#749 ✓not-applicable
felix (xilef)

Encoding for connection

Reported by felix (xilef) | January 6th, 2009 @ 09:02 PM

Data encoded in utf8 in the db is being encoded some other way when pulled out. Only tried on MySQL.

Data was put in the db using mysql command line client (encoding set to utf8) and can be read correctly using it also. It was taken from a utf8 encoded csv file (also reads fine).

Using 'merb -i' (so not an HTML page encoding issue) pulls out rubbish with the command:

repository.adapter.query('SELECT ...')

Also, the results of the following all state that the encoding is utf8:

repository.adapter.query("SHOW VARIABLES LIKE 'character_set_%'")

The database character set is utf8 and so is the table's.

Strange, no?

Comments and changes to this ticket

  • felix (xilef)

    felix (xilef) January 6th, 2009 @ 09:07 PM

    using data_objects 0.9.10.1 dm-core 0.9.9 do_mysql 0.9.10.1

  • Dan Kubb (dkubb)

    Dan Kubb (dkubb) January 7th, 2009 @ 04:24 PM

    • Assigned user set to “Dirkjan Bussink”
  • felix (xilef)

    felix (xilef) January 13th, 2009 @ 08:01 AM

    An example,

    The text I want to submit: appliquéing

    All MySQL stuff is utf8: table, db, connection, server etc. The error from MySQL is:

    (mysql_errno=1366, sql_state=HY000) Incorrect string value: '\xE9ing t...' for column

    Yet in the mysql command line client the original string can be inserted. The MySQL server log shows the original string being inserted, not the encoded string above. Only the testing log from my specs shows the encoded string above.

    It could well be my setup but it all seems in order, and they worked before!
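
The byte in the error message can be checked directly. The following is a minimal sketch, assuming Ruby 2.0 or later (Ruby 1.8 strings carry no encoding) and assuming the offending string held the ISO-8859-1 byte 0xE9 for "é"; the variable names are illustrative only.

```ruby
# "é" is the single byte 0xE9 in ISO-8859-1 but the two bytes 0xC3 0xA9 in UTF-8.
latin1_bytes = "appliqu\xE9ing".b   # raw bytes, as they might come from an ISO-8859-1 file

# Labelling those bytes as UTF-8 does not convert them.
mislabeled = latin1_bytes.dup.force_encoding("UTF-8")

# A lone 0xE9 followed by ASCII is not a valid UTF-8 sequence.
mislabeled_valid = mislabeled.valid_encoding?

# Declaring the real source encoding and transcoding yields valid UTF-8.
converted = latin1_bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
converted_valid = converted.valid_encoding?
```

Here `mislabeled_valid` is false and `converted_valid` is true, which is consistent with MySQL rejecting '\xE9ing t...' for a utf8 column.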

  • felix (xilef)

    felix (xilef) January 13th, 2009 @ 09:53 AM

    • Tag changed from mysql, utf-8 to dataobjects, mysql, utf-8

    see the following gist for a script that duplicates it:

    http://gist.github.com/46493

  • felix (xilef)

    felix (xilef) January 20th, 2009 @ 05:00 AM

    That gist had errors in it and now works. Yet I still have failing specs due to encoding issues I did not have before :-(

    Will try to create a new script to duplicate the errors.

  • felix (xilef)

    felix (xilef) January 21st, 2009 @ 10:17 AM

    • Tag changed from dataobjects, mysql, utf-8 to dm-sweatshop, mysql, utf-8

    OK, I believe I have found the source of my issues. The errors were only occurring when using dm-sweatshop to generate content for me. dm-sweatshop uses randexp to produce a lot of the text, which is taken from either /usr/share/dict/words or /usr/dict/words (whichever it finds first).

    On my system (and no doubt others) /usr/share/dict/words is linked to british-english, which is encoded as ISO-8859. When this is converted to a UTF-8 encoded file I have no issues at all, all tests pass!

    Therefore, I suspect File.read() in randexp is causing havoc, perhaps doing a conversion to UTF-8 which I believe is the default external encoding for Ruby on my system/Merb and screwing the escaping in the process. I have no idea.

    So I am not sure what the status of this bug is, or if it's even a bug.
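
For illustration, here is a sketch of how reading an ISO-8859-1 word list behaves in Ruby 1.9 and later, which, as far as I know, labels the bytes with the external encoding rather than converting them. This is not randexp's actual code; the temporary file and its contents are made up to stand in for /usr/share/dict/words.

```ruby
require "tempfile"

# Stand-in for a word list saved as ISO-8859-1 (contents invented for the example).
words_file = Tempfile.new("words")
words_file.binmode
words_file.write("caf\xE9\nappliqu\xE9ing\n".b)
words_file.close

# Reading as UTF-8 only tags the bytes as UTF-8; nothing is transcoded.
naive = File.open(words_file.path, "r:UTF-8") { |f| f.read }
naive_valid = naive.valid_encoding?

# "r:ISO-8859-1:UTF-8" declares the file's real encoding and transcodes on read.
converted = File.open(words_file.path, "r:ISO-8859-1:UTF-8") { |f| f.read }
converted_valid = converted.valid_encoding?

words_file.unlink
```

With this input `naive_valid` is false and `converted_valid` is true, so declaring the source encoding at read time would be one way to avoid handing invalid strings to the database.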

  • Dirkjan Bussink

    Dirkjan Bussink January 22nd, 2009 @ 04:02 PM

    Well, if it throws in strings with other encodings, strange things are bound to happen. What is the actual rubbish you see? Just some characters or is it all a mess? UTF-8 and ISO-8859-1 are both ASCII compatible afaik, so there shouldn't be gibberish for basic letters.

  • Greg Campbell

    Greg Campbell January 23rd, 2009 @ 11:12 AM

    For what it's worth, I see the same behavior with dm-sweatshop as well (the mysql_errno reported by felix above, that is).

  • Dan Kubb (dkubb)

    Dan Kubb (dkubb) January 23rd, 2009 @ 12:38 PM

    The data from the dict file could contain invalid characters. I think it's safest to always assume external data sources contain invalid information, just like if it were user input, and treat it as untrusted by default.

    What about (as a test) modifying dm-sweatshop to use Iconv to strip out invalid characters and convert the remainder to UTF-8? In other contexts I've had good luck with the following approach:

    http://po-ru.com/diary/fixing-in...
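
Iconv was removed from Ruby's standard library in 2.0, so the sketch below swaps in String#scrub (Ruby 2.1+) to get the same "strip what is invalid, keep the rest" effect. The helper name `strip_invalid_utf8` is hypothetical and not part of dm-sweatshop or randexp.

```ruby
# Hypothetical helper: returns a UTF-8 copy of +text+ with invalid byte
# sequences removed. Valid input comes back unchanged.
def strip_invalid_utf8(text)
  text.dup.force_encoding("UTF-8").scrub("")
end

cleaned = strip_invalid_utf8("appliqu\xE9ing")
cleaned_valid = cleaned.valid_encoding?
```

Stripping is lossy: the "é" is dropped rather than recovered. When the source encoding is known, transcoding from it preserves the character instead.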

  • Dan Kubb (dkubb)

    Dan Kubb (dkubb) January 23rd, 2009 @ 01:02 PM

    To test encoding I whipped up a small script that uses an encoded word, saves it to several code storage engines, and then attempts to retrieve it again. This script returns true for me with every storage engine.

  • Dan Kubb (dkubb)

    Dan Kubb (dkubb) January 23rd, 2009 @ 01:03 PM

    Heh, interesting how Lighthouse mangled the filename. It was originally named "Iñtërnâtiônàlizætiøn.rb"

  • Greg Campbell

    Greg Campbell January 23rd, 2009 @ 01:37 PM

    dkubb: Yes, there appears to be something specifically wrong with the encodings for strings generated by dm-sweatshop - to reproduce the problem, simply set up a dm-sweatshop fixture with a property specified by /\w+/.gen, and generate some large number of them against a MySQL database; eventually you'll get a MysqlError for some string with invalid characters. This causes specs using dm-sweatshop to fail intermittently.
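
One possible explanation for the intermittent failures is that only a fraction of the words in the list contain non-ASCII bytes, so a random pick hits one only occasionally. The sketch below, with a hypothetical helper `invalid_utf8_words` and invented sample data, shows how such entries could be found.

```ruby
# Hypothetical helper: returns the entries of +words+ whose bytes are not
# valid when interpreted as UTF-8.
def invalid_utf8_words(words)
  words.reject { |word| word.dup.force_encoding("UTF-8").valid_encoding? }
end

# Sample data standing in for an ISO-8859-1 word list; only the entries with
# accented characters carry bytes that are invalid as UTF-8.
sample = ["apple", "caf\xE9".b, "banana", "appliqu\xE9ing".b, "cherry"]
bad = invalid_utf8_words(sample)
bad_count = bad.size
```

To try this against a real word list, the lines could be loaded with something like `File.binread(path).split("\n")`, where `path` is whichever dictionary file is in use on the system.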

  • Dan Kubb (dkubb)

    Dan Kubb (dkubb) January 23rd, 2009 @ 02:21 PM

    • Assigned user changed from “Dirkjan Bussink” to “Michael Klishin (antares)”

    I'm going to re-assign this to antares, since I'm pretty sure that DO isn't the problem here, as my sample script demonstrates.

    I think Greg is right in that dm-sweatshop is generating an invalid UTF-8 string causing Ruby 1.9.1 to blow up intermittently.

  • Greg Campbell

    Greg Campbell January 23rd, 2009 @ 04:31 PM

    To clarify, I'm using Ruby 1.8.6, so this isn't Ruby-version-dependent.

  • Michael Klishin (antares)

    Michael Klishin (antares) January 23rd, 2009 @ 06:15 PM

    • State changed from “unconfirmed” to “not-applicable”

    DM sweatshop probably just uses your default repository, and thus, your settings from database.yml. Sweatshop is just a set of helpers to keep track of fixture attributes.

    So if it is not Ruby version specific, make sure your DB connection settings use the same encoding/collation as the server, and that you use a multibyte library and a proper KCODE value in Ruby.

  • felix (xilef)

    felix (xilef) January 23rd, 2009 @ 06:21 PM

    sweatshop DOES use the default repository et al. for output, but I believe the issue is with how sweatshop deals with its input. This is taken from the file system using randexp (so more likely a randexp issue, but still a dependency of sweatshop), which reads the local system's word lists, which are usually in ISO-8859 encoding. The resulting mangled encoding trickles down to the db, where it manifests as a mysql_error.

    I know that my DB connection, creation, collation and KCODE are all UTF-8.

  • Michael Klishin (antares)

    Michael Klishin (antares) January 23rd, 2009 @ 06:34 PM

    How do you think randexp should behave then? Keep in mind it is not tied to DM or Merb, so this should be a Merb/DM agnostic solution.

  • felix (xilef)

    felix (xilef) January 23rd, 2009 @ 06:47 PM

    I believe benburket (randexp's and sweatshop's author) has some goodies in his bag of tricks to make randexp a little more robust. I reckon that should fix it. :-)

  • Michael Klishin (antares)

    Michael Klishin (antares) January 23rd, 2009 @ 06:51 PM

    I just want to make sure we have a good solution, I don't mind fixing it myself as part of some other dm-more work I am up to.

  • Michael Klishin (antares)

    Michael Klishin (antares) January 23rd, 2009 @ 06:52 PM

    Fix it in randexp and submit patches upstream, that is.
