When you thought the DB was UTF-8 but it wasn’t

This post is NOT about converting a DB from latin1 to UTF-8 (as many people had to do when upgrading their blogs, etc to support UTF-8).

This is about fixing a database where everything was setup intending to use UTF-8 (so your charset, and collation was set to the appropriate UTF-8 values), but you wrote a bunch of UTF-8 data using a latin1 charset connection into your database, so your database basically has latin1-encoded utf-8 data (when it should have just plain old utf-8 data). This can be a bit confusing – so just imagine the data was base64 encoded. It’s still the utf-8 data you put there, but you have to decode it first (in this case, not a base64 decode, but by using a latin1-charset MySQL connection).

Evidence of the above problem is easy to check:  if your PHP application is rendering the data correctly, but phpMyAdmin is not (and other tools like MySQL Workbench), then chances are you have this problem.

Because PHP understands UTF-8, and is reading and writing the UTF-8 bytes correctly – but over a default latin1 connection – what happens is that the data being read and written is actually correct.  But MySQL is storing it strangely inbetween.  It’s almost like PHP is decrypting the data, storing it in MySQL and decrypting it back again.

This is a bad setup.  You’ve told MySQL to expect utf-8 data, but then you’ve fed it latin1 data (that is actually utf-8 bytes).  Very confusing.

This hack however actually works if everyone talks to MySQL on a latin1 connection.  E.g. if you hack phpMyAdmin to use a latin1 connection, your data will now render correctly.  But this is bad. So fix it – and here’s how:

Your app was meant to call mysql_set_charset(‘utf8′); but because it didn’t, latin1 was used.  One way to fix your data, is to apply this same bug to phpMyAdmin itself.

In the mysqli.dbi.lib.php and mysql.dbi.lib.php (two files), in the PMA_DBI_connect function, add the code (thanks to Mike Zhang for the code):

mysql_query("SET SESSION CHARACTER_SET_RESULTS =latin1;",$link);
mysql_query("SET SESSION CHARACTER_SET_CLIENT =latin1;",$link);

before the ‘return $link’ statement.

What you now have is a version of phpMyAdmin with the same bug as your PHP code.  If you do a search on your data, it should now display correctly!

But rather than trying to insert this bug into every mysql app and tool you use – it’s best to correct your MySQL data, and then fix your application.

How to fix your old borked data:

  1. apply the latin1 hack (above) to phpMyAdmin (set the connection to latin1, emulating the PHP app)
  2. view some unicode data to test (it should render correctly now in phpMyAdmin, whereas before it was not)
  3. export your entire DB
    This will export the data, using the twisted latin1 interpretation (like our buggy web client was) which is actually what we want.
  4. Remove the latin1 hack in phpMyAdmin.
  5. View the export in a good text editor like jEdit (make sure the charset is UTF-8), this time the dump should render correctly.
  6. Import the hack-exported DB like so:
    mysql --default-character-set=utf8 -u root database_name < dump.sql
    

    (specify -p if you need to enter a password)

  7. Now if you look in phpMyAdmin without the latin1 hack, your UTF-8 should now render correctly
  8. Now that your data is correct, it’s important to update your application.  In PHP, call
    mysql_set_charset('utf8'); 

    after you establish your DB connection (make sure every app that talks to this database sets the charset to utf8).

  9. Done. Reward yourself with hot coffe or ice cold beer, depending on time of day ~_-

Some mostly unrelated links (as it turns out)

Convert_latin1_to_UTF-8_in_MySQL

Using RVM to manage ruby

mysql ruby1.9 hack (I’m not on Ruby 1.9 yet and did not use this)

RoR unicode hacks (In RoR 3.0 I don’t think this is needed as I have unicode support without it)

UTF-8_MySQL_connections_with_Ruby

Someone who had a similar problem


Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>