This post is NOT about converting a DB from latin1 to UTF-8 (as many people had to do when upgrading their blogs, etc to support UTF-8).
This is about fixing a database where everything was setup intending to use UTF-8 (so your charset, and collation was set to the appropriate UTF-8 values), but you wrote a bunch of UTF-8 data using a latin1 charset connection into your database, so your database basically has latin1-encoded utf-8 data (when it should have just plain old utf-8 data). This can be a bit confusing – so just imagine the data was base64 encoded. It’s still the utf-8 data you put there, but you have to decode it first (in this case, not a base64 decode, but by using a latin1-charset MySQL connection).
Evidence of the above problem is easy to check: if your PHP application is rendering the data correctly, but phpMyAdmin is not (and other tools like MySQL Workbench), then chances are you have this problem.
Because PHP understands UTF-8, and is reading and writing the UTF-8 bytes correctly – but over a default latin1 connection – what happens is that the data being read and written is actually correct. But MySQL is storing it strangely inbetween. It’s almost like PHP is decrypting the data, storing it in MySQL and decrypting it back again.
This is a bad setup. You’ve told MySQL to expect utf-8 data, but then you’ve fed it latin1 data (that is actually utf-8 bytes). Very confusing.
This hack however actually works if everyone talks to MySQL on a latin1 connection. E.g. if you hack phpMyAdmin to use a latin1 connection, your data will now render correctly. But this is bad. So fix it – and here’s how:
Your app was meant to call mysql_set_charset(‘utf8′); but because it didn’t, latin1 was used. One way to fix your data, is to apply this same bug to phpMyAdmin itself.
In the mysqli.dbi.lib.php and mysql.dbi.lib.php (two files), in the PMA_DBI_connect function, add the code (thanks to Mike Zhang for the code):
mysql_query("SET SESSION CHARACTER_SET_RESULTS =latin1;",$link); mysql_query("SET SESSION CHARACTER_SET_CLIENT =latin1;",$link);
before the ‘return $link’ statement.
What you now have is a version of phpMyAdmin with the same bug as your PHP code. If you do a search on your data, it should now display correctly!
But rather than trying to insert this bug into every mysql app and tool you use – it’s best to correct your MySQL data, and then fix your application.
mysql --default-character-set=utf8 -u root database_name < dump.sql
(specify -p if you need to enter a password)
after you establish your DB connection (make sure every app that talks to this database sets the charset to utf8).
mysql ruby1.9 hack (I’m not on Ruby 1.9 yet and did not use this)
RoR unicode hacks (In RoR 3.0 I don’t think this is needed as I have unicode support without it)