Character Set Shenanigans: YAMCCS

One of my goals in December was to quickly hack Japanese support into PlanetMySQL. I'm sure anyone reading this will start yelling "Dups, you know better than to 'quickly hack'", but I did it anyway and, as you know, we got Japanese in. I entered utf8 data into a latin1 table. Yuck.

The true solution was to convert the whole Planet MySQL database from latin1 to utf8. I was reluctant at the time to make the change. So let me tell you a YAMCCS - Yet Another MySQL Character-set Conversion Story.

If you are going to mess with character sets and convert to UTF-8 on the web, remember: once you start, you need to make sure *everything* is UTF-8. That means your connection to the server, your tables, your data, and finally the character set declared in your HTML pages. This should be obvious, but make sure you do it properly and are thorough in your changes. Before you start, read through the entire Character Set and Collation section of the manual.
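To make that concrete, here is a minimal sketch of the MySQL-side pieces; the table and column names are just illustrations, and the HTML part is shown as a comment:

SET NAMES utf8;  -- connection: sets the client, connection and results character sets
SHOW VARIABLES LIKE 'character_set%';  -- verify what the server thinks you are using

CREATE TABLE example (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  body TEXT
) DEFAULT CHARACTER SET utf8;  -- tables: store the data as UTF-8

-- and in your HTML pages:
-- <meta http-equiv="Content-Type" content="text/html; charset=utf-8">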

So here was my process; this, by the way, is after I shot down various other approaches, which are detailed later. In total, PlanetMySQL was swapped over in minutes with no downtime, unless you clicked at exactly the right microsecond. Remember, Planet is a very, very, very small database (20,000 entries, it's minuscule) and it is currently very easy to disable all writes to it.

The main entries table had latin1, cp1250 and utf8 characters stored in a latin1 table, with all connections being made as latin1. To convert the latin1 and cp1250 characters correctly, I altered a copy of the table and converted its character set:

ALTER TABLE entries_copy CONVERT TO CHARACTER SET utf8;

I think just yesterday Ryan Lowe noted that this command silently changes the table structure, so be warned; it was not an issue for me, but it might be for you. Specifically, if you have TEXT columns they will map to MEDIUMTEXT, MEDIUMTEXT to LONGTEXT -- you get the picture.
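If that worries you, a quick before-and-after check is cheap; the body column here is hypothetical, but the pattern is just to inspect the table and put the type back if the widening bothers you:

SHOW CREATE TABLE entries_copy;  -- see what the conversion actually did

-- if a TEXT column silently became MEDIUMTEXT and you want the old type back:
ALTER TABLE entries_copy MODIFY body TEXT CHARACTER SET utf8;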

The Japanese UTF-8 data I had entered starting in December would not map correctly with this command: it was already UTF-8 bytes sitting in a latin1 column, so the conversion would have double-encoded it. In this case I created a script to retrieve the data as latin1 and re-insert it as UTF-8. Dumping and changing would also have worked.
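My actual fix was a script, but for the record, the same repair can be expressed in SQL as the classic binary round-trip; this is a sketch, and the body column and the WHERE clause are hypothetical:

-- utf8 -> latin1 undoes the conversion and gives back the original UTF-8
-- bytes; the binary cast lets us relabel those bytes as utf8
UPDATE entries_copy
SET body = CONVERT(CAST(CONVERT(body USING latin1) AS BINARY) USING utf8)
WHERE id IN (/* the rows entered as UTF-8 through the latin1 connection */);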

For all the tables I created UTF-8 copies and then used the RENAME TABLE command to swap all the tables. All told, it took less than 15 minutes and, most importantly, I had an easy way to revert quickly if I had actually messed up.
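The swap itself is what keeps the downtime near zero: RENAME TABLE is atomic, so readers see either the old table or the new one, never neither. The names here are illustrative:

RENAME TABLE entries TO entries_old,
             entries_copy TO entries;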

What I didn't do and why...

1. Dump, fix and re-import. This seemed like a long route to take for such small tables. There are many scripts, in Perl, Python and the like, that can fix character set data in dumps.

2. Convert to binary data types, then convert the character set (a sketch of this appears after the list). This works very well, but I would have had to drop and recreate the fulltext indices and convert all the columns, and one test I ran did not like the reconversion back to text blobs. I haven't investigated this fully yet. While this option definitely works, it was overkill for the dataset size.

3. Fetch and restore all the data via a script through UTF-8. This I tried, but some characters were stored in the cp1250 character set as well, so using iconv, for example, didn't fully convert everything. This is certainly an option for such small datasets.

4. Other enterprising scripts are out there which will do the ALTERs for you; again, see Ryan Lowe's post. I haven't tried them, and they would have been overkill for me.
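For the curious, option 2 above is roughly the following two-step dance per column; this is a sketch with a hypothetical body column and fulltext index, and note the index has to be dropped first and recreated after:

ALTER TABLE entries DROP INDEX ft_body;  -- fulltext indices must go first
ALTER TABLE entries MODIFY body BLOB;  -- relabel the bytes as binary, no transcoding
ALTER TABLE entries MODIFY body TEXT CHARACTER SET utf8;  -- relabel the bytes as utf8
ALTER TABLE entries ADD FULLTEXT INDEX ft_body (body);  -- rebuild the index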

If you do a Google search on this topic, you will discover a plethora of ways to do this. Choosing a method that fits your dataset size, data use and so forth is critical.

However, here is what I strongly suggest for anyone:

a) If you anticipate having UTF-8 characters, change early in your application's life cycle. Heck, I would actually just start off with a default of UTF-8 rather than leaving the latin1 defaults; then you never need to worry (a small example follows after these two points). Do remember that UTF-8 takes up more data space, so maybe only single columns should be made UTF-8. Either way, think about it early.

b) If you do decide to store lots of different character sets in latin1 columns, don't mix connection character sets when storing data. It gets messy really fast.
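Starting off right costs almost nothing; for instance (names illustrative):

CREATE DATABASE myapp DEFAULT CHARACTER SET utf8;

-- or, if data space is a concern, make only the columns that need it UTF-8:
CREATE TABLE posts (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) CHARACTER SET utf8,
  created DATE
) DEFAULT CHARACTER SET latin1;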

Now we have another database that needs conversion, and that one is not so easy at all. The dataset is huge, massive downtime or locking writes is not an option, and its mix of character sets may dwarf the problems of Planet.

The world however can only be altered one database and one dataset at a time...
