Rails 2, MySQL, and Character Sets

Jun 29, 2011

If you have not been able to upgrade to Rails 3 and Ruby 1.9, you may be unaware that your application’s character handling is misconfigured. If you have international customers, configuring character sets properly is key to making sure your application behaves the same for them as it does for your local customers.

UTF-8 All The Way Across Your Stack

There used to be compelling reasons to stick with the latin1 encoding: each character is a single byte, and it is closely related to ASCII, the original encoding of the internet. With the exponential increase in storage capacity, processing power, and international presence on the internet, it makes sense to adopt a character set that works for everyone. UTF-8 has become the character set of choice, and in migrating from latin-based character sets to UTF-8, many developers have run into the same vexing issues that we did.

Where do I need to pay attention to encoding?

With a Rails application, there are a few places where developers need to worry about encoding: between the browser and the Rails application, between the Rails application and the database, and inside the database schema. Of these three locations, only the two related to the database tend to create issues. Because MySQL ships with latin1 enabled by default, anyone who has started a database without first considering the desired encoding probably has a misconfigured database and data in an awkward encoding state. The typical misconfiguration looks like this:

  1. Data is sent from the user’s browser to the Rails application as UTF-8 text.
  2. The Rails application treats it as multibyte UTF-8 text.
  3. Rails sends the data over a MySQL connection configured to transmit as latin1.
  4. The UTF-8 data is treated as if it were latin1 data, so multi-byte characters get split apart into single bytes and stored in a latin1 schema.

Without understanding that both the schema and database connection must be configured for UTF-8, you may be tempted to make some of the same mistakes we did.
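
Before changing anything, it helps to see exactly how the server, the connection, and the tables are configured. From the MySQL console, something like the following shows the relevant settings (the users table is just an example name):

-- How the server and the current connection are configured
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- The charset declared on an individual table and its columns
SHOW CREATE TABLE users\G

In the misconfigured state described above, character_set_client, character_set_connection, and the table definitions all report latin1 even though the browser is submitting UTF-8.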

When We Noticed Something Was Amiss

This misconfiguration persisted until we started hearing from international customers about strange quirks when they tried to @-mention users with non-latin characters in their names. Since we use a case-insensitive search for finding usernames, someone typing “@ølsen” should be able to find Ølsen Grueber. Instead, the only way for these users to find each other through the @-mention autocomplete was to type the name with exactly the right case.

This led us to investigate character set collations, which define how upper- and lower-case characters relate to each other within a character set. Storing UTF-8 strings through a latin1 connection causes the two-byte sequence for Ø to be stored as the two characters “Ã˜” and ø to be stored as “Ã¸”. Since the encoding misconfiguration caused each multi-byte character to be treated as multiple single-byte characters, it was impossible to match across case in the database.
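
A quick way to see the collation issue in isolation is to compare the characters directly in MySQL. On the MySQL versions of this era, something like the following illustrates it (the literals are only examples):

-- Under a case-insensitive utf8 collation, Ø and ø are the same letter,
-- so this comparison returns 1
SELECT _utf8 'Ølsen' = _utf8 'ølsen' COLLATE utf8_general_ci;
-- The mangled latin1 byte sequences ('Ã˜lsen' vs 'Ã¸lsen') differ in
-- characters that have no case relationship to each other, so the
-- equivalent latin1 comparison returns 0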

Fixes Gone Wrong

Since we originally thought that only the encoding declared on the tables was incorrect, we attempted to convert just the charsets defined in the schema. Unfortunately, this didn’t produce the result we desired. Because MySQL already had mistaken assumptions about the original encoding of the data, the change resulted in horribly broken text: each byte in the original multi-byte string was treated as its own character and then converted into a multi-byte form. The data still wasn’t in a format conducive to comparison.
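
The kind of statement that produces this double-encoding is MySQL’s CONVERT TO; for example (users is an illustrative table name):

-- Looks reasonable, but MySQL believes the column data really is latin1,
-- so it transcodes each mojibake byte into its own multi-byte UTF-8
-- character instead of leaving the original bytes alone
ALTER TABLE users CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;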

We then found out about the charset options for the actual connection between our Rails application and MySQL. Unfortunately, since the data in MySQL was still stored as latin1, we ran into a similar miscoding issue as the data came back out of the database.
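
For reference, the connection-level charset boils down to what a SET NAMES statement does when the connection opens; Rails drives this from the encoding entry in database.yml. With the tables still declared latin1, changing only this setting just moves the mismatch:

-- What an encoding: utf8 entry in database.yml effectively does
-- for each new connection
SET NAMES utf8;
-- i.e. character_set_client, character_set_connection, and
-- character_set_results all switch to utf8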

Fixes Gone Right

A comprehensive solution would involve changing both of these settings, but we couldn’t make the change while the data was still improperly represented in the database. Since the data was coming out of the database correctly despite the misconfiguration, we needed to export all of it before making the configuration changes. This can be pretty time consuming when your database reaches into the gigabytes. After scheduling some downtime, the cleanest way to get the data into a good state was to perform the following:

    1. Dump the entire contents of the database with the misconfigured character sets. This gives us a SQL file with all of the schema descriptions (with latin1 charsets) and insert statements. Opening the file reveals all of the content in its original format, with multi-byte characters intact. Only the schema descriptions need to be changed. The --default-character-set=latin1 option makes mysqldump read the data back exactly as it is stored, and --skip-set-charset keeps the dump from embedding a SET NAMES statement that would override the new configuration on import.

mysqldump --default-character-set=latin1 --skip-set-charset --user=username --password=password database_name > dbdump.sql
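
For reference, the lines that need to change are the table definitions in the dump, which look roughly like this (engine and table options will vary):

) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The INSERT statements that follow each table definition carry the text exactly as it was stored and are left untouched.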

    2. Replace all instances of CHARSET=latin1 with CHARSET=utf8 in the SQL dump file. This ensures that the new tables will take on the correct encoding.

sed -e 's/CHARSET=latin1/CHARSET=utf8/g' dbdump.sql > dbdump.utf8.sql

    3. Shut down MySQL and set the default encoding to UTF-8 in both my.cnf (the first block below) and Rails’ database.yml (the second block).

[mysqld]
default-character-set=utf8
default-collation=utf8_general_ci

[client]
default-character-set=utf8

[mysql]
default-character-set=utf8

production:
  encoding: utf8
  collation: utf8_unicode_ci

    4. Restart MySQL and re-import the SQL dump using your new UTF-8 configuration.

script/dbconsole <env> -p < dbdump.utf8.sql 
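
Once the import finishes, it is worth confirming that everything reports utf8 and that the original lookup now behaves; for example (users and username are illustrative names):

-- Server, connection, and database defaults should all report utf8
SHOW VARIABLES LIKE 'character_set%';

-- Each table should now show DEFAULT CHARSET=utf8
SHOW CREATE TABLE users\G

-- The case-insensitive match that originally failed
SELECT username FROM users WHERE username LIKE 'ølsen%';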

Wrapping Up

Once the migration was completed, we could see @-mentions with UTF-8 characters working just the way we wanted: case-insensitively. Despite the amount of confusion and frustration around character set encoding, once the issue was fully framed and the misconfigurations were identified, the solution became straightforward. The biggest hurdle we found was simply talking about character sets and pinning down where the encoding format actually mattered. Diagrams helped communicate this problem to the team better than conversations ever could, so we hope they help you too!
