Everything you never wanted to know about MySQL Charsets & Collations

In this guide I will attempt to explain character sets and collations in MySQL and MariaDB, how they cause problems, how to fix those problems, and how to fix your existing database. This is a long one, so if you're not interested in the gory details, feel free to skip ahead to any of the sections below. Otherwise buckle up because this is one wild ride!

Table of Contents

What are MySQL charsets and collations?
A brief history lesson
- Unicode saves the day
- MySQL fumbles unicode
Character Sets
- Difference Between utf8mb3 and utf8mb4
Collations
- Difference between utf8mb4_general_ci, utf8mb4_unicode_ci, and more
How to Convert Charset and Collation

What are MySQL charsets and collations?

In the MySQL and MariaDB world, if you want to store text in a database, you need to tell the database what kind of text to store. At a high level you can think of charsets (character sets) and collations like human languages. Is this text English? Is it Arabic, Chinese, etc.?

The more specific answer is: charset determines what characters (alphabet) are allowed in the text field; and collation determines how characters are ordered and compared (for example, "A" comes before "B"). In the English language, we presume that there is one and only one order called "alphabetical" order. But it is not so simple - for example does capital "A" come before or after lowercase "a"?

Charset (character set) determines what alphabet the text is composed of.
Collation determines how two or more texts are sorted and compared.

A brief history lesson

MySQL was originally created by Michael Widenius of Finland in 1995 (named after his daughter My). Due to the geographic region of its creator, the default charset is latin1 with corresponding collation latin1_swedish_ci. This means that only the Latin alphabet is allowed (character set). And when sorting things or comparing them (collation), use the Swedish alphabetical order. The ci at the end means "case insensitive", so capital "A" and lowercase "a" are considered identical; therefore "abc" == "ABC" == "Abc".

What happens if you try to insert a character from the Arabic alphabet into a field with latin1 charset? You will get an error! This is because the ancient Romans did not write using Arabic letters. Even more nefarious... what happens if you try to insert an emoji 😊 into a latin1 field 🤔? You will also get an error 😱! The ancient Romans definitely did not have emojis!

So how did MySQL solve this problem? Easy... with MORE CHARSETS! 41 of them to be exact. Here is how to list all available character sets:

SHOW CHARACTER SET;

One other history lesson of note: MariaDB (named after Michael′s second daughter Maria) started as a fork of MySQL back in 2009 when Oracle acquired MySQL. So MariaDB is largely compatible with MySQL. However over a decade later, a few differences have emerged, which I′ll touch on later.

Unicode saves the day

This problem of human language is not unique to MySQL; it is a common problem in all of computer science. The silver bullet is a thing called "Unicode", frequently implemented using an encoding called UTF-8.

Unicode is essentially one character set which encompasses all human alphabets from the history of civilization. You can not only use unicode to store English, Arabic, and Asian alphabets - you can use it to store hieroglpyhics, cuneiform, and of best of all, emojis 🥳! Python even famously switched its internal string to unicode in Python 3 to solve this same problem.

MySQL fumbles unicode

To solve the problem, MySQL implemented a a unicode charset named utf8. So this seems like a no-brainer, right? Just switch from latin1 to utf8?

Wrong! Unfortunately due to some short-sighted (but valid) design desicions, the original utf8 charset was half-baked, meaning some unicode characters worked, but some would mysteriously fail.

By 2010, MySQL had corrected its unicode implementation with utf8mb4.

Character Sets

Generally speaking, utf8mb4 is the recommended character set that will work best in websites and web applications.

Difference Between utf8mb3 and utf8mb4

Instead of following the UTF-8 convention and using 4-byte code points (a code point means a character in Unicode-speak), utf8, now aliased as utf8mb3, uses 3-byte code points to save space. This 3-byte implementation only makes room for 65,536 characters. This means that the most common UTF-8 characters (the Basic Multilingual Plane) worked as expected, but some of the more seldom used code-points, such as the legions of emojis that get added to our phones each year, will fail 😵. By contrast, the proper 4-byte implementation makes room for 2,097,152 characters... now that's a lot of emojis 🤯!

By 2010, MySQL had corrected its Unicode implementation with utf8mb4, and aliased utf8 to utf8mb3 to properly designate it as the lesser 3-byte implementation.

What this means is utf8mb4 is the standard UTF-8 unicode implementation. There is no practical reason to use utf8mb3. Even better utf8mb4 is a superset of utf8mb3, so you can seamlessly upgrade existing utf8 or utf8mb3 fields to utf8mb4!

Collations

OK, so now that we′ve chosen utf8mb4. Which collation is best for websites and web applications?

utf8mb4_0900_ai_ci is the recommended collation for MySQL 8.0 or newer.
uca1400_ai_ci is the recommended collation for MariaDB 10.11 or newer.
utf8mb4_unicode_520_ci is the recommended collation for older versions of MySQL and MariaDB, or for compatibility between both systems simultaneously.

Remember: collation determines the sorting and comparison of characters. This inevitably varies slightly between human languages. For example does the German esset "ß" (which loosely represents "ss") come before or after capital "S"? Lowercase "s"? Or maybe it comes after TWO lowercases "ss"? Similarly, does an accented "á" come before or after regular "a"? Even more maddening, do capital/uppercase letters come before or after lowercase letters?

Difference between utf8mb4_general_ci, utf8mb4_unicode_ci, and more

Since we have decided on utf8mb4, there are many collations available to us. MySQL has 286 collations available, and 89 collations available for utf8mb4. To see them all run:

SHOW COLLATIONS;

The default collation for utf8mb4 used to be utf8mb4_general_ci. This is a simple collation which only sorts/compares one character at a time, and is case-insensitive. So the German esset "ß", which loosely represents "ss", can′t actually be compared to double "ss", only to single "s". It′s fine for English, but might cause problems in other languages.

utf8mb4_unicode_ci is also case-insensitive but can more accurately sort/compare a single letter as equivalent to multiple letters, and follows the Unicode 4.0.0 standard.

utf8mb4_unicode_520_ci is an updated version which follows the Unicode 5.2.0 standard. This is the default used by WordPress.

utf8mb4_0900_ai_ci is the most modern version in MySQL 8.0+ only, which is both accent-insensitive and case-insensitive and follows the Unicode 9.0.0 standard.

uca1400_ai_ci is the most modern version in MariaDB 10.11+ only, which is both accent-insensitive and case-insensitive and follows the Unicode 14.0.0 standard.

utf8mb4_bin is a binary comparison which only sorts/compares based on the binary zeros and ones of the data. So this is not useful for human languages, but can be useful for text fields containing JSON, machine output, etc.

Other collations are available which will do case-sensitive comparisons, or that are optimized for specific languages. For more info, read the official documentation on collations.

How to Convert Charset and Collation

If you′re working with a web framework such as Django, WordPress, etc. then your life will become much easier if all tables and columns in your database use the same charset and collation.

For the rest of this guide, replace your_db and your_table as appropriate.

Step 1: Inspect current charset and collation

First I′d recommend inspecting your current database to see what charsets and collations are in use. The following command will show the charset and collation of each column:

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA='your_db';

Step 2: Change system default

Optionally, you may want to change the default charset and collation of your database server. The true default is actually hard-coded at compile time into MySQL and MariaDB. But your database administrator can configure soft defaults that are used when creating new tables.

For the sake of this tutorial, I′ll be using utf8mb4_unicode_520_ci since it is most widely available between all versions MySQL and MariaDB.

If you′re using hosted services from AWS, GCP, or Azure, these defaults can be configured through their dashboard. If you′re running your own server, these can be configured by editing my.cnf (see list of possible locations here):

[mysqld]
character_set_server=utf8mb4
collation_server=utf8mb4_unicode_520_ci

Next you′ll want to convert your existing database. Based on my own grisly experience dealing with hundreds of databases over the years, there are two tried and true methods.

Step 3a: Convert charset and collation using SQL

The easiest way to do this is to use the built-in SQL command which can convert it at the table-level. Beware though, this might not work as sometimes contraints such as foreign keys, uniqueness, etc. can prevent conversion.

The command to convert a table is:

ALTER TABLE `your_db`.`your_table` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;

This may be a bit tedious if you have hundreds of tables. So here is a SQL script which will generate a list of SQL scripts for all tables:

SELECT CONCAT(
    'ALTER TABLE ',TABLE_SCHEMA,'.',TABLE_NAME,
    ' CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;'
)
FROM information_schema.TABLES
WHERE TABLE_SCHEMA='your_db';

Then copy/paste the output into a script and run it.

Did you get any errors? If so, you may need to attempt Step 3b below, which is a bit more tedious.

Step 3b: Convert charset and collation via dump/load

For this method we are going to dump the entire database into a text file, then find/replace the collations, then drop the database and load the edited file back in. Yes it′s a bit scary, but luckily we will have a backup. Note that this will require downtime of your website.

First, dump the database into a text file as so. Be sure to use the appropriate connection options.

$ mysqldump -p -u <user> -h <host> --hex-blob --result-file dump.sql <your_db>

This will output a file named dump.sql. Before opening it, check the file size. If the size is over 1 gigabyte, open with caution and make sure you save/close any other work you′re doing!

Next, create a backup/copy called dump_new.sql and open the file in your favorite text editor, or use unix tools such as sed to find and replace the following patterns (given in regular expression syntax):

Replace CHARSET=[\w\d\_]+ with CHARSET=utf8mb4
Replace CHARSET [\w\d\_]+ with CHARSET utf8mb4
Replace CHARACTER SET [\w\d\_]+ with CHARACTER SET utf8mb4
Replace COLLATE=[\w\d\_]+ with COLLATE=utf8mb4_unicode_520_ci
Replace COLLATE [\w\d\_]+ with COLLATE utf8mb4_unicode_520_ci
Replace COLLATION=[\w\d\_]+ with COLLATION=utf8mb4_unicode_520_ci
Replace COLLATION [\w\d\_]+ with COLLATION utf8mb4_unicode_520_ci

Once that file has been updated and saved, it′s time to drop the database and load it back in.

DROP DATABASE your_db;

$ cat dump_new.sql | mysql -p -u <user> -h <host> <your_db>

Now test your database, and inspect it again using Step 1 to confirm everything worked correctly. If this was not successful, drop it again and load the original dump.sql to restore.

Step 4: Update your Django or WordPress site

Just because your database is using a specific collation doesn′t mean your website is aware of it. It is important to update the connection strings of your apps to specify the correct character set and collation. Some libraries will assume hard-coded defaults which could cause unexpected breakage.

For Django sites, install mysqlclient>=2.2 (or mysqclient-collate) which has been patched with support for collations (previous versions only used the server′s hard-coded default!). Yes, I was the person who wrote the patch after slowly going mad trying to trace down the Wagtail bug which originates from Django Cast queries.

$ pip install "mysqlclient>=2.2"

Then edit settings.py and add charset and collation to your database options.

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        ...,
        "OPTIONS": {
            ...
            "charset": "utf8mb4",
            "collation": "utf8mb4_unicode_520_ci",
        },
    }
}

For WordPress sites, edit your wp-config.php and set the charset and collation:

/** Database Charset to use in creating database tables. */
define( 'DB_CHARSET', 'utf8mb4' );

/** The Database Collate type. Don't change this if in doubt. */
define( 'DB_COLLATE', 'utf8mb4_unicode_520_ci' );

Hope this helps, and best of luck on your database journey. Stay sane and don′t lose hope!

Shameless plug: If you′d prefer to never deal with a database ever again, consider hosting your Django, Wagtail, or WordPress sites with us on CodeRed Cloud.