In this guide I will attempt to explain character sets and collations in MySQL and MariaDB, how they cause problems, how to fix those problems, and how to fix your existing database. This is a long one, so if you're not interested in the gory details, feel free to skip ahead to any of the sections below. Otherwise buckle up because this is one wild ride!
Table of Contents
What are MySQL charsets and collations?
In the MySQL and MariaDB world, if you want to store text in a database, you need to tell the database what kind of text to store. At a high level you can think of charsets (character sets) and collations like human languages. Is this text English? Is it Arabic, Chinese, etc.?
The more specific answer is: charset determines what characters (alphabet) are allowed in the text field; and collation determines how characters are ordered and compared (for example, "A" comes before "B"). In the English language, we presume that there is one and only one order called "alphabetical" order. But it is not so simple - for example does capital "A" come before or after lowercase "a"?
- Charset (character set) determines what alphabet the text is composed of.
- Collation determines how two or more texts are sorted and compared.
A brief history lesson
MySQL was originally created by Michael Widenius of Finland in 1995 (named after his daughter My). Due to the geographic region of its creator, the default charset is latin1
with corresponding collation latin1_swedish_ci
. This means that only the Latin alphabet is allowed (character set). And when sorting things or comparing them (collation), use the Swedish alphabetical order. The ci
at the end means "case insensitive", so capital "A" and lowercase "a" are considered identical; therefore "abc" == "ABC" == "Abc".
What happens if you try to insert a character from the Arabic alphabet into a field with latin1
charset? You will get an error! This is because the ancient Romans did not write using Arabic letters. Even more nefarious... what happens if you try to insert an emoji 😊 into a latin1
field 🤔? You will also get an error 😱! The ancient Romans definitely did not have emojis!
So how did MySQL solve this problem? Easy... with MORE CHARSETS! 41 of them to be exact. Here is how to list all available character sets:
SHOW CHARACTER SET;
One other history lesson of note: MariaDB (named after Michael′s second daughter Maria) started as a fork of MySQL back in 2009 when Oracle acquired MySQL. So MariaDB is largely compatible with MySQL. However over a decade later, a few differences have emerged, which I′ll touch on later.
Unicode saves the day
This problem of human language is not unique to MySQL; it is a common problem in all of computer science. The silver bullet is a thing called "Unicode", frequently implemented using an encoding called UTF-8.
Unicode is essentially one character set which encompasses all human alphabets from the history of civilization. You can not only use unicode to store English, Arabic, and Asian alphabets - you can use it to store hieroglpyhics, cuneiform, and of best of all, emojis 🥳! Python even famously switched its internal string to unicode in Python 3 to solve this same problem.
MySQL fumbles unicode
To solve the problem, MySQL implemented a a unicode charset named utf8
. So this seems like a no-brainer, right? Just switch from latin1
to utf8
?
Wrong! Unfortunately due to some short-sighted (but valid) design desicions, the original utf8
charset was half-baked, meaning some unicode characters worked, but some would mysteriously fail.
By 2010, MySQL had corrected its unicode implementation with utf8mb4
.
Character Sets
Generally speaking, utf8mb4
is the recommended character set that will work best in websites and web applications.
Difference Between utf8mb3 and utf8mb4
Instead of following the UTF-8 convention and using 4-byte code points (a code point means a character in Unicode-speak), utf8
, now aliased as utf8mb3
, uses 3-byte code points to save space. This 3-byte implementation only makes room for 65,536 characters. This means that the most common UTF-8 characters (the Basic Multilingual Plane) worked as expected, but some of the more seldom used code-points, such as the legions of emojis that get added to our phones each year, will fail 😵. By contrast, the proper 4-byte implementation makes room for 2,097,152 characters... now that's a lot of emojis 🤯!
By 2010, MySQL had corrected its Unicode implementation with utf8mb4
, and aliased utf8
to utf8mb3
to properly designate it as the lesser 3-byte implementation.
What this means is utf8mb4
is the standard UTF-8 unicode implementation. There is no practical reason to use utf8mb3
. Even better utf8mb4
is a superset of utf8mb3
, so you can seamlessly upgrade existing utf8
or utf8mb3
fields to utf8mb4
!
Collations
OK, so now that we′ve chosen utf8mb4
. Which collation is best for websites and web applications?
utf8mb4_0900_ai_ci
is the recommended collation for MySQL 8.0 or newer.uca1400_ai_ci
is the recommended collation for MariaDB 10.11 or newer.utf8mb4_unicode_520_ci
is the recommended collation for older versions of MySQL and MariaDB, or for compatibility between both systems simultaneously.
Remember: collation determines the sorting and comparison of characters. This inevitably varies slightly between human languages. For example does the German esset "ß" (which loosely represents "ss") come before or after capital "S"? Lowercase "s"? Or maybe it comes after TWO lowercases "ss"? Similarly, does an accented "á" come before or after regular "a"? Even more maddening, do capital/uppercase letters come before or after lowercase letters?
Difference between utf8mb4_general_ci, utf8mb4_unicode_ci, and more
Since we have decided on utf8mb4
, there are many collations available to us. MySQL has 286 collations available, and 89 collations available for utf8mb4
. To see them all run:
SHOW COLLATIONS;
The default collation for utf8mb4
used to be utf8mb4_general_ci
. This is a simple collation which only sorts/compares one character at a time, and is case-insensitive. So the German esset "ß", which loosely represents "ss", can′t actually be compared to double "ss", only to single "s". It′s fine for English, but might cause problems in other languages.
utf8mb4_unicode_ci
is also case-insensitive but can more accurately sort/compare a single letter as equivalent to multiple letters, and follows the Unicode 4.0.0 standard.
utf8mb4_unicode_520_ci
is an updated version which follows the Unicode 5.2.0 standard. This is the default used by WordPress.
utf8mb4_0900_ai_ci
is the most modern version in MySQL 8.0+ only, which is both accent-insensitive and case-insensitive and follows the Unicode 9.0.0 standard.
uca1400_ai_ci
is the most modern version in MariaDB 10.11+ only, which is both accent-insensitive and case-insensitive and follows the Unicode 14.0.0 standard.
utf8mb4_bin
is a binary comparison which only sorts/compares based on the binary zeros and ones of the data. So this is not useful for human languages, but can be useful for text fields containing JSON, machine output, etc.
Other collations are available which will do case-sensitive comparisons, or that are optimized for specific languages. For more info, read the official documentation on collations.
How to Convert Charset and Collation
If you′re working with a web framework such as Django, WordPress, etc. then your life will become much easier if all tables and columns in your database use the same charset and collation.
For the rest of this guide, replace your_db
and your_table
as appropriate.
Step 1: Inspect current charset and collation
First I′d recommend inspecting your current database to see what charsets and collations are in use. The following command will show the charset and collation of each column:
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA='your_db';
Step 2: Change system default
Optionally, you may want to change the default charset and collation of your database server. The true default is actually hard-coded at compile time into MySQL and MariaDB. But your database administrator can configure soft defaults that are used when creating new tables.
For the sake of this tutorial, I′ll be using utf8mb4_unicode_520_ci
since it is most widely available between all versions MySQL and MariaDB.
If you′re using hosted services from AWS, GCP, or Azure, these defaults can be configured through their dashboard. If you′re running your own server, these can be configured by editing my.cnf
(see list of possible locations here):
[mysqld] character_set_server=utf8mb4 collation_server=utf8mb4_unicode_520_ci
Next you′ll want to convert your existing database. Based on my own grisly experience dealing with hundreds of databases over the years, there are two tried and true methods.
Step 3a: Convert charset and collation using SQL
The easiest way to do this is to use the built-in SQL command which can convert it at the table-level. Beware though, this might not work as sometimes contraints such as foreign keys, uniqueness, etc. can prevent conversion.
The command to convert a table is:
ALTER TABLE `your_db`.`your_table` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;
This may be a bit tedious if you have hundreds of tables. So here is a SQL script which will generate a list of SQL scripts for all tables:
SELECT CONCAT(
'ALTER TABLE ',TABLE_SCHEMA,'.',TABLE_NAME,
' CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;'
)
FROM information_schema.TABLES
WHERE TABLE_SCHEMA='your_db';
Then copy/paste the output into a script and run it.
Did you get any errors? If so, you may need to attempt Step 3b below, which is a bit more tedious.
Step 3b: Convert charset and collation via dump/load
For this method we are going to dump the entire database into a text file, then find/replace the collations, then drop the database and load the edited file back in. Yes it′s a bit scary, but luckily we will have a backup. Note that this will require downtime of your website.
First, dump the database into a text file as so. Be sure to use the appropriate connection options.
$ mysqldump -p -u <user> -h <host> --hex-blob --result-file dump.sql <your_db>
This will output a file named dump.sql
. Before opening it, check the file size. If the size is over 1 gigabyte, open with caution and make sure you save/close any other work you′re doing!
Next, create a backup/copy called dump_new.sql
and open the file in your favorite text editor, or use unix tools such as sed
to find and replace the following patterns (given in regular expression syntax):
- Replace
CHARSET=[\w\d\_]+
withCHARSET=utf8mb4
- Replace
CHARSET [\w\d\_]+
withCHARSET utf8mb4
- Replace
CHARACTER SET [\w\d\_]+
withCHARACTER SET utf8mb4
- Replace
COLLATE=[\w\d\_]+
withCOLLATE=utf8mb4_unicode_520_ci
- Replace
COLLATE [\w\d\_]+
withCOLLATE utf8mb4_unicode_520_ci
- Replace
COLLATION=[\w\d\_]+
withCOLLATION=utf8mb4_unicode_520_ci
- Replace
COLLATION [\w\d\_]+
withCOLLATION utf8mb4_unicode_520_ci
Once that file has been updated and saved, it′s time to drop the database and load it back in.
DROP DATABASE your_db;
$ cat dump_new.sql | mysql -p -u <user> -h <host> <your_db>
Now test your database, and inspect it again using Step 1 to confirm everything worked correctly. If this was not successful, drop it again and load the original dump.sql
to restore.
Step 4: Update your Django or WordPress site
Just because your database is using a specific collation doesn′t mean your website is aware of it. It is important to update the connection strings of your apps to specify the correct character set and collation. Some libraries will assume hard-coded defaults which could cause unexpected breakage.
For Django sites, install mysqlclient>=2.2
(or mysqclient-collate
) which has been patched with support for collations (previous versions only used the server′s hard-coded default!). Yes, I was the person who wrote the patch after slowly going mad trying to trace down the Wagtail bug which originates from Django Cast queries.
$ pip install "mysqlclient>=2.2"
Then edit settings.py
and add charset
and collation
to your database options.
DATABASES = {
"default": {
"ENGINE": "django.db.backends.mysql",
...,
"OPTIONS": {
...
"charset": "utf8mb4",
"collation": "utf8mb4_unicode_520_ci",
},
}
}
For WordPress sites, edit your wp-config.php
and set the charset and collation:
/** Database Charset to use in creating database tables. */
define( 'DB_CHARSET', 'utf8mb4' );
/** The Database Collate type. Don't change this if in doubt. */
define( 'DB_COLLATE', 'utf8mb4_unicode_520_ci' );
Hope this helps, and best of luck on your database journey. Stay sane and don′t lose hope!
Shameless plug: If you′d prefer to never deal with a database ever again, consider hosting your Django, Wagtail, or WordPress sites with us on CodeRed Cloud.