Have you ever found yourself stuck with gibberish characters in a database that you’ve just imported? Is your data looking like someone’s tried to mask expletives? Then perhaps you’ve found yourself exporting a latin1
encoded database containing UTF-8 characters.
The Problem
Every now and then I find myself needing to migrate a database from one place to another; usually when preparing to replace an old site with a new one. Unfortunately this can sometimes throw up some encoding issues. An apparently working original seems broken with unexpected characters like “�ᥔ in the copy.
The default MySQL character set and collation are latin1
and latin1_swedish_ci
(the Finnish creator of MySQL, Michael Widenius, was co-head of a Swedish firm). As a result there are many existing databases out there using latin1
. If someone then inserts UTF-8 data into one of these everything can look fine until it comes to importing this into a new database.
One way you can check if this is your problem is to query the imported database by setting the CHARACTER_SET_RESULTS
from within the mysql client (via the terminal rather than something like PhpMyAdmin).
Try querying your database for some of the broken data. Then set CHARACTER_SET_RESULTS
before running the query again.
SET SESSION CHARACTER_SET_RESULTS = latin1;
If the data shows correctly the second time then it looks like the problem being described here is your issue too.
The Fix
Thankfully we can resolve the issue by taking a few short steps. We want to export the structure and data of the existing database separately; change the exported schema’s character set; and then import the tweaked structure and data into the new database.
One thing to note before we proceed with the fix is that MySQL’s utf8
encoding does not support all unicode characters. For true UTF-8 support we want to use MySQL’s utf8mb4
character set. For example, we’d need to use utf8mb4
if we wanted to store the increasinly popular emoji characters correctly in our database. For more on this check out my article on Saving Emoticons/Unicode from Twitter to a MySQL Database.
I’m going to assume that the database we’re importing into is utf8mb4
. If you’ve not already created this database then set it up now.
CREATE DATABASE `[new_database]` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
So let’s look at solving our issue with our latin1
tables. To start with we want to export the database’s schema using mysqldump
and change its character set to utf8mb4
.
mysqldump -u[username] -p [existing_database] --no-data --skip-set-charset --default-character-set=latin1 \
| sed 's/CHARSET=latin1/CHARSET=utf8mb4/g' \
> schema.sql
There are three parts to this command: we use mysqldump
to export the data; this is then piped to sed
which updates the charset
in the export; then finally we write the output to a file named schema.sql.
- We’re using
--no-data
to only export the schema (this bit is important). - We’re also using
--skip-set-charset
and--default-character-set=latin1
to ensure that MySQL doesn’t attempt to reconvert or set the charset. sed
is replacing every occurence oflatin1
for thecharset
withutf8mb4
. Changeutf8mb4
toutf8
if you don’t need the full UTF-8 character set.
Next we want to export the data in our database.
mysqldump -u[username] -p [existing_database] --no-create-db --no-create-info --skip-set-charset --default-character-set=latin1 > data.sql
There are just two parts to this command: again we use mysqldump
but this time just to export the data and then we write this output to data.sql.
- We use
--no-create-db
to exclude theCREATE DATABASE
statement. - We also use
--no-create-info
to exclude anyCREATE TABLE
statements as our previous export is handling the structure.
Now it’s time to import the exported schema and data to our new UTF-8 database. Open a connection to the new database using utf8mb4
(or utf8
if that’s what you are using) as the default character set.
mysql -u[username] -p [new_database] --default-character-set=utf8mb4
Finally, import the schema and data.
source schema.sql;
source data.sql;
Hopefully everything should now be working correctly in the new database and the gibberish characters gone!