Character encoding
Breaking and unbreaking your data
Maciej Dobrzanski
maciek@psce.com | @mushupl
Brussels, 1 Feb 2015
01.02.2015 Follow us on Twitter @dbasquare www.psce.com
Character Encoding
• Binary representation of glyphs
• Each character can be represented by 1 or more bytes
• Popular schemes
• ASCII
• Unicode
• UTF-8, UTF-16, UTF-32
• Language specific character sets
• US (Latin US)
• Europe (Latin 1, Latin 2)
• Asia (EUC-KR, GB18030)
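The "1 or more bytes" point can be checked directly. A minimal Python sketch, assuming three sample characters picked for illustration; the little-endian codec variants are used only so that no byte order mark inflates the counts:

```python
# Byte length of the same character under three Unicode encodings.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in ENCODINGS}
    for ch in ("A", "ó", "東")
}

for ch, per_encoding in sizes.items():
    print(ch, per_encoding)
```

UTF-8 is the only one of the three where plain ASCII text stays at one byte per character.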
Character Encoding
• Character set defines the visual interpretation of binary information
• One glyph can be associated with several numeric codes
• One numeric code may be used to represent several different glyphs
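Both directions of that ambiguity fit in a few lines. A small Python sketch, with the byte value and the sample letter chosen purely for illustration:

```python
# One numeric code, several glyphs: the same byte under two character sets.
code = bytes([0xF1])
as_latin1 = code.decode("latin-1")      # Western European reading
as_latin2 = code.decode("iso-8859-2")   # Central European reading
print(as_latin1, as_latin2)

# One glyph, several numeric codes: the same letter under two encodings.
glyph = "ó"
in_latin1 = glyph.encode("latin-1")
in_utf8 = glyph.encode("utf-8")
print(in_latin1.hex(), in_utf8.hex())
```

Without knowing which character set a byte was written in, there is no way to tell which glyph was meant.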
Please state the nature of the emergency
• Application configuration
• Database configuration
• Table/column definitions
Problem #1: We are all born Swedish
• MySQL uses latin1 by default
• MySQL 5.7 too
• Is anyone actually aware of that?
• Why Swedish?
• latin1_swedish_ci is the default collation
Problem #1
• Let’s build an application
mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1 | latin1 |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)
mysql> CREATE SCHEMA fosdem;
Query OK, 1 row affected (0.00 sec)
mysql> USE fosdem;
mysql> CREATE TABLE locations (city VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.15 sec)
mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
Table: locations
Create Table: CREATE TABLE `locations` (
`city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Problem #1
• Everything is correct… NOT!
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from locations;
+--------------------+
| city |
+--------------------+
| Berlin |
| Kraków |
| 東京都 |
+--------------------+
3 rows in set (0.00 sec)
mysql> SET NAMES latin1;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from locations;
+-----------+
| city |
+-----------+
| Berlin |
| Kraków |
| 東京都 |
+-----------+
3 rows in set (0.00 sec)
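One way to picture what typically goes wrong in this scenario, as a minimal Python sketch rather than anything MySQL-specific. It assumes the application sent UTF-8 bytes over a connection declared as latin1, so the server stored every byte as a separate latin1 character. Python's `latin-1` codec stands in for MySQL's latin1 here, which is an approximation (MySQL's latin1 is closer to cp1252).

```python
city = "Kraków"

# The application sends UTF-8 bytes, but the connection says latin1,
# so the server stores each byte as one latin1 character.
stored_bytes = city.encode("utf-8")

# Reading with SET NAMES utf8: the server transcodes latin1 -> utf8,
# which turns every stored byte into its own character.
seen_with_utf8 = stored_bytes.decode("latin-1")

# Reading with SET NAMES latin1: the bytes come back untouched and a
# UTF-8 terminal happens to render them correctly.
seen_with_latin1 = stored_bytes.decode("utf-8")

print(seen_with_utf8)
print(seen_with_latin1)
```

Under these assumptions the data only looks right while every client keeps repeating the same mistake.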
Problem #1
• Let’s fix this
• Or can we ignore it?
• Ruby may not like it
# grep character-set-server /etc/mysql/my.cnf
character-set-server = utf8
mysql> SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| utf8 | utf8 |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)
...we are fixing our tables here...
mysql> SHOW CREATE TABLE locations\G
*************************** 1. row ***************************
Table: locations
Create Table: CREATE TABLE `locations` (
`city` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
Problem #1: The good news
• It’s usually fixable
Problem #2: Settings, defaults, inheritance
• Where do you set character sets in MySQL?
• Session settings
• character_set_server
• character_set_client
• character_set_connection
• character_set_database
• character_set_results
• Schema level defaults
• Table level defaults
• Column charsets
Problem #2
• Having fixed our problem #1, we continue to develop our application
mysql> SELECT @@session.character_set_server, @@session.character_set_client;
+--------------------------------+--------------------------------+
| @@session.character_set_server | @@session.character_set_client |
+--------------------------------+--------------------------------+
| utf8 | utf8 |
+--------------------------------+--------------------------------+
1 row in set (0.00 sec)
mysql> USE fosdem;
mysql> CREATE TABLE people (first_name VARCHAR(30) NOT NULL, last_name VARCHAR(30) NOT NULL);
Query OK, 0 rows affected (0.13 sec)
Problem #2
• Why is the table character set latin1?
mysql> SELECT @@session.character_set_server, @@session.character_set_client;
+--------------------------------+--------------------------------+
| @@session.character_set_server | @@session.character_set_client |
+--------------------------------+--------------------------------+
| utf8 | utf8 |
+--------------------------------+--------------------------------+
1 row in set (0.00 sec)
mysql> USE fosdem;
mysql> SHOW CREATE TABLE people\G
*************************** 1. row ***************************
Table: people
Create Table: CREATE TABLE `people` (
`first_name` varchar(30) NOT NULL,
`last_name` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Problem #2
• What’s all this, then?
mysql> SHOW SESSION VARIABLES LIKE 'character_set_%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)
mysql> SHOW CREATE DATABASE fosdem\G
*************************** 1. row ***************************
Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)
Problem #2
• Can we fix this?
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT last_name, HEX(last_name) FROM people;
+------------+----------------------+
| last_name | HEX(last_name) |
+------------+----------------------+
| Lemon | 4C656D6F6E |
| Müller | 4DFC6C6C6572 |
| Dobrza?ski | 446F62727A613F736B69 |
+------------+----------------------+
3 rows in set (0.00 sec)
mysql> SET NAMES latin2;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT last_name, HEX(last_name) FROM people;
+------------+----------------------+
| last_name | HEX(last_name) |
+------------+----------------------+
| Lemon | 4C656D6F6E |
| Müller | 4DFC6C6C6572 |
| Dobrza?ski | 446F62727A613F736B69 |
+------------+----------------------+
3 rows in set (0.00 sec)
• We can’t! :-(
• 0x3F is '?', so my 'ń' was lost
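The loss can be reproduced outside MySQL. A minimal Python sketch, assuming the server converted incoming utf8 text to the latin1 column and substituted '?' for anything latin1 cannot represent; `store_as_latin1` is a hypothetical helper name, not a MySQL function:

```python
def store_as_latin1(text: str) -> bytes:
    """Mimic a lossy utf8 -> latin1 conversion: unmappable characters become '?'."""
    return text.encode("latin-1", errors="replace")


stored = {name: store_as_latin1(name) for name in ("Lemon", "Müller", "Dobrzański")}

for name, raw in stored.items():
    # Hex of the stored bytes, then what any later read can get back.
    print(name, raw.hex().upper(), raw.decode("latin-1"))
```

'ü' exists in latin1 and survives as 0xFC, while 'ń' does not and is stored as 0x3F. Once that byte is on disk, no connection setting can bring the original character back.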
Problem #2: The bad news
• It may not be enough to configure the server correctly
• A mismatch between client and server can permanently break data
• Implicit conversion inside MySQL server
Problem #2: Settings, defaults, inheritance
• Where do you set character sets in MySQL?
• Session settings
• character_set_server
• character_set_client
• character_set_connection
• character_set_database
• character_set_results
• Schema level defaults – affect new tables
• Table level defaults – affect new columns
• Column charsets
Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} ((none)) > SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1 | utf8 |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)
master [localhost] {msandbox} ((none)) > CREATE SCHEMA fosdem\G
Query OK, 1 row affected (0.00 sec)
master [localhost] {msandbox} ((none)) > SHOW CREATE SCHEMA fosdem\G
*************************** 1. row ***************************
Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)
Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} ((none)) > USE fosdem;
Database changed
master [localhost] {msandbox} (fosdem) > CREATE TABLE test (a VARCHAR(300), INDEX (a));
Query OK, 0 rows affected (0.62 sec)
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} (fosdem) > ALTER TABLE test DEFAULT CHARSET = utf8;
Query OK, 0 rows affected (0.08 sec)
Records: 0 Duplicates: 0 Warnings: 0
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} (fosdem) > ALTER TABLE test ADD b VARCHAR(10);
Query OK, 0 rows affected (0.74 sec)
Records: 0 Duplicates: 0 Warnings: 0
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
`b` varchar(10) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
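The inheritance seen in this session can be summarised with a toy model. This is a Python sketch of the rule only, not of MySQL internals; the class and function names are made up for illustration. The rule it encodes is that each level copies its default from the level above once, at creation time.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Table:
    default_charset: str
    columns: Dict[str, str] = field(default_factory=dict)

    def add_column(self, name: str, charset: Optional[str] = None) -> None:
        # The charset is resolved once, when the column is created.
        self.columns[name] = charset or self.default_charset

    def alter_default_charset(self, charset: str) -> None:
        # Only the default changes; existing columns keep their charset.
        self.default_charset = charset


@dataclass
class Schema:
    default_charset: str

    def create_table(self, charset: Optional[str] = None) -> Table:
        return Table(default_charset=charset or self.default_charset)


def create_schema(server_charset: str, charset: Optional[str] = None) -> Schema:
    # Without an explicit charset, the schema takes character_set_server.
    return Schema(default_charset=charset or server_charset)


fosdem = create_schema(server_charset="latin1")
test = fosdem.create_table()
test.add_column("a")
test.alter_default_charset("utf8")
test.add_column("b")
print(test.columns)
```

Column `a` stays latin1 and only `b` picks up utf8, which matches the SHOW CREATE TABLE output above.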
I f**ckd up. What do I do?
• Let’s start with what you shouldn’t do
• Keep calm and don’t start by changing something
• Analyze the situation
• Why did the problem occur in the first place?
• Reassess the damage
• Is it consistent?
• Are all rows broken in the same way?
• Are some rows bad, but others are okay?
• Are all bad in several different ways?
• Is it actually repairable?
• Only if no character mapping occurred during writes (e.g. UTF-8 data written over a latin1 connection into latin1 columns)
I f**ckd up. What else shouldn’t I do, then?
• Do not rush things as you may easily go from bad to worse
• Do not start fixing this on a replication slave
• You can’t fix this by fixing tables one by one on a live database
• Unless you really have everything in one table
• Do not use: ALTER TABLE … DEFAULT CHARSET = …
• It only changes the default character set for new columns
• Do not use: ALTER TABLE … CONVERT TO CHARACTER SET …
• It’s not for fixing broken encoding
• Do not use: ALTER TABLE … MODIFY col_name … CHARACTER SET …
I f**ckd up. So how do I fix it?
• What needs to be fixed?
• Schema default character set
• ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
• Tables with text columns: CHAR, VARCHAR, TEXT, TINYTEXT, LONGTEXT
• What about ENUM?
• Use INFORMATION_SCHEMA to grab a list
• What about other tables?
• They too (eventually), but it’s not critical
SELECT CONCAT(c.table_schema, '.', c.table_name) AS candidate_table
FROM information_schema.columns c
WHERE c.table_schema = 'fosdem'
AND c.column_type REGEXP '^(.*CHAR|.*TEXT|ENUM)(\\(.+\\))?$'
GROUP BY candidate_table;
I f**ckd up. So how do I fix it?
• Option 1 – requires downtime
• Dump and restore
• Dump the data preserving the bad configuration and drop the old database
bash# mysqldump -u root -p --skip-set-charset --default-character-set=latin1 fosdem > fosdem.sql
mysql> DROP SCHEMA fosdem;
• Correct table definitions in the dump file
• Edit DEFAULT CHARSET in all CREATE TABLE statements
• Create the database again and import the data back
mysql> CREATE SCHEMA fosdem DEFAULT CHARSET utf8;
bash# mysql -u root -p --default-character-set=utf8 fosdem < fosdem.sql
I f**ckd up. So how do I fix it?
• Option 2 – requires downtime
• Perform a two step conversion with ALTER TABLE
• Original encoding -> VARBINARY/BLOB -> Target encoding
• Conversion from/to BINARY/BLOB removes character set context
• How?
• Stop applications
• On each table, for each text column perform:
ALTER TABLE tbl MODIFY col_name VARBINARY(255);
ALTER TABLE tbl MODIFY col_name VARCHAR(255) CHARACTER SET utf8;
• You may specify multiple columns per ALTER TABLE
• Fix the problems (application and/or db configs)
• Restart applications
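Why the detour through a binary type matters can be shown at the byte level. A minimal Python sketch, assuming the common case where a column labelled latin1 actually holds UTF-8 bytes; Python's `latin-1` codec approximates MySQL's latin1 here:

```python
# Bytes sitting in the mislabelled latin1 column.
stored = "Kraków".encode("utf-8")

# Two-step fix: latin1 -> VARBINARY keeps the bytes and drops the label,
# then VARBINARY -> utf8 attaches the correct label to the same bytes.
as_binary = stored
relabelled = as_binary.decode("utf-8")

# Direct latin1 -> utf8 conversion: every stored byte is treated as a
# latin1 character and transcoded, so the damage becomes permanent.
transcoded = stored.decode("latin-1").encode("utf-8")

print(relabelled)
print(transcoded)
```

The two-step route leaves the bytes alone and only changes how they are interpreted. A direct conversion rewrites them, which is why it is not a tool for fixing broken encoding.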
I f**ckd up. So how do I fix it?
• Option 3 – online character set fix; no downtime*
• Thanks to our plugin for pt-online-schema-change
• and a tiny patch for pt-online-schema-change that goes with the plugin 
• How?
• Start pt-online-schema-change on all tables – one by one
• Do not rotate tables (--no-swap-tables) or drop pt-osc triggers
• Wait until all tables have been converted
• Stop applications
• Fix the problems (application and/or db configs)
• Rotate tables – takes just 1 minute
• Restart applications
• Et voilà
GOTCHAs!
• Data space requirements may change during conversion
• latin1 uses 1 byte per character; utf8 has to allow for up to 3 bytes per character
• VARCHAR/TEXT hold up to 64KB, which won’t fit 65536 multi-byte characters
• Key length limit is 767 bytes
• Data type and/or index length changes may be required
• Test and plan this ahead
• There may be more problems than you think
• Detect irrecoverable problems with a simple stored function
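DELIMITER ;;
-- Returns 1 when the latin1 bytes of value_before equal the bytes of value_after, 0 otherwise
CREATE FUNCTION `cnv_test_conversion` (`value_before` LONGTEXT, `value_after` LONGTEXT) RETURNS tinyint(1)
DETERMINISTIC
BEGIN
RETURN (IFNULL(CONVERT(CONVERT(`value_before` USING latin1) USING binary), '') =
IFNULL(CONVERT(`value_after` USING binary), ''));
END;;
DELIMITER ;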
GOTCHAs!
master [localhost] {msandbox} (fosdem) > ALTER TABLE test MODIFY a VARCHAR(300) CHARACTER SET utf8;
Query OK, 0 rows affected, 1 warning (1.23 sec)
Records: 0 Duplicates: 0 Warnings: 1
master [localhost] {msandbox} (fosdem) > SHOW WARNINGS\G
*************************** 1. row ***************************
Level: Warning
Code: 1071
Message: Specified key was too long; max key length is 767 bytes
1 row in set (0.00 sec)
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) DEFAULT NULL,
`b` varchar(10) DEFAULT NULL,
KEY `a` (`a`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
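The 255 in the truncated index follows from simple arithmetic on the limits quoted above. A small Python sketch; the helper names are hypothetical, and the 64KB figure is a rough upper bound that ignores length-prefix overhead:

```python
KEY_LIMIT_BYTES = 767                     # limit reported in the warning
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}


def max_indexable_chars(charset: str) -> int:
    """Characters that fit in one index key when each may need the maximum width."""
    return KEY_LIMIT_BYTES // MAX_BYTES_PER_CHAR[charset]


def index_prefix(column_chars: int, charset: str):
    """Return the prefix length the index is cut to, or None if the column fits."""
    limit = max_indexable_chars(charset)
    return None if column_chars <= limit else limit


latin1_prefix = index_prefix(300, "latin1")
utf8_prefix = index_prefix(300, "utf8")
max_utf8_chars_in_64kb = 65535 // MAX_BYTES_PER_CHAR["utf8"]

print(latin1_prefix, utf8_prefix, max_utf8_chars_in_64kb)
```

A VARCHAR(300) index fits as latin1 but needs 900 bytes as utf8, so the index is silently cut down to a 255-character prefix.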
How to do it right?
• Set character-set-server during initial configuration
• When creating new schemas, always specify the desired charset
• CREATE SCHEMA fosdem DEFAULT CHARSET = utf8
• ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
• When creating new tables, also explicitly specify the charset
• CREATE TABLE people (…) DEFAULT CHARSET = utf8
• And don’t forget to configure applications too
• You can try to force charset on the clients
• init-connect = "SET NAMES utf8"
• It might also break applications that don’t want to talk to MySQL using utf8
Oh, and one more thing…
• We are sharing WebScaleSQL packages with the MySQL Community!
• Check out http://www.psce.com/blog for details
• Follow @dbasquare to receive updates
WebScaleSQL
What is WebScaleSQL?
WebScaleSQL is a collaboration among engineers from several companies
such as Facebook, Twitter, Google or Linkedin, that face the same challenges
in deploying MySQL at scale, and seek greater performance from a database
technology tailored for their needs.

Character Encoding - MySQL DevRoom - FOSDEM 2015

  • 1. Character encoding Breaking and unbreaking your data Maciej Dobrzanski maciek@psce.com | @mushupl Brussels, 1 Feb 2015 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 2. Character Encoding • Binary representation of glyphs • Each character can be represented by 1 or more bytes • Popular schemes • ASCII • Unicode • UTF-8, UTF-16, UTF-32 • Language specific character sets • US (Latin US) • Europe (Latin 1, Latin 2) • Asia (EUC-KR, GB18030) 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 3. Character Encoding • Character set defines the visual interpretation of binary information • One glyph can be associated with several numeric codes • One numeric code may be used to represent several different glyphs 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 4. Please state the nature of the emergency • Application configuration • Database configuration • Table/column definitions 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 5. Problem #1: We are all born Swedish • MySQL uses latin1 by default • MySQL 5.7 too • Is anyone actually aware of that? • Why Swedish? • latin1_swedish_ci is the default collation 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 6. Problem #1 • Let’s build an application mysql> SELECT @@global.character_set_server, @@session.character_set_client; +-------------------------------+--------------------------------+ | @@global.character_set_server | @@session.character_set_client | +-------------------------------+--------------------------------+ | latin1 | latin1 | +-------------------------------+--------------------------------+ 1 row in set (0.00 sec) mysql> CREATE SCHEMA fosdem; Query OK, 1 row affected (0.00 sec) mysql> USE fosdem; mysql> CREATE TABLE locations (city VARCHAR(30) NOT NULL); Query OK, 0 rows affected (0.15 sec) mysql> SHOW CREATE TABLE locationsG *************************** 1. row *************************** Table: locations Create Table: CREATE TABLE `locations` ( `city` varchar(30) NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec) 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 7. Problem #1 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 8. Problem #1 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 9. Problem #1 • Everything is correct… NOT! mysql> SET NAMES utf8; Query OK, 0 rows affected (0.00 sec) mysql> select * from locations; +--------------------+ | city | +--------------------+ | Berlin | | Kraków | | 東京都 | +--------------------+ 3 rows in set (0.00 sec) mysql> SET NAMES latin1; Query OK, 0 rows affected (0.00 sec) mysql> select * from locations; +-----------+ | city | +-----------+ | Berlin | | Kraków | | 東京都 | +-----------+ 3 rows in set (0.00 sec) 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 10. Problem #1 • Let’s fix this • Or can we ignore it? • Ruby may not like it # grep character-set-server /etc/mysql/my.cnf character-set-server = utf8 mysql> SELECT @@global.character_set_server, @@session.character_set_client; +-------------------------------+--------------------------------+ | @@global.character_set_server | @@session.character_set_client | +-------------------------------+--------------------------------+ | utf8 | utf8 | +-------------------------------+--------------------------------+ 1 row in set (0.00 sec) ...we are fixing our tables here... mysql> SHOW CREATE TABLE locationsG *************************** 1. row *************************** Table: locations Create Table: CREATE TABLE `locations` ( `city` varchar(30) NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8 1 row in set (0.00 sec) 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 11. Problem #1: The good news • It’s usually fixable 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 12. Problem #2: Settings, defaults, inheritance • Where do you set character sets in MySQL? • Sesssion settings • character_set_server • character_set_client • character_set_connection • character_set_database • character_set_result • Schema level defaults • Table level defaults • Column charsets 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 13. Problem #2 • Having fixed our problem #1, we continue to develop our application mysql> SELECT @@session.character_set_server, @@session.character_set_client; +--------------------------------+--------------------------------+ | @@session.character_set_server | @@session.character_set_client | +--------------------------------+--------------------------------+ | utf8 | utf8 | +--------------------------------+--------------------------------+ 1 row in set (0.00 sec) mysql> USE fosdem; mysql> CREATE TABLE people (first_name VARCHAR(30) NOT NULL, last_name VARCHAR(30) NOT NULL); Query OK, 0 rows affected (0.13 sec) 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 14. Problem #2 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 15. Problem #2 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 16. Problem #2 • Why is the table character set latin1? mysql> SELECT @@session.character_set_server, @@session.character_set_client; +--------------------------------+--------------------------------+ | @@session.character_set_server | @@session.character_set_client | +--------------------------------+--------------------------------+ | utf8 | utf8 | +--------------------------------+--------------------------------+ 1 row in set (0.00 sec) mysql> USE fosdem; mysql> SHOW CREATE TABLE peopleG *************************** 1. row *************************** Table: people Create Table: CREATE TABLE `people` ( `first_name` varchar(30) NOT NULL, `last_name` varchar(30) NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 1 row in set (0.00 sec) 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 17. Problem #2 • What’s all this, then? mysql> SHOW SESSION VARIABLES LIKE 'character_set_%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ 8 rows in set (0.00 sec) mysql> SHOW CREATE DATABASE fosdemG *************************** 1. row *************************** Database: fosdem Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */ 1 row in set (0.00 sec) 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 18. Problem #2 • Can we fix this? mysql> SET NAMES utf8; Query OK, 0 rows affected (0.00 sec) mysql> SELECT last_name, HEX(last_name) FROM people; +------------+----------------------+ | last_name | HEX(last_name) | +------------+----------------------+ | Lemon | 4C656D6F6E | | Müller | 4DFC6C6C6572 | | Dobrza?ski | 446F62727A613F736B69 | +------------+----------------------+ 3 rows in set (0.00 sec) mysql> SET NAMES latin2; Query OK, 0 rows affected (0.00 sec) mysql> SELECT last_name, HEX(last_name) FROM people; +------------+----------------------+ | last_name | HEX(last_name) | +------------+----------------------+ | Lemon | 4C656D6F6E | | Müller | 4DFC6C6C6572 | | Dobrza?ski | 446F62727A613F736B69 | +------------+----------------------+ 3 rows in set (0.00 sec) • We can’t! :-( • 0x3F is '?', so my 'ń' was lost 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 19. Problem #2: The bad news • It may not be enough to configure the server correctly • A mismatch between client and server can permantenly break data • Implicit conversion inside MySQL server 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 20. Problem #2: Settings, defaults, inheritance • Where do you set character sets in MySQL? • Sesssion settings • character_set_server • character_set_client • character_set_connection • character_set_database • character_set_result • Schema level defaults – affect new tables • Table level defaults – affect new columns • Column charsets 01.02.2015 Follow us on Twitter @dbasquare www.psce.com
  • 21. Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} ((none)) > SELECT @@global.character_set_server, @@session.character_set_client;
+-------------------------------+--------------------------------+
| @@global.character_set_server | @@session.character_set_client |
+-------------------------------+--------------------------------+
| latin1                        | utf8                           |
+-------------------------------+--------------------------------+
1 row in set (0.00 sec)
master [localhost] {msandbox} ((none)) > CREATE SCHEMA fosdem\G
Query OK, 1 row affected (0.00 sec)
master [localhost] {msandbox} ((none)) > SHOW CREATE SCHEMA fosdem\G
*************************** 1. row ***************************
Database: fosdem
Create Database: CREATE DATABASE `fosdem` /*!40100 DEFAULT CHARACTER SET latin1 */
1 row in set (0.00 sec)
  • 22. Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} ((none)) > USE fosdem;
Database changed
master [localhost] {msandbox} (fosdem) > CREATE TABLE test (a VARCHAR(300), INDEX (a));
Query OK, 0 rows affected (0.62 sec)
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
  • 23. Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} (fosdem) > ALTER TABLE test DEFAULT CHARSET = utf8;
Query OK, 0 rows affected (0.08 sec)
Records: 0 Duplicates: 0 Warnings: 0
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
  • 24. Problem #2: Settings, defaults, inheritance
master [localhost] {msandbox} (fosdem) > ALTER TABLE test ADD b VARCHAR(10);
Query OK, 0 rows affected (0.74 sec)
Records: 0 Duplicates: 0 Warnings: 0
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
`b` varchar(10) DEFAULT NULL,
KEY `a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
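The inheritance chain walked through on the last four slides (server default to schema, schema default to table, table default to new columns only) can be restated as a toy model. This is not how MySQL is implemented; the Schema and Table classes and their method names are hypothetical and exist only to replay the same sequence of statements.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Table:
    default_charset: str
    columns: Dict[str, str] = field(default_factory=dict)  # column -> charset

    def add_column(self, name: str, charset: Optional[str] = None) -> None:
        # A column without an explicit charset takes the table default
        # that is in effect at the moment the column is created.
        self.columns[name] = charset or self.default_charset

    def alter_default_charset(self, charset: str) -> None:
        # Only the default changes; existing columns keep their charset.
        self.default_charset = charset


@dataclass
class Schema:
    default_charset: str

    def create_table(self, charset: Optional[str] = None) -> Table:
        # A table without an explicit charset takes the schema default.
        return Table(charset or self.default_charset)


server_charset = "latin1"            # @@global.character_set_server
schema = Schema(server_charset)      # CREATE SCHEMA fosdem
test = schema.create_table()         # CREATE TABLE test (a VARCHAR(300), ...)
test.add_column("a")
test.alter_default_charset("utf8")   # ALTER TABLE test DEFAULT CHARSET = utf8
test.add_column("b")                 # ALTER TABLE test ADD b VARCHAR(10)

print(test.default_charset, test.columns)
```

As in the SHOW CREATE TABLE output, column a stays latin1 while column b, created after the default changed, is utf8.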
  • 25. I f**ckd up. What do I do?
• Let’s start with what you shouldn’t do
• Keep calm and don’t start by changing something
• Analyze the situation
• Why did the problem occur in the first place?
• Reassess the damage
• Is it consistent?
• Are all rows broken in the same way?
• Are some rows bad, but others are okay?
• Are all bad in several different ways?
• Is it actually repairable?
• No character mapping occurred during writes (e.g. unicode over latin1/latin1)
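One way to approach the consistency questions above is to look at the raw bytes of each value (for example, as returned by HEX()) and sort rows into buckets. The sketch below assumes the common "UTF-8 bytes written unconverted into a latin1 column" scenario; the classify function and its labels are hypothetical, and the check is a heuristic, since genuine latin1 text can occasionally also be valid UTF-8.

```python
def classify(stored: bytes) -> str:
    """Classify the raw bytes of one value taken from a latin1 column."""
    if stored.isascii():
        # Same bytes in latin1 and UTF-8: nothing to convert. A '?' that
        # replaced a lost character also lands here and cannot be detected.
        return "ascii"
    try:
        stored.decode("utf-8")
    except UnicodeDecodeError:
        # Genuine latin1 text, or data damaged in some other way.
        return "other"
    # Bytes were written without character mapping: likely repairable.
    return "utf8-in-latin1"


rows = [
    b"Lemon",
    "Müller".encode("utf-8"),   # UTF-8 bytes stored as-is
    b"M\xfcller",               # real latin1 encoding of 'Müller'
    b"Dobrza?ski",              # 'ń' already replaced by '?'
]
report = [classify(raw) for raw in rows]
print(report)
```

A table where every non-ASCII row falls into the same bucket is broken consistently; a mix of buckets means rows need different treatment.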
  • 26. I f**ckd up. What else shouldn’t I do, then?
• Do not rush things as you may easily go from bad to worse
• Do not start fixing this on a replication slave
• You can’t fix this by fixing tables one by one on a live database
• Unless you really have everything in one table
• Do not use: ALTER TABLE … DEFAULT CHARSET = …
• It only changes the default character set for new columns
• Do not use: ALTER TABLE … CONVERT TO CHARACTER SET …
• It’s not for fixing broken encoding
• Do not use: ALTER TABLE … MODIFY col_name … CHARACTER SET …
  • 27. I f**ckd up. So how do I fix it?
• What needs to be fixed?
• Schema default character set
• ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
• Tables with text columns: CHAR, VARCHAR, TEXT, TINYTEXT, LONGTEXT
• What about ENUM?
• Use INFORMATION_SCHEMA to grab a list
• What about other tables?
• They too (eventually), but it’s not critical
SELECT CONCAT(c.table_schema, '.', c.table_name) AS candidate_table
FROM information_schema.columns c
WHERE c.table_schema = 'fosdem'
AND c.column_type REGEXP '^(.*CHAR|.*TEXT|ENUM)((.+))?$'
GROUP BY candidate_table;
  • 28. I f**ckd up. So how do I fix it?
• Option 1 – requires downtime
• Dump and restore
• Dump the data preserving the bad configuration and drop the old database
bash# mysqldump -u root -p --skip-set-charset --default-character-set=latin1 fosdem > fosdem.sql
mysql> DROP SCHEMA fosdem;
• Correct table definitions in the dump file
• Edit DEFAULT CHARSET in all CREATE TABLE statements
• Create the database again and import the data back
mysql> CREATE SCHEMA fosdem DEFAULT CHARSET utf8;
bash# mysql -u root -p --default-character-set=utf8 fosdem < fosdem.sql
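The "correct table definitions in the dump file" step is usually done by hand or with a text tool. A minimal sketch of automating it is shown below, assuming the dump puts each INSERT statement on its own line starting with INSERT INTO. The retarget_charset function is hypothetical, it does not touch COLLATE clauses, and its result should be reviewed before importing.

```python
import re


def retarget_charset(dump_sql: str, old: str = "latin1", new: str = "utf8") -> str:
    """Rewrite charset declarations in DDL lines, leaving row data alone."""
    table_default = re.compile(rf"\bDEFAULT CHARSET={re.escape(old)}\b")
    column_override = re.compile(rf"\bCHARACTER SET {re.escape(old)}\b")
    fixed = []
    for line in dump_sql.splitlines(keepends=True):
        # Skip data lines so that text values are never modified.
        if not line.startswith("INSERT INTO"):
            line = table_default.sub(f"DEFAULT CHARSET={new}", line)
            line = column_override.sub(f"CHARACTER SET {new}", line)
        fixed.append(line)
    return "".join(fixed)


sample_dump = """CREATE TABLE `test` (
  `a` varchar(300) CHARACTER SET latin1 DEFAULT NULL,
  `b` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `test` VALUES ('DEFAULT CHARSET=latin1',NULL);
"""

fixed_dump = retarget_charset(sample_dump)
print(fixed_dump)
```

Both the table default and the column-level override are rewritten, while the string inside the INSERT statement is left exactly as dumped.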
  • 29. I f**ckd up. So how do I fix it?
• Option 2 – requires downtime
• Perform a two step conversion with ALTER TABLE
• Original encoding -> VARBINARY/BLOB -> Target encoding
• Conversion from/to BINARY/BLOB removes character set context
• How?
• Stop applications
• On each table, for each text column perform:
ALTER TABLE tbl MODIFY col_name VARBINARY(255);
ALTER TABLE tbl MODIFY col_name VARCHAR(255) CHARACTER SET utf8;
• You may specify multiple columns per ALTER TABLE
• Fix the problems (application and/or db configs)
• Restart applications
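Why the detour through VARBINARY matters can be illustrated on plain bytes. The sketch below assumes the column holds UTF-8 bytes that were written into a latin1 column without conversion, and again uses Python's cp1252 codec as a stand-in for MySQL's latin1; it models the effect of the statements, not MySQL itself.

```python
raw = "Dobrzański".encode("utf-8")   # what the application actually wrote

# Direct MODIFY ... CHARACTER SET utf8: the server believes the bytes are
# latin1 and transcodes each of them to UTF-8, double-encoding the text.
direct = raw.decode("cp1252").encode("utf-8")

# Step 1, MODIFY ... VARBINARY: the bytes are kept, the charset label is dropped.
as_binary = bytes(raw)
# Step 2, MODIFY ... CHARACTER SET utf8: the same bytes get the new label.
two_step = as_binary

print(direct.decode("utf-8"))     # mojibake: the name is now stored wrongly
print(two_step.decode("utf-8"))   # the original name
```

The direct conversion grows the value and bakes the wrong interpretation into the data, while the two-step path leaves the bytes untouched and only changes how they are labelled.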
  • 30. I f**ckd up. So how do I fix it?
• Option 3 – online character set fix; no downtime*
• Thanks to our plugin for pt-online-schema-change
• and a tiny patch for pt-online-schema-change that goes with the plugin :-)
• How?
• Start pt-online-schema-change on all tables – one by one
• Do not rotate tables (--no-swap-tables) or drop pt-osc triggers
• Wait until all tables have been converted
• Stop applications
• Fix the problems (application and/or db configs)
• Rotate tables – takes just 1 minute
• Restart applications
• Et voilà
  • 31. GOTCHAs!
• Data space requirements may change during conversion
• Latin1 uses 1 byte per character, utf8 will need to assume 3 bytes
• VARCHAR/TEXT fit up to 64KB – it won’t fit 65536 multi-byte characters
• Key length limit is 767 bytes
• Data type and/or index length changes may be required
• Test and plan this ahead
• There may be more problems than you think
• Detect irrecoverable problems with a simple stored function
CREATE FUNCTION `cnv_test_conversion` (`value_before` LONGTEXT, `value_after` LONGTEXT) RETURNS tinyint(1)
BEGIN
RETURN (IFNULL(CONVERT(CONVERT(`value_before` USING latin1) USING binary), "") = IFNULL(CONVERT(`value_after` USING binary), ""));
END;;
  • 32. GOTCHAs!
master [localhost] {msandbox} (fosdem) > ALTER TABLE test MODIFY a VARCHAR(300) CHARACTER SET utf8;
Query OK, 0 rows affected, 1 warning (1.23 sec)
Records: 0 Duplicates: 0 Warnings: 1
master [localhost] {msandbox} (fosdem) > SHOW WARNINGS\G
*************************** 1. row ***************************
Level: Warning
Code: 1071
Message: Specified key was too long; max key length is 767 bytes
1 row in set (0.00 sec)
master [localhost] {msandbox} (fosdem) > SHOW CREATE TABLE test\G
*************************** 1. row ***************************
Table: test
Create Table: CREATE TABLE `test` (
`a` varchar(300) DEFAULT NULL,
`b` varchar(10) DEFAULT NULL,
KEY `a` (`a`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
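The silent truncation of the index to a 255-character prefix follows directly from the numbers quoted in the gotchas: a 767-byte key limit and up to 3 bytes per utf8 character. A small back-of-the-envelope sketch (the constant and function names are made up for this example, and 65535 bytes is used as the TEXT limit):

```python
KEY_LIMIT_BYTES = 767        # from warning 1071 above
TEXT_LIMIT_BYTES = 65535     # maximum size of a TEXT value in bytes
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3}


def index_bytes(chars: int, charset: str) -> int:
    """Worst-case bytes needed to index a column of `chars` characters."""
    return chars * MAX_BYTES_PER_CHAR[charset]


def max_indexed_chars(charset: str) -> int:
    """Longest index prefix, in characters, that fits the key limit."""
    return KEY_LIMIT_BYTES // MAX_BYTES_PER_CHAR[charset]


print(index_bytes(300, "latin1"))   # fits within the 767-byte limit
print(index_bytes(300, "utf8"))     # exceeds the 767-byte limit
print(max_indexed_chars("utf8"))    # prefix length MySQL fell back to
print(TEXT_LIMIT_BYTES // MAX_BYTES_PER_CHAR["utf8"])  # worst-case chars in TEXT
```

Running the same arithmetic over every indexed text column before the conversion shows which indexes will be shortened and which columns may need a larger data type.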
  • 33. How to do it right?
• Set character-set-server during initial configuration
• When creating new schemas, always specify the desired charset
• CREATE SCHEMA fosdem DEFAULT CHARSET = utf8
• ALTER SCHEMA fosdem DEFAULT CHARSET = utf8
• When creating new tables, also explicitly specify the charset
• CREATE TABLE people (…) DEFAULT CHARSET = utf8
• And don’t forget to configure applications too
• You can try to force charset on the clients
• init-connect = "SET NAMES utf8"
• It might also break applications that don’t want to talk to MySQL using utf8
  • 34. Oh, and one more thing…
  • 35. WebScaleSQL
• We are sharing WebScaleSQL packages with the MySQL Community!
• Check out http://www.psce.com/blog for details
• Follow @dbasquare to receive updates
What is WebScaleSQL?
WebScaleSQL is a collaboration among engineers from several companies, such as Facebook, Twitter, Google, and LinkedIn, that face the same challenges in deploying MySQL at scale and seek greater performance from a database technology tailored to their needs.