How MySQL Uses character_set Configurations

How MySQL Uses character_set

There are eight configuration options related to the character_set in MySQL, as shown below. Without reading the MySQL Character Set documentation carefully, it could be hard to know what these configuration options are used for. In addition, for some of the options, unless there is further testing, it could be hard to know how MySQL uses them.

mysql> show variables like 'character_set%'; +--------------------------+-------------------------------------+ | Variable_name            | Value                               | +--------------------------+-------------------------------------+ | character_set_client     | utf8mb4                             | | character_set_connection | utf8mb4                             | | character_set_database   | utf8mb4                             | | character_set_filesystem | binary                              | | character_set_results    | utf8mb4                             | | character_set_server     | utf8mb4                             | | character_set_system     | utf8mb3                             | | character_sets_dir       | /usr/share/percona-server/charsets/ | +--------------------------+-------------------------------------+

Don’t be intimidated – I will simplify the purpose of these settings by grouping them and explaining them. I will also use examples to illustrate how MySQL uses and processes these character_set options so that users can mitigate the occurrence of data loss and/or errors. Lastly, I will explain how the command “set names” can be used to organize the necessary configurations to be identical.

Grouping the options

To begin, we can organize these eight options into three groups: 

  • G1. Miscellaneous; 
  • G2. To define the character_set of data(column in a table); 
  • G3. To transfer/interpret during the processing of the character_set. 

G1. Miscellaneous

G1.1 character_sets_dir: It is straightforward: the directory where dynamically loaded character sets are installed. 

G1.2 character_set_filesystem: when MySQL needs to deal with a file, it needs the file name, and that file name will be converted from character_set_client to character_set_filesystem before the file is opened. The default value is binary, which means that no conversion occurs. In most cases, please keep it as the default value binary unless you are really confident that you would require the change.

G2. Define the character set of data (column in a table)

The smallest unit of data in an RDBMS (for example MySQL) is a column of a table. When the column is used to store a collection of characters(namely string, for example, varchar(), char(), text, etc.), MySQL needs to know which character set these characters belong to, so that MySQL can store and interpret them correctly.  

The sole purpose of the three options in G2 is to help eventually define the character set of a column in a hierarchical way. The three options are character_set_system, character_set_server, and character_set_database.

The system character set and collation(character_set_system) is used as the default value if the server character set and collation(character_set_server) is not specified.

The server character set and collation(character_set_server) is used as the default value if the database character set and collation(character_set_database) is not specified. 

The database character set and collation(character_set_database) is used as the default value if the database character set and collation are not specified in CREATE DATABASE statements. The character set and collation of a database are used as the default value for a table if the table character set and collation are not specified in CREATE TABLE statements.

The character set and collation of a table are used as the default values for string columns of the table if the character set and collation of those string columns are not specified in CREATE TABLE statements.

In other words, if we define the character_set and collation of a column in CREATE TABLE statements, all of the above settings will be ignored. For example:

mysql> CREATE TABLE t1 ( c1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,     
                         c2 VARCHAR(100) COLLATE utf8mb4_general_ci,     
                         c3 VARCHAR(100) )  DEFAULT CHARACTER SET=latin1;
mysql> select TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME, CHARACTER_SET_NAME from information_schema.columns where table_schema='test' and table_name='t1';
+--------------+------------+-------------+--------------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | CHARACTER_SET_NAME |
+--------------+------------+-------------+--------------------+
| test         | t1         | c1          | NULL               |
| test         | t1         | c2          | utf8mb4            |
| test         | t1         | c3          | latin1             |
+--------------+------------+-------------+--------------------+

In the example above, we did not specify the character set for column c3. Therefore, by default, it uses latin1 (which was defined in CREATE TABLE) as its CHARACTER_SET. However, we specified column c2 as COLLATE utf8mb4_general_ci, so it uses utf8mb4 as its CHARACTER_SET, thus ignoring the character set latin1 (which, again, was defined in CREATE TABLE for the table).

For more details, please read https://dev.mysql.com/doc/refman/8.0/en/charset-syntax.html.

10.3.2 Server Character Set and Collation

10.3.3 Database Character Set and Collation

10.3.4 Table Character Set and Collation

10.3.5 Column Character Set and Collation

G3. Transfer/interpret during the processing of the character_set

There are three options in this group: character_set_client, character_set_connection, and character_set_results

The first two options are used by MySQL for writing to tables: The server takes the character_set_client system variable to be the character set in which statements are sent by the client. The server converts statements sent by the client from character_set_client to character_set_connection.

The last one is used for reading from the table: The character_set_results system variable indicates the character set in which the server returns query results to the client. To tell the server to perform no conversion of result sets or error messages, set character_set_results to NULL or binary.

Illustration via examples

Let’s take a look at a few examples to help us understand further:

We will reuse the above table t1, which has a column c2 of CHARACTER SET utf8mb4 and a column c3 of CHARACTER SET latin1

Just so you know, in the example, I will use two Chinese characters “?” which is my Chinese last name, and “?” which is my wife’s Chinese last name.

Example one

There is no conversion when character_set_client is the same as character_set_connection. And we can see that the encoded binary of the Chinese character “?” is 0xE78E8B and the encoded binary of the Chinese character “?” is 0xE69D8E, regardless of if the character set is latin1 or utf8mb4.

mysql> set names latin1;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| latin1                 | latin1                     |
+------------------------+----------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE78E8B                     |
+------------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE69D8E                     |
+------------------------------+
mysql> set names utf8mb4;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| utf8mb4                | utf8mb4                    |
+------------------------+----------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE78E8B                     |
+------------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE69D8E                     |
+------------------------------+

Example two

MySQL will convert from character_set_client to character_set_connection when their settings are different.

Example 2.1 convert from smaller character_set to larger character_set

mysql> set character_set_client=latin1; set character_set_connection=utf8mb4;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| latin1                 | utf8mb4                    |
+------------------------+----------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xC3A7C5BDE280B9             |
+------------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xC3A6C29DC5BD               |
+------------------------------+

Why are the binary(“?”) and binary(“?”) shown here different from example 1? The reason is that MySQL did the conversion from character_set_client(latin1) to character_set_connection(utf8mb4), meaning that MySQL will interpret the string using character_set_client(latin1), then interpret that output using character_set_connection(utf8mb4). 

Let’s verify:

mysql> select binary(convert(convert('?' using latin1) using utf8mb4));
+-------------------------------------------------------------------------------+
| binary(convert(convert('?'  using latin1) using utf8mb4))                    |
+-------------------------------------------------------------------------------+
| 0xC3A7C5BDE280B9                                                              | +-------------------------------------------------------------------------------+
mysql> select binary(convert(convert('?' using latin1) using utf8mb4));
+-------------------------------------------------------------------------------+
| binary(convert(convert('?'  using latin1) using utf8mb4))                    |
+-------------------------------------------------------------------------------+
| 0xC3A6C29DC5BD                                                                | +-------------------------------------------------------------------------------+

Example 2.2 convert from larger character_set to smaller character_set

mysql> set character_set_client=utf8mb4; set character_set_connection=latin1;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| utf8mb4                | latin1                     |
+------------------------+----------------------------+
mysql> select binary("?");
+--------------------------+
| binary("?")              |
+--------------------------+
| 0x3F                     |
+--------------------------+
mysql> select binary("?");
+--------------------------+
| binary("?")              |
+--------------------------+
| 0x3F                     |
+--------------------------+

Again, MySQL does the conversion from character_set_client(utf8mb4) to character_set_connection(latin1), and due to the conversion from larger character_set to smaller character_set, it makes sense that the conversion will cause data loss in some situations, such as in this example. Different Chinese words ‘?’ and ‘?’ returned the same binary 0x3F which obviously is wrong (data was lost).

Below is how MySQL does the conversion with data loss:

mysql> select binary(convert(convert('?' using utf8mb4) using latin1));
+-------------------------------------------------------------------------------+
| binary(convert(convert('?'  using utf8mb4) using latin1))                    |
+-------------------------------------------------------------------------------+
| 0x3F                                                                          |
+-------------------------------------------------------------------------------+
mysql> select binary(convert(convert('?' using utf8mb4) using latin1));
+-------------------------------------------------------------------------------+
| binary(convert(convert('?'  using utf8mb4) using latin1))                    |
+-------------------------------------------------------------------------------+
|0x3F                                                                           |
+-------------------------------------------------------------------------------+

Example three

When the character_set of a column is different from the character_set_connection, the conversion will occur as well.

Example 3.1 converts from a smaller character_set to a larger character_set.

As we can see from the below example, column c2 is expecting character_set utf8mb4 which is different from character_set_connection latin1, so the actual data inserted is the result of the conversion(interpret the string as latin1, then interpret the output into destination character set utf8mb4): “convert(convert(‘?’ using latin1) using utf8mb4)” from which we got the binary of ‘?’ for utf8mb4: 0xC3A7C5BDE280B9; at the same time, column c3 is expecting the same character_set as character_set_connection latin1, so there is no conversion involved, and data inserted is the original one(interpret the string as latin1), so the binary of ‘?’ for latin1: 0xE69D8E.

mysql> set names latin1;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| latin1                 | latin1                     |
+------------------------+----------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE78E8B                     |
+------------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE69D8E                     |
+------------------------------+
mysql> insert into t1(c1,c2,c3) values(1,"?","?");
mysql> select * from t1 where c1=1;
+----+------+------+
| c1 | c2   | c3   |
+----+------+------+
|  1 | ?   | ?   |
+----+------+------+
mysql> select c1,binary(c2),binary(c3) from t1 where c1=1;
+----+------------------------+------------------------+
| c1 | binary(c2)             | binary(c3)             |
+----+------------------------+------------------------+
|  1 | 0xC3A7C5BDE280B9       | 0xE69D8E               |
+----+------------------------+------------------------+

Please also note that here, column c2 is utf8mb4 which is different from character_set_results latin1. So when reading the data, MySQL will convert it for the results as well. However, column c3 is the same as character_set_results latin1, so no conversion is required, the output will be interpreted as latin1, though.

mysql> select @@character_set_results;
+-------------------------+
| @@character_set_results |
+-------------------------+
| latin1                  |
+-------------------------+
mysql> select * from t1 where c1=1;
+----+------+------+
| c1 | c2   | c3   |
+----+------+------+
|  1 | ?   | ?   |
+----+------+------+

For column c2, MySQL will interpret the result as utf8mb4 (the character set definition of the column), then converts it into latin1(setting of character_set_results) as below:

mysql> select convert(convert(0xC3A7C5BDE280B9 using utf8mb4) using latin1);
+---------------------------------------------------------------+
| convert(convert(0xC3A7C5BDE280B9 using utf8mb4) using latin1) |
+---------------------------------------------------------------+
| ?                                                            |
+---------------------------------------------------------------+

For column c3, MySQL interprets the result directly as latin1:

mysql> select convert(0xE69D8E using latin1);
+--------------------------------+
| convert(0xE69D8E using latin1) |
+--------------------------------+
| ?                             |
+--------------------------------+

Example 3.2 converts from a larger character_set to a smaller character_set.

character_set_connection is set to utf8mb4, c2 is expecting for character_set utf8mb4, and c3 is expecting for character_set latin1. c2 works fine, but for c3, MySQL gave out the error 1366 (HY000): Incorrect string value, since it determined the conversion(convert(convert(‘?’ using utf8mb4) using latin1)) causes data loss and stops inserting the wrong data into the column.

mysql> set names utf8mb4 ;
mysql> select @@character_set_client,@@character_set_connection;
+------------------------+----------------------------+
| @@character_set_client | @@character_set_connection |
+------------------------+----------------------------+
| utf8mb4                | utf8mb4                    |
+------------------------+----------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE78E8B                     |
+------------------------------+
mysql> select binary("?");
+------------------------------+
| binary("?")                 |
+------------------------------+
| 0xE69D8E                     |
+------------------------------+
mysql> insert into t1(c1,c2,c3) values(2,"?","?");
ERROR 1366 (HY000): Incorrect string value: '\xE6\x9D\x8E' for column 'c3' at row 1
mysql> select binary(convert(convert('?' using utf8mb4) using latin1));
+------------------------------------------------------------------------------+
| binary(convert(convert('?' using utf8mb4) using latin1))                    |
+------------------------------------------------------------------------------+
| 0x3F                                                                         |
+------------------------------------------------------------------------------+

Example four

When you want to directly convert a column from a larger character_set to a smaller character_set inside the InnoDB engine, where the variable settings are not involved, the same error may happen as well

mysql> select c1, binary(c2), binary(c3) from t1;
+----+------------------------+------------------------+
| c1 | binary(c2)             | binary(c3)             |
+----+------------------------+------------------------+
|  1 | 0xC3A7C5BDE280B9       | 0xE69D8E               |
|  3 | 0xE78E8B               | 0x4131                 |
+----+------------------------+------------------------+
mysql> ALTER TABLE t1 MODIFY c2 varchar(100) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL;
ERROR 1366 (HY000): Incorrect string value: '\xE7\x8E\x8B' for column 'c2' at row 2

Example five

One more thing, as you may notice, in the above examples, I used the command set names a few times. Yes, this is actually the convenient way to make all three options in the G3 group identical, so that we can avoid unnecessary conversions.

mysql> set names latin1;
mysql> show variables like '%character_set%';
+--------------------------+-------------------------------------+
| Variable_name            | Value                               |
+--------------------------+-------------------------------------+
| character_set_client     | latin1                              |
| character_set_connection | latin1                              |
| character_set_database   | utf8mb4                             |
| character_set_filesystem | binary                              |
| character_set_results    | latin1                              |
| character_set_server     | utf8mb4                             |
| character_set_system     | utf8mb3                             |
| character_sets_dir       | /usr/share/percona-server/charsets/ |
+--------------------------+-------------------------------------+
8 rows in set (0.00 sec)
mysql> set names utf8mb4;
mysql> show variables like '%character_set%';
+--------------------------+-------------------------------------+
| Variable_name            | Value                               |
+--------------------------+-------------------------------------+
| character_set_client     | utf8mb4                             |
| character_set_connection | utf8mb4                             |
| character_set_database   | utf8mb4                             |
| character_set_filesystem | binary                              |
| character_set_results    | utf8mb4                             |
| character_set_server     | utf8mb4                             |
| character_set_system     | utf8mb3                             |
| character_sets_dir       | /usr/share/percona-server/charsets/ |
+--------------------------+-------------------------------------+
8 rows in set (0.00 sec)

Conclusion

MySQL will do the potential conversion to make sure the character_set_client, character_set_connection, and the destination character set of the column are aligned. 

We need to be careful to perform the conversion, since when MySQL converts from a character set that permits more values(larger) to a character set that permits fewer values(smaller) such as utf8mb4 -> latin1, it may cause data loss or an “ERROR 1366 (HY000) Incorrect string value” may happen.

We can make all three G3 configurations(character_set_client, character_set_connection, character_set_results) identical via the command “set names”. 

However, talking about the character set of a column which can be defined as required, you may want to rethink before defining a column as a character set different from the character_set_connection.

Leave a Reply

Your email address will not be published. Required fields are marked *

Subscribe to our Newsletter