Who said nothing to do when migrating your character set from MySQL 5.7 to 8 ?

As you know that MySQL 5.7 is going to be in its end of life this month and many businesses are upgrading to 8 ( except few ones, believe me its true) . Many cloud vendors are providing extra support and time to their customer so that they can move to 8. One of the major change in 8 is the default character type and collation. Previously the default one was latin1 and also the alias of utf8 was utf8mb3 but now it is utf8mb4. There are plenty of articles from many database service providers that says no data and table structural changes are required. Although it is true to some extent but there are some major changes, if being missed can cause your application to behave weirdly.

Lets go basic first !!

What is a character set ?

A character is a letter, let say A,B,C . Now I want to encode these letters i.e. A can be represented by the ! and B can be represented by the $. If I combine the letter and their encoding it can be called as a character set.

There are total 41 character set in 8.0.33. I wont go in detail of every character set as this is out of topic but still if you are more interested in reading about it, this is a manual. This link will also explain how utf8mb4 is different from the utf8mb3 and other character set and how man bytes does it take to store a single character.

mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.33    |
+-----------+
1 row in set (0.00 sec)
mysql> select count(1) from information_schema.CHARACTER_SETS;
+----------+
| count(1) |
+----------+
|       41 |
+----------+
1 row in set (0.00 sec)

What is a collation ?

Above I mentioned that if I replace A by ! , then I am imposing rule . This rule is called as a collation. This is a very simple type of collation. Collation can be much more complex, depending on the type of rules one have created.

In total there are 286 collations in 8.0.33. There can be multiple types of collation for one character set and one

mysql> select count(*) from COLLATIONS;
+----------+
| count(*) |
+----------+
|      286 |
+----------+
1 row in set (0.00 sec)
mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.33    |
+-----------+
1 row in set (0.00 sec)

Each collation has a property which talks about the PAD_ATTRIBUTE i.e. if a string can have trailing space as its own part (without considering it separate) or should it be treated as its additional character. It may sounds confusing now but I will clear it further as this is a main topic of this article.

How pad attribute makes the difference while upgrading ?

In many cases if we were using latin1 or utf8 (which was an alias of utf8mb3) as the character set of our tables in 5.7, most probably we are using latin1_swedish_ci or utf8mb3_general_ci as its collation type because these are the default one and rarely we do change it until unless we have some specific business needs. Now lets talk about a case wherein we have a table with character set as utf8mb3

mysql> show create table check_mb3\G;
*************************** 1. row ***************************
       Table: check_mb3
Create Table: CREATE TABLE `check_mb3` (
  `id` int DEFAULT NULL,
  `name` char(10) DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
1 row in set (0.01 sec)
mysql> insert into check_mb3 values(1,'ankit');
Query OK, 1 row affected (0.00 sec)
mysql> insert into check_mb3 values(1,'ankit ');
ERROR 1062 (23000): Duplicate entry 'ankit' for key 'check_mb3.name'
mysql> insert into check_mb3 values(2,'guga');
Query OK, 1 row affected (0.00 sec)
mysql> insert into check_mb3 values(3,'guga ');
ERROR 1062 (23000): Duplicate entry 'guga' for key 'check_mb3.name'
mysql> insert into check_mb3 values(4,'victor');
Query OK, 1 row affected (0.00 sec)
mysql> insert into check_mb3 values(5,'victor ');
ERROR 1062 (23000): Duplicate entry 'victor' for key 'check_mb3.name'

Until now, everything is fine and we are not able to insert the same string even if it has trailing spaces. Now lets try to migrate our tables to 8 without worrying about the pad spaces.

mysql> alter table check_mb3 CHARACTER SET UTF8mb4;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> alter table check_mb3 modify name char(10) CHARACTER SET utf8mb4 collate utf8mb4_0900_ai_ci;
Query OK, 3 rows affected (0.03 sec)
Records: 3  Duplicates: 0  Warnings
mysql> show create table check_mb3\G;
*************************** 1. row ***************************
       Table: check_mb3
Create Table: CREATE TABLE `check_mb3` (
  `id` int DEFAULT NULL,
  `name` char(10) CHARACTER SET utf8mb3 DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
mysql> insert into check_mb3 values(4,'victor ');
ERROR 1062 (23000): Duplicate entry 'victor' for key 'check_mb3.name'
mysql> select * from check_mb3;
+------+--------+
| id   | name   |
+------+--------+
|    1 | ankit  |
|    2 | guga   |
|    3 | victor |
+------+--------+
3 rows in set (0.00 sec)

OOPS !!, we didn’t expect this, but anyhow we are fine in this case when we are using data type as char but lets think of a use case wherein our application expect that every string is ending with a trailing space and we try to insert the data with a trailing space –

mysql> insert into check_mb3 values(7,'ankittf84 ');
Query OK, 1 row affected (0.01 sec)
mysql> select * from check_mb3 where name='ankittf84 ';
Empty set (0.00 sec)

Here our application start behaving weird. However if we do same in utf8mb3 collation, the above read will return the data.

mysql> show create table check_mb3_again;
+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table           | Create Table                                                                                                                                                                                                                               |
+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| check_mb3_again | CREATE TABLE `check_mb3_again` (
  `id` int DEFAULT NULL,
  `name` char(10) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> insert into check_mb3_again values (1,'ankit ');
Query OK, 1 row affected (0.00 sec)
mysql> select * from check_mb3_again where name='ankit ';
+------+-------+
| id   | name  |
+------+-------+
|    1 | ankit |
+------+-------+
1 row in set (0.00 sec)

But major question is how migration will be impacted when writing the data ?

Lets take another case of below table –

mysql>  CREATE TABLE check_mb3_again_withvarchar (name varchar(10)) character set utf8mb3 collate utf8mb3_general_ci ;
Query OK, 0 rows affected, 2 warnings (0.02 sec)
mysql> alter table check_mb3_again_withvarchar modify name varchar(10) unique key;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> insert into check_mb3_again_withvarchar values('ankit');
Query OK, 1 row affected (0.01 sec)
mysql> insert into check_mb3_again_withvarchar values('ankit ');
ERROR 1062 (23000): Duplicate entry 'ankit ' for key 'check_mb3_again_withvarchar.name'

So this is expected that even if application or web, sends the same string with trailing space, it will be rejected. Now its time to migrate the table to 8 using mb4 collation.

mysql> alter table check_mb3_again_withvarchar charset utf8mb4;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> alter table check_mb3_again_withvarchar modify name varchar(10) CHARACTER SET utf8mb4 collate utf8mb4_0900_ai_ci;
Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0
mysql> show create table check_mb3_again_withvarchar;
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table                       | Create Table                                                                                                                                                                                                                     |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| check_mb3_again_withvarchar | CREATE TABLE `check_mb3_again_withvarchar` (
  `name` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Ok, so we have migrated the table successfully, lets continue with our writes –

mysql> insert into check_mb3_again_withvarchar values('ankit ');
Query OK, 1 row affected (0.00 sec)
mysql> select * from check_mb3_again_withvarchar;
+--------+
| name   |
+--------+
| ankit  |
| ankit  |
+--------+
2 rows in set (0.00 sec)

OOPS !! so we fall in a trap wherein application didn’t realise and a duplicate record (of varchar data type) has been inserted successfully.

How to solve this ?

As I mentioned before that there are multiple collations being supported by one character set, we can select the one we need. By “need” I mean several factors like the type of character we need, do we need to compare the strings without spaces or with spaces.

Considering the specific case mentioned in this topic, if we are looking to opt utf8mb4 but with PAD SPACES, we can choose one from collation table. Also we can design our own collation which is not a part of this topic.



So, what is important ?

It is extremely important to understand the behaviour of your data and how does the write operations actually writing the data before planning your migration as such issues looks invisible in beginning but after few months or years, it mess up your data. So, yes deciding character set and collation is extremely important for your data.

Meet me here on linkedin.

Note – These are my own thoughts and doesnt represent the thoughts of my employer in anyway.


Redo flushing. MySQL VS Postgres and Redo in 8/9

One of my earlier post MySQL Innodb – Internal Transaction flow and its related video talks about a transaction flow which gives the internal concept of transaction flow in Innodb. In this article I am going to talk about how the Redo log ( WAL in PG ) flushing is different in MySQL and Postgres and…