February 6, 2026February 6, 2026 simplifieDB

How one space can create a nightmare for you…

As you know that MySQL 5.7 is going to be in its end of life this month (when I wrote this article) and many businesses are upgrading to 8 ( except few ones, believe me its true) . Many cloud vendors are providing extra support and time to their customer so that they can move to 8. One of the major change in 8 is the default character type and collation. Previously the default one was latin1 and also the alias of utf8 was utf8mb3 but now it is utf8mb4. There are plenty of articles from many database service providers that says no data and table structural changes are required. Although it is true to some extent but there are some major changes, if being missed can cause your application to behave weirdly.

Lets go basic first !!

What is a character set ?

A character is a letter, let say A,B,C . Now I want to encode these letters i.e. A can be represented by the ! and B can be represented by the $. If I combine the letter and their encoding it can be called as a character set.

There are total 41 character set in 8.0.33. I wont go in detail of every character set as this is out of topic but still if you are more interested in reading about it, this is a manual. This link will also explain how utf8mb4 is different from the utf8mb3 and other character set and how man bytes does it take to store a single character.

			
mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.33    |
+-----------+
1 row in set (0.00 sec)
mysql> select count(1) from information_schema.CHARACTER_SETS;
+----------+
| count(1) |
+----------+
|       41 |
+----------+
1 row in set (0.00 sec)

		

What is a collation ?

Above I mentioned that if I replace A by ! , then I am imposing rule . This rule is called as a collation. This is a very simple type of collation. Collation can be much more complex, depending on the type of rules one have created.

In total there are 286 collations in 8.0.33. There can be multiple types of collation for one character set for different languages. For example , see below

			
mysql> select * from collations where CHARACTER_SET_NAME='utf8mb3';
+-----------------------------+--------------------+-----+------------+-------------+---------+---------------+
| COLLATION_NAME              | CHARACTER_SET_NAME | ID  | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE |
+-----------------------------+--------------------+-----+------------+-------------+---------+---------------+
| utf8mb3_general_ci          | utf8mb3            |  33 | Yes        | Yes         |       1 | PAD SPACE     |
| utf8mb3_tolower_ci          | utf8mb3            |  76 |            | Yes         |       1 | PAD SPACE     |
| utf8mb3_bin                 | utf8mb3            |  83 |            | Yes         |       1 | PAD SPACE     |
| utf8mb3_unicode_ci          | utf8mb3            | 192 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_icelandic_ci        | utf8mb3            | 193 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_latvian_ci          | utf8mb3            | 194 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_romanian_ci         | utf8mb3            | 195 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_slovenian_ci        | utf8mb3            | 196 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_polish_ci           | utf8mb3            | 197 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_estonian_ci         | utf8mb3            | 198 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_spanish_ci          | utf8mb3            | 199 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_swedish_ci          | utf8mb3            | 200 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_turkish_ci          | utf8mb3            | 201 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_czech_ci            | utf8mb3            | 202 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_danish_ci           | utf8mb3            | 203 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_lithuanian_ci       | utf8mb3            | 204 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_slovak_ci           | utf8mb3            | 205 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_spanish2_ci         | utf8mb3            | 206 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_roman_ci            | utf8mb3            | 207 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_persian_ci          | utf8mb3            | 208 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_esperanto_ci        | utf8mb3            | 209 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_hungarian_ci        | utf8mb3            | 210 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_sinhala_ci          | utf8mb3            | 211 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_german2_ci          | utf8mb3            | 212 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_croatian_ci         | utf8mb3            | 213 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_unicode_520_ci      | utf8mb3            | 214 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_vietnamese_ci       | utf8mb3            | 215 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_general_mysql500_ci | utf8mb3            | 223 |            | Yes         |       1 | PAD SPACE     |
+-----------------------------+--------------------+-----+------------+-------------+---------+---------------+
28 rows in set (0.01 sec)

		

			
Total number of collations in 8 :
mysql> select count(*) from COLLATIONS;
+----------+
| count(*) |
+----------+
|      286 |
+----------+
1 row in set (0.00 sec)
mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.33    |
+-----------+
1 row in set (0.00 sec)

		

Ok, Now going back to the original topic of our blog:

In collation table above, there is a column name known as “pad attribute” . Each collation has a property which talks about the PAD_ATTRIBUTE i.e. if a string can have trailing space (a space after string) as its own part (without considering it separately) or should it be treated as its additional character. It may sounds confusing now but I will clear it further as this is a main topic of this article.

How pad attribute makes the difference while upgrading ?

In many cases if we were using latin1 or utf8 (which was an alias of utf8mb3) as the character set of our tables in 5.7, most probably we are using latin1_swedish_ci or utf8mb3_general_ci as its collation type because these are the default one and rarely we do change it until unless we have some specific business needs. Now lets talk about a case wherein we have a table with character set as utf8mb3

			
mysql> show create table check_mb3\G;
*************************** 1. row ***************************
       Table: check_mb3
Create Table: CREATE TABLE `check_mb3` (
  `id` int DEFAULT NULL,
  `name` char(10) DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
1 row in set (0.01 sec)
mysql> insert into check_mb3 values(1,'ankit');
Query OK, 1 row affected (0.00 sec)
mysql> insert into check_mb3 values(1,'ankit ');
ERROR 1062 (23000): Duplicate entry 'ankit' for key 'check_mb3.name'

		

			
mysql> insert into check_mb3 values(2,'guga');
Query OK, 1 row affected (0.00 sec)
mysql> insert into check_mb3 values(3,'guga ');
ERROR 1062 (23000): Duplicate entry 'guga' for key 'check_mb3.name'

			
mysql> insert into check_mb3 values(4,'victor');
Query OK, 1 row affected (0.00 sec)
mysql> insert into check_mb3 values(5,'victor ');
ERROR 1062 (23000): Duplicate entry 'victor' for key 'check_mb3.name'

Until now, everything went as expected and we are not able to insert the same string even if it has trailing spaces which means, trailing space has been removed when being stored on disk and thus both strings (‘ankit’ & ‘ankit ‘) are same for the user or client . Now let’s try to migrate our tables to 8 by changing the character set since utf8mb4 is default one and comes with collations which has NO PAD option.

			
mysql> alter table check_mb3 CHARACTER SET UTF8mb4;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> alter table check_mb3 modify name char(10) CHARACTER SET utf8mb4 collate utf8mb4_0900_ai_ci;
Query OK, 3 rows affected (0.03 sec)
Records: 3  Duplicates: 0  Warnings
mysql> show create table check_mb3\G;
*************************** 1. row ***************************
       Table: check_mb3
Create Table: CREATE TABLE `check_mb3` (
  `id` int DEFAULT NULL,
  `name` char(10) CHARACTER SET utf8mb4 collate utf8mb4_0900_ai_ci DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
mysql> insert into check_mb3 values(4,'victor ');
ERROR 1062 (23000): Duplicate entry 'victor' for key 'check_mb3.name'
mysql> select * from check_mb3;
+------+--------+
| id   | name   |
+------+--------+
|    1 | ankit  |
|    2 | guga   |
|    3 | victor |
+------+--------+
3 rows in set (0.00 sec)

		

OOPS !!, we didn’t expect this, but we are fine in this case when we are using data type as CHAR. This is because as per SQL standards, while storing the CHAR data, both string should be treated as same regardless of the collation but not while READING. Keep on reading further.

but lets think of a use case wherein our business logic says that
“every string should end with a trailing space”
and hence accordingly we will insert a string with a space for this test

			
mysql> insert into check_mb3 values(7,'ankittf84 ');
Query OK, 1 row affected (0.01 sec)
--------and then we read the same string with space----
mysql> select * from check_mb3 where name='ankittf84 ';
Empty set (0.00 sec)

		

Here our application has started behaving weird and user will assume that our string doesn’t exists. However if we do same in utf8mb3 collation (which was default in MySQL 5.7) the above read will return the data because while reading it treats both string as different. I repeat , while reading it treats both string as different but not while storing because the data is being stored identically, but when read, InnoDB do the comparison based on collation type i.e. trim space or consider space.

			
mysql> show create table check_mb3_again;
+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table           | Create Table                                                                                                                                                                                                                               |
+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| check_mb3_again | CREATE TABLE `check_mb3_again` (
  `id` int DEFAULT NULL,
  `name` char(10) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select COLLATION_NAME,PAD_ATTRIBUTE from information_schema.collations where COLLATION_NAME='utf8mb3_general_ci';
+--------------------+---------------+
| COLLATION_NAME     | PAD_ATTRIBUTE |
+--------------------+---------------+
| utf8mb3_general_ci | PAD SPACE     |
+--------------------+---------------+
mysql> insert into check_mb3_again values (1,'ankit ');
Query OK, 1 row affected (0.00 sec)
mysql> select * from check_mb3_again where name='ankit ';
+------+-------+
| id   | name  |
+------+-------+
|    1 | ankit |
+------+-------+
1 row in set (0.00 sec)

		

Im sure you must be thinking “why even any application will search like this” and I would be completely agree with you. But hold on, I explained this just to setup the background.

NOW, major question is how migration to 8 will corrupt the data while writing ?

Lets try to understand with an example wherein same table structure but with VARCHAR data type.

1. Keep utf8mb3 for now which is default in 5.7 :

			
mysql>  CREATE TABLE check_mb3_again_withvarchar (name varchar(10)) character set utf8mb3 collate utf8mb3_general_ci ;
Query OK, 0 rows affected, 2 warnings (0.02 sec)
mysql> alter table check_mb3_again_withvarchar modify name varchar(10) unique key;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> insert into check_mb3_again_withvarchar values('ankit');
Query OK, 1 row affected (0.01 sec)
mysql> insert into check_mb3_again_withvarchar values('ankit ');
ERROR 1062 (23000): Duplicate entry 'ankit ' for key 'check_mb3_again_withvarchar.name'

		

So this is expected that even if application or web, sends the same string with trailing space, it will be rejected.
Now its time to migrate the table to 8 using mb4 collation.

2. Change table and column to utf8mb4 for now which is default in 8:

			
mysql> alter table check_mb3_again_withvarchar charset utf8mb4;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> alter table check_mb3_again_withvarchar modify name varchar(10) CHARACTER SET utf8mb4 collate utf8mb4_0900_ai_ci;
Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0
mysql> show create table check_mb3_again_withvarchar;
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table                       | Create Table                                                                                                                                                                                                                     |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| check_mb3_again_withvarchar | CREATE TABLE `check_mb3_again_withvarchar` (
  `name` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

		

Ok, so we have migrated the table successfully, lets continue with our writes –

			
mysql> insert into check_mb3_again_withvarchar values('ankit ');
Query OK, 1 row affected (0.00 sec)
mysql> select * from check_mb3_again_withvarchar;
+--------+
| name   |
+--------+
| ankit  |
| ankit  |
+--------+
2 rows in set (0.00 sec)

		

So, now what would you say here and your data became inconsistent and you are gone into a nightmare without even realizing this and you will realize it after years or months ???

Every problem has a solution :

I learnt in my life that every problem has a solution and success is, how well you solve your problems.

Initially I mentioned that there are multiple collations being supported by one character set, we can select the one we need. By “need” I mean several factors like the type of character we need, do we need to compare the strings without spaces or with spaces.

Considering the specific case mentioned in this topic, if we are looking to opt utf8mb4 but with PAD SPACES, we can choose one from collation table.

Lets talk with an example :

			
mysql> select COLLATION_NAME,PAD_ATTRIBUTE from information_schema.collations where CHARACTER_SET_NAME='utf8mb4' and PAD_ATTRIBUTE='PAD SPACE';
+------------------------+---------------+
| COLLATION_NAME         | PAD_ATTRIBUTE |
+------------------------+---------------+
| utf8mb4_general_ci     | PAD SPACE     |
| utf8mb4_bin            | PAD SPACE     |
| utf8mb4_unicode_ci     | PAD SPACE     |
| utf8mb4_icelandic_ci   | PAD SPACE     |
| utf8mb4_latvian_ci     | PAD SPACE     |
| utf8mb4_romanian_ci    | PAD SPACE     |
| utf8mb4_slovenian_ci   | PAD SPACE     |
| utf8mb4_polish_ci      | PAD SPACE     |
| utf8mb4_estonian_ci    | PAD SPACE     |
| utf8mb4_spanish_ci     | PAD SPACE     |
| utf8mb4_swedish_ci     | PAD SPACE     |
| utf8mb4_turkish_ci     | PAD SPACE     |
| utf8mb4_czech_ci       | PAD SPACE     |
| utf8mb4_danish_ci      | PAD SPACE     |
| utf8mb4_lithuanian_ci  | PAD SPACE     |
| utf8mb4_slovak_ci      | PAD SPACE     |
| utf8mb4_spanish2_ci    | PAD SPACE     |
| utf8mb4_roman_ci       | PAD SPACE     |
| utf8mb4_persian_ci     | PAD SPACE     |
| utf8mb4_esperanto_ci   | PAD SPACE     |
| utf8mb4_hungarian_ci   | PAD SPACE     |
| utf8mb4_sinhala_ci     | PAD SPACE     |
| utf8mb4_german2_ci     | PAD SPACE     |
| utf8mb4_croatian_ci    | PAD SPACE     |
| utf8mb4_unicode_520_ci | PAD SPACE     |
| utf8mb4_vietnamese_ci  | PAD SPACE     |
+------------------------+---------------+

		

			
CREATE TABLE `check_mb4_withvarchar` (
  `id` int DEFAULT NULL,
  `name` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

		

Have you noticed why I have chosen utf8mb4_general_ci which is not a default one ? This is because, this collation comes with PAD SPACE i.e. trailing space will be considered same as WITHOUT SPACE in VARCHAR.
Now, insert some data to confirm our understanding :

			
mysql> insert into check_mb4_withvarchar values (1,'ankit');
Query OK, 1 row affected (0.01 sec)
mysql> insert into check_mb4_withvarchar values (1,'ankit ');
ERROR 1062 (23000): Duplicate entry 'ankit ' for key 'check_mb4_withvarchar.name

AND this is what I really needed to prove that how one space can corrupt your entire data while upgrading to 8.

So, what is important ?

It is extremely important to understand the behaviour of your data and how does the write operations actually writing the data before planning your migration as such issues looks invisible in beginning but after few months or years, it mess up your data. So, yes deciding character set and collation is extremely important for your data.

Meet me here on linkedin.

Note – This is fork from my previous article but on new domain and with new subject line.

Mutex & MetaData lock handling with Performance and Sys Schema in MySQL

Read about most important part of MySQL. Mutex and meta data locks

by simplifieDB September 18, 2020April 2, 2023

What’s changed in building of Innodb’s b-tree indexes in MySQL 8

This article discuss about how innodb build b tree index and how things improved with 8

by simplifieDB March 22, 2018April 18, 2023

Redo flushing. MySQL VS Postgres and Redo in 8/9

One of my earlier post MySQL Innodb – Internal Transaction flow and its related video talks about a transaction flow which gives the internal concept of transaction flow in Innodb. In this article I am going to talk about how the Redo log ( WAL in PG ) flushing is different in MySQL and Postgres and…

by simplifieDB December 17, 2020September 30, 2025

October 6, 2023February 6, 2026 simplifieDB

Who said nothing to do while migrating from MySQL 5.7 to 8 ?

			
mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.33    |
+-----------+
1 row in set (0.00 sec)
mysql> select count(1) from information_schema.CHARACTER_SETS;
+----------+
| count(1) |
+----------+
|       41 |
+----------+
1 row in set (0.00 sec)

		

			
mysql> select * from collations where CHARACTER_SET_NAME='utf8mb3';
+-----------------------------+--------------------+-----+------------+-------------+---------+---------------+
| COLLATION_NAME              | CHARACTER_SET_NAME | ID  | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE |
+-----------------------------+--------------------+-----+------------+-------------+---------+---------------+
| utf8mb3_general_ci          | utf8mb3            |  33 | Yes        | Yes         |       1 | PAD SPACE     |
| utf8mb3_tolower_ci          | utf8mb3            |  76 |            | Yes         |       1 | PAD SPACE     |
| utf8mb3_bin                 | utf8mb3            |  83 |            | Yes         |       1 | PAD SPACE     |
| utf8mb3_unicode_ci          | utf8mb3            | 192 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_icelandic_ci        | utf8mb3            | 193 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_latvian_ci          | utf8mb3            | 194 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_romanian_ci         | utf8mb3            | 195 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_slovenian_ci        | utf8mb3            | 196 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_polish_ci           | utf8mb3            | 197 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_estonian_ci         | utf8mb3            | 198 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_spanish_ci          | utf8mb3            | 199 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_swedish_ci          | utf8mb3            | 200 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_turkish_ci          | utf8mb3            | 201 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_czech_ci            | utf8mb3            | 202 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_danish_ci           | utf8mb3            | 203 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_lithuanian_ci       | utf8mb3            | 204 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_slovak_ci           | utf8mb3            | 205 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_spanish2_ci         | utf8mb3            | 206 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_roman_ci            | utf8mb3            | 207 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_persian_ci          | utf8mb3            | 208 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_esperanto_ci        | utf8mb3            | 209 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_hungarian_ci        | utf8mb3            | 210 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_sinhala_ci          | utf8mb3            | 211 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_german2_ci          | utf8mb3            | 212 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_croatian_ci         | utf8mb3            | 213 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_unicode_520_ci      | utf8mb3            | 214 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_vietnamese_ci       | utf8mb3            | 215 |            | Yes         |       8 | PAD SPACE     |
| utf8mb3_general_mysql500_ci | utf8mb3            | 223 |            | Yes         |       1 | PAD SPACE     |
+-----------------------------+--------------------+-----+------------+-------------+---------+---------------+
28 rows in set (0.01 sec)

		

			
mysql> select count(*) from COLLATIONS;
+----------+
| count(*) |
+----------+
|      286 |
+----------+
1 row in set (0.00 sec)
mysql> select version();
+-----------+
| version() |
+-----------+
| 8.0.33    |
+-----------+
1 row in set (0.00 sec)

		

Ok, Now going back to the original topic of our blog:

			
mysql> show create table check_mb3\G;
*************************** 1. row ***************************
       Table: check_mb3
Create Table: CREATE TABLE `check_mb3` (
  `id` int DEFAULT NULL,
  `name` char(10) DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3
1 row in set (0.01 sec)
mysql> insert into check_mb3 values(1,'ankit');
Query OK, 1 row affected (0.00 sec)
mysql> insert into check_mb3 values(1,'ankit ');
ERROR 1062 (23000): Duplicate entry 'ankit' for key 'check_mb3.name'

		

			
mysql> insert into check_mb3 values(2,'guga');
Query OK, 1 row affected (0.00 sec)
mysql> insert into check_mb3 values(3,'guga ');
ERROR 1062 (23000): Duplicate entry 'guga' for key 'check_mb3.name'

			
mysql> insert into check_mb3 values(4,'victor');
Query OK, 1 row affected (0.00 sec)
mysql> insert into check_mb3 values(5,'victor ');
ERROR 1062 (23000): Duplicate entry 'victor' for key 'check_mb3.name'

			
mysql> select count(*) from information_schema.collations where CHARACTER_SET_NAME='utf8mb4' and PAD_ATTRIBUTE='NO PAD';
+----------+
| count(*) |
+----------+
|       63 |
+----------+
1 row in set (0.00 sec)

		

			
mysql> alter table check_mb3 CHARACTER SET UTF8mb4;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> alter table check_mb3 modify name char(10) CHARACTER SET utf8mb4 collate utf8mb4_0900_ai_ci;
Query OK, 3 rows affected (0.03 sec)
Records: 3  Duplicates: 0  Warnings
mysql> show create table check_mb3\G;
*************************** 1. row ***************************
       Table: check_mb3
Create Table: CREATE TABLE `check_mb3` (
  `id` int DEFAULT NULL,
  `name` char(10) CHARACTER SET utf8mb3 DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
mysql> insert into check_mb3 values(4,'victor ');
ERROR 1062 (23000): Duplicate entry 'victor' for key 'check_mb3.name'
mysql> select * from check_mb3;
+------+--------+
| id   | name   |
+------+--------+
|    1 | ankit  |
|    2 | guga   |
|    3 | victor |
+------+--------+
3 rows in set (0.00 sec)

		

OOPS !!, we didn’t expect this, but anyhow we are fine in this case when we are using data type as CHAR. This is because as per SQL standards, while storing the CHAR data, both string should be treated as same regardless of the collation but not while READING. Keep on reading further.

but lets think of a use case wherein our business logic says that
“every string should end with a trailing space”
and hence accordingly we will insert a string with a space for this test

			
mysql> insert into check_mb3 values(7,'ankittf84 ');
Query OK, 1 row affected (0.01 sec)
--------and then we read the same string with space----
mysql> select * from check_mb3 where name='ankittf84 ';
Empty set (0.00 sec)

		

Here our application has started behaving weird and user will assume that our string doesn’t exists. However if we do same in utf8mb3 collation (which was default in MySQL 5.7 and with only “PAD SPACE” option, the above read will return the data because while reading it treats both string as different. I repeat , while reading it treats both string as different but not while storing because the data is being stored identically, but when read, InnoDB do the comparison based on collation type i.e. trim space or consider space.

			
mysql> show create table check_mb3_again;
+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table           | Create Table                                                                                                                                                                                                                               |
+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| check_mb3_again | CREATE TABLE `check_mb3_again` (
  `id` int DEFAULT NULL,
  `name` char(10) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select COLLATION_NAME,PAD_ATTRIBUTE from information_schema.collations where COLLATION_NAME='utf8mb3_general_ci';
+--------------------+---------------+
| COLLATION_NAME     | PAD_ATTRIBUTE |
+--------------------+---------------+
| utf8mb3_general_ci | PAD SPACE     |
+--------------------+---------------+
mysql> insert into check_mb3_again values (1,'ankit ');
Query OK, 1 row affected (0.00 sec)
mysql> select * from check_mb3_again where name='ankit ';
+------+-------+
| id   | name  |
+------+-------+
|    1 | ankit |
+------+-------+
1 row in set (0.00 sec)

		

Im sure you must be thinking “why even any application will search like this” and I would be completely agree with you. But hold on, I explained this just to setup the background.

			
mysql>  CREATE TABLE check_mb3_again_withvarchar (name varchar(10)) character set utf8mb3 collate utf8mb3_general_ci ;
Query OK, 0 rows affected, 2 warnings (0.02 sec)
mysql> alter table check_mb3_again_withvarchar modify name varchar(10) unique key;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> insert into check_mb3_again_withvarchar values('ankit');
Query OK, 1 row affected (0.01 sec)
mysql> insert into check_mb3_again_withvarchar values('ankit ');
ERROR 1062 (23000): Duplicate entry 'ankit ' for key 'check_mb3_again_withvarchar.name'

		

			
mysql> alter table check_mb3_again_withvarchar charset utf8mb4;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> alter table check_mb3_again_withvarchar modify name varchar(10) CHARACTER SET utf8mb4 collate utf8mb4_0900_ai_ci;
Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0
mysql> show create table check_mb3_again_withvarchar;
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table                       | Create Table                                                                                                                                                                                                                     |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| check_mb3_again_withvarchar | CREATE TABLE `check_mb3_again_withvarchar` (
  `name` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+-----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

		

Ok, so we have migrated the table successfully, lets continue with our writes –

			
mysql> insert into check_mb3_again_withvarchar values('ankit ');
Query OK, 1 row affected (0.00 sec)
mysql> select * from check_mb3_again_withvarchar;
+--------+
| name   |
+--------+
| ankit  |
| ankit  |
+--------+
2 rows in set (0.00 sec)

		

So, now what would you say here and your data became inconsistent and you are gone into a nightmare without even realizing this and you will realize it after years or months ???

Every problem has a solution :

I learnt in my life that every problem has a solution and success is, how well you solve your problems.

Considering the specific case mentioned in this topic, if we are looking to opt utf8mb4 but with PAD SPACES, we can choose one from collation table.

Lets talk with an example :

			
mysql> select COLLATION_NAME,PAD_ATTRIBUTE from information_schema.collations where CHARACTER_SET_NAME='utf8mb4' and PAD_ATTRIBUTE='PAD SPACE';
+------------------------+---------------+
| COLLATION_NAME         | PAD_ATTRIBUTE |
+------------------------+---------------+
| utf8mb4_general_ci     | PAD SPACE     |
| utf8mb4_bin            | PAD SPACE     |
| utf8mb4_unicode_ci     | PAD SPACE     |
| utf8mb4_icelandic_ci   | PAD SPACE     |
| utf8mb4_latvian_ci     | PAD SPACE     |
| utf8mb4_romanian_ci    | PAD SPACE     |
| utf8mb4_slovenian_ci   | PAD SPACE     |
| utf8mb4_polish_ci      | PAD SPACE     |
| utf8mb4_estonian_ci    | PAD SPACE     |
| utf8mb4_spanish_ci     | PAD SPACE     |
| utf8mb4_swedish_ci     | PAD SPACE     |
| utf8mb4_turkish_ci     | PAD SPACE     |
| utf8mb4_czech_ci       | PAD SPACE     |
| utf8mb4_danish_ci      | PAD SPACE     |
| utf8mb4_lithuanian_ci  | PAD SPACE     |
| utf8mb4_slovak_ci      | PAD SPACE     |
| utf8mb4_spanish2_ci    | PAD SPACE     |
| utf8mb4_roman_ci       | PAD SPACE     |
| utf8mb4_persian_ci     | PAD SPACE     |
| utf8mb4_esperanto_ci   | PAD SPACE     |
| utf8mb4_hungarian_ci   | PAD SPACE     |
| utf8mb4_sinhala_ci     | PAD SPACE     |
| utf8mb4_german2_ci     | PAD SPACE     |
| utf8mb4_croatian_ci    | PAD SPACE     |
| utf8mb4_unicode_520_ci | PAD SPACE     |
| utf8mb4_vietnamese_ci  | PAD SPACE     |
+------------------------+---------------+

		

			
CREATE TABLE `check_mb4_withvarchar` (
  `id` int DEFAULT NULL,
  `name` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
  UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

		

			
mysql> insert into check_mb4_withvarchar values (1,'ankit');
Query OK, 1 row affected (0.01 sec)
mysql> insert into check_mb4_withvarchar values (1,'ankit ');
ERROR 1062 (23000): Duplicate entry 'ankit ' for key 'check_mb4_withvarchar.name

AND this is what I really needed to prove that how one space can corrupt your entire data while upgrading to 8.

So, what is important ?

It is extremely important to understand the behaviour of your data and how does the write operations actually writing the data before planning your migration as such issues looks invisible in beginning but after few months or years, it mess up your data. So, yes deciding character set and collation is extremely important for your data.

Meet me here on linkedin.

Note – These are my own thoughts and doesnt represent the thoughts of my employer in anyway.

Mutex & MetaData lock handling with Performance and Sys Schema in MySQL

Read about most important part of MySQL. Mutex and meta data locks

by simplifieDB September 18, 2020April 2, 2023

What’s changed in building of Innodb’s b-tree indexes in MySQL 8

This article discuss about how innodb build b tree index and how things improved with 8

by simplifieDB March 22, 2018April 18, 2023

Redo flushing. MySQL VS Postgres and Redo in 8/9

by simplifieDB December 17, 2020September 30, 2025

March 22, 2018April 18, 2023 simplifieDB

What’s changed in building of Innodb’s b-tree indexes in MySQL 8

IN INNODB , EVERYTHING IS DESIGNED IN B-TREE. When I say everything I mean primary key, Indexes, secondary index.

But have we ever thought how MySQL creates an index ? Obviously I am not saying by syntax “create index” or “alter table”. Let’s begin to start from an example by what I mean. Say, I have to create an index on table T1 having column ID with million of records in it. In prior version, mysql uses insert APIs and then index entries were inserted into the B-tree one by one. This is a traditional method to insert data in b-tree .

In the old method :

First b-tree CURSOR (btr0btr.c) gets open .
Search to find the correct position begins.
If there is a space in page, then insert this entry into the b tree page and this has been done in optimistic locking .
If there is no space in page , then pessimistic insert would be performed .For this , b-tree cursor (btr0btr.c) will need to open again and split the page and merging will be done.

This is a top down approach and it is costly because of multiple opening of b-tree cursor, searching, splitting and merging of b-tree nodes

Bottom up approach :

Lets divide new approach in 3 phase to understand :

Run phase:

a. In the run phase, Clustered or Primary index has been scanned or read, post which these entries get added to the buffer , known as sort buffer (innodb_sort_buffer_size) with default size of 262144 bytes. Until this, sorting hasn’t been done. So, you can tune this variable according to your data size to speed up the index creation. Once this buffer becomes full , all the records will be sorted and put it in a temporary file (which is explained further in the second part of this process). In addition to sort buffer size, max_sort_length also needs to be modified whenever we are making changes in sort buffer.
Remember here, I am talking about the sort_buffer_size , not the innodb_sort_buffer_size. However, when dealing with innodb engine and while creation of the secondary index online, innodb_sort_buffer_size should be consider

mysql> show global variables like '%sort_buffer%';

+-------------------------+---------+

| Variable_name           | Value   |

+-------------------------+---------+

| innodb_sort_buffer_size | 1048576 |

| myisam_sort_buffer_size | 8388608 |

| sort_buffer_size        | 262144  |

+-------------------------+---------+

3 rows in set (0.01 sec)

mysql>

2. Second phase :

Multiple temporary files, which got generated in run phase will get merged in this phase using MERGE SORT algorithm. At this stage, multiple temporary files got merged with each other. The amount of merge can be seen under the merge_passes status shown below. This status tells how many temporary files are getting merged and if this number is high enough, then it simply means that the sort buffer size is low and is a good time to tweak it.

mysql> show global status like '%passes%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Sort_merge_passes | 0     |
+-------------------+-------+
1 row in set (0.11 sec)

3. Third phase :

Once all file got merged , all the sorted records from second phase will be inserted into the b-tree nodes.

What’s next ?

Until this point, all these operations i.e. sort and merge is single threaded but starting from MySQL 8, these operations are now multi threaded. Think about a situation wherein you are trying to add a secondary index in your existing table using online DDL approach and you need to get it done faster or you are trying to count the number of records in a heavy table.

Now, we have two variables :

1. innodb_parallel_read_threads – Remember this is only applicable for the clustered index or PK.
2. innodb_ddl_threads – Remember this is only for the secondary index.
3. innodb_ddl_buffer_size

Take an example below to understand, how innodb_parallel_read_threads is making the clustered index read more faster as we increase the number of threads.

mysql> set local innodb_parallel_read_threads=1;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(1) from fact;
+----------+
| count(1) |
+----------+
| 14531072 |
+----------+
1 row in set (2.12 sec)

mysql> set innodb_parallel_read_threads=4;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(1) from fact;
+----------+
| count(1) |
+----------+
| 14531072 |
+----------+
1 row in set (0.92 sec)

mysql> set innodb_parallel_read_threads=1;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(1) from fact;
+----------+
| count(1) |
+----------+
| 14531072 |
+----------+
1 row in set (2.09 sec)

However I don’t see any influential improvement in check tables.

mysql> show variables like '%parallel%';
+------------------------------+----------+
| Variable_name                | Value    |
+------------------------------+----------+
| innodb_parallel_read_threads | 1        |
| replica_parallel_type        | DATABASE |
| replica_parallel_workers     | 0        |
| slave_parallel_type          | DATABASE |
| slave_parallel_workers       | 0        |
+------------------------------+----------+
5 rows in set (0.01 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (17.53 sec)

mysql> set innodb_parallel_read_threads=4;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.23 sec)

mysql> set innodb_parallel_read_threads=8;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.48 sec)

mysql> set innodb_parallel_read_threads=12;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.35 sec)

mysql> show variables like '%parallel%';
+------------------------------+----------+
| Variable_name                | Value    |
+------------------------------+----------+
| innodb_parallel_read_threads | 12       |
| replica_parallel_type        | DATABASE |
| replica_parallel_workers     | 0        |
| slave_parallel_type          | DATABASE |
| slave_parallel_workers       | 0        |
+------------------------------+----------+
5 rows in set (0.00 sec)

mysql> set innodb_parallel_read_threads=20;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.34 sec)

mysql>

This phase is while scanning the clustered index. Similarly if you are using the 8.0.27, innodb_ddl_threads can be used to speed up the sort and merge operations as explained one. This defines the number of threads to perform sort and merge operation. Here is the below test I performed :

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (19.41 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> set innodb_ddl_buffer_size=10485760;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_ddl_buffer_size=104857600;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_ddl_threads=10;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table fact drop index idx_dim2;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (16.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> set innodb_ddl_threads=12;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_parallel_read_threads=12;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table fact drop index idx_dim2;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (15.89 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> set innodb_ddl_buffer_size=1048576000;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_parallel_read_threads=16;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_ddl_threads=16;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table fact drop index idx_dim2;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (17.73 sec)
Records: 0  Duplicates: 0  Warnings: 0

With the various combinations, you can see how the performance varies.

What other things needs to consider when creating index in MySQL ?

Now , question is how much space can we reserved for index future growth per page. To achieve this, we have **innodb_fill_factor** option which will help in reserving the space for future b-tree index growth. For an example, if we have set this value to 70, then 30 percent of the space of page will get reserve for index growth during sorted index build.

This setting help us in managing server memory . Sometimes, when data becomes fragmented , setting this option can help in re-gaining it.

Conclusion

This article helps us to understand the core process behind how innodb creates an index and how with the time, things has been improved in 8. Here I have used 2 versions of MySQL to show the results. One is 8.0.26 and 8.0.32 on Mac OS.

	How one space can co… on Who said nothing to do while m…
	What type of databas… on Some very basics of LSM t…
	What type of databas… on Journey from Relational to Dis…
	Some very basics of… on Bloom Filters
	Some very basics of… on Journey from Relational to Dis…

	How one space can co… on Who said nothing to do while m…
	What type of databas… on Some very basics of LSM t…
	What type of databas… on Journey from Relational to Dis…
	Some very basics of… on Bloom Filters
	Some very basics of… on Journey from Relational to Dis…

SimplifieDB

its all about open source databases

performance

How one space can create a nightmare for you…

Mutex & MetaData lock handling with Performance and Sys Schema in MySQL

What’s changed in building of Innodb’s b-tree indexes in MySQL 8

Redo flushing. MySQL VS Postgres and Redo in 8/9

Who said nothing to do while migrating from MySQL 5.7 to 8 ?

Mutex & MetaData lock handling with Performance and Sys Schema in MySQL

What’s changed in building of Innodb’s b-tree indexes in MySQL 8

Redo flushing. MySQL VS Postgres and Redo in 8/9

What’s changed in building of Innodb’s b-tree indexes in MySQL 8

In the old method :

Bottom up approach :

Other articles

How one space can create a nightmare for you…

tpc-c Benchmarking of MySQL 8.0.44

What type of databases are well suited for LLM ? part 1

Some very basics of LSM tree….

Journey from Relational to Distributed databases