Some very basics of LSM tree….

Basics

This article is in continuation of my previous article on my journey from relational to distributed database. Also I talked about probabilistic data structure here.

While discussing distributed databases with one of my colleague, I realised that we often miss focusing on basics and then I decided to publish this article in most simplest form. However it’s my long pending article which I already wrote few years ago but never published it.

Let’s talk about LSM tree.

In this type of file system, data will be keep on appending to the existing file i.e. it will not overwrite the data ( like we do with b-trees). Data first will be written in the memory ( known as mem tables ) only upto a defined threshold say 10M and once it reaches the limit, the data will be flushed.

LSM uses the concept of LFS which just append the changes (update/insert/delete) i.e. it won’t replace or change the existing record making it as a good option for heavy write work loads in database scenario.

Log structured means that I am writing some logs without modifying the related previous data and just keep on writing one after the other.

Merge is an algorithm to merge the data which is managing this log data.

Tree means different levels of the data i.e. storing the data in a hierarchical order. We can think of our computer hardisk in which we have partitioned the data into different level . The more upper level (having less data) is faster to access and as we go more deeper i.e. lower level is time consuming task (containing more data )

LSM works on different levels which means it maintains data at different levels. Lets try to understand by referring below points.

1st stage and later in writing

1st stage is the memory stage wherein we are having memtables or the memory area ( have explained in later section ). Here data needs to be in the sorted order as per the primary key defined in the table

L0 – level at the storage which contains the data being flushed from the memtable to a stable file in a sorted order. This level contains various sstables ( i.e sorted string tables and known as SST ) and in this level we may experience some duplicate keys (for example a,b,c in s1 table and a,b,c,d is s2 table wherein a,b,c are duplicate keys) . This duplication occurs because of the plain dump received from the memtable and any duplicate data will be deleted at the later stage as a background task.

The SSTable files at Level 1 and above are organized to contain only non-overlapping keys by design.

Have you also heard the word COMPACTION ? let’s understand, what’s that .

COMPACTION is a process wherein SST from level 0 get further combined to form lesser number of SST by removing the overlapping keys/deleted data/ data which has been changed. Obviously this process consume CPU.

What about read operation in LSM tree ?

For read purpose LSM uses binary search algorithm ( which i’m not going to cover in detail as it would be out of context). If we don’t run compaction, it will eventually be impacting the read performance as you can see that the seek will have to go through multiple levels of SSTs.

Did I mention about memtable ?

MemTable is an in-memory data structure holding data before they are flushed to SST files (on disk). It serves both read and write – new writes always insert data to memtable, and reads has to query memtable before reading from SST files, because data in memtable is newer. Once a memtable is full, it becomes immutable and replaced by a new memtable. A background thread will flush the content of the memtable into a SST file, after which the memtable can be destroyed

What all databases uses this ?

Databases like MyRocks, TiDB, Cassandra often uses LSM tree.

Im keeping this article short and concise without writing much more detail as purpose for this article is to make user understand the basics. Writing more content will defeat its purpose.

What’s changed in building of Innodb’s b-tree indexes in MySQL 8

IN INNODB , EVERYTHING IS DESIGNED IN B-TREE.  When I say everything I mean primary key, Indexes, secondary index.

But have we ever thought how MySQL creates an index ? Obviously I am not saying by syntax “create index” or “alter table”. Let’s begin to start from an example by what I mean. Say, I have to create an index on table T1 having column ID with million of records in it. In prior version, mysql uses insert APIs and then index entries were inserted into the B-tree one by one. This is a traditional method to insert data in b-tree .

In the old method :

  1. First b-tree CURSOR (btr0btr.c) gets open .
  2. Search to find the correct position begins.
  3. If there is a space in page, then insert this entry into the b tree page and this has been done in optimistic locking.
  4. If there is no space in page , then pessimistic insert would be performed .For this , b-tree cursor (btr0btr.c) will need to open again and split the page and merging will be done.

This is a top down approach and it is costly because of multiple opening of b-tree cursor, searching, splitting and merging of b-tree nodes

Bottom up approach :

Lets divide new approach in 3 phase to understand :

  1. Run phase:

a.  In the run phase, Clustered or Primary index has been scanned or read, post which these entries get added to the buffer , known as sort buffer (innodb_sort_buffer_size) with default size of 262144 bytes. Until this, sorting hasn’t been done. So, you can tune this variable according to your data size to speed up the index creation. Once this buffer becomes full , all the records will be sorted and put it in a temporary file (which is explained further in the second part of this process). In addition to sort buffer size, max_sort_length also needs to be modified whenever we are making changes in sort buffer.
Remember here, I am talking about the sort_buffer_size , not the innodb_sort_buffer_size. However, when dealing with innodb engine and while creation of the secondary index online, innodb_sort_buffer_size should be consider

mysql> show global variables like '%sort_buffer%';

+-------------------------+---------+

| Variable_name           | Value   |

+-------------------------+---------+

| innodb_sort_buffer_size | 1048576 |

| myisam_sort_buffer_size | 8388608 |

| sort_buffer_size        | 262144  |

+-------------------------+---------+

3 rows in set (0.01 sec)

mysql>

2. Second phase :

Multiple temporary files, which got generated in run phase will get merged in this phase using MERGE SORT algorithm. At this stage, multiple temporary files got merged with each other. The amount of merge can be seen under the merge_passes status shown below. This status tells how many temporary files are getting merged and if this number is high enough, then it simply means that the sort buffer size is low and is a good time to tweak it.

mysql> show global status like '%passes%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Sort_merge_passes | 0     |
+-------------------+-------+
1 row in set (0.11 sec)

3. Third phase :

Once all file got merged , all the sorted records from second phase will be inserted into the b-tree nodes.


What’s next ?

Until this point, all these operations i.e. sort and merge is single threaded but starting from MySQL 8, these operations are now multi threaded. Think about a situation wherein you are trying to add a secondary index in your existing table using online DDL approach and you need to get it done faster or you are trying to count the number of records in a heavy table.

Now, we have two variables :

1. innodb_parallel_read_threads – Remember this is only applicable for the clustered index or PK.
2. innodb_ddl_threads – Remember this is only for the secondary index.
3. innodb_ddl_buffer_size

Take an example below to understand, how innodb_parallel_read_threads is making the clustered index read more faster as we increase the number of threads.

mysql> set local innodb_parallel_read_threads=1;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(1) from fact;
+----------+
| count(1) |
+----------+
| 14531072 |
+----------+
1 row in set (2.12 sec)

mysql> set innodb_parallel_read_threads=4;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(1) from fact;
+----------+
| count(1) |
+----------+
| 14531072 |
+----------+
1 row in set (0.92 sec)

mysql> set innodb_parallel_read_threads=1;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(1) from fact;
+----------+
| count(1) |
+----------+
| 14531072 |
+----------+
1 row in set (2.09 sec)

However I don’t see any influential improvement in check tables.

mysql> show variables like '%parallel%';
+------------------------------+----------+
| Variable_name                | Value    |
+------------------------------+----------+
| innodb_parallel_read_threads | 1        |
| replica_parallel_type        | DATABASE |
| replica_parallel_workers     | 0        |
| slave_parallel_type          | DATABASE |
| slave_parallel_workers       | 0        |
+------------------------------+----------+
5 rows in set (0.01 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (17.53 sec)

mysql> set innodb_parallel_read_threads=4;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.23 sec)

mysql> set innodb_parallel_read_threads=8;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.48 sec)

mysql> set innodb_parallel_read_threads=12;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.35 sec)

mysql> show variables like '%parallel%';
+------------------------------+----------+
| Variable_name                | Value    |
+------------------------------+----------+
| innodb_parallel_read_threads | 12       |
| replica_parallel_type        | DATABASE |
| replica_parallel_workers     | 0        |
| slave_parallel_type          | DATABASE |
| slave_parallel_workers       | 0        |
+------------------------------+----------+
5 rows in set (0.00 sec)

mysql> set innodb_parallel_read_threads=20;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.34 sec)

mysql> 

This phase is while scanning the clustered index. Similarly if you are using the 8.0.27, innodb_ddl_threads can be used to speed up the sort and merge operations as explained one. This defines the number of threads to perform sort and merge operation. Here is the below test I performed :

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (19.41 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> set innodb_ddl_buffer_size=10485760;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_ddl_buffer_size=104857600;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_ddl_threads=10;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table fact drop index idx_dim2;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (16.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> set innodb_ddl_threads=12;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_parallel_read_threads=12;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table fact drop index idx_dim2;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (15.89 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> set innodb_ddl_buffer_size=1048576000;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_parallel_read_threads=16;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_ddl_threads=16;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table fact drop index idx_dim2;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (17.73 sec)
Records: 0  Duplicates: 0  Warnings: 0

With the various combinations, you can see how the performance varies.


What other things needs to consider when creating index in MySQL ?

Now , question is how much space can we reserved for index future growth per page. To achieve this, we have **innodb_fill_factor** option which will help in reserving the space for future b-tree index growth. For an example, if we have set this value to 70, then 30 percent of the space of page will get reserve for index growth during sorted index build.

This setting help us in managing server memory . Sometimes, when data becomes fragmented , setting this option can help in re-gaining it.

Conclusion

This article helps us to understand the core process behind how innodb creates an index and how with the time, things has been improved in 8. Here I have used 2 versions of MySQL to show the results. One is 8.0.26 and 8.0.32 on Mac OS.

Other articles

How one space can create a nightmare for you…

As you know that MySQL 5.7 is going to be in its end of life this month (when I wrote this article) and many businesses are upgrading to 8 ( except few ones, believe me its true) . Many cloud vendors are providing extra support and time to their customer so that they can move…

tpc-c Benchmarking of MySQL 8.0.44

This setup is on my local machine for mac with 8.0.44. M2 chip, memory 16G High level Results : total transaction executed in 10 min – 247298Transaction per minute – 24729Time taken on average for every case – 0-26ms ( you have to do this calculation , testing wont tell you this. I have done…

Some very basics of LSM tree….

This article is in continuation of my previous article on my journey from relational to distributed database. Also I talked about probabilistic data structure here. While discussing distributed databases with one of my colleague, I realised that we often miss focusing on basics and then I decided to publish this article in most simplest form.…