Some very basics of the LSM tree…

Basics

This article is a continuation of my previous article on my journey from relational to distributed databases. I have also talked about probabilistic data structures here.

While discussing distributed databases with one of my colleagues, I realised that we often skip the basics, so I decided to publish this article in its simplest form. It is actually a long-pending article that I wrote a few years ago but never published.

Let’s talk about the LSM tree.

In this type of storage structure, data keeps getting appended to existing files, i.e. existing data is not overwritten in place (as we do with B-trees). Data is first written in memory (into structures known as memtables) only up to a defined threshold, say 10 MB, and once that limit is reached, the data is flushed to disk.
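
As a rough sketch of this write path (in Python, with made-up names and a tiny 4-entry threshold standing in for the real size limit), writes land in a sorted in-memory buffer and are dumped out once the threshold is hit. This only illustrates the idea; it is not how any real engine implements it.

import bisect

FLUSH_THRESHOLD = 4          # stand-in for the real size limit (e.g. 10 MB)
sstables = []                # flushed, immutable sorted files (here: plain lists)

class MemTable:
    """Sorted in-memory buffer; writes only touch memory, never rewrite disk."""
    def __init__(self):
        self.keys, self.values = [], []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value          # newer version wins in memory
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def full(self):
        return len(self.keys) >= FLUSH_THRESHOLD

def write(memtable, key, value):
    memtable.put(key, value)
    if memtable.full():
        # flush: dump the sorted contents as a new SSTable and start fresh
        sstables.append(list(zip(memtable.keys, memtable.values)))
        return MemTable()
    return memtable

mt = MemTable()
for k, v in [("b", 1), ("a", 2), ("d", 3), ("c", 4), ("e", 5)]:
    mt = write(mt, k, v)
print(sstables)   # [[('a', 2), ('b', 1), ('c', 4), ('d', 3)]] -- one flushed, sorted SSTable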

LSM borrows the concept of a log-structured file system (LFS), which simply appends the changes (updates/inserts/deletes), i.e. it won’t replace or modify the existing record, making it a good option for write-heavy database workloads.

Log-structured means that I am writing records like log entries, without modifying the previously written data, just appending one entry after the other.

Merge refers to the algorithm that merges and manages this log data.

Tree means different levels of data, i.e. the data is stored in a hierarchical order. Think of a computer hard disk partitioned into different levels: the upper levels (holding less data) are faster to access, while the deeper, lower levels (holding more data) are slower to reach.

LSM works on different levels, which means it maintains data at different levels. Let’s try to understand this by referring to the points below.

First stage and later stages in writing

The first stage is the memory stage, where we have memtables, the in-memory area (explained in a later section). Here the data needs to be kept in sorted order as per the primary key defined on the table.

L0 – the level on storage that contains the data flushed from the memtable to a stable file in sorted order. This level contains various SSTables (i.e. sorted string tables, also known as SSTs), and at this level we may see some duplicate keys (for example, keys a, b, c in table s1 and a, b, c, d in table s2, where a, b, c are the duplicates). This duplication occurs because each file is a plain dump received from the memtable; the duplicate data is removed at a later stage as a background task.
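
To see why this happens, here is a tiny sketch (Python, with the s1/s2 contents made up) of two successive memtable dumps whose key ranges overlap in L0:

# Two successive memtable flushes, dumped as-is to L0 (hypothetical contents).
s1 = [("a", 1), ("b", 1), ("c", 1)]            # first flush
s2 = [("a", 9), ("b", 9), ("c", 9), ("d", 9)]  # later flush: a, b, c written again

l0 = [s1, s2]
overlap = {k for k, _ in s1} & {k for k, _ in s2}
print(overlap)   # duplicate keys across L0 files: {'a', 'b', 'c'} (set order may vary)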

The SSTable files at Level 1 and above are organized to contain only non-overlapping keys by design.

Have you also heard the word COMPACTION? Let’s understand what that is.

COMPACTION is a process wherein SSTs from level 0 are combined to form a smaller number of SSTs by removing overlapping keys, deleted data, and data that has since been changed. Obviously, this process consumes CPU.
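
Here is a minimal compaction sketch (again Python, with a hypothetical TOMBSTONE marker for deletes): two overlapping files are merged into one, the newer version of each key wins, and deleted keys are dropped.

TOMBSTONE = object()   # marker used in this sketch for deleted keys

def compact(older, newer):
    """Merge two sorted SSTables into one: the newer version of a key wins,
    and deleted keys (tombstones) are dropped from the output."""
    merged = dict(older)
    merged.update(newer)              # the newer file overrides overlapping keys
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)

s1 = [("a", 1), ("b", 1), ("c", 1)]
s2 = [("a", 9), ("b", TOMBSTONE), ("d", 9)]     # b was deleted after s1 was written
print(compact(s1, s2))   # [('a', 9), ('c', 1), ('d', 9)] -> fewer files, no duplicates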

What about read operations in an LSM tree?

For reads, LSM uses the binary search algorithm (which I’m not going to cover in detail, as it would be out of context). If we don’t run compaction, read performance will eventually suffer, because a lookup may have to seek through multiple levels of SSTs.
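
A rough sketch of that read path (Python, names are illustrative): check the memtable first, then walk the SSTables from newest to oldest, binary-searching each sorted file.

import bisect

def get(key, memtable, sstables):
    """Read path sketch: memtable first (newest data), then SSTables from
    newest to oldest, binary-searching each sorted file for the key."""
    if key in memtable:                      # memtable modeled as a dict here
        return memtable[key]
    for table in reversed(sstables):         # each table is a sorted list of (key, value)
        keys = [k for k, _ in table]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return table[i][1]
    return None                              # key not found at any level

memtable = {"x": 42}
sstables = [[("a", 1), ("x", 7)], [("b", 2)]]
print(get("x", memtable, sstables))   # 42 -> served from the memtable, not the older SSTable
print(get("a", memtable, sstables))   # 1  -> found by searching the SSTables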

Did I mention the memtable?

A MemTable is an in-memory data structure holding data before it is flushed to SST files (on disk). It serves both reads and writes: new writes always insert data into the memtable, and reads have to query the memtable before reading from SST files, because data in the memtable is newer. Once a memtable is full, it becomes immutable and is replaced by a new memtable. A background thread then flushes the contents of the immutable memtable into an SST file, after which that memtable can be destroyed.
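
The rotation described above could be sketched like this (Python; the class and method names are made up for illustration): the full memtable is frozen, a fresh one takes new writes, and a background thread flushes the frozen one.

import threading

class MemTableSet:
    """Sketch only: active memtable fills up, becomes immutable,
    and a background thread flushes it to an SST file."""
    def __init__(self):
        self.active = {}
        self.immutable = None
        self.sst_files = []

    def rotate_and_flush(self):
        self.immutable, self.active = self.active, {}    # freeze the full memtable
        t = threading.Thread(target=self._flush)         # flush happens in the background
        t.start()
        return t

    def _flush(self):
        self.sst_files.append(sorted(self.immutable.items()))
        self.immutable = None                             # the flushed memtable can be destroyed

ms = MemTableSet()
ms.active.update({"b": 2, "a": 1})
ms.rotate_and_flush().join()
print(ms.sst_files)   # [[('a', 1), ('b', 2)]]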

Which databases use this?

Databases like MyRocks, TiDB, and Cassandra use LSM trees.

I’m keeping this article short and concise without going into much more detail, as its purpose is to help the reader understand the basics. Writing more would defeat that purpose.

Journey from Relational to Distributed databases

It has been a while since I wrote about the practical database experience I gained over the last year. I call it a “Journey” because things change when you are accustomed to relational databases (MySQL, PG) and start adopting distributed databases (like TiDB, ScyllaDB), for whatever reason. We need a basic understanding when doing this migration. However, this article won’t cover the details of these two, because there are already plenty of articles out there that help in understanding the differences and use cases of both.

You already know relational databases

binary digits here represent the physical storage

So you already know a few things about them:

  • The entire data set is stored on a single machine unless you are using sharding.
  • You replicate data using binary logs, WAL (redo), archiving, or tools that can read these files.
  • Machines do not talk to each other if one in the cluster goes down.
  • There is no way of tracking consistency across nodes (except by relying on semi-synchronous replication).
  • Data is stored in a B-tree storage format.

You might know about distributed databases

You might already know these pointers:

  • Data is distributed across multiple machines rather than residing on a single machine.
  • In the above figure, tables A1 and A11 are subparts of the same table A, where A1 holds some of the data and A11 holds the rest. The same holds for tables B and C.
  • Every table has at least 2 copies on different machines; the number of copies is called the replication factor.
  • This is needed in case a machine goes down.
  • Machines talk to each other using consensus algorithms. Some widely known ones are Paxos and Raft.
  • There is no concept of a single leader with followers; rather, there are many leaders and many followers, since each chunk of data resides on several machines.
  • Consistency has to be maintained using quorum and majority decisions (see the small sketch after this list).
  • Data is stored in an LSM tree format.
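
As a rough illustration of the majority idea (a generic Python sketch, not tied to any particular database’s implementation):

def majority(replication_factor):
    """Smallest number of replicas that forms a majority (the quorum)."""
    return replication_factor // 2 + 1

for rf in (3, 5):
    q = majority(rf)
    print(f"replication factor {rf}: quorum {q}, tolerates {rf - q} node(s) down")
# replication factor 3: quorum 2, tolerates 1 node(s) down
# replication factor 5: quorum 3, tolerates 2 node(s) down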

How has the journey from relational DBs been so far?

It started with Cassandra, but since Cassandra is NoSQL and this article is focused on the relational side, I will talk specifically in terms of relational databases like MySQL; hence I am considering TiDB here. The logs below are from my testing machine.

ankitkapoor@Ankits-MacBook-Air bin % ./mysql -h 127.0.0.1 --port 56210 -u root -p --local-infile=1
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3556769810
Server version: 8.0.11-TiDB-v8.5.0 TiDB Server (Apache License 2.0) Community Edition, MySQL 8.0 compatible

Copyright (c) 2000, 2024, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 

First, in order to understand how data is distributed, I needed to access the INFORMATION_SCHEMA and METRICS_SCHEMA databases.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| INFORMATION_SCHEMA |
| METRICS_SCHEMA     |
| PERFORMANCE_SCHEMA |
| mysql              |
| sys                |
| test               |
+--------------------+

6 rows in set (0.00 sec)

mysql>

This article is not going to cover the architecture.

Let us see how data is distributed internally:

mysql> select * from information_schema.TIKV_REGION_PEERS;
+-----------+---------+----------+------------+-----------+--------+--------------+
| REGION_ID | PEER_ID | STORE_ID | IS_LEARNER | IS_LEADER | STATUS | DOWN_SECONDS |
+-----------+---------+----------+------------+-----------+--------+--------------+
|      1003 |    1004 |        1 |          0 |         1 | NORMAL |         NULL |
|      1009 |    1010 |        1 |          0 |         1 | NORMAL |         NULL |
|        20 |      21 |        1 |          0 |         1 | NORMAL |         NULL |
|        48 |      49 |        1 |          0 |         1 | NORMAL |         NULL |
+-----------+---------+----------+------------+-----------+--------+--------------+

Here the leader is the master copy, and the peers are the replicas that hold copies of the data.

Let’s see the data distribution at the table level:

mysql> select * from TABLE_STORAGE_STATS where TABLE_SCHEMA='test'\G;
*************************** 1. row ***************************
      TABLE_SCHEMA: test
        TABLE_NAME: customer
          TABLE_ID: 110
        PEER_COUNT: 1
      REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
        TABLE_SIZE: 1
        TABLE_KEYS: 0
*************************** 2. row ***************************
      TABLE_SCHEMA: test
        TABLE_NAME: tes
          TABLE_ID: 119
        PEER_COUNT: 1
      REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
        TABLE_SIZE: 1
        TABLE_KEYS: 0
*************************** 3. row ***************************
      TABLE_SCHEMA: test
        TABLE_NAME: before
          TABLE_ID: 131
        PEER_COUNT: 1
      REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
        TABLE_SIZE: 1
        TABLE_KEYS: 165
*************************** 4. row ***************************
      TABLE_SCHEMA: test
        TABLE_NAME: after
          TABLE_ID: 135
        PEER_COUNT: 1
      REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
        TABLE_SIZE: 1
        TABLE_KEYS: 0
4 rows in set (0.00 sec)

ERROR: 
No query specified

In the above output, TABLE_ID is the unique id associated with every table, and PEER_COUNT shows the number of copies of its data. Since this is my testing machine, it shows only one. Regions are the logical blocks of data.

Let’s check whether we can find out more about how data is being distributed (although not the complete picture) with the help of placement policies.

mysql> select * from tables where TABLE_SCHEMA='test'\G;
*************************** 1. row ***************************
             TABLE_CATALOG: def
              TABLE_SCHEMA: test
                TABLE_NAME: customer
                TABLE_TYPE: BASE TABLE
                    ENGINE: InnoDB
                   VERSION: 10
                ROW_FORMAT: Compact
                TABLE_ROWS: 0
            AVG_ROW_LENGTH: 0
               DATA_LENGTH: 0
           MAX_DATA_LENGTH: 0
              INDEX_LENGTH: 0
                 DATA_FREE: 0
            AUTO_INCREMENT: 0
               CREATE_TIME: 2024-12-20 19:23:38
               UPDATE_TIME: NULL
                CHECK_TIME: NULL
           TABLE_COLLATION: latin1_bin
                  CHECKSUM: NULL
            CREATE_OPTIONS: 
             TABLE_COMMENT: 
             TIDB_TABLE_ID: 110
 TIDB_ROW_ID_SHARDING_INFO: NOT_SHARDED(PK_IS_HANDLE)
              TIDB_PK_TYPE: CLUSTERED
TIDB_PLACEMENT_POLICY_NAME: NULL

So far, I have tried to keep things simple: how data distribution works and how we can actually see it. In the next article, we will cover other aspects of distributed systems.

Note – These are my own thoughts; my employer has no role in them.