It has been a while since I wrote my practical database experiences I gained in this last year. I have mentioned “Journey” because things change when you are accustomed to relational databases ( MySQL, PG) but start adopting distributed databases ( like TiDB, ScyllaDB) regardless of any reason. We need a basic understanding while doing this migration. However, this article won’t be covering details of these 2 because there are already lot many articles out there that help in understanding the differences and use cases of both.
You already know relational databases –

So you already know a few things about it :
- Entire data is stored in a single machine unless you are not using sharding.
- You replicate data using binary logs, WAL ( redo), archiving, or some tools that can read these files.
- Machines do not talk with each other if one in a cluster goes down.
- No way of tracking consistency across nodes ( except by relying on semi-sync replication).
- Data is stored in b-tree storage format.
You might know about Distributed database

You might know about these pointers :
- Data is distributed on multiple machines rather than residing on a single machine.
- In the above figure, Tables A1 and A11 are subparts of the same table A wherein A1 hold some data and A11 holds some. The same holds for other tables B and C
- Every table has at least 2 copies in different machines. This is called the replication factor.
- This is needed if in case 1 machine goes down.
- Machines talk to each other using consensus algorithms. Some widely known ones are Paxos and Raft.
- No such concept of a single leader and followers i.e. there are leaders and there are followers since a chunk of data is residing at several machines.
- Consistency has to be maintained using Quorum and majority decisions.
- Data is stored in LSM tree format.
How was the journey from relational DB until now –
It started with Cassandra but since it is NoSQL and this article is focused on relational, I will talk specifically about relational databases like MySQL. Hence considering TiDB here. The below logs are from my testing machine.
ankitkapoor@Ankits-MacBook-Air bin % ./mysql -h 127.0.0.1 --port 56210 -u root -p --local-infile=1
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3556769810
Server version: 8.0.11-TiDB-v8.5.0 TiDB Server (Apache License 2.0) Community Edition, MySQL 8.0 compatible
Copyright (c) 2000, 2024, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
First in order to understand how data is distributed I needed to access information schema and metrics Schema
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| INFORMATION_SCHEMA |
| METRICS_SCHEMA |
| PERFORMANCE_SCHEMA |
| mysql |
| sys |
| test |
+--------------------+
6 rows in set (0.00 sec)
mysql>
This article is not going to cover the architecture.
Let us see, how data is distributed internally :
mysql> select * from information_schema.TIKV_REGION_PEERS;
+-----------+---------+----------+------------+-----------+--------+---------
| REGION_ID | PEER_ID | STORE_ID | IS_LEARNER | IS_LEADER| STATUS|DOWN_SECONDS |
+-----------+---------+----------+------------+-----------+--------+-------+
| 1003 | 1004 | 1 | 0 | 1 | NORMAL | NULL |
| 1009 | 1010 | 1 | 0 | 1 | NORMAL | NULL |
| 20 | 21 | 1 | 0 | 1 | NORMAL | NULL |
| 48 | 49 | 1 | 0 | 1 | NORMAL | NULL |
+-----------+---------+----------+------------+-----------+--------+-------+
Here Leader is the master and Peers are the replicas which hold the copy of data.
Lets see the data distribution at the table level :
mysql> select * from TABLE_STORAGE_STATS where TABLE_SCHEMA='test'\G;
*************************** 1. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: customer
TABLE_ID: 110
PEER_COUNT: 1
REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
TABLE_SIZE: 1
TABLE_KEYS: 0
*************************** 2. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: tes
TABLE_ID: 119
PEER_COUNT: 1
REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
TABLE_SIZE: 1
TABLE_KEYS: 0
*************************** 3. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: before
TABLE_ID: 131
PEER_COUNT: 1
REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
TABLE_SIZE: 1
TABLE_KEYS: 165
*************************** 4. row ***************************
TABLE_SCHEMA: test
TABLE_NAME: after
TABLE_ID: 135
PEER_COUNT: 1
REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
TABLE_SIZE: 1
TABLE_KEYS: 0
4 rows in set (0.00 sec)
ERROR:
No query specified
In the above output, table id is the unique id of a table and is associated with every table and is showing its number of copies. Since it is my testing machine, it is showing only one. Regions are the logical block of data.
Lets try to check if we can find more about how data is being distributed ( although not complete ) with the help of placement policies.
mysql> select * from tables where TABLE_SCHEMA='test'\G;
*************************** 1. row ***************************
TABLE_CATALOG: def
TABLE_SCHEMA: test
TABLE_NAME: customer
TABLE_TYPE: BASE TABLE
ENGINE: InnoDB
VERSION: 10
ROW_FORMAT: Compact
TABLE_ROWS: 0
AVG_ROW_LENGTH: 0
DATA_LENGTH: 0
MAX_DATA_LENGTH: 0
INDEX_LENGTH: 0
DATA_FREE: 0
AUTO_INCREMENT: 0
CREATE_TIME: 2024-12-20 19:23:38
UPDATE_TIME: NULL
CHECK_TIME: NULL
TABLE_COLLATION: latin1_bin
CHECKSUM: NULL
CREATE_OPTIONS:
TABLE_COMMENT:
TIDB_TABLE_ID: 110
TIDB_ROW_ID_SHARDING_INFO: NOT_SHARDED(PK_IS_HANDLE)
TIDB_PK_TYPE: CLUSTERED
TIDB_PLACEMENT_POLICY_NAME: NULL
Until now, I have tried to keep it simple as to how data distribution works and how can we see it in actual. In the next article, we will cover other aspects of distributed systems.
Note – These are my own thoughts and none of my employer has any role in it.

2 thoughts on “Journey from Relational to Distributed databases”