TPC-C Benchmarking of MySQL 8.0.44

This setup runs on my local Mac (M2 chip, 16 GB memory) with MySQL 8.0.44.

High-level results:
Total transactions executed in 10 min – 247298
Transactions per minute – 24729
Average execution time per transaction type – 0–26 ms (the tool won't report this directly; you have to calculate it yourself, and I show the calculation below)
Duration of test – 10 min
Threads (terminals) – 1

Running Average tpmTOTAL: 24729.81 Current tpmTOTAL: 1633860 Memory Usage: 107MB / 258MB
21:44:26,051 [Thread-0] INFO jTPCC : Term-00,
21:44:26,054 [Thread-0] INFO jTPCC : Term-00,
21:44:26,055 [Thread-0] INFO jTPCC : Term-00, Measured tpmC (NewOrders) = 11128.87
21:44:26,055 [Thread-0] INFO jTPCC : Term-00, Measured tpmTOTAL = 24729.4
21:44:26,055 [Thread-0] INFO jTPCC : Term-00, Session Start = 2026-01-25 21:34:26
21:44:26,055 [Thread-0] INFO jTPCC : Term-00, Session End = 2026-01-25 21:44:26
21:44:26,056 [Thread-0] INFO jTPCC : Term-00, Transaction Count = 247298
21:44:26,056 [Thread-0] INFO jTPCC : executeTime[Payment]=74933
21:44:26,056 [Thread-0] INFO jTPCC : executeTime[Order-Status]=4676
21:44:26,056 [Thread-0] INFO jTPCC : executeTime[Delivery]=43428
21:44:26,056 [Thread-0] INFO jTPCC : executeTime[Stock-Level]=256644
21:44:26,056 [Thread-0] INFO jTPCC : executeTime[New-Order]=220100
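
This is how I derived the 0–26 ms average above: executeTime is the cumulative execution time in milliseconds per transaction type, and each type's share of the 247298 transactions can be approximated from its configured weight (45/43/4/4/4, set in props.mysql shown further down). A back-of-the-envelope calculation:

Order-Status:   4676 ms / (247298 * 0.04) ≈  0.5 ms per transaction
Payment     :  74933 ms / (247298 * 0.43) ≈  0.7 ms per transaction
New-Order   : 220100 ms / (247298 * 0.45) ≈  2.0 ms per transaction
Delivery    :  43428 ms / (247298 * 0.04) ≈  4.4 ms per transaction
Stock-Level : 256644 ms / (247298 * 0.04) ≈ 26.0 ms per transaction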

If you want to understand how I set this up and ran it, keep reading; otherwise, you can stop here.

I used the BenchmarkSQL fork maintained by PingCAP as the benchmark tool:

ankit@ankits-MacBook-Pro Documents % git clone https://github.com/pingcap/benchmarksql.git
Cloning into 'benchmarksql'...
remote: Enumerating objects: 506, done.
remote: Counting objects: 100% (239/239), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 506 (delta 174), reused 163 (delta 163), pack-reused 267 (from 2)
Receiving objects: 100% (506/506), 4.48 MiB | 10.98 MiB/s, done.
Resolving deltas: 100% (233/233), done.

Check the config file:

ankit@ankits-MacBook-Pro Documents % cd benchmarksql
ankit@ankits-MacBook-Pro benchmarksql % ls
build.xml doc HOW-TO-RUN.txt lib README README.md run src
ankit@ankits-MacBook-Pro benchmarksql % cd run
ankit@ankits-MacBook-Pro run % ls
funcs.sh log4j.properties props.mysql runBenchmark.sh runLoader.sh sql.firebird sql.postgres
generateGraphs.sh misc props.ora runDatabaseBuild.sh runSQL.sh sql.mysql
generateReport.sh props.fb props.pg runDatabaseDestroy.sh sql.common sql.oracle

props.mysql (the configuration I used):

db=mysql
driver=com.mysql.cj.jdbc.Driver
conn=jdbc:mysql://127.0.0.1:3306/tpcc?useSSL=false&allowPublicKeyRetrieval=true
user=tpcc
password=tpcc
warehouses=1
loadWorkers=4
terminals=1
runTxnsPerTerminal=0
runMins=10
limitTxnsPerMin=0
terminalWarehouseFixed=true
newOrderWeight=45
paymentWeight=43
orderStatusWeight=4
deliveryWeight=4
stockLevelWeight=4
resultDirectory=my_result_%tY-%tm-%td_%tH%tM%tS
ankit@ankits-MacBook-Pro run % ./runSQL.sh props.mysql sql.mysql/tableCreates.sql
# ------------------------------------------------------------
# Loading SQL file sql.mysql/tableCreates.sql
# ------------------------------------------------------------
The operation couldn’t be completed. Unable to locate a Java Runtime.
Please visit http://www.java.com for information on installing Java.

Java was not present, so I downloaded a JDK for my Mac. You will need to do the same if you plan to replicate this.
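
For reference, one way to get a JDK on macOS, assuming you use Homebrew (any recent JDK from java.com or adoptium.net works just as well):

brew install openjdk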

./runSQL.sh props.mysql sql.mysql/tableCreates.sql
# ------------------------------------------------------------
# Loading SQL file sql.mysql/tableCreates.sql
# ------------------------------------------------------------
Error: Could not find or load main class ExecJDBC
Caused by: java.lang.ClassNotFoundException: ExecJDBC

The next bottleneck: the ExecJDBC class could not be found, because BenchmarkSQL ships as source and I had not compiled it yet. Building it requires Apache Ant:

ankit@ankits-MacBook-Pro run % curl -O https://downloads.apache.org/ant/binaries/apache-ant-1.10.14-bin.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6741k  100 6741k    0     0  2719k      0  0:00:02  0:00:02 --:--:-- 2720k
ankit@ankits-MacBook-Pro run % tar -xzf apache-ant-1.10.14-bin.tar.gz
ankit@ankits-MacBook-Pro run % export ANT_HOME=$PWD/apache-ant-1.10.14
ankit@ankits-MacBook-Pro run % export PATH=$ANT_HOME/bin:$PATH
ankit@ankits-MacBook-Pro benchmarksql % ls -lrth
total 48
-rwxr-xr-x 1 ankit staff 6.2K 25 Jan 20:08 HOW-TO-RUN.txt
-rw-r--r-- 1 ankit staff 318B 25 Jan 20:08 README
-rwxr-xr-x 1 ankit staff 5.2K 25 Jan 20:08 README.md
-rwxr-xr-x 1 ankit staff 1.1K 25 Jan 20:08 build.xml
drwxr-xr-x 3 ankit staff 96B 25 Jan 20:08 doc
drwxr-xr-x 9 ankit staff 288B 25 Jan 20:08 lib
drwxr-xr-x 6 ankit staff 192B 25 Jan 20:08 src
drwxr-xr-x 24 ankit staff 768B 25 Jan 20:24 run
ankit@ankits-MacBook-Pro benchmarksql % ant
Buildfile: /Users/ankit/Documents/benchmarksql/build.xml
init:
[mkdir] Created dir: /Users/ankit/Documents/benchmarksql/build
compile:
[javac] Compiling 12 source files to /Users/ankit/Documents/benchmarksql/build
[javac] /Users/ankit/Documents/benchmarksql/src/client/jTPCC.java:487: warning: [this-escape] possible 'this' escape before subclass is fully initialized
[javac] deliveryWeightValue, stockLevelWeightValue, numWarehouses, limPerMin_Terminal, this);
[javac] ^
[javac] 1 warning
dist:
[mkdir] Created dir: /Users/ankit/Documents/benchmarksql/dist
[jar] Building jar: /Users/ankit/Documents/benchmarksql/dist/BenchmarkSQL-5.0.jar
BUILD SUCCESSFUL
Total time: 0 seconds

Time to create the tables (run the DDL):

./runSQL.sh props.mysql sql.mysql/tableCreates.sql
# ------------------------------------------------------------
# Loading SQL file sql.mysql/tableCreates.sql
# ------------------------------------------------------------
WARNING: Use of the three-letter time zone ID "IST" is deprecated and it will be removed in a future release
create table bmsql_config (
cfg_name varchar(30) primary key,
cfg_value varchar(50)
);
create table bmsql_warehouse (
w_id integer not null,
w_ytd decimal(12,2),
w_tax decimal(4,4),
w_name varchar(10),
w_street_1 varchar(20),
w_street_2 varchar(20),
w_city varchar(20),
w_state char(2),
w_zip char(9),
constraint pk_warehouse primary key (w_id)
);
create table bmsql_district (
d_w_id integer not null,
d_id integer not null,
d_ytd decimal(12,2),
d_tax decimal(4,4),
d_next_o_id integer,
d_name varchar(10),
d_street_1 varchar(20),
d_street_2 varchar(20),
d_city varchar(20),
d_state char(2),
d_zip char(9),
constraint pk_district primary key (d_w_id, d_id)
);
create table bmsql_customer (
c_w_id integer not null,
c_d_id integer not null,
c_id integer not null,
c_discount decimal(4,4),
c_credit char(2),
c_last varchar(16),
c_first varchar(16),
c_credit_lim decimal(12,2),
c_balance decimal(12,2),
c_ytd_payment decimal(12,2),
c_payment_cnt integer,
c_delivery_cnt integer,
c_street_1 varchar(20),
c_street_2 varchar(20),
c_city varchar(20),
c_state char(2),
c_zip char(9),
c_phone char(16),
c_since timestamp,
c_middle char(2),
c_data varchar(500),
constraint pk_customer primary key (c_w_id, c_d_id, c_id),
key bmsql_customer_idx1 (c_w_id, c_d_id, c_last, c_first)
);
-- create sequence bmsql_hist_id_seq;
create table bmsql_history (
hist_id integer not null auto_increment primary key,
h_c_id integer,
h_c_d_id integer,
h_c_w_id integer,
h_d_id integer,
h_w_id integer,
h_date timestamp,
h_amount decimal(6,2),
h_data varchar(24)
);
create table bmsql_new_order (
no_w_id integer not null,
no_d_id integer not null,
no_o_id integer not null,
constraint pk_new_order primary key (no_w_id, no_d_id, no_o_id)
);
create table bmsql_oorder (
o_w_id integer not null,
o_d_id integer not null,
o_id integer not null,
o_c_id integer,
o_carrier_id integer,
o_ol_cnt integer,
o_all_local integer,
o_entry_d timestamp,
constraint pk_oorder primary key (o_w_id, o_d_id, o_id),
constraint bmsql_oorder_idx1 unique key (o_w_id, o_d_id, o_c_id, o_id)
);
create table bmsql_order_line (
ol_w_id integer not null,
ol_d_id integer not null,
ol_o_id integer not null,
ol_number integer not null,
ol_i_id integer not null,
ol_delivery_d timestamp,
ol_amount decimal(6,2),
ol_supply_w_id integer,
ol_quantity integer,
ol_dist_info char(24),
constraint pk_order_line primary key (ol_w_id, ol_d_id, ol_o_id, ol_number)
);
create table bmsql_item (
i_id integer not null,
i_name varchar(24),
i_price decimal(5,2),
i_data varchar(50),
i_im_id integer,
constraint pk_item primary key (i_id)
);
create table bmsql_stock (
s_w_id integer not null,
s_i_id integer not null,
s_quantity integer,
s_ytd integer,
s_order_cnt integer,
s_remote_cnt integer,
s_data varchar(50),
s_dist_01 char(24),
s_dist_02 char(24),
s_dist_03 char(24),
s_dist_04 char(24),
s_dist_05 char(24),
s_dist_06 char(24),
s_dist_07 char(24),
s_dist_08 char(24),
s_dist_09 char(24),
s_dist_10 char(24),
constraint pk_stock primary key (s_w_id, s_i_id)
);

On the MySQL side: I forgot to mention that we need to create a database and a user in MySQL so that the benchmark can connect to the server.

mysql> CREATE DATABASE tpcc;
Query OK, 1 row affected (0.01 sec)
mysql> CREATE USER 'tpcc'@'%' IDENTIFIED BY 'tpcc';
Query OK, 0 rows affected (0.04 sec)
mysql> GRANT ALL ON tpcc.* TO 'tpcc'@'%';
Query OK, 0 rows affected (0.00 sec)
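
One more step before the run: the tables have to be populated with the initial warehouse data. BenchmarkSQL ships a loader script for this (you can see runLoader.sh in the run directory listing above); with warehouses=1 and loadWorkers=4 from props.mysql, something like this does the job:

./runLoader.sh props.mysql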

It's time to run the benchmark. Note that I kept MySQL at its default settings, used props.mysql as shown above, and kept the number of threads (terminals) at 1.

ankit@ankits-MacBook-Pro run % ./runBenchmark.sh props.mysql
21:34:25,900 [main] INFO jTPCC : Term-00,
21:34:25,902 [main] INFO jTPCC : Term-00, +-------------------------------------------------------------+
21:34:25,902 [main] INFO jTPCC : Term-00, BenchmarkSQL v5.0
21:34:25,902 [main] INFO jTPCC : Term-00, +-------------------------------------------------------------+
21:34:25,902 [main] INFO jTPCC : Term-00, (c) 2003, Raul Barbosa
21:34:25,902 [main] INFO jTPCC : Term-00, (c) 2004-2016, Denis Lussier
21:34:25,902 [main] INFO jTPCC : Term-00, (c) 2016, Jan Wieck
21:34:25,902 [main] INFO jTPCC : Term-00, +-------------------------------------------------------------+
21:34:25,902 [main] INFO jTPCC : Term-00,
21:34:25,903 [main] INFO jTPCC : Term-00, db=mysql
21:34:25,903 [main] INFO jTPCC : Term-00, driver=com.mysql.cj.jdbc.Driver
21:34:25,903 [main] INFO jTPCC : Term-00, conn=jdbc:mysql://127.0.0.1:3306/tpcc?useSSL=false&allowPublicKeyRetrieval=true
21:34:25,903 [main] INFO jTPCC : Term-00, user=tpcc
21:34:25,903 [main] INFO jTPCC : Term-00,
21:34:25,904 [main] INFO jTPCC : Term-00, warehouses=1
21:34:25,904 [main] INFO jTPCC : Term-00, terminals=1
21:34:25,904 [main] INFO jTPCC : Term-00, runMins=10
21:34:25,904 [main] INFO jTPCC : Term-00, limitTxnsPerMin=0
21:34:25,904 [main] INFO jTPCC : Term-00, terminalWarehouseFixed=true
21:34:25,904 [main] INFO jTPCC : Term-00,
21:34:25,904 [main] INFO jTPCC : Term-00, newOrderWeight=45
21:34:25,904 [main] INFO jTPCC : Term-00, paymentWeight=43
21:34:25,904 [main] INFO jTPCC : Term-00, orderStatusWeight=4
21:34:25,904 [main] INFO jTPCC : Term-00, deliveryWeight=4
21:34:25,904 [main] INFO jTPCC : Term-00, stockLevelWeight=4
21:34:25,904 [main] INFO jTPCC : Term-00,
21:34:25,904 [main] INFO jTPCC : Term-00, resultDirectory=my_result_%tY-%tm-%td_%tH%tM%tS
21:34:25,904 [main] INFO jTPCC : Term-00, osCollectorScript=null
21:34:25,904 [main] INFO jTPCC : Term-00,
21:34:25,912 [main] INFO jTPCC : Term-00, copied props.mysql to my_result_2026-01-25_213425/run.properties
21:34:25,913 [main] INFO jTPCC : Term-00, created my_result_2026-01-25_213425/data/runInfo.csv for runID 1
21:34:25,913 [main] INFO jTPCC : Term-00, writing per transaction results to my_result_2026-01-25_213425/data/result.csv
21:34:25,913 [main] INFO jTPCC : Term-00,
WARNING: Use of the three-letter time zone ID "IST" is deprecated and it will be removed in a future release
21:34:26,029 [main] INFO jTPCC : Term-00, C value for C_LAST during load: 217
21:34:26,029 [main] INFO jTPCC : Term-00, C value for C_LAST this run: 129
21:34:26,029 [main] INFO jTPCC : Term-00,
WARNING: Use of the three-letter time zone ID "IST" is deprecated and it will be removed in a future release
Term-00, Running Average tpmTOTAL: 24729.81 Current tpmTOTAL: 1633860 Memory Usage: 107MB / 258MB
21:44:26,051 [Thread-0] INFO jTPCC : Term-00,
21:44:26,054 [Thread-0] INFO jTPCC : Term-00,
21:44:26,055 [Thread-0] INFO jTPCC : Term-00, Measured tpmC (NewOrders) = 11128.87
21:44:26,055 [Thread-0] INFO jTPCC : Term-00, Measured tpmTOTAL = 24729.4
21:44:26,055 [Thread-0] INFO jTPCC : Term-00, Session Start = 2026-01-25 21:34:26
21:44:26,055 [Thread-0] INFO jTPCC : Term-00, Session End = 2026-01-25 21:44:26
21:44:26,056 [Thread-0] INFO jTPCC : Term-00, Transaction Count = 247298
21:44:26,056 [Thread-0] INFO jTPCC : executeTime[Payment]=74933
21:44:26,056 [Thread-0] INFO jTPCC : executeTime[Order-Status]=4676
21:44:26,056 [Thread-0] INFO jTPCC : executeTime[Delivery]=43428
21:44:26,056 [Thread-0] INFO jTPCC : executeTime[Stock-Level]=256644
21:44:26,056 [Thread-0] INFO jTPCC : executeTime[New-Order]=220100

MySQL server status while the test was running:

mysql> SHOW PROCESSLIST;
+----+-----------------+-----------------+------+---------+---------+------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-----------------+-----------------+------+---------+---------+------------------------+------------------------------------------------------------------------------------------------------+
| 5 | event_scheduler | localhost | NULL | Daemon | 2428165 | Waiting on empty queue | NULL |
| 24 | root | localhost | tpcc | Query | 0 | init | SHOW PROCESSLIST |
| 31 | tpcc | localhost:60305 | tpcc | Sleep | 89 | | NULL |
| 32 | tpcc | localhost:60306 | tpcc | Query | 0 | executing | SELECT count(*) AS low_stock FROM ( SELECT s_w_id, s_i_id, s_quantity FROM bmsql_stock |
+----+-----------------+-----------------+------+---------+---------+------------------------+------------------------------------------------------------------------------------------------------+
4 rows in set, 1 warning (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Threads_running';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| Threads_running | 3 |
+-----------------+-------+
1 row in set (0.02 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Com_commit';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Com_commit | 48701 |
+---------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Com_rollback';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Com_rollback | 4391 |
+---------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
+--------------------------+-------+
| Variable_name | Value |
+--------------------------+-------+
| Innodb_buffer_pool_reads | 2416 |
+--------------------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Innodb_row_lock_time | 0 |
+----------------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Innodb_row_lock_time | 0 |
+----------------------+-------+
1 row in set (0.00 sec)
mysql> SHOW PROCESSLIST;
+----+-----------------+-----------------+------+---------+---------+------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-----------------+-----------------+------+---------+---------+------------------------+------------------------------------------------------------------------------------------------------+
| 5 | event_scheduler | localhost | NULL | Daemon | 2428175 | Waiting on empty queue | NULL |
| 24 | root | localhost | tpcc | Query | 0 | init | SHOW PROCESSLIST |
| 31 | tpcc | localhost:60305 | tpcc | Sleep | 99 | | NULL |
| 32 | tpcc | localhost:60306 | tpcc | Query | 0 | executing | SELECT count(*) AS low_stock FROM ( SELECT s_w_id, s_i_id, s_quantity FROM bmsql_stock |
+----+-----------------+-----------------+------+---------+---------+------------------------+------------------------------------------------------------------------------------------------------+
4 rows in set, 1 warning (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Threads_running';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| Threads_running | 3 |
+-----------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Com_commit';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Com_commit | 53318 |
+---------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Com_rollback';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Com_rollback | 4820 |
+---------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
+--------------------------+-------+
| Variable_name | Value |
+--------------------------+-------+
| Innodb_buffer_pool_reads | 2640 |
+--------------------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Innodb_row_lock_time | 0 |
+----------------------+-------+
1 row in set (0.00 sec)
mysql> SHOW PROCESSLIST;
+----+-----------------+-----------------+------+---------+---------+------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-----------------+-----------------+------+---------+---------+------------------------+------------------------------------------------------------------------------------------------------+
| 5 | event_scheduler | localhost | NULL | Daemon | 2428180 | Waiting on empty queue | NULL |
| 24 | root | localhost | tpcc | Query | 0 | init | SHOW PROCESSLIST |
| 31 | tpcc | localhost:60305 | tpcc | Sleep | 104 | | NULL |
| 32 | tpcc | localhost:60306 | tpcc | Query | 0 | executing | SELECT count(*) AS low_stock FROM ( SELECT s_w_id, s_i_id, s_quantity FROM bmsql_stock |
+----+-----------------+-----------------+------+---------+---------+------------------------+------------------------------------------------------------------------------------------------------+
4 rows in set, 1 warning (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Threads_running';
+-----------------+-------+
| Variable_name | Value |
+-----------------+-------+
| Threads_running | 3 |
+-----------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Com_commit';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Com_commit | 56185 |
+---------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Com_rollback';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| Com_rollback | 5071 |
+---------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
+--------------------------+-------+
| Variable_name | Value |
+--------------------------+-------+
| Innodb_buffer_pool_reads | 2787 |
+--------------------------+-------+
1 row in set (0.00 sec)
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| Innodb_row_lock_time | 0 |
+----------------------+-------+
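
A quick sanity check on these counters: between the first and second snapshots, the benchmark connection's Sleep time advanced from 89 to 99 seconds (about 10 seconds of wall clock) while Com_commit grew from 48701 to 53318, i.e. (53318 - 48701) / 10 ≈ 462 commits per second, or roughly 27,700 per minute, which is in the same ballpark as the measured tpmTOTAL of ~24,700.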

Conclusion –

This is purely my hands-on testing of 8.0.44; it says nothing definitive about MySQL performance in general. You can scale the test further by playing with terminals and warehouses (essentially the thread count and the data volume).


What type of databases are well suited for LLMs? Part 1

This article is a continuation of my previous articles on the journey from relational to distributed databases and on the LSM tree. It will often use the terms below; I will explain each as it comes up. It also assumes you know what an LLM (like ChatGPT) is:

  • Large language models
  • TiDB (vector supported distributed database)
  • high dimensional data
  • vector
  • semantic search
  • knowledge retrieval
  • RAG( Retrieval augmented generation )
  • Euclidean distance
  • cosine similarity
  • ANN
  • High dimensional indexes like HNSW
  • methods to perform similarity searches

Let us start from a use case I was discussing with my colleagues a few days ago, which I then decided to write down in very simple language so that everyone can understand it.

If I ask what the attributes of your favourite car are, what will you answer? Let's also allocate a number to every attribute instead of a name.

Let's think about a white Honda City. What does my Honda City have?

  1. HP of engine : 1500
  2. Car colour : 0001 ( white colour)
  3. Car length : 75
  4. Car Width : 40
  5. Number of gears : 6
  6. Cruise feature : 1
  7. Maximum speed : 400 (Kmph)
  8. Pickup speed :60
  9. Seat colour : 010
  10. Ambience light : 101
  11. Airbags :6
  12. Speakers :4
  13. Tweeters : 2

How did I select these attributes and allocate these numbers?

How I selected these attributes, and on what basis I allocated these numbers, is a separate discussion related to model training. For now, assume it is arbitrary: there is no rule that only specific numbers can be assigned to a specific attribute.

Going back to the topic: if I combine all these numbers, it looks like a list of numbers:

{1500,0001,75,40,6,1,400,60,010,101,6,4,2}

This becomes a vector representation of my Honda City, and every attribute refers to a dimension. If I keep adding more attributes of the car, it becomes high-dimensional data, i.e. data with a high number of attributes.

Do I need a separate data type for this?

Yes. Just as we have varchar, there is a VECTOR data type. See below; please note that the database I used here is TiDB.

mysql> show create table cars;
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                         |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| cars  | CREATE TABLE `cars` (
  `name` char(10) DEFAULT NULL,
  `car_vector` vector DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Is this data type available in traditional databases?

Storing such an array (or list of numbers, or vector) in a traditional database like Postgres/MySQL as a varchar is possible, but then I cannot find similar products: there is no way to perform a similarity search. The obvious idea of using the LIKE operator doesn't help either, because LIKE does text matching on a specific pattern and cannot return similar data.

Let's go into more detail on implementing this in a traditional database and understand why it won't work.

Let's add one more car. Say you have a red Hyundai Verna, whose vector representation could be:

verna – {1500,0011,75,40,6,1,300,50,100,110,4,3,1}

and my city was – {1500,0001,75,40,6,1,400,60,010,101,6,4,2}

If we compute the Euclidean distance between them (to search for similar products), we will see that the two items are quite close and can thus be called similar. I can use several mathematical functions to do this, but the real problem is SCALE: computing such distances row by row cannot keep up once you have millions of records.

mysql> insert into cars values ('city_sedan','[1500,1,75,40,6,1,400,60,10,101,6,4,2]');
Query OK, 1 row affected (0.02 sec)

mysql> insert into cars values ('vern_sedan','[1500,1,79,40,6,1,450,60,10,101,3,2,1]');
Query OK, 1 row affected (0.01 sec)

That's where a vector database comes in. Can we use TiDB, a distributed database that supports the vector data type along with HNSW, ANN, Euclidean distance, and cosine similarity?

Vector databases calculate similarity using multiple methods, including but not limited to:

  1. Distance in the same 2D/3D (or multi-dimensional) space, i.e. Euclidean distance.
  2. Magnitude/angle comparison (not just subtraction of numbers), i.e. cosine similarity.

Let's try to perform a search using Euclidean distance in TiDB, which works on the following fundamental formula: for two n-dimensional vectors a and b,

distance(a, b) = sqrt( (a1-b1)^2 + (a2-b2)^2 + ... + (an-bn)^2 )
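
Applied to the two cars above (reading the binary-looking attribute codes as plain integers), the per-dimension differences are 10 (colour), 100 (maximum speed), 10 (pickup), 90 (seat colour), 9 (ambience light), 2 (airbags), 1 (speakers) and 1 (tweeters), so:

distance(city, verna) = sqrt(100 + 10000 + 100 + 8100 + 81 + 4 + 1 + 1) = sqrt(18387) ≈ 135.6

That is small relative to the magnitude of the values, so the two sedans count as similar.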

Think about a situation where you need to find a car similar to a given input vector. I decided to give as input a high-end, high-speed car:

mysql> SELECT name,VEC_L2_DISTANCE(car_vector,'[5000,10,19,400,60,10,330,600,100,1001,30,20,10]') from cars;
+------------+--------------------------------------------------------------------------------+
| name       | VEC_L2_DISTANCE(car_vector,'[5000,10,19,400,60,10,330,600,100,1001,30,20,10]') |
+------------+--------------------------------------------------------------------------------+
| city_sedan |                                                             3674.4128782704865 |
| vern_sedan |                                                             3675.8008651176956 |
+------------+--------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Above is the distance of the input from each of the stored vectors. Now we can filter on it:

mysql> SELECT name,VEC_L2_DISTANCE(car_vector,'[5000,10,19,400,60,10,330,600,100,1001,30,20,10]') from cars where VEC_L2_DISTANCE(car_vector,'[5000,10,19,400,60,10,330,600,100,1001,30,20,10]') < 100;
Empty set (0.00 sec)

This means no car we have is within a distance of 100 of the input. The value 100 is specific to the use case: it is the threshold at which we consider something "similar".
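
Rather than a fixed threshold, the more common pattern is a top-k nearest-neighbour query: order by distance and take the k closest rows. A sketch against the same cars table (TiDB also provides VEC_COSINE_DISTANCE if you prefer angle-based similarity):

SELECT name,
       VEC_L2_DISTANCE(car_vector,'[5000,10,19,400,60,10,330,600,100,1001,30,20,10]') AS dist
FROM cars
ORDER BY dist
LIMIT 3;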

Not all databases that support the vector data type can be called fully-fledged vector databases.

For example, MySQL 9 supports VECTOR as a data type, but it still needs time to mature. It cannot yet be called a vector database: it does not scale well to millions and billions of records, and the absence of the indexes required for vector search makes it inefficient at that scale. Moreover, the similarity method MySQL supports is Euclidean distance, with no cosine support (which is good for use cases like DNA search).

So what have we learnt so far –

  1. What is a similarity search?
  2. How are these searches implemented?
  3. Why are traditional databases not efficient for this purpose, even when they support the data type?
  4. What is a vector, and what methods compare vectors?
  5. Indexes for vector data types

Can't we use B-tree indexes on these data types?

We have all worked with B-trees, but they are not well suited for similarity search at such a large scale. This is where HNSW comes in, and it is blazing fast because it performs far fewer comparisons: instead of matching exactly, it looks for similar entries by navigating a hierarchy of layers.

So, going back to our question "what databases are well suited for LLMs": an LLM (like ChatGPT or DeepSeek) uses RAG to fetch extra knowledge, that knowledge needs to be stored somewhere searchable by similarity, and that store can be a vector database.

In part 2 of this article, I will do the implementation on TiDB to see how it actually works.

Some very basics of LSM tree….

Basics

This article is a continuation of my previous article on my journey from relational to distributed databases. I have also talked about probabilistic data structures here.

While discussing distributed databases with one of my colleagues, I realised that we often skip the basics, so I decided to publish this article in the simplest possible form. It is actually a long-pending article that I wrote a few years ago but never published.

Let’s talk about LSM tree.

In this type of storage, data keeps being appended to existing files, i.e. it does not overwrite data in place (as we do with B-trees). Data is first written in memory (to structures known as memtables), up to a defined threshold, say 10 MB; once it reaches that limit, the data is flushed to disk.

LSM uses the concept of a log-structured file system (LFS), which just appends the changes (update/insert/delete), i.e. it won't replace or change the existing record, making it a good option for write-heavy database workloads.

Log-structured means that I am writing records one after the other, without modifying the related previous data.

Merge refers to the algorithm that merges this log data and keeps it manageable.

Tree means different levels of data, i.e. the data is stored in a hierarchical order. Think of a computer hard disk partitioned into levels: the upper levels (holding less data) are faster to access, while going deeper to the lower levels (holding more data) is more time-consuming.

LSM works on different levels, which means it maintains data at different levels. Let's try to understand by going through the points below.

The first stage and later stages of a write

The first stage is the memory stage, where we have the memtable, the in-memory area (explained in a later section). Here the data must be kept in sorted order according to the primary key defined on the table.

L0 – the level on storage that receives data flushed from the memtable into stable files in sorted order. This level contains multiple SSTables (sorted string tables, known as SSTs), and at this level we may see duplicate keys across files (for example, keys a,b,c in table s1 and a,b,c,d in table s2, where a,b,c are duplicates). The duplication occurs because each file is a plain dump from the memtable; any duplicate data is removed at a later stage as a background task.

The SSTable files at Level 1 and above are organized to contain only non-overlapping keys by design.

Have you also heard the word COMPACTION? Let's understand what that is.

Compaction is the process in which SSTs from level 0 are combined into a smaller number of SSTs by removing overlapping keys, deleted data, and data that has since changed. Obviously, this process consumes CPU.
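
A tiny illustration using the duplicate-key example from above (an idealised sketch; real engines also track deletes and versions):

L0 (plain memtable dumps):  s1 = {a,b,c}    s2 = {a,b,c,d}
L1 (after compaction):      s3 = {a,b,c,d}  (one copy of each key, non-overlapping ranges)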

What about read operation in LSM tree ?

For reads, LSM uses a binary search algorithm (which I'm not going to cover in detail, as it would be out of scope). If we don't run compaction, read performance will eventually suffer, because a lookup may have to go through multiple levels of SSTs.

Did I mention about memtable ?

A MemTable is an in-memory data structure holding data before it is flushed to SST files (on disk). It serves both reads and writes: new writes always insert data into the memtable, and reads have to query the memtable before reading from SST files, because data in the memtable is newer. Once a memtable is full, it becomes immutable and is replaced by a new memtable. A background thread flushes the contents of the memtable into an SST file, after which the memtable can be destroyed.

Which databases use this?

Databases like MyRocks, TiDB, and Cassandra use LSM trees.

I'm keeping this article short and concise; its purpose is to help readers understand the basics, and writing much more detail would defeat that purpose.

Journey from Relational to Distributed databases

It has been a while since I wrote about the practical database experience I gained over the last year. I say "journey" because things change when you are accustomed to relational databases (MySQL, PG) and start adopting distributed databases (like TiDB, ScyllaDB), for whatever reason. We need a basic understanding when making this migration. However, this article won't cover the two in detail, because there are already a lot of articles out there that explain the differences and use cases of both.

You already know relational databases

binary digits here represent the physical storage

So you already know a few things about it :

  • The entire dataset is stored on a single machine unless you are using sharding.
  • You replicate data using binary logs, WAL (redo), archiving, or tools that can read these files.
  • Machines do not coordinate with each other when one node in a cluster goes down.
  • There is no way of tracking consistency across nodes (except by relying on semi-sync replication).
  • Data is stored in a B-tree storage format.

You might know about Distributed database

You might know about these pointers :

  • Data is distributed on multiple machines rather than residing on a single machine.
  • In the above figure, tables A1 and A11 are subparts of the same table A, where A1 holds some of the data and A11 holds the rest. The same holds for the other tables B and C.
  • Every table has at least 2 copies on different machines; the number of copies is called the replication factor.
  • This is needed in case one machine goes down.
  • Machines talk to each other using consensus algorithms. Some widely known ones are Paxos and Raft.
  • There is no concept of a single leader with followers; rather, there are leaders and there are followers per chunk, since each chunk of data resides on several machines.
  • Consistency has to be maintained using quorum and majority decisions.
  • Data is stored in LSM tree format.

How has the journey from relational DB been so far?

It started with Cassandra, but since Cassandra is NoSQL and this article is focused on relational-style databases like MySQL, I am considering TiDB here. The logs below are from my test machine.

ankitkapoor@Ankits-MacBook-Air bin % ./mysql -h 127.0.0.1 --port 56210 -u root -p --local-infile=1
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3556769810
Server version: 8.0.11-TiDB-v8.5.0 TiDB Server (Apache License 2.0) Community Edition, MySQL 8.0 compatible

Copyright (c) 2000, 2024, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 

First, in order to understand how data is distributed, I needed to look into INFORMATION_SCHEMA and METRICS_SCHEMA.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| INFORMATION_SCHEMA |
| METRICS_SCHEMA     |
| PERFORMANCE_SCHEMA |
| mysql              |
| sys                |
| test               |
+--------------------+

6 rows in set (0.00 sec)

mysql>

This article is not going to cover the architecture.

Let us see, how data is distributed internally :

mysql> select * from information_schema.TIKV_REGION_PEERS;
+-----------+---------+----------+------------+-----------+--------+--------------+
| REGION_ID | PEER_ID | STORE_ID | IS_LEARNER | IS_LEADER | STATUS | DOWN_SECONDS |
+-----------+---------+----------+------------+-----------+--------+--------------+
|      1003 |    1004 |        1 |          0 |         1 | NORMAL |         NULL |
|      1009 |    1010 |        1 |          0 |         1 | NORMAL |         NULL |
|        20 |      21 |        1 |          0 |         1 | NORMAL |         NULL |
|        48 |      49 |        1 |          0 |         1 | NORMAL |         NULL |
+-----------+---------+----------+------------+-----------+--------+--------------+

Here the leader is the primary copy, and the peers are the replicas that hold copies of the data.

Let's see the data distribution at the table level:

mysql> select * from TABLE_STORAGE_STATS where TABLE_SCHEMA='test'\G;
*************************** 1. row ***************************
      TABLE_SCHEMA: test
        TABLE_NAME: customer
          TABLE_ID: 110
        PEER_COUNT: 1
      REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
        TABLE_SIZE: 1
        TABLE_KEYS: 0
*************************** 2. row ***************************
      TABLE_SCHEMA: test
        TABLE_NAME: tes
          TABLE_ID: 119
        PEER_COUNT: 1
      REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
        TABLE_SIZE: 1
        TABLE_KEYS: 0
*************************** 3. row ***************************
      TABLE_SCHEMA: test
        TABLE_NAME: before
          TABLE_ID: 131
        PEER_COUNT: 1
      REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
        TABLE_SIZE: 1
        TABLE_KEYS: 165
*************************** 4. row ***************************
      TABLE_SCHEMA: test
        TABLE_NAME: after
          TABLE_ID: 135
        PEER_COUNT: 1
      REGION_COUNT: 1
EMPTY_REGION_COUNT: 1
        TABLE_SIZE: 1
        TABLE_KEYS: 0
4 rows in set (0.00 sec)

ERROR: 
No query specified

In the above output, TABLE_ID is the unique id associated with every table, and PEER_COUNT shows the number of copies. Since this is my test machine, it shows only one. Regions are the logical blocks of data.
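
As a side note, TiDB also has a dedicated statement to inspect how a table's regions are spread across stores, which is often quicker than querying the system tables (shown here without its output):

mysql> SHOW TABLE test.customer REGIONS;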

Let's check whether we can find out more about how the data is distributed (although not the complete picture) with the help of placement policies:

mysql> select * from tables where TABLE_SCHEMA='test'\G;
*************************** 1. row ***************************
             TABLE_CATALOG: def
              TABLE_SCHEMA: test
                TABLE_NAME: customer
                TABLE_TYPE: BASE TABLE
                    ENGINE: InnoDB
                   VERSION: 10
                ROW_FORMAT: Compact
                TABLE_ROWS: 0
            AVG_ROW_LENGTH: 0
               DATA_LENGTH: 0
           MAX_DATA_LENGTH: 0
              INDEX_LENGTH: 0
                 DATA_FREE: 0
            AUTO_INCREMENT: 0
               CREATE_TIME: 2024-12-20 19:23:38
               UPDATE_TIME: NULL
                CHECK_TIME: NULL
           TABLE_COLLATION: latin1_bin
                  CHECKSUM: NULL
            CREATE_OPTIONS: 
             TABLE_COMMENT: 
             TIDB_TABLE_ID: 110
 TIDB_ROW_ID_SHARDING_INFO: NOT_SHARDED(PK_IS_HANDLE)
              TIDB_PK_TYPE: CLUSTERED
TIDB_PLACEMENT_POLICY_NAME: NULL

Until now, I have tried to keep it simple: how data distribution works and how we can actually observe it. In the next article, we will cover other aspects of distributed systems.

Note – These are my own thoughts; my employer has no role in them.

Bloom Filters

Long ago, I presented a talk on MyRocks and its use cases. In this blog, I discuss bloom filters in a nutshell and how they are used. I must say that this blog has been influenced by my mentor Yves Trudeau, who has already published a very good description of bloom filters.

The bloom filter is not tied to MyRocks/RocksDB alone; it is also used in databases like TiDB (a distributed database), Cassandra, Redis, and OceanBase. In other words, it is widely used in databases built on the LSM tree concept, i.e. the log-structured merge tree. A bloom filter keeps a compact summary of the data in cache or memory to avoid seeking data on storage. To understand bloom filters, we need to understand the following terms:

  1. Probabilistic data structure.
  2. Space efficient.
  3. The hash function and its types.
  4. Bit array

Probabilistic data structure – Data may or may not reside in the storage/set; in other words, there is no guarantee that the required data exists. Take a use case where you have an ample amount of data and need to find a particular record. Searching for that record can take time, but if we can cheaply check the probability that the data exists, we save time and know whether we are looking in the right place.

Hash function – Hashing is a concept that transforms an input of arbitrary size (a string, for example) into a fixed-size output. These hash values have various use cases: one is in security (where hashing is often loosely lumped together with encryption), another is looking up data. MySQL offers various hash functions, for example MD5 and SHA-256.

Bit array – As the name suggests, an array of bits, i.e. 0s and 1s. If I say I need an array of 8 bits, I mean 8 positions each holding 0 or 1. For a bloom filter, after calculating the hash of some input we need a decimal representation of that output so that we can mark an entry in the array: if the value is, say, 24, we set index 24 of the bit array to 1. We will see below how a bloom filter actually does this.

bits :  0 0 0 0 0 0 0 0
index:  7 6 5 4 3 2 1 0

An array of 8 bits, with each bit's index shown below it.

Space efficient – Using a minimal amount of memory to store the data/keys/sets. Since a bloom filter uses a bit array (bitmap), we can call it space efficient.

How does a bloom filter work?

At its most basic, a bloom filter creates a hash of the element/string, transforms it into a number, and records that number in the bit array.
For an example, take the string "contra" and create its hash using any of the available functions. Here I have used MD5:

mysql> select md5('contra');
+----------------------------------+
| md5('contra')                    |
+----------------------------------+
| a83f0f76c2afad4f5d7260824430b798 |
+----------------------------------+
1 row in set (0.00 sec)

In this example, I will create a bloom filter of size 32 bits, i.e. 4 bytes. Let's transform the hash into a bit position using some SQL functions:

mysql> select mid('a83f0f76c2afad4f5d7260824430b798',4,10);
+----------------------------------------------+
| mid('a83f0f76c2afad4f5d7260824430b798',4,10) |
+----------------------------------------------+
| f0f76c2afa                                   |
+----------------------------------------------+
1 row in set (0.00 sec)
mysql> select conv('f0f76c2afa',16,10);
+--------------------------+
| conv('f0f76c2afa',16,10) |
+--------------------------+
| 1034943212282            |
+--------------------------+
1 row in set (0.00 sec)
mysql> select mod(1034943212282,32);
+-----------------------+
| mod(1034943212282,32) |
+-----------------------+
|                    26 |
+-----------------------+
1 row in set (0.00 sec)
mysql> 

Set the entry at index 26 of the bitmap to 1.
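
The 4-byte filter now looks like this (indexes counted from the right, as in the 8-bit example above):

00000100 00000000 00000000 00000000
(bit 26 = 1, all other bits still 0)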


How bloom filter helps in locating any entry ?

For a read, if a client needs this entry, the database will first check the bloom filter by hashing "contra" and testing whether bit 26 is set to 1.

It is also possible that some other key sets the same bit, the 26th, to 1. In that case, if a client looks for "contra" even though we never added "contra" to the set, the filter will still report that the data may exist, while the actual search will find nothing. This is called a false positive.


How can we make bloom filter useful ?

In my example, I used 4 bytes and set only one bit per key. To make a bloom filter more useful, we can set 2 bits per key instead of one and increase the size of the filter from 4 to 8 bytes. But there are caveats to increasing the size of a bloom filter, which I am not covering in this article.
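
For reference, the standard estimate of the false-positive rate for a bloom filter with m bits, k hash functions and n inserted keys is (1 - e^(-kn/m))^k. This is exactly why blindly growing k doesn't always help: more bits set per key also fills the array faster.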

How do different database products use bloom filters?

Talking about RocksDB, it creates a bloom filter for every SST file. More can be read here.
Talking about Redis, we can explicitly add keys into a bloom filter.

This blog does not cover how bloom filters work inside each of these databases.

Conclusion :
The bloom filter is an important concept in LSM-tree-based databases and can be useful to quickly determine whether a key may exist. In this article I have tried to cover the concept of bloom filters. However, one should use them carefully; allocating a large amount of memory won't always help.

Disclaimer

This is my personal blog and my employer is not involved in it; the thoughts are my own.
