this is how vector indexes are faster than b-tree where it doesn't do comparison with each node

What type of databases are well suited for LLM ? part 1

This article is in continuation of my previous article related to Journey of relational to distributed and LSM. This article will often contain below words. I will try to cover its actual meaning as they will come. This article also assumes that you know about LLM ( like chatgpt) models –

  • Large language models
  • TiDB (vector supported distributed database)
  • high dimensional data
  • vector
  • semantic search
  • knowledge retrieval
  • RAG( Retrieval augmented generation )
  • Euclidean distance
  • cosine similarity
  • ANN
  • High dimensional indexes like HNSW
  • methods to perform similar searches

Let us try to understand from a use case which I was discussing with my colleagues a few days ago, and then I thought to put it down in my article in a very simple language so that everyone can understand it.

If I ask, what are the attributes of your favourite car, what will you answer ? And also let’s allocate a random number to every attribute instead of a name :

Let’s think about Honda City of white colour. What things does my Honda City have?

  1. HP of engine : 1500
  2. Car colour : 0001 ( white colour)
  3. Car length : 75
  4. Car Width : 40
  5. Number of gears : 6
  6. Cruise feature : 1
  7. Maximum speed : 400 (Kmph)
  8. Pickup speed :60
  9. Seat colour : 010
  10. Ambience light : 101
  11. Airbags :6
  12. Speakers :4
  13. Twitter : 2

    How I selected these attributes and allocated these numbers ?

Now how I selected these attributes and on what basis I allocated these numbers is a separate discussion, which is related to training models but for now you can think that it’s random and there is no such rule that only specific numbers can be provided to a specific attribute.

Going back to the topic, if I combine all these numbers, it looks like a list of numbers :

{1500,0001,75,40,6,1,400,60,010,101,6,4,2}

This becomes a vector representation of my Honda City, and every attribute refers to a dimension. If I keep on adding more attributes of the car, it becomes high-dimensional data, i.e it has got high number of attributes.

Do I need a separate data type for this ?

Yes, like varchar, we do have a VECTOR data type. See below. Please note that database I used here is TiDB

mysql> show create table cars;
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                         |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| cars  | CREATE TABLE `cars` (
  `name` char(10) DEFAULT NULL,
  `car_vector` vector DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Is this data type available in traditional databases ?

Storing these array or list of numbers or vector in traditional databases like Postgres/MySQL as varchar , is possible, but if I try to find similar products then I can’t perform a similarity search. One obvious question comes to mind about using “Like operator“, but that performs text-based searching on a specific pattern and can’t give similar data.

Let’s go into more detail on implementing it in traditional databases and understand why it won’t work-

Let’s add one more car. Say, you have a Red Hyundai Verna whose vector representation can be –

verna – {1500,0011,75,40,6,1,300,50,100,110,4,3,1}

and my city was – {1500,0001,75,40,6,1,400,60,010,101,6,4,2}

If we perform the euclidean distance ( to search for similar products ) then we will see that both items are quite close and thus can be called similar. I can use multiple mathematical functions to do this, but the real problem is SCALE. You cannot handle multiple such operations at the scale of such a million records.

mysql> insert into cars values ('city_sedan','[1500,1,75,40,6,1,400,60,10,101,6,4,2]');
Query OK, 1 row affected (0.02 sec)

mysql> insert into cars values ('vern_sedan','[1500,1,79,40,6,1,450,60,10,101,3,2,1]');
Query OK, 1 row affected (0.01 sec)

That’s where the vector database knocks in ? Can we use TiDB, which is a distributed database and supports vector datatype with HNSW, ANN , euclidean distance and cosine ?

They calculate the similarity search using multiple methods, but not limited to below.

  1. Search on the same 2D-3D axis or multi-dimensional axis.
  2. Magnitude search ( not only subtraction of numbers ) i.e cosine search

Let’s try to perform some search using euclidean distance in TiDB, which works on below fundamental.

(vectorA1- vectorA2)^2 +  (vectorB1- vectorB2)^2
sqroot(vector A + vector B)

Think about a situation wherein you need to find a car, which is similar to your given vector input. I have decided to give an input of high end speed cars –

mysql> SELECT name,VEC_L2_DISTANCE(car_vector,'[5000,10,19,400,60,10,330,600,100,1001,30,20,10]') from cars;
+------------+--------------------------------------------------------------------------------+
| name       | VEC_L2_DISTANCE(car_vector,'[5000,10,19,400,60,10,330,600,100,1001,30,20,10]') |
+------------+--------------------------------------------------------------------------------+
| city_sedan |                                                             3674.4128782704865 |
| vern_sedan |                                                             3675.8008651176956 |
+------------+--------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Above is the distance of the input from the available vectors. Now here we can do the filtering :

mysql> SELECT name,VEC_L2_DISTANCE(car_vector,'[5000,10,19,400,60,10,330,600,100,1001,30,20,10]') from cars where VEC_L2_DISTANCE(car_vector,'[5000,10,19,400,60,10,330,600,100,1001,30,20,10]') < 100;
Empty set (0.00 sec)

This means there is no car we have which is near 100 distance. This value of 100 is specific to the need that at what threshold we want to setup our similar search.

Not all databases which support vector datatype can be called as fully vector DB.

For example MySQL 9 supports VECTOR as a data type, but still it needs time to get mature as it can’t be completely called as vector database because it still cant scale well to millions and billions of records and also the absence of required indexes for vector search makes it less efficient for billions of search. Moreover, what MySQL supports as a part of the method to perform similarity search is Euclidean distance but not COSINE support ( which is good for DNA search ).

So what we learnt until now –

  1. What is similarity search ?
  2. How these searches are being implemented ?
  3. Why traditional databases are not efficient for these purposes even though they are supporting it ?
  4. What is vector and methods to do it ?
  5. Indexes for vector data types

Can’t we use B-tree indexes on these data types ?

We all worked on Btree, but it is not well suited for similarity search at such a large scale, and this is where HNSW search knocks in, and it is blazing fast because the amount of comparison is very, very less, and it doesn’t focus on exact match but similar by creating layers.

So now going back to our question “what databases are well suited for LLM” . LLM ( like chat gpt or deepseek) uses RAG to get more data from knowledge, and this knowledge needs to be stored, and this store can be vector.

In 2nd phase of this article, I will do the implementation on TiDB to see how it actually works.

Some very basics of LSM tree….

Basics

This article is in continuation of my previous article on my journey from relational to distributed database. Also I talked about probabilistic data structure here.

While discussing distributed databases with one of my colleague, I realised that we often miss focusing on basics and then I decided to publish this article in most simplest form. However it’s my long pending article which I already wrote few years ago but never published it.

Let’s talk about LSM tree.

In this type of file system, data will be keep on appending to the existing file i.e. it will not overwrite the data ( like we do with b-trees). Data first will be written in the memory ( known as mem tables ) only upto a defined threshold say 10M and once it reaches the limit, the data will be flushed.

LSM uses the concept of LFS which just append the changes (update/insert/delete) i.e. it won’t replace or change the existing record making it as a good option for heavy write work loads in database scenario.

Log structured means that I am writing some logs without modifying the related previous data and just keep on writing one after the other.

Merge is an algorithm to merge the data which is managing this log data.

Tree means different levels of the data i.e. storing the data in a hierarchical order. We can think of our computer hardisk in which we have partitioned the data into different level . The more upper level (having less data) is faster to access and as we go more deeper i.e. lower level is time consuming task (containing more data )

LSM works on different levels which means it maintains data at different levels. Lets try to understand by referring below points.

1st stage and later in writing

1st stage is the memory stage wherein we are having memtables or the memory area ( have explained in later section ). Here data needs to be in the sorted order as per the primary key defined in the table

L0 – level at the storage which contains the data being flushed from the memtable to a stable file in a sorted order. This level contains various sstables ( i.e sorted string tables and known as SST ) and in this level we may experience some duplicate keys (for example a,b,c in s1 table and a,b,c,d is s2 table wherein a,b,c are duplicate keys) . This duplication occurs because of the plain dump received from the memtable and any duplicate data will be deleted at the later stage as a background task.

The SSTable files at Level 1 and above are organized to contain only non-overlapping keys by design.

Have you also heard the word COMPACTION ? let’s understand, what’s that .

COMPACTION is a process wherein SST from level 0 get further combined to form lesser number of SST by removing the overlapping keys/deleted data/ data which has been changed. Obviously this process consume CPU.

What about read operation in LSM tree ?

For read purpose LSM uses binary search algorithm ( which i’m not going to cover in detail as it would be out of context). If we don’t run compaction, it will eventually be impacting the read performance as you can see that the seek will have to go through multiple levels of SSTs.

Did I mention about memtable ?

MemTable is an in-memory data structure holding data before they are flushed to SST files (on disk). It serves both read and write – new writes always insert data to memtable, and reads has to query memtable before reading from SST files, because data in memtable is newer. Once a memtable is full, it becomes immutable and replaced by a new memtable. A background thread will flush the content of the memtable into a SST file, after which the memtable can be destroyed

What all databases uses this ?

Databases like MyRocks, TiDB, Cassandra often uses LSM tree.

Im keeping this article short and concise without writing much more detail as purpose for this article is to make user understand the basics. Writing more content will defeat its purpose.

Hash partitioning and its role in distributed systems – 1

Have you ever divided a single physical book into rather sub-books because the book was too heavy and it is not even easy to quickly navigate to a specific page?

Similarly, partitioning a table is like you are dividing one physical table into various sub-tables (on the same storage and same data directory) to have better storage management, better performance (in some cases), and better archival policies. We can partition a table in various ways. The process is, first we have to choose a partitioning key or partitioning keyS i.e. a composite partition key in which multiple columns are involved. By partitioning key I mean we need to choose a column/columns basis on which we can divide the table.

Let’s say we have the below popular table employees_range. This table is divided into two sub-parts i.e. p0 and p1. If you insert a record with an id less than 100, that particular record will insert in the p0 partition. This is the simplest form of partitioning we can have in our database server and is known as range partitioning. However such type of partitioning has already been covered by multiple articles and is not a topic to be covered under this article.

Whenever you are creating a partition, table structure matters a lot. For example, if you have a primary key and a unique key in the table and you are planning to partition your table, then the partition key should be present in all the keys i.e. either a unique key or as a primary key.

This blog is focused on hash partitioning. If you will understand this, you can also replicate your understanding in the distributed systems where database is sharded into several shards and you can also understand that how data is being written into the respective shards.

Hash partition :

create table t1 (col1 int not null,col2 date not null,col3 int not null,col4 int not null,unique key(col1,col2,col3)) partition by hash(col3) partitions 4;

This will divide the table into 6 different sub tables.Numeric 6 means the number of partitions we need. In hash partitioning we doesn’t define the path of data explicitly (by path of data I mean, in which partition, a particular insertion shall take place) as we do in the range partitioning. Hash is a function in computer science/math which transforms an arbitrary string into a fixed size of output.

Now hashing can be of different types.

MySQL, by default uses modulous hashing (linear hashing is different) i.e. x % y where x is the value of partitioning key and y is the number of partition . For example 9 % 5 which means divide 9 by 5 will give remainder as 4 so that will be the hashing. Similarly in above example hashing will be based on the col3 values. Lets try to insert some value

mysql> insert into t1 values (1,NOW(),300,1);
Query OK, 1 row affected, 1 warning (0.00 sec)
mysql> insert into t1 values (1,NOW(),191,1);
Query OK, 1 row affected, 1 warning (0.01 sec)

mysql> insert into t1 values (1,NOW(),0,1);
Query OK, 1 row affected, 1 warning (0.00 sec)



mysql> select * from t1 partition(p3);
+------+------------+------+------+

| col1 | col2       | col3 | col4 |

+------+------------+------+------+

|    1 | 2024-04-01 |  191 |    1 |

+------+------------+------+------+

1 row in set (0.00 sec)

mysql> select * from t1 partition(p0);
+------+------------+------+------+

| col1 | col2       | col3 | col4 |

+------+------------+------+------+

|    1 | 2024-04-01 |    0 |    1 |

|    1 | 2024-04-01 |  300 |    1 |

+------+------------+------+------+
2 rows in set (0.00 sec)


300/4 = 0 , so that means data will insert in the first partition.
191/4=3, this will go in 3rd partition
0/4=0, this will go in first partition.

This is the most simplest form of partitioning you will encounter. Same concept can also be applied on multi-tenant databases where databases or tables has been sharded into different servers. See this diagram below


Choosing the definition of hashing can surely impact the write speed of query. If hashing is complex, it will affect the performance. Lets try to use some complex hashing algorithm in the next section.

Latest Articles

Some very basics of LSM tree….

This article is in continuation of my previous article on my journey from relational to distributed database. Also I talked about probabilistic data structure here. While discussing distributed databases with one of my colleague, I realised that we often miss focusing on basics and then I decided to publish this article in most simplest form.…