Long ago, I presented a talk on MyRocks and its use cases. In this blog, I am discussing bloom filters in the nutshell and how it is been used. I must say that this blog has been influenced by my mentor Yves Trudeau who has already published a very good description of bloom filters.
Bloom filter is not related to only MyRocks/RockDB but it is also being used in databases like TiDB ( distributed database ), Cassandra, Redis and Oceanbase DB. In other words, it is being widely used in the databases which use the LSM tree concept i.e. log-structured merge tree. Bloom filters keeps data in the cache or in memory to avoid seeking data from the storage, To understand the bloom filters, we need to understand the following terms :
- Probabilistic data structure.
- Space efficient.
- The hash function and its types.
- Bit array
Probabilisitc data structure – It means that data may or may not reside in the storage/set. In other words there is no guarantee of the required data existence. Take a use case wherein you have ample amount of data and you need to find a particular record. Searching for this particular record can take some time but if we have a case wherein we can have the probability of finding the required data, we can save the time and have assumption that we are in correct direction.
Hash function – Hashing is a concept which transforms your random size of input (string for example) into a fixed size of output. Now, these hash values has various use cases. One of the use case is in security which we also call as encryption. Additional use case is to search for the data. If I talk about MySQL, I have various hash functions. For example md5,sha256.
Bit array – As name suggest, it is an array of bits i.e. 0/1. So if I say, I need the array of 8 bits, then I mean the combination of 0 and 1. For bloom filter, after calculating the hash of some input, we need to have its decimal representation of that output so that we can mark its entry in the array. For example, 24 and then we will mark the index entry of 24 as 1 in the bit array. We will see further how bloom filter actually do this.
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
Space efficient – It is a method to use minimum amount of memory to store the data/key/sets. Since bloom filter uses the bits array or bitmaps, we can say it is a space efficient method.
How bloom filter works ?
In its very basic, bloom filters create a hash key of the element/string, transform it into the decimal representation and then add it into the array.
For an example, take a string “contra” and try to create its hash using any of the available function. Here I have used md5
mysql> select md5('contra');
+----------------------------------+
| md5('contra') |
+----------------------------------+
| a83f0f76c2afad4f5d7260824430b798 |
+----------------------------------+
1 row in set (0.00 sec)
In this example, I will create the bloom filter of size 32 bits i.e. 4 bytes. Lets try to transform it using mathematical operators :
mysql> select mid('a83f0f76c2afad4f5d7260824430b798',4,10);
+----------------------------------------------+
| mid('a83f0f76c2afad4f5d7260824430b798',4,10) |
+----------------------------------------------+
| f0f76c2afa |
+----------------------------------------------+
1 row in set (0.00 sec)
mysql> select conv('f0f76c2afa',16,10);
+--------------------------+
| conv('f0f76c2afa',16,10) |
+--------------------------+
| 1034943212282 |
+--------------------------+
1 row in set (0.00 sec)
mysql> select mod(1034943212282,32);
+-----------------------+
| mod(1034943212282,32) |
+-----------------------+
| 26 |
+-----------------------+
1 row in set (0.00 sec)
mysql>
Set the 26th index entry in bitmap as 1.
How bloom filter helps in locating any entry ?
In case of read, if client needs to read this entry, database will first check use the bloom filter, by hashing the “contra“, and checking if 26 value is 1.
In other case, it is also possible that some other key also has same bit set to 1 i.e. 26th. In this case, if client is looking for contra and we haven’t included the contra in the set then still we will be notified that the data exist but when searched, don’t wont exist. Hence this will be called as false positive.
How can we make bloom filter useful ?
In my example, I have used 4 bytes and set only one bit. To make bloom filters more suitable to use, we can set 2 bits instead of only one and can increase the size of bllom filter from 4 to 8 bytes. But there are caveats, behind increasing the size of bloom filters which I am not covering in this article.
How different database products uses bloom filters ?
Talking about rockdb, it create bloom filter for every SST file . More can be read here.
Talking about redis, we can add keys into the bloom filters.
This blog is not covering how bloom filters works in the respective database.
Conclusion :
Bloom filter is an important concept in the LSM tree based databases and can be useful to quickly find the respective keys. In this article I have tried to cover the concept of bloom filters. However, one should use it carefully and allocating the large memory won’t always help.
Disclaimer
This is my personal blog and none of my employer is involved in it. Hence thoughts are my own.
Other articles :
What type of databases are well suited for LLM ? part 1
We all know vector data type but do you know its scale os usage, lets see from basics.
Some very basics of LSM tree….
This article is in continuation of my previous article on my journey from relational to distributed database. Also I talked about probabilistic data structure here. While discussing distributed databases with one of my colleague, I realised that we often miss focusing on basics and then I decided to publish this article in most simplest form.…
Journey from Relational to Distributed databases
This article covers how I moved from Relational to distributed databases and how I started learning in detail.
