Hash partitioning and its role in distributed systems – 1

Have you ever divided a single physical book into several sub-books because the book was too heavy to carry and it was hard to quickly navigate to a specific page?

Similarly, partitioning a table divides one physical table into several sub-tables (on the same storage and in the same data directory) to get better storage management, better performance (in some cases), and better archival policies. We can partition a table in various ways. First we have to choose a partitioning key, or partitioning keys in the case of a composite partition key where multiple columns are involved. By partitioning key I mean the column (or columns) on the basis of which we divide the table.

Let’s say we have the well-known table employees_range below. This table is divided into two sub-parts, p0 and p1. If you insert a record with an id less than 100, that record will be stored in the p0 partition. This is the simplest form of partitioning we can have in our database server and is known as range partitioning. However, this type of partitioning has already been covered by multiple articles and is not the topic of this one.
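The original table definition is not reproduced here, so below is a minimal sketch of what such a range-partitioned table could look like (the id column and the 100 boundary come from the description above; everything else is assumed):

create table employees_range (
    id int not null,
    name varchar(50)
)
partition by range (id) (
    partition p0 values less than (100),
    partition p1 values less than maxvalue
);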

Whenever you create a partitioned table, the table structure matters a lot. For example, if the table has a primary key and a unique key and you are planning to partition it, then the partitioning key must be part of every unique key, including the primary key.

This blog focuses on hash partitioning. If you understand this, you can carry the same understanding over to distributed systems, where a database is sharded into several shards, and you can also understand how data is written to the respective shards.

Hash partitioning:

create table t1 (
    col1 int not null,
    col2 date not null,
    col3 int not null,
    col4 int not null,
    unique key (col1, col2, col3)
) partition by hash(col3) partitions 4;

This will divide the table into 4 different sub-tables; the number 4 is the number of partitions we need. In hash partitioning we don't define the path of the data explicitly (by path of data I mean which partition a particular insert goes to) as we do in range partitioning. A hash is a function in computer science/math which transforms an arbitrary input into a fixed-size output.
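To verify how the table was split, you can list its partitions from the data dictionary (a standard INFORMATION_SCHEMA query; t1 is the table defined above):

select partition_name, partition_method, table_rows
from information_schema.partitions
where table_schema = database() and table_name = 't1';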

Now hashing can be of different types.

MySQL by default uses modulo hashing (linear hashing is different), i.e. x % y, where x is the value of the partitioning key and y is the number of partitions. For example, 9 % 5 means divide 9 by 5 and take the remainder, which is 4, so the row lands in partition p4. Similarly, in the above example the hashing will be based on the col3 values. Let's try to insert some values:

mysql> insert into t1 values (1,NOW(),300,1);
Query OK, 1 row affected, 1 warning (0.00 sec)
mysql> insert into t1 values (1,NOW(),191,1);
Query OK, 1 row affected, 1 warning (0.01 sec)

mysql> insert into t1 values (1,NOW(),0,1);
Query OK, 1 row affected, 1 warning (0.00 sec)



mysql> select * from t1 partition(p3);
+------+------------+------+------+
| col1 | col2       | col3 | col4 |
+------+------------+------+------+
|    1 | 2024-04-01 |  191 |    1 |
+------+------------+------+------+
1 row in set (0.00 sec)

mysql> select * from t1 partition(p0);
+------+------------+------+------+
| col1 | col2       | col3 | col4 |
+------+------------+------+------+
|    1 | 2024-04-01 |    0 |    1 |
|    1 | 2024-04-01 |  300 |    1 |
+------+------------+------+------+
2 rows in set (0.00 sec)


300 % 4 = 0, so that row is inserted into the first partition (p0).
191 % 4 = 3, so that row goes into the fourth partition (p3).
0 % 4 = 0, so that row goes into the first partition (p0).

This is the simplest form of partitioning you will encounter. The same concept also applies to multi-tenant databases where databases or tables have been sharded across different servers.
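As a hypothetical illustration (the shard count and tenant IDs below are invented for this example), an application can route a tenant to a shard with the same modulo arithmetic MySQL uses for hash partitions:

-- 4 shards; shard = tenant_id % 4
select 1001 % 4 as shard_for_tenant_1001,   -- -> shard 1
       2048 % 4 as shard_for_tenant_2048;   -- -> shard 0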


The choice of hash function can definitely impact write speed: if the hashing is complex, it will affect performance. Let's try a more complex hashing algorithm in the next section.


Mutex & MetaData lock handling with Performance and Sys Schema in MySQL

The thought of writing this blog came to my mind when I was struggling with high CPU and waiting queries in my live production MySQL environment. One case I can easily recall is a night when a physical backup of my database started using MEB/PXB (MySQL Enterprise Backup and Percona XtraBackup), which takes an exclusive lock in the initial and final phases of the backup for a defined amount of time. During this time all queries go into a waiting state, and when I checked SHOW PROCESSLIST, it said "waiting for metadata lock". Another case I can recall is contention among various threads while writing into the binary log cache and the binary log. Such locks are called mutex locks and can be caught from various Performance Schema tables in MySQL. Imagine a real situation: your live database is running with a high number of connections, your load is CPU-bound, your queries are being slowed down by locks already acquired by other transactions, and you want to make sure the business is not impacted. Understanding these types of locks will help you get out of that weird situation.

In a highly concurrent environment where threads compete for resource access, locking needs to be properly implemented across all databases in an instance and across objects including tables, data, caches, buffers, files, etc. In MySQL, locks can exist at the table level as well as the row level (multi-granular locking), and also at the memory level. These locks need to be implemented in order to maintain data consistency and reduce contention. Some locks are user-managed (like shared and exclusive locks) and some are system-managed. Among the system-managed ones we have mutual exclusion, or mutex.

A mutex also works as an exclusive lock: one thread at a time can access a particular memory resource or file (for example, the binary log), so that other threads requesting the same resource don't interrupt its task; once the thread completes its task it releases the lock, and the thread(s) waiting in the queue take over. Hence the main thread acquires the mutex first, performs the required task, and then releases the lock. One example of a mutex is the binary log. With sync_binlog > 1 there is always one leader thread responsible for flushing the binlog cache to disk after n group commits, and the other threads are followers. The leader thread takes the mutex lock and releases it once the data is flushed to disk.

mysql_mutex_lock(&LOCK_log);

    DEBUG_SYNC(leader->thd, "commit_after_get_LOCK_log");
    mysql_mutex_lock(&LOCK_group_commit_queue);
    current= group_commit_queue;
    group_commit_queue= NULL;
    mysql_mutex_unlock(&LOCK_group_commit_queue);

(I have taken this mutex snippet from the Percona MySQL code on GitHub.)

show engine innodb mutex;

The MySQL documentation describes this statement as follows:

“SHOW ENGINE INNODB MUTEX displays InnoDB mutex and rw-lock statistics”

The sample output (not reproduced here) shows all the rw-locks; for a mutex lock, only the mutex is shown.

Apart from this, mutex status can be monitored from the Performance Schema as well. In a later section you will see which instrument is responsible for enabling mutex monitoring and which table holds the mutex data.

for (ulint i = 0; i < n_sync_obj; i++) {
        rw_lock_create(hash_table_locks_key, table->sync_obj.rw_locks + i,
                       level);
      }
(This is a snippet of hash0hash.cc where the rw-lock is created.)

The MySQL documentation mentions that for a mutex only the mutex name is reported, while for an rw-lock the file name and line number are reported as well. I believe they should do the same for mutexes, as it would really help a lot.

For mutexes we have 223 instruments for different purposes, and the corresponding table to check for mutex locks is performance_schema.mutex_instances.

The related instruments for mutexes can be found in the setup_instruments table under the wait/synch/mutex/ prefix and can be enabled dynamically. I am trying to reproduce the mutex locking scenario, which I will add once I have some data. For now, understand that the events_waits_current table in the Performance Schema can help in analysing mutex locks.
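As a quick sketch of enabling these instruments dynamically (standard Performance Schema tables; the matching wait consumers in setup_consumers must be enabled as well):

update performance_schema.setup_instruments
set enabled = 'YES', timed = 'YES'
where name like 'wait/synch/mutex/%';

-- observe the mutex waits currently in flight
select thread_id, event_name, source, timer_wait
from performance_schema.events_waits_current
where event_name like 'wait/synch/mutex/%';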

Metadata locks

What is a metadata lock?

Metadata locks protect a table's structure: if a thread is inside a transaction (i.e. between begin and commit) and another thread wants to alter the structure of the same table, the requesting thread goes into a metadata-lock wait. Such scenarios can really hamper database performance in a production environment. Think of a scenario where your application has started a transaction:

[Screenshots: Session 1 opens a transaction, Session 2 issues an ALTER and blocks, Session 3 runs SHOW PROCESSLIST]
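The screenshots are not reproduced here, so below is a minimal sketch of how this situation can be created (the table name t1_metadata matches the one referenced later in this article):

-- Session 1: open a transaction and read the table; this holds a
-- SHARED_READ metadata lock until commit/rollback
begin;
select * from t1_metadata;

-- Session 2: any DDL on the same table now blocks
alter table t1_metadata add column c2 int;

-- Session 3: the blocked thread shows "Waiting for table metadata lock"
show processlist;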

With the help of SHOW PROCESSLIST it is visible that the requesting thread is waiting for the metadata lock to be released by the other thread. The problem is that it is not easy to locate which thread is holding the lock; not even SHOW ENGINE INNODB STATUS really helps. But the Performance Schema and sys schema are helpful here. Let us see how:

PSI_stage_info MDL_key::m_namespace_to_wait_state_name[NAMESPACE_END] = {
    {0, "Waiting for global read lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for backup lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for tablespace metadata lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for schema metadata lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for table metadata lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for stored function metadata lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for stored procedure metadata lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for trigger metadata lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for event metadata lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for commit lock", 0, PSI_DOCUMENT_ME},
    {0, "User lock", 0, PSI_DOCUMENT_ME}, /* Be compatible with old status. */
    {0, "Waiting for locking service lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for spatial reference system lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for acl cache lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for column statistics lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for resource groups metadata lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for foreign key metadata lock", 0, PSI_DOCUMENT_ME},
    {0, "Waiting for check constraint metadata lock", 0, PSI_DOCUMENT_ME}};


(These are the metadata-lock thread states, which can be found in mdl.cc.)

The corresponding instrument for metadata locks on waiting queries, which is enabled by default in the setup_instruments table, is:

wait/lock/metadata/sql/mdl

There are other instruments as well for metadata locks, and the corresponding table is performance_schema.metadata_locks.

We can find the thread which has acquired the metadata lock for the above transaction:

SELECT * FROM metadata_locks;

This shows only the current rows; entries are purged once the locks are released.
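A slightly more targeted version of that query (same table; the filter simply hides the system schemas that are usually noise):

select object_type, object_schema, object_name,
       lock_type, lock_status, owner_thread_id
from performance_schema.metadata_locks
where object_schema not in ('performance_schema', 'mysql');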

The lock_status column tells whether the lock has been granted or is in a pending state. Here OWNER_THREAD_ID 97 is in PENDING state, and the corresponding lock is held by OWNER_THREAD_ID 80, which holds a SHARED_READ lock on database ankit and table t1_metadata. Taking advantage of the threads table in performance_schema helps us dive deeper by selecting the rows for threads 80 and 97.

As you can see here, thread_id 97 with processlist_id 33 (the connection id from SHOW PROCESSLIST) is waiting for a metadata lock held by thread_id 80.

Hence 80 is the thread which has been holding the lock for 474 seconds, and the corresponding thread has been waiting for 456 seconds. To dive even deeper, it is beneficial to use sys.session.

[Output from sys.session showing both threads]
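The screenshot is not included, but a query along these lines shows the same information (sys.session is a standard sys-schema view):

select thd_id, conn_id, state, time, current_statement
from sys.session;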

Combining all the above information, sys.schema_table_lock_waits can finally provide the way out of this locking situation: it tells us what action to take to resolve the metadata-lock issue via its sql_kill_blocking_query column.
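For example (sql_kill_blocking_query is a column of this standard sys-schema view, holding a ready-made KILL statement):

select object_schema, object_name,
       waiting_thread_id, blocking_thread_id,
       sql_kill_blocking_query
from sys.schema_table_lock_waits;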


Some basic MySQL concepts and variables should be known while handling this in a real environment:

1. lock_wait_timeout
2. Transaction isolation level
3. Query execution time

Understanding how locking works in MySQL is a key concept behind database performance.


Components in MySQL 8

This article is purely based on my understanding of components and the outcome of getting my hands dirty. The references for this article are the official MySQL documentation and my own experience with components.

Oracle introduced components in MySQL 8.0, and a user can find the corresponding table (mysql.component) under the mysql database.


What is a component and why do we need it?

Components are similar to plugins but are more secure and provide improved and additional features. Moreover, plugins will be deprecated in future versions.

If we want audit logs to be enabled, or want a strong password mechanism, a user needs to install the respective plugins. But there are a few challenges:

1. Plugins can interact only with the server, not with each other, which limits their usage.
2. Plugins interact directly with server internals, which means they are not encapsulated.
3. While installing a plugin, a user needs to mention the extension in the file name as well, for example .so or .dll.

These are some of the overheads or disadvantages of plugins that a DBA normally faces. I remember once I needed to install an audit plugin and, being on a Linux platform, it required a ".so" file, so I had to make sure I had the plugin with the ".so" extension.

Components remove the above challenges. They are more secure because they are encapsulated and provide a defined set of services. The major advantages are:

1. Components can interact with each other.
2. With components, a DBA doesn't need to worry about remembering the extension in the file name. For example, the "validate password" component can simply be installed as shown below, after which a user can check whether the component has been successfully installed. Additionally, we can uninstall it if it is not required.

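The screenshot is not reproduced here, but the statements behind it are the standard component-management commands (note that no platform-specific file extension is needed):

install component 'file://component_validate_password';

-- verify the component is registered
select * from mysql.component;

-- uninstall if not required
uninstall component 'file://component_validate_password';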

3. A user can create his own components using existing macros. Some of the headers for the validate_password component are as follows:

#include "validate_password_imp.h"

#include <assert.h>
#include <string.h>

#include <algorithm>
#include <atomic>
#include <fstream>
#include <iomanip>
#include <set>
#include <sstream>

#include "mysql/components/library_mysys/my_memory.h"
#include "mysql/components/services/mysql_rwlock.h"
#include "mysql/components/services/psi_memory.h"
#include "mysqld_error.h"

The respective macros can be found in the MySQL distribution source code.

3.a After this, copy the macros of any existing component to get the basic framework ready.
3.b Copy any of the services (header files) and compile it using the source distribution of the MySQL server.

Currently, the components below are available to use. Not all plugins have component equivalents yet, so plugins can still be used in their place, but in the future plugins will be completely removed.

  1. Error logging
  2. Audit messaging and logs
  3. MySQL Enterprise Data Masking and De-Identification
  4. Password validation
  5. MySQL Enterprise Firewall
  6. Plugin API
  7. Thread pool
  8. Version tokens

Once a component is enabled, a user can see the respective variables using SHOW GLOBAL VARIABLES.

Such variables start with the component name as a prefix. For example, for validate_password, the variables below become visible once the user enables this component:

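(Screenshot not reproduced. The same list can be retrieved with the query below; component system variables carry the component name as a prefix.)

show global variables like 'validate_password%';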

What about the "audit log" component, and how is it useful?

There are various cases when an application needs to append some specific string to the audit logs. In such cases we can use a UDF known as audit_api_message_emit_udf().

  1. First, install the audit component in a similar manner to how we installed the validate password component:

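(Screenshot not reproduced. Assuming the component in question is the audit message emit component that ships with MySQL, the install statement would be:)

install component 'file://component_audit_api_message_emit';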

2. Now we want to check whether we can add a message to the audit log:

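(Screenshot not reproduced. The documented calling convention for this UDF takes a component name, a producer name, the message, and optional key-value pairs, and returns OK on success:)

select audit_api_message_emit_udf('component_text', 'producer_text',
                                  'message_text', 'key1', 'value1');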

If the required arguments are passed, we receive an OK message, which means it was a success. The MySQL documentation lists the arguments required for this UDF.

To sum up: we have other components too, we can build our own, and in the future components will replace plugins. Moreover, if you are using plugins today and planning to move to MySQL 8, you should plan to replace them, as plugins are going to be removed in a future version.

What’s changed in building InnoDB’s B-tree indexes in MySQL 8

In InnoDB, everything is designed as a B-tree. When I say everything, I mean the primary key (clustered index) and the secondary indexes.

But have we ever thought about how MySQL creates an index? Obviously I don't mean the "create index" or "alter table" syntax. Let's start with an example of what I mean. Say I have to create an index on table T1, on column ID, with millions of records in it. In prior versions, MySQL used insert APIs, and index entries were inserted into the B-tree one by one. This is the traditional method of inserting data into a B-tree.

In the old method:

  1. First, a B-tree cursor (btr0btr.c) is opened.
  2. A search begins to find the correct insert position.
  3. If there is space in the page, the entry is inserted into the B-tree page; this is done under optimistic locking.
  4. If there is no space in the page, a pessimistic insert is performed: the B-tree cursor (btr0btr.c) is opened again, the page is split, and merging is done.

This is a top-down approach, and it is costly because of the repeated opening of the B-tree cursor, searching, and splitting and merging of B-tree nodes.

Bottom-up approach:

Let’s divide the new approach into 3 phases to understand it:

  1. Run phase:

a.  In the run phase, the clustered (primary) index is scanned, and the entries read from it are added to a sort buffer. Up to this point no sorting has been done. Once this buffer becomes full, all the records in it are sorted and written to a temporary file (explained further in the second phase of this process). You can tune the buffer size according to your data size to speed up index creation. In addition to the sort buffer size, max_sort_length also needs to be reviewed whenever we change the sort buffer.
Be careful to distinguish the two variables here: sort_buffer_size (the server-level sort buffer, default 262144 bytes) is not the same as innodb_sort_buffer_size (default 1048576 bytes). When building a secondary index online on an InnoDB table, it is innodb_sort_buffer_size that should be considered.

mysql> show global variables like '%sort_buffer%';
+-------------------------+---------+
| Variable_name           | Value   |
+-------------------------+---------+
| innodb_sort_buffer_size | 1048576 |
| myisam_sort_buffer_size | 8388608 |
| sort_buffer_size        | 262144  |
+-------------------------+---------+
3 rows in set (0.01 sec)

2. Second phase:

The multiple temporary files generated in the run phase are merged in this phase using the merge sort algorithm. The number of merge passes can be seen in the Sort_merge_passes status shown below. This status tells how many passes were needed to merge the temporary files; if this number is high, it simply means the sort buffer is too small, and it is a good time to tweak it.

mysql> show global status like '%passes%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Sort_merge_passes | 0     |
+-------------------+-------+
1 row in set (0.11 sec)

3. Third phase:

Once all the files are merged, the sorted records from the second phase are inserted into the B-tree nodes.


What’s next?

Up to this point, these sort and merge operations were single-threaded, but starting from MySQL 8 they are multi-threaded. Think about a situation where you are trying to add a secondary index to an existing table using the online DDL approach and need it done faster, or you are trying to count the number of records in a heavy table.

Now we have three variables:

1. innodb_parallel_read_threads – remember, this applies only to reads of the clustered index (PK).
2. innodb_ddl_threads – remember, this is only for secondary indexes.
3. innodb_ddl_buffer_size

Take the example below to understand how innodb_parallel_read_threads makes the clustered index read faster as we increase the number of threads.

mysql> set local innodb_parallel_read_threads=1;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(1) from fact;
+----------+
| count(1) |
+----------+
| 14531072 |
+----------+
1 row in set (2.12 sec)

mysql> set innodb_parallel_read_threads=4;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(1) from fact;
+----------+
| count(1) |
+----------+
| 14531072 |
+----------+
1 row in set (0.92 sec)

mysql> set innodb_parallel_read_threads=1;
Query OK, 0 rows affected (0.00 sec)

mysql> select count(1) from fact;
+----------+
| count(1) |
+----------+
| 14531072 |
+----------+
1 row in set (2.09 sec)

However, I don't see any significant improvement in CHECK TABLE.

mysql> show variables like '%parallel%';
+------------------------------+----------+
| Variable_name                | Value    |
+------------------------------+----------+
| innodb_parallel_read_threads | 1        |
| replica_parallel_type        | DATABASE |
| replica_parallel_workers     | 0        |
| slave_parallel_type          | DATABASE |
| slave_parallel_workers       | 0        |
+------------------------------+----------+
5 rows in set (0.01 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (17.53 sec)

mysql> set innodb_parallel_read_threads=4;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.23 sec)

mysql> set innodb_parallel_read_threads=8;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.48 sec)

mysql> set innodb_parallel_read_threads=12;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.35 sec)

mysql> show variables like '%parallel%';
+------------------------------+----------+
| Variable_name                | Value    |
+------------------------------+----------+
| innodb_parallel_read_threads | 12       |
| replica_parallel_type        | DATABASE |
| replica_parallel_workers     | 0        |
| slave_parallel_type          | DATABASE |
| slave_parallel_workers       | 0        |
+------------------------------+----------+
5 rows in set (0.00 sec)

mysql> set innodb_parallel_read_threads=20;
Query OK, 0 rows affected (0.00 sec)

mysql> check table fact;
+------------+-------+----------+----------+
| Table      | Op    | Msg_type | Msg_text |
+------------+-------+----------+----------+
| ankit.fact | check | status   | OK       |
+------------+-------+----------+----------+
1 row in set (14.34 sec)


This parallel read applies while scanning the clustered index. Similarly, if you are using 8.0.27 or later, innodb_ddl_threads can be used to speed up the sort and merge operations explained above; it defines the number of threads that perform the sort and merge. Here is the test I performed:

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (19.41 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> set innodb_ddl_buffer_size=10485760;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_ddl_buffer_size=104857600;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_ddl_threads=10;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table fact drop index idx_dim2;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (16.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> set innodb_ddl_threads=12;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_parallel_read_threads=12;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table fact drop index idx_dim2;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (15.89 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> set innodb_ddl_buffer_size=1048576000;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_parallel_read_threads=16;
Query OK, 0 rows affected (0.00 sec)

mysql> set innodb_ddl_threads=16;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table fact drop index idx_dim2;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table fact add index idx_dim2(dim2);
Query OK, 0 rows affected (17.73 sec)
Records: 0  Duplicates: 0  Warnings: 0

With the various combinations, you can see how the performance varies.


What else needs to be considered when creating an index in MySQL?

Now, the question is how much space we can reserve per page for future index growth. For this we have the innodb_fill_factor option, which reserves space for future B-tree index growth. For example, if we set this value to 70, then 30 percent of the space in each page is left free for index growth during a sorted index build.

This setting helps us manage page space. Sometimes, when data becomes fragmented, setting this option appropriately and rebuilding can help in regaining it.
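A minimal sketch of using it (innodb_fill_factor is a dynamic global variable, so no restart is needed; the fact table and idx_dim2 index are the ones used in the tests above):

-- leave 30% of each page free during sorted index builds
set global innodb_fill_factor = 70;

-- rebuild the index so the new fill factor takes effect
alter table fact drop index idx_dim2;
alter table fact add index idx_dim2(dim2);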

Conclusion

This article helps us understand the core process behind how InnoDB creates an index and how things have improved over time in MySQL 8. I used two versions of MySQL to show the results: 8.0.26 and 8.0.32, on macOS.
