A Brief Introduction to TiDB and the Data Structures and Storage of TiKV | JD Logistics Technical Team

1 Overview

TiDB is an open-source distributed relational database independently designed and developed by PingCAP. It is an integrated distributed database product that supports both online transaction processing and online analytical processing (Hybrid Transactional and Analytical Processing, HTAP). It offers important features such as horizontal scalability, financial-grade high availability, real-time HTAP, a cloud-native distributed architecture, and compatibility with the MySQL 5.7 protocol and the MySQL ecosystem. Its goal is to provide users with a one-stop solution covering OLTP (Online Transactional Processing), OLAP (Online Analytical Processing), and HTAP. TiDB is suitable for scenarios that require high availability, strong consistency, and large data volumes.

To sum up, TiDB is a distributed database that is highly compatible with MySQL and has the following features:

  • Highly compatible with MySQL: if you know MySQL, you can start using TiDB with almost no learning curve
  • Horizontal elastic scaling: adaptive scaling, based on the Raft protocol
  • Distributed transactions: pessimistic locking, optimistic locking, and causal consistency
  • True financial-grade high availability: based on the Raft protocol
  • One-stop HTAP solution: a single database supports both OLTP and OLAP and can handle real-time analytical processing

Among these, the core features of TiDB are horizontal scalability and high availability.

This article starts from the various components of TiDB to understand its basic architecture, then focuses on its storage architecture design, exploring how it organizes data and how each row of a table is stored in memory and on disk.

2 Components

First, look at a TiDB architecture diagram, which includes TiDB, Storage (TiKV, TiFlash), TiSpark, and PD. Among them, TiDB, TiKV, and PD are core components; TiFlash and TiSpark are components for handling complex OLAP workloads.
TiDB is the entry point for MySQL syntax, and TiSpark is the entry point for Spark SQL.

2.1 TiDB Server

The SQL layer, which exposes the MySQL protocol connection endpoint, is responsible for accepting client connections, parsing and optimizing SQL, and finally generating a distributed execution plan.

The TiDB layer itself is stateless. In practice, multiple TiDB instances can be started, and a unified access address is exposed through a load-balancing component (such as LVS, HAProxy, or F5), so that client connections are evenly distributed across the TiDB instances to achieve load balancing. TiDB Server itself does not store data; it only parses SQL and forwards the actual data read requests to the underlying storage nodes, TiKV (or TiFlash).

2.2 PD (Placement Driver) Server

PD is the meta-information management module of the entire TiDB cluster. It is responsible for storing the real-time data distribution of each TiKV node and the overall topology of the cluster, serving the TiDB Dashboard management interface, and allocating transaction IDs for distributed transactions.

PD not only stores meta information, but also issues data-scheduling commands to specific TiKV nodes according to the real-time data distribution reported by the TiKV nodes; it can be said to be the "brain" of the entire cluster. In addition, PD itself consists of at least 3 nodes and provides high availability. It is recommended to deploy an odd number of PD nodes.

2.3 Storage nodes

2.3.1 TiKV Server

Responsible for storing data. From the outside, TiKV is a distributed Key-Value storage engine that provides transactions.

The basic unit for storing data is the Region. Each Region is responsible for storing the data of a Key Range (a left-closed, right-open interval from StartKey to EndKey), and each TiKV node serves multiple Regions.

TiKV's API natively supports distributed transactions at the Key-Value level and provides SI (Snapshot Isolation) as the default isolation level; this is also the core of TiDB's support for distributed transactions at the SQL level.

After TiDB's SQL layer completes SQL parsing, it converts the SQL execution plan into actual calls to the TiKV API, so the data is ultimately stored in TiKV. In addition, data in TiKV is automatically maintained in multiple replicas (three by default), which naturally provides high availability and automatic failover.

2.3.2 TiFlash

TiFlash is a special type of storage node. Unlike ordinary TiKV nodes, TiFlash stores data in columnar form, and its main role is to accelerate analytical scenarios. If the use case involves massive data that requires statistical analysis, a TiFlash replica can be created for the data table to speed up such queries.

The above components cooperate to let TiDB store massive amounts of data while providing high availability, transactions, and good read and write performance.

3 Storage Architecture

3.1 TiKV model

In the TiDB architecture described above, there are two kinds of storage nodes: TiKV and TiFlash. TiFlash is implemented as a columnar store; its design is similar in spirit to ClickHouse. This chapter mainly discusses the implementation of TiKV.

In the figure above, the TiKV node is the storage component of TiDB for OLTP scenarios, while TiFlash serves the corresponding OLAP scenarios. TiKV chooses the Key-Value model as its data storage model and provides ordered traversal for reads.

TiKV data storage has two key points:

  1. It is a huge Map (think of a HashMap) that stores Key-Value pairs.
  2. The Key-Value pairs in this Map are ordered by the binary order of the Key; that is, you can Seek to the position of a certain Key and then call Next repeatedly to obtain the Key-Value pairs larger than that Key in increasing order (see the sketch below).
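
To make these two points concrete, the following Go sketch (the types and method names are illustrative, not TiKV's actual API) shows an ordered Key-Value map whose keys are kept in binary order and which supports Seek plus repeated Next calls:

package main

import (
    "bytes"
    "fmt"
    "sort"
)

// kvPair is one Key-Value pair; keys are compared by their binary order.
type kvPair struct {
    key, value []byte
}

// orderedKV is an illustrative in-memory ordered map, not TiKV's real structure.
type orderedKV struct {
    pairs []kvPair // kept sorted by key
}

// Put inserts or overwrites a key while keeping the slice sorted.
func (m *orderedKV) Put(key, value []byte) {
    i := sort.Search(len(m.pairs), func(i int) bool {
        return bytes.Compare(m.pairs[i].key, key) >= 0
    })
    if i < len(m.pairs) && bytes.Equal(m.pairs[i].key, key) {
        m.pairs[i].value = value
        return
    }
    m.pairs = append(m.pairs, kvPair{})
    copy(m.pairs[i+1:], m.pairs[i:])
    m.pairs[i] = kvPair{key: key, value: value}
}

// Seek returns the position of the first key >= the given key.
func (m *orderedKV) Seek(key []byte) int {
    return sort.Search(len(m.pairs), func(i int) bool {
        return bytes.Compare(m.pairs[i].key, key) >= 0
    })
}

// Next returns the pair at position i and the next position; ok is false at the end.
func (m *orderedKV) Next(i int) (kvPair, int, bool) {
    if i >= len(m.pairs) {
        return kvPair{}, i, false
    }
    return m.pairs[i], i + 1, true
}

func main() {
    m := &orderedKV{}
    m.Put([]byte("t10_r2"), []byte("row 2"))
    m.Put([]byte("t10_r1"), []byte("row 1"))
    m.Put([]byte("t10_r3"), []byte("row 3"))

    // Seek to the first key >= "t10_r2", then iterate in increasing key order.
    for i := m.Seek([]byte("t10_r2")); ; {
        p, next, ok := m.Next(i)
        if !ok {
            break
        }
        fmt.Printf("%s -> %s\n", p.key, p.value)
        i = next
    }
}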

Note that the KV storage model of TiKV described here has nothing to do with the Table concept in SQL; the two should not be confused.

Inside the TiKV node in the figure there are the concepts of Store and Region, which make up the high-availability scheme that TiDB implements with the Raft algorithm; they are analyzed in detail below.

3.2 Row storage structure of TiKV

When using TiDB, you read and write with the traditional concept of a "table". In a relational database a table may have many columns, while TiDB organizes data as Key-Value pairs, so we need to consider how to map each column of a row record into a Key-Value pair.

First of all, OLTP scenarios involve a large number of insert, delete, update, and point-query operations on single rows or small numbers of rows, which require the database to read a row of data quickly. Therefore, the corresponding Key should ideally contain a unique ID (explicit or implicit) for fast location.

Second, many OLAP-style queries require full table scans. If the keys of all rows in a table can be encoded into a range, the task of full table scan can be efficiently completed through range query.

3.2.1 KV mapping of table data

The mapping between table data and Key-Value pairs in TiDB is designed as follows:

  • To ensure that data of the same table is kept together for easy lookup, TiDB assigns each table a table ID, denoted TableID, which is an integer and globally unique.
  • TiDB assigns each row of data a row ID, denoted RowID, which is an integer and unique within the table. If the table has an integer primary key, the primary key value is used as the RowID.

Based on the above rules, the generated Key-Value pair is:

Key:  tablePrefix{TableID}_recordPrefixSep{RowID} 
Value: [col1,col2,col3,col4]

Among them, tablePrefix and recordPrefixSep are specific string constants used to distinguish this data from other data in the Key space.

In this example the Key is constructed entirely from the RowID, which can be compared to MySQL's clustered index.
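
To make the encoding rule concrete, here is a minimal Go sketch that builds a row key in the readable form used by this article's examples (t{TableID}_r{RowID}); the function name is illustrative, and real TiDB actually uses a binary, memory-comparable encoding rather than plain decimal strings.

package main

import "fmt"

// encodeRowKey builds a row key in the readable form used in this article:
// tablePrefix{TableID}_recordPrefixSep{RowID}. The "_" is only for readability;
// real TiDB encodes TableID and RowID in a memory-comparable binary format.
func encodeRowKey(tableID, rowID int64) string {
    return fmt.Sprintf("t%d_r%d", tableID, rowID)
}

func main() {
    fmt.Println(encodeRowKey(10, 1)) // t10_r1, whose value would hold the whole row
}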

3.2.2 KV mapping of index data

For ordinary (secondary) indexes, MySQL has the concept of non-clustered indexes. In InnoDB in particular, a secondary index is a B+Tree whose leaf nodes record the primary key, and the final row data is obtained by going back to the table (a lookup on the clustered index).

TiDB also supports index creation, so how is index information stored? It supports both primary keys and secondary indexes (unique and non-unique), and the mapping is similar to that of table data.

The design is as follows:

  • TiDB assigns each index in a table an index ID, denoted IndexID.
  • For primary keys and unique indexes, the RowID must be quickly located from the key value, so the RowID is stored in the value.

Therefore, the generated Key-Value pair is:

Key:   tablePrefix{TableID}_indexPrefixSep{IndexID}_indexedColumnsValue
Value: RowID

Since indexedColumnsValue, i.e. the value of the queried field, is part of the key, the key can be matched exactly or by prefix. The RowID stored in the value is then used to look up the corresponding row record in the table-data mapping.

For ordinary (non-unique) indexes, one key value may correspond to multiple rows, so the matching RowIDs are found by querying a range of keys.

Key:   tablePrefix{TableID}_indexPrefixSep{IndexID}_indexedColumnsValue_{RowID}
Value: null

According to the field value, the list of matching keys can be retrieved, and the row records can then be obtained from the RowIDs contained in the keys.
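
As with the row keys, the index keys can be sketched in the same readable form. The two helper functions below are illustrative only (real TiDB uses its binary, memory-comparable encoding); they show that a unique index keeps the RowID in the value, while a non-unique index appends the RowID to the key:

package main

import "fmt"

// encodeUniqueIndexKey builds the key for a primary key or unique index:
// the indexed value is part of the key, and the RowID is stored in the value.
func encodeUniqueIndexKey(tableID, indexID int64, indexedValue string) string {
    return fmt.Sprintf("t%d_i%d_%s", tableID, indexID, indexedValue)
}

// encodeNonUniqueIndexKey appends the RowID to the key so that several rows
// with the same indexed value still produce distinct keys; the value is null.
func encodeNonUniqueIndexKey(tableID, indexID int64, indexedValue string, rowID int64) string {
    return fmt.Sprintf("t%d_i%d_%s_%d", tableID, indexID, indexedValue, rowID)
}

func main() {
    fmt.Println(encodeUniqueIndexKey(10, 1, "30"))       // the value would hold the RowID
    fmt.Println(encodeNonUniqueIndexKey(10, 1, "30", 3)) // t10_i1_30_3, value is null
}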

3.2.3 Constant strings in maps

The tablePrefix, recordPrefixSep, and indexPrefixSep in the encoding rules above are string constants used to distinguish this data from other data in the Key space. They are defined as follows:

tablePrefix     = []byte{'t'}
recordPrefixSep = []byte{'r'}
indexPrefixSep  = []byte{'i'}

In the above mapping, all rows of a table share the same Key prefix, and all entries of an index also share the same prefix, so data with the same prefix is arranged together in TiKV's Key space.

Therefore, as long as the suffix encoding is designed to be stable and order-preserving, table data and index data are stored in TiKV in an orderly manner, and this order is exactly what makes reads efficient.
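
Because every key of a table shares the same prefix, a full table scan reduces to one range query. The sketch below builds an approximate [StartKey, EndKey) pair in the readable encoding used here; the exact bounds are an assumption for illustration, since real TiKV computes them from its binary key encoding.

package main

import "fmt"

// tableRecordRange returns a [StartKey, EndKey) pair covering all row keys of
// one table in this article's readable encoding: every row key of table 10
// starts with "t10_r", so scanning ["t10_r", "t10_s") visits exactly that
// table's rows. Caveat: with plain decimal RowIDs, "t10_r10" sorts before
// "t10_r2"; real TiDB avoids this by encoding integers in a memory-comparable
// binary form so that byte order matches numeric order.
func tableRecordRange(tableID int64) (start, end string) {
    start = fmt.Sprintf("t%d_r", tableID)
    end = fmt.Sprintf("t%d_s", tableID) // 's' is the byte right after 'r'
    return start, end
}

func main() {
    start, end := tableRecordRange(10)
    fmt.Printf("full table scan = range scan over [%s, %s)\n", start, end)
}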

3.2.4 Examples

Suppose a table in the database is as follows:

CREATE TABLE User (
    ID int,
    Name varchar(20),
    Role varchar(20),
    Age int,
    PRIMARY KEY (ID),
    KEY idxAge (Age)
);

There are 4 rows in the table:

1, "TiDB", "SQL Layer", 10
2, "TiKV", "KV Engine", 20
3, "PD", "Manager", 30
4, "TiFlash", "OLAP", 30

This table has a primary key ID and a common index idxAge, which corresponds to the column Age.

Assuming that the table's TableID=10, its table data is stored as follows:

t10_r1 --> ["TiDB", "SQL Layer", 10]
t10_r2 --> ["TiKV", "KV Engine", 20]
t10_r3 --> ["PD", "Manager", 30]
t10_r4 --> ["TiFlash", "OLAP", 30]

Its common index idxAge is stored as follows:

t10_i1_10_1 --> null
t10_i1_20_2 --> null
t10_i1_30_3 --> null
t10_i1_30_4 --> null
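
The listed keys can be reproduced with a short Go program that applies the two mapping rules to the four example rows (again using the readable key form, purely for illustration):

package main

import "fmt"

// userRow is one row of the User table from the example.
type userRow struct {
    ID   int64
    Name string
    Role string
    Age  int64
}

func main() {
    const tableID, idxAgeID = 10, 1
    rows := []userRow{
        {1, "TiDB", "SQL Layer", 10},
        {2, "TiKV", "KV Engine", 20},
        {3, "PD", "Manager", 30},
        {4, "TiFlash", "OLAP", 30},
    }
    for _, r := range rows {
        // Table data: the whole row is the value of the row key.
        fmt.Printf("t%d_r%d --> [%q, %q, %d]\n", tableID, r.ID, r.Name, r.Role, r.Age)
    }
    for _, r := range rows {
        // Non-unique index idxAge: the RowID is appended to the key, value is null.
        fmt.Printf("t%d_i%d_%d_%d --> null\n", tableID, idxAgeID, r.Age, r.ID)
    }
}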

3.3 SQL and KV mapping

TiDB's SQL layer, namely TiDB Server, is responsible for translating SQL into Key-Value operations, forwarding them to the shared distributed Key-Value storage layer TiKV, assembling the results returned by TiKV, and finally returning the query results to the client.

For example, an SQL statement such as "select count(*) from user where name='tidb';" is executed in TiDB as follows:

  1. According to the table name, all RowIDs of the table, and the Key encoding rules for table data, construct a left-closed, right-open interval [StartKey, EndKey).
  2. Read the data in the interval [StartKey, EndKey) from TiKV.
  3. For each row read, filter out the rows with name='tidb'.
  4. Count the remaining rows to compute count(*) and return the result.

In a distributed environment, to improve retrieval efficiency, the actual execution pushes the name='tidb' filter and the count(*) aggregation down to each node of the cluster to reduce unnecessary network transmission; each node computes a partial count(*), and the SQL layer then sums the partial results.
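
The following is a hedged sketch of the pushdown idea, not TiDB's actual executor code: each storage node applies the filter to its own rows and returns only a partial count, and the SQL layer sums the partial counts.

package main

import "fmt"

// partialCount simulates what one storage node returns after applying the
// pushed-down filter name = 'tidb' to its own share of the rows.
func partialCount(rows []string) int64 {
    var n int64
    for _, name := range rows {
        if name == "tidb" {
            n++
        }
    }
    return n
}

func main() {
    // Each slice stands for the rows held by one node's Regions.
    nodes := [][]string{
        {"tidb", "tikv", "tidb"},
        {"pd", "tidb"},
        {"tiflash"},
    }
    var total int64
    for _, rows := range nodes {
        total += partialCount(rows) // only a single number crosses the network
    }
    fmt.Println("count(*) =", total) // 3
}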

4 RocksDB Persistence

4.1 Overview

The Key-Value pairs described above are only a storage model that exists in memory; any persistent storage engine must eventually save data on disk. TiKV does not write data directly to disk; instead it stores data in RocksDB, and RocksDB is responsible for actually persisting the data.

The reason for this choice is that developing a standalone storage engine requires a lot of work, especially a high-performance one, which needs all kinds of careful optimization. RocksDB is an excellent standalone KV storage engine open-sourced by Facebook that meets TiKV's requirements for a single-node engine. Here you can simply think of RocksDB as a single-node persistent Key-Value Map.

4.2 RocksDB

The interior of a TiKV node is divided into multiple Regions. These Regions serve as data shards and are the basic unit of data consistency; each TiKV node persists the data of its Regions through RocksDB.

Taking the Region as the unit is based on sequential-I/O performance considerations. As for how TiKV organizes the data within a Region so that the shards stay uniform and ordered, the answer is the LSM-Tree; if you have HBase experience, this model will be familiar.

4.2.1 LSM-Tree structure

LSM-Tree (Log-Structured Merge-Tree) literally means "log-structured merge tree". The structure spans disk and memory: by storage medium it consists of the WAL (write-ahead log) on disk, the MemTable in memory, and SST files on disk. The SST files are organized into multiple levels; when the data in one level reaches a threshold, some of its SSTs are selected and merged into the next level. Each level holds roughly 10 times the data of the previous one, so about 90% of the data ends up in the last level.
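
As a quick sanity check of the "each level is 10x the previous one" statement, the short Go snippet below computes the share of the last level, assuming an illustrative 6-level layout (the level count and base size are made up for the example):

package main

import "fmt"

func main() {
    const levels = 6
    const base = 1.0 // capacity of the first level, in arbitrary units
    sizes := make([]float64, levels)
    total := 0.0
    for i := range sizes {
        if i == 0 {
            sizes[i] = base
        } else {
            sizes[i] = sizes[i-1] * 10 // each level holds 10x the previous one
        }
        total += sizes[i]
    }
    // With a 10x growth factor, the last level holds roughly 90% of all data.
    fmt.Printf("last level share: %.1f%%\n", 100*sizes[levels-1]/total)
}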

WAL: the write-ahead log. When a write is performed, the data is first recorded in the WAL on disk so that the contents of memory are not lost on power failure.

Memory-Table (MemTable): an in-memory data structure that holds recent updates. It can be organized with a skip list or a search tree to keep the data ordered. When the MemTable reaches a certain size, it is turned into an immutable MemTable, and a new MemTable is created to handle new writes.

Immutable Memory-Table: an in-memory structure that can no longer be modified; it is the intermediate state between the MemTable and an SSTable. Its purpose is to avoid blocking writes during the flush: new writes go to the new MemTable instead of waiting for the old MemTable to be unlocked.

SST or SSTable (Sorted String Table): an ordered collection of key-value pairs; it is the on-disk data structure of the LSM-Tree. If an SSTable is large, an index can be built on its keys to speed up lookups. There are multiple SSTables, and under the level design, each level contains multiple SSTable files.

4.2.2 LSM-Tree execution process

Write process

  1. When a write arrives, first check whether the active Memory-Table has reached its size threshold; if not, the write can proceed directly.
  2. If an Immutable Memory-Table already exists, wait for its compaction (flush) to finish.
  3. If the Memory-Table is full and no Immutable Memory-Table exists, turn the current Memory-Table into the Immutable Memory-Table, create a new Memory-Table, trigger compaction, and then write.
  4. The write itself goes to the WAL first; once that succeeds it is applied to the Memory-Table, and at that moment the write is complete.

A piece of data passes through the WAL, the Memory-Table, the Immutable Memory-Table, and finally the SSTables, in that order; the SSTable is where the data is ultimately persisted. A write only needs to reach the WAL and the Memory-Table to be considered complete.
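
Below is a minimal conceptual sketch of this write path in Go (not RocksDB's real code; the types and threshold are invented for illustration): append to the WAL first, apply to the active MemTable, and freeze the MemTable into an immutable one when it is full.

package main

import "fmt"

// memTable is a conceptual in-memory table; a real one would be a skip list.
type memTable struct {
    data map[string]string
    size int
}

func newMemTable() *memTable { return &memTable{data: map[string]string{}} }

// lsm is a conceptual LSM-Tree front end: a WAL, one active MemTable, and at
// most one immutable MemTable waiting to be flushed to an SST file.
type lsm struct {
    wal       []string // stands in for the on-disk write-ahead log
    active    *memTable
    immutable *memTable
    threshold int
}

func (t *lsm) put(key, value string) {
    // 1. Write to the WAL first so the update survives a power failure.
    t.wal = append(t.wal, key+"="+value)
    // 2. Then apply it to the active MemTable; the write is now complete.
    t.active.data[key] = value
    t.active.size++
    // 3. If the MemTable is full, freeze it and start a new one; a background
    //    compaction would later flush the immutable MemTable to an SST file.
    if t.active.size >= t.threshold && t.immutable == nil {
        t.immutable = t.active
        t.active = newMemTable()
    }
}

func main() {
    t := &lsm{active: newMemTable(), threshold: 2}
    t.put("t10_r1", "TiDB")
    t.put("t10_r2", "TiKV")
    t.put("t10_r3", "PD")
    fmt.Println("wal entries:", len(t.wal))          // 3
    fmt.Println("frozen entries:", t.immutable.size) // 2
}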

Search process

1. For a target key, search the Memory-Table, the Immutable Memory-Table, and then the SSTables, in that order.
2. The SSTables are divided into several levels, and the search also proceeds level by level.

  • For Level-0, RocksDB searches the files by traversal, so for the sake of lookup efficiency the number of Level-0 files is kept under control.
  • For Level-1 and above, the SSTables within a level do not overlap and are stored in order, so binary search is used to improve efficiency.

To further improve lookup efficiency, RocksDB attaches a Bloom filter to each Memory-Table and SSTable to quickly judge whether a key might be present, reducing the number of structures that must be searched.
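
The lookup order can be sketched as follows; this is a conceptual illustration only, with the Bloom filter reduced to a simple "might contain" set and the per-level search kept as a plain loop rather than the traversal-vs-binary-search distinction described above.

package main

import "fmt"

// sstable is a conceptual sorted file; mightContain stands in for its Bloom filter.
type sstable struct {
    data         map[string]string
    mightContain map[string]bool
}

func (s sstable) get(key string) (string, bool) {
    if !s.mightContain[key] { // Bloom filter says definitely absent: skip the file
        return "", false
    }
    v, ok := s.data[key]
    return v, ok
}

// get searches the MemTable, then the immutable MemTable, then each level of
// SST files, returning the first match (the newest version wins).
func get(key string, mem, imm map[string]string, levels [][]sstable) (string, bool) {
    if v, ok := mem[key]; ok {
        return v, true
    }
    if v, ok := imm[key]; ok {
        return v, true
    }
    for _, level := range levels {
        for _, sst := range level {
            if v, ok := sst.get(key); ok {
                return v, true
            }
        }
    }
    return "", false
}

func main() {
    mem := map[string]string{"t10_r4": "TiFlash"}
    imm := map[string]string{"t10_r3": "PD"}
    l1 := sstable{
        data:         map[string]string{"t10_r1": "TiDB", "t10_r2": "TiKV"},
        mightContain: map[string]bool{"t10_r1": true, "t10_r2": true},
    }
    v, _ := get("t10_r2", mem, imm, [][]sstable{{l1}})
    fmt.Println(v) // TiKV
}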

Delete and update process

When a deletion occurs, there is no need to find the corresponding data on disk and delete it in place, as a B+Tree would.

  1. First, the key is looked up in the Memory-Table and the Immutable Memory-Table using the search process above.
  2. If it is found, the entry is marked as "deleted".
  3. Otherwise, a new entry is appended and marked as "deleted". Until the data is physically removed, later queries will first hit this "deleted" marker.
  4. At some later point, the data is actually removed during compaction.

The update operation is similar to deletion: it only modifies the in-memory structures by writing a marker, and the real update is deferred until compaction. Since these operations happen in memory, read and write performance is maintained.
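
A small conceptual sketch of the tombstone idea (the marker value and structure names are invented for illustration): a delete only writes a marker into the in-memory table, reads treat the marker as "not found", and the physical removal happens during compaction.

package main

import "fmt"

const tombstone = "__deleted__" // illustrative deletion marker

type store struct {
    mem map[string]string // conceptual MemTable
    sst map[string]string // conceptual on-disk data
}

// del writes a tombstone instead of touching the on-disk data.
func (s *store) del(key string) { s.mem[key] = tombstone }

// get finds the newest version first; a tombstone hides any older value.
func (s *store) get(key string) (string, bool) {
    if v, ok := s.mem[key]; ok {
        if v == tombstone {
            return "", false
        }
        return v, true
    }
    v, ok := s.sst[key]
    return v, ok
}

// compact physically drops tombstoned keys, the way a merge would.
func (s *store) compact() {
    for k, v := range s.mem {
        if v == tombstone {
            delete(s.sst, k)
        } else {
            s.sst[k] = v
        }
        delete(s.mem, k)
    }
}

func main() {
    s := &store{mem: map[string]string{}, sst: map[string]string{"t10_r1": "TiDB"}}
    s.del("t10_r1")
    _, ok := s.get("t10_r1")
    fmt.Println("visible after delete:", ok) // false, though still on disk
    s.compact()
    fmt.Println("on-disk entries after compaction:", len(s.sst)) // 0
}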

4.3 Advantages and disadvantages of RocksDB

Advantages

  1. Data is split into blocks of a few hundred MB and written sequentially
  2. Writes land in memory first; combined with the WAL design and sequential writes, write throughput is high and the time complexity is approximately constant
  3. Supports transactions, although the key ranges of L0 data overlap, so support at that level is weaker

Disadvantages

  1. Read and write amplification are significant
  2. Limited ability to absorb sudden traffic peaks
  3. Limited compression
  4. Indexing is less efficient
  5. Compaction consumes considerable system resources and has a noticeable impact on reads and writes

5 Summary

The above is an introduction to the overall architecture of TiDB, with a focus on how TiKV organizes and stores data; its Key-Value design can be compared with MySQL's index structures to see the similarities and differences. TiDB relies on RocksDB for persistence. The LSM-Tree, as an alternative to the B+Tree, focuses on keeping read performance stable while data changes frequently: it targets sequential disk writes and keeps data continuously sorted, trading a certain amount of read and write amplification for stable, ordered reads.

Author: JD Logistics Geng Hongyu

Source: JD Cloud developer community Ziqishuo Tech
