TiDB (7): Storage Technology Internals

1 Introduction

The database, the operating system, and the compiler are collectively known as the three major systems and can be considered the cornerstone of all computer software. Among them, the database is closest to the application layer and underpins a great many businesses. After decades of development, this field is still making new progress.

Many people have used databases, but few have implemented one, let alone a distributed database. Understanding the principles and details of database implementation improves one's skills and helps in building other systems, and it also helps in using databases well.

The best way to study a technology is to study an open source project in that field, and databases are no exception. The stand-alone database field has many good open source projects, with MySQL and PostgreSQL being the two most famous, and many people have read their code. In the distributed database field, however, good open source projects are rare. TiDB has attracted widespread attention, and some technology enthusiasts hope to take part in the project. Because of the inherent complexity of a distributed database, many people find it hard to get a good grasp of the whole project, so I hope to write a series of articles, from top to bottom and from the shallow to the deep, about some of TiDB's technical principles, covering both the technologies visible to users and the large number of technical details hidden behind the SQL interface.

2 Saving Data

The most fundamental function of a database is to store data, so we start here.

There are many ways to save data. The simplest is to build a data structure directly in memory to hold the data sent by users, for example an array to which a record is appended each time a piece of data is received. This solution is very simple, meets the most basic needs, and certainly performs well, but it is full of holes. The biggest problem is that the data lives entirely in memory: once the service stops or restarts, the data is permanently lost.

To solve the data-loss problem, we can put the data on a non-volatile storage medium (such as a hard disk). The improved solution is to create a file on disk and append a line to it whenever a piece of data is received. OK, we now have a way to store data persistently. But it is still not good enough: what if the disk develops bad sectors? We can use RAID (Redundant Array of Independent Disks) for redundant storage on a single machine. But what if the whole machine goes down? In a fire, for example, RAID cannot save the data. We can switch to network storage, or replicate the storage through hardware or software. At this point it seems we have solved the data-safety problem and can breathe a sigh of relief. But can consistency between the replicas be guaranteed during replication? In other words, besides ensuring that data is not lost, we also need to ensure that the data is correct. Ensuring that data is not lost is only a basic requirement; there are more headaches waiting to be solved:

  • Can it support disaster recovery across data centers?
  • Is the write speed fast enough?
  • After the data is saved, is it easy to read?
  • How to modify the saved data? How to support concurrent modification?
  • How to modify multiple records atomically?

Each of these problems is very difficult, but to build an excellent data storage system, every one of them must be solved. To solve the data storage problem, we developed the TiKV project. Next, I will introduce some of TiKV's design ideas and basic concepts.

3 Key-Value

The first thing a data storage system must decide is its data storage model, that is, in what form the data will be stored. TiKV chose the Key-Value model, and it provides ordered traversal. Simply put, TiKV can be regarded as a huge Map in which both Key and Value are raw byte arrays, and Keys are sorted by the binary order of their raw bytes. There are two points about TiKV to keep in mind:

  1. It is a huge Map, that is, it stores Key-Value pairs
  2. The Key-Value pairs in this Map are ordered by the binary order of the Key, so we can Seek to the position of a given Key and then call Next repeatedly to obtain the Keys larger than it in increasing order (see the sketch below)
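
A minimal sketch of these two properties, using Rust's standard BTreeMap purely as an illustration (the names and types here are not TiKV's actual API): keys are raw byte arrays kept in binary order, and "Seek + Next" is simply iterating from the first key greater than or equal to a given key.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

fn main() {
    // A tiny ordered map from byte-array keys to byte-array values.
    let mut map: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
    map.insert(b"a1".to_vec(), b"v1".to_vec());
    map.insert(b"b1".to_vec(), b"v3".to_vec());
    map.insert(b"a2".to_vec(), b"v2".to_vec());

    // "Seek" to the first key >= b"a2", then "Next" through the remaining
    // keys in increasing binary order: prints a2 -> v2, then b1 -> v3.
    let seek_key: &[u8] = b"a2";
    for (key, value) in map.range::<[u8], _>((Bound::Included(seek_key), Bound::Unbounded)) {
        println!("{:?} -> {:?}", key, value);
    }
}
```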

After all this, someone may ask: what does the storage model described here have to do with the Table in SQL? Here is one important thing that needs to be said four times:

The storage model here has nothing to do with the Table in SQL! The storage model here has nothing to do with the Table in SQL! The storage model here has nothing to do with the Table in SQL! The storage model here has nothing to do with the Table in SQL!

Now let's forget about any concepts in SQL and focus on how to implement a huge (distributed) Map with high performance and high reliability like TiKV.

4 RocksDB

Any persistent storage engine must ultimately put its data on disk, and TiKV is no exception. However, TiKV does not write data directly to disk; instead, it stores data in RocksDB, and RocksDB is responsible for actually persisting it. The reason for this choice is that developing a stand-alone storage engine is a lot of work, especially a high-performance one, which requires all kinds of meticulous optimization. RocksDB is an excellent open source stand-alone storage engine that meets our requirements for a single-node engine, and the Facebook team keeps optimizing it, so with very little effort we get a powerful and continuously improving engine. Of course, we have also contributed some code to RocksDB and hope the project keeps getting better. Here you can simply think of RocksDB as a stand-alone Key-Value Map.

Underneath, the LSM tree keeps incremental changes to the data in memory and flushes them to disk in batches once they reach a configured size limit; the sorted runs on disk are merged periodically into larger ones to keep performance good.
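
The toy sketch below (illustration only, not RocksDB's code) shows this LSM-style write path: writes accumulate in a sorted in-memory memtable and, once a size threshold is reached, are flushed as an immutable sorted run; the periodic merging (compaction) of on-disk runs is omitted.

```rust
use std::collections::BTreeMap;

// Toy LSM-style write path, for illustration only (not RocksDB's implementation).
struct ToyLsm {
    memtable: BTreeMap<Vec<u8>, Vec<u8>>,       // in-memory sorted buffer
    flushed_runs: Vec<Vec<(Vec<u8>, Vec<u8>)>>, // stand-in for SST files on disk
    flush_threshold: usize,                     // max entries before a flush
}

impl ToyLsm {
    fn new(flush_threshold: usize) -> Self {
        ToyLsm { memtable: BTreeMap::new(), flushed_runs: Vec::new(), flush_threshold }
    }

    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.memtable.insert(key, value);
        if self.memtable.len() >= self.flush_threshold {
            // Flush the whole sorted memtable as one immutable run, then start empty.
            let run: Vec<_> = std::mem::take(&mut self.memtable).into_iter().collect();
            self.flushed_runs.push(run);
        }
    }
}
```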

5 Raft

Well, the first step of the Long March has been taken: we now have an efficient and reliable local storage solution for the data. As the saying goes, everything is difficult at the beginning, in the middle, and at the end. Next we face a harder problem: how do we ensure that data is neither lost nor corrupted when a single machine fails? Simply put, we need a way to replicate the data to multiple machines, so that if one machine goes down we still have replicas elsewhere; more precisely, we need a replication scheme that is reliable, efficient, and able to handle replica failure. This sounds hard, but fortunately we have the Raft protocol. Raft is a consensus algorithm equivalent to Paxos but easier to understand; those who are interested can read the Raft paper. This article gives only a brief introduction to Raft, and the details can be found in the paper. One more point: the Raft paper describes only a basic solution, and if it were implemented strictly as written the performance would be poor, so we have made many optimizations in our implementation of the Raft protocol.

Raft is a consensus protocol that provides several important functions:

  • Leader election
  • Membership change
  • Log replication

TiKV uses Raft for data replication: each data change is recorded as a Raft log entry, and through Raft's log replication the change is safely and reliably synchronized to a majority of the nodes in the Raft Group.

Let's summarize here: through single-machine RocksDB, we can store data on disk quickly; through Raft, we can replicate data to multiple machines to guard against single-machine failure. Data is written through the Raft layer's interface rather than directly to RocksDB. By implementing Raft, we have a distributed KV store, and we no longer have to worry about a single machine going down.
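
The following rough sketch uses hypothetical types (not TiKV's actual interfaces) just to show the shape of this write path: a change is proposed as a Raft log entry first, and is applied to the local RocksDB only after the group has committed it.

```rust
// Hypothetical interface, for illustration only.
trait RaftGroup {
    /// Append the change to the leader's log, replicate it to the followers,
    /// and return only after a majority of the group has acknowledged it.
    fn propose_and_commit(&mut self, change: Vec<u8>) -> Result<(), String>;
}

fn write_through_raft(
    raft: &mut dyn RaftGroup,
    change: Vec<u8>,
    apply_to_rocksdb: impl FnOnce(&[u8]),
) -> Result<(), String> {
    raft.propose_and_commit(change.clone())?; // 1. replicate the change via Raft
    apply_to_rocksdb(&change);                // 2. apply the committed change locally
    Ok(())
}
```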

6 Region

At this point we can introduce a very important concept: Region. It is the basis for understanding a whole series of later mechanisms, so please read this section carefully.

As mentioned earlier, we regard TiKV as a huge, ordered KV Map. To achieve horizontal scalability of storage, we need to spread the data across multiple machines. Spreading data across multiple machines is not the same concept as Raft's data replication; in this section let's forget Raft for a moment and assume all data has only one replica, which makes things easier to understand.

For a KV system, there are two typical schemes for distributing data across multiple machines: one is to hash the Key and choose a storage node according to the hash value; the other is to divide by Range, storing a segment of consecutive Keys on one storage node. TiKV chose the second approach: the whole Key-Value space is divided into many segments, each consisting of a series of consecutive Keys, and we call each segment a Region. We try to keep the data stored in each Region under a certain size (configurable; the current default is 64 MB). Each Region can be described by a left-closed, right-open interval from StartKey to EndKey.
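
A minimal sketch of this interval view of a Region, with illustrative field names rather than TiKV's actual data structures:

```rust
// A Region covers the left-closed, right-open key range [start_key, end_key).
struct Region {
    start_key: Vec<u8>, // inclusive
    end_key: Vec<u8>,   // exclusive; an empty end_key here stands for "+infinity"
}

impl Region {
    fn contains(&self, key: &[u8]) -> bool {
        key >= self.start_key.as_slice()
            && (self.end_key.is_empty() || key < self.end_key.as_slice())
    }
}
```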

Note that the Region here still has nothing to do with the table in SQL! Please keep forgetting SQL; we are still only talking about KV. After dividing the data into Regions, we do two important things:

  • Using the Region as the unit, spread the data across all nodes in the cluster, and try to keep the number of Regions served by each node roughly the same
  • Perform Raft replication and membership management in units of Regions

These two points are very important, let's talk about them one by one.

Look at the first point first. The data is divided into many Regions by Key, and the data of each Region is stored on only one node (ignoring replication for now). Our system has a component responsible for spreading the Regions as evenly as possible across all nodes in the cluster. On the one hand, this achieves horizontal scaling of storage capacity (after new nodes are added, Regions on other nodes are automatically scheduled onto them); on the other hand, it achieves load balancing (we will not end up with one node holding lots of data while the others hold none). At the same time, to make sure upper-layer clients can reach the data they need, the system also has a component that records the distribution of Regions across nodes, so that given any Key we can find out which Region it belongs to and which node that Region currently lives on. A sketch of such a lookup follows.
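
This is a minimal sketch of such a lookup, assuming a hypothetical routing table keyed by each Region's StartKey (not TiKV's or the placement component's actual interface): because Regions split the key space into contiguous ranges, the Region that owns a Key is the one with the largest StartKey less than or equal to that Key.

```rust
use std::collections::BTreeMap;

struct Router {
    // StartKey of each Region -> (region id, node currently serving it).
    regions: BTreeMap<Vec<u8>, (u64, u64)>,
}

impl Router {
    // Find the Region whose range contains `key`: the entry with the
    // greatest StartKey <= key (checks against EndKey are omitted here).
    fn locate(&self, key: &[u8]) -> Option<(u64, u64)> {
        self.regions
            .range::<[u8], _>(..=key)
            .next_back()
            .map(|(_, &(region_id, node_id))| (region_id, node_id))
    }
}
```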

As for the second point, TiKV replicates data in units of Regions, that is, each Region's data is kept in multiple copies, and we call each copy a Replica. Replicas stay consistent with each other through Raft. The Replicas of one Region are stored on different nodes and together form a Raft Group. One of the Replicas acts as the Leader of the group and the others act as Followers. All reads and writes go through the Leader, which then replicates them to the Followers.

 

By distributing and replicating data in units of Regions, we get a distributed Key-Value system with a degree of disaster tolerance, and we no longer need to worry about losing data when a disk fails or a machine goes down. This is already cool, but not perfect; we need more features.

7 MVCC

Many databases implement multi-version concurrency control (MVCC), and TiKV is no exception. Imagine a scenario in which two clients modify the value of the same Key at the same time. Without MVCC the data would have to be locked, which in a distributed setting can cause performance problems and deadlocks. TiKV implements MVCC by appending a Version to the Key. Simply put, before MVCC, TiKV's keys can be viewed like this:

Key1 -> Value
Key2 -> Value
……
KeyN -> Value

With MVCC, the key arrangement of TiKV is as follows:

Key1-Version3 -> Value
Key1-Version2 -> Value
Key1-Version1 -> Value
……
Key2-Version4 -> Value
Key2-Version3 -> Value
Key2-Version2 -> Value
Key2-Version1 -> Value
……
KeyN-Version2 -> Value
KeyN-Version1 -> Value
……

Note that for multiple versions of the same Key, we place larger version numbers first and smaller ones after (recall from the Key-Value section that Keys are arranged in order), so that when a user reads the Value through a Key + Version, the Key and Version can be used to construct the MVCC Key, namely Key-Version. We can then Seek(Key-Version) directly to locate the first position greater than or equal to this Key-Version.
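
One common way to obtain this ordering, shown purely as an illustration (not necessarily TiKV's exact byte encoding): append the version to the user Key as (u64::MAX - version) in big-endian form, so that for the same Key a larger version sorts earlier, and Seek on the constructed Key-Version lands on the newest version not exceeding the requested one.

```rust
// Encode Key + Version so that, for the same user key, larger versions sort first.
// (Illustration only; it also assumes user keys do not prefix one another.)
fn encode_mvcc_key(user_key: &[u8], version: u64) -> Vec<u8> {
    let mut key = user_key.to_vec();
    key.extend_from_slice(&(u64::MAX - version).to_be_bytes());
    key
}

fn main() {
    // Key1-Version3 sorts before Key1-Version2, which sorts before Key1-Version1.
    assert!(encode_mvcc_key(b"Key1", 3) < encode_mvcc_key(b"Key1", 2));
    assert!(encode_mvcc_key(b"Key1", 2) < encode_mvcc_key(b"Key1", 1));
}
```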

8 Transactions

TiKV's transactions use the Percolator model, with many optimizations. TiKV's transactions use optimistic locking: during execution, a transaction does not detect write-write conflicts; conflicts are detected only during the commit phase. Of the conflicting parties, the one that finishes committing first writes successfully, and the other tries to re-execute the entire transaction. When write conflicts in the workload are rare, this model performs very well, for example when randomly updating rows of a very large table. When write conflicts are severe, however, performance is poor. An extreme example is a counter: many clients modify a small number of rows at the same time, causing serious conflicts and a large number of useless retries.
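
Below is a heavily simplified sketch of this commit-time conflict check (the real Percolator model is a two-phase prewrite/commit protocol with lock records and a primary key; this only illustrates "detect conflicts at commit, the first committer wins, the loser re-runs the transaction"):

```rust
use std::collections::BTreeMap;

struct Store {
    // key -> MVCC history as a list of (commit_version, value).
    data: BTreeMap<Vec<u8>, Vec<(u64, Vec<u8>)>>,
    next_version: u64,
}

impl Store {
    /// Try to commit a transaction that started at `start_version` with its
    /// buffered writes. Fails if any touched key was committed by another
    /// transaction after this one started; the caller then retries the whole
    /// transaction.
    fn try_commit(
        &mut self,
        start_version: u64,
        writes: Vec<(Vec<u8>, Vec<u8>)>,
    ) -> Result<u64, ()> {
        for (key, _) in &writes {
            if let Some(history) = self.data.get(key) {
                if history.iter().any(|(v, _)| *v > start_version) {
                    return Err(()); // write-write conflict detected at commit time
                }
            }
        }
        self.next_version += 1;
        let commit_version = self.next_version;
        for (key, value) in writes {
            self.data.entry(key).or_default().push((commit_version, value));
        }
        Ok(commit_version)
    }
}
```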

9 Other

So far, we have learned the basic concepts and some implementation details of TiKV, understood how this transactional distributed KV engine is layered, and seen how multi-replica fault tolerance is achieved. The next section will introduce how to build an SQL layer on top of the KV storage model.
