Three articles to understand the inside story of TiDB technology - talk about scheduling

In any complex system, what users perceive is only the tip of the iceberg, and the database is no exception.

The first two articles introduced the basic concepts of TiKV and TiDB and the implementation principles of some of their core features. One of these components is responsible for KV storage and the other for the SQL engine; both are parts that users interact with directly. Behind them there is another component called PD (Placement Driver). Although it does not interact with the business directly, this component is the core of the entire cluster: it stores the global metadata and performs load-balancing scheduling for the TiKV cluster.

This article introduces this mysterious module. This part is relatively complex: many of the issues involved are things people do not usually think about, and descriptions of them are rarely seen in other articles. We follow the same approach as the first two articles: first discuss what functionality we need, then discuss how we implement it. If you read the implementation with the requirements in mind, it is easier to understand the considerations behind our design.

Why scheduling is needed

First, recall some information from the first article. A TiKV cluster is the distributed KV storage engine of the TiDB database. Data is replicated and managed in units of Regions; each Region has multiple Replicas distributed across different TiKV nodes, where the Leader is responsible for reads/writes and the Followers are responsible for replicating the Raft log sent by the Leader. With this information in mind, consider the following questions:

  • How to ensure that multiple Replicas of the same Region are distributed on different nodes? Going a step further, what is the problem if multiple TiKV instances are started on a single machine?
  • When a TiKV cluster is deployed across data centers for disaster recovery, how to ensure that losing one data center does not lose multiple Replicas of a Raft Group?
  • After a node is added to the TiKV cluster, how to migrate data from the other nodes in the cluster onto it?
  • What happens when a node goes offline? What does the entire cluster need to do? What if the node only goes offline briefly (for example, a service restart)? What if the node is offline for a long time (a disk failure where all data is lost)?
  • Suppose the cluster requires N Replicas for each Raft Group. For a single Raft Group, the number of Replicas may be insufficient (for example, a node goes offline and its Replicas are lost) or excessive (for example, a node that went offline recovers and automatically rejoins the cluster). How should the number of Replicas be adjusted?
  • Reads/writes go through the Leader. If Leaders are concentrated on only a few nodes, what impact does that have on the cluster?
  • Not all Regions are accessed frequently; hotspots may be concentrated in just a few Regions. What do we need to do in this case?
  • When the cluster is doing load balancing, data often needs to be relocated. Does this data migration consume a lot of network bandwidth, disk I/O, and CPU, and in turn affect online services?

Each of these problems taken alone may have a simple solution, but mixed together they are not easy to solve. Some problems seem to require only the state of a single Raft Group, for example deciding whether to add a Replica based on whether there are enough Replicas; but in fact, deciding where to add that Replica requires global information. The entire system is also changing dynamically: Region splits, node joins, node failures, and shifts in access hotspots happen continuously, and the scheduling system needs to keep moving toward an optimal state amid these changes. Without a component that has global information, can schedule globally, and is configurable, it is hard to meet these needs. Therefore, we need a central node to grasp and adjust the overall state of the system, and that is why we have the PD module.

Scheduling requirements

Quite a few problems are listed above, so let's classify and organize them first. In general, they fall into two categories:

As a distributed, highly available storage system, there are four requirements that must be met:

  • The number of Replicas must be exactly right, neither more nor fewer
  • Replicas need to be distributed on different machines
  • After a new node is added, Replicas on other nodes can be migrated onto it
  • After a node goes offline, the data on that node needs to be migrated away

As a good distributed system, the places that need to be optimized include:

  • Maintain an even distribution of leaders across the cluster
  • Maintain uniform storage capacity per node
  • Maintain a uniform distribution of access hotspots
  • Control the speed of Balance without affecting online services
  • Manage node status, including manually bringing nodes online/offline and automatically taking failed nodes offline

After the first category of requirements is met, the system has multi-replica fault tolerance, dynamic scale-out/scale-in, tolerance of node disconnection, and automatic error recovery. After the second category is met, the overall load of the system becomes more uniform and the system is easier to manage.

To meet these requirements, we first need to collect enough information, such as the status of each node, the information of each Raft Group, and statistics about business access operations. Second, we need to define some policies; based on this information and the scheduling policies, PD formulates a scheduling plan that satisfies the aforementioned requirements as far as possible. Finally, some basic operations are needed to carry out the scheduling plan.

Basic operations of scheduling

Let's first introduce the simplest part: the basic operations of scheduling, that is, which actions are available to satisfy a scheduling policy. This is the foundation of the entire scheduling system; only when you know what kind of hammer you have in your hand do you know how to drive the nail.

The scheduling requirements above may look complicated, but once sorted out, what finally lands is just the following three operations:

  • Add a Replica
  • Delete a Replica
  • Transfer the Leader role between different Replicas in a Raft Group

Conveniently, the Raft protocol already satisfies these requirements: the AddReplica, RemoveReplica, and TransferLeader commands support the three basic operations above.
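
To make this more concrete, the three basic operations can be viewed as a small set of "operator" messages that PD asks a Region Leader to carry out. The Go sketch below is only illustrative; the type and field names (OpKind, Operator, TargetStore, and so on) are assumptions, not PD's actual definitions.

```go
package scheduler

// OpKind enumerates the three primitive actions PD can ask a Region to perform;
// every higher-level scheduling decision is eventually compiled down to these.
type OpKind int

const (
	AddReplica     OpKind = iota // add a Replica on a target Store
	RemoveReplica                // remove a Replica from a source Store
	TransferLeader               // move the Leader role to another Replica in the group
)

// Operator describes one primitive step to be applied to a single Region.
type Operator struct {
	RegionID    uint64
	Kind        OpKind
	TargetStore uint64 // Store to add the Replica to, or to transfer the Leader to
	SourceStore uint64 // Store to remove the Replica from (RemoveReplica only)
}
```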

Collecting information

Scheduling relies on information about the entire cluster. Simply put, we need to know the state of each TiKV node and the state of each Region. The TiKV cluster reports two types of messages to PD:

Each TiKV node will periodically report the overall information of the node to the PD

There is a heartbeat packet between each TiKV node (Store) and PD. On the one hand, PD uses the heartbeat to check whether each Store is alive and whether new Stores have joined; on the other hand, the heartbeat also carries the state information of the [Store][1], mainly including the items below (a rough sketch of this payload follows the list):

  • Total disk capacity
  • Available disk capacity
  • Number of Regions hosted
  • Data write speed
  • Number of Snapshots being sent/received (data may be synchronized between Replicas through Snapshots)
  • Whether it is overloaded
  • Label information (labels are a series of tags with a hierarchical relationship)
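
As a rough illustration only, the per-Store heartbeat payload could be pictured as a struct like the following. The field names are assumptions chosen for readability and do not match PD's actual protobuf messages.

```go
package scheduler

// StoreStats is an illustrative view of what a Store heartbeat might carry.
type StoreStats struct {
	StoreID            uint64
	Capacity           uint64            // total disk capacity, in bytes
	Available          uint64            // available disk capacity, in bytes
	RegionCount        int               // number of Regions hosted on this Store
	BytesWritten       uint64            // recent data write rate
	SendingSnapCount   int               // Snapshots currently being sent
	ReceivingSnapCount int               // Snapshots currently being received
	IsBusy             bool              // whether the Store reports itself as overloaded
	Labels             map[string]string // hierarchical location labels, e.g. zone/rack/host
}
```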

The leader of each Raft Group will report information to PD regularly

There is a heartbeat packet between the Leader of each Raft Group and PD, used to report the state of this [Region][2], mainly including the information below (a rough sketch follows the list):

  • The location of the Leader
  • The locations of the Followers
  • The number of dropped Replicas
  • Data write/read speed
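
Similarly, the per-Region heartbeat can be pictured with an illustrative struct like this one; again, the names are assumptions rather than PD's real definitions.

```go
package scheduler

// RegionStats is an illustrative view of what a Region Leader heartbeat might carry.
type RegionStats struct {
	RegionID     uint64
	Leader       uint64   // StoreID where the Leader currently lives
	Followers    []uint64 // StoreIDs of the Follower Replicas
	DownReplicas []uint64 // Replicas considered dropped/offline
	BytesWritten uint64   // recent write rate of this Region
	BytesRead    uint64   // recent read rate of this Region
}
```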

Through these two types of heartbeat messages, PD continuously collects information about the entire cluster and uses it as the basis for decision making. In addition, PD can receive extra information through its management interface to make more accurate decisions. For example, when the heartbeats from a Store are interrupted, PD cannot tell whether the node has failed temporarily or permanently. It can only wait for a period of time (30 minutes by default); if there is still no heartbeat, the Store is considered offline, and PD decides to schedule all Regions on that Store elsewhere. However, sometimes an operator proactively takes a machine offline. In this case, PD can be told through its management interface that the Store is unavailable, and PD can immediately determine that all Regions on the Store need to be scheduled away.
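
The decision just described, i.e. distinguishing a briefly unreachable Store from a permanently failed or manually offlined one, can be sketched as follows. The function and parameter names are hypothetical; the 30-minute grace period would be supplied by the caller from configuration.

```go
package scheduler

import "time"

// storeState is a minimal sketch of the decision described above: a Store whose
// heartbeats have stopped is treated as permanently down only after a grace
// period, unless an operator has already marked it offline through the
// management interface.
func storeState(lastHeartbeat time.Time, manuallyOffline bool, maxStoreDownTime time.Duration) string {
	switch {
	case manuallyOffline:
		return "offline" // schedule all Regions on this Store away immediately
	case time.Since(lastHeartbeat) > maxStoreDownTime:
		return "down" // presumed failed; replenish its Replicas on other Stores
	default:
		return "up" // possibly a short interruption; keep waiting
	}
}
```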

Scheduling strategies

After PD has collected this information, it also needs some strategies to formulate a concrete scheduling plan.

The number of Replicas in a Region is correct

When PD finds, through the heartbeat of a Region Leader, that the number of Replicas in the Region does not meet the requirement, it adjusts the Replica count through Add/Remove Replica operations (a sketch of this check follows the list). Possible causes are:

  • A node goes offline and all data on it is lost, leaving some Regions with too few Replicas
  • A disconnected node recovers and automatically rejoins the cluster, so a Region that already had a Replica added to compensate now has one Replica too many and needs to remove one
  • The administrator adjusted the replica policy and modified the [max-replicas][3] configuration
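
A minimal sketch of this rule, reusing the illustrative RegionStats and Operator types from the earlier sketches; maxReplicas stands for the configured target and how it is passed in here is an assumption.

```go
package scheduler

// checkReplicaCount compares a Region's current Replica count with the
// configured target and emits an Add/Remove operator accordingly.
func checkReplicaCount(r RegionStats, maxReplicas int) *Operator {
	current := 1 + len(r.Followers) // the Leader plus its Followers
	switch {
	case current < maxReplicas:
		// Too few Replicas (e.g. a Store went down): add one on a suitable Store.
		return &Operator{RegionID: r.RegionID, Kind: AddReplica}
	case current > maxReplicas:
		// Too many Replicas (e.g. a recovered Store rejoined): remove one.
		return &Operator{RegionID: r.RegionID, Kind: RemoveReplica}
	default:
		return nil // nothing to do
	}
}
```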

Multiple Replicas in a Raft Group are not in the same location

Note the second point: "multiple Replicas of a Raft Group are not in the same location". Here we say "the same location" rather than "the same node". In general, PD only guarantees that multiple Replicas do not fall on the same node, so as to avoid losing multiple Replicas due to the failure of a single node. In actual deployments, the following requirements may also arise:

  • Multiple nodes deployed on the same physical machine
  • TiKV nodes are distributed across multiple racks, and availability should be guaranteed even when a single rack loses power
  • TiKV nodes are distributed across multiple data centers, and the system should remain available even when a single data center loses power

These requirements essentially mean that some nodes share a common location attribute and form a minimum fault-tolerant unit, and we want no more than one Replica of a Region inside such a unit. To achieve this, you can configure [labels][4] for the nodes and specify which labels are location identifiers through the [location-labels][5] configuration on PD, so that when Replicas are allocated, no two Replicas of a Region end up on nodes with the same location identifier.
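
The core of this constraint is simply "no two Replicas share a location value". A minimal sketch, assuming the caller has already extracted the relevant location label value (for example the zone) of each candidate Store:

```go
package scheduler

// distinctLocations reports whether every Replica would land in a different
// fault domain. Real matching on location-labels is more involved; this only
// illustrates the idea.
func distinctLocations(replicaLocations []string) bool {
	seen := make(map[string]bool)
	for _, loc := range replicaLocations {
		if seen[loc] {
			return false // two Replicas would fall into the same fault domain
		}
		seen[loc] = true
	}
	return true
}
```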

Replicas are evenly distributed among Stores

As mentioned earlier, the upper limit of the data stored by each Replica is fixed, so keeping the number of Replicas balanced across the nodes makes the overall load more balanced.
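
A minimal sketch of such balancing, under the simplifying assumption that the only signal is the Region count per Store: pick the most loaded Store as the source and the least loaded one as the target, and only move a Replica when the gap exceeds a threshold. The function name and threshold are illustrative.

```go
package scheduler

// pickBalanceMove chooses a source and target Store for moving one Replica,
// based purely on per-Store Region counts.
func pickBalanceMove(regionCount map[uint64]int, threshold int) (from, to uint64, ok bool) {
	first := true
	for id, n := range regionCount {
		if first {
			from, to, first = id, id, false
		}
		if n > regionCount[from] {
			from = id // most loaded Store so far
		}
		if n < regionCount[to] {
			to = id // least loaded Store so far
		}
	}
	if first || regionCount[from]-regionCount[to] <= threshold {
		return 0, 0, false // already balanced enough, or no Stores known
	}
	return from, to, true
}
```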

The number of Leaders is evenly distributed among the Stores

The Raft protocol performs reads and writes through the Leader, so the computational load is mainly on the Leaders, and PD spreads Leaders across the nodes as much as possible.

The number of access hotspots is evenly distributed among the stores

Each Store and each Region Leader includes the current access load, such as the read/write rate of Keys, when reporting information. PD detects access hotspots and spreads them across the nodes.

The storage space of each Store is roughly equal

When each Store is started, a Capacity parameter will be specified, indicating the upper limit of the storage space of the Store. When the PD is scheduling, it will consider the remaining storage space of the node.

Control scheduling speed to avoid affecting online services

Scheduling operations consume CPU, memory, disk I/O, and network bandwidth, so we need to avoid affecting online services too much. PD controls the number of operations in progress at any time, and the default speed control is conservative. If you need to speed up scheduling (for example, the service has been stopped for an upgrade, or new nodes have been added and you want data to be rebalanced as soon as possible), you can manually accelerate scheduling through pd-ctl.
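
A minimal sketch of this kind of rate control, under the assumption that PD-style logic caps both the cluster-wide and the per-Store number of in-flight operators; the map layout and limit parameters are illustrative, not PD's actual implementation.

```go
package scheduler

// canSchedule reports whether a new operator may be issued for the given Store,
// keeping the impact of data movement on online traffic bounded.
func canSchedule(inFlight map[uint64]int, storeID uint64, perStoreLimit, totalLimit int) bool {
	total := 0
	for _, n := range inFlight {
		total += n
	}
	return total < totalLimit && inFlight[storeID] < perStoreLimit
}
```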

Support manually taking nodes offline

When a node is manually taken offline through pd-ctl, PD schedules the data on that node away under a certain rate limit. When this scheduling is complete, the node is placed in the offline state.

Implementation of scheduling

With the above information in mind, let's look at the whole scheduling flow.

PD continuously collects information through the heartbeats of Stores and Region Leaders, obtaining a detailed picture of the entire cluster, and generates scheduling operation sequences based on this information and the scheduling policies. Each time it receives a heartbeat from a Region Leader, PD checks whether there is a pending operation for this Region; if there is, it returns the required operation to the Region Leader in the heartbeat response and observes the execution result in subsequent heartbeats. Note that the operation here is only a suggestion to the Region Leader; there is no guarantee that it will be executed. Whether and when it is executed is decided by the Region Leader itself based on its current state.
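
Put together, the control loop can be sketched as below, reusing the illustrative types from earlier; the pendingOps map stands in for whatever structure holds the operators generated by the scheduling policies, and is an assumption of this sketch.

```go
package scheduler

// handleRegionHeartbeat looks up whether a pending operator exists for the
// Region that just reported and, if so, returns it so it can be attached to the
// heartbeat response as a suggestion. Whether and when it is executed remains
// up to the Region Leader; progress is observed through later heartbeats.
func handleRegionHeartbeat(pendingOps map[uint64]*Operator, hb RegionStats) *Operator {
	if op, ok := pendingOps[hb.RegionID]; ok {
		return op
	}
	return nil
}
```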

Summary

This article covers things you may rarely see elsewhere. Every design has its own considerations; I hope you now understand what needs to be considered when a distributed storage system does scheduling, and how to decouple policy from implementation so that policies can be extended more flexibly.

At this point, the three articles are complete. I hope you now understand the basic concepts and implementation principles of TiDB as a whole. In the future, we will write more articles introducing the inside story of TiDB from the architecture and code levels. If you have any questions, please send an email to [email protected] to discuss. (Text / Shen Li)

Extended reading

  • [Three articles to understand the inside story of TiDB technology - talk about storage][6]
  • [Three articles to understand the inside story of TiDB technology - talk about computing][7]
