Talking about Raft performance optimization

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This work was created by Li Zhaolong and is confirmed by Li Zhaolong; please indicate the source when reprinting.

Introduction

The title of this article looks like an expert's experience sharing, but in fact it is just my personal conjecture, mixed with ideas from some blog posts I found online. Since I have not run any tests, it may be nothing more than a fantasy.

Optimization Strategy

Most of these optimizations are engineering-level optimizations rather than optimizations of the algorithm itself. For each point I try to describe what my approach would be if I were the one implementing it, and I try to make these methods as feasible as possible.

Parallelization of Append Log and Log Storage

In fact, Append Log can be executed in parallel with the persistence of log entries. The Raft log is naturally ordered, and on current hard-disk storage media sequential writes are far better than random writes. Generally speaking, log persistence is essentially a WAL and is sequential as a whole. Parallelization may cause random writes in extreme cases, such as the following:

(Figure omitted: log states of servers S1–S5 in stages (c) and (d).)

In (c), log entries 2 and 3 have already been persisted (only on S1 and S2, of course), and they will all be overwritten in (d).

But doing so does not affect safety. If the leader does not crash, the result is the same as doing the two steps sequentially. If the leader crashes, there are two cases: if a majority of nodes (n/2+1) received this message and the Append succeeded, then this Raft log entry will be committed and the newly elected leader will respond to the client; otherwise the entry will not be committed and will be overwritten by new log entries.
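
Below is a minimal Go sketch of this idea, purely my own illustration (the functions persistLocally and sendAppendEntries are invented stand-ins, not any real implementation): the leader fsyncs its own WAL concurrently with sending AppendEntries to the followers, and its own disk write counts toward the quorum just like a follower's acknowledgement.

```go
package main

import (
	"fmt"
	"time"
)

// Entry is a single Raft log entry (illustrative only).
type Entry struct {
	Term  uint64
	Index uint64
	Data  []byte
}

// persistLocally stands in for appending the entry to the leader's own WAL and fsyncing it.
func persistLocally(e Entry) bool {
	time.Sleep(2 * time.Millisecond) // pretend fsync latency
	return true
}

// sendAppendEntries stands in for one AppendEntries RPC to a follower.
func sendAppendEntries(peer string, e Entry) bool {
	time.Sleep(5 * time.Millisecond) // pretend one network round trip
	return true
}

// replicate writes the entry to the local disk *in parallel* with sending it to
// the followers, and reports success once a majority (the leader's own disk
// write counts as one vote) has acknowledged it.
func replicate(e Entry, peers []string) bool {
	acks := make(chan bool, len(peers)+1)

	go func() { acks <- persistLocally(e) }() // local persistence is off the sending path
	for _, p := range peers {
		peer := p
		go func() { acks <- sendAppendEntries(peer, e) }()
	}

	needed := (len(peers)+1)/2 + 1 // majority of the whole cluster
	granted := 0
	for i := 0; i < len(peers)+1; i++ {
		if <-acks {
			granted++
			if granted >= needed {
				return true // safe to commit; the stragglers finish in the background
			}
		}
	}
	return false
}

func main() {
	ok := replicate(Entry{Term: 2, Index: 3, Data: []byte("x=1")},
		[]string{"s2", "s3", "s4", "s5"})
	fmt.Println("committed:", ok)
}
```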

Batching & Pipelining

Obviously both methods can improve performance; the real difficulty lies in how to ensure correctness. For each of the two, let's first talk about where it can be applied, and then about its correctness.

Pipelining:

  1. This optimization is necessary; otherwise log processing is not only serial, but each request also costs an RTT. When the RTT is not small, performance cannot be very good.
  2. Pipelining can use multiple threads to send in parallel, and the receiver can restore the order through the log index.

In order to ensure the correctness of Pipelining, we must do two things:

  1. Ensure that every node processes all requests in an orderly manner, and in the same order.
  2. Handling failed requests is similar to a sliding-window process: before the log entry with index T is received, the log entries after T cannot be applied even if they have already arrived. When a follower receives an Append message, it checks whether the message matches the last Raft log entry it has received; if not, it returns a Reject message. Then, according to the Raft protocol, after receiving the Reject the leader retries from entry 3, so entry 4 will be sent again as well. After all, the probability of this situation is not large, so I personally think the extra retries are not a big problem. (A follower-side sketch follows this list.)
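
Here is a rough Go sketch of that follower-side check, with names such as AppendArgs and handleAppend made up by me (the real protocol also handles truncating conflicting suffixes, which this sketch skips): a pipelined Append whose previous index/term does not match the tail of the local log is rejected, which makes the leader fall back and resend from the mismatch point.

```go
package main

import "fmt"

type LogEntry struct {
	Term  uint64
	Index uint64
}

// AppendArgs is a simplified AppendEntries request (illustrative names).
type AppendArgs struct {
	PrevLogIndex uint64
	PrevLogTerm  uint64
	Entries      []LogEntry
}

type Follower struct {
	log []LogEntry // log[0] is a sentinel at index 0, term 0
}

// handleAppend returns false (Reject) when the request does not directly follow
// the last entry we have; the leader then retries from an earlier index,
// re-sending the later pipelined entries as well.
func (f *Follower) handleAppend(a AppendArgs) bool {
	last := f.log[len(f.log)-1]
	if a.PrevLogIndex != last.Index || a.PrevLogTerm != last.Term {
		return false // pipelined request arrived out of order or after a gap
	}
	f.log = append(f.log, a.Entries...)
	return true
}

func main() {
	f := &Follower{log: []LogEntry{{Term: 0, Index: 0}}}
	fmt.Println(f.handleAppend(AppendArgs{PrevLogIndex: 0, PrevLogTerm: 0,
		Entries: []LogEntry{{Term: 1, Index: 1}}})) // true
	fmt.Println(f.handleAppend(AppendArgs{PrevLogIndex: 2, PrevLogTerm: 1,
		Entries: []LogEntry{{Term: 1, Index: 3}}})) // false: entry 2 has not arrived yet
}
```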

Batching:

  1. We can apply it when clients submit proposals: merging multiple proposals into one obviously performs better and saves many RTTs.
  2. Multiple requests can be merged into one message when appending logs, but basic Raft already does this.
  3. Of course, writes to disk can also be batched.

The problem that may arise in batching requests is that the machine may go down when half of the batch has been executed, leaving a batch of data only partially applied. I personally think the simplest solution is to put these requests into a single log entry, which is effectively equivalent to one request, and to reply only after the whole batch has been committed. In other words, a crash before commit may leave this machine's in-memory data structures incomplete, but since the machine is down anyway it does not matter; so we write to disk first and then update the data structures, and the same goes for the followers. The leader replies to the client only after it finally receives replies from a majority. This is just a quick thought and may have some flaws.
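
A small Go sketch of the "many proposals, one log entry" idea (the propose function and the encoding are my own placeholders, not any real API): client requests are buffered until a size limit is reached and then encoded and proposed as a single entry, so the whole batch commits, or is overwritten, atomically.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Request is one client command (illustrative).
type Request struct {
	Key, Value string
}

// propose stands in for handing one Raft log entry to the replication layer.
func propose(entry []byte) {
	fmt.Printf("proposed one entry of %d bytes\n", len(entry))
}

// batcher collects requests and proposes them as a single log entry, so a crash
// before commit loses or keeps the whole batch, never half of it.
type batcher struct {
	buf      []Request
	maxBatch int
}

func (b *batcher) add(r Request) {
	b.buf = append(b.buf, r)
	if len(b.buf) >= b.maxBatch {
		b.flush()
	}
}

// flush encodes the buffered requests into one entry; in a real system a timer
// would also flush a non-full batch to bound latency.
func (b *batcher) flush() {
	if len(b.buf) == 0 {
		return
	}
	var out bytes.Buffer
	if err := gob.NewEncoder(&out).Encode(b.buf); err == nil {
		propose(out.Bytes()) // the whole batch is one Raft log entry
	}
	b.buf = b.buf[:0]
}

func main() {
	b := &batcher{maxBatch: 3}
	for i := 0; i < 7; i++ {
		b.add(Request{Key: fmt.Sprintf("k%d", i), Value: "v"})
	}
	b.flush() // flush the tail that did not fill a batch
}
```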

Connection multiplexing

A single machine may host multiple Raft groups. Because every replica in each group establishes connections, there may be many TCP connections between two machines, and each connection actually carries a non-trivial cost (I will not discuss the cost in detail here). Therefore, as far as the Raft part is concerned, there can be just one connection between each pair of machines, and each Raft group gets a unique label so that every packet can be handed to the right group.

This does not look difficult, but I personally think the implementation is a bit troublesome, because multiple threads need to share this connection. The simple method is to have one thread receive data from the connection and then hand the data to different receiving threads; this is a single-writer, multiple-reader scenario. The more elegant way, in my opinion, is for this single thread to receive data from the connection while maintaining one kfifo per Raft group and putting different data into different kfifos, so that the lock-free dispatch is completed elegantly.
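
The following Go sketch shows that single-receiver dispatch in miniature, with channels standing in for the kfifo queues and a Packet framing I invented for illustration: one goroutine reads every packet from the shared connection and pushes it into the per-group queue chosen by the group ID, so the per-group consumers never contend on a lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Packet is one message on the shared connection; GroupID says which
// raft-group it belongs to (framing invented for this sketch).
type Packet struct {
	GroupID int
	Payload string
}

func main() {
	const groups = 3
	conn := make(chan Packet, 16) // stands in for the single shared TCP connection

	// one bounded queue per raft-group, playing the role of a kfifo
	queues := make([]chan Packet, groups)
	for i := range queues {
		queues[i] = make(chan Packet, 16)
	}

	var wg sync.WaitGroup

	// single reader: pulls from the connection and dispatches by GroupID
	wg.Add(1)
	go func() {
		defer wg.Done()
		for p := range conn {
			queues[p.GroupID] <- p
		}
		for _, q := range queues {
			close(q)
		}
	}()

	// one consumer per raft-group
	for i := 0; i < groups; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for p := range queues[id] {
				fmt.Printf("group %d handles %q\n", id, p.Payload)
			}
		}(i)
	}

	conn <- Packet{GroupID: 0, Payload: "AppendEntries"}
	conn <- Packet{GroupID: 2, Payload: "Heartbeat"}
	conn <- Packet{GroupID: 1, Payload: "RequestVote"}
	close(conn)
	wg.Wait()
}
```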

Sharding during log compaction

In the implementation of ChubbyGo, sharding was not implemented for log compaction. Although the probability of a log falling far behind is relatively low, sending the entire log is still very inelegant, so sharding is a very necessary approach. Section 7 of the original Raft paper describes this optimization.

Heartbeat packet merging

Heartbeat packets exist to ensure connectivity between machines. Obviously, the multiple Raft groups on one machine should not each maintain their own heartbeats toward the same peer; that would be meaningless. With heartbeat merging, when there are many Raft groups on a node, a large number of heartbeat packets can be saved.

This also means that in the callback of the heartbeat packet we need to reset the timers of multiple Raft groups. Setting up the parameters of this callback is probably also troublesome, because the heartbeat handling is now independent of any single Raft group, yet it needs access to the state it must modify; of course, all of these Raft groups must live inside one process. I cannot think of other details at the moment.
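
Here is a rough Go sketch of what such a merged heartbeat callback could look like; the NodeHeartbeat message and the resetElectionTimer helper are my own inventions: a single node-level heartbeat lists the groups whose leader lives on the sender, and the receiver resets the election timer of each of those groups in one callback.

```go
package main

import (
	"fmt"
	"time"
)

// NodeHeartbeat is one merged heartbeat between two machines; it carries
// the IDs of every raft-group whose leader is on the sending node.
type NodeHeartbeat struct {
	From     string
	GroupIDs []int
}

// group holds only the per-raft-group state needed here: an election timer.
type group struct {
	id    int
	timer *time.Timer
}

func (g *group) resetElectionTimer(d time.Duration) {
	if !g.timer.Stop() {
		select { // drain the channel if the timer already fired
		case <-g.timer.C:
		default:
		}
	}
	g.timer.Reset(d)
	fmt.Printf("group %d election timer reset by merged heartbeat\n", g.id)
}

// onHeartbeat is the node-level callback: one packet, many timers reset.
func onHeartbeat(hb NodeHeartbeat, groups map[int]*group, timeout time.Duration) {
	for _, id := range hb.GroupIDs {
		if g, ok := groups[id]; ok {
			g.resetElectionTimer(timeout)
		}
	}
}

func main() {
	groups := map[int]*group{
		1: {id: 1, timer: time.NewTimer(time.Second)},
		2: {id: 2, timer: time.NewTimer(time.Second)},
	}
	onHeartbeat(NodeHeartbeat{From: "node-A", GroupIDs: []int{1, 2}}, groups, time.Second)
}
```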

Log storage

This part is the pain point of ChubbyGo; its durability design at the time can only be described as sloppy.

The correct approach should be based on the WAL process.

The general process is as follows:

  1. Raft receives a piece of data
  2. Persist it (write it to the WAL)
  3. Request a commit (replicate it to the followers)
  4. The commit succeeds
  5. The upper-layer application applies the log

This ensures that if the commit succeeds, the data has already been persisted on a majority of the nodes in the cluster.

The general state transition is as follows (in the original figure, green marks in-memory state and blue marks on-disk state; figure omitted).

But when the upper layer actually applies the data, the data is no longer in memory, which requires an IO, and disk IO is a performance killer: a millisecond-level operation (ignoring the page cache, of course). So we can maintain a user-space cache to improve performance.

I personally think the cache here can simply be organized as an array plus an index marking the compaction point.
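
A tiny Go sketch of the "array plus compaction point" cache I have in mind; this is purely my own guess, not taken from any existing implementation: entries are kept in a slice, firstIndex records where the slice starts after compaction, and any lookup below that point has to fall back to the snapshot or the disk.

```go
package main

import (
	"errors"
	"fmt"
)

type Entry struct {
	Index uint64
	Data  string
}

var errCompacted = errors.New("entry compacted, read it from the snapshot/disk")

// logCache keeps the tail of the log in memory so apply() does not hit the disk.
type logCache struct {
	firstIndex uint64  // index of entries[0]; everything below has been compacted away
	entries    []Entry // contiguous entries starting at firstIndex
}

func (c *logCache) get(i uint64) (Entry, error) {
	if i < c.firstIndex {
		return Entry{}, errCompacted
	}
	off := i - c.firstIndex
	if off >= uint64(len(c.entries)) {
		return Entry{}, errors.New("entry not written yet")
	}
	return c.entries[off], nil
}

// compact drops everything below newFirst, e.g. after taking a snapshot.
func (c *logCache) compact(newFirst uint64) {
	if newFirst <= c.firstIndex {
		return
	}
	if off := newFirst - c.firstIndex; off <= uint64(len(c.entries)) {
		c.entries = append([]Entry(nil), c.entries[off:]...)
		c.firstIndex = newFirst
	}
}

func main() {
	c := &logCache{firstIndex: 1, entries: []Entry{{1, "a"}, {2, "b"}, {3, "c"}}}
	c.compact(3)
	fmt.Println(c.get(3)) // served from memory
	fmt.Println(c.get(2)) // compacted: must go to disk
}
```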

Read optimization: ReadIndex and LeaseRead

If every read operation has to go through the full Raft process, it is too slow. Instead, we can refer to ZooKeeper and provide client FIFO semantics: each client carries a ReadIndex in its request, and the request can only be answered after the applied log has passed that point. In this case, client FIFO semantics are satisfied and a single follower node can serve the request.

If we want to wait until the expected ReadIndex is reached, we need to register a callback for each incoming request; when the target value appears, the corresponding series of callbacks are all triggered.

Of course, strong consistency is also possible, which means only the leader node accepts read requests. This ReadIndex scheme is much faster than pushing the read through the whole log replication process, and LeaseRead is a further optimization on top of it. In the leader-only case, the requested index will generally not be greater than the current applied index, so reads are basically wait-free.
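
A condensed Go sketch of the waiting part, with the surrounding Raft plumbing omitted and the names (readWaiter, waitFor, advance) made up by me: the read records the index it must see applied, callbacks keyed by index are fired as the apply loop advances, and if the applied index has already passed the ReadIndex the read proceeds immediately, which is the wait-free path.

```go
package main

import (
	"fmt"
	"sync"
)

// readWaiter lets reads block until the state machine has applied at least readIndex.
type readWaiter struct {
	mu      sync.Mutex
	applied uint64
	waiting map[uint64][]chan struct{} // readIndex -> callbacks to fire
}

func newReadWaiter() *readWaiter {
	return &readWaiter{waiting: map[uint64][]chan struct{}{}}
}

// waitFor returns a channel that is already closed if readIndex has been applied
// (the wait-free path), or will be closed by advance() once it is.
func (w *readWaiter) waitFor(readIndex uint64) <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan struct{})
	if w.applied >= readIndex {
		close(ch)
		return ch
	}
	w.waiting[readIndex] = append(w.waiting[readIndex], ch)
	return ch
}

// advance is called by the apply loop after entries are applied.
func (w *readWaiter) advance(appliedIndex uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.applied = appliedIndex
	for idx, chans := range w.waiting {
		if idx <= appliedIndex {
			for _, ch := range chans {
				close(ch)
			}
			delete(w.waiting, idx)
		}
	}
}

func main() {
	w := newReadWaiter()
	done := w.waitFor(5) // e.g. readIndex = the leader's commit index when the read arrived
	go w.advance(5)      // the apply loop catches up
	<-done
	fmt.Println("safe to serve the read from the local state machine")
}
```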

PreVote

To avoid the situation where a partitioned node's term increases sharply and then disrupts the whole protocol after the partition heals (a node with a higher term forces the current leader to step down), the PreVote mechanism can be introduced. Simply put, a new state, PreCandidate, is introduced before converting to Candidate. Its job is to send a pre-vote packet whose term field is its own term plus one; note that the node's real term is not incremented at this point. Only after receiving replies from a majority of nodes does it start the real election, and only then does it actually increment its own term.

This mechanism avoids the problem mentioned above; the cost is one extra round of message broadcast per election. In fact this is not a big deal, because leader downtime is not a frequent occurrence.

Of course, in terms of implementation details, we have to be able to distinguish the pre-vote packet from the real vote packet, because with an ordinary vote packet, a node that sees a higher term on the other end will directly set its own term to that value. This is not difficult: just put a mark in the packet.
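
A small Go sketch of that mark, with the message layout simplified by me (the votedFor and last-log-index checks of real Raft are omitted): the vote request carries a PreVote flag; for a pre-vote the receiver answers the request but never touches its own persistent term, while a real vote request with a higher term does move the term forward.

```go
package main

import "fmt"

// VoteRequest is a simplified (Pre)Vote message: the PreVote flag is the
// "mark in the packet" that tells the two kinds apart.
type VoteRequest struct {
	Term    uint64 // for a pre-vote this is candidateTerm+1, sent without bumping the real term
	PreVote bool
}

type node struct {
	term        uint64
	lastLogTerm uint64 // simplified up-to-date check: compare last log terms only
}

func (n *node) handleVote(req VoteRequest, candidateLastLogTerm uint64) bool {
	if req.Term < n.term {
		return false
	}
	upToDate := candidateLastLogTerm >= n.lastLogTerm
	if req.PreVote {
		// Grant or refuse, but do NOT touch our own term: a partitioned node
		// cannot drag the cluster's term up with pre-votes alone.
		return upToDate
	}
	if req.Term > n.term {
		n.term = req.Term // only real vote requests move our term forward
	}
	return upToDate
}

func main() {
	n := &node{term: 5, lastLogTerm: 5}
	fmt.Println(n.handleVote(VoteRequest{Term: 6, PreVote: true}, 5), n.term)  // granted, term stays 5
	fmt.Println(n.handleVote(VoteRequest{Term: 6, PreVote: false}, 5), n.term) // granted, term becomes 6
}
```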

Snapshot management

This question was asked by Haidong during a mock interview. At the time I talked about sharding, compression, and multi-threading the network IO the way Redis does.

Looking back at this problem, if the RPC that installs a snapshot sends the entire snapshot in one go, there are the following problems:

  1. The data volume of a single RPC is too large, which stresses memory and may run into a network IO bottleneck.
  2. If it fails, retrying the whole thing is too expensive (does this remind you of how Dynamo uses Merkle trees?).
  3. It is difficult to do flow control.

Elasticell's Multi-Raft implementation uses the following steps for this problem to reduce the cost of retries and make the traffic more even (of course, this only pays off when network IO really is the bottleneck, but a snapshot of several gigabytes is not small, the throughput of a gigabit NIC is less than 125 MB/s, and the probability of network IO becoming the bottleneck is still quite large):

  1. Raft's snapshot RPC carries no data, only the metadata of the snapshot file (including the shard ID and the current Raft term, index, epoch, and so on).
  2. After the Raft snapshot RPC is sent, the actual data files are sent asynchronously.
  3. Data files are sent in chunks, so the cost of a retry is small.
  4. The connection used to send chunks is not shared with the Raft RPC connection.
  5. The number of chunks sent in parallel is limited, to prevent snapshot transfer from affecting normal Raft RPCs (after all, NIC throughput has an upper limit). A sketch of steps 3 and 5 follows this list.
  6. The Raft replica receiving the chunked snapshot blocks until the complete snapshot data file has been received (I am not sure what this is for).
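
Here is a rough Go sketch of steps 3 and 5 only; sendChunk, the chunk size, and the in-flight cap are placeholders I chose, not Elasticell's code: a buffered channel acts as a counting semaphore so that at most maxInflight chunks are on the wire at once, and a failed chunk could simply be resent instead of restarting the whole snapshot.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	chunkSize   = 1 << 20 // 1 MiB per chunk (arbitrary for the sketch)
	maxInflight = 4       // cap on parallel chunks so normal Raft RPCs are not starved
)

// sendChunk stands in for pushing one chunk over the dedicated snapshot connection.
func sendChunk(id int, data []byte) error {
	fmt.Printf("sent chunk %d (%d bytes)\n", id, len(data))
	return nil
}

func sendSnapshot(snapshot []byte) {
	sem := make(chan struct{}, maxInflight) // counting semaphore
	var wg sync.WaitGroup

	for off, id := 0, 0; off < len(snapshot); off, id = off+chunkSize, id+1 {
		end := off + chunkSize
		if end > len(snapshot) {
			end = len(snapshot)
		}
		chunk := snapshot[off:end]

		sem <- struct{}{} // blocks while maxInflight chunks are outstanding
		wg.Add(1)
		go func(id int, chunk []byte) {
			defer wg.Done()
			defer func() { <-sem }()
			// on error only this chunk would need to be retried, not the whole snapshot
			_ = sendChunk(id, chunk)
		}(id, chunk)
	}
	wg.Wait()
	fmt.Println("all chunks sent; the receiver applies once the file is complete")
}

func main() {
	sendSnapshot(make([]byte, 5*chunkSize+123)) // a fake snapshot of a bit over 5 MiB
}
```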

Multi-Raft

Mr. Zong Dai asked me this question during an interview. In fact, this was not the first time: I remember Huan Shen mentioned it to me when he joined the group last year, but because of various trivial matters I never thought the problem through. Fortunately, the interviews have settled down recently and I finally have time to think about it.

First of all, the three soul-searching questions: what is it? Why? What can be done?

Cockroach’s definition of Multi-Raft [5][10] is quoted here to explain the first two points:

  • In CockroachDB, we use the Raft consensus algorithm to ensure that your data remains consistent even when machines fail. In most systems that use Raft, such as etcd and Consul, the entire system is one Raft consensus group. In CockroachDB, however, the data is divided into ranges, each with its own consensus group. This means that each node may be participating in hundreds of thousands of consensus groups. This presents some unique challenges, which we have addressed by introducing a layer on top of Raft that we call MultiRaft.
  • The more nodes, the worse the performance.
  • The system's storage capacity depends on the size of the leader machine's disk.

TiKV's definition of Multi-Raft is made very clear in [8]:

  • If you've researched Consensus before, please note that comparing Multi-Raft to Raft is not at all like comparing Multi-Paxos to Paxos. Here Multi-Raft only means we manage multiple Raft consensus groups on one node. From the above section, we know that there are multiple different partitions on each node, if there is only one Raft group for each node, the partitions losing its meaning. So Raft group is divided into multiple Raft groups in terms of partitions, namely, Region.
  • TiKV also can perform split or merge on Regions to make the partitions more flexible. When the size of a Region exceeds the limit, it will be divided into two or more Regions, and the range may change like [a, c) -> [a, b) + [b, c); when the sizes of two sibling Regions are small enough, they will be merged into a bigger Region, and the range may change like [a, b) + [b, c) -> [a, c).


The main reason for this kind of sharding is that a single Raft group is prone to the following problems in the KV scenario:

  1. Single-machine computing power becomes a bottleneck (for strong consistency, only the leader can handle write requests).
  2. Single-machine storage becomes a bottleneck (except, of course, for systems like Bigtable that use a distributed storage system as the storage engine).
  3. Not all nodes can provide services.

The solution to the above problems is sharding, and Multi-Raft is built on top of sharding. This not only raises the ceiling on computing power and storage, but also parallelizes operations that used to be serial. Because a node can be a leader for some groups and a follower for others, even a large-scale machine failure can still leave most services running normally.

As stated in [5], many core issues need to be resolved, as follows:

  1. How to shard: I think consistent hashing is one feasible approach; TiKV can shard based on either hash or range (a toy range-sharding sketch follows this list).
  2. The data in a shard keeps growing, and the shard needs to be split to form more Raft groups.
  3. Shard scheduling: make the load more even across the system. Cockroach's scheduling part refers to TiKV's PD. Here PD is responsible for issuing scheduling instructions, and the two most important resources are storage and the computing power tied to leaders. PD collects the data required for scheduling through heartbeats, including the number of shards on each node, the number of leaders among those shards, the node's storage space, its remaining storage space, and so on.
  4. How a newly created shard forms a Raft group.
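
As a toy illustration of item 1 under range sharding (the region table, the split, and the lookup here are all invented by me and are not TiKV's actual code): a sorted list of region start keys is enough to route a key to its Raft group with a binary search, and splitting a region is just inserting a new start key.

```go
package main

import (
	"fmt"
	"sort"
)

// region maps a half-open key range [StartKey, next region's StartKey) to a raft-group.
type region struct {
	StartKey string
	GroupID  int
}

// router holds regions sorted by StartKey; it assumes the first region starts at "".
type router struct {
	regions []region
}

// lookup finds the raft-group responsible for key.
func (r *router) lookup(key string) int {
	// first region whose StartKey is strictly greater than key, then step back one
	i := sort.Search(len(r.regions), func(i int) bool { return r.regions[i].StartKey > key })
	return r.regions[i-1].GroupID
}

// split inserts a new region starting at splitKey, e.g. [a, c) -> [a, b) + [b, c).
func (r *router) split(splitKey string, newGroup int) {
	i := sort.Search(len(r.regions), func(i int) bool { return r.regions[i].StartKey >= splitKey })
	r.regions = append(r.regions, region{})
	copy(r.regions[i+1:], r.regions[i:])
	r.regions[i] = region{StartKey: splitKey, GroupID: newGroup}
}

func main() {
	rt := &router{regions: []region{{"", 1}, {"m", 2}}}   // ["", m) -> group 1, [m, +inf) -> group 2
	fmt.Println(rt.lookup("apple"), rt.lookup("zebra"))   // 1 2
	rt.split("g", 3)                                      // ["", g) -> 1, [g, m) -> 3, [m, +inf) -> 2
	fmt.Println(rt.lookup("house"))                       // 3
}
```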

The above questions are answered in detail in [5].

However, I still have some doubts at this point:

  1. Will PD itself become a single point of failure?
  2. Two partitions may be merged. Since a merged group's log would then span more than one key range, does each Raft group do its own log compaction? If so, how do the leader and the followers synchronize the timing of compaction, or is that decided by the leader alone? I cannot work out the details here in a short while.

It seems that sharding via Multi-Raft greatly improves overall throughput, but this looks more like an industrial-level optimization than an algorithm-level one.

Summary

Most of these are optimizations of industrial details; there is not much material on algorithm-level optimization, which, like Multi-Paxos, may belong to the trade secrets of the big companies.

References

  1. Elasticell: Talking about Raft optimization
  2. Making Raft 100x faster: Dragonboat's write optimization
  3. Linearizability and Raft
  4. Raft's PreVote implementation mechanism
  5. Elasticell: Multi-Raft implementation
  6. Usage of kfifo
  7. TiKV feature introduction: PD Scheduler
  8. TiKV Multi-Raft
  9. TiKV data sharding
  10. Elasticell Multi-Raft
  11. Tencent Cloud's financial-grade message queue CMQ: a detailed explanation of its high-reliability algorithm based on deep Raft optimization
