Distributed | Raft consensus algorithm

1. Introduction

  • The Raft algorithm also addresses how to reach consensus on a series of values in a distributed system and keep the logs of all nodes consistent. Its core approach is a leader: all write requests go through the leader, which issues the proposals.
  • Raft belongs to the Multi-Paxos family of algorithms. It builds on the Multi-Paxos idea with some simplifications and restrictions: for example, the log must be continuous, and a node can only be in one of three states: leader, follower, or candidate.
  • Raft breaks down into three main sub-problems:
    • Leader election
    • Log replication
    • Safety (including data recovery)

2. Roles (node states)

Each node is in one of three states at any time:

  • Leader: handles write requests, manages log replication, and continually sends heartbeat messages.
  • Follower: receives and processes the leader's messages (election messages, data write messages, and so on). When the leader's heartbeat times out, that is, when the follower's countdown ends, it becomes a candidate and initiates an election request.
  • Candidate: the transitional state between follower and leader. A candidate asks the other nodes to vote for it, and if it obtains more than half of the votes it becomes the leader.
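
The three roles and the moves between them can be written down as a small state machine. This is only a sketch: the event names (`election_timeout`, `won_majority`, and so on) are made up for illustration and are not part of any Raft library.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


def next_role(role: Role, event: str) -> Role:
    """Return the role after `event` (hypothetical event names)."""
    transitions = {
        (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
        (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # new voting round
        (Role.CANDIDATE, "won_majority"): Role.LEADER,
        (Role.CANDIDATE, "heartbeat_from_leader"): Role.FOLLOWER,
        (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,
    }
    # Any (role, event) pair not listed leaves the role unchanged.
    return transitions.get((role, event), role)


print(next_role(Role.FOLLOWER, "election_timeout"))  # Role.CANDIDATE
```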


3. About the log

Consider the local logs of two nodes, A and B, at some point in time. Each log entry has three parts:

  • Log ID: a number that uniquely identifies each operation log entry on the local node. It is continuous and monotonically increasing. It is used to recover data when the logs in the distributed system become inconsistent, and it ensures that every node eventually reaches the same state, because the same log sequence applied to the same state machine gives the same result.
    • For example, node A has 8 log entries while node B has only 4, so the data on the two nodes is inconsistent. Suppose node A is the leader. Through heartbeats with B it learns that B's latest local entry is the fourth. After each following heartbeat request, one log entry is copied to B, so after 4 heartbeats node B's log matches the leader's. This is in effect a data recovery process.
  • Term number: this is somewhat similar to the proposal ID in Paxos, but it does not increase with every proposal (treating every consensus negotiation request as a proposal). It increases only when an election vote request is issued, not when a write request is issued, which is why Raft calls it the term number [Term]. In a log entry it records which term's leader issued the proposal (write request). Functionally it plays the same role as the Paxos proposal ID: a Paxos node does not accept a request with a proposal ID smaller than its own, and a Raft node does not accept a request (whether an election request or a log replication request) whose term number is smaller than the one it has stored locally.
  • Command content: similar to the proposal value in Paxos, this is the actual operation requested by the client.
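
A minimal sketch of such a log entry, using the node A / node B example above. The names `LogEntry`, `missing_ids`, and `accepts` are illustrative assumptions, and the commands and term values are placeholder data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    log_id: int   # continuous, monotonically increasing position in the log
    term: int     # term of the leader that issued the write
    command: str  # the client's actual instruction


def missing_ids(leader_log: list[LogEntry], follower_log: list[LogEntry]) -> list[int]:
    """Log IDs present on the leader but not yet on the follower."""
    have = {entry.log_id for entry in follower_log}
    return [entry.log_id for entry in leader_log if entry.log_id not in have]


def accepts(local_term: int, request_term: int) -> bool:
    """A node ignores any request whose term is smaller than its own."""
    return request_term >= local_term


node_a = [LogEntry(i, 1, f"set x={i}") for i in range(1, 9)]  # 8 entries
node_b = node_a[:4]                                           # only 4 entries
print(missing_ids(node_a, node_b))  # [5, 6, 7, 8]
```

The four missing IDs correspond to the four heartbeats node A needs to bring node B up to date.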


4. Core consensus negotiation process

4.1 Leader election process

  • At the start, every node is in the follower state. Every follower and candidate runs a countdown timer (the shrinking outer ring in the animation). The countdown length of each node is randomly generated, usually a few hundred milliseconds. When a follower's countdown ends, it becomes a candidate and initiates an election request. When a candidate's countdown ends, it initiates a new round of election requests. This is also called the election timeout.
  • In the example, the countdowns of nodes S4 and S5 end one after the other and both nodes become candidates. S4 votes for itself, increments its locally stored term number to 2, and then sends a vote request (in Paxos terms, a proposal whose ID is S4's term number, 2) to the other 4 nodes.
  • The three followers S1, S2, and S3 receive candidate S4's request first. Votes are granted first come, first served, and the request's term number 2 is not less than the term number 1 they have stored, so they accept the request: all three vote for S4 and update their locally stored term number to 2.
  • Candidate S5's vote request arrives after the three followers have voted. Because each of them has already cast its vote, all three reject S5's request.
  • In the end S4 has 4 votes and S5 has only 1 (its own). By the majority principle (more than half of the votes), S4 becomes the leader and starts sending heartbeat packets to the 3 followers and to candidate S5. Followers reset their countdown when they receive the leader's heartbeat. Candidate S5 also resets its countdown and, in addition, steps down to follower.
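
The election above can be replayed in a few lines. This is a simplified sketch: the `Node` class and its method names are hypothetical, messages are plain function calls, and the log comparison discussed in section 5.3 is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    term: int = 1
    voted_for: Optional[str] = None

    def start_election(self) -> int:
        """Become a candidate: move to a new term and vote for yourself."""
        self.term += 1
        self.voted_for = self.name
        return self.term

    def handle_vote_request(self, candidate: str, candidate_term: int) -> bool:
        """Grant at most one vote per term, first come, first served."""
        if candidate_term < self.term:
            return False                  # stale term: reject
        if candidate_term > self.term:
            self.term = candidate_term    # adopt the newer term
            self.voted_for = None
        if self.voted_for in (None, candidate):
            self.voted_for = candidate
            return True
        return False                      # already voted in this term


nodes = {name: Node(name) for name in ("S1", "S2", "S3", "S4", "S5")}
s4_term = nodes["S4"].start_election()
s5_term = nodes["S5"].start_election()

votes = {"S4": 1, "S5": 1}                # each candidate's own vote
for candidate, term in (("S4", s4_term), ("S5", s5_term)):  # S4's requests arrive first
    for name, node in nodes.items():
        if name != candidate and node.handle_vote_request(candidate, term):
            votes[candidate] += 1

majority = len(nodes) // 2 + 1
print(votes, votes["S4"] >= majority)     # {'S4': 4, 'S5': 1} True
```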


4.1.1 Election timeout

  • A candidate such as S5 above keeps its countdown timer running. The countdown continues if the candidate neither wins the election nor receives a heartbeat from a new leader, for example because the new leader has crashed or because no candidate obtained more than half of the votes.
  • For example, suppose the new leader S4 crashes before sending any heartbeat packet. Candidate S5's countdown keeps running, and when it ends S5 initiates a new round of vote requests and increments its locally stored term number to 3.
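
A rough sketch of the timer side of this. The article only says the countdown is "a few hundred milliseconds"; the 150 to 300 ms default used here is the range suggested in the Raft paper and is an assumption of this example, as are the function names.

```python
import random


def election_timeout_ms(rng: random.Random, low: int = 150, high: int = 300) -> int:
    """Pick a random countdown length (the default range is an assumption)."""
    return rng.randint(low, high)


def on_election_timeout(role: str, term: int) -> tuple[str, int]:
    """When the countdown reaches zero, (re)start an election in a new term."""
    if role in ("follower", "candidate"):
        return "candidate", term + 1
    return role, term  # a leader has no election countdown


rng = random.Random(42)
timeouts = {name: election_timeout_ms(rng) for name in ("S1", "S2", "S3", "S5")}
first_to_fire = min(timeouts, key=timeouts.get)  # node whose countdown ends first

# Candidate S5 in term 2 hears nothing from a leader and times out again.
print(on_election_timeout("candidate", 2))  # ('candidate', 3)
```

Because every node draws its own timeout, it is unlikely that two countdowns end at the same instant, so one node usually gets its vote requests out first.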


4.2 Data write process (log replication)

  • When a client wants to write data (for example, the command set x=3), leader S5 first writes the operation to its own local log without committing it, then sends a log replication request to the other 4 followers.
  • Each follower checks that the term number of the request is not less than its locally stored term number (here both are 3), accepts the request, writes the operation to its own local log, also without committing it, and responds to the leader.
  • Once leader S5 has successful writes from more than half of the nodes, it commits the entry locally and returns a success message to the client.
  • When a follower later receives a heartbeat or log replication message from the leader and sees that the leader has committed an entry that it has not yet committed itself, the follower commits that entry as well.
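
The write path can be sketched as follows. This is a simplification under stated assumptions: the `Replica` class and `leader_write` function are hypothetical, replication is a direct method call instead of a network message, and the leader counts itself toward the majority.

```python
class Replica:
    def __init__(self, name: str, term: int):
        self.name, self.term = name, term
        self.log: list[tuple[int, str]] = []  # (term, command) entries
        self.commit_index = 0                 # number of committed entries

    def append(self, term: int, command: str) -> bool:
        """Follower side: store the entry uncommitted unless the term is stale."""
        if term < self.term:
            return False
        self.term = term
        self.log.append((term, command))
        return True

    def learn_commit(self, leader_commit: int) -> None:
        """Commit what the leader has committed and this node already holds."""
        self.commit_index = min(leader_commit, len(self.log))


def leader_write(leader: Replica, followers: list[Replica], command: str) -> bool:
    leader.log.append((leader.term, command))             # 1. write locally, uncommitted
    acks = 1 + sum(f.append(leader.term, command) for f in followers)  # 2. replicate
    if acks <= (len(followers) + 1) // 2:                 # no majority: report failure
        return False
    leader.commit_index = len(leader.log)                 # 3. commit, answer the client
    for f in followers:                                   # 4. later messages carry the commit
        f.learn_commit(leader.commit_index)
    return True


s5 = Replica("S5", term=3)
others = [Replica(name, term=3) for name in ("S1", "S2", "S3", "S4")]
print(leader_write(s5, others, "set x=3"), [f.commit_index for f in others])
# True [1, 1, 1, 1]
```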


5. Failure scenarios

5.1 The leader crashes

  • After leader S3 crashes, S4's countdown ends first and S4 becomes a candidate. It initiates a round of voting, obtains more than half of the votes, and becomes the new leader. When the old leader later recovers, it automatically steps down to follower: on receiving a heartbeat from leader S4, it sees that the heartbeat's term number 3 is greater than its local term number 2, so it updates its local term number to 3 and from then on rejects any request (whether an election request or a data write request) with a term number smaller than 3.


5.2 Data inconsistency (safety and recovery)

  • Suppose follower S1 crashes and leader S5 then writes data twice. Follower S4 then stops receiving heartbeats from leader S5, times out, becomes a candidate, and sends an election request. After obtaining more than half of the votes, S4 becomes the new leader and writes data twice more. When follower S1 comes back up, its local log is inconsistent with the other nodes: 4 entries are missing.

One detail is worth spelling out. Once leader S4 is elected, the log read pointers of all followers are aligned with the leader's (all point to log ID 5). More precisely, it is the leader that keeps a log read pointer for each follower, because the leader needs to know the state of each follower's log: consistent, missing entries, or containing wrong data. Why align with the leader's pointer? Because the leader's log is authoritative: even if a follower's log has more entries than the leader's, the leader overwrites the follower's log starting from the read pointer until the two are consistent.

Follower S1's data recovery process in detail

  • After follower S1 recovers and receives leader S4's heartbeat for the first time, it sees that the leader's term number 4 is greater than its locally stored term number 3 and updates its own to 4 (from then on it only accepts requests with a term number greater than or equal to 4). S1 also finds that its log is inconsistent with leader S4's, so it refuses the new log replication and returns a failure message to the leader.
  • The leader's heartbeat carries information about its latest local log entry, for example <latest log ID 6, term number of the latest entry 4>. Under normal circumstances, a follower whose latest local entry is also <log ID 6, term number 4> responds to the heartbeat normally. On S1, however, the entry just before the log pointer (that is, the latest log position) is empty, so it does not match the leader's entry at that position, and S1 returns a failure message to the leader (Figure 2.2).
  • After receiving the failure, the leader decrements S1's log read pointer, to log ID 4 in this example. S1 again finds that the entry before the pointer is empty and does not match the leader's entry at that position, and returns another failure. The leader decrements S1's pointer again, to 3 (Figure 2.3). This time the entry before the pointer matches the leader's, so S1 returns a success message to leader S4.
  • The leader then overwrites and restores follower S1's log starting from log pointer 3, one entry per heartbeat. After 4 heartbeats, S1's data is fully restored.
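
The back-off and copy steps can be simulated directly. This is a sketch under assumptions: the per-follower "log read pointer" is what the Raft paper calls nextIndex, the function names are hypothetical, and the commands and the terms of the older entries are placeholder values. Only the shape of the logs (S1 holds 2 entries, S4 holds 6, pointer starts at 5) comes from the article.

```python
Entry = tuple[int, str]  # (term, command)


def append_entries(follower_log: list[Entry], prev_id: int, prev_term: int,
                   entries: list[Entry]) -> bool:
    """Follower side: accept only if the entry just before the pointer matches."""
    if prev_id > 0:
        if len(follower_log) < prev_id or follower_log[prev_id - 1][0] != prev_term:
            return False                  # missing or conflicting entry
    if entries:
        del follower_log[prev_id:]        # the leader's log wins: overwrite
        follower_log.extend(entries)
    return True


def prev_of(leader_log: list[Entry], pointer: int) -> tuple[int, int]:
    """ID and term of the leader's entry just before `pointer` (1-based IDs)."""
    prev_id = pointer - 1
    return prev_id, (leader_log[prev_id - 1][0] if prev_id > 0 else 0)


def recover(leader_log: list[Entry], follower_log: list[Entry], pointer: int) -> tuple[int, int]:
    """Leader side. Returns (failed probes, heartbeats used for copying)."""
    failures = 0
    while not append_entries(follower_log, *prev_of(leader_log, pointer), []):
        pointer -= 1                      # follower said no: step the pointer back
        failures += 1
    copies = 0
    while pointer <= len(leader_log):     # copy one entry per heartbeat
        append_entries(follower_log, *prev_of(leader_log, pointer),
                       [leader_log[pointer - 1]])
        pointer += 1
        copies += 1
    return failures, copies


s4_log = [(3, "a"), (3, "b"), (3, "c"), (3, "d"), (4, "e"), (4, "f")]
s1_log = s4_log[:2]                       # S1 missed the last 4 entries
print(recover(s4_log, s1_log, pointer=5), s1_log == s4_log)  # (2, 4) True
```

The two failed probes correspond to the pointer moving from 5 to 4 to 3, and the four copy steps to the four heartbeats in the text.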

Figure 2.2

Figure 2.3

5.3 Nodes with low log integrity become candidates

  • Another Raft principle is data integrity: only a candidate whose log is sufficiently complete can become leader, because the leader is responsible for data recovery. An election request from a candidate with a less complete log is rejected.
  • For example, because S1 was down earlier, its local log is missing two entries. Suppose it becomes a candidate and initiates a round of vote requests. The followers that receive them find that S1's log is less complete than their own, reject S1's election request, and update their locally stored term number to 3.
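
The article does not spell out how "data integrity" is compared. The sketch below uses the comparison described in the Raft paper (the term of the last entry first, then the log length); the function name and the sample values are assumptions made for illustration.

```python
def candidate_log_ok(cand_last_term: int, cand_last_id: int,
                     my_last_term: int, my_last_id: int) -> bool:
    """True if the candidate's log is at least as complete as the voter's."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term  # a later last term wins
    return cand_last_id >= my_last_id         # same term: the longer log wins


# Hypothetical values: S1 is two entries behind a follower in the same term.
print(candidate_log_ok(cand_last_term=2, cand_last_id=2,
                       my_last_term=2, my_last_id=4))  # False: vote refused
```

A voter would apply this check in addition to the term and one-vote-per-term rules from section 4.1.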


5.4 Network partition problem

  • In the example, a network partition splits the cluster so that nodes A and B cannot communicate with nodes C, D, and E. Nodes C, D, and E stop receiving heartbeats from leader B, and once a countdown ends they start electing a new leader. Say node C obtains 3 votes, more than half, becomes leader, and increments its term number to 2. The cluster now has two leaders, which appears to violate the principle that Raft has only one leader, although the two leaders belong to different terms.
  • A client then submits the write request set 3 to leader B in the lower partition (A, B). Leader B sends a log replication request to the other nodes, but only node A receives it and writes it to its local log without committing. Leader B counts 2 successful writes, which is not more than half, so it returns a write failure to the client.
  • Another client submits the write request set 3 to leader C in the upper partition (C, D, E). Leader C's log replication request is acknowledged by more than half of the nodes, so C commits its log entry, returns a write success to the client, and then notifies the other nodes to commit.
  • After the network partition heals, leader C receives a message from leader B but does not process it, because leader B's term number 1 is less than its own. Leader B, in turn, receives a message from leader C whose term number is greater than its own, so B automatically steps down to follower and then recovers its log data from leader C, following the data recovery procedure in section 5.2.
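
Two small rules drive the whole partition scenario: the majority count and the term comparison. A minimal sketch, with hypothetical function names:

```python
def has_majority(acks: int, cluster_size: int) -> bool:
    """More than half of the full cluster, not of the reachable nodes."""
    return acks > cluster_size // 2


def on_message(role: str, term: int, msg_term: int) -> tuple[str, int, bool]:
    """Return (role, term, processed) after seeing a message with `msg_term`."""
    if msg_term < term:
        return role, term, False         # stale sender: ignore the message
    if msg_term > term:
        return "follower", msg_term, True  # newer term: step down and adopt it
    return role, term, True


print(has_majority(2, 5), has_majority(3, 5))  # False True
print(on_message("leader", 2, 1))              # ('leader', 2, False)
print(on_message("leader", 1, 2))              # ('follower', 2, True)
```

The first line is leader B (2 writes out of 5) versus leader C (3 out of 5); the last two lines are C ignoring B's message and B stepping down once the partition heals.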

5.5 Membership change issues

  • Changing the number of nodes in a Raft cluster can leave nodes on the old and new configurations with different views of the cluster's total size. Their "more than half" calculations then disagree, which can produce two leaders at once, just as in a network partition. Raft solves this problem with single-server changes.

Single node change

  • For example, to expand a three-node cluster to five nodes, add one node at a time instead of two at once.
  • Because only one node is added or removed at a time, any majority of the old configuration and any majority of the new configuration share at least one node, so two separate majorities, one for the old configuration and one for the new, cannot exist at the same time. The core idea is to keep every node's understanding of the "majority" consistent.
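
The overlap claim can be checked by brute force for small clusters. In this sketch a configuration is just a string of node letters, and the function names are made up for the example. Only the smallest majorities need checking, since any larger majority contains one of them.

```python
from itertools import combinations


def majorities(members: str) -> list[frozenset[str]]:
    """All smallest majorities of a configuration."""
    size = len(members) // 2 + 1
    return [frozenset(group) for group in combinations(members, size)]


def always_overlap(old: str, new: str) -> bool:
    """True if every majority of `old` shares a node with every majority of `new`."""
    return all(a & b for a in majorities(old) for b in majorities(new))


print(always_overlap("ABC", "ABCD"))   # True: one node added
print(always_overlap("ABC", "ABCDE"))  # False: {A, B} and {C, D, E} are disjoint
```

Adding two nodes at once allows an old-configuration majority and a new-configuration majority with no node in common, which is exactly the situation in which two leaders could be elected.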


The specific change process:
Assume the cluster currently has three nodes A, B, and C, and a new node D needs to be added.

  • 1. The leader (node B) first synchronizes data to the new node (node D).
  • 2. Once synchronization is complete, the leader (node B) writes the new configuration [A, B, C, D] as a log entry and replicates it to all nodes in the new configuration. After receiving responses from a majority, it commits the entry to its local log. This keeps the cluster membership configuration consistent on every node: all of them see [A, B, C, D].

6. Summary

1. Raft's strong-leader model routes everything through the leader, much like a single-machine system, so performance and throughput are limited.

2. Differences from the Multi-Paxos consensus algorithm

  • In Raft, not every node can be elected leader: only a node with the most complete log can. In addition, the Raft log must be continuous.
  • Raft guarantees at most one leader per term, and greatly reduces failed elections through terms, leader heartbeat messages, random election timeouts, first-come-first-served voting, and the majority vote principle.

3. Random countdown mechanism. The Raft algorithm uses randomized election timeouts to spread out the moments at which nodes time out. In most cases only one node starts an election first, rather than several at once, which reduces the split votes that cause elections to fail.

[Animated demo website]


Origin blog.csdn.net/weixin_41347419/article/details/115046184