A Detailed Walkthrough of a Raft Consensus Algorithm Simulation

Abstract: Raft is a distributed consensus algorithm used to solve the consistency problem in distributed systems.

This article is shared from Huawei Cloud Community " Raft Algorithm Simulation Number of Consensus Algorithm ", author: TiAmoZhang.

01. Leader election

Consider a Raft cluster of three members, A, B, and C. When the cluster first starts, every member is in the Follower state. The heartbeat timeouts are 110ms for member A, 150ms for member B, and 130ms for member C, as shown in Figure 1.

■ Figure 1 Raft simulation initial state

Because the cluster has no Leader yet, members A, B, and C receive no heartbeat messages. Member A has the shortest timeout, so it times out first: it changes its state to Candidate, increments its term to 1, and sends a vote request, as shown in Figure 2.

■ Figure 2 Request to Vote
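The Candidate transition described above can be sketched as follows. This is a minimal illustration, not a full Raft node: the `Member` class, the `start_election` function, and the dictionary layout of the message are hypothetical names chosen for this example, and the starting term of 0 is inferred from the text saying the term is increased to 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    name: str
    state: str = "Follower"
    term: int = 0
    last_log_index: int = 0
    voted_for: Optional[str] = None


def start_election(member: Member) -> dict:
    """Turn a timed-out Follower into a Candidate and build its RequestVote."""
    member.state = "Candidate"
    member.term += 1                  # member A goes from term 0 to term 1
    member.voted_for = member.name    # a Candidate counts its own vote
    return {
        "type": "RequestVote",
        "candidate": member.name,
        "term": member.term,
        "last_log_index": member.last_log_index,
    }


a = Member("A")
request = start_election(a)
print(a.state, request["term"], request["last_log_index"])  # Candidate 1 0
```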

Member A broadcasts a RequestVote message to members B and C. The message describes the data member A holds, including its term and its latest log index. Members B and C process the RequestVote message according to the following voting rules.

  • A member with a larger term refuses to vote for a candidate with a smaller term.
  • A member with a larger latest log index refuses to vote for a candidate with a smaller latest log index.
  • A member casts only one vote per term, on a first-come, first-served basis.

Here, the terms of members B and C are smaller than member A's term, neither of them has a log index larger than member A's, and neither has yet cast its vote for term 1. Members B and C therefore both cast their term-1 vote for member A and update their own terms to 1.
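As a rough sketch, the three voting rules can be written as one handler. The `Voter` class and `handle_request_vote` function are hypothetical names for this illustration. The log check follows the article's simplified rule of comparing only the log index; the Raft specification compares the term of the last log entry first and the index second.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    name: str
    term: int = 0
    last_log_index: int = 0
    voted_for: Optional[str] = None


def handle_request_vote(voter: Voter, candidate: str,
                        cand_term: int, cand_last_log_index: int) -> bool:
    """Return True if `voter` grants its vote to `candidate`."""
    # Rule 1: a larger term refuses to vote for a smaller term.
    if cand_term < voter.term:
        return False
    # Seeing a newer term: adopt it and forget any vote from the old term.
    if cand_term > voter.term:
        voter.term = cand_term
        voter.voted_for = None
    # Rule 2 (simplified as in the article): compare log indexes only.
    if cand_last_log_index < voter.last_log_index:
        return False
    # Rule 3: one vote per term, first come, first served.
    if voter.voted_for not in (None, candidate):
        return False
    voter.voted_for = candidate
    return True


b = Voter("B")
print(handle_request_vote(b, "A", cand_term=1, cand_last_log_index=0))  # True
print(handle_request_vote(b, "C", cand_term=1, cand_last_log_index=0))  # False
```

The second call is refused because member B has already given its term-1 vote to member A.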

Member A now holds 3 votes, including its own, which is a majority. It is promoted to Leader and sends heartbeat messages to the other members to maintain its leadership, as shown in Figure 3.

■ Figure 3 Leader Promotion Schematic
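The majority check behind the promotion is a single comparison. The function name `wins_election` is a hypothetical helper for this example.

```python
def wins_election(votes_received: int, cluster_size: int) -> bool:
    """A Candidate wins once it holds strictly more than half of the votes."""
    return votes_received > cluster_size // 2


print(wins_election(3, 3))  # True: member A's case, votes from A, B, and C
print(wins_election(2, 3))  # True: two of three is still a majority
print(wins_election(1, 3))  # False: only its own vote
```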

If member A had not received a majority of votes within the agreed time period, it would reset its timeout and end its election attempt. The other members would then start their own elections as their heartbeat timeouts expire. In this example, the order in which members start an election is A→C→B.
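The order A→C→B follows directly from sorting the members by their heartbeat timeouts. A short sketch using the values from Figure 1 (the variable names are illustrative):

```python
# Heartbeat timeouts from the example, in milliseconds.
timeouts_ms = {"A": 110, "B": 150, "C": 130}

# The member whose timeout expires first starts its election first.
election_order = sorted(timeouts_ms, key=timeouts_ms.get)
print("→".join(election_order))  # A→C→B
```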

If, perhaps because of network problems, every member in the cluster runs an election and none of them wins a majority, each member randomly generates a new timeout and the next round of elections begins.
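A minimal sketch of the re-randomization step is shown below. The function `redraw_timeouts` and the 100ms to 200ms range are assumptions made for this example; the article does not state the range from which new timeouts are drawn.

```python
import random


def redraw_timeouts(members, rng, low_ms=100, high_ms=200):
    """Give every member a fresh random timeout after a failed election round.

    The range is a hypothetical choice; a real deployment tunes it to the
    network's round-trip time.
    """
    return {m: rng.randint(low_ms, high_ms) for m in members}


rng = random.Random(42)  # seeded only so the example is repeatable
timeouts = redraw_timeouts(["A", "B", "C"], rng)
next_candidate = min(timeouts, key=timeouts.get)
print(timeouts, "-> first to time out:", next_candidate)
```

Because the timeouts are drawn independently, it is unlikely that all members time out at the same moment again, which is what lets one Candidate get ahead and collect a majority.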

02. Log replication

Log replication is a one-phase negotiation: there is no separate commit message, because the commit of a log entry is carried by the next round of negotiation or by a heartbeat message. To process a transaction request, Raft therefore needs to send only one round of AppendEntries messages.

An AppendEntries message contains the log entries to be replicated and usually also carries the Leader's committedIndex, the index of the last log entry the Leader has committed. Each Follower maintains its own committedIndex locally and compares it with the Leader's committedIndex to advance its own commits.

Continuing the example from Figure 3: the cluster has three members, member A is the Leader, members B and C are Followers, and no log entries have been committed yet. When the Leader receives an Add request from a client, the Leader and the Followers perform the following steps in order, as shown in Figure 4.

■ Figure 4 Log Replication-Replication

(1) The Leader wraps the request in a log entry and appends it to its local log at log index 1.

(2) The Leader broadcasts the log entry to all Followers in an AppendEntries(0, <1, Add>) message, where:

  • The first parameter is committedIndex, the index of the last log entry the Leader has committed.
  • The second parameter is the log index of the entry being replicated, that is, the index of the Add log entry.
  • The third parameter is the transaction operation, that is, the client's instruction.

(3) Each Follower receives the message and appends the log entry to its local log.

At this point, members A, B, and C all hold the Add log entry, persisted at index 1. After a Follower finishes processing the AppendEntries message, it replies to the Leader with an ACK message to signal that it has accepted the log entry. Once the Leader has received ACK messages from a majority, it commits the log entry locally, applies it to its state machine, and returns the result to the client, as shown in Figure 5.

■ Figure 5 Log Replication-Reply
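Steps (1) to (3) and the ACK round can be sketched as a small in-memory simulation. The `Node` class and `replicate` function are hypothetical, messages are plain method calls rather than network traffic, and counting the Leader's own copy toward the majority is an assumption of this sketch.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.log = []             # log[i] holds the entry at log index i + 1
        self.committed_index = 0

    def append_entries(self, leader_committed_index, index, command):
        """Follower side of AppendEntries(committedIndex, <index, command>)."""
        if index != len(self.log) + 1:
            return False          # gap in the log: alignment would be needed
        self.log.append(command)  # step (3): append to the local log
        return True               # ACK


def replicate(leader, followers, command):
    """Leader side: append locally, broadcast, commit on a majority of ACKs."""
    leader.log.append(command)                      # step (1)
    index = len(leader.log)
    acks = 1                                        # the Leader's own copy
    for follower in followers:                      # step (2)
        if follower.append_entries(leader.committed_index, index, command):
            acks += 1
    if acks > (len(followers) + 1) // 2:            # majority of the cluster
        leader.committed_index = index
    return leader.committed_index


a, b, c = Node("A"), Node("B"), Node("C")
print(replicate(a, [b, c], "Add"))                  # 1
print(b.log, b.committed_index)                     # ['Add'] 0
```

After the call, the Leader has committed index 1 while the Followers hold the entry but have not committed it yet.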

In this scenario, member A has committed the log entry at index 1, while members B and C hold the full log entry at index 1 but have not committed it. They must wait for the next AppendEntries message and use its committedIndex to advance the commit of the entry at index 1. Take a heartbeat AppendEntries message as an example: it carries only the committedIndex. The Leader has already committed the log entry at index 1, so committedIndex is 1, and each Follower can commit every log entry with index 1 or lower, as shown in Figure 6.

■ Figure 6 Log Replication-Heartbeat
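The Follower's commit rule on receiving a heartbeat can be sketched as one function. The name `on_heartbeat` is hypothetical. Capping the result at the Follower's own last log index is an assumption added here so that a Follower never commits an entry it has not stored.

```python
def on_heartbeat(committed_index: int,
                 leader_committed_index: int,
                 last_log_index: int) -> int:
    """Return the Follower's committedIndex after a heartbeat AppendEntries."""
    # Never move backward, and never commit past the entries held locally.
    return max(committed_index, min(leader_committed_index, last_log_index))


# Members B and C: entry 1 is stored but uncommitted, and the Leader reports 1.
print(on_heartbeat(committed_index=0, leader_committed_index=1, last_log_index=1))  # 1
```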

03. Log alignment

We write a log entry as <term, logIndex>. As shown in Table 1, the entry at log index 3 of Follower E and the entry at log index 4 of Follower D are inconsistent with the current Leader. This can happen if Follower E and Follower D were each elected Leader, proposed the log entries at indexes 3 and 4 in their own terms, and then went down immediately afterward.

■ Table 1 Log Alignment

Making Follower E and Follower D consistent with the Leader's data takes two steps: find nextIndex, then copy the log entry at nextIndex and all entries after it. In Raft, both steps are carried out with AppendEntries messages. Taking Follower E as an example, the interaction proceeds as follows:

(1) The Leader initializes nextIndex for Follower E as nextIndex=lastLogIndex+1, that is, nextIndex=6+1=7.

(2) The Leader sends a probe message through AppendEntries, carrying preLogIndex (nextIndex-1) and preLogTerm, where preLogIndex=6 and preLogTerm=3.

(3) The Follower receives the probe message, checks its log entry at index 6, and returns a failed response to the Leader with lastLogIndex=3.

(4) The Leader receives the failed response and sets nextIndex to the response's lastLogIndex+1, that is, nextIndex=4.

(5) The Leader sends the next probe message, where preLogIndex=3 and preLogTerm=2.

(6) The Follower receives the probe message, compares its log entry at index 3, and again returns a failed response to the Leader with lastLogIndex=3.

(7) The Leader receives the failed response. This time the response's lastLogIndex+1 (4) does not lower nextIndex, which is already 4, so the Leader decrements nextIndex by one, to 3.

(8) The Leader sends the next probe message, where preLogIndex=2 and preLogTerm=1.

(9) The Follower receives the probe message, compares its log entry at index 2, and returns a successful probe response to the Leader.

(10) With nextIndex found, the Leader sends log entries to the Follower in AppendEntries messages, starting from nextIndex, that is, with the log entry at index 3.

(11) The Follower treats the Leader's data as authoritative, overwrites its local log entries, and returns a successful response to the Leader.

(12) After the Leader receives a successful response, it increments nextIndex and sends the next log entry. This continues until nextIndex reaches the Leader's lastLogIndex, at which point the Follower holds all of the Leader's data and this round of log alignment is complete.
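The probing in steps (1) to (12) can be traced with a short simulation. The functions `probe` and `align` are hypothetical names. Each log is represented only by the term of each entry, and the concrete log contents are assumptions chosen to agree with the values quoted in the steps (preLogTerm 3 at index 6, 2 at index 3, 1 at index 2, and a Follower lastLogIndex of 3). For brevity the sketch copies all missing entries at once instead of one entry per message.

```python
def probe(follower_terms, pre_log_index, pre_log_term):
    """Follower side: does the entry at pre_log_index carry pre_log_term?

    Returns (success, follower_last_log_index).
    """
    last = len(follower_terms)
    ok = pre_log_index == 0 or (
        pre_log_index <= last
        and follower_terms[pre_log_index - 1] == pre_log_term
    )
    return ok, last


def align(leader_terms, follower_terms):
    """Leader side: find nextIndex by probing, then overwrite from there."""
    next_index = len(leader_terms) + 1                        # step (1)
    trace = []
    while True:
        pre_index = next_index - 1
        pre_term = leader_terms[pre_index - 1] if pre_index > 0 else 0
        ok, follower_last = probe(follower_terms, pre_index, pre_term)
        trace.append((pre_index, pre_term, ok))
        if ok:
            break
        if follower_last + 1 < next_index:
            next_index = follower_last + 1                    # step (4): jump
        else:
            next_index -= 1                                   # step (7): step back
    del follower_terms[next_index - 1:]                       # step (11): overwrite
    follower_terms.extend(leader_terms[next_index - 1:])
    return next_index, trace


leader = [1, 1, 2, 2, 3, 3]   # hypothetical terms for log indexes 1..6
follower_e = [1, 1, 4]        # hypothetical: conflicting entry at index 3
next_index, trace = align(leader, follower_e)
print(next_index)             # 3
print(trace)                  # [(6, 3, False), (3, 2, False), (2, 1, True)]
print(follower_e == leader)   # True
```

The three trace entries correspond to the probes in steps (2), (5), and (8).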

 


Origin my.oschina.net/u/4526289/blog/10086299