Why Raft is an easier-to-understand distributed consensus algorithm

The consistency (consensus) problem is one of the most revered problems in distributed systems, and research on it goes back decades.

Byzantine Generals Problem

The problem takes its name from the paper "The Byzantine Generals Problem", which Leslie Lamport and his co-authors published more than thirty years ago (ref [1]).

Byzantium, located in what is now Istanbul, Turkey, was the capital of the Eastern Roman Empire. Because the empire's territory was vast, its armies were stationed far apart for defensive purposes, and the generals could only communicate through messengers. In wartime, all the generals had to reach a consensus on whether there was a chance of winning before attacking the enemy's camp. However, the army might contain traitors and enemy spies, whose messages could sway the generals' decisions and throw the whole army into disorder, so that the consensus reached would not reflect the opinion of the majority. The Byzantine problem is this: knowing that some members are unreliable, how can the remaining loyal generals reach agreement without being influenced by the traitors or spies? The Byzantine assumption is a model of the real world, where computers and networks can behave unpredictably because of hardware errors, network congestion or disconnection, and malicious attacks.

Lamport studied this kind of problem for years and published a series of papers on it. In summary, they set out to answer the following three questions:
1. Is there a solution to a distributed consistency problem like the Byzantine Generals Problem?
2. If there is a solution, what conditions must be met?
3. Given those preconditions, what is a concrete solution?

The first two questions were answered by Lamport in "The Byzantine Generals Problem", while the third was answered in a later paper, "The Part-Time Parliament", which proposed an algorithm named Paxos. That paper relies heavily on mathematical proofs, and I basically could not follow it (I don't even recognize all of the mathematical symbols -。-;). Since it was hard for everyone to understand, Lamport later wrote another paper, "Paxos Made Simple", which drops the mathematical notation entirely and derives the algorithm in plain English. I struggled through it word by word and felt that I had understood something, but if you ask whether I truly understand it, I don't meet my own standard. For me there is a clear standard for understanding an algorithm: if you really understand it, you can map the algorithm to code in your head. After reading that paper, I still could not map it to code with any clarity.

Lamport may think Paxos is simple, but perhaps that is only true for a mind like his. In practice most people still find it hard to understand, so Raft was designed in the hope of providing a more understandable alternative to Paxos. Understandability is one of the algorithm's main goals, as the title of the paper shows: "In Search of an Understandable Consensus Algorithm".

Before getting to the main topic, let me recall an old story that gives an intuitive feel for how much easier or harder a problem becomes depending on the perspective from which you look at it.

Comprehensibility from different perspectives

I vaguely remember that about twenty years ago, when I was still in junior high school, I came across an interesting problem in a book that might have been called "Divergent Thinking in Mathematics" (I can't remember the title clearly).


Two people, A and B, take turns placing black and white Go pieces on a round table, one piece at a time, and the pieces may not overlap. Whoever first has no place to put a piece loses.
How should you play in order to win?

This question has two parts. First, is there a strategy that guarantees a win? Second, if so, how do you prove it? Stop here and think for ten seconds.



The diagram above gives the answer: the first mover wins. It illustrates three different ways of thinking.
1. Suppose the table is only the size of one Go piece. The first mover places a piece and the second has nowhere to play.
2. Suppose the table is arbitrarily large. The first mover occupies the center of the circle. Since the circle is symmetric about its center, whenever the opponent can still find a place to put a piece, you can always place one at the mirror-image position on the opposite side of the center.
3. In a circle, an odd number of small circles of equal diameter, tangent to each other, can be drawn.

The three ways of thinking get progressively harder to understand. The first is minimalist thinking, which is intuitive but not mathematically rigorous. The second is limit thinking; combined with the first it amounts to mathematical induction, which is rigorous. The third is visual thinking, which uses geometric concepts but is hard to follow for readers without a background in geometry.
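The symmetry argument in the second approach can be checked with a small simulation. This is only a sketch under assumed values: the table and piece radii, the random second player, and the function names are all made up for illustration. The first player opens at the center and then always answers at the mirror image of the opponent's last move.

```python
import math
import random

PIECE_RADIUS = 1.0   # assumed piece size
TABLE_RADIUS = 6.0   # assumed table size

def fits(pos, placed):
    """A piece fits if it lies fully on the table and overlaps no placed piece."""
    x, y = pos
    if math.hypot(x, y) + PIECE_RADIUS > TABLE_RADIUS:
        return False
    return all(math.hypot(x - px, y - py) >= 2 * PIECE_RADIUS for px, py in placed)

def mirror(pos):
    """Reflect a position through the center of the table."""
    return (-pos[0], -pos[1])

def play(seed, attempts=2000):
    """Return the total number of pieces placed when the second player gives up."""
    rng = random.Random(seed)
    placed = [(0.0, 0.0)]                      # first player opens at the center
    while True:
        for _ in range(attempts):              # second player probes random spots
            move = (rng.uniform(-TABLE_RADIUS, TABLE_RADIUS),
                    rng.uniform(-TABLE_RADIUS, TABLE_RADIUS))
            if fits(move, placed):
                break
        else:
            return len(placed)                 # second player found no free spot
        placed.append(move)
        reply = mirror(move)
        assert fits(reply, placed)             # the mirrored spot is always free
        placed.append(reply)
```

Because every reply succeeds, the game always ends on the second player's turn, so the total number of pieces is odd. The random second player only approximates "no place left", but the assertion inside the loop is the real point: the mirrored spot is never blocked.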

An understandable description of the Raft protocol

Although the Raft paper is easier to read than "Paxos Made Simple", it is still fairly wide-ranging and long. After reading it, I put it down and thought it over, and decided that sorting it out in my own words was the only reliable way to make it truly my own. Here I use the first approach from the Go puzzle above, minimalist thinking, to describe how the Raft protocol works and to argue that it does.

There are three roles in a cluster organized by the Raft protocol:
1. Leader
2. Follower
3. Candidate

As in a democratic society, the leader is elected by popular vote. At the beginning there is no leader, and every participant in the cluster is an ordinary citizen (a Follower). A general election is then held in which any citizen may stand: those who stand become Candidates, and one of them is elected Leader by democratic vote. The election then ends, the leader's term of office begins, and all candidates other than the leader return to the role of ordinary citizen and follow the leader. This introduces the concept of a term of office, which Raft calls a Term. Many of Raft's core concepts and terms map neatly onto a real democratic system, which makes them easy to understand. The transition diagram for the three roles is shown below; it is easiest to read together with the election process described next.
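For readers who think in code, the role changes can be written as a small lookup table. This is a minimal sketch based on the state diagram in the Raft paper; the event names and the `next_role` helper are my own labels, not part of any Raft library.

```python
from enum import Enum

class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

# Allowed role changes keyed by (current role, event).
# The event names are illustrative assumptions.
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,   # split vote, try again
    (Role.CANDIDATE, "won_majority"): Role.LEADER,
    (Role.CANDIDATE, "discovered_leader_or_higher_term"): Role.FOLLOWER,
    (Role.LEADER, "discovered_higher_term"): Role.FOLLOWER,
}

def next_role(role, event):
    """Return the new role, or the unchanged role if the event does not apply."""
    return TRANSITIONS.get((role, event), role)
```

Note that there is no direct path from Follower to Leader: a node has to stand as a Candidate and win a majority first.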



Leader election process

Following the minimalist approach, the smallest Raft cluster needs three participants (A, B, and C in the figure below) so that a majority vote is possible. Initially A, B, and C are all Followers. When an election starts there are three possible outcomes. In the first two cases shown in the figure a Leader is elected; in the third, the round of voting is void (a split vote). After a split vote, each participant waits for a random period (the election timeout) before starting a new vote, and this repeats until one participant obtains a majority. The key is that the timeout is random: the first participant to wake up requests votes from the other two while they are still waiting, so they can only vote for it, and agreement is reached quickly.
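The election rounds can be sketched as a toy simulation. This is a heavily simplified model under stated assumptions: there are no terms, logs, or real messages; nodes whose random timeouts tie for the smallest value all become candidates at once; and the 150 to 300 ms range is the one suggested in the Raft paper. The function names are hypothetical.

```python
import random

def majority(cluster_size):
    """Smallest number of votes that is more than half of the cluster."""
    return cluster_size // 2 + 1

def run_election(nodes, rng, low_ms=150, high_ms=300):
    """Repeat voting rounds until one node wins a majority; return (leader, rounds)."""
    rounds = 0
    while True:
        rounds += 1
        # Every node draws a random election timeout for this round.
        timeouts = {n: rng.randint(low_ms, high_ms) for n in nodes}
        first = min(timeouts.values())
        # Nodes that wake up first become candidates and vote for themselves.
        candidates = [n for n in nodes if timeouts[n] == first]
        votes = {c: 1 for c in candidates}
        # Nodes still waiting grant their single vote to one of the candidates.
        for n in nodes:
            if n not in candidates:
                votes[rng.choice(candidates)] += 1
        winner = max(votes, key=votes.get)
        if votes[winner] >= majority(len(nodes)):
            return winner, rounds
        # Otherwise the vote was split; everyone draws a new timeout.
```

With three nodes this reproduces the three cases in the text: one candidate collects all three votes, two candidates end two to one, and three simultaneous candidates get one vote each, which forces another round.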



Once elected, the leader maintains its authority by periodically sending heartbeat messages to all followers. If a follower receives no heartbeat from the leader for some time, it assumes that the leader may have crashed and starts a new leader election.
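The follower's side of the heartbeat rule fits in a few lines. This is a minimal sketch that takes the current time as an explicit argument (in milliseconds, an assumption made so the example stays deterministic); the class and method names are hypothetical.

```python
class FollowerTimer:
    """Tracks when the last heartbeat arrived and decides when to start an election."""

    def __init__(self, election_timeout_ms, now_ms=0):
        self.election_timeout_ms = election_timeout_ms
        self.last_heartbeat_ms = now_ms

    def on_heartbeat(self, now_ms):
        # Any heartbeat from the leader resets the countdown.
        self.last_heartbeat_ms = now_ms

    def should_start_election(self, now_ms):
        # The leader is presumed dead once the timeout elapses in silence.
        return now_ms - self.last_heartbeat_ms >= self.election_timeout_ms
```

For example, with a 300 ms timeout and a heartbeat at 250 ms, the follower stays quiet at 500 ms and only starts an election from 550 ms onward.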

Influence of Leader Node on Consistency

The Raft protocol relies heavily on the availability of the leader node to keep the cluster's data consistent. Data flows in one direction only, from the Leader to the Followers. When a client submits data to the leader, the data the leader receives is initially in the Uncommitted state. The leader then replicates the data to all followers concurrently and waits for their responses, until it is sure that more than half of the nodes in the cluster have received the data. At that point the data enters the Committed state: the leader acknowledges to the client that the data has been received, and notifies the followers that the data has been committed.
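The replicate-then-commit flow can be sketched as follows. This is a toy model, not a real implementation: replication is sequential rather than concurrent, there are no terms, retries, or log repair, and the `Replica` class and `submit` function are names invented for illustration. An entry counts as committed once more than half of the cluster holds it, at which point the client can be acknowledged.

```python
class Replica:
    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.log = []            # replicated entries, in order
        self.commit_index = 0    # number of leading log entries known to be committed

    def append(self, entry):
        """Store an entry; True is the acknowledgement, False means unreachable."""
        if not self.online:
            return False
        self.log.append(entry)
        return True

    def mark_committed(self, commit_index):
        if self.online:
            self.commit_index = commit_index

def submit(leader, followers, entry):
    """Return True if the entry was committed, so the client can be acknowledged."""
    leader.log.append(entry)                             # uncommitted on the leader
    acks = 1 + sum(f.append(entry) for f in followers)   # the leader counts itself
    cluster_size = 1 + len(followers)
    if acks <= cluster_size // 2:
        return False                                     # no majority: stays uncommitted
    leader.commit_index = len(leader.log)
    for f in followers:
        f.mark_committed(leader.commit_index)            # tell followers it is committed
    return True
```

In a three-node cluster, one follower being offline still leaves a majority of two, so the entry commits; with both followers offline the entry stays uncommitted on the leader.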



The leader may crash at any stage of this process. Let's look at how the Raft protocol preserves data consistency at each stage.

1. Before the data reaches the Leader node

A leader failure at this stage does not affect consistency, so there is not much to say.



2. The data reaches the Leader node but has not been replicated to any Follower node

If the leader crashes at this stage, the data is in the Uncommitted state. The client receives no Ack, treats the request as timed out and failed, and can safely retry. No follower holds the data, so after a new leader is elected the client's retry can succeed. When the original leader recovers, it rejoins the cluster as a follower and re-synchronizes its data from the new leader of the current term, which forces its data to match the leader's.



3. The data reaches the Leader node and is successfully replicated to all Follower nodes, but the followers have not yet acknowledged receipt to the Leader

If the leader crashes at this stage, the data on the followers is in the Uncommitted state but is consistent across them. After a new leader is elected, the commit can be completed. The client, which does not know whether its submission succeeded, may retry it. To handle this case, Raft requires requests to be idempotent, which means the cluster needs an internal deduplication mechanism.
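One way to get this deduplication, suggested in the Raft paper, is for each client to attach a unique serial number to every command and for the state machine to remember the latest serial number and response per client. The sketch below assumes each client has at most one request in flight; the class, the counter-style command, and the field names are illustrative only.

```python
class StateMachine:
    def __init__(self):
        self.data = {}
        self.last_applied = {}   # client_id -> (serial, result) of the latest command

    def apply(self, client_id, serial, key, delta):
        """Add delta to data[key] once per (client_id, serial); retries get the cached result."""
        seen = self.last_applied.get(client_id)
        if seen is not None and seen[0] == serial:
            return seen[1]       # duplicate retry: do not execute again
        self.data[key] = self.data.get(key, 0) + delta
        result = self.data[key]
        self.last_applied[client_id] = (serial, result)
        return result
```

A retried "add 10" with the same serial number returns the same answer and leaves the stored value unchanged, which is what makes a blind client retry safe.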



4. The data reaches the Leader node and is successfully replicated to only some of the Follower nodes, but the followers have not yet acknowledged receipt to the Leader

If the leader crashes at this stage, the data on the followers is in the Uncommitted state and is inconsistent between them. The Raft protocol requires that a node vote only for a candidate whose data is at least as up to date as its own, so the node with the latest data is elected leader and then forces its data onto the followers. The data is not lost, and the cluster becomes consistent eventually.
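The "latest data" rule is a short comparison in the Raft paper: a voter compares the term and index of the last entry in each log, with the term taking priority. A minimal sketch of that check follows; the function and parameter names are my own.

```python
def candidate_is_up_to_date(cand_last_term, cand_last_index,
                            my_last_term, my_last_index):
    """Return True if the candidate's log is at least as up to date as the voter's."""
    if cand_last_term != my_last_term:
        # The log whose last entry has the later term is more up to date.
        return cand_last_term > my_last_term
    # Same last term: the longer log is more up to date (ties are acceptable).
    return cand_last_index >= my_last_index
```

A follower that holds the extra entry will refuse to vote for a follower that lacks it, so in the three-node case only the node with the newer log can collect two votes.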



5. The data reaches the Leader node and is successfully replicated to all or a majority of the Follower nodes. The data is in the Committed state on the Leader but in the Uncommitted state on the Followers

If the leader crashes at this stage, the handling after a new leader is elected is the same as in stage 3.



6. The data reaches the Leader node and is successfully replicated to all or a majority of the Follower nodes. The data is in the Committed state on all nodes, but the Leader has not yet responded to the Client

If the leader crashes at this stage, the data inside the cluster is in fact already consistent. Because requests are idempotent, the client's retries have no effect on consistency.



7. A network partition causes split-brain, with two leaders at once

A network partition separates the original leader from the followers. Receiving no heartbeat from the leader, the followers start an election and choose a new leader, so there are now two leaders. The original leader is alone in its partition, so data submitted to it cannot be replicated to a majority of nodes and is never committed. Data submitted to the new leader can be committed successfully. After the network recovers, the old leader discovers that the cluster has a new leader with a newer term, automatically steps down to follower, and synchronizes its data from the new leader, which restores consistency across the cluster.
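The step-down rule comes from comparing terms on every message: in the Raft paper, a node that sees a term higher than its own adopts that term and becomes a follower, and a node that sees a lower term rejects the message. A minimal sketch of that rule, with a hypothetical `Peer` class and plain strings for roles:

```python
class Peer:
    def __init__(self, name, role="follower", term=0):
        self.name = name
        self.role = role
        self.term = term

    def on_message(self, sender_term):
        """Handle the term carried by an incoming message; return True if it is accepted."""
        if sender_term > self.term:
            # Someone has a newer term: adopt it and step down.
            self.term = sender_term
            self.role = "follower"
        # Messages from an older term are stale and get rejected.
        return sender_term >= self.term
```

When the partition heals, the old leader (term 1) receives a heartbeat from the new leader (term 2) and steps down, while the new leader rejects anything the old leader sends with term 1.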



In summary, we have exhaustively analyzed the situations a minimal (3-node) cluster can face. The Raft protocol handles the consistency problem well in each of them and is easy to understand.

Summary

Let us conclude by paraphrasing the final section of the Raft paper.


The main design goals of the algorithm are correctness, efficiency, and simplicity.
While these are valuable goals, none of these goals will be achieved until the developer writes a usable implementation.
So we believe that comprehensibility is equally important.

Indeed, consider the Paxos algorithm, which Leslie Lamport first wrote up in 1990 (the paper, ref [2], was not published until 1998). When did we first hear about it? When did a usable implementation appear? The Raft algorithm was published in 2013, and reference [5] already lists many open-source implementations in different languages. That is why understandability matters.

References

[1]. Leslie Lamport, Robert Shostak, Marshall Pease. The Byzantine Generals Problem. 1982
[2]. Leslie Lamport. The Part-Time Parliament. 1998
[3]. Leslie Lamport. Paxos Made Simple. 2001
[4]. Diego Ongaro and John Ousterhout. In Search of an Understandable Consensus Algorithm (the Raft paper). 2013
[5]. Raft Website. The Raft Consensus Algorithm
[6]. Raft Demo. Raft Animate Demo
