This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This work (Li Zhaolong's blog post, written by Li Zhaolong) is confirmed by Li Zhaolong; please credit the author when reposting.
Introduction
While thinking about the difference between the ZAB algorithm and the Raft algorithm I got confused, because the two can be described as very similar and also as very different. Digging into that doubt produced this article.
Problem Description
First of all, we have to be clear about one thing. Generally speaking, a consensus protocol is divided into several parts that together make up the complete protocol. Take Raft as an example: it consists of leader election, log replication, membership changes, and log compaction;
The basic version of ZAB has four parts:
- Leader election
- Discovery
- Synchronization
- Broadcast
After the election phase is optimized with Fast Leader Election, it can be reduced to three phases:
- Fast Leader Election
- Recovery
- Broadcast
Figure 0, source [9]
Because the leader obtained by Fast Leader Election must hold the latest log, it is natural to merge the Discovery and Synchronization phases.
Of course, several of these parts actually overlap between the two protocols: the fact that one protocol lists something as a separate part does not mean the other protocol lacks it.
One thing worth mentioning is that the Raft algorithm is log-based. The message sent when requesting a vote carries the latest log index and the corresponding Term, so it guarantees that the elected Leader holds every committed log entry.
Correspondingly, when Fast Leader Election requests a vote it carries the zxid (the low 32 bits are a counter that increases with each new proposal; the high 32 bits represent the epoch). It can be seen from [3] that it also guarantees that the elected leader's log is the latest.
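The zxid layout described above can be illustrated with a few lines of Python. This is only a sketch of the bit layout (epoch in the high 32 bits, counter in the low 32 bits); the function names make_zxid and split_zxid are hypothetical and do not come from ZooKeeper.

```python
from typing import Tuple

EPOCH_SHIFT = 32                 # the epoch lives in the high 32 bits
COUNTER_MASK = (1 << 32) - 1     # the proposal counter lives in the low 32 bits


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch and a proposal counter into one 64-bit zxid (hypothetical helper)."""
    return (epoch << EPOCH_SHIFT) | (counter & COUNTER_MASK)


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Unpack a zxid into (epoch, counter) (hypothetical helper)."""
    return zxid >> EPOCH_SHIFT, zxid & COUNTER_MASK
```

Because the epoch occupies the high bits, comparing two zxids as plain integers compares the epoch first and the counter second, which is why a larger zxid means a newer log.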
This article looks at how Fast Leader Election differs from Raft's election.
Fast Leader Election
The following information comes from [3].
The purpose of this article is simple: to try to explain clearly what the Fast Leader Election pseudo-code in [3] means.
Let's go straight to the code.
Figure 1:
Figure 2:
First, an explanation of the terms:
- vote: who this ballot is for
- id: server ID
- state: node state
- round: the epoch, which corresponds to Term in Raft
- ReceivedVotes: the ballot box, which stores the latest votes of other nodes
First of all, I personally think that elections can be divided into two situations:
- All nodes are in the Election state, for example when the cluster first starts, or when the leader goes down and the Followers transition to Election.
- A partition has just ended (it may have been very short or very long), and a node requesting votes can see that the other side is already a Leader or a Follower.
The two situations above correspond to the two logical branches in Figure 2. Put simply, the question is whether a leader already exists during this election, and so whether the electing node should promote itself or follow.
Okay, we can start on the pseudo-code. Figure 1 is relatively simple, so let's look at Figure 2:
Initial stage
When a node decides to start an election, it first clears the ballot box and then fills in its ballot with the following fields:
- Recommended Leader id: in the initial stage, on the first vote, every server votes for itself as Leader.
- The largest zxid of the recommended node: the larger the value, the newer that server's data.
- round: this value starts from 0 and each election corresponds to one value, that is, within the same election this value is the same. The larger the value, the newer the election.
- state: one of LOOKING, FOLLOWING, OBSERVING, LEADING.
The node then sends the filled-in ballot to all other peer nodes.
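The initial stage can be sketched in Python as follows. This is an illustrative sketch only, not the pseudo-code from [3] and not ZooKeeper's implementation: the Notification and Peer classes, the start_election method, and the choice to increment round at the start of each election are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vote = Tuple[int, int]  # (zxid of the recommended leader, id of the recommended leader)


@dataclass
class Notification:
    """One ballot as sent on the wire (hypothetical type)."""
    vote: Vote
    sender_id: int
    state: str
    round: int


@dataclass
class Peer:
    """Local election state of one server (hypothetical type)."""
    id: int
    last_zxid: int
    peers: List[int]
    round: int = 0
    state: str = "LOOKING"
    vote: Vote = (0, 0)
    received_votes: Dict[int, Vote] = field(default_factory=dict)

    def start_election(self) -> List[Tuple[int, Notification]]:
        """Clear the ballot box, vote for ourselves, and address the ballot to every other peer."""
        self.state = "LOOKING"
        self.round += 1                        # assumption: every new election gets a new round
        self.received_votes.clear()            # clear the ballot box
        self.vote = (self.last_zxid, self.id)  # first vote: recommend ourselves
        ballot = Notification(self.vote, self.id, self.state, self.round)
        # Return (destination, message) pairs instead of doing real network I/O.
        return [(dst, ballot) for dst in self.peers if dst != self.id]
```

Returning the outgoing messages instead of sending them keeps the sketch free of networking code, so the steps from the text can be read off directly.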
Figure 1 then describes the behavior when a vote is received: if the node is in Election, it pushes the vote onto P.queue. If the local node's Round is greater than the sender's, it also sends its own vote back.
The branch below that is essentially an ordinary ACK: I am not in Election, but a peer that is in Election has sent me a message, so I tell it my settled configuration.
Processing votes
At this point, every ballot received has been placed in P.queue, and the node takes them out one at a time. If nothing arrives before the timeout, it retransmits its own vote.
If the state of the node that sent this ballot is Election:
- If the received round is greater than the current round: this is a newer election, so the local node must update its round and clear the votes it has already collected, because that data is no longer valid. It then decides whether to update its own vote: compare zxid first, and the larger zxid wins; if they are equal, compare leader id, and the larger one wins. Of course, if the vote is updated, it means this node now votes for the newer node, and barring accidents the other nodes will update their ballot boxes as well.
- If the received round is equal to the current round: update the vote according to the received zxid and leader id, then broadcast it.
- If the received round is less than the current round: the sender is in an earlier election, and the node only needs to send it the local data.
Put the ballot processed above into the ballot box. Then proceed based on the data collected so far:
- Check whether the number of ballots in the ballot box equals SizeEnsemble, that is, whether the server has collected the election state of every server. If so, a decision can be made: set your own role (FOLLOWING or LEADING) according to the election result, then leave the election.
- If the election state of all servers has not yet been collected, the node can still check whether the leader it votes for after the processing above is supported by more than half of the servers. If it is, the node waits up to T0 ms for more data. If no new data arrives, everyone has agreed on the result; at this point it sets its role and leaves the election.
Of course, the node that sent the vote may also be in the Leader or Follower state, which is the second of the two situations mentioned above.
First check whether the round is the same. If it is, save the data in the ballot box and then make the following judgments:
- If the sender claims to be Leader, change the state to Follower directly.
- If the sender claims to be Follower, check whether more than half of the servers have elected the leader it votes for; if so, set the role and leave the election.
If execution reaches the logic below, the message comes from a Round that does not match the receiving node's current Round. It may be a higher Round or a lower one.
If the sender's vote points to this node and more than half of the votes in OutOfElection also point to this node, then this node is promoted.
I have not figured out the reasoning behind the rest of this branch.
Summary
At the very least, we can see the following differences between Raft and ZAB in the election phase:
- In a Raft election, each node only holds the votes cast for itself; in Fast Leader Election each Follower holds the voting information of all nodes and can make judgments based on it (Figure 2, line 25).
- The Fast Leader Election request message is (P.vote, P.id, P.state, P.round), where P.vote ← (P.lastZxid, P.id); Raft's is (term, candidateId, lastLogIndex, lastLogTerm).
- In Fast Leader Election, a node changes its vote and rebroadcasts it when it receives a larger vote, and the votes it has received are the criterion for choosing the leader: only the node with the largest (zxid, id) combination can become Leader. Raft, in contrast, is a simple majority: a node grants its vote only when the candidate's log (last log Term, then last log index) is at least as up-to-date as its own, and the leader is chosen by the number of votes obtained. Both ensure that a node with the latest log wins the election.
- In Raft you can vote only once per Term; in Fast Leader Election you can change your vote within one Epoch.
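For comparison, the Raft side (one vote per Term, granted only to a candidate whose log is at least as up-to-date) can be sketched as below. The rule follows the RequestVote description in the Raft paper [2]; the RaftNode class and its method name are hypothetical, and persistence and networking are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RaftNode:
    """Just the state a Raft server needs to answer RequestVote (hypothetical type)."""
    current_term: int
    last_log_index: int
    last_log_term: int
    voted_for: Optional[int] = None

    def handle_request_vote(
        self, term: int, candidate_id: int, last_log_index: int, last_log_term: int
    ) -> bool:
        """Return True if the vote is granted."""
        if term < self.current_term:
            return False                      # stale candidate
        if term > self.current_term:
            self.current_term = term          # a newer term resets our vote
            self.voted_for = None
        # "At least as up-to-date": compare last log term first, then last log index.
        log_ok = (last_log_term, last_log_index) >= (self.last_log_term, self.last_log_index)
        if log_ok and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id     # at most one candidate per term
            return True
        return False
```

Unlike the Fast Leader Election handler, nothing here ever changes voted_for within a term, which is why two candidates can split the vote and force a retry in a later term.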
There is one point I could not find any material on, but I think it is a very interesting property (of course, it may also be a flaw in my analysis). Based on the code above, a Fast Leader Election election will not fail, because every node eventually holds the votes of the other nodes, and as soon as more than half of the nodes agree, the state changes. A Raft election, in contrast, can fail because of the fourth point above and has to be retried after a while. Of course, adding a randomized timeout greatly reduces the probability of repeated failures.
The last question is why Fast Leader Election is faster than the ordinary Leader Election. The answer that can be found is basically one short sentence:
The FastLeaderElection election algorithm is a standard Fast Paxos algorithm implementation, which can solve the problem of the slow convergence of the Leader Election election algorithm.
As for why its convergence is slow, and the details of the Leader Election algorithm, to be honest I have not found any material. It seems the only way is to read the ZooKeeper source code; more on that later.
One last point: [6] says that Fast Leader Election is more efficient than Raft's election, but in theory Raft should be faster, because it needs only one round of messages, whereas Fast Leader Election has to broadcast again every time it receives a higher-priority vote. Without data this is only a guess; I will test it when I get the chance.
References:
- Paper: "Zab: A simple totally ordered broadcast protocol"
- Paper: "In Search of an Understandable Consensus Algorithm (Extended Version)"
- Paper: "ZooKeeper's atomic broadcast protocol: Theory and practice"
- Gitee: "Baidu Open Source/braft ZAB Protocol Introduction"
- Zhihu: "What is the difference between the Raft protocol and the ZAB protocol?"
- https://time.geekbang.org/column/article/143329
- Blog post: "In-depth understanding of Zookeeper (1): Zookeeper architecture and FastLeaderElection mechanism"
- Document: "Search Leader Activation Section"
- Blog post: "ZooKeeper fast leader election (Fast Leader Election) mechanism analysis"
- Blog post: "Raft vs. ZAB Agreement"
- Blog post: "Analysis of the Leader Election Source Code of Zookeeper Dead"