Exploration of the Fast Leader Election mechanism

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This work (Li Zhaolong's blog posts, written by Li Zhaolong) is confirmed by Li Zhaolong; please credit the copyright when reposting.

Introduction

When I thought about the difference between the ZAB algorithm and the Raft algorithm, I was confused: the two can be called very similar, yet they can also be called very different. Digging into that doubt produced this article.

Problem Description

First, one point needs to be clear. A consensus protocol is generally made up of several parts that together form the complete protocol. Take Raft as an example: it consists of leader election, log replication, membership changes, and log compaction.

The basic version of ZAB has four parts:

  1. Leader election
  2. Discovery
  3. Synchronization
  4. Broadcast

After the election phase is optimized with Fast Leader Election, it can be reduced to three phases:

  1. Fast Leader Election
  2. Recovery
  3. Broadcast

Figure 0 (image not reproduced), source [9]

Because the leader obtained through Fast Leader Election must hold the latest log, it is natural to merge the Discovery and Synchronization phases.

Of course, several of these parts overlap between the two protocols; the fact that one protocol lists something as a separate part does not mean that the other lacks it.


One thing worth mentioning is that the Raft algorithm is built around the log. The message sent when requesting a vote carries the index of the latest log entry and its Term, which guarantees that the elected Leader holds the latest committed log.

Correspondingly, when Fast Leader Election requests a vote, it carries the zxid: the low 32 bits are a counter that increases with each new proposal, and the high 32 bits represent the epoch. It can be seen from [3] that this also guarantees that the elected leader's log is the latest.
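To make the zxid layout concrete, here is a minimal sketch, assuming the zxid is handled as a plain 64-bit integer. The helper names `make_zxid` and `split_zxid` are hypothetical and are not taken from [3] or from the ZooKeeper source.

```python
from typing import Tuple

COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch: int, counter: int) -> int:
    """Pack the epoch into the high 32 bits and the counter into the low 32 bits."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Return (epoch, counter) for a packed zxid."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK


# A proposal from a newer epoch always compares greater than any proposal
# from an older epoch, whatever the counters are.
old = make_zxid(epoch=1, counter=500)
new = make_zxid(epoch=2, counter=0)
print(new > old)
```

Because the epoch occupies the high bits, comparing two zxids as integers compares epochs first and counters second, which is why a single integer comparison is enough to find the node with the newest data.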

This article looks at how Fast Leader Election differs from Raft's election.

Fast Leader Election

The following information comes from [3].

The purpose of this article is simple: to explain as clearly as possible the meaning of the Fast Leader Election pseudo-code in [3].

Let's go straight to the code.

Figure 1 (pseudo-code from [3], image not reproduced)

Figure 2 (pseudo-code from [3], image not reproduced)

First, an explanation of the terms:

  1. vote: who this ballot is cast for
  2. id: the server ID
  3. state: the node's state
  4. round: the election epoch, which corresponds to Term in Raft
  5. ReceivedVotes: the ballot box, which stores the latest vote from each of the other nodes

I personally think elections can be divided into two situations:

  1. Every node is in the Election state, for example when the cluster starts, or when the leader goes down and the Followers transition to Election.
  2. A Leader or Follower whose partition has just ended (the partition may have been very short or very long) can see the other side's Leader or Follower when it requests votes.

These two situations correspond to the two branches in Figure 2. Put simply, the question is whether a leader already exists during this election, and whether the electing node ends up promoting itself or following.

Now we can go through the pseudo-code. Figure 1 is relatively simple, so we focus on Figure 2:

Initial stage

When a node decides to hold an election, it first empties the ballot box and then fills in its ballot with the following fields:

  1. Recommended leader id: in the initial stage, every server votes for itself as Leader.
  2. zxid of the recommended node: the larger the value, the newer that server's data.
  3. round: this value increases from 0 and each election corresponds to one value, so within the same election the value is the same. The larger the value, the newer the election.
  4. state: one of LOOKING, FOLLOWING, OBSERVING, LEADING.

The node then sends the completed ballot to all other peer nodes.
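The initial stage described above can be sketched as follows. This is only an illustration: the class and method names (`Vote`, `Notification`, `Peer`, `start_election`) are hypothetical, and incrementing the round at the start of every election is an assumption based on the description of `round` above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Vote:
    zxid: int        # zxid of the recommended node
    leader_id: int   # id of the recommended leader


@dataclass(frozen=True)
class Notification:
    vote: Vote
    sender_id: int
    state: str
    round: int


@dataclass
class Peer:
    id: int
    last_zxid: int
    round: int = 0
    state: str = "LOOKING"
    vote: Optional[Vote] = None
    received_votes: Dict[int, Vote] = field(default_factory=dict)

    def start_election(self, peer_ids: List[int]) -> List[Tuple[int, Notification]]:
        """Empty the ballot box, vote for ourselves, and address the ballot to every other peer."""
        self.state = "LOOKING"
        self.round += 1                      # assumption: one new round per election
        self.received_votes.clear()          # old ballots are no longer valid
        self.vote = Vote(self.last_zxid, self.id)
        note = Notification(self.vote, self.id, self.state, self.round)
        # Return (destination, message) pairs instead of doing real network I/O.
        return [(dest, note) for dest in peer_ids if dest != self.id]


peer = Peer(id=1, last_zxid=10)
for dest, note in peer.start_election([1, 2, 3]):
    print(dest, note)
```

Returning the outgoing messages instead of sending them keeps the sketch free of networking code.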

Figure 1 then describes the behavior when a vote is received: if the node is in the Election state, it pushes the vote onto P.queue. If the local node's round is greater than the sender's, it also sends its own vote back to the sender.

The branch below that is essentially an ordinary ACK: I am not in the Election state, but a peer that is in Election has sent me a message, so I reply with my current vote and state.
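The receive handler described above might look like the sketch below. It follows this article's reading of Figure 1 rather than the ZooKeeper source; `Note` and `on_receive` are hypothetical names, and `"LOOKING"` stands for the Election state.

```python
from collections import deque
from typing import Deque, NamedTuple, Optional, Tuple


class Note(NamedTuple):
    vote: Tuple[int, int]   # (zxid, leader_id)
    sender_id: int
    state: str
    round: int


def on_receive(local: Note, n: Note, queue: Deque[Note]) -> Optional[Note]:
    """Handle one incoming notification.

    `local` is the message the local node would send right now (its own vote,
    id, state and round). The return value is the reply to send back to the
    sender, or None when no reply is needed.
    """
    if local.state == "LOOKING":
        queue.append(n)                       # processed later by the election loop
        if n.state == "LOOKING" and n.round < local.round:
            return local                      # the sender is behind: tell it our vote
        return None
    if n.state == "LOOKING":
        return local                          # we already settled: answer like an ACK
    return None


queue: Deque[Note] = deque()
me = Note(vote=(10, 1), sender_id=1, state="LOOKING", round=3)
stale = Note(vote=(5, 2), sender_id=2, state="LOOKING", round=2)
print(on_receive(me, stale, queue), len(queue))
```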

Processing votes

At this point every received ballot sits in P.queue, and the node takes them out one at a time. If the queue is empty, the node retransmits its own ballot after a timeout.

If the node that sent the ballot is in the Election state:

  1. If the received round is greater than the current round, this is a newer election. The local node must update its round and clear the votes it has collected, because they are no longer valid. It then decides whether to change its own vote: compare zxid first and the larger zxid wins; if they are equal, compare leader id and the larger one wins. If the vote is updated, this node now votes for the better candidate, and barring surprises the other nodes will update their ballot boxes accordingly.
  2. If the received round equals the current round, update the vote according to the received zxid and leader id, then broadcast it.
  3. If the received round is less than the current round, the sender is in an earlier election and only needs to be sent the local node's data.
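The three round comparisons can be sketched as below. This is a minimal illustration under stated assumptions, not the pseudo-code of [3] itself: votes are modeled as `(zxid, leader_id)` tuples so that Python's tuple ordering compares zxid first and id second, a ballot from an older round is not stored, and in a newer round the node falls back to its own `(last_zxid, id)` if the received vote is not better. The names `Peer` and `handle_looking_vote` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Vote = Tuple[int, int]  # (zxid, leader_id): tuple order is zxid first, then id


@dataclass
class Peer:
    id: int
    last_zxid: int
    round: int
    vote: Vote
    received_votes: Dict[int, Vote] = field(default_factory=dict)


def handle_looking_vote(p: Peer, sender_id: int, n_vote: Vote, n_round: int) -> str:
    """Process one ballot from a sender in the Election state.

    Returns the action to take: "broadcast", "reply_to_sender" or "none".
    """
    if n_round < p.round:
        # The sender is in an older election: just send it our data.
        return "reply_to_sender"

    action = "none"
    if n_round > p.round:
        # Newer election: adopt its round and drop the stale ballots.
        p.round = n_round
        p.received_votes.clear()
        own = (p.last_zxid, p.id)
        p.vote = n_vote if n_vote > own else own
        action = "broadcast"
    elif n_vote > p.vote:
        # Same round and a better candidate: switch to it and tell everyone.
        p.vote = n_vote
        action = "broadcast"

    p.received_votes[sender_id] = n_vote
    return action


p = Peer(id=1, last_zxid=10, round=2, vote=(10, 1))
print(handle_looking_vote(p, sender_id=2, n_vote=(12, 2), n_round=2), p.vote)
```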

The ballot processed above is then put into the ballot box, and the node decides what to do based on the data collected so far:

  1. Check whether the size of the ballot box equals SizeEnsemble, that is, whether the node has collected the votes of all servers. If so, a decision can be made: set your own role (FOLLOWING or LEADING) according to the election result, then leave the election.
  2. If votes have not been received from all servers, check whether the leader chosen after the update above is supported by more than half of the servers. If it is, wait up to T0 ms for further data. If no new data arrives, everyone has accepted the result; set the role and leave the election.
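A sketch of this termination check follows. It assumes the ballot box also contains the local node's own vote, and it replaces the T0 wait with a boolean parameter `quiet_for_t0` that the caller sets when no new notification arrived within T0 ms. The function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Vote = Tuple[int, int]  # (zxid, leader_id)


def has_quorum(vote: Vote, received_votes: Dict[int, Vote], ensemble_size: int) -> bool:
    """True when more than half of the ensemble currently backs `vote`."""
    supporters = sum(1 for v in received_votes.values() if v == vote)
    return supporters > ensemble_size // 2


def decide_role(local_id: int, vote: Vote, received_votes: Dict[int, Vote],
                ensemble_size: int, quiet_for_t0: bool) -> Optional[str]:
    """Return "LEADING" or "FOLLOWING" once the election can end, else None."""
    all_collected = len(received_votes) == ensemble_size
    settled = has_quorum(vote, received_votes, ensemble_size) and quiet_for_t0
    if all_collected or settled:
        return "LEADING" if vote[1] == local_id else "FOLLOWING"
    return None


votes = {1: (10, 3), 2: (10, 3)}
print(decide_role(1, (10, 3), votes, ensemble_size=3, quiet_for_t0=True))
```

With three servers, two matching ballots already form a majority, so the node can leave the election once the T0 window passes quietly, without waiting for the third server.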

Of course, the node that sent the ballot may also be in the leader or follower state, which is the second of the two situations mentioned above.

First check whether the rounds are equal. If they are, store the ballot in the ballot box and then make the following decisions:

  1. If the sender claims to be Leader, change the local state to Follower directly.
  2. If the sender claims to be Follower, check whether more than half of the servers have elected the node it voted for. If so, set the role and leave the election.
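The same-round case for a sender that has already left the election can be sketched as below. This follows the article's reading only; the mismatched-round branch (OutOfElection) is deliberately left out, and the name `handle_settled_sender` is hypothetical.

```python
from typing import Dict, Optional, Tuple

Vote = Tuple[int, int]  # (zxid, leader_id)


def handle_settled_sender(local_id: int, local_round: int,
                          sender_id: int, n_vote: Vote, n_state: str, n_round: int,
                          received_votes: Dict[int, Vote],
                          ensemble_size: int) -> Optional[str]:
    """Process a ballot whose sender is LEADING or FOLLOWING.

    Returns the new local role, or None if the election continues.
    """
    if n_round != local_round:
        return None                       # handled by the OutOfElection branch (not sketched)

    received_votes[sender_id] = n_vote
    if n_state == "LEADING":
        return "FOLLOWING"                # a leader already exists in this round

    # The sender is a follower: trust its vote only if a majority agrees with it.
    supporters = sum(1 for v in received_votes.values() if v == n_vote)
    if supporters > ensemble_size // 2:
        return "LEADING" if n_vote[1] == local_id else "FOLLOWING"
    return None


box = {1: (10, 3)}
print(handle_settled_sender(1, 3, 2, (10, 3), "FOLLOWING", 3, box, ensemble_size=3))
```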

If execution reaches the next branch, the message's round does not match the receiving node's current round. It may be a higher round or a lower one.

If the sender's vote points to the local node, and more than half of the votes in OutOfElection also point to the local node, then the local node promotes itself.

I have not yet understood the reasoning behind the rest of this branch.

Summary

At the very least, we can see the following differences between Raft and ZAB in the election phase:

  1. During a Raft election, each node only holds the votes cast for itself; in Fast Leader Election, each node holds the voting information of all nodes and can make decisions based on it (Figure 2, line 25).
  2. The Fast Leader Election vote request carries (P.vote, P.id, P.state, P.round), where P.vote ← (P.lastZxid, P.id); Raft's carries (term, candidateId, lastLogIndex, lastLogTerm).
  3. In Fast Leader Election, a node changes its vote when it receives a better one and rebroadcasts it. The only criterion is the (zxid, id) pair carried by the received votes, and only the node with the largest pair can become Leader. Raft is simpler by contrast: a node grants its vote only when the candidate's Term is not behind its own and the candidate's log is at least as up to date as its own, and the leader is chosen by the number of votes obtained. Both ensure that the node with the latest log wins the election.
  4. In Raft, a node can vote only once per Term; in Fast Leader Election, a node can change its vote within one Epoch.
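For contrast with points 3 and 4, here is a minimal sketch of the voter side of Raft's RequestVote rule as described in [2]. The class name `RaftVoter` and the choice to keep only the last log entry's term and index are simplifications made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RaftVoter:
    current_term: int = 0
    voted_for: Optional[int] = None
    last_log_term: int = 0
    last_log_index: int = 0

    def request_vote(self, term: int, candidate_id: int,
                     last_log_index: int, last_log_term: int) -> bool:
        """Grant at most one vote per term, and only to an up-to-date candidate."""
        if term < self.current_term:
            return False                          # stale candidate
        if term > self.current_term:
            self.current_term = term              # new term: the old vote no longer binds
            self.voted_for = None
        # "At least as up to date": compare the last entry's term first, then its index.
        log_ok = (last_log_term, last_log_index) >= (self.last_log_term, self.last_log_index)
        if log_ok and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


voter = RaftVoter(current_term=1, last_log_term=1, last_log_index=5)
print(voter.request_vote(term=2, candidate_id=7, last_log_index=5, last_log_term=1))
print(voter.request_vote(term=2, candidate_id=8, last_log_index=9, last_log_term=2))
```

The second request is refused even though candidate 8 has a newer log, because the voter has already voted in term 2. A Fast Leader Election node in the same position would instead switch its vote and rebroadcast it.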

There is one point for which I found no references, but I think it is a very interesting property. Of course, it may also be a flaw in my analysis. Based on the pseudo-code above, the Fast Leader Election process does not fail, because every node eventually holds the votes of the other nodes, and the state changes as soon as more than half of the nodes agree on a vote. In contrast, a Raft election can fail because of the fourth point above and must be retried after a period of time. Of course, adding a randomized timeout greatly reduces the probability of repeated failures.

The last question is why Fast Leader Election is faster than the ordinary Leader Election. The answer that can be found is basically one short sentence:

The FastLeaderElection election algorithm is a standard Fast Paxos algorithm implementation, which can solve the problem of the slow convergence of the Leader Election election algorithm.

As for why the convergence is slow, and for the details of the original Leader Election algorithm, I honestly have not found any material. The only way seems to be reading the ZooKeeper source code; I will come back to it later.

One last point: [6] claims that Fast Leader Election is more efficient than Raft's election, but in theory Raft should be faster, because it needs only one round of messages, whereas Fast Leader Election must rebroadcast every time a higher-priority vote is received. Without data this is only speculation; I will test it when I get the chance.

References:

  1. Paper: "Zab: A simple totally ordered broadcast protocol"
  2. Paper: "In Search of an Understandable Consensus Algorithm (Extended Version)"
  3. Paper: "ZooKeeper's atomic broadcast protocol: Theory and practice"
  4. Gitee: "Baidu Open Source/braft ZAB Protocol Introduction"
  5. Zhihu: "What is the difference between the raft protocol and the zab protocol?"
  6. https://time.geekbang.org/column/article/143329
  7. Blog post: "In-depth understanding of Zookeeper (1): Zookeeper architecture and FastLeaderElection mechanism"
  8. Document: "Search Leader Activation Section"
  9. Blog post: "ZooKeeper fast leader election (Fast Leader Election) mechanism analysis"
  10. Blog post: "Raft vs. ZAB Agreement"
  11. Blog post: "Analysis of the Leader Election Source Code of Zookeeper Dead"

Origin blog.csdn.net/weixin_43705457/article/details/113770960