Zookeeper protocols: the Paxos algorithm and the ZAB protocol

Preface

This article looks at the Paxos consensus algorithm and at ZAB, the protocol Zookeeper actually uses for distributed coordination.

Paxos algorithm

Paxos is a highly fault-tolerant consensus algorithm based on message passing, proposed by Leslie Lamport in 1990. It is widely regarded as one of the most effective algorithms for reaching agreement in distributed systems.

The Byzantine generals problem

In 1982, Lamport and two co-authors published a paper proposing a fault-tolerance theory for computer systems. To illustrate the problem the theory addresses, the paper framed it as a story:

The Byzantine Empire had many armies. The generals of these armies had to agree on a unified plan of action — attack or retreat. The generals were geographically separated and could communicate only through messengers. However, some messengers might be traitors who could tamper with messages at will in order to deceive the generals.

This is the well-known Byzantine generals problem, and it is analogous to reaching consistency over unreliable channels in an asynchronous distributed system. Lamport later proposed a theoretical solution in 1990, backed by a rigorous mathematical proof, and the Paxos algorithm was born from this work.

Paxos algorithm

Let's first look at the requirements a consensus algorithm should satisfy:

1. Of all the proposals put forward, only one is ultimately chosen

2. If no proposal is put forward, then no proposal is chosen

3. Once a proposal is chosen, all participants should be able to learn the chosen proposal

The Paxos algorithm defines three roles — Proposer, Acceptor, and Learner — which communicate by sending and receiving messages. In a distributed setting we cannot rely on a single Acceptor instance, since that would be a single point of failure. The election must therefore complete with multiple Acceptors: a Proposer sends its proposal to a set of Acceptors, any of which may accept it. A proposal is considered chosen once enough Acceptors have accepted it, where "enough" simply means a majority of the current Acceptor set. We also stipulate that each Acceptor may accept at most one proposal during a given election round.
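The majority ("quorum") rule above can be stated in a couple of lines; this is a minimal sketch, and the function names are illustrative rather than taken from any Paxos implementation:

```python
def majority(total_acceptors: int) -> int:
    # The smallest number of acceptors that forms a majority.
    return total_acceptors // 2 + 1

def is_chosen(accept_count: int, total_acceptors: int) -> bool:
    # A proposal is chosen once a majority of acceptors accepted it.
    return accept_count >= majority(total_acceptors)

print(majority(5))        # 3
print(is_chosen(3, 5))    # True
print(is_chosen(2, 5))    # False
```

Note that any two majorities of the same Acceptor set must overlap in at least one Acceptor, which is what prevents two different proposals from both being chosen.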

Algorithm process

The entire Paxos algorithm can be summarized as a two-phase commit-style process, divided into a Prepare phase and an Accept phase. The general flow is as follows:

Prepare phase:

The Proposer chooses a proposal, numbers it M0, and sends a Prepare request numbered M0 to the Acceptor set. When an Acceptor in the set receives this request, it compares M0 against the highest-numbered request it has already handled. If M0 is larger, the Acceptor responds with the highest-numbered proposal it has processed so far (if any) and promises not to respond to any future request numbered lower than M0. If the Acceptor has not yet responded to any request, it simply responds to the current request numbered M0.

Accept phase:

When the Proposer's Prepare request has received responses from more than half of the Acceptors in the set, the proposal can proceed. If none of those responses contains an accepted proposal, the Proposer may choose any value. If some responses do carry accepted proposals, the Proposer takes the value of the highest-numbered one — call it V0 — forms the Accept request [M0, V0], and sends it to the Acceptor set again. An Acceptor accepts this request unless it has meanwhile responded to a Prepare request numbered greater than M0; if more than half of the Acceptors accept, the proposal is considered chosen.

Of course, throughout this process each Proposer may periodically generate new, differently numbered proposals, and as long as everyone runs strictly according to the process above, the correctness of the algorithm is guaranteed. Likewise, when a Proposer issues a higher-numbered proposal, the Acceptors' responses carry the highest-numbered proposal already accepted, which forces the Proposer to abandon its own value and adopt the value of that highest-numbered proposal.
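The Prepare/Accept flow described above can be sketched as a toy single-process simulation. All names here (`Acceptor`, `prepare`, `accept`, `run_round`) are mine; a real implementation would add networking, durable storage, and retries:

```python
# Toy sketch of one Paxos round: Prepare (phase 1) then Accept (phase 2).

class Acceptor:
    def __init__(self):
        self.promised = -1      # highest Prepare number responded to
        self.accepted = None    # (number, value) of highest accepted proposal

    def prepare(self, n):
        # Promise to ignore requests numbered below n; report prior acceptance.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Accept unless a higher-numbered Prepare was promised meanwhile.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def run_round(acceptors, n, value):
    # Phase 1: gather promises; abort without a majority.
    promises = [r for r in (a.prepare(n) for a in acceptors) if r[0]]
    if len(promises) <= len(acceptors) // 2:
        return None
    # If any acceptor already accepted a value, we must propose the
    # highest-numbered one instead of our own.
    prior = [acc for ok, acc in promises if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: send Accept [n, value]; chosen if a majority accepts.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(5)]
print(run_round(acceptors, 1, "attack"))   # "attack" is chosen
print(run_round(acceptors, 2, "retreat"))  # still "attack": a later round
                                           # must keep the chosen value
```

The second round demonstrates the safety property: once "attack" is chosen, a later Proposer discovers it during Prepare and is forced to re-propose it.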

ZAB protocol

Having seen the Paxos algorithm, we might assume Zookeeper is simply an implementation of Paxos. In fact, Zookeeper does not adopt Paxos wholesale; its core data-consistency algorithm is a crash-recoverable protocol called Zookeeper Atomic Broadcast, or ZAB for short. The ZAB protocol defines how transaction messages are processed throughout Zookeeper, roughly as follows:

All transaction requests in Zookeeper must be coordinated by exactly one machine, called the Leader server; the remaining servers are called Followers. The Leader forwards a client's transaction request to all Follower servers and waits for their feedback. Once it has received feedback from more than half of the Followers indicating that the transaction was processed normally, the Leader sends a Commit message to all Followers, asking them to commit the preceding transaction request.

The ZAB protocol has two basic modes: crash recovery and message broadcast. When the framework first starts up, or when the Leader suffers a network interruption, crash, or restart, ZAB enters recovery mode to elect a new Leader server to process transaction requests. Once a Leader has been elected and has synchronized state with more than half of the machines, recovery mode ends and ZAB switches to message broadcast mode. If a new Zookeeper server joins the cluster during this process, it starts in recovery mode until it has synchronized state with the Leader.

As the transaction flow above shows, only the Leader in Zookeeper coordinates transaction requests — so are the other Follower servers useless? Of course not. Every server in a Zookeeper cluster can serve client requests; the only difference is that when a Follower receives a transaction request, it forwards it to the Leader, letting the Leader perform unified processing and coordination. Naturally, accidents happen while the cluster runs. For example, if more than half of the Follower servers can no longer communicate with the Leader, the cluster switches from message broadcast mode back to crash recovery mode and elects a new Leader from the remaining machines. As before, a Leader election must be supported by more than half of the machines, so as long as more than half of the cluster remains in normal service, the cluster can keep switching between modes and stay available. Once more than half of the machines have failed, the cluster stops processing transaction requests altogether and enters a protection mode in which it is read-only.

Message broadcast

Message broadcast in the ZAB protocol uses an atomic broadcast protocol, and the overall flow can be seen as a two-phase commit. When a client's transaction request arrives, the Leader server generates the corresponding transaction Proposal, sends it to all the other machines in the cluster, and collects vote responses from the Follower machines. The difference from classic two-phase commit is that the interrupt (abort) logic is removed: while collecting responses, the Leader does not wait for every machine to reply before starting the second phase — it starts as soon as more than half of the machines have responded. Of course, with this simplified two-phase flow the Leader may crash suddenly and leave data inconsistent; when that happens, the crash recovery mechanism takes over. All messages are delivered in FIFO order and TCP is used for network communication, which guarantees ordering during message broadcast.

In the first phase, the Leader generates a Proposal for the transaction message to broadcast and assigns it a monotonically increasing transaction ID (ZXID). Because ZAB must preserve strict transaction ordering, messages are processed strictly in ZXID order. When the broadcast begins, the Leader allocates a separate FIFO queue for each Follower server and places the transaction Proposals to be broadcast into it in order. When a Follower receives a transaction message, it writes the transaction to its log and, on success, returns an Ack to the Leader. Once the Leader has received Acks from more than half of the Followers, it initiates the second phase by sending a Commit message to all Followers; at the same time, the Leader commits the transaction locally just as the Followers do, completing the delivery and commit of the message.
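The two-phase broadcast just described — per-Follower FIFO queues, Ack collection, then Commit after a majority — can be sketched as a toy single-process model. The class and method names below are illustrative, not Zookeeper's actual internals:

```python
from collections import deque

class Follower:
    def __init__(self):
        self.log = []        # transaction log (written in phase 1)
        self.committed = []  # ZXIDs committed in phase 2

    def receive_proposal(self, zxid, txn):
        self.log.append((zxid, txn))
        return True          # Ack once the log write succeeds

    def commit(self, zxid):
        self.committed.append(zxid)

class Leader:
    def __init__(self, followers):
        self.followers = followers
        # One FIFO queue per follower, as in the description above.
        self.queues = {id(f): deque() for f in followers}
        self.zxid = 0

    def broadcast(self, txn):
        self.zxid += 1
        for f in self.followers:
            self.queues[id(f)].append((self.zxid, txn))
        # Phase 1: deliver each follower's next queued proposal, count Acks.
        acks = sum(f.receive_proposal(*self.queues[id(f)].popleft())
                   for f in self.followers)
        # Phase 2: Commit once a majority of the whole ensemble
        # (followers plus the leader itself) has acknowledged.
        if acks + 1 > (len(self.followers) + 1) // 2:
            for f in self.followers:
                f.commit(self.zxid)
            return True
        return False

followers = [Follower() for _ in range(4)]
leader = Leader(followers)
print(leader.broadcast("create /node"))   # True: committed on a majority
print(followers[0].committed)             # [1]
```

In this toy model every Follower always Acks; the point is the shape of the flow — proposal into each queue, majority of Acks, then Commit — not failure handling.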

Crash recovery

As mentioned earlier, the crash recovery mechanism must guarantee that a transaction processed successfully on one machine is eventually processed successfully on every machine. Overall, the crash recovery mechanism of the ZAB protocol therefore needs to ensure two points:

1. All transactions already committed on the Leader server are eventually committed by all servers

2. If the Leader proposed a message in the first phase of the broadcast but never committed it, ZAB must ensure that this transaction message is discarded

The ZAB protocol therefore needs an election algorithm that guarantees proposals already committed by the Leader do get committed, while proposals the Leader merely initiated but never committed are skipped. To achieve this by design, the elected Leader is simply the machine with the highest transaction number (the largest ZXID) in the cluster; that machine can then be assumed to hold all committed proposals.

The crash recovery process does not end once the election completes; data synchronization comes next. The Leader must confirm that every Proposal in its transaction log has been committed by more than half of the Followers before synchronization is done. As we saw earlier, the Leader prepares a FIFO queue for each Follower, places into it all the transactions that need to be committed, and notifies the Follower. When a Follower has committed and drained every entry in its queue, its synchronization is complete and it is added to the list of available machines. Once more than half of the machines are in that list, the crash recovery process finishes, the mode switches to message broadcast, and Zookeeper resumes serving clients.

Committed transactions on the Leader are synchronized as above — but how are uncommitted, stale transactions removed? In the ZAB protocol, the ZXID is designed as a 64-bit number split into a low 32 bits and a high 32 bits. The low 32 bits act as a monotonically increasing counter: each time a transaction message needs an ID, the low 32 bits are atomically incremented. The high 32 bits hold the epoch number. Each time a new Leader is elected, it takes the largest ZXID among all known transactions, parses the epoch value from its high 32 bits, and increments it by one; the epoch thus counts the elections, i.e., it identifies the Leader's term. After each election, once the epoch has been incremented, the low 32 bits are reset to zero to form the new ZXID values. Consequently, machines whose ZXIDs carry a lower epoch when an election is triggered simply become Followers, and only the machines holding the largest epoch compete for Leader. After the Leader is elected and all Followers have connected to it, each Follower compares its highest proposal with the Leader's; based on that comparison, the Leader instructs the Follower to roll back or to synchronize forward to the largest transaction proposal committed by more than half of the machines in the cluster. At that point the data synchronization operation is complete and crash recovery mode ends.
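The 64-bit ZXID layout described above — epoch in the high 32 bits, counter in the low 32 bits — is easy to show with bit operations. This is a sketch; the function names are mine, not Zookeeper's API:

```python
def make_zxid(epoch: int, counter: int) -> int:
    # Pack the epoch into the high 32 bits, the counter into the low 32 bits.
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def epoch_of(zxid: int) -> int:
    return zxid >> 32

def counter_of(zxid: int) -> int:
    return zxid & 0xFFFFFFFF

def new_epoch_zxid(max_known_zxid: int) -> int:
    # On a new election: bump the epoch of the largest known ZXID
    # and reset the low-32-bit counter to zero.
    return make_zxid(epoch_of(max_known_zxid) + 1, 0)

z = make_zxid(epoch=5, counter=42)
print(epoch_of(z), counter_of(z))       # 5 42
print(new_epoch_zxid(z) == make_zxid(6, 0))  # True
```

Because the epoch occupies the high bits, any ZXID from a newer epoch compares greater than every ZXID from an older one, which is what lets stale, uncommitted proposals from a crashed Leader's term be detected and rolled back.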


Origin blog.csdn.net/ywlmsm1224811/article/details/108384966