ZooKeeper consistency protocol: the ZAB protocol

Brief introduction to the ZAB protocol:

  ZAB stands for ZooKeeper Atomic Broadcast. It is often described as a variant of the Paxos protocol. It is an atomic message broadcast algorithm with crash recovery, designed specifically for ZooKeeper. ZooKeeper uses a single primary process (the Leader) to accept and process all client transaction requests, and broadcasts the resulting state changes to all replica processes in the form of transaction Proposals.

The ZAB protocol contains two basic modes:

  • Crash recovery

  • Message broadcasting

 

 

 

The ZAB protocol consists of three phases:

  • Phase 1: Discovery (Leader election process)

  • Phase 2: Synchronization (data synchronization process)

  • Phase 3: Broadcast (the process of formally accepting requests)

When the service is starting up, or when the Leader server suffers a network interruption, crashes, or restarts, the ZAB protocol enters recovery mode and elects a new Leader server. Once a new Leader has been elected and more than half of the machines in the cluster have completed state synchronization with it, the ZAB protocol exits recovery mode.

Once more than half of the servers in the cluster have completed state synchronization with the Leader, the service can enter message broadcast mode. When a server that follows the ZAB protocol starts and joins a cluster that already has a Leader handling message broadcast, the new server automatically enters data recovery mode: it finds the Leader, synchronizes its data with it, and then takes part in message broadcast.

Crash recovery

When the service is starting up, or when the Leader server suffers a network interruption, crashes, or restarts, the ZAB protocol enters recovery mode and elects a new Leader server.

  • Once a new Leader has been elected and more than half of the machines in the cluster have completed state synchronization with it, the ZAB protocol exits recovery mode. State synchronization here means data synchronization: it ensures that more than half of the machines in the cluster hold data consistent with the Leader's.

  • Once more than half of the servers in the cluster have completed synchronization with the Leader, the cluster can enter message broadcast mode.

  • When a new machine joins a cluster that already has a Leader, the new machine enters data synchronization mode: it finds the Leader and synchronizes its data with it.

  • When the Leader crashes or restarts, or can no longer maintain normal communication with more than half of the servers in the cluster, all machines run the crash recovery protocol to reach a consistent state before starting a new round of transactions.
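The switch between the two modes can be sketched roughly as follows. This is only an illustration, not ZooKeeper's actual code: the class, enum, and method names are hypothetical, and it assumes the count of synchronized servers includes the Leader itself.

```java
// Hypothetical sketch of the recovery/broadcast mode decision.
// None of these names come from ZooKeeper's code base.
final class ZabModeSketch {
    enum Mode { RECOVERY, BROADCAST }

    // A quorum is strictly more than half of the ensemble.
    static boolean isQuorum(int count, int ensembleSize) {
        return count > ensembleSize / 2;
    }

    // Assumption: syncedServers counts the Leader itself plus every
    // server that has finished state synchronization with it.
    static Mode nextMode(boolean leaderAlive, int syncedServers, int ensembleSize) {
        if (leaderAlive && isQuorum(syncedServers, ensembleSize)) {
            return Mode.BROADCAST;
        }
        // No Leader, or the Leader has lost its quorum: recover first.
        return Mode.RECOVERY;
    }

    public static void main(String[] args) {
        System.out.println(nextMode(true, 3, 5));   // BROADCAST
        System.out.println(nextMode(true, 2, 5));   // RECOVERY
        System.out.println(nextMode(false, 5, 5));  // RECOVERY
    }
}
```

With five servers, three synchronized servers are enough to leave recovery mode; two are not, and a missing Leader always forces recovery.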

      

 

 

// Compare a received vote (new*) with the server's current vote (cur*)
protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id:" + newId + ", proposed id:" + curId + ", zxid: 0x" +
              Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    // A server with zero voting weight can never replace the current vote
    if (self.getQuorumVerifier().getWeight(newId) == 0) {
        return false;
    }

    /*
     * Return true if any one of the following three conditions holds:
     * 1- The new epoch is greater
     * 2- The new epoch is the same as the current epoch, but the new zxid is larger
     * 3- The new epoch is the same as the current epoch, the new zxid is the same as the current zxid, but the server id is higher.
     */
    return ((newEpoch > curEpoch) ||
            ((newEpoch == curEpoch) &&
             ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
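To see how this ordering behaves, here is a standalone sketch that copies only the comparison and drops the logging and the quorum-weight check, which depend on the surrounding server object. The class and method names are made up for illustration, and the vote values are arbitrary.

```java
// Standalone copy of the vote ordering above; names are hypothetical.
final class VoteOrderSketch {
    // True when the new vote should replace the current one:
    // compare epoch first, then zxid, then server id.
    static boolean newVoteWins(long newId, long newZxid, long newEpoch,
                               long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
            || (newEpoch == curEpoch
                && (newZxid > curZxid
                    || (newZxid == curZxid && newId > curId)));
    }

    public static void main(String[] args) {
        // A higher epoch wins even with a smaller zxid and server id.
        System.out.println(newVoteWins(1, 0x10L, 2, 3, 0x99L, 1)); // true
        // Same epoch: the larger zxid wins.
        System.out.println(newVoteWins(1, 0x20L, 1, 3, 0x10L, 1)); // true
        // Same epoch and zxid: the higher server id wins.
        System.out.println(newVoteWins(3, 0x20L, 1, 1, 0x20L, 1)); // true
        // Identical votes: the current vote is kept.
        System.out.println(newVoteWins(2, 0x20L, 1, 2, 0x20L, 1)); // false
    }
}
```

The effect is that the server with the most recent data is preferred, and the server id only breaks ties between equally up-to-date servers.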

 

Message broadcasting

Message broadcasting resembles a two-phase commit (2PC). For each client transaction request, the Leader server generates a corresponding transaction Proposal and sends it to the other servers in the cluster, then collects each server's vote (Ack), and finally commits the transaction.

Unlike 2PC, the ZAB protocol has no abort logic: each Follower either acknowledges the Leader's Proposal with an Ack or does not respond. Once more than half of the Followers have returned an Ack, the Leader starts committing the transaction without waiting for the remaining Followers.

The message broadcast protocol communicates over TCP, which delivers messages in FIFO order, so the order in which messages are sent and received during broadcast is easy to guarantee.
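The "commit once more than half have acknowledged" rule can be sketched as a small Ack counter kept per Proposal. This is a simplified illustration with hypothetical names; it assumes the Leader's own acknowledgement is recorded like any other server's, and it ignores timeouts and retransmission.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical per-Proposal Ack tracker; not ZooKeeper's actual class.
final class AckTrackerSketch {
    private final int ensembleSize;
    private final Set<Long> ackedBy = new HashSet<>();
    private boolean committed = false;

    AckTrackerSketch(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    // Records an Ack and returns true exactly once: on the Ack that
    // completes a quorum (strictly more than half of the ensemble).
    boolean onAck(long serverId) {
        ackedBy.add(serverId); // duplicate Acks from one server count once
        if (!committed && ackedBy.size() > ensembleSize / 2) {
            committed = true;
            return true;
        }
        return false;
    }

    boolean isCommitted() {
        return committed;
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch(5);
        System.out.println(tracker.onAck(1)); // false
        System.out.println(tracker.onAck(2)); // false
        System.out.println(tracker.onAck(3)); // true, quorum reached
        System.out.println(tracker.onAck(4)); // false, already committed
    }
}
```

There is no "reject" path in the tracker, which mirrors the absence of abort logic: a server that does not Ack simply never contributes to the count.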

    

 

 

  1. On receiving a request, the Leader assigns the message a globally unique, monotonically increasing 64-bit id called the zxid. Comparing zxids gives the messages a causal order.

  2. The Leader sends each message, tagged with its zxid, as a Proposal to every Follower through a first-in first-out queue (implemented over TCP, which preserves the global order).

  3. When a Follower receives a Proposal, it first writes the Proposal to disk. Once the write succeeds, it returns an ACK to the Leader.

  4. When the Leader has received ACKs from a quorum (more than half of the servers), it sends a COMMIT command to all Followers and applies the message locally.

  5. When a Follower receives the COMMIT command for a message, it applies the message.
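Step 1 relies on zxids being comparable. In ZooKeeper the 64-bit zxid is split into two halves: the high 32 bits hold the Leader's epoch and the low 32 bits hold a counter that increases with each Proposal in that epoch. The sketch below shows that layout; the class and method names are chosen for illustration and are not ZooKeeper's API.

```java
// Illustrative zxid layout: high 32 bits = epoch, low 32 bits = counter.
final class ZxidSketch {
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long a = makeZxid(1, 5); // fifth proposal of epoch 1
        long b = makeZxid(2, 1); // first proposal of epoch 2
        System.out.println(Long.toHexString(a)); // 100000005
        System.out.println(Long.toHexString(b)); // 200000001
        // Any zxid from a later epoch is larger than every zxid
        // from an earlier epoch, whatever the counters are.
        System.out.println(a < b); // true
    }
}
```

Because the epoch occupies the high bits, a plain numeric comparison orders Proposals first by Leader epoch and then by position within the epoch.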

Data synchronization

  After the cluster completes Leader election, each Learner registers with the Leader. Once registration is complete, the Learner enters the data synchronization phase, in which the Leader sends the Learner the transactions that have not yet been committed on that Learner.

 1. Direct differential synchronization (DIFF)

    Applies when a Follower node recovers from a crash.

 2. Rollback followed by differential synchronization (TRUNC+DIFF)

    Applies when the old Leader crashes, a new Leader is elected, and the new Leader has already committed new transactions.

 3. Rollback only (TRUNC, a special case of rollback followed by differential synchronization)

    Applies when the Leader crashes and the newly elected Leader has not committed any new transactions.

 4. Full synchronization (SNAP)

    Applies when a new Follower node is added.
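One way to picture how a Leader might choose among these four modes is to compare the Learner's last zxid with the Leader's log of committed zxids. The sketch below is a deliberately simplified assumption, not ZooKeeper's real decision logic, which handles more cases; the enum, method, and parameter names are all hypothetical.

```java
import java.util.List;
import java.util.TreeSet;

// Simplified, hypothetical selection of a synchronization mode.
final class SyncChoiceSketch {
    enum Sync { DIFF, TRUNC_THEN_DIFF, TRUNC, SNAP }

    // committed: zxids in the Leader's committed log (assumed non-empty).
    // learnerLastZxid: the last zxid the Learner has on disk.
    static Sync choose(TreeSet<Long> committed, long learnerLastZxid) {
        long min = committed.first();
        long max = committed.last();
        if (learnerLastZxid > max) {
            return Sync.TRUNC;       // Learner is ahead: roll back only
        }
        if (learnerLastZxid < min) {
            return Sync.SNAP;        // too far behind: send everything
        }
        if (committed.contains(learnerLastZxid)) {
            return Sync.DIFF;        // send only the missing suffix
        }
        // In range but unknown to the Leader: the Learner holds a
        // proposal that was never committed, so roll back, then diff.
        return Sync.TRUNC_THEN_DIFF;
    }

    public static void main(String[] args) {
        TreeSet<Long> log = new TreeSet<>(
            List.of(0x100000001L, 0x100000002L, 0x200000001L));
        System.out.println(choose(log, 0x100000002L)); // DIFF
        System.out.println(choose(log, 0x100000003L)); // TRUNC_THEN_DIFF
        System.out.println(choose(log, 0x200000005L)); // TRUNC
        System.out.println(choose(log, 0L));           // SNAP
    }
}
```

In the example, `0x100000003` stands for a proposal the old Leader wrote in epoch 1 but never got committed, while the new Leader has since committed `0x200000001` in epoch 2.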

 

 

 

 

 


Origin www.cnblogs.com/happily-ye/p/12733165.html