ZooKeeper (16): source code analysis of the ZAB protocol

ZooKeeper uses ZooKeeper Atomic Broadcast (ZAB) as the core algorithm for data consistency. ZAB is an atomic broadcast protocol with crash-recovery support, designed specifically for ZooKeeper.

ZAB theory

The core of ZAB defines how ZooKeeper handles transaction requests, that is, requests that change server data state. All transaction requests must be coordinated by a single, globally unique server called the Leader; the remaining servers are called Followers. The Leader converts each client transaction request into a transaction proposal (Proposal) and distributes it to all Followers in the cluster. The Leader then waits for feedback from the Followers. Once more than half of them have acknowledged the Proposal, the Leader sends a Commit message to all Followers, asking them to commit the preceding Proposal.

ZAB has two basic modes: crash recovery and message broadcast.

1. When the service is starting up, or when the Leader suffers a network outage, crashes, exits, or restarts, ZAB enters recovery mode and elects a new Leader. Once a new Leader has been elected and more than half of the machines in the cluster have finished synchronizing state with it, ZAB leaves recovery mode. State synchronization here means data synchronization: it guarantees that more than half of the machines hold data consistent with the Leader's.

2. Once more than half of the Followers have synchronized state with the Leader, the service can enter message broadcast mode. When another ZAB-compliant server starts and joins the cluster while a Leader is already broadcasting, the new server enters data recovery mode on its own: it locates the Leader, synchronizes data with it, and then takes part in message broadcast. ZooKeeper allows exactly one Leader to process transaction requests. On receiving a transaction request from a client, the Leader generates the corresponding Proposal and starts a round of broadcast. If a non-Leader server receives a transaction request from a client, it first forwards the request to the Leader.

3. When the Leader crashes or restarts, or when it can no longer communicate normally with more than half of the servers, ZAB switches from message broadcast mode to crash recovery mode. Before a new round of atomic broadcast starts, all processes first run the crash recovery protocol to reach a consistent state. A machine needs the support of more than half of the machines to become the new Leader. Because any machine may crash, a running ZAB cluster will see a succession of Leaders over time, and the same machine may be Leader more than once. After entering crash recovery mode, as long as more than half of the servers can communicate with each other, a new Leader can be elected and the cluster can return to message broadcast mode. For example, a ZAB service made up of three machines usually has one Leader and two Followers; if one Follower goes down, the cluster keeps serving requests.
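The "more than half" rule is plain arithmetic. The following is a minimal sketch, assuming a simple majority quorum over the voting servers; the class and method names are illustrative and are not part of ZooKeeper.

```java
// Illustrative only: majority-quorum arithmetic for an ensemble of n voting servers.
final class QuorumMath {
    // Smallest number of servers that forms a strict majority of n.
    static int quorumSize(int n) {
        return n / 2 + 1;
    }

    // Number of servers that may fail while a majority can still be formed.
    static int toleratedFailures(int n) {
        return n - quorumSize(n);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " servers: quorum=" + quorumSize(n)
                    + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```

With three servers the quorum is two, so one server can fail. A fourth server raises the quorum to three without raising the number of tolerated failures, which is one reason odd ensemble sizes are common.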

① message broadcasting

ZAB message broadcast consists of the following four steps:

  1. The Leader sends a PROPOSAL to all nodes in the cluster.
  2. On receiving the PROPOSAL, each node writes it to disk and sends an ACK to the Leader.
  3. After receiving ACKs from a majority of nodes, the Leader sends COMMIT to all Follower nodes.
  4. If there are Observer nodes, the Leader also sends them INFORM messages to synchronize data. An Observer only receives INFORM messages from the Leader and takes part in neither Leader election nor transaction commit voting.
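Steps 2 and 3 come down to counting distinct ACKs per proposal. Here is a hedged sketch of that bookkeeping; `ProposalTracker` is a hypothetical class written for this article, not the ZooKeeper implementation, and it assumes a simple majority over the voting servers (Observers excluded).

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: tracks ACKs for a single proposal.
final class ProposalTracker {
    private final long zxid;
    private final int votingServers;              // Observers are not counted
    private final Set<Long> acks = new HashSet<>();
    private boolean committed;

    ProposalTracker(long zxid, int votingServers) {
        this.zxid = zxid;
        this.votingServers = votingServers;
    }

    // Records an ACK from server sid. Returns true exactly once: for the ACK
    // that completes a majority, which is the moment COMMIT would be sent.
    boolean ack(long sid) {
        acks.add(sid);                            // duplicates from the same sid are ignored
        if (!committed && acks.size() > votingServers / 2) {
            committed = true;
            return true;
        }
        return false;
    }

    boolean isCommitted() {
        return committed;
    }

    long zxid() {
        return zxid;
    }
}
```

Using a set keyed by server id means a repeated ACK from the same server cannot push a proposal over the threshold.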


② crash recovery

When the Leader crashes, or when network problems cut it off from more than half of the Followers, ZAB enters crash recovery mode. To keep the system running correctly, recovery must elect a new Leader. ZAB therefore needs an efficient and reliable Leader election algorithm that elects a new Leader quickly. The algorithm must not only let the elected server know that it is the Leader, but also let every other machine in the cluster quickly learn which server was elected.

③ ZAB basic characteristics

Basic guarantees of the ZAB protocol

3.1. ZAB must ensure that a transaction already committed on the Leader is eventually committed on all servers

Suppose a transaction has been committed on the Leader after more than half of the Followers acknowledged it, but the Leader goes down before its Commit message reaches all Followers.


At some point during normal operation, Server1 is the Leader and has broadcast P1, P2, C1, P3, and C2 in that order (C2 is short for Commit of Proposal2). Suppose the Leader crashes and exits right after sending C2. ZAB must ensure that Proposal2 is eventually committed on all servers; otherwise the servers would end up inconsistent.

3.2. ZAB must ensure that transactions proposed only on the Leader are discarded

If crash recovery finds a Proposal that must be discarded, that Proposal is skipped once recovery completes.


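Suppose the initial Leader, Server1, crashes and exits right after proposing Proposal3, so that no other server in the cluster receives this Proposal. When Server1 recovers and rejoins the cluster, ZAB must ensure that Proposal3 is discarded.

3.3. The Leader election algorithm ZAB requires

The algorithm must ensure that Proposals already committed by the Leader get committed, and that skipped Proposals are discarded. If the election algorithm guarantees that the newly elected Leader holds the Proposal with the highest number (the largest ZXID) among all machines in the cluster, then the new Leader is guaranteed to hold every committed Proposal. More importantly, if the machine holding the highest-numbered Proposal becomes Leader, the Leader no longer needs a separate step to check which Proposals to commit and which to discard.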
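The "largest ZXID wins" rule can be sketched as a simple comparison. The sketch below assumes the usual ZXID layout, with the epoch in the high 32 bits and a per-epoch counter in the low 32 bits, and breaks ties by the larger server id. `ElectionSketch` and `Candidate` are hypothetical names; the real election also exchanges votes over several rounds, which is omitted here.

```java
import java.util.List;

// Illustrative only: pick the candidate with the largest last-logged ZXID.
final class ElectionSketch {
    static final class Candidate {
        final long sid;        // server id
        final long lastZxid;   // ZXID of the last transaction in this server's log

        Candidate(long sid, long lastZxid) {
            this.sid = sid;
            this.lastZxid = lastZxid;
        }
    }

    // Packs an epoch and a counter into one 64-bit ZXID (assumed layout).
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Returns the preferred candidate, or null when the list is empty.
    static Candidate pickLeader(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null
                    || c.lastZxid > best.lastZxid
                    || (c.lastZxid == best.lastZxid && c.sid > best.sid)) {
                best = c;
            }
        }
        return best;
    }
}
```

Because the epoch occupies the high bits, any transaction from a newer epoch compares greater than every transaction from an older one, regardless of the counter.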

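3.4. Data synchronization and consistency

After the election and before it starts serving, the Leader first checks whether every Proposal in its log has been committed by more than half of the machines in the cluster, that is, whether data synchronization is complete. The Leader must make sure that every Follower receives each Proposal and correctly applies all committed Proposals to its in-memory database. The Leader maintains a queue for each Follower. It sends the transactions that Follower has not yet synchronized one by one as Proposal messages, each immediately followed by a Commit message indicating that the transaction has been committed. Once a Follower has fetched all of its missing Proposals from the Leader and applied them to its local database, the Leader adds that Follower to the list of truly available Followers and proceeds with the rest of the workflow.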
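The per-Follower queue described above can be sketched as follows. This is a simplified illustration, assuming the Follower's log is a clean prefix of the Leader's committed log; packets are modelled as strings, and `SyncSketch` is a hypothetical name. Cases such as a Follower holding a Proposal that must be discarded are not handled here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: build the catch-up queue for one lagging follower.
final class SyncSketch {
    // committedLog holds the leader's committed ZXIDs in ascending order.
    // followerLastZxid is the last ZXID the follower already has.
    static List<String> packetsFor(List<Long> committedLog, long followerLastZxid) {
        List<String> queue = new ArrayList<>();
        for (long zxid : committedLog) {
            if (zxid > followerLastZxid) {
                queue.add("PROPOSAL " + zxid);   // the missing transaction
                queue.add("COMMIT " + zxid);     // immediately marked as committed
            }
        }
        return queue;
    }
}
```

A Follower that is already up to date gets an empty queue and can be added to the available list straight away.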

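④ ZAB summary

1. Discovery: a Leader is elected and a new epoch is produced (every election of a new Leader also produces a new epoch).

2. Synchronization: each Follower completes data synchronization with the Leader.

3. Broadcast: the Leader handles client write operations and broadcasts the state changes to the Followers; once a majority of Followers has accepted a change, the Leader issues a Commit to make it take effect.

During normal operation, ZAB stays in phase 3 and repeats the message broadcast flow. If a crash or some other cause leaves the cluster without a Leader, ZAB re-enters the discovery phase and elects a new Leader.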
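The three phases form a small state machine. The sketch below models only the transitions named in the summary; the `Phase` and `Event` names are made up for illustration and do not correspond to ZooKeeper types.

```java
// Illustrative only: phase transitions of a ZAB participant.
final class ZabPhases {
    enum Phase { DISCOVERY, SYNCHRONIZATION, BROADCAST }

    enum Event { LEADER_ELECTED, QUORUM_SYNCED, LEADER_LOST }

    static Phase next(Phase current, Event event) {
        if (event == Event.LEADER_LOST) {
            return Phase.DISCOVERY;              // any phase falls back to discovery
        }
        if (current == Phase.DISCOVERY && event == Event.LEADER_ELECTED) {
            return Phase.SYNCHRONIZATION;
        }
        if (current == Phase.SYNCHRONIZATION && event == Event.QUORUM_SYNCED) {
            return Phase.BROADCAST;
        }
        return current;                          // events that do not apply are ignored
    }
}
```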

Source code analysis

1. The Leader sends PROPOSAL

ProposalRequestProcessor.processRequest() sends the PROPOSAL to every node. It calls Leader.propose(), which enqueues the PROPOSAL on each follower's queuedPackets, and then hands the PROPOSAL directly to the leader's own SyncRequestProcessor.

The approximate code path is:

ProposalRequestProcessor.processRequest(request)
  zks.getLeader().propose(request)
        sendPacket(pp)
            for f in forwardingFollowers
                f.queuePacket(qp)
                    queuedPackets.add(p)
  syncProcessor.processRequest(request)

2. The Leader processes PROPOSAL

SyncRequestProcessor handles it first:

SyncRequestProcessor.run()
    zks.getZKDatabase().append(si)
    flush(toFlush)
        zks.getZKDatabase().commit()
        while (!toFlush.isEmpty())
            Request i = toFlush.remove()
            if (nextProcessor != null)
                nextProcessor.processRequest(i)

Then the Leader's ACK processor runs and delivers the Leader's own ACK to the Leader:

AckRequestProcessor.processRequest()
    leader.processAck(self.getId(), request.zxid, null)


3. The Follower processes PROPOSAL

Follower.followLeader() handles each incoming QuorumPacket; the case Leader.PROPOSAL branch is the one that handles PROPOSAL.

Follower.followLeader()
    loop
        readPacket(qp)
            leaderIs.readRecord(pp, "packet")
        processPacket(qp)
            case Leader.PROPOSAL:
                Record txn = SerializeUtils.deserializeTxn(qp.getData(), hdr)
                fzk.logRequest(hdr, txn)
                    syncProcessor.processRequest(request)
            case Leader.COMMIT:
                fzk.commit(qp.getZxid())
                    commitProcessor.commit(request)

SyncRequestProcessor processing logic:

SyncRequestProcessor.run()
    zks.getZKDatabase().append(si)
    flush(toFlush)
        zks.getZKDatabase().commit()
        while (!toFlush.isEmpty())
            Request i = toFlush.remove()
            if (nextProcessor != null)
                nextProcessor.processRequest(i)
                    QuorumPacket qp = new QuorumPacket(Leader.ACK)
                    learner.writePacket(qp, false)
                        leaderOs.writeRecord(pp, "packet")
        ((Flushable)nextProcessor).flush()
            learner.writePacket(null, true)
                bufferedOutput.flush()
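The ordering in this trace matters: requests are appended, the log is committed to disk once for the whole batch, and only then is an ACK produced for each request. A minimal sketch of that ordering follows, with in-memory lists standing in for the disk log and the network; `GroupCommitSketch` is a hypothetical name, not a ZooKeeper class.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative only: append, flush the batch once, then acknowledge.
final class GroupCommitSketch {
    private final Queue<Long> toFlush = new ArrayDeque<>();
    private final List<Long> durable = new ArrayList<>(); // stands in for the on-disk log
    private final List<Long> acked = new ArrayList<>();   // stands in for ACKs sent to the leader

    // Buffers a transaction; nothing is acknowledged yet.
    void append(long zxid) {
        toFlush.add(zxid);
    }

    // Makes the whole batch durable first, then emits one ACK per transaction.
    void flush() {
        durable.addAll(toFlush);
        while (!toFlush.isEmpty()) {
            acked.add(toFlush.remove());
        }
    }

    List<Long> durable() {
        return durable;
    }

    List<Long> acked() {
        return acked;
    }
}
```

Acknowledging only after the flush means an ACK never refers to a transaction that could be lost in a crash, while batching keeps the number of disk syncs low.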


4. ACK handling on the Leader

Leader.processAck() handles ACK messages. Once ACKs have arrived from a majority of nodes, it sends COMMIT to all follower nodes and invokes the leader's own CommitProcessor. processAck() has two entry points: 1. LearnerHandler.run() handles ACKs from followers. 2. AckRequestProcessor.processRequest() handles the leader's own ACK.

Leader.processAck(this.sid, qp.getZxid(), sock.getLocalSocketAddress())
    Proposal p = outstandingProposals.get(zxid)
    p.addAck(sid)
    tryToCommit(p, zxid, followerAddr)
        if !p.hasAllQuorums()
            return false;
        // Commit on all followers
        commit(zxid)
            QuorumPacket qp = new QuorumPacket(Leader.COMMIT, zxid, null, null)
            sendPacket(qp)
        // Commit on Leader
        zk.commitProcessor.commit(p.request)

5. COMMIT handling on the Leader

CommitProcessor.run()
    request = queuedRequests.poll() 
    processCommitted()
        sendToNextProcessor(pending)

The committed request is handed to ToBeAppliedRequestProcessor, ready to be applied to the in-memory database:

ToBeAppliedRequestProcessor.processRequest()
    next.processRequest(request)

Finally the request reaches FinalRequestProcessor, which returns the response.


6. COMMIT handling on the Follower


CommitProcessor.run()
    request = queuedRequests.poll() 
    processCommitted()
        sendToNextProcessor(pending) 
// Return the response
FinalRequestProcessor.processRequest()


Origin blog.51cto.com/janephp/2462718