ZooKeeper Distributed Protocols and Implementation (the ZAB Protocol), Part 2

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/qq_37598011/article/details/89814372

    To address distributed data consistency, ZooKeeper does not use the Paxos algorithm directly; instead it uses a consensus protocol called ZAB (ZooKeeper Atomic Broadcast).

    The core of the ZAB protocol:

    All transaction requests must be coordinated by a single, globally unique server process, called the Leader server; the remaining servers are Follower servers. The Leader server is responsible for converting a client's transaction request into a transaction Proposal and distributing that Proposal to every Follower server in the cluster. The Leader then waits for feedback from the Followers: once more than half of the Follower servers have responded correctly, the Leader broadcasts a Commit message to all Follower servers, instructing them to commit the preceding Proposal.

Protocol Introduction

    The ZAB protocol includes two basic modes: crash recovery and message broadcast. When the whole service framework is starting up, or when the Leader server crashes, loses its network connection, restarts, or otherwise exits abnormally, the ZAB protocol enters recovery mode and elects a new Leader server. Once a new Leader has been elected and more than half of the machines in the cluster have finished synchronizing state with it, the ZAB protocol exits recovery mode. Here, state synchronization means data synchronization: it ensures that more than half of the machines in the cluster hold data consistent with the Leader server.

    Once more than half of the Follower servers in the cluster have synchronized state with the Leader, the whole service framework can enter message broadcast mode. When a new server that also follows the ZAB protocol joins the cluster while a Leader already exists and is broadcasting messages, the newly added server automatically enters recovery mode: it finds the Leader server, synchronizes data with it, and then joins the message broadcast process. ZooKeeper is designed so that only the Leader server handles transaction requests. When the Leader receives a transaction request from a client, it generates the corresponding transaction Proposal and initiates a round of broadcast; if any other machine in the cluster receives a client transaction request, that non-Leader server first forwards the request to the Leader server.

    When the Leader server crashes or is rebooted, or when more than half of the servers in the cluster can no longer communicate properly with the Leader, the ZAB protocol moves from message broadcast mode into crash recovery mode: before a new round of atomic broadcast can begin, all processes first use the crash recovery protocol to reach a mutually consistent state.

    For a machine to become the new Leader, it must win the support of more than half of the processes. Since any process can crash, over the lifetime of the ZAB protocol there will be a succession of Leaders, and any given process may become Leader several times. After entering crash recovery mode, as long as more than half of the servers in the cluster can communicate with one another, a new Leader can be elected and the cluster can re-enter message broadcast mode. Take, for example, a ZAB service composed of three machines, normally one Leader server and two Follower servers. If one Follower goes down at some moment, the ZAB service as a whole is not interrupted, because the Leader can still obtain the support of more than half of the machines (counting the Leader itself).
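The "more than half" rule used throughout the paragraphs above can be captured in a short sketch (Python; the function name is illustrative, not ZooKeeper's API):

```python
def has_quorum(ack_count: int, cluster_size: int) -> bool:
    """True once strictly MORE than half of the servers
    (the Leader counts toward its own quorum) have responded."""
    return ack_count > cluster_size // 2
```

In the 3-server example, the Leader plus the one surviving Follower still give `has_quorum(2, 3) == True`, so the service continues; by the same arithmetic a 5-server ensemble tolerates two failures.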

Message Broadcast

    The atomic broadcast protocol that ZAB uses during message broadcast resembles a two-phase commit. For a client's transaction request, the Leader server generates the corresponding transaction Proposal and sends it to all other machines in the cluster, then collects their votes, and finally commits the transaction; the figure shows a schematic of the ZAB message broadcast flow.

    ZAB's two-phase-commit-like process removes the abort (interrupt) logic: every Follower server either gives normal feedback on the transaction Proposal the Leader proposed, or simply abandons the Leader server. Removing the interrupt logic also means the Leader can begin committing a transaction Proposal as soon as more than half of the Follower servers have replied with an Ack, without waiting for responses from every Follower server in the cluster.

    This simplified two-phase commit model, however, cannot handle the data inconsistencies caused by a Leader server crashing mid-commit, so the ZAB protocol adds another mode, crash recovery, to fix the problem. In addition, the whole message broadcast protocol is built on TCP, whose FIFO property makes it easy to guarantee ordered sending and receiving of messages during broadcast.

    During message broadcast, the Leader server generates a corresponding Proposal for each transaction request, and before broadcasting a Proposal it assigns the transaction a globally unique, monotonically increasing ID, called the transaction ID (ZXID). Because the ZAB protocol must preserve strict causal order between messages, every transaction Proposal must be sorted and processed in ZXID order.

    Concretely, during message broadcast the Leader server allocates a separate queue for each Follower server, places the transaction Proposals to be broadcast into those queues in turn, and sends the messages under a FIFO policy. On receiving a transaction Proposal, each Follower server first writes it to its local disk as a transaction log entry, and after the write succeeds replies to the Leader with an Ack. When the Leader has received Acks from more than half of the Followers, it broadcasts a Commit message to all Follower servers to notify them to commit the transaction, and commits the transaction itself; each Follower, on receiving the Commit message, completes the commit of the transaction.
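The queue-per-Follower broadcast flow just described can be modeled in a few lines (an illustrative Python sketch of the bookkeeping, not ZooKeeper code; class and field names are mine):

```python
from collections import deque

class LeaderSketch:
    """Model of the broadcast flow: one FIFO queue per Follower,
    commit once a quorum of Acks has arrived."""

    def __init__(self, follower_ids):
        self.followers = list(follower_ids)
        self.queues = {f: deque() for f in self.followers}  # per-Follower FIFO
        self.acks = {}        # zxid -> set of servers that acked
        self.committed = []   # zxids committed, in order

    def broadcast(self, zxid, txn):
        # Enqueue the Proposal for every Follower (sent FIFO per queue).
        for f in self.followers:
            self.queues[f].append((zxid, txn))
        self.acks[zxid] = {"leader"}  # the Leader counts toward the quorum

    def on_ack(self, zxid, follower):
        # The Follower has logged the Proposal to disk and replied with an Ack.
        self.acks[zxid].add(follower)
        cluster = len(self.followers) + 1  # Followers + Leader
        if len(self.acks[zxid]) > cluster // 2 and zxid not in self.committed:
            self.committed.append(zxid)    # would now send Commit to all
```

With three servers, a single Follower's Ack plus the Leader's own vote already forms the quorum, which is exactly why commit does not wait for every Follower.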

Crash Recovery

    The atomic-broadcast-based message protocol above runs very well under normal conditions, but as soon as the Leader server crashes, or loses contact with more than half of the Followers because of a network fault, the cluster enters crash recovery mode. To keep the protocol running correctly, the recovery process must elect a new Leader server. The ZAB protocol therefore requires an efficient and reliable Leader election algorithm that elects a new Leader quickly. At the same time, the election algorithm must not only let the new Leader itself know that it has been elected, but also let every other machine in the cluster quickly learn which server the new Leader is.

Basic Properties

    The ZAB protocol must ensure that a transaction committed on the Leader server is eventually committed on all servers.

    Suppose a transaction is committed on the Leader server and Acks have been received from half of the Follower servers, but the Leader server crashes before it can send the Commit message to all Follower machines, as shown in the figure.

    Message C2 in the figure is a typical example: at some point during normal cluster operation, Server1 is the Leader server and has broadcast the messages P1, P2, C1, P3, and C2, where C2 is short for "Commit of Proposal2", meaning that transaction Proposal2 is committed. The Leader crashes and exits immediately after issuing message C2. For this situation, the ZAB protocol must ensure that transaction Proposal2 is ultimately committed successfully on all servers; otherwise the cluster becomes inconsistent.

    The ZAB protocol must ensure that transactions proposed only on the Leader server are discarded.

    Conversely, if a Proposal must be discarded during crash recovery, then after the crash recovery that transaction Proposal needs to be skipped, as shown in the figure.

    In the cluster shown above, suppose the initial Leader server, Server1, crashes and exits right after proposing transaction Proposal3, so that no other server in the cluster receives this transaction Proposal. When Server1 recovers and rejoins the cluster, the ZAB protocol must ensure that transaction Proposal3 is discarded.

    Taken together, the two crash scenarios that the recovery process must handle determine that the ZAB protocol must be designed with a Leader election algorithm that guarantees both properties: transactions already committed by the Leader end up committed everywhere, while transactions that were only proposed on the crashed Leader are skipped. To meet this requirement, if the election algorithm guarantees that the newly elected Leader server holds the transaction Proposal with the highest number (that is, the largest ZXID) among all machines in the cluster, then the new Leader is guaranteed to hold every Proposal that has already been committed. Even better, letting the machine with the highest-numbered transaction Proposal become the Leader spares the Leader server the step of checking which Proposals to commit and which to discard.
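The election rule this paragraph arrives at, picking the server with the largest ZXID, can be sketched as follows (Python; the tiebreak by server id mirrors ZooKeeper's `myid` rule but is an assumption in this sketch):

```python
def elect_leader(last_zxids: dict) -> str:
    """Return the id of the server holding the highest-numbered
    transaction Proposal (largest ZXID); ties go to the larger id."""
    return max(last_zxids, key=lambda sid: (last_zxids[sid], sid))
```

Because ZXIDs are monotonically increasing, the winner necessarily saw every Proposal that ever reached a quorum, which is exactly the property the text requires.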

Data Synchronization

    After Leader election completes, and before it officially starts work (that is, before it accepts transaction requests from clients and proposes new Proposals), the Leader server first confirms whether every transaction Proposal in its transaction log has been committed by more than half of the machines in the cluster, in other words, whether data synchronization is complete.

    Every normally running server either becomes the Leader or becomes a Follower that stays synchronized with the Leader. The Leader server must ensure that every Follower server receives every transaction Proposal, and that every committed transaction Proposal is correctly applied to the in-memory database. Concretely, the Leader server prepares a queue for each Follower server and sends the transactions that the Follower has not yet synchronized to it one by one, as Proposal messages, following each Proposal message immediately with a Commit message to indicate that the transaction has been committed. Once a Follower has pulled all of its missing transaction Proposals from the Leader and successfully applied them to its local database, the Leader adds that Follower to its list of truly available Followers and proceeds with the subsequent flow.

    The above is the data synchronization logic in the normal case; next, consider how the ZAB protocol handles the transaction Proposals that must be discarded. In the ZAB protocol's design of the transaction ID, the ZXID is a 64-bit number. The low 32 bits can be viewed as a simple, monotonically increasing counter: for each client transaction request, the Leader server increments this counter by 1 when it generates a new transaction Proposal. The high 32 bits represent the number of the Leader epoch: whenever a new Leader server is elected, the ZXID of the largest transaction Proposal in that Leader's local log is read, the corresponding epoch value is parsed out of that ZXID and incremented by 1, and the result is used as the new epoch, with the low 32 bits reset to 0 for generating new ZXIDs. This strategy of distinguishing Leader epochs by the epoch number effectively prevents different Leader servers from mistakenly proposing different transaction Proposals under the same ZXID. It is very helpful for telling apart Proposals generated before and after a Leader crash and recovery, and it greatly simplifies and speeds up the data recovery process.
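The 64-bit ZXID layout just described can be expressed directly (a Python sketch; the helper names are mine, not ZooKeeper's):

```python
def make_zxid(epoch: int, counter: int) -> int:
    """Pack a ZXID: high 32 bits = Leader epoch, low 32 bits = counter."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def epoch_of(zxid: int) -> int:
    return zxid >> 32

def counter_of(zxid: int) -> int:
    return zxid & 0xFFFFFFFF

def new_epoch_zxid(max_logged_zxid: int) -> int:
    """A new Leader parses the epoch out of the largest logged ZXID,
    adds 1, and resets the low 32 bits to 0."""
    return make_zxid(epoch_of(max_logged_zxid) + 1, 0)
```

Because the epoch occupies the high bits, any ZXID from a newer epoch compares greater than every ZXID from an older one, which is what makes the "largest ZXID wins" election rule safe across Leader changes.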

    With this strategy, a server that starts up holding transaction Proposals not yet committed in the previous Leader epoch certainly cannot become the Leader. The reason is simple: the current cluster must contain a Quorum set whose machines hold transaction Proposals from a higher epoch, so this machine's transaction Proposals cannot have the highest ZXID, and it cannot be elected Leader. When this machine joins the cluster and connects to the Leader server as a Follower, the Leader compares the last Proposal committed on its own server against the Follower's Proposals; the outcome of the comparison is, naturally, that the Leader asks the Follower to perform a truncation, rolling back to the most recent transaction Proposal that has genuinely been committed by more than half of the machines in the cluster.

A Deeper Look at the ZAB Protocol

Description of the ZAB Algorithm

    The ZAB protocol as a whole consists of two processes, message broadcast and crash recovery, which can be further divided into three phases: discovery (Discovery), synchronization (Synchronization), and broadcast (Broadcast). Every distributed process that makes up the ZAB protocol executes these three phases in a loop; we call one such loop a main process cycle.

Terms and their meanings:

  • F.p: the last transaction Proposal processed by Follower f.
  • F.zxid: the transaction ID (ZXID) of the last transaction Proposal in the history of Proposals processed by Follower f.
  • hf: each Follower f has usually already processed (accepted) many transaction Proposals and keeps a set of the transactions it has processed; we denote it hf, the sequence of transactions Follower f has already processed.
  • Ie: the initialization history; within a main process cycle with epoch e, once the prospective Leader completes phase one, its hf is designated Ie.

Phase 1: Discovery

    Phase one is essentially the Leader election process, used to elect a main process from among multiple distributed processes. The workflows of the prospective Leader L and of a Follower F are as follows:

  • Step F.1.1: Follower F sends the epoch value of the last transaction Proposal it accepted, CEPOCH(F.p), to the prospective Leader L.
  • Step L.1.1: after receiving CEPOCH(F.p) messages from more than half of the Followers, the prospective Leader L generates a NEWEPOCH(e') message for this quorum of Followers. (For the epoch value e', the prospective Leader L selects the largest epoch value among all the CEPOCH(F.p) messages it received and increments it by 1; the result is e'.)
  • Step F.1.2: when a Follower receives the NEWEPOCH(e') message from the prospective Leader L, if it detects that its current CEPOCH(F.p) value is less than e', it sets CEPOCH(F.p) to e' and replies to the prospective Leader L with an Ack. This feedback message, ACK-E(F.p, hf), contains the Follower's current epoch CEPOCH(F.p) and the Follower's history of transaction Proposals, hf.

    When the prospective Leader L has received Ack confirmations from more than half of the Followers, it selects one Follower F from this quorum and uses its history as the initialization transaction set Ie'.

Regarding the selection of this Follower F: for any other Follower F' in the Quorum, F must satisfy one of the following two conditions: either CEPOCH(F'.p) < CEPOCH(F.p), or CEPOCH(F'.p) = CEPOCH(F.p) and F'.zxid ≤ F.zxid.
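Steps L.1.1 and the selection of F can be condensed into two small functions (an illustrative Python sketch under the assumptions of this section; the data shapes are mine):

```python
def new_epoch(received_epochs):
    """Step L.1.1: e' is one more than the largest epoch reported
    in the quorum's CEPOCH(F.p) messages."""
    return max(received_epochs) + 1

def pick_initial_history(acks):
    """Select the Follower F whose history seeds Ie': the one whose
    (epoch, zxid) pair is at least as large as every other reply."""
    return max(acks, key=lambda a: (a["epoch"], a["zxid"]))
```

Comparing the (epoch, zxid) pair lexicographically is exactly the two-condition rule above: a strictly higher epoch wins outright, and within the same epoch the larger ZXID wins.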

Phase 2: Synchronization

    In this phase, the workflows of Leader L and Follower F are as follows.

  • Step L.2.1: Leader L sends e' and Ie' to all Followers in the Quorum in the form of a NEWLEADER(e', Ie') message.
  • Step F.2.1: when a Follower receives the NEWLEADER(e', Ie') message from Leader L, if the Follower finds that CEPOCH(F.p) != e', it simply proceeds to the next loop iteration, because it has discovered that it is still in the previous round (or an even earlier one) and cannot take part in this round's synchronization. If CEPOCH(F.p) = e', the Follower performs the transaction application operation.

Finally, the Follower replies to the Leader, indicating that it has accepted and processed all the transaction Proposals in Ie'.

  • Step L.2.2: when the Leader has received feedback on NEWLEADER(e', Ie') from more than half of the Followers, it sends a Commit message to all Followers. At this point the Leader has completed phase two.
  • Step F.2.2: when a Follower receives the Commit message from the Leader, it processes and commits, in order, all the transactions in Ie' that it has not yet processed. At this point the Follower has completed phase two.

Phase 3: Broadcast

    After completing the synchronization phase, the ZAB protocol can formally accept client transaction requests and carry out the message broadcast flow.

  • Step L.3.1: after Leader L receives a new transaction request from a client, it generates the corresponding transaction Proposal and sends the proposal <e', <v, z>> to all Followers in ZXID order, where epoch(z) = e'.
  • Step F.3.1: a Follower processes these transaction Proposals from the Leader in the order in which the messages are received, appends them to hf, and then replies to the Leader with an Ack.
  • Step L.3.2: when the Leader has received Ack messages from more than half of the Followers for transaction Proposal <e', <v, z>>, it sends a Commit <e', <v, z>> message to all Followers, asking them to commit the transaction.
  • Step F.3.2: when Follower F receives the Commit <e', <v, z>> message from the Leader, it starts to commit transaction Proposal <e', <v, z>>. Note that at this point Follower F must already have committed every earlier transaction Proposal <e', <v', z'>> with z' < z.
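The ordering guarantee in step F.3.2, that committing z implies every smaller ZXID was committed first, can be modeled as taking a prefix of the FIFO-ordered history (an illustrative Python sketch):

```python
def commit_order(proposals, commit_zxid):
    """Return the prefix of the ZXID-ordered history up to and
    including commit_zxid: everything before z commits before z."""
    out = []
    for z, v in proposals:   # proposals arrive in ZXID order (FIFO)
        out.append((z, v))
        if z == commit_zxid:
            return out
    raise ValueError("commit received for an unseen proposal")
```

This is why TCP's FIFO property matters to ZAB: because Proposals are delivered and logged in ZXID order, a Commit for z can never overtake the commits of the transactions before it.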

    During normal operation, the ZAB protocol stays in phase three, repeatedly carrying out the message broadcast flow. If the Leader crashes or is otherwise lost, the ZAB protocol re-enters phase one and elects a new Leader.

Runtime Analysis

    In the design of the ZAB protocol, each process can be in one of the following three states.

  • LOOKING: in the Leader election phase
  • FOLLOWING: a Follower server keeping in sync with the Leader
  • LEADING: a Leader server acting as the main process

    When the processes that make up the ZAB protocol first start, every process is in the initial LOOKING state, and no Leader exists in the cluster. All processes in this state try to elect a Leader. If a process finds that a Leader has already been elected, it immediately switches to the FOLLOWING state and begins synchronizing with the Leader. Here we call a process in the FOLLOWING state a Follower and a process in the LEADING state a Leader. Since the Leader process may fail at any time, when the Leader is detected to have crashed or to have given up leadership, the remaining FOLLOWING processes re-enter the LOOKING state and begin a new round of Leader election. In the ZAB protocol, therefore, each process continually transitions among the LOOKING, FOLLOWING, and LEADING states.
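The state transitions described above can be written as a small table (an illustrative Python sketch; the event names are mine, not ZooKeeper identifiers):

```python
# Each entry maps (current state, event) to the next state.
TRANSITIONS = {
    ("LOOKING", "elected_self"):   "LEADING",
    ("LOOKING", "elected_other"):  "FOLLOWING",
    ("FOLLOWING", "leader_lost"):  "LOOKING",
    ("LEADING", "quorum_lost"):    "LOOKING",
}

def next_state(state: str, event: str) -> str:
    """Return the next ZAB process state; unknown events leave it unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that both failure events lead back to LOOKING: whether a Follower loses its Leader or a Leader loses its quorum, the process rejoins election and the cycle restarts.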

    After Leader election and data synchronization complete, the ZAB protocol enters the broadcast phase. In this phase, the Leader creates an operation queue for each Follower that stays in sync with it. At any given moment, a Follower can be in sync with only one Leader. The Leader process and all Follower processes sense one another's state through a heartbeat mechanism. As long as the Leader receives a Follower's heartbeats within the timeout, that Follower stays in sync with the Leader. If, within the specified timeout, the Leader cannot receive heartbeats from more than half of the Follower processes, or the TCP connections themselves break, the Leader gives up its leadership of the current epoch and transitions to the LOOKING state; all Followers likewise abandon this Leader and enter the LOOKING state. All processes then start a new round of Leader election, and once a new Leader has been elected a new main process cycle begins.
 

Relationship and Differences Between ZAB and Paxos

    The ZAB protocol is not a typical implementation of the Paxos algorithm. Before discussing the differences between ZAB and Paxos, first look at what they have in common:

  • Both have a role similar to a Leader process, responsible for coordinating the operation of multiple Follower processes.
  • In both, the Leader process waits for correct feedback from more than half of the Followers before committing a proposal.
  • In the ZAB protocol, every Proposal carries an epoch value representing the current Leader epoch; the same kind of identifier exists in the Paxos algorithm, just under a different name, the Ballot.

    In the Paxos algorithm, a newly elected main process works in two phases. The first is called the read phase: the new main process gathers, by communicating with all other processes, the proposals put forward by previous main processes, and commits them. The second is called the write phase: the current main process begins to put forward its own proposals. On top of the Paxos design, the ZAB protocol adds an extra synchronization phase. Before the synchronization phase, ZAB also has a phase very similar to Paxos's read phase, called the discovery (Discovery) phase. In the synchronization phase, the new Leader ensures that more than half of the Followers have committed all transaction Proposals from previous Leader epochs. Introducing this synchronization phase effectively guarantees that, before the Leader proposes new transaction Proposals in the new epoch, all processes have finished committing all earlier transaction Proposals. Once the synchronization phase completes, ZAB executes a write phase similar to that of the Paxos algorithm.

    In general, the essential difference between the ZAB protocol and the Paxos algorithm is that their design goals differ. The ZAB protocol is mainly used to build a highly available distributed primary-backup system, such as ZooKeeper, while the Paxos algorithm is used to build a distributed consistent state machine system.
 

Reference: "From Paxos to ZooKeeper: Principles and Practice of Distributed Consistency"
