ZooKeeper's Implementation of Distributed Data Consistency: The ZAB Protocol in Detail

I. ZAB Protocol Overview

Before delving into ZooKeeper, many people assume that it is an implementation of the Paxos algorithm. In fact, ZooKeeper does not fully adopt Paxos; instead, it uses a protocol called ZAB (ZooKeeper Atomic Broadcast) as the core algorithm for keeping data consistent.

ZAB is an atomic broadcast protocol with crash-recovery support, designed specifically for the distributed coordination service ZooKeeper.
Unlike Paxos, ZAB is not a general-purpose distributed consensus algorithm; it is a crash-recoverable atomic message broadcast algorithm designed specifically for ZooKeeper.

ZooKeeper relies primarily on the ZAB protocol to achieve distributed data consistency. Based on this protocol, ZooKeeper implements a primary-backup architecture to keep the data replicas across the cluster consistent. Specifically, ZooKeeper uses a single master process to receive and process all transaction requests from clients, and uses ZAB's atomic broadcast to broadcast the server's state changes, in the form of transaction Proposals, to all replica processes. ZAB's primary-backup architecture guarantees that at any moment only one master process in the cluster is broadcasting the server's state changes, which is why it copes well with large numbers of concurrent client requests.

On the other hand, in a distributed environment there are ordering dependencies between state changes: some state changes must depend on state changes generated earlier than themselves; for example, change C may depend on changes A and B. Such dependencies impose a requirement on the ZAB protocol: it must guarantee that changes are applied in a single global sequence, that is, if a state change has been processed, then all state changes it depends on must already have been processed (in other words, a transaction request that has been processed is neither reprocessed nor lost). Finally, since the primary process may crash, restart, or exit at any time, the ZAB protocol must continue to work correctly even when the current primary process fails in the ways described above.

The core of the ZAB protocol defines how ZooKeeper servers handle transaction requests that change the server's data state, namely:

All transaction requests must be coordinated by a single, globally unique server process, called the Leader server; all the remaining servers are Follower servers. The Leader server is responsible for converting each client transaction request into a transaction Proposal and distributing the Proposal to every Follower server in the cluster. The Leader then waits for feedback from the Follower servers: once more than half of the Followers have responded correctly, the Leader broadcasts a Commit message to all Follower servers, asking them to commit the preceding Proposal.
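The majority rule above — commit once more than half of the servers have acknowledged — can be sketched in a few lines of Python (the `QuorumTracker` name and shape are invented for illustration; this is not ZooKeeper's actual code):

```python
class QuorumTracker:
    """Tracks Acks for one Proposal and decides when it may be committed.

    The rule from the text: the Leader may broadcast Commit as soon as
    MORE THAN HALF of the whole ensemble has acknowledged the Proposal.
    """

    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.acks = set()

    def record_ack(self, server_id):
        self.acks.add(server_id)

    def can_commit(self):
        # strictly more than half of the whole ensemble
        return len(self.acks) > self.cluster_size // 2


tracker = QuorumTracker(cluster_size=5)
tracker.record_ack("follower1")
tracker.record_ack("follower2")
print(tracker.can_commit())  # False: 2 of 5 is not a majority
tracker.record_ack("follower3")
print(tracker.can_commit())  # True: 3 of 5 is a majority
```

Note that a strict majority (3 of 5, not 2 of 5) is what lets any two quorums overlap, which is the basis of the correctness argument later in the article.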

II. The Protocol

The ZAB protocol includes two basic modes: crash recovery and message broadcast.

When does Leader election take place?
When the whole service framework is starting up, or when the Leader server exits and restarts after a network outage, crash, or other abnormal condition, the ZAB protocol enters recovery mode and elects a new Leader server.

When does the protocol enter message broadcast mode, and when does it enter data recovery mode?

Once a new Leader server has been elected and more than half of the machines in the cluster have completed state synchronization with the Leader, the ZAB protocol exits recovery mode. Here, state synchronization means data synchronization: it ensures that more than half of the machines in the cluster have the same data state as the Leader server.
When more than half of the Follower servers have completed state synchronization with the Leader server, the whole service framework can enter message broadcast mode. When a new server that also follows the ZAB protocol starts up and joins a cluster that already has a Leader in charge of broadcasting messages, the newly added server automatically enters data recovery mode: it locates the Leader server, synchronizes data with it, and then joins in the message broadcast process.

When the Leader server crashes and exits or its machine reboots, or when the Leader can no longer maintain normal communication with more than half of the servers in the cluster, then before a new round of atomic broadcast can start, all processes first use the crash-recovery protocol to bring themselves into a mutually consistent state; the whole flow thereby moves from message broadcast mode into crash recovery mode.
For a machine to become the new Leader, it must be supported by more than half of the processes. Since every process may crash, over the lifetime of the ZAB protocol there may be a succession of Leaders, and each process may become Leader more than once. After entering crash recovery mode, as long as more than half of the servers in the cluster can communicate with one another normally, a new Leader can be produced and the cluster returns to message broadcast mode.

Only the Leader can process transactions — what if another node receives a transaction request?
ZooKeeper is designed so that only the single Leader server processes transaction requests. After the Leader server receives a transaction request from a client, it generates the corresponding transaction Proposal and initiates a round of the broadcast protocol. If any other machine in the cluster receives a transaction request from a client, that non-Leader server first forwards the request to the Leader server.

Message Broadcast

The protocol used during ZAB's message broadcast is an atomic broadcast protocol, similar to a two-phase commit. For each client transaction request, the Leader server generates a corresponding transaction Proposal and sends it to every other machine in the cluster, then collects their votes, and finally commits the transaction. The figure below shows a schematic of the ZAB message broadcast flow.
[Figure: schematic flow of ZAB message broadcast]

How does ZAB's two-phase commit differ from the classic two-phase commit process?

In ZAB's two-phase commit, the interrupt (abort) logic has been removed: every Follower server either positively acknowledges the transaction Proposal put forward by the Leader, or abandons the Leader server. At the same time, removing the interrupt logic from two-phase commit means the Leader can start committing a transaction Proposal as soon as more than half of the Follower servers have returned an Ack, without waiting for every Follower in the cluster to respond. Of course, this simplified two-phase commit model cannot handle the data inconsistencies caused by a Leader server crashing and exiting, so the ZAB protocol adds another mode, crash recovery, to solve this problem. In addition, the whole message broadcast protocol communicates over TCP, which has FIFO (first-in, first-out) properties, making it easy to guarantee the ordering of message sending and receiving during broadcast.

What is a ZXID?

Throughout the message broadcast process, the Leader server generates a corresponding Proposal for each transaction request and broadcasts it. Before broadcasting a transaction Proposal, the Leader first assigns it a globally monotonically increasing unique ID, which we call the transaction ID (ZXID).

Because the ZAB protocol must preserve a strict causal order between messages, each transaction Proposal must be sorted and processed in ZXID order. Concretely, during message broadcast the Leader server assigns each Follower its own dedicated queue, places the transaction Proposals to be broadcast into these queues in order, and sends the messages according to a FIFO policy. On receiving a transaction Proposal, each Follower first writes it to local disk as a transaction log entry, and after a successful write returns an Ack response to the Leader. Once the Leader has received Acks from more than half of the Followers, it broadcasts a Commit message to all Followers, telling them to commit the transaction; the Leader commits the transaction itself as well, and each Follower commits the transaction upon receiving the Commit message.
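The flow just described — a dedicated FIFO queue per Follower, log-then-Ack on the Follower side, Commit after a majority of Acks — can be sketched as follows (class and method names are illustrative, not ZooKeeper internals; the Leader counts itself toward the quorum, consistent with the commit rule in the text):

```python
from collections import deque


class LeaderBroadcast:
    """Illustrative sketch of the Leader side of the broadcast flow."""

    def __init__(self, follower_ids):
        self.followers = list(follower_ids)
        # one dedicated FIFO queue per Follower, as the text describes
        self.queues = {f: deque() for f in self.followers}
        self.acks = {}        # zxid -> set of Followers that have acked
        self.committed = []   # zxids for which Commit has been broadcast

    def propose(self, zxid, txn):
        """Place the Proposal on every Follower's queue in FIFO order."""
        for f in self.followers:
            self.queues[f].append((zxid, txn))
        self.acks[zxid] = set()

    def on_ack(self, follower, zxid):
        """A Follower has logged the Proposal to disk and sent its Ack."""
        self.acks[zxid].add(follower)
        ensemble = len(self.followers) + 1             # Followers + Leader
        if len(self.acks[zxid]) + 1 > ensemble // 2:   # Leader counts too
            if zxid not in self.committed:
                self.committed.append(zxid)            # broadcast Commit(zxid)


leader = LeaderBroadcast(["f1", "f2", "f3", "f4"])  # 5-node ensemble
leader.propose(1, "create /a")
leader.on_ack("f1", 1)
print(leader.committed)  # [] -- only 2 of 5 (Leader + f1) so far
leader.on_ack("f2", 1)
print(leader.committed)  # [1] -- 3 of 5 is a majority
```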

Crash Recovery

When does the protocol enter crash recovery mode?

ZAB's atomic-broadcast-based message broadcast process runs very well under normal conditions, but as soon as the Leader server crashes, or loses contact with more than half of the Followers because of network problems, the protocol enters crash recovery mode.

In the ZAB protocol, to guarantee correct operation, a new Leader server must be elected by the end of the recovery process. ZAB therefore needs an efficient and reliable Leader election algorithm that can elect a new Leader quickly. The election algorithm must not only make the Leader itself aware that it has been elected, but also let every other machine in the cluster quickly learn of the newly elected Leader server.

Basic properties
From the above, we know that the ZAB protocol stipulates that if a transaction Proposal is processed successfully on one machine, it should be processed successfully on all machines, even if machines fail and crash. Next we look at two potential data inconsistencies that can arise during crash recovery, and the properties the ZAB protocol must guarantee for each.

Property one:
The ZAB protocol must ensure that transactions that have already been committed on the Leader server are eventually committed by all servers.

Suppose a transaction has been committed on the Leader server, and has already received Acks from more than half of the Follower servers, but the Leader crashes before it can send the Commit message to all Follower machines.
[Figure: the Leader crashes right after sending Commit message C2]
Message C2 in the figure above is a typical example: at some moment during normal cluster operation, the Leader server broadcasts the messages P1, P2, C1, P3, and C2 in sequence, and crashes and exits immediately after sending message C2 (C2 is short for Commit-of-Proposal2, i.e., commit transaction Proposal2). In this situation, the ZAB protocol must ensure that transaction Proposal2 is eventually committed successfully on all servers; otherwise an inconsistency would result.

Property two:
The ZAB protocol must ensure that transactions proposed only on the Leader server are discarded.

If a proposal that must be discarded appears during crash recovery, then after crash recovery finishes, that transaction Proposal must be skipped, as shown in the figure below.
[Figure: skipping a Proposal seen only by the crashed Leader]

Suppose the initial Leader server, Server1, crashes and exits right after proposing a transaction P3, so that no other server in the cluster receives this transaction Proposal. When Server1 recovers and rejoins the cluster, the ZAB protocol must ensure that the transaction Proposal3 is discarded.

The two special cases above that must be handled during crash recovery dictate that the ZAB protocol must use a Leader election algorithm with the following property:
it must ensure that transaction Proposals already committed by the Leader get committed everywhere, while discarding transaction Proposals that have been skipped.

Given this requirement, if the Leader election algorithm guarantees that the newly elected Leader server holds the highest-numbered transaction Proposal (i.e., the largest ZXID) among all machines in the cluster, then the new Leader is guaranteed to hold every proposal that has already been committed, because the ZXID increases with every committed transaction. More importantly, letting the machine with the highest-numbered transaction Proposal become Leader spares the Leader server the step of checking which Proposals to commit and which to discard.
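As a toy illustration of this rule (a deliberate simplification: ZooKeeper's real fast leader election exchanges votes over several rounds), picking the server with the largest ZXID and breaking ties by server id might look like:

```python
def elect_leader(last_zxids):
    """Return the server id holding the numerically largest ZXID.

    `last_zxids` maps server id -> last ZXID seen by that server.
    Because the epoch lives in the high 32 bits of the ZXID, comparing
    ZXIDs as plain integers already prefers later epochs, and within an
    epoch prefers the later proposal.  Ties are broken by the larger
    server id (an illustrative convention, matching ZooKeeper's use of
    myid as a tie-breaker).
    """
    return max(last_zxids, key=lambda sid: (last_zxids[sid], sid))


# epoch 1, counter 5  vs  epoch 2, counter 1: the epoch-2 holder wins
votes = {1: (1 << 32) | 5, 2: (2 << 32) | 1, 3: (1 << 32) | 5}
print(elect_leader(votes))  # 2
```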

III. Data Synchronization

After Leader election completes, and before formally starting work (i.e., accepting client transaction requests and issuing new proposals), the Leader server first confirms whether every Proposal in its transaction log has already been committed by more than half of the machines in the cluster — that is, whether data synchronization is complete.

The ZAB data synchronization process

Every normally running server either becomes the Leader, or becomes a Follower and stays in sync with the Leader. The Leader server must ensure that every Follower server receives every transaction Proposal, and correctly applies all committed transaction Proposals to its in-memory database. Concretely, the Leader prepares a queue for each Follower server and sends it, one by one as Proposal messages, those transactions the Follower has not yet synchronized, with each Proposal message immediately followed by a Commit message indicating that the transaction has been committed. Once a Follower has synchronized all of its outstanding transaction Proposals from the Leader and successfully applied them to its local database, the Leader adds that Follower to the list of truly available Followers and proceeds with the subsequent flow. This is the data synchronization logic in the normal case.

How does the ZAB protocol handle transaction Proposals that need to be discarded?

In the ZAB protocol's design of the transaction ID, the ZXID is a 64-bit number. The low 32 bits act as a simple monotonically increasing counter: for each client transaction request, the Leader server increments this counter by 1 when it generates a new transaction Proposal. The high 32 bits hold the Leader epoch number: whenever a new Leader server is elected, the ZXID of the largest transaction Proposal in that Leader's local log is taken, its epoch value is parsed out and incremented by 1, and this becomes the new epoch; the low 32 bits are reset to 0 when generating new ZXIDs.

This strategy of distinguishing Leader cycles by epoch number effectively prevents different Leader servers from mistakenly proposing different transaction Proposals under the same ZXID. It is very helpful for identifying Proposals generated before and after a Leader crash, and it greatly simplifies and speeds up the data recovery flow. Under this strategy, when a server that holds uncommitted transaction Proposals from the previous Leader cycle starts up, it cannot become the Leader. The reason is simple: the current cluster necessarily contains a Quorum whose machines hold transaction Proposals with a higher epoch, so this machine's transaction Proposal is certainly not the highest, and it cannot be elected Leader.

When this machine joins the cluster and connects to the Leader as a Follower, the Leader server compares the last Proposal committed on its own server with the Follower's Proposals. The outcome, of course, is that the Leader asks the Follower to perform a rollback — back to the latest transaction Proposal that has genuinely been committed by more than half of the machines in the cluster. For example, in the figure above, when Server1 connects to the Leader, the Leader asks Server1 to remove P3.
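The bit layout described above is easy to make concrete. A minimal sketch (the helper names are ours, not ZooKeeper's API):

```python
def make_zxid(epoch, counter):
    """Pack a 64-bit ZXID: high 32 bits = Leader epoch, low 32 = counter."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)


def epoch_of(zxid):
    """Extract the Leader epoch from the high 32 bits."""
    return zxid >> 32


def counter_of(zxid):
    """Extract the per-epoch counter from the low 32 bits."""
    return zxid & 0xFFFFFFFF


def next_epoch_zxid(last_zxid):
    """What a newly elected Leader does per the text: parse the epoch out
    of the largest ZXID in its log, add 1, and reset the counter to 0."""
    return make_zxid(epoch_of(last_zxid) + 1, 0)


z = make_zxid(3, 41)
print(epoch_of(z), counter_of(z))      # 3 41
print(epoch_of(next_epoch_zxid(z)))    # 4
print(counter_of(next_epoch_zxid(z)))  # 0
```

Because the epoch occupies the high bits, any ZXID from a later epoch compares greater, as a plain integer, than every ZXID from an earlier epoch — which is exactly what lets a stale Proposal like P3 above be recognized and rolled back.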

IV. ZAB Algorithm Description

The ZAB protocol as a whole comprises two processes, message broadcast and crash recovery, which can be further divided into three phases: Discovery, Synchronization, and Broadcast. Every distributed process that makes up the ZAB protocol executes these three phases in a loop; we call one such loop a main-process cycle.

Phase One: Discovery

Phase one is essentially the Leader election process, used to elect the main process from among multiple distributed processes. The workflows of the prospective Leader L and a Follower F are as follows.

Step F.1.1: Follower F sends the epoch value (CEPOCH) of the last transaction Proposal it accepted to the prospective Leader L.

Step L.1.1: After receiving epoch-value messages from more than half of the Followers, the prospective Leader L generates a new-epoch message and sends it to that majority of Followers.
As for the new epoch value e': the prospective Leader L selects the largest epoch value among all the epoch messages it received and increments it by 1; the result is e'.

Step F.1.2: When a Follower receives the new-epoch message from the prospective Leader L, if it finds that its current epoch value is less than e', it sets its current epoch value to e' and replies to the prospective Leader L with an Ack message. This feedback message (ACK-E(F.p, hf)) contains the Follower's last processed transaction and the Follower's history of transaction Proposals. After Leader L has received Ack confirmations from more than half of the Followers, it selects one Follower F from that majority and uses its history as the initial transaction set Ie'.
As for the selection of this Follower F: for any other Follower F' in the Quorum, F must satisfy one of the following two conditions:
  • CEPOCH(F'.p) < CEPOCH(F.p) — F' has a smaller epoch for its last committed transaction; or
  • CEPOCH(F'.p) = CEPOCH(F.p) and (F'.zxid < F.zxid or F'.zxid = F.zxid) — F' has the same epoch for its last committed transaction, and its ZXID is no greater than the ZXID of the last transaction Proposal in F's history.
This completes phase one of the ZAB protocol's workflow.
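Steps L.1.1 and F.1.2 above can be sketched as follows (the function names and data shapes are illustrative assumptions, not the protocol's wire format):

```python
def new_epoch(cepochs):
    """Step L.1.1: the prospective Leader takes the largest epoch value
    reported by a majority of Followers and adds 1 to obtain e'."""
    return max(cepochs) + 1


def pick_history_source(quorum):
    """Selection rule from step F.1.2: among the acking Followers, choose
    the one whose (last-committed epoch, last ZXID) pair is maximal; its
    history becomes the initial transaction set Ie'.

    `quorum` maps follower id -> (cepoch, last_zxid)."""
    return max(quorum, key=lambda fid: quorum[fid])


print(new_epoch([3, 4, 4]))  # 5: max reported epoch is 4, so e' = 5

# F2 wins: same epoch as F1 but a larger ZXID; F3's epoch is older.
quorum = {"F1": (4, 70), "F2": (4, 72), "F3": (3, 90)}
print(pick_history_source(quorum))  # F2
```

Note how F3 loses despite its larger ZXID counter: the epoch is compared first, which is the same ordering the 64-bit ZXID layout encodes.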

Phase Two: Synchronization

After the discovery flow completes, the protocol enters the synchronization phase. In this phase, the workflows of Leader L and Follower F are as follows.

Step L.2.1: Leader L sends e' (the largest epoch value) and Ie' (the initial transaction set) to all Followers in the Quorum (the mechanism used to decide whether the cluster as a whole is available) in the form of a NEWLEADER(e', Ie') message.

Step F.2.1: When a Follower receives the NEWLEADER(e', Ie') message from Leader L, if the Follower finds that CEPOCH(F.p) ≠ e', it proceeds directly into the next round of the loop, because it has discovered that it is still in the previous round (or an even earlier one) and cannot take part in this round's synchronization. If CEPOCH(F.p) = e', the Follower applies the transactions. Finally, the Follower replies to the Leader, indicating that it has accepted and processed all of the transaction Proposals in Ie'.

Step L.2.2: When Leader L receives feedback on NEWLEADER(e', Ie') from more than half of the Followers, it sends a Commit message to all Followers. With this, the Leader completes phase two.

Step F.2.2: When a Follower receives the Commit message from the Leader, it processes and commits, in order, every transaction in Ie' that it has not yet processed. With this, the Follower completes phase two.
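The Follower side of the synchronization phase (steps F.2.1 and F.2.2) can be sketched like this (a simplification under names of our own; in the real protocol, the Follower's Ack and the Leader's Commit sit between the epoch check and the commit loop):

```python
def follower_sync(f_epoch, e_prime, applied, Ie):
    """Follower-side sketch of steps F.2.1/F.2.2.

    Returns the Follower's applied-transaction list after this round, or
    None when the Follower must drop into the next round because its
    epoch does not match e' (i.e., CEPOCH(F.p) != e')."""
    if f_epoch != e_prime:
        return None                    # F.2.1: stale round, bail out
    for proposal in Ie:                # F.2.2: commit Ie' in order,
        if proposal not in applied:    # skipping already-applied entries
            applied.append(proposal)
    return applied


print(follower_sync(4, 5, [], ["p1"]))            # None: stale epoch
print(follower_sync(5, 5, ["p1"], ["p1", "p2"]))  # ['p1', 'p2']
```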

Phase Three: Broadcast
After the synchronization phase completes, the ZAB protocol can formally begin accepting new client transaction requests and carrying out the message broadcast flow.

Step L.3.1: When Leader L receives a new transaction request from a client, it generates the corresponding transaction Proposal and sends the proposal <e', <v, z>> to all Followers in ZXID order, where epoch(z) = e'.

Step F.3.1: Followers process these transaction Proposals from the Leader in the order in which the messages are received, append them to their history hf, and then reply to the Leader.

Step L.3.2: When Leader L receives Ack messages for transaction Proposal <e', <v, z>> from more than half of the Followers, it sends a Commit <e', <v, z>> message to all Followers, asking them to commit the transaction.
Step F.3.2: When Follower F receives the Commit <e', <v, z>> message from the Leader, it commits transaction Proposal <e', <v, z>>. Note that at this point, Follower F must already have committed every earlier transaction Proposal <v', z'>.
These are the three core workflows of the ZAB protocol.

V. The Relationship and Differences Between ZAB and Paxos

The ZAB protocol is not a typical implementation of the Paxos algorithm. Before discussing the differences between ZAB and Paxos, let us first look at what the two have in common.

  • Both have a role similar to a Leader process, which is responsible for coordinating the operation of multiple Follower processes.
  • In both, the Leader process waits for more than half of the Followers to give correct feedback before committing a proposal.
  • In the ZAB protocol, each Proposal carries an epoch value representing the current Leader cycle; the Paxos algorithm has the same kind of identifier, just under the name Ballot.

In the Paxos algorithm, a newly elected main process works in two phases. The first phase, called the read phase, is where the new main process communicates with all other processes to collect the proposals put forward by the previous main process, and commits them. The second phase, called the write phase, is where the current main process begins to put forward its own proposals. On top of the Paxos design, the ZAB protocol adds an extra synchronization phase. Before the synchronization phase, ZAB has a process very similar to Paxos's read phase, called the Discovery phase. In the synchronization phase, the new Leader ensures that more than half of the Followers have committed all transaction Proposals from the previous Leader cycle. Introducing this synchronization phase effectively guarantees that, before the Leader proposes transaction Proposals in the new cycle, all processes have finished committing all transactions from the previous one. Once the synchronization phase completes, ZAB executes a write phase similar to that of Paxos. Overall, the essential difference between the ZAB protocol and the Paxos algorithm is that their design goals are not the same.

Summary of the differences:
The ZAB protocol is mainly used to build a highly available distributed primary-backup system, such as ZooKeeper.
The Paxos algorithm is mainly used to build a distributed, consistent state machine system.

VI. Runtime States

In the design of the ZAB protocol, each process may be in one of three states.

  • LOOKING: taking part in Leader election
  • FOLLOWING: a Follower server, in sync with the Leader
  • LEADING: the Leader server, leading as the main process
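The three states can be captured with a small enum (illustrative; ZooKeeper's own Java code models these as a server-state enum, which additionally has an OBSERVING state for non-voting observers):

```python
from enum import Enum


class ServerState(Enum):
    """The three ZAB process states listed above."""
    LOOKING = "looking"      # taking part in Leader election
    FOLLOWING = "following"  # a Follower, in sync with the Leader
    LEADING = "leading"      # the Leader, acting as the main process


# A typical transition after an election round:
state = ServerState.LOOKING
won_election = True  # outcome of the election, assumed here
state = ServerState.LEADING if won_election else ServerState.FOLLOWING
print(state.name)  # LEADING
```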

Origin blog.csdn.net/weiwei_six/article/details/104083861