How much do you know about the election mechanism of ZooKeeper?

Starting today, we take an in-depth look at ZK elections.

1. Basic rules of election

ZKr~ This time I'm going to break with habit and skip the story at first~ Before anything else, I have to cover a few very important concepts in ZK elections.

1.1 zxid

zxid is the transaction ID we mentioned before. It is an 8-byte integer, but ZK's design splits this single number into two parts, getting two uses out of one value!

An 8-byte integer is 64 bits long. The first (high) 32 bits record the epoch, and the last (low) 32 bits are a counter. You may ask: epoch? What is that?

zxid is initialized to 0, that is:

00000000000000000000000000000000 00000000000000000000000000000000

Each write request increases the last 32 bits by 1. After 10 write requests (regardless of whether a request actually changes any data), zxid looks like this:

00000000000000000000000000000000 00000000000000000000000000001010

When an election is held, the first 32 bits increase by 1 and the last 32 bits are cleared:

00000000000000000000000000000001 00000000000000000000000000000000

Besides elections, when the last 32 bits are completely used up (they become all 1s, meaning ZK has executed 2^32 - 1 write requests without a single election, which is awesome!), the next write also increases the first 32 bits by 1, which is equivalent to a carry:

# Before carry
00000000000000000000000000000000 11111111111111111111111111111111
# After carry
00000000000000000000000000000001 00000000000000000000000000000000

Now I can answer the earlier question. The epoch is the first 32 bits of zxid. The word epoch itself means "era", which suggests a change of generation, while the last 32 bits of zxid are simply a count of write requests, nothing more.
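The layout above can be sketched in a few lines of Python. This is only an illustration of the bit arithmetic as this article describes it, not ZooKeeper's own code; the helper names (make_zxid, epoch_of, counter_of, after_write, after_election, as_bits) are made up for this example.

```python
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1  # low 32 bits all set


def make_zxid(epoch, counter):
    """Pack an epoch (high 32 bits) and a counter (low 32 bits)."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def epoch_of(zxid):
    return zxid >> COUNTER_BITS


def counter_of(zxid):
    return zxid & COUNTER_MASK


def after_write(zxid):
    # A write is a plain +1 on the whole value, so a full counter
    # carries into the epoch, as described in the text.
    return zxid + 1


def after_election(zxid):
    # An election bumps the epoch and clears the counter.
    return make_zxid(epoch_of(zxid) + 1, 0)


def as_bits(zxid):
    """Render the 64 bits as two groups of 32, like the article does."""
    bits = format(zxid, "064b")
    return bits[:32] + " " + bits[32:]


zxid = 0
for _ in range(10):
    zxid = after_write(zxid)
print(as_bits(zxid))                  # counter part ends in 1010
print(as_bits(after_election(zxid)))  # epoch 1, counter 0
print(as_bits(after_write(make_zxid(0, COUNTER_MASK))))  # carry into epoch
```

Because the epoch sits in the high bits, comparing two zxids as plain integers already compares the epoch first and the counter second.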

1.2 myid

In the earlier stories I gave each node in the ZK cluster a memorable name (very memorable indeed!). But how does ZK officially identify each node in the cluster? With myid!

zoo.cfg, the ZK startup configuration, has an item dataDir that specifies the path for data storage (/tmp/zookeeper in the sample configuration). Create a text file named myid under this path. Its content is a single number, and this number is the myid of the current node.

/tmp
└── zookeeper
    ├── myid
    └── ...

Then configure the cluster information in zoo.cfg like this:

server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888

The number after server. is the myid, and it must not repeat between nodes in the cluster. I forget where I read it, but I was once told that myid can only be a number from 1 to 255, and I always believed it. This time I tested it properly, letting the facts decide. My experiment covered the three major versions 3.4, 3.5, and 3.6 (each a simple cluster of three machines), and the conclusion is: any value works as long as myid is not -1 (-1 is a reserved value and makes the node report an error at startup) and is not greater than Long.MAX_VALUE or less than Long.MIN_VALUE. However, if a node is configured with zookeeper.extendedTypesEnabled=true, the maximum myid of that node is 254 (negative numbers are not affected; I don't know what the 254 means, but the check is there in the code). Has your stock of weird knowledge grown again?~
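As a small illustration of the rule that a myid must not repeat, here is a sketch that reads server.N lines like the ones above and rejects duplicates. It is not ZooKeeper's own configuration parser; parse_servers is a hypothetical helper for this example, and it only checks uniqueness, not the value limits discussed above.

```python
def parse_servers(cfg_text):
    """Map each myid to its address string from server.<myid>=... lines."""
    servers = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line.startswith("server.") or "=" not in line:
            continue  # skip other settings and blank lines
        key, value = line.split("=", 1)
        myid = int(key.strip()[len("server."):])
        if myid in servers:
            raise ValueError(f"myid {myid} is configured more than once")
        servers[myid] = value.strip()
    return servers


cfg = """
dataDir=/tmp/zookeeper
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
"""
print(parse_servers(cfg))
```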

I will organize more details about the configuration in a separate post later and leave it here for now.

1.3 Election rules

What is the use of knowing all this? It is very important, because electing a Leader is all about these values:

  • epoch
  • Number of write requests
  • myid

The values are compared level by level from top to bottom, and whoever has the larger value is more qualified to become Leader. If two nodes tie at one level, the comparison moves on to the next level until a winner is determined. Because myid cannot repeat, a winner is always determined in the end!
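In code, this level-by-level comparison is just a lexicographic comparison of three numbers. A minimal sketch, assuming each candidate is represented as an (epoch, counter, myid) tuple; the function names and the sample values are made up for illustration.

```python
def more_qualified(a, b):
    """Return True if candidate a beats candidate b.

    Each candidate is (epoch, counter, myid). Python compares tuples
    element by element, which is exactly the top-to-bottom rule.
    """
    return a > b


def pick_leader(candidates):
    """The most qualified candidate is simply the largest tuple."""
    return max(candidates)


candidates = [
    (1, 5, 3),   # hypothetical node 3
    (1, 9, 1),   # hypothetical node 1: same epoch, larger counter
    (0, 99, 2),  # hypothetical node 2: older epoch loses despite its counter
]
print(pick_leader(candidates))  # (1, 9, 1)
```

Note that myid only matters when both epoch and counter are equal, so a node with a small myid can still win on the strength of its data.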

Okay, now everyone knows the most basic election rules~ Let's move on to the next section

2. The battle of the three Mas

Ma Guoguo surely never imagined that in this lifetime he would be mentioned in the same breath as two famous star entrepreneurs. Let's go and see what happened~

2.1 Ready to start

Previously, Ma Guoguo stipulated that the three offices must elect a leader before opening to the outside world. Before the official start of the election, each office also has some preparatory work to be done:

  • Every office must know how many offices there are in total
  • Hire some additional operators who are responsible for communicating with other offices
  • Prepare a ballot box for counting and returning votes
  • Set a fixed myid for each office

So now the layout of the office has become like this (I omitted other elements from the previous chapter):

With these preparations, all offices can enter the election stage, and the village committee has stipulated several states to indicate the current stage of the office:

  • LOOKING, looking for a leader, the office at this stage cannot provide external services
  • LEADING, the current office is the Leader, which can provide services to the outside world
  • FOLLOWING, the current office is following the leader and can provide services externally

Obviously, the offices that have just been prepared are now in LOOKING state, let us formally enter the election process.

2.2 Start the election

Since the offices have only just been set up, they have not yet exchanged any messages. Everyone is surnamed Ma and everyone secretly wants to be the boss, so each office first draws up a ballot with its own name on it and sends it to the other offices. A ballot mainly contains this information:

  • sid: who am i
  • leader: who do I choose
  • state: my current state
  • epoch: my current epoch
  • zxid: The largest transaction number of the leader I choose

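Take Ma Guoguo as an example:

Ma Xiaoyun and Ma Xiaoteng do the same. Each of them initially chooses itself as the Leader candidate and sends a ballot for the candidate it believes in (in this scenario, itself) to the other two offices (and to itself).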

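2.2.1 Ma Guoguo's perspective

Each office also receives ballots from the other offices (and possibly its own). Every time it gets a ballot, it compares that ballot with the Leader candidate it currently believes in. In theory, the ballot an office casts for itself reaches its own ballot box first, because it does not go over the network and so has a shorter path. Its own ballot matches its own candidate, so no comparison is needed and it simply records an entry in the ballot box. We again take Ma Guoguo as the example:

The left side of => is the name of the office, and the right side is the Leader that office chose. The current vote tally is the count of ballots received by the Leader that the current node has chosen.

Suppose he then receives Ma Xiaoyun's ballot: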

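  • The first thing Ma Guoguo sees is that Ma Xiaoyun is also in the LOOKING state
  • He then compares his own candidate with Ma Xiaoyun's ballot (the left side is the current office's candidate and the right side is the received ballot, same below)

e:0 == e:0
z:0 == z:0
l: Ma Guoguo(69) > l: Ma Xiaoyun(56)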

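In the end Ma Guoguo wins, because his myid 69 is larger than Ma Xiaoyun's myid 56! Although Ma Guoguo wins, the current vote tally cannot be changed yet, because Ma Xiaoyun's ballot in this round still chooses Ma Xiaoyun. The tally can only be updated after Ma Xiaoyun changes his vote and sends it again.

The ballot box then records:

Next comes Ma Xiaoteng's ballot:

e:0 == e:0
z:0 == z:0
l: Ma Guoguo(69) > l: Ma Xiaoteng(49)

Ma Guoguo wins again!

Record it in the ballot box:

Every time a ballot arrives, Ma Guoguo counts the votes based on the current tally. Unfortunately the election still cannot end, because it only ends once some office has received more than half of the votes. So far the only vote for Ma Guoguo is his own, which is not more than half, so Ma Guoguo has to keep waiting.

While Ma Guoguo is busy with all this, Ma Xiaoyun and Ma Xiaoteng are doing exactly the same thing.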

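2.2.2 Ma Xiaoyun's perspective

We skip how Ma Xiaoyun records his own ballot. Suppose the first ballot he receives is Ma Guoguo's. How does he handle it?

e:0 == e:0
z:0 == z:0
l: Ma Xiaoyun(56) < l: Ma Guoguo(69)

Ma Xiaoyun sees that the Leader candidate he believed in has been beaten by Ma Guoguo's ballot, so he changes his candidate to Ma Guoguo and broadcasts the new ballot again.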
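What Ma Xiaoyun does here can be sketched as follows. This is a simplified model under my own assumptions, not ZooKeeper's actual implementation: a proposal is the (epoch, zxid, leader) triple used in the comparisons above, and Proposal and on_ballot are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    epoch: int
    zxid: int
    leader: int  # myid of the proposed Leader

    def key(self):
        # Compared in this order: epoch, then zxid, then myid.
        return (self.epoch, self.zxid, self.leader)


def on_ballot(mine, received):
    """Handle a ballot from another LOOKING office.

    Returns (proposal_to_keep, must_rebroadcast).
    """
    if received.key() > mine.key():
        # Our candidate was beaten: adopt the winner and tell everyone.
        return received, True
    # Our candidate holds: nothing changes and nothing is resent.
    return mine, False


xiaoyun = Proposal(epoch=0, zxid=0, leader=56)
guoguo = Proposal(epoch=0, zxid=0, leader=69)
xiaoteng = Proposal(epoch=0, zxid=0, leader=49)

xiaoyun, resend = on_ballot(xiaoyun, guoguo)
print(xiaoyun.leader, resend)  # 69 True
xiaoyun, resend = on_ballot(xiaoyun, xiaoteng)
print(xiaoyun.leader, resend)  # 69 False
```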

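Then he records in his own ballot box:

For completeness, let's look at Ma Xiaoteng's ballot as well:

e:0 == e:0
z:0 == z:0
l: Ma Guoguo(69) > l: Ma Xiaoteng(49)

Ma Guoguo still wins, so Ma Xiaoyun's ballot box finally looks like this:

Strictly speaking, we should now retell the whole process from Ma Xiaoteng's perspective, but it is almost identical to Ma Xiaoyun's. To keep the story flowing, we return to Ma Guoguo's perspective, because Ma Xiaoyun changed his vote after losing to Ma Guoguo and sent out another round of ballots.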

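2.2.3 Ma Guoguo's perspective (again)

Ma Guoguo receives Ma Xiaoyun's ballot once more (the changed one), and the ballot box becomes:

Once this ballot arrives, Ma Xiaoyun's record is added to the current vote tally. When Ma Guoguo counts the votes, he finds that his candidate now has more than half of them. He then performs a second confirmation: he waits a little while to see whether any newer ballot arrives. Assuming none does, he checks whether the candidate with the majority is himself. If so, he is the Leader; if not, he is a Follower.

Obviously Ma Guoguo is the Leader, so he changes his state to LEADING.

Meanwhile Ma Xiaoyun and Ma Xiaoteng also count their votes, conclude that they are Followers, and change their state to FOLLOWING. Each of them then synchronizes data with the Leader, and once synchronization is complete all the offices can serve the public.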
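The counting step can be sketched like this. The ballot box is modeled as a dictionary from each office's myid to the Leader it currently votes for; has_quorum and decide_state are hypothetical names, and the short wait for newer ballots is left out.

```python
def has_quorum(ballot_box, candidate, cluster_size):
    """True if strictly more than half of the cluster votes for candidate."""
    votes = sum(1 for leader in ballot_box.values() if leader == candidate)
    return votes > cluster_size // 2


def decide_state(my_id, candidate, ballot_box, cluster_size):
    if not has_quorum(ballot_box, candidate, cluster_size):
        return "LOOKING"  # keep waiting for more ballots
    return "LEADING" if candidate == my_id else "FOLLOWING"


# Ma Guoguo's ballot box after the first round: everyone voted for itself.
box = {69: 69, 56: 56, 49: 49}
print(decide_state(69, 69, box, 3))  # LOOKING

# Ma Xiaoyun's changed ballot arrives and replaces his earlier one.
box[56] = 69
print(decide_state(69, 69, box, 3))  # LEADING
# An office whose winning candidate is someone else becomes a Follower.
print(decide_state(56, 69, box, 3))  # FOLLOWING
```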

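2.3 Ma Xiaoteng loses power

An election involves communication within the cluster as well as each node's own state management and state transitions, so it is a fairly complex process. What we just saw is only the simplest case, the startup election. The following examples should help you understand the overall election logic.

Now suppose that after the offices have been serving the public without incident for a while, Ma Xiaoteng's office suddenly loses power and can no longer communicate with the other two Mas. When the other two have not heard from Ma Xiaoteng for some time, they know something has gone wrong! But each of them takes stock and sees that two offices can still provide service, which is more than half of the cluster, so villagers can keep coming in to do business. The cluster now looks like this:

Before long, thanks to the power company's quick repair work, Ma Xiaoteng's office has power again and reopens. But every office is in the LOOKING state before it opens, so it still votes for itself first, and it replays its local archive to recover its own latest data. Suppose Ma Xiaoteng's state before the outage was: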

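e:0 z:21  l: Ma Xiaoteng(49) LOOKING

As before, he sends his own ballot to the other two offices.

Unlike before, though, both Ma Guoguo and Ma Xiaoyun are now in a working state. When they receive Ma Xiaoteng's ballot, they send him the ballot information of the current Leader, Ma Guoguo, together with their own current state.

The ballot information sent by Ma Guoguo:

e:0 z:30  l: Ma Guoguo(69) LEADING

The ballot information sent by Ma Xiaoyun:

e:0 z:30  l: Ma Guoguo(69) FOLLOWING

After receiving both ballots, Ma Xiaoteng knows that the current Leader is Ma Guoguo, and Ma Guoguo himself has confirmed that he is in the LEADING state. Ma Xiaoteng immediately changes his own state to FOLLOWING and, as before, synchronizes data with the Leader. I plan to explain exactly how synchronization works in a later post~

After synchronization, Ma Xiaoteng's state is the same as Ma Xiaoyun's.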


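Now imagine a parallel world. We go back to the moment when Ma Xiaoteng has just regained power and is about to reopen, and suppose his state is this instead:

e:1 z:7  l: Ma Xiaoteng(49) LOOKING

Even though his epoch is larger than the current Leader's, which in principle makes him more qualified to be Leader, the other offices in the cluster already have a definite Leader. Ma Xiaoteng can only swallow his pride (who told you to lose power?) and join the cluster as a Follower, still synchronizing from the current Leader's data. You can also think of this as a downgrade (his epoch is downgraded back to 0).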
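The rejoining logic in both worlds can be sketched as below. It is a simplification under stated assumptions rather than ZooKeeper's real code: a reply is a hypothetical (sender, leader, state) tuple, and I assume the newcomer follows a Leader only when more than half of the cluster points to it and the Leader itself reports LEADING. Note that the newcomer's own epoch and zxid do not appear anywhere in the decision.

```python
def leader_to_follow(replies, cluster_size):
    """Return the myid of an established Leader, or None.

    replies: list of (sender, leader, state) tuples from offices that
    answered the newcomer's ballot.
    """
    supporters = {}
    confirmed = set()
    for sender, leader, state in replies:
        if state not in ("LEADING", "FOLLOWING"):
            continue  # LOOKING offices belong to a normal election round
        supporters.setdefault(leader, set()).add(sender)
        if state == "LEADING" and sender == leader:
            confirmed.add(leader)  # the Leader vouches for itself
    for leader, senders in supporters.items():
        if leader in confirmed and len(senders) > cluster_size // 2:
            return leader
    return None


replies = [
    (69, 69, "LEADING"),    # Ma Guoguo
    (56, 69, "FOLLOWING"),  # Ma Xiaoyun
]
print(leader_to_follow(replies, 3))      # 69
print(leader_to_follow(replies[1:], 3))  # None: the Leader has not confirmed
```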

The workplace is just that cruel. Take a slightly long leave and by the time you come back everything may have changed~

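2.4 Ma Guoguo falls ill again

Ma Guoguo is getting on in years, and he has fallen ill yet again, so his office has to close in tears. Unlike Ma Xiaoteng's power outage, this time it is Ma Guoguo, the Leader, who stops serving, and by the rule set earlier the cluster must always have a Leader. Ma Xiaoyun and Ma Xiaoteng find that they cannot reach the Leader, which means the Leader can no longer serve, so they know a new Leader must be elected. Both change their state to LOOKING, choose themselves as the candidate again, and broadcast their ballots to the other offices that can still provide service (in this scenario, they simply send ballots to each other).

Whoever receives the other's ballot will, after comparing, see that Ma Xiaoteng wins: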

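e:1 == e:1
z:77 < z:80
l: Ma Xiaoyun(56)   l: Ma Xiaoteng(49)

Ma Xiaoyun changes his candidate to Ma Xiaoteng and sends his ballot out again. Ma Xiaoteng now has 2 votes, which is more than half of the whole cluster, so Ma Xiaoteng and Ma Xiaoyun change their states to LEADING and FOLLOWING respectively. As described earlier, the epoch is then increased by 1 and the counter part is cleared, and finally service to the villagers resumes.

Once Ma Guoguo recovers, his office reopens and, as in the previous example, starts in the LOOKING state. He eventually learns from the other two Mas that the current Leader is Ma Xiaoteng, synchronizes data with Ma Xiaoteng, and joins the cluster as a Follower to serve the public.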

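2.5 Attracting investment

The village committee noticed how busy the offices were and thought: if three offices can achieve this, what about more? After discussing it with the three Mas, the committee decided to bring in private capital and let investors set up new offices following the existing model. The village committee would not spend a cent and the villagers would get real benefits. Brilliant!

The move drew a lot of attention from private capital. After some discussion, however, the three Mas felt that bringing in too many outsiders would inevitably weaken their own power, so they made another rule: the three Mas declared themselves Participants, and only they are eligible to run for Leader. Offices created with outside capital can only join the cluster as Observers and provide read-only service, with no right to compete for Leader. This raises the cluster's throughput for read requests without making elections any more complex.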

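To declare that the current node is an Observer, first set peerType=observer in zoo.cfg.

In addition, an extra :observer is appended to that node's entry in the cluster configuration to mark it, so the other nodes also know that the nodes with myid 1 and 2 are Observers:

server.69=maguoguo:2888:3888
server.56=maxiaoyun:2888:3888
server.49=maxiaoteng:2888:3888
server.1=dongdong:2888:3888:observer
server.2=jitaimei:2888:3888:observer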

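An Observer in the LOOKING state also chooses itself as its initial Leader candidate, but its ballot information is set like this, taking Dongdong as an example:

e:Long.MIN_VALUE z:Long.MIN_VALUE  l: Dongdong(1) LOOKING

Because the epoch is set to the minimum value, this ballot is effectively meaningless and can be ignored outright. The three Mas also maintain a list of Participants, and they simply ignore any ballot that comes from an office outside that list, so an Observer's ballot has no effect at all on the election result. In the end an Observer waits to be notified of the result of the election among the Participants, changes its own state to OBSERVING, and starts synchronizing data with the Leader, just as a Follower does. From here on, Observers and Followers are collectively called Learners.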
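The filtering described above can be sketched as follows. The parsing is a rough illustration based on the server lines shown earlier, not ZooKeeper's parser, and split_roles and countable_ballots are hypothetical names.

```python
def split_roles(cfg_text):
    """Return (participants, observers) as sets of myid."""
    participants, observers = set(), set()
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line.startswith("server.") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        myid = int(key.strip()[len("server."):])
        if value.strip().endswith(":observer"):
            observers.add(myid)
        else:
            participants.add(myid)
    return participants, observers


def countable_ballots(ballots, participants):
    """Drop every (sender, leader) ballot whose sender is not a Participant."""
    return [(sender, leader) for sender, leader in ballots
            if sender in participants]


cfg = """
server.69=maguoguo:2888:3888
server.56=maxiaoyun:2888:3888
server.49=maxiaoteng:2888:3888
server.1=dongdong:2888:3888:observer
server.2=jitaimei:2888:3888:observer
"""
participants, observers = split_roles(cfg)
ballots = [(69, 69), (1, 1), (56, 69), (2, 2)]
print(countable_ballots(ballots, participants))  # [(69, 69), (56, 69)]
```

With the Observers filtered out, the majority is computed over the three Participants only, so adding Observers never changes how many votes a Leader needs.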

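2.6 Summary

  • A Leader election looks at three fields: epoch, write request count, and myid. They are compared in that order, and whoever is larger is more qualified to become Leader
  • The office that receives more than half of the votes officially becomes Leader and changes its state to LEADING
  • The other Participants change to FOLLOWING, and Observers change to OBSERVING
  • If the cluster already has a Leader, an office that joins midway simply follows that Leader
  • One more thing: if the nodes currently able to serve are no longer more than half of the cluster, the election can never produce a result. Every node stays in the LOOKING state, and the cluster cannot serve the public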
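The last point is simple arithmetic. A quick sketch, counting only Participants since Observers do not vote; the function names are made up for this example.

```python
def quorum_size(participants):
    """Smallest number of votes that is more than half."""
    return participants // 2 + 1


def can_elect(alive, participants):
    """True if enough Participants are reachable to elect a Leader."""
    return alive >= quorum_size(participants)


for n in (3, 5, 7):
    q = quorum_size(n)
    print(f"{n} participants: quorum {q}, tolerates {n - q} failures")

print(can_elect(2, 3))  # True: two of three offices can still elect
print(can_elect(1, 3))  # False: a lone office stays LOOKING
```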



Origin blog.51cto.com/15114835/2655335