Evolution of Distributed Consensus Algorithms




This article is reproduced from the public account



The Problem of Distributed Systems


Big Fatty Zhang encountered a problem.


His company has a server that stores valuable data. To guard against the server going down, his manager Bill asked Fatty Zhang to find a way to back up the data.


Fatty Zhang put his powers of abstraction to work, and a picture formed in his mind in which this single machine is treated as a node.




In order to improve availability, several machines can be added and connected through a local area network to form a distributed system:



If the data is stored on every node, can he sit back and relax?


But Fatty Zhang quickly discovered that this is not an easy task. For example, each node keeps a copy of an account with a balance of 100 yuan. Now someone adds 20 yuan to the account through node A, while someone else subtracts 30 yuan from it through node B.




What is the balance now?


To stay consistent, node A has to send a message like "balance plus 20" to B and C, and node B has to send a message like "balance minus 30" to A and C. If the network has a problem and a message never reaches the other nodes, or if a node simply breaks down, the data is very likely to become inconsistent.


If users keep operating on such an inconsistent system, it quickly descends into chaos.
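The account example above can be played out in a few lines. This is only a toy sketch: the `balances` dictionary, the `broadcast` helper, and the choice of which message gets lost are assumptions made for illustration, not part of the original design.

```python
# Hypothetical model: three replicas, each applying whatever updates reach it.
balances = {"A": 100, "B": 100, "C": 100}


def apply_update(node, delta):
    """Apply a balance change on one replica."""
    balances[node] += delta


def broadcast(sender, delta, reachable):
    """Apply delta on the sender, then on every peer the message reaches."""
    apply_update(sender, delta)
    for peer in reachable:
        apply_update(peer, delta)


# "balance plus 20" enters through node A and reaches both B and C.
broadcast("A", +20, reachable=["B", "C"])
# "balance minus 30" enters through node B, but the message to C is lost.
broadcast("B", -30, reachable=["A"])

print(balances)  # {'A': 90, 'B': 90, 'C': 120}
```

Nodes A and B agree on 90 yuan, but node C still believes the balance is 120 yuan, and every later operation served by C builds on the wrong number.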


2 Who Will Be the Boss?


After thinking about it for a long time, Fatty Zhang decided that the nodes could not keep operating in such a disorderly way. He had to find a "boss" for these three nodes.


All operations go through the "boss", and the boss then forwards each message to every "little brother".




But who will be the boss? And what if the boss goes down?


This could be handled manually. For example, if node A goes down, an operator makes node B the "boss" and node C the "little brother".


But that is cumbersome. Can it be done automatically?


The question was an interesting one. Fatty Zhang was fascinated and kept thinking: set up an election mechanism and let the nodes compete for the job.


Initially, every node is a candidate and can send a vote request to the other nodes, asking them to vote for it. A node that gets more than half of the votes becomes the "boss".


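To keep everyone from sending vote requests at the same time, each node can be assigned a random "election timeout". Put simply, this is a waiting period: a node must wait patiently until the period is over, and only then may it compete for the job and try to become the boss.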


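Each node has a timer that starts counting from 0. Whichever node's waiting period runs out first starts its campaign first: it calls the other nodes and asks them to vote for it as the boss.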


For example, node A waits 170 ms, node B waits 200 ms, and node C waits 250 ms.


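Because node A has the shortest waiting period, it gets there first. It increments its term (Term), an integer whose initial value is 0, then casts a vote for itself, and then calls node B and node C and asks both of them to vote for it.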




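Nodes B and C receive the vote request. If a node has not yet started its own campaign (its waiting period has not run out), it has no choice but to agree that node A should be the boss. At the same time it resets its timer to 0, which starts a new round of waiting.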




When node A learns that the other two nodes have agreed, its vote count becomes 3. That is more than half, so it knows it can be the boss.
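One round of this election can be sketched as follows. The `Node` class, its field names, and `run_election` are hypothetical, and the rule that a node grants at most one vote per term is an assumption borrowed from how such protocols usually work; the article itself only says that a node agrees if it has not started its own campaign.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.term = 0          # the term is an integer that starts at 0
        self.voted_for = None  # whom this node voted for in the current term
        self.timer_ms = 0      # time waited since the last reset

    def handle_vote_request(self, candidate, term):
        """Grant the vote if this node has not voted in this term yet."""
        if term > self.term:
            self.term = term
            self.voted_for = None
        if term == self.term and self.voted_for is None:
            self.voted_for = candidate
            self.timer_ms = 0  # start a fresh round of waiting
            return True
        return False


def run_election(candidate, peers):
    """Bump the term, vote for yourself, then ask every peer for its vote."""
    candidate.term += 1
    candidate.voted_for = candidate.name
    votes = 1
    for peer in peers:
        if peer.handle_vote_request(candidate.name, candidate.term):
            votes += 1
    cluster_size = len(peers) + 1
    return votes, votes > cluster_size // 2


a, b, c = Node("A"), Node("B"), Node("C")
votes, won = run_election(a, [b, c])
print(votes, won)  # 3 True
```

A second candidate asking for votes in the same term would be refused by B and C, because each has already recorded its vote for A.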


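Once node A becomes the boss, it starts sending messages to nodes B and C at regular intervals, and B and C reply to each one. This keeps the heartbeat going.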


Every time B and C receive a heartbeat message, they reset their timers and start counting from 0 again.


At this point nodes B and C have become "little brothers".




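If node A unfortunately goes down, nodes B and C receive no heartbeat message within their waiting periods, and the two of them compete for the job again.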
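The interplay between heartbeats and the timer can be sketched with a simulated clock. `FollowerTimer`, the 200 ms waiting period, and the 50 ms heartbeat interval are illustrative assumptions.

```python
class FollowerTimer:
    """Counts up from 0; a heartbeat from the boss resets it."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.elapsed_ms = 0

    def tick(self, ms):
        """Advance the clock and report whether the waiting period is over."""
        self.elapsed_ms += ms
        return self.elapsed_ms >= self.timeout_ms

    def on_heartbeat(self):
        self.elapsed_ms = 0


timer = FollowerTimer(timeout_ms=200)

# While the boss is alive, a heartbeat arrives every 50 ms (assumed interval),
# so the timer never reaches 200 ms.
expired_while_alive = False
for _ in range(10):
    expired_while_alive = expired_while_alive or timer.tick(50)
    timer.on_heartbeat()

# The boss goes down: no more heartbeats, so the timer finally runs out.
waited_ms = 0
while not timer.tick(50):
    waited_ms += 50
waited_ms += 50  # count the tick that crossed the threshold

print(expired_while_alive, waited_ms)  # False 200
```

Once the timer expires, the node is free to start its own campaign.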




In this round, node C seizes the initiative and is the first to start a campaign.




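Node B is a step too slow and reluctantly agrees to support node C. Node C wins the support of more than half of the nodes and becomes the "boss", and node B becomes a "little brother".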


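(Some readers may wonder: what if nodes B and C start their campaigns at the same time, so that each has a vote count of 1 and neither gets more than half? The answer is simple: hold another round of voting. Of course, to keep B and C from campaigning simultaneously forever and falling into an infinite loop, each of them resets to a new random waiting period.)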
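A rough way to picture the retry is below. It treats "two nodes time out at exactly the same moment" as a split vote, which is a simplification, and `elect_with_retries` together with the 150 to 300 ms range are assumptions made for this sketch.

```python
import random


def elect_with_retries(nodes, rng, low_ms=150, high_ms=300):
    """Redraw random waiting periods until exactly one node times out first."""
    rounds = 0
    while True:
        rounds += 1
        timeouts = {n: rng.randint(low_ms, high_ms) for n in nodes}
        shortest = min(timeouts.values())
        first = [n for n, t in timeouts.items() if t == shortest]
        if len(first) == 1:
            return first[0], rounds
        # Tie: both campaign at once, each gets only its own vote,
        # so everyone picks a new random waiting period and tries again.


winner, rounds = elect_with_retries(["B", "C"], random.Random(42))
print(winner, rounds)
```

Because the waiting periods are drawn independently, repeated ties become increasingly unlikely, so the loop ends after a few rounds in practice.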


Getting more than half of the votes is important, Fatty Zhang thought. Only then can the "boss" node be guaranteed to be unique.
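The reason a majority guarantees uniqueness is simple arithmetic: two candidates would together need more votes than there are nodes. A small sketch, with `quorum` and `has_majority` as hypothetical helper names and assuming each node votes at most once per round:

```python
def quorum(cluster_size):
    """Smallest number of votes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def has_majority(votes, cluster_size):
    return votes >= quorum(cluster_size)


for n in (3, 4, 5):
    # Two winners would need 2 * quorum(n) votes in total, which is always
    # more than n, so they cannot both win if each node votes only once.
    assert 2 * quorum(n) > n

print(quorum(3), quorum(4), quorum(5))  # 2 3 3
print(has_majority(1, 3), has_majority(2, 3))  # False True
```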


For each node, the processing flow is actually very simple:


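The original flow chart is not available here, so the following is only a guess at the per-node flow, reconstructed from the surrounding text. The `Role` names and the event strings are assumptions.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "little brother"
    CANDIDATE = "candidate"
    LEADER = "boss"


# (current role, event) -> next role; anything not listed leaves the role as is.
TRANSITIONS = {
    (Role.FOLLOWER, "timeout"): Role.CANDIDATE,         # no heartbeat arrived in time
    (Role.CANDIDATE, "won_majority"): Role.LEADER,      # more than half voted yes
    (Role.CANDIDATE, "timeout"): Role.CANDIDATE,        # split vote, campaign again
    (Role.CANDIDATE, "granted_vote"): Role.FOLLOWER,    # someone else got there first
    (Role.CANDIDATE, "heartbeat"): Role.FOLLOWER,       # a boss already exists
}


def next_role(role, event):
    return TRANSITIONS.get((role, event), role)


role = Role.FOLLOWER
for event in ["heartbeat", "timeout", "won_majority"]:
    role = next_role(role, event)
print(role.name)  # LEADER
```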


3 Data Replication


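After a great deal of effort, Fatty Zhang had finally worked out how a distributed system can elect its "boss" node automatically.


The next step was to find a way to copy the data sent to the "boss" onto the "little brother" nodes. How should that be handled?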


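Because the system is distributed, a write counts as successful only when a majority of the nodes have saved the data.


So the "boss" node has to take on the job of coordinating.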


Fatty Zhang came up with a log replication scheme: every node has a log queue.




Before the data is actually committed, it is first appended to the log queue and then replicated to each "little brother".




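1. The client sends data to node A (the "boss").

Node A first records the data in its log. At this point the data is in the "uncommitted" state.


2. In the next heartbeat message, the data is sent to each "little brother".


3. Each "little brother" also records the data in its log (again in the uncommitted state) and then reports to the "boss" that it has logged the entry.


4. If more than half of the nodes have responded, node A commits the data and tells the client that the data has been saved.


5. In the next heartbeat message, node A tells each "little brother" that the data has been committed, and each "little brother" commits its own copy.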


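If a "little brother" unfortunately goes down, the "boss" keeps trying to contact it. Once it is working again, it has to copy the data from the "boss" to get back in step with it.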
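The five steps can be sketched in memory as below. The `Replica` class and `replicate` function are hypothetical, heartbeats are collapsed into direct method calls, and counting the boss's own log entry toward the majority is an assumption. Catching up a recovered node is left out.

```python
class Replica:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.log = []  # entries look like {"data": ..., "committed": bool}

    def append(self, data):
        """Record data as uncommitted; a node that is down does not respond."""
        if not self.alive:
            return False
        self.log.append({"data": data, "committed": False})
        return True

    def commit(self, index):
        if self.alive and index < len(self.log):
            self.log[index]["committed"] = True


def replicate(boss, little_brothers, data):
    boss.append(data)                                       # step 1
    index = len(boss.log) - 1
    acks = 1 + sum(lb.append(data) for lb in little_brothers)  # steps 2 and 3
    cluster_size = len(little_brothers) + 1
    if acks <= cluster_size // 2:
        return False                                        # stays uncommitted
    boss.commit(index)                                      # step 4
    for lb in little_brothers:
        lb.commit(index)                                    # step 5
    return True


a, b, c = Replica("A"), Replica("B"), Replica("C", alive=False)
ok = replicate(a, [b, c], "balance plus 20")
print(ok, a.log[0]["committed"], b.log[0]["committed"], c.log)  # True True True []
```

With node C down, two of the three nodes still hold the entry, so the write succeeds. If both B and C were down, `replicate` would return `False` and the entry would stay uncommitted in node A's log.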


4 RAFT


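Fatty Zhang was fairly satisfied with this preliminary design, and he handed the proposal to his manager Bill for review.


Bill read it and laughed: "What you are working on is actually a consensus algorithm. Put plainly, it lets a group of machines work as a single unit and keep working even when some of them fail."


"Exactly, exactly. You have summed it up perfectly," said Fatty Zhang, flattering him.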


"However," Bill changed the conversation, "there are still many loopholes in the replication of the log you designed. I think there are 5 steps in your design. What if the "boss" node A fails during these 5 steps? Is the data inconsistent?"


"This..." Big Fatty Zhang really didn't think about it carefully. He secretly regretted it, just lowered his head to pull the car, forgot to look up at the road, and ignored the complex problems in the distributed environment.


"But you have done a very good job," the leader immediately encouraged. "The system you designed is actually very similar to the RAFT algorithm."


“RAFT? ”


"Yes, RAFT is a distributed consensus algorithm. Compared with Paxos, which is complicated and difficult to understand, RAFT has made great improvements in comprehension and implementability. Your 'boss' here, RAFT algorithm is called Leader, 'little brother' ' It's called a Follower, but there are very detailed rules about how logs are replicated and how to ensure data consistency."


As soon as Fatty Zhang heard that there was a ready-made algorithm, he cheered up: "Great, someone else has already solved the distributed problem. I'll just implement it."



—————END—————




