Consensus algorithm: Raft

Consensus Algorithm

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members.

Private chain: the consensus algorithms of traditional distributed systems (as opposed to blockchain-style consensus), such as ZooKeeper's ZAB protocol, Paxos, and Raft. A private chain does not consider malicious nodes in the cluster; the only faults it handles are node failures and network failures.

Consortium chain: besides node failures, the presence of malicious nodes in the cluster must also be considered. Every new node joining a consortium chain has to be verified and audited. Example: the PBFT algorithm.

Public chain: besides node failures in the network, malicious nodes must also be considered, as in a consortium chain. The biggest difference from a consortium chain is that nodes on a public chain can join or leave freely, without strict verification and auditing. Examples: PoW, PoS, DPoS, Ripple.
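The difference between the two fault models can be made concrete with the standard cluster-size bounds: tolerating f crashed nodes (Paxos, Raft) needs 2f+1 nodes, while tolerating f malicious nodes with PBFT needs 3f+1. The helper below is only an illustrative sketch; the function name is hypothetical, and the bounds do not apply to PoW-style public chains.

```python
def min_cluster_size(faulty: int, byzantine: bool) -> int:
    """Smallest cluster that tolerates `faulty` bad nodes.

    byzantine=False: crash faults only (Paxos, Raft, ZAB)  -> 2f + 1
    byzantine=True:  malicious nodes allowed (PBFT)         -> 3f + 1
    """
    if faulty < 0:
        raise ValueError("faulty must be non-negative")
    return 3 * faulty + 1 if byzantine else 2 * faulty + 1
```

For example, surviving one faulty node takes 3 nodes under the crash-only model and 4 nodes under the PBFT model.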

Raft and Paxos

Paxos is the collective name for a family of protocols:

Basic Paxos

Multi-Paxos, which strengthens the role of the leader

Raft is somewhat similar to Multi-Paxos.

Raft algorithm

Two papers

Short paper: "In Search of an Understandable Consensus Algorithm"

Long paper (dissertation): "Consensus: Bridging Theory and Practice"

Some open source projects that implement Raft: etcd, TiDB/TiKV, Consul

Algorithm Overview

A simplified view of the Raft process

Raft handles consistency through a leader node. The leader receives requests from clients as log entries and replicates them to the other nodes in the cluster. Once an entry has been replicated to more than half of the nodes, the leader notifies the other nodes that the entry has been replicated successfully and can be committed and applied to the Raft state machine.
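The "more than half" rule in the paragraph above can be sketched as follows. This is a minimal illustration, not code from any Raft implementation; both function names are hypothetical, and the acknowledgement count is assumed to include the leader's own copy.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def replicated_on_majority(ack_count: int, cluster_size: int) -> bool:
    """True once an entry is stored on more than half of the nodes.

    ack_count is assumed to include the leader itself.
    """
    return ack_count >= majority(cluster_size)
```

In a 5-node cluster the leader therefore needs its own copy plus two follower acknowledgements before it can commit an entry.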

Based on this approach, Raft divides the consistency problem into the following sub-problems.

Leader election: the cluster must have a leader node.
Log replication: the leader receives requests from clients, serializes them into log entries, and replicates those entries to the other nodes in the cluster.
Safety: if a node has applied the log entry at a given index to its Raft state machine, no other node may apply a different log entry at the same index.
Membership changes

Leader election

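When a node starts, it enters the follower state and creates an election timer with a timeout between 150 and 300 milliseconds.
Follower main loop:
  If a heartbeat is received from the leader:
    Set the heartbeat flag to 1
  If the election timeout expires:
    If no leader heartbeat was received:
      Increment the term (term+1) and switch to the candidate state.
    If a leader heartbeat was received:
      Clear the heartbeat flag
  If a RequestVote message is received:
    If the message's term is lower than the current term:
      Refuse to vote for that node
    Else if no vote has been cast yet, or the message's term is greater than the current term:
      Vote for that node
    Otherwise:
      Refuse to vote for that node
Candidate main loop:
  Send RequestVote requests carrying the current term to the other nodes in the cluster
  If an AppendEntries message is received:
    If the message's term >= this node's term:
      A leader already exists; switch to the follower state
    Otherwise:
      Reject the message
  If RequestVote responses are received from other nodes:
    If the votes granted exceed half of the cluster, switch to the leader state
  If the election timeout expires:
    term+1, start the next election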
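The follower's vote handling from the loop above can be sketched in Python. This is a simplified illustration under stated assumptions: the class and function names are hypothetical, the timeout is drawn at random from the 150-300 ms range (as the Raft paper suggests), and the check that the candidate's log is at least as up to date as the voter's is omitted.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Follower:
    current_term: int = 0
    voted_for: Optional[str] = None  # candidate voted for in current_term


def new_election_timeout_ms(low: int = 150, high: int = 300) -> float:
    """Pick a random election timeout between low and high milliseconds."""
    return random.uniform(low, high)


def handle_request_vote(node: Follower, candidate_id: str, term: int) -> bool:
    """Return True if the vote is granted (log comparison omitted)."""
    if term < node.current_term:
        return False                  # stale candidate: refuse
    if term > node.current_term:
        node.current_term = term      # newer term: the old vote no longer counts
        node.voted_for = None
    if node.voted_for in (None, candidate_id):
        node.voted_for = candidate_id
        return True
    return False                      # already voted for someone else this term
```

Recording the vote per term is what keeps a node from voting for two different candidates in the same term.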

When starting an election, a follower increments its current term and switches to the candidate state. It then starts a new election by sending RequestVote RPCs to the other nodes in the cluster. A candidate stays in this state until one of the following happens.

The candidate wins the election, i.e. it receives votes from more than half of the nodes in the cluster (its own vote included).
Another node becomes leader.
The election timeout expires without any node having become leader.
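The ways a candidate can leave the candidate state can be written as a small decision function. This is an illustrative sketch; the enum and function names are hypothetical, and `votes_granted` is assumed to include the candidate's own vote.

```python
from enum import Enum


class Outcome(Enum):
    WON = "won the election"
    LOST = "another node became leader"
    TIMED_OUT = "no leader elected, retry with a new term"
    PENDING = "still waiting for votes"


def election_outcome(votes_granted: int, cluster_size: int,
                     saw_leader: bool, timed_out: bool) -> Outcome:
    """Decide what a candidate should do next.

    saw_leader: an AppendEntries with term >= the candidate's term arrived.
    timed_out:  the election timer expired.
    """
    if votes_granted > cluster_size // 2:
        return Outcome.WON
    if saw_leader:
        return Outcome.LOST
    if timed_out:
        return Outcome.TIMED_OUT
    return Outcome.PENDING
```

On `TIMED_OUT` the candidate increments its term and starts another round, which is where differing timeouts help avoid repeated split votes.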

Log Replication

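The log replication process is roughly as follows:

Every client request is redirected to the leader, and these requests are eventually fed into the Raft state machine for execution.
After receiving a request, the leader first appends a new entry to its own log.
After appending the entry locally, the leader sends AppendEntries RPCs to the other nodes in the cluster to replicate the entry. Once the entry has been successfully replicated (what counts as successful replication is discussed below), the leader applies the entry to the Raft state machine and then replies to the client.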
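One way a leader can decide which entries count as successfully replicated is to track, for each follower, the highest log index known to be stored there. The sketch below is an assumption-laden illustration, not the API of etcd or any other library: the function and parameter names are hypothetical, indexes are 1-based, and it includes the Raft paper's rule that a leader only commits by counting replicas for entries of its own current term.

```python
from typing import Dict, List


def leader_commit_index(match_index: Dict[str, int], leader_last_index: int,
                        log_terms: List[int], current_term: int,
                        commit_index: int) -> int:
    """Return the leader's new commit index.

    match_index: highest index known to be replicated on each follower.
    log_terms[i - 1] is the term of the leader's entry at 1-based index i.
    """
    # The leader holds all of its own entries, so count its last index too.
    indexes = sorted([leader_last_index, *match_index.values()], reverse=True)
    # With indexes sorted in descending order, this position holds the
    # highest index that is stored on more than half of the nodes.
    candidate = indexes[len(indexes) // 2]
    # Only advance for entries that belong to the leader's current term.
    if candidate > commit_index and log_terms[candidate - 1] == current_term:
        return candidate
    return commit_index
```

With followers at indexes 5, 4, 2, 1 and the leader at 5, index 4 is stored on three of five nodes, so the commit index can advance to 4 if that entry was created in the current term.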

 committedIndex >= appliedIndex
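The invariant above says a node never applies an entry that is not yet committed. A minimal sketch of the apply loop that keeps it true follows; the function name and the `apply` callback are hypothetical, and indexes are assumed to be 1-based.

```python
from typing import Callable, List


def apply_committed(log: List[str], committed_index: int, applied_index: int,
                    apply: Callable[[str], None]) -> int:
    """Apply entries in (applied_index, committed_index] in order.

    Returns the new applied_index, which never exceeds committed_index.
    """
    if committed_index < applied_index:
        raise ValueError("invariant violated: committedIndex < appliedIndex")
    while applied_index < committed_index:
        applied_index += 1
        apply(log[applied_index - 1])  # feed the command to the state machine
    return applied_index
```

Calling it with a log of three entries, `committed_index=2` and `applied_index=0` applies the first two entries and leaves the third waiting until it is committed.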

Raft maintains the following two properties, which together make up the Log Matching (LogMatch) property mentioned earlier:

If two entries in different logs have the same index and term, then they store the same command.
If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.
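The two properties can be checked mechanically for a pair of logs, which is handy in tests. The sketch below is illustrative only; the `Entry` layout and the function name are assumptions, with the list position plus one serving as the log index.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command); list position + 1 is the index


def satisfies_log_matching(log_a: List[Entry], log_b: List[Entry]) -> bool:
    """Check both Log Matching properties for two logs."""
    shared = min(len(log_a), len(log_b))
    # Find the last index at which both logs hold an entry of the same term.
    last_match = 0
    for i in range(shared):
        if log_a[i][0] == log_b[i][0]:
            last_match = i + 1
    # Up to and including that index, terms and commands must all be equal.
    return log_a[:last_match] == log_b[:last_match]
```

Two logs that diverge only after their last common (index, term) pair pass the check; logs that agree on a later term but differ in an earlier entry fail it.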

Safety

Leader Completeness property: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all higher-numbered terms.

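Timing requirement: Raft can elect and keep a stable leader as long as the following holds, where broadcastTime is the average time for a node to send RPCs to every other node and receive the responses, electionTimeout is the election timeout described above, and MTBF is the mean time between failures of a single node.

broadcastTime << electionTimeout << MTBF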
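The inequality can be sanity-checked for a concrete deployment. The sketch below reads "a << b" as "a is at least 10 times smaller than b"; that factor of 10 and the function name are assumptions chosen for illustration, not values taken from the text above.

```python
def timing_ok(broadcast_ms: float, election_timeout_ms: float,
              mtbf_ms: float, factor: float = 10.0) -> bool:
    """Check broadcastTime << electionTimeout << MTBF.

    'a << b' is interpreted as a * factor <= b (factor is an assumption).
    """
    return (broadcast_ms * factor <= election_timeout_ms
            and election_timeout_ms * factor <= mtbf_ms)
```

For example, a 10 ms broadcast time fits a 150 ms election timeout, while a 100 ms broadcast time would call for a longer timeout than the 150-300 ms range used earlier.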