[Distributed Lock] Distributed Consistency Algorithm Raft

A detailed explanation of the typical applications and architecture of the distributed registry and service-discovery center etcd, and of the Raft consensus algorithm behind it

1. Overview of distributed consensus algorithm Raft

1.1 Raft background

In distributed systems, consensus algorithms are crucial. Among them, Paxos is the most famous: it was proposed by Leslie Lamport in 1990, is based on message passing, and is widely considered one of the most efficient algorithms of its kind.

Although the Paxos algorithm is effective, its complexity makes it notoriously difficult to implement. To date, few systems implement Paxos directly; well-known examples include Google's Chubby (which is not open source) and the open-source libpaxos. In addition, the ZAB (Zookeeper Atomic Broadcast) protocol used by Zookeeper is closely related to Paxos, but ZAB introduces many changes and optimizations, and the two differ in design goals: ZAB is mainly used to build a highly available distributed primary-backup system, while Paxos is used to build a distributed, consistent state-machine system.

Because Paxos is so complex and hard to implement, its adoption has been limited. The field of distributed systems urgently needed an efficient, easy-to-implement consensus algorithm, and against this background the Raft algorithm emerged.

The Raft algorithm was published by Diego Ongaro and John Ousterhout of Stanford University in the paper "In Search of an Understandable Consensus Algorithm" (USENIX ATC 2014). Compared with Paxos, Raft is easier to understand and implement because it decomposes the problem into separate subproblems. Today there are Raft implementations in more than ten languages; the best known are etcd and Consul.

1.2 Raft role

A Raft cluster contains several nodes. Raft divides these nodes into three states (roles): Leader, Follower, and Candidate, each responsible for different tasks. Under normal circumstances, the nodes in the cluster are only in two states: Leader and Follower.

  • Leader : responsible for managing log replication, processing client requests, and maintaining heartbeat contact with the Followers;
  • Follower : responds to the Leader's log-replication requests, responds to a Candidate's vote requests, and redirects client requests it receives to the Leader;
  • Candidate : responsible for election voting. When the cluster has just started or the Leader goes down, a node in the Follower state switches to Candidate and initiates an election. After winning the election (obtaining the votes of more than half of the nodes), it changes from Candidate to Leader.
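The three roles and the transitions between them can be captured in a minimal sketch. The `RaftState` enum and `TRANSITIONS` table below are hypothetical names, not part of any real implementation; they simply encode the transitions described above.

```python
from enum import Enum

class RaftState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

# Legal transitions, as described above:
# Follower -> Candidate (election timeout fires),
# Candidate -> Leader (won a majority of votes),
# Candidate -> Follower (discovered a valid Leader or a higher Term),
# Candidate -> Candidate (split vote, new election round),
# Leader -> Follower (discovered a higher Term).
TRANSITIONS = {
    (RaftState.FOLLOWER, RaftState.CANDIDATE),
    (RaftState.CANDIDATE, RaftState.LEADER),
    (RaftState.CANDIDATE, RaftState.FOLLOWER),
    (RaftState.CANDIDATE, RaftState.CANDIDATE),
    (RaftState.LEADER, RaftState.FOLLOWER),
}

def can_transition(src: RaftState, dst: RaftState) -> bool:
    """Return True if Raft allows a node to move directly from src to dst."""
    return (src, dst) in TRANSITIONS
```

Note, for instance, that a Follower can never jump straight to Leader: it must pass through the Candidate state and win an election.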

1.3 Raft Overview

Usually there is only one Leader in a Raft cluster, and all other nodes are Followers. Followers are passive: they send no requests on their own, but simply respond to requests from the Leader or a Candidate. The Leader handles all client requests (if a client contacts a Follower, the Follower redirects the request to the Leader). To simplify the logic and the implementation, Raft decomposes the consistency problem into three relatively independent subproblems:

  • Leader election : when the Leader goes down or the cluster starts, a new Leader must be elected;
  • Log replication : the Leader receives requests from clients, replicates them to the other nodes in the cluster as log entries, and forces the other nodes' logs to agree with its own;
  • Safety : if any server node has applied a particular log entry to its state machine, then no other server node may apply a different instruction at the same log index.

2. Leader election principle of Raft algorithm

According to the Raft protocol, when a cluster is first started all nodes are in the Follower state. Since there is no Leader, the Followers cannot receive heartbeats from one; each Follower therefore concludes that the Leader is down and switches to the Candidate state. A Candidate then requests votes from the other nodes in the cluster, asking them to agree to promote it to Leader. If the Candidate receives votes from more than half of the nodes (N/2 + 1), it wins and becomes the Leader.

Phase 1: All nodes are Followers

According to the Raft protocol, when a cluster is just started (or the Leader goes down), all nodes are in the Follower state and the initial Term is 0. At the same time each node starts an election timer. Each node's election timeout is a different random value (here, between 100 and 500 milliseconds), so that nodes avoid initiating elections simultaneously.


Phase 2: Follower turns to Candidate and initiates voting

Since there is no Leader, Followers cannot receive heartbeats from one. If a node receives neither a heartbeat nor a vote request within one election-timeout period after startup, it switches to the Candidate state, increments its Term, sends vote requests to all other nodes in the cluster, and resets its election timer.

Note: because each node's election timeout is a different random value (100-500 milliseconds here), the Followers generally avoid all turning into Candidates and issuing vote requests at the same moment. In other words, the node that first becomes a Candidate and sends out vote requests has a "first-mover advantage" in becoming the Leader.
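The randomized timeout is easy to sketch. The range below uses this article's 100-500 ms figure (the Raft paper itself suggests values such as 150-300 ms); the function name is a hypothetical illustration, not an API of any real library.

```python
import random

# Timeout range from the article: each node picks a fresh random value
# in [100, 500] milliseconds every time it resets its election timer.
ELECTION_TIMEOUT_MS = (100, 500)

def new_election_timeout(rng: random.Random) -> int:
    """Pick a fresh randomized timeout so nodes rarely time out together."""
    lo, hi = ELECTION_TIMEOUT_MS
    return rng.randint(lo, hi)

# Five nodes drawing their timeouts (seeded for reproducibility):
rng = random.Random(42)
timeouts = [new_election_timeout(rng) for _ in range(5)]
```

Because the draws almost never coincide, one node usually times out first, becomes a Candidate, and wins the election before the others even start one.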


Phase 3: Voting strategy

After receiving a vote request, a node decides whether to grant its vote according to the following rules:

  1. If the requesting Candidate's Term is greater than its own Term, and it has not yet voted for another node in that term, it accepts the request and votes for the Candidate;
  2. If the requesting Candidate's Term is less than or equal to its own Term, or it has already voted for another node in that term, it rejects the request. (A node that becomes a Candidate always casts its own vote for itself first.)
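The voting rules above can be sketched as a small pure function. This is a simplified illustration: it omits the log up-to-date check that section 4.1 adds, and the function name and tuple return shape are hypothetical.

```python
def decide_vote(my_term, voted_for, candidate_id, candidate_term):
    """Decide whether to grant a vote; returns (granted, new_term, new_voted_for).

    my_term      : this node's current Term
    voted_for    : who we already voted for in my_term (None if nobody)
    candidate_id : id of the requesting Candidate
    candidate_term : Term carried in the vote request
    """
    if candidate_term < my_term:
        return False, my_term, voted_for           # stale Candidate: reject
    if candidate_term > my_term:
        my_term, voted_for = candidate_term, None  # newer Term: our old vote resets
    if voted_for in (None, candidate_id):
        return True, my_term, candidate_id         # first-come-first-served vote
    return False, my_term, voted_for               # already voted this Term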


Phase 4: Candidate becomes Leader

After a round of election, under normal circumstances one Candidate receives votes from more than half of the nodes (N/2 + 1). It then wins, is promoted to Leader, and begins sending periodic heartbeats to the other nodes; the other nodes become Followers and stay in step with the Leader. With that, the round of election is over.

Note: it is possible that in one round of election no Candidate receives votes from more than half of the nodes (a split vote); in that case, a new round of election is held.


3. Log replication principle of Raft algorithm

In a Raft cluster, only the Leader node can handle client requests (if a client's request reaches a Follower, the Follower redirects it to the Leader). Each client request contains an instruction to be executed by the replicated state machine. The Leader appends the instruction to its log as a new log entry, then sends AppendEntries requests to the Followers in parallel, asking them to replicate the entry. Once the entry has been safely replicated, the Leader applies it to its state machine and returns the result to the client. If a Follower crashes, runs slowly, or loses network packets, the Leader keeps retrying the AppendEntries requests (even after it has already responded to the client) until every Follower has eventually stored all log entries, thereby ensuring strong consistency.
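The flow above starts with the Leader appending the client's instruction as an uncommitted entry. A minimal sketch of that first step (the `LogEntry` and `LeaderLog` names are hypothetical, not etcd's types):

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    term: int              # Leader's Term when the entry was created
    index: int             # position in the log (1-based, as in the Raft paper)
    command: str           # client instruction for the state machine
    committed: bool = False

class LeaderLog:
    def __init__(self, current_term: int):
        self.current_term = current_term
        self.entries: list[LogEntry] = []

    def handle_client_request(self, command: str) -> LogEntry:
        """Append the instruction as an uncommitted entry (phase 1 below).

        The entry stays uncommitted until a majority of Followers have
        stored it; only then does the Leader apply it to its state machine.
        """
        entry = LogEntry(self.current_term, len(self.entries) + 1, command)
        self.entries.append(entry)
        return entry
```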

Phase 1: Client request is submitted to Leader

As shown in the figure below, the Leader receives a request from the client, for example to store the value 5. After receiving the request, the Leader writes it into its local log as a log entry. Note that at this point the entry's status is uncommitted: the Leader has not yet updated its local data, so the value is not yet readable.

Phase 2: Leader sends entry to other Followers

The Leader maintains heartbeats with the Followers. Along with the heartbeat, the Leader sends AppendEntries requests to the other Followers in parallel, asking them to replicate the log entry. This process is called replication. A few points deserve attention:

1. Why is the message the Leader sends to Followers called AppendEntries (plural)?

Because heartbeats between the Leader and Followers are periodic, and the Leader may receive multiple client requests within one period, it is quite likely that several entries are sent to the Followers with a single heartbeat, hence AppendEntries. In this example, of course, we assume there is only a single request and therefore a single entry.

2. The Leader sends Followers more than just the new entries (AppendEntries).

When sending new log entries, the Leader also includes the index (prevLogIndex) and term number (prevLogTerm) of the entry immediately preceding the new entries, as well as its own current term. If the Follower cannot find an entry in its log with the same index and term number, it refuses to accept the new entries, because this indicates that the Follower's log is inconsistent with the Leader's.
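The Follower's side of this check fits in a few lines. In the sketch below (function name hypothetical), the log is modeled as a list of `(index, term)` pairs with 1-based indices:

```python
def consistency_check(log, prev_log_index, prev_log_term):
    """AppendEntries consistency check on the Follower side.

    log            : list of (index, term) pairs, 1-based, in order
    prev_log_index : index of the entry just before the new entries
    prev_log_term  : term of that entry, as recorded by the Leader
    """
    if prev_log_index == 0:
        return True                       # appending at the very start of the log
    if prev_log_index > len(log):
        return False                      # we are missing entries: reject
    _, term = log[prev_log_index - 1]
    return term == prev_log_term          # terms must match at that slot
```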

3. How to solve the problem of inconsistency between Leader and Follower?

Under normal circumstances, the logs of the Leader and the Followers are consistent, so the AppendEntries consistency check never fails. However, a series of crashes of Leaders and Followers can leave their logs inconsistent: a Follower may be missing log entries that the new Leader has, it may have extra entries that the Leader does not have, or both. Missing or extraneous entries may span multiple terms.

To make a Follower's log consistent with its own, the Leader must find the last position at which the two logs agree (put plainly, it backtracks to the most recent point of agreement), delete all of the Follower's entries after that point, and send the Follower its own entries from that point on. All of this happens through the consistency check performed by the AppendEntries requests.

The Leader maintains a nextIndex for each Follower: the index of the next log entry to send to that Follower. When a Leader first comes to power, it initializes all nextIndex values to the index just after its last log entry. If a Follower's log is inconsistent with the Leader's, the consistency check fails on the next AppendEntries request. After being rejected by the Follower, the Leader decrements that Follower's nextIndex and retries.

Eventually nextIndex reaches a position where the Leader's and Follower's logs agree. At that point AppendEntries succeeds, the Follower's conflicting entries are deleted, and the Leader's entries are appended. Once this happens, the Follower's log is consistent with the Leader's, and it remains so for the rest of the term.
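The decrement-and-retry loop can be simulated locally. In this sketch (a hypothetical illustration, with each log modeled simply as a list of term numbers where position i holds the term of entry i+1), the loop converges to the nextIndex from which the Leader can safely resend its entries:

```python
def find_agreement(leader_log, follower_log):
    """Simulate the nextIndex back-off until the consistency check passes.

    leader_log / follower_log : lists of term numbers; position i is the
    term of the entry at 1-based index i+1.  Returns the converged nextIndex.
    """
    next_index = len(leader_log) + 1      # optimistic initial value: last index + 1
    while next_index > 1:
        prev = next_index - 1             # the entry that must match on both sides
        if prev <= len(follower_log) and follower_log[prev - 1] == leader_log[prev - 1]:
            break                         # consistency check passed: stop backing off
        next_index -= 1                   # Follower rejected: decrement and retry
    return next_index
```

From the returned nextIndex, the Leader resends its entries; the Follower deletes its conflicting tail and appends the Leader's, after which the two logs agree.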


Phase 3: Leader waits for Followers to respond

After a Follower receives the replication request from the Leader, there are two possible responses:

  1. It writes the entry to its local log and returns success;
  2. The consistency check fails, it refuses the write, and returns false. The causes and remedies were detailed above.

Note that at this point the entry's status is still uncommitted. After completing the step above, a Follower sends a success response to the Leader. When the Leader has received responses from a majority of Followers, it marks the entry written in phase 1 as committed and applies the entry to its state machine.
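The majority test the Leader applies here is the same N/2 + 1 rule used in elections. A minimal sketch (function name hypothetical):

```python
def leader_commit(cluster_size: int, success_responses: int) -> bool:
    """May the Leader mark the entry committed?

    The Leader itself counts as one replica, so an entry is committed once
    1 + success_responses reaches a strict majority (N/2 + 1) of the cluster.
    """
    replicas = 1 + success_responses          # leader + acking followers
    return replicas >= cluster_size // 2 + 1
```

In a 5-node cluster, for example, two Follower acknowledgements already suffice: together with the Leader's own copy, the entry is on 3 of 5 nodes.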


Phase 4: Leader responds to client

After completing the first three phases, the Leader responds to the client with OK: the write operation has succeeded.


Phase 5: Leader notifies Followers that the entry is committed

After responding to the client, the Leader notifies the Followers with the next heartbeat, and on receiving the notification the Followers also mark the entry as committed. At this point more than half of the nodes in the Raft cluster agree on the entry, ensuring strong consistency. Note that nodes that respond slowly or become inconsistent for any reason (network, performance, failures, and so on) will eventually converge to agreement with the Leader.


4. Security of Raft algorithm

The previous chapters described how the Raft algorithm elects a Leader and replicates the log. However, the mechanisms described so far do not fully guarantee that every state machine executes the same instructions in the same order. For example, a Follower may be unavailable while the Leader commits several log entries; if that Follower then recovers (before catching up with the Leader), the Leader fails, and the Follower is elected as the new Leader, it would overwrite those committed entries, and the problem arises: different state machines execute different sequences of instructions.

For this reason, some restrictions must be added to Leader election to complete the Raft algorithm. These restrictions ensure that the Leader of any given term (Term) contains all log entries committed in previous terms (the so-called Leader Completeness property). These election restrictions are explained in detail below.

4.1 Election restrictions

In any Leader-based consensus algorithm, the Leader must eventually store all committed log entries. Raft uses a simple but effective approach to guarantee that all entries committed in previous terms are present on the new Leader from the moment of its election. In other words, log entries flow in only one direction, from Leader to Followers, and a Leader never overwrites or deletes existing entries in its own log.

Raft uses the voting process to prevent a Candidate from winning an election unless its log contains all committed entries. To win, a Candidate must contact a majority of the nodes in the cluster, and every committed entry must exist on at least one node of any majority. If the Candidate's log is at least as up-to-date as that of a majority of the nodes ("up-to-date" is defined precisely below), then it must hold all committed entries (the majority argument). The restriction on vote requests: the RequestVote message carries the Candidate's log information (the index and term of its last entry), and a voter rejects the request if its own log is more up-to-date than the Candidate's.

Raft determines which of two logs is more up-to-date by comparing the index and term number of their last entries. If the last entries have different term numbers, the log whose last entry has the larger term is more up-to-date. If the last entries have the same term, the longer log is more up-to-date.
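This comparison is simple enough to state directly in code. The sketch below (function name hypothetical) implements the voter's side of the rule:

```python
def at_least_as_up_to_date(candidate_last, voter_last):
    """Is the Candidate's log at least as up-to-date as the voter's?

    Each argument is a (last_log_term, last_log_index) pair.
    """
    c_term, c_index = candidate_last
    v_term, v_index = voter_last
    if c_term != v_term:
        return c_term > v_term        # last entry with the larger term wins
    return c_index >= v_index         # same term: the longer log wins
```

A voter grants its vote only when this returns True (in addition to the term and one-vote-per-term checks of section 2).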

4.2 Commit log entries from previous terms

As noted above, the Leader knows that a log entry from its current term is committed once it has been replicated to a majority of the Followers (the majority argument). If a Leader crashes before committing an entry, subsequent Leaders will continue trying to replicate it. However, a Leader cannot conclude that an entry from a previous term is committed merely because it is stored on a majority of nodes: such an entry can still be overwritten by a future Leader.

Given this, Raft never commits a log entry from a previous term by counting replicas. Only entries from the Leader's current term are committed by counting replicas; once an entry of the current term is committed this way, all earlier entries are committed indirectly, by the Log Matching property. In some situations a Leader could safely conclude that an older entry is committed (for example, if the entry is stored on every node), but Raft takes a more conservative approach for simplicity.
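The conservative commit rule combines the majority count with a current-term check. A sketch under stated assumptions (the function and its argument layout are hypothetical; `match_index` plays the role of the per-Follower replication progress the Leader tracks):

```python
def safe_to_commit(match_index, cluster_size, log_terms, current_term, n):
    """May the Leader advance its commit index to entry n?

    match_index  : highest index known replicated on each Follower
    cluster_size : total number of nodes (Leader included)
    log_terms    : log_terms[i] is the term of the entry at 1-based index i+1
    current_term : the Leader's current Term
    n            : 1-based index of the candidate entry
    """
    replicas = 1 + sum(1 for m in match_index if m >= n)   # leader + followers
    majority = replicas >= cluster_size // 2 + 1
    # Majority alone is not enough: the entry must also belong to the
    # Leader's CURRENT term; older entries are only committed indirectly.
    return majority and log_terms[n - 1] == current_term
```

Note in the second assertion below that an old-term entry is refused even though it sits on a majority of nodes; it becomes committed only once a current-term entry after it is committed.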

Because a Leader that replicates entries from earlier terms keeps their original term numbers, this strategy adds some complexity to the commit rule. In exchange, it makes log entries easier to reason about, since an entry's term number stays the same over time and across logs. It also means a new Leader sends fewer log entries from previous terms than it otherwise would.

Author: Love Little Idiot
Link: https://www.jianshu.com/p/8a4dc6d900cf
