The Raft distributed consensus algorithm

1. CAP theorem

In a distributed system, the three CAP properties cannot all hold at the same time; at most two of them can be satisfied. As the saying goes, you can’t have your cake and eat it too.

  • Consistency: All nodes see the same data at the same time
  • Availability: The service responds within a normal response time
  • Partition tolerance: When a network partition occurs, the system can still provide consistency or availability to clients

1.1 The consistency problem arises from partitioning (multiple servers): the more servers there are, the harder it is to keep their data consistent.

1.2 Data redundancy can provide availability and partition tolerance, but it makes strong consistency hard to achieve. Data redundancy means scattering copies of the data across the partitions (servers); this alone does not satisfy strong consistency, because:

Suppose there is a distributed system with two nodes, A and B, which store identical copies of the data. Consider the following scenario:

  1. The client sends a write request to node A to modify some data from X to Y.
  2. After receiving the write request, node A modifies the data from X to Y and responds to the client that the write operation is successful.
  3. At the same time, the client sends a read request to node B, hoping to obtain the latest data.
  4. Due to data redundancy, node B's data should also be up to date. However, in a distributed system there may be communication delays or network partitions between nodes, so data synchronization takes time and node B may still return the old value X.
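The scenario above can be made concrete with a minimal sketch (plain Go, not Raft): node B is replicated asynchronously, so a read served by B immediately after the write to A can still return the stale value X. The node type and delays are illustrative assumptions.

```go
// Minimal stale-read sketch: two replicas with asynchronous replication.
package main

import (
	"fmt"
	"sync"
	"time"
)

type node struct {
	mu    sync.Mutex
	value string
}

func (n *node) write(v string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.value = v
}

func (n *node) read() string {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.value
}

func main() {
	a := &node{value: "X"}
	b := &node{value: "X"}

	// Client writes Y to node A; A acknowledges immediately.
	a.write("Y")

	// Replication to B happens asynchronously with some network delay.
	go func() {
		time.Sleep(100 * time.Millisecond) // simulated replication lag
		b.write(a.read())
	}()

	// A read sent to B right away still observes the stale value X.
	fmt.Println("read from B immediately:", b.read()) // likely "X"
	time.Sleep(200 * time.Millisecond)
	fmt.Println("read from B after sync: ", b.read()) // "Y"
}
```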

To solve the consistency problem and achieve strong consistency, all requests would need to be processed by a single server, which makes availability hard to achieve.

Therefore, for distributed systems there is a trade-off between consistency, availability, and partition tolerance. In practice, either availability or consistency is the property that gets weakened.

For highly available systems that still need strong consistency, the computing community solves this kind of problem with consensus algorithms.

A consensus algorithm ensures that all participants reach the same view of the data, i.e. strong consistency. Common consensus algorithms are:

  • Paxos algorithm: representative system is ZooKeeper
  • Raft algorithm: representative system is etcd

This article mainly introduces the Raft consensus algorithm. For a process demonstration, please refer to: Raft algorithm process

2. Basic concepts of Raft


The Raft algorithm uses a master-slave model (a single Leader and multiple Followers). All requests are processed by the Leader. When the Leader handles a request, it first appends a log entry and then replicates that entry to the Followers. The entry is committed (persisted) once more than half of the nodes have successfully written it.

Raft nodes have three roles

  • Leader (primary copy)
  • Candidate (candidate copy)
  • Follower (secondary copy)
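A minimal sketch of how these three roles might be represented in Go; the type and constant names are illustrative, not taken from any particular Raft library.

```go
// Role enumerates the three Raft node roles.
package main

import "fmt"

type Role int

const (
	Follower  Role = iota // secondary copy: passively replicates the Leader's log
	Candidate             // candidate copy: asks for votes after an election timeout
	Leader                // primary copy: handles all client requests and replication
)

func (r Role) String() string {
	switch r {
	case Follower:
		return "Follower"
	case Candidate:
		return "Candidate"
	case Leader:
		return "Leader"
	}
	return "Unknown"
}

func main() {
	fmt.Println(Follower, Candidate, Leader) // Follower Candidate Leader
}
```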

Raft voting mechanism

  • Nodes cannot vote more than once. A Follower records which node it has voted for and will not vote again during the same term.
  • One node, one vote. A Candidate votes for itself, and a Follower votes for the Candidate that canvasses it for votes.

There are two types of messages used between Raft nodes:

  • RequestVote: sent by a Candidate node, asking the other nodes to vote for it
  • AppendEntries: sent by the Leader node; if the number of entries is greater than 0 it is log replication, and if the number of entries is 0 it is a heartbeat message

Term: a logical clock value, globally and monotonically increasing, that identifies a Leader's term of office
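The two message types and the one-vote-per-term rule can be sketched as below. The field names are illustrative and simplified; the real protocol carries additional fields (for example the previous log index/term and the leader's commit index) that are omitted here.

```go
// Sketch of the two Raft RPC messages and the one-vote-per-term bookkeeping.
package main

import "fmt"

type LogEntry struct {
	Term    int    // term in which the entry was created
	Command string // client command to apply
}

// RequestVote is sent by a Candidate to ask other nodes for their vote.
type RequestVote struct {
	Term        int // candidate's current term
	CandidateID int
}

// AppendEntries is sent by the Leader. A non-empty Entries slice means log
// replication; an empty Entries slice is a pure heartbeat.
type AppendEntries struct {
	Term     int // leader's current term
	LeaderID int
	Entries  []LogEntry
}

// followerState tracks the vote bookkeeping: one vote per term.
type followerState struct {
	currentTerm int
	votedFor    int // -1 means "not voted in this term"
}

func (f *followerState) handleRequestVote(rv RequestVote) bool {
	if rv.Term > f.currentTerm {
		// Newer term: adopt it and clear the previous vote.
		f.currentTerm = rv.Term
		f.votedFor = -1
	}
	if rv.Term < f.currentTerm || f.votedFor != -1 {
		return false // stale candidate, or already voted this term
	}
	f.votedFor = rv.CandidateID
	return true
}

func main() {
	f := &followerState{currentTerm: 1, votedFor: -1}
	fmt.Println(f.handleRequestVote(RequestVote{Term: 2, CandidateID: 1})) // true
	fmt.Println(f.handleRequestVote(RequestVote{Term: 2, CandidateID: 2})) // false: already voted
}
```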

3. Raft algorithm core

The core of the Raft algorithm is

  • Leader election
  • Log replication

3.1. Leader election

Leader election process

When the nodes in the cluster are first started, all nodes are Follower nodes.
When the heartbeat from the Leader times out, a Follower automatically becomes a Candidate, nominates itself for election, and sends RequestVote messages to the other nodes.
If the Candidate receives support from more than half of the nodes (at least ⌊n/2⌋ + 1 of n nodes), it becomes the Leader. The new Leader immediately sends AppendEntries heartbeat messages to the other nodes, and those nodes reset their election timeouts and remain Followers.
When the number of cluster nodes is even, a tie can occur. As shown in the figure, the two Candidate nodes each received 2 votes, and neither obtained more than half, so no Leader can be elected. This phenomenon is called a split vote, and the cluster enters the next round of election.

To avoid split votes, the Raft algorithm uses randomized election timeouts to reduce their probability. The moments at which nodes become Candidates are thus staggered, which increases the probability of successfully electing a Leader.

After a Leader is elected, every node needs to learn about the new Leader as quickly as possible; otherwise its election timer will expire and a new round of election will start. Therefore the heartbeat timeout must be much smaller than the election timeout (heartbeat timeout << election timeout).

A node resets its election timer whenever it receives a heartbeat message. Without this reset, nodes would initiate elections frequently; with it, the system converges to a stable state. As long as the Leader keeps sending heartbeats, Followers never become Candidates and never start an election.

At this point, the election is over, and log synchronization begins between the Leader node and the Follower node.

Any event that causes a heartbeat timeout, such as cluster startup, Leader downtime, or a network partition, triggers a Leader election.
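A sketch of a single election round from the Candidate's point of view, assuming the vote replies have already been collected into a slice; a majority of an n-node cluster means at least n/2 + 1 votes, counting the Candidate's vote for itself.

```go
// Sketch of one election round: term increment, vote counting, majority check.
package main

import "fmt"

func startElection(currentTerm, clusterSize int, votesGranted []bool) (newTerm int, becameLeader bool) {
	newTerm = currentTerm + 1 // becoming a Candidate increments the term

	votes := 1 // the Candidate always votes for itself
	for _, granted := range votesGranted {
		if granted {
			votes++
		}
	}
	becameLeader = votes >= clusterSize/2+1
	return newTerm, becameLeader
}

func main() {
	// 5-node cluster: self + 2 granted votes = 3 >= 3, so the Candidate wins.
	term, won := startElection(1, 5, []bool{true, true, false, false})
	fmt.Println(term, won) // 2 true

	// 4-node cluster split vote: self + 1 = 2 < 3, no Leader this round.
	term, won = startElection(1, 4, []bool{true, false, false})
	fmt.Println(term, won) // 2 false
}
```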
 

Leader election control

  • Heartbeat timeout: the Follower does not receive the Leader's heartbeat message within the specified time.
  • Election timeout: a random value, the time a Follower waits before becoming a Candidate. When this timer expires, the Follower automatically becomes a Candidate and increments the election term by 1.
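A sketch of the two timers with illustrative values (the concrete durations are assumptions, not prescribed by Raft): the heartbeat interval is much smaller than the randomized election timeout, so a healthy Leader always resets the Followers' timers before they expire.

```go
// Sketch of the heartbeat interval and the randomized election timeout.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const heartbeatInterval = 50 * time.Millisecond // Leader sends AppendEntries this often

// randomElectionTimeout returns a fresh random timeout, roughly 150-300 ms,
// re-drawn every time a Follower resets its election timer.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println("election timeout:", randomElectionTimeout(), " heartbeat:", heartbeatInterval)
	}
}
```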

It can be understood like this:
The Leader is the monarch, and the Followers are ambitious ministers, ready to plot rebellion at any moment. As long as the Leader issues orders regularly (heartbeat messages), the Followers fear the monarch's strength after receiving them and dare not rebel (they reset their election timeouts).
If a Follower stops receiving messages from the monarch for a while (heartbeat timeout), it becomes a Candidate once it is ready (election timeout), proclaims itself king, and presses the other Followers to support it (RequestVote messages). Once more than half of the Followers support it, a new king is crowned.
The new Leader immediately sends heartbeat messages, forcing other Candidates to give up their rebellion and remain Followers.
 

 

3.2. Log replication


In the Raft algorithm, all data change requests from the client are appended to the node log as a log entry.

Log entries have two states:

  • Appended but not yet committed (not persisted)
  • Committed (persisted)

Each Raft node maintains a commitIndex, the index of the highest committed log entry. Entries with an index less than or equal to commitIndex are considered committed (persisted); entries above it have not yet been persisted.
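A sketch of this bookkeeping, with illustrative names: entries at or below commitIndex are committed, entries above it are appended but not yet committed. Indexing starts at 1, with 0 meaning "nothing committed yet".

```go
// Sketch of a Raft log with a commitIndex separating committed entries
// from appended-but-uncommitted ones.
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

type raftLog struct {
	entries     []LogEntry // entries[0] corresponds to log index 1
	commitIndex int        // highest log index known to be committed
}

func (l *raftLog) append(e LogEntry) int {
	l.entries = append(l.entries, e)
	return len(l.entries) // index of the newly appended entry
}

func (l *raftLog) isCommitted(index int) bool {
	return index <= l.commitIndex
}

func main() {
	l := &raftLog{}
	i1 := l.append(LogEntry{Term: 1, Command: "x=Y"})
	i2 := l.append(LogEntry{Term: 1, Command: "x=Z"})

	l.commitIndex = i1 // only the first entry has been acknowledged by a majority
	fmt.Println(l.isCommitted(i1), l.isCommitted(i2)) // true false
}
```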
 

 

Raft log replication process

  1. Log append
  2. Log commit
[Figure: Raft log replication]

The client sends a data change request to the Leader. The Leader appends an entry to its own log and then replicates it to the other nodes through AppendEntries messages. Once more than half of the nodes (including the Leader itself) have successfully appended the new entry, the Leader persists and commits it, and then notifies the other nodes to persist it through another AppendEntries message.

Normally, log replication requires two rounds of AppendEntries messages (one to append the entry, one to commit it). Two rounds are needed to make sure more than half of the nodes have appended the entry before it is committed; if a commit happened without a confirmed majority append, serious data inconsistency could occur during a split brain or network partition.
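The Leader-side decision can be sketched as follows, with the ack collection simplified to a boolean slice: the entry is committed only once more than half of the cluster (the Leader included) has appended it, and the commit is then announced in the next AppendEntries message.

```go
// Sketch of the Leader-side commit decision: append locally, count follower
// acks, and commit only with a majority.
package main

import "fmt"

type leaderState struct {
	clusterSize int
	logLength   int // index of the last appended entry
	commitIndex int
}

// replicate appends one entry to the Leader's log and, given the append acks
// from the followers, decides whether the entry can be committed.
func (l *leaderState) replicate(followerAcks []bool) (committed bool) {
	l.logLength++ // 1) append the entry to the Leader's own log

	acks := 1 // the Leader counts as one successful append
	for _, ok := range followerAcks {
		if ok {
			acks++
		}
	}

	// 2) commit only with a majority; the new commitIndex is announced in the
	//    next AppendEntries (or heartbeat) so followers persist it too.
	if acks >= l.clusterSize/2+1 {
		l.commitIndex = l.logLength
		return true
	}
	return false
}

func main() {
	l := &leaderState{clusterSize: 5}
	fmt.Println(l.replicate([]bool{true, true, false, false}))   // true: 3 of 5 appended
	fmt.Println(l.replicate([]bool{false, false, false, false})) // false: only the Leader
	fmt.Println("commitIndex:", l.commitIndex, "logLength:", l.logLength) // 1 2
}
```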

For example, as shown in the figure: in a Raft cluster the original Leader is node B, the others are Followers, and the term is 1. If a network partition occurs and nodes A and B end up in the same partition, A can still receive B's heartbeats and remains a Follower. Nodes C, D, and E can no longer receive B's heartbeats, so after an election timeout C is elected as the new Leader with the term incremented to 2.
 

 

Suppose clients connect to nodes B and C and write to each. In the partition containing A and B, the Leader appends the entry but does not receive confirmation from more than half of the nodes (only 2 of 5), so the entry cannot be committed. In the partition containing C, D, and E, the entry is appended and confirmed by more than half of the nodes (3 of 5), so it is committed. At this point the logs of Leader B and Leader C conflict.

When the network partition heals, Leader B finds that Leader C has a higher term and steps down to Follower. Nodes A and B then discard their conflicting, uncommitted entries and synchronize Leader C's log. At this point the logs are consistent again.
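A sketch of the term comparison that resolves the split brain: when the old Leader (term 1) receives a message from a node with a higher term (the new Leader C, term 2), it steps down to Follower and adopts the new term.

```go
// Sketch of stepping down when a higher term is observed.
package main

import "fmt"

type Role int

const (
	Follower Role = iota
	Leader
)

type node struct {
	role Role
	term int
}

// observeTerm is called whenever a message from another node arrives.
func (n *node) observeTerm(peerTerm int) {
	if peerTerm > n.term {
		n.term = peerTerm
		n.role = Follower // step down; conflicting uncommitted entries are discarded
	}
}

func main() {
	oldLeaderB := &node{role: Leader, term: 1}
	oldLeaderB.observeTerm(2) // hears from Leader C after the partition heals
	fmt.Println(oldLeaderB.role == Follower, oldLeaderB.term) // true 2
}
```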

4. Summary


Reading and writing in Raft

All read and write requests from clients are processed by the Leader node.

Three states of Raft nodes

  • Follower: when its random election timeout expires, it automatically becomes a Candidate
  • Candidate: votes for itself and broadcasts RequestVote to the other nodes to solicit votes; once it holds more than half of the votes, it becomes the Leader
  • Leader: periodically sends heartbeat messages to the other nodes and synchronizes changes; if it discovers a term higher than its own, it immediately steps down to Follower and accepts data synchronization from the new Leader

State transitions of Raft nodes

  • Follower -> Candidate: election timeout; it automatically becomes a Candidate
  • Candidate -> Candidate: the current round ends in a split vote, so a new round of election begins
  • Candidate -> Leader: obtains the support of more than half of the nodes
  • Leader -> Follower: discovers a higher term
  • Candidate -> Follower: receives a heartbeat message from the Leader
 

Follower processes messages sent by other nodes

  • Candidate: the Follower votes for the first Candidate whose RequestVote it receives, sets its own term to the Candidate's term, and resets its election timeout and heartbeat timeout
  • Leader: the Follower resets its own election timeout and heartbeat timeout


After a split vote, in the next round of election:

  • The election term is incremented by 1 (Term + 1)
  • A new random election timeout is drawn
  • The votes of the previous round are reset


Log replication process

  • Log append: upon receiving a data change request, the Leader appends the entry to its local log and replicates it to the other nodes. Once more than half of the nodes have appended it successfully, the commit phase begins.
  • Log commit: the Leader persists (commits) its local log and notifies the other nodes to persist it as well.


How Raft restores consistency after a split brain

  • When a split brain occurs: the partition that cannot receive heartbeat messages elects a new Leader, and the term of every node in that partition is incremented by 1.
  • When the split brain is repaired: if a Leader discovers another Leader with a higher term than its own, it demotes itself to a Follower and accepts data synchronization from that Leader.
 


Origin blog.csdn.net/txh1873749380/article/details/134925532