Raft protocol notes: a consensus protocol

Raft is a consensus algorithm used to keep the state of a distributed system consistent across all of its nodes.

  1. It is among the most widely used consensus algorithms in distributed system development.
  2. Raft reaches agreement on a sequence of values, and keeps every node's log consistent, through a leader-based approach.
  3. It is a protocol for managing log consistency (a replicated log).
  4. It decomposes distributed consensus into several sub-problems:
    1. Leader election
    2. Log replication
    3. Safety
    4. Log compaction

Before digging into Raft, you need to know what a replicated state machine is:

The Raft paper describes Raft as a consensus algorithm for managing a replicated log, so first we need to understand what a replicated state machine is. Consider a problem: you want to share the result of a long and complex computation with a friend. Broadly speaking, you have two options:
The first: you do the computation yourself and, once you have the result, simply tell your friend the answer.
The second: you tell your friend each step of the computation, and your friend carries out those steps and computes the result independently.

The second approach is exactly how a replicated state machine works. Replicated state machines are implemented by replicating a log: each server keeps a log containing a series of commands, and its state machine executes those commands in order. Because each server's state machine is deterministic, executing the same commands in the same order leaves every state machine in the same state with the same results.
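The idea can be sketched in a few lines of Python. The `KVStateMachine` class and its commands are illustrative, not part of Raft itself: the point is only that two deterministic machines replaying the same log end up in the same state.

```python
class KVStateMachine:
    """A deterministic key-value state machine: applying the same
    commands in the same order always yields the same state."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "incr":
            self.state[key] = self.state.get(key, 0) + value

# The same log replayed on two servers produces identical states.
log = [("set", "x", 1), ("incr", "x", 4), ("set", "y", 7)]

server_a, server_b = KVStateMachine(), KVStateMachine()
for entry in log:
    server_a.apply(entry)
    server_b.apply(entry)

assert server_a.state == server_b.state
```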

There are many similar applications in practice. For example, MySQL master-slave replication works this way: the replica synchronizes by replaying the master's binlog.

The Raft protocol turns the whole cluster into a replicated state machine: one leader node serves clients, and the state can only change through commands. Each time a command is executed, the leader replicates the command's log entry to the followers. If a follower's log matches the leader's log, then their states are identical.

Animated demo of Raft:

thesecretlivesofdata.com/raft/

1. Concepts

  1. Majority: (total number of server nodes / 2) + 1, using integer division
  2. Term: a logical time period; each term begins with an election
  3. Election timeout: how long a follower waits without hearing from a leader before starting an election
  4. Heartbeat timeout: the interval at which the leader sends heartbeats
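The "majority" above uses integer division, so it is a one-line sketch:

```python
def majority(n: int) -> int:
    """Quorum size for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

assert majority(3) == 2
assert majority(4) == 3
assert majority(5) == 3
```

Note that a 4-node cluster needs 3 votes while a 5-node cluster still needs only 3, which is why Raft clusters are usually deployed with an odd number of nodes.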

Basic safety guarantee:

To guarantee correctness, Raft ensures that the following properties hold at all times:

●Election Safety:
At most one leader can be elected in a given term.

●Leader Append-Only:
A leader never overwrites or deletes entries in its own log; it only appends new entries.

●Log Matching:
If two logs contain an entry with the same index and term, then the two logs are identical in all entries up through that index.

●Leader Completeness:
If a log entry is committed, then that entry will be present in the log of the leader of every subsequent term.

●State Machine Safety:
If one server has applied a log entry at a given index to its state machine, no other server will ever apply a different entry at that index.

2. Leader election (random election timeout)

During the leader election of the Raft protocol, the node will be in one of three states:

●Follower
A passive node; it responds to requests from leaders and candidates, and grants votes to candidates.

●Candidate
If a Follower receives no heartbeat from the Leader within its election timeout, it concludes the Leader has failed, converts its state to Candidate, and starts a new election.

●Leader
If a Candidate wins the election, it converts to Leader and begins serving client requests.

Election process:

Random election timeout
This is the time a node waits in the Follower state before becoming a Candidate; the duration is randomized.

What the randomized election timeout accomplishes:
1. It prevents multiple nodes from starting elections at the same time.
2. In most cases only one node's timeout fires first, so a single node starts the election rather than several at once, which reduces elections that fail due to split votes.
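Picking the timeout is a one-liner. This sketch assumes the 150–300 ms range commonly cited for Raft; real systems tune the range to their network latency.

```python
import random

def election_timeout(base_ms: float = 150.0, spread_ms: float = 150.0) -> float:
    """Draw a fresh randomized election timeout in [base, base + spread) ms.

    A fresh value is drawn every time the timer is reset, so two nodes
    are unlikely to time out at the same instant."""
    return base_ms + random.random() * spread_ms

timeouts = [election_timeout() for _ in range(1000)]
assert all(150.0 <= t < 300.0 for t in timeouts)
```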

  1. If a Follower receives no heartbeat from the Leader within its election timeout, it converts to Candidate, increments its term by 1, votes for itself, and sends an election request to every other node. The election request is an RPC call: the RequestVote RPC.
  2. Each node that receives the election request either grants or rejects the vote.

Votes are granted on a first-come, first-served basis. A Follower checks whether the requesting candidate's term is at least as large as its own; if so it grants the vote, otherwise it rejects the request as stale. Suppose two candidate nodes A and B send election requests to C at the same time; C will vote only for whichever request arrives first.

  1. When a candidate obtains votes from a majority of the nodes, it becomes the Leader and begins serving requests. The other Candidates automatically revert to Followers.

If no candidate obtains a majority — a split vote, for example when candidates A and B each receive 2 votes — the candidates wait out a fresh timeout (150 ms ~ 300 ms) and then restart the election.
A split vote leaves the cluster unable to serve requests for longer, which is why the Raft protocol draws the timeout from a random interval: it makes this situation unlikely.
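The follower-side vote decision described above can be sketched as follows. This is a simplified model, not a full implementation: `state` is a plain dict whose field names (`currentTerm`, `votedFor`, `log`) follow the Raft paper's RequestVote RPC, and the log is a list of `(term, command)` tuples.

```python
def handle_request_vote(state, candidate_term, candidate_id,
                        last_log_index, last_log_term):
    """Decide whether to grant a vote. Returns (term, vote_granted)."""
    if candidate_term < state["currentTerm"]:
        return state["currentTerm"], False           # stale candidate
    if candidate_term > state["currentTerm"]:
        state["currentTerm"] = candidate_term        # adopt the newer term
        state["votedFor"] = None
    # First-come, first-served: at most one vote per term.
    if state["votedFor"] not in (None, candidate_id):
        return state["currentTerm"], False
    # The candidate's log must be at least as up-to-date as ours
    # (compare last term first, then last index).
    my_last_term = state["log"][-1][0] if state["log"] else 0
    my_last_index = len(state["log"])
    if (last_log_term, last_log_index) < (my_last_term, my_last_index):
        return state["currentTerm"], False
    state["votedFor"] = candidate_id
    return state["currentTerm"], True

# Two candidates ask C for a vote in the same term; only the first wins.
c = {"currentTerm": 1, "votedFor": None, "log": []}
assert handle_request_vote(c, 2, "A", 0, 0) == (2, True)
assert handle_request_vote(c, 2, "B", 0, 0) == (2, False)
```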

●A Raft cluster is composed of multiple nodes.
●Strong leader: the cluster has exactly one leader at any moment, and the leader is responsible for distributing and synchronizing the log.
●Leader election: the leader is elected; a node becomes leader by winning the votes of a majority of the cluster.

Time is divided into terms, and every term begins with an election. If a leader is elected, that leader leads the cluster for the rest of the term. If no leader is elected, the next election follows, and so on until a leader emerges.

The leader periodically sends heartbeats to every node to maintain its leadership.

If a follower does not receive the leader's heartbeat for too long, it becomes a candidate, increments the term by 1, and initiates a vote. If it receives the support of more than half of the nodes, it is promoted to leader.

If a candidate discovers that the cluster already has a leader while campaigning, it reverts to follower.

If a candidate does not receive responses from more than half of the nodes within a certain period after starting a vote, it initiates another vote.

If the leader discovers a leader with a higher term in the cluster, it steps down to follower.

Key mechanisms of the Raft election
The following mechanisms together ensure that a term has at most one leader, which greatly reduces failed elections:

1. Term
2. Leader heartbeat information
3. Random election timeout
4. First-come, first-served voting principle
5. Majority vote principle

3. Log replication

Log execution process

①The client sends a request to the cluster leader, asking it to execute an operation.
②The leader sends a log-append request to all followers. Each follower appends the entry to its log queue and responds to the leader.
③Once a majority of followers respond successfully, the leader sends an apply-log request; on receiving it, the followers apply the entry to their state machines.
④The leader responds to the client that the write succeeded.
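Step ③ — committing once a majority acknowledges — can be sketched like this. The function is illustrative (real Raft additionally requires the committed entry to belong to the leader's current term, as discussed under Safety below); the `matchIndex` name follows the Raft paper.

```python
def leader_commit_index(match_index, leader_last_index, cluster_size):
    """Return the highest log index replicated on a majority of nodes.

    match_index: for each follower, the highest index known to be
    replicated on it; the leader's own log counts toward the majority."""
    replicated = sorted(match_index + [leader_last_index], reverse=True)
    quorum = (cluster_size // 2) + 1
    # The quorum-th largest value is replicated on at least `quorum` nodes.
    return replicated[quorum - 1]

# 5-node cluster: leader at index 7; followers at 7, 6, 3, 2.
# Index 6 is on three nodes (leader and two followers), so it commits.
assert leader_commit_index([7, 6, 3, 2], 7, 5) == 6
```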

Log composition

1. The log consists of log entries numbered sequentially (by log index).
2. Each log entry stores the term in which it was created and a command for the state machine to execute.
3. A log entry that has been replicated to a majority of servers is considered safe to commit.
Majority: (total number of server nodes / 2) + 1

Log consistency

If two entries in different logs have the same index and term number, they store the same command.

Reason: within a term, the leader creates at most one log entry at a given index, and a log entry's position in the log never changes.

If two entries in different logs have the same index and term, then all previous entries before them are identical.

Reason: every time the leader sends entries via the AppendEntries RPC, it includes the index and term of the entry that immediately precedes the new entries.

  1. If the follower does not find a matching entry in its own log, it rejects the request. This is called the consistency check.
  2. If the follower finds a match in its own log, it appends the new entries.
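The consistency check in points 1–2 can be sketched as follows. This is a simplified model: log entries are `(term, command)` tuples and the log is 1-indexed by position, mirroring the `prevLogIndex`/`prevLogTerm` fields of the paper's AppendEntries RPC.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower-side AppendEntries sketch. Mutates `log` and returns
    True on success, or False if the consistency check fails."""
    # Consistency check: we must hold an entry at prev_index with prev_term.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False
    # Delete any conflicting suffix, then append the leader's entries.
    del log[prev_index:]
    log.extend(entries)
    return True

follower = [(1, "a"), (1, "b")]
assert append_entries(follower, 2, 1, [(2, "c")]) is True   # match: append
assert follower == [(1, "a"), (1, "b"), (2, "c")]
assert append_entries(follower, 3, 1, [(3, "d")]) is False  # term mismatch: reject
```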

When logs become inconsistent

1. Raft restores consistency by forcing the followers to duplicate the leader's log.

2. Normally the logs of the Leader and the Followers stay consistent, so the AppendEntries consistency check rarely fails. However, a leader crash can leave logs inconsistent:

the old leader may not have finished replicating all of the entries in its log.

Log inconsistency processing strategy

When the AppendEntries consistency check fails, the follower rejects the request. When the leader sees the rejection, it decrements the index at which it is trying to append and retries, until the append succeeds. To reduce the number of rejected AppendEntries RPCs, a small optimization is possible: when a follower rejects an append, it can return the term of the conflicting entry and the index of the first entry it stores for that term, which lets the leader and the follower find the last index on which they agree as quickly as possible. When a follower's log lags too far behind the leader's, the leader sends a snapshot instead, to reach consistency quickly.
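The basic backtracking loop (without the fast-backup optimization) can be sketched as below. This is an illustrative simplification: it compares whole `(term, command)` entries directly rather than exchanging RPCs, and the `next_index` name follows the Raft paper.

```python
def repair_follower(leader_log, follower_log):
    """Decrement next_index until the consistency check passes, then
    overwrite the follower's suffix with the leader's entries."""
    next_index = len(leader_log) + 1
    while next_index > 1:
        prev = next_index - 1
        # Consistency check: does the follower agree at index `prev`?
        if len(follower_log) >= prev and follower_log[prev - 1] == leader_log[prev - 1]:
            break
        next_index -= 1                  # rejected: step back and retry
    del follower_log[next_index - 1:]    # drop the conflicting suffix
    follower_log.extend(leader_log[next_index - 1:])
    return follower_log

leader = [(1, "a"), (1, "b"), (2, "c")]
stale  = [(1, "a"), (3, "x")]            # diverged from index 2 onward
assert repair_follower(leader, stale) == leader
```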

Safety

1. The Leader handles log inconsistency by forcing the Followers to duplicate its log: any conflicting entries in a Follower's log are overwritten by the Leader's entries. To make a Follower's log consistent with its own, the Leader must find the latest position at which the two logs agree, and then overwrite the Follower's entries after that position.

2. Concrete operation

The Leader probes from back to front: each time AppendEntries fails, it retries with the preceding log entry, until it finds the position at which each Follower's log agrees with its own (such a position exists by the two Log Matching guarantees above), and then overwrites the Follower's entries after that position one by one.

3. Summary
When the leader's and a follower's logs conflict, the leader checks whether the follower's last entry matches its own; if not, it probes backwards in decreasing index order until it finds a match, and the follower's conflicting entries after that point are deleted. This is how leader-follower log consistency is achieved.

Raft adds the following two restrictions to ensure security:

1. Only a candidate whose log contains all committed entries is qualified to become leader.

This means the elected node already holds every log entry the current leader has committed.

2. The leader may only advance the commit index for entries of its current term that have been replicated to a majority of servers. Entries from older terms are never committed by counting replicas directly; they are committed indirectly once an entry of the current term is committed (every entry whose log index is below the commit index is committed implicitly).

1. The leader only ever advances the commit index; it never moves it backwards.
2. Once an entry from the leader's current term is replicated to more than half of the followers, the leader commits it.
3. When the leader's current entry is committed, all entries before it in the log are committed as well.

When a Raft node votes, it checks whether the candidate's log is at least as up-to-date as its own; if not, it will not vote for that candidate.
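The "at least as up-to-date" comparison is defined by the last entry's term first, then the log length — a direct sketch:

```python
def log_is_up_to_date(candidate_last_term, candidate_last_index,
                      my_last_term, my_last_index):
    """True if the candidate's log is at least as up-to-date as ours:
    a later last term wins; with equal terms, the longer log wins."""
    if candidate_last_term != my_last_term:
        return candidate_last_term > my_last_term
    return candidate_last_index >= my_last_index

assert log_is_up_to_date(3, 5, 2, 9)      # higher last term wins
assert not log_is_up_to_date(2, 9, 3, 5)  # even though its log is longer
assert log_is_up_to_date(3, 5, 3, 5)      # equal logs are acceptable
assert not log_is_up_to_date(3, 4, 3, 5)  # same term, shorter log loses
```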

4. Log compaction:

Log compaction is easy to understand. As the cluster runs, the log grows larger and larger, which degrades performance and consumes a great deal of storage, so the log must be compacted periodically. Snapshotting is the simplest compaction method: the entire system state is written to stable persistent storage as a snapshot, and all log entries before that point in time are discarded.

1. In a practical system the log cannot be allowed to grow without bound. Raft solves this by snapshotting the entire system state and discarding all log entries covered by the snapshot.

2. Each replica snapshots its own system state independently, and may only snapshot committed log entries.

As shown in the figure: the log entries before log index 5 can be deleted, leaving only the snapshot (which stores the current state plus some metadata such as the last included index and term). The snapshot contains the following:

1. Log metadata

2. The log index and term of the last committed log entry included in the snapshot.

These two values are used in the consistency check of the first AppendEntries RPC sent after the snapshot.

3. Current status of the system.

When to send snapshots

1. When the log entry the Leader needs to send to a Follower that lags too far behind has already been discarded, the Leader sends that Follower a snapshot instead.
2. When a new machine joins the cluster, it is also sent a snapshot.
3. Snapshots are sent with the InstallSnapshot RPC.

Notes on snapshots

1. Don't snapshot too frequently; it consumes disk bandwidth.
2. Don't snapshot too rarely either, or a restarted node will need to replay a large number of log entries, which hurts availability.
3. A recommended policy is to take a snapshot once the log reaches a fixed size.
4. Taking a snapshot can take a long time and interfere with normal log replication.

Copy-on-write techniques can be used to keep the snapshotting process from interfering with normal log replication.
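The fixed-size trigger recommended above can be sketched as follows. The threshold and data layout are illustrative, not prescribed by Raft; the point is that the snapshot keeps the state plus the last included index and term for the post-snapshot AppendEntries consistency check.

```python
def maybe_snapshot(log, state, last_applied, threshold=4):
    """Compact the log once `last_applied` reaches `threshold` entries.

    Returns (snapshot, remaining_log); snapshot is None if no compaction
    happened. Only applied (committed) entries may be snapshotted."""
    if last_applied < threshold:
        return None, log
    snapshot = {
        "state": dict(state),                         # full machine state
        "last_included_index": last_applied,
        "last_included_term": log[last_applied - 1][0],
    }
    return snapshot, log[last_applied:]               # discard covered entries

log = [(1, "a"), (1, "b"), (2, "c"), (2, "d"), (3, "e")]
snap, rest = maybe_snapshot(log, {"x": 5}, last_applied=4)
assert snap["last_included_index"] == 4 and snap["last_included_term"] == 2
assert rest == [(3, "e")]
```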

Raft log snapshots have two main uses:

●Compacting the log
As the system's log grows, it occupies more and more space. The Raft algorithm uses the snapshot mechanism to compact a huge log: at some point in time, the entire system state is written stably to persistent storage, and all log entries before that point are discarded.

●Snapshot RPC
Once a snapshot exists, the leader can copy its state directly to followers that lag too far behind, so that those followers quickly reach consistency with the leader.

Raft's snapshot mechanism is very similar to Redis's persistence. Some Redis optimizations can therefore be applied selectively to Raft, such as triggering snapshots automatically and using the fork mechanism (copy-on-write) to reduce the resource cost of creating a snapshot.

Safety

Membership changes and dual-leader split-brain
Split-brain: when the network partitions, the cluster is divided into two smaller clusters, and two Leaders can exist at the same time.
During a split-brain, only one of the two partitions can serve requests.

Suppose the cluster has N nodes, with N odd. After a split-brain, at most one partition contains a majority (more than N/2 nodes), and only that partition can continue to serve requests.

During a split-brain, a client that sends a request to the minority partition receives no Ack; when it retries, the request reaches the majority partition and receives an Ack. Once the network heals, the nodes in the minority partition automatically revert to followers.

The problem with membership changes is that adding or removing too many members at once can leave the old member group's majorities and the new member group's majorities with no node in common, which makes dual leaders possible.

Solution (just ensure that neither the old configuration nor the new configuration can decide alone):
allow only one member to be added or removed per configuration change (to change multiple members, perform several single-member changes in a row).
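Why single-member changes are safe can be checked mechanically: every majority of the old configuration must share at least one node with every majority of the new configuration, which rules out two independent leaders. A brute-force sketch (illustrative helper names, exponential in cluster size, so for small examples only):

```python
from itertools import combinations

def quorums(nodes):
    """All majority-sized subsets of a configuration."""
    q = len(nodes) // 2 + 1
    return [set(c) for c in combinations(sorted(nodes), q)]

def configs_overlap(old, new):
    """True if every old-config quorum intersects every new-config
    quorum -- the condition that prevents dual leaders."""
    return all(a & b for a in quorums(old) for b in quorums(new))

old = {"A", "B", "C"}
# Single-member change: every old majority meets every new majority.
assert configs_overlap(old, old | {"D"})
# Jumping straight to a mostly-new group: quorums {A,B} and {D,E,F}
# are disjoint, so each side could elect its own leader.
assert not configs_overlap(old, {"C", "D", "E", "F", "G"})
```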


Source: blog.csdn.net/qq_44961149/article/details/115400439