Distributed storage engines in practice: Understanding Raft in one article (1)

Background

  In a real distributed system, no server in the cluster can be guaranteed to be 100% available and reliable. Any machine may crash or lose its network connection, leaving that node unavailable and its data potentially inconsistent with the rest of the cluster. A mechanism is therefore needed so that the cluster keeps providing reliable data services as long as a majority of the machines are alive. Raft is a consensus algorithm that keeps data consistent across multiple nodes in a distributed cluster. For example, etcd, a well-known project in the Go ecosystem, is built on Raft. Mastering the algorithm makes it much easier to meet the fault-tolerance and consistency requirements of systems such as distributed configuration services and distributed NoSQL stores, and to move past the limits of a single machine.

Introduction to Raft algorithm

  Raft belongs to the Multi-Paxos family of algorithms. It simplifies and restricts Lamport's Multi-Paxos idea. For example, the log must be continuous, and a server can only be a leader, a follower, or a candidate. At its core, Raft keeps data consistent with a very simple idea: "the minority obeys the majority".
  Besides the leader, Raft defines two other roles (server states), follower and candidate. At any time, each server node is in exactly one of these three states.

  • Follower: the ordinary citizen. It silently receives and processes messages from the leader; when the leader's heartbeat times out, it steps forward and nominates itself as a candidate.
  • Candidate: sends RequestVote RPC messages to the other nodes to ask for their votes. If it wins a majority of the votes, it is promoted to leader.
  • Leader: the domineering boss. Its routine work has three parts: processing write requests, managing log replication, and continuously sending heartbeats that tell the other nodes "I am the leader, do not start a new election".
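
  The three roles and the moves between them can be written down as a small state type. This is a minimal sketch; the names Role and CanTransition are made up for illustration and do not come from etcd or any other Raft library.

```go
package main

// Role is the state a Raft server is in at any moment.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// String returns a readable name, handy for logging.
func (r Role) String() string {
	switch r {
	case Follower:
		return "follower"
	case Candidate:
		return "candidate"
	case Leader:
		return "leader"
	}
	return "unknown"
}

// CanTransition reports whether a server may move from one role to another.
func CanTransition(from, to Role) bool {
	switch from {
	case Follower:
		// The leader's heartbeat timed out.
		return to == Candidate
	case Candidate:
		// Won the election, yielded to another leader, or timed out and retried.
		return to == Leader || to == Follower || to == Candidate
	case Leader:
		// Saw a higher term and stepped down.
		return to == Follower
	}
	return false
}
```

  Note that a follower never jumps straight to leader: it always has to pass through the candidate state and win an election first.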

Leader election

  The simplest way to keep data consistent is to have a single node handle all reads and writes, but a distributed architecture obviously cannot rely on one node. Raft therefore elects one leader among all the nodes in the cluster. The election works much like one in real life: the nodes vote, and the candidate that collects votes from a majority of the cluster becomes the leader. To reduce the chance of a tie, clusters are usually deployed with an odd number of nodes (2n+1). How are these nodes elected? Let's take a look at the following example.
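
  The "majority" arithmetic behind the 2n+1 recommendation can be sketched in a few lines. The helper names Quorum and FaultsTolerated are illustrative only.

```go
package main

// Quorum returns the smallest number of votes that forms a strict
// majority in a cluster of clusterSize nodes.
func Quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// FaultsTolerated returns how many nodes can be down while the
// remaining nodes can still form a majority.
func FaultsTolerated(clusterSize int) int {
	return clusterSize - Quorum(clusterSize)
}
```

  With these definitions a 3-node cluster needs 2 votes and survives 1 failure, while a 4-node cluster needs 3 votes and still survives only 1 failure. The extra even node adds cost without adding fault tolerance, which is one more reason odd sizes are preferred.
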
  Suppose the cluster has three nodes, A, B, and C. In the initial state all of them are followers and there is no leader. After a while, A becomes a candidate, and the election begins. Why A? Each node waits for the leader's heartbeat with its own timeout, and that timeout is chosen at random, here between 150 ms and 300 ms. Whenever a node's timeout expires without a heartbeat, the node becomes a candidate. In this example A drew the shortest timeout (150 ms), so it times out first.
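
  Picking the random timeout might look like the sketch below. The 150 ms and 300 ms bounds come from the example above; the function and constant names, and the use of a seeded generator so that runs are repeatable, are assumptions made for illustration.

```go
package main

import (
	"math/rand"
	"time"
)

// Bounds of the election timeout used in the example (150 ms to 300 ms).
const (
	MinElectionTimeout = 150 * time.Millisecond
	MaxElectionTimeout = 300 * time.Millisecond
)

// NewSeededRNG returns a deterministic random source. A real node would
// seed it differently on every machine so that timeouts do not collide.
func NewSeededRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// RandomElectionTimeout returns a duration in the half-open range
// [MinElectionTimeout, MaxElectionTimeout).
func RandomElectionTimeout(rng *rand.Rand) time.Duration {
	span := int64(MaxElectionTimeout - MinElectionTimeout)
	return MinElectionTimeout + time.Duration(rng.Int63n(span))
}
```

  Because every node draws its own value, it is unlikely that two followers time out at the same instant, so usually only one candidate appears at a time.
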
  Each node also stores a term number (term), a monotonically increasing integer that goes up by one each time an election is held. When A starts the election, it increments its own term and nominates itself as a candidate. It first votes for itself and then sends RequestVote RPC messages to the other nodes, asking them to vote for it as leader.
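
  The candidate's side of this step can be sketched as follows. The Node and RequestVoteArgs types are hypothetical and hold only the fields this step needs; no network code is included, the function simply returns the requests that would be sent.

```go
package main

// RequestVoteArgs is an illustrative message layout, not the wire
// format of any particular Raft library.
type RequestVoteArgs struct {
	Term        int    // the candidate's term after incrementing
	CandidateID string // who is asking for the vote
}

// Node holds only the election-related fields used in this sketch.
type Node struct {
	ID          string
	CurrentTerm int
	VotedFor    string // "" means no vote cast in CurrentTerm
	IsCandidate bool
	Peers       []string
}

// StartElection increments the term, votes for the node itself, and
// returns one vote request per peer, keyed by peer ID.
func (n *Node) StartElection() map[string]RequestVoteArgs {
	n.CurrentTerm++
	n.IsCandidate = true
	n.VotedFor = n.ID

	reqs := make(map[string]RequestVoteArgs, len(n.Peers))
	for _, p := range n.Peers {
		reqs[p] = RequestVoteArgs{Term: n.CurrentTerm, CandidateID: n.ID}
	}
	return reqs
}
```

  For node A with peers B and C starting from term 0, StartElection leaves A at term 1 with its own vote cast and produces two requests carrying term 1.
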

  When B and C receive candidate A's RequestVote RPC and have not yet voted in term 1, they vote for A and update their own term to 1.

  If the candidate wins a majority of the votes before its election timeout expires, it becomes the leader for the current term. Once elected, node A periodically sends heartbeat messages to tell the other servers "I am the leader" and to stop followers from starting new elections. When B and C receive a heartbeat, they reset their election timeout. The heartbeat interval is very short, much shorter than the election timeout.
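
  Counting the replies is the other half of the candidate's job. A minimal sketch, assuming a hypothetical Election type that remembers who has granted a vote:

```go
package main

// Election tracks the votes a candidate has collected in one term.
type Election struct {
	ClusterSize int
	votes       map[string]bool // IDs of nodes that granted their vote
}

// NewElection starts the tally with the candidate's own vote.
func NewElection(clusterSize int, self string) *Election {
	return &Election{
		ClusterSize: clusterSize,
		votes:       map[string]bool{self: true},
	}
}

// RecordVote stores a granted vote and reports whether the candidate
// now holds a majority. A repeated reply from the same node counts once.
func (e *Election) RecordVote(from string) bool {
	e.votes[from] = true
	return e.Won()
}

// Won reports whether more than half of the cluster has voted for the candidate.
func (e *Election) Won() bool {
	return len(e.votes) > e.ClusterSize/2
}
```

  In the three-node example, A starts with its own vote and wins as soon as either B or C grants its vote.
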
  After B and C receive a heartbeat, they send a heartbeat response and reset their election timeout.
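
  The reset can be modelled as a deadline that every heartbeat pushes forward. The sketch below uses a caller-supplied clock in milliseconds instead of real timers so the behaviour is easy to follow; ElectionTimer and its methods are illustrative names.

```go
package main

// ElectionTimer tracks a follower's election deadline on a
// caller-supplied clock measured in milliseconds.
type ElectionTimer struct {
	timeoutMs  int64
	deadlineMs int64
}

// NewElectionTimer arms the timer at time nowMs.
func NewElectionTimer(nowMs, timeoutMs int64) *ElectionTimer {
	return &ElectionTimer{timeoutMs: timeoutMs, deadlineMs: nowMs + timeoutMs}
}

// OnHeartbeat pushes the deadline one full timeout into the future.
func (e *ElectionTimer) OnHeartbeat(nowMs int64) {
	e.deadlineMs = nowMs + e.timeoutMs
}

// Expired reports whether the deadline has passed without a heartbeat,
// which is the moment the follower would become a candidate.
func (e *ElectionTimer) Expired(nowMs int64) bool {
	return nowMs >= e.deadlineMs
}
```

  With a 200 ms timeout armed at time 0, a heartbeat at 150 ms moves the deadline to 350 ms, so the follower stays quiet as long as heartbeats keep arriving more often than the timeout.
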

  Now suppose A's heartbeats are delayed or lost on the way to one of the followers, B or C. If that follower's election timeout expires in the meantime, it starts its own election. For example, C increments its term to 2, becomes a candidate, votes for itself, and sends out vote requests.
  C's term is now 2, which is greater than A's. In Raft, a node that receives a message carrying a term greater than its own updates its term to that value, switches to follower, and resets its election timeout.
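
  The step-down rule is small enough to show in full. This is a sketch with hypothetical names; resetting the election timeout is left to the caller and only noted in a comment.

```go
package main

// Role is the state a Raft server is in.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Node holds only the fields the step-down rule touches.
type Node struct {
	Role        Role
	CurrentTerm int
	VotedFor    string // "" means no vote cast in CurrentTerm
}

// ObserveTerm is called with the term carried by any incoming message.
// If that term is newer, the node adopts it and becomes a follower.
// It returns true when the node changed its term, in which case the
// caller should also reset the election timeout.
func (n *Node) ObserveTerm(term int) bool {
	if term <= n.CurrentTerm {
		return false
	}
	n.CurrentTerm = term
	n.Role = Follower
	n.VotedFor = "" // a new term starts with no vote cast
	return true
}
```

  Applied to the example, leader A at term 1 calls ObserveTerm(2) when C's vote request arrives, and ends up as a follower at term 2.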

Communication between nodes

  In Raft, server nodes communicate through remote procedure calls (RPCs). Leader election involves two types of RPC.

  • RequestVote RPC: initiated by a candidate during an election to ask each node for its vote.
  • AppendEntries RPC (log replication): initiated by the leader to replicate log entries and to carry heartbeats. Only the leader can initiate it.

  Raft stipulates that if a candidate or leader discovers that its term is smaller than another node's, it immediately reverts to follower. For example, after a network partition heals, leader node B with term 3 receives a heartbeat from the new leader carrying term 4, so node B immediately returns to the follower state.
  In addition, a node that receives a request carrying a term smaller than its own rejects the request outright. For example, if node C is at term 4 and receives a RequestVote RPC carrying term 3, it rejects the message.
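
  The two messages and the stale-term check might be laid out as below. The struct fields are a simplified, assumed shape for illustration and are not the message definitions of etcd or any other library.

```go
package main

// LogEntry is one replicated command together with the term in which
// the leader created it.
type LogEntry struct {
	Term    int
	Command string
}

// RequestVoteArgs is sent by a candidate to ask for a vote.
type RequestVoteArgs struct {
	Term        int
	CandidateID string
}

// AppendEntriesArgs is sent only by the leader.
type AppendEntriesArgs struct {
	Term     int
	LeaderID string
	Entries  []LogEntry // empty for a pure heartbeat
}

// IsHeartbeat reports whether the message carries no log entries.
func (a AppendEntriesArgs) IsHeartbeat() bool {
	return len(a.Entries) == 0
}

// AcceptTerm reports whether a request carrying requestTerm should be
// processed by a node whose term is currentTerm. Requests from an
// older term are rejected.
func AcceptTerm(currentTerm, requestTerm int) bool {
	return requestTerm >= currentTerm
}
```

  Node C at term 4 therefore evaluates AcceptTerm(4, 3) as false and rejects the stale vote request from the example.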

Election rules

  1. The leader periodically sends heartbeats (AppendEntries RPC messages that contain no log entries) to all followers to announce that it is the leader and to stop followers from starting new elections.

  2. If a follower receives no message from the leader within the specified time, it assumes that there is currently no leader, nominates itself as a candidate, and starts a leader election.

  3. In an election, the candidate who wins the majority of votes is promoted to leader.

  4. In an election, each server node casts at most one vote per term, on a "first come, first served" basis. For example, node C is at term 3 and first receives a vote request from node A carrying term 4, then a vote request from node B also carrying term 4. Node C gives its only vote for term 4 to A; when B's RequestVote RPC arrives, C has no vote left to give.

  5. A follower with a more complete log refuses to vote for a candidate with a less complete log. For example, node B is at term 3 and node C is at term 4; the last log entry on node B has term 3, while the last log entry on node C has term 2. When node C asks B for its vote, B rejects the request.
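
  Rules 4 and 5 both live in the voter's handler for a vote request. The sketch below is one way to combine them, with made-up type names. It assumes that "log integrity" is judged by comparing the term of the last log entry first and the last log index second, which is how Raft usually decides which log is more up to date; the LastLogIndex field is part of that assumption and does not appear in the text above.

```go
package main

// RequestVoteArgs carries what a voter needs to judge a candidate.
type RequestVoteArgs struct {
	Term         int
	CandidateID  string
	LastLogIndex int // index of the candidate's last log entry (assumed field)
	LastLogTerm  int // term of the candidate's last log entry
}

// Node holds only the fields used when deciding a vote.
type Node struct {
	CurrentTerm  int
	VotedFor     string // "" means no vote cast in CurrentTerm
	LastLogIndex int
	LastLogTerm  int
}

// candidateLogOK reports whether the candidate's log is at least as
// complete as the voter's (rule 5).
func (n *Node) candidateLogOK(args RequestVoteArgs) bool {
	if args.LastLogTerm != n.LastLogTerm {
		return args.LastLogTerm > n.LastLogTerm
	}
	return args.LastLogIndex >= n.LastLogIndex
}

// HandleRequestVote returns true if the vote is granted.
func (n *Node) HandleRequestVote(args RequestVoteArgs) bool {
	if args.Term < n.CurrentTerm {
		return false // stale request
	}
	if args.Term > n.CurrentTerm {
		// A newer term begins; no vote has been cast in it yet.
		n.CurrentTerm = args.Term
		n.VotedFor = ""
	}
	if n.VotedFor != "" && n.VotedFor != args.CandidateID {
		return false // rule 4: the single vote for this term is already gone
	}
	if !n.candidateLogOK(args) {
		return false // rule 5: candidate's log is less complete
	}
	n.VotedFor = args.CandidateID
	return true
}
```

  Replaying the two examples: a node at term 3 grants A's request for term 4 and then refuses B's request for the same term, and a node whose last entry has term 3 refuses a candidate whose last entry has term 2, even though the candidate's own term is higher.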

Random timeout

  1. The time interval for the follower to wait for the leader's heartbeat information is random
  2. If a candidate does not win more than half of the votes within a random time interval, the election is void and the candidate starts a new round. The election timeout it waits for is also random.

Summary

  • Raft guarantees that there is at most one leader in any term, and it greatly reduces failed elections through terms, leader heartbeats, random election timeouts, and the first-come-first-served voting rule.
  • Unlike Multi-Paxos, Raft does not allow every node to be elected leader: only a node with a complete log can be elected leader, and the log must be continuous.

Origin blog.csdn.net/songguangfan/article/details/115055366