RocketMQ multi-replica preface: a first look at the Raft protocol

Raft is a well-known protocol for solving the consistency problem in distributed systems. It consists of two main parts: leader election and log replication.

Note: This article is based on the official Raft animation, available at: http://thesecretlivesofdata.com/raft/

1. Leader election

1.1 In a round of voting, only one node initiates a vote

Nodes in the Raft protocol have 3 states (roles):

  • Follower.
  • Candidate.
  • Leader, usually what we call the master node.

Initially, all three nodes are Followers. Each node has an election timeout (timer) set to a random value between 150ms and 300ms. When the timer expires, the node's state changes from Follower to Candidate.
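The randomized timer can be sketched in a few lines. This is only an illustration of the idea, not code from any Raft implementation; the function names and the fixed seed are assumptions made for the example.

```python
import random

ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300


def random_election_timeout_ms(rng: random.Random) -> float:
    """Pick a fresh election timeout between 150 ms and 300 ms."""
    return rng.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)


def first_to_time_out(timeouts: dict) -> str:
    """Return the node whose timer expires first (ties go to the lower name)."""
    return min(sorted(timeouts), key=lambda name: timeouts[name])


# Hypothetical 3-node cluster; the seed only makes the run repeatable.
rng = random.Random(42)
timeouts = {name: random_election_timeout_ms(rng) for name in ("A", "B", "C")}
candidate = first_to_time_out(timeouts)
print(timeouts, "->", candidate, "becomes Candidate first")
```

Because every node draws its own value, the timers rarely expire at the same moment, which is why a single Candidate is the common case.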

Usually, the timer of one of the three nodes expires first. That node becomes a Candidate and initiates an election. Let's first consider how the leader is elected when only one node becomes a Candidate.

A node that becomes a Candidate initiates a round of voting. Since this is the first round, it sets the voting round (Term) to 1 and first casts a vote for itself. For NodeA, for example, Term is 1 and Vote Count is 1.

When a node's timer expires, it first votes for itself and then sends vote requests to the other nodes in the cluster (canvassing for votes is a more fitting description).

When a node in the cluster receives a vote request, it approves if it has not yet voted in this round and rejects otherwise. It then returns the result and resets its timer.

When Node A receives approval votes from more than half of the nodes, it becomes the leader of the cluster. It then periodically sends heartbeats to the other nodes to maintain its leadership.
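The candidate's side of a voting round can be sketched as follows. This is a minimal sketch under simplifying assumptions: `run_election` and the `request_vote` callback are hypothetical names, and the network call is replaced by a plain function.

```python
from typing import Callable, Iterable, Tuple


def has_majority(votes: int, cluster_size: int) -> bool:
    """True when the votes are strictly more than half of the cluster."""
    return votes > cluster_size // 2


def run_election(node_id: str, current_term: int, peers: Iterable[str],
                 request_vote: Callable[[str, int, str], bool]) -> Tuple[int, bool]:
    """Run one round of voting and return (new_term, won)."""
    term = current_term + 1      # start a new voting round
    votes = 1                    # the candidate votes for itself first
    peers = list(peers)
    for peer in peers:
        # Hypothetical stand-in for sending a vote request over the network.
        if request_vote(peer, term, node_id):
            votes += 1
    return term, has_majority(votes, len(peers) + 1)


# Node A canvasses B and C in a 3-node cluster; both approve.
term, won = run_election("A", 0, ["B", "C"], lambda peer, term, cand: True)
print(term, won)  # 1 True
```

A real node would send the requests in parallel and would stop waiting once a majority is reached; the loop here only keeps the counting easy to follow.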

Node A, now the leader of the cluster, sends heartbeat packets to the other nodes.

After a node receives the leader's heartbeat packet, it returns a response and resets its own timer. If a node in the Follower state does not receive a heartbeat from the leader within the timeout period, it changes from Follower to Candidate and initiates the next round of voting.
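The interplay between heartbeats and the election timer might look like the sketch below. The class name `FollowerTimer` and the caller-supplied clock in milliseconds are assumptions for illustration; a real implementation would run on actual timers and network events.

```python
import random


class FollowerTimer:
    """Tracks a follower's election deadline on a caller-supplied clock (ms)."""

    def __init__(self, now_ms: float, rng: random.Random):
        self.rng = rng
        self.state = "FOLLOWER"
        self.deadline_ms = 0.0
        self.reset(now_ms)

    def reset(self, now_ms: float) -> None:
        # A fresh random timeout between 150 ms and 300 ms on every reset.
        self.deadline_ms = now_ms + self.rng.uniform(150, 300)

    def on_heartbeat(self, now_ms: float) -> None:
        # A heartbeat from the leader keeps the node a follower.
        self.state = "FOLLOWER"
        self.reset(now_ms)

    def tick(self, now_ms: float) -> str:
        # No heartbeat before the deadline: the follower becomes a candidate.
        if self.state == "FOLLOWER" and now_ms >= self.deadline_ms:
            self.state = "CANDIDATE"
        return self.state


timer = FollowerTimer(now_ms=0, rng=random.Random(1))
timer.on_heartbeat(100)   # leader alive: deadline moves to 250..400 ms
print(timer.tick(200))    # FOLLOWER, the deadline has not been reached
print(timer.tick(500))    # CANDIDATE, the leader has been silent too long
```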

For example, suppose NodeA goes down and stops sending heartbeats to its followers. Let's see how the cluster elects a new leader.

Once the leader goes down, it no longer sends heartbeat packets to the nodes in the cluster. Node B's timer expires before Node C's, so Node B becomes a Candidate first and sends vote requests to the other nodes in the cluster.

Node B first sets the voting round to 2, votes for itself, and then sends vote requests to the other nodes.

Node C receives the request. Because the voting round in the request is greater than its own and it has not yet voted in that round, Node C votes in favor, returns the result, and resets its timer. Node B then becomes the new Leader and sends heartbeat packets periodically.
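The decision a node makes when it receives a vote request can be sketched like this. The `Voter` class and its field names are hypothetical, and the sketch covers only the voting-round rules described so far; full Raft additionally compares how up to date the candidate's log is, which is left out here, as is the timer reset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Voter:
    node_id: str
    current_term: int = 0
    voted_for: Optional[str] = None

    def handle_vote_request(self, term: int, candidate_id: str) -> bool:
        if term < self.current_term:
            return False                    # stale round: reject
        if term > self.current_term:
            self.current_term = term        # newer round: adopt it
            self.voted_for = None           # and the vote is free again
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id   # only one vote per round
            return True
        return False


# Node C voted for A in round 1; B now asks for a vote in round 2.
node_c = Voter("C", current_term=1, voted_for="A")
print(node_c.handle_vote_request(2, "B"))  # True
print(node_c.handle_vote_request(2, "D"))  # False, already voted in round 2
```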

That covers leader election with three nodes. Some readers may point out that although each node's timer is random, two timers could still expire at the same time, or a node could become a Candidate before it receives the vote request sent by another node. In other words, more than one node may be a Candidate during the same round of voting. How is the leader chosen then?

The following uses a 4-node cluster as an example to show how the leader is elected in that situation.

1.2 In a round of voting, more than one node initiates a vote

First, two nodes enter the Candidate state at the same time and start a new round of voting; the current voting round is 4. Each first votes for itself and then sends vote requests to the other nodes in the cluster.

The other nodes then receive the vote requests and cast their votes.

When nodes C and D receive each other's vote requests, they reject them, because each has already voted for itself in this round. In this example, Node A approves Node C and Node B approves Node D, so C and D each get only two votes. Of course, if both A and B had voted for the same candidate, C or D, the election would end there. With two votes each, neither C nor D has more than half of the four votes, so neither can become the leader. What happens next?
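The split vote can be checked with a small tally. The function name `election_winner` and the ballot dictionaries (voter mapped to the candidate it approved) are assumptions made for this example.

```python
from collections import Counter
from typing import Optional


def election_winner(ballots: dict, cluster_size: int) -> Optional[str]:
    """Return the candidate with more than half of the votes, or None on a split."""
    tally = Counter(ballots.values())
    candidate, votes = tally.most_common(1)[0]
    return candidate if votes > cluster_size // 2 else None


# A approves C, B approves D, and each candidate votes for itself: 2 against 2.
split = {"A": "C", "B": "D", "C": "C", "D": "D"}
print(election_winner(split, 4))   # None

# Had A and B both approved C, the round would have produced a leader.
agreed = {"A": "C", "B": "C", "C": "C", "D": "D"}
print(election_winner(agreed, 4))  # C
```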

At this point, the timers of A, B, C, and D are all counting down. When a Follower's timer expires and it becomes a Candidate, or when the timer of a node that is already a Candidate expires, that node initiates a new round of voting. In this example, Node B and Node D initiate a new round of voting at the same time.

The voting results are as follows: Node A and Node C agree that Node B becomes the leader. However, since both B and D initiated the fifth round of voting, the final voting round is updated to 6.

That covers leader election in the Raft protocol. Next, let's think about which issues we would need to consider if we implemented the Raft protocol ourselves, to prepare for reading the source code of the DLedger (RocketMQ multi-replica) module.

1.3 Thoughts on implementing Raft leader election

  1. Node state: three node states need to be introduced: Follower, Candidate (the trigger point of voting), and Leader (master node).
  2. Timer for entering the voting state: Followers and Candidates need to maintain a timer, and each timeout is a random value between 150ms and 300ms, so the timers of different nodes expire at different times. In the Follower state, a round of voting is triggered when the timer expires. A node resets its timer after it receives and responds to a vote request or a heartbeat request from the leader.
  3. Voting round (Term): each time a node in the Candidate state initiates a round of voting, it increases the Term by one. The Term also needs to be stored.
  4. Voting mechanism: a node can vote for only one node in each round. For example, if Node A's current round is 3 and it has already voted for Node B, it rejects any other vote request for round 3. If it receives a vote request for round 4, it can vote again.
  5. Condition for becoming leader: a node must get the votes of a majority of the nodes in the cluster, that is, more than half. For example, in a 3-node cluster a node needs two votes. If one of the servers goes down, can the remaining two nodes still elect a leader? Yes, because a candidate can still get 2 votes, which is more than half of the 3 nodes in the initial cluster. This is why the number of machines in a cluster is usually odd: the availability of 4 machines is the same as that of 3 machines.
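The arithmetic behind point 5 is easy to verify. The helper names `quorum` and `tolerated_failures` below are made up for this illustration.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of votes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be down while a quorum can still be formed."""
    return cluster_size - quorum(cluster_size)


for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

Three and four machines both survive exactly one failure, so a fourth machine adds cost without adding fault tolerance; the next real improvement comes with five.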

Note: The conclusions above are just some of my own thoughts. We can carry them into the study of DLedger. The next article will look at how the experts implemented Raft leader election, from the perspective of source code analysis. Let's look forward to it together.

2. Log replication

After the cluster has elected a leader, clients send their requests to the leader, and the leader is responsible for replicating the data to keep the data in the cluster consistent.

The client sends a request to the leader, such as set 5, to update the data to 5.

After the leader receives the client request, it appends the data to its own log (but does not commit it), and then forwards the log entry to the followers in the cluster with the next heartbeat packet.

After a follower receives the log entry from the leader, it appends it to its own log file and returns an ACK. After the leader receives the acknowledgements from the followers, it sends a confirmation to the client.
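The normal-case flow can be sketched as below. This is a simplified model rather than DLedger's implementation: the `Leader` and `Follower` classes are hypothetical, replication is a direct method call instead of a heartbeat message, and the commit point is assumed to follow the usual Raft rule that an entry is committed once a majority of the cluster has stored it.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    log: list = field(default_factory=list)
    online: bool = True

    def append(self, entry: str) -> bool:
        """Append the entry and acknowledge it; an offline node never answers."""
        if not self.online:
            return False
        self.log.append(entry)
        return True


@dataclass
class Leader:
    followers: list
    log: list = field(default_factory=list)
    commit_index: int = -1   # index of the last committed entry, -1 means none

    def client_request(self, entry: str) -> bool:
        self.log.append(entry)                     # appended, not yet committed
        # The leader counts itself plus every follower that acknowledged.
        acks = 1 + sum(f.append(entry) for f in self.followers)
        cluster_size = len(self.followers) + 1
        if acks > cluster_size // 2:               # assumed majority commit rule
            self.commit_index = len(self.log) - 1
            return True                            # confirm to the client
        return False


leader = Leader(followers=[Follower(), Follower()])
print(leader.client_request("set 5"))   # True
print(leader.commit_index)              # 0
print(leader.followers[0].log)          # ['set 5']
```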

The log replication described above is relatively simple because it only considers the normal case. If an exception occurs along the way, how is data consistency ensured?

  1. If one of the followers fails and goes down while the leader is broadcasting logs to the followers, what happens?
  2. At what point is a log entry committed? After the leader receives a data change request from the client, it first appends it to its own log file and then broadcasts it to the followers. When a follower receives the log entry, does it commit the entry before returning the ACK, or at some other point?
  3. How are log entries guaranteed to be unique?
  4. How are network partitions handled?

Readers no doubt have more questions. This article does not intend to answer them; instead, we will carry these questions into the study of RocketMQ multi-replica. After analyzing the implementation of RocketMQ DLedger through its source code, I will summarize the Raft protocol again.

Dear readers, if you have read this far, please give the article a like. Thank you. The next article will focus on how the RocketMQ DLedger multi-replica module implements Raft leader election.


About the author: author of "RocketMQ Technology Insider", RocketMQ community evangelist, and maintainer of the official account "Middleware Interest Circle".
