Distributed Technology Stack 2.2 - Common Algorithms: Raft Algorithm Theory and Application

Introduction

To provide partition tolerance, that is, to keep serving normally without data loss when a node goes down, a distributed system usually backs up data on multiple nodes. One of those nodes is then elected as the master node and handles both read and write operations, while the other nodes handle read operations only. The Raft algorithm is a commonly used algorithm for electing the master node.

1 Principle

A node in the algorithm is in one of three states: leader, follower, or candidate. A leader is a node that has been elected as the master node, a follower is a node acting as a slave node, and a candidate is a node running for master after it has lost contact with the master node.
Each node caches a term value, and the election follows these rules:

  • Each election increments the term.
  • A node votes only for an election request whose election sequence number is greater than its cached term value.
  • When a node votes for an election request, it sets its cached term value to the election sequence number carried in the request.
  • A candidate that receives votes from more than half of the nodes is promoted to leader.
  • The new leader sends all nodes a heartbeat carrying the sequence number of the election it won, and the nodes that receive it enter the follower state.
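The voting rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Node, handle_vote_request and run_election are hypothetical and do not come from the original figures, and message passing is replaced by direct method calls.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node that keeps only its cached term."""
    name: str
    current_term: int = 0

    def handle_vote_request(self, req_term: int) -> bool:
        # Vote only if the request's sequence number is greater than the cached term.
        if req_term > self.current_term:
            self.current_term = req_term  # adopt the sequence number from the request
            return True
        return False


def run_election(candidate: Node, peers: list) -> bool:
    """Return True if the candidate gets votes from more than half of all nodes."""
    req_term = candidate.current_term + 1  # each election increments the term
    candidate.current_term = req_term      # assumption: the candidate votes for itself
    votes = 1 + sum(peer.handle_vote_request(req_term) for peer in peers)
    cluster_size = len(peers) + 1
    return votes > cluster_size // 2
```

With three fresh nodes, the first candidate wins with term 1; a node that later campaigns with a stale term of 1 gets no votes, because every peer already caches term 1.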

1.1 Algorithm process

A node is in the follower state when it joins. If it does not receive a notifyLeader message for more than a certain period of time, it calls the switchToCandidate method to switch to the candidate state. The details are as follows:
[Figure: switchToCandidate process]
In addition, each node processes every notifyLeader message it receives. The processing flow is as follows:
[Figure: notifyLeader handling process]
Note: If no notifyLeader with term>=CurrentTerm is received within a certain time interval, the switchToCandidate method is called.
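A minimal sketch of this timeout rule is shown below. The original flowcharts are not available, so the handling inside on_notify_leader is an assumption, as are the class name Follower, the tick method and the explicit now argument (used instead of a real clock to keep the example deterministic).

```python
class Follower:
    """Hypothetical node that tracks when the last valid heartbeat arrived."""

    def __init__(self, election_timeout: float, now: float = 0.0):
        self.state = "follower"
        self.current_term = 0
        self.election_timeout = election_timeout
        self.last_heartbeat = now

    def on_notify_leader(self, term: int, now: float) -> None:
        # Only a notifyLeader with term >= current_term counts as a heartbeat.
        if term >= self.current_term:
            self.current_term = term
            self.state = "follower"
            self.last_heartbeat = now

    def tick(self, now: float) -> None:
        # Called periodically; starts campaigning once the timeout has elapsed.
        if self.state == "follower" and now - self.last_heartbeat > self.election_timeout:
            self.switch_to_candidate()

    def switch_to_candidate(self) -> None:
        self.state = "candidate"
```

A notifyLeader carrying a term lower than the cached one does not reset the timer, so a stale leader cannot keep a node from campaigning.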

1.2 The fly in the ointment

Suppose there are four nodes A, B, C, and D, and A and B start campaigning at the same time, both with a reqTerm value of 1. A and C receive A's request first and vote for A, while B and D receive B's request first and vote for B, so each candidate holds two votes and neither has more than half. If the second round, the third round, and so on split the same way, the election loops endlessly: no candidate ever gets more than half of the votes, and no leader is elected.
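The tie can be checked with a small tally. The sketch assumes that each candidate votes for itself and that the two remaining nodes split between the candidates; the function names tally and winner are hypothetical.

```python
from typing import Optional


def tally(vote_of: dict) -> dict:
    """Count votes per candidate from a voter -> candidate mapping."""
    votes = {}
    for candidate in vote_of.values():
        votes[candidate] = votes.get(candidate, 0) + 1
    return votes


def winner(votes: dict, cluster_size: int) -> Optional[str]:
    """Return the candidate holding more than half of the votes, if any."""
    for candidate, count in votes.items():
        if count > cluster_size // 2:
            return candidate
    return None


# Split vote: two votes each, nobody exceeds half of four nodes.
split = tally({"A": "A", "B": "B", "C": "A", "D": "B"})
```

Here winner(split, 4) returns None, which is the situation that repeats round after round when both candidates keep retrying at the same time.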

1.3 Further improvement

To fix this shortcoming, a node whose first election request times out is given a random timeout interval before it tries again, so that competing candidates are unlikely to retry at the same moment. The random timeout interval should not exceed a certain upper limit.
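One way to pick such an interval is a fixed base plus bounded random jitter. This is a sketch under assumptions: the function name next_election_timeout and the parameters base and max_jitter are hypothetical, and the text does not say how the random interval is actually drawn.

```python
import random


def next_election_timeout(base: float, max_jitter: float, rng: random.Random) -> float:
    """Return a timeout in [base, base + max_jitter]; max_jitter is the upper limit."""
    return base + rng.uniform(0.0, max_jitter)


# Each node would use its own generator; a seeded one keeps this example repeatable.
rng = random.Random(42)
timeouts = [next_election_timeout(0.15, 0.15, rng) for _ in range(10)]
```

Because every node draws its own value, two candidates that tied in one round are likely to time out at different moments in the next, and the earlier one can collect votes first.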

2 Application

2.1 Master-slave backup

2.1.1 Data manipulation algorithm

First, the three states of the original algorithm are expanded into five: leaderUnReady, leaderReady, followerUnReady, followerReady, and candidate. leaderUnReady and leaderReady are both leader states: the former has just become the leader and is still synchronizing data, so it provides read-only operations; the latter has finished synchronizing and provides both read and write operations. followerUnReady and followerReady are both follower states: the former has just become a follower and is still synchronizing data, so it provides read-only operations; the latter has finished synchronizing and, in addition to serving reads, accepts writes synchronized from the master node. candidate is the same as in the original algorithm and needs no further explanation.
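The five states can be modelled as an enumeration with a few predicates. This is a sketch, not the author's implementation: the helper names are hypothetical, and it reads followerReady's "write" as accepting write synchronization from the master node, which is an interpretation based on the heartbeat note later in this section.

```python
from enum import Enum


class NodeState(Enum):
    LEADER_UNREADY = "leaderUnReady"
    LEADER_READY = "leaderReady"
    FOLLOWER_UNREADY = "followerUnReady"
    FOLLOWER_READY = "followerReady"
    CANDIDATE = "candidate"


def is_synchronizing(state: NodeState) -> bool:
    # The two "unready" states are still catching up on data.
    return state in (NodeState.LEADER_UNREADY, NodeState.FOLLOWER_UNREADY)


def accepts_client_write(state: NodeState) -> bool:
    # Assumption: only a leader that has finished synchronizing takes client writes.
    return state is NodeState.LEADER_READY


def accepts_replicated_write(state: NodeState) -> bool:
    # Assumption: only a follower that has finished synchronizing takes replicated writes.
    return state is NodeState.FOLLOWER_READY
```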

  • Run for leader

[Figure: leader election process]
Note: The election process is basically the same as in the original algorithm. RaceTerm is incremented in a way that ensures the newly elected leader node holds the last committed binLog. In addition, the leader and follower states are each divided into two kinds: ready and unready.

  • Handle heartbeat

[Figure: heartbeat handling process]
Note: If no heartbeat with newTerm>=CurrentTerm is received for a certain period of time, switchToCandidate is called to start an election. The leader node and the non-leader nodes set their state differently when they receive the message. The leader node does not accept write operations in LeaderUnReadyState; writes can be performed only after LeaderRecover completes and the node switches to LeaderReadyState. A follower node first enters FollowerUnReadyState; after FollowerRecover completes and it switches to FollowerReadyState, it can accept write synchronization from the master node.

  • Data write

[Figure: data write process]
Note: Once the master node has written the binLog successfully, it synchronizes the binLog to all slave nodes until more than half of them return success. When a slave node receives a binLog, it first pulls any binLogs it has not yet synchronized, and then executes all binLogs in order up to the XA Prepare stage. After the master node finally commits the XA transaction, it notifies all slave nodes of the xids that are still in the prepare state (prepareXidList), and each slave node commits its pending transactions that are not in prepareXidList. The xids generated by different nodes must never be the same. In addition to sending a notifyPrepareXidList message when a transaction is committed, the leader also sends notifyPrepareXidList messages periodically.
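Two details of this note can be sketched in code: keeping xids distinct across nodes, and deciding on a slave node which prepared transactions to commit. Both parts are assumptions for illustration. The node-id prefix scheme and the names XidGenerator and xids_to_commit are hypothetical, and the commit rule follows the reading that a slave commits its prepared transactions that no longer appear in prepareXidList.

```python
import itertools


class XidGenerator:
    """Hypothetical xid scheme: unique node id plus a per-node counter."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        # A real node would have to persist the counter across restarts.
        self._counter = itertools.count(1)

    def next_xid(self) -> str:
        return f"{self.node_id}-{next(self._counter)}"


def xids_to_commit(local_prepared: set, prepare_xid_list: set) -> set:
    """Prepared locally but no longer in the leader's prepare list: safe to commit."""
    return local_prepared - prepare_xid_list
```

For example, a slave holding node0-1, node0-2 and node0-3 in the prepare state that receives a prepareXidList containing only node0-3 would commit node0-1 and node0-2 and keep node0-3 pending.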

  • Leader reset

[Figure: leader reset process]
Note: After a node restarts, leaderRecover is also executed if its state is leaderUnReady or leaderReady. leaderRecover synchronizes all binLogs in the prepare state to all slave nodes in turn. If more than half of the nodes return success, it commits the XA transaction (these delayed commits and the XA transactions that did not return success also need to be recorded) and then continues the loop; otherwise, it keeps initiating recovery synchronization. If a conflict is received along the way, it is recorded and sent to the operations staff.
Conflicts have to be handled because of scenarios like the following:

  1. Suppose there are ten nodes numbered 0 to 9, with node 0 as the master node. Node 0 has successfully committed the transaction with binlog sequence number 5, and the transaction has also been committed on nodes 1 to 9;
  2. Node 0 then initiates the transactions with binlog sequence numbers 6 and 7, but for network reasons only nodes 7, 8, and 9 return success, and node 0 suddenly goes down. At this point, nodes 1 to 6 have no uncommitted binlogs and their last committed binlog sequence number is 5, while nodes 7 to 9 all hold binlog sequence numbers 6 and 7 uncommitted;
  3. Nodes 1 to 9 now elect a new master node. Suppose nodes 1 to 6 are in one computer room and nodes 7 to 9 are in another, and communication between the two rooms fails. Nodes 1 to 6 will then successfully elect a master node, say node 1, while nodes 7 to 9 will stay in the election state indefinitely;
  4. Node 1 initiates a binlog with sequence number 6. Assuming nodes 2 to 6 return success, the binlog with sequence number 6 on nodes 1 to 6 is now inconsistent with the binlog with sequence number 6 on nodes 7 to 9. If the network between the two computer rooms is then restored and one of nodes 7 to 9 is elected as the master node, the data will be inconsistent.
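The numbers in this scenario can be checked with a short sketch. It assumes "more than half" is counted against all ten nodes and represents each binlog as a mapping from sequence number to a label for its content; the names has_majority and find_conflicts and the labels are hypothetical.

```python
def has_majority(success_count: int, cluster_size: int) -> bool:
    """True if strictly more than half of the cluster agreed."""
    return success_count > cluster_size // 2


def find_conflicts(log_a: dict, log_b: dict) -> list:
    """Sequence numbers present in both logs but with different content."""
    return sorted(seq for seq in log_a.keys() & log_b.keys() if log_a[seq] != log_b[seq])


CLUSTER_SIZE = 10

# Step 2: node 0 plus nodes 7, 8, 9 hold binlogs 6 and 7, which is only 4 of 10.
old_leader_committed = has_majority(4, CLUSTER_SIZE)

# Step 3: nodes 1 to 6 can elect a master among themselves, 6 of 10.
room_one_can_elect = has_majority(6, CLUSTER_SIZE)

# Step 4: both rooms now hold a different binlog under sequence number 6.
room_one_log = {5: "tx5", 6: "tx6-from-node1"}
room_two_log = {5: "tx5", 6: "tx6-from-node0", 7: "tx7-from-node0"}
conflicts = find_conflicts(room_one_log, room_two_log)
```

Sequence number 6 is the only entry reported as a conflict; sequence number 7 exists in one room only, so it is not a conflict by this definition.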

  • Follower reset

[Figure: follower reset process]
Note: After a node restarts, followerRecover is also executed if its state is followerUnReady. followerRecover rolls back the XA transactions that are inconsistent with the master node, as well as the XA transactions whose binLogNum is greater than the lastBinLogNum of the master node.
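The rollback rule can be sketched as a comparison between the follower's prepared transactions and the master's log. The function name binlogs_to_roll_back and the representation of both logs as binLogNum -> xid mappings are assumptions made for this example.

```python
def binlogs_to_roll_back(follower_prepared: dict, leader_log: dict,
                         leader_last_binlog_num: int) -> list:
    """Return the binLogNums of prepared XA transactions the follower must roll back."""
    to_roll_back = []
    for num, xid in sorted(follower_prepared.items()):
        beyond_leader = num > leader_last_binlog_num  # leader never got this far
        inconsistent = leader_log.get(num) != xid     # leader holds something else
        if beyond_leader or inconsistent:
            to_roll_back.append(num)
    return to_roll_back
```

A follower that still holds transactions 6 and 7 from the old master, while the new master's log ends at 6 with a different transaction, would roll back both.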

2.1.2 Node addition/removal

Adding or removing a data node can be treated as a kind of data write, where the data being written is an insertion into or deletion from the node information table.

  • Node join

[Figure: node join process]
Note: Joining inserts the node into the system NodeTable and then starts a serializable read. Serializable transaction isolation prevents the db data from being contaminated by subsequent transactions while it is being pulled; repeatable-read isolation under MVCC is even better if it is available.

  • Node removal

[Figure: node removal process]
Note: Removal works the same way as joining, except that the node deletes itself from the NodeTable.

2.1.3 Operation and maintenance matters

Operations staff need to pay attention to the following:

  1. The number of nodes added at one time must be less than half;
  2. The number of nodes removed at one time must be less than half.

Origin blog.csdn.net/fs3296/article/details/106743503