DLedger-Library analysis based on Raft algorithm | JD Logistics Technology Team

1 Background

In distributed systems, high availability and consistency are common problems. Depending on the application scenario, we choose different architectures, such as master-slave replication or leader election based on ZooKeeper. Over time, automatic leader election based on the Raft algorithm emerged. Raft addresses the same consensus problem as Paxos but adds simplifications and restrictions: for example, the log must be continuous, and a node can only be in one of three states (leader, follower, or candidate). This makes the algorithm relatively easy to understand and implement.

1) DLedger is a Java library based on Raft, released by OpenMessaging. It can be embedded in a system to meet high availability, high reliability, and strong consistency requirements. In RocketMQ it serves as the high-availability solution for message broker storage.

2) Raft divides the roles in the system into leaders, followers and candidates:

  • Leader: accepts client requests, sends heartbeat packets regularly, and sends log synchronization requests to the followers. Once a log entry has been synchronized to a majority of nodes, it tells the followers to commit it.
  • Follower: accepts and persists the log entries synchronized by the leader, and commits them after the leader signals that they can be committed.
  • Candidate: a temporary role during leader election. A node in this state initiates voting and tries to get itself elected as leader. After a successful election, no node remains in this state.
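The leader's rule of committing once a majority of nodes hold an entry can be illustrated with a small calculation. This is a minimal sketch of the general Raft idea, not DLedger's code; the class `CommitIndexSketch` and the method `quorumIndex` are hypothetical names. It assumes the leader knows the highest log index stored on every node, itself included.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of DLedger.
final class CommitIndexSketch {

    // Returns the highest log index that is stored on a majority of nodes.
    // storedIndexes holds one entry per node, the leader included.
    static long quorumIndex(long[] storedIndexes) {
        long[] sorted = storedIndexes.clone();
        Arrays.sort(sorted); // ascending order
        // The element at (n - 1) / 2 has at least n / 2 + 1 elements
        // greater than or equal to it, which is a majority.
        return sorted[(sorted.length - 1) / 2];
    }

    public static void main(String[] args) {
        // Leader is at index 10, the two followers at 7 and 9.
        System.out.println(quorumIndex(new long[] {10, 7, 9}));
    }
}
```

With three nodes at indexes 10, 7 and 9, two nodes hold everything up to index 9, so entries up to 9 can be committed even though one follower is still behind.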

2 DLedger architecture design

The implementation of DLedger can be roughly divided into the following two parts:

  • Leader election
  • Log replication

Its overall structure is shown below.

Note: The picture is quoted from the official website

As the architecture diagram above shows, there are two core classes: DLedgerLeaderElector, which handles election, and DLedgerStore, which handles file storage. Once a leader is elected, it accepts data writes and synchronizes them to the followers, which completes Raft's write path.

3 DLedger leader election source code analysis

3.1 Download source code

Download the code from GitHub (https://github.com/openmessaging/dledger). After importing it into IDEA, we found that the code base is small and fairly easy to analyze.

3.2 Analysis of the leader election process

3.2.1 Principle

Raft's leader election is essentially a state machine. When the cluster starts, each node waits for a random timeout. The node whose timeout expires first initiates a vote request to the other nodes. After receiving votes from more than half of the nodes, it is promoted to leader (voting may take several rounds) and starts sending heartbeat requests. After receiving a request from the leader, the other candidate nodes change themselves to followers.

  • Term: each round of voting is a term, starting from 0 by default.
  • Quorum mechanism: put simply, the minority yields to the majority. For example, with 3 nodes, 2 must agree.
  • Timeout: during the election, each node's timeout is random within a certain range, which keeps nodes from all requesting votes at the same moment and lets the election complete smoothly.
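The quorum size and the random timeout described above can be sketched in a few lines. The class `ElectionBasics`, its method names, and the 300 to 1000 ms range are assumptions made for this example; they are not taken from DLedger.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of DLedger.
final class ElectionBasics {

    // Smallest number of votes that is more than half of the cluster.
    static int quorum(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // Random timeout in the range [minMs, maxMs).
    static long randomTimeoutMs(long minMs, long maxMs) {
        return ThreadLocalRandom.current().nextLong(minMs, maxMs);
    }

    public static void main(String[] args) {
        System.out.println("quorum(3) = " + quorum(3));
        // The bounds below are illustrative values only.
        System.out.println("timeout = " + randomTimeoutMs(300, 1000) + " ms");
    }
}
```

Because each node draws its own timeout, one node usually wakes up first and can collect a quorum before the others start competing.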

3.2.2 Code analysis

The entire state machine is driven by a thread that executes the DLedgerLeaderElector.maintainState() method every 10 ms. The following focuses on how the state is driven:
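A loop of this kind can be pictured as a dispatch on the node's current role. The sketch below is a simplified model rather than the DLedger source: `StateLoopSketch`, the `trace` list, and the bodies of the three handlers are hypothetical, and the handler names for the leader and follower cases are assumed by analogy with maintainAsCandidate().

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of a role-driven loop; not DLedger's code.
final class StateLoopSketch {
    enum Role { LEADER, FOLLOWER, CANDIDATE }

    Role role = Role.CANDIDATE; // nodes start as candidates
    final List<String> trace = new ArrayList<>();

    // One pass of the loop: dispatch on the current role.
    void maintainState() {
        switch (role) {
            case LEADER:
                maintainAsLeader();
                break;
            case FOLLOWER:
                maintainAsFollower();
                break;
            default:
                maintainAsCandidate();
        }
    }

    void maintainAsLeader() { trace.add("leader: send heartbeats"); }
    void maintainAsFollower() { trace.add("follower: check heartbeat timeout"); }
    void maintainAsCandidate() { trace.add("candidate: request votes when due"); }

    // Repeats maintainState() with a fixed pause between passes.
    void run(int passes, long pauseMs) throws InterruptedException {
        for (int i = 0; i < passes; i++) {
            maintainState();
            Thread.sleep(pauseMs);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        StateLoopSketch node = new StateLoopSketch();
        node.run(2, 10);         // two passes as candidate, 10 ms apart
        node.role = Role.LEADER; // pretend the election was won
        node.run(1, 10);
        node.trace.forEach(System.out::println);
    }
}
```

Keeping all role handling inside one periodically executed method means a role change made anywhere else simply takes effect on the next pass.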

Enter the core method maintainAsCandidate():

1. Step 1: initialization

  • term: the voting round.
  • ledgerEndTerm: the term in which the node's last log entry was written.
  • ledgerEndIndex: the index of the last entry in the current log, after which the next entry is appended.
  • nextTimeToRequestVote: the next time to initiate a vote (random).
  • needIncreaseTermImmediately: whether the term should be increased immediately; explained later.

Each DLedger node starts in the WAIT_TO_REVOTE state, so the first pass only initializes these values. Of the calls involved, only memberState.nextTerm() changes the voting round.

2. Step 2: vote

Enter the core method handleVote(). A node uses this method to decide whether to vote in favor of another node, based on its own term and on the contents of the vote request.

  • ledgerEndIndex: replication progress can differ between nodes, so in a new election a voter does not vote in favor of a requester whose log is behind its own.
  • If the requester's term is less than the voter's term, a rejection is returned.
  • If the requester's term is greater than the voter's term, the voter performs the following operations:
    • Become candidate (or remain candidate)
    • Set needIncreaseTermImmediately to true.
    • Return REJECT_TERM_NOT_READY, which is mentioned later.

Additional explanation here:

On its next state loop the voter enters the maintainAsCandidate() function. Because needIncreaseTermImmediately is true, it updates its term and resets its timer, but it does not request votes right away. At this point the voter's currVoteFor is still null, so it can vote in favor of the candidate that asked earlier.
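The voter's side of this exchange can be modelled as follows. This is a hedged sketch, not the DLedger implementation: `VoterSketch`, the field `knownMaxTerm`, and every result name except REJECT_TERM_NOT_READY are invented for the example, and the order of the checks is an assumption.

```java
// Hypothetical model of the voter side; not DLedger's code.
final class VoterSketch {
    // Only REJECT_TERM_NOT_READY is a name taken from the text above.
    enum Result { ACCEPT, REJECT_OLD_TERM, REJECT_TERM_NOT_READY, REJECT_LOG_BEHIND, REJECT_ALREADY_VOTED }

    long currTerm;
    long ledgerEndIndex;
    long knownMaxTerm;      // highest term seen in a request (assumed field)
    String currVoteFor;     // null until a vote is granted in this term
    boolean needIncreaseTermImmediately;

    VoterSketch(long currTerm, long ledgerEndIndex) {
        this.currTerm = currTerm;
        this.ledgerEndIndex = ledgerEndIndex;
    }

    Result handleVote(String requesterId, long requesterTerm, long requesterEndIndex) {
        if (requesterTerm < currTerm) {
            return Result.REJECT_OLD_TERM;
        }
        if (requesterTerm > currTerm) {
            // Not ready yet: remember the term and catch up on the next loop.
            knownMaxTerm = requesterTerm;
            needIncreaseTermImmediately = true;
            return Result.REJECT_TERM_NOT_READY;
        }
        if (currVoteFor != null && !currVoteFor.equals(requesterId)) {
            return Result.REJECT_ALREADY_VOTED;
        }
        if (requesterEndIndex < ledgerEndIndex) {
            return Result.REJECT_LOG_BEHIND; // requester's log is behind ours
        }
        currVoteFor = requesterId;
        return Result.ACCEPT;
    }

    // Next pass of the state loop: update the term, but request no votes.
    void maintainAsCandidate() {
        if (needIncreaseTermImmediately) {
            currTerm = Math.max(currTerm + 1, knownMaxTerm);
            currVoteFor = null;
            needIncreaseTermImmediately = false;
        }
    }
}
```

In this model a request with a higher term is first answered with REJECT_TERM_NOT_READY; after the voter's next loop the same requester can retry and receive ACCEPT, because the voter has moved to the new term without voting for anyone.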

After obtaining the voting results of all nodes, start counting the votes:

3. Step 3: arbitration

Once the voting results from all nodes have been counted, arbitration is carried out. Here we mainly explain the condition in the figure below.

  • acceptNum: the number of consents
  • notReadyTermNum: The number that is not ready (i.e. the result is REJECT_TERM_NOT_READY)

In this case nextTimeToRequestVote is not reset, so another vote is initiated immediately. Combined with the description above, this lets the candidate obtain the votes of those not-ready nodes as soon as possible.
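The arbitration can be sketched as a small decision function. The condition used for the immediate re-vote is an assumption, since the figure is not reproduced here: the sketch supposes that a re-vote is triggered when accepted and not-ready responses together reach a quorum. `VoteTally`, `Next`, and `decide` are hypothetical names.

```java
// Hypothetical model of the vote count; not DLedger's code.
final class VoteTally {
    enum Next { BECOME_LEADER, REVOTE_IMMEDIATELY, WAIT_AND_REVOTE }

    static Next decide(int acceptNum, int notReadyTermNum, int clusterSize) {
        int quorum = clusterSize / 2 + 1;
        if (acceptNum >= quorum) {
            return Next.BECOME_LEADER;
        }
        // Assumed condition: the not-ready nodes could still tip the vote,
        // so ask again without waiting for a new random timeout.
        if (acceptNum + notReadyTermNum >= quorum) {
            return Next.REVOTE_IMMEDIATELY;
        }
        return Next.WAIT_AND_REVOTE;
    }

    public static void main(String[] args) {
        // 3 nodes: the candidate's own vote plus one not-ready peer.
        System.out.println(decide(1, 1, 3));
    }
}
```

Under this assumption, a candidate in a 3-node cluster that has only its own vote and one not-ready response votes again at once, by which time the not-ready peer has usually moved to the new term.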

Finally, after several rounds of voting, the node that obtains more than half of the votes changes its role to leader and sends heartbeats to the other nodes. After receiving a heartbeat, the other nodes change themselves from candidate to follower.
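The last step, a candidate yielding to the new leader when a heartbeat arrives, can be sketched as below. This is a simplified Raft-style model under the assumption that a heartbeat carries the sender's id and term; `HeartbeatSketch` and `onHeartbeat` are hypothetical names, and DLedger's real handler may treat more cases.

```java
// Hypothetical model of heartbeat handling; not DLedger's code.
final class HeartbeatSketch {
    enum Role { LEADER, FOLLOWER, CANDIDATE }

    Role role = Role.CANDIDATE;
    long currTerm;
    String leaderId; // null while no leader is known

    HeartbeatSketch(long currTerm) {
        this.currTerm = currTerm;
    }

    // Returns true if the heartbeat was accepted.
    boolean onHeartbeat(String senderId, long senderTerm) {
        if (senderTerm < currTerm) {
            return false; // heartbeat from a stale leader is ignored
        }
        currTerm = senderTerm;
        leaderId = senderId;
        role = Role.FOLLOWER; // yield to the elected leader
        return true;
    }

    public static void main(String[] args) {
        HeartbeatSketch node = new HeartbeatSketch(2);
        node.onHeartbeat("n0", 2);
        System.out.println(node.role + " following " + node.leaderId);
    }
}
```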

3.3 Unit test verification

3.3.1 Writing unit tests

3.3.2 Log analysis

3.4 Application scenarios

  1. DLedger has been released as a message store for RocketMQ (version>=4.5.0)
  2. Implementing multi-node cache synchronization update based on DLedger
  3. Replica fault tolerance processing based on log replication

4 Summary

  1. Here we only briefly analyze the election process. Reading the source code involves a lot of Java fundamentals and Netty usage, such as AQS and CompletableFuture, which helps improve our coding skills.
  2. When DLedger is initialized, it sets the node role to candidate instead of follower. This is different from the original Raft, and there are also slight differences in the node role conversion process.

Author: JD Logistics Guo Qinghai

Source: JD Cloud Developer Community, Ziyuanqishuo Tech. Please indicate the source when reprinting.

Origin blog.csdn.net/jdcdev_/article/details/135011677