Fabric source code analysis: the Raft consensus algorithm

Background: the best-known system built on Raft is etcd, a highly available distributed key-value store, and the core of etcd is generally considered to be its implementation of the Raft algorithm. As a distributed KV system, etcd uses Raft to replicate data across nodes, and every node holds a full copy of the state machine data.

More importantly, Fabric names its Raft module etcdraft in the source code, which makes clear that Raft in Hyperledger Fabric is essentially a wrapper around etcd's Raft library, used to reach consensus among the ordering nodes of a consortium chain.

———————————————————————————————————————————

Hyperledger Fabric's core implementation code for the Raft algorithm is placed under the fabric/orderer/consensus/etcdraft package.

1. Several core data structures

The Chain interface, the Chain structure, and the node structure.
 

Chain interface

Location: fabric/orderer/consensus/consensus.go

Function: defines the operations an ordering node exposes for processing the messages it receives from clients.
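
For reference, the method set looks roughly like the sketch below. It follows the consensus.Chain interface in recent Fabric releases; the Envelope type here is a local stand-in for the real protobuf common.Envelope.

```go
package consensus

// Envelope is a stand-in for Fabric's protobuf common.Envelope type.
type Envelope struct {
	Payload   []byte
	Signature []byte
}

// Chain is the contract every ordering consensus plugin (solo, kafka, etcdraft)
// implements for a single channel.
type Chain interface {
	Order(env *Envelope, configSeq uint64) error     // submit a normal transaction for ordering
	Configure(config *Envelope, configSeq uint64) error // submit a config transaction (channel create/update)
	WaitReady() error                                // block until the chain can accept new messages
	Errored() <-chan struct{}                        // closed when the chain enters an error state
	Start()                                          // start the processing loop
	Halt()                                           // stop the chain and release resources
}
```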

Chain structure

Location: fabric/orderer/consensus/etcdraft/chain.go
Function: implements the Chain interface; it mainly defines a set of channels over which messages are passed between the submission path and the raft node, so that the corresponding operations can be performed according to the messages received.
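
A heavily trimmed stand-in for the struct is shown below. The channel names follow chain.go, but the element types are simplified placeholders, so treat this as an orientation aid rather than the exact definition.

```go
package etcdraft

// submit, apply and snapshot are simplified placeholders for the real message types.
type submit struct{}
type apply struct{}
type snapshot struct{}

// Chain (heavily trimmed) shows the channels used to move work between the
// client-facing side and the underlying raft node.
type Chain struct {
	submitC  chan *submit   // transactions arriving via Order/Configure/Submit
	applyC   chan apply     // committed raft entries coming up from the raft layer
	haltC    chan struct{}  // request to stop the chain
	doneC    chan struct{}  // closed when the run loop has exited
	startC   chan struct{}  // closed once the raft node has been started
	snapC    chan *snapshot // snapshots used to catch a lagging node up
	errorC   chan struct{}  // closed while the chain is in an errored state
	observeC chan<- uint64  // leadership changes (mainly used by tests); simplified to a node ID
}
```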

node structure

Location: fabric/orderer/consensus/etcdraft/node.go

Function: connects the upper-layer Raft application implemented by Fabric with the underlying Raft implementation in etcd. The node structure is the bridge between the two, and it is what shields the rest of the orderer from the details of the Raft implementation.

2. Source code analysis of the startup process of the Raft mechanism

The startup entry for Raft is located in the fabric/orderer/consensus/etcdraft/chain.go file.

In Chain.Start(), node.start() in etcdraft/node.go is invoked, and node.start() in turn calls raft.StartNode(), the startup entry point provided by etcd.

Start() method in Chain

Function: the Start method initializes the Raft cluster node by starting the etcdraft node, and then calls c.run() so that the Chain can process, in a loop, both the messages submitted by clients and the messages coming up from the Raft layer.
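
In outline, and with every collaborator stubbed to the minimum needed to compile, Start behaves roughly as follows (names mirror chain.go; the stubs are illustrative, not the real implementations):

```go
package etcdraft

// Minimal stubs standing in for the real collaborators of Chain.Start.
type raftNode struct{}

func (n *raftNode) start(fresh, join bool) { /* ultimately calls etcd's raft.StartNode */ }

type consenterSupport struct{ height uint64 }

func (s consenterSupport) Height() uint64 { return s.height }

type Chain struct {
	Node    *raftNode
	support consenterSupport
	fresh   bool
	startC  chan struct{}
}

func (c *Chain) configureComm() error { return nil } // wire up intra-cluster gRPC (stubbed)
func (c *Chain) gc()                  {}             // background WAL/snapshot garbage collection (stubbed)
func (c *Chain) run()                 {}             // main loop: client submissions + raft messages (stubbed)

// Start configures communication, starts the etcdraft node, signals readiness,
// and hands control to the background run loop.
func (c *Chain) Start() {
	if err := c.configureComm(); err != nil {
		return
	}
	isJoin := c.support.Height() > 1 // ledger height > 1 means we are joining an existing channel
	c.Node.start(c.fresh, isJoin)
	close(c.startC)

	go c.gc()
	go c.run()
}
```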

start() method on the etcdraft node side

As the bridge between the Chain side and the raft node side, it obtains the ID of the Raft node to start from the metadata configuration passed in by the Chain, calls the underlying raft.StartNode() method to start the node, and, just like the Chain side, starts n.run() to process messages in a loop.

Finally, the raft.StartNode() call made from etcdraft/node.go starts the underlying Raft node itself. Here Raft is initialized, the configuration for each node is read, the log index is initialized, and so on. As with the previous steps, it also starts a run method that continuously monitors its channels in a loop, switching state and taking the corresponding actions.
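
For comparison, the canonical way to start a node through etcd's public Raft API looks roughly like this. The module path and the option values are assumptions based on the upstream raft example; Fabric derives the node ID and tick values from the channel metadata rather than hard-coding them, and uses WAL/snapshot-backed storage instead of memory storage.

```go
package main

import (
	"go.etcd.io/etcd/raft/v3"
)

func main() {
	storage := raft.NewMemoryStorage() // Fabric uses a WAL/snapshot-backed storage instead

	cfg := &raft.Config{
		ID:              1,       // this node's Raft ID, taken from the channel metadata in Fabric
		ElectionTick:    10,      // election timeout, in ticks
		HeartbeatTick:   1,       // heartbeat interval, in ticks
		Storage:         storage, // where entries and the hard state are persisted
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}

	// Start a brand-new cluster of nodes 1, 2 and 3. On restart,
	// raft.RestartNode(cfg) would be used instead, with peers recovered from storage.
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}, {ID: 2}, {ID: 3}})
	defer n.Stop()

	// From here on, the application drives the node: call n.Tick() on a timer,
	// drain n.Ready() (persist entries, send messages, apply committed entries),
	// and acknowledge each batch with n.Advance().
}
```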

3. Source code analysis of the transaction processing flow of the Raft mechanism

With the Raft machinery running, the ordering node in Fabric can begin receiving transactions, ordering them, and packaging them into blocks.

1. Submission of transaction proposal

The client forwards the endorsed transaction, as a Broadcast request, to an ordering node of the Raft cluster for processing. Transactions in Fabric fall into two categories: ordinary transactions and configuration transactions (channel creation, channel configuration updates, and so on). The two kinds of request go through different entry points, the Order and Configure functions respectively, to complete the submission of the transaction proposal.
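
Both entry points are thin wrappers, roughly as sketched below; the protobuf types are replaced with local stand-ins and the metrics bookkeeping is omitted.

```go
package etcdraft

// Stand-ins for the real protobuf types (common.Envelope, orderer.SubmitRequest).
type Envelope struct{ Payload []byte }

type SubmitRequest struct {
	Channel           string
	LastValidationSeq uint64
	Payload           *Envelope
}

type Chain struct{ channelID string }

// Submit wraps the request and hands it to the chain (sketched further below).
func (c *Chain) Submit(req *SubmitRequest, sender uint64) error { return nil }

// Order handles normal transactions; Configure handles configuration
// transactions. Both simply wrap the envelope in a SubmitRequest and call Submit.
func (c *Chain) Order(env *Envelope, configSeq uint64) error {
	return c.Submit(&SubmitRequest{Channel: c.channelID, LastValidationSeq: configSeq, Payload: env}, 0)
}

func (c *Chain) Configure(env *Envelope, configSeq uint64) error {
	return c.Submit(&SubmitRequest{Channel: c.channelID, LastValidationSeq: configSeq, Payload: env}, 0)
}
```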

2. Forward the transaction proposal to the Leader

From the source above we can see that, regardless of the transaction type, the Submit method is called to submit the proposal. Submit mainly wraps the request message in a structure and writes it into a dedicated channel (submitC) to hand it to the Chain for processing. It also checks whether the current node is the Leader; if not, it redirects the message to the Leader node.
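
A simplified version of that flow, with stand-in types and the cluster-RPC forwarding stubbed out, might look like this:

```go
package etcdraft

import "errors"

// Simplified stand-ins; the real code uses orderer.SubmitRequest and a richer submit struct.
type SubmitRequest struct{ Payload []byte }

type submit struct {
	req    *SubmitRequest
	leader chan uint64 // the run loop reports back who the current leader is
}

type Chain struct {
	raftID  uint64
	submitC chan *submit
	doneC   chan struct{}
}

func (c *Chain) forwardToLeader(leader uint64, req *SubmitRequest) error {
	// In Fabric this relays the request to the leader over the cluster RPC; stubbed here.
	return nil
}

// Submit hands the request to the run loop via submitC; if this node is not the
// leader, the request is forwarded to whoever currently is.
func (c *Chain) Submit(req *SubmitRequest, sender uint64) error {
	leadC := make(chan uint64, 1)
	select {
	case c.submitC <- &submit{req: req, leader: leadC}:
		lead := <-leadC
		if lead == 0 { // raft.None: no leader elected yet
			return errors.New("no Raft leader")
		}
		if lead != c.raftID {
			return c.forwardToLeader(lead, req)
		}
		return nil
	case <-c.doneC:
		return errors.New("chain is stopped")
	}
}
```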

3. Order the transactions

As mentioned above, the proposal is forwarded to the Leader, wrapped in a message structure, written into the submitC channel, and passed to the Chain side, which continuously receives transactions and orders them.

In the ordered method, different operations are performed depending on the message type. For channel configuration messages (channel creation, channel configuration updates, and so on), the configuration message is first checked and applied through the consenter support, and then BlockCutter.Cut() is called directly to cut the pending messages into a block, because configuration transactions are always placed in a block of their own. For ordinary transaction messages, after validation, BlockCutter.Ordered() is called to place the transaction into the cache for ordering, and the block-cutting rules decide whether a block should be produced.
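
The shape of that logic can be sketched as follows. Revalidation against the latest config sequence is omitted, and the types are simplified stand-ins for Fabric's envelope and BlockCutter interfaces.

```go
package etcdraft

// Simplified stand-ins for the envelope and the block cutter.
type Envelope struct{ Payload []byte }

type blockCutter interface {
	Cut() []*Envelope                            // flush whatever is pending into one batch
	Ordered(msg *Envelope) ([][]*Envelope, bool) // enqueue msg; return full batches and a "still pending" flag
}

type Chain struct {
	cutter   blockCutter
	isConfig func(*Envelope) bool // in Fabric this inspects the envelope's channel header type
}

// ordered decides how a message is turned into batches: configuration messages
// always close the pending batch and then occupy a block of their own, while
// normal messages are buffered by the block cutter until the cutting rules fire.
func (c *Chain) ordered(msg *Envelope) (batches [][]*Envelope, pending bool) {
	if c.isConfig(msg) {
		batches = [][]*Envelope{}
		if pendingBatch := c.cutter.Cut(); len(pendingBatch) != 0 {
			batches = append(batches, pendingBatch) // close the batch that was in flight
		}
		batches = append(batches, []*Envelope{msg}) // the config tx gets its own block
		return batches, false
	}
	// Normal transaction: let the block cutter buffer it and apply its rules
	// (max message count, preferred/absolute byte sizes).
	return c.cutter.Ordered(msg)
}
```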

4. Pack the block

After a transaction message has been processed by c.ordered, we obtain the batches returned by the BlockCutter (sets of messages that can be packaged into blocks) and a flag indicating whether there is still pending data in the cache. If data remains in the cache that has not yet been cut into a block, the timer is started; otherwise the timer is reset. The timer itself is handled in the case timer.C branch.

Next, the propose method is called to package the transactions into blocks. propose calls createNextBlock on each batch to build the block and passes the block to a channel read by the leader's run loop (only the Leader is allowed to propose). If the block is a configuration block, the in-flight configuration update is also marked.
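
A reduced sketch of that step is shown below; the block and envelope types are stand-ins, and the marshaling plus the actual Node.Propose call (done by the leader's run loop after reading from the channel) are left out.

```go
package etcdraft

// Simplified stand-ins for envelopes and blocks.
type Envelope struct{ Payload []byte }
type Block struct {
	Number uint64
	Txs    []*Envelope
}

type blockCreator struct{ nextNumber uint64 }

func (bc *blockCreator) createNextBlock(batch []*Envelope) *Block {
	b := &Block{Number: bc.nextNumber, Txs: batch}
	bc.nextNumber++
	return b
}

func isConfigBlock(b *Block) bool { return false } // in Fabric this inspects the block's single config tx

type Chain struct {
	blockInflight  int
	configInflight bool
}

// propose turns each batch into a block and hands it to ch; the run loop on the
// leader picks the block up, marshals it and calls Node.Propose so Raft can
// replicate it. Only the leader ever calls propose.
func (c *Chain) propose(ch chan<- *Block, bc *blockCreator, batches ...[]*Envelope) {
	for _, batch := range batches {
		b := bc.createNextBlock(batch)
		ch <- b
		if isConfigBlock(b) {
			c.configInflight = true // block further config changes until this one is applied
		}
		c.blockInflight++
	}
}
```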

5. Raft’s consensus on blocks

The Leader passes the block data to the underlying Raft state machine by calling c.Node.Propose. Propose here means proposing that the data be written to the log of every node; it is the entry point for reaching consensus among the nodes.

Propose broadcasts the log entry and asks all nodes to persist it, without committing it yet. Once the Leader receives acknowledgements from more than half of the nodes that the entry has been saved, it can commit it, and the new committed index is carried on subsequent messages.

Here we take a short detour to walk through the leader election code in etcd's Raft source:

You can fetch the etcd source for analysis directly, for example with go install go.etcd.io/etcd@latest or by cloning the etcd repository.

When a Follower stops hearing the Leader's heartbeat, ticks delivered through the tickc channel in the run function of etcd/raft/node.go keep advancing the election timer, and once the election timeout elapses the tickElection function starts a timeout election.
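
A stripped-down mirror of that function, with the surrounding raft struct reduced to the fields it touches, looks like this (the real code lives on etcd's unexported raft type in etcd/raft/raft.go):

```go
package raftsketch

// MessageType and Message are simplified stand-ins for etcd's raftpb types.
type MessageType int

const MsgHup MessageType = iota // "election timeout fired, start campaigning"

type Message struct {
	Type MessageType
	From uint64
}

type raftNode struct {
	id                        uint64
	electionElapsed           int
	randomizedElectionTimeout int // randomized per node to reduce split votes
}

func (r *raftNode) promotable() bool          { return true } // real check: node is in the cluster and not applying a snapshot
func (r *raftNode) pastElectionTimeout() bool { return r.electionElapsed >= r.randomizedElectionTimeout }
func (r *raftNode) Step(m Message)            { /* MsgHup leads to campaign() */ }

// tickElection runs on every logical tick while the node is a follower or
// candidate; once the (randomized) election timeout elapses without hearing
// from a leader, it injects a MsgHup into the state machine.
func (r *raftNode) tickElection() {
	r.electionElapsed++
	if r.promotable() && r.pastElectionTimeout() {
		r.electionElapsed = 0
		r.Step(Message{From: r.id, Type: MsgHup})
	}
}
```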

 

In the timeout election path, the Step function is called with a MsgHup message, which in turn calls the campaign function to launch an election. In campaign, the node moves from the Follower state to the candidate state, increments its term number, and finally sends vote-request messages to the other nodes. Location: etcd/raft/raft.go

Other nodes handle the vote request in the Step function and decide, based on the corresponding checks, whether to vote for the candidate. The voting logic has two main steps. First, if the term number in the vote request is smaller than the node's own term number, the request is ignored and no vote is returned. Second, the request is compared against the latest local log entry: if the candidate's last log term is greater than the local last log term, the vote is granted; if the terms are equal, the candidate's last log index must be at least as large as the local one. Location: etcd/raft/raft.go
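
The essence of those two checks can be condensed into the following self-contained sketch, which mirrors raftLog.isUpToDate; per-term vote tracking and the PreVote variants are omitted.

```go
package raftsketch

// VoteRequest carries the candidate's term and the position of its last log entry.
type VoteRequest struct {
	Term         uint64 // candidate's current term
	LastLogTerm  uint64 // term of the candidate's last log entry
	LastLogIndex uint64 // index of the candidate's last log entry
}

type voter struct {
	term         uint64
	lastLogTerm  uint64
	lastLogIndex uint64
}

// isUpToDate mirrors raftLog.isUpToDate in etcd: the candidate's log must end
// in a higher term, or in the same term with at least as high an index.
func (v *voter) isUpToDate(lastIndex, lastTerm uint64) bool {
	return lastTerm > v.lastLogTerm ||
		(lastTerm == v.lastLogTerm && lastIndex >= v.lastLogIndex)
}

// grantVote condenses the two checks described above: reject stale terms,
// then only vote for candidates whose log is at least as up to date as ours.
// (The real Step function also tracks whom this node already voted for.)
func (v *voter) grantVote(req VoteRequest) bool {
	if req.Term < v.term {
		return false // step 1: stale term, ignore the request
	}
	return v.isUpToDate(req.LastLogIndex, req.LastLogTerm) // step 2: log freshness check
}
```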

After the candidate receives replies from the other nodes, it checks whether it has obtained more than half of the votes. If so, it sets itself as the Leader; otherwise it reverts to Follower, which means the leader election failed in this round.

That is the outline of Raft's leader election; next we look at log replication.

Log replication

As analyzed above, for blocks produced on the Leader, the Leader calls the Propose method of etcd's Node interface to submit a log write request. Internally, Propose calls stepWithWaitOption to deliver the log message and waits for the result in either blocking or non-blocking fashion.

The Leader node calls appendEntry to append the entry to its own log, without committing the data, and then calls bcastAppend to broadcast the entry to the other Follower nodes.

After a Follower receives the request, it calls the handleAppendEntries function to decide whether to accept the log sent by the Leader. The logic is as follows: if the log index carried by the Leader is smaller than the Follower's already committed index, the Follower simply returns its committed index to the Leader; otherwise the Follower looks for a conflict between the appended entries and its local log, and if one exists, it finds the conflict position and overwrites its log from that position onward with the Leader's entries. After the append succeeds, the latest log index is returned to the Leader. If the term information is inconsistent, the Leader's append request is rejected outright.

When the Leader receives the Follower's response, it handles rejection and acceptance differently; this is a key step in keeping the Followers consistent (a simplified sketch follows the list below).

  1. When the Leader confirms that a Follower has accepted the append request, it calls maybeCommit to try to advance the commit index. During this step it collects the matchIndex reported by each node, sorts them, and takes the median value; if the median is greater than the local commitIndex, more than half of the nodes have stored the entry and it can be committed. The Leader then calls sendAppend to broadcast the new commit index to all nodes, and each Follower applies it via maybeAppend after receiving the request.

  2. If the Follower rejects the Leader's append request, the Leader, after receiving the rejection, enters a probing state and probes backwards for the latest position at which the Follower's log matches its own.
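
The quorum rule behind maybeCommit can be modelled compactly as below: sort the match indexes reported by all nodes and take the median, which by construction is replicated on a majority. This is a simplified model of the idea, not etcd's exact code.

```go
package raftsketch

import "sort"

// leader keeps, for every node in the cluster (itself included), the highest
// log index known to be replicated on that node (the "match index").
type leader struct {
	matchIndex  map[uint64]uint64
	commitIndex uint64
}

// onAppendResponse records a follower's acknowledgement and then tries to
// advance the commit index, mirroring the maybeCommit step described above.
func (l *leader) onAppendResponse(nodeID, acknowledgedIndex uint64) bool {
	if acknowledgedIndex > l.matchIndex[nodeID] {
		l.matchIndex[nodeID] = acknowledgedIndex
	}
	return l.maybeCommit()
}

// maybeCommit sorts all match indexes and takes the median: the index that at
// least a majority of nodes have stored. If that index is beyond the current
// commit index, the entry is considered committed.
func (l *leader) maybeCommit() bool {
	indexes := make([]uint64, 0, len(l.matchIndex))
	for _, idx := range l.matchIndex {
		indexes = append(indexes, idx)
	}
	sort.Slice(indexes, func(i, j int) bool { return indexes[i] < indexes[j] })

	// With n nodes, the element at position (n-1)/2 of the ascending slice is
	// replicated on at least a majority of them.
	quorumIndex := indexes[(len(indexes)-1)/2]
	if quorumIndex > l.commitIndex {
		l.commitIndex = quorumIndex
		return true // the new commit index is broadcast on the next append
	}
	return false
}
```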

Combining this with the source code gives a summary of Raft's log replication process (more on that next time).

6. Save block

After Raft reaches consensus, the node needs to write the block locally. The Raft layer signals that a block can be saved through a channel, namely a Ready whose CommittedEntries field is non-empty, and Fabric completes the block-saving work in its apply method.

In the apply method, if the entry is a normal entry, writeBlock is called to write the block to the local ledger. If that block is a configuration block, it is written to the orderer's ledger and the configuration is parsed to check whether the Raft options or the set of Raft nodes have changed; if so, ProposeConfChange is called on the Raft state machine to propose the change, and the application layer updates its own related information. If the entry is a configuration-change entry, the configuration update is parsed and the underlying Raft state machine's ApplyConfChange is called first to apply the update.
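
Condensed to its skeleton, and with the block writing and ConfChange handling stubbed out, the apply method looks roughly like this; the entry types are local stand-ins for etcd's raftpb.Entry types.

```go
package etcdraft

// EntryType distinguishes ordinary log entries from Raft membership changes.
type EntryType int

const (
	EntryNormal     EntryType = iota // carries a marshaled Fabric block
	EntryConfChange                  // carries a Raft configuration change (add/remove node)
)

type Entry struct {
	Type  EntryType
	Index uint64
	Data  []byte
}

type Chain struct{}

func (c *Chain) writeBlock(data []byte, index uint64) {
	// Unmarshal the block and hand it to the orderer's block writer; if it is a
	// config block, also check whether the Raft options or node set changed and,
	// if so, propose a ConfChange to the raft state machine.
}

func (c *Chain) applyConfChange(data []byte) {
	// Unmarshal the ConfChange and call the raft node's ApplyConfChange so the
	// underlying state machine updates its view of the cluster membership.
}

// apply is called with the CommittedEntries taken from a raft Ready; it is the
// point where consensus output is turned into blocks on disk.
func (c *Chain) apply(entries []Entry) {
	for _, e := range entries {
		switch e.Type {
		case EntryNormal:
			if len(e.Data) == 0 {
				continue // empty entries are appended by raft itself (e.g. on leader change)
			}
			c.writeBlock(e.Data, e.Index)
		case EntryConfChange:
			c.applyConfChange(e.Data)
		}
	}
}
```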


Origin: https://blog.csdn.net/weixin_45270330/article/details/133847153