A study of the PBFT algorithm

True distributed versus pseudo-distributed

Today's databases are often described as distributed storage, but that kind of distributed storage is fundamentally different from a blockchain. A distributed database uses a cluster of servers to provide backup, recovery and redundant data services, yet the cluster still belongs to a single company or institution, so it remains a centrally managed database cluster.

In a blockchain, each node corresponds to one database in a distributed database system. The difference is that there is no administrator maintaining data synchronization between the databases, and of the four create, delete, update and query operations there is no delete or update. The data cannot be tampered with. (This does not mean that data can never be changed, only that it cannot be changed without the change being recorded.)

The Byzantine generals problem (malicious nodes)

Byzantium was the capital of the Eastern Roman Empire. The empire's territory was vast, and the generals of its armies had to reach consensus before deciding on a course of action in war. Some of the generals might be traitors, however, and the traitors would interfere with the loyal generals' decisions and so disrupt the battle plan.

The problem can be stated as follows: given that a distributed system is known to contain Byzantine nodes (nodes that act maliciously, suffer hardware errors, are cut off by network congestion, and so on), how can the nodes of the system still reach agreement effectively?

Prerequisites

1. Messages passed between nodes cannot be tampered with in transit. In other words, cryptography must be used to secure message delivery; if messages can be tampered with, the problem has no solution.

2. The problem is solved on the basis of state machine replication. Assuming f faulty nodes out of N nodes in total, the problem is solvable when N ≥ 3f + 1.

State machine replication: a replication group needs at least three nodes. If one replica is wrong, we can find out by comparing it with the other two. Two replicas are not enough, because there is no way to tell which of the two is the wrong one.

Conversely, a replication group of three can tolerate at most one faulty node. If more than one replica fails, the three nodes may all output different states, and there is no way to confirm which one is correct.

In general, a system that tolerates f faults must use 2f + 1 replicas. The extra replica breaks the tie, so that the correct side is always the majority. More replicas can be added in special circumstances.
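The comparison of replica outputs described above can be sketched as a strict-majority vote. This is only an illustration for the non-Byzantine 2f + 1 case; the function name `majority_output` and the string outputs are hypothetical and do not come from any PBFT implementation.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_output(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, or None."""
    if not outputs:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority means more than half of all replicas agree.
    return value if count > len(outputs) // 2 else None


# Three replicas, one wrong: the other two outvote it.
print(majority_output(["x=1", "x=1", "x=2"]))
# Two replicas that disagree: no way to tell which one is wrong.
print(majority_output(["x=1", "x=2"]))
```

With three replicas and two of them wrong in different ways, the function returns `None`, which matches the observation that a group of three tolerates only one fault.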

So why, when 2f + 1 replicas are clearly enough to decide in the ordinary case, does Byzantine fault tolerance need N ≥ 3f + 1?

Suppose there are N nodes in total, f of them faulty. A node must be able to decide after receiving N - f responses; it cannot wait for all N, because the faulty nodes may never respond. However, the f nodes that did not respond are not necessarily the faulty ones: they may be honest nodes slowed down by network latency. In the worst case, then, f of the N - f responses received are false (sent by faulty nodes). This is the subtle step. It is only an assumption, but a stable system has to tolerate as many problem nodes as possible, so we take the largest possible value, f, for the number of false responses, whatever the real number is. That leaves N - f - f honest responses, and they must outnumber the false ones: N - 2f > f, that is N > 3f. The smallest such N is 3f + 1, which is where the +1 comes from.
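The arithmetic behind N ≥ 3f + 1 can be checked with a few lines of Python. This is a minimal sketch; `max_faulty` and `honest_majority_guaranteed` are names made up for this example.

```python
def max_faulty(n: int) -> int:
    """Largest f that satisfies n >= 3f + 1."""
    if n < 1:
        raise ValueError("need at least one node")
    return (n - 1) // 3


def honest_majority_guaranteed(n: int, f: int) -> bool:
    """True if, among n - f responses, the honest ones (at least n - 2f)
    still outnumber the f possibly false ones."""
    return n - 2 * f > f


for n in (3, 4, 7, 10):
    f = max_faulty(n)
    print(f"N={n}: tolerates f={f}, safe with f+1 faults: "
          f"{honest_majority_guaranteed(n, f + 1)}")
```

For every N the loop shows that the bound is tight: the system is safe with `max_faulty(n)` faulty nodes but not with one more.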

3. The signature of each node cannot be forged or tampered with.

BFT

Traditional BFT has two solutions: the oral-message algorithm and the written (signed) message algorithm. Both have exponential complexity and are rarely used, so they are not covered here. What revitalized Byzantine fault tolerance was Practical Byzantine Fault Tolerance (PBFT), which brought the complexity of the algorithm down from exponential to polynomial so that it could be applied in distributed networks.

PBFT

PBFT algorithm flow

A primary (master) node is selected from the whole network: p = v mod R, where R is the number of nodes and v is the view number.
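The selection rule p = v mod R is a one-liner. The sketch below assumes nodes are numbered 0 to R - 1; the function name is hypothetical.

```python
def primary_for_view(view: int, num_replicas: int) -> int:
    """Number of the primary node in the given view: p = v mod R."""
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    return view % num_replicas


# With four nodes the primary role rotates as the view number grows.
print([primary_for_view(v, 4) for v in range(6)])
```

Because the primary depends only on the view number, every node can work out who the primary is without any extra communication.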

In the figure, C is the client, node 0 is the primary node (also called the ordering node), and the remaining nodes, through node 3, are called replicas. As the figure shows, node 3 never responds to any message it receives, so we can treat it as the faulty node.

The core of the process is phases 2, 3 and 4.

Let's start with the first phase.

1. REQUEST

The client c sends a request <REQUEST, o, t, c> to the primary node p. o: the operation requested; t: the timestamp the client attaches to the request; c: the client identifier. The REQUEST carries the message content m and the message digest d(m). The client signs the request.
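A REQUEST with its digest and signature might be modelled as below. This is a sketch under stated assumptions: the JSON encoding, the `Request` class and the helper names are invented for illustration, and an HMAC over a shared key stands in for the client's signature, whereas a real deployment would more likely use public-key signatures.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    operation: str   # o: the operation requested
    timestamp: int   # t: timestamp attached by the client
    client_id: str   # c: client identifier

    def encode(self) -> bytes:
        """Serialize <REQUEST, o, t, c> into the message content m."""
        fields = ["REQUEST", self.operation, self.timestamp, self.client_id]
        return json.dumps(fields).encode("utf-8")


def digest(message: bytes) -> str:
    """d(m): SHA-256 digest of the message content."""
    return hashlib.sha256(message).hexdigest()


def sign(key: bytes, message: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 with a shared key (an assumption)."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, signature: str) -> bool:
    """Check the signature using a constant-time comparison."""
    return hmac.compare_digest(sign(key, message), signature)


m = Request("set x=1", 1, "client-1").encode()
signature = sign(b"client-1-key", m)
print(digest(m), verify(b"client-1-key", m, signature))
```

Any change to m after signing makes `verify` return `False`, which is the property the first prerequisite relies on.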

2. PRE-PREPARE

When the primary node receives the client's request, it performs the following check:

(1) Whether the signature on the client's request message is correct.

Invalid requests are discarded. For a valid request, the primary assigns a sequence number n, which is used mainly to order client requests. It then broadcasts a <<PRE-PREPARE, v, n, d>, m> message to the other replica nodes. v: the view number; d: the digest of the client message; m: the message content. <PRE-PREPARE, v, n, d> is signed by the primary node. n must lie within a certain interval [h, H], which is explained later.

3. PREPARE

When replica node i receives the PRE-PREPARE message from the primary node, it performs the following checks:

a. Whether the primary node's signature on the PRE-PREPARE message is correct.

b. Whether the current replica has already received a PRE-PREPARE message with the same v and the same n but a different digest (if so, the new message is rejected).

c. Whether d matches the digest of m.

d. Whether n lies within the interval [h, H].

Invalid requests are discarded. For a valid request, replica i sends a <PREPARE, v, n, d, i> message to all other nodes, including the primary. v, n and d are the same as in the PRE-PREPARE message above, and i is the number of the current replica. <PREPARE, v, n, d, i> is signed by replica i. The replica records the PRE-PREPARE and PREPARE messages in its log so that unfinished requests can be recovered during a view change.
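Checks a to d can be sketched as a single predicate. The code below is a simplified illustration, not a full replica: `accept_pre_prepare` and the `log` dictionary are hypothetical, the signature check is passed in as a boolean, and the closed interval [h, H] follows the article's notation.

```python
import hashlib
from typing import Dict, Tuple


def accept_pre_prepare(
    log: Dict[Tuple[int, int], str],
    view: int,
    seq: int,
    digest: str,
    message: bytes,
    low: int,
    high: int,
    signature_ok: bool = True,
) -> bool:
    """Return True and record the message if it passes checks a to d."""
    # a. The primary's signature must be valid (verified elsewhere).
    if not signature_ok:
        return False
    # b. No earlier PRE-PREPARE for the same (v, n) with another digest.
    earlier = log.get((view, seq))
    if earlier is not None and earlier != digest:
        return False
    # c. d must really be the digest of m.
    if hashlib.sha256(message).hexdigest() != digest:
        return False
    # d. n must lie within the water-mark interval [h, H].
    if not low <= seq <= high:
        return False
    log[(view, seq)] = digest
    return True


log: Dict[Tuple[int, int], str] = {}
m = b"set x=1"
d = hashlib.sha256(m).hexdigest()
print(accept_pre_prepare(log, view=0, seq=1, digest=d, message=m, low=0, high=100))
```

A second PRE-PREPARE for the same view and sequence number but a different request is rejected by check b, which is how backups catch a primary that assigns one number to two requests.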

4. COMMIT

When the primary node or a replica node receives a PREPARE message, it performs the following checks:

a. Whether the replica node's signature on the PREPARE message is correct.

b. Whether the current replica has already received sequence number n in the same view v.

c. Whether n lies within the interval [h, H].

d. Whether d is the same as the d in the PRE-PREPARE message it has received.

Invalid requests are discarded. Once replica i has collected 2f + 1 verified PREPARE messages (the original PBFT paper counts this as the PRE-PREPARE plus 2f matching PREPAREs), it sends a <COMMIT, v, n, d, i> message to all other nodes, including the primary. v, n, d and i are the same as in the PREPARE message above. <COMMIT, v, n, d, i> is signed by replica i. The replica records the COMMIT message in its log so that unfinished requests can be recovered during a view change. It also records the PREPARE messages sent by other replicas in its log.
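Counting verified messages until the 2f + 1 threshold is reached could look like this. It is a minimal sketch that assumes signature and range checks have already passed; the `QuorumTracker` class is hypothetical and uses the 2f + 1 count given in this article.

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

Key = Tuple[str, int, int, str]  # (phase, view, sequence number, digest)


class QuorumTracker:
    """Collects PREPARE or COMMIT votes from distinct senders."""

    def __init__(self, f: int) -> None:
        self.f = f
        self.votes: DefaultDict[Key, Set[int]] = defaultdict(set)

    def add(self, phase: str, view: int, seq: int, digest: str, sender: int) -> bool:
        """Record one verified message; return True once 2f + 1 distinct
        senders agree on the same (phase, v, n, d)."""
        key = (phase, view, seq, digest)
        # A set ignores repeated messages from the same sender.
        self.votes[key].add(sender)
        return len(self.votes[key]) >= 2 * self.f + 1


tracker = QuorumTracker(f=1)
for sender in (0, 1, 1, 2):
    reached = tracker.add("PREPARE", 0, 1, "d", sender)
    print(sender, reached)
```

Storing senders in a set matters: a faulty node that repeats its own message many times still counts only once toward the quorum.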

5. REPLY

When the primary node or a replica node receives a COMMIT message, it performs the following checks:

a. Whether the replica node's signature on the COMMIT message is correct.

b. Whether the current replica has already received sequence number n in the same view v.

c. Whether d matches the digest of m.

d. Whether n lies within the interval [h, H].

Invalid requests are discarded. Once replica i has collected 2f + 1 verified COMMIT messages, most nodes in the network have reached agreement. The replica then executes the operation o requested by the client and returns <REPLY, v, t, c, i, r> to the client, where r is the result of the requested operation. If the client receives f + 1 identical REPLY messages, the request it initiated has reached consensus across the whole network; otherwise the client has to decide whether to resend the request to the primary node. The replica also records the COMMIT messages sent by other replicas in its log.
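On the client side, the f + 1 rule can be sketched as follows. The function name and the dictionary of replies are assumptions made for this example; matching on the timestamp t and verifying reply signatures are left out for brevity.

```python
from collections import Counter
from typing import Dict, Optional


def accepted_result(replies: Dict[int, str], f: int) -> Optional[str]:
    """replies maps replica number i to the result r it reported.

    Return the result backed by at least f + 1 replicas, or None if no
    result has enough support yet.
    """
    counts = Counter(replies.values())
    for result, count in counts.items():
        # f + 1 matching replies guarantee at least one honest replica.
        if count >= f + 1:
            return result
    return None


# f = 1: two matching replies are enough, even if a third one disagrees.
print(accepted_result({0: "ok", 1: "ok", 3: "corrupted"}, f=1))
```

f + 1 is sufficient here because at most f replicas are faulty, so at least one of the matching replies comes from an honest replica that only answers after the request has committed.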

Malicious nodes in PBFT

As shown above, when a replica node misbehaves, the other nodes simply discard its invalid messages. The primary node, which orders the requests, is therefore the main concern.

A malicious primary might assign the same sequence number to different requests, fail to assign a sequence number at all, or leave gaps between adjacent sequence numbers. The backup nodes are responsible for actively checking that the sequence numbers are valid. If a malicious primary drops a client's request or does not broadcast it, the client's timeout mechanism takes over: when the timeout expires, the client broadcasts the request to all replica nodes. When a replica node detects that the primary is malicious or offline, it initiates a view change.

View change

When a backup node believes there is a problem with the primary, it sends a view-change message to the other nodes. Among the nodes still alive, the one with the smallest number becomes the new primary (in the original PBFT protocol, the new primary is the node given by p = (v + 1) mod R). When the new primary has received 2f view-change messages from other nodes, enough nodes agree that the old primary is at fault, and it broadcasts a new-view message to the other nodes. Backup nodes do not initiate the new-view message. After sending new-view, the new primary resumes processing the requests left unfinished in the previous view, starting again from the pre-prepare phase. The other nodes verify the new-view message and then handle the pre-prepare messages sent by the new primary, and the PBFT process described above runs again. The view number is incremented by 1 and the system enters the new view.
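The condition under which a node may broadcast new-view can be sketched as below. Note the assumption: this example picks the new primary with the rule from the original PBFT protocol, (v + 1) mod R, rather than "smallest live node number", and both function names are hypothetical.

```python
from typing import Iterable


def next_primary(current_view: int, num_replicas: int) -> int:
    """Primary of the next view, assuming the rule p = (v + 1) mod R."""
    return (current_view + 1) % num_replicas


def ready_to_send_new_view(
    node_id: int,
    current_view: int,
    num_replicas: int,
    f: int,
    view_change_senders: Iterable[int],
) -> bool:
    """True if node_id is the next primary and has received view-change
    messages from at least 2f other nodes."""
    if node_id != next_primary(current_view, num_replicas):
        return False
    # Only messages from other nodes count toward the 2f threshold.
    others = set(view_change_senders) - {node_id}
    return len(others) >= 2 * f


# Four nodes, f = 1, primary 0 of view 0 is suspected by nodes 2 and 3.
print(ready_to_send_new_view(1, 0, 4, 1, {2, 3}))
```

In this run node 1 is the primary of view 1 and has heard from 2f = 2 other nodes, so it may broadcast new-view; any other node gets `False` regardless of how many messages it has seen.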

Some issues mentioned earlier but not yet explained (garbage collection, water marks)

Checkpoint:

A checkpoint is the number of the latest request the current node has processed. As mentioned under PRE-PREPARE, the primary node assigns a sequence number n to every request it receives. For example, if the request a node is currently reaching consensus on has number 101, then that node's checkpoint is 101.

Stable checkpoint:

A stable checkpoint is the largest request number on which a majority of nodes (2f + 1) have completed consensus. For example, in a system of four nodes, if three nodes have completed consensus on request number 213, then 213 is a stable checkpoint.

Why have a stable checkpoint?

The main goal is to reduce memory usage. Each node has to keep a record of every request that has gone through consensus, but if it kept recording forever the data would grow without bound, so a mechanism is needed to delete old records. How do we decide what to delete? It is simple: if the current stable checkpoint is 213, every record before 213 has already been agreed on, so those records can be deleted.
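Garbage collection against a stable checkpoint can be sketched as a simple filter over the message log. The log layout (a dictionary keyed by sequence number) and the function name are assumptions for this example, and the sketch also discards the entry at the checkpoint itself, which is one possible reading of "before 213".

```python
from typing import Dict, List


def garbage_collect(log: Dict[int, List[str]], stable_checkpoint: int) -> Dict[int, List[str]]:
    """Keep only log entries with a sequence number above the stable
    checkpoint; everything at or below it has already been agreed on."""
    return {seq: msgs for seq, msgs in log.items() if seq > stable_checkpoint}


log = {
    212: ["PRE-PREPARE", "PREPARE", "COMMIT"],
    213: ["PRE-PREPARE", "PREPARE", "COMMIT"],
    214: ["PRE-PREPARE", "PREPARE"],
}
print(sorted(garbage_collect(log, stable_checkpoint=213)))
```

Only request 214, which has not finished consensus, stays in the log after the collection.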

High and low water marks, the interval [h, H]:

Because nodes process requests at different speeds (among other reasons), the checkpoints of all nodes need to be kept within the same range. For example:

Node A's checkpoint is 1039 and node B's is 1133, and the current stable checkpoint of the system is 1034. The low water mark h is then 1034, and the high water mark is H = h + L, where L is a value we configure. If we set L to 100, the high water mark of the system is 1034 + 100 = 1134.

Suppose B's checkpoint has now reached 1134 while A's is still 1039. If a new request arrives for B, B waits until A has caught up to a request number close to B's, say 1112. At that point a mechanism updates the stable checkpoint on all nodes (for example, an update after every 100 REPLY messages), so the stable checkpoint can be set to 1100 and B can then process the new request. If L stays at 100, the high water mark becomes 1100 + 100 = 1200.
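The numbers in this example can be reproduced with a short sketch. The helper names are hypothetical, and the closed interval [h, H] follows the notation used above.

```python
from typing import Tuple


def water_marks(stable_checkpoint: int, window: int) -> Tuple[int, int]:
    """Return (h, H) with h = stable checkpoint and H = h + L."""
    return stable_checkpoint, stable_checkpoint + window


def in_window(seq: int, low: int, high: int) -> bool:
    """True if sequence number seq lies within [h, H]."""
    return low <= seq <= high


h, H = water_marks(1034, 100)
print(h, H, in_window(1135, h, H))   # B cannot go past the high water mark

h, H = water_marks(1100, 100)        # stable checkpoint advances to 1100
print(h, H, in_window(1135, h, H))   # now request 1135 fits in the window
```

The window therefore bounds how far a fast node such as B can run ahead of the stable checkpoint before it has to wait for the slower nodes.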

Origin blog.csdn.net/weixin_43122409/article/details/98171865