Big Data Foundation (3) - Common Consistency Protocols

1. Overall introduction

  This section introduces some common consistency protocols in distributed systems. Understanding these protocols helps greatly in understanding how distributed systems are designed.

2. Two-Phase Commit (2PC)

  2PC is a common way to solve the distributed transaction problem: it ensures that in a distributed transaction either all participating processes commit the transaction or all of them abort it. Either every replica of the data is changed, or none is, which guarantees strong data consistency.

  2PC divides the commit process into two consecutive phases, a voting phase (Voting) and a commit phase (Commit), consisting of the following sequence of operations:

(Figure: 2PC)

2.1 Voting phase

coordinator perspective

  • The coordinator sends a VOTE_REQUEST message to all participants and enters the waiting state;

participant perspective

  After the participant receives the VOTE_REQUEST message:

  • If ready, it sends a VOTE_COMMIT message to the coordinator and enters the commit phase
  • If not ready, it sends a VOTE_ABORT message to the coordinator, informing the coordinator that the transaction cannot be committed at this time

2.2 Commit phase

coordinator perspective

  Based on the votes collected in the voting phase, the coordinator decides the outcome of the transaction and notifies all participants:

  • If any one of the participants returned VOTE_ABORT, the coordinator multicasts a GLOBAL_ABORT message to all participants to cancel the transaction
  • If none of the participants returned VOTE_ABORT, the coordinator sends a GLOBAL_COMMIT message to all participants to commit their local transactions

participant perspective

  Participants that have sent their votes wait for the coordinator's decision (a code sketch of the full protocol follows this list):

  • If the participant receives a GLOBAL_COMMIT message, it commits the local transaction
  • If it receives a GLOBAL_ABORT message instead, it cancels the local transaction
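
  To make the two rounds concrete, here is a minimal single-process sketch of 2PC in Python. The Participant class and its method names are illustrative assumptions, not a real library's API, and the sketch deliberately ignores timeouts, logging, and crash recovery:

```python
from enum import Enum

class Vote(Enum):
    VOTE_COMMIT = 1
    VOTE_ABORT = 2

class Participant:
    """Illustrative participant: votes in phase 1, applies the decision in phase 2."""
    def __init__(self, ready: bool):
        self.ready = ready      # can the local transaction be prepared?
        self.committed = None   # None = undecided, True/False = final outcome

    def on_vote_request(self) -> Vote:
        # Voting phase: vote COMMIT only if the local transaction is ready.
        return Vote.VOTE_COMMIT if self.ready else Vote.VOTE_ABORT

    def on_global_commit(self) -> None:
        self.committed = True   # commit the local transaction

    def on_global_abort(self) -> None:
        self.committed = False  # roll back the local transaction

def two_phase_commit(participants) -> bool:
    # Phase 1 (voting): multicast VOTE_REQUEST and collect the votes.
    votes = [p.on_vote_request() for p in participants]
    # Phase 2 (commit): GLOBAL_COMMIT only if every vote was VOTE_COMMIT.
    if all(v is Vote.VOTE_COMMIT for v in votes):
        for p in participants:
            p.on_global_commit()
        return True
    for p in participants:
        p.on_global_abort()
    return False

print(two_phase_commit([Participant(True), Participant(True)]))   # True
print(two_phase_commit([Participant(True), Participant(False)]))  # False
```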

2.3 Disadvantages

single point of failure

  If the coordinator crashes, a new coordinator can be elected through an election algorithm, but the participants waiting in the second phase keep their resources locked, so anyone else who needs those resources is blocked. Even after a new coordinator takes over, the participants remain blocked.

synchronous blocking

  Participants block: after receiving the request in the first phase, they lock their resources in advance and do not release them until the COMMIT completes.

data inconsistency

  If the coordinator crashes partway through the second phase, some participants will receive the COMMIT request and some will not, resulting in data inconsistency.

  The three-phase commit protocol was introduced to address these problems of 2PC.

3. Three-Phase Commit Protocol (3PC)

  The core idea of 3PC is to split the commit phase of 2PC into two phases: a pre-commit phase (preCommit) and a commit phase (doCommit).

3.1 canCommit phase

  The canCommit phase of 3PC is very similar to the preparation phase of 2PC: the coordinator sends a commit request to the participants, and each participant returns a yes response if it can commit, otherwise a no response.

3.2 preCommit phase

  The coordinator decides whether to proceed with the preCommit operation according to the participants' responses in the canCommit phase.

  Depending on the response, there are two possibilities:

  1. The coordinator gets yes feedback from all participants: the pre-execution of the transaction is performed. The coordinator sends a preCommit request to all participants and enters the prepared phase. On receiving the preCommit request, each participant executes the transaction operations and records the undo and redo information in its transaction log. A participant that executes the transaction operations successfully returns an ACK response and starts waiting for the final command.
  2. At least one participant answers No, or the coordinator times out without receiving a response: the transaction must be interrupted, and the coordinator sends an abort request to all participants. After a participant receives the abort request from the coordinator, or times out waiting for a request from the coordinator, it interrupts the execution of the transaction.

3.3 doCommit phase

  The coordinator decides whether to proceed with the doCommit operation according to the participants' responses in the preCommit phase.

  Depending on the response, there are two possibilities:

  1. The coordinator gets ACK feedback from all participants: it moves from the pre-commit state into the commit state and sends a doCommit request to all participants. After receiving the doCommit request, each participant performs the formal transaction commit, releases all transaction resources once the commit is complete, and sends a haveCommitted ACK response to the coordinator. The coordinator finishes the transaction after receiving the ACK responses.
  2. The coordinator does not get ACK feedback from some participant, whether because the participant never sent an ACK response or because the response timed out: the execution of the transaction is interrupted (see the sketch below).
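
  As with 2PC above, here is a minimal sketch of the three phases, assuming a hypothetical Cohort API and omitting the timeout machinery that real 3PC relies on (in real 3PC, a participant that reached the pre-committed state and then hears nothing from the coordinator will eventually commit on its own):

```python
class Cohort:
    """Illustrative 3PC participant (hypothetical API, not a real library)."""
    def __init__(self, ok: bool = True):
        self.ok, self.state = ok, "INIT"

    def can_commit(self) -> bool:   # canCommit: feasibility check only
        return self.ok

    def pre_commit(self) -> bool:   # preCommit: execute, log undo/redo, ACK
        self.state = "PRE_COMMITTED"
        return True

    def do_commit(self) -> None:    # doCommit: finalize, release resources
        self.state = "COMMITTED"

    def abort(self) -> None:
        self.state = "ABORTED"

def three_phase_commit(cohorts) -> bool:
    # canCommit phase: every participant must answer yes.
    if not all(c.can_commit() for c in cohorts):
        for c in cohorts:
            c.abort()
        return False
    # preCommit phase: every participant must ACK after pre-executing.
    if not all(c.pre_commit() for c in cohorts):
        for c in cohorts:
            c.abort()
        return False
    # doCommit phase: make the commit final.
    for c in cohorts:
        c.do_commit()
    return True

print(three_phase_commit([Cohort(True), Cohort(True)]))   # True
print(three_phase_commit([Cohort(False), Cohort(True)]))  # False
```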

3.4 Disadvantages

  If the coordinator sends an abort request after entering the preCommit phase, and only one participant receives it and performs the abort, the other participants, which cannot learn the system state, will time out and continue to commit according to the 3PC rules, leaving the system in an inconsistent state.

  Another important algorithm is the Paxos algorithm. Zookeeper uses an improved variant of Paxos, which will be introduced in the following chapters.

4. Clocks

  In a distributed system, written data is not stored at a single point. For example, DB1 and DB2 can both provide write service at the same time, and both store a full copy of the data. Ideally, whichever DB a client writes to, it should not have to worry about the data becoming chaotic; but in real scenarios, parallel, simultaneous modifications often happen and cause data inconsistency. To solve this problem, we introduce the concept of a clock.

4.1 Logical Clocks (Lamport's Logical Clocks)

  • happens-before

  In order to synchronize logical clocks, Lamport defined a relation called happens-before, denoted as ->

  • a->b means that all processes agree that event a happened before event b.

  In two cases, this relationship can be easily obtained:

  1. If event a and event b are in the same process and event a occurs before event b, then a->b
  2. If process A sends a message m to process B, where a denotes the event that process A sends message m and b denotes the event that process B receives message m, then a->b (because the delivery of the message takes time)

  The happens-before relation is transitive: (a->b && b->c) -> (a->c)

  If event a and event b occur in different processes, and the two processes do not exchange messages, then neither a->b nor b->a can be deduced. Such a pair of events is called concurrent.

  Now it is necessary to define a function C on events such that [a->b] -> [C(a) < C(b)], and because C is used as a measure of time, C must only increase and never decrease.

  • Lamport algorithm

(Figure: Lamport algorithm)

  Three machines each run one process: P1, P2, and P3. Since the quartz crystals on different machines differ, the clock rates on different machines may differ. For example, while the machine running P1 has ticked 6 times, the machine running P2 has ticked 8 times.

  In the figure, P1 sends a message m1 to P2 with the sending clock 6 attached. When P2 receives m1, judging by P2's local clock at receipt, the transmission of the message appears to have taken 16 - 6 = 10 ticks.

  Subsequently, P3 sends a message m3 to P2 with the sending clock 60 attached. Since P2's clock runs slower than P3's, P2's local clock at receipt, 56, is smaller than the sending clock 60. This is unreasonable, so the clock must be adjusted: as shown in the figure, P2's 56 is adjusted to 61, i.e. the sending clock of m3 plus 1.

  • Implementation of Lamport logical clocks

  Each process Pi maintains a local counter Ci, which serves as its logical clock, and updates Ci according to the following rules (a code sketch follows the list):

  1. Before executing an event (such as sending a message over the network, handing a message to the application layer, or some other internal event), add 1 to Ci
  2. When Pi sends a message m to Pj, attach Ci to the message m
  3. When the receiving process Pj receives the message sent by Pi, it updates its own Cj = max{Cj, Ci} and then applies rule 1, since receiving the message is itself an event
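
  A minimal sketch of these three rules in Python; the LamportClock class is an illustrative assumption, not a real library:

```python
class LamportClock:
    """Minimal Lamport logical clock (illustrative, single-threaded sketch)."""
    def __init__(self, time: int = 0):
        self.time = time

    def tick(self) -> int:
        # Rule 1: increment before every local event.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Rule 2: sending is an event; attach the new timestamp to the message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Rule 3: take the max of the local and message clocks,
        # then tick, because receiving is itself an event.
        self.time = max(self.time, msg_time)
        return self.tick()

# Reproducing the m3 example above: P2's clock reads 56 when a message
# stamped 60 arrives, so P2 adjusts to max(56, 60) + 1 = 61.
p2 = LamportClock(time=56)
print(p2.receive(60))  # 61
```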

4.2 Vector Clocks

  Logical clocks guarantee (a->b) -> (C(a) < C(b)), but not (C(a) < C(b)) -> (a->b)

  The problem with logical clocks is that the real ordering of events a and b cannot be determined just by comparing C(a) and C(b), because the values of a logical clock carry no causal information. Vector clocks are introduced to fix this.

  A vector clock is essentially a set of version numbers (version number = logical clock). Suppose the data must be stored in 3 copies on 3 DBs (denoted A, B, and C); then the vector has 3 dimensions, each DB has its own version number starting from 0, and together they form the vector version [A:0, B:0, C:0].

  • DB_A——> [A:0, B:0, C:0]
  • DB_B——> [A:0, B:0, C:0]
  • DB_C——> [A:0, B:0, C:0]

  This guarantees both (a->b) -> (VC(a) < VC(b)) and (VC(a) < VC(b)) -> (a->b).

  • algorithmic logic

  Let VC(a) denote the vector clock of event a. It has the following property: from VC(a) < VC(b) it can be deduced that event a causally precedes event b (that is, event a happens before event b).

  Maintain a vector VC for each process Pi, which is the vector clock of Pi. This vector VC has the following properties:

  1. VCi[i] is the number of events that have occurred on process Pi so far
  2. VCi[k] is the number of events on process Pk that process Pi knows about (that is, Pi's knowledge of Pk)

  The VC of each process can be maintained by the following rules (similar to the Lamport algorithm; a code sketch follows the list):

  1. Process Pi increments VCi[i] by 1 each time before executing an event
  2. When Pi sends message m to Pj, attach VCi (vector clock of process Pi) to message m
  3. When the receiving process Pj receives a message sent by Pi, it updates its own VCj[k] = max{VCj[k], VCi[k]} for all k, and then applies rule 1, since receiving the message is itself an event
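
  A minimal sketch of these rules in Python, plus a compare helper for the causality test; both are illustrative assumptions, not a real library:

```python
from typing import Dict, Iterable

class VectorClock:
    """Minimal vector clock for a fixed set of processes (illustrative sketch)."""
    def __init__(self, process: str, processes: Iterable[str]):
        self.process = process
        self.vc: Dict[str, int] = {p: 0 for p in processes}

    def tick(self) -> None:
        # Rule 1: increment own component before each local event.
        self.vc[self.process] += 1

    def send(self) -> Dict[str, int]:
        # Rule 2: a send is an event; attach a copy of the vector to the message.
        self.tick()
        return dict(self.vc)

    def receive(self, msg_vc: Dict[str, int]) -> None:
        # Rule 3: component-wise max with the message's vector,
        # then tick own component (receiving is itself an event).
        for k in self.vc:
            self.vc[k] = max(self.vc[k], msg_vc.get(k, 0))
        self.tick()

def compare(a: Dict[str, int], b: Dict[str, int]) -> str:
    """'before' if a < b, 'after' if a > b, 'equal', or 'concurrent'."""
    le = all(a[k] <= b[k] for k in a)
    ge = all(a[k] >= b[k] for k in a)
    if le and ge:
        return "equal"
    if le:
        return "before"
    if ge:
        return "after"
    return "concurrent"  # neither precedes the other: a potential conflict
```
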
  • example description

  normal circumstances:

  • Step 1: In the initial state, all machines are [A:0, B:0, C:0];
DB_A——> [A:0, B:0, C:0]
DB_B——> [A:0, B:0, C:0]
DB_C——> [A:0, B:0, C:0]
  • Step 2: Assume the current application is a shopping mall, and a price of 6888 is now entered for an iphone13. The client randomly selects a DB machine to write to; assuming A is selected, the data is as follows:
{key=iphone_price; value=6888; vclk=[A:1,B:0,C:0]}
  • Step 3: Next, A will synchronize the data to B and C; so the final synchronization result is as follows
DB_A——> {key=iphone_price; value=6888; vclk=[A:1,B:0,C:0]}
DB_B——> {key=iphone_price; value=6888; vclk=[A:1,B:0,C:0]}
DB_C——> {key=iphone_price; value=6888; vclk=[A:1,B:0,C:0]}

  • Step 4: After 2 minutes, the price fluctuated and dropped to 6588, so the salesman updated the price. This time the system randomly selected B as the write storage, so the result looks like this:

DB_A——> {key=iphone_price; value=6888; vclk=[A:1,B:0,C:0]}
DB_B——> {key=iphone_price; value=6588; vclk=[A:1,B:1,C:0]}
DB_C——> {key=iphone_price; value=6888; vclk=[A:1,B:0,C:0]}
  • Step 5: B then synchronizes the update to the other replicas
DB_A——> {key=iphone_price; value=6588; vclk=[A:1,B:1,C:0]}
DB_B——> {key=iphone_price; value=6588; vclk=[A:1,B:1,C:0]}
DB_C——> {key=iphone_price; value=6588; vclk=[A:1,B:1,C:0]}

  The synchronizations above all completed normally; the following is an example of an abnormal situation:

  • Step 6: The price fluctuates again and becomes 4000; this time C is chosen for the write:
DB_A——> {key=iphone_price; value=6588; vclk=[A:1,B:1,C:0]}
DB_B——> {key=iphone_price; value=6588; vclk=[A:1,B:1,C:0]}
DB_C——> {key=iphone_price; value=4000; vclk=[A:1,B:1,C:1]}
  • Step 7: C synchronizes the update to A and B, but because of some problem it only reaches A. The results are as follows:
DB_A——> {key=iphone_price; value=4000; vclk=[A:1,B:1,C:1]}
DB_B——> {key=iphone_price; value=6588; vclk=[A:1,B:1,C:0]}
DB_C——> {key=iphone_price; value=4000; vclk=[A:1,B:1,C:1]}
  • Step 8: The price fluctuates again and becomes 6000, and the system chooses B for the write
DB_A——> {key=iphone_price; value=4000; vclk=[A:1,B:1,C:1]}
DB_B——> {key=iphone_price; value=6000; vclk=[A:1,B:2,C:0]}
DB_C——> {key=iphone_price; value=4000; vclk=[A:1,B:1,C:1]}
  • Step 9: A problem occurs when B synchronizes the update to A and C. A's own vector clock is [A:1,B:1,C:1], while the vector clock carried by the update message is [A:1,B:2,C:0]: B:2 is newer than B:1, but C:0 is older than C:1. Neither clock descends from the other, so an inconsistency conflict is detected. The vector clock only tells you that the current data conflicts; resolving the conflict is still up to you, as the check below shows.
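
  Using the compare helper from the sketch in 4.2, the step-9 conflict can be detected mechanically:

```python
# DB_A's own clock after step 7 vs. the clock carried by B's update in step 9.
local  = {"A": 1, "B": 1, "C": 1}
update = {"A": 1, "B": 2, "C": 0}
print(compare(local, update))  # "concurrent": a conflict the application must resolve
```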

5. RWN protocol

  The RWN protocol was proposed by Amazon when implementing the Dynamo KV storage system. By configuring how reads and writes are performed across the multiple replicas of the data, it allows the consistency of a distributed system to be analyzed and constrained.

  • R: the minimum number of replicas that must be read successfully for a read operation to succeed;
  • W: the minimum number of replicas that must be written successfully for an update operation to succeed;
  • N: the number of replicas of the data kept in the distributed storage system;

  If R + W > N is satisfied, every read quorum overlaps every write quorum in at least one replica, so a read is guaranteed to see the latest successful write; such a configuration satisfies the data consistency protocol.
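
  A one-line check of this condition, shown with two common configurations (the function name is an illustrative choice):

```python
def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """R + W > N forces every read quorum to overlap every write quorum
    in at least one replica, so a read always sees the latest write."""
    return r + w > n

print(is_strongly_consistent(n=3, r=2, w=2))  # True: quorums must overlap
print(is_strongly_consistent(n=3, r=1, w=1))  # False: a read can miss the write
```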

  When implementing a system, the RWN protocol alone cannot complete the consistency guarantee: it is also necessary to determine which version of the data is the latest, which requires combining RWN with the vector clock described earlier.

