Transaction Processing in Distributed Systems

When we use a single server to provide data services in production, we run into two problems:

1) The performance of a single server is not enough to serve all network requests.

2) The server may go down at any time, making the service unavailable or losing data.

So we have to scale out: add more machines to share the load and eliminate the single point of failure. Typically, we extend our data services in two ways:

1) Data partitioning: put different blocks of data on different servers (for example, by uid % 16, consistent hashing, and so on).

2) Data mirroring: let every server hold the same data and provide equivalent service.

For the first case, we cannot solve the problem of data loss: when a single server fails, part of the data is lost. So high availability of data services can only be achieved through the second method, redundant storage of data (the industry generally regards three copies as a safe number of replicas, as in Hadoop and Dynamo). But adding more machines makes our data services much more complex, especially transaction processing across servers, that is, cross-server data consistency. This is a hard problem. Let's illustrate it with the most classic use case: "Account A transfers money to account B". Anyone familiar with RDBMS transactions knows that going from account A to account B takes six operations:

  1. Read the balance of account A.
  2. Subtract the amount from account A's balance.
  3. Write the result back to account A.
  4. Read the balance of account B.
  5. Add the amount to account B's balance.
  6. Write the result back to account B.

For data consistency, these six operations must either all succeed or all fail, and while they are in progress any other access to accounts A and B must be locked out. Locking means excluding other read and write operations; otherwise we end up with dirty data. This is a transaction. Now, as we add more machines, things get complicated:
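To make this concrete, here is a minimal single-process sketch of the transfer above (the account names, starting balances and the single lock are illustrative assumptions; they stand in for the row locks and rollback that an RDBMS would provide):

```python
import threading

balances = {"A": 100, "B": 50}
accounts_lock = threading.Lock()

def transfer(src, dst, amount):
    with accounts_lock:                      # exclude other readers and writers
        old_src, old_dst = balances[src], balances[dst]
        try:
            a = balances[src]                # 1. read A's balance
            if a < amount:
                raise ValueError("insufficient funds")
            balances[src] = a - amount       # 2-3. subtract and write back to A
            b = balances[dst]                # 4. read B's balance
            balances[dst] = b + amount       # 5-6. add and write back to B
        except Exception:
            # "rollback": restore both balances if anything went wrong
            balances[src], balances[dst] = old_src, old_dst
            raise

transfer("A", "B", 30)
print(balances)   # {'A': 70, 'B': 80}
```

On a single machine, one lock and an in-memory rollback are enough; the rest of this article is about what happens when the two accounts, or their replicas, live on different machines.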

 

1) In the data partitioning scheme: what if the data of account A and account B are not on the same server? Then we need a transaction that spans machines. That is, if the deduction from A succeeds but the addition to B fails, we have to roll back A's operation, and doing this across machines is much more complicated.

2) In the data mirroring scheme: the transfer between account A and account B can be done on one machine, but don't forget that several machines hold copies of accounts A and B. What if two concurrent operations on account A (one transferring to B, one to C) arrive at two different servers? In other words, with data mirroring, how do we keep writes to the same data consistent across servers and free of conflicts?

At the same time, we also have to consider performance. If we did not care about performance, guaranteeing transactions would not be hard; we would simply let the system be slow. Besides performance, we also have to consider availability: if one machine goes down, no data is lost and the other machines can continue to provide the service. So we need to focus on the following:

1) Disaster recovery: no data loss, failover of nodes

2) Data consistency: transaction processing

3) Performance: throughput, response time

As mentioned earlier, the only way to avoid losing data is data redundancy. Even with data partitioning, each partition needs its own redundancy. This is the data replica: when the data on some node is lost, it can still be read from a replica; replicas are the only means a distributed system has against data loss. So, for simplicity, this article only discusses data consistency and performance under data redundancy. In short:

1) To make data highly available, we must write multiple copies of it.

2) Writing multiple copies leads to data consistency problems.

3) Data consistency problems lead to performance problems.

This is software development: press down the gourd and up floats the ladle, as the saying goes; fix one problem and another pops up.

Consistency Model

Speaking of data consistency, there are roughly three kinds (of course, if you subdivide there are many more consistency models, such as sequential consistency, FIFO consistency, session consistency, single-read consistency and single-write consistency; for the sake of simplicity and readability, this article only covers the following three):

1) Weak consistency: after you write a new value, a read on a replica may or may not see it. Examples: some cache systems; the data of other players in online games that has nothing to do with you; systems like VOIP; or the Baidu search engine (heh).

2) Eventual consistency: after you write a new value, a read may not see it right away, but it is guaranteed to become visible within some time window. Examples: DNS, email, Amazon S3, and the Google search engine.

3) Strong consistency: once new data is written, any replica reads the new value at any time. Examples: file systems, RDBMSs, and Azure Table are strongly consistent.
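As a toy illustration of the difference (not any real system; the three-replica list and the "update the others later" behaviour are assumptions made just for this sketch), consider a value stored on three replicas:

```python
import random

replicas = [{"x": 1}, {"x": 1}, {"x": 1}]

def write_async(key, value):
    replicas[0][key] = value     # only one replica is updated now;
    # a background process would propagate the value to the others later.

def write_sync(key, value):
    for r in replicas:           # all replicas are updated before returning
        r[key] = value

def read(key):
    return random.choice(replicas)[key]   # read from any replica

write_async("x", 2)
print(read("x"))   # may print 1 (stale) or 2: weak / eventual consistency
write_sync("x", 3)
print(read("x"))   # always prints 3: strong consistency
```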

From these three consistency models we can see that Weak and Eventual generally go with asynchronous redundancy, while Strong generally goes with synchronous redundancy. Asynchronous usually means better performance, but also more complex state control; synchronous means simplicity, but also a performance penalty. Now let's look, step by step, at the techniques that are available:

Master-Slave

The first is the Master-Slave structure. In this structure, the Slave is generally a backup of the Master. Such a system is usually designed as follows:

1) The master is responsible for read and write requests.

2) After the write request is written to the Master, the Master synchronizes it to the Slave.

Synchronization from the Master to the Slave can be asynchronous or synchronous, and can be driven by the Master pushing or by the Slave pulling. Usually the Slave pulls periodically, so this is eventual consistency. The problem with this design is that if the Master crashes within a pull interval, the data written in that time slice is lost. If you cannot afford to lose data, the Slave can only become read-only and wait for the Master to recover.

Of course, if you can tolerate losing some data, you can let the Slave take over from the Master immediately (for nodes that only do computation, there is no data consistency or data loss problem, so Master-Slave handles the single point of failure well). Master-Slave can also be made strongly consistent. For example, when we write, the Master writes itself first, and only after that succeeds does it write the Slave; only when both succeed does the whole operation succeed, and the whole process is synchronous. If writing the Slave fails, there are two options: one is to mark the Slave as unavailable, report the error and keep serving (after the Slave recovers it re-synchronizes from the Master; there can be several Slaves, so losing one still leaves backups, just like the three copies mentioned earlier); the other is to roll back the Master and return a write failure. (Note: we generally do not write the Slave first, because if the write to the Master then fails, the Slave has to be rolled back, and if that rollback fails too, the data has to be corrected manually.) You can see how complicated it is to make Master-Slave strongly consistent.
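Here is a rough sketch of that synchronous write path (all class and function names are made up for illustration; a real system would replicate over the network and persist the data):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.available = True

    def write(self, key, value):
        if not self.available:
            raise IOError(f"{self.name} is unavailable")
        self.data[key] = value

def replicated_write(master, slave, key, value, on_slave_failure="degrade"):
    old = master.data.get(key)
    master.write(key, value)             # write the Master first
    try:
        slave.write(key, value)          # then synchronously write the Slave
    except IOError:
        if on_slave_failure == "degrade":
            slave.available = False      # mark the Slave unavailable, keep serving
        else:
            if old is None:              # or roll the Master back and fail the write
                master.data.pop(key, None)
            else:
                master.data[key] = old
            raise

master, slave = Node("master"), Node("slave")
replicated_write(master, slave, "balance", 100)
print(master.data, slave.data)   # both hold balance=100
```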

Master-Master

Master-Master, also known as Multi-Master, means the system has two or more Masters, each providing read and write service. This model is an enhanced version of Master-Slave: synchronization between the Masters is generally asynchronous, so it is eventually consistent. The advantage of Master-Master is that when one Master goes down, the other Masters can still serve reads and writes normally. Like Master-Slave, data that has not yet been copied to the other Masters is lost when a Master fails. Many databases support a Master-Master replication mechanism.

In addition, if multiple Masters modify the same data, the nightmare of this model appears: merging conflicting data is not an easy task. Look at the design of Dynamo's Vector Clock (which records the data's version number and who modified it) and you will see that this is not simple; moreover, Dynamo leaves data conflicts to the user to resolve. It is just like an SVN source-code conflict: a conflict on the same line of code can only be handled by the developer. (We will return to Dynamo's Vector Clock later in this article.)

Two/Three Phase Commit

This protocol is abbreviated 2PC and is known as two-phase commit. In a distributed system, each node knows whether its own operation succeeded or failed, but it cannot know whether the operations on other nodes succeeded or failed. When a transaction spans multiple nodes, in order to preserve the ACID properties of the transaction, we have to introduce a component that acts as a coordinator, collects the operation results of all the nodes (called participants), and finally tells these nodes whether to actually commit their results (for example, write the updated data to disk). The two-phase commit algorithm goes as follows:

Phase 1:

  1. The coordinator asks all participating nodes whether they can perform the commit operation.
  2. Each participant begins preparing to execute the transaction: for example, locking resources, reserving resources, writing the undo/redo log...
  3. Each participant responds to the coordinator with "can commit" if its preparation succeeded, and "refuse to commit" otherwise.

Phase 2:

  • If all participants respond "can commit", the coordinator sends the "formal commit" command to all participants. Each participant completes the formal commit, releases all its resources, and responds "completed". After collecting the "completed" responses from all nodes, the coordinator ends the global transaction.
  • If any participant responds "refuse to commit", the coordinator sends a "rollback" command to all participants; each participant rolls back, releases all its resources, and responds "rollback completed". After collecting the "rollback completed" responses from all nodes, the coordinator cancels the global transaction.
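A minimal single-process simulation of these two phases might look like this (the `Participant` class and its methods are illustrative assumptions; a real implementation would exchange these messages over the network and handle the timeouts discussed below):

```python
class Participant:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.prepared = False

    def prepare(self):        # phase 1: lock resources, write the undo/redo log
        self.prepared = self.will_succeed
        return self.prepared  # "can commit" / "refuse to commit"

    def commit(self):         # phase 2a: make the change durable, release locks
        print(f"{self.name}: committed")

    def rollback(self):       # phase 2b: undo the prepared work, release locks
        self.prepared = False
        print(f"{self.name}: rolled back")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: collect the votes
    if all(votes):
        for p in participants:
            p.commit()                            # phase 2: global commit
        return True
    for p in participants:
        if p.prepared:
            p.rollback()                          # phase 2: global rollback
    return False

print(two_phase_commit([Participant("A"), Participant("B")]))          # True
print(two_phase_commit([Participant("A"), Participant("B", False)]))   # False
```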

To put it bluntly, 2PC is an algorithm that votes in the first phase and decides in the second, and we can also see that it is a strongly consistent algorithm. Earlier we discussed the strongly consistent Master-Slave strategy, which is similar to 2PC, except that 2PC is more conservative: try first, commit later. 2PC is used a lot. Some system designs chain a series of calls together, such as A -> B -> C -> D, where each step allocates some resources or rewrites some data. For example, the order-placing operation of a B2C online shop triggers a series of back-end steps; if we did them one by one and some step could not be done, the resources allocated by every earlier step would have to be reclaimed by a compensating operation, which is complicated. Many workflow systems now borrow the 2PC idea and use a try -> confirm flow to make sure the whole flow completes successfully. As a common example, a Western church wedding has a scene like this:

1) The pastor asks the bride and the groom separately: are you willing to... in sickness and in health, till death do you part... (the inquiry phase)

2) When both the groom and the bride answer yes (locking the resource for a lifetime), the pastor says: I now pronounce you... (the commit of the transaction)

What a classic two-phase-commit transaction. We can also see some problems in it: A) one is that it is a synchronous blocking operation, which inevitably hurts performance a lot; B) the other major problem is timeouts, for example:

1) If, in the first phase, a participant never receives the inquiry, or a participant's response never reaches the coordinator, then the coordinator has to handle a timeout. Once it times out, it can treat the transaction as failed, or it can retry.

2) If, in the second phase, some participants never receive the formal commit, or the confirmation a participant sends after committing or rolling back never comes back, then once a participant's response times out, the coordinator either retries, or marks that participant as a problem node and removes it from the cluster, which keeps the remaining service nodes consistent.

3) The bad case is in the second phase: if a participant never receives the coordinator's commit/rollback instruction, the participant is stuck in a "state unknown" phase and has no idea what to do. For example, suppose all participants have finished replying in the first phase (maybe all yes, maybe all no, maybe some yes and some no) and the coordinator crashes at that point. Then none of the nodes knows what to do (asking the other participants does not help). For the sake of consistency, they either wait for the coordinator to come back, or the yes/no commands of the first phase have to be reissued.

The biggest problem with two-phase commit is case 3): if a participant never receives the second-phase decision after the first phase has completed, that data node is stuck "at a loss", which blocks the whole transaction. In other words, the coordinator is critical to completing the transaction, and the coordinator's availability is the key. This is why three-phase commit (3PC) was introduced. As Wikipedia describes it, 3PC splits the first phase of two-phase commit into two: first ask, then lock the resources, and only commit for real at the end.

The core idea of three-phase commit is not to lock resources while asking; only when everyone has agreed does locking of resources begin.

In theory, if all nodes return success in the first phase, there is good reason to believe the final commit will very likely succeed, which reduces the probability that a participant (cohort) ends up in an unknown state. In other words, once a participant has received the PreCommit, it knows that everyone has in fact agreed to the change. This point is very important. In the 3PC state transition diagram, the dotted lines mark Failure (F) or Timeout (T) transitions, and the states mean: q – Query, a – Abort, w – Wait, p – PreCommit, c – Commit.

Following those dotted F/T transitions, we can see the advantage of three-phase commit over two-phase commit: if an F/T event happens while a node is in the P state (PreCommit), 3PC can move on directly to the C state (Commit), whereas 2PC is simply stuck.
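One way to picture this is as a table of what a participant does when it times out waiting for the coordinator, using the states just listed (a simplified sketch of the behaviour implied by the state diagram, not a full protocol implementation):

```python
# 3PC: a participant that has reached PreCommit may commit on timeout,
# because PreCommit means every participant has already agreed.
TIMEOUT_ACTION_3PC = {
    "q": "abort",    # never voted: safe to abort
    "w": "abort",    # voted yes but no PreCommit seen: nobody can have committed
    "p": "commit",   # PreCommit received: everyone agreed, commit on timeout
    "c": "done",
    "a": "done",
}

# 2PC: a participant that has voted yes can only block and wait.
TIMEOUT_ACTION_2PC = {
    "q": "abort",
    "w": "block",    # voted yes: outcome unknown, must wait for the coordinator
    "c": "done",
    "a": "done",
}

def on_coordinator_timeout(protocol, state):
    table = TIMEOUT_ACTION_3PC if protocol == "3PC" else TIMEOUT_ACTION_2PC
    return table[state]

print(on_coordinator_timeout("2PC", "w"))   # block
print(on_coordinator_timeout("3PC", "p"))   # commit
```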

In practice, though, three-phase commit is a rather complicated thing, it is quite difficult to implement, and it has problems of its own.

By this point I believe you have many, many questions, and you are probably running through all sorts of failure scenarios in 2PC/3PC in your head. You will find that Timeout is very hard to deal with, because a timeout on the network often leaves you unable to do anything: you do not know whether the other side did the operation or not. And so your beautiful state machine becomes mere decoration because of Timeout.

A network service call has three possible outcomes: 1) success, 2) failure, 3) timeout, and the third is an absolute nightmare, especially when you need to maintain state.

Two Generals Problem

The Two Generals Problem is a thought experiment: two armies, each led by a general, are preparing to attack a fortified city. Both armies are encamped near the city, each on its own hill, with a valley separating them, and the only way for the two generals to communicate is to send messengers back and forth across the valley. Unfortunately, the valley is occupied by the city's defenders, and any messenger sent through it may be captured. Note that while the two generals have agreed to attack the city, they did not agree on a time for the attack before taking up their positions. Both armies must attack the city at the same time to succeed, so they must communicate to agree on a time and commit to attacking at that time. If only one general attacks, it would be a catastrophic failure. The thought experiment asks how they could pull this off. Here is one line of thinking:

1) The first general sends a message: "Let's attack at 9 am." But once the messenger is dispatched, there is no way to know whether he made it through the valley. Any bit of uncertainty makes the first general hesitate to attack, because if the second general does not attack at the same moment, the city's garrison will repel his army and destroy it.

2) Knowing this, the second general needs to send a confirmation back: "I received your message and will attack at 9." But what if the messenger carrying the confirmation is captured? So the second general hesitates over whether his confirmation will arrive.

3) So it seems the first general has to send one more confirmation: "I received your confirmation." But what if that messenger is captured?

4) In that case, does the second general need to send a "confirmation that I received your confirmation"?

Damn. You can see that this quickly escalates into a situation where, no matter how many confirmation messages are sent, there is no way for the two generals to be confident enough that their messengers were not captured by the enemy.

This problem is unsolvable. The Two Generals Problem and the proof of its unsolvability were first published by E. A. Akkoyunlu, K. Ekanadham and R. V. Huber in 1975 in the paper "Some Constraints and Trade-offs in the Design of Network Communications", where page 73 describes the problem in terms of communication between two gangs of gangsters. In 1978 it was named the Two Generals Paradox in Jim Gray's "Notes on Data Base Operating Systems" (starting on page 465), which is widely cited as the source of the definition and of the proof of unsolvability.

This experiment is intended to illustrate the pitfalls and design challenges of trying to coordinate an operation through communication over an unreliable connection.

In engineering terms, a practical approach to the Two Generals Problem is to use a scheme that tolerates the unreliability of the channel: it does not try to eliminate the unreliability, only to reduce it to an acceptable level. For example, the first general could dispatch 100 messengers and reckon that the probability of all of them being captured is low; in that case he attacks whether or not the second general attacks or any news arrives. Also, the first general can send a stream of messages and the second general can acknowledge each of them, so that every delivered message makes both generals a little more confident. Yet, as the proof shows, neither of them can be certain that the attack is coordinated; there is no algorithm (for example, attack once more than 4 messages have been received) that can guarantee that one side never attacks alone. The first general can also number his messages 1, 2, ... up to n; this lets the second general know how lossy the channel really is and send back an appropriate number of acknowledgements to make sure the last message gets through. If the channel is reliable, a single message is enough and the rest help little: the last message is just as likely to be lost as the first.
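A quick back-of-the-envelope calculation shows why the "send many messengers" compromise works in practice (assuming, purely for illustration, that each messenger is captured independently with probability 0.5):

```python
def p_message_lost(p_capture, n_messengers):
    """Probability that every one of n independent messengers is captured."""
    return p_capture ** n_messengers

for n in (1, 5, 100):
    print(n, p_message_lost(0.5, n))
# 1   0.5
# 5   0.03125
# 100 about 7.9e-31  (vanishingly small, but never exactly zero)
```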

The Two Generals Problem can be extended into the even nastier Byzantine Generals Problem. The background story goes like this: Byzantium, located where Istanbul, Turkey stands today, was the capital of the Eastern Roman Empire. Because the territory of the Byzantine empire was so vast, for defense each army was stationed far apart, and the generals could only pass messages by messenger. In wartime, all the generals of the Byzantine army must reach a consensus on whether there is a chance of winning before attacking the enemy's camp. However, the army may contain traitors and enemy spies, and these traitorous generals can disrupt or sway the decision-making process. The question is: knowing that some members have turned traitor, how can the remaining loyal generals reach a unanimous agreement without being misled by the traitors? This is the Byzantine Generals Problem.

Paxos algorithm

Wikipedia's descriptions of the various Paxos algorithms are very detailed and well worth reading.

The problem the Paxos algorithm solves is how a distributed system in which the above failures can occur reaches consensus on a value, so that no matter which of those failures happens, the consistency of the decision is never broken. A typical scenario: in a distributed database system, if the initial state of every node is the same and every node executes the same sequence of operations, they will all end up in the same state. To make every node execute the same command sequence, a "consensus algorithm" has to be run for every instruction, to ensure that the instructions seen by every node are consistent. A general consensus algorithm can be applied in many scenarios and is an important problem in distributed computing; research on consensus algorithms has not stopped since the 1980s.

Notes: The Paxos algorithm is a message-passing consensus algorithm proposed in 1990 by Leslie Lamport (the "La" in LaTeX, now at Microsoft Research). Because the algorithm was hard to understand, it attracted little attention at first, so eight years later, in 1998, Lamport republished it in ACM Transactions on Computer Systems (The Part-Time Parliament). Even then Paxos was still not taken seriously, and in 2001 Lamport, feeling that his colleagues could not appreciate his sense of humor, restated it in plainer language (Paxos Made Simple). Clearly Lamport has a soft spot for Paxos. The popularity of Paxos in recent years also proves its important position among distributed consensus algorithms: in 2006, Google's three famous papers marked the beginning of the "cloud" era, and among them the Chubby lock service used Paxos as the consensus algorithm within a Chubby cell; Paxos's popularity has soared ever since. (Lamport himself wrote on his website about the story behind the nine years it took to publish this algorithm.)

Note: in Amazon's AWS, all cloud services are built on an ALF (Async Lock Framework) framework, which uses the Paxos algorithm. When I was at Amazon and watched the internal sharing video, the designer said in the internal Principle Talk that he had referred to ZooKeeper's approach, but implemented the algorithm in another way that he found more readable than ZooKeeper's.

Simply put, the purpose of Paxos is to let the nodes of a cluster agree on a change to a value. The Paxos algorithm is basically a democratic election algorithm: the decision of the majority becomes the decision of the whole cluster. Any node can propose a modification to some piece of data, and whether the proposal passes depends on whether more than half of the nodes in the cluster agree (which is why Paxos clusters usually have an odd number of nodes).

The algorithm has two phases (assume a cluster of three nodes: A, B, C):

Phase 1: the Prepare phase

A sends a Prepare request for the modification to all nodes A, B and C. Note that Paxos uses a Sequence Number (think of it as a proposal number; it keeps increasing and is unique, that is, A and B cannot use the same proposal number), and this proposal number is sent along with the modification request. In the Prepare phase, any node rejects a request whose proposal number is smaller than the current one it has seen. So when node A sends its modification request to all nodes, it carries a proposal number, and the newer the proposal, the larger the number.

If the proposal number n that a receiving node gets is larger than the proposal numbers it has received from other nodes, the node responds Yes (attaching the most recently accepted proposal on this node, if any) and promises not to accept any proposal numbered less than n. In this way, a node always commits to the newest proposal during the Prepare phase.

Optimization: during the Prepare process above, if a node finds that it already has a proposal with a higher number, it notifies the proposer, reminding it to abandon this proposal.

Phase 2: the Accept phase

If proposer A receives Yes from more than half of the nodes, it sends an Accept request to all nodes (again carrying the proposal number n); if it does not get more than half, it returns failure.

When a node receives the Accept request, if n is the largest proposal number this node has seen, it applies the modification; if the node finds it already has a proposal with a larger number, it rejects the modification.
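Below is a heavily simplified, single-process sketch of these two phases for a single value (the names and structure are my own; it leaves out networking, failures, learners and the livelock issue a real Paxos implementation must handle). The key behaviour it shows is that once a value has been accepted by a majority, later proposals end up re-proposing that same value:

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number promised
        self.accepted_n = -1       # highest proposal number accepted
        self.accepted_value = None

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n    # promise to ignore proposals numbered < n
            return True, self.accepted_n, self.accepted_value
        return False, self.accepted_n, self.accepted_value

    def accept(self, n, value):
        if n >= self.promised_n:   # n is still the highest number we promised
            self.promised_n = self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1: Prepare
    promises = [a.prepare(n) for a in acceptors]
    granted = [p for p in promises if p[0]]
    if len(granted) <= len(acceptors) // 2:
        return None                # no majority: the proposal fails
    # If some acceptor already accepted a value, we must propose that value;
    # this is what makes the cluster converge on a single value.
    prior = max(granted, key=lambda p: p[1])
    if prior[2] is not None:
        value = prior[2]
    # Phase 2: Accept
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="x"))   # 'x' is chosen
print(propose(acceptors, n=2, value="y"))   # still 'x': the later proposal adopts it
```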

We can see that this looks like an optimized "two-phase commit". In fact, 2PC/3PC can be regarded as defective versions of a distributed consensus algorithm; Mike Burrows, the author of Google Chubby, has said that there is only one consensus algorithm in the world, namely Paxos, and that all other algorithms are defective versions of it.

We can also see that even if modification proposals for the same value from different nodes arrive at a receiver out of order, nothing goes wrong.

For some examples, take a look at the "Paxos Examples" section of the Chinese Wikipedia article; I will not repeat them here. You can also work through the failure cases of the Paxos algorithm yourself; you will find that basically, as long as more than half of the nodes survive, there is no problem.

One more thing: since Lamport published the Paxos algorithm in 1998, improvements to Paxos have never stopped, and the biggest step was Fast Paxos, published in 2005. Whatever the improvement, the focus remains on trading off message latency against performance and throughput. To distinguish the two concepts, the former is called Classic Paxos and the improved latter Fast Paxos.

Summary

Earlier we said that to make data highly available we must write multiple redundant copies of it; writing multiple copies brings consistency problems, and consistency problems bring performance problems. Comparing the approaches above, we basically cannot make all the items green at once. This is the famous CAP theorem: of consistency, availability and partition tolerance, you can only have two.

NWR model

Finally, I want to mention Amazon Dynamo's NWR model. The NWR model hands the CAP choice over to the user, letting the user choose which two of the CAP properties they want.

In the so-called NWR model, N is the number of replicas, W is the number of replicas that must be written for a write to be considered successful, and R is the number of replicas that must be read. The configuration must satisfy W + R > N. Because W + R > N, we have R > N - W. What does that mean? The number of replicas read must be greater than the total number of replicas minus the number of replicas that guarantee a successful write.

In other words, every read is guaranteed to hit at least one replica holding the latest version, so you never read only stale data. When we need a write-heavy environment, we can configure W = 1; with N = 3 that means R = 3. A write then succeeds as soon as any one node is written, but a read has to read from all nodes. If we want high read efficiency, we can configure W = N and R = 1: reading from any single node counts as success, but a write only succeeds once all three nodes have been written.
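The rule is easy to check in code. Here is a tiny sketch covering the two configurations above plus the balanced N = 3, W = 2, R = 2 setup that is commonly used in practice (the function name is of course made up):

```python
def quorums_overlap(n, w, r):
    # Any read quorum of size R must overlap any write quorum of size W,
    # so at least one of the R replicas read holds the latest written value.
    return w + r > n

for (n, w, r) in [(3, 1, 3),   # write-optimized: write any one, read all
                  (3, 3, 1),   # read-optimized: write all, read any one
                  (3, 2, 2)]:  # balanced configuration
    print(n, w, r, quorums_overlap(n, w, r))
# all three print True; (3, 1, 1) would print False and could return stale data
```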

Some NWR configurations can produce dirty data, because NWR is obviously not a strongly consistent protocol in the way Paxos is: successive reads and writes may not land on the same set of nodes, so some nodes may not hold the latest version of the data even while the latest operations are being performed.

So Amazon Dynamo introduced data versioning. That is, if you read data whose version is v1, and when you try to write the result of your computation back you find that the data's version has already moved to v2, the server rejects your write. Versioning works like optimistic locking.

However, for a distributed system using the NWR model, versions bring their own nightmare: version conflicts. For example, say we set N = 3 and W = 1. A value is accepted on node A and its version goes from v1 to v2, but it has not yet been synchronized to node B (replication is asynchronous, and after all W = 1 means writing one copy is enough), so node B still holds v1. Now node B receives a write request. Logically it should reject the request, but on the one hand it does not know that other nodes have already moved to v2, and on the other hand it cannot reject it, because W = 1 means writing one node counts as success. The result is a serious version conflict.

Amazon's Dynamo cleverly sidesteps the version conflict problem: version conflicts are left to the user to handle.

So Dynamo introduced the Vector Clock design. The idea is that each node records its own version information; that is, for the same piece of data, two things are recorded: 1) who updated me, and 2) what my version number is.

Next, let's look at a sequence of operations:

1) A write request is handled by node A for the first time. Node A adds version information (A, 1); call this data D1(A, 1). Then another request for the same key is again handled by A, giving D2(A, 2). D2 can overwrite D1, and there is no conflict.

2) Now assume D2 is propagated to all nodes (B and C). The data B and C receive did not come from clients but was copied to them, so they do not generate new version information; B and C still hold D2(A, 2). The data and version numbers on A, B and C are now the same.

3) If a new write request reaches node B, node B produces D3(A, 2; B, 1), which means: data D has been updated three times in total, twice by A and once by B. Doesn't this look just like a version-control log?

4) If another request is handled by C before D3 has propagated to C, the data on node C becomes D4(A, 2; C, 1).

5) Now comes the most interesting part: suppose a read request arrives. Remember that our configuration is W = 1 and R = N = 3, so R reads from all three nodes, and it gets back three versions:

    • Node A: D2(A, 2)
    • Node B: D3(A, 2; B, 1)
    • Node C: D4(A, 2; C, 1)

6) At this point it can be determined that D2 is an old version (it is already contained in both D3 and D4) and can be discarded.

7) But D3 and D4 are an obvious version conflict. So it is left to the caller to resolve the conflict, just like source-code version control.
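The comparison that steps 6) and 7) perform can be sketched as follows (a toy version with made-up names; Dynamo's real reconciliation also carries the actual data and client context along with the clocks):

```python
def dominates(a, b):
    # Clock a "descends from" clock b if a has seen everything b has seen.
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(versions):
    # Keep only the versions that no other version dominates; if more than one
    # survives, that is a conflict handed back to the caller to merge.
    survivors = []
    for name, clock in versions:
        if not any(other is not clock and dominates(other, clock)
                   for _, other in versions):
            survivors.append((name, clock))
    return survivors

D2 = ("D2", {"A": 2})
D3 = ("D3", {"A": 2, "B": 1})
D4 = ("D4", {"A": 2, "C": 1})
print(reconcile([D2, D3, D4]))   # D2 is discarded; D3 and D4 conflict
```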

Obviously, the Dynamo configuration described above chooses the A and P of CAP.

I highly recommend reading the paper "Dynamo: Amazon's Highly Available Key-Value Store"; if the English is painful, you can read a translation (translator unknown).

 

http://coolshell.cn/articles/10910.html
