Let's talk about distributed transactions, let's talk about solutions

Foreword

It has been a while since my last post — partly because work at the company has kept me busy, and partly because I was heads-down on the next stage of CAP's development, which has now wrapped up.

Now let's get to today's topic: distributed transactions — or rather, distributed transactions as I see them, since everyone's understanding of the term may differ.

Distributed transactions are a hard problem in enterprise integration, and something almost every distributed architecture has to deal with; in a microservice architecture they are all but unavoidable. This article walks through the topic from start to finish.

Database transactions

Before we get to distributed transactions, let's start with database transactions. Everyone is familiar with them and uses them constantly during development, yet many of us are still fuzzy on the details. Most people can recite the four properties of a database transaction — Atomicity, Consistency, Isolation, and Durability, ACID for short — but dig one level deeper and things get murky: What exactly does isolation mean? Which isolation levels do databases implement, and how do they differ from one another? Many people can't answer.

This article is not going to cover database transactions in depth; if you are interested, plenty of material is a search away. There is one piece of background we do need, though: if the power suddenly goes out while the database is committing a transaction, how does it recover? Why bring this up? Because the core of a distributed system is handling all kinds of abnormal situations, and that is also what makes distributed systems complex: the network environment is messy, and "power-off"-style failures occur far more often than on a single machine, so they are the first thing to consider when building one. These failures include machine crashes, network faults, lost messages, out-of-order messages, corrupted data, unreliable TCP, loss of stored data, and more.

So how does a local database guarantee data consistency when the power goes out? Take SQL Server as an example. A SQL Server database consists of two files — a data file and a log file — and the log file is usually much larger than the data file. Whenever the database performs a write, it writes the log first. Likewise, when we execute a transaction, the database first records the transaction's redo/undo information in the log and flushes the log to disk before actually modifying the data. If the power fails mid-operation, then on restart the database replays the log according to the state it finds: it redoes the work of committed transactions and undoes the work of uncommitted ones. This is what guarantees strong consistency of the data.
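To make the recovery idea concrete, here is a deliberately simplified sketch of log-based crash recovery — all names are hypothetical, and real engines such as SQL Server use far more elaborate (ARIES-style) algorithms. The point is only the shape of the idea: redo the writes of committed transactions, undo the writes of uncommitted ones.

```java
import java.util.*;

// Toy model of write-ahead-log recovery (illustrative only).
public class WalRecovery {
    // One log record: which transaction wrote which key, with before/after values.
    record LogRecord(int txId, String key, Integer before, Integer after) {}

    // After a crash, the on-disk state may contain writes of uncommitted
    // transactions. Recovery redoes committed writes and undoes uncommitted ones.
    public static Map<String, Integer> recover(Map<String, Integer> disk,
                                               List<LogRecord> log,
                                               Set<Integer> committed) {
        Map<String, Integer> db = new TreeMap<>(disk);
        for (LogRecord r : log)                          // redo pass: forward order
            if (committed.contains(r.txId())) db.put(r.key(), r.after());
        for (int i = log.size() - 1; i >= 0; i--) {      // undo pass: reverse order
            LogRecord r = log.get(i);
            if (committed.contains(r.txId())) continue;
            if (r.before() == null) db.remove(r.key());  // restore the "before" image
            else db.put(r.key(), r.before());
        }
        return db;
    }

    public static void main(String[] args) {
        // tx1 committed; tx2's write reached disk but never committed before the crash.
        List<LogRecord> log = List.of(
                new LogRecord(1, "a", null, 10),
                new LogRecord(2, "b", null, 20));
        Map<String, Integer> disk = new TreeMap<>(Map.of("a", 10, "b", 20));
        System.out.println(recover(disk, log, Set.of(1))); // {a=10}
    }
}
```

After recovery, tx1's write survives and tx2's write is rolled back, exactly as if tx2 had never happened.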

Next, let's talk about distributed transactions.

Distributed theory

When a single database hits a performance bottleneck, we may partition it — physically, so that after partitioning, different libraries may live on different servers. At that point ACID no longer fits the situation: in such a clustered environment it is extremely hard to guarantee ACID across the cluster, and even where it is achievable, efficiency and performance suffer badly; worst of all, it becomes very hard to add new partitions. Insisting on cluster-wide ACID here makes the system worse, not better. We need a new theoretical framework to match the clustered case, and that is the CAP principle, or CAP theorem. So what does the CAP theorem say?

CAP theorem

The CAP theorem was proposed by Professor Eric Brewer of the University of California, Berkeley. It states that a web service cannot provide all three of the following guarantees at the same time:

  • Consistency: every read sees the result of the most recent write — all clients observe the same data at the same time
  • Availability: every request receives a response, though not necessarily one reflecting the most recent write
  • Partition tolerance: the system keeps operating even when messages between nodes are lost or delayed

Concretely, in a distributed system, no matter how the database layer is designed, a web application can support at most two of these three properties at once. And since any scale-out strategy relies on data partitioning, partition tolerance is a given — designers are really forced to choose between consistency and availability.

So far, this theorem has held up in every distributed system. Why do I say that?

At this point some readers may bring up the database's 2PC (two-phase commit). Fine — let's take a look at it.

Anyone who has looked into database-level distributed transactions will know the 2PC support built into databases, also known as XA Transactions.

MySQL has supported it since version 5.5, SQL Server since 2005, and Oracle since version 7.

XA is a two-phase commit protocol, split into the following two phases:

  • Phase 1: the transaction coordinator asks every database involved in the transaction to pre-commit the operation and report whether it can commit.
  • Phase 2: the transaction coordinator asks every database to commit its data.

If any one database vetoes the commit, all databases are asked to roll back their part of the transaction. What's the downside? At first glance we seem to get consistency across database partitions.
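As a sketch of the protocol just described — a toy model, not a real XA implementation — a coordinator could look like this. Participants are modeled as vote functions, every participant is polled in phase 1, and a single veto turns phase 2 into a rollback for everyone:

```java
import java.util.*;
import java.util.function.Predicate;

// Toy two-phase commit coordinator (protocol sketch only, not real XA).
public class TwoPhaseCommit {
    public static String run(String txId, List<Predicate<String>> participants) {
        // Phase 1: ask every participant to pre-commit and collect its vote.
        boolean allPrepared = true;
        for (Predicate<String> p : participants) {
            allPrepared &= p.test(txId); // poll everyone, even after a veto
        }
        // Phase 2: commit only if every vote was yes; otherwise everyone rolls back.
        return allPrepared ? "COMMIT" : "ROLLBACK";
    }

    public static void main(String[] args) {
        Predicate<String> yes = tx -> true;   // a database that can pre-commit
        Predicate<String> no  = tx -> false;  // a database that vetoes

        System.out.println(run("tx-1", List.of(yes, yes))); // COMMIT
        System.out.println(run("tx-2", List.of(yes, no)));  // ROLLBACK
    }
}
```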

If the CAP theorem holds, this must come at the cost of availability.

Say a system's availability is the product of the availabilities of all components involved in performing an operation. Then during two-phase commit, overall availability is the product of the availability of each participating database. Suppose each database is 99.9% available during 2PC; with two databases involved, the result is 99.9% × 99.9% ≈ 99.8%. With 43,200 minutes in a month, 99.9% availability means 43,157 minutes of uptime, while 99.8% means 43,114 minutes — roughly 43 extra minutes of downtime per month.
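The arithmetic above can be checked directly — availabilities of components in series multiply, and the product converts into minutes of uptime per month (class and method names here are just for illustration):

```java
// Compound availability of components in series, and the uptime it implies.
public class Availability {
    // n components in series: overall availability is the product of each.
    public static double compound(double perComponent, int components) {
        return Math.pow(perComponent, components);
    }

    public static double upMinutes(double availability, double minutesPerMonth) {
        return availability * minutesPerMonth;
    }

    public static void main(String[] args) {
        double month = 43_200;                               // minutes in a 30-day month
        double one = upMinutes(0.999, month);                // one database: ~43,157 min
        double two = upMinutes(compound(0.999, 2), month);   // two databases in 2PC: ~43,114 min
        System.out.printf("extra downtime per month: %.0f minutes%n", one - two); // 43
    }
}
```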

So the CAP theorem checks out in theory. Let's leave CAP here for now; we'll come back to it later.

BASE theory

In distributed systems we usually pursue availability, ranking its importance above consistency. So how do we achieve high availability? Our predecessors proposed another theory for that, the BASE theory, which further extends the CAP theorem. BASE stands for:

  • Basically Available
  • Soft state
  • Eventually consistent

BASE is the result of trading off consistency against availability in CAP. Its core idea: when strong consistency is unattainable, each application can adopt an approach suited to its own business characteristics so that the system reaches eventual consistency.

With that theory in hand, let's look at the distributed transaction problem.

Distributed transactions

In a distributed system, the options for implementing distributed transactions boil down to the handful of patterns below.

1. Two-phase commit (2PC)

Like the database XA transactions from the previous section, two-phase commit is built on the XA protocol. The figure below shows the flow, including details in the middle such as commit and abort.

Two-phase commit trades some availability for consistency. Implementation-wise, in .NET you can program two-phase commit in a distributed system with the API provided by TransactionScope — WCF, for example, implements part of this functionality. Across multiple servers, however, you must rely on a DTC to achieve transaction consistency: Windows has the MSDTC service, while on Linux the situation is rather bleak.

Also note that, by default, TransactionScope cannot maintain transaction consistency across asynchronous methods, because the transaction context is stored on the current thread; for async methods you need to flow the transaction context explicitly.

Advantages: keeps data as strongly consistent as possible; suitable for critical domains that demand strong consistency. (In reality, even 2PC cannot guarantee strong consistency 100% of the time.)

Disadvantages: complex to implement, sacrifices availability, and carries a heavy performance cost, so it is a poor fit for high-concurrency, high-performance scenarios; and for distributed systems that call each other across service interfaces, the .NET world currently has no ready implementation.

2. Compensation Transaction (TCC)

TCC is essentially a compensation mechanism. Its core idea: every operation must register a corresponding confirmation and a compensation (undo) operation. It has three phases:

  • The Try phase checks the business system and reserves resources.

  • The Confirm phase actually commits against the business system. It only begins after Try has succeeded and is assumed not to fail: the convention is that if Try succeeded, Confirm must succeed.

  • The Cancel phase runs when business execution has gone wrong and must be rolled back: it cancels the operation and releases the reserved resources.

For example, suppose Bob wants to transfer money to Smith. The idea goes roughly like this: we have a local method that, in turn,
1. In the Try phase, calls the remote interfaces to freeze the money on both Smith's and Bob's accounts.
2. In the Confirm phase, performs the remote transfer and, on success, unfreezes the funds.
3. If step 2 succeeds, the transfer is done; if it fails, we call the unfreeze method (Cancel) that corresponds to the remote freeze interface.
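The three steps above can be sketched as follows. This is a toy in-process model with hypothetical names; in a real TCC setup, tryFreeze/confirm/cancel would be remote interfaces:

```java
import java.util.*;

// Toy Try/Confirm/Cancel flow for a transfer (all names illustrative).
public class TccTransfer {
    final Map<String, Integer> balance = new TreeMap<>();
    final Map<String, Integer> frozen  = new TreeMap<>();

    TccTransfer(int bob, int smith) {
        balance.put("bob", bob);
        balance.put("smith", smith);
    }

    // Try: reserve (freeze) the amount on the sender's account.
    boolean tryFreeze(String from, int amount) {
        if (balance.get(from) < amount) return false;
        balance.merge(from, -amount, Integer::sum);
        frozen.merge(from, amount, Integer::sum);
        return true;
    }

    // Confirm: the frozen money actually moves to the receiver.
    void confirm(String from, String to, int amount) {
        frozen.merge(from, -amount, Integer::sum);
        balance.merge(to, amount, Integer::sum);
    }

    // Cancel: compensation — release the reservation, restoring the balance.
    void cancel(String from, int amount) {
        frozen.merge(from, -amount, Integer::sum);
        balance.merge(from, amount, Integer::sum);
    }

    public static void main(String[] args) {
        TccTransfer t = new TccTransfer(100, 0);
        if (t.tryFreeze("bob", 30)) {       // step 1: Try
            t.confirm("bob", "smith", 30);  // step 2: Confirm (pretend the remote call succeeded)
        }
        System.out.println(t.balance);      // {bob=70, smith=30}
    }
}
```

Had the confirm step failed, the caller would invoke cancel("bob", 30) instead, restoring Bob's original balance.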

Advantages: compared with 2PC, the implementation and flow are relatively simple — at the cost of weaker data consistency than 2PC.

Disadvantages: fairly obvious — steps 2 and 3 can both still fail. TCC is an application-layer compensation scheme, so programmers end up writing a lot of compensation code, and in some scenarios the business flow is hard to define and handle with TCC at all.

3. Local message table (asynchronous guarantee)

The local message table is probably the most widely used scheme in industry. Its core idea is to split a distributed transaction into local transactions. The idea originated at eBay. The flow chart below shows some of the details:

The basic idea is:

The message producer adds an extra message table and records each message's delivery status. The message table and the business data are committed in the same transaction — which means they must live in the same database. The message is then delivered to the consumer via MQ; if delivery fails, it is retried.

The message consumer processes the message and completes its own business logic. If its local transaction succeeds, processing is done. If processing fails, it is retried. If the failure is a business failure, the consumer can send a compensation message back to the producer, telling it to roll back.

Both producer and consumer periodically scan the local message table and re-send any messages still unprocessed or failed. Paired with reliable automated reconciliation and repair logic, this scheme is very practical.
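The producer side of this scheme — the message table committed together with the business data, plus the periodic sweep — can be sketched like this (all names are illustrative; plain lists stand in for the database tables and the MQ):

```java
import java.util.*;

// Sketch of the local message table (a.k.a. transactional outbox).
public class OutboxSketch {
    enum Status { PENDING, SENT }
    static class Msg {
        final long id; final String payload; Status status = Status.PENDING;
        Msg(long id, String payload) { this.id = id; this.payload = payload; }
    }

    final List<Msg> messageTable = new ArrayList<>(); // lives in the business database
    final List<String> broker = new ArrayList<>();    // stand-in for the MQ

    // "Local transaction": the business row and its message succeed or fail together.
    void placeOrder(String order) {
        messageTable.add(new Msg(messageTable.size() + 1, order));
    }

    // Periodic sweep: publish every PENDING message, mark it SENT on success.
    void sweep() {
        for (Msg m : messageTable) {
            if (m.status == Status.PENDING) {
                broker.add(m.payload); // would be an MQ publish in real life
                m.status = Status.SENT;
            }
        }
    }

    public static void main(String[] args) {
        OutboxSketch db = new OutboxSketch();
        db.placeOrder("order-1");
        db.sweep();
        db.sweep(); // re-running the sweep sends nothing twice
        System.out.println(db.broker); // [order-1]
    }
}
```

Because a message is only marked SENT after publishing succeeds, delivery failures are simply retried on the next sweep — which is why consumers must handle duplicate deliveries idempotently.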

This scheme follows the BASE theory and settles for eventual consistency. In my view it fits real business scenarios best: it avoids the implementation complexity of 2PC (whose availability collapses as the call chain grows) and TCC's risk of the confirm or rollback step itself failing.

Pros: a classic implementation that sidesteps distributed transactions and achieves eventual consistency. Ready-made solutions exist in .NET.

Disadvantages: the message table couples into the business system, and without a packaged solution there is a lot of plumbing to deal with.

4. MQ transaction message

Some third-party MQs support transactional messages, RocketMQ among them, in a way similar to two-phase commit. But many mainstream MQs on the market — RabbitMQ and Kafka, for instance — do not support them.

Taking Alibaba's RocketMQ middleware as an example, the idea goes roughly like this:

In the first stage, a Prepared message is sent and the producer receives the message's address (handle).
In the second stage, the local transaction executes; in the third stage, the producer uses the address obtained in the first stage to access the message and update its state.

In other words, within one business method you issue two requests to the message queue: one to send the message and one to confirm it. If the confirmation fails to arrive, RocketMQ periodically scans the cluster for transactional messages; whenever it finds one still in the Prepared state, it checks back with the sender. The producer must therefore implement a check interface, and RocketMQ uses the sender's configured policy to decide whether to roll the message back or go ahead and confirm it. This ensures the message send succeeds or fails together with the local transaction.
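The flow can be simulated in a few lines — this mimics the handshake described above, not the real RocketMQ client API: a prepared "half" message stays invisible to consumers until confirmed, and the broker's periodic check-back resolves half messages whose confirmation was lost:

```java
import java.util.*;
import java.util.function.LongPredicate;

// Simulation of a transactional-message handshake (not the real RocketMQ API).
public class TxMessageSim {
    final Map<Long, String> half = new LinkedHashMap<>(); // Prepared: invisible to consumers
    final List<String> visible = new ArrayList<>();       // deliverable to consumers
    private long nextId = 1;

    long prepare(String body) { long id = nextId++; half.put(id, body); return id; }
    void confirm(long id)     { String b = half.remove(id); if (b != null) visible.add(b); }
    void rollback(long id)    { half.remove(id); }

    // Broker-side check-back: ask the producer whether each in-limbo tx committed.
    void checkBack(LongPredicate producerSaysCommitted) {
        for (long id : new ArrayList<>(half.keySet())) {
            if (producerSaysCommitted.test(id)) confirm(id); else rollback(id);
        }
    }

    public static void main(String[] args) {
        TxMessageSim mq = new TxMessageSim();
        long a = mq.prepare("msg-a");
        mq.confirm(a);            // normal path: local tx succeeded, confirm arrived
        mq.prepare("msg-b");      // pretend this confirmation was lost in transit
        mq.checkBack(id -> true); // producer's check interface reports: committed
        System.out.println(mq.visible); // [msg-a, msg-b]
    }
}
```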

Unfortunately, RocketMQ has no .NET client. For more on RocketMQ, see this blog post.

Advantages: Achieves eventual consistency and does not need to rely on local database transactions.

Disadvantages: hard to implement, unsupported by mainstream MQs, no .NET client, and part of RocketMQ's transactional-message code is not open source.

5. Sagas transaction model

The Saga transaction model, also known as the long-running transaction, was proposed by H. Garcia-Molina et al. at Princeton University. It describes how to solve complex business transaction problems in a distributed system without two-phase commit. The Sagas paper can be found here.

What I describe here is a workflow transaction model built on the Sagas mechanism. The theory behind this model is still quite new — so new that a Baidu search turns up almost nothing about it.

The core idea of the model is to split the long transaction in a distributed system into multiple short transactions — multiple local transactions — with a Sagas workflow engine coordinating them. If the whole flow completes normally, the business operation succeeds; if any step fails along the way, the workflow engine invokes the compensating operations in reverse order to roll the business back.

For example, a "buy a travel package" operation involves three steps: reserving a car, booking a hotel, and booking a flight. They hit three different remote interfaces, so from the program's perspective they may not belong to one transaction — but from the business perspective they do.

Their execution order is shown in the figure above; when a failure occurs, the compensating cancel operations run in reverse order.

Because the long transaction is split into many business steps, the most important piece of the Sagas model is the workflow engine — or, if you prefer, the process manager. A workflow engine and a process manager are not the same thing, but here they play the same role. With a workflow engine chosen, the final code might look like this:

SagaBuilder saga = SagaBuilder.newSaga("trip")
        .activity("Reserve car", ReserveCarAdapter.class) 
        .compensationActivity("Cancel car", CancelCarAdapter.class) 
        .activity("Book hotel", BookHotelAdapter.class) 
        .compensationActivity("Cancel hotel", CancelHotelAdapter.class) 
        .activity("Book flight", BookFlightAdapter.class) 
        .compensationActivity("Cancel flight", CancelFlightAdapter.class) 
        .end()
        .triggerCompensationOnAnyError();

camunda.getRepositoryService().createDeployment() 
        .addModelInstance(saga.getModel()) 
        .deploy();

Here is a related C# example for interested readers.

I won't list pros and cons here, because the theory is still new and there is no off-the-shelf solution on the market; even in the Java world I could not find much useful material.

Distributed Transaction Solution: CAP

You may have seen the solutions above described elsewhere, but usually with no working code or open source behind them — which isn't much help. So let me serve up the good stuff below.

In the .NET world there seems to be no off-the-shelf distributed transaction solution — or rather there is, but it is not open source. As far as I know, some companies do have such solutions internally, but they are core company assets and closed source...

For those reasons, I decided to write one myself and open-source it. I started in early 2017 and, after more than half a year of steady polishing, the result is CAP, below.

Github CAP: the CAP here is not the CAP theorem, but the name of a .NET distributed transaction solution.

Detailed introduction: http://www.cnblogs.com/savorboard/p/cap.html
Related documents: http://www.cnblogs.com/savorboard/p/cap-document.html

To put it boldly: this solution comes with a visual interface (a Dashboard), where you can see at a glance which messages succeeded and which failed, and whether the failure was in sending or in processing.

Even better, the Dashboard provides real-time dynamic charts: you can watch messages being sent and processed live, see the current processing rate, and view the historical message throughput over the past 24 hours.

Better still, the solution integrates Consul for distributed node discovery, registration, and heartbeat checks, so you can see the status of other nodes at any time.

Best of all — think you need to log in to each node's Dashboard console to see its data? Wrong. Open the Dashboard of any one node and you can switch to any other node's console with a single click, viewing its data as if it were local. The nodes are fully decentralized.

Think that's all? No — there's much more:

  • CAP also supports RabbitMQ, Kafka, and other message queues
  • CAP also supports SQL Server, MySql, PostgreSql, and other databases
  • The CAP Dashboard ships with both Chinese and English interfaces, so the language is no barrier
  • CAP exposes rich extension interfaces — serialization, custom processing, and custom sending are all easy to plug in
  • CAP is open source under the MIT license, so you are free to build on it (just remember to keep the MIT License)

Think I'm done now? Not yet!

You can also use CAP as an EventBus. CAP's message processing capability is excellent — don't worry about it becoming the bottleneck, as long as your database configuration is up to it...

Summary

In this article we covered two distributed systems theories — CAP and BASE — and summarized and compared the pros and cons of several distributed transaction schemes. Distributed transactions are inherently hard: no single solution handles every scenario perfectly, so the choice must fit the business. Finally, we introduced CAP, a distributed transaction solution based on local messages.

If you found this article helpful, thank you for your [recommendation].


The address of this article: http://www.cnblogs.com/savorboard/p/distributed-system-transaction-consistency.html
Author blog: Savorboard
Reprints are welcome; please credit the author and link to the source prominently.
