Distributed transactions and consistency solutions for distributed systems

In a distributed system, it is impossible to satisfy all three of "consistency", "availability" and "partition tolerance" at the same time. Transaction consistency in distributed systems is a hard technical problem. Which solutions work better?
In OLTP systems, many business scenarios require transactional consistency; the classic case is Bob transferring money to Smith. In traditional enterprise development, the system usually exists as a single application and does not span multiple databases.
We usually only need the data access technologies and frameworks specific to the development platform (such as Spring, JDBC, ADO.NET), combined with the transaction management of the relational database, to meet transactional requirements. Relational databases usually provide the ACID properties: Atomicity, Consistency, Isolation, and Durability.
Large-scale Internet platforms are often composed of a series of distributed systems, and their development languages, platforms and technology stacks are relatively complex. Especially in today's SOA and microservice architectures, a seemingly simple function may need to call several internal "services" and operate on multiple databases or shards, so the situation is often much more complicated. No single technique or solution can cope with all of these complex scenarios.
Characteristics of distributed systems
Readers who have studied distributed systems may have heard of the "CAP theorem", "BASE theory" and so on. The naming is a neat coincidence: in chemistry ACID is an acid, and BASE happens to be alkaline. The author will not explain these concepts in detail here; interested readers can consult the relevant references. The CAP theorem is illustrated below:
CAP theorem

In a distributed system, it is impossible to simultaneously satisfy the "consistency", "availability" and "partition tolerance" of the CAP theorem. That is even harder than finding a partner who is "white, rich, beautiful" all at once. In most scenarios in the Internet field, strong consistency has to be sacrificed in exchange for high availability. The system often only needs to guarantee "eventual consistency", as long as the time it takes to converge is within a range acceptable to users.
Distributed transactions
Any discussion of distributed systems is bound to mention distributed transactions, and to understand distributed transactions we first have to introduce the two-phase commit protocol. Let me give a simple but not entirely accurate example to illustrate.
In the first phase, Teacher Zhang, as the "coordinator", sends a WeChat message to Xiaoqiang and Xiaoming (the participants, or nodes), asking them to gather at the school gate at 8 o'clock tomorrow to go mountain climbing together, and then waits for Xiaoqiang and Xiaoming to reply.
In the second phase, if both Xiaoqiang and Xiaoming reply "no problem", then everyone shows up as scheduled. If either Xiaoqiang or Xiaoming replies "I'm not free tomorrow, I can't go", then Teacher Zhang immediately notifies both Xiaoqiang and Xiaoming that "the climbing activity is cancelled".
Careful readers will notice that many things can go wrong in this process. If Xiaoqiang does not look at his phone, Teacher Zhang keeps waiting for a reply. Xiaoming may have prepared all his climbing gear at home but is still waiting for Teacher Zhang's confirmation. More seriously, if Xiaoqiang still has not replied by 8 o'clock tomorrow, even if that counts as a "timeout", should Xiaoming go to the meeting point or not?
These are the shortcomings of the two-phase commit protocol, so the industry later introduced the three-phase commit protocol to address this type of problem.
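The two phases in the climbing example can be sketched roughly as follows. This is only a minimal in-memory illustration: the Participant class and the two_phase_commit function are hypothetical names introduced here, and timeouts, persistence and coordinator failure (exactly the shortcomings described above) are deliberately not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Participant:
    """A hypothetical resource that votes on and then applies a transaction."""
    name: str
    can_commit: bool = True   # how this participant will vote in phase 1
    state: str = "init"       # init -> prepared -> committed / aborted

    def prepare(self) -> bool:
        # Phase 1: vote. A real participant would also persist its vote here.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Return True if everyone committed, False if everyone was rolled back."""
    # Phase 1: ask every participant to prepare; a single "no" vote aborts.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the coordinator's decision to every participant.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.rollback()
    return decision
```

Calling two_phase_commit([Participant("Xiaoqiang"), Participant("Xiaoming")]) commits both participants, while a single participant created with can_commit=False causes every participant to be rolled back.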
The two-phase commit protocol is widely used and implemented in mainstream development platforms and database products. Here is the DTP model diagram provided by the X/Open organization:

DTP model provided by X/Open

The XA protocol refers to the interface between the TM (Transaction Manager) and the RM (Resource Manager). The current mainstream relational database products all implement the XA interface. JTA (Java Transaction API) conforms to the X/Open DTP model, and the XA protocol is likewise used between its transaction manager and resource managers. In essence, it also relies on the two-phase commit protocol to implement distributed transactions. Let's look at the model diagrams for a successful and a failed XA transaction:
XA success
XA failed

On the JavaEE platform, mainstream commercial application servers such as WebLogic and WebSphere provide JTA implementations and support. Tomcat does not (in fact, the author does not think Tomcat can be regarded as a JavaEE application server), so a third-party framework such as JOTM or Atomikos is needed, both of which support integration with Spring transactions.
On the Windows .NET platform, you can program against the TransactionScope API (System.Transactions) together with ADO.NET, and you must configure and use the MSDTC service of the Windows operating system. If your database is MySQL and it is deployed on Linux, then distributed transactions cannot be supported this way. Due to space limitations this is not expanded on here; interested readers can consult the relevant materials and try it themselves.
Summary: This approach is not too difficult to implement and is better suited to traditional monolithic applications in which a single method performs operations across databases. However, distributed transactions have a relatively large impact on performance and are not suitable for scenarios with high concurrency and high performance requirements.
Providing rollback interfaces
In a service-oriented architecture, function X needs to coordinate the back-end atomic services A, B, or even more. So the question is: what if one of the calls to A and B fails?
The author often encounters this problem at work, and usually provides a BFF (Backend for Frontend) layer to coordinate the calls to services A and B. If some of the results need to be returned synchronously, the calls are made "serially" where possible. If the call to A fails, B is not called blindly. If the call to A succeeds but the call to B fails, the BFF layer tries to roll back the call to A that was just made.
Of course, sometimes we do not strictly need to provide a separate rollback interface; the same effect can be achieved neatly by passing parameters.
In this case, we try our best to call the services that can provide rollback interfaces first. Take an example:
One of our forums rewards users with 5 points after they log in successfully each day, but points and users are two independent subsystem services backed by different DBs, which makes this harder to control. Solution:
1. Put the service calls for login and for adding points in one local method of the BFF layer.
2. When the user calls the login interface, first add the points, and only perform the login after the points have been added successfully.
3. If the login succeeds, all is well and the points have also been added. If the login fails, call the rollback interface for the points addition (that is, subtract the points).
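The three steps above might look roughly like this in a BFF-layer method. It is a sketch under assumptions: PointsService, UserService and login_with_bonus are hypothetical in-memory stand-ins for the two remote subsystems, and the once-per-day rule is left out.

```python
class PointsService:
    """Hypothetical stand-in for the remote points subsystem."""

    def __init__(self):
        self.balances = {}

    def add_points(self, user_id: str, amount: int) -> None:
        self.balances[user_id] = self.balances.get(user_id, 0) + amount

    def subtract_points(self, user_id: str, amount: int) -> None:
        # Acts as the rollback interface for add_points.
        self.balances[user_id] = self.balances.get(user_id, 0) - amount


class UserService:
    """Hypothetical stand-in for the remote user subsystem."""

    def __init__(self, passwords):
        self.passwords = passwords

    def login(self, user_id: str, password: str) -> None:
        if self.passwords.get(user_id) != password:
            raise PermissionError("login failed")


def login_with_bonus(points, users, user_id, password, bonus=5) -> bool:
    # Step 1: call the service that can be rolled back first.
    points.add_points(user_id, bonus)
    try:
        # Step 2: only then perform the login.
        users.login(user_id, password)
    except Exception:
        # Step 3: the login failed, so compensate by undoing the bonus.
        points.subtract_points(user_id, bonus)
        return False
    return True
```

Note that the compensation call itself can also fail in a real deployment, which is one reason this approach becomes costly as the number of serial services grows.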

Summary: This approach has many shortcomings and is usually not recommended for complex scenarios. It is acceptable only when the scenario is very simple, a rollback is very easy to provide, and there are very few dependent services.
This implementation results in a large amount of code and high coupling. It is also very limited, because many business operations cannot be rolled back easily. If there are many serial services, the cost of rollback is too high.

Local message table
This approach originated at eBay and, after later being promoted by Alipay engineers, came to be widely used in the industry. The basic design idea is to split a remote distributed transaction into a series of local transactions. If performance and design elegance are not a concern, it can be implemented with tables in a relational database.
Take the classic example of an inter-bank transfer.
The pseudocode of the first step is as follows: deduct 10,000 (1W), and use a local transaction to ensure that the voucher message is inserted into the message table.
Pseudocode
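A minimal sketch of this first step, using Python's built-in sqlite3 module as a stand-in for the bank's relational database. The table names (account, message), the column layout and the function names are assumptions made for illustration only.

```python
import sqlite3
import uuid


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE message (id TEXT PRIMARY KEY, account_id TEXT, "
        "amount INTEGER, status TEXT NOT NULL)"
    )


def deduct_and_record(conn: sqlite3.Connection, account_id: str, amount: int) -> str:
    """Deduct the amount and insert the voucher message in one local transaction."""
    message_id = str(uuid.uuid4())
    # "with conn" commits on success and rolls back if any statement raises.
    with conn:
        cur = conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown account")
        conn.execute(
            "INSERT INTO message (id, account_id, amount, status) "
            "VALUES (?, ?, ?, 'pending')",
            (message_id, account_id, amount),
        )
    return message_id
```

Because both statements share one local transaction, either the balance is reduced and a pending message exists, or neither change is visible.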
The second step is to notify the other party that 10,000 (1W) should be added to the bank account. So the question is, how do we notify the other party?
There are usually two ways:
1. MQ, which is highly timely: the other party subscribes to and listens for messages, and processing is triggered automatically by message events.
2. Scheduled polling: periodically scan the message table to check its data.
Both ways have pros and cons. Relying only on MQ, a notification may fail. Polling too frequently is inefficient (90% of the scans are useless). Therefore, we generally combine the two.
With the notification problem solved, a new problem appears. If a message is consumed repeatedly and money is added to the user's account more than once, wouldn't the consequences be serious?
On reflection, the consumer can record what it has consumed in a "consumption status table". Before performing the "add money" operation, check whether the message (identified by a unique identifier) has already been consumed; after consuming it, update the "consumption status table" within the same local transaction. In this way, the problem of repeated consumption is avoided.
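One possible way to sketch the "consumption status table" idea is shown below, again with sqlite3 standing in for the consumer's database. The table and function names are hypothetical, and the check is expressed as a primary-key constraint rather than a separate SELECT, which is a variation on the description above.

```python
import sqlite3


def create_consumer_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    # The primary key on message_id is what rejects a second consumption.
    conn.execute("CREATE TABLE consumed (message_id TEXT PRIMARY KEY)")


def credit_once(conn: sqlite3.Connection, message_id: str,
                account_id: str, amount: int) -> bool:
    """Apply the credit unless this message was already consumed."""
    try:
        with conn:  # one local transaction: status record + balance update
            conn.execute("INSERT INTO consumed (message_id) VALUES (?)", (message_id,))
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, account_id),
            )
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: nothing is changed
    return True
```

Recording the message id and crediting the account in the same local transaction means a redelivered message is rejected without touching the balance.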

Summary: The approach above is a very classic implementation that essentially avoids distributed transactions and achieves "eventual consistency". However, relational databases have throughput and performance bottlenecks, and frequently reading and writing messages puts pressure on the database. Therefore, in truly high-concurrency scenarios, this solution also has bottlenecks and limitations.

MQ (non-transactional messages)
Generally, when using an MQ product that does not support transactional messages, it is difficult to manage the business operation and the MQ operation within one local transaction. More concretely, taking the "inter-bank transfer" above as an example, it is hard to guarantee that delivering the message to MQ will succeed after the deduction has completed. Such consistency seems difficult to guarantee.
On careful analysis, however, it is not completely impossible. First analyze it from the message producer's side; please see the pseudocode:
Non-transactional messages
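A rough sketch of what such producer code might look like, with sqlite3 standing in for the database. The send callable represents a hypothetical MQ client, and SendError and transfer_out are names invented for this illustration; no real MQ product's API is implied.

```python
import sqlite3


class SendError(Exception):
    """Raised by the hypothetical MQ client when delivery fails."""


def transfer_out(conn: sqlite3.Connection, send, account_id: str, amount: int) -> bool:
    """Deduct locally, then deliver to MQ; a failed delivery undoes the deduction."""
    try:
        with conn:  # local database transaction
            cur = conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ? AND balance >= ?",
                (amount, account_id, amount),
            )
            if cur.rowcount != 1:
                # The database operation failed, so nothing is sent to MQ.
                raise ValueError("deduction failed")
            # Deliver inside the transaction: if send() raises, the exception
            # propagates and the UPDATE above is rolled back.
            send({"account_id": account_id, "amount": amount})
    except (ValueError, SendError):
        return False
    return True  # both the deduction and the delivery succeeded
```

One gap this sketch does not close: if send() succeeds but the final database commit then fails, the message has already been delivered without a matching deduction.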

Based on the above code and comments, let's analyze the possible situations:
1. The database operation succeeds and the message is delivered to MQ successfully; everyone is happy.
2. The database operation fails, and no message is delivered to MQ.
3. The database operation succeeds but delivering the message to MQ fails; an exception is thrown, and the database update that was just executed is rolled back.
From the analysis of these cases, the problem does not look serious. Next, let's analyze the problems the consumer faces:
1. After the message is dequeued, the consumer's corresponding business operation must be executed successfully. If the business operation fails, the message must not be invalidated or lost. The message must stay consistent with the business operation.
2. Try to avoid consuming a message repeatedly. If a message is consumed repeatedly, it must not affect the business result.
How do we keep messages and business operations consistent, without losing messages?
The mainstream MQ products all support message persistence. If the consumer goes down or consumption fails, a retry mechanism can be used (some MQs allow the number of retries to be customized).
How to avoid problems caused by repeated consumption of messages?
1. Ensure that the business service interface called by the consumer is idempotent.
2. Record the consumption status in consumption logs or a similar status table so that duplicates are easy to detect (it is recommended to implement this in the business layer yourself, rather than relying on the MQ product to provide it).
Summary: This approach is quite common, and its performance and throughput are better than those of a message table in a relational database. If MQ itself and the business services are highly available, it can in theory satisfy most business scenarios. However, it is not recommended to use it directly in trading business without sufficient testing.

MQ (transactional messages)
Again take Bob transferring money to Smith as an example. Should we send the message first or perform the deduction first?
Either order can go wrong. If the message is sent first and the deduction then fails, Smith's account ends up with extra money. Conversely, if the deduction is performed first and the message is sent afterwards, the deduction may succeed while the message fails to send, and Smith never receives the money. Apart from the exception capture and rollback described above, are there any other ideas?
Let's take Alibaba's RocketMQ middleware as an example and analyze its design and implementation.
In the first stage RocketMQ sends a Prepared message and obtains the address of that message; in the second stage it executes the local transaction; in the third stage it uses the address obtained in the first stage to access the message and modify its state. Attentive readers may spot a problem again: what if the confirmation message fails to be sent? RocketMQ periodically scans the transaction messages in the message cluster. When it finds a Prepared message, it asks the message sender whether Bob's money has been deducted or not, that is, whether to roll back or to continue sending the confirmation message. RocketMQ then decides whether to roll back or continue sending the confirmation message according to the strategy set by the sender. This ensures that sending the message and the local transaction succeed or fail together. As shown below:
RocketMQ
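The three stages and the periodic scan can be modeled with a toy in-memory broker, as in the sketch below. This is not RocketMQ's API: ToyBroker, TxState, send_in_transaction and the confirm flag (which simulates a lost confirmation) are all hypothetical names used only to illustrate the flow described above.

```python
from enum import Enum


class TxState(Enum):
    COMMIT = "commit"
    ROLLBACK = "rollback"
    UNKNOWN = "unknown"


class ToyBroker:
    """In-memory stand-in for a broker that supports prepared messages."""

    def __init__(self):
        self.prepared = {}    # message id -> body, invisible to consumers
        self.delivered = []   # bodies visible to consumers
        self._next_id = 0

    def send_prepared(self, body):
        self._next_id += 1
        self.prepared[self._next_id] = body
        return self._next_id  # plays the role of the message "address"

    def end_transaction(self, msg_id, state):
        if state is TxState.UNKNOWN or msg_id not in self.prepared:
            return            # leave it for the periodic scan
        body = self.prepared.pop(msg_id)
        if state is TxState.COMMIT:
            self.delivered.append(body)

    def scan_prepared(self, check_local_tx):
        # Periodic scan: ask the sender what happened to each pending message.
        for msg_id, body in list(self.prepared.items()):
            self.end_transaction(msg_id, check_local_tx(body))


def send_in_transaction(broker, body, local_tx, confirm=True):
    msg_id = broker.send_prepared(body)   # stage 1: prepared message
    try:
        local_tx(body)                    # stage 2: local transaction
        state = TxState.COMMIT
    except Exception:
        state = TxState.ROLLBACK
    if confirm:                           # stage 3: confirmation (may be lost)
        broker.end_transaction(msg_id, state)
    return msg_id
```

If the confirmation is lost (confirm=False here), the message stays in prepared until scan_prepared asks the sender for the outcome of the local transaction and then commits or discards it.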

Summary: As far as the author knows, almost all well-known e-commerce platforms and Internet companies adopt similar design ideas to achieve "eventual consistency". This approach suits a wide range of business scenarios and is relatively reliable. However, it is technically harder to implement. At the time of writing, the mainstream open source MQs (ActiveMQ, RabbitMQ, Kafka) do not support transactional messages, so secondary development or building a new wheel is required. Regrettably, the transactional message code of RocketMQ has not been open-sourced, so it has to be implemented by yourself.
Other compensation methods
Anyone who has integrated with Alipay's transaction interface knows that we usually provide a callback page and interface to Alipay, decrypt the parameters there, and then call the transaction-related services in our system to update the status, marking the order as paid. Alipay stops sending callback requests only when our callback page outputs the word "success" or the corresponding status code indicating that the business was processed successfully. Otherwise, Alipay sends the callback request again after a while, and keeps doing so until the success indicator is output.
In fact, this is a very typical example of compensation, similar to the retry compensation mechanisms of some MQs.
In a reasonably mature system, the overall availability of high-level services and interfaces is usually very high. When some calls fail because of transient network faults or timeouts, this retry mechanism is actually very effective.
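The retry-until-acknowledged pattern can be sketched in a few lines. The function below is a hypothetical illustration, not Alipay's implementation: the attempt limit and the exponentially growing delay are assumptions, since the actual retry schedule is not described here.

```python
import time


def notify_until_acknowledged(callback, payload, max_attempts=5,
                              base_delay=1.0, sleep=time.sleep):
    """Call callback(payload) until it returns "success" or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            if callback(payload) == "success":
                return attempt   # number of attempts that were needed
        except Exception:
            pass                 # treat errors like a missing acknowledgement
        if attempt < max_attempts:
            # Wait longer before each further attempt (assumed backoff policy).
            sleep(base_delay * 2 ** (attempt - 1))
    return None                  # give up: escalate to logs or manual handling
```

The sleep parameter is injectable so the function can be exercised without real waiting; a None result is the point at which logging, alerts or manual handling would take over.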
Of course, in a more extreme scenario, if the system itself has a bug or the program logic is wrong, retrying 10,000 times will not help. Wouldn't that lead to the tragedy of "payment has clearly been made, yet the order shows as unpaid and is not shipped"?
In fact, to make the trading system more reliable, we generally add detailed logging to high-level service code such as transactions. Once a fatal exception occurs, an email notification is sent. In addition, scheduled background tasks scan and analyze these logs, pick out such special cases, try to compensate programmatically, and notify the relevant people by email.
In some special circumstances there is also "manual compensation", which is the last line of defense.
Summary
The author has summarized above the design ideas, advantages and disadvantages of these schemes, and believes readers now have a reasonable understanding of them. In fact, transaction consistency in distributed systems is a hard technical problem in itself, and there is no simple, perfect solution that handles every scenario. Ultimately, a choice still has to be made according to the specific business scenario.

About the author
Ding Long currently works at a vertical business platform as a technical architect. He focuses on high-concurrency and high-availability architecture design, and has in-depth research and rich practical experience in service-oriented systems, database and table sharding, performance tuning, and more. He is passionate about technology research and sharing.

This article was first published on InfoQ, all rights reserved, please do not reprint.
Address: http://www.infoq.com/cn/articles/solution-of-distributed-system-transaction-consistency

Origin blog.csdn.net/dinglang_2009/article/details/51810151