Message middleware (1): A comparison of transaction consistency solutions in distributed systems

In a distributed system, it is impossible to satisfy all three of "consistency", "availability" and "partition tolerance" at the same time. Transaction consistency in distributed systems is a hard technical problem. Which solutions work better?

In the field of OLTP systems, we face transaction consistency requirements in many business scenarios, such as the most classic case of Bob transferring money to Smith. In traditional enterprise development, the system often exists in the form of a single application and does not span multiple databases.

In that case we usually only need the data access technologies and frameworks specific to the development platform (such as Spring, JDBC, ADO.NET), combined with the transaction management mechanism of the relational database, to meet transactional requirements. Relational databases usually have the ACID properties: Atomicity, Consistency, Isolation, Durability.

Large-scale Internet platforms are often composed of a series of distributed systems, and their development platforms and technology stacks are relatively complex. Especially in today's SOA and microservice architectures, a seemingly simple function may need to call several "services" internally and operate on multiple databases or shards, so the situation is often much more complicated. A single technique or solution can no longer handle these complex scenarios.

Characteristics of Distributed Systems

Readers who have studied distributed systems may have heard of the "CAP theorem", "BASE theory", and so on. Coincidentally, in chemistry ACID is an acid and BASE happens to be a base. The author does not explain these concepts in depth here; interested readers can consult the relevant references. The CAP theorem is as follows:
[Figure: the CAP theorem]

In a distributed system, the "consistency", "availability" and "partition tolerance" of the CAP theorem cannot all be satisfied at the same time; having all three is even harder than being "white, rich, and beautiful" all at once. In the vast majority of Internet scenarios, strong consistency has to be sacrificed in exchange for high availability. The system often only needs to guarantee "eventual consistency", as long as the time it takes to become consistent is within a range acceptable to users.

Distributed transaction

When it comes to distributed systems, it is inevitable to mention distributed transactions. To understand distributed transactions, we have to first introduce the two-phase commit protocol. Let's take a simple but imprecise example to illustrate:

In the first stage, Teacher Zhang, as the "coordinator", sends a WeChat message to Xiaoqiang and Xiaoming (the participants, or nodes), asking them to gather at the school gate at 8 o'clock tomorrow to go mountain climbing together, and then waits for Xiaoqiang and Xiaoming to reply.

In the second stage, if both Xiaoqiang and Xiaoming answer "no problem", then everyone shows up as promised. If either Xiaoqiang or Xiaoming answers "I'm not free tomorrow", then Teacher Zhang immediately notifies both of them that "the mountain climbing activity is canceled".

Attentive readers will notice that many things can go wrong in this process. If Xiaoqiang does not look at his phone, Teacher Zhang keeps waiting for an answer. Xiaoming may have prepared all his climbing equipment at home while still waiting for Teacher Zhang's confirmation. More seriously, if Xiaoqiang has still not replied by 8 o'clock tomorrow, then even if this is treated as a "timeout", should Xiaoming go climbing or not?

This is the drawback of the two-phase commit protocol, so later the industry introduced the three-phase commit protocol to solve this type of problem.
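The two-phase flow in the example above can be sketched in a few lines of Python. This is only a toy, in-memory model under stated assumptions: `Participant` and `two_phase_commit` are hypothetical names, and a participant that never replies is modelled as a vote of `None`, which the coordinator treats as a "no".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Participant:
    """Hypothetical participant; vote is None when no reply arrives in time."""
    name: str
    vote: Optional[bool]
    state: str = "init"

    def prepare(self) -> Optional[bool]:
        # Phase 1: a participant that votes yes stays blocked in "prepared"
        # until the coordinator announces the outcome.
        if self.vote:
            self.state = "prepared"
        return self.vote

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled_back"


def two_phase_commit(participants: List[Participant]) -> str:
    # Phase 1 (voting): ask everyone; a missing reply (None) counts as "no".
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(v is True for v in votes) else "rollback"
    # Phase 2 (completion): broadcast the single global decision.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.rollback()
    return decision
```

A real coordinator also has to persist its decision and cope with its own crashes, which this sketch leaves out; the blocked "prepared" state is exactly where Xiaoming is stuck in the story.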

The two-phase commit protocol is widely used and implemented in mainstream development platforms and database products. Let's introduce the DTP model diagram provided by the X/Open organization:
[Figure: X/Open DTP model]
The XA protocol is the interface between the TM (transaction manager) and the RM (resource manager). The current mainstream relational database products all implement the XA interface. JTA (Java Transaction API) conforms to the X/Open DTP model, and the XA protocol is also used between its transaction manager and resource managers. In essence, distributed transactions are realized by means of the two-phase commit protocol. Let's take a look at the model diagrams of XA transaction success and failure:
[Figures: XA transaction success and failure]

On the Java EE platform, mainstream commercial application servers such as WebLogic and WebSphere provide JTA implementations and support. Tomcat does not implement it (in fact, I do not think Tomcat can be regarded as a Java EE application server), so it needs the help of third-party frameworks such as JOTM and Atomikos, both of which support Spring transaction integration.

On the Windows .NET platform, distributed transactions can be programmed with the TransactionScope API in ADO.NET, and the MSDTC service in the Windows operating system must also be configured and running. If your database is MySQL deployed on Linux, distributed transactions cannot be supported in this way. Due to space constraints this is not expanded on here; interested readers can consult the relevant materials and try it themselves.

Summary: This method is not too difficult to implement and is better suited to traditional monolithic applications in which a single method operates across databases. However, distributed transactions have a relatively large impact on performance, so the method is not suitable for scenarios with high concurrency and high performance requirements.

Provide a rollback interface

In the service-oriented architecture, function X needs to coordinate the back-end A, B or even more atomic services. So the question is, what if one of the calls of A and B fails?

In my work I often encounter such problems, and I usually provide a BFF layer to coordinate the calls to services A and B. If results need to be returned synchronously, I try to call the services in a "serial" way. If calling A fails, B is not called blindly. If the call to A succeeds and the call to B fails, the BFF layer tries to roll back the call to A that was just made.

Of course, sometimes we don't have to strictly provide a separate corresponding rollback interface, which can be implemented cleverly by passing parameters.

In this case, we try to call the service that can provide a rollback interface first. For example:

One of our forum websites rewards users with 5 points after a successful login each day, but points and users are two independent subsystem services backed by different databases, which makes this harder to control. The solution:

1. Put the service calls for logging in and adding points in one local method in the BFF layer.
2. When the user requests the login interface, first add the points, and perform the login only after the points have been added successfully.
3. If the login succeeds, all is well and the points have also been added. If the login fails, call the rollback interface that corresponds to adding points (that is, deduct the points again).
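The three steps above can be sketched as follows. This is a minimal illustration, not the author's code: `PointsService`, `UserService` and `login_with_reward` are hypothetical in-memory stand-ins for the two subsystems and the BFF method, and the once-per-day rule is left out.

```python
class PointsService:
    """Hypothetical stand-in for the points subsystem."""

    def __init__(self):
        self.points = {}

    def add(self, user, amount):
        self.points[user] = self.points.get(user, 0) + amount

    def rollback_add(self, user, amount):
        # Compensating call: undo an earlier add().
        self.points[user] = self.points.get(user, 0) - amount


class UserService:
    """Hypothetical stand-in for the user subsystem."""

    def __init__(self, passwords):
        self.passwords = passwords

    def login(self, user, password):
        if self.passwords.get(user) != password:
            raise PermissionError("login failed")


def login_with_reward(users, points, user, password, reward=5):
    # Step 2: add the points first, because this call can be rolled back.
    points.add(user, reward)
    try:
        users.login(user, password)
    except Exception:
        # Step 3: the login failed, so compensate by deducting the points.
        points.rollback_add(user, reward)
        raise
    return True
```

If the compensating call itself fails, the two services are left inconsistent, which is one reason this approach only fits very simple scenarios.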


Summary: This method has many disadvantages and is usually not recommended in complex scenarios. It is reasonable only when the scenario is very simple, rollback is very easy to provide, and there are very few dependent services.

This implementation results in a large amount of code and high coupling. It is also very limited, because many business operations cannot be rolled back easily. If many services are called serially, the cost of rolling back becomes too high.

Local message table

The idea behind this implementation originated at eBay and was later widely adopted in the industry through promotion by Alipay and other companies. Its basic design idea is to split a remote distributed transaction into a series of local transactions. If performance and design elegance are not a concern, it can be achieved with the help of tables in a relational database.

Take a classic example of inter-bank transfer to describe.

In the first step, shown in the pseudo code below, 10,000 is deducted and the voucher message is inserted into the message table within the same local transaction.
[Figure: pseudo code for the deduction and message insert]
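A minimal sketch of this first step, assuming SQLite as the local database. The `account` and `message` tables, their columns, and the function names are hypothetical and chosen only for illustration.

```python
import sqlite3
import uuid


def create_schema(conn):
    conn.execute(
        "CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE message (msg_id TEXT PRIMARY KEY, to_account TEXT, "
        "amount INTEGER, status TEXT NOT NULL)"
    )


def debit_and_record(conn, from_account, to_account, amount):
    """Deduct the amount and insert the voucher message in one local transaction."""
    msg_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE account SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, from_account, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown account")
        conn.execute(
            "INSERT INTO message (msg_id, to_account, amount, status) "
            "VALUES (?, ?, ?, 'pending')",
            (msg_id, to_account, amount),
        )
    return msg_id
```

Because both statements share one local transaction, either the deduction and the message both exist or neither does.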

The second step is to notify the other party to add 10,000 to the target bank account. The question is, how do we notify the other party?

There are usually two ways:

1. Use an MQ with low latency: the other party subscribes to the messages, and an event is triggered automatically when a message arrives.
2. Use scheduled polling to scan the data in the message table.

In fact, each method has its own advantages and disadvantages. When relying only on MQ, a notification may fail. With overly frequent scheduled polling, efficiency suffers (90% of the scans are useless). Therefore, we generally combine the two methods.
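One way to combine the two methods is a small relay that scans the message table and pushes whatever is still pending to the MQ. The sketch below assumes SQLite and a hypothetical `message` table with `msg_id`, `body` and `status` columns; `publish` is any callable supplied by the caller, not a real MQ client API.

```python
import sqlite3


def relay_pending(conn, publish, batch_size=100):
    """Scan the message table and publish rows still marked 'pending'."""
    rows = conn.execute(
        "SELECT msg_id, body FROM message "
        "WHERE status = 'pending' ORDER BY rowid LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for msg_id, body in rows:
        try:
            publish(msg_id, body)
        except Exception:
            continue  # leave the row pending; the next scan retries it
        with conn:
            conn.execute(
                "UPDATE message SET status = 'sent' WHERE msg_id = ?",
                (msg_id,),
            )
        sent += 1
    return sent
```

If the process crashes after publishing but before marking the row as sent, the next scan publishes the same message again, so the receiving side still has to tolerate duplicates.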

With the notification problem solved, a new problem appears. If the message is consumed repeatedly and extra money is added to the user's account, wouldn't the consequences be serious?

On reflection, the message consumer can record the consumption status in a "consumption status table". Before executing the "add money" operation, it checks whether the message (identified by the identifier it carries) has already been consumed. After consumption completes, the "consumption status table" is updated under the control of the same local transaction. This avoids the problem of repeated consumption.
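A minimal sketch of such a consumer, again assuming SQLite. The `consumed` table and `credit_once` function are hypothetical names. Instead of a separate "check, then update" pair, this variant inserts the status row first and lets the primary key reject duplicates, which is one common way to make the check and the credit atomic.

```python
import sqlite3


def create_consumer_schema(conn):
    conn.execute(
        "CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute("CREATE TABLE consumed (msg_id TEXT PRIMARY KEY)")


def credit_once(conn, msg_id, account, amount):
    """Apply the credit at most once per msg_id; return True if it was applied."""
    try:
        with conn:  # one local transaction for the status row and the credit
            # The primary key rejects a second insert of the same msg_id.
            conn.execute("INSERT INTO consumed (msg_id) VALUES (?)", (msg_id,))
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, account),
            )
    except sqlite3.IntegrityError:
        return False  # duplicate delivery, already handled
    return True
```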

Summary: The approach above is a very classic implementation that largely avoids distributed transactions and achieves "eventual consistency". However, relational databases have throughput and performance bottlenecks, and frequent reading and writing of messages puts pressure on the database. Therefore, in a truly high-concurrency scenario, this solution also has bottlenecks and limitations.

MQ (non-transactional messages)

Usually, when using an MQ product that does not support transactional messages, it is difficult to manage the business operation and the MQ operation within one local transaction scope. Put simply, taking the "inter-bank transfer" above as an example, it is hard to guarantee that delivering the message to MQ succeeds after the deduction has completed. This consistency seems difficult to guarantee.
Let's analyze the message producer side first; please see the pseudo code:

[Figure: producer pseudo code]
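A minimal sketch of the producer side, assuming SQLite for the local database. `FakeQueue` is a hypothetical in-memory stand-in for an MQ client, not the API of any real product, and the `account` table is assumed for illustration.

```python
import sqlite3


class FakeQueue:
    """Hypothetical in-memory stand-in for an MQ client."""

    def __init__(self, fail=False):
        self.fail = fail
        self.messages = []

    def send(self, body):
        if self.fail:
            raise ConnectionError("broker unreachable")
        self.messages.append(body)


def transfer_out(conn, queue, account, amount):
    with conn:  # local database transaction
        # 1. Business operation: deduct the money.
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?",
            (amount, account),
        )
        # 2. Deliver the message. An exception here propagates out of the
        #    with block, so the deduction above is rolled back.
        queue.send({"account": account, "amount": amount})
```

Note that the message is sent before the database commit, so if the commit itself failed after a successful send, the message would already be out; the sketch does not cover that case.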

Based on the above code and comments, let's analyze the possible situations:

1. The database operation succeeds, and the delivery of the message to MQ also succeeds. Everyone is happy.
2. The database operation fails, and the message is not delivered to MQ.
3. The database operation succeeds, but the delivery of the message to MQ fails and an exception is thrown, so the database update just performed is rolled back.

From the above analysis of several situations, it seems that the problem is not big. So let's analyze the problems faced by consumers:

1. After a message is dequeued, the consumer's corresponding business operation must execute successfully. If the business execution fails, the message must not be invalidated or lost. Message handling and business operations must stay consistent.
2. Try to avoid repeated consumption of messages. If a message is consumed repeatedly, it must not affect the business result.

How to ensure that messages are consistent with business operations and not lost?

Mainstream MQ products can all persist messages. If the consumer goes down or consumption fails, a retry mechanism can redeliver the message (some MQs allow the number of retries to be customized).
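The retry idea can be illustrated with a toy consumer loop. This is a sketch under assumptions, not the behavior of any particular MQ product: `consume` is a hypothetical function, a plain list stands in for the queue, and messages that exhaust their retries are collected in a dead-letter list for manual handling.

```python
from collections import deque


def consume(queue, handler, max_retries=3):
    """Redeliver each message until handler succeeds or retries run out."""
    dead_letters = []
    pending = deque((msg, 0) for msg in queue)
    while pending:
        msg, attempts = pending.popleft()
        try:
            handler(msg)  # returning normally acts as the acknowledgement
        except Exception:
            if attempts + 1 < max_retries:
                pending.append((msg, attempts + 1))  # requeue for another try
            else:
                dead_letters.append(msg)  # give up and keep it for follow-up
    return dead_letters
```

Because a message may be handled more than once when a retry follows a partial failure, the handler has to be safe to call repeatedly.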

How to avoid problems caused by repeated consumption of messages?

1. Make sure the business service interface called by the consumer is idempotent.
2. Use a consumption log or a similar status table to record the consumption status, which makes it easy to check (it is recommended to implement this in the business code rather than relying on the MQ product to provide the feature).


Summary: This method is quite common, and its performance and throughput are better than the scheme that uses a relational database message table. If both the MQ and the business services are highly available, it can in theory satisfy most business scenarios. However, it is not recommended to use it directly in trading business without adequate testing.

MQ (transactional messages)

For example, if Bob transfers money to Smith, do we send the message first, or perform the deduction first?
Either order seems to have a problem. If the message is sent first and the deduction then fails, Smith's account ends up with extra money. Conversely, if the deduction is performed first and the message is sent afterwards, the deduction may succeed while the message is never sent, and Smith does not receive the money. Besides catching the exception and rolling back as described above, are there any other ideas?

The following takes Alibaba's RocketMQ middleware as an example to analyze its design and implementation ideas.

In the first stage, RocketMQ sends a Prepared message and obtains the address of that message. In the second stage, the local transaction is executed. In the third stage, the message is accessed through the address obtained in the first stage and its state is modified. Attentive readers may spot a problem again: what if the confirmation message fails to be sent?

RocketMQ periodically scans the transactional messages in the message cluster. When it finds a message still in the Prepared state, it asks the message sender to confirm whether Bob's money has been deducted or not, and whether to roll back or to go ahead with the confirmation message. RocketMQ then rolls back or continues to send the confirmation message according to the policy set by the sender. This ensures that sending the message and the local transaction succeed or fail together. As shown below:
[Figure: RocketMQ transactional message flow]
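The three stages and the periodic check can be modelled with a toy in-memory broker. Everything here is hypothetical and for illustration only: `Broker`, `send_prepared`, `confirm`, `check_back` and `send_transactional` are made-up names and do not correspond to RocketMQ's actual API.

```python
class Broker:
    """Toy broker that keeps prepared messages invisible until confirmed."""

    def __init__(self):
        self.half = {}      # msg_id -> body, not visible to consumers
        self.visible = []   # bodies that consumers can read
        self._next_id = 0

    def send_prepared(self, body):
        self._next_id += 1
        self.half[self._next_id] = body
        return self._next_id  # the "address" used in the third stage

    def confirm(self, msg_id, commit):
        body = self.half.pop(msg_id, None)
        if body is not None and commit:
            self.visible.append(body)

    def check_back(self, was_local_tx_committed):
        # Periodic scan: ask the sender about every unresolved message.
        for msg_id, body in list(self.half.items()):
            self.confirm(msg_id, was_local_tx_committed(body))


def send_transactional(broker, body, local_tx, confirm_ok=True):
    msg_id = broker.send_prepared(body)  # stage 1
    try:
        local_tx()                       # stage 2
        committed = True
    except Exception:
        committed = False
    if confirm_ok:                       # stage 3 (may be lost in practice)
        broker.confirm(msg_id, committed)
    return committed
```

Passing `confirm_ok=False` simulates a lost confirmation; the prepared message then stays invisible until `check_back` asks the sender what happened to the local transaction.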

Summary: As far as the author understands, almost all major well-known e-commerce platforms and Internet companies adopt similar design ideas to achieve "eventual consistency". This method suits a wide range of business scenarios and is relatively reliable. However, it is technically difficult to implement. At the time of writing, the mainstream open source MQ products (ActiveMQ, RabbitMQ, Kafka) did not support this kind of transactional message, so secondary development or building a new wheel was required. Unfortunately, the transactional message part of the RocketMQ code was not open source at the time either and had to be implemented by yourself.

Other compensation methods

Developers who have integrated the Alipay transaction interface know that we usually decrypt the parameters in the Alipay callback page or interface, and then call the system's service for updating the transaction status to mark the order as successfully paid. Alipay stops sending callback requests only when the word success is output in the callback page, or when the corresponding status code indicates that the business has been processed successfully. Otherwise, Alipay initiates another callback request to the client side after a period of time, until the success flag is output.
In fact, this is a very typical compensation example, similar to the retry compensation mechanisms of some MQs.
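Seen from the notifying side, this kind of compensation is a loop that keeps calling back until the receiver answers with the success flag. The sketch below is a generic illustration; `notify_until_success` is a hypothetical function, and the delay values are made up rather than Alipay's real schedule.

```python
import time


def notify_until_success(send, delays=(0, 60, 300), sleep=time.sleep):
    """Call send() after each delay until it returns "success".

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt, delay in enumerate(delays, start=1):
        sleep(delay)
        try:
            if send() == "success":
                return attempt
        except Exception:
            pass  # a timeout or network error is treated like a missing reply
    return None
```

The `sleep` parameter is injectable so that the schedule can be exercised in tests without actually waiting.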

In a reasonably mature system, the overall availability of higher-level services and interfaces is usually very high. If some calls fail because of transient network faults or timeouts, this retry mechanism is very effective.

Of course, consider a more extreme scenario: if the system itself has bugs or logic errors, retrying 10,000 times will not help. Wouldn't that lead to a tragedy such as "the payment has been made, but the order shows as unpaid and the goods are never shipped"?

In fact, to make the trading system more reliable, we generally add detailed logging to high-level service code such as transactions. Once a fatal exception is triggered inside the system, an email notification is sent. At the same time, scheduled background tasks scan and analyze these logs, pick out such special cases, try to compensate programmatically, and notify the relevant personnel by email.

In some special cases there is also "manual compensation", which is the last line of defense.

Summary

For the schemes above, the author has roughly summarized their design ideas, advantages and disadvantages, and I believe readers now have a certain understanding of them. In fact, transaction consistency in a distributed system is itself a hard technical problem. At present there is no simple, perfect solution that handles all scenarios; in the end the choice still depends on the specific business scenario.

