This article explains how to ensure transaction consistency in a microservice architecture.

Original address: Liang Guizhao's blog

Blog address: http://blog.720ui.com


The evolution from local transactions to distributed transactions

What is a transaction? Before answering this question, let's look at a classic scenario: a transfer on a trading platform such as Alipay. Suppose Xiaoming needs to transfer 100,000 yuan to Xiaohong through Alipay. His account should be debited 100,000 yuan and her account credited 100,000 yuan. If the system crashes partway through, so that Xiaoming's account has been debited while Xiaohong's balance remains unchanged, there is a serious problem. This is exactly where transactions are needed. See Figure 6-1.

This scenario reflects a very important property of transactions: atomicity. In fact, transactions have four basic properties: atomicity, consistency, isolation, and durability. Atomicity means the operations within a transaction either all succeed or all fail; the transaction never stops halfway through. Consistency means the database must be in a consistent state both before and after a transaction executes; if execution fails, the transaction must be rolled back to the original state. In other words, once a transaction commits, other transactions see its results, and once it rolls back, other transactions can only see the state before the rollback. Isolation means that in a concurrent environment, when different transactions modify the same data at the same time, one in-flight transaction does not affect another. Durability means that once a transaction commits, its changes are permanently saved to the database.

Local transactions guarantee strong data consistency through ACID. ACID stands for Atomicity, Consistency, Isolation, and Durability. In everyday development we use local transactions all the time. For example, a MySQL transaction starts with begin, rolls back with rollback, and confirms with commit. Under the hood, MySQL records changes in the redo log when the transaction commits and uses the undo log to roll back on failure, which guarantees the atomicity of the transaction. Developers who use Java have all come across Spring, which handles transactions through the @Transactional annotation. Spring encapsulates the details: when a bean annotated with @Transactional needs to be injected, Spring injects a proxy instead, and the proxy opens, commits, or rolls back the transaction for us. See Figure 6-2.
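As a minimal sketch of how this looks in application code (the TransferService class, the AccountMapper interface, and the account IDs are hypothetical names made up for illustration), the proxy that Spring generates commits when the annotated method returns normally and rolls back when it throws:

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical DAO; in a real project this would be a MyBatis mapper or JPA repository.
interface AccountMapper {
    void debit(long accountId, long amount);
    void credit(long accountId, long amount);
}

@Service
public class TransferService {

    private final AccountMapper accountMapper;

    public TransferService(AccountMapper accountMapper) {
        this.accountMapper = accountMapper;
    }

    // Spring's proxy opens a transaction before this method runs,
    // commits when it returns normally, and rolls back on a RuntimeException.
    @Transactional
    public void transfer(long fromId, long toId, long amount) {
        accountMapper.debit(fromId, amount);  // Xiaoming's account minus 100,000
        accountMapper.credit(toId, amount);   // Xiaohong's account plus 100,000
        // If credit() throws here, the debit is rolled back as well: atomicity.
    }
}
```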

With rapid business growth and massive data, for example tens of millions or even hundreds of millions of rows, a single query takes longer and longer and can even put single-point pressure on the database. Therefore, we have to consider splitting databases and splitting tables. The purpose of splitting is to reduce the burden on a single database and a single table, improve query performance, and shorten query time.

Let's first look at splitting the order table. Table-splitting strategies can be summarized as vertical splitting and horizontal splitting. Vertical splitting splits a table by its columns: a table with many fields is split into several tables, which makes each row smaller. On the one hand, fewer bytes are transferred over the network between the client program and the database, which matters because the production environment shares the same bandwidth and, as concurrent queries increase, the bandwidth can become a bottleneck. On the other hand, a data block can hold more rows, which reduces the number of I/O operations per query. Horizontal splitting splits a table by its rows: once a table grows beyond several million rows it slows down, and its data can be split across multiple tables. There are many horizontal splitting strategies, such as splitting by modulo or by time dimension. In this scenario, although we split the tables according to certain rules, we can still use local transactions, because all the tables live in the same database. However, splitting tables inside one database only solves the problem of a single table being too large; it does not distribute the data to different physical machines, so it does not reduce the pressure on the MySQL server, and resource contention and bottlenecks, including CPU, memory, disk I/O, and network bandwidth, remain on the same machine.

For database splitting, the data of one table is distributed across different databases that share the same table structure. If we route all the data that needs to participate in one transaction to the same database according to certain rules, local transactions can still guarantee strong consistency. However, when we split vertically by business and function, business data ends up in separate databases. The split system now faces a data consistency problem, because the data a transaction must protect is spread across different databases, and each database can only guarantee ACID for its own data. In a distributed system the databases may be deployed on different servers and can only communicate over the network, so one database cannot know exactly how a transaction executed in another. See Figure 6-3.
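As a small illustration of the modulo and time-dimension routing strategies mentioned above (the table prefix, shard count, and order ID are assumptions made up for this sketch, not part of any particular framework), the target physical table can be derived from the shard key like this:

```java
public final class OrderTableRouter {

    private static final int TABLE_COUNT = 8; // assumed number of physical order tables

    // Route an order ID to a physical table name such as "t_order_1" (modulo strategy).
    public static String routeByModulo(long orderId) {
        return "t_order_" + (orderId % TABLE_COUNT);
    }

    // Route by time dimension instead, e.g. one table per month such as "t_order_202401".
    public static String routeByMonth(java.time.LocalDate createdAt) {
        return "t_order_" + createdAt.getYear()
                + String.format("%02d", createdAt.getMonthValue());
    }

    public static void main(String[] args) {
        System.out.println(routeByModulo(10001L));                              // t_order_1
        System.out.println(routeByMonth(java.time.LocalDate.of(2024, 1, 15)));  // t_order_202401
    }
}
```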

In addition, cross-database calls are not the only place where local transactions fall short. As microservices are adopted, each service has its own database, and the databases are independent of and opaque to each other. If service A needs data owned by service B, a cross-service call is required; if that service is down, the network connection is abnormal, or the synchronous call times out, the data can become inconsistent. This, too, is a data consistency problem in a distributed scenario. See Figure 6-4.

To sum up, as business grows, splitting databases and splitting services into microservices both lead to distributed data inconsistency. Since local transactions cannot meet the demand, distributed transactions come on stage. What is a distributed transaction? We can simply understand it as a transaction solution that guarantees data consistency across different databases. Here, we first need to understand the CAP theorem and BASE theory. CAP stands for Consistency, Availability, and Partition tolerance, and it is a theory about trade-offs in distributed systems. Consistency requires that every read on every node returns the latest data; availability requires that the service remains available no matter what failure occurs; partition tolerance requires that nodes continue to provide service even when the network is partitioned. In fact, any system can satisfy only two of the three at the same time. For a distributed system, partition tolerance is a basic requirement. If we choose consistency and partition tolerance and give up availability, a network problem can make the system unavailable. If we choose availability and partition tolerance and give up consistency, data on different nodes cannot be synchronized in time, leading to inconsistency. See Figure 6-5.

At this point, BASE theory offers a compromise between consistency and availability. BASE stands for Basically Available, Soft state, and Eventually consistent, and it is the theoretical foundation of eventual consistency. Simply put, in a distributed system partial loss of availability is allowed, and synchronizing data between nodes incurs delays, but after a period of repair the data eventually becomes consistent. BASE emphasizes eventual consistency: compared with ACID, BASE trades some consistency for availability.

The distributed transaction solutions commonly used in the industry today include the strongly consistent two-phase commit protocol and three-phase commit protocol, and the eventually consistent reliable event pattern, compensation pattern, and Alibaba's TCC pattern. We will introduce and practice them in detail in the following sections.

Strong consistency solutions

Two-phase commit protocol

In a distributed system, each database can only guarantee ACID for its own data. The databases may be deployed on different servers and can only communicate over the network, so one cannot know exactly how a transaction executed in another. To solve this coordination problem between multiple nodes, a coordinator is introduced that is responsible for the outcome of all nodes: either all succeed or all fail. The XA protocol is such a distributed transaction protocol, with two roles: the transaction manager and the resource manager. We can think of the transaction manager as the coordinator and the resource manager as a participant.

The XA protocol guarantees strong consistency through the two-phase commit protocol.

The two-phase commit protocol, as the name suggests, has two phases: the first phase prepares, and the second phase commits. The transaction manager (coordinator) controls the outcome of all nodes through both the prepare phase and the commit phase. In the first phase, the transaction manager (coordinator) sends a prepare instruction to each resource manager (participant) and asks whether the pre-commit can succeed. If a resource manager (participant) can complete the operation, it performs the work without committing and then reports its own result: pre-commit succeeded or pre-commit failed. In the second phase, if every resource manager (participant) reports that the pre-commit succeeded, the transaction manager (coordinator) sends the formal commit instruction; if any one of them reports that the pre-commit failed, the transaction manager (coordinator) sends a rollback instruction to all of them. For example, suppose we have one transaction manager (coordinator) and three resource managers (participants), and we need strong consistency across the data of the three participants during the transaction. First, the transaction manager (coordinator) sends a prepare instruction to find out whether each participant can pre-commit successfully. If all of them reply that the pre-commit succeeded, the transaction manager (coordinator) formally sends the commit instruction to execute the data change. See Figure 6-6.
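The following is a minimal, simplified sketch of the coordinator's decision logic only (the Participant interface and its methods are assumptions made up for illustration; a real XA implementation also has to handle timeouts, retries, and logging):

```java
import java.util.List;

// Hypothetical participant abstraction: a resource manager reachable over the network.
interface Participant {
    boolean prepare();  // do the work without committing; report whether pre-commit succeeded
    void commit();      // make the change permanent
    void rollback();    // undo the prepared work
}

public class TwoPhaseCommitCoordinator {

    // Returns true if the global transaction committed, false if it was rolled back.
    public boolean execute(List<Participant> participants) {
        // Phase 1: ask every participant to prepare.
        boolean allPrepared = true;
        for (Participant p : participants) {
            if (!p.prepare()) {
                allPrepared = false;
                break;
            }
        }
        // Phase 2: commit everywhere if all prepared, otherwise roll back everywhere.
        for (Participant p : participants) {
            if (allPrepared) {
                p.commit();
            } else {
                p.rollback();
            }
        }
        return allPrepared;
    }
}
```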

Note that although the two-phase commit protocol offers a strong consistency solution, it still has problems. First, the transaction manager (coordinator) controls the outcome of all nodes through both the prepare phase and the commit phase, and the whole process is synchronous: the transaction manager (coordinator) must wait for every resource manager (participant) to return its result before moving to the next step, which easily causes synchronous blocking. Second, single points of failure must be taken seriously. Both the transaction manager (coordinator) and the resource managers (participants) can go down. If a resource manager (participant) fails, it cannot respond and everyone keeps waiting; if the transaction manager (coordinator) fails, control of the transaction flow is lost, in other words the entire process blocks indefinitely, and in extreme cases some resource managers (participants) commit their data while others do not, which also produces inconsistency. At this point a reader may ask: these are all low-probability situations, surely they rarely happen? Yes, but for distributed transaction scenarios we must consider not only the normal flow but also the low-probability abnormal scenarios. If we lack a plan for handling them, data may become inconsistent, and relying on manual intervention afterwards is very costly. Moreover, on a core transaction link the consequence may not be merely a data problem but a much more serious loss of assets.

Three-phase commit protocol

The two-phase commit protocol has many problems, so the three-phase commit protocol takes the stage. The three-phase commit protocol is an improved version of the two-phase commit protocol. It differs in that it introduces a timeout mechanism to alleviate synchronous blocking, and it adds a first phase that asks each resource manager (participant) whether it is able to complete the operation before any work is done, terminating the transaction if not. If all resource managers (participants) answer that they can, the second phase (pre-commit) and the third phase (commit) are carried out; if any resource manager (participant) answers that it cannot, or the wait times out, the transaction is terminated. To summarize, the three-phase commit protocol consists of: phase one, the inquiry (CanCommit); phase two, the pre-commit (PreCommit); and phase three, the commit (DoCommit). See Figure 6-7.

The three-phase commit protocol addresses the problems of the two-phase commit protocol well and is a meaningful improvement. However, data inconsistency can still occur in extremely low-probability scenarios. Because the protocol introduces a timeout mechanism, if a resource manager (participant) times out it commits by default; if that commit does not actually succeed, or other resource managers (participants) have rolled back, the data becomes inconsistent.

Eventual consistency solutions

TCC pattern

The two-phase and three-phase commit protocols handle distributed transactions well, but they can still leave data inconsistent in extreme cases, and they cost the system a great deal. After introducing a transaction manager (coordinator), a single-point bottleneck appears easily, and as the business keeps growing, system scalability becomes a problem. Note also that the protocol is synchronous: once a transaction begins, resources cannot be released until the global transaction ends, so performance suffers, and it is therefore rarely used in high-concurrency scenarios. For these reasons, Alibaba proposed another solution: the TCC pattern. Note that many readers equate two-phase commit with the two-phase commit protocol; this is a misunderstanding. The TCC pattern is, in fact, also a two-phase commit.

The TCC pattern splits a task into three operations: Try, Confirm, and Cancel. If we have a func() method, in the TCC pattern it becomes three methods: tryFunc(), confirmFunc(), and cancelFunc().

tryFunc();
confirmFunc();
cancelFunc();

In the TCC pattern, the main business service initiates the process, and the slave business services provide the Try, Confirm, and Cancel operations. There is also a transaction manager role responsible for keeping the transaction consistent. For example, suppose we have three business services: a trading service, an inventory service, and a payment service. A user selects a product, places an order, and chooses a payment method. For this request, the trading service first calls the inventory service to deduct stock, then calls the payment service to perform the payment, and the payment service in turn asks the third-party payment platform to create the transaction and deduct the money. Here, the trading service is the main business service, and the inventory service and payment service are slave business services. See Figure 6-8.

Let's walk through the TCC flow. In the first phase, the main business service calls the Try operations of all the slave business services, and the transaction manager records the operation log. In the second phase, if all the slave business services succeeded, the Confirm operations are executed; otherwise the Cancel operations are executed to roll back. See Figure 6-9.

Now let's talk about a general implementation approach for the TCC pattern. First, the trading service (the main business service) registers with the transaction manager and starts the transaction. The transaction manager here is a conceptual global transaction management mechanism; it can be business logic embedded in the main business service or an abstracted TCC framework. It generates a global transaction ID that records the whole transaction chain and implements the handling logic for nested transactions. When the main business service calls the Try operations of all the slave business services, the transaction manager uses a local transaction to record the transaction log; in this case it records the action of calling the inventory service and the action of calling the payment service, and sets their status to "pre-committed". Calling the Try operation of a slave business service is the core business code. So, how is a Try operation bound to its corresponding Confirm and Cancel operations? We can establish the binding in a configuration file, or declare confirm and cancel parameters through Spring annotations, which is also a good choice. When all the slave business services succeed, the transaction manager executes the Confirm operations through the TCC transaction-context aspect and sets the status to "successful"; otherwise it executes the Cancel operations and retries on failure. The TCC pattern thus guarantees eventual consistency through compensation.

There are many mature open source implementations of TCC, such as the tcc-transaction framework (see https://github.com/changmingxie/tcc-transaction for details). The tcc-transaction framework mainly consists of the tcc-transaction-core, tcc-transaction-api, and tcc-transaction-spring modules: tcc-transaction-core is the underlying implementation, tcc-transaction-api is the API, and tcc-transaction-spring provides Spring support. tcc-transaction abstracts each business operation into a transaction participant, and each transaction can contain multiple participants. A participant needs to declare the three methods try / confirm / cancel. Here, we mark the try method with the @Compensable annotation and define the corresponding confirm / cancel methods.

// try method
@Compensable(confirmMethod = "confirmRecord", cancelMethod = "cancelRecord", transactionContextEditor = MethodTransactionContextEditor.class)
@Transactional
public String record(TransactionContext transactionContext, CapitalTradeOrderDto tradeOrderDto) {}

// confirm method
@Transactional
public void confirmRecord(TransactionContext transactionContext, CapitalTradeOrderDto tradeOrderDto) {}

// cancel method
@Transactional
public void cancelRecord(TransactionContext transactionContext, CapitalTradeOrderDto tradeOrderDto) {}

To understand how the tcc-transaction framework works, let's look at a few core ideas. The framework intercepts methods through the @Compensable aspect, which lets it call a participant's confirm / cancel methods transparently and thereby implement the TCC pattern. tcc-transaction has two interceptors, see Figure 6-10.

  • org.mengyun.tcctransaction.interceptor.CompensableTransactionInterceptor, Compensable Transaction Interceptor.

  • org.mengyun.tcctransaction.interceptor.ResourceCoordinatorInterceptor, resource coordinator interceptor.

Here we need to pay special attention to the TransactionContext, because when we call a participant of a remote service we must pass the transaction to it as a parameter. In tcc-transaction, a transaction (org.mengyun.tcctransaction.Transaction) can have multiple participants (org.mengyun.tcctransaction.Participant) taking part in the business activity. The transaction number TransactionXid uniquely identifies a transaction and is generated with the UUID algorithm to guarantee uniqueness. When a participant makes a remote call, the transaction number of the remote branch transaction equals the participant's transaction number; the confirm / cancel methods are associated through this transaction number, which links the participant to the remote branch transaction and makes commit and rollback possible. The transaction status TransactionStatus includes the trying status TRYING (1), the confirming status CONFIRMING (2), and the cancelling status CANCELLING (3). The transaction type TransactionType includes the root transaction ROOT (1) and the branch transaction BRANCH (2). TransactionManager#begin() starts a root transaction and is called in the try method when the method type is MethodType.ROOT. TransactionManager#propagationNewBegin() propagates a branch transaction and is called in the try method when the method type is MethodType.PROVIDER. TransactionManager#commit() commits the transaction and is called in the confirm / cancel phase; similarly, TransactionManager#rollback() cancels the transaction. See Figure 6-11.

As for the transaction recovery mechanism, the tcc-transaction framework schedules recovery based on Quartz and retries the transaction at a certain frequency until it completes or the maximum number of retries is exceeded. If a single transaction exceeds the maximum number of retries, the framework stops retrying and manual intervention is required.

Here we must pay special attention to the idempotency of operations. The core of an idempotency mechanism is to guarantee the uniqueness of a resource; for example, duplicate submissions or multiple retries on the server side must produce only one result. Payment and refund scenarios, and anything involving money, must never deduct more than once. A query interface only reads data and does not change the resource, so no matter how many times it is called the resource stays the same; it is naturally idempotent. An insert interface is not idempotent, because calling it multiple times changes the resource, so we must make it idempotent in the presence of duplicate submissions. How do we do that? There are many implementations. A common one is to create a unique index: building a unique index in the database on the fields we need to constrain prevents duplicate rows from being inserted. However, with split databases and tables a unique index is not so convenient; in that case we can query the database first, check whether the constrained fields already exist, and only insert when they do not. Note that to guard against concurrency we can ensure uniqueness through locking mechanisms such as pessimistic and optimistic locks; distributed locks, usually an implementation of pessimistic locking, are a frequently used option here. However, many people treat pessimistic locks, optimistic locks, and distributed locks themselves as idempotency mechanisms, which is incorrect. In addition, we can introduce a state machine and use state constraints and state transitions to ensure that the same business flow executes only once, thereby achieving idempotency.
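A minimal sketch of the unique-index approach, assuming a table such as refund_order with a unique key on out_trade_no (the table, columns, and connection details are made up for illustration): the duplicate-key error from the database is treated as "already processed" rather than as a failure.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

public class IdempotentInsertExample {

    // Returns true if the refund record was inserted, false if it already existed.
    public static boolean insertRefundOnce(Connection conn, String outTradeNo, long amount)
            throws SQLException {
        String sql = "INSERT INTO refund_order (out_trade_no, amount) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, outTradeNo);
            ps.setLong(2, amount);
            ps.executeUpdate();
            return true;
        } catch (SQLIntegrityConstraintViolationException dup) {
            // The unique index on out_trade_no rejected the duplicate:
            // a retry or repeated submission does not create a second refund.
            return false;
        }
    }

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/demo", "root", "password")) {
            insertRefundOnce(conn, "20240101-0001", 10000L); // first call inserts
            insertRefundOnce(conn, "20240101-0001", 10000L); // second call is a no-op
        }
    }
}
```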

Compensation pattern

In the previous section we mentioned the retry mechanism. It, too, is an eventual-consistency solution: we keep retrying as hard as we can so that the database operations eventually become consistent, and if the final retries still fail we can notify the developers from the relevant logs so they can intervene manually. Note that the callee must guarantee idempotency. Retries can be synchronous: for example, when a call from the main business service times out or fails without an exception, the business call is re-initiated immediately. Retry strategies can be roughly divided into fixed-count strategies and fixed-interval strategies. We can also use message queues and scheduled tasks. The retry mechanism of a message queue redelivers a message when consumption fails, so that messages are not discarded before they are consumed; RocketMQ, for example, retries each message up to 16 times by default, and the interval between retries can be configured. For the scheduled-task retry mechanism, we can create a task execution table with a "retry count" field. Each time the scheduled task runs, it fetches tasks that are in the failed state and have not exceeded the retry limit, and retries them. When a task fails and exceeds the retry limit, it is considered permanently failed, and developers must intervene manually to investigate.
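A minimal sketch of the scheduled-task variant, assuming a hypothetical task_execution table behind the TaskDao abstraction invented for this example:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FailedTaskRetryJob {

    private static final int MAX_RETRY = 5; // assumed retry limit

    // Hypothetical record and DAO; a real system would back these with the task_execution table.
    record FailedTask(long id, int retryCount) {}
    interface TaskDao {
        List<FailedTask> findFailedTasks(int maxRetry); // status = FAILED and retry_count < maxRetry
        void markSuccess(long taskId);
        void increaseRetryCount(long taskId);
        boolean execute(long taskId);                   // re-run the business call, true on success
    }

    private final TaskDao taskDao;

    public FailedTaskRetryJob(TaskDao taskDao) {
        this.taskDao = taskDao;
    }

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Scan every minute for failed tasks that still have retries left.
        scheduler.scheduleAtFixedRate(this::retryOnce, 1, 1, TimeUnit.MINUTES);
    }

    void retryOnce() {
        for (FailedTask task : taskDao.findFailedTasks(MAX_RETRY)) {
            if (taskDao.execute(task.id())) {
                taskDao.markSuccess(task.id());
            } else {
                // Tasks that keep failing will exceed MAX_RETRY and wait for manual intervention.
                taskDao.increaseRetryCount(task.id());
            }
        }
    }
}
```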

Besides retries, we can also repair data on each update. For counting scenarios such as likes, favorites, and comment counts in social products, network jitter or the unavailability of a related service may make the data inconsistent for a period of time; we can repair it on every subsequent update, so that the system self-heals and the data becomes consistent after a short while. Note that with this approach, if a piece of inconsistent data is never updated again, it stays wrong forever.

Periodic reconciliation is another very important solution: it guarantees consistency through scheduled checking. For the choice of scheduling framework, Quartz is common for single-machine scenarios, while distributed scheduling middleware such as Elastic-Job, XXL-JOB, and SchedulerX is used in distributed scenarios. There are two periodic-reconciliation scenarios. One is the scheduled retry of unfinished work: a scheduled task scans for unfinished call tasks and repairs them through the compensation mechanism to reach eventual consistency. The other is scheduled verification, which requires the main business service to provide a query interface to the slave business service so that lost business data can be checked and recovered. Consider the refund business of an e-commerce platform. It has a basic refund service and an automated refund service; the automated refund service builds on the basic refund service, performs automatic refunds according to a set of rules, and receives the refund snapshots pushed by the basic refund service through a message queue. If the basic refund service loses a message, or the message queue actively discards a message after repeated failed retries, the data becomes inconsistent. Therefore, recovering lost business data by periodically querying and checking against the basic refund service is particularly important.
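A minimal sketch of the scheduled-verification idea in this refund example, assuming a hypothetical RefundQueryClient that wraps the basic refund service's query interface and a local RefundSnapshotDao; both names are invented for illustration:

```java
import java.time.LocalDateTime;
import java.util.List;

public class RefundReconciliationJob {

    // Hypothetical abstractions over the basic refund service and the local snapshot table.
    interface RefundQueryClient {
        List<String> listRefundNos(LocalDateTime from, LocalDateTime to); // query interface of the basic refund service
    }
    interface RefundSnapshotDao {
        boolean exists(String refundNo);
        void redeliver(String refundNo); // re-trigger processing for a lost refund
    }

    private final RefundQueryClient remote;
    private final RefundSnapshotDao local;

    public RefundReconciliationJob(RefundQueryClient remote, RefundSnapshotDao local) {
        this.remote = remote;
        this.local = local;
    }

    // Invoked by a scheduler (Quartz, Elastic-Job, XXL-JOB, ...), e.g. once per hour.
    public void reconcileLastHour() {
        LocalDateTime now = LocalDateTime.now();
        for (String refundNo : remote.listRefundNos(now.minusHours(1), now)) {
            if (!local.exists(refundNo)) {
                // The snapshot message was lost or discarded; repair it now.
                local.redeliver(refundNo);
            }
        }
    }
}
```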

Reliable event pattern

In distributed systems, message queues play a very important role in the server-side architecture; they mainly address asynchronous processing, system decoupling, and traffic peak shaving. If multiple systems communicate synchronously, blocking is likely and the systems become coupled together. Introducing a message queue solves the blocking caused by synchronous communication on the one hand, and decouples the businesses on the other. See Figure 6-12.

In the reliable event pattern, by introducing a reliable message queue, as long as the event is reliably delivered to the queue and the queue guarantees at-least-once delivery, the consumers subscribing to the event can consume it within their own business. Here, ask yourself: is introducing a message queue by itself enough to solve the problem? In fact it is not, and it does not guarantee eventual consistency, because in a distributed deployment all communication goes over the network, and messages can be lost upstream or downstream for all kinds of reasons during network communication.

First, when the main business service sends a message, the send may fail because the message queue is unavailable. In this case, we let the main business service (producer) send the message first, and only then carry out the business call. The usual practice is: the main business service persists the message to be sent into its local database with the status "to be sent", then sends the message to the message queue; the message queue persists the message into its own storage but does not deliver it to the slave business service (consumer) yet, and instead first returns a response to the main business service (producer); the main business service then decides what to do based on that response. If the response indicates failure, it abandons the subsequent business processing and sets the local message status to "ended"; otherwise it carries out the subsequent business processing and sets the local message status to "sent".

public void doServer() {
    // Persist the message locally with status "to be sent"
    saveMessage();
    // Send the message to the message queue
    boolean ok = send();
    if (!ok) {
        // The queue is unavailable: abandon the business and mark the message "ended"
        updateMsgToEnd();
        return;
    }
    // Execute the business logic
    exec();
    // Mark the local message as "sent"
    updateMsgToSent();
}

In addition, after the message reaches the message queue, the slave business service (consumer) may be down and unable to consume it. Most message middleware, such as RabbitMQ and RocketMQ, introduces an ACK mechanism for this situation. Note that with automatic acknowledgement, the default, the message queue deletes the message immediately after delivering it; therefore, to guarantee reliable delivery, we use manual ACK. If the slave business service (consumer) fails to ACK because of downtime or some other reason, the message queue redelivers the message to ensure reliability. Once the slave business service has finished its processing, it notifies the message queue with a manual ACK, and the message queue removes the persisted message. Then, what if the message queue keeps failing to deliver, exhausts its retries, and actively discards the message? Sharp-eyed readers will have noticed that in the previous step the main business service already persisted the message in its local database. Therefore, after the slave business service consumes the message successfully, it in turn sends a notification message to the message queue, acting as a producer at that moment; when the main business service (now a consumer) receives that notification, it finally marks the local persisted message as "completed". By now readers should see that we use this "forward and reverse message mechanism" to ensure reliable event delivery through the message queue. Of course, a compensation mechanism is still essential: a scheduled task scans the database for messages that have not completed within a certain period and redelivers them. See Figure 6-13.
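As a small sketch of the consumer side with RocketMQ's push consumer (the group name, topic, and name-server address are placeholders for illustration), returning RECONSUME_LATER plays the role of "not ACKed", so the queue will redeliver the message later:

```java
import java.util.List;

import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyContext;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.client.exception.MQClientException;
import org.apache.rocketmq.common.message.MessageExt;

public class RefundMessageConsumer {

    public static void main(String[] args) throws MQClientException {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("refund_consumer_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        consumer.subscribe("REFUND_TOPIC", "*");

        consumer.registerMessageListener((MessageListenerConcurrently) (List<MessageExt> msgs,
                                                                        ConsumeConcurrentlyContext context) -> {
            try {
                for (MessageExt msg : msgs) {
                    // Business processing must be idempotent, because redelivery can happen.
                    handleRefund(new String(msg.getBody()));
                }
                // Equivalent to a manual ACK: the queue considers the message consumed.
                return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
            } catch (Exception e) {
                // No ACK: the queue will redeliver the message later.
                return ConsumeConcurrentlyStatus.RECONSUME_LATER;
            }
        });

        consumer.start();
    }

    private static void handleRefund(String payload) {
        // Hypothetical business logic of the slave business service.
    }
}
```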

Note that because of message-processing timeouts, service downtime, or network problems, the message queue may never receive the consumption result from the slave business service, so it guarantees that events are delivered at least once. The slave business service (consumer) therefore needs to guarantee idempotency; if it does not, duplicate submission and similar anomalies will occur. In addition, we can deploy the message service independently and share it across different business scenarios, reducing the cost of developing it repeatedly.

Now that we understand the methodology of the reliable event pattern, let's look at a real case to deepen the understanding. When a user initiates a refund, the automated refund service receives a refund event message. If the refund matches the automated refund policy, the automated refund service first persists a refund snapshot in its local database, then sends an "execute refund" message to the message queue. The message queue persists the message and returns a success response, after which the automated refund service can continue with its own business logic. Meanwhile, the message queue asynchronously delivers the message to the basic refund service, which executes its own business logic; whether that execution succeeds is the basic refund service's responsibility, and if it succeeds it publishes a "refund executed successfully" message to the message queue. Finally, a scheduled task scans the database for messages that have not completed within a certain period and redelivers them. Note that the automated refund service's persisted refund snapshot can be understood as a message that must be delivered successfully; the "forward and reverse message mechanism" and the scheduled task ensure its successful delivery. In addition, the real refund accounting logic is guaranteed by the basic refund service, so it must be idempotent and its accounting logic must converge. When a message fails and exceeds the retry limit, it means the task has permanently failed, and developers must intervene manually to investigate. See Figure 6-14.

To sum up, simply introducing a message queue does not by itself guarantee reliable event delivery; messages can still be lost for reasons such as network failures, which breaks eventual consistency. Therefore, we use the "forward and reverse message mechanism" to ensure reliable delivery through the message queue, and a compensation mechanism to redeliver, as best we can, the messages that have not completed within a certain period.

How open source projects implement distributed transactions

The way open source projects apply distributed transactions gives us a lot to learn from. In this section, we interpret their implementations.

RocketMQ

Apache RocketMQ is a high-performance, high-throughput distributed message middleware open sourced by Alibaba. During past Double 11 shopping festivals, RocketMQ carried all of the message flow of Alibaba's production systems and performed stably and excellently on the core transaction links; it is one of the core infrastructure products supporting peak transaction volumes. RocketMQ also has a commercial MQ edition that can be purchased on Alibaba Cloud (https://www.aliyun.com/product/ons). The main difference between the open source and commercial editions is that all core distributed-messaging features are open sourced, while the commercial side focuses on cloud platform construction, operation and maintenance management, security and authorization, and in-depth training.

Apache RocketMQ officially supports distributed transaction messages since version 4.3. RocketMQ's transaction message design mainly solves the atomicity between sending the message on the producer side and executing the local transaction; in other words, if the local transaction does not execute successfully, the MQ message is not delivered. A sharp reader may ask: can't we simply execute the local transaction first and send the MQ message only after it succeeds, which would also be transactional? Think again: what if the MQ message then fails to be sent? RocketMQ offers a good idea and solution for this. RocketMQ first sends a pre-execution (half) message to MQ, and executes the local transaction only after that message is sent successfully. It then continues based on the result of the local transaction: if the result is commit, the MQ message is formally delivered; if the result is rollback, MQ deletes the previously stored pre-execution message and never delivers it. For abnormal situations, such as the server going down or timing out while the local transaction is executing, RocketMQ keeps checking back with other producers in the same group to obtain the transaction status. See Figure 6-15.
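A minimal sketch of this flow with the RocketMQ 4.3+ client API (the topic name, group name, and local-transaction details are placeholders for illustration):

```java
import org.apache.rocketmq.client.producer.LocalTransactionState;
import org.apache.rocketmq.client.producer.TransactionListener;
import org.apache.rocketmq.client.producer.TransactionMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class TransactionalProducerExample {

    public static void main(String[] args) throws Exception {
        TransactionMQProducer producer = new TransactionMQProducer("order_tx_producer_group");
        producer.setNamesrvAddr("127.0.0.1:9876");

        producer.setTransactionListener(new TransactionListener() {
            @Override
            public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
                try {
                    // Execute the local transaction after the pre-execution (half) message is stored.
                    createOrderLocally(arg);
                    return LocalTransactionState.COMMIT_MESSAGE;   // formally deliver the message
                } catch (Exception e) {
                    return LocalTransactionState.ROLLBACK_MESSAGE; // delete the half message
                }
            }

            @Override
            public LocalTransactionState checkLocalTransaction(MessageExt msg) {
                // Called when the broker checks back (e.g. the producer crashed or timed out):
                // query the local database to decide whether the transaction really committed.
                return LocalTransactionState.COMMIT_MESSAGE;
            }
        });

        producer.start();
        Message message = new Message("ORDER_TOPIC", "order created".getBytes());
        producer.sendMessageInTransaction(message, /* arg passed to the listener */ 10001L);
        producer.shutdown();
    }

    private static void createOrderLocally(Object arg) {
        // Hypothetical local transaction body.
    }
}
```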

So far we have understood RocketMQ's implementation idea. If you are interested in the source code, you can read org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl#sendMessageInTransaction.

ServiceComb

ServiceComb is open sourced from Huawei's internal CSE (Cloud Service Engine) framework. It is a microservice framework that provides code scaffolding, service registration and discovery, load balancing, and service reliability features (circuit breaking and fault tolerance, rate limiting and degradation, call-chain tracing). Among its components, ServiceComb Saga is an eventual-consistency solution for data in microservice applications.

Saga splits a distributed transaction into multiple local transactions, which are then coordinated by the Saga engine. If the whole flow ends normally, the business has succeeded; if some step fails along the way, the Saga engine invokes compensation operations. Saga has two recovery strategies: forward recovery and backward recovery. Forward recovery keeps retrying the failed node as hard as it can so that the database operations eventually become consistent; if the retries ultimately fail, developers can be notified from the relevant logs for manual intervention. Backward recovery rolls back all previously successful nodes, so that the data returns to a consistent state.

The difference between Saga and TCC is that Saga has no Try operation. Saga therefore commits directly to the database and performs compensation only when a failure occurs. This design can make compensation more troublesome in extreme scenarios, but for simple business logic it is less intrusive, lighter, and requires fewer network round trips. See Figure 6-16.

ServiceComb Saga builds on this theory with two components: alpha and omega. Alpha acts as the coordinator; it persists the transaction's events and coordinates the state of the sub-transactions so that they eventually agree with the state of the global transaction. Omega is an agent embedded in each microservice; it intercepts network requests, reports transaction events to alpha, and performs the corresponding compensation operations according to alpha's instructions when something goes wrong. In the pre-processing phase alpha records the event that the transaction started, and in the post-processing phase it records the event that the transaction ended, so every successful sub-transaction has a one-to-one pair of start and end events. On the service producer side, omega extracts the transaction-related IDs from the request to recover the transaction context; on the service consumer side, omega injects the transaction-related IDs into the request to pass the transaction context along. Through this cooperation between service providers and consumers, the sub-transactions are linked together into a complete global transaction. Note that Saga requires each sub-transaction to provide both its transaction method and a compensation method. Here, the @EnableOmega annotation initializes omega's configuration and establishes the connection with alpha; the @SagaStart annotation marks the starting point of the global transaction; and the @Compensable annotation on a sub-transaction declares its corresponding compensation method. A usage example: https://github.com/apache/servicecomb-saga/tree/master/saga-demo

@EnableOmega
public class Application{
  public static void main(String[] args) {
    SpringApplication.run(Application.class, args);
  }
}

@SagaStart
public void xxx() { }


@Compensable
public void transfer() { }

Now, let's take a look at its business flow diagram, see Figure 6-17.
