One article to thoroughly understand how microservices guarantee transactional consistency

Original address: Liang Guizhao's blog

Blog address: blog.720ui.com

Welcome to follow my WeChat official account, "Server-Side Thinking": a group of like-minded people growing together, sharpening our skills, and breaking through the limits of our understanding.

From local transactions to distributed transactions

What is a transaction? Before answering, let's look at a classic scenario: a transfer on a trading platform such as Alipay. Suppose Xiao Ming needs to transfer 100,000 yuan to Xiao Hong through Alipay; Xiao Ming's account should end up with 100,000 yuan less, and Xiao Hong's account with 100,000 yuan more. If the system crashes partway through, so that Xiao Ming's account is debited 100,000 yuan but Xiao Hong's account is never credited, we have a serious problem — and that is exactly why we need transactions. See Figure 6-1.

This scenario highlights one crucial property of transactions: atomicity. In fact, transactions have four basic properties: atomicity, consistency, isolation, and durability. Atomicity means that either all operations within a transaction succeed or all of them fail; the transaction never stops partway through. Consistency means the database must be in a consistent state both before and after the transaction executes; if the transaction fails, it must automatically roll back to the original state. In other words, once a transaction commits, its results are visible to other transactions, and once it rolls back, other transactions can only see the state before the rollback. Isolation means that in a concurrent environment, when different transactions modify the same data at the same time, one uncompleted transaction does not affect another uncompleted transaction. Durability means that once a transaction commits, its changes are saved permanently to the database — the modification is permanent.

Local transactions guarantee strong data consistency through ACID, an abbreviation of Atomicity, Consistency, Isolation, and Durability. In everyday development we all use local transactions to some degree. For example, in MySQL we use begin to start a transaction, rollback to roll it back, and commit to confirm it. Under the hood, when a transaction commits, the changes are recorded in the redo log, and on failure they are rolled back via the undo log; this is what guarantees the transaction's atomicity. As a side note, developers using Java have all touched Spring: with the @Transactional annotation, Spring handles the transaction for us. In fact, Spring encapsulates these details when generating the related beans — when a bean annotated with @Transactional is injected, a proxy is injected instead, and the proxy opens, commits, or rolls back the transaction on our behalf. See Figure 6-2.
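The atomicity-via-undo-log idea described above can be illustrated with a toy sketch. This is not MySQL's actual redo/undo implementation — the class and method names here are hypothetical — but it shows the essence: every change records its inverse, and a failure replays the inverses to restore the original state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy illustration of transactional atomicity via an in-memory undo log.
// All names are hypothetical; a real database does this in the storage engine.
public class ToyTransaction {
    private final Map<String, Long> accounts = new HashMap<>();
    private final Deque<Runnable> undoLog = new ArrayDeque<>();

    public ToyTransaction(Map<String, Long> initial) {
        accounts.putAll(initial);
    }

    public long balance(String account) { return accounts.get(account); }

    // Transfers amount atomically: either both updates apply or neither does.
    public boolean transfer(String from, String to, long amount) {
        undoLog.clear();
        try {
            debit(from, amount);
            credit(to, amount);
            undoLog.clear();  // commit: the undo records are no longer needed
            return true;
        } catch (RuntimeException e) {
            // rollback: replay the undo records in reverse order
            while (!undoLog.isEmpty()) undoLog.pop().run();
            return false;
        }
    }

    private void debit(String account, long amount) {
        long old = accounts.get(account);
        if (old < amount) throw new IllegalStateException("insufficient funds");
        accounts.put(account, old - amount);
        undoLog.push(() -> accounts.put(account, old));
    }

    private void credit(String account, long amount) {
        long old = accounts.get(account);  // throws if the account is unknown
        accounts.put(account, old + amount);
        undoLog.push(() -> accounts.put(account, old));
    }
}
```

If the credit step fails after the debit has already been applied, the catch block undoes the debit, so the transfer is all-or-nothing, mirroring Xiao Ming's transfer scenario above.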

As the business grows rapidly and we face massive data — tens of millions or even billions of rows — queries take longer and longer, and the database can come under single-point pressure. At that point we consider sharding: splitting databases and splitting tables. The purpose of splitting databases and tables is to reduce the burden on any single database and any single table, improve query performance, and shorten query time. Let's first look at splitting within a single database. Table-splitting strategies can be summarized as vertical splits and horizontal splits. A vertical split divides a table by its fields: a table with relatively large fields is split into multiple tables so that each row becomes smaller. On one hand, this reduces the number of bytes transferred over the network between client and database — in a production environment services share the same network bandwidth, and as concurrent queries increase, bandwidth can become a bottleneck and cause blocking. On the other hand, one data block can hold more rows, which reduces the number of I/O operations per query. A horizontal split divides a table by its rows. Once a table exceeds a few million rows it slows down, so we can split its data across multiple tables. There are several horizontal-split strategies, for example splitting tables by modulo or by time dimension. In this scenario, although we have divided the tables according to specific rules, we can still use local transactions. However, table splitting within one database only solves the problem of a single table holding too much data; it does not spread the tables across different physical machines, so it does not reduce the pressure on the MySQL server — resource competition and bottlenecks still exist on the same physical machine, including CPU, memory, disk I/O, and network bandwidth.
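The modulo-based horizontal split mentioned above can be sketched as a tiny routing rule. The table-naming scheme below is an illustrative assumption, not a prescription:

```java
// Sketch of modulo-based horizontal sharding: a row is routed to one of N
// physical tables by its key. The "order_" table names are hypothetical.
public class ShardRouter {
    private final int tableCount;

    public ShardRouter(int tableCount) { this.tableCount = tableCount; }

    // The same key always routes to the same table, which is why a
    // single-key operation can still run inside a plain local transaction.
    public String tableFor(long orderId) {
        return "order_" + (orderId % tableCount);
    }
}
```

Because all shards of one key live in the same database, this split alone never forces a cross-database transaction — that problem only appears with the database split discussed next.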
In a database split, the data is divided across different databases, and the multiple databases share the same table structure. If we can route all the data involved in a transaction to the same database according to certain rules, we can still guarantee strong consistency through local transactions. However, when a vertical split is carried out along service and functional lines, the data of different services lands in different databases. Here, the split system runs into data-consistency problems, because the data we need to keep transactional is scattered across different databases, and each database can only guarantee that its own data satisfies ACID and is strongly consistent. In a distributed system the databases may be deployed on different servers and can only communicate over the network, so no database can know exactly how a transaction is executing in the others. See Figure 6-3.

Moreover, the problems local transactions cannot solve are not limited to cross-database calls. As microservices land in practice, every service has its own database, and these databases are independent and opaque to each other. If service A needs data held by service B, there is a cross-service call; if the called service is down, the network connection fails, or a synchronous call times out, the data can end up inconsistent. This is another distributed scenario in which data consistency must be considered. See Figure 6-4.

To summarize: once databases are split to scale the business, and once the business is service-oriented through microservices, the problem of inconsistent distributed data arises. Since local transactions cannot meet the need, distributed transactions come on stage. What is a distributed transaction? We can understand it simply as a transactional solution for guaranteeing the consistency of data across different databases. Here, we first need to understand the CAP principle and BASE theory. CAP stands for Consistency, Availability, and Partition tolerance; it is a trade-off theory for distributed systems. In a distributed system, consistency requires that every read operation on every node is guaranteed to return the latest data; availability requires that the service remain usable no matter what failures occur; partition tolerance requires that when nodes are partitioned, they can still provide service normally. In fact, any system can satisfy only two of the three, never all three at once. For a distributed system, partition tolerance is a basic requirement. So, if we choose consistency and partition tolerance and give up availability, network problems can make the system unusable. If we choose availability and partition tolerance and give up consistency, data on different nodes cannot be synchronized in time, leading to data inconsistency. See Figure 6-5.

Weighing consistency against availability, BASE theory proposes a scheme. BASE stands for Basically Available, Soft state, and Eventually consistent, and it is the theoretical support behind eventual consistency. In simple terms, in a distributed system we allow a partial loss of availability, and we allow a delay while different nodes synchronize their data, but after a period of repair the data eventually reaches a consistent state. BASE emphasizes the eventual consistency of data. Compared with ACID, BASE gains availability by allowing some consistency to be lost.

Today, the industry's more commonly used distributed-transaction solutions include the strongly consistent two-phase commit protocol and three-phase commit protocol, as well as the eventually consistent reliable event mode, compensation mode, and Alibaba's TCC mode. We will introduce them in practical detail in the sections that follow.

Strong-consistency solutions

Two-phase commit protocol

In a distributed system, each database can only guarantee that its own data satisfies ACID and is strongly consistent, but the databases may be deployed on different servers and can only communicate over the network, so none of them can know exactly how a transaction is executing in the others. To solve the problem of coordinating multiple nodes, we introduce a coordinator responsible for controlling the operation outcome of all nodes: either all succeed or all fail. The XA distributed transaction protocol is one such protocol, and it defines two roles: the transaction manager and the resource manager. Here, we can understand the transaction manager as the coordinator and the resource managers as the participants.

The XA protocol guarantees strong consistency through a two-phase commit.

The two-phase commit protocol, as the name suggests, has two phases: a first prepare phase and a second commit phase. The transaction manager (coordinator) controls the operation outcome of all nodes, covering both the prepare process and the commit process. In the first phase, the transaction manager (coordinator) sends a prepare instruction to the resource managers (participants), asking whether their pre-commit can succeed. If a resource manager (participant) can complete the operation, it executes it without committing, and finally responds with whether its pre-commit succeeded or failed. In the second phase, if all resource managers (participants) reply that the pre-commit succeeded, the transaction manager (coordinator) sends them a formal commit instruction. If any resource manager (participant) replies that its pre-commit failed, the transaction manager (coordinator) sends a rollback instruction to all resource managers (participants). For example, suppose we have one transaction manager (coordinator) and three resource managers (participants), and in this transaction we need to guarantee strong consistency of the data across all three participants. First, the transaction manager (coordinator) issues the prepare instruction to ask whether each pre-commit will succeed; if all participants reply that the pre-commit succeeded, the transaction manager (coordinator) formally issues the commit instruction to apply the data changes. See Figure 6-6.
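The flow just described can be condensed into a minimal simulation. The Participant interface below is a stand-in for a resource manager; all names are illustrative, not part of any real XA implementation:

```java
import java.util.List;

// Minimal simulation of the two-phase commit flow: commit only if every
// participant votes yes in the prepare phase, otherwise roll back all.
public class TwoPhaseCommit {
    public interface Participant {
        boolean prepare();  // phase 1: execute without committing, vote yes/no
        void commit();      // phase 2, success path
        void rollback();    // phase 2, failure path
    }

    // The coordinator (transaction manager) controls all participants.
    public static boolean run(List<Participant> participants) {
        boolean allPrepared = true;
        for (Participant p : participants) {
            if (!p.prepare()) { allPrepared = false; break; }
        }
        if (allPrepared) {
            for (Participant p : participants) p.commit();
        } else {
            for (Participant p : participants) p.rollback();
        }
        return allPrepared;
    }
}
```

Note how the coordinator blocks through both loops: this synchronous waiting is exactly the blocking problem discussed next.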

Note that although the two-phase commit protocol offers a solution for strong consistency, problems remain. First, the transaction manager (coordinator) controls the operation outcome of all nodes through the prepare and commit processes, but the entire process is synchronous, so the transaction manager (coordinator) must wait for every resource manager (participant) to return its result before taking the next step. This easily causes a synchronous-blocking problem. Second, single points of failure need serious consideration. Both the transaction manager (coordinator) and the resource managers (participants) can go down. If a resource manager (participant) fails and never responds, the transaction manager (coordinator) keeps waiting; if the transaction manager (coordinator) fails, the transaction process loses its controller — in other words, the whole process blocks indefinitely, and in extreme cases some resource managers (participants) commit their data while others do not, leaving the data inconsistent. At this point a reader may ask: aren't these low-probability cases that generally won't occur? Yes, but in distributed-transaction scenarios we must consider not only the normal logical flow but also the low-probability abnormal flows. If we lack handling for abnormal scenarios, data inconsistency may appear, and relying on manual intervention afterward is very costly; on core transactional links, the problem may not be merely the data but the far more serious loss of funds.

Three-phase commit protocol

Because of the problems of the two-phase commit protocol, the three-phase commit protocol takes the stage. The three-phase commit protocol is an improved version of the two-phase commit protocol. It differs from the two-phase commit protocol in two ways: it introduces a timeout mechanism to solve the synchronous-blocking problem, and it adds a preliminary inquiry phase to discover, as early as possible, any resource manager (participant) that cannot execute the operation and terminate the transaction. Only if all resource managers (participants) report that they can complete the operation does it proceed to the second-phase prepare and the third-phase commit; otherwise, if any resource manager (participant) replies that it cannot execute, or waiting times out, the transaction is terminated. To sum up, the three-phase commit protocol comprises a first inquiry phase (CanCommit), a second prepare phase (PreCommit), and a third commit phase (DoCommit). See Figure 6-7.
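The three phases above can be sketched in the same style as the earlier two-phase example. The interface and names are illustrative; a real implementation would also run timeouts on each phase, which this sketch only models as a phase returning false:

```java
import java.util.List;

// Compact sketch of the three-phase commit flow: CanCommit, PreCommit,
// DoCommit. A false return from preCommit() stands in for a failure or
// a timeout; names are illustrative, not from a real library.
public class ThreePhaseCommit {
    public interface Participant {
        boolean canCommit();  // phase 1: ask early whether commit is possible
        boolean preCommit();  // phase 2: prepare the resources
        void doCommit();      // phase 3: make the change final
        void abort();
    }

    public static boolean run(List<Participant> ps) {
        for (Participant p : ps) {
            if (!p.canCommit()) { ps.forEach(Participant::abort); return false; }
        }
        for (Participant p : ps) {
            if (!p.preCommit()) { ps.forEach(Participant::abort); return false; }
        }
        ps.forEach(Participant::doCommit);
        return true;
    }
}
```

The extra CanCommit round is what lets a doomed transaction terminate before any participant has locked resources in the prepare phase.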

The three-phase commit protocol nicely addresses the problems the two-phase commit protocol brings and is a very meaningful reference solution. However, in very-low-probability scenarios, data inconsistency can still occur. Because the three-phase commit protocol introduces a timeout mechanism, a resource manager (participant) that times out will commit by default; if it did not in fact execute successfully, or if other resource managers (participants) rolled back, the data ends up inconsistent.

Eventual-consistency solutions

TCC mode

The two-phase and three-phase commit protocols address the distributed-transaction problem well, but data inconsistency can still occur in extreme cases. Moreover, their system cost is relatively high: the introduced transaction manager (coordinator) easily becomes a single-point bottleneck, and as the business scale grows larger and larger, system scalability becomes a problem. Note also that these are synchronous operations — once a transaction begins, resources are not released until the global transaction ends — so performance suffers badly, and they are rarely used in high-concurrency scenarios. Alibaba therefore proposed another solution: TCC mode. Note that many readers equate TCC mode with the two-phase commit protocol; this is a misunderstanding — TCC is indeed a form of two-phase commit, but one carried out at the business layer.

TCC mode splits a task into three operations: Try, Confirm, and Cancel. Suppose we have a func() method; in TCC mode it becomes three methods: tryFunc(), confirmFunc(), and cancelFunc().

tryFunc();
confirmFunc();
cancelFunc();

In TCC mode, the primary business service initiates the process, while the secondary business services provide the three TCC operations: Try, Confirm, and Cancel. There is also a transaction-manager role responsible for controlling the consistency of the transaction. For example, suppose we now have three business services: a trade service, an inventory service, and a payment service. A user selects a product, places an order, and then chooses a payment method to pay. For this request, the trade service first calls the inventory service to deduct stock, then calls the payment service to carry out the payment-related operations, and the payment service in turn requests the third-party payment platform to create the charge. Here, the trade service is the primary business service, and the inventory service and payment service are secondary business services. See Figure 6-8.

Let's walk through the TCC-mode process again. In the first phase, the primary business service calls the Try operation of all secondary business services, and the transaction manager records the operation log. In the second phase, once all secondary business services have succeeded, the Confirm operation is executed; otherwise the reverse Cancel operation is executed to roll back. See Figure 6-9.

Now let's talk about how TCC mode can roughly be implemented in a business system. First, the trade service (the primary business service) registers with the transaction manager and starts the transaction. The transaction manager is conceptually a global transaction-management mechanism; it can be a piece of business logic embedded in the primary business service, or it can be extracted into a TCC framework. It generates a global transaction ID to record the entire transaction chain, and it implements the process logic of nested transactions. When the primary business service calls the Try operation of every secondary business service, the transaction manager uses a local transaction to record the relevant transaction logs — in this case, the action of calling the inventory service and the action of calling the payment service — and sets their state to "pre-commit". The Try operations exposed by the secondary business services are the core business code. So how is a Try operation bound to its corresponding Confirm and Cancel operations? In practice, we can write a configuration file to establish the binding; with Spring annotations, adding confirm and cancel parameters is also a good choice. When all secondary business services succeed, the transaction manager executes the Confirm operation through the TCC transaction context and sets the state to "success"; otherwise it keeps the state "pre-commit" and executes the Cancel operation, retrying on failure. In this way, TCC mode guarantees eventual consistency by way of compensation.
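The flow above can be condensed into a small coordinator sketch. The interface names are illustrative (they are not tcc-transaction's API), and the retry-on-failure step is omitted for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the TCC flow: call Try on every secondary service; if all
// succeed, Confirm them all, otherwise Cancel the ones that already
// reserved resources. Names are illustrative.
public class TccCoordinator {
    public interface TccParticipant {
        boolean tryReserve();  // Try: reserve the business resource
        void confirm();        // Confirm: make the reservation final
        void cancel();         // Cancel: release the reservation (compensate)
    }

    public static boolean execute(List<TccParticipant> participants) {
        List<TccParticipant> reserved = new ArrayList<>();
        for (TccParticipant p : participants) {
            if (p.tryReserve()) {
                reserved.add(p);
            } else {
                // Compensate only the participants whose Try succeeded.
                for (TccParticipant r : reserved) r.cancel();
                return false;
            }
        }
        for (TccParticipant p : reserved) p.confirm();
        return true;
    }
}
```

Unlike XA, each Try/Confirm/Cancel here is ordinary business code committed in its own local transaction, which is why TCC does not hold database locks across the whole global transaction.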

There are many mature open-source TCC frameworks, such as the tcc-transaction framework. (For details on tcc-transaction, see github.com/changmingxi...) The framework consists of three modules: tcc-transaction-core, tcc-transaction-api, and tcc-transaction-spring. tcc-transaction-core is the underlying implementation, tcc-transaction-api is the API tcc-transaction exposes, and tcc-transaction-spring provides the Spring support. tcc-transaction abstracts each business activity as a transaction participant, and each transaction can comprise multiple participants. A participant needs to declare the three method types: try / confirm / cancel. Here, we mark the try method with the @Compensable annotation and define the corresponding confirm / cancel methods.

// try method
@Compensable(confirmMethod = "confirmRecord", cancelMethod = "cancelRecord", transactionContextEditor = MethodTransactionContextEditor.class)
@Transactional
public String record(TransactionContext transactionContext, CapitalTradeOrderDto tradeOrderDto) {}

// confirm method
@Transactional
public void confirmRecord(TransactionContext transactionContext, CapitalTradeOrderDto tradeOrderDto) {}

// cancel method
@Transactional
public void cancelRecord(TransactionContext transactionContext, CapitalTradeOrderDto tradeOrderDto) {}

To understand how the tcc-transaction framework is implemented, we need to grasp a few of its core ideas. The framework intercepts the @Compensable aspect and transparently calls the participant's confirm / cancel methods, thereby implementing TCC mode. tcc-transaction has two interceptors; see Figure 6-10.

  • org.mengyun.tcctransaction.interceptor.CompensableTransactionInterceptor, the compensable-transaction interceptor.

  • org.mengyun.tcctransaction.interceptor.ResourceCoordinatorInterceptor, the resource-coordinator interceptor.

Pay special attention to the TransactionContext transaction context: when a participant needs to call a remote service, the transaction must be passed to the remote participant through the method parameters. In tcc-transaction, a transaction (org.mengyun.tcctransaction.Transaction) can have multiple participants (org.mengyun.tcctransaction.Participant) involved in the business activity. The transaction number, TransactionXid, uniquely identifies a transaction and is generated with the UUID algorithm to guarantee uniqueness. When a participant makes a remote call, the transaction number of the remote branch transaction equals the participant's transaction number; through this association of transaction numbers, the participant's TCC confirm / cancel methods implement transaction commit and rollback. The transaction states (TransactionStatus) comprise: trying TRYING(1), confirming CONFIRMING(2), and cancelling CANCELLING(3). The transaction types (TransactionType) comprise: root transaction ROOT(1) and branch transaction BRANCH(2). Calling TransactionManager#begin() starts a root transaction; the method type is MethodType.ROOT, and it is invoked when the try method is called. Calling TransactionManager#propagationNewBegin() starts a propagated branch transaction; it is invoked when the method type is MethodType.PROVIDER and the try method is called. Calling TransactionManager#commit() commits the transaction; it is invoked when the confirm method is called. Similarly, calling TransactionManager#rollback() cancels the transaction; it is invoked when the cancel method is called.

In addition, as its transaction-recovery mechanism, the tcc-transaction framework implements a scheduler based on Quartz that retries transactions at a certain frequency until the transaction completes or exceeds the maximum number of retries. If a single transaction exceeds the maximum number of retries, the tcc-transaction framework stops retrying it, and manual intervention is needed to resolve it.

Here we must pay special attention to making operations idempotent. The core of an idempotency mechanism is guaranteeing resource uniqueness: for example, a duplicate client submission or a server-side retry produces only one result. Payment and refund scenarios involve money, and problems such as double charging must never occur. A query interface is naturally idempotent — it only reads data without changing any resource, so no matter how many times it is called, the resources do not change. An insert interface is non-idempotent, because calling it multiple times produces multiple resource changes. We therefore need idempotency handling for cases such as duplicate submission. So how do we implement an idempotency mechanism? There are many approaches. A common scheme is a unique index: create a unique index in the database on the fields of the resource we need to constrain, and duplicate rows cannot be inserted. However, in a sharded (split database and table) setup, a unique index is not so convenient; in that case we can first query the database to determine whether the constrained fields already exist, and insert only if there is no duplicate. Note that to handle concurrent scenarios, we can use locking mechanisms, such as pessimistic and optimistic locks, to guarantee data uniqueness. A distributed lock is a frequently used scheme here, and it is usually an implementation of a pessimistic lock. However, many people treat pessimistic locks, optimistic locks, or distributed locks as idempotency solutions in themselves, which is not accurate. In addition, we can also introduce a state machine, using state constraints and state transitions to guarantee the execution order of a single business process, thereby achieving data idempotency.
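The unique-constraint idea above can be sketched with an in-memory set standing in for a database unique index. The class and key names are illustrative assumptions:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of idempotent submission: a unique business key (here an in-memory
// concurrent set standing in for a database unique index) ensures a repeated
// request takes effect only once. Names are illustrative.
public class IdempotentHandler {
    private final Set<String> processedKeys = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a given request key is seen;
    // duplicates are rejected instead of, say, charging twice.
    public boolean process(String requestKey) {
        // add() is atomic and returns false on a duplicate, so the check
        // and the insert cannot race, mirroring a unique-index violation.
        return processedKeys.add(requestKey);
    }
}
```

With a real database, the equivalent is catching the duplicate-key error on insert; the query-then-insert variant mentioned above additionally needs a lock to stay safe under concurrency.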

Compensation mode

In the previous section we mentioned the retry mechanism. It is itself an eventual-consistency solution: we keep retrying, making our best effort so that the database operation eventually leaves the data consistent, and if the final retries all fail, we proactively notify the responsible developers, based on the relevant logs, for manual intervention. Note that the callee must guarantee idempotency. The retry mechanism can be synchronous — for example, the primary business service re-initiates a service call promptly when a call times out or fails with an exception. Retry mechanisms can be roughly divided into a fixed-count retry strategy and a fixed-interval retry strategy. Beyond that, we can also use message queues and timed tasks to retry. A message-queue retry mechanism redelivers a message when consumption fails, preventing the message from being dropped; for example, RocketMQ by default allows each message up to 16 retries, and the interval between retries can be configured. For a timed-task retry mechanism, we can create a task-execution table and add a "retry count" field. With this design, each timed run fetches the tasks whose state is failed and whose retry count has not been exceeded, and retries them. When a task's state is failed and its retry count has been exceeded, the task has permanently failed, and developers need to intervene manually and troubleshoot.
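The fixed-count strategy with a retry cap can be sketched as follows. The names are illustrative, and a real implementation would also space the attempts out in time rather than retrying immediately:

```java
import java.util.function.Supplier;

// Sketch of a fixed-count retry strategy with a "retry count" cap.
// The task must be idempotent, since it may run several times.
public class RetryExecutor {
    // Runs the task until it reports success or maxRetries is exhausted.
    // Returns true on success; false means the task permanently failed
    // and manual intervention is needed.
    public static boolean runWithRetry(Supplier<Boolean> task, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (task.get()) return true;
        }
        return false;  // exceeded the retry cap: log and alert a developer
    }
}
```

The loop bound makes the cap explicit: one initial attempt plus at most maxRetries retries, matching the "retry count" field of the task table described above.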

Besides the retry mechanism, we can also repair data on each update. For example, for social-interaction counters such as likes, favorites, and comments, network jitter or an unavailable dependent service may leave the data inconsistent for a period of time; by repairing on each update, the system heals itself within a short time after recovery, and the data ultimately becomes consistent. Note that with this solution, if a piece of data is inconsistent but is never updated again, it will remain abnormal forever.

Timed reconciliation is also a very important remedy: it guarantees consistency through periodic verification. As for timed-task frameworks, the industry commonly uses Quartz in single-machine scenarios, and distributed timed-task middleware such as Elastic-Job, XXL-JOB, and SchedulerX in distributed scenarios. Timed reconciliation falls into two scenarios. One is timed retry of unfinished work: for example, we use a timed task to scan for tasks that have not yet completed and repair them through the compensation mechanism, so the data finally becomes consistent. The other is timed verification, which requires the primary business service to provide a query interface that the secondary business services use to check against and recover lost business data. Now, let's imagine an e-commerce refund scenario with a basic refund service and an automated refund service. The automated refund service builds on the basic refund service to enhance refund capability, implements rule-based automated refunds, and receives refund snapshot information pushed by the basic refund service through a message queue. However, if the basic refund service loses the message it sends, or the message queue actively discards the message after several failed retries, the data is very likely to become inconsistent. Therefore, having the automated refund service check itself by periodically querying the basic refund service, and thereby recover lost business data, is especially important.
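A timed verification pass can be sketched by modeling the primary service's query interface and the secondary service's store as maps. All names are illustrative; a real job would page through records in a time window rather than scan everything:

```java
import java.util.Map;

// Sketch of a timed reconciliation pass: compare the secondary service's
// records against the primary service's query interface and restore
// anything missing. Both "services" are modeled as maps for illustration.
public class ReconciliationJob {
    // Returns the number of lost records copied back into the secondary store.
    public static int reconcile(Map<String, String> primary,
                                Map<String, String> secondary) {
        int repaired = 0;
        for (Map.Entry<String, String> e : primary.entrySet()) {
            if (!secondary.containsKey(e.getKey())) {
                secondary.put(e.getKey(), e.getValue());  // recover lost data
                repaired++;
            }
        }
        return repaired;
    }
}
```

In the refund scenario above, primary would be the basic refund service's query interface and secondary the automated refund service's snapshot table; a scheduler such as Quartz or XXL-JOB would run this pass periodically.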

Reliable event mode

In distributed systems, message queues occupy a very important position in server-side architecture, chiefly solving scenarios such as asynchronous processing, decoupling, and traffic shaving. Synchronous communication among multiple systems easily causes blocking, and it couples those systems together. Introducing a message queue, on the one hand, resolves the blocking caused by the synchronous communication mechanism, and on the other hand decouples the business through the queue. See Figure 6-12.

The reliable event mode introduces a reliable message queue: as long as the current service delivers its event reliably and the message queue guarantees the event is delivered at least once, the consumers subscribed to this event are guaranteed to be able to consume it within their own business. Here, a reader may wonder: is introducing a message queue really enough to solve the problem? In fact, merely introducing a message queue does not guarantee eventual consistency, because a distributed deployment communicates over the network, and during network communication, messages can be lost upstream or downstream for all sorts of reasons.

First, the primary business service may fail when sending the message because the message queue is unavailable. For this case, we can have the primary business service (producer) send the message first and only then execute its business, to guarantee delivery. The general practice is: the primary business service persists the message to be sent into its local database with the status flag set to "to send", then sends the message to the message queue. When the message queue receives the message, it also persists the message into its own storage service, but it does not immediately deliver the message to the secondary business service (consumer); instead, it first returns a response to the primary business service (producer), and the primary business service decides how to proceed based on that response. If the response indicates failure, it abandons the subsequent business processing and sets the local persisted message's status flag to "terminated". Otherwise, it executes the subsequent business processing and sets the local persisted message's status flag to "sent".

public void doServer() {
    // 1. Persist the message locally with status "to send"
    saveMsg();
    // 2. Send the message to the message queue
    send();
    // 3. Execute the business logic
    exec();
    // 4. Update the local message status to "sent"
    updateMsg();
}

In addition, the message queue may deliver the message successfully, yet the secondary business service (consumer) may go down before consuming it. For this case, the vast majority of message middleware, such as RabbitMQ and RocketMQ, introduces an ACK mechanism. Note that with the default automatic acknowledgment, the message queue deletes the message from the queue as soon as it is sent. So, to guarantee reliable delivery, we use manual ACK: if the secondary business service (consumer) fails to send an ACK because of downtime or other reasons, the message queue re-delivers the message, guaranteeing message reliability. Only after the secondary business service has finished its related business processing does it notify the message queue with a manual ACK, and only then does the message queue delete the persisted message. Then, what if the message queue keeps retrying delivery, fails, and discards the undeliverable message — how do we solve that? The astute reader may have noticed that, in the earlier step, the primary business service persisted the to-be-sent message into its local database. So when the secondary business service consumes successfully, it sends a notification message to the message queue, this time acting as a producer; after the primary business service (now as consumer) receives that message, it finally sets the local persisted message's status flag to "completed". In short, we can use this "forward and reverse message mechanism" to guarantee that the message queue delivers events reliably. Of course, the compensation mechanism is also indispensable: a timed task scans the database for messages not completed within a certain time and re-delivers them. See Figure 6-13.
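The manual-ACK behavior described above can be modeled without a real broker. The class below is a toy stand-in, not RabbitMQ's or RocketMQ's API: a message leaves the queue only after the consumer acknowledges it, so a crashed consumer leads to redelivery (at-least-once):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Predicate;

// Toy model of at-least-once delivery with a manual ACK: the message is
// deleted only after the consumer acknowledges it; an un-acked message
// stays queued for redelivery. Names are illustrative, not a broker API.
public class AckQueue {
    private final Queue<String> pending = new ArrayDeque<>();

    public void publish(String msg) { pending.add(msg); }

    public int size() { return pending.size(); }

    // Delivers the head message; the consumer returns true to ACK.
    public boolean deliverOnce(Predicate<String> consumer) {
        String msg = pending.peek();
        if (msg == null) return false;
        boolean acked = consumer.test(msg);
        if (acked) pending.poll();  // delete only after the manual ACK
        return acked;
    }
}
```

Because the same message can reach the consumer more than once, the consumer must be idempotent, which is exactly the requirement the next paragraph spells out.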

Note that, because the consumer may time out or go down while processing, or the network may prevent the message queue from receiving the processing result, reliable event delivery through the message queue guarantees that an event is delivered at least once. The downstream business service (the consumer) must therefore guarantee idempotency; if its interface is not idempotent, duplicate submissions and similar anomalies will occur. In addition, we can extract the message handling into a separate message service, deployed independently and shared across business scenarios, to avoid repeated development cost.
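One common way to make the consumer idempotent is to record the ids of already-processed messages and skip duplicates. The sketch below is illustrative only: the in-memory set stands in for a deduplication table in the consumer's database, and the counter stands in for the real business side effect:

```java
import java.util.HashSet;
import java.util.Set;

// Consumer-side idempotency: an at-least-once queue may redeliver the same
// message, so remember processed ids and run the side effect exactly once.
public class IdempotentConsumer {
    private final Set<String> processedIds = new HashSet<>();
    private int refundsExecuted = 0;

    public void onMessage(String msgId) {
        if (!processedIds.add(msgId)) {
            return; // duplicate delivery: acknowledge but do nothing
        }
        refundsExecuted++; // the real business logic runs once per message id
    }

    public int refundsExecuted() {
        return refundsExecuted;
    }
}
```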

Having understood the methodology of the "reliable event pattern", let us now deepen our understanding with a real case. First, when a user initiates a refund, the automated refund service receives a refund event message. If the refund matches the automated refund policy, the automated refund service first persists a snapshot of this refund to its local database, then posts an "execute refund" message to the message queue. After the message queue returns a successful response to the received message, the automated refund service can execute its subsequent business logic. Meanwhile, the message queue asynchronously delivers the message to the refund basic service, which executes its own refund business logic; success or failure is guaranteed by the refund basic service itself, and on success it posts a "refund succeeded" message to the message queue. Finally, a timed task scans the database for messages not completed within a certain time and re-delivers them. Note that the refund snapshot persisted by the automated refund service can be understood as a message whose delivery must be guaranteed, and its successful delivery is ensured by the "positive and negative message mechanism" together with the "timed task". In addition, the real refund logic of crediting the account is guaranteed by the refund basic service, so it must guarantee idempotency and the convergence of the account logic. When a failed state appears and the retry count is exceeded, the task has failed permanently, and developers need to intervene manually and troubleshoot. See Figure 6-14.
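The compensating timed task in this case can be sketched as a periodic rescan over the local message table: re-deliver unfinished messages and, once a retry limit is exceeded, flag them for manual intervention. The `PendingMessage` shape, field names, and `MAX_RETRIES` value are assumptions for illustration, not part of any specific framework:

```java
import java.util.List;

// One pass of the compensation task: re-deliver unfinished messages,
// escalating to manual intervention after too many retries.
public class CompensationTask {
    static final int MAX_RETRIES = 3;

    public static class PendingMessage {
        final String id;
        int retries = 0;
        boolean completed = false;
        boolean needsManualIntervention = false;
        public PendingMessage(String id) { this.id = id; }
    }

    // Invoked on a schedule; scans all messages not yet "completed".
    public static void rescan(List<PendingMessage> table) {
        for (PendingMessage m : table) {
            if (m.completed || m.needsManualIntervention) continue;
            if (m.retries >= MAX_RETRIES) {
                m.needsManualIntervention = true; // permanent failure: notify developers
            } else {
                m.retries++; // re-deliver to the message queue (stubbed out here)
            }
        }
    }
}
```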

In summary, introducing a message queue does not by itself guarantee event delivery; messages can still be lost for various reasons, such as network failures, so final consistency is not automatic. We therefore need the "positive and negative message mechanism" to guarantee reliable event delivery through the message queue, plus a compensation mechanism to re-deliver, as far as possible, the messages that have not completed within a certain time.

Distributed Transaction open source project realization interpretation

Many open source projects have implemented distributed transactions, and we can learn from them. In this section, we interpret some of their implementations.

RocketMQ

Apache RocketMQ, open-sourced by Alibaba, is a high-performance, high-throughput distributed messaging middleware. In past Double 11 events, RocketMQ carried all the message traffic of Alibaba's production systems and performed stably and outstandingly on the core trading links; it is one of the core basic products that bear the trading peak. A commercial version of RocketMQ can be purchased on Alibaba Cloud ( www.aliyun.com/product/ons... ).

Apache RocketMQ officially supports distributed transaction messages since version 4.3. RocketMQ's transaction messages are designed mainly to make sending a message atomic with executing the producer's local transaction; in other words, if the local transaction does not execute successfully, the message is not pushed to MQ. The clever reader might wonder: can't we just execute the local transaction first and send the MQ message only after it succeeds? But think it through again: what if sending the MQ message fails? In fact, RocketMQ has a good solution for exactly this. RocketMQ first sends a pre-execution (half) message to MQ, and executes the local transaction only after the pre-execution message is sent successfully. Then, based on the result of the local transaction, it performs the subsequent logic: if the local transaction committed, the MQ message is formally delivered; if the local transaction rolled back, the previously delivered pre-execution message is deleted from MQ and never delivered. Note that, for unusual situations such as the server going down or timing out while executing the local transaction, RocketMQ checks back with other producer instances in the same producer group to determine the transaction state. See Figure 6-15.
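The flow above can be modeled as a simple state machine over the half message. To keep the example self-contained and runnable, this is a simplified in-memory model of the behavior, not the real RocketMQ client API (the class, method, and state names here are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of RocketMQ's transactional-message flow: a half
// (pre-execution) message is invisible to consumers until the local
// transaction commits; on rollback it is discarded; UNKNOWN triggers
// a later check-back from the broker.
public class TxMessageModel {
    public enum LocalTxState { COMMIT, ROLLBACK, UNKNOWN }

    private final List<String> halfMessages = new ArrayList<>();
    private final List<String> deliverable = new ArrayList<>();

    // Step 1: the producer sends the pre-execution (half) message.
    public void sendHalf(String msg) {
        halfMessages.add(msg);
    }

    // Step 2: the producer reports the outcome of its local transaction.
    public void endTransaction(String msg, LocalTxState state) {
        if (state == LocalTxState.COMMIT) {
            halfMessages.remove(msg);
            deliverable.add(msg);        // now visible to consumers
        } else if (state == LocalTxState.ROLLBACK) {
            halfMessages.remove(msg);    // discarded, never delivered
        }
        // UNKNOWN: the broker later checks back for the transaction state
    }

    public boolean isDeliverable(String msg) { return deliverable.contains(msg); }
    public boolean isPendingCheck(String msg) { return halfMessages.contains(msg); }
}
```

In the real client, the two steps correspond to sending the half message and the broker consulting a transaction listener for the commit/rollback/unknown decision.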

So far, we have seen RocketMQ's design. Readers interested in the implementation can read the source code of org.apache.rocketmq.client.impl.producer.DefaultMQProducerImpl#sendMessageInTransaction.

ServiceComb

ServiceComb is a microservices framework open-sourced from Huawei's internal CSE (Cloud Service Engine) framework. It provides code generation, service registration and discovery, load balancing, and service reliability features (fault-tolerant circuit breaking, rate limiting and degradation, call-chain tracing), among others. Within it, ServiceComb Saga is an eventual-consistency solution for data in microservices applications.

Saga splits a distributed transaction into multiple local transactions, which the Saga engine is then responsible for coordinating. If the whole process ends normally, the business completes successfully; if a step fails along the way, the Saga engine invokes compensation operations. Saga has two recovery strategies: forward recovery and backward recovery. Forward recovery keeps retrying the failed node with best effort, ensuring that the database operation will eventually succeed and data consistency is preserved; if the retries ultimately fail, it proactively notifies developers so that they can intervene manually based on the relevant logs. Backward recovery executes rollback (compensation) operations for all previously successful nodes, thereby keeping the data consistent in effect.

Saga differs from TCC in that Saga has one fewer operation than TCC: the Try. Therefore, Saga commits directly to the database and performs compensation operations only when a failure appears. Designing Saga compensation operations can be relatively cumbersome in extreme scenarios, but for simple business logic Saga is less intrusive, more lightweight, and reduces the number of communications. See Figure 6-16.
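Backward recovery can be sketched in a few lines: run the local transactions in order and, if one fails, invoke the compensations of the already-committed steps in reverse order. The `SagaStep` interface below is an illustrative assumption for this sketch, not the ServiceComb Saga API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Minimal sketch of a Saga engine using backward recovery: each step commits
// its local transaction directly, and a failure triggers compensation of all
// previously committed steps in reverse order.
public class SagaEngine {
    public interface SagaStep {
        boolean execute();   // commits the local transaction, true on success
        void compensate();   // semantically undoes a committed execute()
    }

    // Returns true if the whole saga succeeded, false if it was rolled back.
    public static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            if (step.execute()) {
                completed.push(step);
            } else {
                // backward recovery: compensate in reverse order
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                return false;
            }
        }
        return true;
    }
}
```

Note that, unlike a database rollback, each `compensate()` is a business-level undo (for example, refunding a debit), which is why designing compensations can be cumbersome.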

ServiceComb Saga extends this theoretical basis and contains two components: alpha and omega. Alpha acts as the coordinator; it is responsible for persisting transaction events and coordinating the state of sub-transactions so that the global transaction's state is eventually consistent. Omega is an agent embedded in each microservice; it intercepts network requests in order to report transaction events to alpha and, under abnormal conditions, executes the corresponding compensation operations according to instructions issued by alpha. In the pre-processing stage, alpha records the transaction start event; in the post-processing stage, alpha records the transaction end event. Therefore, every successful sub-transaction has a one-to-one corresponding pair of start and end events. On the service provider side, omega intercepts the request and extracts the transaction id from it to reconstruct the transaction context. On the service consumer side, omega injects the transaction id into the request to pass on the transaction context. Through this collaboration between service providers and service consumers, sub-transactions are connected together to form a complete global transaction. Note that Saga requires the associated sub-transactions to provide both a transaction handling method and a compensation method. Here, the @EnableOmega annotation initializes the omega configuration and establishes the connection with alpha; the @SagaStart annotation marks the start of a global transaction; and the @Compensable annotation specifies the compensation method of the corresponding sub-transaction. Use case: github.com/apache/serv...

@EnableOmega
public class Application{
  public static void main(String[] args) {
    SpringApplication.run(Application.class, args);
  }
}

@SagaStart
public void xxx() { }


@Compensable
public void transfer() { }

Now, we look at its business process diagrams, see Figure 6-17.

More exciting articles, all in "server-side thinking"!


Origin juejin.im/post/5e1c9956f265da3e0f4d5b13