If you don't understand distributed transactions, don't say you understand microservices!

1. Transaction management for traditional applications

1.1 Local transactions

Before discussing data consistency under microservices, let's briefly review the background of transactions. A traditional stand-alone application uses an RDBMS as its data source. The application starts a transaction, performs CRUD operations, and commits or rolls back, all within a local transaction, with the resource manager (RM) providing transaction support directly. Data consistency is guaranteed within a local transaction.

1.png

1.2 Distributed transaction

1.2.1 Two-phase commit (2PC)

As an application grows and starts to use multiple data sources, local transactions can no longer guarantee data consistency. Because several data sources are accessed at the same time, transactions have to be managed across them, and this is where distributed transactions come in. One of the most popular approaches is two-phase commit (2PC), in which a transaction manager (TM) manages the distributed transaction centrally.

Two-phase commit is divided into a prepare phase and a commit phase. In the prepare phase, the TM asks every RM to prepare and to vote on whether it can commit. In the commit phase, the TM tells all RMs to commit if every vote was yes, and to roll back otherwise.

2.png

3.png
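The two phases can be sketched in a few lines of Java. This is only a minimal in-memory illustration, not a real transaction manager: the names `Participant`, `TwoPhaseCoordinator`, and `DemoParticipant` are made up for this example, and failure handling such as timeouts, logging, and coordinator recovery is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** A resource manager (RM) taking part in the transaction. Hypothetical interface. */
interface Participant {
    boolean prepare();   // phase 1: vote yes (true) or no (false)
    void commit();       // phase 2: make the prepared work permanent
    void rollback();     // phase 2: undo the prepared work
}

/** A minimal transaction manager (TM) that drives two-phase commit. */
class TwoPhaseCoordinator {
    /** Returns true if the global transaction committed, false if it rolled back. */
    static boolean run(List<Participant> participants) {
        List<Participant> prepared = new ArrayList<>();
        // Phase 1: ask every participant to prepare.
        for (Participant p : participants) {
            if (p.prepare()) {
                prepared.add(p);
            } else {
                // A single "no" vote aborts: roll back everyone who already prepared.
                for (Participant q : prepared) {
                    q.rollback();
                }
                return false;
            }
        }
        // Phase 2: every vote was yes, so commit everywhere.
        for (Participant p : participants) {
            p.commit();
        }
        return true;
    }
}

/** In-memory participant used only for demonstration. */
class DemoParticipant implements Participant {
    private final boolean vote;
    String state = "INIT";

    DemoParticipant(boolean vote) { this.vote = vote; }

    public boolean prepare() { state = vote ? "PREPARED" : "ABORTED"; return vote; }
    public void commit() { state = "COMMITTED"; }
    public void rollback() { state = "ROLLED_BACK"; }
}
```

Note that between `prepare()` and `commit()` each participant has to hold its resources, which is where the blocking discussed next comes from.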

However, two-phase commit cannot fully guarantee data consistency, and it suffers from synchronous blocking, so an optimized version, three-phase commit (3PC), was invented.

1.2.2 Three-phase commit (3PC)

4.png

However, 3PC can only guarantee data consistency in most cases.

2. Transaction management under microservices

So, are distributed transactions based on 2PC or 3PC suitable for transaction management under microservices? The answer is no, for three reasons:

  1. Microservices cannot access each other's data directly. Calls between microservices usually go through RPC (Dubbo) or an HTTP API (Spring Cloud), so a TM can no longer be used to manage the RMs of the microservices.
  2. Different microservices may use completely different kinds of data sources. If a microservice uses a store that does not support transactions, such as some NoSQL databases, a transaction is simply impossible.
  3. Even if every data source supports transactions, using one large transaction to cover many microservices means that the transaction is held open several orders of magnitude longer than a local transaction. Such long-running, cross-service transactions hold many locks and make data unavailable, which seriously hurts system performance.

It can be seen that traditional distributed transactions can no longer meet the transaction management needs of a microservice architecture. Since traditional ACID transactions cannot be satisfied, transaction management under microservices must follow a new rule: the BASE theory.

The BASE theory was proposed by Dan Pritchett, an architect at eBay. It is an extension of the CAP theory. Its core idea is that even if strong consistency cannot be achieved, the application should reach eventual consistency in a way that suits it. BASE stands for Basically Available, Soft state, Eventual consistency.

Basically Available: when a distributed system fails, it is allowed to lose part of its availability, as long as the core functions remain available.

Soft state: the system is allowed to have intermediate states, and these intermediate states do not affect its overall availability. In distributed storage there are generally at least three replicas of a piece of data. Allowing a delay in synchronizing replicas across nodes is one manifestation of soft state.

Eventual consistency: all replicas of the data in the system reach a consistent state after a certain period of time. Weak consistency is the opposite of strong consistency, and eventual consistency is a special case of weak consistency.

BASE is the fundamental requirement for transaction management under microservices. Transaction management under microservices cannot achieve strong consistency, but eventual consistency must be guaranteed. So, what methods are there to ensure eventual consistency? According to the implementation principle, there are two main types: event notification and compensation. Event notification can be divided into the reliable event notification mode and the best-effort notification mode, and compensation can be divided into the TCC mode and the business compensation mode. All four modes can achieve eventual consistency of data under microservices.

3. Ways to achieve data consistency under microservices

3.1 Reliable event notification mode

3.1.1 Synchronous events

The design concept of the reliable event notification mode is easy to understand: after the main service completes its work, it passes the result to the slave service through an event (usually via a message queue), and the slave service consumes the message and completes its own business, so that the main service and the slave service stay consistent. The first and simplest approach that comes to mind is synchronous event notification, in which business processing and message sending are executed synchronously. The implementation logic is shown in the code and sequence diagram below.

public void trans() {
    try {
        // 1. Operate on the database. If the operation fails, an exception is thrown.
        boolean result = dao.update(data);
        // 2. If the database operation succeeds, send the message.
        if (result) {
            mq.send(data); // If sending fails, an exception is thrown.
        }
    } catch (Exception e) {
        rollback(); // If an exception occurs, roll back.
    }
}

5.png

The above logic seems seamless. If the database operation fails, the method exits without sending the message; if sending the message fails, the database rolls back; if both the database operation and the send succeed, the business succeeds and the message reaches downstream consumers. On closer inspection, however, synchronous message notification has two shortcomings.

  1. Under a microservice architecture, network I/O problems or server downtime can occur. If one of them happens at step 7 of the sequence diagram, the main service either cannot be told that the message was delivered (network problem) or cannot go on to commit the transaction (downtime). The main service then assumes that message delivery failed and rolls back its business operation, but the message has in fact already been consumed by the slave service, so the data of the main service and the slave service becomes inconsistent. The two sequence diagrams below show the specific scenarios.

6.png

7.png

  2. The event service (here, the message service) is too tightly coupled to the business. If the message service is unavailable, the business becomes unavailable too. The event service should be decoupled from the business and run asynchronously and independently, or the service should try to send the message once after the business operation completes and fall back to asynchronous sending if that attempt fails.

3.1.2 Asynchronous events

3.1.2.1 Local Event Service

To solve the problems of synchronous events described in 3.1.1, the asynchronous event notification mode was developed. The business service and the event service are decoupled, events are delivered asynchronously, and a separate event service ensures that they are delivered reliably.

8.png

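Asynchronous event notification: local event service

When the business operation executes, the event is written to a local event table in the same local transaction, and the event is delivered at the same time. If delivery succeeds, the event is deleted from the event table. If delivery fails, the event service periodically and asynchronously processes the failed events in a batch and redelivers them until each one is delivered correctly, then deletes it from the event table. This approach keeps event delivery as timely as possible, and when the first delivery fails, the asynchronous event service still ensures that the event is delivered at least once.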
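The flow just described can be sketched as follows. This is a simplified in-memory illustration under stated assumptions: the `Broker` interface and the `LocalEventService` class are hypothetical names, a `synchronized` block stands in for the local database transaction, and a map stands in for the event table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical message broker; send() returns false when delivery fails. */
interface Broker {
    boolean send(String eventId, String payload);
}

/** Sketch of a business service that keeps an event table next to its business data. */
class LocalEventService {
    final List<String> orders = new ArrayList<>();                 // stands in for a business table
    final Map<String, String> eventTable = new LinkedHashMap<>();  // stands in for the local event table
    private final Broker broker;

    LocalEventService(Broker broker) { this.broker = broker; }

    /** Business operation: write the order and the event together, then try to deliver once. */
    void placeOrder(String orderId) {
        synchronized (this) {   // stands in for one local database transaction
            orders.add(orderId);
            eventTable.put(orderId, "ORDER_PLACED:" + orderId);
        }
        deliver(orderId);       // first, immediate delivery attempt
    }

    /** Called periodically by the asynchronous event service to redeliver failed events. */
    synchronized void retryPending() {
        for (String eventId : new ArrayList<>(eventTable.keySet())) {
            deliver(eventId);
        }
    }

    private synchronized void deliver(String eventId) {
        String payload = eventTable.get(eventId);
        if (payload != null && broker.send(eventId, payload)) {
            eventTable.remove(eventId);   // delete only after a successful delivery
        }
    }
}
```

Because an event leaves the table only after a successful send, a crash between the send and the delete leads to a second delivery, so consumers have to tolerate duplicates.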

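However, using a local event service to guarantee reliable event notification also has shortcomings. The business is still somewhat coupled to the event service (during the first, synchronous delivery). More seriously, the local transaction has to perform the extra event-table operations, which puts pressure on the database. Under high concurrency, every business operation generates a corresponding event-table operation, which nearly halves the usable throughput of the database, and that is clearly unacceptable. For this reason the reliable event notification mode evolved further, and the external event service came into view.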

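3.1.2.2 External Event Service

The external event service goes one step further than the local event service: the event service is separated from the main business service, which no longer has any hard dependency on it.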

9.png

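Asynchronous event notification: external event service

Before committing, the business service sends the event to the event service, which only records the event and does not send it. After committing or rolling back, the business service notifies the event service, which then sends or deletes the event. There is no need to worry that the business system might crash after committing or rolling back and fail to send the confirmation, because the event service periodically fetches all events that have not yet been sent, queries the business system about them, and sends or deletes each event according to the response.

Although the external event service decouples the business system from the event system, it also brings extra work. Compared with the local event service, it adds two network round trips (one before the commit, one after the commit or rollback), and the business system must provide a separate query interface that the event system uses to determine the status of unsent events.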
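A rough Java sketch of such an event service is shown below. It assumes an in-memory store and invents the names `ExternalEventService`, `TxStatus`, `record`, `confirm`, `cancel`, and `recover` purely for illustration. The `Function` passed to `recover` plays the role of the query interface that the business system has to expose.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Outcome the business system reports for a transaction. */
enum TxStatus { COMMITTED, ROLLED_BACK, UNKNOWN }

/** Sketch of an external event service: record first, send only after confirmation. */
class ExternalEventService {
    private final Map<String, String> pending = new HashMap<>();  // recorded but not yet sent
    final List<String> sent = new ArrayList<>();                  // stands in for the message queue

    /** Called by the business service before it commits. Records the event, does not send it. */
    void record(String eventId, String payload) {
        pending.put(eventId, payload);
    }

    /** The business service committed, so the event is sent. */
    void confirm(String eventId) {
        String payload = pending.remove(eventId);
        if (payload != null) {
            sent.add(payload);
        }
    }

    /** The business service rolled back, so the event is deleted. */
    void cancel(String eventId) {
        pending.remove(eventId);
    }

    /** Periodic recovery: ask the business system about every event that is still pending. */
    void recover(Function<String, TxStatus> queryBusiness) {
        for (String eventId : new ArrayList<>(pending.keySet())) {
            TxStatus status = queryBusiness.apply(eventId);
            if (status == TxStatus.COMMITTED) {
                confirm(eventId);
            } else if (status == TxStatus.ROLLED_BACK) {
                cancel(eventId);
            }
            // UNKNOWN: leave the event pending and ask again on the next run.
        }
    }

    int pendingCount() { return pending.size(); }
}
```

The two calls made by the business service (`record`, then `confirm` or `cancel`) correspond to the two extra network round trips mentioned above.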

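3.1.2.3 Notes on the reliable event notification mode

Two things need attention in the reliable event mode: 1. events must be sent correctly; 2. events may be consumed repeatedly. An asynchronous message service can ensure that events are sent correctly, but events may be sent more than once, so the consumer must make sure that the same event is not consumed repeatedly. In short, event consumption must be idempotent.

If the event is itself an idempotent, state-type event, such as an order status notification (order placed, paid, shipped, and so on), the consumer needs to check the order of events. This is usually done with timestamps: once a newer message has been consumed, any older message received afterwards is discarded without being consumed. If a global timestamp cannot be provided, consider using a globally consistent sequence number.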
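The ordering check for state-type events might look like the sketch below. The class name `OrderStatusConsumer` and the use of a per-order sequence number are assumptions made for this example. A timestamp could be compared in the same way.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a consumer for state-type events that drops out-of-date messages. */
class OrderStatusConsumer {
    private final Map<String, Long> lastSeq = new HashMap<>();    // newest sequence number applied per order
    private final Map<String, String> status = new HashMap<>();   // current status per order

    /** Applies the event if it is newer than what was already consumed; returns false if discarded. */
    synchronized boolean onEvent(String orderId, long seq, String newStatus) {
        Long seen = lastSeq.get(orderId);
        if (seen != null && seq <= seen) {
            return false;   // older or duplicate message: discard without consuming
        }
        lastSeq.put(orderId, seq);
        status.put(orderId, newStatus);
        return true;
    }

    synchronized String statusOf(String orderId) {
        return status.get(orderId);
    }
}
```

With this check, a "paid" event that arrives after "shipped" has already been applied is simply ignored.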

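For events that are not idempotent, typically action events such as "debit 100" or "deposit 200", the event id and the event result should be persisted. Before consuming an event, look up its id: if the event has already been consumed, return the stored result directly; if it is a new message, execute it and store the result.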
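A minimal sketch of this deduplication, assuming an in-memory map in place of a persisted table of event ids and results (the class name `DebitConsumer` is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of an idempotent consumer for action events such as "debit 100". */
class DebitConsumer {
    private long balance;
    private final Map<String, String> results = new HashMap<>();  // stands in for a persisted event-id/result table

    DebitConsumer(long initialBalance) { this.balance = initialBalance; }

    /** Executes the debit once per event id; a redelivered event gets the stored result back. */
    synchronized String onDebit(String eventId, long amount) {
        String previous = results.get(eventId);
        if (previous != null) {
            return previous;   // already consumed: do not debit again
        }
        String result;
        if (balance >= amount) {
            balance -= amount;
            result = "OK";
        } else {
            result = "INSUFFICIENT_FUNDS";
        }
        results.put(eventId, result);   // in a real system, stored in the same transaction as the balance update
        return result;
    }

    synchronized long balance() { return balance; }
}
```

In a real service the result record and the balance change would need to be written atomically, otherwise a crash between the two could still cause a double debit.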

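3.2 Best-effort notification mode

Compared with the reliable event notification mode, the best-effort notification mode is much easier to understand. Its defining feature is that, after committing the transaction, the business service sends the message a limited number of times (a maximum is configured). For example, it sends the message three times, and if all three attempts fail it stops sending, so messages can be lost. The main service must therefore provide a query interface that the slave service can use to recover lost messages. Best-effort notification gives weak timeliness guarantees (the soft state may last a long time), so it cannot be used in systems that need data to become consistent quickly. This mode is typically used for notifications between different business platforms or to third-party business services, such as bank notifications and merchant notifications, and is not covered further here.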
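The bounded retry at the heart of this mode can be sketched as below. The class `BestEffortNotifier` and its method are hypothetical, the receiver is modelled as a `Predicate` that returns true when it accepts the message, and the waiting time that a real sender would insert between attempts is left out.

```java
import java.util.function.Predicate;

/** Sketch of best-effort notification: a bounded number of attempts, then give up. */
class BestEffortNotifier {
    private final int maxAttempts;

    BestEffortNotifier(int maxAttempts) { this.maxAttempts = maxAttempts; }

    /**
     * Tries to deliver the message up to maxAttempts times.
     * Returns the number of attempts used on success, or -1 if every attempt failed,
     * in which case the message is lost unless the receiver queries the sender.
     */
    int notifyReceiver(String message, Predicate<String> receiver) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (receiver.test(message)) {
                    return attempt;
                }
            } catch (RuntimeException e) {
                // Treat an exception like a failed attempt and try again.
            }
        }
        return -1;
    }
}
```

A return value of -1 is the case in which the slave service has to fall back on the query interface to find out what it missed.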

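3.3 Business compensation mode

Next come the two compensation modes. The biggest difference from the event notification modes is that in a compensation mode the upstream service depends on the result of the downstream service, whereas in an event notification mode it does not. Business compensation is a pure compensation mode. Its design idea is that each service commits normally when it is called, and when one service fails, every upstream service in the call chain performs a business compensation operation.

For example, Xiao Ming is travelling on business from Hangzhou to New York. He needs to book a train ticket from Hangzhou to Shanghai and a plane ticket from Shanghai to New York. If, after buying the train ticket, he finds that the flights for that day are sold out, then rather than spend an extra day in Shanghai he would prefer to fly to Beijing and connect to New York from there, so he cancels the train ticket to Shanghai. In this example, buying the Hangzhou-to-Shanghai train ticket is service a and buying the Shanghai-to-New York plane ticket is service b. Business compensation means that when service b fails, a compensation operation is performed on service a, which here is cancelling the Hangzhou-to-Shanghai train ticket.

The compensation mode requires every service to provide a compensation interface, and the compensation is generally incomplete: even after the compensation operation, the record of the cancelled train ticket remains in the database and can be traced (usually marked with a corresponding status field such as "cancelled"), because online data that has been committed generally cannot be physically deleted.

The biggest drawback of the business compensation mode is that the soft state lasts a relatively long time. In other words, data becomes consistent slowly, and the services involved may often be in an inconsistent state.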
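The train and flight example can be sketched in Java as follows. The names `CompensableStep`, `CompensationRunner`, and `BookingStep` are invented for this illustration, and a shared log stands in for the database records of each service. Compensating in reverse order of completion is a common choice, not something the mode itself prescribes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** One service call plus the compensation that undoes it at the business level. */
interface CompensableStep {
    void execute() throws Exception;
    void compensate();
}

/** Sketch: run the steps in order; if one fails, compensate the completed ones in reverse order. */
class CompensationRunner {
    static boolean run(List<CompensableStep> steps) {
        Deque<CompensableStep> done = new ArrayDeque<>();
        for (CompensableStep step : steps) {
            try {
                step.execute();   // each step commits normally in its own service
                done.push(step);
            } catch (Exception e) {
                while (!done.isEmpty()) {
                    done.pop().compensate();   // most recently completed step first
                }
                return false;
            }
        }
        return true;
    }
}

/** Demo step that records what happened in a shared log. */
class BookingStep implements CompensableStep {
    private final String name;
    private final boolean available;
    private final List<String> log;

    BookingStep(String name, boolean available, List<String> log) {
        this.name = name;
        this.available = available;
        this.log = log;
    }

    public void execute() throws Exception {
        if (!available) {
            throw new Exception(name + " sold out");
        }
        log.add("booked " + name);
    }

    public void compensate() {
        log.add("cancelled " + name);   // the booking record stays; it is only marked as cancelled
    }
}
```

After a failed run the log still contains the original booking followed by its cancellation, which mirrors the incomplete compensation described above.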

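3.4 TCC/Try Confirm Cancel mode

TCC is an optimized business compensation mode. It can achieve complete compensation: after compensating, no record of the compensation is left behind, as if nothing had happened. The soft state in TCC is also very short, because TCC is a two-phase mode (if you have forgotten the two-phase concept, revisit 1.2.1). The second-phase Confirm operation runs only when the first phase (Try) has succeeded in all services; otherwise the compensating Cancel operation runs. No real business processing is performed in the Try phase.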

10.png

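TCC mode

The TCC mode runs in two phases:

  1. Try: the business service completes all business checks and reserves the necessary business resources.
  2. If Try succeeds in all services, the Confirm operation is executed. Confirm performs no business checks (they were already done in Try) and only carries out the business processing using the resources reserved in the Try phase. Otherwise, the Cancel operation is executed, which releases the resources reserved in the Try phase.

That may sound abstract, so here is a concrete example: Xiao Ming transfers 100 yuan online from China Merchants Bank to China Guangfa Bank. The operation can be seen as two services: service a transfers 100 yuan out of Xiao Ming's China Merchants Bank account, and service b deposits 100 yuan into Xiao Ming's China Guangfa Bank account.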

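Service a (Xiao Ming transfers 100 yuan out of China Merchants Bank):

try: update cmb_account set balance=balance-100, freeze=freeze+100 where acc_id=1 and balance>=100;

confirm: update cmb_account set freeze=freeze-100 where acc_id=1;

cancel: update cmb_account set balance=balance+100, freeze=freeze-100 where acc_id=1;

Service b (Xiao Ming deposits 100 yuan into China Guangfa Bank):

try: update cgb_account set freeze=freeze+100 where acc_id=1;

confirm: update cgb_account set balance=balance+100, freeze=freeze-100 where acc_id=1;

cancel: update cgb_account set freeze=freeze-100 where acc_id=1;

Details:

In the try phase of a, the service does two things: 1. a business check, here whether Xiao Ming's account holds at least 100 yuan; 2. resource reservation, moving 100 yuan from the balance into the frozen funds.

In the confirm phase of a, no business check is performed, because the try phase already did it. Since the transfer has succeeded, the frozen funds are deducted.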

In the cancel phase of a, the reserved resources are released: the 100 yuan of frozen funds is restored to the balance.

In the try phase of b, resources are reserved: 100 yuan is added to the frozen funds.

In the confirm phase of b, the resources reserved in the try phase are used: the 100 yuan of frozen funds is moved into the balance.

In the cancel phase of b, the resources reserved in the try phase are released: 100 yuan is subtracted from the frozen funds.

As this simple example shows, the TCC mode is more complicated than the pure business compensation mode, because every service must also implement the two interfaces Confirm and Cancel.
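The transfer example can also be written as a small Java sketch that mirrors the SQL statements with in-memory accounts. The interface `TccParticipant` and the classes `Account`, `TransferOut`, `TransferIn`, and `TccCoordinator` are illustrative names only. The sketch assumes that a balance equal to the transfer amount is sufficient, and it ignores the retries and idempotency that Confirm and Cancel need in a real system.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical TCC participant interface. */
interface TccParticipant {
    boolean tryReserve();   // Try: check business rules and reserve resources
    void confirm();         // Confirm: use the reserved resources, no further checks
    void cancel();          // Cancel: release the reserved resources
}

/** In-memory account with a balance and a frozen amount. */
class Account {
    long balance;
    long freeze;
    Account(long balance) { this.balance = balance; }
}

/** Service a: transfer money out of an account. */
class TransferOut implements TccParticipant {
    private final Account acc;
    private final long amount;
    TransferOut(Account acc, long amount) { this.acc = acc; this.amount = amount; }

    public boolean tryReserve() {
        if (acc.balance < amount) return false;   // business check
        acc.balance -= amount;                    // reserve: move funds into the frozen amount
        acc.freeze += amount;
        return true;
    }
    public void confirm() { acc.freeze -= amount; }
    public void cancel() { acc.balance += amount; acc.freeze -= amount; }
}

/** Service b: transfer money into an account. */
class TransferIn implements TccParticipant {
    private final Account acc;
    private final long amount;
    TransferIn(Account acc, long amount) { this.acc = acc; this.amount = amount; }

    public boolean tryReserve() { acc.freeze += amount; return true; }
    public void confirm() { acc.balance += amount; acc.freeze -= amount; }
    public void cancel() { acc.freeze -= amount; }
}

/** Runs Try on every participant, then Confirm on all of them or Cancel on those that succeeded. */
class TccCoordinator {
    static boolean run(List<TccParticipant> participants) {
        List<TccParticipant> tried = new ArrayList<>();
        for (TccParticipant p : participants) {
            if (p.tryReserve()) {
                tried.add(p);
            } else {
                for (TccParticipant q : tried) {
                    q.cancel();
                }
                return false;
            }
        }
        for (TccParticipant p : participants) {
            p.confirm();
        }
        return true;
    }
}
```

After a cancelled run both accounts are back to their starting balance and frozen amount, which is the complete compensation that distinguishes TCC from plain business compensation.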

3.5 Summary

The following table compares these four commonly used modes:

Type | Name | Timeliness of data consistency | Development cost | Upstream service depends on downstream result
Notification | Best effort | Low | Low | No
Notification | Reliable event | High | High | No
Compensation | Business compensation | Low | Low | Yes
Compensation | TCC | High | High | Yes


You are welcome to follow my official account, programmer Maidong, which posts industry news every day.


Origin blog.51cto.com/14849432/2555128