Distributed Transactions: Let's Talk About Solutions

Original link: https://www.cnblogs.com/savorboard/p/distributed-system-transaction-consistency.html

Foreword

It has been quite a while since I last wrote a blog post, partly because things have been busy at the company lately, and partly because I have been working on the next phase of CAP development, which has now wrapped up.

So on to today's topic: distributed transactions, or rather, distributed transactions as I see them, since everyone may understand them differently.

Distributed transactions are a technical difficulty in enterprise integration, and something every distributed system architecture has to deal with; in a microservice architecture they are almost unavoidable. This article is a brief chat about distributed transactions.

Database transactions

Before talking about distributed transactions, let's start with database transactions. Everyone is probably familiar with database transactions and uses them frequently in development. Even so, many people remain unclear on some of the details. For example, many people know the properties of a database transaction: atomicity (Atomicity), consistency (Consistency), isolation (Isolation), and durability (Durability), abbreviated ACID. But ask what isolation actually means, or what isolation levels databases implement and how those levels differ, and many of them cannot answer.

This article does not intend to cover those database transaction topics; interested readers can look up the relevant material. However, there is one piece of knowledge we do need: if the power suddenly fails while the database is committing a transaction, how does it recover? Why bring this up? Because the core of a distributed system is handling all kinds of abnormal conditions, which is also what makes distributed systems complex: the network environment is complicated, and "power off"-style failures are far more numerous than on a single machine, so they are the first thing we consider when building a distributed system. Such abnormalities include machine crashes, network failures, lost messages, out-of-order messages, data corruption, unreliable TCP, loss of stored data, and more...

Continuing with the power-failure case for a local database transaction: how does the database guarantee data consistency? Take SQL Server as an example. We know a SQL Server database consists of two files, a data file and a log file, and under normal circumstances the log file is much larger than the data file. Any write to the database is written to the log first. By the same token, when we execute a transaction, the database first records the transaction's redo log and only then actually operates on the data, flushing the log file to disk before the operation. So if the power suddenly fails, even if the operation did not complete, when the database restarts it will undo (roll back) or redo (roll forward) based on the current state of the data, thereby guaranteeing strong consistency.
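The log-first recovery idea can be sketched as a tiny write-ahead-log model. This is a purely illustrative sketch, not SQL Server's actual implementation: every write appends a log record before touching the data, and on restart, committed transactions are redone while uncommitted ones are undone.

```java
import java.util.*;

// Minimal write-ahead-log model: log first, then apply; after a "crash", redo
// committed transactions and undo uncommitted ones. Illustrative only.
public class WalDemo {
    record LogRecord(long txId, String key, Integer oldVal, Integer newVal, boolean commit) {}

    static final List<LogRecord> log = new ArrayList<>();      // the "log file", written first
    static final Map<String, Integer> data = new HashMap<>();  // the "data file"

    static void write(long txId, String key, int newVal) {
        log.add(new LogRecord(txId, key, data.get(key), newVal, false)); // log before data
        data.put(key, newVal);
    }

    static void commit(long txId) {
        log.add(new LogRecord(txId, null, null, null, true));
    }

    // Recovery after a crash: roll committed transactions forward, others back.
    static Map<String, Integer> recover() {
        Set<Long> committed = new HashSet<>();
        for (LogRecord r : log) if (r.commit()) committed.add(r.txId());
        Map<String, Integer> recovered = new HashMap<>();
        for (LogRecord r : log)                          // redo committed writes in order
            if (!r.commit() && committed.contains(r.txId())) recovered.put(r.key(), r.newVal());
        List<LogRecord> reversed = new ArrayList<>(log);
        Collections.reverse(reversed);
        for (LogRecord r : reversed) {                   // undo uncommitted writes, newest first
            if (!r.commit() && !committed.contains(r.txId())) {
                if (r.oldVal() == null) recovered.remove(r.key());
                else recovered.put(r.key(), r.oldVal());
            }
        }
        return recovered;
    }
}
```

A crash between a write and its commit leaves the log record in place, so recovery restores the last committed value.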

With that, let's talk about distributed transactions.

Distributed theory

When a single database hits a performance bottleneck, we may partition it. The partitioning here is physical: after partitioning, different databases may sit on different servers. At that point the ACID of a single database no longer fits, and in such a clustered environment, guaranteeing ACID across the cluster is almost impossible to achieve; even if it could be achieved, efficiency and performance would drop substantially, and, most critically, it would become very hard to add new partitions. If we keep pursuing cluster-wide ACID, our system degrades badly. We therefore need a new theoretical principle suited to this clustered situation: the CAP principle, also called the CAP theorem. So what does the CAP theorem say?

CAP theorem

The CAP theorem was proposed by Professor Eric Brewer of UC Berkeley. He pointed out that a web service cannot satisfy all three of the following properties at the same time:

  • Consistency: the client knows that a series of operations all take effect at the same time
  • Availability: every operation must end with an expected response
  • Partition tolerance: operations can still complete even when a single component becomes unavailable

Concretely, in a distributed system, for any database design, a web application can support at most two of the properties above. Clearly, any horizontal scaling strategy relies on data partitioning; designers must therefore choose between consistency and availability.

This theorem has held in every distributed system to date! Why do I say that?

At this point some readers may bring up the database's 2PC (two-phase commit). OK, let's take a look at two-phase commit in databases.

Anyone familiar with distributed database transactions knows that databases support 2PC, also known as XA Transactions.

MySQL has supported it since version 5.5, SQL Server since 2005, and Oracle since version 7.

XA is a two-phase commit protocol, divided into the following two phases:

  • Phase one: the transaction coordinator asks every database involved in the transaction to precommit (precommit) the operation and report whether it can commit.
  • Phase two: the transaction coordinator asks every database to commit the data.

If any database vetoes the commit, all databases are asked to roll back their portion of the transaction. What is the flaw here? At first glance, we do get consistency across database partitions.
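The two phases above can be sketched as a coordinator loop. This is a simplified illustration with a hypothetical Participant interface standing in for a database; real XA also has to handle timeouts and coordinator crashes, which are omitted here:

```java
import java.util.List;

// Simplified two-phase commit coordinator. "Participant" is a hypothetical
// stand-in for a database resource manager; no timeouts or failure recovery.
public class TwoPhaseCommit {
    public interface Participant {
        boolean precommit(); // phase 1: vote yes/no
        void commit();       // phase 2, on unanimous yes
        void rollback();     // phase 2, on any veto
    }

    /** Trivial in-memory participant used for demonstration. */
    public static class MemoryParticipant implements Participant {
        final boolean vote;
        String state = "pending";
        public MemoryParticipant(boolean vote) { this.vote = vote; }
        public boolean precommit() { return vote; }
        public void commit()   { state = "committed"; }
        public void rollback() { state = "rolled-back"; }
    }

    /** Returns true if the transaction committed on all participants. */
    public static boolean run(List<? extends Participant> participants) {
        // Phase 1: ask every participant to precommit and collect the votes.
        boolean allYes = true;
        for (Participant p : participants) {
            if (!p.precommit()) { allYes = false; break; }
        }
        // Phase 2: commit everywhere, or roll everyone back on a single veto.
        for (Participant p : participants) {
            if (allYes) p.commit(); else p.rollback();
        }
        return allYes;
    }
}
```

A single "no" vote in phase one forces every participant to roll back, which is exactly the all-or-nothing behavior described above.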

If the CAP theorem is correct, then this must affect availability.

Suppose a system's availability for an operation is the combined availability of all components involved. Then during two-phase commit, availability is the product of the availabilities of every database involved. If each database has 99.9% availability, a two-phase commit across two databases yields 99.9% × 99.9% = 99.8%. Using the system-availability calculation, and assuming 43,200 minutes per month, 99.9% availability is 43,157 minutes of uptime and 99.8% is 43,114 minutes, i.e. about 43 extra minutes of downtime every month.
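The arithmetic can be checked directly, assuming independent component availabilities multiply:

```java
// Compound availability across components, assuming independent failures,
// and the corresponding uptime in a 43,200-minute month.
public class Availability {
    public static double compound(double perComponent, int components) {
        return Math.pow(perComponent, components);
    }

    /** Uptime minutes in a 43,200-minute month at the given availability. */
    public static double uptimeMinutes(double availability) {
        return availability * 43_200;
    }
}
```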

The above verifies that the CAP theorem is correct in theory. We'll leave CAP here for now and come back to it shortly.

BASE theory

In distributed systems, what we usually pursue is availability; its importance ranks above consistency. So how do we achieve high availability? Our predecessors proposed another theory, the BASE theory, which further extends the CAP theorem. BASE stands for:

  • Basically Available
  • Soft state
  • Eventually consistent

The BASE theory is the result of trading off consistency against availability in CAP. Its core idea: we cannot achieve strong consistency, but each application can, according to its own business characteristics, adopt an appropriate approach to bring the system to eventual consistency (Eventual consistency).

With the theory above in hand, let's look at the problem of distributed transactions.

Distributed Transaction

In a distributed system, implementing distributed transactions comes down to just a handful of solutions.

1. Two-phase commit (2PC)

Like the database XA transactions mentioned in the previous section, two-phase commit applies the principle of the XA protocol; the flow chart below makes details such as the commit and abort steps in the middle easy to see.

Two-phase commit sacrifices some availability in exchange for consistency. As for implementation, in .NET you can use the API provided by TransactionScope to program two-phase commit in a distributed system; WCF, for instance, implements part of this functionality. Across multiple servers, however, you must rely on a DTC to complete transaction consistency: on Windows, Microsoft provides the MSDTC service, while on Linux the situation is rather bleak.

One more note: by default, TransactionScope cannot maintain a consistent transaction across asynchronous methods, because the transaction context is stored on the current thread; in an async method, the transaction context must be passed explicitly.

Advantages:  strong data consistency is guaranteed as far as possible, suiting critical domains with demanding consistency requirements. (In fact, even this cannot guarantee 100% strong consistency.)

Disadvantages:  complex to implement, sacrifices availability, has a large performance impact, and is unsuitable for high-concurrency, high-performance scenarios; and if the distributed system calls across service interfaces, the .NET community currently has no implementation.

2. Compensating transactions (TCC)

TCC is essentially a compensation mechanism. Its core idea: for every operation, register a corresponding confirmation and compensation (undo) operation. It has three phases:

  • The Try phase mainly checks the business system and reserves resources

  • The Confirm phase mainly confirms and commits on the business system. Once the Try phase succeeds and the Confirm phase begins, Confirm is assumed not to fail; that is, as long as Try succeeds, Confirm is guaranteed to succeed.

  • The Cancel phase cancels the business and releases the reserved resources when execution fails and must be rolled back.

For example, suppose Bob wants to transfer money to Smith. The rough idea is:
We have one local method, which calls, in order:
1. First, in the Try phase, call a remote interface to freeze Smith's and Bob's money.
2. In the Confirm phase, perform the remote transfer operation; on success, unfreeze.
3. If step 2 succeeds, the transfer succeeds; if step 2 fails, call the unfreeze method (Cancel) corresponding to the remote freeze interface.
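The three steps above can be sketched as follows. The accounts and the freeze operations are hypothetical in-memory stand-ins for the remote interfaces in the text, simplified to freezing only the sender's funds:

```java
import java.util.HashMap;
import java.util.Map;

// TCC-style transfer sketch: Try freezes funds, Confirm moves them to the
// receiver, Cancel releases the reservation. In-memory maps stand in for the
// remote account interfaces.
public class TccTransfer {
    final Map<String, Integer> balance = new HashMap<>();
    final Map<String, Integer> frozen  = new HashMap<>();

    public TccTransfer(int bob, int smith) {
        balance.put("Bob", bob);
        balance.put("Smith", smith);
    }

    // Try: reserve (freeze) the amount on the sender's account.
    public boolean tryFreeze(String from, int amount) {
        if (balance.get(from) < amount) return false;   // cannot reserve: veto
        balance.merge(from, -amount, Integer::sum);
        frozen.merge(from, amount, Integer::sum);
        return true;
    }

    // Confirm: actually transfer the frozen amount to the receiver.
    public void confirm(String from, String to, int amount) {
        frozen.merge(from, -amount, Integer::sum);
        balance.merge(to, amount, Integer::sum);
    }

    // Cancel: release the reservation if Confirm cannot proceed.
    public void cancel(String from, int amount) {
        frozen.merge(from, -amount, Integer::sum);
        balance.merge(from, amount, Integer::sum);
    }

    public int balanceOf(String who) { return balance.get(who); }
}
```

Because the money is only reserved during Try, a failure before Confirm can always be undone by Cancel without touching the receiver's account.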

Advantages:  compared with 2PC, the implementation and flow are somewhat simpler, though data consistency is also somewhat weaker than 2PC's

Disadvantages:  the drawbacks are fairly obvious: steps 2 and 3 can both fail. TCC is an application-layer compensation scheme, so programmers must write a great deal of compensation code, and in some scenarios certain business flows are hard to define and handle with TCC.

3. Local message table (asynchronous assurance)

The local message table is probably the implementation most used in industry. Its core idea is to split a distributed transaction into local transactions for processing; the idea originated at eBay. The flow chart below shows some of the details:

The basic idea is this:

The message producer needs an extra message table that records the send status of each message. The message table and the business data must be committed in a single transaction, which means they must live in the same database. The message is then delivered to the consumer via an MQ; if sending fails, it is retried.

The message consumer must process the message and complete its own business logic. If its local transaction succeeds, processing is complete; if it fails, execution is retried. If the failure is a business-level failure, a compensation message can be sent to the producer, notifying it to roll back.

The producer and consumer periodically scan the local message table, resending any unfinished or failed messages. With reliable automatic reconciliation logic in place, this scheme is very practical.
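The producer side of this pattern can be sketched like so. The in-memory maps are illustrative stand-ins for the business table, the message (outbox) table, and the MQ send call:

```java
import java.util.*;

// Producer side of the local message table pattern: the business row and the
// message row commit in one local transaction; a relay job later pushes pending
// messages to the MQ and retries failures. In-memory stand-ins throughout.
public class OutboxProducer {
    record Message(long id, String payload) {}

    final List<String> businessRows = new ArrayList<>();    // business table
    final Map<Long, String> outbox = new LinkedHashMap<>(); // message table: id -> status
    final Map<Long, Message> messages = new HashMap<>();
    long nextId = 1;

    // One local transaction: business data + message row are written together.
    public long placeOrder(String order) {
        businessRows.add(order);
        long id = nextId++;
        messages.put(id, new Message(id, order));
        outbox.put(id, "PENDING");
        return id;
    }

    // Relay job: scan the message table, try to send each unsent message to the
    // MQ (modeled as a predicate), and mark successes; failures stay pending.
    public void relay(java.util.function.Predicate<Message> mqSend) {
        for (Map.Entry<Long, String> e : outbox.entrySet()) {
            if (!e.getValue().equals("SENT")) {
                e.setValue(mqSend.test(messages.get(e.getKey())) ? "SENT" : "PENDING");
            }
        }
    }

    public String status(long id) { return outbox.get(id); }
}
```

Because the message row commits atomically with the business data, a crashed or failed MQ send loses nothing: the next scan simply retries it.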

This scheme follows the BASE theory and provides eventual consistency. In my view it is the one among these options best suited to real business scenarios: it avoids an implementation as complex as 2PC (when the call chain is long, 2PC's availability becomes very low), and it avoids the situations TCC can run into where confirmation or rollback becomes impossible.

Advantages:  a very classic implementation that avoids distributed transactions and achieves eventual consistency. A ready-made solution exists in .NET.

Disadvantages:  the message table is coupled into the business system, and without a well-packaged solution there is a lot of messy work to handle.

4. MQ transactional messages

Some third-party MQs support transactional messages, RocketMQ for example, and the way they support them also resembles two-phase commit; but some of the mainstream MQs on the market do not support transactional messages: RabbitMQ and Kafka, for example, do not.

Taking Alibaba's RocketMQ middleware as an example, the idea is roughly:

In the first phase, a Prepared message is sent and its address is obtained.
In the second phase, the local transaction executes; in the third phase, the message is accessed through the address obtained in the first phase and its state is modified.

In other words, within the business method you submit two requests to the message queue: one to send the message and one to confirm it. If sending the confirmation fails, RocketMQ periodically scans the transactional messages in the message cluster; on finding a Prepared message, it checks back with the message sender, so the producer must implement a check interface. RocketMQ then decides, according to the policy set by the sender, whether to roll back or to go on sending the confirmation. This ensures that the message send and the local transaction succeed or fail together.
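The flow can be sketched as a broker that keeps a "prepared" (half) message invisible until the producer confirms it, and checks back via the producer's check interface when the confirmation goes missing. All names and the check callback are illustrative, not RocketMQ's actual API:

```java
import java.util.*;
import java.util.function.Function;

// Sketch of a transactional-message broker: a prepared (half) message is not
// consumable until confirmed; the periodic scan asks the producer's check
// interface whether to commit or roll back. Illustrative only.
public class TxMessageBroker {
    final Map<Long, String> prepared = new LinkedHashMap<>(); // half messages
    final List<String> deliverable = new ArrayList<>();       // visible to consumers
    long nextId = 1;

    // Phase 1: store the prepared message and hand back its "address" (id here).
    public long sendPrepared(String body) {
        long id = nextId++;
        prepared.put(id, body);
        return id;
    }

    // Phase 3: the producer confirms (local transaction succeeded) or rolls back.
    public void confirm(long id) {
        String body = prepared.remove(id);
        if (body != null) deliverable.add(body);
    }

    public void rollback(long id) { prepared.remove(id); }

    // Periodic scan: for lingering prepared messages, call the producer's check
    // interface; true means the local transaction committed, so deliver it.
    public void scan(Function<Long, Boolean> producerCheck) {
        for (Long id : new ArrayList<>(prepared.keySet())) {
            if (producerCheck.apply(id)) confirm(id); else rollback(id);
        }
    }

    public List<String> visible() { return deliverable; }
}
```

The check-back scan is what makes a lost confirmation harmless: the message stays invisible until the broker learns the fate of the local transaction.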

Unfortunately, RocketMQ has no .NET client. For more about RocketMQ, see this blog post.

Advantages:  achieves eventual consistency without depending on local database transactions.

Disadvantages:  hard to implement, unsupported by mainstream MQs, no .NET client, and part of RocketMQ's transactional-message code is not open source.

5. Sagas transaction model

The Saga transaction model, also called the long-running transaction (Long-running-transaction), was proposed by H. Garcia-Molina and others at Princeton University. It describes another way to solve complex business transaction problems in distributed systems without two-phase commit. You can find the Sagas paper here.

What we discuss here is a workflow transaction model based on the Sagas mechanism. The theory behind this model is still fairly new, to the point that there is almost no related material to be found on Baidu.

The model's core idea is to split a long transaction in a distributed system into multiple short transactions, or multiple local transactions, coordinated by a Sagas workflow engine. If the whole flow finishes normally, the business has completed successfully; if a step fails along the way, the Sagas workflow engine invokes the compensation operations in reverse order to roll the business back.

For example, suppose a purchase of a travel package involves three operations: reserving a car, booking a hotel, and booking a flight, each belonging to a different remote interface. From the program's perspective they may not form one transaction, but from the business perspective they do.

They execute in the order shown in the figure above, so when a failure occurs, the cancellation compensations are performed in turn.

Because the long transaction is split into many business flows, the most important component of the Sagas transaction model is the workflow, or what you might call the process manager (Process Manager). A workflow engine and a process manager are not the same thing, but here their responsibilities are the same. After choosing a workflow engine, the final code might look like this:

SagaBuilder saga = SagaBuilder.newSaga("trip")
        .activity("Reserve car", ReserveCarAdapter.class)
        .compensationActivity("Cancel car", CancelCarAdapter.class)
        .activity("Book hotel", BookHotelAdapter.class)
        .compensationActivity("Cancel hotel", CancelHotelAdapter.class)
        .activity("Book flight", BookFlightAdapter.class)
        .compensationActivity("Cancel flight", CancelFlightAdapter.class)
        .end()
        .triggerCompensationOnAnyError();

camunda.getRepositoryService().createDeployment()
        .addModelInstance(saga.getModel())
        .deploy();

There is a C# example here; interested readers can take a look.

We won't go into pros and cons here, because the theory is fairly new and there are as yet no solutions on the market; even in the Java world I have not turned up much useful information.

A distributed transaction solution: CAP

You may see the distributed transaction schemes introduced above elsewhere as well, but without real code or open-source code they hardly count as substance. Here comes the substance.

In the .NET world there seems to be no ready-made solution for distributed transactions, or if there is, it isn't open source. As far as I know, some companies do have such solutions internally, but they treat them as one of their core products and have not open-sourced them...

For those reasons, I decided to write one myself and open-source it. I started in early 2017 and have spent the better part of a year continuously polishing it: the CAP project below.

GitHub CAP: the CAP here is no longer the CAP theorem, but the name of a .NET distributed transaction solution.

Detailed introduction:
http://www.cnblogs.com/savorboard/p/cap.html
Documentation:
http://www.cnblogs.com/savorboard/p/cap-document.html

Remarkably, this solution comes with a visual interface (a Dashboard): you can easily see which messages succeeded and which failed, and whether a failure was in sending or in processing, at a glance.

Even more remarkably, the Dashboard also provides real-time charts, so you can watch message sending and processing live, see the rate at which the system is currently processing messages, and view historical message throughput over the past 24 hours.

More remarkably still, the solution integrates Consul for distributed node discovery, registration, and heartbeat checks, so you can see the status of other nodes at any time.

Most remarkably of all: think you have to log in to another node's Dashboard console to see its data? Wrong. Open the Dashboard on any node, click once, and you switch to the console of whichever node you want, just as if viewing local data; the nodes are fully decentralized.

Think that's all? No, far from it:

  • CAP supports message queues such as RabbitMQ and Kafka
  • CAP supports databases such as SQL Server, MySQL, and PostgreSQL
  • The CAP Dashboard supports both Chinese and English interfaces, so you'll never be lost
  • CAP provides rich interfaces for extension: serialization, custom processing, custom sending, all no problem
  • CAP is open source under the MIT license; feel free to use it for derivative work (just keep the MIT license)

Think I'm done now? No!

You can use CAP entirely as an EventBus. CAP has excellent message-processing capability; don't worry about CAP becoming the bottleneck, that will never happen, because you can specify in configuration the number of processes CAP uses to handle messages, as long as your database is provisioned well enough...

