A Discussion of the Ultimate Solution for Spring Cloud Distributed Transactions

I. Introduction

This topic is also covered in the video lecture "Spring Cloud Distributed Transaction Solution", which you may wish to watch.

At Alibaba's 2017 Yunqi Conference, in the talk "Crack the World's Technical Problems! GTS Makes Distributed Transactions Simple and Efficient", Alibaba claimed to have found an ultimate solution to the notoriously hard problem of distributed transactions, one that is ahead of every other technology on the market in both reliability and processing speed. Unfortunately, the project is not open source, and it depends on Alibaba Cloud's distributed database. After all, nobody casually gives away the tools they earn their living with.

Even so, the article "World Problems..." summarizes transactions quite well: "A seemingly simple feature may need to call multiple 'services' and operate on multiple databases or shards internally; a single technique or solution can no longer cope with these complex scenarios. Distributed transactions are therefore an unavoidable challenge in a distributed system architecture.

What is a distributed transaction? Put simply, one large operation is made up of several small operations that are spread across different servers, and a distributed transaction must ensure that these small operations either all succeed or all fail."

For example:

When you buy something on Taobao, the money is deducted first and then the product inventory is decremented by 1. However, deduction and inventory belong to two separate services, and a call between them passes through a series of intermediate layers such as the network, gateways, and hosts. If anything goes wrong along the way, such as network jitter or an unexpected timeout, the data becomes inconsistent. For example, the deduction succeeds but the inventory is not decremented, which leads to overselling. This is the problem that distributed transactions need to solve.
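The failure mode above can be reproduced in a few lines. The following is only a minimal illustration under stated assumptions: `AccountService`, `InventoryService`, and `placeOrder` are hypothetical names, and the "network failure" is simulated with a flag. Each service commits locally, so a failure in the second step leaves the first step in place.

```java
// Hypothetical illustration: two "services" that each commit locally,
// with no distributed transaction coordinating them.
class InconsistencySketch {

    static class AccountService {
        int balance = 100;

        void deduct(int amount) {
            if (balance < amount) {
                throw new IllegalStateException("insufficient balance");
            }
            balance -= amount; // committed locally right away
        }
    }

    static class InventoryService {
        int stock = 10;
        boolean networkDown = false; // simulates network jitter or a timeout

        void decrement() {
            if (networkDown) {
                throw new IllegalStateException("inventory service unreachable");
            }
            stock -= 1;
        }
    }

    // Returns true if both steps succeeded. If the second step fails,
    // the first step has already been committed and is NOT undone.
    static boolean placeOrder(AccountService account, InventoryService inventory, int price) {
        account.deduct(price);
        try {
            inventory.decrement();
            return true;
        } catch (IllegalStateException e) {
            return false; // money deducted, stock unchanged: inconsistent state
        }
    }

    public static void main(String[] args) {
        AccountService account = new AccountService();
        InventoryService inventory = new InventoryService();
        inventory.networkDown = true;
        boolean ok = placeOrder(account, inventory, 30);
        System.out.println("ok=" + ok + " balance=" + account.balance + " stock=" + inventory.stock);
    }
}
```

After the failed order the balance has changed while the stock has not, which is exactly the inconsistency the rest of this article tries to prevent.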

II. Two-Phase Commit (2PC, 3PC, etc.)

Two-phase commit is the traditional solution for distributed transactions, and it is still widely used today. When a transaction spans multiple nodes, preserving its ACID properties requires a coordinator that collects the operation results of all nodes (called participants) and finally tells them whether to actually commit those results (for example, by writing the updated data to disk). The idea of two-phase commit can therefore be summarized as follows: the participants report the success or failure of their operations to the coordinator, and the coordinator then decides, based on the feedback from all participants, whether each participant should commit or abort.
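As a rough sketch of the coordinator logic just described, the following in-memory Java example may help. It is not a real XA implementation: the `Participant` interface, `RecordingParticipant`, and `run` are hypothetical names, and timeouts, logging, and crash recovery are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal in-memory sketch of two-phase commit; all names are hypothetical.
class TwoPhaseCommitSketch {

    interface Participant {
        boolean prepare(); // phase 1: do the work, hold locks, vote yes or no
        void commit();     // phase 2: make the prepared work permanent
        void rollback();   // phase 2: undo the prepared work
    }

    // A toy participant that records its state so the outcome can be inspected.
    static class RecordingParticipant implements Participant {
        final boolean vote;
        String state = "INIT";

        RecordingParticipant(boolean vote) { this.vote = vote; }

        public boolean prepare() { state = "PREPARED"; return vote; }
        public void commit() { state = "COMMITTED"; }
        public void rollback() { state = "ROLLED_BACK"; }
    }

    // The coordinator collects the votes, then issues one global decision.
    static boolean run(List<? extends Participant> participants) {
        List<Participant> asked = new ArrayList<>();
        boolean allYes = true;
        for (Participant p : participants) { // voting phase
            asked.add(p);
            if (!p.prepare()) {
                allYes = false;
                break; // a single "no" vote is enough to abort
            }
        }
        for (Participant p : asked) { // commit phase
            if (allYes) {
                p.commit();
            } else {
                p.rollback();
            }
        }
        return allYes;
    }

    public static void main(String[] args) {
        List<RecordingParticipant> ps = new ArrayList<>();
        ps.add(new RecordingParticipant(true));
        ps.add(new RecordingParticipant(false));
        System.out.println("committed=" + run(ps) + " first=" + ps.get(0).state);
    }
}
```

Note that every participant that voted stays in the prepared state, holding its resources, until the coordinator announces the decision. That waiting is where the blocking behaviour of 2PC comes from.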

Take a meeting as an example

A, B, C, and D want to organize a meeting and need to settle on a time. Let A be the coordinator and B, C, and D the participants.

Voting phase

  1. A sends an email to B, C, and D: are you free for a meeting at 20:00 this week?
  2. A confirms that they are free;
  3. B replies that they are free;
  4. C does not reply for a long time. At this point A, B, and C are all blocked on this meeting, and the algorithm cannot proceed;
  5. C finally replies that they are free (or not free);

Commit phase

  1. Coordinator A sends the collected result back to B, C, and D (when the result is sent, and what it is, depends in this example on C's timing and decision);
  2. B receives it;
  3. C receives it;
  4. D receives it;

2PC has to lock not only all of the participants' resources but also the coordinator's, which is expensive. In one sentence: 2PC is very inefficient and unfriendly to high concurrency.

To quote the "World Problems..." article directly: "For commercial distributed transaction products based on the XA model, which have decades of history and technical accumulation abroad, throughput often drops by orders of magnitude under the same hardware and software conditions once distributed transactions are enabled."

In addition, there is three-phase commit (3PC), which interested readers may wish to study on their own.

III. Flexible Transactions

A so-called flexible transaction is defined in contrast to a rigid transaction, which enforces consistency by locking tables. The process is as follows: if the transaction on server A executes successfully, transaction A is committed first. If transaction B also executes successfully, it is committed as well, and the whole transaction is complete. If transaction B fails, however, transaction B itself is rolled back; since transaction A has already been committed, a compensation operation is needed to reverse what A did and restore the state from before A was executed.

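The drawbacks are that this approach is far too intrusive to business code, requires hand-written compensation operations, lacks generality, and cannot be rolled out at scale.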
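The commit-then-compensate flow can be sketched as follows. This is an illustrative toy, not a framework: `Step`, `execute`, and the lambdas in `main` are hypothetical, and the sketch assumes that a compensation itself never fails, which real systems cannot assume.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a flexible (compensating) transaction.
class CompensationSketch {

    // One step: a forward action that commits immediately, plus its compensation.
    static class Step {
        final Runnable action;
        final Runnable compensation;

        Step(Runnable action, Runnable compensation) {
            this.action = action;
            this.compensation = compensation;
        }
    }

    // Runs the steps in order. If a step fails, the compensations of all
    // previously committed steps are run in reverse order.
    static boolean execute(Step... steps) {
        Deque<Step> committed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                committed.push(step);
            } catch (RuntimeException e) {
                while (!committed.isEmpty()) {
                    committed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        final int[] balance = {100};
        boolean ok = execute(
            new Step(() -> balance[0] -= 30, () -> balance[0] += 30),
            new Step(() -> { throw new IllegalStateException("out of stock"); }, () -> { }));
        System.out.println("ok=" + ok + " balance=" + balance[0]);
    }
}
```

Every forward action needs a matching, business-specific compensation written by hand, which is the intrusiveness the paragraph above complains about.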

IV. Message-Based Eventual Consistency with RocketMQ

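Among the current message-queue-based solutions is Alibaba's RocketMQ, which implements a half-message scheme (a two-step prepare-then-confirm flow, loosely comparable to the Paxos algorithm). The process is as follows: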

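Phase 1: the upstream application executes its business logic and sends the MQ message

  1. The upstream application sends a to-be-confirmed message to the reliable message system.
  2. The reliable message system saves the to-be-confirmed message and returns.
  3. The upstream application executes its local business logic.
  4. The upstream application notifies the reliable message system that the business logic has been executed and that the message should be sent.

The reliable message system then changes the message status to "sent" and delivers the message to the MQ middleware.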

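Phase 2: the downstream application listens for MQ messages and executes its business logic

The downstream application listens for MQ messages, executes its business logic, and reports the consumption result to the reliable message service.

  1. The downstream application listens on the MQ component and receives the message.
  2. The downstream application processes its local business logic based on the message body.
  3. The downstream application acknowledges to the MQ that the message has been consumed.
  4. The downstream application notifies the reliable message system that the message was consumed successfully, and the reliable message system changes the message status to "completed".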
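The message lifecycle across the two phases amounts to a small state machine. The sketch below is an assumption-laden illustration of that idea only: `ReliableMessageSketch`, its method names, and the three status values are hypothetical and are not RocketMQ's API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the reliable message system's status tracking.
class ReliableMessageSketch {

    enum Status { TO_BE_CONFIRMED, SENT, COMPLETED }

    private final Map<String, Status> store = new HashMap<>();

    // Phase 1, steps 1-2: save the to-be-confirmed message before any business logic runs.
    void savePending(String messageId) {
        store.put(messageId, Status.TO_BE_CONFIRMED);
    }

    // Phase 1, step 4: the upstream business logic succeeded, so the message is released to the MQ.
    void confirmAndSend(String messageId) {
        transition(messageId, Status.TO_BE_CONFIRMED, Status.SENT);
    }

    // Phase 2, last step: the downstream application reports successful consumption.
    void markConsumed(String messageId) {
        transition(messageId, Status.SENT, Status.COMPLETED);
    }

    Status statusOf(String messageId) {
        return store.get(messageId);
    }

    // Only the expected transition is allowed; anything else is rejected.
    private void transition(String messageId, Status from, Status to) {
        Status current = store.get(messageId);
        if (current != from) {
            throw new IllegalStateException(
                "message " + messageId + " is " + current + ", expected " + from);
        }
        store.put(messageId, to);
    }

    public static void main(String[] args) {
        ReliableMessageSketch system = new ReliableMessageSketch();
        system.savePending("order-1");
        system.confirmAndSend("order-1");
        system.markConsumed("order-1");
        System.out.println(system.statusOf("order-1"));
    }
}
```

A message that stays in the to-be-confirmed or sent state for too long is the signal that something went wrong and that the system needs to check back or redeliver.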

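RocketMQ looks like an advanced implementation, but the problem is its lack of documentation. Neither the Apache project home page nor Alibaba's own pages tell you much more than how to use it; material on the underlying principles or design guidance is very scarce.

Of course, if you have purchased the RocketMQ service on Alibaba Cloud, that is presumably a different story. But if you try to deploy and use it in your own environment, expect a fairly steep learning curve. After all, it is the tool they earn their living with.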

V. Message-Based Eventual Consistency with RabbitMQ

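RabbitMQ follows the AMQP specification and uses a message acknowledgement mechanism to guarantee that once a message has been sent it will be consumed, which yields eventual consistency. It is also open source and exceptionally well documented, so it looks like a good vehicle for implementing distributed transactions.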

5.1 RabbitMQ's Message Acknowledgement Mechanism


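RabbitMQ's overall delivery process is as follows:

1. The producer sends a message to the message service.
2. Once the message has been persisted, the service returns a confirmation to the producer. Only after receiving this confirmation can the producer safely say that the message has reached the message service; otherwise it enters the exception-handling flow.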
    rabbitTemplate.setConfirmCallback((correlationData, ack, cause) -> {
        if (!ack) {
            // not confirmed by the broker: try to resend msg
        } else {
            // confirmed by the broker: delete msg in db
        }
    });
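3. The message service delivers the message to the consumer.
4. The consumer receives and processes the message and, if processing succeeds, acknowledges it manually. Only after receiving this acknowledgement does the message service consider the message consumed; otherwise it redelivers the message or enters exception handling.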
    final Consumer consumer = new DefaultConsumer(channel) {
        @Override
        public void handleDelivery(String consumerTag, Envelope envelope,
                AMQP.BasicProperties properties, byte[] body) throws IOException {
            String message = new String(body, "UTF-8");

            System.out.println(" [x] Received '" + message + "'");
            try {
                doWork(message);
            } finally {
                // acknowledge receipt of the message
                channel.basicAck(envelope.getDeliveryTag(), false);
            }
        }
    };
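The two comments in the confirm callback ("try to resend msg" and "delete msg in db") imply a store-first pattern on the producer side. A minimal sketch of that pattern, without any RabbitMQ dependency, might look like this. `ConfirmOutboxSketch`, `sender`, and `onConfirm` are hypothetical names, and the in-memory map merely stands in for a database table; in a real Spring AMQP setup `onConfirm` would presumably be called from the confirm callback, with the message id carried in the `CorrelationData`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical producer-side sketch: store first, delete on ack, resend on nack.
class ConfirmOutboxSketch {

    // Stands in for a db table of messages that are not yet confirmed.
    private final Map<String, String> pending = new LinkedHashMap<>();

    // Hypothetical transport taking (messageId, body), e.g. a wrapper around the real send call.
    private final BiConsumer<String, String> sender;

    ConfirmOutboxSketch(BiConsumer<String, String> sender) {
        this.sender = sender;
    }

    // Persist before sending, so an unconfirmed message can always be resent.
    void send(String messageId, String body) {
        pending.put(messageId, body);
        sender.accept(messageId, body);
    }

    // Mirrors the confirm callback: delete on ack, resend on nack.
    void onConfirm(String messageId, boolean ack) {
        if (ack) {
            pending.remove(messageId);
        } else {
            String body = pending.get(messageId);
            if (body != null) {
                sender.accept(messageId, body);
            }
        }
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ConfirmOutboxSketch outbox =
            new ConfirmOutboxSketch((id, body) -> System.out.println("send " + id + ": " + body));
        outbox.send("m1", "hello");
        outbox.onConfirm("m1", false); // nack triggers a resend
        outbox.onConfirm("m1", true);  // ack removes the stored copy
        System.out.println("pending=" + outbox.pendingCount());
    }
}
```

The sketch resends without any limit or delay; a production version would want a retry cap and a periodic sweep for messages whose confirmation never arrives.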

5.2 Exceptions


Let's look at the six situations in which an exception can occur:

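1. The message cannot reach the message service at all

The network is down and an exception is thrown, so the business transaction can simply roll back. If a "connection closed" error appears, increasing the channel cache size of the connection factory may help:

    connectionFactory.setChannelCacheSize(100);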
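2. The message has reached the server, but an exception occurs on the way back

RabbitMQ provides a confirm (ack) mechanism that tells the producer whether the message has arrived. We can therefore store the message in a db (in memory or in a relational database) before sending it, and resend it if the ack fails.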

    /** The confirm callback reports whether the message reached the broker. */
    rabbitTemplate.setConfirmCallback((correlationData, ack, cause) -> {
        if (!ack) {
            // try to resend msg
        } else {
            // delete msg in db
        }
    });
    /** The return callback fires when the broker returns a message it could not route to any queue. */
    rabbitTemplate.setReturnCallback((message, replyCode, replyText, tmpExchange, tmpRoutingKey) -> {
        try {
            Thread.sleep(Constants.ONE_SECOND);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }

        log.info("send message failed: " + replyCode + " " + replyText);
        rabbitTemplate.send(message);
    });

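3. The message service itself crashes after the message has arrived

If message persistence is enabled, ack = true is sent only after the message has been persisted, that is, written to disk. This ensures the message is already on disk, so if the message service crashes it can redeliver the message once it recovers.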

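4. The message is not delivered to the consumer

Once the message service has handed a message to a consumer, the message stays in the "UNACK" state until the client acknowledges it.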

    channel.basicQos(1); // accept only one unack-ed message at a time (see below)
    final Consumer consumer = new DefaultConsumer(channel) {
        @Override
        public void handleDelivery(String consumerTag, Envelope envelope,
                AMQP.BasicProperties properties, byte[] body) throws IOException {
            String message = new String(body, "UTF-8");

            System.out.println(" [x] Received '" + message + "'");
            try {
                doWork(message);
            } finally {
                // acknowledge receipt of the message
                channel.basicAck(envelope.getDeliveryTag(), false);
            }
        }
    };
    boolean autoAck = false; // manual acknowledgement
    channel.basicConsume(TASK_QUEUE_NAME, autoAck, consumer);
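5. The acknowledgement is lost

If the acknowledgement is lost on its way back, the message service redelivers the message. Note that if you set autoAck = false but respond with neither channel.basicAck nor channel.basicNack, the consequence is very serious: the message queue becomes blocked. So a response must always be sent, whatever happens.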

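6. The consumer's business processing throws an exception

Suppose the message listener receives the message and an exception is thrown while processing it. The phase-one transaction has already completed; configuring a rollback would be too cumbersome, and even a compensating transaction may itself fail. So here we can simply retry, for example with guava-retrying, looping with exponentially increasing intervals. If it still fails after n attempts, send an email or SMS and let a human step in as the last resort.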
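A retry loop with exponentially increasing intervals can be sketched in plain Java as follows. This does not use the guava-retrying library mentioned above, so that the example stays self-contained; `BackoffRetrySketch` and `retry` are hypothetical names, and the delays in `main` are kept tiny purely for demonstration.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: retry a task with exponential backoff, then give up.
class BackoffRetrySketch {

    // Calls the task up to maxAttempts times, sleeping initialDelayMs, then
    // twice that, then four times that, and so on between failed attempts.
    // If every attempt fails, the last exception is rethrown so the caller
    // can fall back to alerting a human (email, SMS).
    static <T> T retry(Callable<T> task, int maxAttempts, long initialDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // do not retry when interrupted
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String result = retry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "done";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Because the same message may be processed more than once under this scheme, the work being retried should be safe to repeat.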

VI. Summary

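The "World Problems..." article gives a vivid summary of the different ways of implementing distributed transactions:

Every day you commute to work along a 10 km road with only two lanes. The road is badly congested and the trip often takes two or three hours, so your arrival time is never guaranteed. This is the problem with 2PC: it is slow.

You choose a very roundabout 30 km route that is rarely congested; call this option B. Your arrival time is guaranteed, but you must get up early and pay for it in time and fuel. This is the problem with flexible transactions: rollback has to be implemented in terms of the specific business logic, which is hard to modularize.

You choose a somewhat roundabout 20 km mountain road that is so rough only an SUV can drive it. This is the problem with transactional messages and eventual consistency: a new message middleware is introduced, which brings extra development cost. However, CoolMQ, developed by our company, already wraps these components, so simply sending and receiving is enough to meet transactional requirements. There is also a dedicated lecture on this approach, which you can use as you see fit.

Finally, there is GTS. GTS has built a four-lane elevated expressway with no detour, still 10 km long. No congestion means high performance for transactions; no detour means it is simple to use, non-intrusive to business code, and needs no refactoring for the sake of transactions; no restriction on vehicle type means no functional limits, offering strongly consistent transactions. In an era without elevated roads, their arrival was a disruptive innovation for traffic, and many problems that once seemed unsolvable were readily solved. Likewise, GTS hopes to change the state of the industry in data-consistency handling through innovation. Regrettably, it is not open source and must be used together with Alibaba Cloud services.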
