Coffee Wang's notes: how to ensure transaction consistency under a microservice architecture (InfoQ open class)

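Hello, everyone!
I am Coffee Wang, captain of the house-wrecking squad: always asking for trouble, not the sharpest wit, but very good-looking.
A top-grade husky who is either being a drama queen or on the way to becoming one.
A few days ago I saw something that interested me in an InfoQ open course, so I took some brief notes.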

Original video:
How to ensure the consistency of transactions under the microservice architecture | InfoQ Open Course
Lecturer: Liang Guizhao
Video link: https://www.infoq.cn/video/K7pDdIP5ZvqY9aAbf5vY

Preface

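Course outline, what you can learn:

What is a distributed transaction?
Are distributed transactions used much?
Two-phase commit / three-phase commit: why are they rarely used in business code?
Two-phase commit / three-phase commit: what are some real use cases?
What are the common misconceptions about the CAP theorem?
The TCC pattern: what can we borrow from it?
The compensation pattern: in which scenarios is it used?
The reliable event pattern: is introducing a message queue enough?
The reliable event pattern: the reverse message can also be lost, so how do we handle that?
The reliable event pattern: what if we use broadcast mode?
How do we design a reusable distributed transaction component?
The RocketMQ distributed transaction mode: is it reliable?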

Good opening

Come and take a look with me.

1. The performance of microservices is not as good as that of a single service.

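After moving to microservices, the whole call chain becomes much longer: what used to be one RPC call becomes multiple RPC calls, and the performance lost on the network increases.
Take multi-site active-active deployment as an example. Its biggest challenge is network latency. Calls inside one data center take 1~2 ms, calls across data centers in the same city usually take a dozen or so ms, calls across provinces or cities (Hangzhou to Shanghai, in earlier measurements) take roughly 30~50 ms, and in a previous project from Xiamen to Singapore the latency reached 100~200 ms in extreme cases. Accumulated network latency has a large impact on performance.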

In essence, microservices sacrifice network-call performance in order to squeeze more out of machine resources.

2. The problem of data consistency between services, that is, distributed transactions.

3. The evolution from local transactions to distributed transactions

Think: what is a distributed transaction, and why do we need one?

With a single service, the program is usually deployed on one physical machine together with one database. The database's ACID properties (atomicity, consistency, isolation, durability) are enough to ensure data consistency.

Bottleneck: CPU, memory, disk IO, and network bandwidth are all hardware resource bottlenecks.
But each database can still guarantee its own ACID properties.

Key point: with table sharding, a table is split into, say, 1024 tables by hash modulo, yet under one database they can still guarantee strong consistency through ACID. With a time-based model, for data such as refunds, logistics, and logs, we split by time period (year or quarter). As long as the tables live in the same database, all the basic database guarantees are still available.
Database sharding is different: the databases are separate, and each one is unaware of the others.
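As a rough illustration of the hash-modulo split described above, here is a minimal sketch. The function name `table_for`, the `orders` prefix, and the choice of CRC32 as the hash are assumptions made for the example; the course does not say which hash or naming scheme is used.

```python
import zlib

TABLE_COUNT = 1024  # number of tables mentioned in the notes


def table_for(order_id: str, prefix: str = "orders") -> str:
    """Map a key to one of TABLE_COUNT physical tables by hash modulo."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    shard = zlib.crc32(order_id.encode("utf-8")) % TABLE_COUNT
    return f"{prefix}_{shard:04d}"
```

The same key always lands in the same table, so a single-row read or write stays inside one database and keeps its ACID guarantees.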

Concept: a distributed transaction is a solution for ensuring data consistency across different databases.


4. Strong consistency solution-two-phase commit protocol

Two phases: the first phase is prepare; the second phase is commit.

Two roles: a coordinator, which coordinates the task, and participants, which carry out the actual work.

A simple example to aid understanding:
A group dinner with 10 people in total. Whoever is willing to go replies 1, and everyone gives their own answer. Only if everyone replies 1 do we decide to set off.

A real example:
The coordinator asks every participant to pre-commit. If all participants reply that the pre-commit succeeded, the formal commit can be issued. If even one participant replies that the pre-commit failed, the coordinator tells all participants to roll back, and the commit fails.
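The exchange above can be sketched in a few lines. This is a toy in-process model, not a production protocol: the `Participant` class, its `can_commit` flag, and the state names are invented for illustration, and there is no network, logging, or timeout handling.

```python
from typing import List


class Participant:
    """Toy participant; can_commit simulates whether its pre-commit succeeds."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        self.state = "prepared" if self.can_commit else "failed"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled_back"


def two_phase_commit(participants: List[Participant]) -> bool:
    # Phase 1: the coordinator asks every participant to pre-commit.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every vote was yes, otherwise roll everyone back.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

Note that the coordinator in this sketch waits for every vote before it can decide, which is exactly where the blocking problem comes from.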

Note: there is a problem here!
Two-phase commit waits synchronously. If a participant cannot be reached or never responds, the whole process takes a long time and may block for a very long while. Being synchronous is what makes it so time-consuming.

Use case: the MySQL InnoDB storage engine uses the two-phase commit protocol. Its binlog (binary log) and redo log (transaction log) are kept consistent through two-phase commit.

Execution process: when we update a row, MySQL first writes the redo log in a pre-commit (prepare) state, then writes the binlog and flushes it to disk, then commits formally, and the update is complete. A coordinator inside the database drives this.

Why is it used inside databases but not in actual business code?

The main problem is that in microservices direct DB connections are forbidden, so we usually have to call across services, which brings network latency and even failed requests. Because two-phase commit is synchronous, it fails easily when it runs into network jitter or other faults, so it is generally not used in microservices.


5. Strong consistency solution-three-phase commit protocol

The three-phase commit protocol is an improved version of the two-phase commit protocol.

Three-phase commit adds a timeout mechanism to address the synchronous blocking problem. It also adds a preliminary phase early in the task to prepare for the second phase, so that problems are found as early as possible before the third-phase commit.

With the timeout mechanism, a default action is taken when the timeout expires, but problems remain. For example: I commit by default after the timeout, but the service actually hit an exception and the other services have rolled back. In this case the data is still inconsistent.

Therefore, even with two-phase or three-phase commit, data compensation is still required, and an eventual consistency scheme is still needed as a fallback. This is one reason they are not used.


6. Eventually consistent solution-CAP

(1) CAP theorem
Consistency, availability, and partition tolerance cannot all be satisfied at once; only two of the three can be. Partition tolerance is the most basic requirement.
If we choose consistency and partition tolerance, network problems lead to unavailability.
If we choose availability and partition tolerance, data may be inconsistent while it is being synchronized.

(2) BASE theory
In a distributed system, some consistency may be given up temporarily, and eventual consistency of the data is guaranteed after a period of repair.

(3) The traditional CAP theorem is not entirely correct
1) Consistency in CAP is different from consistency in ACID. Consistency = linearizability:
if operation B is performed after operation A has finished, then B must see the result of A. The system looks as if it holds only one copy of the data, even though there may be several copies.

2) Availability in CAP is not exactly the same as the availability of our microservices, because the availability of microservices is generally measured by an SLA.

e.g.: We have two data centers. When the network between them is interrupted, we choose consistency and availability: we shut down the services in one data center, do all reads and writes in the other, and cut the users' traffic over to that data center. So a network interruption does not necessarily cause the system to stop serving!

Across multiple data centers, data is synchronized asynchronously through the binlog. In many cases the cause is not a network failure but network delay; most data inconsistencies come from network delay.


7. Eventually Consistent Solution-TCC

Three operations: Try, Confirm, and Cancel.

try: check and reserve resources.
confirm: perform the business commit.
cancel: roll back resources and compensate.

I have basically never used it, because it is intrusive to business code. It is a synchronous solution, and introducing these three methods makes the business code harder to read. Still, some frameworks for TCC do exist.

If you do use it, watch out for some TCC exception cases:

1) Empty rollback: the second-phase cancel is called even though try was never executed. If the service is down or the network is abnormal and try did not run, a rollback will still be performed after the failure is recovered.

2) Idempotence: repeated submissions produce dirty data records.

3) Suspension: the cancel operation arrives before try, so cancel runs first and try runs afterwards.

Scenario: try times out and is retried, so try executes late, at a later point in time.

Solution: introduce a transaction table and record every try, confirm, and cancel operation in it.

For example, when cancel is called and we find that try was never called, we do not run the cancel logic.

The same applies to idempotence. Before executing, check whether the operation has already been executed, and skip it if it has.

However, checking and then writing raises a concurrency safety issue that needs attention.
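The transaction-table idea can be sketched as follows. This is an in-memory illustration under several assumptions: the `TccParticipant` class, the inventory-style `available`/`reserved` fields, and the state names are invented for the example, and recording a `cancelled` row on an empty rollback (so that a late try is refused) is one common way to handle suspension rather than something the course spells out. A lock stands in for whatever protects the check-then-write step in a real database.

```python
import threading


class TccParticipant:
    """Inventory-style participant guarded by a transaction table."""

    def __init__(self, stock: int):
        self.available = stock
        self.reserved = {}   # tx_id -> reserved quantity
        self.tx_table = {}   # tx_id -> "tried" | "confirmed" | "cancelled"
        self._lock = threading.Lock()  # guards check-then-write on tx_table

    def try_(self, tx_id: str, qty: int) -> bool:
        with self._lock:
            state = self.tx_table.get(tx_id)
            if state == "cancelled":
                return False  # suspension guard: cancel already ran, refuse the late try
            if state is not None:
                return True   # idempotent: this transaction was already tried
            if self.available < qty:
                return False
            self.available -= qty
            self.reserved[tx_id] = qty
            self.tx_table[tx_id] = "tried"
            return True

    def confirm(self, tx_id: str) -> None:
        with self._lock:
            if self.tx_table.get(tx_id) != "tried":
                return  # already confirmed or cancelled, or never tried
            self.reserved.pop(tx_id)
            self.tx_table[tx_id] = "confirmed"

    def cancel(self, tx_id: str) -> None:
        with self._lock:
            state = self.tx_table.get(tx_id)
            if state is None:
                # Empty rollback: nothing to undo, only record the cancel.
                self.tx_table[tx_id] = "cancelled"
                return
            if state != "tried":
                return  # idempotent: already confirmed or cancelled
            self.available += self.reserved.pop(tx_id)
            self.tx_table[tx_id] = "cancelled"
```

In a real service the table would live in the participant's own database, and the lock would be replaced by a unique key on the transaction id plus a conditional update.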


8. Eventually consistent solution-compensation mechanism

Retry mechanism: fixed intervals and a fixed number of attempts. RocketMQ, for example, retries 3 times by default but can retry up to 16 times.
Why fixed intervals? Because the CPU executes very fast: if you do not set a retry interval, all 16 retries may happen within one second. That increases system load and achieves little. A typical schedule is 1 second, 3 seconds, 7 seconds, 1 minute, 2 minutes, 1 hour, 2 hours.
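A minimal sketch of retrying on a fixed schedule, using the delays listed above. The function name `retry_with_schedule` and the convention that `action` returns `True` on success are assumptions for the example; the `sleep` parameter is injectable only so the schedule can be exercised without actually waiting.

```python
import time
from typing import Callable, Sequence

# Delays in seconds, taken from the schedule listed in the notes.
RETRY_DELAYS = (1, 3, 7, 60, 120, 3600, 7200)


def retry_with_schedule(action: Callable[[], bool],
                        delays: Sequence[float] = RETRY_DELAYS,
                        sleep: Callable[[float], None] = time.sleep) -> int:
    """Run action until it returns True.

    Returns the number of attempts used, or -1 if every attempt failed.
    """
    if action():
        return 1
    for attempt, delay in enumerate(delays, start=2):
        sleep(delay)  # wait before the next attempt instead of retrying immediately
        if action():
            return attempt
    return -1
```

A return value of -1 is the point at which the task would be handed over to a scheduled check or manual intervention.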

Update and repair: use self-correction to reduce instantaneous load. For hundreds of thousands of records a scheduler is enough; for tens of millions, the work has to be spread over time and done along each user request, so that the instantaneous load is dispersed. For content distribution scenarios and social scenarios such as likes and comments, add redundant tables to hold redundant data.

Self-correction: data consistency is maintained without scheduled tasks. When a user requests an article, check whether the time since its last update exceeds the refresh interval, for example 24 hours, and if so refresh the data asynchronously. As long as some user makes a request, the refresh is triggered, so other users then see the latest data. If no user ever requests the data, it does not matter that it is not updated.

Scheduled mechanism: retry periodically and check periodically.

Data reconciliation: under microservices, direct DB connections are forbidden. When data is pulled through an interface and the volume is very large, the interface is rate limited; for example, once calls exceed 5000 they are throttled. Add network jitter and the requests become very unstable. Fetching millions of records through an interface is unrealistic, so each domain generally uses a synchronization mechanism instead, binlog -> kafka -> event broadcast, and writes the data into its own domain by listening for the events.
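The refresh-on-read idea might look like the sketch below. Everything here is illustrative: the `ArticleStats` class, the `load_fresh` loader, and the in-memory `refresh_queue` (standing in for whatever asynchronous worker drains refresh requests) are assumptions, not part of the course material.

```python
import time

REFRESH_AFTER_SECONDS = 24 * 3600  # the 24-hour example from the notes


class ArticleStats:
    def __init__(self, load_fresh, now=time.time):
        self._load_fresh = load_fresh   # hypothetical loader: article_id -> stats
        self._now = now
        self._cache = {}                # article_id -> (stats, updated_at)
        self.refresh_queue = []         # ids waiting for an asynchronous refresh

    def get(self, article_id):
        entry = self._cache.get(article_id)
        if entry is None:
            return self._refresh(article_id)
        stats, updated_at = entry
        stale = self._now() - updated_at > REFRESH_AFTER_SECONDS
        if stale and article_id not in self.refresh_queue:
            self.refresh_queue.append(article_id)  # a background worker would drain this
        return stats  # serve the current value; the refresh happens off the request path

    def run_pending_refreshes(self):
        while self.refresh_queue:
            self._refresh(self.refresh_queue.pop(0))

    def _refresh(self, article_id):
        stats = self._load_fresh(article_id)
        self._cache[article_id] = (stats, self._now())
        return stats
```

The request that detects the stale entry still gets the old value; users who arrive after the refresh has run see the new one, which matches the behaviour described above.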


9. Eventually consistent solution-reliable event mechanism

If automatic acknowledgment (auto ACK) is used, messages will get lost.

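Scene 1:
Service A sends a message to service B. After receiving it, service B acknowledges automatically, which tells the message queue: I have received the message, you can delete the persisted copy. Then B fails while consuming the message, and at that point service B can no longer fetch the message again. So we must enable manual acknowledgment (manual ACK).

Scene 2:
During the Double 11 promotion, all the messages are pushed in first, MQ holds them as a backlog, and the messages are then consumed slowly. But there is a problem in this process: service B cannot consume the messages in time. If the delay is particularly long, for example more than two hours in RocketMQ, the messages are recorded in MQ's dead letter queue, and they then need manual intervention to be processed. Even though these messages are not lost, they will not be retried. From service B's point of view they look lost, because we can no longer consume them.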
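The difference between automatic and manual acknowledgment can be shown with a toy in-memory queue. The `ToyQueue` class and its `get`/`ack`/`nack` methods are invented for this sketch and do not mirror any particular broker's client API.

```python
from collections import deque


class ToyQueue:
    def __init__(self):
        self.ready = deque()   # messages waiting for delivery
        self.unacked = {}      # delivery tag -> message held until acknowledged
        self._next_tag = 0

    def publish(self, body):
        self.ready.append(body)

    def get(self, auto_ack: bool):
        body = self.ready.popleft()
        if auto_ack:
            return None, body  # the queue forgets the message immediately
        self._next_tag += 1
        self.unacked[self._next_tag] = body
        return self._next_tag, body

    def ack(self, tag):
        del self.unacked[tag]

    def nack(self, tag):
        # Put the message back so it can be delivered again.
        self.ready.appendleft(self.unacked.pop(tag))


def consume(queue: ToyQueue, handler, auto_ack: bool) -> bool:
    tag, body = queue.get(auto_ack)
    try:
        handler(body)
    except Exception:
        if tag is not None:
            queue.nack(tag)  # manual mode: the failed message is requeued
        return False         # auto mode: the message is already gone
    if tag is not None:
        queue.ack(tag)       # acknowledge only after the handler succeeded
    return True
```

With `auto_ack=True` a failing handler leaves the queue empty; with `auto_ack=False` the same failure leaves the message available for another attempt.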

Process: service A stores the data in its database and marks the message as pending, then delivers the message to the message queue. Service B consumes the message successfully and returns an ACK; at the same time, service B sends a message to another message queue. After service A receives this message, it goes to the database and sets the data to the completed state.


Scheduled task: scan for messages that have not succeeded and redeliver them. If delivery still has not succeeded after 3 to 5 attempts, manual intervention is required.
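Putting the flow and the scheduled scan together, a minimal in-memory sketch could look like this. The `ServiceA` class, the status names, and `MAX_ATTEMPTS = 3` are assumptions for illustration; a plain `deque` stands in for the forward message queue, and the reverse message is modelled as a direct call to `on_reverse_message`.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed limit; the notes suggest 3 to 5


class ServiceA:
    def __init__(self, forward_queue: deque):
        self.forward_queue = forward_queue
        self.messages = {}  # msg_id -> {"payload", "status", "attempts"}

    def create(self, msg_id, payload):
        # The business data and the pending message record are stored together.
        self.messages[msg_id] = {"payload": payload, "status": "pending", "attempts": 0}
        self._deliver(msg_id)

    def _deliver(self, msg_id):
        record = self.messages[msg_id]
        record["attempts"] += 1
        self.forward_queue.append((msg_id, record["payload"]))

    def on_reverse_message(self, msg_id):
        # Service B reported success through the reverse queue.
        self.messages[msg_id]["status"] = "completed"

    def scan_and_redeliver(self):
        # Body of the scheduled task: retry pending messages, escalate exhausted ones.
        for msg_id, record in self.messages.items():
            if record["status"] != "pending":
                continue
            if record["attempts"] >= MAX_ATTEMPTS:
                record["status"] = "needs_manual"
            else:
                self._deliver(msg_id)
```

Because the scan may redeliver a message that service B has in fact already processed, the consumer side has to be idempotent.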

Idempotence: in a single database, build a unique index on the idempotency fields and rely on it to guarantee uniqueness. The core of idempotence is ensuring the uniqueness of the resource.
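For the single-database case, the unique index can do the deduplication by itself. The sketch below uses SQLite from the Python standard library purely for illustration; the `refund` table, the `request_id` column, and the function names are made up for the example.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE refund (id INTEGER PRIMARY KEY, request_id TEXT, amount INTEGER)"
    )
    # The unique index on the idempotency field is what rejects duplicates.
    conn.execute("CREATE UNIQUE INDEX idx_refund_request ON refund (request_id)")
    return conn


def apply_refund(conn: sqlite3.Connection, request_id: str, amount: int) -> bool:
    """Return True if the refund was recorded, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO refund (request_id, amount) VALUES (?, ?)",
                (request_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

There is no separate check step here, so two concurrent deliveries of the same message cannot both succeed.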

If the amount of data is particularly large, you need to shard databases and tables. Generally, there are two common ways:

  1. Check first, then insert. You need to pay attention to concurrency safety. One main order has multiple sub-orders, and the sub-orders are generally created at the same time, which is where the concurrency issue appears. We need a distributed lock to protect this critical section. One thing to note: a distributed lock has an expiration time, for example 30 minutes. If our retry period exceeds 30 minutes, the lock expires and the concurrency issue comes back.
  2. Add a state machine, and guarantee uniqueness through state constraints and state transitions.
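The state machine option can be sketched as a table of allowed transitions. The state names and the `Order` class are hypothetical; in a database the same constraint is usually expressed as a conditional update that only succeeds when the row is still in the expected state.

```python
# Allowed transitions; anything not listed is rejected.
ALLOWED = {
    "created": {"processing"},
    "processing": {"completed", "failed"},
    "failed": {"processing"},
    "completed": set(),
}


class Order:
    def __init__(self):
        self.status = "created"

    def transition(self, target: str) -> bool:
        """Move to target only if the current state allows it."""
        if target not in ALLOWED[self.status]:
            return False  # duplicate or out-of-order request: ignore it
        self.status = target
        return True
```

A repeated "completed" message is harmless in this sketch, because the second transition is simply refused.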

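Business scenario:
Suppose we are a buyer. We buy an item on Taobao, decide it is not good, and request a refund. The refund then flows to the seller as an after-sales ticket. The seller's customer service looks at the ticket: the order, the pricing information, the transaction information, the refund reason, the refund amount, the various reviews, and the various risk-control checks. If everything looks fine, they refund the money, and the money arrives in your account.

Refunds and automated refunds: an automated refund is based on strong rules (refund rules, order rules). For example, under the 7-day no-reason return policy, an order under 100 is refunded automatically.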


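Process: after the user initiates a refund, the refund service first writes to its local database to persist the refund, then delivers a message to the message queue. The refund service does not need to wait synchronously for the refund result and can carry on with other work. After the refund succeeds, the automation system pushes a refund-success message into another message queue, and when the refund service receives it, it sets the refund ticket to the completed state.

A scheduled task scans the database for unfinished refund tickets and retries them. If a retry still fails, manual intervention is required.

Two message queues, one for forward delivery and one for reverse delivery, are used to ensure that messages are delivered successfully.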

10. Eventually consistent solution-RocketMQ mode

There is a very clever design here that is worth studying. It is meant to solve the atomicity problem between the producer sending a message and its local transaction. In other words, if the local transaction fails, the message is not sent. But there is a catch: the local transaction may succeed while the send to MQ does not, in which case the local transaction would need to be rolled back. RocketMQ actually solves this problem, and that is the part we should learn from.

Half-message mechanism: RocketMQ first sends a pre-execution (half) message to the queue to test connectivity to MQ; at this point the message is not yet delivered to consumers. Then the local transaction is executed. If the local transaction succeeds, the half message is committed and can be delivered. If the local transaction fails, the half message in the RocketMQ queue has to be deleted.

Note: even after the message has been sent to MQ, rate limiting or service unavailability may prevent it from being consumed normally, or it may end up in the dead letter queue, and our downstream still knows nothing about it. So this mode is not fully reliable either, and a compensation pattern is still needed to ensure eventual consistency.
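The ordering of the steps can be modelled with a toy broker. This is only a sketch of the idea and not the RocketMQ client API: the `ToyBroker` class and the `send_in_transaction` function are invented, and the broker's check-back of unresolved half messages is left out.

```python
class ToyBroker:
    def __init__(self):
        self.half = {}     # msg_id -> payload, invisible to consumers
        self.visible = []  # committed messages that consumers can read

    def send_half(self, msg_id, payload):
        self.half[msg_id] = payload

    def commit(self, msg_id):
        self.visible.append(self.half.pop(msg_id))

    def rollback(self, msg_id):
        self.half.pop(msg_id, None)


def send_in_transaction(broker: ToyBroker, msg_id, payload, local_tx) -> bool:
    # Step 1: the half message proves the broker is reachable before any local work.
    broker.send_half(msg_id, payload)
    try:
        local_tx()                 # Step 2: run the local transaction.
    except Exception:
        broker.rollback(msg_id)    # Step 3a: local failure, drop the half message.
        return False
    broker.commit(msg_id)          # Step 3b: local success, make the message visible.
    return True
```

If `send_half` itself raises, the local transaction never starts, which is the case the design is meant to cover.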


(Disclaimer: The purpose of the coffee Wang translation blog is to convey more information, and does not represent my views and positions. The content of the article is for reference only.)


Origin blog.csdn.net/weixin_42994251/article/details/114090176