How to complete distributed transactions in practice

Background

A year ago I wrote an article about distributed transactions: "If someone asks you about distributed transactions again, throw this to him". It explained in detail what distributed transactions are and which solutions are commonly used to implement them. Much of it was theoretical, though, and many readers may still find a gap between that theory and real-world use. So in later articles I covered Seata (a distributed transaction framework open-sourced by Alibaba) at length. If you are not familiar with Seata, those articles are a good place to start.

Seata provides two modes for implementing distributed transactions:

  • AT: automatic mode. Seata records an undo log for the SQL being executed and uses it to roll back automatically when the transaction fails.
  • TCC: Try-Confirm-Cancel mode. It covers the scenarios AT cannot, since AT only works with databases that support ACID transactions.

Most of the time Seata is enough, but there are several situations in which we cannot adopt a framework like Seata:

  • The migration is difficult. At the time of writing, Seata supports only a few communication frameworks, namely Dubbo and Spring-Cloud-Alibaba. It does not fit easily if you use another framework or plain HTTP, and some companies' systems do not even support trace propagation.
  • The maintenance cost is high. Seata needs its own cluster, so a company has to allocate people and machines to manage and maintain it. In many cases it is not worth spending that much for a handful of distributed transactions. Of course, this part may be solved by cloud offerings in the future.

I ran into these problems myself while working on some distributed transactions recently. Distributed transactions are usually driven by the business side, which then needs support from the colleagues who maintain the RPC components. We are also not a pure financial services company, so building and running a distributed transaction middleware similar to Seata would cost more resources than we can justify.

Most of the solutions I introduced before are fairly general. As the saying goes, it is better to teach a man to fish than to give him a fish, so here I will show step by step how to implement distributed transactions without a framework.

Problem

To explain how to complete distributed transactions in practice, here is an example everyone is familiar with: when placing an order, a user can pay with three kinds of assets, namely a stored-value balance, gold coins, and coupons. Almost every app has this feature. The scenario maps to four services in our backend (the order service plus one service per asset), as shown in the following figure:

In this scenario, most people write code roughly as follows. The order service performs these steps (to keep things simple, there are only a few order statuses):

  • Step 1: Create the order with status Initialized and check that the user has enough of every resource
  • Step 2: Pay with the stored-value balance
  • Step 3: Pay with the coupon
  • Step 4: Pay with gold coins
  • Step 5: Update the order status to Completed

This is only four or five simple lines of code. Many people put all five steps directly into one transaction by adding the @Transactional annotation. In fact, the annotation not only fails to make the steps transactional, it also turns the transaction into a long one. Steps 2-4 are all remote RPC calls, which a local database transaction cannot roll back. And once one of those calls times out, the database connection is held for a long time without being released, which can cause an avalanche across the system.

Since the transaction does not help here, let's look at what can go wrong. If the payment in Step 2 succeeds and Step 3 fails, the data becomes inconsistent. Many people simply take their chances: they assume Steps 2-4 always succeed and fix the data by hand if something breaks. But manual repair is expensive. Imagine being on vacation and suddenly being asked to repair data; you would be furious. So let's see step by step how to improve this business logic so that the data stays consistent.

Method

Generally speaking, every distributed transaction framework revolves around three keywords: redo records, retry mechanisms, and idempotency. The same three keywords apply to our business code.

Redo record

What does a MySQL transaction rely on to roll back? It relies on the undo log, which saves the version of the data from before the transaction, so a rollback can simply restore that version. We need something similar, which I will call a redo record rather than an undo log: each resource service gets a transaction record table:

CREATE TABLE `transaction_record` (
  `orderId` int(11) unsigned NOT NULL COMMENT 'Order ID passed in by the order service; serves as the global transaction ID',
  `op_total` int(11) NOT NULL COMMENT 'Amount of the resource used by this operation',
  `status` int(11) NOT NULL COMMENT '1: payment succeeded, 2: payment canceled',
  `resource_id` int(11) NOT NULL COMMENT 'ID of the resource used by this operation',
  `user_id` int(11) NOT NULL COMMENT 'ID of the user who owns the resource',
  PRIMARY KEY (`orderId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Every distributed transaction has a global transaction ID, and orderId fits that role well. Each resource service records the orderId in its transaction record table to associate the row with the global transaction. Using it directly as the primary key also means a global transaction ID can appear only once in the table. op_total records how much of the resource this operation used, so that it can be given back on rollback; even without a rollback, it is useful for later queries. status records the state of the current record. Two states are used here, and more can be added to handle more distributed transaction problems.
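As a rough sketch, one row of this table and its two states could be modeled in Java like this. The names RecordStatus, TransactionRecord, canMoveTo, and cancel are assumptions made for illustration; they are not taken from any framework.

```java
// Hypothetical in-memory model of one transaction_record row.
enum RecordStatus {
    PAID(1),      // 1: payment succeeded
    CANCELED(2);  // 2: payment canceled (rolled back)

    final int code;

    RecordStatus(int code) {
        this.code = code;
    }

    // The only legal transition in this sketch is PAID -> CANCELED.
    boolean canMoveTo(RecordStatus next) {
        return this == PAID && next == CANCELED;
    }
}

final class TransactionRecord {
    final int orderId;     // global transaction ID, primary key of the table
    final int userId;      // owner of the resource
    final int resourceId;  // resource that was used
    final int opTotal;     // amount to give back on rollback
    private RecordStatus status = RecordStatus.PAID;

    TransactionRecord(int orderId, int userId, int resourceId, int opTotal) {
        this.orderId = orderId;
        this.userId = userId;
        this.resourceId = resourceId;
        this.opTotal = opTotal;
    }

    RecordStatus status() {
        return status;
    }

    // Returns true only for the first cancel; later calls change nothing.
    boolean cancel() {
        if (!status.canMoveTo(RecordStatus.CANCELED)) {
            return false;
        }
        status = RecordStatus.CANCELED;
        return true;
    }
}
```

Keeping the allowed transitions in one place makes it easier to add further states later without scattering status checks through the service code.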

With this redo record, each pay call only needs to write a transaction_record row for its resource, and a rollback restores every resource by orderId. After this change, the order code looks like this:

        int orderId = createInitOrder();
        checkResourceEnough();
        try {
            // Each resource service records the deduction under this orderId
            accountService.payAccount(orderId, userId, opTotal);
            coinService.payCoin(orderId, userId, opTotal);
            couponService.payCoupon(orderId, userId, couponId);
            updateOrderStatus(orderId, PAID);
        } catch (Exception e) {
            // Roll back every resource here
            accountService.rollback(orderId, userId);
            coinService.rollback(orderId, userId);
            couponService.rollback(orderId, userId);
            updateOrderStatus(orderId, FAILED);
        }

Here the ID of the newly created order is passed to each resource service so that it can be recorded, and the order status is updated at the end. If an exception occurs, we roll back manually and set the order to FAILED; the rollback is keyed on the order ID. The pseudocode for payment and rollback is as follows:

    @Transactional
    void payAccount(int orderId, int userId, int opTotal){
        account.payAccount(userId, opTotal); // Do the actual deduction on our account table
        transactionRecordStorage.save(orderId, userId, opTotal, account.getId()); // Save the transaction record
    }
    @Transactional
    void rollback(int orderId, int userId){
        TransactionRecord tr = transactionRecordStorage.get(orderId); // Query the transaction record
        account.rollbackBytr(tr); // Roll back according to the record
    }

This version is fairly simple and still has several problems, which the following sections address.
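To see the pay and rollback flow run end to end, here is a minimal in-memory sketch. ResourceService and OrderFlow are hypothetical stand-ins for a resource service and the order service; plain maps replace the database tables, and only two resources are used to keep it short. Like the simple version above, it is not yet safe against repeated rollbacks.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for one resource service (for example the account service).
class ResourceService {
    private final Map<Integer, Integer> balances = new HashMap<>(); // userId -> balance
    private final Map<Integer, Integer> records = new HashMap<>();  // orderId -> amount deducted

    ResourceService(int userId, int initialBalance) {
        balances.put(userId, initialBalance);
    }

    int balanceOf(int userId) {
        return balances.get(userId);
    }

    // Deduct and write the redo record together (one local transaction in a real service).
    void pay(int orderId, int userId, int opTotal) {
        int balance = balances.get(userId);
        if (balance < opTotal) {
            throw new IllegalStateException("insufficient balance");
        }
        balances.put(userId, balance - opTotal);
        records.put(orderId, opTotal);
    }

    // Give back whatever the redo record says was deducted for this order.
    void rollback(int orderId, int userId) {
        Integer opTotal = records.get(orderId);
        if (opTotal != null) { // nothing recorded means nothing was deducted
            balances.merge(userId, opTotal, Integer::sum);
        }
    }
}

class OrderFlow {
    // Returns "PAID" or "FAILED"; on failure every resource is compensated.
    static String placeOrder(int orderId, int userId,
                             ResourceService account, int accountAmount,
                             ResourceService coin, int coinAmount) {
        try {
            account.pay(orderId, userId, accountAmount);
            coin.pay(orderId, userId, coinAmount);
            return "PAID";
        } catch (RuntimeException e) {
            account.rollback(orderId, userId);
            coin.rollback(orderId, userId);
            return "FAILED";
        }
    }
}
```

If the coin payment fails after the account payment succeeded, the account rollback looks up the amount by orderId and restores the balance, while the coin rollback finds no record and does nothing.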

Retry mechanism

Some readers may ask: doesn't the code above already guarantee the distributed transaction? It does work as long as nothing crashes. But suppose the order machine goes down right after paying from the account and before paying the coins. The process dies before it reaches the catch block, so the manual rollback is never sent and the account deduction is never rolled back. We are back to repairing data by hand, and if you are on vacation again when that call comes, you will not be happy. And what if the rollback itself fails? We have no way to recover from a failed rollback.

So we need an additional retry mechanism. First we have to define which data needs a retry. Based on our business, paying with all the resources should finish within about one minute, so if an order is still in the init state more than one minute after it was created, we assume one of the failures above has occurred and roll the order back through the retry mechanism. There are two common retry mechanisms:

  • Scheduled tasks: the most common retry mechanism; essentially all distributed transaction frameworks use scheduled tasks as well. The task has to be a distributed scheduled task, which can be built from a single-machine task plus a distributed lock, or taken from open-source middleware such as elastic-job. On each run, the task queries the orders whose status is init and whose creation time is more than one minute ago, rolls them back, and sets the order status to FAILED once the rollback completes.
  • Message queue: this is what we use in our business. The order operation is put into a message queue and executed by the consumer, and any exception is handled by the queue's retry mechanism: generally the message is first retried in the current queue and then moved to the dead-letter queue to be retried again. The logic needs one change: when the consumer creates the order, the order may already exist. If it does, we check whether it should be rolled back (status init and more than one minute old), and if so, roll it back directly. Why did we choose the message queue for retries? Because our business logic already relies on it, so there is no need to introduce scheduled tasks.
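The scheduled-task variant can be sketched as a single pass over the orders. Order, StaleOrderScanner, findStale, and runOnce are hypothetical names, the list stands in for a database query, and the rollback callback represents the calls to the three resource services.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

class Order {
    final int orderId;
    final Instant createdAt;
    String status = "INIT";

    Order(int orderId, Instant createdAt) {
        this.orderId = orderId;
        this.createdAt = createdAt;
    }
}

class StaleOrderScanner {
    static final Duration TIMEOUT = Duration.ofMinutes(1);

    // An order is stale if it is still INIT more than one minute after creation.
    static List<Order> findStale(List<Order> orders, Instant now) {
        List<Order> stale = new ArrayList<>();
        for (Order o : orders) {
            boolean timedOut = Duration.between(o.createdAt, now).compareTo(TIMEOUT) > 0;
            if (o.status.equals("INIT") && timedOut) {
                stale.add(o);
            }
        }
        return stale;
    }

    // One pass of the job: roll back each stale order, then mark it FAILED.
    // Returns how many orders were rolled back successfully.
    static int runOnce(List<Order> orders, Instant now, IntConsumer rollback) {
        int handled = 0;
        for (Order o : findStale(orders, now)) {
            try {
                rollback.accept(o.orderId);
                o.status = "FAILED";
                handled++;
            } catch (RuntimeException e) {
                // Leave the order as INIT so the next pass retries it.
            }
        }
        return handled;
    }
}
```

On a single machine, runOnce could be driven by ScheduledExecutorService.scheduleWithFixedDelay; with several instances, a distributed lock or a job framework is still needed so that only one instance scans at a time. Because a failed rollback leaves the order in INIT, the rollback itself is retried on the next pass, which is exactly why the rollback has to be idempotent.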

Idempotency

You can tell how experienced a programmer is by whether they consider idempotency when writing code. Many junior programmers never think about it, or do not even know what it is. Put simply, an operation is idempotent if executing it any number of times has the same effect as executing it once.

Why do we need idempotency for distributed transactions? Suppose the machine crashes in the middle of a rollback and the retry mechanism above kicks in. The coupon resource, say, has already been rolled back, but the retry does not know that and tries to roll the coupon back again. What happens if the rollback is not idempotent? The user's assets may increase, which can cost the company a lot.
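To make the risk concrete, here is a deliberately non-idempotent rollback in memory. NaiveRollback is a hypothetical class written only to show the failure; it is not something to copy.

```java
import java.util.HashMap;
import java.util.Map;

// Counter-example: a rollback with no status check.
class NaiveRollback {
    private final Map<Integer, Integer> records = new HashMap<>(); // orderId -> amount deducted
    private int balance;

    NaiveRollback(int balance) {
        this.balance = balance;
    }

    int balance() {
        return balance;
    }

    void pay(int orderId, int amount) {
        balance -= amount;
        records.put(orderId, amount);
    }

    // Every call refunds again, because nothing marks the record as rolled back.
    void rollback(int orderId) {
        balance += records.get(orderId);
    }
}
```

Starting from a balance of 100 and paying 60, one rollback restores the balance to 100, but a retried rollback pushes it to 160: the user has gained 60 out of nothing.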

So idempotency matters a great deal once we retry. What is the key to achieving it? To make many executions equivalent to one, we only need to mark that the first execution has been done. How do we record that mark? With a state transition in our state machine. A mark alone is not enough, though, as we will see; first, here is the rollback above with this simple optimization:

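    @Transactional
    void rollback(int orderId, int userId){
        TransactionRecord tr = transactionRecordStorage.get(orderId); // Query the transaction record
        if(tr.isCanceled()){
            return; // Already canceled, so return directly
        }
        // Roll back according to the record; rollbackBytr is assumed to also
        // set the record's status to canceled within this transaction
        account.rollbackBytr(tr);
    }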

In the code above, if the record is already in the canceled state, that is, it has already been rolled back, we return directly, and that gives us the idempotency we wanted. But there is another question: what if two rollbacks execute at the same time? You may wonder how that can happen. Suppose the first rollback request is blocked on the network. The caller times out and, after a while, sends a second rollback, and just then the first request gets unblocked, so two rollback requests arrive together. If both execute the status check at the same moment, both pass it, and the user ends up refunded twice. We must prevent this.

So how can it be avoided? Sharp readers will immediately think of distributed locks, and then of Redis locks, ZooKeeper locks, and so on, which I introduced in the article "Let's talk about distributed locks". But here we can simply use a database row lock, by querying with the following SQL statement:

select * from transaction_record where orderId = #{orderId} for update;
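The effect of the row lock can be imitated in memory with one lock per order. This is only an analogy under stated assumptions: LockedRollback is a hypothetical class, a ReentrantLock plays the role of the lock taken by the locking query, and the record is assumed to exist already.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

class LockedRollback {
    private final Map<Integer, ReentrantLock> rowLocks = new ConcurrentHashMap<>();
    private final Map<Integer, Boolean> canceled = new ConcurrentHashMap<>();  // orderId -> rolled back?
    private final Map<Integer, Integer> opTotals = new ConcurrentHashMap<>();  // orderId -> amount deducted
    private int balance;

    LockedRollback(int balanceAfterPay) {
        this.balance = balanceAfterPay;
    }

    void recordPayment(int orderId, int opTotal) {
        opTotals.put(orderId, opTotal);
        canceled.put(orderId, false);
    }

    // Returns true if this call performed the refund, false if it was already done.
    // Assumes recordPayment was called for this orderId.
    boolean rollback(int orderId) {
        ReentrantLock lock = rowLocks.computeIfAbsent(orderId, id -> new ReentrantLock());
        lock.lock(); // plays the role of the row lock taken by the locking query
        try {
            if (canceled.get(orderId)) {
                return false; // a concurrent or earlier rollback already refunded
            }
            balance += opTotals.get(orderId);
            canceled.put(orderId, true);
            return true;
        } finally {
            lock.unlock(); // plays the role of the commit that releases the row lock
        }
    }

    int balance() {
        return balance;
    }
}
```

Because the status check and the refund happen while the lock is held, a second rollback for the same order has to wait and then sees the canceled mark, so only one of any number of concurrent calls refunds.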

Apart from this locking query, the code stays the same, and in this form the rollback is idempotent. At this point some readers may ask: what if the TransactionRecord does not exist? When we retry, we do not know whether the Try (the pay call) ever succeeded, so we need a strategy that keeps our logic from hitting a null pointer. There are two options:

  • If the record is missing, return directly.
  • If the record is missing, save a TransactionRecord whose status marks that an empty rollback was performed.

The first strategy is simpler, but we need the second, because there is one more thing to guard against: suspension. Recall the network congestion from the idempotent rollback discussion, and now suppose it is the first pay request that gets blocked, so that pay reaches the resource service after the rollback. With the first strategy, the rollback leaves no trace, so the delayed pay request cannot tell that the transaction has already been rolled back. It goes ahead and deducts, and that deduction will never be rolled back: the transaction is left hanging. With the second strategy we save a record, and pay checks whether a record exists before doing anything. The optimized code is:

    @Transactional
    void payAccount(int orderId, int userId, int opTotal){
        TransactionRecord tr = transactionRecordStorage.getForUpdate(orderId); // Locking query (for update)
        if(tr != null){
            return; // A record already exists (paid, canceled, or empty rollback), so return directly
        }
        account.payAccount(userId, opTotal); // Do the actual deduction on our account table
        transactionRecordStorage.save(orderId, userId, opTotal, account.getId()); // Save the transaction record
    }
    @Transactional
    void rollback(int orderId, int userId){
        TransactionRecord tr = transactionRecordStorage.getForUpdate(orderId); // Locking query (for update)
        if(tr == null){
            saveNullCancelTr(orderId, userId); // Save a record marking the empty rollback
            return; // Nothing was paid, so there is nothing to roll back
        }
        if(tr.isCanceled() || tr.isNullCancel()){
            return; // Already canceled, so return directly
        }
        account.rollbackBytr(tr); // Roll back according to the record
    }
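The three cases handled above (repeated rollback, empty rollback, and a pay that arrives after the rollback) can be exercised with a small in-memory sketch. AccountTcc and the NULL_CANCELED state are hypothetical names, and synchronized methods stand in for the row lock taken by getForUpdate.

```java
import java.util.HashMap;
import java.util.Map;

class AccountTcc {
    enum Status { PAID, CANCELED, NULL_CANCELED }

    private final Map<Integer, Status> status = new HashMap<>();    // orderId -> record status
    private final Map<Integer, Integer> opTotal = new HashMap<>();  // orderId -> amount deducted
    private int balance;

    AccountTcc(int balance) {
        this.balance = balance;
    }

    int balance() {
        return balance;
    }

    Status statusOf(int orderId) {
        return status.get(orderId);
    }

    // Returns false if a record already exists: either a duplicate pay,
    // or a rollback that arrived first (anti-suspension).
    synchronized boolean pay(int orderId, int amount) {
        if (status.containsKey(orderId)) {
            return false;
        }
        balance -= amount;
        status.put(orderId, Status.PAID);
        opTotal.put(orderId, amount);
        return true;
    }

    synchronized void rollback(int orderId) {
        Status current = status.get(orderId);
        if (current == null) {
            status.put(orderId, Status.NULL_CANCELED); // empty rollback: leave a marker
            return;
        }
        if (current != Status.PAID) {
            return; // already canceled or empty-canceled: idempotent no-op
        }
        balance += opTotal.get(orderId);
        status.put(orderId, Status.CANCELED);
    }
}
```

If rollback(1) runs before pay(1, 60), the marker record makes the late pay return false and the balance stays untouched; calling rollback twice after a successful pay refunds only once.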

Summary

At this point we have essentially built our own distributed transaction. With this approach we can handle most business problems involving distributed transactions. Let's revisit the three key points:

  • Redo record: saved in a transaction record table.
  • Retry mechanism: retries driven by scheduled tasks or a message queue.
  • Idempotency: achieved through state machine transitions plus database row locks.

Mastering these three points helps not only with distributed transactions but with many other kinds of business logic as well.

If you found this article helpful, following and sharing it is the greatest support you can give me, O(∩_∩)O

