Understand compensation mechanisms in distributed systems in 10 minutes

1. About the business compensation mechanism

1. What is business compensation

In a distributed application, communication is a major source of trouble. A business process usually combines a group of services, and a single call may pass through DNS, network cards, switches, routers, load balancers, and other equipment. None of these is guaranteed to be stable all the time, and a fault at any link along the transmission path can make the call fail.

This is even more pronounced under microservices, because the business requires consistency: if a step fails, the system must either keep retrying until every step succeeds, or roll back to the state before the service call.

Business compensation can therefore be defined as the process of eliminating, through an internal mechanism, the "inconsistent" state left behind when an operation fails.

2. How business compensation is implemented

There are two main ways to implement business compensation:

  • Rollback (transaction compensation): operate in reverse and undo the business process. This means giving up; the current operation is certain to fail;
  • Retry: operate forward and try again to complete the business process. This means there is still a chance of success.

Generally speaking, business transaction compensation requires a workflow engine. The engine connects the services together and applies the corresponding compensation at the workflow level, so that the whole process ends up eventually consistent.
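The role of such a workflow engine can be illustrated with a minimal sketch. It is an assumption-laden toy, not a real engine: `Step`, `run_workflow`, and the stock and payment functions are hypothetical names invented for this example, state is kept in memory only, and compensation is assumed never to fail.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]                       # forward operation
    compensate: Optional[Callable[[dict], None]] = None  # reverse operation, if the service offers one


def run_workflow(steps: List[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate the completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            # Record which step failed so the rollback scope is known.
            ctx["failed_step"] = step.name
            ctx["error"] = str(exc)
            for done in reversed(completed):
                if done.compensate is not None:
                    done.compensate(ctx)
            return False
    return True


# Hypothetical services used only for this demonstration.
log: List[str] = []


def reserve_stock(ctx: dict) -> None:
    log.append("reserve_stock")


def release_stock(ctx: dict) -> None:
    log.append("release_stock")


def charge_payment(ctx: dict) -> None:
    raise RuntimeError("payment gateway timeout")


context: dict = {}
succeeded = run_workflow(
    [Step("reserve_stock", reserve_stock, release_stock), Step("charge_payment", charge_payment)],
    context,
)
```

Here the payment step fails, so the engine calls `release_stock` to undo the reservation and records the failed step in `context`. A production engine would also need to persist this state so that compensation survives a crash.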

PS: Compensation is by nature an additional process, and being able to afford an additional process means that timeliness is not the first consideration. The core principle of compensation is therefore: better slow than wrong.

2. About rollback

"Rollback" means restoring a program or its data to the most recent correct version after something goes wrong. In distributed business compensation, rollback means returning to the state before the service call through transaction compensation.

1. Explicit rollback

Rollback can generally be divided into two modes:

  • Explicit rollback: call the reverse interface to undo the last operation, or cancel the last operation before it completes (the resources involved must be locked);
  • Implicit rollback: no extra handling is needed on the caller's side, because the failure handling mechanism is usually provided by the downstream service.

The most common mode is explicit rollback. It comes down to two things.

First, identify the failed step and its state, so that the scope of the rollback can be determined. A business process is usually defined at design time, so the scope is easy to determine. The one thing to watch for: if not every service in the process provides a "rollback interface", the services that do provide one should be arranged first, so that there is still a chance to roll back when a later service fails.

Second, provide the business data that the rollback operation needs. The more data is supplied with a rollback, the more robust the program becomes, because the receiving service can run business checks on the rollback request, such as whether the accounts match and whether the amounts are consistent.
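Both points can be sketched in a few lines. Everything here is hypothetical and for illustration only: `ServiceCall`, `arrange`, `rollback_transfer`, and the in-memory `ledger` are invented names, and reordering is only valid when the business process has no ordering dependency between the services.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServiceCall:
    name: str
    has_rollback: bool  # whether the service exposes a "rollback interface"


def arrange(calls: List[ServiceCall]) -> List[ServiceCall]:
    """Put services that can be rolled back first, keeping the relative order otherwise."""
    return sorted(calls, key=lambda c: not c.has_rollback)  # sorted() is stable


class RollbackRejected(Exception):
    """Raised when the rollback request does not match the recorded operation."""


def rollback_transfer(ledger: Dict[str, dict], tx_id: str, account: str, amount: int) -> None:
    """Reverse a recorded debit only if the supplied business data matches the record."""
    record = ledger.get(tx_id)
    if record is None:
        raise RollbackRejected(f"unknown transaction {tx_id}")
    if record["account"] != account or record["amount"] != amount:
        raise RollbackRejected("rollback data does not match the original operation")
    if record["state"] == "rolled_back":
        return  # already reversed; repeating the rollback changes nothing
    record["state"] = "rolled_back"


ordered = arrange([
    ServiceCall("notify", False),
    ServiceCall("debit", True),
    ServiceCall("ship", False),
    ServiceCall("reserve", True),
])

ledger = {"tx-1": {"account": "alice", "amount": 100, "state": "done"}}
try:
    rollback_transfer(ledger, "tx-1", "alice", 90)  # wrong amount
    mismatch_rejected = False
except RollbackRejected:
    mismatch_rejected = True
rollback_transfer(ledger, "tx-1", "alice", 100)     # matching data
```

The mismatched request is refused before any state changes, which is the kind of business check that extra rollback data makes possible.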

2. Implementation of rollback

For cross-database transactions, the more common solutions are two-phase commit and three-phase commit (ACID). Neither is generally advisable in a high-availability architecture, because holding locks across databases is expensive in both time and performance.

A highly available architecture generally does not require strong consistency; eventual consistency is enough. Options to consider include transaction tables, message queues, compensation mechanisms, the TCC pattern (Try/Confirm/Cancel: reserve resources, then confirm or cancel), and the Saga pattern (split transactions plus a compensation mechanism).
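As an illustration of the TCC pattern, the following sketch shows one coordinator driving two participants through Try, then Confirm or Cancel. It is a simplified, single-process model under stated assumptions: `Inventory`, `place_order`, and the warehouse objects are hypothetical, and a real implementation would also have to handle timeouts and coordinator crashes.

```python
from typing import Dict, List, Tuple


class InsufficientStock(Exception):
    pass


class Inventory:
    """Toy TCC participant: Try reserves stock, Confirm consumes it, Cancel releases it."""

    def __init__(self, available: int) -> None:
        self.available = available
        self.reserved: Dict[str, int] = {}  # tx_id -> quantity held by the Try phase

    def try_reserve(self, tx_id: str, qty: int) -> None:
        if qty > self.available:
            raise InsufficientStock(f"need {qty}, have {self.available}")
        self.available -= qty
        self.reserved[tx_id] = qty

    def confirm(self, tx_id: str) -> None:
        self.reserved.pop(tx_id, None)  # reserved stock is now sold; safe to repeat

    def cancel(self, tx_id: str) -> None:
        self.available += self.reserved.pop(tx_id, 0)  # give stock back; safe to repeat


def place_order(tx_id: str, items: List[Tuple[Inventory, int]]) -> bool:
    """Try every participant; confirm all if every Try succeeds, otherwise cancel."""
    tried: List[Inventory] = []
    try:
        for inventory, qty in items:
            inventory.try_reserve(tx_id, qty)
            tried.append(inventory)
    except InsufficientStock:
        for inventory in tried:
            inventory.cancel(tx_id)
        return False
    for inventory in tried:
        inventory.confirm(tx_id)
    return True


warehouse_a = Inventory(5)
warehouse_b = Inventory(1)
failed_order = place_order("tx-1", [(warehouse_a, 2), (warehouse_b, 3)])  # B cannot reserve 3
good_order = place_order("tx-2", [(warehouse_a, 2), (warehouse_b, 1)])
```

In the first order the Try on `warehouse_b` fails, so the reservation already made on `warehouse_a` is cancelled and its stock is restored before the second order runs.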

3. About retrying

The semantics of "retry" is that we believe the failure is temporary rather than permanent, so we try again. The biggest advantage is that no additional reverse interface is needed. This lowers code maintenance and long-term development costs, because as the business changes, a reverse interface would have to change with it. Retrying therefore deserves to be considered more often.

1. Scenarios for retrying

Compared with rollback, retrying applies in fewer scenarios. It is worth considering when the downstream system reports a request timeout, rate limiting, or some other temporary state. If the response is a clear business error, such as insufficient balance or no permission, there is no point in retrying. The same goes for errors such as 503 or 404 returned by middleware or an RPC framework with no expected recovery time.
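The rule of thumb above can be written down as a small classification function. The `Failure` categories and the name `should_retry` are assumptions made for this sketch; a real system would map its own error codes onto categories like these.

```python
from enum import Enum, auto


class Failure(Enum):
    TIMEOUT = auto()               # downstream request timed out
    RATE_LIMITED = auto()          # request was throttled
    TEMPORARY_STATE = auto()       # downstream reports a transient condition
    INSUFFICIENT_BALANCE = auto()  # explicit business error
    NO_PERMISSION = auto()         # explicit business error
    UNAVAILABLE_NO_ETA = auto()    # e.g. 503 or 404 with no expected recovery time


# Only failures that are believed to be temporary are worth retrying.
RETRYABLE = frozenset({Failure.TIMEOUT, Failure.RATE_LIMITED, Failure.TEMPORARY_STATE})


def should_retry(failure: Failure) -> bool:
    """Return True when the failure is considered temporary."""
    return failure in RETRYABLE


decisions = {failure.name: should_retry(failure) for failure in Failure}
```

Keeping the retryable set explicit means that any new failure category defaults to "do not retry" until someone decides otherwise.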

2. Retry strategy

A retry strategy decides when to retry and how many times, and the right answer differs from case to case. The mainstream retry strategies are as follows:

Strategy 1 - Retry Immediately: Sometimes the failure is momentary, caused by events such as a network packet collision or a traffic peak on a hardware component. In this case it is appropriate to retry at once. However, an immediate retry should be attempted at most once; if it fails, switch to another strategy;

Strategy 2 - Fixed Interval: This is easy to understand, for example retrying every 5 minutes. PS: Strategy 1 and Strategy 2 are mostly used for interactive operations in front-end systems;

Strategy 3 - Incremental Interval: The retry interval grows by a fixed increment each time, for example 0 seconds for the first retry, 5 seconds for the second, and 10 seconds for the third. Requests that have failed more often are thus pushed further back, making way for newly arriving retry requests;

return (retryCount - 1) * incrementInterval;

Strategy 4 - Exponential Interval: The retry interval grows exponentially each time. Like the incremental interval, it pushes requests that have failed more often further back, but the interval grows much faster;

return 1 << retryCount; // 2 to the power of retryCount

Strategy 5 - Full Jitter: Adds randomness on top of the exponential interval (the exponential part can be replaced with incremental growth). It suits scenarios where a large burst of retry requests generated at the same moment needs to be spread out to relieve pressure;

return random(0, 1 << retryCount);

Strategy 6 - Equal Jitter: A middle ground between "Exponential Interval" and "Full Jitter" that reduces the effect of randomness. The applicable scenarios are the same as for "Full Jitter".

int baseNum = 1 << retryCount; // 2 to the power of retryCount
return baseNum + random(0, baseNum);

[Figure: retry interval under strategies 3, 4, 5, and 6; the x-axis is the number of retries]
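The interval formulas for strategies 3 to 6 can be put into runnable form as follows. This is a sketch under assumptions: delays are taken to be in seconds, `call_with_retry` and its `delay_for` and `sleep` parameters are hypothetical names, and every exception is treated as retryable, which a real caller would narrow down.

```python
import random
import time
from typing import Callable, List


def incremental_delay(retry_count: int, increment: float = 5.0) -> float:
    """Strategy 3: 0, 5, 10, ... for retry 1, 2, 3, ..."""
    return (retry_count - 1) * increment


def exponential_delay(retry_count: int) -> float:
    """Strategy 4: 2 to the power of retry_count."""
    return float(2 ** retry_count)


def full_jitter_delay(retry_count: int, rng: random.Random) -> float:
    """Strategy 5: a random value between 0 and the exponential delay."""
    return rng.uniform(0, 2 ** retry_count)


def equal_jitter_delay(retry_count: int, rng: random.Random) -> float:
    """Strategy 6 as given in the text: the exponential delay plus a random part of the same size."""
    base = 2 ** retry_count
    return base + rng.uniform(0, base)


def call_with_retry(operation: Callable[[], object],
                    delay_for: Callable[[int], float],
                    max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> object:
    """Call operation once, then retry up to max_retries times, waiting delay_for(n) before retry n."""
    for retry_count in range(max_retries + 1):
        if retry_count > 0:
            sleep(delay_for(retry_count))
        try:
            return operation()
        except Exception:
            if retry_count == max_retries:
                raise  # retries exhausted: surface the last error


# Demonstration with a fake sleep that only records the requested waits.
waits: List[float] = []
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("downstream timed out")
    return "ok"


result = call_with_retry(flaky, incremental_delay, max_retries=3, sleep=waits.append)
```

With the incremental strategy, the two failures are followed by waits of 0 and 5 seconds before the third attempt succeeds. Passing `sleep` as a parameter keeps the example testable without real waiting.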

3. Precautions when retrying

First of all, any interface that may be retried must be idempotent: calling the service several times must not cause business data to be increased or decreased more than once.

Achieving idempotence comes down to finding a way to recognize repeated requests and filter them out. The idea is:

Define a unique identifier for each request.

On a retry, check whether the request has already been executed or is currently being executed; if so, discard it.
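The two steps above can be sketched as a small deduplicating wrapper. This is an in-memory, single-threaded illustration only: `IdempotentExecutor` and the request ID `req-001` are hypothetical, and a real service would need a shared store with an atomic "insert if absent" operation.

```python
from typing import Callable, Dict


class IdempotentExecutor:
    """Runs an operation at most once per request ID."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}  # request_id -> "running" or "done"

    def execute(self, request_id: str, operation: Callable[[], None]) -> bool:
        """Return True if the operation ran, False if the request was discarded as a repeat."""
        if request_id in self._status:
            return False  # already executed or still executing: discard
        self._status[request_id] = "running"
        try:
            operation()
        except Exception:
            del self._status[request_id]  # a failed attempt may be retried later
            raise
        self._status[request_id] = "done"
        return True


balance = {"amount": 100}


def deduct() -> None:
    balance["amount"] -= 30


executor = IdempotentExecutor()
first = executor.execute("req-001", deduct)   # runs the deduction
second = executor.execute("req-001", deduct)  # retry with the same ID is discarded
```

The retry carries the same identifier as the original request, so the balance is reduced only once.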

PS: Retrying is also a prime candidate for being downgraded (switched off) under high load, and it should of course be subject to rate limiting and circuit breaking. The "spear" of retrying works best when paired with the "shield" of rate limiting and circuit breaking.
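How retrying and circuit breaking can work together is sketched below. It is deliberately minimal and rests on assumptions: `CircuitBreaker`, `CircuitOpen`, and `retry_through_breaker` are hypothetical names, the breaker counts consecutive failures only, and the cool-down (half-open) state that real breakers have is omitted.

```python
from typing import Callable


class CircuitOpen(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    """Opens after `threshold` consecutive failures and rejects calls while open."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def call(self, operation: Callable[[], object]) -> object:
        if self.is_open:
            raise CircuitOpen("too many recent failures; not calling downstream")
        try:
            result = operation()
        except Exception:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0  # any success closes the failure streak
        return result


def retry_through_breaker(breaker: CircuitBreaker,
                          operation: Callable[[], object],
                          max_attempts: int = 5) -> object:
    """Retry the operation, but stop as soon as the breaker opens."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Exception = RuntimeError("operation was never attempted")
    for _ in range(max_attempts):
        try:
            return breaker.call(operation)
        except CircuitOpen:
            raise  # the shield wins: no further retries
        except Exception as exc:
            last_error = exc
    raise last_error


calls = {"n": 0}


def always_down() -> None:
    calls["n"] += 1
    raise ConnectionError("downstream unavailable")


breaker = CircuitBreaker(threshold=3)
try:
    retry_through_breaker(breaker, always_down, max_attempts=5)
    outcome = "succeeded"
except CircuitOpen:
    outcome = "stopped by breaker"
```

Although five attempts are allowed, the downstream service is called only three times: the breaker opens after the third failure and the fourth attempt is rejected without touching it.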

4. Considerations for the business compensation mechanism

1. ACID or BASE

ACID and BASE are two theories describing different levels of consistency. In a distributed system, ACID provides stronger consistency but scales very poorly, so it should be used only when necessary. BASE provides weaker consistency but scales well and allows asynchronous batch processing. Most distributed transactions are a good fit for BASE.

In retry and rollback scenarios we generally do not require strong consistency; guaranteeing eventual consistency is enough.

2. Considerations for business compensation design

The main points to keep in mind are:

  • To carry a business process through to completion, every service involved must support idempotency, and the upstream caller must have a retry mechanism;

  • The state of the whole process must be carefully maintained and monitored, so do not scatter it across different components. It is better to have a single business process controller, that is, a workflow engine, do this job. That workflow engine must therefore be highly available and stable;

  • The business logic and flow of compensation do not have to be the strict reverse of the forward flow. Sometimes steps can run in parallel, and sometimes the compensation can be simpler. In short, the reverse compensation process should be designed together with the forward business process;

  • The logic of business compensation is strongly tied to the business itself, so it is hard to make it generic;

  • It is best for the underlying business service to provide a short-term resource reservation mechanism. In e-commerce, for example, inventory is reserved in advance while the system waits up to 15 minutes for the user to pay. If no payment arrives, the inventory is released, the order is rolled back, and the user can place the order again.
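The short-term reservation idea can be sketched as follows. The class `StockReservations`, its method names, and the order IDs are hypothetical; the clock is injected so that the 15-minute hold can be demonstrated without waiting, and expired holds are released lazily whenever the object is used rather than by a background job.

```python
import time
from typing import Callable, Dict, Tuple


class StockReservations:
    """Holds stock for an order for a limited time and releases it if payment never arrives."""

    def __init__(self, stock: int, hold_seconds: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.stock = stock
        self.hold_seconds = hold_seconds
        self.clock = clock
        self.holds: Dict[str, Tuple[int, float]] = {}  # order_id -> (quantity, expiry time)

    def release_expired(self) -> None:
        """Return the stock of every hold whose time has run out."""
        now = self.clock()
        for order_id, (qty, expires_at) in list(self.holds.items()):
            if now >= expires_at:
                self.stock += qty
                del self.holds[order_id]

    def reserve(self, order_id: str, qty: int) -> bool:
        self.release_expired()
        if qty > self.stock:
            return False
        self.stock -= qty
        self.holds[order_id] = (qty, self.clock() + self.hold_seconds)
        return True

    def pay(self, order_id: str) -> bool:
        """Turn a live hold into a sale; refuse payment for an expired or unknown hold."""
        self.release_expired()
        return self.holds.pop(order_id, None) is not None


now = [0.0]  # fake clock, in seconds
shop = StockReservations(stock=2, clock=lambda: now[0])
first = shop.reserve("order-1", 2)   # holds both units
blocked = shop.reserve("order-2", 1)  # nothing left while the hold is active
now[0] = 16 * 60                      # 16 minutes later, order-1 is still unpaid
late_payment = shop.pay("order-1")    # the hold has expired, so payment is refused
second = shop.reserve("order-2", 1)   # the released stock is available again
```

Once the hold on `order-1` expires, its stock returns to the pool and the user of `order-1` has to place the order again, as described above.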
