System Architecture: Distributed Idempotence, Applicable Scenarios and Solutions

1. Background

A distributed system is composed of many microservices, and these microservices inevitably make a large number of network calls to each other. The following figure shows an example of an abnormal call between services. After the user submits an order, the request goes to service A. After service A places the order, it calls service B. However, the call from A to B is subject to many uncertainties. For example, service B's execution may time out, so the RPC returns a timeout to A, and A then returns an error prompt to the user. In reality, B may have executed successfully and simply taken too long.

[Figure: an abnormal call between services]

After seeing the error message, the user often clicks again on the interface, which results in repeated calls. If B is a payment service, the user's repeated clicks may cause the same order to be charged multiple times. Users are not the only source of repeated calls: scheduled tasks, message delivery, and machine restarts can all cause repeated execution. In a distributed system, it is very common for service calls to fail in various ways. These failures often leave the system in an inconsistent state, so a fault-tolerant compensation design is required. The most common approach is for the caller to implement a reasonable retry strategy and for the callee to implement an idempotent strategy that tolerates the retries.
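
The caller side of this pairing can be sketched roughly as follows. This is a minimal illustration rather than a production retry policy: `RetryingCaller`, `callWithRetry` and `maxAttempts` are hypothetical names, and the downstream service is modelled as a plain function. The point it shows is that the request number is generated once and reused on every attempt, which is what lets an idempotent callee recognise the retries.

```java
import java.util.UUID;
import java.util.function.Function;

// Hypothetical caller-side helper; none of these names come from the article.
final class RetryingCaller {

    /**
     * Calls the downstream service up to maxAttempts times.
     * The request number is generated once and reused on every attempt,
     * so an idempotent callee can recognise the retries as one request.
     */
    static <R> R callWithRetry(Function<String, R> downstream, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        String reqNo = UUID.randomUUID().toString(); // generated once, not per attempt
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return downstream.apply(reqNo);
            } catch (RuntimeException e) {
                last = e; // e.g. a timeout: the callee may or may not have executed
            }
        }
        throw last; // every attempt failed
    }
}
```

A real retry strategy would normally also wait between attempts and retry only on errors that are worth retrying; both are left out here to keep the sketch short.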

2. Idempotence

Idempotence is commonly described as follows: the same request should return the same result, so query interfaces are naturally idempotent. But consider a query interface that returns the status of an order. The status changes over time, so two queries at different times may return different statuses. Is this query interface still idempotent?

The definition of idempotence directly determines how we design idempotent solutions. If idempotence meant that the same request returns the same result, then caching the first result and returning it for subsequent repeated requests would be enough. But is the problem really that simple? The author prefers this definition: idempotence means that the side effects of executing the same request (identical request) once or multiple times are the same.

This definition is fairly abstract and general. Designing an idempotent solution is really a matter of making that abstraction concrete. For example: what counts as the same request? What are the side effects? How can the side effects be avoided?
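
To make the definition concrete, here is a small sketch that contrasts a non-idempotent operation with an idempotent one. The `Account` class and its methods are hypothetical and exist only for illustration. It is single-threaded on purpose, since concurrency is a separate dimension discussed later.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical account used only to illustrate the definition; not thread-safe.
final class Account {
    private long balance;
    private final Set<String> processed = new HashSet<>();

    Account(long balance) {
        this.balance = balance;
    }

    // Not idempotent: every call changes the balance again.
    void deduct(long amount) {
        balance -= amount;
    }

    // Idempotent: repeated calls with the same reqNo leave the same side effect as one call.
    void deductOnce(String reqNo, long amount) {
        if (processed.add(reqNo)) { // add returns false if reqNo was already recorded
            balance -= amount;
        }
    }

    long balance() {
        return balance;
    }
}
```

Calling `deduct(30)` twice on a balance of 100 leaves 40, while calling `deductOnce("r1", 30)` twice leaves 70: the second call is the same request, so it adds no further side effect.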

3. Solutions

Many articles about idempotence claim that their solution is a general one. The author believes, however, that what counts as the same request and what the side effects are differ between business scenarios, and that different side effects require different solutions, so no completely general solution exists. The three steps below (the "trilogy") aim to extract a way of thinking, with examples to illustrate that this way of thinking makes it easier to design idempotent solutions that fit the business scenario.

1. Identify the same request

Idempotence addresses the problem of the same request being executed repeatedly, so how do we identify whether a request repeats a previous one? Some schemes use a serial number field in the request: the same serial number indicates the same request. Other schemes compare certain fields, or even all fields, of the request. Therefore, the first step in designing a solution is to clearly define what counts as the same request in the specific business scenario.

Example: token mechanism to identify repeated front-end requests

In the back-end systems of a call chain, a duplicate request can generally be identified by the reqNo+source passed by the upstream system. As shown in the figure below, system B relies on the reqNo+source passed by system A to identify the same request. But system A interacts directly with the front-end page, so how does it recognize that requests initiated by the user are the same? For example, if the user clicks multiple times on the payment page, how does system A recognize that this is a repeated operation?

[Figure: system B identifies the same request by the reqNo+source passed by system A]

The front end can disable the button after the first click so that the user cannot click a second time. However, this is only a front-end measure that improves the experience, not a real safeguard.

A common server-side solution is to use a token mechanism to prevent duplicate submissions, as shown below.

[Figure: token mechanism for preventing duplicate submissions]

(1) When the user enters the form page, the front end requests a token from the server and keeps it on the front end.
(2) When the user clicks submit for the first time, the token is submitted to the server together with the form data. The server checks whether the token exists; if it does, the server executes the business logic, and the token is no longer valid afterwards.
(3) When the user clicks submit a second time, the token and form data are again submitted together. The server checks whether the token exists; since it no longer does, the server returns an error and the front end shows that the submission failed.

This solution combines the front end and the back end. From the front end's perspective, it prevents repeated requests. From the server's perspective, it identifies the same request from the front end. The server side is often implemented on a distributed cache such as Redis to guarantee the uniqueness of the generated token and the atomicity of token operations. The core logic is as follows.

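// SETNX keyName value: returns 0 if the key already exists, 1 if it does not exist (and is then set)

// Step 1: apply for a token
String token = generateUniqueToken();

// Step 2: check whether the token has already been used
if (redis.setNx(token, 1) == 1) {
    // first use of this token: do business
} else {
    // token already used: idempotent logic
}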
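
The three steps listed earlier (issue a token, accept the first submission, reject the second) can also be sketched as a single-process stand-in. This is only an illustration under stated assumptions: `SubmitTokenService`, `issueToken` and `consume` are hypothetical names, and an in-memory concurrent set takes the place of the distributed cache.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for a cache-backed token store; all names are hypothetical.
final class SubmitTokenService {
    private final Set<String> issued = ConcurrentHashMap.newKeySet();

    // Step (1): issue a token when the form page is opened.
    String issueToken() {
        String token = UUID.randomUUID().toString();
        issued.add(token);
        return token;
    }

    // Steps (2) and (3): checking and removing the token is one atomic operation,
    // so only the first submission that presents the token gets true.
    boolean consume(String token) {
        return token != null && issued.remove(token);
    }
}
```

The design point is that "does the token exist?" and "invalidate the token" must not be two separate steps, otherwise two concurrent submissions could both pass the check. In Redis, one way to get the same effect is to delete the token key and treat a deleted-key count of 1 as the first submission.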

2. List and reduce the analysis dimensions of side effects

If the same request executes the business logic repeatedly and this is not handled properly, it causes side effects in the system. So what are side effects? From a technical point of view, a side effect means that, in addition to returning a result, the call changes some "system state". A function without side effects is called a pure function. From a business point of view, a side effect is an unexpected result that the business cannot accept. The most common ones are duplicate records being stored and data being changed by mistake. Most idempotent solutions are designed around these problems. However, a system often has side effects in several dimensions, for example:
(1) Downstream call dimension: what happens if the downstream is called repeatedly? If the downstream is not idempotent, what are the side effects of repeated calls?
(2) Upstream return dimension: for example, if the first call returned an exception to the upstream, is the second return idempotent? What side effects does it bring to the upstream?
(3) Concurrent execution dimension: what happens when repeated requests execute concurrently? What are the side effects?
(4) Distributed lock dimension: distributed locks can be introduced to prevent concurrent execution, but if the lock becomes inconsistent, what are the side effects?
(5) Interaction timing dimension: is there asynchronous interaction, and are there ordering problems? What are the side effects?
(6) Customer experience dimension: how quickly must the data move from inconsistency to eventual consistency? What are the side effects if that does not happen in time? For example, a large number of customer complaints (following the customer-first principle, at Alipay an excessive number of customer complaints is classified as a production incident).
(7) Business verification dimension: do repeated calls overwrite the verification mark, so that verification can no longer be performed normally? In a financial system, a funds link that cannot be verified is unacceptable.
(8) Data quality dimension: are there duplicate records? If so, what are the side effects?

The above are some common analysis dimensions, and systems in different industries will have others. Summarize these dimensions as fully as possible and include them in the checklist used during system analysis, which helps make the idempotent solution more complete. A complete idempotent solution has no side effects, but the more dimensions of side effects there are, the more complex the solution becomes. Therefore, as long as the business requirements are still met, reducing some analysis dimensions makes the idempotent solution more economical and effective to implement. For example: if a dedicated idempotent table stores the idempotent results returned to the upstream, dimension (2) need not be considered. If locks are used to prevent concurrency, dimension (3) need not be considered. If single-machine locks are used instead of distributed locks, dimension (4) need not be considered.

This is the second step in tackling idempotence: listing and reducing the analysis dimensions of side effects. The solutions involved in this step usually address the side effects of one particular dimension, and are well suited to being built as general components that serve as shared technical practice within the team.

Example: Locking to avoid concurrent repeated execution

Many idempotent solutions involve preventing concurrency, so what is the relationship between idempotence and concurrency? Idempotence solves the problem of repeated execution, and repeated execution includes both serial repeated execution (such as scheduled tasks) and concurrent repeated execution. If the business logic being repeated has no shared variables and performs no data changes, concurrent repeated execution has no side effects and concurrency can be ignored. For services that do contain shared variables and do change data (which in practice is most services), concurrency can lead to out-of-order reads and writes of shared variables, duplicate inserts, and similar problems. Concurrent reads and writes of shared variables in particular are often only noticed after a production failure.

Therefore, in the concurrent execution dimension, the best idempotent solution is to turn concurrent repeated execution into serial repeated execution. The most common approach at Alipay is "one lock, two judgments, three updates", as shown in the figure below. When a request arrives: first, lock the resource to be operated on; second, judge whether this is a repeated request (the question defined in the first step) and whether the business status is as expected; third, update, that is, execute the business logic.

[Figure: one lock, two judgments, three updates]
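
The lock, judge, update sequence can be sketched in a single JVM as follows. This is an illustration under assumptions, not Alipay's component: `OrderProcessor` and its fields are hypothetical, a per-request `ReentrantLock` stands in for the lock component, and an in-memory map stands in for the stored business result.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Single-JVM sketch; the lock map and result map stand in for a lock component
// and a persisted business result. All names are hypothetical.
final class OrderProcessor {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final Map<String, String> results = new ConcurrentHashMap<>();
    private final AtomicInteger executions = new AtomicInteger(); // how often the business logic really ran

    String process(String reqNo) {
        ReentrantLock lock = locks.computeIfAbsent(reqNo, k -> new ReentrantLock());
        lock.lock();                                  // one: lock the resource
        try {
            String previous = results.get(reqNo);     // two: judge whether this is a repeat
            if (previous != null) {
                return previous;                      // idempotent return
            }
            executions.incrementAndGet();             // three: update (execute business logic)
            String result = "done:" + reqNo;
            results.put(reqNo, result);
            return result;
        } finally {
            lock.unlock();
        }
    }

    int executions() {
        return executions.get();
    }
}
```

Because the judgment happens while the lock is held, concurrent repeats of the same request are serialized: the first one executes the business logic and the rest see its result.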

Q&A

Little A: Locking may hurt performance. Judging first and locking afterwards would improve performance.
Daming: That may lose the protection against concurrency. Remember the double-check implementation of the singleton pattern? It already judges before locking, so why does it need to judge again after locking? In fact, the second check is necessary. Think about it.
Little A is drawing a picture and thinking...
Little A: I understand. In one lock, two judgments, three updates, the order of the lock and the judgment cannot be swapped. If lock contention is high, you can add a judgment before locking to improve efficiency, which is why it is called double check.
Daming: Yes, smart. The two scenarios are different, but the idea for handling concurrency is the same.

private static volatile Girl theOnlyGirl;

// Double check when implementing a singleton
public static Girl getTheOnlyGirl() {
    if (theOnlyGirl == null) {                // check before locking
        synchronized (Girl.class) {
            if (theOnlyGirl == null) {        // check after locking
                theOnlyGirl = new Girl();     // perform the change
            }
        }
    }
    return theOnlyGirl;
}

The lock can be implemented as a distributed lock or a database lock. A distributed lock brings its own lock consistency problem, which has to be weighed against the business's requirements for system stability. Many systems at Alipay implement their business lock component by creating a lock record table in the business database. Its sharding logic is kept consistent with that of the business table, so a single-database lock is sufficient. If there is no lock component, pessimistic locking of the business record also meets the need. Pessimistic locking should be done inside a transaction using select for update. Watch out for deadlocks, and make sure the where condition hits an index; otherwise the whole table is locked rather than individual records.

The concurrency dimension is an analysis dimension in almost every distributed idempotence problem, so a general lock component is necessary. But it only removes the side effects of the concurrency dimension. Concurrent repeated execution is gone, but serial repeated execution remains. Repeated execution is the core problem that idempotence has to solve; if repeated execution still has other side effects, the idempotence problem has not been solved.

Locking reduces business performance. How should this be handled? The author believes that in most cases architectural stability takes priority over system performance, and that there are many other places to optimize performance, such as cleaning up bad code, removing slow SQL, optimizing the business architecture, and scaling database resources horizontally. Running system stress tests to confirm that the service meets its SLA is the right way to evaluate the performance of the whole link.

3. Identify fine-grained side effects and design targeted solutions

After the side effects of some dimensions have been dealt with, the fine-grained side effects of the remaining dimensions need to be identified and solved one by one. In the data quality dimension, one of the biggest side effects is duplicate data. In the interaction dimension, one of the biggest side effects is out-of-order execution. Such problems are generally not solved with general-purpose components, and developers are free to design their own solutions. This section uses two common scenarios as examples.

Example 1: Unique constraints to avoid repeated storage

When designing the data table, include two fields, source and reqNo: source identifies the caller, and reqNo is the request number sent by the caller. Create a composite unique index on source and reqNo to ensure that the same record is not stored twice. If the caller does not provide source and reqNo, then depending on the actual business situation, an md5 can be generated from some of the business parameters in the request and stored in a column with a unique index to avoid repeated storage.
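
As an illustration of the md5 variant, the sketch below derives a fingerprint from selected business parameters. `RequestFingerprint` and `md5Of` are hypothetical names, and which fields to include is a business decision the article leaves open. The sketch also assumes that the separator character `|` never appears inside a field value; otherwise two different requests could produce the same joined string.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: builds a value for the unique column from business parameters.
final class RequestFingerprint {

    // Joins the chosen fields with '|' (assumed absent from field values)
    // and returns the MD5 digest as 32 lowercase hex characters.
    static String md5Of(String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(String.join("|", fields).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

The same parameters in the same order always give the same fingerprint, so a repeated request collides with the stored row on the unique index, while a request with different parameters gets a different value.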

The core logic is as follows:

try {
    dao.insert(entity);
    // do business
} catch (DuplicateKeyException e) {
    dao.select(param);
    // idempotent return
}

Here the record is inserted directly. If the insert succeeds, the request has not been processed before, and the business logic continues. If a DuplicateKeyException is thrown, the request has already been executed, and an idempotent result is returned. Simple services can also use this approach to identify duplicate requests (the first step).

When using a unique index in the database to avoid duplicate records, pay attention to the following issues:
(1) With read-write separation, the insert may go to the master database while the select goes to a slave. Because of replication lag, the select may then return nothing.
(2) If the database has a failover mechanism and a disaster hits one city, traffic is likely to switch to a standby database in another city, and the unique constraint may then fail. For example, in a concurrent scenario the first insert lands in the Hangzhou database, failover then switches to the Shanghai database, and the same request inserts successfully a second time.
(3) In a database expansion scenario, a change in the sharding rules may send the first insert to database A and the second insert to database B, so the unique index does not take effect either.
(4) Some systems catch SQLIntegrityConstraintViolationException, which covers integrity constraints in general, including but not limited to uniqueness constraints. A required field that is not set also throws this exception, so you should catch DuplicateKeyException instead.
Problem (1) can be solved by placing the insert and the select in the same transaction. For (2) and (3), Alipay internally designed a distributed mechanism based on data replication technology to cope with capacity surges and failover (FO). The author does not know this case well and will discuss it later if there is a chance.

Little A: If I use a unique constraint to ensure that no duplicate data is stored, can I prevent concurrency without locking?
Daming: The two are not directly related. Locking to prevent concurrency addresses the side effects of the concurrency dimension, while a unique constraint addresses only one specific side effect, duplicate data. Without a unique constraint, serial repeated execution would also cause duplicate inserts. A unique constraint essentially solves the duplicate data problem, not the concurrency problem.

Example 2: State machine constraints solve out-of-order problems

There are often different states in the life cycle of a business, and using a state machine to control state transitions in the business process is the best choice. One-way state machines are common in practice: once the state machine has moved to the next state, it cannot return to the previous one. The state machine is often used for checks in the following scenarios:
(1) The caller retries after a call times out.
(2) Message delivery is retried after a timeout.
(3) The business system initiates multiple tasks but expects them to return in the order in which they were initiated.

For this kind of problem, the usual approach is to check before processing whether the state is as expected, and to execute the business only if it is. After the business has executed, the state change itself uses a check similar to optimistic locking as a last line of defense. For example, if state M can only be reached from state N, the state is verified in the SQL that updates the record.

update apply set status = 'M' where status = 'N'
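
The effect of that conditional update can be mimicked in memory to show why a repeated or late request does not move the state a second time. This is a sketch under assumptions: `Status`, `ApplyRepository` and `transition` are hypothetical names, and a concurrent map with an atomic compare-and-replace stands in for the database row and the `where status = 'N'` condition.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical states, named after the N and M states in the text.
enum Status { N, M }

// In-memory stand-in for the table being updated; all names are hypothetical.
final class ApplyRepository {
    private final Map<Long, Status> rows = new ConcurrentHashMap<>();

    void insert(long id, Status status) {
        rows.put(id, status);
    }

    // Mirrors "update ... set status = to where status = from":
    // returns true only if the row was still in the expected state.
    boolean transition(long id, Status from, Status to) {
        return rows.replace(id, from, to);
    }

    Status statusOf(long id) {
        return rows.get(id);
    }
}
```

In the database version, the counterpart of the boolean result is the number of affected rows: zero affected rows means the record was no longer in state N, so the request should be treated as a repeat or an out-of-order arrival.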

4. Summary

This article first introduces the definition of idempotence: executing the same request once or multiple times has the same side effects. It then proposes a trilogy for designing idempotent solutions, with examples. Designers should clearly define what the same request is, use common components to reduce the analysis dimensions of some side effects, and then design targeted solutions for the specific remaining side effects until none are left, which gives a truly complete idempotent solution. In actual business, the trilogy does not have to be followed in strict order, but as long as the plan is conceived according to the trilogy, it will surely open up ideas and simplify

Origin blog.csdn.net/zhanggqianglovec/article/details/131481460