In Practice: Dealing with a Concurrency Problem at Work | JD Logistics Technical Team

1. Problem Background

The problem occurred in the express parcel sorting process. I have simplified the business background as much as possible so that we can focus on the concurrency problem itself.

The sorting business generates a task for each express package, which we call a task. Two fields in the task need attention: the exception that occurs during sorting (exp_type) and the status of the sorting task (status). We also need to pay attention to the sorting status reporting interface, which records exceptions and status changes during the sorting process.

Figure: Task overview

Under normal circumstances, the sorter calls the interface promptly when a sorting exception occurs, and calls it again to mark the task as completed when sorting finishes. The interval between the two calls is long, so no concurrency problem arises.

However, one particular sorting machine does not report an exception when it occurs. Instead, it reports the exception and the sorting result together when sorting completes. The sorting status reporting interface then receives two calls at almost the same time, which triggers an unexpected concurrency problem.

Let's first look at the execution process of the sorting status reporting interface:

  1. Query the sorting task. Initially, exp_type and status both hold the default value 0
  2. Set exp_type in the task when a sorting exception is reported, and set status when sorting is completed
  3. Write the modified task back to the database

The illustration of the occurrence of concurrency issues is as follows:

Figure: Concurrency problem flowchart

Writing a row as (id, exp_type, status), the initial value in the database is (1, 0, 0). The sorting exception and the sorting completion are reported almost at the same time, and both requests read this value. The sorting exception action changes exp_type to 9 and writes it to the database, so the database value becomes (1, 9, 0). The sorting completion action changes status to 1 and writes it to the database, making the final value (1, 0, 1), which overwrites the exception field. Under normal circumstances the final value should be (1, 9, 1): the sorting completion action should read the value left by the exception report, (1, 9, 0), and then modify it.
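The interleaving can be replayed without a database. The following is a minimal sketch, assuming an in-memory array stands in for the task row; the class and method names (LostUpdateDemo, read, write, replay) are hypothetical and only illustrate the read-modify-write sequence described above.

```java
import java.util.Arrays;

final class LostUpdateDemo {

    // Hypothetical in-memory "database" holding one row: {id, exp_type, status}.
    private static int[] row;

    /** Step 1 of the interface: read the whole row (returns a private copy). */
    static int[] read() {
        return row.clone();
    }

    /** Step 3 of the interface: write the whole row back, every column included. */
    static void write(int[] task) {
        row = task.clone();
    }

    /** Replays the interleaving from the flowchart and returns the final row. */
    static int[] replay() {
        row = new int[] {1, 0, 0};

        int[] seenByException = read();   // exception report reads (1, 0, 0)
        int[] seenByCompletion = read();  // completion report reads (1, 0, 0) too

        seenByException[1] = 9;           // exception report sets exp_type = 9
        write(seenByException);           // database is now (1, 9, 0)

        seenByCompletion[2] = 1;          // completion report sets status = 1 on its stale copy
        write(seenByCompletion);          // database is now (1, 0, 1): exp_type is lost

        return read();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(replay())); // prints [1, 0, 1]
    }
}
```

The second write carries the stale exp_type = 0 it read earlier, which is exactly how the exception value disappears.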

2. Solutions

The cause is easy to find: two transactions perform a read-modify-write sequence at the same time, and one write directly overwrites the result of the other without merging its changes, so data is lost.

This is a typical lost update problem. It can be avoided by locking the database read or by raising the database isolation level to serializable so that transactions execute serially. Below I introduce the solutions that came up when we discussed how to avoid lost updates, expressed in code wherever possible.

2.1 Database read operation locking and serializable isolation level

Consider this: if any other transaction that wants to modify a given task has to wait until the current transaction has completed, so that the transactions execute serially, then the lost update cannot happen. The most direct way to achieve this is explicit locking, as follows:

select exp_type, status
from task
where id = 1
for update;

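The transaction that queries the row first obtains an exclusive lock on it. Subsequent writes and locking reads of that row (for example, another select ... for update) are blocked until the first transaction finishes and releases the lock. Under InnoDB's default isolation level, ordinary non-locking selects are consistent snapshot reads and are not blocked, so every transaction that takes part in the read-modify-write sequence must use the locking read.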

This achieves serial execution of transactions through locking. However, when you add a locking clause to SQL, you need to confirm that it locks only the target rows and not the entire table. InnoDB locks the index records it scans, so a where condition that cannot use an index may lock far more rows than intended and seriously degrade system performance. You also need to check which business scenarios use the SQL and whether long-running transactions lock or wait on the same rows. If they do, lock waits can cause delays and performance degradation, so careful evaluation is required.

The serializable isolation level also guarantees serial execution of transactions, but it applies to every transaction. To preserve performance we normally do not adopt it (MySQL's InnoDB engine uses repeatable read by default).

MySQL's InnoDB engine implements the serializable isolation level with two-phase locking (2PL): locks are acquired as the transaction executes (the first phase) and are released only when the transaction ends (the second phase).

2.2 Only modify the necessary fields for the business

If the exception report only needs to modify the exp_type field, and the completion report only needs to modify the status field, then we can sort out the business logic and write only the fields that actually change, so that the lost update does not occur, as shown in the following code:

// Handle the exception report: build an object that carries only the changed field
Task task = new Task();
task.setId(id);
task.setExpType(expType);

// Persist the change
taskService.updateById(task);

Before executing the update, create a new object and assign only the fields that must change. The caveat is that if the business flow is already very complicated, it may not be clear which fields should be assigned, and new bugs can be introduced. This approach therefore requires a thorough understanding of the business and ample testing after the change.
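To see why this removes the conflict, here is a minimal in-memory sketch. It assumes that updateById skips fields left as null, which is how selective-update helpers commonly behave but is not stated in the original; TaskUpdate, updateSelective and replay are hypothetical names used only for illustration.

```java
import java.util.Arrays;

final class SelectiveUpdateDemo {

    /** Update object: a null field means "do not touch this column". */
    static final class TaskUpdate {
        Integer expType;
        Integer status;
    }

    // In-memory stand-in for the row with id = 1: {id, exp_type, status}.
    private static final int[] ROW = {1, 0, 0};

    /** Writes only the non-null fields, like "update task set <changed column> = ? where id = ?". */
    static synchronized void updateSelective(TaskUpdate update) {
        if (update.expType != null) {
            ROW[1] = update.expType;
        }
        if (update.status != null) {
            ROW[2] = update.status;
        }
    }

    /** Applies both reports and returns the final row. */
    static synchronized int[] replay() {
        ROW[1] = 0;
        ROW[2] = 0;

        TaskUpdate exception = new TaskUpdate();
        exception.expType = 9;   // the exception report carries exp_type only

        TaskUpdate completion = new TaskUpdate();
        completion.status = 1;   // the completion report carries status only

        // The order does not matter: neither write touches the other's column.
        updateSelective(completion);
        updateSelective(exception);
        return ROW.clone();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(replay())); // prints [1, 9, 1]
    }
}
```

Because each request writes a disjoint set of columns, neither one can carry a stale value of the other's field back into the row.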

2.3 Distributed locks

The distributed lock approach is similar to the approach in 2.1: it uses locking to ensure that only one transaction runs at a time. The difference is that the lock in 2.1 is taken at the database layer, while the distributed lock is implemented with Redis.

The advantage of this implementation is the fine lock granularity: contention is limited to a single package, and there is no need to worry about lock scope or the impact on related businesses as with database locking. The pseudocode looks like this:

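// Distributed lock key
String distributedKey = String.format(DISTRIBUTED_KEY_PREFIX, packageNo);
try {
    // The distributed lock blocks concurrent modifications of the same package number
    lock(distributedKey);
    // Run the business logic
    handler();
} finally {
    // Release the lock when done
    redissonDistributedLocker.unlock(distributedKey);
}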

Note that the lock() method must ensure that a locking failure or other abnormal condition does not prevent the business logic from executing, and it must set both the lock holding time and the maximum time to wait for the lock. In addition, the unlock call must be placed in a finally block to guarantee that the lock is released.
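The bounded wait and the finally-based release can be sketched locally. This is only an illustration under stated assumptions: a java.util.concurrent ReentrantLock per package number stands in for the Redis-backed lock, and it has no lock holding time (Redisson's RLock offers a tryLock overload that takes both a wait time and a lease time). The names PackageLockDemo, withPackageLock and report are hypothetical. Unlike the advice above, this sketch simply returns false when the lock cannot be obtained and leaves the fallback decision to the caller.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class PackageLockDemo {

    // One lock per package number: a local stand-in for a Redis-backed lock.
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    // In-memory stand-in for the task row: {id, exp_type, status}.
    private static final int[] ROW = {1, 0, 0};

    /** Runs handler under the package lock; returns false if the lock was not obtained in time. */
    static boolean withPackageLock(String packageNo, long waitMillis, Runnable handler) {
        ReentrantLock lock = LOCKS.computeIfAbsent(packageNo, key -> new ReentrantLock());
        boolean acquired = false;
        try {
            // Bounded wait instead of blocking forever.
            acquired = lock.tryLock(waitMillis, TimeUnit.MILLISECONDS);
            if (!acquired) {
                return false; // the caller decides whether to retry, reject, or proceed
            }
            handler.run();
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (acquired) {
                lock.unlock(); // release only a lock this thread actually holds
            }
        }
    }

    /** Whole-row read-modify-write, as the reporting interface does it. */
    static void report(int column, int value) {
        withPackageLock("PKG-1", 1000, () -> {
            int[] copy = ROW.clone();                        // read
            copy[column] = value;                            // modify
            System.arraycopy(copy, 0, ROW, 0, copy.length);  // write every column back
        });
    }

    /** Fires both reports concurrently and returns the final row. */
    static int[] run() {
        ROW[1] = 0;
        ROW[2] = 0;
        Thread exception = new Thread(() -> report(1, 9));
        Thread completion = new Thread(() -> report(2, 1));
        exception.start();
        completion.start();
        try {
            exception.join();
            completion.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ROW.clone();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run()));
    }
}
```

Whichever thread wins the lock, the other one reads the row only after the first write has finished, so both fields survive and the row ends as (1, 9, 1).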

2.4 CAS

CAS (compare-and-set) is an optimistic solution. It typically adds a timestamp column to the table that records when the row was last changed. When a transaction writes, it compares the timestamp it saw when reading the row with the one currently stored in the database; if they match, no other transaction has modified the row in the meantime. The update is allowed only when nothing has changed; otherwise the transaction must be retried. Sample SQL is as follows:

update task
set exp_type = #{expType}, status = #{status}, ts = #{currentTs}
where id = #{id} and ts = #{readTs}

The principle is not hard to understand, but the implementation can be tricky, because you have to decide how to retry after a failed update; the retry strategy and the number of retries depend on the business.
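One possible shape for the retry is a bounded loop that re-reads the row and re-applies the change whenever the conditional update affects zero rows. The sketch below is an in-memory illustration, not the team's implementation: a version counter stands in for the ts column, and OptimisticRetryDemo, updateIfUnchanged and updateWithRetry are hypothetical names.

```java
import java.util.function.UnaryOperator;

final class OptimisticRetryDemo {

    /** Immutable snapshot of the row; version stands in for the ts column. */
    static final class Task {
        final int expType;
        final int status;
        final long version;

        Task(int expType, int status, long version) {
            this.expType = expType;
            this.status = status;
            this.version = version;
        }
    }

    // In-memory stand-in for the row with id = 1.
    private static Task row = new Task(0, 0, 0);

    static synchronized Task read() {
        return row;
    }

    static synchronized void reset() {
        row = new Task(0, 0, 0);
    }

    /** Mimics "update ... where id = ? and ts = ?" and returns the affected row count. */
    static synchronized int updateIfUnchanged(Task seen, int expType, int status) {
        if (row.version != seen.version) {
            return 0; // another request wrote in between
        }
        row = new Task(expType, status, seen.version + 1);
        return 1;
    }

    /** Re-reads and re-applies the change until the write lands or attempts run out. */
    static boolean updateWithRetry(UnaryOperator<Task> change, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Task seen = read();
            Task wanted = change.apply(seen);
            if (updateIfUnchanged(seen, wanted.expType, wanted.status) == 1) {
                return true;
            }
        }
        return false; // the caller must handle exhaustion, for example by failing the request
    }

    /** Replays the conflicting reports and summarizes what happened. */
    static String replay() {
        reset();
        Task stale = read(); // completion report reads version 0
        updateWithRetry(t -> new Task(9, t.status, t.version), 3); // exception report commits first
        int firstTry = updateIfUnchanged(stale, stale.expType, 1); // stale write: 0 rows affected
        boolean retried = updateWithRetry(t -> new Task(t.expType, 1, t.version), 3);
        Task end = read();
        return firstTry + " " + retried + " " + end.expType + " " + end.status;
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints "0 true 9 1"
    }
}
```

The stale write is rejected instead of silently overwriting exp_type, and the retry starts from the fresh row, so the final state keeps both changes. How many attempts to allow, and whether to back off between them, remains a business decision.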

On the shoulders of giants

  • "Designing Data-Intensive Applications", Chapter 7: Transactions

Author: JD Logistics Wang Yilong

Source: Reprinted from Yuanqishuo Tech by JD Cloud developer community, please indicate the source


Origin my.oschina.net/u/4090830/blog/10095858