A major accident caused by a Redis distributed lock, and how to avoid the pitfalls

Preface

 

Redis-based distributed locks are nothing new these days. This article analyzes a production accident caused by a Redis distributed lock in one of our actual projects, and walks through the fix.

 

Background: flash-sale (snap-up) ordering in our project is guarded by a distributed lock.

 

During one campaign, the operations team ran a flash sale for Feitian Moutai with a stock of 100 bottles, and it was oversold! You know how scarce Feitian Moutai is on this earth!!! The accident was classified as a P0 major incident... we could only accept it. The whole project team's performance bonus was docked~~

 

After the accident, the CTO called me out by name to take the lead in handling it. Okay, let's go~

 

The accident scene

 

After some investigation, I learned that this flash-sale interface had never caused a problem before. So why the oversell this time?

 

The reason is that previous flash-sale items were not scarce goods, while this event was for Feitian Moutai. Analysis of the event-tracking data showed that essentially every metric had doubled; you can imagine how hot the activity was! Without further ado, here is the core code, with confidential parts replaced by pseudo-code...

 

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;
    String key = "key:" + request.getSeckillId();
    try {
        Boolean lockFlag = redisTemplate.opsForValue().setIfAbsent(key, "val", 10, TimeUnit.SECONDS);
        if (Boolean.TRUE.equals(lockFlag)) {
            // call the user service over HTTP for user-related checks
            // validate the user against the activity

            // stock check
            Object stock = redisTemplate.opsForHash().get(key + ":info", "stock");
            assert stock != null;
            if (Integer.parseInt(stock.toString()) <= 0) {
                // business exception
            } else {
                redisTemplate.opsForHash().increment(key + ":info", "stock", -1);
                // create the order
                // publish an order-created event
                // build the response VO
            }
        }
    } finally {
        // release the lock (deletes by key only, no matter who currently holds it)
        stringRedisTemplate.delete(key);
        // build the response VO
    }
    return response;
}

 

The code above relies on the lock's 10-second expiry to give the business logic enough time to run; the try-finally block ensures the lock is always released promptly; and the stock is also checked inside the business code.

 

Looks safe, right? Hold that thought, let's keep analyzing...

 

Cause of the accident

 

The Feitian Moutai flash sale attracted a large number of new users to download and register our app, among them plenty of scalpers using professional tooling to register accounts and place bot orders. Of course, our user system had prepared in advance: Alibaba Cloud human-machine verification, three-factor authentication, our in-house risk-control system, and a whole arsenal of defenses blocked a large number of illegitimate users. Credit where credit is due~

 

But precisely because of this, the user service was running under a heavier load than usual.

 

At the moment the flash sale started, a flood of user-verification requests hit the user service, and the user-service gateway briefly lagged: some requests took more than 10 seconds to respond. Since the HTTP request timeout was set to 30 seconds, the flash-sale interface simply blocked on user verification. After 10 seconds, the distributed lock expired, so a new request could acquire the lock; in other words, the lock was overwritten.

 

When those blocked requests finally finished, their unlock logic ran and deleted locks belonging to other threads, letting yet more new requests compete for the lock. A truly vicious cycle.

 

At that point the only safeguard left was the stock check, but the stock check was not atomic: it used a get-and-compare pattern, and so the oversell tragedy unfolded~~~
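To make the hazard concrete, here is a minimal, deliberately sequential sketch (not the project's code, and with interleaving spelled out by hand) of how two requests that both pass a get-and-compare check can drive stock negative:

```java
public class GetCompareRace {
    public static void main(String[] args) {
        int stock = 1; // one bottle left

        // both requests "get" the stock before either writes it back
        int readByA = stock;
        int readByB = stock;

        // both compare against their stale read, so both pass the check
        if (readByA > 0) stock = stock - 1; // A deducts: stock = 0
        if (readByB > 0) stock = stock - 1; // B deducts: stock = -1, oversold

        System.out.println(stock); // prints -1
    }
}
```

With real threads the interleaving is nondeterministic, but the moment two reads land before either write, the outcome is exactly this.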

 

Accident analysis

 

After careful analysis, this flash-sale interface turns out to have serious safety hazards under high concurrency, concentrated in three places:

 

No fault tolerance for failures in other systems

 

Because the user service was strained, the gateway responses were delayed, and our code had no way to handle that. This was the fuse that lit the oversell.

 

The seemingly secure distributed lock is not safe at all

 

Although we use set key value [EX seconds] [PX milliseconds] [NX|XX], if thread A takes longer than the expiry to finish, the lock expires before release and thread B can acquire it.

 

When thread A then finishes, its unlock actually releases thread B's lock.

 

Thread C can now acquire the lock in turn; and when thread B finishes and unlocks, it releases the lock set by thread C. This is the direct cause of the oversell.
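The chain above can be simulated without Redis at all. A plain map stands in for the key space here (the key and token names are purely illustrative), and Map.remove(key, value) plays the role of a value-checked delete:

```java
import java.util.HashMap;
import java.util.Map;

public class UnsafeUnlockDemo {
    public static void main(String[] args) {
        // a plain map stands in for the Redis key space
        Map<String, String> fakeRedis = new HashMap<>();

        // thread A's lock has expired and thread B now holds the key
        fakeRedis.put("lock:seckill", "token-of-B");

        // A finally finishes and unlocks by key alone, value unchecked:
        fakeRedis.remove("lock:seckill");
        System.out.println(fakeRedis.containsKey("lock:seckill")); // false: B's lock is gone

        // a value-checked unlock refuses to delete someone else's lock:
        fakeRedis.put("lock:seckill", "token-of-C"); // C acquires next
        boolean removed = fakeRedis.remove("lock:seckill", "token-of-B"); // B unlocks late
        System.out.println(removed); // false: C's lock survives
    }
}
```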

 

Non-atomic stock check

 

The non-atomic stock check produces inaccurate results under concurrency. This is the root cause of the oversell.

 

From the analysis above, the root of the problem is that the stock check relied heavily on the distributed lock: as long as set and del behave normally, the stock check works fine, but the moment the distributed lock becomes unreliable, the stock check is useless.

 

Solution

 

After knowing the reason, we can prescribe the right medicine.

 

Implement a relatively safe distributed lock

 

Definition of "relatively safe": set and del map one-to-one, so one holder can never del a lock set by another. Realistically, even with a perfect one-to-one set/del mapping, absolute business safety still cannot be guaranteed.

 

Because the lock's expiry is always bounded; you could avoid that only by setting no expiry, or a very long one, which brings problems of its own. So chasing absolute safety is pointless.

 

To make the lock relatively safe, we must rely on the value of the key: on release, the uniqueness of the value guarantees we never delete someone else's lock. We implement an atomic get-and-compare with a Lua script, as follows:

 

public void safedUnLock(String key, String val) {
    // note: 'in' is a reserved word in Lua, so the script variable is named 'expected'
    String luaScript = "local expected = ARGV[1] " +
            "local curr = redis.call('get', KEYS[1]) " +
            "if expected == curr then redis.call('del', KEYS[1]) end " +
            "return 'OK'";
    RedisScript<String> redisScript = RedisScript.of(luaScript, String.class);
    redisTemplate.execute(redisScript, Collections.singletonList(key), val);
}

 

We use a Lua script to unlock safely.

 

Implement a safe stock check

 

With a deeper understanding of concurrency, you notice that operations like get-and-compare or read-and-save are never atomic. To make them atomic, we could likewise use a Lua script.

 

But in our case, since each order in this flash sale deducts exactly one bottle, we can rely on the atomicity of a single Redis command instead of a Lua script. The reason:

 

// Redis returns the value after the operation, and the whole operation is atomic
Long currStock = redisTemplate.opsForHash().increment("key", "stock", -1);

 

See? The stock check in the original code was completely "superfluous".

 

Improved code

 

After the analysis above, we decided to create a dedicated DistributedLocker class for handling distributed locks.
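The DistributedLocker class itself isn't shown in this article. As a sketch of the contract it must satisfy (the class and method names simply mirror the handler code, and expiry is omitted), here is a minimal in-memory analogue: a ConcurrentHashMap stands in for Redis, while a real implementation would use setIfAbsent with a TTL plus the Lua unlock. Note that remove(key, val) is the same atomic compare-then-delete the Lua script performs on Redis.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// illustrative in-memory analogue of the DistributedLocker contract
public class DistributedLocker {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    // putIfAbsent mirrors SET key val NX; the timeout is ignored in this sketch
    public boolean lock(String key, String val, long timeout, TimeUnit unit) {
        return store.putIfAbsent(key, val) == null;
    }

    // atomic compare-then-delete, like the Lua script: removes only the caller's own lock
    public boolean safedUnLock(String key, String val) {
        return store.remove(key, val);
    }
}
```

With this contract, a second lock on the same key fails, and unlocking with the wrong value refuses to delete another holder's lock.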

 

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;
    String key = "key:" + request.getSeckillId();
    String val = UUID.randomUUID().toString();
    try {
        boolean lockFlag = distributedLocker.lock(key, val, 10, TimeUnit.SECONDS);
        if (!lockFlag) {
            // business exception
        }

        // validate the user against the activity
        // stock check, guaranteed by Redis's own atomicity
        Long currStock = stringRedisTemplate.opsForHash().increment(key + ":info", "stock", -1);
        if (currStock < 0) { // stock is already exhausted
            // business exception
            log.error("[seckill order] out of stock");
        } else {
            // create the order
            // publish an order-created event
            // build the response
        }
    } finally {
        // safe unlock: deletes only when val matches, so we never release someone else's lock
        distributedLocker.safedUnLock(key, val);
        // build the response
    }
    return response;
}

 

 

Further thoughts

 

Is the distributed lock necessary?

 

After the improvement, you may notice that the atomic decrement in Redis alone is already enough to guarantee no oversell. Correct. But without the lock, every request would run through the full business flow, and since that flow depends on other systems, the pressure on those systems would spike.

 

That would add performance overhead and service instability, costing more than it saves. The distributed lock intercepts a share of the traffic up front.

 

Selection of distributed locks

 

Someone proposed using RedLock for the distributed lock. RedLock is more reliable, but at the cost of some performance. In this scenario, that gain in reliability is far outweighed by what the extra performance buys us; for scenarios with extremely high reliability requirements, RedLock is the right tool.

 

Rethinking the distributed lock: do we even need it?

 

Since the bug had to be fixed and shipped urgently, we made this optimization, ran a stress test in the test environment, and deployed to production immediately. The optimization proved successful: performance improved slightly, and even when the distributed lock failed, no oversell occurred.

 

But is there still room for optimization? Yes!

 

Since the service is deployed as a cluster, we can split the stock evenly across the servers and notify each server via broadcast. The gateway layer routes each request to a server by hashing the user ID, so stock deduction and checking can be done entirely against in-process caches. Performance improves yet again!

 

// initialized ahead of time via messages; ConcurrentHashMap gives efficient thread safety
private static final ConcurrentHashMap<Long, Boolean> SECKILL_FLAG_MAP = new ConcurrentHashMap<>();
// also populated ahead of time via messages; AtomicInteger is itself atomic, so a plain HashMap suffices here
private static final Map<Long, AtomicInteger> SECKILL_STOCK_MAP = new HashMap<>();

...

public SeckillActivityRequestVO seckillHandle(SeckillActivityRequestVO request) {
    SeckillActivityRequestVO response = null;

    Long seckillId = request.getSeckillId();
    if (!SECKILL_FLAG_MAP.getOrDefault(seckillId, false)) {
        // business exception
    }
    // validate the user against the activity
    // stock check
    if (SECKILL_STOCK_MAP.get(seckillId).decrementAndGet() < 0) {
        SECKILL_FLAG_MAP.put(seckillId, false);
        // business exception
    }
    // create the order
    // publish an order-created event
    // build the response
    return response;
}

 

After this transformation we no longer rely on Redis at all, and both performance and safety improve further!
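To see why a plain HashMap of AtomicIntegers is enough, here is a small standalone check with hypothetical numbers: with 3 bottles of stock and 10 concurrent buyers, decrementAndGet lets exactly 3 succeed no matter how the threads interleave.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicStockDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger stock = new AtomicInteger(3);   // 3 bottles
        AtomicInteger orders = new AtomicInteger(0);  // successful deductions

        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i < 10; i++) {                // 10 concurrent buyers
            pool.submit(() -> {
                if (stock.decrementAndGet() >= 0) {   // atomic check-and-deduct
                    orders.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);

        System.out.println(orders.get()); // always 3: never oversold
    }
}
```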

 

Of course, this scheme does not account for complications like dynamic scaling of the cluster. If those must be considered, the distributed-lock approach remains the better choice.

 

Summary

 

 

Overselling a scarce commodity is without question a major incident, and if the oversold quantity is large, it can even have serious operational and social consequences for the platform. After this accident I realized that no line of code in a project should be taken lightly; otherwise, in certain scenarios, code that normally works fine becomes a deadly killer!

 

For a developer, a design must be thought through thoroughly when planning a solution. How can it be thought through thoroughly? Only through continuous learning!

Origin blog.csdn.net/jcmj123456/article/details/108438506