Reasons why Redlock can be unsafe

This post first briefly introduces how to implement a distributed lock on a single Redis node and the problems that can arise, then introduces Redlock, and finally explains the problems of the Redlock algorithm through the debate between distributed systems expert Martin Kleppmann and Redis author antirez.

There are many ways to implement distributed locks today, for example on top of a database, ZooKeeper, or Redis. This post focuses on the Redis-based implementation.

Single-node deployment of distributed locks

In essence, this is similar to lock-based concurrency control on a single machine, except that the lock variable has to be maintained by a shared storage system.

There are two main ways:

  • Use the SETNX and DEL commands together to acquire and release the lock

    When SETNX is executed, it checks whether the key exists. If the key does not exist, SETNX sets its value; if it already exists, SETNX does nothing.

    // Acquire the lock
    SETNX lock_key 1
    // Business logic
    DO THINGS
    // Release the lock
    DEL lock_key
    

    This approach has two problems:

    • If no expiration time is set, a client may hold the lock and never release it (for example, when an exception occurs while executing the business logic).

      Solution: set an expiration time.

    • If an expiration time is set, the lock held by one client may be deleted by another by mistake: client 1's business logic runs past the expiration time, client 2 acquires the lock, and client 1 then deletes the lock that now belongs to client 2.

      Solution: store a unique identifier for the client as the lock value and check it before deleting (the implementation is shown below).

  • Use the SET command with the NX option in place of SETNX. SET also accepts the EX or PX option, which sets the expiration time of the key.

    With the following command, SET creates the key and assigns its value only if the key does not already exist. The key's time to live is given by the EX (seconds) or PX (milliseconds) option.

    SET key value [EX seconds | PX milliseconds]  [NX]
        
    // Acquire the lock; unique_value uniquely identifies the client
    SET lock_key unique_value NX PX 10000
    

    Releasing the lock means reading the value, comparing it, and deleting the key as a single atomic step, so a Lua script is needed:

    -- Release the lock: compare unique_value first to avoid releasing another client's lock
    if redis.call("get",KEYS[1]) == ARGV[1] then
        return redis.call("del",KEYS[1])
    else
        return 0
    end
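
The acquire and release logic above can be sketched in Python. This is a minimal illustration, not a production client: `FakeRedis` is a hypothetical in-memory stand-in for a single Redis instance with a hand-advanced clock, `set_nx_px` models `SET key value NX PX ttl`, and `compare_and_delete` models the Lua release script. All of these names are assumptions made for the example.

```python
import uuid


class FakeRedis:
    """Hypothetical in-memory stand-in for one Redis instance (illustration only)."""

    def __init__(self):
        self.now_ms = 0   # simulated clock, advanced by hand
        self.data = {}    # key -> (value, expires_at_ms)

    def _live(self, key):
        # Return the stored entry, dropping it first if it has expired.
        entry = self.data.get(key)
        if entry is not None and entry[1] <= self.now_ms:
            del self.data[key]
            entry = None
        return entry

    def set_nx_px(self, key, value, ttl_ms):
        # Models: SET key value NX PX ttl_ms
        if self._live(key) is not None:
            return False
        self.data[key] = (value, self.now_ms + ttl_ms)
        return True

    def compare_and_delete(self, key, expected):
        # Models the Lua release script: delete only if the value matches.
        entry = self._live(key)
        if entry is not None and entry[0] == expected:
            del self.data[key]
            return 1
        return 0


def acquire(store, key, ttl_ms):
    """Return a unique token on success, or None if the lock is taken."""
    token = uuid.uuid4().hex
    return token if store.set_nx_px(key, token, ttl_ms) else None


def release(store, key, token):
    """Release the lock only if this token still owns it."""
    return store.compare_and_delete(key, token) == 1


store = FakeRedis()
token1 = acquire(store, "lock_key", 10_000)          # client 1 locks
store.now_ms += 10_001                               # client 1 overruns the expiration time
token2 = acquire(store, "lock_key", 10_000)          # client 2 locks
stale_release = release(store, "lock_key", token1)   # client 1's late release is refused
owner_release = release(store, "lock_key", token2)   # client 2 releases its own lock
```

Because the release compares the stored value with the caller's token, client 1's late release leaves client 2's lock untouched, which is exactly the accidental deletion that a plain DEL cannot prevent.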
    

Problems that may arise from a single node

If only one Redis instance stores the lock variable and that instance fails, the lock variable is lost. Clients can then no longer perform lock operations, which disrupts normal business processing. So when implementing a distributed lock, we also need to ensure the reliability of the lock itself.

To avoid the lock becoming unavailable when a Redis instance fails, antirez, the developer of Redis, proposed the distributed lock algorithm Redlock.

Redlock

The basic idea of the Redlock algorithm is to have the client request the lock from multiple independent Redis instances in turn. If the client completes the locking operation on more than half of the instances, it is considered to have acquired the distributed lock; otherwise the lock attempt fails. This way, even if a single Redis instance fails, the lock variable is still stored on the other instances, so clients can continue to perform lock operations and the lock variable is not lost.

Redisson implements Redis-based distributed locks and supports both single-node and cluster deployments. It also ships a Redlock implementation (RedissonRedLock) that combines locks held on several independent Redis instances.

The implementation of the Redlock algorithm requires N independent Redis instances, and its process can be divided into the following three steps:

  1. The client gets the current time.

  2. The client performs locking operations on N Redis instances in sequence.

    Here, the locking operation on each of the N instances is the same as the second method for single-node locks described above. In addition, each lock request must be given a timeout (shorter than the lock's validity time), so that if an instance cannot be locked in time, the client moves on to the next Redis instance.

  3. Once the client has attempted the locking operation on all Redis instances, it calculates the total time spent on the whole locking process to decide whether the lock was acquired and to update the lock's validity time.

    The client can consider the lock acquired only when both of the following conditions are met:

    • Condition 1: The client has successfully acquired the lock on more than half of the Redis instances (at least N/2+1, with N/2 rounded down);
    • Condition 2: The total time spent acquiring the lock does not exceed the lock's validity time.

If both conditions are met, the lock's validity time is reset to (the initial validity time of the lock - the time spent acquiring it), and the lock is acquired.

If they are not met, the client sends a release request to all Redis instances. The request must go to all of them because, for reasons such as network problems, an instance may have been locked successfully without the response ever reaching the client, and that lock has to be released as well.
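
The three steps can be sketched in Python as follows. This is a simplified simulation under stated assumptions rather than a real Redlock client: `Instance` and `SimClock` are hypothetical in-memory stand-ins, and `request_cost_ms` is an invented parameter that models the time each request (or timeout) takes.

```python
import uuid


class SimClock:
    """Hypothetical simulated millisecond clock (illustration only)."""

    def __init__(self):
        self.ms = 0


class Instance:
    """Hypothetical in-memory stand-in for one independent Redis instance."""

    def __init__(self, up=True):
        self.up = up
        self.data = {}   # key -> (value, expires_at_ms)

    def set_nx_px(self, key, value, ttl_ms, now_ms):
        if not self.up:
            raise ConnectionError("instance unreachable")
        entry = self.data.get(key)
        if entry is not None and entry[1] > now_ms:
            return False                      # key exists and has not expired
        self.data[key] = (value, now_ms + ttl_ms)
        return True

    def compare_and_delete(self, key, expected):
        if not self.up:
            raise ConnectionError("instance unreachable")
        entry = self.data.get(key)
        if entry is not None and entry[0] == expected:
            del self.data[key]


def redlock_acquire(instances, key, ttl_ms, clock, request_cost_ms=5):
    """Return (token, validity_ms) on success, or (None, 0) on failure."""
    token = uuid.uuid4().hex
    start = clock.ms                          # step 1: record the start time
    locked = 0
    for inst in instances:                    # step 2: lock each instance in turn
        clock.ms += request_cost_ms           # simulated round trip or timeout
        try:
            if inst.set_nx_px(key, token, ttl_ms, clock.ms):
                locked += 1
        except ConnectionError:
            pass                              # move on to the next instance
    elapsed = clock.ms - start                # step 3: total time spent
    validity = ttl_ms - elapsed
    quorum = len(instances) // 2 + 1
    if locked >= quorum and validity > 0:     # condition 1 and condition 2
        return token, validity
    for inst in instances:                    # failure: release on every instance
        try:
            inst.compare_and_delete(key, token)
        except ConnectionError:
            pass
    return None, 0


# Three of five instances reachable: a majority, so the lock is acquired.
nodes = [Instance(), Instance(), Instance(), Instance(up=False), Instance(up=False)]
token, validity = redlock_acquire(nodes, "lock_key", 10_000, SimClock())

# Two of five reachable: no majority, so the partial locks are released.
minority = [Instance(), Instance(), Instance(up=False), Instance(up=False), Instance(up=False)]
failed_token, _ = redlock_acquire(minority, "lock_key", 10_000, SimClock())
leftover = sum(1 for n in minority if n.data)
```

In the first run the five simulated requests take 25 ms in total, so the remaining validity time is 10000 - 25 = 9975 ms. In the second run no instance is left holding the key.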

That is the general process of Redlock. However, this distributed algorithm is not perfect, and some problems remain unresolved.

Below is an overview of some of the issues raised in the two articles "Is Redis-based distributed lock safe? (Part 1)" and "Is Redis-based distributed lock safe? (Part 2)". These two articles describe a debate on the safety of Redlock between distributed systems expert Martin and the author of Redis.

The debate between Martin and antirez

Question 1 raised by Martin: a node crash and restart causes Redlock to fail

Suppose there are 5 Redis nodes: A, B, C, D, E. Imagine the following sequence of events taking place:

  1. Client 1 successfully locks A, B, and C, and so acquires the lock (D and E are not locked).
  2. Node C crashes and restarts, but the lock that client 1 set on C was not persisted (neither AOF nor RDB can guarantee that every command is persisted), so it is lost.
  3. After node C restarts, client 2 locks C, D, and E, and so acquires the lock as well.
  4. Client 1 and client 2 now both believe they hold the lock.

Antirez's rebuttal:

Antirez proposes the concept of a delayed restart.

That is, after a node crashes, it should not be restarted immediately. Instead, wait for a period longer than the lock validity time before restarting it. By then, every lock this node took part in before the crash has expired, so the restarted node cannot affect any existing lock.
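
The failure sequence and the effect of a delayed restart can be reproduced with a toy model. The sketch below is only an illustration: `Cluster` and its methods are hypothetical names, lock expiry is left out, and each node simply remembers the current lock holder.

```python
def majority(total):
    return total // 2 + 1


class Cluster:
    """Hypothetical toy model: each node stores the current lock holder or None."""

    def __init__(self, names):
        self.holder = {name: None for name in names}
        self.down = set()

    def try_lock(self, client, targets):
        # Lock every reachable, free node among the targets.
        won = [n for n in targets if n not in self.down and self.holder[n] is None]
        for n in won:
            self.holder[n] = client
        if len(won) >= majority(len(self.holder)):
            return True
        for n in won:                 # no majority: roll back
            self.holder[n] = None
        return False

    def crash(self, node):
        # The unpersisted lock is lost and the node becomes unreachable.
        self.holder[node] = None
        self.down.add(node)

    def restart(self, node):
        self.down.discard(node)


# Immediate restart: the failure sequence described above.
c = Cluster("ABCDE")
first = c.try_lock("client1", ["A", "B", "C"])
c.crash("C")
c.restart("C")
second = c.try_lock("client2", ["C", "D", "E"])
both_hold = first and second

# Delayed restart: C stays down while client 1's lock is still valid.
d = Cluster("ABCDE")
d.try_lock("client1", ["A", "B", "C"])
d.crash("C")
blocked = d.try_lock("client2", ["C", "D", "E"])   # only D and E are reachable
```

With an immediate restart both clients end up holding the lock. With C kept down, client 2 can reach only two nodes and is refused until client 1's lock has expired on its own.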

Question 2 raised by Martin: Redlock relies too heavily on system clocks. If a system clock jumps, Redlock fails

  1. Client 1 successfully acquires the lock on Redis nodes A, B, and C (a majority). Communication with D and E fails due to network problems.
  2. The clock on node C jumps forward, causing the lock held on it to expire early.
  3. Client 2 successfully acquires the lock for the same resource on Redis nodes C, D, and E (a majority).
  4. Client 1 and client 2 now both believe they hold the lock.
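
A small simulation makes the role of the node-local clock visible. This is a sketch under assumptions: `Node` and `acquire` are hypothetical names, and each node expires the lock purely according to its own `clock_ms`.

```python
class Node:
    """Hypothetical toy Redis node that expires keys against its own local clock."""

    def __init__(self):
        self.clock_ms = 0
        self.lock = None   # (holder, expires_at_ms) or None

    def try_lock(self, client, ttl_ms):
        if self.lock is not None and self.lock[1] > self.clock_ms:
            return False             # still held according to this node's clock
        self.lock = (client, self.clock_ms + ttl_ms)
        return True


def acquire(nodes, client, targets, ttl_ms):
    """Lock the target nodes; succeed only with a majority of all nodes."""
    won = [name for name in targets if nodes[name].try_lock(client, ttl_ms)]
    if len(won) >= len(nodes) // 2 + 1:
        return True
    for name in won:                 # no majority: roll back
        nodes[name].lock = None
    return False


TTL = 10_000

nodes = {name: Node() for name in "ABCDE"}
client1_ok = acquire(nodes, "client1", ["A", "B", "C"], TTL)   # D and E unreachable
nodes["C"].clock_ms += 60_000                                   # C's clock jumps forward
client2_ok = acquire(nodes, "client2", ["C", "D", "E"], TTL)

# Control run without the jump: client 2 wins only D and E and is refused.
control = {name: Node() for name in "ABCDE"}
acquire(control, "client1", ["A", "B", "C"], TTL)
control_ok = acquire(control, "client2", ["C", "D", "E"], TTL)
```

No real time passes between the two acquisitions, yet after the jump node C treats client 1's lock as expired and hands it to client 2.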

When discussing clock jumps, Martin gave two concrete causes:

  • A system administrator manually changes the clock.
  • A large clock update is received from the NTP service.

Antirez's rebuttal:

  • Human causes such as manually changing the clock can be avoided by simply not doing it.
  • Use an ntpd that does not "jump" the system clock (which mostly comes down to proper configuration), so that clock corrections are applied as many small adjustments.

Judging from this answer, antirez broadly agrees that a large system clock jump causes Redlock to fail. Where he differs from Martin is that he believes large clock jumps can be avoided in real systems. Of course, that depends on the infrastructure and how it is operated.

Question 3 raised by Martin: Redlock fails because of a client GC pause or network delay

Martin's blog argues that a distributed lock with automatic expiration must provide some kind of fencing mechanism to guarantee real mutual exclusion on the shared resource. Redlock provides no such mechanism.

Fencing mechanism: when a client acquires a lock from the lock server, the server assigns it a serial number that increases with every grant. When the client accesses the resource server, it presents this number, and the resource server accepts the request only if the number is greater than the largest serial number it has seen so far. This keeps access to the shared resource correct even when a client holds an expired lock.

Martin illustrates this with a timing diagram, in which the fencing token is the assigned serial number.

[Figure: Timing with fencing token]
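
The fencing check described above can be sketched in a few lines of Python. `LockService` and `ResourceServer` are hypothetical classes made up for this illustration, and following the description above, the resource server accepts a request only when its token is strictly greater than the largest one seen, so each token is good for one write in this simplified model.

```python
class LockService:
    """Hypothetical lock server that hands out an increasing serial number."""

    def __init__(self):
        self._next = 0

    def grant(self):
        self._next += 1
        return self._next


class ResourceServer:
    """Accepts a write only if its token is greater than any token seen so far."""

    def __init__(self):
        self.max_token = 0
        self.value = None

    def write(self, token, value):
        if token <= self.max_token:
            return False             # stale token: reject the request
        self.max_token = token
        self.value = value
        return True


locks = LockService()
storage = ResourceServer()

t1 = locks.grant()                         # client 1 gets a token, then stalls in a GC pause
t2 = locks.grant()                         # the lock expires; client 2 gets a newer token
w2 = storage.write(t2, "from client 2")    # accepted
w1 = storage.write(t1, "from client 1")    # client 1 wakes up; its stale token is rejected
```

The protection lives in the resource server rather than in the lock: client 1 still believes it holds the lock, but its write can no longer overwrite client 2's.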

The sequence of events Martin assumes is as follows:

  1. Client 1 sends a lock request to Redis nodes A, B, C, D, and E.
  2. Each Redis node returns its result to client 1, but before receiving the results client 1 enters a long GC pause, or the responses suffer a large network delay.
  3. The lock expires on all Redis nodes.
  4. Client 2 acquires the lock on A, B, C, D, and E.
  5. Client 1 recovers from the GC pause and receives the results that the Redis nodes sent in step 2. Client 1 believes it has acquired the lock.
  6. Client 1 and client 2 now both believe they hold the lock.

In Martin's example, the GC pause or network delay happens before the client receives the request results, that is, before the lock is actually acquired. Looking back at the Redlock workflow above, this case violates the condition that the total time spent acquiring the lock must not exceed the lock's validity time. Redlock therefore does not treat the lock as acquired, so the algorithm does not fail here.

However, consider the case where client 1's GC pause occurs after it has successfully acquired the lock, or where the business logic simply runs too long: the lock expires while client 1 still believes it holds it. If client 2 then acquires the lock, both clients operate on the shared resource at the same time, which is bound to cause problems.

Antirez's rebuttal:

First, on the fencing mechanism: antirez questions Martin's line of argument. If a fencing mechanism already exists that keeps access to the resource mutually exclusive even when the lock fails, why use a distributed lock at all, and why demand such strong safety guarantees from it? In addition, antirez points out that Redlock provides a unique_value that uniquely identifies the client, which can be used in a CAS-like way to access the shared resource atomically. (In my view, antirez does not give grounds for Redlock's correctness here; this only addresses the concurrency problems that arise while competing for the lock.)

Finally, antirez and Martin did agree on one point: Redlock handles message delay between the client and the lock server (that is, network delay before the lock is acquired, as described above). But for delay between the client and the resource server (that is, during access to the shared resource), antirez admits that no distributed lock implementation, Redlock included, has a good way to deal with it.

References:

Is Redis-based distributed lock safe? (Part 1)

Is Redis-based distributed lock safe? (Part 2)

Redis core technology and practice

Origin blog.csdn.net/qq_53578500/article/details/126691114