Implementing distributed shared locks with Redis

This article is very well written, so it is reproduced here for reference, with the original address attached.
Original address: http://blog.csdn.net/hupoling/article/details/53411190

Background
In many Internet products there are scenarios that need locking, such as flash sales, globally incremented IDs, and floor-number generation, and most existing solutions are built on a database. Redis runs as a single process with a single thread: its queueing model turns concurrent access into serial access, so there is no contention among the clients connected to it. In addition, Redis provides commands such as SETNX and GETSET that make it convenient to implement a distributed lock mechanism.

Introduction to the Redis commands

To implement a distributed lock with Redis, the following commands need to be introduced.

SETNX command (SET if Not eXists)
Syntax:
SETNX key value
Function:
If and only if the key does not exist, set the value of the key to value and return 1; if the key already exists, SETNX does nothing and returns 0.

GETSET Command
Syntax:
GETSET key value
Function:
Set the value of the given key to value and return the key's old value. If the key exists but does not hold a string, an error is returned; if the key does not exist, nil is returned.

GET Command
Syntax:
GET key
Function:
Returns the string value associated with the key, or returns the special value nil if the key does not exist.

DEL command
Syntax:
DEL key [key ...]
Function:
Delete one or more given keys; keys that do not exist are ignored.

Quality over quantity: these four commands are all we rely on for the distributed lock. In the concrete implementation, however, there are many details that must be handled carefully, because with distributed, concurrent, multi-process access, any mistake can cause a deadlock and hang every process.

Lock implementation

SETNX can be used to lock directly. For example, to lock the keyword foo, a client can try
SETNX foo.lock <current Unix timestamp>

If it returns 1, the client has acquired the lock and can proceed with its operation; when the operation is complete, it issues
DEL foo.lock

to release the lock.
If it returns 0, foo has already been locked by another client. For a non-blocking lock, the client can simply report failure to the caller; for a blocking lock, it enters a retry loop until it acquires the lock or the retries time out. The ideal is beautiful, the reality is cruel: locking with SETNX alone is subject to race conditions and can deadlock in certain cases.
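
As a minimal sketch of this naive approach, assuming the redis-py client and a local Redis instance (the client library, connection parameters, and function names are illustrative; only the SETNX/DEL pattern and the foo.lock key come from the article):

import time
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local Redis instance

def naive_acquire():
    # SETNX returns True only if foo.lock did not exist yet;
    # the current Unix timestamp is stored as the value, as described below.
    return r.setnx("foo.lock", int(time.time()))

def naive_release():
    # DEL removes the lock unconditionally -- this is the weakness discussed next.
    r.delete("foo.lock")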

Handling deadlocks

With the approach above, if the client holding the lock runs for too long, is killed, or crashes for some other reason without releasing the lock, the result is a deadlock. The lock therefore needs a validity check: when locking, we store the current timestamp as the lock's value, and later compare the current time with the timestamp stored in Redis; if the difference exceeds a threshold, the lock is considered expired, so it can never be held forever. Under heavy concurrency, however, if several clients detect the expired lock at the same time and each simply DELetes it and re-locks with SETNX, a race condition arises and several clients may acquire the lock at once.
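
A sketch of this flawed approach, shown only to make the race concrete (names and the expiry threshold are assumed):

import time
import redis

r = redis.Redis()
LOCK_KEY = "foo.lock"
LOCK_EXPIRE = 10  # assumed expiry threshold, in seconds

def flawed_acquire():
    while True:
        now = int(time.time())
        if r.setnx(LOCK_KEY, now):
            return True
        t1 = r.get(LOCK_KEY)
        if t1 is not None and now - int(t1) > LOCK_EXPIRE:
            # Flaw: DEL followed by SETNX is not atomic, so two clients that both
            # detect the expiry can both end up "acquiring" the lock, as shown below.
            r.delete(LOCK_KEY)
            if r.setnx(LOCK_KEY, now):
                return True
        time.sleep(0.01)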

C1 acquires the lock and crashes. C2 and C3 call SETNX to lock, both get 0 back, read the timestamp stored in foo.lock, and by comparing timestamps find that the lock has timed out.
C2 sends a DEL command to foo.lock.
C2 sends SETNX to foo.lock and acquires the lock.
C3 sends a DEL command to foo.lock; this DEL actually deletes the lock C2 just acquired.
C3 sends SETNX to foo.lock and acquires the lock.

At this point both C2 and C3 hold the lock, a race condition; with higher concurrency even more clients could acquire it. So when the lock has timed out we must not simply DEL it. Fortunately we have GETSET. Suppose there is now another client C4; let us see how GETSET avoids this situation.

C1 acquires the lock and crashes. C2 and C3 call SETNX to lock, both get 0 back, then call GET to read the timestamp T1 of foo.lock, and by comparing timestamps find that the lock has timed out.
C4 sends a GETSET command to foo.lock:
GETSET foo.lock <current Unix timestamp>
and gets back the old timestamp T2 stored in foo.lock.

If T1 = T2, C4 was the one that wrote the new timestamp, so C4 has acquired the lock.
If T1 != T2, another client C5 called GETSET before C4 and got the lock instead; C4 did not acquire it and can only sleep and enter the next retry cycle.

The only remaining question is whether C4 overwriting the timestamp in foo.lock, even though it failed to acquire the lock, affects the lock. In practice the time difference between C4 and C5 is tiny, so the error this introduces into the timestamp stored in foo.lock is negligible and the lock's validity is effectively unaffected.
To make the lock more robust, the client that acquires it should, at the critical steps, call GET again and compare the current value with the timestamp T0 it originally wrote, so that it never DELetes a lock that has meanwhile been taken over by another client. The steps above are easy to find in other references, but client failure handling is far more complex than a simple crash: a client may be blocked for a long time by some operation and then attempt DEL while the lock is already in another client's hands; improper handling can still end in deadlock; and unreasonable sleep settings can overwhelm Redis under heavy concurrency. The most common problems are the following.
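
Putting the pieces so far together, here is a hedged sketch in Python with redis-py; the expiry threshold, key name, sleep interval, and function names are illustrative assumptions, the GET-then-DEL release mirrors the check described above but is not atomic, and the two nil corner cases discussed below are not yet handled:

import time
import redis

r = redis.Redis()
LOCK_KEY = "foo.lock"
LOCK_EXPIRE = 10  # seconds after which the lock is treated as expired (assumed)

def acquire_lock():
    while True:
        t0 = int(time.time())
        if r.setnx(LOCK_KEY, t0):
            return t0                      # lock acquired; remember our timestamp T0
        t1 = r.get(LOCK_KEY)               # timestamp written by the current holder
        if t1 is not None and t0 - int(t1) > LOCK_EXPIRE:
            t2 = r.getset(LOCK_KEY, t0)    # try to take over the expired lock
            if t2 == t1:
                return t0                  # nobody else got in between our GET and GETSET
        time.sleep(0.01)                   # back off before the next attempt

def release_lock(t0):
    current = r.get(LOCK_KEY)
    # Only delete the lock if it still holds the timestamp we wrote (T0),
    # so we never DEL a lock another client has since taken over.
    if current is not None and int(current) == t0:
        r.delete(LOCK_KEY)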

What kind of logic should be followed when GET returns nil?

The first approach: the timeout logic
Client C1 acquires the lock and, after finishing its processing, DELetes it. Just before that DEL, client C2 tries to write its timestamp T0 into foo.lock with SETNX, finds that another client still holds the lock, and moves on to the GET step.
C2 sends a GET command to foo.lock and gets the return value T1 (nil).
Comparing T0 > T1 + expire, C2 concludes the lock has expired and enters the GETSET branch.
C2 calls GETSET to write T0 into foo.lock and gets back the previous value of foo.lock, T2.
If T2 = T1, C2 considers the lock acquired; if T2 != T1, it considers the lock not acquired.

The second approach: the SETNX logic
Client C1 acquires the lock in a loop and, after finishing its processing, DELetes it. Just before that DEL, client C2 tries to write its timestamp T0 into foo.lock with SETNX, finds that another client still holds the lock, and moves on to the GET step.
C2 sends a GET command to foo.lock and gets the return value T1 (nil).
Seeing nil, C2 loops and goes back to the SETNX step.

Both versions look fine at first glance, but the first one is logically wrong. When GET returns nil, the lock has been deleted, not timed out, so the client should lock through the SETNX path again. In the first version, however, a freshly released lock sends the client into the GETSET path instead, and if the conditions there are not judged carefully the result can be a deadlock, which is very sad. I ran into exactly that; how, you will see in the next question. A small illustration of the correct branch follows.
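
As a small illustration of that decision, assuming the same conventions as the sketch above (the names and threshold are ad hoc): when GET returns nil the next step is SETNX again, and the GETSET takeover is only for a timestamp that exists and has expired.

LOCK_EXPIRE = 10  # assumed expiry threshold, in seconds

def next_step_after_get(t1, now):
    # Decide what to do after the GET, following the second (correct) logic.
    if t1 is None:
        return "setnx"    # the lock was deleted, not timed out: lock via SETNX again
    if now - int(t1) > LOCK_EXPIRE:
        return "getset"   # the lock exists but has expired: enter the GETSET takeover
    return "sleep"        # the lock is still valid: back off and retry later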

What should I do when GETSET returns nil?

Clients C1 and C2 both call GET; C1 gets back T1. At that moment C3, whose network conditions happen to be better, quickly acquires the lock, does its work, and DELetes the lock. C2's GET then returns T2 (nil), and both C1 and C2 enter the timeout-handling logic.
C1 sends a GETSET command to foo.lock and gets the return value T11 (nil).
C1 compares T1 and T11, finds they differ, and its logic concludes that it has not acquired the lock.
C2 sends a GETSET command to foo.lock and gets the return value T22 (the timestamp just written by C1).
C2 compares T2 and T22, finds they differ, and its logic concludes that it has not acquired the lock.

Now both C1 and C2 believe they failed to acquire the lock, yet C1 actually holds it: its logic never considers the case where GETSET returns nil and only compares the values returned by GET and GETSET. Why does this happen? First, with multiple clients, the commands one client issues are not contiguous on the server: two commands that look back-to-back from a single client may, by the time they reach Redis, have a large number of commands from other clients (DEL, SETNX, and so on) interleaved between them. Second, the clocks of the different clients are not synchronized, or at least not strictly synchronized.
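
The fix is to treat a nil reply from GETSET as success: the key did not exist, so the GETSET we just issued created it and the lock is ours. A hedged helper in the spirit of the sketches above (names assumed):

def getset_takeover_succeeded(t1, t2):
    # t1: value read by GET before the takeover; t2: value returned by GETSET.
    # A nil (None) t2 means the key did not exist, so our GETSET created it
    # and we own the lock; comparing only t1 == t2 misses exactly this case.
    return t2 is None or t2 == t1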

Timestamp problem

As we have seen, the value of foo.lock is a timestamp, so for the lock to work correctly across multiple clients, the clocks of all the servers must be synchronized. If server clocks drift apart, clients with inconsistent time will misjudge whether the lock has timed out, which again produces race conditions.
Whether the lock has timed out depends entirely on the timestamp, and the timestamp itself has limited precision. If the precision is one second, then a typical lock/execute/unlock cycle can easily complete within that second, and the cases described above become likely. It is therefore best to raise the precision to the millisecond level, which keeps millisecond-scale locks reasonably safe.
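
For example, a millisecond-precision Unix timestamp can be used as the lock value (a trivial sketch; the function name is ours):

import time

def now_ms():
    # Millisecond Unix timestamp, so that locking, working, and unlocking within
    # the same second no longer look simultaneous to the timeout comparison.
    return int(time.time() * 1000)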

Problems with distributed locks

1: A timeout mechanism is essential. Once the client holding the lock crashes, the lock must expire; otherwise no other client can ever acquire it, and the result is deadlock.
2: With a distributed lock, the timestamps of multiple clients cannot be kept strictly consistent, so under certain conditions the lock can end up held by the wrong client. The system needs to tolerate such low-probability events.
3: Lock only the critical steps. It is good practice to prepare the related resources first, for example connect to the database, then acquire the lock, do the work immediately, and release the lock, so that the lock is held for as short a time as possible.
4: Should the lock be re-checked while it is held? If you strictly depend on the lock's state, it is best to re-CHECK the lock at the critical steps. In our tests, however, every such CHECK took several milliseconds under heavy concurrency, while our entire lock-protected logic took less than 10 milliseconds, so we chose not to check the lock.
5: Sleeping is an art. To reduce the pressure on Redis, the loop that retries the lock must sleep between attempts, but how long to sleep has to be calculated from your own Redis QPS together with the lock-processing time.
6: As for why Redis's MULTI, EXPIRE, WATCH, and similar mechanisms were not used, the references explain the reasons.

Lock test data

Case one: without the sleep mechanism, i.e. no sleep between lock retries. Timings for a single request, covering lock, execute, and unlock (figure in the original post):

The locking and unlocking times are very fast. We then stress test the method with

ab -n1000 -c100 'http://sandbox6.wanke.etao.com/test/test_sequence.php?tbpm=t'
that is, ab issuing 1000 requests in total at a concurrency of 100.

We find that the time to acquire the lock becomes longer, the execution time while holding the lock also becomes longer, and deleting the lock takes nearly 10 ms. Why is this?
1: While holding the lock, our execution logic calls Redis again, and under heavy concurrency Redis itself becomes noticeably slower.
2: Deleting the lock takes far longer, from 0.2 ms before to 9.8 ms, a performance drop of nearly 50x.
Under these conditions the stress test reached a QPS of 49, and we found the QPS also depends on the total number of requests: with a concurrency of 100 but only 100 requests in total, the QPS was over 110.

Case two: with the sleep mechanism

For a single request (figure in the original post):

We see that the performance is comparable to the no-sleep case. Under the same stress test conditions (figure in the original post):

The time to acquire the lock is significantly longer, while the time to release it is significantly shorter, only half of what it is without the sleep mechanism. The execution time grows, of course, because we re-create the database connection during execution. We can also compare the command-execution pressure on Redis (figure in the original post):

In that figure, the tall narrow section is the load without the sleep mechanism and the short wide section is the load with it, so the pressure on Redis drops by roughly 50%. The drawback of sleeping is that the QPS falls noticeably, to only 35 under our stress test conditions, with some requests timing out. Even so, after weighing everything we decided to keep the sleep mechanism, mainly to prevent Redis from being overwhelmed under heavy concurrency, which is very bad and which we have run into before; so the sleep mechanism stays.

References

http://www.worlduc.com/FileSystem/18/2518/590664/9f63555e6079482f831c8ab1dcb8c19c.pdf
http://redis.io/commands/setnx
http://www.blogjava.net/caojianhua/archive/2013/01/28/394847.html
