[Distributed] Implementing distributed locks with Redis

Distributed locks

Simply put, a distributed lock is a lock that multiple nodes and multiple instances in a distributed environment compete for.

Local locks are used when multiple threads within one process compete for access to a shared resource and need synchronization. Distributed locks are essentially the same, except that the service providing the lock lives on some node, such as a server running Redis; the competitors for the resource are other nodes and instances; and every operation on the lock has to go over the network.

The difficulty of distributed locks lies in the network. In fact, many problems in distributed systems come from the network; it is an eternal problem that we can never avoid.

Four characteristics

  1. Mutual exclusivity: the purpose of a lock is to gain exclusive use of a resource, so only one competitor may hold the lock at a time. This must be guaranteed as far as possible.
  2. Safety: avoid deadlock. If a competitor crashes unexpectedly while holding the lock and never actively unlocks, the lock must still be released eventually so that other competitors can acquire it.
  3. Symmetry: for the same lock, locking and unlocking must be done by the same competitor; no one may release a lock held by someone else.
  4. Reliability: the distributed lock service needs some degree of exception handling and disaster recovery capability.

The simplest implementation

The starting point for implementing a distributed lock is the setnx command (SET if Not eXists), which sets a key-value pair only if the key does not already exist. This guarantees the key is set exclusively: under concurrency, only one client can set it successfully and thereby obtain the lock.

A setnx call has four possible outcomes: it returns 1, meaning the lock was acquired; it returns 0, meaning someone else got the lock first; the request times out and never reaches the lock server at all; or the request does arrive and the lock is actually acquired, but the server's response is lost on the way back due to a network problem.

So the simplest implementation is: to lock, set a key with the setnx command (the value can be anything); to unlock, delete the key.
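The naive scheme can be sketched as follows. A tiny in-memory stand-in models the two Redis commands used (SETNX and DEL) so the logic is self-contained; with a real client these would be the corresponding `setnx`/`delete` calls, and the class name and helpers here are illustrative only.

```python
# Minimal in-memory stand-in for the two Redis commands this naive lock
# uses. With a real client, setnx/delete would be network calls.
class FakeRedis:
    def __init__(self):
        self.store = {}

    def setnx(self, key, value):
        # SETNX: set only if the key does not exist; 1 on success, 0 otherwise
        if key in self.store:
            return 0
        self.store[key] = value
        return 1

    def delete(self, key):
        return 1 if self.store.pop(key, None) is not None else 0


def lock(r, key):
    # In this naive version the value is arbitrary
    return r.setnx(key, "anything") == 1

def unlock(r, key):
    return r.delete(key) == 1


r = FakeRedis()
assert lock(r, "lock:order")       # first competitor wins
assert not lock(r, "lock:order")   # everyone else fails
assert unlock(r, "lock:order")
assert lock(r, "lock:order")       # the lock is available again after release
```

The mutual exclusion here comes entirely from SETNX's set-if-absent semantics; the sections below show why this version is not yet safe.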

This seems fine: multiple nodes can acquire the distributed lock in turn. But in a distributed system we must consider node crashes. If a node goes down after locking and can never unlock, the lock stays held forever and no other node can ever acquire it.

To solve this, the node specifies a timeout when locking: if the lock has not been released by the time it expires, it is released automatically. Many factors in a distributed system are uncertain, but time is certain, so it is often used as the safety net.

Timeout

A timeout mechanism prevents a crashed holder from blocking the lock forever, so the expiration time of a distributed lock must always be set. But it introduces a new problem. Suppose that after a node locks, long business processing, a GC pause, or network delay causes the lock to expire and be released automatically, and a second node then acquires it.
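Locking with an expiry can be sketched like this, simulating the semantics of Redis's atomic `SET key value NX PX <ms>` with an in-memory store (with redis-py this would be roughly `r.set(key, value, nx=True, px=30000)`; the class and method names here are illustrative).

```python
import time

# In-memory model of `SET key value NX PX <ms>`: set-if-absent plus a
# millisecond expiry, checked lazily the way a sketch can afford to.
class FakeRedis:
    def __init__(self):
        self.store = {}  # key -> (value, expire_at)

    def _alive(self, key):
        entry = self.store.get(key)
        if entry is None:
            return False
        if time.monotonic() >= entry[1]:
            del self.store[key]      # lock expired: lazily remove it
            return False
        return True

    def set_nx_px(self, key, value, px_ms):
        if self._alive(key):
            return False
        self.store[key] = (value, time.monotonic() + px_ms / 1000.0)
        return True


r = FakeRedis()
assert r.set_nx_px("lock:a", "owner-1", px_ms=50)
assert not r.set_nx_px("lock:a", "owner-2", px_ms=50)  # still held
time.sleep(0.06)
assert r.set_nx_px("lock:a", "owner-2", px_ms=50)      # auto-released on timeout
```

Note that in real Redis the set and the expiry must be one atomic command (SET with NX and PX), not a SETNX followed by a separate EXPIRE, otherwise a crash between the two calls recreates the deadlock problem.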

When the first node later comes to unlock, the deletion may succeed, but the lock it deletes is no longer its own: it is the one added by the second node. Worse, the second node still believes it holds the lock, yet the key has already been deleted, so a third node can now lock. At this point the second and third nodes both consider themselves the holder, which violates the most basic semantics of a distributed lock.

To solve this, a node must verify on unlock that the lock really is its own. This can be done by setting the value of the key-value pair to a unique identifier when locking.

Unique identifier

When locking, the node sets the value to a unique identifier, such as a UUID. When unlocking, it checks whether the stored value equals the identifier it set; if so, the lock really was added by itself and may be released.

This operation still has a problem. Unlocking now consists of reading the value, comparing it, and deleting the key, and these three steps are separate rather than atomic. A node may read the value and confirm the lock is its own, then get blocked for some reason; the lock expires and is acquired by a second node; when the first node resumes, its earlier comparison still says the lock is its own, so it deletes the key, and what it actually releases is the second node's lock.

Therefore the read, compare, and delete must be performed atomically, which in Redis can be done with a Lua script.
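The compare-and-delete unlock is usually expressed as the following Lua script, which Redis executes atomically; the `r.eval(UNLOCK_SCRIPT, 1, key, token)` call shape is the redis-py style and is illustrative. The pure-Python function below only models what the script does, for the sake of a runnable example.

```python
import uuid

# Lua runs inside Redis, so GET + compare + DEL happen atomically.
UNLOCK_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

# Pure-Python model of the script's semantics, for illustration only:
def unlock(store, key, token):
    if store.get(key) == token:
        del store[key]
        return 1
    return 0


store = {}
mine = str(uuid.uuid4())
store["lock:a"] = mine
assert unlock(store, "lock:a", "someone-else") == 0  # symmetry: not our lock
assert unlock(store, "lock:a", mine) == 1            # our lock: released
assert "lock:a" not in store
```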

Renewal

As mentioned earlier, a distributed lock must have an expiration time. But how long should it be?

It is hard for the lock's user to choose a good expiration time. Too short, and the lock may expire before the business finishes; too long, and if the holder really does crash, other nodes cannot acquire the lock until it finally times out and is deleted. Worse still, business execution time is unpredictable and may simply exceed any timeout chosen.

We can introduce a renewal mechanism: extend the timeout again before the lock expires. The expiration time then no longer needs to be long, because if the business really does run past the original timeout, renewal extends it; and if the holder crashes, the renewals stop and the lock naturally expires shortly after.

Manual renewal

In Redis, manual renewal means atomically, via a Lua script, reading the lock's value, checking that you are the one who locked it, and if so resetting the key's timeout with the expire command or pexpire (which sets the expiration in milliseconds).
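The renewal script is the renewal counterpart of the unlock script: check ownership, then push the expiry forward with PEXPIRE, all inside one atomic Lua script. As before, the Python function below is only a model of the script's semantics so the example is runnable; the `(token, expire_at_ms)` representation is an assumption of this sketch.

```python
# Check-then-PEXPIRE, atomic because Redis runs the whole script at once.
RENEW_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("pexpire", KEYS[1], ARGV[2])
else
    return 0
end
"""

# Pure-Python model of the script, for illustration only.
# store maps key -> (token, expire_at_ms).
def renew(store, key, token, ttl_ms, now_ms):
    entry = store.get(key)
    if entry is not None and entry[0] == token:
        store[key] = (token, now_ms + ttl_ms)  # push the deadline forward
        return 1
    return 0


store = {"lock:a": ("owner-1", 1000)}
assert renew(store, "lock:a", "owner-1", ttl_ms=10_000, now_ms=900) == 1
assert store["lock:a"][1] == 10_900                      # deadline extended
assert renew(store, "lock:a", "owner-2", ttl_ms=10_000, now_ms=900) == 0
```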

Automatic renewal

Manual renewal itself is not complicated; the difficulty lies in the details of using it. How often should you renew? What if a renewal fails? A failure may just be a request timeout, in which case a retry strategy works; but how do you handle failures other than timeouts? If renewal has definitively failed, what do you do with the business still in progress? If the only option is to interrupt it, how?

For most users, handling all the abnormal cases during renewal is difficult, so it is worth designing middleware that provides automatic renewal. Of course, the designer still cannot avoid those same awkward questions:

How often to renew, and by how much each time? The renewal operation is heavily affected by the network and the Redis server, neither of which the designer controls, so the renewal interval is left for the user to choose; for the extension length, one can simply reuse the timeout set when the lock was first acquired. The interval and the timeout are related, though: you obviously cannot wait until the lock has already expired and been deleted before renewing, and the availability of the service must be considered. For example, if the lock timeout is 10s and we expect a renewal to succeed within 2s, we can set the renewal interval to 8s. The expected renewal-success time is really an availability requirement.

How to handle timeouts, and how long should the timeout be? The timeout here is that of the renewal request to the Redis server. On a timeout we do not know whether the renewal succeeded, and most timeouts are sporadic, so we choose to retry the renewal request. The downside is that if a timeout is not sporadic, but the Redis server has actually crashed or the network is unavailable, we end up retrying indefinitely. As for the timeout value itself, we again let the user specify it.

How to notify the user that renewal failed? We only handle renewal failures caused by timeouts, by retrying; for any other error we directly tell the user that an unrecoverable error occurred. When classifying the errors an operation can hit, it is often enough to divide them into just two categories: timeouts, and everything else. A timeout is a kind of error unlike any other.

Should there be a maximum number of renewals? If a business keeps renewing and does not release the distributed lock for minutes on end, should we force-release it? Our answer is no. If a user's business really does need to run that long, they should renew manually. As middleware designers we can try to solve as many problems as possible for users, but we cannot account for such individual needs; users must handle them themselves.
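The automatic-renewal loop described above can be sketched as a small watchdog: renew every `interval` seconds, keep going past a sporadic timeout, and stop and notify the user on any other error. `renew` here is a user-supplied callable standing in for the atomic renewal script, and `TimeoutError` marks the retryable case; all names are illustrative.

```python
import threading

# Watchdog loop: renew on every tick until stopped; a timeout is treated as
# sporadic and survived, any other error stops the loop and is reported.
def auto_renew(renew, interval, stop_event, on_error):
    while not stop_event.wait(interval):
        try:
            renew()
        except TimeoutError:
            continue            # sporadic timeout: try again on the next tick
        except Exception as e:  # anything else: unrecoverable, tell the user
            on_error(e)
            return


calls, errors = [], []
stop = threading.Event()

def renew():
    calls.append(1)
    if len(calls) == 2:
        raise TimeoutError                  # one sporadic timeout: survived
    if len(calls) == 4:
        raise RuntimeError("lock lost")     # fatal: the loop must exit

t = threading.Thread(target=auto_renew, args=(renew, 0.01, stop, errors.append))
t.start()
t.join(timeout=2)

assert not t.is_alive()
assert len(calls) == 4                      # renewed until the fatal error
assert isinstance(errors[0], RuntimeError)  # and the user was notified
```

A real implementation would also bound the retries per tick and expose the stop event so the business can cancel renewal when it releases the lock.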

Locking and retrying

Locking can also fail sporadically, and in that case we can retry. In fact we only retry failures caused by timeouts; most other errors cannot be fixed by retrying. In a distributed environment, timeout errors are unlike all other errors.

  • If the call times out, retry the lock, taking care to reuse the value set in the attempt that timed out. If the key does not exist in Redis, the timed-out request really never took effect; lock again, set the lock timeout, and return.
  • If the key exists, first check whether its value is the value from the timed-out attempt. If so, the earlier lock actually succeeded; only the response from Redis failed to reach our client.
  • If the value is something else, the earlier attempt failed and another node holds the lock, so this locking attempt has failed. Thus the biggest difference between retrying and locking directly is that before retrying, you must check whether the value is the one you set on the first attempt.

Like automatic renewal, lock retries raise several questions. How to retry? The retry logic is as above. How often, and how many times in total? This too is left for the user to specify. When to retry and when not to? As we have said: retry on timeout errors; for other errors, retrying is pointless.
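The retry steps above can be sketched as follows. The key point is reusing the same token across retries, so that after a timeout we can tell, by reading the key, whether the timed-out attempt actually succeeded on the server. `try_set` and `get` stand in for SET NX (with expiry) and GET; the names and the `pending_timeouts` simulation are assumptions of this sketch.

```python
import time
import uuid

def lock_with_retry(try_set, get, key, token, retries, backoff):
    for _ in range(retries + 1):
        try:
            if try_set(key, token):
                return True          # plain success
        except TimeoutError:
            pass                     # outcome unknown: inspect the key below
        current = get(key)
        if current == token:
            return True              # the timed-out request actually won
        if current is not None:
            return False             # held by someone else: locking failed
        time.sleep(backoff)          # key absent: request never landed, retry
    return False


# Simulate one attempt whose request reaches the server but whose
# response is lost on the way back, followed by the inspection step.
store = {}
pending_timeouts = {"n": 1}

def try_set(key, token):
    if pending_timeouts["n"]:
        pending_timeouts["n"] -= 1
        store[key] = token           # the request reached the server...
        raise TimeoutError           # ...but the response was lost
    if key in store:
        return False
    store[key] = token
    return True

mine = str(uuid.uuid4())
assert lock_with_retry(try_set, store.get, "lock:a", mine, retries=2, backoff=0)
assert store["lock:a"] == mine       # the "lost" first attempt is recognized as ours
```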

singleflight optimization

Under very high concurrency with concentrated hotspots, that is, when many threads on many nodes compete for the same locks, singleflight can be brought in as an optimization.

Specifically, all threads (or coroutines) local to a node first compete for a local lock, and only the winner goes on to compete with the winners from other nodes for the global distributed lock. However many nodes the system has, at most that many threads end up competing for the distributed lock.

Clearly, singleflight only pays off when many threads on a single node compete for the distributed lock. If each node has only one or two such threads, the number of threads ultimately competing for the distributed lock is almost the same as without singleflight, yet every node still pays for a round of local lock competition, so the total overhead may end up even larger than not using singleflight at all.
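A minimal sketch of the idea: threads on one node first compete for a cheap in-process mutex, and only the current local winner talks to the lock server at all, so at most one request per node is in flight against the distributed lock at any moment. `acquire_remote` is a stand-in for the real networked distributed-lock call, and the counters exist only to make the property observable.

```python
import threading
import time

local_gate = threading.Lock()   # local competition: cheap, in-process
count_lock = threading.Lock()
remote_calls = []
in_flight = {"now": 0, "max": 0}

def acquire_remote():
    # Stand-in for the networked distributed-lock acquisition.
    with count_lock:
        in_flight["now"] += 1
        in_flight["max"] = max(in_flight["max"], in_flight["now"])
    time.sleep(0.005)            # simulated network round trip
    with count_lock:
        in_flight["now"] -= 1
    remote_calls.append(threading.current_thread().name)
    return True

def critical_section():
    with local_gate:             # only the local winner crosses the network
        acquire_remote()
        # ... do the protected work, then release the remote lock ...

threads = [threading.Thread(target=critical_section) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(remote_calls) == 5    # every thread eventually got its turn
assert in_flight["max"] == 1     # but never more than one remote call at once
```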

Some questions

  1. Why might locking fail? Request timeout, network failure, Redis server failure, or the lock being held by someone else.
  2. How can distributed lock performance be optimized? The whole process involves three parts: the node's own business logic, network communication, and the Redis server. Redis itself is very fast, and the network is a factor we cannot control, so real optimization should start with the business: shorten the time the lock is held and release it normally whenever possible, avoiding the abnormal paths.


Origin: blog.csdn.net/Pacifica_/article/details/127719758