Lightweight distributed lock based on Redis


Opening

In the era of microservices, distributed locks are needed to protect resources in many business scenarios. We need to ensure that modifications tied to a given identity (a user, a tenant, a resource) are not corrupted by concurrent access. The requirements are even stricter in scenarios such as e-commerce, loyalty points, and money transfers. Today, let's look at how a Redis-based distributed lock can be implemented.

Ideas

As we discussed before, in a single-machine program we can use sync.Mutex in Golang to guard a critical section against concurrent access (see the earlier article: Golang Mutex principle analysis).

How does Mutex decide whether the lock has been acquired? Recall:


func (m *Mutex) Lock() {
  // Fast path: grab unlocked mutex.
  if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
    if race.Enabled {
      race.Acquire(unsafe.Pointer(m))
    }
    return
  }
  // Slow path (outlined so that the fast path can be inlined)
  m.lockSlow()
}

As you can see, a CAS is used to check whether the current goroutine can change the state field of the Mutex. Because CAS is an atomic primitive provided by the atomic package, when multiple goroutines race for it, only one can succeed.
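
To make that concrete, here is a small self-contained sketch (not part of the original Mutex code) in which ten goroutines race on a single CompareAndSwapInt32; exactly one of them wins:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var state int32 // 0 = unlocked, 1 = locked (like mutexLocked)
    var winners int32
    var wg sync.WaitGroup

    for i := 0; i < 10; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // Only one goroutine can flip state from 0 to 1.
            if atomic.CompareAndSwapInt32(&state, 0, 1) {
                atomic.AddInt32(&winners, 1)
                fmt.Printf("goroutine %d got the lock\n", id)
            }
        }(i)
    }
    wg.Wait()
    fmt.Println("winners:", winners) // always 1
}

The unlock path of sync.Mutex looks like this: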

func (m *Mutex) Unlock() {
  if race.Enabled {
    _ = m.state
    race.Release(unsafe.Pointer(m))
  }

  // Fast path: drop lock bit.
  new := atomic.AddInt32(&m.state, -mutexLocked)
  if new != 0 {
    // Outlined slow path to allow inlining the fast path.
    // To hide unlockSlow during tracing we skip one extra frame when tracing GoUnblock.
    m.unlockSlow(new)
  }
}

For unlocking, again ignoring the slow path, we can see that it calls AddInt32 from the atomic package to subtract mutexLocked from state.

To summarize, what we need is the ability to atomically set a value for a key, and to atomically reset it. (In terms of sync.Mutex, the key is m.state and the value is mutexLocked.)

In a distributed scenario, there is another point to consider:

In the single-machine scenario, if the goroutine that acquires the lock misbehaves, for example by holding the lock for a long time without releasing it, the impact is confined to that machine and the developer can debug it directly. In a distributed scenario, however, the same lock is contended by multiple online instances, so the lock state naturally cannot live on the server that runs the business logic; it has to be kept in an independent store (a Redis instance, ZooKeeper, or some KV store with strong-consistency guarantees).

Then a problem arises: what happens if, as mentioned above, the instance holding the lock crashes and, for whatever reason, never releases it?

Every subsequent request from the other instances will fail because the lock is still taken, while the holder has already failed, may not recover for a long time, or may have lost its context entirely and will never release it. All further operations on that locked resource are then blocked, which is unacceptable in production.

So in a distributed scenario we must plan for fault tolerance: releasing the lock cannot depend solely on the instance that acquired it; there has to be a TTL-based expiration mechanism.

Implementation

Implementing a distributed lock on top of redis is a very common practice in the industry. It is not perfect, but it is good enough for most business scenarios. If you are interested, you can also look into the ZooKeeper-based approach, which relies mainly on zk's ephemeral sequential nodes.

As an aside, redis does address distributed locking officially with a dedicated algorithm, redlock, which sparked a well-known debate between Martin Kleppmann (author of DDIA) and antirez (the author of redis) and a lot of discussion. The redlock algorithm is not the focus of this article; if you are interested, see this page: redis.io/docs/refere…

Here, let's see how we can build a distributed lock on a single redis instance using only basic redis features.

Basic implementation

The principle of a single-instance redis lock is simple. Redis provides SetNX (set if not exists): set the key if it does not exist, otherwise do nothing. Combined with a TTL, this is enough to acquire a lock:

  • Lock
SET resource_name 1 NX PX 30000
  • Unlock
Del resource_name

Unlocking is equally simple: just delete the key. After that, the next instance that needs the lock will acquire it via SetNX.

Let's look at a Golang implementation:

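For reference, the original snippet does not show the Lock struct itself. A minimal sketch of what it might look like, assuming the context-free go-redis client (github.com/go-redis/redis) whose SetNX/Del/Eval signatures match the calls in this article:

import (
    "errors"
    "time"

    "github.com/go-redis/redis"
)

// ErrTimeout is returned when all retries are exhausted without getting the lock.
var ErrTimeout = errors.New("acquire lock timeout")

// Lock wraps a single redis key used as a lock.
type Lock struct {
    RedisClient *redis.Client // single-instance redis client
    Key         string        // the resource name to lock
    Timeout     time.Duration // TTL of the lock
    RetryTimes  int           // how many acquisition attempts before giving up
    value       string        // the lock value (used by the improved version later)
    acquire     bool          // whether this instance currently holds the lock
}
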
func (l *Lock) Acquire() error {
    retrySleep := 3 // initial backoff in milliseconds
    var err error
    for l.RetryTimes > 0 {
        // SetNX succeeds only if the key does not exist yet; the TTL
        // guarantees the lock eventually expires even if never released.
        v := l.RedisClient.SetNX(l.Key, 1, l.Timeout)
        err = v.Err()
        if err == nil && v.Val() {
            l.acquire = true
            return nil
        }
        l.RetryTimes--
        // exponential backoff before the next attempt
        time.Sleep(time.Millisecond * time.Duration(retrySleep))
        retrySleep = retrySleep * 2
    }
    if err != nil {
        return err
    }
    return ErrTimeout
}

func (l *Lock) Release() {
    if l.acquire {
        // naive release: unconditionally delete the key (see the problems below)
        l.RedisClient.Del(l.Key)
    }
    l.acquire = false
}

Setting aside single-instance performance and looking at it purely from a practical angle, the implementation above is fine in many scenarios: once the lock is taken, other instances that request it fail to acquire it, sleep for a while and retry, and if the retries are exhausted they return a timeout error.

But for large-scale business this implementation is definitely not safe enough. In the next section we will see how to improve it.

Existing problems

The core problem with the basic implementation from the previous section is that it has no safety net for the timeout case.

For example, suppose we set the lock's TTL to 10 seconds.

  1. Instance A acquires the lock; the pct99 latency of normal request handling is 500ms.
  2. Instance B fails to acquire the lock and keeps sleeping and retrying.
  3. The problem appears: for some reason, part of A's business flow gets stuck and blocks for more than 10 seconds; the lock expires while A is still executing.
  4. Instance B acquires the lock and starts executing its business flow.
  5. Instance A finishes and runs its deferred Del, intending to release its own lock; but A's lock has long since expired, so it actually deletes B's lock.
  6. Because of A's mistaken deletion, instance C acquires the lock and now runs concurrently with B, which is still executing. The lock has failed.

The root cause is that a lock can be released in two ways:

  1. The lock is released explicitly;
  2. The lock's TTL expires.

What we want is: once 2 has happened, 1 must no longer take effect.

But in the scheme above, releasing the lock is just a plain Del that any past holder can issue, which is a serious hazard: how do we know the lock we are releasing is still our own? After all, an instance that has timed out does not know that it has timed out, and hard-coding time checks is not a good approach.

OK, so what we want is that any lock holder can only delete the lock it set itself; if its lock has already expired, it should simply return. How do we implement that?

The answer is simple: we must ensure that the value used to lock is known only to the instance that set it. Then, when unlocking, we verify that the lock is still the one we originally held (again by checking the lock value).

(Some readers may think: can't I just use some random value, say a timestamp, and keep the plain Del? That is still not OK. If you do not verify the lock value when unlocking, then no matter what random value you used, a plain Del will release whatever lock happens to be there; locking and unlocking are not bound together.)

Of course, using a fixed value as we did above is the least safe option of all: even if you compare the lock value on unlock, two concurrent instances will have set the same fixed value, so the binding is meaningless.

In short, two things must hold:

  1. The lock value must be unique across all clients and all lock requests; you cannot lock with a value that may repeat, let alone a fixed one, so its randomness must be guaranteed;
  2. When unlocking, verify that the current value matches the value you set when locking, and only unlock on a match (meaning you are releasing the lock you yourself took).

The redis documentation describes the choice of this random string as follows: "What should this random string be? We assume it’s 20 bytes from /dev/urandom, but you can find cheaper ways to make it unique enough for your tasks. For example a safe pick is to seed RC4 with /dev/urandom, and generate a pseudo random stream from that. A simpler solution is to use a UNIX timestamp with microsecond precision, concatenating the timestamp with a client ID. It is not as safe, but probably sufficient for most environments."

The randomness itself leaves a lot of room for flexibility, and you can pick whatever fits your own business scenario. Following the above, one lightweight approach recommended by redis is to use a UNIX timestamp with microsecond precision, concatenated with a client ID, as the random value.
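
As a sketch of that recommendation (the function name here is illustrative, not from the original post), the lock value can be built from a microsecond-precision UNIX timestamp plus a per-client identifier:

import (
    "fmt"
    "os"
    "time"
)

// newLockValue follows the lightweight recipe above: a microsecond-precision
// UNIX timestamp concatenated with a client identifier (here hostname + pid;
// any stable per-client ID would do).
func newLockValue() string {
    host, _ := os.Hostname()
    return fmt.Sprintf("%d-%s-%d",
        time.Now().UnixNano()/int64(time.Microsecond), // microseconds since the epoch
        host,
        os.Getpid(),
    )
}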

Improved solution

  • Lock
SET resource_name my_random_value NX PX 30000
  • Unlock
if redis.call("get",KEYS[1]) == ARGV[1] then
    return redis.call("del",KEYS[1])
else
    return 0
end

Using a lua script for unlocking is fairly common here, because what we need is to wrap the following steps into a single atomic operation:

  1. Read the current value of the lock's key;
  2. Compare the current value with the value we set when locking;
  3. If they match, delete the key; if not, return without deleting.

Simply put, what we essentially need is an atomic Compare And Delete operation.
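
Putting the pieces together, here is a hedged sketch of how the earlier Acquire/Release pair could be adapted (the method names AcquireSafe/ReleaseSafe are mine, not from the original post; the value field and the newLockValue helper come from the sketches above, and the client is still the context-free go-redis one):

// compareAndDelete is the unlock script shown above: delete the key only if
// its current value is the value this client set when it locked.
const compareAndDelete = `
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end`

func (l *Lock) AcquireSafe() error {
    l.value = newLockValue() // unique per client and per lock attempt
    retrySleep := 3
    var err error
    for l.RetryTimes > 0 {
        // Same SetNX + TTL as before, but with the unique value instead of 1.
        v := l.RedisClient.SetNX(l.Key, l.value, l.Timeout)
        err = v.Err()
        if err == nil && v.Val() {
            l.acquire = true
            return nil
        }
        l.RetryTimes--
        time.Sleep(time.Millisecond * time.Duration(retrySleep))
        retrySleep = retrySleep * 2
    }
    if err != nil {
        return err
    }
    return ErrTimeout
}

func (l *Lock) ReleaseSafe() {
    if l.acquire {
        // Compare And Delete in one atomic step on the redis side; a lock that
        // has already expired (and may now belong to someone else) is left alone.
        l.RedisClient.Eval(compareAndDelete, []string{l.Key}, l.value)
    }
    l.acquire = false
}

With this, releasing a lock that has already expired becomes a no-op instead of deleting someone else's lock.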

Limitations

Note that the solution above is built on the premise of a single redis instance, and that premise ignores some cases that arise in truly distributed deployments.

In practice, once our business servers are themselves distributed, it is hard to keep depending on a single redis instance for locking; both capacity and performance become bottlenecks.

If a redis cluster is used to back the distributed lock, we have to face issues such as master-slave failover and replication delay. How much this matters has to be judged against the specific redis deployment and usage.

In fact, redlock, the official algorithm, exists precisely to address distributed locking over multiple redis nodes. It is worth reading the redlock documentation when you have time: redis.io/docs/refere…


Origin juejin.im/post/7102339194252427300