Ten Years at Big Tech Companies: Technology Selection and Reflections on Distributed Locks

Locks and Distributed Locks

In computing, a lock provides mutual exclusion over a shared resource under concurrency, ensuring that only one process or thread controls the resource at any given moment.

Consider the following examples:

  1. File locks solve the problem of different users reading and writing the same file concurrently, preventing the file's contents from being corrupted.
  2. A queue implemented with an array generally needs a lock around the push operation to resolve contention for slots; otherwise concurrent pushes can conflict and lose data.
  3. For 12306 (China's railway ticketing system), train tickets are the shared resource. When a ticket is finally issued, it must be locked so that each ticket, passenger, and seat correspond one-to-one.
  4. ……
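Example 2 is easy to show in code. The Python sketch below (`ArrayQueue` and `producer` are hypothetical names) guards an array-backed queue with a single-machine `threading.Lock`, so that checking the tail, writing the slot, and advancing the tail happen as one critical section and no two pushers can claim the same slot.

```python
import threading


class ArrayQueue:
    """Fixed-capacity queue backed by a list used as an array."""

    def __init__(self, capacity: int) -> None:
        self._slots = [None] * capacity
        self._tail = 0
        self._lock = threading.Lock()

    def push(self, item: object) -> bool:
        # Claiming a slot (check, write, bump tail) must be one critical section.
        with self._lock:
            if self._tail >= len(self._slots):
                return False  # queue is full
            self._slots[self._tail] = item
            self._tail += 1
            return True

    def items(self) -> list:
        with self._lock:
            return self._slots[: self._tail]


def producer(q: ArrayQueue, start: int, count: int) -> None:
    for i in range(start, start + count):
        q.push(i)


q = ArrayQueue(capacity=4000)
threads = [threading.Thread(target=producer, args=(q, n * 1000, 1000)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(q.items()))                          # 4000
print(sorted(q.items()) == list(range(4000)))  # True: no slot was claimed twice
```

This lock only works because all four producers share one process's memory, which is exactly the single-machine case discussed next.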

These examples cover both the traditional single-machine locks we usually talk about and the distributed locks this article is about.

In a single-machine environment, all competitors for a resource (processes/threads) live on the same machine, so a lock can be implemented using only local resources such as disk, memory, and registers.

In a distributed environment, however, the competitors run in far more varied places, and a solution that relies on a single machine no longer works. We need a coordinator that every party recognizes to resolve the contention; that coordinator is what we call a distributed lock.

It is like a conflict between two employees, which can be settled as soon as the company's leadership steps in; when two companies are in conflict, the courts have to step in. The same principle applies here.

Simply put, a distributed lock is a means of resolving resource contention in a distributed environment.


Application Scenarios of Distributed Locks

Any distributed environment with resource contention needs a distributed lock for coordination. Besides the 12306 ticketing example above, typical applications include concurrent edits on shared document platforms, hero selection in Honor of Kings (King of Glory), and globally auto-incrementing primary keys. Below I briefly introduce one scenario: a multi-user collaborative editing platform such as a company's internal wiki.

Multi-user online editing in a wiki

Scenario 1: Before the Qingming Festival holiday, the team asked everyone to register their vacation plans on the wiki; say we record our vacation dates and contact numbers in the document with id=1. Colleagues A and C start editing at the same time and submit at the same time, and before submission the document was empty for both of them. How should the service handle these two requests? Whose version prevails? Will one overwrite the other, so that A's entry is lost?

Scenario 2: In another case, I am colleague Z, and others have already filled in their entries before me. I have a bad habit of pressing Ctrl+S 3-5 times in a row when saving. Each Ctrl+S triggers a request, and each request takes about 1 s to process, but all of the requests are sent within 20 ms.

The problem is the same as above: how do we make sure the appended entries are not duplicated?

Suppose your storage service and storage architecture look like this:

[Figure: storage service and storage architecture]

The processing code looks roughly like this:

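    // Fetch the document content by docId from the distributed file system; latency is unpredictable
    $nowFileContent = getFileByDocId($docId);
    // Do some processing, such as a diff or an append
    $newFileContent = doSomeThing($nowFileContent);
    // Write the result back to the file system
    setNewFileContent($docId, $newFileContent);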

In scenario 1, requests A and C reach this code at the same time. Because of network latency, A reads the document first, but C reads it before A writes back, so in the end one of the two writes is lost.

[Figure: requests A and C both read the document before either writes, so one write overwrites the other]
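The lost write can be reproduced deterministically. This Python sketch replaces the distributed file system with a dict and uses hypothetical helpers `get_file_by_doc_id`, `set_new_file_content`, and `append_entry` in place of the functions above, then replays the interleaving just described: both reads happen before either write.

```python
files = {1: ""}  # stand-in for the distributed file system; doc id=1 starts empty


def get_file_by_doc_id(doc_id: int) -> str:
    return files[doc_id]


def set_new_file_content(doc_id: int, content: str) -> None:
    files[doc_id] = content


def append_entry(content: str, entry: str) -> str:
    # The "do something" step: append one line to the document
    return content + entry + "\n"


# Scenario 1 interleaving: C reads before A has written back.
a_snapshot = get_file_by_doc_id(1)
c_snapshot = get_file_by_doc_id(1)
set_new_file_content(1, append_entry(a_snapshot, "A: vacation entry"))
set_new_file_content(1, append_entry(c_snapshot, "C: vacation entry"))

print(repr(files[1]))  # 'C: vacation entry\n'  (A's entry was overwritten)
```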

Therefore, the read and the write must be performed under a single lock to ensure the integrity and consistency of the transaction.

The figure below is an illustration from Modern Operating Systems; the effect we want here is the same.

[Figure: illustration from Modern Operating Systems]

Wiki-like scenarios are resource-handling problems with long-running transactions. The lock ensures that the long gap between the read and the write inside a transaction cannot cause one write to overwrite another: requests are queued and processed one at a time.
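The queuing effect can be sketched with a single-process stand-in. In the Python code below, a `threading.Lock` plays the role of the distributed lock on document id=1 (an assumption for illustration; across machines it would be the Redis lock described later), and `time.sleep` stretches the gap between read and write the way a slow transaction does. Because the lock is held across the whole read-modify-write, every entry survives.

```python
import threading
import time

files = {1: ""}               # stand-in for the document store
doc_lock = threading.Lock()   # stand-in for the distributed lock on doc id=1


def save(doc_id: int, entry: str) -> None:
    with doc_lock:                            # hold the lock for the whole transaction
        content = files[doc_id]               # read
        time.sleep(0.01)                      # simulate the slow diff/append step
        files[doc_id] = content + entry + "\n"  # write


threads = [threading.Thread(target=save, args=(1, name)) for name in ("A", "C", "Z")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(files[1].splitlines()))  # ['A', 'C', 'Z']
```

The order of the lines depends on which thread wins the lock first, which is why the result is sorted before printing.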


Solution selection

The problem I ran into was also a long-transaction problem like the wiki case. My first instinct was to look for solutions online.

There are many implementations online based on MySQL, ZooKeeper (ZK), and Redis. Which one should I choose? How should I choose? What do I need to weigh?

When I read books on distributed systems, one word came up again and again: trade-off. I understand it as weighing what you gain against what you give up.

As a web developer, my main considerations are:

  1. Can it implement my functionality, and how long will it take to get it into production?
  2. Implementation difficulty and learning cost;
  3. Operations and maintenance cost.

Let's evaluate the current options against these criteria:

Approach | Functional requirements | Implementation difficulty | Learning cost | Operations and maintenance cost
MySQL, using table locks/row locks | Meets basic requirements | Low | Already familiar | Fine at low volume; at high volume it affects existing business; the one-master-multi-slave architecture is hard to scale
ZK, by creating data nodes | Meets requirements | Knowing the ZK API is enough | Needs to be learned | Heavy: needs dedicated machines, and involves cross-datacenter requests
Redis, using SET with NX and EX | Meets basic requirements | Low | Already familiar | Easy to scale; an existing service is available

MySQL uses a single-master architecture, so all writes go to the master, which becomes a bottleneck. The ZK approach requires building and operating our own cluster on dedicated machines, so utilization would be low. In the end I chose Redis: traffic and storage can be scaled out, and we don't have to operate it ourselves.

Implementation

With the approach chosen, the next step is the implementation. What requirements must the lock we build satisfy?

  1. Locking must be atomic, guaranteeing that at most one contender holds the lock at any time;
  2. Unlocking must be atomic, and a holder may release only its own lock;
  3. No deadlocks: if the holding process crashes, other contenders must still be able to acquire the lock;
  4. It must work with both the stand-alone architecture and the Twemproxy cluster architecture;
  5. Latency must be acceptable.

Based on these requirements, my implementation is as follows (simplified, with sensitive information removed):

    <?php
    class LockUtility
    {
        const DEFAULT_UNLOCK_TIME = 4;
        const COMMON_REDISKEY_PREFIX = 'xxxxx';

        private $_objRedis;
        private $_redisKey;
        private $_unLockTime;
        private $_guid;

        /**
         * @param string $ukey        the key to lock
         * @param int    $unlockTime  how long the lock is held, in seconds (the key's TTL)
         */
        public function __construct($ukey, $unlockTime = self::DEFAULT_UNLOCK_TIME)
        {
            $this->_objRedis   = RedisFactory::getRedis();
            $this->_redisKey   = self::COMMON_REDISKEY_PREFIX . $ukey;
            $this->_unLockTime = $unlockTime;
            // Generate a unique GUID for this lock acquisition; it identifies the owner on unlock
            $this->_guid       = genGuid();
        }

        /**
         * @brief Acquire the lock on the given key
         *
         * @return bool true if the lock was acquired
         *
         * If the lock is not acquired, an exception is thrown; choose how much to
         * care according to your business. Error codes:
         *   1. Network error:  Constants_ErrorCodes::REDIS_ERROR   whether to ignore it depends on how strict the business is
         *   2. Lock in use:    Constants_ErrorCodes::LOCK_IS_USED  the lock is definitely held by someone else
         */
        public function lock()
        {
            /*
             * Setting the lock must be atomic, so a single SET is used:
             *     SET key value [EX seconds] [PX milliseconds] [NX|XX]
             * Since Redis 2.6.12, SET accepts these options, giving SETNX plus an expiry in one command.
             *
             * phpredis syntax: $redis->set('key', 'value', ['nx', 'px' => 1000]);
             */
            $setRet = $this->_objRedis->set($this->_redisKey, $this->_guid, ['nx', 'ex' => $this->_unLockTime]);

            // false means the lock request failed: the key already exists
            if (false === $setRet) {
                // The lock is held by someone else
                throw new Exception("get Lock Failed! Locking", Constants_ErrorCodes::LOCK_IS_USED);
            }

            // null means a network, authorization, syntax or similar error
            if (is_null($setRet)) {
                throw new Exception("Request Redis Failed", Constants_ErrorCodes::REDIS_ERROR);
            }

            return $setRet;
        }

        /**
         * @brief Release the lock on the key. In principle the return value can be ignored,
         *        and it is safe to call more than once.
         *
         * @return int 1 if the Redis call succeeded and the key was deleted;
         *             0 if the call succeeded but the key no longer exists (or belongs to another owner)
         */
        public function unlock()
        {
            // Redis 2.6 added Lua scripting, which finally made an efficient atomic
            // CAS (check-and-set) possible: compare the owner GUID and delete in one step
            $luaScript = "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";
            $delRet = $this->_objRedis->eval($luaScript, [$this->_redisKey, $this->_guid], 1);

            if (is_null($delRet)) {
                // null means a network, authorization, syntax or similar error
                throw new Exception("Request Redis Failed", Constants_ErrorCodes::REDIS_ERROR);
            }
            return $delRet;
        }
    }
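To experiment with these semantics without a Redis server, here is a Python sketch under stated assumptions: `FakeRedis` is a hypothetical in-memory stand-in (not a Redis client) that mimics only `SET key value NX EX seconds` and the GET-compare-DEL Lua script, with an injectable clock so that expiry can be tested. The Python `LockUtility` mirrors the PHP class but returns `False` where the PHP version throws `LOCK_IS_USED`.

```python
import uuid
from typing import Callable, Dict, Optional, Tuple


class FakeRedis:
    """In-memory stand-in for the two Redis operations the lock relies on."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)

    def _get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # the TTL has elapsed, so the key is gone
            return None
        return value

    def set_nx_ex(self, key: str, value: str, ex: float) -> bool:
        # Like SET key value NX EX ex: succeeds only if the key is absent
        if self._get(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ex)
        return True

    def compare_and_delete(self, key: str, expected: str) -> int:
        # Same logic as the Lua script: delete only when the stored value is ours
        if self._get(key) == expected:
            del self._data[key]
            return 1
        return 0


class LockUtility:
    def __init__(self, redis: FakeRedis, ukey: str, unlock_time: float = 4) -> None:
        self._redis = redis
        self._key = "xxxxx" + ukey     # same placeholder prefix as the PHP version
        self._unlock_time = unlock_time
        self._guid = uuid.uuid4().hex  # unique owner token, playing the role of genGuid()

    def lock(self) -> bool:
        return self._redis.set_nx_ex(self._key, self._guid, self._unlock_time)

    def unlock(self) -> int:
        return self._redis.compare_and_delete(self._key, self._guid)


now = [0.0]                          # fake clock, in seconds
r = FakeRedis(clock=lambda: now[0])
a, c = LockUtility(r, "doc:1"), LockUtility(r, "doc:1")
print(a.lock())    # True: A acquires the lock
print(c.lock())    # False: A holds it
print(c.unlock())  # 0: C cannot release A's lock
print(a.unlock())  # 1: A releases its own lock
print(c.lock())    # True: C acquires it
now[0] += 5        # C never unlocks; 5 s passes, longer than the 4 s TTL
print(a.lock())    # True: the expired key no longer blocks A
```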

Does this code actually meet the requirements above? Let's examine it under both the stand-alone and the clustered Redis setups.

Stand-alone Redis architecture

[Figure: stand-alone Redis architecture]

In the single-node architecture shown above, reads and writes are not separated.

So does the code above meet the requirements?


  1. lock uses SET with the NX and EX options, and Redis executes commands on a single thread, so acquiring the lock is atomic: success means the lock is held and failure means it is not, which satisfies requirement 1. The EX expiry makes the key expire on timeout, which handles deadlock (requirement 3);
  2. unlock uses a Lua script so that the compare and the DEL happen atomically, which also guarantees that a client can delete only its own lock (requirement 2);
  3. How long does it take? Each operation is a single request, which is acceptable; within the same data center it takes on the order of milliseconds.
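Why requirement 3 depends on a single SET with NX and EX, rather than a separate SETNX followed by EXPIRE, can be shown with a small simulation. `TTLStore` below is a hypothetical in-memory model, not Redis: if a process crashes between the two separate steps, the key never gets a TTL and the lock is stuck, while the combined form attaches the TTL in the same step.

```python
from typing import Dict, Optional, Tuple


class TTLStore:
    """Hypothetical in-memory key store with optional per-key expiry (not Redis)."""

    def __init__(self) -> None:
        self.now = 0.0  # fake clock, in seconds
        self._data: Dict[str, Tuple[str, Optional[float]]] = {}  # key -> (value, expires_at)

    def _exists(self, key: str) -> bool:
        entry = self._data.get(key)
        if entry is None:
            return False
        expires_at = entry[1]
        if expires_at is not None and self.now >= expires_at:
            del self._data[key]  # TTL elapsed
            return False
        return True

    def setnx(self, key: str, value: str) -> bool:
        if self._exists(key):
            return False
        self._data[key] = (value, None)  # stored with no TTL yet
        return True

    def expire(self, key: str, seconds: float) -> None:
        if self._exists(key):
            self._data[key] = (self._data[key][0], self.now + seconds)

    def set_nx_ex(self, key: str, value: str, seconds: float) -> bool:
        if self._exists(key):
            return False
        self._data[key] = (value, self.now + seconds)  # value and TTL in one step
        return True


def two_step_lock_then_crash(store: TTLStore, key: str) -> None:
    store.setnx(key, "owner-1")
    # The process crashes here, so store.expire(key, 4) never runs.


s1 = TTLStore()
two_step_lock_then_crash(s1, "lock:doc:1")
s1.now += 3600
print(s1.setnx("lock:doc:1", "owner-2"))  # False: the lock is stuck with no TTL

s2 = TTLStore()
s2.set_nx_ex("lock:doc:1", "owner-1", 4)  # owner-1 crashes without unlocking
s2.now += 5
print(s2.set_nx_ex("lock:doc:1", "owner-2", 4))  # True: the TTL freed the lock
```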

Multi-region and multi-shard master-slave architecture in Twemproxy mode

[Figure: multi-region, multi-shard master-slave architecture in Twemproxy mode]

Twemproxy is a proxy for Redis/Memcached whose main job is routing each request to a shard based on its key. Some operations are not supported, for example `keys *`, because they would need to traverse all shards. Simple set/get commands, or anything else that routes to a single shard, work the same way as on a single instance.
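Key-based routing can be pictured as below. This Python sketch is a simplification, not Twemproxy's code: the shard list is hypothetical, and CRC32 modulo the shard count is just one simple hashing choice, so real shard assignments will differ. It shows why single-key commands are easy to route while `KEYS *` is not.

```python
import zlib
from typing import List

SHARDS: List[str] = ["shard-0", "shard-1", "shard-2"]  # hypothetical shard names


def route(key: str) -> str:
    # Hash the key and take it modulo the number of shards
    return SHARDS[zlib.crc32(key.encode("utf-8")) % len(SHARDS)]


def dispatch(command: str, *args: str) -> str:
    if command.upper() == "KEYS":
        # A pattern such as "*" can match keys on every shard,
        # so no single shard can answer it.
        raise NotImplementedError("KEYS would have to visit all shards")
    # Single-key commands go to the shard that owns their key
    return route(args[0])


lock_key = "xxxxxdoc:1"
print(dispatch("SET", lock_key, "guid") == dispatch("GET", lock_key))  # True
```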

What about Lua scripts? How are they routed, and are they supported at all?

When looking into how eval is executed, I found that our cluster's documentation says:

There must be at least one key after the script; the command is sent to the shard where the first key is located.

In other words, when eval is used, the command is routed by its first key. Our first key is exactly the lock key we want to operate on, so this code also works in cluster mode.
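To see why the first key matters, the sketch below extracts the routing key from a raw `EVAL script numkeys key [key ...] arg [arg ...]` command. `eval_routing_key` is a hypothetical helper that mirrors the documented rule quoted above, not the proxy's actual code; the argument list is what the PHP call `eval($luaScript, [$key, $guid], 1)` sends.

```python
from typing import List


def eval_routing_key(argv: List[str]) -> str:
    """Return the key a first-key-routing proxy would use for an EVAL command.

    argv is the raw command: ["EVAL", script, numkeys, key1, ..., arg1, ...].
    """
    if argv[0].upper() != "EVAL":
        raise ValueError("not an EVAL command")
    numkeys = int(argv[2])
    if numkeys < 1:
        # Mirrors the rule that at least one key must follow the script
        raise ValueError("EVAL needs at least one key to be routable")
    return argv[3]


script = ("if redis.call('get', KEYS[1]) == ARGV[1] then "
          "return redis.call('del', KEYS[1]) else return 0 end")
lock_key, guid = "xxxxxdoc:1", "hypothetical-guid"
argv = ["EVAL", script, "1", lock_key, guid]
print(eval_routing_key(argv) == lock_key)  # True: unlock reaches the lock key's shard
```

Because the lock key is `KEYS[1]`, the unlock script lands on the same shard that handled the earlier SET.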

However, our cluster currently uses an eventually consistent setup: the master lives in a single region, slaves are spread across multiple regions, and all writes go to the master's region.

Doesn't that mean every write request crosses regions? I optimized for this by adding a read step before the write: reads are served within the local region and only writes have to cross regions, and for more than 99% of requests the master-slave replication delay is small. (The 99% figure is my own guess.)

The specific code is as follows:

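    public function lock(){
        // First use EXISTS to check whether the key is already set (a local read, no cross-region hop)
        if ($this->_objRedis->exists($this->_redisKey)) {
            // If the key exists, treat the lock as taken and throw
            throw new Exception("get Lock Failed! Locking", Constants_ErrorCodes::LOCK_IS_USED);
        }
        // If it does not exist, that does not prove the lock is free: it may be master-slave lag.
        // At this point it is safer to reuse the SET NX EX code above; the pre-check
        // saves one cross-datacenter write whenever the lock is already visibly held.
    }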
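The effect of this read-first check under replication lag can be modeled in Python. `Region` is a hypothetical model with a remote master and a local replica that only catches up when `replicate()` is called; TTLs are omitted for brevity. The master's SET NX stays authoritative, and the local read only saves a cross-region write when the key is already visible.

```python
from typing import Dict


class Region:
    """Hypothetical model: a remote master plus a local replica that lags behind."""

    def __init__(self) -> None:
        self.master: Dict[str, str] = {}
        self.replica: Dict[str, str] = {}
        self.cross_region_writes = 0

    def replicate(self) -> None:
        self.replica = dict(self.master)  # the replica catches up

    def set_nx_on_master(self, key: str, value: str) -> bool:
        self.cross_region_writes += 1  # every write has to reach the master region
        if key in self.master:
            return False
        self.master[key] = value
        return True


def lock(region: Region, key: str, guid: str) -> bool:
    # Local read first: a visible key is treated as held. If the copy is stale,
    # the worst case is a spurious rejection, never two holders.
    if key in region.replica:
        return False
    # An absent key may just be replication lag, so the master decides.
    return region.set_nx_on_master(key, guid)


r = Region()
print(lock(r, "lock:doc:1", "A"))  # True: written on the master
print(lock(r, "lock:doc:1", "C"))  # False: replica lags, but the master rejects C
r.replicate()
print(lock(r, "lock:doc:1", "Z"))  # False: rejected locally, no cross-region write
print(r.cross_region_writes)       # 2
```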

Points to note when using it:

  1. You must choose lockTime yourself, and it has to be longer than your transaction's execution time;
  2. When the caller fails to acquire the lock, it must decide whether to block, or to drop the request and let the client retry.
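For the second point, here is a minimal Python sketch of the blocking option: retry a non-blocking lock attempt until a deadline, then give up so the request can be dropped and retried by the client. `acquire_with_retry` and its parameters are hypothetical; `try_lock` could wrap the `lock()` call above, catching its LOCK_IS_USED exception and returning False. Setting `timeout=0` gives the "discard immediately" behavior.

```python
import time
from typing import Callable


def acquire_with_retry(try_lock: Callable[[], bool], timeout: float = 1.0,
                       interval: float = 0.05) -> bool:
    """Retry try_lock every `interval` seconds for up to `timeout` seconds.

    Returns False once the deadline passes, so the caller can drop the request.
    """
    deadline = time.monotonic() + timeout
    while True:
        if try_lock():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


attempts = iter([False, False, True])  # hypothetical: the lock frees up on the 3rd try
print(acquire_with_retry(lambda: next(attempts), timeout=1.0, interval=0.01))  # True
print(acquire_with_retry(lambda: False, timeout=0.05, interval=0.01))          # False
```

A deadline keeps a blocked request from waiting indefinitely; it is worth keeping it well below the lock's TTL and the client's own timeout.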

Open issues:

  1. If your process is suspended because the CPU is overloaded, and the pause lasts longer than the lock's expiration time, will things still go wrong?
  2. What happens if a shard dies in cluster mode?
  3. Do you have a solution? Feel free to leave a comment and discuss.

Summary

To sum up, the main conclusions of this article are:

  1. A distributed lock is a lock required by a distributed business environment; the service that provides the lock does not itself have to be distributed;
  2. A lock is essentially a resource coordinator that manages control over resources under concurrency;
  3. Choosing a solution is like investing: weigh the input against the output;
  4. The stand-alone and cluster Redis setups each have their own optimization points; optimize according to your scenario;
  5. After finishing the article I realized the title is not quite right; a more accurate one would be "Thoughts on Implementing Distributed Locks with Redis". If the title misled you, please let me know.


Origin http://43.154.161.224:23101/article/api/json?id=325736966&siteId=291194637