Redis command execution process

So far we have covered the overall flow of Redis command execution and analyzed in detail how Redis goes from startup to establishing socket connections, reading socket data into the input buffer, parsing commands, and executing them. Next, let's look at how the set and get commands are implemented and how command results are sent back to the Redis client through the output buffer and the socket.

The specific implementation of set and get commands

As mentioned earlier, processCommand looks up the redisCommand that matches the command parsed from the input buffer, and then calls the call function, which invokes that redisCommand's proc function. Each command has its own proc: for the redisCommand named set it is setCommand, and for get it is getCommand. Dispatching through a function pointer this way achieves in C what polymorphism commonly does in Java.

void call(client *c, int flags) {
    ....
    c->cmd->proc(c);
    ....
}
// Alias defined with typedef
typedef void redisCommandProc(client *c);
// The redisCommand struct
struct redisCommand {
    char *name;
    // Function pointer to the command's implementation
    redisCommandProc *proc;
    .... // other fields
};
// Each command maps to a different function.
struct redisCommand redisCommandTable[] = {
    {"get",getCommand,2,"rF",0,NULL,1,1,1,0,0},
    {"set",setCommand,-3,"wm",0,NULL,1,1,1,0,0},
    {"hmset",hsetCommand,-4,"wmF",0,NULL,1,1,1,0,0},
    .... // every Redis command has an entry
};
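
The dispatch described above can be sketched in a few lines of standalone C. This is a minimal illustration, not Redis source: demoClient, demoCommand, demoLookupCommand and demoDispatch are hypothetical names, the table is searched linearly for brevity, and the arity convention (a positive value means an exact argument count, a negative value means a minimum) is assumed from the 2 and -3 entries in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical, simplified stand-ins for Redis's client and redisCommand. */
typedef struct demoClient {
    int argc;
    const char **argv;
    const char *reply;      /* last reply "sent" to this client */
} demoClient;

typedef void demoCommandProc(demoClient *c);

typedef struct demoCommand {
    const char *name;
    demoCommandProc *proc;
    int arity;              /* >0: exact argc; <0: at least -arity */
} demoCommand;

static void demoGet(demoClient *c) { c->reply = "get handler"; }
static void demoSet(demoClient *c) { c->reply = "set handler"; }

static demoCommand demoCommandTable[] = {
    {"get", demoGet, 2},
    {"set", demoSet, -3},
};

/* Case-insensitive linear lookup over the toy table. */
static demoCommand *demoLookupCommand(const char *name) {
    size_t n = sizeof(demoCommandTable) / sizeof(demoCommandTable[0]);
    for (size_t i = 0; i < n; i++)
        if (strcasecmp(demoCommandTable[i].name, name) == 0)
            return &demoCommandTable[i];
    return NULL;
}

/* Returns 0 when a handler ran, -1 when the command was rejected. */
static int demoDispatch(demoClient *c) {
    demoCommand *cmd = c->argc > 0 ? demoLookupCommand(c->argv[0]) : NULL;
    if (cmd == NULL) { c->reply = "ERR unknown command"; return -1; }
    if ((cmd->arity > 0 && cmd->arity != c->argc) || c->argc < -cmd->arity) {
        c->reply = "ERR wrong number of arguments";
        return -1;
    }
    cmd->proc(c);           /* same shape as c->cmd->proc(c) in call() */
    return 0;
}
```

Calling demoDispatch with argv = {"GET", "k"} runs demoGet, while {"set", "k"} is rejected because set needs at least three arguments. Adding a command only means adding a table row, which is the point of the function-pointer design.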

setCommand checks whether the set command carries optional arguments such as NX, XX, EX, or PX, and then calls the setGenericCommand function. Let's look directly at setGenericCommand.

The processing logic of the setGenericCommand method is as follows:

  • First, check whether the set is flagged NX or XX: with NX, if the key already exists, return immediately; with XX, if the key does not exist, return immediately.
  • Call setKey to add the key-value pair to the corresponding Redis database.
  • If an expiration time was given, call setExpire to set it.
  • Send keyspace notifications.
  • Return the corresponding reply to the client.

// t_string.c
void setGenericCommand(client *c, int flags, robj *key, robj *val, robj *expire, int unit, robj *ok_reply, robj *abort_reply) {
    long long milliseconds = 0; 
    /**
     * An expiration time was given; expire is an robj, so extract its integer value
     */
    if (expire) {
        if (getLongLongFromObjectOrReply(c, expire, &milliseconds, NULL) != C_OK)
            return;
        if (milliseconds <= 0) {
            addReplyErrorFormat(c,"invalid expire time in %s",c->cmd->name);
            return;
        }
        if (unit == UNIT_SECONDS) milliseconds *= 1000;
    }
    /**
     * NX: return immediately if the key exists; XX: return immediately if it does not
     * lookupKeyWrite checks whether the key exists in the given database
     */
    if ((flags & OBJ_SET_NX && lookupKeyWrite(c->db,key) != NULL) ||
        (flags & OBJ_SET_XX && lookupKeyWrite(c->db,key) == NULL))
    {
        addReply(c, abort_reply ? abort_reply : shared.nullbulk);
        return;
    }
    /**
     * Add to the data dict
     */
    setKey(c->db,key,val);
    server.dirty++;
    /**
     * Add the expiration time to the expires dict
     */
    if (expire) setExpire(c,c->db,key,mstime()+milliseconds);
    /**
     * Keyspace notification
     */
    notifyKeyspaceEvent(NOTIFY_STRING,"set",key,c->db->id);
    if (expire) notifyKeyspaceEvent(NOTIFY_GENERIC,
        "expire",key,c->db->id);
    /**
     * Reply to the client; addReply is explained in detail later, after the get command
     */
    addReply(c, ok_reply ? ok_reply : shared.ok);
}

We won't go into the details of setKey and setExpire here. In essence, setKey adds the key-value pair to the db's dict hash table, and setExpire adds the key and its expiration time to the expires hash table, as shown in the figure below.
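
As a rough sketch of the steps above, the toy store below keeps values and expiration times side by side and applies the NX and XX checks before writing. It rests on several assumptions: toyDb, toyFind and toySet are hypothetical names, parallel arrays stand in for the dict and expires hash tables, and a plain overwrite is assumed to clear any earlier expiration.

```c
#include <assert.h>
#include <string.h>

#define TOY_SET_NX 1        /* set only if the key does not exist */
#define TOY_SET_XX 2        /* set only if the key already exists */
#define TOY_CAPACITY 16

/* Hypothetical toy database: parallel arrays stand in for the dict and
 * expires hash tables of redisDb. expire_at < 0 means "no expiration". */
typedef struct toyDb {
    const char *keys[TOY_CAPACITY];
    const char *vals[TOY_CAPACITY];
    long long expire_at[TOY_CAPACITY];
    int used;
} toyDb;

static int toyFind(const toyDb *db, const char *key) {
    for (int i = 0; i < db->used; i++)
        if (strcmp(db->keys[i], key) == 0) return i;
    return -1;
}

/* Returns 1 if the value was stored, 0 if NX/XX aborted the write or the
 * table is full. ttl_ms <= 0 means the key gets no expiration. */
static int toySet(toyDb *db, int flags, const char *key, const char *val,
                  long long now_ms, long long ttl_ms) {
    int i = toyFind(db, key);
    if ((flags & TOY_SET_NX) && i >= 0) return 0;
    if ((flags & TOY_SET_XX) && i < 0) return 0;
    if (i < 0) {
        if (db->used == TOY_CAPACITY) return 0;
        i = db->used++;
        db->keys[i] = key;
    }
    db->vals[i] = val;                          /* role of setKey() */
    /* role of setExpire(); assumed: no TTL clears an older one */
    db->expire_at[i] = ttl_ms > 0 ? now_ms + ttl_ms : -1;
    return 1;
}
```

With an empty toyDb, toySet with TOY_SET_XX fails, toySet with TOY_SET_NX and a 500 ms TTL at time 1000 stores the key with expire_at 1500, and a second TOY_SET_NX call on the same key is refused.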

Next, let's look at getCommand. Like setCommand, it delegates to an underlying generic function, getGenericCommand.

getGenericCommand calls lookupKeyReadOrReply to look up the key in the dict hash table. If the key is not found, lookupKeyReadOrReply has already sent the null reply to the client and the function returns C_OK. If the key is found but its value is not a string, it replies with a wrong-type error and returns C_ERR; otherwise it calls addReplyBulk to add the value to the output buffer.

int getGenericCommand(client *c) {
    robj *o;
    // Call lookupKeyReadOrReply to look up the key in the data dict
    if ((o = lookupKeyReadOrReply(c,c->argv[1],shared.nullbulk)) == NULL)
        return C_OK;
    // If the value is not a string, reply with a wrong-type error; otherwise return it as a bulk reply via addReplyBulk
    if (o->type != OBJ_STRING) {
        addReply(c,shared.wrongtypeerr);
        return C_ERR;
    } else {
        addReplyBulk(c,o);
        return C_OK;
    }
}

lookupKeyReadOrReply goes through lookupKeyRead to lookupKeyReadWithFlags, which looks up the key-value pair in redisDb. It first calls expireIfNeeded to determine whether the key has expired and needs to be deleted. If the key is still valid, it calls lookupKey to find the value in the dict hash table and return it. The comments in the code explain the details.

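/*
 * Look up a key for a read operation. Returns NULL if the key is not found or has logically expired. Side effects:
 *   1 If the key has reached its expiration time, it is marked as expired and deleted
 *   2 The key's last access time is updated
 *   3 The global keyspace hit/miss statistics are updated
 * flags takes one of two values: LOOKUP_NONE (the usual case) or LOOKUP_NOTOUCH (do not modify the last access time)
 */
robj *lookupKeyReadWithFlags(redisDb *db, robj *key, int flags) { // db.c
    robj *val;
    // Check whether the key has expired
    if (expireIfNeeded(db,key) == 1) {
        .... // special handling of this case on master and slave
    }
    // Look the key up in the dict
    val = lookupKey(db,key,flags);
    // Update the global keyspace hit/miss statistics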
    if (val == NULL)
        server.stat_keyspace_misses++;
    else
        server.stat_keyspace_hits++;
    return val;
}

Before calling any of the key lookup functions, Redis calls expireIfNeeded to determine whether the key has expired, and then deletes it synchronously or asynchronously depending on whether lazy deletion (lazyfree_lazy_expire) is configured. For details about key deletion, please refer to the article "Detailed Explanation of Redis Memory Management Mechanism and Implementation".

There are two special cases in the logic that decides whether a key has expired:

  • If the current Redis is a slave instance in a master-slave setup, it only checks whether the key has expired and does not delete it directly; it waits for the delete command sent by the master instance. If the current Redis is the master instance, it calls propagateExpire to propagate the expiration.
  • If a Lua script is currently executing, then because scripts are atomic and transactional, expiration is evaluated against the moment the script started for its entire run. In other words, a key that had not expired when the script started will not expire at any point during its execution.

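/*
 * Called before the lookupKey* family of functions.
 * On a slave:
 *  the slave does not actively delete expired keys, but the return value still reports the key as expired.
 * On a master:
 *  if the key has expired, the master deletes it and triggers AOF and replication propagation.
 * Returns 0 if the key is still valid, otherwise 1
 */
int expireIfNeeded(redisDb *db, robj *key) { // db.c
    // Get the key's expiration time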
    mstime_t when = getExpire(db,key);
    mstime_t now;

    if (when < 0) return 0;

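    /*
     * While a Lua script is running, because it executes atomically, expiration is evaluated against the time the script started.
     * In other words, a key that had not expired when the script started will not expire during its execution.
     */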
    now = server.lua_caller ? server.lua_time_start : mstime();

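    // On a slave, just report whether the key has expired
    if (server.masterhost != NULL) return now > when;
    // On a master, return immediately if the key has not expired
    if (now <= when) return 0;

    // The key has expired: delete it
    server.stat_expiredkeys++;
    // Propagate the expiration
    propagateExpire(db,key,server.lazyfree_lazy_expire);
    // and fire the keyspace event
    notifyKeyspaceEvent(NOTIFY_EXPIRED,
        "expired",key,db->id);
    // Call the asynchronous or synchronous delete depending on the lazy-free setting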
    return server.lazyfree_lazy_expire ? dbAsyncDelete(db,key) :
                                         dbSyncDelete(db,key);
}
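
The decision logic of expireIfNeeded, including the two special cases, can be reduced to the following sketch. It is illustrative only: demoServer and its fields are hypothetical stand-ins for the server state used above, the clock is passed in as a parameter so the behaviour is deterministic, and deletion is modelled as a counter rather than a real delete.

```c
#include <assert.h>

/* Hypothetical, reduced server state; the field names are illustrative. */
typedef struct demoServer {
    int is_replica;             /* stands in for server.masterhost != NULL */
    int in_script;              /* stands in for server.lua_caller != NULL */
    long long script_start_ms;  /* stands in for server.lua_time_start */
    long long deleted;          /* how many keys this node deleted itself */
} demoServer;

/* Returns 1 when the key must be treated as expired, 0 otherwise.
 * when_ms < 0 means the key has no expiration. */
static int demoExpireIfNeeded(demoServer *s, long long when_ms,
                              long long clock_ms) {
    if (when_ms < 0) return 0;
    /* Inside a script, time is frozen at the moment the script started. */
    long long now = s->in_script ? s->script_start_ms : clock_ms;
    /* A slave only reports expiry; it waits for the master's delete. */
    if (s->is_replica) return now > when_ms;
    if (now <= when_ms) return 0;
    s->deleted++;               /* master: delete (and propagate) */
    return 1;
}
```

For a key that expires at 50 ms and a clock of 100 ms, a master reports 1 and deletes, a slave reports 1 without deleting, and a master running a script that started at 40 ms reports 0 for the whole script.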

lookupKey uses dictFind to look up the key in the dict hash table of redisDb. If the key is found, then depending on the maxmemory_policy setting it either calls updateLFU to update the access-frequency data or updates the lru field with the latest access time. These indicators are used to choose which keys to evict later, when memory runs short.

robj *lookupKey(redisDb *db, robj *key, int flags) {
    // dictFind fetches the dict entry for key
    dictEntry *de = dictFind(db->dict,key->ptr);
    if (de) {
        // Get the value
        robj *val = dictGetVal(de);
        // Only when no RDB/AOF child process is running and LOOKUP_NOTOUCH is not set
        if (server.rdb_child_pid == -1 &&
            server.aof_child_pid == -1 &&
            !(flags & LOOKUP_NOTOUCH))
        {
            // Under an LFU policy, update the access frequency
            if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
                updateLFU(val);
            } else {
                // Update the last access time
                val->lru = LRU_CLOCK();
            }
        }
        return val;
    } else {
        return NULL;
    }
}
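
The "touch on lookup" rule in lookupKey can be isolated as below. This is a sketch under assumptions: demoObj and demoTouch are hypothetical, and the saturating linear counter is only a placeholder, since the article does not show how updateLFU actually adjusts its counter. The metadata is left alone while a child process is running, presumably so that reads do not dirty memory pages shared with the child.

```c
#include <assert.h>

#define DEMO_LOOKUP_NOTOUCH 1   /* hypothetical flag value */

typedef struct demoObj {
    unsigned last_access;   /* stands in for the lru field under LRU */
    unsigned freq;          /* stands in for the LFU counter */
} demoObj;

/* Update eviction metadata on a successful lookup. The linear, saturating
 * counter is a placeholder and not how updateLFU works. */
static void demoTouch(demoObj *o, int flags, int child_running,
                      int lfu_policy, unsigned clock) {
    if (child_running || (flags & DEMO_LOOKUP_NOTOUCH)) return;
    if (lfu_policy) {
        if (o->freq < 255) o->freq++;
    } else {
        o->last_access = clock;
    }
}
```

A lookup with no flags under an LRU policy records the clock value; the same lookup with DEMO_LOOKUP_NOTOUCH, or while a child process runs, leaves the object untouched.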

Write command result to output buffer

At the end of almost every redisCommand, addReply is called to return the result. With that, our analysis reaches the stage of command execution where data is returned to the client.

The addReply method does two things:

  • prepareClientToWrite decides whether data needs to be returned, and adds the current client to the queue of clients waiting to have replies written.
  • _addReplyToBuffer and _addReplyObjectToList write the return value into the output buffer, where it waits to be written to the socket.

void addReply(client *c, robj *obj) {
    if (prepareClientToWrite(c) != C_OK) return;
    if (sdsEncodedObject(obj)) {
        // Add the reply to the output buffer: try the fixed buffer first and, if that fails, append to the reply list
        if (_addReplyToBuffer(c,obj->ptr,sdslen(obj->ptr)) != C_OK)
            _addReplyObjectToList(c,obj);
    } else if (obj->encoding == OBJ_ENCODING_INT) {
        .... // optimization for a special case
    } else {
        serverPanic("Wrong obj->encoding in addReply()");
    }
}

prepareClientToWrite first determines whether the current client needs data returned:

  • A fake client used to run a Lua script (or a module) always takes the reply;
  • If the client sent CLIENT REPLY OFF or CLIENT REPLY SKIP, no reply is sent;
  • If the client is the master instance in master-slave replication, no reply is sent (unless CLIENT_MASTER_FORCE_REPLY is set);
  • If it is the fake client used while loading the AOF, no reply is sent.

Then, if the client is not already flagged as waiting to be written (CLIENT_PENDING_WRITE), it is given that flag and added to the queue of clients waiting to have their replies written, the clients_pending_write queue.

int prepareClientToWrite(client *c) {
    // Lua (and module) clients always return OK
    if (c->flags & (CLIENT_LUA|CLIENT_MODULE)) return C_OK;
    // The client sent CLIENT REPLY OFF or SKIP, so no reply should be sent
    if (c->flags & (CLIENT_REPLY_OFF|CLIENT_REPLY_SKIP)) return C_ERR;

    // A master acting as a client of this slave does not expect replies
    if ((c->flags & CLIENT_MASTER) &&
        !(c->flags & CLIENT_MASTER_FORCE_REPLY)) return C_ERR;
    // The fake client used while loading the AOF needs no reply
    if (c->fd <= 0) return C_ERR;

    // Add the client to the pending-write queue; the reply is written in the next event cycle.
    if (!clientHasPendingReplies(c) &&
        !(c->flags & CLIENT_PENDING_WRITE) &&
        (c->replstate == REPL_STATE_NONE ||
         (c->replstate == SLAVE_STATE_ONLINE && !c->repl_put_online_on_ack)))
    {
        // Set the flag and add the client to the clients_pending_write queue
        c->flags |= CLIENT_PENDING_WRITE;
        listAddNodeHead(server.clients_pending_write,c);
    }
    // The client is queued (or already was); the caller may append reply data
    return C_OK;
}
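
The flag in prepareClientToWrite exists so that a client is queued at most once, however many replies it accumulates in one cycle. A minimal sketch of that idea, assuming a hypothetical fixed-size queue (demoQueue) and a single illustrative flag bit in place of the real client flags:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PENDING_WRITE (1 << 0)   /* hypothetical flag bit */
#define DEMO_MAX_PENDING 8

typedef struct demoClient {
    int flags;
    int fd;                 /* <= 0 marks a fake client with no socket */
} demoClient;

typedef struct demoQueue {
    demoClient *items[DEMO_MAX_PENDING];
    size_t len;
} demoQueue;

/* Returns 0 if the caller may append reply data, -1 if no reply is wanted. */
static int demoPrepareToWrite(demoQueue *q, demoClient *c) {
    if (c->fd <= 0) return -1;
    if (!(c->flags & DEMO_PENDING_WRITE)) {
        if (q->len == DEMO_MAX_PENDING) return -1;   /* toy queue is full */
        c->flags |= DEMO_PENDING_WRITE;
        q->items[q->len++] = c;
    }
    return 0;
}
```

Calling demoPrepareToWrite twice for the same client leaves the queue length at 1, and a client with fd -1 is never queued.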

Redis splits the output buffer, the space that holds response data waiting to be returned, into two parts: a fixed-size buffer and a linked list of response data. When the linked list is empty and the buffer has enough space, the response is added to the buffer. Otherwise a node is created and appended to the linked list. _addReplyToBuffer and _addReplyObjectToList write data to these two areas respectively.

Together, the fixed buffer and the response list form a single queue. The advantage of this layout is that it saves memory: there is no need to pre-allocate a large block up front, and frequent allocation and freeing of memory is avoided.
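
The two-part layout can be sketched as follows. The names (demoOut, demoAddToBuffer, demoAddToList, demoAddReply) are hypothetical and the 16-byte buffer is deliberately tiny so the overflow path is easy to trigger; the rule that the fixed buffer is used only while the list is empty is what keeps replies in order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_BUF_SIZE 16    /* tiny on purpose, to force the overflow path */

typedef struct demoNode {
    struct demoNode *next;
    size_t len;
    char data[];            /* C99 flexible array member */
} demoNode;

typedef struct demoOut {
    char buf[DEMO_BUF_SIZE];
    size_t bufpos;
    demoNode *head, *tail;  /* overflow list, in send order */
    size_t list_len;
} demoOut;

/* Role of _addReplyToBuffer: succeed only if the list is empty (to keep
 * ordering) and the bytes fit in the remaining space. */
static int demoAddToBuffer(demoOut *o, const char *s, size_t len) {
    if (o->list_len > 0) return -1;
    if (len > sizeof(o->buf) - o->bufpos) return -1;
    memcpy(o->buf + o->bufpos, s, len);
    o->bufpos += len;
    return 0;
}

/* Role of _addReplyObjectToList: append a new tail node. */
static int demoAddToList(demoOut *o, const char *s, size_t len) {
    demoNode *n = malloc(sizeof(*n) + len);
    if (n == NULL) return -1;
    n->next = NULL;
    n->len = len;
    memcpy(n->data, s, len);
    if (o->tail) o->tail->next = n; else o->head = n;
    o->tail = n;
    o->list_len++;
    return 0;
}

/* Try the fixed buffer first, fall back to the list. */
static int demoAddReply(demoOut *o, const char *s) {
    size_t len = strlen(s);
    if (demoAddToBuffer(o, s, len) == 0) return 0;
    return demoAddToList(o, s, len);
}

static void demoFreeOut(demoOut *o) {
    while (o->head) { demoNode *n = o->head; o->head = n->next; free(n); }
    o->tail = NULL;
    o->list_len = 0;
}
```

A short reply such as "+OK\r\n" lands in the fixed buffer; a reply that does not fit goes to the list, and every later reply follows it to the list even if it would fit in the buffer.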

That covers how the response is written to the output buffer. Now let's look at how data is written from the output buffer to the socket.

prepareClientToWrite adds the client to the clients_pending_write queue, the queue of clients waiting to have their replies written. At that point the event handler for the request is finished; the response is written from the output buffer to the socket when Redis reaches the next event cycle.

Write command return value from output buffer to socket

From the article "Detailed Explanation of Redis Event Mechanism", we know that Redis calls the beforeSleep function to do some housekeeping between two event loop iterations, and processing the clients_pending_write list is part of it.

The aeMain function below is the main logic of the Redis event loop; you can see that beforesleep is called on every iteration of the loop.

void aeMain(aeEventLoop *eventLoop) { // ae.c
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        /* If a function is registered to run before event processing, run it */
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);
        /* Process events */
        aeProcessEvents(eventLoop, AE_ALL_EVENTS|AE_CALL_AFTER_SLEEP);
    }
}
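
To make the role of the hook concrete, here is a stripped-down loop with the same shape as aeMain. Everything in it is hypothetical (demoLoop, demoFlushHook, demoProcessEvents); the fake workload handles one event per iteration so the number of hook calls is easy to predict.

```c
#include <assert.h>
#include <stddef.h>

typedef struct demoLoop demoLoop;
typedef void demoHook(demoLoop *loop);

struct demoLoop {
    int stop;
    int remaining_events;   /* fake workload: one event per iteration */
    int flushed;            /* times the before-sleep hook ran */
    demoHook *beforesleep;
};

/* Hook standing in for beforeSleep(): where pending replies get flushed. */
static void demoFlushHook(demoLoop *loop) { loop->flushed++; }

/* Stand-in for aeProcessEvents(): handle one event, stop when none remain. */
static void demoProcessEvents(demoLoop *loop) {
    if (loop->remaining_events > 0) loop->remaining_events--;
    if (loop->remaining_events == 0) loop->stop = 1;
}

static void demoMain(demoLoop *loop) {
    loop->stop = 0;
    while (!loop->stop) {
        if (loop->beforesleep != NULL) loop->beforesleep(loop);
        demoProcessEvents(loop);
    }
}
```

With three fake events queued, the hook runs three times, once before each round of event processing, which is why replies queued while handling one round are flushed at the start of the next.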

The beforeSleep function calls the handleClientsWithPendingWrites function to process the clients_pending_write list.

handleClientsWithPendingWrites traverses the clients_pending_write list. For each client, it first calls writeToClient to try to write the reply data from the output buffer to the socket. If some data is still unwritten afterwards, it has to call aeCreateFileEvent to register sendReplyToClient as a write event handler and wait for the Redis event mechanism to invoke it.

The advantage is that for clients with little data to return, there is no need to go to the trouble of registering a write event and waiting for it to fire before writing to the socket. Instead, the data is written straight to the socket in the next event cycle, which speeds up the response.

However, this also shows that if the clients_pending_write queue grows too long, processing it takes too long, which blocks normal event handling and increases the latency of subsequent Redis commands.

// Write replies straight to the client sockets where possible, so that no write event handler has to be registered
int handleClientsWithPendingWrites(void) {
    listIter li;
    listNode *ln;
    // Length of the pending-write queue
    int processed = listLength(server.clients_pending_write);

    listRewind(server.clients_pending_write,&li);
    // Process each client in turn
    while((ln = listNext(&li))) {
        client *c = listNodeValue(ln);
        c->flags &= ~CLIENT_PENDING_WRITE;
        listDelNode(server.clients_pending_write,ln);

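        // Try to write the buffered reply to the client's socket; on error the client was freed, so skip it
        if (writeToClient(c->fd,c,0) == C_ERR) continue;

        // Data is still unwritten, so a write event handler has to be registered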
        if (clientHasPendingReplies(c)) {
            int ae_flags = AE_WRITABLE;
            if (server.aof_state == AOF_ON &&
                server.aof_fsync == AOF_FSYNC_ALWAYS)
            {
                ae_flags |= AE_BARRIER;
            }
            // Register the write event handler sendReplyToClient and wait for it to fire
            if (aeCreateFileEvent(server.el, c->fd, ae_flags,
                sendReplyToClient, c) == AE_ERR)
            {
                freeClientAsync(c);
            }
        }
    }
    return processed;
}
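
The fast path and slow path of handleClientsWithPendingWrites can be modelled without any sockets. In this sketch, which is an illustration and not Redis code, each hypothetical demoClient records how many bytes are pending and how many its fake socket accepts right now; a handler is "installed" only when the direct write leaves data behind.

```c
#include <assert.h>
#include <stddef.h>

typedef struct demoClient {
    size_t pending;         /* bytes still waiting in the output buffer */
    size_t sock_room;       /* bytes the fake socket accepts right now */
    int handler_installed;  /* 1 once a write event handler is registered */
} demoClient;

/* Stand-in for writeToClient(): send what the socket will take. */
static void demoWrite(demoClient *c) {
    size_t n = c->pending < c->sock_room ? c->pending : c->sock_room;
    c->pending -= n;
    c->sock_room -= n;
}

/* Drain the queue once; returns how many clients were visited. */
static size_t demoHandlePending(demoClient **queue, size_t len) {
    for (size_t i = 0; i < len; i++) {
        demoClient *c = queue[i];
        demoWrite(c);                       /* fast path: no event needed */
        if (c->pending > 0)
            c->handler_installed = 1;       /* slow path: wait for writable */
    }
    return len;
}
```

A client with 10 pending bytes is fully served on the fast path, while a client with 500 pending bytes and room for 100 ends up with 400 bytes left and a handler installed.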

sendReplyToClient simply calls writeToClient, which writes as much of the data in buf and the reply list of the output buffer as it can to the corresponding socket.

// Write the output buffer to the socket. Returns C_OK unless the client was freed, in which case it returns C_ERR
int writeToClient(int fd, client *c, int handler_installed) {
    ssize_t nwritten = 0, totwritten = 0;
    size_t objlen;
    sds o;
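    // While there is still unwritten data
    while(clientHasPendingReplies(c)) {
        // If the fixed buffer holds data
        if (c->bufpos > 0) {
            // Write it to the socket identified by fd
            nwritten = write(fd,c->buf+c->sentlen,c->bufpos-c->sentlen);
            if (nwritten <= 0) break;
            c->sentlen += nwritten;
            // Track how many bytes were written in this call
            totwritten += nwritten;

            // Everything in the buffer was sent: reset the positions so later reply data can use the buffer again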
            if ((int)c->sentlen == c->bufpos) {
                c->bufpos = 0;
                c->sentlen = 0;
            }
        } else {
            // The buffer is empty, so take the next node from the reply list
            o = listNodeValue(listFirst(c->reply));
            objlen = sdslen(o);

            if (objlen == 0) {
                listDelNode(c->reply,listFirst(c->reply));
                continue;
            }
            // Write the node's data to the socket
            nwritten = write(fd, o + c->sentlen, objlen - c->sentlen);
            if (nwritten <= 0) break;
            c->sentlen += nwritten;
            totwritten += nwritten;
            // If the node was written completely, remove it from the list
            if (c->sentlen == objlen) {
                listDelNode(c->reply,listFirst(c->reply));
                c->sentlen = 0;
                c->reply_bytes -= objlen;
                if (listLength(c->reply) == 0)
                    serverAssert(c->reply_bytes == 0);
            }
        }
        // Stop once more than NET_MAX_WRITES_PER_EVENT bytes were written, unless memory is over the maxmemory limit or the client is a slave
        if (totwritten > NET_MAX_WRITES_PER_EVENT &&
            (server.maxmemory == 0 ||
             zmalloc_used_memory() < server.maxmemory) &&
            !(c->flags & CLIENT_SLAVE)) break;
    }
    server.stat_net_output_bytes += totwritten;
    if (nwritten == -1) {
        if (errno == EAGAIN) {
            nwritten = 0;
        } else {
            serverLog(LL_VERBOSE,
                "Error writing to client: %s", strerror(errno));
            freeClient(c);
            return C_ERR;
        }
    }
    if (!clientHasPendingReplies(c)) {
        c->sentlen = 0;
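        // Everything has been sent: remove the write event handler
        if (handler_installed) aeDeleteFileEvent(server.el,c->fd,AE_WRITABLE);
        // If the client is flagged CLIENT_CLOSE_AFTER_REPLY, close it now that all data is sent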
        if (c->flags & CLIENT_CLOSE_AFTER_REPLY) {
            freeClient(c);
            return C_ERR;
        }
    }
    return C_OK;
}
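
The key detail in writeToClient is that a write may accept only part of the data, so progress is remembered in sentlen and the next call resumes from there. The sketch below shows just that resumable loop for the fixed buffer. It is a simplified illustration: demoSink is a hypothetical in-memory sink that accepts a limited number of bytes, standing in for a socket whose send buffer fills up, and the reply list is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sink that accepts at most `chunk` bytes per call and stops
 * accepting once `room` is used up, like a socket with a full send buffer. */
typedef struct demoSink {
    char data[64];
    size_t len;
    size_t chunk;
    size_t room;
} demoSink;

static long demoSinkWrite(demoSink *s, const char *p, size_t n) {
    if (n > s->chunk) n = s->chunk;
    if (n > s->room) n = s->room;
    if (n > sizeof(s->data) - s->len) n = sizeof(s->data) - s->len;
    if (n == 0) return -1;                  /* would block */
    memcpy(s->data + s->len, p, n);
    s->len += n;
    s->room -= n;
    return (long)n;
}

typedef struct demoOut {
    const char *buf;
    size_t bufpos;          /* bytes of valid data in buf */
    size_t sentlen;         /* bytes of buf already written */
} demoOut;

/* Write as much as the sink accepts; returns total bytes written. Progress
 * is kept in sentlen so a later call resumes where this one stopped. */
static size_t demoFlush(demoOut *o, demoSink *s) {
    size_t total = 0;
    while (o->sentlen < o->bufpos) {
        long n = demoSinkWrite(s, o->buf + o->sentlen, o->bufpos - o->sentlen);
        if (n <= 0) break;
        o->sentlen += (size_t)n;
        total += (size_t)n;
    }
    /* Fully sent: reset both positions so the buffer can be reused. */
    if (o->sentlen == o->bufpos) o->bufpos = o->sentlen = 0;
    return total;
}
```

With 12 bytes buffered and a sink that has room for only 6, the first demoFlush call writes 6 bytes and leaves sentlen at 6; once the sink has room again, a second call writes the remaining 6 and resets the buffer positions.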

Origin blog.csdn.net/qq_38140936/article/details/103537690