Redis AOF persistence source code analysis

This article analyzes AOF persistence based on the source code of Redis 5.0.7.

Redis's AOF persistence has two main components:

  1. Synchronizing incremental write commands to disk
  2. Full rewrite of the AOF file

1. Incremental synchronization

1. Incremental write commands are appended to the buffer

Redis keeps an in-memory AOF buffer. Write commands that have not yet reached the disk are first appended to this buffer, and the buffer is written to disk once the flush conditions are met.

struct redisServer {
	// sds is the dynamic string type defined by Redis (a char array underneath)
	sds aof_buf;      /* AOF buffer, written before entering the event loop */
};

After executing each write command, Redis calls the propagate function, which appends the command to the aof_buf buffer.

void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc, int flags)
{
    if (server.aof_state != AOF_OFF && flags & PROPAGATE_AOF)
        // With AOF enabled, append the new command to aof_buf
        feedAppendOnlyFile(cmd,dbid,argv,argc);
    if (flags & PROPAGATE_REPL)
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}
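To make the append step concrete, here is a minimal sketch of a growable buffer that receives one command encoded as a RESP array of bulk strings, which is the on-disk format of AOF commands. The names demo_buf, demo_buf_append and demo_feed_command are hypothetical stand-ins and not Redis functions; the real feedAppendOnlyFile also handles details such as database selection that are left out here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for server.aof_buf: a growable byte buffer. */
typedef struct {
    char  *data;
    size_t len;
    size_t cap;
} demo_buf;

/* Append n bytes, growing the allocation when needed. Returns 0 on success. */
static int demo_buf_append(demo_buf *b, const char *src, size_t n) {
    if (b->len + n + 1 > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < b->len + n + 1) cap *= 2;
        char *p = realloc(b->data, cap);
        if (p == NULL) return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    b->data[b->len] = '\0';   /* keep the buffer printable as a C string */
    return 0;
}

/* Encode one command as "*argc\r\n$len\r\narg\r\n..." and append it. */
static int demo_feed_command(demo_buf *b, int argc, const char **argv) {
    assert(b != NULL && argv != NULL && argc > 0);
    char head[32];
    int n = snprintf(head, sizeof head, "*%d\r\n", argc);
    if (demo_buf_append(b, head, (size_t)n) != 0) return -1;
    for (int i = 0; i < argc; i++) {
        size_t arglen = strlen(argv[i]);
        n = snprintf(head, sizeof head, "$%zu\r\n", arglen);
        if (demo_buf_append(b, head, (size_t)n) != 0) return -1;
        if (demo_buf_append(b, argv[i], arglen) != 0) return -1;
        if (demo_buf_append(b, "\r\n", 2) != 0) return -1;
    }
    return 0;
}
```

Calling demo_feed_command with the arguments SET, key, value1 leaves the text `*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$6\r\nvalue1\r\n` in the buffer, ready to be handed to write later.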

2. The buffer data is flushed to the disk

There are three policies for syncing the AOF buffer to disk:

appendfsync always // fsync after every write to the AOF
appendfsync everysec // fsync once per second
appendfsync no // never call fsync explicitly; the operating system decides when to flush

Note: With the appendfsync no policy, Redis only calls write, which copies the buffer into the operating system's kernel buffer. When the kernel buffer actually reaches the disk is decided by the operating system.

Syncing once per second is a compromise. Synchronization calls the fsync system call, which involves a switch between user mode and kernel mode. The real disk I/O also happens here, so the call is expensive.

Redis calls flushAppendOnlyFile before each iteration of the event loop (in beforeSleep), and again from serverCron when an earlier flush was postponed:

void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;
    mstime_t latency;
	... ... 
    latencyStartMonitor(latency);
    // Call write to copy aof_buf into the file's kernel buffer
    nwritten = aofWrite(server.aof_fd,server.aof_buf,sdslen(server.aof_buf));
    latencyEndMonitor(latency);
   
   ... ...

try_fsync:
    /* Don't fsync if no-appendfsync-on-rewrite is set to yes and there are
     * children doing I/O in the background. */
    if (server.aof_no_fsync_on_rewrite &&
        (server.aof_child_pid != -1 || server.rdb_child_pid != -1))
            return;

    /* Perform the fsync if needed. */
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* redis_fsync is defined as fdatasync() for Linux in order to avoid
         * flushing metadata. */
        latencyStartMonitor(latency);
        // Call fsync directly to flush the kernel buffer to disk
        redis_fsync(server.aof_fd); /* Let's try to get this data on the disk */
        latencyEndMonitor(latency);
        latencyAddSampleIfNeeded("aof-fsync-always",latency);
        server.aof_fsync_offset = server.aof_current_size;
        server.aof_last_fsync = server.unixtime;
    } else if ((server.aof_fsync == AOF_FSYNC_EVERYSEC &&
                server.unixtime > server.aof_last_fsync)) {
        if (!sync_in_progress) {
            // Queue a job so that a background thread calls fsync and flushes the kernel buffer to disk
            aof_background_fsync(server.aof_fd);
            server.aof_fsync_offset = server.aof_current_size;
        }
        server.aof_last_fsync = server.unixtime;
    }
}
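The branch at the end of flushAppendOnlyFile can be distilled into a small pure function. This is only a sketch for reading the control flow: the enum names and demo_choose_sync are hypothetical, and the no-appendfsync-on-rewrite check and the bookkeeping of aof_last_fsync are left out.

```c
#include <assert.h>

/* Hypothetical mirrors of the three appendfsync policies. */
typedef enum { DEMO_FSYNC_NO, DEMO_FSYNC_EVERYSEC, DEMO_FSYNC_ALWAYS } demo_policy;

/* What the caller should do after write() has returned. */
typedef enum { DEMO_SKIP, DEMO_SYNC_INLINE, DEMO_SYNC_BACKGROUND } demo_action;

/* now and last_fsync are timestamps in whole seconds, like server.unixtime. */
demo_action demo_choose_sync(demo_policy policy, long now, long last_fsync,
                             int sync_in_progress) {
    switch (policy) {
    case DEMO_FSYNC_ALWAYS:
        /* fsync in the calling thread, every time */
        return DEMO_SYNC_INLINE;
    case DEMO_FSYNC_EVERYSEC:
        /* at most one background fsync per second, and never two at once */
        if (now > last_fsync && !sync_in_progress) return DEMO_SYNC_BACKGROUND;
        return DEMO_SKIP;
    case DEMO_FSYNC_NO:
        /* leave flushing to the operating system */
        return DEMO_SKIP;
    }
    assert(0 && "unknown policy");
    return DEMO_SKIP;
}
```

Separating the decision from the side effects makes it easy to see that only the always policy can block the main thread on disk I/O.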

With the everysec policy, the fsync is not run in the main thread. It is submitted as a job to a background thread, which calls fsync to synchronize the data.

/* Starts a background task that performs fsync() against the specified
 * file descriptor (the one of the AOF file) in another thread. */
void aof_background_fsync(int fd) {
    bioCreateBackgroundJob(BIO_AOF_FSYNC,(void*)(long)fd,NULL,NULL);
}

The background thread takes jobs from its job queue and executes them:

void *bioProcessBackgroundJobs(void *arg) {
    struct bio_job *job;
    unsigned long type = (unsigned long) arg;
    sigset_t sigset;

    while(1) {
        listNode *ln;
		
        // Take the job at the head of the job queue
        ln = listFirst(bio_jobs[type]);
        job = ln->value;
        /* It is now possible to unlock the background system as we know have
         * a stand alone job structure to process.*/
        pthread_mutex_unlock(&bio_mutex[type]);

        // Execute the job according to its type
        if (type == BIO_CLOSE_FILE) {
            // Close the file
            close((long)job->arg1);
        } else if (type == BIO_AOF_FSYNC) {
            // Call fsync to flush the data to disk
            redis_fsync((long)job->arg1);
        } else if (type == BIO_LAZY_FREE) {
            if (job->arg1)
                lazyfreeFreeObjectFromBioThread(job->arg1);
            else if (job->arg2 && job->arg3)
                lazyfreeFreeDatabaseFromBioThread(job->arg2,job->arg3);
            else if (job->arg3)
                lazyfreeFreeSlotsMapFromBioThread(job->arg3);
        } else {
            serverPanic("Wrong job type in bioProcessBackgroundJobs().");
        }
        // Free the object that holds the job
        zfree(job);

        pthread_mutex_lock(&bio_mutex[type]);
        // Remove the job at the head of the queue
        listDelNode(bio_jobs[type],ln);
        bio_pending[type]--;

        /* Unblock threads blocked on bioWaitStepOfType() if any. */
        pthread_cond_broadcast(&bio_step_cond[type]);
    }
}

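Many posts say that Redis is single-threaded, but that is not accurate. Redis uses several processes and threads (child processes for AOF rewrites and RDB saves, background threads for jobs such as fsync). Only the execution of read and write commands happens in a single thread.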

The difference between write system call and fsync

The write system call does not guarantee that the data has reached the disk; it only copies the data into the kernel buffer. If the machine loses power before the kernel buffer is flushed, the data is lost. Calling write alone is therefore not enough for durability, and fsync is needed to force the data to disk.
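The difference can be shown with plain POSIX calls. The sketch below appends a string to a file and returns only after fsync has succeeded; demo_durable_append is a hypothetical helper, not part of Redis, and the file path is chosen by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Append text to path and force it to stable storage. Returns 0 on success. */
int demo_durable_append(const char *path, const char *text) {
    assert(path != NULL && text != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) return -1;

    size_t len = strlen(text), off = 0;
    while (off < len) {
        /* write() only copies into the kernel buffer and may be partial */
        ssize_t n = write(fd, text + off, len - off);
        if (n == -1) {
            if (errno == EINTR) continue;   /* interrupted, retry */
            close(fd);
            return -1;
        }
        off += (size_t)n;
    }

    /* fsync() blocks until the kernel has pushed the data to the device */
    if (fsync(fd) == -1) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Dropping the fsync call would make the function faster, but a power failure shortly after it returns could then lose the appended data.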
Similar to Redis's AOF mechanism, MySQL also appends write operations to its binlog. When a slave is configured for a MySQL master, the slave replays the master's binlog to restore the data and stay consistent with the master.

2. Full rewrite of the AOF file

The full rewrite process

[Figure: the full rewrite process]

Why a full rewrite is needed

As write operations accumulate, the AOF file keeps growing, and much of its content becomes redundant. Suppose there is a sequence of operations like this:

set key value1
set key value2
set key value3 
... ...
set key valueN

Then the AOF file only needs to keep the last command, set key valueN, in order to restore the database.
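The effect of the rewrite on such a log can be sketched as follows. Note that Redis does not replay the old file; it dumps the current in-memory state, which produces the same result. The types and the function demo_compact are hypothetical and only handle SET on a small fixed-size table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_KEYS 16

/* One "set key value" entry of the log. */
typedef struct {
    const char *key;
    const char *value;
} demo_set_cmd;

/* Collapse log into out, keeping one entry per key with its last value.
 * out must have room for DEMO_MAX_KEYS entries.
 * Returns the number of surviving entries, or -1 if there are too many keys. */
int demo_compact(const demo_set_cmd *log, size_t n, demo_set_cmd *out) {
    assert(log != NULL && out != NULL);
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        int j;
        /* look for an existing slot for this key */
        for (j = 0; j < count; j++) {
            if (strcmp(out[j].key, log[i].key) == 0) break;
        }
        if (j == count) {
            if (count == DEMO_MAX_KEYS) return -1;
            out[count++].key = log[i].key;   /* first time this key is seen */
        }
        out[j].value = log[i].value;         /* later writes overwrite earlier ones */
    }
    return count;
}
```

For N writes to the same key, the compacted output has a single entry, which is why the rewritten file can be far smaller than the original.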

Rewrite trigger

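auto-aof-rewrite-percentage 100  // growth of the current file, as a percentage of its size after the last rewrite
auto-aof-rewrite-min-size 64mb  // minimum file size before a rewrite is triggered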

In the periodic serverCron task, Redis checks whether the rewrite conditions are met:

int serverCron(struct aeEventLoop *eventLoop, long long id, void *clientData) {
		... ...
		
        /* Trigger an AOF rewrite if needed. */
        if (server.aof_state == AOF_ON &&
            server.rdb_child_pid == -1 &&
            server.aof_child_pid == -1 &&
            server.aof_rewrite_perc &&
            // The file size exceeds the configured minimum
            server.aof_current_size > server.aof_rewrite_min_size)
        {
            long long base = server.aof_rewrite_base_size ?
                server.aof_rewrite_base_size : 1;
            long long growth = (server.aof_current_size*100/base) - 100;
            // The growth percentage reaches the configured threshold
            if (growth >= server.aof_rewrite_perc) {
                serverLog(LL_NOTICE,"Starting automatic rewriting of AOF on %lld%% growth",growth);
                // Fork a child process to rewrite the file
                rewriteAppendOnlyFileBackground();
            }
        }
        ... ...
        
}
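The size check in serverCron is easy to verify with numbers. The sketch below copies the arithmetic into a standalone function; demo_should_rewrite and its parameter names are hypothetical, and the checks on running child processes are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* current_size: current AOF size in bytes
 * base_size:    AOF size after the last rewrite (0 if unknown)
 * min_size:     auto-aof-rewrite-min-size in bytes
 * rewrite_perc: auto-aof-rewrite-percentage (0 disables automatic rewrites) */
bool demo_should_rewrite(long long current_size, long long base_size,
                         long long min_size, int rewrite_perc) {
    assert(current_size >= 0 && base_size >= 0);
    if (rewrite_perc == 0) return false;
    if (current_size <= min_size) return false;

    /* same integer arithmetic as the serverCron excerpt */
    long long base = base_size ? base_size : 1;
    long long growth = (current_size * 100 / base) - 100;
    return growth >= rewrite_perc;
}
```

With the settings shown above (100 and 64mb), a file that was 64 MB after the last rewrite triggers a new rewrite once it reaches 128 MB: the growth is 128*100/64 - 100 = 100. At 100 MB the growth is only 56, so nothing happens.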

In the child process, Redis converts every key-value pair in the database into commands in the corresponding format and writes them into a new AOF file. The source code is not reproduced here, but that is the basic process.
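As a rough illustration of that process, the sketch below forks a child that dumps a small table as SET commands into a temporary file, and the parent renames the file over the target once the child has exited successfully. Everything here is an assumption made for the example: the names demo_entry, demo_dump and demo_rewrite, the ".tmp" suffix, the string-only data, and the plain command format (no RDB preamble). The parent also blocks in waitpid for simplicity, whereas Redis keeps serving clients while the child runs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    const char *key;
    const char *value;
} demo_entry;

/* Child side: write one RESP-encoded SET command per entry. */
static int demo_dump(const char *path, const demo_entry *db, size_t n) {
    FILE *fp = fopen(path, "w");
    if (fp == NULL) return -1;
    for (size_t i = 0; i < n; i++) {
        if (fprintf(fp, "*3\r\n$3\r\nSET\r\n$%zu\r\n%s\r\n$%zu\r\n%s\r\n",
                    strlen(db[i].key), db[i].key,
                    strlen(db[i].value), db[i].value) < 0) {
            fclose(fp);
            return -1;
        }
    }
    /* flush stdio, then force the file to disk before reporting success */
    if (fflush(fp) != 0 || fsync(fileno(fp)) != 0) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}

/* Parent side: fork, wait for the child, then swap the new file in. */
int demo_rewrite(const char *target, const demo_entry *db, size_t n) {
    assert(target != NULL && (db != NULL || n == 0));
    char tmp[256];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", target);
    if (len < 0 || (size_t)len >= sizeof tmp) return -1;

    pid_t pid = fork();
    if (pid == -1) return -1;
    if (pid == 0) {
        /* the child sees a copy-on-write snapshot of db */
        _exit(demo_dump(tmp, db, n) == 0 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return -1;
    return rename(tmp, target);   /* replaces the old file in one step */
}
```

Writing to a temporary file and renaming it means a reader never sees a half-written AOF file, even if the child fails midway.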

Origin blog.csdn.net/bruce128/article/details/104579288