Redis 4.0 automatic memory defragmentation (Active Defrag): a source code analysis

Before reading this article, I recommend first reading this post on the same blog: Where to start reading the Redis source code (reference [1]).

Redis 4.0 adds many nice new features, and automatic memory defragmentation (activedefrag) is certainly one of the most attractive: it makes reclaiming fragmented memory in a Redis cluster far more elegant and convenient than in Redis 3.0. When we upgraded, we enabled activedefrag right away; after deleting some keys in a test we found it really did release fragmented memory, but we did not test the other related parameters.

Part 1: The problem

Due to business needs, we deleted keys accounting for two thirds of the cluster's memory. After the deletion, the cluster's average fragmentation ratio was 1.3 to 1.4 and memory usage dropped, but service response times suddenly shot up. We tested the cluster with `redis-cli -c -h 127.0.0.1 -p 5020 --latency` and found the server response (including network and queuing) had reached 2-3ms, which is very high for Redis; our other clusters generally respond in about 0.2ms. After some investigation, we tried turning activedefrag off: the Redis server response immediately returned to normal and online service latency dropped, while turning activedefrag back on made response times skyrocket again.

Part 2: Redis 4.0 source code analysis (based on the 4.0 branch)

The core of the Active Defrag feature is the activeDefragCycle(void) function in defrag.c.

1. Active Defrag overview and related parameters

Let's look at the comments about activedefrag in redis.conf (translated):

WARNING: this feature is experimental. However, it was stress tested even in production and manually tested by multiple engineers for some time.

What is active defragmentation?
-------------------------------
Active (online) defragmentation allows a Redis server to compact the space left between small allocations and deallocations of data in memory, thus allowing memory to be reclaimed.

Fragmentation is a natural process that happens with every allocator (fortunately, less so with Jemalloc) and certain workloads. Normally a server restart is needed in order to lower the fragmentation, or at least to flush away all the data and create it again.
However, thanks to this feature implemented by Oran Agra for Redis 4.0, this process can happen at runtime in a "hot" way, while the server is running.

Basically, when fragmentation exceeds a certain level (see the configuration options below), Redis starts creating new copies of the values in contiguous memory regions by exploiting certain specific Jemalloc features (in order to understand whether an allocation is causing fragmentation and to allocate it in a better place), while at the same time releasing the old copies of the data. Repeating this process incrementally for all keys causes fragmentation to drop back to normal values.

Important things to understand:
1. This feature is disabled by default, and only works if you compiled Redis to use the copy of Jemalloc shipped with the Redis source code. This is the default on Linux builds.
2. You never need to enable this feature if you don't have fragmentation issues.
3. Once you experience fragmentation, you can enable this feature when needed with the command "CONFIG SET activedefrag yes". The configuration parameters can fine-tune the behavior of the defragmentation process. If you are not sure what they mean, it is best to leave the defaults untouched.

Parameter Description

# Enable automatic memory defragmentation (master switch)
activedefrag yes
# Start defragmentation when at least 100mb of memory is wasted to fragmentation
active-defrag-ignore-bytes 100mb
# Start defragmentation when fragmentation exceeds 10%
active-defrag-threshold-lower 10
# When fragmentation exceeds 100%, defragment with maximum effort
active-defrag-threshold-upper 100
# Minimum percentage of main-thread resources used for defragmentation
active-defrag-cycle-min 25
# Maximum percentage of main-thread resources used for defragmentation
active-defrag-cycle-max 75

2. In which thread does the Active Defrag timer run?

Redis is event-driven: timer events and I/O events are registered with the main thread's event loop, so the memory defragmentation timer executes in the main thread.

Quoted from reference [1]:

  • Registering the timer event callback. Redis is a single-threaded program: if it wants to schedule tasks asynchronously, such as periodically collecting expired keys, it has no choice but to rely on the event loop mechanism. This step simply registers a timer event with the previously created event loop, configured to periodically run a callback: serverCron. Since Redis has only one main thread, this function runs periodically on that thread, driven by the event loop (i.e., invoked at the appropriate times) without blocking other logic on the same thread (it effectively shares the thread through time slicing). What does serverCron actually do? Besides periodically reclaiming expired keys, it performs many other tasks, such as master-replica reconnection, Cluster node handshakes, and triggering BGSAVE and AOF rewrite. This is not the focus of this article, so we won't expand on it here.
  • Registering I/O event callbacks. The Redis server's main job is to monitor I/O events, through which it parses command requests from clients, executes the commands, and returns the results. Monitoring I/O events naturally depends on the event loop. As mentioned earlier, Redis may listen on two kinds of endpoints: TCP connections and a Unix domain socket. So this step registers callbacks for both kinds of I/O events: acceptTcpHandler and acceptUnixHandler. Handling of client requests starts from these two functions, a process we will discuss in the next section. In addition, Redis actually registers one more I/O event here, for two-way communication with modules through a pipe mechanism. That is not the focus of this article, so we ignore it.
  • Initializing the background threads. Redis creates some extra threads that run in the background, dedicated to handling time-consuming tasks that can be delayed (usually cleanup work). Redis calls these background threads bio (Background I/O service). The tasks they handle include: file close operations that can be deferred (e.g., when executing the unlink command), AOF persistence writes (i.e., the fsync call — but note that only fsync operations that may be deferred are executed in a background thread), and cleanup of some big keys (e.g., when executing flushdb async). So the name bio is a bit of a misnomer: what these threads do is not necessarily I/O. These background threads may raise a question: the initialization steps above already registered a timer callback, serverCron, so supposedly these tasks could just be placed in serverCron, which can also run background work. In fact, that is not good enough. As mentioned, serverCron is driven by the event loop and executed on the Redis main thread, sharing time slices with the other operations on that thread (mainly executing requested commands). Consequently, serverCron must not contain operations that take too long, or it would hurt Redis's command response time. So tasks that are time-consuming and can be deferred must go into a separate thread of execution.

3. When does the Active Defrag timer logic execute?

As the parameter description shows, activedefrag is the master switch: only when it is enabled can defragmentation run at all, but whether it actually needs to run is controlled by the following parameters.

void activeDefragCycle(void) {
    /* ... */

    /* Once a second, check the fragmentation and decide whether to run */
    run_with_period(1000) {
        size_t frag_bytes;
        /* Compute the fragmentation percentage and the fragmented bytes */
        float frag_pct = getAllocatorFragmentation(&frag_bytes);
        /* If not already running and below the thresholds, don't run */
        if (!server.active_defrag_running) {
            /* Compare the computed fragmentation percentage and size against our
             * configured parameters to decide whether to run */
            if(frag_pct < server.active_defrag_threshold_lower || frag_bytes < server.active_defrag_ignore_bytes)
                return;
        }
    /* ... */
}

From the source we can see that whether defragmentation runs is decided jointly by active_defrag_running, active-defrag-ignore-bytes, and active-defrag-threshold-lower.
With the default settings, it runs when the fragmentation percentage exceeds 10% and the fragmented memory exceeds 100mb.

4. Why does Active Defrag affect the Redis cluster's response time?

We deleted 2/3 of the cluster's data; the fragmentation ratio quickly dropped to around 1.3 and memory was released soon after. But why did the Redis response time become so high?

First, memory defragmentation runs in the main thread. From the source we found that the defragmentation operation scans (iterates over) the entire Redis node and performs memory copies, transfers, and similar operations. Since Redis is single-threaded, this inevitably degrades Redis performance (the impact can be controlled by adjusting the cluster configuration, described in detail later).

From the Redis log we found that defragmentation kept executing, using 75% of the CPU (which we interpret as 75% of the main thread's resources), with each run taking 82s (note: although one run takes 82s, the main thread is not blocked for that long; this is the elapsed time from the first iteration to the last, during which the main thread can still process command requests).
The log shows frag=14%, so with our configuration parameters the defragmentation threshold remained reached, the main thread kept doing memory defragmentation, and the cluster's performance suffered.

/* redis configuration and log
 * activedefrag yes
 * active-defrag-ignore-bytes 100mb
 * active-defrag-threshold-lower 10
 * active-defrag-threshold-upper 100
 * active-defrag-cycle-min 25
 * active-defrag-cycle-max 75 */
11:M 28 May 06:37:17.430 - Starting active defrag, frag=14%, frag_bytes=484401800, cpu=75%
11:M 28 May 06:38:40.424 - Active defrag done in 82993ms, reallocated=50, frag=14%, frag_bytes=484365248

# redis performance
[service@bigdata src]$ ./redis-cli -h 127.0.0.1 -p 5020 --latency
min: 0, max: 74, avg: 7.38 (110 samples)

We then set activedefrag to no, and the response immediately returned to normal.

# redis performance
min: 0, max: 1, avg: 0.14 (197 samples)

5. How should the Active Defrag parameters be tuned?

We do need the memory defragmentation feature, so how do we tune the parameters to strike a balance between Redis performance and defragmentation? I adjusted and tested these parameters a few times.

(1) Adjusting active-defrag-ignore-bytes and active-defrag-threshold-lower.
This adjustment is relatively simple: these two parameters only determine whether the defragmentation logic is entered at all. If you can accept a larger fragmentation size or percentage before defragmenting, Redis will not defragment and the cluster will not be affected much. As the following code shows, the defragmentation logic is entered only when both conditions are met.

if (!server.active_defrag_running) {
    if(frag_pct < server.active_defrag_threshold_lower || frag_bytes < server.active_defrag_ignore_bytes)
        return;
}

Note that frag_pct and frag_bytes are not the same as the mem_fragmentation_ratio reported by the info command. For example, when our problem occurred, mem_fragmentation_ratio was 1.31 while the ratio derived from frag_pct was 1.14. So when setting these parameters, you cannot rely solely on the mem_fragmentation_ratio shown by info.

/* frag_pct is obtained from jemalloc */
/* Utility function to get the fragmentation ratio from jemalloc.
 * It is critical to do that by comparing only heap maps that belong to
 * jemalloc, and skip ones the jemalloc keeps as spare. Since we use this
 * fragmentation ratio in order to decide if a defrag action should be taken
 * or not, a false detection can cause the defragmenter to waste a lot of CPU
 * without the possibility of getting any results. */
float getAllocatorFragmentation(size_t *out_frag_bytes) {
    size_t epoch = 1, allocated = 0, resident = 0, active = 0, sz = sizeof(size_t);
    /* Update the statistics cached by mallctl. */
    je_mallctl("epoch", &epoch, &sz, &epoch, sz);
    /* Unlike RSS, this does not include RSS from shared libraries and other non
     * heap mappings. */
    je_mallctl("stats.resident", &resident, &sz, NULL, 0);
    /* Unlike resident, this doesn't include the pages jemalloc reserves
     * for re-use (purge will clean that). */
    je_mallctl("stats.active", &active, &sz, NULL, 0);
    /* Unlike zmalloc_used_memory, this matches the stats.resident by taking
     * into account all allocations done by this process (not only zmalloc). */
    je_mallctl("stats.allocated", &allocated, &sz, NULL, 0);
    float frag_pct = ((float)active / allocated)*100 - 100;
    size_t frag_bytes = active - allocated;
    float rss_pct = ((float)resident / allocated)*100 - 100;
    size_t rss_bytes = resident - allocated;
    if(out_frag_bytes)
        *out_frag_bytes = frag_bytes;
    serverLog(LL_DEBUG,
        "allocated=%zu, active=%zu, resident=%zu, frag=%.0f%% (%.0f%% rss), frag_bytes=%zu (%zu%% rss)",
        allocated, active, resident, frag_pct, rss_pct, frag_bytes, rss_bytes);
    return frag_pct;
}
/* mem_fragmentation_ratio */
/* Fragmentation = RSS / allocated-bytes */
float zmalloc_get_fragmentation_ratio(size_t rss) {
    return (float)rss/zmalloc_used_memory();
}

(2) Adjusting active-defrag-cycle-min and active-defrag-cycle-max.
These two parameters are the lower and upper bounds on how much of the main thread's resources defragmentation may occupy. If you want to ensure defragmentation does not unduly affect cluster performance, you need to consider these two parameters very carefully.
While adjusting them, I watched the elapsed time, resource consumption, and Redis response times: the more resources defragmentation occupies, the more aggressive it is and the shorter each run takes, but of course the greater the impact on Redis performance.

# active-defrag-cycle-min 10
# active-defrag-cycle-max 10

# log - elapsed time and resource usage
11:M 28 May 08:37:39.458 - Starting active defrag, frag=15%, frag_bytes=502210608, cpu=10%
11:M 28 May 08:45:26.160 - Active defrag done in 466700ms, reallocated=187804, frag=14%, frag_bytes=493183888

# redis response
min: 0, max: 27, avg: 2.69 (295 samples)
# active-defrag-cycle-min 5
# active-defrag-cycle-max 10

# log - elapsed time and resource usage
11:M 28 May 07:08:29.988 - Starting active defrag, frag=14%, frag_bytes=487298400, cpu=5%
11:M 28 May 07:22:58.225 - Active defrag done in 868237ms, reallocated=4555, frag=14%, frag_bytes=484875424

# redis response
min: 0, max: 6, avg: 0.44 (251 samples)

(3) Overall tuning.
Before going further, we need to look at the specific logic of the activeDefragCycle(void) function in defrag.c.
Tip: in C, a local variable declared static (like cursor in the code below) keeps its value across function calls, behaving like a global.

/* Perform incremental defragmentation work from serverCron.
 * This works in a similar way to activeExpireCycle: we do incremental work between calls. */
void activeDefragCycle(void) {
    static int current_db = -1;
    /* Cursor used to scan (iterate over) the entire Redis node */
    static unsigned long cursor = 0;
    static redisDb *db = NULL;
    static long long start_scan, start_stat;
    /* iteration counter */
    unsigned int iterations = 0;
    unsigned long long defragged = server.stat_active_defrag_hits;
    long long start, timelimit;

    if (server.aof_child_pid!=-1 || server.rdb_child_pid!=-1)
        return; /* Defragging memory while there's a fork will just do damage. */

    /* Once a second, check if the fragmentation justifies starting a scan
     * or making it more aggressive. */
    run_with_period(1000) {
        size_t frag_bytes;
        float frag_pct = getAllocatorFragmentation(&frag_bytes);
        /* If we're not already running, and below the threshold, exit. */
        if (!server.active_defrag_running) {
            if(frag_pct < server.active_defrag_threshold_lower || frag_bytes < server.active_defrag_ignore_bytes)
                return;
        }

        /* Compute the share of main-thread resources defragmentation may occupy */
        int cpu_pct = INTERPOLATE(frag_pct,
                server.active_defrag_threshold_lower,
                server.active_defrag_threshold_upper,
                server.active_defrag_cycle_min,
                server.active_defrag_cycle_max);
        /* Clamp the resource share to the configured range */
        cpu_pct = LIMIT(cpu_pct,
                server.active_defrag_cycle_min,
                server.active_defrag_cycle_max);
         /* We allow increasing the aggressiveness during a scan, but don't
          * reduce it. */
        if (!server.active_defrag_running ||
            cpu_pct > server.active_defrag_running)
        {
            server.active_defrag_running = cpu_pct;
            serverLog(LL_VERBOSE,
                "Starting active defrag, frag=%.0f%%, frag_bytes=%zu, cpu=%d%%",
                frag_pct, frag_bytes, cpu_pct);
        }
    }
    if (!server.active_defrag_running)
        return;

    /* See activeExpireCycle for how timelimit is handled. */
    start = ustime();
    /* Compute the time limit for each invocation */
    timelimit = 1000000*server.active_defrag_running/server.hz/100;
    if (timelimit <= 0) timelimit = 1;

    do {
        if (!cursor) {
            /* Move on to next database, and stop if we reached the last one. */
            if (++current_db >= server.dbnum) {
                long long now = ustime();
                size_t frag_bytes;
                float frag_pct = getAllocatorFragmentation(&frag_bytes);
                serverLog(LL_VERBOSE,
                    "Active defrag done in %dms, reallocated=%d, frag=%.0f%%, frag_bytes=%zu",
                    (int)((now - start_scan)/1000), (int)(server.stat_active_defrag_hits - start_stat), frag_pct, frag_bytes);

                start_scan = now;
                current_db = -1;
                cursor = 0;
                db = NULL;
                server.active_defrag_running = 0;
                return;
            }
            else if (current_db==0) {
                /* Start a scan from the first database. */
                start_scan = ustime();
                start_stat = server.stat_active_defrag_hits;
            }

            db = &server.db[current_db];
            cursor = 0;
        }

        do {
            cursor = dictScan(db->dict, cursor, defragScanCallback, defragDictBucketCallback, db);
            /* Once in 16 scan iterations, or 1000 pointer reallocations
             * (if we have a lot of pointers in one hash bucket), check if we
             * reached the time limit. */
            if (cursor && (++iterations > 16 || server.stat_active_defrag_hits - defragged > 1000)) {
                /* If over the time limit, return; continue next time the timer gets thread resources */
                if ((ustime() - start) > timelimit) {
                    return;
                }
                iterations = 0;
                defragged = server.stat_active_defrag_hits;
            }
        } while(cursor);
    } while(1);
}

Analyzing the code logic, note that cpu_pct (the resource share) is computed by two macros:

int cpu_pct = INTERPOLATE(frag_pct,
        server.active_defrag_threshold_lower,
        server.active_defrag_threshold_upper,
        server.active_defrag_cycle_min,
        server.active_defrag_cycle_max);
cpu_pct = LIMIT(cpu_pct,
        server.active_defrag_cycle_min,
        server.active_defrag_cycle_max);

/* linear interpolation macro */
#define INTERPOLATE(x, x1, x2, y1, y2) ( (y1) + ((x)-(x1)) * ((y2)-(y1)) / ((x2)-(x1)) )
/* clamping (min/max) macro */
#define LIMIT(y, min, max) ((y)<(min)? min: ((y)>(max)? max: (y)))

Suppose we set the parameters as follows (our production configuration):

active-defrag-ignore-bytes 500mb
active-defrag-threshold-lower 50
active-defrag-threshold-upper 100
active-defrag-cycle-min 5
active-defrag-cycle-max 10

(1) With these parameters, the INTERPOLATE calculation of cpu_pct reduces to the linear function y = 0.1x (where x is frag_pct): 5 + (x - 50) * (10 - 5) / (100 - 50) = 0.1x.
(2) Assume at this moment frag_pct = 100 and frag_bytes > 500mb; then cpu_pct = 10.
(3) After the LIMIT clamping, the final cpu_pct value is still 10.
(4) This value then gives timelimit = 1000000 * server.active_defrag_running (10) / server.hz (10 in redis.conf) / 100 = 10000μs = 10ms.
(5) Finally, Redis uses the timelimit value to ensure, as far as possible, that automatic memory defragmentation does not monopolize the main thread's resources.

6. Memory Purge: manually cleaning up memory fragmentation

Here, by the way, let's also introduce the Memory Purge feature.
The memory purge command triggers memory cleanup manually; it is registered as an I/O event and executed on the main thread. Note that the memory it reclaims is not the same as what activedefrag reclaims: it attempts to purge dirty pages so the memory allocator can reclaim them.
For the specific logic, let's look at the implementation in object.c:

/* Only available when using the jemalloc memory allocator */
#if defined(USE_JEMALLOC)
    char tmp[32];
    unsigned narenas = 0;
    size_t sz = sizeof(unsigned);
    /* Get the number of arenas, then call the jemalloc interface to purge */
    if (!je_mallctl("arenas.narenas", &narenas, &sz, NULL, 0)) {
        sprintf(tmp, "arena.%d.purge", narenas);
        if (!je_mallctl(tmp, NULL, 0, NULL, 0)) {
            addReply(c, shared.ok);
            return;
        }
    }
    addReplyError(c, "Error purging dirty pages");
#else
    addReply(c, shared.ok);
    /* Nothing to do for other allocators. */
#endif

For background knowledge about arenas, you can refer to the article interpreted in reference [2].

In actual production use, the effect of memory purge is not as good as that of activedefrag, which is determined by its mechanism; but in some extreme fragmentation scenarios it can still play a role. I recommend using it together with activedefrag, according to your actual situation.

Part 3: Active Defrag parameter tuning recommendations

To sum up: active-defrag-ignore-bytes and active-defrag-threshold-lower control whether memory defragmentation runs, while active-defrag-cycle-min and active-defrag-cycle-max control how aggressively it runs. Since every company's Redis cluster size and in-memory data structures differ, after turning on the automatic memory defragmentation switch you must set the aggressiveness parameters according to your own actual situation.

Reference articles:
[1] Where to start reading the Redis source code
[2] An analysis of the memory defragmentation feature implementation in Redis 4
[3] jemalloc 3.6.0 source code explained - [1] Arena

Reproduced from: https://juejin.im/post/5cec843bf265da1bb77648b5


Origin: blog.csdn.net/weixin_34090643/article/details/91421296