Redis "junk" expired dead key management and optimization

【Author】Fu Lei

Definitions of a Redis "dead key" vary; there are usually two types:

  • A key that, after being written to Redis, has a very long expiration time (or none at all) and is not accessed for a long time. Such a key can be called a dead key.

  • A key whose expiration time has clearly passed but which still occupies Redis memory because it has not actually been deleted. Such a key can also be called a dead key.

Note: This article discusses the second situation

1. Two examples

The keys in the following two examples all have expiration times set, and some of them have already expired.

1. Changes in key count and capacity of a Redis cluster after a full scan:

                  Number of keys    Capacity (GB)
Before scanning   5,628,636,513     1116
After scanning    4,206,662,303     798

2. Key count and capacity of two Redis clusters with the same name but different versions (all keys of type string):

Version        Number of keys   Capacity (GB)
Redis 4.0.14   821,131,528      831
Redis 6.0.15   821,131,528      433

Preliminary Conclusions:

  • Scanning may speed up the deletion of expired keys in Redis.

  • Redis 6 uses noticeably less space than Redis 4 for the same data. The article Redis Cost Optimization - Version Upgrade - 1. SDS Optimization History notes that the string type itself saw little capacity optimization between 4.0 and 7.0, so the initial guess is that Redis 6 has optimized expiration handling instead.

2. Basic knowledge: Redis expiration

1. How does Redis store expiration data?

Each Redis instance has multiple redisDb structures (normally only db0 is used). Each redisDb contains two dicts: dict stores the key-value pairs, and expires stores the expiration time of each key that has one.

typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    ...
} redisDb;
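
For reference: the value stored in expires for a key is the key's absolute expiration timestamp in milliseconds, kept as a 64-bit integer directly in the dictEntry, while the actual value object of the key lives only in dict.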

As shown in the figures below:

(1) A normal dict (figure from "Redis Design and Implementation"):

(2) An abstract representation of the dict and expires tables (figure from Google Images):

(figure omitted)

Borrowing another picture: the key StringObject referenced by expires and the key StringObject referenced by dict are the same object:

(figure omitted)

2. Redis expiration strategy

Because Redis's worker thread is single-threaded, deleting every expired key precisely and in real time would consume a lot of CPU. Redis therefore implements two expiration strategies: lazy deletion and periodic deletion.

(1) Lazy deletion

When a client executes a command against a key (take GET as an example), Redis first checks whether the key is in the expires table (a toy sketch of this check follows the list):

  • If the key is in the expires table:

    • If it has expired: delete it directly and return empty

    • If it has not expired: get the value from the dict table

  • If it is not in the expires table, get the value from the dict table
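
To make this flow concrete, here is a small self-contained toy model (not Redis source code; the fixed-size table, the entry layout and all of the names are invented for illustration) in which the expiry check happens only when a key is accessed, mirroring lazy deletion:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Toy entry: a key, a value, and an optional absolute expire time (0 = no expiry). */
struct entry {
    char key[32];
    char value[32];
    time_t expire_at;
    int used;
};

static struct entry db[16];   /* stands in for both the dict and expires tables */

static void set_key(const char *k, const char *v, int ttl_seconds) {
    for (int i = 0; i < 16; i++) {
        if (!db[i].used) {
            snprintf(db[i].key, sizeof(db[i].key), "%s", k);
            snprintf(db[i].value, sizeof(db[i].value), "%s", v);
            db[i].expire_at = ttl_seconds > 0 ? time(NULL) + ttl_seconds : 0;
            db[i].used = 1;
            return;
        }
    }
}

/* Lazy deletion: the expiry check only happens when the key is accessed. */
static const char *get_key(const char *k) {
    for (int i = 0; i < 16; i++) {
        if (db[i].used && strcmp(db[i].key, k) == 0) {
            if (db[i].expire_at != 0 && db[i].expire_at <= time(NULL)) {
                db[i].used = 0;   /* expired: delete on access ...            */
                return NULL;      /* ... and answer as if it no longer exists */
            }
            return db[i].value;   /* not expired: return the value            */
        }
    }
    return NULL;                  /* key was never stored (or already purged) */
}

int main(void) {
    set_key("session:1", "alice", 1);   /* expires in 1 second  */
    set_key("config:a", "42", 0);       /* no expiration at all */

    printf("before expiry: %s\n", get_key("session:1"));  /* prints alice */
    sleep(2);                                             /* let session:1 expire logically */
    const char *v = get_key("session:1");                 /* triggers the lazy delete */
    printf("after expiry:  %s\n", v ? v : "(nil)");
    printf("no-ttl key:    %s\n", get_key("config:a"));   /* still there  */
    return 0;
}

The point of the model is the shape of get_key(): the expired entry is removed as a side effect of the access, exactly when a command touches it and never before.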

(2) Periodic deletion

The problem with lazy deletion is that it depends on keys being accessed. A key that is never accessed again will sit in memory long after it expires, wasting space, so a second strategy is needed: periodic deletion.

Every 1000/hz milliseconds (100 ms with the default hz of 10), Redis runs an adaptive algorithm that deletes expired data from the expires table (the details are described below).

3. Redis version optimization

To demonstrate the Redis 6 improvement on dead keys more vividly, run the following experiment: write 5 million strings whose keys and values are both 16 bytes, with expiration times between 1 and 18 seconds, and measure how long it takes for all of the data to expire.

Version        Time for all data to expire
Redis 4.0.14   38262 ms
Redis 6.0.15   19267 ms

1. Before Redis 6.0:

Periodic deletion runs in one of two modes: fast mode and slow mode (the default).

Note:
1. Fast mode exists so that each periodic deletion pass finishes quickly and does not take CPU time away from Redis's normal command processing.
2. Fast mode and slow mode switch adaptively during execution; both are essentially about not starving normal command processing of CPU.
3. Fast mode and slow mode differ only in their time limits; the deletion logic is the same.

(figure: periodic deletion flow chart, omitted)

By default the cycle runs in slow mode:

(1) Loop over every redisDb: randomly sample 20 keys from its expires table and immediately delete any that have expired.

(2) Check whether more than 25% of the 20 sampled keys (i.e., more than 5) were expired:

  • If 25% or fewer were expired, finish with the current redisDb and continue with the next redisDb.

  • If more than 25% were expired, sample another 20 keys and repeat, checking each round whether the total execution time has exceeded 25 milliseconds.

    • If it exceeds 25 milliseconds, the cycle times out and exits; the next run enters fast mode (which has a much shorter time limit).

    • If it does not exceed 25 milliseconds, keep sampling the current redisDb until 25% or fewer of a batch are expired, then continue with the next redisDb.
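
For example, if 8 of the 20 sampled keys turn out to be expired (40%, above the threshold), the loop samples another 20 keys from the same redisDb; if only 3 are expired (15%, below the threshold), it moves straight on to the next redisDb.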

Several important parameters:

#define ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP 20 /* the 20 keys sampled per loop mentioned above */
#define ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC 25 /* slow-mode time limit: 25% of CPU time */
#define ACTIVE_EXPIRE_CYCLE_FAST_DURATION 1000 /* fast-mode time limit: 1000 microseconds (1 ms) */
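
With the default hz of 10, periodic deletion runs every 100 ms, and the slow-mode budget is 25% of that interval: 25% x 100 ms = 25 ms, which is where the 25-millisecond limit above comes from. The fast-mode budget is the 1000 microseconds (1 ms) given by ACTIVE_EXPIRE_CYCLE_FAST_DURATION.
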
2. Redis 6.0 optimization
(1) Random sampling each time --> recording a traversal cursor

Before Redis 6.0, every periodic deletion pass sampled 20 keys at random. If the instance holds a huge number of keys with expiration times (millions or tens of millions), this randomness means many keys may never be sampled at all. Redis 6.0 therefore added a cursor (expires_cursor) to redisDb that records where the last scan stopped, guaranteeing that every key will eventually be visited and noticeably improving efficiency.

typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    unsigned long expires_cursor; /* Cursor of the active expire cycle. */
   .......
} redisDb;
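
Each cycle now resumes scanning the expires dict from expires_cursor (using the incremental dictScan iterator) instead of sampling blindly, so successive cycles sweep through the entire table even when it holds tens of millions of keys.
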
(2) Expired-ratio threshold: 25% of the 20 sampled keys (i.e., 5) --> 10% (i.e., 2)

Before 6.0:

do {
   num = ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP;
   while (num--) {
        // check each sampled key's expire time and record stats; if the key has already expired, expired++
    }
} while (expired > ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP/4);

After 6.0, the threshold is config_cycle_acceptable_stale, which can be tuned indirectly (via the effort coefficient described below):

do {
   num = ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP;
   while (num--) {
        // check each sampled key's expire time and record stats; sampled++ for every key checked, expired++ if it has already expired
    }
} while ((expired*100/sampled) > config_cycle_acceptable_stale);
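
With the default ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE of 10, the loop keeps sampling while more than 10% of the sampled keys are found to be expired, versus the fixed 25% threshold before 6.0. In other words, the cycle now stops only once the stale fraction in the sample has dropped to 10% or below, a stricter target than before.
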
(3) Add enhancement coefficient

The new active_expire_effort configuration item can appropriately increase the intensity of periodic deletion. Its value range is 1-10 (default 1).

unsigned long effort = server.active_expire_effort-1, /* Rescale from 0 to 9. */
// increase the number of keys sampled per loop
config_keys_per_loop = ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP +  ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP/4*effort,
// increase the fast-mode time limit
config_cycle_fast_duration = ACTIVE_EXPIRE_CYCLE_FAST_DURATION +  ACTIVE_EXPIRE_CYCLE_FAST_DURATION/4*effort,
// increase the slow-mode time limit (as a percentage of CPU time)
config_cycle_slow_time_perc = ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC + 2*effort,
// lower the acceptable-stale ratio used in the while condition above
config_cycle_acceptable_stale = ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE- effort;
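
A quick worked example: with active_expire_effort set to 3 (so effort = 2), each pass samples 20 + 20/4*2 = 30 keys, the fast-mode limit grows from 1000 to 1500 microseconds, the slow-mode CPU budget grows from 25% to 29%, and the acceptable stale ratio drops from 10% to 8%. Every increment therefore makes periodic deletion work harder per cycle.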

4. Why are there dead keys and what are the dangers?

1. Why are there dead keys?
(1) Lazy deletion: if many keys are never accessed again after being written, they will linger as dead keys.
(2) Periodic deletion: if expired keys are produced faster than periodic deletion can remove them.

There are two situations for (2):

Case 1: the Redis instance takes a large volume of writes and the keys have very short expiration times.

Case 2: the instance holds a very large number of keys (for example, millions), but expired data accounts for only a small proportion of them. This case is less intuitive.

Let’s take an example: an online cluster

                  Number of keys (million)   Capacity (GB)
Before scanning   991.18                     396.37
After scanning    814.58                     296.90

Analyzing its keyspace shows that keys with short expiration times account for only a small proportion, so the instance cannot expire its dead keys quickly on its own.

(figure omitted)

Going back to the flow chart analyzed earlier, the answer is easy to see: each time a redisDb is traversed, the core condition is whether more than 25% of the sampled keys have expired. From the keyspace analysis chart above, most keys in each sample are very unlikely to be expired, so each cycle runs only a single round of sampling and the backlog of dead keys is cleared very slowly.

(figure omitted)

2. The dangers of dead keys

The essence of the hazard: suppose the cluster is provisioned at 100 GB. Without dead keys the real data might be only 50 GB, but with dead keys it may occupy 90 GB.

(1) More operational work: the business side may keep submitting capacity expansion requests.

(2) Wasted cost: memory is paid for but occupied by data that is already logically expired.

(3) Possible eviction: the unexpected memory usage may trigger eviction of live data (most eviction algorithms, such as LRU, are approximate).

5. How to solve

Interestingly, one of the officially suggested remedies is to restart the instance, which is not practical for online environments (even with failover). In production we can use the following approaches:

1. Moderately adjust the active_expire_effort parameter (for Redis 6.0+)

What counts as moderate? Remember that Redis's primary job is serving external requests, so enough CPU time must be left for normal commands. Since Redis 4.0 there is a key indicator, stat_expired_time_cap_reached_count, that can be used as a reference: it counts how many times the active expire cycle hit its time limit, i.e. how often too much CPU time was being spent on expired-key deletion.

/* We can't block forever here even if there are many keys to
 * expire. So after a given amount of milliseconds return to the
 * caller waiting for the other active expire cycle. */
if ((iteration & 0xf) == 0) { /* check once every 16 iterations. */
    elapsed = ustime()-start;
    if (elapsed > timelimit) {
        timelimit_exit = 1;
        server.stat_expired_time_cap_reached_count++;
        break;
    }
}

This counter can be wired into monitoring.
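
As a rough example (127.0.0.1:6379 is a placeholder address; active-expire-effort is the configuration name corresponding to server.active_expire_effort, and the counter is exposed in INFO stats as expired_time_cap_reached_count), the effort can be raised one notch at a time while watching the counter:

127.0.0.1:6379> CONFIG SET active-expire-effort 2
OK
127.0.0.1:6379> CONFIG GET active-expire-effort
1) "active-expire-effort"
2) "2"

If the counter keeps climbing after raising the effort, the expire cycle is still hitting its time limit, and it is safer to back off and rely on external scanning (next subsection) instead.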

2. Regular scan

When a cluster is identified as having the characteristics shown below, an external scan (which in effect relies on lazy deletion) can help remove expired keys. The intensity must stay moderate; for example, the sleep time between batches should be chosen according to how busy the Redis CPU currently is. A minimal sketch follows the figure.

(figure omitted)
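
A minimal sketch of such an external scan, assuming hiredis is available and the instance listens on 127.0.0.1:6379; the batch size (COUNT 500), the 10 ms sleep, and the use of TTL as the cheap read that triggers lazy deletion are illustrative choices, not a fixed recipe:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) {
        fprintf(stderr, "connect failed: %s\n", c ? c->errstr : "oom");
        return 1;
    }

    char cursor[32] = "0";
    do {
        /* SCAN returns [next_cursor, [key, key, ...]] */
        redisReply *r = redisCommand(c, "SCAN %s COUNT 500", cursor);
        if (r == NULL || r->type != REDIS_REPLY_ARRAY || r->elements != 2) {
            fprintf(stderr, "unexpected SCAN reply\n");
            if (r) freeReplyObject(r);
            break;
        }
        snprintf(cursor, sizeof(cursor), "%s", r->element[0]->str);

        redisReply *keys = r->element[1];
        for (size_t i = 0; i < keys->elements; i++) {
            /* TTL is a cheap read; touching an already-expired key makes
             * the server delete it (lazy deletion). */
            redisReply *t = redisCommand(c, "TTL %b",
                                         keys->element[i]->str,
                                         (size_t)keys->element[i]->len);
            if (t) freeReplyObject(t);
        }
        freeReplyObject(r);

        usleep(10 * 1000); /* throttle: tune to how busy the instance is */
    } while (strcmp(cursor, "0") != 0);

    redisFree(c);
    return 0;
}

Because the scan is throttled and spread over time, the extra load stays small while every key is eventually touched at least once.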

3. hz:

It is not recommended to adjust this casually. hz affects far more than expiration, so do not change it on a whim (plenty of articles online tell you to raise it; be careful).

6. How to identify?

To be honest, this question is fairly complicated and I thought about it for a long time. It roughly boils down to 6 points (comments and corrections from experienced readers are welcome):

1. The expires table must be "large":

It needs to reach a certain scale (otherwise the dead-key problem barely exists); more than 1 million keys is a reasonable rule of thumb (though not absolute, see case 2 above).

2. Large batches of keys with short expiration times are written:

An example of avg_ttl is as follows:

(figure omitted)

The effect of key-value analysis is as follows:

(figure omitted)

3. avg_ttl is unreliable

avg_ttl is an approximation, and it is easily skewed by keys with very long expiration times (it gets "averaged out"). The example above is typical: avg_ttl reads 15 days, yet the instance contains a large number of dead keys.

(figure omitted)

4. Use stat_expired_time_cap_reached_count to locate the problem

A frequently increasing stat_expired_time_cap_reached_count indicates many expired keys: the active expire cycle keeps hitting its time limit. The counter can be charted for all instances and used for monitoring or alerting.

5. Combine keyspace analysis with the stat_expired_stale_perc indicator

stat_expired_stale_perc is a smoothed approximation of total_expired/total_sampled. A high value means many sampled keys are expired; a low value should be cross-checked with keyspace analysis to see whether the overall distribution is masking the problem.

double current_perc;
if (total_sampled) {
    current_perc = (double)total_expired/total_sampled;
} else
    current_perc = 0;
server.stat_expired_stale_perc = (current_perc*0.05)+
                                 (server.stat_expired_stale_perc*0.95);
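
A quick worked example of the smoothing: if the latest cycle sampled 20 keys and found 12 expired, current_perc is 0.6; with a previous stat_expired_stale_perc of 0.1, the new value is 0.6*0.05 + 0.1*0.95 = 0.125. The indicator therefore moves slowly and reflects a trend rather than a single noisy sample.
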
6. The ultimate trick: label the cluster after scanning

During off-peak periods, run a cleanup scan on each suspicious cluster, record the key count and capacity before and after, label the cluster accordingly, and schedule regular scans for it.

(figure omitted)

Please keep the number of keys per Redis instance under control.

See the analysis in Redis development specifications (3) - How many keys should a single Redis instance store.

7. The last little experiment

How can we prove that the key in expires and the key in dict are the same object?

(figure omitted)

Experiment: Insert two sets of data in Redis 6.0.15

Number of keys   Key and value                          Expiration   used_memory_human
5,000,000        string; key and value both 16 bytes    none         598.158MB
5,000,000        string; key and value both 16 bytes    1 day        776.599MB

Excess: 178.44MB

Now execute debug htstats on the second set of data

127.0.0.1:12615> debug HTSTATS 0
[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 8388608
 number of elements: 5000000
 different slots: 3766504
 max chain length: 10
 avg chain length (counted): 1.33
 avg chain length (computed): 1.33
 Chain length distribution:
   0: 4622104 (55.10%)
   1: 2755189 (32.84%)
   2: 820438 (9.78%)
   3: 163161 (1.95%)
   4: 24475 (0.29%)
   5: 2930 (0.03%)
   6: 278 (0.00%)
   7: 32 (0.00%)
   10: 1 (0.00%)
[Expires HT]
Hash table 0 stats (main hash table):
 table size: 8388608
 number of elements: 5000000
 different slots: 3766504
 max chain length: 10
 avg chain length (counted): 1.33
 avg chain length (computed): 1.33
 Chain length distribution:
   0: 4622104 (55.10%)
   1: 2755189 (32.84%)
   2: 820438 (9.78%)
   3: 163161 (1.95%)
   4: 24475 (0.29%)
   5: 2930 (0.03%)
   6: 278 (0.00%)
   7: 32 (0.00%)
   10: 1 (0.00%)

The calculations are as follows:

(1) table size is 8,388,608; each slot is an 8-byte pointer: 8388608 * 8 / 1024 / 1024 = 64MB

(2) number of elements is 5,000,000. Excluding the key inside each dictEntry (assumed to be shared here), each entry accounts for a 16-byte value (an int-encoded redisObject; see Redis Cost Optimization - Version Upgrade - 1. SDS Optimization History for why it is 16 bytes in Redis 6) plus an 8-byte next pointer, which converts to 5000000 * (16 + 8) / 1024 / 1024 = 114.44MB

Adding the two parts together gives 64MB + 114.44MB = 178.44MB, which matches the measured excess exactly. Since the extra memory is fully accounted for without any additional key storage, we can conclude that the key in expires and the key in dict are indeed shared.

8. Some thoughts

1. Why can't Redis delete a key immediately when it expires?

2. Why can't we simply use the expired event feature, i.e., delete keys via event notification?

Origin blog.csdn.net/LinkSLA/article/details/135136598