Redis common latency problems: a troubleshooting manual, with 33 optimization suggestions

     As an in-memory database, Redis delivers very high performance: a single instance can handle roughly 100,000 QPS. However, when using Redis we occasionally run into very high access latency. Without an understanding of Redis's internals, troubleshooting these problems can be confusing.

In most cases, increased Redis access latency is caused by improper usage at the application level or by unreasonable operations practices.

Redis slowed down? Locating and analyzing common latency problems

Let's walk through the latency problems Redis commonly encounters in practice and how to locate and analyze them.

Using complex commands

How do you troubleshoot a sudden increase in access latency when using Redis?

The first step is to check the Redis slow log. Redis records statistics on slow commands; with the following settings we can find out which commands have high execution latency.

First, set the slow log threshold: only commands whose execution time exceeds it are recorded. The unit is microseconds. For example, set the threshold to 5 milliseconds (5000 microseconds) and keep only the most recent 1000 slow log entries:

# Log commands whose execution takes longer than 5 ms (5000 us)
CONFIG SET slowlog-log-slower-than 5000

# Keep only the most recent 1000 slow log entries
CONFIG SET slowlog-max-len 1000

Once this is set, any command whose execution takes longer than 5 milliseconds will be recorded by Redis. We can then run SLOWLOG get 5 to query the 5 most recent slow log entries:

127.0.0.1:6379> SLOWLOG get 5

1) 1) (integer) 32693       # Slow log ID
   2) (integer) 1593763337  # Unix timestamp when the command ran
   3) (integer) 5299        # Execution time (microseconds)
   4) 1) "LRANGE"           # The command and its arguments
      2) "user_list_2000"
      3) "0"
      4) "-1"
2) 1) (integer) 32692
   2) (integer) 1593763337
   3) (integer) 5044
   4) 1) "GET"
      2) "book_price_1000"

From the slow log we can see exactly when each slow command ran and how long it took. If your business frequently uses commands of O(n) or higher complexity, such as SORT, SUNION, or ZUNIONSTORE, or if an O(n) command operates on a large amount of data, Redis will spend a lot of time processing it.

If your request volume is not high but the Redis instance's CPU usage is, high-complexity commands are the likely cause.

The solution is to avoid these high-complexity commands and to avoid fetching too much data in a single call: operate on a small batch of elements each time so that Redis can process the request and return promptly, as in the sketch below.
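For example, instead of LRANGE user_list_2000 0 -1, a list can be read in small pages. A minimal sketch, assuming the Python redis-py client (the key name and page size are illustrative):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

def read_list_in_pages(key, page_size=100):
    """Read a potentially large list in small LRANGE pages
    instead of one O(n) LRANGE key 0 -1 call."""
    start = 0
    while True:
        page = r.lrange(key, start, start + page_size - 1)
        if not page:
            break
        yield from page
        start += page_size

# Process user_list_2000 100 elements at a time
for item in read_list_in_pages("user_list_2000"):
    pass  # handle each element here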

Storing big keys

If the slow log shows that the slow entries are not high-complexity commands, but ordinary operations such as SET and DEL, then you should suspect that big keys are being written to Redis.

When Redis writes data, it must allocate memory for the new value; when data is deleted, it frees the corresponding memory.

If a key's value is very large, Redis takes longer to allocate memory for it; likewise, deleting that key takes longer to free the memory.

Check your business code for big-key writes, evaluate the size of the data being written, and avoid storing an excessively large amount of data in a single key at the business layer.

Is there a way to scan Redis for big keys?

Redis also provides a method to scan big keys:

redis-cli -h $host -p $port --bigkeys -i 0.01

This command scans the key-size distribution of the entire instance and reports results per data type.

Note that scanning a production instance for big keys causes a sudden rise in Redis QPS. To reduce the impact during the scan, control the scan rate with the -i parameter, which sets the interval, in seconds, between scan batches.

Internally, this command runs SCAN to traverse all keys, then executes STRLEN, LLEN, HLEN, SCARD, or ZCARD according to each key's type, to obtain the string length or the element count of container types (list/hash/set/zset).

For container-type keys, this scan only finds the keys with the most elements, and the key with the most elements does not necessarily occupy the most memory; keep this in mind. Still, the command generally gives a clear picture of the key distribution across the whole instance.
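To check how much memory a suspect key actually occupies, Redis 4.0 and above provides the MEMORY USAGE command (the key name and the returned byte count here are illustrative):

127.0.0.1:6379> MEMORY USAGE user_list_2000
(integer) 16524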

To address the big-key problem, Redis 4.0 introduced the lazy-free mechanism, which frees the memory of big keys asynchronously and reduces the impact on Redis performance. Even so, we do not recommend storing big keys: they also hurt migration performance in cluster mode, which will be described in detail in a later article on clusters.
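On 4.0 and above, lazy freeing can be enabled with the configuration below (these options exist in Redis 4.0+; whether to turn each one on depends on your workload), and application code can use UNLINK instead of DEL for big keys:

# Reclaim memory in a background thread when keys are evicted,
# expire, or are deleted implicitly (e.g. overwritten)
CONFIG SET lazyfree-lazy-eviction yes
CONFIG SET lazyfree-lazy-expire yes
CONFIG SET lazyfree-lazy-server-del yes

# UNLINK removes the key immediately but frees its memory asynchronously
UNLINK user_list_2000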

Concentrated expiration

Sometimes there is no latency in normal use, but a wave of latency suddenly appears at a certain point in time, and the slow moments are very regular: on a particular hour, say, or once every fixed interval.

If this happens, consider whether a large number of keys are expiring at the same moment.

If many keys expire at a fixed point in time, accessing Redis around that moment may show increased latency.

Redis combines two expiration strategies: active expiration and lazy expiration:

  • Active expiration: Redis maintains an internal timed task that, by default, runs every 100 milliseconds. Each run randomly samples 20 keys from the expires dictionary and deletes the expired ones; if more than 25% of the sampled keys were expired, it samples another 20 and repeats, looping until the expired proportion drops to 25% or the task has run for more than 25 milliseconds;

  • Lazy expiration: a key's expiry is only checked when the key is accessed; if it has expired, it is deleted at that point.

Note that the active-expiration task runs in the Redis main thread. If a run has to delete a large number of expired keys, business requests must wait for it to finish before they can be processed, so access latency rises, by up to 25 milliseconds.

Worse, this latency is not recorded in the slow log. The slow log only records the execution time of a command itself, while active expiration runs before the command executes; if the command stays under the slow log threshold, nothing is logged, yet the application still feels the increased latency.

At this point, check your business code for logic that expires many keys at the same moment. Such code typically uses EXPIREAT or PEXPIREAT, so searching the codebase for those keywords usually finds it.

If your business genuinely needs a batch of keys to expire around a given time, how do you avoid the resulting Redis jitter?

The solution is to add a random offset to the expiry time, spreading out the moments at which these keys expire.

The pseudo code can be written like this:

# Expire the key at a random point within the 5 minutes after the target time
redis.expireat(key, expire_time + random.randint(0, 300))

This way, when the expiry time arrives, Redis does not have to delete a large batch of keys at once, so the deletion pressure does not block the main thread.

Besides handling this at the business level, you can also catch the situation early through operations tooling.

The approach is to monitor Redis's runtime metrics by executing info, and focus on the expired_keys item, which is the cumulative number of expired keys the instance has deleted so far.

Monitor this metric and alert promptly when it spikes within a short window, then correlate the spike with the times the business reported slowness. If they line up, you can conclude that concentrated expiration is indeed the cause of the increased latency.
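As a sketch, a simple watcher over INFO stats might look like this, assuming the redis-py client (the interval and alert threshold are illustrative and should be tuned per workload):

import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

ALERT_DELTA = 10000      # expired keys per interval that triggers an alert
INTERVAL_SECONDS = 10

last = r.info("stats")["expired_keys"]
while True:
    time.sleep(INTERVAL_SECONDS)
    current = r.info("stats")["expired_keys"]
    if current - last > ALERT_DELTA:
        print(f"ALERT: {current - last} keys expired in the last {INTERVAL_SECONDS}s")
    last = current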

Instance memory reaches the upper limit

Sometimes we use Redis purely as a cache, setting a memory limit (maxmemory) on the instance and enabling an LRU eviction policy.

When the instance's memory reaches maxmemory, you may find that every write of new data becomes slower.

The reason is that once memory reaches maxmemory, Redis must evict some data before each new write to keep memory usage under the limit.

This eviction logic takes time, and how much depends on the configured eviction policy:

  • allkeys-lru: evict the least recently accessed keys, regardless of whether they have an expiry set;

  • volatile-lru: evict the least recently accessed keys, but only among keys with an expiry set;

  • allkeys-random: evict keys at random, regardless of whether they have an expiry set;

  • volatile-random: evict keys at random, but only among keys with an expiry set;

  • volatile-ttl: among keys with an expiry set, evict those closest to expiring;

  • noeviction: evict nothing; once memory is full, new writes simply return an error;

  • allkeys-lfu: evict the least frequently accessed keys, regardless of expiry (4.0+);

  • volatile-lfu: evict the least frequently accessed keys among those with an expiry set (4.0+).

Which policy to use depends on the business scenario; see the configuration example below.
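The memory cap and eviction policy can be adjusted at runtime (the values below are illustrative):

# Cap the instance at 4 GB and evict the least recently used keys when full
CONFIG SET maxmemory 4gb
CONFIG SET maxmemory-policy allkeys-lru

# Optionally enlarge the eviction sample size (default 5) for more accurate LRU
CONFIG SET maxmemory-samples 10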

The most commonly used policies are allkeys-lru and volatile-lru. Their logic is: randomly sample a batch of keys from the instance (the batch size is configurable), evict the least recently accessed one, and keep the remaining candidates in a pool; then sample another batch, compare it with the keys already in the pool, and again evict the least recently accessed key. This repeats until memory drops below maxmemory.

The allkeys-random and volatile-random policies are much faster: because eviction is random, no time is spent comparing access recency, and a sampled batch of keys can simply be evicted outright, so these policies are quicker than the LRU policies above.

All of this eviction logic runs before the actual command executes when you access Redis, so it directly delays the commands we run.

Moreover, if the instance holds big keys, evicting one and freeing its memory takes even longer and the latency is even higher; this deserves special attention.

If your traffic is heavy, you must set maxmemory to cap the instance's memory, and eviction is driving latency up, then besides avoiding big keys and using a random eviction policy as described above, you can also consider splitting the instance. Splitting spreads one instance's eviction pressure across several instances, which reduces latency to a certain extent.

Fork is time-consuming

If your Redis has automatic RDB snapshots or AOF rewrite enabled, latency may increase while an RDB file or AOF rewrite is being generated in the background, and disappear once the task completes.

Such latency is generally caused by these RDB-generation and AOF-rewrite tasks.

Generating an RDB file or rewriting the AOF requires the parent process to fork a child process to perform the persistence. During the fork, the parent must copy its memory page tables to the child; if the instance occupies a lot of memory, the page tables are large and copying them takes time and considerable CPU. Until the fork completes, the entire instance is blocked and cannot process any request, and if CPU is tight the fork takes even longer, potentially reaching the level of seconds. This severely hurts Redis performance.

For the underlying details, see my earlier article: How does Redis persistence work? A comparative analysis of RDB and AOF.

We can run the info command and check latest_fork_usec, the duration of the most recent fork, in microseconds. This is the time during which the entire instance was blocked and unable to process requests.
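A quick check from the shell (the value shown is illustrative):

$ redis-cli INFO stats | grep latest_fork_usec
latest_fork_usec:59477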

Besides backup-driven RDB generation, when a master and a slave first establish replication, the master also generates an RDB file for the slave's full synchronization, which likewise affects Redis performance.

To avoid this, plan the backup schedule: run backups on a slave node, preferably during off-peak hours. And if the business is not sensitive to data loss, consider not enabling AOF and AOF rewrite at all.

Fork cost also depends on the underlying system: on a virtual machine, fork takes longer than on a physical machine, so Redis is best deployed on physical machines to reduce the impact of fork.

Binding CPUs

When deploying services, we often pin a process to specific CPUs to improve performance and reduce the context-switching cost of running across multiple CPUs.

But for Redis we do not recommend this, for the following reason.

If Redis is bound to a CPU, the child process forked for persistence inherits the parent's CPU affinity. The child consumes a lot of CPU doing persistence and contends for that CPU with the main process, leaving the main process short of CPU and increasing access latency.

So if you enable RDB generation or AOF rewrite, do not bind the Redis process to a CPU!

Turning on AOF

As mentioned above, an AOF rewrite raises Redis latency because the fork is expensive. Beyond that, enabling AOF with an unreasonable flush policy also causes performance problems.

With AOF enabled, Redis appends every write command to the AOF file in real time; however, the write first goes to an in-memory buffer, and the buffered content is only actually written to disk once it exceeds a certain threshold or a certain amount of time has passed.

To control how safely those writes reach disk, AOF offers three fsync (disk-flush) policies:

  • appendfsync always: fsync on every write; the greatest performance impact and the highest disk IO, but the highest data safety;

  • appendfsync everysec: fsync once per second; a relatively small performance impact, with at most 1 second of data lost if the node goes down;

  • appendfsync no: leave flushing to the operating system; the least performance impact but the lowest data safety, and how much data is lost on a crash depends on the OS flush behavior.

With the first policy, appendfsync always, every write command Redis processes is synced to disk before the command completes, and this happens in the main thread.

Flushing in-memory data to disk adds disk IO load, and disk operations cost far more than memory operations. With a heavy write volume and an fsync on every update, the machine's disk IO climbs very high and drags down Redis performance, so we do not recommend this policy.

By comparison, appendfsync everysec flushes once per second, while appendfsync no depends on the OS flush timing and is not very safe. We therefore recommend appendfsync everysec: in the worst case you lose 1 second of data, while access performance remains good.
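A typical configuration then looks like this (these parameters can also be set in redis.conf):

# Enable AOF persistence and fsync once per second
CONFIG SET appendonly yes
CONFIG SET appendfsync everysec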

Of course, some business scenarios are insensitive to data loss, and for them AOF need not be enabled at all.

Using swap

If Redis suddenly becomes very slow, with each access taking hundreds of milliseconds or even seconds, check whether Redis is using swap. Once it is, Redis basically cannot deliver high-performance service.

As we know, the operating system provides a swap mechanism: when memory runs short, part of the data in memory is swapped out to disk to buffer memory usage.

But once in-memory data has been swapped to disk, accessing it requires reading from disk, which is far slower than memory!

This is especially painful for an in-memory database like Redis: if Redis's memory is swapped to disk, the resulting operation times are unacceptable for such a latency-sensitive system.

Check the machine's memory usage to confirm whether swap is actually being used because memory ran short.
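On Linux, one common way to check whether the Redis process itself has been swapped out is to inspect its smaps entries (replace <redis_pid> with the actual process ID):

# Find the Redis process ID
$ ps -ef | grep redis-server

# List the swap usage of each memory region of the Redis process (in kB)
$ cat /proc/<redis_pid>/smaps | grep Swap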

If swap is indeed in use, free up enough memory for Redis promptly, then release Redis's swap so Redis goes back to using RAM.

Releasing Redis's swap usually involves restarting the instance. To avoid impacting the business, perform a master-slave switchover first, release the swap on the old master and restart it, and then switch back to it once data synchronization completes.

As you can see, once Redis is on swap, its high performance is essentially gone, so we must guard against this in advance.

Monitor memory and swap usage on Redis machines, alert promptly when memory runs short or swap is used, and handle it in time.

Network card load is too high

Suppose you have avoided all of the performance traps above and Redis has run stably for a long time, yet after a certain point in time access to Redis starts slowing down and stays slow. What causes this?

We have hit this before; the telltale sign is a slowdown that begins at a certain point in time and persists. When it happens, check the machine's network card traffic to see whether the NIC bandwidth is saturated.

If the NIC is overloaded, transmission delays and packet loss occur at the network and TCP layers. Apart from memory, Redis's high performance rests on network IO, and a sudden surge in requests can push the NIC load up.

In that case, identify which Redis instance on the machine is generating the traffic that fills the bandwidth, and then confirm with the business whether the surge is expected. If it is, scale out or migrate the instance promptly so that other instances on the machine are not affected.

At the operations level, monitor the machine's metrics, including network traffic, alert before thresholds are reached, and confirm with the business and scale out in time.

Above we have summarized the common scenarios that can increase Redis latency or even block the instance, covering both business-usage issues and Redis operations issues.

As you can see, keeping Redis performing well involves CPU, memory, network, and even disk, along with the relevant features of the operating system.

As developers, we need to understand Redis's operating mechanisms, such as each command's execution complexity, the data expiration strategy, and the data eviction strategy, use appropriate commands, and optimize for our business scenarios.

As DBAs, we need to understand data persistence, the operating system's fork behavior, and the swap mechanism; plan Redis capacity sensibly, reserve sufficient machine resources, and monitor the machines thoroughly to keep Redis running stably.

Redis best practices: the business level and the operations level

Above, I mainly covered Redis's common slowdown scenarios and how to locate and analyze them; they are mostly caused by unreasonable business usage and improper operations.

Once we understand what slows Redis down, we can optimize for each cause and make Redis both stable and faster.

Next, I will summarize best practices for using Redis at two levels: the business level and the operations level.

I previously wrote a lot of UGC backend services and used Redis in a great many scenarios; I stepped on plenty of pitfalls along the way and gradually worked out a set of reasonable usage practices.

Later I moved to infrastructure work, developing Codis and Redis-related middleware. At that stage my focus shifted from using Redis to developing and operating it, concentrating on the problems arising from Redis's internals and its operation, and I accumulated some experience there as well.

For both areas, I will share what I consider reasonable ways to use and operate Redis. It may not be the most comprehensive, and it may differ from how you use Redis, but these are conclusions drawn from real pitfalls, offered for your reference.

Business level

At the business level, the question for developers is how to use Redis sensibly when writing business code. This requires a basic understanding of Redis, so that it is applied in suitable scenarios and business-level latency is avoided.

During development, the business-level optimization suggestions are as follows:

  • Keep key names as short as possible; when the number of keys is very large, long key names consume significant extra memory;

  • Be sure to avoid storing excessively large values (big values); allocating and freeing memory for them takes a long time and blocks the main thread;

  • On Redis 4.0 and above, enable the lazy-free mechanism so that freeing large values happens asynchronously without blocking the main thread;

  • Set expiration times and use Redis as a cache; especially when the key count is large, keys without an expiry cause unbounded memory growth;

  • Do not use high-complexity commands such as SORT, SINTER, SINTERSTORE, ZUNIONSTORE, and ZINTERSTORE; they take a long time to execute and block the main thread;

  • When reading, fetch as few elements as possible per call. When the container size is unknown, avoid operations like LRANGE key 0 -1 and ZRANGE key 0 -1; specify a concrete element count instead, preferably fewer than 100 per query;

  • When writing, write as few elements as possible per call, e.g. with HMSET key field1 value1 field2 value2 ...; keep the number of fields written at once under about 100, and split large writes into multiple batches;

  • For batch operations, replace GET/SET with MGET/MSET and HGET/HSET with HMGET/HMSET to cut the number of network round trips and reduce latency; for commands without a batch form, use a pipeline to send multiple commands to the server at once (see the pipeline sketch after this list);

  • Never use the KEYS command. When you need to scan the instance, use SCAN, and in production control the scan rate to avoid causing performance jitter in Redis;

  • Avoid having a large number of keys expire at the same moment; when setting concentrated expiry times, add a random offset to spread the expirations out, reducing the pressure on Redis when keys expire together and avoiding blocking the main thread;

  • Choose the eviction policy according to the business scenario; random eviction usually evicts data faster than LRU eviction;

  • Access Redis through a connection pool with sensible pool parameters, and avoid short-lived connections: the TCP three-way handshake and four-way teardown are also time-consuming;

  • Use only db0 rather than multiple dbs. Multiple dbs add overhead, since every access to a different db requires a SELECT command; if business lines differ, split them into separate instances instead, which also raises single-instance performance;

  • When the read volume is high, use read-write separation, provided the business can tolerate replicas briefly serving stale data;

  • When the write volume is high, use a cluster and deploy multiple instances to share the write pressure.
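Referring back to the batch-operation suggestion above, here is a minimal pipeline sketch, assuming the Python redis-py client (key names and values are illustrative):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# Send several commands in one network round trip instead of one IO each;
# transaction=False gives a plain pipeline rather than a MULTI/EXEC block
pipe = r.pipeline(transaction=False)
pipe.set("book_price_1000", 42)
pipe.expire("book_price_1000", 3600)
pipe.get("book_price_1000")
results = pipe.execute()  # e.g. [True, True, b"42"]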

Operations level

The operations level is mainly the DBA's concern: plan Redis deployments sensibly and keep Redis running stably. The main optimizations are as follows:

  • Deploy different business lines on separate instances, isolated from one another rather than mixed. Preferably put different business lines on different machines, grouped by business importance, so that a problem in one business line cannot affect others;

  • Ensure that the machine has sufficient CPU, memory, bandwidth, and disk resources to prevent excessive load from affecting Redis performance;

  • Deploy instances as master-slave clusters distributed across different machines to avoid single points of failure, and set the slaves to read-only;

  • Keep the machines hosting master and slave nodes independent of each other, and do not cross-deploy instances: backups normally run on the slave and consume machine resources, so cross-deployment would hurt the master's performance;

  • Deploy sentinel nodes to improve availability: at least 3 of them, distributed across different machines, to enable automatic failover;

  • Plan capacity in advance. Keep the memory limit of the instances deployed on one machine to at most half of the machine's memory; master-slave synchronization can occupy up to an additional equal amount of memory, so this prevents a large-scale network failure, which could trigger full synchronization on all master-slave pairs at once, from eating up all the machine's memory;

  • Monitor machine CPU, memory, bandwidth, and disk, and alert promptly when resources run short: Redis performance drops sharply once it uses swap; an overloaded network noticeably increases access latency; and when disk IO is too high, enabling AOF slows Redis down;

  • Set an upper limit on the maximum number of client connections, to prevent excessive connections from overloading the service;

  • Keep a single instance's memory usage below 10 GB; an oversized instance means long backup times, high resource consumption, and longer blocking during master-slave full synchronization;

  • Set a reasonable slowlog threshold (10 milliseconds is recommended) and monitor it; alert promptly if too many slow logs are produced;

  • Set a reasonable replication backlog (repl-backlog) size; increasing it appropriately lowers the probability of master-slave full resynchronization;

  • Set a reasonable client-output-buffer-limit for slave clients; on write-heavy instances, raising it appropriately helps avoid interruptions in master-slave replication;

  • It is recommended to do the backup on the slave node, which does not affect the performance of the master;

  • Either do not enable AOF, or configure AOF to flush once per second, to avoid disk IO dragging down Redis performance;

  • When an instance has a memory limit set and you need to raise it, adjust the slave first and then the master, otherwise the master's and slave's data may become inconsistent;

  • Add monitoring to Redis, and use long-lived connections when collecting info metrics; frequent short connections also hurt Redis performance;

  • When scanning all keys of a production instance, remember to set a sleep interval, so the QPS spike from the scan does not cause performance jitter in Redis;

  • Monitor Redis's runtime metrics, especially expired_keys, evicted_keys, and latest_fork_usec; a short-term spike in any of these can block the whole instance and cause performance problems (see the sketch after this list).
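As a sketch of that last suggestion, a watcher over these three indicators might look like this, assuming the redis-py client over one long-lived connection (the thresholds and interval are illustrative and should be tuned per workload):

import time
import redis

r = redis.Redis(host="127.0.0.1", port=6379)  # one long-lived connection

EXPIRED_DELTA_MAX = 10000    # expired keys per interval
EVICTED_DELTA_MAX = 10000    # evicted keys per interval
FORK_USEC_MAX = 1_000_000    # 1 second spent in the last fork

prev = r.info("stats")
while True:
    time.sleep(10)
    cur = r.info("stats")
    if cur["expired_keys"] - prev["expired_keys"] > EXPIRED_DELTA_MAX:
        print("ALERT: expired_keys spiked in the last interval")
    if cur["evicted_keys"] - prev["evicted_keys"] > EVICTED_DELTA_MAX:
        print("ALERT: evicted_keys spiked in the last interval")
    if cur["latest_fork_usec"] > FORK_USEC_MAX:
        print("ALERT: the last fork blocked the instance for over 1s")
    prev = cur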

These are the practices I recommend from using Redis and from developing Redis-related middleware; every aspect mentioned above has come up to some degree in real use.

As you can see, to reliably get Redis's high performance we must get every one of these aspects right; a problem in any one of them will inevitably affect Redis performance. This places high demands on how we use and operate it.

If you have run into other problems, or have better experience using Redis, leave a comment and let's discuss.
