Analysis of the Redis Cluster Data Skew Problem and Its Solutions

Overview 

In server-side development, caching is a common way to speed up request processing, and Redis is one of the most widely used caching technologies. Large-scale Internet services must handle massive request volumes and store large amounts of cached data every day, and a single Redis instance can rarely meet those concurrency and capacity requirements. Such services therefore usually adopt a cluster architecture: Redis instances are scaled out horizontally to raise the request throughput and storage capacity of the cache at relatively low cost.

The Redis cluster architecture has many advantages, but it also has costs: solving problems in some scenarios creates problems in others. Cluster mode increases operational complexity and restricts the use of certain Redis commands (range queries, transactions, and so on). Beyond these well-known issues there is data skew, a relatively hidden problem that appears only in particular scenarios but has a large impact when it does.

Example

For the 2019 Spring Festival lottery service, the business estimated a peak of 20,000 QPS. This was translated into a Redis requirement of 100,000 QPS and 5 GB of memory, and a cluster of 5 shards was deployed, each sized for 1 GB and 20,000 QPS (including reserved capacity). When the activity started, it turned out that the service had a "hot key" and requests were severely skewed: the 60,000 QPS at peak all landed on a single shard, three times its planned capacity. That shard was overloaded and the entire lottery service avalanched.

What is Data Skew

In Redis cluster mode, data is distributed across instances according to a fixed distribution rule. Depending on the characteristics of the business data, that rule can distribute data unevenly. For example, some shard instances hold a large amount of data while others hold little; or some instances hold hot data that is accessed heavily while the data on others is relatively "cold" and almost never accessed. Instances that store a large amount of data or hot data have higher resource utilization and load, and so respond to requests more slowly. When this happens, data skew has occurred.

Types of Data Skew

Skew problems in a Redis distributed cluster fall into two main categories:

1. Storage capacity skew: data always lands on a small number of nodes in the cluster;

2. Request (QPS) skew: requests always land on a small number of nodes.

The Impact of Redis Cluster Skew

Skew hits an in-memory, single-threaded service like Redis especially hard. The main pain points are:

  • When QPS concentrates on a few Redis nodes, those nodes are overloaded and drag down the entire service, and the cluster's QPS capacity can no longer be scaled out;

  • When data capacity is skewed, a few nodes run out of memory and may be killed by the OOM killer, and the cluster's storage capacity can no longer be scaled out;

  • Operation and maintenance become more complicated, because monitoring and alerting thresholds for memory usage, QPS, connection count, and Redis CPU busy can no longer be set uniformly;

  • Because the resources of the other nodes cannot be fully used, the utilization of the Redis servers/containers is low;

  • Automated configuration management becomes harder, since the parameter configuration of nodes within one cluster should be kept as uniform as possible.

After analyzing the impact, let's look at the common causes of serious "skew" of Redis clusters in the production environment.

Common Causes of Redis Cluster Skew

Skew usually originates in one of the following ways:

  • The Redis keyspace is poorly designed and "hot keys" appear, which overloads the QPS of the node holding them and skews the QPS of the cluster;

  • The system contains large collection keys (hash, set, list, etc.), which overloads the capacity and QPS of the node holding the large key and skews the QPS and capacity of the cluster;

  • The DBA plans or expands the cluster improperly, so that data slots are distributed unevenly across nodes, which skews capacity and request QPS;

  • The system makes heavy use of key hash tags, which can put a large number of keys into a few data slots and skew the QPS and capacity of the cluster;

  • An engineer runs a command such as monitor, which makes the client output buffer on that node grow and inflates used_memory_rss, so the node's memory usage rises and capacity is skewed.

Next, when the memory capacity, number of keys, or QPS of a cluster is severely skewed, how should we locate the problem?

How to Troubleshoot Redis Cluster Skew Problems

Check the hot keys of the node and determine the top commands

When cluster QPS is skewed by hot keys, the hot keys and top commands must be located quickly. You can use the open source tool redis-faina, although a real-time Redis analysis platform is better if you have one.

The following is an analysis produced by redis-faina. The two top keys each account for roughly 50% of requests, so the hot key is obvious; the abnormally high share of AUTH commands under Top Commands is also visible.

Overall Stats
========================================
Lines Processed         100000
Commands/Sec            7276.82

Top Prefixes
========================================
ar_xxx         49849   (49.85%)

Top Keys
========================================
c8a87fxxxxx        49943   (49.94%)
a_r:xxxx           49849   (49.85%)

Top Commands
========================================
GET             49964   (49.96%)
AUTH            49943   (49.94%)
SELECT          88      (0.09%)
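
The kind of counting behind such a report can be sketched in a few lines of Python. This is not how redis-faina itself is implemented; it is a minimal illustration that assumes the input is captured MONITOR output in the usual `timestamp [db addr] "cmd" "arg" ...` form, and it simply treats the first argument of every command as the key. The function name summarize_monitor and the sample lines are made up for the example.

```python
import re
from collections import Counter

# One line of MONITOR output looks like:
#   1339518083.107412 [0 127.0.0.1:60866] "get" "user:42"
LINE_RE = re.compile(r'^\d+\.\d+ \[\d+ \S+\] (.*)$')
ARG_RE = re.compile(r'"((?:[^"\\]|\\.)*)"')


def summarize_monitor(lines, top=3):
    """Count commands and first arguments in captured MONITOR lines."""
    commands, first_args = Counter(), Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip "OK" and anything that is not a command line
        args = ARG_RE.findall(match.group(1))
        if not args:
            continue
        commands[args[0].upper()] += 1
        if len(args) > 1:
            # Simplification: the first argument is counted as the key,
            # so an AUTH password shows up here as well.
            first_args[args[1]] += 1
    return commands.most_common(top), first_args.most_common(top)


# Hypothetical capture, only for illustration.
sample = [
    '1700000000.000001 [0 10.0.0.1:5001] "get" "a_r:1"',
    '1700000000.000002 [0 10.0.0.1:5001] "get" "a_r:1"',
    '1700000000.000003 [0 10.0.0.2:5002] "auth" "secret"',
    '1700000000.000004 [0 10.0.0.1:5001] "get" "a_r:1"',
    '1700000000.000005 [0 10.0.0.2:5002] "auth" "secret"',
    '1700000000.000006 [0 10.0.0.1:5001] "get" "a_r:2"',
]
top_commands, top_keys = summarize_monitor(sample)
print(top_commands)
print(top_keys)
```

Because the first argument is counted blindly, a flood of AUTH calls appears in the key ranking too, which is one way an abnormal AUTH share like the one above becomes visible. Keep in mind that running monitor on a busy node is itself costly, so the capture should be short.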

Check whether the system uses large collection keys

Large keys can skew a cluster node's capacity or QPS. For example, a hash key with 50 million fields occupies nearly 10 GB of memory, and the memory capacity or QPS of the node that owns the key's slot is likely to be skewed.

Each operation on such a collection key touches only a few fields, so it is hard to discover the size of the key from the proxy or SDK.

You can use redis-cli --bigkeys to find the big keys on a node. For a full analysis, use redis-rdb-tools (https://github.com/sripathikrishnan/redis-rdb-tools) to analyze the node's entire RDB file; the size_in_bytes column of the result gives the number of bytes of memory each key occupies.

Example using redis-cli for sampling analysis:

redis-cli --bigkeys -p 7000

# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type.  You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).
[00.00%] Biggest string found so far 'key:000000019996' with 1024 bytes
[48.57%] Biggest list   found so far 'mylist' with 534196 items
-------- summary -------
Sampled 8265 keys in the keyspace!
Total key length in bytes is 132234 (avg len 16.00)

Biggest string found 'key:000000019996' has 1024 bytes
Biggest   list found 'mylist' has 534196 items

8264 strings with 8460296 bytes (99.99% of keys, avg size 1023.75)
1 lists with 534196 items (00.01% of keys, avg size 534196.00)
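
For the full analysis with redis-rdb-tools, the memory report is a CSV file, so the largest keys can be pulled out with a short script. The sketch below is only an illustration: it relies on the size_in_bytes column mentioned above and assumes the report also has a column named key; the helper name largest_keys and the sample rows are invented for the example.

```python
import csv
import heapq
import io


def largest_keys(report_file, top=10):
    """Return the `top` largest (size_in_bytes, key) pairs from a CSV report.

    The rows are streamed, so the whole report never has to fit in memory.
    """
    rows = csv.DictReader(report_file)
    sized = ((int(row["size_in_bytes"]), row["key"]) for row in rows)
    return heapq.nlargest(top, sized)


# Hypothetical report contents, only for illustration.
sample_report = io.StringIO(
    "database,type,key,size_in_bytes,encoding,num_elements,len_largest_element\n"
    "0,hash,big:hash,10485760,hashtable,500000,32\n"
    "0,string,small:str,96,string,8,8\n"
    "0,list,mylist,4194304,quicklist,534196,16\n"
)
for size, key in largest_keys(sample_report, top=2):
    print(key, size)
```

With a real report, pass an open file object instead of the in-memory sample.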

Check whether the data slot distribution of each shard in the cluster is even

The following uses Redis Cluster as an example and shows the number of data slots and keys that each node is responsible for. The cluster in this example is slightly skewed, though not seriously, and a rebalance could be considered.

redis-trib.rb info redis_ip:port
nodeip:port (5e59101a...) -> 44357924 keys | 617 slots | 1 slaves.
nodeip:port (72f686aa...) -> 52257829 keys | 726 slots | 1 slaves.
nodeip:port (d1e4ac02...) -> 45137046 keys | 627 slots | 1 slaves.
--------------------- (omitted) ------------------------
nodeip:port (f87076c1...) -> 44433892 keys | 617 slots | 1 slaves.
nodeip:port (a7801b06...) -> 44418216 keys | 619 slots | 1 slaves.
nodeip:port (400bbd47...) -> 45318509 keys | 614 slots | 1 slaves.
nodeip:port (c90a36c9...) -> 44417794 keys | 617 slots | 1 slaves.
[OK] 1186817927 keys in 25 masters.
72437.62 keys per slot on average.
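
To judge how far each node is from the average, the per-node lines of this output can be turned into ratios. The following is a minimal sketch that assumes the line format shown above; skew_report is a hypothetical helper, and the average is taken over the lines passed in rather than over the whole cluster.

```python
import re

# Matches lines such as:
#   nodeip:port (72f686aa...) -> 52257829 keys | 726 slots | 1 slaves.
NODE_RE = re.compile(r"\((\w+)\.\.\.\) -> (\d+) keys \| (\d+) slots")


def skew_report(lines):
    """Return (node_id, keys_ratio, slots_ratio) per node, most loaded first.

    A ratio of 1.0 means the node holds exactly the average of the parsed nodes.
    """
    nodes = []
    for line in lines:
        match = NODE_RE.search(line)
        if match:
            nodes.append((match.group(1), int(match.group(2)), int(match.group(3))))
    if not nodes:
        return []
    # Fall back to 1 so an empty cluster does not divide by zero.
    mean_keys = sum(keys for _, keys, _ in nodes) / len(nodes) or 1
    mean_slots = sum(slots for _, _, slots in nodes) / len(nodes) or 1
    report = [(node_id, keys / mean_keys, slots / mean_slots)
              for node_id, keys, slots in nodes]
    return sorted(report, key=lambda row: row[1], reverse=True)


sample = [
    "nodeip:port (5e59101a...) -> 44357924 keys | 617 slots | 1 slaves.",
    "nodeip:port (72f686aa...) -> 52257829 keys | 726 slots | 1 slaves.",
    "nodeip:port (d1e4ac02...) -> 45137046 keys | 627 slots | 1 slaves.",
]
for node_id, keys_ratio, slots_ratio in skew_report(sample):
    print(f"{node_id}: keys x{keys_ratio:.2f}, slots x{slots_ratio:.2f}")
```

On these three lines the node 72f686aa holds about 10% more keys and slots than the average of the three, which matches the "slightly skewed" reading above.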

Check whether the system uses hash tags extensively

In a Redis cluster, some businesses use hash tags to place certain kinds of keys on the same shard so that multi-key operations work, which can make data and QPS uneven. You can use scan to check whether keys in the keyspace use hash tags, or use monitor or the vc-redis-sniffer tool to analyze the skewed node and check whether a large number of keys carry hash tags.
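
Why hash tags concentrate keys can be seen from the slot calculation itself. Redis Cluster maps a key to one of 16384 slots with CRC16 (the XMODEM variant), and if the key contains a non-empty {...} section, only that section is hashed. The sketch below reproduces this rule in Python using binascii.crc_hqx from the standard library, which computes the same CRC; the function name key_slot is just a label for the example.

```python
import binascii

SLOT_COUNT = 16384


def key_slot(key: bytes) -> int:
    """Compute the Redis Cluster hash slot of a key, honoring hash tags."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first "{" and the next "}" counts.
        if end > start + 1:
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % SLOT_COUNT


print(key_slot(b"foo"))  # 12182
# Keys sharing a hash tag always land in the same slot:
print(key_slot(b"{user1000}.following") == key_slot(b"{user1000}.followers"))  # True
```

Every key that carries the tag {user1000} ends up in one slot, and therefore on one node, no matter how many such keys exist. That is convenient for multi-key operations and is exactly what produces skew when a single tag covers a large share of the data or traffic.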

Check whether memory capacity is skewed by an abnormal client output buffer

Confirm whether a client is using an abnormal amount of output buffer and causing excessive memory usage; for example, the client output buffer can grow very large while monitor or keys is running or while a slave performs a full sync.

In this case the memory of the Redis instance typically grows rapidly and then falls back soon afterwards. Monitor the client output buffer usage, as in the following example:

# Watch client_longest_output_list (the length of the output list) to see whether any client is using a large output buffer.
redis-cli  -p 7000 info clients
# Clients
connected_clients:52
client_longest_output_list:9179
client_biggest_input_buf:0
blocked_clients:0

# List the clients whose output buffer list length is not 0. Here the monitor client is using about 370 MB of output buffer.
redis-cli  -p 7000 client list | grep -v "oll=0"
id=1840 addr=xx64598  age=75 idle=0 flags=O obl=0 oll=15234 omem=374930608 cmd=monitor
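
The same check can be scripted so that it is easy to alert on. This is a minimal sketch that parses the key=value fields of client list output and reports clients whose omem exceeds a threshold; the helper names and the 10 MB default threshold are assumptions made for the example, not recommended values from a tool.

```python
def parse_client_line(line):
    """Split one `client list` line into a dict of its key=value fields."""
    fields = {}
    for token in line.split():
        name, sep, value = token.partition("=")
        if sep:
            fields[name] = value
    return fields


def heavy_output_clients(lines, min_omem_bytes=10 * 1024 * 1024):
    """Return (id, cmd, omem) for clients using at least min_omem_bytes."""
    heavy = []
    for line in lines:
        fields = parse_client_line(line)
        omem = int(fields.get("omem", 0))
        if omem >= min_omem_bytes:
            heavy.append((fields.get("id"), fields.get("cmd"), omem))
    return sorted(heavy, key=lambda item: item[2], reverse=True)


sample = [
    "id=1840 addr=xx64598  age=75 idle=0 flags=O obl=0 oll=15234 omem=374930608 cmd=monitor",
]
for client_id, cmd, omem in heavy_output_clients(sample):
    print(client_id, cmd, f"{omem / 1024 / 1024:.0f} MiB")
```

In practice the lines would come from the output of redis-cli client list on the suspect node.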

How to Effectively Avoid the Redis Cluster Skew Problem

  • When designing the Redis cluster keyspace and query pattern, avoid hot keys. If the business logic inevitably has hot keys, try to spread them across different nodes or add a local cache in the application;

  • When designing the Redis cluster keyspace, avoid large keys and split them up; besides causing skew, large keys seriously affect the stability of the cluster;

  • When deploying, expanding, or shrinking a Redis cluster, make sure data slots stay evenly distributed;

  • From the system design perspective, avoid key hash tags where possible;

  • In daily operation and maintenance and in application code, avoid running commands such as keys and monitor directly, since they make the output buffer pile up; renaming such commands is recommended;

  • Configure a limit on the output buffer of normal clients (client-output-buffer-limit): 10 MB is recommended, with a 1 GB limit for slaves, adjusted temporarily as needed (warning: confirm with the business before changing this, to avoid business errors).
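
One common way to spread a hot key across nodes, as suggested in the first point above, is to store several copies of the value under suffixed key names and let each read pick one copy at random. The sketch below only shows the naming and the resulting slots; the "#" suffix, the helper names, and the example key are assumptions, and the application still has to write every copy whenever the value changes.

```python
import binascii
import random


def copy_names(hot_key, copies):
    """Names of the copies of a hot key, e.g. 'k#0' ... 'k#7'."""
    return [f"{hot_key}#{i}" for i in range(copies)]


def pick_copy(hot_key, copies, rng=random):
    """Choose one copy at random so that reads spread over all copies."""
    return f"{hot_key}#{rng.randrange(copies)}"


def key_slot(key):
    """Redis Cluster slot of a key without hash tags (CRC16 mod 16384)."""
    return binascii.crc_hqx(key.encode(), 0) % 16384


names = copy_names("lottery:prize_pool", 8)
for name in names:
    print(name, key_slot(name))
print("read from:", pick_copy("lottery:prize_pool", 8))
```

The copies must not share a hash tag, otherwise they all fall back into one slot. Distinct slots do not guarantee distinct nodes, but with an even slot distribution the copies will normally be spread over several shards.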

In real production scenarios it is hard for a large cluster to be perfectly balanced, but you should at least make sure that no serious skew occurs.

Origin blog.csdn.net/m0_37723088/article/details/130978292