Technical Interpretation丨Locating and Optimizing the Big Key Problem in the Redis Distributed Cache Database

Summary: How to locate a big key problem in a distributed Redis cache database, with a practical case that walks you through the optimization method.

【Background】

An OOM error occurred when accessing a Redis 5.0 Cluster instance; the error message was (error) OOM command not allowed when used memory > 'maxmemory'. Some ECS applications could not write to the database, which affected normal use of the service. Executing set t2 s2 returned the OOM error, as shown below:

【Topology】

Environmental information:

Redis 5.0 Cluster instance, 4 GB memory

DCS network segment: 192.168.1.0/24

Shard 1: master 192.168.1.12 slave 192.168.1.37

Shard 2: master 192.168.1.10 slave 192.168.1.69

Shard 3: master 192.168.1.26 slave 192.168.1.134

【Analysis Approach】

【Detailed Steps】

1. Check monitoring

Checking the Redis instance monitoring shows overall cluster memory usage at 46.97%, with no obvious abnormality. The result is shown in the following figure:

Checking per-node memory monitoring shows that the memory usage of master node 192.168.1.10 in shard 2 has reached 100%, while the other two shards sit at about 20%. The results are shown in the following figure:

2. Confirm the abnormal shard

The monitoring above shows that the memory usage of shard 2 in the Redis cluster has reached 100%, and it is the only shard with abnormal memory usage.

3. Big KEY analysis

Online analysis

① Tool analysis: use the Cache Analysis > Big Key Analysis tool on the HUAWEI CLOUD management console. After the analysis completes, simply view the report. The result is shown in the following figure (the string type keeps the top 20 results; the list/set/zset/hash types keep the top 80):

For specific usage, please refer to the following link: https://support.huaweicloud.com/usermanual-dcs/dcs-ug-190808001.html

② Command analysis: run redis-cli -h <IP> -p <port> --bigkeys; the tool lists the largest key found for each data type. The result is shown below:

As shown in the figure above, the big key of string type in this environment is "nc_filed/_pk", with a size of 13283 bytes; no big keys were found for the list, set, hash, and zset types.

Offline analysis

Offline analysis requires the dedicated rdb_bigkeys tool to analyze RDB files. Tool address: https://github.com/weiyanwei412/rdb_bigkeys. The specific steps are as follows:

Compilation method:

# yum install git go -y

# mkdir /home/gocode/

# cd /home/gocode/

# git clone https://github.com/weiyanwei412/rdb_bigkeys.git

# cd rdb_bigkeys

# go build

The executable file rdb_bigkeys is generated when the build completes.

Usage:

./rdb_bigkeys -bytes 1024 -file bigkeys.csv -sorted -threads 4 /home/redis/dump.rdb

Parameter Description:

-bytes 1024: filter keys larger than 1024 bytes

-file bigkeys.csv: save the result to bigkeys.csv file

-sorted: sort from big to small

-threads: the number of threads used

/home/redis/dump.rdb: the actual RDB file path

The generated file style is as follows:

The columns are: database number, key type, key name, key size, number of elements, largest element name, element size, and key expiry time. Reference: https://www.cnblogs.com/yqzc/p/12425533.html
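For follow-up filtering, the CSV output described above can be loaded into structured records. The snippet below is a minimal sketch in Python; the column order follows the description above, and the sample row is invented for illustration only.

```python
import csv
import io

# Column order as produced by rdb_bigkeys (per the description above).
COLUMNS = ["db", "type", "key", "size_bytes", "elements",
           "largest_element", "element_size", "expiry"]

def load_bigkeys(csv_text):
    """Parse rdb_bigkeys CSV output into a list of dicts, largest key first."""
    rows = [dict(zip(COLUMNS, row))
            for row in csv.reader(io.StringIO(csv_text)) if row]
    rows.sort(key=lambda r: int(r["size_bytes"]), reverse=True)
    return rows

# Invented sample row for illustration only.
sample = "0,string,nc_filed/_pk,13283,1,nc_filed/_pk,13283,0\n"
print(load_bigkeys(sample)[0]["key"])  # nc_filed/_pk
```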

4. Solution

The root cause of this OOM problem is that a big key made the data distribution uneven: one shard's memory reached maxmemory, so any write routed to that shard during normal operation failed with OOM. Export a copy of that shard's RDB file to support the subsequent optimization of the big key.
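The skew arises because Redis Cluster maps every key to one of 16384 slots via CRC16(key) mod 16384, so a single huge key pins all of its memory to one shard. A minimal sketch of that slot mapping (CRC16-CCITT/XModem variant, as documented in the Redis Cluster specification; not part of this case's tooling):

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem variant), the checksum Redis Cluster uses."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """Slot a key maps to; a non-empty {hash tag} is honored if present."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

print(hash_slot("nc_filed/_pk"))  # the slot (and hence shard) holding the big key
```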

Temporary solution:

To restore the business as soon as possible, delete the big key found in the step above, operating as follows. (For big keys that are not strings, do not delete them with DEL; instead use HSCAN, SSCAN, or ZSCAN to remove elements incrementally.)
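The incremental deletion of a big hash can be sketched as below, assuming a redis-py-style client (`hscan` returning `(cursor, {field: value})`); the key name and batch size are illustrative assumptions, and the same pattern applies with SSCAN/SREM for sets and ZSCAN/ZREM for sorted sets.

```python
def delete_big_hash(client, key, batch=100):
    """Incrementally delete a big hash: HSCAN a batch of fields, HDEL them,
    and repeat until the scan completes; finally remove the emptied key.
    Deleting in small batches avoids the long blocking pause that a single
    DEL of a huge key would cause."""
    cursor = 0
    while True:
        cursor, fields = client.hscan(key, cursor, count=batch)
        if fields:
            client.hdel(key, *fields)  # iterating the dict yields field names
        if cursor == 0:  # a zero cursor means the scan has finished
            break
    client.delete(key)  # the key is now small (or empty), safe to DEL
```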

Long-term solution:

Split the big key into multiple small keys (value1, value2, ..., valueN). These smaller keys can then be distributed across different shards, avoiding the data skew caused by concentrating all the data in one shard.

Other data types can be split and reorganized in the same way to avoid the impact of big keys.
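The splitting described above can be sketched as a pair of pure helpers: cut a large string value into fixed-size chunks stored under derived key names (which hash to different slots), and reassemble on read. The key-naming scheme and chunk size here are illustrative assumptions, not a prescribed convention.

```python
CHUNK_SIZE = 10 * 1024  # keep each piece within the 10 KB guideline

def split_value(key, value, chunk_size=CHUNK_SIZE):
    """Split one big value into (sub_key, chunk) pairs. The sub-keys
    deliberately avoid a shared {hash tag}, so Redis Cluster spreads
    them across different slots and shards."""
    chunks = [value[i:i + chunk_size] for i in range(0, len(value), chunk_size)]
    return [(f"{key}:part:{n}", c) for n, c in enumerate(chunks)]

def join_value(parts):
    """Reassemble the original value from (sub_key, chunk) pairs,
    ordering by the numeric suffix of each sub-key."""
    ordered = sorted(parts, key=lambda p: int(p[0].rsplit(":", 1)[1]))
    return "".join(c for _, c in ordered)
```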

5. Result verification

Checking the shard monitoring shows that the memory usage of 192.168.1.10 has dropped to 24%. The result is shown in the following figure:

Executing set t2 s2 now succeeds; logging in to the cluster and executing the get command returns the data normally. The results are as follows, and the business has returned to normal.

[Optimization and Suggestions]

1) Configure alarms on node-level memory usage metrics. If a node holds a big key, its memory usage will be much higher than that of other nodes and will trigger an alarm, helping users discover potential big keys.

2) Configure node-level alarms on the maximum inbound bandwidth, maximum outbound bandwidth, and CPU usage metrics. If a node holds a hot key, its bandwidth and CPU usage will be higher than those of other nodes and will readily trigger an alarm, helping users discover potential hot keys.

3) Keep string values within 10 KB, and keep the number of elements in hash, list, set, and zset keys below 5000 wherever possible.

4) Regularly run the big key and hot key analysis tools against the cluster to identify risks as early as possible.
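The thresholds in point 3) can be codified into such a periodic check. Below is a small hypothetical helper using those thresholds; the key metadata it inspects would come from one of the big key scans described earlier.

```python
# Thresholds from the guideline above: strings within 10 KB,
# collection types within 5000 elements.
LIMITS = {"string": ("size_bytes", 10 * 1024),
          "hash": ("elements", 5000),
          "list": ("elements", 5000),
          "set": ("elements", 5000),
          "zset": ("elements", 5000)}

def is_big_key(key_type, size_bytes, elements):
    """Return True if a key exceeds the recommended limit for its type."""
    metric, limit = LIMITS[key_type]
    value = size_bytes if metric == "size_bytes" else elements
    return value > limit
```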

 


Source: blog.csdn.net/devcloud/article/details/109067728