Incident Review Highlights - An Outage Caused by a Redis Cached Big Key - Look, Listen, Ask, Feel (望闻问切) 09 (updated weekly)

【Problem Description】

  • The online business interface service suddenly began timing out and triggering alarms; the availability of the upstream callers' interface dropped to 80%.

【Incident Level】

  • P0

【Timeline】

  • 19:44 Alarm received: the interface's tp99 exceeded 300ms (normally around 8ms)
  • 19:50 Checked machine monitoring; some machines were timing out, and the affected machines were random rather than fixed
  • 19:55 This interface depends on the cache and has a downgrade switch to query the DB instead (confirmed the switch was off)
  • 19:58 Checked cache cluster monitoring; no client-side cache problems found overall
  • 20:03 Ops assisted in inspecting the cluster and confirmed the cache servers themselves were fine
  • 20:10 Scanned a cache replica and found a single big key exceeding 5MB that was being requested continuously
  • 20:13 Deleted the big key; the cluster recovered

【Cause of Failure】A cached big key!

Summary:

Under the sustained pressure of high-concurrency traffic, distributed frameworks and distributed caches have become standard parts of the technology stack at major Internet companies. Among them, Redis, handling high-frequency requests with fast responses, has become the leading choice for distributed caching.

In interviews we are often asked how to handle cached big keys. Below, borrowing the "look, listen, ask, feel" (望闻问切) diagnostic framework from traditional Chinese medicine, we analyze how to discover a cached big key, how to confirm it is the problem, and how to resolve it.

Look (望)

For an interface service whose client connects only to the cache cluster, how can we quickly confirm from the metrics whether the problem is a cached big key?

First, rule out hardware problems on the Redis server cluster:
Use the cluster metrics to determine whether an abnormality on the cache server machines is causing the response timeouts. Observe the Redis servers' CPU, memory, network, and other machine metrics to exclude hardware problems on the Redis side.
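
As a quick aid in this step, server-side health can also be pulled programmatically via the INFO command. A minimal sketch, assuming redis-py and an instance reachable on localhost:

```python
# Pull server-side metrics via INFO to help rule out host-level issues.
# Assumes redis-py (pip install redis) and a locally reachable instance.
import redis

r = redis.Redis(host="localhost", port=6379)
info = r.info()
print("memory used:", info["used_memory_human"])
print("connected clients:", info["connected_clients"])
print("outbound kbps:", info["instantaneous_output_kbps"])
```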

Listen (闻)

Next, examine the symptoms around the client connections timing out and being dropped, and why the connection between the client servers and the Redis service is interrupted.

Symptom 1: The Redis shards' inbound traffic is modest, but the outbound traffic is huge.

Symptom 2: The timeouts always hit one specific Redis shard, whose outbound traffic is huge, while the outbound traffic of the other shards is normal.

First we need to be clear about one concept: cache data routing. Redis Cluster uses virtual hash slot partitioning rather than consistent hashing. It pre-allocates 16384 (2^14) slots; every key is mapped by a hash function to an integer slot in the range 0 to 16383, and the master node of each partition is responsible for a subset of the slots and the key-value data mapped to them.

So every piece of data we store in Redis, as long as the sharding does not change, always lives on the same shard. Combining this property with the cluster monitoring metrics, we can quickly determine whether a big key is the culprit.
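
To make the routing concrete, here is a minimal sketch of Redis Cluster's key-to-slot mapping, i.e. CRC16 (XMODEM variant) mod 16384; note that real clients also honor hash tags in {...}, which this sketch skips:

```python
# Sketch of Redis Cluster's key -> slot mapping: CRC16/XMODEM mod 16384.
# Hash tags ({...}) are ignored here for brevity; real clients honor them.

def crc16_xmodem(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    return crc16_xmodem(key.encode()) % 16384

# The same key always maps to the same slot, hence the same shard.
print(key_slot("user:10086:orders"))
```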

Per-shard outbound traffic monitoring under normal conditions:

[Figure: outbound traffic of each shard under normal conditions]

High outbound traffic concentrated on one fixed shard, caused by a cached big key:

[Figure: a single shard with abnormally high outbound traffic caused by a big key]

At this point we can be fairly sure the problem is a cached big key. The next step is how to quickly find these big keys and delete them.

Locate the problematic cache shard and scan it for big keys. There are many methods online; if you are interested you can search for them yourself. Here we only introduce a few of the more popular approaches. It is recommended that your ops colleagues set up this kind of infrastructure tooling in advance.

1. The redis-rdb-tools utility. Run bgsave on the Redis instance, then analyze the dumped RDB file to find the big keys. For a detailed introduction to the rdb tool, see github.com/sripathikri… Basic command: rdb -c memory dump.rdb (dump.rdb is the Redis instance's RDB file, which can be generated with bgsave).

2. The redis-cli --bigkeys command. It finds the largest key of each of the five data types (String, hash, list, set, zset) on an instance.

3. A custom scan script, most often written in Python, using an approach similar to redis-cli --bigkeys; a minimal sketch in this style follows below.
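
Here is a minimal custom scan in the spirit of method 3, assuming redis-py and a single node; in a cluster you would run it against each shard, ideally a replica, to avoid loading the masters:

```python
# Minimal big-key scan sketch (method 3). Assumes redis-py and one node.
import redis

THRESHOLD = 5 * 1024 * 1024  # 5MB, the size that triggered this incident

r = redis.Redis(host="localhost", port=6379)
for key in r.scan_iter(count=500):         # SCAN is incremental, non-blocking
    size = r.memory_usage(key, samples=0)  # MEMORY USAGE (Redis >= 4.0)
    if size and size > THRESHOLD:
        print(f"big key: {key!r} ~ {size / 1024 / 1024:.1f} MB")
```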

With these we can find the big keys in the cache; running the del command on them removes them.
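
One caveat: del frees a large object synchronously and can block the single-threaded event loop while doing so; on Redis 4.0 and later, unlink reclaims the memory in a background thread, which is safer for big keys:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
# DEL would block while freeing a large value; UNLINK (Redis >= 4.0) frees it
# asynchronously. "some:big:key" is a hypothetical name for illustration.
r.unlink("some:big:key")
```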

How we prevent this in the first place is the real focus.

1. Redis is positioned as an in-memory cache, not a database, so if complex data structures are involved, they should be split up at design time.

2. Before every cache set call, add aspect-based validation: define a size threshold, check the size of the key-value being stored, and if it exceeds the threshold, block the write and raise an alert (a sketch follows below).
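
A minimal sketch of idea 2 as a guarded write wrapper; the threshold, the alert() hook, and the guarded_set name are all illustrative assumptions, not an existing API:

```python
import json
import redis

MAX_VALUE_BYTES = 1 * 1024 * 1024   # assumed threshold: 1MB

def alert(msg: str) -> None:        # placeholder for a real alerting channel
    print("[ALERT]", msg)

def guarded_set(r: redis.Redis, key: str, value) -> bool:
    """Validate the value size before writing; block and alert if oversized."""
    payload = value if isinstance(value, bytes) else json.dumps(value).encode()
    if len(payload) > MAX_VALUE_BYTES:
        alert(f"blocked oversized cache write: {key} ({len(payload)} bytes)")
        return False
    return bool(r.set(key, payload))
```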

Finally, let us explain why a cached big key causes the connection between the service and the cache server to time out and be dropped.

First, a concept: Redis has three kinds of buffers in total: client buffers, the replication backlog buffer, and the AOF buffer. The reason big keys break the client-server connection lies in the first one: the client buffers.

Redis allocates an input buffer for each client. Its job is to temporarily hold the commands the client sends, while Redis pulls commands from the input buffer and executes them. The input buffer thus decouples the client sending a command from Redis executing it, as shown in the figure.

[Figure: the client input buffer sits between the client's commands and Redis command execution]

The output buffer Redis sets up for each client also consists of two parts: a fixed 16KB buffer used to hold OK responses and error messages, and a dynamically growing buffer used to hold variable-sized response results. Causes of output buffer overflow (a diagnostic sketch follows the list):

  • Results returned by big key requests
  • Running the Monitor command
  • An unreasonable buffer size configuration
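
To see which connections are accumulating replies, CLIENT LIST exposes each client's output-buffer usage in its omem field. A minimal sketch, assuming redis-py:

```python
# Spot clients whose output buffers are filling up, via CLIENT LIST.
# 'omem' is the total output-buffer memory held by that connection.
import redis

r = redis.Redis(host="localhost", port=6379)
for client in r.client_list():
    if int(client.get("omem", 0)) > 0:
        print(client["addr"], "output-buffer bytes:", client["omem"])
```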

Configuration limiting the output buffer (it can also be inspected and adjusted at runtime, see the sketch below):

client-output-buffer-limit normal 0 0 0 - no limit for normal clients; requests are blocking
client-output-buffer-limit pubsub 8mb 2mb 60 - pub/sub clients: the connection is closed immediately above the 8mb hard limit, or automatically if output stays above 2mb for 60s
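
These limits can be read and adjusted at runtime with CONFIG GET/SET; a sketch assuming redis-py, with an illustrative value rather than a recommendation:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
print(r.config_get("client-output-buffer-limit"))
# CONFIG SET takes the whole class spec as a single string; illustrative value:
r.config_set("client-output-buffer-limit", "pubsub 16mb 4mb 60")
```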

The ways to avoid overflow follow directly:

  • Avoid big keys
  • Do not use the Monitor command in production
  • Set client-output-buffer-limit to reasonable values
This is also the root cause of the disconnections between the client servers hosting our service interface and the Redis server. After a disconnection, the client's watchdog re-establishes a new connection to the Redis server.
As for why the client servers whose connections get dropped are random each time: the answer lies in our Dubbo load-balancing strategy, which defaults to random request distribution.


Origin: juejin.im/post/7084904769579384863