Redis cluster troubleshooting: kernel: Memory cgroup out of memory: Kill process 46773 (redis-server)

Check the system log:

cat /var/log/messages
Sep  3 21:43:22 kn-36 kernel: Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podaae68abc_6d8f_4bb5_a7a8_f57845a47080.slice/docker-bfe13af359ae7ed3a03e3cba6c9ffff6407de218406d5f596205c60f66bec807.scope: cache:804KB rss:47185072KB rss_huge:0KB mapped_file:56KB swap:0KB inactive_anon:0KB active_anon:47185044KB inactive_file:432KB active_file:240KB unevictable:0KB
Sep  3 21:43:22 kn-36 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Sep  3 21:43:22 kn-36 kernel: [37691]     0 37691      253        1       4        0          -998 pause
Sep  3 21:43:22 kn-36 kernel: [41147]   999 41147 16147695 11795378   23560        0           682 redis-server
Sep  3 21:43:22 kn-36 kernel: [46773]   999 46773 16147733 11795673   23561        0           682 redis-server
Sep  3 21:43:22 kn-36 kernel: [46804]     0 46804   188620     1288      26        0           682 runc:[2:INIT]
Sep  3 21:43:22 kn-36 kernel: Memory cgroup out of memory: Kill process 46773 (redis-server) score 1683 or sacrifice child
Sep  3 21:43:22 kn-36 kernel: Killed process 46773 (redis-server) total-vm:64590932kB, anon-rss:47182692kB, file-rss:0kB, shmem-rss:0kB
Sep  3 21:43:22 kn-36 dockerd: time="2023-09-03T21:43:22.986862015+08:00" level=error msg="stream copy error: reading from a closed fifo"
Sep  3 21:43:22 kn-36 dockerd: time="2023-09-03T21:43:22.986868156+08:00" level=error msg="stream copy error: reading from a closed fifo"
Sep  3 21:43:22 kn-36 dockerd: time="2023-09-03T21:43:22.988216831+08:00" level=error msg="Error running exec e6974504af953e738c75f40cd56b640562976bd3b01cf972d656efeaf21ded22 in container: OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: close exec fds: readdirent /proc/self/fd: bad address: unknown"
Sep  3 21:43:22 kn-36 kubelet: E0903 21:43:22.992576    8504 summary_sys_containers.go:47] Failed to get system container stats for "/system.slice/docker.service": failed to get cgroup stats for "/system.slice/docker.service": failed to get container info for "/system.slice/docker.service": unknown container "/system.slice/docker.service"
Sep  3 21:43:26 kn-36 kernel: redis-server invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=682
Sep  3 21:43:26 kn-36 kernel: redis-server cpuset=docker-bfe13af359ae7ed3a03e3cba6c9ffff6407de218406d5f596205c60f66bec807.scope mems_allowed=0
Sep  3 21:43:26 kn-36 kernel: CPU: 29 PID: 46863 Comm: redis-server Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-957.el7.x86_64 #1
Sep  3 21:43:26 kn-36 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
Sep  3 21:43:26 kn-36 kernel: Call Trace:

From the system log we can see two redis-server processes (PIDs 41147 and 46773) on the same server, inside the same memory cgroup, each with a resident set well above 10 GB.
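Note that the rss column in the oom-killer task table is counted in 4 KB pages, so the real resident size falls out of simple shell arithmetic:

# 11,795,378 pages x 4 KB per page, converted to GB:
echo $(( 11795378 * 4 / 1024 / 1024 ))   # prints 44, i.e. roughly 45 GB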
The key log line: kernel: Memory cgroup out of memory: Kill process 46773 (redis-server) score 1683 or sacrifice child
For background on this error, see: OutOfMemoryError series (8): Kill process or sacrifice child

An operating system is built on top of processes. Processes are scheduled and maintained by kernel jobs, one of which is the "Out of memory killer" (OOM killer).
The OOM killer kills processes when available memory runs critically low. As soon as its trigger condition is met it activates, selects a victim process, and kills it. Selection uses heuristic scoring across all processes, and the process with the highest score is the one killed (PID 46773 scored 1683 in the log above). It is a safety mechanism built into the kernel.
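To see how the kernel ranks candidates, the live scores can be read straight out of procfs. A minimal sketch (PID 46773 is taken from the log above and will no longer exist after the kill, so substitute a running one):

# The same oom-killer records also land in the kernel ring buffer:
dmesg | grep -iE 'out of memory|killed process'

# Current badness score for a PID (higher score = killed first):
cat /proc/46773/oom_score
# The adjustment Kubernetes applied to this burstable pod (682 in the log):
cat /proc/46773/oom_score_adj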

Connect to the Redis instance with the client and check memory usage:


# Check the instance's memory usage
127.0.0.1:6379> info memory
# Memory
used_memory:34952849048 # Total memory allocated by the Redis allocator, in bytes: the process's internal overhead plus the memory taken by all keys and values in the instance
used_memory_human:32.55G # used_memory in human-readable form
used_memory_rss:30004371456 # Memory requested from the operating system (usually larger than used_memory, since the allocator's allocation strategy produces fragmentation)
used_memory_rss_human:27.94G
used_memory_peak:46624453568 # Peak memory consumption of Redis, in bytes
used_memory_peak_human:43.42G
used_memory_peak_perc:74.97%  # Current memory as a percentage of the peak: (used_memory / used_memory_peak) * 100%
used_memory_overhead:12842730 # Overhead Redis needs to maintain the dataset: all client output buffers, query buffers, the AOF rewrite buffer, and the replication backlog
used_memory_startup:1463872
used_memory_dataset:34940006318 # Memory taken by the data itself: used_memory - used_memory_overhead
used_memory_dataset_perc:99.97% # Share of net memory taken by data: (used_memory_dataset / (used_memory - used_memory_startup)) * 100%
allocator_allocated:34953270856
allocator_active:34958581760
allocator_resident:35062763520
total_system_memory:135024558080 # Total system memory
total_system_memory_human:125.75G
used_memory_lua:33792
used_memory_lua_human:33.00K
used_memory_scripts:216
used_memory_scripts_human:216B
number_of_cached_scripts:1
maxmemory:35000000000 # Configured maximum memory for this Redis instance
maxmemory_human:32.60G
maxmemory_policy:allkeys-lru  # Eviction policy applied when maxmemory is reached
allocator_frag_ratio:1.00
allocator_frag_bytes:5310904
allocator_rss_ratio:1.00
allocator_rss_bytes:104181760
rss_overhead_ratio:0.86
rss_overhead_bytes:-5058392064
mem_fragmentation_ratio:0.86
mem_fragmentation_bytes:-4948498536
mem_not_counted_for_evict:244
mem_replication_backlog:1048576
mem_clients_slaves:49694
mem_clients_normal:9467744
mem_aof_buffer:244
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
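When triaging, you rarely need the full dump; filtering for the handful of fields that matter in this incident is enough (a minimal sketch, the field list is just what this investigation used):

redis-cli info memory | grep -E '^(used_memory|used_memory_peak|maxmemory|maxmemory_policy|mem_fragmentation_ratio):'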

The maxmemory parameter
In production we do not allow Redis to swap, so to cap memory usage Redis provides the maxmemory configuration parameter.
When actual memory usage exceeds maxmemory, Redis offers several selectable policies (maxmemory-policy) that let you decide how space should be freed so reads and writes can continue:

  1. noeviction: stop serving write requests (DEL requests are still served) while reads continue. No data is lost, but writes from the application start failing. This is the default policy.
  2. volatile-lru: evict only keys that have an expiry set, least recently used first. Keys without an expiry are never evicted, so data meant to be persisted does not suddenly disappear.
  3. volatile-ttl: same scope as above, but ranked by remaining TTL instead of LRU; the smaller the TTL, the sooner the key is evicted.
  4. volatile-random: same scope as above, but the evicted key is picked at random from the keys that have an expiry.
  5. allkeys-lru: unlike volatile-lru, this evicts from the entire keyspace rather than just keys with an expiry, so keys without an expiry can also be evicted.
  6. allkeys-random: same as above, but keys are evicted at random.

In short, the volatile-xxx policies only evict keys that carry an expiry, while the allkeys-xxx policies consider every key. If Redis is used purely as a cache, pick an allkeys-xxx policy and don't bother attaching expiries when writing. If you also rely on keys persisting, pick a volatile-xxx policy so keys without an expiry are kept permanently and never evicted by LRU. A quick sketch of checking and switching the policy at runtime follows.
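The limit and policy can be inspected and changed at runtime with redis-cli. A minimal sketch (note that CONFIG SET changes are not written back to redis.conf unless CONFIG REWRITE is also run):

# Current limit and eviction policy:
redis-cli config get maxmemory
redis-cli config get maxmemory-policy

# For a pure cache, allow eviction across the whole keyspace:
redis-cli config set maxmemory-policy allkeys-lru
# Optionally persist the runtime change back to the config file:
redis-cli config rewrite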

Memory usage is nearly at the configured ceiling: used_memory (34,952,849,048 bytes) is about 99.87% of maxmemory (35,000,000,000 bytes), and used_memory_dataset_perc shows that 99.97% of that memory is the data itself.
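Both figures can be rechecked by hand from the info memory output above, e.g. with a small awk sketch over the raw byte counts:

awk 'BEGIN { printf "%.2f%%\n", 34952849048 / 35000000000 * 100 }'   # 99.87%, used_memory vs maxmemory
awk 'BEGIN { printf "%.2f%%\n", 34952849048 / 46624453568 * 100 }'   # 74.97%, matches used_memory_peak_perc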

Hunting down big keys in Redis

root@redis-2:/data# redis-cli -c --bigkeys
# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type.  You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).

[00.00%] Biggest hash   found so far 'user:res:pkgbiz:2dbf5009364647aebb128b526f904194' with 1 fields
[00.00%] Biggest string found so far 'user:sign:confirm:27246432ced94a908ab8459c02a5e5e6' with 2 bytes
[00.00%] Biggest string found so far '16909799251079910000010000635' with 86400263 bytes
[00.08%] Biggest list   found so far 'user:gateway:routes' with 9 items
[01.10%] Biggest hash   found so far 'user:session:sessions:f2cc3097-9328-427d-a326-354419f3a863' with 6 fields
[02.00%] Biggest list   found so far 'user:svc:locationList' with 2531 items
[03.32%] Biggest hash   found so far 'user-activeSessionCache:sessions:7b1a1c0f-6002-4845-89db-19c36b77dbb9' with 7 fields
[14.57%] Biggest set    found so far 'user-activeSessionCache:expirations:1693880100000' with 1 members
[16.42%] Biggest hash   found so far 'user:T_TY_PUB_SYS_PARAM' with 37 fields
[33.79%] Biggest list   found so far 'user:test:svc:locationList' with 3297 items
[64.89%] Biggest set    found so far 'spring:session:user:user:expirations:1693809960000' with 2 members

-------- summary -------

Sampled 13371 keys in the keyspace!
Total key length in bytes is 775988 (avg len 58.04)

Biggest   list found 'user:test:svc:locationList' has 3297 items
Biggest   hash found 'user:T_TY_PUB_SYS_PARAM' has 37 fields
Biggest string found '16909799251079910000010000635' has 86400263 bytes
Biggest    set found 'spring:session:user:user:expirations:1693809960000' has 2 members

4 lists with 5839 items (00.03% of keys, avg size 1459.75)
2737 hashs with 2850 fields (20.47% of keys, avg size 1.04)
10623 strings with 29981298221 bytes (79.45% of keys, avg size 2822300.50)
0 streams with 0 entries (00.00% of keys, avg size 0.00)
7 sets with 8 members (00.05% of keys, avg size 1.14)
0 zsets with 0 members (00.00% of keys, avg size 0.00)

Biggest string found '16909799251079910000010000635' has 86400263 bytes: a single string value of roughly 82 MB.
Biggest list found 'user:test:svc:locationList' with 3297 items: one list holding far too many elements.
The summary confirms the pattern: 10,623 strings account for 29,981,298,221 bytes (about 28 GB), averaging roughly 2.8 MB per string. The cache is being used improperly and the application side needs to rework how it stores data.
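Before reworking the application, the offending keys can be sized up directly. A minimal sketch (MEMORY USAGE requires Redis 4.0+, and -c makes redis-cli follow cluster redirects):

# Value length and total allocation of the ~82 MB string:
redis-cli -c strlen 16909799251079910000010000635
redis-cli -c memory usage 16909799251079910000010000635
# A TTL of -1 means the key never expires and can only go away via eviction or DEL:
redis-cli -c ttl 16909799251079910000010000635
# Length of the oversized list:
redis-cli -c llen user:test:svc:locationList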

Reposted from blog.csdn.net/d495435207/article/details/132671831