I. Low memory killer daemon
The Android low memory killer daemon (lmkd) process monitors the memory state of a running Android system and reacts to high memory pressure by killing the least essential processes, keeping the system running at an acceptable performance level.
All application processes are forked from zygote, recorded in the mLruProcesses list in AMS, and managed uniformly by AMS. AMS updates each process's oom_adj value according to the state of the process and passes this value to lmkd through a socket. Depending on the kernel version, lmkd either passes the value on to the kernel or handles low-memory reclaim itself. When free memory falls to a certain threshold, processes with high oom_adj values are killed to free up memory.
1. Introduction to memory pressure
Android systems running multiple processes in parallel may experience system memory exhaustion and significant delays in processes requiring more memory. Memory pressure is a state of insufficient system memory that requires Android to release memory (to relieve this pressure) by limiting or terminating unnecessary processes, requesting processes to release non-critical cache resources, etc.
Historically, Android monitored system memory pressure using an in-kernel low memory killer (LMK) driver, a rigid mechanism that relied on hard-coded values. As of kernel 4.12, the LMK driver has been removed from the upstream kernel, and userspace lmkd performs the memory monitoring and process killing tasks instead.
2. Pressure stall information
Android 10 and above support the new lmkd mode, which uses the Kernel Pressure Stall Information (PSI) monitor to detect memory pressure. The PSI patch set in the upstream kernel (backported to the 4.9 and 4.14 kernels) measures the time a task is delayed due to insufficient memory. Because these delays directly impact the user experience, they represent a convenient indicator for determining the severity of memory pressure. The upstream kernel also includes PSI monitors, which allow privileged userspace processes (such as lmkd) to specify thresholds for these latencies and subscribe to events from the kernel when the thresholds are breached.
① PSI monitor and vmpressure signal
Because the vmpressure signal (generated by the kernel to detect memory pressure and used by lmkd) often contains a large number of false positives, lmkd must perform filtering to determine whether memory pressure actually exists. This causes unnecessary lmkd wake-ups and uses additional computing resources. Using a PSI monitor enables more accurate memory pressure detection and minimizes filtering overhead.
② Using the PSI monitor
To use the PSI monitor (instead of vmpressure events), configure the ro.lmk.use_psi property. The default value is true, which uses the PSI monitor as the default mechanism for lmkd memory pressure detection. Since the PSI monitor requires kernel support, the kernel must contain the PSI backport patch and be compiled with PSI support enabled (CONFIG_PSI=y).
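To make the stall measurements and thresholds described above more concrete, here is a minimal Python sketch. It assumes the kernel's `/proc/pressure/memory` text format (`some`/`full` lines with `avg10`, `avg60`, `avg300`, `total`) and the PSI trigger format `<some|full> <stall µs> <window µs>`. The function names, the sample text, and the 1000 ms window are illustrative assumptions, not values taken from lmkd.

```python
def parse_psi(text):
    """Parse /proc/pressure/memory-style text into {kind: {field: value}}."""
    stats = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()          # kind is "some" or "full"
        fields = {}
        for pair in pairs:
            key, value = pair.split("=")
            # "total" is a stall time in microseconds; avg* are percentages.
            fields[key] = int(value) if key == "total" else float(value)
        stats[kind] = fields
    return stats


def psi_trigger(kind, stall_ms, window_ms=1000):
    """Build the string a privileged process would write to register a PSI trigger.

    The 1000 ms default window is an assumption made for this sketch.
    """
    return f"{kind} {stall_ms * 1000} {window_ms * 1000}"


# Hypothetical sample; on a device this would be read from /proc/pressure/memory.
SAMPLE = """\
some avg10=1.25 avg60=0.40 avg300=0.10 total=123456
full avg10=0.50 avg60=0.10 avg300=0.02 total=23456
"""

psi = parse_psi(SAMPLE)
print(psi["some"]["avg10"], psi["full"]["total"])
print(psi_trigger("some", 70))   # a 70 ms partial-stall threshold
```

The `some` line counts time in which at least one task was stalled on memory, while `full` counts time in which all non-idle tasks were stalled.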
3. Disadvantages of the LMK driver in the kernel
Android has deprecated the LMK driver due to a number of issues, including:
- Low-memory devices had to be tuned aggressively, and even then performed poorly on workloads with a large file-backed active page cache. The poor performance resulted in thrashing rather than kills.
- The LMK kernel driver relied on free-memory limits, with no scaling based on memory pressure.
- Because of the rigidity of the design, partners often customized the driver so that it would work on their own devices.
- The LMK driver is hooked into the Slab Shrinker API, which is not designed to perform heavy operations such as searching for and killing targets, which would slow down the vmscan process.
4. User space lmkd
User-space lmkd implements the same functionality as a driver in the kernel, but it uses existing kernel mechanisms to detect and evaluate memory pressure. These mechanisms include using kernel-generated vmpressure events or the Pressure Stall Information (PSI) monitor to get notifications about memory pressure levels, and using the memory cgroup feature to limit the memory resources allocated to each process based on the process's importance.
Using userspace lmkd in Android 10
In Android 9 and above, userspace lmkd is activated when the LMK driver in the kernel is not detected. Because userspace lmkd requires the kernel to support memory cgroups, the kernel must be compiled with the following configuration settings:
CONFIG_ANDROID_LOW_MEMORY_KILLER=n
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
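A quick way to check a kernel configuration against these three settings is sketched below in Python. It assumes a plain-text kernel config (for example a `.config` file); the helper names and the sample text are hypothetical, and an option that is absent is treated as `n`.

```python
# Settings required for userspace lmkd, as listed above.
REQUIRED = {
    "CONFIG_ANDROID_LOW_MEMORY_KILLER": "n",
    "CONFIG_MEMCG": "y",
    "CONFIG_MEMCG_SWAP": "y",
}


def parse_kernel_config(text):
    """Return {option: value}; '# CONFIG_X is not set' is recorded as 'n'."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def unmet_settings(text):
    """List the required settings that the given config text does not satisfy."""
    options = parse_kernel_config(text)
    return [f"{name}={want}" for name, want in REQUIRED.items()
            if options.get(name, "n") != want]


# Hypothetical config fragment used only for illustration.
SAMPLE_CONFIG = """\
# CONFIG_ANDROID_LOW_MEMORY_KILLER is not set
CONFIG_MEMCG=y
"""

print(unmet_settings(SAMPLE_CONFIG))
```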
Termination strategy
Userspace lmkd supports termination policies based on vmpressure events or PSI monitors, their severity, and other cues such as swap utilization. Termination strategies differ for low-memory devices and high-performance devices:
- For devices with insufficient memory, the system will generally choose to endure greater memory pressure.
- For high-performance devices, if memory pressure occurs, it is considered an anomaly and should be repaired promptly to avoid affecting overall performance.
You can configure the termination policy using the ro.config.low_ram property. For details, see Low RAM Configurations.
Userspace lmkd also supports a legacy mode in which it makes termination decisions using the same strategy as the in-kernel LMK driver (that is, free memory and file cache thresholds). To enable legacy mode, set the ro.lmk.use_minfree_levels property to true.
5. Illustration
5-1. LMK/LMKD
5-2. lmkd kill process flow
lmkd can use the minfree table to decide which processes to kill (ro.lmk.use_minfree_levels=1),
or use the medium/critical pressure levels to decide which adj levels to kill (ro.lmk.use_minfree_levels=0).
The medium/critical pressure levels are configured with ro.lmk.medium / ro.lmk.critical; the default values are 800 / 0.
lmkd log:(in main log)
ro.lmk.use_minfree_levels=1
ro.lmk.use_minfree_levels=0
5-3. lmk kill process flow
lmk log: (in kernel log)
6. minfree table & oom adj
6.1 minfree table
adb shell cat /sys/module/lowmemorykiller/parameters/minfree
or
adb root
adb shell getprop |grep minfree
[sys.lmk.minfree_levels]: [18432:0,23040:100,27648:200,32256:250,36864:900,46080:950]
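As a rough illustration of how such a table is read, the Python sketch below parses the `sys.lmk.minfree_levels` string shown above into (threshold, adj) pairs and picks the minimum adj that becomes eligible for killing. It assumes the thresholds are page counts with 4 KiB pages, and that a level applies only when both free memory and file cache are below its threshold. The function names are hypothetical and this is not the lmkd source.

```python
PAGE_KB = 4  # assumption: 4 KiB pages


def parse_minfree_levels(value):
    """Turn 'pages:adj,pages:adj,...' into a list of (pages, adj) tuples."""
    levels = []
    for entry in value.split(","):
        pages, adj = entry.split(":")
        levels.append((int(pages), int(adj)))
    return levels


def min_adj_to_kill(levels, free_pages, file_pages):
    """Return the adj of the first level whose threshold exceeds both inputs.

    Processes with oom adj >= the returned value would be kill candidates;
    None means no threshold has been crossed.
    """
    for pages, adj in levels:
        if free_pages < pages and file_pages < pages:
            return adj
    return None


levels = parse_minfree_levels(
    "18432:0,23040:100,27648:200,32256:250,36864:900,46080:950")

for pages, adj in levels:
    print(f"adj >= {adj:4d} when free and file cache < {pages * PAGE_KB} kB")

print(min_adj_to_kill(levels, free_pages=30000, file_pages=30000))
```

With these values the first threshold is 18432 pages, or 73728 kB under the 4 KiB assumption.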
6.2 How to modify?
• The minfree table can be modified by changing
config_lowMemoryKillerMinFreeKbytesAbsolute and
config_lowMemoryKillerMinFreeKbytesAdjust
in frameworks/base/core/res/res/values/config.xml,
or by changing, in frameworks/base/services/core/java/com/android/server/am/ProcessList.java,
the calculation formula of the updateOomLevels() function or
the default value tables (mOomMinFreeLow/mOomMinFreeHigh).
6.3 oom adj
ADJ Priority | OOMADJ | Corresponding scene |
UNKNOWN_ADJ | 1001 | Generally a process that is going to be cached but whose exact adj value is not yet known |
CACHED_APP_MAX_ADJ | 906 | Maximum adj value for invisible processes (invisible processes may be killed at any time) |
CACHED_APP_MIN_ADJ | 900 | Minimum adj value for invisible processes (invisible processes may be killed at any time) |
SERVICE_B_ADJ | 800 | Service in the B list (older, less likely to be used) |
PREVIOUS_APP_ADJ | 700 | Process of the previous app (for example, APP_A switches to APP_B; once APP_A is no longer visible, it belongs to PREVIOUS_APP_ADJ) |
HOME_APP_ADJ | 600 | Home process |
SERVICE_ADJ | 500 | Service process |
HEAVY_WEIGHT_APP_ADJ | 400 | Background heavyweight process, set in the system/rootdir/init.rc file |
BACKUP_APP_ADJ | 300 | Backup process |
PERCEPTIBLE_APP_ADJ | 200 | Perceptible process, such as background music playback |
VISIBLE_APP_ADJ | 100 | Visible process (visible but without focus, for example when the newly started process shows only a floating Activity on top of it) |
FOREGROUND_APP_ADJ | 0 | Foreground process (the app currently displayed, with an interactive UI) |
PERSISTENT_SERVICE_ADJ | -700 | Associated with system or persistent processes |
PERSISTENT_PROC_ADJ | -800 | System persistent processes, such as telephony |
SYSTEM_ADJ | -900 | System process |
NATIVE_ADJ | -1000 | Native process (not managed by the system) |
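To relate these adj values to a kill threshold, the following Python sketch lists which named levels from the table fall at or above a given minimum adj. The dictionary simply restates the table; the helper name `killable_levels` is hypothetical, and the "adj >= threshold" rule is a simplified reading of how a minimum kill adj is applied.

```python
# Values copied from the table above.
OOM_ADJ = {
    "UNKNOWN_ADJ": 1001,
    "CACHED_APP_MAX_ADJ": 906,
    "CACHED_APP_MIN_ADJ": 900,
    "SERVICE_B_ADJ": 800,
    "PREVIOUS_APP_ADJ": 700,
    "HOME_APP_ADJ": 600,
    "SERVICE_ADJ": 500,
    "HEAVY_WEIGHT_APP_ADJ": 400,
    "BACKUP_APP_ADJ": 300,
    "PERCEPTIBLE_APP_ADJ": 200,
    "VISIBLE_APP_ADJ": 100,
    "FOREGROUND_APP_ADJ": 0,
    "PERSISTENT_SERVICE_ADJ": -700,
    "PERSISTENT_PROC_ADJ": -800,
    "SYSTEM_ADJ": -900,
    "NATIVE_ADJ": -1000,
}


def killable_levels(min_adj):
    """Names of levels whose adj is >= min_adj, highest adj first."""
    names = [name for name, adj in OOM_ADJ.items() if adj >= min_adj]
    return sorted(names, key=lambda name: OOM_ADJ[name], reverse=True)


print(killable_levels(800))   # e.g. a medium-pressure threshold of 800
print(len(killable_levels(0)))  # a threshold of 0 reaches the foreground app
```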
7. lmkd parameters
Parameter | Description | Default | LowRam |
ro.lmk.debug | Debug switch. Debug messages other than kill logs are only visible when it is turned on. | false | |
ro.lmk.kill_heaviest_task | false (default): each time a process must be killed, traversal starts from the highest oom_adj, and within the same oom_adj the most recently added process in the list is killed first, until enough memory has been released. true: traversal also starts from the highest oom_adj, but within the same oom_adj the process with the largest RSS (the second value in /proc/<$pid>/statm) is killed first, until enough memory has been released. | false | |
ro.config.low_ram | Generally, an ago device is defined as a low-RAM device; currently that means a device with less than 1 GB of RAM. It has two characteristics: 1. memory is limited according to oom adj; 2. only one process is killed at a time. | false | |
ro.lmk.kill_timeout_ms | Time to wait after a kill before the next kill, in milliseconds | 0 | |
ro.lmk.use_minfree_levels | Kill processes using the cache/minfree thresholds of the kernel low memory killer instead of memory pressure. | false | |
Mem Pressure related | Properties used by mp_event_common; they are not in effect at the same time as the PSI parameters. | | |
ro.lmk.low | Lowest adj to kill when memory pressure is low | 1001 | |
ro.lmk.medium | Lowest adj to kill when memory pressure is medium | 800 | |
ro.lmk.critical | Lowest adj to kill when memory pressure is critical | 0 | |
ro.lmk.critical_upgrade | Allow memory pressure to be upgraded from medium to critical, provided the computed mem_pressure is below the upgrade_pressure threshold | false | |
ro.lmk.upgrade_pressure | Reference value for critical pressure: above it is medium, below it is critical | 100 | |
ro.lmk.downgrade_pressure | Reference value for medium pressure: above it is low, below it is medium | 100 | |
PSI related (>= Android Q) | Parameters used by mp_event_psi; they are not in effect at the same time as the pressure parameters. | | |
ro.lmk.use_psi | The kernel reports PSI events to lmkd | 1 | 1 |
ro.lmk.use_new_strategy | 1: use mp_event_psi; 0: use mp_event_common to kill processes | 0 | 1 |
ro.lmk.swap_free_low_percentage | Percentage below which swap is considered low, e.g. swap free < 10/100 | 20 | 10 |
ro.lmk.swap_util_max | Maximum amount of swapped memory, as a percentage of swappable memory. (The default value effectively disables this feature.) | 100 | 100 |
ro.lmk.thrashing_limit | Threshold used to decide that thrashing is occurring | 100 | 30 |
ro.lmk.thrashing_limit_decay | Percentage by which the thrashing limit decays each time | 10 | 50 |
ro.lmk.psi_partial_stall_ms | Partial PSI (memory) stall threshold, used to trigger low-memory notifications. Default for low-RAM devices = 200, for high-end devices = 70 (PSI_SOME) | 70 | 200 |
ro.lmk.psi_complete_stall_ms | Complete PSI stall threshold, used to trigger critical memory notifications. Default = 700 (PSI_FULL) | 700 | 700 |
ro.lmk.thrashing_min_score_adj | Minimum score adj to kill when thrashing occurs | 200 | 200 |
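The two victim-selection orders described for ro.lmk.kill_heaviest_task can be sketched as follows. This is a simplified Python model under stated assumptions: `Proc`, `pick_victim`, and the sample process list are hypothetical, the list is assumed to be in insertion order (oldest first), and real lmkd keeps killing until enough memory has been freed rather than returning a single process.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    oom_adj: int
    rss_pages: int  # would come from the second value in /proc/<pid>/statm


def pick_victim(procs, min_adj, kill_heaviest_task=False):
    """Pick one process to kill among those with oom_adj >= min_adj."""
    candidates = [p for p in procs if p.oom_adj >= min_adj]
    if not candidates:
        return None
    # Traversal starts from the highest oom_adj.
    top_adj = max(p.oom_adj for p in candidates)
    same_adj = [p for p in candidates if p.oom_adj == top_adj]
    if kill_heaviest_task:
        # true: the process with the largest RSS goes first.
        return max(same_adj, key=lambda p: p.rss_pages)
    # false: the most recently added process goes first.
    return same_adj[-1]


procs = [
    Proc(pid=1, oom_adj=900, rss_pages=500),
    Proc(pid=2, oom_adj=906, rss_pages=100),
    Proc(pid=3, oom_adj=906, rss_pages=300),
    Proc(pid=4, oom_adj=906, rss_pages=200),
    Proc(pid=5, oom_adj=0, rss_pages=9000),
]

print(pick_victim(procs, min_adj=800).pid)
print(pick_victim(procs, min_adj=800, kill_heaviest_task=True).pid)
```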
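II. Data and behavioral characteristics of low memory
1. Meminfo information
The simplest approach is to use the dumpsys meminfo tool that ships with Android:
adb shell dumpsys meminfo
If the system is low on memory, you will see the following:
- The Free RAM value is very small and the Used RAM value is very large
- ZRAM usage is very high (if ZRAM is enabled)
2. LMK and kswapd threads are active
When memory is low, lmkd is very active, and the kernel log shows LMK killing processes:
[kswapd0] lowmemorykiller: Killing 'u.mzsyncservice' (15609) (tgid 15609), adj 906,
This log means that free memory fell below the watermark configured for adj 900 (261272 kB), so the mzsyncservice process with pid 15609 was killed (its adj is 906).
3. /proc/meminfo
This is where the Linux kernel exposes its memory information.
The output shows that when the system is low on memory, both MemFree and MemAvailable are very small.
adb shell cat /proc/meminfo
MemTotal: 5630104 kB
4. Whole-device jank and slow response
When memory is low, the whole device feels much more sluggish than it does otherwise; tapping an app or launching one feels unsmooth or slow to respond.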
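III. Specific performance impact of low memory
1. LMK runs frequently and competes for CPU
When LMK is working it consumes CPU resources. The main effects are:
- CPU: the processes LMK kills are usually cached processes or services. After being killed because of low memory, they are typically restarted very quickly by their main process and then killed by LMK again, forming a loop. Because starting a process is CPU-intensive, the foreground process easily janks if background processes keep being killed and restarted.
- Memory: low memory easily triggers GC in every process. The CPU state in a trace shows HeapTaskDaemon, which performs memory reclamation, running very frequently.
- IO: low memory increases disk I/O. Because disk I/O is slow, frequent disk I/O leaves the main threads of many processes waiting for I/O, which is the Uninterruptible Sleep state we often see.
2. Impact on main-thread I/O operations
The main thread runs into a large number of I/O-related problems:
- In a trace this shows up as many yellow thread-state slices, for example: Uninterruptible Sleep | WakeKill - Block I/O.
- Its block information reads: kernel callsite when blocked:: wait_on_page_bit_killable+0x78/0x88
The Linux page cache list sometimes contains pages that are not ready yet (their contents have not been fully read from disk). If the user accesses such a page at that moment, the access blocks in wait_on_page_locked_killable. This happens most easily when the system's I/O is so busy that every I/O operation has to wait in a queue, and the blocking time then tends to be long.
When there are many I/O operations, the application's main thread spends more time in Uninterruptible Sleep. Any operation that involves I/O (such as view, reading files, reading configuration files, or reading odex files) triggers Uninterruptible Sleep and makes the whole operation take longer.
3. CPU contention
Low memory causes the Low Memory Killer to scan and kill processes frequently. kswapd0 is a kernel worker thread that is woken up when memory is insufficient in order to reclaim memory. When memory is frequently at the low watermark, kswapd0 is woken up frequently and consumes CPU, causing jank and power drain.
For example, in one captured case kswapd0 occupied cpu7, the prime core of the 855, and ran at full frequency, so the power cost is easy to imagine. If the foreground app's main thread is scheduled onto cpu7 at that moment, CPU contention is very likely, and frames are dropped because the thread cannot get scheduled.
HeapTaskDaemon also usually runs heavily under low memory to perform memory-related work.
4. Processes are frequently killed and restarted
The impact on AMS is concentrated in process killing. With LMK involved, cached processes are easily killed and then started again by their parent process or by other apps, ending up in an endless loop. The impact on system resources such as CPU, memory, and I/O is very large.
For example, the following is the result after a Monkey run: QQ was killed and restarted frequently within a short time.
14:32:16.932 1435 1510 I am_proc_start: [0,30387,10145,com.tencent.mobileqq,restart,com.tencent.mobileqq]
07-23 14:32:16.969 1435 3420 I am_proc_bound: [0,30387,com.tencent.mobileqq]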
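One way to spot such kill/restart loops in an event log is to count `am_proc_start` entries per process. The Python sketch below does this with a regular expression. It assumes the first four fields of the event are user, pid, uid, and process name, as the sample line above suggests. Only the first sample line comes from the log above; the other two lines, the helper names, and the threshold of 2 are made up for illustration.

```python
import re
from collections import Counter

# Captures the process name (4th field) of an am_proc_start event.
START_RE = re.compile(r"am_proc_start: \[\d+,\d+,\d+,([^,\]]+),")


def count_starts(log_text):
    """Count how many times each process was started in the given log text."""
    return Counter(match.group(1) for match in START_RE.finditer(log_text))


def frequent_restarts(counts, threshold):
    """Processes started at least `threshold` times, in sorted order."""
    return sorted(name for name, n in counts.items() if n >= threshold)


# First line is from the article; the other two are hypothetical.
SAMPLE_LOG = """\
14:32:16.932 1435 1510 I am_proc_start: [0,30387,10145,com.tencent.mobileqq,restart,com.tencent.mobileqq]
14:32:20.101 1435 1510 I am_proc_start: [0,30512,10145,com.tencent.mobileqq,restart,com.tencent.mobileqq]
14:32:21.300 1435 1510 I am_proc_start: [0,30540,10099,com.example.app,activity,com.example.app/.Main]
"""

counts = count_starts(SAMPLE_LOG)
print(counts.most_common())
print(frequent_restarts(counts, threshold=2))
```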
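The corresponding Systrace (SystemServer) shows AM frequently killing and starting QQ.
The kernel portion of the same trace also shows a busy CPU.
5. Impact on memory allocation and triggered I/O
After long use (aging), a phone may stutter as a whole, or feel slower than it did right after boot. This may be because memory reclaim or block I/O was triggered, and the two are often related. Memory reclaim may take the form of fast-path reclaim, kswapd reclaim, direct reclaim, or LMK killing processes. (Fast-path reclaim does not perform writeback.)
What gets reclaimed is anonymous pages (swapped out) or file-backed pages (written back and dropped). (This assumes that on phones the swap file lives in memory, not on disk.) Anything involving files may perform I/O, which raises the probability of block I/O.
An even more common case is opening an app that was opened before: it is not as fast as the first launch and needs to load or stalls for a while. A do_page_fault may have occurred; on this path, block I/O is often seen in wait_on_page_bit_killable(). If the memory was swapped out, it must be swapped in; if it is a regular file, it must be read back from the page cache or disk.
do_page_fault -> lock_page_or_retry -> wait_on_page_bit_killable checks whether PG_locked is set on the page. If it is set, the thread blocks until PG_locked is cleared, and PG_locked is only cleared when writeback starts and when an I/O read completes. Readahead into the page cache also affects block I/O: if it is too large, the probability of blocking increases.
IV. Example
This example is a trace of an app cold start captured under low memory. We only look at the portion from app launch to the first displayed frame, which takes 2 s in total.
Its total Running time is 682 ms.
1. Startup under low memory
Under low memory, the app takes 2 s from bindApplication to the first displayed frame. The thread information shows:
- Uninterruptible Sleep | WakeKill - Block I/O and Uninterruptible Sleep together take about 750 ms (compared with only 130 ms in the normal case below)
- Running time is around 600 ms (compared with 624 ms in the normal case below, not much different)
Looking at CPU usage over this period, besides HeapTaskDaemon running a lot, many other memory- and I/O-related threads are also running, such as several kworker threads and kswapd0.
2. Under normal memory
Under normal memory, the app needs only 1.22 s from bindApplication to the first displayed frame. The thread information shows:
- Uninterruptible Sleep | WakeKill - Block I/O and Uninterruptible Sleep together take only 130 ms.
- Running time is 624 ms
Looking at CPU usage over this period, besides HeapTaskDaemon running a lot, very few other memory- and I/O-related threads are running.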
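The figures from the two runs can be put side by side with a little arithmetic. The Python snippet below only restates the numbers quoted above (2 s vs 1.22 s total, about 750 ms vs 130 ms blocked on I/O); the variable names are illustrative.

```python
# Startup time from bindApplication to first frame, in milliseconds.
low_total_ms, normal_total_ms = 2000, 1220
# Time spent in the two Uninterruptible Sleep states, in milliseconds.
low_blocked_ms, normal_blocked_ms = 750, 130

extra_total_ms = low_total_ms - normal_total_ms          # extra startup time
extra_blocked_ms = low_blocked_ms - normal_blocked_ms    # extra time blocked on I/O
blocked_share = extra_blocked_ms / extra_total_ms

print(f"extra startup time: {extra_total_ms} ms")
print(f"extra I/O-blocked time: {extra_blocked_ms} ms")
print(f"share explained by I/O blocking: {blocked_share:.0%}")
```

In other words, roughly four fifths of the slowdown in this example is time spent blocked on I/O, while Running time barely changes.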
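V. Recommendations for handling low memory
1. Optimize the memory usage of system processes
Identify the processes that take a high share of memory and optimize them.
2. Reduce reserved memory
2-1 Obtain reserved memory information:
On >= Android Q, file an e-service request for the "memory-layout-parser" tool.
You can also search the lk log for mblock_reserve-R (but entries may be missing).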
Line 1920: [1604] mblock_reserve-R[3].start: 0x46000000, sz: 0x400000 map:0 name:lk_addr_mb
Line 1921: [1605] mblock_reserve-R[4].start: 0x46900000, sz: 0x8000000 map:0 name:scratch_addr_mb
Line 1922: [1606] mblock_reserve-R[5].start: 0x44000000, sz: 0x80000 map:1 name:dtb_kernel_addr_mb
Line 1923: [1607] mblock_reserve-R[6].start: 0x40008000, sz: 0x3200000 map:0 name:kernel_addr_mb
Line 1924: [1608] mblock_reserve-R[7].start: 0x45000000, sz: 0x1000000 map:0 name:ramdisk_addr_mb
Line 1925: [1609] mblock_reserve-R[8].start: 0x77370000, sz: 0xc90000 map:0 name:framebuffer
Line 1926: [1610] mblock_reserve-R[9].start: 0x7fa00000, sz: 0x400000 map:0 name:logo_db_addr_pa
Line 1927: [1611] mblock_reserve-R[10].start: 0x77360000, sz: 0x10000 map:0 name:SPM-reserved
Line 1928: [1612] mblock_reserve-R[11].start: 0x77350000, sz: 0x10000 map:0 name:MCUPM-reserved
Line 1929: [1613] mblock_reserve-R[12].start: 0x72000000, sz: 0x4000000 map:0 name:ccci
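To total up the reservations found this way, the log lines can be parsed and their `sz` values summed. The Python sketch below assumes the line format shown above; the regular expression and helper names are hypothetical, and the sample uses three of the lines from the log.

```python
import re

# Matches e.g. "mblock_reserve-R[3].start: 0x46000000, sz: 0x400000 map:0 name:lk_addr_mb"
RESERVE_RE = re.compile(
    r"mblock_reserve-R\[(\d+)\]\.start: (0x[0-9a-fA-F]+), "
    r"sz: (0x[0-9a-fA-F]+) map:(\d+) name:(\S+)")


def parse_reservations(log_text):
    """Return a list of (name, start, size_bytes) tuples found in the log."""
    return [(m.group(5), int(m.group(2), 16), int(m.group(3), 16))
            for m in RESERVE_RE.finditer(log_text)]


def total_reserved_mib(reservations):
    """Sum of all reservation sizes, in MiB."""
    return sum(size for _, _, size in reservations) / (1024 * 1024)


SAMPLE_LOG = """\
Line 1920: [1604] mblock_reserve-R[3].start: 0x46000000, sz: 0x400000 map:0 name:lk_addr_mb
Line 1925: [1609] mblock_reserve-R[8].start: 0x77370000, sz: 0xc90000 map:0 name:framebuffer
Line 1929: [1613] mblock_reserve-R[12].start: 0x72000000, sz: 0x4000000 map:0 name:ccci
"""

reservations = parse_reservations(SAMPLE_LOG)
for name, start, size in sorted(reservations, key=lambda r: r[2], reverse=True):
    print(f"{name:16s} start=0x{start:08x} size={size / (1024 * 1024):.4f} MiB")
print(f"total: {total_reserved_mib(reservations)} MiB")
```

Sorting by size makes the largest reservations stand out as the first candidates to review.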
Or search the lk code for
mblock_reserve or mblock_reserve_ext
ex:
logo_db_addr_pa = (void *)(u32)mblock_reserve_ext(&g_boot_arg->mblock_info,
LK_LOGO_MAX_SIZE, PAGE_SIZE, 0x80000000, 0, "logo_db_addr_pa");
Or search the .dts for reserved-memory
ex:
reserve-memory-scp_share {
    compatible = "mediatek,reserve-memory-scp_share";
    no-map;
    size = <0 0x01400000>; /* 20 MB share mem size */
    alignment = <0 0x1000000>;
    alloc-ranges = <0 0x40000000 0 0x50000000>; /* 0x4000_0000~0x8FFF_FFFF */
};
consys-reserve-memory {
    compatible = "mediatek,consys-reserve-memory";
    no-map;
    size = <0 0x200000>;
    alignment = <0 0x200000>;
    alloc-ranges = <0 0x40000000 0 0x80000000>;
3. Limit background processes
3-1 Modify DEFAULT_MAX_CACHED_PROCESSES
/frameworks/base/services/core/java/com/android/server/am/ActivityManagerConstants.java or ProcessList.java
public int MAX_CACHED_PROCESSES = DEFAULT_MAX_CACHED_PROCESSES;
private static final int DEFAULT_MAX_CACHED_PROCESSES = 32; // change to DEFAULT_MAX_CACHED_PROCESSES = 8 or 16 or ...
3-2 Modify mCachedRestoreLevel
In /frameworks/base/services/core/java/com/android/server/am/ProcessList.java:
long getCachedRestoreThresholdKb() {
return mCachedRestoreLevel; // change mCachedRestoreLevel to mCachedRestoreLevel/2
}
4. Tune lmk parameters
4-1. Adjust the minfree table
<=kernel-4.9 non-ago project or kernel-4.14 (ro.lmk.use_minfree_levels=1)
Increase each of the last three thresholds in the minfree table by a factor of 1.x (for example 1.2, 1.5, ...).
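As a worked example of this adjustment, the Python sketch below scales the last three thresholds of a minfree table by a chosen factor and prints the result in the `pages:adj` form. The starting values are the ones from the sys.lmk.minfree_levels output earlier in this article; the helper names and the factor 1.5 are illustrative only.

```python
def scale_last_three(levels, factor):
    """Return a new table with the last three thresholds multiplied by factor."""
    head, tail = levels[:-3], levels[-3:]
    return head + [(round(pages * factor), adj) for pages, adj in tail]


def format_levels(levels):
    """Render (pages, adj) pairs back into 'pages:adj,pages:adj,...' form."""
    return ",".join(f"{pages}:{adj}" for pages, adj in levels)


levels = [(18432, 0), (23040, 100), (27648, 200),
          (32256, 250), (36864, 900), (46080, 950)]

scaled = scale_last_three(levels, 1.5)
print(format_levels(scaled))
```

Raising these thresholds makes the levels with adj 250, 900, and 950 trigger while more memory is still free, so cached processes are reclaimed earlier.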
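4-2. Adjust lmkd parameters
ago project, or kernel-4.14 (ro.lmk.use_minfree_levels=0)
Decrease ro.lmk.medium (lowers the kill adj at medium pressure, so more processes can be killed)
Increase ro.lmk.downgrade_pressure (makes it easier to enter the medium pressure state)
Increase ro.lmk.upgrade_pressure (makes it easier to enter the critical pressure state)
5. swap size & swappiness
5-1. Increase the swap size to extend the system's logical memory
/device/mediatek/mt6xxx/
/device/mediatek/vendor/common/
fstab.enableswap
fstab.enableswap_gmo
fstab.enableswap_ago
/dev/block/zram0 none swap defaults zramsize=xx%   (increase the value or the percentage)
Check /proc/zraminfo to confirm that the change took effect.
5-2. Increase swappiness so that the system makes full use of the swap partition
/proc/sys/vm/swappiness
/dev/memcg/memory.swappiness
/dev/memcg/apps/memory.swappiness
/dev/memcg/system/memory.swappiness
6. Enable DuraSpeed (or otherwise manage background processes well)
DuraSpeed can proactively manage background processes and memory, which keeps the system from getting into a bad memory state.
7. Other optimizations
- Raise the extra_free_kbytes value
- Improve disk I/O read/write speed, for example by using UFS 3.0 or solid-state storage
- Avoid setting read_ahead_kb too large
- Use the cgroup blkio controller to limit I/O reads by background processes and shorten foreground I/O response time
- Reclaim memory ahead of time, so that reclaim does not hit while the user is using an app and cause a slight stutter
- Make LMK more efficient and avoid ineffective kills
- Have kswapd periodically reclaim more memory, up to the high watermark
- Adjust swappiness to balance page cache and swap
- Policy: apply special policies to low-memory devices, such as killing processes more aggressively (this degrades the user experience, so the degree must balance performance against user experience)
- Policy: when memory is low, notify the user (or not) and kill unnecessary background processes.
- Policy: when memory is severely insufficient and cannot be recovered, prompt the user to reboot the phone.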
VIII. Analysis of applications being killed because of slab memory usage
Typically, an application requests memory from the system, the system finds that the remaining memory cannot satisfy the request, and after a series of operations it still cannot. The system then selects the most suitable program and kills it so that its memory can be reclaimed to meet the needs of other processes. So a program being killed does not necessarily mean that it leaks memory; it only means that when the system ran out of memory, it was the most suitable candidate to kill.
Before the program is killed, you can inspect the memory used by the process to see whether it leaks:
Part of the output is as follows:
VmPeak: 3068 kB
VmSize: 3068 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 612 kB
VmRSS: 612 kB
We mainly check whether VmRSS keeps growing; if it does, the program very likely has a memory leak. In testing with the test program, however, this value did not change noticeably, so we turned to the system memory information.
At intervals, check the system memory information as follows:
root@Linux: /# cat /proc/meminfo
MemTotal: 493184 kB
MemFree: 442572 kB
MemAvailable: 452300 kB
Buffers: 3424 kB
Cached: 3224 kB
SwapCached: 0 kB
Active: 8940 kB
Inactive: 284 kB
Active(anon): 2588 kB
Inactive(anon): 120 kB
Active(file): 6352 kB
Inactive(file): 164 kB
Unevictable: 0 kB
Mlocked: 0 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 493184 kB
LowFree: 442572 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 20 kB
Writeback: 0 kB
AnonPages: 2616 kB
Mapped: 2204 kB
Shmem: 124 kB
Slab: 30528 kB
SReclaimable: 13904 kB
SUnreclaim: 16624 kB
KernelStack: 704 kB
PageTables: 296 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 246592 kB
Committed_AS: 55424 kB
VmallocTotal: 507904 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
CmaTotal: 65536 kB
CmaFree: 59280 kB
cat /proc/meminfo shows the system's memory information. Slab is the memory used by slab; SReclaimable is the reclaimable part and SUnreclaim is the unreclaimable part. Slab was found to use nearly 30 MB of system memory, which is worth noting. Next, look at the detailed slab usage:
root@Linux: /# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ext4_groupinfo_4k 58 81 296 27 2 : tunables 0 0 0 : slabdata 3 3 0
ext4_groupinfo_1k 1 28 288 28 2 : tunables 0 0 0 : slabdata 1 1 0
jbd2_1k 0 0 3072 10 8 : tunables 0 0 0 : slabdata 0 0 0
bridge_fdb_cache 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
sd_ext_cdb 2 18 216 18 1 : tunables 0 0 0 : slabdata 1 1 0
sgpool-128 2 14 2304 14 8 : tunables 0 0 0 : slabdata 1 1 0
sgpool-64 2 25 1280 25 8 : tunables 0 0 0 : slabdata 1 1 0
sgpool-32 2 21 768 21 4 : tunables 0 0 0 : slabdata 1 1 0
sgpool-16 2 16 512 16 2 : tunables 0 0 0 : slabdata 1 1 0
sgpool-8 2 21 384 21 2 : tunables 0 0 0 : slabdata 1 1 0
cfq_io_cq 10 31 264 31 2 : tunables 0 0 0 : slabdata 1 1 0
cfq_queue 9 22 360 22 2 : tunables 0 0 0 : slabdata 1 1 0
fat_inode_cache 3 26 616 26 4 : tunables 0 0 0 : slabdata 1 1 0
fat_cache 0 0 200 20 1 : tunables 0 0 0 : slabdata 0 0 0
squashfs_inode_cache 88 200 640 25 4 : tunables 0 0 0 : slabdata 8 8 0
jbd2_transaction_s 0 42 384 21 2 : tunables 0 0 0 : slabdata 2 2 0
jbd2_inode 1 76 208 19 1 : tunables 0 0 0 : slabdata 4 4 0
......
kmalloc-128 1589 1596 384 21 2 : tunables 0 0 0 : slabdata 76 76 0
kmalloc-64 15937 16200 320 25 2 : tunables 0 0 0 : slabdata 648 648 0
kmem_cache_node 107 125 320 25 2 : tunables 0 0 0 : slabdata 5 5 0
kmem_cache 107 126 384 21 2 : tunables 0 0 0 : slabdata 6 6 0
This shows how slab memory is being used; record it.
slab is a memory allocation mechanism in Linux. It targets objects that are allocated and freed frequently, such as process descriptors. These objects are usually small; allocating and freeing them directly through the buddy system would cause heavy internal fragmentation and would also be too slow. The slab allocator manages memory by object: objects of the same type are grouped into one class (process descriptors are one such class). Whenever such an object is requested, the slab allocator hands out a unit of that size from a slab list; when the object is freed, it is put back on the list rather than returned directly to the buddy system, which avoids internal fragmentation. The slab allocator does not discard freed objects; it keeps them in memory, so that a later request for a new object can be served directly from memory without repeating the initialization.
Then, at longer intervals, repeat cat /proc/meminfo and cat /proc/slabinfo and compare the snapshots to look for the problem.
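Comparing snapshots by eye is tedious, so the comparison can be scripted. The Python sketch below parses two slabinfo snapshots, estimates each cache's size as `num_objs * objsize`, and sorts caches by growth. The function names are hypothetical. The "before" lines come from the output above, while the "after" snapshot is invented purely to demonstrate the comparison.

```python
def parse_slabinfo(text):
    """Return {cache_name: estimated_bytes} using num_objs * objsize."""
    usage = {}
    for line in text.splitlines():
        # Skip blank lines, the version line and the column header.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        usage[name] = num_objs * objsize
    return usage


def growth(before, after):
    """Per-cache byte growth between two snapshots, largest first."""
    deltas = {name: size - before.get(name, 0) for name, size in after.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


BEFORE = """\
slabinfo - version: 2.1
squashfs_inode_cache 88 200 640 25 4 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-64 15937 16200 320 25 2 : tunables 0 0 0 : slabdata 648 648 0
"""

# Hypothetical later snapshot: kmalloc-64 has doubled in object count.
AFTER = """\
slabinfo - version: 2.1
squashfs_inode_cache 88 200 640 25 4 : tunables 0 0 0 : slabdata 8 8 0
kmalloc-64 32100 32400 320 25 2 : tunables 0 0 0 : slabdata 1296 1296 0
"""

before, after = parse_slabinfo(BEFORE), parse_slabinfo(AFTER)
for name, delta in growth(before, after):
    print(f"{name:24s} {delta:+d} bytes")
```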
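In the end, after a long test run, the amount of memory used by Slab had grown substantially. When slab uses a lot of memory, the cause is the kernel frequently allocating structures, which reduces the memory available to the system until an out-of-memory condition kills a program.
1. Solution
Once you know that Slab is what drives memory usage too high, you can flush the slab caches manually:
echo 3 > /proc/sys/vm/drop_caches    # drop caches
The four drop_caches values have the following meanings:
- 0: do nothing; let the system manage the caches itself
- 1: free the page cache
- 2: free dentries and inodes
- 3: free the page cache, dentries, and inodes
This approach is not ideal, though. It is better to use the slabinfo data to find out which application operations make the kernel allocate structures so frequently that Slab uses a large amount of memory, and to see whether that can be avoided. The kernel also has an automatic reclaim mechanism; the threshold that triggers it can be changed so that reclaim happens effectively once slab's idle memory reaches a certain amount.
2. Follow-up
Later, a reference article provided the following information, summarized here:
The aging test program test mentioned at the beginning saves a large number of files, and its frequent file I/O operations (open, write, close) caused dentry_cache to take up too much system memory.
An inode corresponds to a concrete object on the physical disk, while a dentry is an in-memory entity whose d_inode member points to the corresponding inode. A dentry can therefore be seen as a link to an index node (inode) in the Linux file system; that inode may be a file or a directory. dentry_cache is the directory entry cache, which Linux designed to make handling dentry objects more efficient; it records the mapping from directory entries to inodes.
3. Automatic slab cache reclaim by the system
In the slab cache, objects are either SReclaimable (reclaimable) or SUnreclaim (unreclaimable), and the vast majority of objects in the system are reclaimable. The kernel has a parameter that automatically triggers reclaim once system memory usage reaches a certain level.
- Kernel parameter:
vm.min_free_kbytes = 836787
This is the minimum amount of free memory that the system keeps in reserve.
At system initialization, a default value is computed from the memory size using the following rule:
min_free_kbytes = sqrt(lowmem_kbytes * 16) = 4 * sqrt(lowmem_kbytes) (note: lowmem_kbytes can be regarded as the system memory size)
The computed value is also clamped: the minimum is 128 KB and the maximum is 64 MB.
As you can see, min_free_kbytes does not grow linearly with system memory: as memory grows, there is no need to reserve linearly more of it; enough to cover emergencies is sufficient.
- The main use of min_free_kbytes is to compute the three parameters that affect memory reclaim, watermark[min/low/high]
- watermark[high] > watermark[low] > watermark[min]; each zone has its own set
- When the system's free memory falls below watermark[low], the kernel thread kswapd starts reclaiming memory (one per zone) and stops once the zone's free memory reaches watermark[high]. If upper layers request memory too fast and free memory drops to watermark[min], the kernel performs direct reclaim: it reclaims directly in the context of the application process and then uses the reclaimed free pages to satisfy the request. This actually blocks the application, adds response latency, and may trigger the system OOM. The reason is that memory below watermark[min] is the system's own reserve for special uses, so it is not handed out to ordinary user-space requests.
- How the three watermarks are computed:
watermark[min] = min_free_kbytes converted to pages; call it min_free_pages. (Because each zone has its own set of watermark parameters, the actual per-zone min_free_pages is computed in proportion to the zone's share of total memory.)
watermark[low] = watermark[min] * 5 / 4
watermark[high] = watermark[min] * 3 / 2
So the buffer between levels is high - low = low - min = per_zone_min_free_pages * 1/4. Because min_free_kbytes = 4 * sqrt(lowmem_kbytes), the buffer also grows with the square root of the memory size.
- You can view each zone's watermarks in /proc/zoneinfo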
- Impact of min_free_kbytes size
The larger min_free_kbytes is set, the higher the watermarks are, and the buffer between the three levels grows accordingly. This means kswapd starts reclaiming earlier and reclaims more memory (it does not stop until watermark[high] is reached). The system then keeps too much memory free, which to some extent reduces the amount of memory available to applications. In the extreme case where min_free_kbytes is set close to the total memory size, too little memory is left for applications and OOM may occur frequently.
If min_free_kbytes is set too small, the system's reserved memory is too small. kswapd itself makes a small number of allocations while reclaiming (it sets the PF_MEMALLOC flag, which allows it to use reserved memory). Likewise, a process that the OOM killer has selected may use the reserve if it needs to allocate memory while it is exiting. In both cases, letting them use reserved memory keeps the system from entering a deadlock.
This can be tested: after min_free_kbytes is set to a value greater than the system's free memory, the kswapd process does go from sleeping to running and begins to reclaim memory.
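The default min_free_kbytes rule and the three watermark formulas given in this section can be checked with a short Python sketch. It follows the formulas as stated in the text (including the 128 KB and 64 MB clamps, which may differ between kernel versions), treats the whole system as a single zone, and assumes 4 KiB pages. The function names are hypothetical.

```python
import math

PAGE_KB = 4  # assumption: 4 KiB pages


def default_min_free_kbytes(lowmem_kbytes):
    """min_free_kbytes = sqrt(lowmem_kbytes * 16), clamped to [128, 65536] kB."""
    value = math.isqrt(lowmem_kbytes * 16)   # integer square root
    return max(128, min(value, 65536))


def watermarks(min_free_pages):
    """Watermarks in pages for one zone, per the formulas in the text."""
    return {
        "min": min_free_pages,
        "low": min_free_pages * 5 // 4,
        "high": min_free_pages * 3 // 2,
    }


# MemTotal from the /proc/meminfo output earlier in this section.
min_free_kb = default_min_free_kbytes(493184)
marks = watermarks(min_free_kb // PAGE_KB)

print(f"min_free_kbytes = {min_free_kb} kB")
print(f"watermarks (pages) = {marks}")
# Quadrupling memory only doubles the reserve, illustrating the sqrt growth.
print(default_min_free_kbytes(4 * 493184) / min_free_kb)
```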
There is also a related parameter, vm.vfs_cache_pressure = 200.
This parameter controls the kernel's tendency to reclaim memory used for the directory and inode caches. The default value of 100 means the kernel keeps the directory and inode caches at a reasonable percentage relative to pagecache and swapcache. Lowering the value below 100 makes the kernel tend to retain the directory and inode caches; raising it above 100 makes the kernel tend to reclaim them.