(Reposted) Linux OOM Killer Personal Summary

Linux has a feature called the OOM killer (Out Of Memory killer), which steps in when system memory is exhausted and selectively kills processes to free memory. A typical symptom: one day the machine suddenly becomes unreachable; it still answers ping, but ssh cannot connect. The reason is that the sshd process was killed by the OOM killer. After rebooting the machine and checking the system log, you will find messages like Out of Memory: Killed process ×××.
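If you hit this, the quickest way to confirm that the OOM killer was responsible is to search the kernel log (a minimal sketch; /var/log/messages is the RHEL/CentOS default, Debian-style systems log to /var/log/syslog instead):

# dmesg | grep -i "out of memory"

# grep -i "killed process" /var/log/messages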

The following sections cover what kind of mechanism the OOM killer on Linux is, when it is triggered, and how it chooses which processes to kill.

1. When is it triggered

Let's look at the first question: when is the OOM killer triggered? Is it when malloc returns NULL? No; the malloc manpage contains the following paragraph:

By default, Linux follows an optimistic memory allocation strategy. This means that when malloc() returns non-NULL there is no guarantee that the memory really is available. This is a really bad bug. In case it turns out that the system is out of memory, one or more processes will be killed by the infamous OOM killer. In case Linux is employed under circumstances where it would be less desirable to suddenly lose some randomly picked processes, and moreover the kernel version is sufficiently recent, one can switch off this overcommitting behavior using a command like:

# echo 2 > /proc/sys/vm/overcommit_memory

The paragraph above tells us that when malloc on Linux returns a non-NULL pointer, the memory it points to is not necessarily available. Linux allows programs to request more memory than the system can actually provide; this behavior is called overcommit. It is a system-level optimization, because not every program uses all of the memory it requests right away, and by the time it does, the system may have reclaimed resources elsewhere. Unfortunately, if you touch the overcommitted memory at a moment when the system has no resources left, the OOM killer steps in.

Linux has three overcommit strategies (see the kernel documentation vm/overcommit-accounting), configured through /proc/sys/vm/overcommit_memory (it takes the values 0, 1 and 2; the default is 0).

(1) 0: Heuristic strategy. Obviously unreasonable overcommits fail, for example suddenly requesting 128TB of memory, while moderate overcommits are allowed. In addition, root is allowed to overcommit slightly more than ordinary users.

(2) 1: Always allow overcommit. This strategy suits applications that cannot tolerate memory allocation failure, such as some scientific computing applications.

(3) 2: Always forbid overcommit. In this case the memory the system can hand out will not exceed swap + RAM × overcommit ratio (/proc/sys/vm/overcommit_ratio, default 50%, adjustable). Once that much has been committed, any further attempt to allocate memory returns an error, which usually means no new programs can be started.
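The current policy, the ratio, and the resulting commit accounting can all be read straight from /proc (a minimal sketch; note that CommitLimit is only enforced when overcommit_memory is 2):

# cat /proc/sys/vm/overcommit_memory                  # 0, 1 or 2

# cat /proc/sys/vm/overcommit_ratio                   # percentage of RAM counted toward the limit

# grep -E 'CommitLimit|Committed_AS' /proc/meminfo    # swap + RAM*ratio, and memory committed so far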

Supplement (to be verified): in the article Memory overcommit in Linux, the author mentions that the heuristic policy actually only takes effect when the SMACK or SELinux module is enabled; in other cases it behaves like "always allow".

2. How a process is chosen once it is triggered

As long as overcommit exists, the OOM killer may strike. So what is its strategy for picking a victim once it is triggered? What we would like is: useless, memory-hogging programs get shot first.

This selection strategy has kept evolving in Linux. As users we can influence the OOM killer's decision by setting some values. Every process on Linux has an OOM weight in /proc/<pid>/oom_adj, ranging from -17 to +15; the higher the value, the more likely the process is to be killed.

In the end, the OOM killer decides which process to kill based on /proc/<pid>/oom_score. This value is computed from the process's memory consumption, CPU time (utime + stime), run time (uptime - start time) and oom_adj. The more memory a process consumes, the higher its score; the longer it has been running, the lower its score. In short, the general policy is: lose the least amount of work done, recover as much memory as possible, don't kill innocent processes that merely use a lot of memory, and kill as few processes as possible.
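Both files can be inspected and written directly; for example, to check and then effectively exempt the current shell from the OOM killer (a minimal sketch; -17 is the special "never kill" value for oom_adj on these older kernels):

# cat /proc/self/oom_adj         # default is 0

# cat /proc/self/oom_score       # the score the OOM killer would use right now

# echo -17 > /proc/self/oom_adj  # -17 disables OOM killing for this process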

In addition, when Linux computes a process's memory consumption, half of the memory consumed by each child process is added to the parent. So processes with many children should watch out. Of course there are further details to the policy; see the articles Taming the OOM killer and When Linux Runs Out of Memory.

 

Addressing is limited on the 32-bit CPU architecture. The Linux kernel defines three memory zones:

-------------------------------------------------------------------------------

# DMA:     0x00000000 - 0x00FFFFFF (0 - 16 MB)

# LowMem:  0x01000000 - 0x37FFFFFF (16 - 896 MB) - size: 880 MB

# HighMem: 0x38000000 - <hardware specific>

-------------------------------------------------------------------------------

The LowMem zone (also called the NORMAL zone) is 880 MB in total and cannot be resized (unless a hugemem kernel is used). On heavily loaded systems the OOM Killer may be triggered by poor LowMem utilization: either LowFree is too small, or LowMem is so fragmented that a contiguous region cannot be allocated. [In a case I ran into, my guess is that an application requested a relatively large chunk of memory in one shot; the request fit within the 880 MB, but LowFree was not large enough, so the OOM Killer was triggered.]

Tip: how to disable and re-enable the OOM Killer (the /proc/sys/vm/oom-kill switch is not present on all kernels):

# echo "0" > /proc/sys/vm/oom-kill    # disable

# echo "1" > /proc/sys/vm/oom-kill    # enable

Check the current LowFree value:

# cat /proc/meminfo |grep LowFree     

Check for LowMem memory fragmentation:

# cat /proc/buddyinfo                 

It is said that using SysRq is a better approach, but that is mainly for when the system hangs. According to some documentation, the OOM Killer behaves differently on 2.4 and 2.6 kernels: on 2.4 the incoming process (the one that just requested memory) is killed, while on 2.6 the process occupying the most memory is killed (which is dangerous and can easily cripple a system's applications).
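For completeness, the magic SysRq interface can also invoke the OOM killer by hand, which is mainly useful for experiments (a sketch assuming SysRq is available on your kernel; the 'f' key is documented as calling the OOM killer on a memory hog):

# echo 1 > /proc/sys/kernel/sysrq   # enable SysRq if it is not already

# echo f > /proc/sysrq-trigger      # invoke the OOM killer once by hand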

RHEL 4 adds a new parameter: vm.lower_zone_protection. Its unit is MB; with the default value of 0, only 16 MB of LowMem is protected. It is recommended to set vm.lower_zone_protection = 200 or even higher to keep the LowMem zone from being exhausted and fragmented; this parameter is intended as the fix for exactly this problem.
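To apply it on a running RHEL 4 system and keep the setting across reboots (a minimal sketch; 200 is the value recommended above, not a universal constant):

# sysctl -w vm.lower_zone_protection=200

# echo "vm.lower_zone_protection = 200" >> /etc/sysctl.conf

# sysctl -p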

A request for an ordinary user page can first be satisfied from the "normal" zone (ZONE_NORMAL); if that fails, ZONE_HIGHMEM is tried next; if that also fails, ZONE_DMA is tried. So the zone list for such an allocation contains ZONE_NORMAL, ZONE_HIGHMEM and ZONE_DMA in turn. A request for a DMA page, on the other hand, can only be satisfied from the DMA zone, so the zone list for such requests contains only the DMA zone.

Check the current Low and High memory values:

# egrep 'High|Low' /proc/meminfo

or

# free -lm

Check LowMem memory fragmentation:

# cat /proc/buddyinfo              

Physically, memory does not really have a fragmentation problem in the usual sense: reading or writing any byte of RAM takes the same time, and there is no slow mechanical seek as on a hard disk. However, the buddy system cannot guarantee that allocated pages are physically contiguous, especially when the number of pages is not a power of two, so the MMU has to stitch these physically discontiguous blocks together into a contiguous range of the linear (virtual) address space.

Node 0: the memory node; unless the machine is NUMA there is only one.

zone: the memory zone.

DMA: the first 16 MB of memory.

HighMem: all memory above the LowMem zone.

Normal: the memory in between.

The 11 numbers that follow are counts of free blocks of 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048 and 4096 KB, i.e. contiguous allocatable address ranges of each size. When low memory runs short, the oom-killer starts killing processes to keep the system running, no matter how much high memory is still free. One cause of OOM is exactly this: memory looks available but cannot be allocated, so the process using the most memory gets killed. That is all at the process level; a process has to deal with its own fragmentation, for example by using a memory pool. The system hands memory to processes, and reclaims it, in whole pages; however scattered the physical pages are, they can be mapped to a contiguous range of virtual address space, so there is no fragmentation problem at that level.
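To turn the raw counters in /proc/buddyinfo into an estimate of how much memory each zone still has sitting in its buddy lists, a one-liner like the following works (a minimal sketch assuming the usual 4 KB page size; the free-block counters start at the fifth column):

# awk '{ free=0; for (i=5; i<=NF; i++) free += $i * 4 * 2^(i-5); printf "%s %s zone %-7s ~%d KB free\n", $1, $2, $4, free }' /proc/buddyinfo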

How to reserve low memory:

On x86 machines with high memory, when a user process uses mlock() to allocate a large amount of memory in the Normal zone, reusable lowmem can run short and some system calls will start failing with errors such as "EAGAIN". On RHEL 5.x (x86), the end user can control how much lowmem is reserved with lowmem_reserve_ratio.

Details:

# cat /proc/sys/vm/lowmem_reserve_ratio                 

256 256 32

DMA Normal HighMem

In the Normal zone, 256 pages are reserved (the default). To reserve 512 pages in the Normal zone:

# echo "256 512 32" > /proc/sys/vm/lowmem_reserve_ratio

# cat /proc/sys/vm/lowmem_reserve_ratio                

256 512 32

To make the value permanent, edit /etc/sysctl.conf and add the following:

vm.lowmem_reserve_ratio = 256 512 32                   

# sysctl -p

# cat /proc/sys/vm/lowmem_reserve_ratio                 

256 512 32              

 

=======================================

 

Recently a VPS customer complained that MySQL was dying for no apparent reason, and another complained that his VPS kept freezing. Logging into their terminals, both turned out to be the familiar Out of memory problem. This usually happens when applications request a lot of memory at some point and the system runs short, which triggers the Out of Memory (OOM) killer in the Linux kernel; the OOM killer kills a process to free memory for the system so that it does not crash immediately. Checking the relevant log file (/var/log/messages) shows Out of memory: Kill process messages like the following:

...
Out of memory: Kill process 9682 (mysqld) score 9 or sacrifice child
Killed process 9682, UID 27, (mysqld) total-vm:47388kB, anon-rss:3744kB, file-rss:80kB
httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
httpd cpuset=/ mems_allowed=0
Pid: 8911, comm: httpd Not tainted 2.6.32-279.1.1.el6.i686 #1
...
21556 total pagecache pages
21049 pages in swap cache
Swap cache stats: add 12819103, delete 12798054, find 3188096/4634617
Free swap  = 0kB
Total swap = 524280kB
131071 pages RAM
0 pages HighMem
3673 pages reserved
67960 pages shared
124940 pages non-shared

The Linux kernel allocates memory at the request of applications. Usually an application allocates memory but does not actually use all of it right away; to improve performance, this unused memory could be put to other uses. It belongs to each individual process, and it would be awkward for the kernel to reclaim it directly, so the kernel uses over-commit memory to indirectly exploit this "idle" memory and raise overall memory utilization. Normally this works fine, but trouble starts when most applications actually consume the memory they asked for: their combined demand exceeds the capacity of physical memory (including swap), and the kernel (the OOM killer) has to kill some processes to free space and keep the system running. A banking analogy may make this easier to understand: the bank is not afraid when a few people withdraw their money, since it has enough deposits to cover them; but when everyone (or nearly everyone) withdraws everything at once, the bank is in trouble, because it does not actually hold that much money.

The process by which the kernel detects memory pressure, picks a process and kills it can be followed in the kernel source, linux/mm/oom_kill.c. When the system runs out of memory, out_of_memory() is triggered, which calls select_bad_process() to choose a "bad" process to kill. How does it judge and pick a "bad" process; surely not at random? The choice is made by oom_badness(), and the algorithm and the idea behind it are simple and plain: the "worst" process is the one using the most memory.

/**
 * oom_badness - heuristic function to determine which candidate task to kill
 * @p: task struct of which task we should calculate
 * @totalpages: total present RAM allowed for page allocation
 *
 * The heuristic for determining which task to kill is made to be as simple and
 * predictable as possible.  The goal is to return the highest value for the
 * task consuming the most memory to avoid subsequent oom failures.
 */
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
			  const nodemask_t *nodemask, unsigned long totalpages)
{
	long points;
	long adj;

	if (oom_unkillable_task(p, memcg, nodemask))
		return 0;

	p = find_lock_task_mm(p);
	if (!p)
		return 0;

	adj = (long)p->signal->oom_score_adj;
	if (adj == OOM_SCORE_ADJ_MIN) {
		task_unlock(p);
		return 0;
	}

	/*
	 * The baseline for the badness score is the proportion of RAM that each
	 * task's rss, pagetable and swap space use.
	 */
	points = get_mm_rss(p->mm) + p->mm->nr_ptes +
		 get_mm_counter(p->mm, MM_SWAPENTS);
	task_unlock(p);

	/*
	 * Root processes get 3% bonus, just like the __vm_enough_memory()
	 * implementation used by LSMs.
	 */
	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
		adj -= 30;

	/* Normalize to oom_score_adj units */
	adj *= totalpages / 1000;
	points += adj;

	/*
	 * Never return 0 for an eligible task regardless of the root bonus and
	 * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
	 */
	return points > 0 ? points : 1;
}

The comments in the code above explain it clearly. Once we understand the algorithm, we understand why MySQL gets shot while minding its own business: it is usually the biggest target (generally it uses the most memory on the system), so when Out of Memory (OOM) strikes it is unfortunately always the first to be killed. The simplest fix is to add memory, or to tune MySQL so it uses less memory; besides tuning MySQL you can also tune the system (see "Optimizing Debian 5" and "Optimizing CentOS 5.x") so that the OS uses as little memory as possible and applications such as MySQL can use more. A temporary workaround is to tweak kernel parameters so the MySQL process is less likely to be picked by the OOM killer.

Configuring the OOM killer

We can tune the OOM killer's behavior through a few kernel parameters so that the system does not just sit there killing processes one after another. For example, we can make the kernel panic immediately when OOM occurs, and have the system reboot automatically 10 seconds after the panic:

# sysctl -w vm.panic_on_oom=1
vm.panic_on_oom = 1

# sysctl -w kernel.panic=10
kernel.panic = 10

# echo "vm.panic_on_oom=1" >> /etc/sysctl.conf
# echo "kernel.panic=10" >> /etc/sysctl.conf

As the oom_kill.c code above shows, oom_badness() gives every process a score, and the level of points decides which process gets killed. The points can be adjusted through adj: processes running as root are generally considered important and should not be killed lightly, so they get a 3% discount when scored (adj -= 30; the lower the score, the less likely the process is to be killed). In user space we can manipulate each process's oom_adj/oom_score_adj kernel parameter to decide which processes are less likely to be selected and killed by the OOM killer. For example, if you don't want the MySQL process to be killed easily, find the PID it runs under and set its oom_score_adj to -15 (remember, the smaller the points, the less likely it is to be killed):

# ps aux | grep mysqld
mysql    2196  1.6  2.1 623800 44876 ?        Ssl  09:42   0:00 /usr/sbin/mysqld

# cat /proc/2196/oom_score_adj
0
# echo -15 > /proc/2196/oom_score_adj
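Since the value lives in /proc, it is lost whenever mysqld restarts, so one option is to re-apply it from an init or cron script after the service is up (a minimal sketch; pgrep -o picks the oldest, typically the main, mysqld process):

# echo -15 > /proc/$(pgrep -o mysqld)/oom_score_adj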

Of course, if necessary you can turn the OOM killer off completely (not recommended for production environments):

# sysctl -w vm.overcommit_memory=2

# echo "vm.overcommit_memory=2" >> /etc/sysctl.conf

Finding the processes most likely to be killed by the OOM Killer

We now know that in user space we can adjust a process's score by manipulating its oom_adj kernel parameter, and that the score itself can be viewed through the oom_score parameter. For example, looking at the oom_score of process 981: after being adjusted by the oom_score_adj parameter mentioned above (-15), it drops to 3:

# cat /proc/981/oom_score
18

# echo -15 > /proc/981/oom_score_adj
# cat /proc/981/oom_score
3

The following bash script prints the processes with the highest oom_score (i.e. those most likely to be killed by the OOM Killer) on the current system:

# vi oomscore.sh
#!/bin/bash
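# Print the 10 processes with the highest oom_score (the likeliest OOM killer victims).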
for proc in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
    printf "%2d %5d %s\n" \
        "$(cat $proc/oom_score)" \
        "$(basename $proc)" \
        "$(cat $proc/cmdline | tr '\0' ' ' | head -c 50)"
done 2>/dev/null | sort -nr | head -n 10

# chmod +x oomscore.sh
# ./oomscore.sh
18   981 /usr/sbin/mysqld
 4 31359 -bash
 4 31056 -bash
 1 31358 sshd: root@pts/6
 1 31244 sshd: vpsee [priv]
 1 31159 -bash
 1 31158 sudo -i
 1 31055 sshd: root@pts/3
 1 30912 sshd: vpsee [priv]
 1 29547 /usr/sbin/sshd -D

                              
