Linux Memory Management: How Memory Allocation and Memory Reclaim Work


Article / Content
Linux Memory Management: Bootmem Takes the Stage First - the Bootmem boot-time allocator
Linux Memory Management: The Buddy System Arrives Late - the Buddy System buddy allocator
Linux Memory Management: Slab Makes Its Entrance - the Slab allocator
Linux Memory Management: How Memory Allocation and Memory Reclaim Work - memory allocation and reclaim (this article)

This is the fourth article in the source-code analysis series.

The series is split into four major modules: memory management, device management, system startup, and everything else.

Memory management itself is covered in three parts: Bootmem, Buddy System and Slab. Besides memory initialization, of course, there is also memory allocation and memory reclaim, which is what this article covers.

Some TODO items will be filled in later.


Reclaim Memory

Basic Concept

When a Linux system is under memory pressure, it runs memory reclaim on every zone that is under pressure.

Memory reclaim mainly targets anonymous pages and file pages:

  • For anonymous pages, reclaim picks out anonymous pages that are rarely used, writes them to the swap partition, and releases the page frames back to the buddy system as free pages.
  • For file pages, a clean page needs no writeback and its frame is released to the buddy system directly; a dirty page is first written back to disk and then released to the buddy system.

The downside is the heavy pressure this puts on I/O. For that reason each zone is given a watermark line: memory reclaim only runs when the number of free page frames falls below that line, and is skipped otherwise.

Memory reclaim works per zone, and a zone normally has three watermark lines:

  • watermark[WMARK_MIN]: used by the slow allocation path after the fast allocation has failed; if allocation still cannot succeed at this level, direct memory reclaim (and fast memory reclaim) is triggered.
  • watermark[WMARK_LOW]: the low watermark, the default for the fast path; if a zone's free page count drops below it during allocation, the system runs fast memory reclaim on that zone.
  • watermark[WMARK_HIGH]: the high watermark, the free page count the zone is comfortable with; when a zone is reclaimed, the goal is usually to push its free page count back up to this value.
liuzixuan@liuzixuan-ubuntu ~ # cat /proc/zoneinfo
Node 0, zone   Normal
  pages free     5179
        min      4189
        low      5236
        high     6283
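
These watermarks can be inspected directly from user space. The following is a minimal sketch (my own illustration, not kernel code) that parses /proc/zoneinfo and prints the free/min/low/high page counts per zone; it assumes the field layout shown above and does almost no error handling.

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/zoneinfo", "r");
	char line[256], zone[64];
	long val;

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Node %*d, zone %63s", zone) == 1)
			printf("zone %s\n", zone);
		else if (sscanf(line, " pages free %ld", &val) == 1)
			printf("  free %ld\n", val);
		else if (sscanf(line, " min %ld", &val) == 1)
			printf("  min  %ld\n", val);
		else if (sscanf(line, " low %ld", &val) == 1)
			printf("  low  %ld\n", val);
		else if (sscanf(line, " high %ld", &val) == 1)
			printf("  high %ld\n", val);
	}
	fclose(f);
	return 0;
}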

Zone memory reclaim mainly targets three things: slab, the pages on the lru lists, and buffer_head. The lru lists manage the pages used by process address spaces and cover three kinds of pages: anonymous pages, file pages and shmem pages.
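
The system-wide sizes of these lru lists can be read from /proc/meminfo. A small user-space sketch (assuming the Active/Inactive anon/file field names used by reasonably recent kernels):

#include <stdio.h>
#include <string.h>

int main(void)
{
	static const char *keys[] = {
		"Active(anon):", "Inactive(anon):",
		"Active(file):", "Inactive(file):", "Unevictable:",
	};
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
			if (strncmp(line, keys[i], strlen(keys[i])) == 0)
				fputs(line, stdout);
	fclose(f);
	return 0;
}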


The precondition for a page to be reclaimed is that its reference count, page->_count, has dropped to 0.
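
As a rough illustration of that rule (a toy model, nothing like the real struct page or put_page(), which involve atomics, compound pages and much more):

#include <stdio.h>

struct toy_page {
	int _count;	/* reference count */
};

static void toy_put_page(struct toy_page *p)
{
	if (--p->_count == 0)
		printf("last reference gone: frame can return to the buddy system\n");
	else
		printf("still %d reference(s): cannot reclaim yet\n", p->_count);
}

int main(void)
{
	struct toy_page page = { ._count = 2 };	/* e.g. page cache + one mapping */

	toy_put_page(&page);
	toy_put_page(&page);
	return 0;
}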

Mem Allocator

Memory allocation through alloc_page/alloc_pages normally ends up in __alloc_pages_nodemask -> __alloc_pages_internal.

static inline struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
		struct zonelist *zonelist, nodemask_t *nodemask)
{
	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
}

__alloc_pages_internal normally makes one fast allocation attempt against the low watermark (get_page_from_freelist) and, if that fails, a slow allocation pass using the min watermark.

struct page *
__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
{
    // ...
	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
	if (page)
		goto got_pg;

Fast memory allocation

The fast allocation function get_page_from_freelist walks the zonelist and, using the low watermark, picks a suitable zone to allocate from. If a zone does not meet the low watermark, fast memory reclaim is run on it and the allocation is retried afterwards. Its parameters are:

  • gfp_mask: the gfp mask used for the allocation
  • order: the order of the requested physical memory
  • zonelist: the node's zonelist array
  • alloc_flags: the allocation flags after conversion into internal form
  • high_zoneidx: the highest zone the allocation may use

alloc_flags are the flags the buddy allocator uses internally; they steer several aspects of allocation behaviour:

/* The ALLOC_WMARK bits are used as an index to zone->watermark */
#define ALLOC_WMARK_MIN		WMARK_MIN
#define ALLOC_WMARK_LOW		WMARK_LOW
#define ALLOC_WMARK_HIGH	WMARK_HIGH
#define ALLOC_NO_WATERMARKS	0x04 /* don't check watermarks at all */
 
/* Mask to get the watermark bits */
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
 
/*
 * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
 * cannot assume a reduced access to memory reserves is sufficient for
 * !MMU
 */
#ifdef CONFIG_MMU
#define ALLOC_OOM		0x08
#else
#define ALLOC_OOM		ALLOC_NO_WATERMARKS
#endif
 
#define ALLOC_HARDER		 0x10 /* try to alloc harder */
#define ALLOC_HIGH		 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET		 0x40 /* check for correct cpuset */
#define ALLOC_CMA		 0x80 /* allow allocations from CMA areas */
#ifdef CONFIG_ZONE_DMA32
#define ALLOC_NOFRAGMENT	0x100 /* avoid mixing pageblock types */
#else
#define ALLOC_NOFRAGMENT	  0x0
#endif
#define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */

The flags mean the following (the sketch after this list shows how the ALLOC_WMARK bits are used as an index):

  • ALLOC_WMARK_XXX: select which watermark the allocation is checked against
  • ALLOC_NO_WATERMARKS: do not check watermarks at all
  • ALLOC_OOM: allow triggering the OOM path when memory is insufficient
  • ALLOC_HARDER: whether the MIGRATE_HIGHATOMIC reserve kept for page migration may be used
  • ALLOC_HIGH: same meaning as __GFP_HIGH
  • ALLOC_CPUSET: whether cpusets constrain the allocation
  • ALLOC_CMA: allow allocating from CMA areas
  • ALLOC_NOFRAGMENT: when set, use the no-fallback policy under pressure: do not allocate from remote nodes, i.e. avoid creating external fragmentation
  • ALLOC_KSWAPD: allow waking kswapd when memory is low (__GFP_KSWAPD_RECLAIM was set)
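
Because the ALLOC_WMARK_* values double as an index into the zone's watermark array, the watermark to check can be picked by masking the low bits. A tiny self-contained sketch of that masking (constants copied from the listing above; the array is only a stand-in for zone->watermark, with the values from the /proc/zoneinfo output earlier):

#include <stdio.h>

#define WMARK_MIN		0
#define WMARK_LOW		1
#define WMARK_HIGH		2
#define ALLOC_WMARK_MIN		WMARK_MIN
#define ALLOC_WMARK_LOW		WMARK_LOW
#define ALLOC_WMARK_HIGH	WMARK_HIGH
#define ALLOC_NO_WATERMARKS	0x04
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS - 1)
#define ALLOC_CPUSET		0x40

int main(void)
{
	unsigned long watermark[] = { 4189, 5236, 6283 };	/* min, low, high */
	unsigned int alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET;

	/* the ALLOC_WMARK bits are used as an index into the watermark array */
	unsigned long mark = watermark[alloc_flags & ALLOC_WMARK_MASK];

	printf("checking against watermark %lu\n", mark);	/* prints 5236 */
	return 0;
}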

get_page_from_freelist is the buddy allocator's first attempt at the request; the core idea is that, when there is enough memory, physical pages are taken straight from the free list of the requested order in the zone.

get_page_from_freelist

/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
{
	struct zoneref *z;
	struct page *page = NULL;
	int classzone_idx;
	struct zone *zone, *preferred_zone;
	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
	int zlc_active = 0;		/* set if using zonelist_cache */
	int did_zlc_setup = 0;		/* just call zlc_setup() one time */

Get the zone index of the preferred zone:

	classzone_idx = zone_idx(preferred_zone);

Now look at the zonelist_scan label, which walks the zonelist looking for a zone with enough free pages:

zonelist_scan:
	/*
	 * Scan zonelist, looking for a zone with enough free.
	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
	 */
	for_each_zone_zonelist_nodemask(zone, z, zonelist,
						high_zoneidx, nodemask) {
		if (NUMA_BUILD && zlc_active &&
			!zlc_zone_worth_trying(zonelist, z, allowednodes))
				continue;
		if ((alloc_flags & ALLOC_CPUSET) &&
			!cpuset_zone_allowed_softwall(zone, gfp_mask))
				goto try_next_zone;

		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
			unsigned long mark;
			int ret;
			if (alloc_flags & ALLOC_WMARK_MIN)
				mark = zone->pages_min;
			else if (alloc_flags & ALLOC_WMARK_LOW)
				mark = zone->pages_low;
			else
				mark = zone->pages_high;

			if (zone_watermark_ok(zone, order, mark,
				    classzone_idx, alloc_flags))
				goto try_this_zone;

			if (zone_reclaim_mode == 0)
				goto this_zone_full;

			ret = zone_reclaim(zone, gfp_mask, order);
			switch (ret) {
			case ZONE_RECLAIM_NOSCAN:
				/* did not scan */
				goto try_next_zone;
			case ZONE_RECLAIM_FULL:
				/* scanned but unreclaimable */
				goto this_zone_full;
			default:
				/* did we reclaim enough */
				if (!zone_watermark_ok(zone, order, mark,
						classzone_idx, alloc_flags))
					goto this_zone_full;
			}
		}

The for_each_zone_zonelist_nodemask macro expands roughly to:

for (z = first_zones_zonelist(zonelist, high_zoneidx, nodemask, &zone);
	zone;
	z = next_zones_zonelist(++z, high_zoneidx, nodemask, &zone))	// fetch the next zone in the zonelist
	{ /* loop body */ }

static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
					enum zone_type highest_zoneidx,
					nodemask_t *nodes,
					struct zone **zone)
{
	return next_zones_zonelist(zonelist->_zonerefs, highest_zoneidx, nodes,
								zone);
}

struct zoneref *next_zones_zonelist(struct zoneref *z,
					enum zone_type highest_zoneidx,
					nodemask_t *nodes,
					struct zone **zone)
{
	/*
	 * Find the next suitable zone to use for the allocation.
	 * Only filter based on nodemask if it's set
	 */
	if (likely(nodes == NULL))
		while (zonelist_zone_idx(z) > highest_zoneidx)
			z++;
	else
		while (zonelist_zone_idx(z) > highest_zoneidx ||
				(z->zone && !zref_in_nodemask(z, nodes)))
			z++;

	*zone = zonelist_zone(z); // get the zone this zoneref points to
	return z;
}
		if (NUMA_BUILD && zlc_active &&
            // the node z->zone belongs to does not allow allocation, or the zone is already full
			!zlc_zone_worth_trying(zonelist, z, allowednodes)) 
				continue;
		if ((alloc_flags & ALLOC_CPUSET) &&
            // cpuset checking is enabled and this zone is not allowed for the current task's cpuset
			!cpuset_zone_allowed_softwall(zone, gfp_mask))
				goto try_next_zone;

zlc_zone_worth_trying checks whether the node the zone lives on allows allocation and whether the zone is already full:

static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
						nodemask_t *allowednodes)
{
	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
	int i;				/* index of *z in zonelist zones */
	int n;				/* node that zone *z is on */

	zlc = zonelist->zlcache_ptr; // get the zonelist_cache pointer
	if (!zlc)
		return 1;

	i = z - zonelist->_zonerefs; // index of this zoneref in the _zonerefs array
	n = zlc->z_to_n[i];

	/* This zone is worth trying if it is allowed but not full */
	return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
}
struct zonelist_cache {
	unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];		/* zone->nid */
	DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST);	/* zone full? */
	unsigned long last_full_zap;		/* when last zap'd (jiffies) */
};

The check boils down to two helpers: node_isset and test_bit.

#define node_isset(node, nodemask) test_bit((node), (nodemask).bits)

static inline int test_bit(int nr, const volatile unsigned long *addr)
{
	return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
}
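
A quick self-contained usage sketch of the same bit test (BIT_WORD and BITS_PER_LONG are redefined locally, purely for the demo):

#include <stdio.h>

#define BITS_PER_LONG	(8 * sizeof(unsigned long))
#define BIT_WORD(nr)	((nr) / BITS_PER_LONG)

static inline int test_bit(int nr, const volatile unsigned long *addr)
{
	return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG - 1)));
}

int main(void)
{
	/* a node mask with nodes 0 and 2 set */
	unsigned long nodemask[1] = { (1UL << 0) | (1UL << 2) };

	printf("node 0 allowed: %d\n", test_bit(0, nodemask));	/* 1 */
	printf("node 1 allowed: %d\n", test_bit(1, nodemask));	/* 0 */
	printf("node 2 allowed: %d\n", test_bit(2, nodemask));	/* 1 */
	return 0;
}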

cpuset_zone_allowed_softwall works along the same lines and is not described here.

Pick the watermark for the fast allocation:

		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
			unsigned long mark;
			int ret;
			if (alloc_flags & ALLOC_WMARK_MIN)
				mark = zone->pages_min; // pick the min watermark
			else if (alloc_flags & ALLOC_WMARK_LOW)
				mark = zone->pages_low; // pick the low watermark
			else
				mark = zone->pages_high; // pick the high watermark
zone_watermark_ok

Check whether the zone has enough pages for the allocation:

			if (zone_watermark_ok(zone, order, mark,
				    classzone_idx, alloc_flags))
				goto try_this_zone;

zone_watermark_ok checks the watermark; mark may be any of min, low or high. From the function you can see that to allocate 2^order pages, two conditions must hold (a worked example follows the listing):

  • apart from the page frames being allocated, the zone must still have at least min free page frames
  • apart from the page frames being allocated, for every order o below the requested order, the blocks of order greater than o must together still hold at least min/2^o free page frames
/*
 * Return 1 if free pages are above 'mark'. This takes into account the order
 * of the allocation.
 */
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
		      int classzone_idx, int alloc_flags)
{
	/* free_pages my go negative - that's OK */
	long min = mark;
    // number of free pages (vm_stat[NR_FREE_PAGES]),
    // minus the pages about to be allocated (1 << order)
	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
	int o;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;
	// lowmem_reserve is the number of page frames kept in reserve
	if (free_pages <= min + z->lowmem_reserve[classzone_idx]) 
		return 0;
	// excluding the pages being allocated, the free pages from order k up to order 10 must total at least min/(2^k)
	for (o = 0; o < order; o++) {
		/* At the next order, this order's pages become unavailable */
		free_pages -= z->free_area[o].nr_free << o;

		/* Require fewer higher order pages to be free */
		min >>= 1;

		if (free_pages <= min)
			return 0;
	}
	return 1;
}
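
To make the two conditions concrete, here is a self-contained model of the same check with made-up numbers (the struct is a stand-in, not the kernel's struct zone, and the ALLOC_HIGH/ALLOC_HARDER adjustments are omitted):

#include <stdio.h>

#define MAX_ORDER 11

struct zone_model {
	long free_pages;		/* NR_FREE_PAGES */
	long lowmem_reserve;		/* reserve for this classzone */
	long nr_free[MAX_ORDER];	/* free blocks per order */
};

static int watermark_ok(const struct zone_model *z, int order, long mark)
{
	long min = mark;
	long free_pages = z->free_pages - (1L << order) + 1;
	int o;

	if (free_pages <= min + z->lowmem_reserve)
		return 0;
	for (o = 0; o < order; o++) {
		free_pages -= z->nr_free[o] << o;	/* these blocks are too small */
		min >>= 1;				/* require fewer higher-order pages */
		if (free_pages <= min)
			return 0;
	}
	return 1;
}

int main(void)
{
	/* made-up zone: 4872 free pages, most of them sitting in order-0 blocks */
	struct zone_model z = {
		.free_pages = 4872,
		.lowmem_reserve = 0,
		.nr_free = { 3000, 600, 100, 20, 5, 1 },
	};

	printf("order 0 vs low  (5236): %d\n", watermark_ok(&z, 0, 5236)); /* 0: below low */
	printf("order 0 vs min  (4189): %d\n", watermark_ok(&z, 0, 4189)); /* 1: ok */
	printf("order 3 vs min  (4189): %d\n", watermark_ok(&z, 3, 4189)); /* 0: memory too fragmented */
	return 0;
}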

If the zone_watermark_ok check passes, we jump straight to try_this_zone and allocate the page frames there via buffered_rmqueue.

buffered_rmqueue

This is the core function for allocating pages from a specific zone in the buddy system.

static struct page *buffered_rmqueue(struct zone *preferred_zone,
			struct zone *zone, int order, gfp_t gfp_flags)
{
	unsigned long flags;
	struct page *page;
	int cold = !!(gfp_flags & __GFP_COLD);
	int cpu;
	int migratetype = allocflags_to_migratetype(gfp_flags); // derive the migrate type from the gfp flags

again:
	cpu  = get_cpu();
	if (likely(order == 0)) {
		// a single page: allocate from the per-CPU pcp list (hot/cold pages)
		struct per_cpu_pages *pcp;
		// get this zone's per-CPU page cache for the current CPU
		pcp = &zone_pcp(zone, cpu)->pcp;
		local_irq_save(flags);
		// the list is empty, most likely because the migrate type cached last time differs from this request
		if (!pcp->count) {
			// pull pages out of the buddy system and add them to the per-CPU cache
			pcp->count = rmqueue_bulk(zone, 0,
					pcp->batch, &pcp->list, migratetype);
			// if the list is still empty, the buddy system has no pages either: the allocation fails
			if (unlikely(!pcp->count))
				goto failed;
		}
		
		/* Find a page of the appropriate migrate type */
		// if the allocation does not need to be cache-hot (hardware cache, not the per-CPU page cache), hand back the last page on the list
		if (cold) {
			list_for_each_entry_reverse(page, &pcp->list, lru)
				if (page_private(page) == migratetype)
					break;
		} else {
			// if cache warmth matters, take the first page: it was freed to the per-CPU cache most recently and is hotter
			list_for_each_entry(page, &pcp->list, lru)
				if (page_private(page) == migratetype)
					break;
		}

		/* Allocate more to the pcp list if necessary */
		if (unlikely(&page->lru == &pcp->list)) {
			pcp->count += rmqueue_bulk(zone, 0,
					pcp->batch, &pcp->list, migratetype);
			page = list_entry(pcp->list.next, struct page, lru);
		}
		// take the page off the per-CPU cache list and decrement the cache count
		list_del(&page->lru);
		pcp->count--;
	// a multi-page allocation: no per-CPU page cache involved, allocate straight from the buddy lists
	} else {
		// allocate from the free list of the requested migratetype
		spin_lock_irqsave(&zone->lock, flags); // disable interrupts and take the zone lock
        
		page = __rmqueue(zone, order, migratetype);
		spin_unlock(&zone->lock); // drop the lock now; interrupts are re-enabled later, once the statistics are updated
		if (!page)
			goto failed;
	}
	// event statistics, for accounting/debugging
	__count_zone_vm_events(PGALLOC, zone, 1 << order);
	zone_statistics(preferred_zone, zone);
	local_irq_restore(flags); // re-enable interrupts
	put_cpu();

	VM_BUG_ON(bad_range(zone, page));
	if (prep_new_page(page, order, gfp_flags))
		goto again;
	return page;

failed:
	local_irq_restore(flags);
	put_cpu();
	return NULL;
}

__rmqueue distinguishes two cases (a toy model of the fast path follows the listing below):

  • fast path, __rmqueue_smallest: allocate directly from the free list of the requested migrate type
  • fallback path, __rmqueue_fallback: when the requested migrate type's lists run dry, use the backup lists
/*
 * Do the hard work of removing an element from the buddy allocator.
 * Call me with the zone->lock already held.
 */
static struct page *__rmqueue(struct zone *zone, unsigned int order,
						int migratetype)
{
	struct page *page;
	// fast path first
	page = __rmqueue_smallest(zone, order, migratetype);

	if (unlikely(!page))
		page = __rmqueue_fallback(zone, order, migratetype);

	return page;
}
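
The article does not show __rmqueue_smallest itself, so here is a compact toy model of the idea (my own simplification, not kernel code): search the free lists from the requested order upward, split a larger block when needed, and put the split-off buddies back on the lower-order lists.

#include <stdio.h>

#define MAX_ORDER 11

/* toy free-area bookkeeping: just a count of free blocks per order */
static long nr_free[MAX_ORDER] = { 5, 3, 2, 0, 1 };

static int toy_rmqueue_smallest(int order)
{
	int current_order, o;

	for (current_order = order; current_order < MAX_ORDER; current_order++) {
		if (nr_free[current_order] == 0)
			continue;
		nr_free[current_order]--;		/* take one block of this order */
		for (o = current_order; o > order; o--)
			nr_free[o - 1]++;		/* each split returns one buddy to a lower list */
		return current_order;			/* order of the block we actually used */
	}
	return -1;	/* nothing left; the real code would try __rmqueue_fallback */
}

int main(void)
{
	printf("order-3 request satisfied by splitting an order-%d block\n",
	       toy_rmqueue_smallest(3));			/* 4 */
	printf("free order-3 blocks afterwards: %ld\n", nr_free[3]);	/* 1, the buddy */
	return 0;
}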

That completes the ideal path. If the zone_watermark_ok check fails, execution continues further down.

Reaching this point means the zone has to reclaim pages: the failed watermark check says there are not enough usable pages, so a certain amount of memory is reclaimed right here.

			ret = zone_reclaim(zone, gfp_mask, order);

zone_reclaim does the page reclaim.

It only returns true if 2^order page frames were reclaimed; even if some pages were reclaimed, falling short of that count still returns false.

int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
	int node_id;
	int ret;
	// both are below their configured minimums: nothing worth reclaiming here
	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
		return ZONE_RECLAIM_FULL;

	if (zone_is_all_unreclaimable(zone)) // the zone is flagged as unreclaimable
		return ZONE_RECLAIM_FULL;
	// if __GFP_WAIT is not set (the caller cannot sleep), do not go on reclaiming
    // if PF_MEMALLOC is set, the caller is itself the reclaim path, so do not recurse into reclaim
	if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
		return ZONE_RECLAIM_NOSCAN;

	node_id = zone_to_nid(zone); // node id of this zone
    // the zone sits on a remote node that has its own CPUs, so leave reclaim to that node
	if (node_state(node_id, N_CPU) && node_id != numa_node_id())
		return ZONE_RECLAIM_NOSCAN;
	// someone else is already reclaiming this zone
	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
		return ZONE_RECLAIM_NOSCAN;

	ret = __zone_reclaim(zone, gfp_mask, order); // reclaim this zone's pages
	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);  // release the reclaim lock

	if (!ret)
		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);

	return ret;
}

Note PF_MEMALLOC and __GFP_WAIT here: PF_MEMALLOC is a process flag, and code outside the memory-management subsystem should normally never set it.

Next comes __zone_reclaim:

static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
	/* Minimum pages needed in order to stay on node */
	const unsigned long nr_pages = 1 << order;
	struct task_struct *p = current;
	struct reclaim_state reclaim_state;
	int priority;
	struct scan_control sc = {
		// scan control: the parameters that steer this reclaim pass
		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
		.may_swap = 1,
		.swap_cluster_max = max_t(unsigned long, nr_pages,
					SWAP_CLUSTER_MAX),
		.gfp_mask = gfp_mask,
		.swappiness = vm_swappiness,
		.order = order,
		.isolate_pages = isolate_pages_global,
	};
	unsigned long slab_reclaimable;

	disable_swap_token();
	cond_resched();
	/*
	 * We need to be able to allocate from the reserves for RECLAIM_SWAP
	 * and we also need to be able to write out pages for RECLAIM_WRITE
	 * and RECLAIM_SWAP.
	 */
	p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
	reclaim_state.reclaimed_slab = 0;
	p->reclaim_state = &reclaim_state;

	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
		priority = ZONE_RECLAIM_PRIORITY;
		do {
			note_zone_scanning_priority(zone, priority);
			shrink_zone(priority, zone, &sc); // reclaim pages from this zone
			priority--;
		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
	}

	slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
	if (slab_reclaimable > zone->min_slab_pages) {
		while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
			zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
				slab_reclaimable - nr_pages)
			;
		sc.nr_reclaimed += slab_reclaimable -
			zone_page_state(zone, NR_SLAB_RECLAIMABLE);
	}

	p->reclaim_state = NULL;
	current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
	return sc.nr_reclaimed >= nr_pages;
}

shrink_zone and shrink_slab are discussed later.

			switch (ret) {
			case ZONE_RECLAIM_NOSCAN:
				/* did not scan */
				goto try_next_zone;
			case ZONE_RECLAIM_FULL:
				/* scanned but unreclaimable */
				goto this_zone_full;
			default:
				/* did we reclaim enough */
				if (!zone_watermark_ok(zone, order, mark, 
						classzone_idx, alloc_flags))
					goto this_zone_full;
			}

In the ideal case, allocation goes ahead:

try_this_zone:
		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
		if (page)
			break;

The zone is already full:

this_zone_full:
		if (NUMA_BUILD)
			zlc_mark_zone_full(zonelist, z);

Move on to the next zone:

try_next_zone:
		if (NUMA_BUILD && !did_zlc_setup) {
			/* we do zlc_setup after the first zone is tried */
			allowednodes = zlc_setup(zonelist, alloc_flags);
			zlc_active = 1;
			did_zlc_setup = 1;
		}
	}

Then loop over the zonelist one more time:

	if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
		/* Disable zlc cache for second zonelist scan */
		zlc_active = 0;
		goto zonelist_scan;
	}

Slow memory allocation

Slow allocation: if the fast path fails, i.e. none of the zones in the zonelist could supply the memory, allocation falls back to the slow path using the min watermark, which can involve (a rough sketch of the order of these attempts follows this list):

  • asynchronous memory compaction
  • direct memory reclaim
  • light synchronous memory compaction (with OOM as the eventual last resort)
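
As a self-contained sketch of that ladder of attempts (my own simplification; the helpers are stand-in stubs, not real kernel calls):

#include <stdbool.h>
#include <stdio.h>

static bool try_alloc(const char *watermark) { printf("try alloc at %s watermark\n", watermark); return false; }
static void wake_kswapd(void)                { printf("wake kswapd on every node\n"); }
static bool direct_reclaim(void)             { printf("direct reclaim\n"); return true; }
static void oom_kill(void)                   { printf("OOM killer\n"); }

int main(void)
{
	if (try_alloc("low"))			/* fast path */
		return 0;
	wake_kswapd();				/* background reclaim */
	if (try_alloc("min"))			/* retry at the lower watermark */
		return 0;
	/* PF_MEMALLOC / TIF_MEMDIE callers would retry here with no watermark at all */
	if (direct_reclaim() && try_alloc("min"))
		return 0;
	oom_kill();				/* last resort */
	return 1;
}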

Wake the kswapd kernel thread on every node:

	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
		wakeup_kswapd(zone, order);  // wake each node's kswapd thread
/*
 * A zone is low on free memory, so wake its kswapd task to service it.
 */
void wakeup_kswapd(struct zone *zone, int order)
{
	pg_data_t *pgdat;

	if (!populated_zone(zone))
		return;

	pgdat = zone->zone_pgdat;
	// check the watermark: if the zone is still above low there is nothing to do
	if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
		return;
	if (pgdat->kswapd_max_order < order)
		pgdat->kswapd_max_order = order;
	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))  // cpuset does not allow this zone
		return;
	if (!waitqueue_active(&pgdat->kswapd_wait))  
		return;
	wake_up_interruptible(&pgdat->kswapd_wait);  
}

Lower the bar and retry the fast allocation, this time against the min watermark:

	alloc_flags = ALLOC_WMARK_MIN;
	if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
		alloc_flags |= ALLOC_HARDER;
	if (gfp_mask & __GFP_HIGH)
		alloc_flags |= ALLOC_HIGH;
	if (wait)
		alloc_flags |= ALLOC_CPUSET;

The flags used here:

  • ALLOC_HARDER: try harder to satisfy the allocation
  • ALLOC_HIGH: the caller passed __GFP_HIGH, a high-priority request
  • ALLOC_CPUSET: check the cpuset to see whether the allocation is allowed
	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
						high_zoneidx, alloc_flags);
	if (page)
		goto got_pg;

This stage is sometimes described as "five sword forms". The first form is the casual one: allocate against the low watermark, i.e. call get_page_from_freelist() directly. The second form is this one: allocate against the min watermark, adding ALLOC_WMARK_MIN, and possibly ALLOC_HARDER and ALLOC_HIGH, to the allocation flags.

The next step calls in with no watermark check at all, passing ALLOC_NO_WATERMARKS as alloc_flags:

rebalance:
	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
			&& !in_interrupt()) {
		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
			/* go through the zonelist yet again, ignoring mins */
			page = get_page_from_freelist(gfp_mask, nodemask, order,
				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); // allocate without checking any watermark
			if (page)
				goto got_pg;
			if (gfp_mask & __GFP_NOFAIL) {
				congestion_wait(WRITE, HZ/50);
				goto nofail_alloc;
			}
		}
		goto nopage;
	}
try_to_free_pages

Memory pages are then obtained by freeing memory synchronously; the main function is try_to_free_pages.

	cpuset_update_task_memory_state();
	p->flags |= PF_MEMALLOC;

	lockdep_set_current_reclaim_state(gfp_mask);
	reclaim_state.reclaimed_slab = 0;
	p->reclaim_state = &reclaim_state;

	did_some_progress = try_to_free_pages(zonelist, order,
						gfp_mask, nodemask);

	p->reclaim_state = NULL;
	lockdep_clear_current_reclaim_state();
	p->flags &= ~PF_MEMALLOC;

Its body is:

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
				gfp_t gfp_mask, nodemask_t *nodemask)
{
	struct scan_control sc = {
		// the scan control structure for this reclaim pass
		.gfp_mask = gfp_mask,
		.may_writepage = !laptop_mode,
		.swap_cluster_max = SWAP_CLUSTER_MAX,
		.may_unmap = 1,
		.may_swap = 1,
		.swappiness = vm_swappiness,
		.order = order,
		.mem_cgroup = NULL,
		.isolate_pages = isolate_pages_global,
		.nodemask = nodemask,
	};

	return do_try_to_free_pages(zonelist, &sc);
}

One conceptual point is worth spelling out: the previous approaches are allocation attempts, and when memory is short they can simply take page frames from other nodes (there is reclaim in zone_reclaim, but the search in essence still spreads across nodes), whereas from this point on the kernel goes looking for memory directly on the local node.

do_try_to_free_pages is packed with calls to shrink_zone and shrink_slab; it is the main logic of page reclaim.

static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
					struct scan_control *sc)
{
	int priority;
	unsigned long ret = 0;
	unsigned long total_scanned = 0;
	struct reclaim_state *reclaim_state = current->reclaim_state;
	unsigned long lru_pages = 0;
	struct zoneref *z;
	struct zone *zone;
	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);

	delayacct_freepages_start();

	if (scanning_global_lru(sc))
		count_vm_event(ALLOCSTALL);
	/*
	 * mem_cgroup will not do shrink_slab.
	 */
	if (scanning_global_lru(sc)) {
		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {

			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
				continue;

			lru_pages += zone_lru_pages(zone);
		}
	}

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		sc->nr_scanned = 0;
		if (!priority)
			disable_swap_token();
		shrink_zones(priority, zonelist, sc);
		/*
		 * Don't shrink slabs when reclaiming memory from
		 * over limit cgroups
		 */
		if (scanning_global_lru(sc)) {
			shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
			if (reclaim_state) {
				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
				reclaim_state->reclaimed_slab = 0;
			}
		}
		total_scanned += sc->nr_scanned;
		if (sc->nr_reclaimed >= sc->swap_cluster_max) {
			ret = sc->nr_reclaimed;
			goto out;
		}

		/*
		 * Try to write back as many pages as we just scanned.  This
		 * tends to cause slow streaming writers to write data to the
		 * disk smoothly, at the dirtying rate, which is nice.   But
		 * that's undesirable in laptop mode, where we *want* lumpy
		 * writeout.  So in laptop mode, write out the whole world.
		 */
		if (total_scanned > sc->swap_cluster_max +
					sc->swap_cluster_max / 2) {
			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
			sc->may_writepage = 1;
		}

		/* Take a nap, wait for some writeback to complete */
		if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
			congestion_wait(WRITE, HZ/10);
	}
	/* top priority shrink_zones still had more to do? don't OOM, then */
	if (!sc->all_unreclaimable && scanning_global_lru(sc))
		ret = sc->nr_reclaimed;
out:
	/*
	 * Now that we've scanned all the zones at this priority level, note
	 * that level within the zone so that the next thread which performs
	 * scanning of this zone will immediately start out at this priority
	 * level.  This affects only the decision whether or not to bring
	 * mapped pages onto the inactive list.
	 */
	if (priority < 0)
		priority = 0;

	if (scanning_global_lru(sc)) {
		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {

			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
				continue;

			zone->prev_priority = priority;
		}
	} else
		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);

	delayacct_freepages_end();

	return ret;
}

Back in the slow path: if reclaim made some progress, the allocation is retried.

	if (likely(did_some_progress)) {
		page = get_page_from_freelist(gfp_mask, nodemask, order,
					zonelist, high_zoneidx, alloc_flags); // try the allocation again after reclaiming
		if (page)
			goto got_pg;

TODO: this part is fairly involved and will be revisited later.

The last resort is the OOM mechanism: when there really are no pages left to hand out, some process gets killed and its pages are taken (a little brutal), the so-called out-of-memory killer.

	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
		if (!try_set_zone_oom(zonelist, gfp_mask)) {
			schedule_timeout_uninterruptible(1);
			goto restart;
		}

		/*
		 * Go through the zonelist yet one more time, keep
		 * very high watermark here, this is only to catch
		 * a parallel oom killing, we must fail if we're still
		 * under heavy pressure.
		 */
		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
			order, zonelist, high_zoneidx,
			ALLOC_WMARK_HIGH|ALLOC_CPUSET); // a feint: demanding ALLOC_WMARK_HIGH here is only meant to catch a parallel OOM kill and will normally fail
		if (page) {
			clear_zonelist_oom(zonelist, gfp_mask);
			goto got_pg;
		}

		/* The OOM killer will not help higher order allocs so fail */
		if (order > PAGE_ALLOC_COSTLY_ORDER) {
			clear_zonelist_oom(zonelist, gfp_mask);
			goto nopage;
		}

		out_of_memory(zonelist, gfp_mask, order); // kill a process and free its memory
		clear_zonelist_oom(zonelist, gfp_mask);
		goto restart;

Scan_Control

The scan control structure holds the variables and parameters for one round of memory reclaim or memory compaction, and some of the results are stored there too.

It is used mainly by memory reclaim and memory compaction.

struct scan_control {
	/* Incremented by the number of inactive pages that were scanned */
	unsigned long nr_scanned;  // number of page frames scanned so far

	/* Number of pages freed so far during a call to shrink_zones() */
	unsigned long nr_reclaimed;   // number of page frames reclaimed so far

	/* This context's GFP mask */
	gfp_t gfp_mask;  // the allocation flags of the request that triggered reclaim

	int may_writepage;  // whether writeback may be performed

	/* Can mapped pages be reclaimed? */
	int may_unmap;  // whether unmapping is allowed, i.e. clearing every page-table entry that maps the page

	/* Can pages be swapped as part of reclaim? */
	int may_swap;  // whether swapping is allowed

	/* This context's SWAP_CLUSTER_MAX. If freeing memory for
	 * suspend, we effectively ignore SWAP_CLUSTER_MAX.
	 * In this context, it doesn't matter that we scan the
	 * whole list at once. */
	int swap_cluster_max;

	int swappiness;

	int all_unreclaimable;

	int order;  // the order of the allocation; scanning only happens because an allocation came up short

	/* Which cgroup do we reclaim from */
	struct mem_cgroup *mem_cgroup;  // the target memcg; NULL when the whole zone is the target

	/*
	 * Nodemask of nodes allowed by the caller. If NULL, all nodes
	 * are scanned.
	 */
	nodemask_t	*nodemask;  // mask of nodes that may be scanned

	/* Pluggable isolate pages callback */
	unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
			unsigned long *scanned, int order, int mode,
			struct zone *z, struct mem_cgroup *mem_cont,
			int active, int file);
};

Memory compression: under memory pressure, constantly writing data out to disk through I/O not only wears the flash out but also hurts system performance badly, so memory compression was introduced. The mainstream techniques are:

  • zSwap: a compressed swap cache; usually compresses anonymous pages
  • zRam: uses memory to emulate a block device; usually compresses anonymous pages
  • zCache: usually compresses file pages

Anonymous pages are pages not backed by a file, such as a process's heap and stack; they cannot be exchanged with a file on disk, but they can be swapped by setting aside an extra swap partition on disk or by using a swap file.

As for active and inactive pages, whether a page is active is normally judged by how often applications access it: if the referenced bit of the page is not set, the page is inactive and should be moved to the inactive list; if the bit is set, it was accessed recently and belongs on the active list.

As time goes by, the least active pages sink to the tail of the inactive list. When memory runs short the kernel swaps those pages out first: they have hardly been used since they were created, so by the LRU principle evicting them does the least damage to the system.

Mem Shrink

Memory reclaim usually means reclaiming from a zone, though it may also mean reclaiming from a particular memcg.

  • Each pass tries to reclaim 2^(order+1) page frames: enough for the current allocation, while reclaiming a little extra. If the inactive lru lists do not even hold that many pages, this target is dropped.
  • Zone reclaim often goes hand in hand with zone compaction, so zone reclaim keeps going until enough free page frames have been recovered for compaction to proceed.

From the reclaim paths above, three functions matter most: shrink_zone, shrink_list and shrink_slab.

shrink_zone

static void shrink_zone(int priority, struct zone *zone,
				struct scan_control *sc)
{
	unsigned long nr[NR_LRU_LISTS];
	unsigned long nr_to_scan;
	unsigned long percent[2];	/* anon @ 0; file @ 1 */
	enum lru_list l;
	unsigned long nr_reclaimed = sc->nr_reclaimed;
	unsigned long swap_cluster_max = sc->swap_cluster_max;

get_scan_ratio

	get_scan_ratio(zone, sc, percent);

When physical memory runs short there are normally two options:

  • swap some of the anonymous pages out to the swap partition
  • flush the data in the page cache back to disk, or simply drop it

The weight given to swapping between these two options is:

	/*
	 * With swappiness at 100, anonymous and file have the same priority.
	 * This scanning priority is essentially the inverse of IO cost.
	 */
	anon_prio = sc->swappiness;
	file_prio = 200 - sc->swappiness;

First, check whether swap is switched off entirely:

static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
					unsigned long *percent)
{
	unsigned long anon, file, free;
	unsigned long anon_prio, file_prio;
	unsigned long ap, fp;
	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

	/* If we have no swap space, do not bother scanning anon pages. */
	if (!sc->may_swap || (nr_swap_pages <= 0)) {
		percent[0] = 0;
		percent[1] = 100;
		return;
	}

Count the anonymous pages and the page cache pages:

	anon  = zone_nr_pages(zone, sc, LRU_ACTIVE_ANON) +
		zone_nr_pages(zone, sc, LRU_INACTIVE_ANON);
	file  = zone_nr_pages(zone, sc, LRU_ACTIVE_FILE) +
		zone_nr_pages(zone, sc, LRU_INACTIVE_FILE);

If the number of free pages plus page cache pages is below the high watermark, scan anonymous pages only (everything goes towards swap):

	if (scanning_global_lru(sc)) {
		free  = zone_page_state(zone, NR_FREE_PAGES);
		/* If we have very few page cache pages,
		   force-scan anon pages. */
		if (unlikely(file + free <= zone->pages_high)) {
			percent[0] = 100;
			percent[1] = 0;
			return;
		}
	}

Compute the scan ratio:

	anon_prio = sc->swappiness;
	file_prio = 200 - sc->swappiness;

	/*
	 * The amount of pressure on anon vs file pages is inversely
	 * proportional to the fraction of recently scanned pages on
	 * each list that were recently referenced and in active use.
	 */
	ap = (anon_prio + 1) * (reclaim_stat->recent_scanned[0] + 1);
	ap /= reclaim_stat->recent_rotated[0] + 1;

	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
	fp /= reclaim_stat->recent_rotated[1] + 1;

	/* Normalize to percentages */
	percent[0] = 100 * ap / (ap + fp + 1);
	percent[1] = 100 - percent[0];
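
A quick worked example of this formula as a standalone program (swappiness uses the common default of 60; the recent_scanned/recent_rotated numbers are made up):

#include <stdio.h>

int main(void)
{
	unsigned long swappiness = 60;
	unsigned long anon_prio = swappiness;		/* 60 */
	unsigned long file_prio = 200 - swappiness;	/* 140 */

	/* [0] = anon, [1] = file; invented sample statistics */
	unsigned long recent_scanned[2] = { 2000, 8000 };
	unsigned long recent_rotated[2] = { 1500, 1000 };

	unsigned long ap = (anon_prio + 1) * (recent_scanned[0] + 1) / (recent_rotated[0] + 1);
	unsigned long fp = (file_prio + 1) * (recent_scanned[1] + 1) / (recent_rotated[1] + 1);
	unsigned long percent0 = 100 * ap / (ap + fp + 1);	/* anon share */

	printf("scan anon: %lu%%, scan file: %lu%%\n", percent0, 100 - percent0);
	return 0;
}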

An intermediate step then turns those percentages into per-list scan counts:

	for_each_evictable_lru(l) {
		int file = is_file_lru(l);
		unsigned long scan;

		scan = zone_nr_pages(zone, sc, l);
		if (priority) {
			scan >>= priority;
			scan = (scan * percent[file]) / 100;
		}
		if (scanning_global_lru(sc)) {
			zone->lru[l].nr_scan += scan;
			nr[l] = zone->lru[l].nr_scan;
			if (nr[l] >= swap_cluster_max)
				zone->lru[l].nr_scan = 0;
			else
				nr[l] = 0;
		} else
			nr[l] = scan;
	}

for_each_evictable_lru expands to:

for (l = 0; l <= LRU_ACTIVE_FILE; l++)

The loop variable l is of type enum lru_list:

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,
	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
#ifdef CONFIG_UNEVICTABLE_LRU
	LRU_UNEVICTABLE,
#else
	LRU_UNEVICTABLE = LRU_ACTIVE_FILE, /* avoid compiler errors in dead code */
#endif
	NR_LRU_LISTS
};

As you can see, LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE is the last of the evictable lists.

The loop therefore walks these four lru lists, i.e. all of the evictable lists.


shrink_list

The lru lists are walked in the order LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, LRU_INACTIVE_FILE, LRU_ACTIVE_FILE:

	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
					nr[LRU_INACTIVE_FILE]) {
		for_each_evictable_lru(l) {
			if (nr[l]) {
				nr_to_scan = min(nr[l], swap_cluster_max);
				nr[l] -= nr_to_scan;
				// reclaim from this type of lru list
				nr_reclaimed += shrink_list(l, nr_to_scan,
							    zone, sc, priority);
			}
		}
		/*
		 * On large memory systems, scan >> priority can become
		 * really large. This is fine for the starting priority;
		 * we want to put equal scanning pressure on each zone.
		 * However, if the VM has a harder time of freeing pages,
		 * with multiple processes reclaiming pages, the total
		 * freeing target can get unreasonably large.
		 */
		if (nr_reclaimed > swap_cluster_max &&
			priority < DEF_PRIORITY && !current_is_kswapd())
			break;
	}

Remember the shrinker_list that the dentry caches register around vfs_cache_init? Those shrinkers are invoked during this reclaim as well. shrink_list itself splits into the following cases:

  • LRU_ACTIVE_FILE: active file pages; shrink_active_list processes the active lru list
  • LRU_ACTIVE_ANON: active anonymous pages; shrink_active_list processes the active lru list, but only when there are too few inactive anonymous pages
  • everything else goes to shrink_inactive_list, which handles the inactive lru lists
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
	struct zone *zone, struct scan_control *sc, int priority)
{
	int file = is_file_lru(lru);

	if (lru == LRU_ACTIVE_FILE) {
		shrink_active_list(nr_to_scan, zone, sc, priority, file);
		return 0;
	}

	if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
		shrink_active_list(nr_to_scan, zone, sc, priority, file);
		return 0;
	}
	return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
}

Record how much memory was reclaimed:

	sc->nr_reclaimed = nr_reclaimed;

When the inactive anonymous pages are too few, shrink_active_list is called to rebalance the ratio:

	/*
	 * Even if we did not try to evict anon pages at all, we want to
	 * rebalance the anon lru active/inactive ratio.
	 */
	if (inactive_anon_is_low(zone, sc))
		shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

If too many dirty pages are being written back, sleep here for a little while:

	throttle_vm_writeout(sc->gfp_mask);

shrink_active_list

Let us pick one of these functions and look at it:

static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
			struct scan_control *sc, int priority, int file)
{
	unsigned long pgmoved;
	int pgdeactivate = 0;
	unsigned long pgscanned;
	LIST_HEAD(l_hold);	/* The pages which were snipped off */
	LIST_HEAD(l_inactive);
	struct page *page;
	struct pagevec pvec;
	enum lru_list lru;
	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

	lru_add_drain();
	spin_lock_irq(&zone->lru_lock);
	pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
					ISOLATE_ACTIVE, zone,
					sc->mem_cgroup, 1, file);
	/*
	 * zone->pages_scanned is used for detect zone's oom
	 * mem_cgroup remembers nr_scan by itself.
	 */
	if (scanning_global_lru(sc)) {
		zone->pages_scanned += pgscanned;
	}
	reclaim_stat->recent_scanned[!!file] += pgmoved;

	if (file)
		__mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
	else
		__mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
	spin_unlock_irq(&zone->lru_lock);

	pgmoved = 0;
	while (!list_empty(&l_hold)) {
		cond_resched();
		page = lru_to_page(&l_hold);
		list_del(&page->lru);

		if (unlikely(!page_evictable(page, NULL))) {
			putback_lru_page(page);
			continue;
		}

		/* page_referenced clears PageReferenced */
		if (page_mapping_inuse(page) &&
		    page_referenced(page, 0, sc->mem_cgroup))
			pgmoved++;

		list_add(&page->lru, &l_inactive);
	}

	/*
	 * Move the pages to the [file or anon] inactive list.
	 */
	pagevec_init(&pvec, 1);
	lru = LRU_BASE + file * LRU_FILE;

	spin_lock_irq(&zone->lru_lock);
	/*
	 * Count referenced pages from currently used mappings as
	 * rotated, even though they are moved to the inactive list.
	 * This helps balance scan pressure between file and anonymous
	 * pages in get_scan_ratio.
	 */
	reclaim_stat->recent_rotated[!!file] += pgmoved;

	pgmoved = 0;
	while (!list_empty(&l_inactive)) {
		page = lru_to_page(&l_inactive);
		prefetchw_prev_lru_page(page, &l_inactive, flags);
		VM_BUG_ON(PageLRU(page));
		SetPageLRU(page);
		VM_BUG_ON(!PageActive(page));
		ClearPageActive(page);

		list_move(&page->lru, &zone->lru[lru].list);
		mem_cgroup_add_lru_list(page, lru);
		pgmoved++;
		if (!pagevec_add(&pvec, page)) {
			__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
			spin_unlock_irq(&zone->lru_lock);
			pgdeactivate += pgmoved;
			pgmoved = 0;
			if (buffer_heads_over_limit)
				pagevec_strip(&pvec);
			__pagevec_release(&pvec);
			spin_lock_irq(&zone->lru_lock);
		}
	}
	__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
	pgdeactivate += pgmoved;
	__count_zone_vm_events(PGREFILL, zone, pgscanned);
	__count_vm_events(PGDEACTIVATE, pgdeactivate);
	spin_unlock_irq(&zone->lru_lock);
	if (buffer_heads_over_limit)
		pagevec_strip(&pvec);
	pagevec_release(&pvec);
}

Fast memory reclaim

This happens inside get_page_from_freelist(): while walking the zonelist, each zone is checked before allocating from it, and if the zone's free memory after the allocation would be less than the watermark plus the reserved page frames, fast reclaim is run on that zone.

The watermark involved may be any of min, low or high.

Direct memory reclaim

Direct reclaim happens in the slow allocation path. The slow path first wakes the kswapd kernel thread on every node, then calls get_page_from_freelist to try to get contiguous page frames from the zones in the zonelist using the min watermark. If that fails, asynchronous compaction is run on those zones; after the compaction, get_page_from_freelist is called again with the min watermark, and if that still fails, direct memory reclaim is performed.

kswapd memory reclaim

kswapd->balance_pgdat()->kswapd_shrink_zone()->shrink_zone()

During allocation, whenever get_page_from_freelist() cannot obtain contiguous page frames from the zones in the zonelist at the low watermark, and the allocation's gfp_mask does not carry __GFP_NO_KSWAPD, the kswapd kernel thread is woken up and kswapd memory reclaim runs there.
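
Reclaim activity from these three paths can be observed from user space through /proc/vmstat. A small sketch that dumps the relevant counters (the exact counter names, such as the pgscan_* and pgsteal_* variants, differ between kernel versions, so the filter below only matches prefixes):

#include <stdio.h>
#include <string.h>

int main(void)
{
	static const char *prefixes[] = { "pgscan", "pgsteal", "pgrefill", "allocstall" };
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		for (size_t i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++)
			if (strncmp(line, prefixes[i], strlen(prefixes[i])) == 0)
				fputs(line, stdout);
	fclose(f);
	return 0;
}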


Reposted from blog.csdn.net/qq_48322523/article/details/128307150