article | content |
---|---|
Linux memory management: Bootmem takes the lead | Bootmem, the boot-time memory allocator |
Linux Memory Management: The Buddy System is long overdue | Buddy System, the buddy-system memory allocator |
Linux memory management: Slab makes its debut | Slab memory allocator |
Linux memory management: principles of memory allocation and memory recycling | Principles of memory allocation and memory reclaim |
This is the fourth article in the source code analysis column.
The column is divided into four major modules: memory management, device management, system startup and other parts.
Memory management is divided into three parts: Bootmem, Buddy System and Slab. Besides memory initialization, it also has to cover memory allocation and memory reclaim.
Some todo items will be added later.
Reclaim Memory
Basic Concept
When system memory pressure is high, Linux reclaims memory from every zone that is under pressure.
Memory reclaim mainly works on anonymous pages and file pages.
- For anonymous pages, reclaim picks out infrequently used anonymous pages, writes them to the swap partition, and releases them to the buddy system as free page frames.
- For a file page, if its content is clean, nothing needs to be written back and the page is released directly to the buddy system; a dirty page is first written back to disk and then released to the buddy system.
The drawback is that reclaim puts heavy pressure on I/O. The system therefore sets a watermark for each zone: reclaim runs only when the number of free page frames falls below that watermark; otherwise no reclaim is performed.
Memory reclaim works per zone. Each zone generally has three watermarks:
- watermark[WMARK_MIN]: the threshold used by the slow allocation path after fast allocation fails. If the slow path still cannot allocate at this threshold, direct memory reclaim and fast memory reclaim are performed.
- watermark[WMARK_LOW]: the low threshold, which is the default for fast allocation. If the number of free pages in the zone drops below it during allocation, the system performs fast memory reclaim on the zone.
- watermark[WMARK_HIGH]: the high threshold, a number of free pages that the zone considers satisfactory. When reclaiming memory from a zone, the goal is generally to raise its free page count to this value.
liuzixuan@liuzixuan-ubuntu ~ # cat /proc/zoneinfo
Node 0, zone Normal
pages free 5179
min 4189
low 5236
high 6283
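As a rough illustration of how the three watermarks partition a zone's state, the following minimal sketch classifies a free page count against them. The enum, the struct and the function names are hypothetical and exist only for this example; the sketch also ignores the allocation order and lowmem_reserve, which the real check takes into account.

```c
#include <assert.h>

/* Hypothetical labels for this sketch; the kernel defines no such enum. */
enum wmark_state {
    BELOW_MIN,   /* not even the min watermark is met */
    BELOW_LOW,   /* the low watermark check of the fast path fails */
    BELOW_HIGH,  /* fast path passes, reclaim target not yet reached */
    ABOVE_HIGH   /* reclaim target reached */
};

struct zone_marks {
    unsigned long min, low, high;
};

/* A watermark counts as met only when free_pages is strictly above it. */
static enum wmark_state classify_zone(unsigned long free_pages,
                                      const struct zone_marks *m)
{
    assert(m->min <= m->low && m->low <= m->high);
    if (free_pages <= m->min)
        return BELOW_MIN;
    if (free_pages <= m->low)
        return BELOW_LOW;
    if (free_pages <= m->high)
        return BELOW_HIGH;
    return ABOVE_HIGH;
}

/* Values taken from the /proc/zoneinfo output above. */
static void classify_zone_example(void)
{
    struct zone_marks normal = { .min = 4189, .low = 5236, .high = 6283 };

    assert(classify_zone(5179, &normal) == BELOW_LOW);
    assert(classify_zone(6284, &normal) == ABOVE_HIGH);
}
```

With 5179 free pages the Normal zone shown above sits between min and low, so under this simplified model a fast allocation against the low watermark would not pass without reclaim.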
Zone memory reclaim mainly targets three things: slab, pages on the lru lists, and buffer_head. The lru lists mainly manage the memory pages used in process address space, and they hold three types of pages: anonymous pages, file pages and shmem pages.
A memory page can be reclaimed only if page->_count == 0.
Self Allocator
The memory allocation functions alloc_pages and alloc_page generally call __alloc_pages_nodemask->__alloc_pages_internal
static inline struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
}
__alloc_pages_internal
Generally, it first performs a fast memory allocation, get_page_from_freelist, against the low threshold, and then a slow memory allocation against the min threshold.
struct page *
__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
// ...
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
goto got_pg;
Fast memory allocation
The fast memory allocation function get_page_from_freelist uses the low threshold to find a suitable zone in the zonelist. If a zone does not reach the low threshold, fast memory reclaim is performed on it, and allocation is attempted again afterwards.
- gfp_mask: the gfp mask used for the memory request
- order: the order of the physical memory request
- zonelist: the zonelist array of the node's zones
- alloc_flags: the converted flags for the memory request
- high_zoneidx: the highest zone from which memory may be requested
alloc flags
These flags are used internally by the buddy allocator when requesting memory, and they determine some memory allocation behaviors:
/* The ALLOC_WMARK bits are used as an index to zone->watermark */
#define ALLOC_WMARK_MIN WMARK_MIN
#define ALLOC_WMARK_LOW WMARK_LOW
#define ALLOC_WMARK_HIGH WMARK_HIGH
#define ALLOC_NO_WATERMARKS 0x04 /* don't check watermarks at all */
/* Mask to get the watermark bits */
#define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1)
/*
* Only MMU archs have async oom victim reclaim - aka oom_reaper so we
* cannot assume a reduced access to memory reserves is sufficient for
* !MMU
*/
#ifdef CONFIG_MMU
#define ALLOC_OOM 0x08
#else
#define ALLOC_OOM ALLOC_NO_WATERMARKS
#endif
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
#define ALLOC_CMA 0x80 /* allow allocations from CMA areas */
#ifdef CONFIG_ZONE_DMA32
#define ALLOC_NOFRAGMENT 0x100 /* avoid mixing pageblock types */
#else
#define ALLOC_NOFRAGMENT 0x0
#endif
#define ALLOC_KSWAPD 0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
The flags have the following meanings:
- ALLOC_WMARK_XXX: the watermark that applies when requesting memory
- ALLOC_NO_WATERMARKS: do not check watermarks when requesting memory
- ALLOC_OOM: used on the OOM path when memory is insufficient; it lets an OOM victim allocate from part of the memory reserves
- ALLOC_HARDER: whether the memory reserved for MIGRATE_HIGHATOMIC may be used
- ALLOC_HIGH: same function as __GFP_HIGH
- ALLOC_CPUSET: whether cpuset is used to constrain the memory request
- ALLOC_CMA: allows memory to be requested from CMA areas
- ALLOC_NOFRAGMENT: if set, the no_fallback policy is used when memory is insufficient. The allocation may not fall back to pageblocks of another migrate type, that is, it is not allowed to create external memory fragmentation.
- ALLOC_KSWAPD: allows kswapd to be woken when memory is insufficient
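The comment "The ALLOC_WMARK bits are used as an index to zone->watermark" in the definitions above can be illustrated with a small sketch. It assumes the newer flag layout quoted above, where the low two bits of alloc_flags hold a watermark index; the older get_page_from_freelist excerpt later in this article tests ALLOC_WMARK_MIN and ALLOC_WMARK_LOW as separate bits instead. The names toy_zone and select_watermark are hypothetical.

```c
#include <assert.h>

enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

#define ALLOC_WMARK_MIN     WMARK_MIN
#define ALLOC_WMARK_LOW     WMARK_LOW
#define ALLOC_WMARK_HIGH    WMARK_HIGH
#define ALLOC_NO_WATERMARKS 0x04
#define ALLOC_WMARK_MASK    (ALLOC_NO_WATERMARKS - 1)
#define ALLOC_HARDER        0x10
#define ALLOC_CPUSET        0x40

/* Toy zone: only the watermark array matters for this sketch. */
struct toy_zone {
    unsigned long watermark[NR_WMARK];
};

/* Pick the watermark selected by the low two bits of alloc_flags. */
static unsigned long select_watermark(const struct toy_zone *z,
                                      unsigned int alloc_flags)
{
    unsigned int idx = alloc_flags & ALLOC_WMARK_MASK;

    assert(!(alloc_flags & ALLOC_NO_WATERMARKS));
    assert(idx < NR_WMARK);
    return z->watermark[idx];
}

static void select_watermark_example(void)
{
    struct toy_zone z = { .watermark = { 4189, 5236, 6283 } };

    /* Behaviour bits above the mask do not disturb the index. */
    assert(select_watermark(&z, ALLOC_WMARK_LOW | ALLOC_CPUSET) == 5236);
    assert(select_watermark(&z, ALLOC_WMARK_MIN | ALLOC_HARDER) == 4189);
}
```

Keeping the index in the low bits lets the behaviour flags be OR-ed in freely, which is why ALLOC_WMARK_MASK is derived from ALLOC_NO_WATERMARKS, the first bit above the index.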
get_page_from_freelist
It is the buddy algorithm's first attempt to allocate memory. The core idea is that, when memory is sufficient, physical pages are taken from the freelist of the matching order in the chosen zone.
/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
*/
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
{
struct zoneref *z;
struct page *page = NULL;
int classzone_idx;
struct zone *zone, *preferred_zone;
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
Get the index of the preferred zone
classzone_idx = zone_idx(preferred_zone);
Let's look at the zonelist_scan label. Under this label, the zonelist is walked to look for a zone with enough free pages.
zonelist_scan:
/*
* Scan zonelist, looking for a zone with enough free.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask) {
if (NUMA_BUILD && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
int ret;
if (alloc_flags & ALLOC_WMARK_MIN)
mark = zone->pages_min;
else if (alloc_flags & ALLOC_WMARK_LOW)
mark = zone->pages_low;
else
mark = zone->pages_high;
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
goto try_this_zone;
if (zone_reclaim_mode == 0)
goto this_zone_full;
ret = zone_reclaim(zone, gfp_mask, order);
switch (ret) {
case ZONE_RECLAIM_NOSCAN:
/* did not scan */
goto try_next_zone;
case ZONE_RECLAIM_FULL:
/* scanned but unreclaimable */
goto this_zone_full;
default:
/* did we reclaim enough */
if (!zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
goto this_zone_full;
}
}
The for_each_zone_zonelist_nodemask macro used under zonelist_scan expands to
for (z = first_zones_zonelist(zonelist, high_zoneidx, nodemask, &zone);
zone;
z = next_zones_zonelist(++z, high_zoneidx, nodemask, &zone)) // get the next zone in the zonelist
static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
enum zone_type highest_zoneidx,
nodemask_t *nodes,
struct zone **zone)
{
return next_zones_zonelist(zonelist->_zonerefs, highest_zoneidx, nodes,
zone);
}
struct zoneref *next_zones_zonelist(struct zoneref *z,
enum zone_type highest_zoneidx,
nodemask_t *nodes,
struct zone **zone)
{
/*
* Find the next suitable zone to use for the allocation.
* Only filter based on nodemask if it's set
*/
if (likely(nodes == NULL))
while (zonelist_zone_idx(z) > highest_zoneidx)
z++;
else
while (zonelist_zone_idx(z) > highest_zoneidx ||
(z->zone && !zref_in_nodemask(z, nodes)))
z++;
*zone = zonelist_zone(z); // get the zone from the zonelist
return z;
}
if (NUMA_BUILD && zlc_active &&
// the node holding z->zone does not allow allocation, or the zone is already full
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
if ((alloc_flags & ALLOC_CPUSET) &&
// cpuset checking is enabled and the current cpuset does not allow allocating from this zone
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;
zlc_zone_worth_trying
Checks whether the node allows allocation and whether the zone is full
static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
nodemask_t *allowednodes)
{
struct zonelist_cache *zlc; /* cached zonelist speedup info */
int i; /* index of *z in zonelist zones */
int n; /* node that zone *z is on */
zlc = zonelist->zlcache_ptr; // get the zonelist_cache pointer
if (!zlc)
return 1;
i = z - zonelist->_zonerefs; // position of z in the _zonerefs array
n = zlc->z_to_n[i];
/* This zone is worth trying if it is allowed but not full */
return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
}
struct zonelist_cache {
unsigned short z_to_n[MAX_ZONES_PER_ZONELIST]; /* zone->nid */
DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST); /* zone full? */
unsigned long last_full_zap; /* when last zap'd (jiffies) */
};
It mainly relies on two helpers: node_isset and test_bit
#define node_isset(node, nodemask) test_bit((node), (nodemask).bits)
static inline int test_bit(int nr, const volatile unsigned long *addr)
{
return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
}
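To make the bitmap test concrete, here is a minimal user-space sketch of the same word-and-offset arithmetic, plus a toy version of the "allowed but not full" decision. The toy_ prefixed helpers and zone_worth_trying are hypothetical names for this example; unlike the kernel's set_bit, toy_set_bit is not atomic.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BIT_WORD(nr)  ((nr) / BITS_PER_LONG)

/* Return 1 if bit nr is set in the bitmap starting at addr, else 0. */
static int toy_test_bit(unsigned int nr, const unsigned long *addr)
{
    return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG - 1)));
}

/* Non-atomic helper used only to build the example bitmaps. */
static void toy_set_bit(unsigned int nr, unsigned long *addr)
{
    addr[BIT_WORD(nr)] |= 1UL << (nr & (BITS_PER_LONG - 1));
}

/* A zone slot is worth trying if its node is allowed and it is not full. */
static int zone_worth_trying(unsigned int zone_slot, unsigned int node,
                             const unsigned long *allowed_nodes,
                             const unsigned long *full_zones)
{
    return toy_test_bit(node, allowed_nodes) &&
           !toy_test_bit(zone_slot, full_zones);
}

static void zone_worth_trying_example(void)
{
    unsigned long allowed_nodes[1] = { 0 };
    unsigned long full_zones[1] = { 0 };

    toy_set_bit(0, allowed_nodes);   /* node 0 may be used */
    assert(zone_worth_trying(2, 0, allowed_nodes, full_zones));
    toy_set_bit(2, full_zones);      /* slot 2 is now marked full */
    assert(!zone_worth_trying(2, 0, allowed_nodes, full_zones));
    assert(!zone_worth_trying(1, 1, allowed_nodes, full_zones)); /* node 1 not allowed */
}
```

The first bitmap plays the role of allowednodes and the second the role of zlc->fullzones in zlc_zone_worth_trying.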
cpuset_zone_allowed_softwall works along the same lines and is not summarized here.
Get the watermark for fast memory allocation
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
int ret;
if (alloc_flags & ALLOC_WMARK_MIN)
mark = zone->pages_min; // select the min threshold
else if (alloc_flags & ALLOC_WMARK_LOW)
mark = zone->pages_low; // select the low threshold
else
mark = zone->pages_high; // select the high threshold
zone_watermark_ok
Checks that the zone has enough pages to allocate
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
goto try_this_zone;
Check the watermark; mark may be one of low, min or high. From this function it can be seen that allocating 2^order pages must meet two conditions.
- Not counting the page frames being allocated, the zone must still have more than min free page frames (plus lowmem_reserve).
- Not counting the page frames being allocated, for every o from 1 to order, the free page frames in blocks of order at least o must number more than min/2^o.
/*
* Return 1 if free pages are above 'mark'. This takes into account the order
* of the allocation.
*/
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
int classzone_idx, int alloc_flags)
{
/* free_pages my go negative - that's OK */
long min = mark;
// read the number of free pages, vm_stat[NR_FREE_PAGES],
// and subtract the pages about to be allocated (1 << order)
long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
int o;
if (alloc_flags & ALLOC_HIGH)
min -= min / 2;
if (alloc_flags & ALLOC_HARDER)
min -= min / 4;
// lowmem_reserve is the number of pages to keep in reserve
if (free_pages <= min + z->lowmem_reserve[classzone_idx])
return 0;
// not counting the pages being allocated, the free pages in blocks from order k up to order 10 must total more than min/(2^k)
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
free_pages -= z->free_area[o].nr_free << o;
/* Require fewer higher order pages to be free */
min >>= 1;
if (free_pages <= min)
return 0;
}
return 1;
}
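The effect of the per-order loop is easiest to see on a fragmented zone. The following user-space sketch mirrors the arithmetic of zone_watermark_ok on a toy structure; wm_zone, total_free_pages and toy_watermark_ok are hypothetical names, MAX_ORDER is assumed to be 11, and the ALLOC_HIGH and ALLOC_HARDER adjustments are left out.

```c
#include <assert.h>

#define MAX_ORDER 11

/* Toy zone: nr_free[o] counts free blocks of 2^o pages. */
struct wm_zone {
    long nr_free[MAX_ORDER];
    long lowmem_reserve;
};

static long total_free_pages(const struct wm_zone *z)
{
    long pages = 0;

    for (int o = 0; o < MAX_ORDER; o++)
        pages += z->nr_free[o] << o;
    return pages;
}

/* Same arithmetic as zone_watermark_ok, without the alloc_flags tweaks. */
static int toy_watermark_ok(const struct wm_zone *z, int order, long mark)
{
    long min = mark;
    long free_pages;

    assert(order >= 0 && order < MAX_ORDER);
    free_pages = total_free_pages(z) - (1L << order) + 1;
    if (free_pages <= min + z->lowmem_reserve)
        return 0;
    for (int o = 0; o < order; o++) {
        /* Blocks smaller than the request cannot satisfy it. */
        free_pages -= z->nr_free[o] << o;
        /* Require fewer free pages at each higher order. */
        min >>= 1;
        if (free_pages <= min)
            return 0;
    }
    return 1;
}

static void toy_watermark_ok_example(void)
{
    /* 1000 single pages plus two order-3 blocks: 1016 free pages. */
    struct wm_zone z = { .nr_free = { [0] = 1000, [3] = 2 }, .lowmem_reserve = 0 };

    assert(toy_watermark_ok(&z, 0, 100) == 1); /* plenty of order-0 pages */
    assert(toy_watermark_ok(&z, 3, 100) == 0); /* too few high-order blocks */
}
```

In the example the zone has 1016 free pages against a mark of 100, yet an order-3 request is refused: once the 1000 order-0 pages are discounted only 9 pages remain, which is not more than 100/2.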
After the zone_watermark_ok watermark check passes, execution jumps to try_this_zone to allocate the page frame, that is, to the buffered_rmqueue function.
buffered_rmqueue
This function is the core function for allocating pages from a given zone in the buddy system.
static struct page *buffered_rmqueue(struct zone *preferred_zone,
struct zone *zone, int order, gfp_t gfp_flags)
{
unsigned long flags;
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
int cpu;
int migratetype = allocflags_to_migratetype(gfp_flags);
again:
cpu = get_cpu();
if (likely(order == 0)) {
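// a single page: allocate from the pcp list (hot/cold pages)
struct per_cpu_pages *pcp;
// get this CPU's per-cpu page cache for the zone
pcp = &zone_pcp(zone, cpu)->pcp;
local_irq_save(flags);
// the list is empty; most likely the migrate type cached last time differs from this one
if (!pcp->count) {
// take pages from the buddy system and add them to the per-cpu cache
pcp->count = rmqueue_bulk(zone, 0,
pcp->batch, &pcp->list, migratetype);
// if the list is still empty, the buddy system has no pages either, so the allocation fails
if (unlikely(!pcp->count))
goto failed;
}
/* Find a page of the appropriate migrate type */
// if the page need not be hot in the hardware cache (not the per-cpu page cache), take the last node of the list and return it to the caller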
if (cold) {
list_for_each_entry_reverse(page, &pcp->list, lru)
if (page_private(page) == migratetype)
break;
} else {
// if the hardware cache matters, take the first page of the list; it was freed to the per-cpu cache most recently and is the most cache-hot
list_for_each_entry(page, &pcp->list, lru)
if (page_private(page) == migratetype)
break;
}
/* Allocate more to the pcp list if necessary */
if (unlikely(&page->lru == &pcp->list)) {
pcp->count += rmqueue_bulk(zone, 0,
pcp->batch, &pcp->list, migratetype);
page = list_entry(pcp->list.next, struct page, lru);
}
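// remove the page from the per-cpu cache list and decrement the per-cpu count
list_del(&page->lru);
pcp->count--;
// multiple pages are requested: bypass the per-cpu page cache and allocate from the buddy system directly
} else {
// allocate from the free list of the given migratetype
spin_lock_irqsave(&zone->lock, flags); // disable interrupts and take the zone lock
page = __rmqueue(zone, order, migratetype);
spin_unlock(&zone->lock); // release the lock first; interrupts are re-enabled after the statistics below are updated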
if (!page)
goto failed;
}
// event statistics, used for debugging
__count_zone_vm_events(PGALLOC, zone, 1 << order);
zone_statistics(preferred_zone, zone);
local_irq_restore(flags); // restore interrupts
put_cpu();
VM_BUG_ON(bad_range(zone, page));
if (prep_new_page(page, order, gfp_flags))
goto again;
return page;
failed:
local_irq_restore(flags);
put_cpu();
return NULL;
}
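The hot/cold handling of the per-cpu list in buffered_rmqueue can be sketched with a small array-backed list. This is only an illustration under simplifying assumptions: toy_pcp, pcp_free and pcp_alloc are hypothetical names, migrate types are ignored, and the refill from the buddy system via rmqueue_bulk is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PCP_CAP 8

/* Toy per-cpu page list: pfn[0] is the hot end, pfn[count - 1] the cold end. */
struct toy_pcp {
    size_t count;
    unsigned long pfn[PCP_CAP];
};

/* Freed hot pages go to the head, cold pages to the tail. */
static bool pcp_free(struct toy_pcp *pcp, unsigned long pfn, bool cold)
{
    if (pcp->count == PCP_CAP)
        return false;
    if (cold) {
        pcp->pfn[pcp->count] = pfn;
    } else {
        memmove(&pcp->pfn[1], &pcp->pfn[0], pcp->count * sizeof(pcp->pfn[0]));
        pcp->pfn[0] = pfn;
    }
    pcp->count++;
    return true;
}

/* A cold request takes the tail, a hot request takes the head. */
static bool pcp_alloc(struct toy_pcp *pcp, bool cold, unsigned long *pfn)
{
    if (pcp->count == 0)
        return false;
    if (cold) {
        *pfn = pcp->pfn[pcp->count - 1];
    } else {
        *pfn = pcp->pfn[0];
        memmove(&pcp->pfn[0], &pcp->pfn[1],
                (pcp->count - 1) * sizeof(pcp->pfn[0]));
    }
    pcp->count--;
    return true;
}

static void pcp_example(void)
{
    struct toy_pcp pcp = { 0 };
    unsigned long pfn = 0;

    pcp_free(&pcp, 10, false);   /* hot: goes to the head */
    pcp_free(&pcp, 11, false);   /* hot: now ahead of 10 */
    pcp_free(&pcp, 12, true);    /* cold: goes to the tail */
    pcp_alloc(&pcp, false, &pfn);
    assert(pfn == 11);           /* most recently freed hot page */
    pcp_alloc(&pcp, true, &pfn);
    assert(pfn == 12);           /* cold page from the tail */
}
```

A hot request therefore gets the page most likely to still be in the CPU cache, while a cold request (one with __GFP_COLD set) leaves those pages for later callers.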
The function __rmqueue handles two cases:
- Fast allocation, __rmqueue_smallest: allocate directly from the free list of the specified migrate type
- Slow allocation, __rmqueue_fallback: when the list of the specified migrate type does not have enough memory, the fallback list is used.
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
static struct page *__rmqueue(struct zone *zone, unsigned int order,
int migratetype)
{
struct page *page;
// fast allocation
page = __rmqueue_smallest(zone, order, migratetype);
if (unlikely(!page))
page = __rmqueue_fallback(zone, order, migratetype);
return page;
}
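The bodies of __rmqueue_smallest and __rmqueue_fallback are not shown in this article, so the following is only a sketch of the general idea under simplifying assumptions: free blocks are tracked as per-order counters, the smallest sufficient block is taken and split down to the requested order, and a second counter set stands in for the fallback migrate type. All names are hypothetical, and the real fallback path is more involved than simply repeating the search on another list.

```c
#include <assert.h>

#define MAX_ORDER 11

/* nr_free[o] counts free blocks of 2^o pages; a stand-in for free_area[]. */
struct toy_free_area {
    unsigned long nr_free[MAX_ORDER];
};

/* Returns the order of the block that was taken, or -1 if none fits. */
static int toy_rmqueue_smallest(struct toy_free_area *fa, int order)
{
    assert(order >= 0 && order < MAX_ORDER);
    for (int cur = order; cur < MAX_ORDER; cur++) {
        if (fa->nr_free[cur] == 0)
            continue;
        fa->nr_free[cur]--;
        /* Split: each halving returns one buddy to the next lower list. */
        for (int o = cur; o > order; o--)
            fa->nr_free[o - 1]++;
        return cur;
    }
    return -1;
}

/* Try the preferred lists first, then the fallback lists. */
static int toy_rmqueue(struct toy_free_area *preferred,
                       struct toy_free_area *fallback, int order)
{
    int got = toy_rmqueue_smallest(preferred, order);

    if (got < 0)
        got = toy_rmqueue_smallest(fallback, order);
    return got;
}

static void toy_rmqueue_example(void)
{
    struct toy_free_area preferred = { .nr_free = { [3] = 1 } };
    struct toy_free_area fallback = { .nr_free = { [4] = 1 } };

    /* One page is carved out of the single order-3 block. */
    int got = toy_rmqueue(&preferred, &fallback, 0);
    assert(got == 3);
    assert(preferred.nr_free[0] == 1 && preferred.nr_free[1] == 1 &&
           preferred.nr_free[2] == 1 && preferred.nr_free[3] == 0);

    /* No order-3 block is left in the preferred lists: use the fallback. */
    got = toy_rmqueue(&preferred, &fallback, 3);
    assert(got == 4);
    assert(fallback.nr_free[3] == 1 && fallback.nr_free[4] == 0);
}
```

After the first request the order-3 block has become one allocated page plus free blocks of 1, 2 and 4 pages, which accounts for all 8 pages.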
At this point the allocation is complete. If the zone_watermark_ok watermark check fails, execution continues downward.
Reaching this point means that pages of the zone should be reclaimed
That is, when the watermark check fails, not enough pages are available, so a certain amount of memory reclaim is performed here.
ret = zone_reclaim(zone, gfp_mask, order);
zone_reclaim performs some page reclaim
True is returned only when 2^order page frames have been reclaimed. If some page frames were reclaimed but fewer than that, false is returned.
int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
int node_id;
int ret;
// both are below their configured minimums
if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
return ZONE_RECLAIM_FULL;
if (zone_is_all_unreclaimable(zone)) // the zone is flagged as unreclaimable
return ZONE_RECLAIM_FULL;
// if __GFP_WAIT is not set, that is, wait is 0, do not reclaim here
// if PF_MEMALLOC is set, the caller is itself the memory reclaim path, so do not reclaim here either
if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
return ZONE_RECLAIM_NOSCAN;
node_id = zone_to_nid(zone);// get the node id of this zone
// the zone belongs to another node that has its own CPUs
if (node_state(node_id, N_CPU) && node_id != numa_node_id())
return ZONE_RECLAIM_NOSCAN;
// another process is already reclaiming this zone
if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
return ZONE_RECLAIM_NOSCAN;
ret = __zone_reclaim(zone, gfp_mask, order);// reclaim pages of this zone
zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);// release the reclaim lock
if (!ret)
count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
return ret;
}
Note PF_MEMALLOC and __GFP_WAIT here. PF_MEMALLOC is a process flag bit; code outside the memory management subsystem should generally not use it.
Next, __zone_reclaim is called
static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
/* Minimum pages needed in order to stay on node */
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
int priority;
struct scan_control sc = {
// controls how the scan behaves
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
.swap_cluster_max = max_t(unsigned long, nr_pages,
SWAP_CLUSTER_MAX),
.gfp_mask = gfp_mask,
.swappiness = vm_swappiness,
.order = order,
.isolate_pages = isolate_pages_global,
};
unsigned long slab_reclaimable;
disable_swap_token();
cond_resched();
/*
* We need to be able to allocate from the reserves for RECLAIM_SWAP
* and we also need to be able to write out pages for RECLAIM_WRITE
* and RECLAIM_SWAP.
*/
p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
priority = ZONE_RECLAIM_PRIORITY;
do {
note_zone_scanning_priority(zone, priority);
shrink_zone(priority, zone, &sc); // reclaim memory
priority--;
} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
}
slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (slab_reclaimable > zone->min_slab_pages) {
while (shrink_slab(sc.nr_scanned, gfp_mask, order) &&
zone_page_state(zone, NR_SLAB_RECLAIMABLE) >
slab_reclaimable - nr_pages)
;
sc.nr_reclaimed += slab_reclaimable -
zone_page_state(zone, NR_SLAB_RECLAIMABLE);
}
p->reclaim_state = NULL;
current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
return sc.nr_reclaimed >= nr_pages;
}
shrink_zone and shrink_slab will be explained later
switch (ret) {
case ZONE_RECLAIM_NOSCAN:
/* did not scan */
goto try_next_zone;
case ZONE_RECLAIM_FULL:
/* scanned but unreclaimable */
goto this_zone_full;
default:
/* did we reclaim enough */
if (!zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
goto this_zone_full;
}
Ideally, start allocating memory
try_this_zone:
page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
if (page)
break;
The zone is considered full
this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);
Continue to the next zone
try_next_zone:
if (NUMA_BUILD && !did_zlc_setup) {
/* we do zlc_setup after the first zone is tried */
allowednodes = zlc_setup(zonelist, alloc_flags);
zlc_active = 1;
did_zlc_setup = 1;
}
}
Scan the zonelist once more
if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
/* Disable zlc cache for second zonelist scan */
zlc_active = 0;
goto zonelist_scan;
}
Slow memory allocation
Slow memory allocation: if fast memory allocation fails, that is, no zone in the zonelist yields memory on the fast path, the min threshold is used for slow allocation.
- Asynchronous memory compaction
- Direct memory reclaim
- Light synchronous memory compaction (then oom allocation)
Wake up the kswapd kernel thread of each node
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
wakeup_kswapd(zone, order); // wake the kswapd kernel thread of each node
/*
* A zone is low on free memory, so wake its kswapd task to service it.
*/
void wakeup_kswapd(struct zone *zone, int order)
{
pg_data_t *pgdat;
if (!populated_zone(zone))
return;
pgdat = zone->zone_pgdat;
// check the watermark
if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
return;
if (pgdat->kswapd_max_order < order)
pgdat->kswapd_max_order = order;
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) // is the zone allowed by the cpuset?
return;
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
wake_up_interruptible(&pgdat->kswapd_wait);
}
Lower the requirement and retry the fast memory allocation using the min threshold as the criterion
alloc_flags = ALLOC_WMARK_MIN;
if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
alloc_flags |= ALLOC_HARDER;
if (gfp_mask & __GFP_HIGH)
alloc_flags |= ALLOC_HIGH;
if (wait)
alloc_flags |= ALLOC_CPUSET;
Several macro definitions appear here:
- ALLOC_HARDER: try harder to allocate memory
- ALLOC_HIGH: the caller set __GFP_HIGH, that is, high priority
- ALLOC_CPUSET: check whether the cpuset allows allocating memory pages
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags);
if (page)
goto got_pg;
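To see what these flags buy the caller, the sketch below rebuilds the flag selection above from plain booleans and applies the ALLOC_HIGH and ALLOC_HARDER reductions that zone_watermark_ok performs on the threshold. The function names, the boolean parameters and the bit values are assumptions made for this example; only the logic follows the code quoted in this article.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values chosen for this sketch; only their distinctness matters. */
#define ALLOC_WMARK_MIN 0x02
#define ALLOC_HARDER    0x10
#define ALLOC_HIGH      0x20
#define ALLOC_CPUSET    0x40

/* Mirrors the slow-path flag selection; the inputs are plain booleans. */
static unsigned int slow_path_flags(bool rt_task, bool in_irq,
                                    bool can_wait, bool gfp_high)
{
    unsigned int flags = ALLOC_WMARK_MIN;

    if ((rt_task && !in_irq) || !can_wait)
        flags |= ALLOC_HARDER;
    if (gfp_high)
        flags |= ALLOC_HIGH;
    if (can_wait)
        flags |= ALLOC_CPUSET;
    return flags;
}

/* The threshold zone_watermark_ok actually compares against. */
static long effective_min(long mark, unsigned int flags)
{
    long min = mark;

    assert(mark >= 0);
    if (flags & ALLOC_HIGH)
        min -= min / 2;
    if (flags & ALLOC_HARDER)
        min -= min / 4;
    return min;
}

static void slow_path_flags_example(void)
{
    /* A caller that cannot sleep and has __GFP_HIGH set. */
    unsigned int atomic = slow_path_flags(false, false, false, true);
    /* An ordinary caller that may sleep. */
    unsigned int normal = slow_path_flags(false, false, true, false);

    assert(atomic == (ALLOC_WMARK_MIN | ALLOC_HARDER | ALLOC_HIGH));
    assert(normal == (ALLOC_WMARK_MIN | ALLOC_CPUSET));
    assert(effective_min(4189, atomic) == 1572);
    assert(effective_min(4189, normal) == 4189);
}
```

With the min value of 4189 from the zoneinfo output earlier, the non-sleeping high-priority caller is checked against 1572 pages instead of 4189, so it may dig considerably deeper into the zone.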
There is a saying here called "Five Swords". The first sword is swung casually: it allocates against the low threshold, that is, it calls get_page_from_freelist() directly. The second sword, used here, allocates against the min threshold and adds the ALLOC_WMARK_MIN, ALLOC_HARDER and ALLOC_HIGH flags to the allocation.
Here the call is made without checking the watermark at all: alloc_flags is set to ALLOC_NO_WATERMARKS
rebalance:
if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
&& !in_interrupt()) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS); // allocate without checking watermarks
if (page)
goto got_pg;
if (gfp_mask & __GFP_NOFAIL) {
congestion_wait(WRITE, HZ/50);
goto nofail_alloc;
}
}
goto nopage;
}
try_to_free_pages
Memory pages are obtained by freeing memory synchronously. The main function is try_to_free_pages
cpuset_update_task_memory_state();
p->flags |= PF_MEMALLOC;
lockdep_set_current_reclaim_state(gfp_mask);
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
did_some_progress = try_to_free_pages(zonelist, order,
gfp_mask, nodemask);
p->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
p->flags &= ~PF_MEMALLOC;
The function body is
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
struct scan_control sc = {
// scan control structure
.gfp_mask = gfp_mask,
.may_writepage = !laptop_mode,
.swap_cluster_max = SWAP_CLUSTER_MAX,
.may_unmap = 1,
.may_swap = 1,
.swappiness = vm_swappiness,
.order = order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
.nodemask = nodemask,
};
return do_try_to_free_pages(zonelist, &sc);
}
One conceptual point needs clarifying. The first three attempts all try to allocate memory, and if memory is insufficient they simply take page frames from other nodes (zone_reclaim does reclaim some memory, but in essence memory is still sought on other nodes). From here on, memory is sought directly on this node.
do_try_to_free_pages
It is filled with shrink_zone and shrink_slab calls and is the main logic function for reclaiming pages.
static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
struct scan_control *sc)
{
int priority;
unsigned long ret = 0;
unsigned long total_scanned = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long lru_pages = 0;
struct zoneref *z;
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
delayacct_freepages_start();
if (scanning_global_lru(sc))
count_vm_event(ALLOCSTALL);
/*
* mem_cgroup will not do shrink_slab.
*/
if (scanning_global_lru(sc)) {
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
lru_pages += zone_lru_pages(zone);
}
}
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
sc->nr_scanned = 0;
if (!priority)
disable_swap_token();
shrink_zones(priority, zonelist, sc);
/*
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
*/
if (scanning_global_lru(sc)) {
shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages);
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
}
}
total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->swap_cluster_max) {
ret = sc->nr_reclaimed;
goto out;
}
/*
* Try to write back as many pages as we just scanned. This
* tends to cause slow streaming writers to write data to the
* disk smoothly, at the dirtying rate, which is nice. But
* that's undesirable in laptop mode, where we *want* lumpy
* writeout. So in laptop mode, write out the whole world.
*/
if (total_scanned > sc->swap_cluster_max +
sc->swap_cluster_max / 2) {
wakeup_pdflush(laptop_mode ? 0 : total_scanned);
sc->may_writepage = 1;
}
/* Take a nap, wait for some writeback to complete */
if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
congestion_wait(WRITE, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
ret = sc->nr_reclaimed;
out:
/*
* Now that we've scanned all the zones at this priority level, note
* that level within the zone so that the next thread which performs
* scanning of this zone will immediately start out at this priority
* level. This affects only the decision whether or not to bring
* mapped pages onto the inactive list.
*/
if (priority < 0)
priority = 0;
if (scanning_global_lru(sc)) {
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
zone->prev_priority = priority;
}
} else
mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
delayacct_freepages_end();
return ret;
}
if (likely(did_some_progress)) {
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, alloc_flags); // allocate again after reclaim
if (page)
goto got_pg;
todo: too complicated, we'll look at it later
The last resort is the oom mechanism: if no page can be allocated at all, a process is killed to free the pages it occupies (somewhat cruel). This is the so-called out of memory killer mechanism.
} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
if (!try_set_zone_oom(zonelist, gfp_mask)) {
schedule_timeout_uninterruptible(1);
goto restart;
}
/*
* Go through the zonelist yet one more time, keep
* very high watermark here, this is only to catch
* a parallel oom killing, we must fail if we're still
* under heavy pressure.
*/
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
order, zonelist, high_zoneidx,
ALLOC_WMARK_HIGH|ALLOC_CPUSET); // a feint: with ALLOC_WMARK_HIGH as the requirement this clearly cannot succeed
if (page) {
clear_zonelist_oom(zonelist, gfp_mask);
goto got_pg;
}
/* The OOM killer will not help higher order allocs so fail */
if (order > PAGE_ALLOC_COSTLY_ORDER) {
clear_zonelist_oom(zonelist, gfp_mask);
goto nopage;
}
out_of_memory(zonelist, gfp_mask, order); // free a process's memory
clear_zonelist_oom(zonelist, gfp_mask);
goto restart;
Scan_Control
The scan control structure mainly stores the variables and parameters for one memory reclaim or memory compaction pass; some processing results are also saved in it.
It is mainly used in memory reclaim and memory compaction
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
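unsigned long nr_scanned; // number of page frames scanned so far
/* Number of pages freed so far during a call to shrink_zones() */
unsigned long nr_reclaimed; // number of page frames reclaimed so far
/* This context's GFP mask */
gfp_t gfp_mask; // allocation flags used by the memory request
int may_writepage; // whether writeback may be performed
/* Can mapped pages be reclaimed? */
int may_unmap; // whether unmap may be performed, that is, clearing every page table entry that maps the page
/* Can pages be swapped as part of reclaim? */
int may_swap; // whether swapping may be performed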
/* This context's SWAP_CLUSTER_MAX. If freeing memory for
* suspend, we effectively ignore SWAP_CLUSTER_MAX.
* In this context, it doesn't matter that we scan the
* whole list at once. */
int swap_cluster_max;
int swappiness;
int all_unreclaimable;
int order; // the order of the memory request; scanning only happens when a request finds memory insufficient
/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup; // the target memcg; NULL when the whole zone is targeted
/*
* Nodemask of nodes allowed by the caller. If NULL, all nodes
* are scanned.
*/
nodemask_t *nodemask; // mask of the nodes that may be scanned
/* Pluggable isolate pages callback */
unsigned long (*isolate_pages)(unsigned long nr, struct list_head *dst,
unsigned long *scanned, int order, int mode,
struct zone *z, struct mem_cgroup *mem_cont,
int active, int file);
};
Memory compression technology: normally, when memory is tight, frequently writing memory data back to disk through I/O not only shortens flash lifespan but also seriously hurts system performance. Memory compression was introduced for this reason; the mainstream kinds are as follows:
- zSwap: swap space; generally compresses anonymous pages
- zRam: uses memory to simulate a block device; generally compresses anonymous pages
- zCache: generally compresses file pages
Anonymous pages are pages that are not associated with files, such as the heap and stack of a process. They cannot be exchanged with disk files, but they can be swapped out by setting aside an additional swap partition on the hard disk or by using a swap file.
Regarding active and inactive pages: a page is generally judged active by whether applications in the system access it frequently. If the page's accessed flag is not set, the page is inactive and should be moved to the inactive list; if the flag is set, the page has been accessed recently and should be moved to the active list.
As time goes by, the least active pages end up at the tail of the inactive list. When memory is insufficient, the kernel swaps these pages out. Because they have rarely been used between their creation and the moment they are swapped out, swapping them out does the least harm to the system according to the LRU principle.
Memory Shrink
Memory reclaim generally means reclaiming memory from a zone; it may also mean reclaiming from a memcg.
- Each pass reclaims 2^(order+1) page frames, enough to satisfy this memory allocation, and tries to reclaim as many page frames as possible. If the inactive lru lists hold fewer pages than this target, this check is dropped.
- Zone memory reclaim goes together with zone memory compaction, so zone reclaim keeps freeing page frames until memory compaction can be satisfied.
From the reclaim code above, there are three main functions: shrink_zone, shrink_list and shrink_slab
shrink_zone
static void shrink_zone(int priority, struct zone *zone,
struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
unsigned long percent[2]; /* anon @ 0; file @ 1 */
enum lru_list l;
unsigned long nr_reclaimed = sc->nr_reclaimed;
unsigned long swap_cluster_max = sc->swap_cluster_max;
get_scan_ratio
get_scan_ratio(zone, sc, percent);
Generally, when physical memory runs short, there are two options:
- Swap some anonymous pages out to a swap partition
- Write the data in the page cache back to disk, or drop it directly
Between these two methods, the weight given to swap is set by swappiness
/*
* With swappiness at 100, anonymous and file have the same priority.
* This scanning priority is essentially the inverse of IO cost.
*/
anon_prio = sc->swappiness;
file_prio = 200 - sc->swappiness;
First, check whether swap is completely disabled:
static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
                           unsigned long *percent)
{
    unsigned long anon, file, free;
    unsigned long anon_prio, file_prio;
    unsigned long ap, fp;
    struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

    /* If we have no swap space, do not bother scanning anon pages. */
    if (!sc->may_swap || (nr_swap_pages <= 0)) {
        percent[0] = 0;
        percent[1] = 100;
        return;
    }
Count the number of anonymous pages and page cache pages:
anon = zone_nr_pages(zone, sc, LRU_ACTIVE_ANON) +
       zone_nr_pages(zone, sc, LRU_INACTIVE_ANON);
file = zone_nr_pages(zone, sc, LRU_ACTIVE_FILE) +
       zone_nr_pages(zone, sc, LRU_INACTIVE_FILE);
If the number of free pages (free) plus the number of page cache pages is no greater than the high threshold, scanning is directed entirely at anonymous pages, that is, everything goes to swap:
if (scanning_global_lru(sc)) {
    free = zone_page_state(zone, NR_FREE_PAGES);
    /* If we have very few page cache pages,
       force-scan anon pages. */
    if (unlikely(file + free <= zone->pages_high)) {
        percent[0] = 100;
        percent[1] = 0;
        return;
    }
}
Calculate the proportions:
anon_prio = sc->swappiness;
file_prio = 200 - sc->swappiness;
/*
* The amount of pressure on anon vs file pages is inversely
* proportional to the fraction of recently scanned pages on
* each list that were recently referenced and in active use.
*/
ap = (anon_prio + 1) * (reclaim_stat->recent_scanned[0] + 1);
ap /= reclaim_stat->recent_rotated[0] + 1;
fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
/* Normalize to percentages */
percent[0] = 100 * ap / (ap + fp + 1);
percent[1] = 100 - percent[0];
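To get a feel for the numbers, the percentage calculation above can be reproduced in user space. The sketch below copies only the arithmetic from the excerpt; struct demo_reclaim_stat and demo_scan_ratio are hypothetical stand-ins introduced here, and the early-return cases (no swap, very little page cache) are left out.

```c
/* Hypothetical stand-in for the counters kept in zone_reclaim_stat. */
struct demo_reclaim_stat {
    unsigned long recent_scanned[2];  /* anon @ 0; file @ 1 */
    unsigned long recent_rotated[2];  /* anon @ 0; file @ 1 */
};

/* Same arithmetic as the excerpt: fills percent[0] (anon) and percent[1] (file). */
void demo_scan_ratio(unsigned long swappiness,
                     const struct demo_reclaim_stat *rs,
                     unsigned long percent[2])
{
    unsigned long anon_prio = swappiness;
    unsigned long file_prio = 200 - swappiness;
    unsigned long ap, fp;

    /* Pressure grows with pages scanned and shrinks with pages rotated. */
    ap = (anon_prio + 1) * (rs->recent_scanned[0] + 1);
    ap /= rs->recent_rotated[0] + 1;
    fp = (file_prio + 1) * (rs->recent_scanned[1] + 1);
    fp /= rs->recent_rotated[1] + 1;

    /* Normalize to percentages. */
    percent[0] = 100 * ap / (ap + fp + 1);
    percent[1] = 100 - percent[0];
}
```

With all counters at zero and swappiness set to 60, this yields 30% for anonymous pages and 70% for file pages. With swappiness at 100 the split is 49% and 51%: the two priorities are equal, and the extra 1 in the denominator rounds the anonymous share down.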
Back in shrink_zone, an intermediate step works out how many pages to scan on each lru linked list:
for_each_evictable_lru(l) {
    int file = is_file_lru(l);
    unsigned long scan;

    scan = zone_nr_pages(zone, sc, l);
    if (priority) {
        scan >>= priority;
        scan = (scan * percent[file]) / 100;
    }
    if (scanning_global_lru(sc)) {
        zone->lru[l].nr_scan += scan;
        nr[l] = zone->lru[l].nr_scan;
        if (nr[l] >= swap_cluster_max)
            zone->lru[l].nr_scan = 0;
        else
            nr[l] = 0;
    } else
        nr[l] = scan;
}
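The two steps in this loop, scaling the list size by priority and percentage and then accumulating small amounts until a full batch is available, can be tried out in isolation. The following is a user-space sketch of the global-lru branch only; demo_scan_count and demo_accumulate are hypothetical helper names, not kernel functions.

```c
/* Pages to scan on one list: shrink by priority, then apply the percentage. */
unsigned long demo_scan_count(unsigned long lru_size, int priority,
                              unsigned long percent)
{
    unsigned long scan = lru_size;

    if (priority) {
        scan >>= priority;
        scan = (scan * percent) / 100;
    }
    return scan;
}

/*
 * Accumulate scan requests in *nr_scan. Nothing is handed out until at
 * least swap_cluster_max pages have piled up; then the whole amount is
 * returned and the counter is reset.
 */
unsigned long demo_accumulate(unsigned long *nr_scan, unsigned long scan,
                              unsigned long swap_cluster_max)
{
    unsigned long batch;

    *nr_scan += scan;
    if (*nr_scan < swap_cluster_max)
        return 0;
    batch = *nr_scan;
    *nr_scan = 0;
    return batch;
}
```

For a list of 1048576 pages at priority 12 with a 30% share, demo_scan_count returns 76. A priority of 0 skips both the shift and the percentage, so the whole list is scanned.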
The macro for_each_evictable_lru expands to
for (l = 0; l <= LRU_ACTIVE_FILE; l++)
The variable l is of type enum lru_list:
enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE,
    LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
    LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
#ifdef CONFIG_UNEVICTABLE_LRU
    LRU_UNEVICTABLE,
#else
    LRU_UNEVICTABLE = LRU_ACTIVE_FILE, /* avoid compiler errors in dead code */
#endif
    NR_LRU_LISTS
};
As can be seen, LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE is the last of the reclaimable linked lists.
Therefore, the loop walks the four lru linked lists from LRU_INACTIVE_ANON to LRU_ACTIVE_FILE, which covers every reclaimable linked list.
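The enum arithmetic is easy to verify by hand. The sketch below assumes LRU_BASE, LRU_ACTIVE and LRU_FILE are 0, 1 and 2, as in kernels of this era; the DEMO_ names and both helpers are hypothetical and only mirror the definitions quoted above.

```c
/* Assumed values of LRU_BASE, LRU_ACTIVE and LRU_FILE (an assumption of this sketch). */
#define DEMO_LRU_BASE   0
#define DEMO_LRU_ACTIVE 1
#define DEMO_LRU_FILE   2

enum demo_lru_list {
    DEMO_LRU_INACTIVE_ANON = DEMO_LRU_BASE,
    DEMO_LRU_ACTIVE_ANON   = DEMO_LRU_BASE + DEMO_LRU_ACTIVE,
    DEMO_LRU_INACTIVE_FILE = DEMO_LRU_BASE + DEMO_LRU_FILE,
    DEMO_LRU_ACTIVE_FILE   = DEMO_LRU_BASE + DEMO_LRU_FILE + DEMO_LRU_ACTIVE
};

/* Counts how many lists a loop of the for_each_evictable_lru shape visits. */
int demo_count_evictable_lists(void)
{
    int count = 0;
    int l;

    for (l = 0; l <= DEMO_LRU_ACTIVE_FILE; l++)
        count++;
    return count;
}

/* Nonzero when the index names one of the two file-backed lists. */
int demo_is_file_lru(int l)
{
    return l == DEMO_LRU_INACTIVE_FILE || l == DEMO_LRU_ACTIVE_FILE;
}
```

Under these assumptions the last evictable index is 3, so the loop visits four lists: inactive anon, active anon, inactive file and active file.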
shrink_list
Traverse the lru linked lists in this order: LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, LRU_INACTIVE_FILE, LRU_ACTIVE_FILE
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
       nr[LRU_INACTIVE_FILE]) {
    for_each_evictable_lru(l) {
        if (nr[l]) {
            nr_to_scan = min(nr[l], swap_cluster_max);
            nr[l] -= nr_to_scan;
            // reclaim pages from the linked list of this lru type
            nr_reclaimed += shrink_list(l, nr_to_scan,
                                        zone, sc, priority);
        }
    }
    /*
     * On large memory systems, scan >> priority can become
     * really large. This is fine for the starting priority;
     * we want to put equal scanning pressure on each zone.
     * However, if the VM has a harder time of freeing pages,
     * with multiple processes reclaiming pages, the total
     * freeing target can get unreasonably large.
     */
    if (nr_reclaimed > swap_cluster_max &&
        priority < DEF_PRIORITY && !current_is_kswapd())
        break;
}
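The batching behaviour of this loop is easier to see with concrete numbers. Below is a user-space sketch that keeps only the chunking logic: it counts how many times shrink_list would be called and leaves out the early break. demo_count_batches and the fixed index layout (0 inactive anon, 1 active anon, 2 inactive file, 3 active file) are assumptions of this sketch.

```c
#define DEMO_NR_EVICTABLE 4

/*
 * Walks the four per-list counters in chunks of at most swap_cluster_max
 * and returns the number of chunks processed. As in the excerpt, the outer
 * loop condition does not look at the active anon counter (index 1).
 */
unsigned long demo_count_batches(unsigned long nr[DEMO_NR_EVICTABLE],
                                 unsigned long swap_cluster_max)
{
    unsigned long calls = 0;
    int l;

    while (nr[0] || nr[3] || nr[2]) {
        for (l = 0; l < DEMO_NR_EVICTABLE; l++) {
            if (nr[l]) {
                unsigned long nr_to_scan =
                    nr[l] < swap_cluster_max ? nr[l] : swap_cluster_max;

                nr[l] -= nr_to_scan;
                calls++;  /* one shrink_list call in the real loop */
            }
        }
    }
    return calls;
}
```

Starting from counters {64, 100, 0, 32} with a batch size of 32, the loop runs two rounds and processes five chunks. It then stops with 36 pages still pending on the active anon list, because that counter is not part of the loop condition.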
Remember how each directory entry was stored in shrinker_list during vfs_cache_init? This is where memory reclaim is actually performed. The main cases are:
- LRU_ACTIVE_FILE: active file pages; call shrink_active_list to process the active lru linked list
- LRU_ACTIVE_ANON: active anonymous pages; call shrink_active_list to process the active lru linked list, but only when there are too few inactive anonymous pages
- All other cases: call shrink_inactive_list to process the inactive lru linked list
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
        struct zone *zone, struct scan_control *sc, int priority)
{
    int file = is_file_lru(lru);

    if (lru == LRU_ACTIVE_FILE) {
        shrink_active_list(nr_to_scan, zone, sc, priority, file);
        return 0;
    }
    if (lru == LRU_ACTIVE_ANON && inactive_anon_is_low(zone, sc)) {
        shrink_active_list(nr_to_scan, zone, sc, priority, file);
        return 0;
    }
    return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
}
Record the number of pages reclaimed:
sc->nr_reclaimed = nr_reclaimed;
When there are too few inactive anonymous pages, call shrink_active_list:
/*
 * Even if we did not try to evict anon pages at all, we want to
 * rebalance the anon lru active/inactive ratio.
 */
if (inactive_anon_is_low(zone, sc))
    shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);
If too many dirty pages are being written back, sleep here for a while:
throttle_vm_writeout(sc->gfp_mask);
shrink_active_list
Let us pick one of these functions and look at it in detail:
static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
            struct scan_control *sc, int priority, int file)
{
    unsigned long pgmoved;
    int pgdeactivate = 0;
    unsigned long pgscanned;
    LIST_HEAD(l_hold);  /* The pages which were snipped off */
    LIST_HEAD(l_inactive);
    struct page *page;
    struct pagevec pvec;
    enum lru_list lru;
    struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

    lru_add_drain();
    spin_lock_irq(&zone->lru_lock);
    pgmoved = sc->isolate_pages(nr_pages, &l_hold, &pgscanned, sc->order,
                    ISOLATE_ACTIVE, zone,
                    sc->mem_cgroup, 1, file);
    /*
     * zone->pages_scanned is used for detect zone's oom
     * mem_cgroup remembers nr_scan by itself.
     */
    if (scanning_global_lru(sc)) {
        zone->pages_scanned += pgscanned;
    }
    reclaim_stat->recent_scanned[!!file] += pgmoved;
    if (file)
        __mod_zone_page_state(zone, NR_ACTIVE_FILE, -pgmoved);
    else
        __mod_zone_page_state(zone, NR_ACTIVE_ANON, -pgmoved);
    spin_unlock_irq(&zone->lru_lock);

    pgmoved = 0;
    while (!list_empty(&l_hold)) {
        cond_resched();
        page = lru_to_page(&l_hold);
        list_del(&page->lru);
        if (unlikely(!page_evictable(page, NULL))) {
            putback_lru_page(page);
            continue;
        }
        /* page_referenced clears PageReferenced */
        if (page_mapping_inuse(page) &&
            page_referenced(page, 0, sc->mem_cgroup))
            pgmoved++;
        list_add(&page->lru, &l_inactive);
    }

    /*
     * Move the pages to the [file or anon] inactive list.
     */
    pagevec_init(&pvec, 1);
    lru = LRU_BASE + file * LRU_FILE;
    spin_lock_irq(&zone->lru_lock);
    /*
     * Count referenced pages from currently used mappings as
     * rotated, even though they are moved to the inactive list.
     * This helps balance scan pressure between file and anonymous
     * pages in get_scan_ratio.
     */
    reclaim_stat->recent_rotated[!!file] += pgmoved;

    pgmoved = 0;
    while (!list_empty(&l_inactive)) {
        page = lru_to_page(&l_inactive);
        prefetchw_prev_lru_page(page, &l_inactive, flags);
        VM_BUG_ON(PageLRU(page));
        SetPageLRU(page);
        VM_BUG_ON(!PageActive(page));
        ClearPageActive(page);
        list_move(&page->lru, &zone->lru[lru].list);
        mem_cgroup_add_lru_list(page, lru);
        pgmoved++;
        if (!pagevec_add(&pvec, page)) {
            __mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
            spin_unlock_irq(&zone->lru_lock);
            pgdeactivate += pgmoved;
            pgmoved = 0;
            if (buffer_heads_over_limit)
                pagevec_strip(&pvec);
            __pagevec_release(&pvec);
            spin_lock_irq(&zone->lru_lock);
        }
    }
    __mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
    pgdeactivate += pgmoved;
    __count_zone_vm_events(PGREFILL, zone, pgscanned);
    __count_vm_events(PGDEACTIVATE, pgdeactivate);
    spin_unlock_irq(&zone->lru_lock);
    if (buffer_heads_over_limit)
        pagevec_strip(&pvec);
    pagevec_release(&pvec);
}
Fast memory reclamation
In the get_page_from_freelist() function, each zone is checked before allocation while the zonelist is traversed. If the zone's free memory after the allocation would be less than the threshold plus the number of reserved page frames, fast memory reclaim is performed on that zone.
The threshold may be any of min/low/high.
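The condition described above can be written down as a short predicate. This is a sketch of the rule as stated in the text, not the kernel's actual watermark check, which performs additional per-order tests; demo_needs_fast_reclaim and its parameters are hypothetical names.

```c
#include <stdbool.h>

/*
 * Returns true when a zone would fall below "threshold + reserved pages"
 * after handing out 2^order page frames, i.e. when fast reclaim should be
 * attempted on that zone first.
 */
bool demo_needs_fast_reclaim(unsigned long free_pages, unsigned int order,
                             unsigned long mark, unsigned long reserve)
{
    unsigned long request = 1UL << order;

    /* Not even enough free pages for the request itself. */
    if (free_pages < request)
        return true;
    return free_pages - request < mark + reserve;
}
```

With 1000 free pages, a threshold of 128 and 32 reserved pages, a single-page request leaves 999 pages and no reclaim is needed. With only 150 free pages the same request leaves 149, which is below 160, so fast reclaim is triggered.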
Direct memory reclamation
Direct memory reclaim occurs in slow allocation. In slow allocation, the kswapd kernel threads of all node s are woken up first, and then get_page_from_freelist is called to try to obtain contiguous page frames from the zone s in the zonelist using the min threshold. If that fails, asynchronous compaction is performed on the zone s in the zonelist. After asynchronous compaction, get_page_from_freelist is called again with the min threshold, and if it still fails, direct memory reclaim is performed.
kswapd memory reclamation
kswapd->balance_pgdat()->kswapd_shrink_zone()->shrink_zone()
During allocation, whenever the get_page_from_freelist() function cannot obtain contiguous page frames from the zone s in the zonelist using the low threshold, and the allocation flags in gfp_mask do not include __GFP_NO_KSWAPD, the kswapd kernel thread is woken up and memory reclaim is performed in kswapd.
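The wake-up condition boils down to two checks. The sketch below only illustrates that condition: DEMO_GFP_NO_KSWAPD uses a made-up bit value rather than the kernel's definition of __GFP_NO_KSWAPD, and demo_should_wake_kswapd is a hypothetical helper.

```c
#include <stdbool.h>

/* Made-up bit for this sketch; the kernel defines its own value for __GFP_NO_KSWAPD. */
#define DEMO_GFP_NO_KSWAPD 0x1u

/*
 * kswapd is woken only if the allocation at the low threshold failed
 * and the caller did not opt out through its allocation flags.
 */
bool demo_should_wake_kswapd(bool low_threshold_alloc_failed,
                             unsigned int gfp_mask)
{
    return low_threshold_alloc_failed &&
           !(gfp_mask & DEMO_GFP_NO_KSWAPD);
}
```

A failed allocation with no flags set wakes kswapd; the same failure with the opt-out bit set does not, and a successful allocation never does.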