Linux kernel memory management and the buddy system

1. Linux kernel architecture diagram

(figure: Linux kernel architecture diagram)

2. Virtual memory address space layout

2.1. User space

An application requests memory with malloc() and releases it with free(). malloc()/free() is the interface provided by ptmalloc, the memory allocator in the glibc library. ptmalloc uses the brk/mmap system calls to request memory from the kernel in units of pages, and then divides those pages into small blocks that it hands out to the application.
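As a quick illustration of this user-space interface (a minimal sketch; the brk/mmap behaviour noted in the comments is ptmalloc's typical strategy, not a guarantee):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	/* small request: ptmalloc typically serves this from the heap (brk) */
	char *small = malloc(64);
	/* large request: ptmalloc typically falls back to mmap() */
	char *large = malloc(1024 * 1024);

	if (!small || !large) {
		perror("malloc");
		return 1;
	}

	strcpy(small, "hello");
	printf("%s: small=%p large=%p\n", small, (void *)small, (void *)large);

	free(small);	/* returned to ptmalloc's free lists */
	free(large);	/* large mmap'd chunks may be unmapped immediately */
	return 0;
}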

2.2. Kernel space

The basic functions of kernel space: virtual memory management is responsible for allocating virtual pages from the process's virtual address space; sys_brk expands or shrinks the heap, sys_mmap allocates virtual pages in the memory-mapping region, and sys_munmap releases virtual pages.
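These system calls are reachable from user space through the corresponding libc wrappers; below is a minimal sketch that maps and releases one anonymous page with mmap()/munmap():

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;	/* one page */

	/* ask the kernel for an anonymous, private mapping (sys_mmap) */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	((char *)p)[0] = 'x';	/* first touch faults in a physical page */

	/* release the virtual pages again (sys_munmap) */
	if (munmap(p, len) == -1) {
		perror("munmap");
		return 1;
	}
	return 0;
}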

The page allocator is responsible for allocating physical pages; the page allocator currently used is the buddy allocator. On top of it, the kernel provides a block allocator that divides pages into small memory blocks, with the interface kmalloc() for allocating memory and kfree() for releasing it.

Block allocator: SLAB/SLUB/SLOB.
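A minimal kernel-module sketch of the kmalloc()/kfree() interface (a hypothetical demo module, shown only to illustrate the API):

#include <linux/module.h>
#include <linux/slab.h>

static int __init kmalloc_demo_init(void)
{
	/* the block allocator carves this out of a slab, not a whole page */
	char *buf = kmalloc(128, GFP_KERNEL);

	if (!buf)
		return -ENOMEM;

	pr_info("kmalloc_demo: got %p\n", buf);
	kfree(buf);
	return 0;
}

static void __exit kmalloc_demo_exit(void)
{
}

module_init(kmalloc_demo_init);
module_exit(kmalloc_demo_exit);
MODULE_LICENSE("GPL");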

2.3. Hardware level

The processor contains a component called the Memory Management Unit (MMU), which is responsible for translating virtual addresses into physical addresses. The memory management unit contains a component called the Translation Lookaside Buffer (TLB), which saves the most recently used page table mapping, avoiding the need to query the page table in memory every time a virtual address is converted to a physical address.

2.4. Virtual address space division

Take the ARM64 processor as an example: the maximum width of a virtual address is 48 bits.

  • Kernel virtual addresses sit at the top of the 64-bit address space; the upper 16 bits are all 1, giving the range [0xFFFF 0000 0000 0000, 0xFFFF FFFF FFFF FFFF].
  • User virtual addresses sit at the bottom of the 64-bit address space; the upper 16 bits are all 0, giving the range [0x0000 0000 0000 0000, 0x0000 FFFF FFFF FFFF] (see the sketch after this list).
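Under the full 48-bit split described above, an address can be classified simply by its upper 16 bits (a user-space sketch with hypothetical sample values):

#include <stdbool.h>
#include <stdint.h>

/* sketch: classify a 64-bit virtual address under the 48-bit ARM64 split */
static bool is_kernel_addr(uint64_t va)
{
	return (va >> 48) == 0xFFFF;	/* upper 16 bits all ones */
}

static bool is_user_addr(uint64_t va)
{
	return (va >> 48) == 0x0000;	/* upper 16 bits all zeros */
}

int main(void)
{
	uint64_t k = 0xFFFF000000000000ULL;	/* sample kernel address */
	uint64_t u = 0x0000007FFFFFF000ULL;	/* sample user address */

	return !(is_kernel_addr(k) && is_user_addr(u));
}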

When compiling the Linux kernel for the ARM64 architecture, you can choose the virtual address width:

  • Select a page length of 4KB, and the default virtual address width is 39 bits;
  • Select a page length of 16KB, and the default virtual address width is 47 bits;
  • Select a page length of 64KB, and the default virtual address width is 42 bits;
  • Select 48-bit virtual address.

In the ARM64 Linux kernel, the kernel virtual address has the same width as the user virtual address. All processes share the kernel virtual address space, while each process has its own independent user virtual address space; user threads in the same thread group share that user virtual address space, and kernel threads have no user virtual address space.

2.5. User virtual address space layout

The starting address of the user virtual address space of the process is 0, the length is TASK_SIZE, and each processor architecture defines its own macro TASK_SIZE. The ARM64 architecture defines the macro TASK_SIZE as follows:

  • 32-bit user-space program: the value of TASK_SIZE is TASK_SIZE_32, which is 0x100000000, i.e. 4GB.
  • 64-bit user-space program: the value of TASK_SIZE is TASK_SIZE_64, which is 2^VA_BITS bytes; VA_BITS is the number of virtual address bits selected when compiling the kernel.

(arch/arm64/include/asm/memory.h)

/*
 * PAGE_OFFSET - the virtual address of the start of the linear map, at the
 *               start of the TTBR1 address space.
 * PAGE_END - the end of the linear map, where all other kernel mappings begin.
 * KIMAGE_VADDR - the virtual address of the start of the kernel image.
 * VA_BITS - the maximum number of bits for virtual addresses.
 */
#define VA_BITS			(CONFIG_ARM64_VA_BITS)
#define _PAGE_OFFSET(va)	(-(UL(1) << (va)))
#define PAGE_OFFSET		(_PAGE_OFFSET(VA_BITS))
#define KIMAGE_VADDR		(MODULES_END)
#define BPF_JIT_REGION_START	(KASAN_SHADOW_END)
#define BPF_JIT_REGION_SIZE	(SZ_128M)
#define BPF_JIT_REGION_END	(BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
#define MODULES_END		(MODULES_VADDR + MODULES_VSIZE)
#define MODULES_VADDR		(BPF_JIT_REGION_END)
#define MODULES_VSIZE		(SZ_128M)
#define VMEMMAP_START		(-VMEMMAP_SIZE - SZ_2M)
#define PCI_IO_END		(VMEMMAP_START - SZ_2M)
#define PCI_IO_START		(PCI_IO_END - PCI_IO_SIZE)
#define FIXADDR_TOP		(PCI_IO_START - SZ_2M)

#if VA_BITS > 48
#define VA_BITS_MIN		(48)
#else
#define VA_BITS_MIN		(VA_BITS)
#endif

#define _PAGE_END(va)		(-(UL(1) << ((va) - 1)))

#define KERNEL_START		_text
#define KERNEL_END		_end

The Linux kernel uses the memory descriptor mm_struct to describe the user virtual address space of the process. The main core members are as follows:

// include/linux/mm_types.h

struct mm_struct {
	struct {
		struct vm_area_struct *mmap;	/* linked list of virtual memory areas (VMAs) */
		struct rb_root mm_rb;		/* red-black tree of VMAs */
		u64 vmacache_seqnum;                   /* per-thread vmacache */
/* find an unmapped area in the memory-mapping region */
#ifdef CONFIG_MMU
		unsigned long (*get_unmapped_area) (struct file *filp,
		unsigned long addr, unsigned long len,
		unsigned long pgoff, unsigned long flags);
#endif
		unsigned long mmap_base;	/* start address of the memory-mapping region */
		// .......
		unsigned long task_size;	/* length of the user virtual address space */
		// ......
		pgd_t *pgd;			/* points to the page global directory (the first-level page table) */
		// ......
		/**
		 * @mm_users: The number of users including userspace.
		 *
		 * Use mmget()/mmget_not_zero()/mmput() to modify. When this
		 * drops to 0 (i.e. when the task exits and there are no other
		 * temporary reference holders), we also release a reference on
		 * @mm_count (which may then free the &struct mm_struct if
		 * @mm_count also drops to 0).
		 */
		atomic_t mm_users;	/* number of processes sharing this user virtual address space */

		/**
		 * @mm_count: The number of references to &struct mm_struct
		 * (@mm_users count as 1).
		 *
		 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
		 * &struct mm_struct is freed.
		 */
		atomic_t mm_count;	/* reference count of the mm_struct itself */
                // ......
                spinlock_t arg_lock; /* protect the below fields */
		unsigned long start_code, end_code, start_data, end_data;	/* start and end of the code segment and the data segment */
		unsigned long start_brk, brk, start_stack;	/* start and current end of the heap, start of the stack */
		unsigned long arg_start, arg_end, env_start, env_end;	/* start and end of the argument strings and the environment variables */
               // ......
                /* Architecture-specific MM context */
		mm_context_t context;	/* architecture-specific memory-management context */
               // ......
        } __randomize_layout;
        /*
	 * The mm_cpumask needs to be at the end of mm_struct, because it
	 * is dynamically sized based on nr_cpu_ids.
	 */
	unsigned long cpu_bitmap[];
};

The kernel comments above also show the relationship between mm_users and mm_count: mm_users counts the users of the user virtual address space, while mm_count counts references to the mm_struct itself (all mm_users together hold a single reference).

The user virtual address space of a process contains the following areas: the code segment, data segment and uninitialized data segment of the program; the code segment, data segment and uninitialized data segment of dynamic libraries; the heap; the memory-mapping region; the stack; and the environment variables and argument strings stored at the bottom of the stack.
(figure: user virtual address space layout)
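These areas can be observed for a running process by reading /proc/self/maps; a small sketch ([heap], [stack] and library mappings appear as separate lines of the output):

#include <stdio.h>

int main(void)
{
	char line[512];
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}

	/* each line: start-end perms offset dev inode path (e.g. [heap], [stack]) */
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);

	fclose(f);
	return 0;
}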

2.6. Relationship between the process descriptor and the memory descriptor

(figure: relationship between the process descriptor task_struct and the memory descriptor mm_struct)

2.7. Kernel address space layout

The kernel address space layout of the ARM64 processor architecture is as follows:
(figure: ARM64 kernel address space layout)
It includes: user space (256TB), kernel space (256TB), module area (128MB), PCI I/O area (16MB), vmalloc area (about 123TB), and vmemmap area (4096GB).

KASAN shadow area: used by the Kernel Address Sanitizer (KASAN), a dynamic memory error detection tool.

3. SMP/NUMA Architecture

3.1. SMP

SMP (Symmetric Multi-Processing) is the symmetric multiprocessor architecture: the CPUs in a machine operate as equals, with no master/slave relationship, and share the same physical memory. SMP is also called the Uniform Memory Access (UMA) architecture.
(figure: SMP architecture)
The processors in an SMP system are peers connected to the same physical memory and the same I/O devices through a shared bus, so CPU, memory and I/O resources are all shared. This brings a drawback: poor scalability. CPU performance and availability are limited by the shared bus, so an SMP server usually performs best with two to four CPUs.

3.2. NUMA

NUMA (Non-Uniform Memory Access): memory access time depends on where the memory is located relative to the processor. Under NUMA, a processor accesses its own local memory faster than non-local memory (memory local to another processor, or memory shared between processors).
(figure: NUMA architecture)
NUMA also has drawbacks: access to resources on a remote node is slower, and with a large number of CPUs the performance gain flattens out.

NUMA servers are suited to scaling up a single system, and configurations of 2 to 4 nodes usually give the best results.

A NUMA system interconnects the memory of its nodes into a single system and can be scaled dynamically (related scale-out architectures include MPP, massively parallel processing).

4. Buddy system and algorithm

After the Linux kernel finishes initialization, the page allocator manages physical pages. The page allocator currently used is the buddy allocator, whose management algorithm is simple and efficient.

4.1. Basic buddy allocator

Contiguous physical pages are called page blocks. The order is the unit of page quantity: 2^n contiguous pages are called an n-order page block. Two n-order page blocks are buddies if they satisfy the following conditions:

  1. The two page blocks are adjacent, that is, their physical addresses are contiguous;
  2. The physical page frame number of the first page of each block is an integer multiple of 2^n;
  3. After merging into an (n+1)-order page block, the physical page frame number of the first page is an integer multiple of 2^(n+1).

The buddy allocator also allocates and releases physical pages in units of order.

Example, with single pages (order 0): page 0 and page 1 are buddies, and page 2 and page 3 are buddies; why are page 1 and page 2 not buddies? Because if page 1 and page 2 were merged into a first-order page block, the physical page frame number of its first page (1) would not be an integer multiple of 2^1.
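The buddy of an n-order page block can be computed by flipping bit n of its page frame number, which is essentially what the kernel's __find_buddy_pfn() helper does; a minimal user-space sketch:

#include <stdio.h>

/* sketch: the buddy of the n-order block starting at pfn differs only in bit n */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

int main(void)
{
	/* order 0: page 0 <-> page 1, page 2 <-> page 3, but never 1 <-> 2 */
	printf("buddy of pfn 0, order 0: %lu\n", find_buddy_pfn(0, 0));
	printf("buddy of pfn 2, order 0: %lu\n", find_buddy_pfn(2, 0));
	/* order 1: block [0,1] <-> block [2,3] */
	printf("buddy of pfn 0, order 1: %lu\n", find_buddy_pfn(0, 1));
	return 0;
}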

4.2. Zoned buddy allocator

The zone structure member free_area maintains the free page blocks, with the array index corresponding to the order of the page blocks. The member free_list of struct free_area is a linked list of free page blocks, and nr_free is the number of free page blocks. The zone member managed_pages is the number of physical pages managed by the buddy allocator.

The relevant zone data structures are as follows:

(include/linux/mmzone.h)

struct free_area {
	struct list_head	free_list[MIGRATE_TYPES];
	unsigned long		nr_free;
};

(include/linux/mmzone.h)

struct zone {	/* memory zone data structure */
	/* Read-mostly fields */

	/* zone watermarks, access with *_wmark_pages(zone) macros */
	unsigned long _watermark[NR_WMARK];	/* watermarks used by the page allocator */
	unsigned long watermark_boost;

	unsigned long nr_reserved_highatomic;

	/*
	 * We don't know if the memory that we're going to allocate will be
	 * freeable or/and it will be released eventually, so to avoid totally
	 * wasting several GB of ram we must reserve some of the lower zone
	 * memory (otherwise we risk to run OOM on the lower zones despite
	 * there being tons of freeable ram on the higher zones).  This array is
	 * recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
	 * changes.
	 */
	long lowmem_reserve[MAX_NR_ZONES];	/* pages this zone reserves and must not lend to higher zone types */

#ifdef CONFIG_NUMA
	int node;
#endif
	struct pglist_data	*zone_pgdat;
	struct per_cpu_pageset __percpu *pageset;

#ifndef CONFIG_SPARSEMEM
	/*
	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
	 * In SPARSEMEM, this map is stored in struct mem_section
	 */
	unsigned long		*pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
	unsigned long		zone_start_pfn;

	/*
	 * spanned_pages is the total pages spanned by the zone, including
	 * holes, which is calculated as:
	 * 	spanned_pages = zone_end_pfn - zone_start_pfn;
	 *
	 * present_pages is physical pages existing within the zone, which
	 * is calculated as:
	 *	present_pages = spanned_pages - absent_pages(pages in holes);
	 *
	 * managed_pages is present pages managed by the buddy system, which
	 * is calculated as (reserved_pages includes pages allocated by the
	 * bootmem allocator):
	 *	managed_pages = present_pages - reserved_pages;
	 *
	 * So present_pages may be used by memory hotplug or memory power
	 * management logic to figure out unmanaged pages by checking
	 * (present_pages - managed_pages). And managed_pages should be used
	 * by page allocator and vm scanner to calculate all kinds of watermarks
	 * and thresholds.
	 *
	 * Locking rules:
	 *
	 * zone_start_pfn and spanned_pages are protected by span_seqlock.
	 * It is a seqlock because it has to be read outside of zone->lock,
	 * and it is done in the main allocator path.  But, it is written
	 * quite infrequently.
	 *
	 * The span_seq lock is declared along with zone->lock because it is
	 * frequently read in proximity to zone->lock.  It's good to
	 * give them a chance of being in the same cacheline.
	 *
	 * Write access to present_pages at runtime should be protected by
	 * mem_hotplug_begin/end(). Any reader who can't tolerant drift of
	 * present_pages should get_online_mems() to get a stable value.
	 */
	atomic_long_t		managed_pages;
	unsigned long		spanned_pages;
	unsigned long		present_pages;

	const char		*name;

#ifdef CONFIG_MEMORY_ISOLATION
	/*
	 * Number of isolated pageblock. It is used to solve incorrect
	 * freepage counting problem due to racy retrieving migratetype
	 * of pageblock. Protected by zone->lock.
	 */
	unsigned long		nr_isolate_pageblock;
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
	/* see spanned/present_pages for more description */
	seqlock_t		span_seqlock;
#endif

	int initialized;

	/* Write-intensive fields used from the page allocator */
	ZONE_PADDING(_pad1_)

	/* free areas of different sizes */
	struct free_area	free_area[MAX_ORDER];

	/* zone flags, see below */
	unsigned long		flags;

	/* Primarily protects free_area */
	spinlock_t		lock;

	/* Write-intensive fields used by compaction and vmstats. */
	ZONE_PADDING(_pad2_)

	/*
	 * When free pages are below this point, additional steps are taken
	 * when reading the number of free pages to avoid per-cpu counter
	 * drift allowing watermarks to be breached
	 */
	unsigned long percpu_drift_mark;

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
	/* pfn where compaction free scanner should start */
	unsigned long		compact_cached_free_pfn;
	/* pfn where async and sync compaction migration scanner should start */
	unsigned long		compact_cached_migrate_pfn[2];
	unsigned long		compact_init_migrate_pfn;
	unsigned long		compact_init_free_pfn;
#endif

#ifdef CONFIG_COMPACTION
	/*
	 * On compaction failure, 1<<compact_defer_shift compactions
	 * are skipped before trying again. The number attempted since
	 * last failure is tracked with compact_considered.
	 */
	unsigned int		compact_considered;
	unsigned int		compact_defer_shift;
	int			compact_order_failed;
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
	/* Set to true when the PG_migrate_skip bits should be cleared */
	bool			compact_blockskip_flush;
#endif

	bool			contiguous;

	ZONE_PADDING(_pad3_)
	/* Zone statistics */
	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
	atomic_long_t		vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;

Zone watermark data structure: (include/linux/mmzone.h)

enum zone_watermarks {
	WMARK_MIN,
	WMARK_LOW,
	WMARK_HIGH,
	NR_WMARK
};

Under what circumstances does the preferred memory zone borrow physical pages from a fallback zone? This is best understood through the zone watermarks: each memory zone has three watermarks (a simplified check is sketched after the list below).

  • High watermark (HIGH): if the number of free pages in the zone is above the high watermark, the zone has plenty of free memory;
  • Low watermark (LOW): if the number of free pages in the zone falls below the low watermark, the zone is running slightly short of memory;
  • Minimum watermark (MIN): if the number of free pages in the zone falls below the minimum watermark, the zone is severely short of memory.
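Below is a much-simplified sketch of the idea (not the kernel's actual zone_watermark_ok() logic): a zone can satisfy a request only if doing so keeps its free pages above the chosen watermark.

#include <stdbool.h>
#include <stdio.h>

enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

struct zone_sketch {
	unsigned long free_pages;
	unsigned long watermark[NR_WMARK];
};

/* sketch: allocate from this zone only if it would stay above the given watermark */
static bool zone_can_allocate(const struct zone_sketch *z,
			      unsigned long request, enum zone_watermarks wm)
{
	return z->free_pages >= request + z->watermark[wm];
}

int main(void)
{
	struct zone_sketch zone = {
		.free_pages = 1000,
		.watermark  = { 100, 125, 150 },	/* MIN, LOW, HIGH */
	};

	/* normal allocations check against the low watermark; below it,
	 * kswapd is woken to reclaim pages in the background */
	printf("alloc 800 pages ok: %d\n", zone_can_allocate(&zone, 800, WMARK_LOW));
	printf("alloc 900 pages ok: %d\n", zone_can_allocate(&zone, 900, WMARK_LOW));
	return 0;
}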

5. Block allocator (Slab/Slub/Slob)

5.1. Basic concepts

The buddy allocator provides a page-granularity allocation interface, which is too coarse for many kernel allocations, so a further mechanism is needed to split pages into smaller units for management.

Linux mainly supports three block allocators: slab, slub and slob. The slob allocator has the smallest code size, but its allocation speed is not the best; it is not aimed at large systems and suits embedded systems with tight memory.

5.2. Principle of slab block allocator

The slab allocator not only allocates small blocks of memory, it also acts as a cache for frequently allocated and released objects. Its core idea is to create a memory cache for each object type: each memory cache consists of multiple slabs, a slab consists of one or more contiguous physical pages, and each slab contains multiple objects. The slab allocator thus manages memory in an object-oriented way, grouping objects by type; for example, the process descriptor task_struct is one type, and every process descriptor instance is an object of that type. The structure of a memory cache is shown in the figure below:
(figure: structure of a slab memory cache)
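Kernel code uses such memory caches through the kmem_cache API; a minimal sketch with a hypothetical object type my_obj:

#include <linux/module.h>
#include <linux/slab.h>

/* hypothetical object type managed by its own memory cache */
struct my_obj {
	int id;
	char name[32];
};

static struct kmem_cache *my_cache;

static int __init slab_demo_init(void)
{
	struct my_obj *obj;

	/* one cache per object type, as with task_struct, inode, dentry, ... */
	my_cache = kmem_cache_create("my_obj", sizeof(struct my_obj), 0,
				     SLAB_HWCACHE_ALIGN, NULL);
	if (!my_cache)
		return -ENOMEM;

	obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
	if (obj) {
		obj->id = 1;
		kmem_cache_free(my_cache, obj);
	}
	return 0;
}

static void __exit slab_demo_exit(void)
{
	kmem_cache_destroy(my_cache);
}

module_init(slab_demo_init);
module_exit(slab_demo_exit);
MODULE_LICENSE("GPL");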
The slab allocator is not optimal in every situation, so the Linux kernel provides two derived block allocators:

  • On large machines with plenty of physical memory, the memory overhead of the slab allocator's management structures is significant, so the slub allocator was designed.
  • On embedded devices with little memory, the slab allocator's code is too large and complex, so the simplified slob allocator was designed.

The slub allocator is now the default block allocator.

5.3. Calculating slab length and coloring

(1) Calculating the slab length: the function calculate_slab_order computes the slab length, trying each order from 0 up to KMALLOC_MAX_ORDER, the largest order supported by the kmalloc() function.

(mm/slab.c)

/**
 * calculate_slab_order - calculate size (page order) of slabs
 * @cachep: pointer to the cache that is being created
 * @size: size of objects to be created in this cache.
 * @flags: slab allocation flags
 *
 * Also calculates the number of objects per slab.
 *
 * This could be made much more intelligent.  For now, try to avoid using
 * high order pages for slabs.  When the gfp() functions are more friendly
 * towards high-order requests, this should be changed.
 *
 * Return: number of left-over bytes in a slab
 */
static size_t calculate_slab_order(struct kmem_cache *cachep,
				size_t size, slab_flags_t flags)
{
	size_t left_over = 0;
	int gfporder;

	for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
		unsigned int num;
		size_t remainder;

		num = cache_estimate(gfporder, size, flags, &remainder);
		if (!num)
			continue;

		/* Can't handle number of objects more than SLAB_OBJ_MAX_NUM */
		if (num > SLAB_OBJ_MAX_NUM)
			break;

		if (flags & CFLGS_OFF_SLAB) {
			struct kmem_cache *freelist_cache;
			size_t freelist_size;

			freelist_size = num * sizeof(freelist_idx_t);
			freelist_cache = kmalloc_slab(freelist_size, 0u);
			if (!freelist_cache)
				continue;

			/*
			 * Needed to avoid possible looping condition
			 * in cache_grow_begin()
			 */
			if (OFF_SLAB(freelist_cache))
				continue;

			/* check if off slab has enough benefit */
			if (freelist_cache->size > cachep->size / 2)
				continue;
		}

		/* Found something acceptable - save it away */
		cachep->num = num;
		cachep->gfporder = gfporder;
		left_over = remainder;

		/*
		 * A VFS-reclaimable slab tends to have most allocations
		 * as GFP_NOFS and we really don't want to have to be allocating
		 * higher-order pages when we are unable to shrink dcache.
		 */
		if (flags & SLAB_RECLAIM_ACCOUNT)
			break;

		/*
		 * Large number of objects is good, but very large slabs are
		 * currently bad for the gfp()s.
		 */
		if (gfporder >= slab_max_order)
			break;

		/*
		 * Acceptable internal fragmentation?
		 */
		if (left_over * 8 <= (PAGE_SIZE << gfporder))
			break;
	}
	return left_over;
}

(2) Coloring:
A slab consists of one or more contiguous physical pages, and its starting address is always an integer multiple of the page size, so positions with the same offset in different slabs have the same index in the processor's L1 cache. If the unused bytes left over at the end of a slab span more than one L1 cache line, the cache lines covering that left-over area are never used; likewise, if an object's padding bytes span more than one L1 cache line, the cache lines covering the padding are never used. Both effects cause some of the processor's cache lines to be heavily used while others are rarely used. Slab coloring mitigates this by starting the objects of successive slabs at different offsets (multiples of the cache-line size taken from the left-over bytes), so that equal-offset objects in different slabs fall on different cache lines.
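A small arithmetic sketch of coloring (assuming a 64-byte L1 cache line and 200 left-over bytes per slab; the kernel advances its colour_next counter in a similar round-robin way):

#include <stdio.h>

int main(void)
{
	unsigned long cache_line = 64;		/* L1 cache line size (assumed) */
	unsigned long left_over  = 200;		/* unused bytes at the end of each slab (assumed) */

	/* number of distinct colours and the offset step between them */
	unsigned long colour      = left_over / cache_line;	/* 3 */
	unsigned long colour_off  = cache_line;
	unsigned long colour_next = 0;

	if (colour == 0)
		colour = 1;	/* too little slack: only one colour */

	/* successive slabs start their objects at 0, 64, 128, 0, 64, ... bytes,
	 * so equal-offset objects in different slabs land on different cache lines */
	for (int i = 0; i < 6; i++) {
		printf("slab %d: colour offset %lu bytes\n",
		       i, colour_next * colour_off);
		colour_next = (colour_next + 1) % colour;
	}
	return 0;
}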

5.4. Per Processor Array Cache

The memory cache creates an array cache (struct array_cache) for each processor. When an object is released, it is stored in the array cache of the current processor; when an object is allocated, it is first taken from the array cache of the current processor, in last-in-first-out (LIFO) order. This keeps recently used, cache-hot objects in circulation and improves performance.
(figure: per-processor array cache)
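A toy sketch of this LIFO behaviour (greatly simplified: no locking, no batch transfers; field names only loosely follow struct array_cache):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define AC_LIMIT 8

/* toy per-CPU array cache: a small stack of recently freed objects */
struct array_cache_sketch {
	unsigned int avail;		/* number of cached object pointers */
	void *entry[AC_LIMIT];
};

/* free: push the object onto the current CPU's cache (most recently freed on top) */
static bool ac_put(struct array_cache_sketch *ac, void *obj)
{
	if (ac->avail >= AC_LIMIT)
		return false;		/* the real slab would flush a batch to the shared lists */
	ac->entry[ac->avail++] = obj;
	return true;
}

/* allocate: pop the most recently freed object first (LIFO, likely cache-hot) */
static void *ac_get(struct array_cache_sketch *ac)
{
	if (ac->avail == 0)
		return NULL;		/* the real slab would refill from the slab lists */
	return ac->entry[--ac->avail];
}

int main(void)
{
	struct array_cache_sketch ac = { 0 };
	int a, b;

	ac_put(&ac, &a);
	ac_put(&ac, &b);
	printf("got %s first\n", ac_get(&ac) == &b ? "b (last freed)" : "a");
	return 0;
}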

// include/linux/slab_def.h

struct kmem_cache {
	struct array_cache __percpu *cpu_cache;

/* 1) Cache tunables. Protected by slab_mutex */
	unsigned int batchcount;
	unsigned int limit;
	unsigned int shared;

	unsigned int size;
	struct reciprocal_value reciprocal_buffer_size;
/* 2) touched by every alloc & free from the backend */

	slab_flags_t flags;		/* constant flags */
	unsigned int num;		/* # of objs per slab */

/* 3) cache_grow/shrink */
	/* order of pgs per slab (2^n) */
	unsigned int gfporder;

	/* force GFP flags, e.g. GFP_DMA */
	gfp_t allocflags;

	size_t colour;			/* cache colouring range */
	unsigned int colour_off;	/* colour offset */
	struct kmem_cache *freelist_cache;
	unsigned int freelist_size;

	/* constructor func */
	void (*ctor)(void *obj);

/* 4) cache creation/removal */
	const char *name;
	struct list_head list;
	int refcount;
	int object_size;
	int align;

/* 5) statistics */
#ifdef CONFIG_DEBUG_SLAB
	unsigned long num_active;
	unsigned long num_allocations;
	unsigned long high_mark;
	unsigned long grown;
	unsigned long reaped;
	unsigned long errors;
	unsigned long max_freeable;
	unsigned long node_allocs;
	unsigned long node_frees;
	unsigned long node_overflow;
	atomic_t allochit;
	atomic_t allocmiss;
	atomic_t freehit;
	atomic_t freemiss;

	/*
	 * If debugging is enabled, then the allocator can add additional
	 * fields and/or padding to every object. 'size' contains the total
	 * object size including these internal fields, while 'obj_offset'
	 * and 'object_size' contain the offset to the user object and its
	 * size.
	 */
	int obj_offset;
#endif /* CONFIG_DEBUG_SLAB */

#ifdef CONFIG_MEMCG
	struct memcg_cache_params memcg_params;
#endif
#ifdef CONFIG_KASAN
	struct kasan_cache kasan_info;
#endif

#ifdef CONFIG_SLAB_FREELIST_RANDOM
	unsigned int *random_seq;
#endif

	unsigned int useroffset;	/* Usercopy region offset */
	unsigned int usersize;		/* Usercopy region size */

	struct kmem_cache_node *node[MAX_NUMNODES];
};

5.5. The slab allocator supports NUMA architecture

The memory cache creates a kmem_cache_node instance for each memory node, as shown below:

(figure: kmem_cache_node instances, one per memory node)


Origin blog.csdn.net/Long_xu/article/details/129417361