Linux memory management: (2) slab allocator

1. Background of slab allocator

The buddy system allocates memory in units of physical pages, but in practice many memory requests are measured in bytes. How should such small blocks be allocated? The slab allocator solves the problem of small-block allocation, and it is one of the most important components of kernel memory allocation.

The slab allocator ultimately uses the buddy system to allocate actual physical pages, but the slab allocator implements its own mechanism on these contiguous physical pages to manage small memory blocks.

The slab mechanism is shown in the figure below:

[Figure: overview of the slab mechanism]

Each slab descriptor sets up a shared object buffer pool and a local object buffer pool. The slab mechanism has the following characteristics:

  • Allocated memory blocks are treated as objects. An object can have a custom constructor and destructor to initialize and release its contents (the Linux 5.0 kmem_cache structure shown later keeps only the constructor, ctor).
  • A released slab object is not discarded immediately. It stays in memory because it may be needed again soon, which avoids another request to the buddy system.
  • Slab descriptors can be created for memory blocks of a specific size, such as frequently used kernel data structures (open file objects, for example). This effectively limits memory fragmentation and makes frequently used data structures quick to obtain. The slab mechanism also supports allocating blocks whose size in bytes is a power of two.
  • The slab mechanism builds a multi-level buffer pool, trading space for time and preparing objects in advance to improve efficiency.
    • Each CPU has a local object buffer pool, which avoids lock contention problems between multiple cores.
    • Each memory node has a shared object buffer pool.
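The reuse behaviour described in the list above can be sketched in user space. This is only a minimal sketch, not kernel code: obj_cache, cache_alloc(), cache_free(), demo_reuse() and POOL_MAX are hypothetical names, and malloc() stands in for the buddy system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_MAX 8                  /* hypothetical capacity of the free pool */

struct obj_cache {
	size_t size;                /* object size in bytes */
	void (*ctor)(void *obj);    /* optional constructor */
	void *pool[POOL_MAX];       /* released objects kept for reuse */
	unsigned int avail;         /* number of objects in pool[] */
	unsigned int backend_calls; /* how often malloc() had to be called */
};

void *cache_alloc(struct obj_cache *c)
{
	void *obj;

	if (c->avail)               /* fast path: reuse a released object */
		return c->pool[--c->avail];

	obj = malloc(c->size);      /* slow path: ask the backend */
	if (obj) {
		c->backend_calls++;
		if (c->ctor)        /* construct only freshly created objects */
			c->ctor(obj);
	}
	return obj;
}

void cache_free(struct obj_cache *c, void *obj)
{
	if (c->avail < POOL_MAX)
		c->pool[c->avail++] = obj;  /* keep it for later */
	else
		free(obj);                  /* pool is full: really release it */
}

/* Demo: the second allocation reuses the object released just before. */
unsigned int demo_reuse(void)
{
	struct obj_cache c = { .size = 64 };
	void *a = cache_alloc(&c);
	void *b;

	assert(a != NULL);
	cache_free(&c, a);
	b = cache_alloc(&c);
	assert(a == b);
	free(b);
	return c.backend_calls;
}
```

In this sketch demo_reuse() reaches the backend only once, because the second request is served from the pool of released objects.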

2. Macro understanding

To make the details of the slab allocator easier to follow, we first look at the architecture of the slab system as a whole, as shown in the following figure:

[Figure: architecture of the slab system]

The slab system consists of slab descriptors, slab nodes, local object buffer pools, shared object buffer pools, three slab linked lists, n slab allocators (individual slabs), and numerous slab cache objects. The relevant data structures are annotated below.

slab descriptor:

// kmem_cache is the core data structure of the slab allocator; every slab descriptor is represented by one kmem_cache
struct kmem_cache {
	// Per-CPU array_cache, one per CPU: the local object buffer pool of that CPU
	struct array_cache __percpu *cpu_cache;

/* 1) Cache tunables. Protected by slab_mutex */
	// Number of objects fetched from the shared object buffer pool or from the slabs_partial/slabs_free lists when the current CPU's local pool (array_cache) is empty
	unsigned int batchcount;
	// When the local pool holds more than limit free objects, batchcount objects are released so that the kernel can reclaim and destroy slabs
	unsigned int limit;
	// Used on multi-core systems
	unsigned int shared;

	// Object length, including the bytes added for alignment
	unsigned int size;
	struct reciprocal_value reciprocal_buffer_size;
/* 2) touched by every alloc & free from the backend */

	// Flags of the slab descriptor (allocation mask)
	slab_flags_t flags;		/* constant flags */
	// Maximum number of objects in one slab
	unsigned int num;		/* # of objs per slab */

/* 3) cache_grow/shrink */
	/* order of pgs per slab (2^n) */
	unsigned int gfporder;

	/* force GFP flags, e.g. GFP_DMA */
	gfp_t allocflags;

	// Number of distinct cache lines (colours) available in one slab
	size_t colour;			/* cache colouring range */
	// Length of one colour step, equal to the L1 cache line size
	unsigned int colour_off;	/* colour offset */
	struct kmem_cache *freelist_cache;
	// Size of the freelist; each object takes one entry (1 byte in the layout described here)
	unsigned int freelist_size;

	/* constructor func */
	void (*ctor)(void *obj);

/* 4) cache creation/removal */
	// Name of the slab descriptor
	const char *name;
	struct list_head list;
	int refcount;
	// Actual size of the object
	int object_size;
	// Alignment length
	int align;

...

	// Slab nodes
	// On NUMA systems, each node has one kmem_cache_node
	// On the ARM Vexpress platform there is only one node
	struct kmem_cache_node *node[MAX_NUMNODES];
};

slab node:

struct kmem_cache_node {
	// Protects the slab lists of this slab node
	spinlock_t list_lock;

#ifdef CONFIG_SLAB
	// List of slabs in this node that have some free objects
	struct list_head slabs_partial;	/* partial list first, better asm code */
	// List of slabs in this node that have no free objects
	struct list_head slabs_full;
	// List of slabs in this node whose objects are all free
	struct list_head slabs_free;
	// Total number of slabs in this node
	unsigned long total_slabs;	/* length of all slab lists */
	// Number of slabs in this node whose objects are all free
	unsigned long free_slabs;	/* length of free slab list only */
	// Number of free objects
	unsigned long free_objects;
	// Upper limit on the number of free objects allowed in this slab node
	unsigned int free_limit;
	// Index of the current colour. Slabs in the node derive the size of their colouring area from this index; after reaching the maximum it starts again from 0
	unsigned int colour_next;	/* Per-node cache coloring */
	// Shared object buffer pool. On multi-core systems, besides the per-CPU local pools, the slab node has a pool shared by all CPUs
	struct array_cache *shared;	/* shared per node */
	// Used on NUMA systems
	struct alien_cache **alien;	/* on other nodes */
	// Time of the next reap of this slab node
	unsigned long next_reap;	/* updated without locking */
	// Set when the slabs_free list of this node has been accessed
	int free_touched;		/* updated without locking */
#endif

...

};
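The comments on colour, colour_off and colour_next describe a counter that wraps around. A minimal user-space sketch of that idea follows. The names colour_state, next_colour_offset() and demo_colour() are hypothetical, and the ordering of the steps is simplified compared with the kernel.

```c
#include <assert.h>

struct colour_state {
	unsigned int colour;       /* number of distinct colours */
	unsigned int colour_off;   /* bytes per colour step */
	unsigned int colour_next;  /* colour index for the next slab */
};

/* Return the byte offset for the next slab and advance the index. */
unsigned int next_colour_offset(struct colour_state *s)
{
	unsigned int idx = s->colour_next;

	if (idx >= s->colour)      /* also covers colour == 0 */
		idx = 0;
	s->colour_next = idx + 1;
	if (s->colour_next >= s->colour)
		s->colour_next = 0;    /* wrap around after the last colour */
	return idx * s->colour_off;
}

/* Demo: three colours of 64 bytes give offsets 0, 64, 128, then 0 again. */
int demo_colour(void)
{
	struct colour_state s = { .colour = 3, .colour_off = 64 };

	return next_colour_offset(&s) == 0 &&
	       next_colour_offset(&s) == 64 &&
	       next_colour_offset(&s) == 128 &&
	       next_colour_offset(&s) == 0;
}
```

Because consecutive slabs start their objects at different offsets, objects with the same index in different slabs tend to land on different cache lines.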

Object buffer pool:

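// The slab descriptor provides one object buffer pool (array_cache) for each CPU
// array_cache can describe a local object buffer pool as well as a shared object buffer pool
struct array_cache {
	// Number of objects available in the pool
	unsigned int avail;
	// Upper limit on the number of objects available in the pool
	unsigned int limit;
	// Number of objects to migrate, for example when moving free objects from the shared pool or from other slabs into this pool
	unsigned int batchcount;
	// Set to 1 when an object is taken from the pool;
	// set to 0 when the pool is shrunk
	unsigned int touched;
	// Storage for the objects
	// A variable-length array in which each member holds a pointer to one object. The array initially has at most limit members
	void *entry[];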
};

The object buffer pool structure ends with a flexible array member (written as a GCC zero-length array in older code). The entry[] array stores pointers to multiple objects, as shown in the following figure:

[Figure: entry[] array of the object buffer pool]
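How a structure with a trailing entry[] array is allocated and used as a stack can be sketched in user space. This is a minimal sketch: ac_create(), ac_put(), ac_get() and demo_lifo() are hypothetical helpers, and the structure is a trimmed-down copy of array_cache.

```c
#include <assert.h>
#include <stdlib.h>

struct array_cache {
	unsigned int avail;   /* number of pointers stored in entry[] */
	unsigned int limit;   /* capacity of entry[] */
	void *entry[];        /* flexible array member */
};

/* Allocate the header and room for limit pointers in one block. */
struct array_cache *ac_create(unsigned int limit)
{
	struct array_cache *ac;

	ac = malloc(sizeof(*ac) + limit * sizeof(void *));
	if (!ac)
		return NULL;
	ac->avail = 0;
	ac->limit = limit;
	return ac;
}

/* Push an object pointer; returns -1 when the pool is full. */
int ac_put(struct array_cache *ac, void *obj)
{
	if (ac->avail >= ac->limit)
		return -1;
	ac->entry[ac->avail++] = obj;
	return 0;
}

/* Pop the most recently stored pointer, or NULL when empty. */
void *ac_get(struct array_cache *ac)
{
	return ac->avail ? ac->entry[--ac->avail] : NULL;
}

/* Demo: the pool behaves as a bounded LIFO stack. */
int demo_lifo(void)
{
	int x, y, ok;
	struct array_cache *ac = ac_create(2);

	if (!ac)
		return 0;
	ac_put(ac, &x);
	ac_put(ac, &y);
	ok = ac_put(ac, &x) == -1 && ac_get(ac) == &y &&
	     ac_get(ac) == &x && ac_get(ac) == NULL;
	free(ac);
	return ok;
}
```

Handing out the most recently released pointer first means the returned object is more likely to still be in the CPU cache.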

With this overall picture of the slab system architecture in mind, we now turn to the details.

3. Detailed description

3.1 Memory layout of slab allocator

The memory layout of the slab allocator usually consists of the following three parts:

  • Coloring area
  • n slab objects
  • Management area. The management area can be regarded as a freelist array. Each member of the array occupies 1 byte, and each member represents a slab object.
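The role of the management area can be sketched as follows: the freelist holds object indices, and a counter marks how many of them are currently handed out. This is a simplified user-space sketch. mini_slab, NUM_OBJS, OBJ_SIZE and demo_freelist() are hypothetical, and the 1-byte freelist_idx_t is an assumption that matches the description above.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_OBJS 4
#define OBJ_SIZE 32

typedef unsigned char freelist_idx_t;   /* assumed 1-byte index */

struct mini_slab {
	unsigned char objs[NUM_OBJS * OBJ_SIZE]; /* object area */
	freelist_idx_t freelist[NUM_OBJS];       /* management area */
	unsigned int active;                     /* objects handed out */
};

void slab_init(struct mini_slab *s)
{
	unsigned int i;

	for (i = 0; i < NUM_OBJS; i++)
		s->freelist[i] = (freelist_idx_t)i;
	s->active = 0;
}

/* Hand out the object whose index is stored at freelist[active]. */
void *slab_get_obj(struct mini_slab *s)
{
	if (s->active == NUM_OBJS)
		return NULL;
	return s->objs + (size_t)s->freelist[s->active++] * OBJ_SIZE;
}

/* Return an object: store its index just below the active mark. */
void slab_put_obj(struct mini_slab *s, void *obj)
{
	size_t idx = (size_t)((unsigned char *)obj - s->objs) / OBJ_SIZE;

	s->freelist[--s->active] = (freelist_idx_t)idx;
}

/* Demo: a returned object is the next one to be handed out again. */
int demo_freelist(void)
{
	struct mini_slab s;
	void *a, *b;

	slab_init(&s);
	a = slab_get_obj(&s);        /* object 0 */
	b = slab_get_obj(&s);        /* object 1 */
	slab_put_obj(&s, a);         /* index 0 is written to freelist[1] */
	return b == s.objs + OBJ_SIZE && slab_get_obj(&s) == a && s.active == 2;
}
```

Entries below active describe objects in use and entries from active upward describe free objects, so one small array is enough to manage the whole slab.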

The Linux 5.0 kernel supports the following three slab allocator layout modes:

  • OBJFREELIST_SLAB mode. This optimization was added in the Linux 4.6 kernel to use the memory in the slab allocator more efficiently. The space of the last slab object in the slab allocator is used as the management area, as shown in the figure below

    [Figure: OBJFREELIST_SLAB layout]

  • OFF_SLAB mode. The management data of the slab allocator is not stored in the slab allocator itself; separately allocated memory is used for it, as shown in the figure below

    [Figure: OFF_SLAB layout]

  • Normal mode. The traditional layout, as shown in the figure below

    [Figure: normal layout]
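The practical difference between the three modes is whether the freelist consumes extra space inside the slab. The arithmetic can be sketched as below. This is a sketch under stated assumptions: slab_estimate, estimate() and the freelist_needs_space parameter are hypothetical, the index is assumed to be 1 byte, and the colouring area is ignored.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char freelist_idx_t;   /* assumed 1-byte index */

struct slab_estimate {
	size_t num;        /* objects that fit into one slab */
	size_t left_over;  /* bytes that remain unused */
};

/*
 * freelist_needs_space is 1 for the normal mode, where every object also
 * needs one freelist entry inside the slab, and 0 for OBJFREELIST_SLAB and
 * OFF_SLAB, where the freelist takes no extra room in the slab.
 */
struct slab_estimate estimate(size_t slab_size, size_t obj_size,
			      int freelist_needs_space)
{
	struct slab_estimate e;
	size_t per_obj = obj_size;

	if (freelist_needs_space)
		per_obj += sizeof(freelist_idx_t);

	e.num = slab_size / per_obj;
	e.left_over = slab_size % per_obj;
	return e;
}
```

For a 4096-byte slab and 256-byte objects, this sketch gives 15 objects with 241 bytes left over in normal mode, and 16 objects with nothing left over in the other two modes. The left-over bytes are what the colouring area can use.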

3.2 Create slab descriptor

In the kernel, the kmem_cache_create() function creates a slab descriptor. The flow of kmem_cache_create() is shown in the figure below:

[Figure: flow chart of kmem_cache_create()]

To make this concrete, the flow chart is walked through below alongside the source code:

kmem_cache_create

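// Create a slab descriptor
// kmem_cache_create() creates a dedicated cache descriptor; kmalloc() allocates from the general-purpose caches
// name: name of the slab descriptor
// size: size of the cached object
// align: alignment of the cached object, in bytes
// flags: allocation mask
// ctor: constructor of the object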
struct kmem_cache *
kmem_cache_create(const char *name, unsigned int size, unsigned int align,
		slab_flags_t flags, void (*ctor)(void *))

kmem_cache_create->...->__kmem_cache_create

// Create a slab cache descriptor
int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
{
	...
	// Align the object size to the system word length (BYTES_PER_WORD)
	// If the requested size is smaller than a word, the slab allocator ends up using the word length
	size = ALIGN(size, BYTES_PER_WORD);

	// SLAB_RED_ZONE enables overflow checking, a debugging feature
	if (flags & SLAB_RED_ZONE) {
		ralign = REDZONE_ALIGN;
		size = ALIGN(size, REDZONE_ALIGN);
	}

	/* 3) caller mandated alignment */
	// Alignment required by the caller
	if (ralign < cachep->align) {
		ralign = cachep->align;
	}
	...
	 * 4) Store it.
	 */
	cachep->align = ralign;
	// colour_off is the length of one colour step; it equals the L1 cache line size
	cachep->colour_off = cache_line_size();
	/* Offset must be a multiple of the alignment. */
	if (cachep->colour_off < cachep->align)
		cachep->colour_off = cachep->align;

	// The enum slab_state tracks the state of the slab system: DOWN, PARTIAL, PARTIAL_NODE, UP and FULL. Once the slab mechanism is fully initialized, the state becomes FULL
	// slab_is_available() is true when the slab allocator is in the UP or FULL state, in which case GFP_KERNEL may be used; otherwise only GFP_NOWAIT may be used
	if (slab_is_available())
		gfp = GFP_KERNEL;
	else
		gfp = GFP_NOWAIT;

...

	// Align the slab object size to cachep->align
	size = ALIGN(size, cachep->align);

	...

	// If the freelist array is smaller than one slab object and no constructor is specified, the slab allocator can use OBJFREELIST_SLAB mode
	if (set_objfreelist_slab_cache(cachep, size, flags)) {
		flags |= CFLGS_OBJFREELIST_SLAB;
		goto done;
	}

	// If the leftover space in a slab is smaller than the freelist array, use OFF_SLAB mode
	if (set_off_slab_cache(cachep, size, flags)) {
		flags |= CFLGS_OFF_SLAB;
		goto done;
	}

	// If the leftover space in a slab is larger than the slab management array, use normal mode
	if (set_on_slab_cache(cachep, size, flags))
		goto done;

	return -E2BIG;

done:
	// freelist_size is the size of the management area (the freelist) of one slab
	cachep->freelist_size = cachep->num * sizeof(freelist_idx_t);
	cachep->flags = flags;
	cachep->allocflags = __GFP_COMP;
	if (flags & SLAB_CACHE_DMA)
		cachep->allocflags |= GFP_DMA;
	if (flags & SLAB_RECLAIM_ACCOUNT)
		cachep->allocflags |= __GFP_RECLAIMABLE;
	// size is the size of one slab object
	cachep->size = size;
	cachep->reciprocal_buffer_size = reciprocal_value(size);

...

	// Continue configuring the slab descriptor
	err = setup_cpu_cache(cachep, gfp);
	if (err) {
		__kmem_cache_release(cachep);
		return err;
	}

	return 0;
}
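The size adjustments in __kmem_cache_create() are a chain of round-up operations. The sketch below reproduces only the word alignment and the caller-mandated alignment in user space. ALIGN_UP and aligned_object_size() are hypothetical names, sizeof(void *) stands in for BYTES_PER_WORD, and red-zone handling is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((size_t)(a) - 1)) & ~((size_t)(a) - 1))

size_t aligned_object_size(size_t size, size_t caller_align)
{
	size_t ralign = sizeof(void *);   /* stands in for BYTES_PER_WORD */

	/* Step 1: never go below one machine word. */
	size = ALIGN_UP(size, sizeof(void *));

	/* Step 2: honour a stricter alignment requested by the caller. */
	if (ralign < caller_align)
		ralign = caller_align;

	/* Step 3: the final object size is a multiple of the alignment. */
	return ALIGN_UP(size, ralign);
}
```

With these assumptions, a 1-byte request with no alignment requirement grows to one machine word, and a 20-byte request with 64-byte alignment grows to 64 bytes.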

3.3 Allocate slab objects

kmem_cache_alloc() is the core function for allocating slab cache objects. It calls slab_alloc() internally, and local interrupts stay disabled for the whole object allocation. The flow chart of kmem_cache_alloc() is as follows:

[Figure: flow chart of kmem_cache_alloc()]

To make this concrete, the flow chart is walked through below alongside the source code:

kmem_cache_alloc->slab_alloc

// slab_alloc() keeps local interrupts disabled for the whole slab object allocation
static __always_inline void *
slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
{
	...
	local_irq_save(save_flags);
	// Get a slab object
	objp = __do_cache_alloc(cachep, flags);
	local_irq_restore(save_flags);
	...

	// If __GFP_ZERO was set for the allocation, clear the slab object with memset()
	if (unlikely(flags & __GFP_ZERO) && objp)
		memset(objp, 0, cachep->object_size);

	slab_post_alloc_hook(cachep, flags, 1, &objp);
	return objp;
}

kmem_cache_alloc->slab_alloc->...->____cache_alloc

// Get a slab object
static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
{
	void *objp;
	struct array_cache *ac;

	check_irq_off();

	// Get the local object buffer pool ac of the slab descriptor cachep
	ac = cpu_cache_get(cachep);
	// Check whether the local object buffer pool has a free object
	if (likely(ac->avail)) {
		ac->touched = 1;
		// Get a slab object
		objp = ac->entry[--ac->avail];

		STATS_INC_ALLOCHIT(cachep);
		goto out;
	}

	STATS_INC_ALLOCMISS(cachep);
	// On the first allocation ac->avail is 0, so execution reaches cache_alloc_refill()
	objp = cache_alloc_refill(cachep, flags);
	...

	return objp;
}

kmem_cache_alloc->slab_alloc->__do_cache_alloc->____cache_alloc->cache_alloc_refill

static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags)
{
	...

	// Get the local object buffer pool ac
	ac = cpu_cache_get(cachep);
	...
	// Get the slab node
	n = get_node(cachep, node);

	BUG_ON(ac->avail > 0 || !n);
	// shared is the shared object buffer pool
	shared = READ_ONCE(n->shared);
	// If the slab node has no free objects and the shared pool is absent or also has no free objects, jump straight to direct_grow
	if (!n->free_objects && (!shared || !shared->avail))
		goto direct_grow;

	...

	// If the shared pool has free objects, try to move batchcount of them into the local pool ac
	// transfer_objects() moves free objects from the shared pool to the local pool
	if (shared && transfer_objects(ac, shared, batchcount)) {
		shared->touched = 1;
		goto alloc_done;
	}

	while (batchcount > 0) {
		/* Get slab alloc is to come from. */
		// If the shared pool has no free objects, get_first_slab() looks at the slabs_partial and slabs_free lists of the slab node
		page = get_first_slab(n, false);
		if (!page)
			goto must_grow;

		check_spinlock_acquired(cachep);

		// Move batchcount free objects from the slab into the local pool
		batchcount = alloc_block(cachep, ac, page, batchcount);
		fixup_slab_list(cachep, n, page, &list);
	}

must_grow:
	// Update the free_objects counter of the slab node
	n->free_objects -= ac->avail;
alloc_done:
	spin_unlock(&n->list_lock);
	fixup_objfreelist_debug(cachep, &list);

// Reached when neither the slab node nor the shared pool has free objects, which means the whole memory node has no free slab objects
// In this case a new slab has to be allocated; this is also the situation right after a slab descriptor has been initialized and configured
direct_grow:
	if (unlikely(!ac->avail)) {
		/* Check if we can use obj in pfmemalloc slab */
		if (sk_memalloc_socks()) {
			void *obj = cache_alloc_pfmemalloc(cachep, n, flags);

			if (obj)
				return obj;
		}

		// Allocate a new slab
		page = cache_grow_begin(cachep, gfp_exact_node(flags), node);

		/*
		 * cache_grow_begin() can reenable interrupts,
		 * then ac could change.
		 */
		ac = cpu_cache_get(cachep);
		if (!ac->avail && page)
			// Move batchcount free objects from the newly allocated slab into the local pool
			alloc_block(cachep, ac, page, batchcount);
		// Add the newly allocated slab to the appropriate list; in this scenario it should go onto slabs_partial
		cache_grow_end(cachep, page);

		if (!ac->avail)
			return NULL;
	}
	// Set touched to 1 to record that the local pool has just been used
	ac->touched = 1;

	// Return one free object
	return ac->entry[--ac->avail];
}
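The refill step from the shared pool to the local pool can be sketched in user space. This is a simplified model, not the kernel implementation: pool, min3u(), transfer(), alloc_from() and demo_refill() are hypothetical names, the pools use a fixed-size array for brevity, and the fallback to the slab lists and to growing the cache is left out.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 8

struct pool {
	unsigned int avail;        /* pointers currently stored */
	unsigned int limit;        /* capacity that may be used */
	void *entry[MAX_ENTRIES];  /* fixed-size array, for brevity */
};

static unsigned int min3u(unsigned int a, unsigned int b, unsigned int c)
{
	unsigned int m = a < b ? a : b;

	return m < c ? m : c;
}

/* Move up to max pointers from the top of from to the top of to. */
unsigned int transfer(struct pool *to, struct pool *from, unsigned int max)
{
	unsigned int nr = min3u(from->avail, max, to->limit - to->avail);

	if (!nr)
		return 0;
	memcpy(to->entry + to->avail, from->entry + from->avail - nr,
	       sizeof(void *) * nr);
	from->avail -= nr;
	to->avail += nr;
	return nr;
}

/* Allocate from the local pool, refilling it from the shared pool. */
void *alloc_from(struct pool *local, struct pool *shared,
		 unsigned int batchcount)
{
	if (!local->avail && !transfer(local, shared, batchcount))
		return NULL;   /* a real allocator would try the slab lists here */
	return local->entry[--local->avail];
}

/* Demo: 5 shared objects, batchcount 3, one allocation. */
int demo_refill(void)
{
	static int objs[5];
	struct pool local = { .limit = 4 };
	struct pool shared = { .limit = MAX_ENTRIES };
	void *p;
	int i;

	for (i = 0; i < 5; i++)
		shared.entry[shared.avail++] = &objs[i];
	p = alloc_from(&local, &shared, 3);
	return p == &objs[4] && local.avail == 2 && shared.avail == 2;
}
```

One refill moves a whole batch, so the following allocations on the same CPU are served from the local pool without touching the shared pool again.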

3.4 Release the slab cache object

The interface function for releasing slab cache objects is kmem_cache_free(). Its flow is as follows:

[Figure: flow chart of kmem_cache_free()]

To make this concrete, the flow chart is walked through below alongside the source code:

kmem_cache_free->__cache_free->___cache_free

void ___cache_free(struct kmem_cache *cachep, void *objp,
		unsigned long caller)
{
	struct array_cache *ac = cpu_cache_get(cachep);
	...
	// When the number of free objects in the local pool, ac->avail, reaches or exceeds ac->limit, cache_flusharray() is called to flush the pool and try to reclaim free objects
	if (ac->avail < ac->limit) {
		STATS_INC_FREEHIT(cachep);
	} else {
		STATS_INC_FREEMISS(cachep);
		// Mainly used to reclaim slabs
		cache_flusharray(cachep, ac);
	}
	...
	// Release the object into the local object buffer pool ac
	ac->entry[ac->avail++] = objp;
}
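The release path can be modelled in the same way: a full local pool first gives up a batch of its oldest entries and then accepts the new object. This is a minimal sketch under stated assumptions. local_pool, flush(), pool_free(), demo_free(), LIMIT and BATCHCOUNT are hypothetical, and the flushed objects are only counted here, whereas a real allocator would hand them to the shared pool or back to their slabs.

```c
#include <assert.h>
#include <string.h>

#define LIMIT      4   /* hypothetical capacity of the local pool */
#define BATCHCOUNT 2   /* hypothetical number of objects flushed at once */

struct local_pool {
	unsigned int avail;     /* pointers currently stored */
	void *entry[LIMIT];
	unsigned int flushed;   /* objects given up so far (demo counter) */
};

/* Give up the BATCHCOUNT oldest entries and close the gap. */
void flush(struct local_pool *ac)
{
	ac->flushed += BATCHCOUNT;
	ac->avail -= BATCHCOUNT;
	memmove(ac->entry, &ac->entry[BATCHCOUNT],
		sizeof(void *) * ac->avail);
}

/* Release an object into the local pool, flushing first if it is full. */
void pool_free(struct local_pool *ac, void *obj)
{
	if (ac->avail >= LIMIT)
		flush(ac);
	ac->entry[ac->avail++] = obj;
}

/* Demo: the fifth release triggers one flush of two objects. */
int demo_free(void)
{
	static int objs[5];
	struct local_pool ac = { 0 };
	int i;

	for (i = 0; i < 5; i++)
		pool_free(&ac, &objs[i]);
	return ac.avail == 3 && ac.flushed == 2 &&
	       ac.entry[0] == &objs[2] && ac.entry[2] == &objs[4];
}
```

Flushing from the front keeps the most recently released objects in the local pool, which are the ones most likely to be reused soon.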

Origin blog.csdn.net/qq_58538265/article/details/135183146