Physical Memory Management of the Linux Kernel

    In terms of physical memory, Linux introduces the concepts of the memory node (node), memory zone (zone), and memory page (page). The management of physical memory is divided into two parts: the page-level buddy system implemented at the bottom layer, and the slab memory management for kernel object caches and general-purpose caches built on top of the buddy system.

2. Buddy System

        Node: The kernel uses the struct pglist_data data structure to represent a memory node in both UMA and NUMA systems. A UMA system has a single memory node; in a NUMA system the memory nodes are linked together in a list via the pgdat_next field (a small traversal sketch follows the structure definition below).

    typedef struct pglist_data {
         struct zone node_zones[MAX_NR_ZONES];
         struct zonelist node_zonelists[GFP_ZONETYPES];
         int nr_zones;
         struct page *node_mem_map;
         struct bootmem_data *bdata;
         unsigned long node_start_pfn;
         unsigned long node_present_pages; /* total number of physical pages */
         unsigned long node_spanned_pages; /* total size of physical page range, including holes */
         int node_id;
         struct pglist_data *pgdat_next;
         wait_queue_head_t       kswapd_wait;
         struct task_struct *kswapd;
    } pg_data_t;
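
    As a rough illustration (a sketch assuming a 2.6-era kernel, not code from the article), the fields above are enough to walk the node list and report per-node page counts:

    #include <linux/kernel.h>
    #include <linux/mmzone.h>

    /* Walk the memory nodes through pgdat_next and print the fields
     * listed in the structure above. The caller passes the first node
     * (e.g. the head of the kernel's node list). */
    static void dump_nodes(pg_data_t *first_node)
    {
        pg_data_t *pgdat;

        for (pgdat = first_node; pgdat; pgdat = pgdat->pgdat_next)
            printk(KERN_INFO "node %d: start pfn %lu, %lu present pages, %d zones\n",
                   pgdat->node_id, pgdat->node_start_pfn,
                   pgdat->node_present_pages, pgdat->nr_zones);
    }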

    Memory zone: Linux divides the physical memory of a node into different memory zones, represented by struct zone. The zone types are ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM.

    struct zone {
        unsigned long  free_pages;
        unsigned long  pages_min, pages_low, pages_high;
        unsigned long  protection[MAX_NR_ZONES];

        struct per_cpu_pageset pageset[NR_CPUS];
        spinlock_t  lock;
        struct free_area free_area[MAX_ORDER];
        spinlock_t  lru_lock; 
        struct list_head active_list;
        struct list_head inactive_list;
        unsigned long  nr_scan_active;
        unsigned long  nr_scan_inactive;
        unsigned long  nr_active;
        unsigned long  nr_inactive;
        unsigned long  pages_scanned;    /* since last reclaim */
        int   all_unreclaimable; /* All pages pinned */
        wait_queue_head_t * wait_table;
        unsigned long  wait_table_size;
        unsigned long  wait_table_bits;
        struct pglist_data *zone_pgdat;
        struct page  *zone_mem_map;
        char   *name;
    };

    struct free_area {
         struct list_head free_list[MIGRATE_TYPES];
         unsigned long  nr_free;
    };

    Memory page: The Linux kernel creates a struct page object for each physical page frame, and the system stores these page objects in the global array struct page *mem_map.

    struct page {
        page_flags_t flags;  

        atomic_t _count;  /* Usage count, see below. */
        atomic_t _mapcount; 
        unsigned long private;  

        struct address_space *mapping;

        pgoff_t index;  

        struct list_head lru; 

       #if defined(WANT_PAGE_VIRTUAL)
        void *virtual;  

       #endif /* WANT_PAGE_VIRTUAL */
    };

    The buddy system allocates either a single physical page or a block of 2^order contiguous physical pages. The free_area array member of struct zone holds the free blocks. Order is a central term of the buddy system: it describes the unit of allocation, the length of a memory block is 2^order pages, and order ranges from 0 to 10. The free_area[] array is indexed by order, and each element holds a linked list of free blocks of contiguous pages: the 0th list manages blocks of a single page, the 1st list manages blocks of two contiguous pages, the 2nd list manages blocks of 4 contiguous pages, and so on. Linux uses the well-known buddy system algorithm to limit external fragmentation, grouping all free page frames into 11 block lists whose blocks contain 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames. The largest request of 1024 page frames corresponds to a 4 MB block of contiguous RAM (with 4 KB pages).

    Working principle:

    Suppose a block of 256 page frames (1 MB) is requested. The algorithm first checks whether there is a free block in the 256-page-frame list. If not, it looks at the list for the next larger block size, i.e. it searches for a free block in the 512-page-frame list. If one exists, the kernel splits it into two halves: one half satisfies the request and the other half is inserted into the 256-page-frame list. If the 512-page-frame list is also empty, the search continues with ever larger block lists. Once a 1024-page-frame block is found, 256 page frames are used to satisfy the request and the remaining 768 page frames are split into a 512-page-frame block and a 256-page-frame block, which are inserted into the corresponding lists. The release process is the reverse: the kernel attempts to merge the freed block of size b with a free, physically adjacent block of size b into a single block of size 2b, which is then inserted into the next larger block list. Two blocks that satisfy the following conditions are called buddies:

  •  Both blocks have the same size, denoted b.
  •  Their physical addresses are contiguous.
  •  The physical address of the first page frame of the first block is a multiple of 2 × b × 2^12 (with 4 KB pages, this means the pair is aligned on a boundary of twice the block size).

    The algorithm is iterative: if it succeeds in merging the freed block with its buddy, it tries to merge the resulting block of size 2b with its own buddy, attempting to form ever larger blocks.
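
    To make the buddy relation concrete, here is a minimal user-space sketch (not the kernel's code) of the page-frame-number arithmetic: the buddy of a block of 2^order pages differs from it only in bit "order" of its pfn, which is the same XOR relation the kernel relies on when splitting and merging.

    #include <stdio.h>

    /* For a free block of 2^order pages starting at page frame number pfn,
     * the buddy block's pfn differs only in bit 'order'. */
    static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
    {
        return pfn ^ (1UL << order);
    }

    /* After merging, the combined block of 2^(order+1) pages starts at the
     * lower of the two pfns, i.e. with bit 'order' cleared. */
    static unsigned long merged_pfn(unsigned long pfn, unsigned int order)
    {
        return pfn & ~(1UL << order);
    }

    int main(void)
    {
        unsigned long pfn = 256; /* first pfn of a freed block of 2^8 = 256 pages */
        unsigned int order = 8;

        printf("buddy of pfn %lu at order %u starts at pfn %lu\n",
               pfn, order, buddy_pfn(pfn, order));
        printf("merged block of order %u starts at pfn %lu\n",
               order + 1, merged_pfn(pfn, order));
        return 0;
    }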

3. Buddy System API

    The buddy system's API allocates blocks whose size is a power of two pages; a short usage sketch follows the list below.

    alloc_pages(mask, order) allocates 2^order contiguous pages and returns a pointer to the struct page of the first page of the block.

    get_zeroed_page(mask) allocates a single page, fills it with zeros, and returns the kernel virtual address of the page.

    __get_free_pages and __get_free_page work like alloc_pages, but return the kernel virtual address of the allocated memory block instead of a struct page pointer.

    __get_dma_pages(mask, order) is used to obtain pages suitable for DMA.
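
    As a hedged usage sketch (illustrative module code, not from the article; error paths kept minimal), the functions above can be combined like this:

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/errno.h>

    static int demo_buddy_alloc(void)
    {
        struct page *pages;
        unsigned long addr;

        /* 2^2 = 4 contiguous page frames; GFP_KERNEL may sleep */
        pages = alloc_pages(GFP_KERNEL, 2);
        if (!pages)
            return -ENOMEM;

        /* one zero-filled page, returned as a kernel virtual address */
        addr = get_zeroed_page(GFP_KERNEL);
        if (!addr) {
            __free_pages(pages, 2);
            return -ENOMEM;
        }

        /* ... use the memory ... */

        free_page(addr);        /* frees the page from get_zeroed_page() */
        __free_pages(pages, 2); /* the order must match the allocation */
        return 0;
    }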

    Memory allocation flag mask:

    #define __GFP_WAIT ((__force gfp_t)___GFP_WAIT) /* Can wait and reschedule? */
    #define __GFP_HIGH ((__force gfp_t)___GFP_HIGH) /* Should access emergency pools? */
    #define __GFP_IO ((__force gfp_t)___GFP_IO) /* Can start physical IO? */
    #define __GFP_FS ((__force gfp_t)___GFP_FS) /* Can call down to low-level FS? */
    #define __GFP_COLD ((__force gfp_t)___GFP_COLD) /* Cache-cold page required */
    #define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN) 
    #define __GFP_REPEAT ((__force gfp_t)___GFP_REPEAT) /* See above */
    #define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL) /* See above */
    #define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY) /* See above */
    #define __GFP_MEMALLOC ((__force gfp_t)___GFP_MEMALLOC)

    #define __GFP_COMP ((__force gfp_t)___GFP_COMP) /* Add compound page metadata */
    #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO) /* Return zeroed page on success */
    #define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)   

    #define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)

    #define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
    #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
    #define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */

    #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
    #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
    #define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */

    Common flag combinations (a short usage sketch follows the list):

    #define GFP_ATOMIC (__GFP_HIGH)
    #define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
    #define GFP_NOIO (__GFP_WAIT)
    #define GFP_NOFS (__GFP_WAIT | __GFP_IO)
    #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
    #define GFP_TEMPORARY (__GFP_WAIT | __GFP_IO | __GFP_FS | \
    __GFP_RECLAIMABLE)
    #define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
    #define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
    __GFP_HIGHMEM)
    #define GFP_HIGHUSER_MOVABLE (__GFP_WAIT | __GFP_IO | __GFP_FS | \
     __GFP_HARDWALL | __GFP_HIGHMEM | \
     __GFP_MOVABLE)
    #define GFP_IOFS (__GFP_IO | __GFP_FS)
    #define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
    __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
    __GFP_NO_KSWAPD)
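
    To illustrate the practical difference between two of these combinations (a sketch only): GFP_KERNEL may sleep and is the normal choice in process context, while GFP_ATOMIC never sleeps and is the safe choice in interrupt context:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Process context: __GFP_WAIT, __GFP_IO and __GFP_FS are set, so the
     * allocator may sleep, start I/O and call into the filesystem. */
    static struct page *alloc_in_process_context(void)
    {
        return alloc_pages(GFP_KERNEL, 0);
    }

    /* Interrupt context: __GFP_WAIT is not set, so the allocator never
     * sleeps; __GFP_HIGH lets it dip into the emergency pools. */
    static struct page *alloc_in_irq_context(void)
    {
        return alloc_pages(GFP_ATOMIC, 0);
    }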

4. Slab Allocator

    The slab allocator manages smaller memory areas and specific kernel objects on top of the physical page-frame allocator. The basic idea of slab is to first use the page allocator to obtain one or more contiguous physical pages, and then divide those pages into multiple equally sized small memory units to satisfy requests for small amounts of memory.

    The slab allocator groups objects into caches, each cache containing objects of the same type. A cache is represented by struct kmem_cache_s. The memory of a cache is divided into slabs; each slab consists of one or more contiguous page frames and contains both allocated and free objects.

    struct kmem_cache_s {
        struct kmem_list3 lists;
        unsigned int  objsize;
        unsigned int   flags; /* constant flags */
        unsigned int  num; /* # of objs per slab */
        unsigned int  free_limit; /* upper limit of objects in the lists */
        spinlock_t  spinlock;

        /* order of pgs per slab (2^n) */
        unsigned int  gfporder;

        const char  *name;
        struct list_head next;
    };

    struct kmem_list3 {
        struct list_head slabs_partial; /* partial list first, better asm code */
         struct list_head slabs_full;
         struct list_head slabs_free;
         unsigned long free_objects;
         int  free_touched;
         unsigned long next_reap;
         struct array_cache *shared;
       };

       struct slab {
          struct list_head list;
          unsigned long  colouroff;
          void   *s_mem;  /* including colour offset */
          unsigned int  inuse;  /* num of objs active in slab */
          kmem_bufctl_t  free;
       };

    size_cache:

    These are what the Linux kernel calls the general-purpose (default) caches. They are the basis of the kmalloc implementation.
     struct cache_sizes {
        size_t cs_size;
        kmem_cache_t *cs_cachep;
        kmem_cache_t *cs_dmacachep;
     };

    struct cache_sizes malloc_sizes[] = {
        #define CACHE(x) { .cs_size = (x) },
        #if (PAGE_SIZE == 4096)
        CACHE(32)
        #endif
        CACHE(64)
        #if L1_CACHE_BYTES < 64
        CACHE(96)
        #endif
        CACHE(128)
        #if L1_CACHE_BYTES < 128
        CACHE(192)
        #endif
        CACHE(256)
        CACHE(512)
        CACHE(1024)
        CACHE(2048)
        CACHE(4096)
        CACHE(8192)
        CACHE(16384)
        CACHE(32768)
        CACHE(65536)
        CACHE(131072)
       #ifndef CONFIG_MMU
        CACHE(262144)
        CACHE(524288)
        CACHE(1048576)
       #ifdef CONFIG_LARGE_ALLOCS
        CACHE(2097152)
        CACHE(4194304)
        CACHE(8388608)
        CACHE(16777216)
        CACHE(33554432)
       #endif /* CONFIG_LARGE_ALLOCS */
       #endif
        { 0, }
        #undef CACHE
    };

5. Allocation API

    The kmalloc function is the most frequently used memory allocation function in drivers. The memory it allocates is physically contiguous, and the function does not zero the memory. It is built on the slab allocator, and its implementation is mainly based on the general-purpose size caches (size_cache).

    /* simplified sketch of the size-cache based kmalloc implementation */
    void *kmalloc(size_t size, int flags)
    {
        struct cache_sizes *csizep = malloc_sizes;
        struct kmem_cache *cachep;

        /* walk the size caches until one is large enough */
        while (size > csizep->cs_size)
            csizep++;

        /* a GFP_DMA request would pick cs_dmacachep instead */
        cachep = csizep->cs_cachep;

        return kmem_cache_alloc(cachep, flags);
    }
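
    A typical usage pattern in a driver might look like the sketch below (struct demo_dev and its fields are made up for illustration):

    #include <linux/slab.h>
    #include <linux/gfp.h>

    struct demo_dev {              /* hypothetical driver structure */
        int id;
        char buf[64];
    };

    static struct demo_dev *demo_create(void)
    {
        /* physically contiguous, not zeroed; kzalloc() would zero it */
        struct demo_dev *dev = kmalloc(sizeof(*dev), GFP_KERNEL);

        if (!dev)
            return NULL;

        dev->id = 0;
        return dev;
    }

    static void demo_destroy(struct demo_dev *dev)
    {
        kfree(dev);                /* kfree(NULL) is a safe no-op */
    }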

    Allocation of kernel objects:

    In the Linux kernel source code, kmem_cache_create is widely used to create caches for kernel objects. Unlike the general-purpose size caches, it lets kernel module developers define a kmem_cache tailored to their specific requirements.

    struct kmem_cache* kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor) (void *))

    name is a string naming the kernel object cache, size specifies the size of the cached kernel object, align gives its alignment, flags select slab options, and ctor is an optional constructor invoked for newly allocated objects.


    /**
     * kmem_cache_alloc - Allocate an object
     */
    void * kmem_cache_alloc (kmem_cache_t *cachep, int flags)

    /**
     * kmem_cache_destroy - delete a cache
     * @cachep: the cache to destroy
      */
    int kmem_cache_destroy (kmem_cache_t * cachep)
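
    Putting these together, a hedged sketch of a dedicated cache for a hypothetical struct my_obj, using the current single-constructor form of the API (all names here are illustrative, not from the article):

    #include <linux/slab.h>
    #include <linux/list.h>
    #include <linux/errno.h>

    struct my_obj {                 /* hypothetical kernel object */
        int state;
        struct list_head node;
    };

    static struct kmem_cache *my_obj_cache;

    static int my_obj_cache_init(void)
    {
        /* one cache for objects of a single type and size, no constructor */
        my_obj_cache = kmem_cache_create("my_obj_cache",
                                         sizeof(struct my_obj), 0,
                                         SLAB_HWCACHE_ALIGN, NULL);
        return my_obj_cache ? 0 : -ENOMEM;
    }

    static struct my_obj *my_obj_alloc(void)
    {
        return kmem_cache_alloc(my_obj_cache, GFP_KERNEL);
    }

    static void my_obj_free(struct my_obj *obj)
    {
        kmem_cache_free(my_obj_cache, obj);
    }

    static void my_obj_cache_exit(void)
    {
        kmem_cache_destroy(my_obj_cache); /* all objects must be freed first */
    }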
