"Linux Kernel Design and Implementation" Reading Notes-Memory Management

Page

The kernel regards the physical page as the basic unit of memory management.

The memory management unit (MMU) is the hardware used to manage memory and convert virtual addresses to physical addresses.

The physical page is represented in the kernel by the page structure (include/linux/mm_types.h, slightly trimmed):

struct page {
    unsigned long flags;        /* Atomic flags, some possibly
                     * updated asynchronously */
    atomic_t _count;        /* Usage count, see below. */
    union {
        atomic_t _mapcount; /* Count of ptes mapped in mms,
                     * to show when page is mapped
                     * & limit reverse map searches.
                     */
        struct {        /* SLUB */
            u16 inuse;
            u16 objects;
        };
    };
    union {
        struct {
        unsigned long private;      /* Mapping-private opaque data:
                         * usually used for buffer_heads
                         * if PagePrivate set; used for
                         * swp_entry_t if PageSwapCache;
                         * indicates order in the buddy
                         * system if PG_buddy is set.
                         */
        struct address_space *mapping;  /* If low bit clear, points to
                         * inode address_space, or NULL.
                         * If page mapped as anonymous
                         * memory, low bit is set, and
                         * it points to anon_vma object:
                         * see PAGE_MAPPING_ANON below.
                         */
        };
        struct kmem_cache *slab;    /* SLUB: Pointer to slab */
        struct page *first_page;    /* Compound tail pages */
    };
    union {
        pgoff_t index;      /* Our offset within mapping. */
        void *freelist;     /* SLUB: freelist req. slab lock */
    };
    struct list_head lru;       /* Pageout list, eg. active_list
                     * protected by zone->lru_lock !
                     */
    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;          /* Kernel virtual address (NULL if
                       not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
};

Each physical page corresponds to a page structure.

flags : stores the status of the page; each bit represents one status flag. The individual flags are described in include/linux/page-flags.h;

_count : the reference count of the page; -1 means the page is not referenced by the kernel and is available for a new allocation. Kernel code should query this count through page_count() rather than reading the field directly;

virtual : the kernel virtual address of the page, i.e. the address at which the page is mapped into the kernel's virtual address space;

A page can be used by the page cache, in which case mapping points to the associated address_space object; it can also hold private data, referenced through private .

 

Zones

The kernel divides pages into different zones.

The kernel uses zones to group pages with similar properties.

Linux does this because:

  • Some hardware can only perform DMA to certain specific memory addresses;
  • On some architectures, the physically addressable range of memory is much larger than the virtually addressable range, so some memory cannot be permanently mapped into the kernel address space (for example, on x86-32 the virtual address space is only 4GB, but the installed physical memory can exceed 4GB);

Linux mainly uses 4 zones (include/linux/mmzone.h):

ZONE_DMA : the pages in this zone can be used to perform DMA operations;

ZONE_DMA32 : similar to ZONE_DMA, but the pages in this zone can only be used for DMA by 32-bit devices;

ZONE_NORMAL : contains normally mapped pages;

ZONE_HIGHMEM : contains "high memory", pages that cannot be permanently mapped into the kernel address space;

Which zones actually exist, and how memory is distributed among them, depends on the system architecture.

For example, on x86-32 the layout is roughly as follows:

  • ZONE_DMA : physical memory below 16MB;
  • ZONE_NORMAL : physical memory from 16MB to 896MB;
  • ZONE_HIGHMEM : physical memory above 896MB.

The division into zones has no physical meaning; it is just a logical grouping the kernel uses to manage pages.

When allocating memory, pages can be obtained from a specific zone. Some allocations require a particular zone, while others do not. A single allocation, however, cannot span zone boundaries: all of its pages must come from the same zone.

Not all architectures define all areas.

For example, x86-64 does not have ZONE_HIGHMEM, because the 64-bit address space is much larger than the total amount of memory currently supported.

Each zone is described by a zone structure, located in include/linux/mmzone.h; this structure is quite large.

When pages in high memory are mapped, they are mapped into the kernel portion of the address space, i.e. between 3GB and 4GB on x86-32.

The functions used for mapping high memory are (include/linux/highmem.h):

Permanent mapping:

static inline void *kmap(struct page *page)

Remove permanent mapping (unmap):

static inline void kunmap(struct page *page)

Temporary mapping:

static inline void *kmap_atomic(struct page *page, enum km_type idx)

Remove temporary mapping:

#define kunmap_atomic(addr, idx)    do { pagefault_enable(); } while (0)
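
For example, a high-memory page obtained earlier (assume page is a struct page *, e.g. returned by alloc_page(GFP_HIGHUSER)) might be mapped and unmapped like this (a minimal sketch):

#include <linux/highmem.h>

/* Permanent mapping: kmap() may sleep, so it is only usable in
 * process context. */
void *vaddr = kmap(page);
/* ... access the page contents through vaddr ... */
kunmap(page);

/* Temporary (atomic) mapping: does not sleep, so it is usable in
 * interrupt context; the slot is per-CPU and preemption stays
 * disabled until the matching kunmap_atomic(). */
void *kvaddr = kmap_atomic(page, KM_USER0);
/* ... short, non-sleeping critical section ... */
kunmap_atomic(kvaddr, KM_USER0);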

 

Kernel allocation functions

The low-level page allocation functions are declared in include/linux/gfp.h.
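
For example, the core allocation interfaces declared there are (2.6-era prototypes, shown slightly simplified):

struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);
#define alloc_page(gfp_mask)        alloc_pages(gfp_mask, 0)
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
#define __get_free_page(gfp_mask)   __get_free_pages((gfp_mask), 0)
unsigned long get_zeroed_page(gfp_t gfp_mask);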

The functions to release pages are:

extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
#define free_page(addr) free_pages((addr),0)
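
Putting allocation and release together (a minimal sketch; the error handling is only indicated schematically):

unsigned long addr;

/* Allocate 2^3 = 8 physically contiguous pages. */
addr = __get_free_pages(GFP_KERNEL, 3);
if (!addr)
    return -ENOMEM;     /* the allocation can fail */

/* ... use the eight pages starting at addr ... */

free_pages(addr, 3);    /* the order must match the allocation */

/* Alternatively, work with struct page directly: */
struct page *pg = alloc_page(GFP_KERNEL);
if (pg) {
    void *va = page_address(pg);    /* kernel virtual address; NULL for unmapped high memory */
    /* ... */
    __free_page(pg);
}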

To allocate memory in bytes, use kmalloc() (include/linux/slab_def.h):

static __always_inline void *kmalloc(size_t size, gfp_t flags)

The corresponding release function:

void kfree(const void *);
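
A typical usage pattern (a minimal sketch; struct dog is a hypothetical structure used only for illustration):

struct dog *p;

p = kmalloc(sizeof(struct dog), GFP_KERNEL);
if (!p)
    return -ENOMEM;     /* kmalloc() can fail; always check */

/* ... use p ... */

kfree(p);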

There are three kinds of flags (passed as gfp_mask / flags) involved in memory allocation:

Action modifiers, which specify how the allocation may behave (e.g. __GFP_WAIT, __GFP_IO, __GFP_FS);

Zone modifiers, which specify which zone to allocate from (e.g. __GFP_DMA, __GFP_DMA32, __GFP_HIGHMEM);

Type flags (e.g. GFP_KERNEL, GFP_ATOMIC, GFP_NOIO, GFP_USER, GFP_DMA).

A type flag is simply a predefined combination of the previous two; for example, GFP_KERNEL is (__GFP_WAIT | __GFP_IO | __GFP_FS).

As for when to use which flag: GFP_KERNEL is the usual choice for process-context code that is allowed to sleep, while GFP_ATOMIC must be used in interrupt context or anywhere else sleeping is forbidden, as in the sketch below.
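
A minimal sketch of the latter case (my_handler and BUF_SIZE are hypothetical names; the point is only that an interrupt handler cannot sleep):

#include <linux/interrupt.h>
#include <linux/slab.h>

static irqreturn_t my_handler(int irq, void *dev_id)
{
    /* Sleeping is forbidden here, so GFP_ATOMIC is required and the
     * allocation may fail; the handler must cope with that. */
    char *tmp = kmalloc(BUF_SIZE, GFP_ATOMIC);

    if (tmp) {
        /* ... use the buffer ... */
        kfree(tmp);
    }
    return IRQ_HANDLED;
}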

 

vmalloc()

Similar to kmalloc(), but the virtual addresses of the allocated memory are contiguous while the physical addresses need not be. (The memory allocated by kmalloc() is both virtually and physically contiguous.)

This is also how the user-space allocation function works: the pages returned by malloc() are contiguous in the process's virtual address space, but they are not guaranteed to be physically contiguous.

In most cases, only hardware devices need to get memory with consecutive physical addresses.

To the kernel, all memory appears to be contiguous logically.

Nevertheless, most kernel code uses kmalloc() rather than vmalloc(), mainly for performance reasons: vmalloc() must map the possibly discontiguous pages into a contiguous virtual range, which requires building page table entries and causes more TLB pressure.

vmalloc() is therefore used only when necessary, typically to obtain large blocks of memory.

The declaration of vmalloc() (include/linux/vmalloc.h):

extern void *vmalloc(unsigned long size);

This function may sleep, so it cannot be called from interrupt context. (Whether kmalloc() may sleep is determined by the flags passed to it.)

The corresponding release function is:

extern void vfree(const void *addr);
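
A minimal usage sketch:

/* Allocate a large, virtually contiguous (but possibly physically
 * discontiguous) buffer.  vmalloc() may sleep. */
char *buf = vmalloc(64 * 1024);
if (!buf)
    return -ENOMEM;

/* ... use buf ... */

vfree(buf);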

 

The slab allocator

The slab allocator acts as a cache layer for commonly used data structures, making frequent allocation and freeing of such objects efficient.

The slab layer puts different objects into different groups, called caches, each of which stores a different type of object.

There is one cache per object type. For example, one cache stores process descriptors, while another cache stores inode objects.

The kmalloc() interface is built on the slab layer and uses a set of general-purpose caches.

Each cache is divided into slabs.

A slab consists of one or more physically contiguous pages.

In general, a slab consists of only a single page.

Each cache can contain multiple slabs, and each slab is in one of three states: full, partially full, or empty.

When some part of the kernel needs a new object, it is allocated from a partially full slab; if there is no partially full slab, the object is allocated from an empty slab; if there is no empty slab either, a new slab must be created.

The relationship, in short: each cache contains several slabs, and each slab holds several objects of that cache's type.

Each cache is represented by kmem_cache:

struct kmem_cache {
/* 1) per-cpu data, touched during every alloc/free */
    struct array_cache *array[NR_CPUS];
/* 2) Cache tunables. Protected by cache_chain_mutex */
    unsigned int batchcount;
    unsigned int limit;
    unsigned int shared;
    unsigned int buffer_size;
    u32 reciprocal_buffer_size;
/* 3) touched by every alloc & free from the backend */
    unsigned int flags;     /* constant flags */
    unsigned int num;       /* # of objs per slab */
/* 4) cache_grow/shrink */
    /* order of pgs per slab (2^n) */
    unsigned int gfporder;
    /* force GFP flags, e.g. GFP_DMA */
    gfp_t gfpflags;
    size_t colour;          /* cache colouring range */
    unsigned int colour_off;    /* colour offset */
    struct kmem_cache *slabp_cache;
    unsigned int slab_size;
    unsigned int dflags;        /* dynamic flags */
    /* constructor func */
    void (*ctor)(void *obj);
/* 5) cache creation/removal */
    const char *name;
    struct list_head next;
/* 6) statistics */
#ifdef CONFIG_DEBUG_SLAB
    unsigned long num_active;
    unsigned long num_allocations;
    unsigned long high_mark;
    unsigned long grown;
    unsigned long reaped;
    unsigned long errors;
    unsigned long max_freeable;
    unsigned long node_allocs;
    unsigned long node_frees;
    unsigned long node_overflow;
    atomic_t allochit;
    atomic_t allocmiss;
    atomic_t freehit;
    atomic_t freemiss;
    /*
     * If debugging is enabled, then the allocator can add additional
     * fields and/or padding to every object. buffer_size contains the total
     * object size including these internal fields, the following two
     * variables contain the offset to the user object and its size.
     */
    int obj_offset;
    int obj_size;
#endif /* CONFIG_DEBUG_SLAB */
    /*
     * We put nodelists[] at the end of kmem_cache, because we want to size
     * this array to nr_node_ids slots instead of MAX_NUMNODES
     * (see kmem_cache_init())
     * We still use [MAX_NUMNODES] and not [1] or [0] because cache_cache
     * is statically defined, so we reserve the max number of nodes.
     */
    struct kmem_list3 *nodelists[MAX_NUMNODES];
    /*
     * Do not add fields after nodelists[]
     */
};

At the end of kmem_cache there is an array of kmem_list3 pointers; kmem_list3 contains three linked lists (mm/slab.c):

/*
 * The slab lists for all objects.
 */
struct kmem_list3 {
    struct list_head slabs_partial; /* partial list first, better asm code */
    struct list_head slabs_full;
    struct list_head slabs_free;
    unsigned long free_objects;
    unsigned int free_limit;
    unsigned int colour_next;   /* Per-node cache coloring */
    spinlock_t list_lock;
    struct array_cache *shared; /* shared per node */
    struct array_cache **alien; /* on other nodes */
    unsigned long next_reap;    /* updated without locking */
    int free_touched;       /* updated without locking */
};

These correspond to the partially full, full, and empty slabs respectively.

The slab descriptor structure is as follows (mm/slab.c):

struct slab {
    struct list_head list;
    unsigned long colouroff;
    void *s_mem;        /* including colour offset */
    unsigned int inuse; /* num of objs active in slab */
    kmem_bufctl_t free;
    unsigned short nodeid;
};

The slab descriptor is either allocated outside the slab or placed at the beginning of the slab itself.

When the slab allocator needs to create a new slab, it allocates the pages with the following function:

static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)

Internally it is built on __get_free_pages(), the low-level kernel page allocator.

The pages of a slab are released with:

static void kmem_freepages(struct kmem_cache *cachep, void *addr)

Create a new cache:

struct kmem_cache *
kmem_cache_create (const char *name, size_t size, size_t align,
    unsigned long flags, void (*ctor)(void *))

Destroy a cache:

void kmem_cache_destroy(struct kmem_cache *cachep)

Allocate objects from the cache:

void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)

Release the object:

void kmem_cache_free(struct kmem_cache *cachep, void *objp)
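
Putting these interfaces together (a minimal sketch; struct dog and dog_cachep are hypothetical names used only for illustration):

static struct kmem_cache *dog_cachep;

/* At initialization time: create a cache for struct dog objects. */
dog_cachep = kmem_cache_create("dog_cache", sizeof(struct dog),
                               0, SLAB_HWCACHE_ALIGN, NULL);
if (!dog_cachep)
    return -ENOMEM;

/* Allocate an object from the cache ... */
struct dog *d = kmem_cache_alloc(dog_cachep, GFP_KERNEL);
if (!d)
    return -ENOMEM;

/* ... and return it when done. */
kmem_cache_free(dog_cachep, d);

/* At shutdown: destroy the cache (all objects must already be freed). */
kmem_cache_destroy(dog_cachep);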

 

Per-CPU data

Operating systems that support SMP make use of per-CPU data, that is, data that belongs exclusively to a given CPU.

Per-CPU data can be stored in an array indexed by the CPU number; the following is an example:

#ifdef __ARCH_SYNC_CORE_ICACHE
unsigned long icache_invld_count[NR_CPUS];
void resync_core_icache(void)
{
    unsigned int cpu = get_cpu();
    blackfin_invalidate_entire_icache();
    icache_invld_count[cpu]++;
    put_cpu();
}
#endif /* __ARCH_SYNC_CORE_ICACHE */

get_cpu() disables kernel preemption, so the data cannot be corrupted by a preemption-related race until put_cpu() is called.

The 2.6 kernel added a new per-CPU data interface.

Per-CPU data at compile time (include/linux/percpu-defs.h, include/linux/percpu.h):

DEFINE_PER_CPU(type, name)
DECLARE_PER_CPU(type, name)
/*
 * Must be an lvalue. Since @var must be a simple identifier,
 * we force a syntax error here if it isn't.
 */
#define get_cpu_var(var) (*({               \
    preempt_disable();              \
    &__get_cpu_var(var); }))
/*
 * The weird & is necessary because sparse considers (void)(var) to be
 * a direct dereference of percpu variable (var).
 */
#define put_cpu_var(var) do {               \
    (void)&(var);                   \
    preempt_enable();               \
} while (0)

Per-CPU data at run time:

void *alloc_percpu(type);   /* a macro wrapping __alloc_percpu() */
extern void __percpu *__alloc_percpu(size_t size, size_t align);
extern void free_percpu(void __percpu *__pdata);
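
A minimal sketch of both interfaces (my_counter and struct my_stats are hypothetical names used only for illustration):

/* Compile-time per-CPU variable. */
DEFINE_PER_CPU(unsigned long, my_counter);

void bump_counter(void)
{
    get_cpu_var(my_counter)++;  /* disables preemption, yields this CPU's copy */
    put_cpu_var(my_counter);    /* re-enables preemption */
}

/* Run-time allocation: one instance of struct my_stats per CPU. */
struct my_stats __percpu *stats;

stats = alloc_percpu(struct my_stats);
if (!stats)
    return -ENOMEM;
/* ... access this CPU's copy via get_cpu_ptr(stats) / put_cpu_ptr(stats) ... */
free_percpu(stats);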

Benefits of per-CPU data:

  • Reduced need for locking, since each CPU accesses only its own copy;
  • Reduced cache invalidation (cache lines bouncing between processors);

The only safety requirement is that kernel preemption be disabled while the data is being accessed.

 

Source: blog.csdn.net/jiangwei0512/article/details/106150219