CMA technical principle analysis

foreword

This article introduces the technical principles of CMA (Contiguous Memory Allocator), walks through its initialization and allocation paths in the source code, and covers the related topics of page migration, the LRU (Least Recently Used) cache, and the PCP (per-CPU pages) cache.

1. Overview of CMA

What is CMA, and why do we need it?

The Linux buddy system manages memory at page granularity, with each page being 4 KB. It groups free memory blocks onto free_list linked lists by block size: the list for order n holds blocks of 2^n pages, i.e. 1 page, 2 pages, ..., 2^n pages. The maximum order is usually 10, corresponding to a free block of 4 MB. So when we request physically contiguous pages from the buddy system, the largest possible allocation is 4 MB, and even that becomes hard to satisfy once system memory is badly fragmented.

[figure: buddy system free_list organization by order]

Some peripherals on embedded systems, such as the GPU, camera, and HDMI, need a large amount of physically contiguous memory to work, and in many cases a 4 MB allocation is nowhere near enough. We could reserve a larger contiguous region with memblock, but memory reserved that way can only be used by the device, not by the buddy system, which wastes memory. CMA was born to solve this: we want to be able to hand the device a large contiguous region on demand, yet let the system use that memory while the device is idle, maximizing memory utilization.

The CMA allocator is used to allocate large contiguous blocks of memory. A physical memory region is reserved at system initialization. While no device driver is using it, the memory management subsystem allocates and manages movable pages from this region, serving them to applications and the kernel. When a device driver needs the region, the pages already allocated there are migrated away, and the region is used to satisfy the contiguous allocation.

In the following chapters, we walk through the CMA initialization, allocation, and page migration paths by reading the source code.

Note: the source code quoted in this article is from kernel 5.4; the code screenshots omit minor details and keep only the key code.

2. CMA main data structure and API

2.1 struct cma

Use struct cma to describe a CMA area:

[code screenshot: struct cma definition]

base_pfn: the starting page frame number (PFN) of the CMA area's physical address.

count: the number of pages in the CMA area

bitmap: Describes the allocation of pages in the cma area, 1 means allocated, and 0 means free.

order_per_bit: Indicates the number of pages represented by a bit in the bitmap (2^order_per_bit).

2.2 cma_init_reserved_mem

[code screenshot: cma_init_reserved_mem]

Takes the region of reserved memory at address base with size size and uses it to create and initialize a struct cma.

2.3 cma_init_reserved_areas

[code screenshot: cma_init_reserved_areas]

To improve memory utilization, this function marks the CMA memory and hands it back to the buddy system, which can then use it to satisfy movable page allocations.

2.4 cma_alloc

[code screenshot: cma_alloc]

Allocates count contiguous pages from the specified CMA area, aligned according to align.

2.5 cma_release

[code screenshot: cma_release]

Releases count contiguous previously allocated pages.

3. Analysis of the main process of CMA

3.1 CMA initialization process

3.1.1 System initialization:

During system initialization the CMA area must be created first, either via a reserved-memory node in the dts or via command-line parameters. Here we look at the commonly used dts reserved-memory method, where the physical memory region is described in the dts, for example:

[figure: dts reserved-memory node describing a CMA area]

linux,cma is the name of the CMA area.

compatible must be "shared-dma-pool".

reusable means the cma memory can be used by the buddy system.

size indicates the size of the cma area, in bytes

alignment specifies the address alignment size of the CMA area.

The linux,cma-default attribute indicates that the current cma memory will be used as the default cma pool for cma memory application.
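Putting the properties above together, a reserved-memory node might look like the following sketch (the addresses and sizes here are illustrative, not taken from the article's screenshot):

```dts
reserved-memory {
    #address-cells = <2>;
    #size-cells = <2>;
    ranges;

    linux,cma {
        compatible = "shared-dma-pool";
        reusable;                    /* buddy may use it for movable pages */
        size = <0x0 0x10000000>;     /* 256 MB */
        alignment = <0x0 0x400000>;  /* 4 MB, one pageblock */
        linux,cma-default;           /* default pool for CMA allocations */
    };
};
```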

During system startup, the kernel parses the dtb described above and registers the memory information. The call chain is:

setup_arch

       arm64_memblock_init

              early_init_fdt_scan_reserved_mem

                   __reserved_mem_init_node

[code screenshot: __reserved_mem_init_node]

__reserved_mem_init_node traverses the __reservedmem_of_table section looking for an entry whose compatible string matches the dts node (for CMA this is "shared-dma-pool"), then executes the corresponding initfn. Entries declared with RESERVEDMEM_OF_DECLARE are linked into the __reservedmem_of_table section, so control eventually reaches the function that macro registered, in this case rmem_cma_setup:

3.1.2 rmem_cma_setup

[code screenshot: rmem_cma_setup]

@1 cma_init_reserved_mem takes the reserved memory block at address base with size size, here using the address information parsed from the dtb, and uses it to create and initialize a struct cma. The code is very simple:

[code screenshot: cma_init_reserved_mem]

@2 If the dts specifies linux,cma-default, dma_contiguous_set_default points the default CMA area at this one. When dma_alloc_contiguous later allocates memory from CMA, it allocates from this area by default.

So far CMA is no different from any other reserved memory: it sits in memblock.reserved, and this reserved memory is not used by the buddy system. As mentioned earlier, to improve memory utilization the CMA memory must be marked and handed back to the buddy system, so that buddy can serve it as movable pages to applications and the kernel. This is implemented by cma_init_reserved_areas.

3.1.3 cma_init_reserved_areas

Late in kernel initialization, the initcall registered with core_initcall runs: cma_init_reserved_areas, which simply calls cma_activate_area. cma_activate_area allocates the bitmap according to the cma size, then calls init_cma_reserved_pageblock on every pageblock of the CMA area. See the source:

[code screenshot: cma_activate_area]

@1 The CMA area uses a bitmap to track the state of each page. cma_bitmap_maxno computes how much memory the bitmap requires, and the variable i holds how many pageblocks (4 MB each) the CMA area contains.

@2 Traverse all pageblocks in the CMA area.

@3 Make sure all pages of the CMA area fall within a single zone.

@4 Finally call init_cma_reserved_pageblock, which processes each pageblock: it sets the migrate type to MIGRATE_CMA, adds the pages to the buddy system, and updates the total number of pages managed by the zone. As follows:

[code screenshot: init_cma_reserved_pageblock]

@1 Clears the reserved flag that has been set on the page.

@2 Set migratetype to MIGRATE_CMA

@3 Calls __free_pages in a loop to release all pages of the CMA area to the buddy system.

@4 Updates the amount of memory managed by the buddy system.

At this point the CMA memory can be allocated through buddy. Inside the buddy system, when a request asks for movable pages with the CMA flag set, memory can be handed out from the CMA pageblocks:

[code screenshot: buddy allocation path using MIGRATE_CMA pageblocks]

3.2 CMA allocation process

Before reading the cma allocation code, let's look at its call graph, then analyze the flow and each function through the source.

[figure: cma_alloc call flow]

3.2.1 cma_alloc

[code screenshot: cma_alloc]

@1 The bitmap calculations obtain the maximum available bit number of the bitmap (bitmap_maxno), how many bits this allocation needs (bitmap_count), and so on.

@2 According to the bitmap information calculated above, find a free position from the bitmap.

@3 Some special situations (discussed later) often cause a CMA allocation to fail. When the allocation returns -EBUSY, we msleep(100) and retry, by default up to 5 times.

@4 First set the bitmap bits corresponding to the pages being allocated to 1, marking them as allocated.

@5 Use alloc_contig_range to perform the actual allocation; it is analyzed in detail in the following sections.

@6 If the allocation fails, the bitmap will be cleared.

3.2.2 "Batch processing" in the kernel: LRU cache and PCP cache

Before analyzing alloc_contig_range, let's cover two background topics: the LRU cache and the PCP cache. Reading the kernel source, you will find that the kernel likes "batch processing" tricks that improve efficiency and reduce locking overhead.

1. LRU cache

The classic LRU (Least Recently Used) linked-list algorithm works as follows:

[figure: active/inactive LRU lists]

Note: For a detailed introduction to the LRU algorithm, please refer to the previous article of Kernel Craftsman: Introduction to kswapd

Newly allocated pages are continually added to the ACTIVE LRU list, and pages are continually taken off the ACTIVE list and moved to the INACTIVE LRU list. Contention on the list lock (pgdat->lru_lock) is heavy; if pages were transferred one at a time, the fight for the lock would be severe.

To improve this, the kernel adds a per-CPU LRU cache (represented by struct pagevec). A page destined for an LRU list is first placed in the current CPU's LRU cache; once the cache is full (usually 15 pages), lru_lock is taken once and the whole batch is moved onto the LRU list at one time.

2. PCP (PER-CPU PAGES) cache

Since memory pages are a shared resource, frequent page allocation and release cost a lot in acquiring the zone lock (zone->lock) and in synchronization between CPUs. Also to improve this, the kernel adds a per-CPU page cache (represented by struct per_cpu_pages): each CPU takes a small batch of pages from buddy wholesale and keeps them locally. An allocation first tries the PCP cache, refilling from buddy only when it runs dry; a release likewise goes back into the PCP cache first, and pages are returned to the buddy system only when the cache is full.

Previously the kernel only supported PCP for order-0 pages; a recent patch series in the community adds per-cpu caching for order > 0.

[figure: per-cpu pages cache between CPUs and the buddy system]

3.2.3 alloc_contig_range function:

Let's see what alloc_contig_range does in the cma_alloc path:

In short, the goal is to turn a "dirty" contiguous block (already in use by various kinds of memory) into a clean one: reclaim or migrate away whatever occupies it, and hand the resulting clean contiguous block back to the caller, as shown below:

[figure: alloc_contig_range turning a dirty range into a clean one]

Walk through the code:

[code screenshot: alloc_contig_range]

@1 start_isolate_page_range: change the migrate type of the target block's pageblocks from MIGRATE_CMA to MIGRATE_ISOLATE. The buddy system never allocates pages from MIGRATE_ISOLATE pageblocks, so this prevents the pages from being handed out by buddy during the cma allocation.

@2 drain_all_pages: drain the per-cpu pages. As introduced above, the pages sitting in the PCP cache must be returned to buddy during this step.

@3 __alloc_contig_migrate_range: migrate the pages in use within the target block. Migration copies a page's contents to another memory area and updates all references to the page.

[code screenshot: __alloc_contig_migrate_range]

@3.1 lru_cache_disable: pages sitting in the LRU cache cannot be migrated, so the pages still in the pagevecs (about to be added to an LRU list but not yet on it) must first be flushed to the LRU lists, and the LRU caching is then turned off.

@3.2 isolate_migratepages_range: isolate the in-use pages in the range to be allocated, collect them on the cc list, and return the last page frame number scanned. The isolation mainly prevents the pages from being freed, or grabbed by the LRU reclaim path, during the subsequent migration.

@3.3 reclaim_clean_pages_from_list: For clean file pages, just reclaim them directly.

@3.4 migrate_pages: this is the main kernel interface for page migration; most migration-related paths in the kernel funnel into it. It migrates the movable physical pages onto newly allocated pages. It is described in detail in the next section.

@3.5 lru_cache_enable: migration is complete; re-enable the LRU pagevecs.

@4 undo_isolate_page_range: the reverse of @1; the pageblocks' migrate type is restored from MIGRATE_ISOLATE to MIGRATE_CMA.

Finally these pages are returned to the caller.

3.3 CMA release process

The cma_release code is very simple: it frees the pages back to buddy and clears the corresponding bits in the cma bitmap. Here is the code:

[code screenshot: cma_release]

4. Page Migration

If the system is to use memory in the CMA area, the pages allocated from it must be migratable, so that they can be moved out when a device claims the CMA region. So which pages can be migrated? There are two kinds:

1. Pages on the LRU lists. These are pages mapped into the address spaces of user processes, such as anonymous pages and file pages, all allocated from pageblocks whose buddy migrate type is movable.

2. Non-LRU but movable pages. Non-LRU pages are usually allocated for kernel space; to support migration, the driver must implement the relevant methods in page->mapping->a_ops. For example, pages of the common zsmalloc memory allocator support migration.

migrate_pages() is the main kernel interface for page migration, and most migration-related paths in the kernel funnel into it. As the figure below shows, migrate_pages() essentially: allocates a new page, unmaps the old page, copies the old page's contents to the new page, sets the new page's struct page attributes to match the old page's, remaps the new page in place of the old one, and finally frees the old page. Let's read its source below.

[figure: page migration overview]

4.1 migrate_pages:

migrate_pages function and parameters:

[code screenshot: migrate_pages signature]

from: list of pages to be migrated

get_new_page: pointer to apply for a new page function

put_new_page: pointer to the function that releases a new page

private: parameter passed to get_new_page; CMA does not use it and passes NULL

mode: the migration mode; CMA sets MIGRATE_SYNC. The modes are:

[code screenshot: enum migrate_mode definitions]

reason: the migration reason, recording which path triggered the migration, since many kernel paths use migrate_pages (memory compaction, hot-unplug, etc.). CMA passes MR_CONTIG_RANGE, meaning alloc_contig_range() is allocating contiguous memory.

Looking at the migrate_pages code: it traverses the from list and calls unmap_and_move on each page to perform the migration.

[code screenshot: migrate_pages]

4.2 unmap_and_move

[code screenshot: unmap_and_move]

The parameters of unmap_and_move are exactly the same as those of migrate_pages. It calls get_new_page to allocate a new page, then uses __unmap_and_move to migrate the old page onto it. We focus on __unmap_and_move.

4.3 __unmap_and_move:

[code screenshot: __unmap_and_move]

@1 Try to take the old page's PG_locked page lock. If another process already holds it, the trylock fails; in MIGRATE_ASYNC mode the page is then simply skipped. CMA migrates in MIGRATE_SYNC mode, so lock_page must be used here to wait for the lock.

@2 Handle a page under writeback, deciding from the migration mode whether to wait for writeback to finish. MIGRATE_SYNC_LIGHT and MIGRATE_ASYNC do not wait; cma's mode is MIGRATE_SYNC, so wait_on_page_writeback() is called to wait for the page's writeback to complete.

@3 For anonymous pages, in order to prevent the anon_vma data structure from being released during the migration process, page_get_anon_vma needs to be used to increase the anon_vma->refcount reference count.

@4 Get the page lock PG_locked of the new page, which can be obtained under normal circumstances.

@5 Determine whether the page is a non-LRU page. If so, it is handled by move_to_new_page, which calls back into the driver's migratepage method to perform the migration. If it is an LRU page, continue to @6.

[code screenshot: __unmap_and_move, continued]

@6 Use page_mapped() to check whether any user PTE maps the old page; if so, call try_to_unmap() to tear down all the old page's PTEs via the reverse-mapping mechanism.

@7 Call move_to_new_page to copy the content of the old page and the attribute data of the struct page to the new page. For the LRU page, move_to_new_page does two things by calling migrate_page: copying the attributes and page content of the struct page.

@8 Migrate the page table: remove_migration_ptes establishes the mapping relationship from new page to process through the reverse mapping mechanism.

@9 After migration completes, release PG_locked on both the old and the new page. For anonymous pages, also put_anon_vma to drop the anon_vma->refcount reference.

@10 For non-LRU pages, call put_page to drop the old page's reference count (_refcount minus 1). For ordinary LRU pages, putback_lru_page adds the newpage to the LRU list.

4.4 move_to_new_page

In @5 and @7, non-LRU and LRU pages respectively are copied through move_to_new_page; let's look at its implementation:

[code screenshot: move_to_new_page]

1. For non-LRU pages, this function calls back into the driver's migratepage method to perform the migration.

For example, the migration callback function will be registered in the zsmalloc memory allocator, and the migration process will call zs_page_migrate of zsmalloc to migrate the requested page. The zsmalloc memory allocator will not be discussed here. Interested readers can read the source code of zsmalloc.

[code screenshot: zsmalloc migration callback registration]

2. For an LRU page, calling migrate_page does two things: it copies the struct page attributes and the page contents.

[code screenshot: migrate_page]

@7.1 Copying the struct page attributes:

migrate_page_move_mapping first checks that the page's refcount matches expectations, then copies the page's mapping data, such as page->index, page->mapping, and the PG_swapbacked flag.

[code screenshot: migrate_page_move_mapping]

A word here about refcount: _refcount is the reference count in struct page, indicating how many times the page is referenced in the kernel. refcount == 0 means the page is free or about to be freed; refcount > 0 means the page has been allocated and is in use by the kernel, and will not be released for now.

Functions such as get_page, pin_user_pages, and get_user_pages are used in the kernel to raise _refcount, which prevents a page from being freed by other paths during certain operations (for example while being added to the LRU); a page holding such extra references simply cannot be migrated at that moment.

@7.2 Copying the page contents:

copy_highpage is very simple: it maps both pages with kmap, then copies the old page's memory to the new page.

[code screenshot: copy_highpage]

@7.3 migrate_page_states copies the page flags, such as PG_dirty and the other PG_XXX flags; this too is part of copying the struct page attributes.

4.5 Summary:

The entire migration process has been analyzed, and the flow chart is drawn as follows.

[figure: page migration flow chart]

5. Summary

From the analysis of the above chapters, we can see that the design of CMA is based on these two points:

1. When no device driver is using it, the CMA memory is handed to buddy for management; this is implemented in the initialization path cma_init_reserved_areas() and in cma_release().

2. When a device driver needs it, physically contiguous CMA memory is requested through cma_alloc. Pages in the range that buddy has already handed out as movable pages to applications or the kernel are "cleaned" away by reclaim or migration, and the resulting physically contiguous "clean" memory is returned to the device driver. The core implementation is in the alloc_contig_range() and migrate_pages() functions.

references

1. The code quoted and interpreted in this article is from kernel 5.4: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/?h=v5.4.234

2. "Run Linux Kernel"

3. Song Baohua: On Linux Page Migration (Page Migration) full version:

https://blog.csdn.net/21cnbao/article/details/108067917


Origin blog.csdn.net/feelabclihu/article/details/129457653