A brief analysis of Linux physical memory fragmentation

The kernel code referenced in this article is from Linux 4.19. Interested readers can follow this article together with the source code.

1. Overview of Linux physical memory fragmentation

What is Linux physical memory fragmentation? Linux physical memory fragmentation includes two types:

1. Internal fragmentation: the unused portion of memory space that has already been allocated to a user.

For example, a process needs 3 KB of physical memory and therefore requests 3 KB from the system. However, since the minimum allocation granularity of the Linux buddy system is 4 KB, 4 KB is allocated. The unused 1 KB is internal fragmentation.

(Figure: internal fragmentation of physical memory)

 

2. External fragmentation: small blocks of free memory in the system that cannot be put to use.

For example, the system has 16 KB of memory left, but that 16 KB consists of four 4 KB pages whose physical page frame numbers#1 are not consecutive. Even though 16 KB remains, the system cannot allocate contiguous physical memory larger than 4 KB. This situation is external fragmentation. This article focuses on external fragmentation of physical memory.

Note: #1 Physical page frame number (PFN): Linux manages physical memory in pages, and each page is numbered; this number is the page frame number. If two physical pages are contiguous, their page frame numbers are consecutive.
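As a quick illustration (a sketch only, not a kernel excerpt), the kernel helpers page_to_pfn() and pfn_to_page() convert between a struct page and its page frame number; two pages are physically contiguous exactly when their PFNs differ by one:

#include <linux/mm.h>

/* Sketch: two struct pages are physically adjacent when their page frame
 * numbers are consecutive. page_to_pfn() is the standard kernel helper. */
static bool pages_are_adjacent(struct page *a, struct page *b)
{
	return page_to_pfn(b) == page_to_pfn(a) + 1;
}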

(Figure: external fragmentation of physical memory)

2. Linux physical memory management framework

Before explaining the ins and outs of external fragmentation, we first need to understand how Linux manages physical memory. The Linux kernel uses the well-known buddy system allocator.

1. Design ideas

The core idea of the buddy system allocator is to organize the system's free pages into 11 block lists, managing blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 pages with consecutive physical page frame numbers. Each page is 4 KB, so the block sizes managed by the buddy system range from 4 KB to 4 MB, increasing by powers of two.
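In the kernel, these 11 lists are the free_area array inside struct zone, indexed by order; a simplified excerpt from include/linux/mmzone.h (Linux 4.19, most fields omitted):

/* Simplified from include/linux/mmzone.h (Linux 4.19); most fields omitted. */
#define MAX_ORDER 11		/* orders 0..10: 1 page (4 KB) up to 1024 pages (4 MB) */

struct free_area {
	struct list_head free_list[MIGRATE_TYPES];	/* one list per migrate type */
	unsigned long	 nr_free;			/* free blocks of this order */
};

struct zone {
	/* ... */
	struct free_area free_area[MAX_ORDER];		/* one entry per block order */
	/* ... */
};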

(Figure: Linux physical memory management framework)

2. Management logic

The framework of Linux physical page management is shown above. Since this article is about external fragmentation, it only gives a brief analysis of the buddy system; it does not cover details such as the per-cpu pageset. Interested readers can refer to the kernel source code.

Linux divides physical memory into different nodes and zones for management:

  • node: To support NUMA, where CPUs access different memory clusters at different speeds, Linux uses the node structure and divides physical memory into multiple memory nodes for management; on a UMA system there is only one node.
  • zone: To accommodate hardware limitations on different platforms, such as bus/DMA addressing restrictions on the 80x86 architecture, Linux divides the memory under each node into multiple zones; on current ARM platforms, multiple zones are no longer strictly necessary.

Within each zone, memory is organized into 11 block lists through the free_area array:

The free_area array has 11 entries, each managing a block list of a different size.

  • free_area[0] manages blocks of 2^0 = 1 page, i.e. 4 KB;
  • free_area[1] manages blocks of 2^1 = 2 pages with consecutive page frame numbers, i.e. 8 KB;
  • and so on, up to free_area[10], which manages 1024-page (4 MB) blocks.

The memory managed by each free_area is further subdivided by page type, such as unmovable pages and movable pages. Each page type has its own free_list, onto which the corresponding page structures are linked.
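The page types referred to here are the migrate types; a simplified excerpt of the enum from include/linux/mmzone.h (Linux 4.19; the CMA and isolation entries depend on configuration):

/* Simplified from include/linux/mmzone.h (Linux 4.19). */
enum migratetype {
	MIGRATE_UNMOVABLE,	/* cannot be moved or reclaimed (most kernel allocations) */
	MIGRATE_MOVABLE,	/* can be migrated (typical user pages) */
	MIGRATE_RECLAIMABLE,	/* cannot be moved, but can be reclaimed */
	MIGRATE_PCPTYPES,	/* number of types kept on the per-cpu lists */
	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
	/* MIGRATE_CMA / MIGRATE_ISOLATE follow when the matching options are enabled */
	MIGRATE_TYPES
};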

When allocating a page (ignoring the slow path), the buddy system obtains it through the following steps; a simplified sketch follows the list:

  • Based on the requested page type, find the corresponding memory node and zone;
  • Based on the requested size (order), find the free_area of the corresponding order;
  • Based on the requested page type (migrate type), find the corresponding free_list and allocate the page from it.
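A heavily simplified sketch of this fast path follows (not actual kernel code; the real logic lives in get_page_from_freelist() and __rmqueue() in mm/page_alloc.c and also handles watermarks, per-cpu lists and fallback types):

/* Heavily simplified sketch of the buddy fast path; splitting of larger
 * blocks and all error handling are omitted. */
static struct page *buddy_alloc_sketch(struct zone *zone, unsigned int order,
					int migratetype)
{
	unsigned int current_order;

	/* Try the requested order first, then progressively larger blocks. */
	for (current_order = order; current_order < MAX_ORDER; current_order++) {
		struct free_area *area = &zone->free_area[current_order];
		struct page *page;

		page = list_first_entry_or_null(&area->free_list[migratetype],
						struct page, lru);
		if (!page)
			continue;

		list_del(&page->lru);
		area->nr_free--;
		/* A larger-than-needed block would be split here and the unused
		 * halves returned to the lower-order free lists (omitted). */
		return page;
	}
	return NULL;	/* fast path failed; the slow path would take over */
}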

When a page is released back to the buddy system, the steps are as follows (a sketch of the buddy lookup follows the list):

  • Based on the page type, find the corresponding memory node and zone;
  • Check whether there is a free block adjacent in page frame numbers (its buddy) that can be merged with the released block into a larger block. The conditions for merging are:
  • the physical page frames of the two blocks are contiguous;
  • the two blocks have the same migrate type and the same size (order);
  • the physical address of the first page of the merged block is a multiple of 2 × block size × 4 KB (i.e. the merged block is naturally aligned to its own size).

  • Based on the released (or merged) size, find the free_area of the corresponding order;
  • Based on the page type, find the corresponding free_list and insert the page into it.
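The buddy of a block is located with a simple XOR on its page frame number; in Linux 4.19 this is what __find_buddy_pfn() in mm/internal.h computes:

/* Mirrors __find_buddy_pfn() (mm/internal.h, Linux 4.19): a free block of
 * 2^order pages starting at pfn has its buddy at pfn ^ (1 << order); if the
 * buddy is also free, of the same order and type, the two merge into a block
 * starting at (pfn & buddy_pfn), naturally aligned to twice the block size. */
static unsigned long find_buddy_pfn_sketch(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}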


3. Linux measures against external fragmentation of physical memory

From "2. Linux physical memory management architecture", we can find that the partner system memory management framework can effectively improve physical memory external fragmentation, because the partner system has the following two management logics, which can reduce the occurrence of external fragmentation:

  • small allocations are served from the small-block lists, which reduces the chance of polluting the large-block lists;
  • when memory is released, the allocator tries to merge it with its buddy, which helps re-form large contiguous blocks.

In addition, the kernel provides the following mechanisms to further reduce external fragmentation (only the main ones are listed):

1. Memory compaction

(1) Principle of memory compaction

Linux memory compaction (physical page defragmentation) is similar in spirit to disk defragmentation. It relies mainly on the kernel's page migration mechanism: movable pages are migrated in order to free up contiguous physical memory.

Suppose there is a very small memory zone as follows:

Blue represents free pages and white represents allocated pages. The free (blue) pages in this zone are very scattered, so a contiguous physical region larger than two pages cannot be allocated.

The following illustrates the simplified working principle of memory compaction. The kernel runs two independent scanners. The first scanner starts from the bottom of the zone and records the allocated movable (MOVABLE) pages it finds in a list:

The second scanner starts from the top of the zone and records free pages that can serve as migration targets in another list:

When the two scanners meet in the middle of the zone, the scan ends; the allocated pages collected by the first scanner are then migrated into the free pages collected by the second. This leaves a contiguous region of free physical memory on the left (bottom) of the zone, completing compaction.
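A pseudocode sketch of the two-scanner idea is below (greatly simplified; the real implementation is compact_zone() in mm/compaction.c, where isolate_migratepages(), isolate_freepages() and migrate_pages() do the actual work; the collect_*/move_* helpers here are invented stand-ins):

/* Greatly simplified pseudocode of compaction's two scanners; the helpers
 * below are invented stand-ins for isolate_migratepages(), isolate_freepages()
 * and migrate_pages() in mm/compaction.c. */
static unsigned long collect_movable_pages(struct zone *zone, unsigned long pfn,
					   struct list_head *out);
static unsigned long collect_free_targets(struct zone *zone, unsigned long pfn,
					  struct list_head *out);
static void move_pages(struct list_head *movable, struct list_head *targets);

static void compact_zone_sketch(struct zone *zone)
{
	unsigned long migrate_pfn = zone->zone_start_pfn;	/* bottom of the zone, scans upward */
	unsigned long free_pfn = zone_end_pfn(zone);		/* top of the zone, scans downward  */
	LIST_HEAD(movable_pages);	/* allocated MOVABLE pages to be moved */
	LIST_HEAD(free_targets);	/* free pages that will receive them   */

	while (migrate_pfn < free_pfn) {
		migrate_pfn = collect_movable_pages(zone, migrate_pfn, &movable_pages);
		free_pfn = collect_free_targets(zone, free_pfn, &free_targets);
		move_pages(&movable_pages, &free_targets);	/* migrate toward the top */
	}
	/* The scanners have met: contiguous free memory is left at the bottom. */
}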

(2) How to use

To enable memory compaction, the kernel must be built with the relevant option, CONFIG_COMPACTION (default y).

With this option enabled, compaction is triggered automatically: when a process fails a high-order allocation and has completed direct_reclaim#1 (the costly_order case is not analyzed here), the system decides, based on the remaining memory, whether to trigger compaction.

Note: #1 direct_reclaim: when a process finds during allocation that memory is insufficient, it starts direct memory reclaim. In this mode allocation and reclaim are synchronous, meaning the allocating process blocks waiting for reclaim to finish.

The kernel also provides an interface that lets users trigger compaction manually:

/proc/sys/vm/compact_memory

Writing a value (typically 1) to this node, for example with echo 1 > /proc/sys/vm/compact_memory, triggers compaction of the memory managed by all nodes in the system.
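For instance, a minimal user-space sketch (root privileges required; equivalent to the echo command above):

/* Minimal user-space sketch: trigger system-wide memory compaction. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

	if (!f) {
		perror("fopen /proc/sys/vm/compact_memory");
		return 1;
	}
	fputs("1\n", f);	/* writing to the node compacts memory on all nodes */
	fclose(f);
	return 0;
}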

2. kcompactd

(1) kcompactd design principle

kcompactd is a per-node kernel background thread dedicated to compaction. Its difference from the memory compaction described above is:

memory compaction is triggered either after an allocation enters direct_reclaim (the costly_order case is not analyzed here), where the system decides based on the remaining memory whether to compact, or manually by the user;

kcompactd performs compaction proactively: it is woken either on the path that wakes kswapd or when kswapd is about to go to sleep.

kcompactd is triggered mainly along two paths:

Before waking kswapd: compaction is triggered when this allocation does not support direct_reclaim and the memory node is already balanced or the number of kswapd failures has reached MAX_RECLAIM_RETRIES (default 16).

When kswapd is about to go to sleep: before it sleeps, kswapd wakes kcompactd so that background compaction can run while the system is otherwise idle.
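A condensed paraphrase of the two wake-up paths (based on mm/vmscan.c in Linux 4.19; not a verbatim excerpt):

/* Condensed paraphrase of the kcompactd wake-up paths (mm/vmscan.c, Linux 4.19). */

/* Path 1: on the allocation path, wakeup_kswapd() hands over to kcompactd when
 * the caller cannot perform direct reclaim and reclaim looks pointless. */
void wakeup_kswapd_paraphrase(pg_data_t *pgdat, gfp_t gfp_flags, int order, int classzone_idx)
{
	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
	    pgdat_balanced(pgdat, order, classzone_idx)) {
		/* Memory may be free but too fragmented for this high-order request. */
		if (!(gfp_flags & __GFP_DIRECT_RECLAIM))
			wakeup_kcompactd(pgdat, order, classzone_idx);
		return;
	}
	/* ... otherwise wake kswapd itself ... */
}

/* Path 2: just before kswapd sleeps, kswapd_try_to_sleep() wakes kcompactd so
 * background compaction can run while the system is quiet. */
void kswapd_try_to_sleep_paraphrase(pg_data_t *pgdat, int alloc_order, int classzone_idx)
{
	if (prepare_kswapd_sleep(pgdat, alloc_order, classzone_idx)) {
		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
		/* ... kswapd then goes to sleep ... */
	}
}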

(2) How to use

As with memory compaction, kcompactd requires the kernel to be built with the relevant option, CONFIG_COMPACTION (default y).

3. Other optimization ideas

The kernel is continuously being optimized, so why does Linux still suffer from external fragmentation? Because external fragmentation can be mitigated but not eradicated. In the current kernel, I think there are two main causes of physical memory fragmentation:

  • unmovable pages pollute the memory layout, causing compaction to fail;
  • as the system keeps allocating and releasing pages, the page frame numbers handed out by the buddy system become increasingly random, raising the probability that free memory ends up in isolated pieces and increasing fragmentation, as explained in (2) below.

For these two causes, the following measures may provide some improvement:

(1) Reduce pollution of memory by UNMOVABLE pages

  • Limit page stealing by unmovable pages

Linux memory allocation supports a fallback mechanism, also called page stealing. Its purpose is to avoid an imbalance of free memory across page types within the same zone. For example, if page type A has plenty of free memory while page type B has almost none, then without page stealing every type-B allocation would fall into the slow path. With page stealing, within the same zone, if there are no free UNMOVABLE pages but there are free MOVABLE pages, an UNMOVABLE request is allowed to be satisfied from the MOVABLE free list.

The fallbacks array below specifies which page types each type may steal from. For example, the first row shows that UNMOVABLE allocations may steal RECLAIMABLE and MOVABLE pages; the other rows are similar.
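The array is fallbacks[][] in mm/page_alloc.c; a simplified excerpt (Linux 4.19; CMA and isolation rows omitted, MIGRATE_TYPES terminates each row):

/* Simplified from mm/page_alloc.c (Linux 4.19); CMA/isolation rows omitted.
 * Each row lists, in order of preference, the types that may be stolen from. */
static int fallbacks[MIGRATE_TYPES][4] = {
	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
};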

If unmovable allocations steal pages frequently, they quickly pollute the memory layout, especially memory that belongs to movable pages, which significantly hurts the success rate of compaction. Based on this, a size limit can be added to the page stealing mechanism: only allow unmovable allocations to steal when they request large blocks, thereby reducing the pollution caused by unmovable pages.
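One way to express this idea (a purely hypothetical sketch, not existing kernel code; min_steal_order is an invented threshold):

/* Hypothetical sketch of the proposed restriction; min_steal_order is invented. */
static bool unmovable_may_steal(unsigned int order, int start_migratetype,
				unsigned int min_steal_order)
{
	/* Let UNMOVABLE allocations fall back to other types only for large
	 * requests, so small unmovable allocations do not scatter themselves
	 * across memory that movable pages (and compaction) depend on. */
	if (start_migratetype == MIGRATE_UNMOVABLE && order < min_steal_order)
		return false;
	return true;
}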

  • Automatically repay pages stolen by unmovable allocations

This is a compensation scheme applied after unmovable allocations have stolen pages. When an unmovable allocation steals a page, the page is recorded in a list; later, when free pages of the unmovable type become available again, the stolen page is migrated back to the unmovable type's own memory, reducing the pollution caused by unmovable pages.

(2) Reduce the randomness of allocated page frame numbers

Assume a small memory zone. Which of the following two allocation strategies causes more severe memory fragmentation?

  • Pages are allocated from the head to the tail;
  • Pages are randomly assigned from any location.

Clearly, the second allocation strategy is less friendly with respect to external fragmentation, and this is a problem the buddy system has not yet solved. Although the buddy system organizes memory into blocks of various sizes, so that small allocations are served from the small-block lists and try not to pollute the large-block lists, it cannot guarantee which range of physical page frames a small allocation will come from. The longer the system runs, the more random the page frame numbers given to small allocations become, and the higher the probability of fragmentation.

  • Reservation

Given this situation, a corresponding optimization can be made through reservation.

Reserve a region of memory dedicated to small allocations. This effectively reduces the randomness of the page frame numbers used for small allocations, and thus the probability that small allocations contaminate the rest of memory.

Reserve a region of memory dedicated to large allocations. This reduces the probability that the reserved memory is polluted by small allocations, improving the success rate of large allocations from the reserved region.

Original author: Kernel Craftsman

 

Source: blog.csdn.net/youzhangjing_/article/details/132543311