30 pictures take you through the essence of glibc memory management

When I was browsing Zhihu recently, I saw a post, as follows:

[image: screenshot of the Zhihu question]

The answers below it either missed the point or lacked depth. So, in this article, I will take an in-depth look at C/C++ memory management.

1 Preface

Source code analysis itself is very boring, especially when it comes to writing it into an easy-to-understand article.

This article tries its best to analyze from the reader's perspective, focusing on the points that everyone is concerned about. When necessary, some source codes will be posted to deepen everyone's understanding. Through this article, everyone can understand the essential principles of memory allocation and release.

The following content is full of useful information, and it will be a rewarding process for you and me. Mainly from the memory layout, glibc memory management, malloc implementation and free implementation, we will show you the essence of glibc memory management. Finally, solutions to the problems in the project are pointed out. The outline is as follows:

[figure: main content outline]

2 Background

A few years ago, at my previous company, I worked on a project; let's call it SeedService. SeedService reads the repost, comment, and like data of the feed stream from Kafka and loads it into memory. Because the data differs every day, the Kafka topics are also partitioned by day. Internally, the server works roughly like a double-buffer scheme: at midnight, the previous day's data is released.

The project is online and everything is running normally.

But after a few days, the process started disappearing for no apparent reason. I began to locate the problem and eventually found that a sudden spike in memory usage triggered an OOM, and the process was killed by the operating system.

After figuring out why the process disappeared, we started hunting for memory leaks. Even after fixing several potential leaks, the system would still see a sudden memory spike when running for a long time under high load and high concurrency, and the process would eventually be OOM-killed.

Memory management involves no more than three layers: the user (application) layer, the C runtime library layer, and the operating system layer. We observed the sharp memory growth at the operating system layer, and at the same time confirmed there was no leak at the user layer, so we suspected the C runtime library — that is, glibc's memory management — was causing the process's memory to balloon.

The problem was thus narrowed down to glibc's memory management. Only by answering the following questions could we solve the disappearing SeedService process:

  • Under what circumstances does glibc not return memory to the operating system?

  • What are the constraints on glibc's memory management method? What kind of memory allocation scenarios is it suitable for?

  • Is the memory management method in our system contrary to the memory management constraints of glibc?

  • How does glibc manage memory?

With the above questions in mind, I spent nearly a month analyzing the memory management code of the glibc runtime library. Today I have sorted out the notes from that time, hoping it will be useful to everyone.

3 Basics

When the Linux system loads a program file in ELF format, it calls the loader to map each segment of the executable into the address space, starting at certain addresses.

User programs can manage the heap and the mmap mapping area directly with system calls, but more often they use the malloc() and free() functions provided by the C library to allocate and release memory dynamically. The stack is the only memory area users can access without mapping it first, which is also the basis of stack-overflow attacks.

3.1 Process memory layout

Computer systems are divided into 32-bit and 64-bit, and the process layout of 32-bit and 64-bit is different. Even if they are both 32-bit systems, their layout depends on the kernel version.

Before introducing the detailed memory layout, we first describe several concepts:

  • Stack - stores local variables and function parameters during program execution; grows from high addresses to low addresses

  • Heap - dynamic memory allocation area, managed through functions such as malloc/new and free/delete

  • Uninitialized data area (BSS) - stores uninitialized global variables and static variables

  • Data area (Data) - stores global variables and static variables that have initial values in the source code

  • Code area (Text) - stores the read-only program code, that is, machine instructions

3.1.1 32-bit process memory layout
classic layout

Before Linux kernel 2.6.7, the process memory layout was as shown in the figure below.

[figure: 32-bit classic layout]

In this layout, the mmap area and the stack grow toward each other, which leaves the heap only about 1 GB of virtual address space. If the heap keeps growing, it runs into the mmap mapping area, which is obviously not what we want. This is a limitation of the 32-bit address space, so the kernel introduced another form of virtual address space layout. 64-bit systems, however, keep this classic layout, because their virtual address space is huge.

default layout

Due to the space limitations of the classic memory layout, the default process layout shown below was introduced starting with kernel 2.6.7.

[figure: 32-bit default layout]

As you can see from the figure, the stack grows from top to bottom and is bounded. The heap grows from bottom to top, and the mmap mapping area grows from top to bottom; the heap and the mmap area grow toward each other until the remaining virtual address space is exhausted. This structure makes it convenient for the C runtime library to use both the heap and the mmap area for memory allocation.

3.1.2 64-bit process memory layout

As mentioned before, thanks to its ample address space and ease of implementation, the 64-bit process memory layout follows the 32-bit classic layout, as shown in the following figure:

[figure: 64-bit process layout]

3.2 Operating system memory allocation function

When introducing the memory layout, we mentioned that the heap and the mmap mapping area are the virtual memory regions available to user programs. So how do we obtain memory in these regions?

The operating system provides related system calls to complete memory allocation work.

  • For heap operations, the operating system provides the brk() system call, and the C runtime library provides the sbrk() function.

  • For the operation of mmap mapping area, the operating system provides mmap() and munmap() functions.

sbrk(), brk() or mmap() can all be used to add additional virtual memory to our process. glibc uses these functions to apply for virtual memory from the operating system to complete memory allocation.

A very important concept appears here: lazy (delayed) allocation. The physical mapping for an address is established only when the address is actually accessed; this is one of the basic ideas of Linux memory management. When a user requests memory, the Linux kernel only hands out a linear region (that is, virtual memory) and does not allocate actual physical memory; only when the user touches that memory does the kernel map concrete physical pages to it, so precious physical memory is occupied only at that point. The kernel releases physical memory by releasing a linear region: it finds the physical pages mapped there and frees them all.

[figure: memory allocation]

The memory structure of the process is represented by mm_struct in the kernel, and its definition is as follows:

struct mm_struct {
    ...
    unsigned long (*get_unmapped_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
    ...
    unsigned long mmap_base;        /* base of mmap area */
    unsigned long task_size;        /* size of task vm space */
    ...
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;
    ...
};

In the above mm_struct structure:

  • [start_code,end_code) represents the address space range of the code segment.

  • [start_data, end_data) represents the address space range of the data segment.

  • [start_brk, brk) represents the start of the heap segment and the current heap pointer (the program break).

  • [start_stack, end_stack) represents the address space range of the stack segment.

  • mmap_base represents the starting address of the memory mapping segment.

The basic dynamic memory allocation function in C is malloc(), which on Linux is ultimately implemented via the kernel's brk system call. brk() is a very simple system call: it merely changes the value of the brk member of the mm_struct structure.

[figure: mm_struct]

3.2.1 Heap operation

As mentioned before, there are two functions that can request memory directly from the heap: brk() is a system call, and sbrk() is a C library function.

System calls usually provide minimal functionality, while library functions build richer features on top of them. In glibc, malloc allocates and releases memory by calling sbrk() to move the program break, the upper bound of the data segment. The sbrk() function maps additional virtual address space, under kernel management, for malloc() to hand out.

The following is the declaration of the brk() function and sbrk() function.

#include <unistd.h>
int brk(void *addr);

void *sbrk(intptr_t increment);

It should be noted that when the increment argument of sbrk() is 0, sbrk() returns the process's current brk value. A positive increment raises the break; a negative increment lowers it.

3.2.2 MMap operation

On Linux, we can use mmap to allocate address space within the process's virtual address space and establish a mapping to physical memory.

[figure: shared memory]

The mmap() function maps a file or other object into memory. A file is mapped in whole pages; if the file size is not a multiple of the page size, the unused remainder of the last page is zero-filled.

munmap performs the opposite operation, deleting the object mapping for a specific address range.

The function is defined as follows:

#include <sys/mman.h>
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset); 

int munmap(void *addr, size_t length);

Mapping relationships are divided into the following two types:

  • File mapping: a disk file is mapped into the process's virtual address space, and the file's contents initialize the corresponding physical memory.

  • Anonymous mapping: the memory is initialized to all zeros.

By whether the mapping is shared, mappings can be divided into:

  • Private mapping (MAP_PRIVATE)

    • Modifications are private to each process and are not reflected in the actual file on disk. It is a copy-on-write mapping.

  • Shared mapping (MAP_SHARED)

    • Data is shared between multiple processes, and modifications are reflected in the actual file on the disk.

Therefore, the entire mapping relationship can be summarized into the following four types:

  • Private file mapping

    • Multiple processes initialize from the same physical pages backing the file, but each process's modifications to the mapped memory are not shared, nor are they reflected in the physical file.

  • private anonymous mapping

    • mmap creates a new mapping that is not shared between processes. This is mainly used to allocate memory (malloc calls mmap for large allocations). It is also what happens on fork: each process gets its own virtual address space, the physical pages behind those addresses are shared between processes for reads, and copy-on-write applies on writes.

  • Shared file mapping

    • Multiple processes share the same physical memory space through virtual memory technology. Modifications to memory files will be reflected in the actual physical files, which is also a mechanism for inter-process communication (IPC).

  • Shared anonymous mapping

    • This mechanism does not use copy-on-write when forking. The parent and child processes completely share the same physical memory page, which also realizes parent-child process communication (IPC).

It is worth noting here that mmap only allocates address space in virtual memory, and only allocates physical memory when virtual memory is accessed for the first time.

After mmap returns, the file contents are not yet loaded into physical pages; only address space has been reserved. When the process first accesses an address in the mapping, the page-table lookup finds that the corresponding page is not resident in physical memory, a "page fault" occurs, the kernel's page-fault handler runs, and the relevant portion of the file is read into physical memory. Pages are loaded on demand in 4096-byte units: in principle only the faulting page is loaded, although kernel scheduling policies such as readahead may pull in more than required.

The following content will be the focus of this article. It is crucial to understand the memory layout and the memory allocation principle of glibc later. If necessary, you can read it several times.

4 Overview

Earlier, we mentioned that there are two functions for allocating memory on the heap, namely the brk() system call and the sbrk() C runtime library function, and that the mmap function allocates memory in the memory mapping area.

Now assume that every allocation went straight to brk(), sbrk(), or mmap(). If a program frequently allocates and frees memory and every request goes directly to the operating system, the performance cost is easy to imagine.

This introduces a concept, "memory management".

The outline of this section is as follows:

[figure: outline of this section]

4.1 Memory management

Memory management refers to the technology of allocating and using computer memory resources when software is running. Its main purpose is to allocate efficiently and quickly, and to release and reclaim memory resources at the appropriate time.

A good memory manager needs the following characteristics:

1. Cross-platform and portable. Normally the memory manager requests memory from the operating system and then re-allocates it, so it must accommodate different operating systems and give users a uniform experience across platforms.

2. Low space overhead. A memory manager that wastes a lot of space managing memory is clearly not an excellent one. Memory fragmentation is the main culprit of wasted space: a large number of small, discontiguous free blocks may add up to a lot of memory, yet none of it can be used.

3. Fast. The fundamental reason to use a memory manager at all is fast allocation and release.

4. Debugging support. For C/C++ programmers, memory errors are a nightmare — your last memory error is probably still fresh in your mind. A powerful, easy-to-use debugging facility in the memory manager is invaluable, especially in embedded environments where memory-error detection tools are scarce.

4.2 Management methods

Memory management methods fall into two categories: manual management and automatic management.

Manual management means the user requests memory with functions such as malloc and must call free to release it. If used memory is never released, there is a memory leak and the process occupies ever more system memory; if memory is released while still in use, the result is a dangerous dangling pointer: the memory other objects point to has already been reclaimed or reused by the system.

Automatically managed memory is automatically managed by the memory management system of the programming language. In most cases, it does not require user participation and can automatically release memory that is no longer used.

4.2.1 Manual management

Manual memory management is the more traditional approach. System-level programming languages such as C/C++ include no automatic memory management in the narrow sense; users must actively request and release memory. Experienced engineers can judge the timing of allocation and release accurately, and as long as that manual strategy is accurate enough, manual memory management improves program performance without causing memory-safety problems.

However, engineers experienced enough to always get allocation and release exactly right are rare, and wherever there is manual handling there will be mistakes: memory leaks and dangling pointers are endemic in languages such as C/C++. Manual memory management also consumes a great deal of an engineer's attention — deciding whether an object belongs on the stack or the heap, and when heap memory should be freed — so maintenance costs are relatively high. That is the trade-off it imposes.

4.2.2 Automatic management

Automatic memory management is practically a standard feature of modern programming languages. Because the job of the memory-management module is well defined, automatic memory management can be introduced at the language's compile time or run time. The most common mechanism is garbage collection, though some languages also use automatic reference counting to assist memory management.

Automatic memory management saves engineers a great deal of time otherwise spent dealing with memory, letting them focus all their energy on core business logic and improving development efficiency. Under normal circumstances it solves memory leaks and dangling pointers very well, but it brings extra overhead and affects the language's runtime performance.

4.3 Common memory managers

1 ptmalloc: ptmalloc is a memory allocator affiliated with glibc (GNU Libc). Now in the Linux environment, the memory allocation (malloc/new) and release (free/delete) of the runtime library we use are provided by it.

2 BSD Malloc: BSD Malloc is an implementation released with 4.2 BSD and included in FreeBSD. This allocator can allocate objects from a pool of predetermined sized objects. It has size classes for sizes of objects that are powers of 2 minus some constant. So, if you request an object of a given size, it simply allocates a matching size class. This provides a fast implementation, but may waste memory.

3 Hoard: The goal of writing Hoard is to make memory allocation happen very fast in a multi-threaded environment. Therefore, its construction centers around the use of locks so that all processes do not have to wait for memory to be allocated. It can significantly speed up multi-threaded processes that do a lot of allocations and deallocations.

4 TCMalloc: A memory allocator developed by Google and used in many projects. For example, Golang uses a similar algorithm for memory allocation. It has the basic characteristics of a modern memory allocator: resistance to memory fragmentation and the ability to scale on multi-core processors. It is said that its memory allocation speed is several times that of malloc implemented in glibc2.3.

5 glibc memory management (ptmalloc)

Because the incident above involved memory allocated and released through the runtime library (new/delete), this article focuses on analyzing glibc's memory allocation library, ptmalloc.

The outline of this section is as follows:

[figure: outline of this section]

In C/C++, we allocate memory on the heap. So how is this heap represented in glibc?

Let’s first look at the structure declaration of the heap:

typedef struct _heap_info
{
  mstate ar_ptr;            /* Arena for this heap. */
  struct _heap_info *prev;  /* Previous heap. */
  size_t size;              /* Current size in bytes. */
  size_t mprotect_size;     /* Size in bytes that has been mprotected
                               PROT_READ|PROT_WRITE.  */
  /* Make sure the following data is properly aligned, particularly
     that sizeof (heap_info) + 2 * SIZE_SZ is a multiple of
     MALLOC_ALIGNMENT. */
  char pad[-6 * SIZE_SZ & MALLOC_ALIGN_MASK];
} heap_info;
In the above definition of the heap, ar_ptr is a pointer to the allocation area, and the heaps are connected in a linked list. I will describe the structural representation of the heap under the process layout in detail later.

Before starting this part, let's understand some concepts.

5.1 Allocation area (arena)

ptmalloc manages process memory through arenas (allocation areas).

In ptmalloc, allocation areas are divided into the main arena and non-main arenas. An arena is represented by struct malloc_state. The difference between them is that the main arena can request memory from the OS with both sbrk and mmap, while non-main arenas can only request memory from the OS through mmap.

When a thread calls malloc, it first checks whether its thread-private variable already holds an arena. If so, it tries to lock that arena; on success it uses it for the allocation. On failure, it walks the circular linked list looking for an unlocked arena. If every arena is locked, malloc creates a new arena, adds it to the circular list, locks it, and allocates from it. The free operation likewise must acquire the lock.

It should be noted that a non-main arena obtains memory from the OS via mmap, 64 MB at a time, and once created an arena is never released. To avoid wasting resources, ptmalloc limits the number of arenas.

For 32-bit systems, the maximum number of allocation areas = 2 * number of CPU cores + 1

For 64-bit systems, the maximum number of allocation areas = 8 * number of CPU cores + 1

The arena management structure:

struct malloc_state {
  mutex_t mutex;                     /* Serialize access. */
  int flags;                         /* Flags (formerly in max_fast). */
#if THREAD_STATS
  /* Statistics for locking.  Only used if THREAD_STATS is defined. */
  long stat_lock_direct, stat_lock_loop, stat_lock_wait;
#endif
  mfastbinptr fastbins[NFASTBINS];   /* Fastbins */
  mchunkptr top;
  mchunkptr last_remainder;
  mchunkptr bins[NBINS * 2];
  unsigned int binmap[BINMAPSIZE];   /* Bitmap of bins */
  struct malloc_state *next;         /* Linked list */
  INTERNAL_SIZE_T system_mem;
  INTERNAL_SIZE_T max_system_mem;
};

[figure: malloc_state]

Each process has exactly one main arena and possibly several non-main arenas. The main arena is created and held by the main thread or the first thread; the main and non-main arenas are linked in a circular list. Each arena contains a mutex to support multi-threaded access.

[figure: arenas linked in a circular list]

As mentioned earlier, there is a variable mutex in each allocation area to support multi-threaded access. Each thread must correspond to an allocation area, but one allocation area can be used by multiple threads. At the same time, an allocation area can be composed of one or more heaps. The heaps under the same allocation area are connected in a linked list. The relationship is as shown below:

[figure: thread - arena - heap relationship]

The dynamic memory of a process is managed by the allocation area. There are multiple allocation areas in a process, and there are multiple heaps in an allocation area, which forms a complex process memory management structure.

[figure: process memory management structure]

There are a few points to note:

  • The main allocation area is allocated through brk, and the non-primary allocation area is allocated through mmap.

  • Although non-main arenas are allocated by mmap, this is unrelated to the direct use of mmap for allocations larger than 128 KB: requests larger than 128 KB are served by mmap directly and returned to the system with munmap as soon as they are freed.

  • On malloc, each thread first obtains an arena and allocates its own memory from that arena's pool, which introduces a contention problem.

  • To avoid this contention, thread-local storage can be used as a thread cache (the "tc" in tcmalloc means exactly this). The way thread-local storage improves on arenas is as follows:

  • Variables that every function call within a thread can access, but other threads cannot (thread-local static variables), require a dedicated mechanism: this is TLS.

  • A thread cache essentially carves out, in the static area, a space unique to each thread. Being exclusive, it is free of contention.

  • On each malloc, the allocator first looks in the thread-local storage for an arena and allocates from the chunks cached there; only when that is insufficient does it fall back to the arenas in the heap area.

5.2 chunk

ptmalloc manages memory through malloc_chunk structures: it stores bookkeeping information just before the user data and uses boundary tags to delimit chunks.

chunk is defined as follows:

struct malloc_chunk {  
  INTERNAL_SIZE_T      prev_size;    /* Size of previous chunk (if free).  */  
  INTERNAL_SIZE_T      size;         /* Size in bytes, including overhead. */  
  
  struct malloc_chunk* fd;           /* double links -- used only if free. */  
  struct malloc_chunk* bk;  
  
  /* Only used for large blocks: pointer to next larger size.  */  
  struct malloc_chunk* fd_nextsize;      /* double links -- used only if free. */  
  struct malloc_chunk* bk_nextsize; 
};

  • prev_size: If the previous chunk is free, this field indicates the size of the previous chunk. If the previous chunk is not free, this field is meaningless.

A contiguous region of memory is divided into multiple chunks. prev_size records the size of the adjacent previous chunk: given the current chunk's address, subtracting prev_size yields the previous chunk's address. prev_size is mainly used when merging adjacent free chunks.

  • size: The size of the current chunk. Its low bits also record attributes of the current and previous chunk: whether the previous chunk is in use, whether the current chunk was obtained via mmap, and whether the current chunk belongs to a non-main arena.

  • fd and bk: These pointers exist only while the chunk is free; they link the chunk into the corresponding free-chunk list for unified management. Once the chunk is allocated to the application, they are no longer needed (the chunk has been removed from the free list), so that space is handed to the application rather than wasted.

  • fd_nextsize and bk_nextsize: Used when the chunk sits in a large bin. Free chunks in a large bin are sorted by size, but several chunks may share the same size; these two fields speed up traversal when searching for a suitable free chunk. fd_nextsize points to the next free chunk larger than the current one, and bk_nextsize to the previous free chunk smaller than it. (With many same-size chunks, finding the next larger or smaller chunk would otherwise require walking through all the equal-size ones — hence fd_nextsize and bk_nextsize.) As with fd/bk, once the chunk is allocated these pointers are unnecessary (the chunk has been removed from the size list), so the space is given to the application rather than wasted.

As described above, in ptmalloc, in order to save memory as much as possible, the structures of used chunks and unused chunks are different.

[figure: allocated chunk layout]

In the picture above:

  • The chunk pointer points to the address where the chunk starts.

  • The mem pointer points to the address where the user memory block starts.

  • When p=0, it means that the previous chunk is free, and prev_size is valid.

  • When p=1, it means that the previous chunk is in use, and prev_size is invalid. p is mainly used for merging memory blocks; the first block allocated by ptmalloc always sets p to 1 to prevent the program from referencing non-existent areas.

  • M=1: the chunk was allocated from the mmap mapping area; M=0: from the heap area.

  • A=0: the chunk was allocated from the main arena; A=1: from a non-main arena.

Compared with an allocated chunk, a free chunk has four extra pointers in the user area — fd, bk, fd_nextsize, and bk_nextsize — whose meanings were explained above and will not be repeated here.

[figure: free chunk layout]

5.3 Free linked list (bins)

When the user calls free, ptmalloc does not immediately return the memory to the operating system; it puts the chunk on a free list (bin). The next time malloc is called, ptmalloc first tries to take a block out of the bins and return it, avoiding frequent system calls and thus reducing the cost of memory allocation.

In ptmalloc, chunks of similar size are linked together into lists called bins. There are 128 bins in total for ptmalloc to use.

According to the size of the chunk, ptmalloc divides the bin into the following types:

  • fast bin

  • unsorted bin

  • small bin

  • large bin

From the malloc_state definition above, bins can be classified into fast bins and regular bins; the unsorted bin, small bins, and large bins all belong to the latter.

In glibc, the numbers of bins of the above four kinds are not equal, as shown in the following figure:

[figure: bin layout]

5.3.1 fast bins

While a program runs, it frequently requests and releases small blocks of memory. If the allocator merged several adjacent small chunks and then immediately received another small request, it would have to carve a block back out of the large free region, which is clearly inefficient. For this reason, malloc introduces fast bins into the allocation process.

In the previous definition of malloc_state:

mfastbinptr fastbins[NFASTBINS]; // NFASTBINS = 10

  1. The number of fast bins is 10.

  2. Each fast bin is a singly linked list (only the fd pointer is used), because chunks are both added and removed at the same end of the list. In other words, fast-bin chunks are handled LIFO (last in, first out): the add operation (free) appends the new fast chunk to the end of the list, and the delete operation (malloc) removes the fast chunk at the end of the list.

  3. Chunk size: the chunk sizes held by the 10 fast bins increase in 8-byte steps, that is, the first fast bin holds 16-byte chunks, the second holds 24-byte chunks, and so on; the last fast bin holds 80-byte chunks.

  4. Free chunks in fast bins are not merged. Fast bins are designed for fast allocation and release of small memory, so the system always keeps the P (PREV_INUSE) flag of chunks belonging to a fast bin set to 1. Even when a chunk in a fast bin is adjacent to another free chunk, the system does not automatically merge them; both are retained.

  5. Malloc operation: if the requested size falls within the fast bin range, the fast bins are searched first; if a chunk is found, it is returned. Otherwise the search continues in the small bins, unsorted bin and large bins.

  6. Free operation: first the chunksize function obtains the size of the chunk at the given address pointer; then the fast bin the chunk belongs to is determined from that size, and the chunk is added to that fast bin's list.

The following is the fastbin structure diagram:

picture

fastbin

5.3.2 unsorted bin

The unsorted bin uses the first slot of the bins array. It acts as a buffer in front of the other bins to speed up allocation: when the memory released by the user is larger than max_fast, or when chunks from the fast bins are merged, the resulting chunks first enter the unsorted bin.

Chunks in the unsorted bin have no size limit; chunks of any size can be placed there. This mainly gives the glibc malloc mechanism a second chance to reuse recently freed chunks (the first chance being the fast bin mechanism). The unsorted bin speeds up allocation and release because no extra time is spent finding the exact bin for a chunk. During malloc, if no suitable chunk is found in the fast bins, malloc first searches the unsorted bin for a suitable free chunk. If none fits, ptmalloc moves the chunks in the unsorted bin into the appropriate bins, and then searches the bins for a suitable free chunk.

Unlike the fast bins, the unsorted bin is traversed in FIFO order.

The unsorted bin structure diagram is as follows:

picture

unsorted bin

5.3.3 small bin

Chunks smaller than 512 bytes are called small chunks, and the bins that hold small chunks are called small bins. Small bins start at index 2 of the bins array, and the first 62 bins are small bins. Adjacent small bins differ by 8 bytes, and all chunks in the same small bin have the same size. Each small bin is a doubly linked circular list of free chunks (also called a binlist): freed chunks are added at the front of the list, and chunks are removed from the back when needed. Two adjacent free chunks are merged into one free chunk; merging reduces fragmentation but slows down free. During allocation, when a small bin is not empty, the last chunk in its binlist is removed and returned to the user.

When freeing a chunk, the allocator checks whether the chunks before and after it are free; if so, they are merged, that is, removed from their lists and combined into a new chunk, which is then added to the front of the unsorted bin list.

Small bins also follow FIFO: a release adds the newly freed chunk to the front of the list, and an allocation takes a chunk from the back.

picture

small bin

5.3.4 large bin

Chunks of 512 bytes or more are called large chunks, and the bins that hold large chunks are called large bins; they sit after the small bins in the array. Each large bin holds chunks within a given size range, sorted in descending order of size; chunks of equal size are ordered by most recent use.

Two adjacent free chunks will be merged into one free chunk.

The small-bin strategy works well for small allocations, but there cannot be a bin for every possible chunk size. For chunks larger than 512 bytes (1024 bytes on 64-bit), the heap manager uses large bins instead.

Each of the 63 large bins operates in much the same way as a small bin, but instead of storing chunks of one fixed size it stores chunks within a range of sizes. The ranges are designed not to overlap with the small-bin sizes or with each other: any given chunk size maps to exactly one small bin or large bin.

Among these 63 large bins: the first group of 32 large bins is spaced in 64-byte steps, that is, the first holds chunks of 1024-1087 bytes, the second 1088-1151 bytes, and so on. The second group of 16 large bins is spaced at 512-byte intervals; the third group of 8 at 4096-byte intervals; the fourth group of 4 at 32768-byte intervals; the fifth group of 2 at 262144-byte intervals; and the chunk size in the last group's single bin is unlimited.

When performing a malloc operation, the allocator first determines which large bin the requested size maps to, then checks whether the largest chunk in that bin is at least the requested size. If so, it traverses the bin from the back, finds the first chunk of equal or close size, and hands it to the user. If that chunk is larger than the request, it is split into two chunks: the front part, exactly the requested size, is returned to the user, and the remainder is added to the unsorted bin as a new chunk.

If the largest chunk in that large bin is smaller than the requested size, the allocator checks whether the subsequent large bins contain a suitable chunk. However, given the large number of bins (the chunks of different bins most likely reside on different memory pages), traversing them one by one as described in the previous paragraph could trigger multiple page faults and seriously hurt retrieval speed. glibc malloc therefore designed the binmap structure to speed up bin-by-bin retrieval: the binmap records whether each bin is empty, so empty bins can be skipped. If the binmap locates the next non-empty large bin, a chunk is allocated from it as in the previous paragraph; otherwise the top chunk (discussed later) is used to allocate the memory.

The free operation of large bin is the same as that of small bin and will not be described again here.

picture

large bin

Together, the bins described above form the core allocation structure of the process, as shown in the following figure:

picture

5.4 Special chunk

The previous section described the various bins and their allocation and release behavior. However, the bins alone are not enough; for example, a request may not be satisfiable from any bin. glibc therefore defines several other special chunks for allocation and release: the top chunk, the mmaped chunk and the last remainder chunk.

5.4.1 top chunk

The top chunk is the space at the top of the heap and belongs to no bin. When no bin can satisfy an allocation, memory is carved from this area: the allocated part is returned to the user and the remainder becomes the new top chunk. If the top chunk itself cannot satisfy the request, brk or mmap must be used to obtain more heap space from the system (brk/sbrk for the main allocation area, mmap for non-main allocation areas).

When a chunk is freed and its size falls outside the fastbin range, the allocator must check whether it is adjacent to the top chunk; if so, it is merged into the top chunk.

5.4.2 mmaped chunk

When an allocation is very large (greater than the allocation threshold, 128K by default), it is served by an mmap mapping and tracked as an mmaped chunk (the M flag in the chunk is set to 1). When the memory of an mmaped chunk is freed, it is returned directly to the operating system.

5.4.3 last remainder chunk

Last remainder chunk is another special chunk. This special chunk is maintained in the unsorted bin.

If the requested size falls in the small-bin range but no exact match exists, a best fit is used: for example, if the user requests 128 bytes but the corresponding bin is empty and only the 256-byte bin is non-empty, a chunk is taken from the 256-byte bin and split into two parts; one part is returned to the user, and the other becomes the last remainder chunk and is inserted into the unsorted bin.

When a small chunk is requested but no suitable chunk is found in the small bins, and the last remainder chunk is larger than the requested size, the last remainder chunk is split into two chunks: one is returned to the user, and the other becomes the new last remainder chunk.

The last remainder chunk mainly improves the efficiency of consecutive mallocs that produce many small chunks, by improving the locality of memory allocation.

5.5 chunk merging

When a chunk is freed and its size falls outside the fastbin range, it is merged with adjacent free chunks. Likewise, when an allocation request falls in the large-bin range and there are free chunks in the fast bins, the fastbin chunks are merged with their adjacent free chunks and the merged chunks are put into the unsorted bin; if a fastbin chunk's neighbors are not free and it cannot be merged, the chunk is still placed in the unsorted bin. In other words, chunks are merged where possible, but either way they end up in the unsorted bin.

Merging is also triggered when no suitable chunk exists in the fast bins or small bins and the top chunk is too small to satisfy the request: the chunks in the fast bins are then consolidated.

5.6 chunk splitting

As mentioned earlier, adjacent free chunks can be merged into a large chunk; conversely, a large chunk can be split into two smaller chunks. A chunk is split in the same way as when a new chunk is carved from the top chunk. One thing to note: both chunks produced by a split must be at least the minimum chunk size (32 bytes on 64-bit systems), that is, both must remain allocatable and usable; otherwise no split is performed and the whole chunk is returned to the user.

6 Memory allocation (malloc)

Dynamic memory in the glibc runtime library is allocated through malloc at the bottom layer (new eventually calls malloc as well). The following is the malloc function call flow chart:

picture

malloc

Here, the above flow chart is expressed in text form to facilitate everyone's understanding:

  1. Obtain the lock of the allocation area. To prevent multiple threads from accessing the same allocation area at once, the lock must be acquired before allocating. The thread first checks whether its thread-private instance already records an allocation area; if so, it tries to lock that area and, on success, allocates from it. Otherwise, the thread searches the circular linked list of allocation areas for a free (unlocked) one. If every allocation area is locked, ptmalloc opens a new one, adds it to the global circular linked list and the thread's private instance, locks it, and then allocates from it. A newly opened allocation area is always a non-main allocation area, because the main allocation area is inherited from the parent process; opening a non-main allocation area calls mmap() to create a sub-heap and sets up its top chunk.

  2. Convert the user's request size into the actual chunk space size that needs to be allocated.

  3. Check whether the chunk size to allocate satisfies chunk_size <= max_fast (max_fast defaults to 64B). If so, go to the next step; otherwise skip to step 5.

  4. First try to get a chunk of the required size in fast bins and distribute it to the user. If it can be found, the allocation ends. Otherwise go to next step.

  5. Determine whether the required size falls in the small bins, that is, whether chunk_size < 512B holds. If so, go to the next step; otherwise skip to step 7.

  6. According to the size of the chunk to be allocated, find a specific small bin and pick a chunk that exactly meets the size from the end of the bin. If successful, the allocation ends, otherwise, go to the next step.

  7. Reaching this step means that a large block is needed, or that no suitable chunk was found in the small bins. ptmalloc first traverses the fast bins, merges adjacent chunks, and links them into the unsorted bin; it then traverses the unsorted bin. If the unsorted bin holds only one chunk, that chunk was split off during the last allocation, the requested size belongs to the small bins, and the chunk is at least the required size, then the chunk is split directly and the allocation ends. Otherwise, each chunk in the unsorted bin is placed into the small bins or large bins according to its size. After the traversal, go to the next step.

  8. Reaching this step means that a large block is needed, or that no suitable chunk was found in the small bins or unsorted bin, and the fast bins and unsorted bin have been emptied. Following the "smallest-first, best-fit" principle, find a suitable chunk in the large bins, split off a chunk of the required size, and link the remainder back into the bins. If this succeeds the allocation ends; otherwise go to the next step.

  9. If no suitable chunk was found by searching the fast bins and bins, the top chunk must be used. If the top chunk is large enough for the required chunk size, split a piece off it; otherwise go to the next step.

  10. Reaching this step means that the top chunk cannot satisfy the request, so there are two options: for the main allocation area, call sbrk() to enlarge the top chunk; for a non-main allocation area, call mmap to allocate a new sub-heap and enlarge the top chunk, or allocate directly with mmap(). The choice depends on the size of the request: if the chunk size to allocate is at least the mmap allocation threshold, go to the next step and allocate with mmap; otherwise skip to step 12 and enlarge the top chunk.

  11. Use the mmap system call to map a region of chunk_size, aligned to 4KB, into the program's address space, and return the memory pointer to the user.

  12. Check whether this is the first call to malloc. For the main allocation area, initialization is required: a space of (chunk_size + 128KB), aligned to 4KB, is allocated as the initial heap. If initialization has already been done, the main allocation area calls sbrk() to enlarge the heap, while a non-main allocation area splits a chunk off its top chunk to satisfy the request, and the memory pointer is returned to the user.

Putting the above processes together is:

Depending on the size requested, ptmalloc may allocate memory for the user from two places. At the first allocation there is generally only the main allocation area, though several non-main allocation areas may have been inherited from the parent process; the discussion here focuses on the main allocation area. Initially the brk value equals start_brk, so the heap size is 0 and the top chunk size is also 0; no request can be satisfied without growing the heap. Therefore, if the user's first request is smaller than the mmap allocation threshold, ptmalloc initializes the heap, allocates the user's space from it, and serves subsequent allocations from that heap. If the first request is at least the mmap allocation threshold, ptmalloc uses mmap() directly, and the heap is not initialized until the user first requests an allocation below the threshold.

Allocations after the first are more complicated. In short, ptmalloc first searches the fast bins; if no matching chunk is found, it searches the small bins. If that still fails, it consolidates the fast bins, moves the resulting chunks into the unsorted bin, and searches there. If that also fails, all chunks in the unsorted bin are moved into the large bins, which are then searched. Searches in the fast bins and small bins require an exact match, while searches in the large bins follow the "smallest-first, best-fit" principle and do not.

If all of the above fail, ptmalloc turns to the top chunk. If the top chunk cannot satisfy the request either, then mmap is used when the required chunk size is at least the mmap allocation threshold; otherwise the heap is grown, enlarging the top chunk to satisfy the request.

Of course, malloc in glibc is far more complicated than the above; all kinds of situations must be handled, such as abnormal pointers and out-of-bounds accesses. These checks are also added to the flow chart, as shown in the following figure:

picture

malloc

 

7 memory release (free)

Malloc performs memory allocation; its counterpart free performs memory release. The following is the basic flow chart of the free function:

picture

free

The above flow chart is described as follows:

  1. Obtain the lock of the allocation area to ensure thread safety.

  2. If the pointer passed to free is NULL, return without doing anything.

  3. Determine whether the current chunk is memory mapped by mmap; if so, release it directly with munmap(). As seen in the chunk data structure earlier, the M flag identifies mmap-mapped memory.

  4. Determine whether the chunk is adjacent to the top chunk. If so, merge it directly into the top chunk (being adjacent to the top chunk is equivalent to being adjacent to the free memory block in the allocation area) and go to step 8; otherwise continue to the next step.

  5. If the chunk is larger than max_fast (64B), put it into the unsorted bin and check whether merging occurs. If it merges and becomes adjacent to the top chunk, go to step 8; if no merging occurs, free ends.

  6. If the chunk is smaller than max_fast (64B), put it directly into a fast bin; the fast bin does not change the chunk's state. If no merging occurs, free ends; if merging occurs, go to step 7.

  7. In the fast bins, if the next chunk of the current chunk is also free, the two are merged and placed into the unsorted bin. If the merged size exceeds 64B, a consolidation of the fast bins is triggered: the fastbin chunks are traversed, merged with adjacent free chunks, and the merged chunks are placed into the unsorted bin, leaving the fast bins empty. If a merged chunk is adjacent to the top chunk, it is merged into the top chunk. Go to step 8.

  8. Determine whether the top chunk exceeds the mmap shrink threshold (128KB by default). If so, for the main allocation area, an attempt is made to return part of the top chunk to the operating system. free ends.

If various conditions inside the free function are added, the detailed flow chart of the free call is as follows:

picture

 

8 Problem analysis and solutions

Through the preceding analysis of the glibc runtime library, we can basically pinpoint the cause: we did call free, but free only returned the memory to the glibc allocator, and glibc did not return it to the operating system. In the end, system memory was exhausted and the program was killed by the system due to OOM.

There are three options:

  • Disable ptmalloc's dynamic adjustment of the mmap allocation threshold. Setting any one of M_TRIM_THRESHOLD, M_MMAP_THRESHOLD, M_TOP_PAD or M_MMAP_MAX through mallopt() turns the dynamic adjustment mechanism off; then set the mmap allocation threshold to 64K. Allocations larger than 64K are served by mmap, and freeing such memory releases it back to the system with munmap. The drawback of this solution is that every such allocation and release goes directly to the operating system, which is inefficient.

  • Estimate the maximum physical memory the program can use, configure the system's /proc/sys/vm/overcommit_memory and /proc/sys/vm/overcommit_ratio, and use ulimit -v to limit the virtual memory the program may use, preventing it from being killed by OOM. The drawback of this solution is that if the estimate is lower than the memory the process actually needs, OOM still occurs and the process is still killed.

  • tcmalloc

In the end, we used tcmalloc to solve the problem. If there is a chance later, I will write an article on tcmalloc's memory management.

9 Conclusion

As the saying goes in the industry, understanding the memory management mechanism is an important difference between C/C++ programmers and programmers of other high-level languages. Pointers and dynamic memory management, the most important features of C/C++, give programming great flexibility but also bring developers plenty of trouble.

Understanding the underlying memory implementation can sometimes have unexpected effects.

Let’s stop here first.

Origin blog.csdn.net/weixin_41114301/article/details/133201233