Linux memory management (1): overview

1. Linux process memory layout

2. Why limit the size of the stack

  2.1 Process stack

  2.2 Thread stack    

3. Operating system memory layout related functions

  3.1 Heap operation related functions

  3.2 Mmap mapping area operation related functions

4. Memory Management Method

  4.1. C-style memory management program

  4.2. Pooled memory management

  4.3. Reference counting

  4.4. Garbage Collection

5. Common C memory management program


    Memory management involves three layers: the user program, the C runtime library, and the kernel. The allocator is the memory-management module of the C runtime library: it responds to the user's allocation requests, obtains memory from the kernel, and returns it to the user program. To keep allocation efficient, the allocator generally requests a block larger than the user asked for and manages that block with some algorithm. Memory freed by the user is not immediately returned to the operating system; instead, the allocator keeps the freed space to satisfy future allocation requests. In other words, the allocator must manage both allocated blocks and free blocks: when serving an allocation request it first searches the free space for a suitable block, and only requests new memory from the kernel when no suitable free block can be found.

1. Linux process memory layout

    When the Linux system loads a program file in ELF format, it calls the loader to place each segment of the executable into the address space starting from a certain address (the load address depends on the link editor (ld) and the machine's address width; on a 32-bit machine it is 0x08048000, i.e. at 128M). As shown below:

    [Figure: classic process memory layout in 32-bit mode]

    Taking a 32-bit machine as an example, the .text section is loaded first, then the .data section, and finally the .bss section. This can be seen as the start of the program's address space. The highest address the program can access is 0xbfffffff, i.e. the 3G boundary. The 1G of space above 3G is used by the kernel, and applications cannot access it directly.

    The application's stack starts at the highest address and grows downward. The space between the .bss section and the stack is free, and this free space is divided into two parts: the heap and the mmap mapping area. The mmap mapping area generally starts at TASK_SIZE/3, but its starting position differs across Linux kernels and machines. Both the heap and the mmap area can be used freely by user code, but they are not mapped into the address space at first and are therefore inaccessible: before asking the kernel to allocate the space, any access to it causes a segmentation fault. A user program can manage the heap and the mmap mapping area directly through system calls, but more often it uses the malloc() and free() functions provided by the C library to allocate and release memory dynamically. The stack is the only memory area that users can access without mapping it first, which is also what makes stack-overflow attacks possible.

    The layout in the figure above is the default process memory layout before Linux kernel 2.6.7. The mmap area and the stack grow toward each other, which means the heap has only about 1GB of virtual address space available; if it grew further it would run into the mmap mapping area, which is obviously not what we want. This is a consequence of the 32-bit address-space limitation, so the kernel introduced another virtual address-space layout, described below. For 64-bit systems, which provide a huge virtual address space, this classic layout works quite well. After Linux kernel 2.6.7, the process memory layout is as shown in the figure below:

    [Figure: default process memory layout in 32-bit mode after kernel 2.6.7]

    As the figure shows, the stack extends from the top downward and is bounded. The heap grows upward from the bottom, the mmap mapping area grows downward from the top, and the two expand toward each other until the remaining virtual address space between them is exhausted. This structure makes it convenient for the C runtime library to use both the mmap mapping area and the heap for memory allocation. This layout, introduced after kernel 2.6.7, is the default memory layout of a process in 32-bit mode.

    What are the starting positions of the regions in 64-bit mode? On AMD64 systems, the classic memory layout is used: the text segment starts at 0x0000000000400000, the heap grows upward immediately after the BSS segment, and the starting position of the mmap mapping area is generally set to TASK_SIZE/3:

#define TASK_SIZE_MAX ((1UL << 47) - PAGE_SIZE)

#define TASK_SIZE  (test_thread_flag(TIF_IA32) ? IA32_PAGE_OFFSET : TASK_SIZE_MAX)

#define STACK_TOP TASK_SIZE

#define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE/3))

    From these definitions, the starting address of the mmap area works out to 0x00002AAAAAAAA000, and the top of the stack is at 0x00007FFFFFFFF000. The figure below shows the 64-bit process memory layout:

    [Figure: default 64-bit (x86_64) process memory layout]

    The figure above shows the default memory layout of a Linux process under x86_64, but it is only a schematic. Under the current default kernel configuration, the stack and mmap mapping area of a process do not start at fixed addresses; the values differ on every run. The kernel randomizes these values at program startup, making attacks based on buffer overflows more difficult. Of course, you can also make the process stack and mmap mapping area start at fixed positions by setting the kernel parameter randomize_va_space to 0 (by default, randomization is enabled). Users can disable this feature through /proc/sys/kernel/randomize_va_space, or with the following command:

sudo sysctl -w kernel.randomize_va_space=0

2. Why limit the size of the stack

2.1 Process stack

    The initial size of the process stack is computed by the compiler and linker, but its size at run time is not fixed: the Linux kernel grows the stack area dynamically as it is used (in practice, by mapping in new pages). This does not mean the stack can grow without bound, however; it has a maximum limit, RLIMIT_STACK (usually 8MB), and we can check or change the RLIMIT_STACK value with ulimit.

2.2 Thread stack    

   From the perspective of the Linux kernel, there is no real concept of a thread. Linux implements all threads as processes, representing threads and processes uniformly as task_struct. A thread is simply a process that shares certain resources with other processes, and sharing the address space is almost the only difference between a process and a so-called thread in Linux. When a thread is created, the CLONE_VM flag is passed, so the thread's memory descriptor points directly to that of its parent. Although a thread's address space is the same as the process's, its stack within that address space is handled differently. For a Linux process or the main thread, the stack is created at fork() time: the parent's stack address range is copied, then handled with copy-on-write (COW) and grown dynamically. For child threads created by the main thread, however, the stack is not like this; it is fixed in advance. When the thread is created, an mmap system call allocates the thread's stack, and the starting address and size of the thread stack are stored in pthread_attr_t.

    The stack must occupy contiguous memory. This means it cannot be allocated piecemeal on demand; at minimum, a range of virtual addresses must be reserved for it. The larger the reserved virtual address range, the fewer threads can run. For example, a 32-bit application typically has 2GB of usable virtual address space, so if the stack size is 2MB (the default in pthreads), at most about 1024 threads can be created. For an application such as a web server, this may be too few. Raising the stack size to 100MB (i.e. reserving 100MB, not necessarily allocating 100MB to the stack immediately) would limit the thread count to around 20, too few even for a simple GUI application. An interesting question is why this limit persists on 64-bit platforms. I don't know the answer, but I assume people are used to certain "stack best practices": be careful about allocating huge objects on the stack, and increase the stack size manually when necessary. So nobody has found it worthwhile to add "huge stack" support on 64-bit platforms.

3. Operating system memory layout related functions

    As mentioned in the previous section, the heap and the mmap mapping area are virtual memory regions available to user programs. How is memory in these regions obtained? The operating system provides system calls for this. For the heap, the kernel provides the brk() system call and the C runtime library provides sbrk(); for the mmap mapping area, the kernel provides the mmap() and munmap() functions. sbrk(), brk(), and mmap() can all add additional virtual memory to a process, and glibc itself uses these functions to request virtual memory from the operating system.

    A very important concept must be mentioned here: delayed (lazy) allocation of memory. The physical mapping for an address is established only when the address is actually accessed; this is one of the basic ideas of Linux memory management. When a user requests memory, the Linux kernel only allocates a linear region (virtual memory) and does not allocate actual physical memory; only when the user uses that memory does the kernel assign specific physical pages to it, which is when precious physical memory is consumed. The kernel releases physical pages by releasing the linear region: it finds the corresponding physical pages and frees them all.

3.1 Heap operation related functions

    There are two main heap-operation functions: brk() is a system call, and sbrk() is a C library function. System calls usually provide minimal functionality, while library functions build more complex features on top of them. Glibc's malloc function family (realloc, calloc, etc.) calls sbrk() to move the end of the data segment (the program break). Under kernel management, sbrk() maps virtual address space into the process for the malloc() function to use. In the kernel data structure mm_struct:

  • start_code and end_code are the start and end addresses of the process code segment;
  • start_data and end_data are the start and end addresses of the process data segment;
  • start_stack is the starting address of the process stack segment;
  • start_brk is the starting address of the process dynamic memory allocation (the starting address of the heap),
  • brk is the current last address of the heap, which is the current end address of dynamic memory allocation.

    The basic dynamic memory allocation function in C is malloc(), implemented on Linux through the kernel's brk system call. brk() is a very simple system call that merely changes the value of the brk member of the mm_struct structure. The two functions are declared as follows:

#include <unistd.h>

int brk(void *addr);

void *sbrk(intptr_t increment);

    Note that when the increment parameter of sbrk() is 0, sbrk() returns the process's current program break. A positive increment expands the break; a negative increment shrinks it.

3.2 Mmap mapping area operation related functions

    The mmap() function maps a file or other object into memory. A file is mapped in whole pages; if the file's size is not a multiple of the page size, the unused portion of the last page is zero-filled. munmap() performs the opposite operation, removing the mapping for a given address range. The functions are declared as follows:

#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset); 
// On success, returns the starting address of the mapped region;
// otherwise returns MAP_FAILED (-1) and stores the error cause in errno.

int munmap(void *addr, size_t length);

   Parameters:

  • addr: the starting address for the mapping, usually set to NULL, meaning the system chooses the address automatically; the chosen address is returned on success;
  • length: the length of the mapped region;
  • prot: the protection mode of the mapped region, which must not conflict with the open mode of the file. It is one of the following values, which may be combined with the OR operator. Ptmalloc mainly uses the following flags:
  • PROT_EXEC // Page contents may be executed; not used in ptmalloc
  • PROT_READ // Page contents may be read; ptmalloc sets this flag on memory it allocates directly with mmap and returns to the user immediately
  • PROT_WRITE // Page may be written; ptmalloc sets this flag on memory it allocates directly with mmap and returns to the user immediately
  • PROT_NONE // Page is inaccessible; ptmalloc sets this flag when it uses mmap to "wholesale" a chunk of memory from the system for its own management
  • flags: specifies the type of the mapped object, mapping options, and whether the mapped pages may be shared. Its value can be a combination of one or more of the following bits:
  • MAP_FIXED // Use the specified mapping start address. If the memory region specified by addr and length overlaps an existing mapping, the overlapped part is discarded. If the specified start address is unavailable, the operation fails. The starting address must fall on a page boundary. Ptmalloc sets this flag when returning "wholesaled" memory to the system.
  • MAP_PRIVATE // Create a private copy-on-write mapping. Writes to the memory region do not affect the original file. This flag is mutually exclusive with MAP_SHARED; only one of them can be used. Ptmalloc sets this flag on every mmap call.
  • MAP_NORESERVE // Do not reserve swap space for this mapping. When swap space is reserved, modifications to the mapping are guaranteed to be possible. When swap space is not reserved and memory is insufficient, a modification to the mapping raises a segmentation-violation signal. Ptmalloc sets this flag when "wholesaling" memory blocks from the system.
  • MAP_ANONYMOUS // Anonymous mapping; the region is not backed by any file. Ptmalloc sets this flag on every mmap call.
  • fd: a valid file descriptor. If MAP_ANONYMOUS is set, for compatibility its value should be -1. Some systems do not support anonymous memory mapping; in that case you can open the /dev/zero file and map it to achieve the same effect as an anonymous mapping.
  • offset: the offset into the mapped object at which the mapping starts, usually set to 0, meaning the mapping starts at the beginning of the file. The offset must be an integer multiple of the page size.

4. Memory Management Method

4.1. C-style memory management program

    A C-style memory management program mainly implements the malloc() and free() functions, and obtains additional virtual memory for the process by calling brk() or mmap(). Doug Lea Malloc, ptmalloc, BSD malloc, Hoard, and TCMalloc all belong to this category.

    A malloc()-based memory manager still has many shortcomings, whichever allocator is used. For programs that must maintain long-lived storage, managing memory with malloc() can be very frustrating: with a large number of floating memory references, it is often hard to know when a piece of memory will be released. Memory whose lifetime is limited to the current function is very easy to manage, but memory that lives beyond that scope is much harder. Because of this, many programs tend to use their own memory-management schemes.

4.2. Pooled memory management

    A memory pool is a semi-automatic memory management method. Memory pools help certain programs manage memory automatically: such programs pass through specific stages, and each stage has memory allocated for it. For example, many web server processes allocate a great deal of memory per connection, and the maximum lifetime of that memory is the lifetime of the connection. Apache uses pooled memory, dividing a connection into stages, each with its own memory pool. At the end of each stage, all of its memory is released at once.

    In pooled memory management, each allocation specifies the memory pool from which to allocate. Each pool has a different lifetime: in Apache, one pool lasts for the lifetime of the server, another for the lifetime of a connection, another for the lifetime of a request, and there are others. So if a series of functions generates no data that outlives the connection, it can allocate entirely from the connection pool, knowing that the memory will be released automatically when the connection ends. In addition, some implementations allow cleanup functions to be registered; these are called just before the pool is cleared, to finish any other work that must be done before the memory goes away (similar to object-oriented destructors).

    The advantages of using pooled memory allocation are as follows:

  • The application can simply manage the memory.
  • Memory allocation and reclamation are faster, because they happen within a pool each time. Allocation completes in O(1) time, and releasing a whole pool takes about the same (technically O(n), but in most cases divided by a large constant factor, making it effectively O(1)).
  • Error-handling pools can be allocated in advance so that the program can recover when conventional memory is exhausted.
  • There are very easy-to-use standard implementations.

   The disadvantages of pooled memory are:

  • The memory pool is only suitable for programs whose operations can be staged.
  • Memory pools usually do not work well with third-party libraries.
  • If the structure of the program changes, you have to modify the memory pool, which may lead to a redesign of the memory management system.
  • You must remember which pool to allocate from. If you get this wrong, it is hard to catch the misuse of the pool.

4.3. Reference counting

    In reference counting, every shared data structure has a field holding the number of places where the structure is currently in use. When a pointer to the structure is handed to some part of the program, that code increments the reference count by 1, in essence telling the structure how many locations it is stored in. When it is done with the structure, the code decrements the reference count by 1 and then checks whether the count has reached zero; if so, it frees the memory.

    In high-level languages such as Java and Perl, reference counting is widely used for memory management. In these languages, reference counting is handled automatically by the language, so you don't have to worry about it at all unless you are writing extension modules. Since everything must be reference counted, there is some impact on speed, but safety and convenience of programming improve greatly.

    The following are the benefits of reference counting:

  • Simple to implement.
  • Easy to use.
  • Since the reference count is part of the data structure, it has good cache locality.

    It also has its shortcomings:

  • You must never forget to call the reference counting functions.
  • Structures that are part of a cyclic data structure can never be released.
  • Almost every assignment of a pointer is slowed down.
  • Although the objects are reference counted, extra measures are needed when using exception handling (such as try or setjmp()/longjmp()).
  • Extra memory is needed to hold the references.
  • The reference count occupies the first slot in the structure, the very position that can be accessed fastest on most machines.
  • It is slower and harder to use in a multithreaded environment.

4.4. Garbage Collection

    Garbage collection is the automatic detection and removal of data objects that are no longer in use. A garbage collector usually runs when available memory drops below a given threshold. It starts from a set of "root" data the program is known to hold — stack data, global variables, registers — and tries to trace every piece of data reachable through them. Whatever the collector finds is live data; whatever it cannot find is garbage, which can be destroyed and reused. To manage memory effectively, many kinds of garbage collectors need to know the layout of pointers inside data structures; therefore, to run correctly, they must be part of the language itself.

    Some advantages of garbage collection:

  • Never worry about double release of memory or the life cycle of objects.
  • With some collectors, you can use the same API as for regular allocation.

    Its disadvantages include:

  • When using most collectors, you cannot interfere with when to release the memory.
  • In most cases, garbage collection is slower than other forms of memory management.
  • Defects caused by garbage collection errors are difficult to debug.
  • If you forget to set the pointer that is no longer used to null, there will still be a memory leak.

5. Common C memory management program

  • Doug Lea Malloc:

    Doug Lea Malloc is actually a family of allocators, including Doug Lea's original allocator, the GNU libc allocator, and ptmalloc. Doug Lea's allocator adds an index, which makes searches faster and can combine several unused chunks into one large chunk. It also supports caching, so recently freed memory can be reused more quickly. ptmalloc is an extension of Doug Lea Malloc that supports multi-threading.

  • BSD Malloc:

    BSD Malloc is an implementation released with 4.2BSD and included in FreeBSD. This allocator pre-allocates objects from pools of objects of certain sizes. It maintains size classes whose sizes are powers of 2 minus a constant, and a request for an object of a given size is simply served from the matching size class. This gives a fast implementation, but may waste memory.

  • Hoard:

    Hoard was written with the goal of making memory allocation very fast in multi-threaded environments. Its design therefore centers on minimizing lock contention, so that threads do not have to wait on one another to allocate memory. It can significantly speed up multi-threaded programs that allocate and free heavily.

  • TCMalloc:

    tcmalloc is a memory allocator developed by Google, used for memory allocation in Golang and Chrome. It effectively addresses problems present in ptmalloc, though of course at some cost.

  • Jemalloc:

    jemalloc was released by Facebook and is widely used in software such as Firefox, Facebook's servers, and Android 5.0. jemalloc's greatest strength is its multi-core/multi-thread allocation performance. On modern computer hardware, the biggest bottleneck is no longer memory capacity or CPU speed but lock contention under multiple cores and threads, because no matter how many CPU cores there are, there is usually only one copy of memory. One could say that, given enough memory, the more CPU cores and program threads there are, the more jemalloc's allocation speed advantage shows.

reference:

    "Glibc Memory Management Ptmalloc2 Source Code Analysis"

Origin blog.csdn.net/MOU_IT/article/details/115273132