In-depth explanation of the principle and implementation of the memory allocation function malloc

Anyone who has used or learned C will be familiar with malloc. Everyone knows that malloc allocates a contiguous block of memory and that the block is released with free once it is no longer needed. However, many programmers know little about what happens behind malloc, and some even take malloc for a system call provided by the operating system or for a C keyword.

In fact, malloc is just a common function provided in the C standard library, and the basic idea of implementing malloc is not complicated and can be easily understood by any programmer who has some knowledge of C and operating systems.

This article describes the mechanism behind malloc by implementing a simple malloc. Of course, compared with existing C standard library implementations (such as glibc), our implementation of malloc is not particularly efficient, but this implementation is much simpler than the current real malloc implementation, so it is easy to understand. Importantly, this implementation is consistent with the real implementation in basic principles.

This article first introduces the background knowledge required, such as how the operating system manages a process's memory and the related system calls, and then implements a simple malloc step by step. For the sake of simplicity, this article only considers the x86_64 architecture with Linux as the operating system.

1 What is malloc

Before implementing malloc, we must first define malloc relatively formally.

According to the C standard library, malloc has the following prototype:

void* malloc(size_t size);

The job of this function is to allocate a contiguous region of available memory. The specific requirements are as follows:

The memory size allocated by malloc is at least the number of bytes specified by the size parameter.
The return value of malloc is a pointer pointing to the starting address of a segment of available memory.
The addresses allocated by malloc multiple times cannot overlap unless the address allocated by malloc is released.
malloc should complete the allocation and return as quickly as possible (so an NP-hard[1] allocation algorithm cannot be used).
When implementing malloc, memory size adjustment and memory release functions (i.e. realloc and free) should be implemented at the same time.

For more instructions on malloc, you can type the following command on the command line to view:

man malloc

2 Preliminary knowledge

Before implementing malloc, some knowledge related to Linux system memory needs to be explained.

2.1 Linux memory management

2.1.1 Virtual memory address and physical memory address

To keep things simple for programs, modern operating systems generally use virtual memory addressing. That is, at the assembly (or machine language) level, every memory address is a virtual address. With this technique each process appears to have 2^N bytes of memory to itself, where N is the machine word size in bits. For example, with a 64-bit CPU and a 64-bit operating system, the virtual address space of each process is 2^64 bytes.

The main purpose of this virtual address space is to simplify programming and to make it easier for the operating system to isolate the memory of different processes. A real process is unlikely to have (or need) that much memory; the memory that can actually be used depends on the size of physical memory.

Since virtual addresses are used at the machine language level, when the actual machine code program involves memory operations, the virtual address needs to be converted into a physical memory address according to the actual context of the current process running in order to realize the operation of real memory data. This conversion is generally completed by a piece of hardware called MMU[2] (Memory Management Unit).

2.1.2 Page and address composition

In modern operating systems, neither virtual memory nor physical memory is managed in units of bytes, but in units of pages. A memory page is a fixed-size contiguous range of memory addresses. On Linux, the typical page size is 4096 bytes (4K).

So the memory address can be divided into page number and offset within the page. Taking a 64-bit machine, 4G physical memory, and 4K page size as an example, the composition of the virtual memory address and the physical memory address is as follows:

The top is the virtual memory address, and the bottom is the physical memory address. Since the page size is all 4K, the offset within the page is represented by the lower 12 bits, and the remaining high address represents the page number.
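As a minimal sketch of this split, assuming 4K pages (PAGE_SHIFT, PAGE_SIZE, page_number and page_offset are names made up for this example):

```c
#include <stdint.h>

#define PAGE_SHIFT 12                          /* assumes 4K pages: 2^12 = 4096 */
#define PAGE_SIZE  ((uint64_t)1 << PAGE_SHIFT)

/* Page number: every bit above the low 12 bits. */
uint64_t page_number(uint64_t addr) {
    return addr >> PAGE_SHIFT;
}

/* Offset within the page: the low 12 bits. */
uint64_t page_offset(uint64_t addr) {
    return addr & (PAGE_SIZE - 1);
}
```

For the address 0x12345678, the page number is 0x12345 and the offset within the page is 0x678.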

The unit the MMU maps is not the byte but the page. The mapping is implemented by looking up a memory-resident data structure, the page table [3]. In today's computers the actual address mapping is fairly complex, and a series of caches and optimizations, such as the TLB [4], are introduced to speed it up.

A simplified schematic diagram of memory address translation is given below. Although it is simplified, the basic principle is consistent with the actual situation of modern computers.

2.1.3 Memory pages and disk pages

We know that memory is generally regarded as a cache for the disk. Sometimes the MMU finds, while working, that the page table marks a memory page as not present in physical memory. This triggers a page fault. The system then loads the page from the corresponding location on disk into memory and re-executes the machine instruction that failed because of the page fault. Since this part is transparent to a malloc implementation, I will not go into details.

Finally, I attach a process found on Wikipedia that is more in line with the real address translation for your reference. This picture adds the process of TLB and missing page exceptions.

2.2 Linux process-level memory management

2.2.1 Memory arrangement

Now that we understand the relationship between virtual memory and physical memory and the related mapping mechanism, let's take a look at how memory is arranged within a process.

Take Linux 64-bit system as an example. Theoretically, the available space for 64-bit memory addresses is 0x0000000000000000 ~ 0xFFFFFFFFFFFFFFFF. This is a quite large space, and Linux actually uses only a small part of it (256T).

According to the Linux kernel documentation [6], 64-bit Linux uses only the lower 47 bits of an address, and the upper 17 bits are used for extension (they can only be all 0s or all 1s). Therefore, the addresses actually used are the ranges 0x0000000000000000 ~ 0x00007FFFFFFFFFFF and 0xFFFF800000000000 ~ 0xFFFFFFFFFFFFFFFF, where the former is User Space and the latter is Kernel Space. The diagram is as follows:

For users, the main space of concern is User Space. After enlarging the User Space, you can see that it is mainly divided into the following sections:

Code: This is the lowest address part of the entire user space, which stores instructions (that is, the executable machine code compiled by the program)
Data: The initialized global variables are stored here.
BSS: Uninitialized global variables are stored here.
Heap: this is the focus of this article. The heap grows from low addresses to high addresses. The brk-related system calls discussed later allocate memory from here.
Mapping Area: this is the area used by the mmap system call. Most practical malloc implementations allocate larger memory regions via mmap; this article does not discuss that case. This area grows from high addresses to low addresses.
Stack: this is the stack area, which grows from high addresses to low addresses.

Below we mainly focus on the operations of the Heap area. Students who are interested in the entire Linux memory arrangement can refer to other materials.

2.2.2 Heap memory model

Generally speaking, the memory requested by malloc is mainly allocated from the Heap area (this article does not consider applying for large blocks of memory through mmap).

As we know from the above, the virtual memory address space faced by the process can only be actually used if it is mapped to the physical memory address on a page basis. Due to physical storage capacity limitations, it is impossible for the entire heap virtual memory space to be mapped to actual physical memory. Linux’s heap management is as follows:

Linux maintains a break pointer, which points to an address in the heap space. The address space from the heap starting address to break is mapped and can be accessed by the process; and from break upwards, it is unmapped address space. If this space is accessed, the program will report an error.

2.2.3 brk and sbrk

As we know from the above, to increase the actual available heap size of a process, you need to move the break pointer to a higher address. Linux operates the break pointer through the brk and sbrk system calls. The prototypes of the two system calls are as follows:

int brk(void *addr);
void *sbrk(intptr_t increment);

brk sets the break pointer directly to an address, while sbrk moves break from the current position by the increment specified by increment. brk returns 0 when executed successfully, otherwise it returns -1 and sets errno to ENOMEM; when sbrk is successful, it returns the address pointed to before break moved, otherwise it returns (void *)-1.

A little trick is that if you set increment to 0, you can get the address of the current break.

Another thing to note is that because Linux maps memory by page, if break is not aligned to the page size, the system still maps a complete page at the end, so the memory actually mapped extends beyond the address break points to. Using addresses beyond break is nevertheless dangerous (even though a small region of usable memory may exist just past it).
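The end of the last mapped page can be worked out with a little arithmetic. This is only a sketch of the rounding described above; page_round_up is a hypothetical helper, and the page size is assumed to be a power of two:

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of page_size (a power of two).
 * If break sits at addr, the page containing it ends at this address. */
uintptr_t page_round_up(uintptr_t addr, uintptr_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (addr + page_size - 1) & ~(page_size - 1);
}
```

With a 4096-byte page, a break at 0x1001 gives 0x2000, while a break already at 0x2000 stays at 0x2000. In a real program the arguments would typically come from sbrk(0) and sysconf(_SC_PAGESIZE).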

2.2.4 Resource limits and rlimit

The resources allocated by the system to each process are not unlimited, including mappable memory space, so each process has an rlimit that represents the upper limit of resources available to the current process.

This limit can be obtained through the getrlimit system call. The following code obtains the rlimit of the current process's virtual memory space:

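#include <stdio.h>
#include <sys/resource.h>
int main(void) {
    struct rlimit limit;
    getrlimit(RLIMIT_AS, &limit);
    /* rlim_t is unsigned; cast for a portable format specifier */
    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)limit.rlim_cur, (unsigned long long)limit.rlim_max);
    return 0;
}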

where rlimit is a structure:

struct rlimit {
rlim_t rlim_cur; /* Soft limit */
rlim_t rlim_max; /* Hard limit (ceiling for rlim_cur) */
};

Each resource has soft limits and hard limits, and rlimit can be set conditionally through setrlimit. The hard limit serves as the upper limit of the soft limit. Non-privileged processes can only set soft limits and cannot exceed the hard limit.
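Setting the soft limit might look like the following sketch. set_soft_as_limit is a hypothetical helper written for this article; it caps the requested value at the hard limit so that the call also works for a non-privileged process:

```c
#include <sys/resource.h>

/* Set the soft RLIMIT_AS limit to `bytes`, capped at the hard limit.
 * Returns 0 on success and -1 on error. */
int set_soft_as_limit(rlim_t bytes) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_AS, &lim) != 0)
        return -1;
    if (lim.rlim_max != RLIM_INFINITY && bytes > lim.rlim_max)
        bytes = lim.rlim_max;          /* the soft limit may not exceed the hard limit */
    lim.rlim_cur = bytes;              /* the hard limit is left unchanged */
    return setrlimit(RLIMIT_AS, &lim);
}
```

Lowering the soft limit this way makes later brk, sbrk and mmap calls fail with ENOMEM once the address space limit is reached, which is a convenient way to test the error paths of an allocator.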

3 Implement malloc

3.1 Toy implementation

Before we formally discuss the implementation of malloc, we can use the knowledge above to write a simple toy malloc that is almost useless in practice, as a review of that knowledge:

/* A toy malloc */
#include <sys/types.h>
#include <unistd.h>
void *malloc(size_t size)
{
    void *p;
    p = sbrk(0);                      /* current break = start of the new area */
    if (sbrk(size) == (void *)-1)     /* move break up by size bytes */
        return NULL;
    return p;
}

This malloc moves break up by size bytes on every call and returns the address break pointed to before the move. Because it keeps no record of what it has allocated, memory cannot be released, so it cannot be used in real scenarios.

3.2 Formal implementation

Let’s discuss the implementation of malloc seriously.

3.2.1 Data structure

First we need to decide on the data structure. A simple and workable solution is to organize the heap as blocks. Each block consists of a meta area and a data area. The meta area records meta-information about the block (data area size, free flag, pointers, etc.), the data area is the memory actually handed out, and the address of the first byte of the data area is the address returned by malloc.

You can define a block with the following structure:

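typedef struct s_block *t_block;
struct s_block {
  size_t size; /* size of the data area */
  t_block next; /* pointer to the next block */
  int free; /* whether this block is free */
  int padding; /* 4 bytes of padding so the meta area is a multiple of 8 bytes */
  char data[1]; /* dummy field marking the first byte of the data area; not counted in the meta length */
};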

Since we only consider 64-bit machines, for convenience, we fill an int at the end of the structure so that the length of the structure itself is a multiple of 8 for memory alignment. The schematic diagram is as follows:

3.2.2 Find the appropriate block

Now consider how to find the appropriate block in the block chain. Generally speaking, there are two search algorithms:

First fit: Start from the beginning and use the first block whose data area is at least the required size as the block for this allocation.
Best fit: Start from the beginning, traverse all blocks, and use the block whose data area is at least size and closest to it as the block for this allocation.

Both methods have their own merits. Best fit achieves higher memory utilization (a higher payload ratio), while first fit runs faster. Here we use the first fit algorithm.

/* First fit */
t_block find_block(t_block *last, size_t size) {
  t_block b = first_block;
  while(b && !(b->free && b->size >= size)) {
     *last = b;
     b = b->next;
    }
  return b;
}

find_block starts from first_block, finds the first block that meets the requirements, and returns its starting address. If no such block is found, it returns NULL.

During the traversal, the pointer last is updated so that it always points to the most recently visited block. If no suitable block is found, last ends up at the tail of the list, which is where a new block is appended; this is used in the next section.
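For comparison, a best-fit search could be sketched as follows. This is not part of the implementation built in this article; find_best_block is a hypothetical name, and the list head is passed as a parameter here so that the example is self-contained:

```c
#include <stddef.h>

typedef struct s_block *t_block;
struct s_block {
    size_t  size;     /* size of the data area */
    t_block next;     /* pointer to the next block */
    int     free;     /* whether this block is free */
    int     padding;
    char    data[1];
};

/* Best fit (sketch): scan the whole list and keep the smallest free block
 * whose data area is at least `size`. */
t_block find_best_block(t_block head, t_block *last, size_t size) {
    t_block best = NULL;
    for (t_block b = head; b; b = b->next) {
        if (b->free && b->size >= size && (!best || b->size < best->size))
            best = b;
        *last = b;    /* ends up at the tail, ready for appending a new block */
    }
    return best;
}
```

Unlike first fit, the loop cannot stop at the first match, which is where the extra cost of best fit comes from.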

3.2.3 Open up a new block

If the existing blocks cannot meet the size requirements, a new block needs to be opened at the end of the linked list. The key here is how to create a struct using only sbrk:

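#define BLOCK_SIZE 24 /* sizeof cannot give the meta length because of the dummy data field, so it is set by hand */

t_block extend_heap(t_block last, size_t s) {
    t_block b;
    b = sbrk(0);                               /* the new block starts at the current break */
    if(sbrk(BLOCK_SIZE + s) == (void *)-1)     /* make room for meta area + data area */
        return NULL;
    b->size = s;
    b->next = NULL;
    if(last)
        last->next = b;                        /* append to the end of the list */
    b->free = 0;
    return b;
}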
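The hand-set value 24 can also be derived by the compiler. A minimal sketch, assuming the struct layout from section 3.2.1 on a 64-bit system (META_SIZE is a name introduced here for illustration, not part of the original code):

```c
#include <stddef.h>

typedef struct s_block *t_block;
struct s_block {
    size_t  size;     /* size of the data area */
    t_block next;     /* pointer to the next block */
    int     free;     /* whether this block is free */
    int     padding;  /* keeps the meta area a multiple of 8 bytes */
    char    data[1];  /* first byte of the data area, not part of the meta area */
};

/* Length of the meta area: the offset of data from the start of the struct. */
#define META_SIZE offsetof(struct s_block, data)
```

On x86_64 this evaluates to 24, whereas sizeof(struct s_block) is larger because it also counts the data field and trailing padding.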

3.2.4 Split block

First fit has a serious shortcoming: a small request may end up occupying a large block. To improve memory utilization, when the leftover data area is large enough the block should be split, with the remainder becoming a new free block. The representation is as follows:

Implementation code:

void split_block(t_block b, size_t s) {
    t_block new;
    /* The new block's meta area starts right after the first s bytes of data */
    new = (t_block)(b->data + s);
    new->size = b->size - s - BLOCK_SIZE;
    new->next = b->next;
    new->free = 1;
    b->size = s;
    b->next = new;
}

3.2.5 Implementation of malloc

With the code above, we can put together a simple but usable malloc. Note that we first need to define the head of the block list, first_block, and initialize it to NULL; in addition, we only split a block when the remaining space is at least BLOCK_SIZE + 8.

Since we want the data area allocated by malloc to be aligned to 8 bytes, when size is not a multiple of 8 we need to round it up to the smallest multiple of 8 that is not less than size:

size_t align8(size_t s) {
    /* Parentheses are required: == binds tighter than & */
    if((s & 0x7) == 0)
        return s;
    return ((s >> 3) + 1) << 3;
}
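Rounding up to a multiple of 8 can also be written without a branch. The following is an alternative sketch (align8_mask is a name made up here); like any such rounding, it wraps around for sizes within 7 bytes of the largest size_t value:

```c
#include <stddef.h>

/* Round s up to the nearest multiple of 8 (s itself if already aligned). */
size_t align8_mask(size_t s) {
    return (s + 7) & ~(size_t)7;
}
```

For example, 1 becomes 8, 8 stays 8, and 9 becomes 16.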

#define BLOCK_SIZE 24
void *first_block=NULL;
 
/* other functions... */
 
void *malloc(size_t size) {
    t_block b, last;
    size_t s;
    /* Align the requested size */
    s = align8(size);
    if(first_block) {
        /* Look for a suitable block */
        last = first_block;
        b = find_block(&last, s);
        if(b) {
            /* Split the block if it is large enough */
            if ((b->size - s) >= ( BLOCK_SIZE + 8))
                split_block(b, s);
            b->free = 0;
        } else {
            /* No suitable block: open up a new one */
            b = extend_heap(last, s);
            if(!b)
                return NULL;
        }
    } else {
        /* First call: the list is empty */
        b = extend_heap(NULL, s);
        if(!b)
            return NULL;
        first_block = b;
    }
    return b->data;
}

3.2.6 Implementation of calloc

With malloc, there are only two steps to implement calloc:

malloc a piece of memory
Set the contents of the data area to 0

Since our data area is aligned by 8 bytes, in order to improve efficiency, we can set 0 in groups of 8 bytes instead of setting them one by one. We can achieve this by creating a new size_t pointer and forcing the memory area to be of type size_t.

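void *calloc(size_t number, size_t size) {
    size_t *new;
    size_t s8, i;
    /* Reject requests whose total size would overflow size_t */
    if(size && number > (size_t)-1 / size)
        return NULL;
    new = malloc(number * size);
    if(new) {
        /* The data area is 8-byte aligned, so clear it 8 bytes at a time */
        s8 = align8(number * size) >> 3;
        for(i = 0; i < s8; i++)
            new[i] = 0;
    }
    return new;
}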

3.2.7 Implementation of free

The implementation of free is not as simple as it seems. Here we have to solve two key issues:

How to verify that the incoming address is a valid address, that is, it is indeed the first address of the data area allocated through malloc.
How to fix fragmentation issues

First of all, we need to ensure that the incoming free address is valid. This validity includes two aspects:

The address should be within the area allocated by malloc before, that is, within the range of first_block and the current break pointer
This address was indeed previously allocated through our own malloc

The first problem is easier to solve. Just compare the addresses. The key is the second problem.

There are two solutions here: one is to bury a magic number field in the structure. Before freeing, check whether the value of a specific position is the magic number we set by using a relative offset. The other method is to add a magic pointer to the structure. This pointer points to the first byte of the data area (that is, the address passed in when free is legal). We check whether the magic pointer points to the address pointed by the parameter before freeing. Here we use the second option:

First we add the magic pointer to the structure (and update BLOCK_SIZE accordingly, to 32 with this layout):

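typedef struct s_block *t_block;
struct s_block {
    size_t size; /* size of the data area */
    t_block next; /* pointer to the next block */
    int free; /* whether this block is free */
    int padding; /* 4 bytes of padding so the meta area is a multiple of 8 bytes */
    void *ptr; /* magic pointer, points to data */
    char data[1]; /* dummy field marking the first byte of the data area; not counted in the meta length */
};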

Then we define a function that checks the validity of the address:

t_block get_block(void *p) {
    char *tmp;
    tmp = p;
    /* Step back from the data area to the start of the meta area */
    return (t_block)(tmp - BLOCK_SIZE);
}
 
int valid_addr(void *p) {
if(first_block) {
if(p > first_block && p < sbrk(0)) {
return p == (get_block(p))->ptr;
}
}
return 0;
}
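For comparison, the first option (a magic number) might be sketched as follows. Everything here is hypothetical and separate from the implementation in this article: BLOCK_MAGIC is an arbitrary constant, and struct m_block reuses the padding slot for the magic field:

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_MAGIC 0x4d414c4cu   /* arbitrary constant chosen for this sketch */

struct m_block {
    size_t          size;
    struct m_block *next;
    int             free;
    uint32_t        magic;    /* takes the place of the padding int */
    char            data[1];
};

#define M_BLOCK_SIZE offsetof(struct m_block, data)

/* Step back from a data pointer to its meta area and test the magic value.
 * The caller must already have range-checked p against the heap bounds. */
int has_magic(void *p) {
    struct m_block *b = (struct m_block *)((char *)p - M_BLOCK_SIZE);
    return b->magic == BLOCK_MAGIC;
}
```

A fixed magic number costs no extra space here, but user data that happens to contain the same value at the right offset would pass the check, which is one reason to prefer a pointer that must match the address itself.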

After many mallocs and frees, the memory pool may contain many fragmented blocks. These blocks are small and often unusable. There may even be many free fragments lying next to each other that together could satisfy a malloc request but, being split into several small blocks, cannot. This is the fragmentation problem.

A simple solution is that when freeing a block, if it is found that its adjacent block is also free, merge the block with the adjacent block. In order to meet this implementation, s_block needs to be changed to a doubly linked list.

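The modified block structure is as follows (BLOCK_SIZE is now 40, and extend_heap and split_block must also fill in the new prev and ptr fields):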

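typedef struct s_block *t_block;
struct s_block {
    size_t size; /* size of the data area */
    t_block prev; /* pointer to the previous block */
    t_block next; /* pointer to the next block */
    int free; /* whether this block is free */
    int padding; /* 4 bytes of padding so the meta area is a multiple of 8 bytes */
    void *ptr; /* magic pointer, points to data */
    char data[1]; /* dummy field marking the first byte of the data area; not counted in the meta length */
};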

The merge method is as follows:

t_block fusion(t_block b) {
  if (b->next && b->next->free) {
  b->size += BLOCK_SIZE + b->next->size;
  b->next = b->next->next;
  if(b->next)
  b->next->prev = b;
  }
  return b;
}
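Every place that creates a block also has to set the prev and ptr fields, otherwise valid_addr and the backward merge cannot work. The article does not show that code, so the following is only a sketch of what the initialization could look like; init_block is a hypothetical helper that extend_heap (and, with small changes, split_block) could call:

```c
#include <stddef.h>

typedef struct s_block *t_block;
struct s_block {
    size_t  size;
    t_block prev;
    t_block next;
    int     free;
    int     padding;
    void   *ptr;
    char    data[1];
};

/* Hypothetical helper: fill in a meta area at `mem` and append the block
 * after `last` (which may be NULL for the first block). */
t_block init_block(void *mem, t_block last, size_t s) {
    t_block b = mem;
    b->size = s;
    b->prev = last;
    b->next = NULL;
    b->free = 0;
    b->ptr  = b->data;    /* magic pointer checked by valid_addr */
    if (last)
        last->next = b;
    return b;
}
```

In extend_heap, mem would be the old break returned by sbrk(0) and last the tail of the list.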

With the above in place, the idea behind free is fairly clear: first check that the address passed in is valid, and do nothing if it is not; otherwise set the block's free flag to 1 and, where possible, merge it with the neighboring free blocks.

If the resulting block is the last block in the list, move the break pointer back to release the memory to the system. If it is also the only block (it has no predecessor), set first_block to NULL as well. The implementation is as follows:

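void free(void *p) {
    t_block b;
    if(valid_addr(p)) {
        b = get_block(p);
        b->free = 1;
        /* Merge with the previous block if it is free */
        if(b->prev && b->prev->free)
            b = fusion(b->prev);
        if(b->next)
            /* Merge with the next block if it is free (fusion checks this) */
            fusion(b);
        else {
            /* Last block: detach it and give the memory back to the system */
            if(b->prev)
                b->prev->next = NULL;
            else
                /* No blocks left */
                first_block = NULL;
            brk(b);
        }
    }
}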

3.2.8 Implementation of realloc

In order to implement realloc, we first need to implement a memory copy method. Like calloc, for efficiency, we copy in 8-byte units:

void copy_block(t_block src, t_block dst) {
size_t *sdata, *ddata;
size_t i;
sdata = src->ptr;
ddata = dst->ptr;
for(i = 0; (i * 8) < src->size && (i * 8) < dst->size; i++)
ddata[i] = sdata[i];
}

Then we start to implement realloc. A simple (but inefficient) method is to malloc a section of memory and then copy the data there. But we can do it more efficiently. Specifically, we can consider the following aspects:

If the data area of the current block is greater than or equal to the size required by realloc, no operation will be performed.
If the new size becomes smaller, consider splitting
If the data area of the current block cannot meet the size, but its subsequent block is free and can meet the size after merging, consider merging it.

The following is the implementation of realloc:

void *realloc(void *p, size_t size) {
    size_t s;
    t_block b, new;
    void *newp;
    if (!p)
        /* Per the standard library documentation, passing NULL for p is equivalent to calling malloc */
        return malloc(size);
    if(valid_addr(p)) {
        s = align8(size);
        b = get_block(p);
        if(b->size >= s) {
            if(b->size - s >= (BLOCK_SIZE + 8))
                split_block(b,s);
        } else {
            /* See whether merging with the next block is enough */
            if(b->next && b->next->free
                    && (b->size + BLOCK_SIZE + b->next->size) >= s) {
                fusion(b);
                if(b->size - s >= (BLOCK_SIZE + 8))
                    split_block(b, s);
            } else {
                /* Fall back to a new malloc and copy the data over */
                newp = malloc (s);
                if (!newp)
                    return NULL;
                new = get_block(newp);
                copy_block(b, new);
                free(p);
                return(newp);
            }
        }
        return (p);
    }
    return NULL;
}

3.3 Remaining issues and optimizations

The above is a relatively simple, but initially usable malloc implementation. There are still many remaining possible optimization points, such as:

Compatible with both 32-bit and 64-bit systems
When allocating larger blocks of memory, consider using mmap instead of sbrk, which is often more efficient
You can consider maintaining multiple linked lists instead of a single one. The block size in each linked list is within a range, such as 8-byte linked list, 16-byte linked list, 24-32 byte linked list, etc. At this time, allocation can be made to the corresponding linked list according to the size, which can effectively reduce fragmentation and improve the speed of querying the block.
You can consider storing only free blocks in the linked list instead of allocated blocks, which can reduce the number of block searches and improve efficiency.
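The multiple-list idea needs a way to map a request size to a list. A minimal sketch, assuming the example ranges mentioned above plus one catch-all list for larger sizes (NUM_CLASSES and size_class are hypothetical names, and real allocators use many more classes):

```c
#include <stddef.h>

#define NUM_CLASSES 4    /* hypothetical: 8, 16, 24-32, and "larger" */

/* Map a request size (already aligned to 8 bytes) to a free-list index. */
int size_class(size_t s) {
    if (s <= 8)
        return 0;
    if (s <= 16)
        return 1;
    if (s <= 32)
        return 2;
    return NUM_CLASSES - 1;    /* everything larger shares the last list */
}
```

malloc would then search only the list at index size_class(s), falling back to a larger class or to extend_heap when that list is empty.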

Original author: Learn embedded together