Overview of jemalloc principles

jemalloc rose to fame in the Linux world and has since been ported to many platforms. tcmalloc, a later arrival, delivers comparable performance, but even with Google behind it, its adoption still trails jemalloc's. There are already plenty of write-ups on jemalloc online, so rather than repeat them one by one, this article analyzes a few key points.

1. Address access
The argument to free (and the value returned by malloc) is a memory address. Given such an address, how do we quickly locate the base address of the memory block it belongs to? Under high-frequency allocation this is the first thing to get right. jemalloc uses a simple trick, chunk = addr & (~chunksize_mask), so the lookup is O(1). The implicit precondition is that every chunk starts at an address of the form 0xaabb0000, i.e. aligned to the chunk size: the low bits covered by chunksize_mask must all be zero.
When jemalloc allocates a chunk it therefore over-allocates, alloc_size = size + alignment - PAGE_SIZE, and then trims the misaligned leading portion so that the chunk start satisfies this condition; the surplus memory is returned to the system.
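To make this concrete, here is a minimal sketch of both tricks, assuming a 4 MB chunk size (the unit mentioned in Section 2); the names and the mmap-based trimming are illustrative, not jemalloc's actual code:

```c
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>

#define CHUNK_SIZE ((size_t)4 << 20)          /* assumed 4 MB chunks */
#define CHUNK_MASK ((uintptr_t)CHUNK_SIZE - 1)

/* O(1) addressing: clear the low bits to get the chunk base address. */
static inline void *chunk_base(const void *addr) {
    return (void *)((uintptr_t)addr & ~CHUNK_MASK);
}

/* Over-allocate size + chunksize - page, then trim the slack so the
 * returned region starts exactly on a chunk boundary.
 * size is assumed to be a multiple of CHUNK_SIZE. */
static void *chunk_alloc_aligned(size_t size) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t alloc_size = size + CHUNK_SIZE - page;
    char *raw = mmap(NULL, alloc_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t aligned = ((uintptr_t)raw + CHUNK_MASK) & ~CHUNK_MASK;
    size_t head = aligned - (uintptr_t)raw;            /* misaligned prefix */
    size_t tail = alloc_size - head - size;            /* leftover suffix   */
    if (head) munmap(raw, head);                       /* give the slack    */
    if (tail) munmap((char *)aligned + size, tail);    /* back to the OS    */
    return (void *)aligned;
}
```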

2. Memory page management
Small objects can be mapped to their chunk with the trick above, but the chunks themselves (the metadata describing them) cannot be located that way. jemalloc therefore keeps them in a three-level radix tree keyed by chunk address, so lookups are still quite fast, but inserts, deletes, updates and lookups must take a lock. Locking costs something, yet because these operations are comparatively rare the impact is small. Note that once allocated, the nodes of jemalloc's global radix tree are not released until the process exits. From the operating system, jemalloc requests memory in units of 4 MB at a time.
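Below is a minimal sketch of what such a three-level radix tree lookup might look like; the bit widths and node layout are assumptions chosen for illustration, not jemalloc's constants, and the locking described above is omitted for brevity:

```c
#include <stdint.h>
#include <stddef.h>

#define CHUNK_SHIFT 22                    /* 4 MB chunks: drop the low 22 bits */
#define LEVEL_BITS  14                    /* 3 levels x 14 bits of key          */
#define LEVEL_SIZE  ((size_t)1 << LEVEL_BITS)
#define LEVEL_MASK  (LEVEL_SIZE - 1)

typedef struct rtree_node {
    void *slot[LEVEL_SIZE];               /* child node, or metadata at leaves  */
} rtree_node_t;

/* Look up the metadata registered for the chunk containing addr, or NULL. */
static void *rtree_lookup(rtree_node_t *root, uintptr_t addr) {
    uintptr_t key = addr >> CHUNK_SHIFT;
    size_t i0 = (key >> (2 * LEVEL_BITS)) & LEVEL_MASK;
    size_t i1 = (key >> LEVEL_BITS) & LEVEL_MASK;
    size_t i2 = key & LEVEL_MASK;

    rtree_node_t *l1 = root->slot[i0];
    if (l1 == NULL) return NULL;
    rtree_node_t *l2 = l1->slot[i1];
    if (l2 == NULL) return NULL;
    return l2->slot[i2];                  /* leaf slot holds the chunk metadata */
}
```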

3. Length Alignment
In real workloads the requested sizes are essentially arbitrary. Allocating exactly the requested number of bytes easily leads to page faults, so sizes are rounded up to aligned size classes. jemalloc does not use one fixed alignment; the alignment grows with the size range, as in the following table:
serial number | size range (bytes) | byte alignment
0             | [0, 16]            | 8
1             | (16, 128]          | 16
2             | (128, 256]         | 32
3             | (256, 512]         | 64
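A minimal sketch of this rounding, encoding only the four rows of the table above (the real size-class table in jemalloc is longer and generated at build time):

```c
#include <stddef.h>

/* Round a request up to the size-class alignment from the table above. */
static size_t align_request(size_t size) {
    size_t align;
    if (size <= 16)       align = 8;
    else if (size <= 128) align = 16;
    else if (size <= 256) align = 32;
    else if (size <= 512) align = 64;
    else                  return size;    /* classes above 512 B not covered here */

    return (size + align - 1) & ~(align - 1);   /* round up to a multiple of align */
}
```

For example, a 100-byte request falls into row 1 and is rounded up to 112 bytes, while a 300-byte request falls into row 3 and is rounded up to 320 bytes.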

4. Thread competition
During memory allocation, locks make threads wait, which hurts performance badly. jemalloc uses two measures to keep threads from contending on locks:
1. Thread-local variables: each thread has its own memory manager, so an allocation completes within the thread without competing with other threads.
2. Arenas are kept in an array, and each thread is mapped to an array element by its thread id, which reduces the probability of several threads contending for the same arena.
Somewhat surprisingly, jemalloc makes little use of atomic operations; its locks are coarser-grained mutexes. Such coarse-grained locks only pay off when a wait can be long, for example when a thread is stuck in a system call. There is plenty of material about arenas available online.
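A minimal sketch of these two measures together, with a thread-local pointer caching each thread's arena and the initial pick mapping the thread id onto an arena array; the arena count and the mapping are illustrative assumptions, and the cast of pthread_self() assumes an integral pthread_t as on Linux:

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NARENAS 16                        /* illustrative; real jemalloc sizes
                                             this from the number of CPUs     */
typedef struct arena {
    pthread_mutex_t lock;                 /* coarse-grained per-arena mutex    */
    /* bins, runs, statistics ... */
} arena_t;

static arena_t arenas[NARENAS];
static __thread arena_t *thread_arena;    /* measure 1: thread-local binding   */

static arena_t *choose_arena(void) {
    if (thread_arena == NULL) {
        /* measure 2: spread threads across the arena array by thread id */
        uintptr_t tid = (uintptr_t)pthread_self();
        thread_arena = &arenas[(tid >> 4) % NARENAS];
    }
    return thread_arena;
}
```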

5. Allocation process
Suppose we want to allocate a memory block of SIZE bytes; the process is roughly as follows (a code sketch of these steps follows the list):
1. Select an arena (or the thread's tcache).
2. Compute the aligned length (see Section 3) and, from it, the index of the bin within the arena.
3. Within that bin, allocate from runcur if it has free space; otherwise pick a run from runs.
4. In the chosen run, scan the bitmap for a free region and return it.
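The following sketch strings the four steps together. The bin/run layout, the bitmap, and the bin indexing are simplified assumptions rather than jemalloc's real data structures, and the arena/tcache chosen in step 1 is represented by the bins array passed in:

```c
#include <stdint.h>
#include <stddef.h>
#include <strings.h>                    /* ffs() */

typedef struct run {
    uint32_t    free_bitmap;            /* one bit per region, 1 = free     */
    size_t      region_size;            /* size class served by this run    */
    char       *regions;                /* start of the region area         */
    struct run *next;                   /* link in the bin's run list       */
} run_t;

typedef struct bin {
    run_t *runcur;                      /* current run for this size class  */
    run_t *runs;                        /* other runs that still have space */
} bin_t;

/* Rounding from Section 3, repeated so this block stands alone
 * (covers only requests up to 512 bytes). */
static size_t align_request(size_t size) {
    size_t a = size <= 16 ? 8 : size <= 128 ? 16 : size <= 256 ? 32 : 64;
    return (size + a - 1) & ~(a - 1);
}

/* Step 4: take a free region out of the run via its bitmap. */
static void *run_alloc_region(run_t *run) {
    int bit = ffs((int)run->free_bitmap);        /* lowest set bit, 1-based */
    if (bit == 0)
        return NULL;                             /* run is full             */
    run->free_bitmap &= ~(1u << (bit - 1));      /* mark region as used     */
    return run->regions + (size_t)(bit - 1) * run->region_size;
}

static void *sketch_malloc(bin_t *bins, size_t size) {
    /* 2. aligned length -> bin index (placeholder indexing)                */
    size_t aligned = align_request(size);
    bin_t *bin = &bins[aligned / 8];

    /* 3. prefer runcur, otherwise promote a run from the runs list         */
    run_t *run = bin->runcur;
    if (run == NULL || run->free_bitmap == 0) {
        run = bin->runs;
        if (run == NULL)
            return NULL;               /* a real allocator would map a new run */
        bin->runs = run->next;
        bin->runcur = run;
    }

    /* 4. bitmap scan inside the chosen run                                 */
    return run_alloc_region(run);
}
```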


