Table of Contents
Small memory (small_class) allocation
Large memory (large_class) allocation
Memory redistribution (realloc)
Advantages and disadvantages of JeMalloc implementation
JeMalloc is a memory allocator. Compared with other memory allocators, its biggest advantages are its performance under multi-threading and its reduction of memory fragmentation.
Basic knowledge
The following content introduces the more important concepts and data structures in JeMalloc.
size_class
Each `size_class` represents a memory size that jemalloc allocates. There are NSIZES (232) classes in total; if the size requested by the user falls between two classes, the larger one is used (for example, a request for 14 bytes, which falls between the 8-byte and 16-byte classes, is served with 16 bytes). They are divided into 2 categories:

- `small_class` (small memory): for 64-bit machines, the usual range is [8, 14kb], with common sizes such as 8, 16, 32, 48, 64, ..., 2kb, 4kb, 8kb. Note that, in order to reduce memory fragmentation, not all sizes are powers of 2. For example, if there were no 48-byte class, a request for 33 bytes would be served with 64 bytes, which obviously causes about 50% internal fragmentation.
- `large_class` (large memory): for 64-bit machines, the usual range is [16kb, 7EiB], starting from 4 * page_size, with common sizes such as 16kb, 32kb, ..., 1mb, 2mb, 4mb.
- `size_index`: the index of a size within the `size_class` table, in the interval [0, 231]. For example, 8 bytes has index 0, 14 bytes (treated as 16) has index 1, and 4kb has index 28. When the size is a `small_class`, the `size_index` is also called `binind`.
base
The structure used to allocate jemalloc's own metadata memory. A `base` is usually 2mb in size, and all `base`s form a linked list.

- `base.extents[NSIZES]`: stores the `extent` metadata for each `size_class`
bin
Manages the `slab`s in use (a `slab` is a small `extent` used for small memory allocation); each `bin` corresponds to one `size_class`.

- `bin.slabcur`: the `slab` currently in use
- `bin.slabs_nonfull`: `slab`s that still have free memory blocks
extent
The structure that manages a jemalloc memory block (that is, the memory allocated to users). The size of each memory block is N * page_size (4kb), with N >= 1. Each extent has a serial number.

An `extent` can serve one `large_class` memory request, or multiple `small_class` memory requests.

- `extent.e_bits`: 8 bytes long, records various pieces of information
- `extent.e_addr`: the starting address of the managed memory block
- `extent.e_slab_data`: a bitmap; when the `extent` is used to allocate `small_class` memory, it records the allocation state within the `extent`. In this case, each piece of small memory inside the `extent` is called a `region`.

slab

When an `extent` is used to allocate `small_class` memory, it is called a `slab`. One `extent` can serve multiple memory requests of the same `size_class`.
extents
A collection of managed `extent`s.

- `extents.heaps[NPSIZES+1]`: `extent`s grouped by size in multiples of a page (4kb)
- `extents.lru`: a doubly linked list storing all `extent`s
- `extents.delay_coalesce`: whether `extent` coalescing is delayed
arena
The structure responsible for `extent` allocation & recycling. Each user thread is bound to one `arena`; by default, each logical CPU has 4 `arena`s to reduce lock contention. The memory managed by each arena is independent of the others.

- `arena.extents_dirty`: where an `extent` is placed immediately after being freed
- `arena.extents_muzzy`: where `extent`s from `extents_dirty` are placed after lazy purge, i.e. `dirty -> muzzy`
- `arena.extents_retained`: where `extent`s from `extents_muzzy` are placed after decommit or forced purge, i.e. `muzzy -> retained`
- `arena.large`: the `extents` storing large `extent`s
- `arena.extent_avail`: a heap storing available `extent` metadata
- `arena.bins[NBINS]`: the `bin`s used to allocate small memory
- `arena.base`: the `base` used to allocate metadata

purge and decommit are introduced in the memory gc section.
rtree
A globally unique radix tree that stores the information of every `extent`, using `extent->e_addr` (a `uintptr_t`) as the key. On my machine, for example, `uintptr_t` is 64 bits (8 bytes) and the `rtree` has a height of 3. Since `extent->e_addr` is page (1 << 12) aligned, only 64 - 12 = 52 bits are needed to determine a position in the tree, and the three layers are indexed through bits 0-16, bits 17-33, and bits 34-51 of the key respectively.
cache_bin
Each thread's private cache for allocating small memory.

- `cache_bin.low_water`: the number of cached items remaining after the last gc
- `cache_bin.ncached`: the number of items currently cached in the `cache_bin`
- `cache_bin.avail`: memory that can be handed out directly, allocated from left to right (note the addressing scheme here)
tcache

Each thread's private cache (Thread Cache). Most memory requests can be served directly from the `tcache`, thereby avoiding locking.

- `tcache.bins_small[NBINS]`: the `cache_bin`s for small memory

tsd
Thread Specific Data, unique to each thread, used to store the structures related to this thread.
- `tsd.rtree_ctx`: the current thread's rtree context, for fast access to `extent` information
- `tsd.arena`: the `arena` bound to the current thread
- `tsd.tcache`: the current thread's `tcache`
Memory allocation (malloc)
Small memory (small_class) allocation
First, try to obtain memory of the corresponding `size_class` from `tsd->tcache->bins_small[binind]`. If memory is available there, it is returned to the user directly. If `bins_small[binind]` is empty, a `slab(extent)` must be used to refill `tsd->tcache->bins_small[binind]`; multiple blocks are filled at once to serve subsequent allocations. The refill proceeds as follows (if one step fails, move on to the next):
- Allocate from `bin->slabcur`
- Obtain a usable `extent` from `bin->slabs_nonfull`
- Recycle an `extent` from `arena->extents_dirty`. The recycling policy is Best-Fit, i.e. the smallest `extent` that satisfies the size requirement: in `arena->extents_dirty->bitmap`, locate the first non-empty heap index `i` that meets the size requirement, then take the first `extent` from `extents->heaps[i]`. Because this `extent` may be large, it is split (buddy algorithm) to prevent memory fragmentation, and the unused part left over after splitting is put back into `extents_dirty`.
- Recycle an `extent` from `arena->extents_muzzy`. The recycling policy is First-Fit, i.e. among the `extent`s that satisfy the size requirement, the one with the lowest serial number and lowest address (the oldest): traverse each non-empty heap in the bitmap that meets the size requirement, take the first `extent` of the corresponding `extents->heaps`, compare them to find the oldest `extent`, and then split it as above.
- Recycle an `extent` from `arena->extents_retained`, in the same way as for `extents_muzzy`
- Try `mmap` to obtain the required `extent` memory from the kernel, and register the new `extent`'s information in the `rtree`
- Try once more to obtain a usable `extent` from `bin->slabs_nonfull`

Simply put, the process looks like this: `cache_bin -> slab -> slabs_nonfull -> extents_dirty -> extents_muzzy -> extents_retained -> kernel`.
Large memory (large_class) allocation
Large memory is not cached in `tsd->tcache`, because that could waste memory, so each request allocates a new `extent`. The `extent` application process is the same as steps 3, 4, 5, and 6 of the small memory allocation process.
Memory release (free)
Small memory release
First, locate the `extent` that the memory to be freed belongs to via the `rtree`, then return the memory to `tsd->tcache->bins_small[binind]`. If `tsd->tcache->bins_small[binind]` is full, it needs to be flushed, as follows:
1. Return the memory to its owning `extent`. If all memory blocks in this `extent` become free (that is, nothing remains allocated from it), go to 2; if the number of free blocks in this `extent` becomes 1 and this `extent` is not `arena->bins[binind]->slabcur`, go to 3
2. Release this `extent`, that is, insert it into `arena->extents_dirty`
3. Switch `arena->bins[binind]->slabcur` to this `extent`, provided this `extent` is "older" (its serial number is smaller and its address is lower); the replaced `extent` is moved into `arena->bins[binind]->slabs_nonfull`
Large memory release
Because large memory is not cached in `tsd->tcache`, releasing it only performs step 2 of the small memory release, i.e. the `extent` is inserted into `arena->extents_dirty`.
Memory redistribution (realloc)
Small memory redistribution
1. Try a `no move` allocation: if the previous actual allocation already satisfies the new size, do nothing and return directly. For example, suppose 12 bytes were requested the first time; jemalloc actually allocates 16 bytes, so if the second request grows the 12 bytes to 15 or shrinks them to 9, the 16 bytes already satisfy the demand and nothing needs to be done. If the demand cannot be satisfied, go to 2
2. Reallocate: request memory of the new size (see memory allocation), copy the contents of the old memory to the new address, release the old memory (see memory release), and finally return the new memory
Large memory redistribution
1. Try a `no move` allocation: if the two requests fall into the same `size_class`, do nothing and return directly.
2. Try a `no move resize` allocation. If the size of the second request is greater than the first, check whether the `extent` at the address immediately following the current one can be allocated: for example, if the current `extent`'s address is 0x1000 and its size is 0x1000, check (via the `rtree`) whether an `extent` starting at 0x2000 exists and meets the requirements; if so, the two `extent`s can be merged into a new `extent` without reallocation. If the size of the second request is smaller than the first, try to split the current `extent` and release the unneeded second half, to reduce memory fragmentation. If the `no move resize` allocation fails, go to 3
3. Reallocate: request memory of the new size (see memory allocation), copy the contents of the old memory to the new address, release the old memory (see memory release), and finally return the new memory
Memory GC
There are 2 kinds: `tcache` GC and `extent` GC. Strictly speaking, "decay" would be the more accurate term, but "gc" is used here for convenience.
1. tcache GC
To prevent a thread from holding pre-allocated `small_class` memory in its cache without actually handing it out to the user, the thread cache is flushed periodically.
GC strategy
Every malloc or free operation on a `tcache` is counted; by default, gc is triggered when the count reaches 228, and each gc processes one `cache_bin`.
How to GC
- `cache_bin.low_water > 0`: gc away 3/4 of `low_water`, and at the same time halve the maximum number of items the `cache_bin` may cache
- `cache_bin.low_water < 0`: double the maximum number of items the `cache_bin` may cache

In general, this guarantees that the more frequently a `cache_bin` is used for allocation, the more memory it caches; otherwise, its cache shrinks.
2. extent GC
When free is called, the memory is not returned to the kernel immediately. In jemalloc, `extent`s that have not been used for allocation for a while are gradually gc-ed; the process is the reverse of memory allocation: `free -> extents_dirty -> extents_muzzy -> extents_retained -> kernel`.
GC strategy
By default, `extents_dirty` and `extents_muzzy` each have a 10s gc cycle. Every malloc or free operation on an `arena` is counted; when the count reaches 1000, the arena checks whether the gc deadline has been reached, and if so, performs gc.

Note that gc is not done all at once every 10s. jemalloc actually divides the 10s into 200 slices, i.e. gc runs every 0.05s. This way, if there are N `page`s that need gc at time t, jemalloc tries to ensure that all N `page`s have been gc-ed by time t+10.
For the N `page`s that need gc, jemalloc does not simply process N/200 `page`s every 0.05s; it introduces `Smoothstep` (a function family mainly used in computer graphics) to obtain a smoother gc curve, which is one of the most interesting aspects of jemalloc.

jemalloc internally maintains an array of length 200 that determines how many `page`s should have been gc-ed at each time point within the 10s gc cycle. This ensures that the `page`s produced between two gc deadlines decrease smoothly from N to 0 over the 10s period, following a smoothstep curve (smootherstep by default).
How to GC
`extents_dirty` is gc-ed first, then `extents_muzzy`.

1. Move `extent`s from `extents_dirty` into `extents_muzzy`:
   1. From the LRU linked list in `extents_dirty`, take the `extent` to be gc-ed, try to coalesce it with the `extent`s before and after it (provided they belong to the same arena and the same `extents`) to obtain a new `extent`, then remove it
   2. Lazily purge the address range managed by the current `extent`: via `madvise` with the `MADV_FREE` flag, tell the kernel that the memory managed by this `extent` may no longer be accessed
   3. In `extents_muzzy`, try to coalesce the current `extent` with its neighbors to obtain a new `extent`, and finally insert it into `extents_muzzy`
2. Move `extent`s from `extents_muzzy` into `extents_retained`:
   1. From the LRU linked list in `extents_muzzy`, take the `extent` to be gc-ed, try to coalesce it with its neighbors to obtain a new `extent`, then remove it
   2. Decommit the address range managed by the current `extent`: call `mmap` with `PROT_NONE` to tell the kernel that the address range managed by this `extent` may no longer be accessed. If the decommit fails, perform a forced purge, i.e. `madvise` with the `MADV_DONTNEED` flag
   3. In `extents_retained`, try to coalesce the current `extent` with its neighbors to obtain a new `extent`, and finally insert it into `extents_retained`
3. By default, jemalloc does not return memory to the kernel; only when the process ends is all memory returned to the kernel via `munmap`. However, an `arena` can be destroyed manually, so that the memory in its `extents_retained` is `munmap`-ed
Memory fragmentation
JeMalloc keeps internal fragmentation at around 20%. Most `size_class`es are organized into groups; for example, 160, 192, 224, and 256 belong to group 7. Each group contains 4 `size_class`es, and the size of each class is

size = group_base + i * group_delta, where group_delta = group_base / 4

with i being the index within the group (1, 2, 3, 4).

Consider two adjacent groups: the worst case is a request one byte larger than the last class of the previous group, which must be served by the first class of the next group. The maximum waste is then roughly group_delta / (group_base + group_delta), which is about 20%.
Advantages and disadvantages of JeMalloc implementation
Advantages

- Multiple `arena`s are used to avoid thread synchronization
- Fine-grained locks: for example, each `bin` and each `extents` has its own lock
- Use of memory orderings: for example, `rtree` reads and writes use different atomic semantics (relaxed, acquire, release)
- Structures and memory allocations are kept aligned for better cache locality
- When `cache_bin` allocates memory, a stack variable is used to determine success, avoiding cache misses
- Dirty `extent`s are coalesced lazily for better cache locality; `extent`s are purged lazily for a smoother gc
- Compact structure memory layouts reduce the footprint, e.g. `extent.e_bits`
- The two-level `rtree_ctx` cache introduced for the `rtree` speeds up `extent` lookups while reducing cache misses
- `tcache` dynamically adjusts its cache capacity during gc
Disadvantages

1. Memory is not shared between `arena`s
   - A thread may use a lot of memory in its `arena`; if that `arena` is then not used by other threads, the memory cannot be reclaimed by gc and stays occupied
   - Threads bound to two different `arena`s may make frequent memory requests, causing the memory of the two `arena`s to interleave heavily; contiguous memory belonging to different `arena`s cannot be merged
2. (This is the only one I could think of.)
To sum up

As stated at the beginning of the article, jemalloc's strengths are its performance under multi-threading and its reduction of memory fragmentation. For multi-threaded performance there are per-thread `arena`s, reduced lock granularity, the use of atomic semantics, and so on; for reducing memory fragmentation there are many designs such as the `size_class` scheme, the buddy algorithm, and gc.

The value of reading the jemalloc source is not just being able to describe exactly what happens on every malloc and free, but learning how a memory allocator manages memory. malloc and free are the static allocation and release, while `tcache` and `extent` gc are the dynamic management, which is just as important to understand.

In addition, it can help you choose an appropriate allocation strategy based on your program's memory usage characteristics, or even write your own special-purpose memory allocator.

Finally, I think the most interesting part of jemalloc is the `extent` gc curve.