Introduction to the JeMalloc memory allocator

Table of Contents

Basic knowledge

size_class

base

bin

extent

slab

extents

arena

rtree

cache_bin

tcache

tsd

Memory allocation (malloc)

Small memory (small_class) allocation

Large memory (large_class) allocation

Memory release (free)

Small memory release

Large memory release

Memory reallocation (realloc)

Small memory reallocation

Large memory reallocation

Memory GC

1. tcache GC

2. extent GC

Memory fragmentation

Advantages and disadvantages of JeMalloc implementation

Summary


JeMalloc is a memory allocator. Compared with other memory allocators, its biggest advantages are its high performance under multi-threading and its reduction of memory fragmentation.

Basic knowledge

The following content introduces the more important concepts and data structures in JeMalloc.

size_class

Each size_class represents a memory size that jemalloc can allocate. There are NSIZES (232) classes in total (if the size requested by the user falls between two classes, the larger one is used; for example, a request for 14 bytes lies between the 8- and 16-byte classes and is served with 16 bytes). They are divided into 2 categories:

  • small_class (small memory): for 64-bit machines, the usual range is [8, 14kb]; common sizes are 8, 16, 32, 48, 64, ..., 2kb, 4kb, 8kb. Note that, in order to reduce memory fragmentation, the sizes are not all powers of 2. For example, if there were no 48-byte class, a request for 33 bytes would be served with 64 bytes, causing roughly 50% internal fragmentation.
  • large_class (large memory): for 64-bit machines, the usual range is [16kb, 7EiB], starting from 4 * page_size; common sizes are 16kb, 32kb, ..., 1mb, 2mb, 4mb, and the maximum is 7EiB.
  • size_index: the index of a size within size_class, in the interval [0, 231]. For example, 8 bytes is 0, 14 bytes (treated as 16) is 1, and 4kb is 28. When the size is a small_class, size_index is also called binind.

base

The structure used to allocate jemalloc's metadata memory. A base is usually 2mb in size, and all bases form a linked list.

  • base.extents[NSIZES]: stores the extent metadata for each size_class

bin

Manages the slabs in use (i.e., the extents used for small memory allocation); each bin corresponds to one size_class.

  • bin.slabcur: the slab currently in use
  • bin.slabs_nonfull: slabs that still have free memory blocks

extent

The structure that manages a jemalloc memory block (that is, the memory allocated to the user). The size of each memory block can be N * page_size (4kb) (N >= 1). Each extent has a serial number.

An extent can be used to serve one large_class memory request, or multiple small_class memory requests.

  • extent.e_bits: 8 bytes long, records various pieces of information
  • extent.e_addr: the starting address of the managed memory block
  • extent.e_slab_data: a bitmap. When the extent is used to allocate small_class memory, it records the allocation state of the extent; in this case, each small memory block in the extent is called a region.

slab

When an extent is used to allocate small_class memory, it is called a slab. One extent can serve multiple memory requests of the same size_class.

extents

A collection of managed extents.

  • extents.heaps[NPSIZES+1]: extents whose sizes are various multiples of a page (4kb)
  • extents.lru: a doubly linked list storing all the extents
  • extents.delay_coalesce: whether to delay extent merging

arena

The structure used for extent allocation & recycling. Each user thread is bound to one arena; by default there are 4 arenas per logical CPU to reduce lock contention. The memory managed by each arena is independent of the others.

  • arena.extents_dirty: where an extent goes immediately after being freed
  • arena.extents_muzzy: where extents from extents_dirty go after a lazy purge; dirty -> muzzy
  • arena.extents_retained: where extents from extents_muzzy go after a decommit or force purge; muzzy -> retained
  • arena.large: the extents collection that stores large extents
  • arena.extent_avail: a heap storing available extent metadata
  • arena.bins[NBINS]: the bins used to allocate small memory
  • arena.base: the base used to allocate metadata

purge and decommit are introduced in the Memory GC section.

rtree

A globally unique Radix Tree that stores the information of every extent, keyed by extent->e_addr (i.e., a uintptr_t). On my machine, for example, uintptr_t is 64 bits (8 bytes) and the rtree has a height of 3. Since extent->e_addr is page (1 << 12) aligned, 64 - 12 = 52 bits suffice to determine a position in the tree, and the three layers are indexed through bits 0-16, bits 17-33, and bits 34-51.

cache_bin

A per-thread cache for allocating small memory.

  • cache_bin.low_water: the number of cached items remaining after the last gc
  • cache_bin.ncached: the number of items currently cached in the cache_bin
  • cache_bin.avail: memory that can be used directly for allocation, allocated from left to right (note the addressing scheme here)

 

tcache

The cache unique to each thread (Thread Cache). Most memory requests can be served directly from the tcache, thereby avoiding locking.

  • tcache.bins_small[NBINS]: the small-memory cache_bins

tsd

Thread-Specific Data, unique to each thread, used to store structures related to the thread.

  • tsd.rtree_ctx : The rtree context of the current thread for quick access to  extent information
  • tsd.arena: the arena bound to the current thread
  • tsd.tcache: the tcache of the current thread

 

Memory allocation (malloc)

Small memory (small_class) allocation

First, try to obtain memory of the corresponding size_class from tsd->tcache->bins_small[binind]; if memory is available it is returned to the user directly. If bins_small[binind] is empty, a slab (extent) is needed to fill tsd->tcache->bins_small[binind] with multiple items for subsequent allocation. The filling process is as follows (if a step does not succeed, proceed to the next step):

  1. Allocate from bin->slabcur
  2. Obtain a usable extent from bin->slabs_nonfull
  3. Recycle an extent from arena->extents_dirty. The recycling mode is Best-Fit, i.e., the smallest extent that meets the size requirement: in arena->extents_dirty->bitmap, locate the first non-empty heap index i that meets the size requirement, then take the first extent from extents->heaps[i]. Because this extent may be large, to prevent memory fragmentation it is split (buddy algorithm), and the unused part left after splitting is put back into extents_dirty.
  4. Recycle an extent from arena->extents_muzzy. The recycling mode is First-Fit, i.e., the extent with the lowest serial number and lowest address (the oldest) that meets the size requirement: traverse each non-empty slot of arena->extents_muzzy->bitmap that meets the size requirement, obtain the first extent from the corresponding extents->heaps, compare them to find the oldest extent, and then split it as above.
  5. Recycle an extent from arena->extents_retained, in the same way as for extents_muzzy.
  6. Try mmap to obtain the required extent memory from the kernel, and register the new extent's information in the rtree.
  7. Try again to obtain a usable extent from bin->slabs_nonfull.

Simply put, the process looks like this: cache_bin -> slab -> slabs_nonfull -> extents_dirty -> extents_muzzy -> extents_retained -> kernel.

 

Large memory (large_class) allocation

Large memory is not cached in tsd->tcache, because that could waste memory, so each request allocates a new extent. The extent application process is the same as steps 3, 4, 5, and 6 of the small memory allocation process.

 

Memory release (free)

Small memory release

First locate, via the rtree, the extent to which the memory being freed belongs, then return the memory to tsd->tcache->bins_small[binind]. If tsd->tcache->bins_small[binind] is full, it needs to be flushed; the process is as follows:

  1. Return the memory to its owning extent. If every memory block in this extent is now free (that is, nothing is allocated), skip to 2; if exactly one block in this extent is now free and this extent is not arena->bins[binind]->slabcur, skip to 3.
  2. Release this extent, that is, insert it into arena->extents_dirty.
  3. Switch arena->bins[binind]->slabcur to this extent, provided this extent is "older" (smaller serial number and lower address); the replaced extent is moved into arena->bins[binind]->slabs_nonfull.

 

Large memory release

Because large memory is not cached in tsd->tcache, releasing large memory only performs step 2 of the small memory release, that is, the extent is inserted into arena->extents_dirty.

 

 

Memory reallocation (realloc)

Small memory reallocation

  1. Try a no-move allocation: if the previously allocated size already satisfies the request, nothing needs to be done and the old pointer is returned directly. For example, the first request was for 12 bytes, but jemalloc actually allocated 16 bytes; if the second request grows 12 to 15 bytes or shrinks it to 9 bytes, the 16 bytes already satisfy the demand, so nothing is done. If the request cannot be satisfied, skip to 2.
  2. Reallocate: request memory of the new size (see Memory allocation), copy the contents of the old memory to the new address, release the old memory (see Memory release), and finally return the new memory.

 

Large memory reallocation

  1. Try a no-move allocation: if the two requests fall in the same size_class, nothing needs to be done and the old pointer is returned directly.
  2. Try a no-move resize allocation. If the size of the second request is greater than the first, check whether the extent at the address following the current extent can be allocated. For example, if the current extent's address is 0x1000 and its size is 0x1000, check whether an extent starting at 0x2000 exists (via the rtree) and whether it meets the requirement; if it does, the two extents can be merged into a new extent without reallocation. If the size of the second request is smaller than the first, try to split the current extent and release the unneeded second half, to reduce memory fragmentation. If no-move resize allocation fails, skip to 3.
  3. Reallocate: request memory of the new size (see Memory allocation), copy the contents of the old memory to the new address, release the old memory (see Memory release), and finally return the new memory.

 

 

Memory GC

There are 2 kinds: tcache GC and extent GC. Strictly speaking, "decay" would be more accurate, but for convenience we call it gc.

1. tcache GC

To prevent a thread from holding pre-allocated small_class memory without actually handing it to the user, the cached memory is periodically flushed back to its extents.

GC strategy

Each malloc or free operation on a tcache increments a counter; gc is triggered when it reaches 228 by default, and each gc processes one cache_bin.

How to GC

  1. cache_bin.low_water > 0: gc flushes 3/4 of low_water, and at the same time the maximum number of items the cache_bin can cache is halved
  2. cache_bin.low_water < 0: the maximum number the cache_bin can cache is doubled

In general, this guarantees that the more frequently a cache_bin is currently allocating, the more memory it caches; otherwise its cache shrinks.

 

2. extent GC

When free is called, the memory is not returned to the kernel immediately. jemalloc GCs the extents that have not been used for allocation gradually, from time to time. The process is the reverse of memory allocation: free -> extents_dirty -> extents_muzzy -> extents_retained -> kernel.

GC strategy

By default, 10s is one gc cycle for both extents_dirty and extents_muzzy. Each malloc or free operation on an arena increments a counter; when it reaches 1000, the arena checks whether the gc deadline has been reached, and if so, performs gc.

Note that gc is not done all at once every 10s. In fact, jemalloc divides the 10s into 200 parts, that is, gc may run every 0.05s. This way, if there are N pages needing gc at time t, jemalloc tries to ensure that by time t+10 those N pages have been gc'd.

For the N pages needing gc, it does not simply process N/200 pages every 0.05s; jemalloc introduces Smoothstep (mainly used in computer graphics) to obtain a smoother gc mechanism, which is a very interesting point of jemalloc.

jemalloc internally maintains an array of length 200 to compute how many pages should be gc'd at each point in the 10s cycle. This ensures that the pages produced between two gc cycles are reduced from N to 0 within a 10s period following a smooth decay curve (smootherstep is used by default).

How to GC

GC is performed on extents_dirty first, and then on extents_muzzy.

1. Move extents in extents_dirty into extents_muzzy:

  1. From the LRU linked list in extents_dirty, take the extent to be gc'd, try to merge it with its neighboring extents (provided the extents belong to the same arena and the same extents collection), obtain a new extent, and then remove it.
  2. Lazily purge the addresses managed by the current extent, that is, use madvise with the MADV_FREE parameter to tell the kernel that the memory managed by the current extent may no longer be accessed.
  3. In extents_muzzy, try to merge the current extent with its neighbors to obtain a new extent, and finally insert it into extents_muzzy.

2. Move extents in extents_muzzy into extents_retained:

  1. From the LRU linked list in extents_muzzy, take the extent to be gc'd, try to merge it with its neighbors, obtain a new extent, and then remove it.
  2. Decommit the addresses managed by the current extent, that is, call mmap with PROT_NONE to tell the kernel that the addresses managed by the current extent may no longer be accessed. If the decommit fails, a force purge is performed instead, that is, madvise with the MADV_DONTNEED parameter tells the kernel that the memory managed by the current extent may no longer be accessed.
  3. In extents_retained, try to merge the current extent with its neighbors to obtain a new extent, and finally insert it into extents_retained.

3. By default, jemalloc does not return memory to the kernel; only when the process ends is all memory munmap'ed back to the kernel. However, an arena can be destroyed manually, so that the memory in its extents_retained is munmap'ed.

 


Memory fragmentation

JeMalloc keeps internal fragmentation at around 20%. Most size_classes are organized into groups of 4; for example, 160, 192, 224, 256 belong to group 7. Within a group, the size of the i-th class, where i is the index in the group (1, 2, 3, 4), is:

size = 2^g + i * 2^(g - 2)

For group 7 this gives 128 + i * 32, that is, 160, 192, 224, 256.

Considering two adjacent groups, the worst case is a request one byte larger than the last size of one group, which must be served by the first size of the next group; for example, a 129-byte request is served with 160 bytes, wasting 31/160 ≈ 19%, so internal fragmentation is about 20%.

 

Advantages and disadvantages of JeMalloc implementation

Advantages

  1. Multiple arenas are used to avoid thread synchronization
  2. Fine-grained locks; for example, each bin and each extents collection has its own lock
  3. Use of Memory Order; for example, rtree read and write accesses use different atomic semantics (relaxed, acquire, release)
  4. Structure and memory allocations are kept aligned for better cache locality
  5. When cache_bin allocates memory, stack variables are used to determine success or failure, avoiding cache misses
  6. Dirty extents are coalesced lazily for better cache locality; extents are purged lazily for a smoother gc mechanism
  7. Compact structure memory layout to reduce footprint, such as extent.e_bits
  8. The two-level rtree_ctx cache introduced for the rtree speeds up extent information lookup while reducing cache misses
  9. tcache dynamically adjusts its cache capacity during gc

Disadvantages

  1. Memory between arenas is not shared
  • A thread may use a large amount of memory in its arena, and if that arena is then left unused by other threads, the memory cannot be gc'd and occupies too much space
  • Threads bound to two different arenas may make frequent memory requests, causing heavy interleaving between the two arenas' memory; contiguous memory then cannot be merged because it belongs to different arenas

2. For now, that is the only one I have thought of

 

Summary

As stated at the beginning of the article, JeMalloc's advantages lie in its multi-threaded performance and its reduction of memory fragmentation. For multi-threaded performance there are designs such as per-thread arenas, reduced lock granularity, and the use of atomic semantics; for reducing memory fragmentation there are designs such as size_class, the buddy algorithm, and gc.

The point of reading the JeMalloc source is not only to be able to describe exactly what happens on every malloc and free, but also to learn how a memory allocator manages memory. malloc and free are the static allocation and release, while the tcache and extent gc are the dynamic management, which is just as important to understand.

In addition, it can help you choose an appropriate memory allocation approach based on the memory usage characteristics of your program, or even build your own special-purpose memory allocator.

Finally, I think the most interesting part of jemalloc is the extent gc decay curve.

 

 

Origin blog.csdn.net/whatday/article/details/115202462