TCMalloc: Thread-Caching Malloc



This article is a translation of the most important parts of TCMalloc: Thread-Caching Malloc. TCMalloc is the cornerstone of Go's memory allocation; Go's allocator is derived from TCMalloc. The remaining content can be found in the original text.

Motivation

Memory allocation is fast. TCMalloc is faster than the glibc 2.3 malloc (a standalone library known as ptmalloc2) and the other mallocs I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because, if malloc is not fast enough, application writers tend to write their own custom free lists on top of malloc. Unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of them, this can lead to extra complexity and more memory usage.

TCMalloc also reduces lock contention for multi-threaded programs. For small objects, there is virtually zero contention. For large objects, TCMalloc tries to use fine-grained and efficient spinlocks. ptmalloc2 also reduces lock contention by using per-thread arenas, but there is a big problem with ptmalloc2's use of per-thread arenas: in ptmalloc2, memory can never move from one arena to another. This can lead to huge amounts of wasted space. For example, in one Google application, the first phase allocates approximately 300MB of memory for its data structures. When the first phase finishes, a second phase starts in the same address space. If the second phase is assigned a different arena than the one used by the first phase, this phase will not reuse any of the memory left over from the first phase and will add another 300MB to the address space. Similar memory-blowup problems have been observed in other applications.

Another benefit of TCMalloc is its space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space of roughly 8N * 1.01 bytes. That is, the space overhead is one percent. ptmalloc2 uses a four-byte header on each object and (I think) rounds the size up to a multiple of 8 bytes, ending up with 16N bytes.

Overview

TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.

TCMalloc treats objects with size <= 32KB ("small" objects) differently from large objects. Large objects are allocated directly from the central heap using a page-level allocator (a page is a 4KB-aligned region of memory). That is, a large object is always page-aligned and occupies an integral number of pages.

A run of pages can be carved into a sequence of small objects, each of equal size. For example, one page (4KB) can be divided into 32 objects of 128 bytes each.

Small object allocation

Each small object size is mapped to one of approximately 170 allocatable size classes. For example, all allocations in the range 961 to 1024 bytes are rounded up to 1024. The size classes are spaced so that small sizes are separated by 8 bytes, larger sizes by 16 bytes, still larger sizes by 32 bytes, and so forth. The maximum spacing (for sizes >= ~2K) is 256 bytes.
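The rounding rule above can be sketched as follows. This is a simplified illustration of the spacing scheme, not TCMalloc's actual size-class table; the class boundaries chosen here are assumptions for demonstration only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical spacing table: the gap between adjacent size classes
// grows with the request size, capped at 256 bytes for large sizes.
size_t SpacingFor(size_t size) {
  if (size <= 128)  return 8;    // smallest sizes: classes 8 bytes apart
  if (size <= 256)  return 16;
  if (size <= 512)  return 32;
  if (size <= 1024) return 64;
  if (size <= 2048) return 128;
  return 256;                    // sizes >= ~2K: classes 256 bytes apart
}

// Round a request up to the next size-class boundary.
size_t RoundToClass(size_t size) {
  size_t spacing = SpacingFor(size);
  return (size + spacing - 1) / spacing * spacing;
}
```

With this sketch, a 961-byte request rounds up to 1024, matching the example in the text.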

The thread cache contains a singly linked list of free objects for each size class.

[Figure: per-size-class free lists in a thread cache]

When allocating a small object:

  • (1) We map its size to the corresponding size class.
  • (2) We look in the corresponding free list in the thread cache for the current thread.
  • (3) If the free list is not empty, we remove the first object from the list and return it. When following this fast path, TCMalloc acquires no locks at all. This helps speed up allocation significantly, because a lock/unlock pair takes approximately 100 nanoseconds on a 2.8 GHz Xeon.
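The fast path above relies on a cheap singly linked free list in which each free object's first word stores the pointer to the next free object, so no separate list nodes are needed. The following is a minimal sketch under that assumption; `FreeList`, `Push`, and `Pop` are illustrative names, not TCMalloc's real code.

```cpp
#include <cassert>
#include <cstddef>

// Per-thread free list for one size class. Free objects are linked
// through their own first word, so the list costs no extra memory.
struct FreeList {
  void* head = nullptr;
  size_t length = 0;

  // Deallocation fast path: link the object at the front of the list.
  void Push(void* obj) {
    *reinterpret_cast<void**>(obj) = head;
    head = obj;
    ++length;
  }

  // Allocation fast path: unlink and return the first object.
  // Returns nullptr when empty (the caller would then refill from
  // the central free list).
  void* Pop() {
    if (head == nullptr) return nullptr;
    void* obj = head;
    head = *reinterpret_cast<void**>(obj);
    --length;
    return obj;
  }
};
```

Because the list belongs to one thread, `Push` and `Pop` need no locks at all, which is exactly what makes the fast path so cheap.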

If the free list is empty:

  • (1) We fetch a bunch of objects from the central free list for this size class (the central free list is shared by all threads).
  • (2) Put them in the thread local free list.
  • (3) Return one of the newly acquired objects to the application.

If the central free list is also empty:

  • (1) We allocate a run of pages from the central page allocator.
  • (2) We split the run into a set of objects of this size class.
  • (3) We place the new objects on the central free list.
  • (4) As before, we move some of these objects to the thread-local free list.

Large object allocation

A large object size (> 32K) is rounded up to a page size (a multiple of 4K) and handled by the central page heap. The central page heap is again an array of free lists. For k < 256, the kth entry is a free list of runs that consist of k pages. The 256th entry is a free list of runs of length >= 256 pages:

[Figure: page heap free lists, indexed by run length in pages]

An allocation for k pages is satisfied by looking in the kth free list. If that free list is empty, we look in the next free list, and so forth. Eventually, we look in the last free list if necessary. If that fails, we fetch memory from the system (using sbrk, mmap, or by mapping in portions of /dev/mem).

If an allocation of k pages is satisfied by a run of pages with length greater than k, the remainder of the run is re-inserted into the appropriate free list in the page heap. (For example, if one 4KB page is needed but a two-page 8KB run is used, the leftover 4KB page is put back on the one-page free list.)
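The search-and-split procedure above can be sketched as follows. This is a simplified model under stated assumptions: `Run`, `AllocateRun`, and the `std::vector`-backed lists are illustrative, not the page heap's real data structures.

```cpp
#include <cassert>
#include <vector>

constexpr int kMaxPages = 256;

// A run of contiguous pages, identified by its first page and length.
struct Run { int start_page; int num_pages; };

// free_lists[k] holds free runs of exactly k pages (for k < kMaxPages);
// free_lists[kMaxPages] holds all runs of >= kMaxPages pages.
std::vector<Run> free_lists[kMaxPages + 1];

// Satisfy a request for k pages: scan free lists k, k+1, ... for a run,
// split off what we need, and return the remainder to the page heap.
bool AllocateRun(int k, Run* out) {
  for (int i = k; i <= kMaxPages; ++i) {
    if (free_lists[i].empty()) continue;
    Run r = free_lists[i].back();
    free_lists[i].pop_back();
    if (r.num_pages > k) {  // re-insert the unused tail of the run
      Run rest{r.start_page + k, r.num_pages - k};
      int idx = rest.num_pages < kMaxPages ? rest.num_pages : kMaxPages;
      free_lists[idx].push_back(rest);
    }
    *out = {r.start_page, k};
    return true;
  }
  return false;  // the real allocator would now fall back to sbrk/mmap
}
```

Note how a one-page request served from a two-page run leaves a one-page remainder on the one-page free list, matching the example in the text.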

Span

The heap managed by TCMalloc consists of a set of pages. A run of contiguous pages is represented by a Span object. A span is either allocated or free. If free, the span is one of the entries in a page heap linked list. If allocated, it is either a large object that has been handed off to the application, or a run of pages that has been split up into a sequence of small objects. If split into small objects, the size class of the objects is recorded in the span.

A central array indexed by page number can be used to find the span to which a page belongs. For example, span a below occupies 2 pages, span b occupies 1 page, span c occupies 5 pages, and span d occupies 3 pages.

[Figure: central array mapping page numbers to spans a, b, c, d]

A 32-bit address space can fit 2^20 4K pages, so this central array takes 4MB of space, which seems acceptable. On 64-bit machines, we use a 3-level radix tree instead of an array to map from a page number to the corresponding span pointer.
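A minimal sketch of such a 3-level radix tree is shown below. The 12-bit fanout per level (covering 36 bits of page number) is an assumption chosen for illustration, and the interior nodes are leaked for brevity; it is not TCMalloc's actual pagemap implementation.

```cpp
#include <cassert>
#include <cstdint>

struct Span {};  // placeholder for the span metadata described above

constexpr int kBits = 12;           // bits consumed per level (illustrative)
constexpr int kFanout = 1 << kBits; // 4096 children per node

struct Leaf   { Span* spans[kFanout] = {}; };
struct Middle { Leaf* leaves[kFanout] = {}; };

// 3-level radix tree: page number -> span pointer. Interior nodes are
// allocated lazily, so sparse address spaces stay cheap.
struct PageMap {
  Middle* roots[kFanout] = {};

  void Set(uint64_t page, Span* s) {
    uint64_t i1 = page >> (2 * kBits);
    uint64_t i2 = (page >> kBits) & (kFanout - 1);
    uint64_t i3 = page & (kFanout - 1);
    if (!roots[i1]) roots[i1] = new Middle();
    if (!roots[i1]->leaves[i2]) roots[i1]->leaves[i2] = new Leaf();
    roots[i1]->leaves[i2]->spans[i3] = s;
  }

  Span* Get(uint64_t page) const {
    uint64_t i1 = page >> (2 * kBits);
    uint64_t i2 = (page >> kBits) & (kFanout - 1);
    uint64_t i3 = page & (kFanout - 1);
    if (!roots[i1] || !roots[i1]->leaves[i2]) return nullptr;
    return roots[i1]->leaves[i2]->spans[i3];
  }
};
```

Unlike the flat 4MB array, the tree only materializes the interior nodes for page ranges that are actually in use.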

Deallocation

When an object is deallocated, we compute its page number and look it up in the central array to find the corresponding span object. The span tells us whether or not the object is small, and its size class if it is small. If the object is small, we insert it into the appropriate free list in the current thread's thread cache. If the thread cache now exceeds a predetermined size (2MB by default), we run a garbage collector that moves unused objects from the thread cache into central free lists.

If the object is large, the span tells us the range of pages covered by the object. Suppose this range is [p, q]. We also look up the spans for pages p-1 and q+1. If either of these neighboring spans is free, we coalesce it with the [p, q] span. The resulting span is inserted into the appropriate free list in the page heap.

Small object central free list

As mentioned earlier, we maintain a central free list for each size class. Each central free list is organized as a two-level data structure: a set of spans, and a linked list of free objects per span.

An object is allocated from a central free list by removing the first entry from the linked list of some span. (If all spans have empty linked lists, a suitably sized span is first allocated from the central page heap.)

An object is returned to a central free list by adding it to the linked list of its containing span. If the linked list length now equals the total number of small objects in the span, this span is now completely free and is returned to the page heap.
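The "span is completely free" check above amounts to comparing the free-object count against the total number of objects the span was carved into. A minimal sketch, with illustrative field names that are assumptions rather than TCMalloc's real span layout:

```cpp
#include <cassert>

// Per-span bookkeeping for a span that was split into small objects.
struct SpanInfo {
  int objects_in_span;  // total small objects the span was divided into
  int free_count;       // current length of the span's free-object list

  // Record one object being returned to this span's linked list.
  // Returns true when the span becomes completely free, i.e. when the
  // list length equals the total object count (time to return the span
  // to the page heap).
  bool ReleaseObject() {
    ++free_count;
    return free_count == objects_in_span;
  }
};
```

For example, a 4KB span split into 32 objects of 128 bytes goes back to the page heap only when all 32 objects are on its list.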

Thread cache garbage collection

A thread cache is garbage collected when the combined size of all objects in the cache exceeds 2MB. The garbage collection threshold is automatically decreased as the number of threads increases, so that we do not waste an inordinate amount of memory in a program with lots of threads.

We traverse all free lists in the cache, and then move some objects from the free list to the corresponding central list.

The number of objects to be moved from a free list is determined using a per-list low-water mark L. L records the minimum length of the list since the last garbage collection. Note that we could have shortened the list by L objects at the last garbage collection without requiring any extra accesses to the central list. We use this past history as a predictor of future accesses and move L/2 objects from the thread cache free list to the corresponding central free list. This algorithm has the nice property that if a thread stops using a particular size, all objects of that size will quickly move from the thread cache to the central free list, where they can be used by other threads.
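The heuristic above reduces to a tiny computation over the list lengths observed since the last collection. A sketch, with an illustrative function name:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Low-water-mark heuristic: L is the minimum list length observed since
// the last garbage collection; L/2 objects are released to the central
// free list, using past history as a predictor of future accesses.
int ObjectsToRelease(const std::vector<int>& lengths_since_last_gc) {
  if (lengths_since_last_gc.empty()) return 0;
  int low_water = *std::min_element(lengths_since_last_gc.begin(),
                                    lengths_since_last_gc.end());
  return low_water / 2;
}
```

If a thread stops using a size class entirely, the list never shrinks, so L equals the full list length and repeated collections drain it geometrically.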


Origin blog.csdn.net/DERRANTCM/article/details/105342996