Deadly GC: An In-Depth Comparison of Java GC and Go GC to Master Both in One Read

A few words up front

Getting an offer is very hard right now; some people cannot even get a call for an interview.

In Nien's technical community (50+ groups), many members have landed offers with the combined skill set of "cloud native in one hand, big data in the other", and very high-quality offers at that: reportedly with year-end bonuses worth 18 months of salary.

One recent case: a buddy with 2 years of experience wanted to raise his salary to 18K. Nien helped write a Go-language project architecture into his resume, which made it shine; he could then pursue 30K offers from ByteDance ("Toutiao"), Tencent, and others, raising his annual salary by 20W (200,000 CNY).

A second case: a member with 6 years of experience relies on a Java+Go dual-language cloud-native architecture skill set, with an annual salary of 60W (600,000 CNY).

Looking at high-paying senior Java positions, cloud native, K8S, and Go are becoming more and more important for senior engineers and architects. That is why Nien wrote the PDF "Go Study Bible: Technical Freedom Edition", to help you quickly transform from a Java developer into a Java+Go "amphibious" master.

When learning Go, GC is an absolute core focus and a real difficulty. During interviews you will also run into plenty of GC-related questions, such as:

  • Talk about: common garbage collection algorithms
  • Talk about: tri-color marking
  • Talk about: Go's STW (Stop The World)
  • Talk about: how to observe Go's GC
  • Talk about: with GC in place, why do memory leaks still occur?

This article starts from first principles, introduces the garbage collection algorithms of Java and Go, and compares the two at the level of principles.

With these answers you can fully show off your "technical muscles" and impress the interviewer. The questions and reference answers are also included in V2 of our "Go Study Bible: Technical Freedom Edition" for later readers, to improve everyone's "3-high" (high concurrency, high performance, high availability) architecture, design, and development skills.

Note: this article is continuously updated in PDF form. For the PDFs of "Nien Architecture Notes", "Nien High Concurrency Trilogy", and "Nien Java Interview Collection", see the official account [Technical Freedom Circle] linked at the end of the article.

On the road to becoming a Java+Go "amphibious" master, we start with Java's GC.


Java Garbage Collection

Modern high-level programming languages manage memory in one of two ways: manually or automatically. Typical representatives of manual memory management are C and C++, where you must explicitly request and release memory as you write code. Languages such as PHP, Java, and Go use automatic memory management: a memory allocator and a garbage collector allocate and reclaim memory for you, and the garbage collector is what we usually call the GC.

1. Java garbage collection areas and their division

Before introducing Java garbage collection, we need to understand where Java's garbage mainly exists.

The JVM memory runtime area division is shown in the following figure:

JVM memory runtime area division

Program counter : It is a small memory space. It can be regarded as the line number indicator of the bytecode executed by the current thread. The counters of each thread do not affect each other and are stored independently.

Virtual machine stack : It describes the memory model of Java method execution: as each method executes, a stack frame (Stack Frame, the basic runtime data structure of a method) is created to hold the local variable table, operand stack, dynamic linking, method return address, and other information. Each method's life from invocation to completion corresponds to one stack frame being pushed onto and later popped off the virtual machine stack.

Native method stack : Its role is very similar to that of the virtual machine stack. The difference is that the virtual machine stack serves the execution of Java methods (i.e., bytecode), while the native method stack serves the Native methods used by the virtual machine.

Java heap : It is the largest piece of memory managed by the Java virtual machine, shared by all threads and created when the virtual machine starts. The sole purpose of this area is to hold object instances; almost all object instances are allocated here.

Method area : Like the Java heap, it is a memory area shared by all threads. It stores data such as class information loaded by the virtual machine, constants, static variables, and code compiled by the just-in-time (JIT) compiler.

Of these runtime areas, the program counter, virtual machine stack, and native method stack are created and destroyed together with their thread; stack frames are pushed and popped as methods are entered and exited, and the memory allocated in each frame is essentially known once the class structure is determined. The Java heap and the method area are different: the several implementation classes of one interface may need different amounts of memory, and the different branches of one method may too, so we only find out which objects exist while the program is running; allocation and reclamation of this memory is dynamic. (In Java 8, the method area is implemented by the Metaspace, which is carved out of native memory rather than the heap.) The Java heap and the method area are therefore the main areas managed by the garbage collector.

From the perspective of garbage collection , since the JVM garbage collector basically adopts the theory of generational garbage collection, the Java heap can be subdivided into the following areas (taking the default situation of the HotSpot virtual machine as an example):

Among them, the Eden area, the From Survivor ("From", S0) area, and the To Survivor ("To", S1) area all belong to the young generation, while the Old Memory area belongs to the old generation.

In most cases, an object is allocated in the Eden area first. After one young-generation garbage collection, if the object is still alive, it moves into the To area and its age increases by 1 (Eden -> Survivor sets the age to 1). When its age grows past a threshold, it is promoted to the old generation; the threshold is the smaller of MaxTenuringThreshold and the dynamically computed age at which objects of that age or younger occupy more than half of a Survivor space. After this GC, Eden and the From area have been emptied; From and To then swap roles, which guarantees that the Survivor space named To is always empty. Minor GC keeps repeating this process. It is also possible that after a Minor GC the Survivor space is too small to hold all survivors that do not yet qualify for promotion; the overflow is then moved into the old generation ahead of time.
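The aging and promotion rules above can be sketched as a toy model. Everything here (class name, the simplified threshold handling) is illustrative only, not HotSpot internals:

```java
// Toy model of the young-generation promotion rules described above.
public class PromotionModel {
    static final int MAX_TENURING_THRESHOLD = 15; // HotSpot default ceiling

    // After surviving one Minor GC, an object's age increases by 1.
    static int ageAfterMinorGc(int age) {
        return age + 1;
    }

    // An object is promoted when its age reaches the effective threshold:
    // the smaller of MaxTenuringThreshold and the dynamically computed
    // value (here simply passed in as `dynamicThreshold`).
    static boolean promotedToOld(int age, int dynamicThreshold) {
        int effective = Math.min(MAX_TENURING_THRESHOLD, dynamicThreshold);
        return age >= effective;
    }

    public static void main(String[] args) {
        int age = 0;
        int gcCount = 0;
        while (!promotedToOld(age, 15)) {
            age = ageAfterMinorGc(age);
            gcCount++;
        }
        System.out.println("promoted after " + gcCount + " minor GCs, age=" + age);
    }
}
```

With the default threshold of 15, an object that keeps surviving is promoted after its 15th Minor GC; a smaller dynamic threshold promotes it sooner.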

For the implementation of HotSpot VM, there are actually only two types of GC in it:

1. Partial collection (Partial GC) :

  • Young GC (Minor GC/Young GC): only collect garbage in the new generation;
  • Old generation collection (Major GC/Old GC): Only garbage collection is performed on the old generation. It should be noted that Major GC is also used to refer to the whole heap collection in some contexts;
  • Mixed collection (Mixed GC): Garbage collection is performed on the entire new generation and part of the old generation.

2. Full heap collection (Full GC): Collect the entire Java heap and method area .

Common allocation strategies for Java heap memory

1. Objects are allocated in the Eden area first; most objects there live fast and die young.

2. Large objects enter the old generation directly. A large object is one that requires a large amount of contiguous memory (for example, long strings or arrays). Even when plenty of total memory is still free, a large object can trigger garbage collection early just to obtain a large-enough contiguous region to hold it. To avoid the copying cost the allocation-guarantee mechanism would incur when allocating such objects, large objects are placed directly into the old generation, which has more space.

3. Long-lived objects enter the old generation, with dynamic object-age determination: after one young-generation garbage collection, a surviving object moves into S0 or S1 and its age increases by 1 (Eden -> Survivor sets the age to 1). When the age passes a threshold (the smaller of MaxTenuringThreshold and the dynamically determined age at which objects occupy more than half of a Survivor space), the object is promoted to the old generation. The promotion age threshold can be set with the -XX:MaxTenuringThreshold parameter.

4. Space allocation guarantee. Before a Minor GC occurs, the virtual machine first checks whether the largest contiguous free space in the old generation is greater than the total size of all objects in the young generation. If so, the Minor GC is guaranteed to be safe. If not, the virtual machine checks whether the HandlePromotionFailure setting allows a guarantee failure:

  • If allowed, it then checks whether the largest contiguous free space in the old generation is greater than the average size of objects promoted to the old generation across previous collections.
  • If it is larger, a Minor GC is attempted, even though this Minor GC carries some risk.
  • If it is smaller, or HandlePromotionFailure does not allow the risk, a Full GC is performed instead.
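The three-way check above can be sketched as a small decision function. The names and the long-typed sizes are invented for illustration; HotSpot's real logic lives inside the collector, not in user code:

```java
// Sketch of the space-allocation-guarantee check described above.
public class PromotionGuarantee {

    // Returns "MINOR" if a Minor GC may proceed, "FULL" if a Full GC is needed.
    static String decide(long oldGenFree,
                         long youngGenUsed,
                         long avgPromotedSize,
                         boolean handlePromotionFailure) {
        if (oldGenFree >= youngGenUsed) {
            return "MINOR";                    // guaranteed safe
        }
        if (handlePromotionFailure && oldGenFree >= avgPromotedSize) {
            return "MINOR";                    // risky, but attempted
        }
        return "FULL";                         // guarantee fails: Full GC instead
    }

    public static void main(String[] args) {
        System.out.println(decide(100, 80, 30, true));   // MINOR (safe)
        System.out.println(decide(50, 80, 30, true));    // MINOR (risky)
        System.out.println(decide(20, 80, 30, true));    // FULL
        System.out.println(decide(50, 80, 30, false));   // FULL
    }
}
```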

2. Determining whether an object is dead

Almost all object instances live in the heap. The first step of garbage collection on the heap is to determine which objects are already dead, i.e., which can no longer be used through any path.

There are two algorithms for judging whether an object is alive: reference counting and reachability analysis. Both algorithms have their own advantages and disadvantages.

Both Java and Go use reachability analysis, while some dynamic scripting languages (such as ActionScript) generally use reference counting.

(1) Reference counting method

The reference counting method adds a reference counter to each object's header; whenever the object is referenced somewhere, the counter increases by 1.

When a reference expires, the counter decreases by 1; an object whose counter reaches 0 can no longer be used.

This method is simple to implement and efficient, but mainstream Java virtual machines do not use it to manage memory, mainly because it struggles with circular references between objects.

That is, as shown in the code below: apart from referencing each other, objects objA and objB hold no other references and are referenced by nothing else.

But because they reference each other, their reference counters never reach 0, so a reference counting collector could never reclaim them.

```java
public class ReferenceCountingGc {

    // The only field: a reference to another object.
    Object instance = null;

    public static void main(String[] args) {
        ReferenceCountingGc objA = new ReferenceCountingGc();
        ReferenceCountingGc objB = new ReferenceCountingGc();

        // objA and objB now refer to each other.
        objA.instance = objB;
        objB.instance = objA;

        // Drop the external references; only the cycle remains.
        objA = null;
        objB = null;

        // Under reference counting, neither counter ever drops to 0, so the
        // two objects could never be reclaimed. (HotSpot uses reachability
        // analysis, so in practice they are collectible here.)
    }
}
```

(2) Reachability analysis algorithm

The basic idea of this algorithm is to take a set of objects called "GC Roots" as starting points and search downward from them; the path traversed is called a reference chain. When an object is not connected to any GC Root by any reference chain, the object is proven unusable.

The advantage of the algorithm is that it can accurately identify all useless objects, including objects that refer to each other circularly;

The disadvantage is that the implementation of the algorithm is more complicated than the reference counting method. For example, as shown in the figure below, both Root1 and Root2 are "GC Roots", and the white nodes should be garbage collected.
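The idea can be sketched as a graph search. This is a toy model (string names standing in for objects, a map standing in for the heap), not the JVM's actual implementation; note how a cycle with no path from any root is classified as garbage, which reference counting could not do:

```java
import java.util.*;

// Minimal reachability analysis over a toy object graph: nodes reachable
// from the GC roots are live; everything else (including cycles) is garbage.
public class Reachability {

    static Set<String> live(Map<String, List<String>> refs, List<String> roots) {
        Set<String> marked = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>(roots);
        while (!stack.isEmpty()) {
            String obj = stack.pop();
            if (marked.add(obj)) {                         // not seen before
                stack.addAll(refs.getOrDefault(obj, List.of()));
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        Map<String, List<String>> refs = new HashMap<>();
        refs.put("root1", List.of("a"));
        refs.put("a", List.of("b"));
        refs.put("objA", List.of("objB"));   // a cycle with no path from any root
        refs.put("objB", List.of("objA"));
        Set<String> alive = live(refs, List.of("root1"));
        System.out.println(alive);           // objA/objB are garbage despite the cycle
    }
}
```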

For inspecting reachability and memory leaks in Java, the Eclipse Memory Analyzer Tool (MAT) is strongly recommended; it can show memory distribution, inter-object dependencies, and object states.

In Java, there are many objects that can be used as " GC Roots ", such as:

  • Objects referenced in the virtual machine stack (the local variable tables of stack frames), such as the parameters, local variables, and temporaries used in the methods currently executing on each thread.
  • Objects referenced by class static fields in the method area, such as the reference-type static variables of Java classes.
  • Objects referenced by constants in the method area, such as references in the string constant pool.
  • Objects referenced by JNI (native methods) in the native method stack.
  • References internal to the Java virtual machine, such as the Class objects corresponding to the basic data types, some resident exception objects (such as NullPointerException), and the system class loader.
  • All objects held by synchronized locks.
  • JMXBeans reflecting the JVM's internal state, callbacks registered through JVMTI, the local code cache, and so on.

Unreachable objects are not necessarily "dead"

Even objects found unreachable by reachability analysis are not "doomed to die"; at that point they are merely on "probation". To truly declare an object dead, at least two marking passes are required.

When reachability analysis first finds an object unreachable, it is marked once and screened; the screening condition is whether the object needs its finalize() method executed.

If the object does not override finalize(), or its finalize() has already been invoked by the virtual machine, both cases count as "no need to execute".

Objects judged as needing finalize() are placed on a queue for a second marking. Unless the object re-links itself to the reference chain during finalize() (for example, by assigning itself to some reachable field), it will actually be reclaimed.

Determining that a constant in the runtime constant pool is obsolete

1. Before JDK 1.7, the runtime constant pool, which logically includes the string constant pool, was stored in the method area, and the HotSpot virtual machine implemented the method area as the permanent generation.

2. In JDK 1.7, the string constant pool was moved out of the method area into the heap; the rest of the runtime constant pool remained in the method area, i.e., in HotSpot's permanent generation.

3. In JDK 1.8, HotSpot removed the permanent generation and replaced it with the Metaspace. The string constant pool is still in the heap, and the runtime constant pool is still in the method area, but the method area is now implemented by the Metaspace.

Take the string "abc" in the string constant pool: if no String object currently references it, the constant "abc" is obsolete. If memory reclamation occurs at this time and it is necessary, "abc" will be cleaned out of the constant pool by the system.

How to determine that a class in the method area is useless

A class must meet all three of the following conditions at the same time to count as a "useless class"; only then may the virtual machine reclaim it.

1. All instances of this class have been recycled, that is, there is no instance of this class in the Java heap.

2. The ClassLoader that loaded this class has been recycled.

3. The java.lang.Class object corresponding to this class is not referenced anywhere, and the method of this class cannot be accessed through reflection anywhere.

3. Garbage collection algorithm

After determining which objects can be recycled, it is necessary to consider how to recycle these objects. Currently, there are mainly the following types of garbage collection algorithms.

  • Mark-sweep algorithm : mark the live objects, then sweep away everything unmarked. Disadvantages: inefficient, and it leaves memory fragmentation behind.
  • Mark-copy algorithm : divide memory into two equal halves and use only one at a time; when it fills up, copy the live objects into the other half, then wipe the used half in one go. Disadvantage: memory utilization is low, only half of the total.
  • Mark-compact algorithm : mark the live objects, move all of them toward one end of memory, then directly reclaim the memory beyond the boundary.
  • Generational algorithm : divide memory into several regions by object lifetime, usually a young generation and an old generation; the young generation basically uses the copy algorithm, and the old generation uses the mark-compact algorithm.

(1) Mark-sweep algorithm

The algorithm is divided into a "mark" phase and a "sweep" phase: first mark all live objects (those that must not be reclaimed); once marking completes, reclaim all unmarked objects uniformly. Suitable scenario: when many objects survive each collection, i.e., the old generation.

Mark-Sweep algorithm (Mark-Sweep) is a common basic garbage collection algorithm, which divides garbage collection into two stages:

  • Mark phase : mark all live (reachable) objects.
  • Sweep phase : reclaim the space occupied by the unmarked objects.

The execution process of the mark-clear algorithm is shown in the figure below

Advantages : simple implementation, no need for objects to move.

Disadvantages :

1. Space problems : it is prone to memory fragmentation. A later allocation of a large object may trigger another garbage collection early even though plenty of total memory is free (for example, when the object is larger than every single block on the free list but smaller than the sum of several of them).

2. Efficiency problems : the whole space is scanned twice (first pass: mark live objects; second pass: sweep the unmarked ones). Marking and sweeping are both slow, and the large number of discontinuous memory fragments they leave behind increases the frequency of garbage collection.

In short:

The mark-sweep algorithm is called foundational because the garbage collection algorithms discussed later are all improvements built on top of it.
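The sweep phase and the fragmentation it leaves behind can be sketched on a made-up heap of equal-size cells (a toy, not a real allocator):

```java
import java.util.*;

// Toy mark-sweep over a fixed heap of equal-size cells. `true` cells are
// marked live; sweep collects the dead cells into a free list of runs,
// which shows how non-contiguous fragments appear.
public class MarkSweep {

    // Returns the list of free runs as {start, length} pairs after sweeping.
    static List<int[]> sweep(boolean[] marked) {
        List<int[]> freeRuns = new ArrayList<>();
        int i = 0;
        while (i < marked.length) {
            if (!marked[i]) {
                int start = i;
                while (i < marked.length && !marked[i]) i++;
                freeRuns.add(new int[]{start, i - start});
            } else {
                i++;
            }
        }
        return freeRuns;
    }

    public static void main(String[] args) {
        // cells: live, dead, live, dead, dead, live
        boolean[] marked = {true, false, true, false, false, true};
        for (int[] run : sweep(marked)) {
            System.out.println("free run at " + run[0] + ", length " + run[1]);
        }
        // Two separate free runs: neither can hold an object of size 3,
        // even though 3 cells are free in total -- that is fragmentation.
    }
}
```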

(2) Mark-copy algorithm

In order to solve the problem of low efficiency of the mark-sweep algorithm, a mark-copy algorithm was produced.

It divides the memory space into two equal areas, and only uses one of them at a time.

During garbage collection, the currently used half is traversed, surviving objects are copied into the other half, and then the whole used half is reclaimed in one go.

Advantages : memory is allocated sequentially (bump-pointer), the implementation is simple, it runs efficiently, and there is no memory fragmentation to worry about.

Disadvantages : The available memory size is reduced to half of the original, and the object will be copied frequently when the survival rate of the object is high.

The execution process of the replication algorithm is shown in the figure below
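A minimal sketch of the semispace idea, with strings standing in for objects (illustrative only): copying the survivors into the other half compacts them automatically.

```java
import java.util.*;

// Sketch of a semispace (mark-copy) collector: live objects are copied
// from from-space to to-space, which compacts them as a side effect.
public class SemiSpace {

    // from-space cells: null = free slot, non-null = an object; `live` says which survive.
    static List<String> collect(List<String> fromSpace, Set<String> live) {
        List<String> toSpace = new ArrayList<>();
        for (String obj : fromSpace) {
            if (obj != null && live.contains(obj)) {
                toSpace.add(obj);            // copy survivor; allocation stays bump-pointer
            }
        }
        return toSpace;                      // from-space can now be reclaimed wholesale
    }

    public static void main(String[] args) {
        List<String> from = Arrays.asList("a", "b", null, "c", "d");
        List<String> to = collect(from, Set.of("a", "c"));
        System.out.println(to);              // survivors sit contiguously: [a, c]
    }
}
```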

(3) Mark-compact algorithm

The copy algorithm works in the young generation, but it is a poor choice for the old generation: old-generation objects have a higher survival rate, which would mean many copy operations and lower efficiency.

The mark-sweep algorithm can be applied in the old generation, but it is not efficient and easily leaves lots of memory fragments behind.

Hence the mark-compact (Mark-Compact) algorithm. It differs from mark-sweep in that, after marking, all surviving objects are compacted toward one end of memory so that they sit tightly together, and then the memory beyond the end boundary is reclaimed directly.

After recycling, used and unused memory are separated.

Advantages : Solve the problem of memory fragmentation in the mark-sweep algorithm.

Disadvantages : live objects still need to be moved, which reduces efficiency to a certain extent.

The execution process of the marking-sorting algorithm is shown in the figure below

Suitable scenario for the mark-compact algorithm : the old generation, where objects have a high survival rate and compaction keeps the free memory contiguous.
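A minimal sketch of in-place compaction (illustrative only): survivors slide toward the low end of the same space, leaving one contiguous free region.

```java
import java.util.*;

// Sketch of mark-compact: surviving objects slide toward the low end of the
// same space; the rest of the space becomes one contiguous free region.
public class MarkCompact {

    // heap[i] == null means a dead/free cell. Returns the index of the first
    // free cell after compaction (everything below it is live and contiguous).
    static int compact(String[] heap) {
        int dst = 0;
        for (int src = 0; src < heap.length; src++) {
            if (heap[src] != null) {
                heap[dst++] = heap[src];     // move survivor toward one end
            }
        }
        for (int i = dst; i < heap.length; i++) heap[i] = null; // clear the tail
        return dst;
    }

    public static void main(String[] args) {
        String[] heap = {"a", null, "b", null, null, "c"};
        int boundary = compact(heap);
        System.out.println(Arrays.toString(heap) + " free from index " + boundary);
    }
}
```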

(4) Generational collection algorithm

Generational collection is what most JVMs adopt today. Its core idea is to divide memory into regions according to the different lifetimes of objects, generally an old generation (Tenured/Old Generation) and a young generation (Young Generation).

The old generation is characterized by only a small number of objects needing reclamation in each garbage collection; the young generation is characterized by large amounts of garbage per collection. A different algorithm can therefore be chosen for each region.

Current commercial virtual machines all adopt the "Generational Collection" algorithm. As the name implies, it divides memory into several blocks according to object lifetimes, in the JVM typically the young generation , the old generation , and (historically) the permanent generation , as shown in the figure. The most appropriate GC algorithm can then be used for the characteristics of each generation.

Young Generation and Mark-Copy Algorithm

Every young-generation collection finds that a large number of objects are dead and only a few survive. The copy algorithm is therefore chosen: a collection costs only the copying of the few live objects.

Most JVMs today use the Copying algorithm for the young generation, because each young-generation collection reclaims most objects, meaning few survivors need copying; however, the young generation is usually not split 1:1.

Generally the young generation is divided into one larger Eden space and two smaller Survivor spaces (From Space and To Space). Each round uses Eden plus one of the Survivor spaces; at collection time, the objects still alive in those two spaces are copied into the other Survivor space.

Old Generation and the Mark-Compact Algorithm

Because objects in the old generation have a high survival rate and there is no extra space to provide an allocation guarantee, the old generation must be collected in place using the "mark-sweep" or "mark-compact" algorithm, freeing memory directly instead of copying it. HotSpot's old-generation collectors generally adopt Mark-Compact (CMS, which uses mark-sweep, is the notable exception).

  1. The Permanent Generation of the method area mentioned by the Java virtual machine is used to store classes, constants, method descriptions, etc. Collection of the permanent generation mainly covers obsolete constants and useless classes.
  2. Object memory allocation mainly happens in the Eden Space of the young generation and the From Space of the Survivor area (whichever Survivor space currently holds objects); in a few cases, objects are allocated directly in the old generation.
  3. When Eden Space and From Space run out of room, a GC occurs. After the GC, surviving objects from Eden Space and From Space are moved into To Space, and then Eden Space and From Space are wiped clean.
  4. If To Space cannot hold some object, that object is moved into the old generation.
  5. After the GC, Eden Space and To Space are the spaces in use, and the cycle repeats.
  6. Each time an object survives a GC in the Survivor area, its age increases by 1. By default, objects whose age reaches 15 are moved into the old generation.
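Steps 3 through 6 above can be sketched as a toy Minor GC cycle. All names and the simplified rules are illustrative, not HotSpot internals:

```java
import java.util.*;

// Toy Minor GC cycle: survivors from Eden and From move to To, ages rise by 1,
// objects at the age threshold are promoted, then From and To swap roles.
public class MinorGcCycle {
    static final int THRESHOLD = 15;   // default promotion age

    // Each map is object name -> age; returns the names promoted to the old gen.
    static List<String> minorGc(Map<String, Integer> eden,
                                Map<String, Integer> from,
                                Map<String, Integer> to,
                                Set<String> live) {
        List<String> promoted = new ArrayList<>();
        for (Map<String, Integer> space : List.of(eden, from)) {
            for (Map.Entry<String, Integer> e : space.entrySet()) {
                if (!live.contains(e.getKey())) continue;   // dead: simply dropped
                int newAge = e.getValue() + 1;
                if (newAge >= THRESHOLD) promoted.add(e.getKey());
                else to.put(e.getKey(), newAge);
            }
        }
        eden.clear();
        from.clear();                  // the caller then swaps the From/To references
        return promoted;
    }

    public static void main(String[] args) {
        Map<String, Integer> eden = new HashMap<>(Map.of("x", 0, "y", 0));
        Map<String, Integer> from = new HashMap<>(Map.of("z", 14));
        Map<String, Integer> to = new HashMap<>();
        List<String> promoted = minorGc(eden, from, to, Set.of("x", "z"));
        System.out.println("to=" + to + " promoted=" + promoted);
        // y was dead and vanished; x aged to 1 in To; z hit age 15 and was promoted.
    }
}
```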

4. Garbage collectors in the JVM

Java heap memory is divided into the young generation and the old generation. The young generation mainly uses the mark-copy garbage collection algorithm, while the old generation mainly uses the mark-compact algorithm. Each generation offers a variety of garbage collectors. The garbage collectors of the Sun HotSpot virtual machine as of JDK 1.6 are as follows:

| Garbage collector | Features | Algorithm | Applicable scenario | Advantages | Disadvantages |
| --- | --- | --- | --- | --- | --- |
| Serial | The most basic and oldest collector; single-threaded | Mark-copy (young gen), mark-compact (old gen) | Virtual machines running in Client mode | Simple and efficient | All other worker threads must pause during collection |
| ParNew | Multi-threaded version of the Serial collector | Mark-copy (young gen), mark-compact (old gen) | Virtual machines running in Server mode | Parallel, efficient | |
| Parallel Scavenge | Multi-threaded mark-copy collector focused on throughput | Mark-copy (young gen), mark-compact (old gen) | Default collector in JDK 1.8; when throughput and CPU utilization matter | High throughput | |
| Serial Old | Old-generation version of the Serial collector | Mark-compact | Paired with Parallel Scavenge before JDK 1.5; fallback for the CMS collector | Simple and efficient | All other worker threads must pause during collection |
| Parallel Old | Old-generation version of the Parallel Scavenge collector | Mark-compact | Where throughput and CPU resources matter | High throughput | |
| CMS | Multi-threaded collector; user threads and GC threads can run concurrently | Mark-sweep | When the shortest pauses and service response time matter most | Concurrent collection, low pauses | Sensitive to CPU resources; cannot handle floating garbage; produces fragmentation |
| G1 | Server-oriented; parallel and concurrent, compacts space, predictable pauses | Region-based mark-copy | Server applications; multiprocessor machines with large memory | Controllable pause times, essentially no fragmentation | Possible space waste; higher runtime overhead |

While we compare collectors here, the point is not to crown a best one.

To this day there is no best garbage collector, let alone a universal one; all we can do is choose the collector that suits our specific application scenario.

If the content above feels too dense, refer to the companion videos of the "Go Study Bible: Technical Freedom Circle Edition".

Famous high-performance memory allocator

On the road to becoming a Java+Go "amphibious" master, we started with Java's GC; now we turn toward Go's GC.

To introduce Go's GC, we must first introduce high-performance memory allocators.

There are some well-known high-performance memory allocators in the industry, such as ptmalloc, tcmalloc, and jemalloc.

A simple comparison is as follows:

  • ptmalloc (per-thread malloc): the memory allocator implemented in glibc. As the standard implementation it has good compatibility; its drawbacks are that memory cannot be shared across threads and the memory overhead is high.
  • tcmalloc (thread-caching malloc): open-sourced by Google; its biggest feature is its thread cache , and it is used in Chrome, Safari, and other products. tcmalloc gives each thread a local cache from which small objects can be allocated; for large allocations it uses spin locks to reduce contention and improve efficiency.
  • jemalloc: borrows the best design ideas of tcmalloc, so the two share many architectural similarities, including thread caches . But jemalloc is more complex in design: it splits allocation granularity into Small, Large, and Huge classes and records much more metadata, so its metadata footprint is higher than tcmalloc's .

The core goals of a high-performance memory allocator come down to two points:

  • Efficient memory allocation and reclamation, improving performance in both single-threaded and multi-threaded scenarios.
  • Reduced fragmentation, both internal and external, improving the effective utilization of memory.

Internal Fragmentation and External Fragmentation

In the Linux world, physical memory is divided into 4KB memory pages (page), the minimum granularity of memory allocation; allocation and reclamation are both done in units of pages.

The fragments generated within the page are called internal fragments, and the fragments generated outside the page are called external fragments.

Causes of memory fragmentation:

1. Memory gets divided into small blocks which, although free and contiguous in address, are too small to be useful.

2. As allocations and releases accumulate, memory becomes more and more discontinuous.

3. Eventually only fragments remain: even when enough free page frames exist in total, a request for a large contiguous run of page frames cannot be satisfied.

Causes of external fragmentation:

1. External fragmentation refers to free memory that has not been allocated (belongs to no process) but is too small to be allocated.

2. External fragments are free memory blocks lying outside every allocated region or page .

3. The total size of these blocks may satisfy the current request, but because their addresses are not contiguous (or for other reasons), the system cannot use them to satisfy it.

Therefore, the core of reducing memory waste is to avoid memory fragmentation as much as possible.
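Internal fragmentation is easy to see in numbers: rounding a request up to a fixed size class wastes the difference inside the block. The size classes below are made up for illustration; real allocators such as tcmalloc use many more:

```java
// Toy size-class allocator math: rounding a request up to a fixed size class
// wastes the difference as internal fragmentation.
public class Fragmentation {
    static final int[] SIZE_CLASSES = {8, 16, 32, 64, 128};  // illustrative classes

    // Smallest size class that fits the request.
    static int sizeClassFor(int request) {
        for (int c : SIZE_CLASSES) {
            if (c >= request) return c;
        }
        throw new IllegalArgumentException("too large for these size classes");
    }

    // Bytes wasted inside the allocated block.
    static int internalWaste(int request) {
        return sizeClassFor(request) - request;
    }

    public static void main(String[] args) {
        System.out.println(sizeClassFor(5));    // 8: a 5-byte request gets an 8-byte block
        System.out.println(internalWaste(33));  // 31: 33 bytes rounds up to 64
    }
}
```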

buddy algorithm

There are two ways to avoid external fragmentation:

(1) use the paging unit to map a group of non-contiguous free page frames into a contiguous range of linear addresses;

(2) the buddy system: keep track of the existing free runs of contiguous page frames, so that a large contiguous free block is never split merely to satisfy a request for a small one.

Buddy memory allocation technology is a memory allocation algorithm that divides memory into partitions to satisfy memory requests with the most appropriate size.

Buddy memory allocation was invented by Harry Markowitz in 1963.

The principle of the buddy algorithm

The buddy algorithm allocates memory in blocks of different size classes. Every block can be split and merged, but not arbitrarily: each block has a unique "buddy" (or "partner") with which it can be split from and merged back together.

Only blocks in a buddy relationship may be split and merged.

Free memory in the system is always grouped pairwise by adjacency; the two blocks in each pair are called buddies. Buddies can be allocated independently of each other.

But if both buddies are free, the kernel merges them into a larger memory block that serves as one buddy of a block on the next level up.

Specifically look at the next example:

First, the buddy algorithm divides all free pages into 10 block groups.

The block size in each group is a power of two pages, e.g.:

  • blocks in group 0 are 1 page each,
  • blocks in group 1 are 2 pages each,
  • blocks in group 9 are 512 pages each.

That is, blocks within a group all have the same size, and blocks of the same size form a linked list (comparable to a bucket of keys with the same hash value in a HashMap).

Remaining unallocated memory is added to the last block group.

In the example figure, group 0 has 2 blocks and every other group holds a single block, but the last group may hold multiple blocks.

What is a buddy block, and how is it found?

A buddy pair is two adjacent blocks of the same size; merged blocks can themselves keep merging, iteratively, up to a maximal block of 1024 pages.

This may not be easy to understand, draw a picture:

Suppose we are looking for the buddy of block 1. If it were block 2, then after merging blocks 1 and 2, the result could no longer merge with block 0, leaving block 0 permanently unmergeable. So the buddy of block 1 must be block 0, not block 2.

The rule for finding the buddy is:

For blocks of size n, two adjacent blocks are buddies only when the combined block of size 2n is properly aligned, i.e. its left (low-memory) half starts at an address that is a whole multiple of 2n (at offset 2nk for some non-negative integer k).

How to perform fast allocation?

When memory of size n must be allocated, a block of size m is chosen, where m is the smallest power of two that fits: m/2 < n and m >= n.

This restriction determines the size of block to allocate. We then search the corresponding block group for a free block; if one exists, we allocate it. If not, we search one level up: if a free block exists there, we split it in two, allocate one half, and insert the other half into the block group one level down.

If the upper level has no free block either, we keep searching upward, finding a suitable block recursively.

How to release blocks?

Releasing is roughly the reverse of allocation. When a block is reclaimed, first check whether its buddy is free. If it is, merge the reclaimed block with its buddy into a larger block and insert the result into the next-higher block group; repeat this recursively until no further merge is possible.
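The split-on-allocate and merge-on-free logic above can be sketched in Go. This is a toy model (block addresses and sizes in abstract units, free lists kept as per-order sets, all names illustrative), not the kernel's implementation:

```go
package main

import "fmt"

const maxOrder = 10 // orders 0..10 → block sizes 1..1024 units

// Buddy keeps one free list per order; freeLists[k] holds the start
// addresses of free blocks of size 1<<k.
type Buddy struct {
	freeLists [maxOrder + 1]map[int]bool
}

func NewBuddy() *Buddy {
	b := &Buddy{}
	for k := range b.freeLists {
		b.freeLists[k] = map[int]bool{}
	}
	b.freeLists[maxOrder][0] = true // start with one big free block of 1024 units
	return b
}

// orderFor returns the smallest k with 1<<k >= n (i.e. m/2 < n <= m).
func orderFor(n int) int {
	k := 0
	for 1<<k < n {
		k++
	}
	return k
}

// Alloc finds a free block of the right order, splitting larger blocks down.
func (b *Buddy) Alloc(n int) (addr int, ok bool) {
	want := orderFor(n)
	k := want
	for k <= maxOrder && len(b.freeLists[k]) == 0 {
		k++ // no free block at this order: look one level up
	}
	if k > maxOrder {
		return 0, false
	}
	for a := range b.freeLists[k] {
		addr = a
		break
	}
	delete(b.freeLists[k], addr)
	for k > want { // split down, returning each right half to its free list
		k--
		b.freeLists[k][addr+1<<k] = true
	}
	return addr, true
}

// Free returns a block and merges it with its buddy while possible.
func (b *Buddy) Free(addr, n int) {
	k := orderFor(n)
	for k < maxOrder {
		buddy := addr ^ (1 << k) // the buddy's address differs exactly in bit k
		if !b.freeLists[k][buddy] {
			break // buddy is busy: stop merging
		}
		delete(b.freeLists[k], buddy)
		if buddy < addr {
			addr = buddy
		}
		k++
	}
	b.freeLists[k][addr] = true
}

func main() {
	b := NewBuddy()
	a1, _ := b.Alloc(3) // needs a 4-unit block → splits 1024 down to 4
	fmt.Println("allocated at", a1)
	b.Free(a1, 3) // merges all the way back to one 1024-unit block
	fmt.Println("free 1024-blocks:", len(b.freeLists[maxOrder]))
}
```

Note how `Free` computes the buddy's address as `addr ^ (1<<k)`: buddies of order k differ exactly in bit k, which is another way of stating the alignment rule above.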

A Simple Example of the Buddy Algorithm

Suppose we start with 256KB of physical memory and the program requests 21KB. The kernel's splitting process goes as follows:

The kernel splits the 256KB of memory into two 128KB blocks, AL and AR, which are buddies.

It then finds that 128KB is still much larger than 21KB, so it keeps splitting: into two 64KB blocks, and, since 64KB is still not the smallest block that satisfies the request, into two 32KB blocks.

Splitting 32KB further would give 16KB, which no longer satisfies the 21KB request, so 32KB is the smallest block that fits, and the kernel hands one of the resulting blocks, CL or CR, to the requester.

When the requester is finished, the memory must be returned:

The 32KB block is freed; if its buddy is not occupied, their addresses are contiguous and the two are merged back into a 64KB block, and so on, merging upward.
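The split sequence in this example can be traced with a few lines of Go (sizes in KB; a toy trace, not kernel code):

```go
package main

import "fmt"

func main() {
	request := 21 // KB
	block := 256  // KB, the initial block
	// Keep splitting while the half-block still satisfies the request.
	for block/2 >= request {
		fmt.Printf("split %dKB into two %dKB buddies\n", block, block/2)
		block /= 2
	}
	fmt.Printf("allocate a %dKB block for the %dKB request\n", block, request)
	// → split 256KB into two 128KB buddies
	// → split 128KB into two 64KB buddies
	// → split 64KB into two 32KB buddies
	// → allocate a 32KB block for the 21KB request
}
```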

Notice:

All splits here are into halves, and every block size is a power of 2.

The secret of the buddy algorithm:

Store memory blocks in more advanced data structures than linked lists. These structures are often combinations or variants of buckets, trees, and heaps. In general, the way each partner allocator works differs greatly due to the chosen data structure.

Buddy allocators are widely used due to the availability of a wide variety of data structures with known properties.

Buddy allocators are often complex to write, and their performance can vary.

Buddy algorithm in linux kernel

An important task of Linux kernel memory management is how to avoid fragmentation in the case of frequent applications for freeing memory.

Linux uses the buddy system to solve the problem of external fragmentation, and uses slab to solve the problem of internal fragmentation.

Linux 2.6 uses a separate buddy system for each memory zone. Kernel memory is divided into three zones, DMA, NORMAL, and HIGHMEM, and each zone has its own buddy allocator.

In the Linux kernel, the buddy algorithm groups all free page frames into 11 block lists, whose elements are blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 consecutive page frames respectively.

The maximum request is therefore 4MB of contiguous memory. Within each list, all blocks have the same size and each block covers consecutive addresses.

In the 11 block list:

  • the 0th block list holds blocks of 2^0 = 1 contiguous page frame,
  • each element of the 1st block list is a block of 2 contiguous page frames,
  • ….
  • each element of the 10th block list is a contiguous 4MB run of page frames.

The number of elements in each linked list is determined when the system is initialized, and changes dynamically during execution.

#ifndef CONFIG_FORCE_MAX_ZONEORDER
#define MAX_ORDER 11
#else
#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER
#endif
#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))

  struct free_area {
    struct list_head    free_list[MIGRATE_TYPES]; /* doubly linked lists of free blocks */
    unsigned long       nr_free;                  /* number of free blocks */
  };

  struct zone {
       ....
       struct free_area    free_area[MAX_ORDER];
       ....
  };

Each zone holds an array of free_area elements, and each free_area stores free_list linked lists inside:

struct free_area free_area[MAX_ORDER]  /* MAX_ORDER defaults to 11: one entry per group */

The kth free_area element in the array tracks all free blocks of size 2^k; those blocks are organized on the doubly linked circular lists pointed to by free_list.

The buddy algorithm can only allocate power of 2 page frames at a time, and the size of each page is usually 4K

For example: allocate 1 page, 2 pages, 4 pages, 8 pages, ..., 1024 pages (2^10) and so on at a time, so the buddy algorithm can allocate up to 4M (1024*4K) memory space at a time

MAX_ORDER defaults to 11, giving the 11 groups above; each free_area entry also tracks the free memory blocks of its group.

Members of zone: nr_free and zone_mem_map arrays

Buddy bitmap mem_map

Buddy relationship : two memory blocks of the same size, with consecutive addresses, that belong to the same larger block. (Blocks 0 and 1 are buddies, blocks 2 and 3 are buddies, but blocks 1 and 2 are not.)

Buddy bit : a single bit that describes the state of a buddy pair, called the buddy bit.

For example, bit0 is the buddy bit of blocks 0 and 1. If bit0 is 1, exactly one of the two blocks has been allocated; if bit0 is 0, either both blocks are free or both are in use.

Throughout the process, bitmaps play an important role

In the Linux kernel's buddy algorithm, each order has a bitmap covering its blocks, and one bit of the bitmap corresponds to a pair of buddy blocks.

A bit value of 1 means exactly one of the pair is busy; 0 means both blocks are free or both are busy.

Each time the system allocates or reclaims a buddy block, it XORs the pair's buddy bit with 1.

The XOR works like this: at the beginning both buddies are free and the bit is 0:

  • when one of them is allocated, the XOR flips the bit to 1;
  • when the other is also allocated, the XOR flips it back to 0;
  • when one of them is then freed, the XOR flips it to 1 again;
  • when the other is freed as well, the XOR returns it to 0.

As shown in the figure, one bit of the bitmap corresponds to two blocks that are buddies of each other: 1 means exactly one of the pair has been allocated, and 0 means both are free or both are in use.

Whether a block of the pair is being allocated or released, the bitmap operation is the same: XOR the corresponding bit with 1.

The bitmap is maintained during allocation mainly to serve the release path; on release, the kernel uses it to decide whether the buddy can be merged:

  • if XORing the bit yields 1, the bit was 0 before, meaning both blocks were busy; the buddy is still busy, so no merge is possible.
  • if XORing yields 0, the bit was 1 before, meaning exactly one block was busy and one was free, and the busy one is the block now being freed; its buddy is free, so merge them, and continue merging upward in the same way until no further merge is possible.

So the bitmap's main purpose is to tell the reclaim algorithm whether a block can merge with its buddy. Allocation only needs to search the free lists, but it must still XOR the corresponding bit at allocation time, precisely to keep the reclaim algorithm correct.
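The XOR bookkeeping can be walked through with a toy bitmap in Go (one bit per buddy pair; purely illustrative):

```go
package main

import "fmt"

func main() {
	var bit uint = 0 // buddy bit for blocks 0 and 1; 0 = both free (or both busy)

	// Allocate block 0: toggle the bit.
	bit ^= 1
	fmt.Println("after allocating block 0:", bit) // 1: exactly one block busy

	// Allocate block 1 too: toggle again.
	bit ^= 1
	fmt.Println("after allocating block 1:", bit) // 0: both busy

	// Free block 1: toggle; a result of 1 means the buddy (block 0)
	// is still busy, so no merge is possible.
	bit ^= 1
	fmt.Println("after freeing block 1:", bit, "→ buddy busy, no merge")

	// Free block 0: toggle; a result of 0 means the buddy is free,
	// so the pair merges into the next-higher order.
	bit ^= 1
	fmt.Println("after freeing block 0:", bit, "→ buddy free, merge")
}
```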

Buddy algorithm problem:

The buddy algorithm manages raw memory, such as the most primitive physical memory (outside the heap in Java), or a large continuous heap memory.

On a request, the buddy algorithm hands the program an entire power-of-two block, possibly larger than asked for, which is what guarantees that requests for large contiguous blocks can be satisfied.

Obviously, allocating more memory space than required will cause internal fragmentation.

Therefore, although the buddy algorithm can completely avoid the generation of external fragments, it is precisely at the cost of generating internal fragments.

shortcoming:

Although the buddy algorithm effectively eliminates external fragmentation, its minimum granularity is still a page (4K) and every block is a power of two, so it can cause very serious internal fragmentation, in the worst case wasting close to 50% of the allocated memory.

slab algorithm

The buddy algorithm is poorly suited to small allocations: every request costs at least a page, producing very serious internal fragmentation.

The slab algorithm builds on the buddy algorithm and optimizes specifically for small allocations:

It provides an object-cache mechanism for kernel objects, so when the kernel needs memory again, the request can usually be served straight from the cache.

The bottom layer of Linux uses the Slab algorithm for small memory allocation.

Linux uses the buddy system to solve the problem of external fragmentation, and uses slab to solve the problem of internal fragmentation.

The basic principle of the slab allocator:
carve the memory it obtains into blocks of predetermined, fixed sizes, which eliminates internal fragmentation for those objects.

Specifically:

The slab allocator splits allocated memory into chunks of various sizes and groups chunks of the same size into groups.

In addition, after the allocated memory is used up, it will not be released, but will be returned to the corresponding group for reuse.
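A minimal sketch of the slab idea in Go: carve one backing chunk into fixed-size objects, hand them out from a free list, and take them back for reuse instead of freeing them. The names and layout are illustrative, not the kernel slab:

```go
package main

import "fmt"

// SlabCache hands out fixed-size objects carved from one backing slab.
type SlabCache struct {
	objSize int
	slab    []byte
	free    []int // offsets of free objects within the slab
}

func NewSlabCache(objSize, nObjs int) *SlabCache {
	c := &SlabCache{objSize: objSize, slab: make([]byte, objSize*nObjs)}
	for i := 0; i < nObjs; i++ {
		c.free = append(c.free, i*objSize)
	}
	return c
}

// Get pops a free object: no per-object allocation, no fragmentation.
func (c *SlabCache) Get() ([]byte, int, bool) {
	if len(c.free) == 0 {
		return nil, 0, false // a real slab would grow by taking more pages
	}
	off := c.free[len(c.free)-1]
	c.free = c.free[:len(c.free)-1]
	return c.slab[off : off+c.objSize], off, true
}

// Put returns the object to the free list for reuse instead of freeing it.
func (c *SlabCache) Put(off int) {
	c.free = append(c.free, off)
}

func main() {
	cache := NewSlabCache(64, 4) // four 64-byte objects in one slab
	obj, off, _ := cache.Get()
	fmt.Println("got object of", len(obj), "bytes at offset", off)
	cache.Put(off)
	fmt.Println("free objects:", len(cache.free))
}
```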

Implementation of jemalloc algorithm

jemalloc builds on buddy + slab and is more complex than either.

Slab speeds up small allocations; jemalloc additionally achieves excellent allocation efficiency under multi-threading through its Arena and Thread Cache designs.

Arena embodies divide and conquer: rather than having one manager for all of memory, the work is handed to multiple arenas, each managed independently without interfering with the others (reducing thread contention).

Thread Cache is the core idea of tcmalloc, and jemalloc borrows it as well.

With the Thread Cache mechanism, each thread has its own memory manager; allocation completes inside the thread, with no need to compete with other threads.

If the above feels too complex, see the companion video of the "Go Study Bible - Technical Freedom Edition".

Note that Netty's memory pool is modeled on the jemalloc algorithm. For details, see Nien's blog post:

Netty Memory Pool (50k-word article, the most complete ever)

TCMalloc, the thread-caching memory allocator

Introduction to TCMalloc

Why introduce TCMalloc?

Because the vast majority of Golang's memory allocation algorithm comes from TCMalloc; Golang changed only a small part of it.

So to understand Golang's memory allocator, first get to know TCMalloc, as groundwork for the later analysis of Go's memory.

tcmalloc is a memory allocation library developed by Google; it began as part of perftools, a Google performance tool suite.

TCMalloc is a replacement for the traditional malloc memory allocation function.

TCMalloc reduces memory fragmentation, suits multi-core machines, and has good parallelism support. The "TC" prefix is short for Thread Cache.

TCMalloc provides many optimizations, for example:

  • TCMalloc acquires and allocates memory in fixed-size pages, much the same idea as Linux's division of physical memory into pages.
  • TCMalloc uses fixed-size objects (size classes such as 8 bytes, 16 bytes, and so on) to allocate objects of particular sizes, which simplifies acquiring and releasing memory.
  • TCMalloc caches commonly used objects to speed up memory acquisition.
  • TCMalloc can size its caches per thread or per CPU (the default).
  • TCMalloc sets its cache allocation strategy independently per thread, reducing lock contention between threads.

TCMalloc Architecture Diagram

From: google tcmalloc design

  • Front-end:
    A memory cache that provides fast allocation and deallocation of memory to the application. It consists of two parts: the per-thread cache and the per-CPU cache.
  • Middle-end:
    Responsible for refilling the Front-end's cache: when the Front-end runs out of cached memory, it requests memory from the Middle-end.
    This is mainly the Central free list.
  • Back-end:
    Responsible for obtaining memory from the operating system and refilling the Middle-end. This mainly involves the Page Heap.

TCMalloc divides the entire virtual memory space into n pages of equal size. Connect n consecutive pages together to form a Span.

PageHeap applies for memory from the OS, and the requested span may have only one page or n pages.

When ThreadCache runs out of memory it applies to CentralCache; when CentralCache runs out it applies to PageHeap; and when PageHeap runs short it applies to the OS.
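The ThreadCache → CentralCache → PageHeap → OS fallback chain can be sketched as a chain of caches in Go. The structure and the batch size of 8 are hypothetical, purely to illustrate the multi-level refill:

```go
package main

import "fmt"

// Tier hands out objects from its local stock and refills in batches
// from the tier below it when the stock runs out.
type Tier struct {
	name   string
	stock  int
	refill *Tier // nil for the bottom tier, which asks the "OS"
}

// Get takes one object, refilling from the lower tier when empty.
func (t *Tier) Get() string {
	if t.stock == 0 {
		if t.refill != nil {
			t.refill.GetBatch(8) // consume a batch from the tier below...
			t.stock += 8         // ...which lands in this tier's stock
			fmt.Println(t.name, "refilled from", t.refill.name)
		} else {
			t.stock += 8 // bottom tier: "ask the OS" for more pages
			fmt.Println(t.name, "requested pages from the OS")
		}
	}
	t.stock--
	return t.name
}

// GetBatch drains n objects, modeling a batched hand-off between tiers.
func (t *Tier) GetBatch(n int) {
	for i := 0; i < n; i++ {
		t.Get()
	}
}

func main() {
	pageHeap := &Tier{name: "PageHeap"}
	central := &Tier{name: "CentralCache", refill: pageHeap}
	thread := &Tier{name: "ThreadCache", refill: central}

	// Everything starts empty, so one Get cascades all the way down:
	// OS → PageHeap → CentralCache → ThreadCache.
	thread.Get()
	fmt.Println("thread cache stock after first Get:", thread.stock)
}
```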

Concepts in TCMalloc

Page

A page is the operating system's memory-management unit. TCMalloc also manages memory in pages, but a TCMalloc Page is a multiple of the operating-system page size: 2×, 4×, 8×, and so on.

Span

Span is the unit that manages memory pages in PageHeap. It is composed of a group of continuous Pages, such as a span composed of 2 Pages. Multiple such spans are managed by a linked list.

Of course, there can also be a span composed of 4 Pages and so on.

ThreadCache

ThreadCache is a cache owned independently by each thread. A cache contains multiple free lists (size classes); each list holds its own objects, and all objects on one list are the same size.

CentralCache

CentralCache supplies memory to a ThreadCache when the latter runs out. It maintains free lists, one per size class, matching the lists in ThreadCache. When a ThreadCache holds too much memory, objects can be put back into CentralCache.

PageHeap

PageHeap also saves several linked lists, but the linked list saves Span (multiple identical pages form a Span). When CentralCache runs out of memory, it can obtain Span from PageHeap, and then cut the Span into objects.

Small object memory allocation ThreadCache

TCMalloc defines many size classes, each maintaining a free list of allocatable items. Each item on a free list is called an object (as shown in the figure below), and all objects on the free list of one size class have the same size.

When applying for small memory (less than 256K), TCMalloc will map it to a certain size-class according to the size of the requested memory.

for example,

  • When applying for a size of 0 to 8 bytes, it will be mapped to size-class1 and allocated a size of 8 bytes;
  • When applying for a size of 9 to 16 bytes, it will be mapped to size-class2, and the size of 16 bytes will be allocated....

and so on.
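The size-to-class mapping can be sketched as follows. The table here is a tiny illustrative subset (the real TCMalloc has on the order of a hundred classes):

```go
package main

import "fmt"

// sizeClasses is a small illustrative table of allocation sizes in bytes.
var sizeClasses = []int{8, 16, 32, 48, 64, 128, 256}

// classFor returns the size-class number and the allocated size for a
// request of n bytes; classes are numbered from 1 as in the text above.
func classFor(n int) (class, size int) {
	for i, s := range sizeClasses {
		if n <= s {
			return i + 1, s
		}
	}
	return 0, 0 // larger requests bypass this small-object path
}

func main() {
	for _, n := range []int{5, 9, 100} {
		c, s := classFor(n)
		fmt.Printf("request %dB → size-class%d, allocated %dB\n", n, c, s)
	}
	// → request 5B → size-class1, allocated 8B
	// → request 9B → size-class2, allocated 16B
	// → request 100B → size-class6, allocated 128B
}
```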

Each of the above objects is N bytes and serves the Thread Cache's small-object allocation.
These lists make up each ThreadCache's free lists. A thread can take objects from its own free lists without locking, so allocation is very fast.

What if a ThreadCache free list is empty? Then several objects are fetched from the CentralFreeList in CentralCache into the corresponding ThreadCache size-class list, and one of them is handed out.
What if the CentralFreeList does not have enough objects? Then the CentralFreeList asks PageHeap for a run of pages forming a Span, cuts the pages into a series of objects, and moves some of them over to the ThreadCache.
What if PageHeap runs short? Then it requests memory from the OS.
As the discussion above shows, this is another application of the multi-level caching idea.

When the requested memory is greater than 256K, it is not allocated through ThreadCache, but directly allocated through PageHeap.

Large object memory allocation PageHeap

PageHeap is responsible for applying for memory from the operating system.

tcmalloc also allocates at page granularity: every request is for at least one page of memory.

The page size in tcmalloc is 8KB by default (configurable); a page on most Linux systems is 4KB, so one tcmalloc page is twice the size of one Linux page.

PageHeap applies for memory according to the page, but the basic unit when it manages the allocated page memory is Span, and the Span object represents a continuous page. As shown below:

How to organize Span in PageHeap, as shown below

Middle end-Central Free List

CentralFreeList lives in CentralCache. Its job is to take Spans from PageHeap, split them into fixed-size objects according to the predetermined sizes, and supply those objects to ThreadCache.

The above content is too complicated. If you don’t understand it, please refer to the supporting video of "Go Study Bible-Technical Free Circle Edition"

Golang Garbage Collection

Starting with Go v1.12, Go uses a non-generational, concurrent, three-color mark-and-sweep based garbage collector .

For the classic mark-sweep algorithm itself, you can refer to the C/C++ literature; Go, for its part, is a statically typed, compiled language.

Therefore Go does not require a VM: a small runtime (the Go runtime), embedded in every Go application binary, handles language features such as garbage collection (GC), scheduling, and concurrency.

First let's look at what memory management looks like inside Go.

1. Golang memory management

Here is a brief introduction to Golang operation scheduling.

There are three basic concepts in Golang: G, M, P.

  • G: a goroutine, the context in which Go code executes.
  • M: an operating-system thread.
  • P: a processor, the logical processor holding the resources needed for scheduling; it is the key to goroutine scheduling and can be thought of as roughly a CPU core.

The operation of a Goroutine requires the combination of G+P+M.

Go memory management

Source: "Golang-Memory Management (Memory Allocation)"

(http://t.zoukankan.com/zpcoding-p-13259943.html)

(1)TCMalloc

Go divides and groups memory into pages (Page), which is completely different from Java's memory structure.

Go does not use generational memory, because Go's memory allocator follows the design ideas of TCMalloc:

1.Page

Same as the Page in TCMalloc, the size of a Page under x64 is 8KB.

At the bottom of the picture above, a light blue rectangle represents a Page.

2.Span

Same as the Span in TCMalloc, a Span is the basic unit of memory management (mspan in the code). A set of contiguous Pages forms a Span, so in the figure above a group of contiguous light-blue rectangles is a Span, and a lavender rectangle is also a Span.

3.mcache

mcache is cache provided to P (logical processor) for storing small objects (object size <= 32Kb).

Although this is similar to the thread stack, it is part of the heap and is used for dynamic data.

mcache holds mspans of the scan and noscan kinds for all size classes. A goroutine can obtain memory from its mcache without any locking, because a P runs only one G at a time.

This path is therefore very efficient. When needed, mcache requests new spans from mcentral.

4.mcentral

mcentral is similar to CentralCache in TCMalloc,

It is a cache shared by all threads and needs to be locked for access. It classifies Spans according to Span class and concatenates them into a linked list. When the memory of a certain level of Span in mcache is allocated, it will apply to mcentral for a current level of Span.

Each mcentral contains two mspanLists:

  • empty : a doubly linked list of spans that have no free objects, or spans currently lent to an mcache. When a span here is freed, it is moved to the non-empty list.
  • non-empty : a doubly linked list of spans that still have free objects. When a new span is requested from mcentral, mcentral takes a span from this list and moves it into the empty list.

5. mheap

mheap is similar to PageHeap in TCMalloc. It is an abstraction of heap memory and a key area of ​​garbage collection. It organizes memory pages requested from the OS into Spans and saves them.

When mcentral runs out of Spans it applies to mheap, and when mheap runs out it applies to the OS. Requests to the OS are made page by page; the requested pages are then organized into Spans, and this, too, requires locked access.

6. stack

This is the stack storage area, and each Goroutine (G) has a stack.

Static data is stored here, including function stack frames, static structures, native type values, and pointers to dynamic structures.

This is not the same thing as mcache assigned to each P.

(2) Memory allocation

The memory classification in Go is not divided into small, medium, and large objects like TCMalloc, but its small objects are subdivided into a Tiny object. A Tiny object refers to an object whose size is between 1Byte and 16Byte and does not contain pointers.

Small objects and large objects are only delineated by size, and there is no other distinction.

Core idea : Divide memory into multi-level management, reduce the granularity of locks (just go to mcentral and mheap to apply for locks), and multiple object size types to reduce memory fragmentation caused by allocation.

  • Tiny objects (size<16B)

Objects smaller than 16 bytes are allocated using mcache's tiny allocator, and multiple tiny allocations can be done on a single 16-byte block.

  • Small objects (size 16B~32KB)

Objects with a size between 16 bytes and 32k bytes are allocated on the corresponding mspan size class of the mcache of P where G runs.

  • Large objects (size >32KB)

Objects larger than 32 KB are directly allocated on the corresponding size class of mheap (size class).

  • If the mheap is empty or there are no pages large enough to satisfy the allocation request, it will allocate a new set of pages (at least 1MB) from the operating system.
  • If there is no available block in mcache for the corresponding size specification, apply to mcentral.
  • If there is no available block in mcentral, apply to mheap, and find the most suitable mspan according to the BestFit algorithm. If the applied mspan exceeds the application size, it will be divided according to the demand to return the number of pages required by the user. The remaining pages constitute a new mspan and are put back on the mheap's free list.
  • If there is no span available in the mheap, apply to the operating system for a new series of pages (minimum 1MB). Go allocates huge pages (called arenas) on the operating system. Allocating a large number of pages reduces the cost of communicating with the operating system.
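The three allocation paths above can be expressed as a small classifier in Go, using the thresholds from the text (a sketch, not the runtime's actual code path):

```go
package main

import "fmt"

// allocPath names the allocation path the Go runtime would take for a
// request of n bytes; hasPointers matters for the tiny allocator, which
// only serves pointer-free objects.
func allocPath(n int, hasPointers bool) string {
	switch {
	case n < 16 && !hasPointers:
		return "tiny allocator (mcache)"
	case n <= 32*1024:
		return "small: mcache span for the matching size class"
	default:
		return "large: allocated directly from mheap"
	}
}

func main() {
	fmt.Println(allocPath(8, false))      // tiny object
	fmt.Println(allocPath(1024, true))    // small object
	fmt.Println(allocPath(64*1024, true)) // large object
}
```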

(3) Memory recovery

Go memory is divided into two parts, the heap and the stack. The program can actively apply for memory space from the heap during operation. The memory is allocated by the memory allocator and reclaimed by the garbage collector.

Memory in the stack area is allocated and released automatically by the compiler. Function parameters and local variables live on the stack: they are created when the function is called and destroyed when it returns. If a program only ever requested and allocated memory, memory would eventually be exhausted.

Go uses garbage collection to find spans that are no longer in use and releases them back to mheap; mheap merges adjacent spans and adds the merged spans to the scav tree, where they wait to be reallocated.

Therefore, the Go heap is the main area managed by the Go garbage collector .

2. Mark-and-sweep algorithm

Once the Go garbage collector has distinguished the live objects from the dead ones in the area it manages, its next task is to reclaim the memory occupied by the dead objects, so that enough free memory is available to allocate to new objects.

The common garbage collection algorithms were introduced in the "Garbage Collection Algorithms" section of the previous article. Go uses the mark-and-sweep algorithm, a very basic and common garbage collection algorithm proposed by J. McCarthy et al. in 1960.

When heap space is exhausted, the collector stops the world (STW). Its execution has two phases, marking and sweeping: the Go garbage collector traverses from the root set, performing reachability analysis and recursively marking every referenced object as alive; when the marking phase ends, it traverses the objects in the heap in turn and reclaims those that were not marked as alive.

The user program cannot run during this collection (STW). In the reachability analysis, Go's GC roots are mainly global variables and the pointers on each G's stack; they are few compared with the objects of the whole heap, so the pause for scanning the roots is short and relatively stable, and does not grow with heap capacity.

While traversing down from the GC roots, however, the larger the heap and the more objects it stores, the deeper the recursive traversal and the longer the marking pause. We therefore need a more sophisticated mechanism to attack the STW problem.

3. Three-color accessibility analysis

To reduce the STW cost of the mark-and-sweep algorithm, both Go and Java implement variants of the tri-color reachability marking algorithm to shorten STW time. It divides the program's objects into white, black, and gray according to whether they have been visited:

  • White object — not yet visited by the garbage collector. At the start of the reachability analysis every object is white; an object still white at the end is unreachable.
  • Black object — visited by the collector, with every reference it holds already scanned. A black object is proven alive and never needs rescanning; moreover, a black object can never point directly (without an intervening gray object) at a white object.
  • Gray object — visited by the collector, but with at least one of its references not yet scanned; it may still hold pointers to white objects, so the collector will go on to scan its children.

The general flow of the three-color accessibility analysis algorithm is (all objects in the initial state are white):

1. Start enumerating from the GC Roots: all of their direct references become gray (moved into the gray set), and the GC Roots themselves become black.

2. Take a gray object from the gray collection for analysis:

  • Turn all direct references of this object into gray and put them into the gray collection;
  • Make this object black.

3. Repeat step 2 until the gray set is empty.

4. After the analysis is complete, the objects that are still white are objects that are not reachable by GC Roots and can be cleaned up as garbage.

A specific example is shown in the figure below. After three-color reachability analysis, the white H is an unreachable object, which needs to be garbage collected.
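The four steps can be implemented directly. Below is a minimal tri-color mark phase in Go; the objects and references are illustrative, and for simplicity the roots are pushed gray rather than immediately blackened (equivalent in effect):

```go
package main

import "fmt"

// Color is the tri-color state of an object.
type Color int

const (
	White Color = iota // not yet visited
	Gray               // visited, references not fully scanned
	Black              // visited, all references scanned
)

type Object struct {
	name  string
	refs  []*Object
	color Color
}

// mark runs the tri-color reachability analysis from the given roots.
func mark(roots []*Object) {
	var gray []*Object
	for _, r := range roots {
		r.color = Gray
		gray = append(gray, r)
	}
	for len(gray) > 0 { // step 2 and 3: drain the gray set
		obj := gray[len(gray)-1]
		gray = gray[:len(gray)-1]
		for _, ref := range obj.refs {
			if ref.color == White {
				ref.color = Gray
				gray = append(gray, ref)
			}
		}
		obj.color = Black
	}
}

func main() {
	a := &Object{name: "A"}
	b := &Object{name: "B"}
	h := &Object{name: "H"} // unreachable: nothing refers to it
	a.refs = []*Object{b}

	mark([]*Object{a})
	// Step 4: objects still white after marking (here H) are garbage.
	for _, o := range []*Object{a, b, h} {
		fmt.Println(o.name, "white?", o.color == White)
	}
	// → A white? false
	// → B white? false
	// → H white? true
}
```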

The tri-color mark-sweep algorithm by itself cannot run concurrently or incrementally; it still requires STW,

because if marking runs concurrently with the user program, the program may modify object pointers during the marking process.

There are generally 2 types of this situation:

1. The first is to mark as alive a dead object that should have been collected.

This is undesirable but not serious: it merely produces a little floating garbage that escapes this cycle and will be cleaned up in the next one. For example, during the tri-color marking shown above, suppose the user program removes the reference from object B to object E after B has already been scanned; step 2 will not revisit B, so E ends up wrongly marked black and is not reclaimed. E is floating garbage and will be cleaned up during the next garbage collection.

2. The second is to mark a live object as dead, causing an "object disappearance", which is a very serious memory-management error.

For example, during the tri-color marking shown above, the user program establishes a reference from object B to object H (say, B.next = H) and then executes D.next = nil. Because B is already black, the new B→H reference will not be scanned in step 2, and the link from D to H has been cut, so H ends up marked white and is wrongly reclaimed by the collector. The surviving pointer then no longer refers to a legal object of its type; we call this a dangling pointer, and it breaks memory safety.

4. Barrier technology

To solve the "object disappearance" phenomenon above, Wilson proved in 1994 that an object that should be black is mislabeled white if and only if both of the following conditions hold at the same time:

  • the mutator inserts one or more new references from a black object to the white object;
  • the mutator deletes every direct or indirect reference from gray objects to that white object.

Therefore, to prevent objects from disappearing during concurrent scanning and keep the garbage collection algorithm correct, it suffices to break either of the two conditions. Barrier techniques are the key technology for preserving tri-color invariance during concurrent or incremental marking.

Note that barriers in garbage collection and the operating system's memory-barrier technique are concepts at different levels.

A memory barrier is an instruction that makes the CPU or compiler obey ordering constraints on memory operations: operations issued before the barrier are guaranteed to complete before operations issued after it. (Most modern processors otherwise execute instructions out of order to maximize performance.)

A barrier in garbage collection is more like a hook: a snippet of code executed when the user program reads an object, creates a new object, or updates an object pointer. By operation type, barriers divide into read barriers (Read barrier) and write barriers (Write barrier). Because a read barrier adds code to every read, it hurts user-program performance badly, so programming languages usually rely on write barriers to preserve tri-color invariance.

(1) Insert a write barrier

Dijkstra proposed the insert write barrier in 1978; the approach is also known as incremental update.

It breaks the first condition above (the mutator inserts a new reference from a black object to a white object) via a write barrier as shown below:

func DijkstraWritePointer(slot *unsafe.Pointer, ptr unsafe.Pointer) {
     shade(ptr)  // first shade the new downstream object ptr gray
     *slot = ptr
}

// Explanation:
addDownstream(currentDownstream slot, newDownstream ptr) {
 //step 1
 markGray(ptr)   // shade the new downstream object

 //step 2
 slot = ptr
}

// Scenarios:
A.addDownstream(nil, B) // A had no downstream; B is newly attached and marked gray
A.addDownstream(C, B)   // A replaces downstream C with B; B is marked gray

The pseudo code above is easy to understand: whenever a black object installs a new reference to a white object ptr, the shade function marks the newly referenced object ptr gray.

Assume that we use the insert write barrier in the concurrent reachability analysis of the example above:

  1. GC marks the B object pointed to by the root object Root2 as black and marks the object D pointed to by the B object as gray;
  2. The user program modifies the pointer, and B.next=H triggers the write barrier to mark the H object in gray;
  3. The user program modifies the pointer D.next=null ;
  4. GC traverses H and D in the program in turn and marks them as black respectively.

Stack slots are treated as root objects during garbage collection and carry no write barrier, so a black stack may end up pointing at a white heap object: in Figure 1 above, Root2 points to H while the reference from D to H is deleted, and without a barrier H would be collected even though it is still reachable. To guarantee memory safety, Dijkstra's scheme must either add write barriers to objects on the stack as well, or re-scan the stack objects during the marking phase. Both options have drawbacks: the former greatly increases the overhead of every pointer write, while the latter requires pausing the program while the stacks are re-scanned, so the designer of a garbage collection algorithm has to make a trade-off between the two.
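The insert barrier's effect can be exercised with a minimal, runnable simulation of the B/D/H scenario above. Everything here (the Object type, the color bookkeeping, the writePointer and drainGray helpers) is an illustrative teaching sketch, not the Go runtime's real data structures:

```go
package main

import "fmt"

type Color int

const (
	White Color = iota
	Gray
	Black
)

// Object is a toy heap object with explicit color and references.
type Object struct {
	Name  string
	Color Color
	Refs  []*Object
}

// shade makes a white object gray and queues it for scanning.
func shade(o *Object, gray *[]*Object) {
	if o != nil && o.Color == White {
		o.Color = Gray
		*gray = append(*gray, o)
	}
}

// writePointer installs src.Refs[i] = dst, shading dst first: the
// Dijkstra insert barrier, so a black object never hides a white one.
func writePointer(src *Object, i int, dst *Object, gray *[]*Object) {
	shade(dst, gray)
	src.Refs[i] = dst
}

// drainGray finishes marking: scan gray objects until none remain.
func drainGray(gray *[]*Object) {
	for len(*gray) > 0 {
		o := (*gray)[len(*gray)-1]
		*gray = (*gray)[:len(*gray)-1]
		for _, ref := range o.Refs {
			shade(ref, gray)
		}
		o.Color = Black
	}
}

func main() {
	// B -> D -> H, with B already scanned (black) and D gray.
	h := &Object{Name: "H", Color: White}
	d := &Object{Name: "D", Color: Gray, Refs: []*Object{h}}
	b := &Object{Name: "B", Color: Black, Refs: []*Object{d}}
	gray := []*Object{d}

	// Mutator runs while marking is in progress: B.next = H, then D.next = nil.
	writePointer(b, 0, h, &gray) // barrier shades H gray
	d.Refs[0] = nil              // deleting D -> H needs no barrier here

	drainGray(&gray)
	fmt.Println("H is black:", h.Color == Black) // H survives the cycle
}
```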

(2) Delete the write barrier

In his 1990 paper Real-time garbage collection on general-purpose machines, Yuasa proposed the delete write barrier: once the barrier is active, it preserves every object that was reachable on the heap at the moment the barrier was turned on.

At the beginning, an STW pause scans all goroutine stacks so that every live object on the heap is protected by gray objects; hence the approach is also called snapshot garbage collection (Snapshot GC). It breaks the second condition of "object disappearing" (the mutator deletes all direct or indirect references from gray objects to the white object).

// Yuasa barrier (black mutator)
func YuasaWritePointer(slot *unsafe.Pointer, ptr unsafe.Pointer) {
    shade(*slot) // first shade the old object *slot gray
    *slot = ptr
}

// Explanation:
addDownstream(currentDownstream slot, newDownstream ptr) {
  //step 1
  if (slot is gray || slot is white) {
          markGray(slot)     // slot is the object being unlinked; mark it gray
  }
  //step 2
  slot = ptr
}

// Scenarios:
A.addDownstream(B, nil) // A drops its reference to B; B is marked gray (if it was white)
A.addDownstream(B, C)   // A replaces downstream B with C; B is marked gray (if it was white)

The code above shades the old object gray when the reference to it is deleted. The delete write barrier thus maintains the weak three-color invariant: any downstream object that the old object referenced remains protected by a gray object.

This comes at a cost: objects that may already be dead are still marked gray, so an object that should be reclaimed can survive the current cycle and only be collected in the next one, such as object D in the figure below.

Because of the snapshot, an STW pause is taken at the start, so the delete write barrier is a poor fit for programs with very large stacks: the larger the stacks, the longer the initial STW scan.
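Here is a minimal runnable sketch of that snapshot behavior (hypothetical types and helpers, not the runtime's real structures): deleting the only reference from a gray object still leaves the old target gray, so it is marked this cycle.

```go
package main

import "fmt"

type Color int

const (
	White Color = iota
	Gray
	Black
)

// Object is a toy object with explicit color and references.
type Object struct {
	Name  string
	Color Color
	Refs  []*Object
}

// shade makes a white object gray and queues it for scanning.
func shade(o *Object, gray *[]*Object) {
	if o != nil && o.Color == White {
		o.Color = Gray
		*gray = append(*gray, o)
	}
}

// yuasaWrite overwrites src.Refs[i] with dst, shading the OLD value
// first: the delete barrier protects the object being unlinked.
func yuasaWrite(src *Object, i int, dst *Object, gray *[]*Object) {
	shade(src.Refs[i], gray)
	src.Refs[i] = dst
}

// drainGray finishes marking: scan gray objects until none remain.
func drainGray(gray *[]*Object) {
	for len(*gray) > 0 {
		o := (*gray)[len(*gray)-1]
		*gray = (*gray)[:len(*gray)-1]
		for _, ref := range o.Refs {
			shade(ref, gray)
		}
		o.Color = Black
	}
}

func main() {
	// Gray A references white B; the mutator deletes A -> B mid-mark.
	b := &Object{Name: "B", Color: White}
	a := &Object{Name: "A", Color: Gray, Refs: []*Object{b}}
	gray := []*Object{a}

	yuasaWrite(a, 0, nil, &gray) // delete A -> B; barrier shades B
	drainGray(&gray)

	// B is floating garbage this cycle, but it was never at risk of
	// being freed while some black object might still reach it.
	fmt.Println("B marked:", b.Color == Black)
}
```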

(3) Hybrid write barrier

Up to and including v1.7, the Go runtime used the Dijkstra insert write barrier to guarantee the strong three-color invariant, but it did not enable the barrier on all garbage collection root objects.

An application may contain hundreds or thousands of Goroutines, and the garbage collection roots generally include global variables and all stack objects; enabling the write barrier on hundreds of Goroutine stacks would add huge overhead, so instead the runtime paused the program once the marking phase completed, marked all stack objects gray, and re-scanned them. In v1.8 the Go team combined the two write barriers above into a hybrid write barrier to eliminate that re-scan.

Go v1.8 combined the Dijkstra insert write barrier with the Yuasa delete write barrier into the hybrid write barrier shown below. It shades the object being overwritten and, while the current goroutine's stack has not yet been scanned, also shades the new object:

writePointer(slot, ptr):
    shade(*slot)
    if current stack is gray:
        shade(ptr)
    *slot = ptr

To remove stack re-scanning entirely, the hybrid write barrier is paired with another rule: during the marking phase, every newly created object is immediately marked black. This prevents newly allocated stack and heap objects from being reclaimed incorrectly, and since stack memory ends the marking phase entirely black, no stack re-scan is needed. In summary:

  • When GC starts, all objects on the stacks are scanned and marked black;
  • Any object newly created on a stack during GC is black;
  • A heap object whose reference is deleted is marked gray;
  • A heap object that gains a new reference is marked gray.
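The writePointer pseudo code above can be exercised with a tiny simulation. The Object type, the color side-state, and the hybridWrite helper are illustrative assumptions, not the runtime's real implementation:

```go
package main

import "fmt"

type Color int

const (
	White Color = iota
	Gray
	Black
)

// Object is a toy object with a single pointer slot.
type Object struct {
	Name  string
	Color Color
	Slot  *Object
}

// grayQueue collects shaded objects for later scanning.
var grayQueue []*Object

// shade turns a white object gray, as in the pseudo code's shade().
func shade(o *Object) {
	if o != nil && o.Color == White {
		o.Color = Gray
		grayQueue = append(grayQueue, o)
	}
}

// hybridWrite mirrors writePointer(slot, ptr): shade the overwritten
// object, and also shade the new one while the current stack is gray.
func hybridWrite(src *Object, ptr *Object, stackIsGray bool) {
	shade(src.Slot)
	if stackIsGray {
		shade(ptr)
	}
	src.Slot = ptr
}

func main() {
	old := &Object{Name: "old", Color: White}
	neu := &Object{Name: "new", Color: White}
	owner := &Object{Name: "owner", Color: Black, Slot: old}

	hybridWrite(owner, neu, true) // unscanned stack: both ends shaded
	fmt.Println(old.Color == Gray, neu.Color == Gray) // true true
}
```

Once the current stack has been scanned (stackIsGray is false), only the overwritten object is shaded, which is exactly the Yuasa half of the rule.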

Five, GC evolution process

v1.0 — a fully serial mark-and-sweep process that paused the entire program;

v1.1 — executed the mark and sweep phases in parallel on multi-core hosts;

v1.3 — based on the assumption that only pointer-typed values contain pointers at runtime, added precise scanning of stack memory, achieving truly precise garbage collection; converting unsafe.Pointer values to integer types became illegal, since it can cause serious problems such as dangling pointers;

v1.5 — implemented a concurrent garbage collector based on three-color mark-and-sweep:

  • significantly reduced garbage collection pauses from hundreds of milliseconds to under 10ms;
  • computed the appropriate time to start garbage collection and accelerated it through concurrency;

v1.6 — implemented a decentralized garbage collection coordinator:

  • based on an explicit state machine, any Goroutine can trigger a GC state transition;
  • replaced the free-list heap representation with a dense bitmap, reducing CPU usage during the sweep phase;

v1.7 — reduced garbage collection pauses to under 2ms via parallel stack shrinking;

v1.8 — reduced garbage collection pauses to under 0.5ms with the hybrid write barrier;

v1.9 — completely removed the stop-the-world stack re-scan;

v1.10 — updated the garbage collection pacer (Pacer), separating soft and hard heap-size goals;

v1.12 — simplified several phases of the garbage collector with a new mark-termination algorithm;

v1.13 — solved the problem of returning memory to the operating system for applications with transient memory spikes via the new Scavenger;

v1.14 — optimized memory allocation speed with a new page allocator;

v1.15 — improved compiler and runtime internals (CL 226367), allowing the compiler to use more x86 registers for the garbage collector's write-barrier calls;

v1.16 — the Go runtime now uses MADV_DONTNEED by default, releasing unused memory back to the OS more aggressively.

Mark and sweep algorithm prior to Go V1.3

Let's look at the classic mark-and-sweep algorithm that Golang mainly used before v1.3. The algorithm has two main phases:

  • Mark phase
  • Sweep phase
(1) Specific steps of the mark-and-sweep algorithm

The first step is to suspend the business logic of the program, classify the reachable and unreachable objects, and then mark them.

(1) The figure shows the reachability relationship between the program and the object. At present, the reachable objects of the program include five objects: object 1-2-3, object 4-7, etc.


The second step is to start marking. The program finds all its reachable objects and marks them. As shown below:

(2) Five objects including object 1-2-3 and object 4-7 are reachable and marked.


In the third step, once marking is complete, the unmarked objects are swept. The result is shown below.

(3) Objects 5 and 6 are unreachable and cleared by GC


The operation is very simple, but one thing needs extra attention:

While the mark-and-sweep algorithm runs, the program must be paused: this is STW (stop the world). During STW, the CPU executes no user code; it is devoted entirely to garbage collection. The impact is large, which makes STW the biggest problem of such collection schemes and the main target for optimization. So during marking and sweeping, the program simply stops all work and waits for collection to finish.

The fourth step ends the pause and lets the program continue running, and the whole cycle repeats until the process exits.

That is the mark and sweep collection algorithm.
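The four steps above can be condensed into a toy stop-the-world mark-and-sweep. The Obj and Heap types are illustrative only, and the object graph mirrors the figures (objects 1-2-3 and 4-7 reachable, 5 and 6 garbage):

```go
package main

import "fmt"

// Obj is a toy heap object with a mark bit and outgoing references.
type Obj struct {
	ID     int
	Marked bool
	Refs   []*Obj
}

// Heap holds every allocated object plus the root set.
type Heap struct {
	Objects []*Obj
	Roots   []*Obj
}

// mark recursively marks everything reachable from o.
func (h *Heap) mark(o *Obj) {
	if o == nil || o.Marked {
		return
	}
	o.Marked = true
	for _, r := range o.Refs {
		h.mark(r)
	}
}

// Collect performs one STW cycle: mark from the roots, sweep the rest.
func (h *Heap) Collect() (live, freed int) {
	for _, r := range h.Roots {
		h.mark(r)
	}
	kept := h.Objects[:0]
	for _, o := range h.Objects {
		if o.Marked {
			o.Marked = false // reset for the next cycle
			kept = append(kept, o)
			live++
		} else {
			freed++
		}
	}
	h.Objects = kept
	return live, freed
}

func main() {
	o7 := &Obj{ID: 7}
	o4 := &Obj{ID: 4, Refs: []*Obj{o7}}
	o3 := &Obj{ID: 3}
	o2 := &Obj{ID: 2, Refs: []*Obj{o3}}
	o1 := &Obj{ID: 1, Refs: []*Obj{o2}}
	o5 := &Obj{ID: 5}
	o6 := &Obj{ID: 6}

	h := &Heap{Objects: []*Obj{o1, o2, o3, o4, o5, o6, o7}, Roots: []*Obj{o1, o4}}
	live, freed := h.Collect()
	fmt.Println(live, freed) // 5 2
}
```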

(2) Disadvantages of mark and sweep

The mark-and-sweep algorithm is simple and its process is clear, but it has serious problems:

  • STW, stop the world; let the program pause, the program freezes ( important problem );
  • The mark needs to scan the entire heap;
  • Clearing data creates heap fragments.

Before Go v1.3, this is what was implemented: the basic GC flow was to start the STW pause, mark, sweep, and finally end the STW pause, as shown in the figure.

In the figure above, the entire GC is wrapped inside the STW window, so the program pauses for too long and its runtime performance suffers. Go v1.3 therefore made a simple optimization, moving a step out of STW to shrink the pause window, as follows.

The optimization moves the sweep phase outside of STW: sweeping does not require the program to be paused, because the swept objects are already unreachable, so there can be no conflicts with concurrent writes.

But no matter how it is optimized, Go v1.3 still faces the key problem: the mark-and-sweep algorithm pauses the entire program.

How does Go address this? Go v1.5 introduced three-color concurrent marking to optimize the problem.

Three-color concurrent marking in Go V1.5

Garbage collection in Golang mainly uses the three-color marking method. The GC runs concurrently with user goroutines, but it still requires a certain amount of STW (stop the world). The so-called three-color marking method uses three marking states to determine which objects can be swept.

Let's take a look at the specific process.

In the first step , the default color of each newly created object is marked as "white", as shown in the figure.

(1) The program is initially created, all marked as white, and all objects are put into the white collection


As shown above, the left side is the graph of memory objects reachable from the program's Root Set, and the mark table on the right records each object's current color.

(2) The form of expanding the root node set of the program


In the second step , every time GC recycling starts, it will traverse all objects from the root node, and put the traversed objects from the white collection into the "gray" collection as shown in the figure.

Traverse Root Set (non-recursive form, only traverse once) to get gray nodes

Note that this traversal is non-recursive: it walks only one layer of objects directly reachable from the program, as shown above.

The currently reachable objects are object 1 and object 4, so when the current round of traversal ends, object 1 and object 4 will be marked as gray, and these two objects will be added to the gray mark table.

The third step is to traverse the gray collection, put the object referenced by the gray object from the white collection into the gray collection, and then put the gray object into the black collection, as shown in the figure.

Traverse the Gray gray marking table, mark the reachable objects from white to gray, and mark the gray after traversal as black


This pass scans only the gray objects: the white objects reachable one layer away (object 2 and object 7) turn gray, while the previously gray object 1 and object 4 are marked black and moved from the gray mark table to the black mark table.

In the fourth step , repeat the third step until there is no object in the gray, as shown in the figure.

Repeat the previous step until there are no objects in the gray mark table


When all reachable objects have been traversed, the gray mark table is empty and everything in memory is either black or white. The black objects are the ones the program logic can reach and needs: legal, useful data that must not be deleted. The white objects are all unreachable; the program logic does not depend on them, so they are garbage and need to be swept.

Step 5: reclaim all objects in the white mark table, that is, collect the garbage, as shown in the figure.

Above, we delete and reclaim all the white objects.

Collect all white objects (garbage)

All that remains are the black objects that the program depends on.

That is the three-color concurrent marking method; the "three colors" that give it its name are clearly visible above.
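The five steps above can be sketched as a worklist algorithm: everything starts white, roots go gray, gray objects are scanned and blackened, and whatever is still white at the end is garbage. The types and names are illustrative only:

```go
package main

import "fmt"

// Obj is a toy object; colors live in a side table, like the mark tables above.
type Obj struct {
	ID   int
	Refs []*Obj
}

// triColorMark returns the black (live) and white (garbage) sets.
func triColorMark(all, roots []*Obj) (black, white []*Obj) {
	color := map[*Obj]int{} // 0 white, 1 gray, 2 black
	var gray []*Obj
	for _, r := range roots { // step 2: roots turn gray
		color[r] = 1
		gray = append(gray, r)
	}
	for len(gray) > 0 { // steps 3-4: drain the gray set
		o := gray[0]
		gray = gray[1:]
		for _, r := range o.Refs {
			if color[r] == 0 {
				color[r] = 1
				gray = append(gray, r)
			}
		}
		color[o] = 2
	}
	for _, o := range all { // step 5: whatever stayed white is garbage
		if color[o] == 2 {
			black = append(black, o)
		} else {
			white = append(white, o)
		}
	}
	return black, white
}

func main() {
	o3 := &Obj{ID: 3}
	o2 := &Obj{ID: 2, Refs: []*Obj{o3}}
	o7 := &Obj{ID: 7}
	o4 := &Obj{ID: 4, Refs: []*Obj{o7}}
	o1 := &Obj{ID: 1, Refs: []*Obj{o2}}
	o5 := &Obj{ID: 5}
	black, white := triColorMark([]*Obj{o1, o2, o3, o4, o5, o7}, []*Obj{o1, o4})
	fmt.Println(len(black), len(white)) // 5 1
}
```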

However, marking runs while many goroutines execute concurrently, and the memory they touch may have interdependencies. To keep the GC safe, one option is to add STW before three-color marking begins, scan until every object is settled as black or white, and only then release STW.

But the performance of such a fully stopped GC scan is clearly too low.

So how does Go solve the problem of stuttering (stw, stop the world) in the mark and sweep (mark and sweep) algorithm?

Three-color marking without STW

Let's float an idea: if removing STW would remove the performance problem, what happens if three-color marking runs without any STW at all?

The three-color concurrent marking described above depends on STW, because if the program is not paused, its logic can change object reference relationships during the marking phase and corrupt the marking result. Let's walk through a scenario: what happens if three-color marking does not use STW?

Assume the initial state has already gone through the first round of scanning: the black objects are object 1 and object 4, the gray objects are object 2 and object 7, the rest are white, and object 2 points to object 3 through pointer p, as shown in the figure.

(1) Object 2 that has been marked as gray has a pointer p pointing to object 3 that is white


Now, without STW, any object may be read or written during the GC scan. As shown in the figure, before object 2 has been scanned, object 4, which is already marked black, creates a pointer q pointing to the white object 3.

(2) Before object 2 has been scanned and object 4 has been marked as black, create a pointer q pointing to object 3


At the same time, the gray object 2 removes pointer p, so the white object 3 now hangs only under the already-scanned black object 4, as shown in the figure.

(3) At the same time, object 2 removes the pointer p, and object 3 is hung under the scanned black object 4


Marking then proceeds by the normal three-color logic: all gray objects are marked black, so object 2 and object 7 become black, as shown in the figure.

(4) The algorithm logic is executed normally, objects 2 and 3 are marked as black, and object 3, because object 4 is no longer scanned, is waiting to be recycled and cleared


Then the last step of the three-color marking is performed, and all white objects are recycled as garbage, as shown in the figure.

(5) Object 3, a normally referenced object, is innocently cleared


In the end, object 3, which was legitimately referenced by object 4, is "killed by mistake" and reclaimed by the GC.

As this shows, two situations must not occur in three-color marking:

  • Condition 1: a white object is referenced by a black object (white hung under black);
  • Condition 2: the reachability path from gray objects to that white object is destroyed (gray loses white).

If both conditions hold at the same time, the object is lost!

Moreover, in this scene, if white object 3 had many downstream objects, they would all be swept along with it.
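The lost-object scenario above can be replayed step by step: marking pauses midway, the mutator adds the black-to-white pointer q and deletes the gray-to-white pointer p, and marking then finishes with no barrier. The types and the side color table are illustrative sketches:

```go
package main

import "fmt"

// Obj is a toy object; colors live in a side table, like the mark tables above.
type Obj struct {
	ID   int
	Refs []*Obj
}

// drain finishes marking with NO write barrier: it only follows
// references the gray objects still hold.
func drain(gray []*Obj, color map[*Obj]string) {
	for len(gray) > 0 {
		o := gray[0]
		gray = gray[1:]
		for _, r := range o.Refs {
			if color[r] == "white" {
				color[r] = "gray"
				gray = append(gray, r)
			}
		}
		color[o] = "black"
	}
}

func main() {
	o3 := &Obj{ID: 3}
	o2 := &Obj{ID: 2, Refs: []*Obj{o3}} // p: 2 -> 3
	o7 := &Obj{ID: 7}
	o4 := &Obj{ID: 4}
	color := map[*Obj]string{o2: "gray", o7: "gray", o3: "white", o4: "black"}

	// The mutator interleaves with marking, with no barrier:
	o4.Refs = append(o4.Refs, o3) // q: black 4 -> white 3 (condition 1)
	o2.Refs = nil                 // delete p: gray 2 -> white 3 (condition 2)

	drain([]*Obj{o2, o7}, color)

	// Object 3 is reachable from black object 4, yet it stayed white,
	// so the sweep phase would wrongly reclaim it.
	fmt.Println("object 3 color:", color[o3]) // object 3 color: white
}
```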

The simplest way to prevent this is STW: forbid user programs from mutating object references at all. But STW obviously wastes resources and impacts every user program. So can we keep objects from being lost while improving GC efficiency and reducing STW time? Yes: we just need a mechanism that breaks one of the two necessary conditions above.

The barrier mechanism

To guarantee that no object is lost, we let the GC collector maintain one of the following two invariants.

They are the "strong three-color invariant" and the "weak three-color invariant".

(1) "Strong-weak" three-color invariant
  • Strong three-color invariant

There is no pointer from the black object to the white object.

It is mandatory not to allow black objects to refer to white objects


The strong three-color invariant simply forbids black objects from referencing white objects, so no white object can be deleted by mistake.

  • Weak three-color invariant

All white objects referenced by black objects are protected by gray objects.

A black object can refer to a white object, provided that white object has references to it from other gray objects, or a gray object somewhere on an upstream path to it

The weak three-color invariant emphasizes that a black object may reference a white object, but that white object must also be referenced by some gray object, or have a gray object somewhere on an upstream reference path to it.

Thus when a black object references a white object, the white object is nominally in danger of deletion, but the upstream gray reference protects it and keeps it safe.

To maintain these two invariants, GC algorithms evolved two barrier types: the "insert barrier" and the "delete barrier".

(2) Insert barrier

Specific operation: when an object A references an object B, B is marked gray. (Whenever B is attached downstream of A, B must be marked gray.)

Satisfies: the strong three-color invariant. (A black object can never reference a white object, because the white object is forced gray.)

The pseudo code is as follows:

func DijkstraWritePointer(slot *unsafe.Pointer, ptr unsafe.Pointer) {
     shade(ptr)  // first shade the new downstream object ptr gray
     *slot = ptr
}

// Explanation:
addDownstream(currentDownstream slot, newDownstream ptr) {
 //step 1
 markGray(ptr)   // shade the new downstream object

 //step 2
 slot = ptr
}

// Scenarios:
A.addDownstream(nil, B) // A had no downstream; B is newly attached and marked gray
A.addDownstream(C, B)   // A replaces downstream C with B; B is marked gray

This pseudo-code logic is the write barrier.

We know that a black object's memory slots can live in two places: the stack and the heap.

The stack is small but demands fast access, because function frames are pushed and popped constantly, so the "insert barrier" mechanism is not used for objects on the stack.

"Insertion barriers" are only used for operations on objects in the heap space.

Next, we use a few pictures to simulate the entire detailed process, hoping that you can see the overall process more clearly.

(1) The program is initially created, all marked as white, and all objects are put into the white collection



(2) Traverse Root Set (non-recursive form, only traverse once) to get gray nodes



(3) Traverse the Gray gray mark table, mark the reachable objects from white to gray, and mark the gray after traversal as black



(4) Due to concurrency, at this moment the outside world attaches object 8 downstream of object 4 and object 9 downstream of object 1. Object 4 is on the heap, so the insert barrier will fire; object 1 is on the stack, so it will not


(5) Because of the insert write barrier (a black object gaining a white downstream turns that white object gray), object 8 becomes gray, while object 9 remains white


(6) Continue to cycle the above process for three-color marking until there are no gray nodes



But because the stack is not barriered, after the three-color marking above completes there may still be white objects referenced from the stack (such as object 9 in the figure). To make sure they are not lost, an STW pause is initiated for one more marking scan of the stack.

The pause lasts until the three-color marking of the stack space finishes.

(7) Before preparing to recycle white, traverse and scan the stack space again.  At this time, add STW to suspend the protection stack to prevent external interference (new white is added by black)



(8) In STW, mark the objects in the stack with three colors until there are no gray nodes



(9) stop STW



Finally, all white nodes remaining after the stack and heap scans are swept. This STW pause typically lasts on the order of 10 to 100ms.

(10) clear white
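The flow in steps (1) through (10) compresses into a small sketch: the insert barrier fires only for heap writes, so a final STW-style pass re-scans the stack to rescue object 9. All types and helpers are illustrative assumptions:

```go
package main

import "fmt"

// Obj is a toy object; InHeap distinguishes heap slots from stack slots.
type Obj struct {
	ID     int
	InHeap bool
	Refs   []*Obj
}

// GC holds the color table (0 white, 1 gray, 2 black) and the gray queue.
type GC struct {
	color map[*Obj]int
	gray  []*Obj
}

func (g *GC) shade(o *Obj) {
	if o != nil && g.color[o] == 0 {
		g.color[o] = 1
		g.gray = append(g.gray, o)
	}
}

// writePointer attaches dst under src, shading dst only for heap writes.
func (g *GC) writePointer(src, dst *Obj) {
	if src.InHeap {
		g.shade(dst) // insert barrier: heap objects only
	}
	src.Refs = append(src.Refs, dst)
}

// drain scans gray objects until none remain.
func (g *GC) drain() {
	for len(g.gray) > 0 {
		o := g.gray[len(g.gray)-1]
		g.gray = g.gray[:len(g.gray)-1]
		for _, r := range o.Refs {
			g.shade(r)
		}
		g.color[o] = 2
	}
}

func main() {
	g := &GC{color: map[*Obj]int{}}
	stack1 := &Obj{ID: 1}              // stack object, treated as a root
	heap4 := &Obj{ID: 4, InHeap: true} // heap object
	g.shade(stack1)
	g.shade(heap4)
	g.drain() // both are black now

	obj8 := &Obj{ID: 8, InHeap: true}
	obj9 := &Obj{ID: 9, InHeap: true}
	g.writePointer(heap4, obj8)  // barrier fires: object 8 turns gray
	g.writePointer(stack1, obj9) // stack write: no barrier, 9 stays white
	g.drain()
	fmt.Println("8:", g.color[obj8], "9:", g.color[obj9]) // 8: 2 9: 0

	// STW-style re-scan of the stack before sweeping rescues object 9.
	for _, r := range stack1.Refs {
		g.shade(r)
	}
	g.drain()
	fmt.Println("9 after stack re-scan:", g.color[obj9]) // 9 after stack re-scan: 2
}
```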



(3) Delete barrier

Specific operation: when a reference to an object is deleted, if the object itself is gray or white, it is marked gray.

Satisfies: the weak three-color invariant. (It protects the path from gray objects to white objects from being broken.)

// Yuasa barrier (black mutator)
func YuasaWritePointer(slot *unsafe.Pointer, ptr unsafe.Pointer) {
    shade(*slot) // first shade the old object *slot gray
    *slot = ptr
}

// Explanation:
addDownstream(currentDownstream slot, newDownstream ptr) {
  //step 1
  if (slot is gray || slot is white) {
          markGray(slot)     // slot is the object being unlinked; mark it gray
  }
  //step 2
  slot = ptr
}

// Scenarios:
A.addDownstream(B, nil) // A drops its reference to B; B is marked gray (if it was white)
A.addDownstream(B, C)   // A replaces downstream B with C; B is marked gray (if it was white)

Next, we use a few pictures to simulate the entire detailed process, hoping that you can see the overall process more clearly.

(1) The program is initially created, all marked as white, and all objects are put into the white collection



(2) Traverse Root Set (non-recursive form, only traverse once) to get gray nodes



(3) Gray object 1 deletes object 5. If the delete write barrier is not triggered, the 5-2-3 path will be disconnected from the main link, and will be cleared in the end



(4) The delete write barrier is triggered, and the deleted object 5 itself is marked as gray



(5) Traverse the Gray gray mark table, mark the reachable objects from white to gray, and mark the gray after traversal as black



(6) Continue to cycle the above process for three-color marking until there are no gray nodes



(7) Clear white



The collection precision of this method is low: even an object whose last pointer is deleted still survives the current round and is only cleaned up in the next GC cycle.

Go V1.8's hybrid write barrier mechanism

Shortcomings of the insert and delete write barriers:

  • Insert write barrier: an STW stack re-scan is required at the end to settle the fate of white objects referenced from the stacks;
  • Delete write barrier: collection precision is low; when GC starts, STW scans the stacks to record an initial snapshot, which protects every object alive at that moment.

Go v1.8 introduced the hybrid write barrier (hybrid write barrier), which removes the stack re-scan process and greatly reduces STW time, combining the advantages of both barriers.

(1) Mixed write barrier rules

Specific operation:

1. At the start of GC, scan all objects on the stacks and mark them black (there is no second scan afterwards, and no STW is needed for it).

2. During GC, any new objects created on the stack are black.

3. Deleted objects are marked in gray.

4. Added objects are marked in gray.

Satisfies: a variant of the weak three-color invariant.

pseudocode:

addDownstream(currentDownstream slot, newDownstream ptr) {
  //1: whenever the current downstream object is unlinked, mark it gray
  markGray(slot)

  //2
  markGray(ptr)

  //3
  slot = ptr
}
writePointer(slot, ptr):
    shade(*slot)
    if current stack is gray:
        shade(ptr)
    *slot = ptr

Note again that the barrier is not applied on the stack, in order to preserve the stack's operating efficiency.
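Under these rules, a heap-side delete alone is enough to protect an object that was just attached to an unbarriered stack slot. A minimal sketch (hypothetical types and helpers, not runtime internals):

```go
package main

import "fmt"

type Color int

const (
	White Color = iota
	Gray
	Black
)

// Obj is a toy object with a single pointer slot.
type Obj struct {
	ID    int
	Color Color
	Slot  *Obj
}

var gray []*Obj

func shade(o *Obj) {
	if o != nil && o.Color == White {
		o.Color = Gray
		gray = append(gray, o)
	}
}

// heapWrite follows the hybrid rule for a heap slot: shade the old value
// before overwriting. (The shade-new-value half is skipped here because
// all stacks are already scanned black in this scenario.)
func heapWrite(src *Obj, ptr *Obj) {
	shade(src.Slot)
	src.Slot = ptr
}

// drain scans gray objects until none remain.
func drain() {
	for len(gray) > 0 {
		o := gray[len(gray)-1]
		gray = gray[:len(gray)-1]
		shade(o.Slot)
		o.Color = Black
	}
}

func main() {
	obj7 := &Obj{ID: 7, Color: White}
	heap4 := &Obj{ID: 4, Color: Black, Slot: obj7}
	stack1 := &Obj{ID: 1, Color: Black} // stacks are black once GC starts

	stack1.Slot = obj7    // stack write: no barrier at all
	heapWrite(heap4, nil) // heap delete: barrier shades the unlinked obj7

	drain()
	fmt.Println("object 7 survives:", obj7.Color == Black) // object 7 survives: true
}
```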

(2) Specific scenario analysis of mixed write barriers

Next, we use a few pictures to simulate the entire detailed process, hoping that you can see the overall process more clearly.

Note that the hybrid write barrier is a GC barrier mechanism, so it is only triggered while the program is running garbage collection.

GC starts: Scan the stack area and mark all reachable objects as black

(1) GC has just started, and the default is white



(2) Three-color marking first scans all the stack objects and marks every reachable object black


Scenario 1: an object is dereferenced by a heap object and becomes the downstream of a stack object

pseudocode:

// Precondition: heap object 4 -> object 7; object 7 is referenced by object 4
stack object 1 -> object 7 = heap object 7  // attach heap object 7 downstream of stack object 1
heap object 4 -> object 7 = null            // object 4 drops its reference to object 7

(1) Add object 7 to the downstream of object 1, because the stack does not start the write barrier, so it hangs directly below



(2) Object 4 deletes the reference relationship of object 7. Since object 4 is a heap area, a write barrier is triggered (deletion means that the new value is assigned to null), and the deleted object 7 is marked as gray



Scenario 2: The object is dereferenced by a stack object and becomes the downstream of another stack object

pseudocode:

new stack object 9
stack object 9 -> object 3 = object 3  // attach stack object 3 downstream of stack object 9
object 2 -> object 3 = null            // object 2 drops its reference to object 3

(1) Create a new object 9 on the stack (in the hybrid write barrier mode, any newly created object during the GC process is marked black)



(2) Object 9 adds downstream reference stack object 3 (directly added, the stack does not start the barrier, and there is no barrier effect)



(3) Object 2 deletes the reference relationship of object 3 (delete directly, the stack does not start the write barrier, and there is no barrier effect)



Scenario 3: An object's reference is deleted by one heap object and it becomes the downstream of another heap object

pseudocode:

heap object 10 -> object 7 = heap object 7   // hang heap object 7 under heap object 10
heap object 4 -> object 7 = null             // object 4 deletes its reference to object 7

(1) Heap object 10 has been scanned and marked as black (the case of black is special, other colors are not considered for now)



(2) The heap object 10 adds a downstream reference to the heap object 7, triggering the barrier mechanism, the added object is marked as gray, and the object 7 becomes gray (object 6 is protected)



(3) The heap object 4 deletes the downstream reference heap object 7, triggering the barrier mechanism, the deleted object is marked gray, and the object 7 is marked gray



Scenario 4: An object's reference is deleted by a stack object and it becomes the downstream of a heap object

pseudocode:

stack object 1 -> object 2 = null            // object 1 deletes its reference to object 2
heap object 4 -> object 2 = stack object 2   // object 4 adds stack object 2 downstream
heap object 4 -> object 7 = null             // object 4 deletes its reference to object 7

(1) Stack object 1 deletes the reference to stack object 2 (the stack space does not trigger the write barrier)



(2) Heap object 4 transfers the previously referenced relationship of object 7 to object 2 (object 4 deletes the reference relationship of object 7)



(3) When object 4 is deleted, a write barrier is triggered, and the deleted object 7 is marked as gray, protecting object 7 and downstream nodes



The hybrid write barrier in Golang satisfies the weak three-color invariant and combines the advantages of the deletion and insertion write barriers. At the start of a cycle it only needs to scan each goroutine's stack once, concurrently, marking everything reachable from it black and keeping it black. This scan does not require STW, and because a scanned stack stays black, no re-scan is needed when marking ends, which shortens the STW time.
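The barrier behavior walked through in these scenarios can be condensed into a few lines. Below is a runnable toy sketch (the `Object`, `shade`, and `writeBarrier` names are illustrative, not the runtime's actual implementation) showing how the hybrid barrier shades both the old pointee (the Yuasa deletion half) and the new pointee (the Dijkstra insertion half) on every heap pointer write:

```go
package main

import "fmt"

type Color int

const (
	White Color = iota
	Grey
	Black
)

type Object struct {
	name  string
	color Color
	refs  []*Object
}

// shade is the "mark grey" primitive used by both barrier halves:
// a white object touched by the barrier is pushed onto the grey queue.
func shade(obj *Object, greyQueue *[]*Object) {
	if obj != nil && obj.color == White {
		obj.color = Grey
		*greyQueue = append(*greyQueue, obj)
	}
}

// writeBarrier sketches the hybrid barrier: when slot i of src is
// overwritten with ptr, both the old value (deletion barrier) and the
// new value (insertion barrier) are shaded before the store happens.
func writeBarrier(src *Object, i int, ptr *Object, greyQueue *[]*Object) {
	shade(src.refs[i], greyQueue) // old pointee: Yuasa half
	shade(ptr, greyQueue)         // new pointee: Dijkstra half
	src.refs[i] = ptr
}

func main() {
	var grey []*Object
	old := &Object{name: "7", color: White}
	added := &Object{name: "9", color: White}
	heapObj := &Object{name: "4", color: Black, refs: []*Object{old}}

	writeBarrier(heapObj, 0, added, &grey)
	fmt.Println(old.color == Grey, added.color == Grey) // prints: true true
}
```

Both the removed and the newly installed pointee end up grey, so neither can be lost no matter which order the mutator and the marker run in.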

Summary of GC evolution process

The above is all the mark-clear logic and scene demonstration process of Golang's GC.

Go v1.3 — ordinary mark-and-sweep; the whole process runs under STW, which is extremely inefficient.

Go v1.5 — three-color marking; the write barrier is enabled for the heap but not the stack, so after the full scan the stacks must be re-scanned (requiring STW); efficiency is mediocre.

Go v1.8 — three-color marking with the hybrid write barrier; disabled on the stack, enabled on the heap. The whole process needs almost no STW, and efficiency is high.

Six, GC process source code analysis (based on the Go v1.16 source)

The Golang GC-related code is in the runtime/mgc.go file. You can see that the GC is divided into 4 stages:

1. Sweep termination

  • Pause the program, trigger STW. All P (processors) will enter safe-point (safe point);
  • Clean up spans that have not been cleaned up. If the current garbage collection is forcibly triggered, the memory management unit that has not been cleaned up needs to be processed;

2. Mark phase

  • Change the GC state gcphase from _GCoff to _GCmark, enable the write barrier, enable mutator assists, and enqueue the root objects;
  • Resume program execution; mark workers and assisting mutators begin concurrently marking objects in memory. The write barrier shades both overwritten pointers and new pointers (marking them grey), and all newly created objects are marked black immediately;
  • The GC performs root marking, which includes scanning all stacks, global objects, and runtime data structures not on the heap. Scanning a goroutine's stack briefly stops that goroutine, greys all pointers found on its stack, and then lets it resume;
  • The GC drains the grey object queue, turning each grey object black and greying the objects it points to;
  • Because GC work is distributed across local caches, the GC uses a distributed termination algorithm to detect when no root-marking jobs or grey objects remain; once none remain, it moves to mark termination.

3. Mark termination

  • STW;
  • Switch the GC state gcphase to _GCmarktermination and shut down the GC workers and assist programs;
  • Perform housekeeping, such as flushing mcaches.

4. Sweep phase

  • Switch the GC state gcphase to _GCoff to prepare for the sweep phase, initialize the sweep state, and disable the write barrier;
  • Resume the user program; from now on, all newly created objects are marked white, and spans are swept before use when necessary;
  • Background workers concurrently sweep all memory management units (spans).

GC process code example

package main

import (
	"os"
	"runtime"
	"runtime/trace"
)

func gcfinished() *int {
	p := 1
	runtime.SetFinalizer(&p, func(_ *int) {
		println("gc finished")
	})
	return &p
}

func allocate() {
	_ = make([]byte, int((1<<20)*0.25))
}

func main() {
	f, _ := os.Create("trace.out")
	defer f.Close()
	trace.Start(f)
	defer trace.Stop()
	gcfinished()
	// stop allocating once the GC has finished
	for n := 1; n < 50; n++ {
		println("#allocate: ", n)
		allocate()
	}
	println("terminate")
}

run the program

hewittwang@HEWITTWANG-MB0 rtx % GODEBUG=gctrace=1 go run new1.go  
gc 1 @0.015s 0%: 0.015+0.36+0.043 ms clock, 0.18+0.55/0.64/0.13+0.52 ms cpu, 4->4->0 MB, 5 MB goal, 12 P
gc 2 @0.024s 1%: 0.045+0.19+0.018 ms clock, 0.54+0.37/0.31/0.041+0.22 ms cpu, 4->4->0 MB, 5 MB goal, 12 P
....

gctrace output analysis

gc 2       : the 2nd GC cycle
@0.024s    : 0.024 s elapsed from program start to this GC
1%         : CPU utilization during this GC

wall clock
0.045+0.19+0.018 ms clock
0.045 ms   : STW, mark start, enable the write barrier
0.19 ms    : concurrent marking phase
0.018 ms   : STW, mark termination, disable the write barrier

CPU time
0.54+0.37/0.31/0.041+0.22 ms cpu
0.54 ms    : STW, mark start
0.37 ms    : assist marking time
0.31 ms    : background marking time
0.041 ms   : idle GC time
0.22 ms    : mark termination time

4->4->0 MB, 5 MB goal
4 MB       : actual heap size when marking starts
4 MB       : actual heap size when marking ends
0 MB       : live heap (objects marked live) when marking ends
5 MB       : predicted heap size when marking ends (the goal)

12 P       : number of processors (P) used during this GC
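The same counters can also be read programmatically instead of parsing gctrace output. A small sketch using the standard runtime.ReadMemStats API (the field names are real; the printed values vary from run to run, so none are shown):

```go
package main

import (
	"fmt"
	"runtime"
)

// gcStats forces one collection and returns the runtime's GC counters,
// the programmatic counterpart of a gctrace line.
func gcStats() runtime.MemStats {
	runtime.GC() // force a cycle so the counters are non-zero
	var s runtime.MemStats
	runtime.ReadMemStats(&s)
	return s
}

func main() {
	s := gcStats()
	fmt.Println("completed GC cycles:", s.NumGC)
	fmt.Println("total STW pause (ns):", s.PauseTotalNs)
	fmt.Println("next GC target heap (bytes):", s.NextGC)
}
```

NumGC corresponds to the leading "gc N" in the trace, and NextGC to the "goal" figure.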

Seven, GC trigger conditions

The runtime uses the runtime.gcTrigger.test method to decide whether garbage collection should be triggered. Once the basic conditions are met (GC is enabled, the program is not panicking, and no collection cycle is in progress, i.e. the phase is _GCoff), this method performs one of three different checks depending on the trigger kind:

// mgc.go: runtime.gcTrigger.test
func (t gcTrigger) test() bool {
	// check the basic conditions for triggering garbage collection
	if !memstats.enablegc || panicking != 0 || gcphase != _GCoff {
		return false
	}
	switch t.kind {
	case gcTriggerHeap: // heap allocation has reached the trigger size computed by the controller
		// Non-atomic access to gcController.heapLive for performance. If
		// we are going to trigger on this, this thread just
		// atomically wrote gcController.heapLive anyway and we'll see our
		// own write.
		return gcController.heapLive >= gcController.trigger
	case gcTriggerTime: // no GC for a while: a forced cycle, controlled by runtime.forcegcperiod (default 2 minutes)
		if gcController.gcPercent < 0 {
			return false
		}
		lastgc := int64(atomic.Load64(&memstats.last_gc_nanotime))
		return lastgc != 0 && t.now-lastgc > forcegcperiod
	case gcTriggerCycle: // start a new cycle if no collection is currently running
		// t.n > work.cycles, but accounting for wraparound.
		return int32(t.n-work.cycles) > 0
	}
	return true
}

The method used to start garbage collection is runtime.gcStart, so all places where this function is called are codes that trigger GC:

  • GC is triggered according to the heap size when runtime.mallocgc applies for memory
  • runtime.GC user program manually triggers GC
  • runtime.forcegchelper background running timing check triggers GC

(1) Memory allocation triggers GC: runtime.mallocgc

The Go runtime divides the objects on the heap into three types according to their size: micro objects, small objects, and large objects. The creation of these three types of objects may trigger a new GC.

1. When the current thread's memory management unit has no free space, creating tiny objects (noscan && size < maxTinySize) and small objects requires calling runtime.mcache.nextFree to obtain a new management unit from the central cache or the page heap; if the span is full, nextFree returns shouldhelpgc = true, which may trigger garbage collection;

2. When the user program allocates a large object (larger than 32KB), it always constructs a runtime.gcTrigger structure to try to trigger garbage collection.

func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
	// ... omitted ...
	shouldhelpgc := false
	dataSize := size
	c := getMCache() // try to get the mcache; returns nil if not started or there is no P
	// ... omitted ...
	if size <= maxSmallSize {
		if noscan && size < maxTinySize {
			// tiny object allocation
			// ... omitted ...
			v := nextFreeFast(span)
			if v == 0 {
				v, span, shouldhelpgc = c.nextFree(tinySpanClass)
			}
			// ... omitted ...
		} else {
			// small object allocation
			// ... omitted ...
			if v == 0 {
				v, span, shouldhelpgc = c.nextFree(spc)
			}
			// ... omitted ...
		}
	} else {
		shouldhelpgc = true
		// ... omitted ...
	}
	// ... omitted ...
	if shouldhelpgc {
		// should GC be triggered?
		if t := (gcTrigger{kind: gcTriggerHeap}); t.test() {
			// call gcStart() if the trigger condition is satisfied
			gcStart(t)
		}
	}
	// ... omitted ...
	return x
}

Here t.test() executes the gcTriggerHeap case, which only needs to check whether gcController.heapLive >= gcController.trigger. heapLive is the number of bytes of live objects, and trigger is the heap size at which marking should start; when the live bytes exceed the trigger size, a new round of garbage collection begins.

1. heapLive — to reduce lock contention, the runtime only updates it when the central cache allocates or releases a memory management unit, or when a large object is allocated on the heap;

2. trigger — runtime.gcSetTriggerRatio is called during mark termination to update the heap size that triggers the next garbage collection. It determines when collection is triggered and how much marking work is assigned to user programs and background workers; a feedback control algorithm decides the trigger point based on heap growth and GC CPU utilization.

(2) Manually trigger runtime.GC

During execution, the user program can actively ask the runtime to collect by calling the runtime.GC function. The caller blocks until the current garbage collection cycle completes; during collection the whole program may also be paused by STW:

func GC() {
	// before starting, wait via runtime.gcWaitOnMark for the previous
	// cycle's mark termination, marking, and sweep termination to finish
	n := atomic.Load(&work.cycles)
	gcWaitOnMark(n)

	// call runtime.gcStart to trigger a new round of garbage collection
	gcStart(gcTrigger{kind: gcTriggerCycle, n: n + 1})

	// runtime.gcWaitOnMark waits for this round's mark termination to end normally
	gcWaitOnMark(n + 1)

	// keep calling runtime.sweepone to sweep all pending memory management
	// units and wait for the sweeping work to complete, yielding the
	// processor via runtime.Gosched while waiting
	for atomic.Load(&work.cycles) == n+1 && sweepone() != ^uintptr(0) {
		sweep.nbgsweep++
		Gosched()
	}

	for atomic.Load(&work.cycles) == n+1 && !isSweepDone() {
		Gosched()
	}

	// once this round's sweeping is done, publish a snapshot of the heap
	// state via runtime.mProf_PostSweep so the memory profile reflects it
	mp := acquirem()
	cycle := atomic.Load(&work.cycles)
	if cycle == n+1 || (gcphase == _GCmark && cycle == n+2) {
		// only if no other mark termination has started
		mProf_PostSweep()
	}
	releasem(mp)
}

(3) Timing checks run in the background to trigger runtime.forcegchelper

When the application starts, the runtime launches a background goroutine that forcibly triggers garbage collection on a timer; it calls runtime.gcStart to try to start a new round:

// start forcegc helper goroutine
func init() {
	go forcegchelper()
}

func forcegchelper() {
	forcegc.g = getg()
	lockInit(&forcegc.lock, lockRankForcegc)
	for {
		lock(&forcegc.lock)
		if forcegc.idle != 0 {
			throw("forcegc: phase error")
		}
		atomic.Store(&forcegc.idle, 1)

		// the goroutine parks itself via goparkunlock and sleeps
		// until another goroutine wakes it up
		goparkunlock(&forcegc.lock, waitReasonForceGCIdle, traceEvGoBlock, 1)

		if debug.gctrace > 0 {
			println("GC forced")
		}
		// Time-triggered, fully concurrent.
		gcStart(gcTrigger{kind: gcTriggerTime, now: nanotime()})
	}
}

The above content is too complicated. If you don’t understand it, please refer to the supporting video of "Go Study Bible-Technical Free Circle Edition"

A big comparison between Java GC and Go GC

1. Garbage collection area PK

Among the Java runtime memory areas, the program counter, virtual machine stack, and native method stack are created and destroyed along with their thread; stack frames are pushed and popped in an orderly way as methods enter and exit, and how much memory each frame allocates is essentially known once the class structure is determined. The Java heap and the method area are different: the memory needed by different implementations of an interface may differ, as may the memory needed by different branches within a method, and only at runtime do we know which objects exist, so this memory is allocated and reclaimed dynamically.

Therefore, the Java heap and method area are the main areas managed by the Java garbage collector .

Go memory is divided into two parts: the heap and the stack. The program can actively apply for memory space from the heap during operation. The memory is allocated by the memory allocator and reclaimed by the garbage collector.

Memory in the Go stack area is automatically allocated and released by the compiler: function parameters and local variables live on the stack, created when the function is called and destroyed when it returns. If memory were only requested and allocated, it would eventually be exhausted; Go uses garbage collection to reclaim spans that are no longer in use and release them back to mheap. mheap merges adjacent free spans and adds the merged spans to its free ("scav") tree, from which it reallocates when memory is next requested.

Therefore, the Go heap is the main area managed by the Go garbage collector .

Go memory management


2. Timing to trigger garbage collection PK

Java GC runs in low-priority threads, so it is mostly performed while the application is idle, i.e. when no application thread is busy; when the application is busy, the GC thread generally does not run, except in the cases below.

When the Java heap runs short of memory, GC is invoked. However, since Java uses generational collection and ships several different collectors, the exact trigger timing varies from collector to collector; what follows is the general picture.

  1. Minor GC when there is insufficient space in the Eden area;
  2. Young GC when the age of the object increases to a certain level;
  3. When objects in the new generation are transferred to the old generation or created as large objects or large arrays, the space in the old generation will be insufficient, triggering the Old GC;
  4. System.gc() call triggers Full GC;
  5. Various collector-specific cases where heap or region occupancy exceeds a threshold.

Go will be triggered according to the following conditions:

  • runtime.mallocgc triggers GC according to the heap size when applying for memory;
  • runtime.GC user program manually triggers GC;
  • runtime.forcegchelper runs a regular check in the background to trigger GC.
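The first of these Go triggers, allocation reaching the heap-growth target, can be observed without ever calling runtime.GC. A sketch (the helper name allocUntilGC is made up for illustration; with the default GOGC=100, 64 MB of live allocations comfortably passes the first trigger):

```go
package main

import (
	"fmt"
	"runtime"
)

// allocUntilGC allocates total bytes in 1 MB chunks, keeps them live, and
// reports how many collections the allocator itself started; no explicit
// runtime.GC call is made anywhere.
func allocUntilGC(total int) uint32 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	sink := make([][]byte, 0, total>>20)
	for i := 0; i < total>>20; i++ {
		sink = append(sink, make([]byte, 1<<20))
	}

	runtime.ReadMemStats(&after)
	runtime.KeepAlive(sink) // keep the chunks live through the measurement
	return after.NumGC - before.NumGC
}

func main() {
	fmt.Println("GCs triggered by allocation:", allocUntilGC(64<<20))
}
```

Because every chunk stays live, the heap keeps doubling past each successive trigger point, so the allocation path (runtime.mallocgc) starts several collections on its own.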

3. Collection Algorithm PK

Current Java virtual machines use a generational collection algorithm, dividing memory into regions according to object lifetimes. In the young generation, where most objects die at every collection, the "mark-copy" algorithm can be used, so each collection only pays the cost of copying a small number of survivors. Objects in the old generation have a high survival probability and no extra space to guarantee copying, so "mark-sweep" or "mark-compact" must be used there.

Currently, Go's garbage collection is based on the mark-and-sweep algorithm .

4. Garbage debris treatment PK

Java's memory layout makes garbage objects easy to generate, and the JVM has continually improved its GC algorithms over the years. To handle fragmentation, the JVM leans on space compaction and generational collection: the young generation uses the "mark-copy" algorithm, the G1 collector supports moving objects to curb fragmentation in long-running programs, and its region-based design makes it easier to return free memory to the OS.

Due to the implementation of Go's memory management, it is difficult to implement generation, and moving objects may also lead to a larger and more complex runtime. Therefore, Go's solution to memory fragmentation is not the same as that of Java.

1. The design of the Go language span memory pool alleviates many problems of memory fragmentation .

The process of Go memory release is as follows: when there are many free spans in mcache, they will be returned to mcentral; and when there are many free spans in mcentral, they will be returned to mheap; mheap will be returned to the operating system. This design has the following advantages:

  • Most of the memory allocation is done in the user mode, and there is no need to enter the kernel mode frequently.
  • Each P has an independent span cache, and multiple CPUs will not read and write the same memory concurrently, thereby reducing the dirty cacheline of the CPU L1 cache and increasing the CPU cache hit rate.
  • For the problem of memory fragmentation, Go manages it in the user mode by itself, and there is no fragmentation at the OS level, which reduces the pressure on the management of fragmentation at the operating system level.
  • The existence of mcache makes memory allocation without locking.

2. The tcmalloc-style allocation mechanism, with its optimizations for tiny-object and large-object allocation, also largely avoids memory fragmentation.

For example, under the regular scheme, spans with sizeclass=1 serve objects <= 8B, so common tiny values such as int32, byte, bool, and short strings would each occupy an 8B slot with most of the space unused. Since these types are used very frequently, that produces a lot of internal fragmentation.

Therefore, Go avoids sizeclass=1 spans for such values and treats objects < 16B as tiny objects. On allocation, a 16B slot is obtained from a sizeclass=2 span; if the stored object is smaller than 16B, the remaining space is kept (in the mcache.tiny field) and reused by subsequent tiny allocations until the slot is used up.

Taking 1B, 2B, and 8B objects as an example, space utilization this way is (1+2+8)/16 × 100% = 68.75%, whereas under the original scheme it would be (1+2+8)/(8×3) = 45.83%. A comment in the source code notes that this special handling of tiny objects saves about 20% of memory on average. If the data to be stored contains a pointer, it is not treated as a tiny object even if it is <= 8B, and a sizeclass=1 span is used normally.

In Go, the largest sizeclass can only store 32K objects. If you apply for more than 32K memory at one time, the system will directly bypass mcache and mcentral, and obtain it directly from mheap. There is a freelarge field in mheap to manage super large spans.

3. Go objects (that is, struct types) can be allocated on the stack.

Go performs static escape analysis at compile time. If it finds that an object does not escape its current scope, it allocates the object on the stack instead of the heap, thereby reducing GC pressure.

For example, the following code:

func F() {
	// temp is only a temporary variable inside the function and is not
	// returned, so the compiler allocates it on the stack
	temp := make([]int, 0, 20)
	temp = append(temp, 1)
}

func main() {
	F()
}

func main() {
    
    
  F()
}

Run the code as follows, the result shows that the temp variable is allocated on the stack and not on the heap:

hewittwang@HEWITTWANG-MB0 rtx % go build -gcflags=-m
# hello
./new1.go:4:6: can inline F
./new1.go:9:6: can inline main
./new1.go:10:3: inlining call to F
./new1.go:5:14: make([]int, 0, 20) does not escape
./new1.go:10:3: make([]int, 0, 20) does not escape

When we change the above code to:

package main
import "fmt"

func F() {
    
    
  temp := make([]int, 0, 20)
  fmt.Print(temp)
}

func main() {
    
    
  F()
}

Running it now shows that the temp variable is allocated on the heap: because temp is passed to fmt.Print, the compiler assumes the variable may be used later, so it escapes to the heap. Heap allocations must eventually be garbage-collected, and if this happens too often (continually triggering collections), it causes excessive GC pressure and hurts program performance.

hewittwang@HEWITTWANG-MB0 rtx % go build -gcflags=-m
# hello
./new1.go:9:11: inlining call to fmt.Print
./new1.go:12:6: can inline main
./new1.go:8:14: make([]int, 0, 20) escapes to heap
./new1.go:9:11: temp escapes to heap
./new1.go:9:11: []interface {}{...} does not escape
<autogenerated>:1: .this does not escape

5. "GC Roots" object selection PK

In Java, due to the division of the memory runtime area, the following objects are usually selected as "GC Roots":

  • Objects referenced in the virtual machine stack (local variable table in the stack frame);
  • Objects referenced in the native method stack (Native method);
  • Objects referenced by class static properties in the method area;
  • Objects referenced by constants in the method area;
  • Java virtual machine internal reference;
  • All objects held by synchronization locks.

In Java, however, unreachable objects have a chance to escape collection: even objects judged unreachable by reachability analysis are not necessarily "dead". And since the method area holds runtime constant pools and class metadata, obsolete constants and unused classes there must also be cleaned up.

Go's choice is simpler: the GC roots are the global variables and the pointers on each goroutine's stack (G stack). Because Go has no class encapsulation, root selection is relatively straightforward.

6. Write barrier PK

To solve the problem of objects being lost during concurrent three-color reachability analysis, there are two solutions: the Dijkstra insertion write barrier and the Yuasa deletion write barrier.

In Java, both approaches are used: for example, CMS performs concurrent marking based on the Dijkstra-style insertion write barrier, while G1 and Shenandoah use the Yuasa-style deletion write barrier.

Before Go 1.8 (i.e. through v1.7), the runtime used the Dijkstra insertion write barrier to guarantee strong three-color invariance. Go v1.8 combined the Dijkstra insertion barrier and the Yuasa deletion barrier into the hybrid write barrier, which unites the characteristics of both and achieves concurrent, stable GC as follows:

1. Scan and mark all objects on the stack as black.

2. During GC, any new objects created on the stack are black.

3. Deleted objects are marked in gray.

4. Added objects are marked in grey.

To preserve stack performance, the hybrid write barrier is applied only to the heap: the stack never triggers a write barrier, only the heap does. Since the reachable stack objects are all marked black at the start, no second stack scan under STW is needed. In essence, the hybrid barrier combines the properties of the insertion and deletion barriers, eliminating the insertion barrier's re-scan, while treating the heap and the stack differently so that stack performance is untouched.

7. Summary of the comparison between Java and Golang in history

| Compared | Java | Go |
| --- | --- | --- |
| GC area | Java heap and method area | Go heap |
| GC trigger timing | Many trigger points due to generational collection | Memory allocation, manual trigger, periodic trigger |
| Garbage collection algorithm | Generational: "mark-copy" in the young generation; "mark-sweep" or "mark-compact" in the old generation | Mark-and-sweep |
| Garbage type | Dead objects (which may escape), obsolete constants, and unused classes | Objects unreachable from global variables and G-stack pointers |
| Marking phase | Three-color reachability analysis (insertion / deletion write barriers) | Three-color reachability analysis (hybrid write barrier) |
| Space compaction | Yes | No |
| Memory allocation | Pointer bumping / free lists | Span memory pool |
| Fragmentation solutions | Generational GC, object moving, region division | Span memory pool, tcmalloc-style allocation, stack allocation, object pools |

From the garbage collection perspective, after generations of development Java's GC mechanism is relatively mature. Java separates objects into a young and an old generation: objects are usually allocated in the young generation, and those that survive several collections are moved to the old generation. Because survival rates in the young generation are low and fragmentation is likely, "mark-copy" is usually chosen there; the old generation's high survival rate favors "mark-sweep" or "mark-compact", which compacts the space.

Go uses a non-generational, concurrent garbage collector based on three-color mark-and-sweep. Its advantages only show in combination with the tcmalloc-style allocation strategy: small and tiny objects are allocated from their own memory pools and all fragments can be reused, so the GC does not need to worry about space fragmentation.

The above content is too complicated. If you don’t understand it, please refer to the supporting video of "Go Study Bible-Technical Free Circle Edition"

GC related interview questions

1. Chat: Common Garbage Collection Algorithms

  • Reference counting: each object maintains a reference count; when the object becomes referenced (created or assigned to another object) the count is incremented, when a referencing object is destroyed the count is decremented, and when the count reaches 0 the object is reclaimed.
    • Advantages: objects are reclaimed promptly, with no need to wait for memory exhaustion or a threshold.
    • Disadvantages: cannot handle circular references well.
  • Mark-sweep: traverse all referenced objects starting from the root variables, mark the reachable ones as "referenced", and reclaim those left unmarked.
    • Advantages: Solve the disadvantages of reference counting.
    • Disadvantages: STW (stop the world) is required to temporarily stop the program from running.
  • Generational collection: Divide different generation spaces according to the length of object life cycle. Objects with long life cycles are placed in the old generation, and objects with short life cycles are placed in the new generation. Different generations have different recycling algorithms and recycling frequencies.
    • Advantages: good recycling performance
    • Disadvantages: complex algorithm
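The circular-reference weakness of reference counting is easy to demonstrate with a toy counter (purely illustrative; Go's runtime does not use reference counting):

```go
package main

import "fmt"

// node is a toy reference-counted object.
type node struct {
	refs int
	next *node
}

func retain(n *node)  { n.refs++ }
func release(n *node) { n.refs-- } // a real collector would free at 0

func main() {
	a := &node{}
	b := &node{}
	retain(a) // a is held by a root: refs(a)=1
	retain(b) // b is referenced by a below: refs(b)=1
	a.next = b
	retain(a) // b points back at a, completing the cycle: refs(a)=2
	b.next = a

	release(a) // the root drops a
	// Neither count reaches 0, so a reference-counting collector
	// would leak both objects: the circular-reference weakness.
	fmt.Println(a.refs, b.refs) // prints: 1 1
}
```

A tracing collector has no such problem: neither a nor b is reachable from a root, so both are reclaimed.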

2. Chat: the three-color marking method

  1. Initially all objects are white.
  2. Traverse all objects from the root node, and turn the traversed objects into gray objects
  3. Traverse the gray object, turn the object referenced by the gray object into a gray object, and then turn the traversed gray object into a black object.
  4. Repeat step 3 until all gray objects turn black.
  5. Detect changes in the object through the write barrier (write-barrier), repeat the above operation
  6. Recycle all white objects (garbage).
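Steps 1–4 above can be simulated directly. A minimal sketch of the tri-color scan (illustrative types; there is no write barrier here, so this corresponds to a fully stopped world):

```go
package main

import "fmt"

const (
	white = iota // not yet visited: garbage if still white at the end
	grey         // visited, children not yet scanned
	black        // visited, children scanned
)

type obj struct {
	name  string
	color int
	refs  []*obj
}

// mark runs the tri-color steps: roots go grey, then grey objects are
// scanned (children greyed, the object itself blackened) until the grey
// set is empty; anything still white is unreachable.
func mark(roots []*obj) {
	var greys []*obj
	for _, r := range roots {
		r.color = grey
		greys = append(greys, r)
	}
	for len(greys) > 0 {
		o := greys[0]
		greys = greys[1:]
		for _, c := range o.refs {
			if c.color == white {
				c.color = grey
				greys = append(greys, c)
			}
		}
		o.color = black
	}
}

func main() {
	c := &obj{name: "c"}
	b := &obj{name: "b", refs: []*obj{c}}
	a := &obj{name: "a", refs: []*obj{b}}
	orphan := &obj{name: "orphan"}

	mark([]*obj{a})
	for _, o := range []*obj{a, b, c, orphan} {
		fmt.Println(o.name, "black:", o.color == black)
	}
}
```

Everything reachable from the root `a` ends black; `orphan` stays white and would be swept.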

3. Chat: What is the root object?

The root objects, also called the root set in garbage collection terminology, are the first objects the garbage collector inspects during marking, including:

  1. Global variables: those variables that exist in the entire life cycle of the program that the program can determine at compile time.
  2. Execution stack: Each goroutine contains its own execution stack, which contains variables on the stack and pointers to allocated heap memory blocks.
  3. Registers: a register value may represent a pointer, and pointers involved in computation may point to blocks of allocated heap memory.

4. Chat: GO's STW (Stop The World)

  • To prevent the reference relationships between objects from changing during GC in a way that corrupts the result (for example, a reference added mid-GC whose target gets collected because the new reference was never scanned), all running goroutines are stopped.
  • STW has some impact on performance, and Golang can currently achieve STW below 1ms.

5. Chat: Write Barrier

  • To keep references modified during GC from corrupting the GC result, we would need STW. But STW hurts program performance, so write-barrier techniques are used to shorten the STW time as much as possible.
    A reachable object is lost only when two conditions hold at the same time:
    Condition 1: a black node A adds a reference to a white node C.
    Condition 2: apart from A, white node C is reachable from no gray node, or such a path existed but was deleted during GC.
    Both must hold simultaneously: condition 1 means node A has already been scanned, so the new reference from A to C will never be scanned; condition 2 means no gray node still leads to C, so C is ignored when the scan ends.

Write barrier breaks one of two conditions

  • Breaking condition 1: Dijkstra write barrier

Satisfies strong tricolor invariance: black nodes must never reference white nodes. When a black node adds a reference to a white node, the white node is turned gray.

  • Breaking condition 2: Yuasa write barrier

Satisfies weak tricolor invariance: a black node may reference a white node, provided the white node is still reachable (directly or indirectly) from some gray node, which guarantees it will not be missed. When a reference to a white node is deleted, the collector pessimistically assumes a black node may later add a new reference to it, so the node is turned gray.

Memory newly allocated during GC is marked immediately, using the write barrier technique; that is, memory allocated during a GC cycle is not reclaimed in that cycle.

6. Chat: GC trigger timing

GC is triggered when the amount of allocated memory reaches a threshold

Every memory allocation checks whether the current allocation has reached the threshold; if it has, GC starts immediately.

Threshold = memory allocated at the last GC × memory growth rate

The memory growth rate is controlled by the environment variable GOGC and defaults to 100, i.e., GC starts each time the heap memory doubles.
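The threshold rule above can be sketched as a simple function (a simplification of the real pacer, which also accounts for stacks and other factors; `nextGCThreshold` is an illustrative name, not a runtime API):

```go
package main

import "fmt"

// nextGCThreshold models the heap-growth trigger described above:
// the next GC fires once the heap grows by gogc percent over the
// heap left live after the previous GC.
func nextGCThreshold(lastLiveHeap, gogc uint64) uint64 {
	return lastLiveHeap + lastLiveHeap*gogc/100
}

func main() {
	// With the default GOGC=100, a 4 MB live heap means the next GC
	// is triggered when the heap reaches about 8 MB.
	fmt.Printf("next GC at ~%d MB\n", nextGCThreshold(4<<20, 100)>>20)
}
```

At runtime the same knob can be adjusted in code with `debug.SetGCPercent` from the `runtime/debug` package.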

Trigger GC periodically

By default, a GC is forced at most every 2 minutes; this interval is declared by the runtime.forcegcperiod variable.

Manual trigger:

GC can be triggered by calling runtime.GC() in program code; this is mainly used for GC performance testing and statistics.

7. What is GC and what does it do?

GC, short for Garbage Collection, is an automatic memory management mechanism.

When memory the program requested from the operating system is no longer needed, the garbage collector proactively reclaims it and reuses it for other code's allocations, or returns it to the operating system. This automatic reclamation of memory-level resources is garbage collection, and the program component responsible for it is the garbage collector.

Garbage collection is actually a perfect example of "Simplicity is Complicated". On one hand, programmers benefit from GC: they no longer need to worry about, or manually perform, memory allocation and release; GC automatically reclaims leftover memory while the program runs. On the other hand, GC is almost invisible to programmers; it only surfaces when the program needs special tuning, by providing adjustable APIs to control GC timing and overhead.

Typically, the execution of the garbage collector is divided into two semi-independent components:

  • Mutator : essentially user-space code. From the garbage collector's point of view, user code merely modifies the reference relationships between objects, i.e., operates on the object graph (a directed graph of references between objects).
  • Collector : The code responsible for performing garbage collection.

8. What are the common GC implementations? What does the GC of the Go language use?

All GC algorithms can be seen as combinations of two basic forms: tracing (Tracing) and reference counting (Reference Counting).

  • Tracing GC
    starts from the root objects and walks the references between objects step by step until the entire heap has been scanned and the objects to keep are determined, then reclaims all reclaimable objects. Go, Java, and V8's JavaScript implementation are all tracing GCs.
  • Reference-counting GC
    Each object carries a counter of the references to it and is automatically reclaimed when the counter drops to zero. Because this approach has many defects, it is usually not used where high performance is the goal. Python, Objective-C, and others use reference-counting GC.

At present, the more common GC implementation methods include:

  • Tracing , which comes in several variants, such as:
    • Mark-sweep : starting from the root objects, mark the objects determined to be alive, then sweep the objects that can be reclaimed.
    • Mark-compact : to solve memory fragmentation, compact the live objects into a contiguous region of memory, as far as possible, during marking.
    • Incremental : marking and sweeping are executed in batches, a small piece at a time, advancing garbage collection incrementally toward near-real-time, nearly pause-free collection.
    • Incremental compacting : on top of incremental collection, also compact the objects.
    • Generational : classify objects by survival time. Objects younger than a threshold form the young generation, older ones the old generation, and objects that never participate in collection the permanent generation. Collection relies on the generational hypothesis (an object that has not lived long tends to be reclaimed; an object that has already lived long tends to live longer).
  • Reference counting : Recycle according to the reference count of the object itself, and recycle immediately when the reference count reaches zero.

A detailed introduction to the various approaches and their implementations is beyond this article. For Go: Go's GC currently uses a tri-color mark-sweep algorithm that is non-generational (objects are not split into generations), non-compacting (objects are not moved or compacted during collection), and concurrent (it runs concurrently with user code). The reasons [1] are:

  1. Object compaction solves memory fragmentation and "allows" the use of a sequential memory allocator. But the Go runtime's allocation algorithm is based on tcmalloc, so there is essentially no fragmentation problem, and sequential allocators are unsuited to multi-threaded scenarios. Since Go already uses a modern tcmalloc-based allocation algorithm, compacting objects brings no substantial performance gain.
  2. Generational GC relies on the generational hypothesis: the GC focuses its effort on newly created objects (short-lived and more likely to be reclaimable) rather than frequently checking all objects. But the Go compiler's escape analysis keeps most new objects on the stack (which is reclaimed directly), and only objects that must live longer are allocated on the garbage-collected heap. That is, the short-lived objects a generational GC would reclaim are, in Go, allocated directly on the stack; when the goroutine dies, its stack is reclaimed without GC involvement, so the generational hypothesis brings no immediate advantage. Moreover, Go's garbage collector runs concurrently with user code, making the STW time independent of object generation and object size. The Go team focuses on making GC run better concurrently with user code (spending an appropriate amount of CPU on garbage collection) rather than on the single goal of reducing pause time.
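The escape-analysis point can be made concrete with a small sketch (the function names are illustrative; what actually escapes is decided by the compiler and can be inspected, not assumed):

```go
package main

import "fmt"

// sumOnStack keeps its buffer on the stack: the value does not
// outlive the call, so no GC work is needed to reclaim it.
func sumOnStack() int {
	buf := [4]int{1, 2, 3, 4}
	s := 0
	for _, v := range buf {
		s += v
	}
	return s
}

// escapesToHeap returns a pointer that outlives the call, so the
// buffer escapes to the heap and must be reclaimed by the GC.
func escapesToHeap() *[4]int {
	buf := [4]int{1, 2, 3, 4}
	return &buf
}

func main() {
	fmt.Println(sumOnStack(), escapesToHeap()[0])
}
```

Building with `go build -gcflags=-m` prints the compiler's escape decisions, e.g. a `moved to heap: buf` diagnostic for the second function.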

9. Talk carefully: What is the three-color marking method?

The key to understanding tri-color marking is to understand the tri-color abstraction over objects and the concept of the advancing wavefront . The tri-color abstraction is merely a way of describing a tracing collector; it has no physical counterpart in practice. Its important role is to let us logically reason about the correctness of mark-sweep garbage collection. In other words, when we talk about tri-color marking, we usually mean mark-sweep garbage collection.

From the garbage collector's point of view, the three-color abstraction specifies three different types of objects, matched with different colors:

  • White objects (possibly dead): Objects that have not been accessed by the collector. At the beginning of the collection, all objects are white, and when the collection is over, the white objects are unreachable.
  • Gray object (wave front): Objects that have been accessed by the collector, but the collector needs to scan one or more pointers in it, because they may still point to white objects.
  • Black object (determined to be alive): An object that has been accessed by the collector, and all fields in it have been scanned, and it is impossible for any pointer in the black object to directly point to the white object.

The collection process defined by the tri-color invariants is really a process of continuously advancing the wavefront ; the wavefront is the boundary between black objects and white objects, and the gray objects are the wavefront.

When garbage collection begins, only white objects exist. As marking starts, gray objects appear (are colored) and the wavefront expands. When all of an object's children have been scanned, the object is colored black. Once the whole heap has been traversed, only black and white objects remain: the black objects are reachable, i.e., alive, and the white objects are unreachable, i.e., dead. The process can be viewed as the gray wavefront separating black objects from white objects and being pushed forward until every reachable gray object has become black. As shown below:

The whole picture of the three-color marking method


The figure shows the relationship between the root objects, reachable and unreachable objects, black, gray, and white objects, and the wavefront.

10. Talk carefully: What does STW mean?

STW is an abbreviation of Stop the World (or, correspondingly, Start the World). It usually refers to the period between the moment Stop the World happens and the moment Start the World happens, during which everything is at rest. To guarantee the correctness of the implementation and to prevent problems such as unbounded memory growth during garbage collection, STW inevitably requires stopping the mutator from further operating on the object graph.

During this period, user code is entirely stopped or slowed down. The longer the STW, the greater the impact (such as latency) on user code. Early Go garbage collectors had STW pauses lasting hundreds of milliseconds, which is devastating for time-sensitive applications such as real-time communication. Let's look at an example:

package main

import (
	"runtime"
	"time"
)

func main() {
	go func() {
		for {
		}
	}()

	time.Sleep(time.Millisecond)
	runtime.GC()
	println("OK")
}

The above program will never print OK before Go 1.14. The culprit is that entering STW is delayed indefinitely.

Although STW has since been optimized to below the half-millisecond level, this program still hangs because it cannot enter STW: when GC needs to enter STW, it must notify and stop all user-mode code, but the goroutine running for {} can never be interrupted, so the STW phase can never begin. The same holds in practice: when some goroutine of the program cannot be stopped for a long time and entry into STW is forcibly delayed, the resulting impact (stalls) is dreadful. Fortunately, since Go 1.14 such goroutines can be preempted asynchronously, so entering STW takes no longer than the period of the preemption signal, and the program no longer stalls before entering STW merely because it is waiting for one goroutine to stop.

11. Chat: How to observe Go GC?

Let's take the following program as an example: first we introduce four different ways to observe GC, and then, in the questions that follow, discuss how to optimize GC through several detailed examples.

package main

func allocate() {
	_ = make([]byte, 1<<20)
}

func main() {
	for n := 1; n < 100000; n++ {
		allocate()
	}
}

Method 1: GODEBUG=gctrace=1

We can first build and run the program with GC tracing enabled:

$ go build -o main
$ GODEBUG=gctrace=1 ./main

gc 1 @0.000s 2%: 0.009+0.23+0.004 ms clock, 0.11+0.083/0.019/0.14+0.049 ms cpu, 4->6->2 MB, 5 MB goal, 12 P
scvg: 8 KB released
scvg: inuse: 3, idle: 60, sys: 63, released: 57, consumed: 6 (MB)
gc 2 @0.001s 2%: 0.018+1.1+0.029 ms clock, 0.22+0.047/0.074/0.048+0.34 ms cpu, 4->7->3 MB, 5 MB goal, 12 P
scvg: inuse: 3, idle: 60, sys: 63, released: 56, consumed: 7 (MB)
gc 3 @0.003s 2%: 0.018+0.59+0.011 ms clock, 0.22+0.073/0.008/0.042+0.13 ms cpu, 5->6->1 MB, 6 MB goal, 12 P
scvg: 8 KB released
scvg: inuse: 2, idle: 61, sys: 63, released: 56, consumed: 7 (MB)
gc 4 @0.003s 4%: 0.019+0.70+0.054 ms clock, 0.23+0.051/0.047/0.085+0.65 ms cpu, 4->6->2 MB, 5 MB goal, 12 P
scvg: 8 KB released
scvg: inuse: 3, idle: 60, sys: 63, released: 56, consumed: 7 (MB)
scvg: 8 KB released
scvg: inuse: 4, idle: 59, sys: 63, released: 56, consumed: 7 (MB)
gc 5 @0.004s 12%: 0.021+0.26+0.49 ms clock, 0.26+0.046/0.037/0.11+5.8 ms cpu, 4->7->3 MB, 5 MB goal, 12 P
scvg: inuse: 5, idle: 58, sys: 63, released: 56, consumed: 7 (MB)
gc 6 @0.005s 12%: 0.020+0.17+0.004 ms clock, 0.25+0.080/0.070/0.053+0.051 ms cpu, 5->6->1 MB, 6 MB goal, 12 P
scvg: 8 KB released
scvg: inuse: 1, idle: 62, sys: 63, released: 56, consumed: 7 (MB)

Two different types of information can be observed in this log:

gc 1 @0.000s 2%: 0.009+0.23+0.004 ms clock, 0.11+0.083/0.019/0.14+0.049 ms cpu, 4->6->2 MB, 5 MB goal, 12 P
gc 2 @0.001s 2%: 0.018+1.1+0.029 ms clock, 0.22+0.047/0.074/0.048+0.34 ms cpu, 4->7->3 MB, 5 MB goal, 12 P
...

as well as:

scvg: 8 KB released
scvg: inuse: 3, idle: 60, sys: 63, released: 57, consumed: 6 (MB)
scvg: inuse: 3, idle: 60, sys: 63, released: 56, consumed: 7 (MB)
...

For garbage collections triggered at runtime by user code allocating memory:

gc 2 @0.001s 2%: 0.018+1.1+0.029 ms clock, 0.22+0.047/0.074/0.048+0.34 ms cpu, 4->7->3 MB, 5 MB goal, 12 P

The meanings are shown in the table below:

field meaning
gc 2 second GC cycle
0.001 0.001 seconds after program start
2% percentage of CPU time spent in GC since the program started
0.018 The time spent in STW when marking starts (wall clock)
1.1 During the marking process, the time spent on concurrent marking (wall clock)
0.029 The time spent in STW (wall clock) when the mark is terminated
0.22 The time spent in STW when marking starts (cpu time)
0.047 During the marking process, the time spent on marking assistance (cpu time)
0.074 During the marking process, the time spent on concurrent marking (cpu time)
0.048 During the marking process, the GC idle time (cpu time)
0.34 The time spent in STW when the mark is terminated (cpu time)
4 The actual value of the heap size at the start of marking
7 The actual value of the heap size at the end of marking
3 Size of objects marked alive at the end of marking
5 Estimated size of the heap at the end of marking
12 number of P

Wall clock refers to the actual time from start to finish, including the time consumed by other programs and this program; cpu time refers to the time a specific program uses the CPU; they have the following relationship:

  • wall clock < cpu time: make full use of multi-core
  • wall clock ≈ cpu time: not executed in parallel
  • wall clock > cpu time: the advantage of multi-core is not obvious

For the runtime's scavenging activity (returning excess memory to the operating system):

scvg: 8 KB released
scvg: inuse: 3, idle: 60, sys: 63, released: 57, consumed: 6 (MB)

The meanings are shown in the table below:

field meaning
8 KB released Returned 8 KB of memory to the operating system
3 The total memory size (MB) that has been allocated to user code and is being used
60 The total size of memory that is free and waiting to be returned to the operating system (MB)
63 Inform the operating system of the reserved memory size (MB)
57 The memory size (MB) that has been returned to the operating system (or has not been officially applied for)
6 The memory size that has been requested from the operating system (MB)

Method 2: go tool trace

The main function of go tool trace is to present the statistics to the user visually. To use this tool, call the trace API:

package main

import (
	"os"
	"runtime/trace"
)

func main() {
	f, _ := os.Create("trace.out")
	defer f.Close()
	trace.Start(f)
	defer trace.Stop()
	(...)
}

and then parse and serve the trace with:

$ go tool trace trace.out
2019/12/30 15:50:33 Parsing trace...
2019/12/30 15:50:38 Splitting trace...
2019/12/30 15:50:45 Opening browser. Trace viewer is listening on http://127.0.0.1:51839

This command opens the trace viewer in the browser.

Selecting the first link yields the following view:

The question mark in the upper right corner can open the help menu, the main usage methods include:

  • The w/s key can be used to zoom in or zoom out the view
  • a/d keys can be used to move left and right
  • Hold Shift to select multiple events

Method 3: debug.ReadGCStats

This approach lets us monitor the metrics we care about directly from code. For example, to check the GC status once per second:

func printGCStats() {
	t := time.NewTicker(time.Second)
	s := debug.GCStats{}
	for {
		select {
		case <-t.C:
			debug.ReadGCStats(&s)
			fmt.Printf("gc %d last@%v, PauseTotal %v\n", s.NumGC, s.LastGC, s.PauseTotal)
		}
	}
}

func main() {
	go printGCStats()
	(...)
}

We can see output like the following:

$ go run main.go

gc 4954 last@2019-12-30 15:19:37.505575 +0100 CET, PauseTotal 29.901171ms
gc 9195 last@2019-12-30 15:19:38.50565 +0100 CET, PauseTotal 77.579622ms
gc 13502 last@2019-12-30 15:19:39.505714 +0100 CET, PauseTotal 128.022307ms
gc 17555 last@2019-12-30 15:19:40.505579 +0100 CET, PauseTotal 182.816528ms
gc 21838 last@2019-12-30 15:19:41.505595 +0100 CET, PauseTotal 246.618502ms

Method 4: runtime.ReadMemStats

Besides the methods provided by the debug package, the runtime's memory-related APIs can be used for monitoring directly:

func printMemStats() {
	t := time.NewTicker(time.Second)
	s := runtime.MemStats{}

	for {
		select {
		case <-t.C:
			runtime.ReadMemStats(&s)
			fmt.Printf("gc %d last@%v, next_heap_size@%vMB\n", s.NumGC, time.Unix(int64(time.Duration(s.LastGC).Seconds()), 0), s.NextGC/(1<<20))
		}
	}
}

func main() {
	go printMemStats()
	(...)
}
$ go run main.go

gc 4887 last@2019-12-30 15:44:56 +0100 CET, next_heap_size@4MB
gc 10049 last@2019-12-30 15:44:57 +0100 CET, next_heap_size@4MB
gc 15231 last@2019-12-30 15:44:58 +0100 CET, next_heap_size@4MB
gc 20378 last@2019-12-30 15:44:59 +0100 CET, next_heap_size@6MB

Of course, the latter two methods can monitor many more metrics; readers can consult the fields of debug.GCStats [2] and runtime.MemStats [3], which are not repeated here.

12. With GC, why do memory leaks still occur?

In a garbage-collected language, what we usually call a memory leak is, strictly speaking: memory that was expected to be released quickly goes unreclaimed for a long time, because it is attached to long-lived memory or because its lifetime is unexpectedly extended.

In Go, because of goroutines, leaks take several forms beyond attachment to long-lived objects.

Form 1: memory expected to be released quickly is kept alive by a reference from a root object

With a global object, a variable may inadvertently be attached to it and never released, so that memory is never reclaimed. For example:

var cache = map[interface{}]interface{}{}

func keepalloc() {
	for i := 0; i < 10000; i++ {
		m := make([]byte, 1<<10)
		cache[i] = m
	}
}
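A hedged sketch of avoiding this leak: release the reference from the root object explicitly once an entry is no longer needed (the `fill`/`evict` names are invented for this sketch; the structure follows the example above):

```go
package main

import "fmt"

var cache = map[interface{}]interface{}{}

// fill reproduces the leak above: entries attach to the global cache.
func fill() {
	for i := 0; i < 10000; i++ {
		cache[i] = make([]byte, 1<<10)
	}
}

// evict drops the entries so the GC can reclaim the attached memory;
// without this, the root object keeps every slice alive forever.
func evict() {
	for k := range cache {
		delete(cache, k)
	}
}

func main() {
	fill()
	fmt.Println("entries before evict:", len(cache))
	evict()
	fmt.Println("entries after evict:", len(cache))
}
```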

Form 2: goroutine leaks

A goroutine, logically a lightweight thread, must maintain the context needed to run user code, which consumes a certain amount of memory while it runs; in the current version of Go this memory is not released until the goroutine ends. Therefore, if a program keeps creating new goroutines without ending the ones already created and reusing that memory, it leaks memory. For example:

func keepalloc2() {
	for i := 0; i < 100000; i++ {
		go func() {
			select {}
		}()
	}
}
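A common remedy, sketched under the same structure: give every goroutine an exit path, for example a done channel (or a context.Context), so it never blocks forever (the `runAndStop` helper is invented for this sketch):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runAndStop starts n goroutines that block until done is closed,
// then exit cleanly; it returns how many actually exited.
// Without the done channel, each goroutine would leak forever,
// as in keepalloc2 above.
func runAndStop(n int) int64 {
	done := make(chan struct{})
	var exited int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-done // exit path: released when done is closed
			atomic.AddInt64(&exited, 1)
		}()
	}
	close(done) // signal every worker to stop
	wg.Wait()   // their stacks can now be reused or reclaimed
	return atomic.LoadInt64(&exited)
}

func main() {
	fmt.Println(runAndStop(100), "goroutines exited cleanly")
}
```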

Verification

We can call the two functions above as follows:

package main

import (
	"os"
	"runtime/trace"
)

func main() {
	f, _ := os.Create("trace.out")
	defer f.Close()
	trace.Start(f)
	defer trace.Stop()
	keepalloc()
	keepalloc2()
}

Run the program:

go run main.go

A trace.out file is generated; running go tool trace trace.out produces the figure below:

As the figure shows, the Heap keeps growing and no memory is reclaimed: a memory leak.

It is worth mentioning that this form of goroutine leak can also be caused by a channel leak, and a channel leak is in essence directly tied to a goroutine leak. A channel, as a synchronization primitive, connects two different goroutines; if a goroutine tries to send on an unbuffered channel that has no receiver, the goroutine sleeps forever, and neither the goroutine nor its execution stack is ever released. For example:

var ch = make(chan struct{})

func keepalloc3() {
	for i := 0; i < 100000; i++ {
		// no receiver: the goroutine blocks forever
		go func() {
			ch <- struct{}{}
		}()
	}
}
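One hedged fix for the blocked-sender pattern above: ensure a receiver exists, or make the send non-blocking with a buffer or a select/default, so a goroutine is never parked forever (the `trySend` helper is invented for this sketch):

```go
package main

import "fmt"

// trySend attempts a non-blocking send: if no receiver is ready and
// the channel buffer is full, it gives up instead of parking the
// goroutine forever, as happens in keepalloc3 above.
func trySend(ch chan struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan struct{})
	buffered := make(chan struct{}, 1)

	fmt.Println("unbuffered, no receiver:", trySend(unbuffered)) // would have blocked
	fmt.Println("buffered:", trySend(buffered))                  // the buffer absorbs it
}
```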

13. What is hard about concurrent mark-sweep?

Without user code concurrently modifying the tri-color abstraction, collection terminates normally. The fundamental problem with concurrent collection is that user code updates the object graph concurrently during collection, so the mutator and the collector may form different views of the graph's structure. A fixed tri-color wavefront is then no longer a valid boundary for the collection's progress.

Consider an example of a mutator write:

Step  Collector            Mutator               Note
1     shade(A, gray)                             Collector: the root object's child A is shaded gray
2     shade(C, black)                            Collector: once all its children are gray, node C is shaded black
3                          C.ref3 = C.ref2.ref1  Mutator: concurrently modifies C's children
4                          A.ref1 = nil          Mutator: concurrently modifies A's children
5     shade(A.ref1, gray)                        Collector: shades the gray object's children gray; since A.ref1 is nil, nothing happens
6     shade(A, black)                            Collector: all children appear marked, and black objects are never rescanned; A is shaded black, scan(A) does nothing, so B is never marked black in this cycle and is wrongly reclaimed

  • Initial state: suppose black object C points to gray object A, and A points to white object B;
  • C.ref3 = C.ref2.ref1: the mutator concurrently makes black object C point to (ref3) white object B;
  • A.ref1 = nil: the reference (ref2) from gray object A to white object B is removed;
  • Final state: as scanning continues, white object B can never be marked black (the collector does not rescan black objects), so B is wrongly reclaimed.

gc-mutator

In short, the fundamental problem facing concurrent mark-sweep is how to guarantee the correctness of the marking and sweeping process.

14. What are the write barrier and the hybrid write barrier, and how are they implemented?

Explaining the write barrier requires understanding the strong and weak invariants of tri-color mark-sweep and the color of the mutator, which takes some abstract thinking. The write barrier is a concept that appears only in concurrent garbage collectors. A collector is correct when no object is lost and no object that is still needed is wrongly reclaimed.

It can be shown that the collector's correctness is broken when the following two conditions hold at the same time:

  • Condition 1: the mutator modifies the object graph so that some black object references a white object;
  • Condition 2: every unvisited path from a gray object to that white object is destroyed by the mutator.

As long as either condition is avoided, no object is lost, because:

  • if condition 1 is avoided, every white object is referenced by some gray object and no white object is missed;
  • if condition 2 is avoided, then even if a pointer to a white object is written into a black object, there is always an unvisited path from some gray object leading to the white object, so the white object is not missed in the end.

We can weaken the wavefront defined by tri-color invariance according to these two conditions:

  • when the original tri-color invariant holds (i.e., neither condition above is met), we speak of the strong tricolor invariant;
  • when the mutator makes a black object reference a white object (condition 1 is met), we speak of the weak tricolor invariant.

When the mutator further destroys the paths from gray objects to white objects (condition 2 is also met), the weak tricolor invariant is broken and so is the collector's correctness; in other words, breaking the strong/weak tricolor invariants requires introducing extra auxiliary operations. The benefit of the weak invariant is that a black object may be made to point to a white object, as long as some unvisited path to the white object still exists.

If we consider concurrent user code, where the collector is not allowed to stop all mutators at once, then mutators in multiple different states coexist. To make the concepts precise, we switch perspective: treat the collector as the object, and the mutators as actors affecting that object (i.e., affecting the length of the GC cycle), which introduces the color of the mutator:

  • Black mutator: has already been scanned by the collector and will not be scanned again.
  • Gray mutator: has not yet been scanned by the collector, or has been scanned but still needs rescanning.

The mutator's color affects how the collection cycle ends:

  • If a concurrent collector allows gray mutators to exist, it must rescan the object graph before collection finishes.
  • If the rescan discovers new gray or white objects, the collector must trace them too; but while it traces them, the mutator may keep inserting new non-black references into its roots, and so on, until a rescan discovers no new white or gray objects.

Thus, in algorithms that allow gray mutators, the collector in the worst case can only complete a full scan of the roots and objects by stopping all mutator threads, which is what we call STW.

To ensure that concurrent pointer updates preserve the strong/weak tricolor invariants, mutator barrier techniques are needed to keep pointer reads and writes consistent. So the write barrier and hybrid write barrier we speak of in Go are in fact the mutator's write barrier: a synchronization mechanism that lets the mutator "notify" the collector when it performs a pointer write, so that the weak tricolor invariant is not broken.

There are two classic write barriers: the Dijkstra insertion barrier and the Yuasa deletion barrier.

The basic idea of the Dijkstra insertion barrier (used with gray mutators) is to avoid condition 1:

// Dijkstra insertion barrier (gray mutator)
func DijkstraWritePointer(slot *unsafe.Pointer, ptr unsafe.Pointer) {
    shade(ptr)
    *slot = ptr
}

To prevent black objects from pointing to white ones, assume *slot may turn black; to ensure ptr cannot still be white once written into *slot, shade(ptr) first marks ptr gray, thereby avoiding condition 1. As shown in the figure:

The advantage of the Dijkstra insertion barrier is that concurrent marking can begin immediately. It has two drawbacks:

  1. Because the Dijkstra insertion barrier is "conservative", some objects that should die may survive one collection cycle and only be reclaimed in the next;
  2. During the mark phase, every pointer assignment must pass through the write barrier, which adds substantial overhead. To avoid this cost, the Go team's final implementation does not enable the write barrier for pointer writes on stacks; instead, a write on a stack marks the whole stack gray. This produces gray mutators, so those stacks must be rescanned during the mark-termination STW.

The other classic write barrier is the Yuasa deletion barrier (used with black mutators). Its basic idea is to avoid condition 2:

// Yuasa deletion barrier (black mutator)
func YuasaWritePointer(slot *unsafe.Pointer, ptr unsafe.Pointer) {
    shade(*slot)
    *slot = ptr
}

To avoid losing paths from gray objects to white ones, assume *slot may turn black; shade(*slot) first marks the object previously referenced through *slot gray before the write, so every such write creates a gray-to-gray or gray-to-white path, thereby avoiding condition 2.

The strength of the Yuasa deletion barrier is that no rescanning is needed at mark termination: when marking ends, all white objects that should be reclaimed can be reclaimed accurately.

Its drawback is that intercepting write operations pushes the wavefront backwards, producing "redundant" scanning:

In Go 1.8, to simplify the GC flow and reduce the rescanning cost of the mark-termination phase, the Dijkstra insertion barrier and the Yuasa deletion barrier were combined into the hybrid write barrier.

The basic idea when this barrier was proposed: shade the object being overwritten, and, if the current stack has not yet been fully scanned, shade the pointer being written as well.

However, in the final implementation, the shading of ptr in the original proposal [4] additionally involved a check of the execution stack's scan state, which was not fully implemented for lack of time, so the current pseudocode of the hybrid write barrier is:

// hybrid write barrier
func HybridWritePointerSimple(slot *unsafe.Pointer, ptr unsafe.Pointer) {
    shade(*slot)
    shade(ptr)
    *slot = ptr
}

In this implementation, both sides of the reference are shaded unconditionally, which naturally combines the advantages of the Dijkstra and Yuasa write barriers. The drawback is equally clear: the shading cost doubles, the compiler must insert correspondingly more code, and the compiled binary grows further. To optimize write-barrier performance, around Go 1.10 the Go team implemented a batched write barrier: pointers that need shading are written into a buffer, and when the buffer fills, all of them are shaded at once.

The material above is quite complex; if anything is unclear, please refer to the companion video of the "Go Study Bible - Technical Freedom Circle Edition".

say later

If you encounter difficulties, you can come to Nien for help.

Nien will sort the problems out for you from the ground up and systematically, and help you truly win the interviewer over.

In addition, if your resume and projects are weak and you are not getting interview opportunities, you can also ask Nien for an upgrade to make your resume shiny and attractive.

Recommended related reading

" Go Study Bible: 0 Basics Proficient in GO Development and High Concurrency Architecture "

" Go Study Bible: Queue peak shaving + batch write ultra-high concurrency principle and practice "

" Go Study Bible: Starting from 0, Proficient in Go Language Rest Microservice Architecture and Development "

" Go Study Bible: Go language realizes high concurrent CRUD business development "

Please go to the following "Technical Freedom Circle" official account to get the PDF file update of Nien's architecture notes and interview questions↓↓↓


Origin blog.csdn.net/crazymakercircle/article/details/131673351