[JVM] 15. Garbage collection algorithms

Hello, everyone, I am a pig who is dominated by cabbage.

1. Marking stage: reference counting algorithm

Garbage marking stage: determining whether an object is alive

Garbage collection raises two questions: what counts as garbage (marking), and how to remove it (clearing).

  • Almost all Java object instances are stored in the heap. Before the GC performs garbage collection, it must first distinguish which objects in memory are alive and which are dead. Only when an object is marked as dead will the GC release the memory it occupies during collection, so this process is called the garbage marking stage.
  • So how exactly is a dead object marked in the JVM? Simply put, when an object is no longer referenced by any surviving object, it can be declared dead.
  • There are generally two ways to determine whether an object is alive: the reference counting algorithm and the reachability analysis algorithm.

Method 1: Reference counting algorithm

  • The reference counting algorithm (Reference Counting) is relatively simple: it stores an integer reference counter attribute for each object, which records how many references point to the object.
  • For an object A, whenever any object references A, A's reference counter is incremented by one; when a reference becomes invalid, the counter is decremented by one. Once the counter of object A reaches 0, object A can no longer be used and can be recycled.
  • Advantages: simple to implement, and garbage objects are easy to identify; judgment is efficient, and recycling is not delayed.
  • Disadvantages:
    • It needs a separate field to store the counter, which increases storage overhead.
    • Every assignment has to update the counter with an addition or subtraction, which increases time overhead.
    • A serious problem with reference counters is that they cannot handle circular references. This is a fatal flaw, and it is why Java's garbage collectors do not use this algorithm.

Recycling is not delayed because there is no need to wait until memory runs short: as soon as a reference counter is found to be 0, the object can be reclaimed.
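
The counter bookkeeping described above can be sketched in a few lines of Java. This is a toy model only: the Obj class and the retain/release helpers are hypothetical names made up for illustration, and no JVM collector is claimed to work this way.

```java
/** Toy model of reference counting; the names here are illustrative assumptions. */
class RefCountDemo {
    /** A hypothetical managed object carrying its own counter. */
    static final class Obj {
        final String name;
        int refCount = 0;            // the extra per-object field (storage overhead)
        boolean reclaimed = false;   // set once the counter drops to 0
        Obj(String name) { this.name = name; }
    }

    /** Called whenever a new reference to target is created. */
    static void retain(Obj target) {
        target.refCount++;
    }

    /** Called whenever a reference to target becomes invalid. */
    static void release(Obj target) {
        target.refCount--;
        if (target.refCount == 0) {
            target.reclaimed = true; // reclaimed immediately, without waiting for memory pressure
        }
    }

    public static void main(String[] args) {
        Obj a = new Obj("A");
        retain(a);   // first reference to A
        retain(a);   // second reference to A
        release(a);  // one reference dropped, A is still in use
        System.out.println(a.name + " count=" + a.refCount + " reclaimed=" + a.reclaimed);
        release(a);  // last reference dropped
        System.out.println(a.name + " count=" + a.refCount + " reclaimed=" + a.reclaimed);
    }
}
```

Every retain and release call is an extra addition or subtraction on an assignment, which is the time overhead listed among the disadvantages.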

Summary

  • Reference counting is the resource reclamation choice of many languages. For example, Python, which has become more popular thanks to artificial intelligence, supports both reference counting and a garbage collection mechanism.
  • Which is best depends on the scenario. There have been attempts in industry, in large-scale practice, to keep only the reference counting mechanism in order to improve throughput.
  • Java did not choose reference counting because it has a fundamental problem: it has difficulty dealing with circular references.
  • How does Python resolve circular references?
    • Manual release: easy to understand; break the reference relationship at the right time.
    • Weak references: weakref is a module in Python's standard library, designed to help with circular references.

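Circular reference

If object A holds a reference to object B and B holds a reference back to A, each counter stays at 1 even after nothing else references either object, so a pure reference counter never reclaims them.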
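
A minimal sketch of the circular reference problem, again as a toy simulation: the Node class and the setNext helper are assumptions made up for this example, with the counters maintained by hand the way a counting collector would.

```java
/** Toy simulation of why reference counting leaks cycles; names are illustrative. */
class CycleDemo {
    static final class Node {
        int refCount = 0;
        Node next;   // a reference field pointing at another node
    }

    /** Stores target into holder.next and adjusts both counters accordingly. */
    static void setNext(Node holder, Node target) {
        if (holder.next != null) {
            holder.next.refCount--;
        }
        holder.next = target;
        if (target != null) {
            target.refCount++;
        }
    }

    /** Builds A <-> B, drops both external references, and returns the leftover counters. */
    static int[] leakedCounts() {
        Node a = new Node();
        a.refCount++;        // external reference "a"
        Node b = new Node();
        b.refCount++;        // external reference "b"
        setNext(a, b);       // A references B
        setNext(b, a);       // B references A
        a.refCount--;        // external reference "a" goes away
        b.refCount--;        // external reference "b" goes away
        return new int[] { a.refCount, b.refCount };
    }

    public static void main(String[] args) {
        int[] counts = leakedCounts();
        // Neither counter reaches 0, although nothing outside the cycle refers to A or B.
        System.out.println("A=" + counts[0] + ", B=" + counts[1]);
    }
}
```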

2. Marking stage: reachability analysis algorithm

Method 2: Reachability analysis (also called the root search algorithm or tracing garbage collection)

  • Like reference counting, the reachability analysis algorithm is simple to implement and efficient to execute; more importantly, it effectively solves the circular reference problem of the reference counting algorithm and prevents the resulting memory leaks.
  • Reachability analysis, rather than reference counting, is the choice of Java and C#. This type of garbage collection is usually called Tracing Garbage Collection.
  • The so-called "GC Roots" set is a set of references that must be active.
  • The basic idea:
    • The reachability analysis algorithm takes the root object set (GC Roots) as its starting point and searches, top-down, for whether the target objects connected to the root object set are reachable.
    • After reachability analysis, every surviving object in memory is directly or indirectly connected to the root object set, and the path traversed by the search is called the Reference Chain.
    • If a target object is not connected by any reference chain, it is unreachable, which means the object is dead and can be marked as garbage.
    • In the reachability analysis algorithm, only objects that are directly or indirectly connected to the root object set are surviving objects.
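
The search from the root set can be sketched as an ordinary graph traversal. The Obj class and the reachable method below are hypothetical names for illustration; a real collector walks actual object headers and fields, not a Java list of references.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy reachability analysis over an explicit object graph; names are illustrative. */
class ReachabilityDemo {
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();   // outgoing references
        Obj(String name) { this.name = name; }
    }

    /** Returns every object connected to the roots by some reference chain. */
    static Set<Obj> reachable(List<Obj> roots) {
        Set<Obj> seen = new HashSet<>();            // Obj uses identity equality by default
        Deque<Obj> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Obj current = pending.pop();
            if (seen.add(current)) {                // first visit: follow its references
                pending.addAll(current.refs);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Obj root = new Obj("root");
        Obj a = new Obj("a");
        Obj x = new Obj("x");
        Obj y = new Obj("y");
        root.refs.add(a);   // root -> a
        x.refs.add(y);      // x and y reference each other,
        y.refs.add(x);      // but no chain leads to them from the root

        Set<Obj> live = reachable(Arrays.asList(root));
        for (Obj o : Arrays.asList(root, a, x, y)) {
            System.out.println(o.name + (live.contains(o) ? " is reachable" : " is garbage"));
        }
    }
}
```

The cycle between x and y is reported as garbage because the traversal never reaches it, which is exactly the case reference counting cannot handle.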

GC Roots

In the Java language, GC Roots include the following kinds of elements:

  • Objects referenced in the virtual machine stack
    • For example: the parameters, local variables, etc. used in the methods that each thread is currently executing.
  • Objects referenced by JNI (that is, native methods) in the native method stack
  • Objects referenced by class static properties in the method area
    • For example: reference-type static variables of a Java class
  • Objects referenced by constants in the method area
    • For example: references in the String Table
  • All objects held by a synchronized lock
  • References inside the Java virtual machine.
    • The Class objects corresponding to the primitive data types, some resident exception objects (such as NullPointerException and OutOfMemoryError), and the system class loader.
    • JMXBeans that reflect the internal state of the Java virtual machine, callbacks registered in JVMTI, the native code cache, etc.

  • In addition to these fixed GC Roots, depending on the garbage collector selected by the user and the memory area currently being reclaimed, other objects can be added "temporarily" to form the complete GC Roots set. Examples are generational collection and partial collection (Partial GC).

    • If only a certain area of the Java heap is collected (typically only the young generation), keep in mind that the division into areas is an implementation detail of the virtual machine and that the area is not isolated and closed. Objects in this area may well be referenced by objects in other areas, so the objects of those associated areas must also be added to the GC Roots set to keep the reachability analysis accurate.
  • Tips:

    Roots are usually variables and pointers stored on the stack: if a pointer refers to an object in heap memory but is not itself stored in heap memory, it is a root.
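
To make a few of the root kinds listed above concrete, here is a small annotated class. The class and its members are invented for illustration; the comments only restate the categories from the list and do not show anything about how a particular JVM tracks them.

```java
/** Annotated examples of references that act as GC roots; names are illustrative. */
class GcRootsDemo {
    // Reference-type static variable: the object is referenced from the method area.
    static Object cache = new Object();

    // String literal: referenced from the String Table.
    static final String GREETING = "hello";

    static int work(Object param) {        // parameter: held in the current stack frame
        Object local = new Object();       // local variable: also held in the stack frame
        synchronized (local) {             // while the lock is held, the lock object is a root too
            return param == local ? 0 : 1;
        }
    }

    public static void main(String[] args) {
        // While work() runs, param and local are reachable from the thread's stack.
        System.out.println(work(cache) + " " + GREETING);
    }
}
```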

Note:

  • If the reachability analysis algorithm is used to determine whether memory is recyclable, the analysis must be performed on a snapshot that guarantees consistency. Otherwise the accuracy of the analysis result cannot be guaranteed.
  • This is also an important reason why GC must "Stop The World".
    • Even the CMS collector, which claims to have (almost) no pauses, must pause while enumerating the root nodes.

3. Object finalization mechanism

  • The Java language provides an object finalization mechanism that lets developers supply custom processing logic to run before an object is destroyed.
  • When the garbage collector finds that nothing references an object, that is, before the object is garbage collected, it always calls the object's finalize() method first.
  • The finalize() method may be overridden in subclasses and is used to release resources when the object is recycled. It usually performs resource release and cleanup work, such as closing files, sockets, and database connections.

  • Never call an object's finalize() method yourself; leave it to the garbage collection mechanism. There are three reasons:
    • finalize() may cause the object to be resurrected.
    • The execution time of finalize() is not guaranteed. It is entirely determined by the GC thread; in the extreme case where no GC occurs, finalize() never gets a chance to run.
    • A badly written finalize() seriously hurts GC performance.
  • Functionally, finalize() resembles a destructor in C++, but because Java uses automatic memory management based on a garbage collector, finalize() is essentially different from a C++ destructor.
  • Because of the finalize() method, an object in the virtual machine is generally in one of three possible states.

To survive or die?

  • If an object cannot be reached from any root node, the object is no longer in use, and generally speaking it needs to be recycled. But in fact such objects are not necessarily "dead"; for the moment they are on "probation". An unreachable object may "resurrect" itself under certain conditions, and if it does, reclaiming it would be wrong. The virtual machine therefore defines three possible states for an object:
    • Reachable: the object can be reached from the root nodes.
    • Resurrectable: all references to the object have been released, but the object may be resurrected in finalize().
    • Untouchable: the object's finalize() has been called and did not resurrect it, so it enters the untouchable state. An untouchable object cannot be resurrected, because finalize() is called only once.
  • These three states are distinguished because the finalize() method exists. An object can be recycled only when it is untouchable.

Specific process

To determine whether an object objA is recyclable, at least two marking passes are required:

  1. If there is no reference chain from GC Roots to the object objA, it is marked for the first time.

  2. The object is then screened to determine whether its finalize() method needs to be executed.

    1. If objA does not override finalize(), or its finalize() has already been called by the virtual machine, the virtual machine regards execution as unnecessary and objA is judged untouchable.
    2. If objA overrides finalize() and the method has not yet been executed, objA is inserted into the F-Queue, and a low-priority Finalizer thread created automatically by the virtual machine triggers its finalize() method.
    3. The finalize() method is the object's last chance to escape death. Later, the GC marks the objects in the F-Queue a second time. If objA re-establishes a connection with any object on a reference chain inside finalize(), then objA is removed from the "about to be recycled" set during this second marking. If the object later becomes unreferenced again, finalize() is not called again and the object becomes untouchable directly; in other words, an object's finalize() method is called only once.
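
The resurrection scenario in step 3 can be tried with a short program. This is a sketch for illustration only: the class and method names are made up, finalize() has been deprecated since Java 9, System.gc() is only a request, and the 500 ms wait is an arbitrary assumption to give the low-priority Finalizer thread time to run. The printed result therefore depends on the JVM and its settings, so no output is promised here.

```java
/** Sketch of an object rescuing itself in finalize(); behaviour depends on the JVM. */
class ResurrectionDemo {
    // A static field is a GC root; volatile because the Finalizer thread writes it.
    static volatile ResurrectionDemo saved;

    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() throws Throwable {
        super.finalize();
        saved = this;   // re-attach this object to a reference chain
    }

    /** Drops the only reference, requests a GC, and reports whether the object came back. */
    static boolean dropAndCollect() {
        saved = null;
        System.gc();                 // a request, not a guarantee
        try {
            Thread.sleep(500);       // give the Finalizer thread a chance to run
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return saved != null;
    }

    public static void main(String[] args) {
        saved = new ResurrectionDemo();
        // First drop: finalize() may run and resurrect the object.
        System.out.println("after first drop, alive: " + dropAndCollect());
        // Second drop: finalize() is not invoked a second time for the same object.
        System.out.println("after second drop, alive: " + dropAndCollect());
    }
}
```

On a JVM where finalization is enabled and the collection actually happens, the first call would be expected to report the object alive and the second to report it gone, matching the rule that finalize() is called only once.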

4. Tracing GC Roots with MAT and JProfiler

MAT is short for Memory Analyzer. It is a powerful Java heap memory analyzer, used to find memory leaks and to view memory consumption.

MAT is developed based on Eclipse and is a free performance analysis tool.

You can download and use MAT at http://www.eclipse.org/mat/.

Getting a dump file

Method 1: Use jmap from the command line, for example jmap -dump:format=b,file=<file> <pid>

Method 2: Use JVisualVM to export

5. Clearing stage: mark-sweep algorithm

Garbage clearing stage

After the surviving and dead objects in memory have been successfully distinguished, the GC's next task is to perform garbage collection, releasing the memory occupied by useless objects so that there is enough free memory to allocate to new objects.

At present, the three common garbage collection algorithms in the JVM are the mark-sweep algorithm (Mark-Sweep), the copying algorithm (Copying), and the mark-compact algorithm (Mark-Compact).

Mark-Sweep algorithm

Background:

The mark-sweep algorithm (Mark-Sweep) is a very basic and common garbage collection algorithm. It was proposed by J. McCarthy et al. in 1960 and applied to the Lisp language.

Implementation process:

When the available memory in the heap is exhausted, the entire program is stopped (also known as stop the world), and two tasks are performed: the first is marking, the second is clearing.

  • Marking: the collector traverses from the reference root nodes and marks all referenced objects, generally by recording in each object's header that it is reachable.
  • Clearing: the collector traverses the heap memory linearly from beginning to end. If it finds an object whose header is not marked as reachable, it reclaims the object.

  • Disadvantages:
    • Its efficiency is not high.
    • The entire application must be stopped during GC, resulting in a poor user experience.
    • The free memory reclaimed in this way is not contiguous, which produces memory fragmentation and requires maintaining a free list.
  • Note: what does clearing mean?
    • Clearing here does not really empty the memory; it saves the addresses of the objects to be cleared in the free list. The next time a new object has to be loaded, the collector checks whether the space at a garbage location is large enough, and if so, stores the object there.
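
The two phases and the free list can be sketched on a toy heap. Everything here is an assumption made for illustration: the heap is an array of fixed-size slots, references are slot indices, and the marked flag stands in for the mark bit in the object header.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy mark-sweep over an array of slots; names and layout are illustrative. */
class MarkSweepDemo {
    static final class Obj {
        boolean marked;                                 // stands in for the header mark bit
        final List<Integer> refs = new ArrayList<>();   // slot indices of referenced objects
    }

    final Obj[] heap;
    final List<Integer> freeList = new ArrayList<>();   // slots that may be reused

    MarkSweepDemo(int slots) {
        heap = new Obj[slots];
    }

    /** Mark phase: flag every object reachable from the root slots. */
    void mark(List<Integer> roots) {
        Deque<Integer> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Obj o = heap[pending.pop()];
            if (o != null && !o.marked) {
                o.marked = true;
                pending.addAll(o.refs);
            }
        }
    }

    /** Sweep phase: walk the heap from start to end and record unmarked slots as free. */
    void sweep() {
        freeList.clear();
        for (int i = 0; i < heap.length; i++) {
            if (heap[i] == null || !heap[i].marked) {
                freeList.add(i);          // not emptied, only recorded as reusable
            } else {
                heap[i].marked = false;   // reset the mark for the next cycle
            }
        }
    }

    public static void main(String[] args) {
        MarkSweepDemo gc = new MarkSweepDemo(6);
        for (int i = 0; i < 5; i++) {
            gc.heap[i] = new Obj();       // slot 5 stays empty
        }
        gc.heap[0].refs.add(1);           // 0 -> 1 -> 3 is reachable from the root
        gc.heap[1].refs.add(3);
        gc.heap[2].refs.add(4);           // 2 <-> 4 is an unreachable cycle
        gc.heap[4].refs.add(2);
        gc.mark(Arrays.asList(0));
        gc.sweep();
        System.out.println("free slots: " + gc.freeList);   // prints "free slots: [2, 4, 5]"
    }
}
```

The free slots 2, 4 and 5 are not all adjacent, which is the fragmentation the disadvantages list refers to.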

6. Clearing stage: copying algorithm

Background:

To address the mark-sweep algorithm's shortcomings in garbage collection efficiency, M. L. Minsky published a famous paper in 1963, "A LISP Garbage Collector Algorithm Using Serial Secondary Storage". The algorithm Minsky described in the paper is called the Copying algorithm, and Minsky himself successfully introduced it into an implementation of the Lisp language.

Main idea:

Divide the live memory space into two blocks and use only one of them at a time. During garbage collection, copy the live objects from the block in use to the unused block, then clear every object in the block that was in use and swap the roles of the two blocks. This completes the garbage collection.
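
The two-block idea can be sketched as follows. This is a simplified model under stated assumptions: the two blocks are Java lists named fromSpace and toSpace, objects are copied recursively, and a map of forwarded objects keeps shared references and cycles consistent. Real collectors copy raw memory and are considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Toy semispace copying collector; names and data structures are illustrative. */
class CopyingDemo {
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        Obj(String name) { this.name = name; }
    }

    List<Obj> fromSpace = new ArrayList<>();   // the block currently in use
    List<Obj> toSpace = new ArrayList<>();     // the idle block

    Obj allocate(String name) {
        Obj o = new Obj(name);
        fromSpace.add(o);
        return o;
    }

    /** Copies everything reachable from roots, clears the old block, swaps roles. */
    List<Obj> collect(List<Obj> roots) {
        Map<Obj, Obj> forwarded = new IdentityHashMap<>();   // old object -> its copy
        List<Obj> newRoots = new ArrayList<>();
        for (Obj root : roots) {
            newRoots.add(copy(root, forwarded));
        }
        fromSpace.clear();               // whatever was left behind is garbage
        List<Obj> tmp = fromSpace;       // swap the roles of the two blocks
        fromSpace = toSpace;
        toSpace = tmp;
        return newRoots;
    }

    private Obj copy(Obj old, Map<Obj, Obj> forwarded) {
        Obj existing = forwarded.get(old);
        if (existing != null) {
            return existing;             // already copied: reuse the new address
        }
        Obj fresh = new Obj(old.name);
        forwarded.put(old, fresh);       // record before recursing so cycles terminate
        toSpace.add(fresh);              // survivors end up packed next to each other
        for (Obj ref : old.refs) {
            fresh.refs.add(copy(ref, forwarded));
        }
        return fresh;
    }

    public static void main(String[] args) {
        CopyingDemo heap = new CopyingDemo();
        Obj a = heap.allocate("a");
        Obj b = heap.allocate("b");
        Obj c = heap.allocate("c");
        Obj d = heap.allocate("d");
        a.refs.add(b);                   // a -> b is reachable
        c.refs.add(d);                   // c -> d is garbage
        heap.collect(Arrays.asList(a));
        for (Obj o : heap.fromSpace) {
            System.out.println("survivor: " + o.name);
        }
    }
}
```

Only the live objects are touched, so the work is proportional to the number of survivors, not to the size of the block.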


Advantages:

  • There are no separate marking and clearing passes, so it is simple to implement and efficient to run.
  • After the objects are copied over, the space is contiguous, and there is no "fragmentation" problem.

Disadvantages:

  • The disadvantage of this algorithm is also obvious: it needs twice the memory space.
  • For a GC such as G1, which splits the heap into a large number of regions, copying rather than moving means the GC has to maintain the object reference relationships between regions, and neither the memory footprint nor the time overhead of doing so is small.

Special note:

  • The copying algorithm only pays off when the system holds a lot of garbage objects, so that the number of live objects it has to copy is not too large, ideally very low.

    That is, it is especially suitable for scenarios with many garbage objects and few surviving objects, for example s0 and s1 in the Young area.

Application scenarios:

In the young generation, a single garbage collection in a typical application can usually reclaim 70%-99% of the memory space, so collection is very cost-effective. Therefore, current commercial virtual machines use this collection algorithm to reclaim the young generation.

7. Clearing stage: mark-compact algorithm

Background:

The efficiency of the copying algorithm rests on the premise that there are few surviving objects and many garbage objects. This often holds in the young generation, but in the old generation it is more common for most objects to survive. If the copying algorithm were still used, the cost of copying would be high because of the large number of surviving objects. Therefore, given the characteristics of garbage collection in the old generation, other algorithms are needed.

The mark-sweep algorithm can indeed be applied in the old generation, but it is not only inefficient, it also leaves memory fragmentation behind after reclaiming memory, so JVM designers needed to improve on it. This is how the mark-compact algorithm (Mark-Compact) was born.

Around 1970, researchers such as G. L. Steele, C. J. Cheney, and D. S. Wise published the mark-compact algorithm. Many modern garbage collectors use the mark-compact algorithm or an improved version of it.


Implementation process:

The first stage is the same as in the mark-sweep algorithm: starting from the root nodes, mark all referenced objects.

The second stage compacts all surviving objects to one end of memory, laying them out in order.

After that, all the space beyond the boundary is cleared.


The final effect of the mark-compact algorithm is equivalent to running the mark-sweep algorithm and then defragmenting the memory. Therefore, it can also be called the mark-sweep-compact (Mark-Sweep-Compact) algorithm.

The essential difference between the two is that mark-sweep is a non-moving collection algorithm, whereas mark-compact is a moving one. Whether to move the surviving objects after collection is a risky decision with both advantages and disadvantages.

As can be seen, the marked live objects are tidied up and arranged in order of memory address, while the unmarked memory is cleaned up. As a result, when memory has to be allocated for a new object, the JVM only needs to hold one starting address of free memory, which is clearly much cheaper than maintaining a free list.
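
The compaction step can be sketched on a toy heap of slots. The layout is an assumption for illustration: heap is an array of payloads, marked is the result of a preceding mark phase, and the returned forwarding table records where each surviving slot moved, which is the information a collector would need to fix up references to moved objects.

```java
import java.util.Arrays;

/** Toy sliding compaction over an array of slots; names and layout are illustrative. */
class MarkCompactDemo {
    /**
     * Slides marked slots toward index 0, keeping their order.
     * Returns the forwarding table: new index per old slot, or -1 for dead slots.
     */
    static int[] compact(String[] heap, boolean[] marked) {
        int[] forward = new int[heap.length];
        int top = 0;
        // Pass 1: compute the new address of every surviving slot.
        for (int i = 0; i < heap.length; i++) {
            forward[i] = marked[i] ? top++ : -1;
        }
        // Pass 2: move survivors; forward[i] <= i, so nothing live is overwritten.
        for (int i = 0; i < heap.length; i++) {
            if (forward[i] >= 0) {
                heap[forward[i]] = heap[i];
            }
        }
        // Pass 3: clear everything beyond the boundary.
        for (int i = top; i < heap.length; i++) {
            heap[i] = null;
        }
        return forward;
    }

    public static void main(String[] args) {
        String[] heap = { "A", "B", "C", "D", "E" };
        boolean[] marked = { true, false, true, false, true };
        int[] forward = compact(heap, marked);
        System.out.println(Arrays.toString(heap));      // prints [A, C, E, null, null]
        System.out.println(Arrays.toString(forward));   // prints [0, -1, 1, -1, 2]
    }
}
```

After compaction the free memory is a single run starting at index 3, so only that one starting address has to be remembered.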


Bump the Pointer

If the memory space is laid out in a regular, orderly manner, that is, used memory on one side and unused memory on the other, with a pointer between them recording the starting point of the next allocation, then allocating memory for a new object only requires moving the pointer by an offset so that the new object is placed at the first free location. This allocation method is called Bump the Pointer.
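
A minimal sketch of this allocation scheme, assuming a region described only by its capacity and a single top pointer. The BumpAllocator class and its allocate method are hypothetical names; sizes and offsets are in arbitrary units.

```java
/** Toy bump-the-pointer allocator; names and units are illustrative. */
class BumpAllocator {
    private final int capacity;   // total size of the region
    private int top = 0;          // boundary between used and unused memory

    BumpAllocator(int capacity) {
        this.capacity = capacity;
    }

    /** Returns the start offset of the new object, or -1 if it does not fit. */
    int allocate(int size) {
        if (size <= 0 || size > capacity - top) {
            return -1;
        }
        int start = top;   // the object goes at the first free location
        top += size;       // bump the pointer past it
        return start;
    }

    public static void main(String[] args) {
        BumpAllocator region = new BumpAllocator(64);
        System.out.println(region.allocate(16));    // prints 0
        System.out.println(region.allocate(8));     // prints 16
        System.out.println(region.allocate(100));   // prints -1
    }
}
```

Allocation is a bounds check and an addition, with no search through a free list.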


Advantages:

  • It eliminates the scattered memory areas of the mark-sweep algorithm. When memory has to be allocated for a new object, the JVM only needs to hold one starting address of free memory.
  • It eliminates the high cost of halving the memory in the copying algorithm.

Disadvantages:

  • In terms of efficiency, the mark-compact algorithm is slower than the copying algorithm.

  • When an object is moved and other objects reference it, the addresses held in those references must be adjusted as well.

  • The user application must be suspended for the entire duration of the move, that is, STW.

8. Summary

In terms of efficiency, the copying algorithm is the undisputed leader, but it wastes too much memory.

Taking efficiency, memory usage, and fragmentation together, the mark-compact algorithm is the more balanced choice, but its efficiency is unsatisfactory: it has one more phase than the copying algorithm (marking) and one more phase than mark-sweep (compacting memory).

9. Generational collection algorithm

Isn't there an optimal algorithm?

Answer: No. There is no best algorithm, only the most suitable one.


None of the algorithms above can completely replace the others; each has its own advantages and characteristics. This is how the generational collection algorithm came into being.

The generational collection algorithm is based on the fact that different objects have different life cycles. Objects with different life cycles can therefore be collected in different ways to improve collection efficiency. Generally, the Java heap is divided into the young generation and the old generation, so that a different collection algorithm can be used for each generation according to its characteristics.

A running Java program creates a large number of objects. Some of them are tied to business information, such as Session objects in HTTP requests, threads, and Socket connections. These objects are directly linked to the business, so their life cycles are relatively long. Others are mainly temporary variables generated while the program runs, and their life cycles are relatively short. String objects are an example: because String is an immutable class, the system produces a large number of these objects, and some of them can be reclaimed after being used only once.


Almost all GCs currently use the generational collection algorithm (Generational Collecting) to perform garbage collection.

In HotSpot, based on the concept of generations, the memory reclamation algorithm used by the GC must take the characteristics of both the young generation and the old generation into account.

  • Young generation (Young Gen)
    • Characteristics of the young generation: the area is small compared with the old generation, object life cycles are short, the survival rate is low, and collection is frequent.
    • In this situation the copying algorithm reclaims and tidies memory fastest. Its cost depends only on the size of the currently surviving objects, so it is well suited to collecting the young generation. Its low memory utilization is alleviated by the two survivor spaces in HotSpot.
  • Old generation (Tenured Gen)
    • Characteristics of the old generation: the area is large, object life cycles are long, the survival rate is high, and collection is less frequent than in the young generation.
    • With a large number of objects that have a high survival rate, the copying algorithm is clearly unsuitable. The old generation is generally handled by mark-sweep, or by a hybrid of mark-sweep and mark-compact.
      • The cost of the Mark phase is proportional to the number of surviving objects.
      • The cost of the Sweep phase is positively related to the size of the managed area.
      • The cost of the Compact phase is proportional to the amount of data in the surviving objects.

Take the CMS collector in HotSpot as an example. CMS is implemented on the basis of Mark-Sweep, which collects objects efficiently. As for fragmentation, CMS uses the Serial Old collector, which is based on the Mark-Compact algorithm, as a fallback: when memory reclamation goes badly (a concurrent mode failure caused by fragmentation), Serial Old performs a Full GC to defragment the old generation's memory.

The idea of generations is widely used by existing virtual machines. Almost all garbage collectors distinguish between the young generation and the old generation.

10. Incremental collection algorithm, partition algorithm

In the algorithms above, the application is in a stop-the-world state during garbage collection. In this state all application threads are suspended, stop their normal work, and wait for the collection to finish. If the collection takes too long, the application is suspended for a long time, which seriously affects the user experience or the stability of the system. Research on real-time garbage collection algorithms aimed at this problem led directly to the birth of the incremental collection algorithm (Incremental Collecting).

Basic idea

If processing all the garbage at once would require a long pause, the garbage collection thread and the application threads can be executed alternately instead. Each time, the garbage collection thread collects only a small area of memory and then switches back to the application threads. This repeats until the garbage collection is complete.

In general, the incremental collection algorithm is still built on the traditional mark-sweep and copying algorithms. By properly handling conflicts between threads, it allows the garbage collection thread to complete the marking, clearing, or copying work in stages.
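
The alternating execution can be sketched by splitting the mark phase into slices with a fixed work budget. This is a deliberately simplified model: the class, the step method, and the budget value are assumptions for illustration, and the sketch leaves out the hard part mentioned above, namely handling references that the application changes between slices.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy incremental marking with a per-slice budget; names are illustrative. */
class IncrementalMarkDemo {
    static final class Obj {
        boolean marked;
        final List<Obj> refs = new ArrayList<>();
    }

    private final Deque<Obj> pending = new ArrayDeque<>();   // work left over between slices

    void start(List<Obj> roots) {
        pending.addAll(roots);
    }

    /** Marks at most budget objects, then yields. Returns true once marking is finished. */
    boolean step(int budget) {
        while (budget > 0 && !pending.isEmpty()) {
            Obj o = pending.pop();
            if (!o.marked) {
                o.marked = true;
                pending.addAll(o.refs);
                budget--;
            }
        }
        return pending.isEmpty();
    }

    /** Builds a chain of the given length and returns its objects in order. */
    static List<Obj> chain(int length) {
        List<Obj> objects = new ArrayList<>();
        for (int i = 0; i < length; i++) {
            objects.add(new Obj());
            if (i > 0) {
                objects.get(i - 1).refs.add(objects.get(i));
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        List<Obj> objects = chain(5);
        IncrementalMarkDemo gc = new IncrementalMarkDemo();
        gc.start(Arrays.asList(objects.get(0)));
        int slices = 1;
        while (!gc.step(2)) {
            // In a real system the application threads would run here between slices.
            slices++;
        }
        System.out.println("marking finished after " + slices + " slices");   // prints 3 slices
    }
}
```

Each slice is short, so each individual pause is short, but the bookkeeping between slices is extra work compared with marking everything in one go.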

Disadvantages:

Because application code runs intermittently during the garbage collection process, system pause times are reduced. However, thread switching and context switching have a cost, so the overall cost of garbage collection rises and system throughput falls.


Partition algorithm

Generally speaking, under the same conditions, the larger the heap, the longer a GC takes and the longer the resulting pause. To better control the pause time caused by GC, a large memory area is divided into multiple small regions. Based on the target pause time, a reasonable number of these regions is reclaimed each time instead of the entire heap, which reduces the pause caused by a single GC.

The generational algorithm divides objects into two parts according to the length of their life cycles, while the partition algorithm divides the entire heap into contiguous small regions.

Each region is used independently and reclaimed independently. The advantage of this algorithm is that it can control how many regions are reclaimed at a time.
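
One way to picture "reclaiming a reasonable number of regions within a target pause time" is a greedy selection. The sketch below is an assumption made for illustration, not the policy of any particular collector: each region carries an estimated amount of garbage and an estimated collection cost, and regions with the most garbage per millisecond are picked until the pause budget is used up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy region selection under a pause budget; the heuristic is an assumption. */
class RegionSelectionDemo {
    static final class Region {
        final int id;
        final int garbage;   // estimated reclaimable amount, arbitrary units
        final int costMs;    // estimated pause needed to collect this region (assumed > 0)
        Region(int id, int garbage, int costMs) {
            this.id = id;
            this.garbage = garbage;
            this.costMs = costMs;
        }
    }

    /** Picks regions with the best garbage-per-millisecond ratio that fit the budget. */
    static List<Integer> choose(List<Region> regions, int pauseBudgetMs) {
        List<Region> sorted = new ArrayList<>(regions);
        // Descending by garbage / costMs, compared by cross-multiplication to avoid division.
        sorted.sort((a, b) -> Long.compare((long) b.garbage * a.costMs, (long) a.garbage * b.costMs));
        List<Integer> chosen = new ArrayList<>();
        int remaining = pauseBudgetMs;
        for (Region r : sorted) {
            if (r.costMs <= remaining) {
                chosen.add(r.id);
                remaining -= r.costMs;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Region> regions = Arrays.asList(
                new Region(0, 80, 4),
                new Region(1, 10, 5),
                new Region(2, 60, 2),
                new Region(3, 30, 3));
        System.out.println(choose(regions, 7));   // prints [2, 0]
    }
}
```

With a budget of 7 ms, regions 2 and 0 are collected in this cycle and the others are left for a later one.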

Closing remarks

Note that these are only the basic ideas behind the algorithms. Actual GC implementations are much more complicated, and the cutting-edge GCs still under development are composite algorithms that are both parallel and concurrent.

Origin blog.csdn.net/weixin_44226263/article/details/112723272