Interview must-know: what do you know about JVM tuning? A summary of JVM performance tuning

Let's talk about JVM tuning from an interview perspective.

Steps for GC tuning

The general steps of GC optimization can be summarized as: determine the goals, tune the parameters, and verify the results.

1. Determine the goals

For example:

  • High availability: what level of availability is required (how many "nines")
  • Low latency: within how many milliseconds must a request complete its response
  • High throughput: how many transactions per second must be handled

2. Optimize parameters

The following refers to the practice of the Meituan technical team. If you focus on high availability and low latency, you need a quantitative indicator. For example, with N GCs within a time window T, the proportion of requests affected by GC = (interface response time + GC pause time) × N / T. It follows that reducing either the duration of a single GC or the number of GCs effectively reduces the impact of GC on response time.
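To make the formula concrete, here is a small worked example; the numbers are assumed for illustration, not taken from the Meituan case:

```
Assume T = 5 minutes = 300,000 ms, N = 500 Minor GCs in that window,
interface response time = 50 ms, single GC pause = 25 ms.

Proportion of requests affected by GC
  = (interface response time + GC pause) × N / T
  = (50 + 25) × 500 / 300,000
  = 12.5%
```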


Specific optimization:

Collect GC log information and, combined with the system requirements, determine the optimization plan, such as selecting an appropriate GC collector, resetting memory ratios, adjusting JVM parameters, and so on.

3. Verify the optimization results

Roll the change out to all servers, judge whether the optimization results meet expectations, and summarize the lessons learned.

Next, we use three cases to practice the above optimization process and basic principles (all three cases in this article use ParNew + CMS as the garbage collectors, with Serial Old as the fallback when CMS fails).

GC optimization cases

1. Frequent Major GC and Minor GC

Suppose the interviewer gives us a scenario: Minor GC runs 100 times per minute and Major GC runs once every 4 minutes; a single Minor GC takes 25 ms, a single Major GC takes 200 ms, and the interface response time is 50 ms.

  • First, we can see that Minor GC is very frequent, so we start with Minor GC.

Optimizing the frequent-Minor-GC problem:

1. Appropriately increase the size of the young generation

Usually the young generation is small, so the Eden area fills up quickly, which leads to frequent Minor GCs; the frequency of Minor GC can therefore be reduced by enlarging the young generation. For example, with the same memory allocation rate, doubling the Eden area roughly halves the number of Minor GCs. At the same time, **the duration of a single Minor GC depends mainly on the number of objects that survive the collection, not on the size of the Eden area.** Therefore, if the young generation mostly contains short-lived objects, enlarging it will not significantly increase the time of a single Minor GC.



  • If the application creates a large number of short-lived objects, choose a larger young generation; if relatively many objects are long-lived, appropriately enlarge the old generation (a sizing-flag sketch follows).
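As a reference, a hedged sketch of the HotSpot flags typically used to enlarge the young generation; the sizes below are illustrative assumptions, not values from the case:

```
-Xms4g -Xmx4g        # fix the total heap size (illustrative)
-Xmn2g               # set the young generation to 2 GB (or use -XX:NewRatio to set the old/young ratio)
-XX:SurvivorRatio=8  # Eden : Survivor0 : Survivor1 = 8 : 1 : 1
```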

2. GC occurs during the peak request period, resulting in reduced service availability

The GC logs show that at peak time the CMS Remark phase takes 1.39 s. The Remark phase is Stop-The-World (hereafter STW): while it runs, every thread in the Java application except the garbage-collector threads is suspended, which means all of the user's worker threads are paused during this period. That is unacceptable for a low-latency service, so the goal of this optimization is to reduce the Remark time.


Optimizing the long marking pause:

Before solving the problem, let's review the four main phases of CMS and what each phase does.

  • 1. Initial mark (Init-mark, STW): performs reachability analysis and marks only the objects directly reachable from the GC Roots, so it is very fast.
  • 2. Concurrent mark: starting from the objects marked in the initial-mark phase, this phase marks all reachable objects while the application keeps running.
  • 3. Remark (STW): suspends all user threads, rescans objects in the heap, performs reachability analysis again, and marks live objects. **Because the concurrent marking phase runs concurrently with the user threads, a user thread may modify a field of a live object during that time so that it points to a not-yet-marked object.** An object that was unreachable when concurrent marking started can thus become reachable because its references changed during the concurrent period. Such objects need to be remarked in this phase to prevent them from being cleaned up in the next phase, and this process also requires STW. Note in particular that in this phase, objects in the young generation are also used as roots when determining whether an object is alive.
  • 4. Concurrent sweep: reclaims garbage concurrently with the application.


The Remark phase mainly rescans the heap to determine whether objects are alive. So, to judge liveness accurately, which objects need to be scanned?

  • Pay attention to cross-generational references: the young generation may hold references to old-generation objects.


Because such references exist, the Remark phase must scan the entire heap to determine whether objects are alive, including objects that are in fact unreachable.

Young-generation GC and old-generation GC are performed separately and independently. Only during a Minor GC does the JVM use the root-tracing algorithm to mark which young-generation objects are reachable; outside of a Minor GC, dead young-generation objects are not marked as unreachable, so CMS cannot tell which of them are alive and can only scan the whole heap (young generation + old generation). It follows that the number of objects in the heap affects how long the Remark phase takes. Analyzing the GC logs shows the same pattern: whenever Remark takes more than 500 ms, young-generation usage is above 75%. The problem of reducing the Remark time is thus converted into the problem of reducing the number of objects in the young generation.


Objects in the young generation are mostly short-lived ("die young"), so **if a Minor GC is executed before Remark, most of them will be reclaimed. CMS adopts exactly this approach: it adds an interruptible concurrent precleaning phase (CMS-concurrent-abortable-preclean) before Remark.** The main work of this phase is still to concurrently mark whether objects are alive, but the phase can be aborted.

In addition, to avoid waiting indefinitely for a Minor GC in this phase, CMS provides the parameter CMSMaxAbortablePrecleanTime, which defaults to 5 s: if the abortable precleaning runs for more than 5 s, the phase is aborted and Remark begins regardless of whether a Minor GC has occurred. For this situation, CMS provides the CMSScavengeBeforeRemark parameter, which guarantees that a Minor GC is forced before Remark.

After adding the CMSScavengeBeforeRemark parameter, GC pauses longer than 200 ms disappeared. Monitoring shows that GC time now tracks the business traffic, with no obvious spikes.
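A minimal sketch of the flags involved in this optimization, added on top of the existing ParNew + CMS configuration (the preclean value shown is the default):

```
-XX:+CMSScavengeBeforeRemark          # force a Minor GC before the CMS Remark phase
-XX:CMSMaxAbortablePrecleanTime=5000  # abortable-preclean time limit in milliseconds (default 5000)
```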

Summary

Through this case we learned that, because of cross-generational references, CMS must scan the entire heap during the Remark phase. To avoid having many young-generation objects at scan time, CMS adds an interruptible precleaning phase that waits for a Minor GC to occur; but this phase has a time limit, and if it times out without a Minor GC, the young generation will still hold many objects at Remark time. Our tuning strategy was to use a parameter to force a Minor GC before Remark, thereby shortening the Remark phase.

Thinking further: since the old generation may hold references to young-generation objects, the old generation would, in principle, also have to be scanned during a Minor GC.

How does the JVM avoid scanning the full heap during a Minor GC? Statistics show that less than 1% of old-generation objects hold references to young-generation objects. Based on this observation, the JVM introduces the card table for this purpose.

The specific strategy of the card table is to divide the old-generation space into cards of 512 bytes each. The card table itself is a byte array in which each element corresponds to one card. **When an old-generation object stores a reference to a young-generation object, the virtual machine sets the card-table element for the corresponding card to the appropriate value,** marking that card as dirty (the card table has a second role: identifying which blocks were modified during the concurrent marking phase). A Minor GC can then quickly identify, by scanning the card table, which cards contain references from the old generation to the young generation, and only those cards need to be scanned. In this way, the virtual machine trades space for time and avoids a full-heap scan.
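To illustrate the mechanism, here is a conceptual Java sketch of a card-table write barrier. It is pseudocode under simplifying assumptions, not HotSpot source:

```java
// Conceptual sketch of a card-table write barrier (not HotSpot source code).
// The old generation is divided into 512-byte cards; one byte of the card
// table tracks each card.
class CardTableSketch {
    static final int CARD_SHIFT = 9;   // 2^9 = 512-byte cards
    static final byte DIRTY = 0;

    final byte[] cardTable;            // one byte per card
    final long oldGenBase;             // assumed start address of the old generation

    CardTableSketch(long oldGenBase, long oldGenSize) {
        this.oldGenBase = oldGenBase;
        this.cardTable = new byte[(int) (oldGenSize >>> CARD_SHIFT) + 1];
    }

    // Called by the write barrier whenever a reference field at 'fieldAddress'
    // (inside an old-generation object) is updated.
    void onReferenceStore(long fieldAddress) {
        int cardIndex = (int) ((fieldAddress - oldGenBase) >>> CARD_SHIFT);
        cardTable[cardIndex] = DIRTY;  // mark the card dirty
    }

    // During a Minor GC, only dirty cards are scanned for old-to-young references.
    boolean needsScan(int cardIndex) {
        return cardTable[cardIndex] == DIRTY;
    }
}
```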

3. A Stop-The-World GC occurs

  • Determine the target

The GC log shows that this Full GC took 1.23 s. This online service also requires low latency and high availability, so the goal of this optimization is to reduce the pause time of a single STW collection and improve availability.


Problem analysis:

First of all, when might a STW Full GC be triggered?

  1. Insufficient Perm (permanent generation) space;

  2. Promotion failed or concurrent mode failure occurs during CMS GC (concurrent mode failure generally means that while CMS is in progress, the old generation runs short of space and objects no longer in use must be reclaimed as soon as possible; all threads are stopped, the CMS cycle is terminated, and a Serial Old GC is performed directly);

  3. Statistics indicate that the average amount promoted to the old generation by Young GCs is larger than the remaining space in the old generation;

  4. A Full GC is triggered deliberately (for example by executing jmap -histo:live [pid]) to avoid fragmentation problems.

Then, let's analyze them one by one:

  • Rule out reason 2: in those two situations the log contains distinctive markers, and none are present here.
  • Rule out reason 3: according to the GC log, old-generation usage was only about 20% at the time, and there were no large objects over 2 GB.
  • Rule out reason 4: no such command was executed at the time.
  • Confirm reason 1: the log shows that the Perm area became larger after the Full GC, so we infer that the Full GC was caused by insufficient permanent-generation space followed by expansion.

Solving the expansion problem caused by insufficient permanent-generation space:

  1. Set the -XX:PermSize and -XX:MaxPermSize parameters to the same value, forcing the virtual machine to fix the permanent generation's capacity at startup and avoiding automatic expansion at runtime.
  2. By default, CMS does not collect the Perm area. With the CMSPermGenSweepingEnabled and CMSClassUnloadingEnabled parameters, CMS can collect it when the Perm area runs short of space (see the flag sketch below).
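A hedged flag sketch for this fix on JDK 7 and earlier (the size is an illustrative assumption); note that JDK 8+ replaces the permanent generation with Metaspace, where -XX:MetaspaceSize / -XX:MaxMetaspaceSize play the analogous role:

```
-XX:PermSize=512m -XX:MaxPermSize=512m  # fix the perm-gen capacity at startup (illustrative size)
-XX:+CMSPermGenSweepingEnabled          # let CMS sweep the perm area
-XX:+CMSClassUnloadingEnabled           # let CMS unload classes when perm space runs short
```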

Analysis and solutions of nine common CMS GC problems in Java

2. GC basics

  • TLAB: short for Thread Local Allocation Buffer. Using CAS, mutator threads can preferentially allocate objects in a small piece of Eden that belongs exclusively to the thread; because there is no lock contention in this thread-private region, allocation is faster. Each TLAB is exclusive to one thread (a few related flags are sketched after this list).
  • Card Table: used mainly to mark the state of card pages; each card-table entry corresponds to one card page. When a reference field of an object in a card page is written, the write barrier marks the corresponding card-table entry as dirty. The card table essentially exists to solve the problem of cross-generational references.
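For reference, a few related HotSpot flags, shown as a sketch; TLAB is enabled by default and its size is normally left to the JVM's adaptive sizing:

```
-XX:+UseTLAB       # enable TLAB allocation (the default in HotSpot)
-XX:TLABSize=256k  # initial TLAB size (illustrative; usually left adaptive)
-XX:+PrintTLAB     # print TLAB statistics with JDK 8-style GC logging
```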

2.3 Allocating objects

Object address operations in Java are mainly performed through Unsafe, which calls C's allocate and free methods. There are two allocation approaches:

  • Free list: records the free addresses in additional storage, turning random I/O into sequential I/O, but at the cost of extra space.
  • Bump pointer: a pointer serves as the boundary between allocated and free space; to allocate, the pointer is simply moved toward the free side by a distance equal to the object's size. Allocation is very efficient, but the applicable scenarios are limited (see the sketch after this list).
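A conceptual Java sketch of bump-pointer allocation, contrasted with a free list in the closing comment; this is illustrative pseudocode, not JVM source:

```java
// Conceptual sketch of bump-pointer allocation (not JVM source code).
class BumpPointerAllocator {
    private long top;          // current allocation pointer
    private final long end;    // end of the contiguous free region

    BumpPointerAllocator(long start, long end) {
        this.top = start;
        this.end = end;
    }

    // Allocate 'size' bytes by simply bumping the pointer; O(1) and very fast,
    // but it requires a contiguous free region (hence its use with copying/compacting GCs).
    long allocate(long size) {
        if (top + size > end) {
            return -1L;        // region exhausted; the caller would trigger a GC or request a new region
        }
        long result = top;
        top += size;
        return result;
    }
}
// A free-list allocator, by contrast, keeps a list of free blocks and searches it
// on every allocation, which is slower but works with fragmented memory.
```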

2.4 Collecting objects

  • Reference Counting: keep a count of the references to each object; the counter is incremented whenever a reference to the object is created and decremented when a reference becomes invalid. The count is stored in the object header, and an object whose count is greater than 0 is considered alive. Although the circular-reference problem can be solved by the Recycler algorithm, updating the count in a multi-threaded environment requires expensive synchronization, so performance is low. Early programming languages often used this algorithm.
  • Reachability analysis, also called the reference-chain method (Tracing GC): the search starts from the GC Roots, and every object that can be reached is a live object. An object that cannot be reached is not immediately judged dead; a second marking pass determines this more precisely, and objects outside the connected graph are collected as garbage. Mainstream Java virtual machines currently use this algorithm.

Which objects can serve as GC Roots?

Objects referenced in the virtual machine stack (local variable tables), objects referenced by class static fields in the method area, objects referenced by constants in the method area, and objects referenced by JNI in the native method stack.

2.4.1 Collection algorithms

  • Mark-Sweep: the collection is divided into two phases. The first is the Tracing phase: starting from the GC Roots, the object graph is traversed and every object encountered is marked. The second is the Sweep phase: the collector examines every object in the heap and reclaims all unmarked ones; no objects are moved during the whole process. Implementations use techniques such as Tricolor Abstraction and BitMaps to improve efficiency, and the algorithm is more efficient when many objects survive (a minimal sketch follows this list).
  • Mark-Compact: mainly solves the fragmentation problem of non-moving collectors. It is also divided into two phases: the first is similar to Mark-Sweep, and in the second the surviving objects are relocated according to a compaction order (Compaction Order). The main implementations include the Two-Finger algorithm, the sliding-compaction (Lisp2) algorithm, and Threaded Compaction.
  • Copying: the space is divided into two halves of equal size, From and To, only one of which is in use at a time; on each collection, the live objects in one half are copied to the other half. There are recursive (proposed by Robert R. Fenichel and Jerome C. Yochelson) and iterative (proposed by Cheney) implementations, as well as an approximate priority-search variant that addresses the recursion-stack and cache-line problems of the first two. A copying collector can allocate memory quickly by bumping a pointer, but its space utilization is low, and when the surviving objects are large, the copying cost is high.
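A minimal Java sketch of the Tracing (mark) and Sweep steps described above, over a simplified object graph; this is illustrative only, not how HotSpot actually represents objects:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Minimal sketch of the Mark-Sweep algorithm (illustrative, not JVM source).
class MarkSweepSketch {
    static class Obj {
        boolean marked;
        List<Obj> references = List.of();  // outgoing references of this object
    }

    // Tracing phase: mark every object reachable from the GC roots using a work stack.
    static void mark(List<Obj> gcRoots) {
        Deque<Obj> stack = new ArrayDeque<>(gcRoots);
        while (!stack.isEmpty()) {
            Obj obj = stack.pop();
            if (obj.marked) continue;
            obj.marked = true;
            stack.addAll(obj.references);
        }
    }

    // Sweep phase: reclaim every unmarked object, then reset the marks.
    // (Assumes 'heap' is a mutable list of all allocated objects.)
    static void sweep(List<Obj> heap) {
        heap.removeIf(obj -> !obj.marked);
        heap.forEach(obj -> obj.marked = false);
    }
}
```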

2.5 Collectors

2.5.1 Generational collectors

  • ParNew: a multi-threaded collector that uses the copying algorithm and works mainly in the Young area; the number of collection threads can be controlled with the -XX:ParallelGCThreads parameter. The whole process is STW, and it is often used in combination with CMS.
  • CMS: aims for the shortest collection pause time; it uses the mark-sweep algorithm and collects garbage in four major phases, of which initial mark and remark are STW (a typical flag combination is sketched after this list).
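A typical way to enable the ParNew + CMS pairing used throughout this article, shown as a hedged sketch; the thread count and threshold are illustrative:

```
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC  # ParNew for the young generation, CMS for the old generation
-XX:ParallelGCThreads=4                   # number of ParNew collection threads (illustrative)
-XX:CMSInitiatingOccupancyFraction=75     # start a CMS cycle at 75% old-generation occupancy
-XX:+UseCMSInitiatingOccupancyOnly        # treat the threshold above as a hard trigger
```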

2.5.2 Region-based (partition) collectors

  • G1: a server-oriented garbage collector for machines with many processors and a large amount of memory; it achieves high throughput while meeting garbage-collection pause-time goals as far as possible.
  • ZGC: a low-latency garbage collector introduced in JDK 11, suitable for memory management and collection in large-memory, low-latency services. In the SPECjbb 2015 benchmark with a 128 GB heap, its maximum pause time was only 1.68 ms, far better than G1 and CMS.
  • Shenandoah: developed by a Red Hat team; similar to G1, it is a Region-based garbage collector, but it needs no Remembered Set or Card Table to record cross-Region references, and its pause time is independent of heap size and close to ZGC's (selection flags for these collectors are sketched after this list).
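For reference, a hedged sketch of how these collectors are selected on the command line; availability depends on the JDK version and build:

```
-XX:+UseG1GC -XX:MaxGCPauseMillis=200  # G1 with a pause-time goal (G1 is the default collector since JDK 9)
-XX:+UseZGC                            # ZGC (JDK 11-14 also require -XX:+UnlockExperimentalVMOptions)
-XX:+UseShenandoahGC                   # Shenandoah (only in builds that ship it; experimental before JDK 15)
```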

2.5.3 Commonly used collectors


3.1 Determining whether there is a GC problem

3.1.1 Evaluation criteria

  • Latency: can also be understood as the maximum pause time, i.e., the longest single STW pause during garbage collection. The shorter the better; a certain increase in GC frequency is acceptable in exchange, and reducing pause time is the main direction of GC technology development.
  • Throughput: over the application's lifetime, GC threads take CPU cycles away from the mutator, so throughput is the percentage of the system's total running time that the mutator spends doing effective work.

At present, the systems of major Internet companies mostly pursue low latency, to avoid hurting the user experience with overly long GC pauses. The measurement targets need to be set in combination with the application service's SLA.
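As a quick illustrative calculation of the throughput metric (the numbers are assumed):

```
If, over a 10-minute window (600 s), GC consumes a total of 12 s of pause/CPU time,
then throughput = (600 − 12) / 600 = 98%, i.e. the GC overhead is about 2%.
```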


3.1.2 Understand the corresponding GC Cause


Several GC Causes that need attention:

  • System.gc(): a manually triggered GC operation (two related flags are sketched after this list).
  • CMS: actions taken during a CMS GC; pay particular attention to the two STW phases, CMS Initial Mark and CMS Final Remark.
  • Promotion Failure: There is not enough space in the Old area for objects promoted in the Young area (even though the total available memory is large enough).
  • Concurrent Mode Failure: during a CMS GC, the space reserved in the Old area is not enough to allocate new objects; the collector then degrades, seriously hurting GC performance. The case below is exactly this scenario.
  • GCLocker Initiated GC: if a GC is needed while threads are executing in a JNI critical region, the GC Locker prevents the GC from happening and prevents other threads from entering the critical region; the GC is triggered once the last thread exits the critical region.
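Two flags commonly used when dealing with System.gc(), shown as a reference sketch rather than a recommendation:

```
-XX:+DisableExplicitGC            # turn System.gc() into a no-op (beware: may delay cleanup of direct/off-heap memory)
-XX:+ExplicitGCInvokesConcurrent  # make System.gc() trigger a concurrent CMS/G1 cycle instead of a STW Full GC
```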

3.3.2 Classification of GC problems

  • Unexpected GC: a GC that occurs unexpectedly and did not actually need to happen; we can avoid it by some means.
    • Space Shock: Space shock problem, see "Scenario 1: Space shock caused by dynamic capacity expansion".
    • Explicit GC: problems with explicitly invoked GC, see "Scenario 2: Going and Staying of Explicit GC".
  • Partial GC: GC of partial collection operation, which only recycles certain generations/partitions.
    • Young GC: Young area collection action in generational collection, also called Minor GC.
      • ParNew: Young GC is frequent, see "Scenario 4: Premature promotion".
    • Old GC: the old-area collection action in generational collection, also called Major GC; some people call it Full GC as well, but that naming is not standard. It is a Full GC only when CMS degrades to a Foreground GC. The CMSScavengeBeforeRemark parameter also only triggers one Young GC before Remark.
      • CMS: Old GC is frequent, see "Scenario 5: CMS Old GC is frequent".
      • CMS: Old GC is infrequent but takes a long time, see "Scenario 6: A single CMS Old GC takes a long time".
  • Full GC: a collection of the entire heap. The STW time is usually long, and once it occurs the impact is large; it is also sometimes called Major GC. See "Scenario 7: Memory Fragmentation & Collector Degradation".
  • MetaSpace: Metaspace reclamation causes problems, see "Scenario 3: OOM in MetaSpace area".
  • Direct Memory: The reclamation of direct memory (also called off-heap memory) causes problems, see "Scenario 8: OOM of off-heap memory".
  • JNI: The local Native method causes problems, see "Scenario 9: GC problems caused by JNI".

Commonly used JVM parameter settings

| Parameter | Description |
| --- | --- |
| -Xms | Sets the JVM's initial heap size. It is recommended to set it to the same value as -Xmx, to avoid the JVM reallocating memory after every garbage collection. |
| -Xmx | Sets the maximum heap size available to the JVM. To avoid container OOM, reserve enough memory for the system. |
| -XX:+PrintGCDetails | Prints GC details. |
| -XX:+PrintGCDateStamps | Prints GC timestamps in date format, e.g. 2019-12-24T21:53:59.234+0800. |
| -Xloggc:/home/admin/nas/gc-${POD_IP}-$(date '+%s').log | GC log file path. Make sure the container path for the log file already exists. It is recommended to mount the container path to a NAS directory or collect the logs to SLS, so that the directory is created automatically and the logs are persisted. |
| -XX:+HeapDumpOnOutOfMemoryError | Automatically generates a heap dump file when the JVM runs out of memory (OOM). |
| -XX:HeapDumpPath=/home/admin/nas/dump-${POD_IP}-$(date '+%s').hprof | Heap dump file path. Make sure the container path for the dump file already exists. It is recommended to mount the container path to a NAS directory so that the directory is created automatically and the dump is persisted. |
A combined reference example: -Xms2048m -Xmx2048m -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/home/admin/nas/gc-${POD_IP}-$(date '+%s').log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/admin/nas/dump-${POD_IP}-$(date '+%s').hprof




Origin blog.csdn.net/weixin_59823583/article/details/129428748