FGC troubleshooting (reposted from an expert's article)

GC problems in online services are a very typical class of Java problem and a real test of an engineer's troubleshooting ability. They are also an almost obligatory interview topic, yet few people can really answer well: they either don't understand the principles or lack practical experience.

In the past six months, our advertising system has had many online problems related to GC: Full GC (FGC) that was too frequent, and Young GC (YGC) that took too long. The impact is that the program stalls during GC, which leads to service timeouts and in turn hurts advertising revenue.

In this article, I use a case of frequent FGC in production as a starting point to walk through the GC troubleshooting process in detail. I then give a practical guide based on the operating principle of GC, which I hope will be helpful to you. The content is divided into the following 3 parts:

  • Starting from a case of frequent FGC in production

  • Introduction to the operating principle of GC

  • A practical guide for troubleshooting FGC problems

 

01 Starting from a case of frequent FGC in production

In October last year, after a release went live, our advertising recall system started receiving frequent FGC alerts. The monitoring chart showed FGC running about once every 35 minutes on average, whereas before the release it ran about once every 2 days. The troubleshooting process is described in detail below.

1. Check the JVM configuration

View the JVM startup parameters with the following command; the relevant flags from its output are listed after it:

ps aux | grep "applicationName=adsearch"

-Xms4g -Xmx4g -Xmn2g -Xss1024K 

-XX:ParallelGCThreads=5 

-XX:+UseConcMarkSweepGC 

-XX:+UseParNewGC 

-XX:+UseCMSCompactAtFullCollection 

-XX:CMSInitiatingOccupancyFraction=80

It can be seen that the heap is 4G, with 2G for the young generation and 2G for the old generation. The young generation uses the ParNew collector, and the old generation uses the CMS (concurrent mark-sweep) collector. When old-generation usage reaches 80%, CMS starts collecting the old generation, which is what shows up as FGC in our monitoring.

Running jmap -heap 7276 | head -n20 further shows that the Eden area of the young generation is 1.6G, and the S0 and S1 areas are 0.2G each.
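As a quick cross-check of these numbers, the sizes can be derived from the flags themselves. The following is a minimal Java sketch, assuming the default -XX:SurvivorRatio=8 (no such flag appears in the startup parameters above). The class and method names are made up for illustration, and the real JVM aligns region sizes, so its figures can differ slightly.

```java
// Sketch: derive generation sizes from the startup flags of the case.
// Assumes the default -XX:SurvivorRatio=8 (Eden : S0 : S1 = 8 : 1 : 1).
final class HeapLayout {
    static final long MB = 1024L * 1024L;

    /** Eden size for a young generation of youngBytes and the given SurvivorRatio. */
    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes * survivorRatio / (survivorRatio + 2);
    }

    /** Size of one Survivor space (S0 or S1). */
    static long survivorBytes(long youngBytes, int survivorRatio) {
        return youngBytes / (survivorRatio + 2);
    }

    /** Old-generation occupancy at which CMS starts a collection. */
    static long cmsTriggerBytes(long heapBytes, long youngBytes, int occupancyPercent) {
        return (heapBytes - youngBytes) * occupancyPercent / 100;
    }

    public static void main(String[] args) {
        long heap = 4096 * MB;   // -Xms4g -Xmx4g
        long young = 2048 * MB;  // -Xmn2g
        System.out.println("Eden MB:        " + edenBytes(young, 8) / MB);
        System.out.println("Survivor MB:    " + survivorBytes(young, 8) / MB);
        System.out.println("Old MB:         " + (heap - young) / MB);
        // -XX:CMSInitiatingOccupancyFraction=80
        System.out.println("CMS trigger MB: " + cmsTriggerBytes(heap, young, 80) / MB);
    }
}
```

With the flags of this case, the sketch gives about 1638 MB for Eden, 204 MB for each Survivor space, 2048 MB for the old generation, and a CMS trigger point of about 1638 MB, which matches the 1.6G / 0.2G figures reported by jmap.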

2. Observe memory changes in the old generation

By observing old-generation usage, we could see that after each FGC the used memory dropped back to about 500M, so we ruled out a memory leak.
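The same observation can also be made from inside a Java process through the standard java.lang.management API. This is only a sketch: the class name OldGenUsage is hypothetical, and matching the pool by the substrings "Old Gen" or "Tenured" is an assumption about HotSpot pool names (for example "CMS Old Gen"). It would have to run inside the service being observed, or be adapted to a remote JMX connection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class OldGenUsage {
    static final long MB = 1024L * 1024L;

    // Assumption: HotSpot names its old-generation pool "... Old Gen" or "Tenured Gen".
    static boolean isOldGenPool(String poolName) {
        return poolName.contains("Old Gen") || poolName.contains("Tenured");
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (!isOldGenPool(pool.getName())) {
                continue;
            }
            MemoryUsage now = pool.getUsage();
            // Usage measured right after the most recent collection of this pool;
            // null if the JVM does not support it for this pool.
            MemoryUsage afterGc = pool.getCollectionUsage();
            if (now != null) {
                System.out.println(pool.getName() + " used now: " + now.getUsed() / MB + " MB");
            }
            if (afterGc != null) {
                System.out.println(pool.getName() + " used after last GC: "
                        + afterGc.getUsed() / MB + " MB");
            }
        }
    }
}
```

If the "used after last GC" figure keeps returning to roughly the same level, as the 500M did here, a leak is unlikely; if it climbs steadily from one collection to the next, a leak is the first thing to suspect.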

3. View the objects in the heap memory through the jmap command

Run the following command:

jmap -histo 7276 | head -n20

The output lists the objects in the heap sorted by memory occupied, showing the instance count, the bytes used, and the class name for each class. The first entry was int[], and its memory footprint was far larger than that of any other class. At this point, int[] became our prime suspect.

4. Dump the heap for further analysis

Having singled out int[], we decided to dump the heap and trace the source of these objects with a visualization tool. Because the program is paused while the heap is being dumped, we first removed this node from the service management platform, and then dumped the heap with the following command:

jmap -dump:format=b,file=heap 7276

Importing the dumped file into JVisualVM also shows the space occupied by each class; int[] accounted for more than 50% of the memory. Drilling down to the business object that holds the int[] showed that it came from the codis base component provided by the architecture team.

5. Analyze the suspicious object in the code

Code analysis showed that the codis base component creates an int array of about 40M every minute, which is used to compute TP99 and TP90 statistics. The array lives for one minute. When observing the old generation in step 2, we had seen its usage grow by a little more than 40M per minute, so we inferred that this 40M int array was being promoted from the young generation to the old generation.

We then checked the YGC frequency monitoring, which showed about 8 YGCs per minute. This basically confirmed our inference: with the CMS collector the default tenuring threshold is 6, meaning that an object that survives 6 YGCs is promoted to the old generation. At 8 YGCs per minute, the large array in the codis component, which lives for 1 minute, survives long enough to meet this condition.

At this point the investigation was basically complete. So why didn't the problem occur before the release? The monitoring showed that YGC ran about 5 times per minute before the release and about 8 times per minute afterwards. At 5 YGCs per minute the array died before it had survived 6 collections; at 8 per minute it did not.

6. Solution

To fix the problem quickly, we raised the tenuring threshold of the CMS collector to 15 (the -XX:MaxTenuringThreshold parameter). After the change, the FGC frequency returned to once every 2 days. If the YGC frequency ever exceeds 15 per minute, however, the problem will be triggered again. The fundamental solution is to optimize the program to reduce the YGC frequency and to shorten the lifetime of the int array in the codis component; we will not go into that here.
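The reasoning in steps 5 and 6 can be checked with some rough arithmetic. The sketch below is a simplified model, not a description of the JVM internals: it assumes YGCs are evenly spaced and that an object is promoted once it has stayed alive long enough to survive the threshold number of YGCs. The class and method names are hypothetical; the numbers (60-second lifetime, 40M per minute, 500M baseline, 80% of a 2G old generation) come from the case above.

```java
final class PromotionCheck {
    /** Seconds an object must stay alive to survive tenuringThreshold young collections. */
    static double secondsToPromotion(int tenuringThreshold, double ygcPerMinute) {
        return tenuringThreshold * 60.0 / ygcPerMinute;
    }

    /** True if an object with the given lifetime lives long enough to be promoted. */
    static boolean isPromoted(double lifetimeSeconds, int tenuringThreshold, double ygcPerMinute) {
        return lifetimeSeconds >= secondsToPromotion(tenuringThreshold, ygcPerMinute);
    }

    /** Minutes for the old generation to grow from its post-GC baseline to the trigger level. */
    static double minutesBetweenFgc(double triggerMb, double baselineMb, double growthMbPerMinute) {
        return (triggerMb - baselineMb) / growthMbPerMinute;
    }

    public static void main(String[] args) {
        double lifetime = 60.0; // the codis int array lives for about one minute
        System.out.println("threshold 6,  5 YGC/min: promoted = " + isPromoted(lifetime, 6, 5));
        System.out.println("threshold 6,  8 YGC/min: promoted = " + isPromoted(lifetime, 6, 8));
        System.out.println("threshold 15, 8 YGC/min: promoted = " + isPromoted(lifetime, 15, 8));
        // 80% of a 2048 MB old generation, 500 MB left after each FGC, 40 MB promoted per minute
        System.out.println("minutes between FGCs: " + minutesBetweenFgc(0.8 * 2048, 500, 40));
    }
}
```

Under this model, an object needs to live 72 seconds to be promoted at 5 YGCs per minute but only 45 seconds at 8 per minute, so the one-minute array is promoted only after the release. With a threshold of 15 it would need 112.5 seconds at 8 YGCs per minute, so it dies in the young generation again. The estimated interval between FGCs comes out at roughly half an hour, the same order as the 35 minutes observed.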

 

02 Introduction to the operating principle of GC

The analysis of the case above actually relies on a good deal of knowledge about how GC works. If you start troubleshooting without understanding these principles, the whole investigation is little more than groping in the dark.

Here I pick out a few core points to explain the operating principle of GC, and then give a practical guide.

1. Heap memory structure

As everyone knows, GC is divided into YGC and FGC, both of which operate on the JVM heap. First, look at the heap structure in JDK 8:

The heap uses a generational structure, consisting of the young generation and the old generation. The young generation is further divided into the Eden area, the From Survivor area (S0 for short), and the To Survivor area (S1 for short), with a default ratio of 8:1:1. The default ratio of the young generation to the old generation is 1:2.

The heap is generational because most objects are short-lived. Placing objects with different lifetimes in different areas lets the young generation and the old generation use different garbage collection algorithms, which makes GC as efficient as possible.

2. When is YGC triggered?

In most cases, objects are allocated directly in the Eden area of the young generation. If Eden does not have enough space, a YGC (Minor GC) is triggered. YGC processes only the young generation. Because most objects become garbage within a short time, only a few survive a YGC, and these are moved to the S0 area (using a copying algorithm).

When the next YGC is triggered, the surviving objects in Eden and S0 are moved to S1, and Eden and S0 are cleared. On the YGC after that, the areas processed are Eden and S1 (that is, S0 and S1 swap roles). Each time an object survives a YGC, its age increases by one.

3. When is FGC triggered?

Objects enter the old generation in the following four situations:

  • During a YGC, the To Survivor area is too small to hold the surviving objects, so the objects that do not fit go directly to the old generation.

  • After surviving many YGCs, an object whose age reaches the configured threshold is promoted to the old generation.

  • Dynamic age determination. If the objects in the Survivor area, added up from the youngest age upward, exceed half of the Survivor space at some age (half being the default -XX:TargetSurvivorRatio of 50), objects of that age or older go directly to the old generation without having to reach the configured tenuring threshold.

  • Large objects: controlled by the -XX:PretenureSizeThreshold startup parameter. An object larger than this value bypasses the young generation and is allocated directly in the old generation.
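The dynamic age rule is easier to follow in code. The sketch below is my reading of how HotSpot picks the threshold after each YGC: it adds up the surviving bytes from age 1 upward and stops at the first age where the running total exceeds the target share of one Survivor space. The class name, method name, and sample sizes are hypothetical, and this is a simplification rather than the JVM's actual source.

```java
final class TenuringThreshold {
    /**
     * @param bytesByAge           bytesByAge[a] = size of surviving objects of age a (index 0 unused)
     * @param survivorCapacity     capacity of one Survivor space, in the same unit
     * @param targetSurvivorRatio  -XX:TargetSurvivorRatio, 50 by default
     * @param maxTenuringThreshold -XX:MaxTenuringThreshold
     * @return the age at which objects will be promoted on the next YGC
     */
    static int compute(long[] bytesByAge, long survivorCapacity,
                       int targetSurvivorRatio, int maxTenuringThreshold) {
        long desired = survivorCapacity * targetSurvivorRatio / 100;
        long total = 0;
        for (int age = 1; age < bytesByAge.length; age++) {
            total += bytesByAge[age];
            if (total > desired) {
                // Survivor space is filling up: lower the threshold to this age.
                return Math.min(age, maxTenuringThreshold);
            }
        }
        // Survivors fit comfortably: keep the configured maximum.
        return maxTenuringThreshold;
    }

    public static void main(String[] args) {
        // Hypothetical survivor contents in MB: 60 MB of age 1, 50 MB of age 2, 10 MB of age 3.
        long[] sizes = {0, 60, 50, 10};
        System.out.println(compute(sizes, 200, 50, 6)); // 60 + 50 exceeds 100, so the result is 2
    }
}
```

In this example the objects of age 2 and above are promoted early, even though the configured threshold is 6. This is one reason an object can reach the old generation sooner than the tenuring threshold alone would suggest.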

When the objects being promoted need more space than the old generation has left, FGC (Major GC) is triggered. FGC processes both the young generation and the old generation. In addition, the following 4 situations also trigger FGC:

  • Old-generation usage reaches a certain threshold (adjustable by parameter), which directly triggers FGC.

  • Space allocation guarantee: before a YGC, the JVM first checks whether the largest contiguous free space in the old generation is greater than the total size of all objects in the young generation. If it is, the YGC is safe. If not, the JVM checks whether the HandlePromotionFailure parameter allows the guarantee to fail. If it does not, Full GC is triggered directly. If it does, the JVM further checks whether the largest contiguous free space in the old generation is greater than the average size of the objects promoted to the old generation in previous collections, and triggers Full GC if it is smaller. (Since JDK 6 Update 24 the HandlePromotionFailure parameter no longer affects this policy; the JVM behaves as if the guarantee were always allowed to fail.)

  • Metaspace expands when it runs short of space, and when its usage reaches the value specified by the -XX:MetaspaceSize parameter, FGC is also triggered.

  • An explicit call to System.gc() or Runtime.getRuntime().gc() triggers FGC.
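The space allocation guarantee is essentially a small decision procedure, which the following sketch spells out. It follows the rule as described in the list above rather than the JVM's source code; the class, enum, and parameter names are hypothetical.

```java
final class PromotionGuarantee {
    enum Decision { YOUNG_GC, FULL_GC }

    /**
     * Decides what to run when a young collection is about to start.
     *
     * @param maxContiguousOldFree   largest contiguous free space in the old generation
     * @param youngUsed              total size of all objects in the young generation
     * @param avgPromoted            average size promoted by previous young collections
     * @param handlePromotionFailure whether the guarantee is allowed to fail
     */
    static Decision beforeYoungGc(long maxContiguousOldFree, long youngUsed,
                                  long avgPromoted, boolean handlePromotionFailure) {
        if (maxContiguousOldFree > youngUsed) {
            return Decision.YOUNG_GC; // even if everything survives, it fits: safe
        }
        if (!handlePromotionFailure) {
            return Decision.FULL_GC;  // taking the risk is not allowed
        }
        // Risky young GC is allowed only if past promotions suggest it will fit.
        return maxContiguousOldFree > avgPromoted ? Decision.YOUNG_GC : Decision.FULL_GC;
    }

    public static void main(String[] args) {
        // Hypothetical sizes in MB.
        System.out.println(beforeYoungGc(3000, 2000, 100, false)); // YOUNG_GC
        System.out.println(beforeYoungGc(1000, 2000, 100, true));  // YOUNG_GC
        System.out.println(beforeYoungGc(50, 2000, 100, true));    // FULL_GC
    }
}
```

The practical consequence is that a nearly full or fragmented old generation can turn what would have been a cheap YGC into an FGC.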

4. Under what circumstances will GC affect the program?

Both YGC and FGC cause the program to stall to some degree (the Stop The World problem: while the GC threads work, the other worker threads are suspended). Even more advanced garbage collectors such as ParNew, CMS, or G1 only shorten the stall; they cannot eliminate it completely.

So under what circumstances does GC affect the program? In order of severity from high to low, I think there are the following 4 situations:

  • FGC is too frequent: FGC is usually slow, taking anywhere from a few hundred milliseconds to several seconds. Normally FGC runs once every few hours or even days, and the impact on the system is acceptable. However, once FGC becomes frequent (for example, once every few tens of minutes), it is definitely a problem: the worker threads are stopped frequently, the system seems to keep stalling, and overall program performance deteriorates.

  • YGC takes too long: generally speaking, a total YGC time of tens to hundreds of milliseconds is normal. It stalls the system for a few milliseconds or tens of milliseconds, but users can hardly perceive this and the impact on the program is negligible. If a YGC takes 1 second or even several seconds (almost as long as an FGC), however, the stalls become much longer, and because YGC itself is frequent, this causes many more service timeouts.

  • FGC takes too long: as FGC time increases, so does the stall time. For high-concurrency services in particular, this can cause more timeouts during FGC and reduce availability, so it also needs attention.

  • YGC is too frequent: even if YGC does not cause service timeouts, overly frequent YGC reduces the overall performance of the service. High-concurrency services also need to watch for this.

Of these, "FGC is too frequent" and "YGC takes too long" are the more typical GC problems and are very likely to affect the service quality of the program. The other two are less serious, but high-concurrency or high-availability programs still need to pay attention to them.
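To judge which of these four situations applies, you need the count and the total time of each kind of collection. Besides the monitoring system, a Java process can read them for itself through the standard GarbageCollectorMXBean API, as in the sketch below. The class name GcStats and the helper averageMillis are hypothetical, and the collector names printed depend on the JVM and the collectors in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcStats {
    /** Average time per collection in milliseconds, or 0 if none has run (or the count is undefined). */
    static double averageMillis(long count, long totalMillis) {
        return count <= 0 ? 0.0 : (double) totalMillis / count;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();  // -1 if undefined for this collector
            long millis = gc.getCollectionTime();  // accumulated time reported by the JVM, -1 if undefined
            System.out.printf("%s: %d collections, %d ms total, %.1f ms average%n",
                    gc.getName(), count, millis, averageMillis(count, millis));
        }
    }
}
```

Sampling these counters periodically and taking the difference between two samples gives the frequency (collections per minute) and the average duration over that interval, which are the two quantities the list above is concerned with.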

 

03 A practical guide for troubleshooting FGC problems

Based on the case analysis and the theory above, here is a summary of how to approach FGC problems, as a practical guide for your reference.

1. From the program's point of view, what causes FGC?

  • Large objects: the system loads too much data into memory at one time (for example, an SQL query without paging), causing large objects to enter the old generation.

  • Memory leak: large numbers of objects are created frequently but cannot be reclaimed (for example, an IO object whose close method is never called to release resources after use). This first triggers FGC and eventually causes OOM.

  • The program frequently creates long-lived objects. When these objects survive longer than the tenuring threshold, they enter the old generation and eventually trigger FGC. (This is the case in this article.)

  • A program bug dynamically generates many new classes, so Metaspace keeps filling up. This first triggers FGC and eventually leads to OOM.

  • The gc method is called explicitly in code, whether your own code or code inside a framework.

  • JVM parameter settings: the total memory size, the sizes of the young and old generations, the sizes of the Eden and Survivor areas, the Metaspace size, the garbage collection algorithm, and so on.

 

2. Know which tools you can use when troubleshooting

  • The company's monitoring system: most companies have one, and it can monitor all the JVM metrics.

  • JDK's own tools, including commonly used commands such as jmap and jstat:

    # View the usage of each area of the heap and the GC statistics

    jstat -gcutil -h20 pid 1000

    # View the objects in the heap, sorted by space occupied

    jmap -histo pid | head -n20

    # Dump the heap to a file

    jmap -dump:format=b,file=heap pid

  • Visual heap memory analysis tools: JVisualVM, MAT, etc.
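When no monitoring system is available, the FGC frequency can be estimated from two jstat -gcutil samples. The sketch below pairs each column name in the header with the value below it and computes Full GCs per hour from the FGC column. The class and method names are hypothetical, the sample rows in main are made-up values for illustration only, and the column set shown is the one I would expect from JDK 8, so the parser reads the header instead of hard-coding positions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class JstatGcutil {
    /** Pairs each column name in the header with the value at the same position in the row. */
    static Map<String, Double> parse(String headerLine, String valueLine) {
        String[] names = headerLine.trim().split("\\s+");
        String[] values = valueLine.trim().split("\\s+");
        if (names.length != values.length) {
            throw new IllegalArgumentException("header and row have different column counts");
        }
        Map<String, Double> row = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            // jstat prints "-" for a value that is not available
            row.put(names[i], "-".equals(values[i]) ? Double.NaN : Double.parseDouble(values[i]));
        }
        return row;
    }

    /** Full GCs per hour between two samples taken intervalSeconds apart. */
    static double fgcPerHour(Map<String, Double> earlier, Map<String, Double> later,
                             double intervalSeconds) {
        return (later.get("FGC") - earlier.get("FGC")) * 3600.0 / intervalSeconds;
    }

    public static void main(String[] args) {
        String header = "  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT";
        // Hypothetical samples taken one hour apart.
        String first  = "  0.00  12.50  40.00  35.00  95.00  90.00    100    2.500     3    1.200    3.700";
        String second = " 11.00   0.00  55.00  62.00  95.00  90.00    580   14.500     5    2.000   16.500";
        Map<String, Double> a = parse(header, first);
        Map<String, Double> b = parse(header, second);
        System.out.println("FGC per hour: " + fgcPerHour(a, b, 3600.0));
    }
}
```

The same approach works for the YGC and YGCT columns if you want the YGC frequency and average YGC time.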

 

3. Troubleshooting Guide

  • Check the monitoring to find out when the problem started and what the current FGC frequency is (compare it with the normal situation to judge whether the frequency is abnormal).

  • Find out whether there were any releases, base component upgrades, or similar changes shortly before that point in time.

  • Understand the JVM parameter settings, including the size of each area of the heap and which garbage collectors the young and old generations use, and then analyze whether the settings are reasonable.

  • Then rule out the possible causes listed in step 1 one by one. Of these, a full Metaspace, a memory leak, and explicit gc calls in the code are relatively easy to check.

  • For FGC caused by large objects or long-lived objects, use the jmap -histo command together with a heap dump file for further analysis. The first task is to locate the suspicious object.

  • Once the suspicious object is located, analyze the relevant code. At this point, combine the GC principles with the JVM parameter settings to determine whether the suspicious object meets the conditions for entering the old generation, and then draw a conclusion.

Reposted from: https://mp.weixin.qq.com/s/I1fp89Ib2Na1-vjmjSpsjQ

Origin blog.csdn.net/qq_39809613/article/details/107354493