For online service GC troubleshooting, this is enough

GC problems in online services are a classic class of Java issue and a real test of an engineer's troubleshooting ability. They are also an almost mandatory interview topic, yet few people can answer such questions well: they either do not understand the principles or lack hands-on experience.

Over the past six months, our advertising system has had many production incidents related to GC: Full GC (FGC) was too frequent, or Young GC (YGC) took too long. The impact is that the program stalls during GC, which leads to service timeouts and ultimately hurts advertising revenue.

In this article, I use a production case of frequent FGC to walk through the GC troubleshooting process in detail, and then give a practical guide based on how GC works. I hope it helps. The content is divided into the following 3 parts:

  • Starting from a production case of frequent FGC
  • Introduction to the operating principle of GC
  • A practical guide to troubleshooting FGC issues

 

01 Starting from a production case of frequent FGC

Last October, after a new release went live, our advertising recall system started raising frequent FGC alerts. The monitoring chart showed an FGC every 35 minutes on average, whereas before the release the frequency was about once every 2 days. The troubleshooting process is described in detail below.

1. Check the JVM configuration

View the startup parameters of the JVM with the following command:

ps aux | grep "applicationName=adsearch"

-Xms4g -Xmx4g -Xmn2g -Xss1024K

-XX:ParallelGCThreads=5

-XX:+UseConcMarkSweepGC

-XX:+UseParNewGC

-XX:+UseCMSCompactAtFullCollection

-XX:CMSInitiatingOccupancyFraction=80

The heap is 4G and the new generation is 2G, so the old generation is also 2G. The new generation uses the ParNew collector and the old generation uses the CMS (Concurrent Mark Sweep) collector. When the memory usage of the old generation reaches 80%, an FGC is performed.

Running jmap -heap 7276 | head -n20 further shows that the Eden area of the new generation is 1.6G, and the S0 and S1 areas are 0.2G each.
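The sizes above can be re-derived from the startup flags. The following is a minimal sketch, assuming the default -XX:SurvivorRatio=8 (Eden:S0:S1 = 8:1:1); the class and method names are hypothetical, and a real JVM aligns region sizes, so the values reported by jmap may differ slightly.

```java
// Sketch: derive generation sizes from -Xmx4g -Xmn2g and CMSInitiatingOccupancyFraction=80.
// Assumes SurvivorRatio=8; the JVM's own alignment rules are ignored here.
final class HeapLayout {
    static final long GB = 1024L * 1024 * 1024;

    // Old generation = total heap (-Xmx) minus new generation (-Xmn).
    static long oldGenBytes(long heapBytes, long youngBytes) {
        return heapBytes - youngBytes;
    }

    // Each survivor space gets 1 / (survivorRatio + 2) of the new generation.
    static long survivorBytes(long youngBytes, int survivorRatio) {
        return youngBytes / (survivorRatio + 2);
    }

    // Eden is whatever remains after the two survivor spaces.
    static long edenBytes(long youngBytes, int survivorRatio) {
        return youngBytes - 2 * survivorBytes(youngBytes, survivorRatio);
    }

    // Old-generation occupancy at which the collection is started.
    static long cmsTriggerBytes(long oldBytes, int occupancyFraction) {
        return oldBytes * occupancyFraction / 100;
    }

    public static void main(String[] args) {
        long heap = 4 * GB;
        long young = 2 * GB;
        long old = oldGenBytes(heap, young);
        System.out.printf("eden=%.1fG survivor=%.1fG old=%.1fG trigger=%.1fG%n",
                (double) edenBytes(young, 8) / GB,
                (double) survivorBytes(young, 8) / GB,
                (double) old / GB,
                (double) cmsTriggerBytes(old, 80) / GB);
    }
}
```

With these flags the old generation has to fill up to roughly 1.6G before a collection starts, which is the figure to keep in mind when reading the old-generation usage chart.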

2. Observe the memory changes in the old generation

The old-generation usage shows that after each FGC the memory drops back to about 500M, so we ruled out a memory leak.
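The reasoning here is that a leak shows up as a rising floor: the usage left after each FGC keeps growing. A minimal sketch of that check, assuming you have collected the post-FGC old-generation usage yourself (the class name, method name, and tolerance are hypothetical, not part of any JDK tool):

```java
// Sketch: decide whether the old-generation "floor" after FGC is rising.
// The samples are post-FGC usages in MB, in chronological order.
final class LeakCheck {
    // Returns true when the last post-FGC usage exceeds the first by more
    // than toleranceMb, which would hint at a leak rather than normal churn.
    static boolean floorIsRising(long[] usedAfterFgcMb, long toleranceMb) {
        if (usedAfterFgcMb.length < 2) {
            return false; // not enough samples to see a trend
        }
        long first = usedAfterFgcMb[0];
        long last = usedAfterFgcMb[usedAfterFgcMb.length - 1];
        return last - first > toleranceMb;
    }

    public static void main(String[] args) {
        long[] stable = {500, 510, 495, 505};   // returns to ~500M each time
        long[] leaking = {500, 700, 900, 1100}; // floor keeps climbing
        System.out.println(floorIsRising(stable, 100));
        System.out.println(floorIsRising(leaking, 100));
    }
}
```

In this case the floor stayed near 500M, so the check would come back negative.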

3. View the objects in the heap memory with the jmap command

List the objects that occupy the most heap memory with the following command:

jmap -histo 7276 | head -n20

The output is sorted by memory size and shows, for each class, the number of instances, the memory occupied, and the class name. The first entry is int[], whose memory footprint is far larger than that of any other object. At this point, int[] became our prime suspect.

4. Dump the heap memory for further analysis

Having singled out int[], we planned to dump the heap and trace the source of these objects with a visualization tool. Because the program is suspended while the heap is being dumped, we first removed this node from the service management platform, and then dumped the heap with the following command:

jmap -dump:format=b,file=heap 7276

After importing the dumped heap file into JVisualVM, you can also see the space occupied by each class of object. int[] occupies more than 50% of the memory. Drilling down further reveals the business object that owns the int[] instances: it comes from the codis base component provided by the architecture team.
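If attaching jmap to the process is not convenient, HotSpot JVMs also expose heap dumping programmatically through HotSpotDiagnosticMXBean. This is an alternative to the jmap command used above, not what we did in this case; the class name HeapDumper and the temp-directory helper are hypothetical. A minimal sketch:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: dump the heap of the current JVM from inside the process (HotSpot only).
final class HeapDumper {
    // Writes an .hprof file; liveOnly=true dumps only reachable objects.
    static Path dump(Path target, boolean liveOnly) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.toString(), liveOnly); // fails if the file already exists
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper: dump into a fresh temporary directory.
    static Path dumpToTempDir() {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            return dump(dir.resolve("heap.hprof"), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heap dumped to " + dumpToTempDir());
    }
}
```

Like jmap -dump, this pauses the application while the file is written, so the same precaution applies: take the node out of traffic first.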

5. Analyze the suspicious objects in the code

Code analysis showed that the codis base component creates an int array of about 40M every minute to compute TP99 and TP90 statistics, and that the array lives for one minute. In step 2 we had observed that the old generation grows by a little more than 40M per minute, so we inferred that the 40M int array was being promoted from the new generation to the old generation.

We then checked the YGC frequency monitoring, which showed about 8 YGCs per minute. This essentially confirms the inference. With the CMS collector, the default tenuring age is 6, meaning that objects that survive 6 YGCs are promoted to the old generation. At 8 YGCs per minute, the large array in the codis component, whose life cycle is 1 minute, lives through more than 6 YGCs and is therefore promoted before it dies.

At this point the investigation was essentially complete. So why did the problem not occur before the release? The same monitoring shows that the YGC frequency was about 5 per minute before the release and rose to about 8 per minute afterwards. At 5 per minute the array died before reaching the tenuring age; at 8 per minute it did not.

6. Solution

To fix the problem quickly, we raised the tenuring age from 6 to 15 (the -XX:MaxTenuringThreshold parameter). After the change, the FGC frequency returned to once every 2 days. If the YGC frequency ever exceeds 15 per minute, the problem will be triggered again. The fundamental solution, of course, is to optimize the program to reduce the YGC frequency and to shorten the life cycle of the int array in the codis component; we will not go into that here.
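The arithmetic behind the diagnosis and the fix can be written down in a few lines. This is a simplified model, assuming evenly spaced YGCs and treating an object as promoted once it has lived through more collections than the tenuring age; the class and method names are hypothetical.

```java
// Sketch: does an object with a given lifetime get promoted to the old generation?
// Simplified: YGCs are evenly spaced and only the age threshold is considered.
final class PromotionCheck {
    // Number of YGCs that happen while the object is alive.
    static long collectionsSurvived(int ygcPerMinute, int lifetimeSeconds) {
        return (long) ygcPerMinute * lifetimeSeconds / 60;
    }

    // Promoted when the object lives through more YGCs than the tenuring age.
    static boolean reachesOldGen(int ygcPerMinute, int lifetimeSeconds, int tenuringAge) {
        return collectionsSurvived(ygcPerMinute, lifetimeSeconds) > tenuringAge;
    }

    public static void main(String[] args) {
        int lifetime = 60; // the codis array lives for one minute
        System.out.println("before release (5/min, age 6):  " + reachesOldGen(5, lifetime, 6));
        System.out.println("after release  (8/min, age 6):  " + reachesOldGen(8, lifetime, 6));
        System.out.println("after fix      (8/min, age 15): " + reachesOldGen(8, lifetime, 15));
    }
}
```

The model also makes the limit of the quick fix visible: once the YGC frequency climbs past 15 per minute, the array is promoted again even with the higher tenuring age.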

 

02 Introduction to the operating principle of GC

The analysis in the case above relies on a good deal of knowledge about how GC works. If you start troubleshooting without understanding these principles, the whole investigation is essentially blind.

Here I pick a few core topics, introduce how GC works, and finally give a practical guide.

1. Heap memory structure

As everyone knows, GC is divided into YGC and FGC, and both operate on the JVM heap. Let's first look at the heap memory structure in JDK 8.

The heap uses a generational structure consisting of the new generation and the old generation. The new generation is further divided into the Eden area, the From Survivor area (S0 for short), and the To Survivor area (S1 for short), with a default ratio of 8:1:1. The default ratio between the new generation and the old generation is 1:2.

The heap is generational because most objects are short-lived. Placing objects with different life cycles in different areas allows the new generation and the old generation to use different garbage collection algorithms, which makes GC as efficient as possible.
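To see these areas on a running JVM, the standard java.lang.management API lists the memory pools. The sketch below is only illustrative: the class name HeapPools is hypothetical, and the pool names it prints depend on the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Sketch: list the heap memory pools (Eden, Survivor, old generation) of this JVM.
final class HeapPools {
    // Names of all heap pools; the exact names vary by garbage collector.
    static List<String> heapPoolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                names.add(pool.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (pool.getType() != MemoryType.HEAP || usage == null) {
                continue; // skip non-heap pools such as Metaspace, and invalid pools
            }
            // getMax() returns -1 when the maximum is undefined.
            System.out.println(pool.getName()
                    + " used=" + usage.getUsed()
                    + " max=" + usage.getMax());
        }
    }
}
```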

2. When is YGC triggered?

In most cases, objects are allocated directly in the Eden area of the new generation. If Eden does not have enough space, a YGC (Minor GC) is triggered. YGC processes only the new generation. Because most objects become garbage within a short time, only a few survive a YGC, and those are moved to the S0 area (using a copying algorithm).

When the next YGC is triggered, the surviving objects in the Eden and S0 areas are moved to the S1 area, and Eden and S0 are cleared. On the YGC after that, the areas processed are Eden and S1 (that is, S0 and S1 swap roles). Each time an object survives a YGC, its age increases by 1.
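The copying and aging described above can be illustrated with a toy model. This is a teaching sketch, not how HotSpot is implemented: objects are plain entries in lists, liveness is supplied by the caller, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model of the new generation: Eden plus two survivor spaces that swap roles.
final class YoungGenModel {
    static final class Obj {
        final String name;
        int age; // number of YGCs survived so far
        Obj(String name) { this.name = name; }
    }

    final List<Obj> eden = new ArrayList<>();
    List<Obj> from = new ArrayList<>(); // survivor space holding the last survivors
    List<Obj> to = new ArrayList<>();   // empty survivor space that receives copies
    final List<Obj> old = new ArrayList<>();
    final int tenuringAge;

    YoungGenModel(int tenuringAge) { this.tenuringAge = tenuringAge; }

    Obj allocate(String name) {
        Obj o = new Obj(name);
        eden.add(o);
        return o;
    }

    // One YGC: copy live objects from Eden and From into To, then swap the survivors.
    void youngGc(Predicate<Obj> isLive) {
        List<Obj> candidates = new ArrayList<>(eden);
        candidates.addAll(from);
        for (Obj o : candidates) {
            if (!isLive.test(o)) {
                continue; // dead objects are simply not copied
            }
            if (o.age >= tenuringAge) {
                old.add(o); // old enough: promote instead of copying
            } else {
                o.age++;
                to.add(o);
            }
        }
        eden.clear();
        from.clear();
        List<Obj> tmp = from;
        from = to;
        to = tmp;
    }

    // Counts how many YGCs it takes for an always-live object to reach the old generation.
    static int gcsUntilPromotion(int tenuringAge) {
        YoungGenModel model = new YoungGenModel(tenuringAge);
        model.allocate("long-lived");
        int gcs = 0;
        while (model.old.isEmpty()) {
            model.youngGc(o -> true);
            gcs++;
        }
        return gcs;
    }
}
```

In this model an object that stays alive is promoted on the first YGC after its age has reached the tenuring age, which is why a higher tenuring age gives short-lived objects more chances to die in the new generation.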

3. When is FGC triggered?

Objects enter the old generation in the following four situations:

  • During a YGC, the To Survivor area is too small to hold the surviving objects, so they go directly into the old generation.
  • After many YGCs, an object whose age reaches the configured threshold is promoted to the old generation.
  • Dynamic age determination: if objects of the same age in the To Survivor area occupy more than half of its space, objects of that age or older go directly into the old generation without having to reach the default tenuring age.
  • Large objects: controlled by the -XX:PretenureSizeThreshold startup parameter. An object larger than this value bypasses the new generation and is allocated directly in the old generation.
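The dynamic age rule is easier to follow as code. The sketch below uses the cumulative form, which to my understanding is how HotSpot computes it: sizes are added up from age 1 upward, and the first age at which the running total exceeds the target share of a survivor space becomes the effective threshold. The 50% target corresponds to the default of -XX:TargetSurvivorRatio; the class and method names are hypothetical.

```java
// Sketch: compute the effective tenuring age from the survivor age distribution.
final class DynamicAge {
    // bytesByAge[i] is the total size of survivors of age i (index 0 is unused).
    static int threshold(long[] bytesByAge, long survivorCapacity,
                         int targetSurvivorRatio, int maxTenuringAge) {
        long desired = survivorCapacity * targetSurvivorRatio / 100;
        long total = 0;
        for (int age = 1; age <= maxTenuringAge; age++) {
            if (age < bytesByAge.length) {
                total += bytesByAge[age];
            }
            if (total > desired) {
                return age; // objects of this age or older will be promoted
            }
        }
        return maxTenuringAge; // survivors fit: fall back to the configured age
    }

    public static void main(String[] args) {
        // 200M survivor space, 50% target: ages 1 and 2 together exceed 100M.
        long[] crowded = {0, 30, 80, 10};
        System.out.println(threshold(crowded, 200, 50, 6));
        // Survivors are small, so the configured age applies.
        long[] sparse = {0, 10, 10, 10};
        System.out.println(threshold(sparse, 200, 50, 6));
    }
}
```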

When the objects being promoted are larger than the remaining space in the old generation, an FGC (Major GC) is triggered. FGC processes both the new generation and the old generation. In addition, the following 4 situations also trigger FGC:

  • The memory usage of the old generation reaches a certain threshold (adjustable by parameter), which triggers FGC directly.
  • Space allocation guarantee: before a YGC, the JVM checks whether the largest contiguous free space in the old generation is greater than the total size of all objects in the new generation. If it is not, the YGC is not guaranteed to be safe, and the JVM checks whether the HandlePromotionFailure parameter allows the guarantee to fail. If it does not, a Full GC is triggered directly. If it does, the JVM further checks whether the largest contiguous free space in the old generation is greater than the average size of the objects promoted in previous collections; if it is smaller, a Full GC is triggered.
  • Metaspace expands when it runs out of space. When its usage reaches the value specified by the -XX:MetaspaceSize parameter, FGC is also triggered.
  • FGC is triggered when System.gc() or Runtime.getRuntime().gc() is called explicitly.
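The space allocation guarantee in the list above is a small decision procedure, sketched below. It only mirrors the rule as described in this article; the class, enum, and parameter names are hypothetical, and the real JVM logic has more cases.

```java
// Sketch: the space allocation guarantee check performed before a YGC.
final class PromotionGuarantee {
    enum Decision { YOUNG_GC, FULL_GC }

    static Decision beforeYoungGc(long oldMaxContiguousFree,
                                  long youngUsed,
                                  boolean handlePromotionFailure,
                                  long avgPromoted) {
        // Worst case: every object in the new generation survives and is promoted.
        if (oldMaxContiguousFree > youngUsed) {
            return Decision.YOUNG_GC;
        }
        // The worst case does not fit, and taking the risk is not allowed.
        if (!handlePromotionFailure) {
            return Decision.FULL_GC;
        }
        // Take the risk only if the historical average promotion size fits.
        return oldMaxContiguousFree > avgPromoted ? Decision.YOUNG_GC : Decision.FULL_GC;
    }

    public static void main(String[] args) {
        // Values in MB: 300M free in the old generation, 1500M used in the new generation.
        System.out.println(beforeYoungGc(300, 1500, true, 40));
        System.out.println(beforeYoungGc(300, 1500, false, 40));
        System.out.println(beforeYoungGc(30, 1500, true, 40));
    }
}
```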

4. Under what circumstances will GC affect the program?

Whether it is YGC or FGC, every collection stalls the program to some degree (the Stop The World problem: while the GC threads work, the application's worker threads are suspended). Even more advanced collectors such as ParNew, CMS, or G1 only shorten the stall; they cannot eliminate it.

So under what circumstances does GC affect the program? In descending order of severity, I think there are the following 4 situations:

  • FGC is too frequent: FGC is usually slow, ranging from several hundred milliseconds to several seconds. Normally FGC runs every few hours or even every few days, and the impact on the system is acceptable. Once FGC becomes frequent (for example, every few tens of minutes), there is definitely a problem: worker threads are stopped frequently, the system appears to be stuck, and the overall performance of the program deteriorates.
  • YGC takes too long: generally speaking, a total YGC time of tens or hundreds of milliseconds is normal. Although the system freezes for a few milliseconds or tens of milliseconds, users hardly notice, and the impact on the program is negligible. But if a YGC takes 1 second or even several seconds (almost as long as an FGC), the stall time grows, and because YGC itself is frequent, this causes many more service timeouts.
  • FGC takes too long: as FGC time grows, so does the stall time. For high-concurrency services in particular, this can cause more timeouts during FGC and reduce availability, so it also needs attention.
  • YGC is too frequent: even if YGC does not cause service timeouts, overly frequent YGC reduces the overall performance of the service. High-concurrency services also need to pay attention to this.

Of these, "FGC is too frequent" and "YGC takes too long" are the typical GC problems and are very likely to affect the service quality of the program. The other two are less serious, but programs with high concurrency or high availability requirements still need to watch for them.
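Both frequency and duration can be measured from inside the JVM with the standard GarbageCollectorMXBean, which reports a cumulative collection count and an approximate accumulated collection time per collector. The sketch below takes two snapshots and derives a rate and an average; the class GcStats and its methods are hypothetical, and for concurrent collectors the reported time is not a pure stop-the-world pause.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: derive GC frequency and average duration from two cumulative snapshots.
final class GcStats {
    final long count;  // cumulative number of collections
    final long timeMs; // cumulative (approximate) collection time in milliseconds

    GcStats(long count, long timeMs) {
        this.count = count;
        this.timeMs = timeMs;
    }

    // Sums the counters of all collectors in the running JVM.
    static GcStats snapshot() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount()); // -1 means undefined
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        return new GcStats(count, timeMs);
    }

    static double collectionsPerMinute(GcStats before, GcStats after, long elapsedMs) {
        return (after.count - before.count) * 60_000.0 / elapsedMs;
    }

    static double avgCollectionMs(GcStats before, GcStats after) {
        long n = after.count - before.count;
        return n == 0 ? 0.0 : (double) (after.timeMs - before.timeMs) / n;
    }

    public static void main(String[] args) {
        // Per-collector view: young and old collectors are reported separately.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
    }
}
```

For the four situations above you would keep the young and old collectors apart, as the main method does, rather than summing them.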

 

03 A practical guide to troubleshooting FGC problems

Based on the case analysis and the theory above, this part summarizes how to approach FGC problems, as a practical guide for your reference.

1. From the program's point of view, what causes FGC?

  • Large objects: the system loads too much data into memory at once (for example, an SQL query without paging), so large objects enter the old generation.
  • Memory leak: a large number of objects are created frequently but cannot be reclaimed (for example, the close method is not called to release resources after an IO object is used). This first triggers FGC and eventually leads to OOM.
  • The program frequently creates long-lived objects. When their age exceeds the tenuring age, they enter the old generation and eventually trigger FGC. (This is the case in this article.)
  • A program bug dynamically generates many new classes, so Metaspace keeps filling up. This first triggers FGC and eventually leads to OOM.
  • The gc method is called explicitly in code, whether in your own code or in a framework.
  • JVM parameter settings: the total memory size, the sizes of the new and old generations, the sizes of the Eden and Survivor areas, the Metaspace size, the choice of garbage collector, and so on.
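For the resource-release cause in the list above, the usual Java remedy is try-with-resources, which calls close() even when an exception is thrown. The sketch below uses a hypothetical TrackingReader subclass only to make the close() call observable; in real code you would wrap the actual stream or connection the same way.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Sketch: release IO resources deterministically with try-with-resources.
final class ResourceRelease {
    // Reader that records whether close() was called (for demonstration only).
    static final class TrackingReader extends BufferedReader {
        boolean closed;

        TrackingReader(String content) {
            super(new StringReader(content));
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Reads the first line; the reader is closed on both normal and exceptional exit.
    static String firstLine(TrackingReader reader) {
        try (TrackingReader r = reader) {
            return r.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingReader reader = new TrackingReader("first\nsecond");
        System.out.println(firstLine(reader));
        System.out.println("closed: " + reader.closed);
    }
}
```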

 

2. Know which tools you can use when troubleshooting

  • The company's monitoring system: most companies have one, and it can monitor all JVM metrics.
  • The JDK's own tools, including commonly used commands such as jmap and jstat:
      # View the usage of each area of the heap memory and the GC status
      jstat -gcutil -h20 pid 1000
      # View the live objects in the heap memory, sorted by space
      jmap -histo pid | head -n20
      # Dump the heap memory to a file
      jmap -dump:format=b,file=heap pid
  • Visual heap memory analysis tools: JVisualVM, MAT, etc.

 

3. Troubleshooting Guide

  • Check the monitoring to find out when the problem started and how frequent FGC currently is (compare with the normal frequency to judge whether it is abnormal).
  • Find out whether anything changed before that point in time, such as a release or an upgrade of a base component.
  • Review the JVM parameter settings, including the size of each area of the heap and the garbage collectors used by the new and old generations, and then judge whether the settings are reasonable.
  • Rule out the possible causes listed in step 1 one by one. A full Metaspace, memory leaks, and explicit calls to the gc method are the easier ones to check.
  • For FGC caused by large objects or long-lived objects, use the jmap -histo command together with a heap dump for further analysis. The first goal is to locate the suspicious objects.
  • With the suspicious objects located, go back to the specific code. At this point, combine the GC principles with the JVM parameter settings to determine whether the suspicious objects meet the conditions for entering the old generation, and draw a conclusion.

Final words

This article walked through the FGC troubleshooting process in detail, using a production case together with the GC principles, and gave a practical guide.

In a follow-up, I will share a case of YGC taking too long in a similar way. I hope it helps you understand how to troubleshoot GC problems. If you found this article helpful, please share it!

Reprinted at: https://mp.weixin.qq.com/s/Hs2bo37x7mcx7XTdNQVgZQ


Origin blog.csdn.net/qq_45401061/article/details/108761500