How to troubleshoot and resolve long GC pauses

For many enterprise applications, especially OLTP applications, long pauses are likely to cause service timeouts. For applications running on the JVM, garbage collection (GC) is often the main cause of long pauses. This article describes several scenarios in which long GC pauses can occur and explains how to troubleshoot and resolve them.

The following are some different scenarios that may cause long GC pauses when the application is running.

1. Fragmentation

Fragmentation definitely deserves the top spot. It is precisely because of fragmentation, the most fatal flaw of CMS, that the collector which dominated OLTP systems for more than a decade has been pushed off the stage: CMS is already deprecated and will be removed in a future release, so cherish the JVMs still configured with it. Against G1 and the newer ZGC, a CMS crippled by fragmentation has no power to fight back.

With CMS, fragmentation of the old generation can cause promotion failures during a young GC: even if the old generation still has enough free space in total, an allocation can fail because there is not enough contiguous space. This triggers a concurrent mode failure, and a fully stop-the-world FullGC occurs. Compared with CMS's concurrent collection, this FullGC needs a much longer pause to finish its work, which is one of the biggest disasters for a Java application.

Why does CMS suffer from fragmentation? Because CMS collects the old generation with a mark-sweep algorithm, it does not compact the heap during garbage collection. Over time, fragmentation of the old generation gets worse and worse, until the single-threaded, fully stop-the-world mark-sweep-compact FullGC has to run. If the heap is large, that STW pause can take several seconds, or even tens of seconds.
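
As a hedged starting point, the JVM options below make fragmentation visible in the GC log and start the concurrent cycle earlier, so the old generation is less likely to fill up before CMS finishes; the 70% occupancy threshold and the application jar name are illustrative assumptions, not recommendations:

java -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:PrintFLSStatistics=1 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -jar app.jar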

In the following CMS GC log, a very high fragmentation rate led to a promotion failure and then a concurrent mode failure; the triggered FullGC took 17.1365396 seconds to complete:

{Heap before GC invocations=7430 (full 24):

parnew generation total 134400K, used 121348K[0x53000000, 0x5c600000, 0x5c600000)

eden space 115200K, 99% used [0x53000000, 0x5a07e738, 0x5a080000)

from space 19200K, 32% used [0x5a080000, 0x5a682cc0, 0x5b340000)

to space 19200K, 0% used [0x5b340000, 0x5b340000, 0x5c600000)

concurrent mark-sweep generation total 2099200K, used 1694466K [0x5c600000, 0xdc800000, 0xdc800000)

concurrent-mark-sweep perm gen total 409600K, used 186942K [0xdc800000, 0xf5800000, 0xfbc00000)

10628.167: [GC Before GC:

Statistics for BinaryTreeDictionary:

------------------------------------

Total Free Space: 103224160

Max Chunk Size: 5486

Number of Blocks: 57345

Av. Block Size: 1800

Tree Height: 36 <---- High fragmentation

Statistics for IndexedFreeLists:

--------------------------------

Total Free Space: 371324

Max Chunk Size: 254

Number of Blocks: 8591 <---- High fragmentation

Av. Block Size: 43

free=103595484

frag=1.0000 <---- High fragmentation

Before GC:

Statistics for BinaryTreeDictionary:

------------------------------------

Total Free Space: 0

Max Chunk Size: 0

Number of Blocks: 0

Tree Height: 0

Statistics for IndexedFreeLists:

--------------------------------

Total Free Space: 0

Max Chunk Size: 0

Number of Blocks: 0

free=0 frag=0.0000

10628.168: [ParNew (promotion failed) Desired survivor size 9830400 bytes, new threshold 1 (max 1)

- age 1: 4770440 bytes, 4770440 total: 121348K->122157K(134400K), 0.4263254 secs]

10628.594: [CMS10630.887: [CMS-concurrent-mark: 7.286/8.682 secs] [Times: user=14.81, sys=0.34, real=8.68 secs]

(concurrent mode failure):1698044K->625427K(2099200K), 17.1365396 secs]

1815815K->625427K(2233600K), [CMS Perm : 186942K->180711K(409600K)]

After GC:

Statistics for BinaryTreeDictionary:

------------------------------------

Total Free Space: 377269492

Max Chunk Size: 377269492

Number of Blocks: 1 <---- No fragmentation

Av. Block Size: 377269492

Tree Height: 1 <---- No fragmentation

Statistics for IndexedFreeLists:

--------------------------------

Total Free Space: 0

Max Chunk Size: 0

Number of Blocks: 0

free=377269492

frag=0.0000 <---- No fragmentation

After GC:

Statistics for BinaryTreeDictionary:

------------------------------------

Total Free Space: 0

Max Chunk Size: 0

Number of Blocks: 0

Tree Height: 0

Statistics for IndexedFreeLists:

--------------------------------

Total Free Space: 0

Max Chunk Size: 0

Number of Blocks: 0

free=0 frag=0.0000

, 17.5645589 secs] [Times: user=17.82 sys=0.06, real=17.57 secs]

Heap after GC invocations=7431 (full 25):

parnew generation total 134400K, used 0K [0x53000000, 0x5c600000, 0x5c600000)

eden space 115200K, 0% used [0x53000000, 0x53000000, 0x5a080000)

from space 19200K, 0% used [0x5b340000, 0x5b340000, 0x5c600000)

to space 19200K, 0% used [0x5a080000, 0x5a080000, 0x5b340000)

concurrent mark-sweep generation total 2099200K, used 625427K [0x5c600000, 0xdc800000, 0xdc800000)

concurrent-mark-sweep perm gen total 409600K, used 180711K [0xdc800000, 0xf5800000, 0xfbc00000)

}

Total time for which application threads were stopped: 17.5730653 seconds

2. Operating system activities during GC

When a GC occurs, operating system activity such as swapping can make the GC pause much longer; such pauses can reach several seconds or even tens of seconds.

If your system is configured to allow the use of swap space, the operating system may move inactive memory pages of the JVM process to swap space to free memory for whatever is currently active (which may be another process on the machine, depending on system scheduling). Swapping requires disk access, so compared with physical memory it is horribly slow. If the system happens to swap during a GC, the GC pause time becomes frighteningly long.
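
As a hedged example on Linux (the commands and values are illustrative and assume appropriate privileges), you can check whether the machine is swapping and make the kernel less eager to swap out the JVM's pages:

free -m                      # how much swap is currently in use
vmstat 5                     # watch the si/so columns for swap-in/swap-out activity
sysctl -w vm.swappiness=10   # make the kernel less eager to swap
swapoff -a                   # disable swap entirely, only if there is enough RAM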

The following is a YGC log that lasted 29.48 seconds:

{Heap before GC invocations=132 (full 0):

par new generation total 2696384K, used 2696384K [0xfffffffc20010000, 0xfffffffce0010000, 0xfffffffce0010000)

eden space 2247040K, 100% used [0xfffffffc20010000, 0xfffffffca9270000, 0xfffffffca9270000)

from space 449344K, 100% used [0xfffffffca9270000, 0xfffffffcc4940000, 0xfffffffcc4940000)

to space 449344K, 0% used [0xfffffffcc4940000, 0xfffffffcc4940000, 0xfffffffce0010000)

concurrent mark-sweep generation total 9437184K, used 1860619K [0xfffffffce0010000, 0xffffffff20010000, 0xffffffff20010000)

concurrent-mark-sweep perm gen total 1310720K, used 511451K [0xffffffff20010000, 0xffffffff70010000, 0xffffffff70010000)

2013-07-17T03:58:06.601-0700: 51522.120: [GC Before GC: :2696384K->449344K(2696384K), 29.4779282 secs] 4557003K->2326821K(12133568K) ,29.4795222 secs] [Times: user=915.56, sys=6.35, real=29.48 secs]

In the last line, [Times: user=915.56, sys=6.35, real=29.48 secs], real is the actual wall-clock pause the application experienced during this YGC.

At the time this YGC occurred, the output of the vmstat command was as follows:

r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id

0 0 0 77611960 94847600 55 266 0 0 0 0 0 0 0 0 0 3041 2644 2431 44 8 48

0 0 0 76968296 94828816 79 324 0 18 18 0 0 0 0 1 0 3009 3642 2519 59 13 28

1 0 0 77316456 94816000 389 2848 0 7 7 0 0 0 0 2 0 40062 78231 61451 42 6 53

2 0 0 77577552 94798520 115 591 0 13 13 0 0 13 12 1 0 4991 8104 5413 2 0 98

The YGC took about 29 seconds to complete. The vmstat output shows that available swap space decreased by roughly 600 MB over this period, which means that some memory pages were moved to swap space during the GC. Those pages do not necessarily belong to the JVM process; they may belong to other processes on the system.

From the above we can see that the physical memory on the machine is not enough to run all the processes on it. The solutions are to run fewer processes and to add RAM. In this example, the old generation is 9 GB but only about 1.8 GB is used (concurrent mark-sweep generation total 9437184K, used 1860619K), so we can reasonably reduce the size of the old generation and of the whole heap, which lowers memory pressure and minimizes the chance that applications on the system will swap.
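
For instance, a hedged adjustment in the spirit of the advice above (the exact numbers are assumptions that would have to be validated against the application's measured live set) would shrink the heap so that the JVM and the other processes fit comfortably in physical RAM:

-Xms6g -Xmx6g -Xmn2560m   # instead of a ~12 GB heap whose 9 GB old generation is only ~20% used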

Besides swapping, we also need to monitor and understand any IO or network activity during long GC pauses, which can be done with iostat and netstat. We can also look at CPU statistics with mpstat to figure out whether there were enough CPU resources while the GC was running.
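
A hedged set of commands (Linux syntax; the 5-second intervals are arbitrary example values) that can be run alongside GC log collection:

vmstat 5          # memory, swap, run queue and overall CPU
iostat -x 5       # per-device IO utilization and wait times
mpstat -P ALL 5   # per-CPU utilization, to see whether GC threads are starved
netstat -s        # protocol-level network counters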

3. Not enough heap space

If the application needs more memory than the maximum heap we set with -Xmx, the result is frequent garbage collection and possibly an OOM error. When an object allocation fails because heap space is insufficient, the JVM invokes a GC to try to reclaim space; if the GC cannot free enough, the next allocation failure triggers another GC, and the application enters a vicious circle.

While the application is running, these frequent FullGCs cause long pauses. In the following example, the Perm space is nearly full and an attempted allocation in the Perm area fails, triggering a FullGC:

166687.013: [Full GC [PSYoungGen:126501K->0K(922048K)] [PSOldGen: 2063794K->1598637K(2097152K)]2190295K->1598637K(3019200K) [PSPermGen: 165840K->164249K(166016K)],6.8204928 secs] [Times: user=6.80 sys=0.02, real=6.81 secs]

166699.015: [Full GC [PSYoungGen:125518K->0K(922048K)] [PSOldGen: 1763798K->1583621K(2097152K)]1889316K->1583621K(3019200K) [PSPermGen: 165868K->164849K(166016K)],4.8204928 secs] [Times: user=4.80 sys=0.02, real=4.81 secs]

Similarly, if the old generation is too small, it also leads to frequent FullGCs. This kind of problem is relatively easy to handle: don't be too stingy when sizing the old and permanent generations.
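
A hedged sizing sketch for a pre-JDK 8 HotSpot like the one in the logs above (the values are placeholders that should be derived from the application's measured footprint):

-Xms3g -Xmx3g                            # fix the heap size so the old generation has room for the live set
-XX:PermSize=256m -XX:MaxPermSize=256m   # give the permanent generation headroom (JDK 7 and earlier)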

4. JVM Bug

All software has bugs, and the JVM is no exception. Sometimes a long GC pause is caused by a JVM bug. For example, the JVM bugs listed below can cause Java applications to pause for a long time during GC.

6459113: CMS+ParNew: wildly different ParNew pause times depending on heap shape caused by allocation spread

fixed in JDK 6u1 and 7

6572569: CMS: consistently skewed work distribution indicated in (long) re-mark pauses

fixed in JDK 6u4 and 7

6631166: CMS: better heuristics when combatting fragmentation

fixed in JDK 6u21 and 7

6999988: CMS: Increased fragmentation leading to promotion failure after CR#6631166 got implemented

fixed in JDK 6u25 and 7

6683623: G1: use logarithmic BOT code such as used by other collectors

fixed in JDK 6u14 and 7

6976350: G1: deal with fragmentation while copying objects during GC

fixed in JDK 8

If the JDK you are running happens to be one of the affected versions, it is strongly recommended to upgrade to a version where the bug has been fixed.
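
To check which version you are actually running before deciding on an upgrade (a trivial but easily forgotten step):

java -version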

5. Explicit System.gc calls

Check whether there are explicit System.gc calls. Some class in the application, or a third-party module, may call System.gc and thereby trigger a stop-the-world FullGC, which can also cause a very long pause. As the following GC log shows, the (System) after Full GC indicates that it was triggered by a System.gc() call, and it took 5.75 seconds:

164638.058: [Full GC (System) [PSYoungGen: 22789K->0K(992448K)]

[PSOldGen: 1645508K->1666990K(2097152K)] 1668298K->1666990K(3089600K)

[PSPermGen: 164914K->164914K(166720K)], 5.7499132 secs] [Times: user=5.69, sys=0.06, real=5.75 secs]
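
Searching the source tree (and decompiled third-party jars) for System.gc calls is usually the quickest way to find the caller. A minimal hedged Java sketch (the class name is made up for illustration; run it with -verbose:gc or -XX:+PrintGCDetails) shows how such a call surfaces as a Full GC in the log:

// ExplicitGcDemo.java - demonstrates an explicit collection request
public class ExplicitGcDemo {
    public static void main(String[] args) {
        byte[] buffer = new byte[16 * 1024 * 1024]; // allocate some garbage
        buffer = null;                              // drop the reference
        System.gc();                                // explicit call: appears as a (System) Full GC
    }
}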


If you use RMI, you may observe a FullGC at a fixed interval, because the RMI implementation calls System.gc. This interval can be configured through system properties:

-Dsun.rmi.dgc.server.gcInterval=7200000

-Dsun.rmi.dgc.client.gcInterval=7200000

In JDK 1.4.2 and 5.0 the default value is 60000 milliseconds (1 minute); in JDK 6 and later it is 3600000 milliseconds (1 hour).

If you want to stop calls to System.gc() from triggering a FullGC, you can add the JVM parameter -XX:+DisableExplicitGC.
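
If the explicit calls cannot simply be ignored (RMI, for example, relies on them for distributed GC), a hedged alternative with CMS or G1 is to keep them but run them as a concurrent cycle instead of a stop-the-world FullGC:

-XX:+ExplicitGCInvokesConcurrent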

So how do we locate and solve these kinds of problems?

  1. Configure the JVM parameters -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps and -XX:+PrintGCApplicationStoppedTime. For CMS, also add -XX:PrintFLSStatistics=2. Then collect the GC logs; they tell us the GC frequency, whether there are long pauses, and other important information (a combined example command line is sketched after this list).

  2. Use tools such as vmstat, iostat, netstat and mpstat to monitor the overall health of the system.

  3. Use the GCHisto tool to analyze the GC log visually, to find the GCs that took a long time and to see whether they appear in any pattern.

  4. Try to find signs of JVM heap fragmentation in the GC log.

  5. Monitor whether the heap size given to the application is sufficient.

  6. Check the version of the JVM you are running for known bugs related to long pauses, then upgrade to the latest JDK that fixes them.
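
As referenced in step 1, a combined example command line (the jar name and log path are placeholders) that enables the logging needed for this kind of analysis on a JDK 7-era CMS setup might look like this:

java -XX:+UseConcMarkSweepGC \
     -XX:+PrintGCDetails -XX:+PrintHeapAtGC \
     -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
     -XX:+PrintGCApplicationStoppedTime \
     -XX:PrintFLSStatistics=2 \
     -Xloggc:/var/log/app/gc.log \
     -jar app.jar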


Origin blog.csdn.net/doubututou/article/details/109098975