JVM cases and troubleshooting routines

FGC:

A practical guide to troubleshooting FGC (full GC) issues: https://mp.weixin.qq.com/s/I1fp89Ib2Na1-vjmjSpsjQ

1. From the program's point of view, what causes FGC?

  • Large objects: the system loads too much data into memory at once (for example, an SQL query without paging), so large objects go into the old generation.

  • Memory leak: a large number of objects are created frequently but cannot be reclaimed (for example, the close method is not called to release resources after an IO object has been used). This first triggers FGC and eventually causes an OOM.

  • The program frequently creates long-lived objects. Once the age of these objects exceeds the tenuring threshold, they are promoted to the old generation, which eventually triggers FGC. (This is the case in the linked article.)

  • A bug in the program dynamically generates many new classes, so Metaspace keeps filling up. This first triggers FGC and eventually leads to an OOM.

  • The gc method (System.gc()) is called explicitly in code, either in your own code or in a framework.

  • JVM parameter settings: the total memory size, the sizes of the young and old generations, the sizes of Eden and the Survivor (S) spaces, the Metaspace size, the choice of garbage collector, and so on.
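The first cause in the list above (an unpaged query that pulls everything into memory at once) can be illustrated with a minimal sketch. The `fetchPage` function here is a hypothetical stand-in for a paged query and is not taken from the linked article; the only point is that the caller never holds more than one page of rows at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

// Sketch: process rows page by page so that only one page is reachable at a time.
final class PagedReader {
    // Largest page seen during the last run; exposed so the bound can be checked.
    static int maxPageSeen = 0;

    // fetchPage is a hypothetical stand-in for a paged query: (offset, limit) -> rows.
    static long sumAll(BiFunction<Integer, Integer, List<Integer>> fetchPage, int pageSize) {
        long total = 0;
        maxPageSeen = 0;
        int offset = 0;
        while (true) {
            List<Integer> page = fetchPage.apply(offset, pageSize);
            if (page.isEmpty()) {
                break; // no more rows
            }
            maxPageSeen = Math.max(maxPageSeen, page.size());
            for (int row : page) {
                total += row;
            }
            offset += page.size();
            // The page is dropped at the end of each iteration and becomes eligible for collection.
        }
        return total;
    }

    // In-memory "table" holding 0..n-1, used only for the demo.
    static BiFunction<Integer, Integer, List<Integer>> fakeTable(int n) {
        return (offset, limit) -> {
            List<Integer> rows = new ArrayList<>();
            for (int i = offset; i < Math.min(n, offset + limit); i++) {
                rows.add(i);
            }
            return rows;
        };
    }

    public static void main(String[] args) {
        long total = sumAll(fakeTable(1000), 100);
        System.out.println("total=" + total + ", maxPageSeen=" + maxPageSeen);
    }
}
```

With a real database the same loop would issue one bounded query per iteration, so peak memory depends on the page size rather than on the table size.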

2. Know which tools you can use when troubleshooting

  • The company's monitoring system: most companies have one, and it can monitor the JVM's metrics.

  • The JDK's own tools, including commonly used commands such as jmap and jstat:

    # Show the utilization of each heap area and the GC statistics (sample every 1000 ms, header every 20 rows)

    jstat -gcutil -h20 pid 1000

    # Show a histogram of objects in the heap, sorted by space used (use -histo:live to count only live objects; this forces a full GC)

    jmap -histo pid | head -n20

    # Dump the heap to a file

    jmap -dump:format=b,file=heap pid

  • Visual heap memory analysis tools: JVisualVM, MAT, etc.
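When a shell on the host is not available, roughly the same numbers that jstat reports can be read from inside the process through the standard java.lang.management API. The sketch below is an addition and not part of the linked article. Pool and collector names depend on the JVM and on the collector in use, so it simply prints whatever the running JVM exposes.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

final class GcSnapshot {
    // One line per collector: name, collection count, accumulated time in ms.
    static List<String> collectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + " count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    // One line per memory pool: name and used/max bytes (max is -1 when undefined).
    static List<String> pools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            lines.add(pool.getName() + " used=" + usage.getUsed() + " max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        collectors().forEach(System.out::println);
        pools().forEach(System.out::println);
    }
}
```

Logging such a snapshot periodically gives a rough history to compare against when no external monitoring is in place.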

3. Troubleshooting Guide

  • Check the monitoring to find out when the problem started and how frequent FGC currently is (compare with the normal level to judge whether the frequency is abnormal).

  • Find out whether there were any releases, base component upgrades, etc. before that point in time.

  • Review the JVM parameter settings, including the size of each heap area and which garbage collectors are used for the young and old generations, then assess whether the settings are reasonable.

  • Then rule out the possible causes listed in step 1 one by one. Of these, a full Metaspace, memory leaks, and explicit calls to the gc method are relatively easy to check.

  • For FGC caused by large objects or long-lived objects, use the jmap -histo command together with a heap dump file for further analysis. The first goal is to locate the suspicious object.

  • Once the suspicious object is located, go back to the specific code. At this point, combine GC principles with the JVM parameter settings to determine whether the suspicious object meets the conditions for promotion to the old generation, and draw a conclusion from that.
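The first step in the list above (comparing the current FGC frequency with the normal level) comes down to simple arithmetic on two samples of a cumulative full GC counter, such as the FGC column printed by jstat -gcutil. The following is a minimal sketch; the sample numbers and the threshold factor are hypothetical and chosen only for illustration.

```java
final class FgcRate {
    // Full GCs per hour between two samples of a cumulative FGC counter.
    static double perHour(long fgcStart, long fgcEnd, long elapsedSeconds) {
        if (elapsedSeconds <= 0 || fgcEnd < fgcStart) {
            throw new IllegalArgumentException("samples must be ordered in time");
        }
        return (fgcEnd - fgcStart) * 3600.0 / elapsedSeconds;
    }

    // True when the current rate exceeds the baseline by more than the given factor.
    static boolean isAbnormal(double currentPerHour, double baselinePerHour, double factor) {
        return currentPerHour > baselinePerHour * factor;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: the counter went from 40 to 52 in 10 minutes,
        // compared against a baseline of 2 full GCs per hour.
        double current = perHour(40, 52, 600);
        System.out.println(current + " FGC/hour, abnormal=" + isAbnormal(current, 2.0, 3.0));
    }
}
```

What counts as a normal baseline differs from service to service, so the factor is best derived from the service's own history rather than fixed in advance.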

 

YGC:

Classic troubleshooting case: https://mp.weixin.qq.com/s/O0l-d928hr994OpSNw3oow

1. A timeout alarm was received: calls to the service were timing out. Service monitoring showed long durations, up from the usual tens of milliseconds to hundreds of milliseconds.

2. Take one node out of service and dump its heap to a file with the following command, to preserve the scene:

     jmap -dump:format=b,file=heap pid

3. View the JVM parameter configuration:

    ps aux | grep "applicationName=adsearch"

   jmap -heap pid showed that Eden in the young generation is 1.6G and that S0 and S1 are 0.2G each.

4. Since there had been no changes there, the problem was attributed to the code.

5. Analyze the heap dump file with a tool to see whether the heap contains large objects that are never collected.

  (1) Look at the external interface: a static map is used to convert between old and new data, but it occupies only about 100M and resides in the old generation.

  (2) Analyze the dump file. Possible causes: too many long-lived objects accumulate, or severe lock contention blocks threads and lengthens the lifetime of local variables.

  Another team wrapped Apollo and packaged it for other teams to use, but its getconfig method always adds to a list when reading the configuration (the list is static, but values are added on every call, so they are allocated in the young generation). There is also no de-duplication.

  Many teams read their configuration through this Apollo wrapper, so more and more data accumulates.

6. Rolling back to the old version and testing on a single machine showed normal behavior, which confirmed the cause (the external team released the latest version through the super-pom, so the business side picked it up without noticing).
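The defect described in step 5 can be reproduced in a few lines. The class and method names below are hypothetical and do not reflect the real Apollo client or the other team's wrapper. The sketch only mirrors the described pattern, in which a static list gains entries on every read and nothing removes duplicates, and then shows one possible fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ConfigWrapperSketch {
    // Buggy pattern: a static collection that is appended to on every read.
    static final List<String> LEAKY = new ArrayList<>();

    static List<String> getConfigLeaky(List<String> source) {
        LEAKY.addAll(source); // grows by source.size() on each call and never shrinks
        return LEAKY;
    }

    // One possible fix: a set drops duplicates, so repeated reads do not grow it.
    static final Set<String> DEDUPED = new LinkedHashSet<>();

    static Set<String> getConfigDeduped(List<String> source) {
        DEDUPED.addAll(source);
        return DEDUPED;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("timeout=100", "retries=3");
        for (int i = 0; i < 1000; i++) {
            getConfigLeaky(source);
            getConfigDeduped(source);
        }
        System.out.println("leaky=" + LEAKY.size() + ", deduped=" + DEDUPED.size());
    }
}
```

The sketch is single-threaded. A wrapper shared by many callers would also need a thread-safe collection, or would load the configuration once instead of on every read.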

Origin blog.csdn.net/qq_39809613/article/details/107353733