A record of troubleshooting an online GC problem

After three or four hours of investigation I finally solved the GC problem. I am recording the troubleshooting process here in the hope that it will be useful to others. The article assumes the reader has basic knowledge of GC and JVM tuning. For how to use the JVM tuning tools, see my other article in the same category:

    http://my.oschina.net/feichexia/blog/196575

 

Background

    The system in question was deployed on Unix and had been running for over two weeks before the problem occurred.

    The system uses the CountingBloomFilter from the Hadoop source code, which I modified into a thread-safe implementation (see AdjustedCountingBloomFilter for details): the long[] counter array was replaced with an AtomicLong[], relying on the CAS-based optimistic locking of AtomicLong. As a result the system holds 5 huge AtomicLong arrays (each array occupies about 50MB), and each array contains a large number of AtomicLong objects (all the AtomicLong objects together occupy about 1.2GB of memory). These AtomicLong arrays stay alive for at least one day.
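
    As a rough illustration only (a minimal sketch, not the actual AdjustedCountingBloomFilter code; the class and method names below are made up), the thread-safety change boils down to replacing increments on a long[] with lock-free CAS increments on an AtomicLong[]:

    import java.util.concurrent.atomic.AtomicLong;

    /** Sketch of the counter-array change; hashing and bucket selection are omitted. */
    public class CounterArraySketch {
        private final AtomicLong[] buckets;

        public CounterArraySketch(int size) {
            buckets = new AtomicLong[size];
            for (int i = 0; i < size; i++) {
                buckets[i] = new AtomicLong();   // one small, long-lived object per slot
            }
        }

        // buckets[i]++ on a plain long[] is not thread-safe; AtomicLong performs the
        // same increment with an internal CAS loop, so no explicit lock is needed.
        public void increment(int i) {
            buckets[i].incrementAndGet();
        }

        public long count(int i) {
            return buckets[i].get();
        }
    }

    This also shows why the heap ends up with a huge number of small, long-lived AtomicLong objects: every counter slot becomes a separate object instead of a primitive inside one array.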

    The server side went live before the mobile client. The client was originally planned to launch on Thursday (this article was written on Monday), so I intended to keep observing the system's behaviour over the next few days, with the logs still at DEBUG level.

    Some of the JVM parameters are excerpted below (they are configured in setenv.sh under the bin directory of the Tomcat server on which the project is deployed; the running process can be checked with ps -ef | grep xxx | grep -v grep):

    -XX:PermSize=256M -XX:MaxPermSize=256M -Xms6000M -Xmx6000M -Xmn1500M -Xss256k
    -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+DisableExplicitGC
    -XX:+CMSParallelRemarkEnabled -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled
    -XX:CMSInitiatingOccupancyFraction=70 -XX:CMSFullGCsBeforeCompaction=5
    -XX:+UseCMSCompactAtFullCollection -XX:+CMSScavengeBeforeRemark
    -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/usr/local/webserver/point/logs/gc.log
    -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime

 

    As you can see, the permanent generation is set to 256M, the heap to 6000M (-Xms and -Xmx are set equal to avoid "heap oscillation", which reduces the number of GCs to some extent but increases the average time consumed by each GC), and the young generation to 1500M.

    -XX:+UseConcMarkSweepGC makes the old generation use the CMS (Concurrent Mark-Sweep) collector, and -XX:+UseParNewGC makes the young generation use the parallel ParNew collector. -XX:ParallelGCThreads specifies the number of parallel collector worker threads: with 8 or fewer CPU cores it is generally recommended to set it equal to the number of cores, while with more than 8 cores the recommendation is 3 + [(5 * CPU_COUNT) / 8]. The other parameters are omitted here.
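
    For example (my own arithmetic, not from the original article), on a 16-core machine the formula gives 3 + (5 * 16) / 8 = 13 parallel GC worker threads, while on an 8-core machine you would simply use 8.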

 

Problem discovery and resolution process

    During testing in the morning I found that the online system had suddenly hung and was reporting access-timeout exceptions.

    My first reaction was that the system had run out of memory or that the process had been killed by the operating system. ps -ef | grep xxx | grep -v grep showed the process was still there. I then looked at Tomcat's catalina.out log and the system GC log, and found no sign of an out-of-memory error.

    Next I used jstat -gcutil pid 1000 to check the occupancy and GC status of each generation in the heap, and found something alarming: the Eden space was more than 77% occupied, S0 was at 100%, while both the Old and Perm generations still had plenty of free space.
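
    As a side note (my own sketch, not part of the original investigation), the same per-generation occupancy that jstat -gcutil reports can also be read from inside the JVM through the standard java.lang.management API, which is handy when you cannot attach external tools to the process:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryUsage;

    public class PoolUsageDump {
        public static void main(String[] args) {
            // One MemoryPoolMXBean per pool: Eden, Survivor, Old/Tenured, Perm, code cache, ...
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                MemoryUsage u = pool.getUsage();
                double pct = u.getMax() > 0 ? 100.0 * u.getUsed() / u.getMax() : 0.0;
                System.out.printf("%-25s used=%dM max=%dM (%.2f%%)%n",
                        pool.getName(), u.getUsed() >> 20, u.getMax() >> 20, pct);
            }
        }
    }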

    I suspected the young generation was running out of space, but there was no conclusive evidence, so I used jstack to obtain a thread dump.

    From the first part of the dump it can be seen that the JVM-internal Low Memory Detector thread (a daemon thread started by the JVM to monitor and report low-memory conditions) had been holding the lock 0x00....00, while the C2 CompilerThread1, C2 CompilerThread0, Signal Dispatcher and Surrogate Locker threads were all waiting for that lock, which caused the whole JVM process to hang.

    Searching online, most of the advice was to increase the heap, so following those suggestions I planned to enlarge the whole heap, enlarge the young generation (-Xmn), and raise the proportion of the Survivor spaces within the young generation (-XX:SurvivorRatio). Because the system holds the large AtomicLong array objects, I also set -XX:PretenureSizeThreshold=10000000, so that any object larger than roughly 10M (the value is in bytes) is allocated directly in the old generation; note that this parameter only takes effect with the Serial and ParNew collectors. In addition, I wanted the large number of long-lived small AtomicLong objects to reach the old generation as early as possible, to avoid the AtomicLong arrays in the old generation holding references to AtomicLong objects in the young generation, so I lowered -XX:MaxTenuringThreshold (default 15) to 8: an object can now survive at most 8 minor collections in the young generation, after which it is promoted to the old generation even if the young generation still has enough space. The modified and added JVM GC parameters are as follows:

    -Xms9000M -Xmx9000M -Xmn1500M -XX:SurvivorRatio=6 -XX:MaxTenuringThreshold=8 -XX:PretenureSizeThreshold=10000000
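
    As a quick sanity check on these numbers (my own note): with -XX:SurvivorRatio=6 the young generation is split Eden : S0 : S1 = 6 : 1 : 1, so with -Xmn1500M each Survivor space is about 1500M / 8 ≈ 187M and Eden about 1125M, i.e. the Survivor spaces are larger than with the default ratio of 8.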

 

    After restarting the system, jstat -gcutil pid 1000 revealed an even more alarming picture, as shown in the figure below: memory in the Eden space kept growing rapidly, Survivor usage was still very high, and the old generation's usage grew considerably after every Young GC, from which it could be predicted that a Full GC would occur every three or four hours, which is clearly unreasonable.

    The second column is S1, which is as high as 87.45%, and the third column shows the change in Eden usage; it is easy to see that it grows very quickly.

    So I used jmap -histo:live pid (note that this jmap option triggers a Full GC, so use it with caution on an online system under heavy concurrent traffic) to inspect the live objects, and found that the memory occupied by some Integer arrays and Character arrays kept growing, amounting to several hundred megabytes; it dropped after each Young GC, then grew rapidly again, then dropped again at the next Young GC, over and over.

    At this point I speculated that the large number of Integer array and Character array objects had essentially filled the Survivor spaces, so that once Eden filled up, the newly created Integer and Character arrays no longer fit into the Survivor spaces and were promoted directly into the old generation. This speculation turned out to be partially correct: it explains why S1 usage was so high, but it cannot explain the continuous growth of Eden usage described above.

    So I went on to check the interface call logs, and what I saw came as a shock: the log was scrolling extremely fast, and 99% of it was DEBUG output. It turned out that operations and QA had released an Android build on one of the channels the day before without notifying us on the server side (no wonder the problem only surfaced today), and the system already had more than 6,400 users. That explains why the memory occupied by the Integer and Character arrays kept growing: the large number of interface calls triggered a large amount of DEBUG logging, and writing logs is a heavyweight operation for an online system in terms of both CPU and memory usage. A high-concurrency online system must remember to raise the log level to INFO or even ERROR.

    So I changed the log level in log4j.properties to INFO and then used jmap -histo:live pid to look at the live objects again: the number of Integer array and Character array objects dropped significantly, and the memory they occupied fell from several hundred MB to a few MB.
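
    For illustration (a minimal sketch assuming the project uses log4j 1.x; the class and helper names are made up), the same effect as editing log4j.properties can also be applied programmatically, and guarding expensive DEBUG statements avoids even building the message when DEBUG is off:

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    public class LogLevelFix {
        private static final Logger log = Logger.getLogger(LogLevelFix.class);

        public static void main(String[] args) {
            // Equivalent to changing "log4j.rootLogger=DEBUG, ..." to
            // "log4j.rootLogger=INFO, ..." in log4j.properties, applied at runtime.
            Logger.getRootLogger().setLevel(Level.INFO);

            // Guarding DEBUG statements avoids constructing the message string
            // (and the temporary char[]/Integer garbage it creates) when DEBUG is off.
            if (log.isDebugEnabled()) {
                log.debug("request payload: " + buildExpensiveDump());
            }
        }

        // Stands in for whatever expensive message construction the real code does.
        private static String buildExpensiveDump() {
            return "...";
        }
    }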

    Then I used jstat -gcutil pid 1000 to check the GC situation again, as follows:

    Clearly the Survivor usage is no longer that high, and, most importantly, the old generation's usage no longer increases after each Young GC. Eden still appears to grow quite fast here, simply because there are far more users than before. The problems found so far have basically been resolved, but the system still needs to be kept under observation.

 

Summary

    Overall, this system contains something that violates the generational GC assumption: a large number of small objects (the AtomicLong objects) with a long life cycle in the JVM heap. That inevitably plants a trap in the system.

    The basic assumption behind generational GC is:

    Most objects in the JVM heap are small, short-lived objects.

 

    This is also why the young generation of the HotSpot JVM uses a copying collection algorithm.

 

    Finally, some very good reference articles on GC (the first two correspond to the book "In-depth Understanding of the Java Virtual Machine"; most of the links are materials I consulted today, so read them selectively):

    JVM Memory Management: Deep Dive into Java Memory Regions and OOM  http://www.iteye.com/topic/802573

    JVM Memory Management: In-depth Garbage Collector and Memory Allocation Strategy  http://www.iteye.com/topic/802638

    Oracle GC Tuning http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

    Complete list of Java 6 JVM parameter options  http://kenwublog.com/docs/java6-jvm-options-chinese-edition.htm

    Java HotSpot VM Options http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

    CMS GC practice summary  http://www.iteye.com/topic/473874

    JVM memory allocation and recovery  http://blog.csdn.net/eric_sunah/article/details/7893310

    Step by step optimization of JVM series  http://blog.csdn.net/zhoutao198712/article/category/1194642

    Java Thread Dump Analysis  http://www.linuxidc.com/Linux/2009-01/18171.htm  http://jameswxx.iteye.com/blog/1041173

    JVM Troubleshooting with Java Dump  http://www.ibm.com/developerworks/cn/websphere/library/techarticles/0903_suipf_javadump/

    Detecting Low Memory in Java https://techblug.wordpress.com/2011/07/16/detecting-low-memory-in-java/

    Detecting Low Memory in Java Part 2 http://techblug.wordpress.com/2011/07/21/detecting-low-memory-in-java-part-2/

    http://blog.sina.com.cn/s/blog_56d8ea9001014de3.html

    http://stackoverflow.com/questions/2101518/difference-between-xxuseparallelgc-and-xxuseparnewgc

    http://stackoverflow.com/questions/220388/java-concurrent-and-parallel-gc

    http://j2eedebug.blogspot.com/2008/12/what-to-look-for-in-java-thread-dumps.html

    https://devcenter.heroku.com/articles/java-memory-issues

    http://blog.csdn.net/sun7545526/article/category/1193563

    http://java.dzone.com/articles/how-tame-java-gc-pauses

Original link: https://my.oschina.net/feichexia/blog/277391
