Diagnosing long JVM Full GC pauses on a game server that knocked tens of thousands of players offline

Recently I was handed a game server on which GC pauses were causing large numbers of players to drop offline. I was asked to take a look and was sent a JMC flight recording and an hprof heap dump. I opened them with jmc and jvisualvm from the JDK. Let's look at the basic information first.

  Basic information:
    Generated date: Tue Jul 28 19:51:09 CST 2020
    File: E:\Downloads\java_pid11875\java_pid11875.hprof.4
    File size: 13,721.7 MB

    Total Bytes: 12,583,316,263
    Total Classes: 5,124
    Total Instances: 209,093,920
    ClassLoaders: 60
    GC Roots: 4,351
    Pending Objects Waiting for Finalization: 0

  Environment:
    OS: Linux (3.16.0-6-amd64)
    Architecture: amd64 64bit
    Java Home Directory: /usr/lib/jvm/jdk1.8.0_144/jre
    Java Version: 1.8.0_144
    JVM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01, mixed mode)
    Java Vendor: Oracle Corporation

On the JMC dashboard, the maximum GC pause time is close to 14 seconds, which is terrifying!

Looking at the JVM parameters, -XX:+UnlockCommercialFeatures -Xmx40g -Xms40g, the heap is fixed at 40 GB. Since this is a GC problem, the first step is to see what objects are in the heap. In jvisualvm, the most numerous objects are related to network sending and receiving, such as classes in the java.nio, netty, and nukkitx packages. Frequent Full GCs despite such a large heap indicate that the old generation holds a large number of objects, which I attribute to these network-related objects not being reclaimed.
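When the startup script is not at hand, a running JVM can report its own flags. The following is a minimal sketch, not part of the original diagnosis: the class name HeapSettings and its helper methods are made up for illustration, and it only understands the plain -Xmx form with an optional k, m, or g suffix.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class HeapSettings {
    /** Parses a size such as "40g", "512m", "1024k" or "1048576" (plain bytes). */
    static long parseSize(String s) {
        String t = s.trim().toLowerCase();
        char unit = t.charAt(t.length() - 1);
        long mult = 1;
        if (unit == 'k') mult = 1024L;
        else if (unit == 'm') mult = 1024L * 1024;
        else if (unit == 'g') mult = 1024L * 1024 * 1024;
        // Strip the suffix only when one was recognized.
        String digits = (mult == 1) ? t : t.substring(0, t.length() - 1);
        return Long.parseLong(digits) * mult;
    }

    /** Returns the -Xmx value in bytes, or -1 if no -Xmx argument is present. */
    static long maxHeapFromArgs(List<String> args) {
        long result = -1;
        for (String a : args) {
            if (a.startsWith("-Xmx")) result = parseSize(a.substring(4));
        }
        return result;
    }

    public static void main(String[] argv) {
        // Arguments passed to this JVM, e.g. -Xmx40g -Xms40g on the server discussed here.
        List<String> args = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + args);
        System.out.println("-Xmx from arguments (bytes): " + maxHeapFromArgs(args));
        System.out.println("Runtime.maxMemory() (bytes): " + Runtime.getRuntime().maxMemory());
    }
}
```

Runtime.maxMemory() may report somewhat less than the -Xmx value, depending on the collector in use, so the two numbers are printed side by side rather than compared.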

In JMC, the hot code (the code executed most frequently) is also in network/IO-related classes.

Next, look at what the JVM's threads were doing at the time. Sure enough, each thread is busy either releasing memory or allocating it, and all of the methods involved are in the netty library. This means user connections were very busy at that moment.

Given this information, and given that this is a game server with high concurrency and long-lived connections, I suspect the problem lies with netty. Netty is an excellent network programming framework, but its drawback is that it creates many objects. Each connection in netty is a Channel, and each Channel has its own DefaultChannelPipeline with a HeadContext, a TailContext, and an Unsafe object. With long-lived connections, these objects stay reachable for as long as the connection is open, so they survive one GC after another and are eventually promoted to the old generation. The old generation fills up, and each Full GC takes a long time. Even knowing that this is the likely cause, removing netty is not an option: the cost is too high, since the project code would have to be rewritten. What remains is to optimize by adjusting the sizes of the heap regions and by changing the garbage collection strategy (the collector type).

Let's start with a quick introduction to the garbage collectors provided by Oracle's official Java 8 virtual machine.

As the figure above shows, as of Java 8 the official JVM provides 7 garbage collectors. Serial, ParNew, and Parallel Scavenge collect the young generation. CMS, Serial Old, and Parallel Old collect the old generation. G1 handles both the young and the old generation. They are commonly grouped as follows:

  • Serial collector: Serial + Serial Old;
  • Parallel collector: Parallel Scavenge + Parallel Old, focusing on application throughput;
  • Concurrent collectors: CMS, G1, focus on response time.

What is throughput? It is the share of total running time that the application threads spend doing useful work rather than waiting for GC. Garbage collection threads compete with application threads for CPU time, so the more time the application threads get, the higher the throughput.

What is response time? It is how quickly the program responds to the user. For GC, this means pauses should be short enough that the application does not visibly stall.

Obviously, a good GC has both high throughput and short response times. Unfortunately, the two goals conflict. To improve throughput, the GC must run less often, but if it runs less often, garbage objects and memory fragmentation accumulate, and each later collection has more work to do. The resulting long pauses hurt response time and, in the worst case, throughput as well.
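To make the throughput definition concrete, here is a rough sketch that computes it for the current JVM from the standard GarbageCollectorMXBean counters. The class name GcThroughput is hypothetical, and getCollectionTime() is only an approximate accumulated time, so treat the result as an estimate rather than an exact pause total.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcThroughput {
    /** Percentage of elapsed time that was not spent in GC. */
    static double throughputPercent(long uptimeMs, long gcTimeMs) {
        if (uptimeMs <= 0) {
            throw new IllegalArgumentException("uptimeMs must be positive");
        }
        return 100.0 * (uptimeMs - gcTimeMs) / uptimeMs;
    }

    public static void main(String[] args) {
        long gcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) gcTimeMs += t;
        }
        // Guard against a zero uptime reading right after startup.
        long uptimeMs = Math.max(1, ManagementFactory.getRuntimeMXBean().getUptime());
        System.out.printf("GC time: %d ms, uptime: %d ms, throughput: %.2f%%%n",
                gcTimeMs, uptimeMs, throughputPercent(uptimeMs, gcTimeMs));
    }
}
```

The same formula shows why throughput alone can mislead. Over a hypothetical 10-minute window, a single 14-second pause still leaves throughputPercent(600_000, 14_000) at roughly 97.7, even though every connected player was frozen for those 14 seconds.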

Back to our diagnostic case. Which garbage collector does this server's JVM use? It uses the Parallel collectors, the default for the Oracle 1.8 HotSpot JVM in server mode. Parallel Scavenge collects the young generation with a copying algorithm, and Parallel Old collects the old generation with a mark-compact algorithm. Both are throughput-first collectors. With the large number of network connections on this game server, a throughput-first GC strategy means that many garbage objects that should have been collected promptly are collected late and end up in the old generation. The old generation here has a capacity of 26.66 GB. When many objects accumulate in the old generation, each GC pause becomes very long. When memory pressure is very high (for example, I see many GCs with the cause Allocation Failure), Full GCs are triggered. Each Full GC then takes a long time, and the server visibly freezes. Even if the JVM never throws OutOfMemoryError, it is not surprising that user connections time out and drop.
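The collector pair in use can also be confirmed from inside the JVM. The sketch below lists the registered GarbageCollectorMXBean names and maps the names HotSpot reports to the collectors discussed in this post. The class name CollectorInfo and the describe helper are illustrative, and the name table is an assumption about HotSpot's naming that may not hold for other JVMs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class CollectorInfo {
    /** Maps a HotSpot GC bean name to the collector it belongs to. */
    static String describe(String beanName) {
        switch (beanName) {
            case "Copy":                return "Serial (young generation)";
            case "MarkSweepCompact":    return "Serial Old (old generation)";
            case "ParNew":              return "ParNew (young generation)";
            case "ConcurrentMarkSweep": return "CMS (old generation)";
            case "PS Scavenge":         return "Parallel Scavenge (young generation)";
            case "PS MarkSweep":        return "Parallel old-generation collector";
            case "G1 Young Generation": return "G1 (young collections)";
            case "G1 Old Generation":   return "G1 (old and full collections)";
            default:                    return "unknown collector";
        }
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-22s -> %s, collections=%d, time=%d ms%n",
                    gc.getName(), describe(gc.getName()),
                    gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

On a Java 8 server JVM started without an explicit collector flag, I would expect this to print the two "PS" entries, matching the Parallel collectors described above.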

Following this reasoning, the development team adjusted the JVM startup parameters to use the G1 collector, to mitigate the long Full GC pauses that were causing users to disconnect. The long-term effect remains to be seen.
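The exact flags the team chose are not recorded here; G1 itself is enabled with -XX:+UseG1GC. To judge the effect over time, one option is to log every collection that exceeds a pause budget from inside the server process. The sketch below uses the HotSpot-specific com.sun.management.GarbageCollectionNotificationInfo API. The class name LongGcWatcher and the 1000 ms threshold are assumptions for illustration, not values from the original setup.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

final class LongGcWatcher {
    static final long THRESHOLD_MS = 1_000; // hypothetical pause budget

    static boolean isLong(long durationMs, long thresholdMs) {
        return durationMs >= thresholdMs;
    }

    /** Registers a listener on every collector that emits GC notifications. */
    static void install(long thresholdMs) {
        NotificationListener listener = (notification, handback) -> {
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(notification.getType())) {
                return;
            }
            GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                    .from((CompositeData) notification.getUserData());
            long durationMs = info.getGcInfo().getDuration();
            String level = isLong(durationMs, thresholdMs) ? "LONG" : "ok";
            System.out.printf("[%s] %s (%s), cause=%s, %d ms%n", level,
                    info.getGcName(), info.getGcAction(), info.getGcCause(), durationMs);
        };
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (gc instanceof NotificationEmitter) {
                ((NotificationEmitter) gc).addNotificationListener(listener, null, null);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        install(THRESHOLD_MS);
        System.gc();       // request a collection so the demo has something to report
        Thread.sleep(500); // notifications are delivered asynchronously
    }
}
```

The reported duration is the elapsed time of the collection, which for collectors with concurrent phases is not necessarily all stop-the-world time, so a GC log or a fresh JMC recording remains the more reliable source for pause figures.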

Origin blog.csdn.net/liudun_cool/article/details/107676438