A senior engineer joined the company and tuned FullGC down from 40 times a day to roughly once every 10 days. Impressive!

Source: https://heapdump.cn/article/1859160

After more than a month of effort, FullGC was reduced from 40 times a day to roughly once every 10 days, and YoungGC time was cut by more than half. An optimization this large is worth writing down, so here is the tuning process from start to finish.

Before this, my knowledge of JVM garbage collection was purely theoretical: I knew how objects are promoted from the young generation to the old generation, but that level of knowledge is really only enough for interviews.


Problem

Some time ago, FullGC on the online servers was very frequent, more than 40 times a day on average, and the servers restarted themselves every few days, which indicated that something was badly wrong. Since opportunities like this don't come along often, I naturally volunteered to do the tuning.

The server GC statistics before tuning show that FullGC was very frequent:

[Figure: GC statistics before tuning]

First, the server configuration is quite modest (2 cores, 4 GB RAM), with a cluster of 4 servers in total. The FullGC count and duration on each server were roughly the same. The key JVM startup parameters were:

-Xms1000M -Xmx1800M -Xmn350M -Xss300K -XX:+DisableExplicitGC -XX:SurvivorRatio=4 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:LargePageSizeInBytes=128M -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
  • -Xmx1800M: Set the maximum available memory of the JVM to 1800M.

  • -Xms1000m: Set the JVM's initial heap to 1000m. This value can be set equal to -Xmx so that the JVM does not have to reallocate memory after every garbage collection.

  • -Xmn350M: Set the young generation size to 350M. Total heap size = young generation size + old generation size, so enlarging the young generation shrinks the old generation. This value has a large impact on system performance; Sun officially recommends setting it to 3/8 of the whole heap.

  • -Xss300K: Set the stack size of each thread. Since JDK 5.0 the default stack size per thread is 1M (before that it was 256K); adjust it to the memory the application's threads actually need. With the same physical memory, a smaller value allows more threads to be created, but the operating system still limits the number of threads per process, so it cannot grow without bound; the empirical limit is around 3000~5000 threads.

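To double-check that flags like these actually take effect on a running instance, here is a minimal sketch using the standard java.lang.management API to print the effective heap and pool sizes (the pool names in the comments assume the ParNew/CMS collectors configured above):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class HeapSettingsCheck {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // Should report init close to -Xms and max close to -Xmx
        System.out.printf("heap init=%dM, max=%dM%n", heap.getInit() >> 20, heap.getMax() >> 20);
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // With ParNew/CMS the pools are typically "Par Eden Space",
            // "Par Survivor Space", "CMS Old Gen" and "Metaspace";
            // max is -1 when a pool has no defined limit
            MemoryUsage u = pool.getUsage();
            System.out.printf("%s: used=%dM, max=%dM%n", pool.getName(), u.getUsed() >> 20, u.getMax() >> 20);
        }
    }
}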

First optimization

Looking at the parameters, my first question was why the young generation was so small. Such a small young generation hurts throughput and causes YoungGC to be triggered very frequently; as the figure above shows, young-generation collection had already consumed 830s in total.

Also, the initial heap size did not match the maximum heap size. The references I consulted recommend setting the two to the same value, which prevents the heap from being resized after each GC.

Based on the above, the first round of online tuning was carried out: enlarge the young generation and set the initial heap size equal to the maximum.

-Xmn350M -> -Xmn800M
-XX:SurvivorRatio=4 -> -XX:SurvivorRatio=8
-Xms1000m -> -Xms1800m

The intention behind changing SurvivorRatio to 8 was to have as much garbage as possible collected in the young generation.
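As a rough sanity check (assuming HotSpot's usual layout of one Eden plus two survivor spaces): with -Xmn350M and SurvivorRatio=4 the split is 4:1:1, i.e. Eden ≈ 233M and about 58M per survivor space; with -Xmn800M and SurvivorRatio=8 it becomes 8:1:1, i.e. Eden ≈ 640M and 80M per survivor space, so Eden alone nearly triples.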

This configuration was deployed to two of the online servers (prod1 and prod2; the other two were left unchanged for comparison). After running for 5 days, the GC results showed that the number of YoungGCs had dropped by more than half and YoungGC time had dropped by about 400s, but the number of FullGCs had increased by an average of 41. YoungGC was basically in line with expectations, but the FullGC result was a complete disappointment.

[Figure: GC statistics after the first optimization]

And so the first optimization ended in failure.

Second optimization

During the optimization, our supervisor discovered that there were more than 10,000 instances of a certain object T in memory, occupying nearly 20M. By tracing how this bean was used, we found the cause in the project: a reference held by an anonymous inner class. The pseudo-code is as follows:

public void doSomething(T t) {
    redis.addListener(new Listener() {
        @Override
        public void onTimeout() {
            if (t.success()) {
                // perform the operation
            }
        }
    });
}

The listener is never released after the callback, and because it is a timeout callback it only fires when an event exceeds the configured time (1 minute). As a result, the captured object T can never be collected, which is why so many instances accumulated in memory.
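One way to break that reference chain, sketched against the same hypothetical listener API (removeListener, getId and isSuccess are illustrative names, not the project's actual code), is to capture only the data the callback needs and to deregister the listener once it has fired:

public void doSomething(T t) {
    final String id = t.getId();               // capture only the identifier, not the whole T
    redis.addListener(new Listener() {
        @Override
        public void onTimeout() {
            try {
                if (isSuccess(id)) {           // re-check the state by id instead of holding T
                    // perform the operation
                }
            } finally {
                redis.removeListener(this);    // assumes the API supports deregistration
            }
        }
    });
}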

Having found this memory leak, we first went through the program's error log and resolved all of the error events. But after redeploying, the GC behaviour was basically unchanged: a small memory leak had been fixed, but clearly the root cause had not, and the servers kept restarting for no apparent reason.

Memory leak investigation

Since the first round of tuning had exposed a memory leak, everyone started investigating it. We began by reviewing the code, but that was quite inefficient and turned up almost nothing. So I went on to dump the heap during a period of low traffic, and finally caught a large object.
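For reference, a live heap dump like the one described here is typically taken with jmap and then opened in a tool such as Eclipse MAT; the pid and file path below are placeholders:

jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
jmap -histo:live <pid> | head -n 20    # quick instance-count histogram without a full dump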

[Figures: heap dump analysis showing the dominant objects]

There were more than 40,000 of these objects, all instances of ByteArrowRow, which confirmed that the data was being generated by database queries or inserts.

So another round of code analysis began. While it was going on, our operations colleagues noticed that inbound traffic multiplied several times over at a certain time of day, peaking at 83MB/s. After checking, we confirmed that there was no business activity anywhere near that volume and no file upload feature.

Alibaba Cloud customer service also confirmed that the traffic was legitimate, so an attack could be ruled out.

[Figure: inbound traffic monitoring]

While I was still investigating the ingress traffic, another colleague found the root cause: under a certain condition, the code queried all unprocessed records in a table, but because the WHERE clause was missing the module condition, the query returned more than 400,000 rows. The logs of the requests and data at that time confirmed that this code path had indeed been executed. The heap dump contained only 40,000-odd of these objects because that was how many had been loaded at the moment the dump was taken; the rest were still being transferred. This also explains very well why the servers kept restarting on their own.
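Besides adding the missing condition, it also helps to bound how much a single query can load into memory. Here is a plain-JDBC sketch of the idea (table, column and method names are illustrative, not the project's real code):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BoundedTaskQuery {
    // Query pending rows for one module only, in bounded batches.
    static void processPending(Connection conn, String module) throws SQLException {
        String sql = "SELECT id, payload FROM task WHERE status = 'UNPROCESSED' AND module = ? LIMIT 1000";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, module);   // the filter the original query was missing
            ps.setFetchSize(500);      // hint the driver to stream rows instead of buffering them all
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    handle(rs.getLong("id"), rs.getString("payload"));
                }
            }
        }
    }

    static void handle(long id, String payload) { /* business processing placeholder */ }
}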

After this problem was fixed, the online servers ran completely normally. Even with the pre-tuning parameters, after about 3 days of running there were only 5 FullGCs.

[Figure: GC statistics after fixing the query]

Second tuning

With the memory leak solved, what remained was tuning. Checking the GC log showed that for the first three FullGCs, the old generation occupied less than 30% of its memory, yet FullGC still occurred.

So I went through various references, and this blog post, https://blog.csdn.net/zjwstz/article/details/77478054, explains very clearly how metaspace expansion can trigger FullGC. The server's default metaspace size is about 21M, and the GC log showed metaspace peaking at about 200M, so I made the following adjustments. Below are the modified parameters for prod1 and prod2 respectively; prod3 and prod4 were left unchanged.

-Xmn350M -> -Xmn800M
-Xms1000M -> -Xms1800M
-XX:MetaspaceSize=200M
-XX:CMSInitiatingOccupancyFraction=75

and

-Xmn350M -> -Xmn600M
-Xms1000M -> -Xms1800M
-XX:MetaspaceSize=200M
-XX:CMSInitiatingOccupancyFraction=75

prod1 and prod2 differ only in the young generation size; everything else is the same. After about 10 days online, here is the comparison:

prod1:

[Figure: prod1 GC statistics]

prod2:

[Figure: prod2 GC statistics]

prod3:

[Figure: prod3 GC statistics]

prod4:

[Figure: prod4 GC statistics]

By comparison, the FullGC counts on servers 1 and 2 are far lower than on servers 3 and 4, and their YoungGC counts are also roughly half of those on 3 and 4.

Moreover, the gain on the first server is even more noticeable: besides the reduced number of YoungGCs, its throughput is higher than that of servers 3 and 4 even though they had been running one day longer (judging by the number of threads started), which shows that prod1's throughput improved markedly.

Judging by GC counts and GC time, this optimization can be declared a success, and prod1's configuration is the better one: it greatly improves server throughput and cuts GC time by more than half.

The only FullGC in prod1:

[Figures: GC log around prod1's only FullGC]

I cannot tell the cause from the GC log. The old generation occupied only about 660M at the CMS remark phase, which should not be enough to trigger a FullGC, and the earlier investigation of YoungGC had already ruled out promotion of large objects. It may be that metaspace reached the GC threshold; this still needs further investigation, and anyone who knows the answer is welcome to point it out. Thanks in advance.
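One way to keep checking this suspicion is to sample metaspace occupancy over time with jstat (the pid and sampling interval are placeholders):

jstat -gcutil <pid> 5000        # the M column is metaspace utilisation in percent
jstat -gcmetacapacity <pid>     # MCMN/MCMX/MC show minimum, maximum and current metaspace capacity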

Summary

From this month-plus of tuning, I would summarize the following points:

  • FullGC more than once a day is definitely not normal.

  • When FullGC is found to be frequent, investigate memory leaks first.

  • Once memory leaks are resolved, there is little room left for JVM tuning; it is fine as a learning exercise, but otherwise don't invest too much time in it.

  • If CPU stays persistently high, after ruling out code problems you can consult Alibaba Cloud's operations support. During this investigation we found a case of 100% CPU that was caused by a problem with the host itself; after the server was migrated it returned to normal.

  • Database query results also count toward the server's inbound traffic. If there is no business activity of that scale and no sign of an attack, look at the database.

  • Keep an eye on the servers' GC behaviour regularly so that problems are found early (a minimal in-process monitoring sketch follows below).
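As a minimal sketch of that kind of regular attention, the standard GarbageCollectorMXBean API can be polled from inside the application and fed into whatever monitoring already exists (the one-minute interval is arbitrary):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // With ParNew + CMS the bean names are typically "ParNew" and "ConcurrentMarkSweep"
                System.out.printf("%s: collections=%d, total time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(60_000);
        }
    }
}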




Origin blog.csdn.net/youanyyou/article/details/132353176