A 3,000-character record of troubleshooting and fixing a 46.6-second CMS GC pause

Around 7:16 this morning, a service fired an emergency alarm and many interfaces timed out. Out of curiosity (and a desire to learn), I started the investigation~~~

[As it turned out at the end of the investigation, not all of the timed-out interfaces had problems of their own; one interface dragged down the entire service. Any interface on this service could then show timeouts, exceptions, and other abnormal behavior.]

1. First, look at the symptoms:

DingTalk group alert:

[screenshot: DingTalk group alert]

The following figure shows the JVM monitoring panel of the machine that raised the alarm:

[screenshot: JVM monitoring panel]

1.1. Observations from the monitoring panel:

  1. A single CMS GC takes far too long:
    • The panel shows a CMS GC pause of 46.6 seconds! We know CMS is the collector for the old generation, and that its initial-mark and remark phases are stop-the-world, i.e. they pause every application thread. If those 46.6 seconds fall into one of those two phases, then for 46.6 seconds the whole service handles no requests at all and does nothing but garbage collection: high latency and low throughput across the entire service. Seriously?! How is the service supposed to function like that? In that case it is no surprise that every interface on this service can time out! (A sketch of the GC-logging flags that would confirm which phase it was follows this list.)
  2. Thread count: the number of threads rises sharply.
  3. Thread state: the number of WAITING and TIMED_WAITING threads rises sharply.
  4. CPU: CPU usage is very high.
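Only the GC log can settle which phase the 46.6 seconds actually belongs to. A rough sketch of the JDK 8 flags that would make the log show per-phase CMS timings and every stop-the-world pause (the log path and jar name are placeholders; the service presumably already runs with -XX:+UseConcMarkSweepGC):

```bash
java -XX:+UseConcMarkSweepGC \
     -XX:+PrintGCDetails \
     -XX:+PrintGCDateStamps \
     -XX:+PrintGCApplicationStoppedTime \
     -Xloggc:/path/to/gc.log \
     -jar your-service.jar
```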

1.2. Conclusions from the observations & direction of investigation:

  1. Conclusion: from the monitoring information above, we can conclude that a single CMS GC takes far too long, and that the thread count and CPU are abnormal as well.
  2. Direction of investigation: my first instinct was a memory leak somewhere (in my experience, memory leaks are very often tied to GC trouble), or that somewhere a huge object was being allocated. Since the thread information had not been dumped at the time (I do not know why it was not captured), and I guessed the root cause was closer to a memory leak or a large allocation, my investigation did not go down the thread path but focused on the heap and GC information instead.

    Guessing alone gets you nowhere, so I asked the ops colleague for the dump, and the subsequent investigation went roughly like this:

    dump the heap -> analyze it with MAT -> examine each indicator carefully -> locate the problem code
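For reference, the dump in this case came from the ops team, but if you have access to the machine, the standard JDK tools can capture both a heap dump and the thread dump we were missing. A minimal sketch (the PID and paths are placeholders; the `:live` option forces a full GC before collecting):

```bash
# Binary heap dump (.hprof) of the live objects in process 12345
jmap -dump:live,format=b,file=/tmp/heap.hprof 12345

# Thread dump, useful for the WAITING/TIMED_WAITING spike seen on the panel
jstack -l 12345 > /tmp/threads.txt

# Quick class histogram without writing a full dump
jmap -histo:live 12345 | head -n 30
```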

2. Getting the heap dump and analyzing it with MAT

I got the heap dump file from the ops colleague; it ends in .dump, as shown below:

[screenshot: heap dump file from ops]

After receiving the file, unzip it and change the suffix to .hprof, otherwise MAT will not recognize it and the import will fail.
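In my case that was just a rename on the command line (file names below are placeholders):

```bash
unzip heapdump.zip               # the ops colleague sent it compressed
mv heapdump.dump heapdump.hprof  # MAT only recognizes the .hprof suffix
```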

Speaking of dump files, there are actually many tools that can analyze them, as follows:

  • Simple: JConsole, JVisualvm, HA, GCHisto, GCViewer
  • Advanced: MAT, JProfiler

Since I have used MAT before, I stuck with MAT for the troubleshooting. A quick introduction first:

 
 

MAT (Eclipse Memory Analyzer) is a fast, feature-rich Java heap analyzer that helps you find memory leaks. It can analyze heap dumps containing hundreds of millions of objects, quickly compute the retained size of objects, show which objects are keeping the garbage collector from reclaiming others, and run reports that automatically flag leak suspects. The official site is https://www.eclipse.org/mat and the download page is https://www.eclipse.org/mat/downloads.php. As the download page shows, MAT can run standalone or be installed into Eclipse as a plugin; I won't cover the installation here. I used the standalone package.
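A side note on large dumps: parsing a huge dump in the GUI can be slow, and the standalone MAT package also ships a headless script that pre-parses the dump and generates the standard reports (you may need to raise MAT's own heap in MemoryAnalyzer.ini). A rough sketch for Linux, with a placeholder file name; on Windows the script is ParseHeapDump.bat:

```bash
# Parse the dump and generate the Leak Suspects report without opening the GUI
./ParseHeapDump.sh heap.hprof org.eclipse.mat.api:suspects
```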

Importing the dump into MAT:

[screenshot: importing the dump into MAT]


After importing, MAT generally starts analyzing automatically, which takes a while. Once the analysis completes, it looks like this:

[screenshot: MAT analysis overview]

2.1. Understanding MAT's analysis report

To do a good job, one must first sharpen one's tools. So let's look at some common operations in MAT and the indicators it reports.

2.1.1. Histogram: shows, for each class, the number of object instances and the memory they occupy

  • [screenshot: Histogram view]

2.1.2. Dominator Tree: shows which objects retain the most memory and their share of the heap (often used to find large objects)

  • [screenshot: Dominator Tree view]

2.1.3. Leak Suspects: automated memory-leak suspect analysis; the pie chart is very intuitive (a powerful tool for tracking down leaks).

  • [screenshot: Leak Suspects report]

2.1.4. Top Components: a series of analyses of the components that occupy more than 1% of the entire heap

It analyzes the components above that 1% threshold from several angles (objects, class loaders, packages, and so on).

[screenshot: Top Components report]

2.1.5. Top Consumers: lists the most expensive objects

  • [screenshot: Top Consumers report]

Top Components and Top Consumers are quite similar; the difference between them does not seem to be big.

2.1.6. Component Report: reports on components (analyzing objects that belong to a common root package or class loader)

2.1.7. Duplicate Classes: used to find classes that have been loaded more than once

Personally I find Component Report and Duplicate Classes less commonly needed, so I will not go into them here.

During this kind of investigation, Histogram, Leak Suspects, Dominator Tree, Top Consumers, and Top Components all deserve careful attention. The more you observe, the closer you get to the answer.

2.2. What MAT showed

2.2.1. Histogram results:

[screenshot: Histogram results]

Since the Histogram lists the instance count and memory footprint of every class, it is normal for the more basic classes to top the list: primitive-type arrays, String, the various collections such as Map and List, and other built-in Java classes. What also shows up prominently here, though, is com.mysql.jdbc.ByteArrayRow.

2.2.2. Dominator Tree result:

It is clear that a single ArrayList retains 56.90% of the entire heap, which means a huge object has appeared. (The objects inside that list are in fact com.mysql.jdbc.ByteArrayRow, the class MySQL's Connector/J driver uses to hold each fetched result-set row!)

[screenshot: Dominator Tree results]

2.2.3. Leak Suspects results:

[screenshot: Leak Suspects results]

The report above tells us there may be a memory leak, so I drilled in and clicked outgoing references (note: outgoing references shows what this thread refers to, or, put simply, which objects this thread is holding on to). The result:

[screenshot: outgoing references of the suspicious thread]

PS: since the Top Consumers and Top Components results are very similar to the Dominator Tree results, I will not post them.

3. Locate the problem and the code snippet

From the analysis of these three views, Histogram, Dominator Tree, and Leak Suspects, the cause of the problem emerges:

A single SQL query returned millions of rows (millions of ByteArrayRow objects), so the List holding them took up 56.90% of the heap, which in turn caused the GC trouble.

Note: unfortunately I did not get the GC log, so I could only confirm that this was a GC problem from two directions: forward (the JVM monitoring panel) plus reverse (the MAT results), rather than starting from the GC log directly.

3.1. The offending code

OK, at this point, based on the SQL statement visible in MAT:

[screenshot: the SQL statement shown in MAT]

I tracked down the code snippet; it looks like this:

[screenshot: the offending code snippet]

When I saw this code, a thousand thoughts ran through my head... For a moment I did not know what to say...

With an empty condition, doesn't this just become select a, b, c from tableA where 1=1? That is a full table scan, plain and brutal...
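The actual code is only visible in the screenshot above, but the shape is the classic "where 1=1" plus an optional-condition guard. A hypothetical reconstruction of that pattern, with made-up table, column, and class names, written here as plain JDBC:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of the problematic pattern; all names are illustrative.
public class TableADao {

    // When "condition" is null, the SQL degenerates to
    // "SELECT a, b, c FROM tableA WHERE 1=1" -- a full table scan -- and every
    // returned row is materialized into one ArrayList, filling the old generation.
    public List<String[]> query(Connection conn, String condition) throws SQLException {
        StringBuilder sql = new StringBuilder("SELECT a, b, c FROM tableA WHERE 1=1");
        if (condition != null) {
            sql.append(" AND a = ?");
        }
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            if (condition != null) {
                ps.setString(1, condition);
            }
            List<String[]> rows = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new String[] {rs.getString("a"), rs.getString("b"), rs.getString("c")});
                }
            }
            return rows; // with a null condition this is the entire table: millions of rows
        }
    }
}
```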

3.2. The fix

The fix is easy: remove the 1=1 and remove the if-guard. When the incoming condition is null, the query matches against null and the result simply comes back empty (because that column is never null).
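Mapped onto the same hypothetical DAO sketch from section 3.1 (same illustrative names and imports), the fix described above might look roughly like this:

```java
// Fixed version of the hypothetical query above: no "1=1", no if-guard.
// A null condition is bound as SQL NULL, and "a = NULL" matches no rows,
// so the method now returns an empty list instead of the whole table.
public List<String[]> query(Connection conn, String condition) throws SQLException {
    String sql = "SELECT a, b, c FROM tableA WHERE a = ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setString(1, condition); // null becomes SQL NULL; "a = NULL" is never true
        List<String[]> rows = new ArrayList<>();
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                rows.add(new String[] {rs.getString("a"), rs.getString("b"), rs.getString("c")});
            }
        }
        return rows;
    }
}
```

An even more defensive variant would be to validate the parameter and return early (or fail fast) before touching the database at all.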

4. Summary

  1. First of all, this investigation leaned on two main tools. The first is JVM monitoring via Prometheus + Grafana: together they let us watch the JVM and see at a glance whether the service is misbehaving.
  2. The second is the heap dump: because the GC on the JVM monitoring panel was so obviously abnormal, I followed the trail to the dump file (thanks to the ops colleague here), analyzed it with MAT, found the suspicious spots, and even pulled out the SQL statement, so the problem code was located quickly. I have to say the investigation went fairly smoothly.
  3. A note: since this article focuses on the actual investigation process, I have not expanded much on CMS GC theory. The small regret this time is that I did not get the GC log; next time something goes wrong I will ask for the GC log, heap dump, and thread dump as early as possible, haha~
  4. Conclusion (important): although I did not get the GC log, based on my experience I can basically pin down the cause of this alarm:

    Indirect cause: a full table scan returned millions of rows, all of which were cached in memory.

    Direct cause: with that much data in memory, the Final Remark (re-marking) phase of the CMS GC takes a very long time (my guess is that this remark is exactly the 46.6 seconds seen on the monitoring panel, or slightly less). Final Remark is stop-the-world, meaning the service pauses all of its application threads for that period, which is what caused the drop in throughput and the mass interface timeouts~

Reflections:

  • A journey of a thousand miles begins with a single step. After working for a while, you find that many seemingly serious problems are actually caused by something very basic, something very simple. The SQL this time really is basic, but it is exactly this kind of basic thing that, if you are not careful, turns into a full table scan. A small table is fine, but a large table returning millions, tens of millions, even hundreds of millions of rows? Of course the interface times out; it is a wonder the service is not down entirely. So respecting every line of code and doing every small thing well is the goal I pursue.


Origin blog.csdn.net/wdj_yyds/article/details/131811721