JVM Summary (5): Basic Operations for Online Troubleshooting

This article talks about analyzing and troubleshooting some common online problems, and it sticks to basic operations: thread problems are too numerous and too strange, and even seasoned engineers run into cases that are hard to crack. More importantly, as the author my own level is limited, so I can only summarize my own experience and what I have heard and read, in the hope that it gives you some ideas when you run into problems.
For Java programmers, troubleshooting online problems is unavoidable; if you walk by the river often enough, sooner or later your shoes get wet. Suddenly you are faced with high CPU, memory overflow, frequent GC, a frozen system, or similar problems. What should we do? How do we solve these problems?

First of all, when a problem occurs, we must locate it, then analyze its cause, then solve it, and finally write a summary to prevent it from happening again.

Common problems of Java services

Online issues to focus on: CPU and memory
Server troubleshooting commands: the top command for CPU, the free command for memory
Note: due to limitations of my environment I have no screenshots of my own; the screenshots below are taken from the web

1. The CPU is soaring

A soaring CPU is probably one of the most common online problems.
The idea: first locate the Java process whose CPU is spiking, then find the thread with the highest CPU usage inside that process, and finally use that thread's stack trace to locate the problematic code and fix it.
Steps:

  1. Use the top command to find the Java process with the highest CPU usage.
    [Screenshot: top output]
    The top command displays the CPU usage of each process, sorted from high to low. Load Average shows the system's average load over the last 1, 5, and 15 minutes; in the figure above the values are 2.46, 1.96, and 1.99.
Label    Description
PID      process ID
USER     process owner
PR       priority
NI       nice value; a negative value means higher priority, a positive value means lower priority
VIRT     total virtual memory used by the process, in KB. VIRT = SWAP + RES
RES      physical memory used by the process that has not been swapped out, in KB. RES = CODE + DATA
SHR      shared memory size, in KB
S        process state: D = uninterruptible sleep, R = running, S = sleeping, T = traced/stopped, Z = zombie
%CPU     percentage of CPU time since the last update
%MEM     percentage of physical memory used by the process
TIME+    total CPU time used by the process, in units of 1/100 second
COMMAND  command name

Here we find that the Java process with the highest CPU usage has PID 11506.
  2. Use the command ps -mp 11506 -o THREAD,tid,time to find the thread with the highest CPU usage inside process 11506.
    [Screenshot: output of ps -mp 11506 -o THREAD,tid,time]
From the figure above, the thread with the highest CPU usage is 11508, at 96.6%.
Convert 11508 to hexadecimal (jstack prints native thread IDs in hexadecimal); you can use printf "%x\n" 11508 to do the conversion, which gives 2cf4.
Then use jstack -l 11506 > jstack.log to dump the process's thread stacks to a log file, and search the log for the hexadecimal thread ID you just computed to see whether some business code is stuck in an infinite loop. My friends work at e-commerce companies, and the problem I hear them mention most often is deadlock, so when you run into this kind of issue you can also check first whether the system has a deadlock.
If there is no deadlock, check whether GC is running wild. One engineer's CPU spiked because he called a third-party service through HttpClient without setting a timeout while the business kept creating objects; as a result the JVM kept GCing and pushed the CPU to its limit. As you can see, a CPU spike can have many different causes, and the key is to master the common methods of analysis.
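Putting these steps together, here is a minimal command sketch of the whole workflow. The PID 11506 and TID 11508 come from the example screenshots; substitute your own values.

```bash
# 1. Find the Java process with the highest CPU usage (press P in top to sort by CPU)
top

# 2. List that process's threads, with per-thread CPU time
ps -mp 11506 -o THREAD,tid,time

# 3. Convert the hottest thread ID to hexadecimal (jstack prints native thread IDs as nid=0x...)
printf "%x\n" 11508        # -> 2cf4

# 4. Dump all thread stacks of the process and look up that thread by its nid
jstack -l 11506 > jstack.log
grep -A 30 "nid=0x2cf4" jstack.log

# 5. While you are at it, check whether jstack detected a deadlock
grep -A 20 "Java-level deadlock" jstack.log
```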

2. Troubleshooting memory problems

The CPU spikes discussed above are, for the most part, pits we dug for ourselves.
The memory problems we are talking about here are usually GC problems, and there are two situations: either memory actually overflows, or memory does not overflow but GC is unhealthy.

In the memory-overflow case, you can add the -XX:+HeapDumpOnOutOfMemoryError parameter, which makes the JVM write out a heap dump file when the program runs out of memory.
With the dump file in hand, you can analyze it with dump-analysis tools; the commonly used MAT, JProfiler, jvisualvm and similar tools can all do the job, and they show where the overflow happened, where large numbers of objects were created, and so on.
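As a sketch, the startup flags could look like the following; the heap size, dump path, and application jar are only placeholders.

```bash
# Write a heap dump automatically when an OutOfMemoryError occurs
java -Xms4g -Xmx4g \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/data/dumps/ \
     -jar my-app.jar
```

The resulting .hprof file can then be opened in MAT, JProfiler, or jvisualvm.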

The second situation, unhealthy GC, is more complicated.
What does healthy GC look like?
Roughly: a YGC about every 5 seconds, each taking no more than 50 ms; ideally no FGC at all, and with CMS roughly one GC per day.

GC optimization has two dimensions: frequency and duration.
Let's look at YGC first, starting with frequency. If a YGC happens only every 5 seconds or even less often, the memory is over-provisioned and the young generation can be shrunk; if YGC is very frequent, Eden is too small and can be enlarged. However, the whole young generation should stay at roughly 30%-40% of the heap, and the eden : from : to ratio should be around 8:1:1, adjusted according to how much gets promoted.
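For reference, these sizes map to the following flags; the concrete numbers are only an illustrative assumption (a 4 GB heap with a young generation of roughly 1.5 GB).

```bash
# Heap 4 GB, young generation ~1.5 GB (about 30%-40% of the heap),
# eden : from : to = 8 : 1 : 1
java -Xms4g -Xmx4g -Xmn1536m -XX:SurvivorRatio=8 -jar my-app.jar
```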

What if YGC takes too long? YGC has two phases, scanning and copying. Scanning is usually fast; copying is relatively slow, so if a large number of objects have to be copied each time, the STW pause gets longer. Another situation is the StringTable, the data structure that holds references to the interned strings returned by String.intern; YGC scans this hash table every time, so if it is very large the STW pause gets longer. Yet another situation is operating-system virtual memory: if the GC happens while the OS is swapping memory, the STW pause also gets longer.
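If you suspect the StringTable or swapping, a couple of knobs are worth knowing; the table size below is just an example value, and the flags are HotSpot-specific.

```bash
# Enlarge the interned-string hash table and print its statistics when the JVM exits
java -XX:StringTableSize=1000003 -XX:+PrintStringTableStatistics -jar my-app.jar

# Discourage the OS from swapping JVM memory out (Linux, requires root)
sysctl -w vm.swappiness=1
```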

Now let's look at FGC. In practice we can only optimize the frequency of FGC, not its duration, because the duration is not something we control. So how do we optimize the frequency?
First of all, there are several triggers for FGC: (1) insufficient memory in the old generation; (2) insufficient memory in the metaspace; (3) System.gc(); (4) jmap or jcmd; (5) CMS promotion failed or concurrent mode failure; (6) the JVM's pessimistic strategy, which estimates that after a YGC the old generation would not be able to hold the objects due for promotion, so it cancels the YGC and performs an FGC up front.
The usual optimization target is FGC caused by insufficient old-generation memory. If a large number of objects still remain after an FGC, the old generation is simply too small and should be enlarged. If an FGC reclaims a lot, the old generation is full of short-lived objects, and the goal should be to let those objects die young via YGC; the usual way is to enlarge the young generation, and if there are large short-lived objects, use parameters to set the object-size threshold so that they do not go straight into the old generation. Also check whether the tenuring age is too low: if after a YGC a large number of objects are promoted early because they cannot fit into the Survivor space, the Survivor space should be enlarged, but not excessively.
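A sketch of the flags behind these optimization points; the values are illustrative, and note that -XX:PretenureSizeThreshold only takes effect with the Serial/ParNew young-generation collectors.

```bash
# -Xmn                        enlarge the young generation so short-lived objects die in YGC
# -XX:MaxTenuringThreshold    how many YGCs an object survives before promotion (max 15)
# -XX:PretenureSizeThreshold  objects above this size in bytes are allocated directly in the
#                             old generation; raise it so large short-lived objects stay young
# -XX:+DisableExplicitGC      ignore System.gc() calls (careful if the app relies on them
#                             to release NIO direct buffers)
java -Xms4g -Xmx4g -Xmn1536m \
     -XX:MaxTenuringThreshold=10 \
     -XX:PretenureSizeThreshold=1048576 \
     -XX:+DisableExplicitGC \
     -jar my-app.jar
```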

The above are all optimization ideas, and we also need some tools to know the status of GC.
The JDK ships with many tools, such as jmap, jcmd and so on. Oracle officially recommends using jcmd instead of jmap, and jcmd can indeed replace most of jmap's functions. jmap can print the object distribution (histogram) and dump the heap to a file. Note that jmap and jcmd trigger an FGC when dumping live objects, so think about the situation before using them in production.
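For example, using the assumed PID 11506 from earlier; the "live" variants are the ones that force a full GC.

```bash
# Object histogram: which classes occupy the most memory
jmap -histo 11506 | head -n 20
jmap -histo:live 11506 | head -n 20          # "live" forces a full GC first

# Heap dump to a file (live objects only, so this also triggers a full GC)
jmap -dump:live,format=b,file=/tmp/heap.hprof 11506

# The jcmd equivalents recommended by Oracle
jcmd 11506 GC.class_histogram
jcmd 11506 GC.heap_dump /tmp/heap.hprof
```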
Another commonly used tool is jstat, which shows detailed GC information, such as the memory usage of eden, from, to, the old generation, and so on.
Another tool is jinfo, which shows the flags the current JVM is running with, and can also modify (manageable) flags on the fly without restarting.
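For example, again with the assumed PID 11506:

```bash
# GC summary: usage of S0/S1/Eden/Old/Metaspace plus YGC/FGC counts and times, every second
jstat -gcutil 11506 1000

# Detailed capacities and usage of eden, from, to, old, metaspace
jstat -gc 11506 1000

# Show the flags the running JVM was started with
jinfo -flags 11506
jinfo -flag MaxHeapSize 11506

# Flip a manageable flag at runtime without a restart
jinfo -flag +HeapDumpOnOutOfMemoryError 11506
```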
And then there are the visual tools mentioned above for analyzing dump files: MAT, JProfiler, jvisualvm and so on. They can analyze the files dumped by jmap, show which objects are using the most memory, and usually the problem can be found that way.

One more very important point: production environments must always run with GC logging enabled!!!
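For reference, a sketch of typical GC-logging flags; the log path and rotation sizes are example values.

```bash
# Java 8 and earlier
java -Xloggc:/data/logs/gc.log \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M \
     -jar my-app.jar

# Java 9+ (unified logging)
java -Xlog:gc*:file=/data/logs/gc.log:time,uptime:filecount=5,filesize=20m \
     -jar my-app.jar
```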

Online problems are many and varied and cannot be covered in one or two articles. What we need to master is the method: with the right method we will have ideas no matter what problem we run into, and experience matters a lot, so after you encounter a problem and solve it, write up a summary.

I hope all fellow code apes can go live smoothly, with fewer bugs.

Origin: blog.csdn.net/u010994966/article/details/103011654