Java Online Troubleshooting

Online faults mainly involve the CPU, disk, memory, and network, and a single incident often spans more than one of these layers, so it is best to check all four in turn when troubleshooting. Tools such as jstack and jmap are also not limited to a single layer. In practice the routine is: run df, free, and top first, then bring in jstack and jmap as needed, and analyze the specific problem from there.

cpu

Generally we troubleshoot CPU problems first, since CPU anomalies are usually the easiest to locate. Common causes include business-logic bugs (such as infinite loops), frequent GC (covered in the GC section below), and excessive context switching. The most common culprit is business logic (or framework logic), which can be analyzed with jstack against the corresponding thread stacks.
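As a deliberately simplified illustration, here is a hedged sketch of the kind of business-logic bug, an accidental busy loop, that top -H -p would surface as a hot thread and jstack would show as a RUNNABLE frame; the class and thread names are made up.

public class HotLoopDemo {
    public static void main(String[] args) {
        // Hypothetical "poller" that was supposed to wait for work but never blocks.
        Thread worker = new Thread(HotLoopDemo::pollForever, "order-poller");
        worker.start();
    }

    private static void pollForever() {
        while (true) {
            // Bug: no blocking call and no sleep/backoff, so this thread spins at ~100% CPU.
            // top -H -p <pid> shows "order-poller" near the top; jstack shows it RUNNABLE here.
        }
    }
}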

jstack

We first use the ps command to find the pid of the corresponding process (if there are several candidate processes, use top to see which one is consuming the most CPU).

ps

Then use top -H -p pid to find the threads with relatively high CPU usage, and convert the pid of the busiest thread to hexadecimal with printf '%x\n' pid to get the nid.

top -H -p pid
printf '%x\n' pid

Then search for the corresponding stack directly in the jstack output:

jstack pid | grep 'nid' -C5 --color


In the output we can find the stack for the thread with nid 0x42, and then we just need to analyze it carefully.
Of course, more often we analyze the whole jstack dump. We usually pay particular attention to the WAITING and TIMED_WAITING sections, and BLOCKED goes without saying.
We can use the following command to get an overall picture of the thread states in the dump; if there are a lot of WAITING threads and the like, there is probably a problem.

cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c
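As an illustration, here is a hedged sketch (class and thread names are invented) of the classic pattern behind a pile of BLOCKED threads: several threads contending for one coarse lock. In a jstack dump, one thread would sit inside doSlowWork() holding the monitor while the other shows up as BLOCKED (on object monitor) waiting for the same lock.

import java.util.concurrent.TimeUnit;

public class LockContentionDemo {
    private static final Object LOCK = new Object();

    public static void main(String[] args) {
        // Both threads fight over the same coarse lock; whichever loses shows up
        // in jstack as BLOCKED, pointing straight at LOCK.
        new Thread(LockContentionDemo::doSlowWork, "worker-1").start();
        new Thread(LockContentionDemo::doSlowWork, "worker-2").start();
    }

    private static void doSlowWork() {
        synchronized (LOCK) {
            try {
                TimeUnit.MINUTES.sleep(10); // simulate a very long critical section
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}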

context switch

For frequent context switching we can use the vmstat command to take a look.

For more detail, collect data continuously, for example 10 samples at 3-second intervals:

vmstat 3 10

The cs (context switch) column shows the number of context switches per second. If we want to monitor a specific pid, we can use:

pidstat -w -p pid 1


cswch/s and nvcswch/s show voluntary and involuntary context switches per second, respectively.
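If a Java-side illustration helps, the sketch below (thread count and sleep interval are arbitrary) starts far more threads than there are cores, each of which blocks and wakes constantly; running pidstat -w -p <pid> 1 against such a process shows an elevated cswch/s.

import java.util.concurrent.TimeUnit;

public class ContextSwitchDemo {
    public static void main(String[] args) {
        // Far more threads than cores, each blocking and waking every millisecond.
        for (int i = 0; i < 500; i++) {
            new Thread(() -> {
                while (true) {
                    try {
                        TimeUnit.MILLISECONDS.sleep(1); // block, give up the CPU, wake up again
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }, "chatty-" + i).start();
        }
    }
}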

disk

Disk problems are as fundamental as CPU problems. First, for disk space, we check the state of the file system directly with:

df -hl


More often than not, though, disk problems are performance problems, which we can analyze with iostat:

iostat -d -k -x


The last column, %util, shows how busy each disk is, and rrqm/s and wrqm/s show the rates of merged read and write requests; together this is usually enough to pinpoint which disk is having problems.

We also need to know which process is doing all the reading and writing. Usually the developer already has a good idea, but the iotop command can be used to locate the source of the file IO. What it reports, however, is a tid; to convert it to a pid we can use readlink:

readlink -f /proc/*/task/tid/../..


Once we have the pid, we can look at the process's cumulative read/write statistics:

cat /proc/pid/io


And we can see which files the process is actually reading and writing with lsof:

lsof -p pid


Memory

Troubleshooting memory problems is more involved than CPU and covers more scenarios, mainly OOM, GC issues, and off-heap memory. Generally we first use the free command to get an overview of memory usage.

heap memory

Most memory problems are heap memory problems, and they mainly surface as OOM or StackOverflow errors.

OOM

OOM (insufficient memory in the JVM) can be roughly divided into the following categories:

  • Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread — there is not enough memory to allocate a native stack for a new thread. This is usually a problem in thread-pool code, such as forgetting to call shutdown, so look for the problem at the code level first, using jstack or jmap. If everything there is normal, you can reduce the size of a single thread stack with -Xss, or raise the OS limits on threads by editing /etc/security/limits.conf (nofile and nproc). A minimal sketch of this failure mode follows this list.
  • Exception in thread "main" java.lang.OutOfMemoryError: Java heap space — heap usage has reached the maximum set by -Xmx; this should be the most common OOM error. Again, look in the code first: suspect a memory leak and use jstack and jmap to locate it. If everything is normal, the heap can be enlarged by raising -Xmx.
  • Caused by: java.lang.OutOfMemoryError: Metaspace — metaspace usage has reached the maximum set by -XX:MaxMetaspaceSize. The troubleshooting approach is the same as above; the limit itself is adjusted via -XX:MaxMetaspaceSize (the pre-1.8 permanent generation and -XX:MaxPermSize are not covered here).
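As promised above, here is a minimal, hedged sketch of the first category; the "per request pool" is hypothetical and only meant to show why a forgotten shutdown leaks live threads until the JVM can no longer create native ones. Do not run it on a shared machine.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class NativeThreadOomDemo {
    public static void main(String[] args) {
        while (true) {
            // Bug: a new pool per "request" and shutdown() is never called,
            // so every iteration leaks a live worker thread until the JVM throws
            // java.lang.OutOfMemoryError: unable to create new native thread.
            ExecutorService perRequestPool = Executors.newFixedThreadPool(10);
            perRequestPool.submit(() -> {
                try {
                    TimeUnit.HOURS.sleep(1); // keep the worker thread alive
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }
}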

StackOverflow

Stack memory overflow — you have probably seen this one a lot. Exception in thread "main" java.lang.StackOverflowError means a thread needs more stack space than the -Xss value allows. Again, check the code first; the limit is adjusted via -Xss, but setting it too large may in turn contribute to OOM.
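A minimal sketch of the error, assuming nothing more than unbounded recursion; the same frame repeated thousands of times in the resulting stack trace is the usual giveaway.

public class StackOverflowDemo {
    private static long depth = 0;

    private static void recurse() {
        depth++;
        recurse(); // no termination condition, so the stack eventually overflows
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // The trace above this point repeats recurse() thousands of times.
            System.out.println("StackOverflowError at depth " + depth);
        }
    }
}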

jmap

For the code-level troubleshooting of OOM and StackOverflow described above, we generally use jmap:

jmap -dump:format=b,file=filename pid

This exports a heap dump file, which we then load into an analysis tool such as MAT (Eclipse Memory Analyzer Tool). For memory-leak problems we can usually go straight to Leak Suspects, where MAT gives its leak suggestions, or to Top Consumers to see the report of the largest objects. Thread-related questions can be analyzed in the thread overview, and the Histogram view of classes is there if you want to dig through things yourself; plenty of MAT tutorials cover the details.


In daily development, memory leaks in code are common and well hidden, so developers need to watch the details. Typical causes of code-level OOM include creating new objects for every request so that large numbers of duplicate objects accumulate, opening file streams without closing them properly, triggering GC manually in inappropriate ways, unreasonable ByteBuffer cache allocation, and so on.
In addition, we can add the following flag to the startup parameters so that a dump file is saved automatically when an OOM occurs:

-XX:+HeapDumpOnOutOfMemoryError
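To make the "repeated object creation that is never released" case concrete, here is a hedged sketch of a grow-only static cache; the class name and sizes are made up. Run it with a small -Xmx plus the flag above and MAT's Leak Suspects report will point at the static List as the dominator.

import java.util.ArrayList;
import java.util.List;

public class HeapLeakDemo {
    // Bug: a "cache" that only ever grows, so nothing it holds can be collected.
    private static final List<byte[]> CACHE = new ArrayList<>();

    public static void main(String[] args) {
        while (true) {
            CACHE.add(new byte[1024 * 1024]); // 1 MB per "request", never evicted
        }
    }
}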

off-heap memory

It is really unfortunate if you run into an off-heap memory overflow. The first symptom is that physical resident memory grows rapidly; whether an error is reported depends on how the off-heap memory is used. If it comes from Netty, an OutOfDirectMemoryError may appear in the error log; if DirectByteBuffer is used directly, the error is OutOfMemoryError: Direct buffer memory.
Off-heap memory overflow is often related to the use of NIO. Generally we first check the memory occupied by the process with pmap:

pmap -x pid | sort -rn -k3 | head -30

This shows the process's top 30 memory segments sorted by resident size in descending order. You can run the command again after a while to see how the memory grows, or compare against a normal machine to spot suspicious memory segments.


In practice, though, the operations above rarely locate the specific problem. The key is to read the error stack, find the suspicious objects, understand their reclamation mechanism, and then analyze those objects. For example, memory allocated through DirectByteBuffer is only reclaimed on a full GC or an explicit System.gc() (which is one reason it is best not to use -XX:+DisableExplicitGC). So we can track the DirectByteBuffer objects and trigger a full GC manually with jmap -histo:live pid to see whether the off-heap memory gets reclaimed. If it does, then most likely the off-heap allocation itself is simply too small and can be raised with -XX:MaxDirectMemorySize. If nothing changes, use jmap to analyze the objects that cannot be collected and their reference relationships to DirectByteBuffer.
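A hedged sketch of the DirectByteBuffer case described above; sizes and the -XX:MaxDirectMemorySize value are made up for illustration. Because the buffers stay referenced, even a full GC cannot release the native memory, which is exactly the situation where jmap -histo:live shows no improvement.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectBufferDemo {
    // Off-heap (direct) buffers that stay referenced and therefore can never be freed.
    private static final List<ByteBuffer> RETAINED = new ArrayList<>();

    public static void main(String[] args) {
        // With e.g. -XX:MaxDirectMemorySize=64m this fails fairly quickly with
        // java.lang.OutOfMemoryError: Direct buffer memory, while the heap itself stays small.
        while (true) {
            RETAINED.add(ByteBuffer.allocateDirect(1024 * 1024)); // 1 MB of native memory each
        }
    }
}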

GC problem

As the saying goes, to do a good job you must first sharpen your tools, so we should know which tools are available for looking at GC information:

  • The company's monitoring system: most companies have one, and it can monitor JVM metrics across the board.
  • The JDK's built-in tools, including common commands such as jmap and jstat:
    • View the usage and GC status of each heap region: jstat -gcutil -h20 pid 1000
    • View the object histogram of the heap, sorted by occupied space: jmap -histo pid | head -n20
    • Dump the heap to a file: jmap -dump:format=b,file=heap pid
  • Visual heap analysis tools: JVisualVM, MAT, and so on.

Often the first step is to determine whether GC is too frequent; the following command observes how the generations change over time:

jstat -gc pid 1000


Here 1000 is the sampling interval in milliseconds. S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU show the capacity and usage of the two survivor spaces, the Eden space, the old generation, and the metaspace, respectively. YGC/YGCT and FGC/FGCT show the count and total time of young and full GCs, and GCT is the total GC time. If GC looks frequent, analyze it further.

We use GC logs to troubleshoot GC problems. Add -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps to the startup parameters to enable GC logging; on JDK 9 and later you can use -Xlog:gc* instead.

Will GC affect the program? I think it comes down to the following four situations, which differ in severity:

  • YGC too frequent: even if each YGC does not cause timeouts, overly frequent YGC lowers the overall performance of the service; high-concurrency services need to pay attention to this.
  • YGC takes too long: generally a YGC totalling tens of milliseconds, or even around a hundred, is normal; the resulting pause of a few to tens of milliseconds is almost imperceptible to users, and its impact on the program is negligible. But if a YGC takes one second or even several seconds (almost catching up with an FGC), the pauses become large, and since YGC itself is frequent this leads to many more service timeouts.
  • FGC too frequent: FGC is usually slow, anywhere from a few hundred milliseconds to several seconds. Normally FGC runs every few hours or even days and its impact on the system is acceptable. But once FGC starts running frequently (for example every few tens of minutes), something is definitely wrong: worker threads are stopped again and again, the system appears to be constantly stuck, and overall performance deteriorates.
  • FGC takes too long: when FGC gets slower, the pauses get longer too. For high-concurrency services in particular this can mean more timeouts during FGC and reduced availability, so it also needs attention.

Of these, "FGC too frequent" and "YGC takes too long" are the most typical GC problems and are very likely to affect service quality. The other two are less severe, but high-concurrency or high-availability programs should watch them as well.

YGC is too frequent

Frequent YGC usually means a large number of small, short-lived objects are being created. First consider whether the young generation is too small and whether adjusting parameters such as -Xmn and -XX:SurvivorRatio solves it. If the parameters look fine but the young GC frequency is still too high, use jmap and MAT to examine a heap dump further.
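For intuition, here is a hedged sketch of the allocation churn that drives frequent YGC; batch sizes and the sleep are arbitrary. Watching it with jstat -gc <pid> 1000 shows the YGC counter climbing steadily, and shrinking -Xmn makes it climb faster.

import java.util.ArrayList;
import java.util.List;

public class YoungGcChurnDemo {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // ~10 MB of short-lived objects per round: Eden fills quickly and
            // jstat -gc <pid> 1000 shows the YGC counter climbing.
            List<byte[]> batch = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                batch.add(new byte[10 * 1024]);
            }
            batch.clear(); // everything becomes garbage immediately
            Thread.sleep(10);
        }
    }
}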

YGC takes too long

For long YGC pauses, look at where the time goes in the GC log. Taking G1 logs as an example, focus on phases such as Root Scanning, Object Copy, and Ref Proc. If Ref Proc is slow, look at reference-related objects; if Root Scanning is slow, look at the number of threads and cross-generational references; for Object Copy, look at object lifetimes. Timing analysis also needs horizontal comparison, against other services or against normal time periods; for example, if Root Scanning takes noticeably longer than it does in normal periods, it usually means too many threads have been started.

FGC too frequent

Frequent FGC may be due to insufficient memory, which simply needs to be enlarged, or to explicit System.gc() calls in the code; otherwise there is a memory leak on the heap. Again, use jmap and MAT to examine a heap dump further.

FGC takes too long

This one is harder to handle. Each collector behaves differently, and the configuration can only be tuned according to the characteristics of the collector in use.

Network problems

Network-level problems are generally the most complicated: there are many scenarios and they are hard to pin down, which makes them a nightmare for most developers. Some examples are given here, covering the TCP layer, the application layer, and the use of tools.

Timeouts

Most timeout errors are at the application level, so this part focuses on concepts. Timeouts can be roughly divided into connection timeouts and read/write timeouts; client frameworks that use connection pools also have connection-acquisition timeouts and idle-connection-eviction timeouts.

  • Read/write timeout. readTimeout/writeTimeout, called so_timeout or socketTimeout in some frameworks, all refer to data read/write timeouts. Note that most of these are logical timeouts, and the timeout of an RPC (SOA) framework usually refers to the read timeout as well. Read/write timeouts are generally set only on the client side.
  • Connection timeout. connectionTimeout on the client side refers to the maximum time allowed to establish a connection with the server. On the server side the meaning varies: in Jetty it is the idle-connection cleanup time, while in Tomcat it is the maximum time a connection is kept. (A minimal client-side sketch of both timeout settings follows this list.)
  • Others. These include the connection-acquisition timeout connectionAcquireTimeout and the idle-connection cleanup timeout idleConnectionTimeout, mostly used by client or server frameworks with connection pools or queues.
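A minimal client-side sketch of the two timeouts above using plain java.net.Socket (host, port, and values are made up); HTTP clients and RPC frameworks expose the same two knobs under names such as connectTimeout and readTimeout/socketTimeout.

import java.net.InetSocketAddress;
import java.net.Socket;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            // Connection timeout: give up if the TCP connection is not established within 1s.
            socket.connect(new InetSocketAddress("example.com", 80), 1_000);
            // Read timeout: any blocking read() throws SocketTimeoutException after 3s.
            socket.setSoTimeout(3_000);
            socket.getOutputStream().write("GET / HTTP/1.0\r\nHost: example.com\r\n\r\n".getBytes());
            int firstByte = socket.getInputStream().read(); // bounded by the 3s read timeout
            System.out.println("first byte: " + firstByte);
        }
    }
}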

When setting the various timeouts, we need to make sure the client-side timeout stays smaller than the server-side timeout wherever possible, so that connections end normally.
In actual development, what we usually care about most is the read/write timeout of an interface.
Setting a reasonable interface timeout is a problem in itself: if it is too long, it may hold on to too many of the server's TCP connections; if it is too short, the interface will time out very frequently.
Another problem is when the server has clearly lowered the interface's RT, yet the client still keeps timing out. The explanation is simple: the path from client to server includes network transmission, queueing, and service processing, and each of these can be the time sink.

TCP queue overflow

TCP queue overflow is a relatively low-level error that can cause more visible errors such as timeouts and RSTs. Because the symptoms are indirect, the error is subtle, so it is worth discussing separately.

There are two queues involved: the half-connection (SYN) queue and the full-connection (accept) queue. During the three-way handshake, after the server receives the client's SYN it puts the connection into the SYN queue and replies with SYN+ACK. When the server then receives the client's ACK, it moves the connection from the SYN queue into the accept queue if the accept queue is not full; otherwise it acts according to tcp_abort_on_overflow.
tcp_abort_on_overflow = 0 means that if the accept queue is full at the third step of the handshake, the server simply drops the client's ACK. tcp_abort_on_overflow = 1 means the server sends an RST to the client, aborting the handshake and the connection; in that case you may see many connection reset / connection reset by peer errors in the logs.
So in actual development, how can we quickly tell whether the TCP queues are overflowing? Use the following commands.

netstat command: netstat -s | egrep "listen|LISTEN"


In this output, the "overflowed" counter shows how many times the full-connection (accept) queue overflowed, and "sockets dropped" shows drops from the half-connection (SYN) queue.

or the following command:

ss -lnt


In the ss -lnt output, Send-Q shows the maximum size of the full-connection queue for the listening port, and Recv-Q shows how much of the full-connection queue is currently in use.
Next, how are the sizes of the full-connection and half-connection queues determined?
The size of the full-connection queue is min(backlog, somaxconn): backlog is passed in when the socket is created, and somaxconn is an OS-level kernel parameter. The size of the half-connection queue is max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog).
In daily development we usually run behind a servlet container, so we sometimes need to pay attention to the container's connection backlog as well: in Tomcat the backlog is called acceptCount, and in Jetty acceptQueueSize.
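On the Java side, the backlog mentioned above is the second argument of ServerSocket (the port and backlog value below are made up); the effective accept queue is still min(backlog, somaxconn), and if accept() drains too slowly the overflow counters reported by netstat -s / ss -lnt start to grow.

import java.net.ServerSocket;
import java.net.Socket;

public class BacklogDemo {
    public static void main(String[] args) throws Exception {
        // 100 is the requested accept-queue (backlog) length, capped by net.core.somaxconn.
        try (ServerSocket server = new ServerSocket(8080, 100)) {
            while (true) {
                Socket client = server.accept(); // draining too slowly here fills the accept queue
                client.close();
            }
        }
    }
}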

RST exception

An RST packet means "reset" and is used to close useless connections; it usually indicates an abnormal close, as opposed to the normal four-way (FIN) close.
In actual development we often see connection reset / connection reset by peer errors, which are caused by RST packets. The common situations are listed below.

port does not exist

If a SYN requesting a connection is sent to a port that is not listening, the server finds that it has no socket on that port and replies directly with an RST to reject the connection.

Actively terminating the connection with RST instead of FIN

Normally a connection is closed via FIN packets, but a connection can also be terminated immediately with an RST instead of a FIN. In practice this can be controlled with the SO_LINGER socket option. It is sometimes done deliberately to skip TIME_WAIT and improve efficiency, but it should not be used unless really necessary.
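A hedged sketch of the SO_LINGER trick (host and port are made up): with linger enabled and a timeout of 0, close() performs an abortive close and, on common TCP stacks, sends an RST instead of going through FIN and TIME_WAIT.

import java.net.InetSocketAddress;
import java.net.Socket;

public class LingerDemo {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket();
        socket.connect(new InetSocketAddress("example.com", 80), 1_000);
        socket.setSoLinger(true, 0); // linger on, timeout 0: close() becomes an abortive close
        socket.close();              // the peer typically sees an RST / "connection reset"
    }
}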

One side (client or server) encounters an exception and sends an RST to the other end to close the connection

The RST sent on TCP queue overflow that we discussed above actually belongs to this category: for some reason one side can no longer handle the connection normally (for example the program has crashed, or a queue is full) and tells the other side to close the connection.

A TCP packet is received that does not belong to any known connection

For example, one machine loses a TCP segment because of a bad network, the other side closes the connection, and then the missing segment finally arrives much later. Since the corresponding TCP connection no longer exists, the receiving side replies directly with an RST.

One side receives no acknowledgment from the other for a long time and sends an RST after a certain time or number of retransmissions

These cases are mostly related to the network environment; a poor network can lead to more RST packets.
As mentioned before, too many RST packets make the program report errors. A read on a connection that has already been closed reports connection reset, while a write on such a connection reports connection reset by peer. You may also see broken pipe, a pipe-level error that means reading from or writing to a closed pipe; it usually occurs when the program keeps reading or writing the socket after an RST has already been received and connection reset reported. This is also described in the glibc source comments.
How do we confirm the presence of RST packets when troubleshooting? Capture packets with tcpdump and do a simple analysis with Wireshark: tcpdump -i en0 tcp -w xxx.cap, where en0 is the network interface to capture on.

Then open the capture file in Wireshark; RST packets stand out in red under the default coloring rules.

TIME_WAIT and CLOSE_WAIT

Everyone presumably knows what TIME_WAIT and CLOSE_WAIT mean. Online, we can directly run:

netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

to see the number of connections in each state, including TIME_WAIT and CLOSE_WAIT.
Using the ss command is faster:

ss -ant | awk '{++S[$1]} END {for(a in S) print a, S[a]}'

TIME_WAIT

TIME_WAIT exists, first, so that delayed packets from an old connection are not picked up by a later connection reusing the same address and port, and second, so that the connection can close reliably within the 2MSL window. Its presence actually greatly reduces the occurrence of RST packets.
An excess of TIME_WAIT sockets tends to appear in scenarios with many short-lived connections. In that case, some kernel parameters can be tuned on the server side:

# Enable reuse: allow sockets in TIME_WAIT to be reused for new TCP connections. Default 0 (off).
net.ipv4.tcp_tw_reuse = 1
# Enable fast recycling of TIME_WAIT sockets. Default 0 (off).
net.ipv4.tcp_tw_recycle = 1

Of course, do not forget that in a NAT environment tcp_tw_recycle can cause packets to be rejected because of timestamp mismatches. Another option is to lower tcp_max_tw_buckets, so that TIME_WAIT sockets beyond that number are killed, although this will also cause "time wait bucket table overflow" to be reported in the logs.

CLOSE_WAIT

CLOSE_WAIT usually means the application itself has a problem: it never sends its own FIN after ACKing the peer's FIN. CLOSE_WAIT occurs even more often than TIME_WAIT and its consequences are more serious, because it usually means something is blocked and connections are not being closed properly, gradually eating up all the threads.
To locate this kind of problem, it is best to analyze the thread stacks with jstack, as described in the sections above. Here is just one example.
A developer reported that CLOSE_WAIT kept increasing after a release until the application died. The jstack output showed a suspicious pattern: most threads were stuck in CountDownLatch.await. Talking to the developers revealed that multi-threading was used without catching exceptions; after fixing that, the underlying exception turned out to be nothing more than the common ClassNotFoundException that appears after upgrading an SDK.
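As a hedged reconstruction of that bug pattern (names and the exception are invented), the sketch below shows the corrected form: countDown() sits in a finally block, so even when the task throws, await() returns and the code that would close the connection can run. In the buggy version countDown() was outside finally, the exception skipped it, and the waiting thread, along with the connection it owned, stayed stuck, leaving the socket in CLOSE_WAIT.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CloseWaitLatchDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch latch = new CountDownLatch(1);
        pool.execute(() -> {
            try {
                throw new RuntimeException("simulated failure, e.g. a ClassNotFoundException");
            } finally {
                latch.countDown(); // without this finally, await() below would never return
            }
        });
        latch.await();   // in the buggy version (countDown() not in finally) the caller parks here
                         // forever, which is exactly what the jstack dump showed
        pool.shutdown(); // ...and cleanup such as connection.close() would never be reached
    }
}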
