JAVA online troubleshooting routines, from CPU, disk, memory, network to GC one-stop!

Online faults mainly involve CPU, disk, memory, and network problems, and most faults span more than one layer, so when troubleshooting, try to check the four areas in turn.

At the same time, tools such as jstack and jmap are not limited to one type of problem. Basically, for any problem you start with df, free, and top, then bring in jstack and jmap as needed; specific problems are analyzed case by case.

CPU

Generally speaking, we troubleshoot CPU problems first, since CPU anomalies are often easier to locate. Causes include business logic problems (infinite loops), frequent gc, and excessive context switching. The most common cause is business logic (or framework logic), and jstack can be used to analyze the corresponding thread stacks.
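
For illustration, here is a minimal, hypothetical sketch of the kind of business-logic bug that produces this picture: the thread below would show up near 100% CPU in top -H -p pid and as a RUNNABLE thread in the jstack output.

// Hypothetical example: a runaway loop that never blocks, burning one core.
public class BusyLoop {
    public static void main(String[] args) {
        new Thread(() -> {
            long counter = 0;
            while (true) {      // missing exit condition -> endless loop
                counter++;      // pure CPU work, the thread never sleeps or waits
            }
        }, "busy-worker").start();
    }
}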

Use jstack to analyze cpu problems

First use the ps command to find the pid of the target process (if there are several candidate processes, use top to see which one has the higher usage).

Then use top -H -p pid to find the threads with high CPU usage.

Then convert the id of the busiest thread to hexadecimal with printf '%x\n' pid to get the nid.

Then find the corresponding stack information directly in jstack: jstack pid | grep 'nid' -C5 --color

You can see that we have found the stack information with nid 0x42, and then we just need to analyze it carefully.

Of course, it is more common to analyze the whole jstack file. Usually we pay more attention to the WAITING and TIMED_WAITING sections, not to mention BLOCKED. We can use cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c to get an overall picture of the thread states in the jstack output. If there are too many WAITING threads, there is probably a problem.
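
As a rough illustration of where BLOCKED threads come from, the hypothetical sketch below (class and thread names are made up) has many threads competing for one coarse lock: the thread holding the lock shows as TIMED_WAITING in jstack, while the rest show as BLOCKED on the same monitor.

// Hypothetical sketch: heavy contention on a single monitor produces many BLOCKED threads.
public class LockContention {
    private static final Object LOCK = new Object();

    public static void main(String[] args) {
        for (int i = 0; i < 50; i++) {
            new Thread(() -> {
                synchronized (LOCK) {             // only one thread gets in at a time
                    try {
                        Thread.sleep(60_000);     // hold the lock for a long time
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "worker-" + i).start();
        }
    }
}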

Frequent gc

Of course, we still use jstack to analyze the problem, but sometimes we can first confirm whether gc is too frequent by running jstat -gc pid 1000 to observe changes in the GC generations; 1000 is the sampling interval (ms). S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU are the capacity and usage of the two Survivor spaces, the Eden space, the old generation, and the metaspace respectively. YGC/YGCT, FGC/FGCT, and GCT are the counts and times of YoungGC and FullGC and the total GC time. If gc looks frequent, do further analysis on gc.

Context switch

For frequent context switching problems, we can use the vmstat command to check.

The cs (context switch) column represents the number of context switches.

If we want to monitor a specific pid, we can use the pidstat -w pid command; cswch and nvcswch are voluntary and involuntary context switches respectively.

Disk

Disk problems, like CPU ones, are relatively fundamental. First is disk space: we simply use df -hl to view the file system status.

More often, though, disk problems are performance problems. We can analyze them with iostat: iostat -d -k -x

The last column, %util, shows how heavily each disk is used, while rrqm/s and wrqm/s show the merged read and write requests per second; together these generally help locate the specific disk that has a problem.

In addition, we also need to know which process is doing the reading and writing. Generally the developer already has a good idea; otherwise use the iotop command to locate the source of the file I/O.

However, what we get here is a tid, which needs to be converted to a pid. We can find the pid through readlink: readlink -f /proc/*/task/tid/../..

After finding the pid, you can see the process's specific read and write statistics with cat /proc/pid/io

We can also use the lsof command to determine which files the process is reading and writing: lsof -p pid

Memory

Memory problems are trickier to troubleshoot than CPU problems, and there are more scenarios: mainly OOM, GC issues, and off-heap memory. Generally speaking, we first use the free command to get an overview of memory usage.

Heap memory

Most memory problems are still heap memory problems, mainly divided into OOM and StackOverflow.

OOM

OOM means the JVM has insufficient memory, and it can be roughly divided into the following types:

Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread

This means there is not enough memory to allocate a Java thread stack for a new thread. It is basically caused by a problem in thread pool code, such as forgetting to shut down the pool, so first look for the problem at the code level using jstack or jmap. If everything there is normal, on the JVM side you can reduce the size of a single thread stack by specifying Xss.
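
A minimal sketch of the "forgot to shut down" pattern described above (the names are hypothetical): a new thread pool is created on every request and never shut down, so native threads accumulate until this OOM is thrown.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: each call leaks 10 worker threads because the pool is never shut down.
public class ThreadLeak {
    public void handleRequest(Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(10); // new pool per request
        pool.submit(task);
        // missing pool.shutdown(), so the worker threads are never released
    }
}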

In addition, at the system level, you can increase the OS thread limits by modifying nofile and nproc in /etc/security/limits.conf.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

This means that heap memory usage has reached the maximum value set by -Xmx; it should be the most common OOM error. The solution is still to look in the code first: if there is a memory leak, use jstack and jmap to locate the problem; if everything is normal, expand the heap by adjusting the Xmx value.
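
A minimal, hypothetical sketch of the kind of leak that jmap and MAT would surface here: objects stay reachable from a static collection that is never cleared, so the heap eventually hits -Xmx.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: entries are added to a static "cache" and never removed.
public class HeapLeak {
    private static final List<byte[]> CACHE = new ArrayList<>();

    public void handleRequest() {
        CACHE.add(new byte[1024 * 1024]); // 1 MB per request, never released
    }
}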

Caused by: java.lang.OutOfMemoryError: Metaspace

This means that metaspace usage has reached the maximum value set by -XX:MaxMetaspaceSize. The troubleshooting idea is the same as above; the parameter can be adjusted through -XX:MaxMetaspaceSize (for the permanent generation before 1.8, it is -XX:MaxPermSize).

Stack Overflow

Stack memory overflow, which everyone has probably seen more often.

Exception in thread "main" java.lang.StackOverflowError

This indicates that the memory required by the thread stack exceeds the Xss value. It is also checked at the code level first; the parameter is adjusted via Xss, but setting it too large may in turn cause OOM.
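
For reference, a minimal sketch that triggers this error: recursion without a base case grows the thread stack beyond the Xss limit.

// Minimal sketch: unbounded recursion exhausts the thread stack.
public class DeepRecursion {
    static long depth = 0;

    static void recurse() {
        depth++;
        recurse();   // no termination condition
    }

    public static void main(String[] args) {
        recurse();   // throws java.lang.StackOverflowError
    }
}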

Use jmap to locate code-level memory leaks

For the code-level investigation of OOM and StackOverflow above, we generally use jmap to export a dump file: jmap -dump:format=b,file=filename pid

Import the dump file into MAT (Eclipse Memory Analyzer Tool) for analysis. For memory leaks we can usually select Leak Suspects directly, and MAT will give suggestions about the leak. You can also choose Top Consumers to view the largest-object report. Thread-related questions can be analyzed via the thread overview. In addition, the Histogram class overview lets you dig in yourself; you can find MAT tutorials online.

In daily development, memory leaks in code are relatively common and rather hidden, requiring developers to pay attention to details. For example: creating new objects on every request, leading to a large number of duplicate objects; opening file streams but not closing them properly; triggering gc manually and improperly; or unreasonable ByteBuffer caching, all of which lead to code-level OOM.
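
As a hedged example of the "file stream not closed properly" case above (method and path names are illustrative), compare a leaky version with the try-with-resources fix:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamExample {
    // Leaky version: if readLine() throws, the reader (and file handle) is never closed.
    String firstLineLeaky(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        return reader.readLine();
    }

    // Fixed version: try-with-resources always closes the reader.
    String firstLine(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return reader.readLine();
        }
    }
}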

On the other hand, we can specify -XX:+HeapDumpOnOutOfMemoryError in the startup parameters to save a dump file when an OOM occurs.

gc issues and threads

GC problems affect not only the CPU but also memory, and the troubleshooting approach is the same: generally first use jstat to check generational changes, such as whether youngGC or fullGC counts are too high, or whether indicators such as EU and OU are growing abnormally.

Too many threads that are not reclaimed in time will also cause OOM, mostly the "unable to create new native thread" mentioned earlier. Besides analyzing the dump file in detail with jstack, we generally look at the overall thread count first, via pstree -p pid | wc -l.

Or simply count the entries under /proc/pid/task, which equals the number of threads.

Off-heap memory

It is really unfortunate if you run into an off-heap memory overflow. Its first symptom is rapid growth of the physical resident memory; whether an error is reported depends on how the off-heap memory is used. If it comes from Netty, an OutOfDirectMemoryError may appear in the error log; if DirectByteBuffer is used directly, OutOfMemoryError: Direct buffer memory is reported.
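
A minimal, hypothetical sketch of the DirectByteBuffer case: direct buffers are allocated outside the heap, so resident memory grows while the heap itself looks healthy, until the limit set by -XX:MaxDirectMemorySize is reached and the error above is thrown.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each iteration allocates 1 MB of off-heap memory that stays reachable.
public class DirectBufferLeak {
    private static final List<ByteBuffer> BUFFERS = new ArrayList<>();

    public static void main(String[] args) {
        while (true) {
            BUFFERS.add(ByteBuffer.allocateDirect(1024 * 1024));
        }
    }
}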

Off-heap memory overflow is often related to the use of NIO. Generally, we first use pmap to view the memory segments occupied by the process: pmap -x pid | sort -rn -k3 | head -30, which lists the 30 largest memory segments of the pid in descending order. Run the command again after a while to see how memory has grown, or compare against a normal machine to spot suspicious memory segments.

If we find a suspicious memory segment, we can dump it for analysis with gdb: gdb --batch --pid {pid} -ex "dump memory filename.dump {start address} {start address + block size}"

After obtaining the dump file, you can inspect it with hexdump: hexdump -C filename | less, although most of what you see will be unreadable binary.

NMT (Native Memory Tracking) is a HotSpot feature introduced in Java 7u40. With the jcmd command we can see the detailed native memory composition. It requires adding -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail to the startup parameters, which causes a slight performance loss.

Generally, when off-heap memory grows slowly until it blows up, you can first set a baseline with jcmd pid VM.native_memory baseline.

Then wait a while to observe memory growth, and run a summary- or detail-level diff with jcmd pid VM.native_memory detail.diff (or summary.diff).

You can see that the memory breakdown given by jcmd is very detailed, including heap, thread, and GC (so the other memory problems mentioned above can also be analyzed with NMT). Here we focus on the growth of Internal memory: if the increase is very obvious, there is likely a problem.

At the detail level, you can also see the growth of specific memory segments.

In addition, at the system level, we can use the strace command to monitor memory allocation: strace -f -e "brk,mmap,munmap" -p pid

The memory allocation information here mainly includes pid and memory address.

However, it is still difficult to locate the specific problem with the operations above. The key is to look at the error log stack, find the suspicious object, figure out its reclamation mechanism, and then analyze that object. For example, memory allocated by DirectByteBuffer requires a full GC or a manual System.gc() to be reclaimed (so it is best not to use -XX:+DisableExplicitGC).

In fact, we can track the memory of DirectByteBuffer objects and manually trigger a full GC via jmap -histo:live pid to see whether the off-heap memory is reclaimed. If it is reclaimed, the off-heap memory itself was probably allocated too small and can be adjusted via -XX:MaxDirectMemorySize. If nothing changes, use jmap to analyze the objects that cannot be GC'd and their reference relationships with DirectByteBuffer.

GC issues

Heap memory leaks are always accompanied by GC anomalies. However, GC issues are not only related to memory; they can also cause a series of complications such as CPU load and network problems. Since they are most closely tied to memory, we summarize GC-related issues separately here.

In the CPU chapter, we introduced using jstat to obtain the current GC generation information. More often, we use the GC log to troubleshoot problems, adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps to the startup parameters to enable GC logging.

The meaning of common Young GC and Full GC logs will not be repeated here.

From the gc log we can roughly infer whether youngGC and fullGC are too frequent or take too long, so as to prescribe the right medicine. We will analyze the G1 garbage collector below; it is also recommended to use G1 (-XX:+UseG1GC).

youngGC too frequent

Frequent youngGC is usually caused by a large number of small, short-lived objects. First consider whether the Eden / young generation is set too small, and see whether adjusting parameters such as -Xmn and -XX:SurvivorRatio solves it. If the parameters are normal but the youngGC frequency is still too high, use jmap and MAT to investigate the dump file further.

youngGC takes too long

For excessive GC time, look at where in the GC log the time goes. Taking the G1 log as an example, focus on stages such as Root Scanning, Object Copy, and Ref Proc. If Ref Proc takes long, pay attention to reference-related objects.

If Root Scanning takes long, pay attention to the number of threads and cross-generation references; for Object Copy, look at object lifetimes. Timing analysis also needs a horizontal comparison, i.e., against other projects or against normal time periods. For example, if Root Scanning increases noticeably compared to normal periods, it indicates too many threads.

Trigger fullGC

In G1, mixedGC is more common, and mixedGC can be investigated the same way as youngGC. When a fullGC is triggered, something is usually wrong: G1 degrades to the Serial collector to clean up garbage, and pause times reach the second level, which is close to bringing the service to its knees.

The reasons for fullGC may include the following, as well as some ideas for parameter adjustment:

  • Concurrent mode failure: during the concurrent marking phase, the old generation fills up before the mixedGC happens, so G1 abandons the marking cycle. In this case you may need to increase the heap size, or increase the number of concurrent marking threads via -XX:ConcGCThreads.
  • Promotion failure: there is not enough memory for surviving/promoted objects during GC, so a fullGC is triggered. You can increase the percentage of reserved memory via -XX:G1ReservePercent, lower -XX:InitiatingHeapOccupancyPercent to start marking earlier, or increase the number of marking threads via -XX:ConcGCThreads.
  • Large (humongous) object allocation failure: a large object cannot find suitable region space to be allocated, so a fullGC is performed. In this case you can add memory or increase -XX:G1HeapRegionSize.
  • The program actively executes System.gc(): Don't just write it casually.

In addition, we can configure -XX:HeapDumpPath=/xxx/dump.hprof in the startup parameters to specify where fullGC-related dump files go, and use jinfo to enable dumps before and after fullGC:

jinfo -flag +HeapDumpBeforeFullGC pid 
jinfo -flag +HeapDumpAfterFullGC pid

This produces two dump files. Compare them, focusing mainly on the problem objects dropped by the GC, to locate the root cause.

Network

Network-level issues are generally the most complex: there are many scenarios, they are hard to pinpoint, and they have become a nightmare for most developers. Here we give some examples and discuss them from the TCP layer, the application layer, and the tools involved.

Timeout

Most timeout errors are at the application level, so here the emphasis is on concepts. Timeouts can be roughly divided into connection timeouts and read/write timeouts; client frameworks that use connection pools also have connection-acquisition timeouts and idle-connection cleanup timeouts.

  • Read/write timeout. readTimeout/writeTimeout, called so_timeout or socketTimeout in some frameworks, refers to data read/write timeouts. Note that most timeouts here are logical timeouts. The timeout of an RPC/SOA call usually also means the read timeout. Read/write timeouts are generally set only on the client.

  • Connection timeout. connectionTimeout on the client usually means the maximum time to establish a connection with the server. On the server side, connectionTimeout varies: in Jetty it is the idle-connection cleanup time, while in Tomcat it is the maximum time a connection is kept.

  • Others, including the connection-acquisition timeout connectionAcquireTimeout and the idle-connection cleanup timeout idleConnectionTimeout, mostly used by client or server frameworks that use connection pools or queues.

When setting various timeouts, try to keep the client's timeout smaller than the server's so the connection ends normally.
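
A minimal client-side sketch (the URL is just a placeholder) of where these two timeouts are set with the standard HttpURLConnection API; the values are examples and should stay below the server-side timeout:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutExample {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://example.com/api");        // placeholder endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(1_000);   // max time (ms) to establish the TCP connection
        conn.setReadTimeout(2_000);      // max time (ms) to wait for data once connected
        System.out.println(conn.getResponseCode());
        conn.disconnect();
    }
}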

In actual development, what we care about most should be the read and write timeout of the interface.

How to set a reasonable interface timeout is itself a problem: set it too long and it may hold the server's TCP connections for too long; set it too short and timeouts become very frequent.

Another problem is when the server-side rt has clearly dropped but the client still keeps timing out. This one is actually simple: the path from client to server includes network transmission, queuing, and service processing, and any of these links can be the source of the extra time.

TCP queue overflow

TCP queue overflow is a relatively low-level error that can cause more superficial errors such as timeouts and RSTs, which makes it harder to notice, so let's discuss it separately.

There are two queues: the syns queue (half-open connection queue) and the accept queue (fully established connection queue). During the three-way handshake, after the server receives the client's syn, it puts the message into the syns queue and replies with syn+ack. When the server then receives the client's ack, if the accept queue is not full, it moves the temporarily stored entry from the syns queue into the accept queue; otherwise it acts according to tcp_abort_on_overflow.

With tcp_abort_on_overflow set to 0, if the accept queue is full at the third step of the handshake, the server simply drops the ack sent by the client. With tcp_abort_on_overflow set to 1, the server instead sends an rst packet to the client, aborting the handshake and the connection, which means you may see many connection reset / connection reset by peer errors in the log.

So in actual development, how can we quickly locate the tcp queue overflow?

Using netstat: execute netstat -s | egrep "listen|LISTEN"

In the output, "overflowed" is the number of accept (full connection) queue overflows, and "sockets dropped" is the number of syns (half-open) queue overflows.

Using ss: execute ss -lnt

In the output, the Send-Q column shows that the maximum accept queue size on the listening port is 5, while Recv-Q shows how much of the accept queue is currently in use.

Then let's see how to set the size of the fully connected and semi-connected queues:

The accept queue size is min(backlog, somaxconn), where backlog is passed in when the socket is created and somaxconn is an OS-level parameter. The half-open queue size is max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog).
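
On the Java side, a minimal sketch (port and size are examples) of where the backlog value comes from: it is the second argument when the listening socket is created, and the kernel caps the effective accept queue at min(backlog, somaxconn).

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class BacklogExample {
    public static void main(String[] args) throws IOException {
        int port = 8080;      // example port
        int backlog = 100;    // requested accept-queue size
        try (ServerSocket server = new ServerSocket(port, backlog)) {
            while (true) {
                Socket client = server.accept(); // connections queue up until accepted
                client.close();
            }
        }
    }
}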

In daily development, we often use the servlet container as the server, so we sometimes need to pay attention to the connection queue size of the container. The backlog in tomcat is called acceptCount, and in jetty it is acceptQueueSize.

RST exception

An RST packet represents a connection reset and is used to close connections that are no longer useful. It usually indicates an abnormal close, unlike the normal four-way close.

In actual development, we often see connection reset / connection reset by peer errors, which are caused by RST packets.

Port does not exist

If a SYN request is sent to a port that does not exist, the server, finding that it has no such port, directly returns an RST message to abort the connection.

Actively terminating the connection with RST instead of FIN

Generally speaking, a normal close goes through FIN messages, but an RST can be used instead of FIN to terminate the connection immediately. In practice this is controlled via the SO_LINGER option. It is often deliberate, to skip TIME_WAIT and improve interaction efficiency; use it with caution.
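
A minimal sketch of this with the standard Socket API (host and port are placeholders): enabling SO_LINGER with a timeout of 0 makes close() send an RST instead of going through FIN and TIME_WAIT.

import java.io.IOException;
import java.net.Socket;

public class LingerExample {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("example.com", 80)) {
            socket.setSoLinger(true, 0); // linger enabled with 0s -> abortive close (RST)
        }                                // close() here resets the connection instead of sending FIN
    }
}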

One side (client or server) hits an exception and sends an RST to tell the other side to close the connection

The RST sent on TCP queue overflow mentioned above actually belongs to this category: for some reason one side can no longer handle the connection normally (for example, the program crashed or a queue is full) and tells the other side to close it.

The received TCP packet is not in a known TCP connection

For example, one side loses a TCP packet on a bad network and the other side closes the connection; when the missing packet finally arrives much later, the corresponding TCP connection no longer exists, so the receiver replies directly with an RST packet.

One side has not received an acknowledgement from the other for a long time, and sends an RST after a certain time or number of retransmissions

Most of this is also related to the network environment. A poor network environment may cause more RST packets.

As mentioned before, many RST packets cause the program to report errors: a read on a reset connection reports connection reset, and a write on a reset connection reports connection reset by peer. We may also see broken pipe errors, which are pipe-level errors meaning reading or writing to a closed pipe; they often occur when data is still read or written after an RST has already produced a connection reset error, as also described in the glibc source comments.

How do we confirm the presence of RST packets when troubleshooting? Capture packets with tcpdump and analyze them with wireshark: tcpdump -i en0 tcp -w xxx.cap, where en0 is the network interface to monitor.

Next, open the captured packets in wireshark; RST packets are highlighted in red by default.

TIME_WAIT and CLOSE_WAIT

I believe everyone knows what TIME_WAIT and CLOSE_WAIT mean.

Online, we can directly use the command netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' to view the number of connections in time_wait and close_wait.

Using the ss command is faster: ss -ant | awk '{++S[$1]} END {for(a in S) print a, S[a]}'

TIME_WAIT

time_wait exists, first, so that delayed packets from a closed connection are not picked up by a later connection reusing the same address and port, and second, so that the connection can be closed normally within 2MSL. Its existence actually greatly reduces the appearance of RST packets.

Too much time_wait is more likely to occur in scenarios with frequent short connections. In this case, you can do some kernel parameter tuning on the server:

# Enable reuse: allow TIME-WAIT sockets to be reused for new TCP connections. Default is 0 (off).
net.ipv4.tcp_tw_reuse = 1
# Enable fast recycling of TIME-WAIT sockets. Default is 0 (off).
net.ipv4.tcp_tw_recycle = 1

Of course, do not forget the pitfall that tcp_tw_recycle can cause packets to be rejected in NAT environments because of timestamp checks. Another option is to raise tcp_max_tw_buckets: any time_wait beyond this number is killed, but this also leads to "time wait bucket table overflow" errors being reported.

CLOSE_WAIT

close_wait is often caused by a problem in the application code: after replying with the ACK, the FIN is never sent. close_wait occurs even more often than time_wait and with more serious consequences; typically something blocks somewhere and connections are not closed properly, gradually eating up all the threads.

To locate such problems, it is best to use jstack to analyze the thread stacks; for details refer to the sections above. Here is just one example.

A developer reported that after a release, CLOSE_WAIT on the application kept increasing until it hung. jstack found the suspicious stack: most threads were stuck in the countDownLatch.await method. Talking with the developers revealed that multithreading was used without catching exceptions; after fixing that, the exception turned out to be the simple "class not found" that often appears after upgrading the SDK.
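
A hedged reconstruction of that bug (names are made up): if the worker throws before countDown(), the awaiting thread blocks forever and its connection stays in CLOSE_WAIT; putting countDown() in a finally block fixes it.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LatchExample {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch latch = new CountDownLatch(1);
        pool.submit(() -> {
            try {
                doWork();            // may throw, e.g. a NoClassDefFoundError after an SDK upgrade
            } finally {
                latch.countDown();   // always release the waiter, even on failure
            }
        });
        latch.await();               // without the finally, an exception in doWork() blocks this forever
        pool.shutdown();
    }

    static void doWork() { /* business logic that might throw */ }
}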

Origin: blog.51cto.com/14957073/2542746