Basic routines for system troubleshooting under Linux

Summary of common commands

  1. top Find processes with high CPU usage
  2. ps Find the pid of the corresponding process
  3. top -H -p pid Find threads with high cpu utilization
  4. printf '%x\n' pid Convert thread pid to hexadecimal to get nid
  5. jstack pid |grep 'nid' -C5 --color Find the corresponding stack information in jstack for analysis
  6. cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c Get an overall grasp of the jstack output; pay attention to thread states such as WAITING and TIMED_WAITING
  7. jstat -gc pid 1000 Observe changes in the GC generations
  8. vmstat 1 Check for frequent context switching
  9. pidstat -w -p pid 1 Monitor context switches of a specific pid
  10. Disk
  11. df -hl View disk file system status
  12. iostat -d -k -x Analyze disk read/write speed and locate the problematic disk
  13. iotop View the tid doing the reads and writes; readlink -f /proc/*/task/tid/../.. to find the process pid
  14. cat /proc/pid/io View the specific read and write conditions of the process
  15. lsof -p pid Determine the specific file read and write conditions
  16. Memory
  17. free Check the memory status
  18. If jstack and jmap show no problem at the code level, adjust the JVM settings
  19. -Xss
  20. -Xmx
  21. -XX:MaxMetaspaceSize (-XX:MaxPermSize before 1.8)
  22. jmap -dump:format=b,file=filename pid Export a dump file
  23. Specify -XX:+HeapDumpOnOutOfMemoryError in the startup parameters to save a dump file on OOM
  24. pstree -p pid | wc -l or ls -l /proc/pid/task | wc -l View the total number of threads
  25. pmap -x pid | sort -rn -k3 | head -30 View the process's top 30 memory segments in descending order of size
  26. strace -f -e "brk,mmap,munmap" -p pid Monitor memory allocation

Original text

Online faults mainly involve CPU, disk, memory, and network problems, and a single incident often spans more than one of them, so when troubleshooting try to check the four areas in turn.

At the same time, tools such as jstack and jmap are not limited to one kind of problem. In most cases it is df, free, and top first, then jstack and jmap in turn, with deeper analysis depending on the specific problem.

Generally speaking, we troubleshoot CPU problems first, since CPU anomalies are usually easier to locate. Causes include business-logic problems (such as infinite loops), frequent GC, and excessive context switching. The most common case is caused by business logic (or framework logic), and you can use jstack to analyze the corresponding thread stacks.

We first use the ps command to find the pid of the corresponding process (if you have several target processes, you can use top to see which one takes up more).

Then use top -H -p pid to find some threads with relatively high cpu usage

Then convert the pid of the most CPU-hungry thread to hexadecimal with printf '%x\n' pid to get the nid

Then find the corresponding stack information directly in jstack: jstack pid | grep 'nid' -C5 --color
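
Putting those steps together, a minimal sketch might look like the following; the process name "myapp", the thread id, and the variable names are placeholders for illustration, not values from the article:

    # locate the java process (adjust the grep to your own service name)
    pid=$(ps -ef | grep java | grep myapp | grep -v grep | awk '{print $2}')
    top -H -p "$pid"                     # note the PID column of the busiest thread (a tid)
    tid=12345                            # placeholder: the busiest thread id seen in top
    nid=$(printf '%x\n' "$tid")          # hex form, as it appears as nid=0x... in jstack output
    jstack "$pid" | grep "nid=0x$nid" -C5 --color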

It can be seen that we have found the stack information with nid 0x42, and then we only need to analyze it carefully.

Of course, it is more common to analyze the whole jstack file. We usually pay more attention to the WAITING and TIMED_WAITING sections, not to mention BLOCKED. We can use the command cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c to get an overall picture of the thread states in the jstack output; if there are too many threads in WAITING and similar states, there is probably a problem.

Of course we will still use jstack to analyze the problem, but sometimes we can first determine whether GC is too frequent. Use the jstat -gc pid 1000 command to observe the changes of the GC generations, where 1000 is the sampling interval (ms). S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU represent the capacity and usage of the two Survivor areas, the Eden area, the old generation, and the metadata area respectively; YGC/YGCT and FGC/FGCT give the count and total time of young GC and full GC, and GCT is the total GC time. If GC looks frequent, do further analysis on GC.
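
A quick sketch of the sampling command (the pid is a placeholder); on JDK 8 the printed header also includes CCSC/CCSU for the compressed class space, which the article does not mention:

    jstat -gc <pid> 1000    # sample every 1000 ms
    # header: S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT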

For frequent context-switching problems, we can use the vmstat command to check

The cs (context switch) column represents the number of context switches.

If we want to monitor a specific pid, we can use pidstat -w -p pid 1; the cswch/s and nvcswch/s columns represent voluntary and involuntary context switches respectively.
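
A minimal sketch of both commands (the interval and pid are placeholders; the -p flag is added here because pidstat needs it to target a single process):

    vmstat 1                    # watch the cs column for context switches per second
    pidstat -w -p <pid> 1       # cswch/s = voluntary, nvcswch/s = involuntary switches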

Disk problems are as fundamental as CPU problems. First, for disk space, we can directly use df -hl to view the file-system status

More often than not, disk problems are performance problems. We can analyze them through iostat -d -k -x

The last column, %util, shows how busy each disk is, while rrqm/s and wrqm/s show the rates of merged read and write requests; together these generally help locate the specific disk that has a problem.

In addition, we also need to know which process is doing the reading and writing. Generally the developer already knows, or we can use the iotop command to locate the source of the file reads and writes.

But what we get there is a tid; to convert it into a pid, we can find the pid through readlink -f /proc/*/task/tid/../..

After finding the pid, we can see the process's concrete read/write statistics with cat /proc/pid/io

We can also use the lsof command to determine which files are being read and written: lsof -p pid
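
A rough sketch of that chain, with the tid observed in iotop and the resulting pid as placeholders:

    tid=<tid-from-iotop>
    readlink -f /proc/*/task/$tid/../..    # resolves to /proc/<pid> of the owning process
    pid=<pid-from-previous-step>
    cat /proc/$pid/io                      # read_bytes / write_bytes counters for the process
    lsof -p $pid                           # which files the process currently has open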

Troubleshooting memory problems is more troublesome than CPU and covers more scenarios, mainly OOM, GC issues, and off-heap memory. Generally speaking, we first use the free command to check the overall memory situation.

Most memory problems are heap-memory problems, which mainly appear as OOM and StackOverflow.

OOM due to insufficient memory in the JVM can be roughly divided into the following categories:

Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread

This means that there is not enough memory space to allocate a Java stack to the thread. Basically this points to a problem in thread-pool code, such as forgetting to call shutdown, so first look for problems at the code level, using jstack or jmap. If everything there is normal, the JVM side can reduce the size of a single thread stack by specifying -Xss.

In addition, at the system level, you can modify the nofile and nproc limits in /etc/security/limits.conf to raise the OS limits on threads and open files.
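
For illustration only, raising both limits for a hypothetical "appuser" account in /etc/security/limits.conf might look like this (the values are examples, not recommendations from the article):

    # <domain>  <type>  <item>   <value>
    appuser     soft    nproc    65535
    appuser     hard    nproc    65535
    appuser     soft    nofile   65535
    appuser     hard    nofile   65535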

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

This means that heap memory usage has reached the maximum value set by -Xmx, probably the most common OOM error. The solution is still to look in the code first, suspecting a memory leak, and use jstack and jmap to locate the problem. If everything is normal, expand the heap by adjusting the -Xmx value.

Caused by: java.lang.OutOfMemoryError: Metaspace

This means that metaspace usage has reached the maximum value set by -XX:MaxMetaspaceSize. The troubleshooting idea is the same as above; the limit can be raised via -XX:MaxMetaspaceSize (or -XX:MaxPermSize for the permanent generation before 1.8).

Stack memory overflow: you have probably seen this one a lot.

Exception in thread "main" java.lang.StackOverflowError

It means that the memory required by the thread stack exceeds the -Xss value. Again, check the code first; the limit is adjusted through -Xss, but setting it too large may in turn cause OOM.

For the code-level troubleshooting of the OOM and StackOverflow cases above, we generally use jmap -dump:format=b,file=filename pid to export a dump file

Import the dump file into MAT (Eclipse Memory Analyzer Tool) for analysis. For memory leaks we can usually go straight to Leak Suspects, where MAT gives its leak suggestions. Alternatively, select Top Consumers to view the largest-object report. Thread-related questions can be analyzed in the thread overview. Otherwise, open the Histogram class overview and work through it yourself; tutorials on MAT are easy to find.

In daily development, code-level memory leaks are fairly common and well hidden, and require developers to pay attention to detail. For example: creating new objects on every request, producing large numbers of duplicate objects; opening file streams but not closing them properly; triggering GC manually and improperly; unreasonable ByteBuffer cache allocation; and so on, all of which can cause OOM in code.

On the other hand, we can specify -XX:+HeapDumpOnOutOfMemoryError in the startup parameters to save a dump file when OOM occurs.
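
For example, a startup command might carry these flags; the heap sizes, dump path, and jar name are placeholders, and -XX:HeapDumpPath is the companion flag that also appears in the GC section below:

    java -Xms4g -Xmx4g \
         -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/path/to/dumps/ \
         -jar myapp.jar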

Besides affecting the CPU, GC problems also affect memory, and the troubleshooting approach is the same. Generally, jstat is used to check the generational changes, for example whether there are too many young GC or full GC events, or whether indicators such as EU and OU are growing abnormally.

Too many threads that are not released in time will also cause OOM, mostly the "unable to create new native thread" case mentioned earlier. Besides analyzing the dump file in detail with jstack, we generally look at the overall thread count first, via pstree -p pid | wc -l.

Or count the entries under /proc/pid/task directly, which equals the number of threads.

It is really unfortunate if you run into an off-heap memory overflow. It shows up first as rapid growth of the physical resident memory; whether an error is reported depends on how the off-heap memory is used. If it comes through Netty, an OutOfDirectMemoryError may appear in the error log; if it is a direct DirectByteBuffer allocation, an "OutOfMemoryError: Direct buffer memory" is reported.

Off-heap memory overflow is often related to the use of NIO. Generally we first check the memory occupied by the process through pmap: pmap -x pid | sort -rn -k3 | head -30, that is, the process's 30 largest memory segments in descending order. You can run the command again after a while to watch the memory growth, or compare against a normal machine to spot suspicious memory segments.

If we identify a suspicious memory segment, we can analyze it through gdb: gdb --batch --pid {pid} -ex "dump memory filename.dump {memory start address} {memory start address + memory block size}"

After obtaining the dump file, you can view it with hexdump -C filename | less, although most of what you see will be binary gibberish.
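
A hedged sketch of the pmap / gdb / hexdump flow; the pid and the address range are placeholders that you would take from your own pmap output:

    pmap -x <pid> | sort -rn -k3 | head -30     # find a suspicious large segment and note its address
    gdb --batch --pid <pid> \
        -ex "dump memory mem.dump 0x7f0000000000 0x7f0001000000"   # placeholder start/end addresses
    hexdump -C mem.dump | less                  # mostly binary, but readable strings may hint at the owner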

NMT is a HotSpot feature introduced in Java 7u40. Together with the jcmd command, it lets us see the detailed native memory composition. You need to add -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail to the startup parameters, at the cost of a slight performance loss.

Generally, when off-heap memory grows slowly until it blows up, you can first record a baseline with jcmd pid VM.native_memory baseline.

Then, after waiting a while for the memory to grow, do a summary- or detail-level diff through jcmd pid VM.native_memory summary.diff (or detail.diff).
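
Putting the NMT workflow together (the pid is a placeholder, and the JVM must have been started with -XX:NativeMemoryTracking enabled):

    # startup flag: -XX:NativeMemoryTracking=summary   (or =detail for finer granularity)
    jcmd <pid> VM.native_memory baseline        # record the baseline
    # ... wait for the suspected growth ...
    jcmd <pid> VM.native_memory summary.diff    # or detail.diff for per-region deltas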

As you can see, the memory reported by jcmd is very detailed, covering heap, thread, GC, and so on (so the other memory anomalies above can in fact also be analyzed with NMT). Here we focus on the growth of the Internal (off-heap) memory; if that growth is obvious, there is a problem.

At the detail level, there will also be growth of specific memory segments, as shown in the figure below.

In addition, at the system level, we can also use the strace command to monitor memory allocation: strace -f -e "brk,mmap,munmap" -p pid

The memory allocation information here mainly includes pid and memory address.

However, the operations above rarely pinpoint the problem by themselves. The key is to look at the error stack in the log, find the suspicious object, understand its reclamation mechanism, and then analyze that object. For example, memory allocated through DirectByteBuffer needs a full GC or a manual System.gc() to be reclaimed (so it is best not to set -XX:+DisableExplicitGC).

So in practice we can track the memory of the DirectByteBuffer objects and manually trigger a full GC via jmap -histo:live pid to see whether the off-heap memory has been reclaimed. If it has, then most likely the off-heap allocation itself is too small, and it can be adjusted with -XX:MaxDirectMemorySize. If nothing changes, use jmap to analyze the objects that cannot be collected and their reference relationship with the DirectByteBuffer.
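
A small sketch of that check (the pid is a placeholder; jmap -histo:live forces a full GC as a side effect):

    jmap -histo:live <pid> | grep -i directbytebuffer   # live DirectByteBuffer instances after the forced GC
    # if resident memory drops afterwards, consider raising -XX:MaxDirectMemorySize;
    # if it does not, export a heap dump and look at what still references the buffers
    jmap -dump:format=b,file=heap.hprof <pid>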

GC related

In-heap memory leaks are always accompanied by GC anomalies. GC problems, however, are not only memory problems; they can also cause a series of complications such as CPU load and network issues. Since they are most closely tied to memory, we summarize GC-related problems separately here.

We introduced using jstat to obtain the current GC generation changes in the CPU section. More often, we use GC logs to troubleshoot, adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps to the startup parameters to enable GC logging.
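
For example (the log path and the -Xloggc flag are additions here for completeness, not from the article; these are the pre-JDK 9 logging flags):

    java -verbose:gc \
         -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps \
         -Xloggc:/path/to/gc.log \
         -jar myapp.jar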

The meaning of common Young GC and Full GC logs will not be repeated here.

From the GC log we can roughly infer whether young GC and full GC are too frequent or take too long, and prescribe the right medicine accordingly. Below we analyze against the G1 garbage collector, and it is also recommended that you use G1 (-XX:+UseG1GC).

Frequent young GC usually means many short-lived small objects. First consider whether the Eden area / young generation is set too small, and see whether adjusting parameters such as -Xmn and -XX:SurvivorRatio solves it. If the parameters look fine but young GC is still too frequent, use jmap and MAT to further examine a dump file.

When GC takes too long, look at where the time goes in the GC log. Taking a G1 log as an example, you can focus on stages such as Root Scanning, Object Copy, and Ref Proc. If Ref Proc takes long, pay attention to reference-related objects.

If Root Scanning takes long, pay attention to the number of threads and cross-generational references; for Object Copy, look at object lifecycles. Moreover, time-consumption analysis needs horizontal comparison, against other projects or against normal time periods. For example, if Root Scanning grows noticeably compared with a normal period, it means too many threads have been started.

In G1 there are also mixed GCs, but mixed GC problems can be investigated the same way as young GC. When full GC is triggered, there is usually a real problem: G1 degenerates to the Serial collector to do the cleanup, and pause times reach the level of seconds, which effectively brings the application to its knees.

The reasons for fullGC may include the following, as well as some ideas for parameter adjustment:

Concurrent mode failure: in the concurrent marking phase, the old generation fills up before the mixed GC can run, and G1 abandons the marking cycle. In this case it may be necessary to increase the heap size, or adjust the number of concurrent marking threads with -XX:ConcGCThreads.

Promotion failure: there is not enough memory for surviving/promoted objects during GC, so a full GC is triggered. Here you can use -XX:G1ReservePercent to increase the reserved-memory percentage, reduce -XX:InitiatingHeapOccupancyPercent to start marking earlier, or raise -XX:ConcGCThreads to increase the number of marking threads.

Large object allocation failure: The large object cannot find a suitable region space for allocation, and fullGC will be performed. In this case, you can increase the memory or increase -XX:G1HeapRegionSize.

The program actively calls System.gc(): just don't write it casually.

In addition, we can configure -XX:HeapDumpPath=/xxx/dump.hprof in the startup parameters to dump full-GC-related files, and use jinfo to enable dumps before and after GC

  1. jinfo -flag +HeapDumpBeforeFullGC pid
  2. jinfo -flag +HeapDumpAfterFullGC pid

In this way, two dump files are obtained. After comparison, we mainly focus on the problem objects dropped by gc to locate the problem.

Network

Problems at the network level are generally more complicated: there are many scenarios, they are hard to locate, and they have become a nightmare for most developers; they are probably the most complicated of all. Some examples are given here, covering the TCP layer, the application layer, and the use of tools.

Most timeout errors are at the application level, so this part focuses on the concepts. Timeouts can be roughly divided into connection timeouts and read/write timeouts; client frameworks that use connection pools also have connection-acquisition timeouts and idle-connection cleanup timeouts.

  • read and write timeout

readTimeout/writeTimeout, which some frameworks call so_timeout or socketTimeout, both refer to data read/write timeouts. Note that most timeouts here are logical timeouts. SOA timeouts also refer to read timeouts. Read/write timeouts are generally set only on the client side.

  • Connection timeout

connectionTimeout: on the client it usually means the maximum time to establish a connection with the server. On the server side connectionTimeout varies a bit: in Jetty it is the idle-connection cleanup time, while in Tomcat it is the maximum time a connection is maintained.

  • other

Including connection acquisition timeout connectionAcquireTimeout and idle connection cleanup timeout idleConnectionTimeout. Mostly used for client or server frameworks that use connection pools or queues.

When setting the various timeouts, what we need to ensure is that the client-side timeout stays smaller than the server-side timeout as far as possible, so that the connection can end normally.

In actual development, what we care about most is usually interface read/write timeouts.

How to set a reasonable interface timeout is a question in itself. If the interface timeout is set too long, it may hold on to too many of the server's TCP connections; if it is set too short, the interface will time out very frequently.

Another problem is that the server has clearly lowered the interface RT, yet the client still keeps timing out. This one is actually simple: the link from the client to the server includes network transmission, queuing, and service processing, and each of these can be the time sink.

TCP queue overflow is a relatively low-level error that can cause more surface-level errors such as timeouts and RSTs. Because the symptoms are subtle, we discuss it separately.

As shown in the figure above, there are two queues: the syns queue (half-connection queue) and the accept queue (full-connection queue). In the three-way handshake, after the server receives the client's SYN, it puts the message into the syns queue and replies SYN+ACK to the client; when the server then receives the client's ACK, if the accept queue is not full it moves the entry from the syns queue into the accept queue, otherwise it acts according to tcp_abort_on_overflow.

tcp_abort_on_overflow = 0 means that if the accept queue is full at the third step of the handshake, the server simply discards the ACK sent by the client. tcp_abort_on_overflow = 1 means that if the full-connection queue is full at the third step, the server sends an RST packet to the client, aborting the handshake and the connection, which means you may see many connection reset / connection reset by peer errors in the log.
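
To check (or, as an example, switch) the current behaviour, a sketch:

    cat /proc/sys/net/ipv4/tcp_abort_on_overflow    # 0 = silently drop the ACK, 1 = reply with RST
    # sysctl -w net.ipv4.tcp_abort_on_overflow=1    # switch behaviour (requires root); example only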

So in actual development, how can we quickly locate the tcp queue overflow?

netstat command, execute netstat -s | egrep "listen|LISTEN"

As shown in the figure above, overflowed indicates the number of full-connection queue overflows, and sockets dropped indicates the number of half-connection queue overflows.

As seen above, Send-Q in the third column indicates that the maximum full-connection queue on the listen port is 5, and Recv-Q in the first column indicates how much of the full-connection queue is currently in use.

Then let's see how the full-connection and half-connection queue sizes are set:

The size of the full-connection queue depends on min(backlog, somaxconn). The backlog is passed in when the socket is created, and somaxconn is an OS-level system parameter. The size of the half-connection queue depends on max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog).
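
To see the relevant values on a machine, a sketch (the listen port is a placeholder):

    cat /proc/sys/net/core/somaxconn            # OS cap on the accept (full-connection) queue
    cat /proc/sys/net/ipv4/tcp_max_syn_backlog  # input to the half-connection queue size
    ss -lnt | grep :<port>                      # Send-Q = max accept queue, Recv-Q = currently in use
    netstat -s | egrep "listen|LISTEN"          # cumulative overflow / drop counters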

In daily development we often use a servlet container as the server, so we sometimes also need to pay attention to the container's connection queue size. In Tomcat the backlog is called acceptCount, and in Jetty it is acceptQueueSize.

An RST packet means connection reset; it is used to close useless connections and usually indicates an abnormal close, unlike the normal four-way wave.

In actual development, we often see connection reset / connection reset by peer errors, which are caused by RST packets.

If a SYN to establish a connection is sent to a port that nothing is listening on, the server, finding that it has no such port, directly returns an RST message to terminate the connection.

Generally speaking, a normal connection close is done with FIN messages, but we can also use an RST instead of a FIN to terminate the connection immediately. In actual development this can be controlled by setting the SO_LINGER value; it is often done deliberately to skip TIME_WAIT and improve interaction efficiency, so don't use it unless you really have to.

When an exception occurs on either the client or the server side, that side sends an RST to the other end to close the connection

The RST sent on TCP queue overflow that we mentioned above actually belongs to this category. For some reason one side can no longer handle the connection normally (for example the program crashed, or a queue is full), and so tells the other side to close the connection.

The received TCP packet is not in a known TCP connection

For example, one machine loses a TCP segment because of a bad network, the other side closes the connection, and the missing segment arrives much later; since the corresponding TCP connection no longer exists, an RST packet is sent directly so that a new connection can be opened.

One party has not received the confirmation message from the other party for a long time, and sends an RST message after a certain time or number of retransmissions

Most of these are also related to the network environment, and poor network environment may lead to more RST packets.

As mentioned before, too many RST packets will make the program report errors. A read on a closed connection reports connection reset, while a write on a closed connection reports connection reset by peer. You may also see a broken pipe error, a pipe-level error meaning a read or write on a closed pipe; it usually happens when you keep reading or writing the socket after an RST has already produced a connection reset error. This is also described in the glibc source comments.

How do we confirm the presence of RST packets when troubleshooting? By capturing packets with the tcpdump command and doing a simple analysis with wireshark: tcpdump -i en0 tcp -w xxx.cap, where en0 is the network interface to monitor.

Next, we open the captured packets through wireshark, and we may see the following picture, and the red ones represent RST packets.

I believe everyone knows what TIME_WAIT and CLOSE_WAIT mean.

Online, we can directly use the command netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' to count the TIME_WAIT and CLOSE_WAIT connections

It will be faster to use the ss command ss -ant | awk '{++S[$1]} END {for(a in S) print a, S[a]}'

TIME_WAIT

TIME_WAIT exists, first, so that delayed packets from an old connection are not picked up by a later connection reusing the same port, and second, so that the connection can be closed normally within 2MSL. Its existence actually greatly reduces the occurrence of RST packets.

Excessive TIME_WAIT mostly appears in scenarios with frequent short connections. In this case, some kernel parameter tuning can be done on the server side:

    # Enable reuse: allow TIME-WAIT sockets to be reused for new TCP connections (default 0, disabled)
    net.ipv4.tcp_tw_reuse = 1
    # Enable fast recycling of TIME-WAIT sockets on TCP connections (default 0, disabled)
    net.ipv4.tcp_tw_recycle = 1

Of course, we must not forget that in a NAT environment tcp_tw_recycle causes packets to be rejected because of timestamp mismatches. Another approach is to lower tcp_max_tw_buckets: any TIME_WAIT beyond this number is killed, though this also leads to time wait bucket table overflow being reported.

CLOSE_WAIT

CLOSE_WAIT is usually due to a problem in the application: the FIN is never sent after the ACK. CLOSE_WAIT appears with an even higher probability than TIME_WAIT and has more serious consequences. It is often because something is blocked and connections are not closed properly, gradually consuming all the threads.

If you want to locate this kind of problem, it is best to analyze the thread stack through jstack to troubleshoot the problem. For details, please refer to the above chapters. Here is just one example.

A developer reported that CLOSE_WAIT kept increasing after the application went online, until it hung. From jstack we found a suspicious stack: most of the threads were stuck in the CountDownLatch.await method. After asking the developer, we learned that multithreading was used but exceptions were not caught; after fixing that, the exception turned out to be just the simple ClassNotFoundException that often appears after upgrading an SDK.

Original link:

https://www.toutiao.com/article/6868136002486010371/?channel=&source=search_tab
