Common methods of system failure analysis

"Peak of Performance" (性能之巅, the Chinese edition of Brendan Gregg's Systems Performance), Chapter 2.5, Methodology

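USE stands for utilization, saturation, and errors. It is a method used in performance work to identify system bottlenecks. In one sentence: for every resource, check its utilization, saturation, and errors.

These terms are defined as follows:
Resource: all physical server components (CPU, buses, ...). Some software resources can also be included, provided they offer useful metrics.
Utilization: the percentage of time, within a given interval, that the resource was busy servicing work.
Saturation: the degree to which the resource has extra work it cannot service, usually held in a wait queue.
Errors: the number of error events.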
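As a minimal sketch of these definitions (in Python; the ResourceSample type, its field names, and the sample numbers are hypothetical and chosen only for illustration), utilization is busy time over the interval, while saturation and errors are plain counts:

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource over an interval (hypothetical type)."""
    name: str
    busy_seconds: float      # time the resource spent servicing work
    interval_seconds: float  # length of the measurement interval
    queue_length: int        # work waiting because the resource was busy
    error_count: int         # error events seen during the interval


def use_summary(sample: ResourceSample) -> dict:
    """Return the three USE metrics for one resource sample."""
    utilization = 100.0 * sample.busy_seconds / sample.interval_seconds
    return {
        "resource": sample.name,
        "utilization_pct": utilization,
        "saturation": sample.queue_length,
        "errors": sample.error_count,
    }


# Made-up numbers: a disk busy for 0.75 s of a 1 s interval, 4 requests queued.
summary = use_summary(ResourceSample("disk0", 0.75, 1.0, 4, 0))
print(summary)
```

In practice the inputs would come from the tools listed later in these notes; the sketch only shows how the three numbers relate to one resource.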

For example:

U (utilization):

hardware
- CPU: utilization as shown by top, that is, the ratio of busy time to total time within each 1-second interval
- Memory: memory in use relative to total memory (the complement of currently available memory)
- Network card: receive throughput / maximum bandwidth and send throughput / maximum bandwidth (the maximum is the network card's capability)
- Storage device I/O: device busy time / total time
software
- Mutex: time the lock is held / total time
- Thread pool: time the pool's threads are busy / total time
- Process/thread capacity: number of threads/processes in the system, relative to the limit
- File descriptor capacity: number of descriptors in use in the system, relative to the limit

S (saturation):

hardware
- CPU: length of the run queue
- Memory: anonymous paging, swapping, and OOM (out-of-memory) events
- Storage device I/O: length of the wait queue
software
- Mutex: number of threads queued waiting for the lock
- Thread pool: number of requests queued waiting for the pool to process them
- Process/thread capacity: number of thread/process creation requests waiting on the system
- File descriptor capacity: number of descriptor allocation requests waiting on the system

E (errors):

hardware
- Storage device I/O: number of device errors
software
- Process/thread capacity: number of creation failures
- File descriptor capacity: number of file descriptor allocation failures

Common commands:

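# Some of the commands below depend on the sysstat package
uptime              # get a rough picture of the overall system state

dmesg | tail        # show the most recent system log messages

df  -h              # check disk space

top                 # overall view of system metrics

vmstat 1            # show CPU, memory, and virtual memory statistics

mpstat -P ALL 1     # show usage for each CPU core

pidstat 1           # like top, but with history

iostat -xz 1        # show block device (I/O) status

free -m             # show system memory

sar -n DEV 1        # show network interface throughput

sar -n TCP,ETCP 1   # check TCP statistics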

uptime：

This command gives a quick view of the system load averages. On Linux, these count tasks that are running or want to run on a CPU, as well as tasks blocked on I/O.

The three values are exponentially damped moving averages of the system load over the last 1, 5, and 15 minutes; they can be treated simply as the average over each period. Together they show how the load changes over time. For example, if the system has a problem now and the 1-minute value is much smaller than the 15-minute value, you have probably missed the moment when things went wrong.
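The comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sample string is made up in the format of /proc/loadavg on Linux, whose first three fields are the 1, 5, and 15 minute averages.

```python
def parse_loadavg(text: str) -> tuple:
    """Parse the first three fields of /proc/loadavg-style text."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def load_trend(one: float, five: float, fifteen: float) -> str:
    """Describe the trend by comparing the 1-minute and 15-minute averages."""
    if one > fifteen:
        return "rising"
    if one < fifteen:
        return "falling"
    return "steady"


# Made-up sample: the load was high earlier and has since dropped.
sample = "0.50 2.10 6.80 1/345 12345"
print(load_trend(*parse_loadavg(sample)))
```

A "falling" result on a system that is misbehaving right now matches the case in the text: the interesting moment is probably already in the past.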

What is system load?
The system load average is traditionally defined as the average number of processes in the run queue over a given time interval. A process is placed in the run queue if it meets all of the following conditions:

- It is not waiting for the result of an I/O operation.
- It has not voluntarily entered a wait state (that is, it has not called 'wait').
- It has not been stopped (for example, while waiting to be terminated).

On Linux, as noted above, tasks blocked on I/O are counted in the load average as well.

As a rule of thumb, the system is working well as long as the number of active processes per CPU core is no greater than 3. Note that this is per core: on a quad-core host, load averages below 12 in the uptime output mean the load is not very serious. If the value reaches 20, the system is heavily loaded, and opening pages or running web scripts will probably be very slow.
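The per-core rule of thumb can be written down as a small Python sketch. The function names are hypothetical; the "ok" limit of 3 per core comes from the text, while the "heavy" limit of 5 per core is an assumption derived from the quad-core example (20 / 4).

```python
def load_per_core(load_average: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    return load_average / cores


def classify_load(load_average: float, cores: int,
                  ok_per_core: float = 3.0, heavy_per_core: float = 5.0) -> str:
    """Classify load using the rule of thumb from these notes.

    heavy_per_core is an assumption taken from the quad-core example.
    """
    per_core = load_per_core(load_average, cores)
    if per_core <= ok_per_core:
        return "ok"
    if per_core >= heavy_per_core:
        return "heavy"
    return "elevated"


print(classify_load(11.0, 4))  # quad-core host, load below 12
print(classify_load(20.0, 4))  # quad-core host, load of 20
```

On a live Linux host the inputs could be taken from os.getloadavg() and os.cpu_count(); fixed numbers are used here so the example does not depend on the machine it runs on.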

dmesg:

On some systems this requires elevated privileges: sudo dmesg
This command shows the most recent system (kernel) log messages. Look here mainly for errors that can cause performance problems.

df -h：

View disk usage

top

CPU states: the share of CPU time used by user processes, by system (kernel) processes, and by niced user processes, plus the idle share, and so on;

Memory (Mem): total, used, and free memory, and so on;

Swap: total, used, and free swap space, and so on;

Per-process status: process ID, user name, priority, CPU and memory usage, and the command line that started the process;

Once top is running, the following keys are available:

[q] Quit top;
[c] Toggle display of the full command line;
[P] Sort by CPU usage;
[N] Sort by process ID;
[M] Sort by memory usage;
[Space] Refresh the display immediately;

top refreshes the display automatically at a fixed interval (5 seconds on the system described here; the default differs between versions). To choose the interval yourself, start top with -d followed by the number of seconds, for example top -d 1.

vmstat 1：

Adding user time (us) and kernel time (sy) shows whether the CPU is busy. If wait I/O (wa) is persistently high, the disk is likely the bottleneck: tasks are blocked on disk I/O while the CPU sits idle. Wait I/O can therefore be treated as another form of CPU idle, one that explains why the CPU is idle.
Handling I/O necessarily consumes kernel time (sy). If kernel time is high, for example above 20%, further analysis is warranted: perhaps the kernel is handling I/O inefficiently, or the code issues many separate calls in a loop instead of batching them.
Indicators to look at:

- r: number of runnable tasks, both running and waiting to run. This is a better indicator of CPU saturation than the load average because it does not include tasks waiting on I/O. When r is larger than the number of CPUs, the system is saturated.

- free: free memory, in KB.

- si, so: pages swapped in and swapped out. If these are non-zero, the system is short of memory.

- us, sy, id, wa, st: breakdown of CPU time, averaged across all CPUs: user time, kernel (system) time, idle, wait I/O, and stolen time (in a virtualized environment, time taken by other tenants sharing the host).

vmstat reports virtual memory and CPU statistics. The argument 1 on the command line means print a new line every second.
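The checks above can be applied mechanically to one line of vmstat output. The Python sketch below is an illustration only: the function names are hypothetical, the header is the usual 17-column vmstat layout (some versions print additional columns), and the data row is made up.

```python
def parse_vmstat(header: str, row: str) -> dict:
    """Pair vmstat column names with the integer values of one data row."""
    return dict(zip(header.split(), (int(v) for v in row.split())))


def vmstat_findings(stats: dict, cpu_count: int) -> list:
    """Apply the rules of thumb from these notes to one vmstat sample."""
    findings = []
    if stats["r"] > cpu_count:
        findings.append("CPU saturated: r exceeds the number of CPUs")
    if stats["si"] > 0 or stats["so"] > 0:
        findings.append("swapping: the system is short of memory")
    if stats["sy"] > 20:
        findings.append("kernel time above 20%: analyze further")
    return findings


HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st"
# Made-up sample row for a host assumed to have 8 CPUs.
ROW = "10 0 0 1500000 80000 600000 0 12 0 5 600 1000 70 25 5 0 0"

stats = parse_vmstat(HEADER, ROW)
print("busy %:", stats["us"] + stats["sy"])
for finding in vmstat_findings(stats, cpu_count=8):
    print(finding)
```

Parsing by column name rather than by position keeps the sketch readable and tolerant of extra columns, as long as the header and row come from the same vmstat run.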

mpstat -P ALL 1：

mpstat prints a CPU time breakdown for each CPU, which shows whether work is spread evenly across CPUs. A single CPU with high usage while the others are mostly idle suggests a single-threaded application.
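A rough Python sketch of the "one hot CPU" check follows. It assumes the per-CPU busy percentage has already been obtained (for example as 100 minus the %idle column of mpstat); the function name and both thresholds are hypothetical choices, not values from mpstat or from these notes.

```python
from typing import Optional


def single_hot_cpu(busy_pct: dict, hot: float = 90.0,
                   quiet: float = 20.0) -> Optional[int]:
    """Return the id of the only busy CPU, or None if there is no such CPU.

    hot and quiet are assumed thresholds chosen for illustration.
    """
    hot_ids = [cpu for cpu, pct in busy_pct.items() if pct >= hot]
    others_quiet = all(pct <= quiet
                       for cpu, pct in busy_pct.items() if cpu not in hot_ids)
    if len(hot_ids) == 1 and others_quiet:
        return hot_ids[0]
    return None


# Made-up samples: CPU id -> busy percentage.
skewed = {0: 3.0, 1: 99.0, 2: 5.0, 3: 2.5}
balanced = {0: 60.0, 1: 62.0, 2: 58.0, 3: 61.0}
print(single_hot_cpu(skewed))
print(single_hot_cpu(balanced))
```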

pidstat 1：

pidstat is much like top, but it prints a rolling summary at each interval instead of clearing the screen. This makes it easy to watch patterns in process behavior over time, and the output can be copied and pasted directly to record how each process changes.

iostat -xz 1：

iostat is an important tool for understanding the current load and performance of block devices (disks). The meaning of several indicators:

- r/s, w/s, rkB/s, wkB/s: reads per second, writes per second, kB read per second, and kB written per second delivered to the device. These characterize the workload; a performance problem may simply be due to excessive load.

- await: average time for I/O requests to the device, in milliseconds. It includes both the time a request spends queued and the time it is being serviced. Larger-than-expected averages can indicate that the device is saturated or has a problem.

- avgqu-sz: average length of the device request queue (shown as aqu-sz in newer sysstat versions). A value greater than 1 can indicate that the device is saturated.

- %util: device utilization, that is, the percentage of each second during which the device was busy handling I/O. Values above 60% usually lead to performance problems (visible in await), although this depends on the device. Values close to 100% usually indicate that the disk is saturated.

Notice:

If the block device is a logical device backed by many physical disks, 100% utilization only means that some I/O was being processed 100% of the time; the back-end physical disks may be far from saturated and able to handle much more load.

Poor disk I/O performance is not necessarily an application problem. Many techniques let applications perform I/O asynchronously so that they do not block on it; applications can also use read-ahead and write buffering to reduce the impact of I/O latency on themselves.
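The iostat thresholds above can be collected into one small Python check. This is a sketch under stated assumptions: the function name is hypothetical, "close to 100%" is taken to mean 95% or more, and, per the notes above, the results are only hints for logical devices and for applications that do not block on I/O.

```python
def iostat_findings(util_pct: float, queue_len: float,
                    near_full_pct: float = 95.0) -> list:
    """Apply the rules of thumb for %util and queue length.

    near_full_pct is an assumed cutoff for "close to 100%".
    """
    findings = []
    if queue_len > 1:
        findings.append("queue length above 1: possible saturation")
    if util_pct >= near_full_pct:
        findings.append("utilization close to 100%: device likely saturated")
    elif util_pct > 60:
        findings.append("utilization above 60%: check await")
    return findings


# Made-up samples: (%util, average queue length).
print(iostat_findings(72.0, 0.4))
print(iostat_findings(99.5, 3.2))
```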

free -m

available: Currently available memory; we want to see if this value is close to zero. Values close to zero will result in higher disk IO and poorer performance.
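To judge whether available memory is "close to zero", it helps to express it as a share of total memory. The Python sketch below parses text in the format of /proc/meminfo, whose MemAvailable field is the source of the available column on Linux; the function name and the sample values are made up for illustration.

```python
def available_pct(meminfo_text: str) -> float:
    """Return MemAvailable as a percentage of MemTotal."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])  # value in kB
    return 100.0 * fields["MemAvailable"] / fields["MemTotal"]


# Made-up excerpt in /proc/meminfo format.
SAMPLE = """MemTotal:       16000000 kB
MemFree:          500000 kB
MemAvailable:    4000000 kB
"""
print(available_pct(SAMPLE))
```

Note that MemFree alone is much smaller here than MemAvailable, which is why the notes look at available rather than free.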

sar -n DEV 1

Check network interface throughput: rxkB/s and txkB/s measure the load and show whether a limit has been reached. In the example this note refers to, eth2 reaches about 8 Mbytes/s, which is roughly 64 Mbit/s, far below its 10 Gbit/s limit.
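The unit conversion in this comparison is easy to get wrong, so here it is as a short Python sketch. The function names are hypothetical, and 1 kB is treated as 1000 bytes for a rough estimate, matching the arithmetic in the text (8 Mbytes/s is about 64 Mbit/s).

```python
def kb_per_s_to_mbit_per_s(kb_per_s: float) -> float:
    """Convert kilobytes per second to megabits per second (1 kB = 1000 B)."""
    return kb_per_s * 8 / 1000


def link_utilization_pct(kb_per_s: float, link_mbit_per_s: float) -> float:
    """Return throughput as a percentage of the link's nominal bandwidth."""
    return 100.0 * kb_per_s_to_mbit_per_s(kb_per_s) / link_mbit_per_s


# About 8 Mbytes/s (8000 kB/s) on a 10 Gbit/s (10000 Mbit/s) link.
print(kb_per_s_to_mbit_per_s(8000))
print(link_utilization_pct(8000, 10000))
```

This is the network utilization metric from the USE list earlier: throughput divided by the maximum bandwidth of the interface.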

sar -n TCP,ETCP 1

Summary of some key TCP metrics:
- retrans/s: number of TCP retransmissions per second

- passive/s: number of remotely initiated TCP connections per second, that is, connections the local program accepted with accept()

- active/s: number of locally initiated TCP connections per second, that is, the local program called connect()

Note:

The active and passive counts are often useful as a rough measure of server load: the number of newly accepted connections (passive) and the number of downstream connections (active). It can help to think of active as outbound and passive as inbound, but this is not strictly accurate; consider a connection from localhost to localhost.

Retransmissions are a sign of a network or server problem: the network may be unreliable, or the server may be overloaded and dropping packets. The example this note refers to showed only 2 retransmissions per second.