[Linux] 19. CPU: load average, top, mpstat, pidstat, vmstat, context switching, stress testing with stress and sysbench, high CPU usage, pstree, execsnoop, perf, sar

  • If monitoring is in place, first check the monitoring dashboard for abnormal alerts. If there is no monitoring yet, I follow the steps below to look for anomalies at the system level.
  • First, look at the system's load average with top or htop. The load average reflects the overall state of the system, combining CPU, memory, and disk pressure. Generally, a load average greater than the machine's number of CPU cores means its resources are already tight.
  • If the load average is high, the next step is to find which resource is causing it. I first look at per-core CPU usage in top; if it is high, the bottleneck is probably the CPU, and then I look at which process is responsible.
  • If the CPU looks fine, I check memory next. I use free, but I do not just look at how much is left; I also look at cache and buffers together, and then use top's sorting to see which process is using too much memory.
  • If memory is fine, check the disks. I use iostat to inspect disk I/O; disk problems have been relatively rare in my experience.
  • There is also bandwidth. iftop is generally used to check the traffic and see whether it exceeds the bandwidth allocated to the machine.
  • When it comes to a specific application, check it against its own configuration, for example whether the number of connections has reached the configured limit.
  • If no anomalies are found in any of the system-level metrics, then consider external systems such as databases, caches, and storage.

1. CPU

1.1 Load Average

The load average is the average number of processes in the runnable and uninterruptible states per unit time, that is, the average number of active processes. It has no direct relationship with CPU usage.

  • Runnable processes: processes that are either using the CPU or waiting for the CPU, i.e. processes in the R state (Running or Runnable) in ps.
  • Uninterruptible processes: processes that are in the middle of critical kernel-side operations and cannot be interrupted, most commonly while waiting for an I/O response from a hardware device, i.e. processes in the D state (Uninterruptible Sleep, also called Disk Sleep) in ps.
    • For example, when a process is reading from or writing to a disk, it must not be interrupted by other processes or by interrupts until it gets the disk's reply, in order to guarantee data consistency. If it were interrupted at that point, the disk data and the process's data could easily become inconsistent. The uninterruptible state is therefore a protection mechanism the system provides for processes and hardware devices.

So you can simply understand the load average as the average number of active processes. Intuitively that is the number of active processes per unit time, but strictly speaking it is an exponentially damped moving average of the number of active processes. (You don't need to worry about the exact meaning of "exponentially damped average"; it is just a cheaper way for the system to compute the value. Just treat it as the average number of active processes.)
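
For intuition, here is a hedged sketch of that update rule (the kernel samples roughly every 5 seconds and uses fixed-point arithmetic; the starting load 0.50 and the 2 active tasks below are made-up example values):

# Rough sketch of one 5-second update of the 1-minute load average:
# new = old * exp(-5/60) + active * (1 - exp(-5/60))
awk -v old=0.50 -v active=2 'BEGIN {
  decay = exp(-5/60)                                  # ~0.92 per 5-second tick
  printf "new 1-minute load: %.2f\n", old*decay + active*(1-decay)
}'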

Since the load average is the number of active processes, the ideal situation is to have exactly one active process per CPU, so that every CPU is fully utilized. For example, what does a load average of 2 mean?

  • On a system with 2 CPUs, it means all CPUs are exactly fully occupied.
  • On a system with 4 CPUs, it means the CPUs are 50% idle.
  • On a system with only 1 CPU, it means half of the processes cannot get the CPU.

In conclusion:

  • A high load average may be caused by CPU-intensive processes;
  • A high load average does not necessarily mean high CPU usage; it may also mean that I/O is busy;
  • When you find the load is high, you can use tools such as mpstat and pidstat to help analyze where the load comes from.
uptime # output: 23:01:47 up 69 days, 15:32,  2 users,  load average: 1.73, 3.09, 3.40

NAME
       uptime - Tell how long the system has been running.
DESCRIPTION
       uptime  gives a one line display of the following information.  The current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.
       This is the same information contained in the header line displayed by w(1).
       System load averages is the average number of processes that are either in a runnable or uninterruptable state.  A process in a runnable state is either using the CPU or waiting to use the CPU.  A process in uninterruptable state is waiting for some I/O access, eg waiting for disk.  The averages are taken over the three time intervals.  Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time.

The number of logical CPUs can be obtained as follows. If the load average is greater than the number of logical CPUs * 70%, the system is generally considered overloaded:

grep 'model name' /proc/cpuinfo | wc -l
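
A minimal sketch of the 70% check (nproc and /proc/loadavg are standard; the threshold itself is only the rule of thumb above, not a hard limit):

cpus=$(nproc)                                # number of logical CPUs
load1=$(awk '{print $1}' /proc/loadavg)      # 1-minute load average
awk -v l="$load1" -v c="$cpus" 'BEGIN {
  printf "load1=%.2f, 70%% of %d CPUs=%.2f -> %s\n", l, c, 0.7*c, (l > 0.7*c) ? "worth investigating" : "ok"
}'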


Three load-average windows:

  • If the 1-minute, 5-minute, and 15-minute values are basically the same, or differ only slightly, the system load is very stable.
  • If the 1-minute value is much smaller than the 15-minute value, the load has been decreasing over the last minute, but was heavy over the past 15 minutes.
  • Conversely, if the 1-minute value is much larger than the 15-minute value, the load has been rising over the last minute. The increase may be temporary or may keep growing, so keep observing. Once the 1-minute load average approaches or exceeds the number of CPUs, the system is overloaded, and you need to analyze what is causing it and find ways to optimize.
  • For example, suppose we see load averages of 1.73, 0.60, 7.98 on a single-CPU system. That means the system was overloaded by 73% over the last minute ((1.73 - 1) / 1 = 73%) and by 698% over the last 15 minutes; judging from the overall trend, the system load is decreasing.

In a real production environment, how high does the load average need to be before we should pay attention?

  • When the load average is above 70% of the number of logical CPUs, you should analyze and troubleshoot the high load; once the load is too high, processes may respond slowly and the service's normal functioning may be affected. The 70% figure is not absolute, though.
  • The most recommended approach is to monitor the system's load average and judge the trend against historical data. When you see an obvious upward trend, for example the load has doubled, then start analyzing and investigating.

1.1.1 CPU usage

The load average is the number of processes in the runnable and uninterruptible states per unit time, so it includes not only processes that are using the CPU, but also processes waiting for the CPU and processes waiting for I/O.

[CPU usage] is a statistic of how busy the CPU is per unit time, and does not necessarily correspond to [load average]. For example:

  • For CPU-intensive processes, heavy CPU use drives the load average up, and the two are consistent;
  • For I/O-intensive processes, waiting for I/O also drives the load average up, but CPU usage is not necessarily high;
  • A large number of processes waiting to be scheduled onto the CPU also drives the load average up, and in this case CPU usage is also relatively high.

1.1.1.1 High CPU usage: locating the specific application causing it

In the previous section, I explained what CPU usage is and, through a case study, showed how to use top, vmstat, pidstat, and other tools to track down processes with high CPU usage, and then use perf top to locate the problem inside the application's functions. However, someone left a comment saying that high CPU usage seems fairly easy to troubleshoot.

Can every high-CPU-usage problem be analyzed this way? I think your answer would be no.

Looking back at the earlier content, we know that the system's CPU usage covers not only user-mode and kernel-mode execution of processes, but also interrupt handling, waiting for I/O, and kernel threads. So when you find that the system's CPU usage is very high, you may not necessarily be able to find a corresponding process with high CPU usage.

Today I will use an Nginx + PHP web service case to walk you through analyzing this situation.

Environment preparation:
Machine configuration: 2 CPUs, 8 GB memory

Pre-install tools such as docker, sysstat, perf, and ab, e.g. apt install docker.io sysstat linux-tools-common apache2-utils

One machine acts as the web server to simulate the performance problem; the other acts as the web server's client, putting request pressure on the web service. Two virtual machines are used so that they are isolated from each other and "cross-infection" is avoided.

Next, open two terminals, log in to the two machines via SSH, and install the tools above.

1.1.1.1.1 Operation and analysis










You can also use perf record -ag -- sleep 2; perf report to generate the report in one step.

sar -w (or sar -w 1) also gives a direct view of how many processes or threads are created per second.
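
If top shows high CPU usage but no obviously busy process, the culprit is often short-lived processes. A hedged sketch of how one might catch them, assuming execsnoop (from bcc-tools or perf-tools) is installed; <PID> is a placeholder:

execsnoop               # print every new process as it is exec'ed, with its full command line
pstree -aps <PID>       # then walk up the parent chain of a suspicious short-lived process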



1.1.2 Case: uptime, mpstat, pidstat

Machine configuration: Ubuntu 18.04, 2 CPUs, 8 GB memory, pre-installed via apt install stress sysstat

  • stress is a Linux stress-testing tool. Here we use it as an abnormal process to simulate scenarios in which the load average rises.
  • sysstat contains commonly used Linux performance tools for monitoring and analyzing system performance. In this case we use two commands from the package, mpstat and pidstat.
    • mpstat is a commonly used multi-core CPU performance analysis tool, used to view the performance metrics of each CPU in real time as well as the averages across all CPUs.
    • pidstat is a commonly used per-process performance analysis tool, used to view a process's CPU, memory, I/O, context-switch, and other metrics in real time.

Each scenario requires opening three terminals on the same machine:

# First run uptime; the initial load averages are basically 0:
uptime # load average: 0.11, 0.15, 0.09

# mpstat -P ALL 5 20  (-P ALL monitors all CPUs; 5 means print one set of data every 5 seconds; 20 means take 20 samples)
Linux 4.15.0 (ubuntu) 09/22/18 _x86_64_ (2 CPU)
13:30:06     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle

1.1.2.1 CPU-intensive processes
# Terminal 1: run stress to simulate 100% CPU usage
stress --cpu 1 --timeout 600 

# Terminal 2: run uptime to watch how the load average changes:
watch -d uptime # load average: 1.00, 0.75, 0.39 # high 1-minute load

# Terminal 3: run mpstat to watch how CPU usage changes:
# mpstat -P ALL 5
Linux 4.15.0 (ubuntu) 09/22/18 _x86_64_ (2 CPU)
13:30:06     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:30:11     all   50.05    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.95
13:30:11       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
13:30:11       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00 # %usr as high as 100%

From terminal 2 you can see the 1-minute load average slowly climb to 1.00,
and from terminal 3 you can see that exactly one CPU is at 100% usage while its iowait is 0.
This shows that the rise in load average is due precisely to the 100% CPU usage.


So which process is driving CPU usage to 100%? You can use pidstat to find out:
$ pidstat -u 5 1 # print one set of data after a 5-second interval
13:37:07      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13:37:12        0      2962  100.00    0.00    0.00    0.00  100.00     1  stress # high %usr
It is clear from this that the stress process's CPU usage is 100%.

1.1.2.2 IO-intensive processes
# First run stress again, but this time simulate I/O pressure, i.e. keep executing sync
stress-ng -i 1 --hdd 1 --timeout 600 # --hdd means read/write temporary files; stress-ng is the next generation of stress and supports richer options

# Terminal 2: run uptime to watch how the load average changes:
watch -d uptime # load average: 1.06, 0.58, 0.37
# Terminal 3: run mpstat to watch how CPU usage changes:
$ mpstat -P ALL 5 1 # show metrics for all CPUs, print one set of data at a 5-second interval
Linux 4.15.0 (ubuntu)     09/22/18     _x86_64_    (2 CPU)
13:41:28     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:41:33     all    0.21    0.00   12.07   32.67    0.00    0.21    0.00    0.00    0.00   54.84
13:41:33       0    0.43    0.00   23.87   67.53    0.00    0.43    0.00    0.00    0.00    7.74 # high %sys, high %iowait
13:41:33       1    0.00    0.00    0.81    0.20    0.00    0.00    0.00    0.00    0.00   98.99

Here you can see the 1-minute load average slowly climb to 1.06; on one CPU the [system CPU usage] rises to 23.87% while iowait reaches 67.53%. This shows that the rise in load average is due to the rise in iowait.


So which process is causing iowait to be so high? Again, use pidstat to find out:
$ pidstat -u 5 1 # print one set of data after a 5-second interval; -u shows CPU metrics
Linux 4.15.0 (ubuntu)     09/22/18     _x86_64_    (2 CPU)
13:42:08      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13:42:13        0       104    0.00    3.39    0.00    0.00    3.39     1  kworker/1:1H
13:42:13        0       109    0.00    0.40    0.00    0.00    0.40     0  kworker/0:1H
13:42:13        0      2997    2.00   35.53    0.00    3.99   37.52     1  stress # high %system, some %wait, high %CPU
13:42:13        0      3057    0.00    0.40    0.00    0.00    0.40     0  pidstat
As you can see, the stress process is again the culprit.

1.1.2.3 Large number of processes

When the number of processes running in the system exceeds what the CPUs can handle, processes start waiting for the CPU.

# For example, still using stress, but this time simulating 8 processes:
stress -c 8 --timeout 600
# Since the system has only 2 CPUs, far fewer than 8 processes, the CPUs are severely overloaded and the load average reaches 7.97
uptime # 7.97, 5.93, 3.02

$ pidstat -u 5 1 # print one set of data after a 5-second interval
14:23:25      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
14:23:30        0      3190   25.00    0.00    0.00   74.80   25.00     0  stress # high %wait
14:23:30        0      3191   25.00    0.00    0.00   75.20   25.00     0  stress # high %wait
14:23:30        0      3192   25.00    0.00    0.00   74.80   25.00     1  stress # high %wait
14:23:30        0      3193   25.00    0.00    0.00   75.00   25.00     1  stress # high %wait
14:23:30        0      3194   24.80    0.00    0.00   74.60   24.80     0  stress # high %wait
14:23:30        0      3195   24.80    0.00    0.00   75.00   24.80     0  stress # high %wait
14:23:30        0      3196   24.80    0.00    0.00   74.60   24.80     1  stress # high %wait
14:23:30        0      3197   24.80    0.00    0.00   74.80   24.80     1  stress # high %wait
14:23:30        0      3200    0.00    0.20    0.00    0.20    0.20     0  pidstat

As you can see, 8 processes are competing for 2 CPUs, and each process waits for the CPU up to 75% of the time (the %wait column above). These processes, which exceed the CPUs' computing capacity, end up overloading the CPU.

1.1.3 htop, atop

htop is a more direct way to look at the load (enable the relevant display options under the F2 setup menu and turn on color differentiation). Different kinds of load are marked with different colors: for CPU-intensive applications the load shows up as green, while iowait activity shows up as red, and so on. With these indicators plus htop's sorting, it is easy to locate the problematic process.

The atop command appears to be a report generated from sar-style statistics. It marks the problematic process directly in red, which is even more intuitive.
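
A minimal usage sketch (package names assume a Debian/Ubuntu-style system):

apt install htop atop   # install both tools
htop                    # F2 to configure columns/colors, F6 to choose the sort key
atop 5                  # sample system-wide and per-process activity every 5 seconds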

1.2 CPU context switching

Linux is a multi-tasking OS that supports running far more tasks than there are CPUs at the same time. Of course, these tasks are not really running simultaneously; the OS allocates the CPU to them in turn for short periods, creating the illusion that many tasks run at the same time.

Before each task runs, the CPU needs to know [the instruction currently being executed or about to be executed] (from the Program Counter, PC) and [the data] (from the CPU registers). Both form the environment the task depends on, also called the [CPU context].

CPU context: general-purpose registers R0–Rn and the program counter PC

CPU context switch: first save the [CPU context of the previous task], then load the [CPU context of the new task] and jump to the new location pointed to by the [new task's program counter (PC)] to run the new task, as shown below:

CPU context switch

CPU tasks fall into three types: processes, threads, and interrupts; correspondingly, CPU context switching is divided into process context switches, thread context switches, and interrupt context switches.

Effects of CPU context switching:

  • It is one of the core mechanisms that keep a Linux system running normally, and usually does not require any special attention from us.
  • However, excessive context switching makes the CPU spend its time saving and restoring registers, kernel stacks, virtual memory, and other data, shortening the time actually spent running processes and causing a significant drop in overall system performance.

1.2.1 Process context switching

It is divided into two kinds: [privileged-mode switching for system calls within a process] and [CPU context switching between processes].

1.2.1.1 In-process system call: privileged mode switching

The CPU provides four privilege levels, Ring 0 through Ring 3 (think of them as four states, each with its own restrictions); Linux rarely uses Ring 1 and Ring 2:

  • Kernel space runs in Ring 0, which has the highest privileges and can directly access all resources.
  • User space runs in Ring 3, which can only access restricted resources and cannot directly access memory or other hardware devices; it must trap into the kernel via a [system call] to access those privileged resources.
    CPU Rings

That is, a process can execute either in user space (user mode) or in kernel space (kernel mode), and it moves from user mode to kernel mode via [system calls]. For example, to view a file, the process needs the open() system call to open it, read() to read it, write() to write it, close() to close it, and so on.
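
A hedged way to see how many of these system calls a simple command actually makes is strace's call summary (the file path is just an example):

# Each system call traps into the kernel and returns, i.e. two privilege-mode
# switches per call; -c prints a per-call count summary instead of the full trace.
strace -c cat /etc/hostname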

When a [system call] happens, [privileged-mode switching] occurs twice. (This is not called a CPU context switch, because it all happens within one process, no process switch takes place, and user-mode resources such as virtual memory are not involved.) As shown in the figure below:

  • First switch: the CPU saves the [original user-mode instruction location] to a register, loads the [kernel-mode instruction location to be executed] into the register, and jumps into kernel mode to execute the kernel task.
  • Second switch: the CPU [restores] the [original user-mode instruction location] and switches back to user space to continue running the process.

Single process system call

1.2.1.2 Inter-process: CPU context switching

Because processes are managed and scheduled by the kernel, switching between processes can only happen in [kernel mode]. The context of a process therefore includes not only [user-space state] such as virtual memory, the stack, and global variables, but also [kernel-space state] such as the kernel stack and registers.

Therefore, an [inter-process context switch] involves one more step than a [system call]: first save the [current process's user mode: virtual memory, stack, global variables], then save the [current process's kernel mode: CPU registers], then load the next process's [kernel mode: CPU registers], and finally load the next process's [user mode: virtual memory, stack, global variables]. As shown below:

Inter-process context switching

CPU context switching has potential performance costs (see the sketch after this list):

  • Each context switch takes from tens of nanoseconds to several microseconds of CPU time. When switches are frequent, the CPU can easily spend a large amount of time saving and restoring registers, kernel stacks, virtual memory, and other resources, greatly shortening the time spent actually running processes.
  • In addition, Linux uses the TLB (Translation Lookaside Buffer) to manage the mapping between virtual and physical memory. When virtual memory is updated, the TLB must also be flushed, and memory access becomes slower. On multi-processor systems in particular, caches are shared between processors, so flushing them affects not only the current processor's processes but also the processes of other processors sharing the cache.
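
For a rough feel of what a single switch costs on your own machine, perf ships a scheduler micro-benchmark (just a sketch; results vary widely with hardware and kernel):

# Bounces a token between two tasks through a pipe, forcing a task switch per
# round trip, and reports total time and time per operation.
perf bench sched pipe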

When does a CPU context switch happen? Only when a process switch is needed. Linux maintains a ready queue for each CPU, sorts the active processes (those running or waiting for the CPU) by priority and by how long they have been waiting for the CPU, and then picks the process that needs the CPU most (highest priority and longest wait) to run.

When will the process be scheduled to run on the CPU?

  • The most obvious moment is when a process finishes executing and terminates: the CPU it was using is released, and a new process is taken from the ready queue to run.
  • In fact, many other scenarios also trigger process scheduling:
    • First, to ensure all processes are scheduled fairly, CPU time is divided into time slices which are handed to processes in turn. When a process's time slice runs out, it is suspended by the system and the CPU switches to another process waiting for it.
    • Second, when system resources are insufficient (for example, not enough memory), a process cannot run until the resources are available; it is suspended, and the system schedules another process to run.
    • Third, when a process voluntarily suspends itself, for example via the sleep function, it is naturally rescheduled as well.
    • Fourth, when a process with a higher priority becomes runnable, the current process is suspended so the higher-priority process can run.
    • Finally, when a hardware interrupt occurs, the process on the CPU is interrupted and the kernel's interrupt service routine runs instead.

1.2.2 Thread context switching

Threads are the basic unit of scheduling, and processes are the basic units of resource ownership.

In other words, when the kernel schedules tasks, the actual object being scheduled is a thread; the process merely provides resources such as virtual memory and global variables to its threads. So it can be understood like this:

  • When a process has only one thread, it can be considered that the process is equal to the thread.
  • When a process has multiple threads, these threads will share the same resources such as virtual memory and global variables. These resources do not need to be modified during context switching.
  • In addition, threads also have their own private data, such as stacks and registers, which also need to be saved during context switching.

Therefore, thread context switching can actually be divided into two situations:

  • First, the two threads before and after belong to different processes: because resources are not shared, the switching process is the same as process context switching.
  • Second, the two threads before and after belong to the same process: because the virtual memory is shared, the virtual memory resources remain unchanged when switching, and only the thread's private data, registers and other non-shared data need to be switched.

Therefore, although both are context switches, switching between threads within the same process consumes fewer resources than switching between processes, and this is one advantage of multithreading over multiple processes.

1.2.3 Interrupt context switch

To respond quickly to hardware events, interrupt handling interrupts the normal scheduling and execution of processes and instead calls the interrupt handler to respond to the device event. When a process is interrupted, its current state must be saved so that, after the interrupt ends, the process can resume from where it left off.

Unlike a process context switch, an interrupt context switch does not involve the process's user mode. So even if the interrupt preempts a process that is in user mode, there is no need to save and restore user-mode resources such as the process's virtual memory and global variables. The interrupt context therefore only includes the state needed for the kernel-mode interrupt service routine to run: CPU registers, the kernel stack, hardware interrupt parameters, and so on.

For the same CPU, interrupt handling has a higher priority than processes, so interrupt context switches do not happen at the same time as process context switches. Likewise, because interrupts break into the scheduling and execution of normal processes, most interrupt handlers are kept short and simple so that they finish as quickly as possible.

In addition, like process context switches, interrupt context switches consume CPU. Too many of them can eat a lot of CPU time and even seriously degrade overall system performance. So when you find that interrupts are too frequent, you need to investigate whether they are causing serious performance problems for your system.

1.2.4 vmstat: viewing context-switch status

vmstat is a commonly used system performance analysis tool. It is mainly used to analyze the system's memory usage, but it is also commonly used to analyze the number of CPU context switches and interrupts.

# Print one set of data every 5 seconds
$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 7005360  91564 818900    0    0     0     0   25   33  0  0 100  0  0
cs (context switch) is the number of context switches per second. 
in (interrupt) is the number of interrupts per second. 
r (Running or Runnable) is the length of the ready queue, i.e. the number of processes running or waiting for the CPU. 
b (Blocked) is the number of processes in uninterruptible sleep. 
In this example the context-switch count cs is 33, the interrupt count in is 25, and both the ready-queue length r and the uninterruptible-process count b are 0. 

vmstat only shows the system-wide context-switch situation. To see the details for each process, you need to run pidstat -w, which shows per-process context switching:

  • cswch is the number of voluntary context switches per second, i.e. switches caused by a process being unable to obtain the resources it needs. For example, voluntary context switches occur when system resources such as I/O or memory are insufficient.
  • nvcswch is the number of non-voluntary (involuntary) context switches per second, i.e. switches where the process is forcibly rescheduled by the system, for example because its time slice has run out. Involuntary context switches occur easily when many processes are competing for the CPU.
# Print one set of data every 5 seconds
$ pidstat -w 5
Linux 4.15.0 (ubuntu)  09/23/18  _x86_64_  (2 CPU)
08:18:26      UID       PID   cswch/s nvcswch/s  Command
08:18:31        0         1      0.20      0.00  systemd
08:18:31        0         8      5.40      0.00  rcu_sched
...

1.2.4.1 Case: sysbench simulates multi-thread scheduling switching

Now that you know how to look at these metrics, the next question is: how many context switches per second counts as normal? Don't rush to the answer. As before, let's first work through a context-switching case; through hands-on practice you can work out this standard yourself.

sysbench is a multi-threaded benchmarking tool, generally used to evaluate database load under different system parameters. In this case, though, we only treat it as an abnormal process whose role is to simulate the problem of excessive context switching.

The following case is based on Ubuntu 18.04, but other Linux systems work too. The environment I used is as follows:

  • Machine configuration: 2 CPU, 8GB memory
  • Pre-installed sysbench and sysstat packages, such as apt install sysbench sysstat
  • Before the official operation begins, you need to open three terminals and log in to the same Linux machine.
# First observe the context-switch count on an idle system; print one set of data per second. column -t aligns the output for easier reading
$ vmstat 1 | column -t
procs 	 -----------memory---------- 	 ---swap--  -----io---- 	-system-- 		------cpu-----
r  b    swpd   	free   	buff  	cache     si   so    bi    bo   	in   cs 		us sy id wa st
0  0      0 	6984064  92668 	830896    0    0     2    19   		19   35  		1  0 99  0  0 # low r, low in, low cs, low us, low sy

# Then run a 5-minute benchmark with 10 threads to simulate the multi-thread switching problem
$ sysbench --threads=10 --max-time=300 threads run
# Next, run vmstat in the second terminal to observe the context switching:
# print one set of data per second (press Ctrl+C to stop)
$ vmstat 1 | column -t
procs 	 -----------memory---------- 	 ---swap--  -----io---- 	-system-- 		------cpu-----
r  b   	swpd   	free    buff  	cache     si   so    bi    bo   	in   	cs 		us sy id wa st
6  0      0 	6487428 118240 	1292772    0    0     0     0 		9019 	1398830 16 84  0  0  0 # high r, high in, high cs, high us, high sy
8  0      0 	6487428 118240 	1292772    0    0     0     0 		10191 	1392312 16 84  0  0  0

You can see that the context-switch count in the cs column has jumped from the earlier 35 to about 1.39 million per second. Also note the other metrics:
r column: the ready-queue length has reached 8, far more than the system's 2 CPUs, so there is bound to be heavy competition for the CPU. 
us (user) and sy (system) columns: these two CPU usage figures add up to 100%, with system CPU usage (sy) as high as 84%, meaning the CPU is mostly occupied by the kernel. 
in column: the interrupt count has also risen to around 10,000, so interrupt handling is a potential problem as well. 
Putting these metrics together: the ready queue r is too long (too many processes running and waiting for the CPU), which causes a large number of context switches (cs), and the context switches in turn drive up the system CPU usage (sy). 
So which processes are causing these problems? Continue the analysis: in the third terminal, use pidstat to look at CPU usage and process context switching:
$ pidstat -w -u 1 # print one set of data per second; -w outputs context-switch metrics, -u outputs CPU usage metrics
08:06:33      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
08:06:34        0     10488   30.00  100.00    0.00    0.00  100.00     0  sysbench # high %CPU (coming from %system)
08:06:34        0     26326    0.00    1.00    0.00    0.00    1.00     0  kworker/u4:2
 
08:06:33      UID       PID   cswch/s nvcswch/s  Command
08:06:34        0         8     11.00      0.00  rcu_sched
08:06:34        0        16      1.00      0.00  ksoftirqd/1
08:06:34        0       471      1.00      0.00  hv_balloon
08:06:34        0      1230      1.00      0.00  iscsid
08:06:34        0      4089      1.00      0.00  kworker/1:5
08:06:34        0      4333      1.00      0.00  kworker/0:3
08:06:34        0     10499      1.00    224.00  pidstat # high nvcswch/s
08:06:34        0     26326    236.00      0.00  kworker/u4:2 # high cswch/s
08:06:34     1000     26784    223.00      0.00  sshd # high cswch/s

From the pidstat output you can see:
- The rise in CPU usage (now 100%) is indeed caused by sysbench
- But the context switches come from other processes
	- including pidstat, which has the highest involuntary context-switch rate
	- and the kernel thread kworker and sshd, which have the highest voluntary context-switch rates

However, you may have noticed something odd: the context-switch counts reported by pidstat add up to only a few hundred (1 + 224), far smaller than vmstat's 1.39 million. That is because the output above only shows processes, not threads. In fact, pidstat -t shows the per-[thread] metrics, as follows:
$ pidstat -wt 1 # print one set of data per second; -wt outputs per-thread context-switch metrics
08:14:05      UID      TGID       TID   cswch/s nvcswch/s  Command
...
08:14:05        0     10551         -      6.00      0.00  sysbench
08:14:05        0         -     10551      6.00      0.00  |__sysbench
08:14:05        0         -     10552  18911.00 103740.00  |__sysbench
08:14:05        0         -     10553  18915.00 100955.00  |__sysbench
08:14:05        0         -     10554  18827.00 103954.00  |__sysbench
...
Now you can see it: although the sysbench process (the main thread) does not appear to switch context very often, its child threads switch a great deal. The real culprit behind the context switches is the excessive number of sysbench threads. There are 10 sysbench threads in total, each switching roughly 120,000 times per second (cswch/s + nvcswch/s ≈ 120k), so about 1.2 million in total, which roughly matches the 1.39 million in vmstat's cs column above (the rest comes from other, lower-usage processes).
We have found the root cause of the increased context switching; can we stop here?

Of course not. When we observed the system metrics earlier, besides the sudden rise in context switches, the interrupt count (vmstat's in column) also rose to around 10,000. But which type of interrupt went up? That still needs to be investigated, as follows.

Since these are interrupts, we know they only happen in kernel mode, and pidstat is only a per-process performance tool that provides no detail about interrupts. How can we find out which type of interrupt is occurring?

The answer is to read the read-only file /proc/interrupts. /proc is actually a virtual filesystem in Linux used for communication between kernel space and user space. /proc/interrupts is part of that mechanism and provides a read-only view of interrupt usage. 

Still in the third terminal, run the following command and observe how the interrupts change:
$ watch -d 'cat /proc/interrupts | grep RES' # -d highlights the areas that change
           CPU0       CPU1
...
RES:    2450431    5279697   Rescheduling interrupts
...

After watching for a while you will find that the fastest-changing row is the rescheduling interrupt (RES), the interrupt type used to wake up an idle CPU to schedule new tasks. 
This is the mechanism the scheduler uses on multi-processor (SMP) systems to spread tasks across CPUs, often called inter-processor interrupts (IPI). 
So the rise in interrupts is still due to the scheduling of too many tasks, consistent with the earlier analysis of the context-switch counts. 

1.2.4.2 Summary: how to observe context switching

How many context switches per second is normal? The value really depends on the CPU performance of the system itself:

  • Generally, if the system's context-switch count is relatively stable, anywhere from a few hundred to under ten thousand should be considered normal.
  • But when the count exceeds ten thousand, or rises by an order of magnitude, a performance problem has probably already appeared.

At that point, analyze further according to the type of context switch. For example:

  • A rise in voluntary context switches means processes are waiting for resources; other problems such as I/O may have occurred;
  • A rise in involuntary context switches means processes are being forcibly scheduled, i.e. they are all competing for the CPU, which shows the CPU really has become the bottleneck;
  • A rise in interrupts means the CPU is being occupied by interrupt handlers; you also need to check the /proc/interrupts file to analyze the specific interrupt type.

Commonly used routine (a command sketch follows this list):

  • First, use uptime to check the system load.

  • Then run mpstat -P ALL 3 to see the current overall state of each CPU; focus on the user-mode, kernel-mode, and iowait figures.

  • Use pidstat to make an initial judgment: is the CPU doing heavy computation, are too many processes competing for it, or is there too much I/O?

  • vmstat 1: analyze memory usage and the CPU context-switch and interrupt counts (cs: context switches per second; in: interrupts per second; r: processes running or waiting for the CPU; b: processes in uninterruptible sleep). This helps determine whether excessive I/O or fierce competition between processes is causing the problem.

    • pidstat -w (process-switch metrics) / -u (CPU usage metrics) / -wt (per-thread context-switch metrics), depending on the situation:
      • cswch: voluntary context switches per second, caused by insufficient system resources such as I/O
      • nvcswch: involuntary context switches per second, e.g. the time slice ran out or a higher-priority thread preempted
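
A minimal sketch of that routine as one command sequence (intervals and counts are just example values):

uptime                # 1. load averages
mpstat -P ALL 3 3     # 2. per-CPU %usr / %sys / %iowait
pidstat -u 3 3        # 3. which process is eating CPU
vmstat 1 5            # 4. system-wide cs, in, r, b
pidstat -w 3 3        # 5. per-process voluntary/involuntary context switches
pidstat -wt 3 3       #    per-thread view when the per-process numbers look too small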


Origin blog.csdn.net/jiaoyangwm/article/details/132240844