A detailed explanation of Linux load problems, plus a stress-test command set

This article explains CPU-related performance indicators, common CPU performance problems, and how to troubleshoot them.

Original address: https://zhuanlan.zhihu.com/p/180402964

System load average

Introduction

System load average : the average number of processes that are in the runnable or uninterruptible state.

Runnable process : a process that is either using the CPU or waiting for the CPU.

Uninterruptible process : a process waiting on some I/O access, usually an interaction with hardware, that cannot be interrupted (it cannot be interrupted in order to protect the consistency of system data and prevent read errors).
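
As a small sketch (assuming a standard /proc filesystem), the raw load averages and the R/D-state processes behind them can be inspected directly:

$ cat /proc/loadavg                            # 1-, 5- and 15-minute load, runnable/total tasks, last PID
$ ps -eo state,pid,cmd | awk '$1 ~ /^(R|D)/'   # processes currently in the R or D state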

View system load average

First, use the top command to check the running state of processes, as follows:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10760 user   20   0 3061604  84832   5956 S  82.4  0.6 126:47.61 Process
29424 user   20   0   54060   2668   1360 R  17.6  0.0   0:00.03 top

In the process state column (S), R means the process is runnable and D means it is in uninterruptible sleep (more details on top follow later).

top View system load average:

top - 13:09:42 up 888 days, 21:32,  8 users,  load average: 19.95, 14.71, 14.01
Tasks: 642 total,   2 running, 640 sleeping,   0 stopped,   0 zombie
%Cpu0  : 37.5 us, 27.6 sy,  0.0 ni, 30.9 id,  0.0 wa,  0.0 hi,  3.6 si,  0.3 st
%Cpu1  : 34.1 us, 31.5 sy,  0.0 ni, 34.1 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
...
KiB Mem : 14108016 total,  2919496 free,  6220236 used,  4968284 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6654506 avail Mem

The load average field here shows the system load average over the last 1 minute, 5 minutes, and 15 minutes.

uptime View the system load average:

[root /home/user]# uptime
 13:11:01 up 888 days, 21:33,  8 users,  load average: 17.20, 14.85, 14.10

View CPU core information

The system load average is closely related to the number of CPU cores. We can view the CPU information of the current machine with the following commands:

lscpu View CPU information:

[root@Tencent-SNG /home/user_00]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
...
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-7  // NUMA architecture info

cat /proc/cpuinfo to view the information of each CPU core:

processor       : 7   // core number 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 6
...
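
A quick way to get just the core count, which is what the load average should be compared against (a minimal sketch; nproc is part of coreutils):

$ nproc                                 # number of logical CPUs
$ grep -c '^processor' /proc/cpuinfo    # same count, taken from /proc/cpuinfo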

Reasons for high system load average

In general, a higher system load average means higher CPU usage, but the two are not strictly tied together. If there are more CPU-intensive tasks, the system load average and CPU usage generally both rise. If there are more I/O-intensive tasks, the system load average also rises, but CPU usage is not necessarily high; it may even be very low, because many processes are stuck in the uninterruptible state. Processes waiting for CPU scheduling also push the system load average up.

So if the system load average is very high but CPU usage is not, we should consider whether the system has hit an I/O bottleneck and whether I/O read/write speed needs to be optimized.

Therefore, judging whether the system has hit a CPU bottleneck requires looking at CPU usage together with the system load average (there are other indicators that need to be cross-checked as well, which are explained below). One way to watch both at once is sketched below.
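
As a small sketch (assuming sysstat's mpstat is installed), the load average and per-CPU usage can be watched side by side:

$ watch -d "uptime; mpstat -P ALL 1 1 | tail -n +3"   # refresh the load average plus a 1-second CPU sample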

Case Troubleshooting

stress is a tool for imposing load on a system and stress testing it. We can use it to put pressure on the CPU in order to locate and troubleshoot CPU problems.

yum install stress    # install the stress tool

The stress command uses

 # --cpu 8: 8 processes repeatedly running sqrt() calculations
 # --io 4: 4 processes repeatedly calling sync() (flushing to disk)
 # --vm 2: 2 processes repeatedly calling malloc() to allocate memory
 # --vm-bytes 128M: limits the amount of memory each malloc process allocates
 stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 10s

Below we reproduce three kinds of problems: high CPU, high I/O, and too many processes.

Troubleshooting CPU issues

Use stress -c 1 to simulate high CPU load, and then use the following command to observe the load change:

uptime : Use uptime to view the system load at this time:

# the -d option highlights the regions that change between refreshes
$ watch -d uptime
... load average: 1.00, 0.75, 0.39

mpstat : Use mpstat -P ALL 1 to view per-second changes for each CPU core. It is broadly similar to top; its advantage is that it prints data every second (or at a custom interval), which makes changes easier to observe, and it prints averages at the end:

13:14:53     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:14:58     all   12.89    0.00    0.18    0.00    0.00    0.03    0.00    0.00    0.00   86.91
13:14:58       0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
13:14:58       1    0.40    0.00    0.20    0.00    0.00    0.20    0.00    0.00    0.00   99.20

From the above output we can conclude that the system load is rising and one core is fully utilized, mostly running user-mode work, which at this point is business (application) work. The next step is to find which process is saturating that single core:

pidstat : Use pidstat -u 1 to print per-process CPU usage every second:

13:18:00      UID       PID    %usr %system  %guest    %CPU   CPU  Command
13:18:01        0         1    1.00    0.00    0.00    1.00     4  systemd
13:18:01        0   3150617  100.00    0.00    0.00  100.00     0  stress
...

top : Of course, the most convenient way is to use the top command to view the load:

top - 13:19:06 up 125 days, 20:01,  3 users,  load average: 0.99, 0.63, 0.42
Tasks: 223 total,   2 running, 221 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.5 us,  0.3 sy,  0.0 ni, 85.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16166056 total,  3118532 free,  9550108 used,  3497416 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6447640 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
3150617 root      20   0   10384    120      0 R 100.0  0.0   4:36.89 stress

At this point we can see that stress is occupying a full CPU core.

IO Troubleshooting

We use stress -i 1 to simulate an I/O bottleneck, i.e. calling sync() to flush to disk in an endless loop.

uptime : Use uptime to view the system load at this time:

$ watch -d uptime
...,  load average: 1.06, 0.58, 0.37

mpstat : Check the I/O consumption at this point. In fact we find that the CPU time here is mostly spent in sys, i.e. system (kernel) time.

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    0.33    0.00   12.64    0.13    0.00    0.00    0.00    0.00    0.00   86.90
Average:       0    0.00    0.00   99.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
Average:       1    0.00    0.00    0.33    0.00    0.00    0.00    0.00    0.00    0.00   99.67

Why iowait does not rise :

The reason iowait does not rise is that this case uses stress's sync() system call, whose job is to flush buffered memory to disk. On a freshly installed virtual machine the buffer cache may be quite small, so sync() cannot generate much I/O pressure and most of the time is spent in the system call itself, which is why you only see system CPU usage go up. The solution is to use stress-ng, the next generation of stress, which supports richer options, e.g. stress-ng -i 1 --hdd 1 --timeout 600 (--hdd means reading and writing temporary files).

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    0.25    0.00    0.44   26.22    0.00    0.00    0.00    0.00    0.00   73.09
Average:       0    0.00    0.00    1.02   98.98    0.00    0.00    0.00    0.00    0.00    0.00

pidstat : Same as above (omitted)

It can be seen that rising I/O drives the system load average up. We then use pidstat to find out which process is causing the I/O, as sketched below.
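
A minimal sketch (assuming a sysstat version with per-process disk statistics): pidstat -d prints per-process read/write rates, which makes the I/O-heavy process easy to spot:

$ pidstat -d 1    # per-process kB read/written per second, refreshed every second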

top : Here top is still the most comprehensive single view, and from it we can conclude that stress is the culprit behind the I/O increase.

If pidstat does not show an iowait (%wait) column : the sysstat shipped with CentOS by default may be too old; the column is only available in sysstat 11.5.5 and later, so upgrade if needed.
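
To confirm which sysstat version is installed (a small sketch; the -V flag prints the version for the sysstat tools):

$ pidstat -V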

Troubleshooting too many processes

The too-many-processes case is a bit special: if the system runs more processes than the CPUs can handle, some processes will be left waiting for a CPU. Use stress -c 24 to run 24 CPU-bound processes (this machine has 8 cores).

uptime : Use uptime to view the system load at this time:

$ watch -d uptime
...,  load average: 18.50, 7.13, 2.84

mpstat : Same as above (omitted)

pidstat : Same as above (omitted)

We can see that the system is severely overloaded at this point; the load average is as high as 18.50.

top : We can also use the top command to check the number of processes in the running state; if that number is large, too many processes are running or waiting to run. A quick way to count them is sketched below.
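
As a rough sketch, the number of runnable (R-state) processes can also be counted directly with ps:

$ ps -eo state | grep -c '^R'    # count of processes currently in the R state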

Summary

Through the above problems and solutions, it can be concluded that:

A high load average may be caused by a CPU-intensive process

A high average load does not necessarily mean high CPU usage, it may also mean that I/O is busier

When you find that the load is high, you can use tools such as mpstat and pidstat to assist in analyzing the source of the load

Summary tools: mpstat, pidstat, top and uptime

CPU context switch

CPU context : before running each task, the CPU needs to know where the task was loaded and where it should start running. That is, the system must set up the CPU registers and the program counter (PC) for it in advance. The CPU registers and the program counter together are called the CPU context.

CPU context switching : saving the CPU context (the CPU registers and program counter) of the previous task, loading the context of the new task into the registers and program counter, and finally jumping to the new location the program counter points to in order to run the new task.

CPU context switches are divided into process context switches, thread context switches, and interrupt context switches.

process context switch

Switching from user mode to kernel mode happens through system calls. During a system call the CPU context is switched (the user-mode registers are saved and kernel code runs), and another switch happens when returning to user mode; this is usually called a privilege mode switch rather than a full process context switch.

Generally, each context switch takes tens of nanoseconds to several microseconds of CPU time. If switches are frequent, CPU time is easily wasted saving and restoring registers, kernel stacks, virtual memory and other resources, which also drives the system load average up. The cost can be estimated as sketched below.
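
As a rough sketch (assuming perf is installed), perf bench gives an order-of-magnitude feel for this cost by bouncing a token between two tasks through a pipe:

$ perf bench sched pipe    # reports the total time and the average microseconds per operation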

Linux maintains a ready queue for each CPU, sorts R-state processes by priority and how long they have waited for the CPU, and picks the process that most needs the CPU to run. The situations in which a process context switch occurs are:

The process's time slice is exhausted.

The process cannot get enough system resources (e.g. memory) and is suspended until they become available.

The process actively sleeps.

A process with a higher priority is executed.

A hard interrupt occurs.

thread context switch

Threads and processes:

When a process has only one thread, it can be considered that the process is equal to the thread.

When a process has multiple threads, the threads share resources such as virtual memory and global variables; these shared resources do not need to be switched during a thread context switch.

Threads also have their own private data, such as stacks and registers, which also need to be saved during context switching.

So thread context switching includes 2 situations:

Switching between threads of different processes: this is equivalent to a process context switch.

Switching between threads of the same process: only non-shared data, such as thread-private data and registers, needs to be switched.

interrupt context switch

Interrupt processing will interrupt the normal scheduling and execution of the process, and instead call the interrupt handler to respond to device events. When interrupting other processes, it is necessary to save the current state of the process, so that after the interruption is over, the process can still resume from the original state.

For the same CPU, interrupt processing has a higher priority than process, so interrupt context switching does not happen at the same time as process context switching. Since interrupts will interrupt the scheduling and execution of normal processes, most interrupt handlers are short and concise, so as to end the execution as soon as possible.

View system context switches

vmstat : This tool shows the system's memory usage, context switch count, and interrupt count:

# output every 1 second
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0      0 157256 3241604 5144444    0    0    20     0 26503 33960 18  7 75  0  0
17  0      0 159984 3241708 5144452    0    0    12     0 29560 37696 15 10 75  0  0
 6  0      0 162044 3241816 5144456    0    0     8   120 30683 38861 17 10 73  0  0

cs : The number of context switches per second.

in : The number of interrupts per second.

r : length of the ready queue, processes that are running or waiting for the CPU.

b : The number of processes in an uninterruptible sleep state, such as interacting with hardware.

pidstat : Use the pidstat -w option to view the number of context switches for a specific process:

$ pidstat -w -p 3217281 1
10:19:13      UID       PID   cswch/s nvcswch/s  Command
10:19:14        0   3217281      0.00     18.00  stress
10:19:15        0   3217281      0.00     18.00  stress
10:19:16        0   3217281      0.00     28.71  stress

Here cswch/s and nvcswch/s are the numbers of voluntary and involuntary context switches per second, respectively.

Voluntary context switching : Refers to the context switching caused by the inability of the process to obtain the required resources. For example, voluntary context switching occurs when system resources such as I/O and memory are insufficient.

Involuntary context switching : It refers to the context switching that occurs because the process is forced to be scheduled by the system due to reasons such as the time slice has expired. For example, when a large number of processes are competing for the CPU, involuntary context switches are prone to occur

Case Troubleshooting

Here we use the sysbench tool to simulate the context switching problem.
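
sysbench may need to be installed first (a sketch for yum-based systems, possibly via the EPEL repository; use your distro's package manager otherwise):

$ yum install sysbench    # install the sysbench benchmarking tool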

First use vmstat 1 to view the current context switching information:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 514540 3364828 5323356    0    0    10    16    0    0  4  1 95  0  0
 1  0      0 514316 3364932 5323408    0    0     8     0 27900 34809 17 10 73  0  0
 1  0      0 507036 3365008 5323500    0    0     8     0 23750 30058 19  9 72  0  0

Then use sysbench --threads=64 --max-time=300 threads run to run a 64-thread workload. We then check the context switch information again with vmstat 1:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 318792 3385728 5474272    0    0    10    16    0    0  4  1 95  0  0
 1  0      0 307492 3385756 5474316    0    0     8     0 15710 20569 20  8 72  0  0
 1  0      0 330032 3385824 5474376    0    0     8    16 21573 26844 19  9 72  0  0
 2  0      0 321264 3385876 5474396    0    0    12     0 21218 26100 20  7 73  0  0
 6  0      0 320172 3385932 5474440    0    0    12     0 19363 23969 19  8 73  0  0
14  0      0 323488 3385980 5474828    0    0    64   788 111647 3745536 24 61 15  0  0
14  0      0 323576 3386028 5474856    0    0     8     0 118383 4317546 25 64 11  0  0
16  0      0 315560 3386100 5475056    0    0     8    16 115253 4553099 22 68  9  0  0

We can clearly observe:

The current cs and in have increased dramatically at this time.

The CPU usage of sy+us exceeds 90%.

The ready-queue length r reaches 16, exceeding the number of CPU cores (8).

Analyzing the context switch (cs) problem

We use pidstat to view current CPU information and specific process context switching information:

# -w shows task switching info, -u shows CPU usage, -t includes per-thread statistics
$ pidstat -w -u -t 1

10:35:01      UID       PID    %usr %system  %guest    %CPU   CPU  Command
10:35:02        0   3383478   67.33  100.00    0.00  100.00     1  sysbench

10:35:01      UID      TGID       TID   cswch/s nvcswch/s  Command
10:45:39        0   3509357         -      1.00      0.00  kworker/2:2
10:45:39        0         -   3509357      1.00      0.00  |__kworker/2:2
10:45:39        0         -   3509702  38478.00  45587.00  |__sysbench
10:45:39        0         -   3509703  39913.00  41565.00  |__sysbench

So we can see that a large number of sysbench threads are performing a large number of context switches.

Analyzing the interrupt (in) problem

We can use watch -d cat /proc/softirqs and watch -d cat /proc/interrupts to observe the system's soft interrupts and hard interrupts. Here we mainly observe /proc/interrupts.

$ watch -d cat /proc/interrupts
RES:  900997016  912023527  904378994  902594579  899800739  897500263  895024925  895452133   Rescheduling interrupts

Here it is obvious that the number of rescheduling interrupts (RES) has increased. This interrupt is used to wake up idle CPUs to schedule new tasks.

Summary

There are more voluntary context switches, indicating that processes are waiting for resources, and other problems such as I/O may have occurred.

There are more involuntary context switches, indicating that processes are being forcibly scheduled, that is, they are all competing for the CPU, indicating that the CPU has indeed become a bottleneck.

If the number of interrupts increases, it means that the CPU is occupied by the interrupt handler, and it is necessary to analyze the specific interrupt type by viewing the /proc/interrupts file.

CPU usage

In addition to system load and context switching information, the most intuitive indicator of CPU problems is CPU usage. Linux exposes the system's internal state to user space through the /proc virtual filesystem, and /proc/stat holds the CPU and task statistics (a small usage calculation based on these counters is sketched after the column list below).

$ cat /proc/stat | grep cpu
cpu  6392076667 1160 3371352191 52468445328 3266914 37086 36028236 20721765 0 0
cpu0 889532957 175 493755012 6424323330 2180394 37079 17095455 3852990 0 0
...

The meaning of each column here is as follows:

user (usually abbreviated as us), represents the user mode CPU time. Note that it does not include the nice time below, but includes the guest time.

nice (usually abbreviated as ni), stands for low-priority user-mode CPU time, that is, the CPU time of processes whose nice value has been adjusted to between 1 and 19. Note that the possible range of nice is -20 to 19; the larger the value, the lower the priority.

system (often abbreviated as sys), stands for kernel-mode CPU time.

idle (often abbreviated as id), stands for idle time. Note that it does not include time waiting for I/O (iowait).

iowait (often abbreviated to wa), stands for CPU time waiting for I/O.

irq (often abbreviated as hi), stands for CPU time processing hard interrupts.

softirq (often abbreviated as si), stands for CPU time handling softirqs.

steal (usually abbreviated as st) represents the CPU time occupied by other virtual machines when the system is running in a virtual machine.

Guest (usually abbreviated as guest) represents the time of running other operating systems through virtualization, that is, the CPU time of running a virtual machine.

guest_nice (often abbreviated as gnice), which represents the amount of time the virtual machine is running at low priority.
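
Tools such as top derive CPU usage from the change in these counters between two samples: usage = 1 - idle / total over the interval. A minimal shell sketch of the same calculation (assuming bash and the standard field order described above):

read -r cpu u1 n1 s1 i1 w1 q1 sq1 st1 rest < /proc/stat   # first sample of the aggregate "cpu" line
sleep 1
read -r cpu u2 n2 s2 i2 w2 q2 sq2 st2 rest < /proc/stat   # second sample, one second later
idle=$(( (i2 + w2) - (i1 + w1) ))                          # idle + iowait delta
total=$(( (u2+n2+s2+i2+w2+q2+sq2+st2) - (u1+n1+s1+i1+w1+q1+sq1+st1) ))
echo "CPU usage over the last second: $(( 100 * (total - idle) / total ))%"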

We can query these figures conveniently with tools such as top, ps, and pidstat, and easily spot the processes with high CPU usage. These tools give us an initial picture, but finding the specific cause of a problem requires other methods.

Here we can conveniently view hotspot functions with perf top, or save a profile with perf record for later inspection with perf report, for example:
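
A sketch of typical invocations (the PID is a placeholder; -g collects call graphs):

$ perf top -g -p <PID>                   # live view of the hottest functions in one process
$ perf record -g -p <PID> -- sleep 30    # sample the process for 30 seconds into perf.data
$ perf report                            # browse the recorded samples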

Troubleshooting CPU usage

Here is a summary of CPU usage issues and troubleshooting ideas:

If the user CPU and Nice CPU are high, it means that the user mode process occupies more CPU, so you should focus on troubleshooting the performance problems of the process.

The system CPU is high, indicating that the kernel mode occupies more CPU, so you should focus on troubleshooting the performance problems of kernel threads or system calls.

I/O waiting CPU is high, indicating that the waiting time for I/O is relatively long, so you should focus on checking whether there is an I/O problem in the system storage.

The high number of soft interrupts and hard interrupts indicates that the processing program of soft interrupts or hard interrupts takes up more CPU, so you should focus on checking the interrupt service program in the kernel.

CPU Troubleshooting Routine

CPU usage

CPU usage mainly includes the following aspects:

User CPU usage, including user-mode CPU usage (user) and low-priority user-mode CPU usage (nice), indicating the percentage of time the CPU is running in user mode. High user CPU usage usually indicates a busy application.

System CPU usage, indicating the percentage of time the CPU runs in kernel mode (excluding interrupts). High system CPU usage indicates that the kernel is busy.

CPU usage waiting for I/O, also commonly referred to as iowait, indicates the percentage of time waiting for I/O. If iowait is high, it usually indicates that the I/O interaction time between the system and the hardware device is relatively long.

The CPU usage of soft interrupts and hard interrupts indicates the percentage of time the kernel spends in soft-interrupt handlers and hard-interrupt handlers, respectively. High usage here usually indicates a large number of interrupts on the system.

In addition, there are steal CPU usage (steal) and guest CPU usage (guest), which are relevant in virtualized environments; they represent the percentage of CPU time taken by other virtual machines and the percentage of CPU time spent running guest virtual machines, respectively.

average load

It reflects the overall load of the system, and you can view the average load of the past 1 minute, the past 5 minutes and the past 15 minutes.

context switch

Context switching focuses on 2 metrics:

A voluntary context switch due to failure to acquire a resource.

Involuntary context switches caused by system-forced scheduling.

CPU cache hit ratio

The CPU runs much faster than memory can respond, so the CPU inevitably has to wait when it accesses memory. To bridge this speed gap, CPUs have multi-level caches; the higher the cache hit rate, the better the performance. The cachestat tool shown below (from Brendan Gregg's perf-tools project) reports the hit ratio of the kernel page cache; the hardware CPU cache hit rate can be inspected with perf, as sketched after the output.

# ./cachestat -t
Counting cache functions... Output every 1 seconds.
TIME         HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
08:28:57      415        0        0   100.0%            1        191
08:28:58      411        0        0   100.0%            1        191
08:28:59      362       97        0    78.9%            0          8
08:29:00      411        0        0   100.0%            0          9
08:29:01      775    20489        0     3.6%            0         89
08:29:02      411        0        0   100.0%            0         89
08:29:03     6069        0        0   100.0%            0         89
08:29:04    15249        0        0   100.0%            0         89
08:29:05      411        0        0   100.0%            0         89
08:29:06      411        0        0   100.0%            0         89
08:29:07      411        0        3   100.0%            0         89
[...]
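
For the hardware CPU cache specifically, perf can report cache references and misses (a hedged sketch; event availability depends on the CPU, and the PID is a placeholder):

$ perf stat -e cache-references,cache-misses -p <PID> sleep 10   # miss ratio = cache-misses / cache-references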

Summary

Finding tools by performance indicator (CPU-related)

Load average: uptime, top. uptime simply displays the load averages for the most recent period; top shows more indicators alongside them.

CPU usage: vmstat, mpstat, top, sar, /proc/stat. top, vmstat, and mpstat can only view the current state dynamically, while sar can also show history; /proc/stat is the data source for the other performance tools.

Per-process CPU usage: top, pidstat, ps, htop, atop. top and ps can display processes sorted by CPU usage, while pidstat cannot sort; htop and atop show different kinds of data in different colors, which is more intuitive.

System context switches: vmstat. Shows context switches, running processes, and the number of uninterruptible processes.

Per-process context switches: pidstat. Shows many items, including per-process context switch information.

Soft interrupts: top, /proc/softirqs, mpstat. top can view soft-interrupt CPU usage; /proc/softirqs and mpstat show the cumulative counts on each CPU.

Hard interrupts: vmstat, /proc/interrupts. vmstat shows the total interrupt count; /proc/interrupts shows cumulative counts of each interrupt type on each CPU core.

Network: dstat, sar, tcpdump. dstat and sar show the overall send/receive situation in more detail; tcpdump can capture packets dynamically.

I/O: dstat, sar. Both give a detailed overall picture of I/O.

CPU information: /proc/cpuinfo, lscpu. Both show CPU information.

System analysis: perf, execsnoop. perf analyses kernel function calls and hotspot functions; execsnoop monitors short-lived processes.

Finding performance indicators by tool (CPU-related)

uptime: load averages for the last 1, 5, and 15 minutes.

top: load average, run queue, CPU usage breakdown, process states, and per-process CPU usage.

htop: an enhanced top that uses different colors for different kinds of processes; more intuitive display.

atop: full monitoring of CPU, memory, disk, and network resources.

vmstat: overall CPU usage, context switch count, interrupt count, plus the number of processes in the running (r) and uninterruptible (b) states.

pidstat: per-CPU usage of each process and thread (-t), and context switch counts.

/proc/softirqs: soft interrupt types and counts on each CPU.

/proc/interrupts: hard interrupt types and counts on each CPU.

ps: state and CPU usage of each process.

pstree: parent-child relationships between processes.

dstat: overall CPU usage (plus related I/O and network information).

sar: overall CPU usage, including usage history.

strace: traces a process's system calls.

perf: CPU performance event analysis, e.g. function call chains, CPU cache hit rate, CPU scheduling.

execsnoop: analysis of short-lived processes.

Directions for CPU troubleshooting

With all the performance tools above, we cannot realistically run every one of them when a problem actually occurs; that would be far too inefficient. Instead, first run a few common tools (top, vmstat, pidstat) to get a rough picture of how the system is running, and then pin down the specific cause. Typical paths:

top high system CPU => vmstat context switch count => pidstat involuntary context switches => process analysis tools (perf, strace, ps, execsnoop, pstack)

top high user CPU => pidstat user CPU => usually a CPU-bound computation task

top zombie processes => process analysis tools (perf, strace, ps, execsnoop, pstack)

top high load average => vmstat number of running processes => pidstat user CPU => process analysis tools (perf, strace, ps, execsnoop, pstack)

top high iowait CPU => vmstat number of uninterruptible processes => I/O analysis tools (dstat, sar -d)

top high hard interrupts => vmstat interrupt count => check the specific interrupt type (/proc/interrupts)

top high soft interrupts => check the specific soft interrupt type (/proc/softirqs) => network analysis tools (sar -n, tcpdump) or SCHED (pidstat involuntary context switches)

Directions for CPU optimization

Performance optimization is usually multi-faceted; CPU, memory, network and so on are all interrelated. Here we only give ideas for CPU optimization, for reference.

Program optimization

Basic optimization: optimize program logic, for example reduce loop iterations, reduce memory allocations, reduce recursion, and so on.

Compiler optimization: enable compiler optimization options, e.g. gcc -O2, to optimize the program code.

Algorithm optimization: reduce algorithmic complexity, e.g. use an O(n log n) sorting algorithm or an O(log n) search algorithm.

Asynchronous processing: e.g. replace polling with notifications.

Multithreading instead of multiprocessing: in some scenarios threads can replace processes, because their context switch cost is lower.

Caching: use caches (including multi-level caches) to speed up data access.

System optimization

CPU binding: binding a process to one or more CPUs improves the CPU cache hit rate and reduces the context switching caused by cross-CPU scheduling.

CPU exclusivity: similar to CPU binding, but goes further by grouping CPUs and assigning processes to them through the CPU affinity mechanism.

Priority adjustment: use nice to adjust process priorities; appropriately lowering the priority of non-core applications and raising that of core applications ensures the core applications are handled first.

Resource limits for processes: use Linux cgroups to set an upper bound on a process's CPU usage, preventing one application's own problems from exhausting system resources.

NUMA optimization: a NUMA-capable processor is divided into multiple nodes, each with its own local memory, so a CPU can access its local memory directly.

Interrupt load balancing: interrupt handlers, whether for soft or hard interrupts, can consume a lot of CPU. Enabling the irqbalance service or configuring smp_affinity automatically load-balances interrupt handling across multiple CPUs. A few of these knobs are sketched below.
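
A hedged sketch of the corresponding commands (the PID, application name, and cgroup path are placeholders; cgroup details differ between v1 and v2):

$ taskset -c 0,1 ./my_app                                  # CPU binding: run my_app only on CPUs 0 and 1
$ renice -n 10 -p <PID>                                    # priority adjustment: lower a non-core process's priority
$ echo "50000 100000" > /sys/fs/cgroup/<group>/cpu.max     # cgroup v2: cap the group at 50% of one CPU
$ systemctl start irqbalance                               # start the interrupt load-balancing service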

References

Geek Time (极客时间): Linux 性能优化实战 (Linux Performance Optimization in Practice)
