Linux performance optimization: practical study notes

The following notes are based on the Geek Time course on Linux performance optimization; see the original course for full details.

CPU performance

Common commands for cpu performance debugging

  1. mpstat: view CPU usage
# -P ALL monitors all CPUs; the trailing 5 means output one set of data every 5 seconds
$ mpstat -P ALL 5
Linux 4.15.0 (ubuntu) 09/22/18 _x86_64_ (2 CPU)
13:30:06 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
13:30:11 all 50.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 49.95
13:30:11 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
13:30:11 1 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
  2. pidstat: view the CPU usage of specific processes
# Output one set of data after a 5-second interval
$ pidstat -u 5 1
13:37:07      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13:37:12        0      2962  100.00    0.00    0.00    0.00  100.00     1  stress
  3. pidstat -w: view process context switches. By default pidstat reports per-process metrics; adding the -t parameter also outputs per-thread metrics.
# Output one set of data every 5 seconds
$ pidstat -wt 5
Linux 4.15.0 (ubuntu) 09/23/18 _x86_64_ (2 CPU)
08:18:26 UID PID cswch/s nvcswch/s Command
08:18:31 0 1 0.20 0.00 systemd
08:18:31 0 8 5.40 0.00 rcu_sched

Two columns in this output deserve special attention: cswch, the number of voluntary context switches per second, and nvcswch, the number of involuntary context switches per second. Keep these two concepts in mind, because they point to different performance problems:

  • The so-called voluntary context switch refers to the context switch caused by the process being unable to obtain the required resources. For example, when system resources such as I/O and memory are insufficient, voluntary context switching occurs.
  • Involuntary context switching refers to a context switch that occurs when a process is forcibly rescheduled by the system, for example because its time slice has expired. Involuntary context switches tend to occur when a large number of processes are competing for the CPU.
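Besides pidstat, on Linux the kernel exposes each process's cumulative counters for both kinds of switch under /proc. A quick way to peek at them (here /proc/self resolves to the grep process itself):

```shell
# lifetime counters of voluntary / nonvoluntary context switches;
# /proc/self resolves to the reading process itself (grep, in this case)
grep ctxt_switches /proc/self/status
```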
  4. vmstat: view context switches, run-queue length, and other system-wide statistics
# Output one set of data every 5 seconds
$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- 
r b swpd free buff cache si so bi bo in cs us sy id wa st 
0 0 0 7005360 91564 818900 0 0 0 0 25 33 0 0 100 0 0

Try to interpret the meaning of each column yourself; here I will highlight the four columns that need special attention:

  • cs (context switch): the number of context switches per second.
  • in (interrupt): the number of interrupts per second.
  • r (Running or Runnable): the length of the ready queue, that is, the number of processes that are running or waiting for the CPU.
  • b (Blocked): the number of processes in uninterruptible sleep.

  5. pstree: view the relationships between processes
# -a shows command-line arguments
# -p shows PIDs
# -s shows the parent processes of the specified process
$ pstree -aps 3084
systemd,1 
└─dockerd,15006 -H fd:// 
   └─docker-containe,15024 --config /var/run/docker/containerd/containerd.toml 
     └─docker-containe,3991 -namespace moby -workdir... 
       └─app,4009 
         └─(app,3084)

CPU related concepts

1. Average load

The load average is the average number of processes in the runnable or uninterruptible state per unit time, that is, the average number of active processes. As a rule of thumb, the load average is considered reasonable when it does not exceed 70% of the number of CPUs.

Load average versus CPU usage: the load average counts processes in the runnable and uninterruptible states, so it includes not only processes that are using the CPU but also processes waiting for the CPU and waiting for I/O. CPU usage, by contrast, measures how busy the CPU is per unit time, and the two do not necessarily move together. For example:

  • For CPU-intensive processes, heavy CPU use drives up the load average, and in this case the two metrics move together;
  • I/O-intensive processes, waiting for I/O will also lead to an increase in average load, but the CPU usage is not necessarily high;
  • A large number of process scheduling waiting for the CPU will also lead to an increase in the average load, and the CPU usage will also be relatively high at this time.
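The 70% rule of thumb can be checked directly: on Linux the three load averages live in /proc/loadavg (the same numbers uptime prints). A minimal sketch:

```shell
load=$(cut -d ' ' -f1 /proc/loadavg)   # 1-minute load average
cpus=$(nproc)                          # number of online CPUs
# flag the load once it crosses 70% of the CPU count
awk -v l="$load" -v c="$cpus" 'BEGIN { if (l > 0.7 * c) print "load high"; else print "load OK" }'
```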

2. How many context switches indicate a problem?

This value really depends on the CPU performance of the system itself. In my opinion, if the number of context switches is relatively stable, anywhere from a few hundred to around ten thousand per second should be considered normal. But when context switches exceed ten thousand per second, or the count jumps by an order of magnitude, a performance problem has likely occurred.
At this time, you also need to do a specific analysis based on the type of context switch. for example:

  • More voluntary context switches indicate that processes are waiting for resources, so other problems such as I/O bottlenecks may have occurred;
  • More involuntary context switches indicate that processes are being forcibly rescheduled, that is, they are competing for the CPU, which suggests the CPU really has become the bottleneck;
  • An increased interrupt count indicates that the CPU is being occupied by interrupt handlers; view the /proc/interrupts file to analyze the specific interrupt types.
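On Linux, /proc/interrupts carries one row per hardware interrupt source with a count column per CPU; watching which row grows fastest identifies the responsible device:

```shell
# first column: IRQ number; one count column per CPU;
# trailing fields name the interrupt controller and the device
head -n 5 /proc/interrupts
```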

3. Process states

  • R is the abbreviation of Running or Runnable, indicating that the process is in the ready queue of the CPU, running or waiting to run.
  • D is the abbreviation of Disk Sleep, that is, Uninterruptible Sleep, which generally means that the process is interacting with the hardware, and the interaction process is not allowed to be interrupted by other processes or interrupts.
  • Z is the abbreviation of Zombie. If you have played the game "Plants vs. Zombies", you should know what it means. It indicates a zombie process, that is, the process has actually ended, but the parent process has not reclaimed its resources (such as the process descriptor, PID, etc.).
  • S is the abbreviation of Interruptible Sleep, that is, interruptible sleep, which means that the process is suspended by the system because it is waiting for an event. When the event that the process is waiting for occurs, it will be awakened and enter the R state.
  • I is the abbreviation of Idle and is used for kernel threads in uninterruptible sleep. As mentioned earlier, uninterruptible sleep caused by hardware interaction is represented by D, but some kernel threads in that state may not actually carry any load, and Idle distinguishes this case. Note that a process in the D state raises the load average, while a process in the I state does not.
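To see how many processes sit in each of these states at a glance, one can tally the first letter of the ps STAT column (a sketch; the exact set of state letters varies slightly across kernel and procps versions):

```shell
# ps -eo stat= prints one state string per process with no header;
# cut keeps only the leading state letter (R, S, D, Z, I, T, ...)
ps -eo stat= | cut -c1 | sort | uniq -c | sort -rn
```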

4. CPU interrupt

The interrupt handler in Linux is divided into the upper part and the lower part:

  • The top half corresponds to the hardware interrupt and is used to handle the interrupt quickly.
  • The bottom half corresponds to soft interrupts (softirqs) and asynchronously processes the work the top half did not finish.
    Soft interrupts in Linux include network receive/send, timers, scheduling, the RCU lock, and other types. You can observe how they are running by viewing /proc/softirqs.
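For example, on Linux the per-CPU softirq counters can be inspected directly; NET_RX/NET_TX cover network receive/send, TIMER covers timing, SCHED covers scheduling, and RCU covers the RCU lock:

```shell
# one row per softirq type, one cumulative count column per CPU
grep -E 'NET_RX|NET_TX|TIMER|SCHED|RCU' /proc/softirqs
```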

5. Common ideas for CPU optimization

Application optimization
First of all, from the application's perspective, the best way to reduce CPU usage is to eliminate all unnecessary work and keep only the core logic: for example, reduce loop nesting, reduce recursion, and reduce dynamic memory allocation. Beyond that, application performance optimization includes many methods; I have listed the most common ones here.

  • Compiler optimization: Many compilers will provide optimization options, turn them on appropriately, and you can get help from the compiler during the compilation phase to improve performance. For example, gcc provides the optimization option -O2, which automatically optimizes the application code after it is turned on.
  • Algorithm optimization: using a lower-complexity algorithm can significantly speed up processing. For example, on relatively large data sets, an O(n log n) sorting algorithm (such as quicksort or merge sort) can replace an O(n^2) one (such as bubble sort or insertion sort).
  • Asynchronous processing: asynchronous processing prevents the program from blocking while waiting for a resource, improving its concurrency. For example, replacing polling with event notification avoids the CPU cost of polling.
  • Multi-threading instead of multi-process: as mentioned earlier, a thread context switch does not switch the process address space, so it costs less than a process context switch.
  • Make good use of cache: Frequently accessed data or steps in the calculation process can be cached in the memory, so that it can be directly obtained from the memory next time to speed up the processing speed of the program.
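As a quick illustration of the compiler-optimization point, here is a sketch that assumes gcc is installed; the file names under /tmp are arbitrary. Both binaries print the same result, but the -O2 one typically uses several times less CPU:

```shell
# a tiny compute loop to compile at two optimization levels
cat > /tmp/busy.c <<'EOF'
#include <stdio.h>
int main(void) {
    long s = 0;
    for (long i = 0; i < 10000000L; i++) s += i % 7;
    printf("%ld\n", s);
    return 0;
}
EOF
gcc -O0 -o /tmp/busy_O0 /tmp/busy.c   # no optimization
gcc -O2 -o /tmp/busy_O2 /tmp/busy.c   # optimized
/tmp/busy_O0
/tmp/busy_O2   # same output, less CPU time (compare with: time /tmp/busy_O0)
```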

System optimization
From a system perspective, to optimize the operation of the CPU, on the one hand, it is necessary to make full use of the locality of the CPU cache to accelerate cache access; on the other hand, it is to control the CPU usage of the process and reduce the interaction between processes. Specifically, there are many CPU optimization methods at the system level. Here I also list some of the most common methods to facilitate your memory and use.

  • CPU binding: Binding a process to one or more CPUs can increase the hit rate of the CPU cache and reduce the context switching problem caused by cross-CPU scheduling.
  • CPU exclusive: Similar to CPU binding, the CPUs are further grouped, and processes are allocated to them through the CPU affinity mechanism. In this way, these CPUs are exclusively occupied by the specified process, in other words, other processes are not allowed to use these CPUs.
  • Priority adjustment: Use nice to adjust the priority of the process, a positive value lowers the priority, and a negative value increases the priority. The numerical meaning of the priority has been mentioned earlier, if you forget it, please review it in time. Here, appropriately reducing the priority of non-core applications and increasing the priority of core applications can ensure that core applications are processed first.
  • Set resource limits for the process: Use Linux cgroups to set the upper limit of the CPU usage of the process, which can prevent the exhaustion of system resources due to an application's own problems.
  • NUMA (Non-Uniform Memory Access) optimization: A processor that supports NUMA will be divided into multiple nodes, and each node has its own local memory space. NUMA optimization is actually to allow the CPU to only access local memory as much as possible.
  • Interrupt load balancing: Whether it is a soft or hard interrupt, their interrupt handlers may consume a lot of CPU. Turn on the irqbalance service or configure smp_affinity to automatically load balance the interrupt processing process to multiple CPUs.
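A minimal sketch of the binding and priority ideas above, using only /proc and the coreutils nice command (taskset from util-linux reads and changes the same affinity mask shown here):

```shell
# the kernel records each process's allowed CPUs under /proc
grep Cpus_allowed_list /proc/self/status
# lower a command's priority: positive nice values mean lower priority;
# coreutils nice run with no command prints the current niceness
nice -n 10 nice
```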

Memory performance

View memory related commands

  1. cachestat: view the system-wide cache hit rate
$ cachestat 1 3 
TOTAL MISSES HITS DIRTIES BUFFERS_MB CACHED_MB 
2 0 2 1 17 279 
2 0 2 1 17 279 
2 0 2 1 17 279
  • TOTAL: the total number of I/O operations;
  • MISSES: the number of cache misses;
  • HITS: the number of cache hits;
  • DIRTIES: the number of dirty pages added to the cache;
  • BUFFERS_MB: the size of Buffers, in MB;
  • CACHED_MB: the size of Cache, in MB.
  2. cachetop: view per-process cache hit rates
$ cachetop
11:58:50 Buffers MB: 258 / Cached MB: 347 / Sort: HITS / Order: ascending
PID      UID      CMD              HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
   13029 root     python                  1        0        0     100.0%       0.0%

Memory related concepts

1. How does Linux memory work

Although the address space of each process includes the kernel space, these kernel spaces are actually associated with the same physical memory.

Of course, the system will not allow a process to use up all the memory. When the memory is found to be tight, the system will reclaim the memory through a series of mechanisms, such as the following three methods:

  • Reclaim the cache, such as using the LRU (Least Recently Used) algorithm to reclaim the least recently used memory pages;
  • Reclaim the memory that is not frequently accessed, and write the memory that is not frequently used directly to the disk through the swap partition;
  • Kill processes. When memory is critically tight, the system uses the OOM (Out of Memory) killer to terminate the process occupying the most memory, as ranked by oom_score: the more memory a process consumes, the larger its oom_score; the more CPU it consumes, the smaller its oom_score.
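On Linux the kernel's score for every process is visible under /proc, which is what the OOM killer consults:

```shell
# oom_score is computed by the kernel from memory usage;
# oom_score_adj (-1000..1000) lets you bias the OOM killer's choice
cat /proc/self/oom_score
cat /proc/self/oom_score_adj
```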

When a process requests memory through malloc(), the memory is not allocated immediately; it is allocated on first access, when the kernel handles the resulting page fault.
A detailed look at malloc and virtual memory: https://blog.holbertonschool.com/hack-the-virtual-memory-malloc-the-heap-the-program-break/
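This lazy allocation is why a process's virtual size (VmSize) can far exceed its resident memory (VmRSS); both are visible in /proc (here for the grep process itself):

```shell
# VmSize: virtual address space reserved; VmRSS: pages actually resident,
# i.e. touched at least once and faulted in by the kernel
grep -E 'VmSize|VmRSS' /proc/self/status
```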

2. Buffer and Cache

cache = Cached + SReclaimable

  • Buffers are temporary storage of original disk blocks, which are used to cache disk data, and are usually not particularly large (about 20MB). In this way, the kernel can concentrate scattered writes and optimize disk writes uniformly. For example, multiple small writes can be merged into a single large write, and so on.
  • Cached is a page cache that reads files from disk, that is, it is used to cache data read from files. In this way, the next time you access these file data, you can quickly obtain it directly from the memory, without having to access the slow disk again.
  • SReclaimable is part of Slab. Slab consists of two parts, the reclaimable part is recorded with SReclaimable; the non-reclaimable part is recorded with SUnreclaim.

In short, Buffer caches disk data and Cache caches file data, and both are used for read requests as well as write requests.
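The raw counters behind these numbers come from /proc/meminfo on Linux, so the relation cache = Cached + SReclaimable can be checked directly:

```shell
# Buffers: raw disk-block cache; Cached: page cache for file data;
# SReclaimable / SUnreclaim: the two halves of the kernel slab
grep -E '^(Buffers|Cached|SReclaimable|SUnreclaim):' /proc/meminfo
```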

3. Memory analysis

When you run free and find that the system's available memory is insufficient, first check whether the memory is occupied by cache/buffers. After ruling out cache/buffers, use pidstat or top to locate the process occupying the most memory, and then analyze that process's address space with a process-memory tool such as pmap.
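The steps above can be sketched as a short command sequence; step 3 reads /proc/self/maps directly, which is the same data pmap formats:

```shell
# 1. is "missing" memory actually sitting in buffers/cache?
grep -E '^(MemTotal|MemAvailable|Buffers|Cached):' /proc/meminfo
# 2. rank processes by resident set size (RSS, in KB)
ps -eo rss,pid,comm | sort -rn | head -n 5
# 3. inspect one process's address-space mappings (what pmap formats)
head -n 5 /proc/self/maps
```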

4. Common optimization ideas

  • It is best to prohibit Swap. If you must enable Swap, reduce the value of swappiness to reduce the tendency to use Swap during memory recovery.
  • Reduce dynamic memory allocation. For example, use arrays instead of dynamic allocation where possible, and consider memory pools, huge pages (HugePage), and so on.
  • Try to use cache and buffer to access data. For example, you can use the stack to explicitly declare the memory space to store the data that needs to be cached; or use external caching components such as Redis to optimize data access.
  • Use cgroups and other methods to limit the memory usage of the process. In this way, you can ensure that the system memory will not be exhausted by abnormal processes.
  • Adjust the oom_score of the core application through /proc/pid/oom_adj. In this way, it can be guaranteed that even if the memory is tight, the core application will not be killed by OOM.
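For the swappiness knob in the first point, the current value can be read from /proc (writing it requires root, so only a read is shown here):

```shell
# swappiness: lower values make the kernel prefer reclaiming the
# page cache over swapping out anonymous memory during reclaim
cat /proc/sys/vm/swappiness
```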

5. The difference between file system and disk

  • When reading and writing ordinary files, I/O requests will first pass through the file system, and then the file system will be responsible for interacting with the disk.
  • When reading and writing block device files, the file system is skipped and directly interacts with the disk, which is the so-called "raw I/O".
    The caches used by these two paths naturally differ: the cache managed by the file system is part of the page cache (the Cached portion), while raw disk I/O uses the Buffer.

Origin blog.csdn.net/zimu312500/article/details/113871150