This article is a panoramic guide to Linux performance optimization; it is well worth bookmarking.

Linux performance optimization

Performance optimization

Performance

High concurrency and fast response correspond to the two core indicators of performance optimization: throughput and latency.


  • Application load perspective: directly determines the end-user experience of the product

  • System resource perspective: resource usage, saturation, etc.

The essence of a performance problem is that system resources have hit a bottleneck, so requests are not processed fast enough to support more load. Performance analysis is about finding the bottlenecks of the application or system and trying to avoid or alleviate them.

  • Select metrics to evaluate application and system performance

  • Set performance goals for applications and systems

  • Do a performance benchmark

  • Performance analysis to locate bottlenecks

  • Performance monitoring and alerting

Different performance analysis tools should be selected for different performance problems. The following are commonly used Linux Performance Tools and the corresponding types of performance problems analyzed.

[figure: Linux performance tools and the types of problems they analyze]

How should we understand "load average"?

Average load: the average number of processes in the runnable and uninterruptible states per unit of time, that is, the average number of active processes. It is not directly related to CPU usage as we traditionally understand it.

An uninterruptible process is one that is in the middle of a critical kernel operation (for example, waiting for an I/O response from a device). The uninterruptible state is actually a protection mechanism the system provides for processes and hardware devices.

What is a reasonable load average?

In a real production environment, monitor the system's load average and judge the trend against historical data. When the load shows a clear upward trend, analyze and investigate it promptly. You can also set a threshold (for example, alert when the load average exceeds 70% of the number of CPUs).

In real work, we often confuse the concepts of load average and CPU usage. In fact, the two are not completely equivalent:

  • For CPU-intensive processes, heavy CPU usage drives the load average up; in this case the two are consistent

  • For I/O-intensive processes, waiting for I/O will also cause the average load to increase. At this time, the CPU usage is not necessarily high.

  • A large number of processes waiting for CPU scheduling will cause the average load to increase, and the CPU usage will also be relatively high.

High load averages can be caused by CPU-intensive processes or busy I/O. During specific analysis, you can use the mpstat/pidstat tool to assist in analyzing the load source.
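As a quick illustration, the following commands (standard sysstat tools; output columns vary slightly by version) help confirm where a high load comes from:

uptime                 # 1-, 5-, and 15-minute load averages
mpstat -P ALL 5 1      # per-CPU usage: high %usr suggests CPU-bound load, high %iowait suggests I/O
pidstat -u 5 1         # per-process CPU usage, to see which processes drive the load
pidstat -d 5 1         # per-process disk I/O, useful when %iowait is high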

CPU

CPU context switching (Part 1)

A CPU context switch saves the CPU context (CPU registers and the program counter) of the previous task, loads the context of the new task into the registers and program counter, and finally jumps to the location the program counter points to in order to run the new task. The saved context is stored in the kernel and loaded again when the task is rescheduled, so the original task's state is not affected.

According to the task type, CPU context switching is divided into:

  • Process context switching

  • Thread context switching

  • Interrupt context switching

Process context switching

Linux divides a process's address space into kernel space and user space according to privilege level. Switching from user mode to kernel mode has to go through a system call.

A system call process actually performs two CPU context switches:

  • First, the user-mode instruction location in the CPU registers is saved, the registers are updated to the kernel-mode instruction location, and execution jumps into kernel mode to run the kernel task;

  • After the system call is completed, the CPU registers restore the originally saved user state data, and then switch to user space to continue running.

During the system call process, process user-mode resources such as virtual memory will not be involved, and processes will not be switched. It is different from process context switching in the traditional sense. Therefore system calls are often called privileged mode switches.

Processes are managed and scheduled by the kernel, so process context switching can only happen in kernel mode. Compared with a system call, a process switch additionally saves the process's virtual memory and user stack before saving the current process's kernel state and CPU registers, and refreshes the new process's virtual memory and user stack after loading its kernel state.

A process needs a context switch only when it is scheduled onto a CPU, in scenarios such as: its time slice expires and the CPU is handed to the next task; insufficient system resources force the process to be suspended; the process suspends itself, for example via sleep; a higher-priority process preempts it; or a hardware interrupt occurs and the process on the CPU is suspended so that the kernel's interrupt service routine can run.

Thread context switching

There are two types of thread context switching:

  • The two threads belong to the same process: the virtual memory resources stay the same during the switch, and only the thread's private data, registers, etc. need to be switched;

  • The two threads belong to different processes: this is the same as a process context switch.

Thread switching in the same process consumes less resources, which is also an advantage of multi-threading.

Interrupt context switching

Interrupt context switching does not involve the user state of the process, so the interrupt context only includes the state necessary for the execution of the kernel state interrupt service program (CPU registers, kernel stack, hardware interrupt parameters, etc.).

Interrupt processing priority is higher than that of the process, so interrupt context switching and process context switching will not occur at the same time.

CPU context switching (Part 2)

You can view the overall context switching situation of the system through vmstat

vmstat 5         # output one set of data every 5 seconds
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 103388 145412 511056    0    0    18    60    1    1  2  1 96  0  0
 0  0      0 103388 145412 511076    0    0     0     2  450 1176  1  1 99  0  0
 0  0      0 103388 145412 511076    0    0     0     8  429 1135  1  1 98  0  0
 0  0      0 103388 145412 511076    0    0     0     0  431 1132  1  1 98  0  0
 0  0      0 103388 145412 511076    0    0     0    10  467 1195  1  1 98  0  0
 1  0      0 103388 145412 511076    0    0     0     2  426 1139  1  0 99  0  0
 4  0      0  95184 145412 511108    0    0     0    74  500 1228  4  1 94  0  0
 0  0      0 103512 145416 511076    0    0     0   455  723 1573 12  3 83  2  0
  • cs (context switch) number of context switches per second

  • in (interrupt) Number of interrupts per second

  • r (running or runnable) the length of the ready queue, the number of processes running and waiting for the CPU

  • b (Blocked) Number of processes in uninterruptible sleep state

To see per-process details, use pidstat to view each process's context switches:

pidstat -w 5
14:51:16   UID       PID   cswch/s nvcswch/s  Command
14:51:21     0         1      0.80      0.00  systemd
14:51:21     0         6      1.40      0.00  ksoftirqd/0
14:51:21     0         9     32.67      0.00  rcu_sched
14:51:21     0        11      0.40      0.00  watchdog/0
14:51:21     0        32      0.20      0.00  khugepaged
14:51:21     0       271      0.20      0.00  jbd2/vda1-8
14:51:21     0      1332      0.20      0.00  argusagent
14:51:21     0      5265     10.02      0.00  AliSecGuard
14:51:21     0      7439      7.82      0.00  kworker/0:2
14:51:21     0      7906      0.20      0.00  pidstat
14:51:21     0      8346      0.20      0.00  sshd
14:51:21     0     20654      9.82      0.00  AliYunDun
14:51:21     0     25766      0.20      0.00  kworker/u2:1
14:51:21     0     28603      1.00      0.00  python3
  • cswch Number of voluntary context switches per second (context switches caused by the process being unable to obtain the required resources)

  • nvcswch Number of involuntary context switches per second (system forced scheduling such as time slice rotation)

vmstat 1 1    # observe context switching in a new terminal
# cs rises sharply; also watch the other columns:
# r column: far exceeds the number of CPUs, indicating heavy competition for the CPU
# us and sy columns: sy accounts for about 80%, indicating the CPU is mainly occupied by the kernel
# in column: interrupts rise noticeably, so interrupt handling is also a potential problem

This means that too many processes are running or waiting for the CPU, causing a large number of context switches, and the context switching in turn drives up the system's CPU usage.

pidstat -w -u 1    # find out which process is causing the problem

The results show that sysbench is causing the excessive CPU usage, yet the total number of context switches reported by pidstat is not high. The reason is that sysbench simulates thread switching, so the -t parameter must be added to pidstat to see per-thread metrics, as in the sketch below.
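A minimal sketch of that (the sysbench process name comes from this case; the interval and count are arbitrary):

pidstat -w -t 1 5                              # -w context switches, -t include per-thread (TID) rows
pidstat -w -t -p $(pgrep -o sysbench) 1 5      # limit the output to the sysbench process and its threads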

In addition, if there are too many interrupts, we can read them through the /proc/interrupts file.

watch -d cat /proc/interrupts

The counter that changes fastest is the rescheduling interrupt (RES), which wakes up idle CPUs to schedule new tasks. The analysis again points to too many tasks being scheduled, consistent with the context-switch analysis.

What should I do if the CPU usage of an application reaches 100%?

As a multi-tasking operating system, Linux divides CPU time into short time slices and allocates them to tasks in turn through the scheduler. To keep track of CPU time, Linux triggers timer interrupts at a pre-defined tick rate and uses the global variable jiffies to record the number of ticks since boot; the value is incremented by 1 on every timer interrupt.

CPU usage is the percentage of total CPU time other than idle time. It can be calculated from the data in /proc/stat, but since /proc/stat holds tick counts accumulated since boot, a direct calculation only gives the average CPU usage since boot, which is usually not very meaningful. Instead, take two samples a short interval apart and compute the usage over that interval from their difference. This is also what performance analysis tools report: average CPU usage over a sampling period, so pay attention to the interval setting.
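As a rough sketch of that difference-based calculation (bash, reading the aggregate cpu line of /proc/stat twice; an approximation for illustration, not a replacement for top):

# sample the aggregate "cpu" line twice and compute busy% over the interval
prev=($(grep '^cpu ' /proc/stat)); sleep 1; curr=($(grep '^cpu ' /proc/stat))
prev_idle=$(( prev[4] + prev[5] ))   # idle + iowait
curr_idle=$(( curr[4] + curr[5] ))
prev_total=0; curr_total=0
for v in "${prev[@]:1}"; do prev_total=$(( prev_total + v )); done
for v in "${curr[@]:1}"; do curr_total=$(( curr_total + v )); done
echo "CPU usage over 1s: $(( 100 * ( (curr_total - prev_total) - (curr_idle - prev_idle) ) / (curr_total - prev_total) ))%"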

CPU usage can be viewed through top or ps. You can analyze the CPU problems of the process through perf, which is based on performance event sampling. It can not only analyze various events of the system and kernel performance, but also can be used to analyze the performance problems of specified applications.

perf top / perf record / perf report (-g enables call-graph sampling)

sudo docker run --name nginx -p 10000:80 -itd feisky/nginx
sudo docker run --name phpfpm -itd --network container:nginx feisky/php-fpm
ab -c 10 -n 100 http://XXX.XXX.XXX.XXX:10000/   # benchmark the Nginx service

The test shows that the requests per second the service can sustain are quite low. Increase the number of test requests from 100 to 10,000, and run top in another terminal to watch each CPU's usage: several php-fpm processes are causing a sharp rise in CPU usage.

Then use perf to analyze which function in php-fpm causes the problem.

perf top -g -p XXXX    # analyze one specific php-fpm process

perf shows that sqrt and add_function take up too much CPU. Checking the source code reveals that a test code block containing a million-iteration sqrt loop was not removed before release. After deleting the useless code, nginx's load capacity improves significantly.

The system's CPU usage is very high. Why can't I find the application with high CPU usage?

sudo docker run --name nginx -p 10000:80 -itd feisky/nginx:sp
sudo docker run --name phpfpm -itd --network container:nginx feisky/php-fpm:sp
ab -c 100 -n 1000 http://XXX.XXX.XXX.XXX:10000/   # test with 100 concurrent requests

In the experimental results, the number of requests per second is still not high. After we reduce the number of concurrent requests to 5, the load capacity of nginx is still very low.

At this time, top and pidstat were used to find that the system CPU usage was too high, but no processes with high CPU usage were found.

This situation usually means some information was missed during the analysis. Re-running top and observing for a while shows that there are many processes in the Running state in the ready queue, far more than our 5 concurrent requests. Carefully checking the per-process data shows that nginx and php-fpm are both in the sleep state, while several stress processes are actually running.

The next step is to analyze these stress processes with pidstat, which shows no output for them; cross-checking with ps aux shows the processes do not exist either. This does not mean the tools are wrong: checking top again shows that the PIDs of the stress processes keep changing, which may be caused by one of two things:

  • The process keeps crashing and restarting (such as segfault/configuration error, etc.). At this time, the process may be restarted by the monitoring system after exiting;

  • Caused by short-term processes, that is, external commands called through exec within other applications. These commands generally only run for a short time and end. It is difficult to find them with a long-interval tool like top.

You can use pstree to find the parent process of stress and find out the calling relationship.

pstree | grep stress

pstree shows that stress is a child process of php-fpm. Checking the source code then shows that every request calls a stress command to simulate I/O pressure. top previously showed rising CPU usage, but whether stress is really the cause needs further analysis. After adding verbose=1 to each request in the code, the output of the stress command can be inspected; interrupting the test shows that stress fails because it cannot create files due to a permission problem.

This is still just a guess. The next step is to continue analyzing it through the perf tool. The performance report shows that stress indeed consumes a large amount of CPU, which can be optimized and solved by fixing the permission problem.

What should I do if there are a large number of uninterruptible processes and zombie processes in the system?

Process states

R  Running/Runnable, indicating that the process is in the ready queue of the CPU, running or waiting to run;
D  Disk Sleep, uninterruptible state sleep, generally indicating that the process is interacting with the hardware, and is not allowed to be interrupted by other processes during the interaction;
Z  Zombie , zombie process, which means that the process has actually ended, but the parent process has not reclaimed its resources;
S  Interruptible Sleep, interruptible sleep state, meaning the process is suspended by the system while waiting for an event; when the event occurs, the process is woken up and enters the R state;
I  Idle, the idle state, used for kernel threads in uninterruptible sleep; unlike the D state, it does not raise the load average;
T  Stop/Traced, indicating that the process is suspended or traced (SIGSTOP/SIGCONT, GDB debugging);
X  Dead, the process has died and will not be seen in top/ps.

For uninterruptible states, they generally end in a very short time and can be ignored. However, if the system or hardware fails, the process may remain in an uninterruptible state for a long time, or even a large number of uninterruptible states appear in the system. At this time, you need to pay attention to whether I/O performance problems occur.

Zombie processes are common in multi-process applications: when a child process exits before the parent process has reaped its exit status, the child becomes a zombie. A large number of zombie processes will use up the available PIDs, preventing new processes from being created.
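A quick way to spot such processes, as a small sketch using standard ps fields:

# list processes in the uninterruptible (D) or zombie (Z) state, with their parent PIDs
ps -eo stat,pid,ppid,comm | awk '$1 ~ /^[DZ]/'
# count zombies, to see whether the number keeps growing
ps -eo stat | grep -c '^Z'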

Disk O_DIRECT problem

sudo docker run --privileged --name=app -itd feisky/app:iowait
ps aux | grep '/app'

You can see that multiple app processes are running, with states Ss+ and D+. The lowercase s means the process is the leader of a session, and the + sign means it belongs to the foreground process group.

The process group represents a group of interrelated processes, and the child process is a member of the group where the parent process belongs. A session refers to one or more process groups that share the same controlling terminal.

Use top to check the system resources and find that: 1) the average load is gradually increasing, and the average load reaches the number of CPUs within 1 minute, indicating that the system may have a performance bottleneck; 2) there are many zombie processes and they are increasing; 3) The CPU usage of us and sys is not high, but iowait is relatively high; 4) The CPU usage of each process is not high, but there are two processes in D state, possibly waiting for IO.

Analysis of the current data shows that: excessive iowait causes the average load of the system to increase, and the continuous growth of zombie processes indicates that a program fails to correctly clean up child process resources.

Use dstat for analysis, because it can view the usage of both CPU and I/O resources at the same time, which facilitates comparative analysis.

dstat 1 10    # output 10 groups of data at 1-second intervals

It can be seen that whenever wai (iowait) rises, disk read requests (read) become very large, indicating that the increase in iowait is related to disk read requests. The next step is to find out which process is reading the disk.

For the PID of the D-state process seen earlier in top, use pidstat -d -p XXX to show its I/O statistics; it turns out the D-state process has no read or write activity at all. Checking all processes with pidstat -d instead shows the app process performing disk reads at 32MB per second. Since a process must go through system calls in kernel mode to access the disk, the next step is to find the app process's system calls.

sudo strace -p XXX    # trace the app process's system calls

strace reports a permission error even though we are already running as root. When this happens, first check whether the process state is normal: ps shows the process is already in the Z state, that is, a zombie process.

In this situation, tools such as top and pidstat cannot give more information. As in the earlier cases, use perf record -g and perf report to examine the app process's call stack.

The call stack shows that the app is indeed reading data through the sys_read() system call, and from new_sync_read and blkdev_direct_IO it is clear that the process is doing direct reads: requests go straight to the disk, bypassing the cache, which causes iowait to rise.

After layer-by-layer analysis, the root cause is direct disk I/O inside the app. Then locate the specific code location for optimization.

Zombie processes

After the above optimization, iowait has dropped significantly, but the number of zombie processes is still increasing. First, locate the parent process of the zombie process. Use pstree -aps XXX to print out the call tree of the zombie process and find that the parent process is the app process.

Check the app's code to see whether the termination of child processes is handled correctly (whether wait()/waitpid() is called, whether a SIGCHLD handler is registered, etc.).

When encountering an increase in iowait, first use tools such as dstat and pidstat to confirm whether there is a disk I/O problem, and then find out which processes are causing the I/O. If you cannot use strace to directly analyze process calls, you can use the perf tool to analyze it.

For the zombie problem, use pstree to find the parent process, and then look at the source code to check the processing logic for the end of the child process.

CPU performance indicators

  • CPU usage

    • User CPU usage, including user mode (user) and low-priority user mode (nice). If this indicator is too high, it indicates that the application is busy.

    • System CPU usage, the percentage of time the CPU is running in kernel mode (excluding interrupts). A high indicator indicates that the kernel is busy.

    • CPU usage waiting for I/O, iowait. A high indicator indicates that the I/O interaction time between the system and the hardware device is relatively long.

    • Soft/hard interrupt CPU usage. A high indicator indicates that a large number of interrupts occur in the system.

    • Steal CPU / guest CPU: steal is the percentage of CPU time taken away by the hypervisor or other virtual machines, and guest is the percentage spent running guest virtual machines.

  • load average

    • Ideally, the average load is equal to the number of logical CPUs, indicating that each CPU is fully utilized. If it is greater, the system load is heavier.

  • process context switch

    • Including voluntary switches when resources cannot be obtained and involuntary switches forced by system scheduling. Context switching itself is a core function that keeps Linux running normally, but excessive switching consumes CPU time on saving and restoring data such as registers, kernel stacks, and virtual memory, rather than on the processes' real work.

  • CPU cache hit rate

    • The higher the hit rate, the better the CPU caches are being reused and the better the performance. L1/L2 caches are private to each core, while L3 is shared across cores (see the sketch below for checking cache behaviour with perf).
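To check these cache metrics in practice, a hedged sketch with perf (event names vary by CPU model and kernel, and perf may require root):

perf stat -e cache-references,cache-misses -- ls /usr                 # overall cache references vs. misses for a command
perf stat -e L1-dcache-loads,L1-dcache-load-misses -p <PID> sleep 10  # L1 data cache behaviour of a running process for 10s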

Performance tools

  • Load average case

    • First use uptime to check the system load average

    • After determining that the load has increased, use mpstat and pidstat to view the CPU usage of each CPU and each process respectively. Find out the process causing the higher average load.

  • Context switch case

    • First use vmstat to check the number of system context switches and interruptions.

    • Then use pidstat to observe the voluntary and involuntary context switching of the process.

    • Finally, observe the context switching of the thread through pidstat

  • Case of high process CPU usage

    • First use top to check the CPU usage of the system and process, and locate the process.

    • Then use perf top to observe the process call chain and locate the specific function.

  • Cases of high system CPU usage

    • First use top to check the CPU usage of the system and process. Top/pidstat cannot find processes with high CPU usage.

    • Revisit top output

    • Start with processes that have low CPU usage but are in the Running state.

    • perf record/report found short-term process (execsnoop tool)

  • Uninterruptible and zombie process cases

    • First use top to observe the increase in iowait and find a large number of uninterruptible and zombie processes.

    • strace cannot trace process system calls

    • Perf analysis of the call chain found that the root cause comes from direct disk I/O

  • Soft interrupt case

    • top observes high system soft interrupt CPU usage

    • Check /proc/softirqs to find several softirqs with fast changing rates.

    • The sar command found that it was a network packet problem

    • tcpdump finds out the type and source of network frames and determines the cause of SYN FLOOD attack

Find the right tool based on different performance indicators:

[figure: CPU performance indicators mapped to the corresponding analysis tools]

First run a few tools that cover many indicators, such as top, vmstat, and pidstat. Based on their output, determine what type of performance problem it is. After locating the offending process, use strace or perf to analyze its calls further; for softirq problems, check /proc/softirqs.


CPU optimization

  • application optimization

    • Compiler optimization: Enable optimization options during the compilation phase, such as gcc -O2

    • algorithm optimization

    • Asynchronous processing: Avoid the program from being blocked waiting for a certain resource, and improve the program's concurrent processing capabilities. (Replace polling with event notification)

    • Multithreading instead of multiprocessing: reducing context switching costs

    • Make good use of cache: speed up program processing

  • System Optimization

    • CPU binding: Bind the process to one/multiple CPUs to improve the CPU cache hit rate and reduce context switching caused by CPU scheduling.

    • CPU exclusivity: use the CPU affinity mechanism to dedicate one or more CPUs to specific processes

    • Priority adjustment: use nice to appropriately lower the priority of non-core applications

    • Set resource limits for processes: use cgroups to cap resource usage so that a misbehaving application cannot exhaust system resources (see the sketch after this list)

    • NUMA optimization: have each CPU access its local memory as much as possible

    • Interrupt load balancing: irqbalance automatically distributes interrupt handling across CPUs

  • The difference and understanding of TPS, QPS, and system throughput

    • QPS (Queries Per Second): query rate per second, the number of queries a server can respond to per second.

    • TPS (Transactions Per Second): number of transactions per second, as measured in software testing. One transaction covers the user sending a request to the server, the server's internal processing, and the server returning the result to the client.
      QPS is similar to TPS, but one visit to a page forms one TPS, while that page request may involve multiple requests to the server, each of which may be counted as a QPS.

  • System throughput includes several important parameters:

    • QPS (TPS)

    • concurrency

    • response time

    • QPS (TPS) = concurrency / average response time
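To make the system-optimization items above more concrete, here is a minimal sketch (the placeholder PID, the cgroup v1 layout under /sys/fs/cgroup, and the group name myapp are assumptions, not part of the original case):

taskset -cp 0-1 <PID>            # bind an existing process to CPUs 0 and 1 to improve cache locality
renice +10 -p <PID>              # lower the priority of a non-core process
mkdir /sys/fs/cgroup/cpu/myapp
echo 50000 > /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us    # roughly 50% of one CPU (default period 100000us)
echo <PID> > /sys/fs/cgroup/cpu/myapp/cgroup.procs        # move the process into the group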

Memory

How Linux memory works

Memory mapping

The main memory used in most computers is dynamic random access memory (DRAM), and only the kernel can directly access physical memory. The Linux kernel provides an independent virtual address space for each process, and this address space is continuous. In this way, the process can easily access memory (virtual memory).

The interior of the virtual address space is divided into two parts: kernel space and user space. The range of the address space of processors with different word lengths is different. The 32-bit system kernel space occupies 1G and the user space occupies 3G. The kernel space and user space of 64-bit systems are both 128T, occupying the highest and lowest parts of the memory space respectively, and the middle part is undefined.

Not all virtual memory is backed by physical memory; physical memory is allocated only for the parts actually used, and that allocation is managed through memory mapping. To implement memory mapping, the kernel maintains a page table for each process that records the mapping between virtual and physical addresses. The page table is used by the CPU's memory management unit (MMU), so the processor can resolve the memory to access directly in hardware.

When the virtual address accessed by the process cannot be found in the page table, the system will generate a page fault exception, enter the kernel space to allocate physical memory, update the process page table, and then return to the user space to resume the operation of the process.

The MMU manages memory in units of pages, with a page size of 4KB. In order to solve the problem of too many page table entries, Linux provides multi-level page tables and HugePage mechanisms.
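These values can be checked quickly on a running system (standard getconf and procfs interfaces; the HugePages counters are 0 unless huge pages have been configured):

getconf PAGE_SIZE              # page size, usually 4096 bytes
grep -i huge /proc/meminfo     # HugePages_Total, HugePages_Free, Hugepagesize, ...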

Virtual memory space distribution

User-space memory is divided into five segments, from low to high addresses:

  • Read-only segment: code and constants, etc.

  • Data segment: global variables, etc.

  • Heap: dynamically allocated memory, growing upwards from low addresses

  • File mapping segment: dynamic libraries, shared memory, etc., growing downwards from high addresses

  • Stack: local variables, function-call context, etc.; the stack size is fixed, normally 8MB

Memory allocation and recycling

Allocation

malloc has two implementations, corresponding to different system calls:

  • brk() allocates small blocks of memory (<128K) by moving the top of the heap.

    After the memory is released, the memory is not returned immediately, but is cached.

  • mmap() allocates large blocks of memory (>128K) via memory mapping, that is, by finding a free region in the file mapping segment.

The former's cache can reduce the occurrence of page fault exceptions and improve memory access efficiency. However, because the memory is not returned to the system, frequent memory allocation/release will cause memory fragmentation when the memory is busy.

The latter is directly returned to the system when it is released, so a page fault exception will occur every time mmap is performed.

When memory work is busy, frequent memory allocation will cause a large number of page fault exceptions, increasing the kernel management burden.

Neither call actually allocates physical memory at the time. The memory is allocated by the kernel only when it is first accessed, via a page fault that traps into the kernel.

Recycle

When memory is tight, the system reclaims memory in the following ways:

  • Recycling cache: LRU algorithm reclaims the least recently used memory pages;

  • Recycle infrequently accessed memory: write infrequently used memory to disk through the swap partition

  • Killing processes: the OOM (Out of Memory) killer, a kernel protection mechanism (the more memory a process consumes, the higher its oom_score; the more CPU it uses, the lower its oom_score; oom_adj can be adjusted manually through /proc)

echo -16 > /proc/$(pidof XXX)/oom_adj

How to check memory usage

free to view the memory usage of the entire system

top/ps to view the memory usage of a process

  • VIRT virtual memory size of the process

  • RES The size of resident memory, that is, the size of physical memory actually used by the process, excluding swap and shared memory

  • SHR shared memory size, memory shared with other processes, loaded dynamic link libraries and program code segments

  • %MEM process uses physical memory as a percentage of the total system memory.

How to understand Buffer and Cache in memory?

Buffer is a cache of disk data, and cache is a cache of file data. They are used in both read requests and write requests.

How to use system cache to optimize program running efficiency

Cache hit rate

The cache hit rate refers to the number of requests to obtain data directly through the cache, accounting for the percentage of all requests. The higher the hit rate, the higher the benefits brought by the cache and the better the performance of the application.

After installing the bcc package, you can use cachestat and cachetop to monitor cache read and write hits.
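For example, assuming the bcc tools are installed under /usr/share/bcc/tools (package layouts differ across distributions):

/usr/share/bcc/tools/cachestat 1 3      # system-wide page cache hits/misses, every second, 3 samples
/usr/share/bcc/tools/cachetop 1         # per-process cache hit ratio, refreshed every second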

After installing pcstat, you can view the cache size and cache ratio of files in memory:

# first install Go, then:
export GOPATH=~/go
export PATH=~/go/bin:$PATH
go get golang.org/x/sys/unix
go get github.com/tobert/pcstat/pcstat

dd cache acceleration

dd if=/dev/sda1 of=file bs=1M count=512   # generate a 512MB temporary file
echo 3 > /proc/sys/vm/drop_caches         # drop the caches
pcstat file       # confirm the file is not yet in the system cache: cached and percent are both 0
cachetop 5
dd if=file of=/dev/null bs=1M             # test the file read speed
# read performance is about 30 MB/s; cachetop shows that not all reads hit the cache, the read hit rate is only 50%
dd if=file of=/dev/null bs=1M             # repeat the read test
# read performance is now 4+ GB/s and the read cache hit rate is 100%
pcstat file       # check the file's cache status: 100% cached

O_DIRECT option bypasses system cache

cachetop 5
sudo docker run --privileged --name=app -itd feisky/app:io-direct
sudo docker logs app    # confirm the case has started successfully
# the result shows that reading each 32MB of data takes 0.9s, and cachetop reports 1024 cache accesses, all hits

Intuitively, though, if the reads really hit the cache they should not be this slow. There are 1024 reads and the page size is 4KB, so only 1024*4KB of data goes through the cache in five seconds, about 0.8MB per second, far less than the 32MB in the result. This shows the cache is not being used effectively; the suspicion is that the system call sets the direct I/O flag and bypasses the system cache. So the next step is to look at the system calls.

strace -p $(pgrep app)
# strace shows openat opening the disk partition /dev/sdb1 with the flags O_RDONLY|O_DIRECT

This explains why reading 32MB of data is so slow. Reading and writing directly from disk must be much slower than caching. After finding the problem, we looked at the source code of the case and found that the direct IO flag was specified in flags. Delete this option and rerun to verify the performance changes.

Memory leak, how to locate and deal with it?

For applications, dynamic memory allocation and recycling is a core and complex logical function module. Various "accidents" may occur during memory management:

  • The allocated memory was not properly reclaimed, resulting in leaks

  • Accessing an address outside the boundaries of allocated memory causes the program to exit abnormally.

Memory allocation and recycling

The virtual memory distribution from low to high is divided into five parts: read-only segment, data segment, heap, memory mapping segment, and stack. Among them, the ones that can cause memory leaks are:

  • Heap: Allocated and managed by the application itself. Unless the program exits, these heap memories will not be automatically released by the system.

  • Memory mapping segment: includes dynamic link libraries and shared memory, where shared memory is automatically allocated and managed by the program

Memory leaks are harmful: the application itself can no longer access the memory it forgot to free, and the system cannot reallocate it to other applications either. Leaks accumulate and can even exhaust system memory.

How to detect memory leaks

Pre-install sysstat, docker, and bcc:

sudo docker run --name=app -itd feisky/app:mem-leak
sudo docker logs app
vmstat 3

Free memory keeps declining while buffer and cache stay basically unchanged, which means the system's memory usage keeps increasing. This does not necessarily mean there is a memory leak, though; use the memleak tool to trace the system's or a process's memory allocation and free requests.

/usr/share/bcc/tools/memleak -a -p $(pidof app)

From the memleak output, we can see that the application keeps allocating memory and the allocated addresses are never freed. The call stack shows that the memory allocated by the fibonacci function is not released. After locating it in the source code, add the missing free call to fix the leak.

Why does the system Swap become high?

When system memory resources are tight, memory recycling and OOM killing processes can be used to solve the problem. The recyclable memory includes:

  • Cache/buffer is a recyclable resource, usually called file page in file management.

    • Synchronize dirty pages to disk via fsync in application

    • Leave it to the system, and the kernel thread pdflush is responsible for refreshing these dirty pages.

    • Data (dirty pages) that have been modified by the application and have not yet been written to the disk must be written to the disk first and then the memory can be released.

  • The file mapping page obtained by memory mapping can also be released and re-read from the file the next time it is accessed.

For heap memory allocated by the program itself, that is, the anonymous pages in memory management, the memory cannot be released directly. Instead, Linux provides the Swap mechanism: infrequently accessed memory is written to disk to free it up, and when it is accessed again it is read back from disk into memory.

Swap principle

The essence of Swap is to use a piece of disk space or a local file as memory, including two processes of swapping in and swapping out:

  • Swap out: store the memory data temporarily unused by the process to the disk, and release the memory

  • Swap in: When the process accesses memory again, read them from disk into memory

How does Linux measure whether memory resources are tight?

  • Direct memory reclamation: a new large memory allocation is requested, but not enough free memory remains.

    At this time, the system will reclaim part of the memory;

  • The kswapd0 kernel thread periodically reclaims memory.

    In order to measure memory usage, three thresholds of pages_min, pages_low, and pages_high are defined, and memory recycling operations are performed based on them.

    • Remaining memory < pages_min, the available memory for the process is exhausted, and only the kernel can allocate memory

    • pages_min < remaining memory < pages_low, memory pressure is high, kswapd0 performs memory recycling until remaining memory > pages_high

    • pages_low < remaining memory < pages_high, there is a certain pressure on the memory, but it can satisfy new memory requests.

    • Remaining memory > pages_high, indicating that there is plenty of free memory and no memory pressure
      pages_low = pages_min * 5 / 4
      pages_high = pages_min * 3 / 2

NUMA and SWAP

In many cases, the system has a lot of remaining memory, but the SWAP is still elevated. This is due to the NUMA architecture of the processor.

Under the NUMA architecture, multiple processors are divided into different Nodes, and each Node has its own local memory space. When analyzing memory usage, each Node should be analyzed separately.

numactl --hardware    # check how processors are distributed across Nodes and each Node's memory usage

The three memory thresholds can be viewed through /proc/zoneinfo, which also includes the number of active and inactive anonymous pages/file pages.

When a Node runs out of memory, the system can either find free resources on other Nodes or reclaim memory locally. The behaviour is adjusted via /proc/sys/vm/zone_reclaim_mode.

  • 0 means that you can either find free resources from other Nodes or reclaim memory locally.

  • 1, 2, and 4 mean only local memory is reclaimed: 2 means dirty pages can be written back to reclaim memory, and 4 means memory can be reclaimed using Swap.

swappiness

During reclamation, Linux adjusts how aggressively Swap is used according to /proc/sys/vm/swappiness, which ranges from 0 to 100. The larger the value, the more actively Swap is used, that is, the more the kernel prefers to reclaim anonymous pages; the smaller the value, the more passively Swap is used, that is, the more it prefers to reclaim file pages.

Note: this only adjusts the weighting of Swap aggressiveness. Even if it is set to 0, Swap will still occur when the remaining memory plus file pages fall below the page-high threshold (see the example below).
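For example (standard sysctl interface; the value 10 is only an illustrative choice):

cat /proc/sys/vm/swappiness                    # current value, 60 on many distributions
sysctl -w vm.swappiness=10                     # temporarily prefer reclaiming file pages over swapping
echo 'vm.swappiness=10' >> /etc/sysctl.conf    # make it persistent (path assumed)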

How to locate and analyze when Swap increases

free    # first check Swap usage with free; if swap is 0, Swap is not configured
# create and enable a swap file first
fallocate -l 8G /mnt/swapfile
chmod 600 /mnt/swapfile
mkswap /mnt/swapfile
swapon /mnt/swapfile

free    # run free again to make sure Swap is configured successfully

dd if=/dev/sda1 of=/dev/null bs=1G count=2048    # simulate reading a large file
sar -r -S 1    # watch memory metrics: -r memory, -S swap
# the results show %memused keeps growing, free memory (kbmemfree) keeps shrinking, and buffers (kbbuffers) keep growing,
# so the free memory is continuously handed to the buffer cache
# after a while, little free memory is left and buffers occupy most of it; at that point Swap usage starts to grow,
# while buffers and free memory only fluctuate within a small range

# stop the sar command
cachetop 5    # observe the cache
# the dd process has only a 50% read hit rate with 40k+ missed pages, so it is dd that drives buffer usage up
watch -d grep -A 15 'Normal' /proc/zoneinfo    # watch the memory watermarks
# the free memory keeps fluctuating within a small range: when it drops below the low watermark it suddenly jumps
# back above the high watermark

It shows that the fluctuation of remaining memory and buffer is due to the cycle of memory recycling and cache reallocation. Sometimes Swap is used more, and sometimes the buffer fluctuates more. At this time, the swappiness value is 60, which is a relatively neutral configuration. The system will select the appropriate recycling type based on the actual operating conditions.

How to find system memory problems quickly and accurately

Memory performance metrics

  • System memory metrics

  • Used memory/remaining memory

  • Shared memory (tmpfs implementation)

  • Available memory: including remaining memory and reclaimable memory

  • Cache: Page cache of disk read files, reclaimable part in slab allocator

  • Buffer: temporary storage of raw disk blocks, cache of data to be written to disk

Process memory metrics

  • Virtual memory: includes the five segments described above (read-only segment, data segment, heap, file mapping segment, stack)

  • Resident memory: the physical memory actually used by the process, excluding Swap and shared memory

  • Shared memory: memory shared with other processes, and code segments of dynamic link libraries and programs

  • Swap memory: swap out memory to disk through Swap

Page fault exception

  • If the page can be allocated directly from physical memory, it is a minor page fault

  • If disk I/O (such as Swap) is involved, it is a major page fault; in this case memory access is much slower

Memory performance tools

Find the right tool based on different performance indicators:

[figure: memory performance indicators mapped to the corresponding analysis tools]

Performance indicators included in the memory analysis tool:

[figure: the performance indicators covered by each memory analysis tool]

How to quickly analyze memory performance bottlenecks

Usually run several performance tools with relatively large coverage first, such as free, top, vmstat, pidstat, etc.

  • First use free and top to check the overall memory usage of the system

  • Then use vmstat and pidstat to check the trend over a period of time to determine the type of memory problem.

  • Finally, detailed analysis is performed, such as memory allocation analysis, cache/buffer analysis, memory usage analysis of specific processes, etc.

Common optimization ideas:

  • It is best to disable Swap. If it must be enabled, try to reduce the value of swappiness.

  • Reduce the dynamic allocation of memory, such as memory pool, HugePage, etc.

  • Try to use caches and buffers to access data. For example, use the stack to explicitly declare the memory space to store the data that needs to be cached, or use the Redis external cache component to optimize data access.

  • Use cgroups or similar mechanisms to limit the memory usage of processes, so that an abnormal process cannot exhaust system memory

  • Adjust the oom_score of core applications via /proc/pid/oom_adj, so that they will not be killed by the OOM killer even when memory is tight (see the sketch below)
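As a minimal sketch of the last two ideas (the cgroup v1 paths, the group name myapp, and the placeholder PID are assumptions; newer kernels expose oom_score_adj alongside the older oom_adj):

mkdir /sys/fs/cgroup/memory/myapp
echo 512M  > /sys/fs/cgroup/memory/myapp/memory.limit_in_bytes   # cap the group at 512MB
echo <PID> > /sys/fs/cgroup/memory/myapp/cgroup.procs            # move the process into the group
echo -16   > /proc/<PID>/oom_adj           # legacy interface, range -17..15 (-17 disables OOM killing)
echo -1000 > /proc/<PID>/oom_score_adj     # newer interface, range -1000..1000 (-1000 disables)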

Detailed use of vmstat

The vmstat command is one of the most common Linux/Unix monitoring tools. It reports the server's status at a given interval, including CPU usage, memory usage, virtual memory swapping, and I/O reads and writes. It shows the CPU, memory, and I/O usage of the whole machine rather than of individual processes, so its usage scenario differs from per-process tools.

vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 1379064 282244 11537528    0    0     3   104    0    0  3  0 97  0  0
 0  0      0 1372716 282244 11537544    0    0     0    24 4893 8947  1  0 98  0  0
 0  0      0 1373404 282248 11537544    0    0     0    96 5105 9278  2  0 98  0  0
 0  0      0 1374168 282248 11537556    0    0     0     0 5001 9208  1  0 99  0  0
 0  0      0 1376948 282248 11537564    0    0     0    80 5176 9388  2  0 98  0  0
 0  0      0 1379356 282256 11537580    0    0     0   202 5474 9519  2  0 98  0  0
 1  0      0 1368376 282256 11543696    0    0     0     0 5894 8940 12  0 88  0  0
 1  0      0 1371936 282256 11539240    0    0     0 10554 6176 9481 14  1 85  1  0
 1  0      0 1366184 282260 11542292    0    0     0  7456 6102 9983  7  1 91  0  0
 1  0      0 1353040 282260 11556176    0    0     0 16924 7233 9578 18  1 80  1  0
 0  0      0 1359432 282260 11549124    0    0     0 12576 5495 9271  7  0 92  1  0
 0  0      0 1361744 282264 11549132    0    0     0    58 8606 15079  4  2 95  0  0
 1  0      0 1367120 282264 11549140    0    0     0     2 5716 9205  8  0 92  0  0
 0  0      0 1346580 282264 11562644    0    0     0    70 6416 9944 12  0 88  0  0
 0  0      0 1359164 282264 11550108    0    0     0  2922 4941 8969  3  0 97  0  0
 1  0      0 1353992 282264 11557044    0    0     0     0 6023 8917 15  0 84  0  0
# Explanation of the output:
- r: the run queue, i.e. how many processes are actually ready for a CPU. On my test server the CPU is fairly idle and little is running. When this value exceeds the number of CPUs, a CPU bottleneck appears. It is also related to top's load: a load above 3 is fairly high, above 5 is high, and above 10 is abnormal and the server is in a dangerous state. top's load is roughly the run queue per second. If the run queue is too large, the CPU is very busy and CPU usage is usually very high.
- b: the number of blocked processes.
- swpd: the amount of virtual memory (swap) already in use. If it is greater than 0, the machine is short of physical memory; unless the cause is a memory leak in a program, you should add memory or move memory-hungry tasks to another machine.
- free: the amount of free physical memory. My machine has 8G in total, with 3415M free.
- buff: the cache Linux/Unix uses for things such as directory contents and permissions; about 300M on my machine.
- cache: used to cache the files we open, buffering file contents; about 300M on my machine. (This is where Linux/Unix is clever: part of the free physical memory is used to cache files and directories to improve program performance, and when programs need memory, buffer/cache is given up quickly.)
- si: the amount swapped in from disk to memory per second. If greater than 0, physical memory is insufficient or there is a memory leak; find the memory-hungry process and deal with it. My machine has plenty of memory and everything is normal.
- so: the amount swapped out from memory to disk per second. If greater than 0, same as above.
- bi: the number of blocks received from block devices per second (all disks and other block devices on the system; the default block size is 1024 bytes). There is little I/O on my machine so it stays at 0, but on a machine copying large amounts of data (2-3T) I have seen it reach about 140000/s, roughly 140MB per second of disk throughput.
- bo: the number of blocks sent to block devices per second; for example, reading files makes bo greater than 0. bi and bo should generally be close to 0, otherwise I/O is too frequent and needs tuning.
- in: the number of CPU interrupts per second, including timer interrupts.
- cs: the number of context switches per second. Calling system functions, switching threads, and switching processes all cause context switches, so the smaller this value the better. If it is too large, consider lowering the number of threads or processes. For example, with web servers such as apache and nginx, when load-testing with thousands or tens of thousands of concurrent connections, we tune the number of worker processes or threads down from the peak and keep testing until cs reaches a relatively small value; that process/thread count is then a suitable setting. The same applies to system calls: each call drops our code into kernel space and causes a context switch, which is expensive, so frequent system calls should be avoided. Too many context switches means the CPU spends most of its time switching instead of doing real work, so the CPU is not being used effectively.
- us: user CPU time. On a server doing heavy encryption and decryption I have seen us close to 100 and the run queue r reach 80 (the machine was under stress testing and performing poorly).
- sy: system CPU time. If it is too high, system calls are taking a long time, for example because of frequent I/O operations.
- id: idle CPU time. In general, id + us + sy = 100; id is the idle CPU percentage, us the user CPU percentage, and sy the system CPU percentage.
- wa: CPU time spent waiting for I/O.

Detailed use of pidstat

pidstat is mainly used to monitor the usage of system resources by all or specified processes, such as CPU, memory, device IO, task switching, threads, etc.

Instructions:

  • pidstat -d interval count: report I/O statistics for each process

  • pidstat -u interval count: report CPU statistics for each process

  • pidstat -r interval count: report memory statistics for each process

  • pidstat -w interval count: report context switching for each process

  • -p PID: restrict the report to the specified PID

1. Statistics on IO usage

pidstat -d 1 10
03:02:02 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
03:02:03 PM     0       816      0.00    918.81      0.00  jbd2/vda1-8
03:02:03 PM     0      1007      0.00      3.96      0.00  AliYunDun
03:02:03 PM   997      7326      0.00   1904.95    918.81  java
03:02:03 PM   997      8539      0.00      3.96      0.00  java
03:02:03 PM     0     16066      0.00     35.64      0.00  cmagent

03:02:03 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
03:02:04 PM     0       816      0.00   1924.00      0.00  jbd2/vda1-8
03:02:04 PM   997      7326      0.00  11156.00   1888.00  java
03:02:04 PM   997      8539      0.00      4.00      0.00  java
  • UID

  • PID

  • kB_rd/s: the amount of data the process reads from disk per second, in KB

  • kB_wr/s: the amount of data the process writes to disk per second, in KB

  • kB_ccwr/s: the amount of data per second whose write-out to disk was cancelled (this may happen when the task truncates dirty pagecache)

  • iodelay: block I/O delay, measured in clock ticks

  • Command: the task (process) name

2. Statistics of CPU usage

# CPU statistics
pidstat -u 1 10
03:03:33 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
03:03:34 PM     0      2321    3.96    0.00    0.00    3.96     0  ansible
03:03:34 PM     0      7110    0.00    0.99    0.00    0.99     4  pidstat
03:03:34 PM   997      8539    0.99    0.00    0.00    0.99     5  java
03:03:34 PM   984     15517    0.99    0.00    0.00    0.99     5  java
03:03:34 PM     0     24406    0.99    0.00    0.00    0.99     5  java
03:03:34 PM     0     32158    3.96    0.00    0.00    3.96     2  ansible
  • UID

  • PID

  • %usr: The percentage of CPU occupied by the process in user space

  • %system: The percentage of CPU occupied by the process in the kernel space

  • %guest: The percentage of CPU occupied by the process in the virtual machine

  • %wait: The percentage of the process waiting to run

  • %CPU: The percentage of CPU occupied by the process

  • CPU: the number of the CPU the process is running on

  • Command: process name

3. Statistics of memory usage

# memory statistics
pidstat -r 1 10
Average:      UID       PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
Average:        0         1      0.20      0.00  191256   3064   0.01  systemd
Average:        0      1007      1.30      0.00  143256  22720   0.07  AliYunDun
Average:        0      6642      0.10      0.00 6301904 107680   0.33  java
Average:      997      7326     10.89      0.00 13468904 8395848  26.04  java
Average:        0      7795    348.15      0.00  108376   1233   0.00  pidstat
Average:      997      8539      0.50      0.00 8242256 2062228   6.40  java
Average:      987      9518      0.20      0.00 6300944 1242924   3.85  java
Average:        0     10280      3.70      0.00  807372   8344   0.03  aliyun-service
Average:      984     15517      0.40      0.00 6386464 1464572   4.54  java
Average:        0     16066    236.46      0.00 2678332  71020   0.22  cmagent
Average:      995     20955      0.30      0.00 6312520 1408040   4.37  java
Average:      995     20956      0.20      0.00 6093764 1505028   4.67  java
Average:        0     23936      0.10      0.00 5302416 110804   0.34  java
Average:        0     24406      0.70      0.00 10211672 2361304   7.32  java
Average:        0     26870      1.40      0.00 1470212  36084   0.11  promtail
  • UID

  • PID

  • minflt/s: minor page faults per second, i.e. faults resolved by mapping a virtual address to a physical page already in memory

  • majflt/s: major page faults per second, i.e. the page being mapped has to be read in from disk (for example from swap)

  • VSZ: virtual memory used by the process, in KB

  • RSS: physical (resident) memory used by the process, in KB

  • %MEM: the process's share of total memory

  • Command: the command (task) name of the process

4. Check the usage of specific processes

pidstat -T ALL -r -p 20955 1 10
03:12:16 PM   UID       PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
03:12:17 PM   995     20955      0.00      0.00 6312520 1408040   4.37  java

03:12:16 PM   UID       PID minflt-nr majflt-nr  Command
03:12:17 PM   995     20955         0         0  java

Source: https://www.ctq6.cn/linux%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/


Origin blog.csdn.net/LinkSLA/article/details/132708534