An inventory of common CPU performance bottlenecks in the kernel

Our applications run on top of language runtimes, the operating system kernel, and hardware such as the CPU. We usually develop in Go, Java, and other languages, but those languages in turn depend on multiple layers: the runtime, the kernel, and the hardware.

When a program is running, performance bottlenecks are not necessarily caused by our own application code; they can also come from the underlying software stack being in poor shape. Bottlenecks may appear in the hardware. We have already looked at the key CPU hardware metrics that affect program performance, namely CPI, the average number of clock cycles per instruction, and the cache hit rate. Bottlenecks may also appear in kernel software. Today we will look at several key indicators in the kernel that may affect the performance of our programs.

In fact, kernel developers have long known which parts of the kernel carry relatively high overhead while it runs, so they provide a facility called software performance events. It makes it easy for application developers to observe how many times these events occur and the function call chains that trigger them.

 

1. Listing software performance events

You can see which software performance events the current system supports with perf's list subcommand.

# perf list sw
List of pre-defined events (to be used in -e):
  alignment-faults                                   [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

In the command above, sw is short for software, which in practice means the kernel. The output lists the events that can affect performance, and we will explain them one by one.

alignment-faults

This is an alignment fault. Simply put, when the CPU accesses a memory address that is not properly aligned, a single memory IO may not be enough to fetch the requested data, and a second IO has to be issued to bring in the rest. Alignment faults therefore add unnecessary memory IO, which inevitably drags down program performance.

To make this concrete: the data at addresses 0-63, or at addresses 64-127, can each be fetched in a single memory IO. But if your application needs 64 bytes starting at address 40, the access straddles both blocks; that is a misaligned access, and it costs two IOs.
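As a rough illustration of my own (not from the original article), the sketch below deliberately performs a misaligned 8-byte store. Strictly speaking this is undefined behavior in C and is meant only to show the pattern: on x86 the hardware usually absorbs it at the cost of extra memory accesses, while on strict-alignment architectures the kernel may have to fix the access up, which is what the alignment-faults counter records.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    char buf[16];
    memset(buf, 0, sizeof(buf));

    /* buf + 1 is not 8-byte aligned. An 8-byte store here straddles the
     * natural boundary, so the hardware may need an extra memory access,
     * or on strict-alignment CPUs the kernel performs a fixup that is
     * counted as an alignment-fault event. */
    uint64_t *p = (uint64_t *)(buf + 1);   /* deliberately misaligned */
    *p = 0x1122334455667788ULL;

    printf("misaligned store done at %p\n", (void *)p);
    return 0;
}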

context-switches

This counts process context switches. On average, each switch costs about 3-5 us. For a fast-running operating system this is already a long time, and more importantly, from the user program's point of view this time is completely wasted. Frequent context switching also hurts the CPU cache hit rate and drives up CPI.
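Besides perf, Linux also exposes per-process context switch counts through getrusage. A minimal sketch of my own, assuming a Linux/glibc environment:

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void) {
    struct rusage ru;

    /* usleep blocks, so the process voluntarily gives up the CPU on each
     * iteration and the scheduler performs a context switch. */
    for (int i = 0; i < 100; i++)
        usleep(1000);

    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        /* ru_nvcsw:  voluntary switches (blocked on sleep, IO, locks, ...)
         * ru_nivcsw: involuntary switches (preempted, time slice used up) */
        printf("voluntary context switches:   %ld\n", ru.ru_nvcsw);
        printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
    }
    return 0;
}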

cpu-migrations

If a process is scheduled onto the same CPU core every time it runs, there is a good chance that the data sitting in that core's L1, L2, and L3 caches can still be used. A high cache hit rate keeps data accesses from falling through to the much slower main memory. For this reason, the kernel's scheduler implements the wake_affine mechanism, which tries to schedule a task back onto the core it last used.

But what if, when the scheduler wakes the process up, it finds that the core it last used is occupied by another process? It cannot simply hold the process back and make it wait for that particular core; it may be better to give it another core so it gets CPU time promptly. The cost is that the process hops between CPUs as it executes, a phenomenon known as task migration.

Obviously, task migration is not friendly to the CPU caches. Too many migrations will inevitably degrade the process's performance.
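A common countermeasure for latency-sensitive code is to pin the process or thread to one core, so the scheduler cannot migrate it and its caches stay warm. A minimal sketch of my own using sched_setaffinity (whether pinning actually helps depends on the workload):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);              /* restrict this process to CPU 0 */

    /* pid 0 means the calling thread; after this call the scheduler will
     * not migrate it, so its cpu-migrations count should stay at 0. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("now running on CPU %d\n", sched_getcpu());
    /* ... hot code path goes here ... */
    return 0;
}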

emulation-faults

emulation-faults are a type of fault that occurs when running x86 applications inside a QEMU virtual machine. An x86 program is built for the x86 hardware architecture and instruction set. As an emulator, QEMU can emulate that architecture and instruction set, but because an emulator is never identical to real hardware, emulation-faults may occur while such applications run.

page-faults

This is the familiar page fault. When a user process requests memory, what it actually gets at first is only a vm_area_struct, i.e., just a range of addresses; physical memory is not allocated immediately. The real allocation is deferred until the memory is actually accessed. When the running process later touches a variable whose physical page has not yet been allocated, a page fault interrupt is triggered, and the physical memory is allocated inside the page fault handler.

Page faults come in two kinds, major-faults and minor-faults. The difference is that a major fault requires disk IO to resolve, so it has a much bigger impact on the running program.
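The deferred allocation is easy to observe. The sketch below is my own example (not from the article): it maps 16 MB of anonymous memory and then writes to every page, so each first touch triggers a minor fault. You can confirm the counts with getrusage, or by running the program under perf stat -e minor-faults.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define SIZE (16 * 1024 * 1024)   /* 16 MB */

int main(void) {
    struct rusage ru;

    /* mmap only sets up the address range (a vm_area_struct); no physical
     * pages are allocated yet. */
    char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* The first write to each page triggers a page fault, and the physical
     * memory is allocated in the fault handler. These are minor faults
     * because no disk IO is needed. */
    memset(buf, 1, SIZE);

    if (getrusage(RUSAGE_SELF, &ru) == 0) {
        printf("minor faults: %ld\n", ru.ru_minflt);
        printf("major faults: %ld\n", ru.ru_majflt);
    }
    return 0;
}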

2. Counting software performance events

Now that we understand these kernel events that can affect program performance, the next thing we want to know is how many times they actually occur on the system. This can be done with the perf stat subcommand.

# perf stat -e alignment-faults,context-switches,cpu-migrations,emulation-faults,page-faults,major-faults,minor-faults sleep 5
 Performance counter stats for 'sleep 5':
     0      alignment-faults:u
     0      context-switches:u
     0      cpu-migrations:u
     0      emulation-faults:u
    56      page-faults:u
     0      major-faults:u
    56      minor-faults:u

Since I ran the command above on a development machine at hand, most of the counters are 0, and only 56 minor-faults, which are not particularly serious, occurred. Note that as written this command only measures the sleep 5 command itself; add the -a option if you want statistics for the whole system.

If you only want to measure a specific program or process, append the program name, or specify the process pid with -p:

# perf stat <executable>   // measure the specified program
# perf stat -p <pid>       // measure the specified process

3. Sampling function call stacks for software performance events

If you find that one of these counters is too high on your system, the next thing you will want to know is which function call chains are producing the events. This is where the perf record subcommand helps you sample the stacks.

For example, to see where context-switches come from, take a sample:

# perf record -a -g -e context-switches sleep 30

In the command above, -a means sample system-wide, across all CPUs; -g means record not just the name of the currently running function at each sample but the entire call chain (both user and kernel stacks); -e restricts sampling to context-switches events; and sleep 30 makes the collection run for 30 seconds. When the command finishes, a perf.data file is written to the current directory.

By default, perf record samples 4000 times per second. This can make the collected perf.data file too large and also adds overhead to the running programs. You can control the sampling frequency with the -F option.

# perf record -F 100 ...

Use perf script to view the contents of the perf.data file.

# perf script

You can also use the perf report command for a simple summary.

# perf report

The best way is to use Brendan Gregg's FlameGraph project to render the sampled perf.data into a very intuitive flame graph.

Generating one is very simple. You only need to clone the FlameGraph project and then run the output through its stackcollapse-perf.pl and flamegraph.pl scripts.

# git clone https://github.com/brendangregg/FlameGraph.git
# perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > out.svg

The stackcollapse-perf.pl script folds each call stack into a single line: the front of the line is the call stack, and the number at the end is how many counts were sampled on that stack. For example, the output below means that the call chain main;funcA;funcD;funcE;caculate accounted for 554118432 counts during sampling, and the call chain main;funcB;caculate accounted for 338716787.

main;funcA;funcD;funcE;caculate 554118432
main;funcB;caculate 338716787

The flamegraph.pl script then renders the folded output of stackcollapse-perf.pl as an SVG image.

Once the perf.data sampled for the context-switches kernel software event has been rendered as a flame graph, you can clearly see on which call chains context switches happen most frequently.

By reading the flame graph, you can work out what is responsible for most of the process context switch overhead. The same analysis approach applies to the other kernel software events, such as page faults and CPU migrations.

 
