The road to performance for programmers starts with perf!

1. Introduction to perf

Starting with kernel 2.6.31, the Linux kernel ships with a performance analysis tool called perf, which can find hotspots at both the function level and the instruction level. Through it, applications can take advantage of the PMU, tracepoints, and special counters in the kernel to collect performance statistics. It can analyze the performance of a specified application (per thread), analyze the performance of the kernel itself, or analyze application code and the kernel at the same time, giving a full picture of the performance bottlenecks in an application.

Perf is a performance profiling tool built into the Linux kernel source tree. Based on sampling of performance events, it supports analysis of both processor-related and operating-system-related performance metrics. It is commonly used to find performance bottlenecks and locate hot code.

1.1 Install Perf

Installing perf is simple: as long as the kernel version is 2.6.31 or later, perf is already supported by the kernel. First install the kernel source code:

apt-get install linux-source

The kernel source code is now downloaded into the /usr/src directory. Unpack the source package, enter the tools/perf directory, and type the following two commands:

make
make install

Depending on your system, the following development packages may need to be installed first:

apt-get install -y binutils-dev
apt-get install -y libdw-dev
apt-get install -y python-dev
apt-get install -y libnewt-dev

1.2 Basic use of Perf

CPU cycles (cpu-cycles) is the default performance event. A CPU cycle is the smallest unit of time the CPU can recognize, typically a fraction of a nanosecond on modern processors; it is the time required for the CPU to execute its simplest instruction, such as reading the contents of a register, and is also called a clock tick.

Perf uses the command format perf COMMAND [-e event ...] PROGRAM, where COMMAND is most commonly top, stat, record, report, and so on. The -e option selects the events to count; to count several events, pass -e multiple times or separate the event names with commas.
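For example, a sketch of counting two events in one run (./a.out here is just a placeholder for whatever program you want to measure):

perf stat -e cycles -e instructions ./a.out
perf stat -e cycles,instructions ./a.out

Both forms are equivalent; the second simply lists the events as a comma-separated group.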

Perf is a toolset containing more than 20 sub-commands; the following are the most commonly used:

  • perf-list
  • perf-stat
  • perf-top
  • perf-record
  • perf-report
  • perf-trace

perf-list

perf list is used to view the performance events supported by perf, including both software and hardware events; it lists all symbolic event types.

perf list [hw | sw | cache | tracepoint | event_glob]
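For instance, to narrow the listing to hardware events or to tracepoints only (a sketch using the filters from the syntax above):

perf list hw
perf list tracepoint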

perf stat

The best way to illustrate a tool is with an example. Consider the program below. The function longa() is a long, time-wasting loop; foo1() and foo2() call it 100 and 10 times, respectively.

//t1.c
void longa()
{
    int i, j;
    for (i = 0; i < 1000000; i++)
        j = i; // am I silly or crazy? I feel boring and desperate.
}

void foo2()
{
    int i;
    for (i = 0; i < 10; i++)
        longa();
}

void foo1()
{
    int i;
    for (i = 0; i < 100; i++)
        longa();
}

int main(void)
{
    foo1();
    foo2();
    return 0;
}

and compile it:

gcc -o t1 -g t1.c

The following demonstrates the output of perf stat for program t1:

root@ubuntu-test:~# perf stat ./t1

 Performance counter stats for './t1':

        218.584169 task-clock # 0.997 CPUs utilized
                18 context-switches # 0.000 M/sec
                 0 CPU-migrations # 0.000 M/sec
                82 page-faults # 0.000 M/sec
       771,180,100 cycles # 3.528 GHz
     <not counted> stalled-cycles-frontend
     <not counted> stalled-cycles-backend
       550,703,114 instructions # 0.71 insns per cycle
       110,117,522 branches # 503.776 M/sec
             5,009 branch-misses # 0.00% of all branches

       0.219155248 seconds time elapsed

Program t1 is CPU bound, because the task-clock figure shows CPU utilization close to 1 (0.997 CPUs utilized).

Tuning t1 is therefore about finding the hotspots (the most time-consuming pieces of code) and seeing whether their efficiency can be improved. By default, besides task-clock, perf stat also reports several other commonly used statistics:

  • Task-clock: CPU utilization; a high value means most of the program's time is spent on CPU computation rather than I/O.
  • Context-switches: how many context switches occurred while the program ran; frequent switches should be avoided.
  • Cache-misses: how well the program used the cache overall; if this value is too high, the program's cache utilization is poor.
  • CPU-migrations: how many times process t1 was migrated during its run, i.e. moved from one CPU to another by the scheduler.
  • Cycles: processor clock cycles; one machine instruction may take several cycles.
  • Instructions: the number of machine instructions executed.
  • IPC: the ratio Instructions/Cycles; the larger the value, the better, indicating the program makes good use of the processor (a quick check against the output above follows this list).
  • Cache-references: the number of cache accesses; Cache-misses: the number of cache misses.

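As a quick check against the output above: IPC = Instructions / Cycles = 550,703,114 / 771,180,100 ≈ 0.71, which matches the "0.71 insns per cycle" printed by perf stat.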
By specifying the -e option you can change the default events of perf stat (events were covered in the previous section and can be viewed with perf list). If you already have tuning experience, you will often use -e to watch the specific events you care about.
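For example, if you suspect cache behavior rather than raw instruction throughput, a sketch of counting cache events for t1 instead of the defaults (the event names are the generic ones shown by perf list):

perf stat -e cache-references,cache-misses,instructions,cycles ./t1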

Some programs are slow because they do a large amount of computation and spend most of their time on the CPU; these are called CPU bound. Others spend most of their time waiting for I/O; these are called IO bound. Tuning a CPU bound program is different from tuning an IO bound one.

perf top

When you use perf stat, you usually already have a tuning target in mind, such as the boring program t1 above.

Sometimes, you just find that the system performance drops for no reason, and you don't know which process has become a greedy hog.

At this time, a command like top is needed to list all suspicious processes and find the guy who needs further review.

perf top displays performance statistics for the current system in real time. It is mainly used to observe the state of the whole system; for example, its output shows the most time-consuming kernel functions and user processes at the moment.
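Two common variations worth knowing (a sketch; 1234 stands for a hypothetical PID): -g makes perf top show call chains as well, and -p restricts sampling to a single process:

perf top -g
perf top -p 1234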

Let's devise another example to demonstrate it. Here is a quick program, t2.c:

//t2.c
int main(void)
{
    int i;
    while (1)
        i++;
}

Then compile this program:

gcc -o t2 -g t2.c

After running this program, we open another window and run perf top to see:

Events: 8K cycles
 98.67% t2 [.] main
  1.10% [kernel] [k] __do_softirq
  0.07% [kernel] [k] _raw_spin_unlock_irqrestore
  0.05% perf [.] kallsyms__parse
  0.05% libc-2.15.so [.] 0x807c7
  0.05% [kernel] [k] kallsyms_expand_symbol
  0.02% perf [.] map__process_kallsym_symbol

It's easy to spot t2 as a suspicious program that needs attention. But its modus operandi is too simple: it recklessly wastes CPU, so we don't need to do anything else to find the problem. In real life, programs that hurt performance are rarely this stupid, and we often need other perf tools for further analysis.

Using perf record and interpreting the report

After using top and stat, you probably have a rough idea. For further analysis, finer-grained information is required. For example, you have concluded that the target program is computation heavy, perhaps because some of the code is not efficient enough. Faced with a long source file, which lines of code need to be changed? This calls for perf record to record per-function statistics and perf report to display the results.

Your tuning should focus on the hotspot code fragments with a high percentage of the samples. If a piece of code accounts for only 0.1% of the program's running time, then even optimizing it down to a single machine instruction would improve overall performance by at most 0.1%. As the saying goes, good steel should be used on the blade, so no more needs to be said.

perf record -e cpu-clock ./t1
perf report

perf report output:

Events: 229 cpu-clock
100.00% t1 t1 [.] longa

As expected, the hot spot is the longa() function. But suppose the code were complicated and hard to follow: foo1() in t1 is also a potential tuning target, since why should it call that boring longa() function 100 times? Yet foo1 and foo2 do not appear in the output above at all, let alone show how they differ.

I once found that nearly half of the time in a program I wrote was spent in a few methods of the string class. std::string is part of the C++ standard library, and I am certainly not going to write better code than the STL, so all I could do was find the places where my program used strings too heavily. That requires displaying statistics according to the call relationships.

Use the -g option of perf to get the required information:

perf record -e cpu-clock -g ./t1
perf report

Output result:

Events: 270 cpu-clock
- 100.00% t1 t1 [.] longa
   - longa
      + 91.85% foo1
      + 8.15% foo2

Analysis of the call graph makes it easy to see that 91.85% of the time is spent in the foo1() function, because it calls longa() 100 times. So if longa() cannot be optimized further, the programmer should consider optimizing foo1() to reduce the number of calls to longa().
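If longa() itself did need work, a natural next step (not part of the original run, but standard perf usage) is perf annotate, which maps the samples collected by perf record back onto source lines and assembly, provided the program was built with -g as above:

perf annotate longa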

Using tracepoints

When perf samples on timer ticks, you can find the hotspots in kernel code. So when do you need to sample on tracepoints instead?

The basic reason to use tracepoints is that you care about the runtime behavior of the kernel. As mentioned earlier, some kernel developers need to focus on specific subsystems, such as the memory management module, which requires statistics on the relevant kernel functions. In addition, the impact of kernel behavior on application performance cannot be ignored:

Taking the earlier example: if I could go back, what I would want to do is count how many system calls occurred while the application was running, and where they happened.

Below I use the ls command to demonstrate the use of the tracepoint sys_enter:

root@ubuntu-test:~# perf stat -e raw_syscalls:sys_enter ls
bin libexec off perf.data.old t1 t3 tutong.iso
bwtest minicom.log perf.data pktgen t1.c t3.c

 Performance counter stats for 'ls':

               111 raw_syscalls:sys_enter

       0.001557549 seconds time elapsed

This report details how many system calls occurred during the ls run (111 in the above example).
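To answer the "where did it happen" part, one option (a sketch) is to record the same tracepoint together with call graphs and then browse the report; perf trace, perf's strace-like mode, is another way to see each call as it happens:

perf record -e raw_syscalls:sys_enter -g ls
perf report
perf trace ls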

2. Analysis of common performance problems

Performance testing is roughly divided into the following steps:

  1. Requirements analysis
  2. Script preparation
  3. Test execution
  4. Results collation
  5. Problem analysis

Requirement description: a service loads a 1 GB vocabulary file into memory at startup. When a request arrives, the requested word is fuzzy-matched against the vocabulary; if a match is found, the service sends an HTTP request to a backend service, and when the data comes back it returns the response to the client while recording the request's unique identifier and a request-count mark in MySQL.

There are several key functions:

  • Fuzzy matching (fuzzyMatching)
  • Backend request (sendingRequest)
  • Response assembly (buildResponse)
  • Recording the request-count mark in MySQL (signNum)

Questions and Analysis:

The first group: with completely random request words, QPS reaches 1k; the server shows nothing abnormal, and CPU, memory, and bandwidth are not saturated, yet QPS cannot increase further;

  • Analysis: since this service talks to another service at its backend, it must be confirmed before the stress test that the backend service will not become a bottleneck. The current situation is most likely that the backend service is limiting the performance of the service under test. Check the metrics of the machine hosting the backend service, or check its connection status: in general, when the backend cannot keep up and the service under test keeps sending requests, more and more connections will sit in the TIME_WAIT state (a quick way to check this is shown below).
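A quick way to check this on the machine under test (a sketch; ss ships with the iproute2 package, and the first line of its output is a header):

ss -tan state time-wait | wc -l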

The second group: after the backend service problem is solved, the load test is repeated with request words averaging 30 characters; when QPS reaches 400, the CPU is fully loaded;

  • Analysis: this is clearly the computation cost of the fuzzyMatching function saturating the CPU, so QPS cannot rise and response time keeps growing. At this point perf plus a flame graph (described in section 2.1 below) can identify the functions with long processing times within a request. It is also worth evaluating whether the test data is reasonable: if the average online request word is only 2 characters, this test group is clearly unreasonable and asking developers to optimize for it would be a waste of time; if the test data is judged reasonable, rerun the test with shorter words to verify the guess.

The third group: after solving the two problems above, with completely random request words, QPS falls back to 1k after reaching 3k, then climbs to 3k again, and so on;

  • Analysis: watch all the metrics here. If the problems above have been ruled out, slow MySQL operations are the most likely cause. For a system that needs high concurrency, reading and writing MySQL directly is not a smart design; a layer of Redis cache is usually added. This is also an example of a performance problem caused by unreasonable design during development.

The fourth group: replace the backend with the real service and run an overall stress test; QPS tops out at only 300, and checking the metrics shows that the ingress bandwidth is saturated;

  • Analysis: this time the problem is obvious: the content returned by the backend service is too large, saturating the bandwidth. Again the requirement needs to be evaluated: 1. Is all of the data returned by the backend actually needed? 2. Is upgrading to a 10GbE NIC cost-effective? 3. Can bandwidth usage be reduced by technical means, such as fanning one request out into multiple requests across several groups of services?

2.1 Using perf + flame graphs to locate function-level problems

Here is a brief introduction to using perf and flame graphs to locate performance problems intuitively:

perf

Perf has many performance analysis capabilities. For example, it can compute the number of instructions per clock cycle (IPC); a low IPC indicates the code is not making good use of the CPU. Perf can also sample a program at the function level to find out where its performance bottleneck is. It can also replace strace, add dynamic kernel probe points, and run benchmarks to measure, for example, the quality of the scheduler.

  • Example usage: perf record -e cpu-clock -g -p 11110 -o data/perf.data sleep 30
  • -g tells perf record to also record the call relationships between functions; -e cpu-clock selects the cpu-clock software event to sample on; -p specifies the PID of the process to record; -o sets the output file; the trailing sleep 30 limits the recording to 30 seconds.
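A sketch of the other capabilities mentioned above (vfs_read is just an illustrative kernel function; 11110 is the same example PID as before):

perf trace -p 11110
perf probe --add vfs_read
perf bench sched messaging

perf trace gives an strace-like view of a running process, perf probe adds a dynamic kernel probe point that can then be recorded as an event, and perf bench runs a built-in scheduler benchmark.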

Generate a flame graph

1. The first step: use a load testing tool to drive the program to its inflection point (the load at which performance stops scaling);

$ sudo perf record -e cpu-clock -g -p 11110
After stopping with Ctrl+C, the sampling data perf.data is generated in the current directory.

2. The second step: use perf script to turn perf.data into readable text:

perf script -i perf.data &> perf.unfold

3. The third step: Fold the symbols in perf.unfold:

./stackcollapse-perf.pl perf.unfold &> perf.folded

4. Finally generate the svg image:

./flamegraph.pl perf.folded > perf.svg

At this point, the function call flame graph can be generated, as shown in the following figure:

[Figure: flame graph generated from the sampled data]

Native perf can profile C/C++ programs directly, and compiling a debug build of the program usually exposes more information. Flame graphs for Java, Go, and other languages can be generated with their own dedicated tools; the principle is the same. With a flame graph, you can easily see which functions take the longest to process and thus locate the problem.
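For completeness, a minimal end-to-end sketch of the pipeline, assuming the stackcollapse-perf.pl and flamegraph.pl scripts come from Brendan Gregg's FlameGraph repository and 11110 is the PID of the process being profiled:

git clone https://github.com/brendangregg/FlameGraph
perf record -e cpu-clock -g -p 11110 sleep 30
perf script -i perf.data > perf.unfold
FlameGraph/stackcollapse-perf.pl perf.unfold > perf.folded
FlameGraph/flamegraph.pl perf.folded > perf.svg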


Copyright statement: This article is an original article written by Zhihu blogger "Playing with the Linux Kernel". It follows the CC 4.0 BY-SA copyright agreement. For reprinting, please attach the original source link and this statement.
Original link: https://zhuanlan.zhihu.com/p/638791429
