Performance analysis under Linux 3: perf

==Introduction==

The tracing method of ftrace is exhaustive: it measures the interval from every event to the next, lays those intervals out on a time axis, and thereby shows how the whole system's execution is distributed over time.

This is accurate but expensive. We therefore also need a sampling-based tracing method, and perf provides one.

The principle of perf is as follows: at a fixed interval, an interrupt is raised on each CPU core. In the interrupt handler we look at which pid and which function are currently running and bump a counter for that pid and function. Over time, this tells us what percentage of CPU time is spent in each pid or each function.

This is clearly a sampling mode. The expectation is that the more CPU time a function consumes, the more likely it is to be hit by the clock interrupt, so a high hit count lets us infer that the function (or pid, etc.) has high CPU usage.
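The counting half of this principle is trivial to sketch. Assuming a made-up sample stream where each line is "pid function", one line per clock-interrupt hit (not real perf output), the aggregation is just:

```shell
# Toy illustration of the counting step (made-up "pid function" sample lines,
# one per clock-interrupt hit -- this is NOT real perf output).
printf '%s\n' \
    '1203 heavy_cal' '1203 heavy_cal' '1203 memcpy' \
    '877 heavy_cal'  '877 idle_poll' |
awk '{ hits[$2]++; total++ }
     END { for (f in hits) printf "%-12s %5.1f%%\n", f, 100 * hits[f] / total }' |
sort -k2 -rn
```

heavy_cal ends up with 60% of the hits; perf top does essentially this bookkeeping, just driven by real interrupts instead of canned lines.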

The method extends to all kinds of events, such as the ftrace events introduced in the previous post: whenever the event fires, perf records who was hit, and from the resulting distribution we learn who triggers that event most often.

Of course, if a process is lucky enough to dodge the sampling point every single time, the statistics can be completely wrong. This is a hazard every sampling method faces.

Take the sched_switch tracepoint we used when introducing ftrace as an example. With the tracepoint as the probe point, every time the kernel hits it we record who caused it (this is only useful for classifying by pid; classifying by function is pointless, since the tracepoint's location is fixed):

sudo perf top -e sched:sched_switch -s pid

More commonly, though, perf uses the CPU's PMU counters. The PMU (performance monitoring unit) is a facility most CPUs provide; its counters can count things such as L1 cache misses and branch mispredictions, and the PMU can raise an interrupt when a counter crosses a threshold. On that interrupt we can sample exactly as we do with the clock, and judge which function in the system causes the most cache misses, branch mispredictions, and so on.

The following command samples on branch mispredictions and shows a live result:

sudo perf top -e branch-misses

From this we can see which functions in the system cause the most branch mispredictions, and consider whether inserting a few likely()/unlikely() hints into them would help.

Readers will also have noticed perf's biggest advantage over ftrace: it can trace every program in the whole system, not just the kernel. So perf is usually the first step of an analysis: we look at the outline of the whole system first, then drill into specific scheduling or latency issues. perf itself also hints at whether scheduling is healthy; for example, if the kernel scheduler's functions show an unusually high share, we know the scheduling path needs analysis.

==Using perf==

perf's source code lives in the Linux source tree, because it is closely tied to the kernel and uses the kernel's header files. It is not built when you build the kernel, though; you must enter tools/perf yourself and run make.

perf supports many optional features, and make probes for them automatically. For example, collecting tracepoint events as we did earlier requires the libtraceevent library on your system. perf is designed with a high degree of freedom here: missing libraries cost you features without breaking the basics.

Since perf is tied to the kernel, in theory you should use the perf built for whichever kernel you run, which guarantees interface consistency. That is why distributions like Ubuntu make you install a perf package per kernel: the perf in your PATH is actually a wrapper script that dispatches to the perf version matching the running kernel.

But that is only the theory. In practice, perf's user-kernel interface is quite stable, and using it across versions is usually fine. Because perf is still developing quickly and many distribution builds leave features disabled, I often just grab the latest kernel and compile perf from it; in my experience this works, though readers should apply the experience with some caution. perf has few path dependencies: you do not even need to install it after compiling, just invoke your build by absolute path.
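As a sketch of that build-and-run-in-place workflow (the source path below is illustrative, not a fixed location):

```shell
#!/bin/sh
# Sketch: building perf out of a kernel source tree and running it in place.
# The SRC path is illustrative; point it at wherever you unpacked the kernel.
SRC=${SRC:-$HOME/src/linux}
if [ -d "$SRC/tools/perf" ]; then
    make -C "$SRC/tools/perf"          # feature detection runs during make
    "$SRC/tools/perf/perf" --version   # usable by absolute path, no install step
else
    echo "kernel source not found at $SRC"
fi
```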

==General Tracking==

We have already seen a few perf examples. Like multi-function tools such as git and docker, perf follows the perf <subcommand> pattern. The first things anyone needs to learn are the two simplest commands: perf list and perf top.

perf list lists all the events perf supports on your system.

Older versions also listed every tracepoint, but the list is too long and newer versions omit them; readers can look them up in ftrace directly.

perf top dynamically collects samples and updates a statistics list. Like many other perf commands it takes many parameters, but two are essential:

1. -e specifies the event to track

-e accepts any event shown by perf list (including the unlisted tracepoints), and multiple -e options can track several events at once (they are displayed separately).

A single -e can also name multiple events, separated by commas:

sudo perf top -e branch-misses,cycles

(The events in perf list were contributed by CPU vendors to the Linux community, but some vendors have additional event counters that were never upstreamed; those you must dig out of the vendor's manual. Such events can be given directly as a number in the rXXXX format. For example, on our chip 0x13 means a cross-die memory access, so -e r0013 tracks the software's cross-die access count.)

An event can take a modifier suffix. For example, to track only branch mispredictions that occur in user mode:

sudo perf top -e branch-misses:u,cycles

To apply the modifier to every event at once, you can also write:

sudo perf top -e '{branch-misses,cycles}:u'

The perf-list manual documents more modifiers, which I use rarely. If you are interested, dig deeper, and please let me know of any good experience.

2. -s specifies what parameters are used to classify

-s is optional; by default perf classifies by function. To classify by pid, as in the earlier example, you need -s. It can also take multiple sort keys, separated by commas:

sudo perf top -e 'cycles' -s comm,pid,dso

perf top is good for getting a feel for what perf can do, but in practice it is used less than perf record and perf report: perf record starts a trace, and perf report renders the result.

The general process is:

sudo perf record -e 'cycles' -- myapplication arg1 arg2
sudo perf report

Here is an example report:

perf record writes a perf.data file into the current directory (an existing one is renamed perf.data.old). A subsequent perf report reads it and prints the statistics. perf.data holds only raw samples; perf report needs the local symbol tables and the pid-to-process mapping to build the report, so perf.data cannot simply be copied to another machine. You can, however, bundle all of that with the perf-archive command and analyze it elsewhere.

Note that perf-archive is the standalone command perf-archive, not the subcommand perf archive. It is produced when you compile the perf sources, so if your distribution does not ship it you can build it yourself. Unfortunately, the archived data cannot be used across platforms: data captured on an arm machine cannot be analyzed on x86.

Because the previous perf.data is preserved, perf can support the perf diff command, which compares two runs. You can run your program with different parameters and diff the results. Taking the cs program from earlier as an example, comparing a 4-thread run against a 2-thread run gives:

We can see that after adding threads, the share of heavy_cal dropped by a significant 10.70%, while other changes were minor.
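The whole compare-two-runs flow can be sketched as follows (assumes perf and root privileges on a real machine; ./cs and its thread-count argument stand in for the example program from the earlier posts):

```shell
#!/bin/sh
# Sketch: comparing two runs with perf diff.
# ./cs and its argument are placeholders for the example program; this needs
# perf installed and root privileges on a real system.
if command -v perf >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    perf record -- ./cs 2    # first run writes perf.data
    perf record -- ./cs 4    # second run: perf.data -> perf.data.old, new perf.data
    perf diff                # compares perf.data.old against perf.data
else
    echo "needs perf and root privileges"
fi
```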

perf record is not limited to tracing processes it launches itself: given a pid, it can attach directly to a fixed set of processes. Also, note that the traces above only capture events on specific pids. For many workloads, such as a web server, you actually care about the whole system: the network stack takes some CPU, the server itself takes some, and the storage subsystem takes some, and the network and storage portions do not necessarily run under your server's pid. For system-wide tuning we therefore usually add -a to perf record, which traces the entire system. Tracing the same cs program with -a, for example, gives a very different result:

Note the Command column: it no longer contains only the cs process.
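A minimal sketch of both modes (pid 1 and the five-second window are arbitrary choices for illustration; needs perf and root on a real system):

```shell
#!/bin/sh
# Sketch: attaching to an existing pid vs. whole-system sampling.
# pid 1 and the 5-second window are arbitrary stand-ins.
if command -v perf >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    perf record -e cycles -p 1 -- sleep 5   # sample only pid 1 for five seconds
    perf record -e cycles -a -- sleep 5     # sample every task on every CPU
else
    echo "needs perf and root privileges"
fi
```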

perf report presents a menu interface that can drill down into each function's code. For example, to expand the per-instruction counts of the heavy_cal() function above, press Enter on it and choose annotation:

perf record has further knobs; for example, -c sets how many events elapse between samples (the sample period). Readers can consult the manual.

Alongside perf record/report there is also the perf stat command, which computes no distribution but simply counts events, similar to this:
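A typical invocation looks like this (the event aliases are generic and may differ per CPU; check perf list, and note that restricted environments may not expose the PMU at all):

```shell
#!/bin/sh
# Sketch: plain event counting with perf stat over one command run.
# Event aliases vary by CPU; the capability probe below skips gracefully
# where perf or the PMU is unavailable.
if command -v perf >/dev/null 2>&1 && perf stat -e cycles -- true >/dev/null 2>&1; then
    perf stat -e cycles,branch-misses -- sleep 1
else
    echo "perf unavailable here (install the linux-tools package matching your kernel)"
fi
```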

In general, I don't think this feature is useful.

==Stack Trace==

One illusion in perf's output deserves attention. Suppose function abc() calls another function def(). In perf's statistics the two are counted separately; that is, def's execution time is not attributed to abc. Diagrammatically:

Here abc() is hit 5 times, def() 5 times, and ghi() once. This can be quite misleading: abc looks computationally light, but it is not, because def and ghi should be counted towards it.

But that raises another problem: def may be called not only by abc but by others as well. In that case, how do we know who is responsible?

For that, we can enable stack tracing: each time a sample hits, perf walks back up the call stack, crediting the callers as well, which makes the problem easier to see. The principle looks like this:

Now abc is hit 11 times, def 6 times, and ghi once, which makes the bottleneck considerably easier to locate. The -g option enables such a trace; here is an example:

With stack tracing enabled, start_thread rises to the top, because it is the caller of heavy_cal.

Note that stack tracing is limited by the unwind depth: stacks that are too deep may fail to unwind completely, which can distort the results.
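The two counting modes can be replayed on toy data matching the abc/def/ghi numbers above (invented call chains, caller first, not real perf output):

```shell
# Toy illustration of self vs. stack-traced counting. Each line is one sample's
# invented call chain, caller first: 5 samples inside abc, 5 inside abc->def,
# 1 inside abc->def->ghi.
samples='abc
abc
abc
abc
abc
abc def
abc def
abc def
abc def
abc def
abc def ghi'
echo '-- self (no -g): only the innermost frame is credited'
echo "$samples" | awk '{ c[$NF]++ } END { for (f in c) print f, c[f] }' | sort
echo '-- inclusive (-g): every frame on the chain is credited'
echo "$samples" | awk '{ for (i = 1; i <= NF; i++) c[$i]++ } END { for (f in c) print f, c[f] }' | sort
```

The first awk yields abc 5, def 5, ghi 1; the second yields abc 11, def 6, ghi 1, matching the two diagrams above.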

Another problem is that some calls in the source code are not function calls at the assembly level. Inline functions and macros generate no call; on many platforms gcc also inlines very short functions automatically, likewise producing no call. And functions using register-based (fastcall-style) parameter passing may leave no frame on the stack. None of these can be recovered by walking the call stack.

Stranger still, some platforms use a simplified unwinding heuristic: any value on the stack that looks like a code-segment address is treated as a return address. This can corrupt the backtrace badly. You need a good command of your system's ABI to keep stack tracing under control.

==Other functions==

perf is now the main performance analysis tool on Linux, and nearly every kernel release brings major updates. It has even grown benchmarking features, special-purpose commands such as perf-mem, and the perf script command, which generates scripts so you can analyze results with different scripting languages. Readers can consult the manual; with the foundation above, these features are easy to pick up.

But perf script deserves special mention. Although it nominally just generates analysis scripts, we often use it to dump the raw sample data. After a perf-record, readers can simply run:

sudo perf script

Every hit point is listed, and what you do with that data is limited only by your imagination.
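For instance, tallying samples per command from such output is a one-liner (the three lines below are a hand-written approximation of perf script's default line format, which varies with the event and options used):

```shell
# Toy post-processing of perf-script-style output. The sample lines are
# hand-written approximations, roughly: comm pid [cpu] time: count event: ip sym (dso)
printf '%s\n' \
    'cs   1203 [000] 100.000001: 250000 cycles: ffffffff81 heavy_cal (./cs)' \
    'cs   1203 [001] 100.000101: 250000 cycles: ffffffff82 memcpy (/lib/libc.so.6)' \
    'sshd  877 [000] 100.000201: 250000 cycles: ffffffff83 tcp_sendmsg ([kernel])' |
awk '{ hits[$1]++ } END { for (c in hits) print c, hits[c] }' | sort
```

The same awk skeleton works on real perf script output piped straight in.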

==Defects tracked by perf==

As emphasized before, perf tracing is sampling-based, so we must be very careful about sampling itself: with a poorly chosen model, the entire analysis can be wrong. We must always be prepared for that.

In particular, every time you read a perf report, first check how many samples were collected in total. If there are only a few dozen, the report may not be trustworthy.

Also keep in mind that modern CPUs rarely busy-wait. When the CPU goes idle (no runnable task, which happens whenever utilization is below 100%), sampling stops too, so no samples land on idle time (samples attributed to the idle function itself reflect the work of entering and leaving idle, not the idle time). perf statistics therefore cannot be used to measure CPU utilization; tools like ftrace and top can do that, but perf cannot.

Another problem is perf's reliance on interrupts. Many perf events depend on interrupts, but the Linux kernel can disable them. While interrupts are off, no samples land in that region: the sampling interrupt is delayed until interrupts are re-enabled. On such platforms you will therefore see the functions running right after interrupts are re-enabled hit intensively, even though they are innocent. Worse, if several events overflow while interrupts are off, the interrupt controller merges identical interrupts, so samples are lost and the statistics are skewed.

Modern Intel platforms deliver PMU interrupts as NMIs (non-maskable interrupts), so this problem does not arise there. On most ARM/ARM64 platforms, however, it remains unsolved, so read reports from those platforms very carefully: if you see functions such as _raw_spin_unlock() hit extremely hard, doubt your results (they are still usable, it just depends on how you use them).

==Summary==

This article introduced the basic usage of perf. perf is usually the first step of a performance analysis, but that step is not trivial to do well: grasp its principle first, then use perf step by step to verify the guesses of your analysis model; only then can the real problem be found.

Origin blog.csdn.net/m0_54437879/article/details/131727004