Summary of basic knowledge of perf(1) for Linux performance analysis

Summary of perf (1) basic knowledge of Linux (09)

Author: Onceday Date: January 31, 2023

The long road has only just begun...

1 Overview

The figure below, from Brendan Gregg's overview of Linux analysis tools, shows which tools apply to each part of the system; nearly every component is covered.

[Figure: application scenarios of Linux analysis tools, by Brendan Gregg]

This article covers only perf, and focuses on using it to understand where programs spend time and resources on Linux systems.

1.1 Perf background history

perf, short for Performance Events, is a powerful profiling tool used mainly to measure and analyze software performance on Linux. It is built on the performance monitoring support in the Linux kernel and helps developers find bottlenecks, optimize code, and improve overall system performance.

perf dates back to 2008, when the Linux kernel community began developing a kernel performance monitoring subsystem (perf_events, originally called Performance Counters for Linux) on top of the CPU's Performance Monitoring Unit (PMU) to make kernel performance analysis easier. Through continuous contributions from the kernel community, it has matured into an essential performance analysis tool on Linux.

perf offers a wide range of functionality, including but not limited to:

  1. CPU performance monitoring: perf can collect and report various CPU performance events, such as cache hit rates and branch mispredictions, to help developers understand how efficiently a program runs.

  2. Memory performance monitoring: perf can analyze a program's memory access patterns, such as memory allocation and release, to help developers optimize memory usage.

  3. System call tracing: perf can trace a program's system calls to help developers find potential performance problems.

  4. Hot function analysis: by sampling function calls during execution, perf can find the functions that consume the most execution time, pointing developers toward what to optimize.

  5. Hardware event statistics: perf can also count hardware events, such as CPU cache and branch prediction events, to evaluate how efficiently a program uses the hardware.

Since its beginnings in 2008, perf has gone through many important updates:

  • Kernel 2.6.31: the initial version of perf is released, supporting basic performance monitoring.

  • Kernel 3.0: dynamic tracing is supported, making perf more flexible.

  • Kernel 3.7: the perf trace command is introduced for system call tracing.

  • Kernel 4.10: the perf c2c command is introduced for cache-line-level analysis.

As the Linux kernel evolves, perf keeps expanding and improving to cover more performance analysis needs.

perf is tied to the Linux kernel's perf_events interface, so it does not run natively on other Unix-like systems such as FreeBSD or macOS. Even on Linux, its functionality and behavior vary with the kernel version and hardware architecture, so the characteristics and limitations of the target platform need to be considered when using perf for performance analysis.

1.2 PMU (Performance Monitoring Unit)

The PMU (Performance Monitoring Unit) is an important part of a modern CPU. It monitors and collects hardware events related to processor performance, helping developers understand how a program runs on a specific processor so that targeted optimizations can be made.

A PMU usually consists of multiple hardware performance counters (HPCs). These counters count how many times particular hardware events occur, such as:

  • CPU cycles
  • Instructions
  • Cache hits and misses
  • Correct and incorrect branch predictions

When profiling with the PMU, you need to choose which events to monitor. The PMU then records the number of times the event occurred in the corresponding performance counter. By collecting and analyzing this data, program bottlenecks can be identified.

On Linux, perf can access the PMU and use the performance data it provides. perf offers a range of commands and options for configuring the PMU and analyzing its data.

Note that PMUs differ between processors: the supported event types and the number of performance counters may vary. perf abstracts over most popular processors, however, so a unified interface can be used for performance analysis across many of them.

1.3 Principle of Perf

perf is powerful: it measures CPU performance counters, tracepoints, kprobes and uprobes (dynamic tracing). It is capable of lightweight profiling. It is also included in the Linux kernel under tools/perf and is frequently updated and enhanced.

perf started as a tool for working with the performance counter subsystem in Linux and has been variously enhanced to add tracing capabilities. Performance counters are CPU hardware registers that count hardware events such as executed instructions, cache misses, or branch mispredictions. They form the basis for analyzing applications to track dynamic control flow and identify hot spots. Perf provides rich generic abstractions for hardware-specific functionality. Among other things, it provides per-task, per-CPU and per-workload counters and samples based on these counters and source code event annotations.

Tracepoints are instrumentation points placed at logical locations in the code, such as system calls, TCP/IP events, and file system operations. When not in use they have negligible overhead, and they can be enabled with the perf command to collect information including timestamps and stack traces. perf can also use the kprobes and uprobes frameworks to create tracepoints dynamically, which enables dynamic tracing in both kernel and user space.

  • kprobes (Kernel Probes) is a dynamic tracing technique for tracing kernel code execution. By inserting kprobes into the kernel, developers can monitor calls to and returns from kernel functions and collect information about key events, such as argument values and return values.

    kprobes are built on the kernel's breakpoint mechanism. When a kprobe is inserted, the kernel places a breakpoint instruction at the target address. When execution reaches that instruction, an exception is raised and control transfers to the kprobe's handler. The handler can access the current CPU registers and memory to obtain event information. After the handler finishes, execution resumes from the target address.

    The advantage of kprobes is that kernel code can be traced dynamically at runtime, without modifying or recompiling the kernel. Their impact on system performance is also small, since they fire only when the probed events occur.

  • uprobes (User Probes) is a dynamic tracing technique for tracing the execution of user-space code. Like kprobes, uprobes can be inserted at runtime to monitor function calls and returns in user programs.

    uprobes work much like kprobes and are also based on the breakpoint mechanism. The difference is that the breakpoints are placed in user-space code, so the kernel and user space have to cooperate. When the user program reaches a breakpoint, an exception is raised and control transfers to the uprobe's handler. The handler can access the current CPU registers and memory to obtain event information. After the handler finishes, execution resumes from the target address.

    The advantage of uprobes is that user-space code can be traced dynamically at runtime, without modifying or recompiling the program. As with kprobes, the impact on system performance is small.

perf has two working modes:

  • Counting mode: accurately counts how the values of CPU hardware counters change over a period of time. To count the events the user is interested in, perf programs the relevant performance control registers and reads their values once the monitoring period is over.
  • Sampling mode: obtains performance data by periodic sampling. perf configures an overflow period for the chosen events on the PMU counters. Each time a counter overflows, related data is captured, such as the instruction pointer (IP), general-purpose registers, and EFLAGS.
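
To make the difference between the two modes concrete, here is a small Python sketch. It is a toy model, not how perf itself is implemented: the trace, the function names hot_loop and setup, and the period of 100 are all made up for illustration.

```python
from collections import Counter

def count_events(trace):
    """Counting mode: only the total number of events over the run is kept."""
    return len(trace)

def sample_events(trace, period):
    """Sampling mode: every `period`-th event overflows the counter and
    records where the program was at that moment (here, a function name)."""
    samples = Counter()
    counter = 0
    for location in trace:
        counter += 1
        if counter == period:        # counter overflow
            samples[location] += 1   # capture the current location
            counter = 0              # re-arm for the next period
    return samples

# Hypothetical trace: each entry names the function running when one event occurred.
trace = ["hot_loop"] * 900 + ["setup"] * 100

total = count_events(trace)
samples = sample_events(trace, period=100)
print(total)
print(dict(samples))
```

Counting mode yields one exact total, while sampling mode yields an approximate breakdown by location; a shorter period gives a finer breakdown at the cost of more overhead.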

1.4 Perf performance events

Perf events fall roughly into the following categories:

  • Hardware events: generated by the processor hardware itself. For example:
    • instructions: number of instructions executed
    • cycles: number of processor cycles
    • cache-references: number of cache references
    • cache-misses: number of cache misses
    • branch-instructions: number of branch instructions
    • branch-misses: number of mispredicted branches
  • Hardware cache events: related to the processor's caches, such as:
    • L1-dcache-loads: number of L1 data cache loads
    • L1-dcache-load-misses: number of L1 data cache load misses
    • L1-dcache-stores: number of L1 data cache stores
    • L1-dcache-store-misses: number of L1 data cache store misses
    • L1-dcache-prefetches: number of L1 data cache prefetches
    • L1-dcache-prefetch-misses: number of L1 data cache prefetch misses
  • Software events: generated by the operating system, such as:
    • context-switches: number of context switches
    • cpu-migrations: number of CPU migrations
    • page-faults: number of page faults
    • minor-faults: number of minor page faults
    • major-faults: number of major page faults
  • Tracepoint events: generated by tracepoints in the kernel. The complete list can be obtained with the perf list command. Tracepoint events can be used to monitor kernel subsystems, for example scheduler behavior and memory management.
  • Probe events: user-defined events, inserted dynamically (via kprobes or uprobes).

1.5 Properties of Perf performance events

The following text is taken from the article "perf learning summary" on Zhihu (zhihu.com).

Hardware performance events are supported by the PMU in the processor. Because modern processors run at very high clock frequencies and have deep pipelines, hundreds of instructions may have passed through the pipeline between the moment a performance event is triggered and the moment the processor responds to the resulting PMI (performance monitoring interrupt). The instruction address captured in the PMI handler is therefore no longer the address of the instruction that triggered the event, and the deviation can be severe. To solve this, Intel processors implement high-precision event sampling through the PEBS mechanism. With PEBS, the hardware saves the processor state directly to memory when the counter overflows (instead of the registers being saved when the interrupt is handled), so perf can collect the address of the instruction that actually triggered the event, which improves sampling accuracy. By default, perf does not use PEBS.

To use high-precision sampling, append the suffix ":p" or ":pp" to the event name when specifying a performance event. perf defines 4 levels of sampling precision, listed below.

  • 0: no precision guarantee
  • 1: the deviation between the sampled instruction and the instruction that triggered the event is constant (:p)
  • 2: the deviation between the sampled instruction and the instruction that triggered the event is requested to be 0 (:pp)
  • 3: the deviation between the sampled instruction and the instruction that triggered the event must be 0 (:ppp)

Current x86 processors, both Intel and AMD, can only achieve the first three precision levels.

In addition to the precision level, performance events also have several other attributes, which can be specified by "event:X".

  • u: Only count performance events triggered by user space programs
  • k: Only count performance events triggered by the kernel
  • h: Only count performance events triggered by the Hypervisor
  • G: In the KVM virtual machine, only the performance events triggered by the Guest system are counted
  • H: Only count performance events triggered by the Host system
  • p: precision level

2 Perf commands in practice

2.1 perf command

The parameters supported by this command are as follows:

# General form of the perf command
perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
# Currently supported options:
	--help, Run perf help command.
	--version, Display perf version information.
	-vv, Print the compiled-in status of libraries.
	--exec-path, Display or set exec path.
	--html-path, Display html documentation path.
	--paginate, Set up pager.
	--no-pager, Do not set pager.
	--debugfs-dir, Set debugfs directory or set environment variable PERF_DEBUGFS_DIR.
	--buildid-dir, Setup buildid cache directory. It has higher priority than buildid.dir config file option.
	--list-cmds, List the most commonly used perf commands.
	--list-opts, List available perf options.
	--debug, Setup debug variable (see list below) in value range (0, 10).

Debug variables can be set as follows:

--debug verbose      # sets verbose = 1
--debug verbose=2    # sets verbose = 2
# Variables that can be set this way, such as verbose, are:
    verbose          - general debug messages
    ordered-events   - ordered events object debug messages
    data-convert     - data convert command debug messages
    stderr           - write debug output (option -v) to stderr in browser mode
    perf-event-open  - Print perf_event_open() arguments and return value

perf is a user-space tool set made up of many subcommands. The list below describes the basic function of each:

  • annotate: Reads perf.data and analyzes a specific function in detail, displaying source-level or assembly-level information.
  • archive: Creates an archive containing the binaries, debug information, and build IDs needed for analysis. The archive lets perf report be run on other systems, especially ones without the original binaries and debug information.
  • bench: Runs built-in micro-benchmarks to evaluate and compare performance under different system and kernel configurations. The predefined benchmarks cover the scheduler, memory subsystem, lock operations, NUMA access, and more.
  • buildid-cache: Manages perf's build ID cache. A build ID is a unique identifier that associates an executable, shared library, or kernel module with its debug information. perf buildid-cache can add, remove, or list entries in the cache.
  • buildid-list: Lists the build IDs found in a perf.data file. perf record stores these build IDs in perf.data when it collects performance data.
  • c2c: The cache-to-cache command analyzes memory accesses and cache line hits and misses, focusing on cache line contention and false sharing on multicore systems. False sharing occurs when multiple cores access different data on the same cache line, causing the line to migrate frequently between cores and reducing overall performance.
  • config: Queries and modifies perf's configuration options, which control behaviors such as default event types, display styles, and color schemes. The configuration is stored in a perfconfig file, usually in the user's home directory (~/.perfconfig) or system-wide (/etc/perfconfig).
  • daemon: Runs record sessions in the background.
  • data: Processes and manages the performance data files produced by perf (typically perf.data). These files hold the samples collected by perf record and are the input for further analysis and reporting.
  • diff: Compares two or more perf.data files to find performance changes, by computing the differences in event counts and other metrics. This helps identify the cause of a change quickly when optimizing code or adjusting system configuration.
  • evlist: Displays the list of events configured in a perf.data file, that is, the events perf record logged while collecting data. They can be hardware counter events (CPU cycles, cache hits and misses, etc.) or software events (context switches, page faults, etc.).
  • ftrace: A simple wrapper for the kernel's ftrace functionality.
  • inject: Reads an existing perf.data file and inserts or modifies event records in it, so the data can be prepared for further analysis without being collected again.
  • iostat: Shows I/O performance metrics in real time to help identify possible I/O bottlenecks. It is an additional perf command that produces reports similar to the traditional iostat command, based on data collected by perf.
  • kallsyms: Displays kernel symbol table information. The kernel symbol table holds the names of kernel functions, variables, and other kernel objects together with their addresses in memory, which helps in understanding the kernel's structure and runtime behavior.
  • kmem: Analyzes the behavior of the kernel memory allocator, which manages memory in kernel space. perf kmem helps identify allocator bottlenecks and problems such as memory leaks.
  • kvm: Analyzes and reports the performance of KVM (Kernel-based Virtual Machine) environments. KVM is an open source virtualization technology on Linux that runs multiple virtual machines (guests) on one physical host. perf kvm helps identify bottlenecks in the virtual environment.
  • list: Lists the performance events available to perf record and other perf subcommands. These include hardware events (CPU cycles, cache misses, etc.), software events (context switches, page faults, etc.), and tracepoint events.
  • lock: Analyzes lock contention and lock-related performance issues. Locks are synchronization primitives that keep shared resources consistent across threads, but they can cause contention and become bottlenecks.
  • mem: Analyzes memory access performance, including access latency, bandwidth, and cache line hit rates.
  • record: Records performance events, collecting data according to the user-specified event type, sampling frequency, and target application or process.
  • report: Analyzes and displays the data collected by perf record. perf report reads perf.data (by default) and generates a report with various statistics and metrics.
  • sched: Analyzes and debugs the behavior of the Linux scheduler, which manages the execution of processes and threads.
  • script: Processes and displays the event data collected by perf record. Unlike perf report, perf script shows the data in raw form, which is especially useful for custom analysis, generating timelines, and integrating with other tools.
  • stat: Collects and displays performance statistics for a specified program or for the whole system. It can monitor hardware events (CPU cycles, cache hits and misses, etc.) as well as software events (context switches, process migrations, etc.).
  • test: Checks that perf functions correctly by running a series of built-in self-tests on the current system.
  • timechart: Visualizes system-level performance data. It generates a time chart (an SVG file) from recorded event data, showing system activity and resource utilization over time.
  • top: Displays, in real time, the functions that consume the most CPU time in the system, showing which functions or code fragments affect performance most while programs run.
  • version: Displays the version information of the perf executable.
  • probe: Dynamically adds probe points so that data can be collected at specific functions or code locations.
  • trace: Traces and logs system calls, signals, and other kernel events. It helps analyze how a program interacts with the operating system, diagnose system call errors, and locate problems.

2.2 perf list: viewing supported performance events

The perf tool supports a list of measurable events. The tool and the underlying kernel interface can measure events coming from different sources. For instance, some events are pure kernel counters, in which case they are called software events. Examples include context-switches and minor-faults.

Another source of events is the processor itself and its Performance Monitoring Unit (PMU). It provides an event list to measure microarchitectural events such as cycle counts, instruction retirements, L1 cache misses, etc. These events are called PMU hardware events or simply hardware events. They vary by processor type and model.

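The perf_events interface also provides a small set of common hardware event names. On each processor, these are mapped onto actual events provided by the CPU if they exist; otherwise the event cannot be used. Somewhat confusingly, these are also called hardware events and hardware cache events.

Finally, there are tracepoint events, which are implemented by the kernel's ftrace infrastructure. They are only available on 2.6.3x and newer kernels.

The command's help output is as follows: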

onceday->~:# perf list -h

 Usage: perf list [<options>] [hw|sw|cache|tracepoint|pmu|sdt|metric|metricgroup|event_glob]

    -d, --desc            Print extra event descriptions. --no-desc to not print.
    -v, --long-desc       Print longer event descriptions.
        --debug           Enable debugging output
        --deprecated      Print deprecated events.
        --details         Print information on the perf event names and expressions used internally by events.

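Below is actual output from one machine. The command displays the symbolic event types that can be selected with the -e option in the various perf commands: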

onceday->~:# perf list

List of pre-defined events (to be used in -e):

  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  node-loads                                         [Hardware cache event]
  node-stores                                        [Hardware cache event]

  armv8_cortex_a72/br_mis_pred/                      [Kernel PMU event]
  armv8_cortex_a72/br_pred/                          [Kernel PMU event]
  ...(much more output omitted)...

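PMU hardware events are CPU specific and are documented by the CPU vendor. If perf is linked against the libpfm4 library, it provides short descriptions for some events. For the lists of PMU hardware events on Intel and AMD processors, see the processor documentation published by each vendor.

The events printed by perf list are the performance events supported on the local machine, and the text in square brackets is the event type. The list can be very long, and the result differs somewhat depending on the privileges of the account running the command.

For non-root users, generally only context-switched PMU events are available. These are normally just the events in the cpu PMU, the predefined events such as cycles and instructions, and some software events. Other PMUs and global measurements are normally root only. Some event qualifiers, such as "any", are also root only. This can be changed by setting the kernel.perf_event_paranoid sysctl to -1, which allows non-root users to use these events. To access tracepoint events, perf needs read access to /sys/kernel/debug/tracing, even when perf_event_paranoid is set to a relaxed value.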
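
As a rough illustration of this permission setting, the sketch below reads kernel.perf_event_paranoid from procfs and prints what each level broadly means for unprivileged users. The level descriptions paraphrase the kernel sysctl documentation and the helper names are hypothetical; some distributions add their own stricter levels, so treat the output as a guide only.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")

def read_paranoid(path=PARANOID_PATH):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

def describe_paranoid(level):
    """Summarize what an unprivileged user may do at a given level."""
    if level <= -1:
        return "no restrictions: (almost) all events are available to all users"
    if level == 0:
        return "raw tracepoint access is restricted; CPU-wide events are allowed"
    if level == 1:
        return "CPU-wide events are restricted; per-process user and kernel profiling is allowed"
    # Level 2 and anything stricter is treated the same in this sketch.
    return "kernel profiling is restricted; only user-space measurements are allowed"

level = read_paranoid()
if level is None:
    print("perf_event_paranoid is not readable on this system")
else:
    print(f"perf_event_paranoid = {level}: {describe_paranoid(level)}")
```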

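2.3 perf event modifiers

For any supported event, perf can keep a running count during process execution. In counting mode, the occurrences of events are simply aggregated and printed to standard output at the end of the application run. To generate these statistics, use perf's stat command. For example: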

onceday->~:# perf stat -B dd if=/dev/zero of=/dev/null count=1000000 
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 4.01987 s, 127 MB/s

 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':

       2239.106840      task-clock (msec)         #    0.556 CPUs utilized          
             10041      context-switches          #    0.004 M/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               145      page-faults               #    0.065 K/sec                  
        2665504250      cycles                    #    1.190 GHz                    
        2338072047      instructions              #    0.88  insn per cycle         
   <not supported>      branches                                                    
           6692430      branch-misses                                               

       4.027916520 seconds time elapsed

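When no events are specified, perf stat collects the common events listed above. Some are software events, such as context-switches, and others are generic hardware events, such as cycles.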
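
The values after the "#" in the output above are derived from the raw counts. As a quick check, the following Python sketch recomputes three of them from the numbers printed in this run. The function name and dictionary keys are my own choices, and the formulas are the obvious ratios rather than code taken from perf.

```python
def derived_metrics(task_clock_ms, elapsed_s, cycles, instructions):
    """Recompute some of the '#' column values that perf stat prints."""
    task_clock_s = task_clock_ms / 1000.0
    return {
        "cpus_utilized": task_clock_s / elapsed_s,   # CPU time / wall-clock time
        "ghz": cycles / task_clock_s / 1e9,          # cycles per second of CPU time
        "insn_per_cycle": instructions / cycles,     # IPC
    }

# Raw values copied from the perf stat output above.
m = derived_metrics(
    task_clock_ms=2239.106840,
    elapsed_s=4.027916520,
    cycles=2665504250,
    instructions=2338072047,
)
print(f"{m['cpus_utilized']:.3f} CPUs utilized")
print(f"{m['ghz']:.3f} GHz")
print(f"{m['insn_per_cycle']:.2f} insn per cycle")
```

The results agree with the 0.556 CPUs utilized, 1.190 GHz, and 0.88 insn per cycle shown in the output.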

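One or more events can be measured in each run of the perf tool. Events are specified by their symbolic names, followed by optional unit masks and modifiers. Event names, unit masks, and modifiers are case insensitive.

By default, events are measured at both user and kernel level: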

perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000

To measure at user level only, pass a modifier (u):

perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000

To measure both user and kernel level (explicitly):

perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000

An event can optionally take modifiers, given by appending a colon and one or more modifier characters. Modifiers let the user restrict when events are counted. The modifiers are:

  • u: user-space counting
  • k: kernel counting
  • h: hypervisor counting
  • I: non idle counting
  • G: guest counting (in KVM guests)
  • H: host counting (not in KVM guests)
  • p: precise level (precision level of hardware events)
  • P: use maximum detected precise level
  • S: read sample value (PERF_SAMPLE_READ)
  • D: pin the event to the PMU
  • W: the group is weak and falls back to non-group if not schedulable
  • e: the group or event is exclusive and does not share the PMU

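The p modifier specifies how precise the instruction address should be. It can be given multiple times:

  • 0 - SAMPLE_IP can have arbitrary skid
  • 1 - SAMPLE_IP must have constant skid
  • 2 - SAMPLE_IP is requested to have 0 skid
  • 3 - SAMPLE_IP must have 0 skid, or uses randomization to avoid sample shadowing effects

On Intel systems, precise event sampling is implemented with PEBS, which supports up to precise level 2, and precise level 3 in some special cases.

On AMD systems it is implemented with IBS (up to precise level 2). The precise modifier works with event types 0x76 (cpu-cycles, CPU clocks not halted) and 0xC1 (micro-ops retired).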
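
The modifier rules above can be sketched as a small parser. This is an illustration only, not perf's actual parser: it handles just the simple name:modifiers form, and the function name and the returned dictionary are hypothetical. It assumes that giving any of u, k, or h restricts counting to the levels named, and that with none of them both user and kernel level are counted.

```python
KNOWN_MODIFIERS = set("ukhIGHpPSDWe")

def parse_event(spec):
    """Split 'name:modifiers' and summarize what the modifiers request.

    PMU syntax such as 'cpu/event=0x3c/u' and event groups are out of
    scope for this sketch.
    """
    name, _, mods = spec.partition(":")
    unknown = set(mods) - KNOWN_MODIFIERS
    if unknown:
        raise ValueError(f"unknown modifier(s): {''.join(sorted(unknown))}")
    precise = mods.count("p")            # p, pp, ppp -> levels 1, 2, 3
    if precise > 3:
        raise ValueError("precise level cannot exceed 3")
    restricted = any(m in mods for m in "ukh")
    return {
        "name": name,
        "user": "u" in mods or not restricted,
        "kernel": "k" in mods or not restricted,
        "precise_level": precise,
    }

for spec in ("cycles", "cycles:u", "cycles:uk", "cycles:upp"):
    print(spec, parse_event(spec))
```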

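2.4 Measuring PMU events on specific hardware

Even when an event is not currently available in symbolic form in perf, it can be encoded in a processor-specific way.

For example, on x86 CPUs, to measure an actual PMU event as listed in the hardware vendor's documentation, pass its hexadecimal code: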

perf stat -e r1a8 -a sleep 1
perf record -e r1a8 ...

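Some processors, such as those from AMD, support event codes and unit masks larger than one byte. In that case, the bits that correspond to the event configuration parameters can be found in the output of the following command: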

 cat /sys/bus/event_source/devices/cpu/format/event

Possible commands look like this:

perf record -e r20000038f -a sleep 1
perf record -e cpu/r20000038f/ ...
perf record -e cpu/r0x20000038f/ ...

For PMU events on specific hardware, consult the processor's documentation to determine how to use them.
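
To show how a raw code relates to those format descriptions, here is a hedged Python sketch that pulls the event and umask fields out of the raw codes used above. The format strings are assumptions about what the sysfs format files typically contain ("config:0-7" for a one-byte event select, "config:0-7,32-35" for the wider AMD layout, "config:8-15" for the unit mask); check the files on the target machine before relying on them.

```python
def field_bits(format_spec):
    """Parse a sysfs format string such as 'config:0-7,32-35' into bit positions."""
    _, _, ranges = format_spec.partition(":")
    bits = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo                      # a single bit such as '21'
        bits.extend(range(int(lo), int(hi) + 1))
    return bits

def extract_field(config, format_spec):
    """Gather the listed config bits (lowest first) into one field value."""
    value = 0
    for i, bit in enumerate(field_bits(format_spec)):
        value |= ((config >> bit) & 1) << i
    return value

# Assumed format strings; the real ones live under .../devices/cpu/format/.
EVENT_FMT_1BYTE = "config:0-7"
EVENT_FMT_WIDE = "config:0-7,32-35"
UMASK_FMT = "config:8-15"

# r1a8: event 0xa8 with umask 0x01
print(hex(extract_field(0x1A8, EVENT_FMT_1BYTE)), hex(extract_field(0x1A8, UMASK_FMT)))
# r20000038f: event 0x28f with umask 0x03 under the wide layout
print(hex(extract_field(0x20000038F, EVENT_FMT_WIDE)), hex(extract_field(0x20000038F, UMASK_FMT)))
```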

The available PMUs and their raw parameters can be listed with:

ls /sys/devices/*/format

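Some PMUs are not associated with a core but with a whole CPU socket. Events on these PMUs generally cannot be sampled; they can only be counted globally with perf stat -a. They can be bound to one logical CPU, but will measure all the CPUs in the same socket.

This example measures memory bandwidth every second on the first memory controller of socket 0 on an Intel Xeon system: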

perf stat -C 0 -a -e uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/ -I 1000 ...

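Each memory controller has its own PMU. Measuring the bandwidth of the whole system requires specifying all imc PMUs (see the perf list output) and adding the values together. To make it easier to create multiple events, prefix and glob matching are supported in the PMU name, and the prefix uncore_ is ignored when matching. The command above can therefore be extended to all memory controllers with either of the following forms:

perf stat -C 0 -a -e imc/cas_count_read/,imc/cas_count_write/ -I 1000 ...
perf stat -C 0 -a -e *imc*/cas_count_read/,*imc*/cas_count_write/ -I 1000 ...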
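
Adding the per-controller values together and turning them into a bandwidth figure might look like the sketch below. It assumes that each CAS count corresponds to one 64-byte cache line transfer and that the counts are raw; the sample numbers are invented. If perf on your system already reports these events scaled to MiB, this conversion is not needed.

```python
BYTES_PER_CAS = 64  # assumption: one CAS command moves one 64-byte cache line

def bandwidth_mib_per_s(cas_counts, interval_s):
    """Sum the CAS counts of all memory controllers and convert to MiB/s."""
    total_bytes = sum(cas_counts) * BYTES_PER_CAS
    return total_bytes / interval_s / (1024 * 1024)

# Hypothetical read counts from four controllers over one 1000 ms interval.
reads = [1_000_000, 1_200_000, 900_000, 996_000]
print(f"{bandwidth_mib_per_s(reads, 1.0):.1f} MiB/s read")
```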

2.5 Parameterized performance events

Some PMU events are listed with a ? in their display string, for example:

hv_gpci/dtbp_ptitc,phys_processor_idx=?/

This means that when the event is specified, a value must be supplied for whatever the ? stands for:

 perf stat -C 0 -e 'hv_gpci/dtbp_ptitc,phys_processor_idx=0x2/' ...

It is also possible to specify an extra event qualifier (percore):

perf stat -e cpu/event=0,umask=0x3,percore=1/

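The command above sums the event counts of all hardware threads in a core.

2.6 Measuring event groups

When the number of active events exceeds the number of hardware performance counters, perf supports time-based multiplexing of events. Multiplexing can cause measurement errors when the workload changes its execution profile.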
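
With multiplexing, an event occupies a hardware counter for only part of the run, so the reported count is an estimate scaled up from the time it was actually running. A minimal sketch of that scaling, with hypothetical numbers and a function name of my own, follows:

```python
def scale_count(raw_count, time_enabled, time_running):
    """Estimate the full-run count of a multiplexed event.

    time_enabled: how long the event was enabled
    time_running: how long it actually occupied a hardware counter
    """
    if time_running == 0:
        return None  # never scheduled, so there is nothing to extrapolate from
    return raw_count * time_enabled / time_running

# Hypothetical event that held a counter for a quarter of the run.
estimate = scale_count(raw_count=2_500_000, time_enabled=4_000, time_running=1_000)
print(estimate)
```

The estimate is only as good as the assumption that the event rate was the same while the event was not running, which is why a workload that changes its execution profile leads to errors.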

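When metrics are computed from event counts with formulas, it is useful to make sure certain events are always measured together as a group, to minimize multiplexing errors. Event groups are specified with {}.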

perf stat -e '{instructions,cycles}' ...

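The number of available performance counters depends on the CPU. A group cannot contain more events than there are counters. For example, Intel Core CPUs typically have four generic core performance counters, plus three fixed counters for instructions, cycles, and ref-cycles. Some special events have restrictions on which counters they can be scheduled on, and may not support multiple instances in a single group. When too many events are specified in a group, some of them will not be measured.

Globally pinned events can limit the number of counters available to other groups. On x86 systems, the NMI watchdog pins one counter by default. The NMI watchdog can be disabled as root: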

echo 0 > /proc/sys/kernel/nmi_watchdog

Events from multiple different PMUs cannot be mixed in one group, with the exception of software events.

perf also supports group leader sampling, using the :S specifier.

perf record -e '{cycles,instructions}:S' ...
perf report --group

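Normally, every event in an event group samples. With :S, only the first event (the leader) samples, and it merely reads the values of the other events in the group. However, with AUX area events (for example Intel PT or CoreSight), the AUX area event must be the leader, so the second event samples instead of the first.

2.7 perf list event categories

By default, perf list lists all known events. A single category can also be listed by passing one of the following: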

  • hw or hardware: list hardware events, such as cache-misses
  • sw or software: list software events, such as context switches
  • cache or hwcache: list hardware cache events, such as L1-dcache-loads
  • tracepoint: list all tracepoint events; subsys_glob:event_glob can be used to filter by subsystem, such as sched or block
  • pmu: print the PMU events provided by the kernel
  • sdt: list all Statically Defined Tracepoint events
  • metric: list metrics
  • metricgroup: list metric groups with metrics
  • --raw-dump: display the raw format of all events; this option can be followed by [hw|sw|cache|tracepoint|pmu|event_glob]

(This article is part of a series; to be continued.)

Origin blog.csdn.net/Once_day/article/details/131159373