Kernel debugging - perf introduction

perf concept

perf_event

perf_events is currently a widely used profiling/tracing facility on Linux. In addition to the in-kernel infrastructure, it provides user-space command-line tools: the "perf" command and its subcommands such as "perf record" and "perf stat".

perf_events provides two working modes:

  1. sampling mode
  2. counting mode

The "perf record" command works in sampling mode: periodically samples events and records the information, which is saved in the perf.data file by default; while the "perf stat" command works in counting mode: only counts the number of occurrences of an event .

We often see commands like "perf record -a ... sleep 10". Here "sleep" is a dummy command that does no meaningful work; its role is to let "perf record" sample the entire system and end the sampling automatically after 10 seconds.

perf_event - PMU

The hardware events handled by perf_events require CPU support, and mainstream CPUs today all include a PMU (Performance Monitoring Unit). The PMU counts performance-related quantities such as cache hit rate and instruction cycles. Since the counting is done in hardware, the CPU overhead is minimal.

Taking the x86 architecture as an example, the PMU consists of two kinds of MSRs (Model-Specific Registers, so called because some registers differ between CPU models): Performance Event Select Registers and Performance Monitoring Counters (PMC). To count a performance event, you program a Performance Event Select Register, and the result accumulates in the corresponding Performance Monitoring Counter.
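
To make this register pair concrete, here is a minimal read-only sketch. It assumes an x86 machine, root privileges, and the msr driver (modprobe msr); the MSR addresses are the architectural ones, and the program only peeks at the registers, since perf programs them from inside the kernel:

/* Sketch: peek at PMU MSRs through the msr driver (assumptions: x86,
 * root, msr module loaded). The driver maps the file offset to the
 * MSR address, so pread() at 0x186 reads IA32_PERFEVTSEL0. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_PERFEVTSEL0 0x186  /* Performance Event Select Register 0 */
#define IA32_PMC0        0x0c1  /* Performance Monitoring Counter 0 */

int main(void)
{
    uint64_t sel = 0, cnt = 0;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);

    if (fd < 0) {
        perror("open /dev/cpu/0/msr");
        return 1;
    }
    pread(fd, &sel, sizeof(sel), IA32_PERFEVTSEL0);  /* what is selected */
    pread(fd, &cnt, sizeof(cnt), IA32_PMC0);         /* current count */
    printf("PERFEVTSEL0=%#" PRIx64 "  PMC0=%" PRIu64 "\n", sel, cnt);
    close(fd);
    return 0;
}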

When perf_events works in sampling mode (the mode used by perf record), there is a delay between the moment a sampling event fires and the moment it is actually handled, and the CPU pipeline and out-of-order execution add further imprecision. As a result, the recorded instruction address IP (Instruction Pointer) is not the IP at which the sampling event actually occurred; this offset is called skid. To improve the situation and make the IP value more accurate, Intel provides PEBS (Precise Event-Based Sampling) and AMD provides IBS (Instruction-Based Sampling).

Take PEBS as an example: each time a sampling event occurs, the hardware itself stores the sampled machine state into a dedicated buffer (the PEBS buffer), and once the buffer fills to a threshold its contents are processed in one batch. Because the hardware captures the state directly, the skid problem is greatly reduced.

Execute the perf list --help command and you will see the following:

The p modifier can be used for specifying how precise the instruction address should be. The p modifier can be specified multiple times:

       0 - SAMPLE_IP can have arbitrary skid
       1 - SAMPLE_IP must have constant skid
       2 - SAMPLE_IP requested to have 0 skid
       3 - SAMPLE_IP must have 0 skid

For Intel systems precise event sampling is implemented with PEBS which supports up to precise-level 2.

It is now clear that in commonly seen commands such as "perf record -e cpu/mem-loads/pp -a", the pp modifier specifies the required IP precision.
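
In the perf_event_open(2) API, these p modifiers correspond to the precise_ip field of struct perf_event_attr, which takes the values 0 to 3 listed above. A minimal sketch of a sampling-mode attribute that asks for "pp" precision (the event and period here are illustrative choices):

#include <linux/perf_event.h>

/* sampling-mode attribute equivalent to the event spec "cycles:pp" */
static struct perf_event_attr attr = {
    .size          = sizeof(struct perf_event_attr),
    .type          = PERF_TYPE_HARDWARE,
    .config        = PERF_COUNT_HW_CPU_CYCLES,
    .sample_period = 100000,  /* take one sample every 100000 cycles */
    .precise_ip    = 2,       /* "pp": SAMPLE_IP requested to have 0 skid */
};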

system call: perf_event_open

struct perf_event represents one event resource. Each perf_event_open() call from user space creates a corresponding perf_event object in the kernel, and the important bookkeeping lives in this structure: the pmu, the context (ctx), the enabled/running times, the event count, and other information.

include/linux/perf_event.h (abridged):

struct perf_event {
    struct pmu                *pmu;                /* the PMU that backs this event */
    struct perf_event_context *ctx;                /* the context the event is attached to */
    local64_t                  count;              /* current counter value */
    u64                        total_time_enabled;
    u64                        total_time_running;
    /* ... many more fields ... */
};


The architecture-specific PMU driver for arm64 lives in arch/arm64/kernel/perf_event.c.
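
To tie the pieces together, below is a minimal counting-mode user of the system call, modeled on the example in the perf_event_open(2) man page. glibc provides no wrapper, so the raw syscall is used; the choice of event (retired instructions around a single printf) is only illustrative:

/* Count instructions retired in a small code region via
 * perf_event_open(2). Error handling is kept minimal. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    long long count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.type           = PERF_TYPE_HARDWARE;
    attr.config         = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled       = 1;   /* start stopped; enabled via ioctl below */
    attr.exclude_kernel = 1;   /* count user-space instructions only */
    attr.exclude_hv     = 1;

    fd = perf_event_open(&attr, 0 /* this process */, -1 /* any cpu */, -1, 0);
    if (fd == -1) {
        perror("perf_event_open");
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("measuring this printf\n");          /* the measured region */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));            /* counting mode: just read */
    printf("instructions retired: %lld\n", count);
    close(fd);
    return 0;
}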

example

Below I use the ls command to demonstrate the use of the sys_enter tracepoint:

perf stat -e raw_syscalls:sys_enter ls

Specify a pid and collect for 1 second:

[root@localhost /home/ahao.mah]
#perf stat -e syscalls:* -p 49770 sleep 1

output of perf stat

[root@localhost /home/ahao.mah]
#perf stat ls
perf.data  perf.data.old  test  test.c

 Performance counter stats for 'ls':

          1.256036      task-clock (msec)         #    0.724 CPUs utilized
                 4      context-switches          #    0.003 M/sec
                 0      cpu-migrations            #    0.000 K/sec
               285      page-faults               #    0.227 M/sec
         2,506,596      cycles                    #    1.996 GHz                      (87.56%)
         1,891,085      stalled-cycles-frontend   #   75.44% frontend cycles idle
         1,526,425      stalled-cycles-backend    #   60.90% backend  cycles idle
         1,551,244      instructions              #    0.62  insns per cycle
                                                  #    1.22  stalled cycles per insn
           309,841      branches                  #  246.682 M/sec
            12,190      branch-misses             #    3.93% of all branches          (21.57%)

       0.001733685 seconds time elapsed

1. Execution time (task-clock): 1.256036 ms
2. Elapsed time: 0.001733685 seconds. The elapsed wall-clock time is necessarily longer than the execution time, because of CPU scheduling, preemption, and so on.
3. CPU utilization: the "0.724 CPUs utilized" figure equals execution time / elapsed time.
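
As a quick check: 1.256036 ms / 1.733685 ms ≈ 0.724, which matches the reported utilization.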

perf stat implementation

tools/perf/builtin-stat.c, roughly:

cmd_stat()
  run_perf_stat()
    __run_perf_stat()   /* opens the counters, runs the workload, reads the counts */
  print_stat()          /* formats the output shown above */

perf use

#include <stdio.h>

/* burn CPU cycles in a tight loop */
void longa(void)
{
  int i, j;
  for (i = 0; i < 1000000; i++)
    j = i;
}

void foo2(void)
{
  int i;
  for (i = 0; i < 10; i++)
    longa();
}

void foo1(void)
{
  int i;
  for (i = 0; i < 100; i++)
    longa();
}

int main(void)
{
  foo1();
  foo2();
  return 0;
}
Compile it as t1 (for example, gcc -o t1 t1.c) and count the kmem tracepoint events it triggers:

#perf stat -e kmem:*  ./t1

 Performance counter stats for './t1':

                 1      kmem:kmalloc
             1,443      kmem:kmem_cache_alloc
                85      kmem:kmalloc_node
                85      kmem:kmem_cache_alloc_node
             1,078      kmem:kfree
             1,472      kmem:kmem_cache_free
                37      kmem:mm_page_free
                35      kmem:mm_page_free_batched
                40      kmem:mm_page_alloc
                70      kmem:mm_page_alloc_zone_locked
                 0      kmem:mm_page_pcpu_drain
                 0      kmem:mm_page_alloc_extfrag

       0.382027010 seconds time elapsed

overhead of perf

environment:

  1. kernel 3.10
  2. a Java workload running on the machine, fully loaded at 733.3% CPU

When perf record collects data for a single Java pid, perf's overhead is about 100% of a CPU during the startup phase and about 7.5% after it stabilizes:

#perf sched record  -p 49770
#ps -eo pmem,pcpu,args   | grep perf  | grep -v grep
 0.0  0.0 [perf]
 0.0  7.5 perf sched record -p 49770

Using perf to collect all syscall tracepoints for a single pid has a very large overhead, stabilizing around 40%:

#perf stat -e syscalls:* -p 49770 sleep 10
 0.0 88.0 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 96.5 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 90.6 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 68.0 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 54.4 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 45.3 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 38.8 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 34.0 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 30.2 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 27.2 perf stat -e syscalls:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0 24.7 perf stat -e syscalls:* -p 49770 sleep 10

Collecting the same syscall events without attaching to the busy pid (here perf stat traces only the sleep command itself) has far less overhead:

#perf stat -e syscalls:*  sleep 10
 0.0  0.0 [perf]
 0.0  0.0 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  6.0 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  3.0 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  2.0 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  1.5 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  1.0 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.8 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.7 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.6 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.6 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.5 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.5 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.4 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.4 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.4 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -e syscalls:* sleep 10
 0.0  0.0 [perf]
 0.0  0.4 perf stat -e syscalls:* sleep 10

The simplest perf stat (default events only, same pid) has low overhead:

#perf stat  -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.0 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  3.0 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  1.0 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.7 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.6 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.5 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.4 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.3 perf stat -p 49770 sleep 10

The overhead of perf collecting kmem-related events

#perf stat -e kmem:*  -p 49770 sleep 10

 Performance counter stats for process id '49770':

           163,603      kmem:kmalloc                                                  (100.00%)
           484,012      kmem:kmem_cache_alloc                                         (100.00%)
           302,553      kmem:kmalloc_node                                             (100.00%)
            301,051      kmem:kmem_cache_alloc_node                            (100.00%)
           263,768      kmem:kfree                                                    (100.00%)
           774,941      kmem:kmem_cache_free                                          (100.00%)
            83,850      kmem:mm_page_free                                             (100.00%)
               799      kmem:mm_page_free_batched                                     (100.00%)
            83,064      kmem:mm_page_alloc                                            (100.00%)
              1,088      kmem:mm_page_alloc_zone_locked                        (100.00%)
               403      kmem:mm_page_pcpu_drain                                       (100.00%)
                 0      kmem:mm_page_alloc_extfrag
 0.0  7.0 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  3.5 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  2.3 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  1.7 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  1.4 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  1.1 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  1.0 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.8 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.7 perf stat -e kmem:* -p 49770 sleep 10
 0.0  0.0 [perf]
 0.0  0.7 perf stat -e kmem:* -p 49770 sleep 10

REF

Identify performance bottlenecks using OProfile for Linux on POWER:
https://www.ibm.com/developerworks/cn/linux/l-pow-oprofile/

http://abcdxyzk.github.io/blog/2015/07/27/debug-perf/

Brendan Gregg, Linux Performance Analysis: New Tools and Old Secrets:
https://www.slideshare.net/brendangregg/linux-performance-analysis-new-tools-and-old-secrets
