CPU Cache Efficiency Everyone Should Know

Hello everyone, I'm Brother Fei!

When it comes to CPU performance, most students first think of CPU utilization, and that indicator does deserve attention first. But besides utilization there is another, easily overlooked indicator: how efficiently instructions actually run. If instruction efficiency is low, then no matter how busy the CPU looks, the useful output will still be low.

It is like a person who is busy every day but whose productivity varies wildly: some days a lot gets done, while other days are wasted and, looking back, nothing was accomplished at all!

1. CPU hardware operating efficiency

So what is the operating efficiency of the CPU? Before introducing it, let's briefly review the composition and working principle of the CPU. During manufacturing, the lithography process etches the CPU's various functional modules onto the die.

[Figure: CPU die photo showing the layout of the physical cores and the shared L3 Cache]

In the physical structure diagram above, you can see the placement of each physical core and the L3 Cache. Each physical core in turn contains further components: every core integrates its own private registers and caches, where the caches include the L1 data cache, the L1 instruction cache, and the L2 cache.

[Figure: internals of a single core, with its private registers, L1 data cache, L1 instruction cache, and L2 cache]

While a service program runs, the CPU core continuously fetches the instructions to execute and the data to operate on from storage. "Storage" here includes the registers, the L1 data cache, the L1 instruction cache, the L2 cache, the L3 cache, and main memory.

When a service program starts, it is loaded into memory on demand through page faults. As the CPU runs the service, it keeps reading instructions and data from memory, performs the computation, and writes the results back to memory.

[Figure: the CPU continuously reading instructions and data from memory and writing results back]

Different CPUs have different pipelines. In a classic CPU pipeline, each instruction cycle usually includes the stages of instruction fetch, decode, execute, and memory access.

  • During the instruction fetch stage, the CPU fetches the instruction from memory and loads it into the instruction register.

  • In the decode stage, the CPU decodes the instruction, determines the type of operation to perform, and loads the operands into registers.

  • In the execute stage, the CPU carries out the operation and stores the result in registers.

  • In the memory access stage, the CPU reads data from memory into registers or writes register data back to memory, as needed.

However, memory access is very slow. A CPU clock cycle is generally only a few tenths of a nanosecond, but even the fastest sequential memory IO takes about 10 nanoseconds, and random IO costs around 30-40 nanoseconds. You can refer to several articles I have written before.

Therefore, to speed up computation, the CPU builds its own staging areas for data: the various caches mentioned above, including each core's registers, L1 data cache, L1 instruction cache, and L2 cache, the L3 shared by the whole CPU, and the TLB, a dedicated cache for virtual-to-physical memory address translation.

Take the fastest storage, the registers: an access costs just a few tenths of a nanosecond, so they keep pace with the CPU itself. Going down, L1 latency is roughly 2 ns and L2 roughly 4 ns, increasing level by level.

The slower storage has its own advantage: being farther from the CPU core, it can be made much larger. So the storage the CPU accesses is logically a pyramid. The closer a level is to the tip of the pyramid, the faster it is and the smaller its capacity; going down, speed drops a little but capacity grows.

[Figure: the storage pyramid, with registers at the top, then the L1/L2/L3 caches, then main memory at the base]

So much for the basics; now let's think about instruction execution efficiency. From the pyramid above it is clear that if the instructions and data a service program needs sit near the top of the pyramid, the service will run efficiently. If the program is poorly written, or the kernel keeps migrating the process between physical cores (different cores do not share their L1 and L2 caches), then the hit rate in the upper caches drops, more requests fall through to L3 or even to main memory, and the program's running efficiency deteriorates.
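
When frequent core migration is the culprit, a common mitigation is to pin the process to one core so its working set stays warm in that core's private L1/L2. Here is a minimal sketch using the Linux sched_setaffinity call; the core number 2 is just an arbitrary example.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main()
{
    // Pin the current process (pid 0) to core 2 so its working set
    // stays warm in that core's private L1/L2 caches.
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    // ... run the cache-sensitive workload here ...
    return 0;
}

The same effect is available from the shell with taskset -c 2 ./your_program.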

So how do we measure instruction execution efficiency? There are two main categories of indicators.

The first category is CPI and IPC.

CPI stands for cycles per instruction: the average number of clock cycles each instruction takes. IPC stands for instructions per cycle: how many instructions run per clock cycle. Both indicators help us judge whether an executable runs fast or slow. Since they are reciprocals of each other, in practice it is enough to watch just the CPI.

The CPI gives us an overall read on the program's speed. If the program runs with a high cache hit rate and most data can be served from cache, the CPI will be low. If the program makes poor use of the locality principle, or there is a problem with the kernel's scheduling, then executing the same instructions will need more CPU cycles, the program will perform worse, and the CPI will be on the high side.
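
To make the locality point concrete, here is a small demo of my own (an illustration, not from any real workload): both functions below read exactly the same 64 MB array, but the row-wise walk is sequential while the column-wise walk strides one full row per access, so it misses the caches far more often. Running the two variants under perf stat should show a clearly higher CPI for the column-wise one.

#include <stdio.h>
#include <stdlib.h>

#define N 4096

// Sequential walk: consecutive elements share cache lines, so most
// loads hit L1 and the CPI stays low.
long sum_row_major(int (*a)[N])
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

// Strided walk: each access jumps N * sizeof(int) bytes, so nearly
// every load misses the upper caches and the CPI climbs.
long sum_col_major(int (*a)[N])
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

int main()
{
    int (*a)[N] = calloc(N, sizeof(*a)); // 64 MB, larger than a typical L3
    long sum = sum_row_major(a);         // swap in sum_col_major(a) to compare
    printf("sum=%ld\n", sum);
    free(a);
    return 0;
}

Compare the two versions with, for example, perf stat -e cycles,instructions,L1-dcache-load-misses ./a.out.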

The second category is the cache hit rate.

The cache hit rate tells us how much of the data a running program touches is not caught by the caches and falls through to memory. Accesses that reach memory are much slower, so the lower the program's cache miss rate, the better.

2. How to evaluate CPU hardware efficiency

In the previous section we saw that the main indicators of CPU hardware efficiency are the CPI and the cache hit rate. So how do we obtain these metrics?

2.1 Use the perf tool

The first way is to use the perf tool that ships with Linux. Run perf list to see which hardware events the current system supports.

# perf list hw cache
List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]

From the above output, let's pick a few important ones to explain:

  • cpu-cycles: the number of CPU cycles consumed

  • instructions: the number of executed instructions; combined with cpu-cycles, this gives the CPI, the average number of cycles each instruction consumes (see the small helper sketch after this list)

  • L1-dcache-loads: the number of L1 data cache reads

  • L1-dcache-load-misses: the number of L1 data cache read misses; combined with L1-dcache-loads, this gives the L1 data cache hit rate

  • dTLB-loads: the number of dTLB reads

  • dTLB-load-misses: the number of dTLB read misses; combined with dTLB-loads, this likewise gives the dTLB hit rate
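
Putting those combinations into arithmetic, here is a tiny illustrative helper (my own sketch, not part of perf): CPI = cpu-cycles / instructions, and each hit rate is 1 minus misses / loads. The numbers plugged in are the ones from the perf stat runs shown below.

#include <stdio.h>

// Illustrative helper: derive the indicators discussed above from raw counts.
static void report(double cycles, double instructions,
                   double loads, double misses)
{
    printf("CPI      = %.2f\n", cycles / instructions);
    printf("IPC      = %.2f\n", instructions / cycles);
    printf("hit rate = %.2f%%\n", 100.0 * (1.0 - misses / loads));
}

int main()
{
    // Raw numbers taken from the perf stat outputs later in this section.
    report(1758466, 871474, 220911, 22578);
    return 0;
}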

The perf stat command counts these indicators for the whole system or for a specified process. To get the CPI you can run perf stat directly. (To measure a specific process, just add a -p parameter followed by its pid, e.g. perf stat -p 1234 sleep 5.)

# perf stat sleep 5
Performance counter stats for 'sleep 5':
    ......
    1,758,466      cycles                    #    2.575 GHz
      871,474      instructions              #    0.50  insn per cycle

As the comment after instructions shows, the system's IPC is 0.50, meaning an average of 0.5 instructions is executed per CPU cycle. Since CPI and IPC are reciprocals, the CPI works out to 1/0.5 = 2. In other words, each instruction consumes two CPU cycles on average.

Now let's look at the L1 and dTLB cache hit rates. This time we pass the -e option to perf stat to specify the events to observe, because they are not reported by default.

# perf stat -e L1-dcache-load-misses,L1-dcache-loads,dTLB-load-misses,dTLB-loads sleep 5
Performance counter stats for 'sleep 5':
    22,578      L1-dcache-load-misses     #   10.22% of all L1-dcache accesses
   220,911      L1-dcache-loads
     2,101      dTLB-load-misses          #    0.95% of all dTLB cache accesses
   220,911      dTLB-loads

In the above results, L1-dcache-load-misses is 22,578 against 220,911 total L1-dcache-loads, so the L1 data cache miss rate is about 10.22%. Similarly, the dTLB miss rate works out to 0.95%. Neither figure is alarmingly high, but in practice, the lower the better.

2.2 Directly use the system calls provided by the kernel

perf is very convenient to use, but in some business scenarios you may need to collect these numbers from your own code. In that case you can bypass perf and use the system call it is built on to read the hardware indicators directly.

Development takes roughly two steps:

  • Step 1: call perf_event_open to create a perf file descriptor

  • Step 2: periodically read the perf file descriptor to fetch the counts

The core code is roughly as follows. To avoid distraction I kept only the trunk; I have put the complete source code in my coder-kung-fu repository on GitHub.

Github address : https://github.com/yanfeizhang/coder-kung-fu/blob/main/tests/cpu/test08/main.c

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

// perf_event_open has no glibc wrapper, so invoke it via syscall()
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main()
{
    // Step 1: create the perf file descriptor
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;           // monitor a hardware event
    attr.config = PERF_COUNT_HW_INSTRUCTIONS; // count executed instructions

    // pid = 0: measure the current process only
    // cpu = -1: measure on all CPU cores
    int fd = perf_event_open(&attr, 0, -1, -1, 0);

    // Step 2: periodically read the counter
    long long instructions;
    while (1)
    {
        read(fd, &instructions, sizeof(instructions));
        printf("instructions=%lld\n", instructions);
        sleep(1);
    }
}

The source first declares the perf_event_attr parameter object needed to create the perf file. In it, type is set to PERF_TYPE_HARDWARE to monitor hardware events, and config is set to PERF_COUNT_HW_INSTRUCTIONS to count instructions.

Then the perf_event_open system call is invoked. Besides the perf_event_attr object, the pid and cpu parameters are critical. For pid: -1 monitors all processes, 0 monitors the current process, and a value > 0 monitors the process with that pid. For cpu: -1 monitors all cores, and any other value monitors only the specified core. A few combinations are sketched below.
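
Some illustrative combinations, reusing the attr object and the perf_event_open wrapper from the sample above (the pid 1234 is hypothetical):

// Current process, on whichever cores it runs (what the sample does):
int fd_self = perf_event_open(&attr, 0, -1, -1, 0);

// An existing process, e.g. pid 1234 (hypothetical), on all cores:
int fd_pid  = perf_event_open(&attr, 1234, -1, -1, 0);

// Every process, but only on core 0. This is system-wide mode and
// normally needs root or a permissive perf_event_paranoid setting;
// pid = -1 together with cpu = -1 is rejected by the kernel.
int fd_all  = perf_event_open(&attr, -1, 0, -1, 0);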

After the kernel allocates the perf_event, it returns a file descriptor fd. From then on, the perf_event object can be operated on through the ordinary file interfaces read/write/ioctl/mmap.
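
For example, the ioctl commands documented in the perf_event_open man page let you zero and gate the counter around just the code you care about. A sketch that slots into the sample program between step 1 and step 2 (run_workload is a hypothetical stand-in for the code being measured):

#include <sys/ioctl.h> // in addition to the sample's includes

ioctl(fd, PERF_EVENT_IOC_RESET, 0);   // zero the counter
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);  // start counting

run_workload();                       // hypothetical code under measurement

ioctl(fd, PERF_EVENT_IOC_DISABLE, 0); // stop counting

long long count;
read(fd, &count, sizeof(count));      // harvest the final value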

perf_event programming can be used in two modes: counting and sampling. The example in this article uses the simplest, counting. Sampling supports much richer functionality, such as collecting call stacks to render flame graphs. In that mode a plain read is not enough: a ring buffer must be allocated for the perf_event and read through mmap. In the perf tool, this corresponds to perf record/report.
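
To give a feel for the sampling setup, here is a rough, simplified sketch. It assumes the perf_event_open wrapper from the sample above; real code must also handle ring-buffer wrap-around and use memory barriers when reading data_head.

#include <linux/perf_event.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// Assumes the perf_event_open() wrapper from the sample program above.
void *setup_sampling(void)
{
    // Take a sample every 100000 cycles and record the instruction
    // pointer of each sample, which is roughly what perf record does.
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.sample_period = 100000;
    attr.sample_type = PERF_SAMPLE_IP;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);

    // The ring buffer must be 1 metadata page plus 2^n data pages.
    size_t page = sysconf(_SC_PAGESIZE);
    void *ring = mmap(NULL, (1 + 8) * page, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    // The first page is a struct perf_event_mmap_page holding
    // data_head/data_tail; sample records are consumed from the
    // pages that follow, between those two offsets.
    return ring;
}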

Compile and run the complete source code:

# gcc main.c -o main
# ./main
instructions=1799
instructions=112654
instructions=123078
instructions=133505
...

3. The internal working principle of perf

Think this article is over? Not at all! Stopping at usage without touching the principles has never been the style of this column.

So, having covered how to obtain the hardware indicators, let's talk about how the upper-layer software cooperates with the CPU hardware to collect the underlying instruction counts, cache hit rates, and other metrics, and dig into the underlying principles.

The CPU's hardware designers anticipated that software engineers would need to observe hardware indicators, so they added a set of special registers dedicated to performance monitoring. This is described in Chapter 18 of the official Intel manual, which you can find online; I will also drop a copy into my reader group. If you haven't joined yet, add me on WeChat: zhangyanfei748527.

These registers are called hardware performance counters (PMC: Performance Monitoring Counter). Each PMC consists of a counter and an event selector: the counter stores how many times an event has occurred, and the event selector determines which type of event to count. For example, a PMC can be used to count L1 cache accesses or instruction execution cycles. Whenever the CPU executes the event a PMC is configured for, the hardware increments the counter automatically, with no interference to the normal execution of the program.
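
To make the "event selector + counter" pairing concrete, here is a hedged sketch of programming one Intel PMC by hand through the msr kernel module (/dev/cpu/0/msr, root required). The register addresses and the "instructions retired" event encoding come from the Intel manual mentioned above; treat the exact values as illustrative, and note that in practice the kernel's perf subsystem owns these registers.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

// Register addresses from the Intel SDM (architectural perfmon):
#define IA32_PERFEVTSEL0      0x186 // event selector for counter 0
#define IA32_PMC0             0xC1  // the counter itself
#define IA32_PERF_GLOBAL_CTRL 0x38F // global enable bits (perfmon v2+)

int main()
{
    // The msr module exposes MSRs as a file; the pread/pwrite offset
    // is the register address. Root required.
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open msr"); return 1; }

    // Event 0xC0, umask 0x00 = "instructions retired".
    // Bit 16 counts user mode, bit 17 kernel mode, bit 22 enables the counter.
    uint64_t evtsel = 0xC0 | (1ULL << 16) | (1ULL << 17) | (1ULL << 22);
    pwrite(fd, &evtsel, sizeof(evtsel), IA32_PERFEVTSEL0);

    // Demo only: this clobbers the other global enable bits.
    uint64_t global = 1; // enable PMC0
    pwrite(fd, &global, sizeof(global), IA32_PERF_GLOBAL_CTRL);

    uint64_t count;
    pread(fd, &count, sizeof(count), IA32_PMC0); // read the running count
    printf("instructions retired on cpu0: %llu\n", (unsigned long long)count);

    close(fd);
    return 0;
}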

With this hardware support, the Linux kernel above it can obtain the desired indicators simply by reading these PMC registers. The overall workflow is shown below.

[Figure: overall workflow: the application calls perf_event_open/read, and the kernel PMU subsystem programs and reads the hardware PMC registers]

Next, let's take a look at this process from the perspective of source code.

3.1 Initialization of CPU PMU

The Linux PMU (Performance Monitoring Unit) subsystem is the kernel's mechanism for monitoring and analyzing system performance. Each source of events to observe is described by a struct pmu and registered with the system through the perf_pmu_register function.

For the CPU, a PMU for the x86 architecture is defined and registered into the system at boot:

//file:arch/x86/events/core.c
static struct pmu pmu = {
    .pmu_enable     = x86_pmu_enable,
    .read           = x86_pmu_read,
    ...
};

static int __init init_hw_perf_events(void)
{
    ...
    err = perf_pmu_register(&pmu, "cpu", PERF_TYPE_RAW);
}

3.2 perf_event_open system call

In the earlier sample code we created a perf file through the perf_event_open system call. Let's look at what this creation process actually does.

//file:kernel/events/core.c
SYSCALL_DEFINE5(perf_event_open,
        struct perf_event_attr __user *, attr_uptr,
        pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
{
    ...

    // 1. Allocate a new file descriptor for the caller
    event_fd = get_unused_fd_flags(f_flags);

    ...
    // 2. Locate the pmu object according to the user's attr, and initialize the event with it
    event = perf_event_alloc(&attr, cpu, task, group_leader, NULL,
                 NULL, NULL, cgroup_fd);
    pmu = event->pmu;

    // 3. Create the perf_event_context object ctx, which holds the event's context information
    ctx = find_get_context(pmu, task, event);


    // 4. Create a file, with perf_fops as the operations for the perf file type
    event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
                    f_flags);

    // 5. Install the event into ctx
    perf_install_in_context(ctx, event, event->cpu);

    fd_install(event_fd, event_file);
    return event_fd;
}

The above is the core of perf_event_open. The most critical call is perf_event_alloc, which looks up the pmu object according to the attr passed in by the user. Recall that in our sample code we asked to monitor the CPU hardware's instruction count:

struct perf_event_attr attr;
attr.type = PERF_TYPE_HARDWARE;           // monitor a hardware event
attr.config = PERF_COUNT_HW_INSTRUCTIONS; // count executed instructions

So here the lookup lands on the CPU PMU object from Section 3.1, and that pmu initializes the new event. Then anon_inode_getfile creates the actual file object, with perf_fops as its file operations. perf_fops is defined as follows:

//file:kernel/events/core.c
static const struct file_operations perf_fops = {
    ...
    .read               = perf_read,
    .unlocked_ioctl     = perf_ioctl,
    .mmap               = perf_mmap,
};

After the perf kernel object is created, perf_pmu_enable is also triggered; after a chain of calls, the registers to be monitored are finally assigned:

perf_pmu_enable
-> pmu_enable
  -> x86_pmu_enable
    -> x86_assign_hw_event

//file:arch/x86/events/core.c
static inline void x86_assign_hw_event(struct perf_event *event,
                struct cpu_hw_events *cpuc, int i)
{
    struct hw_perf_event *hwc = &event->hw;
    ...
    if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
        hwc->config_base = 0;
        hwc->event_base = 0;
    } else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
        hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
        hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED);
        hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30;
    } else {
        hwc->config_base = x86_pmu_config_addr(hwc->idx);
        hwc->event_base  = x86_pmu_event_addr(hwc->idx);
        hwc->event_base_rdpmc = x86_pmu_rdpmc_index(hwc->idx);
    }
}

3.3 Reading the count with read

In the second step of the sample code, the read system call is invoked periodically to fetch the counter. We saw in Section 3.2 that the read operation of the newly created perf file object is perf_read.

The perf_read function can actually read several indicators at once, but to keep the description simple I will only walk through reading a single one. The call chain is as follows:

perf_read
    __perf_read
        perf_read_one
            __perf_event_read_value
                perf_event_read
                    __perf_event_read_cpu
                perf_event_count

Of these, perf_event_read is where the value is actually read from the hardware register.

static int perf_event_read(struct perf_event *event, bool group)
{
    enum perf_event_state state = READ_ONCE(event->state);
    int event_cpu, ret = 0;
    ...

again:
    // If the event is running, try to refresh it to the latest value
    if (state == PERF_EVENT_STATE_ACTIVE) {
        ...
        data = (struct perf_read_data){
            .event = event,
            .group = group,
            .ret = 0,
        };
        (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
        preempt_enable();
        ret = data.ret;
    } else if (state == PERF_EVENT_STATE_INACTIVE) {
        ...
    }
    return ret;
}

smp_call_function_single runs a function on a specified CPU. Because the PMC registers are private to each CPU core, the read must be carried out on the core where the event is counting. The function it runs is the __perf_event_read passed in its arguments, and that function actually reads the x86 CPU hardware register.

__perf_event_read
-> x86_pmu_read
  -> intel_pmu_read_event
    -> x86_perf_event_update

Here, __perf_event_read dispatches into the x86-specific code through a function pointer.

//file:kernel/events/core.c
static void __perf_event_read(void *info)
{
    ...
    pmu->read(event);
}

In Section 3.1 we saw the CPU's pmu definition; its read function pointer points to x86_pmu_read.

//file:arch/x86/events/core.c
static struct pmu pmu = {
    ...
    .read           = x86_pmu_read,
};

This executes x86_pmu_read and finally reaches x86_perf_event_update, which uses the rdpmcl wrapper (around the rdpmc instruction) to fetch the value from the register.

//file:arch/x86/events/core.c
u64 x86_perf_event_update(struct perf_event *event)
{
    ...
    rdpmcl(hwc->event_base_rdpmc, new_raw_count);
    return new_raw_count;
}

Finally, back in perf_read_one, copy_to_user copies the value into user space, and our process obtains the hardware execution count held in the register.

//file:kernel/events/core.c
static int perf_read_one(struct perf_event *event,
                 u64 read_format, char __user *buf)
{

    values[n++] = __perf_event_read_value(event, &enabled, &running);
    ...

    copy_to_user(buf, values, n * sizeof(u64))
    return n * sizeof(u64);
}

Summary

Memory is fast, but next to the CPU its speed is still minor league. So the CPU does not fetch instructions and data straight from memory; it goes to its own caches first, and only on a cache miss does the request reach memory, at a real cost in performance.

The main indicators for judging how efficiently the CPU uses its caches are the CPI and the cache hit rate. In hardware, the CPU implements a dedicated PMU module containing special-purpose counting registers. Whenever the CPU executes the event a PMC register is configured for, the hardware increments the counter automatically, without interfering with the program's normal execution. With that support, the Linux kernel can obtain the desired indicators simply by reading these PMC registers.

We can observe them with perf, or use the perf_event_open system call directly to obtain a perf file object and read the counts ourselves.


Welcome to share this article with your team members, let’s grow together!


Origin: blog.csdn.net/zhangyanfei01/article/details/130592240