CPU program performance optimization

A program must first be correct; only on that basis does performance become an important consideration. To write high-performance programs you must, first, choose appropriate algorithms and data structures; second, write source code that the compiler can effectively translate into efficient executable code, which requires understanding the capabilities and limitations of compilers; and third, understand how the hardware works and optimize for its characteristics. This article focuses on the second and third points.

A brief look at the compiler

To write high-performance code, you first need a basic understanding of compilers. Modern compilers have strong optimization capabilities, but there is code they cannot optimize. Only with a basic understanding of compilers can you write compiler-friendly, high-performance code.

Compiler optimization options

Taking GCC as an example, it supports the following optimization levels:

  • -O<number>, where number is 0/1/2/3; the larger the number, the higher the optimization level. The default is -O0.

  • -Ofast, which enables all the optimizations of -O3 and additionally turns on -ffast-math and -fallow-store-data-races. Note that these two options may cause incorrect program behavior.

-ffast-math: Sets the options -fno-math-errno, -funsafe-math-optimizations, -ffinite-math-only, -fno-rounding-math, -fno-signaling-nans, -fcx-limited-range and -fexcess-precision=fast. It can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.

-fallow-store-data-races: Allow the compiler to perform optimizations that may introduce new data races on stores, without proving that the variable cannot be concurrently accessed by other threads. Does not affect optimization of local data. It is safe to use this option if it is known that global data will not be accessed by multiple threads.

  • -Og, the recommended optimization level when debugging code.

gcc -Q --help=optimizers -O<number> shows the optimization options enabled at each optimization level.

Reference link: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

Compiler limitations

To guarantee correctness, the compiler makes no assumptions about how the code will be used, so there is code it will not optimize. Here are two less obvious examples.

1. Memory aliasing

void twiddle1(long *xp, long *yp) {
    *xp += *yp;
    *xp += *yp;
}
void twiddle2(long *xp, long *yp) {
    *xp += 2 * *yp;
}

When xp and yp point to the same memory location (memory aliasing), twiddle1 and twiddle2 are two completely different functions, so the compiler will not try to optimize twiddle1 into twiddle2. If the original intent is the behavior of twiddle2, it should be written as twiddle2 rather than twiddle1: twiddle2 requires only 2 reads and 1 write, while twiddle1 requires 4 reads and 2 writes.

You can explicitly qualify a pointer with __restrict to tell the compiler that no other pointer aliases the memory it points to. With this qualifier, the compiler will optimize twiddle3 to be equivalent to twiddle2. You can disassemble the result and inspect the assembly code for further confirmation.

void twiddle3(long *__restrict xp, long *__restrict yp) {
    *xp += *yp;
    *xp += *yp;
}
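
A small driver makes the aliasing hazard concrete (a minimal sketch reusing twiddle1 and twiddle2 above): when both pointers refer to the same variable, the two functions produce different results, which is exactly why the compiler cannot substitute one for the other.

#include <stdio.h>

void twiddle1(long *xp, long *yp) { *xp += *yp; *xp += *yp; }
void twiddle2(long *xp, long *yp) { *xp += 2 * *yp; }

int main(void) {
    long a = 1, b = 1;
    twiddle1(&a, &a);          /* aliased: 1 -> 2 -> 4 */
    twiddle2(&b, &b);          /* aliased: 1 + 2*1 = 3 */
    printf("%ld %ld\n", a, b); /* prints: 4 3 */
    return 0;
}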

2. Side effects

long f();
long func1() {
    return f() + f() + f() + f();
}
long func2() {
    return 4 * f();
}

Since the implementation of f may have side effects, as in the example below, the compiler will not optimize func1 into func2. If the original intent is func2's behavior, it should be written directly in the func2 form, which eliminates 3 of the 4 function calls.

long counter = 0;
long f() {
    return counter++;
}
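
A quick check makes the difference concrete (a minimal sketch reusing the counter-based f above):

#include <stdio.h>

long counter = 0;
long f() { return counter++; }
long func1() { return f() + f() + f() + f(); }
long func2() { return 4 * f(); }

int main(void) {
    printf("%ld\n", func1()); /* the four calls return 0,1,2,3 in some order -> 6 */
    counter = 0;
    printf("%ld\n", func2()); /* one call returns 0 -> 4 * 0 = 0 */
    return 0;
}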

Program performance optimization

Before diving in, we first introduce a program performance metric: Cycles Per Element (CPE), the number of clock cycles it takes to process one element. CPE characterizes program performance and can guide optimization.
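
As a rough sketch of how CPE might be estimated in practice (this assumes a fixed CPU frequency supplied as CPU_GHZ; rigorous measurements, like those in CSAPP, use hardware cycle counters and take the best of many runs):

#include <stdio.h>
#include <time.h>

#define CPU_GHZ 3.2 /* assumed clock frequency; adjust for your machine */

int main(void) {
    enum { N = 1 << 20 };
    static double a[N];
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double sum = 0;
    for (long i = 0; i < N; i++) /* the measured per-element loop */
        sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* cycles = nanoseconds * (cycles per nanosecond) */
    printf("sum = %f, CPE ~= %.2f\n", sum, ns * CPU_GHZ / N);
    return 0;
}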

The following example introduces several ways to optimize program performance. First, define a vector data structure and some helper functions. The vector is implemented as a contiguously stored array, and the element type can be specified through the typedef of data_t.

typedef struct {
    long len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Create a vector */
vec_ptr new_vec(long len) {
    vec_ptr result = (vec_ptr)malloc(sizeof(vec_rec));
    if (!result)
        return NULL;
    data_t *data = NULL;
    result->len = len;
    if (len > 0) {
        data = (data_t*)calloc(len, sizeof(data_t));
        if (!data) {
            free(result);
            return NULL;
        }
    }
    result->data = data;
    return result;
}

/* Get a vector element by index */
int get_vec_element(vec_ptr v, long index, data_t *dest) {
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Get the number of elements in the vector */
long vec_length(vec_ptr v) {
    return v->len;
}

The function below combines all the elements of a vector into a single value using some operation. IDENT and OP are macro definitions: #define IDENT 0 with #define OP + performs accumulation (sum), while #define IDENT 1 with #define OP * performs cumulative multiplication (product).

void combine1(vec_ptr v, data_t *dest) {
    long i;

    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
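
For concreteness, the configuration for the floating-point multiply experiments below might look like this (an assumed instantiation; data_t can equally be int, long, or float, and the sum variant would use IDENT 0 and OP +):

typedef double data_t; /* element type */
#define IDENT 1
#define OP *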

For the above combine1, the following three basic optimizations can be carried out.

1. For a function that is called repeatedly and returns the same result, save the result in a temporary variable.

The implementation of combine1 calls vec_length in the loop test condition on every iteration. Here, every call to vec_length returns the same result, so the loop can be rewritten as combine2 below. In extreme cases it is especially important to avoid such repeated calls: for example, if a function that computes the length of a string appears in the loop termination condition, that function is usually O(n), and if the string's length clearly does not change, the repeated calls add a great deal of unnecessary overhead (see the sketch after combine2).

void combine2(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);

    *dest = IDENT;
    for (i = 0; i < length; i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
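
A concrete instance of the string-length pitfall mentioned above (a small sketch: lower1 is quadratic overall because the O(n) strlen runs on every iteration, while lower2 computes the length once):

#include <string.h>

/* Quadratic: strlen is re-evaluated in the test condition of every iteration. */
void lower1(char *s) {
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
}

/* Linear: the length is computed once, outside the loop. */
void lower2(char *s) {
    size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] += 'a' - 'A';
}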

2. Reduce procedure calls

Procedure (function) calls incur a certain overhead: argument passing, saving and restoring clobbered registers, and the transfer of control. We can therefore add a function get_vec_start that returns a pointer to the start of the underlying array, and avoid calling get_vec_element inside the loop. There is a trade-off in this optimization: on the one hand it improves performance; on the other hand it requires knowing the implementation details of the vector data structure, which breaks the program's abstraction. If the vector were ever changed to store its data in something other than an array, the implementation of combine3 would have to be changed along with it.

data_t *get_vec_start(vec_ptr v) {
    return v->data;
}
void combine3(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);

    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest OP data[i];
    }
}

3. Eliminate unnecessary memory references

In combine3, every loop iteration reads dest once and writes it once. Because memory aliasing may exist, the compiler must optimize this cautiously. Below is the assembly code of combine3's loop at the -O1 and -O2 optimization levels. With -O2, the compiler keeps the intermediate result in a temporary (register %xmm0) rather than reading it from memory on every iteration as the -O1 code does; but to allow for possible memory aliasing, even the -O2 code must still store the intermediate result back to memory on every iteration.

// combine3 -O1
.L1:
    vmovsd (%rbx), %xmm0
    vmulsd (%rdx), %xmm0, %xmm0
    vmovsd %xmm0, (%rbx)
    addq $8, %rdx
    cmpq %rax, %rdx
    jne .L1

// combine3 -O2
.L1:
    vmulsd (%rdx), %xmm0, %xmm0
    addq $8, %rdx
    cmpq %rax, %rdx
    vmovsd %xmm0, (%rbx)
    jne .L1

To avoid these frequent memory reads and writes, we can explicitly keep the intermediate result in a temporary variable, as combine4 shows.

void combine4(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;
    for (i = 0; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}
// combine4 -O1
.L1:
    vmulsd (%rdx), %xmm0, %xmm0
    addq $8, %rdx
    cmpq %rax, %rdx
    jne .L1

The effect of the above optimizations can be measured with CPE. Test results on an Intel Core i7 Haswell are given below. From the results:

  • combine1 performs very differently at different optimization levels: -O1 is about twice as fast as -O0, showing that it is worth enabling an appropriate optimization level.

  • combine2, which moves vec_length out of the loop, is slightly faster than combine1 at the same optimization level.

  • However, combine3 shows no improvement over combine2. The reason is that the cost of the other operations in the loop hides the cost of calling get_vec_element; it can be hidden thanks to the CPU's support for branch prediction and out-of-order execution, two concepts introduced briefly later in this article.

  • Similarly, combine3 at -O2 performs much better than at -O1. The assembly shows that -O2 removes one read of (%rbx) per iteration and, more importantly, eliminates the read-after-write memory dependency on (%rbx).

  • combine4, which keeps the intermediate result in a temporary variable, beats combine3 at -O2 even when compiled with only -O1. This shows that even with a powerful optimizing compiler, paying attention to such details is still necessary for writing high-performance code.

The following test data is quoted from Chapter 5 of Computer Systems: A Programmer's Perspective.

Function   Optimization                   int +   int *   float +   float *
combine1   -O0                            22.68   20.02   19.98     20.18
combine1   -O1                            10.12   10.12   10.17     11.14
combine2   move vec_length out, -O1       7.02    9.03    9.02      11.03
combine3   reduce procedure calls, -O1    7.17    9.02    9.02      11.03
combine3   reduce procedure calls, -O2    1.60    3.01    3.01      5.01
combine4   accumulate in temporary, -O1   1.27    3.01    3.01      5.01

Instruction-level parallelism

The optimizations above do not rely on any characteristics of the target machine. They simply reduce the overhead of procedure calls and remove "optimization blockers" that make compiler optimization difficult. Further optimization requires understanding some characteristics of the hardware. The figure below (haswell.png) shows the back-end portion of the Intel Core i7 Haswell microarchitecture.

For the complete hardware structure of Intel Core i7 Haswell, please see: https://en.wikichip.org/w/images/c/c7/haswell_block_diagram.svg

Hardware performance

This CPU supports the following features:

  • Instruction-level parallelism: through instruction pipelining, multiple instructions can be evaluated simultaneously.

  • Out-of-order execution: instructions may execute in an order different from the order in which they are written, which lets the hardware achieve greater instruction-level parallelism. By combining out-of-order execution with in-order commit, the processor obtains results consistent with sequential execution.

  • Branch prediction: when a branch is encountered, the hardware predicts which way it will go. A correct prediction speeds up execution, but on a misprediction the speculatively executed results must be discarded and the correct instructions fetched and executed, which incurs a fairly large misprediction penalty.

In the figure above, we mainly focus on the execution units (EUs), which are composed of multiple functional units. The performance of a functional unit is characterized by its latency, issue time, and capacity.

  • Latency: The number of clock cycles required to complete an instruction.

  • Issue time: the minimum number of clock cycles required between two consecutive operations of the same type.

  • Capacity: the number of functional units of a given type. As the figure shows, the EUs contain 4 integer addition units (INT ALU), 1 integer multiplication unit (INT MUL), 1 floating-point addition unit (FP ADD), and 2 floating-point multiplication units (FP MUL).

The functional-unit performance data (in cycles) for the Intel Core i7 Haswell is as follows, quoted from Chapter 5 of Computer Systems: A Programmer's Perspective:

Operation        Latency (int)   Issue time (int)   Capacity (int)   Latency (float)   Issue time (float)   Capacity (float)
Addition         1               1                  4                3                 1                    1
Multiplication   3               1                  1                5                 1                    2

The latency, issue time, and capacity of these arithmetic operations affect the performance of the combine functions above. We describe this effect with two bounds on CPE; the throughput bound is the theoretical optimum.

  • Latency bound: the minimum CPE of any combine-style function that must perform its operations in strict sequential order; it equals the latency of the functional unit.

  • Throughput bound: the maximum rate at which the functional units can produce results, determined by capacity / issue time. Expressed as a CPE, it is the reciprocal of that rate, i.e. issue time / capacity.

Since the combine function must also load data, it is further limited by the load units. Because there are only two load units, each with an issue time of 1 cycle, the throughput bound for integer addition is 0.5 rather than 0.25.

Bound        int +   int *   float +   float *
Latency      1.0     3.0     3.0       5.0
Throughput   0.5     1.0     1.0       0.5
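
As a worked check of the throughput row, using only the capacity and issue-time figures quoted above:

\[
\mathrm{CPE_{throughput}} = \frac{\text{issue time}}{\text{capacity}}:\quad
\text{int}+:\ \tfrac{1}{4} = 0.25 \Rightarrow 0.5\ \text{(limited by the two load units)},\quad
\text{int}\times:\ \tfrac{1}{1} = 1.0,\quad
\text{float}+:\ \tfrac{1}{1} = 1.0,\quad
\text{float}\times:\ \tfrac{1}{2} = 0.5
\]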

Abstract model of processor operation

To analyze the performance of machine-level programs running on modern processors, we introduce the data-flow graph, a graphical representation of how the data dependencies between operations constrain their execution order. These constraints form the graph's critical path, which gives a lower bound on the number of clock cycles required to execute a set of machine instructions.

Usually the for loop accounts for most of a program's execution time. The figure data_flow1.png shows the data-flow graph of combine4's for loop; the arrows indicate the flow of data. The registers involved can be divided into four categories:

  1. Read-only: used only as source values and not modified within the loop; in this example, %rax.

  2. Write-only: used only as the destination of data movement. This example has no such register.

  3. Local: modified and used within the loop, with no dependence carried from one iteration to the next; the condition-code registers are an example.

  4. Loop-carried: used as both source and destination, with the value produced in one iteration used in the next; in this example, %rdx and %xmm0. Because of the data dependencies between iterations, operations on such registers are often the limiting factor of program performance.

Rearranging the graph and keeping only the paths involving the loop-carried registers yields the simplified data-flow graph shown in data_flow_simplify.png.

Replicating the simplified data-flow graph once per iteration yields the critical path, shown in data_flow_critical.png. If the operation in combine4 is floating-point multiplication, then thanks to instruction-level parallelism the latency of the floating-point multiply hides the latency of the integer addition (the pointer increment, the right-hand path in the figure). The theoretical lower bound on combine4's CPE is therefore the floating-point multiply latency of 5.0, which matches the measured 5.01 given above.

Loop unrolling

So far, the program has only reached the latency bound. This is because each floating-point multiplication must wait for the previous one to complete, so the hardware's instruction-level parallelism cannot be fully exploited. Loop unrolling can increase the instruction-level parallelism along the critical path.

void combine5(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);
    long limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc0 = IDENT;
    data_t acc1 = IDENT;

    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 OP data[i];
        acc1 = acc1 OP data[i + 1];
    }

    for (; i < length; ++i) {
        acc0 = acc0 OP data[i];
    }

    *dest = acc0 OP acc1;
}

The critical-path data-flow graph of combine5 is shown in data_flow_critical2.png. There are now two critical paths, but they can execute in parallel at the instruction level, and each contains only n/2 of the operations, so performance can break through the latency bound. In theory, the CPE for floating-point multiplication is approximately 5.0/2 = 2.5.

Increasing the number of accumulator variables and the unrolling factor can, in theory, raise the degree of instruction-level parallelism until the throughput bound is reached. However, the unrolling factor cannot grow without limit. First, the hardware has a limited number of functional units, so the CPE is bounded below by the throughput bound; beyond a certain point, further unrolling cannot increase instruction-level parallelism. Second, register resources are limited: more unrolling means more registers in use, and once the demand exceeds the registers provided by the hardware, register spilling occurs. Values then have to be temporarily stored to memory and reloaded into registers when used, which degrades performance. As the table below shows, unrolling 20 times performs slightly worse than unrolling 10 times. Fortunately, most hardware reaches its throughput bound before register spilling occurs.

Function           Unroll factor   int +   int *   float +   float *
combine5           2               0.81    1.51    1.51      2.51
combine5           10              0.55    1.00    1.01      0.52
combine5           20              0.83    1.03    1.02      0.68
Latency bound      /               1.00    3.00    3.00      5.00
Throughput bound   /               0.50    1.00    1.00      0.50
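
For reference, a 4-way unrolled version with four accumulators might look like the sketch below (following the same pattern as combine5; the measured factors 10 and 20 extend this idea further):

void combine5x4(vec_ptr v, data_t *dest) {
    long i;
    long length = vec_length(v);
    long limit = length - 3;
    data_t *data = get_vec_start(v);
    data_t acc0 = IDENT, acc1 = IDENT, acc2 = IDENT, acc3 = IDENT;

    /* four independent dependency chains per iteration */
    for (i = 0; i < limit; i += 4) {
        acc0 = acc0 OP data[i];
        acc1 = acc1 OP data[i + 1];
        acc2 = acc2 OP data[i + 2];
        acc3 = acc3 OP data[i + 3];
    }
    /* finish any remaining elements */
    for (; i < length; ++i) {
        acc0 = acc0 OP data[i];
    }
    *dest = (acc0 OP acc1) OP (acc2 OP acc3);
}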

SIMD (single instruction, multiple data)

SIMD is another effective performance optimization technique. Unlike instruction-level parallelism, it exploits data-level parallelism: a single instruction operates on a whole vector of data, and it requires hardware support. x86 CPUs support the AVX instruction set, and ARM CPUs support the NEON instruction set. SIMD is used extensively in MegCC, a deep learning compiler developed by the Megvii Tianyuan team. MegCC takes a model in MegEngine format as input and outputs all the kernels required to run the model, making deployment easy, high-performance, and lightweight. To help users convert models from other formats, the team also provides the model conversion tool MgeConvert: you can convert a model to onnx and then use MgeConvert to turn it into a MegEngine-format model. And if you want to measure the throughput and latency of a particular instruction on your device to guide optimization, you can use MegPeak.

Many high-performance deep learning operators are implemented in MegCC. Convolution and matrix multiplication are typical compute-intensive operators, and convolution can also be implemented on top of matrix multiplication (via im2col, Winograd, and similar algorithms).

On the ARM platform, MegCC supports matrix multiplication and convolution implemented with the NEON DOT and I8MM instructions. One DOT instruction completes 32 multiply-add operations (16 multiplications and 16 additions); one I8MM instruction completes 64 multiply-add operations (32 multiplications and 32 additions). This is how SIMD technology accelerates computation.
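
To illustrate the DOT instruction, here is a minimal int8 dot-product sketch, not MegCC's actual kernel (it assumes a CPU with the dotprod extension and compilation with something like -march=armv8.2-a+dotprod; the helper name dot_s8 is ours):

#include <arm_neon.h>
#include <stdint.h>

/* int8 dot product: each vdotq_s32 performs 16 multiplies and 16 adds. */
int32_t dot_s8(const int8_t *a, const int8_t *b, long n) {
    int32x4_t acc = vdupq_n_s32(0);
    long i = 0;
    for (; i + 16 <= n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        acc = vdotq_s32(acc, va, vb); /* 4 lanes, each accumulating a 4-way dot */
    }
    int32_t sum = vaddvq_s32(acc);    /* horizontal sum of the 4 lanes */
    for (; i < n; i++)                /* scalar tail */
        sum += a[i] * b[i];
    return sum;
}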

References

  1. Randal E. Bryant, David R. O’Hallaron. Computer Systems: A Programmer’s Perspective, Chapter 5.

  2. Antonio González, Fernando Latorre, Grigorios Magklis. Processor Microarchitecture: An Implementation Perspective, Chapter 1.

  3. https://github.com/MegEngine/MegCC
