GCC compilation optimization and loop unrolling

1. Brief overview of GCC optimization levels

The default is -O0, which performs essentially no optimization.
From -O1 to -O3 the optimization level increases step by step, and each higher level includes all the optimizations of the levels below it.
Once optimization is turned on, whether -O1, -O2, or -O3, the compiler may reorder the generated code relative to the source, which makes debugging harder.
Each optimization level is made up of many finer-grained optimization flags, and sometimes you can enable just one of those flags on its own. For details of the gcc compilation commands, you can see this blog.
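
One way to check which mode a file was actually compiled in is to look at GCC's predefined macros. The sketch below assumes GCC or a compatible compiler, which defines __OPTIMIZE__ when optimization is enabled and __OPTIMIZE_SIZE__ when optimizing for size; the helper names optimization_mode and print_optimization_mode are made up for illustration.

```c
#include <stdio.h>

/* Returns a short description of how this file was compiled.
 * __OPTIMIZE__ and __OPTIMIZE_SIZE__ are GCC predefined macros;
 * optimization_mode is a hypothetical helper name. */
const char *optimization_mode(void) {
#if defined(__OPTIMIZE_SIZE__)
    return "optimized for size";
#elif defined(__OPTIMIZE__)
    return "optimized";
#else
    return "not optimized";
#endif
}

/* Prints the mode; call this from main() when experimenting. */
void print_optimization_mode(void) {
    printf("built with: %s\n", optimization_mode());
}
```

Building the same file once with gcc -O0 and once with gcc -O2 and calling print_optimization_mode() from main should report different modes.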

In addition, production projects are generally built with -O2, while unoptimized builds are used for local debugging. For a project that actually runs in production, enabling optimization is definitely better. Moreover, it is often said that STL containers are less efficient than a hand-written array implementation. That holds when -O2 is not enabled (in competitions you may not be able to enable optimization). With -O2 enabled, STL containers are still very efficient, because they are designed to minimize their own overhead.

-O1 optimization:

This level starts to reorder instructions (at the compiler level, not the CPU level), removes some redundant conditional branches, and reduces unnecessary stack frame setup.

-O2 optimization:

This level does not turn on general loop unrolling (-funroll-loops)! ChatGPT may say that -O2 includes loop unrolling, which is wrong.
-O2 turns on all the optimizations of -O1, plus additional ones such as:
1) Data in memory is copied into a register before the instructions that use it run, so if several instructions need the same data, the cost of copying it from memory to a register again is saved. (For example, if a = 2 and many variables then need +a, copying a into a register once and then executing all the +a instructions is more efficient than loading a from memory every time a +a instruction is encountered.)
2) Multiple function calls can be merged into one (the -funit-at-a-time compilation option makes this possible; it lets the compiler analyze the whole compilation unit before generating machine instructions, so that it can perform such optimizations), as in the following example:

int add(int a, int b) {
    return a + b;
}

int mul(int a, int b) {
    return a * b;
}

int add_and_mul(int a, int b, int c) {
    int sum = add(a, b);
    return mul(sum, c);
}

In the code above, the function add_and_mul calls the two functions add and mul; these can be combined into a single function, as follows:

int add_and_mul_opt(int a, int b, int c) {
    // Same result as mul(add(a, b), c), with both calls folded away
    return (a + b) * c;
}
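
Item 1) above, keeping a value in a register instead of reloading it, can be imitated by hand in the same way. This is only a sketch of the idea, not what GCC literally emits; the function names add_offset_naive and add_offset_hoisted are hypothetical, and the two versions are equivalent only under the assumption that out does not overlap the memory a points to.

```c
#include <stddef.h>

/* Naive form: as written, *a is read from memory in every iteration. */
void add_offset_naive(int *out, const int *in, size_t n, const int *a) {
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] + *a;
    }
}

/* Hand-hoisted form: *a is read once into a local variable, which the
 * compiler can keep in a register for the whole loop.
 * Assumption: out does not alias *a, so the value cannot change. */
void add_offset_hoisted(int *out, const int *in, size_t n, const int *a) {
    const int a_val = *a;
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] + a_val;
    }
}
```

Writing the hoisted form by hand is rarely necessary with optimization on; it is shown here only to make the saved loads visible.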

-O3 optimization:

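Only at this level are loop unrolling and function inlining applied aggressively. (According to the GCC documentation, some inlining already happens at lower levels, and the general -funroll-loops flag is not part of any -O level: it has to be passed explicitly or enabled through profile feedback. What -O3 adds includes -fpeel-loops, which completely unrolls loops with a small constant number of iterations.)
Loop unrolling is introduced below.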
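
For a single hot loop, unrolling can also be requested explicitly instead of through the optimization level. A minimal sketch, assuming GCC 8 or later, which documents the loop-specific pragma #pragma GCC unroll n; the function name sum_array is hypothetical, and other compilers may ignore the pragma.

```c
#include <stddef.h>

/* Sums n ints. The pragma asks GCC to unroll the loop that follows
 * by a factor of 4; it is a request for this one loop only and does
 * not change the result. sum_array is a hypothetical name. */
long sum_array(const int *data, size_t n) {
    long total = 0;
#pragma GCC unroll 4
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Comparing the assembly from gcc -O2 -S with and without the pragma is one way to see whether the request was honored.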

2. Loop unrolling

(Loop unrolling reduces the number of loop iterations; it does not get rid of the loop entirely.)
Loop unrolling is an optimization technique that improves program performance by reducing the number of loop iterations. It is often used to reduce loop overhead, improve CPU utilization, and reduce the number of branch mispredictions.

(1) Improve CPU utilization

Within a loop body, independent statements can be executed concurrently. Each statement ultimately compiles to several instructions, and the more independent statements the body contains, the more likely it is that their instructions execute in parallel. Take the following example:

for (int i = 0; i < n; i++) {
    a[i] += b[i];
    c[i] += d[i];
}

After doing loop unrolling:

int i = 0;
for (; i + 1 < n; i += 2) {
    a[i] += b[i];
    c[i] += d[i];
    a[i+1] += b[i+1];
    c[i+1] += d[i+1];
}
// Leftover iteration when n is odd
for (; i < n; i++) {
    a[i] += b[i];
    c[i] += d[i];
}

The above unrolls the loop by hand to imitate what the compiler does. Comparing the loop before and after unrolling, the total amount of work is the same (the total number of statements executed is the same).
But in fact the instructions within each iteration can be executed in parallel; this is the CPU's instruction-level parallelism.
In the first version, the instructions for two statements are available to execute at the same time (not just two instructions; an estimated 6),
while in the second version the instructions for four statements are available at the same time (about 12).
This is where the efficiency gain comes from.
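
The parallelism argument relies on the unrolled statements being independent of one another. In a reduction such as a sum they are not, unless each unrolled copy gets its own accumulator. The sketch below illustrates that idea under stated assumptions: the names sum_single and sum_two_accumulators are hypothetical, and any actual speedup depends on the CPU and compiler.

```c
#include <stddef.h>

/* Baseline: every addition depends on the previous value of total. */
long sum_single(const int *v, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += v[i];
    }
    return total;
}

/* Unrolled by 2 with separate accumulators: the two additions in the
 * loop body do not depend on each other, so the CPU is free to
 * overlap them. */
long sum_two_accumulators(const int *v, size_t n) {
    long even = 0;
    long odd = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        even += v[i];
        odd += v[i + 1];
    }
    if (i < n) {
        even += v[i]; /* leftover element when n is odd */
    }
    return even + odd;
}
```

Both functions return the same sum for integer input; only the dependency structure of the loop body differs.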

(2) Reduce the number of branch prediction failures

Loop unrolling can also reduce the number of branch mispredictions, because it reduces the number of branch statements executed. When the loop is unrolled, the branch statements inside the loop run fewer times, which lowers the chance of a misprediction, as in the following example:

for (int i = 0; i < n; i++) {
    if (i % 2 == 0) {
        // even
        sum += i;
    } else {
        // odd
        sum -= i;
    }
}

After loop unrolling:

int i = 0;
for (; i + 1 < n; i += 2) {
    sum += i;       // even
    sum -= (i + 1); // odd
}
if (i < n) {
    sum += i; // last even index when n is odd
}

As you can see, there is no if/else branch left in the loop body after unrolling. Of course, in a real project unrolling usually reduces the branches rather than eliminating them entirely; otherwise the branch would not have been written in the first place. ^_^
In the loop before unrolling, every iteration executes a branch statement, and depending on the branch predictor's success rate there is some probability of a misprediction; in the unrolled loop there are fewer branch statements, so there are fewer mispredictions.
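
A quick way to gain confidence in a hand-unrolled loop of this kind is to compare it against the branching version over a range of n, including odd values, where the unrolled loop needs one leftover iteration. The sketch below does that; the function names alternating_sum_branchy and alternating_sum_unrolled are hypothetical.

```c
/* Branching version: adds even indices, subtracts odd ones. */
long alternating_sum_branchy(int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i % 2 == 0) {
            sum += i;
        } else {
            sum -= i;
        }
    }
    return sum;
}

/* Unrolled by 2: each pass handles one even and one odd index, so
 * the if/else disappears from the loop body. */
long alternating_sum_unrolled(int n) {
    long sum = 0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        sum += i;
        sum -= (i + 1);
    }
    if (i < n) {
        sum += i; /* leftover even index when n is odd */
    }
    return sum;
}
```

Calling both functions for every n in a small range and checking that the results match catches the two typical mistakes: a wrong loop step and a missing leftover iteration.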

Origin blog.csdn.net/mrqiuwen/article/details/130215980