Serial code optimization can be carried out at the following levels: system level, application level, algorithm level, function level, loop level, statement level, and instruction level.
1. System level
1.1 Network Speed, Utilization and Load Balancing
If the application frequently waits for network data to be transmitted and arrive, the speed and utilization of the network must be considered; on a cluster, network load balancing must be considered as well.
1.2 Processor Utilization
Find out why processor utilization is unusually low (e.g. the processor is stalled waiting for memory or I/O) or unusually high.
1.3 Memory Bandwidth Utilization
(1) Improve the locality of memory access to increase cache utilization.
(2) Save data in temporary variables to reduce memory read and write.
(3) Reduce read and write dependencies.
(4) Read and write multiple data items at a time to improve instruction-level parallelism and the potential for vectorization.
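Point (2) above can be sketched in a few lines of C (the function name `sum_array` and the single-accumulator structure are illustrative):

```c
#include <stddef.h>

/* Accumulate in a temporary that can live in a register instead of
 * writing a result back to memory on every iteration. */
float sum_array(const float *a, size_t n)
{
    float tmp = 0.0f;          /* temporary: no store inside the loop */
    for (size_t i = 0; i < n; i++)
        tmp += a[i];
    return tmp;                /* single write when the caller stores it */
}
```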
1.4 Factors that block processor operations
By observing processor utilization, you can estimate what percentage of the processor's time is spent waiting for I/O operations to complete. If that percentage is significant, consider using non-blocking calls or a separate thread to handle the I/O.
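A minimal sketch of the separate-thread approach using POSIX threads (everything here is hypothetical: a real version would replace `io_task` with an actual blocking read or write):

```c
#include <pthread.h>

/* Stand-in for a blocking I/O task, run on its own thread. */
static void *io_task(void *arg)
{
    *(int *)arg = 42;   /* pretend this value arrived from I/O */
    return NULL;
}

int overlap_compute_with_io(void)
{
    pthread_t tid;
    int io_result = 0;
    pthread_create(&tid, NULL, io_task, &io_result);

    int compute = 0;
    for (int i = 1; i <= 100; i++)   /* computation overlapped with I/O */
        compute += i;

    pthread_join(tid, NULL);         /* wait only when the data is needed */
    return compute + io_result;      /* 5050 + 42 */
}
```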
2. Application level
2.1 Compiler Options
For example, GCC offers -O0 (no optimization), -O1, -O2 (the most commonly used), -O3, and other optimization levels.
2.2 Calling high-performance libraries
such as BLAS and FFTW.
2.3 Remove global variables
Global variables, especially global data structures shared across multiple files, hinder compiler optimization. In parallel programs, mutable global variables should be avoided entirely.
2.4 Restricted pointers
The restrict pointer qualifier promises the compiler that the object a pointer refers to is accessed only through that pointer (no aliasing), which enables more aggressive optimization.
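A minimal sketch of how restrict is used (function and parameter names are illustrative):

```c
#include <stddef.h>

/* With restrict, the compiler may assume dst and src never alias,
 * so it can keep values in registers and vectorize the loop. */
void add_arrays(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];   /* no per-iteration reload forced by aliasing */
}
```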
2.5 Conditional compilation
Conditional compilation can make code shorter and more efficient.
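For example, a debug-only trace macro can be compiled away entirely (`DEBUG_TRACE` and `TRACE` are illustrative names, not a standard API):

```c
#include <stdio.h>

/* When DEBUG_TRACE is not defined, the logging code is removed
 * at compile time and costs nothing at runtime. */
#ifdef DEBUG_TRACE
#define TRACE(msg) fprintf(stderr, "trace: %s\n", (msg))
#else
#define TRACE(msg) ((void)0)   /* compiles to nothing */
#endif

int scaled_sum(int a, int b)
{
    TRACE("entering scaled_sum");
    return 2 * (a + b);
}
```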
3. Algorithm level
3.1 Index order
The locality of accesses to multidimensional data depends directly on the order in which the dimensions are laid out in memory. For example, arrays in C are stored in row-major order, so data should be traversed row by row.
Code before optimization:
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            r[j] += a[j][i];    /* strides across rows: poor locality */
        }
    }
Optimized code:
    for (int i = 0; i < M; i++) {
        float ret = 0.0f;
        for (int j = 0; j < N; j++) {
            ret += a[i][j];     /* consecutive addresses: good locality */
        }
        r[i] = ret;
    }
3.2 Cache blocking
If the working set exceeds the cache size, sustained cache misses become likely. The common remedy is cache blocking (tiling): partitioning the data into blocks that fit in the cache.
Take matrix multiplication as an example:
    for (i = 0; i < N; i += NB)
        for (j = 0; j < M; j += NB)
            for (k = 0; k < K; k += NB)
                for (i0 = i; i0 < i + NB; i0++)
                    for (j0 = j; j0 < j + NB; j0++)
                        for (k0 = k; k0 < k + NB; k0++)
                            C[i0][j0] += A[i0][k0] * B[k0][j0];
3.3 Software Prefetching
Prefetching speculatively loads data into the cache before it is used. The prefetch distance (how far ahead) and frequency must be chosen carefully: too early and the data may be evicted before use, too late and the latency is not hidden.
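On GCC and Clang, software prefetching can be expressed with the `__builtin_prefetch` intrinsic; a sketch (the prefetch distance `AHEAD` is a tuning parameter, chosen arbitrarily here):

```c
#include <stddef.h>

#define AHEAD 16   /* prefetch distance: must be tuned per machine */

float prefetch_sum(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n)
            /* args: address, rw (0 = read), temporal locality (0-3) */
            __builtin_prefetch(&a[i + AHEAD], 0, 1);
        sum += a[i];
    }
    return sum;
}
```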
3.4 Look-up table method
The look-up table method and linear interpolation are usually used in combination to reduce the loss of calculation accuracy.
4. Function level
4.1 Function call parameters
If a function parameter is a large structure, pass a pointer or reference instead, to avoid the overhead of copying it at the call and destroying the copy when the call returns.
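A sketch contrasting the two calling conventions (the `Block` type and its size are illustrative):

```c
/* A large struct: passing it by value copies all 1024 bytes per call. */
typedef struct {
    double samples[128];
} Block;

/* by value: the whole struct is copied into the callee's frame */
double mean_by_value(Block b)
{
    double s = 0.0;
    for (int i = 0; i < 128; i++) s += b.samples[i];
    return s / 128.0;
}

/* by pointer: only an address is passed */
double mean_by_ptr(const Block *b)
{
    double s = 0.0;
    for (int i = 0; i < 128; i++) s += b->samples[i];
    return s / 128.0;
}
```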
4.2 Inline small functions
Inlining eliminates function-call overhead and exposes more optimization opportunities, such as instruction-level parallelism and redundant-expression elimination, thereby improving pipeline performance.
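A minimal illustration (`sq` and `sum_of_squares` are illustrative names):

```c
/* static inline lets the compiler substitute the body at the call
 * site, removing call/return overhead and exposing the arithmetic
 * to further optimization inside the loop below. */
static inline float sq(float x) { return x * x; }

float sum_of_squares(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += sq(a[i]);        /* after inlining: s += a[i] * a[i]; */
    return s;
}
```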
5. Loop level
5.1 Loop Unrolling
Loop unrolling reduces the number of branch tests and loop-variable updates per iteration and improves processor pipeline utilization. As a rule of thumb: unrolling small loops without internal branches usually pays off; unrolling large loops may hurt performance because of register spilling; and unrolling loops that branch internally increases branch-prediction overhead, so performance can decrease.
Before loop unrolling:
    float sum = 0.0f;
    for (int i = 0; i < num; i++) {
        sum += a[i];
    }
After loop unrolling (the tail elements must be handled separately when num is not a multiple of 4):
    float sum = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    for (int i = 0; i < num; i += 4) {
        sum  += a[i];
        sum1 += a[i+1];
        sum2 += a[i+2];
        sum3 += a[i+3];
    }
    sum += sum1 + sum2 + sum3;
5.2 Loop accumulation
Loop accumulation is mainly used together with loop unrolling: it preserves the parallelism of unrolling while reducing register usage.
    float sum = 0.0f, sum1 = 0.0f, sum2 = 0.0f;
    for (int i = 0; i < num; i += 6) {
        sum1 += a[i]   + a[i+1];
        sum  += a[i+2] + a[i+3];
        sum2 += a[i+4] + a[i+5];
    }
    sum += sum1 + sum2;
A direct six-way unroll would need at least six accumulators; here only three are needed, potentially reducing register usage.
5.3 Loop Merging
If the registers used by several small loops together do not exceed what the processor provides, merging (fusing) them may yield a performance benefit by increasing out-of-order execution opportunities. This requires that the loops have no dependencies on each other.
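A sketch of merging two independent loops over the same array (names are illustrative):

```c
/* Two independent small loops over a[] fused into one: the array is
 * read once instead of twice, and the two independent accumulations
 * can execute out of order within the same iteration. */
void fused_stats(const float *a, int n, float *sum, float *sumsq)
{
    float s = 0.0f, s2 = 0.0f;
    for (int i = 0; i < n; i++) {   /* was: one loop for s, one for s2 */
        s  += a[i];
        s2 += a[i] * a[i];
    }
    *sum = s;
    *sumsq = s2;
}
```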
5.4 Loop splitting
If a large loop causes register spilling, splitting it into several smaller loops can relieve register pressure.
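A sketch of the idea (the two-pass split and all names are illustrative):

```c
/* If one big loop body keeps more values live than there are
 * registers, splitting it into two passes reduces the number of
 * live values per loop and can eliminate spilling. */
void split_passes(const float *a, const float *b,
                  float *x, float *y, int n)
{
    /* pass 1: touches only a and x */
    for (int i = 0; i < n; i++)
        x[i] = a[i] * 2.0f;
    /* pass 2: touches only b and y */
    for (int i = 0; i < n; i++)
        y[i] = b[i] + 1.0f;
}
```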
6. Statement level