Serial code performance optimization

Serial code optimization can be carried out at the following levels: system level, application level, algorithm level, function level, loop level, statement level, and instruction level.

1. System level

  1.1 Network Speed, Utilization and Load Balancing

  If the application often waits for network data to be transmitted and arrive, the speed and utilization of the network must be considered; for a cluster, network load balancing must be considered as well.

  1.2 Processor Utilization

  Find out why the processor utilization is low or high.

  1.3 Memory Bandwidth Utilization

  (1) Improve the locality of memory access to increase cache utilization.

  (2) Save data in temporary variables to reduce memory read and write.

  (3) Reduce read and write dependencies.

  (4) Read and write multiple data at the same time to improve the system's instruction-level parallelism and vectorization capabilities.
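Point (2) above can be illustrated with a minimal sketch: accumulating into a local temporary keeps the running value in a register instead of repeatedly reading and writing memory. The function name and shapes here are illustrative, not from the original text.

```c
#include <stddef.h>

/* Accumulate into a local temporary instead of storing to memory on
 * every iteration, reducing read/write traffic on the result location. */
float dot(const float *a, const float *b, size_t n) {
    float acc = 0.0f;               /* lives in a register, not memory */
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];         /* no store to the result inside the loop */
    return acc;
}
```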

  1.4 Factors that block processor operations

  By observing processor utilization, you can estimate what percentage of the processor's time is spent waiting for IO operations to complete. If that percentage is significant, consider using non-blocking calls or a separate thread to handle the IO.

2. Application level

  2.1 Compiler Options

  For example, GCC has -O0 (no optimization), -O1, -O2 (commonly used), -O3, and other optimization options.

  2.2 Calling high-performance libraries

  For example, BLAS and FFTW.

  2.3 Remove global variables

  Global variables, especially global data structures shared across multiple files, hinder compiler optimization. In parallel programs, mutable global variables should be avoided entirely.

  2.4 Restricted pointers

  The C99 pointer qualifier restrict promises the compiler that the qualified pointer is the only way the pointed-to object is accessed, which enables more aggressive optimization.
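A minimal sketch of restrict in use; the function name is illustrative. With the no-aliasing promise, the compiler can vectorize the loop without worrying that dst overlaps a or b.

```c
/* The C99 `restrict` qualifier promises the compiler that these
 * pointers do not alias, so values may be kept in registers across
 * the stores and the loop can be vectorized freely. */
void add_vec(float *restrict dst, const float *restrict a,
             const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```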

  2.5 Conditional compilation

  Conditional compilation can make code shorter and more efficient, since unused variants are removed entirely at compile time.
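A minimal sketch of the idea, with an illustrative macro name: the element type is selected at compile time, so the compiled loop contains no runtime branch for the unused variant.

```c
/* Conditional compilation removes the unused variant entirely;
 * the macro name USE_DOUBLE is illustrative. */
#define USE_DOUBLE 0

#if USE_DOUBLE
typedef double real;
#else
typedef float real;
#endif

real scale_sum(const real *a, int n, real s) {
    real acc = 0;
    for (int i = 0; i < n; i++)
        acc += s * a[i];   /* only one type variant exists in the binary */
    return acc;
}
```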

3. Algorithm level

  3.1 Index order

  The locality of accesses to multidimensional data depends directly on the order in which the dimensions are laid out in memory. For example, arrays in the C language are stored in row-major order, so data should be accessed row by row.

  Code before optimization:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        r[j] += a[j][i];   /* column-wise access: poor locality */
    }
}

  Optimized code:

for (int i = 0; i < M; i++) {
    float ret = 0.0f;
    for (int j = 0; j < N; j++) {
        ret += a[i][j];    /* row-wise access: contiguous in memory */
    }
    r[i] = ret;
}

  3.2 Cache blocking

  If the working set exceeds the size of the cache, sustained cache misses are likely. The common remedy is cache blocking (tiling): partition the loops so that each block of data fits in cache.

  Take matrix multiplication as an example:

/* Blocked (tiled) matrix multiplication; assumes NB divides N, M and K */
for (int i = 0; i < N; i += NB)
    for (int j = 0; j < M; j += NB)
        for (int k = 0; k < K; k += NB)
            for (int i0 = i; i0 < i + NB; i0++)
                for (int j0 = j; j0 < j + NB; j0++)
                    for (int k0 = k; k0 < k + NB; k0++)
                        C[i0][j0] += A[i0][k0] * B[k0][j0];

  3.3 Software Prefetching

  Prefetching speculatively loads data into the cache before it is used. The prefetch distance and frequency must be chosen carefully: prefetch too early and the data may be evicted before use, too late and the latency is not hidden.
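A minimal sketch using the GCC/Clang builtin __builtin_prefetch; the prefetch distance of 16 elements is an assumption and must be tuned for the target machine.

```c
/* Request cache lines a fixed distance ahead of the current index.
 * The distance (16 elements here) is illustrative, not tuned. */
float sum_prefetch(const float *a, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1); /* read, low temporal locality */
        acc += a[i];
    }
    return acc;
}
```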

  3.4 Look-up table method

  The look-up table method is usually combined with linear interpolation to reduce the loss of calculation accuracy.
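A minimal sketch of the combination, with illustrative names and an illustrative target function: 1/x on [1, 2) is approximated by a small table plus linear interpolation between adjacent entries.

```c
#define TBL_N 64
static float inv_tbl[TBL_N + 1];

/* Build the table once: inv_tbl[i] = 1 / (1 + i/TBL_N) */
static void init_inv_tbl(void) {
    for (int i = 0; i <= TBL_N; i++)
        inv_tbl[i] = 1.0f / (1.0f + (float)i / TBL_N);
}

/* Approximate 1/x for x in [1, 2): table lookup + linear interpolation */
static float inv_lut(float x) {
    float t = (x - 1.0f) * TBL_N;
    int i = (int)t;
    float frac = t - (float)i;            /* position between two entries */
    return inv_tbl[i] + frac * (inv_tbl[i + 1] - inv_tbl[i]);
}
```

With 64 intervals the interpolation error here stays well below 1e-3, while each call costs only a few arithmetic operations instead of a division.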

4. Function level

  4.1 Function call parameters

  If a function parameter is a large structure, pass a pointer (or, in C++, a reference) instead, to avoid the overhead of copying the structure on the call and destroying the copy on return.
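A minimal sketch of the difference, with illustrative names: the by-value version copies the whole struct (about 8 KB here) on every call, while the pointer version copies only one pointer.

```c
/* A struct this large is copied onto the stack on every by-value call. */
typedef struct { double samples[1024]; int count; } Buffer;

double mean_by_value(Buffer b) {          /* copies ~8 KB per call */
    double s = 0.0;
    for (int i = 0; i < b.count; i++) s += b.samples[i];
    return s / b.count;
}

double mean_by_ptr(const Buffer *b) {     /* copies one pointer */
    double s = 0.0;
    for (int i = 0; i < b->count; i++) s += b->samples[i];
    return s / b->count;
}
```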

  4.2 Inline small functions

  Inlining eliminates the overhead of function calls and exposes more optimization opportunities, such as instruction-level parallelism and redundant-expression elimination, thereby improving instruction pipeline performance.
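A minimal sketch with illustrative names: marking a tiny helper static inline lets the compiler substitute its body at the call site.

```c
/* A tiny helper marked `static inline`: the compiler can substitute
 * its body at each call site, removing call overhead. */
static inline float sq(float x) { return x * x; }

float sum_squares(const float *a, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += sq(a[i]);     /* expands to a[i] * a[i] after inlining */
    return acc;
}
```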

5. Loop level

  5.1 Loop Unrolling

  Loop unrolling reduces the per-iteration branch and loop-variable overhead and improves processor pipelining. Usually, unrolling small loops without internal branches is beneficial; unrolling large loops may degrade performance due to register spills, and unrolling loops that contain internal branches increases branch-prediction overhead, so performance may instead decrease.

  Before loop unrolling:

float sum = 0.0f;
for (int i = 0; i < num; i++) {
    sum += a[i];
}

  After unrolling the loop (note that the leftover tail elements must be handled separately when num is not a multiple of 4):

float sum = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
for (int i = 0; i < num; i += 4) {
    sum  += a[i];
    sum1 += a[i+1];
    sum2 += a[i+2];
    sum3 += a[i+3];
}
sum += sum1 + sum2 + sum3;

  5.2 Loop accumulation

  Loop accumulation is mainly used together with loop unrolling: it preserves the parallelism while reducing register usage.

float sum = 0.0f, sum1 = 0.0f, sum2 = 0.0f;
for (int i = 0; i < num; i += 6) {
    sum1 += a[i]   + a[i+1];
    sum  += a[i+2] + a[i+3];
    sum2 += a[i+4] + a[i+5];
}
sum += sum1 + sum2;

  Unrolling the loop six ways directly would require at least six accumulators; here only three are used, reducing register pressure.

  5.3 Loop Merging

  If the registers used by several small loops together do not exceed what the processor provides, merging those loops may yield a performance benefit (increased out-of-order execution capability). The loops being merged must have no dependencies between them.
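A minimal sketch with illustrative names: two independent passes over the same array (one for the minimum, one for the maximum) are merged into a single traversal, halving loop overhead and memory traffic.

```c
/* Two independent small loops (min pass, max pass) merged into one. */
void min_max_merged(const float *a, int n, float *mn, float *mx) {
    float lo = a[0], hi = a[0];
    for (int i = 1; i < n; i++) {   /* one traversal computes both results */
        if (a[i] < lo) lo = a[i];
        if (a[i] > hi) hi = a[i];
    }
    *mn = lo;
    *mx = hi;
}
```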

  5.4 Loop Split

  If a large loop causes register spills, splitting it into smaller loops can improve register usage.
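A minimal sketch with illustrative names: a loop body that updates two unrelated outputs is split into two lighter loops, each with a smaller set of values live at once.

```c
/* One loop with many live values at once... */
void loops_fused(const float *a, const float *b, float *x, float *y, int n) {
    for (int i = 0; i < n; i++) {
        x[i] = a[i] * 2.0f + b[i];
        y[i] = a[i] - b[i] * 0.5f;
    }
}

/* ...split into two loops, each with a smaller register working set. */
void loops_split(const float *a, const float *b, float *x, float *y, int n) {
    for (int i = 0; i < n; i++)     /* first pass: only a, b, x live */
        x[i] = a[i] * 2.0f + b[i];
    for (int i = 0; i < n; i++)     /* second pass: only a, b, y live */
        y[i] = a[i] - b[i] * 0.5f;
}
```

Splitting only pays off when the two halves are independent; here each output depends only on the inputs, so the results are identical.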

6. Statement level

  
