9 Tips to Improve Code Efficiency

An article I came across while studying. I am reprinting it here because it rewards repeated reading.

The first goal of any program is to work correctly under all circumstances; a program that runs fast but produces wrong answers is useless. During development and optimization we must consider how the code will be used and which factors matter most, and we usually have to trade off the program's simplicity against its running speed. Today we will look at how to optimize program performance.

1. Reduce the amount of program calculations

1.1 Sample code

for (i = 0; i < n; i++) {
  int ni = n*i;
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
}

1.2 Analysis code

  As shown above, each execution of the outer loop performs one multiplication: i = 0 gives ni = 0; i = 1 gives ni = n; i = 2 gives ni = 2n. We can therefore replace the multiplication with an addition, stepping ni by n on each iteration, which removes the multiplication from the outer loop.

1.3 Improve the code

int ni = 0;
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
  ni += n;         // replace the multiplication with an addition
}

Multiplication instructions in a computer are much slower than addition instructions.

2. Extract the common parts of the code

2.1 Sample code

  Imagine we have an image represented as a two-dimensional array whose elements are pixels. For a given pixel, we want the sum (or average) of its four neighbors to the east, south, west, and north. The code is shown below.

up =    val[(i-1)*n + j  ];
down =  val[(i+1)*n + j  ];
left =  val[i*n     + j-1];
right = val[i*n     + j+1];
sum = up + down + left + right;

2.2 Analysis code

  Compiling the above code produces the assembly shown below. Note the three imulq instructions, each multiplying by n. Expanding the expressions for up, down, left, and right shows that i*n + j appears in all four. We can therefore extract this common subexpression and derive each neighbor's index with a single addition or subtraction.

leaq   1(%rsi), %rax  # i+1
leaq   -1(%rsi), %r8  # i-1
imulq  %rcx, %rsi     # i*n
imulq  %rcx, %rax     # (i+1)*n
imulq  %rcx, %r8      # (i-1)*n
addq   %rdx, %rsi     # i*n+j
addq   %rdx, %rax     # (i+1)*n+j
addq   %rdx, %r8      # (i-1)*n+j

2.3 Improve the code

long inj = i*n + j;
up =    val[inj - n];
down =  val[inj + n];
left =  val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

  The assembly for the improved code is shown below. Only one multiplication remains, saving roughly 6 clock cycles (a multiplication takes about 3 cycles, and two of the three were eliminated).

imulq %rcx, %rsi  # i*n
addq %rdx, %rsi  # i*n+j
movq %rsi, %rax  # i*n+j
subq %rcx, %rax  # i*n+j-n
leaq (%rsi,%rcx), %rcx # i*n+j+n
...

  GCC can perform the optimization above automatically, depending on the optimization level. The techniques introduced below, however, must be applied by hand.

3. Eliminate inefficient code in the loop

3.1 Sample code

  The program below looks fine: an ordinary case-conversion routine. So why does its execution time grow quadratically as the input string gets longer?

void lower1(char *s)
{
  size_t i;
  for (i = 0; i < strlen(s); i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

3.2 Analysis code

  We test the code with strings of increasing length.

[Figure: lower1 running time vs. string length]

  When the input string is shorter than about 100,000 characters, the running times differ little. As the string grows longer, however, the running time increases quadratically.

  Let's take a look at the code converted to goto form.

void lower1(char *s)
{
   size_t i = 0;
   if (i >= strlen(s))
     goto done;
 loop:
   if (s[i] >= 'A' && s[i] <= 'Z')
     s[i] -= ('A' - 'a');
   i++;
   if (i < strlen(s))
     goto loop;
 done:
   return;
}

  The code above has three parts: initialization (i = 0), test (the comparison against strlen(s)), and update (i++ followed by the re-test). The initialization runs only once, but the test and update run on every iteration, so strlen is called once per loop iteration.

  Let's take a look at how the source code of the strlen function calculates the length of a string.

size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}

  strlen computes a string's length by traversing it until it encounters '\0', so a single call costs O(N). In lower1, strlen is called on every one of the N iterations (plus the initial test), and each call scans the entire string, so the total running time is O(N²).

3.3 Improve the code

  A redundant call like this inside a loop can be moved outside it, with the result saved and reused in the loop. The improved code is shown below.

void lower2(char *s)
{
  size_t i;
  size_t len = strlen(s);
  for (i = 0; i < len; i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

  Compare the two functions, as shown in the figure below. The execution time of the lower2 function has been significantly improved.

[Figure: running time of lower1 vs. lower2]

4. Eliminate unnecessary memory references

4.1 Sample code

  The following code computes the sum of the elements in each row of the n x n array a and stores it in b[i].

void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

4.2 Analysis code

  The assembly code is shown below.

# sum_rows1 inner loop
.L4:
        movsd   (%rsi,%rax,8), %xmm0 # load b[i] from memory into %xmm0
        addsd   (%rdi), %xmm0        # add the current a element to %xmm0
        movsd   %xmm0, (%rsi,%rax,8) # store %xmm0 back to memory: this is b[i]
        addq    $8, %rdi
        cmpq    %rcx, %rdi
        jne     .L4

  This means that every iteration reads b[i] from memory and then writes b[i] back to memory, even though b[i] = b[i] + a[i*n + j] begins each iteration with exactly the value the previous iteration stored. Why read it from memory and write it back every time?

4.3 Improve the code

/* Sum the rows of the n x n matrix a
   and store the results in vector b */
void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}

  The assembly is shown below.

# sum_rows2 inner loop
.L10:
        addsd   (%rdi), %xmm0 # FP load + add
        addq    $8, %rdi
        cmpq    %rax, %rdi
        jne     .L10

  The improved code introduces temporary variables to store intermediate results, and only stores the results in an array or global variable when the final value is calculated.

5. Reduce unnecessary calls

5.1 Sample code

  For this example, we define a structure containing an array and its length, mainly to guard against out-of-bounds access; data_t can be int, long, or another type. The details are as follows.

typedef struct {
    size_t len;
    data_t *data;
} vec, *vec_ptr;

[Figure: layout of the vec structure]

  get_vec_element retrieves element idx of the data array and stores it in *val, returning 0 if the index is out of bounds.

int get_vec_element(vec *v, size_t idx, data_t *val)
{
    if (idx >= v->len)
        return 0;
    *val = v->data[idx];
    return 1;
}

  We will use the following code as an example to start optimizing the program step by step.

void combine1(vec_ptr v, data_t *dest)
{
    long i;
    *dest = 1;    /* identity for the multiplication below */
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest * val;
    }
}

5.2 Analysis code

  get_vec_element fetches the next element, comparing idx against v->len on every call to prevent crossing the boundary. Bounds checking is a good habit, but performing it on every iteration costs efficiency.

5.3 Improve the code

  We can move the code for calculating the length of the vector outside the loop, and add a function get_vec_start to the abstract data type. This function returns the starting address of the array. In this way, there is no function call in the loop body, but direct access to the array.

data_t *get_vec_start(vec_ptr v)
{
    return v->data;
}

void combine2(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = 1;    /* identity for the multiplication below */
    for (i = 0; i < length; i++) {
        *dest = *dest * data[i];
    }
}

6. Loop unrolling

6.1 Sample code

  We make improvements on the code of combine2.

6.2 Analysis code

  Loop unrolling increases the number of elements processed in each iteration, reducing the total number of loop iterations.

6.3 Improve the code

void combine3(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    long limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc = 1;    /* identity for the multiplication below */

    /* process two elements per iteration */
    for (i = 0; i < limit; i += 2) {
        acc = (acc * data[i]) * data[i+1];
    }
    /* finish any remaining elements */
    for (; i < length; i++) {
        acc = acc * data[i];
    }
    *dest = acc;
}

  In the improved code, the first loop processes two elements of the array at a time. That is, for each iteration, the loop index i is increased by 2, and in one iteration, the merge operation is used on the array elements i and i+1. Generally we call this 2×1 loop unrolling, and this transformation can reduce the impact of loop overhead.

Be careful not to read past the end of the array: for n elements, the unrolled loop's bound should be set to limit = n - 1.

7. Multiple accumulators, multi-way parallelism

7.1 Sample code

  We make improvements on the code of combine3.

7.2 Analysis code

  For an associative and commutative combining operation, such as integer addition or multiplication, we can improve performance by splitting the work across two or more accumulators and combining their results at the end.

Special note: do not casually reassociate floating-point operations. Floating-point addition and multiplication round their results and are therefore not associative, so regrouping can change the answer.

7.3 Improve the code

void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    long limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc0 = 1;    /* identity for the multiplication below */
    data_t acc1 = 1;

    /* unroll the loop and maintain two accumulators */
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 * data[i];
        acc1 = acc1 * data[i+1];
    }
    /* finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 * data[i];
    }
    *dest = acc0 * acc1;
}

  The code above unrolls the loop by two to merge more elements per iteration, and it also runs two parallel streams: elements with even indices accumulate in acc0, and elements with odd indices in acc1. We therefore call it 2x2 loop unrolling. By maintaining multiple accumulators, this approach takes advantage of the processor's multiple functional units and their pipelining.

8. Reassociation transformation

8.1 Sample code

  We make improvements on the code of combine3.

8.2 Analysis code

  At this point the code is close to its performance limit; further loop unrolling yields little improvement. We need a change of approach. Look at the merge statement in combine3, acc = (acc * data[i]) * data[i+1]: we can change the order in which the vector elements are combined (this does not apply to floating point). The critical path of combine3 before reassociation is shown in the figure below.

[Figure: critical path of the combine3 code]

8.3 Improve the code

/* OP is the combining operation (e.g. *) and IDENT its identity (e.g. 1) */
void combine7(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    long limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        acc = acc OP (data[i] OP data[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

  The reassociation transformation reduces the number of operations on the critical path, increases the number of operations that can execute in parallel, and makes better use of the functional units' pipelines. The critical path after reassociation is shown below.

[Figure: critical path of combine3 after reassociation]

9. Conditional-move style code

9.1 Sample code

void minmax1(long a[], long b[], long n) {
    long i;
    for (i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}

9.2 Analysis code

  Modern pipelined processors work far ahead of the instruction currently executing. When the processor encounters a comparison-and-branch, the branch predictor guesses where execution will jump next; a misprediction forces the pipeline to back up to the branch point, which seriously hurts performance. We should therefore write code that the compiler can implement with conditional data transfer (conditional move) instructions, which avoid branch prediction entirely: compute values with conditional expressions first, then use those values to update the program state, as in the improved code below.

9.3 Improve the code

void minmax2(long a[], long b[], long n) {
    long i;
    for (i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}

  The original code must compare a[i] and b[i] before deciding what to do next, which requires a branch prediction on every iteration. The improved code instead computes the minimum and maximum at each position i and then assigns them to a[i] and b[i], with no branch to predict.

10. Summary

  We introduced several techniques to improve code efficiency, some of which can be automatically optimized by the compiler, and some need to be implemented by ourselves. It is summarized as follows.

  1. Eliminate repeated function calls. When possible, move computations outside loops. Consider selectively compromising the program's modularity for greater efficiency.

  2. Eliminate unnecessary memory references. Introduce temporary variables to save intermediate results. Only when the final value is calculated, the result is stored in an array or global variable.

  3. Unroll the loop, reduce overhead, and make further optimization possible.

  4. By using techniques such as multiple accumulation variables and recombination, find ways to improve instruction-level parallelism.

  5. Rewrite conditional operations in a functional style so that the compiler can use conditional data transfers.

Origin blog.csdn.net/mainmaster/article/details/113695241