9 tips to improve code efficiency!

The first requirement of any program is that it works correctly under all circumstances; a program that runs quickly but produces wrong answers is useless. During development and optimization we must consider how the code will be used and which factors are critical, and we usually have to trade off simplicity against running speed. Today we will talk about how to optimize the performance of a program.

1. Reduce the amount of program calculations

1.1 Sample code

for (i = 0; i < n; i++) {
  int ni = n*i;
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
}

1.2 Analysis code

  As the code above shows, each iteration of the outer loop performs one multiplication: i = 0 gives ni = 0; i = 1 gives ni = n; i = 2 gives ni = 2n. We can therefore replace the multiplication with an addition, stepping ni by n on each iteration, which reduces the work done in the outer loop.

1.3 Improve the code

int ni = 0;
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++)
    a[ni + j] = b[j];
  ni += n;         // replace multiplication with addition
}

Multiplication instructions are much slower than addition instructions on most processors; this kind of rewrite is known as strength reduction.
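The full transformation can be sketched as a small routine (a minimal sketch; fill_rows is a hypothetical name for the loop above):

```c
/* Copy vector b into every row of the n x n row-major matrix a.
   The running index ni always equals n*i but is maintained by
   addition, so no multiplication runs in the outer loop. */
void fill_rows(int *a, const int *b, int n) {
    int ni = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;   /* strength reduction: replaces ni = n*i */
    }
}
```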

2. Extract the common parts of the code

2.1 Sample code

  Imagine we have an image represented as a two-dimensional array whose elements are pixels. For a given pixel, we want the sum (or average) of its four neighbors to the east, south, west, and north. The code is shown below.

up =    val[(i-1)*n + j  ];
down =  val[(i+1)*n + j  ];
left =  val[i*n     + j-1];
right = val[i*n     + j+1];
sum = up + down + left + right;

2.2 Analysis code

  Compiling the above code produces the assembly shown below. Note the three multiplications by n (the three imulq instructions). Expanding the expressions for up, down, left, and right, we find that i*n + j appears in all four. We can therefore extract this common subexpression and derive the values of up, down, etc. from it using only additions and subtractions.

leaq   1(%rsi), %rax  # i+1
leaq   -1(%rsi), %r8  # i-1
imulq  %rcx, %rsi     # i*n
imulq  %rcx, %rax     # (i+1)*n
imulq  %rcx, %r8      # (i-1)*n
addq   %rdx, %rsi     # i*n+j
addq   %rdx, %rax     # (i+1)*n+j
addq   %rdx, %r8      # (i-1)*n+j

2.3 Improve the code

long inj = i*n + j;
up =    val[inj - n];
down =  val[inj + n];
left =  val[inj - 1];
right = val[inj + 1];
sum = up + down + left + right;

  The assembly for the improved code is shown below. Only one multiplication remains after compilation; eliminating two multiplications saves roughly 6 clock cycles (a multiplication takes about 3 clock cycles).

imulq %rcx, %rsi  # i*n
addq %rdx, %rsi  # i*n+j
movq %rsi, %rax  # i*n+j
subq %rcx, %rax  # i*n+j-n
leaq (%rsi,%rcx), %rcx # i*n+j+n
...

  Depending on the optimization level, the GCC compiler can perform optimizations like the ones above automatically. Next we introduce the optimizations that must be done by hand.
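As a concrete form of the rewrite in 2.3, here is a minimal sketch (neighbor_sum is a hypothetical helper name) that computes the four-neighbor sum through the shared subexpression i*n + j:

```c
/* Four-neighbor sum at (i, j) of an n x n row-major grid.
   Assumes 1 <= i, j <= n-2 so all four neighbors exist. */
long neighbor_sum(const long *val, long n, long i, long j) {
    long inj = i*n + j;        /* common subexpression, computed once */
    long up    = val[inj - n];
    long down  = val[inj + n];
    long left  = val[inj - 1];
    long right = val[inj + 1];
    return up + down + left + right;
}
```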

3. Eliminate inefficient code in the loop

3.1 Sample code

  The program below does not seem to have a problem; it is a very common case-conversion routine. So why does its execution time grow quadratically as the input string gets longer?

void lower1(char *s)
{
  size_t i;
  for (i = 0; i < strlen(s); i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

3.2 Analysis code

  Then we test the code and enter a series of strings.

Lower1 code performance test

  When the input string is shorter than about 100000 characters, the running times differ little. As the string grows longer, however, the running time increases quadratically.

  Let's take a look at the code converted to goto form.

void lower1(char *s)
{
    size_t i = 0;
    if (i >= strlen(s))
        goto done;
loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
done:
    return;
}

  The code above has three parts: initialization (line 3), test (line 4), and update (lines 9 and 10). Initialization runs only once, but the test and update run on every iteration, so strlen is called once per loop iteration.

  Let's look at how the source code of the strlen function calculates the length of the string.

size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}

  The strlen function computes the length of a string by traversing it until it encounters '\0', so strlen runs in O(N) time. In lower1, strlen is called on every iteration, roughly N times for a string of length N, and since case conversion does not change the length, each call scans all N characters. The total time is therefore close to O(N²).

3.3 Improve the code

  For such redundant calls that appear in the loop, we can move them outside the loop. Use the calculation result in the loop. The improved code is shown below.

void lower2(char *s)
{
  size_t i;
  size_t len = strlen(s);
  for (i = 0; i < len; i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      s[i] -= ('A' - 'a');
}

  Compare the two functions, as shown in the figure below. The execution time of the lower2 function has been significantly improved.

Lower1 and lower2 code efficiency
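A quick correctness check of lower2 (a minimal sketch that repeats the function so it compiles on its own):

```c
#include <string.h>

/* Case conversion with strlen hoisted out of the loop:
   the length is computed once instead of once per iteration. */
void lower2(char *s) {
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```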

4. Eliminate unnecessary memory references

4.1 Sample code

  The following code is used to calculate the sum of all elements in each row of the a array and store it in b[i].

void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

4.2 Analysis code

  The assembly code is shown below.

# sum_rows1 inner loop
.L4:
        movsd   (%rsi,%rax,8), %xmm0 # read b[i] from memory into %xmm0
        addsd   (%rdi), %xmm0        # add a[i*n + j] to %xmm0
        movsd   %xmm0, (%rsi,%rax,8) # write %xmm0 back to memory as b[i]
        addq    $8, %rdi
        cmpq    %rcx, %rdi
        jne     .L4

  This means that every iteration reads b[i] from memory and then writes b[i] back to memory: b[i] = b[i] + a[i*n + j]. At the start of each iteration, b[i] already holds the value from the previous one, so why must it be re-read and re-written every time? Because a and b could point to overlapping memory (aliasing), the compiler cannot safely keep b[i] in a register.

4.3 Improve the code

/* Sum rows of n x n matrix a
   and store in vector b */
void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}

  The assembly is shown below.

# sum_rows2 inner loop
.L10:
        addsd   (%rdi), %xmm0 # FP load + add
        addq    $8, %rdi
        cmpq    %rax, %rdi
        jne     .L10

  The improved code introduces a temporary variable to hold the intermediate result, and stores it into the array or global variable only once the final value has been computed.
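A minimal usage sketch of sum_rows2 on a small matrix (the function is repeated so the sketch compiles on its own):

```c
/* Row sums of an n x n matrix: the running sum lives in the local
   variable val, so there is one store to b[i] per row instead of
   one load and one store per element. */
void sum_rows2(const double *a, double *b, long n) {
    for (long i = 0; i < n; i++) {
        double val = 0;
        for (long j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
```

Note that sum_rows1 and sum_rows2 can produce different results if a and b overlap in memory; keeping the sum in a temporary is safe only when they do not.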

5. Reduce unnecessary calls

5.1 Sample code

  For the examples that follow, we define a structure containing an array and its length, mainly to guard against out-of-bounds array access; data_t can be int, long, or another type. The details are as follows.

typedef struct {
    size_t len;
    data_t *data;
} vec, *vec_ptr;

vec vector diagram

  The get_vec_element function retrieves the element at index idx from the data array and stores it in *val.

int get_vec_element(vec_ptr v, size_t idx, data_t *val)
{
    if (idx >= v->len)
        return 0;
    *val = v->data[idx];
    return 1;
}

  We will use the following code as an example to start optimizing the program step by step.

void combine1(vec_ptr v, data_t *dest)
{
    long i;
    *dest = IDENT;   /* identity element: 1 for multiplication */
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest * val;
    }
}

5.2 Analysis code

  The get_vec_element function fetches the next element, and on every call it compares idx against v->len to prevent crossing the boundary. Bounds checking is a good habit, but performing it on every iteration reduces efficiency.

5.3 Improve code

  We can move the code for calculating the length of the vector outside the loop, and add a function get_vec_start to the abstract data type. This function returns the starting address of the array. So there is no function call in the loop body, but direct access to the array.

data_t *get_vec_start(vec_ptr v)
{
    return v->data;
}

void combine2(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        *dest = *dest * data[i];
    }
}
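Putting section 5 together with a concrete element type (a minimal sketch assuming data_t is long, with multiplication as the combining operation and 1 as its identity element):

```c
#include <stddef.h>

typedef long data_t;
#define IDENT 1   /* identity element for multiplication */

typedef struct {
    size_t len;
    data_t *data;
} vec, *vec_ptr;

long vec_length(vec_ptr v)       { return (long)v->len; }
data_t *get_vec_start(vec_ptr v) { return v->data; }

/* combine2: the length and the start pointer are read once,
   outside the loop; the loop body indexes the array directly. */
void combine2(vec_ptr v, data_t *dest) {
    long length = vec_length(v);
    data_t *data = get_vec_start(v);
    *dest = IDENT;
    for (long i = 0; i < length; i++)
        *dest = *dest * data[i];
}
```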

6. Loop unrolling

6.1 Sample code

  We make improvements on the code of combine2.

6.2 Analysis code

  Loop unrolling increases the number of elements processed per iteration, reducing the number of iterations and thus the loop overhead.

6.3 Improve the code

void combine3(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    long limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    /* Combine two elements per iteration */
    for (i = 0; i < limit; i += 2) {
        acc = (acc * data[i]) * data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc = acc * data[i];
    }
    *dest = acc;
}

  In the improved code, the first loop processes two array elements per iteration: the loop index i is increased by 2, and the combining operation is applied to elements i and i+1. This is generally called 2×1 loop unrolling; the transformation reduces the impact of loop overhead.

Be careful not to access past the end of the array: with n elements, set the limit to n - 1 so that data[i+1] stays in bounds.
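The same 2×1 unrolling over a plain array (a minimal sketch; product_unrolled is a hypothetical name), with an odd-length input exercising the cleanup loop:

```c
/* 2x1 loop unrolling: two elements per iteration, plus a cleanup
   loop for the element left over when n is odd. */
long product_unrolled(const long *data, long n) {
    long acc = 1;
    long limit = n - 1;   /* keeps data[i+1] in bounds */
    long i;
    for (i = 0; i < limit; i += 2)
        acc = (acc * data[i]) * data[i+1];
    for (; i < n; i++)
        acc = acc * data[i];
    return acc;
}
```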

7. Multiple parallel accumulators

7.1 Sample code

  We make improvements on the code of combine3.

7.2 Analysis code

  For a combining operation that is associative and commutative, such as integer addition or multiplication, we can improve performance by splitting the sequence of combining operations into two or more parts and merging the partial results at the end.

Special note: do not casually reorder floating-point operations. Floating-point addition and multiplication are not associative, so splitting the accumulation can change the result.

7.3 Improve the code

void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    long limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc0 = IDENT;
    data_t acc1 = IDENT;

    /* Unroll the loop and maintain two accumulators */
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 * data[i];
        acc1 = acc1 * data[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 * data[i];
    }
    *dest = acc0 * acc1;
}

  The code above both unrolls the loop by two, combining more elements per iteration, and uses two parallel accumulation chains: elements with even indices accumulate in acc0 and elements with odd indices in acc1. We therefore call this 2×2 loop unrolling. By maintaining multiple accumulators, this method exploits the processor's multiple functional units and their pipelining.
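Over a plain array, 2×2 unrolling looks like this (a minimal sketch; product_2x2 is a hypothetical name):

```c
/* 2x2 unrolling: even-index elements accumulate in acc0, odd-index
   elements in acc1; the two chains are independent and can execute
   in parallel. Both start at 1, the multiplicative identity. */
long product_2x2(const long *data, long n) {
    long acc0 = 1, acc1 = 1;
    long limit = n - 1;
    long i;
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 * data[i];
        acc1 = acc1 * data[i+1];
    }
    for (; i < n; i++)
        acc0 = acc0 * data[i];
    return acc0 * acc1;   /* merge the two chains */
}
```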

8. Reassociation transformation

8.1 Sample code

  We make improvements on the code of combine3.

8.2 Analysis code

  At this point the performance of the code is close to its limit; even more loop unrolling yields little further improvement. We need a change of approach. Look at the combining statement acc = (acc * data[i]) * data[i+1] in combine3: we can change the order in which the elements are combined (not applicable to floating point). The critical path of combine3 is shown in the figure below.

The critical path of the combine3 code

8.3 Improve the code

/* OP is the combining operation (e.g. *) and IDENT its identity element */
void combine7(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    long limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc = IDENT;

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        acc = acc OP (data[i] OP data[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc = acc OP data[i];
    }
    *dest = acc;
}

  The reassociation transformation reduces the number of operations on the critical path of the computation. It increases the number of operations that can execute in parallel and makes better use of the functional units' pipelines, yielding better performance. The critical path after reassociation is shown below.

Critical path after combine3 recombination
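Over a plain array, the reassociated version differs from plain 2×1 unrolling only in the parentheses (a minimal sketch; product_reassoc is a hypothetical name):

```c
/* Reassociated 2x1 unrolling: data[i] * data[i+1] is computed
   first and does not depend on acc, so only one multiplication
   per iteration sits on the acc critical path. */
long product_reassoc(const long *data, long n) {
    long acc = 1;
    long limit = n - 1;
    long i;
    for (i = 0; i < limit; i += 2)
        acc = acc * (data[i] * data[i+1]);
    for (; i < n; i++)
        acc = acc * data[i];
    return acc;
}
```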

9. Conditional-move style code

9.1 Sample code

void minmax1(long a[], long b[], long n) {
    long i;
    for (i = 0; i < n; i++) {
        if (a[i] > b[i]) {
            long t = a[i];
            a[i] = b[i];
            b[i] = t;
        }
    }
}

9.2 Analysis code

  Pipelining lets modern processors work far ahead of the instruction currently executing. When the processor meets a comparison-and-jump, the branch predictor guesses where execution will go next; a wrong guess forces the pipeline to roll back to the branch, and such mispredictions seriously hurt performance. For hard-to-predict branches, we should therefore write code that the compiler can implement with conditional-move instructions, which avoid branch prediction entirely: use conditional expressions to compute both values, then use those values to update the program state, as shown in the improved code.

9.3 Improve the code

void minmax2(long a[], long b[], long n) {
    long i;
    for (i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}

  In the fourth line of the original code, a[i] and b[i] must be compared before deciding what to do next, so the branch must be predicted on every iteration. The improved code instead computes the minimum and maximum for each position i and unconditionally assigns them to a[i] and b[i], so no branch prediction is needed.
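A quick usage check of minmax2 (repeated here so the sketch compiles on its own):

```c
/* Branch-free element-wise min/max: both conditional expressions
   can be compiled to conditional moves, avoiding branch prediction. */
void minmax2(long a[], long b[], long n) {
    for (long i = 0; i < n; i++) {
        long min = a[i] < b[i] ? a[i] : b[i];
        long max = a[i] < b[i] ? b[i] : a[i];
        a[i] = min;
        b[i] = max;
    }
}
```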

10. Summary

  We have introduced several techniques for improving code efficiency; some of them the compiler can apply automatically, and others we must apply ourselves. In summary:

  1. Eliminate excessive function calls. When possible, move computations out of loops. Consider selectively compromising program modularity for greater efficiency.

  2. Eliminate unnecessary memory references. Introduce temporary variables to hold intermediate results, and store a result into an array or global variable only once its final value has been computed.

  3. Unroll loops to reduce overhead and enable further optimization.

  4. Improve instruction-level parallelism with techniques such as multiple accumulators and reassociation.

  5. Rewrite conditional operations in a functional style so that the compiler can use conditional data transfers.
