An in-depth article: C code optimization techniques

1. Choose the right algorithm and data structure

It is important to choose a suitable data structure. For example, if you need to perform a large number of insertions and deletions on a set of values stored in no particular order, a linked list is much faster than an array. Arrays and pointers are closely related: generally speaking, pointers are more flexible and concise, while arrays are more intuitive and easier to understand. For most compilers, code written with pointers is shorter and executes faster than the equivalent code written with array subscripts.

In many cases, pointer arithmetic can be used instead of array indexing, and doing so often produces faster, shorter code. Compared with array indexing, pointers generally make the code quicker and smaller, and the difference is more obvious with multidimensional arrays. The following two pieces of code do the same thing, but their efficiency differs.

    Array indexing:

    for (;;)
    {
        a = array[t++];
        ......
    }

    Pointer arithmetic:

    p = array;
    for (;;)
    {
        a = *(p++);
        ......
    }

The advantage of the pointer version is that the array's starting address is loaded into p once, and p only needs to be incremented on each iteration. In the array-indexing version, the element address must be recomputed from the subscript t on every pass through the loop.
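
A minimal, compilable sketch of the two styles (the array contents, size, and summing task are made up for illustration):

#include <stdio.h>

#define N 8

int main(void)
{
    int array[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    int sum_idx = 0, sum_ptr = 0;
    const int *p = array;
    int t;

    for (t = 0; t < N; t++)        /* array indexing: address recomputed from t */
        sum_idx += array[t];

    for (t = 0; t < N; t++)        /* pointer arithmetic: p is simply incremented */
        sum_ptr += *p++;

    printf("%d %d\n", sum_idx, sum_ptr);
    return 0;
}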

2. Use the smallest possible data type

Variables that can be defined as character type (char) should not be defined as integer (int); variables that fit in an int should not be defined as long int; and variables that can be represented as float should not be defined as double. Of course, the values assigned to a variable must not exceed its range. If an assignment does exceed the range, the C compiler reports no error, but the program produces wrong results, and such errors are hard to find.

In ICCAVR you can configure the printf parameters under Options: prefer the basic format specifiers (%c, %d, %x, %X, %u, and %s), use the long-integer specifiers (%ld, %lu, %lx, and %lX) sparingly, and avoid the floating-point specifier (%f) if at all possible. The same holds for other C compilers: with everything else unchanged, using %f increases the amount of generated code and reduces execution speed.

3. Reduce computational strength

(1) Look-up tables (a required course for game programmers)

A clever game programmer does essentially no computation in the main loop: compute everything in advance, then just look the results up in a table inside the loop. Look at the following example:

Old code:

long factorial(int i)
{
    if (i == 0)
      return 1;
    else
      return i * factorial(i - 1);
}

New code:

static long factorial_table[] = {1, 1, 2, 6, 24, 120, 720  /* etc */ };
long factorial(int i)
{
    return factorial_table[i];
}

If the table is too large to write out by hand, write an init function that generates it once, outside the loop.
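
A minimal sketch of such an init function (the table size and names are assumptions for illustration; 12! is the largest factorial that fits in a 32-bit long):

#define FACT_TABLE_SIZE 13

static long factorial_table[FACT_TABLE_SIZE];

/* Fill the table once, before entering the hot loop. */
void init_factorial_table(void)
{
    int i;
    factorial_table[0] = 1;
    for (i = 1; i < FACT_TABLE_SIZE; i++)
        factorial_table[i] = factorial_table[i - 1] * i;
}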

(2) Remainder operation

a=a%8;

Can be changed to:

a=a&7;

Note: a bitwise operation takes only one instruction cycle, while the "%" operator in most C compilers is implemented by calling a subroutine, which produces long code and executes slowly. Usually all that is needed is the remainder modulo a power of two (2^n), and that can be replaced by a bitwise operation.
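
In general, for an unsigned value the remainder modulo a power of two can be taken by masking with that power minus one; a small sketch (the variable names are illustrative):

unsigned int a = 1234;
unsigned int r1 = a % 16;          /* remainder modulo 2^4 */
unsigned int r2 = a & (16 - 1);    /* same result, computed with a single AND */
/* Note: for signed values with negative operands the two forms differ,
 * so the substitution is safe only for unsigned (or known non-negative) values. */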

(3) Square operation

a=pow(a, 2.0);

Can be changed to:

a=a*a;

Note: on microcontrollers with a built-in hardware multiplier (such as the 51 series), multiplication is much faster than calling pow(), because floating-point exponentiation is implemented through a subroutine call. On AVR microcontrollers with a hardware multiplier, such as the ATmega163, a multiplication completes in only 2 clock cycles. Even on AVR microcontrollers without a hardware multiplier, the multiplication subroutine is shorter and faster than the exponentiation subroutine.

If it is to the third power, such as:

a=pow(a,3.0);

change to:

a=a*a*a;

The improvement in efficiency is more obvious.

(4) Realize multiplication and division by shift

a=a*4;
b=b/4;

Can be changed to:

a=a<<2;
b=b>>2;

Whenever you need to multiply or divide by 2^n, you can use shifts instead. In ICCAVR, multiplying by 2^n generates left-shift code, while multiplying by other integers or dividing by any number calls the multiply/divide subroutine; code obtained with shifts is more efficient than code that calls those subroutines. In fact, multiplication by any integer constant can be built from shifts and additions, for example:

a=a*9;

Can be changed to:

a=(a<<3)+a;

Replace the original expression with an expression with a smaller computational complexity. The following is a classic example:

Old code:

x = w % 8;
y = pow(x, 2.0);
z = y * 33;
for (i = 0;i < MAX;i++)
{
    h = 14 * i;
    printf("%d", h);
}

New code:

x = w & 7;                 /* bitwise AND is faster than remainder */
y = x * x;                 /* multiplication is faster than pow() */
z = (y << 5) + y;          /* shift-and-add is faster than multiplication */
for (i = h = 0; i < MAX; i++)
{
    h += 14;               /* addition is faster than multiplication */
    printf("%d",h);
}

(5) Avoid unnecessary integer division

Integer division is the slowest of the integer operations, so it should be avoided where possible. One opportunity to reduce integer divisions is chained division, where successive divisions can be combined by multiplying the divisors. The side effect of this substitution is that the product may overflow, so it can only be used within a safe range of divisors.

Bad code:

int i, j, k, m;
m = i / j / k;

Recommended code:

int i, j, k, m;
m = i / (j * k);

(6) Use increment and decrement operators

When adding or subtracting 1, use the increment and decrement operators, because an increment statement is faster than the equivalent assignment. The reason is that on most CPUs, incrementing or decrementing a word in memory does not require explicit load and store instructions. For example, the statement:

x=x+1;

Taking a typical microcomputer assembly language as an example, the generated code looks like:

move A,x      ; load x from memory into accumulator A
add A,1       ; add 1 to accumulator A
store x       ; store the new value back into x

If you use the increment operator, the generated code is as follows:

incr x           ; increment x by 1

Obviously, with no separate load and store instructions, increment and decrement operations execute faster and the code is shorter.

(7) Use compound assignment expressions

Compound assignment expressions (such as a-=1 and a+=1, etc.) can generate high-quality program code.

(8) Extract common sub-expression

In some cases, the compiler cannot extract common sub-expressions from floating-point expressions, because doing so would amount to reordering the expressions. It should be pointed out that the compiler is not allowed to rearrange expressions according to algebraic equivalences before extracting common sub-expressions, so the programmer has to extract them manually (VC.NET has a "global optimization" option intended to do this, but its effect is uncertain).

Bad code:

float a, b, c, d, e, f;
......
e = b * c / d;
f = b / d * a;

Recommended code:

float a, b, c, d, e, f;
......
const float t = b / d;
e = c * t;
f = a * t;

Bad code:

float a, b, c, e, f;
......
e = a / c;
f = b / c;

Recommended code:

float a, b, c, e, f;
......
const float t = 1.0f / c;
e = a * t;
f = b * t;

4. Layout of structure members

Many compilers have options to align structures on word, doubleword, or quadword boundaries, but there is still room to improve the alignment of structure members. Some compilers may allocate space to structure members in an order different from the declaration order, while others do not provide such features, or handle them poorly. Therefore, to get the best structure and member alignment at the least cost, the following methods are recommended:

(1) Sort by length of data type

Sort the members of a structure by the length of their type, declaring longer members before shorter ones. The compiler requires multi-byte data to be stored on even address boundaries. When declaring a composite data type (containing both multi-byte and single-byte members), put the multi-byte members first and the single-byte members last to avoid memory holes. The compiler automatically aligns instances of the structure on even boundaries in memory.

(2) Fill the structure to an integral multiple of the longest type length

Pad the structure to an integral multiple of its longest member's length. That way, if the first member is aligned, every instance of the structure is naturally aligned. The following example shows how to reorder the structure members:

Bad code, normal order:

struct
{
  char a[5];
  long k;
  double x;
} baz;

Recommended code, new order and manually filled in a few bytes:

struct
{
  double x;
  long k;
  char a[5];
  char pad[7];
} baz;

This rule also applies to the layout of the members of the class.
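
A quick way to check the effect of the layout is to compare sizeof() for the two orderings; a hedged sketch (the exact numbers depend on the target's type sizes and alignment rules; the original example assumes a 4-byte long):

#include <stdio.h>

struct unordered { char a[5]; long k; double x; };              /* compiler inserts hidden padding */
struct ordered   { double x; long k; char a[5]; char pad[7]; }; /* padding is explicit */

int main(void)
{
    printf("unordered: %u bytes\n", (unsigned)sizeof(struct unordered));
    printf("ordered:   %u bytes\n", (unsigned)sizeof(struct ordered));
    return 0;
}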

(3) Sort local variables by the length of the data type

When the compiler allocates space for local variables, it uses the same order in which they are declared in the source code. As with the previous rule, long variables should be placed before short ones. If the first variable is aligned, the remaining variables are stored consecutively and are naturally aligned without padding bytes. Some compilers do not reorder variables when allocating them, and some cannot generate a 4-byte-aligned stack, in which case 4-byte alignment is not guaranteed. The following example shows the reordering of local variable declarations:

Bad code, normal order

short ga, gu, gi;
long foo, bar;
double x, y, z[3];
char a, b;
float baz;

Recommended code, order of improvement

double z[3];
double x, y;
long foo, bar;
float baz;
short ga, gu, gi;

(4) Copy frequently used pointer parameters to local variables

Avoid frequently dereferencing pointer parameters inside a function. Because the compiler does not know whether the pointers alias each other, pointer parameters often cannot be optimized: the data cannot be kept in registers, and memory bandwidth is obviously wasted. Note that many compilers have an "assume no aliasing" optimization switch (in VC you must add /Oa or /Ow to the compiler command line manually), which lets the compiler assume that two different pointers always point to different data, so the pointer parameters do not have to be copied to local variables. Otherwise, copy the data the pointers refer to into local variables at the start of the function and, if necessary, copy it back before the function returns.

Bad code:

// assume q != r
void isqrt(unsigned long a, unsigned long* q, unsigned long* r)
{
  *q = a;
  if (a > 0)
  {
    while (*q > (*r = a / *q))
    {
      *q = (*q + *r) >> 1;
    }
  }
  *r = a - *q * *q;
}

Recommended code:

// assume q != r

void isqrt(unsigned long a, unsigned long* q, unsigned long* r)
{
  unsigned long qq, rr;
  qq = a;
  if (a > 0)
  {
    while (qq > (rr = a / qq))
    {
      qq = (qq + rr) >> 1;
    }
  }
  rr = a - qq * qq;
  *q = qq;
  *r = rr;
}

5. Cycle optimization

(1) Fully unroll small loops

To make full use of the CPU's instruction cache, small loops should be fully unrolled. This is especially worthwhile when the loop body itself is small. Note: many compilers cannot unroll loops automatically.

Bad code:

// 3D transform: multiply vector V by the 4x4 matrix M
for (i = 0;i < 4;i ++)
{
  r[i] = 0;
  for (j = 0;j < 4;j ++)
  {
    r[i] += M[j][i]*V[j];
  }
}

Recommended code:

r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];

(2) Extract the common part

Work that does not depend on the loop variable can be moved outside the loop. This includes expressions, function calls, pointer operations, array accesses, and so on: collect everything that does not need to be done repeatedly and do it once, for example in an init routine, as sketched below.
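
A minimal, hypothetical sketch of hoisting loop-invariant work out of a loop (the names scale_all, scale and base are made up for illustration):

/* 'scale' and 'base' do not change inside the loop,
 * so their product is computed once, before the loop. */
void scale_all(double *out, const double *in, int n, double scale, double base)
{
    double factor = scale * base;   /* loop-invariant work hoisted out */
    int i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * factor;
}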

(3) Delay function

The commonly used delay functions are all in the form of self-addition:

void delay (void)
{
  unsigned int i;
  for (i=0;i<1000;i++) ;
}

Change it to a self-decreasing delay function:

void delay (void)
{
  unsigned int i;
  for (i=1000;i>0;i--) ;
}

The two functions have similar delay effects, but almost every C compiler generates 1 to 3 bytes less code for the second one, because almost all MCUs have a branch-on-zero instruction and the count-down form lets the compiler use it. The same applies to while loops: controlling the loop with a decrement generates 1 to 3 bytes less code than controlling it with an increment. However, when the loop body reads or writes an array indexed by the loop variable i, the count-down form can run the index out of bounds, so take care, as the sketch below shows.
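
For example, when the loop variable also indexes an array, the decrementing form has to be written so the indices stay in range; a hedged sketch (the array name and size are illustrative):

#define N 100

void clear_array(unsigned char a[N])
{
    unsigned int i;

    /* i runs from N down to 1, so index with i - 1 to stay inside a[0..N-1];
     * the loop still ends with a cheap compare against zero. */
    for (i = N; i > 0; i--)
        a[i - 1] = 0;
}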

(4)while loop and do...while loop

There are two types of loops when using a while loop:

unsigned int i;
i=0;
while (i<1000)
{
   i++;
   //user code
}

or:

unsigned int i;
i=1000;
do
{
   i--;
   //user code
}
while (i>0);

Of these two loops, the do...while version compiles to shorter code than the while version.

(5) Loop unrolling

This is a classic speed optimization, but many compilers (for example gcc with -funroll-loops) can do it automatically, so unrolling by hand no longer brings an obvious gain.

Old code:

for (i = 0; i < 100; i++)
{
  do_stuff(i);
}

New code:

for (i = 0; i < 100; )
{
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
  do_stuff(i); i++;
}

In the new code, the number of comparison instructions drops from 100 to 10, cutting the loop overhead by 90%. But note: for loops whose intermediate variables or results change, the compiler often refuses to unroll (it plays it safe), and you have to do the unrolling yourself.

One more caveat: on CPUs with an internal instruction cache (such as MMX-era chips), unrolled code can become so large that it overflows the cache; the code then keeps shuttling between the cache and memory, and since the cache is much faster than memory, unrolling actually slows things down in that case. In addition, loop unrolling can interfere with vectorization of the loop.
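
When the trip count is not a multiple of the unroll factor, a remainder loop is needed; a hedged sketch, unrolling by 4 and reusing do_stuff() from the example above:

extern void do_stuff(int i);

void run_all(int count)
{
    int i = 0;

    /* Main unrolled loop: four calls per iteration. */
    for (; i + 4 <= count; i += 4)
    {
        do_stuff(i);
        do_stuff(i + 1);
        do_stuff(i + 2);
        do_stuff(i + 3);
    }

    /* Remainder loop: the last 0 to 3 iterations. */
    for (; i < count; i++)
        do_stuff(i);
}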

(6) Loop nesting

Merging related loops into a single loop also speeds things up.

Old code:

for (i = 0; i < MAX; i++)         /* initialize 2d array to 0's */
    for (j = 0; j < MAX; j++)
        a[i][j] = 0.0;

for (i = 0; i < MAX; i++)         /* put 1's along the diagonal */
    a[i][i] = 1.0;

New code:

for (i = 0; i < MAX; i++)         /* initialize 2d array to 0's */
{
    for (j = 0; j < MAX; j++)
        a[i][j] = 0.0;
    a[i][i] = 1.0;                            /* put 1's along the diagonal */
}

(7) Sort case according to frequency of occurrence in Switch statement

A switch may be compiled into many different kinds of code, the most common being a jump table or a comparison chain/tree. When the switch is compiled as a comparison chain, the compiler generates nested if-else-if code that tests the cases in order and jumps to the matching statement when one succeeds. Therefore, sorting the case values by probability of occurrence, with the most likely first, can improve performance. In addition, it is best to use small consecutive integers as case values, because then every compiler can turn the switch into a jump table.

Bad code:

int days_in_month, short_months, normal_months, long_months;

......

switch (days_in_month)
{
  case 28:
  case 29:
    short_months ++;
    break;
  case 30:
    normal_months ++;
    break;
  case 31:
    long_months ++;
    break;
  default:
    cout << "month has fewer than 28 or more than 31 days" << endl;
    break;
}

Recommended code:

int days_in_month, short_months, normal_months, long_months;

......

switch (days_in_month)
{
  case 31:
    long_months ++;
    break;
  case 30:
    normal_months ++;
    break;
  case 28:
  case 29:
    short_months ++;
    break;
  default:
    cout << "month has fewer than 28 or more than 31 days" << endl;
    break;
}

(8) Convert large switch statements into nested switch statements

When a switch statement contains many case labels, it is wise to turn the large switch into nested switch statements in order to reduce the number of comparisons. Put the frequently occurring case labels in the outer switch statement, and the relatively infrequent case labels in an inner switch statement. For example, the following program fragment handles the infrequent cases under the default label of the outer switch.

pMsg=ReceiveMessage();
switch (pMsg->type)
{
      case FREQUENT_MSG1:
        handleFrequentMsg();
        break;
      case FREQUENT_MSG2:
        handleFrequentMsg2();
        break;
        ......
      case FREQUENT_MSGn:
        handleFrequentMsgn();
        break;
      default:                     // the nested switch handles infrequent messages
        switch (pMsg->type)
        {
          case INFREQUENT_MSG1:
               handleInfrequentMsg1();
               break;
          case INFREQUENT_MSG2:
               handleInfrequentMsg2();
               break;
          ......
          case INFREQUENT_MSGm:
              handleInfrequentMsgm();
              break;
        }
}

If there is a lot of work to be done in each case in switch, it would be more effective to replace the entire switch statement with a table of pointers to functions. For example, the following switch statement has three cases:

enum MsgType {Msg1, Msg2, Msg3};
switch (ReceiveMessage())
{
    case Msg1:
        ......
    case Msg2:
        ......
    case Msg3:
        ......
}

In order to improve the execution speed, replace the above switch statement with the following code.

/* preparation */
int handleMsg1(void);
int handleMsg2(void);
int handleMsg3(void);
/* create an array of function pointers */
int (*MsgFunction[])(void) = {handleMsg1, handleMsg2, handleMsg3};
/* replace the switch statement with this single, more efficient line */
status = MsgFunction[ReceiveMessage()]();
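
One caveat: the array index comes straight from ReceiveMessage(), so in practice it is worth range-checking it before dispatching; a hedged sketch (the error value is hypothetical):

int type = ReceiveMessage();
if (type >= Msg1 && type <= Msg3)     /* guard against out-of-range message types */
    status = MsgFunction[type]();
else
    status = -1;                      /* hypothetical error value */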

(9) Loop reversal

Some machines have a special, very fast instruction for JNZ (jump if not zero). If your loop is not sensitive to direction, you can run it from high to low.

Old code:

for (i = 1; i <= MAX; i++)
{
   ......
}

New code:

i = MAX+1;
while (--i)
{
  ......
}

But be careful: if pointer arithmetic inside the loop uses the value of i, this method can cause a serious out-of-bounds error (since i starts at MAX+1). You can of course compensate by adjusting i, but then nothing is gained, unless the code looks like the following:

Old code:

char a[MAX+5];
for (i = 1; i <= MAX; i++)
{
  *(a+i+4)=0;
}

New code:

i = MAX+1;
while (--i)
{
    *(a+i+4)=0;
}

(10) Common code block

Shared processing modules often contain a large number of if-then-else branches internally in order to serve all kinds of callers. That is bad: if the condition tests get too complex, they waste a lot of time, so minimize the use of such shared code blocks. (In any case, space optimization and speed optimization pull in opposite directions, as Donglou says.) Of course, a simple test such as (3 == x) is still fine in moderation. Remember, optimization is always the pursuit of balance, not of extremes.

(11) Improve loop performance

To improve loop performance, it is very useful to eliminate redundant constant calculations, that is, calculations whose results do not change across loop iterations.

Bad code (an unchanging if() inside the for() loop):

for (i ...)
{
  if (CONSTANT0)
  {
     DoWork0(i);    // assume this does not change CONSTANT0
  }
  else
  {
    DoWork1(i);     // assume this does not change CONSTANT0
  }
}

Recommended code:

if( CONSTANT0 )
{
  for (i ...)
  {
    DoWork0( i );
  }
}
else
{
  for (i ...)
  {
    DoWork1( i );
  }
}

Since the value of the if() condition is already known, the test need not be repeated on every iteration. Although the branch in the bad code is easy to predict, the recommended code decides the branch once, before entering the loop, which reduces the dependence on branch prediction.

(12) Choose a good infinite loop

In programming we often need infinite loops, and the two common ways to write one are while (1) and for (;;). The two have exactly the same effect, but which is better? Let's look at the compiled code:

Before compilation:

while (1);

After compilation:

mov eax,1
test eax,eax
je foo+23h
jmp foo+18h

Before compilation:

for (;;);

After compilation:

jmp foo+23h

Obviously, for (;;) produces fewer instructions, uses no register, makes no test and no conditional jump, and is therefore better than while (1).

6. Improve CPU parallelism

(1) Use parallel code

Whenever possible, break long dependent chains of code into several independent chains that can execute in parallel in the pipelined execution units. Many high-level languages, including C++, do not reorder floating-point expressions themselves, because that is a fairly complicated process. Note that the reordered code is not numerically identical to the original, because floating-point operations have limited precision; in some cases these optimizations can lead to unexpected results, but fortunately, in most cases only the least significant bit of the final result may differ.

Bad code:

double a[100], sum;
int i;
sum = 0.0;
for (i = 0; i < 100; i++)
    sum += a[i];

Recommended code:

double a[100], sum1, sum2, sum3, sum4, sum;
int i;
sum1 = sum2 = sum3 = sum4 = 0.0;
for (i = 0;i < 100;i += 4)
{
  sum1 += a[i];
  sum2 += a[i+1];
  sum3 += a[i+2];
  sum4 += a[i+3];
}
sum = (sum4+sum3)+(sum1+sum2);

Note that a 4-way decomposition is used because the floating-point adder has a 4-stage pipeline, each stage taking one clock cycle; splitting the sum four ways keeps the pipeline fully utilized.

(2) Avoid unnecessary read and write dependencies

When data is saved to memory, a read-after-write dependency exists: the data must be written correctly before it can be read back. Although the AMD Athlon and other CPUs have hardware that shortens this read-write latency, allowing a value to be read before it has actually reached memory, it is still faster to avoid the dependency and keep the data in a register. In a long chain of dependent code, avoiding read-write dependencies matters even more. If the dependency arises from operating on an array, many compilers cannot optimize it away automatically, so the programmer should eliminate it by hand, for example by introducing a temporary variable that can live in a register. This can improve performance considerably. The following code is an example:

Bad code:

float x[VECLEN], y[VECLEN], z[VECLEN];
......
for (unsigned int k = 1;k < VECLEN;k ++)
{
  x[k] = x[k-1] + y[k];
}

for (k = 1;k <VECLEN;k++)
{
  x[k] = z[k] * (y[k] - x[k-1]);
}

Recommended code:

float x[VECLEN], y[VECLEN], z[VECLEN];
......
float t = x[0];
for (unsigned int k = 1;k < VECLEN;k ++)
{
  t = t + y[k];
  x[k] = t;
}
t = x[0];
for (k = 1; k < VECLEN; k++)
{
  t = z[k] * (y[k] - t);
  x[k] = t;
}

7. Move loop-invariant calculations out of the loop

Calculations that do not depend on the loop variable can be moved outside the loop. Many compilers can do this themselves, but they dare not touch calculations involving variables that are modified inside the loop, so in many cases you have to do it by hand. For functions called inside the loop, pull all the work that does not need repeating into an init function called before the loop. Also pass as few parameters as possible; if none are needed, pass none, and if the function needs the loop count, let it keep its own static counter and increment it itself, which is faster.

The same applies to structure access. In Donglou's experience, whenever two or more members of the same structure are accessed inside a loop, it pays to introduce an intermediate variable (that is for structures; whether the same holds for C++ objects is worth thinking about). See the following example:

Old code:

total = a->b->c[4]->aardvark + a->b->c[4]->baboon + a->b->c[4]->cheetah + a->b->c[4]->dog;

New code:

struct animals * temp = a->b->c[4];
total = temp->aardvark + temp->baboon + temp->cheetah + temp->dog;

Some old C compilers do not perform this kind of grouping optimization, but newer ANSI-conforming compilers can do it automatically. Consider this example:

float a, b, c, d, f, g;
......
a = b / c * d;
f = b * g / c;

Written this way, the code is of course correct, but it cannot be optimized. If instead it is written as:

float a, b, c, d, f, g;
......
a = b / c * d;
f = b / c * g;

If written like this, a new compiler conforming to the ANSI specification can calculate b/c only once, and then substitute the result into the second formula, saving one division operation.

8. Function optimization

(1) Inline function

In C++ (and in C99), the keyword inline can be added to a function's declaration. It asks the compiler to replace every call to the function with the function's body. This is faster than a function call in two respects: first, it saves the execution time of the call instruction; second, it saves the time spent passing arguments and returning the result. However, while this optimization improves speed, it makes the program larger and therefore needs more ROM. It is most effective when the inline function is called frequently and contains only a few lines of code.
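
A minimal sketch (the helper's name and purpose are made up for illustration):

/* Small, frequently called helper: a good inline candidate. */
static inline int clamp_to_byte(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}

/* Each call below can be expanded in place by the compiler,
 * avoiding call/return overhead and argument passing. */
int scale_pixel(int pixel, int gain)
{
    return clamp_to_byte(pixel * gain);
}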

(2) Undefined unused return value

A function definition cannot know whether its return value will ever be used. If the return value is never used, the function should be declared void to make clear that it returns nothing.

(3) Reduce function call parameters

Passing data through global variables is more efficient than passing it as function parameters: it removes the time spent pushing parameters onto the stack before the call and popping them off afterwards. However, using global variables affects the modularity and reentrancy of the program, so use them with care.

(4) All functions should have prototype definitions

Generally speaking, all functions should have prototype definitions. The prototype definition can convey to the compiler more information that may be used for optimization.

(5) Use constants (const) as much as possible

Use constants (const) whenever possible. The C++ standard allows the compiler not to allocate storage for a const object whose address is never taken, which lets it generate more efficient code.

(6) Declare the local function as static (static)

If a function is used only inside the file that implements it, declare it static to force internal linkage. Otherwise, the function gets external linkage by default, which may prevent some compiler optimizations, such as automatic inlining.
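
A minimal sketch (the file and function names are hypothetical):

/* helpers.c -- a single translation unit */

/* Internal linkage: visible only in this file, so the compiler
 * is free to inline it or otherwise optimize it aggressively. */
static int square(int x)
{
    return x * x;
}

/* External linkage: the file's public entry point. */
int sum_of_squares(int a, int b)
{
    return square(a) + square(b);
}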

9. Use recursion

Unlike languages such as LISP, C has from its beginnings had an unhealthy fondness for loops of repeated code, and many C programmers insist on avoiding recursion unless the algorithm requires it. In fact, C compilers are not at all averse to optimizing recursive calls; on the contrary, they do it quite happily. Only when a recursive function must pass a large number of parameters, which may become a bottleneck, should you rewrite it as a loop; otherwise, recursion is the better choice.

10. Variables

(1) register variable

You can use the register keyword when declaring local variables. It asks the compiler to place the variable in a general-purpose register instead of on the stack; used sensibly, this improves execution speed. The more often the function is called, the more likely it is to pay off.
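
A minimal sketch (the loop is illustrative; modern compilers treat register only as a hint and usually allocate registers well on their own):

long sum_bytes(const unsigned char *buf, long len)
{
    register long sum = 0;   /* hint: keep the accumulator in a register */
    register long i;

    for (i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}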

Avoid using global and static variables in the innermost loop unless you are certain they do not change during the loop. For most compilers the only way to optimize a variable is to keep it in a register; for a variable that might change behind the compiler's back, they simply give up optimizing the whole expression. Also try to avoid passing the address of a variable to another function, common as that is. A C compiler normally assumes that each function's local variables are private to it, which is when its optimization works best; but once a variable might be modified by another function, the compiler no longer dares keep it in a register, and speed suffers badly. Look at this example:

a = b();
c(&d);

Because the address of d is passed to the function c and may be modified there, the compiler dares not keep d in a register for long; as soon as c(&d) has run, d is written back to memory. If this happens inside a loop, it causes constant traffic between memory and registers, and as everyone knows, CPU reads and writes over the system bus are slow. For example, on a Celeron 300 the CPU runs at 300 MHz while the bus runs at only 66 MHz, so the CPU may have to wait 4 to 5 cycles for a single bus read. Just thinking about it is painful.

(2) Declaring multiple variables at the same time is better than declaring variables individually

(3) Short variable names are better than long ones; try to keep variable names short

(4) Declare variables before the start of the loop

11. Use nested if structure

If an if structure has to test many parallel conditions, it is better to split them up and nest the if statements, so that unnecessary tests are avoided, as the sketch below shows.
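
A hedged sketch of the idea (the mode flags and handlers are made up): nesting lets the common condition be evaluated once instead of being repeated in every parallel test.

/* Hypothetical flags and handlers, for illustration only. */
enum Mode { MODE_FAST, MODE_SLOW };
void do_fast_small(void);
void do_fast_empty(void);
void do_slow(void);

void dispatch(enum Mode mode, int size)
{
    /* A flat version would test 'mode == MODE_FAST' in every parallel if;
     * nesting checks it only once. */
    if (mode == MODE_FAST)
    {
        if (size > 0)
            do_fast_small();
        else
            do_fast_empty();
    }
    else if (mode == MODE_SLOW)
    {
        do_slow();
    }
}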

Description:

The above optimization notes were collected and organized by Wang Quanming. Much of the material comes from the Internet and its original sources are unknown. Thanks to all the authors!

These notes are aimed mainly at embedded development, where the requirements on execution speed are extremely high, so they focus on optimizing the program's execution speed.

Note: optimization has its priorities and is an art of balance; it often comes at the expense of readability or larger code size.

(In any case, space optimization and speed optimization pull in opposite directions, as Donglou says.)



