A collection of C code optimization methods

This article collects experience and methods that can help us optimize C code for both execution speed and memory usage.

Introduction

In a recent project, we needed to develop a lightweight JPEG library that runs on mobile devices without guaranteeing high image quality. During that work, I summarized some ways to make programs run faster, and this article collects those experiences and methods.

Applying these experiences and methods can help us optimize C language code in terms of execution speed and memory usage.

Although there are many guides to optimizing C code, very few cover compilation and the machine you are programming for.

Usually, making a program run faster means the code size has to grow, and more code can hurt the program's complexity and readability. This is unacceptable on small devices such as mobile phones and PDAs, which place tight limits on memory usage.

Therefore, when optimizing code, our motto should be to ensure that both memory usage and execution speed are optimized.

Disclaimer

In my project I used many ARM-specific optimization methods (the project was based on the ARM platform), as well as many methods found online. Not all of them worked well in practice, so I collected the ones that proved useful and efficient, and modified some of them so they apply to any programming environment, not just ARM.

Where do these methods need to be used?

Without knowing this, any discussion is pointless. The most important part of optimization is finding out where to optimize, that is, which parts or modules of the program run slowly or consume the most memory. Only by optimizing those parts can the program be made meaningfully faster.

The parts that run most often, especially methods called repeatedly from the program's inner loops, should be optimized first.

For an experienced coder, spotting the parts most in need of optimization is often easy. There are also many tools that can help: I have used Visual C++'s built-in profiler to find out where a program consumes the most memory.

Another tool I've used is Intel's VTune, which is also good at finding the slowest parts of a program. In my experience, inner or nested loops and calls into third-party libraries are usually the biggest causes of slowness.

Integer

If we are sure that an integer is non-negative, we should use unsigned int instead of int. Some processors handle unsigned integers much more efficiently than signed ones (this is also good practice: the type helps document the code).

Therefore, in a tight loop, the best way to declare an integer variable is:

register unsigned int variable_name;

Remember that integer operations are faster than floating-point ones: they can be completed directly by the processor, without resorting to an FPU (floating-point unit) or a floating-point emulation library.

Although this does not guarantee that the compiler will use a register for the variable, nor that the processor handles unsigned integers more efficiently, it is common to all compilers.

For example, in a calculation where the result must be accurate to two decimal places, we can work in hundredths (multiply everything by 100) and convert to floating point as late as possible.
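For instance, a minimal sketch of this idea (the function names are illustrative, not from the article): keep a currency amount in integer cents and convert to floating point only for display.

```c
/* Illustrative sketch: fixed-point money handling in integer cents. */
static int add_tax_cents(int price_cents, int tax_percent)
{
    /* all arithmetic stays in integers: e.g. 1000 cents + 7% tax */
    return price_cents + (price_cents * tax_percent) / 100;
}

static double cents_to_dollars(int cents)
{
    /* a single, late conversion to floating point, for display only */
    return cents / 100.0;
}
```

The integer path avoids floating-point operations in the hot code; only the final formatting step pays the floating-point cost.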

Division and remainder

On a typical processor, a 32-bit division takes 20 to 140 cycles, depending on the values of the numerator and denominator. The time consumed is a constant plus a cost per bit of the quotient:

Time(numerator / denominator) = C0 + C1 * log2(numerator / denominator)
                              = C0 + C1 * (log2(numerator) - log2(denominator))

On an ARM processor, division takes 20 + 4.3N cycles. This is an expensive operation and should be avoided where possible. Sometimes a multiplication expression can replace the division.

For example, if we know that b is positive and b*c is an integer, then (a/b)>c can be rewritten as a>(c*b). If you are sure the operands are unsigned, use unsigned division, since it is more efficient than signed division.

Combined division and remainder

In some scenarios, both the quotient (x/y) and the remainder (x%y) are needed. In this case, the compiler can produce both with a single call to the division routine. If we need both, we can write them together like this:

int func_div_and_mod (int a, int b) 
{         
    return (a / b) + (a % b);    
}

Division by powers of 2 and remainder

If the divisor in a division is a power of 2, the division can be optimized better: the compiler replaces it with a shift operation. Therefore, arrange for divisors to be powers of 2 whenever possible (for example, 64 instead of 66). And still keep in mind that unsigned integer division performs more efficiently than signed integer division.

typedef unsigned int uint;

uint div32u (uint a) 
{
     return a / 32;
}
int div32s (int a)
{
    return a / 32;
}

Both divisions above avoid calling the division routine, and the unsigned version needs fewer instructions. The signed version takes longer to execute because it must round toward zero, which requires extra instructions to adjust negative values before shifting.

An Alternative to Modulo

We use the remainder operator to express modulo arithmetic. Sometimes, however, an if statement can do the job. Consider the following two examples:

uint modulo_func1 (uint count)
{
    return (++count % 60);
}

uint modulo_func2 (uint count)
{
    if (++count >= 60)
        count = 0;
    return (count);
}

Prefer the if statement to the remainder operator, because the if statement executes faster. Note that the new version only works correctly if we know the input count ranges from 0 to 59.

Use array subscripts

If you wanted to set a variable to a character value that represented something, you might do something like this:

switch ( queue ) 
{
    case 0 :   letter = 'W';   
        break;
    case 1 :   letter = 'S';   
        break;
    case 2 :   letter = 'U';   
        break;
}

or do this:

if ( queue == 0 )  
    letter = 'W';
else if ( queue == 1 )  
    letter = 'S';
else  letter = 'U';

A cleaner and faster way is to use array subscripts to get the value of a character array. as follows:

static const char *classes = "WSU";
letter = classes[queue];

Global variables

Global variables are never allocated to registers. Because a global variable can be modified through a pointer or by a function call, the compiler cannot cache its value in a register; this causes additional (often unnecessary) loads and stores whenever globals are used. Therefore, we do not recommend using global variables in important loops.

If a function uses global variables heavily, it is better to copy their values into local variables so they can be held in registers. This approach only works if the global variable is not used by any function we call. An example:

int f(void);
int g(void);
int errs;
void test1(void)
{  
    errs += f();  
    errs += g();
} 
void test2(void)
{  
    int localerrs = errs;  
    localerrs += f();  
    localerrs += g();  
    errs = localerrs;
}

Note that test1 must load and store the global errs at every addition, while test2 keeps localerrs in a register, so the global is loaded once and stored once.

Pointer aliasing

Consider the following example:

void func1( int *data )
{    
    int i;     
    for(i=0; i<10; i++)    
    {          
        anyfunc( *data, i);    
    }
}

Although the value of *data may never change, the compiler doesn't know that anyfunc will not modify it, so the program must read it from memory every time it is used. If we know that the value of the variable will not be changed, then the following encoding should be used:

void func1( int *data )
{    
    int i;    
    int localdata;     
    localdata = *data;    
    for(i=0; i<10; i++)    
    {          
        anyfunc (localdata, i);    
    }
}

This provides conditions for the compiler to optimize the code.

Variable live range splitting

Since a processor has a fixed number of registers, only a limited number of variables can be held in registers at once.

Some compilers support "live-range splitting", which means that variables can be allocated to different registers or memory in different parts of the program.

The live range of a variable begins at an assignment to it and ends with the last use before the next assignment. During the live range, the value of the variable is valid, that is to say, the variable is alive. Between live ranges, the value of the variable is not needed, that is to say, the variable is dead.

This way, the register can be used by the rest of the variables, allowing the compiler to allocate more variables to use the register.

Registers are allocated to live ranges rather than to variables, so the number of variables in a function may exceed the number of registers. If, however, the number of overlapping live ranges exceeds the number of registers, some variables must be stored temporarily in memory. This process is called splitting (spilling).

The compiler spills the least recently used variables first to keep the cost of spilling low. Ways to help avoid spilling are as follows:

  • Limit the number of variables used: This can be achieved by keeping the expressions in the function simple, small, and not using too many variables. Splitting larger functions into smaller, simpler ones can also work well.

  • Use register storage for frequently used variables: this tells the compiler that the variable is used often and should be kept in a register first. However, such variables may still be spilled from registers under certain circumstances.

Variable types

The C compiler supports the basic types char, short, int, long (signed and unsigned), float, and double. Using the correct variable type is critical, as it can reduce code and data size and greatly increase program performance.

Local variables

We should try not to use local variables of type char or short. For these types, the compiler must reduce the local variable to 8 or 16 bits after every assignment. This is called sign extension for signed variables and zero extension for unsigned ones.

The extension is implemented by shifting the register left by 24 or 16 bits and then shifting it right by the same amount (arithmetically or logically, depending on signedness), which costs two instructions (zero-extending an unsigned char costs only one).

Such shift operations can be avoided by using local variables of type int and unsigned int. This is very important for operations such as loading data into local variables first, and then processing the local variable data values. Regardless of whether the input and output data are 8-bit or 16-bit, it is worth considering them as 32-bit.

Consider the following three functions:

int wordinc (int a)
{   
    return a + 1;
}
short shortinc (short a)
{    
    return a + 1;
}
char charinc (char a)
{    
    return a + 1;
}

Although the results are the same, the first program fragment runs faster than the latter two.

Pointers

We should pass structures by reference, that is to say, by pointer; otherwise the data is copied onto the stack, reducing the program's performance. I've seen a program pass very large structures by value when a simple pointer would have done better.

When a function receives a pointer to structure data and we are sure the data will not be modified, declare the pointed-to contents as const. For example:

void print_data_of_a_structure (const Thestruct  *data_pointer)
{    
    ...printf contents of the structure...
}

This tells the compiler that the function does not change the external data (it is const-qualified), so it need not be re-read on every access. The compiler will also reject any attempt to modify the read-only structure, giving the data extra protection.

Pointer chains

Pointer chains are often used to access structured data. For example, commonly used codes are as follows:

typedef struct { int x, y, z; } Point3;
typedef struct { Point3 *pos, *direction; } Object;
 
void InitPos1(Object *p)
{
   p->pos->x = 0;
   p->pos->y = 0;
   p->pos->z = 0;
}

However, such code must reload p->pos for each assignment, because the compiler does not know that p->pos remains the same between assignments. A better way is to cache p->pos in a local variable:

void InitPos2(Object *p)
{
   Point3 *pos = p->pos;
   pos->x = 0;
   pos->y = 0;
   pos->z = 0;
}

Another method is to include the Point3 data directly in the Object structure, eliminating pointer operations on Point3 entirely.
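A sketch of that alternative, with Point3 embedded by value (the types here shadow the earlier example for illustration):

```c
/* Illustrative sketch: Point3 embedded directly in Object,
 * removing one level of pointer indirection entirely. */
typedef struct { int x, y, z; } Point3;
typedef struct { Point3 pos, direction; } Object;   /* by value, not pointers */

static void InitPos3(Object *p)
{
    p->pos.x = 0;   /* single dereference of p; no separate p->pos load */
    p->pos.y = 0;
    p->pos.z = 0;
}
```

Whether this is appropriate depends on whether the Point3 data really belongs to exactly one Object; if it is shared, the pointer form is still needed.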

Conditional execution

Conditional execution is mostly applied to if statements, but also to complex expressions built from relational operators (<, ==, >, etc.) or Boolean operators (&&, !, etc.). For code fragments containing function calls, conditional execution cannot be used, because the processor's condition flags are destroyed by the call.

Therefore, it is beneficial to keep the bodies of if and else statements as simple as possible, so that the compiler can apply conditional execution to them. Relational expressions should be grouped together.

The following example shows how the compiler uses conditional execution:

int g(int a, int b, int c, int d)
{
   if (a > 0 && b > 0 && c < 0 && d < 0)
   /* grouped conditions tied up together */
      return a + b + c + d;
   return -1;
}

Since the conditions are grouped together, the compiler is able to process them collectively.

Boolean expressions and range checking

A common boolean expression is used to determine whether a variable is within a certain range, for example, to check whether a graphics coordinate is within a window:

bool PointInRectangelArea (Point p, Rectangle *r)
{
   return (p.x >= r->xmin && p.x < r->xmax &&
                      p.y >= r->ymin && p.y < r->ymax);
}

Here's a faster way: x >= min && x < max can be converted to (unsigned)(x - min) < (max - min). This is especially beneficial when min is 0. The optimized code is as follows:

bool PointInRectangelArea (Point p, Rectangle *r)
{
    return ((unsigned) (p.x - r->xmin) < (unsigned) (r->xmax - r->xmin) &&
            (unsigned) (p.y - r->ymin) < (unsigned) (r->ymax - r->ymin));
}

Boolean expressions and zero-valued comparisons

The processor's flag bits are set after a compare instruction. The flags can also be set by basic arithmetic and logical instructions such as MOV, ADD, AND, and MUL. If a data-processing instruction sets the flags, the N and Z flags are set as if the result had been compared with zero. The N flag indicates whether the result is negative, and the Z flag indicates whether the result is zero.

In C, the processor's N and Z flags correspond to the following comparisons: for signed operands, x < 0, x >= 0, x == 0, x != 0; for unsigned operands, x == 0, x != 0 (or x > 0).

Normally, each relational operator in C costs a comparison instruction, but for the comparisons listed above the compiler can omit it, since the preceding arithmetic instruction has already set the flags. For example:

int aFunction(int x, int y)
{
    if (x + y < 0)
        return 1;
    else
        return 0;
}

Use these comparisons wherever possible to reduce comparison instructions in critical loops, shrinking code size and improving performance. C has no concept of carry or overflow flags, so the carry flag C and the overflow flag V cannot be used directly without resorting to assembly. However, the compiler can recognize a carry (unsigned overflow) test, for example:

int sum(int x, int y)
{
   int res;
   res = x + y;
   if ((unsigned) res < (unsigned) x) // carry set?  //
     res++;
   return res;
}

Lazy evaluation

In a statement like if (a > 10 && b == 4), make sure the first part of the AND expression is the one most likely to be false (or the quickest to compute), so that the second part often does not need to be evaluated at all.
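A small illustrative sketch (both functions are hypothetical): the cheap, usually-false test goes first, so short-circuit evaluation normally skips the expensive one.

```c
/* Stands in for some costly check; the loop simulates expensive work. */
static int expensive_check(int x)
{
    int i, acc = 0;
    for (i = 0; i < 1000; i++)
        acc += (x ^ i) & 1;
    return acc > 0;
}

static int should_process(int flag, int x)
{
    /* 'flag != 0' is cheap and often false, so it is tested first;
     * expensive_check() runs only when flag is set */
    return flag != 0 && expensive_check(x);
}
```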

Use a switch statement instead of if...else...

For multi-condition judgments involving if...else...else..., for example:

if( val == 1)
    dostuff1();
else if (val == 2)
    dostuff2();
else if (val == 3)
    dostuff3();

It might be faster to use switch:

switch( val )
{
    case 1: dostuff1(); break;

    case 2: dostuff2(); break;

    case 3: dostuff3(); break;
}

With the if chain, reaching the last case requires testing all the previous conditions. A switch lets us avoid this extra testing. If you must use if...else... statements, put the most likely case first.

Binary breaks

Break the tests in half instead of stacking them into one long chain; don't do something like this:

if(a==1) {
} else if(a==2) {
} else if(a==3) {
} else if(a==4) {
} else if(a==5) {
} else if(a==6) {
} else if(a==7) {
} else if(a==8) {
}

Replace it with the following binary split:

if(a<=4) {
    if(a==1)     {
    }  else if(a==2)  {
    }  else if(a==3)  {
    }  else if(a==4)   {

    }
}
else
{
    if(a==5)  {
    } else if(a==6)   {
    } else if(a==7)  {
    } else if(a==8)  {
    }
}

or as follows:

if(a<=4)
{
    if(a<=2)
    {
        if(a==1)
        {
            /* a is 1 */
        }
        else
        {
            /* a must be 2 */
        }
    }
    else
    {
        if(a==3)
        {
            /* a is 3 */
        }
        else
        {
            /* a must be 4 */
        }
    }
}
else
{
    if(a<=6)
    {
        if(a==5)
        {
            /* a is 5 */
        }
        else
        {
            /* a must be 6 */
        }
    }
    else
    {
        if(a==7)
        {
            /* a is 7 */
        }
        else
        {
            /* a must be 8 */
        }
    }
}


switch statement vs lookup table

switch is typically used when we want to:

  • call one or more functions

  • set a variable's value or return a value

  • execute one or more code fragments

If there are many case labels, the first two scenarios can often be handled more efficiently with a lookup table. For example, here are two ways to convert a condition code to a string:

char * Condition_String1(int condition) {
  switch(condition) {
     case 0: return "EQ";
     case 1: return "NE";
     case 2: return "CS";
     case 3: return "CC";
     case 4: return "MI";
     case 5: return "PL";
     case 6: return "VS";
     case 7: return "VC";
     case 8: return "HI";
     case 9: return "LS";
     case 10: return "GE";
     case 11: return "LT";
     case 12: return "GT";
     case 13: return "LE";
     case 14: return "";
     default: return 0;
  }
}
 
char * Condition_String2(int condition) {
   if ((unsigned) condition >= 15) return 0;
      return
      "EQ\0NE\0CS\0CC\0MI\0PL\0VS\0VC\0HI\0LS\0GE\0LT\0GT\0LE\0\0" +
       3 * condition;
}

The first program requires 240 bytes, while the second only requires 72 bytes.

Loops

Loops are a common construct in most programs, and most of a program's execution time is spent inside them, so it is well worth spending effort on loop execution time.

Loop termination

Carelessly written loop termination conditions can add noticeable overhead. Use a counter that counts down to zero and a simple termination condition; simple conditions take less time to evaluate. Consider the two functions below, which calculate n! The first uses an incrementing loop, the second a decrementing loop.

int fact1_func (int n)
{
    int i, fact = 1;
    for (i = 1; i <= n; i++)
      fact *= i;
    return (fact);
}
 
int fact2_func(int n)
{
    int i, fact = 1;
    for (i = n; i != 0; i--)
       fact *= i;
    return (fact);
}

The second function, fact2_func, executes more efficiently than the first.

Faster for() loop

It's a simple yet highly effective concept. Usually, we write the for loop code as follows:

for( i=0;  i<10;  i++){ ... }

i loops from 0 to 9. If the order of iteration doesn't matter, we can write instead:

for( i=10; i--; ) { ... }

This is faster because the value of i can be tested more cheaply: the test is simply "is i non-zero? If so, decrement it." Compare this with the original loop, where the processor must "subtract 10 from i; is the result negative? If so, increment i and continue."

Such a simple change makes a big difference: i now counts down from 9 to 0, and the loop executes faster.

The syntax here is a bit weird, but legal. The third statement in the loop is optional (an infinite loop can be written as for(;;)). The following code has the same effect:

for(i=10; i; i--){}

or even further:

for(i=10; i!=0; i--){}

What we need to remember here is that the loop must terminate at 0 (so this won't work if looping between 50 and 80), and that the loop counter is decremented. Code that uses incrementing loop counters does not enjoy this optimization.

Merging loops

If one loop can do the job, don't use two. But if the loop body does a lot of work, it may not fit in the processor's instruction cache; in that case, two separate loops may execute faster than a single loop.

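As a sketch of the first point (one loop instead of two), here is a hypothetical example that computes the minimum and maximum of an array in a single pass rather than in two separate loops:

```c
#define N 100

/* Illustrative merged loop: one pass over the array finds both the
 * minimum and the maximum, halving the loop overhead of two passes. */
static void min_and_max_merged(const int *a, int *mn, int *mx)
{
    int i;
    *mn = a[0];
    *mx = a[0];
    for (i = 1; i < N; i++)      /* a single traversal */
    {
        if (a[i] < *mn) *mn = a[i];
        if (a[i] > *mx) *mx = a[i];
    }
}
```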

Loops inside functions

There is always a certain performance cost to calling a function. Not only must the program counter change, but variables in use must be pushed onto the stack and new variables allocated. Much can be done to functions to improve a program's performance, while keeping the code readable and its size under control.

If a function is called frequently from a loop, consider putting the loop inside the function instead, removing the repeated call overhead. The code below:

for(i=0 ; i<100 ; i++)
{
    func(t,i);
}
...
void func(int w, int d)
{
    /* lots of stuff */
}

should be changed to:

func(t);
...
void func(int w)
{
    int i;
    for(i=0 ; i<100 ; i++)
    {
        /* lots of stuff */
    }
}

Loop unrolling

Simple loops can be unrolled for better performance, at the cost of increased code size. After unrolling, the loop counter is updated less often and fewer branches are executed. If the number of iterations is small, the loop can be fully unrolled, eliminating the loop overhead entirely.

This can make a big difference. Loop unrolling brings considerable savings because the code no longer has to test and increment i on every iteration.

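As an illustration (a hypothetical fragment, not from the original article), here is a loop with a small fixed count alongside its fully unrolled equivalent:

```c
/* What the compiler sees: a fixed, small iteration count. */
static int sum4_loop(const int *a)
{
    int i, s = 0;
    for (i = 0; i < 4; i++)
        s += a[i];
    return s;
}

/* The fully unrolled form: no counter test, no branch. */
static int sum4_unrolled(const int *a)
{
    return a[0] + a[1] + a[2] + a[3];
}
```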

Compilers will usually unroll simple loops with a fixed number of iterations themselves, but not a loop like the following, whose count is unknown at compile time:

for(i=0; i<limit; i++) { ... }

The code below (Example 1) is significantly longer than a plain loop, but much more efficient. The choice of 8 for BLOCKSIZE is arbitrary; any block size works, as long as the loop body is repeated that many times.

In this example, the loop condition is checked once every 8 iterations instead of every iteration. Since the iteration count is unknown, the compiler will generally not unroll such a loop itself, so unrolling it by hand gives better execution speed.

//Example 1

#include <stdio.h>

#define BLOCKSIZE (8)

int main(void)
{
    int i = 0;
    int limit = 33;  /* could be anything */
    int blocklimit;

    /* The limit may not be divisible by BLOCKSIZE,
     * go as near as we can first, then tidy up.
     */
    blocklimit = (limit / BLOCKSIZE) * BLOCKSIZE;

    /* unroll the loop in blocks of 8 */
    while( i < blocklimit )
    {
        printf("process(%d)\n", i);
        printf("process(%d)\n", i+1);
        printf("process(%d)\n", i+2);
        printf("process(%d)\n", i+3);
        printf("process(%d)\n", i+4);
        printf("process(%d)\n", i+5);
        printf("process(%d)\n", i+6);
        printf("process(%d)\n", i+7);

        /* update the counter */
        i += 8;
    }

    /*
     * There may be some left to do.
     * This could be done as a simple for() loop,
     * but a switch is faster (and more interesting)
     */

    if( i < limit )
    {
        /* Jump into the case at the place that will allow
         * us to finish off the appropriate number of items.
         */
        switch( limit - i )
        {
            case 7 : printf("process(%d)\n", i); i++;
            case 6 : printf("process(%d)\n", i); i++;
            case 5 : printf("process(%d)\n", i); i++;
            case 4 : printf("process(%d)\n", i); i++;
            case 3 : printf("process(%d)\n", i); i++;
            case 2 : printf("process(%d)\n", i); i++;
            case 1 : printf("process(%d)\n", i);
        }
    }

    return 0;
}

Count the number of non-zero bits

By repeatedly shifting right and testing the lowest bit, example program 1 counts how many bits are set in a word. Example program 2 unrolls that loop four times and then optimizes the code by combining the four shifts into one. Unrolling a loop often opens up further optimization opportunities like this.

//Example - 1

int countbit1(uint n)
{
  int bits = 0;
  while (n != 0)
  {
    if (n & 1) bits++;
    n >>= 1;
   }
  return bits;
}

//Example - 2

int countbit2(uint n)
{
   int bits = 0;
   while (n != 0)
   {
      if (n & 1) bits++;
      if (n & 2) bits++;
      if (n & 4) bits++;
      if (n & 8) bits++;
      n >>= 4;
   }
   return bits;
}

Breaking out of loops early

Often, a loop does not need to run all of its iterations. For example, when searching an array for a particular value, we should break out of the loop as soon as it is found. The following loop searches 10,000 integers for the value -99.

found = FALSE;
for(i=0;i<10000;i++)
{
    if( list[i] == -99 )
    {
        found = TRUE;
    }
}
 
if( found ) 
    printf("Yes, there is a -99. Hooray!\n");

The code above works fine, but requires the loop to execute all the way through, regardless of whether we've found it or not. A better approach is to terminate the query once we find the number we are looking for.

found = FALSE;
for(i=0; i<10000; i++)
{
    if( list[i] == -99 )
    {
        found = TRUE;
        break;
    }
}
if( found ) 
    printf("Yes, there is a -99. Hooray!\n");

If the value is at position 23, the loop executes only 23 times, saving 9,977 iterations.

Function design

It is a good habit to design functions small and simple. This lets optimizations such as register allocation work at their best.

Performance consumption of function calls

The processor overhead of a function call is small, usually only a small fraction of the work the function goes on to do. There are limits on how many parameters can be passed to a function in registers: the parameters must be integer-compatible (char, short, int, and float each take one word) or total at most four words (including double and long long, which take two words each).

With a limit of four, the fifth and later words are passed on the stack, adding a store at each call site and a load inside the called function.

Look at the code below:

int f1(int a, int b, int c, int d) {
   return a + b + c + d;
}
 
int g1(void) {
   return f1(1, 2, 3, 4);
}
 
int f2(int a, int b, int c, int d, int e, int f) {
  return a + b + c + d + e + f;
}
 
int g2(void) {
 return f2(1, 2, 3, 4, 5, 6);
}

The fifth and sixth arguments of g2 are stored on the stack and reloaded inside f2, costing two extra stores and loads compared with g1.

Reduce function parameter passing consumption

The methods to reduce the consumption of function parameter passing are as follows:

  • Try to design functions with four or fewer parameters; then the stack is not used for parameter passing.

  • If a function needs more than four parameters, try to ensure it does enough work to outweigh the cost of passing the extra parameters on the stack.

  • Pass a pointer to a structure instead of passing the structure itself.

  • Putting parameters into a struct and passing them into functions via pointers reduces the number of parameters and improves readability.

  • Minimize the use of long long parameters, which take two argument words. The same goes for double in code that needs floating point: use it as little as possible, since it also takes two words.

  • Avoid arguments that are passed partly in registers and partly on the stack (known as argument splitting). Current compilers do not handle this situation efficiently: all register-based arguments end up on the stack as well.

  • Avoid variadic functions. They place all of their arguments on the stack.
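As a sketch of the struct-pointer suggestion above (the type and function are illustrative): many parameters collapse into one pointer argument.

```c
/* Illustrative: six logical parameters packed into one struct,
 * passed by a single pointer instead of six argument words. */
typedef struct {
    int x, y, width, height, color, border;
} DrawArgs;

static int area_of(const DrawArgs *args)   /* one pointer, not six words */
{
    return args->width * args->height;
}
```

Beyond avoiding the stack for the fifth and sixth arguments, this also tends to improve readability at the call site.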

Leaf functions

A function that calls no other function is called a leaf function. In a typical application, nearly half of all function calls are calls to leaf functions. Leaf functions are efficient on any platform, because they do not need to save and restore registers.

The cost of saving and restoring a few registers is small compared with the useful work a leaf function can do with four or five registers at its disposal. So, wherever possible, write frequently called functions as leaf functions.

The number of calls a function receives can be checked with profiling tools. Here are some ways to ensure a function compiles to a leaf function:

  • Avoid calling other functions: including functions that are quietly turned into C library calls (such as division or floating-point operations).

  • Use __inline for short functions.

Inline functions

Declaring a function with __inline makes the compiler substitute the function body directly at each call site. This makes the call faster, but increases code size, particularly if the function is large and called from many places.

__inline int square(int x) {
   return x * x;
}
 
#include <math.h>
 
double length(int x, int y){
    return sqrt(square(x) + square(y));
}

The benefits of using inline functions are as follows:

  • No function call overhead. The call is replaced by the function body, so there is no cost such as saving and restoring registers.

  • Lower parameter passing overhead. Since no variables need to be copied, passing parameters costs less, and if any arguments are constants, the compiler can optimize the inlined body further.

The disadvantage of inline functions is that if there are many places to call, the size of the code will become very large. This mainly depends on the size of the function itself and the number of calls.

It is wise to use inline only for important functions. When used properly, inlining functions can even reduce the size of your code: a function call generates a few computer instructions, but an optimized version using inlining may generate fewer computer instructions.

use lookup table

Calculations can often be replaced by lookup tables, which can significantly improve performance. A lookup table is less precise than computing the value directly, but for many programs the difference does not matter.

Many signal-processing programs (for example, modem demodulation software) make heavy use of the computationally expensive sin and cos functions. For real-time systems where perfect accuracy is not essential, sin/cos lookup tables may be more appropriate. When using lookup tables, try to combine similar operations into a single table; this is faster and saves storage compared with keeping several separate tables.

floating point operation

Although floating-point arithmetic is time-consuming for all processors, we still need to use it when implementing signal processing software. When writing floating-point manipulation programs, keep the following points in mind:

  • Floating-point division is slow, considerably slower than addition or multiplication. Convert division by a constant into multiplication by its reciprocal (for example, replace x = x / 3.0 with x = x * (1.0 / 3.0)); the reciprocal is evaluated once at compile time.

  • Use float instead of double. float variables consume less memory and fewer registers, and are faster to operate on because of their lower precision. If float precision is sufficient, use float whenever possible.

  • Avoid transcendental functions. Transcendental functions such as sin, exp, and log are implemented as series of multiplications and additions (with extended precision), making them at least ten times slower than an ordinary multiplication.

  • Simplify floating-point expressions by hand. The compiler cannot apply many integer optimizations to floating-point arithmetic: for example, 3 * (x / 3) can be optimized to x for integers, but not for floating point, where the rewrite could change the result. Perform such simplifications manually when you know the result remains correct.

However, floating-point performance may still fall short of a particular program's requirements. In that case, the best approach may be fixed-point arithmetic: when the range of values is small enough, fixed-point operations are both more precise and faster than floating-point ones.
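
As a sketch of the fixed-point idea (the Q16.16 format and the helper names are my own illustration, not from the article): an int32_t value x represents the real number x / 65536.0, so addition stays a plain integer add and multiplication needs only a 64-bit intermediate and a shift.

```c
#include <stdint.h>

/* Q16.16 fixed point: 16 integer bits, 16 fractional bits. */
typedef int32_t fix16;

static fix16 fix_from_int(int x)  { return (fix16)(x * 65536); }
static int   fix_to_int(fix16 x)  { return (int)(x >> 16); }

/* Addition is an ordinary integer addition. */
static fix16 fix_add(fix16 a, fix16 b) { return a + b; }

/* Multiplication needs a 64-bit intermediate, then a shift
 * back down to restore the Q16.16 scaling. */
static fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}
```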

other tricks

Often, space can be traded for time. If you cache frequently used data instead of recomputing it, access becomes faster: sine and cosine lookup tables, or precomputed pseudo-random numbers, are typical examples.

  • Be careful with ++ and -- inside loop conditions. For example, while (n--) {} can sometimes be hard for the compiler to optimize.

  • Reduce the use of global variables.

  • If a variable is only needed within one file, declare it static to restrict it to file scope instead of making it global.

  • Whenever possible, use word-sized variables (int, long, etc.); the machine can often operate on them faster than on char, short, double, or bit-fields.

  • Avoid recursion. Recursion can be elegant and simple, but it requires too many function calls.

  • Calculating a square root is very expensive; do not call the sqrt() function inside a loop.

  • One-dimensional arrays are faster than multidimensional arrays.

  • The compiler optimizes within a single file: avoid splitting closely related functions across files; the compiler can handle them better (for example, by inlining) when they are kept together.

  • Single precision functions are faster than double precision.

  • Floating point multiplication is faster than floating point division - use val*0.5 instead of val/2.0.

  • Addition is faster than multiplication - use val+val+val instead of val*3.

  • The puts() function is faster than printf(), but less flexible.

  • Use #define macros instead of commonly used small functions.

  • Binary (unformatted) file access is faster than formatted access, because the program does not have to convert between human-readable ASCII and machine-readable binary. If humans do not need to read the file, save it as binary.

  • If your C library supports the mallopt() function (which controls the behavior of malloc), use it. The MAXFAST setting can greatly improve the performance of code that calls malloc many times. If a structure is created and destroyed many times per second, experiment with the mallopt options.
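
As a concrete example of the sqrt() advice in the list above (the helper name is mine, not from the article): when you only need the closest point, compare squared distances, which order points exactly the same way as true distances, so no square root ever runs inside the loop.

```c
/* Return the index of the point in (xs, ys) closest to (px, py).
 * Squared distances preserve the ordering of real distances, so
 * sqrt() is never needed inside the loop. */
static int nearest_point(const int *xs, const int *ys, int n,
                         int px, int py)
{
    int best = 0;
    long best_d2 = (long)(xs[0] - px) * (xs[0] - px)
                 + (long)(ys[0] - py) * (ys[0] - py);
    for (int i = 1; i < n; i++) {
        long dx = xs[i] - px;
        long dy = ys[i] - py;
        long d2 = dx * dx + dy * dy;
        if (d2 < best_d2) {
            best_d2 = d2;
            best = i;
        }
    }
    return best;
}
```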

Last but not least: turn on compiler optimization! It seems obvious, but is often forgotten by the time a product ships. The compiler can optimize the code at a much lower level and apply optimizations specific to the target processor.

Original: http://www.codeceo.com/article/c-high-performance-coding.html

Copyright statement: This article comes from the Internet, conveying knowledge for free, and the copyright belongs to the original author. If it involves copyright issues, please contact me to delete it.

Origin blog.csdn.net/zp1990412/article/details/123931470