Loop operation (LOOP) optimization of ARM embedded compilation

A loop can take a long time to complete, depending on its number of iterations. In addition, the loop condition has to be checked on every iteration, which adds further overhead.

Table of contents

1 Loop unrolling

2 Loop vectorization

3 Loop termination in C language

4 Infinite loops


1 Loop unrolling

To reduce the cost of evaluating the loop condition on every iteration, you can unroll the loop so that the condition is checked less often. Use #pragma unroll (<n>) to unroll time-critical loops in your code. Unrolling has a drawback, however: it increases code size. The pragmas in the table below only take effect at the -O2, -O3, -Ofast, and -Omax optimization levels:

Loop unrolling pragmas

Pragma                       Description
#pragma unroll (<n>)         Unroll n iterations of the loop
#pragma unroll_completely    Unroll all iterations of the loop

For specific usage, see:
#pragma unroll[(n)], #pragma unroll_completely https://developer.arm.com/documentation/101754/0620/armclang-Reference/Compiler-specific-Pragmas/-pragma-unroll--n-----pragma-unroll-completely

Note: manually unrolling a loop in the source code and using #pragma unroll (<n>) do not have the same effect. Manually unrolling the loop in the source code may hinder the compiler from optimizing it, so Arm recommends using #pragma unroll (<n>). If n is not specified, the compiler attempts to unroll all iterations of the loop. However, if the compiler cannot determine the number of iterations at compile time, #pragma unroll_completely does not unroll the loop.

For example, the following sample code:

int countSetBits1(unsigned int n)
{
    int bits = 0;

    while (n != 0)
    {
        if (n & 1) bits++;
        n >>= 1;
    }
    return bits;
}

Compile with the following command:

armclang --target=arm-arm-none-eabi -march=armv8-a file.c -O2 -S -o file.s

By default, this produces the following assembly code:

countSetBits1:
        mov     r1, r0
        mov     r0, #0
        cmp     r1, #0
        bxeq    lr
        mov     r2, #0
        mov     r0, #0
.LBB0_1:
        and     r3, r1, #1
        cmp     r2, r1, asr #1
        add     r0, r0, r3
        lsr     r3, r1, #1
        mov     r1, r3
        bne     .LBB0_1
        bx      lr

To unroll the loop four times, add #pragma unroll (4) before it:

int countSetBits2(unsigned int n)
{
    int bits = 0;
    #pragma unroll (4)
    while (n != 0)
    {
        if (n & 1) bits++;
        n >>= 1;
    }
    return bits;
}

The generated assembly code is then as follows:

countSetBits2:
        mov     r1, r0
        mov     r0, #0
        cmp     r1, #0
        bxeq    lr
        mov     r2, #0
        mov     r0, #0
.LBB0_1:
        and     r3, r1, #1
        cmp     r2, r1, asr #1
        add     r0, r0, r3
        beq     .LBB0_4
@ BB#2:
        asr     r3, r1, #1
        cmp     r2, r1, asr #2
        and     r3, r3, #1
        add     r0, r0, r3
        asrne   r3, r1, #2
        andne   r3, r3, #1
        addne   r0, r0, r3
        cmpne   r2, r1, asr #3
        beq     .LBB0_4
@ BB#3:
        asr     r3, r1, #3
        cmp     r2, r1, asr #4
        and     r3, r3, #1
        add     r0, r0, r3
        asr     r3, r1, #4
        mov     r1, r3
        bne     .LBB0_1
.LBB0_4:
        bx      lr
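
At the C level, the unrolled loop above behaves roughly like the following hand-written sketch. The function name countSetBitsUnrolled is hypothetical, and the code only illustrates the transformation: it is not compiler output and not a recommendation, since manually unrolling a loop may hinder the compiler. Because the iteration count is not known at compile time, each copy of the body keeps its own exit test; what is saved is the backward branch, which is now taken once per four iterations.

```c
/* Illustration only: a hand-written equivalent of unrolling the
 * countSetBits loop four times. The function name is hypothetical. */
int countSetBitsUnrolled(unsigned int n)
{
    int bits = 0;

    while (n != 0)
    {
        /* Copy 1 of the loop body */
        if (n & 1) bits++;
        n >>= 1;
        if (n == 0) break;

        /* Copy 2 */
        if (n & 1) bits++;
        n >>= 1;
        if (n == 0) break;

        /* Copy 3 */
        if (n & 1) bits++;
        n >>= 1;
        if (n == 0) break;

        /* Copy 4: the while condition acts as its exit test */
        if (n & 1) bits++;
        n >>= 1;
    }
    return bits;
}
```

For any input, this function returns the same result as countSetBits1 and countSetBits2.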

If the number of iterations can be determined at compile time, the ARM Embedded Compiler can fully unroll the loop.
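
As a minimal sketch of such a loop, the following function has an iteration count that is a compile-time constant. The names N_SAMPLES and sum_samples are hypothetical. The pragma is guarded with __ARMCC_VERSION, a macro defined by Arm Compiler, so that other compilers do not see a pragma they do not recognize; whether the loop is actually unrolled still depends on the optimization level.

```c
#define N_SAMPLES 8   /* hypothetical fixed iteration count */

/* Sums a fixed-size array. The loop bound is a compile-time constant,
 * so the compiler is able to determine the number of iterations. */
int sum_samples(const int samples[N_SAMPLES])
{
    int sum = 0;
    unsigned int i;

#if defined(__ARMCC_VERSION)   /* only pass the pragma to Arm Compiler */
    #pragma unroll_completely
#endif
    for (i = 0; i < N_SAMPLES; i++)
        sum += samples[i];

    return sum;
}
```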

2 Loop vectorization

If the target has an Advanced SIMD unit, the Arm embedded compiler can use its vectorization engine to optimize the parts of the code that can be vectorized. At optimization level -O1, you can enable vectorization with the -fvectorize option. At optimization levels higher than -O1, -fvectorize is enabled by default, and you can turn it off with the -fno-vectorize option. For details, see: -fvectorize, -fno-vectorize.

Example code, vectorized using Advanced SIMD:

typedef struct tBuffer {
  int a;
  int b;
  int c;
} tBuffer;
tBuffer buffer[8];

void DoubleBuffer1 (void)
{
  int i;
  for (i=0; i<8; i++)
  {
    buffer[i].a *= 2;
    buffer[i].b *= 2;
    buffer[i].c *= 2;
  }
}

Compile with an optimization level of -O2:

armclang --target=arm-arm-none-eabi -march=armv8-a -O2 file.c -S -o file.s

This produces the following code:

DoubleBuffer1:
.fnstart
@ BB#0:
        movw    r0, :lower16:buffer
        movt    r0, :upper16:buffer
        vld1.64 {d16, d17}, [r0:128]
        mov     r1, r0
        vshl.i32        q8, q8, #1
        vst1.32 {d16, d17}, [r1:128]!
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        add     r1, r0, #32
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        add     r1, r0, #48
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        add     r1, r0, #64
        add     r0, r0, #80
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        vld1.64 {d16, d17}, [r0:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r0:128]
        bx      lr

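By contrast, the following version, which splits the work into three separate loops, is compiled without Advanced SIMD instructions: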

typedef struct tBuffer {
  int a;
  int b;
  int c;
} tBuffer;
tBuffer buffer[8];

void DoubleBuffer2 (void)
{
  int i;
  for (i=0; i<8; i++)
    buffer[i].a *= 2;
  for (i=0; i<8; i++)
    buffer[i].b *= 2;
  for (i=0; i<8; i++)
    buffer[i].c *= 2;
}

This produces:

DoubleBuffer2:
.fnstart
@ BB#0:
        movw    r0, :lower16:buffer
        movt    r0, :upper16:buffer
        ldr     r1, [r0]
        lsl     r1, r1, #1
        str     r1, [r0]
        ldr     r1, [r0, #12]
        lsl     r1, r1, #1
        str     r1, [r0, #12]
        ldr     r1, [r0, #24]
        lsl     r1, r1, #1
        str     r1, [r0, #24]
        ldr     r1, [r0, #36]
        lsl     r1, r1, #1
        str     r1, [r0, #36]
        ldr     r1, [r0, #48]
        lsl     r1, r1, #1
        str     r1, [r0, #48]
        ldr     r1, [r0, #60]
        lsl     r1, r1, #1
        str     r1, [r0, #60]
        ldr     r1, [r0, #72]
        lsl     r1, r1, #1
        str     r1, [r0, #72]
        ldr     r1, [r0, #84]
        lsl     r1, r1, #1
        str     r1, [r0, #84]
        ldr     r1, [r0, #4]
        lsl     r1, r1, #1
        str     r1, [r0, #4]
        ldr     r1, [r0, #16]
        lsl     r1, r1, #1
        ...
        bx      lr

Advanced SIMD (Single Instruction Multiple Data), also known as Neon technology, is available in the Armv7-A architecture and later. It allows you to write highly optimized, high-performance code. Neon can be used directly from C/C++ through its function interface (intrinsics). For more on using Neon, see the following documents:

  • Arm C Language Extensions ACLE Q1 2019
  • Cortex-A Series Programmer's Guide
  • Arm Neon Programmer's Guide
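
As a small sketch of that C interface, the following function doubles four 32-bit integers with Neon intrinsics from arm_neon.h, mirroring the load, shift, and store sequence that the compiler generated for DoubleBuffer1. The function name double4 is hypothetical, and the scalar fallback for targets without Neon is an addition made here so that the example builds everywhere.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>   /* Neon intrinsics, available when Neon is enabled */
#endif

/* Doubles the four 32-bit integers starting at p (hypothetical helper). */
void double4(int32_t *p)
{
#if defined(__ARM_NEON)
    int32x4_t v = vld1q_s32(p);   /* load four lanes            */
    v = vshlq_n_s32(v, 1);        /* shift each lane left by 1  */
    vst1q_s32(p, v);              /* store the four lanes back  */
#else
    /* Scalar fallback for targets without Advanced SIMD */
    int i;
    for (i = 0; i < 4; i++)
        p[i] *= 2;
#endif
}
```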

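Using the -fno-vectorize option does not completely prevent the compiler from emitting SIMD instructions. If a linked library contains Neon instructions, the compiler or linker may still use SIMD.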

To prevent the compiler from emitting Advanced SIMD instructions for AArch64 targets, specify +nosimd with -march or -mcpu:

armclang --target=aarch64-arm-none-eabi -march=armv8-a+nosimd -O2 file.c -S -o file.s

To prevent the compiler from emitting Advanced SIMD instructions for AArch32 targets, set the option -mfpu to the correct value that does not include Advanced SIMD. For example, -mfpu=fp-armv8.

armclang --target=aarch32-arm-none-eabi -march=armv8-a -mfpu=fp-armv8 -O2 file.c -S -o file.s

3 Loop termination in C language

Loop termination conditions can cause significant performance overhead if the code is not written with care. Where possible, follow these guidelines:

  • Use a simple loop termination condition.
  • Write loops that count down to 0 and test for equality with 0.
  • Use a counter of type unsigned int.

For example, consider the following two functions that compute the factorial n!:

int fact1(int n)
{
    int i, fact = 1;
    for (i = 1; i <= n; i++)
        fact *= i;
    return (fact);
}

int fact2(int n)
{
    unsigned int i, fact = 1;
    for (i = n; i != 0; i--)
        fact *= i;
    return (fact);
}

fact1 uses incrementing and fact2 uses decrementing.

Compile with the following command:

armclang -Os -S --target=arm-arm-none-eabi -march=armv8-a

The resulting assembly is:

Incrementing factorial function fact1:

; r1 -> n
; r0 -> fact
; r2 -> i


fact1:
        mov     r1, r0
        mov     r0, #1
        cmp     r1, #1
        bxlt    lr
        mov     r2, #0
.LBB0_1:
        add     r2, r2, #1
        mul     r0, r0, r2
        cmp     r1, r2
        bne     .LBB0_1
        bx      lr

Decrementing factorial function fact2:

; r1 -> i
; r0 -> fact

fact2:
        mov     r1, r0
        mov     r0, #1
        cmp     r1, #0
        bxeq    lr
.LBB1_1:
        mul     r0, r0, r1
        subs    r1, r1, #1
        bne     .LBB1_1
        bx      lr

Comparing the incrementing and decrementing functions shows that:

  1. fact1 needs one more instruction, CMP r1, r2, than fact2. fact1 first uses ADD to increment i and then CMP to compare i with n. fact2 needs only a SUBS instruction to decrement i: because SUBS updates the Z flag of the CPSR, the conditional branch can be taken without a separate CMP.
  2. fact1 also uses one more register, R2, than fact2. Because i must be compared with n after each increment, fact1 needs an extra register to hold n for the whole loop. In fact2, n does not have to stay available for the lifetime of the loop, and the fewer registers there are to maintain, the easier register allocation becomes.

In summary, using a decrementing loop is simpler and more efficient.

If the original termination condition involved a function call, that function may be called on each iteration of the loop, even if the value it returns remains the same. In this case, it is also more efficient to use a decrementing loop. For example:

for (...; i < get_limit(); ...);
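
The difference can be made visible with a small sketch. Here get_limit() is a hypothetical stand-in that counts its own calls, and iterate_up and iterate_down are illustrative names. In the incrementing loop the call sits in the condition and is evaluated before every iteration; in the decrementing loop it sits in the initializer and is evaluated once.

```c
/* Hypothetical stand-in for get_limit(): records how often it is called. */
static unsigned int limit_calls = 0;

static unsigned int get_limit(void)
{
    limit_calls++;
    return 10u;
}

/* Incrementing loop: get_limit() is evaluated before every iteration. */
unsigned int iterate_up(void)
{
    unsigned int iterations = 0;
    unsigned int i;

    for (i = 0; i < get_limit(); i++)
        iterations++;
    return iterations;
}

/* Decrementing loop: get_limit() is evaluated once, in the initializer. */
unsigned int iterate_down(void)
{
    unsigned int iterations = 0;
    unsigned int i;

    for (i = get_limit(); i != 0; i--)
        iterations++;
    return iterations;
}
```

Both functions run ten iterations, but iterate_up calls get_limit() eleven times (ten passing tests plus the final failing one), while iterate_down calls it once.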

The method of initializing the loop counter (i) to the desired number of iterations (n) and then decrementing it to 0 also works for while and do statements.
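
A minimal sketch of the same count-down pattern with while and do statements, using the factorial example again. The function names fact_while and fact_do are hypothetical. Note that a do loop runs its body at least once, so the n == 0 case has to be handled before entering it.

```c
/* Count-down factorial using a while statement. */
unsigned int fact_while(unsigned int n)
{
    unsigned int fact = 1;

    while (n != 0)
    {
        fact *= n;
        n--;
    }
    return fact;
}

/* Count-down factorial using a do statement. */
unsigned int fact_do(unsigned int n)
{
    unsigned int fact = 1;

    if (n == 0)          /* the do body must not run for n == 0 */
        return fact;

    do
    {
        fact *= n;
    } while (--n != 0);  /* decrement, then test against zero */
    return fact;
}
```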

4 Infinite loops

armclang treats an infinite loop without side effects as undefined behavior, as stated in the C11 and C++11 standards. In some cases, armclang removes or moves infinite loops that have no side effects, which may cause the program to terminate or to behave unexpectedly.

To ensure that a loop runs indefinitely, Arm recommends writing infinite loops that contain an __asm volatile statement. The volatile keyword tells the compiler to treat the statement as having potential side effects, which prevents the optimizer from removing the loop. In such a loop, it is also good practice to put the processor into a low-power state until an event or interrupt occurs. The following example shows an infinite loop whose body is an __asm volatile statement containing a WFE instruction, which puts the processor into a low-power state until an event occurs:

void infinite_loop(void) {
    while (1)
        __asm volatile("wfe");
}

The volatile keyword tells armclang not to delete or move the loop. The compiler considers the loop to have side effects, so it does not remove it during optimization. The Wait For Event (WFE) assembly instruction is a hint to the processor. Writing the loop this way lets a processor that implements WFE enter a low-power state until an event or interrupt occurs, so the loop does not consume power unnecessarily. On processors that implement it, the WFI (Wait For Interrupt) instruction can be used in the same way.

Reference article:

Optimizing loops: https://developer.arm.com/documentation/100748/0620/Writing-Optimized-Code/Optimizing-loops?lang=en

Origin blog.csdn.net/luolaihua2018/article/details/130714061