Depending on the number of iterations, a loop can take a long time to complete. In addition, the loop condition must be checked on every iteration, which further reduces the performance of the loop.
Table of contents
1 Loop unrolling
2 Loop vectorization
3 Loop termination in C language
4 Infinite loop
1 Loop unrolling
To reduce the cost of evaluating the loop condition on every iteration, you can unroll the loop so that the condition is checked less often. Use #pragma unroll (<n>) to unroll time-critical, performance-sensitive loops in user code. Unrolling also has a disadvantage: it increases code size. The pragmas in the table below only take effect at the optimization levels -O2, -O3, -Ofast, and -Omax:
Pragma | Description |
---|---|
#pragma unroll (<n>) | Unroll n iterations of the loop |
#pragma unroll_completely | Unroll all iterations in the loop |
For specific usage, see:
#pragma unroll[(n)], #pragma unroll_completely https://developer.arm.com/documentation/101754/0620/armclang-Reference/Compiler-specific-Pragmas/-pragma-unroll--n-----pragma-unroll-completely
Note: Manually unrolling a loop in the source code does not have the same effect as using #pragma unroll (<n>). Manual unrolling may hinder the compiler from optimizing the loop, so Arm recommends using #pragma unroll (<n>) instead. If n is not specified, the compiler by default tries to unroll all iterations of the loop. In addition, if the compiler cannot determine the number of iterations at compile time, #pragma unroll_completely does not unroll the loop.
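To make the difference in the note concrete, the following is a minimal sketch of the two approaches applied to a simple array sum. The function names sum_pragma and sum_manual are hypothetical and chosen only for illustration. Compilers that do not recognize #pragma unroll ignore it, so both functions return the same result either way.

```c
#include <stddef.h>

/* Pragma version: the source keeps its simple shape and the compiler
   is asked to unroll the loop four times. */
unsigned int sum_pragma(const unsigned int *values, size_t count)
{
    unsigned int sum = 0;
    size_t i;
#pragma unroll (4)
    for (i = 0; i < count; i++)
        sum += values[i];
    return sum;
}

/* Manual version: four elements are handled per iteration by hand,
   and a second loop handles the remaining 0 to 3 elements. */
unsigned int sum_manual(const unsigned int *values, size_t count)
{
    unsigned int sum = 0;
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        sum += values[i];
        sum += values[i + 1];
        sum += values[i + 2];
        sum += values[i + 3];
    }
    for (; i < count; i++)
        sum += values[i];
    return sum;
}
```

The manual version fixes the unroll factor in the source and needs an extra remainder loop, which is the kind of structure that can get in the way of later compiler optimizations.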
For example, the following sample code:
int countSetBits1(unsigned int n)
{
    int bits = 0;
    while (n != 0)
    {
        if (n & 1) bits++;
        n >>= 1;
    }
    return bits;
}
Compile with the following command:
armclang --target=arm-arm-none-eabi -march=armv8-a file.c -O2 -S -o file.s
By default, the following assembly code will be obtained:
countSetBits1:
mov r1, r0
mov r0, #0
cmp r1, #0
bxeq lr
mov r2, #0
mov r0, #0
.LBB0_1:
and r3, r1, #1
cmp r2, r1, asr #1
add r0, r0, r3
lsr r3, r1, #1
mov r1, r3
bne .LBB0_1
bx lr
If the loop is unrolled four times with #pragma unroll (4):
int countSetBits2(unsigned int n)
{
    int bits = 0;
    #pragma unroll (4)
    while (n != 0)
    {
        if (n & 1) bits++;
        n >>= 1;
    }
    return bits;
}
Then the generated assembly code is as follows:
countSetBits2:
mov r1, r0
mov r0, #0
cmp r1, #0
bxeq lr
mov r2, #0
mov r0, #0
.LBB0_1:
and r3, r1, #1
cmp r2, r1, asr #1
add r0, r0, r3
beq .LBB0_4
@ BB#2:
asr r3, r1, #1
cmp r2, r1, asr #2
and r3, r3, #1
add r0, r0, r3
asrne r3, r1, #2
andne r3, r3, #1
addne r0, r0, r3
cmpne r2, r1, asr #3
beq .LBB0_4
@ BB#3:
asr r3, r1, #3
cmp r2, r1, asr #4
and r3, r3, #1
add r0, r0, r3
asr r3, r1, #4
mov r1, r3
bne .LBB0_1
.LBB0_4:
bx lr
If the number of iterations can be determined at compile time, the ARM Embedded Compiler can fully unroll the loop.
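As an illustration of that case, here is a small sketch of a loop whose iteration count is a compile-time constant, marked with #pragma unroll_completely. The names TAP_COUNT and weighted_sum are hypothetical. The pragma is specific to armclang; other compilers are expected to ignore it as an unknown pragma, which does not change the result.

```c
/* Hypothetical fixed size: the iteration count is known at compile time. */
#define TAP_COUNT 8

/* Multiply each sample by its weight and add up the products. */
int weighted_sum(const int samples[TAP_COUNT], const int weights[TAP_COUNT])
{
    int sum = 0;
    int i;
#pragma unroll_completely
    for (i = 0; i < TAP_COUNT; i++)
        sum += samples[i] * weights[i];
    return sum;
}
```

If TAP_COUNT were replaced by a function parameter, the compiler could no longer determine the iteration count and, as noted earlier, #pragma unroll_completely would not unroll the loop.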
2 Loop vectorization
If the target of the user code includes an Advanced SIMD unit, the Arm Embedded Compiler can use its vectorization engine to optimize the vectorizable parts of the code. At optimization level -O1, you can enable vectorization with the -fvectorize option. At optimization levels higher than -O1, -fvectorize is enabled by default, and you can turn it off with the -fno-vectorize option. For details, see: -fvectorize, -fno-vectorize.
Example code, vectorized using Advanced SIMD:
typedef struct tBuffer {
    int a;
    int b;
    int c;
} tBuffer;
tBuffer buffer[8];
void DoubleBuffer1 (void)
{
    int i;
    for (i=0; i<8; i++)
    {
        buffer[i].a *= 2;
        buffer[i].b *= 2;
        buffer[i].c *= 2;
    }
}
Compile with an optimization level of -O2:
armclang --target=arm-arm-none-eabi -march=armv8-a -O2 file.c -S -o file.s
This produces the following code:
DoubleBuffer1:
.fnstart
@ BB#0:
movw r0, :lower16:buffer
movt r0, :upper16:buffer
vld1.64 {d16, d17}, [r0:128]
mov r1, r0
vshl.i32 q8, q8, #1
vst1.32 {d16, d17}, [r1:128]!
vld1.64 {d16, d17}, [r1:128]
vshl.i32 q8, q8, #1
vst1.64 {d16, d17}, [r1:128]
add r1, r0, #32
vld1.64 {d16, d17}, [r1:128]
vshl.i32 q8, q8, #1
vst1.64 {d16, d17}, [r1:128]
add r1, r0, #48
vld1.64 {d16, d17}, [r1:128]
vshl.i32 q8, q8, #1
vst1.64 {d16, d17}, [r1:128]
add r1, r0, #64
add r0, r0, #80
vld1.64 {d16, d17}, [r1:128]
vshl.i32 q8, q8, #1
vst1.64 {d16, d17}, [r1:128]
vld1.64 {d16, d17}, [r0:128]
vshl.i32 q8, q8, #1
vst1.64 {d16, d17}, [r0:128]
bx lr
If the code is instead written in a form that the compiler does not vectorize with Advanced SIMD:
typedef struct tBuffer {
    int a;
    int b;
    int c;
} tBuffer;
tBuffer buffer[8];
void DoubleBuffer2 (void)
{
    int i;
    for (i=0; i<8; i++)
        buffer[i].a *= 2;
    for (i=0; i<8; i++)
        buffer[i].b *= 2;
    for (i=0; i<8; i++)
        buffer[i].c *= 2;
}
This produces the following scalar code:
DoubleBuffer2:
.fnstart
@ BB#0:
movw r0, :lower16:buffer
movt r0, :upper16:buffer
ldr r1, [r0]
lsl r1, r1, #1
str r1, [r0]
ldr r1, [r0, #12]
lsl r1, r1, #1
str r1, [r0, #12]
ldr r1, [r0, #24]
lsl r1, r1, #1
str r1, [r0, #24]
ldr r1, [r0, #36]
lsl r1, r1, #1
str r1, [r0, #36]
ldr r1, [r0, #48]
lsl r1, r1, #1
str r1, [r0, #48]
ldr r1, [r0, #60]
lsl r1, r1, #1
str r1, [r0, #60]
ldr r1, [r0, #72]
lsl r1, r1, #1
str r1, [r0, #72]
ldr r1, [r0, #84]
lsl r1, r1, #1
str r1, [r0, #84]
ldr r1, [r0, #4]
lsl r1, r1, #1
str r1, [r0, #4]
ldr r1, [r0, #16]
lsl r1, r1, #1
...
bx lr
Advanced SIMD (Single Instruction Multiple Data), also known as Neon technology, is available in the ARMv7-A series and later architectures. It allows users to write highly optimized, high-performance code. Neon can be used directly from C/C++ through its function interface (intrinsics). For Neon usage tips, please refer to the article: Arm C Language Extensions ACLE Q1 2019
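As a rough sketch of what calling Neon through the C function interface can look like, the following doubles an array of 32-bit integers four lanes at a time, mirroring the vld1/vshl/vst1 sequence in the compiler output above. The function name double_values is hypothetical. The intrinsics vld1q_s32, vshlq_n_s32, and vst1q_s32 come from arm_neon.h, and the code assumes the ACLE macro __ARM_NEON is defined when Advanced SIMD is available; otherwise only the scalar loop is compiled.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Double every element of data in place. */
void double_values(int32_t *data, size_t count)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Vector part: load four values, shift left by one bit, store back. */
    for (; i + 4 <= count; i += 4) {
        int32x4_t v = vld1q_s32(&data[i]);
        v = vshlq_n_s32(v, 1);
        vst1q_s32(&data[i], v);
    }
#endif
    /* Scalar part: remaining elements, or all elements without Neon. */
    for (; i < count; i++)
        data[i] *= 2;
}
```

Keeping a scalar fallback in the same function means the code still builds for targets where Advanced SIMD is disabled.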
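Note: Using the -fno-vectorize option does not completely prevent the compiler from emitting SIMD instructions. If a linked library contains Neon instructions, the compiler or linker may still use SIMD.
To prevent the compiler from emitting Advanced SIMD instructions for AArch64 targets, specify +nosimd with -march or -mcpu: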
armclang --target=aarch64-arm-none-eabi -march=armv8-a+nosimd -O2 file.c -S -o file.s
To prevent the compiler from emitting Advanced SIMD instructions for AArch32 targets, set the -mfpu option to a value that does not include Advanced SIMD, for example -mfpu=fp-armv8:
armclang --target=arm-arm-none-eabi -march=armv8-a -mfpu=fp-armv8 -O2 file.c -S -o file.s
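One way to check from within the source whether these options took effect is to test a feature macro at compile time. The sketch below assumes the ACLE macro __ARM_NEON, which is expected to be defined only when Advanced SIMD is available for the current target and options. The function name advanced_simd_available is hypothetical.

```c
/* Returns 1 if this translation unit was compiled with Advanced SIMD
   available (assumption: __ARM_NEON is defined in that case), else 0. */
int advanced_simd_available(void)
{
#if defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}
```

The same macro can guard code that uses Neon intrinsics, so that a build with +nosimd or a non-SIMD -mfpu value falls back to plain C.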
3 Loop termination in C language
Loop termination conditions can cause significant performance overhead if they are not written with care. Where possible, follow these guidelines:
- Use a simple loop termination condition
- Write a loop that counts down to 0 and tests for equality with 0
- Use a counter of type unsigned int
For example, consider the following two functions that compute the factorial n!:
int fact1(int n)
{
    int i, fact = 1;
    for (i = 1; i <= n; i++)
        fact *= i;
    return (fact);
}
int fact2(int n)
{
    unsigned int i, fact = 1;
    for (i = n; i != 0; i--)
        fact *= i;
    return (fact);
}
fact1 uses an incrementing loop and fact2 uses a decrementing loop.
Compile with the following command:
armclang -Os -S --target=arm-arm-none-eabi -march=armv8-a
The resulting assembly is:
Incrementing factorial function fact1:
; r1 -> n
; r0 -> fact
; r2 -> i
fact1:
mov r1, r0
mov r0, #1
cmp r1, #1
bxlt lr
mov r2, #0
.LBB0_1:
add r2, r2, #1
mul r0, r0, r2
cmp r1, r2
bne .LBB0_1
bx lr
Decrementing factorial function fact2:
; r1 -> i
; r0 -> fact
fact2:
mov r1, r0
mov r0, #1
cmp r1, #0
bxeq lr
.LBB1_1:
mul r0, r0, r1
subs r1, r1, #1
bne .LBB1_1
bx lr
Comparing the incrementing and decrementing functions shows that:
- fact1 needs one more instruction per iteration than fact2, namely CMP r1, r2: fact1 first uses ADD to increment i and then uses CMP to compare i with n. fact2 needs only a SUBS instruction to decrement i by 1. Because SUBS updates the Z flag of the CPSR, the conditional branch needs no separate CMP.
- fact1 also uses one more register, r2, than fact2. In fact1, i must be compared with n after each increment, so n has to stay in a register for the whole loop. In fact2, n is only needed to initialize i and does not have to remain available for the lifetime of the loop. The fewer registers that have to be maintained, the easier register allocation becomes.
In summary, the decrementing loop is simpler and more efficient.
If the original termination condition involved a function call, that function may be called on each iteration of the loop, even if the value it returns remains the same. In this case, it is also more efficient to use a decrementing loop. For example:
for (...; i < get_limit(); ...);
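The effect of a function call in the termination condition can be made visible with a call counter. The following is a minimal sketch: get_limit here is a hypothetical stand-in that returns a fixed limit of 10 and records how often it is called, and count_up and count_down are illustrative names.

```c
/* Number of times get_limit() has been called. */
unsigned int limit_calls = 0;

/* Hypothetical limit function: always returns 10 and counts its calls. */
unsigned int get_limit(void)
{
    limit_calls++;
    return 10u;
}

/* Incrementing loop: the condition calls get_limit() on every check. */
unsigned int count_up(void)
{
    unsigned int i, iterations = 0;
    for (i = 0; i < get_limit(); i++)
        iterations++;
    return iterations;
}

/* Decrementing loop: get_limit() is called once, to initialize i. */
unsigned int count_down(void)
{
    unsigned int i, iterations = 0;
    for (i = get_limit(); i != 0; i--)
        iterations++;
    return iterations;
}
```

Both loops run 10 iterations, but count_up evaluates its condition 11 times and therefore calls get_limit() 11 times, whereas count_down calls it only once.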
The method of initializing the loop counter (i) to the desired number of iterations (n) and then decrementing it to 0 also works for while and do statements.
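As a sketch of how the same count-down pattern might look with while and do statements, the two hypothetical functions below sum an array of n elements by decrementing n to 0. The names sum_while and sum_do are chosen only for illustration.

```c
/* while form: n counts down to 0 and doubles as the array index. */
unsigned int sum_while(const unsigned int *values, unsigned int n)
{
    unsigned int sum = 0;
    while (n != 0) {
        n--;
        sum += values[n];
    }
    return sum;
}

/* do form: the body runs at least once, so n == 0 is handled first. */
unsigned int sum_do(const unsigned int *values, unsigned int n)
{
    unsigned int sum = 0;
    if (n == 0)
        return 0;
    do {
        n--;
        sum += values[n];
    } while (n != 0);
    return sum;
}
```

In both forms the counter is an unsigned int and the termination test is a comparison with 0, in line with the guidelines at the start of this section.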
4 Infinite loop
armclang considers an infinite loop without side effects to be undefined behavior, as stated in the C11 and C++11 standards. In some cases, armclang will remove or move infinite loops that have no side effects, which may cause the program to terminate, or not behave as expected.
To ensure that a loop really runs forever, Arm recommends writing infinite loops that contain an __asm volatile statement. The volatile keyword tells the compiler that the assembly statement has potential side effects, which prevents the optimizer from removing the loop. In such a loop it is also good practice to put the processor into a low-power state until an event or interrupt occurs. The following example shows an infinite loop containing a volatile assembly statement with the WFE instruction, which puts the processor into a low-power state until an event occurs:
void infinite_loop(void) {
    while (1)
        __asm volatile("wfe");
}
Because of the volatile keyword, armclang does not delete or move the loop during optimization. The WFE (Wait For Event) instruction is a hint to the processor: a processor that implements WFE can enter a low-power state until an event or interrupt occurs, so the loop does not consume power unnecessarily. WFI (Wait For Interrupt) can be used in the same way on processors that implement it.