Series Article Directory
- SIMD in Digital Signal Processing
- Neon intrinsics brief tutorial
- Implementing an Efficient FIR Filter with NEON
Foreword
I have published several articles introducing SIMD and NEON. If you have read them, I believe you already have a basic understanding of NEON. So far, though, we have stayed mostly at the theoretical level: introducing the NEON API and giving a few simple examples.
Today we will work through some exercises. These are tasks you may well encounter in real development, and they are simple enough to serve as beginner-friendly teaching examples for NEON. We will show how to use NEON to optimize existing code, and use benchmarks to measure the performance difference before and after optimization.
Regrettably, I had assumed that mastering SIMD would let me double the performance of my algorithms, but after actually testing, I found that the compiler is simply too smart: for some simple tasks, the compiler's auto-vectorized code beats hand-written SIMD. Honestly, this was a little frustrating and sapped much of my motivation to learn SIMD, but I am also relieved that, as a programmer, I can safely hand part of this work over to the compiler instead of reinventing the wheel.
Originally, my expected blog content flow should be like this:
- Ask a question and give a baseline with the basic implementation
- Optimized with NEON
- Wow, the optimized version is twice as fast, SIMD is awesome!
But in reality, the "optimized" versions were mostly performance regressions (ouch!), and different compilers behave differently: the same code that improves performance on platform A may be a regression on platform B. This means the work of "optimization" ends up bound to the compiler version and operating system, and things get more and more complicated, beyond just the code level. Of course, the reason may simply be that my example tasks are too simple, so simple that a smart compiler sees through them at a glance.
In any case, I decided to write up the whole process for your reference. The following examples were run on the author's Mac M1 and an Android Honor 50. You can find all the code in neon_intrinsics_exercise - github.
1. General process
When you want to optimize the performance of an algorithm, first consider whether there is a better implementation, such as replacing bubble sort with quick sort, which reduces the complexity from $O(N^2)$ to $O(N\log N)$.
When the algorithm implementation has been fixed and there is no better way to implement it, it is time to consider using SIMD technology for performance optimization. Assuming that there is an algorithm A to be optimized, the overall process is roughly as follows:
- Profile algorithm A and use the result as the optimization baseline
- Apply SIMD optimization to algorithm A, producing the optimized algorithm A_SIMD
- Compare the output of A_SIMD with that of A to ensure the results are consistent with the unoptimized version
- Profile A_SIMD and compare it with the baseline to confirm the optimization is actually positive
In short, we must ensure both that the optimized results are correct and that performance has actually improved.
2. Some examples
1. Vector sum
Task description: Implement a function that uses the NEON instruction set to sum all the numbers in an array and return the result.
1.1 baseline
This task is very simple. You may already have a NEON implementation in mind, but hold on: one bite at a time. Let's start with the simplest possible implementation in plain C/C++, with no NEON involved. The code is as follows:
```cpp
float sum(float* array, size_t size)
{
    float s = 0.0f;
    for (size_t i = 0; i < size; ++i) {
        s += array[i];
    }
    return s;
}
```
1.2 NEON implementation
First, unroll the baseline loop:
```cpp
float sum_expand(float* array, size_t size)
{
    float s = 0.0f;
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {  // stop before overrunning the array
        s += array[i];
        s += array[i + 1];
        s += array[i + 2];
        s += array[i + 3];
    }
    for (; i < size; ++i) {  // handle the remaining 0-3 elements
        s += array[i];
    }
    return s;
}
```
The unrolled body can now be replaced with SIMD vector operations:
```cpp
float sum_neon(float* array, size_t size)
{
    size_t i = 0;
    float32x4_t out_chunk = vdupq_n_f32(0.0f);  // four accumulators, all zero
    for (; i + 4 <= size; i += 4) {
        float32x4_t chunk = vld1q_f32(array + i);  // load 4 floats
        out_chunk = vaddq_f32(out_chunk, chunk);   // vector add
    }
    // horizontal sum of the four lanes
    float x = out_chunk[0] + out_chunk[1] + out_chunk[2] + out_chunk[3];
    for (; i < size; ++i) {  // accumulate the remaining elements
        x += array[i];
    }
    return x;
}
```
Where:
- `vld1q_f32(array + i)` loads four floats from memory into `chunk`
- `vaddq_f32` performs the vector addition
- finally, a `for` loop accumulates the remaining elements
1.3 Performance comparison
| | Mac M1 | Android Honor 50 |
| --- | --- | --- |
| baseline | 16 ns | 3167 us |
| neon | 16 ns | 2445 us |
On Android, NEON improves performance by about 23%; on the Mac M1 there is no improvement.
2. Left and right channel mixing
Task description: Given the left and right channel data and the volume of each channel (two `float` arrays and two `float` values respectively), mix the two channels and output the mixed data.
2.1 baseline
```cpp
void mix(float *left, float left_volume,
         float *right, float right_volume,
         float *output, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        output[i] = left[i] * left_volume + right[i] * right_volume;
    }
}
```
2.2 NEON implementation
Similarly, start with loop unrolling:
```cpp
void mix_expand(float *left, float left_volume,
                float *right, float right_volume,
                float *output, size_t size) {
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {  // stop before overrunning the arrays
        output[i] = left[i] * left_volume + right[i] * right_volume;
        output[i + 1] = left[i + 1] * left_volume + right[i + 1] * right_volume;
        output[i + 2] = left[i + 2] * left_volume + right[i + 2] * right_volume;
        output[i + 3] = left[i + 3] * left_volume + right[i + 3] * right_volume;
    }
    for (; i < size; ++i) {
        output[i] = left[i] * left_volume + right[i] * right_volume;
    }
}
```
From the unrolled loop you can see that three vectors are involved: the left channel data, the right channel data, and the output data:
```cpp
void mix_neon(float *left, float left_volume,
              float *right, float right_volume,
              float *output, size_t size) {
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        float32x4_t left_chunk = vld1q_f32(left + i);
        float32x4_t right_chunk = vld1q_f32(right + i);
        left_chunk = vmulq_n_f32(left_chunk, left_volume);    // scale by volume
        right_chunk = vmulq_n_f32(right_chunk, right_volume);
        float32x4_t output_chunk = vaddq_f32(left_chunk, right_chunk);
        vst1q_f32(output + i, output_chunk);                  // store 4 results
    }
    for (; i < size; ++i) {  // scalar tail
        output[i] = left[i] * left_volume + right[i] * right_volume;
    }
}
```
Where:
- `vld1q_f32` loads the left and right channel data from memory
- `vmulq_n_f32` multiplies a vector by a scalar
- `vaddq_f32` adds the left and right channel vectors
2.3 Performance comparison
| | Mac M1 | Android Honor 50 |
| --- | --- | --- |
| baseline | 136 us | 3329 us |
| neon | 227 us | 5401 us |
In this case, NEON is a performance regression on both the Mac M1 and Android.
3. FIR filter
For the theory behind FIR filters and their SIMD implementation, please refer to Implementing an Efficient FIR Filter with NEON; the details will not be repeated here.
3.1 baseline
```cpp
float* applyFirFilterSingle(FilterInput& input) {
    const auto* x = input.x;
    const auto* c = input.c;
    auto* y = input.y;
    for (auto i = 0u; i < input.outputLength; ++i) {
        y[i] = 0.f;
        for (auto j = 0u; j < input.filterLength; ++j) {
            y[i] += x[i + j] * c[j];
        }
    }
    return y;
}
```
3.2 VIL (vectorized inner loop)
```cpp
float* applyFirFilterInnerLoopVectorizationARM(FilterInput& input) {
    const auto* x = input.x;
    const auto* c = input.c;
    auto* y = input.y;
    for (auto i = 0u; i < input.outputLength; ++i) {
        float32x4_t outChunk = vdupq_n_f32(0.0f);
        for (auto j = 0u; j < input.filterLength; j += 4) {
            float32x4_t xChunk = vld1q_f32(x + i + j);
            float32x4_t cChunk = vld1q_f32(c + j);
            float32x4_t temp = vmulq_f32(xChunk, cChunk);
            outChunk = vaddq_f32(outChunk, temp);
        }
        y[i] = vaddvq_f32(outChunk);  // horizontal sum of the four lanes
    }
    return y;
}
```
3.3 VOL (vectorized outer loop)
```cpp
float* applyFirFilterOuterLoopVectorizationARM(FilterInput& input) {
    const auto* x = input.x;
    const auto* c = input.c;
    auto* y = input.y;
    // Note the increment by 4
    for (auto i = 0u; i < input.outputLength; i += 4) {
        float32x4_t yChunk = vdupq_n_f32(0.0f);
        for (auto j = 0u; j < input.filterLength; ++j) {
            float32x4_t xChunk = vld1q_f32(x + i + j);
            float32x4_t temp = vmulq_n_f32(xChunk, c[j]);
            yChunk = vaddq_f32(yChunk, temp);
        }
        // store to memory
        vst1q_f32(y + i, yChunk);
    }
    return y;
}
```
3.4 VOIL (vectorized outer and inner loops)
```cpp
float* applyFirFilterOuterInnerLoopVectorizationARM(FilterInput& input)
{
    const auto* x = input.x;
    const auto* c = input.c;
    auto* y = input.y;
    const int K = 4;
    std::array<float32x4_t, K> outChunk{};
    for (auto i = 0u; i < input.outputLength; i += K) {
        for (auto k = 0; k < K; ++k) {
            outChunk[k] = vdupq_n_f32(0.0f);
        }
        for (auto j = 0u; j < input.filterLength; j += 4) {
            float32x4_t cChunk = vld1q_f32(c + j);
            for (auto k = 0; k < K; ++k) {
                float32x4_t xChunk = vld1q_f32(x + i + j + k);
                float32x4_t temp = vmulq_f32(cChunk, xChunk);
                outChunk[k] = vaddq_f32(temp, outChunk[k]);
            }
        }
        for (auto k = 0; k < K; ++k) {
            y[i + k] = vaddvq_f32(outChunk[k]);
        }
    }
    return y;
}
```
3.5 Performance comparison
| | Mac M1 | Android Honor 50 |
| --- | --- | --- |
| baseline | 10420 us | 51119 us |
| VIL | 2297 us | 55947 us |
| VOL | 2524 us | 54134 us |
| VOIL | 689 us | 69341 us |
On the Mac M1, SIMD achieves a good speedup, but on Android every variant is a regression.
Summary
Because modern compilers are so powerful, simple tasks can often be automatically recognized and vectorized by the compiler, so SIMD optimization techniques must be applied with care. We must establish a baseline before optimizing, verify that the algorithm's output after optimization is consistent with the original, and compare performance against the baseline to confirm the optimization is actually positive. I had also picked a few NEON-implemented algorithms from WebRTC, but after testing they were still regressions, so I will not include them here.