NEON Intrinsics Practice Questions

Foreword

I have already published several articles introducing SIMD and NEON. If you have read them, you should have a reasonable understanding of NEON by now. So far, though, we have mostly stayed at the theoretical level: introducing the NEON API and giving a few simple examples.

Today we will work through some exercises. These are tasks you may well run into in real development, and they are simple enough to serve as introductory NEON teaching examples. We will show how to use NEON to optimize existing code, and use a benchmark to measure the performance difference before and after optimization.

Regrettably, I had assumed that mastering SIMD would let you double an algorithm's performance, but after actually testing I found that the compiler is simply too smart. For some simple tasks, the code the compiler optimizes on its own is faster than hand-written SIMD. To be honest, this was a little frustrating and dampened my motivation to learn SIMD, but it is also reassuring that, as a programmer, I can safely hand part of this work over to the compiler instead of reinventing the wheel.

Originally, the flow I expected for this post was:

  1. Pose a problem and give a baseline with a basic implementation
  2. Optimize it with NEON
  3. Wow, performance has doubled after optimization, SIMD is awesome!

But in practice, the optimized versions were mostly slower (a negative optimization, sigh), and different compilers behave differently: the same code improves performance on platform A but regresses on platform B. This means the "optimization" work ends up tied to the compiler version and operating system, and things become more and more complicated, well beyond the code level. Of course, the reason may simply be that my example tasks are too easy, so easy that a smart compiler can see through them at a glance.

In any case, I decided to write up the whole process for your reference. The following examples run on the author's Mac M1 and an Android Honor 50. You can find all the code at neon_intrinsics_exercise - github.

1. General process

When you want to optimize the performance of an algorithm, first consider whether there is a better implementation, for example replacing bubble sort with quick sort, which brings the algorithmic complexity down from $O(N^2)$ to $O(N \log N)$.

Once the algorithm implementation is fixed and there is no better approach, it is time to consider SIMD for performance optimization. Suppose there is an algorithm A to be optimized; the overall process is roughly as follows:

  1. Perform profiling on the A algorithm, and use the result as the baseline for optimization
  2. Perform SIMD optimization on the A algorithm to obtain the optimized algorithm A_SIMD
  3. Compare A_SIMD with the output of the A algorithm to ensure that the A_SIMD results are consistent with the results before optimization
  4. Perform profiling on A_SIMD and compare it with the baseline to ensure positive optimization

All in all, we must not only ensure that the optimized results are correct, but also ensure that the performance has indeed been improved.
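
As a rough illustration of steps 1 and 4, here is a minimal sketch of how a baseline and its NEON variant could be compared with Google Benchmark (assuming that is the benchmarking library in use). The data size, the sum/sum_neon prototypes and the tolerance in the consistency check are assumptions made only for this sketch.

#include <benchmark/benchmark.h>
#include <cassert>
#include <cmath>
#include <vector>

// Assumed prototypes of the baseline and NEON versions from section "1. Vector sum" below.
float sum(float* array, size_t size);
float sum_neon(float* array, size_t size);

static std::vector<float> data(1 << 16, 1.0f);  // arbitrary test input

static void BM_sum_baseline(benchmark::State& state) {
    for (auto _ : state)
        benchmark::DoNotOptimize(sum(data.data(), data.size()));
}
BENCHMARK(BM_sum_baseline);

static void BM_sum_neon(benchmark::State& state) {
    // Step 3: make sure the optimized output matches the baseline before timing it.
    assert(std::fabs(sum(data.data(), data.size()) - sum_neon(data.data(), data.size())) < 1e-3f);
    for (auto _ : state)
        benchmark::DoNotOptimize(sum_neon(data.data(), data.size()));
}
BENCHMARK(BM_sum_neon);

BENCHMARK_MAIN();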

2. Some examples

1. Vector sum

Task description: Implement a function that uses the NEON instruction set to sum all the numbers in an array and return the result.

1.1 baseline

This task is very simple. You may already have a NEON implementation in mind, but hold on; one bite at a time. Let's start with the simplest possible C/C++ implementation, which has nothing to do with NEON. The code is as follows:

float sum(float* array, size_t size)
{
    float s = 0.0f;
    for (int i = 0; i < size; ++i) {
        s += array[i];
    }
    return s;
}

1.2 NEON implementation

First, unroll the loop in the baseline code:

float sum_expand(float* array, size_t size)
{
    float s = 0.0f;
    int i = 0;
    // process 4 elements per iteration; stop before running past the end
    for (; i + 4 <= size; i += 4) {
        s += array[i];
        s += array[i + 1];
        s += array[i + 2];
        s += array[i + 3];
    }

    // handle the remaining (size % 4) elements
    for (; i < size; ++i) {
        s += array[i];
    }
    return s;
}

The unrolled part of the loop can then be replaced with SIMD vector operations:

#include <arm_neon.h>

float sum_neon(float* array, size_t size)
{
    int i = 0;
    float32x4_t out_chunk{0.0f, 0.0f, 0.0f, 0.0f};
    for (; i + 4 <= size; i += 4) {
        float32x4_t chunk = vld1q_f32(array + i);   // load 4 floats into a vector register
        out_chunk = vaddq_f32(out_chunk, chunk);    // accumulate 4 lanes in parallel
    }

    // reduce the 4 lanes to a single scalar
    float x = out_chunk[0] + out_chunk[1] + out_chunk[2] + out_chunk[3];
    for (; i < size; ++i) {
        x += array[i];
    }

    return x;
}

Where:

  1. vld1q_f32(array + i) loads 4 floats from memory into chunk
  2. vaddq_f32 performs lane-wise vector addition
  3. Finally, a scalar for loop accumulates the remaining elements
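
On AArch64 there is also a dedicated horizontal-add intrinsic, vaddvq_f32 (it appears again in the FIR examples later), which can replace the manual lane-by-lane reduction. A minimal sketch of the end of sum_neon using it, under the same assumptions as above:

    // AArch64 only: sum the 4 lanes of out_chunk into a single float
    float x = vaddvq_f32(out_chunk);
    for (; i < size; ++i) {
        x += array[i];   // tail elements, as before
    }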

1.3 Performance comparison

|          | Mac M1 | Android Honor 50 |
| -------- | ------ | ---------------- |
| baseline | 16 ns  | 3167 us          |
| neon     | 16 ns  | 2445 us          |

On Android, performance improves by about 23%; on the Mac M1 there is no improvement.

2. Left and right channel mixing

Task description: You are given the left- and right-channel data and the volume of each channel, i.e. two float arrays and two float values. Mix the left and right channels and output the mixed data.

2.1 baseline

void mix(float *left, float left_volume,
         float *right, float right_volume,
         float *output, size_t size) {
    for (int i = 0; i < size; ++i) {
        output[i] = left[i] * left_volume + right[i] * right_volume;
    }
}

2.2 NEON implementation

Similarly, first unroll the loop:

void mix_expand(float *left, float left_volume,
                float *right, float right_volume,
                float *output, size_t size) {
    int i = 0;
    // process 4 samples per iteration; stop before running past the end
    for (; i + 4 <= size; i += 4) {
        output[i] = left[i] * left_volume + right[i] * right_volume;
        output[i + 1] = left[i + 1] * left_volume + right[i + 1] * right_volume;
        output[i + 2] = left[i + 2] * left_volume + right[i + 2] * right_volume;
        output[i + 3] = left[i + 3] * left_volume + right[i + 3] * right_volume;
    }

    for (; i < size; ++i) {
        output[i] = left[i] * left_volume + right[i] * right_volume;
    }
}

From the unrolled loop you can see that three vectors are involved: the left-channel data, the right-channel data, and the output data:

void mix_neon(float *left, float left_volume,
              float *right, float right_volume,
              float *output, size_t size) {
    int i = 0;
    for (; i + 4 <= size; i += 4) {
        float32x4_t left_chunk = vld1q_f32(left + i);
        float32x4_t right_chunk = vld1q_f32(right + i);

        left_chunk = vmulq_n_f32(left_chunk, left_volume);     // scale left channel
        right_chunk = vmulq_n_f32(right_chunk, right_volume);  // scale right channel

        float32x4_t output_chunk = vaddq_f32(left_chunk, right_chunk);
        vst1q_f32(output + i, output_chunk);                    // store 4 mixed samples
    }

    for (; i < size; ++i) {
        output[i] = left[i] * left_volume + right[i] * right_volume;
    }
}

Where:

  1. vld1q_f32 loads the left- and right-channel data from memory
  2. vmulq_n_f32 multiplies a vector by a scalar constant
  3. vaddq_f32 adds the left and right channel vectors lane by lane
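
As a side note, the multiply and the add in the loop body could also be fused with vmlaq_n_f32, a multiply-accumulate with a scalar operand. A minimal sketch of the loop body under the same assumptions; whether this actually helps again depends on the compiler and the target:

    float32x4_t left_chunk = vld1q_f32(left + i);
    float32x4_t right_chunk = vld1q_f32(right + i);
    // output_chunk = left * left_volume, then += right * right_volume in one intrinsic
    float32x4_t output_chunk = vmulq_n_f32(left_chunk, left_volume);
    output_chunk = vmlaq_n_f32(output_chunk, right_chunk, right_volume);
    vst1q_f32(output + i, output_chunk);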

2.3 Performance comparison

|          | Mac M1 | Android Honor 50 |
| -------- | ------ | ---------------- |
| baseline | 136 us | 3329 us          |
| neon     | 227 us | 5401 us          |

In this case, the result is a negative optimization on both the M1 and Android.

3. FIR filter

For the details of implementing an FIR filter with SIMD, please refer to Implementing Efficient FIR Filter with NEON; they will not be repeated here.
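
The snippets below all take a FilterInput argument whose definition is not shown in this post. As a rough sketch (an assumption based on the referenced article, not the actual definition), it looks something like this, with the signal and coefficients padded so that the vectorized loops can safely read blocks of 4:

// Hypothetical sketch of FilterInput; see the linked article/repository for the real definition.
struct FilterInput {
    const float* x;       // input signal, padded at the end
    const float* c;       // filter coefficients (reversed), padded to a multiple of 4
    float* y;             // output buffer
    size_t outputLength;  // number of output samples
    size_t filterLength;  // number of coefficients after padding
};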

3.1 baseline

float* applyFirFilterSingle(FilterInput& input) {
    const auto* x = input.x;
    const auto* c = input.c;
    auto* y = input.y;

    for (auto i = 0u; i < input.outputLength; ++i) {
        y[i] = 0.f;
        for (auto j = 0u; j < input.filterLength; ++j) {
            y[i] += x[i + j] * c[j];
        }
    }
    return y;
}

3.2 VIL (inner loop vectorization)

float* applyFirFilterInnerLoopVectorizationARM(FilterInput& input) {
    const auto* x = input.x;
    const auto* c = input.c;
    auto* y = input.y;

    for (auto i = 0u; i < input.outputLength; ++i) {
        y[i] = 0.f;
        float32x4_t outChunk = vdupq_n_f32(0.0f);
        // filterLength is assumed to be padded to a multiple of 4
        for (auto j = 0u; j < input.filterLength; j += 4) {
            float32x4_t xChunk = vld1q_f32(x + i + j);
            float32x4_t cChunk = vld1q_f32(c + j);
            float32x4_t temp = vmulq_f32(xChunk, cChunk);
            outChunk = vaddq_f32(outChunk, temp);
        }
        y[i] = vaddvq_f32(outChunk);  // horizontal add of the 4 lanes
    }
    return y;
}
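
On AArch64 the separate vmulq_f32/vaddq_f32 pair in the inner loop could also be written as a single fused multiply-add with vfmaq_f32. A minimal sketch of just that loop body, using the same variables as above; note that a fused multiply-add may round slightly differently from the separate operations:

    float32x4_t xChunk = vld1q_f32(x + i + j);
    float32x4_t cChunk = vld1q_f32(c + j);
    // outChunk += xChunk * cChunk in one fused multiply-add
    outChunk = vfmaq_f32(outChunk, xChunk, cChunk);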

3.3 VOL (outer loop vectorization)

float* applyFirFilterOuterLoopVectorizationARM(FilterInput& input) {
    const auto* x = input.x;
    const auto* c = input.c;
    auto* y = input.y;

    // Note the increment by 4: four output samples are computed per iteration
    for (auto i = 0u; i < input.outputLength; i += 4) {
        float32x4_t yChunk{0.0f, 0.0f, 0.0f, 0.0f};
        for (auto j = 0u; j < input.filterLength; ++j) {
            float32x4_t xChunk = vld1q_f32(x + i + j);
            float32x4_t temp = vmulq_n_f32(xChunk, c[j]);
            yChunk = vaddq_f32(yChunk, temp);
        }
        // store 4 results back to memory
        vst1q_f32(y + i, yChunk);
    }
    return y;
}

3.4 VOIL (outer and inner loop vectorization)

float* applyFirFilterOuterInnerLoopVectorizationARM(FilterInput& input)
{
    const auto* x = input.x;
    const auto* c = input.c;
    auto* y = input.y;

    const int K = 4;
    std::array<float32x4_t, K> outChunk{};

    for (auto i = 0u; i < input.outputLength; i += K) {
        for (auto k = 0; k < K; ++k) {
            outChunk[k] = vdupq_n_f32(0.0f);
        }

        for (auto j = 0u; j < input.filterLength; j += 4) {
            float32x4_t cChunk = vld1q_f32(c + j);

            for (auto k = 0; k < K; ++k) {
                float32x4_t xChunk = vld1q_f32(x + i + j + k);
                float32x4_t temp = vmulq_f32(cChunk, xChunk);
                outChunk[k] = vaddq_f32(temp, outChunk[k]);
            }
        }

        for (auto k = 0; k < K; ++k) {
            y[i + k] = vaddvq_f32(outChunk[k]);
        }
    }

    return input.y;
}

3.5 Performance comparison

|          | Mac M1   | Android Honor 50 |
| -------- | -------- | ---------------- |
| baseline | 10420 us | 51119 us         |
| VIL      | 2297 us  | 55947 us         |
| VOL      | 2524 us  | 54134 us         |
| VOIL     | 689 us   | 69341 us         |

SIMD achieves a good speed-up on the Mac M1, but on Android all three variants are a negative optimization.


Summary

Because modern compilers are so capable, many uncomplicated tasks are automatically recognized and vectorized by the compiler, so SIMD optimization must be applied with care. Before optimizing, establish a baseline; after optimizing, make sure the algorithm's output is consistent with the original, and compare the performance against the baseline to confirm the optimization is actually positive. I also tried a few NEON-implemented algorithms taken from WebRTC, but they too turned out to be negative optimizations in my tests, so I will not include them here.

Origin blog.csdn.net/weiwei9363/article/details/128203136