Efficient FIR Filter Implementation with NEON

Preface

Most of this article is translated from "Efficient FIR Filter Implementation with SIMD". The original article implements its SIMD code with AVX; this article uses NEON instead. For an introduction to NEON, please refer to the concise tutorial on Neon intrinsics.

Introduction

How to make your FIR filter run faster in the time domain?

FIR filters are a cornerstone of digital signal processing. They are especially important when applying reverb to audio signals, for example in virtual reality audio or in VST plug-ins for digital audio workstations. They are also widely used in sound applications on mobile phones (even pre-smartphone ones!) and embedded devices.

How to make FIR run faster?

FIR filter

An FIR filter is defined by its finite impulse response $h[n]$. The output $y[n]$ of the filter is the convolution of the input signal $x[n]$ with the impulse response. We write it as:
$$y[n] = x[n] * h[n] = \sum_{k=-\infty}^{\infty} x[k] h[n-k]$$

For background on convolution, you can refer to my previous article "[Audio Processing] Fast Convolution Algorithm Introduction" or to "Convolution: The secret behind filtering".

Optimization

If you want any software to run as fast as possible, there are two ways to get there:

  1. Choose the best algorithm.
  2. Implement it efficiently.

The same principle applies to DSP code. To have a fast finite impulse response (FIR) filter in code, you can either:

  1. Use an algorithm with lower time complexity, such as fast convolution, which moves the computation to the frequency domain via the Fourier transform, or
  2. Take advantage of hardware and software resources to implement time-domain convolution efficiently. This usually means vectorizing your code using SIMD instructions.

When should time domain filters be used?

You may be asking yourself:

Since we have a fast convolution algorithm, why do we need a time-domain implementation?

The time complexity of fast convolution is $O(N \log N)$, where $N$ may be the length of the input signal or of the filter. The time complexity of linear convolution is $O(N^2)$.

First, this means that for sufficiently small $N$, the linear convolution algorithm will be faster than the fast convolution algorithm.

Second, fast convolution operates on the domain of complex numbers, while linear convolution always uses real numbers. This effectively means that fast convolution needs to process twice as much data as linear convolution.

If it surprises you that time-domain convolution can be faster than fast convolution, think of sorting algorithms. Quicksort is generally considered one of the fastest general-purpose sorting algorithms. But if you look at an implementation of a sort routine such as std::sort, you will typically find that it uses quicksort only at first. Once the container has been divided into small enough parts, the elements in each part are sorted with insertion sort.

After explaining this, we can investigate how to efficiently implement FIR filtering in the time domain.

How to speed up the implementation of FIR filter?

In short: via SIMD.

The best way to speed up filter execution is to use SIMD instructions to process many samples at once. To achieve this, we need to rewrite the linear convolution algorithm so that our code operates on vectors.

This process is called loop vectorization .

Loop vectorization is often done automatically by the compiler, but this level of automatic vectorization is usually not sufficient for real-time DSP. Instead, we need to tell the compiler exactly what to do.

SIMD instructions perform best when operating on aligned data. Therefore, data alignment is another factor we should consider.
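As a rough illustration of what "aligned" means here, the sketch below checks whether an address is a multiple of 16 bytes, which is the size of a register holding 4 floats. The helper `isAligned`, the struct `AlignedBlock`, and the choice of 16 bytes are assumptions made for this example; they do not appear in the original code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `ptr` is a multiple of `alignment` bytes.
inline bool isAligned(const void* ptr, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// alignas asks the compiler to place the buffer on a 16-byte boundary,
// so every 4-float chunk starting at index 0, 4, 8, ... is aligned too.
struct AlignedBlock {
  alignas(16) float samples[8];
};
```

With such a buffer, `samples` and `samples + 4` are aligned to 16 bytes, while `samples + 1` is not. This is why a filter that reads windows starting at arbitrary sample offsets cannot use aligned loads without extra work.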

In summary, an efficient FIR filter implementation uses two strategies:

  1. Loop vectorization
  2. Data alignment

We will now discuss these two strategies in detail.

Preliminary Assumptions

The linear convolution formula is as follows:
$$x[n] * h[n] = \sum_{k=-\infty}^{\infty} x[k] h[n-k] = y[n], \quad n \in \mathbb{Z}. \tag{1}$$

As you can probably guess, infinitely long signals are impractical. Also, reversing the time index of the signal $h[n]$ in code is rather awkward. Therefore, we will make some assumptions; they will not, however, reduce the generality of our discussion.

Finite-Length Signals

We will assume that our signals are finite. This is certainly true of $h[n]$, but not necessarily of $x[n]$.

We denote the length of $x[n]$ by $N_x$ and the length of $h[n]$ by $N_h$.

We also assume that $N_x > N_h$.

Outside their index ranges the signals are zero; for example, $x[n] = 0$ for $n < 0$ or $n \geq N_x$.

Reversed Filter Coefficients

In practical real-time audio scenarios, such as virtual reality, computer games, or digital audio workstations, we know $h[n]$ in advance but not $x[n]$.

Therefore, we can reverse $h[n]$ ahead of time. We define a signal $c$ of length $N_h$ as:
$$c[n] = h[N_h - n - 1], \quad n = 0, \ldots, N_h - 1. \tag{2}$$
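A minimal sketch of Equation (2), assuming the impulse response is stored in a `std::vector<float>`. The helper name `reverseCoefficients` is hypothetical and is not part of the original code.

```cpp
#include <algorithm>
#include <vector>

// Builds c[n] = h[N_h - n - 1] (Equation 2) from the impulse response h.
inline std::vector<float> reverseCoefficients(const std::vector<float>& h) {
  std::vector<float> c(h.size());
  std::reverse_copy(h.begin(), h.end(), c.begin());
  return c;
}
```

Because $h[n]$ is known in advance, this reversal can be done once at setup time rather than for every block of audio.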

The actual convolution formula

After introducing the above two assumptions, we can rewrite formula (1) as:
$$y[n] = (x[n] * h[n])[n] = \sum_{k=0}^{N_h-1} x[n+k]\, c[k], \quad n = 0, \ldots, N_x - 1. \tag{3}$$

This formula looks very similar to the correlation formula, but keep in mind that it is still a convolution, albeit written differently.

In this new formulation, each convolution output is simply the inner product of the vector $c$ and the window $x[n], \ldots, x[n + N_h - 1]$ of the input.
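The inner-product view can be sketched with the standard library as follows. The function name `convolutionOutput` and its parameter layout are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <numeric>

// Computes one output sample y[n] of Equation (3) as the inner product of
// the reversed coefficients c and the window x[n], ..., x[n + N_h - 1].
// The caller must guarantee that x holds at least n + filterLength samples.
inline float convolutionOutput(const float* x, const float* c,
                               std::size_t filterLength, std::size_t n) {
  return std::inner_product(c, c + filterLength, x + n, 0.f);
}
```

For example, with x = {1, 2, 3, 4, 0} and c = {0.5, 1}, the first output is 1 * 0.5 + 2 * 1 = 2.5.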

Note that the convolution in formula (3) uses the "same" mode. If we append $N_h - 1$ zeros to $x$, it becomes the "full" mode.

As you will see, the full mode will greatly simplify our discussion.

Visualization of the convolution process

The figure below illustrates how the convolution is calculated.
[Figure: computing one convolution output]
The orange box marks the elements that are multiplied together; the products are then summed to obtain $y[0]$.

With the above assumptions and convolution format, we can write its implementation.

The simplest implementation of linear convolution

Before we speed up FIR filters with SIMD, we need to start with a baseline: a non-SIMD implementation.

This can be achieved by:

#include <cstddef>

struct FilterInput {
  // assume that these fields are correctly initialized
  const float* x;            // input signal with (N_h-1) zeros appended
  std::size_t inputLength;   // N_x
  const float* c;            // reversed filter coefficients
  std::size_t filterLength;  // N_h
  float* y;                  // output (filtered) signal;
                             // pointer to preallocated, uninitialized memory
  std::size_t outputLength;  // should be N_x in our context
};

float* applyFirFilterSingle(FilterInput& input) {
  const auto* x = input.x;
  const auto* c = input.c;
  auto* y = input.y;

  for (auto i = 0u; i < input.outputLength; ++i) {
    y[i] = 0.f;
    // inner product of c and the window x[i], ..., x[i + N_h - 1]
    for (auto j = 0u; j < input.filterLength; ++j) {
      y[i] += x[i + j] * c[j];
    }
  }
  return y;
}

As you can see, this code is not very efficient; we iterate over samples one by one.

Its time complexity is $O(N_h N_x)$. Let's see how to vectorize this code.

Loop Vectorization

In the FIR filter scenario, there are 3 types of loop vectorization methods:

  1. Inner loop vectorization (VIL),
  2. Outer loop vectorization (VOL),
  3. Outer and inner loop vectorization (VOIL).

Their names indicate the loop in which we load data into the SIMD registers. The easiest one to understand is VIL.

Inner loop vectorization (VIL)

In VIL, we will vectorize operations in the inner loop.
We start by rewriting the previous code in a rough, vector-like form. Let's assume our vector length is 4, which corresponds to a register that can hold 4 floats (such as ARM's 128-bit NEON registers).

float* applyFirFilterInnerLoopVectorization(
    FilterInput& input) {
  const auto* x = input.x;
  const auto* c = input.c;
  auto* y = input.y;

  for (auto i = 0u; i < input.outputLength; ++i) {
    y[i] = 0.f;
    // Note the increment by 4
    for (auto j = 0u; j < input.filterLength; j += 4) {
      y[i] += x[i + j] * c[j] +
              x[i + j + 1] * c[j + 1] +
              x[i + j + 2] * c[j + 2] +
              x[i + j + 3] * c[j + 3];
    }
  }
  return y;
}

In the above code, on each iteration of the inner loop, we do the inner product of two 4-element vectors. In this way, we compute part of the convolution sum in Equation (3).

Note that we assume that the vector passed in is already zero-padded and has a length that is a multiple of 4.
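One possible way to satisfy this assumption is sketched below. The helper names `roundUpToMultipleOf4`, `padCoefficients`, and `padSignal` are hypothetical, and the sketch assumes a non-empty filter stored in `std::vector<float>`.

```cpp
#include <cstddef>
#include <vector>

// Rounds `length` up to the next multiple of 4.
inline std::size_t roundUpToMultipleOf4(std::size_t length) {
  return (length + 3) / 4 * 4;
}

// Zero-pads the reversed coefficients so that their count is a multiple of 4.
// The extra zero taps add nothing to the sum in Equation (3).
inline std::vector<float> padCoefficients(std::vector<float> c) {
  c.resize(roundUpToMultipleOf4(c.size()), 0.f);
  return c;
}

// Pads the signal to a multiple of 4 samples and then appends
// (paddedFilterLength - 1) zeros, which keeps every x[i + j] read in bounds.
// Precondition: paddedFilterLength >= 1.
inline std::vector<float> padSignal(std::vector<float> x,
                                    std::size_t paddedFilterLength) {
  const std::size_t paddedInputLength = roundUpToMultipleOf4(x.size());
  x.resize(paddedInputLength + paddedFilterLength - 1, 0.f);
  return x;
}
```

For a 5-sample signal and a 3-tap filter, this yields 4 coefficients and a buffer of 8 + 4 - 1 = 11 samples.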

The figure below shows the situation of VIL.

[Figure: inner loop vectorization (VIL)]
Of course, this code is no more optimized than the previous code; it merely rewrites it in vector form. But this vector form is now easy to implement with vector instructions.

How would this implementation look in real SIMD code?

VIL NEON Implementation

Below I use NEON intrinsic functions to implement VIL. For NEON-related tutorials, please refer to the Neon intrinsics brief tutorial.

#ifdef __aarch64__
#include <arm_neon.h>
#include <cassert>

float* applyFirFilterInnerLoopVectorizationARM(FilterInput& input) {
  const auto* x = input.x;
  const auto* c = input.c;
  auto* y = input.y;
  assert(input.inputLength % 4 == 0);
  assert(input.filterLength % 4 == 0);

  for (auto i = 0u; i < input.outputLength; ++i) {
    // 4 partial sums, one per lane
    float32x4_t outChunk = vdupq_n_f32(0.0f);
    for (auto j = 0u; j < input.filterLength; j += 4) {
      float32x4_t xChunk = vld1q_f32(x + i + j);
      float32x4_t cChunk = vld1q_f32(c + j);
      float32x4_t temp = vmulq_f32(xChunk, cChunk);
      outChunk = vaddq_f32(outChunk, temp);
    }
    // horizontal sum of the 4 lanes (AArch64 only)
    y[i] = vaddvq_f32(outChunk);
  }
  return y;
}
#endif

Again, we assume that the vector passed in is already zero-padded and has a length that is a multiple of 4.

Its time complexity is $O(N_x N_h / 4)$. Of course, in complexity theory, this is the same as for the non-vectorized algorithm. But notice that in the inner loop, we have reduced the number of iterations by a factor of 4. This is because we can operate on a vector of 4 floating point numbers with a single NEON instruction.
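As a possible refinement of the inner loop above, the separate multiply and add can be fused into one instruction. This is a minimal sketch, assuming an AArch64 target where the `vfmaq_f32` intrinsic is available. The function name `dotProductFma` is hypothetical, and the scalar branch exists only so that the example also compiles on other platforms.

```cpp
#include <cstddef>
#ifdef __aarch64__
#include <arm_neon.h>
#endif

// Inner product of two float arrays whose length is a multiple of 4.
inline float dotProductFma(const float* a, const float* b,
                           std::size_t length) {
#ifdef __aarch64__
  float32x4_t acc = vdupq_n_f32(0.0f);
  for (std::size_t j = 0; j < length; j += 4) {
    // acc += a[j..j+3] * b[j..j+3], one fused multiply-add per lane
    acc = vfmaq_f32(acc, vld1q_f32(a + j), vld1q_f32(b + j));
  }
  return vaddvq_f32(acc);  // horizontal sum of the 4 lanes
#else
  // Portable fallback so the sketch also builds without NEON.
  float sum = 0.f;
  for (std::size_t j = 0; j < length; ++j) {
    sum += a[j] * b[j];
  }
  return sum;
#endif
}
```

Whether this is measurably faster than the separate `vmulq_f32` and `vaddq_f32` calls depends on the compiler and the processor, so it is worth measuring on the target device.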

Outer Loop Vectorization (VOL)

VOL is a more radical approach. Here we compute a whole vector of outputs at once in the outer loop. As before, we first give a rough scalar implementation of VOL:

float* applyFirFilterOuterLoopVectorization(
    FilterInput& input) {
  const auto* x = input.x;
  const auto* c = input.c;
  auto* y = input.y;

  // Note the increment by 4
  for (auto i = 0u; i < input.outputLength; i += 4) {
    y[i] = 0.f;
    y[i + 1] = 0.f;
    y[i + 2] = 0.f;
    y[i + 3] = 0.f;
    for (auto j = 0u; j < input.filterLength; ++j) {
      y[i] += x[i + j] * c[j];
      y[i + 1] += x[i + j + 1] * c[j];
      y[i + 2] += x[i + j + 2] * c[j];
      y[i + 3] += x[i + j + 3] * c[j];
    }
  }
  return y;
}

Again, we assume that the vector passed in is already zero-padded and has a length that is a multiple of 4.
The figure below shows the case of VOL.
[Figure: outer loop vectorization (VOL)]
Instead of computing 4 terms of a single convolution sum in each inner-loop iteration (as in VIL), we compute one term of each of 4 convolution sums. In VIL we reduced the number of inner-loop iterations by a factor of 4; in VOL we reduce the number of outer-loop iterations by a factor of 4.

Therefore, VOL on its own is no more optimized than VIL.

This code is now easily implemented with SIMD instructions.

VOL NEON Implementation

The following code illustrates how to use NEON to implement VOL

#include <arm_neon.h>

float* applyFirFilterOuterLoopVectorizationARM(FilterInput& input) {
  const auto* x = input.x;
  const auto* c = input.c;
  auto* y = input.y;

  // Note the increment by 4
  for (auto i = 0u; i < input.outputLength; i += 4) {
    // accumulators for y[i], ..., y[i + 3]
    float32x4_t yChunk = vdupq_n_f32(0.0f);
    for (auto j = 0u; j < input.filterLength; ++j) {
      float32x4_t xChunk = vld1q_f32(x + i + j);
      // multiply all 4 lanes by the scalar c[j]
      float32x4_t temp = vmulq_n_f32(xChunk, c[j]);
      yChunk = vaddq_f32(yChunk, temp);
    }
    // store to memory
    vst1q_f32(y + i, yChunk);
  }
  return y;
}

In theory, this code should be up to 4 times faster than the original code. In practice, the speedup will be smaller because of the additional overhead.
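To see the actual speedup on a given machine, the variants can be timed. The sketch below is one simple way to do that with `std::chrono`; the helper name `averageMicroseconds` is an assumption made for this example, and a careful benchmark would also need warm-up runs and a guard against the compiler optimizing the work away.

```cpp
#include <chrono>

// Runs `work` `repetitions` times and returns the mean duration of one run
// in microseconds. `repetitions` must be positive.
template <typename Work>
double averageMicroseconds(Work&& work, int repetitions) {
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < repetitions; ++r) {
    work();
  }
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::micro> elapsed = stop - start;
  return elapsed.count() / repetitions;
}
```

A call could look like `averageMicroseconds([&] { applyFirFilterSingle(input); }, 100)`, repeated for each variant with the same `input`.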

A question arises: can we do better? Yes we can!

Outer and Inner Loop vectorization (VOIL)

The real breakthrough comes when we combine the two types of vectorization.

In this method, we calculate a vector in the outer loop and use the vector inner product in the inner loop. The rough code is as follows:

float* applyFirFilterOuterInnerLoopVectorization(
    FilterInput& input) {
  const auto* x = input.x;
  const auto* c = input.c;
  auto* y = input.y;

  // Note the increment
  for (auto i = 0u; i < input.outputLength; i += 4) {
    y[i] = 0.f;
    y[i + 1] = 0.f;
    y[i + 2] = 0.f;
    y[i + 3] = 0.f;

    // Note the increment
    for (auto j = 0u; j < input.filterLength; j += 4) {
      y[i] += x[i + j] * c[j] +
              x[i + j + 1] * c[j + 1] +
              x[i + j + 2] * c[j + 2] +
              x[i + j + 3] * c[j + 3];

      y[i + 1] += x[i + j + 1] * c[j] +
                  x[i + j + 2] * c[j + 1] +
                  x[i + j + 3] * c[j + 2] +
                  x[i + j + 4] * c[j + 3];

      y[i + 2] += x[i + j + 2] * c[j] +
                  x[i + j + 3] * c[j + 1] +
                  x[i + j + 4] * c[j + 2] +
                  x[i + j + 5] * c[j + 3];

      y[i + 3] += x[i + j + 3] * c[j] +
                  x[i + j + 4] * c[j + 1] +
                  x[i + j + 5] * c[j + 2] +
                  x[i + j + 6] * c[j + 3];
    }
  }
  return y;
}

Why is VOIL better?

You might be saying to yourself "okay, but it's just a manually unrolled loop! Why is it faster?".

This is because the VOIL code keeps several SIMD registers in use at the same time, which gives the processor more room to execute the instructions quickly. This is in contrast to the VIL or VOL case, where only one or two registers are used.

When I say "at the same time", I don't mean multithreading. I mean that several registers hold live values at once. This allows the processor to work most efficiently.
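The idea of keeping several registers busy can be illustrated even without SIMD. The scalar sketch below is an analogy under stated assumptions, not the original author's code: it sums products into four independent accumulators so that no addition has to wait for the previous one. The function name `dotProductFourAccumulators` is hypothetical.

```cpp
#include <cstddef>

// Sums products with four independent accumulators. Each accumulator forms
// its own dependency chain, so the processor can overlap their additions.
// `length` must be a multiple of 4.
inline float dotProductFourAccumulators(const float* a, const float* b,
                                        std::size_t length) {
  float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
  for (std::size_t j = 0; j < length; j += 4) {
    acc0 += a[j] * b[j];
    acc1 += a[j + 1] * b[j + 1];
    acc2 += a[j + 2] * b[j + 2];
    acc3 += a[j + 3] * b[j + 3];
  }
  // combine the partial sums once, after the loop
  return (acc0 + acc1) + (acc2 + acc3);
}
```

VOIL applies the same principle with four vector accumulators, one per output sample of the current block. How much this helps depends on the processor and on what the compiler already does on its own.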

Data Alignment

Another reason VOIL has such great optimization potential is that we can implement it using aligned load/store SIMD instructions. How? That will be the topic of the next post!

Let's see how to implement VOIL with NEON instructions.

VOIL NEON Implementation

#include <array>
#include <arm_neon.h>

// Note: vaddvq_f32 is only available on AArch64.
float* applyFirFilterOuterInnerLoopVectorizationARM(FilterInput& input) {
  const auto* x = input.x;
  const auto* c = input.c;
  auto* y = input.y;

  // one accumulator register per output sample of the current block
  std::array<float32x4_t, 4> outChunk{};

  for (auto i = 0u; i < input.outputLength; i += 4) {
    for (auto k = 0; k < 4; ++k) {
      outChunk[k] = vdupq_n_f32(0.0f);
    }

    for (auto j = 0u; j < input.filterLength; j += 4) {
      float32x4_t cChunk = vld1q_f32(c + j);

      for (auto k = 0; k < 4; ++k) {
        float32x4_t xChunk = vld1q_f32(x + i + j + k);
        float32x4_t temp = vmulq_f32(cChunk, xChunk);
        outChunk[k] = vaddq_f32(temp, outChunk[k]);
      }
    }

    // the horizontal sum of each accumulator gives one output sample
    for (auto k = 0; k < 4; ++k) {
      y[i + k] = vaddvq_f32(outChunk[k]);
    }
  }

  return y;
}
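Before trusting any vectorized variant, it helps to compare its output with the baseline. The sketch below does this for a scalar 4-by-4 blocked loop that mirrors the VOIL access pattern. It is a self-contained illustration with hypothetical names (`firReference`, `firBlocked`, `maxAbsDifference`) that uses `std::vector` rather than the `FilterInput` struct.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference: one output at a time, as in the baseline implementation.
// x must hold at least outputLength + c.size() - 1 samples.
inline std::vector<float> firReference(const std::vector<float>& x,
                                       const std::vector<float>& c,
                                       std::size_t outputLength) {
  std::vector<float> y(outputLength, 0.f);
  for (std::size_t i = 0; i < outputLength; ++i)
    for (std::size_t j = 0; j < c.size(); ++j)
      y[i] += x[i + j] * c[j];
  return y;
}

// Blocked variant: 4 outputs per outer iteration, 4 taps per inner iteration.
// Assumes outputLength and c.size() are multiples of 4.
inline std::vector<float> firBlocked(const std::vector<float>& x,
                                     const std::vector<float>& c,
                                     std::size_t outputLength) {
  std::vector<float> y(outputLength, 0.f);
  for (std::size_t i = 0; i < outputLength; i += 4)
    for (std::size_t j = 0; j < c.size(); j += 4)
      for (std::size_t k = 0; k < 4; ++k)    // output within the block
        for (std::size_t m = 0; m < 4; ++m)  // tap within the block
          y[i + k] += x[i + j + k + m] * c[j + m];
  return y;
}

// Largest absolute difference between two equally long signals.
inline float maxAbsDifference(const std::vector<float>& a,
                              const std::vector<float>& b) {
  float worst = 0.f;
  for (std::size_t i = 0; i < a.size(); ++i)
    worst = std::fmax(worst, std::fabs(a[i] - b[i]));
  return worst;
}
```

The same comparison can be applied to the NEON functions. Because SIMD code may add the terms in a different order, a small tolerance is a safer check than exact equality.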


Summary

In this post, we discussed what an FIR filter is and how to implement it efficiently: either choose a fast convolution algorithm, or use the single-instruction, multiple-data (SIMD) instructions of modern processors. Of course, you can do both.

We reformulated the convolution sum to make it easier to discuss and implement.

We looked at implementing FIR filters using a technique called loop vectorization. We showed plain C++ implementations of VIL, VOL, and VOIL, discussed their visualizations, and demonstrated their SIMD equivalents using the NEON instruction set. On my computer, VIL and VOL improve performance by about 4 times, and VOIL is twice as fast as VIL/VOL.

Finally, we pointed out that we can do even better with aligned data. This will be discussed in the next article.

Take a look at the helpful references below. The entire code can be found in my GitHub repository.

As always, if you have any questions feel free to post them below.

reference

Origin blog.csdn.net/weiwei9363/article/details/128164364