Summary of common functions of ARM NEON

NEON technology is a 128-bit SIMD (Single Instruction, Multiple Data) architectural extension to the ARM Cortex™-A series processors, designed to provide flexible and powerful acceleration for consumer multimedia applications and significantly improve the user experience. Its register bank consists of thirty-two 64-bit registers, which can also be viewed as sixteen 128-bit registers.
At present, mainstream iPhones and most Android phones support ARM NEON acceleration, so when writing mobile algorithms you can use NEON intrinsics to speed them up. Taking a register holding 4 lanes as an example, the ideal speed-up is about 4x over the scalar version.

NEON instructions perform "packed SIMD" processing:

Registers are regarded as vectors of elements of the same data type
The data type can be signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, or single-precision floating point
Instructions perform the same operation in all lanes

(figure omitted: the same operation applied to each lane of the source registers)

This article mainly introduces the structures and functions related to float32x4_t.
float32x4_t can be understood as vector<float32_t>(4); similarly, typexN_t can be read as vector<type>(N).

In NEON programming, operations on a single data can be extended to operations on registers, that is, a vector of elements of the same type, thus greatly reducing the number of operations.
Here is a small example to explain how to use NEON built-in functions to speed up the implementation of counting the sum of elements in an array.

Taking C++ as an example, the original algorithm is as follows:

#include <iostream>
using namespace std;

float sum_array(float *arr, int len)
{
    if(NULL == arr || len < 1)
    {
        cout<<"input error\n";
        return 0;
    }
    float sum(0.0);
    for(int i=0; i<len; ++i)
    {
        sum += *arr++;
    }
    return sum;
}

For an array of length N, the time complexity of the above algorithm is O(N).
Using NEON intrinsics to accelerate it:

#include <iostream>
#include <arm_neon.h> // required header for NEON intrinsics
using namespace std;

float sum_array(float *arr, int len)
{
    if(NULL == arr || len < 1)
    {
        cout<<"input error\n";
        return 0;
    }

    int dim4 = len >> 2; // number of full groups of 4
    int left4 = len & 3; // remaining elements (len mod 4)
    float32x4_t sum_vec = vdupq_n_f32(0.0); // accumulator register, initialized to 0
    for (; dim4>0; dim4--, arr+=4) // process 4 array elements per iteration
    {
        float32x4_t data_vec = vld1q_f32(arr); // load 4 consecutive elements into a register
        sum_vec = vaddq_f32(sum_vec, data_vec); // ri = ai + bi: lane-wise addition
    }
    // add the four lanes of the accumulator to get the final sum
    float sum = vgetq_lane_f32(sum_vec, 0) + vgetq_lane_f32(sum_vec, 1)
              + vgetq_lane_f32(sum_vec, 2) + vgetq_lane_f32(sum_vec, 3);
    for (; left4>0; left4--, arr++)
        sum += (*arr); // accumulate the remaining (fewer than 4) elements one by one
    return sum;
}

The vector loop executes roughly N/4 iterations, so the loop-iteration count is cut to about a quarter (asymptotically the complexity is still O(N)).
As the example shows, using NEON intrinsics is straightforward: turn sequential processing into batch processing (here, 4 elements at a time).
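The main-loop/tail split used above can be sketched in portable C++ without intrinsics; this is only to illustrate the loop structure, not actual SIMD (the name `sum_array_blocked` is made up for this sketch):

```cpp
#include <cstddef>

// Portable sketch of the main-loop / tail split used above:
// process 4 elements per iteration, then handle the remainder.
float sum_array_blocked(const float *arr, int len)
{
    if (arr == nullptr || len < 1)
        return 0.0f;

    float acc[4] = {0, 0, 0, 0};   // stands in for sum_vec
    int dim4 = len >> 2;           // number of full groups of 4
    int left4 = len & 3;           // leftover elements (0..3)

    for (; dim4 > 0; dim4--, arr += 4)
        for (int i = 0; i < 4; ++i)
            acc[i] += arr[i];      // stands in for vaddq_f32

    float sum = acc[0] + acc[1] + acc[2] + acc[3];
    for (; left4 > 0; left4--, arr++)
        sum += *arr;               // scalar tail
    return sum;
}
```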

The functions used above are:
float32x4_t vdupq_n_f32 (float32_t value)
Duplicates value into all 4 lanes of the returned register

float32x4_t vld1q_f32 (float32_t const * ptr)
Load 4 elements in turn from the array and store them in the register

The corresponding void vst1q_f32 (float32_t * ptr, float32x4_t val)
writes the value in the register to the array

float32x4_t vaddq_f32 (float32x4_t a, float32x4_t b)
returns the sum of the corresponding elements of the two registers r = a+b

Correspondingly, float32x4_t vsubq_f32 (float32x4_t a, float32x4_t b)
returns the lane-wise difference of the two registers, r = a - b

float32_t vgetq_lane_f32 (float32x4_t v, const int lane)
returns the value of the given lane of the register (0 <= lane <= 3)
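The semantics of these intrinsics can be mimicked with a plain array, purely to illustrate the lane-wise behaviour (the names `vec4`, `dup`, `load`, `add`, and `lane` are made up for this sketch):

```cpp
#include <array>

using vec4 = std::array<float, 4>;  // stand-in for float32x4_t (illustration only)

vec4 dup(float v) { return {v, v, v, v}; }                    // like vdupq_n_f32
vec4 load(const float *p) { return {p[0], p[1], p[2], p[3]}; }// like vld1q_f32
vec4 add(vec4 a, vec4 b)                                      // like vaddq_f32
{
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}
float lane(vec4 v, int n) { return v[n]; }                    // like vgetq_lane_f32
```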

Other commonly used functions are:

float32x4_t vmulq_f32 (float32x4_t a, float32x4_t b)
returns the product of the corresponding elements of the two registers r = a*b

float32x4_t vmlaq_f32 (float32x4_t a, float32x4_t b, float32x4_t c)
r = a + b*c (lane-wise multiply-accumulate)
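Lane by lane, the multiply-accumulate behaves as follows (plain C++ illustration of the semantics, not the intrinsic itself; `mla4` is a made-up name):

```cpp
// Lane-wise meaning of vmlaq_f32: r[i] = a[i] + b[i] * c[i]
void mla4(const float a[4], const float b[4], const float c[4], float r[4])
{
    for (int i = 0; i < 4; ++i)
        r[i] = a[i] + b[i] * c[i];
}
```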

float32x4_t vextq_f32 (float32x4_t a, float32x4_t b, const int n)
concatenates the two registers and returns 4 consecutive lanes starting from lane n, where 0 <= n <= 3
e.g.
a: 1 2 3 4
b: 5 6 7 8
vextq_f32(a,b,1) -> r: 2 3 4 5
vextq_f32(a,b,2) -> r: 3 4 5 6
vextq_f32(a,b,3) -> r: 4 5 6 7
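The extraction above can be reproduced with a plain loop, just to make the indexing explicit (`ext4` is a made-up name for this sketch):

```cpp
// Lane-wise meaning of vextq_f32(a, b, n): take 4 consecutive lanes
// from the 8-lane concatenation [a0 a1 a2 a3 b0 b1 b2 b3], starting at n.
void ext4(const float a[4], const float b[4], int n, float r[4])
{
    float cat[8];
    for (int i = 0; i < 4; ++i) { cat[i] = a[i]; cat[i + 4] = b[i]; }
    for (int i = 0; i < 4; ++i)
        r[i] = cat[n + i];
}
```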

float32x4_t vfmaq_laneq_f32 (float32x4_t a, float32x4_t b, float32x4_t v, const int lane)
r = a + b * v[lane], i.e. fused multiply-accumulate with one lane of v broadcast to all lanes
e.g.

float32x4_t sum = vdupq_n_f32(0);
float _a[] = {1,2,3,4}, _b[] = {5,6,7,8};
float32x4_t a = vld1q_f32(_a), b = vld1q_f32(_b);
float32x4_t sum1 = vfmaq_laneq_f32(sum, a, b, 0);  // sum  + a * b[0]
float32x4_t sum2 = vfmaq_laneq_f32(sum1, a, b, 1); // sum1 + a * b[1]
float32x4_t sum3 = vfmaq_laneq_f32(sum2, a, b, 2); // sum2 + a * b[2]
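Lane by lane, this accumulation behaves as follows (plain C++ illustration of the semantics; `fma_laneq4` is a made-up name):

```cpp
// Lane-wise meaning of vfmaq_laneq_f32(a, b, v, lane):
// r[i] = a[i] + b[i] * v[lane]
void fma_laneq4(const float a[4], const float b[4],
                const float v[4], int lane, float r[4])
{
    for (int i = 0; i < 4; ++i)
        r[i] = a[i] + b[i] * v[lane];
}
```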

For other commonly used intrinsics, please refer to the developer website:
https://developer.arm.com/technologies/neon/intrinsics

In short, NEON is quick to learn, but going deeper requires more time and effort.
