Neon intrinsics brief tutorial

Foreword

This article aims to give beginners a guide to NEON so that they can get started quickly. As a low-level technology, NEON has a steep learning curve. This tutorial will clear up the various doubts you may have as a newcomer and, combined with a large number of exercises, help you truly get started with NEON.

SIMD & NEON

SIMD stands for Single Instruction, Multiple Data. In short, it is an extension to the instruction set that performs the same operation on multiple values at once.

NEON refers to the Advanced SIMD (Single Instruction, Multiple Data) extension instruction set for Arm Cortex-A series processors.

Why did I want to learn NEON? My reasons:

  1. The audio DSP algorithms I am familiar with can be accelerated with SIMD to improve their performance
  2. The ARM architecture is already dominant on mobile devices (Android/iOS, etc.). If you want these algorithms to perform better on mobile, you must learn NEON
  3. SIMD technology is cool, and I can learn about the SIMD programming paradigm through NEON. Learning something new is fun in itself.

NEON intrinsics

So how can we use NEON? You can embed NEON assembly code in C/C++ code, but this approach is very difficult: you need to be familiar with registers, assembly, and other low-level details, which is daunting for novices. Fortunately, you also have NEON intrinsics to choose from.

NEON intrinsics are actually a set of C functions. You can implement SIMD by calling them to make your algorithms more efficient. For novices like us, NEON intrinsics are a very appropriate entrance into the wonderful world of SIMD.

After studying NEON intrinsics for a while, I have roughly mastered their basic usage, so I am summarizing it here both to keep myself from forgetting it later and as a reference for you to read.

NEON intrinsics learning materials

Here are a few good materials that I found during the learning process.

  1. Learn the architecture - Optimizing C code with Neon intrinsics — recommended reading first: short and concisely written. If some of its points are not clear to you, it doesn't matter; read through this blog and I believe your doubts will be easily resolved.
  2. Learn the architecture - Neon programmers' guide. The "Introduction" section explains the principles of SIMD & NEON in detail, including the registers; the "NEON Intrinsics" chapter covers a lot of intrinsics usage. A Chinese translation is available ("NEON Coder Guide", Chapter 1: Introduction).
  3. ARM Compiler toolchain Compiler Reference Version 5.03, whose "Using NEON Support" section is very thorough: it classifies and summarizes the NEON intrinsics in detail. Recommended reading.
  4. ARM NEON for C++ Developers introduces various uses of NEON Intrinsics in considerable detail.
  5. neon-guide , a brief neon tutorial with concise content.

Registers

The key to SIMD's speed-up lies in the registers. As mentioned in the Introducing NEON Development article:

Some modern software, especially multimedia codecs and graphics acceleration software, operates on a lot of data smaller than the machine word length. For example, 16-bit data is common in audio applications, and 8-bit data is common in graphics and video.
When performing these operations on a 32-bit microprocessor, a significant portion of the computing units goes unused while still consuming computing resources. To make better use of these idle resources, SIMD technology uses a single instruction to perform the same operation in parallel on multiple data elements of the same type and size. In this way, in the same time it takes to add two 32-bit values, the hardware can instead add four 8-bit values in parallel.

The registers of the ARM architecture are introduced in more detail in Arm NEON programming quick reference and Learn the architecture - Neon programmers' guide . To sum it up:

  • Armv7-A/AArch32
    • There are sixteen 32-bit general-purpose registers (R0-R15)
    • There are thirty-two 64-bit NEON registers (D0-D31); they can also be viewed as sixteen 128-bit NEON registers (Q0-Q15), with each Q register aliasing a pair of D registers
  • Armv8/AArch64
    • There are thirty-one 64-bit general-purpose registers (X0-X30); in addition, there is a special register whose name depends on the current context
    • There are thirty-two 128-bit NEON registers (V0-V31); their lower halves can also be viewed as 32-bit Sn or 64-bit Dn registers

Vector data types

There are many vector data types in NEON, and you can find the specific list in Vector data types .

They follow a unified naming scheme:

 <type><size>x<number of lanes>_t
  • type, the type of data stored in the vector, including:
    • int, signed integer
    • uint, unsigned integer
    • float, floating point
    • poly, for an introduction to this type, please refer here
  • size, that is, the bit length of the type, for example, float32 means 32-bit float type, int64 means 64-bit int, and so on
  • number of lanes, i.e. how many elements there are; for example, float32x4_t holds 4 float32 values

In fact, you can think of these vector data types as arrays, analogous to std::array in C++, for example:

int16x8_t   < == > std::array<int16_t, 8>
uint64x2_t  < == > std::array<uint64_t, 2>
float32x4_t < == > std::array<float, 4>

These data types are designed to fill a register, so their total bit length is either 64 or 128. Given a float32x4_t vector with values {0, 1, 2, 3}, the values are stored in the register in the following order:

You can read the values in these vectors just like indexing an array, for example:

float32x4_t a{1.0f, 2.0f, 3.0f, 4.0f};
printf("%f %f %f %f\n", a[0], a[1], a[2], a[3]);

As for lanes, you can think of a lane as an array index; you will often see the word "lanes" in the descriptions of NEON intrinsics later on.

NEON intrinsics naming convention

As mentioned earlier, NEON intrinsics are really just a bunch of C functions. As a novice, I was a little confused the first time I saw them, because their names are quite abstract, and it takes some searching to get a general idea of what they mean. They roughly follow this rule:

<opname><flags>_<type>

To give a few examples:

  • vmul_s16, multiplies two s16 vectors
  • vaddl_u8, adds two u8 vectors (the "l" marks the Long variant, which widens the result)

More detailed rules are given in Program-conventions, which can be dazzling. Understanding the naming rules helps us quickly grasp the meaning of an intrinsic, but as a newcomer I don't think you need to dwell on them too much. We can quickly grasp what these magical functions do by consulting the intrinsics documentation; as for the naming rules, once you have read enough of them you will naturally be able to guess their meaning.

Looking up NEON intrinsics

You can search the Arm Intrinsics reference to look a function up. So how should we read the query results? Let me share my experience here.

For a function, what we care about includes:

  • What is the input? That is, what are the parameters.
  • What is the output? That is, what data type is returned.
  • How does the function behave? That is, what operations the function does.

Take vaddq_f32 for example; the query result is shown below. Let's go through it item by item.

  • Arguments: there are two float32x4_t parameters, namely a and b
  • Return Type: returns a float32x4_t
  • Description: describes the behavior of the function: "Floating-point Add (vector). This instruction adds the corresponding vector elements in the two source SIMD&FP registers, writes the result into a vector, and writes the vector into the destination SIMD&FP register. All values in this instruction are floating-point values."
  • Instruction Group: the category
  • This intrinsic compiles to the following instructions: the function compiles to FADD Vd.4S,Vn.4S,Vm.4S — that is, perform FADD on the 4 floats in the Vn and Vm registers and store the result in Vd
  • Argument Preparation: the parameter a is placed in the Vn register, and the parameter b in the Vm register
  • Architectures, this function is available under v7, A32, A64 architectures
  • Operation, that is, the specific operation of the instruction. Through this part, you can roughly understand the algorithm flow of the instruction. It is similar to pseudocode and is not difficult to understand. When encountering some strange instructions, you may not be able to know its function only through the Description. At this time, you can look at the Operation.

Three processing methods: Long/Wide/Narrow

NEON instructions are usually divided into Normal, Long, Wide and Narrow.

  • Normal, the input and output of the instruction have the same bit width; for example vaddq_f32, whose input is float32x4_t and output is float32x4_t, both 128-bit.
  • Long, the instruction operates on 64-bit vectors and produces a 128-bit vector result that is twice as wide as the inputs, with the same element type. Such instructions are identified by an "l" in NEON intrinsics, e.g. vaddl_s32: input is int32x2_t, output is int64x2_t.
  • Wide, the instruction operates on a 128-bit vector and a 64-bit vector, producing a 128-bit vector result. The result and the first input vector are twice as wide as the second input vector. Such instructions are identified by a "w", for example vaddw_s32: inputs are int64x2_t and int32x2_t, output is int64x2_t.
  • Narrow, the instruction operates on 128-bit vectors and produces a 64-bit result that is half the width of the inputs. Such instructions are identified by an "n", e.g. vaddhn_s32: input is int32x4_t, output is int16x4_t.

NEON intrinsics manual

The intrinsics are classified in detail in ARM Compiler toolchain Compiler Reference Version 5.03. This chapter gives example functions for each category to help you understand them.

You can run all the code directly in the Compiler Explorer online editor: select an arm64 compiler and include <arm_neon.h>.

Addition

Vector add: vadd{q}_type. Vr[i]:=Va[i]+Vb[i]

c = a + b

  • vaddq_f32
float32x4_t a{1.0f, 2.0f, 3.0f, 4.0f};
float32x4_t b{1.0f, 2.0f, 3.0f, 4.0f};
float32x4_t c = vaddq_f32(a, b); // c: {2, 4, 6, 8}
  • vadd_u64
uint64x1_t a{1};
uint64x1_t b{2};
uint64x1_t c = vadd_u64(a, b); // c: {3}

Vector long add: vaddl_type. Vr[i]:=Va[i]+Vb[i]

Long-mode processing: Va and Vb have the same number of lanes, and the return value is a vector twice the width of the inputs

  • vaddl_s32
int32x2_t a{1, 2};
int32x2_t b{1, 2};
int64x2_t c = vaddl_s32(a, b); // c: {2, 4}

Vector wide add: vaddw_type. Vr[i]:=Va[i]+Vb[i]

Wide-mode processing: Va and Vb have the same number of lanes, Va is twice as wide as Vb, and the return value has the same width as Va

  • vaddw_s32
int64x2_t a{1, 2};
int32x2_t b{1, 2};
int64x2_t c = vaddw_s32(a, b); // c: {2, 4}

Vector halving add: vhadd{q}_type. Vr[i]:=(Va[i]+Vb[i])>>1

Add Va and Vb and shift the result right by one bit (equivalent to integer division by 2), that is, c = (a + b) >> 1

  • vhadd_s32
int32x2_t a{1, 2};
int32x2_t b{2, 3};

// a + b = {3, 5}
// (a + b) >> 1 = {1, 2}
int32x2_t c = vhadd_s32(a, b);

Vector rounding halving add: vrhadd{q}_type. Vr[i]:=(Va[i]+Vb[i]+1)>>1

Add Va and Vb, add 1, and shift right by one: integer division by 2 with rounding, i.e. c = (a + b + 1) >> 1

  • vrhadd_s32
int32x2_t a{1, 2};
int32x2_t b{2, 3};
int32x2_t c = vrhadd_s32(a, b); // c: {2, 3}

VQADD: Vector saturating add

In saturating addition, when the result exceeds the maximum representable value or is less than the minimum, the result is clamped to that maximum or minimum.

  • vqadd_s8
int8x8_t a{127, 127};
int8x8_t b{0, 1};
int8x8_t c = vqadd_s8(a, b); // c: {127, 127, 0, ...}

int8x8_t e{-128, -128};
int8x8_t d{0, -1};
int8x8_t f = vqadd_s8(e, d); // f: {-128, -128, 0, ...}

Vector add high half: vaddhn_type. Vr[i]:=Va[i]+Vb[i]

Narrow-mode processing: add Va and Vb, and keep the high half of each result in Vr

int32x4_t a{0x7ffffffe, 0x7ffffffe, 0, 0};
int32x4_t b{0x00000001, 0x00000002, 0, 0};
// 0x7ffffffe + 0x00000001 = 0x7fffffff => high half => 0x7fff
// 0x7ffffffe + 0x00000002 = 0x80000000 => high half => 0x8000
int16x4_t c = vaddhn_s32(a, b); // c: {32767, -32768, 0, 0}

Vector rounding add high half: vraddhn_type.

Vector addition, keeping the high half of each result, with rounding

int32x4_t a{0x7ffffffe, 0x7ffffffe, 0, 0};
int32x4_t b{0x00000001, 0x00000002, 0, 0};
// 0x7ffffffe + 0x00000001 + 0x00008000 = 0x80007fff => high half => 0x8000
// 0x7ffffffe + 0x00000002 + 0x00008000 = 0x80008000 => high half => 0x8000
int16x4_t c = vraddhn_s32(a, b); // c: {-32768, -32768, 0, 0}

Multiplication

Vector multiply: vmul{q}_type. Vr[i] := Va[i] * Vb[i]

Vector multiplication, c = a * b

  • vmul_f32
float32x2_t a{1.0f, 2.0f};
float32x2_t b{2.0f, 3.0f};
float32x2_t c = vmul_f32(a, b); // c: {2.0f, 6.0f}

Vector multiply accumulate: vmla{q}_type. Vr[i] := Va[i] + Vb[i] * Vc[i]

Vector multiply-accumulate, that is, d = a + b * c

  • vmla_f32
float32x2_t a{1.0f, 2.0f};
float32x2_t b{2.0f, 3.0f};
float32x2_t c{4.0f, 5.0f};
float32x2_t d = vmla_f32(a, b, c); // d: {9, 17}

Vector multiply accumulate long: vmlal_type. Vr[i] := Va[i] + Vb[i] * Vc[i]

Long-mode processing: Va is twice as wide as Vb/Vc, and the output width matches Va

  • vmlal_s32
int64x2_t a{1, 2};
int32x2_t b{2, 3};
int32x2_t c{4, 5};
int64x2_t d = vmlal_s32(a, b, c); // d: {9, 17}

Vector multiply subtract: vmls{q}_type. Vr[i] := Va[i] - Vb[i] * Vc[i]

Vector multiply-subtract, that is, d = a - b * c

  • vmls_f32
float32x2_t a{1.0f, 2.0f};
float32x2_t b{2.0f, 3.0f};
float32x2_t c{4.0f, 5.0f};
float32x2_t d = vmls_f32(a, b, c); // d: {-7, -13}

Vector multiply subtract long

Vector multiply-subtract, processed in Long mode

  • vmlsl_s32
int64x2_t a{1, 2};
int32x2_t b{2, 3};
int32x2_t c{4, 5};
int64x2_t d = vmlsl_s32(a, b, c); // d: {-7, -13}

Vector saturating doubling multiply high

Multiply a by b, double the result (×2), take the high half of each final result, and write the resulting vector to the destination register.

  • vqdmulh_s32
int32x2_t a{0x00020000, 0x00035000};
int32x2_t b{0x00010000, 0x00015000};
// (0x00020000 * 0x00010000)*2 = 0x400000000,  >> 32 = 0x00000004
// (0x00035000 * 0x00015000)*2 = 0x8b2000000,  >> 32 = 0x00000008
int32x2_t c = vqdmulh_s32(a, b); // c: {4, 8}

Vector saturating rounding doubling multiply high

  • vqrdmulh_s32, where 0x80000000 is 1<<31; to see where this rounding constant comes from, refer to the Operation section of vqrdmulh_s32.
int32x2_t a{0x00010000, 0x00035000};
int32x2_t b{0x00020000, 0x00015000};
// (0x00020000 * 0x00010000)*2 + 0x80000000 = 0x480000000, >> 32 = 0x00000004
// (0x00035000 * 0x00015000)*2 + 0x80000000 = 0x932000000, >> 32 = 0x00000009
int32x2_t c = vqrdmulh_s32(a, b); // c: {4, 9}

Vector saturating doubling multiply accumulate long

That is, d = a + (b*c*2), processed in Long mode

  • vqdmlal_s32
int64x2_t a{1, 2};
int32x2_t b{3, 4};
int32x2_t c{5, 6};
int64x2_t d = vqdmlal_s32(a, b, c); // d: {31, 50}

Vector saturating doubling multiply subtract long

That is, d = a - (b*c*2), processed in Long mode

  • vqdmlsl_s32
int64x2_t a{1, 2};
int32x2_t b{3, 4};
int32x2_t c{5, 6};
int64x2_t d = vqdmlsl_s32(a, b, c); // d: {-29, -46}

Vector long multiply

That is, c = a*b, processed in Long mode

  • vmull_s32
int32x2_t a{1, 2};
int32x2_t b{3, 4};
int64x2_t c = vmull_s32(a, b); // c: {3, 8}

Vector saturating doubling long multiply

That is, c = 2*a*b, processed in Long mode

  • vqdmull_s32
int32x2_t a{1, 2};
int32x2_t b{3, 4};
int64x2_t c = vqdmull_s32(a, b); // c: {6, 16}

Subtraction vector subtraction

AdditionBy Multiplicationstudying the and instructions, you will find that there are many, many instructions that are variants of a certain basic command. The operations of these variant commands are similar to the basic commands. We will not explain the variant commands later, let us focus on more important instructions.

Vector subtract

Vector subtraction, that is, c = a - b

  • vsubq_f32
float32x4_t a{4, 3, 2, 1};
float32x4_t b{1, 2, 3, 4};
float32x4_t c = vsubq_f32(a, b); // c: {3, 1, -1, -3}

Vector long subtract: vsubl_type. Vr[i]:=Va[i]-Vb[i]

Vector subtraction, processed in Long mode.

  • vsubl_s32
int32x2_t a{4, 3};
int32x2_t b{1, 2};
int64x2_t c = vsubl_s32(a, b); // c: {3, 1}

Vector wide subtract: vsubw_type. Vr[i]:=Va[i]-Vb[i]

Vector subtraction, processed in Wide mode

  • vsubw_s32
int64x2_t a{4, 3};
int32x2_t b{1, 2};
int64x2_t c = vsubw_s32(a, b); // c: {3, 1}

Vector saturating subtract

Vector saturating subtraction

  • vqsub_s32
int32x2_t a{0x7fffffff, 0x7fffffff};
int32x2_t b{-1, 1};
int32x2_t c = vqsub_s32(a, b); // c: {0x7fffffff, 0x7ffffffe}

Vector halving subtract

Vector subtraction followed by division by 2, i.e. c = (a - b) >> 1

  • vhsubq_s32
int32x4_t a{4, 3, 2, 1};
int32x4_t b{0, 1, 2, 3};
int32x4_t c = vhsubq_s32(a, b); // c: {2, 1, 0, -1}

Vector subtract high half

Vector subtraction, keeping the high half of each result

  • vsubhn_s32
int32x4_t a{0x7fffffff, 0x7fffeeee, 0, 0};
int32x4_t b{0x7fff0000, 0x0000000f, 0, 0};
// 0x7fffffff - 0x7fff0000 = 0x0000ffff => high half => 0x0000
// 0x7fffeeee - 0x0000000f = 0x7fffeedf => high half => 0x7fff (32767)
int16x4_t c = vsubhn_s32(a, b); // c: {0, 32767, 0, 0}

Vector rounding subtract high half

Vector subtraction, keeping the high half of each result, with rounding

  • vrsubhn_s32
int32x4_t a{0x7fffffff, 0x7fffeeee, 0, 0};
int32x4_t b{0x7fff0000, 0x0000000f, 0, 0};
// 0x7fffffff - 0x7fff0000 + 0x00008000 = 0x00017fff => high half => 0x0001
// 0x7fffeeee - 0x0000000f + 0x00008000 = 0x80006edf => high half => 0x8000
int16x4_t c = vrsubhn_s32(a, b); // c: {1, -32768, 0, 0}

Comparison

NEON provides many vector comparisons, including ==, >=, <=, >, <, and so on. Here we pick a few of the more interesting functions to explain.

  • vceq_s32, tests the vectors for equality. If the values in corresponding lanes are equal, the result lane is all 1 bits; otherwise it is 0.
int32x2_t a{1, 2};
int32x2_t b{1, 0};
uint32x2_t c = vceq_s32(a, b); // c: {0xffffffff, 0}
  • vcage_f32, compares the absolute values of floating-point lanes: |a| >= |b|.
float32x2_t a{1.0f, 2.0f};
float32x2_t b{-1.0f, -3.0f};
uint32x2_t c = vcage_f32(a, b); // c: {0xffffffff, 0}
  • vtst_s8, ANDs a and b bit by bit and tests the result: a lane is all 1 bits if (a & b) is nonzero.
int8x8_t a{0b00011111, 0b00010000};
int8x8_t b{0b00010000, 0b00000000};
uint8x8_t c = vtst_s8(a, b); // c: {0xff, 0}

Absolute difference

Absolute difference between the arguments: vabd{q}_type. Vr[i] = | Va[i] - Vb[i] |

The absolute value of the vector difference, that is, c = abs(a - b)

  • vabd_f32
float32x2_t a{1.0f, 2.0f};
float32x2_t b{2.0f, 1.0f};
float32x2_t c = vabd_f32(a, b); // c: {1.0, 1.0}

Absolute difference and accumulate: vaba{q}_type. Vr[i] = Va[i] + | Vb[i] - Vc[i] |

That is, d = a + abs(b - c)

  • vaba_s32
int32x2_t a{1, 1};
int32x2_t b{1, 2};
int32x2_t c{2, 1};
int32x2_t d = vaba_s32(a, b, c); // d: {2, 2}

Max/Min

vmax{q}_type. Vr[i] := (Va[i] >= Vb[i]) ? Va[i] : Vb[i]

Compare two vectors lane by lane and take the larger value

  • vmax_f32
float32x2_t a{1.0f, -2.0f};
float32x2_t b{2.0f, 1.0f};
float32x2_t c = vmax_f32(a, b); // c: {2.0, 1.0}

vmin{q}_type. Vr[i] := (Va[i] >= Vb[i]) ? Vb[i] : Va[i]

Compare two vectors lane by lane and take the smaller value

  • vmin_f32
float32x2_t a{1.0f, -2.0f};
float32x2_t b{2.0f, 1.0f};
float32x2_t c = vmin_f32(a, b); // c: {1.0, -2.0}

Pairwise addition

Pairwise add

Adds adjacent pairs within each vector; for two-lane vectors, c = { sum(a), sum(b) }

  • vpadd_s32
int32x2_t a{3, 4};
int32x2_t b{1, 2};
int32x2_t c = vpadd_s32(a, b); // c: {7, 3}

Long pairwise add and accumulate

Pairwise-add b, then accumulate into a: c = { a[0] + b[0] + b[1], a[1] + b[2] + b[3] }

  • vpadal_s16
int32x2_t a{3, 4};
int16x4_t b{1, 2, 3, 4};
int32x2_t c = vpadal_s16(a, b); // c: {6, 11}

Folding maximum

Take the pairwise maxima within vector a and vector b; for two-lane vectors, c = { max(a), max(b) }

  • vpmax_s32
int32x2_t a{1, 2};
int32x2_t b{-1, 0};
int32x2_t c = vpmax_s32(a, b); // c: {2, 0}

Folding minimum

Take the pairwise minima within vector a and vector b; for two-lane vectors, c = { min(a), min(b) }

  • vpmin_s32
int32x2_t a{1, 2};
int32x2_t b{-1, 0};
int32x2_t c = vpmin_s32(a, b); // c: {1, -1}

Reciprocal/Sqrt

These intrinsics perform the first of the two steps in one Newton-Raphson iteration converging to the reciprocal or the reciprocal square root

  • vrecps_f32, i.e. c = 2.0 - a*b
float32x2_t a{2, 4};
float32x2_t b{1, 3};
float32x2_t c = vrecps_f32(a, b); // c: {0.0, -10.0}
  • vrsqrts_f32, i.e. c = (3.0 - a*b)/2
float32x2_t a{2, 4};
float32x2_t b{1, 3};
float32x2_t c = vrsqrts_f32(a, b); // c: {0.5, -4.5}

Shifts by signed variable

These functions shift each lane by an amount given in a signed vector

Vector shift left: vshl{q}_type. Vr[i] := Va[i] << Vb[i] (negative values shift right)

That is, each lane of a is shifted left by the corresponding value in b; if the value in b is negative, the lane is shifted right instead

  • vshl_s16
int16x4_t a{1, 8, -1, -8};
int16x4_t b{2, -2, 2, -2};
int16x4_t c = vshl_s16(a, b); // c: {4, 2, -4, -2}

Shifts by a constant

Vector shift right by constant

Vector right shift

  • vshr_n_s32
int32x2_t a{8, 16};
int32x2_t c = vshr_n_s32(a, 2); // c: {2, 4}

Vector shift left by constant

Vector left shift

  • vshl_n_s32
int32x2_t a{2, 4};
int32x2_t c = vshl_n_s32(a, 2); // c: {8, 16}

Vector shift right by constant and accumulate

That is, d = a + (b >> n)

  • vsra_n_s32
int32x2_t a{8, 4};
int32x2_t b{4, 2};
const int n = 1;
int32x2_t d = vsra_n_s32(a, b, n); // d: {10, 5}

Shifts with insert

Vector shift right and insert

This operation is quite special. Let me describe vsri_n_s32 in plain language using the example below:

  1. Because the shift amount is 6, the top 6 bits of a are kept, giving 0x0c000000
  2. b is shifted right by 6 bits, giving 0x00000400
  3. OR the two results together, giving 0x0c000400
  • vsri_n_s32
int32x2_t a{0x0fffffff, 0x0fffffff};
int32x2_t b{0x00010000, 1};
const int n = 6;
int32x2_t d = vsri_n_s32(a, b, n); // d: {0x0c000400, 0x0c000000}

Loads of a single vector or lane

Load a single vector from memory

Load a vector from memory

  • vld1q_f32
float a[4] = {1, 2, 3, 4};
float32x4_t b = vld1q_f32(a); // b: {1.0, 2.0, 3.0, 4.0}

Load a single lane from memory

Load a value from memory into a specified lane of a vector. In the example below, the result c is src with lane `lane` replaced by ptr[0]

  • vld1q_lane_s32
int32_t ptr[] = {1, 2, 3, 4};
int32x4_t src{0, 1, 2, 3};
const int lane = 2;
int32x4_t c = vld1q_lane_s32(ptr, src, lane); // c: {0, 1, 1, 3}

Load all lanes of vector with same value from memory

Load one value from memory and replicate it to every lane of the vector

  • vld1q_dup_s32
const int32_t a{10};
int32x4_t b = vld1q_dup_s32(&a); // b: {10, 10, 10, 10}

Store a single vector or lane

Store vector values to memory

Store a single vector into memory

Store an entire vector to memory

  • vst1q_s32
int32x4_t a{0, 1, 2, 3};
int32_t ptr[4];
vst1q_s32(ptr, a); // ptr: {0, 1, 2, 3}

Store a lane of a vector into memory

Store the value of one lane of a vector to memory

  • vst1q_lane_s32
int32x4_t a{0, 1, 2, 3};
int32_t a0, a1;
vst1q_lane_s32(&a0, a, 0); // a0: 0
vst1q_lane_s32(&a1, a, 3); // a1: 3

Loads of an N-element structure

Load multiple vectors at once

Load N-element structure from memory

Load multiple vectors from memory, de-interleaving the elements

  • vld2q_f32
float32_t ptr[] = {0, 1, 2, 3, 4, 5, 6, 7, 8};
float32x4x2_t a = vld2q_f32(ptr);
// a.val[0] = {0, 2, 4, 6}
// a.val[1] = {1, 3, 5, 7}

Load all lanes of N-element structure with same value from memory

That is, every lane of a.val[0] is filled with ptr[0], and every lane of a.val[1] with ptr[1]

  • vld2_dup_f32
float32_t ptr[] = {0, 1};
float32x2x2_t a = vld2_dup_f32(ptr);
// a.val[0] = {0, 0}
// a.val[1] = {1, 1}

Load a single lane of N-element structure from memory

That is, c[:, lane] = ptr: replace the lane-th column of src with the values from ptr

  • vld2q_lane_f32
float32_t ptr[] = {10, 20};
float32x4x2_t src = {
    float32x4_t{1, 1, 1, 1},
    float32x4_t{2, 2, 2, 2}};
float32x4x2_t c = vld2q_lane_f32(ptr, src, 1);
// c.val[0] = {1, 10, 1, 1}
// c.val[1] = {2, 20, 2, 2}

Store N-element structure to memory

Store multiple vectors at once, interleaving the elements

  • vst2q_s32
int32_t ptr[8];
int32x4x2_t val{
    int32x4_t{0, 1, 2, 3},
    int32x4_t{4, 5, 6, 7},
};
vst2q_s32(ptr, val);
// ptr: {0, 4, 1, 5, 2, 6, 3, 7}

Store a single lane of N-element structure to memory

Store one lane from each vector to memory, i.e. take one column

  • vst2q_lane_s32
int32_t ptr[2];
int32x4x2_t val{
    int32x4_t{0, 1, 2, 3},
    int32x4_t{4, 5, 6, 7},
};
const int lane = 2;
vst2q_lane_s32(ptr, val, lane);
// ptr: {2, 6}

Extract lanes from a vector and put into a register

Read the value of a specified lane of a vector

  • vgetq_lane_f32
float32x4_t a{0, 1, 2, 3};
const int lane = 2;
float32_t b = vgetq_lane_f32(a, lane); // b: 2

Load a single lane of a vector from a literal

Set the value of a specified lane of a vector

  • vsetq_lane_f32
float32x4_t a{0, 1, 2, 3};
float32_t new_val = 10;
const int lane = 2;
float32x4_t b = vsetq_lane_f32(new_val, a, lane); // b: {0, 1, 10, 3}

Initialize a vector from a literal bit pattern

Create a vector from the bits of a uint64_t variable

  • vcreate_s32
uint64_t a = 0x0000000100000002;
int32x2_t b = vcreate_s32(a); // b: {2, 1} (the low 32 bits become lane 0)

Set all lanes to same value

Load all lanes of vector to the same literal value

Set every lane of the vector to the same value

  • vdupq_n_f32
float32x4_t a = vdupq_n_f32(10); // a: {10, 10, 10, 10}
  • vmovq_n_f32
float32x4_t a = vmovq_n_f32(10); // a: {10, 10, 10, 10}

Load all lanes of the vector to the value of a lane of a vector

Take the value of one lane of vector a and create a vector filled with that value

  • vdupq_lane_f32
float32x2_t a{0, 1};
const int lane = 1;
float32x4_t b = vdupq_lane_f32(a, lane); // b: {1, 1, 1, 1}

Combining vectors

Merge two 64-bit vectors into one 128-bit vector

  • vcombine_f32
float32x2_t low{0, 1};
float32x2_t high{2, 3};
float32x4_t c = vcombine_f32(low, high); // c: {0, 1, 2, 3}

Splitting vectors

  • vget_high_f32
float32x4_t a{0, 1, 2, 3};
float32x2_t b = vget_high_f32(a); // b: {2, 3}
  • vget_low_f32
float32x4_t a{0, 1, 2, 3};
float32x2_t c = vget_low_f32(a); // c: {0, 1}

Converting vectors

  • vcvt_s32_f32
float32x2_t a{1, 2};
int32x2_t b = vcvt_s32_f32(a); // b: {1, 2}

Table look up

  • vtbl2_s8
int8x8x2_t a{
    int8x8_t{0, 1, 3, 5, 7, 9, 11, 13},
    int8x8_t{2, 4, 6, 8, 10, 12, 14, 16}};
int8x8_t b{0, 1, 2, 3, 12, 13, 14, 15};
int8x8_t c = vtbl2_s8(a, b); // c: {0, 1, 3, 5, 10, 12, 14, 16}
  • vtbx1_s8
int8x8_t a{0, 1, 3, 5, 7, 9, 11, 13};
int8x8_t b{2, 4, 6, 8, 10, 12, 14, 16};
int8x8_t c{0, 1, 2, 3, 12, 13, 14, 15};
int8x8_t d = vtbx1_s8(a, b, c); // d: {2, 4, 6, 8, 7, 9, 11, 13}

Operations with a scalar value

Some operations between vectors and scalars

Vector multiply accumulate with scalar

That is, d = a + (b * c[lane])

  • vmla_lane_f32
float32x2_t a{1, 2};
float32x2_t b{1, 2};
float32x2_t c{2, 4};
const int lane = 0;
float32x2_t d = vmla_lane_f32(a, b, c, lane); // d: {3, 6}

Vector multiply by scalar

Multiply a vector by a scalar

  • vmulq_n_f32
float32x4_t a{0, 1, 2, 3};
float32_t b = 2;
float32x4_t c = vmulq_n_f32(a, b); // c: {0, 2, 4, 6}

Vector multiply subtract with scalar

That is, d = a - (b * c)

  • vmls_n_f32
float32x2_t a{1, 2};
float32x2_t b{1, 2};
float32_t c = 2.0f;
float32x2_t d = vmls_n_f32(a, b, c); // d: {-1, -2}

Vector extract

The example below illustrates it: concatenate a and b into one double-length vector, then take the lanes starting at index lane

  • vext_s16
int16x4_t a{0, 2, 4, 6};
int16x4_t b{1, 3, 5, 7};
// a:b => {0,2,4,6,1,3,5,7} => take 4 lanes starting at index 3 => {6, 1, 3, 5}
const int lane = 3;
int16x4_t c = vext_s16(a, b, lane); // c: {6, 1, 3, 5}

Reverse vector elements (swap endianness)

Reverse the elements within each 64-bit doubleword

  • vrev64q_f32
float32x4_t a{0, 1, 2, 3};
float32x4_t b = vrev64q_f32(a); // b: {1, 0, 3, 2}

Other single operand arithmetic

Absolute: vabs{q}_type. Vd[i] = |Va[i]|

Take the absolute value

  • vabsq_f32
float32x4_t a{0, 1, -2, -3};
float32x4_t b = vabsq_f32(a); // b: {0, 1, 2, 3}

Negate: vneg{q}_type. Vd[i] = - Va[i]

Negation

  • vnegq_f32
float32x4_t a{0, 1, -2, -3};
float32x4_t b = vnegq_f32(a); // b: {-0, -1, 2, 3}

Count leading sign bits

Counts, starting after the sign bit, the number of consecutive bits equal to the sign bit

  • vcls_s8
int8x8_t a{0b00000001, 0b00110000};
int8x8_t b = vcls_s8(a);
printf("%d %d\n", b[0], b[1]); // 6 1

Count leading zeros

Counts the number of consecutive 0 bits starting from the most significant bit

  • vclz_s8
int8x8_t a{0b00000001, 0b01110000};
int8x8_t b = vclz_s8(a);
printf("%d %d\n", b[0], b[1]); // 7 1

Count number of set bits

Counts the number of bits that are 1

  • vcnt_s8
int8x8_t a{0b00000001, 0b01110000};
int8x8_t b = vcnt_s8(a);
printf("%d %d\n", b[0], b[1]); // 1 3

Reciprocal estimate

Reciprocal estimate (an approximate value)

  • vrecpe_f32
float32x2_t a{1, 3};
float32x2_t b = vrecpe_f32(a); // b: {0.998047, 0.333008}

Reciprocal square root estimate

That is, b = 1 / sqrt(a), approximately

  • vrsqrte_f32
float32x2_t a{1, 4};
float32x2_t b = vrsqrte_f32(a); // b: {0.998047, 0.499023}

Logical operations

Bitwise not

Bitwise NOT

  • vmvn_s32
int32x2_t a{0x0000ffff, 0x00000000};
int32x2_t b = vmvn_s32(a);
printf("%x %x\n", b[0], b[1]); // ffff0000 ffffffff

Bitwise and

Bitwise AND

  • vand_s32
int32x2_t a{0x00000ff0, 0x000000ff};
int32x2_t b{0x0000ffff, 0x000000ff};
int32x2_t c = vand_s32(a, b); // c: {0xff0, 0xff}

Bitwise or

Bitwise OR

  • vorr_s32
int32x2_t a{0x00000ff0, 0x000000ff};
int32x2_t b{0x0000ffff, 0x000000ff};
int32x2_t c = vorr_s32(a, b); // c: {0xffff, 0xff}

Bitwise exclusive or (EOR or XOR)

Bitwise XOR

  • veor_s32
int32x2_t a{0x00000ff0, 0x000000ff};
int32x2_t b{0x0000ffff, 0x000000ff};
int32x2_t c = veor_s32(a, b); // c: {0xf00f, 0}

Bit Clear

That is, c = a & ~b

  • vbic_s32
int32x2_t a{0x7fffffff, 0x000000ff};
int32x2_t b{0x0000ffff, 0x7fffffff};
int32x2_t c = vbic_s32(a, b); // c: {0x7fff0000, 0}

Bitwise OR complement

That is, c = a | ~b

  • vorn_s32
int32x2_t a{0x7fffffff, 0x000000ff};
int32x2_t b{0x0000ffff, 0x7fffffff};
int32x2_t c = vorn_s32(a, b); // c: {0xffffffff, 0x800000ff}

Transposition operations

Some vector transposition operations

Transpose elements

  • vtrn_s32
int32x2_t a{0, 1};
int32x2_t b{2, 3};
int32x2x2_t c = vtrn_s32(a, b);
// c.val[0] = {0, 2}
// c.val[1] = {1, 3}

Interleave elements

  • vzip_s8
int8x8_t a{0, 2, 4, 6, 8, 10, 12, 14};
int8x8_t b{1, 3, 5, 7, 9, 11, 13, 15};
int8x8x2_t c = vzip_s8(a, b);
// c.val[0] = {0, 1, 2, 3, 4, 5, 6, 7}
// c.val[1] = {8, 9, 10, 11, 12, 13, 14, 15}

De-Interleave elements

  • vuzp_s8
int8x8_t a{0, 2, 4, 6, 8, 10, 12, 14};
int8x8_t b{1, 3, 5, 7, 9, 11, 13, 15};
int8x8x2_t c = vuzp_s8(a, b);
// c.val[0] = {0, 4, 8, 12, 1, 5, 9, 13}
// c.val[1] = {2, 6, 10, 14, 3, 7, 11, 15}

Vector reinterpret cast operations

In some cases, you may want to treat a vector as having a different type without changing its bit pattern. NEON provides a set of functions to perform this kind of conversion.
These functions all share the same syntax

vreinterpret{q}_dsttype_srctype

in:

  • q, specifying that the conversion is performed on a 128-bit vector. If it is absent, conversion will be performed on 64-bit vectors.
  • dsttype, the target type
  • srctype, the source data type

Take vreinterpret_u16_s16 for example:

int16x4_t a{0, 1, 2, 3};
uint16x4_t b = vreinterpret_u16_s16(a); // b: {0, 1, 2, 3}

Summary

This article introduced the basic concepts and usage of NEON intrinsics and listed a large number of usage examples, with the aim of making it less difficult to get started with NEON. Later articles will give examples of NEON instructions in real applications, to help you understand how NEON is used in practical scenarios.


reference

  1. Learn the architecture - Optimizing C code with Neon intrinsics
  2. Learn the architecture - Neon programmers’ guide
  3. NEON Coder Guide Chapter 1 : Introduction
  4. ARM Compiler toolchain Compiler Reference Version 5.03
  5. ARM NEON for C++ Developers
  6. neon-guide

Origin blog.csdn.net/weiwei9363/article/details/127802048