Optimize common math operations with ARM NEON intrinsics

ARM NEON is the SIMD instruction set on the ARM platform, and making good use of it can greatly improve the speed of a program. However, optimizing code directly in assembly is difficult for many people. In that case, ARM NEON intrinsics can be used instead: they are thin wrappers around the underlying assembly instructions, so the user does not have to worry about register allocation, yet they can reach nearly the same performance as hand-written assembly.
For the full list of intrinsics, please refer to the blog "Introduction to ARM Neon Intrinsics Functions", which will not be repeated in this article. Here we highlight two examples of using ARM NEON intrinsics to accelerate common math operations.
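
As a minimal warm-up sketch of my own (not taken from the referenced blog), the snippet below adds two arrays of four floats with a single NEON instruction; the pointers are assumed to reference at least four valid floats each.

#include <arm_neon.h>

void add4(const float* a, const float* b, float* out)
{
    float32x4_t va = vld1q_f32(a);       // load 4 floats from a
    float32x4_t vb = vld1q_f32(b);       // load 4 floats from b
    float32x4_t vc = vaddq_f32(va, vb);  // 4 additions in one instruction
    vst1q_f32(out, vc);                  // store the 4 results
}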

1. Dot Product of Vectors

Given two vectors A = (a_1, a_2, a_3, ..., a_n) and B = (b_1, b_2, b_3, ..., b_n), the dot product of the vectors is

A·B = a_1*b_1 + a_2*b_2 + a_3*b_3 + ... + a_n*b_n

As can be seen, the operation multiplies the two vectors element by element and then adds up the products. The multiplications are independent of each other and therefore highly parallel, so the computation can be accelerated with ARM NEON intrinsics. Here is the code implementation:

#include <arm_neon.h>

// Dot product of two float vectors. K is assumed to be a multiple of 4.
float dot(float* A, float* B, int K)
{
    float sum = 0;
    float32x4_t sum_vec = vdupq_n_f32(0), left_vec, right_vec;

    for (int k = 0; k < K; k += 4)
    {
        left_vec  = vld1q_f32(A + k);                         // load 4 floats from A
        right_vec = vld1q_f32(B + k);                         // load 4 floats from B
        sum_vec   = vmlaq_f32(sum_vec, left_vec, right_vec);  // sum_vec += left_vec * right_vec
    }

    // Horizontal add of the 4 per-lane partial sums.
    float32x2_t r = vadd_f32(vget_high_f32(sum_vec), vget_low_f32(sum_vec));
    sum += vget_lane_f32(vpadd_f32(r, r), 0);

    return sum;
}
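
For reference, a plain serial version of the same dot product (the baseline against which the speedup below is measured) might look like the following sketch; the name dot_serial is just a placeholder.

// Scalar reference implementation of the same dot product.
float dot_serial(float* A, float* B, int K)
{
    float sum = 0;
    for (int k = 0; k < K; k++)
        sum += A[k] * B[k];
    return sum;
}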

The NEON code is relatively simple. The core loop loads the two arrays four elements at a time into 128-bit NEON registers, then uses a multiply-accumulate instruction to keep a running sum of products in each of the four lanes. Finally, the 4 lane sums are added together to get the result. Compared to the serial version, this achieves a nearly 4x speedup. When the data type is short or char, an even higher speedup can be achieved. The following is an example for char:

// Dot product of two char vectors. K is assumed to be a multiple of 8.
int dot(char* A, char* B, int K)
{
    int sum = 0;
    int16x8_t sum_vec = vdupq_n_s16(0);
    int8x8_t left_vec, right_vec;
    int32x4_t part_sum4;
    int32x2_t part_sum2;

    // Risk of overflow: the 16-bit accumulator lanes may overflow
    // when K is large or the input values are near the char range limits.
    for (int k = 0; k < K; k += 8)
    {
        left_vec  = vld1_s8((int8_t*)(A + k));                // load 8 signed bytes from A
        right_vec = vld1_s8((int8_t*)(B + k));                // load 8 signed bytes from B
        sum_vec   = vmlal_s8(sum_vec, left_vec, right_vec);   // widening multiply-accumulate into 16-bit lanes
    }

    // Horizontal reduction: 8 x 16-bit lanes -> 4 x 32-bit -> 2 x 32-bit -> scalar.
    part_sum4 = vaddl_s16(vget_high_s16(sum_vec), vget_low_s16(sum_vec));
    part_sum2 = vadd_s32(vget_high_s32(part_sum4), vget_low_s32(part_sum4));
    sum += vget_lane_s32(vpadd_s32(part_sum2, part_sum2), 0);

    return sum;
}

The char-based dot product code is similar to the float version, but because char multiplication can overflow, the products must be widened to short. The code above also carries a warning comment: a single char product always fits in 16 bits, but the running 16-bit sum of many such products can overflow. If the values of the two vectors are designed sensibly, the probability of overflow is very low; one way to remove the risk entirely is sketched below. More importantly, the speedup of the char version is more than twice that of the float version, so in a program with very strict speed requirements it brings a very noticeable improvement.
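As a hedged sketch of my own (not from the original article), one possible way to avoid the overflow risk is to compute the 16-bit products with vmull_s8 and fold them into a 32-bit accumulator on every iteration with vpadalq_s16, at the cost of one extra instruction per loop; the name dot_safe is just a placeholder.

// Overflow-safe char dot product: products are accumulated in 32-bit lanes.
// K is assumed to be a multiple of 8.
int dot_safe(char* A, char* B, int K)
{
    int32x4_t acc = vdupq_n_s32(0);

    for (int k = 0; k < K; k += 8)
    {
        int8x8_t  a = vld1_s8((int8_t*)(A + k));
        int8x8_t  b = vld1_s8((int8_t*)(B + k));
        int16x8_t prod = vmull_s8(a, b);   // 8 x 16-bit products, cannot overflow
        acc = vpadalq_s16(acc, prod);      // pairwise add into 4 x 32-bit accumulators
    }

    int32x2_t r = vadd_s32(vget_high_s32(acc), vget_low_s32(acc));
    return vget_lane_s32(vpadd_s32(r, r), 0);
}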

2. exp acceleration

Current deep neural networks often use the sigmoid or softmax function, and both rely on the floating-point exponential function e^x. The e^x in general math libraries is precise but slow. In the blog "Fast Floating Point Number exp Algorithm", the author gives a fast algorithm that needs only two multiplications, so it is very fast. The drawback is a somewhat larger error, generally between 0 and 5%. For neural networks, however, such accuracy is sufficient.
First look at the original exp algorithm:

#include <stdint.h>

inline float fast_exp(float x)
{
    union { uint32_t i; float f; } v;
    // Build the bit pattern of a float whose value approximates e^x.
    v.i = (uint32_t)((1 << 23) * (1.4426950409 * x + 126.93490512f));

    return v.f;
}

The algorithm exploits the bit layout of the float data type in memory; for the details, please refer to the original blog. This article only describes how to accelerate it with ARM NEON intrinsics (compared to the original blog, the second constant has been changed slightly; the new constant is something I found by experiment, and the error is smaller).
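As a quick sanity check (my own sketch, not from the original blog), the relative error of fast_exp against the standard expf can be measured over a range of inputs, using the fast_exp defined above:

#include <math.h>
#include <stdio.h>

int main(void)
{
    float max_err = 0;
    for (float x = -5.0f; x <= 5.0f; x += 0.01f)
    {
        float approx = fast_exp(x);
        float exact  = expf(x);
        float err = fabsf(approx - exact) / exact;   // relative error
        if (err > max_err) max_err = err;
    }
    printf("max relative error: %f\n", max_err);     // expected to be a few percent
    return 0;
}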
The advantage of ARM NEON intrinsics is parallel computation, so here we compute exp for every element of an array and accumulate the results, and speed up that whole loop:

// Sum of fast exp over an array. len is assumed to be a multiple of 4.
inline float expf_sum(float* score, int len)
{
    float sum = 0;
    float32x4_t sum_vec;
    float32x4_t ai = vdupq_n_f32(1064807160.56887296), bi;
    int32x4_t   int_vec;
    int value;

    for (int i = 0; i < len; i += 4)
    {
        bi = vld1q_f32(score + i);                         // load 4 inputs
        // 12102203.16*x + 1064807160.57 == (1<<23)*(x/ln2 + bias)
        sum_vec = vmlaq_n_f32(ai, bi, 12102203.1616540672);
        int_vec = vcvtq_s32_f32(sum_vec);                  // convert to the integer bit pattern

        // Reinterpret each 32-bit lane as a float and accumulate.
        value = vgetq_lane_s32(int_vec, 0);
        sum += (*(float*)(&value));
        value = vgetq_lane_s32(int_vec, 1);
        sum += (*(float*)(&value));
        value = vgetq_lane_s32(int_vec, 2);
        sum += (*(float*)(&value));
        value = vgetq_lane_s32(int_vec, 3);
        sum += (*(float*)(&value));
    }

    return sum;
}

In the original algorithm, (1<<23) is computed first and then multiplied by the rest of the expression. Here the constants are folded together so that it becomes a single multiply-add: 12102203.1616540672*x + 1064807160.56887296. The structure is very similar to the dot product: load 4 values, then perform the multiply-add. The result is then converted to an int vector, and each lane's bit pattern is reinterpreted as a float through a pointer cast and accumulated. Compared with the original exp accumulation, the speed improves by about 5 to 6 times.
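For example, an approximate softmax over a score array could be built on top of the two functions above (a hedged sketch of my own; the names softmax_fast and out are placeholders, and len is assumed to be a multiple of 4):

// Approximate softmax using fast_exp and expf_sum defined above.
void softmax_fast(float* score, float* out, int len)
{
    float sum = expf_sum(score, len);       // sum of approximate exponentials
    for (int i = 0; i < len; i++)
        out[i] = fast_exp(score[i]) / sum;  // normalize each approximate exp
}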
Anyone working in deep learning will not need me to explain what these two operations are useful for~
