NEON-accelerated image transposition

The usual way to transpose a matrix is to visit its elements one by one. For an m×n matrix the time complexity is O(mn). With ARM's NEON SIMD extension, several elements can be loaded, rearranged, and stored in parallel. The time complexity is still O(mn), but the constant factor shrinks, and operating on values held in registers is faster than touching memory element by element, so a measurable speedup results. The transposition below is for a grayscale image whose pixels are stored as uint8_t. Plain C version:

/* Naive transpose: src is w columns wide, dst is h columns wide. */
int x, y;
for (y = 0; y < h; y++)
{
    for (x = 0; x < w; x++)
        dst_gray[x * h + y] = src_gray[y * w + x];
}

neon acceleration idea:
Consider dividing the matrix into sub-matrices: a 128×256 matrix, for example, splits into 16×32 = 512 sub-matrices of size 8×8. Transpose each 8×8 sub-matrix separately and copy it to the correct coordinates in the output matrix. This can be summed up in 2 steps:
The following steps are performed in a loop until all sub-matrices have been processed:
1. Transpose the current sub-matrix
2. Copy the transposed sub-matrix into the output matrix
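The two steps above can be sketched in portable scalar C (a stand-in for the NEON kernel; `transpose8x8_scalar` is a hypothetical helper name, not part of the original code):

```c
#include <stdint.h>

/* Transpose one 8x8 block: src has row stride w, dst has row stride h. */
static void transpose8x8_scalar(const uint8_t *src, uint8_t *dst, int w, int h)
{
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            dst[j * h + i] = src[i * w + j];
}

/* Block transpose of a w x h image; leftovers handled element by element. */
void transpose_blocked(const uint8_t *src, uint8_t *dst, int w, int h)
{
    int sw = w & ~7, sh = h & ~7;          /* round down to multiples of 8 */
    for (int y = 0; y < sh; y += 8)        /* steps 1 and 2 per 8x8 block */
        for (int x = 0; x < sw; x += 8)
            transpose8x8_scalar(src + y * w + x, dst + x * h + y, w, h);
    for (int y = sh; y < h; y++)           /* leftover rows */
        for (int x = 0; x < w; x++)
            dst[x * h + y] = src[y * w + x];
    for (int x = sw; x < w; x++)           /* leftover columns */
        for (int y = 0; y < sh; y++)
            dst[x * h + y] = src[y * w + x];
}
```

The NEON version below keeps exactly this loop structure and replaces only the inner 8×8 transpose with register operations.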
The NEON instruction vtrn is the core of the transpose: it rearranges elements between two vectors as if transposing 2×2 blocks. For an 8×8 matrix of bytes, first apply vtrn_u8 to adjacent row pairs, then reinterpret the vectors as uint16x4_t and apply vtrn_u16 to rows two apart, then reinterpret as uint32x2_t and apply vtrn_u32 to rows four apart, and finally reinterpret back to uint8x8_t and store each row to the corresponding location in the target matrix. The leftover data also needs attention: since the kernel works on 8×8 blocks, any remaining rows or columns that cannot fill an 8×8 block must be processed element by element.
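The vtrn interleave pattern can be modeled in scalar C (an illustration of the semantics only; the real intrinsic operates on NEON registers):

```c
#include <stdint.h>

/* Scalar model of vtrn_u8: view a and b as two rows and transpose each
 * 2x2 block, which amounts to swapping a[i+1] with b[i] for every even i.
 * Afterwards a = [a0,b0,a2,b2,...] and b = [a1,b1,a3,b3,...]. */
static void vtrn_u8_model(uint8_t a[8], uint8_t b[8])
{
    for (int i = 0; i < 8; i += 2) {
        uint8_t t = a[i + 1];
        a[i + 1] = b[i];
        b[i] = t;
    }
}
```

Applying this pattern at byte, 16-bit, and 32-bit granularity in succession (to row pairs at distance 1, 2, and 4) composes into the full 8×8 transpose.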

(Figure: the vtrn instruction)

NEON-optimized code:

#include <stdint.h>
#include <arm_neon.h>

int transposition_neon(uint8_t* src,uint8_t* dst,int w,int h)
{
    uint8x8x4_t mat1;
    uint8x8x4_t mat2;
    uint8x8x2_t temp1;
    uint8x8x2_t temp2;
    uint8x8x2_t temp3;
    uint8x8x2_t temp4;
    uint16x4x4_t temp11;
    uint16x4x4_t temp12;
    uint16x4x2_t temp5;
    uint16x4x2_t temp6;
    uint16x4x2_t temp7;
    uint16x4x2_t temp8;
    uint32x2x4_t temp21;
    uint32x2x4_t temp22;
    uint32x2x2_t res1;
    uint32x2x2_t res2;
    uint32x2x2_t res3;
    uint32x2x2_t res4;

    int dw=w&7;   /* remainder columns (w mod 8) */
    int dh=h&7;   /* remainder rows (h mod 8) */
    int sw=w-dw;  /* width rounded down to a multiple of 8 */
    int sh=h-dh;  /* height rounded down to a multiple of 8 */
    int x,y;
    for(y=0;y<sh;y=y+8)
    {
        for(x=0;x<sw;x=x+8)
        {
            mat1.val[0]=vld1_u8(src+y*w+x);
            mat1.val[1]=vld1_u8(src+(y+1)*w+x);
            mat1.val[2]=vld1_u8(src+(y+2)*w+x);
            mat1.val[3]=vld1_u8(src+(y+3)*w+x);
            mat2.val[0]=vld1_u8(src+(y+4)*w+x);
            mat2.val[1]=vld1_u8(src+(y+5)*w+x);
            mat2.val[2]=vld1_u8(src+(y+6)*w+x);
            mat2.val[3]=vld1_u8(src+(y+7)*w+x);
            temp1=vtrn_u8(mat1.val[0],mat1.val[1]);
            temp2=vtrn_u8(mat1.val[2],mat1.val[3]);
            temp3=vtrn_u8(mat2.val[0],mat2.val[1]);
            temp4=vtrn_u8(mat2.val[2],mat2.val[3]);

            temp11.val[0]=vreinterpret_u16_u8(temp1.val[0]);
            temp11.val[1]=vreinterpret_u16_u8(temp1.val[1]);
            temp11.val[2]=vreinterpret_u16_u8(temp2.val[0]);
            temp11.val[3]=vreinterpret_u16_u8(temp2.val[1]);
            temp12.val[0]=vreinterpret_u16_u8(temp3.val[0]);
            temp12.val[1]=vreinterpret_u16_u8(temp3.val[1]);
            temp12.val[2]=vreinterpret_u16_u8(temp4.val[0]);
            temp12.val[3]=vreinterpret_u16_u8(temp4.val[1]);

            temp5=vtrn_u16(temp11.val[0],temp11.val[2]);
            temp6=vtrn_u16(temp11.val[1],temp11.val[3]);
            temp7=vtrn_u16(temp12.val[0],temp12.val[2]);
            temp8=vtrn_u16(temp12.val[1],temp12.val[3]);

            temp21.val[0]=vreinterpret_u32_u16(temp5.val[0]);
            temp21.val[1]=vreinterpret_u32_u16(temp5.val[1]);
            temp21.val[2]=vreinterpret_u32_u16(temp6.val[0]);
            temp21.val[3]=vreinterpret_u32_u16(temp6.val[1]);
            temp22.val[0]=vreinterpret_u32_u16(temp7.val[0]);
            temp22.val[1]=vreinterpret_u32_u16(temp7.val[1]);
            temp22.val[2]=vreinterpret_u32_u16(temp8.val[0]);
            temp22.val[3]=vreinterpret_u32_u16(temp8.val[1]);

            res1=vtrn_u32(temp21.val[0],temp22.val[0]);
            res2=vtrn_u32(temp21.val[1],temp22.val[1]);
            res3=vtrn_u32(temp21.val[2],temp22.val[2]);
            res4=vtrn_u32(temp21.val[3],temp22.val[3]);

            mat1.val[0]=vreinterpret_u8_u32(res1.val[0]);
            mat1.val[1]=vreinterpret_u8_u32(res2.val[0]);
            mat1.val[2]=vreinterpret_u8_u32(res3.val[0]);
            mat1.val[3]=vreinterpret_u8_u32(res4.val[0]);
            mat2.val[0]=vreinterpret_u8_u32(res1.val[1]);
            mat2.val[1]=vreinterpret_u8_u32(res2.val[1]);
            mat2.val[2]=vreinterpret_u8_u32(res3.val[1]);
            mat2.val[3]=vreinterpret_u8_u32(res4.val[1]);

            vst1_u8(dst+x*h+y,mat1.val[0]);
            vst1_u8(dst+(x+1)*h+y,mat1.val[1]);
            vst1_u8(dst+(x+2)*h+y,mat1.val[2]);
            vst1_u8(dst+(x+3)*h+y,mat1.val[3]);
            vst1_u8(dst+(x+4)*h+y,mat2.val[0]);
            vst1_u8(dst+(x+5)*h+y,mat2.val[1]);
            vst1_u8(dst+(x+6)*h+y,mat2.val[2]);
            vst1_u8(dst+(x+7)*h+y,mat2.val[3]);
        }
    }
    /* Leftover rows (fewer than 8) are transposed element by element.
       The loops must start at sh and sw, not sh-1 and sw-1: starting one
       earlier redoes an already-written row/column and, when sh or sw is 0,
       indexes out of bounds. */
    for(y=sh;y<h;y++)
    {
        for(x=0;x<w;x++)
            dst[x*h+y]=src[y*w+x];
    }
    /* Leftover columns for the rows already handled by the NEON kernel. */
    for(x=sw;x<w;x++)
    {
        for(y=0;y<sh;y++)
        {
            dst[x*h+y]=src[y*w+x];
        }
    }
    return 0;
}
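As a quick sanity check on the `w & 7` split used above (plain arithmetic, no NEON required): for the 1680×1050 test image, 1680 is already a multiple of 8, while 1050 leaves a remainder of 2 rows for the scalar tail.

```c
/* Split a dimension n into a multiple of 8 (handled by the NEON kernel)
 * plus a remainder of 0..7 (handled element by element). */
static void split8(int n, int *main_part, int *tail)
{
    *tail = n & 7;          /* same as n % 8 for non-negative n */
    *main_part = n - *tail; /* largest multiple of 8 not exceeding n */
}
```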

Test image size: 1680×1050
Test platform: HiSilicon Hi3559
Test result: with O3-level optimization the NEON version runs about 2.5× faster (the effect of O3 seems to depend on the platform and has not been studied further); with default compilation it is about 1.5× faster. Some online sources claim a 10× speedup, but I do not know how that is achieved.

References:
http://blog.csdn.net/jxt1234and2010/article/details/50437884
http://book.51cto.com/art/201506/481001.htm
http://www.cnblogs.com/hrlnw/p/3723072.html
http://www.cnblogs.com/hrlnw/p/3767853.html
