# AVX / AVX2 programming instructions

While recently working on accelerating an encryption algorithm, which involves a large number of matrix operations in C, I needed AVX instructions for the optimization. What follows is not just ordinary beginner notes; it is mainly a description of the functions themselves (a translated digest of the documentation). Please credit the source when reposting, or I'll curse you so that every line of code you write from now on has a bug >_<.

Before reading, you need to know:

1. Basic C
2. What SIMD is

## Official references (if you can read these, skip whatever nonsense I write below):

All SSE, AVX, AVX2, AVX-512 and other Intel intrinsics can be found here:

19.0U1_CPP_Compiler_DGR_0.pdf (software.intel.com)

Online version (searchable):

Intrinsics Guide (software.intel.com)

## A taste of SIMD - a first try at SIMD programming:

Without SIMD:

    void vectorAdd(const float *a, const float *b, float *c) {
        for (int i = 0; i < 8; i++) {
            c[i] = a[i] + b[i]; // one statement can only operate on two single values
        } // the loop has to repeat to add the whole array
    }

With AVX:

    __m256 vectorAdd(__m256 a, __m256 b) {
        return _mm256_add_ps(a, b); // one call operates on the whole vector at once - done in one step
    }

## AVX common data types

When working with AVX vectors (arrays), it's best to think of each element in terms of how many bits it takes up.

Kindergarten-level introduction:

Every data type name starts with two underscores.

128 in the name = the vector holds 128 bits = 16 bytes

256 in the name = the vector holds 256 bits = 32 bytes

No suffix at the end = floats inside; one float = 32 bits, so work out for yourself how many fit

Suffix d = doubles inside; one double = 64 bits, so work out for yourself how many fit

Suffix i = integers inside; how many fit depends on the integer type

A proper introduction:

The two floating-point vector families are listed separately: __m128 and __m256 are each made of 4-byte floats (4 and 8 of them, respectively); __m128d and __m256d are made of 8-byte doubles (2 and 4, respectively).

__m128i and __m256i are vectors made of integers; char, short, int, and long long all count as integers (plus their unsigned variants), so a __m256i, for example, can consist of 32 chars, 16 shorts, 8 ints, or 4 long longs.

Further: AVX-512 works the same way, i.e. __m512 / __m512d / __m512i.

## Function name

_mm<bit_width>_<name>_<data_type>
• <bit_width> The size of the returned vector: 256 when a 256-bit vector is returned, empty when a 128-bit one is returned. Special cases: the store family returns nothing (void); the test family compares two inputs and returns 0 or 1.
• <name> The name of the operation; for the basic functions you can tell what they do from the name ~
• <data_type> Tells the function what type to treat the data as while processing it

About <data_type>: because __m256i / __m128i come in so many flavors, you have to tell AVX how big each integer inside is.

1. ps: the elements are floats; every 32 bits is one number
2. pd: the elements are doubles; every 64 bits is one number
3. epi8/epi16/epi32/epi64: every element in the vector is a signed integer of 8/16/32/64 bits
4. epu8/epu16/epu32/epu64: every element in the vector is an unsigned integer of 8/16/32/64 bits
5. m128/m128i/m128d/m256/m256i/m256d: appears when the input and return types differ, e.g. __m256i _mm256_setr_m128i(__m128i lo, __m128i hi) takes two __m128i vectors and joins them into one __m256i to return. This kind of suffix basically only shows up in load/set functions
6. si128/si256: doesn't care what types the vector actually holds — it's just 128/256 bits, e.g.:
__m256i _mm_broadcastsi128_si256 (__m128i a)

si128/si256 basically only show up in:

• broadcast: copies the __m128i twice and puts both copies into a __m256i
• cast:
• __m128i cast to a __m256i (upper 128 bits undefined),
• __m256i cast down to a __m128i

1. Quiz: suppose I have loaded 32 8-bit integers (char a[32]) into a __m256i, and I want to treat every two of them as one number (a[0] and a[1] as one number, a[2] and a[3] as one number, and so on). I name this operation "dodoMagic", and it finally returns a __m256i vector. What should this function be called? [1] (answer in the footnote)

## Functions for [initialization / loading]

Intrinsics for Load and Store Operations (& Set Operations)

1. Initialize to all zeros
2. Initialize with hand-picked values
3. Load from an existing array
4. Load from a pointer/address

pending

## Functions you can figure out from their names once you've seen add/subtract

• Intrinsics for Bitwise Operations - bitwise logic
• bitwise and / andnot / or / xor
• In AVX2 these bitwise functions only come in one flavor: __m256i in, __m256i out
• Intrinsics for Compare Operations - comparisons
• _mm256_cmpeq_epi8/16/32/64
• takes two __m256i, treating every 8/16/32/64 bits as one number,
• and compares the two input vectors number by number;
• if a pair is equal, all bits of that number in the returned vector are set to 1,
• if not, all bits of that number in the returned vector are set to 0
• _mm256_cmpgt_epi8/16/32/64
• takes two __m256i, s1 and s2, treating every 8/16/32/64 bits as one number;
• if the i-th number of s1 > the i-th number of s2, all bits of that number in the returned vector are set to 1
• if the i-th number of s1 <= the i-th number of s2, all bits of that number in the returned vector are set to 0
• _mm256_max_epi8/16/32
• each number in the returned vector is set to the larger of s1[i] and s2[i]
• _mm256_max_epu8/16/32
• _mm256_min_epi8/16/32
• each number in the returned vector is set to the smaller of s1[i] and s2[i]
• _mm256_min_epu8/16/32

## Fused multiply-add functions - (AVX2) Intrinsics for Fused Multiply Add Operations

• The functions included (don't bother memorizing the ones you won't use):
• Slot 1: _mm / _mm256
• Slot 2: _fmadd / fmsub / fmaddsub / fmsubadd / fnmadd / fnmsub
• Slot 3: ps / pd
• All combinations of slots 1 + 2 + 3: 2 × 6 × 2 = 24 functions in total
• Slot 4: _mm
• Slot 5: _fmadd / fmsub / fnmadd / fnmsub
• Slot 6: ss / sd
• All combinations of slots 4 + 5 + 6: 1 × 4 × 2 = 8 functions in total

Quick mnemonic: _mm256 has no ss or sd variants, and fmaddsub / fmsubadd have no ss or sd variants either.

## Rearrange? Shuffle-pick? I don't know what to call it in Chinese - Intrinsics for Permute Operations

• _mm256_permutevar8x32_epi32
• __m256i _mm256_permutevar8x32_epi32(__m256i a,__m256i idx)
• _mm256_permutevar8x32_ps
• __m256 _mm256_permutevar8x32_ps(__m256 a,__m256i idx)
• _mm256_permute4x64_epi64
• __m256i _mm256_permute4x64_epi64(__m256i a,const int imm8)
• _mm256_permute4x64_pd
• __m256d _mm256_permute4x64_pd(__m256d a,const int imm8)
• _mm256_permute2x128_si256
• __m256i _mm256_permute2x128_si256(__m256i a,__m256i b,const int imm8)

Permute, illustrated:

Of the first two, _mm256_permutevar8x32_epi32 and _mm256_permutevar8x32_ps, the only difference is the input and return types.

_mm256_permutevar8x32_ps

a is the source input, b the returned vector, idx the rearrangement index.

That is: if idx[i] holds the value m, the operation is b[i] = a[m].

Similarly for the middle two: _mm256_permute4x64_epi64 takes and returns __m256i, while _mm256_permute4x64_pd takes and returns __m256d.

_mm256_permute4x64_epi64

a is the source input, b the returned vector, imm8 the control / rearrangement index.

The int control value is read in binary (bit by bit) and cut into (256 bit) / (epi64) = 4 two-bit parts, each of which becomes an index.

As illustrated: control = 173, binary 10101101, which cuts into 10, 10, 11, 01. Translated to decimal, the index m for each position is 2, 2, 3, 1 respectively. Applying b[i] = a[m]:

That is: b[3]=a[2], b[2]=a[2], b[1]=a[3], b[0]=a[1]

## A slightly different rearrange? shuffle-pick? instruction - Intrinsics for Shuffle Operations

I call these the "shuffle with a wall" instructions.

For the versions whose input and output are 128-bit vectors:

Figure source: https://www.officedaytime.com/simd512e/simdimg/si.php?f=pshufb

When bit 7 of b[i] is 1, return[i] = 0.

When bit 7 of b[i] is 0, read the low 4 bits of b[i] as a value m; then return[i] = a[m].

For the versions whose input and output are wider than 128 bits, such as _mm256_shuffle_epi8 and _mm512_shuffle_epi8:

The 256/512-bit versions work the same way; see https://www.officedaytime.com/simd512e/simdimg/si.php?f=pshufb

Every 128 bits form a lane, and each byte can only be rearranged within its own 128-bit lane — content from the first 128 bits cannot be moved into the second 128 bits. The other 128-bit-lane shuffles behave likewise.

1. _mm256_shuffle_epi8
2. _mm256_shuffle_epi32
3. _mm256_shufflehi_epi16
4. _mm256_shufflelo_epi16

## Variable-shift instructions - (AVX2) Intrinsics for Logical Shift Operations

"Variable shift" means that every number in a vector can be shifted by a different number of bits (in the same direction only). Before AVX2, all elements could only be shifted by the same amount.

pending


## References

1. ^ Answer: it should be called __m256i _mm256_dodoMagic_epi16(__m256i a) — the parameter is a __m256i and the return type is also __m256i

Origin: www.cnblogs.com/jinanxiaolaohu/p/12424622.html