AVX / AVX2 programming instructions
Source: https://zhuanlan.zhihu.com/p/94649418 (I found it well explained, although I still did not understand the difference between AVX2 and AVX512.)
I have recently been accelerating an encryption algorithm. Because it involves a large number of matrix operations written in C, the optimization needs AVX instructions. This article is not a systematic introduction, just ordinary beginner's notes, mainly describing what the functions do (a localization of the documentation). Please credit the source when reposting, or I will draw circles and curse you to write nothing but bugs from now on >_<.
Before reading, you should know:
- Basic C
- What SIMD is
Official references (if you can read these, there is no need to read any of the rubbish I wrote):
All the intrinsics for SSE, AVX, AVX2, AVX512 and Intel's other extensions can be found here:
PDF version: 19.0U1_CPP_Compiler_DGR_0.pdf
Online version (with filtering): Intrinsics Guide
A taste of SIMD - a first try at programming with SIMD:
Source: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions
Without SIMD:
void vectorAdd(const float* a, const float* b, float* c) {
    for (int i = 0; i < 8; i++) {
        c[i] = a[i] + b[i]; // one statement can only operate on two single values
    }                       // the loop has to repeat to add the whole arrays
}
With AVX:
#include <immintrin.h>
__m256 vectorAdd(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b); // one call operates on the whole array at once
}
AVX common data types
When working with an AVX vector (array), it is best to think of its size in bits.
Kindergarten-level introduction:
- Every data type name starts with two underscores.
- A name containing 128 = a vector of 128 bits = 16 bytes.
- A name containing 256 = a vector of 256 bits = 32 bytes.
- No suffix = the vector holds float. One float = 32 bits, so work out for yourself how many fit.
- Suffix d = the vector holds double. One double = 64 bits, so again work out how many fit.
- Suffix i = the vector holds integers. How many depends on the integer type.
Proper introduction:
The two kinds of floating-point vector are listed separately: __m128 and __m256 are made up of float values of 4 bytes each; __m128d and __m256d are made up of double values of 8 bytes each.
__m128i and __m256i are vectors made up of integers. char, short, int and long are all integer types (plus their unsigned variants), so a __m256i, for example, can consist of 32 char, or 16 short, or 8 int, or 4 long (64-bit).
In addition, AVX-512 follows the same pattern: __m512 / __m512d / __m512i.
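To make the sizes concrete, here is a minimal sketch that derives the element counts from sizeof. The LANES macro and the helper function names are made up for this illustration; only the vector types come from immintrin.h.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper macro: how many elements of elem_type fit in vec_type. */
#define LANES(vec_type, elem_type) (sizeof(vec_type) / sizeof(elem_type))

/* Illustrative helpers, not part of any library. */
size_t floats_per_m128(void)   { return LANES(__m128,  float);  }
size_t floats_per_m256(void)   { return LANES(__m256,  float);  }
size_t doubles_per_m256d(void) { return LANES(__m256d, double); }
size_t chars_per_m256i(void)   { return LANES(__m256i, char);   }
size_t shorts_per_m256i(void)  { return LANES(__m256i, short);  }
size_t ints_per_m256i(void)    { return LANES(__m256i, int);    }
```

With 4-byte float and int and 8-byte double, which is what common x86-64 compilers use, this gives 4 floats per __m128, 8 floats per __m256, 4 doubles per __m256d, and 32 / 16 / 8 elements for char / short / int in a __m256i.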
Function name
_mm<bit_width>_<name>_<data_type>
<bit_width>
The size of the returned vector: 256 if the function returns a 256-bit vector, empty if it returns a 128-bit vector. There are special cases: store functions return nothing (void), and the test family checks whether two inputs are the same and returns 0 or 1.
<name>
The name of the operation. You can usually tell what the function does from the name.
<data_type>
Indicates what type the function treats its input data as while processing it.
About <data_type>: because __m256i / __m128i can hold several kinds of integer, you need to tell AVX how large each integer inside is.
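1. ps: the vector holds float; every 32 bits is treated as one number.
2. pd: the vector holds double; every 64 bits is treated as one number.
3. epi8 / epi16 / epi32 / epi64: every number in the vector is a signed integer of 8 / 16 / 32 / 64 bits.
4. epu8 / epu16 / epu32 / epu64: every number in the vector is an unsigned integer of 8 / 16 / 32 / 64 bits.
5. m128 / m128i / m128d / m256 / m256i / m256d: appears when the input type differs from the return type. For example, __m256i _mm256_setr_m128i(__m128i lo, __m128i hi) takes two __m128i vectors and joins them into one __m256i, which it returns. Apart from that, this kind of suffix is only seen on load functions.
6. si128 / si256: does not care what types are inside the vector, only that it is 128 / 256 bits. For example:
__m256i _mm256_broadcastsi128_si256(__m128i a)
(older documentation spells it _mm_broadcastsi128_si256). This function takes a 128-bit __m128i and copies its 128 bits, bit for bit, into the low 128 bits (127-0) of the returned __m256i, and again into the high 128 bits (255-128). Because no per-element operation is involved and the 128 bits are handled as a whole, the data type of the individual elements does not matter.
si128 / si256 appears on operations that treat the register as a whole:
- broadcast: copies a __m128i twice into a __m256i;
- cast: converts __m128i to __m256i (upper 128 bits undefined), or __m256i to __m128i;
- whole-register integer load / store and bitwise logic, for example _mm256_loadu_si256 and _mm256_and_si256.
In short, kinds 5 and 6 are less common than the others.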
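The effect of the <data_type> suffix can be seen by feeding the same 256 bits to the same operation with two different suffixes. This is a minimal sketch, not taken from the original article: the function names add_as_epi8 and add_as_epi32 are made up, and the target attribute is a GCC/Clang extension used so that the sketch compiles without -mavx2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Treat the 32 bytes as 32 independent 8-bit integers and add them. */
__attribute__((target("avx2"))) /* GCC/Clang: enable AVX2 for this function only */
void add_as_epi8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi8(va, vb));
}

/* Treat the same 32 bytes as 8 independent 32-bit integers and add them. */
__attribute__((target("avx2")))
void add_as_epi32(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi32(va, vb));
}
```

If every byte of a is 0xFF and every byte of b is 0x01, the epi8 version wraps each byte to 0x00 independently, while the epi32 version lets the carry travel across the four bytes of each 32-bit number, so only the lowest byte of each group becomes 0x00 and the other three become 0x01.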
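Quiz:
- I have loaded an array of 32 8-bit integers (say int8_t a[32]) into a __m256i, and I want to operate on every two integers as one number (a[0] and a[1] as one number, a[2] and a[3] as one number, and so on). I name this operation "dodoMagic", and it returns a vector of type __m256i. What should this function be called? [1] (the answer is in the notes)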
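Functions for initialization / loading
Intrinsics for Load and Store Operations (& Set Operations)
I do not feel like writing the details today, so here is an outline:
- Initialize to all zeros
- Initialize with manually given values
- Load from an existing array
- Load from a pointer / address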
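The four items in the outline can be sketched as follows. This is only one possible choice of intrinsics per item (the article does not say which ones the author had in mind), and the function name init_examples is made up for the illustration.

```c
#include <immintrin.h>

/* Fills three 8-int output arrays using three ways of creating a __m256i. */
__attribute__((target("avx2"))) /* GCC/Clang: enable AVX2 for this function only */
void init_examples(const int *src, int *zeros, int *manual, int *loaded)
{
    /* Initialize to all zeros. */
    __m256i z = _mm256_setzero_si256();

    /* Initialize with manually given values. The "r" in setr means the
       arguments are given in memory order, element 0 first.
       (_mm256_set_epi32 takes them in the opposite order.) */
    __m256i m = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

    /* Load from an existing array or pointer. loadu accepts any alignment;
       _mm256_load_si256 requires a 32-byte aligned address. */
    __m256i l = _mm256_loadu_si256((const __m256i *)src);

    /* Store the vectors back to memory so the caller can inspect them. */
    _mm256_storeu_si256((__m256i *)zeros, z);
    _mm256_storeu_si256((__m256i *)manual, m);
    _mm256_storeu_si256((__m256i *)loaded, l);
}
```

Each output array must have room for 8 int values. The float and double counterparts follow the same naming pattern, for example _mm256_setzero_ps and _mm256_loadu_pd.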
Common add / subtract / multiply / divide functions - Intrinsics for Arithmetic Operations
pending
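Until this section is written, here is a minimal sketch of the four basic float operations on 8 values at a time. It is an illustration under my own assumptions, not the author's planned content, and the function name arith8 is made up.

```c
#include <immintrin.h>

/* Element-wise +, -, *, / on two arrays of 8 floats. */
__attribute__((target("avx"))) /* GCC/Clang: enable AVX for this function only */
void arith8(const float *a, const float *b,
            float *sum, float *diff, float *prod, float *quot)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);

    _mm256_storeu_ps(sum,  _mm256_add_ps(va, vb)); /* a[i] + b[i] */
    _mm256_storeu_ps(diff, _mm256_sub_ps(va, vb)); /* a[i] - b[i] */
    _mm256_storeu_ps(prod, _mm256_mul_ps(va, vb)); /* a[i] * b[i] */
    _mm256_storeu_ps(quot, _mm256_div_ps(va, vb)); /* a[i] / b[i] */
}
```

The double versions use the pd suffix with __m256d. Integer addition and subtraction exist as _mm256_add_epi32 and friends in AVX2, but there is no packed integer division intrinsic in AVX2.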
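Functions you can work out from their names once you have seen addition and subtraction
- Intrinsics for Bitwise Operations - bitwise logic
  - Bitwise and / andnot / or / xor
  - AVX2 itself only has the bitwise logic functions whose inputs are of type __m256i and whose output is also of type __m256i
- Intrinsics for Compare Operations - comparison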
_mm256_cmpeq_epi8/16/32/64
- Takes two __m256i inputs and treats every 8 / 16 / 32 / 64 bits as one number.
- Compares the two input vectors number by number for equality.
- If they are equal, all bits of the corresponding number in the returned vector are set to 1.
- If they differ, all bits of the corresponding number in the returned vector are set to 0.
_mm256_cmpgt_epi8/16/32/64
- Takes two __m256i inputs, s1 and s2, and treats every 8 / 16 / 32 / 64 bits as one signed number.
- If the i-th number of s1 > the i-th number of s2, all bits of the corresponding number in the returned vector are set to 1.
- If the i-th number of s1 <= the i-th number of s2, all bits of the corresponding number in the returned vector are set to 0.
_mm256_max_epi8/16/32
- Each number in the returned vector is set to the larger of s1[i] and s2[i].
_mm256_max_epu8/16/32
_mm256_min_epi8/16/32
- Each number in the returned vector is set to the smaller of s1[i] and s2[i].
_mm256_min_epu8/16/32
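A compare result is a mask of all-ones or all-zeros per number, which is why it combines naturally with the bitwise functions. A minimal sketch, with the made-up function name compare_demo, that produces the cmpgt mask, uses it with a bitwise and to keep only the numbers of a that are greater than b, and also computes the per-number maximum:

```c
#include <immintrin.h>
#include <stdint.h>

/* mask[i] = all ones if a[i] > b[i], else 0
   kept[i] = a[i] if a[i] > b[i], else 0
   maxv[i] = larger of a[i] and b[i]                        */
__attribute__((target("avx2"))) /* GCC/Clang: enable AVX2 for this function only */
void compare_demo(const int32_t *a, const int32_t *b,
                  int32_t *mask, int32_t *kept, int32_t *maxv)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    __m256i m  = _mm256_cmpgt_epi32(va, vb); /* signed a > b, per 32 bits   */
    __m256i k  = _mm256_and_si256(va, m);    /* zero out where mask is 0    */
    __m256i mx = _mm256_max_epi32(va, vb);   /* signed maximum, per 32 bits */

    _mm256_storeu_si256((__m256i *)mask, m);
    _mm256_storeu_si256((__m256i *)kept, k);
    _mm256_storeu_si256((__m256i *)maxv, mx);
}
```

Read as a signed 32-bit integer, an all-ones number is -1, so a "true" entry of mask prints as -1 rather than 1.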
Fused multiply-add functions - (AVX2) Intrinsics for Fused Multiply Add Operations
- Functions included (skip this part if you will not use them)
- Zone 1: _mm / _mm256
- Zone 2: _fmadd / _fmsub / _fmaddsub / _fmsubadd / _fnmadd / _fnmsub
- Zone 3: _ps / _pd
- All combinations of Zones 1 + 2 + 3: 24 functions in total
- Zone 4: _mm
- Zone 5: _fmadd / _fmsub / _fnmadd / _fnmsub
- Zone 6: _ss / _sd
- All combinations of Zones 4 + 5 + 6: 8 functions in total
Easy way to remember: _mm256 has no ss or sd variants, and fmaddsub and fmsubadd have no ss or sd variants either.
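Three of the combinations from Zones 1 + 2 + 3, as a minimal sketch. The function name fma_demo is made up. Note that on the CPU side FMA is reported as its own feature flag, separate from AVX2, which is why the target attribute (a GCC/Clang extension) lists both.

```c
#include <immintrin.h>

/* For 8 floats each:
   madd[i]   =  a[i] * b[i] + c[i]
   nmadd[i]  = -(a[i] * b[i]) + c[i]
   addsub[i] =  a[i] * b[i] - c[i] for even i, a[i] * b[i] + c[i] for odd i */
__attribute__((target("avx2,fma"))) /* GCC/Clang: enable AVX2 and FMA here only */
void fma_demo(const float *a, const float *b, const float *c,
              float *madd, float *nmadd, float *addsub)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(c);

    _mm256_storeu_ps(madd,   _mm256_fmadd_ps(va, vb, vc));
    _mm256_storeu_ps(nmadd,  _mm256_fnmadd_ps(va, vb, vc));
    _mm256_storeu_ps(addsub, _mm256_fmaddsub_ps(va, vb, vc));
}
```

The multiply and the add are done with a single rounding at the end, so results can differ in the last bit from a separate _mm256_mul_ps followed by _mm256_add_ps.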
Rearrangement? Reordering? I do not know the proper Chinese term - Intrinsics for Permute Operations
- _mm256_permutevar8x32_epi32
__m256i _mm256_permutevar8x32_epi32(__m256i a,__m256i idx)
- _mm256_permutevar8x32_ps
__m256 _mm256_permutevar8x32_ps(__m256 a,__m256i idx)
- _mm256_permute4x64_epi64
__m256i _mm256_permute4x64_epi64(__m256i a,const int imm8)
- _mm256_permute4x64_pd
__m256d _mm256_permute4x64_pd(__m256d a,const int imm8)
- _mm256_permute2x128_si256
__m256i _mm256_permute2x128_si256(__m256i a,__m256i b,const int imm8)
Permute illustrated
The first two, _mm256_permutevar8x32_epi32 and _mm256_permutevar8x32_ps, differ only in their input and return types.
a is the input source, b is the returned vector, and idx is the rearrangement index.
That is: if idx[i] stores the value m, the operation performed is b[i] = a[m].
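The rule b[i] = a[idx[i]] can be tried directly. A minimal sketch, where the wrapper name permute_by_index is made up and arrays are used for input and output so that the results are easy to inspect:

```c
#include <immintrin.h>
#include <stdint.h>

/* b[i] = a[idx[i]] for i = 0..7; only the low 3 bits of each idx[i] are used. */
__attribute__((target("avx2"))) /* GCC/Clang: enable AVX2 for this function only */
void permute_by_index(const int32_t *a, const int32_t *idx, int32_t *b)
{
    __m256i va   = _mm256_loadu_si256((const __m256i *)a);
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    __m256i vb   = _mm256_permutevar8x32_epi32(va, vidx);
    _mm256_storeu_si256((__m256i *)b, vb);
}
```

For example, idx = {7, 6, 5, 4, 3, 2, 1, 0} reverses the vector, and an idx of all zeros copies a[0] into every position. Unlike the shuffle family, this permute can move a number across the two 128-bit halves.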
The middle two are similar: _mm256_permute4x64_epi64 takes and returns __m256i; _mm256_permute4x64_pd takes and returns __m256d.
a is the input source, b is the returned vector, and imm8 is the control that encodes the rearrangement index.
The int control is read in binary, bit by bit. Since 256 bits (__m256i) / 64 bits (epi64) = 4, it is cut into 4 fields of 2 bits, and each field gives one index.
As illustrated: control = 173 is 10101101 in binary, which is cut into 10, 10, 11, 01. Converted to decimal, the index m for each position is 2, 2, 3, 1. Applying b[i] = a[m]
gives: b[3] = a[2]; b[2] = a[2]; b[1] = a[3]; b[0] = a[1]
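The decoding of control = 173 can be checked against a plain scalar loop. In this minimal sketch both function names are made up; permute4x64_reference is my own restatement of the rule above, not code from the article.

```c
#include <immintrin.h>
#include <stdint.h>

/* The real instruction with the control from the example (173 = 0b10101101).
   The control must be a compile-time constant. */
__attribute__((target("avx2"))) /* GCC/Clang: enable AVX2 for this function only */
void permute_173(const int64_t *a, int64_t *b)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_permute4x64_epi64(va, 173);
    _mm256_storeu_si256((__m256i *)b, vb);
}

/* Scalar restatement: bits [2i+1 : 2i] of imm8 select the source of b[i]. */
void permute4x64_reference(const int64_t *a, int imm8, int64_t *b)
{
    for (int i = 0; i < 4; i++) {
        int m = (imm8 >> (2 * i)) & 3; /* 2-bit field for position i */
        b[i] = a[m];
    }
}
```

With a = {100, 101, 102, 103}, both functions produce b = {101, 103, 102, 102}, matching b[0] = a[1], b[1] = a[3], b[2] = a[2], b[3] = a[2].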
A slightly different rearrangement instruction - Intrinsics for Shuffle Operations
I call it the "walled shuffle" instruction.
For 128-bit input and output vectors:
Figure source: https://www.officedaytime.com/simd512e/simdimg/si.php?f=pshufb
When bit 7 of b[i] is 1, return[i] = 0.
When bit 7 of b[i] is 0, the low 4 bits of b[i] are read as the value m, and return[i] = a[m].
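Both cases of the 128-bit rule can be seen in a minimal sketch using _mm_shuffle_epi8, the intrinsic for pshufb. The wrapper name shuffle16 is made up, and the target attribute is a GCC/Clang extension (pshufb belongs to SSSE3).

```c
#include <immintrin.h>
#include <stdint.h>

/* out[i] = 0 if bit 7 of b[i] is set, otherwise a[b[i] & 0x0F], for i = 0..15. */
__attribute__((target("ssse3"))) /* GCC/Clang: enable SSSE3 for this function only */
void shuffle16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(va, vb));
}
```

For instance, with a[i] = 100 + i, a control byte of 15 selects 115, a control byte of 0x80 gives 0, and a control byte of 0x13 selects a[3] = 103 because only the low 4 bits are used as the index.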
For input and output vectors larger than 128 bits, such as _mm256_shuffle_epi8 and _mm512_shuffle_epi8:
Every 128 bits forms one lane, and rearrangement only happens within a 128-bit lane: the contents of one 128-bit lane cannot be moved into another. The same applies to the other shuffles wider than 128 bits.
- _mm256_shuffle_epi8
- _mm256_shuffle_epi32
- _mm256_shufflehi_epi16
- _mm256_shufflelo_epi16
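The "wall" between the lanes shows up as soon as the same index is used in both halves. A minimal sketch, with the made-up wrapper name shuffle32_in_lane:

```c
#include <immintrin.h>
#include <stdint.h>

/* 256-bit byte shuffle: each 16-byte lane is shuffled on its own, so an
   index m in the upper lane refers to a[16 + m], not to a[m]. */
__attribute__((target("avx2"))) /* GCC/Clang: enable AVX2 for this function only */
void shuffle32_in_lane(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_shuffle_epi8(va, vb));
}
```

With a[i] = 100 + i and every control byte set to 0, the lower 16 output bytes are all 100 (a[0]) while the upper 16 are all 116 (a[16]). To move data across the wall, one of the permute functions from the previous section is needed.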
Variable shift instructions - (AVX2) Intrinsics for Logical Shift Operations
Each number in a vector can be shifted by a different number of bits (all in the same direction). Before AVX2, every number in the vector could only be shifted by the same amount.
pending
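Until this section is written, here is a minimal sketch of the difference, using _mm256_sllv_epi32 for the per-number shift and _mm256_slli_epi32 for the uniform one. The function name shift_demo is made up for the illustration.

```c
#include <immintrin.h>
#include <stdint.h>

/* variable[i] = a[i] << counts[i]   (a different count per number)
   uniform[i]  = a[i] << 2           (the same count for every number) */
__attribute__((target("avx2"))) /* GCC/Clang: enable AVX2 for this function only */
void shift_demo(const uint32_t *a, const uint32_t *counts,
                uint32_t *variable, uint32_t *uniform)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vc = _mm256_loadu_si256((const __m256i *)counts);

    _mm256_storeu_si256((__m256i *)variable, _mm256_sllv_epi32(va, vc));
    _mm256_storeu_si256((__m256i *)uniform,  _mm256_slli_epi32(va, 2));
}
```

With a all ones and counts = {0, 1, ..., 7}, variable becomes {1, 2, 4, ..., 128} and uniform is 4 everywhere. A count of 32 or more gives 0 for that number. The right-shift counterparts are _mm256_srlv_epi32 (logical) and _mm256_srav_epi32 (arithmetic).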
Loading scattered data - (AVX2) Intrinsics for GATHER Operations
pending
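Until this section is written, here is a minimal sketch of what a gather does: it loads 8 numbers from 8 unrelated positions of a table in one call. This is my own illustration rather than the author's planned content, and the wrapper name gather8 is made up.

```c
#include <immintrin.h>

/* out[i] = table[indices[i]] for i = 0..7 */
__attribute__((target("avx2"))) /* GCC/Clang: enable AVX2 for this function only */
void gather8(const int *table, const int *indices, int *out)
{
    __m256i vidx = _mm256_loadu_si256((const __m256i *)indices);

    /* The last argument is the byte scale applied to each index and must be
       a compile-time constant of 1, 2, 4 or 8; 4 matches a 4-byte int. */
    __m256i v = _mm256_i32gather_epi32(table, vidx, 4);

    _mm256_storeu_si256((__m256i *)out, v);
}
```

The caller is responsible for keeping every index inside the table, since the instruction does no bounds checking.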
Notes
- ^ Answer: it would be called _mm256_dodoMagic_epi16(__m256i a); the parameter is __m256i and the return type is also __m256i