Programming with AVX / AVX2 instructions


I have recently been accelerating an encryption algorithm. It contains a large number of matrix operations written in C, so optimizing it requires AVX instructions. This article is not a systematic introduction, just ordinary beginner's notes; the content is mainly descriptions of functions (a localization of the documentation). If you repost it, please credit the source, or I will draw circles and curse you so that all the code you write from now on has bugs >_<.

Before reading, you should know:

  1. Basic C
  2. What SIMD is

Official documentation (if you can read it, don't bother with the rubbish I wrote):

How to use all of Intel's SSE, AVX, AVX2, AVX-512 and other instructions can be found here:

PDF version:


Online version (supports filtering):

Intrinsics Guide (software.intel.com)

A first taste of SIMD programming:

Source: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions

Without SIMD:

void vectorAdd(const float* a, const float* b, float* c) {
    for (int i = 0; i < 8; i++) {
        c[i] = a[i] + b[i]; // one statement can only operate on two single values
    } // the loop has to repeat to finish adding the whole arrays
}

With AVX:

#include <immintrin.h>
__m256 vectorAdd(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b); // one call operates on the whole array at once, in a single step
}

Common AVX data types

When working with an AVX vector (array), it is best to think of it in units of bits.

Kindergarten-level introduction:

Every data type name starts with two underscores.

128 in the name = the vector holds 128 bits = 16 bytes

256 in the name = the vector holds 256 bits = 32 bytes

No suffix = the elements are float; one float = 32 bits, so work out for yourself how many fit

Suffix d = the elements are double; one double = 64 bits, so work out for yourself how many fit

Suffix i = the elements are integers; how many fit depends on the integer type

Proper introduction:


There are two kinds of floating-point vectors: __m128 and __m256, in which each number is a 4-byte float; and __m128d and __m256d, in which each number is an 8-byte double.

__m128i and __m256i are vectors of integers. char, short, int and long are all integer types (as are their unsigned variants), so a __m256i, for example, can be made up of 32 chars, 16 shorts, 8 ints or 4 64-bit longs.


AVX-512 follows the same pattern: __m512 / __m512d / __m512i.
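The element counts above can be checked from the type sizes alone. This is a small sketch, assuming GCC or Clang on x86-64 with a C11 compiler; the LANES macro and m256i_lanes helper are names made up for illustration, not part of any Intel header.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macro: how many elem_type values fit in one vec_type. */
#define LANES(vec_type, elem_type) (sizeof(vec_type) / sizeof(elem_type))

/* Compile-time checks (C11). They only inspect type sizes, so no AVX
   instruction is executed. */
_Static_assert(sizeof(__m128) == 16, "__m128 is 128 bits = 16 bytes");
_Static_assert(sizeof(__m256) == 32, "__m256 is 256 bits = 32 bytes");
_Static_assert(LANES(__m128, float) == 4, "4 floats per __m128");
_Static_assert(LANES(__m256, float) == 8, "8 floats per __m256");
_Static_assert(LANES(__m128d, double) == 2, "2 doubles per __m128d");
_Static_assert(LANES(__m256d, double) == 4, "4 doubles per __m256d");
_Static_assert(LANES(__m256i, int8_t) == 32, "32 8-bit integers per __m256i");
_Static_assert(LANES(__m256i, int16_t) == 16, "16 16-bit integers per __m256i");
_Static_assert(LANES(__m256i, int32_t) == 8, "8 32-bit integers per __m256i");
_Static_assert(LANES(__m256i, int64_t) == 4, "4 64-bit integers per __m256i");

/* Hypothetical helper: element count of a __m256i for a given element size in bytes. */
size_t m256i_lanes(size_t elem_bytes) {
    return sizeof(__m256i) / elem_bytes;
}
```

If any of the counts were wrong, the file would simply fail to compile.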

Function naming

Intrinsic names follow the pattern _mm<bit_width>_<name>_<data_type>:

  • <bit_width>: the width of the vector being returned. It is 256 when a 256-bit vector is returned and empty when a 128-bit vector is returned. There are special cases: store functions return nothing (void), and the test family (e.g. _mm256_testz_si256) performs a bitwise test on its two inputs and returns 0 or 1.
  • <name>: the name of the operation; you can usually tell what a function does from its name.
  • <data_type>: tells the function what type to treat the data as while processing it.

About <data_type>: because a __m256i / __m128i can hold several kinds of integers, you need to tell AVX how large each integer inside is.

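  1. ps: the elements are float; every 32 bits is treated as one number
  2. pd: the elements are double; every 64 bits is treated as one number
  3. epi8/epi16/epi32/epi64: every number in the vector is a signed integer of 8/16/32/64 bits
  4. epu8/epu16/epu32/epu64: every number in the vector is an unsigned integer of 8/16/32/64 bits
  5. m128/m128i/m128d/m256/m256i/m256d: appears when the input type differs from the return type. For example, __m256i _mm256_setr_m128i(__m128i lo, __m128i hi) takes two __m128i vectors and joins them into one __m256i, which it returns. This ending is only seen on load-style functions.
  6. si128/si256: does not care what types are inside the vector, only that it is 128/256 bits, for example:
__m256i _mm_broadcastsi128_si256 (__m128i a)

This function takes a 128-bit __m128i and copies its 128 bits, bit for bit, into the low 128 bits (127-0) of the returned __m256i, and then copies them again into the high 128 bits (255-128). Because no individual number is operated on and the 128 bits are treated as a whole, the data type of each number inside does not matter.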


  • broadcast: copies the __m128i twice into a __m256i
  • cast
    • __m128i cast to __m256i (upper 128 bits undefined)
    • __m256i cast to __m128i

In short, types 5 and 6 are not very common.
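Here is a rough sketch of the si128/si256 style in action, assuming GCC or Clang on an x86-64 CPU with AVX2. It uses the _mm256_-prefixed spelling of the broadcast intrinsic that those compilers' headers provide; the wrapper function names are invented for this example.

```c
#include <immintrin.h>
#include <stdint.h>

/* The target attribute is a GCC/Clang extension that enables AVX2 for this
   one function, so the file also builds without -mavx2. */
__attribute__((target("avx2")))
void broadcast_128_to_256(const int32_t src[4], int32_t dst[8]) {
    __m128i lo = _mm_loadu_si128((const __m128i *)src);
    /* The same 128 bits end up in both the low and the high half. */
    __m256i both = _mm256_broadcastsi128_si256(lo);
    _mm256_storeu_si256((__m256i *)dst, both);
}

__attribute__((target("avx2")))
void low_half_of_256(const int32_t src[8], int32_t dst[4]) {
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    /* Cast from 256 to 128 bits: keeps bits 127-0, drops the rest. */
    __m128i lo = _mm256_castsi256_si128(v);
    _mm_storeu_si128((__m128i *)dst, lo);
}
```

Neither function cares whether the bits are four ints or sixteen chars; int32_t is used only to make the result easy to inspect.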


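  1. Suppose I have loaded an array of 32 8-bit integers (int8_t a[32]) into a __m256i, and I want to operate on every two of them as one number (a[0] and a[1] as one number, a[2] and a[3] as one number, and so on). I name this operation "dodoMagic", and it returns a __m256i vector. What should this function be called? [1] (the answer is in the footnote)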

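Functions for initialization / loading

Intrinsics for Load and Store Operations (& Set Operations)


  1. Initialize to all zeros
  2. Initialize with manually given values
  3. Load from an existing array
  4. Load from a pointer/address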
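The four cases in the list can be sketched roughly as follows, assuming GCC or Clang on an x86-64 CPU with AVX. The function names are made up for illustration, and the unaligned loadu/storeu variants are used so that the arrays need no special alignment.

```c
#include <immintrin.h>

/* The target attribute (GCC/Clang extension) enables AVX per function,
   so no -mavx flag is needed. */

/* 1. All zeros. */
__attribute__((target("avx")))
void init_zero(float out[8]) {
    _mm256_storeu_ps(out, _mm256_setzero_ps());
}

/* 2. Manually given values. _mm256_setr_ps takes the lowest element first;
      _mm256_set_ps takes the same values in the opposite order. */
__attribute__((target("avx")))
void init_manual(float out_setr[8], float out_set[8]) {
    __m256 r = _mm256_setr_ps(0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f);
    __m256 s = _mm256_set_ps(7.f, 6.f, 5.f, 4.f, 3.f, 2.f, 1.f, 0.f);
    _mm256_storeu_ps(out_setr, r);
    _mm256_storeu_ps(out_set, s);
}

/* 3 and 4. Load 8 floats from an array or pointer, add, and store back. */
__attribute__((target("avx")))
void add8(const float *a, const float *b, float *c) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(c, _mm256_add_ps(va, vb));
}
```

In C an array decays to a pointer when passed, so cases 3 and 4 use the same load call.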

Commonly used add / subtract / multiply / divide functions - Intrinsics for Arithmetic Operations
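As an illustration of this family, here is a minimal sketch of element-wise arithmetic on 8 floats and 8 32-bit integers, assuming GCC or Clang on an x86-64 CPU with AVX2. The wrapper names are hypothetical; the intrinsics themselves are standard.

```c
#include <immintrin.h>
#include <stdint.h>

/* Float add / subtract / multiply / divide, 8 lanes at a time (AVX). */
__attribute__((target("avx")))
void float_arith8(const float *a, const float *b,
                  float *sum, float *diff, float *prod, float *quot) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(sum,  _mm256_add_ps(va, vb));
    _mm256_storeu_ps(diff, _mm256_sub_ps(va, vb));
    _mm256_storeu_ps(prod, _mm256_mul_ps(va, vb));
    _mm256_storeu_ps(quot, _mm256_div_ps(va, vb));
}

/* 32-bit integer add and multiply (AVX2). _mm256_mullo_epi32 keeps the
   low 32 bits of each 64-bit product. */
__attribute__((target("avx2")))
void int_add_mul8(const int32_t *a, const int32_t *b,
                  int32_t *sum, int32_t *prod) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)sum,  _mm256_add_epi32(va, vb));
    _mm256_storeu_si256((__m256i *)prod, _mm256_mullo_epi32(va, vb));
}
```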



  • Intrinsics for Bitwise Operations - bitwise logic
    • bitwise and/andnot/or/xor
    • In AVX2, the bitwise logic functions only take __m256i inputs and return __m256i
  • Intrinsics for Compare Operations - comparisons
    • _mm256_cmpeq_epi8/16/32/64
      • Takes two __m256i inputs and treats every 8/16/32/64 bits as one number,
      • compares the two input vectors number by number,
      • if two numbers are equal, sets all the bits of that number in the returned vector to 1,
      • otherwise sets all the bits of that number in the returned vector to 0
    • _mm256_cmpgt_epi8/16/32/64
      • Takes two __m256i inputs, s1 and s2, and treats every 8/16/32/64 bits as one signed number
      • If the i-th number of s1 > the i-th number of s2, all the bits of that number in the returned vector are set to 1
      • If the i-th number of s1 <= the i-th number of s2, all the bits of that number in the returned vector are set to 0
    • _mm256_max_epi8/16/32
      • Each number in the returned vector is set to the larger of s1[i] and s2[i]
    • _mm256_max_epu8/16/32
    • _mm256_min_epi8/16/32
      • Each number in the returned vector is set to the smaller of s1[i] and s2[i]
    • _mm256_min_epu8/16/32
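The compare functions return masks rather than booleans, which is why they pair naturally with the bitwise functions. A small sketch, assuming GCC or Clang on an x86-64 CPU with AVX2 (function names invented for this example): negative values are replaced by zero once with a compare mask plus AND, and once with max.

```c
#include <immintrin.h>
#include <stdint.h>

/* mask[i] = all ones (reads as -1) where a[i] == b[i], else 0. */
__attribute__((target("avx2")))
void equal_mask8(const int32_t a[8], const int32_t b[8], int32_t mask[8]) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)mask, _mm256_cmpeq_epi32(va, vb));
}

/* Keep positive values, zero the rest: compare, then AND with the mask. */
__attribute__((target("avx2")))
void clamp_negative_mask8(const int32_t in[8], int32_t out[8]) {
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i positive = _mm256_cmpgt_epi32(v, _mm256_setzero_si256());
    _mm256_storeu_si256((__m256i *)out, _mm256_and_si256(v, positive));
}

/* Same result using a signed maximum against zero. */
__attribute__((target("avx2")))
void clamp_negative_max8(const int32_t in[8], int32_t out[8]) {
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out,
                        _mm256_max_epi32(v, _mm256_setzero_si256()));
}
```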

Fused multiply-add functions - (AVX2) Intrinsics for Fused Multiply Add Operations


  • Functions included (feel free to skip this if you don't use them)
  • Zone 1: _mm / _mm256
  • Zone 2: _fmadd / _fmsub / _fmaddsub / _fmsubadd / _fnmadd / _fnmsub
  • Zone 3: ps / pd
  • All combinations of Zone 1 + Zone 2 + Zone 3: 24 in total
  • Zone 4: _mm
  • Zone 5: _fmadd / _fmsub / _fnmadd / _fnmsub
  • Zone 6: ss / sd
  • All combinations of Zone 4 + Zone 5 + Zone 6: 8 in total

Easy way to remember: _mm256 has no ss or sd variants, and fmaddsub and fmsubadd have no ss or sd variants either.
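To see what a few of these combinations compute, here is a rough sketch assuming GCC or Clang on an x86-64 CPU that supports both AVX2 and FMA. FMA has its own CPU feature flag, so the sketch enables both; the wrapper names are made up.

```c
#include <immintrin.h>

/* out[i] = a[i] * b[i] + c[i], computed with a single rounding step. */
__attribute__((target("avx2,fma")))
void fmadd8(const float *a, const float *b, const float *c, float *out) {
    __m256 r = _mm256_fmadd_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b),
                               _mm256_loadu_ps(c));
    _mm256_storeu_ps(out, r);
}

/* out[i] = -(a[i] * b[i]) + c[i]. */
__attribute__((target("avx2,fma")))
void fnmadd8(const float *a, const float *b, const float *c, float *out) {
    __m256 r = _mm256_fnmadd_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b),
                                _mm256_loadu_ps(c));
    _mm256_storeu_ps(out, r);
}

/* Even-indexed elements: a*b - c. Odd-indexed elements: a*b + c. */
__attribute__((target("avx2,fma")))
void fmaddsub8(const float *a, const float *b, const float *c, float *out) {
    __m256 r = _mm256_fmaddsub_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b),
                                  _mm256_loadu_ps(c));
    _mm256_storeu_ps(out, r);
}
```

The ss / sd forms do the same arithmetic on the lowest element only, which is presumably why they exist only with the _mm prefix.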

Rearrangement? Shuffling? I don't know what the proper Chinese term is - Intrinsics for Permute Operations

  • _mm256_permutevar8x32_epi32
    • __m256i _mm256_permutevar8x32_epi32(__m256i a,__m256i idx)
  • _mm256_permutevar8x32_ps
    • __m256 _mm256_permutevar8x32_ps(__m256 a,__m256i idx)
  • _mm256_permute4x64_epi64
    • __m256i _mm256_permute4x64_epi64(__m256i a,const int imm8)
  • _mm256_permute4x64_pd
    • __m256d _mm256_permute4x64_pd(__m256d a,const int imm8)
  • _mm256_permute2x128_si256
    • __m256i _mm256_permute2x128_si256(__m256i a,__m256i b,const int imm8)

Permute illustrated

The first two, _mm256_permutevar8x32_epi32 and _mm256_permutevar8x32_ps, differ only in their input and return types.


a is the input source, b is the returned vector, and idx holds the indices used for the rearrangement.

That is: if idx[i] holds the value m, the operation performed is b[i] = a[m].
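The rule b[i] = a[idx[i]] can be tried out with a short sketch, assuming GCC or Clang on an x86-64 CPU with AVX2. The wrapper name permute8 is invented for this example.

```c
#include <immintrin.h>
#include <stdint.h>

/* b[i] = a[idx[i]] for eight 32-bit elements. Indices may cross the
   128-bit halves freely. */
__attribute__((target("avx2")))
void permute8(const int32_t a[8], const int32_t idx[8], int32_t b[8]) {
    __m256i va   = _mm256_loadu_si256((const __m256i *)a);
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    _mm256_storeu_si256((__m256i *)b, _mm256_permutevar8x32_epi32(va, vidx));
}
```

With idx = {7, 6, 5, 4, 3, 2, 1, 0} this reverses the vector, and an index may be repeated to duplicate an element.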

The middle two are similar: _mm256_permute4x64_epi64 takes and returns __m256i, and _mm256_permute4x64_pd takes and returns __m256d.


a is the input source, b is the returned vector, and imm8 is the control value that encodes the rearrangement indices.

The int control is read in binary (bit by bit). Because (__m256i)/(epi64) = 256/64 = 4, it is cut into 4 parts of 2 bits each, and each part yields one index.

As illustrated: control = 173 is 10101101 in binary, which is cut into 10, 10, 11, 01. Translated to decimal, the index m for each position is 2, 2, 3, 1 respectively. Then apply b[i] = a[m].

Namely: b[3]=a[2], b[2]=a[2], b[1]=a[3], b[0]=a[1]
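The control = 173 example can be reproduced in code. This is a sketch assuming GCC or Clang on an x86-64 CPU with AVX2; both function names are hypothetical, and the scalar version exists only to spell out how the 2-bit fields are decoded.

```c
#include <immintrin.h>
#include <stdint.h>

/* Applies the control value 173 (binary 10101101) from the text.
   The control must be a compile-time constant. */
__attribute__((target("avx2")))
void permute4_ctrl173(const int64_t a[4], int64_t b[4]) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    _mm256_storeu_si256((__m256i *)b, _mm256_permute4x64_epi64(va, 173));
}

/* Scalar reference: bits (2i+1):(2i) of the control select the source
   element for b[i]. */
void permute4_scalar(const int64_t a[4], int control, int64_t b[4]) {
    for (int i = 0; i < 4; i++) {
        int m = (control >> (2 * i)) & 3;
        b[i] = a[m];
    }
}
```

Note that the lowest 2 bits of the control belong to b[0], so the fields are read from right to left.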

A slightly different rearrangement instruction - Intrinsics for Shuffle Operations

I call it the "walled shuffle instruction"

For 128-bit input and output vectors:

Figure source: https://www.officedaytime.com/simd512e/simdimg/si.php?f=pshufb

When bit 7 of b[i] is 1, return[i] = 0

When bit 7 of b[i] is 0, the low 4 bits of b[i] are read as the value m, and return[i] = a[m]
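The two rules above can be sketched with the 128-bit byte shuffle, _mm_shuffle_epi8, next to a scalar version of the same rules. This assumes GCC or Clang on an x86-64 CPU with SSSE3; the function names are made up.

```c
#include <immintrin.h>
#include <stdint.h>

/* Byte shuffle on one 128-bit vector (the pshufb instruction). */
__attribute__((target("ssse3")))
void shuffle_bytes16(const uint8_t a[16], const uint8_t b[16],
                     uint8_t out[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(va, vb));
}

/* Scalar reference for the same rules. */
void shuffle_bytes16_scalar(const uint8_t a[16], const uint8_t b[16],
                            uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        if (b[i] & 0x80) {
            out[i] = 0;               /* bit 7 set: result byte is 0 */
        } else {
            out[i] = a[b[i] & 0x0F];  /* low 4 bits select the source byte */
        }
    }
}
```

Bits 4 to 6 of each control byte are ignored, so 0x73 selects the same byte as 0x03.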

For input and output vectors larger than 128 bits, such as _mm256_shuffle_epi8 and _mm512_shuffle_epi8:

The 256/512-bit cases work similarly; see https://www.officedaytime.com/simd512e/simdimg/si.php?f=pshufb

Every 128 bits form a lane, and elements can only be rearranged within their own 128-bit lane: the contents of the first 128 bits cannot be moved into the next 128 bits. The other shuffle instructions are likewise limited to 128-bit lanes.

  1. _mm256_shuffle_epi8
  2. _mm256_shuffle_epi32
  3. _mm256_shufflehi_epi16
  4. _mm256_shufflelo_epi16
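The "wall" between the two lanes can be observed directly. A minimal sketch, assuming GCC or Clang on an x86-64 CPU with AVX2, with invented wrapper names: the same index picks a different source byte depending on which 128-bit lane the output byte sits in.

```c
#include <immintrin.h>
#include <stdint.h>

/* 256-bit byte shuffle: each output byte can only come from its own
   128-bit lane. */
__attribute__((target("avx2")))
void shuffle_bytes32(const uint8_t a[32], const uint8_t b[32],
                     uint8_t out[32]) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_shuffle_epi8(va, vb));
}

/* 32-bit shuffle with control 0x1B (binary 00 01 10 11): reverses the four
   ints inside each lane, but never moves an int across lanes. */
__attribute__((target("avx2")))
void reverse_ints_per_lane(const int32_t a[8], int32_t out[8]) {
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    _mm256_storeu_si256((__m256i *)out, _mm256_shuffle_epi32(va, 0x1B));
}
```

With every index byte set to 0, the low lane is filled with a[0] and the high lane with a[16], not a[0].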

Variable shift instructions - (AVX2) Intrinsics for Logical Shift Operations

Each number in a vector can be shifted by a different number of bits (all in the same direction). Before AVX2, every number could only be shifted by the same number of bits.
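The difference between the two kinds of shift can be sketched as follows, assuming GCC or Clang on an x86-64 CPU with AVX2. The wrapper names are made up: one function shifts every element by its own count, the other shifts all elements by the same constant.

```c
#include <immintrin.h>
#include <stdint.h>

/* Per-element shift: out[i] = v[i] << counts[i]. */
__attribute__((target("avx2")))
void shift_left_variable8(const uint32_t v[8], const uint32_t counts[8],
                          uint32_t out[8]) {
    __m256i vv = _mm256_loadu_si256((const __m256i *)v);
    __m256i vc = _mm256_loadu_si256((const __m256i *)counts);
    _mm256_storeu_si256((__m256i *)out, _mm256_sllv_epi32(vv, vc));
}

/* Uniform shift: every element is shifted left by the same 3 bits. */
__attribute__((target("avx2")))
void shift_left_by3(const uint32_t v[8], uint32_t out[8]) {
    __m256i vv = _mm256_loadu_si256((const __m256i *)v);
    _mm256_storeu_si256((__m256i *)out, _mm256_slli_epi32(vv, 3));
}
```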


Loading non-contiguous data - (AVX2) Intrinsics for GATHER Operations
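As an illustration of a gather, here is a rough sketch that reads eight 32-bit values from arbitrary positions of a table in one call. It assumes GCC or Clang on an x86-64 CPU with AVX2 and a platform where int32_t is int; the wrapper name gather8 is invented for this example.

```c
#include <immintrin.h>
#include <stdint.h>

/* out[i] = table[idx[i]] for eight 32-bit elements. */
__attribute__((target("avx2")))
void gather8(const int32_t *table, const int32_t idx[8], int32_t out[8]) {
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    /* scale = 4: each index is multiplied by 4 bytes (the element size)
       to form the byte offset from the base address. */
    __m256i v = _mm256_i32gather_epi32((const int *)table, vidx, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

The indices are not bounds-checked, so every idx[i] must point inside the table.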





  1. ^ Answer: it is called _mm256_dodoMagic_epi16(__m256i a); the parameter is a __m256i and the return type is also __m256i

Origin www.cnblogs.com/jinanxiaolaohu/p/12424622.html