AVX instruction set function list

Based on Intel Intrinsics Guide 3.62. Functions in AVX and AVX2 with the _mm_ prefix are not included.

Arithmetic

__m256i _mm256_add_epi16 (__m256i a, __m256i b)

16-bit integer vector a plus b

Add packed 16-bit integers in a and b, and store the results in dst.

__m256i _mm256_add_epi32 (__m256i a, __m256i b)

32-bit integer vector a plus b

Add packed 32-bit integers in a and b, and store the results in dst.

__m256i _mm256_add_epi64 (__m256i a, __m256i b)

64-bit integer vector a plus b

Add packed 64-bit integers in a and b, and store the results in dst.

__m256i _mm256_add_epi8 (__m256i a, __m256i b)

8-bit integer vector a plus b

Add packed 8-bit integers in a and b, and store the results in dst.

__m256d _mm256_add_pd (__m256d a, __m256d b)

64-bit double-precision floating-point vector a plus b

Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

__m256 _mm256_add_ps (__m256 a, __m256 b)

32-bit single-precision floating-point vector a plus b

Add packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

__m256i _mm256_adds_epi16 (__m256i a, __m256i b)

16-bit integer vector a plus b, using saturation

Add packed 16-bit integers in a and b using saturation, and store the results in dst.

__m256i _mm256_adds_epi8 (__m256i a, __m256i b)

8-bit integer vector a plus b, using saturation

Add packed 8-bit integers in a and b using saturation, and store the results in dst.

__m256i _mm256_adds_epu16 (__m256i a, __m256i b)

16-bit unsigned integer vector a plus b, using saturation

Add packed unsigned 16-bit integers in a and b using saturation, and store the results in dst.

__m256i _mm256_adds_epu8 (__m256i a, __m256i b)

8-bit unsigned integer vector a plus b, using saturation

Add packed unsigned 8-bit integers in a and b using saturation, and store the results in dst.
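
The difference between plain and saturating addition is easiest to see side by side. The following is a minimal sketch, not taken from the Intel guide: the helper name add_u8_wrap_and_sat and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: adds 32 unsigned bytes twice, once with wraparound
   and once with saturation, so the two results can be compared. */
AVX2_FN
void add_u8_wrap_and_sat(const uint8_t *a, const uint8_t *b,
                         uint8_t *wrap, uint8_t *sat)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)wrap, _mm256_add_epi8(va, vb));  /* modulo 256 */
    _mm256_storeu_si256((__m256i *)sat, _mm256_adds_epu8(va, vb));  /* clamped to 255 */
}
```

With every a[i] = 200 and b[i] = 100, wrap holds 44 (300 modulo 256) while sat holds 255.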

__m256d _mm256_addsub_pd (__m256d a, __m256d b)

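64-bit double-precision floating-point number vector a plus or minus b (subtraction is used for even channels, and addition is used for odd channels)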

Alternatively add and subtract packed double-precision (64-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.

__m256 _mm256_addsub_ps (__m256 a, __m256 b)

32-bit single-precision floating-point number vector a plus or minus b (subtraction is used for even channels, and addition is used for odd channels)

Alternatively add and subtract packed single-precision (32-bit) floating-point elements in a to/from packed elements in b, and store the results in dst.
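
Which channels are added and which are subtracted is easy to mix up, so a small check helps. This is a minimal sketch under stated assumptions: the helper name addsub4 and the AVX2_FN macro are illustrative, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: out[i] = a[i] - b[i] for even i, a[i] + b[i] for odd i. */
AVX2_FN
void addsub4(const double *a, const double *b, double *out)
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_addsub_pd(va, vb));
}
```

For a = {10, 20, 30, 40} and b = {1, 2, 3, 4}, the result is {9, 22, 27, 44}: channels 0 and 2 are subtracted, channels 1 and 3 are added.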

__m256d _mm256_div_pd (__m256d a, __m256d b)

64-bit double-precision floating-point number vector a divided by b

Divide packed double-precision (64-bit) floating-point elements in a by packed elements in b, and store the results in dst.

__m256 _mm256_div_ps (__m256 a, __m256 b)

32-bit single-precision floating-point number vector a divided by b

Divide packed single-precision (32-bit) floating-point elements in a by packed elements in b, and store the results in dst.

__m256 _mm256_dp_ps (__m256 a, __m256 b, const int imm8)

Use bits 4-7 of imm8 to select which channels of the 32-bit single-precision floating-point number vectors a and b are multiplied, sum the selected products, and write the sum to the channels of dst selected by the lower four bits of imm8, with the other channels set to 0 (the high and low 128-bit halves are processed separately)

Conditionally multiply the packed single-precision (32-bit) floating-point elements in a and b using the high 4 bits in imm8, sum the four products, and conditionally store the sum in dst using the low 4 bits of imm8.
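
The two halves of imm8 are easier to follow with a concrete value. The sketch below uses 0xF1; the helper name dot4_per_half and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: one 4-element dot product per 128-bit half. */
AVX2_FN
void dot4_per_half(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    /* High nibble 0xF: multiply all four pairs in each half.
       Low nibble 0x1: write the sum to channel 0 of each half, zero the rest. */
    _mm256_storeu_ps(out, _mm256_dp_ps(va, vb, 0xF1));
}
```

With a = {1, 2, 3, 4, 5, 6, 7, 8} and b set to all ones, out[0] is 10, out[4] is 26, and every other channel is 0.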

__m256i _mm256_hadd_epi16 (__m256i a, __m256i b)

Add each pair of adjacent channels within the 16-bit integer vectors a and b, and pack the sums from a and from b into dst

Horizontally add adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.

__m256i _mm256_hadd_epi32 (__m256i a, __m256i b)

Add each pair of adjacent channels within the 32-bit integer vectors a and b, and pack the sums from a and from b into dst

Horizontally add adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.

__m256d _mm256_hadd_pd (__m256d a, __m256d b)

Add each pair of adjacent channels within the 64-bit double-precision floating-point vectors a and b, and pack the sums from a and from b into dst

Horizontally add adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.

__m256 _mm256_hadd_ps (__m256 a, __m256 b)

Add each pair of adjacent channels within the 32-bit single-precision floating-point vectors a and b, and pack the sums from a and from b into dst

Horizontally add adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.
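
The packing order of a horizontal add is not obvious from the description, because sums from a and b are interleaved per 128-bit half. The sketch below just exposes the layout; the helper name hadd8 and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: stores the raw result of _mm256_hadd_ps.
   Low half:  a0+a1, a2+a3, b0+b1, b2+b3
   High half: a4+a5, a6+a7, b4+b5, b6+b7 */
AVX2_FN
void hadd8(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_hadd_ps(va, vb));
}
```

For a = {1, 2, 3, 4, 5, 6, 7, 8} and b = {10, 20, 30, 40, 50, 60, 70, 80}, the result is {3, 7, 30, 70, 11, 15, 110, 150}.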

__m256i _mm256_hadds_epi16 (__m256i a, __m256i b)

Add each pair of adjacent channels within the 16-bit integer vectors a and b using saturation, and pack the sums from a and from b into dst

Horizontally add adjacent pairs of signed 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.

__m256i _mm256_hsub_epi16 (__m256i a, __m256i b)

Subtract each pair of adjacent channels within the 16-bit integer vectors a and b, and pack the differences from a and from b into dst

Horizontally subtract adjacent pairs of 16-bit integers in a and b, and pack the signed 16-bit results in dst.

__m256i _mm256_hsub_epi32 (__m256i a, __m256i b)

Subtract each pair of adjacent channels within the 32-bit integer vectors a and b, and pack the differences from a and from b into dst

Horizontally subtract adjacent pairs of 32-bit integers in a and b, and pack the signed 32-bit results in dst.

__m256d _mm256_hsub_pd (__m256d a, __m256d b)

Subtract each pair of adjacent channels within the 64-bit double-precision floating-point vectors a and b, and pack the differences from a and from b into dst

Horizontally subtract adjacent pairs of double-precision (64-bit) floating-point elements in a and b, and pack the results in dst.

__m256 _mm256_hsub_ps (__m256 a, __m256 b)

Subtract each pair of adjacent channels within the 32-bit single-precision floating-point vectors a and b, and pack the differences from a and from b into dst

Horizontally subtract adjacent pairs of single-precision (32-bit) floating-point elements in a and b, and pack the results in dst.

__m256i _mm256_hsubs_epi16 (__m256i a, __m256i b)

Subtract each pair of adjacent channels within the 16-bit integer vectors a and b using saturation, and pack the differences from a and from b into dst

Horizontally subtract adjacent pairs of signed 16-bit integers in a and b using saturation, and pack the signed 16-bit results in dst.

__m256i _mm256_madd_epi16 (__m256i a, __m256i b)

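The 16-bit integer vector a is multiplied by b, producing 32-bit integer intermediate results. Adjacent channels of the intermediate results are added, and the sums are stored in dst.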

Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results in dst.
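
The multiply-then-pairwise-add pattern is the building block of 16-bit dot products. The sketch below shows one step; the helper name madd16 and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1], for i = 0..7. */
AVX2_FN
void madd16(const int16_t *a, const int16_t *b, int32_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_madd_epi16(va, vb));
}
```

With a = {1, 2, ..., 16} and every b[i] = 2, out[0] is 1*2 + 2*2 = 6 and out[7] is 15*2 + 16*2 = 62.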

__m256i _mm256_maddubs_epi16 (__m256i a, __m256i b)

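The 8-bit unsigned integer vector a is multiplied by the 8-bit signed integer vector b, producing 16-bit signed integer intermediate results. Adjacent channels of the intermediate results are added using saturation, and the sums are stored in dst.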

Vertically multiply each unsigned 8-bit integer from a with the corresponding signed 8-bit integer from b, producing intermediate signed 16-bit integers. Horizontally add adjacent pairs of intermediate signed 16-bit integers, and pack the saturated results in dst.

__m256i _mm256_mul_epi32 (__m256i a, __m256i b)

Multiply the lower 32-bit signed integers of the 64-bit integer vectors a and b, and the result is a 64-bit signed vector

Multiply the low signed 32-bit integers from each packed 64-bit element in a and b, and store the signed 64-bit results in dst.

__m256i _mm256_mul_epu32 (__m256i a, __m256i b)

Multiply the lower 32-bit unsigned integers of the 64-bit integer vectors a and b, and the result is a 64-bit unsigned vector

Multiply the low unsigned 32-bit integers from each packed 64-bit element in a and b, and store the unsigned 64-bit results in dst.

__m256d _mm256_mul_pd (__m256d a, __m256d b)

64-bit double-precision floating-point vector a multiplied by b

Multiply packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

__m256 _mm256_mul_ps (__m256 a, __m256 b)

32-bit single-precision floating-point vector a multiplied by b

Multiply packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

__m256i _mm256_mulhi_epi16 (__m256i a, __m256i b)

The 16-bit signed integer vector a is multiplied by b, producing 32-bit intermediate results, and the upper 16 bits of each intermediate result are stored in dst

Multiply the packed signed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

__m256i _mm256_mulhi_epu16 (__m256i a, __m256i b)

The 16-bit unsigned integer vector a is multiplied by b, producing 32-bit intermediate results, and the upper 16 bits of each intermediate result are stored in dst

Multiply the packed unsigned 16-bit integers in a and b, producing intermediate 32-bit integers, and store the high 16 bits of the intermediate integers in dst.

__m256i _mm256_mulhrs_epi16 (__m256i a, __m256i b)

The 16-bit signed integer vector a is multiplied by b, producing 32-bit intermediate results. Each intermediate result is truncated to its 18 most significant bits, 1 is added for rounding, and bits [16:1] are stored in dst.

Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Truncate each intermediate integer to the 18 most significant bits, round by adding 1, and store bits [16:1] to dst.
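
The truncate, round, and select steps amount to a rounded multiply of Q15 fixed-point values, where 32768 represents 1.0. Reading the inputs as Q15 is an interpretation added here, not something the guide states. The helper name q15_mul16 and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: per channel, out = (((a * b) >> 14) + 1) >> 1. */
AVX2_FN
void q15_mul16(const int16_t *a, const int16_t *b, int16_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_mulhrs_epi16(va, vb));
}
```

For a = b = 16384 (0.5 in Q15), the product is 2^28, shifting right by 14 gives 16384, adding 1 gives 16385, and the final shift gives 8192 (0.25 in Q15). For a = -16384 and b = 16384, the same steps give -8192.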

__m256i _mm256_mullo_epi16 (__m256i a, __m256i b)

The 16-bit signed integer vector a is multiplied by b, producing 32-bit intermediate results, and the lower 16 bits of each intermediate result are stored in dst

Multiply the packed signed 16-bit integers in a and b, producing intermediate 32-bit integers, and store the low 16 bits of the intermediate integers in dst.

__m256i _mm256_mullo_epi32 (__m256i a, __m256i b)

The 32-bit signed integer vector a is multiplied by b, producing 64-bit intermediate results, and the lower 32 bits of each intermediate result are stored in dst

Multiply the packed signed 32-bit integers in a and b, producing intermediate 64-bit integers, and store the low 32 bits of the intermediate integers in dst.

__m256i _mm256_sad_epu8 (__m256i a, __m256i b)

Compute the absolute differences between the channels of the 8-bit unsigned integer vectors a and b. Each group of 8 consecutive differences is summed, producing four 16-bit unsigned integers, which are placed in the lower 16 bits of each 64-bit element of dst.

Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce four unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
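
Since each 16-bit sum sits in the low bits of a 64-bit element with zeros above it, the result can be read back as four 64-bit values. The sketch below does that; the helper name sad32 and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: sums[k] = sum of |a[i] - b[i]| for i in 8k .. 8k+7. */
AVX2_FN
void sad32(const uint8_t *a, const uint8_t *b, uint64_t *sums)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)sums, _mm256_sad_epu8(va, vb));
}
```

With a[i] = i and b set to all zeros, the four sums are 28, 92, 156 and 220.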

__m256i _mm256_sign_epi16 (__m256i a, __m256i b)

When a channel of the 16-bit signed integer vector b is negative, the value in the corresponding channel of the 16-bit signed integer vector a is negated and stored in dst. When a channel of b is 0, the corresponding channel of dst is set to 0

Negate packed signed 16-bit integers in a when the corresponding signed 16-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

__m256i _mm256_sign_epi32 (__m256i a, __m256i b)

When a channel of the 32-bit signed integer vector b is negative, the value in the corresponding channel of the 32-bit signed integer vector a is negated and stored in dst. When a channel of b is 0, the corresponding channel of dst is set to 0

Negate packed signed 32-bit integers in a when the corresponding signed 32-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.
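
The three cases (b negative, zero, positive) can be checked directly. This is a minimal sketch: the helper name apply_sign8 and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: out[i] = -a[i] if b[i] < 0, 0 if b[i] == 0, else a[i]. */
AVX2_FN
void apply_sign8(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_sign_epi32(va, vb));
}
```

With every a[i] = 5 and b = {-1, 0, 1, -7, 0, 9, -2, 3}, the result is {-5, 0, 5, -5, 0, 5, -5, 5}.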

__m256i _mm256_sign_epi8 (__m256i a, __m256i b)

When a channel of the 8-bit signed integer vector b is negative, the value in the corresponding channel of the 8-bit signed integer vector a is negated and stored in dst. When a channel of b is 0, the corresponding channel of dst is set to 0

Negate packed signed 8-bit integers in a when the corresponding signed 8-bit integer in b is negative, and store the results in dst. Elements in dst are zeroed out when the corresponding element in b is zero.

__m256i _mm256_sub_epi16 (__m256i a, __m256i b)

16-bit integer vector a minus b

Subtract packed 16-bit integers in b from packed 16-bit integers in a, and store the results in dst.

__m256i _mm256_sub_epi32 (__m256i a, __m256i b)

32-bit integer vector a minus b

Subtract packed 32-bit integers in b from packed 32-bit integers in a, and store the results in dst.

__m256i _mm256_sub_epi64 (__m256i a, __m256i b)

64-bit integer vector a minus b

Subtract packed 64-bit integers in b from packed 64-bit integers in a, and store the results in dst.

__m256i _mm256_sub_epi8 (__m256i a, __m256i b)

8-bit integer vector a minus b

Subtract packed 8-bit integers in b from packed 8-bit integers in a, and store the results in dst.

__m256d _mm256_sub_pd (__m256d a, __m256d b)

64-bit double-precision floating-point vector a minus b

Subtract packed double-precision (64-bit) floating-point elements in b from packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

__m256 _mm256_sub_ps (__m256 a, __m256 b)

32-bit single-precision floating-point number vector a minus b

Subtract packed single-precision (32-bit) floating-point elements in b from packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

__m256i _mm256_subs_epi16 (__m256i a, __m256i b)

16-bit integer vector a minus b, using saturation

Subtract packed signed 16-bit integers in b from packed 16-bit integers in a using saturation, and store the results in dst.

__m256i _mm256_subs_epi8 (__m256i a, __m256i b)

8-bit integer vector a minus b, using saturation

Subtract packed signed 8-bit integers in b from packed 8-bit integers in a using saturation, and store the results in dst.

__m256i _mm256_subs_epu16 (__m256i a, __m256i b)

16-bit unsigned integer vector a minus b, using saturation

Subtract packed unsigned 16-bit integers in b from packed unsigned 16-bit integers in a using saturation, and store the results in dst.

__m256i _mm256_subs_epu8 (__m256i a, __m256i b)

8-bit unsigned integer vector a minus b, using saturation

Subtract packed unsigned 8-bit integers in b from packed unsigned 8-bit integers in a using saturation, and store the results in dst.

Compare

__m256d _mm256_cmp_pd (__m256d a, __m256d b, const int imm8)

Compare the 64-bit double-precision floating-point number vectors a and b using the condition selected by imm8. Each channel of dst is 64 bits of all 1s when the condition is met, and 64 bits of all 0s otherwise

Compare packed double-precision (64-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.

__m256 _mm256_cmp_ps (__m256 a, __m256 b, const int imm8)

Compare the 32-bit single-precision floating-point number vectors a and b using the condition selected by imm8. Each channel of dst is 32 bits of all 1s when the condition is met, and 32 bits of all 0s otherwise

Compare packed single-precision (32-bit) floating-point elements in a and b based on the comparison operand specified by imm8, and store the results in dst.
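
A comparison result is usually consumed as a mask. The sketch below turns it into one bit per channel with _mm256_movemask_ps, an AVX intrinsic that does not appear in this part of the list. It uses the predicate _CMP_LT_OQ (ordered, non-signaling less-than). The helper name less_than_mask and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: bit i of the result is set when a[i] < limit. */
AVX2_FN
int less_than_mask(const float *a, float limit)
{
    __m256 va = _mm256_loadu_ps(a);
    __m256 lt = _mm256_cmp_ps(va, _mm256_set1_ps(limit), _CMP_LT_OQ);
    /* Collect the highest bit of each 32-bit channel into an 8-bit mask. */
    return _mm256_movemask_ps(lt);
}
```

For a = {0, 1, 2, 3, 4, 5, 6, 7} and limit = 3, only channels 0 to 2 satisfy the condition, so the function returns 7.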

__m256i _mm256_cmpeq_epi16 (__m256i a, __m256i b)

Compare the 16-bit integer vectors a and b for equality. Each channel of dst is 16 bits of all 1s when the channels are equal, and 16 bits of all 0s otherwise

Compare packed 16-bit integers in a and b for equality, and store the results in dst.

__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)

Compare the 32-bit integer vectors a and b for equality. Each channel of dst is 32 bits of all 1s when the channels are equal, and 32 bits of all 0s otherwise

Compare packed 32-bit integers in a and b for equality, and store the results in dst.

__m256i _mm256_cmpeq_epi64 (__m256i a, __m256i b)

Compare the 64-bit integer vectors a and b for equality. Each channel of dst is 64 bits of all 1s when the channels are equal, and 64 bits of all 0s otherwise

Compare packed 64-bit integers in a and b for equality, and store the results in dst.

__m256i _mm256_cmpeq_epi8 (__m256i a, __m256i b)

Compare the 8-bit integer vectors a and b for equality. Each channel of dst is 8 bits of all 1s when the channels are equal, and 8 bits of all 0s otherwise

Compare packed 8-bit integers in a and b for equality, and store the results in dst.

__m256i _mm256_cmpgt_epi16 (__m256i a, __m256i b)

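Test whether the 16-bit signed integer vector a is greater than (but not equal to) b. Each channel of dst is 16 bits of all 1s when the condition is met, and 16 bits of all 0s otherwise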

Compare packed signed 16-bit integers in a and b for greater-than, and store the results in dst.

__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b)

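Test whether the 32-bit signed integer vector a is greater than (but not equal to) b. Each channel of dst is 32 bits of all 1s when the condition is met, and 32 bits of all 0s otherwise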

Compare packed signed 32-bit integers in a and b for greater-than, and store the results in dst.

__m256i _mm256_cmpgt_epi64 (__m256i a, __m256i b)

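Test whether the 64-bit signed integer vector a is greater than (but not equal to) b. Each channel of dst is 64 bits of all 1s when the condition is met, and 64 bits of all 0s otherwise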

Compare packed signed 64-bit integers in a and b for greater-than, and store the results in dst.

__m256i _mm256_cmpgt_epi8 (__m256i a, __m256i b)

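Test whether the 8-bit signed integer vector a is greater than (but not equal to) b. Each channel of dst is 8 bits of all 1s when the condition is met, and 8 bits of all 0s otherwise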

Compare packed signed 8-bit integers in a and b for greater-than, and store the results in dst.

Convert

__m256i _mm256_cvtepi16_epi32 (__m128i a)

Extend (SignExtend) the 16-bit signed integer vector a to a 32-bit signed integer vector dst

Sign extend packed 16-bit integers in a to packed 32-bit integers, and store the results in dst.

__m256i _mm256_cvtepi16_epi64 (__m128i a)

Extend (SignExtend) the lower 4 channels of the 16-bit signed integer vector a to a 64-bit signed integer vector dst

Sign extend packed 16-bit integers in a to packed 64-bit integers, and store the results in dst.

__m256i _mm256_cvtepi32_epi64 (__m128i a)

Extend (SignExtend) the 32-bit signed integer vector a to a 64-bit signed integer vector dst

Sign extend packed 32-bit integers in a to packed 64-bit integers, and store the results in dst.

__m256d _mm256_cvtepi32_pd (__m128i a)

Convert the 32-bit signed integer vector a to a 64-bit double-precision floating-point vector dst

Convert packed signed 32-bit integers in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.

__m256 _mm256_cvtepi32_ps (__m256i a)

Convert the 32-bit signed integer vector a to a 32-bit single-precision floating-point vector dst

Convert packed signed 32-bit integers in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.

__m256i _mm256_cvtepi8_epi16 (__m128i a)

Extend (SignExtend) the 8-bit signed integer vector a to the 16-bit signed integer vector dst

Sign extend packed 8-bit integers in a to packed 16-bit integers, and store the results in dst.

__m256i _mm256_cvtepi8_epi32 (__m128i a)

Extend (SignExtend) the lower 8 channels of the 8-bit signed integer vector a to a 32-bit signed integer vector dst

Sign extend packed 8-bit integers in a to packed 32-bit integers, and store the results in dst.

__m256i _mm256_cvtepi8_epi64 (__m128i a)

Extend (SignExtend) the lower 4 channels of the 8-bit signed integer vector a to a 64-bit signed integer vector dst

Sign extend packed 8-bit integers in the low 8 bytes of a to packed 64-bit integers, and store the results in dst.

__m256i _mm256_cvtepu16_epi32 (__m128i a)

Extend (ZeroExtend) the 16-bit unsigned integer vector a to a 32-bit integer vector dst

Zero extend packed unsigned 16-bit integers in a to packed 32-bit integers, and store the results in dst.

__m256i _mm256_cvtepu16_epi64 (__m128i a)

Extend (ZeroExtend) the lower 4 channels of the 16-bit unsigned integer vector a to a 64-bit signed integer vector dst

Zero extend packed unsigned 16-bit integers in a to packed 64-bit integers, and store the results in dst.

__m256i _mm256_cvtepu32_epi64 (__m128i a)

Extend (ZeroExtend) the 32-bit unsigned integer vector a to a 64-bit integer vector dst

Zero extend packed unsigned 32-bit integers in a to packed 64-bit integers, and store the results in dst.

__m256i _mm256_cvtepu8_epi16 (__m128i a)

Extend (ZeroExtend) the 16 channels of the 8-bit unsigned integer vector a to a 16-bit integer vector dst

Zero extend packed unsigned 8-bit integers in a to packed 16-bit integers, and store the results in dst.

__m256i _mm256_cvtepu8_epi32 (__m128i a)

Extend (ZeroExtend) the lower 8 channels of the 8-bit unsigned integer vector a to a 32-bit integer vector dst

Zero extend packed unsigned 8-bit integers in a to packed 32-bit integers, and store the results in dst.
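
Sign extension and zero extension differ only for bytes with the top bit set, which the following sketch makes visible by widening the same 8 bytes both ways. It loads the bytes with _mm_loadl_epi64, an SSE2 intrinsic outside this list. The helper name widen_bytes and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: widens 8 bytes to 32 bits with sign and zero extension. */
AVX2_FN
void widen_bytes(const uint8_t *src, int32_t *as_signed, int32_t *as_unsigned)
{
    /* Load the low 8 bytes; the upper half of the register is zeroed. */
    __m128i bytes = _mm_loadl_epi64((const __m128i *)src);
    _mm256_storeu_si256((__m256i *)as_signed, _mm256_cvtepi8_epi32(bytes));
    _mm256_storeu_si256((__m256i *)as_unsigned, _mm256_cvtepu8_epi32(bytes));
}
```

The byte 0x80 becomes -128 when sign-extended and 128 when zero-extended; 0xFF becomes -1 and 255 respectively. Bytes below 0x80 give the same value either way.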

__m256i _mm256_cvtepu8_epi64 (__m128i a)

Extend (ZeroExtend) the lower 4 channels of the 8-bit unsigned integer vector a to a 64-bit integer vector dst

Zero extend packed unsigned 8-bit integers in the low 8 bytes of a to packed 64-bit integers, and store the results in dst.

__m128i _mm256_cvtpd_epi32 (__m256d a)

Convert the 64-bit double-precision floating-point vector a to the 32-bit integer vector dst

Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.

__m128 _mm256_cvtpd_ps (__m256d a)

Convert the 64-bit double-precision floating-point vector a to the 32-bit single-precision floating-point vector dst

Convert packed double-precision (64-bit) floating-point elements in a to packed single-precision (32-bit) floating-point elements, and store the results in dst.

__m256i _mm256_cvtps_epi32 (__m256 a)

Convert the 32-bit single-precision floating-point vector a to the 32-bit integer vector dst

Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers, and store the results in dst.

__m256d _mm256_cvtps_pd (__m128 a)

Convert 32-bit single-precision floating-point vector a to 64-bit double-precision floating-point vector dst

Convert packed single-precision (32-bit) floating-point elements in a to packed double-precision (64-bit) floating-point elements, and store the results in dst.

double _mm256_cvtsd_f64 (__m256d a)

Copy the lowest channel of the 64-bit double-precision floating-point vector to dst

Copy the lower double-precision (64-bit) floating-point element of a to dst.

int _mm256_cvtsi256_si32 (__m256i a)

Copy the lowest channel of the 32-bit integer vector to dst

Copy the lower 32-bit integer in a to dst.

float _mm256_cvtss_f32 (__m256 a)

Copy the lowest channel of the 32-bit single-precision floating-point vector to dst

Copy the lower single-precision (32-bit) floating-point element of a to dst.

__m128i _mm256_cvttpd_epi32 (__m256d a)

Using truncation, convert the 64-bit double-precision floating-point vector a to the 32-bit integer vector dst

Convert packed double-precision (64-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.

__m256i _mm256_cvttps_epi32 (__m256 a)

Using truncation, convert the 32-bit single-precision floating-point vector a to the 32-bit integer vector dst

Convert packed single-precision (32-bit) floating-point elements in a to packed 32-bit integers with truncation, and store the results in dst.
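
The truncating and non-truncating conversions differ whenever the input has a fractional part. The sketch below runs both on the same input. It assumes the MXCSR rounding mode is at its default, round to nearest even, which governs _mm256_cvtps_epi32. The helper name round_and_truncate and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: converts 8 floats with rounding and with truncation. */
AVX2_FN
void round_and_truncate(const float *x, int32_t *rounded, int32_t *truncated)
{
    __m256 vx = _mm256_loadu_ps(x);
    /* Uses the current rounding mode (assumed: round to nearest even). */
    _mm256_storeu_si256((__m256i *)rounded, _mm256_cvtps_epi32(vx));
    /* Always rounds toward zero. */
    _mm256_storeu_si256((__m256i *)truncated, _mm256_cvttps_epi32(vx));
}
```

Under that assumption, {2.5, 3.5, -1.5, 1.7} converts to {2, 4, -2, 2} with rounding and to {2, 3, -1, 1} with truncation.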

Elementary Math Functions

__m256 _mm256_rcp_ps (__m256 a)

Approximately calculate the reciprocal of the 32-bit single-precision floating-point vector a. The maximum relative error of the result is less than 1.5*2^-12

Compute the approximate reciprocal of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.

__m256 _mm256_rsqrt_ps (__m256 a)

Approximately calculate the reciprocal of the square root of the 32-bit single-precision floating-point number vector a. The maximum relative error of the result is less than 1.5*2^-12

Compute the approximate reciprocal square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst. The maximum relative error for this approximation is less than 1.5*2^-12.
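
To get a feel for the stated error bound, the approximation can be compared with a full-precision result built from _mm256_sqrt_ps and _mm256_div_ps. This is a minimal sketch: the helper name rsqrt_approx_and_exact and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler and a CPU that supports AVX.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: computes 1/sqrt(x) approximately and via sqrt + divide. */
AVX2_FN
void rsqrt_approx_and_exact(const float *x, float *approx, float *exact)
{
    __m256 vx = _mm256_loadu_ps(x);
    _mm256_storeu_ps(approx, _mm256_rsqrt_ps(vx));
    _mm256_storeu_ps(exact,
                     _mm256_div_ps(_mm256_set1_ps(1.0f), _mm256_sqrt_ps(vx)));
}
```

For positive inputs, the two outputs should agree to within a relative difference of about 1.5*2^-12 (roughly 0.00037). The approximate bits themselves can differ between CPU models.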

__m256d _mm256_sqrt_pd (__m256d a)

Computes the square root of a vector a of 64-bit double-precision floating-point numbers

Compute the square root of packed double-precision (64-bit) floating-point elements in a, and store the results in dst.

__m256 _mm256_sqrt_ps (__m256 a)

Computes the square root of a vector a of 32-bit single-precision floating-point numbers

Compute the square root of packed single-precision (32-bit) floating-point elements in a, and store the results in dst.

General Support

__m256d _mm256_undefined_pd (void)

Returns a __m256d variable with undefined contents

Return vector of type __m256d with undefined elements.

__m256 _mm256_undefined_ps (void)

Returns a __m256 variable with undefined contents

Return vector of type __m256 with undefined elements.

__m256i _mm256_undefined_si256 (void)

Returns a __m256i variable with undefined contents

Return vector of type __m256i with undefined elements.

void _mm256_zeroall (void)

Set all XMM or YMM registers to zero

Zero the contents of all XMM or YMM registers.

void _mm256_zeroupper (void)

Zero the upper 128 bits of all YMM registers, and leave the lower 128 bits unchanged

Zero the upper 128 bits of all YMM registers; the lower 128-bits of the registers are unmodified.

Load

__m256d _mm256_broadcast_pd (__m128d const * mem_addr)

Read 128 bits (2 64-bit double precision floats) from memory and broadcast to dst

Broadcast 128 bits from memory (composed of 2 packed double-precision (64-bit) floating-point elements) to all elements of dst.

__m256 _mm256_broadcast_ps (__m128 const * mem_addr)

Read 128 bits (4 32-bit single precision floats) from memory and broadcast to dst

Broadcast 128 bits from memory (composed of 4 packed single-precision (32-bit) floating-point elements) to all elements of dst.

__m256d _mm256_broadcast_sd (double const * mem_addr)

Read a 64-bit double-precision floating-point number from memory and broadcast to all channels of dst

Broadcast a double-precision (64-bit) floating-point element from memory to all elements of dst.

__m256 _mm256_broadcast_ss (float const * mem_addr)

Read a 32-bit single-precision floating-point number from memory and broadcast to all channels of dst

Broadcast a single-precision (32-bit) floating-point element from memory to all elements of dst.

__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

Gather a 32-bit integer vector from memory. The starting address for reading is base_addr, and the offset is the 32-bit vector vindex multiplied by scale bytes. scale must be 1, 2, 4, 8

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
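
In practice a gather behaves like eight table lookups done at once. The sketch below uses scale = 4, so each index counts 4-byte elements and not bytes. The helper name gather8 and the AVX2_FN macro are illustrative assumptions, and the code assumes a GCC- or Clang-style compiler, a CPU that supports AVX2, and a platform where int is 32 bits wide.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: out[i] = table[idx[i]] for i = 0..7.
   The caller must keep every index inside the table. */
AVX2_FN
void gather8(const int32_t *table, const int32_t *idx, int32_t *out)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    /* scale = 4: byte offset is idx[i] * 4, the size of one 32-bit element. */
    __m256i g = _mm256_i32gather_epi32((const int *)table, vindex, 4);
    _mm256_storeu_si256((__m256i *)out, g);
}
```

With table[i] = 10 * i and idx = {7, 0, 3, 3, 1, 6, 2, 5}, the result is {70, 0, 30, 30, 10, 60, 20, 50}.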

__m256i _mm256_mask_i32gather_epi32 (__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)

Gather a 32-bit integer vector from memory. The starting address for reading is base_addr, and the offset is the 32-bit vindex vector multiplied by scale bytes. scale must be 1, 2, 4, 8.

If the highest bit of the corresponding channel of the 32-bit vector mask is 1, dst uses the gathered data, otherwise it uses the value of the corresponding channel in src

Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.
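
The masked form is a gather with a per-channel fallback value. In the sketch below the mask is built with _mm256_cmpgt_epi32, which produces all 1s (highest bit set) where a flag is positive. The helper name gather8_masked, the take flag array, the fallback parameter and the AVX2_FN macro are all illustrative assumptions, and the code assumes a GCC- or Clang-style compiler, a CPU that supports AVX2, and a 32-bit int.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang target attribute, so this builds without -mavx2. */
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

/* Hypothetical helper: out[i] = take[i] > 0 ? table[idx[i]] : fallback. */
AVX2_FN
void gather8_masked(const int32_t *table, const int32_t *idx,
                    const int32_t *take, int32_t fallback, int32_t *out)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx);
    __m256i vtake = _mm256_loadu_si256((const __m256i *)take);
    /* All 1s where take[i] > 0, all 0s elsewhere. */
    __m256i mask = _mm256_cmpgt_epi32(vtake, _mm256_setzero_si256());
    __m256i src = _mm256_set1_epi32(fallback);
    __m256i g = _mm256_mask_i32gather_epi32(src, (const int *)table,
                                            vindex, mask, 4);
    _mm256_storeu_si256((__m256i *)out, g);
}
```

With table = {10, 20, 30, 40}, idx = {0, 3, 1, 2, 0, 1, 2, 3}, take = {1, 1, 0, 1, 0, 1, 1, 0} and fallback = 99, the result is {10, 40, 99, 30, 99, 20, 30, 99}.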

__m256i _mm256_i32gather_epi64 (__int64 const* base_addr, __m128i vindex, const int scale)

Gather a vector of 64-bit integers from memory. The starting address for reading is base_addr, and the offset is the 32-bit vindex vector multiplied by scale bytes. scale must be 1, 2, 4, 8

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

__m256i _mm256_mask_i32gather_epi64 (__m256i src, __int64 const* base_addr, __m128i vindex, __m256i mask, const int scale)

Gather a vector of 64-bit integers from memory. The starting address for reading is base_addr, and the offset is the 32-bit vindex vector multiplied by scale bytes. scale must be 1, 2, 4, 8.

If the highest bit of the corresponding channel of the 64-bit vector mask is 1, dst uses the gathered data, otherwise it uses the value of the corresponding channel in src

Gather 64-bit integers from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

Gather a vector of 64-bit double-precision floating-point numbers from memory. The starting address for reading is base_addr, and the offset is the 32-bit vindex vector multiplied by scale bytes. scale must be 1, 2, 4, 8

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

__m256d _mm256_mask_i32gather_pd (__m256d src, double const* base_addr, __m128i vindex, __m256d mask, const int scale)

Gather a vector of 64-bit double-precision floating-point numbers from memory. The starting address for reading is base_addr, and the offset is the 32-bit vindex vector multiplied by scale bytes. scale must be 1, 2, 4, 8.

If the highest bit of the corresponding channel of the 64-bit vector mask is 1, dst uses the gathered data, otherwise it uses the value of the corresponding channel in src

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

__m256 _mm256_i32gather_ps (float const* base_addr, __m256i vindex, const int scale)

Gather a vector of 32-bit single-precision floating-point numbers from memory. The starting address for reading is base_addr, and the offset is the 32-bit vindex vector multiplied by scale bytes. scale must be 1, 2, 4, 8

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

__m256 _mm256_mask_i32gather_ps (__m256 src, float const* base_addr, __m256i vindex, __m256 mask, const int scale)

Gather a vector of 32-bit single-precision floating-point numbers from memory. Each element is read from base_addr plus the corresponding 32-bit index in vindex multiplied by scale bytes. scale must be 1, 2, 4, or 8.

For each lane, if the highest bit of the corresponding 32-bit element of mask is 1, dst receives the gathered value; otherwise it receives the corresponding element of src.

Gather single-precision (32-bit) floating-point elements from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

__m128i _mm256_i64gather_epi32 (int const* base_addr, __m256i vindex, const int scale)

Gather a vector of 32-bit integers from memory. Each element is read from base_addr plus the corresponding 64-bit index in vindex multiplied by scale bytes. scale must be 1, 2, 4, or 8.

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
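Because the indices are 64 bits wide, a 256-bit vindex holds only four of them, which is why the result is a 128-bit __m128i. A short sketch under the same GCC/Clang and AVX2 assumptions, with the hypothetical helper name gather4_epi32_idx64:

```c
#include <immintrin.h>

/* Target attribute: GCC/Clang extension, assumed for this sketch. */
__attribute__((target("avx2")))
void gather4_epi32_idx64(const int *table, const long long *idx, int *out)
{
    __m256i vindex = _mm256_loadu_si256((const __m256i *)idx); /* four 64-bit indices */
    __m128i v = _mm256_i64gather_epi32(table, vindex, 4);      /* four 32-bit results */
    _mm_storeu_si128((__m128i *)out, v);
}
```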

__m128i _mm256_mask_i64gather_epi32 (__m128i src, int const* base_addr, __m256i vindex, __m128i mask, const int scale)

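Gather a vector of 32-bit integers from memory. Each element is read from base_addr plus the corresponding 64-bit index in vindex multiplied by scale bytes. scale must be 1, 2, 4, or 8.

For each lane, if the highest bit of the corresponding 32-bit element of mask is 1, dst receives the gathered value; otherwise it receives the corresponding element of src.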

Gather 32-bit integers from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

__m256i _mm256_i64gather_epi64 (__int64 const* base_addr, __m256i vindex, const int scale)

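Gather a vector of 64-bit integers from memory. Each element is read from base_addr plus the corresponding 64-bit index in vindex multiplied by scale bytes. scale must be 1, 2, 4, or 8.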

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

__m256i _mm256_mask_i64gather_epi64 (__m256i src, __int64 const* base_addr, __m256i vindex, __m256i mask, const int scale)

Gather a vector of 64-bit integers from memory. Each element is read from base_addr plus the corresponding 64-bit index in vindex multiplied by scale bytes. scale must be 1, 2, 4, or 8.

For each lane, if the highest bit of the corresponding 64-bit element of mask is 1, dst receives the gathered value; otherwise it receives the corresponding element of src.

Gather 64-bit integers from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

__m256d _mm256_i64gather_pd (double const* base_addr, __m256i vindex, const int scale)

Gather a vector of 64-bit double-precision floating-point numbers from memory. Each element is read from base_addr plus the corresponding 64-bit index in vindex multiplied by scale bytes. scale must be 1, 2, 4, or 8.

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

__m256d _mm256_mask_i64gather_pd (__m256d src, double const* base_addr, __m256i vindex, __m256d mask, const int scale)

Gather a vector of 64-bit double-precision floating-point numbers from memory. Each element is read from base_addr plus the corresponding 64-bit index in vindex multiplied by scale bytes. scale must be 1, 2, 4, or 8.

For each lane, if the highest bit of the corresponding 64-bit element of mask is 1, dst receives the gathered value; otherwise it receives the corresponding element of src.

Gather double-precision (64-bit) floating-point elements from memory using 64-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

__m128 _mm256_i64gather_ps (float const* base_addr, __m256i vindex, const int scale)

Gather a vector of 32-bit single-precision floating-point numbers from memory. Each element is read from base_addr plus the corresponding 64-bit index in vindex multiplied by scale bytes. scale must be 1, 2, 4, or 8.

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.

__m128 _mm256_mask_i64gather_ps (__m128 src, float const* base_addr, __m256i vindex, __m128 mask, const int scale)

Gather a vector of 32-bit single-precision floating-point numbers from memory. Each element is read from base_addr plus the corresponding 64-bit index in vindex multiplied by scale bytes. scale must be 1, 2, 4, or 8.

For each lane, if the highest bit of the corresponding 32-bit element of mask is 1, dst receives the gathered value; otherwise it receives the corresponding element of src.

Gather single-precision (32-bit) floating-point elements from memory using 64-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 64-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst using mask (elements are copied from src when the highest bit is not set in the corresponding element). scale should be 1, 2, 4 or 8.

__m256i _mm256_lddqu_si256 (__m256i const * mem_addr)

Load 256 bits of integer data from unaligned memory. This instruction may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary

Load 256-bits of integer data from unaligned memory into dst. This intrinsic may perform better than _mm256_loadu_si256 when the data crosses a cache line boundary.

__m256d _mm256_load_pd (double const * mem_addr)

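Load 256 bits of data (four 64-bit double-precision floating-point numbers) from memory. mem_addr must be 32-byte aligned; otherwise a general-protection exception may be generated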

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

__m256 _mm256_load_ps (float const * mem_addr)

Load 256 bits of data (eight 32-bit single-precision floating-point numbers) from memory. mem_addr must be 32-byte aligned; otherwise a general-protection exception may be generated

Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

__m256i _mm256_load_si256 (__m256i const * mem_addr)

Load 256 bits of integer data from memory. mem_addr must be 32-byte aligned; otherwise a general-protection exception may be generated

Load 256-bits of integer data from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

__m256d _mm256_loadu_pd (double const * mem_addr)

Load 256 bits of data (four 64-bit double-precision floating-point numbers) from memory. mem_addr does not need to be aligned

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.

__m256 _mm256_loadu_ps (float const * mem_addr)

Load 256 bits of data (eight 32-bit single-precision floating-point numbers) from memory. mem_addr does not need to be aligned

Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.
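A small sketch of why the unaligned form is convenient: it can walk an array from any starting pointer. The helper sum_floats is hypothetical and assumes GCC or Clang (for the target attribute) and an AVX-capable CPU.

```c
#include <immintrin.h>

/* Sum n floats starting at p; p may have any alignment.
   Target attribute: GCC/Clang extension, assumed for this sketch. */
__attribute__((target("avx")))
float sum_floats(const float *p, int n)
{
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i)); /* no alignment requirement */

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (int k = 0; k < 8; k++)
        s += lanes[k];          /* horizontal sum of the 8 partial sums */
    for (; i < n; i++)
        s += p[i];              /* scalar tail for the last n % 8 elements */
    return s;
}
```

Replacing _mm256_loadu_ps with _mm256_load_ps here would only be valid if every p + i were 32-byte aligned.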

__m256i _mm256_loadu_si256 (__m256i const * mem_addr)

Load 256 bits of integer data from memory. mem_addr does not need to be aligned

Load 256-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.

__m256 _mm256_loadu2_m128 (float const* hiaddr, float const* loaddr)

Load two 128-bit values (each consisting of four 32-bit single-precision floating-point numbers) from memory and combine them into one 256-bit value. hiaddr and loaddr do not need to be aligned

Load two 128-bit values (composed of 4 packed single-precision (32-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

__m256d _mm256_loadu2_m128d (double const* hiaddr, double const* loaddr)

Load two 128-bit values (each consisting of two 64-bit double-precision floating-point numbers) from memory and combine them into one 256-bit value. hiaddr and loaddr do not need to be aligned

Load two 128-bit values (composed of 2 packed double-precision (64-bit) floating-point elements) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

__m256i _mm256_loadu2_m128i (__m128i const* hiaddr, __m128i const* loaddr)

Load two 128-bit values (each consisting of integer data) from memory and combine them into one 256-bit value. hiaddr and loaddr do not need to be aligned

Load two 128-bit values (composed of integer data) from memory, and combine them into a 256-bit value in dst. hiaddr and loaddr do not need to be aligned on any particular boundary.

__m256i _mm256_maskload_epi32 (int const* mem_addr, __m256i mask)

Load a vector of 32-bit integers from memory. Lanes whose corresponding mask element has its highest bit set to 0 are set to zero in dst

Load packed 32-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).
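One common use of a masked load is handling the last few elements of an array without reading past its end. The sketch below builds the mask from an element count n; the helper name sum_tail_epi32 is hypothetical, and GCC or Clang with AVX2 hardware is assumed.

```c
#include <immintrin.h>

/* Sum the first n ints at p, for 0 <= n <= 8, using one masked load.
   Target attribute: GCC/Clang extension, assumed for this sketch. */
__attribute__((target("avx2")))
int sum_tail_epi32(const int *p, int n)
{
    __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* Lane i becomes all ones (highest bit set) when i < n, else all zeros. */
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), lane);
    __m256i v = _mm256_maskload_epi32(p, mask);  /* masked-off lanes read as 0 */

    int tmp[8];
    _mm256_storeu_si256((__m256i *)tmp, v);
    int s = 0;
    for (int k = 0; k < 8; k++)
        s += tmp[k];
    return s;
}
```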

__m256i _mm256_maskload_epi64 (__int64 const* mem_addr, __m256i mask)

Load a vector of 64-bit integers from memory. Lanes whose corresponding mask element has its highest bit set to 0 are set to zero in dst

Load packed 64-bit integers from memory into dst using mask (elements are zeroed out when the highest bit is not set in the corresponding element).

__m256d _mm256_maskload_pd (double const * mem_addr, __m256i mask)

Load a vector of 64-bit double-precision floating-point numbers from memory. Lanes whose corresponding mask element has its highest bit set to 0 are set to zero in dst

Load packed double-precision (64-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

__m256 _mm256_maskload_ps (float const * mem_addr, __m256i mask)

Load a vector of 32-bit single-precision floating-point numbers from memory. Lanes whose corresponding mask element has its highest bit set to 0 are set to zero in dst

Load packed single-precision (32-bit) floating-point elements from memory into dst using mask (elements are zeroed out when the high bit of the corresponding element is not set).

__m256i _mm256_stream_load_si256 (__m256i const* mem_addr)

Load 256 bits of integer data from memory using a non-temporal memory hint. mem_addr must be 32-byte aligned; otherwise a general-protection exception may be generated

Load 256-bits of integer data from memory into dst using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Logical

__m256d _mm256_and_pd (__m256d a, __m256d b)

Bitwise AND of the 64-bit double-precision floating-point vectors a and b

Compute the bitwise AND of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

__m256 _mm256_and_ps (__m256 a, __m256 b)

Bitwise AND of the 32-bit single-precision floating-point vectors a and b

Compute the bitwise AND of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

__m256i _mm256_and_si256 (__m256i a, __m256i b)

Bitwise AND of the 256-bit vectors a and b

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and store the result in dst.

__m256d _mm256_andnot_pd (__m256d a, __m256d b)

Compute the bitwise NOT of the 64-bit double-precision floating-point vector a, then AND the result with vector b

Compute the bitwise NOT of packed double-precision (64-bit) floating-point elements in a and then AND with b, and store the results in dst.

__m256 _mm256_andnot_ps (__m256 a, __m256 b)

Compute the bitwise NOT of the 32-bit single-precision floating-point vector a, then AND the result with vector b

Compute the bitwise NOT of packed single-precision (32-bit) floating-point elements in a and then AND with b, and store the results in dst.
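Note that the NOT applies to the first operand only: the result is (~a) & b. A familiar use is clearing the sign bit to get an absolute value, sketched here with a hypothetical helper abs8_ps (GCC/Clang and AVX assumed).

```c
#include <immintrin.h>

/* out[i] = |in[i]| for 8 floats.
   Target attribute: GCC/Clang extension, assumed for this sketch. */
__attribute__((target("avx")))
void abs8_ps(const float *in, float *out)
{
    __m256 sign = _mm256_set1_ps(-0.0f);  /* only the sign bit set in each lane */
    __m256 x = _mm256_loadu_ps(in);
    /* (~sign) & x keeps every bit of x except the sign bit */
    _mm256_storeu_ps(out, _mm256_andnot_ps(sign, x));
}
```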

__m256i _mm256_andnot_si256 (__m256i a, __m256i b)

Compute the bitwise NOT of the 256-bit vector a, then AND the result with vector b

Compute the bitwise NOT of 256 bits (representing integer data) in a and then AND with b, and store the result in dst.

__m256d _mm256_or_pd (__m256d a, __m256d b)

Bitwise OR of the 64-bit double-precision floating-point vectors a and b

Compute the bitwise OR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

__m256 _mm256_or_ps (__m256 a, __m256 b)

Bitwise OR of the 32-bit single-precision floating-point vectors a and b

Compute the bitwise OR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

__m256i _mm256_or_si256 (__m256i a, __m256i b)

Bitwise OR of the 256-bit vectors a and b

Compute the bitwise OR of 256 bits (representing integer data) in a and b, and store the result in dst.

int _mm256_testc_pd (__m256d a, __m256d b)

Compute the bitwise AND of the 64-bit double-precision floating-point vectors a and b. If the sign bits of all 4 lanes of the intermediate result are 0, set ZF to 1; otherwise set ZF to 0. Then compute the bitwise NOT of a and AND it with b. If the sign bits of all 4 lanes of that intermediate result are 0, set CF to 1; otherwise set CF to 0. Return CF.

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

int _mm256_testc_ps (__m256 a, __m256 b)

Compute the bitwise AND of the 32-bit single-precision floating-point vectors a and b. If the sign bits of all 8 lanes of the intermediate result are 0, set ZF to 1; otherwise set ZF to 0. Then compute the bitwise NOT of a and AND it with b. If the sign bits of all 8 lanes of that intermediate result are 0, set CF to 1; otherwise set CF to 0. Return CF.

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the CF value.

int _mm256_testc_si256 (__m256i a, __m256i b)

Compute the bitwise AND of the 256-bit vectors a and b. If all bits of the intermediate result are 0, set ZF to 1; otherwise set ZF to 0. Then compute the bitwise NOT of a and AND it with b. If all bits of that intermediate result are 0, set CF to 1; otherwise set CF to 0. Return CF.

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the CF value.

int _mm256_testnzc_pd (__m256d a, __m256d b)

Compute the bitwise AND of the 64-bit double-precision floating-point vectors a and b. If the sign bits of all 4 lanes of the intermediate result are 0, set ZF to 1; otherwise set ZF to 0. Then compute the bitwise NOT of a and AND it with b. If the sign bits of all 4 lanes of that intermediate result are 0, set CF to 1; otherwise set CF to 0. Return 1 if both ZF and CF are 0; otherwise return 0.

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

int _mm256_testnzc_ps (__m256 a, __m256 b)

Compute the bitwise AND of the 32-bit single-precision floating-point vectors a and b. If the sign bits of all 8 lanes of the intermediate result are 0, set ZF to 1; otherwise set ZF to 0. Then compute the bitwise NOT of a and AND it with b. If the sign bits of all 8 lanes of that intermediate result are 0, set CF to 1; otherwise set CF to 0. Return 1 if both ZF and CF are 0; otherwise return 0.

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

int _mm256_testnzc_si256 (__m256i a, __m256i b)

Compute the bitwise AND of the 256-bit vectors a and b. If all bits of the intermediate result are 0, set ZF to 1; otherwise set ZF to 0. Then compute the bitwise NOT of a and AND it with b. If all bits of that intermediate result are 0, set CF to 1; otherwise set CF to 0. Return 1 if both ZF and CF are 0; otherwise return 0.

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return 1 if both the ZF and CF values are zero, otherwise return 0.

int _mm256_testz_pd (__m256d a, __m256d b)

Compute the bitwise AND of the 64-bit double-precision floating-point vectors a and b. If the sign bits of all 4 lanes of the intermediate result are 0, set ZF to 1; otherwise set ZF to 0. Then compute the bitwise NOT of a and AND it with b. If the sign bits of all 4 lanes of that intermediate result are 0, set CF to 1; otherwise set CF to 0. Return ZF.

Compute the bitwise AND of 256 bits (representing double-precision (64-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 64-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

int _mm256_testz_ps (__m256 a, __m256 b)

Compute the bitwise AND of the 32-bit single-precision floating-point vectors a and b. If the sign bits of all 8 lanes of the intermediate result are 0, set ZF to 1; otherwise set ZF to 0. Then compute the bitwise NOT of a and AND it with b. If the sign bits of all 8 lanes of that intermediate result are 0, set CF to 1; otherwise set CF to 0. Return ZF.

Compute the bitwise AND of 256 bits (representing single-precision (32-bit) floating-point elements) in a and b, producing an intermediate 256-bit value, and set ZF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, producing an intermediate value, and set CF to 1 if the sign bit of each 32-bit element in the intermediate value is zero, otherwise set CF to 0. Return the ZF value.

int _mm256_testz_si256 (__m256i a, __m256i b)

Compute the bitwise AND of the 256-bit vectors a and b. If all bits of the intermediate result are 0, set ZF to 1; otherwise set ZF to 0. Then compute the bitwise NOT of a and AND it with b. If all bits of that intermediate result are 0, set CF to 1; otherwise set CF to 0. Return ZF.

Compute the bitwise AND of 256 bits (representing integer data) in a and b, and set ZF to 1 if the result is zero, otherwise set ZF to 0. Compute the bitwise NOT of a and then AND with b, and set CF to 1 if the result is zero, otherwise set CF to 0. Return the ZF value.
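In practice the two integer tests answer simple questions: _mm256_testz_si256(v, v) reports whether v is entirely zero, and _mm256_testc_si256(a, b) reports whether every bit set in b is also set in a. A sketch with hypothetical helper names, assuming GCC or Clang and an AVX-capable CPU:

```c
#include <immintrin.h>

/* Target attribute: GCC/Clang extension, assumed for this sketch. */

/* Returns 1 when all 256 bits at p are zero. */
__attribute__((target("avx")))
int is_all_zero_256(const void *p)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)p);
    return _mm256_testz_si256(v, v);      /* ZF: (v & v) == 0 */
}

/* Returns 1 when every bit set in `required` is also set in `a`. */
__attribute__((target("avx")))
int has_all_bits_256(const void *a, const void *required)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vr = _mm256_loadu_si256((const __m256i *)required);
    return _mm256_testc_si256(va, vr);    /* CF: (~a & required) == 0 */
}
```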

__m256d _mm256_xor_pd (__m256d a, __m256d b)

Bitwise XOR of the 64-bit double-precision floating-point vectors a and b

Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

__m256 _mm256_xor_ps (__m256 a, __m256 b)

Bitwise XOR of the 32-bit single-precision floating-point vectors a and b

Compute the bitwise XOR of packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst.

__m256i _mm256_xor_si256 (__m256i a, __m256i b)

Bitwise XOR of the 256-bit vectors a and b

Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.

Miscellaneous

__m256i _mm256_alignr_epi8 (__m256i a, __m256i b, const int imm8)

For each 128-bit lane, concatenate the 16-byte blocks of a and b into a 32-byte intermediate value, shift it right by imm8 bytes, and store the low 16 bytes in the corresponding lane of dst

Concatenate pairs of 16-byte blocks in a and b into a 32-byte temporary result, shift the result right by imm8 bytes, and store the low 16 bytes in dst.
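The per-lane behaviour is easy to miss: the shift never carries bytes from one 128-bit lane into the other. A sketch with imm8 = 4, using a hypothetical helper alignr4 (GCC/Clang and AVX2 assumed):

```c
#include <immintrin.h>

/* For each 128-bit lane: out = low 16 bytes of ((a_lane : b_lane) >> 4 bytes),
   i.e. 12 bytes taken from b followed by 4 bytes taken from a.
   Target attribute: GCC/Clang extension, assumed for this sketch. */
__attribute__((target("avx2")))
void alignr4(const unsigned char *a, const unsigned char *b, unsigned char *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_alignr_epi8(va, vb, 4));
}
```

In the low lane the result is b[4..15] followed by a[0..3]; in the high lane it is b[20..31] followed by a[16..19].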

int _mm256_movemask_epi8 (__m256i a)

Collect the highest bit of each 8-bit element of a into a 32-bit mask dst

Create mask from the most significant bit of each 8-bit element in a, and store the result in dst.
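A typical pairing is a byte-wise compare followed by the mask extraction, which turns 32 comparison results into one integer. The helper match_mask32 below is a hypothetical illustration, assuming GCC or Clang and AVX2 hardware.

```c
#include <immintrin.h>

/* Bit i of the result is 1 when p[i] == c, for i in 0..31.
   Target attribute: GCC/Clang extension, assumed for this sketch. */
__attribute__((target("avx2")))
unsigned match_mask32(const unsigned char *p, unsigned char c)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)p);
    /* Equal bytes become 0xFF (highest bit set), others 0x00. */
    __m256i eq = _mm256_cmpeq_epi8(v, _mm256_set1_epi8((char)c));
    return (unsigned)_mm256_movemask_epi8(eq);
}
```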

int _mm256_movemask_pd (__m256d a)

Collect the highest bit of each 64-bit double-precision floating-point element of a into the low 4 bits of the integer dst

Set each bit of mask dst based on the most significant bit of the corresponding packed double-precision (64-bit) floating-point element in a.

int _mm256_movemask_ps (__m256 a)

Collect the highest bit of each 32-bit single-precision floating-point element of a into the low 8 bits of the integer dst

Set each bit of mask dst based on the most significant bit of the corresponding packed single-precision (32-bit) floating-point element in a.

__m256i _mm256_mpsadbw_epu8 (__m256i a, __m256i b, const int imm8)

Compute the sum of absolute differences (SADs) of quadruplets of unsigned 8-bit integers in a compared to those in b, and store the 16-bit results in dst. Eight SADs are performed for each 128-bit lane using one quadruplet from b and eight quadruplets from a. One quadruplet is selected from b starting at the offset specified in imm8. Eight quadruplets are formed from sequential 8-bit integers selected from a starting at the offset specified in imm8.

__m256i _mm256_packs_epi16 (__m256i a, __m256i b)

Using signed saturation, narrow the signed 16-bit integers in a and b to 8-bit integers and store them in dst, interleaving the results from a and b per 128-bit lane

Convert packed signed 16-bit integers from a and b to packed 8-bit integers using signed saturation, and store the results in dst.
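Because the pack works on each 128-bit lane separately, the output order is a[0..7], b[0..7], a[8..15], b[8..15] rather than all of a followed by all of b. The sketch below shows the raw result and one common way to restore linear order with _mm256_permute4x64_epi64; that follow-up step is an addition for illustration, and the helper name pack16_to_8 is hypothetical (GCC/Clang and AVX2 assumed).

```c
#include <immintrin.h>

/* Target attribute: GCC/Clang extension, assumed for this sketch. */
__attribute__((target("avx2")))
void pack16_to_8(const short *a, const short *b,
                 signed char *raw, signed char *ordered)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* 64-bit blocks of the result: [a_lo, b_lo, a_hi, b_hi] */
    __m256i packed = _mm256_packs_epi16(va, vb);
    _mm256_storeu_si256((__m256i *)raw, packed);

    /* 0xD8 selects blocks 0, 2, 1, 3, giving [a_lo, a_hi, b_lo, b_hi] */
    __m256i fixed = _mm256_permute4x64_epi64(packed, 0xD8);
    _mm256_storeu_si256((__m256i *)ordered, fixed);
}
```

Values outside the range -128..127 are clamped to the nearest bound, so 300 becomes 127 and -300 becomes -128.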

__m256i _mm256_packs_epi32 (__m256i a, __m256i b)

Using signed saturation, narrow the signed 32-bit integers in a and b to 16-bit integers and store them in dst, interleaving the results from a and b per 128-bit lane

Convert packed signed 32-bit integers from a and b to packed 16-bit integers using signed saturation, and store the results in dst.

__m256i _mm256_packus_epi16 (__m256i a, __m256i b)

Using unsigned saturation, narrow the signed 16-bit integers in a and b to unsigned 8-bit integers and store them in dst, interleaving the results from a and b per 128-bit lane

Convert packed signed 16-bit integers from a and b to packed 8-bit integers using unsigned saturation, and store the results in dst.

__m256i _mm256_packus_epi32 (__m256i a, __m256i b)

Using unsigned saturation, narrow the signed 32-bit integers in a and b to unsigned 16-bit integers and store them in dst, interleaving the results from a and b per 128-bit lane

Convert packed signed 32-bit integers from a and b to packed 16-bit integers using unsigned saturation, and store the results in dst.

Move

__m256d _mm256_movedup_pd (__m256d a)

Copies even-indexed elements of a 64-bit double-precision floating-point vector to adjacent odd-indexed elements

Duplicate even-indexed double-precision (64-bit) floating-point elements from a, and store the results in dst.

__m256 _mm256_movehdup_ps (__m256 a)

Copies odd-indexed elements of a vector of 32-bit single-precision floating-point numbers to adjacent even-indexed elements

Duplicate odd-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.

__m256 _mm256_moveldup_ps (__m256 a)

Copies even-indexed elements of a 32-bit single-precision floating-point vector to adjacent odd-indexed elements

Duplicate even-indexed single-precision (32-bit) floating-point elements from a, and store the results in dst.
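For an input of {0, 1, 2, 3, 4, 5, 6, 7}, _mm256_moveldup_ps yields {0, 0, 2, 2, 4, 4, 6, 6} and _mm256_movehdup_ps yields {1, 1, 3, 3, 5, 5, 7, 7}. A sketch with a hypothetical helper dup_even_odd, assuming GCC or Clang and an AVX-capable CPU:

```c
#include <immintrin.h>

/* Target attribute: GCC/Clang extension, assumed for this sketch. */
__attribute__((target("avx")))
void dup_even_odd(const float *in, float *even_dup, float *odd_dup)
{
    __m256 v = _mm256_loadu_ps(in);
    _mm256_storeu_ps(even_dup, _mm256_moveldup_ps(v)); /* copies elements 0, 2, 4, 6 */
    _mm256_storeu_ps(odd_dup,  _mm256_movehdup_ps(v)); /* copies elements 1, 3, 5, 7 */
}
```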

Probability/Statistics

__m256i _mm256_avg_epu16 (__m256i a, __m256i b)

Computes the mean of the 16-bit unsigned integer vectors a and b

Average packed unsigned 16-bit integers in a and b, and store the results in dst.

__m256i _mm256_avg_epu8 (__m256i a, __m256i b)

Computes the mean of 8-bit unsigned integer vectors a and b

Average packed unsigned 8-bit integers in a and b, and store the results in dst.
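The average rounds up: each result is (a + b + 1) >> 1, computed without overflow in the 8-bit lanes. A sketch with a hypothetical helper avg32_epu8, assuming GCC or Clang and AVX2 hardware:

```c
#include <immintrin.h>

/* out[i] = (a[i] + b[i] + 1) >> 1 for 32 unsigned bytes.
   Target attribute: GCC/Clang extension, assumed for this sketch. */
__attribute__((target("avx2")))
void avg32_epu8(const unsigned char *a, const unsigned char *b,
                unsigned char *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_avg_epu8(va, vb));
}
```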

Set

__m256i _mm256_set_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)

Set each element of a 16-bit integer vector to the specified value

Set packed 16-bit integers in dst with the supplied values.

__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)

Set each element of a 32-bit integer vector to the specified value

Set packed 32-bit integers in dst with the supplied values.

__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)

Set each element of a 64-bit integer vector to the specified value

Set packed 64-bit integers in dst with the supplied values.

__m256i _mm256_set_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)

Set each element of an 8-bit integer vector to the specified value

Set packed 8-bit integers in dst with the supplied values.

__m256 _mm256_set_m128 (__m128 hi, __m128 lo)

Set a __m256 from two __m128 values

Set packed __m256 vector dst with the supplied values.

__m256d _mm256_set_m128d (__m128d hi, __m128d lo)

Set a __m256d from two __m128d values

Set packed __m256d vector dst with the supplied values.

__m256i _mm256_set_m128i (__m128i hi, __m128i lo)

Set a __m256i from two __m128i values

Set packed __m256i vector dst with the supplied values.

__m256d _mm256_set_pd (double e3, double e2, double e1, double e0)

Set each element of a 64-bit double-precision floating-point vector to the specified value

Set packed double-precision (64-bit) floating-point elements in dst with the supplied values.

__m256 _mm256_set_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)

Set each element of a 32-bit single-precision floating-point vector to the specified value

Set packed single-precision (32-bit) floating-point elements in dst with the supplied values.

__m256i _mm256_set1_epi16 (short a)

Broadcast a 16-bit integer as a vector

Broadcast 16-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastw.

__m256i _mm256_set1_epi32 (int a)

Broadcast a 32-bit integer as a vector

Broadcast 32-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastd.

__m256i _mm256_set1_epi64x (long long a)

Broadcast a 64-bit integer as a vector

Broadcast 64-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastq.

__m256i _mm256_set1_epi8 (char a)

Broadcast an 8-bit integer as a vector

Broadcast 8-bit integer a to all elements of dst. This intrinsic may generate the vpbroadcastb.

__m256d _mm256_set1_pd (double a)

Broadcast a 64-bit double-precision floating-point number as a vector

Broadcast double-precision (64-bit) floating-point value a to all elements of dst.

__m256 _mm256_set1_ps (float a)

Broadcast a 32-bit single-precision floating-point number as a vector

Broadcast single-precision (32-bit) floating-point value a to all elements of dst.

__m256i _mm256_setr_epi16 (short e15, short e14, short e13, short e12, short e11, short e10, short e9, short e8, short e7, short e6, short e5, short e4, short e3, short e2, short e1, short e0)

Set each element of a 16-bit integer vector to the specified values, in reverse order

Set packed 16-bit integers in dst with the supplied values in reverse order.

__m256i _mm256_setr_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0)

Set each element of a 32-bit integer vector to the specified values, in reverse order

Set packed 32-bit integers in dst with the supplied values in reverse order.
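The difference between the set and setr forms is only the argument order: _mm256_set_epi32 takes the highest element first, while _mm256_setr_epi32 takes element 0 first. The sketch below builds the same vector both ways; the helper name fill_set_and_setr is hypothetical, and GCC or Clang with an AVX-capable CPU is assumed.

```c
#include <immintrin.h>

/* Both outputs end up as {0, 1, 2, 3, 4, 5, 6, 7} in memory order.
   Target attribute: GCC/Clang extension, assumed for this sketch. */
__attribute__((target("avx")))
void fill_set_and_setr(int *from_set, int *from_setr)
{
    __m256i s = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);   /* last argument is element 0 */
    __m256i r = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);  /* first argument is element 0 */
    _mm256_storeu_si256((__m256i *)from_set, s);
    _mm256_storeu_si256((__m256i *)from_setr, r);
}
```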

__m256i _mm256_setr_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0)

Set each element of a 64-bit integer vector to the specified values, in reverse order

Set packed 64-bit integers in dst with the supplied values in reverse order.

__m256i _mm256_setr_epi8 (char e31, char e30, char e29, char e28, char e27, char e26, char e25, char e24, char e23, char e22, char e21, char e20, char e19, char e18, char e17, char e16, char e15, char e14, char e13, char e12, char e11, char e10, char e9, char e8, char e7, char e6, char e5, char e4, char e3, char e2, char e1, char e0)

Set each element of an 8-bit integer vector to the specified values, in reverse order

Set packed 8-bit integers in dst with the supplied values in reverse order.

__m256 _mm256_setr_m128 (__m128 lo, __m128 hi)

Set a __m256 from two __m128 values, with the arguments in reverse order (lo first)

Set packed __m256 vector dst with the supplied values.

__m256d _mm256_setr_m128d (__m128d lo, __m128d hi)

Set a __m256d from two __m128d values in reverse order (low half first)

Set packed __m256d vector dst with the supplied values.

__m256i _mm256_setr_m128i (__m128i lo, __m128i hi)

Set a __m256i from two __m128i values in reverse order (low half first)

Set packed __m256i vector dst with the supplied values.

__m256d _mm256_setr_pd (double e3, double e2, double e1, double e0)

Set a 64-bit double-precision floating-point vector from the specified values in reverse order

Set packed double-precision (64-bit) floating-point elements in dst with the supplied values in reverse order.

__m256 _mm256_setr_ps (float e7, float e6, float e5, float e4, float e3, float e2, float e1, float e0)

Set a 32-bit single-precision floating-point vector from the specified values in reverse order

Set packed single-precision (32-bit) floating-point elements in dst with the supplied values in reverse order.
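
"Reverse order" here means that the first argument lands in the lowest element. The sketch below contrasts _mm256_setr_epi32 with _mm256_set_epi32 (which takes the highest element first). The function name order_demo is hypothetical, and the target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>

/* Hypothetical demo: both vectors end up as 0,1,...,7 in memory order.
   Assumption: GCC or Clang; target("avx") enables AVX for this function only. */
__attribute__((target("avx")))
void order_demo(int set_out[8], int setr_out[8])
{
    /* set: arguments run from the highest element down to the lowest */
    __m256i hi_first = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    /* setr: arguments run from the lowest element up to the highest */
    __m256i lo_first = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

    _mm256_storeu_si256((__m256i *)set_out, hi_first);
    _mm256_storeu_si256((__m256i *)setr_out, lo_first);
}
```

The setr form is usually the easier one to read next to array initialisers, since its argument order matches memory order.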

__m256d _mm256_setzero_pd (void)

Return a __m256d with all elements set to zero

Return vector of type __m256d with all elements set to zero.

__m256 _mm256_setzero_ps (void)

Return a __m256 with all elements set to zero

Return vector of type __m256 with all elements set to zero.

__m256i _mm256_setzero_si256 (void)

Return a __m256i with all elements set to zero

Return vector of type __m256i with all elements set to zero.

Shift

__m256i _mm256_bslli_epi128 (__m256i a, const int imm8)

Shift each 128-bit lane left by imm8 bytes, filling with zeros

Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.

__m256i _mm256_bsrli_epi128 (__m256i a, const int imm8)

Shift each 128-bit lane right by imm8 bytes, filling with zeros

Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.

__m256i _mm256_sll_epi16 (__m256i a, __m128i count)

Shift the 16-bit integer vector left by count bits, filling with zeros

Shift packed 16-bit integers in a left by count while shifting in zeros, and store the results in dst.

__m256i _mm256_sll_epi32 (__m256i a, __m128i count)

Shift the 32-bit integer vector left by count bits, filling with zeros

Shift packed 32-bit integers in a left by count while shifting in zeros, and store the results in dst.

__m256i _mm256_sll_epi64 (__m256i a, __m128i count)

Shift the 64-bit integer vector left by count bits, filling with zeros

Shift packed 64-bit integers in a left by count while shifting in zeros, and store the results in dst.

__m256i _mm256_slli_epi16 (__m256i a, int imm8)

Shift the 16-bit integer vector left by imm8 bits, filling with zeros

Shift packed 16-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

__m256i _mm256_slli_epi32 (__m256i a, int imm8)

Shift the 32-bit integer vector left by imm8 bits, filling with zeros

Shift packed 32-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

__m256i _mm256_slli_epi64 (__m256i a, int imm8)

Shift the 64-bit integer vector left by imm8 bits, filling with zeros

Shift packed 64-bit integers in a left by imm8 while shifting in zeros, and store the results in dst.

__m256i _mm256_slli_si256 (__m256i a, const int imm8)

Shift each 128-bit lane left by imm8 bytes, filling with zeros

Shift 128-bit lanes in a left by imm8 bytes while shifting in zeros, and store the results in dst.
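
The byte shifts operate on each 128-bit lane separately, so bytes never cross from the low lane into the high lane. A small sketch follows; the function name is hypothetical and the target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical demo: shift both 128-bit lanes left by one byte.
   Assumption: GCC or Clang; target("avx2") enables AVX2 for this function only. */
__attribute__((target("avx2")))
void shift_left_one_byte(const uint8_t in[32], uint8_t out[32])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i s = _mm256_slli_si256(v, 1);  /* imm8 must be a compile-time constant */
    _mm256_storeu_si256((__m256i *)out, s);
}
```

With in[i] = i + 1, out[0] and out[16] are both 0: byte 15 of the low lane is dropped rather than carried into byte 16.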

__m256i _mm256_sllv_epi32 (__m256i a, __m256i count)

Shift each element of the 32-bit integer vector left by the value of the corresponding element in count, filling with zeros

Shift packed 32-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

__m256i _mm256_sllv_epi64 (__m256i a, __m256i count)

Shift each element of the 64-bit integer vector left by the value of the corresponding element in count, filling with zeros

Shift packed 64-bit integers in a left by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.
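
The variable-count form gives each element its own shift amount. In the sketch below (hypothetical function name, GCC/Clang target attribute assumed), counts of 32 or more produce zero for 32-bit elements.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical demo: shift the value 1 left by a different count per element.
   Assumption: GCC or Clang; target("avx2") enables AVX2 for this function only. */
__attribute__((target("avx2")))
void per_element_shift(uint32_t out[8])
{
    __m256i ones   = _mm256_set1_epi32(1);
    __m256i counts = _mm256_setr_epi32(0, 1, 2, 3, 4, 31, 32, 33);
    _mm256_storeu_si256((__m256i *)out, _mm256_sllv_epi32(ones, counts));
}
```

The result is 1, 2, 4, 8, 16, 0x80000000, 0, 0.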

__m256i _mm256_sra_epi16 (__m256i a, __m128i count)

Shift the 16-bit integer vector right by count bits (arithmetic shift, filling with the sign bit)

Shift packed 16-bit integers in a right by count while shifting in sign bits, and store the results in dst.

__m256i _mm256_sra_epi32 (__m256i a, __m128i count)

Shift the 32-bit integer vector right by count bits (arithmetic shift, filling with the sign bit)

Shift packed 32-bit integers in a right by count while shifting in sign bits, and store the results in dst.

__m256i _mm256_srai_epi16 (__m256i a, int imm8)

Shift the 16-bit integer vector right by imm8 bits (arithmetic shift, filling with the sign bit)

Shift packed 16-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.

__m256i _mm256_srai_epi32 (__m256i a, int imm8)

Shift the 32-bit integer vector right by imm8 bits (arithmetic shift, filling with the sign bit)

Shift packed 32-bit integers in a right by imm8 while shifting in sign bits, and store the results in dst.

__m256i _mm256_srav_epi32 (__m256i a, __m256i count)

Shift each element of the 32-bit integer vector right by the value of the corresponding element in count (arithmetic shift, filling with the sign bit)

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in sign bits, and store the results in dst.

__m256i _mm256_srl_epi16 (__m256i a, __m128i count)

Shift the 16-bit integer vector right by count bits (logical shift, filling with zeros)

Shift packed 16-bit integers in a right by count while shifting in zeros, and store the results in dst.

__m256i _mm256_srl_epi32 (__m256i a, __m128i count)

Shift the 32-bit integer vector right by count bits (logical shift, filling with zeros)

Shift packed 32-bit integers in a right by count while shifting in zeros, and store the results in dst.

__m256i _mm256_srl_epi64 (__m256i a, __m128i count)

Shift the 64-bit integer vector right by count bits (logical shift, filling with zeros)

Shift packed 64-bit integers in a right by count while shifting in zeros, and store the results in dst.

__m256i _mm256_srli_epi16 (__m256i a, int imm8)

Shift the 16-bit integer vector right by imm8 bits (logical shift, filling with zeros)

Shift packed 16-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

__m256i _mm256_srli_epi32 (__m256i a, int imm8)

Shift the 32-bit integer vector right by imm8 bits (logical shift, filling with zeros)

Shift packed 32-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.
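
The difference between the arithmetic (sra/srai) and logical (srl/srli) right shifts only shows on negative inputs. A short sketch, with a hypothetical function name and the GCC/Clang target attribute as an assumption:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical demo: shift the same input right by 2 bits in both ways.
   Assumption: GCC or Clang; target("avx2") enables AVX2 for this function only. */
__attribute__((target("avx2")))
void shift_right_by_2(const int32_t in[8], int32_t arith[8], uint32_t logical[8])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    /* srai copies the sign bit into the vacated high bits */
    _mm256_storeu_si256((__m256i *)arith, _mm256_srai_epi32(v, 2));
    /* srli fills the vacated high bits with zeros */
    _mm256_storeu_si256((__m256i *)logical, _mm256_srli_epi32(v, 2));
}
```

For an input element of -16, the arithmetic shift gives -4 while the logical shift gives 0x3FFFFFFC; for 16 both give 4.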

__m256i _mm256_srli_epi64 (__m256i a, int imm8)

Shift the 64-bit integer vector right by imm8 bits (logical shift, filling with zeros)

Shift packed 64-bit integers in a right by imm8 while shifting in zeros, and store the results in dst.

__m256i _mm256_srli_si256 (__m256i a, const int imm8)

Shift each 128-bit lane right by imm8 bytes (logical shift, filling with zeros)

Shift 128-bit lanes in a right by imm8 bytes while shifting in zeros, and store the results in dst.

__m256i _mm256_srlv_epi32 (__m256i a, __m256i count)

Shift each element of the 32-bit integer vector right by the value of the corresponding element in count (logical shift, filling with zeros)

Shift packed 32-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

__m256i _mm256_srlv_epi64 (__m256i a, __m256i count)

Shift each element of the 64-bit integer vector right by the value of the corresponding element in count (logical shift, filling with zeros)

Shift packed 64-bit integers in a right by the amount specified by the corresponding element in count while shifting in zeros, and store the results in dst.

Special Math Functions

__m256i _mm256_abs_epi16 (__m256i a)

Computes the absolute value of a 16-bit signed integer vector a

Compute the absolute value of packed signed 16-bit integers in a, and store the unsigned results in dst.

__m256i _mm256_abs_epi32 (__m256i a)

Computes the absolute value of a 32-bit signed integer vector a

Compute the absolute value of packed signed 32-bit integers in a, and store the unsigned results in dst.

__m256i _mm256_abs_epi8 (__m256i a)

Computes the absolute value of an 8-bit signed integer vector a

Compute the absolute value of packed signed 8-bit integers in a, and store the unsigned results in dst.
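
The phrase "unsigned results" matters for the most negative input, whose absolute value does not fit in the signed type. The sketch below uses _mm256_abs_epi16 and reads the result back as uint16_t. The function name is hypothetical and the target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical demo: absolute value of 16 signed 16-bit integers.
   Assumption: GCC or Clang; target("avx2") enables AVX2 for this function only. */
__attribute__((target("avx2")))
void abs16(const int16_t in[16], uint16_t out[16])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    _mm256_storeu_si256((__m256i *)out, _mm256_abs_epi16(v));
}
```

An input of -32768 yields the bit pattern 0x8000, which is 32768 when read as unsigned but still -32768 if reinterpreted as int16_t.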

__m256d _mm256_ceil_pd (__m256d a)

Round up the 64-bit double-precision floating-point number vector a and return the 64-bit double-precision floating-point number vector

Round the packed double-precision (64-bit) floating-point elements in a up to an integer value, and store the results as packed double-precision floating-point elements in dst.

__m256 _mm256_ceil_ps (__m256 a)

Round up the 32-bit single-precision floating-point number vector a and return the 32-bit single-precision floating-point number vector

Round the packed single-precision (32-bit) floating-point elements in a up to an integer value, and store the results as packed single-precision floating-point elements in dst.

__m256d _mm256_floor_pd (__m256d a)

Round down the 64-bit double-precision floating-point number vector a and return the 64-bit double-precision floating-point number vector

Round the packed double-precision (64-bit) floating-point elements in a down to an integer value, and store the results as packed double-precision floating-point elements in dst.

__m256 _mm256_floor_ps (__m256 a)

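Round down the 32-bit single-precision floating-point number vector a and return the 32-bit single-precision floating-point number vector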

Round the packed single-precision (32-bit) floating-point elements in a down to an integer value, and store the results as packed single-precision floating-point elements in dst.

__m256i _mm256_max_epi16 (__m256i a, __m256i b)

Calculate the maximum value of 16-bit signed integer vectors a and b

Compare packed signed 16-bit integers in a and b, and store packed maximum values in dst.

__m256i _mm256_max_epi32 (__m256i a, __m256i b)

Calculate the maximum value of 32-bit signed integer vectors a and b

Compare packed signed 32-bit integers in a and b, and store packed maximum values in dst.

__m256i _mm256_max_epi8 (__m256i a, __m256i b)

Calculate the maximum value of 8-bit signed integer vectors a and b

Compare packed signed 8-bit integers in a and b, and store packed maximum values in dst.

__m256i _mm256_max_epu16 (__m256i a, __m256i b)

Calculate the maximum value of 16-bit unsigned integer vectors a and b

Compare packed unsigned 16-bit integers in a and b, and store packed maximum values in dst.

__m256i _mm256_max_epu32 (__m256i a, __m256i b)

Calculate the maximum value of 32-bit unsigned integer vectors a and b

Compare packed unsigned 32-bit integers in a and b, and store packed maximum values in dst.

__m256i _mm256_max_epu8 (__m256i a, __m256i b)

Calculate the maximum value of 8-bit unsigned integer vectors a and b

Compare packed unsigned 8-bit integers in a and b, and store packed maximum values in dst.

__m256d _mm256_max_pd (__m256d a, __m256d b)

Calculate the maximum value of 64-bit double-precision floating-point vectors a and b. The result does not follow IEEE 754 when an input is NaN or a signed zero

Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed maximum values in dst. dst does not follow the IEEE Standard for Floating-Point Arithmetic (IEEE 754) maximum value when inputs are NaN or signed-zero values.

__m256 _mm256_max_ps (__m256 a, __m256 b)

Calculate the maximum value of 32-bit single-precision floating-point vectors a and b. The result does not follow IEEE 754 when an input is NaN or a signed zero

Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed maximum values in dst. dst does not follow the IEEE Standard for Floating-Point Arithmetic (IEEE 754) maximum value when inputs are NaN or signed-zero values.

__m256i _mm256_min_epi16 (__m256i a, __m256i b)

Calculate the minimum value of 16-bit signed integer vectors a and b

Compare packed signed 16-bit integers in a and b, and store packed minimum values in dst.

__m256i _mm256_min_epi32 (__m256i a, __m256i b)

Calculate the minimum value of 32-bit signed integer vectors a and b

Compare packed signed 32-bit integers in a and b, and store packed minimum values in dst.

__m256i _mm256_min_epi8 (__m256i a, __m256i b)

Calculate the minimum value of 8-bit signed integer vectors a and b

Compare packed signed 8-bit integers in a and b, and store packed minimum values in dst.

__m256i _mm256_min_epu16 (__m256i a, __m256i b)

Calculate the minimum value of 16-bit unsigned integer vectors a and b

Compare packed unsigned 16-bit integers in a and b, and store packed minimum values in dst.

__m256i _mm256_min_epu32 (__m256i a, __m256i b)

Calculate the minimum value of 32-bit unsigned integer vectors a and b

Compare packed unsigned 32-bit integers in a and b, and store packed minimum values in dst.

__m256i _mm256_min_epu8 (__m256i a, __m256i b)

Calculate the minimum value of 8-bit unsigned integer vectors a and b

Compare packed unsigned 8-bit integers in a and b, and store packed minimum values in dst.

__m256d _mm256_min_pd (__m256d a, __m256d b)

Calculate the minimum value of 64-bit double-precision floating-point vectors a and b. The result does not follow IEEE 754 when an input is NaN or a signed zero

Compare packed double-precision (64-bit) floating-point elements in a and b, and store packed minimum values in dst. dst does not follow the IEEE Standard for Floating-Point Arithmetic (IEEE 754) minimum value when inputs are NaN or signed-zero values.

__m256 _mm256_min_ps (__m256 a, __m256 b)

Calculate the minimum value of 32-bit single-precision floating-point vectors a and b. The result does not follow IEEE 754 when an input is NaN or a signed zero

Compare packed single-precision (32-bit) floating-point elements in a and b, and store packed minimum values in dst. dst does not follow the IEEE Standard for Floating-Point Arithmetic (IEEE 754) minimum value when inputs are NaN or signed-zero values.

__m256d _mm256_round_pd (__m256d a, int rounding)

Round the 64-bit double-precision floating-point vector a and return a 64-bit double-precision floating-point vector. The rounding method is selected by the rounding parameter.

Round the packed double-precision (64-bit) floating-point elements in a using the rounding parameter, and store the results as packed double-precision floating-point elements in dst. Rounding is done according to the rounding[3:0] parameter.

__m256 _mm256_round_ps (__m256 a, int rounding)

Round the 32-bit single-precision floating-point vector a and return a 32-bit single-precision floating-point vector. The rounding method is selected by the rounding parameter.

Round the packed single-precision (32-bit) floating-point elements in a using the rounding parameter, and store the results as packed single-precision floating-point elements in dst. Rounding is done according to the rounding[3:0] parameter.
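
The rounding argument is normally built from the _MM_FROUND_* constants in the intrinsics headers. The sketch below uses two of them with _mm256_round_pd: round to nearest (ties to even) and round toward zero, each combined with _MM_FROUND_NO_EXC to suppress exceptions. The function name is hypothetical and the target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>

/* Hypothetical demo: round 4 doubles with two different rounding modes.
   Assumption: GCC or Clang; target("avx") enables AVX for this function only. */
__attribute__((target("avx")))
void round_demo(const double in[4], double nearest[4], double toward_zero[4])
{
    __m256d v = _mm256_loadu_pd(in);
    /* The rounding argument must be a compile-time constant. */
    _mm256_storeu_pd(nearest,
        _mm256_round_pd(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
    _mm256_storeu_pd(toward_zero,
        _mm256_round_pd(v, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC));
}
```

For the input 2.5, 3.5, -1.5, 0.4 the nearest mode gives 2, 4, -2, 0 and the toward-zero mode gives 2, 3, -1, 0.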

Store

void _mm256_maskstore_epi32 (int* mem_addr, __m256i mask, __m256i a)

Store the 32-bit integer vector to memory; an element is not stored when the highest bit of the corresponding mask element is zero

Store packed 32-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

void _mm256_maskstore_epi64 (__int64* mem_addr, __m256i mask, __m256i a)

Store the 64-bit integer vector to memory; an element is not stored when the highest bit of the corresponding mask element is zero

Store packed 64-bit integers from a into memory using mask (elements are not stored when the highest bit is not set in the corresponding element).

void _mm256_maskstore_pd (double * mem_addr, __m256i mask, __m256d a)

Store the 64-bit double-precision floating-point vector to memory; an element is not stored when the highest bit of the corresponding mask element is zero

Store packed double-precision (64-bit) floating-point elements from a into memory using mask.

void _mm256_maskstore_ps (float * mem_addr, __m256i mask, __m256 a)

Store the 32-bit single-precision floating-point vector to memory; an element is not stored when the highest bit of the corresponding mask element is zero

Store packed single-precision (32-bit) floating-point elements from a into memory using mask.
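
A common use of the masked stores is writing a partial vector at the end of an array. The sketch below builds the mask by comparing element indices against n. The helper name store_first_n is hypothetical, and the target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>

/* Hypothetical helper: store only the first n (0..8) floats of src to dst.
   Assumption: GCC or Clang; target("avx2") enables AVX2 for this function only. */
__attribute__((target("avx2")))
void store_first_n(float *dst, const float src[8], int n)
{
    __m256i index = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    /* n > index gives all ones (highest bit set); otherwise zero */
    __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), index);
    _mm256_maskstore_ps(dst, mask, _mm256_loadu_ps(src));
}
```

Elements of dst at index n and above keep their previous contents.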

void _mm256_store_pd (double * mem_addr, __m256d a)

Store 256 bits (composed of 4 64-bit double-precision floating-point numbers) to memory; mem_addr must be 32-byte aligned, otherwise a general-protection exception may be generated

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

void _mm256_store_ps (float * mem_addr, __m256 a)

Store 256 bits (composed of 8 32-bit single-precision floating-point numbers) to memory; mem_addr must be 32-byte aligned, otherwise a general-protection exception may be generated

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

void _mm256_store_si256 (__m256i * mem_addr, __m256i a)

Store 256 bits of integer data to memory; mem_addr must be 32-byte aligned, otherwise a general-protection exception may be generated

Store 256-bits of integer data from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

void _mm256_storeu_pd (double * mem_addr, __m256d a)

Store 256 bits (composed of 4 64-bit double-precision floating-point numbers) to memory; mem_addr does not need to be 32-byte aligned

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

void _mm256_storeu_ps (float * mem_addr, __m256 a)

Store 256 bits (composed of 8 32-bit single-precision floating-point numbers) to memory; mem_addr does not need to be 32-byte aligned

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr does not need to be aligned on any particular boundary.

void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)

Store 256 bits of integer data to memory; mem_addr does not need to be 32-byte aligned

Store 256-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.

void _mm256_storeu2_m128 (float* hiaddr, float* loaddr, __m256 a)

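Split a into two 128-bit halves (each composed of 4 32-bit single-precision floating-point numbers) and store them to hiaddr and loaddr respectively; hiaddr and loaddr do not need to be aligned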

Store the high and low 128-bit halves (each composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory two different 128-bit locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

void _mm256_storeu2_m128d (double* hiaddr, double* loaddr, __m256d a)

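Split a into two 128-bit halves (each composed of 2 64-bit double-precision floating-point numbers) and store them to hiaddr and loaddr respectively; hiaddr and loaddr do not need to be aligned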

Store the high and low 128-bit halves (each composed of 2 packed double-precision (64-bit) floating-point elements) from a into memory two different 128-bit locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

void _mm256_storeu2_m128i (__m128i* hiaddr, __m128i* loaddr, __m256i a)

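Split a into two 128-bit halves (each composed of 128 bits of integer data) and store them to hiaddr and loaddr respectively; hiaddr and loaddr do not need to be aligned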

Store the high and low 128-bit halves (each composed of integer data) from a into memory two different 128-bit locations. hiaddr and loaddr do not need to be aligned on any particular boundary.

void _mm256_stream_pd (double * mem_addr, __m256d a)

Store 256 bits (composed of 4 64-bit double-precision floating-point numbers) to memory using a non-temporal memory hint; mem_addr must be 32-byte aligned, otherwise a general-protection exception may be generated

Store 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

void _mm256_stream_ps (float * mem_addr, __m256 a)

Store 256 bits (composed of 8 32-bit single-precision floating-point numbers) to memory using a non-temporal memory hint; mem_addr must be 32-byte aligned, otherwise a general-protection exception may be generated

Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)

Store 256 bits of integer data to memory using a non-temporal memory hint; mem_addr must be 32-byte aligned, otherwise a general-protection exception may be generated

Store 256-bits of integer data from a into memory using a non-temporal memory hint. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.

Swizzle

__m256i _mm256_blend_epi16 (__m256i a, __m256i b, const int imm8)

Blend (select one of two) the 16-bit integer vectors a and b according to the low 8 bits of imm8; the same 8-bit mask is applied to both the low and the high 128-bit lane (that is, 16-bit elements 0 and 8 use the same control bit)

Blend packed 16-bit integers from a and b within 128-bit lanes using control mask imm8, and store the results in dst.
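
The reuse of the 8-bit mask in both lanes is easy to miss. In the sketch below (hypothetical function name, GCC/Clang target attribute assumed), setting only bit 0 of imm8 selects elements 0 and 8 from b.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical demo: blend zeros (a) with ones (b) using imm8 = 0x01.
   Assumption: GCC or Clang; target("avx2") enables AVX2 for this function only. */
__attribute__((target("avx2")))
void blend16_demo(uint16_t out[16])
{
    __m256i a = _mm256_setzero_si256();
    __m256i b = _mm256_set1_epi16(1);
    /* Bit 0 of imm8 controls element 0 of the low lane and element 8 of the high lane. */
    _mm256_storeu_si256((__m256i *)out, _mm256_blend_epi16(a, b, 0x01));
}
```

out[0] and out[8] are 1 and every other element is 0.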

__m256i _mm256_blend_epi32 (__m256i a, __m256i b, const int imm8)

Blend (select one of two) the 32-bit integer vectors a and b according to the low 8 bits of imm8

Blend packed 32-bit integers from a and b using control mask imm8, and store the results in dst.

__m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)

Blend (select one of two) the 64-bit double-precision floating-point vectors a and b according to the low 4 bits of imm8

Blend packed double-precision (64-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.

__m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)

Blend (select one of two) the 32-bit single-precision floating-point vectors a and b according to the low 8 bits of imm8

Blend packed single-precision (32-bit) floating-point elements from a and b using control mask imm8, and store the results in dst.

__m256i _mm256_blendv_epi8 (__m256i a, __m256i b, __m256i mask)

Mix (choose one of two) 8-bit integer vectors a and b according to the highest bit of the channel corresponding to the mask

Blend packed 8-bit integers from a and b using mask, and store the results in dst.

__m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask)

Mix (choose one of two) 64-bit double-precision floating-point number vectors a and b according to the highest bit of the channel corresponding to the mask

Blend packed double-precision (64-bit) floating-point elements from a and b using mask, and store the results in dst.

__m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask)

Mix (choose one of two) 32-bit single-precision floating-point number vectors a and b according to the highest bit of the channel corresponding to the mask

Blend packed single-precision (32-bit) floating-point elements from a and b using mask, and store the results in dst.
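
Because the variable blends look only at the highest bit of each mask element, a float vector can serve as its own mask: the sign bit selects the second operand. The sketch below replaces negative elements with zero. The function name is hypothetical and the target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = in[i] if its sign bit is clear, else 0.
   Assumption: GCC or Clang; target("avx") enables AVX for this function only. */
__attribute__((target("avx")))
void clamp_negatives_to_zero(const float in[8], float out[8])
{
    __m256 x = _mm256_loadu_ps(in);
    /* mask = x: where the sign bit is set, take the element from the zero vector */
    _mm256_storeu_ps(out, _mm256_blendv_ps(x, _mm256_setzero_ps(), x));
}
```

Note that -0.0 also has its sign bit set, so it is replaced by +0.0.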

__m256d _mm256_broadcast_pd (__m128d const * mem_addr)

Broadcast 128 bits of data from memory (composed of 2 64-bit double-precision floating-point numbers)

Broadcast 128 bits from memory (composed of 2 packed double-precision (64-bit) floating-point elements) to all elements of dst.

__m256 _mm256_broadcast_ps (__m128 const * mem_addr)

Broadcast 128 bits of data from memory (composed of 4 32-bit single-precision floating-point numbers)

Broadcast 128 bits from memory (composed of 4 packed single-precision (32-bit) floating-point elements) to all elements of dst.
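
A 128-bit broadcast repeats a group of four floats in both halves of the result. A minimal sketch, with a hypothetical function name and the GCC/Clang target attribute as an assumption:

```c
#include <immintrin.h>

/* Hypothetical demo: repeat 4 floats twice to fill 8 floats.
   Assumption: GCC or Clang; target("avx") enables AVX for this function only. */
__attribute__((target("avx")))
void repeat4(const float in[4], float out[8])
{
    __m128 quad = _mm_loadu_ps(in);
    /* The intrinsic takes a pointer to the 128-bit value in memory. */
    _mm256_storeu_ps(out, _mm256_broadcast_ps(&quad));
}
```

The output satisfies out[i] == in[i % 4] for all eight elements.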

__m256d _mm256_broadcast_sd (double const * mem_addr)

Broadcast a 64-bit double-precision floating-point number in memory to all channels of dst

Broadcast a double-precision (64-bit) floating-point element from memory to all elements of dst.

__m256i _mm256_broadcastb_epi8 (__m128i a)

Broadcast the lowest channel of 8-bit integer vector to all channels of dst

Broadcast the low packed 8-bit integer from a to all elements of dst.

__m256i _mm256_broadcastd_epi32 (__m128i a)

Broadcast the lowest channel of the 32-bit integer vector to all channels of dst

Broadcast the low packed 32-bit integer from a to all elements of dst.

__m256i _mm256_broadcastq_epi64 (__m128i a)

Broadcast the lowest channel of 64-bit integer vector to all channels of dst

Broadcast the low packed 64-bit integer from a to all elements of dst.

__m256d _mm256_broadcastsd_pd (__m128d a)

Broadcast 64-bit double-precision floating-point vector lowest channel to all channels of dst

Broadcast the low double-precision (64-bit) floating-point element from a to all elements of dst.

__m256i _mm256_broadcastsi128_si256 (__m128i a)

Broadcast 128-bit integer data to all channels of dst

Broadcast 128 bits of integer data from a to all 128-bit lanes in dst.

__m256 _mm256_broadcastss_ps (__m128 a)

Broadcast 32-bit single-precision floating-point vector lowest channel to all channels of dst

Broadcast the low single-precision (32-bit) floating-point element from a to all elements of dst.

__m256i _mm256_broadcastw_epi16 (__m128i a)

Broadcast the lowest channel of the 16-bit integer vector to all channels of dst

Broadcast the low packed 16-bit integer from a to all elements of dst.

int _mm256_extract_epi16 (__m256i a, const int index)

Extract a 16-bit integer from a at the position given by index

Extract a 16-bit integer from a, selected with index, and store the result in dst.

__int32 _mm256_extract_epi32 (__m256i a, const int index)

Extract a 32-bit integer from a at the position given by index

Extract a 32-bit integer from a, selected with index, and store the result in dst.

__int64 _mm256_extract_epi64 (__m256i a, const int index)

Extract a 64-bit integer from a at the position given by index

Extract a 64-bit integer from a, selected with index, and store the result in dst.

int _mm256_extract_epi8 (__m256i a, const int index)

Extract an 8-bit integer from a at the position given by index

Extract an 8-bit integer from a, selected with index, and store the result in dst.

__m128d _mm256_extractf128_pd (__m256d a, const int imm8)

Extract 128 bits of data (composed of 2 64-bit double-precision floating-point numbers) selected by the index imm8

Extract 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a, selected with imm8, and store the result in dst.

__m128 _mm256_extractf128_ps (__m256 a, const int imm8)

Extract 128 bits of data (composed of 4 32-bit single-precision floating-point numbers) selected by the index imm8

Extract 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a, selected with imm8, and store the result in dst.

__m128i _mm256_extractf128_si256 (__m256i a, const int imm8)

Extract 128 bits of data (composed of integer data) selected by the index imm8

Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.

__m128i _mm256_extracti128_si256 (__m256i a, const int imm8)

Extract 128-bit data (composed of integer data) according to the index imm8

Extract 128 bits (composed of integer data) from a, selected with imm8, and store the result in dst.
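
The 128-bit extracts are a typical first step of a horizontal reduction: split the vector into halves, combine them, then finish in a narrower type. The sketch below sums eight floats with _mm256_extractf128_ps. The function name sum8 is hypothetical and the target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>

/* Hypothetical helper: sum of 8 floats via the two 128-bit halves.
   Assumption: GCC or Clang; target("avx") enables AVX for this function only. */
__attribute__((target("avx")))
float sum8(const float in[8])
{
    __m256 v  = _mm256_loadu_ps(in);
    __m128 lo = _mm256_extractf128_ps(v, 0);  /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* elements 4..7 */

    float part[4];
    _mm_storeu_ps(part, _mm_add_ps(lo, hi));  /* four pairwise sums */
    return (part[0] + part[1]) + (part[2] + part[3]);
}
```

For the input 1 through 8 the function returns 36. The summation order differs from a left-to-right scalar loop, so results may differ slightly for inputs that are not exactly representable.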

__m256i _mm256_insert_epi16 (__m256i a, __int16 i, const int index)

Copy a to dst, then insert the 16-bit integer i at the position given by index

Copy a to dst, and insert the 16-bit integer i into dst at the location specified by index.

__m256i _mm256_insert_epi32 (__m256i a, __int32 i, const int index)

Copy a to dst, then insert the 32-bit integer i at the position given by index

Copy a to dst, and insert the 32-bit integer i into dst at the location specified by index.

__m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index)

Copy a to dst, then insert the 64-bit integer i at the position given by index

Copy a to dst, and insert the 64-bit integer i into dst at the location specified by index.

__m256i _mm256_insert_epi8 (__m256i a, __int8 i, const int index)

Copy a to dst, then insert the 8-bit integer i at the position given by index

Copy a to dst, and insert the 8-bit integer i into dst at the location specified by index.

__m256d _mm256_insertf128_pd (__m256d a, __m128d b, int imm8)

Copy a to dst, and then insert 128-bit data (consisting of two 64-bit double-precision floating-point numbers) at the corresponding position according to the index imm8

Copy a to dst, then insert 128 bits (composed of 2 packed double-precision (64-bit) floating-point elements) from b into dst at the location specified by imm8.

__m256 _mm256_insertf128_ps (__m256 a, __m128 b, int imm8)

Copy a to dst, and then insert 128-bit data (composed of four 32-bit single-precision floating-point numbers) at the corresponding position according to the index imm8

Copy a to dst, then insert 128 bits (composed of 4 packed single-precision (32-bit) floating-point elements) from b into dst at the location specified by imm8.

__m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int imm8)

Copy a to dst, and then insert 128-bit data at the corresponding position according to the index imm8

Copy a to dst, then insert 128 bits from b into dst at the location specified by imm8.

__m256i _mm256_inserti128_si256 (__m256i a, __m128i b, const int imm8)

Copy a to dst, and then insert 128-bit data (composed of integer data) at the corresponding position according to the index imm8

Copy a to dst, then insert 128 bits (composed of integer data) from b into dst at the location specified by imm8.

__m256d _mm256_permute_pd (__m256d a, int imm8)

Rearrange the 64-bit double-precision floating-point vector according to imm8 (each of the low 4 bits controls one element); elements can only be rearranged within their own 128-bit lane

Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

__m256 _mm256_permute_ps (__m256 a, int imm8)

Rearrange the 32-bit single-precision floating-point vector according to imm8 (each 2 bits of the low 8 bits control one element, and the same control is used in both lanes); elements can only be rearranged within their own 128-bit lane

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.

__m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int imm8)

Select 128-bit halves (each composed of 2 64-bit double-precision floating-point numbers) from a and b according to imm8 (each 4-bit field of the low 8 bits controls one 128-bit half of dst, and the highest bit of each field can force that half to zero)

Shuffle 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.

__m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int imm8)

Select 128-bit halves (each composed of 4 32-bit single-precision floating-point numbers) from a and b according to imm8 (each 4-bit field of the low 8 bits controls one 128-bit half of dst, and the highest bit of each field can force that half to zero)

Shuffle 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) selected by imm8 from a and b, and store the results in dst.

__m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int imm8)

Select 128-bit halves (composed of integer data) from a and b according to imm8 (each 4-bit field of the low 8 bits controls one 128-bit half of dst, and the highest bit of each field can force that half to zero)

Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.

__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)

Select 128-bit halves (composed of integer data) from a and b according to imm8 (each 4-bit field of the low 8 bits controls one 128-bit half of dst, and the highest bit of each field can force that half to zero)

Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.
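
In the permute2f128/permute2x128 control byte, bits [1:0] choose the source of the low half of dst and bits [5:4] choose the source of the high half (0 = a low, 1 = a high, 2 = b low, 3 = b high). The sketch below uses imm8 = 0x01 to swap the two halves of one vector. The function name is hypothetical and the target attribute is a GCC/Clang assumption.

```c
#include <immintrin.h>

/* Hypothetical helper: swap the low and high 128-bit halves of 8 floats.
   Assumption: GCC or Clang; target("avx") enables AVX for this function only. */
__attribute__((target("avx")))
void swap_halves(const float in[8], float out[8])
{
    __m256 v = _mm256_loadu_ps(in);
    /* 0x01: low half of dst = high half of a, high half of dst = low half of a */
    _mm256_storeu_ps(out, _mm256_permute2f128_ps(v, v, 0x01));
}
```

For the input 0 through 7 the output is 4, 5, 6, 7, 0, 1, 2, 3.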

__m256i _mm256_permute4x64_epi64 (__m256i a, const int imm8)

Rearrange the 64-bit integer vector according to imm8 (each 2 bits of the low 8 bits select the source of one element); elements may cross 128-bit lanes

Shuffle 64-bit integers in a across lanes using the control in imm8, and store the results in dst.

__m256d _mm256_permute4x64_pd (__m256d a, const int imm8)

Rearrange the 64-bit double-precision floating-point vector according to imm8 (each 2 bits of the low 8 bits select the source of one element); elements may cross 128-bit lanes

Shuffle double-precision (64-bit) floating-point elements in a across lanes using the control in imm8, and store the results in dst.

__m256d _mm256_permutevar_pd (__m256d a, __m256i b)

Rearrange the 64-bit double-precision floating-point vector according to b (bit 1 of each 64-bit element of b controls one element); elements can only be rearranged within their own 128-bit lane

Shuffle double-precision (64-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.

__m256 _mm256_permutevar_ps (__m256 a, __m256i b)

Rearrange the 32-bit single-precision floating-point vector according to b (the low 2 bits of each 32-bit element of b control one element); elements can only be rearranged within their own 128-bit lane

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in b, and store the results in dst.

__m256i _mm256_permutevar8x32_epi32 (__m256i a, __m256i idx)

Rearrange the 32-bit integers of a according to idx (the lower 3 bits of each 32-bit element of idx select the source element)

Shuffle 32-bit integers in a across lanes using the corresponding index in idx, and store the results in dst.

__m256 _mm256_permutevar8x32_ps (__m256 a, __m256i idx)

Rearrange the 32-bit single-precision floating-point elements of a according to idx (the lower 3 bits of each 32-bit element of idx select the source element)

Shuffle single-precision (32-bit) floating-point elements in a across lanes using the corresponding index in idx.
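
Unlike the permutevar functions, the permutevar8x32 functions can move elements across the 128-bit boundary. A minimal scalar sketch, assuming the vector is an array of eight 32-bit elements and using the hypothetical name permutevar8x32_model:

```c
#include <stdint.h>

/* Hypothetical scalar model of _mm256_permutevar8x32_epi32.
   Only the lower 3 bits of each index are used; higher bits are ignored.
   dst and a must not overlap. */
void permutevar8x32_model(uint32_t dst[8], const uint32_t a[8],
                          const uint32_t idx[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i] = a[idx[i] & 0x7];   /* any of the 8 source elements is reachable */
    }
}
```

With idx = {7, 6, 5, 4, 3, 2, 1, 0} this model reverses the whole vector, something the lane-restricted shuffles cannot do in one step.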

__m256i _mm256_shuffle_epi32 (__m256i a, const int imm8)

Rearrange the 32-bit integers of a according to imm8 (each 2-bit field of imm8 selects the source element for one output element) (rearranges only within each 128-bit lane)

Shuffle 32-bit integers in a within 128-bit lanes using the control in imm8, and store the results in dst.

__m256i _mm256_shuffle_epi8 (__m256i a, __m256i b)

Rearrange the 8-bit integers of a according to b (the lower 4 bits of each byte of b select the source byte, and the highest bit of each byte of b forces the output byte to zero) (rearranges only within each 128-bit lane)

Shuffle 8-bit integers in a within 128-bit lanes according to shuffle control mask in the corresponding 8-bit element of b, and store the results in dst.
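
The per-byte control of _mm256_shuffle_epi8 can be sketched in scalar C as follows. The function name shuffle_epi8_model is hypothetical, and the vectors are assumed to be arrays of 32 bytes with byte 0 in the lowest bits.

```c
#include <stdint.h>

/* Hypothetical scalar model of _mm256_shuffle_epi8.
   dst and a must not overlap. */
void shuffle_epi8_model(uint8_t dst[32], const uint8_t a[32], const uint8_t b[32])
{
    for (int i = 0; i < 32; i++) {
        int lane_base = (i / 16) * 16;          /* bytes never leave their lane */
        if (b[i] & 0x80) {
            dst[i] = 0;                         /* bit 7 of the control: output zero */
        } else {
            dst[i] = a[lane_base + (b[i] & 0x0F)];  /* low 4 bits: source byte */
        }
    }
}
```

A control vector whose bytes are 15, 14, ..., 0 in each lane reverses the bytes of each 128-bit lane separately in this model.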

__m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int imm8)

Mix the 64-bit double-precision floating-point elements of a and b according to imm8 (each of the lower 4 bits of imm8 controls one output element) (mixes only within each 128-bit lane)

Shuffle double-precision (64-bit) floating-point elements within 128-bit lanes using the control in imm8, and store the results in dst.

__m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int imm8)

Mix the 32-bit single-precision floating-point elements of a and b according to imm8 (each 2-bit field of imm8 controls one output element; in each lane the two low elements come from a and the two high elements from b) (mixes only within each 128-bit lane)

Shuffle single-precision (32-bit) floating-point elements in a within 128-bit lanes using the control in imm8, and store the results in dst.
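
The two floating-point shuffles read imm8 differently: shuffle_pd uses one bit per output element, shuffle_ps uses one 2-bit field per position and applies the same fields to both lanes. A scalar sketch under those assumptions, with hypothetical function names and vectors stored as plain arrays:

```c
/* Hypothetical scalar model of _mm256_shuffle_pd.
   Even output elements come from a, odd ones from b; bit i of imm8
   chooses the low or high double of the matching 128-bit lane. */
void shuffle_pd_model(double dst[4], const double a[4], const double b[4],
                      unsigned imm8)
{
    for (int i = 0; i < 4; i++) {
        int lane_base = (i / 2) * 2;
        const double *src = (i % 2 == 0) ? a : b;
        int sel = (int)((imm8 >> i) & 0x1);
        dst[i] = src[lane_base + sel];
    }
}

/* Hypothetical scalar model of _mm256_shuffle_ps.
   In each lane, positions 0-1 come from a and positions 2-3 from b;
   the 2-bit field for a position is reused in both lanes. */
void shuffle_ps_model(float dst[8], const float a[8], const float b[8],
                      unsigned imm8)
{
    for (int i = 0; i < 8; i++) {
        int lane_base = (i / 4) * 4;
        int pos = i % 4;
        const float *src = (pos < 2) ? a : b;
        int sel = (int)((imm8 >> (2 * pos)) & 0x3);
        dst[i] = src[lane_base + sel];
    }
}
```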

__m256i _mm256_shufflehi_epi16 (__m256i a, const int imm8)

Rearrange the 16-bit integers in the upper 64 bits of each 128-bit lane of a according to imm8 (each 2-bit field of imm8 selects one element) and store them in the upper 64 bits of the corresponding lane of dst; the lower 64 bits of each lane are copied directly from a to dst

Shuffle 16-bit integers in the high 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the high 64 bits of 128-bit lanes of dst, with the low 64 bits of 128-bit lanes being copied from a to dst.

__m256i _mm256_shufflelo_epi16 (__m256i a, const int imm8)

Rearrange the 16-bit integers in the lower 64 bits of each 128-bit lane of a according to imm8 (each 2-bit field of imm8 selects one element) and store them in the lower 64 bits of the corresponding lane of dst; the upper 64 bits of each lane are copied directly from a to dst

Shuffle 16-bit integers in the low 64 bits of 128-bit lanes of a using the control in imm8. Store the results in the low 64 bits of 128-bit lanes of dst, with the high 64 bits of 128-bit lanes being copied from a to dst.
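
The "shuffle one half, copy the other half" behaviour of shufflehi and shufflelo can be sketched in scalar C. The function names are hypothetical, and the vector is assumed to be an array of sixteen 16-bit elements, eight per 128-bit lane.

```c
#include <stdint.h>

/* Hypothetical scalar model of _mm256_shufflehi_epi16.
   dst and a must not overlap. */
void shufflehi_epi16_model(uint16_t dst[16], const uint16_t a[16], unsigned imm8)
{
    for (int lane = 0; lane < 2; lane++) {
        int base = lane * 8;
        for (int i = 0; i < 4; i++) {
            unsigned sel = (imm8 >> (2 * i)) & 0x3;
            dst[base + i] = a[base + i];            /* low 64 bits: copied as-is */
            dst[base + 4 + i] = a[base + 4 + sel];  /* high 64 bits: shuffled */
        }
    }
}

/* Hypothetical scalar model of _mm256_shufflelo_epi16. */
void shufflelo_epi16_model(uint16_t dst[16], const uint16_t a[16], unsigned imm8)
{
    for (int lane = 0; lane < 2; lane++) {
        int base = lane * 8;
        for (int i = 0; i < 4; i++) {
            unsigned sel = (imm8 >> (2 * i)) & 0x3;
            dst[base + i] = a[base + sel];          /* low 64 bits: shuffled */
            dst[base + 4 + i] = a[base + 4 + i];    /* high 64 bits: copied as-is */
        }
    }
}
```

Applying both models in sequence with imm8 = 0x1B reverses each 64-bit half of each lane independently; neither can move a 16-bit element from one half to the other.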

__m256i _mm256_unpackhi_epi16 (__m256i a, __m256i b)

Interleave the upper 64 bits of each 128-bit lane of a and b in units of 16-bit integers

Unpack and interleave 16-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

__m256i _mm256_unpackhi_epi32 (__m256i a, __m256i b)

Interleave the upper 64 bits of each 128-bit lane of a and b in units of 32-bit integers

Unpack and interleave 32-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

__m256i _mm256_unpackhi_epi64 (__m256i a, __m256i b)

Interleave the upper 64 bits of each 128-bit lane of a and b in units of 64-bit integers

Unpack and interleave 64-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

__m256i _mm256_unpackhi_epi8 (__m256i a, __m256i b)

Interleave the upper 64 bits of each 128-bit lane of a and b in units of 8-bit integers

Unpack and interleave 8-bit integers from the high half of each 128-bit lane in a and b, and store the results in dst.

__m256d _mm256_unpackhi_pd (__m256d a, __m256d b)

Interleave the upper 64 bits of each 128-bit lane of a and b in units of 64-bit double-precision floating-point numbers

Unpack and interleave double-precision (64-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

__m256 _mm256_unpackhi_ps (__m256 a, __m256 b)

Interleave the upper 64 bits of each 128-bit lane of a and b in units of 32-bit single-precision floating-point numbers

Unpack and interleave single-precision (32-bit) floating-point elements from the high half of each 128-bit lane in a and b, and store the results in dst.

__m256i _mm256_unpacklo_epi16 (__m256i a, __m256i b)

Interleave the lower 64 bits of each 128-bit lane of a and b in units of 16-bit integers

Unpack and interleave 16-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

__m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)

Interleave the lower 64 bits of each 128-bit lane of a and b in units of 32-bit integers

Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

__m256i _mm256_unpacklo_epi64 (__m256i a, __m256i b)

Interleave the lower 64 bits of each 128-bit lane of a and b in units of 64-bit integers

Unpack and interleave 64-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

__m256i _mm256_unpacklo_epi8 (__m256i a, __m256i b)

Interleave the lower 64 bits of each 128-bit lane of a and b in units of 8-bit integers

Unpack and interleave 8-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.

__m256d _mm256_unpacklo_pd (__m256d a, __m256d b)

Interleave the lower 64 bits of each 128-bit lane of a and b in units of 64-bit double-precision floating-point numbers

Unpack and interleave double-precision (64-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.

__m256 _mm256_unpacklo_ps (__m256 a, __m256 b)

Interleave the lower 64 bits of each 128-bit lane of a and b in units of 32-bit single-precision floating-point numbers

Unpack and interleave single-precision (32-bit) floating-point elements from the low half of each 128-bit lane in a and b, and store the results in dst.
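
All of the unpackhi and unpacklo functions above follow the same pattern and differ only in element width. As an illustration, here is a scalar sketch for the 32-bit case; the function name unpack_epi32_model and its high flag are hypothetical, and the vectors are assumed to be arrays of eight 32-bit elements, four per 128-bit lane.

```c
#include <stdint.h>

/* Hypothetical scalar model covering _mm256_unpacklo_epi32 (high == 0)
   and _mm256_unpackhi_epi32 (high != 0).
   dst must not overlap a or b. */
void unpack_epi32_model(uint32_t dst[8], const uint32_t a[8],
                        const uint32_t b[8], int high)
{
    int half = high ? 2 : 0;          /* start of the chosen 64-bit half */
    for (int lane = 0; lane < 2; lane++) {
        int base = lane * 4;          /* each 128-bit lane is handled separately */
        for (int i = 0; i < 2; i++) {
            dst[base + 2 * i]     = a[base + half + i];  /* element from a first */
            dst[base + 2 * i + 1] = b[base + half + i];  /* then the one from b */
        }
    }
}
```

Because the interleave is done per lane, the low variant in this model yields a0, b0, a1, b1, a4, b4, a5, b5 rather than the first four elements of each input in sequence.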

Origin blog.csdn.net/weixin_56819992/article/details/130498878