Beating the Segment Tree: A Guide to Instruction Set Optimization

Foreword

When practicing problems, we keep running into range query / range modify problems. These usually ask us to maintain a sequence and support query and modify operations on ranges. Some are simple, like Fenwick Tree 1; others are like [boring columns] [2]. They can generally be solved with a segment tree or a Fenwick tree, but the code is long, hard to read, and harder to think through than the problem itself. The constraints are also strict: the number of queries/updates and the sequence length are both on the order of \(10^5\) or more, which leaves the \(O(n^2)\) brute-force algorithm hopelessly out of reach. So, is there an approach that passes the data and is also easy to code?

Instruction Set Optimization

Today I want to talk about instruction set optimization, which is the best answer to this question. So what is an instruction set? And what can instruction set optimization do?

What is an instruction set?

An instruction set is the collection of instructions built into a CPU that direct and carry out its operations.

Simply put, whatever language we write a program in, whether C, C++, Java, Python, or another high-level language, the CPU cannot read it. So the compiler (or interpreter) translates the program into assembly code, and then into binary machine code: instructions from the instruction set that the CPU can understand.

However, because of the nature of high-level languages, the compiler generates a lot of redundant code during translation in order to guarantee correctness and compatibility. This redundant code makes the program run much slower.

So, can we delete that redundant code?

Usually this redundant code cannot be removed completely. The usual way is to turn on compiler optimization, what we jokingly call "inhaling oxygen" (-O2) and "inhaling ozone" (-O3). But turning on optimization does not make all the redundant code disappear. Plenty remains, and we need to get rid of it.

What can instruction set optimization do?

From the above, you can probably guess the focus of this optimization: getting rid of redundant code.

So, how do we get rid of it?

There are roughly two ways:

1. Inline assembly

Ugh, learning a whole new language just to solve one problem? Why would I ** do that?!

So, as a mere beginner, I'll pass on this option.

2. Instruction set optimization

Programmers and OIers are not the only ones who want to get rid of this redundant code. As suppliers of processors, CPU manufacturers naturally want it gone too, to improve the performance of their own products.

To this end, Intel offers C++ a solution: if the optimizer cannot eliminate the redundant code, then I'll do it myself!

So it wrote a computation library built directly on its own CPUs' instruction sets.

This library (only its C++ form is discussed here) has the advantage of shipping as ordinary library files: include the appropriate header files and you can use it directly. And because it maps directly onto the CPU's instruction set, it generates very little redundant code. From a problem-solver's point of view, that means "the same additions and subtractions, several times faster."

Using instruction set optimization

All that rambling boils down to one sentence: "instruction set optimization is fast!"

So, how do we use it?

Preparation

Note! Note! Note! Do not use instruction set optimization in any official contest!!

First of all, we have to tell the compiler that we want to use the instruction sets.

#pragma GCC target("sse,sse2,sse3,ssse3,sse4.1,sse4.2,avx,avx2,popcnt,tune=native")
// SSE, AVX, and so on here are all names of instruction sets

Second, we need to include two header files so that we can call the instructions as functions, without writing inline assembly.

#include <immintrin.h>
#include <emmintrin.h>

Note!! If you also include <bits/stdc++.h>, include these two header files first, or the libraries will conflict!

Now we can happily use the instruction sets!

Variables / functions

Three variable types are commonly used here: __m256, __m256i, and __m256d, which store single-precision floating-point numbers, integers, and double-precision floating-point numbers respectively. The 256 means the variable occupies 256 bits; it can be replaced with 128.
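To make the sizes concrete, here is a minimal sketch of how many elements fit in one 256-bit variable. It assumes GCC on an x86-64 CPU with AVX2; lanes_256 is a hypothetical helper written for this illustration, not part of any library.

```cpp
#pragma GCC target("avx2")  // assumption: GCC targeting a CPU with AVX2
#include <immintrin.h>

// Number of elements of type T that fit in one 256-bit variable.
// lanes_256 is a hypothetical helper, not a library function.
template <typename T>
constexpr int lanes_256() {
    return 256 / (8 * static_cast<int>(sizeof(T)));
}

// 256 bits = 32 bytes; the 128-bit variant is half that.
static_assert(sizeof(__m256i) == 32, "a 256-bit variable is 32 bytes");
static_assert(sizeof(__m128i) == 16, "a 128-bit variable is 16 bytes");
```

So one __m256i can hold eight int values or four long long values, which is why the range code later in this article works in groups of four.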

As for functions, you can look up the one you need in Intel's official manual. The manual gives a brief description of what each function does, along with pseudocode showing roughly how it works, which is very convenient. It does, however, require a certain level of foreign-language reading ability.

Function names follow a regular pattern, usually _mm<data size>_<operation>_epi<size of each element>(). For example, _mm256_add_epi32(a, b) treats the __m256i variables a and b as blocks of int (32-bit) elements, adds the corresponding elements, and returns the result as a new variable.
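As a small illustration of the naming pattern, the sketch below adds eight ints at a time with _mm256_add_epi32. It assumes GCC and a CPU with AVX2. The helper name add8 is made up for this example, and the unaligned load/store intrinsics _mm256_loadu_si256 and _mm256_storeu_si256 are not mentioned elsewhere in this article; they are used here only to move data between plain arrays and blocks.

```cpp
#pragma GCC target("avx2")  // assumption: GCC targeting a CPU with AVX2
#include <immintrin.h>

// Adds eight ints from x and eight ints from y, element by element, into out.
// add8 is a hypothetical helper name used only for this illustration.
void add8(const int *x, const int *y, int *out) {
    // Load eight 32-bit ints from each input (no alignment required).
    __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(x));
    __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(y));
    // _mm256: 256-bit data, add: the operation, epi32: 32-bit integer elements.
    __m256i c = _mm256_add_epi32(a, b);
    // Write the eight sums back to a plain int array.
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(out), c);
}
```

Calling add8 on two 8-element arrays fills out with the eight element-wise sums using a single vector addition.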

Here are some common functions (compiled by ouuan; will be removed on request if this infringes).

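__m256i _mm256_set_epi32 (int e7, int e6, int e5, int e4, int e3, int e2, int e1, int e0): takes eight numbers, that is, the numbers in one "block"; note that they are given in reverse order. Returns a "block" containing these eight numbers.

__m256i _mm256_set_epi64x (__int64 e3, __int64 e2, __int64 e1, __int64 e0): same as above, but for 64-bit integers, that is, long long.

__m256i _mm256_set1_epi32 (int a): equivalent to _mm256_set_epi32(a,a,a,a,a,a,a,a).

__m256i _mm256_add_epi32 (__m256i a, __m256i b): adds the corresponding positions of the two "blocks" and returns the result.

__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b): tests whether the corresponding positions of the two "blocks" are equal; where they are equal, the corresponding position of the returned "block" is 0xffffffff, otherwise 0.

__m256i _mm256_cmpgt_epi32 (__m256i a, __m256i b): same as above, but the comparison is greater-than instead of equality.

__m256i _mm256_and_si256 (__m256i a, __m256i b): returns the bitwise AND of the two "blocks"; it can be combined with the two comparison instructions above.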
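To show how a comparison can be combined with the bitwise AND, here is a minimal sketch that sums only the elements of an 8-int block that are greater than a threshold. It assumes GCC and a CPU with AVX2; sum_greater8 is a hypothetical helper name, and the load/store intrinsics are used only to get data in and out of plain arrays.

```cpp
#pragma GCC target("avx2")  // assumption: GCC targeting a CPU with AVX2
#include <immintrin.h>

// Returns the sum of those of the eight ints in x that are greater than k.
// sum_greater8 is a hypothetical helper name used only for this illustration.
int sum_greater8(const int *x, int k) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(x));
    // Each position becomes 0xffffffff where v > k, and 0 elsewhere.
    __m256i mask = _mm256_cmpgt_epi32(v, _mm256_set1_epi32(k));
    // AND with the mask keeps the selected elements and zeroes the rest.
    __m256i kept = _mm256_and_si256(v, mask);
    // Copy the eight results out and add them up with ordinary code.
    int lanes[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(lanes), kept);
    int sum = 0;
    for (int i = 0; i < 8; i++) sum += lanes[i];
    return sum;
}
```

The mask trick replaces a per-element if statement: elements that fail the comparison simply contribute zero to the sum.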

Tips:

If you want to access the individual elements of a block, you can do so through a pointer.

__m256i block;
int *a = (int *) &block;
for(int i = 0; i < 8; i++)
    printf("%d ", a[i]); // print each element, separated by spaces
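As a hedged alternative to the pointer cast, the elements can also be copied out into an ordinary array with _mm256_storeu_si256. The sketch below additionally shows what "reverse order" means for _mm256_set_epi32. It assumes GCC and a CPU with AVX2; lanes_of_set is a hypothetical helper name.

```cpp
#pragma GCC target("avx2")  // assumption: GCC targeting a CPU with AVX2
#include <immintrin.h>

// Builds a block with _mm256_set_epi32 and copies its eight elements to out.
// lanes_of_set is a hypothetical helper name used only for this illustration.
void lanes_of_set(int *out) {
    // The arguments are e7, e6, ..., e0: the LAST argument lands in element 0.
    __m256i block = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    // Copy the block into a plain int array instead of casting a pointer to it.
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(out), block);
}
```

After the call, out[i] holds i for every i from 0 to 7, so the first argument of _mm256_set_epi32 ends up in the last position.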

In practice

Here we use Segment Tree 1 as an example to explain how to use all this.

First, let's define an __m256i array block[] to store the data.

__m256i block[100000];

Then read the input numbers into this array, packing four numbers into each block.

a = (long long *) & block;
for(int i = 0; i < n; i++)
    scanf("%lld", a + i);

Next comes range addition. Here we borrow the idea of block decomposition, that is, "modify the two ends first, then modify the whole blocks in the middle."

void add(int l, int r, long long x) {
    // This article stores the data in a[0..n-1]; all range operations are half-open, [l, r).
    while((l & 3) && (l < r)) // handle the left boundary
        a[l++] += x;
    if(l == r)
        return;
    while((r & 3)) // handle the right boundary
        a[--r] += x;
    if(l == r)
        return;
    __m256i ad = _mm256_set1_epi64x(x);
    for(l >>= 2, r >>= 2; l < r; l++) {
        block[l] = _mm256_add_epi64(block[l], ad);
    }
}
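To make the "ends first, middle blocks afterwards" idea easier to follow, here is a plain scalar sketch that only computes where a half-open range [l, r) gets split. No intrinsics are involved. Split and split_range are hypothetical names invented for this illustration, and the second loop uses an explicit guard instead of the early return in the code above, which should behave the same way.

```cpp
#include <cassert>

// head: [l, head_end) is handled element by element,
// middle: [head_end, tail_begin) consists of whole groups of four,
// tail: [tail_begin, r) is handled element by element.
// Split and split_range are hypothetical names used only for this illustration.
struct Split {
    int head_end;
    int tail_begin;
};

Split split_range(int l, int r) {
    assert(0 <= l && l <= r);
    int head_end = l;
    // Advance until the index is a multiple of 4 or the range runs out.
    while ((head_end & 3) && head_end < r) ++head_end;
    int tail_begin = r;
    // Step back until the index is a multiple of 4 or it meets the head.
    while ((tail_begin & 3) && tail_begin > head_end) --tail_begin;
    return {head_end, tail_begin};
}
```

For example, split_range(3, 14) gives head_end = 4 and tail_begin = 12: element 3 is updated alone, elements 4 to 11 form block[1] and block[2], and elements 12 and 13 are updated alone.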

Range query works in much the same way as range modify.

long long ask(int l, int r) {
    long long ans = 0;
    while((l & 3) && (l < r))
        ans += a[l++];
    if(l == r)
        return ans;
    while((r & 3))
        ans += a[--r];
    if(l == r)
        return ans;
    __m256i ans1 = _mm256_set1_epi64x(0);
    for(l >>= 2, r >>= 2; l < r; l++) {
        ans1 = _mm256_add_epi64(block[l], ans1);
    }
    for(int i = 0; i < 4; i++)
        ans += ans1[i];
    return ans;
}
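The final loop above reads ans1[i] directly, which relies on a compiler extension (subscripting a vector type in GCC). As a hedged alternative, the sketch below accumulates whole blocks and then copies the four partial sums out with _mm256_storeu_si256 before adding them. It assumes GCC and a CPU with AVX2; sum_blocks is a hypothetical helper name and works on a plain long long array rather than on the global block[] array.

```cpp
#pragma GCC target("avx2")  // assumption: GCC targeting a CPU with AVX2
#include <immintrin.h>

// Sums the first 4 * nblocks elements of x, four 64-bit elements at a time.
// sum_blocks is a hypothetical helper name used only for this illustration.
long long sum_blocks(const long long *x, int nblocks) {
    __m256i acc = _mm256_setzero_si256();  // four running partial sums, all zero
    for (int b = 0; b < nblocks; b++) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(x + 4 * b));
        acc = _mm256_add_epi64(acc, v);    // add four 64-bit elements at once
    }
    // Copy the four partial sums out and combine them with ordinary additions.
    long long lanes[4];
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(lanes), acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Each position of acc collects every fourth element, so the total is only correct after the four partial sums are added together at the end.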

Complete code:

#define __AVX__ 1
#define __AVX2__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

#pragma GCC optimize("Ofast,no-stack-protector,unroll-loops,fast-math")
#pragma GCC target("sse,sse2,sse3,ssse3,sse4.1,sse4.2,avx,avx2,popcnt,tune=native")

#include <immintrin.h>
#include <emmintrin.h>
#include <bits/stdc++.h>
using namespace std;
__m256i block[100001], mod;
long long *a;
void add(int l, int r, long long x) {
    while((l & 3) && (l < r))
        a[l++] += x;
    if(l == r)
        return;
    while((r & 3))
        a[--r] += x;
    if(l == r)
        return;
    __m256i ad = _mm256_set1_epi64x(x);
    for(l >>= 2, r >>= 2; l < r; l++) {
        block[l] = _mm256_add_epi64(block[l], ad);
    }
}
long long ask(int l, int r) {
    long long ans = 0;
    while((l & 3) && (l < r))
        ans += a[l++];
    if(l == r)
        return ans;
    while((r & 3))
        ans += a[--r];
    if(l == r)
        return ans;
    __m256i ans1 = _mm256_set1_epi64x(0);
    for(l >>= 2, r >>= 2; l < r; l++) {
        ans1 = _mm256_add_epi64(block[l], ans1);
    }
    for(int i = 0; i < 4; i++)
        ans += ans1[i];
    return ans;
}
int main() {
    int n, m, op, x, y;
    long long val; // read with %lld below, so it must be a long long
    scanf("%d%d", &n, &m);
    a = (long long *) & block;
    for(int i = 0; i < n; i++)
        scanf("%lld", a + i);
    while(m--) {
        scanf("%d%d%d", &op, &x, &y);
        x--;
        if(op == 1) {
            scanf("%lld", &val);
            add(x, y, val);
        } else {
            printf("%lld\n", ask(x, y));
        }
    }
}

Summary

That is roughly what instruction set optimization is about. The complexity is still \(O(nm)\) in theory, but because the instruction set speeds things up several times over, it can pass some problems with more lenient time limits. Of course, it also has significant limitations: for example, it does not support range modulo or range division (these only seem to work properly with Intel's compiler; other compilers report that the function does not exist), and so on. Even so, it remains an excellent solution for range query / modify problems, and it is also extremely useful for students who hope to become programmers in the future.

After all, solving one problem with several different methods is also one of the best ways to exercise your thinking.
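As a rough back-of-the-envelope count, assume \(n = m = 10^5\) and that every operation covers the whole sequence (the worst case; these figures are assumptions for illustration, not measurements). Processing four 64-bit elements per instruction gives:

\[
n \cdot m = 10^5 \cdot 10^5 = 10^{10} \ \text{scalar additions},
\qquad
\frac{n \cdot m}{4} = 2.5 \times 10^{9} \ \text{256-bit additions}.
\]

The asymptotic complexity is unchanged; only the constant factor shrinks, by roughly the number of elements per block.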


Origin www.cnblogs.com/MaxDYF/p/11518355.html