NVIDIA CUDA Massively Parallel Processor Programming (7): Parallel Patterns: Prefix Sum

Background

The prefix sum is also called scan. An inclusive scan takes a binary operator ⊕ and an n-element array [x0, x1, x2, …, xn-1] and returns the output array [x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ x2 ⊕ … ⊕ xn-1)]. An exclusive scan is similar and returns [0, x0, (x0 ⊕ x1), …, (x0 ⊕ x1 ⊕ … ⊕ xn-2)]; that is, shifting the inclusive scan output one position to the right and putting 0 (the identity) in position 0 gives the exclusive scan output.
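For example, with ⊕ taken to be ordinary addition and the 8-element input [3, 1, 7, 0, 4, 1, 6, 3], the inclusive scan returns [3, 4, 11, 11, 15, 16, 22, 25] and the exclusive scan returns [0, 3, 4, 11, 11, 15, 16, 22].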

Serial implementation of the inclusive scan:

void sequential_scan(float *x, float *y, int Max_i){
    y[0] = x[0];
    for(int i = 1; i < Max_i; i++){
        y[i] = y[i-1] + x[i];
    }
}

This is a simple dynamic-programming recurrence: each output reuses the previous partial sum.
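For comparison, the exclusive scan has an equally simple serial form. This is a minimal sketch following the same signature as sequential_scan above:

// Exclusive scan: y[i] is the sum of x[0..i-1], and y[0] is 0 (the identity of +).
void sequential_exclusive_scan(float *x, float *y, int Max_i){
    y[0] = 0.0f;
    for(int i = 1; i < Max_i; i++){
        y[i] = y[i-1] + x[i-1];
    }
}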

Simple Parallel Scan

The following is an in-place scanning algorithm implemented by a reduction tree:

  • Declare an array XY in shared memory and copy the n values of the input array x into XY.
  • Iterate: in the k-th iteration, every element whose index is at least 2^(k-1) adds to itself the element 2^(k-1) positions to its left, i.e. XY[i] += XY[i - 2^(k-1)].
  • Stop iterating once 2^(k-1) ≥ n.

[Figure: iterations of the simple in-place scan]
The number of elements processed by the above algorithm cannot exceed the number of threads in a block.
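To make the iterations concrete, here is the same 8-element example as above, [3, 1, 7, 0, 4, 1, 6, 3]:

after the stride-1 iteration: [3, 4, 8, 7, 4, 5, 7, 9]
after the stride-2 iteration: [3, 4, 11, 11, 12, 12, 11, 14]
after the stride-4 iteration: [3, 4, 11, 11, 15, 16, 22, 25]

which is exactly the inclusive scan of the input.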

Code:

#include<cuda.h>
#include<stdio.h>
#include<stdlib.h>
#define SECTION_SIZE 256
__global__ void work_inefficient_scan_kernel(float *X, int input_size){
    __shared__ float XY[SECTION_SIZE];
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    // Load the input into shared memory.
    if(i < input_size){
        XY[threadIdx.x] = X[i];
    }
    // Inclusive scan: in each iteration, every element with index >= stride adds the
    // value that sat `stride` positions to its left at the end of the previous iteration.
    // The temporary and the two barriers keep reads and writes of XY in separate phases,
    // and every thread executes the same number of __syncthreads() calls.
    for(unsigned int stride = 1; stride < blockDim.x; stride *= 2){
        __syncthreads();
        float temp = 0.0f;
        if(threadIdx.x >= stride)
            temp = XY[threadIdx.x - stride];
        __syncthreads();
        if(threadIdx.x >= stride)
            XY[threadIdx.x] += temp;
    }
    __syncthreads();
    if(i < input_size)
        X[i] = XY[threadIdx.x];
}


void test(float *A, int n){
    float *d_A;
    int size = n * sizeof(float);
    cudaMalloc(&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    // Launch the kernel with SECTION_SIZE threads per block.
    work_inefficient_scan_kernel<<<ceil((float)n/SECTION_SIZE), SECTION_SIZE>>>(d_A, n);
    cudaMemcpy(A, d_A, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
}

int main(int argc, char **argv){
    int n = atoi(argv[1]);  // n must not exceed SECTION_SIZE.
    float *A = (float *)malloc(n * sizeof(float));
    for(int i = 0; i < n; ++i){
        A[i] = 1.0f;
    }
    test(A, n);
    for(int i = 0; i < n; ++i){
        printf("%f ", A[i]);
    }
    free(A);
    return 0;
}
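A quick way to sanity-check the kernel is to compare its output against the serial sequential_scan from the background section. A minimal sketch (verify_scan is a hypothetical helper, and it assumes the original input has been kept in a separate array):

// Returns the number of positions where the GPU result differs from the serial scan.
int verify_scan(float *gpu_result, float *x, int n){
    float *y = (float *)malloc(n * sizeof(float));
    sequential_scan(x, y, n);
    int errors = 0;
    for(int i = 0; i < n; ++i){
        float d = gpu_result[i] - y[i];
        if(d < 0) d = -d;
        if(d > 1e-3f) ++errors;
    }
    free(y);
    return errors;
}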
	

Efficiency

Now let's analyze the work efficiency of the above kernel. All threads iterate for log2(SECTION_SIZE) steps. In each iteration, the number of threads that do not perform an addition equals the stride, so the total amount of work is

∑ (N − stride), for stride = 1, 2, 4, …, N/2.

The first part of each term is independent of the stride, and summing it over the log2(N) iterations gives N × log2(N); the second part is a geometric series that sums to N − 1. So the total number of additions is N × log2(N) − (N − 1).
The serial scan algorithm performs only N − 1 additions in total. The table below compares the number of additions for the serial and parallel versions at different values of N:

N                        16    32    64    128   256   512   1024
N − 1                    15    31    63    127   255   511   1023
N × log2(N) − (N − 1)    49    129   321   769   1793  4097  9217

When the number of elements is 1024, the parallel version performs about 9 times as many additions as the serial version, and this ratio keeps growing as N increases.

Efficient Parallel Scanning

This algorithm consists of two parts: the first is a reduction-tree sum, and the second distributes the partial sums down a reverse tree.

Reducing (summing) an array of length N (the upper half of the figure below) requires (N/2) + (N/4) + … + 2 + 1 = N − 1 additions; for N = 16 that is 8 + 4 + 2 + 1 = 15.
[Figure: reduction tree (top) and distribution of partial sums (bottom) for a 16-element section]
Reduction and summation:

Version with reduced control-flow divergence (the active threads always have consecutive indices):

for(unsigned int stride = 1; stride <= blockDim.x; stride *= 2){
    __syncthreads();
    int index = (threadIdx.x + 1)*stride*2 - 1;
    if(index < SECTION_SIZE)
        XY[index] += XY[index - stride];
}
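The point of the index calculation is that the threads performing additions always have consecutive threadIdx.x values, so whole warps are either fully active or fully idle instead of diverging. For example, with SECTION_SIZE = 16 (so 8 threads):

stride 1: threads 0–7 update XY[1], XY[3], XY[5], …, XY[15]
stride 2: threads 0–3 update XY[3], XY[7], XY[11], XY[15]
stride 4: threads 0–1 update XY[7], XY[15]
stride 8: thread 0 updates XY[15]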

After the reduction, the values in the array are as shown in the first row of the figure below:
[Figure: distribution (reverse-tree) phase for a 16-element section]
From the reverse tree we can see that in the first step XY[7] is added to XY[11]; in the second step XY[3], XY[7], and XY[11] are added to XY[5], XY[9], and XY[13]; and in the third step every remaining even-indexed element (XY[2], XY[4], …, XY[14]) has the element immediately before it added to it.

It can be seen that the stride decreases from SECTION_SIZE/4 down to 1. At each step, the XY element at each position of the form k·(2·stride) − 1 is added to the element one stride to its right. The specific code is as follows:

    for(unsigned int stride = SECTION_SIZE/4; stride > 0; stride /= 2){
        __syncthreads();
        int index = (threadIdx.x + 1)*stride*2 - 1;
        if(index + stride < SECTION_SIZE)
            XY[index + stride] += XY[index];
    }

In both the reduction phase and the distribution phase, at most SECTION_SIZE/2 threads are active, so a block can be launched with only SECTION_SIZE/2 threads. Since a block can have up to 1024 threads, each scan section can cover up to 2048 elements, which means every thread has to load 2 elements.
Complete Brent_Kung_scan code:

#include<cuda.h>
#include<stdio.h>
#include<stdlib.h>
#define SECTION_SIZE 2048
__global__ void Brent_Kung_scan_kernel(float *X, int input_size){
    __shared__ float XY[SECTION_SIZE];

    // Load the section into shared memory; each thread loads two elements,
    // padding out-of-range positions with 0.
    int i = 2 * blockIdx.x * blockDim.x + threadIdx.x;
    if(i < input_size) XY[threadIdx.x] = X[i];
    else XY[threadIdx.x] = 0.0f;
    if(i + blockDim.x < input_size) XY[threadIdx.x + blockDim.x] = X[i + blockDim.x];
    else XY[threadIdx.x + blockDim.x] = 0.0f;

    // Reduction phase with reduced control divergence.
    for(unsigned int stride = 1; stride <= blockDim.x; stride *= 2){
        __syncthreads();
        int index = (threadIdx.x + 1)*stride*2 - 1;
        if(index < SECTION_SIZE)
            XY[index] += XY[index - stride];
    }

    // Distribution phase: push partial sums down the reverse tree.
    for(unsigned int stride = SECTION_SIZE/4; stride > 0; stride /= 2){
        __syncthreads();
        int index = (threadIdx.x + 1)*stride*2 - 1;
        if(index + stride < SECTION_SIZE)
            XY[index + stride] += XY[index];
    }
    __syncthreads();
    if(i < input_size) X[i] = XY[threadIdx.x];
    if(i + blockDim.x < input_size) X[i + blockDim.x] = XY[threadIdx.x + blockDim.x];
}

void test(float *A, int n){
    float *d_A;
    int size = n * sizeof(float);
    cudaMalloc(&d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    // One block of SECTION_SIZE/2 threads scans one section of SECTION_SIZE elements.
    Brent_Kung_scan_kernel<<<ceil((float)n/SECTION_SIZE), SECTION_SIZE/2>>>(d_A, n);
    cudaMemcpy(A, d_A, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
}

int main(int argc, char **argv){
    int n = atoi(argv[1]);
    float *A = (float *)malloc(n * sizeof(float));
    for(int i = 0; i < n; ++i){
        A[i] = 1.0f;
    }
    test(A, n);
    for(int i = 0; i < n; ++i){
        printf("%f ", A[i]);
    }
    free(A);
    return 0;
}

For an array of N elements, the reduction phase performs (N/2) + (N/4) + … + 2 + 1 = N − 1 additions, and the distribution phase performs (2 − 1) + (4 − 1) + … + (N/2 − 1), which is a little under N − 2 additions, so the total is at most 2 × N − 3. For example, for N = 16 the reduction does 15 additions and the distribution does 1 + 3 + 7 = 11, i.e. 26 in total, within the 2 × 16 − 3 = 29 bound. The following table compares the operation counts for different N:

N                        16    32    64    128   256   512   1024
N − 1                    15    31    63    127   255   511   1023
N × log2(N) − (N − 1)    49    129   321   769   1793  4097  9217
2 × N − 3                29    61    125   253   509   1021  2045

Since the number of operations is now proportional to N instead of N × log2(N), the efficient algorithm performs no more than about twice as many operations as the serial one. As long as the machine provides at least twice as many execution units, the parallel algorithm can outperform the serial one.

Parallel Scans of Greater Length

The above algorithm can handle at most 2048 elements. To scan longer arrays in parallel we need another approach; the hierarchical scan described below is one such method.

  • First, run the Brent_Kung_scan_kernel from the previous section over the entire array. Each block scans its own section; after the kernel finishes, the result array holds only these per-block scan results. We call these segments scan blocks.
  • Save the last element of each scan block into an auxiliary array S of length m (one entry per scan block), then scan S.
  • Add the first m − 1 elements of the scanned S to every element of the last m − 1 scan blocks of the result array, respectively (a small numeric example follows the figure below).

[Figure: hierarchical scan — per-block scan, scan of the block sums S, then per-block fix-up]
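As a small numeric example (using a scan section size of 4 instead of 2048 for readability), suppose the input is twelve 1s. After the first step the result array holds the three scan blocks [1 2 3 4 | 1 2 3 4 | 1 2 3 4] and S = [4, 4, 4]. Scanning S gives [4, 8, 12]. Adding S[0] = 4 to every element of the second scan block and S[1] = 8 to every element of the third gives [1 2 3 4 5 6 7 8 9 10 11 12], the scan of the whole array.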

Current CUDA devices allow up to 1024 threads per block, so each block can process up to 2048 elements. With 65536 thread blocks along the grid's x dimension, up to 65536 × 2048 = 134,217,728 elements can be covered.
To complete the work in the above figure, we design 3 kernels.

  1. The first kernel scans each block and stores the last element of each scan block into S, using the block's first thread. You can reuse Brent_Kung_scan_kernel and append the following at the end:

    __syncthreads();
    if(threadIdx.x == 0)
    	S[blockIdx.x] = XY[SECTION_SIZE - 1];
    
  2. The second kernel scans S. Brent_Kung_scan_kernel can be used.

  3. The third kernel adds the scanned block sums in S back into the original array: every block after the first adds S[blockIdx.x − 1] to each of its elements.

The final code is as follows:

#include<cuda.h>
#include<stdio.h>
#include<stdlib.h>
#define SECTION_SIZE 2048
__global__ void Brent_Kung_scan_kernel_1(float *X, float *S, int input_size){
    __shared__ float XY[SECTION_SIZE];

    // Load the section into shared memory; each thread loads two elements,
    // padding out-of-range positions with 0.
    int i = 2 * blockIdx.x * blockDim.x + threadIdx.x;
    if(i < input_size) XY[threadIdx.x] = X[i];
    else XY[threadIdx.x] = 0.0f;
    if(i + blockDim.x < input_size) XY[threadIdx.x + blockDim.x] = X[i + blockDim.x];
    else XY[threadIdx.x + blockDim.x] = 0.0f;

    // Reduction phase with reduced control divergence.
    for(unsigned int stride = 1; stride <= blockDim.x; stride *= 2){
        __syncthreads();
        int index = (threadIdx.x + 1)*stride*2 - 1;
        if(index < SECTION_SIZE)
            XY[index] += XY[index - stride];
    }

    // Distribution phase: push partial sums down the reverse tree.
    for(unsigned int stride = SECTION_SIZE/4; stride > 0; stride /= 2){
        __syncthreads();
        int index = (threadIdx.x + 1)*stride*2 - 1;
        if(index + stride < SECTION_SIZE)
            XY[index + stride] += XY[index];
    }
    __syncthreads();
    if(i < input_size) X[i] = XY[threadIdx.x];
    if(i + blockDim.x < input_size) X[i + blockDim.x] = XY[threadIdx.x + blockDim.x];

    // The first thread saves this scan block's total (its last element) into S.
    __syncthreads();
    if(threadIdx.x == 0){
        S[blockIdx.x] = XY[SECTION_SIZE - 1];
    }
}

__global__ void Brent_Kung_scan_kernel_2(float *X, int input_size){
    __shared__ float XY[SECTION_SIZE];

    // Load the section into shared memory; each thread loads two elements,
    // padding out-of-range positions with 0.
    int i = 2 * blockIdx.x * blockDim.x + threadIdx.x;
    if(i < input_size) XY[threadIdx.x] = X[i];
    else XY[threadIdx.x] = 0.0f;
    if(i + blockDim.x < input_size) XY[threadIdx.x + blockDim.x] = X[i + blockDim.x];
    else XY[threadIdx.x + blockDim.x] = 0.0f;

    // Reduction phase with reduced control divergence.
    for(unsigned int stride = 1; stride <= blockDim.x; stride *= 2){
        __syncthreads();
        int index = (threadIdx.x + 1)*stride*2 - 1;
        if(index < SECTION_SIZE)
            XY[index] += XY[index - stride];
    }

    // Distribution phase: push partial sums down the reverse tree.
    for(unsigned int stride = SECTION_SIZE/4; stride > 0; stride /= 2){
        __syncthreads();
        int index = (threadIdx.x + 1)*stride*2 - 1;
        if(index + stride < SECTION_SIZE)
            XY[index + stride] += XY[index];
    }
    __syncthreads();
    if(i < input_size) X[i] = XY[threadIdx.x];
    if(i + blockDim.x < input_size) X[i + blockDim.x] = XY[threadIdx.x + blockDim.x];
}

__global__ void kernel_3(float *X, float *S, int input_size){
    // Each thread fixes up two elements of its scan block by adding the
    // preceding block sum S[blockIdx.x - 1]. Block 0 needs no fix-up.
    int i = 2*blockDim.x*blockIdx.x + threadIdx.x;
    if(blockIdx.x > 0){
        if(i < input_size)
            X[i] += S[blockIdx.x - 1];
        if(i + blockDim.x < input_size)
            X[i + blockDim.x] += S[blockIdx.x - 1];
    }
}

void test(float *A, int n){
    float *d_A, *S;
    int size = n * sizeof(float);
    int num_blocks = ceil((float)n/SECTION_SIZE);   // number of scan blocks = length of S
    cudaMalloc(&d_A, size);
    cudaMalloc(&S, num_blocks * sizeof(float));
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    // Step 1: per-block scan, saving each block's total into S.
    Brent_Kung_scan_kernel_1<<<num_blocks, SECTION_SIZE/2>>>(d_A, S, n);
    // Step 2: scan S (only a complete scan while S fits in a single scan section).
    Brent_Kung_scan_kernel_2<<<ceil((float)num_blocks/SECTION_SIZE), SECTION_SIZE/2>>>(S, num_blocks);
    // Step 3: add the scanned block sums back into every scan block.
    kernel_3<<<num_blocks, SECTION_SIZE/2>>>(d_A, S, n);
    cudaMemcpy(A, d_A, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(S);
}

int main(int argc, char **argv){
    int n = atoi(argv[1]);
    float *A = (float *)malloc(n * sizeof(float));
    for(int i = 0; i < n; ++i){
        A[i] = 1.0f;
    }
    test(A, n);
    // Printing every element is too slow for large n; print only the last one,
    // which for an all-1 input should equal n.
    // for(int i = 0; i < n; ++i){
    //     printf("%f ", A[i]);
    // }
    printf("%f ", A[n-1]);
    free(A);
    return 0;
}

Running result:
Since printing every element is too slow, only the last element is printed; because every input element is 1.0, the result is correct when the last element equals the input length.
[Screenshot: output for lengths that compute correctly]
When the length is 10 million, the calculation result is wrong.

[Screenshot: incorrect output at length 10,000,000]
Experimentally, the largest length that computes correctly is around 4,125,000. This is close to SECTION_SIZE × SECTION_SIZE = 2048 × 2048 = 4,194,304, which is the structural limit of this two-level scheme: once the auxiliary array S no longer fits in a single scan section, Brent_Kung_scan_kernel_2 runs as several independent blocks, each scanning only its own piece of S, so the block sums are never combined and the final results go wrong. Supporting longer arrays would need a third level of the hierarchy (scanning the block sums of S in the same way). Whether hardware limits also play a role here is not yet clear to me.

Origin blog.csdn.net/weixin_45773137/article/details/125352979