CUDA Notes - Convolution Computation

1. One-dimensional convolution       

         A one-dimensional convolution computes, for each element of the source array, a weighted sum of that element and its neighbors, with the weights given by the convolution kernel; the kernel length is usually odd;

        The figure below shows a source array of 16 elements and a convolution kernel of 5 elements;

         The convolution is computed as shown in the figure below: each output element is the weighted sum of the corresponding input element and its neighbors. At the boundaries of the array the kernel extends past the valid range, so a boundary check is needed and out-of-range positions are treated as 0;

        Let the source array be N and the output convolution array be P; both have length size;

        Let the convolution kernel array be M, with length size_kernel;

2. CPU code implementation:

void convolution_1D_basic_kernel(float* N, float* P, int size,
    float* M, int size_kernel) {
    int half_width_kernel = size_kernel / 2;
    for (int i = 0; i < size; i++) {
        float p_value = 0;
        int begin_pos = i - half_width_kernel;
        for (int j = begin_pos; j < begin_pos + size_kernel; j++) {
            if (j >= 0 && j < size) {
                p_value += N[j] * M[j - begin_pos];
            }            
        }
        P[i] = p_value;
    }
}

        Code walkthrough:

        1. Compute half_width_kernel of the convolution kernel; with a kernel size of 5, half_width_kernel is 2;

        2. Iterate over the entire source array:

                2.1 Declare a temporary variable p_value to accumulate the weighted sum for the current element;

                2.2 Compute the starting position of the convolution at position i: begin_pos = i - half_width_kernel;

                2.3 Starting from begin_pos, accumulate the product of each input value and the corresponding kernel weight, provided the index lies within the valid range [0, size) of the source array; the loop runs size_kernel (here 5) times;

                2.4 Write the result into the corresponding P[i];

3. The GPU code is as follows:

3.1 General implementation

__global__ void convolution_1D_basic_kernel(float* N, float* P, int size,
    float* M, int size_kernel) {
    // Global index of this thread: one thread per output element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= size) return;   // guard against extra threads in the last block

    float p_value = 0;
    int begin_pos = i - (size_kernel / 2);
    for (int j = begin_pos; j < begin_pos + size_kernel; j++) {
        if (j >= 0 && j < size) {
            p_value += N[j] * M[j - begin_pos];
        }
    }
    P[i] = p_value;
}

        Assume the grid is one-dimensional and each block is one-dimensional; that is, the kernel is launched as convolution_1D_basic_kernel<<<1,16>>>;

        Code walkthrough:

        1. Compute the index i of the current thread in the overall grid from the thread Id, the block Id, and the block size;

        2. The current thread computes only the convolution value for N[i]. The starting position of the convolution at position i is begin_pos = i - (size_kernel / 2); starting from begin_pos, size_kernel values are accumulated into the convolution sum;

        Note: the loop index must stay within the valid range of N;

        3. Write the result into P[i];

        Shortcomings:

        1. Control-flow divergence: threads near the boundary take different branches in the if statement, so threads in the same warp follow different paths;

        2. Memory bandwidth: as the code above shows, the N array is accessed repeatedly, and every access goes to global memory; faster memories such as constant memory and shared memory are not used;

3.2 Using constant memory

        The convolution kernel used in a convolution is small, its contents do not change during the computation, and every thread accesses it;

        The kernel can and should therefore be placed in constant memory, which is cached; this reduces the number of global memory accesses and lowers the effective access latency;

        Variables stored in constant memory must be global variables, declared outside all function definitions, using the keyword " __constant__ "; " cudaMemcpyToSymbol " is used to copy the kernel data to the device's constant memory;

#define MAX_MASK_WIDTH 10
__constant__ float M[MAX_MASK_WIDTH];

        The difference from the previous code is that M is no longer passed into the kernel function as a parameter;

__global__ void convolution_1D_basic_kernel(float* N, float* P, int size,
    int size_kernel) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= size) return;   // guard against extra threads in the last block

    float p_value = 0;
    int begin_pos = i - (size_kernel / 2);
    for (int j = begin_pos; j < begin_pos + size_kernel; j++) {
        if (j >= 0 && j < size) {
            p_value += N[j] * M[j - begin_pos];   // M now lives in constant memory
        }
    }
    P[i] = p_value;
}
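        A minimal host-side setup for this version might look like the following sketch (launch_convolution, h_M, d_N, d_P and the block size of 256 are illustrative assumptions, and error checking is omitted):

```cuda
void launch_convolution(const float* h_M, int size_kernel,
                        float* d_N, float* d_P, int size) {
    // Copy the kernel weights into the __constant__ array M declared above.
    cudaMemcpyToSymbol(M, h_M, size_kernel * sizeof(float));

    int threads_per_block = 256;
    int blocks = (size + threads_per_block - 1) / threads_per_block;
    convolution_1D_basic_kernel<<<blocks, threads_per_block>>>(d_N, d_P, size, size_kernel);
}
```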

3.3 Using blocks

        Sections 3.1 and 3.2 launch the kernel with a single one-dimensional block:

convolution_1D_basic_kernel<<<1,16>>>(N, P, size, size_kernel);

         In a tiled algorithm, all threads cooperate to load input elements into on-chip memory, and subsequent uses of those elements read directly from on-chip memory;

        The most direct way to do this is to load all the input data required by a thread block into shared memory before computing;

        Using 4 thread blocks, with 4 threads per block, to compute the one-dimensional convolution:

convolution_1D_basic_kernel<<<4,4>>>(N, P, size, size_kernel);

        With 4 threads per block, loading only the elements owned by those 4 threads into on-chip memory improves access speed, but the warp-divergence problem remains;

        To solve it, a few extra elements must also be loaded into on-chip memory. Their number depends on the kernel radius: 2 * (size_kernel / 2). The four lines in the figure below show the data each of the four thread blocks stores in shared memory; the yellow parts are the extra elements (halo elements) required by the block's computation, and the green parts are the block's own central elements (internal elements);

        The shared memory tile of each block must hold the left halo elements, the internal elements in the middle, and the right halo elements. In the example in the figure below, each block stores size_kernel/2 == 2 left halo elements, 4 internal elements (the number of threads), and size_kernel/2 == 2 right halo elements, 8 in total;

        This way, each thread can compute starting from its position in shared memory, as shown in the figure below; the boundary-condition check can be omitted, because out-of-range halo slots are filled with 0;

        In the figure below, the left halo elements (tile indices 0 and 1) are loaded by threads 2 and 3, and the right halo elements (tile indices 6 and 7) are loaded by threads 0 and 1;

        Loading the left halo elements:

int halo_index_left = (blockIdx.x - 1) * blockDim.x + threadIdx.x;
if (threadIdx.x >= blockDim.x - n){
    N_ds[threadIdx.x - (blockDim.x - n)] =
         (halo_index_left < 0) ? 0 : N[halo_index_left];
}

        halo_index_left is the global index of the element this thread loads into the left halo; it is an internal element of the previous thread block, hence blockIdx.x - 1. Here n is size_kernel / 2, and the last n threads of the block load the n left halo elements;

        Loading the internal elements:

N_ds[n + threadIdx.x] = N[blockIdx.x * blockDim.x + threadIdx.x];

        Loading the right halo elements:

int halo_index_right = (blockIdx.x + 1) * blockDim.x + threadIdx.x;
if (threadIdx.x < n){
    N_ds[n + blockDim.x + threadIdx.x] =
         (halo_index_right >= size) ? 0 : N[halo_index_right];
}

        The complete kernel is as follows:

__global__ void convolution_1D_basic_kernel(float* N, float* P, int size,
    int size_kernel) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float N_ds[TILE_SIZE + MAX_MASK_WIDTH - 1];

    int n = size_kernel / 2;

    // Last n threads load the left halo (internal elements of the previous block).
    int halo_index_left = (blockIdx.x - 1) * blockDim.x + threadIdx.x;
    if (threadIdx.x >= blockDim.x - n){
        N_ds[threadIdx.x - (blockDim.x - n)] =
             (halo_index_left < 0) ? 0 : N[halo_index_left];
    }

    // Every thread loads its own internal element.
    N_ds[n + threadIdx.x] = N[blockIdx.x * blockDim.x + threadIdx.x];

    // First n threads load the right halo (internal elements of the next block).
    int halo_index_right = (blockIdx.x + 1) * blockDim.x + threadIdx.x;
    if (threadIdx.x < n){
        N_ds[n + blockDim.x + threadIdx.x] =
             (halo_index_right >= size) ? 0 : N[halo_index_right];
    }

    __syncthreads();

    float p_value = 0;
    // All inputs now come from shared memory (N_ds, not N).
    for (int j = 0; j < size_kernel; j++) {
        p_value += N_ds[threadIdx.x + j] * M[j];
    }
    P[i] = p_value;
}

        By introducing this extra complexity, the number of DRAM accesses to the array N is reduced; the ultimate goal is to increase the ratio of arithmetic operations to memory accesses, so that performance is no longer limited, or only partially limited, by DRAM bandwidth;

3.4 Using another shared-memory scheme

        In this scheme, only the internal elements are stored in the shared memory tile, and the halo elements are read directly from global memory. Since the convolution kernel is small, so that halo accesses are far fewer than internal accesses, the code is relatively simple;

__global__ void convolution_1D_basic_kernel(float *N, float *P,
	int Mask_Width, int Width){
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	__shared__ float N_ds[TILE_SIZE];
	
	// Each thread loads only its own internal element.
	N_ds[threadIdx.x] = N[i];
	
	__syncthreads();
	
	int this_tile_start_point = blockIdx.x * blockDim.x;
	int next_tile_start_point = (blockIdx.x + 1) * blockDim.x;
	int N_start_point = i - (Mask_Width / 2);
	float Pvalue = 0;
	for (int j = 0; j < Mask_Width; j++) {
		int N_index = N_start_point + j;
		if (N_index >= 0 && N_index < Width) {
			if (N_index >= this_tile_start_point 
                && N_index < next_tile_start_point) {
				// Inside this block's tile: read from shared memory.
				Pvalue += N_ds[threadIdx.x + j - (Mask_Width/2)]*M[j];
			} else {
				// Halo element: read directly from global memory.
				Pvalue += N[N_index] * M[j];
			}
		}		
	}
	P[i] = Pvalue;
}

