Those things about CUDA programming in those years (3)

1. Overview

The previous two articles introduced the basic concepts of CUDA, as well as simple summation operations on arrays and matrices:

Those things about CUDA programming in those years (1)
Those things about CUDA programming in those years (2)

One of the most common uses of CUDA is processing two-dimensional images: feature extraction, stereo matching, three-dimensional reconstruction, deep learning training and detection, and so on. Here is a simple example that shows how CUDA processes an image by combining thread blocks and threads:

Purpose: Divide an 8000*1000 single-channel image (8000 rows, 1000 columns) into 40x40 (adjustable) blocks, calculate the sum, maximum, minimum, and average of the pixel values in each block, and save the results on the CPU side.
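
Before moving to the GPU, it can help to have a plain CPU version of the same task to compare results against. The following is a minimal sketch, not part of the original program: the names `BlockStats` and `blockStatsCpu` are made up for this illustration, and the image is assumed to be a row-major byte buffer whose width and height are multiples of the block size.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Statistics of one image block (hypothetical helper type).
struct BlockStats {
    int sum;
    int max;
    int min;
    double average;
};

// CPU reference: split a row-major, single-channel image into tile x tile
// blocks and compute the statistics of each block. Assumes width and height
// are multiples of tile, as with 1000 x 8000 and tile = 40.
std::vector<BlockStats> blockStatsCpu(const std::vector<unsigned char>& img,
                                      int width, int height, int tile) {
    const int blocksX = width / tile;
    const int blocksY = height / tile;
    std::vector<BlockStats> out(static_cast<std::size_t>(blocksX) * blocksY);
    for (int by = 0; by < blocksY; ++by) {
        for (int bx = 0; bx < blocksX; ++bx) {
            BlockStats s{0, 0, 255, 0.0};  // max starts at 0, min at 255
            for (int y = 0; y < tile; ++y) {
                for (int x = 0; x < tile; ++x) {
                    const std::size_t row = static_cast<std::size_t>(by) * tile + y;
                    const std::size_t col = static_cast<std::size_t>(bx) * tile + x;
                    const int v = img[row * width + col];
                    s.sum += v;
                    s.max = std::max(s.max, v);
                    s.min = std::min(s.min, v);
                }
            }
            s.average = static_cast<double>(s.sum) / (tile * tile);
            out[static_cast<std::size_t>(by) * blocksX + bx] = s;
        }
    }
    return out;
}
```

The results are stored block by block, row of blocks after row of blocks, which is the same order the kernel in this article uses for its output arrays.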

The description above and the program below are based on the reference blog listed in section 6.

2. Implementation steps

A thread block can contain at most 1024 threads, which is fewer than the 40*40 = 1600 pixels of one image block. The work is therefore done in two passes of 40*20 pixels each: every thread handles two pixels. The pixel data of each image block is stored in shared memory, and a reduction algorithm is then used to speed up the addition.
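
The arithmetic behind "two passes of 40*20" can be written down as a small compile-time check. This is only a sketch; the function names are invented for the illustration, and 1024 is the per-block thread limit quoted above.

```cpp
// Number of pixels each thread must handle so that a block of at most
// maxThreads threads covers a tileW x tileH image block (ceiling division).
constexpr int pixelsPerThread(int tileW, int tileH, int maxThreads) {
    return (tileW * tileH + maxThreads - 1) / maxThreads;
}

// Threads needed per block once the work is split into that many passes.
constexpr int threadsPerBlock(int tileW, int tileH, int maxThreads) {
    return tileW * tileH / pixelsPerThread(tileW, tileH, maxThreads);
}

static_assert(pixelsPerThread(40, 40, 1024) == 2, "1600 pixels need two passes");
static_assert(threadsPerBlock(40, 40, 1024) == 800, "40 x 20 threads per block");
```

With a 32x32 image block the same formula gives one pixel per thread, so the two-pass scheme is only needed because 1600 exceeds the limit.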

2.1 Input an 8000*1000 single-channel image using OpenCV

What? No 8000*1000 single-channel image at hand? Just call resize directly.

	Mat image = imread("2.jpg", 0);   // read the input image; 0 means load it as grayscale
	cv::resize(image, image, cv::Size(1000, 8000));   // cv::Size is (width, height)

2.2 Allocating memory for CUDA arrays

Device memory is needed for the image data to be processed and for the per-block results: the sum, the maximum and the minimum of the pixel values in each block. The implementation is as follows:

	// total number of bytes in the image
	size_t memSize = image.cols*image.rows * sizeof(uchar);
	int size = 5000 * sizeof(int);   // one int per image block: 25*200 = 5000 blocks

	// allocate device memory: image data, sum results, maximum results, minimum results
	uchar *d_src = NULL;
	int *d_sum = NULL;
	int *d_max = NULL;
	int *d_min = NULL;
	cudaMalloc((void**)&d_src, memSize);
	cudaMalloc((void**)&d_sum, size);
	cudaMalloc((void**)&d_max, size);
	cudaMalloc((void**)&d_min, size);

	// copy the image data to the GPU
	cudaMemcpy(d_src, image.data, memSize, cudaMemcpyHostToDevice);
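
The constant 5000 above is the number of image blocks. A small sketch of how the buffer sizes follow from the image dimensions is given below; `BufferSizes` and `bufferSizes` are hypothetical names, and the image is assumed to divide evenly into blocks.

```cpp
#include <cstddef>

// Sizes of the device buffers for a width x height single-channel image
// split into tile x tile blocks (width and height assumed multiples of tile).
struct BufferSizes {
    std::size_t imageBytes;   // one byte per pixel
    std::size_t blockCount;   // number of image blocks
    std::size_t resultBytes;  // one int per block, for each result array
};

constexpr BufferSizes bufferSizes(int width, int height, int tile) {
    return BufferSizes{
        static_cast<std::size_t>(width) * height * sizeof(unsigned char),
        static_cast<std::size_t>(width / tile) * (height / tile),
        static_cast<std::size_t>(width / tile) * (height / tile) * sizeof(int)};
}
```

For the 1000 x 8000 image and 40x40 blocks this gives 8,000,000 image bytes and 5000 blocks, matching the values used in the program.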

2.3 Allocate thread blocks and threads, execute the kernel function

This part first shows how the kernel function is launched. The next section then describes the implementation of the kernel function in detail, which is the core of this article.

	// image width and height
	int imgWidth = image.cols;
	int imgHeight = image.rows;

	//dim3 threadsPerBlock(20, 40); // each thread block is 20*40 threads, for the matSum2 kernel
	dim3 threadsPerBlock(40, 20); // each thread block is 40*20 threads, for the matSum kernel

	dim3 blockPerGrid(25, 200); // split the 8000*1000 image into 25*200 image blocks

	double time0 = static_cast<double>(getTickCount()); // start the timer
	matSum<<<blockPerGrid, threadsPerBlock, 3200 * sizeof(int)>>>(d_src, d_sum, d_max, d_min, imgHeight, imgWidth);

	// wait until all threads have finished
	cudaDeviceSynchronize();

	time0 = ((double)getTickCount() - time0) / getTickFrequency(); // stop the timer
	cout << "The Run Time is :" << time0 << "s" << endl; // print the run time

The program above contains two thread layouts, (40, 20) and (20, 40). Both produce the same results, but the threads are mapped to pixels slightly differently. They are only listed here and are discussed in detail later.
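
The two layouts and the grid size can be derived from the image size and the block size rather than typed in by hand. The sketch below uses plain structs as stand-ins for CUDA's dim3; `Dim2`, `LaunchConfig` and `launchConfig` are names invented for this illustration.

```cpp
// Plain-C++ stand-in for CUDA's dim3, used only for this illustration.
struct Dim2 { int x; int y; };

struct LaunchConfig { Dim2 threads; Dim2 blocks; };

// Each thread block covers one tile x tile image block with half as many
// threads as pixels. splitRows = true gives the (tile, tile/2) layout, where
// each thread takes one pixel from the upper and one from the lower half;
// splitRows = false gives (tile/2, tile), with left and right halves.
constexpr LaunchConfig launchConfig(int width, int height, int tile, bool splitRows) {
    return LaunchConfig{
        splitRows ? Dim2{tile, tile / 2} : Dim2{tile / 2, tile},
        Dim2{width / tile, height / tile}};
}
```

For a 1000 x 8000 image with 40x40 blocks both layouts use the same 25 x 200 grid; only the shape of the thread block changes.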

The launch above also passes a third configuration parameter, 3200 * sizeof(int). It is the number of bytes of dynamic shared memory allocated to each thread block while the kernel runs. Shared memory is, in short, memory that all threads inside one thread block can use jointly; knowing this is enough for this article. Note that the kernel stores only 1600 ints in its dynamic array, so 1600 * sizeof(int) would already be sufficient.
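
It is worth checking that the shared memory requested per thread block fits on the device. The sketch below adds up the dynamic part (the third launch parameter) and the static part (the two 1600-element arrays declared in section 3.1). The 48 KB limit is an assumption that holds by default on many GPUs, not a value taken from the original article; the helper names are invented.

```cpp
#include <cstddef>

// Bytes of shared memory one thread block uses: the dynamic part comes from
// the third launch parameter, the static part from fixed-size shared arrays.
constexpr std::size_t sharedBytesPerBlock(std::size_t dynamicInts,
                                          std::size_t staticArrays,
                                          std::size_t intsPerStaticArray) {
    return (dynamicInts + staticArrays * intsPerStaticArray) * sizeof(int);
}

// Assumed default per-block limit; query cudaDeviceProp::sharedMemPerBlock
// for the real value on a given device.
constexpr std::size_t kAssumedSharedLimit = 48 * 1024;
```

With 4-byte ints, 3200 dynamic ints plus two static arrays of 1600 ints come to 25,600 bytes per block; requesting only 1600 dynamic ints would lower that to 19,200 bytes.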

Shared memory can be declared statically or dynamically. A static declaration looks like this:

// declare a two-dimensional shared array of floats
__shared__ float a[size_x][size_y];

As with an ordinary C++ array, size_x and size_y must be constants known at compile time; they cannot be variables.

The dynamic declaration is:

extern __shared__ int tile[];

Note that a dynamic declaration supports only a one-dimensional array, whose size in bytes is given by the third launch parameter. For more information about shared memory, refer to this blog.
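
Because the dynamic array is one-dimensional, two-dimensional data has to be addressed through a flattened index. The following host-side sketch shows the usual row-major mapping; `flatIndex` and `FlatTile` are hypothetical names, and the std::vector merely stands in for the shared array.

```cpp
#include <cstddef>
#include <vector>

// Row-major offset of element (row, col) in a flattened array with `cols` columns.
constexpr std::size_t flatIndex(std::size_t row, std::size_t col, std::size_t cols) {
    return row * cols + col;
}

// Host-side stand-in for a dynamically sized shared array: one flat buffer
// addressed as if it were a rows x cols tile.
class FlatTile {
public:
    FlatTile(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols, 0) {}

    int& at(std::size_t row, std::size_t col) {
        return data_[flatIndex(row, col, cols_)];
    }

    std::size_t size() const { return data_.size(); }

private:
    std::size_t cols_;
    std::vector<int> data_;
};
```

Inside a kernel the same expression, row * cols + col, would index the extern __shared__ array directly.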

2.4 Result output and program end

This step copies the results from the GPU back to the CPU, releases the device memory, resets the device, and ends the program.

	// copy the results to the CPU (sum, max and min are host arrays of 5000 ints, see the complete code)
	cudaMemcpy(sum, d_sum, size, cudaMemcpyDeviceToHost);
	cudaMemcpy(max, d_max, size, cudaMemcpyDeviceToHost);
	cudaMemcpy(min, d_min, size, cudaMemcpyDeviceToHost);

	// print the results of the first image block
	cout << "The sum is :" << sum[0] << endl;
	cout << "The max is :" << max[0] << endl;
	cout << "The min is :" << min[0] << endl;

	// free the device memory
	cudaFree(d_src);
	cudaFree(d_sum);
	cudaFree(d_max);
	cudaFree(d_min);

	// reset the device
	cudaDeviceReset();
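
The stated purpose also asks for the average of each block, which the program does not compute on the GPU. One simple option is to derive it on the host from the sums that were copied back. This is a sketch of that idea, with `blockAverages` being a hypothetical helper.

```cpp
#include <vector>

// Average pixel value of each image block, derived on the host from the
// per-block sums copied back from the GPU.
std::vector<double> blockAverages(const std::vector<int>& sums, int pixelsPerBlock) {
    std::vector<double> averages;
    averages.reserve(sums.size());
    for (int s : sums) {
        averages.push_back(static_cast<double>(s) / pixelsPerBlock);
    }
    return averages;
}
```

For the 40x40 blocks used here, pixelsPerBlock is 1600, so a block whose sum is 3200 has an average of 2.0.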

3. The specific implementation process of the kernel function

This section is the key part of the algorithm and the core of the CUDA acceleration. When the kernel function is launched, the thread blocks run in parallel (though not necessarily all at the same moment). Each thread block runs independently of the others, while the threads inside one thread block can exchange data with each other.

3.1 Define shared memory

The threads in a thread block exchange data through shared memory; this is what makes the maximum, minimum and summation operations possible.

	// shared arrays for the threads of one thread block: 40*40 = 1600 elements
	const int number = 1600;
	extern __shared__ int _sum[];  // dynamic shared array, used for the sum
	__shared__ int _max[number];   // static shared array, used for the maximum
	__shared__ int _min[number];   // static shared array, used for the minimum

3.2 Calculate the index corresponding to each thread in the image

This involves three index values:
1. The absolute index in the image of the thread block's origin.
2. The absolute index in the image of each thread in the thread block.
3. The index of each thread within its thread block.

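	// map the thread block and thread indices to positions in the image array
	// 1. index of the thread block in the image
	//    blockindex1: origin of the upper 40*20 half, blockindex2: origin of the lower half
	int blockindex1 = blockIdx.x*blockDim.x + 2*blockIdx.y*blockDim.y*imgWidth;
	int blockindex2 = blockIdx.x*blockDim.x + (2*blockIdx.y*blockDim.y + blockDim.y)*imgWidth;

	// 2. index in the image of each thread in the thread block
	int index1 = threadIdx.x + threadIdx.y*imgWidth + blockindex1;
	int index2 = threadIdx.x + threadIdx.y*imgWidth + blockindex2;

	// 3. index of the thread within the thread block
	int thread = threadIdx.y*blockDim.x + threadIdx.x;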

The calculation can be pictured as follows (the original post illustrates it with a diagram). A thread block of (40, 20) threads is not enough to cover an image block of (40, 40) pixels. Each thread therefore processes two pixels: one in the upper 40*20 half of the image block and one in the lower half.

Our goal is to find, for every thread in the thread block, the index of the pixel it is responsible for, and then to store that pixel's value in shared memory for the later calculation. The steps are:

  1. Compute the absolute indices of the thread block in the image, blockindex1 and blockindex2. blockindex1 is the origin of the upper 40*20 half of the image block; blockindex2 is the origin of the lower half, which is handled in the second pass.
  2. Compute the image indices of the thread, index1 and index2. index2 is the pixel that the same thread handles in the second pass.
  3. Compute the index of the thread within the thread block, thread, which is used as the index into shared memory. The pixel from the second pass is stored at thread + blockDim.x*blockDim.y.
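
The index arithmetic is easy to check on the host with concrete numbers. The sketch below mirrors the three steps in plain C++; `Idx2`, `PixelPair` and `matSumIndices` are names invented for this illustration, with `Idx2` standing in for CUDA's built-in blockIdx, threadIdx and blockDim.

```cpp
// Plain-C++ stand-in for CUDA's built-in index variables.
struct Idx2 { int x; int y; };

struct PixelPair { int index1; int index2; int slot; };

// Mirrors the index arithmetic of matSum: each thread reads one pixel from
// the upper half of its image block (index1) and one from the lower half
// (index2); slot is the thread's position inside the thread block.
PixelPair matSumIndices(Idx2 blockIdx, Idx2 threadIdx, Idx2 blockDim, int imgWidth) {
    const int blockindex1 = blockIdx.x * blockDim.x
                          + 2 * blockIdx.y * blockDim.y * imgWidth;
    const int blockindex2 = blockIdx.x * blockDim.x
                          + (2 * blockIdx.y * blockDim.y + blockDim.y) * imgWidth;
    const int offset = threadIdx.x + threadIdx.y * imgWidth;
    return PixelPair{offset + blockindex1, offset + blockindex2,
                     threadIdx.y * blockDim.x + threadIdx.x};
}
```

For example, with a 1000-pixel-wide image and a (40, 20) thread block, thread (2, 3) of block (1, 1) reads pixels 43042 and 63042. The two indices differ by 20 rows of 1000 pixels, which is exactly the height of the upper half.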

3.3 Save each pixel value of the image block

Copy all pixel values of the 40*40 image block into the shared arrays in two passes:

	// 4. copy all pixel values of the 40*40 image block into the shared arrays in two passes:
	//    the upper 40*20 half goes to elements [0, 800),
	//    the lower 40*20 half goes to elements [800, 1600)
	_sum[thread] = dataIn[index1];
	_sum[thread + blockDim.x*blockDim.y] = dataIn[index2];

	_max[thread] = dataIn[index1];
	_max[thread + blockDim.x*blockDim.y] = dataIn[index2];

	_min[thread] = dataIn[index1];
	_min[thread + blockDim.x*blockDim.y] = dataIn[index2];

	// make sure the whole image block is loaded before any thread reads another thread's data
	__syncthreads();
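
A quick way to convince yourself that the two stores per thread fill every shared slot exactly once is to count the writes on the host. This is only a sketch; `slotWriteCounts` is a hypothetical helper that replays the store pattern with ordinary loops.

```cpp
#include <vector>

// Counts how often each shared-memory slot is written when every thread of a
// blockDimX x blockDimY block stores to `thread` and `thread + threadCount`.
std::vector<int> slotWriteCounts(int blockDimX, int blockDimY) {
    const int threadCount = blockDimX * blockDimY;
    std::vector<int> counts(2 * threadCount, 0);
    for (int ty = 0; ty < blockDimY; ++ty) {
        for (int tx = 0; tx < blockDimX; ++tx) {
            const int thread = ty * blockDimX + tx;
            ++counts[thread];                // pixel from the first pass
            ++counts[thread + threadCount];  // pixel from the second pass
        }
    }
    return counts;
}
```

For a (40, 20) thread block the result has 1600 entries, all equal to 1: no slot is skipped and no two threads write the same slot.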

3.4 Calculate the final result using the reduction algorithm

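Use the reduction algorithm to find the sum, maximum and minimum of the 1600 pixel values in a 40*40 image block. Two details matter here. First, 1600 is not a power of two, so the active length is halved rounding up; otherwise elements are dropped whenever the length is odd. Second, only the threads that have a partner element update shared memory in each step, so no thread reads an element that another thread is writing.

	// tree reduction over the 1600 values of the 40*40 image block
	// n is the number of elements still active; half is n/2 rounded up
	// after the loop, the sum, maximum and minimum are stored in element 0
	for (int n = number; n > 1; n = (n + 1) / 2)
	{
		int half = (n + 1) / 2;
		if (thread + half < n)
		{
			_sum[thread] = _sum[thread] + _sum[thread + half];
			if (_max[thread] < _max[thread + half]) _max[thread] = _max[thread + half];
			if (_min[thread] > _min[thread + half]) _min[thread] = _min[thread + half];
		}

		__syncthreads(); // synchronize all threads before the next step
	}

	// the first thread of each thread block writes the block's results
	if (thread == 0)
	{
		dataOutSum[blockIdx.x + blockIdx.y*gridDim.x] = _sum[0];
		dataOutMax[blockIdx.x + blockIdx.y*gridDim.x] = _max[0];
		dataOutMin[blockIdx.x + blockIdx.y*gridDim.x] = _min[0];
	}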
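
Tree reductions are usually written for power-of-two sizes, and 1600 halves down to 25, which is odd. The host-side sketch below compares a plain stride-halving loop with a variant that rounds the half length up. Both functions are illustrations with invented names; each inner-loop iteration plays the role of one thread in one reduction step.

```cpp
#include <cstddef>
#include <vector>

// Stride-halving reduction as often written for power-of-two sizes: the
// stride is n/2, n/4, ... and elements are lost whenever a length is odd.
int reduceHalving(std::vector<int> a) {
    for (std::size_t s = a.size() / 2; s > 0; s >>= 1) {
        for (std::size_t t = 0; t < s; ++t) a[t] += a[t + s];
    }
    return a.empty() ? 0 : a[0];
}

// Variant that rounds the half length up, so the middle element of an odd
// length is carried over to the next step instead of being dropped.
int reduceRoundUp(std::vector<int> a) {
    for (std::size_t n = a.size(); n > 1; n = (n + 1) / 2) {
        const std::size_t half = (n + 1) / 2;
        for (std::size_t t = 0; t + half < n; ++t) a[t] += a[t + half];
    }
    return a.empty() ? 0 : a[0];
}
```

On 1600 ones the plain halving loop returns 1024 rather than 1600, because an element is dropped at lengths 25 and 3; the rounded-up variant returns 1600. For a power-of-two length such as 1024 the two agree.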

3.5 Another thread mode

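The (20, 40) layout used by matSum2 works the same way, except that each thread reads one pixel from the left 20*40 half of the image block and one from the right half. It is not described further here; see the complete code in the next section.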
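
For comparison with the (40, 20) layout, the index arithmetic of the (20, 40) layout can be replayed on the host in the same way. As before, this is a sketch with invented names (`Idx2`, `PixelPair`, `matSum2Indices`), and `Idx2` stands in for CUDA's built-in index variables.

```cpp
// Plain-C++ stand-in for CUDA's built-in index variables.
struct Idx2 { int x; int y; };

struct PixelPair { int index1; int index2; int slot; };

// Mirrors the index arithmetic of matSum2: each thread reads one pixel from
// the left half of its image block (index1) and one from the right half
// (index2); slot is the thread's position inside the thread block.
PixelPair matSum2Indices(Idx2 blockIdx, Idx2 threadIdx, Idx2 blockDim, int imgWidth) {
    const int rowStart = blockIdx.y * blockDim.y * imgWidth;
    const int blockIndex1 = rowStart + 2 * blockIdx.x * blockDim.x;
    const int blockIndex2 = rowStart + (2 * blockIdx.x + 1) * blockDim.x;
    const int offset = threadIdx.y * imgWidth + threadIdx.x;
    return PixelPair{blockIndex1 + offset, blockIndex2 + offset,
                     threadIdx.x + threadIdx.y * blockDim.x};
}
```

With a 1000-pixel-wide image, thread (2, 3) of block (1, 1) reads pixels 43042 and 43062: the two pixels lie in the same row, 20 columns apart.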

4. Complete engineering code

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cuda.h>
#include <device_functions.h>
#include <opencv2/opencv.hpp>
#include <iostream>

using namespace std;
using namespace cv;

//threadsPerBlock(40, 20)
__global__ void matSum(uchar* dataIn, int *dataOutSum, int *dataOutMax, int *dataOutMin, int imgHeight, int imgWidth)
{
	// shared arrays for the threads of one thread block: 40*40 = 1600 elements
	const int number = 1600;
	extern __shared__ int _sum[];  // dynamic shared array, used for the sum
	__shared__ int _max[number];   // static shared array, used for the maximum
	__shared__ int _min[number];   // static shared array, used for the minimum

	// map the thread block and thread indices to positions in the image array
	// 1. index of the thread block in the image (upper half and lower half)
	int blockindex1 = blockIdx.x*blockDim.x + 2*blockIdx.y*blockDim.y*imgWidth;
	int blockindex2 = blockIdx.x*blockDim.x + (2*blockIdx.y*blockDim.y + blockDim.y)*imgWidth;

	// 2. index in the image of each thread in the thread block
	int index1 = threadIdx.x + threadIdx.y*imgWidth + blockindex1;
	int index2 = threadIdx.x + threadIdx.y*imgWidth + blockindex2;

	// 3. index of the thread within the thread block
	int thread = threadIdx.y*blockDim.x + threadIdx.x;

	// 4. copy all pixel values of the 40*40 image block into the shared arrays in two passes:
	//    upper 40*20 half to elements [0, 800), lower 40*20 half to elements [800, 1600)
	_sum[thread] = dataIn[index1];
	_sum[thread + blockDim.x*blockDim.y] = dataIn[index2];

	_max[thread] = dataIn[index1];
	_max[thread + blockDim.x*blockDim.y] = dataIn[index2];

	_min[thread] = dataIn[index1];
	_min[thread + blockDim.x*blockDim.y] = dataIn[index2];

	__syncthreads(); // the whole image block is loaded before the reduction starts

	// tree reduction over the 1600 values; 1600 is not a power of two, so the
	// active length n is halved rounding up, and only threads with a partner
	// element update. Afterwards the results are in element 0.
	for (int n = number; n > 1; n = (n + 1) / 2)
	{
		int half = (n + 1) / 2;
		if (thread + half < n)
		{
			_sum[thread] = _sum[thread] + _sum[thread + half];
			if (_max[thread] < _max[thread + half]) _max[thread] = _max[thread + half];
			if (_min[thread] > _min[thread + half]) _min[thread] = _min[thread + half];
		}

		__syncthreads(); // synchronize all threads before the next step
	}

	// the first thread of each thread block writes the block's results
	if (thread == 0)
	{
		dataOutSum[blockIdx.x + blockIdx.y*gridDim.x] = _sum[0];
		dataOutMax[blockIdx.x + blockIdx.y*gridDim.x] = _max[0];
		dataOutMin[blockIdx.x + blockIdx.y*gridDim.x] = _min[0];
	}
}

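//threadsPerBlock(20, 40)
__global__ void matSum2(uchar* dataIn, int *dataOutSum, int *dataOutMax, int *dataOutMin, int imgHeight, int imgWidth)
{
	// shared arrays for the threads of one thread block: 40*40 = 1600 elements
	const int number = 1600;
	extern __shared__ int _sum[];  // dynamic shared array, used for the sum
	__shared__ int _max[number];   // static shared array, used for the maximum
	__shared__ int _min[number];   // static shared array, used for the minimum

	// index of the thread block in the image: left 20*40 half and right 20*40 half
	int blockIndex1 = blockIdx.y*blockDim.y*imgWidth + 2 * blockIdx.x*blockDim.x;
	int blockIndex2 = blockIdx.y*blockDim.y*imgWidth + (2 * blockIdx.x + 1)*blockDim.x;

	// index in the image of each thread in the thread block
	int threadIndex1 = blockIndex1 + threadIdx.y*imgWidth + threadIdx.x;
	int threadIndex2 = blockIndex2 + threadIdx.y*imgWidth + threadIdx.x;

	// index of the thread within the thread block
	int thread = threadIdx.x + threadIdx.y*blockDim.x;

	// copy the left half to elements [0, 800) and the right half to [800, 1600)
	_sum[thread] = dataIn[threadIndex1];
	_sum[thread + blockDim.x*blockDim.y] = dataIn[threadIndex2];

	_max[thread] = dataIn[threadIndex1];
	_max[thread + blockDim.x*blockDim.y] = dataIn[threadIndex2];

	_min[thread] = dataIn[threadIndex1];
	_min[thread + blockDim.x*blockDim.y] = dataIn[threadIndex2];

	__syncthreads(); // the whole image block is loaded before the reduction starts

	// tree reduction; the active length n is halved rounding up because 1600 is not a power of two
	for (int n = number; n > 1; n = (n + 1) / 2)
	{
		int half = (n + 1) / 2;
		if (thread + half < n)
		{
			_sum[thread] = _sum[thread] + _sum[thread + half];
			if (_min[thread] > _min[thread + half]) _min[thread] = _min[thread + half];
			if (_max[thread] < _max[thread + half]) _max[thread] = _max[thread + half];
		}

		__syncthreads(); // synchronize all threads before the next step
	}

	// the first thread of each thread block writes the block's results
	if (thread == 0)
	{
		dataOutSum[blockIdx.x + blockIdx.y*gridDim.x] = _sum[0];
		dataOutMax[blockIdx.x + blockIdx.y*gridDim.x] = _max[0];
		dataOutMin[blockIdx.x + blockIdx.y*gridDim.x] = _min[0];
	}
}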

int main()
{
	Mat image = imread("2.jpg", 0); // read the input image as grayscale
	if (image.empty())
	{
		cout << "Could not read 2.jpg" << endl;
		return -1;
	}
	cv::resize(image, image, cv::Size(1000, 8000)); // cv::Size is (width, height)

	int sum[5000];       // sum results, one per image block
	int max[5000];       // maximum results
	int min[5000];       // minimum results

	// total number of bytes in the image
	size_t memSize = image.cols*image.rows * sizeof(uchar);
	int size = 5000 * sizeof(int);

	// allocate device memory: image data, sum results, maximum results, minimum results
	uchar *d_src = NULL;
	int *d_sum = NULL;
	int *d_max = NULL;
	int *d_min = NULL;
	cudaMalloc((void**)&d_src, memSize);
	cudaMalloc((void**)&d_sum, size);
	cudaMalloc((void**)&d_max, size);
	cudaMalloc((void**)&d_min, size);

	// copy the image data to the GPU
	cudaMemcpy(d_src, image.data, memSize, cudaMemcpyHostToDevice);

	// image width and height
	int imgWidth = image.cols;
	int imgHeight = image.rows;

	//dim3 threadsPerBlock(20, 40); // each thread block is 20*40 threads, for the matSum2 kernel
	dim3 threadsPerBlock(40, 20); // each thread block is 40*20 threads, for the matSum kernel

	dim3 blockPerGrid(25, 200); // split the 8000*1000 image into 25*200 image blocks

	double time0 = static_cast<double>(getTickCount()); // start the timer
	matSum<<<blockPerGrid, threadsPerBlock, 3200 * sizeof(int)>>>(d_src, d_sum, d_max, d_min, imgHeight, imgWidth);

	// wait until all threads have finished
	cudaDeviceSynchronize();

	time0 = ((double)getTickCount() - time0) / getTickFrequency(); // stop the timer
	cout << "The Run Time is :" << time0 << "s" << endl; // print the run time

	// copy the results to the CPU
	cudaMemcpy(sum, d_sum, size, cudaMemcpyDeviceToHost);
	cudaMemcpy(max, d_max, size, cudaMemcpyDeviceToHost);
	cudaMemcpy(min, d_min, size, cudaMemcpyDeviceToHost);

	// print the results of the first image block
	cout << "The sum is :" << sum[0] << endl;
	cout << "The max is :" << max[0] << endl;
	cout << "The min is :" << min[0] << endl;

	// free the device memory
	cudaFree(d_src);
	cudaFree(d_sum);
	cudaFree(d_max);
	cudaFree(d_min);

	// reset the device
	cudaDeviceReset();

	waitKey(0);
	return 0;
}

5. Experimental results

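Each thread block writes only its own element of the output arrays, so the order in which the thread blocks run cannot change sum[0]. If the sum differs from run to run, the cause is a data race inside the reduction: when every thread updates shared memory in every step, one thread may read an element while another thread is still writing it. The maximum and minimum tolerate this, because repeating a comparison does not change the result, but addition does not. Restricting each step to the threads that have a partner element, with a __syncthreads() barrier after every step, makes the sum, maximum and minimum identical on every run.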
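
The program prints only sum[0], max[0] and min[0], which belong to the top-left image block. To inspect the block that contains a particular pixel, the output index can be computed from the pixel coordinates. The helper below is a sketch with an invented name; it follows the kernel's output layout blockIdx.x + blockIdx.y*gridDim.x.

```cpp
// Index into the result arrays of the image block that contains pixel (x, y),
// matching the kernel's output layout blockIdx.x + blockIdx.y * gridDim.x.
// Assumes imgWidth is a multiple of tile.
constexpr int resultIndexForPixel(int x, int y, int tile, int imgWidth) {
    return (x / tile) + (y / tile) * (imgWidth / tile);
}
```

For the 1000 x 8000 image with 40x40 blocks, pixel (0, 40) falls into block 25, the first block of the second row of blocks, and the bottom-right pixel (999, 7999) falls into block 4999.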

6. Others

Reference blog: https://blog.csdn.net/MGotze/article/details/75268746?spm=1001.2014.3001.5501

Origin blog.csdn.net/qq_38589460/article/details/120317152