GPU programming

For project reasons I needed to convert existing CPU code to CUDA in order to accelerate it. Here I will walk through the whole conversion process.

CUDA code is actually very similar to C++ code. The core point is how the CPU side launches work on the GPU and distributes it across threads, so that the computation is accelerated.

For GPU code, the header files that must be included are cuda_runtime.h and device_launch_parameters.h.

These two headers are needed to detect the GPU and to operate on its memory.
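
A minimal skeleton (my own sketch, not taken from the project code) therefore starts like this:

    #include <cuda_runtime.h>
    #include <device_launch_parameters.h>
    #include <stdio.h>

    int main()
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);   /* how many CUDA-capable devices are visible */
        printf("CUDA devices found: %d\n", deviceCount);
        return 0;
    }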

1. Confirm that the device can be used

    int deviceCount = 0;
    int device;
    struct cudaDeviceProp properties;
    cudaError_t cudaResultCode = cudaGetDeviceCount(&deviceCount);
    if (cudaResultCode != cudaSuccess)
        deviceCount = 0;
    /* machines with no GPUs can still report one emulation device */
    for (device = 0; device < deviceCount; ++device) {
        cudaGetDeviceProperties(&properties, device);
        if (properties.major != 9999) /* 9999 means emulation only */
            if (device == 0)
            {
                printf("multiProcessorCount %d\n", properties.multiProcessorCount);
                printf("maxThreadsPerMultiProcessor %d\n", properties.maxThreadsPerMultiProcessor);
            }
    }

Here I read out the multiprocessor count and get a general picture of the card, which makes it easier to choose the number of threads and blocks later.
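
For example, the multiprocessor count can be used as a rough starting point for the launch configuration. This continues the query loop above and is my own sketch; the factor 2 and the 256 threads per block are arbitrary choices, not values from the original project:

    int threadsPerBlock = 256;                           /* a common, hardware-friendly choice */
    int blocks = properties.multiProcessorCount * 2;     /* enough blocks to keep every SM busy */
    /* later: someKernel<<<blocks, threadsPerBlock>>>(...); */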

2. GPU memory allocation

If you want to compute on the GPU, you must first make sure the data resides on the GPU, so GPU memory has to be allocated; this uses cudaMalloc, the counterpart of malloc on the CPU.

    float* dBuf = NULL;   /* device buffer */
    cudaError_t status = cudaMalloc(&dBuf, sizeof(float) * dBufSize);
    if (status != cudaSuccess)
    {
        printf("****************cuda malloc dbuf error ******************* \r\n");
        return;
    }

For each allocation we need to make sure the GPU memory was actually allocated successfully, judging by the returned status. Listed below are just the return codes I have encountered.

  • cudaSuccess = 0
    API calls return no errors.
  • cudaErrorIllegalAddress = 700
    The device encountered a load or store instruction at an invalid memory address. This leaves the process in an inconsistent state, and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and restarted.

If you get cudaSuccess the allocation went through and there is nothing to worry about. If you hit 700, the first thing to check is whether a GPU address and a CPU address have been mixed up. This error mostly shows up at kernel call time; it basically never appears during memory allocation itself.
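
Since every CUDA runtime call returns a status, it is convenient to wrap the check in a small helper instead of repeating the if/printf block. This macro is my own sketch, not part of the original code:

    /* Hypothetical helper: print the CUDA error string and where the call failed. */
    #define CUDA_CHECK(call)                                                    \
        do {                                                                    \
            cudaError_t err_ = (call);                                          \
            if (err_ != cudaSuccess) {                                          \
                printf("CUDA error %d (%s) at %s:%d\r\n",                       \
                       (int)err_, cudaGetErrorString(err_), __FILE__, __LINE__);\
            }                                                                   \
        } while (0)

    /* usage: CUDA_CHECK(cudaMalloc(&dBuf, sizeof(float) * dBufSize)); */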

After space has been allocated successfully, we still need to put values into it.

3. Memory copy

For filling GPU memory, the most common operation is copying data from the CPU to the GPU.

    /* host -> device: copy input data onto the GPU */
    status = cudaMemcpy(xyABD, pP->xyABD[0], NP * sizeof(cuFloatComplex), cudaMemcpyHostToDevice);
    if (status != cudaSuccess)
    {
        printf("****************cuda memcpy xyABD error ******************* \r\n");
        return;
    }

    /* device -> host: copy results back to the CPU */
    status = cudaMemcpy(aberrationOut, Waberration0, sizeof(float) * NP, cudaMemcpyDeviceToHost);
    if (status != cudaSuccess)
    {
        printf("****************cuda memcpy aberrationOut error ******************* \r\n");
        return;
    }

This function takes four parameters: the first is the destination address, the second is the source address, the third is the copy size (note that this is the size in bytes of the memory being copied), and the fourth is the direction.

cudaMemcpyHostToDevice means copying data from the CPU to the GPU.

cudaMemcpyDeviceToHost means copying data from the GPU to the CPU, which is mostly used to return the GPU's calculation results to the CPU.

Note: this function executes synchronously. It blocks and keeps control of the calling CPU thread until the data transfer has completed, so there is no need to add a cudaDeviceSynchronize() call after it.
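
To illustrate this, here is a small sketch of my own (the buffer and configuration names dIn, dOut, hOut, block, thread are hypothetical): the kernel launch itself returns immediately, but the following cudaMemcpy on the default stream waits for the kernel to finish and then for the copy to complete, so the host data is valid as soon as it returns.

    gconstMulDbl<<<block, thread>>>(dOut, 2.0f, dIn, number);   /* asynchronous launch (kernel shown in the next section) */
    status = cudaMemcpy(hOut, dOut, sizeof(float) * number,
                        cudaMemcpyDeviceToHost);                /* blocks until kernel and copy are done */
    /* hOut can be used right here; no extra cudaDeviceSynchronize() is needed */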

4. Function call

For GPU operations, specially qualified functions are required. Here is a brief introduction to the three function type qualifiers: __global__, __host__ and __device__.

(1) __host__

If no qualifier is specified, a function defaults to __host__. These can be understood as CPU functions, and they cannot be called from __global__ or __device__ functions.

(2) __global__

A function marked __global__ is a kernel: it runs on the GPU and is launched from CPU code, and it can call functions marked __device__.

The most commonly used type for GPU computation is __global__.

__global__ void gconstMulDbl(float* out, float c, float* in, int number)
{
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	int threadMax = blockDim.x * gridDim.x;
	for (int i = tid; i < number; i = i + threadMax)
	{
		out[i] = in[i] * c;
	}
}

Note that a __global__ function must have a void return type, so results have to be passed back through pointer parameters.

One-dimensional parallelism is used here, and the parallel allocation will be explained in detail below.

A __global__ function must be launched from CPU code, that is, from a __host__ function:

gconstMulDbl<<<block, thread>>>(out, c, in, number);

Note that apart from parameters passed by value (such as the constant c above), every pointer argument must point to memory that lives on the GPU; pointers to CPU memory cannot be dereferenced in the kernel, otherwise error 700 will be reported.
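
As a sketch of the calling side (my own example built around the kernel above; hIn, hOut and the launch configuration are hypothetical), only c and number are plain host values passed by value, while everything reached through a pointer must live on the GPU:

    float* dIn = NULL;
    float* dOut = NULL;
    cudaMalloc(&dIn,  sizeof(float) * number);
    cudaMalloc(&dOut, sizeof(float) * number);
    cudaMemcpy(dIn, hIn, sizeof(float) * number, cudaMemcpyHostToDevice);   /* hIn: host input array */

    gconstMulDbl<<<block, thread>>>(dOut, c, dIn, number);   /* c and number are passed by value */
    /* passing hIn or a host output buffer here instead of dIn/dOut would
       typically end in cudaErrorIllegalAddress (700) inside the kernel */

    cudaMemcpy(hOut, dOut, sizeof(float) * number, cudaMemcpyDeviceToHost);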

(3) __device__

__device__ indicates that the function runs on the GPU.

To be precise, it is called from a __global__ function (or from another __device__ function). Its parameters are likewise GPU-side values; the difference is that no block and thread configuration needs to be specified, it is simply called directly inside the kernel.
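
A minimal sketch of my own (not from the original code) showing the relationship: a __device__ helper called from a __global__ kernel, with no launch configuration at the inner call site:

    __device__ float scaleAndOffset(float x, float c)   /* runs on the GPU, callable only from GPU code */
    {
        return x * c + 1.0f;
    }

    __global__ void gScaleAndOffset(float* out, float c, float* in, int number)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        int threadMax = blockDim.x * gridDim.x;
        for (int i = tid; i < number; i += threadMax)
        {
            out[i] = scaleAndOffset(in[i], c);   /* plain function call, no <<<>>> */
        }
    }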

5. Thread allocation

The basic idea of CUDA parallel programming is to split a large task into N simple, repetitive operations and create N threads to execute them. Thread, block and grid are CUDA programming concepts; threads are organized into this hierarchy to make the programmer's software design easier.

  • thread: a CUDA parallel program is executed by many threads.
  • block: several threads are grouped together into a block; threads within the same block can synchronize with each other or communicate through shared memory.
  • grid: multiple blocks in turn make up a grid.

The complete execution configuration of the <<<>>> operator has the form <<<Dg, Db, Ns, S>>>; the parameters are described below, followed by a short usage sketch.

  • The parameter Dg defines the dimension and size of the whole grid, that is, how many blocks the grid contains. It is of type dim3. dim3 Dg(Dg.x, Dg.y, 1) means each row of the grid has Dg.x blocks, each column has Dg.y blocks, and the third dimension is 1 (a kernel launch has only one grid). The whole grid therefore contains Dg.x*Dg.y blocks. The maximum value of Dg.x and Dg.y is 65535 (on newer hardware Dg.x may be much larger, while Dg.y stays limited to 65535).
  • The parameter Db defines the dimension and size of each block, that is, how many threads a block contains. It is of type dim3. dim3 Db(Db.x, Db.y, Db.z) means each row of the block has Db.x threads, each column has Db.y threads, and its height is Db.z. On compute capability 1.x hardware the maximum of Db.x and Db.y is 512 and the maximum of Db.z is 64. A block contains Db.x*Db.y*Db.z threads in total, and this product may not exceed 512 threads per block on compute capability 1.x devices (1024 on later devices); the figures 768 and 1024 often quoted for compute capability 1.0/1.1 and 1.2/1.3 are the maximum resident threads per multiprocessor, not per block.
  • The parameter Ns is optional. It gives the number of bytes of shared memory to allocate dynamically for each block, in addition to the statically allocated shared memory. When no dynamic allocation is needed it is 0 or simply omitted.
  • The parameter S is optional, of type cudaStream_t, with a default value of zero; it specifies which stream the kernel is launched in.
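
A short usage sketch of my own (the values and the kernel name are arbitrary, not from the original code):

    dim3 Dg(64, 64);        /* grid of 64 x 64 blocks */
    dim3 Db(16, 16);        /* each block holds 16 x 16 threads */
    size_t Ns = 0;          /* no dynamically allocated shared memory */
    cudaStream_t S = 0;     /* default stream */
    /* someKernel<<<Dg, Db, Ns, S>>>(...); */

    /* for the one-dimensional case used in this article, plain integers are enough: */
    /* gconstMulDbl<<<block, thread>>>(out, c, in, number); */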

Now that the parameters have been introduced, let's look at how the different dimensions are used.

The code below shows how my sample kernel distributes the work; the one-dimensional case only uses the x components. I added an outer loop (a grid-stride loop) so that the results stay correct even if the number of blocks and threads is set smaller than the length of the arrays.

__global__ void gconstMulDbl(float* out, float c, float* in, int number)
{
	int tid = threadIdx.x + blockIdx.x * blockDim.x;
	int threadMax = blockDim.x * gridDim.x;
	for (int i = tid; i < number; i = i + threadMax)
	{
		out[i] = in[i] * c;
	}
}

Next comes the two-dimensional case, that is, a two-dimensional block within a two-dimensional grid. The flattened linear index is computed as follows:

    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int threadId = blockId * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;

(The two-dimensional indexing diagram and formulas come from "hujingshuang"'s original article: https://blog.csdn.net/hujingshuang/article/details/53097222)

I haven't used this part in depth yet, so I won't describe it in detail.
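
For reference only, here is a minimal sketch of my own showing how the two indices above could be used inside a kernel that scales a flattened rows*cols array (the kernel and its launch configuration are assumptions, not the original author's code):

    __global__ void gScale2D(float* out, float c, float* in, int rows, int cols)
    {
        /* two-dimensional grid and two-dimensional block, flattened to one linear index */
        int blockId  = blockIdx.x + blockIdx.y * gridDim.x;
        int threadId = blockId * (blockDim.x * blockDim.y)
                     + (threadIdx.y * blockDim.x) + threadIdx.x;
        if (threadId < rows * cols)
        {
            out[threadId] = in[threadId] * c;
        }
    }

    /* launch sketch:
       dim3 Db(16, 16);
       dim3 Dg((cols + 15) / 16, (rows + 15) / 16);
       gScale2D<<<Dg, Db>>>(dOut, c, dIn, rows, cols);   */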

6. Memory release

After the program finishes and the calculation results have been copied back to the designated CPU memory, the allocated GPU memory needs to be released.

    cudaError_t status = cudaFree((void*)gpuPar->XPupil);

Just as with CPU memory, you only need to free the pointers that were allocated earlier, and the cudaError_t return value can again be used to check whether the release succeeded.
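
As a small sketch of my own (reusing the dBuf buffer from the allocation example earlier), each cudaMalloc gets a matching cudaFree, and the pointer is cleared so it is not reused by accident:

    status = cudaFree(dBuf);
    if (status != cudaSuccess)
    {
        printf("cuda free dBuf error: %s \r\n", cudaGetErrorString(status));
        return;
    }
    dBuf = NULL;   /* avoid holding a dangling device pointer */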

Summary

At this point, the basic introduction to CUDA programming is complete. But with GPU acceleration, the road to optimization goes far beyond this: choosing the number of blocks and grids, the way the computation is parallelized, thread synchronization and so on all need continued exploration.

Original article: https://blog.csdn.net/ltd0924/article/details/123806157