Those Things About CUDA Programming (1)

1. Overview

  • The purpose of CUDA programming: when ordinary acceleration methods (SIMD instructions, C++ multithreading, OpenMP, etc.) cannot meet actual needs, CUDA is used to accelerate an algorithm so that the system meets its real-time requirements.
  • Examples: stereo matching, deep learning training and testing, 3D reconstruction, etc.
  • Hardware requirements: the official NVIDIA website lists the graphics cards that support CUDA and each card's compute capability.

2. CUDA installation

The installation process is fairly simple and breaks down into three steps:

  • 1. Prepare Visual Studio and the CUDA installer downloaded from the official website (I use 10.2; newer versions are available and worth trying). During installation, the installer automatically checks whether a supported VS version is present on the machine; if the VS and CUDA versions do not match, installation cannot proceed. Also, if 360 antivirus is installed (best to simply turn it off), it will repeatedly warn of suspected virus modifications during installation, and every operation must be allowed, or the installation will fail.
  • 2. After the installation completes, open a command window, run path, and check whether the corresponding environment variables are present, as shown below.
    [screenshot omitted]
    If they are not, you can add the environment variables yourself; usually they are already there, since the installer adds them by default. Run nvcc -V to view the installed CUDA version information.
    [screenshot omitted]
  • 3. Open VS and you will find a new NVIDIA option; after selecting it you can create a new CUDA project.
    [screenshot omitted]
  • Reference: https://blog.csdn.net/HaleyDong/article/details/86093520

3. Simple structure description

  • The figure below sketches the data flow between the Host (the CPU) and the Device (the GPU, where the CUDA code runs).
    [screenshot omitted]
  • The grid is the outermost layer and is generally three-dimensional; gridDim.x, gridDim.y, and gridDim.z give the size of each dimension of the grid.
  • A block is a thread block within the grid, also generally three-dimensional; blockDim.x, blockDim.y, and blockDim.z give the size of each dimension of the thread block, while blockIdx.x, blockIdx.y, and blockIdx.z give the block's index within the grid.
  • The innermost level is the actual threads, likewise laid out in three dimensions; threadIdx.x, threadIdx.y, and threadIdx.z give each thread's index along x, y, and z within its thread block.
  • The specific distribution of threads and thread blocks is shown in the following figure:
    [screenshot omitted]
  • Note: threads in different blocks cannot directly affect each other; they are physically separated. Threads within the same thread block can communicate through shared memory. Keep these two points in mind; we will discuss them in detail later.

4. Standard example

  • After creating a new CUDA project in VS, a kernel.cu file is generated by default. Below is that file with detailed comments:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

// The "__global__" qualifier tells the compiler that this function should be
// compiled to run on the GPU (the device) rather than on the CPU.
__global__ void addKernel(int *c, const int *a, const int *b)
{
    // threadIdx.x is this thread's index along the x dimension of its block.
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main()
{
    const int arraySize = 5;
    const int a[arraySize] = { 1, 2, 3, 4, 5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    int c[arraySize] = { 0 };

    // Entry function for the GPU computation; its return type is cudaError_t.
    cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addWithCuda failed!");
        return 1;
    }

    printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n",
        c[0], c[1], c[2], c[3], c[4]);

    // cudaDeviceReset frees all allocated device memory and resets the device state.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

// Helper function for using CUDA to add vectors in parallel.
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    // Choose which GPU to run on; change this on a multi-GPU system.
    // Here we initialize the device and select the GPU with ID 0.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output).
    // Device memory for the output array c.
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Device memory for the input array a.
    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Device memory for the input array b.
    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    // "<<< >>>" is the execution-configuration syntax. Here <<<1, size>>> launches
    // one thread block containing size threads, so the layout is one-dimensional:
    // blockIdx.x == 0 and threadIdx.x ranges over [0, size).
    // In total arraySize threads run, each adding one pair of elements.
    addKernel<<<1, size>>>(dev_c, dev_a, dev_b);

    // Check for any errors launching the kernel. cudaGetLastError returns the most
    // recent runtime error; cudaGetErrorString converts it to a readable message.
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize blocks until every thread has finished its work, and
    // returns any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy the output vector from the GPU buffer back to host memory. cudaMemcpy
    // copies between host and device in either direction, host to host, or
    // device to device, depending on the cudaMemcpyKind argument.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    // cudaFree releases device memory allocated with cudaMalloc.
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);

    return cudaStatus;
}
  • The comments above are already quite detailed, so I will not repeat their meaning here. The actual output is as follows:
    [screenshot omitted]
  • A hello_world applet:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void hello_world(void)
{
    printf("GPU: Hello world!\n");
}

int main(int argc, char **argv)
{
    printf("CPU: Hello world!\n");
    hello_world<<<1, 10>>>();
    // Without this call the program may exit before the GPU's printf output is flushed.
    cudaDeviceReset();
    return 0;
}
  • The resulting output:
    [screenshot omitted]

5. Reference

Tan Sheng's blog:


Origin blog.csdn.net/qq_38589460/article/details/120209431