Concurrent Design Course Notes

CUDA hardware overview

  • CPU

    • ALUs with strong single-thread computing power
    • Large-capacity cache
    • Complex control logic
  • GPU

    • Small cache capacity
    • Simple control logic
    • Many energy-efficient ALUs
    • Hides memory latency by running a large number of threads
  • The CPU is heavily optimized for the efficient execution of serial programs

  • The GPU is designed for parallel processing of large amounts of data

  • Nvidia's graphics processor product types

    • GeForce, consumer-oriented graphics products
    • Quadro, workstation graphics products
    • Tegra, a chip family for mobile devices
    • Tesla, mainly used for high-performance computing on servers

CUDA program structure

  • Get GPU hardware information
    • Get the number of GPU cards available
    cudaError_t cudaGetDeviceCount(int *n)
    
    • Set/get the current GPU card information
    cudaError_t cudaGetDevice(int *d)
    cudaError_t cudaSetDevice(int d)
    
    • View the characteristics of the GPU card (computing power, memory, etc.)
    cudaError_t cudaGetDeviceProperties(cudaDeviceProp* prop, int dev)
    struct cudaDeviceProp{
    	char name[256];
    	int major;
    	int minor;
    	...
    };
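
    A minimal sketch that queries the devices with the calls above (error checking omitted for brevity):
    #include <stdio.h>
    #include <cuda_runtime.h>
    
    int main(void){
        int count = 0;
        cudaGetDeviceCount(&count);                // number of CUDA-capable GPUs
        printf("found %d CUDA device(s)\n", count);
        for(int dev = 0; dev < count; dev++){
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);   // fill in the property struct
            printf("device %d: %s, compute capability %d.%d\n",
                   dev, prop.name, prop.major, prop.minor);
        }
        return 0;
    }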
    
  • CUDA function call error handling (every runtime call returns a cudaError_t; when an error occurs its cause can be reported, and a wrapper macro can add the file name and line number, as sketched below)
    float *gpuptr;
    cudaError_t error;
    error = cudaMalloc((void**)&gpuptr, N*sizeof(float));
    if(error != cudaSuccess){
    	printf("Error allocating memory: %s\n", cudaGetErrorString(error));
    	exit(1);
    }
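
    A common wrapper (a sketch, not from the original notes; CHECK is a hypothetical macro name) uses __FILE__ and __LINE__ to report exactly where the failing call sits:
    #define CHECK(call)                                           \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("CUDA error %s at %s:%d\n",                    \
                   cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(1);                                              \
        }                                                         \
    } while (0)
    
    // usage: CHECK(cudaMalloc((void**)&gpuptr, N*sizeof(float)));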
    
  • Data allocation and release on GPU memory
    • CPU and GPU have different memory spaces (starting from CUDA 6.0, a unified memory access mechanism can be used)
    • Allocate and release memory
    cudaMalloc( (void**) &pointer, size);
    cudaFree(pointer);
    
  • Data transfer between CPU and GPU
    // dst_pointer: destination pointer for the data
    // src_pointer: source pointer for the data
    // size: amount of data to copy, in bytes
    // direction: copy direction (e.g. cudaMemcpyHostToDevice)
    cudaMemcpy(dst_pointer, src_pointer, size, direction);
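
    Putting allocation, transfer, and release together, a minimal host-side round trip might look like this (a sketch; the buffer names are illustrative):
    float host[256], result[256];
    float *dev_ptr;
    size_t bytes = 256 * sizeof(float);
    cudaMalloc((void**)&dev_ptr, bytes);                         // allocate on the GPU
    cudaMemcpy(dev_ptr, host, bytes, cudaMemcpyHostToDevice);    // CPU -> GPU
    // ... launch a kernel that works on dev_ptr ...
    cudaMemcpy(result, dev_ptr, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(dev_ptr);                                           // release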
    
  • Definition of kernel function on GPU
    __host__   // called and executed on the host; compiled like an ordinary C function
    __device__ // called and executed on the GPU; compiled by nvcc
    __global__ // called from the host, executed on the GPU
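
    For example (a sketch; square and squareAll are hypothetical names), a __global__ kernel may call a __device__ helper:
    __device__ float square(float x){ return x * x; }  // runs on the GPU, callable from kernels
    
    __global__ void squareAll(float *data, int n){     // launched from the host, runs on the GPU
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = square(data[i]);
    }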
    
  • The CPU side calls the kernel function on the GPU
    • The way of calling kernel functions
      Declare the thread organization (thread grid and thread blocks) with which the kernel executes on the GPU
    • Thread grid
      • One- or two-dimensional structure (up to three dimensions on devices of compute capability 2.0 and above)
      • dim3 grid(2, 4) means there are 8 thread blocks, arranged as 2×4
    • Thread block
      • Up to three-dimensional structure
      • dim3 block(32, 8, 1) means each thread block has 256 threads, arranged as 32×8×1
    • Call kernel function
      kernelFunc<<<grid, block>>>(param1, param2);
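
      A minimal sketch of a complete launch, using the hypothetical squareAll kernel from the earlier sketch with a one-dimensional configuration:
      int n = 1024;
      int threads = 256;                        // threads per block
      int blocks = (n + threads - 1) / threads; // enough blocks to cover all n elements
      squareAll<<<blocks, threads>>>(dev_ptr, n);
      cudaDeviceSynchronize();                  // block the CPU until the kernel finishes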
      

Thread grid, thread block and thread

  • Thread organization

    • The thread grid can be organized in one or two dimensions; gridDim.x and gridDim.y (x indexes columns, y indexes rows) give the number of thread blocks in the x and y directions of the thread grid
    • Thread blocks can be organized in one, two, or three dimensions; blockDim.x, blockDim.y, and blockDim.z give the number of threads in the x, y, and z directions, while blockIdx.x, blockIdx.y, and blockIdx.z give the index of the thread block within the thread grid it belongs to
    • threadIdx.x, threadIdx.y, and threadIdx.z give the index of the thread within the thread block it belongs to
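      A common combination of these built-ins (a sketch; indexDemo is a hypothetical kernel) computes a thread's global coordinates and flattens them to a 1D offset:
      __global__ void indexDemo(int *out, int width, int height){
          int col = blockIdx.x * blockDim.x + threadIdx.x;  // global x coordinate
          int row = blockIdx.y * blockDim.y + threadIdx.y;  // global y coordinate
          if (col < width && row < height)                  // guard against overshoot
              out[row * width + col] = row * width + col;   // flatten to a 1D offset
      }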
  • Threads are executed in units of warps (each GPU contains multiple SMs, streaming multiprocessors; each SM has multiple cores, and threads are scheduled in units of warps, 32 threads on NVIDIA GPUs)

  • Thread block mapping
    All threads are assigned to the GPU's SMs in units of thread blocks. The number of thread blocks an SM can hold at once depends on the number of threads inside each block and is limited by the SM's maximum number of resident threads

  • Synchronization between threads

    • Within the same thread block, threads synchronize with the __syncthreads() function, as in the sketch below
    • Synchronization between thread blocks must go through global memory
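
    A typical use of __syncthreads() (a sketch; blockSum and BLOCK are hypothetical names): threads stage data in shared memory and must all finish writing before any thread reads it:
    #define BLOCK 256
    
    __global__ void blockSum(const float *in, float *out){
        __shared__ float buf[BLOCK];
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * BLOCK + tid];  // each thread loads one element
        __syncthreads();                          // wait until the whole tile is loaded
        for (int s = BLOCK / 2; s > 0; s >>= 1){  // tree reduction in shared memory
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();                      // partial sums ready before next step
        }
        if (tid == 0) out[blockIdx.x] = buf[0];   // one result per thread block
    }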

Source: blog.csdn.net/qq_42573343/article/details/105292402