Concurrent Design Course Notes

CUDA hardware overview

  • CPU

    • ALUs with strong single-thread computing power
    • Large-capacity cache
    • Complex control logic
  • GPU

    • Small cache capacity
    • Simple control logic
    • Many energy-efficient ALUs
    • Hides memory latency by running a large number of threads
  • The CPU is heavily optimized for the efficient execution of serial programs

  • The GPU is designed for parallel processing of large amounts of data

  • Nvidia's graphics processor product types

    • GeForce, consumer-oriented graphics products
    • Quadro, workstation graphics products
    • Tegra, a chip family for mobile devices
    • Tesla, mainly used for high-performance computing on servers

CUDA program structure

  • Get GPU hardware information
    • Get the number of GPU cards available
    cudaError_t cudaGetDeviceCount(int *n)
    
    • Set/get the current GPU card information
    cudaError_t cudaGetDevice(int *d)
    cudaError_t cudaSetDevice(int d)
    
    • View the characteristics of the GPU card (computing power, memory, etc.)
    cudaError_t cudaGetDeviceProperties(cudaDeviceProp* prop, int dev)
    struct cudaDeviceProp{
    	char name[256];
    	int major;
    	int minor;
    	...
    };
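
    A minimal sketch that queries the devices with the calls above (error checking omitted for brevity):
    #include <stdio.h>
    #include <cuda_runtime.h>
    
    int main(void){
        int count = 0;
        cudaGetDeviceCount(&count);                // number of CUDA-capable GPUs
        printf("found %d CUDA device(s)\n", count);
        for(int dev = 0; dev < count; dev++){
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);   // fill in the property struct
            printf("device %d: %s, compute capability %d.%d\n",
                   dev, prop.name, prop.major, prop.minor);
        }
        return 0;
    }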
    
  • CUDA function call error handling (every runtime call returns a cudaError_t; when an error occurs its cause can be reported, and a wrapper macro can add the file name and line number, as sketched below)
    float *gpuptr;
    cudaError_t error;
    error = cudaMalloc((void**)&gpuptr, N*sizeof(float));
    if(error != cudaSuccess){
    	printf("Error allocating memory: %s\n", cudaGetErrorString(error));
    	exit(1);
    }
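
    A common wrapper (a sketch, not from the original notes; CHECK is a hypothetical macro name) uses __FILE__ and __LINE__ to report exactly where the failing call sits:
    #define CHECK(call)                                           \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            printf("CUDA error %s at %s:%d\n",                    \
                   cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(1);                                              \
        }                                                         \
    } while (0)
    
    // usage: CHECK(cudaMalloc((void**)&gpuptr, N*sizeof(float)));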
    
  • Data allocation and release on GPU memory
    • CPU and GPU have different memory spaces (starting from CUDA 6.0, a unified memory access mechanism can be used)
    • Allocate and release memory
    cudaMalloc( (void**) &pointer, size);
    cudaFree(pointer);
    
  • Data transfer between CPU and GPU
    // dst_pointer: destination pointer for the data
    // src_pointer: source pointer for the data
    // size: amount of data to copy, in bytes
    // direction: copy direction (e.g. cudaMemcpyHostToDevice)
    cudaMemcpy(dst_pointer, src_pointer, size, direction);
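
    Putting allocation, transfer, and release together, a minimal host-side round trip might look like this (a sketch; the buffer names are illustrative):
    float host[256], result[256];
    float *dev_ptr;
    size_t bytes = 256 * sizeof(float);
    cudaMalloc((void**)&dev_ptr, bytes);                         // allocate on the GPU
    cudaMemcpy(dev_ptr, host, bytes, cudaMemcpyHostToDevice);    // CPU -> GPU
    // ... launch a kernel that works on dev_ptr ...
    cudaMemcpy(result, dev_ptr, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
    cudaFree(dev_ptr);                                           // release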
    
  • Definition of kernel function on GPU
    __host__   // called and executed on the host; compiled like an ordinary C function
    __device__ // called and executed on the GPU; compiled by nvcc
    __global__ // called from the host, executed on the GPU
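
    For example (a sketch; square and squareAll are hypothetical names), a __global__ kernel may call a __device__ helper:
    __device__ float square(float x){ return x * x; }  // runs on the GPU, callable from kernels
    
    __global__ void squareAll(float *data, int n){     // launched from the host, runs on the GPU
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = square(data[i]);
    }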
    
  • The CPU side calls the kernel function on the GPU
    • The way of calling kernel functions
      Declare the thread organization (thread grid and thread blocks) with which the kernel executes on the GPU
    • Thread grid
      • One- or two-dimensional structure (up to three dimensions on devices of compute capability 2.0 and above)
      • dim3 grid(2, 4) means there are 8 thread blocks, arranged as 2×4
    • Thread block
      • Up to three-dimensional structure
      • dim3 block(32, 8, 1) means each thread block has 256 threads, arranged as 32×8×1
    • Call kernel function
      kernelFunc<<<grid, block>>>(param1, param2);
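
      A minimal sketch of a complete launch, using the hypothetical squareAll kernel from the earlier sketch with a one-dimensional configuration:
      int n = 1024;
      int threads = 256;                        // threads per block
      int blocks = (n + threads - 1) / threads; // enough blocks to cover all n elements
      squareAll<<<blocks, threads>>>(dev_ptr, n);
      cudaDeviceSynchronize();                  // block the CPU until the kernel finishes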
      

Thread grid, thread block and thread

  • Thread organization

    • The thread grid can be organized in one or two dimensions; gridDim.x and gridDim.y (x indexes columns, y indexes rows) give the number of thread blocks in the x and y directions of the thread grid
    • Thread blocks can be organized in one, two, or three dimensions; blockDim.x, blockDim.y, and blockDim.z give the number of threads in the x, y, and z directions, while blockIdx.x, blockIdx.y, and blockIdx.z give the index of the thread block within the thread grid it belongs to
    • threadIdx.x, threadIdx.y, and threadIdx.z give the index of the thread within the thread block it belongs to
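      A common combination of these built-ins (a sketch; indexDemo is a hypothetical kernel) computes a thread's global coordinates and flattens them to a 1D offset:
      __global__ void indexDemo(int *out, int width, int height){
          int col = blockIdx.x * blockDim.x + threadIdx.x;  // global x coordinate
          int row = blockIdx.y * blockDim.y + threadIdx.y;  // global y coordinate
          if (col < width && row < height)                  // guard against overshoot
              out[row * width + col] = row * width + col;   // flatten to a 1D offset
      }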
  • Threads are executed in units of warps (each GPU contains multiple SMs, streaming multiprocessors; each SM has multiple cores, and threads are scheduled in units of warps, 32 threads on NVIDIA GPUs)

  • Thread block mapping
    All threads are assigned to the GPU's SMs in units of thread blocks. The number of thread blocks an SM can hold at once depends on the number of threads inside each block and is limited by the SM's maximum number of resident threads

  • Synchronization between threads

    • Within the same thread block, threads synchronize with the __syncthreads() function, as in the sketch below
    • Synchronization between thread blocks must go through global memory
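
    A typical use of __syncthreads() (a sketch; blockSum and BLOCK are hypothetical names): threads stage data in shared memory and must all finish writing before any thread reads it:
    #define BLOCK 256
    
    __global__ void blockSum(const float *in, float *out){
        __shared__ float buf[BLOCK];
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * BLOCK + tid];  // each thread loads one element
        __syncthreads();                          // wait until the whole tile is loaded
        for (int s = BLOCK / 2; s > 0; s >>= 1){  // tree reduction in shared memory
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();                      // partial sums ready before next step
        }
        if (tid == 0) out[blockIdx.x] = buf[0];   // one result per thread block
    }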

Source: blog.csdn.net/qq_42573343/article/details/105292402