CUDA Principles and Programming Fundamentals

Foreword:

CUDA stands for Compute Unified Device Architecture, a parallel computing architecture that NVIDIA brought to market in 2007. As the general-purpose computing engine of NVIDIA graphics processors, CUDA provides a full set of tools for GPGPU (General-Purpose computing on Graphics Processing Units) development on NVIDIA graphics cards.

1. GPU and CPU

[Figure: CPU vs. GPU architecture. Green marks the compute units, orange-red the memory units, and orange-yellow the control units.]

A GPU is a highly parallel, multi-threaded, many-core processor. The GPU's design devotes more transistors to data computation, rather than to the data caching and flow control that dominate a CPU.

2. GPU memory hierarchy

[Figure: GPU memory hierarchy]

2.1 Hardware perspective

Each thread has its own registers and local memory. All threads within the same thread block share one shared memory. All threads (including threads in different blocks) share one global memory, and different grids each have their own global memory.
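To make these levels concrete, here is a minimal sketch (the kernel name and the fixed block size of 256 are illustrative assumptions): a per-thread value lives in a register, a __shared__ array is visible to the whole block, and the input/output arrays live in global memory.

// Sketch of the memory levels; names are illustrative only.
__global__ void memoryLevels(const float* in, float* out, int n)
{
    __shared__ float tile[256];                     // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // 'i' lives in a per-thread register
    if (i < n)
        tile[threadIdx.x] = in[i];                  // global memory -> shared memory
    __syncthreads();                                // every thread in the block reaches this
    if (i < n)
        out[i] = tile[threadIdx.x] * 2.0f;          // shared memory -> global memory
}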

2.2 Software perspective

Device -> Grid -> a kernel launch; its size can be customized
Streaming Multiprocessor (SM) -> Block -> a thread block, composed of threads
Streaming Processor (SP) -> Thread -> a thread, the smallest execution unit
[Figure: the relationship among grids, blocks, and threads]
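To see the relationship in code, here is a minimal sketch (myKernel, width, height, and d_data are hypothetical names): grid and block dimensions are described with the dim3 type, and each kernel launch creates one grid.

// Sketch: a hypothetical 2D launch configuration (sizes are arbitrary).
dim3 threadsPerBlock(16, 16);                    // each block holds 16 x 16 = 256 threads
dim3 blocksPerGrid((width  + 15) / 16,           // enough blocks to cover the width...
                   (height + 15) / 16);          // ...and the height of the data
myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data);  // one grid of blocks of threads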

3. CUDA software system

[Figure: CUDA software stack]

CUDA libraries (CUDA Library)
CUDA Runtime API
CUDA Driver API

4. Asynchronous programming

[Figure: heterogeneous programming — serial code on the host, parallel kernels on the device]

The CUDA programming model assumes that CUDA threads execute on a physically separate device (the GPU), which operates as a coprocessor to the host running the program. For example, kernels execute on the GPU while the rest of the program executes on the CPU.
Tip: serial code runs on the host (CPU), while parallel code runs on the device (GPU).
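One consequence worth noting: a kernel launch is asynchronous, so control returns to the host immediately, and the host must synchronize before using the results. A minimal sketch (myKernel, blocks, threads, and d_data are placeholder names):

myKernel<<<blocks, threads>>>(d_data);  // queued on the GPU; the host does not wait
// ... host code here can run concurrently with the kernel ...
cudaDeviceSynchronize();                // block the host until the device has finished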

A simple vector-addition demo:
// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = …;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    ...
}

5. CUDA program steps

Part 1: allocate device memory from the host, and copy the input data from host memory into the allocated device memory.
Part 2: the kernel function on the device operates on the copied data and produces the result; the kernel is the function that runs on the GPU.
Part 3: copy the result from device memory back to the allocated host memory, then free the device memory (video memory) and the host memory.

6. Application Programming Interface

6.1 Function type qualifiers

6.1.1 __device__

A function declared with the __device__ qualifier:
executes on the device
can only be called from the device

6.1.2 __global__

A function declared with the __global__ qualifier:
executes on the device
can only be called from the host

6.1.3 __host__

A function declared with the __host__ qualifier:
executes on the host
can only be called from the host
The __host__ qualifier can also be combined with __device__.
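A minimal sketch of the three qualifiers together (all function names are illustrative):

__device__ float square(float x)          // runs on the GPU, callable only from GPU code
{
    return x * x;
}

__host__ __device__ float twice(float x)  // compiled for both the CPU and the GPU
{
    return 2.0f * x;
}

__global__ void transform(float* v, int n)  // runs on the GPU, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = square(twice(v[i]));
}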

6.1.4 Restrictions on qualifiers

[Table: restrictions on function type qualifiers]

6.2 Variable type qualifiers

__device__: declares a variable that resides in global memory on the device
__constant__: declares a variable that resides in constant memory on the device
__shared__: declares a variable that resides in the shared memory of a thread block
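A minimal sketch of the three variable qualifiers together (all names and the block size of 256 are illustrative):

__device__   float d_bias  = 0.5f;   // resides in device global memory
__constant__ float c_scale = 2.0f;   // resides in constant memory, read-only in kernels

__global__ void applyScale(float* v, int n)
{
    __shared__ float tile[256];      // resides in this block's shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = v[i];
        v[i] = tile[threadIdx.x] * c_scale + d_bias;
    }
}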

6.3 Execution configuration

[Figure: execution configuration]
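The execution configuration is the <<< >>> clause of a kernel launch; it takes up to four arguments, the last two optional. A minimal sketch (kernel, args, and stream are placeholder names):

// <<< gridDim, blockDim, dynamicSharedMemBytes, stream >>>
kernel<<<blocksPerGrid, threadsPerBlock>>>(args);                       // common two-argument form
kernel<<<blocksPerGrid, threadsPerBlock, 256 * sizeof(float)>>>(args);  // + dynamic shared memory
kernel<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(args);            // + a specific CUDA stream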

6.4 Built-in variables

6.4.1 gridDim

The dimensions of the grid; the data type is dim3 (an integer vector type, based on uint3, used to specify dimensions)

6.4.2 blockIdx

The index of the thread block within the grid; the data type is uint3

6.4.3 blockDim

The dimensions of the thread block; the data type is dim3

6.4.4 threadIdx

The index of the thread within its block; the data type is uint3

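For example, combining the built-in variables gives every thread a unique 2D coordinate (a minimal sketch; index2D is an illustrative name, assuming a 2D launch configuration):

__global__ void index2D(float* m, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index from the built-ins
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (x < width && y < height)
        m[y * width + x] = 1.0f;                    // flatten (x, y) into a linear offset
}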

7. Common functions

1. cudaMalloc() – allocates device memory (video memory)
2. cudaMemcpy() – copies data between the device and the host
3. cudaFree() – frees device memory (releases video memory)
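All three return a cudaError_t status, so it is good practice to check it. A minimal sketch (h_buf and size are assumed to already exist):

float* d_buf = nullptr;
cudaError_t err = cudaMalloc(&d_buf, size);              // allocate device memory
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);  // host -> device copy
cudaFree(d_buf);                                         // release the device memory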

8. Kernel function thread block and thread index

[Figure: thread and block indexing]

tid = threadIdx.x + blockIdx.x * blockDim.x; // the starting index for the current thread
threadIdx.x is the thread's index within its block, blockIdx.x is the index of the thread block, and blockDim.x is the size of the thread block (the number of threads it contains).
tid += blockDim.x * gridDim.x;
After a thread finishes the task at its current index tid, we increment the index; the increment step is the total number of threads running in the grid, as the sketch below shows.
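Putting the two lines together gives the common grid-stride loop pattern. A minimal sketch (the kernel name is illustrative):

__global__ void vecAddStride(const float* a, const float* b, float* c, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;  // this thread's starting index
    while (tid < n) {
        c[tid] = a[tid] + b[tid];                     // handle the element at tid
        tid += blockDim.x * gridDim.x;                // step by the total thread count in the grid
    }
}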

———————————————————————————————————————————————————————

1. The process of calling a custom CUDA operator from PyTorch
1. First, write the CUDA operator and its corresponding launch function
2. Then, write a torch .cpp file that connects PyTorch to CUDA and wraps it with pybind11 (see the sketch after the link below)
3. Finally, compile and call it with PyTorch's cpp_extension library

https://godweiyang.com/2021/03/18/torch-cpp-cuda/
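A minimal sketch of step 2, assuming a hypothetical vec_add_launcher implemented in a separate .cu file (all names here are illustrative, not taken from the linked post):

// torch_ext.cpp — a sketch of the pybind11 binding file
#include <torch/extension.h>

// Declared here; implemented in a .cu file that launches the CUDA kernel.
void vec_add_launcher(torch::Tensor a, torch::Tensor b, torch::Tensor c, int n);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("vec_add", &vec_add_launcher, "vector add (CUDA)");  // exposed to Python
}

The extension can then be built with torch.utils.cpp_extension (for example via load or CUDAExtension) and called from Python like an ordinary function.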
