CUDA 11.8 programming learning

__global__ is used to define a kernel function. A kernel is executed on the GPU and is called from the CPU through the triple angle bracket syntax. It can take parameters but cannot have a return value. Some editors flag an error at the triple angle brackets; this has little impact, since nvcc still compiles the code correctly (the specific reason the error is reported is unclear).

// How a kernel function is declared
__global__ void kernel()
{
    // ... kernel body, executed on the GPU
}

int main()
{
    // Kernel call: launched from the CPU with the triple angle bracket syntax
    // <<<number of blocks, threads per block>>>
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();   // wait for the kernel to finish
    return 0;
}

__device__ is used to define device functions. A device function is executed on the GPU and called from GPU code; triple angle brackets are not needed. It is used like an ordinary function and can have parameters and a return value.
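
As a minimal sketch (the names square, square_kernel and out are just for illustration):

// A device function: runs on the GPU and is called from GPU code
__device__ int square(int x)
{
    return x * x;              // parameters and a return value are allowed
}

__global__ void square_kernel(int *out)
{
    // Called from a kernel like an ordinary function, no triple angle brackets
    out[threadIdx.x] = square(threadIdx.x);
}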

Calling relationship: __host__ can call __global__, __global__ can call __device__, and __device__ can call __device__.

With the __host__ __device__ double qualifier, a function is compiled for both the CPU and the GPU, so both the CPU and the GPU can call it.

A constexpr function can likewise become callable from both CPU and GPU (with the nvcc option --expt-relaxed-constexpr). GPU-specific calls such as printf and __syncthreads cannot be used inside a constexpr function.
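
A minimal sketch of both cases; twice and half are made-up names, and the constexpr variant assumes nvcc is run with --expt-relaxed-constexpr:

// Compiled for both the CPU and the GPU thanks to the double qualifier
__host__ __device__ float twice(float x)
{
    return 2.0f * x;
}

// With --expt-relaxed-constexpr this is also callable from device code;
// note that printf, __syncthreads and similar cannot appear inside it
constexpr float half(float x)
{
    return 0.5f * x;
}

__global__ void scale_kernel(float *out)
{
    out[0] = twice(half(8.0f));   // both functions called on the GPU
}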

The CUDA compiler performs multi-pass compilation.

The code is first handed to the host compiler (usually the system compiler, such as gcc or MSVC) to generate the instructions for the CPU part, then sent to the real GPU compiler to generate the GPU instructions, and finally the two are linked into the same file. It looks like a single compilation, but the code is in fact preprocessed several times.
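
One visible consequence of the multiple passes is the __CUDA_ARCH__ macro: it is only defined during the GPU pass, so the same source can tell which compiler is currently looking at it. A minimal sketch:

#include <cstdio>

__host__ __device__ void where_am_i()
{
#ifdef __CUDA_ARCH__
    // Seen only by the GPU pass; __CUDA_ARCH__ is e.g. 860 for sm_86
    printf("device code, arch %d\n", __CUDA_ARCH__);
#else
    // Seen only by the host compiler (gcc/MSVC) pass
    printf("host code\n");
#endif
}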

threadIdx.x Gets the index of the current thread within its block
blockDim.x Gets the number of threads per block
blockIdx.x Gets the index of the current block within the grid
gridDim.x Gets the number of blocks in the grid
These are declared in the header file #include "device_launch_parameters.h"

// thread < block < grid
unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x;   // global index of this thread
unsigned int tnum = blockDim.x * gridDim.x;                 // total number of threads
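
A common use of tid and tnum is the grid-stride loop, so the kernel works for any array length n; a minimal sketch (fill is a made-up name):

__global__ void fill(int *arr, int n)
{
    unsigned int tid  = blockDim.x * blockIdx.x + threadIdx.x;  // global thread index
    unsigned int tnum = blockDim.x * gridDim.x;                 // total number of threads
    // Grid-stride loop: this thread handles elements tid, tid + tnum, tid + 2*tnum, ...
    for (unsigned int i = tid; i < (unsigned int)n; i += tnum)
        arr[i] = i;
}

// Launch example: 32 blocks of 128 threads each
// fill<<<32, 128>>>(arr, n);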

The CPU and GPU each use separate memory. The CPU's memory is called host memory (main memory). The memory used by the GPU is called device memory; it sits on the graphics card itself and is faster, and is also known as video memory.

Both stack variables and malloc allocate memory on the CPU side, which the GPU cannot access.

cudaMalloc allocates device memory on the GPU, and cudaFree releases it at the end. Because the return value of cudaMalloc is used for the error code, the allocated pointer can only be passed back through a double pointer such as &pret.
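
A minimal sketch of the usual allocate / launch / copy back / free pattern, reusing the fill kernel sketched above:

#include <vector>

int main()
{
    const int n = 1024;
    int *pret = nullptr;                        // will receive a device pointer
    // The return value of cudaMalloc is the error code, so the allocated
    // address comes back through the &pret double pointer
    cudaMalloc(&pret, n * sizeof(int));

    fill<<<32, 128>>>(pret, n);                 // use the memory on the GPU

    std::vector<int> host(n);                   // copy the result back to host memory
    cudaMemcpy(host.data(), pret, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(pret);                             // release the device memory at the end
    return 0;
}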

Relatively new graphics cards support unified (managed) memory. Simply replace cudaMalloc with cudaMallocManaged; the memory is still released with cudaFree. An address allocated this way is the same whether used on the CPU or on the GPU.
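
The same example with unified memory; the pointer can then be dereferenced directly on the CPU (after synchronizing):

#include <cstdio>

int main()
{
    const int n = 1024;
    int *arr = nullptr;
    cudaMallocManaged(&arr, n * sizeof(int));   // managed (unified) memory
    fill<<<32, 128>>>(arr, n);
    cudaDeviceSynchronize();                    // wait before touching it on the CPU
    printf("%d\n", arr[42]);                    // the same address works on the CPU
    cudaFree(arr);                              // released with cudaFree as well
    return 0;
}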

The problem with atomic operations: they hurt performance.
An atomic operation guarantees that only one thread at a time can modify a given address; the next thread may only proceed after the previous one has finished, which is very inefficient.
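
For example, summing an array with atomicAdd is correct but serializes every update on the same address; a minimal sketch:

__global__ void sum_atomic(const int *arr, int n, int *result)
{
    unsigned int tid  = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int tnum = blockDim.x * gridDim.x;
    for (unsigned int i = tid; i < (unsigned int)n; i += tnum)
        // Only one thread at a time can update *result, so a single
        // heavily used atomic counter becomes a bottleneck
        atomicAdd(result, arr[i]);
}
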
SM (Streaming Multiprocessor) and block
A GPU is composed of multiple streaming multiprocessors (SMs), and each SM can process one or more blocks.
An SM is composed of multiple streaming processors (SPs), and each SP can handle one or more threads.
Each SM has its own shared memory, which is similar in nature to a CPU cache: small but fast compared with main memory, and used to cache temporary data.
Usually the number of blocks is larger than the number of SMs. In that case the NVIDIA driver schedules the submitted blocks across the SMs, just as an operating system schedules threads across multiple CPU cores.
One SM can run several blocks at the same time; the blocks then share that SM's shared memory (so the amount available to each block shrinks).
Each thread inside a block is in turn scheduled onto the SPs of the SM.
All threads in the same block share one storage space, namely shared memory. In CUDA, shared memory is created by defining a variable with the __shared__ qualifier.
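
A minimal sketch of __shared__: each block first reduces its own elements in shared memory, then issues one atomic per block instead of one per thread (assumes the kernel is launched with blockDim.x == 128):

__global__ void sum_shared(const int *arr, int n, int *result)
{
    __shared__ int local[128];                  // one copy per block
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    local[threadIdx.x] = (i < (unsigned int)n) ? arr[i] : 0;
    __syncthreads();                            // wait until the whole block has written

    // Tree reduction inside shared memory
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            local[threadIdx.x] += local[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(result, local[0]);            // one atomic per block
}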

Summary of GPU optimization:
1. Warp divergence: try to make all 32 threads of a warp take the same branch, otherwise both branches are executed;
2. Latency hiding: there must be enough threads (blockDim) so that the SM can schedule other warps while one is stalled waiting on memory;
3. Register spill: if the kernel uses many local variables (registers), blockDim should not be too large;
4. Shared memory: global memory is relatively slow; if data is needed several times, read it into shared memory first;
5. Coalesced access: read global memory sequentially (coalesced) into shared memory first, and let the high-bandwidth shared memory absorb any strided access;
6. Bank conflict: it is inefficient for multiple threads of the same warp to access shared-memory addresses that are equal modulo 32; this can be avoided by deliberately padding the array to a stride of 33 (see the sketch after this list).
NVIDIA's warp size is 32.
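
A minimal sketch of point 6, the classic padded tile in a shared-memory matrix transpose; the extra column changes the stride to 33 so that reading a column no longer hits the same bank 32 times (assumes an n×n matrix with n a multiple of 32 and a (32, 32) block):

__global__ void transpose_tile(const float *in, float *out, int n)
{
    // 33 instead of 32: the padding shifts each row into a different bank
    __shared__ float tile[32][33];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read from global memory
    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // strided access absorbed by shared memory
}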

GPU Compute Capability

Compute capability does not refer to the computing performance of the GPU.
NVIDIA introduced the concept of compute capability to identify the device's core architecture and the features and instructions supported by the GPU hardware, so it is also called the "SM version".
Compute capability is written as a major revision number X and a minor revision number Y. The major number identifies the core architecture, and the minor number identifies incremental updates to that architecture.
GPU computing capability query website: https://developer.nvidia.com/cuda-gpus#compute
I use an RTX 3060, whose compute capability is easily looked up as 8.6.
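
The compute capability can also be queried in code through cudaGetDeviceProperties; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // device 0
    // On an RTX 3060 this prints "... Compute Capability 8.6"
    printf("%s: Compute Capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
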
In VS2019, right-click the .cu file and, under CUDA C/C++ > Device, change Code Generation to compute_86,sm_86.


Origin blog.csdn.net/dyk4ever/article/details/127185400