CUDA: Grid, Block and Thread

CUDA is NVIDIA's parallel computing platform and programming model for high-performance computing on GPUs, programmed using C/C++ language extensions. A CUDA computation is organized in a three-level hierarchy: grid (Grid), thread block (Block), and thread (Thread). A grid contains one or more thread blocks, and each thread block contains many threads; both the grid and the block can be laid out in one, two, or three dimensions. Every thread executes the same kernel code, but each typically operates on different data.

In this hierarchy, Thread is the smallest unit of execution; Block is the next level up and groups a set of threads that can cooperate; Grid is the top-level organization, spanning all the blocks of one kernel launch.
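Inside a kernel, the built-in variables `threadIdx`, `blockIdx`, `blockDim`, and `gridDim` tell each thread its position in this hierarchy. As a minimal sketch (the kernel name `whoAmI` is illustrative, not from the original), a unique global index for a 1D launch can be computed like this:

```cuda
__global__ void whoAmI(int *out) {
    // Offset of this block times block width, plus position within the block,
    // gives a unique index across the whole grid.
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    out[globalIdx] = globalIdx;
}
```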

The setting principles are as follows:

  • The sizes of Grid and Block must respect the limits of the GPU hardware (for example, a maximum number of threads per block) and should match the scale of the problem.
  • The Block size should be a multiple of the warp size (32); round values such as 128, 256, or 512 let the GPU's thread scheduler run at full efficiency.
  • Each Block should be small enough that several blocks can reside on a streaming multiprocessor (SM) at once, yet large enough to hide scheduling and memory latency and sustain performance.
  • The total number of Threads should be large enough to keep the SMs fully occupied, while avoiding so many threads that thread-scheduling and memory-traffic overhead degrades performance.

To configure the grid and thread blocks for a kernel launch in CUDA, use code like the following:

dim3 gridSize(xGridSize, yGridSize, zGridSize);
dim3 blockSize(xBlockSize, yBlockSize, zBlockSize);
kernel<<<gridSize, blockSize>>>(arg1, arg2, ...);

Here, gridSize specifies the dimensions of the grid, blockSize specifies the dimensions of each thread block, and kernel is the name of the CUDA kernel function to launch.
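Unspecified dim3 components default to 1, so a 1D launch can be written simply as `kernel<<<numBlocks, 256>>>(…)`. For a 2D problem such as a matrix, both the grid and the block can be two-dimensional; a hedged sketch, where `matAdd`, `W`, and `H` are illustrative names not taken from the original:

```cuda
dim3 blockSize(16, 16);                       // 16 x 16 = 256 threads per block
dim3 gridSize((W + 15) / 16, (H + 15) / 16);  // enough blocks to cover W x H
matAdd<<<gridSize, blockSize>>>(a, b, c, W, H);
```

Inside such a kernel, each thread would compute its 2D coordinates as `x = blockIdx.x * blockDim.x + threadIdx.x` and `y = blockIdx.y * blockDim.y + threadIdx.y`, with a bounds check against W and H.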

For example, to add two arrays of size N element-wise, you can use the following code:

#include <cuda_runtime.h>

__global__ void add(int *a, int *b, int *c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    int N = 1000;
    int *a, *b, *c;
    // cudaMalloc takes a void**, so the pointer addresses are cast.
    cudaMalloc((void **)&a, N * sizeof(int));
    cudaMalloc((void **)&b, N * sizeof(int));
    cudaMalloc((void **)&c, N * sizeof(int));
    // initialize a and b (e.g., cudaMemcpy from host arrays)
    dim3 gridSize((N + 255) / 256, 1, 1);  // ceil(N / 256.0) blocks
    dim3 blockSize(256, 1, 1);
    add<<<gridSize, blockSize>>>(a, b, c, N);
    cudaDeviceSynchronize();
    // copy result back to host with cudaMemcpy, then free device memory
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return 0;
}

In this example, the grid size is ceil(N/256.0) thread blocks (4 blocks for N = 1000) and the thread block size is 256 threads. In the kernel add, each thread computes the sum of one pair of array elements and stores it in the result array c; the `if (i < N)` guard keeps the surplus threads in the last block (1024 threads are launched for 1000 elements) from writing out of bounds.
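Kernel launches are asynchronous and report errors lazily, so in practice it helps to check both the launch and the execution for errors. A minimal sketch (the launch line repeats the example above for context):

```cuda
add<<<gridSize, blockSize>>>(a, b, c, N);
cudaError_t err = cudaGetLastError();  // errors from the launch itself
if (err != cudaSuccess) {
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
}
err = cudaDeviceSynchronize();         // errors raised during kernel execution
if (err != cudaSuccess) {
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
}
```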

Origin blog.csdn.net/qq_39506862/article/details/130894277