CUDA C: Threads, Thread Blocks and Thread Grids

Related Reading

CUDA C: https://blog.csdn.net/weixin_45791458/category_12530616.html?spm=1001.2014.3001.5482


        This is my hundredth blog post, so I want to write about something a little different.

        When a kernel function is called on the host side, it is dispatched to the device for execution. The device then creates threads according to the execution configuration of the kernel call, and each of those threads executes the statements of the kernel function.

        CUDA provides a thread hierarchy to make organizing threads easier. From top to bottom it consists of thread grids, thread blocks and threads. All the threads started by one kernel launch are collectively called a thread grid, and all threads in the same grid share the same global memory space. A thread grid is composed of multiple thread blocks, and a thread block contains several threads. Threads in the same thread block can cooperate in the following two ways (a small sketch follows the list); threads in different thread blocks cannot cooperate.

  • Synchronization
  • Shared memory
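
        The sketch below illustrates this kind of cooperation within a block. It is not from the original post; the kernel name reverseInBlock, the macro TILE, and the assumption that the kernel is launched with exactly TILE threads per block are all illustrative. Each thread stages one element in shared memory, the block synchronizes with __syncthreads(), and only then does each thread read an element written by another thread of the same block.

// Sketch: threads of one block cooperate through shared memory and a barrier.
// The kernel name, TILE, and the launch assumptions are illustrative only;
// the kernel is assumed to be launched with TILE threads per block.
#define TILE 256

__global__ void reverseInBlock(float *data)
{
    __shared__ float tile[TILE];                      // visible to the whole block

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];                      // each thread stages one element

    __syncthreads();                                  // wait until every thread has written

    data[i] = tile[blockDim.x - 1 - threadIdx.x];     // safely read a neighbour's value
}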

        Threads distinguish themselves from one another through the following two preset variables that are available inside a kernel function. "Preset" means that CUDA assigns these two variables to every thread at runtime; based on them, different pieces of data can be assigned to different threads, as the short sketch after this list shows.

  • blockIdx (the index of the thread block in the thread grid)
  • threadIdx (the index of the thread in the thread block)
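        For instance, when a kernel is launched with a single one-dimensional block, threadIdx.x alone is enough to give each thread its own element. The kernel below is a minimal sketch for illustration only (addOne, d_a and the launch configuration are not part of the original post):

// Sketch: with a single block, threadIdx.x assigns one array element per thread.
__global__ void addOne(int *a)
{
    a[threadIdx.x] = a[threadIdx.x] + 1;   // thread i handles element i
}

// hypothetical launch: one block of 32 threads
// addOne<<<1, 32>>>(d_a);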

        These two variables are of the type uint3, a built-in CUDA structure containing three unsigned integers, as shown below.

// This definition is in the vector_types.h header
struct __device_builtin__ uint3
{
    unsigned int x, y, z;
};

typedef __device_builtin__ struct uint3 uint3;

        Given this definition, the members of these two variables can be accessed as follows.

blockIdx.x  // x component of the thread block index
blockIdx.y  // y component of the thread block index
blockIdx.z  // z component of the thread block index
threadIdx.x // x component of the thread index
threadIdx.y // y component of the thread index
threadIdx.z // z component of the thread index

        These two variables have three components because CUDA supports a hierarchy of up to three dimensions: the thread blocks within a thread grid can be laid out in up to three dimensions, and the threads within a thread block can be laid out in up to three dimensions. CUDA uses the following two preset variables to store the sizes of these dimensions (a short sketch that uses both of them follows the list).

  • blockDim (the dimensions of a thread block, measured in threads per dimension)
  • gridDim (the dimensions of the thread grid, measured in thread blocks per dimension)
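
        As a sketch of how these variables are typically combined (the kernel scaleArray and its parameters are hypothetical, not from the original post), blockDim and gridDim turn the per-block indices into a grid-wide index, and a grid-stride loop lets one kernel cover arrays larger than the total number of threads:

// Sketch: a grid-wide index built from blockIdx, blockDim and threadIdx,
// plus a grid-stride loop that uses gridDim to cover the whole array.
__global__ void scaleArray(float *a, float factor, int n)
{
    int stride = gridDim.x * blockDim.x;                        // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        a[i] *= factor;                                         // each thread handles i, i+stride, ...
}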

        These two preset variables are of the type dim3, a built-in CUDA structure that also contains three unsigned integers, as shown below.

// This definition is in the vector_types.h header
struct __device_builtin__ dim3
{
    unsigned int x, y, z;
#if defined(__cplusplus)
#if __cplusplus >= 201103L
    __host__ __device__ constexpr dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1) : x(vx), y(vy), z(vz) {}
    __host__ __device__ constexpr dim3(uint3 v) : x(v.x), y(v.y), z(v.z) {}
    __host__ __device__ constexpr operator uint3(void) const { return uint3{x, y, z}; }
#else
    __host__ __device__ dim3(unsigned int vx = 1, unsigned int vy = 1, unsigned int vz = 1) : x(vx), y(vy), z(vz) {}
    __host__ __device__ dim3(uint3 v) : x(v.x), y(v.y), z(v.z) {}
    __host__ __device__ operator uint3(void) const { uint3 t; t.x = x; t.y = y; t.z = z; return t; }
#endif
#endif /* __cplusplus */
};

typedef __device_builtin__ struct dim3 dim3;

        Given this definition, the members of these two variables can be accessed as follows.

blockDim.x // size of the thread block in the x dimension
blockDim.y // size of the thread block in the y dimension
blockDim.z // size of the thread block in the z dimension
gridDim.x  // size of the thread grid in the x dimension
gridDim.y  // size of the thread grid in the y dimension
gridDim.z  // size of the thread grid in the z dimension

        Typically, a thread grid is organized in two dimensions and a thread block in three. If fewer dimensions are used, the members of these dim3 variables that correspond to the unused dimensions are initialized to 1.

        Note that the four preset variables above can only be accessed inside a kernel function, that is, on the device side. On the host side, you define your own variables of type dim3 in order to launch the kernel; those host-side variables are not visible inside the kernel.
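        As a small host-side sketch (the sizes nx and ny, the 16 x 16 block and the kernel name someKernel2D are hypothetical, not from the original post), a two-dimensional launch configuration might look like this:

// Sketch: host-side dim3 variables for a 2D kernel launch.
int nx = 1024, ny = 768;                     // hypothetical problem size
dim3 block(16, 16);                          // 16 x 16 threads per block, z defaults to 1
dim3 grid((nx + block.x - 1) / block.x,      // enough blocks to cover nx columns
          (ny + block.y - 1) / block.y);     // enough blocks to cover ny rows
someKernel2D<<<grid, block>>>(/* ... */);    // grid and block exist only on the host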

        The following program demonstrates how to use these preset variables and how to define dim3 variables yourself.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void checkIndex(void) // kernel function: print this thread's preset variables
{
    printf("threadIdx:(%d, %d, %d)\n", threadIdx.x, threadIdx.y, threadIdx.z);
    printf("blockIdx:(%d, %d, %d)\n", blockIdx.x, blockIdx.y, blockIdx.z);

    printf("blockDim:(%d, %d, %d)\n", blockDim.x, blockDim.y, blockDim.z);
    printf("gridDim:(%d, %d, %d)\n", gridDim.x, gridDim.y, gridDim.z);

}

int main(int argc, char **argv)
{
    // define the number of data elements
    int nElem = 6;

    // define two dim3 variables, block and grid, for the kernel launch
    dim3 block(3); // note: the dim3 constructor is used to create the structure variable
    dim3 grid((nElem + block.x - 1) / block.x);

    // print the components of block and grid
    printf("grid.x %d grid.y %d grid.z %d\n", grid.x, grid.y, grid.z);
    printf("block.x %d block.y %d block.z %d\n", block.x, block.y, block.z);

    // launch the kernel with grid and block
    checkIndex<<<grid, block>>>();

    // reset the device
    cudaDeviceReset();

    return(0);
}

        Because device-side printf is only supported on the Fermi architecture (compute capability 2.0) and later, you need to specify sm_20 or a higher architecture when compiling, as shown below (by default, nvcc generates code for the lowest architecture it supports).

$nvcc -arch=sm_20 checkDimension.cu -o check
$./check

        The output of the program is shown below. 

grid.x 2 grid.y 1 grid.z 1
block.x 3 block.y 1 block.z 1
threadIdx:(0, 0, 0)
threadIdx:(1, 0, 0)
threadIdx:(2, 0, 0)
threadIdx:(0, 0, 0)
threadIdx:(1, 0, 0)
threadIdx:(2, 0, 0)
blockIdx:(0, 0, 0)
blockIdx:(0, 0, 0)
blockIdx:(0, 0, 0)
blockIdx:(1, 0, 0)
blockIdx:(1, 0, 0)
blockIdx:(1, 0, 0)
blockDim:(3, 1, 1)
blockDim:(3, 1, 1)
blockDim:(3, 1, 1)
blockDim:(3, 1, 1)
blockDim:(3, 1, 1)
blockDim:(3, 1, 1)
gridDim:(2, 1, 1)
gridDim:(2, 1, 1)
gridDim:(2, 1, 1)
gridDim:(2, 1, 1)
gridDim:(2, 1, 1)
gridDim:(2, 1, 1)

Written at the end: this is my 100th blog post. Looking back, it has only been 10 months since I wrote the first one, but blogging seems to have become a habit. I hope I can stick with it, keep working hard, and keep improving my skills!

Last but not least: I would like to thank my parents and classmate Xiao Li for their continued support and help!

Origin blog.csdn.net/weixin_45791458/article/details/135064050