3.5 CUDA Runtime API: Definition and Use of Kernel Functions

Foreword

Teacher Du has published a TensorRT high-performance deployment course built from scratch. I went through it before without taking notes and have since forgotten a lot, so this time I am working through it again and taking notes along the way.

This lesson covers the condensed CUDA tutorial on kernel functions.

The course outline can be seen in the mind map below.

[Mind map: course outline]

1. Kernel function

What you need to know about kernel functions:

  1. The kernel function is the core of CUDA programming

  2. CUDA C programs are written in xxx.cu files, and the .cu files are handed to nvcc, which recognizes the CUDA syntax and compiles them

    • nvcc is NVIDIA's C++ compiler, used to compile CUDA C programs
  3. __global__ marks a kernel function, which is called from the host.

  4. __device__ marks a device function, which is called from the device

  5. __host__ marks a host function, which is called from the host. __shared__ marks a variable as shared memory

    • A function can be both a device function and a host function if it is decorated with __device__ __host__ at the same time
  6. The host launches a kernel with: function<<<gridDim, blockDim, sharedMemorySize, stream>>>(args…)

    • stream is the stream introduced in the previous lesson and allows asynchronous control; sharedMemorySize is the size of the shared memory
    • gridDim and blockDim tell the kernel how many threads to start; both are built-in variables of type dim3
    • The total number of threads started is nthreads = gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z
    • Both gridDim and blockDim are constrained, and the limits can be queried through the runtime API or deviceQuery (see the sketch after this list); typical limits are gridDim (about 2.1 billion, 65535, 65535) and blockDim (1024, 1024, 64), with the extra constraint blockDim.x * blockDim.y * blockDim.z <= 1024
  7. Only functions decorated with __global__ can be launched with the <<<>>> syntax

  8. Kernel arguments are passed by value, not by reference; classes, structs, etc. can be passed. A kernel can be a template, and its return value must be void

  9. Kernel execution is asynchronous, i.e. the launch returns immediately

  10. The thread layout is controlled mainly through blockDim and gridDim

  11. Thread indices inside the kernel are accessed mainly through the built-in variables threadIdx, blockIdx, blockDim, and gridDim
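
As a small aside, the grid and block limits mentioned above can be read through the runtime API. The following is a minimal sketch (not part of the course code; it assumes device 0 is queried):

#include <cuda_runtime.h>
#include <stdio.h>

int main(){
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query the properties of device 0

    printf("maxGridSize        = (%d, %d, %d)\n",
        prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("maxThreadsDim      = (%d, %d, %d)\n",
        prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);
    return 0;
}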

We mentioned earlier that data on the host (the CPU) is copied to the device (the GPU). Why? The purpose, of course, is to use the GPU's high-performance parallel computing capability. So how do we use that data on the GPU to carry out a specific computation? This is where CUDA kernel functions (kernels) come in: we call them to perform the parallel computation.

The kernel is an important concept in CUDA programming: it is the function executed in parallel by threads on the device. A kernel function is declared with the __global__ qualifier, and when it is called, <<<grid, block>>> specifies how many threads will execute it. Each of those threads executes the kernel, and each thread is assigned a unique thread ID, which can be obtained inside the kernel through the built-in variable threadIdx.

Since the GPU works in a heterogeneous model, code running on the host must be distinguished from code running on the device. In CUDA this is done with function type qualifiers, of which there are three main ones:

  • __global__ marks a kernel function, which is executed on the device, called from the host, and must return void
  • __device__ marks a device function, which is executed on the device and can only be called from the device
  • __host__ marks a host function, which runs on the host and can only be called from the host
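
As a brief illustration of the three qualifiers (a sketch, not taken from the course code), they can be combined like this:

// __device__: runs on the device, callable only from device code
__device__ float square(float x){ return x * x; }

// __device__ __host__: compiled for both sides, callable from either
__device__ __host__ float add_one(float x){ return x + 1.0f; }

// __global__: a kernel, launched from the host, must return void
__global__ void square_kernel(const float* in, float* out, int n){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx < n) out[idx] = square(in[idx]);
}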

To understand kernel functions in depth, we need a clear picture of their thread hierarchy.

First of all, the GPU runs a large number of lightweight threads in parallel. When a kernel executes on the device, it actually launches many threads. All of the threads launched by one kernel form a grid; the threads of the same grid share the same global memory space, and the grid is the first level of the thread hierarchy. A grid is further divided into thread blocks (blocks), and each block contains many threads, which is the second level.

This two-level thread organization is shown in Figure 1-1, where both the grid and the blocks are 2-dim. What does 2-dim mean? It comes from the type of the grid and block variables: both are defined as dim3, which can be regarded as a struct with three unsigned integer members (x, y, z), each initialized to 1 by default.


Figure 1-1 Two-tier thread organization structure on Kernel (2-dim)

Therefore, grid and block can be flexibly defined as 1-dim, 2-dim or 3-dim structures; a 2-dim layout is the most common. For the thread organization in Figure 1-1, grid and block can be defined as follows:

dim3 grid(3, 2);
dim3 block(5, 3);
kernel_func<<<grid, block>>>(params, ...);

It is worth noting that when a kernel is called, the number of threads and their layout must be specified through the execution configuration <<<grid, block>>>. The launch above, for example, starts 3 × 2 = 6 blocks of 5 × 3 = 15 threads each, 90 threads in total.

Therefore, a thread needs two built-in variables, blockIdx and threadIdx, to be uniquely identified; both have three components (x, y, z). blockIdx gives the position of the thread's block within the grid, and threadIdx gives the position of the thread within its block. Thread (1,1) in Figure 1-1 satisfies:

threadIdx.x = 1
threadIdx.y = 1
blockIdx.x = 1
blockIdx.y = 1

Sometimes we also want to know the linear ID of a thread within its thread block (block). For that we also need the block's shape, which is obtained through the built-in variable blockDim; it gives the size of each dimension of the block. For a 2-dim block (Dx, Dy), thread (x, y) has the in-block ID x + y * Dx; for a 3-dim block (Dx, Dy, Dz), thread (x, y, z) has the in-block ID x + y * Dx + z * Dx * Dy. In addition, there is also the built-in variable gridDim, which gives the size of each dimension of the grid.
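
In code, the in-block ID described above could be computed with a small device helper like this (the function name is made up for illustration):

// Linear ID of a thread inside its own block. For a 2-dim block,
// threadIdx.z is 0 and the last term disappears.
__device__ int thread_id_in_block(){
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}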

If we want the global ID of the current thread among all threads in the grid, we need gridDim and blockDim together. Using Mr. Du's method, the corresponding global ID is easy to compute, as shown in the figure below:


Figure 1-2 Global ID calculation of threads in the grid

Inside the kernel, blockDim and gridDim can be thought of as shapes, and threadIdx and blockIdx as indices. A convenient way to remember the global index calculation is "multiply on the left, add on the right": walk the dimensions from slowest to fastest, multiplying the running position by the current dimension and then adding the current index. Once memorized, this rule applies no matter how many dimensions the tensor has. The thread's global index is usually mapped to a pointer offset, which is convenient for the operations that follow.

Let's illustrate with a simple example: a launch of two blocks with ten threads per block (gridDim.x = 2, blockDim.x = 10).


Figure 1-3 Example of global index calculation

Applying the left-multiply, right-add rule, idx = blockIdx.x * blockDim.x + threadIdx.x.
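
Written out in full, the left-multiply, right-add rule over all six dimensions could look like the following device helper (a sketch matching the pseudo code used later in kernel.cu; the function name is hypothetical):

// Flatten (blockIdx, threadIdx) into one global thread index by walking the
// dimensions from slowest (gridDim.z) to fastest (blockDim.x):
// position = position * dim + index at each step.
__device__ int global_thread_index(){
    int position = 0;
    position = position * gridDim.z  + blockIdx.z;
    position = position * gridDim.y  + blockIdx.y;
    position = position * gridDim.x  + blockIdx.x;
    position = position * blockDim.z + threadIdx.z;
    position = position * blockDim.y + threadIdx.y;
    position = position * blockDim.x + threadIdx.x;
    return position;
}

For the launch in Figure 1-3 only the x dimensions are larger than 1, so this collapses to exactly blockIdx.x * blockDim.x + threadIdx.x.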

The threads of a thread block (block) reside on the same **streaming multiprocessor (SM)**. Because a single SM has limited resources, the number of threads per block is limited; modern GPUs support up to 1024 threads per block.

When a kernel executes, it actually launches many threads. These threads are parallel logically, but not necessarily physically. A core component of the GPU hardware is the SM. When a kernel runs, the thread blocks of its grid are distributed to the SMs. A thread block is scheduled on exactly one SM, while one SM can generally schedule several thread blocks, depending on its capabilities. The blocks of one kernel are therefore spread across multiple SMs, so the grid is only the logical layer, and the SM is the physical layer where execution happens, as shown in Figure 1-4.

The basic execution unit of an SM is the warp, which contains 32 threads, and the number of warps an SM can run concurrently is limited. In short, grids and thread blocks are only logical divisions, and not all threads of a kernel are necessarily concurrent at the physical layer; different grid/block configurations of the same kernel can therefore give different performance. Also note that since the SM's basic execution unit is a 32-thread warp, the block size is generally set to a multiple of 32.


Figure 1-4 Logical and physical layers of CUDA programming
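
In practice this usually means choosing a fixed block size that is a multiple of 32 and deriving the grid size from the data size with a rounded-up division. A minimal sketch (the kernel and names here are hypothetical, not from the course code):

#include <cuda_runtime.h>

// A trivial kernel just to demonstrate the launch configuration.
__global__ void copy_kernel(const float* in, float* out, int n){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx < n) out[idx] = in[idx];        // guard: grid * block may exceed n
}

void launch_copy(const float* d_in, float* d_out, int n, cudaStream_t stream){
    int block = 256;                       // a multiple of 32, at most 1024
    int grid  = (n + block - 1) / block;   // round up so every element is covered
    copy_kernel<<<grid, block, 0, stream>>>(d_in, d_out, n);
}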

2. Kernel function case

The main.cpp sample code of the kernel function case is as follows:

#include <cuda_runtime.h>
#include <stdio.h>

#define checkRuntime(op)  __check_cuda_runtime((op), #op, __FILE__, __LINE__)

bool __check_cuda_runtime(cudaError_t code, const char* op, const char* file, int line){
    if(code != cudaSuccess){
        const char* err_name    = cudaGetErrorName(code);
        const char* err_message = cudaGetErrorString(code);
        printf("runtime error %s:%d  %s failed. \n  code = %s, message = %s\n", file, line, op, err_name, err_message);
        return false;
    }
    return true;
}

void test_print(const float* pdata, int ndata);   // implemented in kernel.cu

int main(){
    float* parray_host   = nullptr;
    float* parray_device = nullptr;
    int narray = 10;
    int array_bytes = sizeof(float) * narray;

    // allocate host and device memory
    parray_host = new float[narray];
    checkRuntime(cudaMalloc(&parray_device, array_bytes));

    // fill the host array and copy it to the device
    for(int i = 0; i < narray; ++i)
        parray_host[i] = i;
    checkRuntime(cudaMemcpy(parray_device, parray_host, array_bytes, cudaMemcpyHostToDevice));

    // launch the kernel (asynchronous), then wait so that the printf output becomes visible
    test_print(parray_device, narray);
    checkRuntime(cudaDeviceSynchronize());

    checkRuntime(cudaFree(parray_device));
    delete[] parray_host;
    return 0;
}

The kernel.cu sample code of the kernel function case is as follows:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void test_print_kernel(const float* pdata, int ndata){

    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    /*    dims                 indexs
        gridDim.z            blockIdx.z
        gridDim.y            blockIdx.y
        gridDim.x            blockIdx.x
        blockDim.z           threadIdx.z
        blockDim.y           threadIdx.y
        blockDim.x           threadIdx.x

        Pseudo code:
        position = 0
        for i in 6:
            position *= dims[i]
            position += indexs[i]
    */
    printf("Element[%d] = %f, threadIdx.x=%d, blockIdx.x=%d, blockDim.x=%d\n", idx, pdata[idx], threadIdx.x, blockIdx.x, blockDim.x);
}

void test_print(const float* pdata, int ndata){

    // <<<gridDim, blockDim, bytes_of_shared_memory, stream>>>
    test_print_kernel<<<1, ndata, 0, nullptr>>>(pdata, ndata);

    // After the kernel launch, read the error code with cudaPeekAtLastError to know whether anything went wrong.
    // Both cudaPeekAtLastError and cudaGetLastError return the error code.
    // cudaGetLastError returns the error code and clears it, so calling cudaGetLastError again returns success.
    // cudaPeekAtLastError returns the current error without clearing it, so calling cudaPeekAtLastError
    // or cudaGetLastError again still returns the same error.
    // CUDA errors propagate: if an error occurs here and is not cleared, every subsequent API call
    // will return this error and fail.
    cudaError_t code = cudaPeekAtLastError();
    if(code != cudaSuccess){
        const char* err_name    = cudaGetErrorName(code);
        const char* err_message = cudaGetErrorString(code);
        printf("kernel error %s:%d  test_print_kernel failed. \n  code = %s, message = %s\n", __FILE__, __LINE__, err_name, err_message);
    }
}
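
The two files can then be compiled and linked together with nvcc, for example with a command along these lines (file and output names as in this example):

nvcc main.cpp kernel.cu -o main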

The running effect is as follows:


Figure 2-1 Kernel case running effect

This example shows how to use kernel functions for parallel computing in CUDA.

test_print_kernel is a kernel function marked with the __global__ qualifier; it executes on the GPU and is called from the host. Its job is to print each element of the incoming data array together with information such as the thread index, block index, and block size. test_print is the host function responsible for launching the kernel; <<<1, ndata, 0, nullptr>>> is the launch syntax, where 1 is the number of blocks, ndata is the number of threads per block, 0 is the shared memory size, and nullptr means the default stream is used.

After the kernel launch, cudaPeekAtLastError is used to check whether an error occurred; if so, the error code and message are printed. It is worth noting that both cudaPeekAtLastError and cudaGetLastError return the error code, but cudaGetLastError returns the error and clears it, so calling cudaGetLastError again returns success, whereas cudaPeekAtLastError returns the current error without clearing it, so calling cudaPeekAtLastError or cudaGetLastError afterwards still returns the same error. CUDA errors propagate: if an error occurs here and is not cleared, the return value of every subsequent API call will be that error, and those calls will fail.
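
The difference between the two calls can be made concrete with a small sketch (not part of the course code; it assumes some kernel launch has just failed):

#include <cuda_runtime.h>
#include <stdio.h>

void inspect_error(){
    cudaError_t e1 = cudaPeekAtLastError();  // reads the pending error, does not clear it
    cudaError_t e2 = cudaPeekAtLastError();  // still the same error
    cudaError_t e3 = cudaGetLastError();     // reads the error and clears it
    cudaError_t e4 = cudaGetLastError();     // now cudaSuccess again
    printf("%s %s %s %s\n", cudaGetErrorName(e1), cudaGetErrorName(e2),
           cudaGetErrorName(e3), cudaGetErrorName(e4));
}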

Through this example you can see how to define and launch a kernel function and how to use the thread index, block index, block size and related information to perform parallel computation. In practice, more complex kernels can be written as needed for real computing tasks.

The knowledge points about kernel functions are summarized as follows (from Mr. Du):

  1. .cu files are generally used to write CUDA kernel functions
  2. With "*.cu": "cuda-cpp" configured in .vscode/settings.json, the code can be parsed correctly
  3. In the Makefile, .cu files are handed to nvcc for compilation
  4. .cu files can be written like ordinary .cpp files; CUDA C++ is a superset of C++ and compatible with all of its features
  5. Some new symbols and syntax are introduced in .cu files
  • The __global__ marker marks a kernel function
    • The caller must be the host
    • The return value must be void
    • For example: __global__ void kernel(const float* pdata, int ndata)
    • The kernel must be launched as kernel<<<gridDim, blockDim, bytesSharedMemorySize, stream>>>(pdata, ndata)
    • The launch parameter types are: <<<dim3 gridDim, dim3 blockDim, size_t bytesSharedMemorySize, cudaStream_t stream>>>
    • dim3 has a constructor with default arguments, roughly dim3(int x, int y = 1, int z = 1)
    • Therefore, when an int is passed directly, it effectively defines dim.x = value, dim.y = 1, dim.z = 1
    • gridDim, blockDim, bytesSharedMemory, and stream are the launch configuration parameters
    • If a stream is specified, the kernel is added to that stream and executed asynchronously
    • pdata and ndata are the kernel's call arguments
    • Call arguments must be passed by value, not by reference; arguments can be class types, etc.
      • The kernel launch is asynchronous regardless of whether stream is nullptr or not
    • Therefore, to see printf output from a kernel you must wait, e.g. with cudaDeviceSynchronize or cudaStreamSynchronize; otherwise the printed information will not appear
  • The __device__ marker, a function called by the device
    • The caller must be the device
  • The __host__ marker, a function called by the host
    • The caller must be the host
  • A function can also carry both markers, __device__ __host__, meaning it can serve as both a device and a host function
  • The __constant__ marker defines constant memory
  • The __shared__ marker defines shared memory
  6. The cudaPeekAtLastError/cudaGetLastError functions can be used to catch errors or exceptions from a kernel
  7. Calculation formula for the memory index:
position = 0
for i in range(6):
   position *= dims[i]
   position += indexs[i]
  8. builtin variables, i.e. built-in variables; Ctrl + left click jumps to their definition

    • They are accessible from all kernel functions, and their values are maintained and set by the executor
    • gridDim[x, y, z]: grid dimensions, the size of the thread layout, specified when the kernel is launched
    • blockDim[x, y, z]: block dimensions, the size of the thread layout, specified when the kernel is launched
    • blockIdx[x, y, z]: block index, bounded by gridDim; it is assigned by the executor for the currently running thread and is already set when accessed inside the kernel
    • threadIdx[x, y, z]: thread index, bounded by blockDim; it is assigned by the executor for the currently running thread and is already set when accessed inside the kernel
    • The Dim values are fixed and do not change after launch; they are the upper bound of the corresponding Idx values
    • Each has three components x, y, z; when flattening, they are taken in the order z, y, x
  9. On the concepts of thread, grid, block and threadIdx

  • First, we can loosely picture the GPU as a cube made up of many small cells, as shown in the figure below

[Figure: the GPU pictured as a cube of small cells (threads)]

  • Each small cell is a thread. For ease of discussion we only consider the 2D case, as shown below

[Figure: 2D thread layout with the yellow thread highlighted]

  • What we care about is the position of a particular thread, such as the yellow cell in the figure above
  • Its position in 2D is (blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y) = (1, 0, 1, 1)
  • If this 2D layout is flattened into 1D, the 1D position of this yellow thread is 13
  • The calculation is shown in the figure below
  • In general, however, to simplify the problem we only need the three quantities threadIdx.x, blockIdx.x and blockDim.x, so the formula for idx is as follows:
  • int idx = threadIdx.x + blockIdx.x * blockDim.x; the meaning is that to obtain a thread's 1D index, we first need to know which block it is in, and then which thread it is within that block

[Figure: calculation of the yellow thread's 1D index]

Summary

In this lesson we studied the kernel function, a function that is computed in parallel on the GPU and is declared with the __global__ qualifier. Kernel functions differ from ordinary functions: when calling one, <<<grid, block>>> specifies the number of threads the kernel will start, and each thread has a unique thread ID to identify it. The index calculation can be memorized with Mr. Du's left-multiply, right-add rule.

We also need some understanding of the thread structure: all of the threads started by a kernel form a grid, a grid contains many blocks, and a block contains many threads. The grid is only the logical layer; the SM (streaming multiprocessor) is the physical layer of execution. The basic execution unit of an SM is the warp, and each warp contains 32 threads.

Finally, we wrote a simple kernel function example to understand how a kernel is defined and launched and how threadIdx, blockIdx, blockDim and related information are used to perform parallel computation.

Origin: blog.csdn.net/qq_40672115/article/details/131616766