Foreword

These notes cover error detection while a CUDA program is running.

1. Error detection while a CUDA program is running

A header file for error detection

As with logging, error detection code is usually collected in a header file that wraps the calls to the error detection API. As seen with the basic CUDA runtime API, almost every CUDA API function returns a value of type cudaError_t (an enumeration, not a structure) that indicates whether the call succeeded; a successful call returns cudaSuccess. Based on that return value, a macro function can be defined in the header file to check whether a CUDA call has failed.

    #pragma once
    #include <stdio.h>
    #include <stdlib.h>  // exit

    // Wrap a CUDA runtime API call; on failure, print where and why, then exit.
    #define CHECK(call) \
    do \
    { \
        const cudaError_t error_code = call; \
        if (error_code != cudaSuccess) \
        { \
            printf("CUDA Error:\n"); \
            printf("  File:   %s\n", __FILE__); \
            printf("  Line:   %d\n", __LINE__); \
            printf("  Error code: %d\n", error_code); \
            printf("  Error text: %s\n", cudaGetErrorString(error_code)); \
            exit(1); \
        } \
    } while (0)

(1) #pragma once is a preprocessor directive that serves the same purpose as an #ifndef include guard, but is more concise.
(2) The argument of the CHECK macro is a call to a CUDA runtime API function.
(3) The compiler provides predefined macros and identifiers:

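    __FILE__ // name of the current source file
    __func__ // name of the current function (a predefined identifier, not a macro)
    __LINE__ // current line number
    __DATE__ // compilation date
    __TIME__ // compilation time
    __WORDSIZE // word size in bits (non-standard, provided by glibc headers); like the others, useful in warning and error messages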

(4) The macro also works without the do while statement, but that is unsafe in some cases, for example when the macro is used as the body of an if statement without braces. No semicolon follows while (0) in the macro definition because the caller writes one after CHECK(...), and that semicolon completes the statement.
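Note (4) can be made concrete with a small host-only sketch. The two macros, the counter, and the two functions below are hypothetical names written only to contrast a bare two-statement macro with one wrapped in do { ... } while (0); none of them belong to the CUDA API.

```cuda
#include <stdio.h>

// Hypothetical counter used to observe which statements actually ran.
static int g_reports = 0;

// Unsafe: expands to two separate statements.
#define REPORT_BARE(msg) printf("%s\n", msg); ++g_reports

// Safe: the do { ... } while (0) wrapper turns the body into one statement,
// and the semicolon written by the caller completes it.
#define REPORT_WRAPPED(msg) \
do \
{ \
    printf("%s\n", msg); \
    ++g_reports; \
} while (0)

// Returns the number of counted reports after one guarded use of the bare macro.
int count_with_bare(bool failed)
{
    g_reports = 0;
    if (failed)
        REPORT_BARE("failure");    // only printf is guarded; ++g_reports always runs
    return g_reports;
}

// Same as above, but with the wrapped macro.
int count_with_wrapped(bool failed)
{
    g_reports = 0;
    if (failed)
        REPORT_WRAPPED("failure"); // both statements stay under the if
    return g_reports;
}
```

With failed set to false, count_with_bare returns 1 because the increment escapes the if, while count_with_wrapped returns 0. A macro wrapped in plain braces would avoid that problem, but it would break an if/else chain, because the semicolon written by the caller would end the if before the else.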

Check the CUDA API functions at runtime

Include the header file in the source program. CUDA header files are generally named xxx.cuh. Then pass each CUDA runtime API call as the argument of the CHECK macro, for example:

    // Allocate device memory
    double *d_x,*d_y,*d_z;
    // printf("%p",d_x);
    CHECK(cudaMalloc((void **)&d_x,M));
    CHECK(cudaMalloc((void **)&d_y,M));
    CHECK(cudaMalloc((void **)&d_z,M));
    // Copy data from the host to the device
    CHECK(cudaMemcpy(d_x,h_x,M,cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y,h_y,M,cudaMemcpyHostToDevice));

    // Sum the arrays
    const int block_size = 128;
    const int grid_size = N/block_size;
    add<<<grid_size,block_size>>>(d_x,d_y,d_z);
    // Copy the result back. The correct flag is cudaMemcpyDeviceToHost; the
    // mismatched cudaMemcpyHostToDevice appears to be left in so that CHECK has an error to report
    CHECK(cudaMemcpy(h_z,d_z,M,cudaMemcpyHostToDevice));

If a call fails, CHECK prints the file name, line number, error code, and error text, then exits the program.
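For comparison, here is a minimal self-contained sketch of the same flow with the correct copy direction. It repeats the CHECK macro so that it stands alone. The kernel signature with an extra length argument, the rounded-up grid size, the fill values, and the helper name add_on_device are assumptions made for illustration, not the original program.

```cuda
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Same checking macro as in the header, repeated so the sketch stands alone.
#define CHECK(call) \
do \
{ \
    const cudaError_t error_code = call; \
    if (error_code != cudaSuccess) \
    { \
        printf("CUDA Error:\n"); \
        printf("  File:   %s\n", __FILE__); \
        printf("  Line:   %d\n", __LINE__); \
        printf("  Error code: %d\n", error_code); \
        printf("  Error text: %s\n", cudaGetErrorString(error_code)); \
        exit(1); \
    } \
} while (0)

// Assumed kernel: element-wise sum with a bounds guard.
__global__ void add(const double *x, const double *y, double *z, const int n)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];
}

// Hypothetical helper: sums two arrays of n elements on the device and
// returns true when every element matches the sum computed on the host.
bool add_on_device(const int n)
{
    const size_t bytes = sizeof(double) * n;
    double *h_x = (double *)malloc(bytes);
    double *h_y = (double *)malloc(bytes);
    double *h_z = (double *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.23; h_y[i] = 2.34; }

    // Every runtime API call is wrapped in CHECK.
    double *d_x, *d_y, *d_z;
    CHECK(cudaMalloc((void **)&d_x, bytes));
    CHECK(cudaMalloc((void **)&d_y, bytes));
    CHECK(cudaMalloc((void **)&d_z, bytes));
    CHECK(cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice));

    const int block_size = 128;
    const int grid_size = (n + block_size - 1) / block_size; // round up
    add<<<grid_size, block_size>>>(d_x, d_y, d_z, n);

    // Device-to-host copy with the matching direction flag.
    CHECK(cudaMemcpy(h_z, d_z, bytes, cudaMemcpyDeviceToHost));

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (fabs(h_z[i] - (h_x[i] + h_y[i])) > 1.0e-12) ok = false;

    CHECK(cudaFree(d_x));
    CHECK(cudaFree(d_y));
    CHECK(cudaFree(d_z));
    free(h_x);
    free(h_y);
    free(h_z);
    return ok;
}
```

Calling add_on_device(1000) from main should return true on a working device. If any wrapped call fails, the macro prints the location and error text and exits before the function returns.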

Check the CUDA kernel function at runtime

The method above cannot be applied directly to a kernel function, because a kernel launch has no return value. The CUDA runtime API provides a way to catch errors that occur in a kernel launch:
cudaGetLastError() returns the last error produced by a runtime call in the same host thread and resets the error state to cudaSuccess.
NOTE: This function may also return error codes from previous asynchronous launches.
Therefore, after checking the launch, also synchronize the host and device and check the result, for example:

    // Sum the arrays
    const int block_size = 1240;  // exceeds the limit of 1024 threads per block
    const int grid_size = N/block_size;
    add<<<grid_size,block_size>>>(d_x,d_y,d_z);
    // Check the kernel launch
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // optional here: the cudaMemcpy below synchronizes implicitly
    // Copy data from the device to the host. This transfer implicitly synchronizes the host and device, so cudaDeviceSynchronize is optional before it
    CHECK(cudaMemcpy(h_z,d_z,M,cudaMemcpyDeviceToHost));
    check(h_z,N);

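Synchronizing the host and device explicitly with cudaDeviceSynchronize() makes the reported error location more precise.
With block_size set to 1240, above the limit of 1024 threads per block, the launch fails and CHECK(cudaGetLastError()) reports an invalid configuration argument error.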
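The two-step check can also be sketched without exiting the program, which makes it possible to compare a valid launch with an invalid one. The kernel touch and the helper launch_status are hypothetical names. The sketch assumes a device whose limit is 1024 threads per block, as stated above.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: a single thread writes a flag so the launch does some work.
__global__ void touch(int *flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) *flag = 1;
}

// Hypothetical helper: launches the kernel with the given block size and
// returns the first error seen, or cudaSuccess if there was none.
cudaError_t launch_status(const int block_size)
{
    int *d_flag = NULL;
    cudaError_t err = cudaMalloc((void **)&d_flag, sizeof(int));
    if (err != cudaSuccess) return err;

    touch<<<1, block_size>>>(d_flag);

    // Step 1: errors detected at launch time, such as an invalid
    // configuration, are reported by cudaGetLastError().
    err = cudaGetLastError();

    // Step 2: errors raised while the kernel executes are reported only
    // after the host has synchronized with the device.
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    if (err != cudaSuccess)
        printf("Kernel check failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_flag);
    return err;
}
```

Under that assumption, launch_status(128) should return cudaSuccess and launch_status(1240) should return cudaErrorInvalidConfiguration. Returning the error code instead of calling exit is a design choice for this sketch only; inside an application, the CHECK macro remains the simpler option.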

CUDA-MEMCHECK tools

CUDA provides the CUDA-MEMCHECK tool set, run through the cuda-memcheck command, to check for memory errors.
Running cuda-memcheck -h shows the tools it offers: memcheck, racecheck, initcheck, and synccheck.

Summary

This post covered the basic methods of error checking in CUDA programs.

CUDA programming notes (3)
