3. Error handling and obtaining hardware information

3. CUDA error handler

A good CUDA programming habit is to wrap every call to a CUDA runtime API, such as cudaMalloc() or cudaMemcpy(), in an error handler. This makes it much easier to locate the source of an error.

Specifically, every CUDA runtime API call returns a cudaError_t (an enumeration type). You can compare the returned value against cudaSuccess to see whether the call succeeded, or inspect it to find out which error occurred.

__FILE__ and __LINE__ expand to the name of the current source file and the current line number. This is where the file name and line number in the following message come from:

ERROR: src/matmul_gpu_basic.cu:62, CODE:cudaErrorInvalidConfiguration, DETAIL:invalid configuration argument

Two macros are defined here: one checks CUDA runtime API calls and the other checks kernel launches. To check a kernel, call LAST_KERNEL_CHECK() after a synchronization, so that all preceding CUDA operations (including the kernel execution) have completed before the error state is inspected.

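The last error can be read with either cudaPeekAtLastError or cudaGetLastError. The difference is whether the error state is cleared: cudaPeekAtLastError leaves it in place, so later checks still see the error, while cudaGetLastError resets it to cudaSuccess.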

kernelFunction<<<numBlocks, numThreads>>>();
cudaError_t err1 = cudaPeekAtLastError();  // read only, the error state is not cleared
cudaError_t err2 = cudaGetLastError();  // read and clear the error state

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)             __cudaCheck(call, __FILE__, __LINE__)
#define LAST_KERNEL_CHECK()          __kernelCheck(__FILE__, __LINE__)
#define BLOCKSIZE 16

// Report a failed CUDA runtime API call together with its call site, then exit.
inline static void __cudaCheck(cudaError_t err, const char* file, const int line) {
    if (err != cudaSuccess) {
        printf("ERROR: %s:%d, ", file, line);
        printf("CODE:%s, DETAIL:%s\n", cudaGetErrorName(err), cudaGetErrorString(err));
        exit(1);
    }
}

inline static void __kernelCheck(const char* file, const int line) {
    /*
     * Error checking is very important when writing CUDA. The functions of the
     * CUDA runtime API all return a cudaError_t, but a kernel launch returns
     * nothing, so its error has to be fetched with cudaPeekAtLastError or
     * cudaGetLastError.
     */
    cudaError_t err = cudaPeekAtLastError();
    if (err != cudaSuccess) {
        printf("ERROR: %s:%d, ", file, line);
        printf("CODE:%s, DETAIL:%s\n", cudaGetErrorName(err), cudaGetErrorString(err));
        exit(1);
    }
}

3.1 Two error cases

EX1:

Here the matrix multiplication is launched with blockSize = 64, so a thread block has 64x64=4096 threads, which exceeds the limit of 1024 threads per block. Below is the difference between running without LAST_KERNEL_CHECK() and running with it.

Without the check, no error is reported.

matmul in cpu                  uses 4092.84 ms
matmul in GPU Warmup           uses 199.453 ms
matmul in GPU blockSize = 1    uses 13.1558 ms
matmul in GPU blockSize = 16   uses 13.0716 ms
matmul in GPU blockSize = 32   uses 13.0694 ms
matmul in GPU blockSize = 64   uses 2.00626 ms
res is different in 0, cpu: 260.89050293, gpu: 0.00000000
Matmul result is different

With the check added, an error is reported. The error cudaErrorInvalidConfiguration means that the launch configuration passed to the kernel is invalid. The launch configuration includes the number of thread blocks, the number of threads per block, and so on.

matmul in cpu                  uses 4115.42 ms
matmul in GPU Warmup           uses 201.464 ms
matmul in GPU blockSize = 1    uses 13.1182 ms
matmul in GPU blockSize = 16   uses 13.0607 ms
matmul in GPU blockSize = 32   uses 13.0602 ms
ERROR: src/matmul_gpu_basic.cu:69, CODE:cudaErrorInvalidConfiguration, DETAIL:invalid configuration argument

EX2:

    // Configure grid and block
    dim3 dimBlock(blockSize, blockSize);
    int gridDim = (width + blockSize - 1) / blockSize;
    dim3 dimGrid(gridDim, gridDim);

Suppose the configuration above is mistakenly written as:

    // Configure grid and block
    dim3 dimBlock(blockSize, blockSize);
    int gridDim = (width + blockSize - 1) / blockSize;
    dim3 dimGrid(gridDim);

matmul in cpu                  uses 4152.26 ms
matmul in GPU Warmup           uses 189.667 ms
matmul in GPU blockSize = 1    uses 2.92747 ms
matmul in GPU blockSize = 16   uses 2.85372 ms
matmul in GPU blockSize = 32   uses 2.86483 ms
res is different in 32768, cpu: 260.76977539, gpu: 0.00000000

No error is reported in this case. The grid now has only one dimension (gridDim x 1 blocks), so there are not enough blocks to cover the whole matrix. Part of the result is never computed, which is why the run is so much faster. In the future, whenever a CUDA program turns out to be unexpectedly fast, we should check whether the result was actually computed in full.

4. Obtaining the relevant hardware information

4.1 Why obtain hardware information

When programming in CUDA, it is very important to understand the hardware specifications, because they limit the parallelization strategies and optimization methods you can use.

*********************Architecture related**********************
Device id:                              7
Device name:                            NVIDIA GeForce RTX 3090
Device compute capability:              8.6
GPU global meory size:                  23.70GB
L2 cache size:                          6.00MB
Shared memory per block:                48.00KB
Shared memory per SM:                   100.00KB
Device clock rate:                      1.69GHz
Device memory clock rate:               9.75Ghz
Number of SM:                           82
Warp size:                              32
*********************Parameter related************************
Max block numbers:                      16
Max threads per block:                  1024
Max block dimension size:               1024:1024:64
Max grid dimension size:                2147483647:65535:65535

4.2 Code

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>

#define CUDA_CHECK(call)             __cudaCheck(call, __FILE__, __LINE__)
#define LAST_KERNEL_CHECK()          __kernelCheck(__FILE__, __LINE__)
#define LOG(...)                     __log_info(__VA_ARGS__)

#define BLOCKSIZE 16

// Report a failed CUDA runtime API call together with its call site, then exit.
static void __cudaCheck(cudaError_t err, const char* file, const int line) {
    if (err != cudaSuccess) {
        printf("ERROR: %s:%d, ", file, line);
        printf("CODE:%s, DETAIL:%s\n", cudaGetErrorName(err), cudaGetErrorString(err));
        exit(1);
    }
}

// Report an error left behind by a kernel launch, then exit.
static void __kernelCheck(const char* file, const int line) {
    cudaError_t err = cudaPeekAtLastError();
    if (err != cudaSuccess) {
        printf("ERROR: %s:%d, ", file, line);
        printf("CODE:%s, DETAIL:%s\n", cudaGetErrorName(err), cudaGetErrorString(err));
        exit(1);
    }
}

// Print a log line using variadic arguments. This is a recommended way to write logging.
static void __log_info(const char* format, ...) {
    char msg[1000];
    va_list args;
    va_start(args, format);

    vsnprintf(msg, sizeof(msg), format, args);

    fprintf(stdout, "%s\n", msg);
    va_end(args);
}

#include <stdio.h>
#include <cuda_runtime.h>
#include <string>

#include "utils.hpp"

int main() {
    int count = 0;
    int index = 0;
    CUDA_CHECK(cudaGetDeviceCount(&count));
    // Print the properties of every visible device.
    while (index < count) {
        CUDA_CHECK(cudaSetDevice(index));
        cudaDeviceProp prop;
        CUDA_CHECK(cudaGetDeviceProperties(&prop, index));
        LOG("%-40s",             "*********************Architecture related**********************");
        LOG("%-40s%d%s",         "Device id: ",                   index, "");
        LOG("%-40s%s%s",         "Device name: ",                 prop.name, "");
        LOG("%-40s%.1f%s",       "Device compute capability: ",   prop.major + (float)prop.minor / 10, "");
        LOG("%-40s%.2f%s",       "GPU global meory size: ",       (float)prop.totalGlobalMem / (1<<30), "GB");
        LOG("%-40s%.2f%s",       "L2 cache size: ",               (float)prop.l2CacheSize / (1<<20), "MB");
        LOG("%-40s%.2f%s",       "Shared memory per block: ",     (float)prop.sharedMemPerBlock / (1<<10), "KB");
        LOG("%-40s%.2f%s",       "Shared memory per SM: ",        (float)prop.sharedMemPerMultiprocessor / (1<<10), "KB");
        LOG("%-40s%.2f%s",       "Device clock rate: ",           prop.clockRate*1E-6, "GHz");
        LOG("%-40s%.2f%s",       "Device memory clock rate: ",    prop.memoryClockRate*1E-6, "Ghz");
        LOG("%-40s%d%s",         "Number of SM: ",                prop.multiProcessorCount, "");
        LOG("%-40s%d%s",         "Warp size: ",                   prop.warpSize, "");

        LOG("%-40s",             "*********************Parameter related************************");
        LOG("%-40s%d%s",         "Max block numbers: ",           prop.maxBlocksPerMultiProcessor, "");
        LOG("%-40s%d%s",         "Max threads per block: ",       prop.maxThreadsPerBlock, "");
        LOG("%-40s%d:%d:%d%s",   "Max block dimension size:",     prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2], "");
        LOG("%-40s%d:%d:%d%s",   "Max grid dimension size: ",     prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2], "");
        index ++;
        printf("\n");
    }
    return 0;
}