VS2019 CMake: developing a dual-platform (Windows & Linux) CUDA + cuDNN project

Foreword


To deploy AI models on Nvidia server GPUs you should use TensorRT, because cuDNN is much slower than TensorRT. See my recent article on AI model structure optimization on Nvidia GPUs:

Finally, as far as I know, the only engine that can fully exploit the performance of Tensor Cores is TensorRT. Some training and inference platforms (such as PyTorch) actually call cuDNN to run model inference, but in my experience, for a convolution with identical parameters the cuDNN implementation is more than 3x slower than the TensorRT implementation. I also tried implementing convolution on Tensor Cores myself, but it still ended up more than 20% slower than TensorRT.
Original link: https://blog.csdn.net/luoyu510183/article/details/117385131

 


My most recent project involves implementing some AI algorithms on the server side, using Nvidia GPUs for inference, mainly with CUDA, cuDNN and TensorRT. I am also developing a standalone CUDA engine to replace cuDNN, because cuDNN currently does not support a direct convolution algorithm; it only offers three algorithms: GEMM, FFT and WINOGRAD. In my tests so far, these three methods are not necessarily faster than direct convolution. In addition, cuDNN's TRUE_HALF_CONFIG does not compute faster than float, which is the opposite of what I observe with direct convolution.

First, let me explain how these components relate to each other:

CUDA is the basic toolkit for Nvidia GPU development. It includes not only nvcc, the compiler for .cu files, but also many useful libraries such as NPP, nvJPEG and NVBLAS.

cuDNN is a CUDA-based library of convolution algorithms. It contains many operators needed for neural-network computations, such as elementwise tensor + - * /, activation functions, convolution functions, and so on.

TensorRT is a model-oriented application library built on top of cuDNN. With it you do not need to understand how any particular convolution is implemented; you simply hand the model to TensorRT, let the corresponding parser analyze it, and then run inference.

In other words, the structure is as follows:

Model => TensorRT Parser => TensorRT cuDNN Wrapper => cuDNN ops => CUDA Kernel => GPU driver

This article is mainly intended as an introduction to configuring and building a basic dual-platform CUDA + cuDNN project.

Main text

Preparation

Install CUDA

The main tools needed are VS2019 and its CMake project template. This is a relatively advanced article and does not cover CMake basics; for those, please read my previous article.

There is no specific requirement on the VS version. I use CUDA 10.1, which requires a driver newer than r418. CUDA 10.2 also works; I have not tested versions after 11.0.

Download CUDA from the official website: https://developer.nvidia.com/Cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal

Note that this page offers the latest version. I generally like to use the latest version, but because of server deployment requirements I have to use 10.1, so historical versions need to be downloaded from the archive linked below:

URL: https://developer.nvidia.com/cuda-toolkit-archive

There is not much to say about the Windows installation; it simply lets you choose the components you need. One reminder: the GPU driver will be installed at the same time. The driver bundled with 10.1 is relatively old, so you can choose not to install the driver and only install the CUDA toolkit.

For the Linux installation, refer to https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html and be sure to follow the steps on the official website exactly; I recommend the deb (local) installation.
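After installing on either platform, you can double-check which driver and runtime versions your programs actually see. The following sketch is my own addition, not from the original article; cudaDriverGetVersion and cudaRuntimeGetVersion are standard CUDA runtime calls, and versions are encoded as 1000*major + 10*minor, so 10010 means CUDA 10.1:

#include <cstdio>
#include "cuda_runtime.h"

int main()
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);   // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVersion); // version of the CUDA runtime this program was built with
    printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVersion / 1000, (driverVersion % 100) / 10,
           runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    return 0;
}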

Install cuDNN

Address: https://developer.nvidia.com/rdp/cudnn-download

Here, again, I download an older version: 7.6.5.

I do not remember the exact Windows installation steps; the cuDNN package seems to just unpack into a folder and then requires setting environment variables. The most straightforward approach is to copy the unpacked folders directly into the CUDA folder, unless you need multiple cuDNN versions against the same CUDA version.

After unpacking there should be several folders such as include and lib; just copy them directly into the corresponding CUDA directory.

The Linux side is simpler: just install the two deb files directly. There is an ordering requirement, and they must not be installed out of order.

First comes the runtime library, then the development library; check whether the file name contains dev to tell them apart. Finally there is a samples-and-documentation package, which I usually do not install.
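Once cuDNN is installed, a quick sanity check is to print the version and confirm that the headers and the library you link against match. This check is my own addition, not part of the original article; cudnnGetVersion() and the CUDNN_VERSION macro are part of the standard cuDNN API:

#include <cstdio>
#include "cudnn.h"

int main()
{
    // CUDNN_VERSION: version of the headers this program was compiled against.
    // cudnnGetVersion(): version of the library actually loaded at run time.
    printf("cuDNN header version:  %d\n", CUDNN_VERSION);
    printf("cuDNN library version: %d\n", (int)cudnnGetVersion());
    return 0;
}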

CMake configuration

For basics such as how to create the project and how to add a Linux or WSL target, please refer to my previous article. Here I mainly describe how to set the CUDA link and include paths on Windows and Linux.

Please see the CMakeLists.txt below:

cmake_minimum_required(VERSION 3.18)

project(cudnnBug)
if (UNIX)
	set(CUDA_TOOLKIT_ROOT_DIR "/usr/local/cuda")
	set(CMAKE_CUDA_COMPILER "${CUDA_TOOLKIT_ROOT_DIR}/bin/nvcc") # the nvcc path must be set before the project() call below that enables the CUDA language
	set(CUDA_LIB_DIR "${CUDA_TOOLKIT_ROOT_DIR}/lib64")
elseif (WIN32)
    find_package(CUDA)
	set(CUDA_LIB_DIR "${CUDA_TOOLKIT_ROOT_DIR}/lib/x64")
endif()
set(CUDA_INCLUDE "${CUDA_TOOLKIT_ROOT_DIR}/include")

project(${PROJECT_NAME} LANGUAGES CXX CUDA) # enable CUDA; this is where CMake checks that the nvcc path is correct

FILE(GLOB SRCS "*.cpp" "*.cc" "*.cu") # collect all source files in the current directory into SRCS; replace GLOB with GLOB_RECURSE to also search subdirectories
FILE(GLOB INCS "*.h" "*.hpp" "*.cuh") # same as above, but for header files

add_executable(${PROJECT_NAME} ${INCS} ${SRCS})

set_property(TARGET ${PROJECT_NAME} PROPERTY CUDA_ARCHITECTURES 61 70 75) # important: these are the GPU architectures the generated code supports, e.g. to run on a 1080 Ti you need to add 61
# note: CUDA_ARCHITECTURES appears to be new in CMake 3.18; older CMake needs set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -gencode=arch=compute_70,code=sm_70")

target_include_directories(${PROJECT_NAME} PUBLIC ${CUDA_INCLUDE})

target_link_directories(${PROJECT_NAME} PUBLIC ${CUDA_LIB_DIR})
target_link_libraries(${PROJECT_NAME} PUBLIC cudart cudnn)

if(WIN32)
elseif(UNIX)
	target_link_libraries(${PROJECT_NAME} PUBLIC pthread)
endif()

The comments above cover the core points. One thing worth adding is how to look up the CUDA_ARCHITECTURES value for your graphics card.

URL here: https://developer.nvidia.com/cuda-GPUs
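If you prefer to query this value programmatically rather than look it up on that page, the sketch below (my own addition, not from the original article) uses the standard runtime call cudaGetDeviceProperties; compute capability 6.1 corresponds to CUDA_ARCHITECTURES 61, 7.5 to 75, and so on:

#include <cstdio>
#include "cuda_runtime.h"

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0)
    {
        printf("No CUDA device found\n");
        return -1;
    }
    for (int i = 0; i < count; i++)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major and prop.minor form the compute capability, e.g. 6.1 on a 1080 Ti.
        printf("Device %d: %s, compute capability %d.%d -> CUDA_ARCHITECTURES %d%d\n",
               i, prop.name, prop.major, prop.minor, prop.major, prop.minor);
    }
    return 0;
}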

This completes the CMake configuration on Windows and Linux.

Some test code

The following code is a cuDNN example built around a bug I once reported to Nvidia.

The code mainly demonstrates three convolution forms:

  1. First use cudnnConvolutionForward to perform the convolution, then use cudnnAddTensor to add the bias
  2. Use cudnnConvolutionBiasActivationForward to do the convolution and bias addition in one call, with CUDNN_ACTIVATION_IDENTITY as the activation
  3. Use cudnnConvolutionBiasActivationForward to do the convolution and bias addition in one call, with CUDNN_ACTIVATION_RELU as the activation

Normally, the results of 1 and 2 should be identical, and only 3 should show the effect of ReLU. When the bug occurs, both 2 and 3 show the ReLU effect. The bug appears when the dilation rate is greater than 3. After I reported it to Nvidia, the bug was fixed in cuDNN 8; it should still exist in cuDNN 7.

The project contains only a single file, cudnn_convbug.cu; the code is as follows:


#include "cuda_runtime.h"
#include "cudnn.h"
#include <stdio.h>
#include <vector>

#define CheckCUDNN(ret) \
{\
auto tmp = ret;\
if(tmp!=CUDNN_STATUS_SUCCESS)\
{\
  printf("CuDNN Error %d: %d",__LINE__,tmp);\
  return -1;\
}\
}

#define CheckCUDA(ret) \
{\
auto tmp=ret;\
if(tmp!=cudaSuccess)\
{\
  printf("CUDA Error %d: %d",__LINE__,tmp);\
  return -1;\
}\
}

static int cudnnBugTest(int dilation)
{
  constexpr int width = 28, height = 16;
  constexpr int bytesize = width * height * sizeof(float);
  int logsize = 100;
  
  cudnnHandle_t hcudnn = NULL;
  CheckCUDNN(cudnnCreate(&hcudnn));
  cudnnTensorDescriptor_t hinputtensor, houtputtensor, bias;
  cudnnCreateTensorDescriptor(&hinputtensor);
  cudnnCreateTensorDescriptor(&houtputtensor);
  cudnnCreateTensorDescriptor(&bias);
  cudnnSetTensor4dDescriptor(hinputtensor, cudnnTensorFormat_t::CUDNN_TENSOR_NCHW, cudnnDataType_t::CUDNN_DATA_FLOAT, 1, 1, height, width);
  cudnnSetTensor4dDescriptor(houtputtensor, cudnnTensorFormat_t::CUDNN_TENSOR_NCHW, cudnnDataType_t::CUDNN_DATA_FLOAT, 1, 3, height, width);
  cudnnSetTensor4dDescriptor(bias, cudnnTensorFormat_t::CUDNN_TENSOR_NCHW, cudnnDataType_t::CUDNN_DATA_FLOAT, 1, 3, 1, 1);
  float* src_dev, * tar_dev;
  float* src_h, * tar_h;
  float* bias_dev;
  float bias_h[] = { -100,-90,-80 };

  cudaMalloc(&src_dev, bytesize);
  cudaMalloc(&tar_dev, bytesize * 3);
  cudaMallocHost(&src_h, bytesize);
  cudaMallocHost(&tar_h, bytesize * 3);
  cudaMalloc(&bias_dev, 3 * sizeof(float));
  cudaMemcpy(bias_dev, bias_h, 3 * sizeof(float), cudaMemcpyKind::cudaMemcpyHostToDevice);
  for (int i = 0; i < height; i++)
  {
    for (int j = 0; j < width; j++)
    {
      *(src_h + i * width + j) = (float)(j);
    }
  }
  cudaMemcpy(src_dev, src_h, bytesize, cudaMemcpyKind::cudaMemcpyHostToDevice);
  float alpha(1), beta(0);

  cudnnConvolutionDescriptor_t hconv0;
  cudnnCreateConvolutionDescriptor(&hconv0);
  CheckCUDNN(cudnnSetConvolution2dDescriptor(hconv0, dilation, dilation, 1, 1, dilation, dilation, cudnnConvolutionMode_t::CUDNN_CROSS_CORRELATION, cudnnDataType_t::CUDNN_DATA_FLOAT));

  float* filter_h, * filter_d;
  constexpr int filtersizeb = 3 * 3 * 3 * sizeof(float);
  cudaMalloc(&filter_d, filtersizeb);
  cudaMallocHost(&filter_h, filtersizeb);
  for (int i = 0; i < 3 * 3 * 3; i++)
  {
    *(filter_h + i) = 1;
  }
  cudaMemcpy(filter_d, filter_h, filtersizeb, cudaMemcpyKind::cudaMemcpyHostToDevice);

  cudnnFilterDescriptor_t hfilter0;
  cudnnCreateFilterDescriptor(&hfilter0);
  CheckCUDNN(cudnnSetFilter4dDescriptor(hfilter0, cudnnDataType_t::CUDNN_DATA_FLOAT, cudnnTensorFormat_t::CUDNN_TENSOR_NCHW, 3, 1, 3, 3));
  size_t sizeInBytes = 0;
  void* workSpace = NULL;
  cudnnConvolutionFwdAlgo_t convalgo = cudnnConvolutionFwdAlgo_t::CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;
  CheckCUDNN(cudnnGetConvolutionForwardWorkspaceSize(hcudnn, hinputtensor, hfilter0, hconv0, houtputtensor, convalgo, &sizeInBytes));
  if (sizeInBytes)
  {
    cudaMalloc(&workSpace, sizeInBytes);
  }
  CheckCUDNN(cudnnConvolutionForward(hcudnn, &alpha, hinputtensor, src_dev, hfilter0, filter_d, hconv0
    , convalgo
    , workSpace, sizeInBytes, &beta, houtputtensor, tar_dev));
  cudnnAddTensor(hcudnn, &alpha, bias, bias_dev, &alpha, houtputtensor, tar_dev);
  std::vector<float> out0, out1, out2;
  out0.resize(width * height * 3);
  out1.resize(width * height * 3);
  out2.resize(width * height * 3);
  cudaMemcpy(out0.data(), tar_dev, bytesize * 3, cudaMemcpyKind::cudaMemcpyDeviceToHost); // out0 holds the correct result (convolution + bias, no activation)
  
  cudnnActivationDescriptor_t act0;
  cudnnCreateActivationDescriptor(&act0);
  CheckCUDNN(cudnnSetActivationDescriptor(act0, cudnnActivationMode_t::CUDNN_ACTIVATION_IDENTITY, cudnnNanPropagation_t::CUDNN_NOT_PROPAGATE_NAN, 0));
  CheckCUDNN(cudnnConvolutionBiasActivationForward(hcudnn, &alpha, hinputtensor, src_dev, hfilter0, filter_d, hconv0
    , convalgo
    , workSpace, sizeInBytes, &beta, houtputtensor, tar_dev, bias, bias_dev, act0, houtputtensor, tar_dev));
  cudaMemcpy(out1.data(), tar_dev, bytesize * 3, cudaMemcpyKind::cudaMemcpyDeviceToHost); // with the bug, out1 has been passed through ReLU even though the activation is IDENTITY

  CheckCUDNN(cudnnSetActivationDescriptor(act0, cudnnActivationMode_t::CUDNN_ACTIVATION_RELU, cudnnNanPropagation_t::CUDNN_NOT_PROPAGATE_NAN, 10));
  CheckCUDNN(cudnnConvolutionBiasActivationForward(hcudnn, &alpha, hinputtensor, src_dev, hfilter0, filter_d, hconv0
    , convalgo
    , workSpace, sizeInBytes, &beta, houtputtensor, tar_dev, bias, bias_dev, act0, houtputtensor, tar_dev));
  cudaMemcpy(out2.data(), tar_dev, bytesize * 3, cudaMemcpyKind::cudaMemcpyDeviceToHost); // out2 matches the CUDNN_ACTIVATION_IDENTITY result (both show the ReLU effect)

  printf("Method0\tMethod1\tMethod2\n");
  for (size_t i = 0; i < logsize; i++)
  {
    printf("%f\t%f\t%f\n", out0[i], out1[i], out2[i]);
  }
  // TODO: free the buffers from cudaMalloc and cudaMallocHost; omitted in the original since the test runs only once.
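  // A minimal cleanup sketch (my addition, not in the original code):
  cudnnDestroyActivationDescriptor(act0);
  cudnnDestroyFilterDescriptor(hfilter0);
  cudnnDestroyConvolutionDescriptor(hconv0);
  cudnnDestroyTensorDescriptor(hinputtensor);
  cudnnDestroyTensorDescriptor(houtputtensor);
  cudnnDestroyTensorDescriptor(bias);
  if (workSpace) cudaFree(workSpace);
  cudaFree(filter_d);
  cudaFreeHost(filter_h);
  cudaFree(bias_dev);
  cudaFree(src_dev);
  cudaFree(tar_dev);
  cudaFreeHost(src_h);
  cudaFreeHost(tar_h);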
  CheckCUDNN(cudnnDestroy(hcudnn));
  return 1;
}

int main(int ac,const char**as)
{
    int dilation = 5;
    if (ac>1)
    {
        dilation = atoi(as[1]);
    }
    cudnnBugTest(dilation);
    CheckCUDA(cudaDeviceReset());
    return 0;
}

The program above tests how the three convolution results change as the dilation rate is varied.

The three columns of output correspond to the three convolution methods described above. When a given dilation rate causes the first and second columns to differ, the bug has occurred.
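For example, with the CMakeLists.txt above the executable is named cudnnBug, so running ./cudnnBug 5 (or any dilation greater than 3) should reproduce the mismatch on cuDNN 7, while ./cudnnBug 1 should leave the first and second columns identical.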

Epilogue

This article does not explain CUDA syntax or the CUDA API, nor does it demonstrate launching a kernel; for that, I suggest going straight to the CUDA Samples.

Here are a few samples I recommend:

 

Origin: blog.csdn.net/luoyu510183/article/details/113471199