[TensorRT] Deploying the Yolov5 model with TensorRT (C++)

Source code:

  The project code is open source in my GitHub repository. My GitHub homepage is: GitHub
  Project code:

https://github.com/guojin-yan/Inference/blob/master/tensorrt/cpp_tensorrt_yolov5/cpp_tensorrt_yolov5.cpp

  NVIDIA TensorRT™ is an SDK for high-performance deep learning inference that delivers low latency and high throughput. For detailed installation instructions, refer to the following blog post: NVIDIA TensorRT Installation (Windows C++)

1. Basic steps for deploying a model with TensorRT

  A typical TensorRT deployment consists of the following steps: convert the ONNX model to an engine, read the local model, create the inference engine, create the inference context, create the GPU memory buffers, configure the input data, run model inference, and process the inference results.

1.1 Convert the ONNX model to an engine

  TensorRT supports several model file formats. However, as ONNX has matured, most frameworks now use it as an intermediate exchange format and model structures have become more uniform, so TensorRT's ongoing updates mainly target the conversion of ONNX models. TensorRT can read an engine file directly. An ONNX model, by contrast, needs a series of conversion settings and can only be used for inference after it has been converted to an engine, so model conversion has to happen before model inference. The project provides a conversion function:

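void onnx_to_engine(std::string onnx_file_path, std::string engine_file_path, int type) {

	// Builder: the core class used to create the config, network and engine objects.
	// It searches the available CUDA kernels for the fastest implementation.
	nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
	const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
	// TensorRT network definition that the parser fills in
	nvinfer1::INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
	// ONNX parser: parses the ONNX file and populates the TensorRT network
	nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, gLogger);
	// Parse the ONNX file; 2 is the log verbosity (ILogger::Severity::kWARNING)
	bool parsed = parser->parseFromFile(onnx_file_path.c_str(), 2);
	for (int i = 0; i < parser->getNbErrors(); ++i) {
		std::cout << "load error: " << parser->getError(i)->desc() << std::endl;
	}
	if (!parsed) {
		std::cerr << "could not parse onnx file: " << onnx_file_path << std::endl;
		parser->destroy();
		network->destroy();
		builder->destroy();
		return;
	}
	printf("TensorRT loaded the onnx model successfully!\n");

	// Create the builder configuration object
	nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
	// Set the maximum workspace size (16 MiB)
	config->setMaxWorkspaceSize(16 * (1 << 20));
	// Set the model precision: 1 = FP16, 2 = INT8, anything else = default (FP32).
	// Note: INT8 normally also needs a calibrator or explicit dynamic ranges,
	// which this function does not set.
	if (type == 1) {
		config->setFlag(nvinfer1::BuilderFlag::kFP16);
	}
	if (type == 2) {
		config->setFlag(nvinfer1::BuilderFlag::kINT8);
	}
	// Build the inference engine; the result is null if the build fails
	nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
	std::ofstream file_ptr(engine_file_path, std::ios::binary);
	if (!engine || !file_ptr) {
		std::cerr << "could not build the engine or open the plan output file" << std::endl;
	}
	else {
		// Save the inference engine to a local file
		std::cout << "try to save engine file now~~~" << std::endl;
		// Serialize the engine into a memory stream
		nvinfer1::IHostMemory* model_stream = engine->serialize();
		// Write the stream to the file
		file_ptr.write(reinterpret_cast<const char*>(model_stream->data()), model_stream->size());
		model_stream->destroy();
		std::cout << "convert onnx model to TensorRT engine model successfully!" << std::endl;
	}
	// Destroy the created objects (the parser before the network it refers to)
	if (engine) {
		engine->destroy();
	}
	parser->destroy();
	network->destroy();
	config->destroy();
	builder->destroy();
}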

1.2 Read local model

  Reading the local model means loading the engine binary file saved in the previous step into memory. Besides the model itself, this file stores configuration that is specific to the machine it was built on (such as the GPU model and the TensorRT version), so an engine file generally cannot be reused on a different computer.

std::ifstream file_ptr(model_path_engine, std::ios::binary);
size_t size = 0;
file_ptr.seekg(0, file_ptr.end);	// Move the read pointer 0 bytes from the end of the file
size = file_ptr.tellg();	// The read pointer position is now the file size in bytes
file_ptr.seekg(0, file_ptr.beg);	// Move the read pointer 0 bytes from the start of the file
char* model_stream = new char[size];
file_ptr.read(model_stream, size);
file_ptr.close();
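
  The snippet above does not check whether the file could be opened and leaves the buffer to be freed by hand. A slightly more defensive variant could look like the sketch below. The helper name read_binary_file is hypothetical and not part of the project; it returns an empty vector on failure and uses std::vector so that the buffer is released automatically.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the project): read a whole binary file,
// such as a serialized engine, into memory.
// Returns an empty vector if the file cannot be opened or read.
std::vector<char> read_binary_file(const std::string& path) {
    // std::ios::ate opens the file with the read pointer at the end.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) {
        return {};
    }
    // The position at the end of the file equals the file size in bytes.
    const std::streamsize size = file.tellg();
    if (size <= 0) {
        return {};
    }
    std::vector<char> buffer(static_cast<std::size_t>(size));
    file.seekg(0, std::ios::beg);
    if (!file.read(buffer.data(), size)) {
        return {};
    }
    return buffer;
}
```

  With such a helper, the data and length passed to the deserialization step would be buffer.data() and buffer.size().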

1.3 Create an inference engine

  First, initialize a logger object, which is required to create the deserialization engine (the runtime). Then create the deserialization engine, nvinfer1::IRuntime, whose main job is to deserialize a serialized engine. Finally, call its deserializeCudaEngine() method to create the inference engine. This step only needs the model data read in the previous step and its length.

// Logging interface
Logger logger;
// Deserialization engine (runtime)
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
// Inference engine
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(model_stream, size);

1.4 Create inference context

  The inference context here is similar to the inference request in OpenVINO; it is the object that later runs model inference.

nvinfer1::IExecutionContext* context = engine->createExecutionContext();

1.5 Create GPU memory buffer

  TensorRT runs model inference on an NVIDIA GPU, but the input data and the later result processing live in host memory. We therefore need to create GPU memory buffers to hold the input data and to receive the inference results.

// Array of GPU buffer pointers, one per input/output node
void** data_buffer = new void* [num_ionode];
// Allocate the GPU input buffer
int input_node_index = engine->getBindingIndex(input_node_name);
cudaMalloc(&(data_buffer[input_node_index]), input_data_length * sizeof(float));
// Allocate the GPU output buffer
int output_node_index = engine->getBindingIndex(output_node_name);
cudaMalloc(&(data_buffer[output_node_index]), output_data_length * sizeof(float));

1.6 Configure input data

  To configure the input data, you only need to call cudaMemcpyAsync() to copy the data into the model's input buffer in GPU memory. The data must first be preprocessed as the model requires, and the copy is issued on a CUDA stream.

// Create the CUDA stream
cudaStream_t stream;
cudaStreamCreate(&stream);
// BN_image is the preprocessed input blob (see section 2.6)
std::vector<float> input_data(input_data_length);
memcpy(input_data.data(), BN_image.ptr<float>(), input_data_length * sizeof(float));
// Copy the input data from host memory to GPU memory
cudaMemcpyAsync(data_buffer[input_node_index], input_data.data(), input_data_length * sizeof(float), cudaMemcpyHostToDevice, stream);

1.7 Model inference

context->enqueueV2(data_buffer, stream, nullptr);

1.8 Processing inference results

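  The results are processed in host memory, so the output data must first be copied from GPU memory back to host memory. Since the copy is issued asynchronously on the stream, synchronize the stream before reading the results.

float* result_array = new float[output_data_length];
cudaMemcpyAsync(result_array, data_buffer[output_node_index], output_data_length * sizeof(float), cudaMemcpyDeviceToHost, stream);
// Wait until inference and the copy have finished before reading result_array
cudaStreamSynchronize(stream);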

  The next step is to post-process the output according to the model. Different models need different processing methods.

2. Deploying the Yolov5 model with TensorRT

2.1 Create a new C++ project

  Right-click the solution, select Add New Project, add an empty C++ project, and name it cpp_tensorrt_yolov5. Inside the project, right-click the source files folder, select Add → New Item → C++ file (cpp), and add the file.
  Right-click the current project, open its property settings, and configure the TensorRT and OpenCV properties.

Set the include directories:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include
D:\Program Files\TensorRT-8.4.0.6\include
E:\OpenCV Source\opencv-4.5.5\build\include
E:\OpenCV Source\opencv-4.5.5\build\include\opencv2

Set the library directories:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\lib
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\lib\x64
D:\Program Files\TensorRT-8.4.0.6\lib
E:\OpenCV Source\opencv-4.5.5\build\x64\vc15\lib

Set the additional dependencies:

nvinfer.lib
nvinfer_plugin.lib
nvonnxparser.lib
nvparsers.lib
cublas.lib
cublasLt.lib
cuda.lib
cudadevrt.lib
cudart.lib
cudart_static.lib
cudnn.lib
cudnn64_8.lib
cudnn_adv_infer.lib
cudnn_adv_infer64_8.lib
cudnn_adv_train.lib
cudnn_adv_train64_8.lib
cudnn_cnn_infer.lib
cudnn_cnn_infer64_8.lib
cudnn_cnn_train.lib
cudnn_cnn_train64_8.lib
cudnn_ops_infer.lib
cudnn_ops_infer64_8.lib
cudnn_ops_train.lib
cudnn_ops_train64_8.lib
cufft.lib
cufftw.lib
curand.lib
cusolver.lib
cusolverMg.lib
cusparse.lib
nppc.lib
nppial.lib
nppicc.lib
nppidei.lib
nppif.lib
nppig.lib
nppim.lib
nppist.lib
nppisu.lib
nppitc.lib
npps.lib
nvblas.lib
nvjpeg.lib
nvml.lib
nvrtc.lib
OpenCL.lib
opencv_world455.lib

2.2 Define the relevant information of the yolov5 model

const char* model_path_onnx = "E:/Text_Model/yolov5/yolov5s.onnx";
const char* model_path_engine = "E:/Text_Model/yolov5/yolov5s.engine";
const char* image_path = "E:/Text_dataset/YOLOv5/0001.jpg";
std::string lable_path = "E:/Git_space/Al模型部署开发方式/model/yolov5/lable.txt";
const char* input_node_name = "images";
const char* output_node_name = "output";
int num_ionode = 2;
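
  The article does not show the format of the label file behind lable_path or how read_class_names() parses it. As a minimal sketch, assuming the file holds one class name per line, loading it could look like this. The function name load_class_names is hypothetical and not the project's implementation.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read class names from a text file.
// Assumption: one class name per line; empty lines are skipped.
// Returns an empty vector if the file cannot be opened.
std::vector<std::string> load_class_names(const std::string& path) {
    std::vector<std::string> names;
    std::ifstream file(path);
    std::string line;
    while (std::getline(file, line)) {
        // Files saved on Windows may end lines with "\r\n"; drop the '\r'.
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        if (!line.empty()) {
            names.push_back(line);
        }
    }
    return names;
}
```

  The index of a name in the returned vector then corresponds to the class index predicted by the model.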

2.3 Read local model information

std::ifstream file_ptr(model_path_engine, std::ios::binary);
if (!file_ptr.good()) {
	std::cerr << "Could not open the file; please check that it is available!" << std::endl;
}
size_t size = 0;
file_ptr.seekg(0, file_ptr.end);	// Move the read pointer 0 bytes from the end of the file
size = file_ptr.tellg();	// The read pointer position is now the file size in bytes
file_ptr.seekg(0, file_ptr.beg);	// Move the read pointer 0 bytes from the start of the file
char* model_stream = new char[size];
file_ptr.read(model_stream, size);
file_ptr.close();

2.4 Initialize the inference engine

  Here we initialize the deserialization engine and the inference engine, and create the context used for inference.

Logger logger;
// Deserialization engine (runtime)
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
// Inference engine
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(model_stream, size);
// Execution context
nvinfer1::IExecutionContext* context = engine->createExecutionContext();

2.5 Create GPU memory buffer

  The number of GPU memory buffers equals the number of input and output nodes of the model, so we set it from that count. The size of each buffer is computed from the node's dimensions; the batch dimension d[0] is skipped, which assumes a batch size of 1.

void** data_buffer = new void* [num_ionode];
// Allocate the GPU input buffer
int input_node_index = engine->getBindingIndex(input_node_name);
nvinfer1::Dims input_node_dim = engine->getBindingDimensions(input_node_index);
size_t input_data_length = input_node_dim.d[1] * input_node_dim.d[2] * input_node_dim.d[3];
cudaMalloc(&(data_buffer[input_node_index]), input_data_length * sizeof(float));
// Allocate the GPU output buffer
int output_node_index = engine->getBindingIndex(output_node_name);
nvinfer1::Dims output_node_dim = engine->getBindingDimensions(output_node_index);
size_t output_data_length = output_node_dim.d[1] * output_node_dim.d[2];
cudaMalloc(&(data_buffer[output_node_index]), output_data_length * sizeof(float));
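
  The two length computations above hard-code the number of dimensions of each node. A more general sketch multiplies every dimension, whatever the rank. To stay self-contained, the example uses a stand-in struct DimsLike that mirrors only the nbDims and d fields of nvinfer1::Dims; both DimsLike and element_count are hypothetical names, and the function is a template so that it would also accept the real Dims type.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for nvinfer1::Dims (assumption: only nbDims and d are mirrored).
struct DimsLike {
    std::int32_t nbDims;
    std::int32_t d[8];
};

// Hypothetical helper: number of elements in a tensor with these dimensions,
// including the batch dimension. Returns 0 if any dimension is negative,
// which is how a dynamic (not yet known) dimension is reported.
template <typename Dims>
std::size_t element_count(const Dims& dims) {
    std::size_t count = 1;
    for (int i = 0; i < dims.nbDims; ++i) {
        if (dims.d[i] < 0) {
            return 0;
        }
        count *= static_cast<std::size_t>(dims.d[i]);
    }
    return count;
}
```

  For a 1 x 3 x 640 x 640 input this gives 1228800 elements, the same value as the hand-written product when the batch size is 1.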

2.6 Configure model input

  First, the input image is preprocessed as the model requires: the image is copied onto a square black background, the B and R channels are swapped, and the result is scaled to the specified size and normalized. In OpenCV, the blobFromImage() method performs the channel swap, scaling, and normalization in a single call.

// Image preprocessing: pad the image to a square
cv::Mat image = cv::imread(image_path);
int max_side_length = std::max(image.cols, image.rows);
cv::Mat max_image = cv::Mat::zeros(cv::Size(max_side_length, max_side_length), CV_8UC3);
cv::Rect roi(0, 0, image.cols, image.rows);
image.copyTo(max_image(roi));
// Normalize the image and scale it to the model input size.
// cv::Size takes (width, height); for an NCHW input, d[3] is the width and d[2] the height.
cv::Size input_node_shape(input_node_dim.d[3], input_node_dim.d[2]);
cv::Mat BN_image = cv::dnn::blobFromImage(max_image, 1 / 255.0, input_node_shape, cv::Scalar(0, 0, 0), true, false);

  Next, create a CUDA stream and place the processed data in the input_data container; then call cudaMemcpyAsync() to transfer the input data to GPU memory.

// Create the CUDA stream
cudaStream_t stream;
cudaStreamCreate(&stream);
std::vector<float> input_data(input_data_length);
memcpy(input_data.data(), BN_image.ptr<float>(), input_data_length * sizeof(float));
// Copy the input data from host memory to GPU memory
cudaMemcpyAsync(data_buffer[input_node_index], input_data.data(), input_data_length * sizeof(float), cudaMemcpyHostToDevice, stream);

2.7 Model inference

context->enqueueV2(data_buffer, stream, nullptr);

2.8 Processing inference results

  First read the inference results: copy the output from GPU memory to host memory so that it can be processed further. Since the copy is issued asynchronously on the stream, synchronize the stream before reading the results.

float* result_array = new float[output_data_length];
cudaMemcpyAsync(result_array, data_buffer[output_node_index], output_data_length * sizeof(float), cudaMemcpyDeviceToHost, stream);
// Wait until inference and the copy have finished before reading result_array
cudaStreamSynchronize(stream);

  The next step is to process the data. The Yolov5 output holds 25200 × 85 values, in which every 85 consecutive values form one group describing one candidate box. This project provides a result processing class dedicated to Yolov5 output, so here we only need to call that class:

ResultYolov5 result;
result.factor = max_side_length / (float) input_node_dim.d[2];
result.read_class_names(lable_path);
cv::Mat result_image = result.yolov5_result(image, result_array);
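
  The implementation of ResultYolov5 is not shown in this article. As a rough sketch of what Yolov5 post-processing typically involves, the code below decodes each group into a box and then applies non-maximum suppression. It assumes each group is laid out as [cx, cy, w, h, objectness, class scores...] in model-input pixels; the names Detection, decode_rows, iou and nms are hypothetical and are not the project's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Detection {
    float x1, y1, x2, y2;  // box corners in original-image pixels
    float score;           // objectness * best class score
    int class_id;
};

// Decode rows of [cx, cy, w, h, objectness, class scores...] into boxes.
// factor maps model-input pixels back to the padded original image.
std::vector<Detection> decode_rows(const float* data, std::size_t num_rows,
                                   std::size_t row_len, float factor,
                                   float score_threshold) {
    std::vector<Detection> out;
    for (std::size_t r = 0; r < num_rows; ++r) {
        const float* row = data + r * row_len;
        const float* best = std::max_element(row + 5, row + row_len);
        const float score = row[4] * (*best);
        if (score < score_threshold) continue;
        const float cx = row[0] * factor, cy = row[1] * factor;
        const float w = row[2] * factor, h = row[3] * factor;
        out.push_back({cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score,
                       static_cast<int>(best - (row + 5))});
    }
    return out;
}

// Intersection over union of two boxes.
float iou(const Detection& a, const Detection& b) {
    const float iw = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const float ih = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const float inter = iw * ih;
    const float uni = (a.x2 - a.x1) * (a.y2 - a.y1) +
                      (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy non-maximum suppression: visit boxes from highest to lowest score
// and drop any box that overlaps an already kept box of the same class by
// more than iou_threshold.
std::vector<Detection> nms(std::vector<Detection> dets, float iou_threshold) {
    std::sort(dets.begin(), dets.end(),
              [](const Detection& a, const Detection& b) { return a.score > b.score; });
    std::vector<Detection> kept;
    for (const Detection& d : dets) {
        bool suppressed = false;
        for (const Detection& k : kept) {
            if (k.class_id == d.class_id && iou(k, d) > iou_threshold) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) kept.push_back(d);
    }
    return kept;
}
```

  With the variables from this article, a call could look like nms(decode_rows(result_array, 25200, 85, result.factor, 0.25f), 0.45f), where the two thresholds are example values only. Because the original image is pasted at the top-left corner of the square background, the scaled box coordinates apply directly to the original image.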

  The figure below shows our test results.
[Figure: detection results on the test image]

Origin blog.csdn.net/grape_yan/article/details/128550102