Deploying Deep Learning Models with C++

Foreword

When deploying large-scale deep learning applications, C++ may be a better choice than Python for meeting application requirements or squeezing out model performance. With that in mind, I am recording my recent C++ learning experience here. Thinking about why to learn C++ really starts from the end goal. The first reason is to improve model performance and meet the high-availability, high-concurrency, and low-latency requirements of production scenarios. Improving model performance requires an inference framework such as TensorRT, NCNN, or OpenVINO (this article uses TensorRT as the example). TensorRT does provide a Python API in versions 8.0 and above, but learning C++ is still worthwhile. In addition, C++ is also a natural choice for model compression work.

After choosing the inference framework, the next question is what model format the framework requires. This is where ONNX, the common intermediate format, comes in, so the ONNX model itself has to be learned: how to convert TensorFlow, PyTorch, or Keras models to ONNX, and which operators it supports.

The overall route, then, is: convert the TensorFlow or PyTorch model to ONNX, optimize the ONNX model, convert it into a TensorRT engine for further optimization, and finally run inference from C++. In short:

  1. ONNX model conversion and optimization
  2. TensorRT conversion and optimization
  3. C++ inference with the ONNX model and with TensorRT.

ONNX

For ONNX conversion, please refer to my git repository: onnx model conversion. It includes TensorFlow and PyTorch conversion demos, as well as the workflow for running inference with onnxruntime. The basic conversion process is covered there; going forward it still needs to be deepened, to understand which operators are supported, how to convert complex models, and even how to write custom operators.

C++

The C++ basics need to be learned first, and they are not a problem: read a book, watch some videos, and then work through an example of calling an ONNX model from C++ for inference. For the specific code, please refer to: ONNX C++. In addition, I found an excellent blogger; you can refer to his series of articles:

  1. https://blog.csdn.net/qq_34124780/article/details/114666312
  2. 2021.04.15 update: deploying the yolov5 model with OpenCV in C++ (2)
  3. 2021.09.02 update: deploying the yolov5 model with OpenCV in C++ (3)
  4. 2021.11.01: deploying the yolov5 6.0 model with OpenCV in C++ (4)
  5. 2022.07.25: deploying the yolov7 model with OpenCV in C++ (5)
  6. the accompanying code

The blog series above uses the DNN inference module that ships with OpenCV. When I have time I will compare the pros and cons of the various inference frameworks. YOLOX inference mainly consists of three steps: model loading, image preprocessing, and result postprocessing. First, a header file declares the following.
ONNX inference header file:

#ifndef yoloxmodel
#define yoloxmodel

#include <assert.h>
#include <onnxruntime_cxx_api.h>
#include <ctime>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/videoio.hpp>
#include <opencv2/highgui.hpp>

class yoloxmodelinference {
public:
	yoloxmodelinference(const wchar_t* onnx_model_path);
	float* predict_test(std::vector<float> input_tensor_values, int batch_size = 1);
	cv::Mat predict(cv::Mat& input_tensor, int batch_size = 1, int index = 0);
	std::vector<float> predict(std::vector<float>& input_data, int batch_size = 1, int index = 0);
private:
	Ort::Env env;
	Ort::Session session;
	Ort::AllocatorWithDefaultOptions allocator;
	std::vector<const char*> input_node_names;
	std::vector<const char*> output_node_names;
	std::vector<int64_t> input_node_dims;
	std::vector<int64_t> output_node_dims;
	std::size_t num_output_nodes;
	std::size_t num_input_nodes;
	const int netWidth = 640;
	const int netHeight = 640;
	const int strideSize = 3;   //stride size
	float boxThreshold = 0.25;
};
#endif // !yoloxmodel

DNN inference header file:

#pragma once
#include <iostream>
#include <opencv2/opencv.hpp>

struct Output {
	//class id
	int id;
	//confidence score
	float confidence;
	//bounding box
	cv::Rect box;
};

class YOLO {
public:
	YOLO() {}
	~YOLO() {}
	bool initModel(cv::dnn::Net& net, std::string& netPath, bool isCuda);
	std::vector<Output>& Detect(cv::Mat& image, cv::dnn::Net& net);

private:
	//network input shape
	const int netWidth = 640;   //ONNX input image width
	const int netHeight = 640;  //ONNX input image height
	const int strideSize = 3;   //stride size
	float boxThreshold = 0.25;
	float classThreshold = 0.25;
	float nmsThreshold = 0.45;
	float nmsScoreThreshold = boxThreshold * classThreshold;
};

The header file can be thought of as a "configuration file": it declares the functions and some fixed parameters.

Loading the model with DNN is very simple, as shown below. It is also clear from the code that DNN can use CUDA.

bool YOLO::initModel(Net& net, string& netPath, bool isCuda)
{
	try {
		net = readNet(netPath);
	}
	catch (const exception& e) {
		cout << e.what() << std::endl;
		return false;
	}
	//CUDA backend (only if OpenCV was built with CUDA support)
	//if (isCuda) {
	//	net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
	//	net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA_FP16);
	//}
	net.setPreferableBackend(DNN_BACKEND_DEFAULT);
	net.setPreferableTarget(DNN_TARGET_CPU);
	return true;
}
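
The Detect function body is not reproduced here, but a rough sketch of what it typically does for a YOLO-style ONNX model is shown below: preprocess with blobFromImage, run a forward pass, decode the boxes, and apply NMS. This is only a sketch under my own assumptions, not the blogger's code: it assumes a single output tensor of shape [1, num_boxes, 5 + num_classes] and reuses the Output struct and the thresholds declared in the header above.

// Sketch of a Detect-style function. Uses the Output struct and <opencv2/opencv.hpp>
// from the DNN header above; the output layout [1, num_boxes, 5 + num_classes] is an assumption.
std::vector<Output> DetectSketch(cv::Mat& image, cv::dnn::Net& net)
{
    const int netW = 640, netH = 640;
    // Preprocess: scale pixels to [0,1], resize to 640x640, swap BGR->RGB; the blob is NCHW.
    cv::Mat blob = cv::dnn::blobFromImage(image, 1.0 / 255.0, cv::Size(netW, netH),
                                          cv::Scalar(), true, false);
    net.setInput(blob);
    std::vector<cv::Mat> outs;
    net.forward(outs, net.getUnconnectedOutLayersNames());

    const int rows = outs[0].size[1];
    const int dims = outs[0].size[2];
    const float xFactor = (float)image.cols / netW;
    const float yFactor = (float)image.rows / netH;

    std::vector<int> ids;
    std::vector<float> confs;
    std::vector<cv::Rect> boxes;
    const float* data = (const float*)outs[0].data;
    for (int i = 0; i < rows; ++i, data += dims) {
        float objConf = data[4];
        if (objConf < 0.25f) continue;                         // boxThreshold
        cv::Mat scores(1, dims - 5, CV_32FC1, (void*)(data + 5));
        cv::Point classId; double maxScore;
        cv::minMaxLoc(scores, nullptr, &maxScore, nullptr, &classId);
        float conf = objConf * (float)maxScore;
        if (conf < 0.25f * 0.25f) continue;                    // nmsScoreThreshold
        float cx = data[0], cy = data[1], w = data[2], h = data[3];
        boxes.emplace_back((int)((cx - w / 2) * xFactor), (int)((cy - h / 2) * yFactor),
                           (int)(w * xFactor), (int)(h * yFactor));
        confs.push_back(conf);
        ids.push_back(classId.x);
    }
    // Postprocess: non-maximum suppression over the decoded boxes.
    std::vector<int> keep;
    cv::dnn::NMSBoxes(boxes, confs, 0.25f * 0.25f, 0.45f, keep);
    std::vector<Output> result;
    for (int idx : keep) result.push_back({ ids[idx], confs[idx], boxes[idx] });
    return result;
}

The decoding loop is where different YOLO versions differ the most, so it has to be adapted to the exact model export.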

ONNX Runtime is a little more troublesome. To be fair, the "trouble" is really because I have not studied its API seriously yet:

yoloxmodelinference::yoloxmodelinference(const wchar_t* onnx_model_path) : session(nullptr), env(nullptr) {
    //Initialize the environment. One environment per process; it holds the thread pool and other state.
    this->env = Ort::Env(ORT_LOGGING_LEVEL_WARNING, "yolox");
    //Initialize the session options
    Ort::SessionOptions session_options;
    session_options.SetInterOpNumThreads(1);
    session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    //Create the session and load the model into memory
    this->session = Ort::Session(env, onnx_model_path, session_options);
    //Number and names of the input and output nodes
    this->num_input_nodes = session.GetInputCount();
    this->num_output_nodes = session.GetOutputCount();
    for (int i = 0; i < this->num_input_nodes; i++)
    {
        auto input_node_name = session.GetInputName(i, allocator);
        this->input_node_names.push_back(input_node_name);
        Ort::TypeInfo type_info = session.GetInputTypeInfo(i);
        auto tensor_info = type_info.GetTensorTypeAndShapeInfo();
        ONNXTensorElementDataType type = tensor_info.GetElementType();
        this->input_node_dims = tensor_info.GetShape();
    }
    for (int i = 0; i < this->num_output_nodes; i++)
    {
        auto output_node_name = session.GetOutputName(i, allocator);
        this->output_node_names.push_back(output_node_name);
    }
}
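
With the session set up, one of the predict overloads can then be sketched roughly as below: wrap the preprocessed float data in an Ort::Value tensor and call session.Run. This is a minimal sketch of my own, assuming a fixed NCHW input of batch_size x 3 x 640 x 640; the member names (session, input_node_names, output_node_names, netWidth, netHeight) are the ones declared in the header above.

std::vector<float> yoloxmodelinference::predict(std::vector<float>& input_data, int batch_size, int index)
{
    //Shape of the input tensor: NCHW, assuming a fixed-size 3-channel 640x640 input.
    std::vector<int64_t> input_shape = { batch_size, 3, netHeight, netWidth };

    //Wrap the already-preprocessed float buffer in an ORT tensor (no copy is made).
    Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(
        memory_info, input_data.data(), input_data.size(),
        input_shape.data(), input_shape.size());

    //Run the session with the node names collected in the constructor.
    auto output_tensors = session.Run(Ort::RunOptions{ nullptr },
                                      input_node_names.data(), &input_tensor, 1,
                                      output_node_names.data(), output_node_names.size());

    //Copy the requested output tensor into a std::vector<float> for postprocessing.
    float* raw_output = output_tensors[index].GetTensorMutableData<float>();
    size_t output_size = output_tensors[index].GetTensorTypeAndShapeInfo().GetElementCount();
    return std::vector<float>(raw_output, raw_output + output_size);
}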

Next is image preprocessing. My current reference is the blogger's code above, but there is a catch: the ONNX model the blogger uses was converted from a PyTorch YOLO model, while my YOLO model comes from TensorFlow. The difference is the tensor layout: the former expects [batch, channel, height, width] (NCHW), while the TensorFlow model expects [batch, height, width, channel] (NHWC). This is one of the problems I still need to solve. From what I found when searching, for example change-blobfromimage-dimensions-order, OpenCV's DNN module is not very friendly to the TensorFlow layout. To sidestep the problem for now, I downloaded the official PyTorch-converted ONNX model from GitHub.
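
If the NHWC (TensorFlow-style) model does have to be used with ONNX Runtime, one workaround is to skip blobFromImage and build the NHWC float buffer by hand, since an OpenCV image is already stored in interleaved H x W x C order. This is just a sketch of the idea, not something taken from the blog:

#include <cstring>
#include <vector>
#include <opencv2/opencv.hpp>

// Sketch: building an NHWC float tensor manually for an NHWC (TensorFlow-style) ONNX model.
// blobFromImage always produces NCHW, so here the pixels are copied in HWC order instead.
std::vector<float> makeNHWCInput(const cv::Mat& image, int netWidth = 640, int netHeight = 640)
{
    cv::Mat resized, rgb;
    cv::resize(image, resized, cv::Size(netWidth, netHeight));
    cv::cvtColor(resized, rgb, cv::COLOR_BGR2RGB);
    rgb.convertTo(rgb, CV_32FC3, 1.0 / 255.0);   // scale pixels to [0,1]

    // A freshly allocated cv::Mat is continuous and stores pixels as H x W x C,
    // which is exactly NHWC for batch size 1.
    std::vector<float> tensor(rgb.total() * rgb.channels());
    std::memcpy(tensor.data(), rgb.ptr<float>(0), tensor.size() * sizeof(float));
    return tensor;   // logical shape: {1, netHeight, netWidth, 3}
}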

TensorRT

TensorRT supports three ways of importing network structures and parameters:

  1. TF-TRT, which requires a TensorFlow model. The API is integrated into the TensorFlow framework, so the development cost is minimal.
  2. The ONNX model format. This route has a relatively short development cycle and is very user-friendly.
  3. Building the model manually with the TensorRT API, which is a hardcore, expert-level undertaking.

To sum up, the second option is the most economical. An ONNX model by itself is not yet optimized, so it still has to be optimized by TensorRT with the appropriate parameters to obtain a TensorRT engine, which is then used for inference.

(1) Using trtexec.exe
trtexec is one of the TensorRT samples; it packs many TensorRT capabilities into a single executable. It can optimize a model into a TensorRT engine and feed it random inputs to benchmark inference speed. The command ./trtexec --onnx=model.onnx optimizes the ONNX model into an engine, runs inference multiple times, and reports timing statistics. It can also convert the ONNX model into a serialized TensorRT engine in .trt format: ./trtexec --onnx=model.onnx --saveEngine=xxx.trt. The purpose of trtexec is to see how fast the model can run; it does not care about accuracy. If you really want to deploy a model that is both fast and accurate, you still have to work with the TensorRT API yourself.
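
Once trtexec has saved a .trt file, loading it from C++ with the TensorRT runtime looks roughly like the sketch below. It is a minimal sketch assuming TensorRT 8.x; engine.trt is a placeholder file name and error handling is kept to a bare minimum.

#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};

int main() {
    // Read the serialized engine produced by: ./trtexec --onnx=model.onnx --saveEngine=engine.trt
    std::ifstream file("engine.trt", std::ios::binary);
    std::vector<char> engineData((std::istreambuf_iterator<char>(file)),
                                 std::istreambuf_iterator<char>());

    Logger logger;
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(engineData.data(), engineData.size());
    if (!engine) {
        std::cerr << "Failed to deserialize engine" << std::endl;
        return 1;
    }
    // The execution context is what actually runs inference with device-side buffers.
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    std::cout << "Engine loaded, bindings: " << engine->getNbBindings() << std::endl;

    // Clean up (TensorRT 8 still supports destroy(); newer releases prefer delete).
    context->destroy();
    engine->destroy();
    runtime->destroy();
    return 0;
}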

Generally speaking, the models you write use float32 operations throughout. TensorRT enables the TF32 data format by default: a truncated version of FP32 with only 19 bits, keeping the precision of FP16 (10-bit mantissa) and the exponent range of FP32 (8-bit exponent). In addition, TensorRT can be told to convert the computations in the model to float16 or int8; you can enable either one or both, and trtexec will pick whichever is fastest for each layer (some network modules do not support int8). The --best parameter is equivalent to enabling --int8 and --fp16 at the same time.
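
When the engine is built through the C++ API instead of trtexec, the same precision choices are expressed as builder-config flags. A minimal sketch, assuming TensorRT 8.x with the ONNX parser, model.onnx as a placeholder path, and a Logger like the one in the previous snippet:

#include <NvInfer.h>
#include <NvOnnxParser.h>

// Sketch: building an engine from ONNX with FP16/INT8 enabled where the hardware supports them.
nvinfer1::IHostMemory* buildEngine(nvinfer1::ILogger& logger)
{
    auto builder = nvinfer1::createInferBuilder(logger);
    auto network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));
    auto parser = nvonnxparser::createParser(*network, logger);
    parser->parseFromFile("model.onnx",
                          static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    auto config = builder->createBuilderConfig();
    if (builder->platformHasFastFp16())
        config->setFlag(nvinfer1::BuilderFlag::kFP16);   // roughly what --fp16 does
    if (builder->platformHasFastInt8())
        config->setFlag(nvinfer1::BuilderFlag::kINT8);   // roughly what --int8 does
    // An INT8 calibrator would be attached here with config->setInt8Calibrator(...).

    // Returns a serialized engine that can be written to disk (like --saveEngine).
    return builder->buildSerializedNetwork(*network, *config);
}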

One thing to note is that int8 optimization involves model quantization, which requires calibration to preserve accuracy. TensorRT supports two quantization approaches: post-training quantization and quantization-aware training. Their calibration procedures differ, and so does the resulting accuracy, with the latter being higher. For details, refer to the NVIDIA Deep Learning TensorRT Documentation.

(2) Quantization calibration
Post-training calibration is used here. The idea is: take some real input data (no labels or outputs required) and hand it to TensorRT, which adjusts the quantization scaling factors according to the distribution of the real inputs so that accuracy is preserved as much as possible. In theory, more calibration data means higher accuracy, but in practice not much is needed; NVIDIA states that about 500 images are enough to calibrate an ImageNet classification network.
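
A post-training calibrator is plugged into the builder config shown earlier via setInt8Calibrator. The sketch below is a bare skeleton of an entropy calibrator, assuming TensorRT 8.x; loading and preprocessing of the calibration images is left as a stub, the single-input binding is an assumption, and calibration.cache is just a placeholder file name.

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iterator>
#include <vector>

// Skeleton of a post-training INT8 calibrator. TensorRT calls getBatch() repeatedly
// and measures activation distributions to pick the quantization scales.
class EntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator2 {
public:
    EntropyCalibrator(int batchSize, int inputVolume)
        : mBatchSize(batchSize), mInputVolume(inputVolume) {
        cudaMalloc(&mDeviceInput, batchSize * inputVolume * sizeof(float));
    }
    ~EntropyCalibrator() override { cudaFree(mDeviceInput); }

    int32_t getBatchSize() const noexcept override { return mBatchSize; }

    bool getBatch(void* bindings[], const char* names[], int32_t nbBindings) noexcept override {
        // Load and preprocess the next batch of calibration images here (stub).
        std::vector<float> batch(mBatchSize * mInputVolume);
        if (!loadNextBatch(batch)) return false;   // false = calibration data exhausted
        cudaMemcpy(mDeviceInput, batch.data(), batch.size() * sizeof(float), cudaMemcpyHostToDevice);
        bindings[0] = mDeviceInput;                // assumes a single input binding
        return true;
    }

    // Cache the scales so calibration does not have to be repeated on every build.
    const void* readCalibrationCache(size_t& length) noexcept override {
        mCache.clear();
        std::ifstream in("calibration.cache", std::ios::binary);
        if (in) mCache.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }
    void writeCalibrationCache(const void* cache, size_t length) noexcept override {
        std::ofstream out("calibration.cache", std::ios::binary);
        out.write(static_cast<const char*>(cache), length);
    }

private:
    bool loadNextBatch(std::vector<float>& batch) noexcept { return false; }  // stub
    int mBatchSize, mInputVolume;
    void* mDeviceInput{ nullptr };
    std::vector<char> mCache;
};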

My take is that before this calibration step, you first have to make sure the ONNX model itself is accurate. In other words, when the model saved by the training framework is converted to ONNX, its precision should be aligned first. The following is based on the practical tutorial "Implementing a PyTorch-ONNX precision alignment tool".

After converting the deep learning framework model into an intermediate representation, the first thing a deployment engineer needs to do is align the precision, to make sure the model's outputs are consistent with the original. The most common approach is to evaluate the intermediate representation on the test set and check whether the evaluation metrics (such as accuracy or similarity) have dropped.
To match PyTorch modules against ONNX nodes, we can use a custom operator that stores debugging information, as described below.

We can define an ONNX operator called Debug with a name attribute that stores a debug name. Since every ONNX operator node already has its own output tensor name, the node's output name and the debug name become bound together. You can then follow a debug name set in PyTorch, find the corresponding output in the ONNX graph, and establish the correspondence between PyTorch and ONNX. For details, please refer to the original article. The general idea is the same as debugging on Linux: sprinkle some print or cout statements and compare the outputs one by one. You can also start from the tail of the network and work backwards step by step, checking the precision difference between the two at each stage. If an operator with differing results is found, you know exactly which step has the problem and can rewrite that part of the ONNX model.
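
In practice, comparing the PyTorch output with the ONNX Runtime output for the same input boils down to an element-wise comparison of two float buffers. A trivial helper along these lines (my own illustration, not from the original article):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Compare two output tensors element by element and report the largest absolute difference.
// A tolerance around 1e-4 to 1e-3 is a common rule of thumb for FP32 framework-to-ONNX checks.
bool outputsMatch(const std::vector<float>& torchOut, const std::vector<float>& onnxOut,
                  float tolerance = 1e-4f)
{
    if (torchOut.size() != onnxOut.size()) {
        std::printf("size mismatch: %zu vs %zu\n", torchOut.size(), onnxOut.size());
        return false;
    }
    float maxDiff = 0.0f;
    for (size_t i = 0; i < torchOut.size(); ++i)
        maxDiff = std::max(maxDiff, std::fabs(torchOut[i] - onnxOut[i]));
    std::printf("max abs diff: %g\n", maxDiff);
    return maxDiff <= tolerance;
}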

References

  1. NVIDIA TensorRT
  2. Pitfall notes: a foolproof TensorRT deployment process
  3. ONNX

Origin: blog.csdn.net/u012655441/article/details/125848977