YOLOv5 + OpenVINO deployment: solving slow model loading when deploying a quantized model

Pain points

After a model trained with YOLOv5 is quantized with NNCF, it is converted into an XML/BIN model with INT8 precision. When the XML model is loaded for deployment in C++, a problem appears: loading the model for the first time takes a long time. Although the quantized model roughly doubles detection speed, an overly long loading time has a serious impact on industrial production. On a production line that is already running, if loading takes too long when the inspection software starts, the first batch of products cannot be inspected and is judged as PreNG. We therefore need some way to fix the long initial loading time of the XML model.

Solution

After testing, we found that the unquantized ONNX model loads very quickly at inference time, which meets production needs (my guess is that the ONNX computation graph contains the weight information, while the XML stores only the graph and the BIN stores the weights, and this separate storage makes loading slow). So we start by loading the ONNX model for detection and open a separate thread to load the XML model, i.e. the XML model loads while detection runs on ONNX. We also define a global counter that counts ONNX detections. When the counter reaches a certain value and the inference request of the XML model has been created, we switch to the XML inference path and continue the remaining inference with the accelerated model. In the logs you will therefore see that detection is relatively slow at the beginning, and once the XML model finishes loading the detection speed suddenly jumps up at some point.
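To make the structure clearer before the detailed code below, here is a minimal, simplified sketch of the handover idea (illustrative only: xml_ready, onnx_count, load_xml_in_background and the threshold 20 are placeholder names; the real globals and functions appear in the next sections):

#include <thread>

// Simplified sketch of the ONNX -> XML handover; not the project code.
bool xml_ready = false;   // set by the background loader once the XML infer request exists
int  onnx_count = 0;      // number of detections already done with the ONNX model

void load_xml_in_background() {
	// compile the INT8 XML model and create its infer request here (this is the slow part)
	xml_ready = true;
}

void detect_frame(/* cv::Mat frame */) {
	if (xml_ready && onnx_count >= 20) {
		// run inference with the quantized XML model
	} else {
		// run inference with the ONNX model loaded at start-up
		onnx_count++;
	}
}

void start_detection() {
	// the ONNX model is loaded first (fast), then the XML load is pushed to a detached thread
	std::thread(load_xml_in_background).detach();
	// the detection loop then calls detect_frame() for every incoming image
}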

Code details

Global variable settings

#include"yolov5_detect.h"
#include "mobilenet.h"
//#include <opencv2/cudaarithm.hpp>
//#include <opencv2/cudaoptflow.hpp>
//#include <opencv2/cudaimgproc.hpp>
#include<crtdbg.h>

#ifdef _DEBUG
#define new new(_NORMAL_BLOCK,__FILE__,__LINE__)
#endif
#include <time.h>
#include <chrono>
using namespace std::chrono;
//#include <windows.h>

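// Inference requests for the ONNX models (used first, while the XML models load)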
InferRequest infer_request_left_up;
InferRequest infer_request_left_down;
InferRequest infer_request_right_up;
InferRequest infer_request_right_down;
InferRequest infer_request_WuzhuangSeCha_left;
InferRequest infer_request_WuzhuangSeCha_right;

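// Inference requests for the quantized INT8 XML models (marked by the trailing underscore)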
InferRequest infer_request_left_up_;
InferRequest infer_request_left_down_;
InferRequest infer_request_right_up_;
InferRequest infer_request_right_down_;
InferRequest infer_request_WuzhuangSeCha_left_;
InferRequest infer_request_WuzhuangSeCha_right_;

string model_file = "yolov5s_int8.xml";
string model_file_onnx = "yolo_detect.onnx";
string model_file_WuZhuangSeCha = "yolo_WuZhuangSeCha_int8.xml";
string model_file_WuZhuangSeCha_onnx = "yolo_WuZhuangSeCha_detect.onnx";
string device_name = "CPU";
const float threshold_nms = 0.45;    // do not modify
const float confidence_score = 0.6;  // do not modify
const float threshold_score = 0.25;
const int OVERLAP = 100;             // do not modify
const int ADD_HEIGHT = 900;          // 4000 / 4 - 100;
vector<string> class_names(0);
vector<string> class_names_WuzhuangSeCha(0);
int detect_num_classes = 0;               // do not modify
int detect_num_classes_WuzhuangSeCha = 0; // do not modify
int max_detection = 100;
int tt = 0;
bool load_xml_left = true;
bool load_xml_right = true;
bool load_xml_Wu_left = true;
bool load_xml_Wu_right = true;
bool xml_loading_left = false;
bool xml_loading_right = false;
bool xml_loading_Wu_left = false;
bool xml_loading_Wu_right = false;

// Counter for ONNX model detections
int onnx_detect_count = 0;

Defining the initialization functions of the detection models (XML)

CPPDLL_API void Initial_model_xml_left()
{
	if (load_xml_left && xml_loading_left == false)
	{
		Core core;
		// 2. Load and compile the model
		CompiledModel compiled_model_left_up_ = core.compile_model(model_file, device_name);

		// 2. Load and compile the model
		CompiledModel compiled_model_left_down_ = core.compile_model(model_file, device_name);

		if (compiled_model_left_up_ && compiled_model_left_down_)
		{
			infer_request_left_down_ = compiled_model_left_down_.create_infer_request();
			infer_request_left_up_ = compiled_model_left_up_.create_infer_request();

			xml_loading_left = true;
		}
	}
	if (onnx_detect_count >= 60 && infer_request_left_down && infer_request_left_up)
	{
		infer_request_left_down.~InferRequest();
		infer_request_left_up.~InferRequest();
	}
}

CPPDLL_API void Initial_model_xml_right()
{
	if (load_xml_right && xml_loading_right == false)
	{
		Core core;
		// 2. Load and compile the model
		CompiledModel compiled_model_right_up_ = core.compile_model(model_file, device_name);

		// 2. Load and compile the model
		CompiledModel compiled_model_right_down_ = core.compile_model(model_file, device_name);

		if (compiled_model_right_up_ && compiled_model_right_down_)
		{
			infer_request_right_down_ = compiled_model_right_down_.create_infer_request();
			infer_request_right_up_ = compiled_model_right_up_.create_infer_request();

			xml_loading_right = true;
		}
	}
	if (onnx_detect_count >= 60 && infer_request_right_down && infer_request_right_up)
	{
		infer_request_right_down.~InferRequest();
		infer_request_right_up.~InferRequest();
	}
}

CPPDLL_API void Initial_model_xml_Wu_left()
{
	if (load_xml_Wu_left && xml_loading_Wu_left == false)
	{
		Core core;
		// 2. Load and compile the model
		CompiledModel compiled_model_WuzhuangSeCha_left_ = core.compile_model(model_file_WuZhuangSeCha, device_name);

		if (compiled_model_WuzhuangSeCha_left_)
		{
			infer_request_WuzhuangSeCha_left_ = compiled_model_WuzhuangSeCha_left_.create_infer_request();

			xml_loading_Wu_left = true;
		}
	}
	if (onnx_detect_count >= 60 && infer_request_WuzhuangSeCha_left)
	{
		infer_request_WuzhuangSeCha_left.~InferRequest();
	}
}

CPPDLL_API void Initial_model_xml_Wu_right()
{
	if (load_xml_Wu_right && xml_loading_Wu_right == false)
	{
		Core core;
		// 2. Load and compile the model
		CompiledModel compiled_model_WuzhuangSeCha_right_ = core.compile_model(model_file_WuZhuangSeCha, device_name);

		if (compiled_model_WuzhuangSeCha_right_)
		{
			infer_request_WuzhuangSeCha_right_ = compiled_model_WuzhuangSeCha_right_.create_infer_request();

			xml_loading_Wu_right = true;
		}
	}
	if (onnx_detect_count >= 60 && infer_request_WuzhuangSeCha_right)
	{
		infer_request_WuzhuangSeCha_right.~InferRequest();
	}
}

void xml_init()
{
	Initial_model_xml_left();
	Initial_model_xml_right();
	Initial_model_xml_Wu_left();
	Initial_model_xml_Wu_right();
}

The xml_init() function wraps the XML loading routines of all our detection models; it is handed to a newly created thread in the ONNX initialization function below. A few details about each XML loading routine: load_xml_left is a global variable that we set to true, meaning the XML model needs to be loaded, and xml_loading_left is another global variable set to false, meaning the XML model has not been loaded yet. When both conditions hold, we enter the first if statement, instantiate a Core, and compile the model objects. If both compiled models exist, we create their inference requests. From my earlier tests, creating the inference requests is what takes most of the time. Once the inference requests have been created we set xml_loading_left to true, marking the XML model as loaded, so the first branch will not be entered again later. The last if statement: when the detection count is >= 60 and both ONNX inference requests still exist, we destroy them to release the memory they occupy. After that the program enters neither the first if nor the last if, so there is no need to worry about the initialization function doing this work over and over during detection.
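One caveat worth noting: these flags and the counter are plain bool/int globals written by the loader thread and read by the detection thread, without synchronization. If you want the sharing to be formally race-free, a possible variant (not part of the original project) is to declare them as std::atomic; the surrounding logic stays the same:

#include <atomic>

// Hypothetical atomic variant of the shared globals; not the original declarations.
std::atomic<bool> xml_loading_left{false};  // set to true by the loader thread once the XML requests exist
std::atomic<bool> load_xml_left{true};      // true while the XML model still has to be loaded
std::atomic<int>  onnx_detect_count{0};     // incremented on the ONNX detection path

// Reads and writes keep the same shape as before, e.g.:
//   if (load_xml_left && xml_loading_left == false) { ... xml_loading_left = true; }
//   if (onnx_detect_count >= 60) { ... }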

Defining the ONNX initialization function of the detection models and handing the XML initialization to a child thread

CPPDLL_API void Initial_Model() {
	// 1. Create the OpenVINO Runtime Core object
	Core core;

	// 2. Load and compile the model
	CompiledModel compiled_model_left_up = core.compile_model(model_file_onnx, device_name);

	// 3. Create an inference request
	infer_request_left_up = compiled_model_left_up.create_infer_request();

	// 2. Load and compile the model
	CompiledModel compiled_model_left_down = core.compile_model(model_file_onnx, device_name);

	// 3. Create an inference request
	infer_request_left_down = compiled_model_left_down.create_infer_request();

	// 2. Load and compile the model
	CompiledModel compiled_model_right_up = core.compile_model(model_file_onnx, device_name);

	// 3. Create an inference request
	infer_request_right_up = compiled_model_right_up.create_infer_request();

	// 2. Load and compile the model
	CompiledModel compiled_model_right_down = core.compile_model(model_file_onnx, device_name);

	// 3. Create an inference request
	infer_request_right_down = compiled_model_right_down.create_infer_request();

	class_names = getClassName("cocoName.txt");
	detect_num_classes = class_names.size();

	initial_model_classification();
	Initial_Model_WuzhuangSeCha();

	thread xml_thread(xml_init);
	xml_thread.detach();

	// Warm-up
	//Warm_CPU();
}

The key part is the last two lines. First a child thread object is created through std::thread, and the XML initialization function is passed into it. Then detach() separates the created child thread from the calling (main) thread. This way, calling the initialization function does not make the already loaded ONNX models wait on the XML loading: the calling thread continues on, and the child (called) thread runs in the background. With join(), by contrast, the calling thread would wait for the called thread to finish before executing the following code. For the exact difference between join() and detach() in std::thread, refer to the blog post: The difference and implementation of join and detach in std::thread.
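A tiny self-contained illustration of that difference (not project code; slow_load just stands in for compiling a model):

#include <chrono>
#include <iostream>
#include <thread>

void slow_load() {
	std::this_thread::sleep_for(std::chrono::seconds(3)); // stands in for compiling the XML model
	std::cout << "background load finished\n";
}

int main() {
	std::thread t1(slow_load);
	t1.join();    // blocks here for ~3 s before the next line runs

	std::thread t2(slow_load);
	t2.detach();  // returns immediately; slow_load keeps running in the background
	std::cout << "main thread continues right away\n";

	std::this_thread::sleep_for(std::chrono::seconds(4)); // keep the process alive for the detached thread
	return 0;
}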

Pre-processing, inference, and post-processing

Here we take one of the models as an example: this project contains six inference paths, so we show only one of them, and the others follow a similar process.

void detect_left_up(cv::Mat frame, string& return_result) {
	Shape tensor_shape;
	Tensor input_node;
	if (xml_loading_left && onnx_detect_count >= 20)
	{
		input_node = infer_request_left_up_.get_input_tensor();
		tensor_shape = input_node.get_shape();

		//Mat frame = imread(".\\test_data0\\2.png", IMREAD_COLOR);
		// Letterbox resize is the default resize method in YOLOv5.
		int w = frame.cols;
		int h = frame.rows;
		int _max = max(h, w);
		Mat image(Size(_max, _max), CV_8UC3, cv::Scalar(255, 255, 255));
		Rect roi(0, 0, w, h);
		frame.copyTo(image(roi));
		// Swap the R and B channels
		cvtColor(image, image, COLOR_BGR2RGB);
		// Compute the scaling factors
		size_t num_channels = tensor_shape[1];
		size_t height = tensor_shape[2];
		size_t width = tensor_shape[3];
		float x_factor = image.cols / float(width);
		float y_factor = image.rows / float(height);

		//int64 start = cv::getTickCount();
		// Resize the image and normalize it
		Mat blob_image;
		resize(image, blob_image, cv::Size(width, height));
		blob_image.convertTo(blob_image, CV_32F);
		blob_image = blob_image / 255.0;

		// 4.3 Fill the image data into the input tensor
		Tensor input_tensor = infer_request_left_up_.get_input_tensor();
		// Get a pointer to the data buffer of the model input node
		float* input_tensor_data = input_node.data<float>();
		// Copy the image data into the model input node:
		// the image is stored as HWC, the model input node expects CHW
		for (size_t c = 0; c < num_channels; c++) {
			for (size_t h = 0; h < height; h++) {
				for (size_t w = 0; w < width; w++) {
					input_tensor_data[c * width * height + h * width + w] = blob_image.at<Vec<float, 3>>(h, w)[c];
				}
			}
		}

		// 5. Run inference
		auto start1 = std::chrono::system_clock::now();
		infer_request_left_up_.infer();
		auto end1 = std::chrono::system_clock::now();
		//auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
		//cout << "left_up inference time: " << duration << "ms" << endl;


		// 6. Process the inference results
		// 6.1 Get the inference output
		const ov::Tensor& output = infer_request_left_up_.get_tensor("output");
		const float* output_buffer = output.data<const float>();

		// 6.2 Parse the results; YOLOv5 output format: cx,cy,w,h,score
		int out_rows = output.get_shape()[1]; // rows of the "output" node
		int out_cols = output.get_shape()[2]; // cols of the "output" node
		Mat det_output(out_rows, out_cols, CV_32F, (float*)output_buffer);

		vector<cv::Rect> boxes;
		vector<int> classIds;
		vector<float> confidences;

		for (int i = 0; i < det_output.rows; i++) {
			float confidence = det_output.at<float>(i, 4);
			if (confidence < confidence_score) {
				continue;
			}
			int endindex = detect_num_classes + 5;
			Mat classes_scores = det_output.row(i).colRange(5, endindex);
			Point classIdPoint;
			double score;
			minMaxLoc(classes_scores, 0, &score, 0, &classIdPoint);

			// Confidence is between 0 and 1
			if (score > confidence_score)
			{
				float cx = det_output.at<float>(i, 0);
				float cy = det_output.at<float>(i, 1);
				float ow = det_output.at<float>(i, 2);
				float oh = det_output.at<float>(i, 3);
				int x = static_cast<int>((cx - 0.5 * ow) * x_factor);
				int y = static_cast<int>((cy - 0.5 * oh) * y_factor);
				int width = static_cast<int>(ow * x_factor);
				int height = static_cast<int>(oh * y_factor);
				Rect box;
				box.x = x;
				box.y = y;
				box.width = width;
				box.height = height;

				boxes.push_back(box);
				classIds.push_back(classIdPoint.x);
				confidences.push_back(score);
			}
		}

		// NMS
		vector<int> indexes;
		dnn::NMSBoxes(boxes, confidences, threshold_score, threshold_nms, indexes);
		int count = 0;
		if (indexes.size() > max_detection) {
			count = max_detection;
		}
		else {
			count = indexes.size();
		}
		for (size_t i = 0; i < count; i++) {
			int index = indexes[i];
			int idx = classIds[index];


			int x0 = int(boxes[index].x);
			int y0 = int(boxes[index].y);
			int x1 = int(boxes[index].x) + int(boxes[index].width);
			int y1 = int(boxes[index].y) + int(boxes[index].height);
			int area = boxes[index].width * boxes[index].height;

			return_result += class_names[idx] + "," + std::to_string(int(x0)) + "," + std::to_string(int(y0)) + "," + std::to_string(int(x1)) + "," +
				std::to_string(int(y1)) + "," + std::to_string(int(area)) + "," + std::to_string(confidences[index]) + ";" + "\n";

			rectangle(frame, Point(x0, y0), Point(x1, y1), Scalar(0, 255, 255), 4);

			cout << " " << endl;

			//putText(frame, class_names[idx], Point(boxes[index].tl().x, boxes[index].tl().y - 10), FONT_HERSHEY_SIMPLEX, .5, Scalar(0, 0, 0));
		}
		//cv::namedWindow("left_up", WINDOW_NORMAL);
		//cv::imshow("left_up", frame);
		//cv::waitKey(0);
	}
	else
	{
		input_node = infer_request_left_up.get_input_tensor();
		tensor_shape = input_node.get_shape();

		//Mat frame = imread(".\\test_data0\\2.png", IMREAD_COLOR);
		// Letterbox resize is the default resize method in YOLOv5.
		int w = frame.cols;
		int h = frame.rows;
		int _max = max(h, w);
		Mat image(Size(_max, _max), CV_8UC3, cv::Scalar(255, 255, 255));
		Rect roi(0, 0, w, h);
		frame.copyTo(image(roi));
		// Swap the R and B channels
		cvtColor(image, image, COLOR_BGR2RGB);
		// Compute the scaling factors
		size_t num_channels = tensor_shape[1];
		size_t height = tensor_shape[2];
		size_t width = tensor_shape[3];
		float x_factor = image.cols / float(width);
		float y_factor = image.rows / float(height);

		//int64 start = cv::getTickCount();
		// Resize the image and normalize it
		Mat blob_image;
		resize(image, blob_image, cv::Size(width, height));
		blob_image.convertTo(blob_image, CV_32F);
		blob_image = blob_image / 255.0;

		// 4.3 Fill the image data into the input tensor
		Tensor input_tensor = infer_request_left_up.get_input_tensor();
		// Get a pointer to the data buffer of the model input node
		float* input_tensor_data = input_node.data<float>();
		// Copy the image data into the model input node:
		// the image is stored as HWC, the model input node expects CHW
		for (size_t c = 0; c < num_channels; c++) {
			for (size_t h = 0; h < height; h++) {
				for (size_t w = 0; w < width; w++) {
					input_tensor_data[c * width * height + h * width + w] = blob_image.at<Vec<float, 3>>(h, w)[c];
				}
			}
		}

		// 5. Run inference
		auto start1 = std::chrono::system_clock::now();
		infer_request_left_up.infer();
		auto end1 = std::chrono::system_clock::now();
		//auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
		//cout << "left_up inference time: " << duration << "ms" << endl;


		// 6. Process the inference results
		// 6.1 Get the inference output
		const ov::Tensor& output = infer_request_left_up.get_tensor("output");
		const float* output_buffer = output.data<const float>();

		// 6.2 Parse the results; YOLOv5 output format: cx,cy,w,h,score
		int out_rows = output.get_shape()[1]; // rows of the "output" node
		int out_cols = output.get_shape()[2]; // cols of the "output" node
		Mat det_output(out_rows, out_cols, CV_32F, (float*)output_buffer);

		vector<cv::Rect> boxes;
		vector<int> classIds;
		vector<float> confidences;

		for (int i = 0; i < det_output.rows; i++) {
			float confidence = det_output.at<float>(i, 4);
			if (confidence < confidence_score) {
				continue;
			}
			int endindex = detect_num_classes + 5;
			Mat classes_scores = det_output.row(i).colRange(5, endindex);
			Point classIdPoint;
			double score;
			minMaxLoc(classes_scores, 0, &score, 0, &classIdPoint);

			// Confidence is between 0 and 1
			if (score > confidence_score)
			{
				float cx = det_output.at<float>(i, 0);
				float cy = det_output.at<float>(i, 1);
				float ow = det_output.at<float>(i, 2);
				float oh = det_output.at<float>(i, 3);
				int x = static_cast<int>((cx - 0.5 * ow) * x_factor);
				int y = static_cast<int>((cy - 0.5 * oh) * y_factor);
				int width = static_cast<int>(ow * x_factor);
				int height = static_cast<int>(oh * y_factor);
				Rect box;
				box.x = x;
				box.y = y;
				box.width = width;
				box.height = height;

				boxes.push_back(box);
				classIds.push_back(classIdPoint.x);
				confidences.push_back(score);
			}
		}


		// NMS
		vector<int> indexes;
		dnn::NMSBoxes(boxes, confidences, threshold_score, threshold_nms, indexes);
		int count = 0;
		if (indexes.size() > max_detection) {
			count = max_detection;
		}
		else {
			count = indexes.size();
		}
		for (size_t i = 0; i < count; i++) {
			int index = indexes[i];
			int idx = classIds[index];


			int x0 = int(boxes[index].x);
			int y0 = int(boxes[index].y);
			int x1 = int(boxes[index].x) + int(boxes[index].width);
			int y1 = int(boxes[index].y) + int(boxes[index].height);
			int area = boxes[index].width * boxes[index].height;

			return_result += class_names[idx] + "," + std::to_string(int(x0)) + "," + std::to_string(int(y0)) + "," + std::to_string(int(x1)) + "," +
				std::to_string(int(y1)) + "," + std::to_string(int(area)) + "," + std::to_string(confidences[index]) + ";" + "\n";

			rectangle(frame, Point(x0, y0), Point(x1, y1), Scalar(0, 255, 255), 4);
			onnx_detect_count++;

			cout << " " << endl;

			//putText(frame, class_names[idx], Point(boxes[index].tl().x, boxes[index].tl().y - 10), FONT_HERSHEY_SIMPLEX, .5, Scalar(0, 0, 0));
		}
		//cv::namedWindow("left_up", WINDOW_NORMAL);
		//cv::imshow("left_up", frame);
		//cv::waitKey(0);
	}
	// Compute FPS
	//float t = (getTickCount() - start) / static_cast<float>(getTickFrequency());
	//cout << "Infer time(ms): " << t * 1000 << "ms; Detections: " << indexes.size() << endl;
	//putText(frame, format("FPS: %.2f", 1.0 / t), Point(20, 40), FONT_HERSHEY_PLAIN, 2.0, Scalar(255, 0, 0), 2, 8);
	//cv::namedWindow("YOLOv5-6.1 + OpenVINO 2022.1 C++ Demo", WINDOW_NORMAL);
	//imshow("YOLOv5-6.1 + OpenVINO 2022.1 C++ Demo", frame);

	//waitKey(0);
	//destroyAllWindows();

}

Only two details need attention here. The first is the condition of the if statement: xml_loading_left being true means the XML model has been loaded, and onnx_detect_count >= 20 means the ONNX model has already run a number of detections. The second is why the threshold in the ONNX destruction condition of the XML loading function is set to 60. At first I wrote >= 30 instead of >= 60. I changed it because, while the code was running, the destructor was executed and the ONNX model was destroyed as soon as the ONNX detection counter reached 30, but the XML model had not finished loading yet, so neither the XML model nor the ONNX model was available and the program exited abnormally. Setting it to 60 gives the XML model enough time to load.
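If you would rather not depend on a tuned magic number at all, one possible variant (a sketch, not the original project code, assuming that releasing an ov::InferRequest by assigning a default-constructed one is acceptable here) is to gate the ONNX tear-down on the XML flag itself:

// Hypothetical alternative to the count-based tear-down; not the original project code.
// The ONNX requests are released only once the XML requests exist, so there is never
// a moment when neither model is available.
static bool onnx_released = false;
if (xml_loading_left && !onnx_released)
{
	infer_request_left_down = InferRequest();  // assigning an empty request drops the old ONNX one
	infer_request_left_up = InferRequest();
	onnx_released = true;
}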

Summary

At this point we have solved the problem that the XML model produced by NNCF quantization takes too long to load at initialization, which would otherwise cause a large number of products to be missed in an industrial production application. Along the way I learned how to switch models flexibly during detection and picked up some std::thread operations.

Originally published at blog.csdn.net/ycx_ccc/article/details/131965908