OpenCV DNN module example (20): object detection (object_detection) with YOLOR

YOLOR comes from the paper You Only Learn One Representation: Unified Network for Multiple Tasks. Inspired by the way humans learn (using the five senses, through routine and subconscious learning, accumulating rich experience, encoding and storing it, and then using it to process known or unknown information), the paper proposes a unified network that encodes explicit knowledge and implicit knowledge together, and performs kernel space alignment, prediction refinement, and multi-task learning within the network, forming a unified representation that serves multiple tasks at the same time. The results show that introducing implicit knowledge into a neural network helps improve the performance of all tasks. Further analysis finds that the implicit representation improves performance because it is able to capture the physical meaning of the different tasks.

1. Introduction to the paper

paper: https://arxiv.org/abs/2105.04206
code: https://github.com/WongKinYiu/yolor

1.1. YOLOR ideological motivation

Figure 1: People can answer different questions based on the same input image. This article also aims to train a single neural network to serve multiple tasks.

As shown in Figure 1, people can analyze the same target from multiple angles, but a CNN is usually trained from only one angle, i.e. for a single task, so the resulting CNN features are difficult to apply to other problems. The author believes the main reason is that the model only extracts neuron features and discards the learning and use of implicit knowledge, even though, just as in the human brain, implicit knowledge is very useful for analyzing a wide variety of tasks.

Humans usually learn implicit knowledge through the subconscious mind, yet there is no systematic definition of how implicit knowledge is learned and obtained. For neural networks, shallow features are often called explicit knowledge and deep features implicit knowledge. This paper instead defines directly observable knowledge as explicit knowledge, and knowledge that is hidden inside the neural network and cannot be observed as implicit knowledge.

Figure 2: Multi-purpose neural network architecture. (a) Different tasks correspond to different models; (b) Different tasks share the backbone network and use different output heads; (c) The unified network proposed in this article: one representation that integrates explicit knowledge and implicit knowledge serves multiple tasks.

As shown in Figure 2, a unified network is proposed to integrate explicit knowledge and implicit knowledge; by learning a unified representation, each sub-representation can be applied to a different task. Building on the theoretical foundations of previous work, the paper combines compressed sensing and deep learning to construct the unified network.

The main contributions of this article are as follows:

  1. A unified network that can handle multiple tasks at the same time is proposed. By fusing explicit knowledge and implicit knowledge, it learns a unified representation applicable to multiple tasks. The proposed network effectively improves model performance while adding less than one-thousandth of the computational cost;

  2. Implicit knowledge learning is completed through kernel space alignment, prediction refinement and multi-task learning, and its effectiveness is verified;

  3. The modeling methods for implicit knowledge, including vectors, neural networks, and matrix factorization, were each discussed, and their effectiveness was verified;

  4. It was confirmed that the learned implicit representation corresponds accurately to specific physical characteristics and can be visualized; it was also confirmed that when the operator conforms to the physical meaning of the target, it can be used to integrate implicit knowledge and explicit knowledge with a multiplier effect;

  5. Compared with SOTA methods, YOLOR achieves the same object detection accuracy as Scaled-YOLOv4-P7 while its inference speed is 88% faster.

1.2. Implicit knowledge learning

1.2.1. How implicit knowledge works

Manifold space reduction
Kernel space alignment
More functions and processing methods

1.2.2. Unified network modeling of implicit knowledge

Representation of implicit knowledge
Unified Networks:
Modeling of implicit knowledge
Vector/Matrix/ Tensor
Matrix factorization
Training
Inference
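
To make the outline above a bit more concrete, the paper's formulation can be paraphrased roughly as follows (my summary, so treat the notation as approximate). A conventional network only minimizes the error term \epsilon of the explicit model, while the unified network combines the explicit features with an implicit term through a simple operator \star:

    y = f_\theta(x) + \epsilon                 (conventional, explicit knowledge only)
    y = f_\theta(x) \star g_\phi(z)            (unified, with implicit knowledge z)

where \star can be addition, multiplication or concatenation, and the implicit part g_\phi(z) is modeled as a learned vector z, a small neural network W z, or a matrix factorization Z^T c, which are the options compared in the experiment sections below.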

1.3. Experiment

3.1 Experimental setup

3.2 FPN feature alignment

3.3 Target detection prediction refinement

3.4 Multi-task specification representation

3.5 Comparison of different operators for implicit knowledge modeling

3.6 Comparison of different methods of implicit knowledge modeling

3.7 Implicit knowledge model analysis

3.8 Implicit knowledge improves target detection

1.4. Summary

2. Test

The test here uses yolor-p6-640-640. From the network model you can see that there are 4 raw outputs, which are the results at 4 scales; they are finally merged into a single output through reshape and concat (the output format is consistent with yolov5). A quick way to verify this from code is shown below.
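If you want to confirm the exported model's output name and the merged shape yourself, the short sketch below uses only standard OpenCV DNN calls (readNet, getUnconnectedOutLayersNames and a dummy forward pass). The model path is a placeholder, and the (1, N, 85) shape is simply what a yolov5-style export is expected to produce:

#include <opencv2/opencv.hpp>
#include <iostream>

int main()
{
    // Placeholder path; point this at your own exported ONNX model.
    cv::dnn::Net net = cv::dnn::readNet("yolor-p6-640-640.onnx");

    // For this export there should be a single merged blob named "output".
    for(const auto& name : net.getUnconnectedOutLayersNames())
        std::cout << "output layer: " << name << std::endl;

    // Dummy 1x3x640x640 forward pass; the merged output is expected to be
    // (1, N, 85): box[x,y,w,h] + confidence + class scores, as in yolov5.
    int dims[] = {1, 3, 640, 640};
    cv::Mat dummy = cv::Mat::zeros(4, dims, CV_32F);
    net.setInput(dummy);
    cv::Mat out = net.forward("output");
    std::cout << "output shape:";
    for(int i = 0; i < out.dims; ++i)
        std::cout << " " << out.size[i];
    std::cout << std::endl;
    return 0;
}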

2.1. OpenCV DNN

2.1.1. Code

Use the same test code as yolov5.

#pragma once

#include "opencv2/opencv.hpp"


#include <fstream>
#include <sstream>

#include <random>


using namespace cv;
using namespace dnn;

float inpWidth;
float inpHeight;
float confThreshold, scoreThreshold, nmsThreshold;
std::vector<std::string> classes;
std::vector<cv::Scalar> colors;


bool letterBoxForSquare = true;

cv::Mat formatToSquare(const cv::Mat &source);

void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& out, Net& net);

void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame);

std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<int> dis(100, 255);

int testYoloR()
{
    // Configure according to the selected detection model file
    confThreshold = 0.25;
    scoreThreshold = 0.45;
    nmsThreshold = 0.5;
    float scale = 1 / 255.0;  //0.00392
    Scalar mean = {0, 0, 0};
    bool swapRB = true;
    inpWidth = 640;
    inpHeight = 640;

    String modelPath = R"(E:\DeepLearning\yolor\yolor-p6-640-640.onnx)";
    String configPath;

    String framework = "";

    //int backendId = cv::dnn::DNN_BACKEND_OPENCV;
    //int targetId = cv::dnn::DNN_TARGET_CPU;

    int backendId = cv::dnn::DNN_BACKEND_CUDA;
    int targetId = cv::dnn::DNN_TARGET_CUDA;  

    String classesFile = std::string(R"(\data\coco.names)");


    // Open file with classes names.
    if(!classesFile.empty()) {
        const std::string& file = classesFile;
        std::ifstream ifs(file.c_str());
        if(!ifs.is_open())
            CV_Error(Error::StsError, "File " + file + " not found");
        std::string line;
        while(std::getline(ifs, line)) {
            classes.push_back(line);
            colors.push_back(cv::Scalar(dis(gen), dis(gen), dis(gen)));
        }
    }
    // Load a model.
    Net net = readNet(modelPath, configPath, framework);
    net.setPreferableBackend(backendId);
    net.setPreferableTarget(targetId);

    //std::vector<String> outNames = net.getUnconnectedOutLayersNames();
    std::vector<String> outNames{ "output" };
    {
        // Warm up the network with a few dummy forward passes at the input size.
        int dims[] = {1, 3, (int)inpHeight, (int)inpWidth};
        cv::Mat tmp = cv::Mat::zeros(4, dims, CV_32F);
        std::vector<cv::Mat> outs;

        net.setInput(tmp);
        for(int i = 0; i<10; i++)
            net.forward(outs, outNames); // warmup
    }

    // Create a window
    static const std::string kWinName = "Deep learning object detection in OpenCV";

    cv::namedWindow(kWinName, 0);

    // Open a video file or an image file or a camera stream.
    VideoCapture cap;
    cap.open(R"(E:\DeepLearning\yolov5\data\images\bus.jpg)");

    cv::TickMeter tk;
    // Process frames.
    Mat frame, blob;

    while(waitKey(1) < 0) {

        //tk.reset();
        //tk.start();

        cap >> frame;
        if(frame.empty()) {
            waitKey();
            break;
        }

        // Create a 4D blob from a frame.
        cv::Mat modelInput = frame;
        if(letterBoxForSquare && inpWidth == inpHeight)
            modelInput = formatToSquare(modelInput);

        blobFromImage(modelInput, blob, scale, cv::Size2f(inpWidth, inpHeight), mean, swapRB, false);

        // Run a model.
        net.setInput(blob);

        std::vector<Mat> outs;
        //tk.reset();
        //tk.start();

        auto tt1 = cv::getTickCount();
        net.forward(outs, outNames);
        auto tt2 = cv::getTickCount();

        tk.stop();
        postprocess(frame, modelInput.size(), outs, net);
        //tk.stop();

        // Put efficiency information.
        //std::vector<double> layersTimes;
        //double freq = getTickFrequency() / 1000;
        //double t = net.getPerfProfile(layersTimes) / freq;
        //std::string label = format("Inference time: %.2f ms  (%.2f ms)", t, /*tk.getTimeMilli()*/ (tt2 - tt1) / cv::getTickFrequency() * 1000);
        std::string label = format("Inference time: %.2f ms", (tt2 - tt1) / cv::getTickFrequency() * 1000);

        cv::putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

        cv::imshow(kWinName, frame);
    }
    return 0;
}


cv::Mat formatToSquare(const cv::Mat &source)
{
    // Pad the frame into a square (top-left anchored) so that resizing to the
    // network input does not change the aspect ratio.
    int col = source.cols;
    int row = source.rows;
    int _max = MAX(col, row);
    cv::Mat result = cv::Mat::zeros(_max, _max, CV_8UC3);
    source.copyTo(result(cv::Rect(0, 0, col, row)));
    return result;
}

void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& outs, Net& net)
{
    // Like yolov5, the merged output has shape (batchSize, 25200, 85):
    // box[x, y, w, h] + objectness confidence + class scores (80 for COCO) per row.
    auto tt1 = cv::getTickCount();

    float x_factor = inputSz.width / inpWidth;
    float y_factor = inputSz.height / inpHeight;

    std::vector<int> class_ids;
    std::vector<float> confidences;
    std::vector<cv::Rect> boxes;

    int rows = outs[0].size[1];
    int dimensions = outs[0].size[2];

    float *data = (float *)outs[0].data;

    for(int i = 0; i < rows; ++i) {
        float confidence = data[4];

        if(confidence >= confThreshold) {
            float *classes_scores = data + 5;

            cv::Mat scores(1, (int)classes.size(), CV_32FC1, classes_scores);
            cv::Point class_id;
            double max_class_score;

            minMaxLoc(scores, 0, &max_class_score, 0, &class_id);

            if(max_class_score > scoreThreshold) {
                confidences.push_back(confidence);
                class_ids.push_back(class_id.x);

                float x = data[0];
                float y = data[1];
                float w = data[2];
                float h = data[3];

                int left = int((x - 0.5 * w) * x_factor);
                int top = int((y - 0.5 * h) * y_factor);
                int width = int(w * x_factor);
                int height = int(h * y_factor);

                boxes.push_back(cv::Rect(left, top, width, height));
            }
        }

        data += dimensions;
    }

    std::vector<int> indices;
    NMSBoxes(boxes, confidences, scoreThreshold, nmsThreshold, indices);

    auto tt2 = cv::getTickCount();
    std::string label = format("NMS time: %.2f ms", (tt2 - tt1) / cv::getTickFrequency() * 1000);
    cv::putText(frame, label, Point(0, 30), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));


    for(size_t i = 0; i < indices.size(); ++i) {
        int idx = indices[i];
        Rect box = boxes[idx];
        drawPred(class_ids[idx], confidences[idx], box.x, box.y,
                 box.x + box.width, box.y + box.height, frame);
    }
}

void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame)
{
    // Draw the bounding box, then a filled label with class name and confidence.
    rectangle(frame, Point(left, top), Point(right, bottom), Scalar(0, 255, 0));

    std::string label = format("%.2f", conf);
    Scalar color = Scalar::all(255);
    if(!classes.empty()) {
        CV_Assert(classId < (int)classes.size());
        label = classes[classId] + ": " + label;
        color = colors[classId];
    }

    int baseLine;
    Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);

    top = std::max(top, labelSize.height);
    rectangle(frame, Point(left, top - labelSize.height),
              Point(left + labelSize.width, top + baseLine), color, FILLED);
    cv::putText(frame, label, Point(left, top), FONT_HERSHEY_SIMPLEX, 0.5, Scalar());
}

2.1.2. Results

The test result is shown in the figure: a traffic light is falsely detected. Comparing against the repository's Python test script and setting bool letterBoxForSquare = false; (i.e. disabling the square padding before resize) makes the detection normal again. If you want to keep the aspect ratio instead of disabling the padding, a letterbox-style preprocessing can be used; a sketch follows below.
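The sketch below is my own illustration, not taken from the test code above (the function name letterbox, the gray pad value 114 and the centered padding are assumptions): it resizes while preserving the aspect ratio, pads the borders, and returns the scale and padding needed to map detected boxes back to the original image.

// Resize keeping the aspect ratio and pad to (dstW x dstH).
// Outputs scale and left/top padding so boxes can be mapped back afterwards.
cv::Mat letterbox(const cv::Mat& src, int dstW, int dstH,
                  float& scale, int& padX, int& padY)
{
    scale = MIN(dstW / (float)src.cols, dstH / (float)src.rows);
    int newW = cvRound(src.cols * scale);
    int newH = cvRound(src.rows * scale);
    padX = (dstW - newW) / 2;
    padY = (dstH - newH) / 2;

    cv::Mat resized;
    cv::resize(src, resized, cv::Size(newW, newH));

    // Gray padding (114) as commonly used by yolo-style preprocessing.
    cv::Mat padded(dstH, dstW, src.type(), cv::Scalar(114, 114, 114));
    resized.copyTo(padded(cv::Rect(padX, padY, newW, newH)));
    return padded;
}

With this preprocessing, the box mapping in postprocess would become left = int((x - 0.5f * w - padX) / scale) and top = int((y - 0.5f * h - padY) / scale), instead of multiplying by x_factor / y_factor.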


2.2. Test efficiency

GTX 1080 Ti, i7-7700K

opencv cpu: 630 ms
opencv gpu: 52 ms
opencv gpu (fp16): 793 ms

The following measured times include preprocessing + inference + postprocessing:
openvino (cpu): 274 ms
onnxruntime (gpu): 30 ms
tensorrt: 23 ms
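
For reference, the three OpenCV numbers above only differ in the backend/target pair set on the net object from the test code; the fp16 line simply switches the target (the very slow fp16 result is plausible, since Pascal cards such as the 1080 Ti have low fp16 throughput):

// CPU
net.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);

// CUDA, fp32
net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

// CUDA, fp16
net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA_FP16);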

Original article: blog.csdn.net/wanggao_1990/article/details/133269741