OpenCV DNN module example (19): object detection (object_detection) with YOLOX

0. Preface

YOLOX was released by Megvii Technology in 2021 as a benchmark against YOLO v5. It introduces three main innovations: a decoupled head, an anchor-free design, and an advanced label assignment strategy (SimOTA). How well does YOLOX perform? Figure 1 of the original paper, shown below, indicates that it was slightly better than the YOLO v5 of that time, and Megvii used YOLOX to win first place in the Streaming Perception Challenge that year.

[Figure 1 of the YOLOX paper: speed/accuracy comparison]

Someone may then ask: between YOLO v5 and YOLOX (and now yolov7, yolov8...), which should you choose for your own project? If the dataset's image resolution is not very high, say 640x640, try both. If the resolution is very high, say 1280x1280, use YOLO v5, because the official YOLO v5 repository provides larger-scale pre-trained weights, while YOLOX currently only offers 640x640 pre-trained weights (the official YOLOX repository says larger-scale weights will come in the future, but there has been no news for more than a year).

Main model content:

  • Starting from the Yolov3 baseline model, add various tricks such as the Decoupled Head and SimOTA to obtain the Yolox-Darknet53 version;
  • Apply these effective tricks, one by one, to the four Yolov5 versions to obtain Yolox-s, Yolox-m, Yolox-l and Yolox-x;
  • Design the lightweight Yolox-Nano and Yolox-Tiny networks and test which of the tricks still apply to them.

1. Introduction to the network

Taking Yolox-Darknet53 as an example, its network structure is shown below.

[Figure: Yolox-Darknet53 network structure]

In order to facilitate the analysis of improvement points, we split the Yolox-Darknet53 network structure into four sections:

  • Input: strong data augmentation
  • Backbone: unchanged, still Darknet53
  • Neck: unchanged, still the FPN structure of the Yolov3 baseline
  • Prediction: Decoupled Head, End-to-End YOLO, Anchor-free, Multi positives

After this series of improvements, Yolox-Darknet53 finally reaches an AP of 47.3.

1.1. Input

At the input side, Yolox mainly uses two data augmentation methods, Mosaic and Mixup. These two augmentations alone improve the Yolov3 baseline by 2.4 AP points.

Two points are worth noting: (1) Both augmentations are turned off during the last 15 epochs of training; before that, Mosaic and Mixup are both enabled. (2) Because such strong augmentation is used, the author found that ImageNet pre-training no longer helps, so all models are trained from scratch.

1.2. Backbone network

The backbone networks of Yolox-Darknet53 and the original Yolov3 baseline both use the Darknet53 network structure.

1.3. Neck

The Neck of Yolox-Darknet53, like that of the Yolov3 baseline, fuses features with an FPN structure.
As shown in the figure below, the FPN works top-down: high-level feature information is upsampled, transferred and fused to obtain the feature maps used for prediction.
[Figure: FPN top-down feature fusion in the Neck]

1.4. Prediction output

The output side is explained from four aspects: Decoupled Head, Anchor-Free, label assignment, and loss calculation.

1.4.1. Decoupled Head

Decoupled heads already appear in many one-stage networks, such as RetinaNet and FCOS. In Yolox, the author adds three decoupled heads, one per prediction feature map.

In the baseline network, the Yolov3 baseline reaches an AP of 38.5, and the author wanted to push it further. For example, after converting the output to an end-to-end form (i.e., without NMS), the AP drops to only 34.3, even though an NMS-free variant of FCOS achieves performance on COCO comparable to FCOS with NMS. Why does the YOLO version drop so much? By chance, the author replaced the coupled YOLO head of the end-to-end model with a decoupled head.

[Figure: original coupled YOLO head vs. decoupled head]

Tests showed that the AP of end-to-end YOLO rose from 34.3 to 38.8. The author then also replaced the YOLO head of the Yolov3 baseline with a decoupled head and found that the AP rose from 38.5 to 39.6, and that besides the accuracy gain, the network also converged faster. Conclusion: the coupled detection head used so far in the YOLO series may be lacking in expressive power, and a decoupled head expresses the features better. The comparison curves are shown below.
[Figure: training convergence curves, coupled head vs. decoupled head]
The curves show that the decoupled head converges faster and reaches higher accuracy. Note, however, that decoupling the head increases computational complexity. Weighing speed against performance, the author first uses a 1x1 convolution to reduce the channel dimension and then two 3x3 convolutions in each of the two branches, so that in the end the head adds only a small number of parameters. Decoupling also has a deeper significance: the Yolox architecture can be combined with many other algorithm tasks. For example: (1) YOLOX + Yolact/CondInst/SOLO for instance segmentation on edge devices; (2) YOLOX + a 34-channel output (17 keypoints x 2 coordinates) for human keypoint detection on edge devices.

The details of the Yolox-Darknet53 decoupled head are shown in the figure: three separate branches predict Cls., Reg. and Obj. (IoU), so the three are decoupled. Note that in YOLOX each prediction feature map has its own head; the parameters are not shared. The first head's output resolution is 20*20, and there are three branches before the Concat:

(1) cls_output: predicts the category score of the target box. The COCO dataset has 80 categories, treated as 80 binary classifications, so after the Sigmoid activation the output size is 20*20*80.
(2) obj_output: judges whether the box is foreground or background, so after Sigmoid its size is 20*20*1.
(3) reg_output: predicts the coordinate information (x, y, w, h) of the target box, so its size is 20*20*4.

These three outputs are then fused by Concat into 20*20*85 feature information. This is only Decoupled Head①; Decoupled Head② and ③ are processed in the same way.

[Figure: outputs of the three decoupled heads and their Reshape + Concat into 8400*85]
Decoupled Head② produces 40*40*85 feature information after its Concat, and Decoupled Head③ produces 80*80*85. The three outputs ①②③ are then reshaped and concatenated as a whole into 8400*85 prediction information (400 + 1600 + 6400 = 8400).
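As a sanity check on these shapes, the following sketch (illustrative only; it uses zero-filled cv::Mat placeholders instead of real head outputs) flattens the three head outputs and stacks them into the 8400*85 matrix described above:

#include "opencv2/opencv.hpp"
#include <vector>

// Illustrative sketch: flatten the three decoupled-head outputs
// (20*20*85, 40*40*85, 80*80*85) to (H*W) x 85 and stack them,
// mirroring the Reshape + Concat step described above.
cv::Mat concatHeadOutputs()
{
    const int channels = 85;                          // 4 box + 1 obj + 80 classes
    const std::vector<int> gridSizes = { 20, 40, 80 };

    std::vector<cv::Mat> flattened;
    int totalRows = 0;
    for(int g : gridSizes) {
        cv::Mat head = cv::Mat::zeros(g * g, channels, CV_32F); // placeholder head output
        flattened.push_back(head);
        totalRows += g * g;                           // 400 + 1600 + 6400 = 8400
    }

    cv::Mat predictions;
    cv::vconcat(flattened, predictions);              // 8400 x 85
    CV_Assert(predictions.rows == totalRows && predictions.cols == channels);
    return predictions;
}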

1.4.2. Anchor-Free

In the industry there are currently two mainstream approaches: anchor-based and anchor-free. Yolov3, Yolov4 and Yolov5 usually use the anchor-based method to generate candidate boxes, which are then compared with the annotated ground truth to measure the difference between the two.

① Anchor Based method

For example, the input image passes through the Backbone and Neck, and the feature information arrives at the output feature maps. Anchor rules are then needed to associate predicted boxes with labeled boxes, so that during training the difference between the two, i.e. the loss, can be computed and the network parameters updated. As in the figure below, on each of the last three feature maps, three anchor boxes of different sizes are placed at every cell.
[Figure: three anchor boxes per cell on the three Yolov3 feature maps]
When the input is 416*416, the last three feature maps of the network are 13*13, 26*26 and 52*52, and each grid point on a feature map predicts three anchor boxes. With the 80-class COCO dataset, each anchor box carries x, y, w, h, obj (foreground/background) and 80 class scores, 85 parameters in total. Therefore 3*(13*13+26*26+52*52)*85 = 904,995 prediction values are generated. If the input is 640*640 and the last three feature maps are 20*20, 40*40 and 80*80, then 3*(20*20+40*40+80*80)*85 = 2,142,000 prediction values are generated.
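The counts above are easy to verify with a few lines of standalone code (not part of the detector, just the arithmetic):

#include <cstdio>

// Verify the anchor-based prediction counts quoted above.
int main()
{
    const int numClasses = 80;
    const int perBox = 5 + numClasses;   // x, y, w, h, obj + 80 class scores
    const int anchorsPerCell = 3;

    // 416*416 input -> 13*13, 26*26, 52*52 feature maps
    long long n416 = (long long)anchorsPerCell * (13*13 + 26*26 + 52*52) * perBox;
    // 640*640 input -> 20*20, 40*40, 80*80 feature maps
    long long n640 = (long long)anchorsPerCell * (20*20 + 40*40 + 80*80) * perBox;

    std::printf("416 input: %lld values\n", n416);   // 904995
    std::printf("640 input: %lld values\n", n640);   // 2142000
    return 0;
}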

② Anchor Free method

In Yolox-Darknet53 the anchor-free method is used. The network output is no longer the per-anchor feature maps of Yolov3 but a single 8400*85 feature vector. That amounts to 8400*85 = 714,000 prediction values, two-thirds fewer than the anchor-based method.

What about anchor box sizes? In the anchor-based method, every feature map cell has three anchor boxes of different sizes. Yolox-Darknet53 still retains a notion of box scale, but it cleverly derives it from the downsampling factors of the Backbone.

[Figure: anchor-free predictions on the three feature maps with strides 32, 16 and 8]
The top branch is downsampled 5 times (2^5 = 32), and Decoupled Head① outputs 20*20*85, so its 400 prediction boxes correspond to an anchor size of 32*32 on the input image. The middle branch has 1600 prediction boxes with a corresponding anchor size of 16*16, and the bottom branch has 6400 prediction boxes with a corresponding anchor size of 8*8.

1.4.3. Label assignment

1.4.4. Loss calculation

1.5. Yolox-s, m, l, x series

Yolov5s network structure:
[Figure: Yolov5s network structure]
Yolox-s network structure:
[Figure: Yolox-s network structure]
Comparing the two figures above with the earlier content, the main differences between Yolov5s and Yolox-s are: (1) Input: Mixup augmentation is added on top of Mosaic augmentation; (2) Backbone: the activation function is SiLU; (3) Neck: the activation function is SiLU; (4) Output: the detection head is changed to a decoupled head, and anchor-free, multi positives and SimOTA are used. On top of the earlier Yolov3 baseline, these tricks bring a very good gain.
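For reference, SiLU (also called Swish) is simply x multiplied by its sigmoid; a one-line sketch for clarity:

#include <cmath>

// SiLU / Swish activation used in the Yolox-s backbone and neck:
// silu(x) = x * sigmoid(x) = x / (1 + exp(-x))
inline float silu(float x)
{
    return x / (1.0f + std::exp(-x));
}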

[Figure: comparison of the Yolox and Yolov5 model series]
It can be seen that for a speed cost of about 1 ms, AP improves by 0.8 to 2.9 points. The lighter the network, the larger the gain: Yolox-s gains the most, 2.9 points, while as the depth and width of the network grow the gain slowly shrinks, down to 0.8 points for Yolox-x.

1.6. Lightweight network research

After improving the Yolov3 and Yolov5 series, the author designed two lightweight networks to compare against Yolov4-Tiny and NanoDet. During this study the author made two findings, concerning lightweight networks and the pros and cons of data augmentation.

1.6.1. Lightweight network

Because of the needs of real-world scenarios, Yolo models are often ported to edge devices. Therefore the Yolox-Tiny network was built to compare with Yolov4-Tiny, and the Yolox-Nano network was built to compare with the FCOS-style NanoDet.
[Table: comparison of lightweight models]
The table above shows: (1) Compared with Yolov4-Tiny, Yolox-Tiny gains 9 AP points while using about 1M fewer parameters. (2) Compared with NanoDet, Yolox-Nano gains 1.8 points with its parameter count reduced to only 0.91M. (3) The overall design of Yolox therefore also brings large improvements for lightweight models.

1.6.2. Advantages and disadvantages of data enhancement

Many of the Yolox comparison tests use data augmentation, but different network structures, some deep and some shallow, have different learning capacities. Is unrestrained data augmentation really always better? The author's team ran a comparison on this question as well.
[Table: data augmentation ablation]

Through the above table, we can find the following:

① Mosaic and Mixup hybrid strategy. (1) For the lightweight Yolox-Nano, adding Mixup on top of Mosaic makes the AP drop rather than rise, from 25.3 to 24.0. (2) For the deeper Yolox-L, adding Mixup on top of Mosaic raises the AP from 48.6 to 49.5. (3) Different network structures therefore call for different augmentation strategies; for Yolox-s, Yolox-m, or the Yolov4 and Yolov5 series, it is worth trying different augmentation settings.

② Scale augmentation strategy. In Mosaic augmentation, the random_perspective function in yolox/data/data_augment.py draws a random value for the image scaling coefficient when building the affine transformation matrix.

For Yolox-l, the random scale range is set to [0.1, 2], the default parameter in the paper; for a lightweight model such as Yolox-Nano, on the one hand only Mosaic augmentation is used (no Mixup), and on the other hand the random scale range is narrowed to [0.5, 1.5], weakening the strength of the Mosaic augmentation.
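To make the two settings concrete, here is a small sketch of how such a random scale factor could be drawn (C++ for consistency with the rest of this post; the actual logic is the Python random_perspective function in the repository, so treat this as an illustration of the ranges only):

#include <random>

// Illustrative only: draw a random scale factor for the Mosaic affine step.
// Ranges follow the text above: [0.1, 2.0] for Yolox-l, [0.5, 1.5] for
// lightweight models such as Yolox-Nano.
float randomScale(bool lightweightModel, std::mt19937 &gen)
{
    std::uniform_real_distribution<float> dist(lightweightModel ? 0.5f : 0.1f,
                                               lightweightModel ? 1.5f : 2.0f);
    return dist(gen);
}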

1.7. Implementation results of Yolox

1.7.1. Accuracy and speed comparison

Earlier we looked at the reasons and principles behind Yolox's various trick improvements. Now let's look at the overall comparison of accuracy and speed across the models:
[Figure: speed/accuracy comparison of the standard models (left) and parameter/accuracy comparison of the lightweight models (right)]
The left plot compares the standard network structures, mainly in terms of speed and accuracy. The right plot compares the lightweight networks, mainly in terms of parameter count and accuracy.

From the left plot: (1) Compared with Yolov4-CSP and the comparable Yolov5-l, Yolox-l reaches 50.0% AP on the COCO dataset at almost the same speed, exceeding Yolov5-l by 1.8 percentage points. (2) Yolox-Darknet53 reaches 47.3% AP, about 3 percentage points higher than its Darknet53-based Yolov3 counterpart at almost the same speed.

From the right plot: (1) Compared with NanoDet, Yolox-Nano has fewer parameters and fewer GFLOPs (0.91M parameters, 1.08 GFLOPs) yet reaches 25.3% AP, exceeding NanoDet by 1.8 percentage points. (2) Compared with Yolov4-Tiny, Yolox-Tiny uses fewer parameters and GFLOPs while exceeding it by 9 percentage points in accuracy.

1.7.2. Autonomous driving competition

In the Streaming Perception Challenge track of the CVPR 2021 autonomous-driving competition, one of the main focuses is real-time 2D object detection on video streams in autonomous-driving scenarios. A server sends images and receives detection results to simulate a 30 FPS video stream, and the client performs real-time inference on each received image. Contest address: https://eval.ai/web/challenges/challenge-page/800/overview

In the competition, Megvii Technology used Yolox-l as the entry model with TensorRT for inference acceleration, and won first place on both the full track and the detection-only track. The various improvements in Yolox are therefore quite solid and well worth studying in depth.

1.8. Network training

Reference link: https://github.com/Megvii-BaseDetection/YOLOX/blob/main/docs/train_custom_data.md

2. Test

First clone the project and install it

git clone [email protected]:Megvii-BaseDetection/YOLOX.git
cd YOLOX
pip3 install -v -e .  # or  python3 setup.py develop

We take the yolox-m model as an example for testing. Download the pre-trained weights with wget:

wget https://github.com/Megvii-BaseDetection/YOLOX/releases/download/0.1.1rc0/yolox_m.pth

2.1. Official script test

2.1.1. Torch model test

python tools/demo.py image -n yolox-m -c yolox_m.pth --path assets/dog.jpg --conf 0.25 --nms 0.45 --tsize 640 --save_result --device [cpu/gpu]

For a 1080p video, tested with CPU and GPU respectively, the per-frame inference time is about 650 ms and 20 ms.

2.1.2. ONNX model test

The official download link for the ONNX model: https://ghproxy.com/https://github.com/Megvii-BaseDetection/YOLOX/releases/download/0.1.1rc0/yolox_m.onnx

Alternatively, convert from .pth via script

python tools/export_onnx.py --output-name yolox_m.onnx -f exps/default/yolox_m.py -c yolox_m.pth

Test command:

python demo\ONNXRuntime\onnx_inference.py -m yolox_m.onnx -i assets\bus.jpg -o output -s 0.3 --input_shape 640,640

The result graph after execution will be saved in the output folder.

2.1.3. OpenCV DNN test

Pay attention to the post-processing of the output: Yolox produces raw predictions that must be decoded. Using the feature map sizes, grid offsets and strides, the boxes are first mapped to the network input size and then scaled back to the original image.

With 80 classes and a 640*640 input, the network outputs [8400, 85], where 8400 is the total number of candidate boxes and 85 is the per-box information in the format

[center_x, center_y, w,h,    obj-score,     cls1-score, cls2-score, ... , cls80-score]

The 8400 candidate boxes are generated, anchor-free, on three feature maps of sizes [80,80], [40,40] and [20,20], with corresponding strides of 8, 16 and 32. The [80,80] map comes first and contributes 6400 positions, each producing one 85-dimensional prediction; likewise [40,40] contributes 1600 positions and [20,20] contributes 400 positions.

center_x, center_y: the offset relative to the grid cell on the current feature map. For example, on the 40x40 map at grid cell (1, 2), if the network outputs center_x = 0.12 and center_y = 0.08, the position mapped onto the network input is [(1+0.12)*16, (2+0.08)*16].
w, h: the width and height of the box; to map them onto the network input they must be multiplied by the stride of the feature map the box belongs to.
obj-score: the confidence that an object is present.
cls1-score, cls2-score, ..., cls80-score: the confidence of each class; the actual class confidence is obtained by multiplying by obj-score.
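Before the full program, here is the decode of a single 85-value row in isolation (a minimal sketch; grid0, grid1 and stride come from the grid/stride table built for the three feature maps, as in the complete code below):

#include <cmath>

// Decode one 85-value prediction row into a box on the 640*640 network input.
// grid0/grid1 are the cell coordinates and stride is 8, 16 or 32 depending on
// which feature map the row belongs to.
struct DecodedBox { float cx, cy, w, h; };

DecodedBox decodeRow(const float *row, int grid0, int grid1, int stride)
{
    DecodedBox b;
    b.cx = (row[0] + grid0) * stride;      // center x on the network input
    b.cy = (row[1] + grid1) * stride;      // center y on the network input
    b.w  = std::exp(row[2]) * stride;      // box width
    b.h  = std::exp(row[3]) * stride;      // box height
    // row[4] is obj-score; row[5..84] are the 80 class scores.
    return b;
}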

The code is given directly below:

#pragma once

#include "opencv2/opencv.hpp"

#include <fstream>
#include <sstream>
#include <random>
#include <cmath>
#include <algorithm>

using namespace cv;
using namespace dnn;

float inpWidth;
float inpHeight;
float confThreshold, scoreThreshold, nmsThreshold;
std::vector<std::string> classes;
std::vector<cv::Scalar> colors;

bool letterBoxForSquare = true;

cv::Mat formatToSquare(const cv::Mat &source);

void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& out, Net& net);

void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame);

std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<int> dis(100, 255);

// Grid cell and stride of one anchor-free prediction position
struct GridAndStride
{
    int grid0;
    int grid1;
    int stride;
};

// Build the (grid, stride) table for the three feature maps (80x80, 40x40, 20x20)
static void generate_grids_and_stride(std::vector<int>& strides, std::vector<GridAndStride>& grid_strides)
{
    for(auto stride : strides) {
        int num_grid_y = inpHeight / stride;
        int num_grid_x = inpWidth / stride;
        for(int g1 = 0; g1 < num_grid_y; g1++) {
            for(int g0 = 0; g0 < num_grid_x; g0++) {
                grid_strides.push_back(GridAndStride{ g0, g1, stride });
            }
        }
    }
}

std::vector<GridAndStride> grid_strides;

int NUM_CLASSES;

int testYolo_x()
{
    // Configuration for the selected detection model
    confThreshold = 0.25;
    scoreThreshold = 0.45;
    nmsThreshold = 0.5;
    float scale = 1;  // 1 / 255.0;  //0.00392
    Scalar mean = { 0, 0, 0 };
    bool swapRB = true;
    inpWidth = 640;
    inpHeight = 640;

    String modelPath = R"(E:\DeepLearning\YOLOX\yolox_m.onnx)";
    String configPath;

    String framework = "";

    //int backendId = cv::dnn::DNN_BACKEND_OPENCV;
    //int targetId = cv::dnn::DNN_TARGET_CPU;

    int backendId = cv::dnn::DNN_BACKEND_CUDA;
    int targetId = cv::dnn::DNN_TARGET_CUDA;

    String classesFile = R"(E:\DeepLearning\darknet-yolo3-master\data\coco.names)";

    // Open file with classes names.
    if(!classesFile.empty()) {
        const std::string& file = classesFile;
        std::ifstream ifs(file.c_str());
        if(!ifs.is_open())
            CV_Error(Error::StsError, "File " + file + " not found");
        std::string line;
        while(std::getline(ifs, line)) {
            classes.push_back(line);
            colors.push_back(cv::Scalar(dis(gen), dis(gen), dis(gen)));
        }
    }
    NUM_CLASSES = (int)classes.size();

    std::vector<int> strides = { 8, 16, 32 };
    generate_grids_and_stride(strides, grid_strides);

    // Load a model.
    Net net = readNet(modelPath, configPath, framework);
    net.setPreferableBackend(backendId);
    net.setPreferableTarget(targetId);

    std::vector<String> outNames = net.getUnconnectedOutLayersNames();
    {
        int dims[] = { 1, 3, (int)inpHeight, (int)inpWidth };
        cv::Mat tmp = cv::Mat::zeros(4, dims, CV_32F);
        std::vector<cv::Mat> outs;

        net.setInput(tmp);
        for(int i = 0; i < 10; i++)
            net.forward(outs, outNames); // warmup
    }

    // Create a window
    static const std::string kWinName = "Deep learning object detection in OpenCV";
    cv::namedWindow(kWinName, 0);

    // Open a video file or an image file or a camera stream.
    VideoCapture cap;
    cap.open(R"(E:\DeepLearning\yolov5\data\images\bus.jpg)");

    cv::TickMeter tk;
    Mat frame, blob;

    while(waitKey(1) < 0) {
        cap >> frame;
        if(frame.empty()) {
            waitKey();
            break;
        }

        // Create a 4D blob from a frame.
        cv::Mat modelInput = frame;
        if(letterBoxForSquare && inpWidth == inpHeight)
            modelInput = formatToSquare(modelInput);

        blobFromImage(modelInput, blob, scale, cv::Size(inpWidth, inpHeight), mean, swapRB, false);

        // Run a model.
        net.setInput(blob);

        std::vector<Mat> outs;
        //tk.reset();
        //tk.start();

        auto tt1 = cv::getTickCount();
        net.forward(outs, outNames);
        auto tt2 = cv::getTickCount();

        postprocess(frame, modelInput.size(), outs, net);
        //tk.stop();

        std::string label = format("Inference time: %.2f ms", (tt2 - tt1) / cv::getTickFrequency() * 1000);
        cv::putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

        cv::imshow(kWinName, frame);
    }
    return 0;
}

// Pad the image to a square with black borders (content kept at the top-left)
cv::Mat formatToSquare(const cv::Mat &source)
{
    int col = source.cols;
    int row = source.rows;
    int _max = MAX(col, row);
    cv::Mat result = cv::Mat::zeros(_max, _max, CV_8UC3);
    source.copyTo(result(cv::Rect(0, 0, col, row)));
    return result;
}

void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& outs, Net& net)
{
    // yolox has an output of shape (batchSize, 8400, 85) (box[x,y,w,h] + objectness + 80 class scores)
    auto tt1 = cv::getTickCount();

    float x_factor = inputSz.width / inpWidth;
    float y_factor = inputSz.height / inpHeight;

    std::vector<int> class_ids;
    std::vector<float> confidences;
    std::vector<cv::Rect> boxes;

    float *feat_blob = (float *)outs[0].data;

    const int num_anchors = (int)grid_strides.size();

    // Post-processing loop; can be simplified (see the streamlined version below)
    for(int anchor_idx = 0; anchor_idx < num_anchors; anchor_idx++) {
        const int grid0 = grid_strides[anchor_idx].grid0;
        const int grid1 = grid_strides[anchor_idx].grid1;
        const int stride = grid_strides[anchor_idx].stride;

        const int basic_pos = anchor_idx * (NUM_CLASSES + 5);

        float box_objectness = feat_blob[basic_pos + 4];
        for(int class_idx = 0; class_idx < NUM_CLASSES; class_idx++)
        {
            float box_cls_score = feat_blob[basic_pos + 5 + class_idx];
            float box_prob = box_objectness * box_cls_score;

            if(box_prob > scoreThreshold) {
                class_ids.push_back(class_idx);
                confidences.push_back(box_prob);

                // yolox/models/yolo_head.py decode logic
                float x_center = (feat_blob[basic_pos + 0] + grid0) * stride;
                float y_center = (feat_blob[basic_pos + 1] + grid1) * stride;
                float w = std::exp(feat_blob[basic_pos + 2]) * stride;
                float h = std::exp(feat_blob[basic_pos + 3]) * stride;

                int left = int((x_center - 0.5 * w) * x_factor);
                int top = int((y_center - 0.5 * h) * y_factor);
                int width = int(w * x_factor);
                int height = int(h * y_factor);

                boxes.push_back(cv::Rect(left, top, width, height));
            }
        } // class loop
    }

    std::vector<int> indices;
    NMSBoxes(boxes, confidences, scoreThreshold, nmsThreshold, indices);

    auto tt2 = cv::getTickCount();
    std::string label = format("NMS time: %.2f ms", (tt2 - tt1) / cv::getTickFrequency() * 1000);
    cv::putText(frame, label, Point(0, 30), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

    for(size_t i = 0; i < indices.size(); ++i) {
        int idx = indices[i];
        Rect box = boxes[idx];
        drawPred(class_ids[idx], confidences[idx], box.x, box.y,
                 box.x + box.width, box.y + box.height, frame);
    }
}

void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame)
{
    rectangle(frame, Point(left, top), Point(right, bottom), Scalar(0, 255, 0));

    std::string label = format("%.2f", conf);
    Scalar color = Scalar::all(255);
    if(!classes.empty()) {
        CV_Assert(classId < (int)classes.size());
        label = classes[classId] + ": " + label;
        color = colors[classId];
    }

    int baseLine;
    Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);

    top = std::max(top, labelSize.height);
    rectangle(frame, Point(left, top - labelSize.height),
              Point(left + labelSize.width, top + baseLine), color, FILLED);
    cv::putText(frame, label, Point(left, top), FONT_HERSHEY_SIMPLEX, 0.5, Scalar());
}

The post-processing can be streamlined: for each candidate box, only the class with the highest probability is kept. This speeds up the loop and greatly reduces the NMS time.

void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& outs, Net& net)
{
	....

    for(int anchor_idx = 0; anchor_idx < num_anchors; anchor_idx++) {
        const int grid0 = grid_strides[anchor_idx].grid0;
        const int grid1 = grid_strides[anchor_idx].grid1;
        const int stride = grid_strides[anchor_idx].stride;

        const int basic_pos = anchor_idx * (NUM_CLASSES + 5);

        float *data = feat_blob + basic_pos;
        float confidence = data[4];         // objectness
        if(confidence < confThreshold)      // skip low-objectness boxes early
            continue;

        // Only keep the class with the highest score for this box
        cv::Mat scores(1, (int)classes.size(), CV_32FC1, data + 5);
        cv::Point class_id;
        double max_class_score;
        minMaxLoc(scores, 0, &max_class_score, 0, &class_id);

        float box_prob = confidence * max_class_score;

        if(box_prob > scoreThreshold) {
            class_ids.push_back(class_id.x);
            confidences.push_back(box_prob);

            // yolox/models/yolo_head.py decode logic
            float x_center = (feat_blob[basic_pos + 0] + grid0) * stride;
            float y_center = (feat_blob[basic_pos + 1] + grid1) * stride;
            float w = std::exp(feat_blob[basic_pos + 2]) * stride;
            float h = std::exp(feat_blob[basic_pos + 3]) * stride;

            int left = int((x_center - 0.5 * w) * x_factor);
            int top = int((y_center - 0.5 * h) * y_factor);
            int width = int(w * x_factor);
            int height = int(h * y_factor);

            boxes.push_back(cv::Rect(left, top, width, height));
        }
    }

    ...  // nms + draw
}

Test results:
cuda 36ms, cpu 420ms, fp16 650ms.


2.2. Test summary and comparison

Summary of tests of the same ONNX model on other frameworks:
opencv cuda: 36 ms
opencv cpu: 650 ms
opencv cuda fp16: 420 ms

The timings include preprocessing, inference and post-processing.
openvino (CPU): 199 ms
onnxruntime (GPU): 22 ms
TensorRT: 13 ms


Origin: blog.csdn.net/wanggao_1990/article/details/133983897