OpenCV DNN module example (17): object detection (object_detection) with YOLO v5

Two months after YOLO v4, which was introduced in the previous article [OpenCV DNN module example (16): object detection (object_detection) with YOLO v4], Ultralytics released the first official version of YOLO v5, whose performance is comparable to YOLO v4.

Insert image description here

YOLO v5 actually has no direct lineage from YOLO v4; both are improvements built on YOLO v3. Because no accompanying paper was published, and because of open-source licensing and other issues, it has been questioned whether it can really be regarded as a new generation of YOLO. For learning and practical use, however, it does not matter: as the saying goes, a cat that catches mice is a good cat, whether black or white.

1. Explanation of the differences between YOLO v5 and YOLO v4

The following compares YOLO V5 and V4 from several aspects, briefly describing the new techniques each adopts and noting their similarities and differences.

1.1. Data Augmentation

YOLO V4 applies a combination of several data augmentation techniques to a single image. In addition to classic geometric and lighting distortions, it also makes innovative use of image-occlusion techniques (Random Erase, Cutout, Hide and Seek, Grid Mask, MixUp). For multi-image combination, the authors use a mixture of CutMix and Mosaic. In addition, Self-Adversarial Training (SAT) is used for data augmentation.

The author of YOLO V5 has not published a paper yet, so its data augmentation pipeline can only be understood from the code.
YOLO V5 passes each batch of training data through its data loader, which augments the data on the fly.
The data loader performs three kinds of augmentation: scaling, color-space adjustment, and mosaic augmentation.
Interestingly, media reports describe Glenn Jocher, the author of YOLO V5, as the creator of mosaic augmentation, and he believes the huge performance improvement of YOLO V4 is largely due to it. Perhaps he was not convinced: only two months after the release of YOLO V4, YOLO V5 was launched. Whether the name YOLO V5 is kept or another name is adopted in the future depends first on whether its final results can truly surpass YOLO V4.
It is undeniable, though, that mosaic augmentation does effectively address the most troublesome "small object problem" in model training, namely that small objects are not detected as accurately as large ones.
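
To make the idea concrete, below is a minimal, simplified sketch of mosaic augmentation in Python (a fixed 2×2 grid rather than YOLO V5's random-center variant; the function name simple_mosaic and the box format [class, x1, y1, x2, y2] in pixels are choices made purely for illustration):

import cv2
import numpy as np

def simple_mosaic(images, boxes_list, out_size=640):
    """Minimal mosaic sketch: paste 4 images into a fixed 2x2 grid.

    images     : list of 4 BGR images
    boxes_list : list of 4 arrays with rows [class, x1, y1, x2, y2] in pixels
    returns    : (mosaic image, merged boxes in mosaic coordinates)
    """
    s = out_size // 2                               # each tile is s x s
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    offsets = [(0, 0), (s, 0), (0, s), (s, s)]      # top-left corner of each tile
    merged = []

    for (img, boxes), (ox, oy) in zip(zip(images, boxes_list), offsets):
        h, w = img.shape[:2]
        canvas[oy:oy + s, ox:ox + s] = cv2.resize(img, (s, s))
        if len(boxes):
            b = np.array(boxes, dtype=np.float32)
            b[:, [1, 3]] = b[:, [1, 3]] * (s / w) + ox   # scale and shift x coordinates
            b[:, [2, 4]] = b[:, [2, 4]] * (s / h) + oy   # scale and shift y coordinates
            merged.append(b)

    merged = np.concatenate(merged) if merged else np.zeros((0, 5), np.float32)
    return canvas, merged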

1.2. Auto-Learning Bounding Box Anchors (adaptive anchor boxes)

In the earlier YOLO V3, k-means and a genetic algorithm were used to analyze the custom dataset and derive preset anchor boxes suited to predicting object bounding boxes on that dataset.
Insert image description here
In YOLO V5, anchor boxes are learned automatically from the training data; YOLO V4 does not have adaptive anchor boxes.

For the COCO dataset, the anchor box sizes for a 640×640 input are preset in YOLO V5's *.yaml configuration file:

anchors:
  - [10,13, 16,30, 33,23]  		# P3/8
  - [30,61, 62,45, 59,119]  	# P4/16
  - [116,90, 156,198, 373,326]  # P5/32

For custom datasets, since the detection framework usually rescales the original images and the object sizes may differ from those in COCO, YOLO V5 automatically re-learns suitable anchor box sizes.
Insert image description here
The picture above shows YOLO V5 learning anchor box sizes automatically. For the BDD100K dataset, with images scaled to 512, the optimal anchor boxes are:
Insert image description here
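
The clustering idea behind this auto-anchor step can be illustrated with a plain k-means over the labeled box sizes; the sketch below (with the hypothetical helper kmeans_anchors) only shows the clustering part, whereas YOLO V5 additionally refines the result with a genetic algorithm and an IoU-based fitness measure:

import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs of training boxes into k anchor sizes.

    wh : (N, 2) array of box widths and heights, already scaled to the training image size.
    """
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(np.float64)

    for _ in range(iters):
        # assign every box to the nearest anchor (Euclidean distance in w/h space)
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # move each anchor to the mean of the boxes assigned to it
        for j in range(k):
            if np.any(assign == j):
                centers[j] = wh[assign == j].mean(axis=0)

    return centers[np.argsort(centers.prod(axis=1))]   # sorted by area, as in the yaml above

# e.g. anchors = kmeans_anchors(np.asarray(all_box_wh)) with all_box_wh collected from the labels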

1.3. Backbone - Cross Stage Partial Network (CSP)

Both YOLO V5 and V4 use CSPDarknet as the backbone. CSPNet (Cross Stage Partial Network) addresses the problem of duplicated gradient information in the backbones of other large convolutional neural network frameworks: it integrates the gradient changes into the feature map from end to end, which reduces the number of parameters and FLOPS, preserving inference speed and accuracy while shrinking the model size.

1.4. Neck - Path Aggregation Network (PANet)

The Neck is mainly used to build feature pyramids. Feature pyramids enhance the model's ability to detect objects at different scales, allowing it to recognize the same object at different sizes.

Before PANet appeared, FPN was the state of the art for the feature aggregation layer in object detection frameworks.

In the YOLO V4 study, PANet was considered the most suitable feature fusion network for YOLO, so both YOLO V5 and V4 use PANet as the Neck to aggregate features.

1.5. Head - the YOLO detection layer

The model Head performs the final detection step. It applies anchor boxes to the feature maps and produces the final output vectors containing class probabilities, objectness scores, and bounding boxes.

In the YOLO V5 model, the model Head is the same as the previous YOLO V3 and V4 versions.
Insert image description here
Heads at different scales are used to detect objects of different sizes (for a 608 input, the deepest output is downsampled 5 times, i.e. by a factor of 32). Each head outputs (80 classes + 1 objectness score + 4 box coordinates) × 3 anchor boxes, i.e. 255 channels in total.
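
As a quick check of these numbers (here for the 640×640 input size used later in this article), the three heads run at strides 8, 16 and 32, each predicting 3 anchors per grid cell with num_classes + 5 values per anchor; the snippet below simply reproduces the 255-channel figure and the 25200 predictions that appear in the ONNX output shape later on:

# Output-shape arithmetic for YOLOv5 with a 640x640 input (80 COCO classes assumed)
num_classes = 80
values_per_anchor = num_classes + 1 + 4          # class scores + objectness + box (x, y, w, h)
anchors_per_cell = 3
strides = [8, 16, 32]                            # P3, P4, P5 heads

channels = anchors_per_cell * values_per_anchor  # 3 * 85 = 255 output channels per head
grids = [(640 // s) ** 2 for s in strides]       # 80*80, 40*40, 20*20 grid cells
predictions = anchors_per_cell * sum(grids)      # 3 * (6400 + 1600 + 400) = 25200

print(channels, predictions)                     # 255 25200 -> matches the (1, 25200, 85) ONNX output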

1.6. Activation Function

The choice of activation function is crucial for deep learning networks. The author of YOLO V5 used Leaky ReLU and Sigmoid activation functions.

In YOLO V5, the middle/hidden layer uses the Leaky ReLU activation function, and the final detection layer uses the Sigmoid activation function. YOLO V4 uses the Mish activation function.

Mish beats Swish on 39 benchmarks and ReLU on 40 benchmarks, with some results showing 3–5% improvements in benchmark accuracy. But be aware that Mish activation is computationally more expensive compared to ReLU and Swish.
Insert image description here
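
For reference, the three activations mentioned here can be written in a few lines; this is a plain NumPy sketch rather than the framework implementations (the 0.1 slope for Leaky ReLU is the commonly used value and an assumption here):

import numpy as np

def leaky_relu(x, alpha=0.1):          # hidden layers in YOLO V5
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):                        # final detection layer in YOLO V5
    return 1.0 / (1.0 + np.exp(-x))

def mish(x):                           # used by YOLO V4: x * tanh(softplus(x))
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.linspace(-4, 4, 9)
print(leaky_relu(x), sigmoid(x), mish(x), sep="\n")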

1.7. Optimization Function

The author of YOLO V5 provides two optimizers, Adam and SGD, each with matching preset training hyperparameters; SGD is the default.

YOLO V4 uses SGD.

The author of YOLO V5 suggests that Adam is the more suitable choice for training smaller custom datasets, although the learning rate used with Adam is generally lower than with SGD.

For training on large datasets, SGD works better than Adam for YOLO V5.

In fact, there is no settled conclusion in the research community as to whether SGD or Adam is better; it depends on the actual project.

1.8. Cost Function

The loss calculation of the YOLO series is based on the objectness score, the class probability score, and the bounding box regression score.

YOLO V5 uses GIoU loss for the bounding boxes and binary cross-entropy with logits for the class probability and objectness losses. Focal loss can also be enabled through the fl_gamma hyperparameter.

YOLO V4 uses CIoU loss for the bounding boxes. Compared with the other methods mentioned, CIoU brings faster convergence and better performance.
Insert image description here

The results in the above figure are based on Faster R-CNN. It can be seen that CIoU actually performs better than GIoU.
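
To make the difference concrete, GIoU extends plain IoU with a penalty based on the smallest enclosing box (CIoU then adds center-distance and aspect-ratio terms on top). Below is a minimal sketch for two boxes in (x1, y1, x2, y2) format, with the function name giou chosen only for illustration:

def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    # intersection area
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # smallest box enclosing both a and b
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    return iou - (c_area - union) / c_area   # the GIoU loss used in training is 1 - GIoU

print(giou((0, 0, 2, 2), (1, 1, 3, 3)))      # IoU = 1/7 ~ 0.143, GIoU ~ -0.079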

1.9. Benchmarks - YOLO V5 vs. YOLO V4

Until a paper gives a detailed discussion, we can only compare the two by looking at the COCO metrics released by the authors, together with subsequent evaluations by others.

1.9.1. Official performance evaluation

Insert image description here
Insert image description here
In the two figures above, FPS and ms/img are reciprocals of each other. After converting units, YOLO V5 reaches about 250 FPS on a V100 GPU while maintaining a high mAP.

Since YOLO V4 was originally trained on a 1080 Ti, whose performance is far below a V100, and since AP_50 and AP_val are different benchmarks, the two cannot be compared from the table above alone.

Fortunately, WongKinYiu, the second author of YOLO V4, used the V100 GPU to provide comparable benchmarks.
Insert image description here

As the chart shows, the performance of the two is actually very close, but according to the data YOLO V4 remains the stronger object detection framework. YOLO V4 is highly customizable: if you are not afraid of extra custom configuration, the Darknet-based YOLO V4 is still the most accurate.

It is worth noting that YOLO V4 actually uses many of the data augmentation techniques from the Ultralytics YOLOv3 code base, and the same techniques are used in YOLO V5. How much the augmentation contributes to the results will have to wait for analysis in a paper.

1.9.2. Training time

According to Roboflow's research, YOLO V5 trains very quickly, far exceeding YOLO V4 in training speed. On Roboflow's custom dataset, YOLO V4 took 14 hours to reach its best validation score, while YOLO V5 took only 3.5 hours.

Insert image description here

1.9.3. Model size

The model sizes in the figure are: V5x: 367 MB, V5l: 192 MB, V5m: 84 MB, V5s: 27 MB, YOLO V4: 245 MB. The YOLO V5s model is very small, which reduces deployment cost and makes rapid deployment practical.
Insert image description here

1.9.4. Inference time

Insert image description here

On a single image (batch size 1), YOLOV4 infers in 22 ms and YOLOV5s infers in 20 ms.

The YOLO V5 implementation defaults to batch inference (batch size 36) and divides the batch processing time by the number of images in the batch, so a single image can be processed in as little as 7 ms, about 140 FPS, which was state of the art for object detection at the time.

I used my trained model to run real-time inference on 10,000 test images. The inference speed of YOLO V5s is astonishing: only about 7 ms per image. Combined with a model size of just over 20 MB, it is unrivaled in flexibility.

In fact, this is not entirely fair to YOLO V4: since YOLO V4 does not do batch inference by default, it is at a disadvantage in this comparison. More tests of the two frameworks under the same benchmark are needed.

Secondly, YOLO V4 has recently released a tiny version; comparing the performance and speed of YOLO V5s against YOLO V4-tiny will require more hands-on analysis.

1.10. Comparison and summary

In general, YOLO V4 is better than YOLO V5 in performance, but weaker than YOLO V5 in flexibility and speed.

Since YOLO V5 is still being updated rapidly, the final research results of YOLO V5 remain to be analyzed.

I personally think that for these object detection frameworks, the performance of the feature fusion layer is very important. Currently, both use PANET, but according to research by Google Brain, BiFPN is the best choice for the feature fusion layer. Whoever can integrate this technology is likely to achieve significant performance improvements.

Insert image description here

Although YOLO V5 is, for now, slightly behind in accuracy, it still has the following significant advantages:

  • It uses the PyTorch framework, which is very user-friendly and makes it easy to train on your own datasets; compared with the Darknet framework used by YOLO V4, PyTorch is easier to put into production.

  • The code is easy to read and integrates a large number of computer vision technologies, which is very conducive to learning and reference.

  • Not only is it easy to configure the environment, model training is also very fast, and batch inference produces real-time results.

  • It can run efficient inference directly on single images, batched images, videos, and even webcam inputs.

  • The PyTorch weights can easily be converted to the ONNX format (usable on Android), then to the format used by OpenCV, or converted via CoreML for iOS and deployed directly in a mobile application.

  • Finally, the detection speed of YOLO V5s, up to 140 FPS, is very impressive, and the user experience is great.

2. YOLO v5 test

The current YOLO v5 project address is https://github.com/ultralytics/yolov5 , and the version has been updated to v7.0.

2.1. python test

2.1.1. Installation

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

2.1.2. Inference

  • Using YOLOv5 hub inference; the latest model is downloaded automatically from the YOLOv5 releases.

    	import torch
    	# Model
    	model = torch.hub.load("ultralytics/yolov5", "yolov5s")  # or yolov5n - yolov5x6, custom
    	# Images
    	img = "https://ultralytics.com/images/zidane.jpg"  # or file, Path, PIL, OpenCV, numpy, list
    	# Inference
    	results = model(img)
    	# Results
    	results.print()  # or .show(), .save(), .crop(), .pandas(), etc.
    
  • Inference using detect.py
    detect.py runs inference on various sources. The model is automatically downloaded from the latest YOLOv5 release and the results are saved to runs/detect.

    python detect.py --weights yolov5s.pt --source 0                               # webcam
                                                   img.jpg                         # image
                                                   vid.mp4                         # video
                                                   screen                          # screenshot
                                                   path/                           # directory
                                                   list.txt                        # list of images
                                                   list.streams                    # list of streams
                                                   'path/*.jpg'                    # glob
                                                   'https://youtu.be/LNwODJXcvt4'  # YouTube
                                                   'rtsp://example.com/media.mp4'  # RTSP, RTMP, HTTP stream
    

2.1.3. Test output

Pay attention to the usage of the --dnn and --half parameters and their effect on runtime, focusing on the three reported timings: pre-process, inference, and NMS.

(yolo_pytorch) E:\DeepLearning\yolov5>python detect.py --weights yolov5n.pt --source data/images/bus.jpg
detect: weights=['yolov5n.pt'], source=data/images/bus.jpg, data=data\coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs\detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
YOLOv5  v7.0-167-g5deff14 Python-3.9.16 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11264MiB)

Fusing layers...
YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients
image 1/1 E:\DeepLearning\yolov5\data\images\bus.jpg: 640x480 4 persons, 1 bus, 121.0ms
Speed: 1.0ms pre-process, 121.0ms inference, 38.0ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs\detect\exp2


(yolo_pytorch) E:\DeepLearning\yolov5>python detect.py --weights yolov5n.pt --source data/images/bus.jpg --device 0
detect: weights=['yolov5n.pt'], source=data/images/bus.jpg, data=data\coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=0, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs\detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
YOLOv5  v7.0-167-g5deff14 Python-3.9.16 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11264MiB)

Fusing layers...
YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients
image 1/1 E:\DeepLearning\yolov5\data\images\bus.jpg: 640x480 4 persons, 1 bus, 11.0ms
Speed: 0.0ms pre-process, 11.0ms inference, 7.0ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs\detect\exp2


(yolo_pytorch) E:\DeepLearning\yolov5>python detect.py --weights yolov5n.pt --source data/images/bus.jpg --dnn
detect: weights=['yolov5n.pt'], source=data/images/bus.jpg, data=data\coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs\detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=True, vid_stride=1
YOLOv5  v7.0-167-g5deff14 Python-3.9.16 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11264MiB)

Fusing layers...
YOLOv5n summary: 213 layers, 1867405 parameters, 0 gradients
image 1/1 E:\DeepLearning\yolov5\data\images\bus.jpg: 640x480 4 persons, 1 bus, 10.0ms
Speed: 0.0ms pre-process, 10.0ms inference, 4.0ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs\detect\exp3

Other test comparisons (times in ms per image):

            pre-process   inference   NMS
cpu:              1           121      38
gpu:              0            11       7
dnn:              0            10       4
gpu-half:         0            10       4
dnn-half:         1            11       4

2.2. C++ test

Here, the OpenCV DNN module is used to load the ONNX model exported from YOLOv5 for testing.

2.2.1. Model export

The official releases actually provide ONNX exports of each model version, but they are all half-precision models and cannot be used directly with OpenCV DNN.

Here we take yolov5x as an example and export it to ONNX. For first-time use you can check the parameters of export.py in the file itself or from the command line, as shown below. Note that when exporting you should select an ONNX opset version compatible with your OpenCV DNN version (a quick load check with OpenCV is sketched after the export log).

(yolo_pytorch) E:\DeepLearning\yolov5>python export.py --weights yolov5x.pt --include onnx --opset 12
export: data=E:\DeepLearning\yolov5\data\coco128.yaml, weights=['yolov5x.pt'], imgsz=[640, 640], batch_size=1, device=cpu, half=False, inplace=False, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=12, verbose=False, workspace=4, nms=False, agnostic_nms=False, topk_per_class=100, topk_all=100, iou_thres=0.45, conf_thres=0.25, include=['onnx']
YOLOv5  v7.0-167-g5deff14 Python-3.9.16 torch-1.13.1+cu117 CPU

Fusing layers...
YOLOv5x summary: 444 layers, 86705005 parameters, 0 gradients

PyTorch: starting from yolov5x.pt with output shape (1, 25200, 85) (166.0 MB)

ONNX: starting export with onnx 1.14.0...
ONNX: export success  10.0s, saved as yolov5x.onnx (331.2 MB)

Export complete (15.0s)
Results saved to E:\DeepLearning\yolov5
Detect:          python detect.py --weights yolov5x.onnx
Validate:        python val.py --weights yolov5x.onnx
PyTorch Hub:     model = torch.hub.load('ultralytics/yolov5', 'custom', 'yolov5x.onnx')
Visualize:       https://netron.app
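
Before moving to C++, a quick way to confirm that the exported model is usable is to load it with OpenCV's Python bindings and run a dummy forward pass. This is only a minimal sanity check, assuming yolov5x.onnx is in the current directory:

import cv2
import numpy as np

net = cv2.dnn.readNet("yolov5x.onnx")   # fails here if the opset is not supported by this OpenCV build

blob = cv2.dnn.blobFromImage(np.zeros((640, 640, 3), np.uint8),
                             1 / 255.0, (640, 640), swapRB=True)
net.setInput(blob)
out = net.forward()

print(out.shape)                        # expected (1, 25200, 85), matching the shape reported by export.py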

2.2.2. OpenCV DNN C++ code test

The main body of the code is the same as in the YOLOv4 example; the main differences are:

  • Preprocessing can optionally pad the image to a square before resizing, so that the aspect ratio is preserved and the size matches the network input; see the formatToSquare() function.
  • The post-processing code contains some slight adjustments to how the network output is parsed.

The complete code is as follows

#pragma once

#include "opencv2/opencv.hpp"

#include <fstream>
#include <sstream>
#include <random>

using namespace cv;
using namespace dnn;

 float inpWidth;
 float inpHeight;
 float confThreshold, scoreThreshold, nmsThreshold;
 std::vector<std::string> classes;
 std::vector<cv::Scalar> colors;

 bool letterBoxForSquare = true;

 cv::Mat formatToSquare(const cv::Mat &source);

 void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& out, Net& net);

 void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame);

std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<int> dis(100, 255);

int main()
{
    // Configure according to the selected detection model
    confThreshold = 0.25;
    scoreThreshold = 0.45;
    nmsThreshold = 0.5;
    float scale = 1 / 255.0;  // = 0.00392
    Scalar mean = {0, 0, 0};
    bool swapRB = true;
    inpWidth = 640;
    inpHeight = 640;

    String model_dir = R"(E:\DeepLearning\yolov5)";
    String modelPath = model_dir + R"(\yolov5n.onnx)";
    String configPath;

    String framework = "";
    int backendId = cv::dnn::DNN_BACKEND_CUDA;
    int targetId = cv::dnn::DNN_TARGET_CUDA;

    String classesFile = R"(model\object_detection_classes_yolov3.txt)";

    // Open file with classes names.
    if(!classesFile.empty()) {
        const std::string& file = classesFile;
        std::ifstream ifs(file.c_str());
        if(!ifs.is_open())
            CV_Error(Error::StsError, "File " + file + " not found");
        std::string line;
        while(std::getline(ifs, line)) {
            classes.push_back(line);
            colors.push_back(cv::Scalar(dis(gen), dis(gen), dis(gen)));
        }
    }
    // Load a model.
    Net net = readNet(modelPath, configPath, framework);
    net.setPreferableBackend(backendId);
    net.setPreferableTarget(targetId);

    std::vector<String> outNames = net.getUnconnectedOutLayersNames();
    {
        // Warm up the network with an all-zero input so later timings exclude initialization
        int dims[] = {1, 3, (int)inpHeight, (int)inpWidth};
        cv::Mat tmp = cv::Mat::zeros(4, dims, CV_32F);
        std::vector<cv::Mat> outs;

        net.setInput(tmp);
        for(int i = 0; i < 10; i++)
            net.forward(outs, outNames); // warmup
    }

    // Create a window
    static const std::string kWinName = "Deep learning object detection in OpenCV";

    cv::namedWindow(kWinName, 0);

    // Open a video file or an image file or a camera stream.
    VideoCapture cap;
    //cap.open(0);
    cap.open(R"(E:\DeepLearning\yolov5\data\images\bus.jpg)");

    cv::TickMeter tk;
    // Process frames.
    Mat frame, blob;

    while(waitKey(1) < 0) {
        //tk.reset();
        //tk.start();

        cap >> frame;
        if(frame.empty()) {
            waitKey();
            break;
        }

        // Create a 4D blob from a frame.
        cv::Mat modelInput = frame;
        if(letterBoxForSquare && inpWidth == inpHeight)
            modelInput = formatToSquare(modelInput);
            
        blobFromImage(modelInput, blob, scale, cv::Size2f(inpWidth, inpHeight), mean, swapRB, false);

        // Run a model.
        net.setInput(blob);

        std::vector<Mat> outs;
        //tk.reset();
        //tk.start();

        auto tt1 = cv::getTickCount();
        net.forward(outs, outNames);
        auto tt2 = cv::getTickCount();

        tk.stop();
        postprocess(frame, modelInput.size(), outs, net);
        //tk.stop();

        // Put efficiency information.
        std::vector<double> layersTimes;
        double freq = getTickFrequency() / 1000;
        double t = net.getPerfProfile(layersTimes) / freq;
        std::string label = format("Inference time: %.2f ms  (%.2f ms)", t, /*tk.getTimeMilli()*/ (tt2 - tt1) / cv::getTickFrequency() * 1000);
        cv::putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

        cv::imshow(kWinName, frame);
    }
    return 0;
}

cv::Mat formatToSquare(const cv::Mat &source)
{
    // Pad the image to a square (zeros on the right/bottom) so that resizing to the
    // network input does not change the aspect ratio.
    int col = source.cols;
    int row = source.rows;
    int _max = MAX(col, row);
    cv::Mat result = cv::Mat::zeros(_max, _max, CV_8UC3);
    source.copyTo(result(cv::Rect(0, 0, col, row)));
    return result;
}

void postprocess(Mat& frame, cv::Size inputSz, const std::vector<Mat>& outs, Net& net)
{
    // yolov5 output shape is (batchSize, 25200, 85): box[x,y,w,h] + objectness + 80 class scores

    auto tt1 = cv::getTickCount();

    // Scale factors from the network input size back to the (possibly padded) source image
    //float x_factor = frame.cols / inpWidth;
    //float y_factor = frame.rows / inpHeight;
    float x_factor = inputSz.width / inpWidth;
    float y_factor = inputSz.height / inpHeight;

    std::vector<int> class_ids;
    std::vector<float> confidences;
    std::vector<cv::Rect> boxes;

    int rows = outs[0].size[1];
    int dimensions = outs[0].size[2];

    float *data = (float *)outs[0].data;

    for(int i = 0; i < rows; ++i) {
        float confidence = data[4];   // objectness score

        if(confidence >= confThreshold) {
            float *classes_scores = data + 5;

            cv::Mat scores(1, classes.size(), CV_32FC1, classes_scores);
            cv::Point class_id;
            double max_class_score;

            minMaxLoc(scores, 0, &max_class_score, 0, &class_id);

            if(max_class_score > scoreThreshold) {
                confidences.push_back(confidence);
                class_ids.push_back(class_id.x);

                float x = data[0];
                float y = data[1];
                float w = data[2];
                float h = data[3];

                int left = int((x - 0.5 * w) * x_factor);
                int top = int((y - 0.5 * h) * y_factor);
                int width = int(w * x_factor);
                int height = int(h * y_factor);
               
                boxes.push_back(cv::Rect(left, top, width, height));
            }
        }

        data += dimensions;
    }

    std::vector<int> indices;
    NMSBoxes(boxes, confidences, scoreThreshold, nmsThreshold, indices);
     
    auto tt2 = cv::getTickCount();
    std::string label = format("NMS time: %.2f ms",  (tt2 - tt1) / cv::getTickFrequency() * 1000);
    cv::putText(frame, label, Point(0, 30), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 255, 0));

    for(size_t i = 0; i < indices.size(); ++i) {
        int idx = indices[i];
        Rect box = boxes[idx];
        drawPred(class_ids[idx], confidences[idx], box.x, box.y,
                 box.x + box.width, box.y + box.height, frame);
    }
}

void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame)
{
    rectangle(frame, Point(left, top), Point(right, bottom), Scalar(0, 255, 0));

    std::string label = format("%.2f", conf);
    Scalar color = Scalar::all(255);
    if(!classes.empty()) {
        CV_Assert(classId < (int)classes.size());
        label = classes[classId] + ": " + label;
        color = colors[classId];
    }

    int baseLine;
    Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);

    top = max(top, labelSize.height);
    rectangle(frame, Point(left, top - labelSize.height),
              Point(left + labelSize.width, top + baseLine), color, FILLED);
    cv::putText(frame, label, Point(left, top), FONT_HERSHEY_SIMPLEX, 0.5, Scalar());
}

2.2.3. Test results

In the earlier Python test on the GPU, forward inference took about 10 ms and NMS about 4 ms. Here, using OpenCV DNN (with the CUDA backend configured in the code above), forward inference takes about 7 ms and NMS about 0.3 ms.
Insert image description here

3. Custom data set training

Here, yolov5s is used as the pre-trained model to train a detector for 4 vehicle classes.

3.1. Data set preparation

First, label the images yourself; taking the VOC format as an example, use the labelImg tool for annotation. labelImg's default annotation format is VOC XML, which needs to be converted to the TXT format required by YOLO through a script (alternatively, the labeling tool can also save directly in YOLO's TXT format).
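
A minimal conversion sketch is shown below; it assumes the 4 vehicle class names used later in myvoc.yaml, one VOC XML file per image, and illustrative paths and function name (voc_to_yolo):

import os
import xml.etree.ElementTree as ET

CLASSES = ["car", "huoche", "guache", "keche"]   # must match the order in myvoc.yaml

def voc_to_yolo(xml_path, out_dir):
    root = ET.parse(xml_path).getroot()
    size = root.find("size")
    img_w, img_h = float(size.find("width").text), float(size.find("height").text)

    lines = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        if name not in CLASSES:
            continue
        box = obj.find("bndbox")
        x1, y1 = float(box.find("xmin").text), float(box.find("ymin").text)
        x2, y2 = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class_id, center x/y and width/height, all normalized to [0, 1]
        cx, cy = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
        w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
        lines.append(f"{CLASSES.index(name)} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")

    txt_path = os.path.join(out_dir, os.path.splitext(os.path.basename(xml_path))[0] + ".txt")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))

# e.g. for f in os.listdir("Annotations"): voc_to_yolo(os.path.join("Annotations", f), "labels")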

From the VOC layout we only need the JPEGImages and labels folders. After annotation is complete, place the images and the generated label files in any directory, for example E:\DeepLearning\yolov5\custom-data\vehicle, with images in an images folder and label files in a labels folder (the yolov5 default layout; otherwise you need to modify the two path substrings in the img2label_paths function in yolov5/utils/dataloaders.py).

vehicle
├── images
│   ├── 20151127_114556.jpg
│   ├── 20151127_114946.jpg
│   └── 20151127_115133.jpg
├── labels
│   ├── 20151127_114556.txt
│   ├── 20151127_114946.txt
│   └── 20151127_115133.txt

After that, prepare the list files train.txt, val.txt, and test.txt for the training, validation, and (optional) test sets. Each file stores the absolute paths of its images, with the images split randomly in a ratio such as 7:2:1, as sketched below.
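
A minimal splitting sketch, assuming the vehicle directory layout above (the 7:2:1 ratio, the fixed seed, and the .jpg extension are just one possible choice):

import glob
import os
import random

image_dir = r"E:\DeepLearning\yolov5\custom-data\vehicle\images"
out_dir = r"E:\DeepLearning\yolov5\custom-data\vehicle"

paths = sorted(glob.glob(os.path.join(image_dir, "*.jpg")))
random.seed(0)
random.shuffle(paths)

n = len(paths)
n_train, n_val = int(n * 0.7), int(n * 0.2)          # 7:2:1 split
splits = {
    "train.txt": paths[:n_train],
    "val.txt":   paths[n_train:n_train + n_val],
    "test.txt":  paths[n_train + n_val:],
}

for name, files in splits.items():
    with open(os.path.join(out_dir, name), "w") as f:
        f.write("\n".join(files))                    # one absolute image path per line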

3.2. Configuration file

Copy the data/coco.yaml and models/yolov5s.yaml files to the dataset directory and modify them.

For example, the dataset description file myvoc.yaml:

train: E:/DeepLearning/yolov5/custom-data/vehicle/train.txt
val: E:/DeepLearning/yolov5/custom-data/vehicle/val.txt
 
# number of classes
nc: 4
 
# class names
names: ["car", "huoche", "guache", "keche"]

In the network model configuration file yolov5s.yaml, only the parameter nc needs to be changed to the actual number of detection classes:

# Parameters
nc: 4  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

3.3. Training

As mentioned earlier, once the preparation is complete, the directory structure is as follows.
Insert image description here
After that, we train for 20 epochs. The command for single-GPU training is:

python train.py
	 --weights yolov5s.pt 
	 --cfg custom-data\vehicle\yolov5s.yaml 
	 --data custom-data\vehicle\myvoc.yaml 
	 --epoch 20 
	 --batch-size=32 
	 --img 640 
	 --device 0

The training output is:

E:\DeepLearning\yolov5>python train.py --weights yolov5s.pt --cfg custom-data\vehicle\yolov5s.yaml --data custom-data\vehicle\myvoc.yaml --epoch 20 --batch-size=32 --img 640 --device 0
train: weights=yolov5s.pt, cfg=custom-data\vehicle\yolov5s.yaml, data=custom-data\vehicle\myvoc.yaml, hyp=data\hyps\hyp.scratch-low.yaml, epochs=20, batch_size=32, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs\train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
fatal: unable to access 'http://github.com/ultralytics/yolov5.git/': Recv failure: Connection was reset
Command 'git fetch origin' timed out after 5 seconds
YOLOv5  v7.0-167-g5deff14 Python-3.9.16 torch-1.13.1+cu117 CUDA:0 (NVIDIA GeForce GTX 1080 Ti, 11264MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5  runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs\train', view at http://localhost:6006/

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1     24273  models.yolo.Detect                      [4, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
YOLOv5s summary: 214 layers, 7030417 parameters, 7030417 gradients, 16.0 GFLOPs

Transferred 342/349 items from yolov5s.pt
AMP: checks passed
optimizer: SGD(lr=0.01) with parameter groups 57 weight(decay=0.0), 60 weight(decay=0.0005), 60 bias
train: Scanning E:\DeepLearning\yolov5\custom-data\vehicle\train... 998 images, 0 backgrounds, 0 corrupt: 100%|██████████| 998/998 [00:07<00:00, 141.97it/s]
train: New cache created: E:\DeepLearning\yolov5\custom-data\vehicle\train.cache
val: Scanning E:\DeepLearning\yolov5\custom-data\vehicle\val... 998 images, 0 backgrounds, 0 corrupt: 100%|██████████| 998/998 [00:13<00:00, 72.66it/s]
val: New cache created: E:\DeepLearning\yolov5\custom-data\vehicle\val.cache

AutoAnchor: 4.36 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset
Plotting labels to runs\train\exp13\labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs\train\exp13
Starting training for 20 epochs...

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       0/19      6.36G    0.09633      0.038    0.03865         34        640: 100%|██████████| 32/32 [00:19<00:00,  1.66it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:11<00:00,  1.45it/s]
                   all        998       2353      0.884      0.174      0.248     0.0749

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       1/19       9.9G    0.06125    0.03181    0.02363         26        640: 100%|██████████| 32/32 [00:14<00:00,  2.18it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.50it/s]
                   all        998       2353      0.462      0.374       0.33      0.105

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       2/19       9.9G    0.06124    0.02353    0.02014         18        640: 100%|██████████| 32/32 [00:14<00:00,  2.22it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.58it/s]
                   all        998       2353      0.469      0.472      0.277      0.129

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       3/19       9.9G    0.05214    0.02038     0.0175         27        640: 100%|██████████| 32/32 [00:14<00:00,  2.22it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.56it/s]
                   all        998       2353       0.62       0.64      0.605      0.279

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       4/19       9.9G    0.04481    0.01777    0.01598         23        640: 100%|██████████| 32/32 [00:14<00:00,  2.17it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.60it/s]
                   all        998       2353      0.803      0.706      0.848      0.403

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       5/19       9.9G     0.0381    0.01624    0.01335         19        640: 100%|██████████| 32/32 [00:14<00:00,  2.16it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.55it/s]
                   all        998       2353      0.651      0.872        0.8      0.414

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       6/19       9.9G    0.03379    0.01534    0.01134         28        640: 100%|██████████| 32/32 [00:14<00:00,  2.18it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.58it/s]
                   all        998       2353       0.94      0.932      0.978      0.608

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       7/19       9.9G    0.03228    0.01523    0.00837         10        640: 100%|██████████| 32/32 [00:14<00:00,  2.21it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:09<00:00,  1.67it/s]
                   all        998       2353      0.862      0.932      0.956      0.591

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       8/19       9.9G     0.0292    0.01458   0.007451         20        640: 100%|██████████| 32/32 [00:14<00:00,  2.21it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.56it/s]
                   all        998       2353       0.97      0.954      0.986      0.658

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
       9/19       9.9G    0.02739    0.01407   0.006553         29        640: 100%|██████████| 32/32 [00:15<00:00,  2.12it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.58it/s]
                   all        998       2353      0.982      0.975      0.993       0.74

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      10/19       9.9G     0.0248    0.01362   0.005524         30        640: 100%|██████████| 32/32 [00:14<00:00,  2.14it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.55it/s]
                   all        998       2353      0.985      0.973      0.993      0.757

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      11/19       9.9G    0.02377    0.01271   0.005606         27        640: 100%|██████████| 32/32 [00:15<00:00,  2.13it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.52it/s]
                   all        998       2353      0.964      0.975      0.989      0.725

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      12/19       9.9G    0.02201    0.01247   0.005372         33        640: 100%|██████████| 32/32 [00:14<00:00,  2.19it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.57it/s]
                   all        998       2353      0.988      0.988      0.994       0.83

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      13/19       9.9G    0.02103    0.01193   0.004843         22        640: 100%|██████████| 32/32 [00:14<00:00,  2.14it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.57it/s]
                   all        998       2353      0.981      0.987      0.994      0.817

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      14/19       9.9G    0.02017    0.01167    0.00431         22        640: 100%|██████████| 32/32 [00:14<00:00,  2.20it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:09<00:00,  1.60it/s]
                   all        998       2353       0.96      0.952      0.987      0.782

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      15/19       9.9G    0.01847    0.01158   0.004043         32        640: 100%|██████████| 32/32 [00:14<00:00,  2.20it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.56it/s]
                   all        998       2353      0.988      0.992      0.994      0.819

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      16/19       9.9G    0.01771     0.0114   0.003859         24        640: 100%|██████████| 32/32 [00:14<00:00,  2.20it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.55it/s]
                   all        998       2353      0.967       0.96       0.99      0.832

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      17/19       9.9G    0.01665    0.01077   0.003739         32        640: 100%|██████████| 32/32 [00:14<00:00,  2.22it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.59it/s]
                   all        998       2353      0.992      0.995      0.994       0.87

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      18/19       9.9G    0.01559    0.01067   0.003549         45        640: 100%|██████████| 32/32 [00:14<00:00,  2.21it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:10<00:00,  1.53it/s]
                   all        998       2353      0.991      0.995      0.995      0.867

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      19/19       9.9G    0.01459    0.01009   0.003031         31        640: 100%|██████████| 32/32 [00:14<00:00,  2.18it/s]
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:11<00:00,  1.42it/s]
                   all        998       2353      0.994      0.995      0.994      0.885

20 epochs completed in 0.143 hours.
Optimizer stripped from runs\train\exp13\weights\last.pt, 14.4MB
Optimizer stripped from runs\train\exp13\weights\best.pt, 14.4MB

Validating runs\train\exp13\weights\best.pt...
Fusing layers...
YOLOv5s summary: 157 layers, 7020913 parameters, 0 gradients, 15.8 GFLOPs
                 Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 16/16 [00:11<00:00,  1.37it/s]
                   all        998       2353      0.994      0.995      0.994      0.885
                   car        998       1309      0.995      0.999      0.995      0.902
                huoche        998        507      0.993      0.988      0.994      0.895
                guache        998        340      0.988      0.993      0.994      0.877
                 keche        998        197      0.999          1      0.995      0.866
Results saved to runs\train\exp13

During training, you can use TensorBoard to view the training curves: start it in the yolov5 directory with tensorboard --logdir runs\train, then open http://localhost:6006/ to view them:
Insert image description here
Training is very fast: with 998 images, 20 epochs take only about 8 minutes. The saved models are stored in the runs\train\exp13 directory.

Other relevant screenshots

results.png

train_batch1.jpg
Testing with python detect.py --weights runs\train\exp13\weights\best.pt --source custom-data\vehicle\images\11.jpg gives the following:
Insert image description here
result chart
Insert image description here

Origin blog.csdn.net/wanggao_1990/article/details/132758180