OpenCV dnn module example (21): object detection with YOLOv6

1. Introduction to YOLOv6

1.1. Overview

In early 2023, Meituan's Visual Intelligence Department released YOLOv6 version 3.0, which once again raised the overall performance of object detection. In addition to upgrading the full YOLOv6-N/S/M/L series, this update also introduced large-resolution P6 models. Among them, YOLOv6-L6 surpasses YOLOv7-E6E in both detection accuracy and speed, achieving SOTA among real-time object detectors at the time.

YOLOv6 GitHub repository: github.com/meituan/YOLOv6; technical report: YOLOv6 v3.0: A Full-Scale Reloading

The first version of YOLOv6 was released in June 2022, and it has been updated to version 4.0 so far.
Figure 1 Performance comparison chart of YOLOv6 models of various sizes and other YOLO series frameworks

Table 1 Performance comparison results of YOLOv6 models of various sizes and other YOLO series frameworks


Note: The YOLOv6 series models are all trained for 300 epochs without pre-trained models or additional detection datasets. "‡" indicates that the self-distillation algorithm is used, and "*" indicates metrics obtained by re-evaluating the released model from the official code base. All speed figures above were measured in a T4 TRT7.2 environment.

1.2. Key technologies

1.2.0. Network structure

Backbone:
Inspired by RepVGG, the authors designed an efficient re-parameterizable backbone called EfficientRep. For small models, the main component of the backbone is RepBlock, as shown in Figure 3(a); in the inference stage, each RepBlock is converted into RepConv, as shown in Figure 3(b).

Compared with other mainstream architectures, the authors found that the RepVGG backbone has stronger feature representation capability in small networks at similar inference speed, although it is harder to scale to large models because of the explosive growth of parameters and computational cost. The authors therefore use RepBlock as the building block for small networks; for large models, they revise a more efficient CSP block, namely the CSPStackRep Block. 3×3 convolutions are usually highly optimized on hardware platforms, so this design improves representation capability while also keeping inference fast.

However, the authors found that when the model is scaled up, the number of parameters and the computational cost grow exponentially. To obtain a better balance, they use the CSPStackRep Block, as shown in Figure 3(c). It absorbs the advantages of CSP (Cross Stage Partial connections) and is composed of RepVGG blocks in the training phase and RepConv in the inference phase.
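The training-to-inference conversion described above relies on the fact that parallel linear branches can be folded into one kernel. The following is a minimal single-channel sketch of that idea in plain Python. It assumes a RepVGG-style block with a 3×3 branch, a 1×1 branch and an identity branch, and it omits batch normalization and multiple channels, so it illustrates the principle rather than the YOLOv6 implementation. The names `conv2d_same` and `fuse_branches` are hypothetical.

```python
def conv2d_same(image, kernel):
    """Single-channel 3x3 convolution (cross-correlation) with zero padding."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    iy, ix = y + ky - 1, x + kx - 1
                    if 0 <= iy < h and 0 <= ix < w:
                        acc += image[iy][ix] * kernel[ky][kx]
            out[y][x] = acc
    return out


def fuse_branches(k3x3, k1x1, with_identity=True):
    """Fold the 3x3, 1x1 and identity branches into one 3x3 kernel."""
    fused = [row[:] for row in k3x3]
    fused[1][1] += k1x1        # a 1x1 kernel is a 3x3 kernel with only the centre set
    if with_identity:
        fused[1][1] += 1.0     # identity is a 3x3 kernel with 1 at the centre
    return fused


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, 0.0, -0.1], [0.2, 0.5, -0.2], [0.1, 0.0, -0.1]]
k1 = 0.3

# Training-time form: three parallel branches whose outputs are summed.
branch3 = conv2d_same(image, k3)
multi_branch = [
    [branch3[y][x] + k1 * image[y][x] + image[y][x] for x in range(3)]
    for y in range(3)
]

# Inference-time form: a single 3x3 convolution with the fused kernel.
single_branch = conv2d_same(image, fuse_branches(k3, k1))

max_diff = max(
    abs(multi_branch[y][x] - single_branch[y][x])
    for y in range(3)
    for x in range(3)
)
print("outputs match:", max_diff < 1e-9)
```

Because both forms compute the same function, the extra branches cost nothing at inference time. In a real network each branch's batch normalization would have to be folded into its kernel and bias before the kernels are summed.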

Neck:

In practice, multi-scale feature fusion has proven to be a key and effective part of object detection. The authors adopt the PAN topology, consistent with YOLOv4 and YOLOv5, and enhance the neck with RepBlock to form Rep-PAN.

Head:
The detection head of YOLOv5 is a coupled head whose parameters are shared between the classification and localization branches. Its counterparts in FCOS and YOLOX add two 3×3 convolutions to decouple the two branches and improve performance.

The authors adopt a hybrid-channel strategy to build a more efficient decoupled head. Only one 3×3 convolution is used, and the width of the head is determined jointly by the width multipliers of the backbone and the neck. This further reduces computation and improves inference speed. The authors call this simplified decoupled head the Efficient Decoupled Head.


The v3.0 update mainly innovates and optimizes Neck network design, training and distillation strategies:

  • Designed a reparameterizable bidirectional fusion PAN (RepBi-PAN) Neck network with stronger representation ability;
  • Proposed a new anchor-aided training strategy;
  • Proposed a Decoupled Location Distillation (DLD) strategy to improve the performance of small models.

1.2.1. RepBi-PAN Neck network with stronger representation ability

An effective multi-scale feature fusion network is particularly critical to the effect of target detection. Feature Pyramid Network (FPN) uses a top-down path to fuse the output features from different stages of the backbone network to compensate for the loss of target position information during the network learning process. Given the limitations of unidirectional information flow transmission, PANet adds an additional bottom-up path on top of FPN. BiFPN introduces learnable weights for different input features and simplifies PAN to achieve better performance and higher efficiency. PRB-FPN preserves high-quality features through a parallel residual FPN structure with bidirectional fusion for accurate localization.

Based on the above research, the paper proposes a reparameterizable bidirectional fusion PAN (RepBi-PAN) Neck network with stronger representation ability. Generally speaking, the shallow features of the backbone network have high resolution and rich spatial information, which is beneficial to the positioning task in target detection. In order to aggregate shallow features, a common approach is to add a P2 fusion layer and an additional detection head to the FPN, but this often brings large computational costs.

To achieve a better trade-off between accuracy and latency, a Bidirectional Concatenate (BiC) module is designed. It introduces a bottom-up information flow into the top-down path, so that shallow features can take part in multi-scale feature fusion more efficiently, which further enhances the expressive power of the fused features. The module helps retain more accurate localization signals, which matters greatly for localizing small objects.
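To make the data flow concrete, here is a small shape-only sketch of such a fusion step. It assumes that BiC concatenates three inputs along the channel axis: the deeper top-down feature upsampled by 2, the backbone feature at the current level, and the shallower backbone feature downsampled by 2. The function name `bic_output_shape` and the channel counts are hypothetical, and the convolutions that adjust channel counts in a real network are left out.

```python
def bic_output_shape(deep, same, shallow):
    """Shape (channels, height, width) after a BiC-style concatenation.

    deep    -- feature from the level above in the top-down path (half resolution)
    same    -- backbone feature at the current level
    shallow -- backbone feature from the level below (double resolution)
    """
    up = (deep[0], deep[1] * 2, deep[2] * 2)                 # upsample by 2
    down = (shallow[0], shallow[1] // 2, shallow[2] // 2)    # downsample by 2
    if not (up[1:] == same[1:] == down[1:]):
        raise ValueError("spatial sizes do not line up after resampling")
    # Concatenation along the channel axis adds the channel counts.
    return (up[0] + same[0] + down[0], same[1], same[2])


# Hypothetical channel counts for a 640x640 input (strides 8, 16 and 32).
fused = bic_output_shape(deep=(128, 20, 20), same=(128, 40, 40), shallow=(128, 80, 80))
print(fused)  # (384, 40, 40)
```

The point of the sketch is that the shallow, high-resolution feature enters the fusion at the cost of one extra downsampling step, without adding a P2 fusion layer or an extra detection head.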

In addition, the SimSPPF module of the previous version was optimized for feature enhancement to enrich the representation capability of the feature map. The SPPCSPC module used by YOLOv7 was found to improve detection accuracy, but it has a considerable impact on network inference speed. So we simplified its design, which greatly improved inference efficiency without much impact on detection accuracy. At the same time, we introduced the idea of reparameterization and adjusted the channel width and depth of the Neck network accordingly. The final RepBi-PAN network structure is shown in Figure 2 below:
Figure 2 RepBi-PAN network structure diagram

Table 2 BiC module ablation experimental results

As Table 2 shows, on the YOLOv6-S/L models, introducing the BiC module only into the top-down path of the PAN network improves detection accuracy by 0.6% and 0.4% AP respectively, while the impact on inference speed stays at 4%. When we additionally tried to replace the regular connections in the bottom-up path with BiC modules, no further gain was obtained, so we apply BiC modules only in the top-down path. We also noticed that the BiC module brings a 1.8% AP improvement in the detection accuracy of small objects.

Table 3 Comparison results of different SPP modules on model accuracy and speed

Table 3 compares the impact of different SPP modules on model accuracy and speed, including the simplified SPPF, SPPCSPC and CSPSPPF modules (SimSPPF, SimSPPCSPC and SimCSPSPPF). In addition, we also tried applying the SimSPPF module after each of the backbone outputs C3, C4 and C5 to strengthen feature aggregation, which is shown as SimSPPF * 3 in the table. According to the experimental results, repeated use of the SimSPPF module increases the amount of computation but brings no further improvement in detection accuracy.

Compared with the SimSPPF module, the SimSPPCSPC module improves AP by 1.6% and 0.3% on YOLOv6-N/S respectively, but reduces inference speed (FPS) by about 10%. Replacing SimSPPF with the optimized SimCSPSPPF module yields accuracy gains of 1.1%/0.4%/0.1% on YOLOv6-N/S/M respectively, with much higher inference speed than the SimSPPCSPC module. Therefore, for a better accuracy-efficiency trade-off, the SimCSPSPPF module is adopted on YOLOv6-N/S, while the SimSPPF module is adopted on YOLOv6-M/L.

1.2.2. Brand new Anchor-Aided Training strategy

Using YOLOv6-N as the baseline, experiments and analysis were carried out on the similarities and differences between the anchor-based and anchor-free paradigms.
When YOLOv6-N is trained with the anchor-based and the anchor-free paradigm respectively, the overall mAP of the two models is nearly the same, but the anchor-based model has higher AP on small, medium, and large objects. These experiments suggest that, compared with the anchor-free paradigm, the anchor-based paradigm brings additional performance gains.

At the same time, it was found that when YOLOv6 uses TAL for label assignment, the stability of model accuracy depends strongly on whether ATSS warm-up is used. Without ATSS warm-up, training YOLOv6-N several times with the same configuration gives a best accuracy of 35.9% mAP and a worst of 35.3% mAP, so the same model varies by 0.6% mAP. With ATSS warm-up, however, the best accuracy reaches only 35.7% mAP. The results suggest that ATSS warm-up uses the anchor-based preset information to stabilize training, but it also limits the peak capability of the network to some extent, so it is not the optimal choice.

Inspired by the above, we proposed the Anchor-Aided Training (AAT) strategy. During training, the anchor-based and anchor-free paradigms are combined, and the network is mapped and optimized at all stages. This ultimately unifies the two anchor paradigms and exploits the advantages of both, further improving detection accuracy. In addition, a flexibly configurable training strategy is proposed: the auxiliary branches are introduced only during training and are not used during testing. As a result, network accuracy improves without any increase in inference time, a "painless" gain.
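The "training-only auxiliary branch" pattern can be sketched with a toy example. This is not the YOLOv6 head: the class `ToyDetectionHead`, its two placeholder branches, and the `aux_weight` parameter are all hypothetical, and the arithmetic inside the branches merely stands in for real prediction layers.

```python
class ToyDetectionHead:
    """Toy head whose auxiliary branch is evaluated only while training."""

    def __init__(self):
        self.training = True

    def _anchor_free(self, feature):
        # Placeholder for the main (anchor-free) prediction branch.
        return [2.0 * v for v in feature]

    def _anchor_based(self, feature):
        # Placeholder for the auxiliary (anchor-based) branch.
        return [v + 1.0 for v in feature]

    def forward(self, feature):
        main = self._anchor_free(feature)
        if self.training:
            # Both outputs feed the loss during training.
            return main, self._anchor_based(feature)
        # At inference the auxiliary branch is skipped, so it adds no latency.
        return main


def total_loss(main_loss, aux_loss, aux_weight=1.0):
    """Combine both losses; aux_weight is a hypothetical balancing factor."""
    return main_loss + aux_weight * aux_loss


head = ToyDetectionHead()
train_out = head.forward([1.0, 2.0])
head.training = False
infer_out = head.forward([1.0, 2.0])
print(train_out)  # ([2.0, 4.0], [2.0, 3.0])
print(infer_out)  # [2.0, 4.0]
```

The inference output is identical to the main training output, which is why the extra branch can be dropped after training without changing the deployed model.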
The ablation results for the AAT training strategy are shown in Table 5. We ran experiments on YOLOv6 models of various sizes. The YOLOv6-S model gained 0.3% in accuracy after adopting the AAT strategy, while the YOLOv6-M/L models each gained 0.5%. Notably, the accuracy of YOLOv6-N/S/M on small objects improved significantly.

1.2.3. Painless DLD decoupling positioning distillation strategy

The DLD (Decoupled Location Distillation) algorithm decouples the detection task from the distillation task by adding an extra enhanced regression branch to the regression head of each layer of the network. During training, this branch also takes part in the IoU loss calculation, and its loss is added to the final loss.

In the distillation task of target detection, LD achieves the purpose of distilling positioning information in the network by introducing the DFL branch, which makes up for the inability of the Logit Mimicking method to use positioning distillation information. However, the addition of the DFL branch has a significant impact on the speed of small models. The speed of YOLOv6-N dropped by 16.7%, and the speed of YOLOv6-S dropped by 5.2%. In actual industrial applications, the speed requirements for small models are often very high. Therefore, the current distillation strategy is not suitable for industrial implementation.

To address this problem, we propose the DLD (Decoupled Location Distillation) algorithm, which decouples the detection task from the distillation task. DLD adds an extra enhanced regression branch to the regression head of each layer of the network. During training, this branch also takes part in the IoU loss calculation, and its loss is accumulated into the final loss. The enhanced regression branch places additional constraints on the network, which leads to a more comprehensive and fine-grained optimization. Moreover, DLD applies a branch distillation learning strategy when training the enhanced regression branch.

1.3. Summary

From the perspective of commercial applications, we have carried out in-depth improvement experiments on the YOLO open-source object detector. From the construction of the backbone to the decoupling of the detection head, and from the choice of loss function to quantized deployment, we have consistently maintained a balance between speed and accuracy. YOLOv6 is an anchor-free detector.

Third-party evaluations
While YOLOv7 was being widely reported and dubbed the "currently fastest and most powerful" object detector, the latest version of YOLOv6 had already beaten it on the balance between inference speed and accuracy. YOLOv6 has nevertheless been controversial.

Critics say that the YOLO model produced by Meituan "borrows" other companies' techniques to build a "stitched-together monster" and squats on the V6 name. Some also point out that, although it has advantages in speed and accuracy, it avoids comparing parameter counts. From the perspective of commercial applications, however, it is understandable that the authors care more about speed than about parameter count and computation.

2. Test

The yolov6m.pt model is used for testing.

2.1. Official project testing

Modify the infer() function in yolov6/core/inferer.py to warm up the model and print the inference time:

    # Added: warm up the model with a few untimed forward passes
    for _ in range(5):
        self.model(img)

    t1 = time.time()
    pred_results = self.model(img)
    det = non_max_suppression(pred_results, conf_thres, iou_thres, classes, agnostic_nms, max_det=max_det)[0]
    t2 = time.time()

    # Added: print the forward pass + NMS time for this image, in seconds
    print("img_src process time: ", t2 - t1)
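The same warm-up-then-measure pattern can be wrapped in a small reusable helper. The sketch below is framework-independent; the function name `time_inference` and its parameters are hypothetical, and the lambda merely stands in for a model call such as `self.model(img)`.

```python
import time


def time_inference(run_once, warmup=5, repeats=10):
    """Return the average wall-clock time of run_once() in seconds.

    The first `warmup` calls are not timed, so one-off costs such as
    lazy initialization do not distort the measurement.
    """
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    return (time.perf_counter() - start) / repeats


# Stand-in workload; in practice this would wrap the model's forward pass.
calls = []
avg = time_inference(lambda: calls.append(sum(range(1000))), warmup=5, repeats=10)
print(f"average time per call: {avg * 1000:.3f} ms")
```

Averaging over several runs is less noisy than timing a single call. When timing on a CUDA device with PyTorch, kernels are launched asynchronously, so calling torch.cuda.synchronize() before reading the clock usually gives more trustworthy numbers.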

The test is as follows

>python tools\infer.py --weights=weights/yolov6m.pt --source=data/images/bus.jpg --yaml=data/coco.yaml --img-size 640 640 --device cpu

Switch to the GPU with --device 0 and run the test again.

2.2. opencv dnn test

First export the ONNX model with the following command:

python deploy\ONNX\export_onnx.py --weights weights\yolov6m.pt --img 640 --batch 1 --simplify

Since the network output is consistent with that of yolov5 and yolor (see the code in the previous blog posts), the earlier test code is reused. By default, the Python script's preprocessing resizes the original image to 640*640. On the previous test image, the recognition confidence is 100%, which is a bit outrageous.
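For readers who do not have the earlier posts at hand, the following is a rough pure-Python sketch of how a YOLOv5-style output can be post-processed. It assumes each output row has the layout [cx, cy, w, h, objectness, class scores...], as the text above implies. The helper names `iou` and `decode`, the thresholds, and the sample rows are hypothetical, and this is not the code from the previous posts.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def decode(rows, conf_thres=0.25, iou_thres=0.45):
    """Turn raw rows into (score, class_id, box) tuples after per-class NMS."""
    candidates = []
    for cx, cy, w, h, obj, *cls_scores in rows:
        class_id = max(range(len(cls_scores)), key=cls_scores.__getitem__)
        score = obj * cls_scores[class_id]
        if score < conf_thres:
            continue
        # Convert centre/size to corner coordinates.
        box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        candidates.append((score, class_id, box))

    # Greedy NMS: keep the best box, drop same-class boxes that overlap it.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = []
    for cand in candidates:
        if all(k[1] != cand[1] or iou(k[2], cand[2]) <= iou_thres for k in kept):
            kept.append(cand)
    return kept


# Hypothetical rows with two classes.
rows = [
    [100, 100, 50, 50, 0.9, 0.1, 0.8],
    [102, 101, 50, 50, 0.8, 0.1, 0.7],   # overlaps the first row, lower score
    [300, 300, 40, 80, 0.9, 0.9, 0.05],
    [500, 500, 10, 10, 0.1, 0.5, 0.5],   # below the confidence threshold
]
detections = decode(rows)
for score, class_id, box in detections:
    print(f"class {class_id} score {score:.2f} box {box}")
```

With OpenCV, the suppression step is usually delegated to cv2.dnn.NMSBoxes instead of a hand-written loop; the loop here only makes the logic visible.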

In addition, without proportional (letterbox) scaling, that is, with bool letterBoxForSquare = false;, the bicycle on the balcony of the building is actually detected, which is ridiculously strong.
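For comparison, proportional scaling is commonly done with a "letterbox" resize: the image is scaled by a single factor and the remainder is padded. The sketch below computes the scale and padding for a centred letterbox and maps a point back to the source image. It follows the common convention and is not necessarily what the letterBoxForSquare option in the earlier code does; the function names and the 810x1080 image size are hypothetical.

```python
def letterbox_params(src_w, src_h, dst=640):
    """Scale factor, resized size and padding for a centred letterbox."""
    scale = min(dst / src_w, dst / src_h)      # one factor keeps the aspect ratio
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (dst - new_w) // 2, (dst - new_h) // 2
    return scale, (new_w, new_h), (pad_x, pad_y)


def to_source_coords(x, y, scale, pad):
    """Map a point from the letterboxed image back to the source image."""
    return (x - pad[0]) / scale, (y - pad[1]) / scale


scale, size, pad = letterbox_params(810, 1080)
print(size, pad)                                 # (480, 640) (80, 0)
print(to_source_coords(320, 320, scale, pad))    # centre of the network input
```

A plain resize to 640*640 stretches the image instead, so boxes must be mapped back with separate x and y factors, and object proportions seen by the network differ from those in the original image.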

2.3. Test statistics

python (CPU): 393 ms
python (GPU): 25 ms

opencv dnn (CPU): 350 ms
opencv dnn (GPU): 35 ms

openvino (CPU): 337 ms
onnxruntime (GPU): 31 ms
TensorRT: 15 ms
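To compare these figures as throughput, each latency can be converted to frames per second with FPS = 1000 / latency in ms. The short sketch below does this for the values listed above; the dictionary keys are just labels.

```python
latency_ms = {
    "python (CPU)": 393,
    "python (GPU)": 25,
    "opencv dnn (CPU)": 350,
    "opencv dnn (GPU)": 35,
    "openvino (CPU)": 337,
    "onnxruntime (GPU)": 31,
    "TensorRT": 15,
}

# One image per inference, so FPS is the reciprocal of the latency.
fps = {name: 1000.0 / ms for name, ms in latency_ms.items()}

for name, ms in sorted(latency_ms.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} {ms:4d} ms  {fps[name]:5.1f} FPS")
```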

Origin blog.csdn.net/wanggao_1990/article/details/133269792