[YOLO series] Super detailed interpretation of YOLOv6 papers (translation + study notes)

foreword 

YOLOv6 is an object detection framework developed by Meituan's Visual Intelligence Department and dedicated to industrial applications. The paper is titled "YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications".

The framework focuses on detection accuracy and inference efficiency at the same time. Among the model sizes commonly used in industry, YOLOv6-nano reaches 35.0% AP on COCO with an inference speed of 1242 FPS on a T4 GPU, and YOLOv6-s reaches 43.1% AP on COCO at 520 FPS on T4. In terms of deployment, YOLOv6 supports GPU (TensorRT), CPU (OpenVINO), and ARM (MNN, TNN, NCNN) platforms, which greatly simplifies adaptation work during project deployment.


Learning materials:

Original address: YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications

Open source code: GitHub - meituan/YOLOv6: YOLOv6: a single-stage object detection framework dedicated to industrial applications.

Previous review: 

[YOLO Series] YOLOv5 Super Detailed Interpretation (Internet Detailed Explanation)

[YOLO series] Super detailed interpretation of YOLOv4 paper 2 (detailed network explanation)

[YOLO series] Super detailed interpretation of YOLOv4 papers 1 (translation + study notes)

[YOLO series] Super detailed interpretation of YOLOv3 papers (translation + study notes)

[YOLO series] super detailed interpretation of YOLOv2 papers (translation + study notes)

[YOLO series] Super detailed interpretation of YOLOv1 papers (translation + study notes)


Table of contents

foreword 

Abstract—abstract 

1. Introduction—Introduction

2. Method—Method

2.1 Network Design— Network Design

2.1.1 Backbone—backbone network

2.1.2 Neck—neck 

2.1.3 Head—head

2.2 Label Assignment—label assignment

2.3 Loss Functions—loss function

 2.3.1 Classification Loss—classification loss 

2.3.2 Box Regression Loss— regression box loss 

2.3.3 Object Loss—target loss 

 2.4. Industry-handy improvements—industry-handy improvements 

2.4.1 More training epochs—more training epochs

2.4.2 Self-distillation— Self-distillation

2.4.3 Gray border of images— image gray border

2.5. Quantization and Deployment—quantization and deployment

2.5.1 Reparameterizing Optimizer—reparameterization 

2.5.2 Sensitivity Analysis—Sensitivity

2.5.3 Quantization-aware Training with Channel-wise Distillation—quantization-aware training based on channel distillation

3. Experiments—Experiments

3.1 Implementation Details—implementation details

3.2 Comparisons—control experiments

3.3 Ablation Study—ablation experiment

3.3.1 Network—Network

3.3.2 Label Assignment—label assignment

3.3.3 Loss functions—loss function

3.4 Industry-handy improvements—handy improvements in industry

3.5. Quantization Results—quantization results

3.5.1 PTQ

3.5.2 QAT

4. Conclusion—Conclusion

Abstract— abstract 

translate

For many years, the YOLO series has been the de facto industry-grade standard for efficient object detection. The YOLO community has overwhelmingly enriched its applications in numerous hardware platforms and rich scenarios. In this technical report, we strive to push its limits to a new level, moving towards industry application with a firm mind. Considering the different requirements for speed and accuracy in real-world environments, we extensively study the latest advances in object detection from industry or academia. Specifically, we draw heavily on recent ideas in network design, training strategies, testing techniques, quantization and optimization methods. On this basis, we integrated our ideas and practices to build a set of deployable networks of different scales to suit diverse use cases. With the generous permission of the YOLO authors, we named it YOLOv6. We also warmly welcome further improvements from users and contributors. For performance, our YOLOv6-N achieved an AP of 35.9% on the COCO dataset, with a throughput of 1234 FPS on the NVIDIA Tesla T4 GPU. YOLOv6-S achieves 43.5% AP at 495 FPS, surpassing other mainstream detectors of the same scale (YOLOv5-S, YOLOX-S and PPYOLOE-S). Our quantized version, YOLOv6-S, even brings a new state-of-the-art 43.3% AP at 869 FPS. In addition, YOLOv6-M/L also achieves better accuracy performance (i.e. 49.5%/52.3%) than other detectors with similar inference speed. We carefully conduct experiments to verify the effectiveness of each component.


intensive reading

Abstract— abstract

background

The YOLO series is now widely used in industry

The main work of this paper

YOLOv6 draws heavily on recent ideas in network design, training strategies, testing techniques, quantization, and optimization methods. (In other words, there are no eye-catching innovations; the main work is integrating and stacking existing techniques.)

final effect

  • Higher accuracy: YOLOv6-N achieves 35.9% AP on the COCO dataset with a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU.
  • Faster: YOLOv6-S achieves 43.5% AP at 495 FPS, surpassing other mainstream detectors of the same size (YOLOv5-S, YOLOX-S and PPYOLOE-S).
  • The quantized version also improves: quantized YOLOv6-S even brings a new state-of-the-art 43.3% AP at 869 FPS.
  • Other versions: YOLOv6-M/L also achieve better accuracy (i.e. 49.5%/52.3% AP) than other detectors with similar inference speed.


1. Introduction— Introduction

translate

The YOLO series has been the most popular detection framework for industrial applications because of its good balance between speed and accuracy. The pioneering works of the YOLO series are YOLOv1-3 [32-34], which opened up a new path of one-stage detectors and were greatly improved later. YOLOv4 [1] reorganized the detection framework into several independent parts (backbone, neck and head), verified the bag-of-freebies and bag-of-specials of the time, and designed a framework suitable for training on a single GPU. Currently, YOLOv5 [10], YOLOX [7], PPYOLOE [44] and YOLOv7 [42] are all competitive candidates for efficient detectors that can be deployed. Models of different sizes are usually obtained through scaling techniques.

In this report, we empirically observe several important factors that motivate us to refit the YOLO framework: (1) Reparameterization from RepVGG [3] is a superior technique that has not yet been put to good use in detection. We also note that simple model scaling of RepVGG blocks becomes impractical, for which we argue that an elegant consistency of network design between small and large networks is unnecessary. For small networks, a plain single-path architecture is a good choice, but for large models the exponential growth of parameters and the computational cost of a single-path architecture make it infeasible; (2) Quantization of reparameterization-based detectors also requires careful handling, otherwise the performance degradation caused by their heterogeneous configurations during training and inference will be intractable. (3) Previous work [7, 10, 42, 44] tends to pay less attention to deployment, whose latencies are usually compared on high-cost machines like the V100. There is a hardware gap when it comes to the real serving environment: typically, low-power GPUs like the Tesla T4 cost less and provide reasonably good inference performance. (4) Considering the differences in architecture, advanced domain-specific strategies, such as label assignment and loss function design, need further validation; (5) For deployment, we can tolerate training strategy adjustments that improve accuracy but do not increase inference cost, such as knowledge distillation.

With the above observations in mind, we bring about the birth of YOLOv6, which achieves the best trade-off between accuracy and speed to date. We show in Figure 1 how YOLOv6 compares to other peers at a similar scale. To improve inference speed without performance degradation, we investigate state-of-the-art quantization methods, including post-training quantization (PTQ) and quantization-aware training (QAT), and incorporate them into YOLOv6 to achieve the goal of a deployment-ready network.


intensive reading

background

   (1) Reparameterization from RepVGG is a superior technique that has not been well exploited in detection (i.e. in existing YOLO versions). At the same time, the author believes that small and large networks should be designed differently: for large networks, simple model scaling of RepVGG blocks is impractical

   (2) Quantization of reparameterization-based detectors requires careful handling

   (3) Previous work tends to pay less attention to deployment; latencies are often compared on high-cost machines such as the V100, so there is a hardware gap with the actual serving environment

   (4) Label assignment and loss function design need further validation given the differences in architecture

   (5) For deployment, training-time strategies may be adjusted, as long as they do not increase inference cost

contributions

  (1) Models of different scales are customized for different application scenarios: small models use a plain single-path backbone, while large models are built on efficient multi-branch blocks

  (2) A self-distillation strategy is added, applied to both the classification task and the regression task

  (3) Various advanced tricks are integrated, such as label assignment strategies, loss functions and data augmentation techniques

  (4) With the help of RepOptimizer and channel-wise distillation, the quantization scheme is reformed to obtain a better quantized detector


2. Method—Method

translate

The redesign of YOLOv6 consists of the following parts: network design, label assignment, loss function, data augmentation, industry-handy improvements, and quantization and deployment:

  • Network design: Backbone: Compared with other mainstream architectures, we find that the RepVGG [3] backbone has stronger feature representation ability in small networks at a similar inference speed, but it can hardly be scaled up to larger models because of the explosive growth of parameters and computational cost. In this regard, we take RepBlock [3] as the building block of our small networks. For large models, we modify a more efficient CSP [43] block, named the CSPStackRep block. Neck: The neck of YOLOv6 adopts the PAN topology [24] following YOLOv4 and YOLOv5. We enhance the neck with RepBlocks or CSPStackRep blocks to obtain Rep-PAN. Head: We simplify the decoupled head to make it more efficient, called the Efficient Decoupled Head.
  • Label assignment: We evaluate recent advances in label assignment strategies [5, 7, 18, 48, 51] on YOLOv6 through extensive experiments, and the results show that TAL [5] is more effective and training-friendly.
  • Loss function: The loss functions of mainstream anchor-free object detectors include classification loss, box regression loss and object loss. For each loss, we conduct systematic experiments with all available techniques, and finally choose VariFocal Loss [50] as our classification loss and SIoU [8]/GIoU [35] loss as our regression loss.
  • Industry-handy improvements: We introduce additional common practices and tricks to improve performance, including self-distillation and more training epochs. For self-distillation, both classification and regression are supervised by a teacher model. The distillation of the regression branch is made possible by DFL [20]. In addition, the proportion of information from the soft and hard labels is dynamically decreased via cosine decay, which helps the student selectively acquire knowledge at different phases of training. Additionally, we encountered the problem of degraded performance when extra gray borders are not added at evaluation time, for which we provide some remedies.
  • Quantization and deployment: To cure the performance degradation when quantizing reparameterization-based models, we train YOLOv6 with RepOptimizer [2] to obtain PTQ-friendly weights. We further adopt QAT with channel-wise distillation [36] and graph optimization to pursue extreme performance. Our quantized YOLOv6-S reaches a new state of the art with 43.3% AP and a throughput of 869 FPS (batch size = 32).

2.1 Network Design— Network Design

translate

A single-stage object detector generally consists of the following parts: a backbone, a neck and a head. The backbone mainly determines the feature representation ability; meanwhile, its design has a critical influence on inference efficiency since it carries a large portion of the computational cost. The neck is used to aggregate the low-level physical features with the high-level semantic features and then build pyramid feature maps at all levels. The head consists of several convolutional layers and predicts the final detection results based on the multi-level features aggregated by the neck. From a structural point of view, heads can be divided into anchor-based and anchor-free, or rather parameter-coupled heads and parameter-decoupled heads.

In YOLOv6, based on the principle of hardware-friendly network design [3], we propose two scalable reparameterizable backbones and necks to accommodate models of different sizes, as well as an efficient decoupled head with a hybrid-channel strategy.

intensive reading

The composition of a single-stage object detector:

  • Backbone network: mainly determines the feature representation ability  
  • Neck: Used to aggregate low-level physical features with high-level semantic features, and then build pyramidal feature maps at all levels
  • Head: Consists of multiple convolutional layers and predicts the final detection result based on multi-level features assembled in the neck


2.1.1 Backbone—backbone network

translate

As mentioned above, the design of the backbone network has a great influence on the effectiveness and efficiency of the detection model. Previous studies have shown that multi-branch networks [13, 14, 38, 39] tend to achieve better classification performance than single-path networks [15, 37], but are often accompanied by reduced parallelism, resulting in increased inference latency. In contrast, common single-path networks like VGG [37] have the advantages of high parallelism and less memory footprint, leading to higher inference efficiency. Recently in RepVGG [3], a structural reparameterization method is proposed to decouple the multi-branch topology at training time from the common structure at inference time to achieve a better speed-accuracy trade-off.
Inspired by the above works, we design an efficient reparameterizable backbone named EfficientRep. For small models, the main component of the backbone during the training phase is RepBlock, as shown in Figure 3(a). During the inference phase, each RepBlock is converted into 3×3 convolutional layers (denoted as RepConv) with ReLU activation functions, as shown in Figure 3(b). Typically, 3×3 convolutions are highly optimized on mainstream GPUs and CPUs and enjoy higher computational density. Consequently, the EfficientRep backbone makes full use of the computing power of the hardware, which greatly reduces inference latency while enhancing the representation ability.

However, we notice that the computational cost and the number of parameters of plain single-path networks grow exponentially as the model capacity is further scaled up. To achieve a better trade-off between the computational burden and accuracy, we revise a CSPStackRep block to build the backbone of the medium and large networks. As shown in Figure 3(c), the CSPStackRep block is composed of three 1×1 convolutional layers and a stack of sub-blocks, each consisting of two RepVGG blocks [3] or RepConvs (at training or inference time, respectively) with a residual connection. Furthermore, a Cross Stage Partial (CSP) connection is adopted to boost performance without excessive computational cost. Compared with CSPRepResStage [45], it comes with a more succinct outlook and considers the balance between accuracy and speed.

intensive reading

background

The RepVGG backbone has stronger feature representation ability in small networks, but as parameters and computational cost grow explosively, it is difficult for RepVGG to achieve high performance in large models

improvements

  • Designed an efficient reparameterizable backbone called EfficientRep
  • In small models (n/t/s), RepBlock is used
  • In large models (m/l), CSPStackRep blocks are used

method

(1) Replace the ordinary stride-2 Conv layers in the Backbone with stride-2 RepConv layers

(2) Redesign the original CSP-Block as RepBlock, where the first RepConv of each RepBlock performs channel dimension transformation and alignment

(3) Optimize the original SPPF into a more efficient SimSPPF

The following figure shows the specific design of the EfficientRep Backbone.

The following figure shows the structure of the block in different situations:

Figure (a): During training, the RepVGG block is followed by a ReLU

Figure (b): At inference time, the RepVGG block is converted into a RepConv

Figure (c): The structure of the CSPStackRep block (three 1×1 convs + stacked RepVGG blocks (training) / RepConvs (inference) + ReLU, with a residual connection)
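To make the training-time vs. inference-time structures above concrete, here is a minimal PyTorch sketch of how a RepVGG-style block (3×3 conv+BN, 1×1 conv+BN and an identity BN branch) can be fused into a single 3×3 RepConv for inference. It only illustrates the reparameterization idea; the class and function names are made up and this is not the official RepVGG/YOLOv6 code.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold a BatchNorm layer into the preceding convolution's weight and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                                   # per-channel scale
    fused_w = conv.weight * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused_b = bn.bias + (conv_bias - bn.running_mean) * scale
    return fused_w, fused_b

class RepVGGLikeBlock(nn.Module):
    """Training-time block: 3x3 conv+BN, 1x1 conv+BN and identity BN branches, then ReLU."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn_id = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn3(self.conv3(x)) + self.bn1(self.conv1(x)) + self.bn_id(x))

    @torch.no_grad()
    def to_repconv(self) -> nn.Sequential:
        """Inference time: merge all branches into one 3x3 conv (RepConv) + ReLU."""
        c = self.conv3.out_channels
        w3, b3 = fuse_conv_bn(self.conv3, self.bn3)
        w1, b1 = fuse_conv_bn(self.conv1, self.bn1)
        w1 = nn.functional.pad(w1, [1, 1, 1, 1])              # pad 1x1 kernel to 3x3
        # identity branch == a 3x3 conv whose kernel is 1 at the center of its own channel
        id_conv = nn.Conv2d(c, c, 3, padding=1, bias=False)
        id_conv.weight.zero_()
        for i in range(c):
            id_conv.weight[i, i, 1, 1] = 1.0
        wi, bi = fuse_conv_bn(id_conv, self.bn_id)
        fused = nn.Conv2d(c, c, 3, padding=1, bias=True)
        fused.weight.copy_(w3 + w1 + wi)
        fused.bias.copy_(b3 + b1 + bi)
        return nn.Sequential(fused, nn.ReLU(inplace=True))
```

The fused block produces the same outputs as the three-branch training-time block (up to numerical precision), which is exactly why the single 3×3 RepConv path runs faster at inference without losing the training-time representation ability.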


 2.1.2 Neck—neck 

translate

In practice, multi-scale feature integration has been proven to be a critical and effective part of object detection [9, 21, 24, 40]. We adopt the modified PAN topology [24] from YOLOv4 [1] and YOLOv5 [10] as the basis of our detection neck. In addition, we replace the CSPBlock used in YOLOv5 with RepBlock (for small models) or CSPStackRep block (for large models) and adjust the width and depth accordingly. The neck of YOLOv6 is denoted as Rep-PAN.

intensive reading

Referring to the PAN used by YOLOv4 and YOLOv5, and combining it with the RepBlock or CSPStackRep block used in the Backbone, Rep-PAN is proposed.

method

Rep-PAN is based on the PAN topology, replacing the CSP-Block used in YOLOv5 with RepBlock (or CSPStackRep block for large models) and adjusting the operators in the overall Neck accordingly.

Purpose

To maintain good multi-scale feature fusion ability while achieving efficient inference on hardware.


2.1.3 Head—head

Efficient decoupled head—efficient decoupled head

translate

The detection head of YOLOv5 is a coupled head whose parameters are shared between the classification and localization branches, while its counterparts in FCOS [41] and YOLOX [7] decouple the two branches and introduce two additional 3×3 convolutional layers in each branch to improve performance. In YOLOv6, we adopt a hybrid-channel strategy to build a more efficient decoupled head. Specifically, we reduce the number of intermediate 3×3 convolutional layers to only one. The width of the head is jointly scaled by the width multiplier of the backbone and the neck. These modifications further reduce the computational cost for lower inference latency.

intensive reading

method

   (1) Reduce the number of intermediate 3×3 convolutional layers to one

   (2) Scale the head width jointly with the width multiplier of the backbone and the neck

Purpose

   Computational costs are further reduced for lower inference latency.
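As a rough illustration of the hybrid-channel idea (a single intermediate 3×3 convolution per branch instead of two, with the head width following a width multiplier), here is a hedged PyTorch sketch; the exact topology, channel choices and layer names are assumptions, not the official head implementation.

```python
import torch
import torch.nn as nn

class EfficientDecoupledHeadSketch(nn.Module):
    """Per-level decoupled head sketch: a 1x1 stem, then ONE 3x3 conv per branch
    (instead of two as in YOLOX), followed by 1x1 prediction layers."""
    def __init__(self, in_channels: int, num_classes: int, width_mult: float = 1.0):
        super().__init__()
        mid = max(16, int(in_channels * width_mult))   # head width follows the width multiplier

        def conv(cin, cout, k):
            return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
                                 nn.BatchNorm2d(cout), nn.SiLU(inplace=True))

        self.stem = conv(in_channels, mid, 1)
        self.cls_conv = conv(mid, mid, 3)              # single intermediate 3x3 conv
        self.reg_conv = conv(mid, mid, 3)              # single intermediate 3x3 conv
        self.cls_pred = nn.Conv2d(mid, num_classes, 1)
        self.reg_pred = nn.Conv2d(mid, 4, 1)           # (l, t, r, b) distances per anchor point

    def forward(self, feat: torch.Tensor):
        x = self.stem(feat)
        return self.cls_pred(self.cls_conv(x)), self.reg_pred(self.reg_conv(x))

# one head per FPN level, e.g. feature maps of strides 8/16/32
heads = nn.ModuleList(EfficientDecoupledHeadSketch(c, num_classes=80) for c in (128, 256, 512))
```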


Anchor-free

translate

Anchor-free detectors stand out for their better generalization ability and simpler decoding of predictions, and the time cost of their post-processing is substantially reduced. There are two types of anchor-free detectors: anchor point-based [7, 41] and keypoint-based [16, 46, 53]. In YOLOv6, we adopt the anchor point-based paradigm, whose regression branch actually predicts the distance from the anchor point to the four sides of the box.

intensive reading

Reasons for not using an anchor-based scheme

An anchor-based detector needs to run a cluster analysis before training to determine the optimal anchor set, which increases the complexity of the detector to a certain extent. In some edge applications, carrying a large number of detection results between hardware stages also introduces extra latency.

Advantages of anchor-free

An anchor-free scheme requires no preset anchor parameters and has a shorter post-processing time.

There are two anchor-free schemes:

  • anchor point-based (e.g. FCOS), used by YOLOv6
  • keypoint-based (e.g. CornerNet)
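Because the regression branch of the anchor point-based scheme predicts the distances from each anchor point to the four sides of the box, decoding a box is a simple add/subtract around the point. A minimal sketch (tensor shapes and pixel units are assumptions):

```python
import torch

def decode_ltrb(points: torch.Tensor, ltrb: torch.Tensor) -> torch.Tensor:
    """points: (N, 2) anchor-point centers (x, y) in image pixels.
    ltrb:   (N, 4) predicted distances (left, top, right, bottom) in pixels.
    Returns (N, 4) boxes as (x1, y1, x2, y2)."""
    x1y1 = points - ltrb[:, :2]   # subtract left/top distances
    x2y2 = points + ltrb[:, 2:]   # add right/bottom distances
    return torch.cat([x1y1, x2y2], dim=-1)

# example: point (100, 60) with distances (10, 5, 20, 15) -> box (90, 55, 120, 75)
print(decode_ltrb(torch.tensor([[100., 60.]]), torch.tensor([[10., 5., 20., 15.]])))
```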

2.2 Label Assignment—label assignment

SimOTA

translate

OTA [6] considers label assignment in object detection as an optimal transport problem. It defines positive/negative training samples for each ground-truth object from a global perspective. SimOTA [7] is a simplified version of OTA [6], which reduces the extra hyperparameters and maintains performance. SimOTA was used as the label assignment method in an early version of YOLOv6. However, in practice, we found that introducing SimOTA slows down the training process, and it is not uncommon for training to become unstable. Therefore, we desire a replacement for SimOTA.

intensive reading

introduce

OTA regards label assignment in object detection as an optimal transport problem. It defines positive/negative training samples for each ground-truth object from a global perspective.

SimOTA is a simplified version of OTA that reduces the extra hyperparameters while maintaining performance.

step

   ① Compute the pairwise cost between each prediction box and each ground-truth box, composed of the classification and regression losses

   ② For each ground-truth box, compute the IoU with its top-k prediction boxes; the sum of these IoUs gives its Dynamic k, so Dynamic k differs across ground-truth boxes

   ③ Finally, select the Dynamic k prediction boxes with the smallest cost as positive samples for that ground truth (see the sketch below)

Shortcomings

SimOTA slows down training and can easily make training unstable
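As a rough sketch of the dynamic-k assignment described in the steps above (not the official SimOTA implementation; the cost and IoU matrices are assumed to be given):

```python
import torch

def simota_like_assign(cost: torch.Tensor, ious: torch.Tensor, topk: int = 10):
    """cost: (num_gt, num_pred) pairwise cost (classification + regression loss).
    ious: (num_gt, num_pred) pairwise IoU between gts and predictions.
    Returns a (num_gt, num_pred) 0/1 assignment matrix."""
    num_gt, num_pred = cost.shape
    assign = torch.zeros_like(cost)
    # dynamic k per gt: sum of its top-k IoUs (rounded down), at least 1
    topk_ious, _ = ious.topk(min(topk, num_pred), dim=1)
    dynamic_k = topk_ious.sum(dim=1).int().clamp(min=1)
    for g in range(num_gt):
        _, idx = cost[g].topk(int(dynamic_k[g]), largest=False)  # lowest-cost predictions
        assign[g, idx] = 1.0
    # if one prediction is matched to several gts, keep only the lowest-cost gt
    multiple = assign.sum(dim=0) > 1
    if multiple.any():
        best_gt = cost[:, multiple].argmin(dim=0)
        assign[:, multiple] = 0.0
        assign[best_gt, torch.nonzero(multiple).squeeze(1)] = 1.0
    return assign
```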


Task alignment learning—task alignment learning

translate

Task Alignment Learning (TAL) was first proposed in TOOD [5], where a unified metric of the classification score and the predicted-box quality is designed. The IoU is replaced by this metric when assigning object labels. To some extent, this alleviates the problem of task misalignment (between classification and box regression).

Another major contribution of TOOD is the task-aligned head (T-head). The T-head stacks convolutional layers to build interactive features, on top of which a Task-Aligned Predictor (TAP) is used. PP-YOLOE [45] improves the T-head by replacing its layer attention with lightweight ESE attention, forming the ET-head. However, we found that the ET-head reduces inference speed in our models without bringing accuracy gains.

Therefore, we retain the design of our Efficient Decoupled Head. Furthermore, we observe that TAL brings more performance gains than SimOTA and stabilizes training. Therefore, we adopt TAL as the default label assignment strategy in YOLOv6.

intensive reading

introduce

TAL was proposed in TOOD, where a unified metric combining the classification score and the localization-box quality is designed. This metric is used instead of IoU to assign labels, which helps solve the problem of task misalignment; it is also more stable and works better.

step

  ① In each feature layer, compute the product of the IoU between the gt and the prediction box and the classification score as the alignment score, aligning the classification and detection tasks (see the sketch below)

  ② For each gt, select the bboxes with the top-k largest scores

  ③ Among these, keep as positive samples the bboxes whose anchor centers fall inside the gt

  ④ If an anchor box corresponds to multiple gts, assign it to the gt whose prediction box has the largest IoU with it

Effect

TAL brings more performance improvement than SimOTA and stabilizes training
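A minimal sketch of the task-alignment metric t = s^α · u^β used in step ① to rank candidates; the α, β and top-k values are common defaults from TOOD and are assumptions here, not the exact YOLOv6 code:

```python
import torch

def tal_alignment_topk(cls_scores: torch.Tensor, ious: torch.Tensor,
                       alpha: float = 1.0, beta: float = 6.0, topk: int = 13):
    """cls_scores: (num_gt, num_pred) predicted score of each box for its gt's class.
    ious:       (num_gt, num_pred) IoU between predictions and gts.
    Returns a boolean (num_gt, num_pred) mask of the top-k aligned candidates per gt."""
    metric = cls_scores.pow(alpha) * ious.pow(beta)   # alignment metric t = s^alpha * u^beta
    _, idx = metric.topk(min(topk, metric.size(1)), dim=1)
    mask = torch.zeros_like(metric, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask
    # afterwards: keep only candidates whose anchor centers fall inside the gt,
    # and resolve anchors matched to several gts by the highest metric/IoU (steps 3-4 above)
```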


2.3 Loss Functions—loss function

2.3.1 Classification Loss—classification loss

translate

Improving the performance of the classifier is a key part of optimizing a detector. Focal Loss [22] modifies the traditional cross-entropy loss to solve the problem of class imbalance between positive and negative samples or between hard and easy samples. To address the inconsistent use of quality estimation and classification between training and inference, Quality Focal Loss (QFL) [20] further extends Focal Loss with a joint representation of the classification score and the localization quality for classification supervision. VariFocal Loss (VFL) [50] is derived from Focal Loss [22], but it treats positive and negative samples asymmetrically. By considering the different importance of positive and negative samples, it balances the learning signals from both. Poly Loss [17] decomposes the commonly used classification losses into a series of weighted polynomial bases. It tunes the polynomial coefficients on different tasks and datasets, and is experimentally shown to be better than cross-entropy loss and Focal Loss.

We evaluate all these advanced classification losses on YOLOv6 and finally adopt VFL [50].

intensive reading

VariFocal Loss (VFL): proposes an asymmetric weighting operation.

It addresses the imbalance between positive and negative samples and the unequal importance among positive samples, so that more valuable positive samples are emphasized. Therefore, VariFocal Loss is chosen as the classification loss (a sketch follows below).
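For reference, a hedged sketch of VariFocal Loss following the formulation in the VFL paper: the target is an IoU-aware score that is positive only for positive samples, and negatives are down-weighted asymmetrically; the α and γ defaults are assumptions.

```python
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits: torch.Tensor, target_score: torch.Tensor,
                   alpha: float = 0.75, gamma: float = 2.0) -> torch.Tensor:
    """pred_logits: raw classification logits, same shape as target_score.
    target_score: IoU-aware target in [0, 1]; > 0 only for positive samples."""
    pred = pred_logits.sigmoid()
    # positives keep full weight scaled by the target quality,
    # negatives are asymmetrically down-weighted by alpha * p^gamma
    weight = torch.where(target_score > 0,
                         target_score,
                         alpha * pred.pow(gamma))
    bce = F.binary_cross_entropy_with_logits(pred_logits, target_score, reduction="none")
    return (weight * bce).sum()
```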


2.3.2 Box Regression Loss—regression box loss

translate

Regression loss provides important learning signals for precisely localizing box boundaries. L1 loss is the original regression loss in early works. Gradually, a variety of well-designed regression losses emerged, such as the IoU-series losses [8, 11, 35, 47, 52] and probability losses [20].

IoU-series loss. IoU loss [47] regresses the four bounds of a predicted box as a whole unit. It has been proven effective because it is consistent with the evaluation metric. There are many variants of IoU, such as GIoU [35], DIoU [52], CIoU [52], α-IoU [11] and SIoU [8], which form the related loss functions. In this work we experiment with GIoU, CIoU and SIoU. SIoU is applied to YOLOv6-N and YOLOv6-T, while the others use GIoU.

Probability loss. Distribution Focal Loss (DFL) [20] simplifies the continuous distribution of box locations into a discretized probability distribution. It considers the ambiguity and uncertainty of the data without introducing any other strong priors, which helps improve box localization accuracy, especially when the boundaries of the ground-truth boxes are ambiguous. On top of DFL, DFLv2 [19] develops a lightweight sub-network to exploit the close correlation between the distribution statistics and the real localization quality, which further boosts detection performance. However, DFL usually outputs 17 times more regression values than general box regression, causing a substantial overhead; the extra computational cost significantly hinders the training of small models, and DFLv2 further increases the burden because of the additional sub-network. In our experiments, DFLv2 brings performance gains similar to DFL on our models. Therefore, we only adopt DFL in YOLOv6-M/L. Experimental details can be found in Section 3.3.3.

intensive reading

IoU-series Loss—IoU series loss

SIoU Loss improves small models noticeably and GIoU Loss improves large models noticeably, so SIoU (for n/t/s) / GIoU (for m/l) is selected as the regression loss.

Probability Loss—probability loss

Distribution Focal Loss (DFL) and DFLv2 can bring certain performance gains but have a larger impact on efficiency, so DFL is only used in YOLOv6-M/L (and DFLv2 is not used). A sketch of DFL follows below.
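To make the DFL idea concrete, each box side is predicted as a discrete distribution over integer bins and decoded by its expectation; a hedged sketch (the number of bins and tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def dfl_loss(pred_dist: torch.Tensor, target: torch.Tensor, reg_max: int = 16) -> torch.Tensor:
    """pred_dist: (N, reg_max + 1) logits over integer bins for one box side.
    target:    (N,) continuous regression target in [0, reg_max]."""
    tl = target.floor().long().clamp(max=reg_max - 1)   # left integer bin
    tr = tl + 1                                         # right integer bin
    wl = tr.float() - target                            # weight of the left bin
    wr = 1.0 - wl                                       # weight of the right bin
    logp = F.log_softmax(pred_dist, dim=-1)
    return -(wl * logp.gather(1, tl[:, None]).squeeze(1)
             + wr * logp.gather(1, tr[:, None]).squeeze(1)).mean()

def dfl_decode(pred_dist: torch.Tensor, reg_max: int = 16) -> torch.Tensor:
    """Decode the predicted distribution to a scalar distance via its expectation."""
    bins = torch.arange(reg_max + 1, dtype=pred_dist.dtype, device=pred_dist.device)
    return (pred_dist.softmax(dim=-1) * bins).sum(dim=-1)
```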


2.3.3 Object Loss—target loss

translate

Object loss was first proposed in FCOS [41] to reduce the scores of low-quality bounding boxes so that they can be filtered out in post-processing. It is also used in YOLOX [7] to accelerate convergence and improve network accuracy. As an anchor-free framework like FCOS and YOLOX, we tried object loss in YOLOv6. Unfortunately, it did not bring many positive effects. Details are given in Section 3.

intensive reading

Object loss was first proposed in FCOS to lower the scores of low-quality bboxes, which makes it easier to filter them out in NMS. This loss is used in YOLOX to speed up convergence and improve accuracy, but using the same approach in YOLOv6 brought no gain.


2.4. Industry-handy improvements—industry-handy improvements

2.4.1 More training epochs—more training epochs

translate

Empirical results show that the performance of the detector keeps improving as training time increases. We extend the training duration from 300 epochs to 400 epochs to reach a better convergence.

intensive reading

Training is extended from 300 epochs to 400 epochs to achieve better convergence.

Effect

YOLOv6-N, T, and S improve AP by 0.4%, 0.6%, and 0.5% respectively with the longer training schedule.


2.4.2 Self-distillation—self-distillation

translate

To further improve the accuracy of the model without introducing much additional computational cost, we apply the classic knowledge distillation technique to minimize the KL-divergence between the predictions of the teacher and the student. We restrict the teacher to be the student itself but pre-trained, hence we call it self-distillation. Note that the KL-divergence is usually used to measure the difference between data distributions. However, there are two subtasks in object detection, of which only the classification task can directly utilize KL-divergence-based knowledge distillation. Thanks to the DFL loss [20], we can also perform it on box regression. The knowledge distillation loss can be formulated as:

$$L_{KD} = KL(p_t^{cls} \,\|\, p_s^{cls}) + KL(p_t^{reg} \,\|\, p_s^{reg})$$

where $p_t^{cls}$ and $p_s^{cls}$ are the class predictions of the teacher model and the student model respectively, and accordingly $p_t^{reg}$ and $p_s^{reg}$ are the box regression predictions. The overall loss function is then formulated as:

$$L_{total} = L_{det} + \alpha L_{KD}$$

where $L_{det}$ is the detection loss computed with predictions and labels. The hyperparameter $\alpha$ is introduced to balance the two losses. In the early stages of training, the soft labels from the teacher are easier to learn. As training continues, the student's performance will match the teacher's, so the hard labels will help the student more. On this basis, we apply a cosine weight decay to $\alpha$ to dynamically adjust the information from the hard labels and the soft labels of the teacher. We conduct detailed experiments to verify the effect of self-distillation in YOLOv6, which will be discussed in Section 3.

intensive reading

background

To further improve accuracy without introducing much additional computational cost, the classic knowledge distillation technique is applied to minimize the KL-divergence between the teacher's and the student's predictions.

method

The authors constrain the teacher model to have the same network structure as the student model, but pre-trained, hence the name self-distillation.

Thanks to the DFL loss, the regression branch can also use knowledge distillation; the loss function is shown in Equation 1 above (a code sketch follows below).

Effect

  • Applying self-distillation on the classification branch alone improves AP by 0.4%
  • Performing self-distillation on the box regression task improves AP by 0.3%
  • Introducing the cosine weight decay for self-distillation increases AP by 0.6%
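A hedged sketch of the self-distillation objective L_total = L_det + α·L_KD with a cosine-decayed α, assuming the detection loss and the teacher/student classification and DFL-regression outputs are already available; the temperature and the exact decay schedule are assumptions:

```python
import math
import torch
import torch.nn.functional as F

def kd_kl(student_logits: torch.Tensor, teacher_logits: torch.Tensor, temperature: float = 1.0):
    """KL(teacher || student) on softened class / DFL-regression distributions."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

def self_distillation_loss(det_loss, s_cls, t_cls, s_reg_dist, t_reg_dist,
                           epoch: int, max_epochs: int, alpha_max: float = 1.0):
    # cosine decay of alpha: rely more on the teacher's soft labels early,
    # and more on the hard labels late in training
    alpha = alpha_max * 0.5 * (1 + math.cos(math.pi * epoch / max_epochs))
    l_kd = kd_kl(s_cls, t_cls) + kd_kl(s_reg_dist, t_reg_dist)  # cls branch + DFL regression branch
    return det_loss + alpha * l_kd
```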

2.4.3 Gray border of images— image gray border

translate

We notice that in the implementations of YOLOv5 [10] and YOLOv7 [42], each image is surrounded by a half gray border when evaluating model performance. Although no useful information is added, it helps to detect objects near the edges of the image. This trick also works in YOLOv6. However, the extra gray pixels obviously slow down the inference speed. Without the gray border, the performance of YOLOv6 drops, which is also the case in [10, 42]. We speculate that this problem is related to the gray border padding used in Mosaic augmentation [1, 10]. For verification, we conduct experiments with turning off the Mosaic augmentation in the last epochs [7] (also known as the fade strategy). In this regard, we change the area of the gray border and directly resize the image with the gray border to the target image size. Combining these two strategies, our models maintain or even improve the performance without slowing down the inference speed.

intensive reading

background

A half gray border adds no useful information, but it helps to detect objects near the edges of the image.

method

  (1) Turn off Mosaic augmentation in the last epochs (also known as the fade strategy)

  (2) Change the area of the gray border and directly resize the image with the gray border to the target image size (see the letterbox sketch below)

Effect

The final performance of YOLOv6-N/S/M is more accurate, and the final image size is reduced from 672 to 640.
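For context, the gray border comes from the usual YOLO letterbox preprocessing; a minimal sketch of resizing an image while keeping its aspect ratio and padding it to the target size with gray (value 114) borders, assuming HWC numpy images (cv2.resize would be used in practice):

```python
import numpy as np

def letterbox(img: np.ndarray, new_size: int = 640, pad_value: int = 114):
    """Resize keeping the aspect ratio, then pad to (new_size, new_size) with gray borders."""
    h, w = img.shape[:2]
    scale = new_size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    # nearest-neighbor resize to keep the sketch dependency-free
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    out = np.full((new_size, new_size, img.shape[2]), pad_value, dtype=img.dtype)
    top, left = (new_size - nh) // 2, (new_size - nw) // 2
    out[top:top + nh, left:left + nw] = resized
    return out, scale, (left, top)   # scale and offsets are needed to map boxes back
```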


2.5. Quantization and Deployment—quantization and deployment

2.5.1 Reparameterizing Optimizer—reparameterizing optimizer

translate

RepOptimizer [2] proposes gradient reparameterization at each optimization step. This technique also works well for the quantization of reparameterization-based models. We therefore reconstruct the reparameterization blocks of YOLOv6 in this fashion and train them with RepOptimizer to obtain PTQ-friendly weights. The distribution of feature maps becomes much narrower (Fig. 4, B.1), which greatly benefits the quantization process; see Section 3.5.1 for results.

intensive reading

(1) RepOptimizer proposes gradient reparameterization at each optimization step, which better handles the quantization of reparameterization-based models.

(2) RepOptimizer is used in YOLOv6 to obtain PTQ-friendly weights; the resulting feature distributions are very narrow, which benefits quantization.


2.5.2 Sensitivity Analysis—sensitivity analysis

translate

We further improve the performance of PTQ by partially converting quantization-sensitive operations into floating-point computation. To obtain the sensitivity distribution, several metrics are commonly used: mean squared error (MSE), signal-to-noise ratio (SNR) and cosine similarity. Typically, one can pick the output feature map (after the activation of a certain layer) and compute these metrics with and without quantization for comparison. As an alternative, it is also viable to compute the validation AP when switching the quantization of a specific layer on and off [29].

We compute all these metrics on the YOLOv6-S model trained with RepOptimizer and select the top-6 sensitive layers to run in float. A full chart of the sensitivity analysis can be found in B.2.

intensive reading

PTQ performance is further improved by running quantization-sensitive operations in floating point. To obtain the sensitivity distribution, the authors use the mean squared error (MSE), the signal-to-noise ratio (SNR) and cosine similarity (a sketch follows below).
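These sensitivity metrics can be computed per layer by comparing the layer's float output feature map with its quantized counterpart; a minimal sketch (how the quantized feature map is obtained is outside the scope of this snippet):

```python
import torch

def sensitivity_metrics(feat_fp32: torch.Tensor, feat_quant: torch.Tensor) -> dict:
    """Compare a layer's float output with its quantized output."""
    x, q = feat_fp32.flatten(), feat_quant.flatten()
    noise = x - q
    mse = noise.pow(2).mean()
    snr = 10 * torch.log10(x.pow(2).mean() / (noise.pow(2).mean() + 1e-12))  # in dB
    cos = torch.nn.functional.cosine_similarity(x, q, dim=0)
    return {"mse": mse.item(), "snr_db": snr.item(), "cosine": cos.item()}

# rank layers by sensitivity (e.g. lowest SNR / cosine, highest MSE)
# and keep the most sensitive layers in floating point
```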

2.5.3 Quantization-aware Training with Channel-wise Distillation—quantization-aware training based on channel distillation

translate

In cases where PTQ is insufficient, we propose to involve quantization-aware training (QAT) to improve quantization performance. To resolve the problem of inconsistent fake quantizers during training and inference, it is necessary to build QAT on top of RepOptimizer. In addition, channel-wise distillation [36] (later called CW distillation) is adopted within the YOLOv6 framework, as shown in Fig. 5. This is also a self-distillation approach where the teacher network is the student model in FP32 precision. See the experiments in Section 3.5.1.

intensive reading

To handle cases where PTQ is insufficient, the author introduces QAT (quantization-aware training) built on top of RepOptimizer to keep the fake quantizers consistent between training and inference. In addition, channel-wise distillation is used, as shown in the figure:
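A hedged sketch of channel-wise distillation as used here for QAT: each channel's activation map is turned into a spatial probability distribution with a softmax, and the quantized student is matched to the FP32 teacher via KL divergence; the temperature and reduction are assumptions:

```python
import torch
import torch.nn.functional as F

def channel_wise_distillation(feat_student: torch.Tensor, feat_teacher: torch.Tensor,
                              tau: float = 4.0) -> torch.Tensor:
    """feat_*: (N, C, H, W) feature maps from matching layers of the quantized student
    and the FP32 teacher. Each channel is normalized over its spatial positions with a
    softmax, then KL(teacher || student) is averaged over all channels."""
    n, c, h, w = feat_student.shape
    s = feat_student.reshape(n * c, h * w) / tau
    t = feat_teacher.reshape(n * c, h * w) / tau
    log_p_s = F.log_softmax(s, dim=1)
    p_t = F.softmax(t, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (tau ** 2)
```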


3. Experiments— Experiments

3.1 Implementation Details—implementation details

translate

We use the same optimizer and learning schedule as YOLOv5 [10], i.e. stochastic gradient descent (SGD) with momentum and cosine decay of the learning rate. Warm-up, grouped weight decay and exponential moving average (EMA) are also utilized. We adopt two strong data augmentations (Mosaic [1, 10] and Mixup [49]) following [1, 7, 10]. A complete list of hyperparameter settings can be found in our released code. We train our models on the COCO 2017 [23] training set and evaluate accuracy on the COCO 2017 validation set. All our models are trained on 8 NVIDIA A100 GPUs, and speed performance is measured on an NVIDIA Tesla T4 GPU with TensorRT version 7.2 unless otherwise stated. Speed performance measured with other TensorRT versions or on other devices is given in Appendix A.

intensive reading

(1) The same optimization algorithm and learning schedule as YOLOv5 are used (SGD, cosine learning rate decay, warm-up, grouped weight decay, EMA, and two strong data augmentations)

(2) The models are trained on the COCO 2017 training set and accuracy is evaluated on the COCO 2017 validation set

(3) All models are trained on 8 NVIDIA A100 GPUs, and speed performance is measured on an NVIDIA Tesla T4 GPU with TensorRT version 7.2


3.2 Comparisons—control experiments

translate

Considering that the goal of this work is to build networks for industrial applications, we mainly focus on the speed performance of all models after deployment, including throughput (FPS at batch size 1 or 32) and GPU latency, rather than FLOPs or the number of parameters. We compare YOLOv6 with other state-of-the-art detectors of the YOLO family, including YOLOv5 [10], YOLOX [7], PPYOLOE [45] and YOLOv7 [42]. Note that we test the speed performance of all official models at FP16 precision on the same Tesla T4 GPU with TensorRT [28]. The performance of YOLOv7-Tiny is re-evaluated according to its open-source code and weights at input sizes of 416 and 640. The results are shown in Table 1 and Figure 2. Compared with YOLOv5-N/YOLOv7-Tiny (input size=416), our YOLOv6-N is significantly improved by 7.9%/2.6%, and it also has the best speed performance in terms of both throughput and latency. Compared with YOLOX-S/PPYOLOE-S, YOLOv6-S increases AP by 3.0%/0.4% while being faster. We compare YOLOv5-S and YOLOv7-Tiny (input size=640) with YOLOv6-T: our method is 2.9% more accurate and 73/25 FPS faster at a batch size of 1. YOLOv6-M achieves 4.2% higher AP than YOLOv5-M at a similar speed, and its AP is 2.7%/0.6% higher than YOLOX-M/PPYOLOE-M at a higher speed. Besides, it is more accurate and faster than YOLOv5-L. YOLOv6-L is 2.8%/1.1% more accurate than YOLOX-L/PPYOLOE-L under the same latency constraint. We also provide a faster version of YOLOv6-L by replacing SiLU with ReLU (denoted as YOLOv6-L-ReLU). It achieves 51.7% AP with a latency of 8.8 ms, outperforming YOLOX-L/PPYOLOE-L/YOLOv7 in both accuracy and speed.

intensive reading

Compared with SOTA

(1) Compared with YOLOv5-N/YOLOv7-Tiny (input size=416), YOLOv6-N improves AP by 7.9% and 2.6% respectively and reaches the highest speed

(2) Compared with YOLOX-S/PPYOLOE-S, YOLOv6-S improves AP by 3.0% and 0.4% respectively

(3) Compared with YOLOv5-S and YOLOv7-Tiny (input size=640), YOLOv6-T is 2.9% more accurate and faster; compared with YOLOv5-M, YOLOv6-M has 4.2% higher AP at a similar speed

(4) Compared with YOLOX-M/PPYOLOE-M, YOLOv6-M is faster and has 2.7% and 0.6% higher AP respectively

(5) Compared with YOLOX-L/PPYOLOE-L/YOLOv7, YOLOv6-L-ReLU achieves 51.7% AP, surpassing these methods in both accuracy and speed


3.3 Ablation Study—ablation experiment

3.3.1 Network—Network

Backbone and neck—backbone network and neck

translate

Backbone and neck. We explore the influence of single-path and multi-branch structures on the backbone and neck, as well as the channel coefficient (denoted as CC) of the CSPStackRep block. All models described in this part use TAL as the label assignment strategy, VFL as the classification loss, and GIoU together with DFL as the regression loss. The results are shown in Table 2. We find that models at different scales call for different structures. For YOLOv6-N, the single-path structure outperforms the multi-branch structure in terms of both accuracy and speed: although it has more FLOPs and parameters, it runs faster due to a relatively lower memory footprint and a higher degree of parallelism. For YOLOv6-S, the two block styles bring similar performance. For larger models, the multi-branch structure achieves better performance in accuracy and speed, and we finally choose multi-branch with a channel coefficient of 2/3 for YOLOv6-M and 1/2 for YOLOv6-L. Furthermore, we investigate the influence of the width and depth of the neck on YOLOv6-L. The results in Table 3 show that a slim and deep neck performs 0.2% AP better than a wide and shallow neck at a similar speed.

intensive reading

The author compares the influence of different blocks in the backbone and neck and the channel coefficient in the CSPStackRep Block

in conclusion  

Models of different scales call for different network structures


Combinations of convolutional layers and activation functions—convolutional layer and activation function combination

translate

The YOLO series adopts a wide variety of activation functions: ReLU [27], LReLU [25], Swish [31], SiLU [4], Mish [26], and so on. Among these, SiLU is the most used. In general, SiLU has better accuracy and does not incur much extra computational cost. However, when it comes to industrial applications, especially for deploying models with TensorRT [28] acceleration, ReLU has a greater speed advantage because it can be fused into the convolution. Furthermore, we verify the effectiveness of combinations of RepConv/ordinary convolution (denoted as Conv) and ReLU/SiLU/LReLU in networks of different sizes to achieve a better trade-off. As shown in Table 4, Conv with SiLU performs best in accuracy, while the RepConv and ReLU combination achieves a better trade-off. We suggest users adopt RepConv with ReLU in latency-sensitive applications. We choose the RepConv/ReLU combination in YOLOv6-N/T/S/M for higher inference speed and the Conv/SiLU combination in YOLOv6-L to speed up training and improve performance.

intensive reading

Commonly used activation functions in the YOLO series include ReLU, LReLU, Swish, SiLU, Mish, etc. SiLU has the highest accuracy and is the most commonly used, but it cannot be fused into the convolutional layer when deploying models accelerated with TensorRT, whereas ReLU can, giving ReLU a speed advantage.

The effectiveness of combinations of RepConv/ordinary convolution (denoted as Conv) and ReLU/SiLU/LReLU is further verified in networks of different sizes.

in conclusion

(1) Conv + SiLU performs best in accuracy, but RepConv + ReLU achieves a better balance between performance and speed

(2) The RepConv/ReLU combination is used in YOLOv6-N/T/S/M for higher inference speed

(3) The Conv/SiLU combination is used in the large model YOLOv6-L to speed up training and improve performance.


Miscellaneous design—the rest of the design

translate

We also perform a series of ablations on the other network parts mentioned in Section 2.1, based on YOLOv6-N. We choose YOLOv5-N as the baseline and add the other components incrementally. The results are shown in Table 5. First, with the decoupled head (denoted as DH), our model is 1.4% more accurate at a 5% increase in time cost. Second, we verify that the anchor-free paradigm is 51% faster than the anchor-based one because of its 3× fewer predefined anchors, which results in a lower output dimensionality. Furthermore, the unified modification of the backbone (EfficientRep backbone) and the neck (Rep-PAN neck), denoted as EB+RN, brings a 3.6% AP improvement and 21% faster speed. Finally, the optimized decoupled head (hybrid channels, HC) brings 0.2% AP and 6.8% FPS uplifts in accuracy and speed, respectively.

intensive reading

Operation and conclusion

DH: Taking YOLOv5-N as the baseline, the influence of different components in YOLOv6-N is verified; the decoupled head (DH) improves performance by 1.4% and increases time cost by 5%

AF: The anchor-free scheme is 51% faster than the anchor-based one

EB + RN: The EfficientRep backbone + Rep-PAN neck improve performance by 3.6% and run 21% faster

HC: The hybrid-channel strategy in the head increases performance by 0.2% and speed by 6.8%


 3.3.2 Label Assignment—label assignment

translate

In Table 6, we analyze the effectiveness of mainstream label assignment strategies. Experiments are conducted on YOLOv6-N. As expected, we observe that SimOTA and TAL are the best two strategies: compared with ATSS, SimOTA improves AP by 2.0%, and TAL brings 0.5% higher AP than SimOTA. Considering TAL's stable training and better accuracy, we adopt TAL as our label assignment strategy. In addition, the implementation of TOOD [5] uses ATSS [51] as the warm-up label assignment strategy during the early training epochs. We also keep the warm-up strategy and explore it further. Details are shown in Table 7: similar performance can also be achieved without warm-up, or with warm-up via other strategies (e.g., SimOTA).

intensive reading

  • The comparison shows that   SimOTA and TAL are the best two strategies.

in conclusion

Considering TAL's stable training and better accuracy, the authors adopt TAL as the label assignment strategy.

  • Further exploring the warm-up strategy :

in conclusion  

Similar performance can also be achieved without warm-up, or with warm-up via other strategies (e.g. SimOTA).


3.3.3 Loss functions—loss function

Classification Loss—classification loss

translate

Classification loss: We experiment with Focal Loss [22], Poly Loss [17], QFL [20] and VFL [50] on YOLOv6-N/S/M. As can be seen in Table 8, compared with Focal Loss, VFL brings 0.2%/0.3%/0.1% AP improvements on YOLOv6-N/S/M respectively. We choose VFL as the classification loss function.

intensive reading

The author validates different classification loss functions

in conclusion

VFL is chosen as the classification loss function


Regression Loss— regression loss

translate

IoU-series losses and probability losses are experimented with on YOLOv6-N/S/M, using the latest IoU-series losses. The experimental results in Table 9 show that SIoU loss outperforms the others on YOLOv6-N and YOLOv6-T, while CIoU loss performs better on YOLOv6-M. For the probability losses, as listed in Table 10, introducing DFL brings 0.2%/0.1%/0.2% performance gains for YOLOv6-N/S/M respectively. However, for small models the inference speed suffers greatly. Therefore, DFL is only introduced in YOLOv6-M/L.

intensive reading

The author experimented with IoU series loss and probability loss function on YOLOv6-N/S/M

in conclusion

(1) About IoU series loss:    YOLOv6-N and YOLOv6-T use SIoU loss, and the rest use GIoU loss

(2) Regarding probability loss:    YOLOv6-M/L uses DFL , and the rest are not used


Object Loss—target loss

translate

As shown in Table 11, object loss is also experimented with on YOLOv6. From Table 11 we can see that object loss has a negative effect on the YOLOv6-N/S/M networks, with a maximum drop of 1.1% AP on YOLOv6-N. The negative gain may come from the conflict between the objectness branch and the other two branches in TAL. Specifically, during training, the IoU between predicted and ground-truth boxes, as well as the classification score, are used together to build a metric as the criterion for assigning labels. However, the introduced objectness branch expands the number of tasks to be aligned from two to three, which obviously increases the difficulty. Based on the experimental results and this analysis, object loss is discarded in YOLOv6.

intensive reading

Object Loss experiment was also carried out using YOLOv6

in conclusion

The object loss reduces the effectiveness of YOLOv6-N/S/M, so the author chooses to discard it.

Reason: the negative gain may come from the conflict between the objectness branch and the other two branches in TAL. In TAL, the IoU and the classification score are jointly used for label assignment; introducing an additional objectness branch expands the alignment from two tasks to three, which increases the difficulty of alignment.


3.4 Industry-handy improvements—handy improvements in industry

(For this part, please see Section 2.4 above, so it will not be repeated here~)


3.5. Quantization Results—quantization results

3.5.1 PTQ

translate

When the model is trained with RepOptimizer, the average PTQ performance is significantly improved, see Table 15. Training with RepOptimizer is generally about as fast, with almost the same full-precision accuracy.

intensive reading

The models are trained with RepOptimizer.

Conclusion: training with RepOptimizer brings a substantial improvement in quantized (PTQ) performance


3.5.2 QAT

translate

For v1.0, we apply fake quantizers to the non-sensitive layers obtained from Section 2.5.2 to perform quantization-aware training, and call it partial QAT. We compare the results with full QAT in Table 16. Partial QAT leads to better accuracy but slightly lower throughput. Since the quantization-sensitive layers are removed in v2.0, we directly use full QAT on YOLOv6-S trained with RepOptimizer. We eliminate the inserted quantizers through graph optimization for higher accuracy and faster speed. We compare with the distillation-based quantization results from PaddleSlim [30] in Table 17. Note that our quantized version of YOLOv6-S is the fastest and the most accurate, see also Figure 1.

intensive reading

For v1.0, the authors apply quantization-aware training only to the non-sensitive layers and call it partial QAT.

in conclusion

Partial QAT (which quantizes only the non-sensitive layers) gives better accuracy than full QAT, but slightly lower throughput.

In v2.0, the quantization-sensitive layers are removed, so the authors directly use full QAT on YOLOv6-S trained with RepOptimizer.

in conclusion

The quantized YOLOv6-S is both fast and accurate; the compared detectors use the distillation-based quantization method from PaddleSlim.


4. Conclusion—Conclusion

translate

In short, and with ongoing industry needs in mind, we propose YOLOv6 in its current form, carefully examining all advances in object detector components to date, while instilling our thoughts and practices. The results outperform other available real-time detectors in both accuracy and speed. To facilitate industrial deployment, we also provide a custom quantization method for YOLOv6, making it a fast detector out of the box. We sincerely appreciate the outstanding ideas and efforts from academia and industry. In the future, we will continue to expand the project to meet higher standards and more demanding scenarios.

intensive reading

YOLOv6 outperforms other available object detectors in both accuracy and speed . To facilitate industrial deployment, the authors also provide a custom quantization method for YOLOv6 , making it an out-of-the-box fast detector.

problems solved

   (1) The structural reparameterization method proposed by RepVGG performs well but had not been well exploited in detection. The author argues that small and large models do not need to keep the same network structure: a plain single-path architecture works well for small models, while for large models simply stacking up parameters in a single path is impractical.

   (2) After adopting reparameterization, the quantization of the detector also needs to be reconsidered, otherwise performance may degrade because of the different structures used during training and inference.

   (3) Earlier work paid little attention to deployment: inference was benchmarked on high-end machines such as the V100, whereas in practice low-power inference GPUs such as the T4 are commonly used, and the author focuses on the latter's performance.

   (4) Label assignment and the loss functions are reconsidered in light of the changes in network structure.

   (5) For deployment, the training strategy can be adjusted to improve performance without increasing the inference cost, for example by using knowledge distillation.

main contribution

 (1) Models of different scales are designed for different industrial scenarios, taking both accuracy and speed into account; the small models use a single-path backbone and the large models use multi-branch blocks.

 (2) A self-distillation strategy is used for both the classification and the regression tasks, with the knowledge from the teacher and from the labels dynamically adjusted to facilitate the training of the student model.

 (3) Various label assignment strategies, loss functions and data augmentation techniques are analyzed, and suitable strategies are selected to further improve performance.

 (4) Based on RepOptimizer and channel-wise distillation, the quantization scheme is improved, yielding a faster and more accurate quantized detector.

future outlook

   1.   Improve the full range of YOLOv6 models and continue to improve detection performance.

   2.   Design hardware-friendly models for various hardware platforms.

   3.   Support ARM platform deployment and full-chain adaptation such as quantization distillation.

   4.   Expand horizontally by introducing related technologies, such as semi-supervised and self-supervised learning.

   5.   Explore the generalization performance of YOLOv6 in more unknown business scenarios.


References in this article:

YOLOv6: Fast and accurate target detection framework open source - Meituan Technical Team (meituan.com)


Original post: blog.csdn.net/weixin_43334693/article/details/130444498