[YOLO series] Super detailed interpretation of YOLOv7 papers (translation + study notes)

foreword

Finally read the legendary YOLOv7~≖‿≖✧

This paper made a high-profile debut less than a month after Meituan's v6 came out. The author is still the AB master we are familiar with (yes, the one behind v4), so it feels "familiar" to read (it took up my entire May Day holiday (╯-_-)╯╧╧).

In fact, there are many details of the YOLOv7 network structure worth studying in depth, and I will release a detailed code-walkthrough series like the v5 one in the future (there are too many private messages in the background asking me to cover the v7 code — hold on a moment! Let me finish studying it first! (ง •̀_•́)ง)

Well, let us study the paper itself first!

Learning materials:

Paper: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

Open source code: mirrors / WongKinYiu / yolov7 · GitCode


Earlier reviews in the YOLO paper series: 

[YOLO series] Super detailed interpretation of YOLOv6 papers (translation + study notes)

[YOLO Series] YOLOv5 Super Detailed Interpretation (Network Detailed Explanation)

[YOLO series] Super detailed interpretation of YOLOv4 paper 2 (detailed network explanation)

[YOLO series] Super detailed interpretation of YOLOv4 papers 1 (translation + study notes)

[YOLO series] Super detailed interpretation of YOLOv3 papers (translation + study notes)

[YOLO series] Super detailed interpretation of YOLOv2 papers (translation + study notes)

[YOLO series] Super detailed interpretation of YOLOv1 papers (translation + study notes)


Table of contents

foreword

Abstract—abstract

1. Introduction—Introduction

2. Related work—related work

2.1 Real-time object detectors—real-time object detector

2.2 Model re-parameterization—model re-parameterization

2.3 Model scaling—model scaling

3. Architecture—Network Structure

3.1 Extended efficient layer aggregation networks—Extended efficient layer aggregation network

3.2 Model scaling for concatenation-based models— model scaling based on connection models

4. Trainable bag-of-freebies—trainable bag-of-freebies

4.1 Planned re-parameterized convolution—convolution re-parameterization

4.2 Coarse for auxiliary and fine for lead loss—auxiliary training module

4.2.1 Deep supervision—deep supervision

4.2.2 Label assigner— label assigner

4.2.3 Lead head guided label assigner— lead head guided label assigner

4.2.4 Coarse-to-fine lead head guided label assigner—from coarse to fine lead head guided label assigner

4.3 Other trainable bag-of-freebies—other trainable "tools"

5. Experiments—Experiments

5.1 Experimental setup—experimental setup

5.2 Baselines—Baseline Network

5.3 Comparison with state-of-the-arts—comparison with other popular networks

5.4 Ablation study—ablation study

5.4.1 Proposed compound scaling method—the proposed compound scaling method

5.4.2 Proposed planned re-parameterized model—proposed planned re-parameterized model

5.4.3 Proposed assistant loss for auxiliary head—the proposed assistant loss for the auxiliary head

6. Conclusions—Conclusion

Abstract— abstract

translate

YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS, and has the highest accuracy, 56.8% AP, among all known real-time object detectors running at 30 FPS or higher on GPU V100. The YOLOv7-E6 object detector (56 FPS V100, 55.9% AP) is 509% faster and 2% more accurate than the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS A100, 53.9% AP), and 551% faster and 0.7% AP more accurate than the convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS A100, 55.2% AP). YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B, and many other object detectors in speed and accuracy. Furthermore, we train YOLOv7 from scratch on the MS COCO dataset only, without using any other datasets or pre-trained weights. The source code is published at: GitHub - WongKinYiu/yolov7: Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors


intensive reading

Achievements of YOLOv7

Surpasses all known object detectors in speed and accuracy in the range of 5 FPS to 160 FPS, and achieves the highest accuracy among all known real-time object detectors at 30 FPS or higher on GPU V100: 56.8% AP

YOLOv7-E6 outperforms, in both speed and accuracy:

  • The transformer-based detector SWIN-L Cascade-Mask R-CNN
  • The convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN

YOLOv7 also outperforms, in both speed and accuracy:

YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B, and many other object detectors.

Training: the authors train YOLOv7 from scratch on the MS COCO dataset only, without using any other datasets or pre-trained weights.


1.   Introduction— Introduction

translate

    Real-time object detection is a very important topic in computer vision, because it is often a necessary component in computer vision systems, for example in multi-object tracking [94, 93], autonomous driving [40, 18], robotics [35, 58], medical image analysis [34, 46], etc. Computing devices that perform real-time object detection are usually mobile CPUs or GPUs, as well as various neural processing units (NPUs) developed by major manufacturers. For example, the Apple Neural Engine (Apple), Neural Compute Stick (Intel), Jetson AI edge devices (Nvidia), Edge TPU (Google), Neural Processing Engine (Qualcomm), AI Processing Unit (MediaTek), and AI SoCs (Kneron) are all NPUs. Some of the edge devices mentioned above focus on accelerating different operations, such as plain convolution, depthwise convolution, or MLP operations. In this paper, we propose a real-time object detector that we mainly hope can support both mobile GPUs and GPU devices from the edge to the cloud.

    In recent years, real-time object detectors are still being developed for different edge devices. For example, the development of MCUNet [49, 48] and NanoDet [54] focused on producing low-power single chips and improving the inference speed on edge CPUs, while methods such as YOLOX [21] and YOLOR [81] focus on improving the inference speed on various GPUs. More recently, the development of real-time object detectors has focused on the design of efficient architectures. Real-time object detectors that can be used on CPUs [54, 88, 84, 83] are mostly designed based on MobileNet [28, 66, 27], ShuffleNet [92, 55], or GhostNet [25], while mainstream real-time object detectors developed for GPUs [81, 21, 97] mostly use ResNet [26], DarkNet [63], or DLA [87], and then optimize the architecture with the CSPNet [80] strategy. The method proposed in this paper takes a different direction from current mainstream real-time object detectors. In addition to architecture optimization, our proposed method focuses on the optimization of the training process. We focus on optimized modules and optimization methods that may increase the training cost to improve the accuracy of object detection, but do not increase the inference cost. We call the proposed modules and optimization methods trainable bag-of-freebies.

    Recently, model re-parameterization [13, 12, 29] and dynamic label assignment [20, 17, 42] have become important topics in network training and object detection. After these new concepts were proposed, many new issues arose in the training of object detectors. In this paper, we present some of the new issues we discovered and devise effective methods to address them. For model re-parameterization, we use the concept of gradient propagation paths to analyze the re-parameterization strategies applicable to the layers of different networks, and propose the planned re-parameterized model. Furthermore, when dynamic label assignment techniques are used, we discover that training a model with multiple output layers creates new issues, namely: "How to assign dynamic targets to the outputs of the different branches?" To address this question, we propose a new label assignment method called coarse-to-fine lead guided label assignment.

    The contributions of this paper are summarized as follows: (1) we design several trainable bag-of-freebies methods that allow real-time object detection to greatly improve detection accuracy without increasing the inference cost; (2) for the evolution of object detection methods, we discovered two new issues, namely how a re-parameterized module replaces the original module, and how a dynamic label assignment strategy handles assignment to different output layers, and we propose methods to address the difficulties they pose; (3) we propose "extend" and "compound scaling" methods for real-time object detectors that can effectively utilize parameters and computation; (4) the proposed method can effectively reduce about 40% of the parameters and 50% of the computation of the state-of-the-art real-time object detector, while achieving faster inference speed and higher detection accuracy.


intensive reading

The main work of this paper

  • A real-time object detector is proposed, mainly in the hope that it can support both mobile GPUs and GPU devices from the edge to the cloud
  • Besides optimizing the architecture, the focus is on optimizing the training process: optimized modules and optimization methods that are together called the trainable "bag-of-freebies"

In fact, we have already seen Bag of freebies and Bag of specials in YOLOv4; let's review:

  • Bag of freebies: literally "free gifts". Here it refers to training the model with useful training techniques that only change the training strategy or only increase the training cost. In this way, the model can achieve better accuracy without increasing model complexity, and the amount of computation at inference does not increase.
  • Bag of specials: refers to plugin modules and post-processing methods that only slightly increase the inference cost but can greatly improve detection accuracy. Generally, these plugins enhance a specific attribute of a model, for example enlarging the receptive field (SPP, ASPP, RFB), introducing attention mechanisms (spatial attention, channel attention), or improving feature-integration ability (FPN, ASFF, BiFPN).
  • For the model re-parameterization problem, this paper uses the concept of gradient propagation paths to analyze the re-parameterization strategies applicable to the layers of different networks, and proposes the planned re-parameterized model.
  • For the dynamic label assignment problem, this paper proposes a new label assignment method called coarse-to-fine lead guided label assignment

The main contribution of this paper

(1) Several trainable methods are proposed; they only increase the training cost to improve model performance, without increasing the inference cost.

(2) For the development of target detection methods, the author discovered two new problems :

  • ① How to replace the original module with a re-parameterized module
  • ② How the dynamic label assignment strategy handles assignment to different output layers

This paper proposes solutions to both problems.

(3) The author proposes "extend" and "compound scaling" methods for real-time object detectors, which make more effective use of parameters and computation.

(4) The proposed method can effectively reduce the number of parameters by about 40% and the computation by about 50% compared with the state-of-the-art real-time object detector, while being both accurate and fast.


2. Related work—related work

2.1 Real-time object detectors—real-time object detector

translate

    The current state-of-the-art real-time object detectors are mainly based on YOLO [61, 62, 63] and FCOS [76, 77], namely [3, 79, 81, 21, 54, 85, 23]. Becoming a state-of-the-art real-time object detector usually requires the following: (1) a faster and stronger network architecture; (2) a more effective feature integration method [22, 97, 37, 74, 59, 30, 9, 45]; (3) a more accurate detection method [76, 77, 69]; (4) a more robust loss function [96, 64, 6, 56, 95, 57]; (5) a more efficient label assignment method [99, 20, 17, 82, 42]; and (6) a more efficient training method. In this paper, we do not intend to explore self-supervised learning or knowledge distillation methods that require additional data or large models. Instead, we design new trainable bag-of-freebies methods for the problems derived from the state-of-the-art methods related to (4), (5), and (6) above.


intensive reading

To become a state-of-the-art real-time object detector, a network usually requires:

     (1) A faster and stronger network architecture

     (2) A more effective feature integration method

     (3) A more accurate detection method

     (4) A more robust loss function

     (5) A more effective label assignment method

     (6) A more effective training method

This article mainly focuses on (4), (5), and (6).


2.2 Model re-parameterization—model re-parameterization

translate

    Model re-parameterization techniques [71, 31, 75, 19, 33, 11, 4, 24, 13, 12, 10, 29, 14, 78] merge multiple computational modules into one at the inference stage. Model re-parameterization can be regarded as an ensemble technique, which we can divide into two categories: module-level ensemble and model-level ensemble. There are two common practices of model-level re-parameterization for obtaining the final inference model. One is to train multiple identical models with different training data and then average the weights of the trained models. The other is to take a weighted average of the model weights at different iterations. Module-level re-parameterization is a popular research topic recently. This kind of method splits a module into multiple identical or different module branches during training and integrates the multiple branches into a completely equivalent module during inference. However, not all proposed re-parameterized modules can be applied perfectly to different architectures. With this in mind, we develop new re-parameterization modules and design related application strategies for various architectures.


intensive reading

Introduction to Model Reparameterization

Split a module into multiple identical or different module branches during training; integrate the multiple branches into a completely equivalent module during inference.

Re-parameterization can be seen as an ensemble technique, divided into two categories: module-level ensemble and model-level ensemble.

Two ways to get the final inference model

     (1) Train multiple identical models with different training data, then average the weights of the trained models

     (2) Take a weighted average of the model weights at different training iterations (see the sketch below)
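To make this concrete, here is a minimal PyTorch sketch of model-level re-parameterization by weight averaging. It is not from the paper; the function name and the equal-weight average are illustrative assumptions.

```python
import copy
import torch

def average_models(models):
    """Model-level re-parameterization (sketch): fuse several identically
    structured trained models into one inference model by averaging weights."""
    fused = copy.deepcopy(models[0])
    state = fused.state_dict()
    for key, value in state.items():
        if value.is_floating_point():
            # equal-weight average of this parameter across all models;
            # non-float buffers (e.g. num_batches_tracked) are kept as-is
            state[key] = torch.stack(
                [m.state_dict()[key] for m in models]).mean(dim=0)
    fused.load_state_dict(state)
    return fused
```

A weighted average over checkpoints from different iterations, i.e. approach (2), works the same way, with per-checkpoint weights replacing the plain mean.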

advantage

  • During training, a multi-branch network is used so the model can learn better feature representations
  • During inference, the parallel branches are fused into a single serial path, reducing the number of parameters and the amount of computation and increasing speed (in theory the fused model behaves exactly like the unfused one; in practice the accuracy usually drops slightly)

insufficient

   Not all proposed reparameterization modules can be perfectly applied to different architectures.


2.3 Model scaling—model scaling

translate

   Model scaling [72, 60, 74, 73, 15, 16, 2, 51] is a way to scale a designed model up or down so that it fits different computing devices. Model scaling methods usually use different scaling factors, such as resolution (size of the input image), depth (number of layers), width (number of channels), and stage (number of feature pyramids), to achieve a good trade-off among the number of network parameters, the amount of computation, inference speed, and accuracy. Network architecture search (NAS) is one of the commonly used model scaling methods: it can automatically search for suitable scaling factors from a search space without defining overly complicated rules. The disadvantage of NAS is that it requires very expensive computation to complete the search for the scaling factors. In [15], the researchers analyzed the relationship between scaling factors and the number of parameters and operations, trying to directly estimate some rules and thereby obtain the scaling factors required by model scaling. Looking through the literature, we find that almost all model scaling methods analyze individual scaling factors independently, and even methods in the compound scaling category optimize scaling factors independently. This is because most popular NAS architectures deal with scaling factors that are not strongly correlated. We observe that all concatenation-based models, such as DenseNet [32] or VoVNet [39], change the input width of some layers when the depth of such models is scaled. Since the proposed architecture is concatenation-based, we have to devise a new compound scaling method for this model.


intensive reading

Common scaling factors

  • Resolution (size of the input image)
  • Depth (number of layers)
  • Width (number of channels)
  • Stage (number of feature pyramids)

 Common method: NAS

   Introduction: NAS is network architecture search. Its main idea is not to design a specific network by hand, but to let the model search for one by itself.

   Issues to consider:

             (1) How to define the candidate search space

             (2) How to speed up the search and training

   Cons: consumes a lot of time and computing resources


3. Architecture— Network Structure

3.1 Extended efficient layer aggregation networks—Extended efficient layer aggregation network

translate

In most of the literature on designing efficient architectures, the main considerations are no more than the number of parameters, the amount of computation, and the computational density. Starting from the characteristics of memory access cost, Ma et al. also analyzed the influence of the input/output channel ratio, the number of branches in the architecture, and element-wise operations on network inference speed. Dollár et al. additionally considered activations when performing model scaling, i.e., put more consideration on the number of elements in the output tensors of convolutional layers. The design of CSPVoVNet [79] in Fig. 2(b) is a variant of VoVNet [39]. In addition to considering the basic design issues above, the architecture of CSPVoVNet [79] also analyzes the gradient path so that the weights of different layers can learn more diverse features. The gradient analysis described above makes inference faster and more accurate. The ELAN [1] in Fig. 2(c) considers the design strategy "how to design an efficient network?" and concludes that, by controlling the shortest and longest gradient paths, a deeper network can learn and converge effectively. In this paper, we propose an ELAN-based Extended ELAN (E-ELAN), whose main architecture is shown in Fig. 2(d).

     In large-scale ELAN, a stable state is reached regardless of the gradient path length and the number of stacked computational blocks. However, if more computational blocks are stacked without limit, this stable state may be destroyed and parameter utilization will decrease. The proposed E-ELAN uses expand, shuffle, and merge cardinality to continuously enhance the network's learning ability without destroying the original gradient path. In terms of architecture, E-ELAN only changes the architecture of the computational blocks, while the architecture of the transition layer remains completely unchanged. Our strategy is to use group convolution to expand the channels and cardinality of the computational blocks. We apply the same group parameter and channel multiplier to all the computational blocks of a computational layer. Then, the feature maps computed by each computational block are shuffled into g groups according to the set group parameter g and concatenated together. At this point, the number of channels in each group of feature maps is the same as in the original architecture. Finally, we add the g groups of feature maps together to perform merge cardinality. Besides maintaining the original ELAN design architecture, E-ELAN can also guide different groups of computational blocks to learn more diverse features.


intensive reading

(a)VoVNet

VoVNet is a concatenation-based model composed of OSA (One-Shot Aggregation) modules. It improves DenseNet to be more efficient and differs from the common plain and residual structures.

This structure inherits DenseNet's advantage of representing multiple features with multiple receptive fields, while solving the inefficiency of dense connections.

For detailed explanations of VoVNet, please refer to: VoVNet: Real-time Target Detection Backbone Network_Target Detection Backbone_AI Vision Network Qi Blog-CSDN Blog


 (b)CSPVoVNet 

(b) is CSPVoVNet, a CSP variant of (a). In addition to considering the number of parameters, the amount of computation, the computational density, and the memory access cost factors raised by ShuffleNet v2 (input/output channel ratio, number of architecture branches, element-wise operations, etc.), it also analyzes the gradient path, which allows the weights of different layers to learn more diverse features.


(c)  ELAN

(c) concludes, on "how to design an efficient network": by controlling the shortest and longest gradient paths, a deeper network can learn and converge more efficiently.

Advantages and disadvantages of the gradient path design strategy

   The gradient path design strategy has 3 advantages:

  • 1. Network parameters can be used effectively. By adjusting the gradient propagation path, the weights of different computational units can learn diverse information, achieving higher parameter utilization.
  • 2. Stable model learning ability. Since the gradient path design strategy directly determines how information propagates to update the weights of each computational unit, the designed architecture can avoid degradation during training.
  • 3. Efficient inference speed. The gradient path design strategy makes parameter utilization very effective, so the network can achieve higher accuracy without adding extra complex architecture.

For the above reasons, the designed network can be lighter and simpler in architecture.


   The gradient path design strategy has 1 disadvantage:

  • 1. When the gradient update path is not a simple reversed feed-forward path of the network, the programming difficulty increases greatly.

For a detailed explanation of ELAN, please refer to: [Network Structure Design] 11. E-LAN ​​| Designing a network structure through a gradient transmission path


(d)  E-ELAN

(d) is the Extended ELAN (E-ELAN) proposed by the author. E-ELAN uses expand, shuffle, and merge cardinality to enhance the ELAN network.

In terms of architecture, E-ELAN only changes the architecture of the computational blocks, while the architecture of the transition layer remains completely unchanged.

YOLOv7's strategy is to use group convolution to expand the channels and cardinality of the computational blocks.

Implementation (see the sketch after this list)

  • First, use group convolution to expand the channels and cardinality of the computational blocks (all computational blocks use the same group parameter and channel multiplier)
  • Next, shuffle the feature maps produced by each computational block into g groups and concatenate them, so that the channel count of each group is the same as in the original architecture
  • Finally, add the g groups of feature maps together (merge cardinality)
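The steps above map fairly directly onto tensor operations. Below is a hedged PyTorch sketch of the expand-shuffle-merge-cardinality idea; the module name and channel sizes are illustrative assumptions, not the official yolov7 code.

```python
import torch
import torch.nn as nn

class ExpandShuffleMerge(nn.Module):
    """Illustrative sketch of E-ELAN's expand / shuffle / merge cardinality.
    It only mirrors the three steps described above."""
    def __init__(self, channels, g=2):
        super().__init__()
        self.g = g
        # expand: group convolution multiplies channels and cardinality by g
        self.expand = nn.Conv2d(channels, channels * g, kernel_size=3,
                                padding=1, groups=g, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        out = self.expand(x)              # (b, c*g, h, w)
        # shuffle into g groups; each group keeps the original channel count
        out = out.view(b, self.g, c, h, w)
        # merge cardinality: element-wise addition of the g groups
        return out.sum(dim=1)             # (b, c, h, w), same as the input

# quick shape check
y = ExpandShuffleMerge(64, g=2)(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```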

3.2 Model scaling for concatenation-based models— model scaling based on connection models

translate

The main purpose of model scaling is to adjust certain attributes of a model and generate models at different scales to meet the needs of different inference speeds. For example, the scaling model of EfficientNet [72] considers width, depth, and resolution, while the scaling model of scaled-YOLOv4 [79] adjusts the number of stages. In [15], Dollár et al. analyzed the influence of vanilla convolution and group convolution on the number of parameters and the amount of computation when scaling width and depth, and designed a corresponding model scaling method accordingly.

    The above methods are mainly used in architectures such as PlainNet or ResNet. When these architectures are scaled up or down, the in-degree and out-degree of each layer do not change, so the influence of each scaling factor on the number of parameters and the amount of computation can be analyzed independently. However, if these methods are applied to a concatenation-based architecture, we find that when the depth is scaled up or down, the in-degree of the transition layer right after the concatenation-based computational block decreases or increases, as shown in Figure 3 (a) and (b).

    From the above phenomenon it can be inferred that, for concatenation-based models, we cannot analyze the scaling factors separately but have to consider them together. Taking scaling-up of depth as an example, such an action changes the ratio between the input and output channels of the transition layer, which may reduce the model's hardware utilization. Therefore, we have to propose a corresponding compound model scaling method for concatenation-based models. When we scale the depth factor of a computational block, we also have to compute the change in that block's output channels; then we scale the width of the transition layer by the same amount of change, and the result is shown in Fig. 3(c). Our proposed compound scaling method can preserve the properties the model had at its initial design and thus maintain the optimal structure.


intensive reading

The main purpose of model scaling

The main purpose of model scaling is to adjust certain attributes of the model and generate models of different scales to meet the needs of different inference speeds [like V5 and YOLOX ]

Problem introduction

In plain or residual architectures, each scaling factor's effect on parameters and computation can be analyzed independently. However, when these methods are applied to a concatenation-based architecture, scaling the depth up or down changes the input width of the layer that follows the concatenation: the ratio between the input and output channels of the subsequent transition layer changes, its in-degree increases or decreases, and the model's hardware utilization drops, as shown in (a) -> (b).

method

For concatenation-based models, the author proposes a compound scaling method: when scaling the depth factor of a computational block, also consider the corresponding change in the width factor of the transition layer.

When scaling a network with a concatenation-based structure, only the depth of the computational block is scaled, and the remaining transition layers are scaled only in width. A small sketch of the rule follows.
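As a worked example of the rule (using the ×1.5 depth and ×1.25 width factors that appear later in the ablation study), here is a tiny hypothetical helper; the function and its inputs are illustrative assumptions:

```python
def compound_scale(block_depth, transition_width,
                   depth_factor=1.5, width_factor=1.25):
    """Compound scaling sketch for concatenation-based models: deepening a
    computational block widens its concatenated output, so the following
    transition layer's width must be scaled by the matching width factor."""
    new_depth = max(1, round(block_depth * depth_factor))
    new_width = max(1, round(transition_width * width_factor))
    return new_depth, new_width

print(compound_scale(block_depth=4, transition_width=256))  # (6, 320)
```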


4. Trainable  bag-of-freebies— trainable bag-of-freebies

4.1 Planned re-parameterized convolution—convolution re-parameterization

translate

    Although RepConv [13] achieves excellent performance on VGG [68], its accuracy drops significantly when we directly apply it to architectures such as ResNet [26] and DenseNet [32]. We use the gradient flow propagation path to analyze how reparameterized convolutions should be combined with different networks. We also design the planned reparameterized convolutions accordingly.

   RepConv actually combines a 3×3 convolution, a 1×1 convolution, and an identity connection in one convolutional layer. After analyzing the combinations of RepConv with different architectures and the corresponding performance, we found that the identity connection in RepConv destroys the residual in ResNet and the concatenation in DenseNet, which provide more gradient diversity for different feature maps. For the above reasons, we use RepConv without the identity connection (RepConvN) to design the architecture of planned re-parameterized convolution. In our thinking, when a convolutional layer with a residual or concatenation is replaced by a re-parameterized convolution, there should be no identity connection. Figure 4 shows examples of our designed "planned re-parameterized convolution" for PlainNet and ResNet. As for the full planned re-parameterized convolution experiments on residual-based and concatenation-based models, they will be presented in the ablation study section.


intensive reading

Review RepVGG

RepVGG is built in the VGG style and uses re-parameterization technology, hence the name RepVGG.

(A) is the ResNet structure; its shortcut branch uses identity or 1×1 convolution, and RepVGG's overall layout is similar to ResNet.

(B) is RepVGG's training structure, which borrows the residual-block idea from ResNet; specifically, it contains three branches: 3×3 convolution, 1×1 convolution, and a shortcut.

  • When the input and output dimensions are inconsistent (i.e., stride = 2), there are only the 3×3 and 1×1 branches
  • When the input and output dimensions are the same, besides the 3×3 and 1×1 convolution branches there is also a non-convolutional identity connection that adds the input directly to the output

(C) is RepVGG's inference structure: all the extra branches are fused away, leaving a single plain VGG-style stack of 3×3 convolutions. This operation is also called the decoupling of training and inference.

Problem introduction

The identity connection in RepConv destroys the residual in ResNet and the concatenation in DenseNet. The residual and concatenation provide more gradient diversity for different feature maps, so destroying them causes an accuracy drop. Therefore, when a convolutional layer with a residual or concatenation is replaced by a re-parameterized convolution, there should be no identity connection.

method

Based on this, the author uses RepConv without the identity connection (RepConvN) to design the planned re-parameterized convolution structure, as shown in the figure below. A minimal fusion sketch follows:
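For concreteness, here is a hedged PyTorch sketch of an identity-free RepConv (a RepConvN-style block) with deploy-time fusion. The class is illustrative rather than the official yolov7 module; the BN folding and the 1×1-to-3×3 padding are the standard RepVGG-style fusion math.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv, bn):
    """Fold a BatchNorm layer into the preceding conv's weight and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    w = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

class RepConvN(nn.Module):
    """Sketch: 3x3 + 1x1 branches at training time, no identity branch,
    fused into a single 3x3 conv for inference."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv3 = nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(c_out)
        self.conv1 = nn.Conv2d(c_in, c_out, 1, stride, 0, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()
        self.fused = None

    def forward(self, x):
        if self.fused is not None:          # deploy path: one conv only
            return self.act(self.fused(x))
        return self.act(self.bn3(self.conv3(x)) + self.bn1(self.conv1(x)))

    @torch.no_grad()
    def fuse(self):
        w3, b3 = fuse_conv_bn(self.conv3, self.bn3)
        w1, b1 = fuse_conv_bn(self.conv1, self.bn1)
        # pad the 1x1 kernel to 3x3 so the two branches can be summed
        w1 = F.pad(w1, [1, 1, 1, 1])
        self.fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels,
                               3, self.conv3.stride, 1, bias=True)
        self.fused.weight.copy_(w3 + w1)
        self.fused.bias.copy_(b3 + b1)
```

After training, calling fuse() (in eval mode, so BN uses its running statistics) makes the block numerically equivalent to its training-time form while running as a single 3×3 convolution.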


4.2 Coarse for auxiliary and fine for lead loss—auxiliary training module

(I really searched a lot of translations of this title online, and the results are either strange literal translations or no translation at all. In short, it is enough to understand it as an auxiliary training module.)

My personal understanding is:

The Lead head is the head responsible for the final output; the Auxiliary head is the head responsible for auxiliary training.

So the title means: the head responsible for auxiliary training uses coarse labels, while the head responsible for the final output uses fine labels.


4.2.1 Deep supervision—deep supervision

translate

   Deep supervision [38] is a technique often used when training deep networks. Its main concept is to add extra auxiliary heads in the middle layers of the network, with the auxiliary loss guiding the weights of the shallow layers. Even for architectures that generally converge well, such as ResNet [26] and DenseNet [32], deep supervision [70, 98, 67, 47, 82, 65, 86, 50] can still significantly improve model performance on many tasks. Figure 5 (a) and (b) show the object detector architectures "without" and "with" deep supervision, respectively. In this paper, we call the head responsible for the final output the lead head, and the head used for auxiliary training the auxiliary head.


intensive reading

Introduction

   Deep supervision is a technique commonly used when training deep networks. Its main concept is to add extra Auxiliary heads in the middle layers of the network, with the Auxiliary loss guiding the weights of the shallow layers.

In the figure below, (A) is without deep supervision and (B) is with deep supervision.

Method in this paper

   The head responsible for the final output is called the Lead head, and the head used for auxiliary training is called the Auxiliary head. A tiny sketch of how the two losses are typically combined follows:
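This is a minimal sketch of how a deeply supervised detector usually combines the two heads' losses; the 0.25 weight is an illustrative assumption, not a value from the paper.

```python
def deep_supervision_loss(lead_loss, aux_losses, aux_weight=0.25):
    """Total loss = lead head loss + down-weighted auxiliary head losses.
    The auxiliary branch exists only at training time, so the inference
    cost is unchanged."""
    return lead_loss + aux_weight * sum(aux_losses)
```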


4.2.2  Label assigner— label assigner

translate

    We next discuss the issue of label assignment. In the past, in the training of deep networks, the label assignment usually directly refers to the ground truth, and generates hard labels according to the given rules. However, in recent years, if we take object detection as an example, researchers tend to use the network to predict the quality and distribution of outputs, and then consider using some computational and optimization methods together with ground truth to generate reliable soft labels [61, 8, 36 , 99, 91, 44, 43, 90, 20, 17, 42]. For example, YOLO [61] uses predicted bounding box regression and ground truth IoU as soft labels for objectness. In this paper, we consider the network's prediction results together with the ground truth, and then the mechanism for assigning soft labels is called a "label assigner".

    Regardless of whether it is the auxiliary head or the lead head, deep supervision of the target objects is required. While developing soft-label-assigner-related techniques, we accidentally discovered a new derived question, namely: "How to assign soft labels to the auxiliary head and the lead head?" As far as we know, no literature has discussed this issue so far. The currently most popular method is shown in Fig. 5(c): separate the auxiliary head and the lead head, and use each head's own prediction results together with the ground truth for label assignment. The method proposed in this paper is a new label assignment method that guides both the auxiliary head and the lead head through the lead head's predictions. In other words, we use the lead head's predictions as guidance to generate coarse-to-fine hierarchical labels, which are used for auxiliary head and lead head learning, respectively. The two proposed deeply supervised label assignment strategies are shown in Fig. 5 (d) and (e), respectively.


intensive reading

previous method

   In the past, when training deep networks, label assignment usually referred directly to the GT and generated hard labels according to given rules.

Method in this paper

   The author proposes a "label assigner" mechanism, which considers the network's prediction results together with the GT and then assigns soft labels.

   About hard labels and soft labels:

  • Hard label: also called hard target in some papers. The idea is borrowed from knowledge distillation: "hard" means the label is absolute, yes is yes and no is no. Label format: (1, 2, 3...) or (0, 1, 0, 0...) [for example: either a cat or a dog]
  • Soft label: expressed in the form of probabilities; it can be understood as smoothing or softening the label, e.g. [0.6, 0.4] [for example: 60% probability of a cat and 40% probability of a dog]. It does not give an absolutely clear-cut answer.

In the popular architectures of today, the data distribution of the network output is often matched with the GT through some optimization method to generate a soft label (in fact, the familiar output of softmax or sigmoid is a kind of soft label). A small sketch of the IoU-based soft objectness label follows.
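As a concrete example of a soft label, here is a hedged sketch of the IoU-as-objectness idea mentioned in the translation above; it assumes already-matched prediction/GT pairs in xyxy format.

```python
import torch

def iou_soft_objectness(pred_boxes, gt_boxes, eps=1e-7):
    """Soft-label sketch: use the IoU between each predicted box and its
    matched ground-truth box (both (N, 4) in xyxy format) as the
    objectness target, instead of a hard 0/1 label."""
    x1 = torch.max(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], gt_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_p + area_g - inter + eps)   # IoU in [0, 1]
```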

problem introduction   

How to assign soft labels to the Auxiliary head and the Lead head?

current method

   As shown in (c), the currently popular approach is an independent label assignment structure: the Auxiliary head and the Lead head are separated, and each uses its own prediction results together with the real labels for label assignment.


4.2.3 Lead head guided label assigner— lead head guided label assigner

translate

(Lead head guided label assigner) The lead head guided label assigner mainly performs its calculation based on the lead head's prediction results and the GT, and generates soft labels through an optimization process. This set of soft labels is used to train both the auxiliary head and the lead head. The reason is that the lead head has a relatively strong learning ability, so the soft labels it generates should better represent the distribution of, and correlation between, the source data and the targets. Furthermore, we can view this learning as a kind of generalized residual learning: by letting the shallower auxiliary head directly learn the information the lead head has already learned, the lead head can focus on learning the residual information that has not yet been learned.


intensive reading

specific method

   Soft labels are computed mainly from the Lead head's prediction results and the GT, through an optimization process. This set of soft labels is used for the training of both the Auxiliary head and the Lead head.

reason

  • The Lead head has a stronger learning ability, so the soft labels it generates are more representative of the real data distribution.
  • By letting the shallower Auxiliary head directly learn what the Lead head has already learned, the Lead head can focus on learning the residual information that has not yet been learned.


4.2.4 Coarse-to-fine lead head guided label assigner—from coarse to fine lead head guided label assigner

translate

(Coarse-to-fine lead head guided label assigner) The coarse-to-fine lead head guided label assigner also uses the lead head's predictions and the GT to generate soft labels. However, in the process, two different sets of soft labels are generated, namely coarse labels and fine labels. The fine labels are the same as the soft labels generated by the lead head guided label assigner, while the coarse labels are generated by relaxing the constraints of the positive-sample assignment process, allowing more grids to be treated as positive targets. The reason is that the learning ability of the auxiliary head is not as strong as that of the lead head; to avoid losing information that needs to be learned, we focus on optimizing the recall of the auxiliary head in the object detection task. For the lead head's output, high-precision results can be filtered out of high-recall results as the final output. However, note that if the extra weight of the coarse labels is close to that of the fine labels, it may produce a bad prior on the final prediction. Therefore, to reduce the influence of these extra-coarse positive grids, a constraint is set in the decoder so that the extra-coarse positive grids cannot perfectly produce soft labels. The mechanism above allows the importance of fine and coarse labels to be dynamically adjusted during learning, and keeps the optimization upper bound of fine labels always higher than that of coarse labels.


intensive reading

specific method

In this process, two different sets of soft labels are generated, namely coarse labels and fine labels:

  • The fine labels are the same as the soft labels generated by the lead head guided label assigner
  • The coarse labels are generated by relaxing the constraints of the positive-sample assignment process, i.e., allowing more grids to be treated as positive targets

What exactly are coarse and fine labels?

       First the fine labels. The three heads finally output by the network are Lead heads; their predictions, together with the ground truth, are used to generate soft labels. The network regards the data distribution of these soft labels as closer to the real distribution, so what is learned from them is more "fine".

       Now the coarse labels. Since the Auxiliary head is taken from a middle part of the network, its predictions are certainly not as refined as the data or features seen by the deeper Lead head, so the Auxiliary head's content is relatively "coarse". During training, the soft label produced by the Lead head together with the ground truth is treated as a brand-new ground truth, and a loss is built against the Auxiliary head's output. To put it plainly, the Auxiliary head's predictions are also pulled "approximately" toward the Lead head's.

—————— The above quotes the interpretation of @爱食肉的Peng — thanks, big brother!

reason

 The learning ability of the Auxiliary head is not as strong as that of the Lead head. To avoid losing information that should be learned, it needs to receive more information to learn from; in the object detection task this means the Auxiliary head should be optimized for high recall.

[Note] If the extra weight of the coarse labels is close to that of the fine labels, it may produce a bad prior on the final prediction.

[Solution] To reduce the influence of the extra-coarse positive grids (the positive labels of the Auxiliary head), a restriction is set in the decoder so that these extra-coarse positive grids cannot perfectly produce soft labels. This allows the importance of coarse and fine labels to be adjusted dynamically during training, and keeps the optimization upper bound of fine labels always higher than that of coarse labels. A small sketch of the grid-relaxation idea follows.
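Here is a hedged sketch of the grid-relaxation idea: the same assignment routine produces fine labels with a small neighbourhood and coarse labels with a relaxed one. The function and its thresholds are illustrative, not the paper's exact rule.

```python
def positive_grids(cx, cy, relax=0.5):
    """Return the grid cells treated as positive for an object whose centre
    is (cx, cy) in grid units. A small `relax` gives few positives (fine
    labels); a larger `relax` lets more neighbouring cells count as
    positive (coarse labels)."""
    gx, gy = int(cx), int(cy)
    cells = [(gx, gy)]                    # cell containing the centre
    fx, fy = cx - gx, cy - gy             # centre offset inside its cell
    if fx < relax:      cells.append((gx - 1, gy))
    if fx > 1 - relax:  cells.append((gx + 1, gy))
    if fy < relax:      cells.append((gx, gy - 1))
    if fy > 1 - relax:  cells.append((gx, gy + 1))
    return cells

print(len(positive_grids(10.3, 7.8, relax=0.5)))  # 3 cells -> fine labels
print(len(positive_grids(10.3, 7.8, relax=1.0)))  # 5 cells -> coarse labels
```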


4.3 Other trainable bag-of-freebies—other trainable "tools"

translate

In this section we list some trainable bag-of-freebies. These freebies are tricks we use in training, but the original concepts are not ours. Their training details will be elaborated in the appendix. They include: (1) Batch normalization in the conv-bn-activation topology: this part mainly connects the batch-normalization layer directly to the convolutional layer, so that at the inference stage the BN mean and variance are folded into the bias and weight of the convolutional layer. (2) Combining the implicit knowledge in YOLOR [81] with the convolutional feature map by addition and multiplication: the implicit knowledge in YOLOR can be reduced to a vector by pre-computation at the inference stage, and this vector can be combined with the bias and weight of the previous or subsequent convolutional layer. (3) EMA model: EMA is a technique used in mean teacher [75]; in our system, we use the EMA model purely as the final inference model.


intensive reading

(1) Batch normalization in the conv-bn-activation combination: the BN layer is connected directly to the convolutional layer so that, at inference time, the BN mean and variance can be folded into the bias and weight of the convolution

(2) Combining the implicit knowledge in YOLOR with the convolutional feature map by addition and multiplication

(3) The EMA model

Please read the translation above for a detailed explanation~ A minimal EMA sketch follows:
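For freebie (3), this is a minimal EMA sketch; the class and the 0.9999 decay are illustrative assumptions, not the paper's exact settings.

```python
import copy
import torch

class ModelEMA:
    """Keep a shadow copy of the model whose weights are an exponential
    moving average of the live weights; use the shadow copy for inference."""
    def __init__(self, model, decay=0.9999):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for ema_v, v in zip(self.ema.state_dict().values(),
                            model.state_dict().values()):
            if ema_v.is_floating_point():
                ema_v.mul_(self.decay).add_(v, alpha=1 - self.decay)
```

Typical use is to call ema.update(model) after every optimizer step, then evaluate or export ema.ema.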


5. Experiments— Experiments

5.1 Experimental setup—experimental setup

translate

 We conduct experiments and validate our object detection method using the Microsoft COCO dataset. All our experiments do not use pre-trained models. That is, all models are trained from scratch. During development, we use the train 2017 set for training, and then use the val 2017 set for validation and hyperparameter selection. Finally, we show the performance of object detection on the 2017 test set and compare it with state-of-the-art object detection algorithms. The detailed training parameter settings are described in the appendix.

    We design basic models for the edge GPU, normal GPU, and cloud GPU, called YOLOv7-tiny, YOLOv7, and YOLOv7-W6, respectively. We also apply model scaling to the basic models to obtain different types of models for different business needs. For YOLOv7, we perform stack scaling on the neck and use the proposed compound scaling method to scale the depth and width of the entire model, obtaining YOLOv7-X. For YOLOv7-W6, we use the newly proposed compound scaling method to obtain YOLOv7-E6 and YOLOv7-D6. Furthermore, we apply the proposed E-ELAN to YOLOv7-E6, completing YOLOv7-E6E. Since YOLOv7-tiny is an edge-GPU-oriented architecture, it uses leaky ReLU as the activation function; the other models use SiLU. We describe the scaling factors of each model in detail in the appendix.


intensive reading

  • Dataset: COCO dataset
  • Pre-trained model: none, trained from scratch
  • Different GPUs and corresponding models:

    • Edge GPU: YOLOv7-tiny

    • Ordinary GPU: YOLOv7

    • Cloud GPU basic model: YOLOv7-W6

  • Activation function:
    • YOLOv7-tiny: leaky ReLU

    • Other models: SiLU


5.2 Baselines—Baseline Network

translate

We choose previous versions of YOLO [3, 79] and the state-of-the-art object detector YOLOR [81] as our baselines. Table 1 shows the comparison of our proposed YOLOv7 model with a baseline trained with the same settings.

From the results we can see that, compared with YOLOv4, YOLOv7 has 75% fewer parameters and 36% less computation, with 1.5% higher AP. Compared with the state-of-the-art YOLOR-CSP, YOLOv7 has 43% fewer parameters and 15% less computation, with 0.4% higher AP. As for the tiny models, compared with YOLOv4-tiny-3l, YOLOv7-tiny reduces the number of parameters by 39% and the computation by 49%, while keeping the same AP. On the cloud GPU models, our model still achieves higher AP while reducing the number of parameters by 19% and the computation by 33%.


intensive reading

We choose previous YOLO versions [YOLOv4, Scaled-YOLOv4] and the state-of-the-art object detector YOLOR as baselines. Table 1 shows the comparison of the YOLOv7 models proposed in this paper with the baselines trained under the same settings.

Table 1: Comparison of baseline object detectors

Conclusion: the specific values will not be repeated here; the comparison shows that the number of parameters and the amount of computation both drop while accuracy improves.


5.3 Comparison with state-of-the-arts—comparison with other popular networks

translate

We compare the proposed method with state-of-the-art object detectors for general-purpose GPUs and mobile GPUs; the results are shown in Table 2. They show that the proposed method has the best speed-accuracy trade-off overall. Comparing YOLOv7-tiny-SiLU with YOLOv5-N (r6.1), our method is 127 fps faster and 10.7% more accurate in AP. YOLOv7 reaches 51.4% AP at a frame rate of 161 fps, while PPYOLOE-L with the same AP reaches only 78 fps; in terms of parameter usage, YOLOv7 uses 41% fewer parameters than PPYOLOE-L. Comparing YOLOv7-X at an inference speed of 114 fps with YOLOv5-L (r6.1) at 99 fps, YOLOv7-X improves AP by 3.9%. Compared with the similarly scaled YOLOv5-X (r6.1), YOLOv7-X's inference is 31 fps faster; moreover, YOLOv7-X reduces parameters by 22% and computation by 8% relative to YOLOv5-X (r6.1), while improving AP by 2.2%.

Comparing YOLOv7 with YOLOR at an input resolution of 1280, YOLOv7-W6 infers 8 fps faster than YOLOR-P6 and improves detection by 1% AP. As for YOLOv7-E6 versus YOLOv5-X6 (r6.1), the former gains 0.9% AP, has 45% fewer parameters and 63% less computation, and its inference is 47% faster. YOLOv7-D6's inference speed is close to YOLOR-E6's, but its AP is 0.8% higher; YOLOv7-E6E's inference speed is close to YOLOR-D6's, but its AP is 0.3% higher.


intensive reading

In this paper, the proposed method is compared with state-of-the-art object detectors for general-purpose GPUs and mobile GPUs; the results are shown in Table 2:

Table 2: Comparison of state-of-the-art real-time object detectors

Conclusion: in short, all of them are beaten — right now YOLOv7 is the best.


5.4 Ablation study—ablation study

5.4.1 Proposed compound scaling method—the proposed compound scaling method

translate

Table 3 shows the results obtained when scaling with different model scaling strategies. Our proposed compound scaling method enlarges the depth of the computational block by 1.5 times and the width of the transition block by 1.25 times. Compared with the method that only scales up width, our method improves AP by 0.5% with fewer parameters and less computation. Compared with the method that only scales up depth, our method needs to increase the number of parameters by only 2.9% and the computation by 1.2%, while improving AP by 0.2%.


intensive reading

Table 3 shows the results obtained when scaling with different model scaling strategies:

Table 3: Ablation studies under the proposed model scaling

Conclusions : Compound scaling strategies allow for more efficient use of parameters and computation.


5.4.2 Proposed planned re-parameterized model—proposed planned re-parameterized model

translate

To verify the generality of our proposed planned re-parameterized model, we apply it to a concatenation-based model and a residual-based model for validation. The concatenation-based and residual-based models chosen for validation are 3-stacked ELAN and CSPDarknet, respectively.

In the concatenation-based model experiment, we replace the 3×3 convolutional layers at different positions in the 3-stacked ELAN with RepConv; the detailed configurations are shown in Figure 6. From the results in Table 4, all the higher AP values appear in our proposed planned re-parameterized model.

In the experiment on the residual-based model, since the original dark block does not contain a 3×3 convolution block that fits our design strategy, we additionally designed a reversed dark block for the experiments; its architecture is shown in Figure 7. Since CSPDarknet with dark blocks and with reversed dark blocks have exactly the same numbers of parameters and operations, the comparison is fair. The experimental results in Table 5 fully confirm that the proposed planned re-parameterized model is equally effective on residual-based models. We also find that the design of RepCSPResNet [85] follows our design pattern.


intensive reading

Purpose

To verify the generality of the proposed planned re-parameterized model, it is applied to a concatenation-based model and a residual-based model for verification.

Model selection: 3-stacked ELAN and CSPDarknet.

Replace the 3×3 convolutional layers at different positions in the 3-stacked ELAN with RepConv; the detailed configurations are shown in the figure below:

Figure 6: Planned RepConv 3-stacked ELAN. The blue circles mark where Conv is replaced with RepConv

Table 4: Ablation studies of the Planned RepConcatenation model.

Conclusion: from the results in Table 4, all the higher AP values appear in the proposed planned RepConcatenation model.

In the residual-model experiment, since the original dark block does not contain a 3×3 convolution block that fits the design strategy, the author additionally designed a reversed dark block, whose architecture is shown in Figure 7:

Figure 7: Reversed CSPDarknet. The positions of the 1×1 and 3×3 convolutional layers in the dark block are reversed to fit the planned re-parameterized model design strategy.

Table 5: Ablation studies of the planned RepResidual model

Conclusion: the experimental results in Table 5 confirm that the proposed re-parameterization strategy is also effective on residual-based models. The design of RepCSPResNet also follows this design pattern.


5.4.3 Proposed assistant loss for auxiliary head—the proposed assistant loss for the auxiliary head

translate

In the assistant-loss-for-auxiliary-head experiments, we compare the general independent label assignment method for the lead head and auxiliary head with the two proposed lead-guided label assignment methods. All comparison results are shown in Table 6. The results show that any model with an added assistant loss significantly improves overall performance. Furthermore, our proposed lead-guided label assignment strategy outperforms the general independent strategy in AP, AP50, and AP75, and our proposed coarse-for-auxiliary, fine-for-lead label assignment strategy gives the best results in all cases. In Fig. 8 we show the objectness maps predicted by different methods at the auxiliary head and the lead head; from Fig. 8 we find that if the auxiliary head learns the lead-guided soft labels, it indeed helps the lead head extract residual information from the consistent targets.

   Since the proposed YOLOv7 uses multiple pyramids to jointly predict object detection results, we can directly connect the auxiliary head to the middle-layer pyramids for training. This kind of training can compensate for information that may be lost in the next level of pyramid prediction. For these reasons, we designed a partial auxiliary head in the proposed E-ELAN architecture. Our approach is to connect the auxiliary head after one of the feature-map sets before merging cardinality; this connection keeps the weights of the newly generated feature-map set from being directly updated by the assistant loss. Our design lets each pyramid of the lead head still draw information from objects of different sizes. Table 8 shows the results of two different methods, the coarse-to-fine lead guided method and the partial coarse-to-fine lead guided method. Apparently, the partial coarse-to-fine lead guided method has a better auxiliary effect.


intensive reading

The author compares the independent label assignment strategies of the lead head and auxiliary head, and also compares the two proposed guided label assignment methods. All comparison results are shown in Table 6:

Table 6: Ablation study of the proposed auxiliary head

Conclusion : From the results listed in Table 6, it can be seen that any model adding auxiliary loss can significantly improve the overall performance.

Figure 8: Object graphs predicted by different methods at the auxiliary head and lead head

Conclusion: if the Auxiliary head learns the lead-head-guided soft labels, it will indeed help the Lead head extract residual information from the consistent targets.

Table 7: Ablation study of the constrained auxiliary head.

Conclusion: judging from the numbers in the table, constraining the objectness upper bound by the distance from the object center gives better performance.

Method: a partial auxiliary head is designed in the E-ELAN architecture: before merging cardinality, the auxiliary head is connected after one of the sets of feature maps. This connection keeps the weights of the newly generated set of feature maps from being directly updated by the auxiliary loss.

Table 8: Ablation study of the partial auxiliary head

Conclusion: the partial coarse-to-fine method has a better auxiliary effect.

6. Conclusions— Conclusion

translate

In this paper, we propose a new architecture for real-time object detection and a corresponding model scaling method. Furthermore, we find that the evolution of object detection methods generates new research topics. During the research, we found the replacement problem of re-parameterized modules and the assignment problem of dynamic label assignment. To address these issues, we propose trainable bag-of-freebies methods to improve object detection accuracy. On this basis, we developed the YOLOv7 series of object detection systems, which achieve state-of-the-art results.


intensive reading

In this paper, a new real-time detector is proposed, which addresses both the replacement of re-parameterized modules and the assignment of dynamic labels.

The main work:

(1) Efficient layer aggregation network architecture: YOLOv7 extends ELAN and proposes a new architecture, E-ELAN, with a focus on efficiency

(2) Planned re-parameterized convolution: YOLOv7 introduces model re-parameterization into the network architecture; the re-parameterization idea first appeared in RepVGG

(3) Auxiliary head detection: in YOLOv7, shallow features of the head part are extracted as the Auxiliary head, while deep features form the network's final output as the Lead head

(4) Concatenation-based model scaling: the author proposes a compound scaling method for concatenation-based models; when scaling such a network, only the depth of the computational block is scaled, and the remaining transition layers are scaled only in width

(5) Dynamic label assignment strategy: the lead head guided label assigner and the coarse-to-fine lead head guided label assigner

References in this article:

CSDN:

Partial interpretation of YOLOv7 paper [including my own understanding]

[Target Detection] 54, YOLO v7 | Alexey AB strikes again! Designed for real-time object detection

Bilibili: [Highly recommended! YOLOv7 paper innovation points, network structure, official source code fully explained, taking you line by line through the YOLOv7 code!]
 
