Some Notes on YOLOv7

Paper link: https://arxiv.org/abs/2207.02696
Code link: https://github.com/WongKinYiu/yolov7

Flexing some muscle:

At the same model size, the official YOLOv7 is more accurate than YOLOv5 and 120% faster (FPS), 180% faster than YOLOX (FPS), 1200% faster than Dual-Swin-T (FPS), 550% faster than ConvNeXt (FPS), and 500% faster than SWIN-L (FPS). In the range of 5 FPS to 160 FPS, YOLOv7 surpasses all currently known detectors in both speed and accuracy. Tested on a V100 GPU, the model that reaches 56.8% AP can run at 30 FPS or more (batch=1), and it is the only detector that still exceeds 30 FPS at such high accuracy. The specific comparison is shown in the figure below:
(Figure: speed/accuracy comparison of YOLOv7 against other real-time detectors)

1. Introduction

In recent years, model re-parameterization and dynamic label assignment have become important optimization directions in network training and object detection. In this paper, the authors identify some existing problems, such as:

  • For model re-parameterization, the concept of the gradient propagation path is used to analyze which re-parameterization strategies are applicable to the layers of different networks, and a planned re-parameterized model structure is proposed.
  • When a dynamic label assignment strategy is used, a model with multiple output layers raises new problems during training, such as how to better match the outputs of the different branches to the dynamic targets. To address this, the authors propose a new label assignment method called the coarse-to-fine lead-guided label assignment strategy.
The contributions of this paper are as follows:
  1. Several trainable bag-of-freebies are designed so that real-time detectors can improve detection accuracy without increasing inference cost.
  2. For the evolution of object detection, the authors identify two new problems: how model re-parameterization should replace the original modules, and how a dynamic label assignment strategy should handle assignment across different output layers. This paper proposes methods to solve both.
  3. The authors propose "extend" and "compound scaling" methods for real-time detectors that use parameters and computation more efficiently. The proposed method can reduce the parameters of a real-time detector by about 50% while giving faster inference and higher detection accuracy. (Similar to how the v5 baseline or scaled-v4 is divided into several model sizes, the model can be scaled by width and depth, or scaled by module.)

2. Related work

SOTA detectors generally have the following characteristics:

  • a faster and more efficient network architecture
  • a more effective feature integration method
  • a more accurate detection method
  • a more robust loss function
  • a more efficient label assignment method
  • a more efficient training method

2.1 Model reparameterization

There are two common operations for model-level reparameterization:

  1. The first is to train multiple identical models on different training data and then average the weights of the trained models.
  2. The second is to take a weighted average of the model weights from different training iterations (a minimal sketch is given below).
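As an illustration of the second operation, here is a minimal sketch of keeping a running weighted average of a model's weights across iterations (the same idea as an exponential moving average); the class name and decay value are illustrative, not YOLOv7's implementation.

```python
import copy

import torch


class WeightedAverageModel:
    """Keeps a running weighted average of model weights across training iterations;
    the averaged copy is what gets used for inference."""

    def __init__(self, model, decay=0.9999):
        self.avg = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.avg.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        msd = model.state_dict()
        for name, avg_param in self.avg.state_dict().items():
            if avg_param.dtype.is_floating_point:
                # new_avg = decay * old_avg + (1 - decay) * current_weight
                avg_param.mul_(self.decay).add_(msd[name].detach(), alpha=1 - self.decay)
```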

Module re-parameterization is a popular research topic in recent years. A single module is split into multiple identical or different branches during training, and the multiple branches are merged into a fully equivalent module at inference time. However, not all proposed re-parameterized modules can be applied perfectly to every architecture. The authors therefore develop new re-parameterized modules and design the corresponding application strategies for various architectures.

2.2 Model Scaling

Model scaling adapts a model to different computing devices by scaling a baseline up or down. NAS (neural architecture search) is one of the commonly used model scaling methods nowadays. Scaling factors typically include the following (a rough sketch of width/depth scaling follows the list):

  • input size (resolution)
  • depth (number of layers)
  • width (number of channels)
  • stage (number of feature pyramid stages)
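As a rough illustration of depth/width scaling in the spirit of the depth/width multipliers used by YOLOv5-style configs (the multiplier values, the divisor, and the function name below are illustrative, not YOLOv7's exact compound scaling rule):

```python
import math


def scale_model_config(base_depths, base_widths, depth_multiple, width_multiple, divisor=8):
    """Scale the block repeat counts (depth) and channel counts (width) of a baseline.
    Channels are rounded up to a multiple of `divisor` to stay hardware friendly."""
    depths = [max(round(d * depth_multiple), 1) for d in base_depths]
    widths = [int(math.ceil(w * width_multiple / divisor) * divisor) for w in base_widths]
    return depths, widths


# Example: shrink a hypothetical baseline down to a "small" variant.
depths, widths = scale_model_config([3, 6, 9, 3], [64, 128, 256, 512],
                                    depth_multiple=0.33, width_multiple=0.5)
print(depths, widths)  # [1, 2, 3, 1] [32, 64, 128, 256]
```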

3. Model Design

3.1 Efficient aggregation network

Most papers on efficient network design mainly consider the number of parameters, the amount of computation, and computational density. From the perspective of memory access cost, one can also analyze the influence of the input/output channel ratio, the number of branches in the architecture, and element-wise operations on inference speed (as proposed in the ShuffleNet V2 paper). Activations also need to be considered when doing model scaling, i.e., paying more attention to the number of elements in the output tensors of the convolutional layers.
(Figure: architecture comparison of (a) VoVNet, (b) CSPVoVNet, (c) ELAN, and (d) the proposed E-ELAN)

  • CSPVoVNet is a variant of VoVNet whose architecture also analyzes the gradient paths, enabling different layers to learn more diverse features.
  • (c) ELAN comes out of the design question "how to build an efficient network?" and reaches the conclusion that by controlling the shortest and longest gradient paths, a deeper network can learn effectively and converge better.
  • In v7, the authors therefore propose E-ELAN (d), an extended version of ELAN. In large-scale ELAN, a stable state is reached regardless of the gradient path length and the number of computational blocks, but if more computational blocks are stacked without limit, this stable state may be broken and parameter utilization will drop. The E-ELAN proposed in this paper uses expand, shuffle, and merge cardinality to improve the learning ability of the network without destroying the original gradient paths.

In terms of architecture, E-ELAN only changes the structure inside the computational block, while the transition layer is left completely unchanged. The authors' strategy: use group convolution to expand the channels and cardinality of the computational blocks, applying the same group parameter and channel multiplier to all computational blocks in a layer; then shuffle the feature maps computed by each block into G groups according to the set number of groups and concatenate them together. At this point, the number of channels in each group of feature maps is the same as in the original architecture. Finally, the G groups of feature maps are added together to merge cardinality. Besides maintaining the original ELAN design, E-ELAN can also guide different groups of computational blocks to learn more diverse features.
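A minimal PyTorch-style sketch of the expand / shuffle / merge-cardinality idea; the channel counts, group number, and class name are illustrative, not the actual E-ELAN implementation.

```python
import torch
import torch.nn as nn


def channel_shuffle(x, groups):
    """Interleave channels across `groups` groups (ShuffleNet-style shuffle)."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)


class ExpandShuffleMerge(nn.Module):
    """Toy block: expand channels with a group convolution, shuffle the channels,
    then merge cardinality by summing the G groups so the output matches the input width."""

    def __init__(self, channels, groups=2):
        super().__init__()
        self.groups = groups
        # expand: group convolution multiplies the channel count by `groups`
        self.expand = nn.Conv2d(channels, channels * groups, 3, padding=1, groups=groups)

    def forward(self, x):
        y = self.expand(x)                   # (b, c*G, h, w)
        y = channel_shuffle(y, self.groups)  # mix channels across the groups
        b, cg, h, w = y.shape
        y = y.view(b, self.groups, cg // self.groups, h, w)
        return y.sum(dim=1)                  # merge cardinality -> (b, c, h, w)
```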

4. Bag-of-freebies

4.1 Convolution Reparameterization

Although RepConv achieves excellent performance on VGG, its accuracy drops significantly when it is applied directly to ResNet, DenseNet, or other network architectures. The authors use gradient propagation paths to analyze which re-parameterization modules should be paired with which networks. By analyzing RepConv combined with different architectures and the resulting performance, they found that the identity branch in RepConv destroys the residual connections in ResNet and the cross-layer concatenation in DenseNet; the planned re-parameterized architecture therefore uses RepConv without the identity branch (RepConvN) in those positions.
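For intuition, here is a minimal sketch of the standard re-parameterization trick of fusing parallel 3x3, 1x1, and identity branches into a single 3x3 convolution at inference time (BatchNorm folding is omitted for brevity; this is the generic RepVGG-style fusion, not the exact RepConv code from the YOLOv7 repo).

```python
import torch
import torch.nn.functional as F


def fuse_rep_branches(w3x3, b3x3, w1x1, b1x1, channels):
    """Fuse 3x3 conv + 1x1 conv + identity branches into one 3x3 kernel and bias.
    This works because convolution is linear: conv(x, A) + conv(x, B) = conv(x, A + B)."""
    w1x1_padded = F.pad(w1x1, [1, 1, 1, 1])        # place the 1x1 kernel at the 3x3 center
    w_id = torch.zeros(channels, channels, 3, 3)   # identity branch as a 3x3 kernel
    for i in range(channels):
        w_id[i, i, 1, 1] = 1.0
    return w3x3 + w1x1_padded + w_id, b3x3 + b1x1


# quick check that the fused conv matches the sum of the three branches
c = 8
x = torch.randn(1, c, 16, 16)
w3, b3 = torch.randn(c, c, 3, 3), torch.randn(c)
w1, b1 = torch.randn(c, c, 1, 1), torch.randn(c)
branch_sum = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1) + x
wf, bf = fuse_rep_branches(w3, b3, w1, b1, c)
print(torch.allclose(branch_sum, F.conv2d(x, wf, bf, padding=1), atol=1e-5))  # True
```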

4.2 Auxiliary training module

Deep supervision is a technique commonly used for training deep networks. The main idea is to add extra auxiliary heads in the middle layers of the network (GoogLeNet's auxiliary classifiers are a classic example) and to guide the shallow weights with an auxiliary loss. Even for network structures that already converge well, such as ResNet and DenseNet, deep supervision can still significantly improve performance on many tasks. Figures (a) and (b) below show the object detector architecture with and without deep supervision, respectively. In this paper, the head responsible for the final output is called the lead head, and the head used to assist training is called the auxiliary head.
(Figure: detector architectures with and without deep supervision, showing the lead head and the auxiliary head)
Object detection often uses the IoU between the predicted box and the ground truth to generate labels. Previously, the auxiliary head and the lead head were assigned labels independently of each other. The method proposed in this paper is a new label assignment scheme in which the lead head's predictions guide both the auxiliary head and the lead head. That is, the lead head's predictions are first used as guidance to generate coarse-to-fine hierarchical labels, which are then used for the learning of the auxiliary head and the lead head respectively. (It has a bit of the flavor of knowledge distillation.) The motivation is that the lead head has a relatively strong learning ability, so the soft labels generated from it better represent the distribution of, and correlation between, the source data and the targets. In addition, this can be regarded as a kind of generalized residual learning: by letting the shallower auxiliary head directly learn the information that the lead head has already learned, the lead head can focus more on the residual information that has not yet been learned.
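Below is a heavily simplified sketch of the coarse-to-fine idea: soft targets are derived from the lead head's predictions, and the coarse set simply relaxes the positive criterion so that the auxiliary head sees more positives. YOLOv7's real assigner works per grid cell and per anchor and is considerably more involved; the thresholds, function names, and the `head_loss` helper here are hypothetical.

```python
import torch
from torchvision.ops import box_iou


def lead_guided_soft_labels(lead_pred_boxes, gt_boxes, fine_thresh=0.5, coarse_relax=0.1):
    """Soft objectness targets generated from the lead head's predictions.
    Fine labels (for the lead head) use a strict positive criterion; coarse labels
    (for the auxiliary head) relax it so that more predictions count as positive."""
    iou = box_iou(lead_pred_boxes, gt_boxes).max(dim=1).values   # best IoU per prediction
    fine = torch.where(iou > fine_thresh, iou, torch.zeros_like(iou))
    coarse = torch.where(iou > fine_thresh - coarse_relax, iou, torch.zeros_like(iou))
    return fine, coarse


# training would then combine both heads, e.g.
# loss = head_loss(lead_out, fine) + 0.25 * head_loss(aux_out, coarse)
```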

4.3 Other trainable bag-of-freebies

  • Batch normalization: fold the mean and variance of batch normalization into the bias and weight of the preceding convolutional layer at the inference stage (a minimal sketch follows this list).
  • Implicit knowledge in YOLOR combined with the convolutional feature map by addition and multiplication: the implicit knowledge in YOLOR can be pre-computed into a vector at the inference stage, and this vector can be merged into the bias and weight of the previous or following convolutional layer.
  • EMA model: EMA is a technique used in mean teacher; the authors simply use the EMA model as the final inference model.
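A minimal sketch of the BatchNorm-folding idea from the first bullet, using standard PyTorch modules; this is the generic fusion formula, not code from the YOLOv7 repo.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BN's running statistics and affine parameters into the conv's weight and bias."""
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(var + eps)
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_(bn.bias + (conv_bias - bn.running_mean) * scale)
    return fused


# sanity check: the fused conv matches conv followed by BN in eval mode
conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8).eval()
bn.running_mean.uniform_(-1, 1), bn.running_var.uniform_(0.5, 2.0)
x = torch.randn(1, 3, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))  # True
```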

5. Experiments

No muscle-flexing in this part; go look at the experiments in the paper yourself.

Original post: blog.csdn.net/weixin_45074568/article/details/126002238