Interpretation of YOLOv7 paper

Paper link: https://arxiv.org/abs/2207.02696
Code link: https://github.com/WongKinYiu/yolov7

Summary

In the range of 5 FPS to 160 FPS, YOLOv7 surpasses all known object detectors in both speed and accuracy. Among all known real-time object detectors running at 30 FPS or higher on a V100 GPU, YOLOv7 has the highest accuracy, reaching 56.8% AP.


Among them, the YOLOv7-E6 object detector (56 FPS on V100, 55.9% AP) is 509% faster and 2% AP more accurate than the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS on A100, 53.9% AP), and 551% faster and 0.7% AP more accurate than the convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS on A100, 55.2% AP).
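The "509% faster" and "551% faster" figures can be reproduced from the quoted frame rates if "faster" is read as the relative increase in FPS. The short check below is only arithmetic on the reported numbers; the function name is made up for illustration, and the measurements were taken on different GPUs (V100 versus A100).

```python
def percent_faster(fps_new: float, fps_old: float) -> float:
    """Relative speed-up of fps_new over fps_old, in percent."""
    return (fps_new / fps_old - 1.0) * 100.0


# Frame rates quoted in the summary.
yolov7_e6 = 56.0
swin_l = 9.2
convnext_xl = 8.6

print(round(percent_faster(yolov7_e6, swin_l)))       # 509
print(round(percent_faster(yolov7_e6, convnext_xl)))  # 551
```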

YOLOv7 outperforms: YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B and many other object detectors in terms of speed and accuracy.

1. Introduction

Real-time object detection is an important topic in computer vision and an essential component of many computer vision systems, such as multi-object tracking, autonomous driving, robotics, and medical image analysis. The computing devices that run real-time object detection are usually mobile CPUs or GPUs, or the various neural processing units (NPUs) developed by major manufacturers. Some of these edge devices focus on accelerating particular operations such as plain convolution, depthwise convolution, or MLP operations. The real-time object detector proposed in this paper is mainly intended to support both mobile GPUs and GPU devices, from the edge to the cloud.

Real-time object detectors for different edge devices are still under development. For example, MCUNet and NanoDet focus on low-power microcontrollers and on improving inference speed on edge CPUs, while methods such as YOLOX and YOLOR focus on improving inference speed on various GPUs.

In recent years, the development of real-time object detectors has mainly focused on the design of efficient architectures. Real-time object detectors used on CPU are mostly based on MobileNet, ShuffleNet or GhostNet. Most of the mainstream real-time object detectors developed for GPU use ResNet, DarkNet or DLA, and then use the CSPNet strategy to optimize the architecture.

The method proposed in this paper takes a different direction from current mainstream real-time object detectors. In addition to architecture optimization, it focuses on optimizing the training process. The authors concentrate on optimized modules and optimization methods that raise the training cost in order to improve detection accuracy, without increasing the inference cost. They refer to these modules and methods as a trainable bag-of-freebies.

In recent years, model reparameterization and dynamic label assignment have become important topics in network training and object detection. In this paper, some newly discovered problems are introduced and effective methods are designed to solve them, such as:

(1) For model reparameterization, the authors use the concept of the gradient propagation path to analyze reparameterization strategies applicable to layers in different networks, and propose a planned re-parameterized model.

(2) With dynamic label assignment, training a model that has multiple output layers raises a new question: "How should dynamic targets be assigned to the outputs of different branches?" To address it, the authors propose a new label assignment method called coarse-to-fine lead guided label assignment.

The main contributions of this paper are as follows:

  1. Several trainable bag-of-freebies methods are designed, which greatly improve the detection accuracy of real-time object detection without increasing the inference cost;
  2. For the evolution of object detection methods, the authors identify two new issues: how a reparameterized module should replace the original module, and how a dynamic label assignment strategy should handle assignment to different output layers. The paper proposes methods to address both;
  3. "Extend" and "compound scaling" methods are proposed so that real-time object detectors use parameters and computation efficiently;
  4. The proposed method reduces the parameters of state-of-the-art real-time object detectors by about 40% and their computation by about 50%, while achieving faster inference and higher detection accuracy.

2. Related work

2.1 Real-time Object Detector

The current state-of-the-art real-time object detectors are mainly based on YOLO and FCOS. To become a state-of-the-art real-time object detector, a method usually needs the following characteristics:

        (1) Faster and stronger network architecture;

        (2) A more efficient feature integration method;

        (3) More accurate detection methods;

        (4) More robust loss function;

        (5) A more efficient label assignment method;

        (6) A more effective training method.

In this paper, the authors do not intend to explore self-supervised learning or knowledge distillation methods that require additional data or large models. Instead, new trainable bag-of-freebies will be designed for problems derived from the state-of-the-art methods in (4), (5) and (6) above.

2.2 Model reparameterization

Model reparameterization merges multiple computational modules into one at the inference stage. It can be regarded as an ensemble technique and falls into two categories: module-level ensemble and model-level ensemble. There are two common model-level reparameterization practices for obtaining the final inference model:

  • Train multiple identical models with different training data, and then average the weights of multiple trained models.
  • Take a weighted average of the model weights at different iterations.
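Both practices reduce to averaging weight tensors that share the same names and shapes. The sketch below is a minimal illustration using plain Python lists in place of tensors; the `Weights` alias, the function name, and the parameter names in the example are assumptions made for illustration, not part of the paper or its code.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Element-wise mean of identically shaped weight dictionaries."""
    n = len(models)
    first = models[0]
    return {
        name: [sum(m[name][i] for m in models) / n for i in range(len(first[name]))]
        for name in first
    }


model_a = {"conv.weight": [1.0, 2.0], "conv.bias": [0.0]}
model_b = {"conv.weight": [3.0, 4.0], "conv.bias": [1.0]}
merged = average_weights([model_a, model_b])
print(merged)  # {'conv.weight': [2.0, 3.0], 'conv.bias': [0.5]}
```

The same function applies whether the inputs are separately trained models or checkpoints of one model saved at different iterations; only the source of the list changes.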

Module-level reparameterization has been a popular research topic in recent years. This approach splits a module into multiple identical or different branches during training, and merges the branches into a single, fully equivalent module at inference. However, not every proposed reparameterized module can be applied equally well to different architectures. With this in mind, the authors develop a new reparameterization module and design corresponding application strategies for various architectures.

2.3 Model Scaling

Model scaling enlarges or shrinks an already designed model so that it fits different computing devices. Model scaling methods usually use different scaling factors, such as resolution (the size of the input image), depth (the number of layers), width (the number of channels), and stage (the number of feature pyramids), in order to achieve a good trade-off among the number of network parameters, the amount of computation, inference speed, and accuracy.

Network Architecture Search (NAS) is a commonly used method for model scaling. NAS can automatically search for suitable scaling factors from the search space without defining overly complicated rules. The disadvantage of NAS is that it requires very expensive calculations to complete the search of the model scaling factor.

From a review of the literature, the authors found that almost all model scaling methods analyze each scaling factor independently, and even methods in the compound scaling category optimize the scaling factors independently. This is because most popular NAS architectures deal with scaling factors that are only weakly correlated. The authors observe that all concatenation-based models, such as DenseNet or VoVNet, change the input width of some layers when their depth is scaled. Since the architecture proposed in this paper is concatenation-based, a new compound scaling method has to be devised.

3. Model architecture design

3.1 Extended Efficient Aggregation Network

In most of the literature on designing efficient architectures, the main considerations are the number of parameters, the amount of computation, and computational density. Starting from the characteristics of memory access cost, one can also analyze the impact of the input/output channel ratio, the number of architectural branches, and element-wise operations on network inference speed. In addition, activations can be taken into account when performing model scaling, that is, more consideration is given to the number of elements in the output tensors of the convolutional layers.

  • CSPVoVNet in Figure 2(b) is a variant of VoVNet. In addition to considering the above basic design issues, the architecture of CSPVoVNet also analyzes the gradient path, so that the weights of different layers can learn more diverse features. The gradient analysis method described above enables faster and more accurate inference.
  • The ELAN in Figure 2(c) starts from the design question "How to design an efficient network?" and concludes that a deeper network can learn and converge effectively when its shortest and longest gradient paths are controlled.
  • In this paper, the author proposes Extended-ELAN (E-ELAN) based on ELAN, whose main architecture is shown in Fig. 2(d).

In large-scale ELAN, a stable state is reached regardless of the gradient path length and the number of stacked computational blocks. If more computational blocks are stacked without limit, this stable state may be destroyed and the parameter utilization rate will decrease. The proposed E-ELAN uses expand, shuffle, and merge cardinality to continuously enhance the learning ability of the network without destroying the original gradient path.

In terms of architecture, E-ELAN changes only the computational block, while the transition layer is left completely unchanged. The authors' strategy is to use group convolution to expand the channels and cardinality of the computational blocks, applying the same group parameter and channel multiplier to all computational blocks of a layer. The feature map produced by each computational block is then shuffled into g groups according to the group parameter g, and the groups are concatenated together. At this point, the number of channels in each group of feature maps is the same as in the original architecture. Finally, the g groups of feature maps are added together to perform merge cardinality. Besides preserving the original ELAN design, E-ELAN can also guide different groups of computational blocks to learn more diverse features.
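The shuffle and merge steps described above can be pictured on channel lists. The sketch below is a loose illustration under strong simplifying assumptions: every channel is reduced to a single number, the group convolutions that produce the expanded outputs are omitted, and the function name is hypothetical. It is one reading of the description, not the authors' implementation.

```python
from typing import List


def shuffle_and_merge(block_outputs: List[List[float]], g: int) -> List[float]:
    """block_outputs[i] holds the g * c channel values of computational block i."""
    groups: List[List[float]] = [[] for _ in range(g)]
    for out in block_outputs:
        c = len(out) // g
        for j in range(g):
            # Shuffle: slice j of every block is concatenated into group j.
            groups[j].extend(out[j * c:(j + 1) * c])
    # Merge cardinality: add the g groups channel by channel.
    return [sum(values) for values in zip(*groups)]


# Two blocks, each expanded to g * c = 2 * 2 channels.
block0 = [1.0, 2.0, 10.0, 20.0]
block1 = [3.0, 4.0, 30.0, 40.0]
merged = shuffle_and_merge([block0, block1], g=2)
print(merged)  # [11.0, 22.0, 33.0, 44.0]
```

The merged output has 4 channels, the same count that concatenating two unexpanded 2-channel blocks would give, which matches the statement that each group has as many channels as the original architecture.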

3.2 Model scaling for concatenation-based models

In architectures such as PlainNet or ResNet, the in-degree and out-degree of each layer do not change when the model is scaled up or down, so the impact of each scaling factor on the number of parameters and the amount of computation can be analyzed independently. However, when these methods are applied to a concatenation-based architecture, depth scaling causes the in-degree of the transition layer that immediately follows a concatenation-based computational block to decrease or increase, as shown in Fig. 3(a) and (b).

It follows that the scaling factors of a concatenation-based model cannot be analyzed separately and must be considered together. Taking depth scale-up as an example, it changes the ratio between the input and output channels of the transition layer, which may reduce the hardware utilization of the model. Therefore, a corresponding compound scaling method must be used for concatenation-based models. When the depth factor of a computational block is scaled, the resulting change in the block's output channels must also be calculated. The transition layer is then scaled in width by the same amount of change, and the result is shown in Fig. 3(c). The compound scaling method proposed by the authors preserves the properties the model had at its initial design and maintains the optimal structure.
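The coupling between depth and width can be made concrete with a toy calculation. The sketch below assumes a DenseNet-style block whose output is the concatenation of its input and the output of each of its layers; the function name, parameter names, and channel counts are illustrative assumptions rather than values from the paper.

```python
from typing import Tuple


def compound_scale(c_in: int, growth: int, n_layers: int,
                   transition_out: int, depth_factor: float) -> Tuple[int, float, int]:
    """Scale block depth, then scale the transition width by the induced change."""
    old_out = c_in + growth * n_layers           # channels entering the transition layer
    new_layers = round(n_layers * depth_factor)  # depth scaling of the block
    new_out = c_in + growth * new_layers         # concatenation output grows with depth
    width_factor = new_out / old_out             # change seen by the transition layer
    new_transition_out = round(transition_out * width_factor)
    return new_layers, width_factor, new_transition_out


print(compound_scale(c_in=64, growth=32, n_layers=2,
                     transition_out=128, depth_factor=2.0))  # (4, 1.5, 192)
```

Scaling only the depth would have left the transition layer at 128 output channels while its input grew from 128 to 192, which is the change in input/output ratio the text warns about.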

4. Trainable bag-of-freebies

4.1 Reparameterized convolution

Although RepConv achieves excellent performance on VGG, its accuracy drops significantly when it is applied directly to ResNet, DenseNet, and other architectures. The authors use gradient flow propagation paths to analyze how reparameterized convolution should be combined with different networks, and design a planned reparameterized convolution accordingly.

RepConv combines a 3×3 convolution, a 1×1 convolution, and an identity connection in one convolutional layer. After analyzing how RepConv performs when combined with different architectures, the authors found that the identity connection in RepConv destroys the residual in ResNet and the concatenation in DenseNet, both of which provide more gradient diversity for different feature maps.
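Why the three branches can be merged into one 3×3 convolution at inference is easiest to see in a single-channel toy case: a 1×1 kernel and an identity mapping both act only on the center pixel, so they can be added to the center tap of the 3×3 kernel. The sketch below assumes one channel, stride 1, zero padding, and no batch normalization in the branches (the real RepConv also folds per-branch batch normalization); all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv3x3(x: Matrix, k: Matrix) -> Matrix:
    """Single-channel 3x3 cross-correlation with zero padding of 1."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(3):
                for dj in range(3):
                    r, c = i + di - 1, j + dj - 1
                    if 0 <= r < h and 0 <= c < w:
                        out[i][j] += k[di][dj] * x[r][c]
    return out


def fuse(k3: Matrix, k1: float, identity: bool) -> Matrix:
    """Fold a 1x1 kernel and an optional identity branch into the 3x3 kernel."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + (1.0 if identity else 0.0)
    return fused


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
k1 = 0.5

# Training-time form: three parallel branches whose outputs are summed.
y3 = conv3x3(x, k3)
branches = [[y3[i][j] + k1 * x[i][j] + x[i][j] for j in range(3)] for i in range(3)]

# Inference-time form: a single 3x3 convolution with the fused kernel.
single = conv3x3(x, fuse(k3, k1, identity=True))

assert all(abs(a - b) < 1e-9 for ra, rb in zip(branches, single) for a, b in zip(ra, rb))
```

Calling `fuse(k3, k1, identity=False)` gives the variant without the identity branch, which corresponds to the RepConvN idea of dropping the identity connection.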

For these reasons, the authors use RepConv without the identity connection (RepConvN) to design the planned reparameterized convolution architecture. In their view, when a convolutional layer with a residual or concatenation connection is replaced by a reparameterized convolution, there should be no identity connection. Figure 4 shows the planned reparameterized convolution designed by the authors for PlainNet and ResNet. The full reparameterized convolution experiments on residual-based and concatenation-based models are presented in the ablation study.

4.2 Auxiliary training module

Deep supervision is a technique commonly used in training deep networks. The main idea is to add extra auxiliary heads in the middle layers of the network and to use an auxiliary loss to guide the weights of the shallow layers. Even for architectures such as ResNet and DenseNet that usually converge well, deep supervision can still significantly improve model performance on many tasks. Figure 5(a) and (b) show object detector architectures "without" and "with" deep supervision, respectively. In this article, the head responsible for the final output is called the lead head, and the head used to assist training is called the auxiliary head.
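In loss terms, deep supervision simply adds the auxiliary heads' losses to the lead head's loss with some weight. The sketch below is a minimal illustration; the function name and the default weight of 0.25 are assumptions chosen for the example, not values stated in this article.

```python
from typing import List


def total_loss(lead_loss: float, aux_losses: List[float], aux_weight: float = 0.25) -> float:
    """Lead-head loss plus down-weighted auxiliary-head losses."""
    return lead_loss + aux_weight * sum(aux_losses)


# One lead head and two auxiliary heads attached to middle layers.
loss = total_loss(lead_loss=1.0, aux_losses=[0.8, 0.4])
print(loss)
```

The auxiliary heads exist only during training, so this extra term raises the training cost but leaves the inference cost unchanged.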

Next comes the issue of label assignment. In the past, label assignment in deep network training usually referred directly to the ground truth and generated hard labels according to given rules. In recent years, however, taking object detection as an example, researchers often use the quality and distribution of the network's prediction output, considered together with the ground truth, and apply some calculation and optimization method to generate reliable soft labels. For example, YOLO uses the IoU between the predicted bounding box and the ground truth as the soft label for objectness. In this paper, the mechanism that considers the network predictions together with the ground truth and then assigns soft labels is called a "label assigner".
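The IoU-based soft label mentioned above can be illustrated with a few lines of code. This is a minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2); the function name and the example boxes are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


predicted = (0.0, 0.0, 2.0, 2.0)
ground_truth = (0.0, 0.0, 2.0, 4.0)

hard_label = 1.0                           # rule-based: the grid is simply "positive"
soft_label = iou(predicted, ground_truth)  # prediction-aware target
print(soft_label)  # 0.5
```

A hard label ignores how good the current prediction is, whereas the soft label shrinks toward 0 as the predicted box drifts away from the ground truth.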

While developing techniques related to the soft label assigner, the authors came across a new question, "How should soft labels be assigned to the auxiliary head and the lead head?", which had not been discussed in the literature. The most popular method at present, shown in Fig. 5(c), separates the auxiliary head and the lead head and performs label assignment for each using its own prediction results and the ground truth.

The label assignment method proposed in this paper uses the predictions of the lead head to guide both the auxiliary head and the lead head itself. That is, the lead head's predictions serve as guidance to generate coarse-to-fine hierarchical labels, which are used for auxiliary head learning and lead head learning, respectively. The two proposed deep supervision label assignment strategies are shown in Fig. 5(d) and (e).

Lead head guided label assigner: The lead head guided label assigner is computed mainly from the prediction results of the lead head and the ground truth, and generates soft labels through an optimization process. This set of soft labels is used as the training target for both the auxiliary head and the lead head. The reason is that the lead head has a relatively strong learning capability, so the soft labels generated from it should better represent the distribution and correlation between the source data and the target. Furthermore, this kind of learning can be viewed as a form of generalized residual learning: by letting the shallower auxiliary head directly learn the information that the lead head has already learned, the lead head can focus more on the residual information that has not yet been learned.

Coarse-to-fine lead head guided label assigner: This assigner also uses the predictions of the lead head and the ground truth to generate soft labels. However, it generates two different sets of soft labels, coarse labels and fine labels. The fine labels are the same as the soft labels generated by the lead head guided label assigner, while the coarse labels are generated by relaxing the constraints of the positive sample assignment process so that more grids are treated as positive targets. This is because the learning ability of the auxiliary head is not as strong as that of the lead head; to avoid losing information that needs to be learned, the authors focus on optimizing the recall of the auxiliary head. For the output of the lead head, the high-precision results can then be filtered from the high-recall results to form the final output. Note, however, that if the additional weight of the coarse labels is close to that of the fine labels, a bad prior may be produced at final prediction. Therefore, to make these extra coarse positive grids less influential, the authors put restrictions in the decoder so that the extra coarse positive grids cannot produce soft labels perfectly. This mechanism allows the importance of fine and coarse labels to be adjusted dynamically during learning, and keeps the optimizable upper bound of the fine labels always higher than that of the coarse labels.
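The relationship between the two label sets can be sketched as two selections over the same lead-head matching scores, with a looser criterion for the coarse set. The sketch below is only an illustration of "relaxing the positive sample constraint": the per-grid scores, the thresholds, and the function name are all hypothetical, and the actual assignment rule in the paper's code may differ.

```python
from typing import Dict, Set, Tuple

Grid = Tuple[int, int]  # (row, column) of a grid cell


def positive_grids(scores: Dict[Grid, float], min_score: float) -> Set[Grid]:
    """Grids whose lead-head matching score passes the given threshold."""
    return {grid for grid, score in scores.items() if score >= min_score}


# Hypothetical matching scores computed from lead-head predictions and ground truth.
lead_scores = {(3, 4): 0.9, (3, 5): 0.6, (4, 4): 0.4, (7, 1): 0.1}

fine = positive_grids(lead_scores, min_score=0.5)    # target for the lead head
coarse = positive_grids(lead_scores, min_score=0.3)  # relaxed target for the auxiliary head

print(sorted(fine))    # [(3, 4), (3, 5)]
print(sorted(coarse))  # [(3, 4), (3, 5), (4, 4)]
```

Every fine positive is also a coarse positive, so the auxiliary head is trained for recall on a superset of the grids that the lead head treats as positive.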

4.3 Other trainable bag-of-freebies

(1) Batch normalization in the conv-bn-activation topology: The batch normalization layer is connected directly to the convolutional layer, so that at the inference stage the mean and variance of batch normalization can be folded into the weights and biases of the convolutional layer.
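The folding works because batch normalization at inference is a fixed affine transform per output channel. The sketch below shows the standard algebra for a single channel with scalar weight and bias; the function name and the numeric values are illustrative assumptions.

```python
import math
from typing import Tuple


def fuse_conv_bn(w: float, b: float, gamma: float, beta: float,
                 mean: float, var: float, eps: float = 1e-5) -> Tuple[float, float]:
    """Return (w', b') such that w' * x + b' == BN(w * x + b) at inference."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta


w, b = 2.0, 0.5                                 # convolution weight and bias (one channel)
gamma, beta, mean, var = 1.5, 0.1, 0.3, 4.0     # BN parameters and running statistics
x = 3.0

unfused = gamma * ((w * x + b) - mean) / math.sqrt(var + 1e-5) + beta
fused_w, fused_b = fuse_conv_bn(w, b, gamma, beta, mean, var)
fused = fused_w * x + fused_b

assert abs(unfused - fused) < 1e-9
```

For a real convolution the same per-channel scale multiplies every element of that output channel's kernel, so the fused layer costs exactly one convolution at inference.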

(2) Implicit knowledge in YOLOR combined with convolutional feature maps by addition and multiplication: Through precomputation at the inference stage, the implicit knowledge in YOLOR can be simplified to a vector, which can then be merged into the biases and weights of the previous or subsequent convolutional layer.

(3) EMA model: EMA (exponential moving average) is a technique used in mean teacher, and the authors use the EMA model alone as the final inference model.
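An EMA model keeps a running, exponentially weighted average of the training weights and is updated after each optimizer step. The sketch below shows the update rule on a flat list of weights; the function name and the decay value are assumptions for illustration (practical decays are much closer to 1).

```python
from typing import List


def ema_update(ema: List[float], weights: List[float], decay: float) -> List[float]:
    """One EMA step: keep `decay` of the old average, blend in the new weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


ema = [0.0, 0.0]
training_steps = ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0])  # weights after each step
for step_weights in training_steps:
    ema = ema_update(ema, step_weights, decay=0.5)

print(ema)  # [0.875, 1.75]
```

The averaged weights form a separate copy of the model, so using them for inference adds no cost at deployment time.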

5. Experiment

5.1 Experimental setup

The authors run their experiments on the COCO dataset, and none of the experiments use pre-trained models. During development, the train 2017 set is used for training, and the val 2017 set is used for validation and hyperparameter selection.

The authors design basic models for edge GPUs, normal GPUs, and cloud GPUs, called YOLOv7-tiny, YOLOv7, and YOLOv7-W6, respectively. The basic models are also scaled for different service requirements to obtain further model variants. For YOLOv7, the authors apply stack scaling to the neck and use the proposed compound scaling method to scale up the depth and width of the entire model, which yields YOLOv7-X. For YOLOv7-W6, the newly proposed compound scaling method is used to obtain YOLOv7-E6 and YOLOv7-D6. In addition, applying the proposed E-ELAN to YOLOv7-E6 gives YOLOv7-E6E. Since YOLOv7-tiny is an edge GPU-oriented architecture, it uses leaky ReLU as its activation function; the other models use SiLU.

5.2 Baseline

The authors choose previous versions of YOLO and the state-of-the-art object detector YOLOR as baselines. Table 1 of the paper compares the proposed YOLOv7 models with these baselines trained under the same settings. The results show:

  • Compared with YOLOv4, the parameters of YOLOv7 are reduced by 75%, the amount of calculation is reduced by 36%, and the AP is increased by 1.5%.
  • Compared with the most advanced YOLOR-CSP, the parameters of YOLOv7 are reduced by 43%, the amount of calculation is reduced by 15%, and the AP is increased by 0.4%.
  • For the small models, compared with YOLOv4-tiny-3l, YOLOv7-tiny has 39% fewer parameters and 49% less computation, while the AP remains unchanged.
  • On the cloud GPU model, the YOLOv7 model can still have a higher AP while reducing the number of parameters by 19% and the amount of calculation by 33%.

5.3 Comparison with SOTA Algorithm

The authors compare the proposed method with current state-of-the-art object detectors for general-purpose GPUs and mobile GPUs; the results are reported in Table 2 of the paper.

The results show that the proposed method achieves the best overall trade-off between speed and accuracy:

  • Comparing YOLOv7-tiny-SiLU with YOLOv5-N (r6.1), YOLOv7-tiny-SiLU is 127 FPS faster and 10.7% higher in AP.
  • YOLOv7 reaches 51.4% AP at 161 FPS, while PPYOLOE-L reaches the same AP at only 78 FPS. In terms of parameter usage, YOLOv7 uses 41% fewer parameters than PPYOLOE-L.
  • Comparing YOLOv7-X, which runs at 114 FPS, with YOLOv5-L (r6.1), YOLOv7-X improves AP by 3.9%.
  • Comparing YOLOv7-X with the similarly scaled YOLOv5-X (r6.1), YOLOv7-X runs inference 31 FPS faster. In terms of parameters and computation, YOLOv7-X has 22% fewer parameters and 8% less computation than YOLOv5-X (r6.1), yet improves AP by 2.2%.
  • Comparing YOLOv7 and YOLOR at an input resolution of 1280, YOLOv7-W6 runs inference 8 FPS faster than YOLOR-P6 and also improves AP by 1%.
  • Comparing YOLOv7-E6 with YOLOv5-X6 (r6.1), the former has an AP gain of 0.9%, 45% fewer parameters, 63% less computation, and 47% faster inference.
  • YOLOv7-D6 has an inference speed close to that of YOLOR-E6 but improves AP by 0.8%. YOLOv7-E6E has an inference speed close to that of YOLOR-D6 but improves AP by 0.3%.

5.4 Ablation experiment

5.4.1 Composite scaling method

Table 3 shows the results obtained with different model scaling strategies. The compound scaling method proposed by the authors scales up the depth of the computational block by 1.5 times and the width of the transition block by 1.25 times. Compared with scaling up only the width, it improves AP by 0.5% with fewer parameters and less computation. Compared with scaling up only the depth, it needs only 2.9% more parameters and 1.2% more computation to improve AP by 0.2%. The results in Table 3 show that the proposed compound scaling strategy uses parameters and computation more efficiently.

5.4.2 Reparameterization model

To verify the generality of the proposed planned reparameterized model, the authors apply it to a concatenation-based model and a residual-based model. The concatenation-based and residual-based models chosen for verification are 3-stacked ELAN and CSPDarknet, respectively. In the concatenation-based experiment, the authors replace the 3 × 3 convolutional layers at different positions in the 3-stacked ELAN with RepConv; the detailed configuration is shown in Figure 6. The results in Table 4 show that all of the higher AP values appear in the planned reparameterized models proposed by the authors.

In the experiment on the residual-based model, the original dark block does not have a 3 × 3 convolution that conforms to the authors' design strategy, so they additionally design a reversed dark block for the experiment; its architecture is shown in Figure 7. Since the CSPDarknet with dark blocks and the CSPDarknet with reversed dark blocks have exactly the same number of parameters and operations, the comparison is fair. The experimental results in Table 5 confirm that the proposed planned reparameterized model is equally effective on residual-based models. The design of RepCSPResNet also fits the authors' design pattern.

5.4.3 Auxiliary loss head

In the auxiliary head experiments, the authors compare the usual independent label assignment for the lead head and the auxiliary head with the two proposed lead guided label assignment methods. Table 6 shows all the comparison results.

Table 6 shows that any model with an added auxiliary loss significantly improves overall performance. In addition, the proposed lead guided label assignment strategy achieves better AP, AP50, and AP75 than the general independent label assignment strategy. The proposed strategy of coarse labels for the auxiliary head and fine labels for the lead head gives the best results in all cases. In Figure 8, the authors show the objectness maps predicted by the different methods at the auxiliary head and the lead head. From Figure 8, the authors find that when the auxiliary head learns lead guided soft labels, it does help the lead head extract residual information from consistent targets.

Table 7 further analyzes the effect of the proposed coarse-to-fine lead guided label assignment method on the decoder of the auxiliary head. In other words, the authors compare the results with and without an upper bound constraint. Judging from the numbers in the table, constraining the upper bound of objectness by the distance from the center of the object achieves better performance.

Since the proposed YOLOv7 uses multiple pyramids to jointly predict object detection results, the auxiliary head can be connected directly to the pyramid in the middle layer for training. This kind of training can compensate for information that may be lost in the prediction of the next pyramid level. For these reasons, the authors design a partial auxiliary head in the proposed E-ELAN architecture. Their approach is to connect the auxiliary head after one of the sets of feature maps before merge cardinality, so that the weights of the newly generated set of feature maps are not directly updated by the auxiliary loss. This design allows the lead head of each pyramid to still obtain information from objects of different sizes. Table 8 shows the results of the coarse-to-fine lead guided method and the partial coarse-to-fine lead guided method. The partial coarse-to-fine lead guided method clearly has a better auxiliary effect.

6. Conclusion

This paper proposes a new real-time object detector architecture and a corresponding model scaling method. The authors also find that the evolution of object detection methods has given rise to new research topics. During the research, they identified the replacement problem of reparameterized modules and the assignment problem of dynamic label assignment. To solve these problems, they propose a trainable bag-of-freebies that improves the accuracy of object detection. On this basis, they developed the YOLOv7 series of object detection systems, which achieve state-of-the-art results.
