Detailed Interpretation of YOLOv7 (2): Interpretation of the Paper


Foreword

After Meituan released YOLOv6, the original authors of the YOLO series released YOLOv7.
The main contributions of YOLOv7 are:

1. Model re-parameterization
YOLOv7 introduces model re-parameterization into the network architecture, an idea that first appeared in RepVGG.
2. Label assignment strategy
YOLOv7's label assignment combines the cross-grid search of YOLOv5 with the matching strategy of YOLOX.
3. ELAN efficient network architecture
A new network architecture proposed in YOLOv7, focused on efficiency.
4. Training with an auxiliary head
YOLOv7 proposes an auxiliary-head training method. The main purpose is to spend more training cost in exchange for higher accuracy without affecting inference time, because the auxiliary head only exists during training.

1. What is YOLOv7?

The YOLO algorithm is the most typical representative of one-stage object detectors. It performs object recognition and localization with a deep neural network, runs very fast, and can be used in real-time systems.
YOLOv7 is currently the most advanced algorithm in the YOLO series, surpassing previous YOLO versions in both accuracy and speed.
Understanding YOLOv7 is a necessary step in studying object detection algorithms.

2. Interpretation of the paper

0. Abstract

  • In the range of 5 FPS to 160 FPS, YOLOv7 exceeds all known object detectors in both speed and accuracy. Among all known real-time object detectors running at 30 FPS or higher on a GPU V100, YOLOv7 has the highest accuracy, reaching 56.8% AP.
  • The YOLOv7-E6 detector (56 FPS on V100, 55.9% AP) is 509% faster and 2% more accurate in AP than the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS on A100, 53.9% AP), and 551% faster and 0.7% more accurate in AP than the convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS on A100, 55.2% AP).
  • YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B, and many other detectors in speed and accuracy. Furthermore, YOLOv7 is trained from scratch only on the MS COCO dataset, without using any other datasets or pre-trained weights.

1 Introduction

  1. Real-time object detection is a very important topic in computer vision, and it is often an indispensable component of computer vision systems.
  2. Examples include multi-object tracking, autonomous driving, robotics, and medical image analysis.
  3. The devices that run real-time object detection are usually mobile CPUs or GPUs, as well as the various neural processing units (NPUs) developed by major manufacturers.
  4. For example, the Apple Neural Engine (Apple), the Neural Compute Stick (Intel), Jetson AI edge devices (Nvidia), the Edge TPU (Google), the Neural Processing Engine (Qualcomm), the AI Processing Unit (MediaTek), and AI SoCs (Kneron) are all NPUs.
  5. Some of the edge devices mentioned above mainly accelerate different operations, such as standard convolution, depthwise convolution, or MLP operations. In this paper, the proposed real-time object detector is mainly intended to support both mobile GPUs and GPU devices from the edge to the cloud.
  6. In recent years, real-time object detectors are still being developed for different edge devices. For example, MCUNet and NanoDet focus on producing low-power MCU models and improving inference speed on edge CPUs, while methods such as YOLOX and YOLOR focus on improving inference speed on various GPUs.
  7. More recently, the development of real-time object detectors has focused on the design of efficient architectures. Real-time object detectors intended for CPUs are mostly based on MobileNet, ShuffleNet, or GhostNet.
  8. Another mainstream line of real-time object detectors is developed for GPUs; most of them use ResNet, DarkNet, or DLA, and then apply the CSPNet strategy to optimize the architecture.
  9. The method proposed in this paper takes a different direction from current mainstream real-time object detectors. In addition to architecture optimization, the proposed method focuses on optimizing the training process. The focus is on optimized modules and optimization methods that increase the training cost to improve object detection accuracy without increasing the inference cost. The authors call these modules and optimization methods a trainable bag-of-freebies.
  10. Recently, model re-parameterization and dynamic label assignment have become important topics in network training and object detection. After these new concepts were proposed, many new problems emerged in training object detectors. In this paper, the authors present some of the new problems they discovered and propose effective solutions.
  11. For model re-parameterization, the paper analyzes the re-parameterization strategies applicable to the layers of different networks using the concept of the gradient propagation path, and proposes a planned re-parameterized model.
  12. Training a model with multiple output layers raises a new problem when dynamic label assignment is used, namely: "How to assign dynamic targets to the outputs of different branches?" To address this question, the paper proposes a new label assignment method called coarse-to-fine lead head guided label assignment.
  13. The main contributions of this paper are as follows:
  • (1) Several trainable bag-of-freebies methods are designed, which let real-time object detection greatly improve detection accuracy without increasing the inference cost;
  • (2) In the development of object detection methods, two new issues are uncovered: how a re-parameterized module replaces the original module, and how a dynamic label assignment strategy handles the assignment to different output layers. Methods are proposed to address the difficulties posed by both issues;
  • (3) For real-time object detectors, "extend" and "compound scaling" methods are proposed to make effective use of parameters and computation;
  • (4) The proposed method reduces the parameters of the state-of-the-art real-time object detector by about 40% and the computation by about 50%, with faster inference speed and higher detection accuracy.

2. Related work

2.1. Real-time Object Detector

The current state-of-the-art real-time object detectors are mainly based on YOLO and FCOS.
Becoming a state-of-the-art real-time object detector usually requires the following characteristics:
(1) a faster and stronger network architecture;
(2) a more effective feature integration method;
(3) a more accurate detection method;
(4) a more robust loss function;
(5) a more efficient label assignment method;
(6) a more efficient training method.
In this paper, the authors do not explore self-supervised learning or knowledge distillation methods that require additional data or large models. Instead, they design new trainable bag-of-freebies for the problems derived from the state-of-the-art methods associated with (4), (5), and (6) above.

2.2. Model reparameterization

  • Model re-parameterization techniques merge multiple computational modules into one at the inference stage.
  • Model re-parameterization can be regarded as an ensemble technique, and it can be divided into two categories: module-level ensemble and model-level ensemble. There are two common practices for model-level re-parameterization to obtain the final inference model.
  • One is to train multiple identical models with different training data and then average the weights of the trained models (a minimal sketch follows this list).
  • The other is to take a weighted average of the model weights saved at different training iterations.
  • Module-level re-parameterization has become a popular research topic in recent years. This approach splits a module into multiple identical or different branches during training and integrates the branches into a fully equivalent module at inference. However, not all proposed re-parameterized modules can be perfectly applied to different architectures. With this in mind, the authors develop new re-parameterization modules and design related application strategies for various architectures.
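
As a concrete illustration of the first practice, here is a minimal sketch, assuming PyTorch and checkpoints that share one architecture; the helper name `average_state_dicts` and the choice to skip integer buffers are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of model-level re-parameterization by weight averaging.
# Assumes all checkpoints come from the same architecture; illustrative only.
import torch

def average_state_dicts(state_dicts):
    """Average the parameters of several same-architecture checkpoints."""
    avg = {}
    for key, value in state_dicts[0].items():
        if value.is_floating_point():
            # mean of the corresponding weights across all checkpoints
            avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        else:
            # integer buffers (e.g. BatchNorm's num_batches_tracked) are copied
            avg[key] = value.clone()
    return avg

# Usage: the averaged weights become the single inference model, e.g.
# model.load_state_dict(average_state_dicts([torch.load(p) for p in paths]))
```

The second practice, a weighted average over training iterations, is essentially the EMA model mentioned later in Section 4.3.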

2.3. Model scaling

  • Model scaling is a way to scale a designed model up or down so that it fits different computing devices.
    Model scaling methods usually use different scaling factors, such as resolution (the size of the input image), depth (the number of layers), width (the number of channels), and stage (the number of feature pyramids), to achieve a good trade-off among the number of parameters, the amount of computation, inference speed, and accuracy.
  • Network architecture search (NAS) is a commonly used model scaling method. NAS automatically searches the search space for suitable scaling factors without defining overly complex rules. Its disadvantage is that the search for model scaling factors is computationally very expensive.
  • Researchers have analyzed the relationship between scaling factors and the amounts of parameters and operations, trying to estimate rules directly and thereby obtain the scaling factors required for model scaling. Reviewing the literature, the authors found that almost all model scaling methods analyze each scaling factor independently, and even methods in the compound-scaling category optimize scaling factors independently. This is because most popular NAS architectures deal with scaling factors that are not strongly correlated. The authors observe that all concatenation-based models, such as DenseNet or VoVNet, change the input width of some layers when their depth is scaled. Since the proposed architecture is concatenation-based, a new compound scaling method had to be devised for it.

3. Architecture

3.1. Extended Efficient Layer Aggregation Networks

In most of the literature on designing efficient architectures, the main considerations are no more than the number of parameters, the amount of computation, and the computational density.
Starting from the characteristics of memory access cost, Ma et al. also analyzed the influence of the input/output channel ratio, the number of branches in the architecture, and element-wise operations on network inference speed. Dollár et al. additionally considered activations when performing model scaling, i.e., the number of elements in the output tensors of convolutional layers.
The design of CSPVoVNet in Fig. 2(b) is a variant of VoVNet. In addition to the basic design issues above, the architecture of CSPVoVNet also analyzes the gradient path so that the weights of different layers can learn more diverse features. This gradient analysis makes inference faster and more accurate.
ELAN [1] in Fig. 2(c) considered the following design strategy: "How to design an efficient network?" It reached a conclusion: by controlling the shortest and longest gradient paths, a deeper network can learn and converge effectively. This paper proposes Extended-ELAN (E-ELAN) based on ELAN, whose main structure is shown in Fig. 2(d).
Figure 2: Extended efficient layer aggregation networks.

  • In large-scale ELAN, the network reaches a stable state regardless of the gradient path length and the number of stacked computational blocks. If more computational blocks are stacked without limit, this stable state may be destroyed and parameter utilization will decrease.
  • The proposed E-ELAN uses expand, shuffle, and merge cardinality to continuously enhance the learning ability of the network without destroying the original gradient path.
  • In terms of architecture, E-ELAN only changes the architecture of the computational block, while the architecture of the transition layer is left completely unchanged. The strategy is to use group convolution to expand the channels and cardinality of the computational blocks.
  • The same group parameter and channel multiplier are applied to all computational blocks of a computational layer. Then, the feature maps produced by each computational block are shuffled into g groups according to the group number g set for the group convolution, and concatenated together. At this point, the number of channels in each group of feature maps is the same as in the original architecture. Finally, the g groups of feature maps are added together to merge cardinality. Besides maintaining the original ELAN design, E-ELAN can also guide different groups of computational blocks to learn more diverse features. A minimal sketch of this expand-shuffle-merge flow follows.
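
The sketch below is a toy PyTorch illustration of the flow, assuming a single grouped convolution stands in for a whole computational block and `channels` is divisible by `groups`; the class name and channel bookkeeping are illustrative assumptions, not the released YOLOv7 code.

```python
# Toy sketch of E-ELAN's expand -> shuffle -> merge-cardinality flow.
# Assumption: one grouped conv stands in for a full computational block.
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # ShuffleNet-style shuffle: interleave channels across `groups` groups
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class EELANSketch(nn.Module):
    def __init__(self, channels, groups=2):
        super().__init__()
        self.groups = groups
        # expand: group convolution multiplies channels and cardinality
        self.expand = nn.Conv2d(channels, channels * groups, kernel_size=3,
                                padding=1, groups=groups)

    def forward(self, x):
        y = self.expand(x)                   # (B, C*g, H, W)
        y = channel_shuffle(y, self.groups)  # shuffle feature maps into g groups
        b, c, h, w = y.shape
        # merge cardinality: add the g groups so the channel count
        # matches the original architecture again
        return y.view(b, self.groups, c // self.groups, h, w).sum(dim=1)
```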

3.2. Model Scaling for Concatenation-Based Models

  • The main purpose of model scaling is to adjust certain attributes of the model and generate models of different scales to meet the needs of different inference speeds.
  • For example, the scaling model of EfficientNet considers width, depth, and resolution, while the scaling model of Scaled-YOLOv4 adjusts the number of stages. Dollár et al. analyzed the influence of vanilla convolution and group convolution on the amounts of parameters and computation when scaling width and depth, and designed a corresponding model scaling method based on this.
  • Figure 3 illustrates model scaling for concatenation-based models. From (a) to (b), it can be observed that when a concatenation-based model is depth-scaled, the output width of the computational block also increases. This phenomenon causes the input width of the subsequent transition layer to increase. Therefore, the paper proposes (c): when performing model scaling on a concatenation-based model, only the depth of the computational block needs to be scaled, and the remaining transition layers are then scaled in width accordingly.
  • The methods above are mainly used in architectures such as PlainNet and ResNet. When these architectures are scaled up or down, the in-degree and out-degree of each layer do not change, so the impact of each scaling factor on the amounts of parameters and computation can be analyzed independently. However, if these methods are applied to a concatenation-based architecture, scaling depth up or down causes the in-degree of the transition layer that follows the concatenation-based computational block to increase or decrease, as shown in Fig. 3 (a) and (b).
  • From this phenomenon it can be inferred that for concatenation-based models, the scaling factors cannot be analyzed individually but must be considered jointly. Take scaling up depth as an example: this changes the ratio between the input and output channels of the transition layer, which can reduce the model's hardware utilization. Therefore, a corresponding compound model scaling method has to be proposed for concatenation-based models. When the depth factor of a computational block is scaled, the resulting change in the block's output channels must also be computed, and then the transition layer's width is changed by the same amount, as shown in Fig. 3(c). The proposed compound scaling method preserves the properties the model had at its initial design and thus maintains the optimal structure. A toy sketch of this rule follows.
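The sketch below illustrates the rule under a simplifying assumption: the block's output width is proportional to the number of concatenated branches (its depth), so the following transition layer is widened by the same ratio. The function and values are illustrative, not the paper's exact bookkeeping.

```python
# Toy sketch of compound scaling for a concatenation-based block.
# Assumption: the block concatenates `depth` branches, so its output width
# grows in proportion to depth; the transition layer that follows must then
# be widened by the same ratio. Illustrative only.

def compound_scale(depth, transition_width, depth_factor=1.5):
    new_depth = max(1, round(depth * depth_factor))
    ratio = new_depth / depth              # change in the block's output width
    new_transition_width = round(transition_width * ratio)
    return new_depth, new_transition_width

# e.g. scaling a 4-branch block by 1.5x also widens a 256-channel
# transition layer to 384 channels:
print(compound_scale(4, 256))  # -> (6, 384)
```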

4. Trainable Bag-of-Freebies

4.1. Planned Re-parameterized Convolution

  • Although model re-parameterization (RepConv) has achieved excellent performance on VGG, its accuracy drops significantly when it is applied directly to architectures such as ResNet and DenseNet.
  • The paper uses the gradient flow propagation path to analyze how re-parameterized convolution should be combined with different networks, and redesigns the re-parameterized convolution accordingly.
  • RepConv actually combines a 3×3 convolution, a 1×1 convolution, and an identity connection in one convolutional layer. By analyzing the combinations of RepConv with different architectures and their performance, the authors find that the identity connection in RepConv destroys the residual in ResNet and the concatenation in DenseNet, which themselves provide more gradient diversity for different feature maps.
  • For the above reason, RepConvN, a RepConv without the identity connection, is used to design the architecture of the planned re-parameterized convolution. In this line of thinking, when a convolutional layer with residual or concatenation is replaced by re-parameterized convolution, there should be no identity connection. Figure 4 shows examples of the designed "planned re-parameterized convolution" used in PlainNet and ResNet. Full experiments on planned re-parameterized convolution for residual-based and concatenation-based models are presented in the ablation study section. A minimal fusion sketch follows.
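The following is a minimal sketch of the RepVGG-style fusion that makes re-parameterization possible, assuming PyTorch; BN folding is omitted for brevity and the function name is illustrative. Dropping the identity branch (`use_identity=False`) corresponds to the RepConvN variant described above.

```python
# Minimal sketch of RepVGG-style branch fusion (BN folding omitted).
# Assumption: parallel 3x3, 1x1, and identity branches over C channels.
# `fuse_branches` is an illustrative helper, not the official RepConv code.
import torch
import torch.nn.functional as F

def fuse_branches(w3x3, b3x3, w1x1, b1x1, channels, use_identity=True):
    # pad the 1x1 kernel to 3x3 so the two kernels can be summed
    w = w3x3 + F.pad(w1x1, [1, 1, 1, 1])
    b = b3x3 + b1x1
    if use_identity:
        # the identity branch equals a 3x3 kernel with a 1 at each
        # channel's own center position (use_identity=False -> RepConvN)
        eye = torch.zeros_like(w3x3)
        for c in range(channels):
            eye[c, c, 1, 1] = 1.0
        w = w + eye
    return w, b

# Equivalence check: three branches at train time == one conv at inference.
x = torch.randn(1, 8, 16, 16)
w3, b3 = torch.randn(8, 8, 3, 3), torch.randn(8)
w1, b1 = torch.randn(8, 8, 1, 1), torch.randn(8)
y_train = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1) + x
wf, bf = fuse_branches(w3, b3, w1, b1, channels=8)
assert torch.allclose(y_train, F.conv2d(x, wf, bf, padding=1), atol=1e-5)
```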

4.2. Coarse Labels for the Auxiliary Head, Fine Labels for the Lead Head

  • Deep supervision is a technique commonly used in training deep networks. Its main idea is to add extra auxiliary heads in the middle layers of the network, with assistant losses guiding the shallow network weights. Even for architectures that generally converge well, such as ResNet and DenseNet, deep supervision can still significantly improve model performance on many tasks. Figure 5 (a) and (b) show the object detector architectures "without" and "with" deep supervision, respectively. In this paper, the head responsible for the final output is called the lead head, and the head that assists training is called the auxiliary head.
  • Next, the issue of label assignment is discussed.
  • In the past, when training deep networks, label assignment usually referred directly to the ground truth (GT) and generated hard labels according to given rules. In recent years, however, taking object detection as an example, researchers often use the quality and distribution of the network's prediction outputs together with the GT, and apply calculation and optimization methods to generate reliable soft labels. For example, YOLO uses the IoU between the bounding-box regression prediction and the GT as the soft label for objectness (see the sketch after this list). In this paper, the mechanism that jointly considers the network prediction results and the GT to assign soft labels is called a "label assigner".
  • Whether for the auxiliary head or the lead head, deep supervision must be trained on the target. During the development of soft-label assigner techniques, the authors stumbled upon a new derivative question, namely: "How to assign soft labels to the auxiliary head and the lead head?" To the best of their knowledge, the literature has not addressed this question. The most commonly used current approach is shown in Fig. 5(c): separate the auxiliary head and the lead head, and let each use its own prediction results together with the GT to perform label assignment.
  • The method proposed in this paper is a novel label assignment method that guides both the auxiliary head and the lead head by the lead head's predictions. In other words, using the lead head's prediction as guidance, coarse-to-fine hierarchical labels are generated for the learning of the auxiliary head and the lead head, respectively. The two proposed deep-supervision label assignment strategies are shown in Fig. 5 (d) and (e), respectively.

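As a concrete illustration of the IoU-as-soft-label idea above, here is a minimal sketch assuming PyTorch and torchvision; the matching of predictions to GT boxes is taken as given, and the function name is illustrative.

```python
# Toy sketch: the IoU between each matched prediction and its GT box serves
# as the objectness soft label. Assumes a precomputed `matches` index;
# illustrative only, not YOLOv7's actual assigner.
import torch
from torchvision.ops import box_iou

def iou_soft_labels(pred_boxes, gt_boxes, matches):
    """pred_boxes: (N, 4), gt_boxes: (M, 4) in xyxy; matches: (N,) GT index."""
    iou = box_iou(pred_boxes, gt_boxes)                     # (N, M) pairwise IoU
    return iou[torch.arange(len(matches)), matches].clamp(0.0, 1.0)
```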

  • The lead-head guided label assigner mainly performs calculations based on the lead head's prediction results and the GT, and generates soft labels through an optimization process. This set of soft labels is used as the training target for both the auxiliary head and the lead head. The reason is that the lead head has a relatively strong learning ability, so the soft labels it produces should be more representative of the distribution of, and correlation between, the source and target data. Furthermore, this kind of learning can be viewed as a form of generalized residual learning: by letting the shallower auxiliary head directly learn the information the lead head has already learned, the lead head can focus more on learning the residual information that has not yet been learned.
  • The coarse-to-fine lead-head guided label assigner also uses the lead head's predictions and the GT to generate soft labels. But in this process, two different sets of soft labels are generated, namely coarse labels and fine labels. The fine labels are the same as the soft labels produced by the lead-head guided label assigner, while the coarse labels are obtained by relaxing the constraints of the positive-sample assignment process, allowing more grid cells to be treated as positive targets. The reason is that the learning ability of the auxiliary head is not as strong as that of the lead head; to avoid losing information that needs to be learned, the recall of the auxiliary head is prioritized in the object detection task. For the output of the lead head, high-precision results can then be filtered from the high-recall results as the final output. However, note that if the additional weight of the coarse labels is close to that of the fine labels, it may produce bad priors in the final prediction. Therefore, to reduce the influence of the extra coarse positive grids, restrictions are added in the decoder so that those grids cannot produce soft labels perfectly. This mechanism allows the relative importance of fine and coarse labels to be adjusted dynamically during learning, keeping the optimizable upper bound of fine labels always higher than that of coarse labels. A toy sketch of the coarse/fine positive split follows.
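
The sketch below splits positives by a center-distance prior only, in grid units; the real assigner also ranks candidates by the lead head's prediction quality (SimOTA-style), so the radii and names here are illustrative assumptions.

```python
# Toy sketch of the coarse-to-fine positive split. Assumption: positives
# are selected by a center-distance prior only; the real assigner also
# uses the lead head's predicted scores. Radii are illustrative.
import torch

def coarse_to_fine_positives(grid_xy, gt_xy, fine_radius=1.0, coarse_radius=2.5):
    """grid_xy: (N, 2) grid-cell centers; gt_xy: (M, 2) GT box centers."""
    d = torch.cdist(grid_xy, gt_xy)     # (N, M) center distances in grid units
    fine_mask = d < fine_radius         # strict: fewer grids, trains the lead head
    coarse_mask = d < coarse_radius     # relaxed: more grids, trains the aux head
    return fine_mask, coarse_mask       # every fine positive is also coarse
```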

4.3. Other Trainable Bag-of-Freebies

In this section, some trainable bag-of-freebies are listed. These freebies are tricks used in training whose original concepts the authors did not propose. The training details of these freebies are elaborated in the appendix.
They include:
(1) Batch normalization in a conv-bn-activation topology: the batch normalization layer is connected directly to the convolutional layer, so that at the inference stage the mean and variance of batch normalization can be folded into the bias and weight of the convolutional layer (a minimal folding sketch follows this list).
(2) Implicit knowledge in YOLOR combined with a convolutional feature map by addition and multiplication: at the inference stage, the implicit knowledge in YOLOR can be reduced to a vector by pre-computation. This vector can be combined with the bias and weight of the previous or subsequent convolutional layer.
(3) EMA model: EMA is a technique used in mean teacher; in this system, the EMA model is used purely as the final inference model.
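
For item (1), the standard folding identity is y = γ·(Wx + b − μ)/σ + β, which collapses into a single convolution with scaled weights. A minimal PyTorch sketch, assuming a Conv2d directly followed by BatchNorm2d; the helper name is illustrative.

```python
# Minimal sketch of folding BatchNorm into the preceding convolution for
# inference, per item (1). Assumes conv is immediately followed by bn.
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups,
                      bias=True)
    # per-channel scale: gamma / sqrt(running_var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    # folded bias: beta + (b - running_mean) * scale
    fused.bias.copy_(bn.bias + (b - bn.running_mean) * scale)
    return fused
```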

5. Experimental results

5.1. Experiment setup

  • Experiments are conducted on the Microsoft COCO dataset to validate the object detection method. No pre-trained models are used in any experiment; that is, all models are trained from scratch. During development, the train2017 set is used for training, and the val2017 set is used for validation and hyperparameter selection. Finally, the object detection performance is reported on the test-dev 2017 set and compared with state-of-the-art object detection algorithms. Detailed training parameter settings are described in the appendix.
  • Basic models are designed for the edge GPU, normal GPU, and cloud GPU, called YOLOv7-tiny, YOLOv7, and YOLOv7-W6, respectively. These basic models are then scaled for different service requirements to obtain different types of models.
  • For YOLOv7, stack scaling is applied to the neck, and the proposed compound scaling method is used to scale the depth and width of the entire model, producing YOLOv7-X.
  • For YOLOv7-W6, the newly proposed compound scaling method is used to obtain YOLOv7-E6 and YOLOv7-D6.
  • Furthermore, the proposed E-ELAN is used on YOLOv7-E6 to obtain YOLOv7-E6E. Since YOLOv7-tiny is an edge-GPU-oriented architecture, it uses leaky ReLU as its activation function. The other models use SiLU as the activation function. The scaling factors of each model are described in detail in the appendix.

5.2. Baseline

  • Previous versions of YOLO and the state-of-the-art object detector YOLOR are chosen as baselines. Table 1 compares the proposed YOLOv7 models with baselines trained using the same settings.
  • The results show that compared with YOLOv4, YOLOv7 has 75% fewer parameters, 36% less computation, and 1.5% higher AP. Compared with the state-of-the-art YOLOR-CSP, YOLOv7 has 43% fewer parameters, 15% less computation, and 0.4% higher AP.
  • In the tiny-model comparison, compared with YOLOv4-tiny-3l, YOLOv7-tiny reduces the number of parameters by 39% and the amount of computation by 49% while keeping the same AP. On the cloud GPU models, YOLOv7 can still achieve a higher AP while reducing the number of parameters by 19% and the amount of computation by 33%.

5.3. Comparison with state-of-the-art

  • The proposed method is compared with state-of-the-art detectors for general-purpose GPUs and mobile GPUs; the results are shown in Table 2. They show that the proposed method has the best overall speed-accuracy trade-off.
  • Comparing YOLOv7-tiny-SiLU with YOLOv5-N (r6.1), the former is 127 FPS faster and 10.7% higher in AP.
  • Furthermore, YOLOv7 reaches 51.4% AP at 161 FPS, while PPYOLOE-L achieves the same AP at only 78 FPS.
  • In terms of parameter usage, YOLOv7 uses 41% fewer parameters than PPYOLOE-L. Comparing YOLOv7-X, with an inference speed of 114 FPS, against YOLOv5-L (r6.1), YOLOv7-X improves AP by 3.9%. Compared with YOLOv5-X (r6.1) of a similar scale, YOLOv7-X is 31 FPS faster at inference.
  • In addition, in terms of parameters and computation, YOLOv7-X reduces parameters by 22% and computation by 8% compared with YOLOv5-X (r6.1), while improving AP by 2.2%.
  • Comparing YOLOv7 with YOLOR at an input resolution of 1280, YOLOv7-W6 is 8 FPS faster than YOLOR-P6 at inference, and its detection accuracy is 1% AP higher. In the comparison between YOLOv7-E6 and YOLOv5-X6 (r6.1), the former has 0.9% higher AP, 45% fewer parameters, 63% less computation, and 47% faster inference. YOLOv7-D6 has an inference speed similar to YOLOR-E6 but improves AP by 0.8%. YOLOv7-E6E has an inference speed similar to YOLOR-D6 but improves AP by 0.3%.

5.4. Ablation experiments

5.4.1 Proposed Composite Scaling Method

  • Table 3 shows the results of scaling with different model scaling strategies. Among them, the proposed compound scaling method scales up the depth of the computational block by 1.5 times and the width of the transition block by 1.25 times.
  • Compared with the method that only scales up width, the proposed method improves AP by 0.5% with fewer parameters and less computation.
  • Compared with the method that only scales up depth, it needs only 2.9% more parameters and 1.2% more computation while improving AP by 0.2%.
  • The results in Table 3 show that the proposed compound scaling strategy utilizes parameters and computation more efficiently.


5.4.2 Proposed Planned Re-parameterized Model

  • To verify the generality of the proposed planned re-parameterized model, it is validated on both concatenation-based and residual-based models.
  • The chosen concatenation-based and residual-based models are a 3-stacked ELAN and CSPDarknet, respectively.
  • In the concatenation-based model experiments, the 3×3 convolutional layers at different positions in the 3-stacked ELAN are replaced with RepConv; the specific configurations are shown in Figure 6. From the results in Table 4, all of the higher AP values appear in the proposed planned re-parameterized model.


  • In the experiments on the residual-based model, since the original Dark block does not have a 3×3 convolutional block that fits the design strategy, a reversed Dark block is additionally designed for the experiments; its architecture is shown in Figure 7.
  • Since the Dark block of CSPDarknet and the reversed Dark block have exactly the same parameters and operations, the comparison is fair. The experimental results in Table 5 fully confirm that the proposed planned re-parameterized model is equally effective on residual-based models. The design of RepCSPResNet also follows this design pattern.

5.4.3 Proposed Assistant Loss for the Auxiliary Head

  • In the assistant-loss experiments for the auxiliary head, the general, independent label assignment for the lead head and the auxiliary head is compared with the two proposed lead-head guided label assignment methods. The comparison results are shown in Table 6. From the results listed there, any model that adds an assistant loss significantly improves overall performance.

  • Furthermore, the proposed lead-head guided label assignment strategy achieves better performance than the general independent label assignment strategy in AP, AP50, and AP75. The proposed coarse-for-auxiliary and fine-for-lead label assignment strategy obtains the best results in all cases.

  • Figure 8 shows the objectness maps predicted at the auxiliary head and the lead head by different methods. From Fig. 8, it can be seen that if the auxiliary head learns lead-head guided soft labels, it indeed helps the lead head extract residual information from consistent targets.

  • Table 7 further analyzes the effect of the proposed coarse-to-fine lead-head guided label assignment method on the decoder of the auxiliary head; that is, the results with and without the upper-bound constraint are compared. From the numbers in the table, constraining the objectness upper bound by the distance from the center of the object achieves better performance.

  • Since the proposed YOLOv7 uses multiple pyramids to jointly predict object detection results, the auxiliary head can be directly connected to the pyramids of the middle layers for training.

  • This type of training can compensate for information that may be lost in the prediction of the next-level pyramid.

  • For the above reasons, partial auxiliary heads are designed in the proposed E-ELAN architecture. The approach is to connect the auxiliary head after one of the feature map sets before merging cardinality; this connection makes the weights of the newly generated feature map set not directly updated by the assistant loss.

  • This design allows each pyramid of the lead head to still draw information from objects of different sizes. Table 8 shows the results of two different methods, coarse-to-fine guided and partial coarse-to-fine guided. Obviously, the partial coarse-to-fine guided method has a better assistance effect.


6 Conclusion

  • This paper proposes a new real-time object detector architecture and a corresponding model scaling method. Furthermore, the authors find that the evolution of object detection methods generates new research topics.
  • During the research, they discovered the replacement problem of re-parameterized modules and the allocation problem of dynamic label assignment. To solve them, they propose a trainable bag-of-freebies to improve object detection accuracy.
  • On this basis, the YOLOv7 series of object detection systems was developed, which obtains state-of-the-art object detection results.

7. Acknowledgments

The authors would like to thank the National Center for High-performance Computing (NCHC) for providing computing and storage resources.

8. More Comparisons

  • In the range of 5 FPS to 160 FPS, YOLOv7 exceeds all known object detectors in both speed and accuracy, and among all known real-time object detectors at 30 FPS or higher on a GPU V100 it has the highest accuracy: 56.8% AP (test-dev) / 56.8% AP (min-val).
  • The YOLOv7-E6 object detector (56 FPS on V100, 55.9% AP) is 509% faster and 2% more accurate in AP than the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS on A100, 53.9% AP), and 551% faster and 0.7% more accurate in AP than the convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS on A100, 55.2% AP).
  • YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B, and many other object detectors in speed and accuracy. Furthermore, YOLOv7 is trained from scratch only on the MS COCO dataset, without using any other datasets or pre-trained weights.



  • The maximum-accuracy real-time model YOLOv7-E6E (56.8% AP) is 13.7% AP higher on the COCO dataset than the currently most accurate Meituan YOLOv6-s model (43.1% AP). On the COCO dataset with batch=32 on a V100 GPU, YOLOv7-tiny (35.2% AP, 0.4 ms) is 25% faster and 0.2% AP higher than Meituan YOLOv6-n (35.0% AP, 0.5 ms).

Origin: blog.csdn.net/qq128252/article/details/126695791