YOLOv7 full text translation

YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors


Original link https://arxiv.org/abs/2207.02696
Github address: https://github.com/WongKinYiu/yolov7
Abstract
YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy 56.8% AP among all known real-time object detectors with 30 FPS or higher on GPU V100. The YOLOv7-E6 object detector (56 FPS V100, 55.9% AP) outperforms both the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS A100, 53.9% AP) by 509% in speed and 2% in accuracy, and the convolutional-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy. YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B and many other object detectors in speed and accuracy. Moreover, we train YOLOv7 only on the MS COCO dataset from scratch without using any other datasets or pre-trained weights. Source code is released at https://github.com/WongKinYiu/yolov7 .

  1. Introduction
    Real-time object detection is a very important topic in computer vision, as it is often an indispensable component of computer vision systems, for example in multi-object tracking [94, 93], autonomous driving [40, 18], robotics [35, 58], medical image analysis [34, 46], etc. The computing devices that execute real-time object detection are usually mobile CPUs or GPUs, as well as various neural processing units (NPUs) developed by major manufacturers: for example, the Apple Neural Engine (Apple), the Neural Compute Stick (Intel), the Jetson AI edge devices (Nvidia), the Edge TPU (Google), the Neural Processing Engine (Qualcomm), the AI Processing Unit (MediaTek), and the AI SoCs (Kneron) are all NPUs. Some of the above edge devices mainly accelerate different operations, such as vanilla convolution, depth-wise convolution, or MLP operations. In this paper, the proposed real-time object detector is mainly intended to support both mobile GPU and GPU devices from the edge to the cloud.
    In recent years, real-time object detectors are still being developed for different edge devices. For example, the development of MCUNet [49, 48] and NanoDet [54] mainly focuses on producing low-power single-chip MCUs and improving the inference speed of edge CPUs, while methods such as YOLOX [21] and YOLOR [81] focus on improving the inference speed on various GPUs. (Figure 1 compares the proposed method with other real-time object detectors; our method achieves state-of-the-art performance.) More recently, the development of real-time object detectors has mainly focused on the design of efficient architectures. Real-time object detectors that can be used on CPUs [54, 88, 84, 83] mostly base their designs on MobileNet [28, 66, 27], ShuffleNet [92, 55], or GhostNet [25]. Another mainstream line of real-time object detectors is developed for GPUs [81, 21, 97]; they mostly use ResNet [26], DarkNet [63], or DLA [87], and then apply the CSPNet [80] strategy to optimize the architecture. The method proposed in this paper takes a different direction from the current mainstream real-time object detectors. In addition to architecture optimization, our method focuses on optimizing the training process. We focus on optimized modules and optimization methods that strengthen the training cost in order to improve object detection accuracy without increasing the inference cost. We call the proposed modules and optimization methods the trainable bag-of-freebies.
    Recently, model re-parameterization [13, 12, 29] and dynamic label assignment [20, 17, 42] have become important topics in network training and object detection. After these new concepts were proposed, the training of object detectors developed many new issues. In this paper, we present some of the new issues we discovered and propose effective solutions for them. For model re-parameterization, we analyze the model re-parameterization strategies applicable to the layers of different networks with the concept of the gradient propagation path, and propose a planned re-parameterized model. For dynamic label assignment, we find that training a model with multiple output layers creates a new issue, namely: "How to assign dynamic targets to the outputs of the different branches?" To address this issue, we propose a new label assignment method called coarse-to-fine lead guided label assignment. The main contributions of this paper are as follows: (1) we design several trainable bag-of-freebies methods, so that real-time object detection can greatly improve detection accuracy without increasing the inference cost; (2) for the evolution of object detection methods, we discover two new issues, namely how the re-parameterized module replaces the original module, and how the dynamic label assignment strategy deals with the assignment to different output layers, and we propose methods to address the difficulties arising from these issues; (3) we propose "extend" and "compound scaling" methods for real-time object detectors that can effectively utilize parameters and computation; and (4) the proposed method can effectively reduce about 40% of the parameters and 50% of the computation of the state-of-the-art real-time object detector, while achieving faster inference speed and higher detection accuracy.
  2. Related work
    2.1. Real-time object detectors
    The current state-of-the-art real-time object detectors are mainly based on YOLO [61, 62, 63] and FCOS [76, 77], namely [3, 79, 81, 21, 54, 85, 23]. Being a state-of-the-art real-time object detector usually requires the following characteristics: (1) a faster and stronger network architecture; (2) a more effective feature integration method [22, 97, 37, 74, 59, 30, 9, 45]; (3) a more accurate detection method [76, 77, 69]; (4) a more robust loss function [96, 64, 6, 56, 95, 57]; (5) a more efficient label assignment method [99, 20, 17, 82, 42]; and (6) a more efficient training method. In this paper, we do not intend to explore self-supervised learning or knowledge distillation methods that require additional data or large models. Instead, we design new trainable bag-of-freebies methods for the problems derived from the state-of-the-art methods associated with (4), (5), and (6) above.
    2.2. Model re-parameterization
    Model re-parameterization techniques [71, 31, 75, 19, 33, 11, 4, 24, 13, 12, 10, 29, 14, 78] merge multiple computational modules into one at the inference stage. Model re-parameterization can be regarded as an ensemble technique, and we can divide it into two categories: module-level ensemble and model-level ensemble. There are two common practices of model-level re-parameterization for obtaining the final inference model. One is to train multiple identical models with different training data and then average the weights of the multiple trained models. The other is to perform a weighted average of the model weights at different iterations. Module-level re-parameterization has been a popular research topic in recent years. This kind of method splits a module into multiple identical or different module branches during training and integrates the multiple branched modules into a completely equivalent module during inference. However, not all proposed re-parameterized modules can be perfectly applied to different architectures. With this in mind, we develop new re-parameterization modules and design related application strategies for various architectures.
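    To make the model-level practice concrete, below is a minimal sketch of averaging the weights of several checkpoints of the same architecture; the function name and the assumption that each checkpoint is a plain state_dict are illustrative, not taken from the YOLOv7 code base.

```python
import torch

def average_checkpoints(checkpoint_paths):
    """Model-level re-parameterization sketch: average the weights of several
    checkpoints of the same architecture (e.g. models trained on different
    data splits, or snapshots saved at different epochs)."""
    avg_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")  # assumed to be a plain state_dict
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg_state:
                avg_state[k] += state[k].float()
    for k in avg_state:
        avg_state[k] /= len(checkpoint_paths)
    return avg_state  # load back with model.load_state_dict(avg_state)
```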
    2.3. Model scaling
    Model scaling [72, 60, 74, 73, 15, 16, 2, 51] is a way to scale an already designed model up or down so that it fits different computing devices. Model scaling methods usually use different scaling factors, such as resolution (the size of the input image), depth (the number of layers), width (the number of channels), and stage (the number of feature pyramids), to achieve a good trade-off among the number of network parameters, the amount of computation, the inference speed, and the accuracy. Network architecture search (NAS) is one of the commonly used model scaling methods. NAS can automatically search for suitable scaling factors from the search space without defining overly complex rules. The disadvantage of NAS is that it requires very expensive computation to complete the search for the model scaling factors. In [15], the researchers analyzed the relationship between the scaling factors and the amounts of parameters and operations, trying to directly estimate some rules to obtain the scaling factors required by model scaling. By reviewing the literature, we found that almost all model scaling methods analyze individual scaling factors independently, and even methods in the compound scaling category optimize the scaling factors independently. This is because most popular NAS architectures deal with scaling factors that are not very correlated. We observe that all concatenation-based models, such as DenseNet [32] or VoVNet [39], change the input width of some layers when the depth of these models is scaled. Since the proposed architecture is concatenation-based, we have to design a new compound scaling method for this model.
  3. Architecture
    3.1. Extended efficient layer aggregation networks
    In most of the literature on designing efficient architectures, the main considerations are no more than the number of parameters, the amount of computation, and the computational density. Starting from the characteristics of memory access cost, Ma et al. [55] also analyzed the influence of the input/output channel ratio, the number of architecture branches, and element-wise operations on the network inference speed. Dollar et al. [15] additionally considered activations when performing model scaling, that is, they put more consideration on the number of elements in the output tensors of convolutional layers. The design of CSPVoVNet [79] in Figure 2(b) is a variant of VoVNet [39]. In addition to considering the above basic design issues, the architecture of CSPVoVNet [79] also analyzes the gradient path so that the weights of different layers can learn more diverse features. The gradient analysis approach described above makes inference faster and more accurate. ELAN [1] in Figure 2(c) considered the following design strategy: "How to design an efficient network?" They came to a conclusion: by controlling the shortest and longest gradient paths, a deeper network can learn and converge effectively. In this paper, we propose Extended-ELAN (E-ELAN) based on ELAN, and its main architecture is shown in Figure 2(d).
    In large-scale ELAN, the network reaches a stable state regardless of the gradient path length and the number of stacked computational blocks. If more computational blocks are stacked without limit, this stable state may be destroyed and the parameter utilization will decrease. The proposed E-ELAN uses expand, shuffle, and merge cardinality to achieve the ability to continuously enhance the learning ability of the network without destroying the original gradient path. In terms of architecture, E-ELAN only changes the architecture of the computational block, while the architecture of the transition layer is completely unchanged. Our strategy is to use group convolution to expand the channels and cardinality of the computational blocks. We apply the same group parameter and channel multiplier to all the computational blocks of a computational layer. Then, the feature maps calculated by each computational block are shuffled into g groups according to the set group parameter g and then concatenated together. At this point, the number of channels in each group of feature maps will be the same as in the original architecture. Finally, we add the g groups of feature maps to perform merge cardinality. In addition to maintaining the original ELAN design architecture, E-ELAN can also guide different groups of computational blocks to learn more diverse features.
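    The following is a minimal PyTorch sketch of the expand, shuffle, and merge cardinality idea described above, reduced to a single computational block; the module name, group parameter, and channel multiplier are illustrative assumptions, not the official E-ELAN implementation.

```python
import torch
import torch.nn as nn

class EELANMergeSketch(nn.Module):
    """Sketch of E-ELAN's expand / shuffle / merge cardinality
    (assumption-based, simplified to a single computational block)."""
    def __init__(self, channels, groups=2, channel_multiplier=2):
        super().__init__()
        self.g = groups
        # expand: group convolution enlarges the channels and cardinality
        self.expand = nn.Conv2d(channels, channels * channel_multiplier,
                                kernel_size=3, padding=1, groups=groups, bias=False)

    def forward(self, x):
        y = self.expand(x)                       # (N, C*m, H, W)
        n, c, h, w = y.shape
        # shuffle: split the expanded feature map into g groups
        y = y.view(n, self.g, c // self.g, h, w)
        # merge cardinality: add the g groups of feature maps, so the channel
        # count matches the original architecture when m == g
        return y.sum(dim=1)                      # (N, C*m/g, H, W)
```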
    3.2. Model scaling for concatenation-based models
    The main purpose of model scaling is to adjust some attributes of the model and generate models of different scales to meet the needs of different inference speeds. For example, the scaling model of EfficientNet [72] considers width, depth, and resolution, while the scaling model of Scaled-YOLOv4 [79] adjusts the number of stages. In [15], Dollar et al. analyzed the influence of vanilla convolution and group convolution on the amounts of parameters and computation when scaling width and depth, and designed a corresponding model scaling method accordingly. (Figure 3 illustrates model scaling for concatenation-based models: from (a) to (b), we observe that when the depth of a concatenation-based model is scaled, the output width of the computational block also increases, which causes the input width of the subsequent transition layer to increase; therefore we propose (c), that is, when performing model scaling on a concatenation-based model, only the depth in the computational block needs to be scaled, and the remaining transition layers are scaled in width accordingly.) The above methods are mainly used in architectures such as PlainNet and ResNet. When these architectures are scaled up or down, the in-degree and out-degree of each layer do not change, so we can analyze the influence of each scaling factor on the amounts of parameters and computation independently. However, if we apply these methods to a concatenation-based architecture, we find that when the depth is scaled up or down, the in-degree of the transition layer located right after the concatenation-based computational block decreases or increases, as shown in Figure 3 (a) and (b).
    From the above phenomenon, it can be inferred that for a concatenation-based model, we cannot analyze the different scaling factors separately but must consider them together. Take depth scaling up as an example: this behavior causes a change in the ratio between the input and output channels of the transition layer, which may lead to a decrease in the hardware utilization of the model. Therefore, we have to propose a corresponding compound model scaling method for concatenation-based models. When we scale the depth factor of a computational block, we must also calculate the change in the output channels of that block. Then, we perform width scaling by the same amount of change on the transition layers, and the result is shown in Figure 3(c). Our proposed compound scaling method can preserve the properties that the model had at its initial design and maintain the optimal structure.
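    As a small illustration of this compound rule, the sketch below scales the depth of a computational block and widens the downstream transition layers together with the resulting width change; the factor values and the simple linear width rule are assumptions for illustration only.

```python
def compound_scale(block_depth, transition_channels,
                   depth_factor=1.5, width_factor=1.25):
    """Sketch of compound scaling for a concatenation-based model:
    scale only the depth of the computational block, then scale the width
    of the transition layers by the same amount of change (assumed factors)."""
    new_depth = max(1, round(block_depth * depth_factor))
    # the transition layer that consumes the concatenated block output is
    # widened together with the change in the block's output width
    new_transition_channels = int(transition_channels * width_factor)
    return new_depth, new_transition_channels

# example: a 4-layer block with 256-channel transition layers
print(compound_scale(4, 256))   # -> (6, 320)
```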
  4. Trainable bag-of-freebies
    4.1. Planned re-parameterized convolution
    Although RepConv [13] has achieved excellent performance on VGG [68], when we directly apply it to architectures such as ResNet [26] and DenseNet [32], its accuracy drops significantly. We use the gradient flow propagation path to analyze how re-parameterized convolution should be combined with different networks, and design the planned re-parameterized convolution accordingly. RepConv actually combines a 3 × 3 convolution, a 1 × 1 convolution, and an identity connection in one convolutional layer. By analyzing the combinations of RepConv with different architectures and their performance, we find that the identity connection in RepConv destroys the residual in ResNet and the concatenation in DenseNet, which provide more diversity of gradients for different feature maps. For the above reasons, we use RepConv without identity connection (RepConvN) to design the architecture of the planned re-parameterized convolution. In our thinking, when a convolutional layer with residual or concatenation is replaced by a re-parameterized convolution, there should be no identity connection. Figure 4 shows examples of our designed "planned re-parameterized convolution" used in PlainNet and ResNet. The experiments on the planned re-parameterized convolution based on residual models and concatenation-based models are presented in the ablation study section.
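    The sketch below shows the standard way the three RepConv training-time branches can be folded into a single 3 × 3 kernel for inference (batch normalization and bias folding omitted for brevity). It is a generic re-parameterization sketch rather than the exact YOLOv7 code, and calling it with `with_identity=False` corresponds to the RepConvN variant used in the planned re-parameterized convolution.

```python
import torch
import torch.nn.functional as F

def fuse_repconv_kernels(w3x3, w1x1, with_identity=True):
    """Merge RepConv branches (3x3 conv, 1x1 conv, optional identity) into one
    3x3 kernel. Assumes groups=1 and, for the identity branch, equal input and
    output channels. Bias/BN folding is left out for brevity."""
    # pad the 1x1 kernel to 3x3 so the branches can be summed element-wise
    fused = w3x3 + F.pad(w1x1, [1, 1, 1, 1])
    if with_identity:
        # the identity branch equals a 3x3 kernel with a 1 at the centre of
        # the matching input/output channel
        out_ch, in_ch = w3x3.shape[:2]
        identity = torch.zeros_like(w3x3)
        for i in range(min(out_ch, in_ch)):
            identity[i, i, 1, 1] = 1.0
        fused = fused + identity
    return fused
```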

    4.2. Coarse for auxiliary and fine for lead loss
    Deep supervision [38] is a technique often used in training deep networks. Its main concept is to add extra auxiliary heads in the middle layers of the network, and the shallow network weights are guided by the auxiliary loss. Even for architectures such as ResNet [26] and DenseNet [32], which usually converge well, deep supervision [70, 98, 67, 47, 82, 65, 86, 50] can still significantly improve the performance of the model on many tasks. Figure 5 (a) and (b) show the object detector architectures "without" and "with" deep supervision, respectively. In this paper, we call the head responsible for the final output the lead head, and the head used to assist training the auxiliary head.
    Next we discuss the issue of label assignment. In the past, in the training of deep networks, label assignment usually referred directly to the ground truth and generated hard labels according to given rules. In recent years, however, taking object detection as an example, researchers often use the quality and distribution of the network's prediction output, together with the ground truth, and apply some calculation and optimization methods to generate a reliable soft label [61, 8, 36, 99, 91, 44, 43, 90, 20, 17, 42]. For example, YOLO [61] uses the IoU between the bounding box regression prediction and the ground truth as the soft label of objectness. In this paper, we call the mechanism that jointly considers the network prediction results and the ground truth and then assigns soft labels the "label assigner".
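    As a small illustration of this kind of soft label, the sketch below uses the IoU between predicted boxes and their best-matching ground-truth boxes as the objectness target; the simple "best IoU per prediction" matching is an assumption for illustration, not the full YOLO assignment rule.

```python
import torch
from torchvision.ops import box_iou

def iou_soft_objectness(pred_boxes, gt_boxes):
    """Sketch: soft objectness labels as the IoU between each predicted box
    and its best-matching ground truth (boxes in (x1, y1, x2, y2) format)."""
    iou = box_iou(pred_boxes, gt_boxes)   # (num_pred, num_gt)
    soft_labels, _ = iou.max(dim=1)       # best IoU per prediction, in [0, 1]
    return soft_labels
```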
    No matter how the circumstances change for the auxiliary head or the lead head, deep supervision training needs to be performed on the target objectives. During the development of soft label assigner related techniques, we accidentally discovered a new derived issue, namely: "How to assign soft labels to the auxiliary head and the lead head?" As far as we know, the relevant literature has not yet explored this question. The results of the most popular method at present are shown in Figure 5(c), which separates the auxiliary head and the lead head and uses their respective prediction results and the ground truth to perform label assignment. The method proposed in this paper is a new label assignment method that guides both the auxiliary head and the lead head by the lead head prediction. In other words, we use the lead head prediction as guidance to generate coarse-to-fine hierarchical labels, which are used for the learning of the auxiliary head and the lead head, respectively. The two proposed deep supervision label assignment strategies are shown in Figure 5 (d) and (e), respectively.
    The lead head guided label assigner mainly calculates based on the prediction result of the lead head and the ground truth, and generates soft labels through the optimization process. This set of soft labels is used as the training target for both the auxiliary head and the lead head. The reason for doing this is that the lead head has a relatively strong learning ability, so the soft labels generated from it should be more representative of the distribution and correlation between the source data and the target. Furthermore, we can view this kind of learning as a kind of generalized residual learning. By letting the shallower auxiliary head directly learn the information that the lead head has already learned, the lead head can focus more on learning the residual information that has not yet been learned.
    The **coarse-to-fine lead head guided label assigner** also uses the prediction result of the lead head and the ground truth to generate soft labels. However, in the process we generate two different sets of soft labels, namely coarse labels and fine labels, where the fine labels are the same as the soft labels generated by the lead head guided label assigner, and the coarse labels are generated by relaxing the constraints of the positive sample assignment process, allowing more grids to be treated as positive targets. The reason is that the learning ability of the auxiliary head is not as strong as that of the lead head, and in order to avoid losing the information that needs to be learned, we focus on optimizing the recall of the auxiliary head in the object detection task. For the output of the lead head, we can filter the high-precision results from the high-recall results as the final output. However, we must note that if the additional weight of the coarse labels is close to that of the fine labels, it may produce a bad prior at the final prediction. Therefore, in order to make those extra coarse positive grids have less influence, we put restrictions in the decoder so that the extra coarse positive grids cannot perfectly produce soft labels. This mechanism allows the importance of the fine labels and the coarse labels to be dynamically adjusted during the learning process, and makes the optimizable upper bound of the fine labels always higher than that of the coarse labels.
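    A minimal sketch of the coarse versus fine positive grid assignment is given below; using a center-distance radius as the positive criterion and the specific radius values are assumptions for illustration, not the actual YOLOv7 assigner.

```python
import torch

def coarse_and_fine_positives(grid_centers, gt_centers,
                              fine_radius=1.5, coarse_radius=3.0):
    """Sketch of coarse-to-fine assignment: the fine set (for the lead head)
    uses a strict positive criterion, while the coarse set (for the auxiliary
    head) relaxes it so that more grids count as positives (higher recall)."""
    # pairwise distances between grid centers and GT centers: (num_grids, num_gt)
    dist = torch.cdist(grid_centers, gt_centers)
    fine_pos = (dist < fine_radius).any(dim=1)      # strict positives
    coarse_pos = (dist < coarse_radius).any(dim=1)  # relaxed positives
    return fine_pos, coarse_pos
```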
    4.3. Other trainable bag-of-freebies
    In this section we list some trainable bag-of-freebies. These freebies are some of the tricks we use during training, but the original concepts were not proposed by us. The training details of these freebies are elaborated in the appendix, including: (1) Batch normalization in conv-bn-activation topology: this part mainly connects the batch normalization layer directly to the convolutional layer. The purpose is to integrate the mean and variance of batch normalization into the bias and weight of the convolutional layer at the inference stage. (2) Implicit knowledge in YOLOR [81] combined with the convolutional feature map in addition and multiplication manner: the implicit knowledge in YOLOR can be simplified to a vector by pre-computation at the inference stage. This vector can be combined with the bias and weight of the previous or subsequent convolutional layer. (3) EMA model: EMA is a technique used in mean teacher [75]; in our system we use the EMA model purely as the final inference model.
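    For item (1), the sketch below shows the standard way a following batch normalization layer can be folded into the convolution's weight and bias for inference; it is a generic conv-bn fusion sketch, not the exact utility used in YOLOv7.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the preceding convolution so that, at
    inference time, a single convolution reproduces conv + bn."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(var + eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```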
  5. Experiments
    5.1. Experimental setup
    We use the Microsoft COCO dataset to conduct experiments and validate our object detection method. None of our experiments use pre-trained models; that is, all models are trained from scratch. During development, we use the train 2017 set for training and the val 2017 set for verification and choosing hyperparameters. Finally, we show the performance of object detection on the test 2017 set and compare it with state-of-the-art object detection algorithms. The detailed training parameter settings are described in the appendix. We design basic models for edge GPU, normal GPU, and cloud GPU, which are called YOLOv7-tiny, YOLOv7, and YOLOv7-W6, respectively. At the same time, we also use the basic models to perform model scaling for different service requirements and obtain different types of models. For YOLOv7, we do stack scaling on the neck and use the proposed compound scaling method to scale the depth and width of the entire model, and obtain YOLOv7-X. For YOLOv7-W6, we use the newly proposed compound scaling method to obtain YOLOv7-E6 and YOLOv7-D6. In addition, we use the proposed E-ELAN for YOLOv7-E6 and thereby complete YOLOv7-E6E. Since YOLOv7-tiny is an edge GPU-oriented architecture, it uses leaky ReLU as the activation function. For the other models, we use SiLU as the activation function. We describe the scaling factors of each model in detail in the appendix.
    5.2. Baselines
    We choose the previous versions of YOLO [3, 79] and the state-of-the-art object detector YOLOR [81] as our baselines. Table 1 shows the comparison of our proposed YOLOv7 models with the baselines trained with the same settings. The results show that, compared with YOLOv4, the parameters of YOLOv7 are reduced by 75%, the computation is reduced by 36%, and the AP is increased by 1.5%. Compared with the state-of-the-art YOLOR-CSP, YOLOv7 has 43% fewer parameters, 15% less computation, and 0.4% higher AP. For the performance of the tiny model, compared with YOLOv4-tiny-31, YOLOv7-tiny reduces the parameters by 39% and the computation by 49% while keeping the same AP. For the cloud GPU models, our model can still have a higher AP while reducing the number of parameters by 19% and the computation by 33%.
    5.3. Comparison with state-of-the-arts
    We compare the proposed method with the state-of-the-art object detectors for general GPUs and mobile GPUs, and the results are shown in Table 2. From the results in Table 2, we know that the proposed method has the best speed-accuracy trade-off overall. If we compare YOLOv7-tiny-SiLU with YOLOv5-N (r6.1), our method is 127 fps faster and 10.7% more accurate in AP. In addition, YOLOv7 has 51.4% AP at a frame rate of 161 fps, while PP-YOLOE-L with the same AP only has a frame rate of 78 fps. In terms of parameter usage, YOLOv7 uses 41% fewer parameters than PP-YOLOE-L. If we compare YOLOv7-X at an inference speed of 114 fps with YOLOv5-L (r6.1), YOLOv7-X can improve the AP by 3.9%. If YOLOv7-X is compared with YOLOv5-X (r6.1) of a similar scale, the inference speed of YOLOv7-X is 31 fps faster. In addition, in terms of parameters and computation, compared with YOLOv5-X (r6.1), YOLOv7-X reduces the parameters by 22% and the computation by 8%, but improves the AP by 2.2%. If we compare YOLOv7 with YOLOR under the input resolution of 1280, the inference speed of YOLOv7-W6 is 8 fps faster than that of YOLOR-P6, and the detection accuracy is also 1% AP higher. As for the comparison between YOLOv7-E6 and YOLOv5-X6 (r6.1), the former has a 0.9% AP gain over the latter, 45% fewer parameters, 63% less computation, and 47% faster inference speed. YOLOv7-D6 has an inference speed close to that of YOLOR-E6, but improves the AP by 0.8%. YOLOv7-E6E has an inference speed close to that of YOLOR-D6, but improves the AP by 0.3%.
    5.4. Ablation study
    5.4.1 Proposed compound scaling method
    Table 3 shows the results obtained when scaling up with different model scaling strategies. Among them, our proposed compound scaling method is to scale up the depth of the computational block by 1.5 times and the width of the transition block by 1.25 times. Compared with the method that only scales up the width, our method can improve the AP by 0.5% with fewer parameters and less computation. Compared with the method that only scales up the depth, our method only needs to increase the number of parameters by 2.9% and the computation by 1.2%, and can improve the AP by 0.2%. From the results in Table 3, we can see that our proposed compound scaling strategy can utilize parameters and computation more efficiently.
    5.4.2 Proposed planned re-parameterized model
    In order to verify the generality of the proposed planned re-parameterized model, we use it on a concatenation-based model and a residual-based model respectively for verification. The concatenation-based and residual-based models we choose are the 3-stacked ELAN and CSPDarknet, respectively. In the experiments on the concatenation-based model, we replace the 3 × 3 convolutional layers at different positions in the 3-stacked ELAN with RepConv, and the detailed configurations are shown in Figure 6. From the results shown in Table 4, we see that all the higher AP values appear in our proposed planned re-parameterized model. In the experiments on the residual-based model, since the original dark block does not have a 3 × 3 convolutional block that conforms to our design strategy, we additionally design a reversed dark block for the experiments, and its architecture is shown in Figure 7. Since the dark block and the reversed dark block of CSPDarknet have exactly the same amounts of parameters and operations, the comparison is fair. The experimental results shown in Table 5 fully confirm that the proposed planned re-parameterized model is equally effective on residual-based models. We also find that the design of RepCSPResNet [85] follows our design pattern.
    5.4.3 Proposed assistant loss for auxiliary head

In the experiments on the assistant loss for the auxiliary head, we compare the general independent label assignment methods for the lead head and the auxiliary head, as well as the two proposed lead guided label assignment methods. We show all the comparison results in Table 6. From the results listed in Table 6, it is clear that any model that adds the assistant loss can significantly improve the overall performance. In addition, our proposed lead guided label assignment strategy achieves better performance than the general independent label assignment strategy in AP, AP50, and AP75. As for our proposed coarse for assistant and fine for lead label assignment strategy, it obtains the best results in all cases. Figure 8 shows the objectness maps predicted by different methods at the auxiliary head and the lead head. From Figure 8 we find that if the auxiliary head learns the lead guided soft labels, it can indeed help the lead head to extract the residual information from the consistent targets.
In Table 7, we further analyze the influence of the proposed coarse-to-fine lead guided label assignment method on the decoder of the auxiliary head. That is, we compare the results with and without the upper bound constraint. From the numbers in the table, the method of constraining the objectness upper bound by the distance from the center of the object achieves better performance.
Since the proposed YOLOv7 uses multiple pyramids to jointly predict object detection results, we can directly connect the auxiliary head to the pyramids in the middle layers for training. This type of training can compensate for the information that may be lost in the prediction of the next-level pyramid. For the above reasons, we designed partial auxiliary heads in the proposed E-ELAN architecture. Our approach is to connect the auxiliary head after one of the feature map sets before merging cardinality, and this connection makes the weights of the newly generated feature map set not directly updated by the assistant loss. Our design allows each pyramid of the lead head to still obtain information from objects of different sizes. Table 8 shows the results of two different methods, namely the coarse-to-fine lead guided method and the partial coarse-to-fine lead guided method. Obviously, the partial coarse-to-fine lead guided method has a better auxiliary effect.
6. Conclusions
This paper proposes a new architecture for real-time object detectors and the corresponding model scaling method. In addition, we find that the evolution process of object detection methods generates new research topics. During the research, we found the replacement problem of the re-parameterized module and the allocation problem of dynamic label assignment. To solve these problems, we propose the trainable bag-of-freebies method to improve the accuracy of object detection. Based on this, we developed the YOLOv7 series of object detection systems and obtained state-of-the-art results.
7. Acknowledgments
8. More comparison
YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 160 FPS and has the highest accuracy 56.8% AP test-dev / 56.8% AP min-val among all known real-time object detectors with 30 FPS or higher on GPU V100. The YOLOv7-E6 object detector (56 FPS V100, 55.9% AP) outperforms the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS A100, 53.9% AP) by 509% in speed and 2% in accuracy, and the convolutional-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy. YOLOv7 also outperforms YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, ViT-Adapter-B and many other object detectors in speed and accuracy. Furthermore, we only train YOLOv7 from scratch on the MS COCO dataset without using any other datasets or pre-trained weights.
The maximum-accuracy real-time model YOLOv7-E6E (56.8% AP) is 13.7% AP higher than the currently most accurate Meituan/YOLOv6-s model (43.1% AP) on the COCO dataset. On the COCO dataset with batch=32 on a V100 GPU, our YOLOv7-tiny (35.2% AP, 0.4 ms) model is 25% faster and 0.2% AP higher than Meituan/YOLOv6-n (35.0% AP, 0.5 ms).
References



Origin blog.csdn.net/qq_45294476/article/details/125657910