YOLO series target detection algorithm - YOLOX

YOLO series target detection algorithm catalog - article link


This article summarizes:

  1. Using YOLOv3-SPP as the baseline, a variety of improvements are applied to build a high-performance anchor-free detector;
  2. The improvements include EMA weight updating, a cosine learning-rate schedule, IoU loss and an IoU-aware branch; during training, the cls and obj branches use BCE loss and the reg branch uses IoU loss;
  3. In addition, there are the decoupled head, multi positives, SimOTA, etc.;
  4. With these improvements, the lightweight models YOLOX-Tiny and YOLOX-Nano perform better than their counterparts, and the large model YOLOX-L also achieves SOTA.

Summary of deep learning knowledge points

Column link:
https://blog.csdn.net/qq_39707285/article/details/124005405

This column mainly summarizes knowledge points in deep learning, starting from the major dataset competitions and introducing the champion algorithms over the years; it also covers important topics in deep learning, including loss functions, optimizers, various classic algorithms, and optimization strategies for them such as Bag of Freebies (BoF).



YOLO series target detection algorithm - YOLOX
2021.7.18 YOLOX: 《YOLOX: Exceeding YOLO Series in 2021》

1 Introduction

  In this paper, some experienced improvements to the YOLO series are introduced, forming a high-performance detector: YOLOX. YOLOX adopts the anchor-free approach together with other advanced detection techniques, such as the decoupled head and the label assignment strategy SimOTA, and achieves excellent results. Among lightweight models, YOLOX-Nano has only 0.91M parameters yet reaches 25.3% AP, 1.8% higher than NanoDet. YOLOX-L has almost the same number of parameters as YOLOv4-CSP and YOLOv5-L, and achieves 50.0% AP on COCO, exceeding YOLOv5-L by 1.8% AP.

  Over the past two years, progress in the object detection academic community has mainly focused on anchor-free detectors, advanced label assignment strategies, and end-to-end (NMS-free) detectors. These have not yet been integrated into the YOLO series: YOLOv4 and YOLOv5 are still anchor-based detectors whose assigning rules for training need to be designed by hand. This is why YOLOX is proposed, to bring these advances into the YOLO series.

  Considering that YOLOv4 and YOLOv5 may be slightly over-optimized for the anchor-based pipeline, this paper uses YOLOv3 as the starting point (by default YOLOv3-SPP).
[Figure 1: speed-accuracy trade-off of YOLOX and other detectors]
  As shown in Figure 1, YOLOX-DarkNet53, obtained by modifying YOLOv3, reaches 47.3% AP, higher than YOLOv3's 44.3%. In addition, with the advanced CSPNet backbone and an additional PAN head, YOLOX-L achieves 50.0% AP on COCO at a resolution of 640×640, 1.8% higher than the corresponding YOLOv5-L. YOLOX-Tiny and YOLOX-Nano (only 0.91M parameters and 1.08G FLOPs) are 10% AP and 1.8% AP higher than the corresponding YOLOv4-Tiny and NanoDet, respectively.

2. YOLOX

2.1 YOLOX-DarkNet53

  This article chooses YOLOv3's DarkNet53 as the baseline. In this section, the design of the entire system in YOLOX will be introduced step by step.

Implementation details

YOLOv3 baseline

  • The baseline uses DarkNet53 as the backbone together with an SPP layer, i.e., YOLOv3-SPP
  • EMA weight updating is added
  • Cosine learning-rate schedule
  • IoU loss and an IoU-aware branch
  • During training, the cls and obj branches use BCE loss and the reg branch uses IoU loss (see the sketch after this list)
  • RandomHorizontalFlip (random horizontal flip), ColorJitter (color jitter) and multi-scale data augmentation
  • RandomResizedCrop is abandoned, as it was found to overlap somewhat with the planned mosaic augmentation
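As a rough illustration of how these baseline training tricks fit together, here is a minimal PyTorch sketch of an EMA weight copy, a cosine learning-rate schedule, and the BCE/IoU loss split. The names and hyperparameters (`ModelEMA`, the 0.9998 decay, `iou_loss`) are illustrative assumptions, not the exact YOLOX implementation.

```python
import copy
import math

import torch
import torch.nn as nn

class ModelEMA:
    """Minimal sketch of exponential-moving-average (EMA) weights.
    Hypothetical helper; the decay value is an assumption."""
    def __init__(self, model: nn.Module, decay: float = 0.9998):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: nn.Module):
        msd = model.state_dict()
        for k, v in self.ema.state_dict().items():
            if v.dtype.is_floating_point:
                # ema = decay * ema + (1 - decay) * current
                v.mul_(self.decay).add_(msd[k], alpha=1.0 - self.decay)

def cosine_lr(base_lr: float, step: int, total_steps: int) -> float:
    """Cosine schedule decaying from base_lr toward 0."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# Loss split from the list above: BCE for cls/obj, IoU loss for reg.
bce = nn.BCEWithLogitsLoss()
# reg branch: an `iou_loss(pred_boxes, target_boxes)` is assumed here.
```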

Decoupled head
[Figure 2: difference between the YOLOv3 coupled head and the decoupled head]

  In object detection, the conflict between the classification and regression tasks is well known, so the decoupled head has been widely used since it was proposed. However, while the backbones and feature pyramids of the YOLO series (e.g., FPN/PAN) have kept evolving, their detection heads remain coupled, as shown in Figure 2.

  Through two analytical experiments, this paper shows that the coupled detection head may harm performance:

  1. Replacing YOLO's head with a decoupled head greatly improves the convergence speed, as shown in Figure 3;
    [Figure 3: training curves of detectors with the coupled vs. decoupled head]
  2. A decoupled head is crucial for an end-to-end version of YOLO (described below).
    [Table 1: effect of the decoupled head on the end-to-end property]

  As shown in Table 1, the end-to-end property drops by 4.2% AP with the coupled head, while the drop shrinks to 0.8% AP with the decoupled head. Therefore, this paper replaces the YOLO detection head with a lite decoupled head, as shown in Figure 2. Specifically, it contains a 1×1 conv layer to reduce the channel dimension, followed by two parallel branches, each with two 3×3 conv layers.
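The following is a minimal PyTorch sketch of such a decoupled head for a single FPN level: a 1×1 conv reduces channels, then two parallel stacks of two 3×3 convs feed the cls branch and the reg/obj branch. The channel width, activation choice, and the omitted BatchNorm are assumptions for illustration, not the exact YOLOX module.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Sketch of a decoupled head for one FPN level: a 1x1 conv reduces
    channels, then two parallel stacks of two 3x3 convs produce
    classification and regression/objectness outputs."""
    def __init__(self, in_ch: int, num_classes: int, width: int = 256):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, width, 1), nn.SiLU())
        def branch():
            return nn.Sequential(
                nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
                nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            )
        self.cls_branch = branch()
        self.reg_branch = branch()
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # class scores
        self.reg_pred = nn.Conv2d(width, 4, 1)            # box (x, y, w, h)
        self.obj_pred = nn.Conv2d(width, 1, 1)            # objectness / IoU-aware

    def forward(self, x):
        x = self.stem(x)
        c = self.cls_branch(x)
        r = self.reg_branch(x)
        return self.cls_pred(c), self.reg_pred(r), self.obj_pred(r)

# One head per FPN level, e.g.:
# cls, reg, obj = DecoupledHead(512, 80)(torch.randn(1, 512, 20, 20))
```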

Data augmentation

  Mosaic and MixUp data augmentation are added, and both are turned off for the last 15 epochs of training. As shown in Table 2, this reaches 42.0% AP. After using such strong data augmentation, ImageNet pre-training was found to be no longer beneficial, so all the following models are trained from scratch.
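A hedged sketch of the "turn off strong augmentation for the last 15 epochs" schedule; the dataset flags and the 300-epoch total are assumed names/values:

```python
from types import SimpleNamespace

# Hypothetical dataset object exposing augmentation switches.
train_dataset = SimpleNamespace(enable_mosaic=True, enable_mixup=True)

TOTAL_EPOCHS = 300   # assumed 300-epoch schedule
NO_AUG_EPOCHS = 15   # strong augmentation disabled for the final epochs

for epoch in range(TOTAL_EPOCHS):
    if epoch == TOTAL_EPOCHS - NO_AUG_EPOCHS:
        train_dataset.enable_mosaic = False
        train_dataset.enable_mixup = False
    # ... train one epoch here ...
```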

Anchor-free

  The well-known anchor-based mechanism has several problems, for example:

  1. To achieve optimal detection performance, cluster analysis is required before training to determine an optimal set of anchors, which are domain-specific and less general;
  2. The anchor mechanism increases the complexity of the detection head and the number of predictions per image; in some edge AI systems, moving such a large number of predictions between devices (e.g., from NPU to CPU) can become a potential bottleneck for overall latency.

  The anchor-free mechanism significantly reduces the number of design parameters that need heuristic tuning and the many tricks involved (such as anchor clustering and Grid Sensitive) for good performance, making the detector, especially its training and decoding stages, considerably simpler.

  Switching YOLO to anchor-free is very simple: the predictions at each location are reduced from 3 to 1, and they directly predict four values, namely two offsets from the top-left corner of the grid cell plus the height and width of the predicted box. The center location of each object is assigned as the positive sample, and a scale range is predefined to designate the FPN level for each object. This modification reduces the parameters and GFLOPs of the detector and makes it faster, yet it achieves a better performance of 42.9% AP, as shown in Table 2.
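A sketch of this anchor-free decoding: one prediction per grid cell, parameterized as two offsets from the cell's top-left corner plus width and height. The exact parameterization (e.g., the exp on width/height) is an assumption for illustration.

```python
import torch

def decode_anchor_free(reg: torch.Tensor, stride: int) -> torch.Tensor:
    """reg: (H, W, 4) raw predictions = (dx, dy, log_w, log_h) per cell.
    Returns (H, W, 4) boxes as (cx, cy, w, h) in image pixels.
    The parameterization is assumed, not quoted from the codebase."""
    h, w, _ = reg.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx = (xs + reg[..., 0]) * stride   # offset from the grid cell's top-left corner
    cy = (ys + reg[..., 1]) * stride
    bw = reg[..., 2].exp() * stride    # width/height via exp, scaled by stride
    bh = reg[..., 3].exp() * stride
    return torch.stack([cx, cy, bw, bh], dim=-1)

# e.g. boxes = decode_anchor_free(torch.randn(20, 20, 4), stride=32)
```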
[Table 2: roadmap of YOLOX-DarkNet53 in terms of AP on COCO]

Multi positives
  To be consistent with the assigning rule of YOLOv3, the anchor-free version above selects only one positive sample (the center location) for each object, ignoring other high-quality predictions. However, optimizing those high-quality predictions may also bring beneficial gradients, which can alleviate the extreme imbalance of positive/negative samples during training. This paper simply assigns the center 3×3 region as positives, which is also called "center sampling" in FCOS. The detector's performance improves to 45.0% AP, as shown in Table 2, already surpassing the current best practice of YOLOv3 (44.3% AP).
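A minimal sketch of this center-sampling rule: mark the 3×3 grid cells around each object center as positives (the grid geometry below is assumed for illustration):

```python
import torch

def center_3x3_positives(gt_centers: torch.Tensor, stride: int,
                         grid_h: int, grid_w: int) -> torch.Tensor:
    """gt_centers: (N, 2) object centers (cx, cy) in pixels.
    Returns a (grid_h, grid_w) bool mask marking the 3x3 cells
    around each center as positive samples (illustrative sketch)."""
    mask = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    for cx, cy in gt_centers:
        gx, gy = int(cx // stride), int(cy // stride)
        x0, x1 = max(gx - 1, 0), min(gx + 2, grid_w)   # clamp at borders
        y0, y1 = max(gy - 1, 0), min(gy + 2, grid_h)
        mask[y0:y1, x0:x1] = True
    return mask

# e.g. mask = center_3x3_positives(torch.tensor([[100., 150.]]), 32, 20, 20)
```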

SimOTA

  Advanced label assignment is another important progress in object detection in recent years. This paper studies OTA and summarizes four key properties of an advanced label assignment:

  1. loss/quality aware
  2. center prior
  3. A dynamic number of positive anchors for each ground truth (abbreviated as dynamic top-k)
  4. A global view

  OTA satisfies all four rules above, so this paper chooses it as the candidate label assignment strategy.

  Specifically, OTA analyzes label assignment from a global perspective and formulates the assignment process as an Optimal Transport (OT) problem, yielding SOTA performance among current assignment strategies. However, in practice, solving the OT problem with the Sinkhorn-Knopp algorithm brings 25% extra training time, which is quite expensive for a 300-epoch schedule. Therefore, this paper simplifies it to a dynamic top-k strategy, named SimOTA, to obtain an approximate solution.
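For context, a generic entropy-regularized Sinkhorn-Knopp iteration looks like the sketch below. This is a textbook version, shown only to illustrate the iterative solver that OTA would have to run at every assignment step; it is not OTA's actual code.

```python
import torch

def sinkhorn_knopp(cost, row_marginals, col_marginals, eps=0.1, iters=50):
    """Minimal Sinkhorn-Knopp iteration for an entropy-regularized OT plan.
    cost: (m, n); row_marginals: (m,); col_marginals: (n,),
    with both marginals summing to the same total."""
    K = torch.exp(-cost / eps)                # Gibbs kernel
    u = torch.ones_like(row_marginals)
    for _ in range(iters):                    # alternating marginal scaling
        v = col_marginals / (K.t() @ u)
        u = row_marginals / (K @ v)
    return u[:, None] * K * v[None, :]        # transport plan

# e.g. plan = sinkhorn_knopp(torch.rand(4, 100), torch.ones(4),
#                            torch.full((100,), 4 / 100.))
```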

  Here is a brief introduction to SimOTA. SimOTA first calculates the pairwise matching degree, represented by the cost or quality of each prediction-GT pair. For example, in SimOTA, the cost between GT $g_i$ and prediction $p_j$ is calculated as:

$$c_{ij} = L^{cls}_{ij} + \lambda L^{reg}_{ij}$$

where $\lambda$ is a balancing coefficient, and $L^{cls}_{ij}$ and $L^{reg}_{ij}$ are the classification loss and regression loss between $g_i$ and $p_j$. Then, for GT $g_i$, the top k predictions with the smallest cost within a fixed center region are selected as its positive samples. Finally, the grids corresponding to these positive predictions are assigned as positives, and the rest as negatives. Note: the value of k varies across GTs.
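A simplified sketch of this SimOTA assignment: build the cost matrix $c_{ij} = L^{cls}_{ij} + \lambda L^{reg}_{ij}$, choose a dynamic k per GT (here derived from the summed IoU of the top candidates, a commonly described heuristic that is assumed rather than quoted), and keep the k lowest-cost predictions inside the center region.

```python
import torch

def simota_assign(cls_cost: torch.Tensor, reg_cost: torch.Tensor,
                  ious: torch.Tensor, center_mask: torch.Tensor,
                  lam: float = 3.0, topk: int = 10) -> torch.Tensor:
    """Simplified SimOTA-style assignment (sketch, not the exact YOLOX code).
    cls_cost, reg_cost, ious: (num_gt, num_pred); center_mask: bool, same shape.
    Returns a (num_gt, num_pred) bool matrix of positives; the tie-breaking
    when one prediction matches several GTs is omitted."""
    cost = cls_cost + lam * reg_cost
    cost = cost + 1e5 * (~center_mask)        # push far-away predictions out
    assign = torch.zeros_like(center_mask)
    num_pred = cost.size(1)
    for i in range(cost.size(0)):
        # dynamic-k from the summed IoU of the top candidates (assumed heuristic)
        n_cand = min(topk, num_pred)
        k = int(ious[i].topk(n_cand).values.sum().clamp(min=1))
        k = min(k, num_pred)
        idx = cost[i].topk(k, largest=False).indices  # k lowest-cost predictions
        assign[i, idx] = True
    return assign

# Example with random costs for 2 GTs and 50 predictions:
# a = simota_assign(torch.rand(2, 50), torch.rand(2, 50),
#                   torch.rand(2, 50), torch.rand(2, 50) > 0.5)
```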

  SimOTA not only reduces training time, but also avoids the extra solver hyperparameters of the Sinkhorn-Knopp algorithm. As shown in Table 2, SimOTA raises the detector from 45.0% AP to 47.3% AP, 3.0% AP higher than the SOTA YOLOv3, demonstrating the power of an advanced assignment strategy.

End-to-end YOLO

  Two additional conv layers, one-to-one label assignment, and stop-gradient are added. These enable the detector to run in an end-to-end manner, but slightly degrade both performance and inference speed, as shown in Table 2. Therefore, it is kept as an optional module and is not included in the final models.

2.2 Other backbones

  In addition to DarkNet53, YOLOX is also tested on other backbones with different model sizes, where YOLOX achieves consistent improvements over all the corresponding counterparts.

Modified CSPNet in YOLOv5

  For a fair comparison, the exact backbone of YOLOv5 is adopted, including the modified CSPNet, the SiLU activation, and the PAN head. YOLOX-S, YOLOX-M, YOLOX-L, and YOLOX-X are obtained by following its model scaling rules. Compared with YOLOv5 in Table 3, our models achieve consistent improvements of ~3.0% to ~1.0% AP with only a marginal increase in latency (from the decoupled head).
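The scaling rules referred to here are YOLOv5-style depth/width multipliers; below is a sketch of how such compound scaling is typically applied. The factor values are assumptions quoted from YOLOv5's conventions, not from this paper.

```python
import math

# Depth/width multipliers in the YOLOv5-style scaling rule (assumed values).
SCALES = {"s": (0.33, 0.50), "m": (0.67, 0.75), "l": (1.00, 1.00), "x": (1.33, 1.25)}

def scale(base_depth: int, base_width: int, variant: str) -> tuple:
    """Scale block repeats and channel counts for a given model variant."""
    d_mult, w_mult = SCALES[variant]
    depth = max(round(base_depth * d_mult), 1)      # number of repeated blocks
    width = math.ceil(base_width * w_mult / 8) * 8  # channels, rounded to 8
    return depth, width

# e.g. scale(9, 512, "s") -> (3, 256)
```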
[Table 3: comparison of YOLOX and YOLOv5 on COCO]

Tiny/Nano detectors

  The model is further scaled down to YOLOX-Tiny for comparison with YOLOv4-Tiny. For mobile devices, depthwise convolution is used to build YOLOX-Nano, which has only 0.91M parameters and 1.08G FLOPs. As shown in Table 4, YOLOX performs well even at these smaller model sizes.
[Table 4: comparison of YOLOX-Tiny/Nano with lightweight counterparts on COCO]

Model size and data augmentation

  In the experiments, all models keep almost the same learning schedule and optimization parameters. However, the suitable augmentation strategy varies with model size. As shown in Table 5, applying MixUp to YOLOX-L improves AP by 0.9%, whereas for small models like YOLOX-Nano it is better to weaken the augmentation. Specifically, when training the small models, i.e., YOLOX-S, YOLOX-Tiny, and YOLOX-Nano, MixUp is removed and the mosaic augmentation is weakened (the scale range is reduced from [0.1, 2.0] to [0.5, 1.5]). This modification raises YOLOX-Nano's AP from 24.0% to 25.3%.
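Summarized as a config, the size-dependent augmentation described above might look like this (the dictionary layout is illustrative; the values follow the text):

```python
# Augmentation settings by model size, per the description above.
AUG_BY_SIZE = {
    "large": {"mosaic_scale": (0.1, 2.0), "mixup": True},   # e.g. YOLOX-L
    "small": {"mosaic_scale": (0.5, 1.5), "mixup": False},  # S / Tiny / Nano
}
```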

  For large models, stronger augmentation was found to be more helpful. Inspired by Copypaste, this paper jitters both images by a randomly sampled scale factor before mixing them up. To understand the power of MixUp with scale jittering, it is compared with Copypaste on YOLOX-L. Note: Copypaste requires extra instance mask annotations, while MixUp does not. As shown in Table 5, the two methods achieve competitive performance, indicating that MixUp with scale jittering is a qualified replacement for Copypaste when instance mask annotations are unavailable.
[Table 5: effect of data augmentation across different model sizes]

3. Comparison with SOTA

  The comparison with SOTA is shown in Table 6. However, the inference speeds of the models in such tables are often uncontrolled, since speed varies with software and hardware. Therefore, all the YOLO-series models in Figure 1 were run on the same hardware and code base, yielding a controlled speed/accuracy curve.
[Table 6: comparison of YOLOX with SOTA detectors on COCO]

4 Conclusion

  This paper introduces some experienced updates to the YOLO series, forming a high-performance anchor-free detector called YOLOX. Equipped with some recent advanced detection techniques, i.e., the decoupled head, the anchor-free design, and the advanced label assignment strategy, YOLOX achieves a better trade-off between speed and accuracy than other counterparts across all model sizes. Notably, the accuracy of the YOLOv3 architecture on COCO is boosted to 47.3% AP, 3.0% higher than the current best practice.
