Detailed interpretation of YOLOX (1) Interpretation of the paper

0. Summary

In this report, we present some experienced improvements to the YOLO family, forming a new high-performance detector, YOLOX. We switch the YOLO detector to an anchor-free design and adopt other advanced detection techniques, such as a decoupled head and the advanced label assignment strategy SimOTA, achieving state-of-the-art results across a large range of models.

For YOLOX-Nano, with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on the COCO dataset, surpassing NanoDet by 1.8% AP.

For YOLOv3, one of the most widely used detectors in industry, we boost it to 47.3% AP on COCO, 3.0% AP higher than the current best practice.

For YOLOX-L, with roughly the same number of parameters as YOLOv4-CSP and YOLOv5-L, we achieve 50.0% AP on COCO at 68.9 FPS on a Tesla V100, exceeding YOLOv5-L by 1.8% AP.

Furthermore, we won 1st place in the Streaming Perception Challenge (CVPR Workshop on Autonomous Driving 2021) using a single YOLOX-L model.

We hope this report can provide useful experience for developers and researchers in real-world scenarios, and we also provide deployment versions supporting ONNX, TensorRT, NCNN, and OpenVINO. The source code is at https://github.com/Megvii-BaseDetection/YOLOX.

[Figure 1: speed/accuracy trade-off of YOLOX compared with other YOLO detectors on COCO]

1 Introduction

  • With the development of object detection, the YOLO series has always pursued the optimal trade-off between speed and accuracy for real-time applications. It absorbs the most advanced detection techniques available at the time (e.g., anchors for YOLOv2, residual networks for YOLOv3) and optimizes the implementation for best practice. Currently, YOLOv5 holds the best trade-off performance on COCO, with 48.2% AP at an inference time of 13.7 ms.

  • However, over the past two years, the major advances in object detection research have focused on anchor-free detectors, advanced label assignment strategies, and end-to-end (NMS-free) detectors. These have not yet been integrated into the YOLO series: YOLOv4 and YOLOv5 are still anchor-based detectors with hand-crafted assignment rules for training.

  • This is why we propose YOLOX: to port these latest, well-established improvements to the YOLO family. Considering that YOLOv4 and YOLOv5 may be somewhat over-optimized for the anchor-based pipeline, we choose YOLOv3 as our starting point (YOLOv3-SPP is set as the default YOLOv3). In fact, due to limited computing resources and insufficient software support in various practical applications, YOLOv3 is still one of the most widely used detectors in industry.

  • As shown in Figure 1, with the above techniques we boost YOLOv3 to 47.3% AP (YOLOX-DarkNet53) at 640 resolution on the COCO dataset, greatly exceeding the current best practice of YOLOv3 (44.3% AP, ultralytics version). In addition, when switching to the advanced YOLOv5 architecture (with a modified CSPNet backbone and an additional PAN head), YOLOX-L achieves 50.0% AP at 640 resolution on COCO, 1.8% AP higher than the corresponding YOLOv5-L. We also test our design strategy on small models: YOLOX-Tiny and YOLOX-Nano (only 0.91M parameters and 1.08G FLOPs) are 10% AP and 1.8% AP higher than the corresponding YOLOv4-Tiny and NanoDet, respectively.

  • We have released our code at https://github.com/Megvii-BaseDetection/YOLOX, with support for ONNX, TensorRT, NCNN, and OpenVINO. It is worth mentioning that we won 1st place in the Streaming Perception Challenge (CVPR Workshop on Autonomous Driving 2021) using the YOLOX-L model alone.

2 YOLOX

2.1 YOLOX-DarkNet53

We choose YOLOv3 with DarkNet53 as our baseline. In the following sections, we walk through the whole system design of YOLOX step by step.

Implementation details:

The training parameter settings are maintained from the baseline model to the final model.

  • Dataset: COCO train2017

  • Epochs: 300

  • Optimizer: SGD

  • Learning rate: lr*BatchSize/64, lr = 0.01

  • Weight decay coefficient: 0.0005

  • SGD Momentum Coefficient: 0.9

  • Number of GPUs: 8× Tesla V100

  • Batch size: 128

    You can also train with a single GPU and other batch sizes. The input image size ranges from 448 to 832 with a step of 32 (multi-scale training). FPS and latency in this report are measured with FP16 precision and batch=1 on a single Tesla V100. (A small sketch of the learning-rate rule and size sampling follows.)
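To make the learning-rate rule and multi-scale setting concrete, here is a minimal Python sketch (illustrative only, not the official training script; function names such as `scaled_lr` are made up for this example):

```python
import random

def scaled_lr(batch_size, base_lr=0.01, base_batch=64):
    """Linear scaling rule from the report: lr = base_lr * batch_size / 64."""
    return base_lr * batch_size / base_batch

def sample_input_size(low=448, high=832, step=32):
    """Pick a training resolution from 448 to 832 in steps of 32 (multi-scale training)."""
    return random.choice(range(low, high + 1, step))

print(scaled_lr(128))       # 0.02 for the 8-GPU, batch-size-128 setting
print(sample_input_size())  # e.g. 640
```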

YOLOv3 baseline:

We adopt the DarkNet53 backbone and an SPP layer as the base network, i.e., YOLOv3-SPP. Compared with the original implementation, we slightly change some training strategies, adding EMA weight updating, a cosine learning-rate schedule, IoU loss, and an IoU-aware branch. We use BCE loss to train the cls and obj branches and IoU loss to train the reg branch. These general training tricks are orthogonal to the key improvements of YOLOX, so we put them on the baseline. Furthermore, we only use RandomHorizontalFlip, ColorJitter, and multi-scale data augmentation, and discard RandomResizedCrop because we found it overlaps with the planned mosaic augmentation. As shown in Table 2, with these settings our base network achieves 38.5% AP on the COCO validation set.
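As a rough illustration of the reg-branch loss, below is a minimal PyTorch sketch of a plain IoU loss (1 − IoU) for corner-format boxes; the official repository may use a different IoU variant, so treat this as an assumption-laden example, not the exact implementation:

```python
import torch

def iou_loss(pred, target, eps=1e-7):
    """Plain IoU loss (1 - IoU) for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    lt = torch.max(pred[:, :2], target[:, :2])    # intersection top-left
    rb = torch.min(pred[:, 2:], target[:, 2:])    # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    return (1.0 - iou).mean()
```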

Decoupled head in object detection:

The conflict between classification and regression tasks is a well-known problem. Decoupled heads for classification and localization are therefore widely used in most one-stage and two-stage detectors. However, as shown in Figure 2, although the backbones and feature pyramids (e.g., FPN, PAN) of the YOLO series keep evolving, their detection heads remain coupled.

Our two analytical experiments suggest that the coupled detection head may harm performance:

  • As shown in Figure 3, replacing the head of YOLO with a decoupled head can greatly improve the convergence speed.

  • The decoupled head is essential for the end-to-end version of YOLO.

  • As shown in Table 1, the end-to-end property decreases AP by 4.2% with the coupled head, but only by 0.8% with the decoupled head.

  • Therefore, as shown in Figure 2, we replace the YOLO detection head with a lightweight decoupled head. Specifically, it consists of a 1x1 convolutional layer to reduce the channel dimension, followed by two parallel branches, each with two 3x3 convolutional layers (a sketch follows this list).

  • We report the batch=1 inference time on V100 in Table 2; the lightweight decoupled head brings an extra 1.1 ms (11.6 ms vs. 10.5 ms).
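Below is a minimal PyTorch sketch of one FPN level of such a decoupled head. The hidden channel width (256), the activation choice, and the placement of the objectness branch on the regression path are illustrative assumptions, not the exact repository code:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """One FPN level: 1x1 conv to reduce channels, then parallel cls / reg branches."""
    def __init__(self, in_ch, num_classes, hidden_ch=256):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, hidden_ch, 1), nn.SiLU())
        # classification branch: two 3x3 convs -> class scores
        self.cls_convs = nn.Sequential(
            nn.Conv2d(hidden_ch, hidden_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden_ch, hidden_ch, 3, padding=1), nn.SiLU())
        self.cls_pred = nn.Conv2d(hidden_ch, num_classes, 1)
        # regression branch: two 3x3 convs -> box (4 values) and objectness (1 value)
        self.reg_convs = nn.Sequential(
            nn.Conv2d(hidden_ch, hidden_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden_ch, hidden_ch, 3, padding=1), nn.SiLU())
        self.reg_pred = nn.Conv2d(hidden_ch, 4, 1)
        self.obj_pred = nn.Conv2d(hidden_ch, 1, 1)

    def forward(self, x):
        x = self.stem(x)
        cls_feat = self.cls_convs(x)
        reg_feat = self.reg_convs(x)
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.obj_pred(reg_feat)
```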

Strong Data Augmentation:

  • We add Mosaic and MixUp to our augmentation strategy to improve the performance of YOLOX. Mosaic is an effective augmentation strategy proposed by ultralytics-YOLOv3 that has since been widely used in YOLOv4, YOLOv5, and other detectors. MixUp was originally designed for image classification and later modified in BoF (Bag of Freebies) for object detection training.

  • We adopt MixUp and Mosaic in our model and turn them off for the last 15 epochs, achieving 42.0% AP as shown in Table 2. After using strong data augmentation, we find that ImageNet pre-training no longer brings benefits, so all our models are trained from scratch. (A sketch of this schedule follows.)
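A hedged sketch of how the "turn off strong augmentation for the last 15 epochs" schedule could be wired into a training loop; the `enable_mosaic_mixup` flag and loop structure are hypothetical, the official trainer handles this differently:

```python
def train(model, optimizer, dataloader, total_epochs=300, no_aug_epochs=15):
    """Hypothetical loop: strong augmentation (Mosaic/MixUp) is disabled near the end."""
    for epoch in range(total_epochs):
        # the last `no_aug_epochs` epochs see un-augmented images
        dataloader.dataset.enable_mosaic_mixup = epoch < total_epochs - no_aug_epochs
        for images, targets in dataloader:
            loss = model(images, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```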

Anchor-free:

  • Both YOLOv4 and YOLOv5 follow the original anchor-based pipeline of YOLOv3. However, the anchor mechanism has many known issues.
    • First, to achieve optimal detection performance, one needs to run clustering analysis before training to determine a set of optimal anchors; those clustered anchors are domain-specific and less general. Second, the anchor mechanism increases the complexity of the detection head and the number of predictions per image. On some edge AI systems, moving such a large number of predictions between devices (e.g., from NPU to CPU) can become a potential bottleneck for overall latency.
  • Anchor-free detectors have developed rapidly in the past two years, and these works show that their performance can be on par with that of anchor-based detectors. The anchor-free mechanism significantly reduces the number of design parameters that require manual tuning and the many tricks involved (e.g., anchor clustering, Grid Sensitive), and it greatly simplifies the detector, especially in the training and decoding stages.
  • Grid Sensitive is an optimization introduced in YOLOv4: when computing the center coordinates of the predicted box within a grid cell, a scale and offset are applied to the sigmoid-activated logit so that the predicted center can still fit ground-truth boxes whose centers fall exactly on the grid boundary.
  • Switching YOLO to an anchor-free manner is quite simple. We reduce the predictions per location from 3 to 1 and make them directly predict four values, i.e., two offsets with respect to the top-left corner of the grid cell, and the height and width of the predicted box. We assign the center location of each object as the positive sample and pre-define a scale range, as done in FCOS, to designate an FPN level for each object. Such a modification reduces the parameters and GFLOPs of the detector and makes it faster, yet it achieves better performance: 42.9% AP, as shown in Table 2. (A decoding sketch follows this list.)
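A minimal sketch of this anchor-free decoding for one FPN level. The log-space size parameterization (`exp` on width/height) is an assumption borrowed from common practice; the repository's exact decoding may differ:

```python
import torch

def decode_level(reg_out, stride):
    """
    reg_out: (H, W, 4) raw regression output for one FPN level, one prediction per cell.
    Returns (H, W, 4) boxes as (cx, cy, w, h) in image coordinates.
    """
    h, w, _ = reg_out.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()      # (H, W, 2) cell top-left corners
    xy = (grid + reg_out[..., :2]) * stride           # two offsets from the corner
    wh = torch.exp(reg_out[..., 2:]) * stride         # width/height predicted in log space
    return torch.cat((xy, wh), dim=-1)
```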

Multi positives:

  • To be consistent with YOLOv3's assignment rule, the anchor-free version above selects only one positive sample (the center location) for each object, ignoring other high-quality predictions.

  • Optimizing these high-quality predictions may also bring beneficial gradients, which can alleviate the extreme imbalance of positive/negative samples during training. We simply assign the central 3x3 region as positives, also known as "center sampling" in FCOS (sketched below). As shown in Table 2, the AP of the detector improves to 45.0%, already surpassing the current best practice of ultralytics-YOLOv3 (44.3% AP).
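A small sketch of how such a 3x3 center-sampling mask could be built on one feature level (shapes and names are illustrative, not the repository's implementation):

```python
import torch

def center_sampling_mask(gt_centers, stride, h, w):
    """
    Mark the 3x3 grid-cell neighborhood around each object center as positive.
    gt_centers: (N, 2) object centers (x, y) in image coordinates.
    Returns an (N, H, W) boolean mask over one feature level.
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cx = (gt_centers[:, 0] / stride).long()        # cell index of each object center
    cy = (gt_centers[:, 1] / stride).long()
    dx = (xs[None] - cx[:, None, None]).abs()
    dy = (ys[None] - cy[:, None, None]).abs()
    return (dx <= 1) & (dy <= 1)                   # within one cell of the center cell
```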

SimOTA:

Advanced label assignment is another important advance in object detection in recent years. Based on our own work OTA, we summarize four key insights for an advanced label assignment:

  • Loss/quality aware

  • Center prior

  • Dynamic number of positive anchors for each ground truth (abbreviated as dynamic top-k)

  • Global view

OTA satisfies all the above four conditions, so we choose it as a candidate label assignment strategy.

Specifically, OTA analyzes label assignment from a global perspective and formulates the assignment procedure as an optimal transport (OT) problem, achieving SOTA performance among current assignment strategies. However, in practice we found that solving the OT problem with the Sinkhorn-Knopp algorithm brings 25% extra training time, which is quite expensive for a 300-epoch training schedule. We therefore simplify it to a dynamic top-k strategy, named SimOTA, to obtain an approximate solution.

We briefly introduce SimOTA here. SimOTA first computes a pairwise matching degree, represented by the cost (or quality) between each ground truth and each prediction. For example, in SimOTA the cost between ground truth gi and prediction pj is computed as:
c_ij = L_ij^cls + λ · L_ij^reg


where λ is a balancing coefficient, and L_ij^cls and L_ij^reg are the classification loss and regression loss between ground truth gi and prediction pj.

For each ground truth gi, we select the top k predictions with the lowest cost as its positive samples. Finally, the grids corresponding to these positive predictions are assigned as positives, while the rest are negatives. Note that the value of k varies for different ground truths; for more details, refer to the dynamic k estimation strategy in OTA.

SimOTA not only reduces training time but also avoids the additional solver hyperparameters of the Sinkhorn-Knopp algorithm. As shown in Table 2, SimOTA raises AP from 45.0% to 47.3%, 3.0% AP higher than the SOTA ultralytics-YOLOv3, showing the effectiveness of an advanced assignment strategy.
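A simplified sketch of the SimOTA selection step under the cost above. The dynamic k here is estimated from the sum of the top-10 IoUs, following the dynamic k estimation idea in OTA; conflict resolution when one prediction is matched to several ground truths is omitted, and all tensor names are illustrative:

```python
import torch

def simota_assign(cost, ious, n_candidate_k=10):
    """
    cost: (num_gt, num_pred) pairwise cost c_ij = L_cls + lambda * L_reg.
    ious: (num_gt, num_pred) pairwise IoU between ground truths and predictions.
    Returns a boolean (num_gt, num_pred) matching matrix.
    """
    num_gt, num_pred = cost.shape
    matching = torch.zeros_like(cost, dtype=torch.bool)
    # dynamic k: roughly "how many good candidates does this ground truth have?"
    topk_ious, _ = torch.topk(ious, min(n_candidate_k, num_pred), dim=1)
    dynamic_ks = torch.clamp(topk_ious.sum(1).int(), min=1)
    for gt_idx in range(num_gt):
        k = int(dynamic_ks[gt_idx])
        _, pred_idx = torch.topk(cost[gt_idx], k, largest=False)  # lowest-cost predictions
        matching[gt_idx, pred_idx] = True
    return matching
```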

End-to-end YOLO

Following the idea of the PSS head, we add two additional convolutional layers, one-to-one label assignment, and a stop-gradient operation. This enables the detector to run in an end-to-end manner, but slightly decreases both performance and inference speed. The NMS-free results are shown in the table below. We therefore leave it as an optional module and do not include it in the final model.
(The new PSS (Positive Samples Selector) module replaces the traditional NMS step, making object detection training truly end-to-end.)

[Table: effect of the end-to-end (NMS-free) variant]

2.2 Other backbone networks

In addition to DarkNet53, we also test YOLOX with backbones of other sizes, where YOLOX achieves consistent improvements against the corresponding counterparts.

Modified CSPNet in YOLOv5

For a fair comparison, we adopt the exact YOLOv5 backbone, including the modified CSPNet, the SiLU activation function, and the PAN head. We also follow its scaling rules to produce the YOLOX-S, YOLOX-M, YOLOX-L, and YOLOX-X models. Compared with YOLOv5, as shown in the table below, our models improve consistently by 1.0% to 3.0% AP, with only a marginal increase in inference time (the increase comes from the decoupled head).
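For reference, a sketch of the depth/width multipliers typically used to derive the S/M/L/X variants under the YOLOv5-style scaling rule. Treat the exact numbers as an assumption here; the authoritative values live in the experiment configs of the repository:

```python
# (depth_multiplier, width_multiplier) applied to the number of CSP blocks and channel widths
SCALING_FACTORS = {
    "yolox_s": (0.33, 0.50),
    "yolox_m": (0.67, 0.75),
    "yolox_l": (1.00, 1.00),
    "yolox_x": (1.33, 1.25),
}
```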

[Table: YOLOX-S/M/L/X vs. the corresponding YOLOv5 models on COCO]

Tiny and Nano detectors

We further shrink our model to YOLOX-Tiny to compare with YOLOv4-Tiny. For mobile devices, we adopt depth-wise convolutions to construct YOLOX-Nano, with only 0.91M parameters and 1.08G FLOPs. As shown in the table below, YOLOX performs well even at these smaller model sizes.
[Table: YOLOX-Tiny and YOLOX-Nano vs. YOLOv4-Tiny and NanoDet]

Model Size and Data Augmentation

In our experiments, all models keep almost the same learning schedule and optimization parameters as described in 2.1. However, we find that the suitable augmentation strategy differs across model sizes. As shown in Table 5, applying MixUp on YOLOX-L improves AP by 0.9%, whereas for small models like YOLOX-Nano the AP decreases instead.

Specifically, when training small models, namely YOLOX-S, YOLOX-Tiny, and YOLOX-Nano, we remove the MixUp augmentation and weaken the Mosaic (reducing the scale range from [0.1, 2.0] to [0.5, 1.5]). This modification improves the AP of YOLOX-Nano from 24.0% to 25.3%.

For large models, we also find stronger augmentation helpful. Indeed, our MixUp implementation is partly heavier than the original version: inspired by Copypaste, we jitter both images by a randomly sampled scale factor before blending them. To understand the power of MixUp with scale jittering, we compare it with Copypaste on YOLOX-L. Note that Copypaste requires extra instance mask annotations while MixUp does not. As shown in Table 5, the two methods achieve nearly the same performance, indicating that MixUp with scale jittering is a qualified replacement for Copypaste when no instance mask annotations are available.
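A hedged sketch of MixUp with scale jittering as described above. For simplicity only the second image is rescaled here (the text jitters both), the jitter range and blending ratio are assumptions, and box handling is omitted:

```python
import random
import numpy as np
import cv2

def mixup_with_scale_jitter(img_a, img_b, jitter_range=(0.5, 1.5), alpha=0.5):
    """Blend two images after randomly rescaling the second one (box handling omitted)."""
    scale = random.uniform(*jitter_range)
    h, w = img_b.shape[:2]
    img_b = cv2.resize(img_b, (int(w * scale), int(h * scale)))
    # paste (or crop) the jittered image onto a canvas matching img_a's size
    canvas = np.zeros_like(img_a)
    ch = min(canvas.shape[0], img_b.shape[0])
    cw = min(canvas.shape[1], img_b.shape[1])
    canvas[:ch, :cw] = img_b[:ch, :cw]
    return (alpha * img_a + (1.0 - alpha) * canvas).astype(img_a.dtype)
```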

[Table 5: effect of data augmentation across model sizes]

3 Comparison with SOTA

The following table shows the comparison with SOTA detectors. However, keep in mind that the inference speeds of the models in such tables are often uncontrolled, since speed varies with software and hardware. Therefore, for Figure 1 we draw a somewhat controlled speed/accuracy curve using the same hardware and codebase for all the YOLO series.

We note that some high-performance YOLO variants have larger model sizes, such as Scaled-YOLOv4 and YOLOv5-P6, and current Transformer-based detectors push the SOTA accuracy to around 60 AP. Due to time and resource constraints, we did not explore these important directions in this report, but they are already under consideration.

[Table: comparison with SOTA detectors on COCO]

4 1st Place in the Streaming Perception Challenge

The Streaming Perception Challenge at WAD 2021 jointly evaluates accuracy and latency through a recently proposed metric: streaming accuracy.

The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant, forcing the stack to account for the amount of streaming data that is missed while computation is in progress. We find that the sweet spot on a 30 FPS data stream is a powerful model whose inference time is ≤ 33 ms. We therefore deploy the YOLOX-L model with TensorRT to produce our final model for the challenge and win 1st place.

See the Challenges page for details.

5 Conclusion

In this report, we present some experienced improvements to the YOLO series, forming a high-performance anchor-free detector called YOLOX. Equipped with several recent advanced detection techniques, namely a decoupled head, an anchor-free design, and an advanced label assignment strategy, YOLOX achieves a better trade-off between speed and accuracy than its counterparts across all model sizes. Remarkably, we boost the architecture of YOLOv3, which is still one of the most widely used detectors in industry due to its broad compatibility, to 47.3% AP on COCO, surpassing the current best practice by 3.0% AP. We hope this report can help developers and researchers gain better experience in practical scenarios.
