Interpretation of YOLOX paper
0. Summary
In this paper, we introduce some empirical improvements to the YOLO series to form a new high-performance detector, YOLOX. We switch the YOLO detector to an anchor-free manner and adopt other advanced detection techniques, such as a decoupled head and the advanced label assignment strategy SimOTA, achieving state-of-the-art results across a wide range of model sizes.
For YOLO-Nano, with only 0.91M parameters and 1.08G FLOPs, we get 25.3% AP on the COCO dataset, surpassing NanoDet by 1.8% AP;
For YOLOv3, one of the most widely used detectors in industry, we improve the AP to 47.3% on COCO, exceeding the current best practice by 3.0% AP;
For YOLOX-L, which has roughly the same number of parameters as YOLOv4-CSP and YOLOv5-L, we achieve 50.0% AP on COCO at 68.9 FPS on a Tesla V100, exceeding YOLOv5-L by 1.8% AP.
Furthermore, we won first place in the Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model.
We hope that this report can provide useful experience for developers and researchers in real-world scenarios, and we also provide deployment versions that support ONNX, TensorRT, NCNN, and OpenVINO. The source code is at https://github.com/Megvii-BaseDetection/YOLOX.
1 Introduction
With the development of object detection, the YOLO series has always pursued the best trade-off between speed and accuracy for real-time applications. Each version takes the most advanced detection techniques available at the time (e.g., anchors for YOLOv2, residual networks for YOLOv3) and optimizes the implementation for best practice. Currently, YOLOv5 holds the best trade-off performance, with 48.2% AP on COCO at an inference time of 13.7 ms.
However, over the past two years, the major advances in object detection research have focused on anchor-free detectors, advanced label assignment strategies, and end-to-end (NMS-free) detectors. These have not yet been integrated into the YOLO series: YOLOv4 and YOLOv5 are still anchor-based detectors that rely on hand-crafted assignment rules for training.
This is why we propose YOLOX: to bring these recent advances to the YOLO series with careful optimization. Considering that YOLOv4 and YOLOv5 may be somewhat over-optimized for the anchor-based pipeline, we choose YOLOv3 as our starting point (we take YOLOv3-SPP as the default YOLOv3). Indeed, YOLOv3 is still one of the most widely used detectors in industry, owing to limited computing resources and insufficient software support in various practical applications.
As shown in Figure 1, with the above techniques we boost YOLOv3 to 47.3% AP (YOLOX-DarkNet53) on COCO at 640 × 640 resolution, greatly exceeding the current best practice of YOLOv3 (44.3% AP, ultralytics version). In addition, when switching to the advanced YOLOv5 architecture (with an advanced CSPNet backbone and an additional PAN head), YOLOX-L achieves 50.0% AP on COCO at 640 × 640 resolution, outperforming the corresponding YOLOv5-L by 1.8% AP. We also test our design strategies on small models: YOLOX-Tiny and YOLOX-Nano (only 0.91M parameters and 1.08G FLOPs) outperform the corresponding YOLOv4-Tiny and NanoDet by 10% AP and 1.8% AP, respectively.
We have released our code at https://github.com/Megvii-BaseDetection/YOLOX, with ONNX, TensorRT, NCNN, and OpenVINO support. It is worth mentioning that we won first place in the Streaming Perception Challenge (Workshop on Autonomous Driving at CVPR 2021) using a single YOLOX-L model.
2 YOLOX
2.1 YOLOX-DarkNet53
We choose YOLOv3 with DarkNet53 as our baseline. In the following sections, we walk through the whole system design of YOLOX step by step.
Implementation details:
The training settings stay mostly consistent from the baseline model to the final model.
- Dataset: COCO train2017
- Epochs: 300
- Optimizer: SGD
- Learning rate: lr × BatchSize / 64, with initial lr = 0.01
- Weight decay: 0.0005
- SGD momentum: 0.9
- Number of GPUs: 8 Tesla V100
- Batch size: 128
Other batch sizes, including single-GPU training, also work. The input size is drawn from 448 to 832 in steps of 32. FPS and latency in this report are all measured with FP16 precision and batch=1 on a single Tesla V100.
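The learning-rate rule and the multi-scale input range above can be made concrete with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and only the numbers (0.01, 64, 448, 832, 32) come from the settings listed above.

```python
BASE_LR_PER_64 = 0.01  # "lr" in the rule lr * BatchSize / 64


def scaled_lr(batch_size, base_lr=BASE_LR_PER_64):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * batch_size / 64


def multiscale_sizes(low=448, high=832, step=32):
    """All candidate input sizes from low to high (inclusive) in steps of `step`."""
    return list(range(low, high + 1, step))


print(scaled_lr(128))       # learning rate for the default batch size of 128
print(scaled_lr(16))        # a smaller batch, e.g. for a single-GPU run
print(multiscale_sizes())   # candidate training resolutions
```

With the default batch size of 128 the rule gives a learning rate of about 0.02, and the range 448 to 832 yields 13 candidate resolutions.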
YOLOv3 baseline:
Our baseline adopts the DarkNet53 backbone with an SPP layer, i.e., YOLOv3-SPP. Compared to the original implementation, we slightly change some training strategies, adding EMA weight updating, a cosine learning-rate schedule, IoU loss, and an IoU-aware branch. We use BCE loss to train the cls and obj branches, and IoU loss to train the reg branch. These general training tricks are orthogonal to the key improvements of YOLOX, so we put them on the baseline. Moreover, we only use RandomHorizontalFlip, ColorJitter, and multi-scale training for data augmentation and discard the RandomResizedCrop strategy, because we found that RandomResizedCrop overlaps with the planned Mosaic augmentation. As shown in Table 2, with these enhancements our baseline achieves 38.5% AP on the COCO validation set.
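To make the baseline's training ingredients more tangible, here is a minimal pure-Python sketch of an IoU loss, a binary cross-entropy loss, and an EMA weight update. The plain `1 - IoU` form, the `(x1, y1, x2, y2)` box layout, and the decay value are assumptions for illustration; the text above does not specify them.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred_box, gt_box):
    """Regression loss: zero for a perfect match, one for no overlap."""
    return 1.0 - box_iou(pred_box, gt_box)


def bce_loss(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def ema_update(ema_weights, weights, decay=0.9998):
    """Exponential moving average of the weights (decay is a hypothetical value)."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


print(iou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
print(bce_loss(0.9, 1))
print(ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9))
```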
Decoupled head:
In object detection, the conflict between classification and regression tasks is a well-known problem. Therefore, decoupled heads for classification and localization are widely used in most one-stage and two-stage detectors. However, as shown in Figure 2, although the backbones and feature pyramids (such as FPN and PAN) of the YOLO series keep evolving, their detection heads remain coupled. Our two analytical experiments suggest that the coupled detection head may harm performance.
- As shown in Figure 3, replacing YOLO's head with a decoupled one greatly improves the convergence speed.
- The decoupled head is essential to the end-to-end version of YOLO. As Table 1 shows, the end-to-end property costs 4.2% AP with the coupled head, but only 0.8% AP with the decoupled head.
Therefore, as shown in Figure 2, we replace the YOLO detection head with a lightweight decoupled head. Specifically, it consists of a 1×1 convolutional layer that reduces the channel dimension, followed by two parallel branches with two 3×3 convolutional layers each. We report the batch=1 inference time on V100 in Table 2: the lightweight decoupled head adds 1.1 ms (11.6 ms vs. 10.5 ms).
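The layer layout of the decoupled head can be sketched as a simple table of convolutions. This is an illustrative sketch only: the hidden width of 256, the 80 classes, the three 1×1 output convolutions, and the function names are assumptions, not details given in the text above.

```python
def decoupled_head_layers(in_channels, num_classes=80, hidden=256):
    """Return (name, kernel_size, in_ch, out_ch) for each conv of one head."""
    layers = [("stem", 1, in_channels, hidden)]        # 1x1 conv reduces channels
    for branch in ("cls", "reg"):                      # two parallel branches
        for i in range(2):                             # two 3x3 convs in each
            layers.append((f"{branch}_conv{i + 1}", 3, hidden, hidden))
    # Assumed 1x1 output convs: class scores, box regression, objectness.
    layers.append(("cls_out", 1, hidden, num_classes))
    layers.append(("reg_out", 1, hidden, 4))
    layers.append(("obj_out", 1, hidden, 1))
    return layers


def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of a plain convolution (normalization layers ignored)."""
    return kernel * kernel * in_ch * out_ch + out_ch


for name, k, cin, cout in decoupled_head_layers(in_channels=512):
    print(f"{name:10s} {k}x{k}  {cin:4d} -> {cout:4d}  params={conv_params(k, cin, cout)}")
```

Listing the layers this way shows where the extra cost comes from: the four 3×3 convolutions dominate the parameter count, which is consistent with the small latency increase reported above.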
Strong Data Augmentation:
We add Mosaic and MixUp to our augmentation strategy to boost the performance of YOLOX. Mosaic is an efficient augmentation strategy proposed by ultralytics-YOLOv3 and has since been widely used in YOLOv4, YOLOv5, and other detectors. MixUp was originally designed for image classification and was later modified in BoF for object detection training.
We adopt MixUp and Mosaic in our model and turn them off for the last 15 epochs, achieving 42.0% AP, as shown in Table 2. After using strong data augmentation, we found that ImageNet pre-training is no longer beneficial, so we train all the following models from scratch.
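A minimal sketch of the two ideas above: switching strong augmentation off for the final epochs, and blending two images MixUp-style. The zero-based epoch counting, the fixed blend ratio of 0.5, and the nested-list image representation are assumptions made to keep the example self-contained; in detection training the boxes of both images would also be kept.

```python
def use_strong_aug(epoch, total_epochs=300, no_aug_epochs=15):
    """True while Mosaic/MixUp should stay on; epochs are counted from 0."""
    return epoch < total_epochs - no_aug_epochs


def mixup(img_a, img_b, ratio=0.5):
    """Blend two equally sized images given as nested lists of pixel values."""
    return [
        [ratio * pa + (1.0 - ratio) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


print(use_strong_aug(284), use_strong_aug(285))
print(mixup([[0, 100]], [[200, 100]]))
```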
Anchor-free:
Both YOLOv4 and YOLOv5 follow the original anchor-based pipeline of YOLOv3. However, the anchor mechanism has many known problems. First, to achieve optimal detection performance, a clustering analysis must be run before training to determine a set of optimal anchors; these clustered anchors are domain-specific and do not generalize. Second, the anchor mechanism increases the complexity of the detection heads, as well as the number of predictions per image. On some edge AI systems, moving such a large number of predictions between devices (e.g., from NPU to CPU) can become a potential bottleneck for overall latency.
Anchor-free detectors have developed rapidly in the past two years, and these works show that their performance can be on par with that of anchor-based detectors. The anchor-free mechanism significantly reduces the number of design parameters that need manual tuning and the many tricks involved (e.g., anchor clustering, Grid Sensitive) to reach good performance, which makes the detector considerably simpler, especially in its training and decoding stages.
Grid Sensitive is an optimization introduced by YOLOv4: when computing the center coordinates of the predicted box within a grid cell, a scale and an offset are applied to the sigmoid-activated output logit, so that the predicted center can still fit ground-truth boxes whose centers fall exactly on a grid cell boundary.
Switching YOLO to an anchor-free manner is quite simple. We reduce the predictions per location from 3 to 1 and make them directly predict four values, i.e., two offsets from the top-left corner of the grid cell, and the height and width of the predicted box. We assign the center location of each object as the positive sample and pre-define a scale range, as done in prior work, to designate the FPN level for each object. This modification reduces the parameters and GFLOPs of the detector and makes it faster, yet achieves better performance: 42.9% AP, as shown in Table 2.
Multi positives:
To be consistent with the assignment rule of YOLOv3, the anchor-free version above selects only one positive sample (the center location) for each object, ignoring other high-quality predictions.
However, optimizing these high-quality predictions may also bring beneficial gradients, which may alleviate the extreme imbalance between positive and negative samples during training. We simply assign the central 3×3 area as positives, which is also known as "center sampling" in FCOS. As shown in Table 2, the AP of the detector improves to 45.0%, already surpassing the current best practice of ultralytics-YOLOv3 (44.3% AP).
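As a rough illustration of the central 3×3 rule, the sketch below lists the grid cells that would be marked positive for one object on one feature level. The function name, the pixel-coordinate convention, and the clipping at the feature-map border are assumptions, not details from the text above.

```python
def center_positive_cells(cx, cy, stride, grid_w, grid_h, radius=1):
    """Grid cells in the (2 * radius + 1)^2 window around the object's center cell."""
    gx, gy = int(cx // stride), int(cy // stride)
    cells = []
    for y in range(gy - radius, gy + radius + 1):
        for x in range(gx - radius, gx + radius + 1):
            if 0 <= x < grid_w and 0 <= y < grid_h:  # drop cells outside the map
                cells.append((x, y))
    return cells


# Object centered at pixel (100, 60) on a stride-8 level of an 80 x 80 grid.
print(center_positive_cells(100, 60, stride=8, grid_w=80, grid_h=80))
# radius=0 reproduces the single-positive rule of the previous step.
print(center_positive_cells(100, 60, stride=8, grid_w=80, grid_h=80, radius=0))
```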
SimOTA:
Advanced label assignment is another important advance in object detection in recent years. Based on our own research OTA, we summarize four key points for advanced label assignment:
- loss/quality awareness
- center prior
- dynamic number of positive anchors for each ground truth (abbreviated as dynamic top-k)
- global view
OTA satisfies all the above four conditions, so we choose it as a candidate label assignment strategy.
Specifically, OTA analyzes label assignment from a global perspective and formulates the assignment procedure as an Optimal Transport (OT) problem, achieving SOTA performance among current assignment strategies. However, in practice we found that solving the OT problem with the Sinkhorn-Knopp algorithm brings 25% extra training time, which is quite expensive when training for 300 epochs. Therefore, we simplify it to a dynamic top-k strategy, named SimOTA, to obtain an approximate solution.
We briefly introduce SimOTA here. SimOTA first computes the pair-wise matching degree, represented by a cost or quality, for each prediction-GT pair. For example, in SimOTA, the cost between GT $g_i$ and prediction $p_j$ is calculated as:
$$c_{ij} = L_{ij}^{cls} + \lambda L_{ij}^{reg}$$
where $\lambda$ is a balancing coefficient, and $L_{ij}^{cls}$ and $L_{ij}^{reg}$ are the classification loss and the regression loss between GT $g_i$ and prediction $p_j$.
Then, for GT $g_i$, we select the top k predictions with the lowest cost within a fixed center region as its positive samples. Finally, the grids corresponding to these positive predictions are assigned as positives, while the remaining grids are negatives. Note that the value of k varies across ground truths. For more details, please refer to the Dynamic k Estimation strategy in OTA.
SimOTA not only reduces training time, but also avoids the additional solver hyperparameters of the Sinkhorn-Knopp algorithm. As shown in Table 2, SimOTA raises the detector from 45.0% AP to 47.3% AP, which is 3.0% AP higher than the SOTA ultralytics-YOLOv3, showing the power of the advanced assignment strategy.
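The dynamic top-k idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the dynamic k is estimated as the truncated sum of the largest IoUs (following the general idea of OTA's dynamic k estimation), `top_candidates=10` and `lam=3.0` are hypothetical values, the center-region filter is omitted, and a prediction claimed by several ground truths is given to the one with the lowest cost.

```python
def pair_cost(cls_loss, reg_loss, lam=3.0):
    """c_ij = L_cls + lambda * L_reg for one GT/prediction pair."""
    return cls_loss + lam * reg_loss


def simota_assign(cost, iou, top_candidates=10):
    """Dynamic top-k assignment.

    cost[i][j] and iou[i][j] relate ground truth i to prediction j.
    Returns {prediction index: ground-truth index} for the positive predictions.
    """
    num_preds = len(cost[0])
    assigned = {}
    for i, (cost_row, iou_row) in enumerate(zip(cost, iou)):
        # Dynamic k: truncated sum of the largest IoUs, but at least 1.
        top_ious = sorted(iou_row, reverse=True)[:top_candidates]
        k = max(1, int(sum(top_ious)))
        # Take the k predictions with the lowest cost for this ground truth.
        ranked = sorted(range(num_preds), key=lambda j: cost_row[j])[:k]
        for j in ranked:
            # A prediction claimed by several GTs goes to the cheapest one.
            if j not in assigned or cost_row[j] < cost[assigned[j]][j]:
                assigned[j] = i
    return assigned


iou = [[0.9, 0.8, 0.5, 0.0],
       [0.0, 0.2, 0.7, 0.6]]
# Toy costs: a constant classification loss and (1 - IoU) as the regression loss.
cost = [[pair_cost(0.5, 1.0 - v) for v in row] for row in iou]
print(simota_assign(cost, iou))
```

In this toy case the first ground truth, which overlaps well with several predictions, receives two positives, while the second receives one, which is the behavior the "different k per GT" remark describes.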
End-to-end YOLO
Following the idea of the PSS head, we add two additional convolutional layers, one-to-one label assignment, and a stop-gradient operation. These enable the detector to run in an end-to-end manner, but slightly decrease both performance and inference speed, as shown in the table below. We therefore leave it as an optional module that is not included in our final models.
(PSS, the Positive Sample Selector, is a module that replaces the conventional NMS call, so that detection training becomes truly end-to-end.)
2.2 Other backbone networks
In addition to DarkNet53, we also test YOLOX on backbones of different sizes, where YOLOX achieves consistent improvements over all the corresponding counterparts.
Modified CSPNet in YOLOv5
For a fair comparison, we adopt the exact YOLOv5 backbone, including the modified CSPNet, SiLU activation, and the PAN head. We also follow its scaling rule to produce the YOLOX-S, YOLOX-M, YOLOX-L, and YOLOX-X models. Compared with YOLOv5, as shown in the table below, our models achieve consistent improvements of 1.0% to 3.0% AP, with only a marginal increase in inference time (which comes from the decoupled head).
Tiny and Nano detectors
We further shrink our model into YOLOX-Tiny to compare with YOLOv4-Tiny. For mobile devices, we adopt depth-wise convolution to build YOLOX-Nano, which has only 0.91M parameters and 1.08G FLOPs. As shown in the table below, YOLOX performs well with even smaller model sizes than its counterparts.
Model size and data augmentation
In our experiments, all models keep almost the same learning schedule and optimization parameters, as described in 2.1. However, we found that the suitable augmentation strategy varies with model size. As shown in Table 5, applying MixUp to YOLOX-L improves AP by 0.9%, whereas for small models like YOLOX-Nano it is better to weaken the augmentation.
Specifically, when training small models, namely YOLOX-S, YOLOX-Tiny, and YOLOX-Nano, we remove the MixUp augmentation and weaken Mosaic (reducing the scale range from [0.1, 2.0] to [0.5, 1.5]). This modification improves the AP of YOLOX-Nano from 24.0% to 25.3%.
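The size-dependent augmentation rule can be written down as a small configuration helper. The function name, the dictionary keys, and the lower-case model identifiers are hypothetical; only the choice of small models and the two scale ranges come from the text above.

```python
SMALL_MODELS = {"yolox-s", "yolox-tiny", "yolox-nano"}


def augmentation_config(model_name):
    """Pick MixUp and Mosaic settings according to model size."""
    small = model_name.lower() in SMALL_MODELS
    return {
        "mixup": not small,                                    # removed for small models
        "mosaic_scale": (0.5, 1.5) if small else (0.1, 2.0),   # weakened for small models
    }


for name in ("YOLOX-Nano", "YOLOX-S", "YOLOX-L"):
    print(name, augmentation_config(name))
```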
For large models, we also find that stronger augmentation is more helpful. Indeed, our MixUp implementation is somewhat heavier than the original version. Inspired by Copy-Paste, we jitter both images by a randomly sampled scale factor before mixing them. To understand the power of MixUp with scale jittering, we compare it with Copy-Paste on YOLOX-L. Note that Copy-Paste requires extra instance mask annotations, while MixUp does not. As shown in Table 5, the two methods achieve nearly the same performance, indicating that MixUp with scale jittering is a qualified replacement for Copy-Paste when no instance mask annotations are available.
3 Comparison with SOTA
The following table compares YOLOX with SOTA detectors, as is traditional. Keep in mind, however, that the inference speeds in the table are usually uncontrolled, since speed varies with software and hardware. We therefore use the same hardware and code base for all the YOLO series in Figure 1, plotting a somewhat controlled speed/accuracy curve.
We notice that there are some high-performance YOLO models with larger sizes, such as Scaled-YOLOv4 and YOLOv5-P6, and that current Transformer-based detectors push the SOTA accuracy to about 60 AP. Due to time and resource limitations, we did not explore these important directions in this report. However, they are already within our scope.
4 First place in the Streaming Perception Challenge (WAD at CVPR 2021)
The Streaming Perception Challenge at WAD 2021 is a joint evaluation of accuracy and latency through a recently proposed metric: streaming accuracy.
The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant, forcing the stack to account for the amount of streaming data that should be ignored while computation is occurring. We find that the best trade-off point for the metric on a 30 FPS data stream is a powerful model with an inference time ≤ 33 ms. So we adopt the YOLOX-L model with TensorRT to produce our final model for the challenge and win first place.
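The 33 ms threshold follows directly from the frame interval of a 30 FPS stream. The helper names below are hypothetical; the sketch only checks whether a given inference time fits within one frame interval.

```python
def frame_budget_ms(fps):
    """Time between two consecutive frames, in milliseconds."""
    return 1000.0 / fps


def keeps_up(latency_ms, fps=30.0):
    """True if one inference finishes before the next frame arrives."""
    return latency_ms <= frame_budget_ms(fps)


print(round(frame_budget_ms(30), 2))   # about 33.33 ms per frame
print(keeps_up(33.0), keeps_up(40.0))
```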
See the Challenges page for details.
5 Conclusion
In this report, we introduce some empirical improvements to the YOLO series, forming a high-performance anchor-free detector called YOLOX. Equipped with some recent advanced detection techniques, i.e., a decoupled head, anchor-free detection, and an advanced label assignment strategy, YOLOX achieves a better trade-off between speed and accuracy than its counterparts across all model sizes. It is remarkable that we boost the architecture of YOLOv3, which is still one of the most widely used detectors in industry due to its broad compatibility, to 47.3% AP on COCO, surpassing the current best practice by 3.0% AP. We hope this report can help developers and researchers get a better experience in practical scenarios.