Paper reading | One-stage object detector YOLOv4

Paper information

1. Paper title: YOLOv4: Optimal Speed and Accuracy of Object Detection

2. Publication date: April 2020

3. Paper URL: https://arxiv.org/abs/2004.10934

4. Paper source code: https://github.com/AlexeyAB/darknet

Introduction

YOLOv4 is the successor to YOLOv3 in the YOLO family. By combining a number of general-purpose techniques, it raises accuracy on MS COCO to 43.5% AP (65.7% AP50) while keeping a real-time speed of about 65 FPS. YOLOv4 also aims to make it possible to train an accurate and fast model on a single conventional GPU (1080 Ti, 2080 Ti, etc.).

According to the authors' summary, there are three major contributions:

  1. The network is efficient and powerful, and a fast and accurate model can be trained on a single GPU.
  2. The influence of Bag-of-Freebies and Bag-of-Specials methods on object detection is verified.
  3. CBN, PAN, SAM, and other methods are modified to make them suitable for single-GPU training.

Related work

Object detector model

[Image omitted: general structure of an object detector]

A detector is roughly divided into two parts, the backbone and the head. The layers inserted between the two, which collect feature maps from different stages, form the neck. Please refer to the figure above.

Development directions in recent years: 1. inserting layers between the backbone and the head, i.e. changes to the neck (FPN, PAN); 2. building a new backbone (DetNet, DetNAS); 3. designing an entirely new detection architecture (SpineNet, HitDetector).

Object detectors can be broken down into the following components:

• Input: Image, Patches, Image Pyramid
• Backbones: VGG16, ResNet-50, SpineNet, EfficientNet-B0/B7, CSPResNeXt50, CSPDarknet53
• Neck:
	◦ Additional blocks: SPP, ASPP, RFB, SAM
	◦ Path-aggregation blocks: FPN, PAN, NAS-FPN, Fully-connected FPN, BiFPN, ASFF, SFAM
• Heads:
	◦ Dense prediction (one-stage), anchor based: RPN, SSD, YOLO, RetinaNet
	◦ Dense prediction (one-stage), anchor free: CornerNet, CenterNet, MatrixNet, FCOS
	◦ Sparse prediction (two-stage), anchor based: Faster R-CNN, R-FCN, Mask R-CNN
	◦ Sparse prediction (two-stage), anchor free: RepPoints

Bag of freebies (BoF)

Bag of freebies concept: methods that change only the training strategy or increase only the training cost, leaving the inference cost unchanged.

Common techniques in object detection that fit this definition are as follows:

Data augmentation

The purpose of data augmentation is to increase the variability of the input images, thereby improving the generalization ability and robustness of the model. Augmentation operations fall roughly into two categories, photometric distortion and geometric distortion.

Photometric distortion: adjusting the brightness, contrast, hue, saturation, and noise of an image.

Geometric distortion: random scaling, cropping, flipping, and rotating.

The methods above are pixel-wise augmentations. The following can be called region-wise augmentations.

Some augmentations simulate object occlusion: random erase, CutOut, hide-and-seek, and grid mask. The general idea is to randomly select one or more rectangular regions and fill them with random values or zeros. These methods apply to both classification and object detection. When the same idea is applied to feature maps, it gives DropOut, DropConnect, and DropBlock.

More recent augmentation methods combine multiple images: MixUp and CutMix.
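As a rough illustration of the multi-image idea, the sketch below implements a minimal MixUp on plain Python lists: two images are blended with a coefficient `lam`, and the labels are blended with the same coefficient. The function name `mixup`, the list-based image representation, and the Beta parameters are assumptions made for this example, not code from the paper or from darknet.

```python
import random

def mixup(image_a, image_b, label_a, label_b, lam):
    """Blend two equally sized images (2D lists of pixel values) and their one-hot labels."""
    mixed_image = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    # The labels are mixed with the same ratio as the pixels.
    mixed_label = [lam * la + (1.0 - lam) * lb for la, lb in zip(label_a, label_b)]
    return mixed_image, mixed_label

# lam is commonly drawn from a Beta distribution; alpha = 1.0 is an assumed setting.
lam = random.betavariate(1.0, 1.0)
image, label = mixup([[0.0, 0.0]], [[1.0, 1.0]], [1.0, 0.0], [0.0, 1.0], lam)
```

CutMix follows the same pattern, except that a rectangular region of one image is pasted over the other and the label ratio is set by the area of that region.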

Style transfer GANs are also used for augmentation, to reduce the texture bias learned by CNNs.

Sample balance

Sample balancing addresses semantic distribution bias in the data set, in particular the imbalance between categories.

Two-stage detectors address this with hard negative example mining and online hard example mining (OHEM). These example-mining methods do not apply to one-stage detectors, because one-stage detectors perform dense prediction.

One-stage detectors use focal loss instead.
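To make the effect of focal loss concrete, here is a minimal sketch of the binary form for a single prediction. The defaults `gamma=2.0` and `alpha=0.25` are the values commonly quoted for focal loss and are assumptions here, as is the function name; this article does not state which settings YOLOv4 uses.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, strictly between 0 and 1
    y: ground-truth label, 1 (object) or 0 (background)
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)  # confident and correct: very small loss
hard = focal_loss(0.1, 1)  # confident and wrong: much larger loss
```

With `gamma=0` the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, which is one way to see that focal loss only re-weights examples rather than changing the objective.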

Category relationship representation

It is difficult to express the degree of association between different categories with a one-hot hard representation, so label smoothing is introduced: hard labels are converted to soft labels, which makes the model more robust. To obtain better soft labels, knowledge distillation has been used to design a label refinement network.
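A minimal sketch of uniform label smoothing follows. The smoothing factor `eps=0.1` and the function name are assumptions for illustration; the article does not give the value used in YOLOv4.

```python
def smooth_labels(one_hot, eps=0.1):
    """Convert a hard one-hot label into a soft label.

    Each entry keeps (1 - eps) of its original value, and eps is
    spread uniformly over all classes.
    """
    num_classes = len(one_hot)
    return [(1.0 - eps) * v + eps / num_classes for v in one_hot]

soft = smooth_labels([0.0, 0.0, 1.0, 0.0])
# The true class gets about 0.925, every other class about 0.025.
```

The soft label still sums to 1, so it can be used directly with a cross-entropy loss.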

Objective function of BBox regression

Traditional object detectors usually apply MSE to regress directly either the center coordinates plus width and height of the BBox, or the coordinates of its top-left and bottom-right corners. Anchor-based detectors estimate the corresponding offsets instead. In both cases, directly estimating the coordinates of each BBox point treats these points as independent variables and ignores the integrity of the object itself.

To account for this, and to take scale invariance, object shape, aspect ratio, and center distance into consideration, the loss function has evolved in several steps:

IoU loss -> GIoU loss -> DIoU loss -> CIoU loss
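The sketch below computes IoU and a CIoU-style loss for two axis-aligned boxes given as `(x1, y1, x2, y2)`, following the commonly published formulation: one minus IoU, plus the normalized center distance (the DIoU term), plus an aspect-ratio penalty. The box format and function names are assumptions for this example, and boxes are assumed to have positive width and height.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def ciou_loss(pred, gt):
    """CIoU loss: 1 - IoU + center-distance term + aspect-ratio term."""
    overlap = iou(pred, gt)
    # Squared distance between the two box centers.
    rho2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2 \
       + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2
    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (
        math.atan((gt[2] - gt[0]) / (gt[3] - gt[1]))
        - math.atan((pred[2] - pred[0]) / (pred[3] - pred[1]))
    ) ** 2
    denom = 1.0 - overlap + v
    alpha = v / denom if denom > 0 else 0.0
    return 1.0 - overlap + rho2 / c2 + alpha * v
```

Unlike plain IoU loss, this loss still gives a useful signal for boxes that do not overlap at all, because the center-distance term keeps growing as the boxes move apart.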

Bag of specials (BoS)

Bag of specials concept: plugin modules and post-processing methods that increase the inference cost only slightly but can significantly improve detection accuracy.

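Plugin modules

Plugin modules are usually used to enhance specific attributes of the model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening feature integration.

Enlarging the receptive field: SPP, ASPP, RFB

Attention mechanisms in object detection fall into two kinds:

	channel-wise attention: Squeeze-and-Excitation (SE)

	point-wise attention: Spatial Attention Module (SAM)

Feature integration: early approaches used skip connections or hyper-columns to integrate low-level physical features into high-level features. After multi-scale prediction methods such as FPN appeared, many lightweight modules that integrate different feature levels were proposed, including SFAM, ASFF, and BiFPN.

	SFAM uses an SE module to re-weight, channel-wise, the channels of the feature levels being fused, and then adds them.

	ASFF uses softmax for point-wise feature fusion.

	BiFPN uses weighted residual connections for scale-wise feature fusion.

Activation functions: a good activation function lets gradients propagate efficiently without adding much computation. Typical choices are tanh, sigmoid, ReLU, leaky-ReLU, parametric-ReLU, ReLU6, SELU, Swish, hard-Swish, and Mish. Leaky-ReLU and parametric-ReLU address the problem that the gradient of ReLU is zero for negative inputs, while ReLU6 and hard-Swish are designed for quantized networks.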
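For reference, three of the activation functions named above can be written in a few lines of scalar Python. The definitions follow the commonly published formulas; the leaky-ReLU slope of 0.01 is an assumed default, not a value taken from the paper.

```python
import math

def leaky_relu(x, slope=0.01):
    """Like ReLU, but keeps a small non-zero gradient for negative inputs."""
    return x if x > 0 else slope * x

def relu6(x):
    """ReLU clipped at 6, which bounds the output range."""
    return min(max(0.0, x), 6.0)

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish: x * tanh(softplus(x)), smooth and non-monotonic."""
    return x * math.tanh(softplus(x))
```

Mish lets small negative values through (for example, `mish(-1.0)` is roughly -0.30) instead of cutting them to zero as ReLU does.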

Post-processing methods

Post-processing methods are used to screen the prediction results of the model.

Common post-processing methods include NMS, soft-NMS, and more recently DIoU-NMS. Because none of these methods refers directly to the captured image features, post-processing is no longer needed in anchor-free methods.
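A minimal sketch of greedy NMS follows, assuming boxes in `(x1, y1, x2, y2)` format and an IoU threshold of 0.5 (both assumptions for this example). Boxes are visited in descending score order, and a box is kept only if it does not overlap too much with any box already kept.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive greedy NMS."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Suppress box i if a higher-scoring kept box overlaps it too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])  # the second box overlaps the first and is dropped
```

Soft-NMS lowers the score of an overlapping box instead of discarding it, and DIoU-NMS replaces the plain IoU test with IoU minus the normalized distance between box centers.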

Methodology

Selection of architecture

To select the backbone network, it is necessary to find the optimal balance among the input resolution of the network, the number of convolutional layers, the number of parameters, and the number of output channels (filters) per layer. CSPResNeXt50 outperforms CSPDarknet53 on classification, but CSPDarknet53 is better on detection.

The next step is to choose additional blocks that enlarge the receptive field, such as SPP or ASPP, and the best path-aggregation method, such as FPN, ASFF, or PAN.

The model that is optimal for classification is often not optimal for detection. A detector requires:

  1. Higher input resolution, to detect multiple small objects.
  2. More layers, for a larger receptive field to cover the increased input size.
  3. More parameters, for greater capacity to detect objects of different sizes in a single image.

Finally, the structure of YOLOv4 consists of CSPDarknet53 backbone, SPP additional module, PANet path-aggregation neck, and YOLOv3 (anchor based) head.

Additional improvements

To make the model more suitable for training on a single GPU, the authors make the following improvements:

  1. Propose new data augmentation methods, Mosaic and Self-Adversarial Training (SAT);
  2. Select optimal hyperparameters using genetic algorithms;
  3. Modify SAM, PAN, and batch normalization (as CmBN).

Mosaic mixes 4 training images into one, so that a single image contains 4 different contexts. As a result, the mini-batch does not need to be very large to train the model well.

[Image omitted: Mosaic data augmentation]
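A heavily simplified sketch of the Mosaic idea is shown below: four equally sized images are tiled into a 2 x 2 canvas, and each image's boxes are shifted by the offset of its tile. The actual augmentation also involves random scaling and cropping around a random center point, which this sketch leaves out. The function name and the list-based image format are assumptions for this example.

```python
def mosaic_2x2(images, boxes_per_image):
    """Tile four H x W images (2D lists) into one 2H x 2W image.

    images: [top_left, top_right, bottom_left, bottom_right]
    boxes_per_image: for each image, a list of (x1, y1, x2, y2) boxes
    Returns the combined image and all boxes in canvas coordinates.
    """
    height, width = len(images[0]), len(images[0][0])
    # (dx, dy) offset of each tile inside the canvas.
    offsets = [(0, 0), (width, 0), (0, height), (width, height)]
    top = [left + right for left, right in zip(images[0], images[1])]
    bottom = [left + right for left, right in zip(images[2], images[3])]
    canvas = top + bottom
    boxes = []
    for (dx, dy), image_boxes in zip(offsets, boxes_per_image):
        for x1, y1, x2, y2 in image_boxes:
            boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, boxes

# Four 2 x 2 dummy images filled with 0, 1, 2, 3, each with one box.
images = [[[k] * 2 for _ in range(2)] for k in range(4)]
canvas, boxes = mosaic_2x2(images, [[(0, 0, 1, 1)]] * 4)
```

Because every training sample now carries objects from four source images, each mini-batch sees more varied context than the same number of unmixed images would provide.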

Self-Adversarial Training (SAT) is a new data augmentation method that works in two stages. In the first stage, the network alters the input image instead of its own weights, creating the deception that the image contains no target object; in effect the network performs an adversarial attack on itself. In the second stage, the network is trained to detect objects in this modified image in the normal way.

CmBN (Cross mini-Batch Normalization) is a modified version of CBN, which is itself a variant of BN. It collects statistics only across the mini-batches within a single batch, as shown in the following figure:

[Image omitted: CmBN]

The paper changes SAM from spatial-wise attention to point-wise attention, and changes the feature fusion in PAN from addition to concatenation, as shown in the following figure:

[Image omitted: modified SAM and modified PAN]

YOLOv4

Detailed description of YOLOv4:

YOLOv4 consists of:

  • Backbone: CSPDarknet53
  • Neck: SPP, PAN
  • Head: YOLOv3

The tricks used by YOLOv4:

  • Bag of Freebies (BoF) for backbone: CutMix and Mosaic data augmentation, DropBlock regularization, class label smoothing

  • Bag of Specials (BoS) for backbone: Mish activation, Cross-stage partial connections (CSP), Multi-input weighted residual connections (MiWRC)

  • Bag of Freebies (BoF) for detector: CIoU-loss, CmBN, DropBlock regularization, Mosaic data augmentation, Self-Adversarial Training, eliminating grid sensitivity, using multiple anchors for a single ground-truth box, cosine annealing scheduler, optimal hyperparameters, random training shapes

  • Bag of Specials (BoS) for detector: Mish activation, SPP-block, SAM-block, PAN path-aggregation block, DIoU-NMS
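Of the items listed above, the cosine annealing scheduler is easy to show in isolation. The sketch below uses the standard cosine decay formula; the function name, the absence of warm-up or restarts, and the example learning rates are assumptions for illustration, not the exact schedule used to train YOLOv4.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at a given step under cosine annealing.

    The rate starts at lr_max, decays slowly at first, faster in the
    middle, and flattens out again as it approaches lr_min.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Assumed example values: decay from 0.01 towards 0 over 1000 steps.
schedule = [cosine_annealing_lr(s, 1000, 0.01) for s in (0, 250, 500, 750, 1000)]
```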

Experiments

To measure how different training techniques affect classifier accuracy, the classifier is evaluated on the ImageNet (ILSVRC 2012 val) data set. To measure their effect on detector accuracy, the detector is evaluated on the MS COCO (test-dev 2017) data set. Only the detector results are shown here.

Because the results reported for other models were measured on different GPU architectures, YOLOv4 was run on several architectures for comparison: Maxwell (GTX Titan X or Tesla M40 GPU), Pascal (Titan X, Titan Xp, GTX 1080 Ti, or Tesla P100 GPU), and Volta (Titan Volta or Tesla V100 GPU).

On MS COCO (test-dev 2017), all results use a batch size of 1 without TensorRT, and only detectors running at 30 FPS or higher are compared. The results on the three architectures are as follows:

Maxwell:

[Image omitted: results on Maxwell GPUs]

Pascal:

[Image omitted: results on Pascal GPUs]

Volta:

[Image omitted: results on Volta GPUs]

Conclusion

The paper focuses mainly on model optimization methods and reads like a compendium of optimization tricks. The model can be trained and tested on a single GPU such as a GTX 1080 Ti or RTX 2080 Ti, which makes it widely accessible. The paper also proposes some new data augmentation methods and other improvements, but the result is considerably more complicated than YOLOv3; ideally a model should be simple, efficient, and convenient to train. Even so, the paper's analysis of each part of the detector is very good, and it can serve as an overview of detector tricks.

Origin blog.csdn.net/yanghao201607030101/article/details/111969099