[Paper Notes] [Faster R-CNN Optimization] "Light-Head R-CNN: In Defense of Two-Stage Object Detector"

Paper: https://arxiv.org/abs/1711.07264

Code: https://github.com/zengarden/light_head_rcnn

1 Overview

Object detection algorithms are generally divided into two families:

  • Two-stage detectors:

    • Representative algorithms include Faster R-CNN, Mask R-CNN, etc.

    • Detection is carried out in two stages: the first stage generates proposals, and the second stage classifies and regresses (fine-tunes) these proposals.

  • One-stage detectors:

    • Representative algorithms include the YOLO series, SSD, etc.

    • Detection is done in a single pass, with no separate proposal-generation step.

Generally speaking, two-stage detectors achieve higher detection accuracy, while one-stage detectors run faster. In practical industrial deployments, inference speed is an important consideration; at the same time, as the YOLO family has been progressively optimized, its accuracy has improved to the point where it basically meets practical needs, so most real-world applications still tend to be built on one-stage detectors.

The authors subtitled the paper "In Defense of Two-Stage Object Detector" and explore how a two-stage R-CNN can balance accuracy and speed in object detection by constructing a light-head R-CNN network.

  • Using ResNet-101 as the base model, the authors won the COCO 2017 detection challenge.

  • Using a lightweight Xception as the base model, Light-Head R-CNN achieves 30.7 COCO mAP at 102 FPS, outperforming existing single-stage detectors in both speed and accuracy.

2. Why is the two-stage algorithm so slow?

The authors argue that the main reason two-stage detectors are slow is their heavy head. The paper defines "head" as follows:
“Head” in our paper refers to the structure attached to our backbone base network. More specifically, there will be two components: R-CNN subnet and ROI warping

In other words, the head consists of two main parts:

  • RoI warping: generates a fixed-size feature map for each RoI via RoI Pooling, PSRoI Pooling, or similar operations.

  • R-CNN subnet: takes the per-RoI feature map and performs the recognition (classification and box regression).

Because the head of a two-stage detector is computationally heavy, even a lightweight backbone does not greatly improve speed. Faster R-CNN and R-FCN are used as examples to explain this:

For Faster R-CNN, there are two main sources of inefficiency:

  • To preserve accuracy, each RoI is processed by RoI Pooling and then passed through two large FC layers, which is very time-consuming;

  • The input to the first FC layer often has many channels (for example, the ResNet-101 output has 2048 channels), so the amount of computation is very large.

[Note: to speed things up, a global average pool is often applied to the RoI feature map instead of feeding it directly into the fully connected layer. This reduces the computation of the first FC layer, but some spatial localization information is lost, which hurts bbox regression. A rough cost comparison is sketched below.]
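A rough back-of-the-envelope sketch of this trade-off in Python; the 7×7 RoI size and 2048 input channels come from the text above, while the 4096-d first FC width is only an illustrative assumption:

```python
# Approximate multiply-accumulate count of the first FC layer, per RoI.
# The 4096-d hidden width is an assumed example, not a value from the paper.
roi_h, roi_w, c_in = 7, 7, 2048
fc_out = 4096

flatten_then_fc = roi_h * roi_w * c_in * fc_out   # FC on the flattened RoI feature
gap_then_fc = c_in * fc_out                       # FC after global average pooling

print(f"flatten -> FC: {flatten_then_fc / 1e6:.1f}M MACs per RoI")   # ~411.0M
print(f"GAP     -> FC: {gap_then_fc / 1e6:.1f}M MACs per RoI")       # ~8.4M
```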

For R-FCN, a 1×1 convolution is used to produce a position-sensitive score map. Each RoI goes through a PSRoI Pooling layer to obtain a fixed-size feature map of shape p×p×(C+1), where C is the number of categories, and the per-RoI prediction is finally obtained through a global average pooling layer. Compared with Faster R-CNN, the number of channels after PSRoI Pooling is much smaller (C+1) and the fully connected layers are removed, so R-FCN is faster. However, the position-sensitive score map itself must have p² × (C+1) channels:

  • For the 80 categories of the COCO dataset (plus background), with p = 7 the corresponding number of channels is 7² × 81 = 3969, and this overhead becomes very large (see the arithmetic sketch below).
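A quick sketch of the channel arithmetic in Python, comparing R-FCN's class-dependent score map with the 490-channel thin feature map that Light-Head R-CNN uses (Section 3.1):

```python
p = 7                 # pooled output size (p x p bins)
num_classes = 80      # COCO categories; background is added below

rfcn_channels = p * p * (num_classes + 1)   # 7 * 7 * 81 = 3969, grows with C
light_head_channels = p * p * 10            # 7 * 7 * 10 = 490, independent of C

print(rfcn_channels, light_head_channels)   # 3969 490
```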

3. Optimization of Light-Head R-CNN

Figure (c) of the paper's overview diagram shows the improved network structure. Light-Head R-CNN builds on Faster R-CNN and R-FCN and combines some of the advantages of both; its improvements fall into the following three areas:

3.1. Thin feature maps for RoI warping

  • Before RoI warping, the feature map is compressed into a thin feature map, avoiding R-FCN's problem of a score map that is too large and grows with the number of categories C.

    • The number of channels drops from R-FCN's 3969 to 490, greatly reducing computation.

  • PSRoI/RoI Pooling is applied on top of the thin feature map, so the resulting RoI-wise feature map is also thin. Any FC layer that follows is therefore much cheaper, avoiding Faster R-CNN's problem of an expensive first FC layer in the R-CNN subnet (a sketch of this pipeline follows below).
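A minimal PyTorch sketch of this thin-feature-map + PSRoI pooling path, assuming `torchvision.ops.ps_roi_pool`, a 2048-channel stride-16 backbone output, and 490 = 7×7×10 channels; module and variable names are illustrative, not the authors' released code (which also replaces the plain 1×1 conv with the large separable convolution of Section 3.2):

```python
import torch
import torch.nn as nn
from torchvision.ops import ps_roi_pool

class ThinFeatureHead(nn.Module):
    """Compress backbone features to 490 channels, then PSRoI-pool to 7x7x10 per RoI."""
    def __init__(self, in_channels=2048, pooled_size=7, per_bin_channels=10):
        super().__init__()
        self.pooled_size = pooled_size
        # 1x1 conv producing the thin feature map (490 = 7*7*10 channels).
        self.reduce = nn.Conv2d(in_channels, pooled_size * pooled_size * per_bin_channels, 1)

    def forward(self, features, rois, spatial_scale=1.0 / 16):
        thin = self.reduce(features)                                # (N, 490, H, W)
        # rois: (K, 5) tensor of (batch_index, x1, y1, x2, y2) in image coordinates
        pooled = ps_roi_pool(thin, rois, self.pooled_size, spatial_scale)
        return pooled                                               # (K, 10, 7, 7)

# Toy usage
head = ThinFeatureHead()
feats = torch.randn(1, 2048, 38, 50)                  # stride-16 backbone feature map
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0]])  # one RoI in image 0
print(head(feats, rois).shape)                        # torch.Size([1, 10, 7, 7])
```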

To study the impact of thin feature maps with fewer channels on accuracy, the authors built the following network structure for their experiments.

The network structure is basically the same as R-FCN; the main differences are the following two points (other settings are identical):

  • Reduce the feature map channels from 3969 (7×7×81) to 490 (7×7×10).

  • Since the number of channels no longer matches the number of classes, predictions cannot be made directly via global average pooling, so an FC layer is added.

The experimental results are as follows:

Here B1 is the original authors' R-FCN implementation, and B2 is the Light-Head authors' reimplementation of R-FCN with some extra tricks (larger training image size, a balanced classification/regression loss, modified anchor settings, etc.); B2 improves on B1 by about 3 points. Switching B1/B2 to the thin feature map shows that reducing the number of channels does lower accuracy, but only by a very small margin.

  • Because an extra FC layer is added, the channel count is not the only variable; still, the model structures are largely the same, so the experimental results remain reasonably comparable.

  • In addition, the thin feature map makes it feasible to introduce a feature pyramid network (FPN); with the original R-FCN design, the higher resolution of shallow feature maps (e.g., ResNet stage 2) would make the computation and memory cost prohibitive.

3.2. Large separable convolution

Following the R-FCN approach, the simplest way to obtain the thin feature map would be to compress the channels directly with a 1×1 convolution. However, the preceding feature map has 2048 channels while this layer has only 490, and such a drastic channel reduction weakens the feature map. To compensate for the accuracy lost by the channel reduction, the authors introduce a large separable convolution, whose structure is shown in the figure below; the kernel size is k = 15, hence "large", which helps ensure that channel compression does not cost too much accuracy.

  • To further reduce computation, the large separable convolution borrows the factorization idea from Inception-v3: a k×k kernel is replaced by a two-layer 1×k plus k×1 convolution, which keeps a comparable receptive field while greatly reducing the amount of computation;

  • The computational cost depends on the sizes of Cmid and Cout; in the paper these hyperparameters are set to Cmid = 256 and Cout = 490 (a sketch follows below).
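A minimal PyTorch sketch of such a large separable convolution, assuming the two-branch layout (k×1 followed by 1×k, in parallel with 1×k followed by k×1, with the outputs summed) shown in the paper's figure; padding and other details are guesses rather than the released implementation:

```python
import torch
import torch.nn as nn

class LargeSeparableConv(nn.Module):
    """k x k conv factorized into two parallel 1D-factorized branches, summed."""
    def __init__(self, in_channels=2048, c_mid=256, c_out=490, k=15):
        super().__init__()
        pad = k // 2
        # Branch 1: k x 1 followed by 1 x k
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, c_mid, (k, 1), padding=(pad, 0)),
            nn.Conv2d(c_mid, c_out, (1, k), padding=(0, pad)),
        )
        # Branch 2: 1 x k followed by k x 1
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, c_mid, (1, k), padding=(0, pad)),
            nn.Conv2d(c_mid, c_out, (k, 1), padding=(pad, 0)),
        )

    def forward(self, x):
        return self.branch1(x) + self.branch2(x)

x = torch.randn(1, 2048, 38, 50)
print(LargeSeparableConv()(x).shape)   # torch.Size([1, 490, 38, 50])
```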

Experimental results:

  • Thanks to the larger valid receptive field of the large kernel, the pooled feature maps are more expressive, giving an accuracy improvement of 0.7 points.

3.3. Light R-CNN subnet

The authors use only a single 2048-channel FC layer (no dropout), followed by two sibling FC layers for classification and regression.

  • Only 4 outputs are used for each bounding box, because the regression is shared across classes (class-agnostic).

  • Because the feature map feeding the FC layer is only 10×p×p and only one large FC layer is used, the computation here is much smaller than in the original Faster R-CNN (see the sketch below).
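A minimal PyTorch sketch of this R-CNN subnet, assuming the 7×7×10 pooled features from the earlier sketch and 81 classes (80 COCO categories plus background); layer names are illustrative:

```python
import torch
import torch.nn as nn

class LightRCNNSubnet(nn.Module):
    """Single 2048-d FC, then sibling heads for classification and class-agnostic regression."""
    def __init__(self, pooled_size=7, per_bin_channels=10, num_classes=81, fc_dim=2048):
        super().__init__()
        in_dim = per_bin_channels * pooled_size * pooled_size   # 10 * 7 * 7 = 490
        self.fc = nn.Linear(in_dim, fc_dim)                     # the only large FC layer
        self.cls_score = nn.Linear(fc_dim, num_classes)         # classification logits
        self.bbox_pred = nn.Linear(fc_dim, 4)                   # 4 outputs shared across classes

    def forward(self, pooled):                                  # pooled: (K, 10, 7, 7)
        x = torch.relu(self.fc(pooled.flatten(1)))
        return self.cls_score(x), self.bbox_pred(x)

pooled = torch.randn(8, 10, 7, 7)
scores, deltas = LightRCNNSubnet()(pooled)
print(scores.shape, deltas.shape)   # torch.Size([8, 81]) torch.Size([8, 4])
```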

Experimental results:

  • Thanks to these optimization strategies, Light-Head R-CNN achieves good results in both speed and accuracy.

4. Experiment

The authors ran COCO experiments on 8 Pascal TITAN XP GPUs. The relevant parameter settings are as follows:
  • Synchronized SGD with weight decay 0.0001 and momentum 0.9.
  • Each mini-batch has 2 images per GPU.
  • Each image has 2000/1000 RoIs for training/testing.
  • Pad images within a mini-batch to the same size by zero-filling the right-bottom of the image.
  • Learning rate: 0.01 for the first 1.5M iterations and 0.001 for the later 0.5M iterations (sketched below).
  • Adopt the atrous algorithm in stage 5 of ResNet.
  • Adopt online hard example mining (OHEM).
  • The backbone network is initialized from ImageNet pre-trained weights.
  • Pooling size: 7.
  • Batch normalization is fixed for faster experiments.
  • Data augmentation: horizontal image flipping.
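A minimal PyTorch sketch of this optimization schedule (synchronized SGD with momentum 0.9, weight decay 1e-4, and a 0.01 → 0.001 learning-rate drop after 1.5M iterations); the model and loss here are placeholders, not the detector itself:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)   # placeholder for the actual detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# lr = 0.01 for the first 1.5M iterations, then 0.001 for the last 0.5M
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[1_500_000], gamma=0.1)

for step in range(2_000_000):                                 # 1.5M + 0.5M iterations
    optimizer.zero_grad()
    loss = model(torch.randn(2, 3, 64, 64)).pow(2).mean()     # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()                                          # stepped once per iteration
    if step == 0:
        break                                                 # demo: run a single step
```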

4.1. Light-Head R-CNN: High Accuracy

To improve accuracy, the authors add further tricks to Light-Head R-CNN, namely:
  • Add the bilinear interpolation of RoIAlign (from Mask R-CNN) to PSRoI pooling, improving accuracy by 1.3 points.
  • Use 0.5 instead of the original 0.3 as the NMS threshold, gaining 0.6 points (improves recall in crowded scenes).
  • Use multi-scale training: randomly sample the training scale from {600, 700, 800, 900, 1000}, gaining 1 point (see the sketch below).
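A minimal sketch of the multi-scale sampling, assuming the sampled scale is the target length of the image's shorter side (a common convention; the paper's exact resizing rule may differ):

```python
import random
import torch
import torch.nn.functional as F

TRAIN_SCALES = (600, 700, 800, 900, 1000)

def random_rescale(image: torch.Tensor) -> torch.Tensor:
    """Resize a CHW image so its shorter side matches a randomly sampled training scale."""
    scale = random.choice(TRAIN_SCALES)
    _, h, w = image.shape
    ratio = scale / min(h, w)
    new_h, new_w = round(h * ratio), round(w * ratio)
    return F.interpolate(image[None], size=(new_h, new_w),
                         mode="bilinear", align_corners=False)[0]

img = torch.rand(3, 480, 640)
print(random_rescale(img).shape)   # shorter side equals the sampled scale
```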

Finally, on the COCO test-dev set, the performance comparison with other common object detection algorithms is as follows:

4.2. Light-Head R-CNN: High Speed

To make the model practical to deploy and to improve inference speed, the authors make the following changes:

  • Use lightweight Xception to replace ResNet-101 as the backbone.

  • Drop the atrous algorithm (its extra computation is too heavy for a small backbone).

  • Reduce the RPN channels by half: 512 → 256.

  • Change the large separable convolution settings to: kernel size = 15, Cmid = 64, Cout = 490.

  • Use PSRoI pooling with RoIAlign-style interpolation (a hypothetical config sketch follows below).
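A hypothetical configuration dict collecting these fast-path settings; the key names are illustrative and do not correspond to the repository's actual config:

```python
# Hypothetical configuration capturing the "fast" Light-Head R-CNN settings above.
FAST_CONFIG = {
    "backbone": "xception_small",     # lightweight Xception-style backbone
    "use_atrous_stage5": False,       # atrous convolution dropped for speed
    "rpn_channels": 256,              # halved from 512
    "large_sep_conv": {"kernel_size": 15, "c_mid": 64, "c_out": 490},
    "roi_op": "ps_roi_align",         # PSRoI pooling with RoIAlign-style interpolation
    "pooled_size": 7,
}

print(FAST_CONFIG["large_sep_conv"])
```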

With these tricks, the model reaches 102 FPS on COCO at 30.7 mAP, which is a very strong speed/accuracy trade-off.

5 Conclusion:

Light-Head R-CNN is a two-stage object detection method that achieves a good balance between speed and accuracy; its main optimizations are:
  • Use lightweight Backbone Xception to replace Resnet-101.
  • Thin feature maps plus large separable convolution preserve accuracy while increasing speed.
  • Use a single FC layer instead of global average pooling, reducing the loss of spatial information and improving accuracy.
  • Remove the two heavy FC layers of the original head, further reducing computation.
  • Add other tricks: PSRoI with RoIAlign, multi-scale training, OHEM 

Origin blog.csdn.net/qq_44804542/article/details/123624669