Fast R-CNN Paper Interpretation: merging R-CNN's multi-stage training into a single stage and using the RoI pooling layer to unify feature scales; its biggest advantage is fast training and detection

Author: WXY

Date: 2020-9-5

Paper: Fast R-CNN, Ross Girshick, Microsoft Research, September 2015

Tag: Fast RCNN

1. Preface

Fast R-CNN builds on the earlier R-CNN and targets efficient object detection. It uses several new techniques to improve training speed, test speed, and accuracy. Fast R-CNN trains a VGG16 network 9 times faster than R-CNN and tests 213 times faster, while achieving higher accuracy on PASCAL VOC. Compared with SPPnet, it trains 3 times faster and tests 10 times faster.

Earlier detection models faced two main problems: too many candidate regions must be processed, and those regions are not accurate enough, so they must be refined to achieve precise localization. Solving these problems tended to sacrifice speed or accuracy. Fast R-CNN proposes a single-stage training algorithm that learns classification and position refinement at the same time.

Defects of R-CNN:

  • Training is multi-stage: R-CNN first fine-tunes the convolutional network, then trains SVMs on the extracted features, and trains bounding-box regressors in a third step.

  • Training is expensive in time and space: when training the SVMs and bounding-box regressors, the input is the CNN features extracted from every image, and these are cached on disk. Training a large network like VGG16 on 5k images takes 2.5 GPU-days, and the features require hundreds of gigabytes of storage.

  • Detection is slow: R-CNN with VGG16 takes 47 s to detect one image.

The reason detection is slow: the convolutional computation is not shared across the extracted regions of an image (each region is passed through the network separately). SPPnet (Spatial Pyramid Pooling network) was therefore proposed to speed things up by sharing computation. SPPnet convolves the entire image once, maps each region obtained by the selective search (SS) algorithm onto the feature map, and obtains the corresponding window. The window is evenly divided into 4×4, 2×2, and 1×1 grids of blocks, and max pooling is applied at each scale, so regions of different sizes all produce a feature vector of the same length, (4×4 + 2×2 + 1×1)×N for N feature-map channels. (Because the feature map obtained after convolution and pooling preserves the spatial layout of the original image, the candidate box can be cropped directly on the feature map; the max-pooled outputs at the different grid sizes are then concatenated into the final feature vector. This pooling of differently sized inputs that are then merged is the spatial pyramid pooling layer.)
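The pyramid pooling just described can be sketched in a few lines (an illustrative numpy version, not SPPnet's actual code): a window of any size is max-pooled into 4×4, 2×2, and 1×1 grids, and the results are concatenated into a fixed-length vector.

```python
import numpy as np

def spp_pool(window, levels=(4, 2, 1)):
    """Spatial pyramid pooling over one candidate window of a feature map.

    window: (C, H, W) feature-map crop for one region proposal.
    Returns a fixed-length vector of size (4*4 + 2*2 + 1*1) * C,
    regardless of H and W.
    """
    C, H, W = window.shape
    parts = []
    for n in levels:                      # pool into an n x n grid
        # bin boundaries (bins are uneven when H or W is not divisible by n)
        ys = np.linspace(0, H, n + 1, dtype=int)
        xs = np.linspace(0, W, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                cell = window[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                 xs[j]:max(xs[j + 1], xs[j] + 1)]
                parts.append(cell.max(axis=(1, 2)))   # max pool per channel
    return np.concatenate(parts)

# Two windows of different sizes yield same-length feature vectors.
v1 = spp_pool(np.random.rand(256, 13, 9))
v2 = spp_pool(np.random.rand(256, 7, 21))
assert v1.shape == v2.shape == ((16 + 4 + 1) * 256,)
```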

Compared with R-CNN, SPPnet speeds up testing about 100 times and training about 3 times. Disadvantages: training is still multi-stage (extracting features, fine-tuning, training SVMs, training bounding-box regressors), the features are still cached on disk, and the fine-tuning algorithm cannot update the convolutional layers before the pooling pyramid, which limits accuracy.

Advantages of Fast R-CNN

  • High accuracy
  • Single-stage training
  • Training can update all layers of the network
  • No need for disk storage features

2. The structure and training process of Fast R-CNN

(Figure: Fast R-CNN architecture)
Fast R-CNN takes as input an entire image and a set of candidate boxes extracted by the SS algorithm. The network first convolves and pools the whole image to obtain a feature map; then, for each candidate region, an RoI (region of interest) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector passes through fully connected layers and finally feeds two output layers: the first uses softmax to estimate the probability of each object class plus the background, and the other outputs bounding-box regression values for each class (4 values per class).

The CNN model is VGG16, with the last pooling layer replaced by an RoI pooling layer, and the final fully connected layer and softmax replaced by two sibling layers: a (K+1)-way fully connected layer with softmax, and a set of per-class bounding-box regressors.
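The two sibling output heads can be sketched as follows (illustrative numpy code; the dimensions follow VGG16 and PASCAL VOC, but the weights are random, not trained):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 20                      # number of object classes (PASCAL VOC)
D = 4096                    # fc7 feature dimension in VGG16

# Sibling heads: (K+1)-way classification and per-class box regression.
W_cls = rng.standard_normal((D, K + 1)) * 0.01
W_bbox = rng.standard_normal((D, 4 * K)) * 0.01

def heads(fc7):                          # fc7: (R, D) RoI feature vectors
    scores = fc7 @ W_cls                 # (R, K+1) class scores
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)   # softmax over K+1 classes
    deltas = fc7 @ W_bbox                # (R, 4K): 4 box values per class
    return probs, deltas

probs, deltas = heads(rng.standard_normal((3, D)))
assert probs.shape == (3, K + 1) and deltas.shape == (3, 4 * K)
assert np.allclose(probs.sum(axis=1), 1.0)   # valid probabilities per RoI
```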

1. RoI pooling layer

RoI (region of interest): basically equivalent to a region proposal; here it refers to the proposal's mapping onto the feature map.

RoI pooling is a simplified version of the SPP pooling layer: SPP pools at several scales, whereas RoI pooling uses only one scale (7×7 for VGG16). Before the fully connected layers, the RoI pooling layer pools the differently sized mapped blocks of each RoI on the feature map into feature vectors of the same size and feeds them to the fully connected layers.
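RoI pooling is just the single-level case of the SPP sketch above; a minimal numpy version (illustrative, not the paper's implementation) looks like this:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """RoI max pooling: one pyramid level, 7x7 for VGG16 in Fast R-CNN.

    feature_map: (C, H, W); roi: (x1, y1, x2, y2) in feature-map coordinates.
    Returns (C, output_size, output_size) whatever the RoI's size.
    """
    x1, y1, x2, y2 = roi
    window = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    C, h, w = window.shape
    ys = np.linspace(0, h, output_size + 1, dtype=int)
    xs = np.linspace(0, w, output_size + 1, dtype=int)
    out = np.empty((C, output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            cell = window[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))   # max pool each grid cell
    return out

fm = np.random.rand(512, 38, 50)          # e.g. a VGG16 conv5 feature map
small = roi_pool(fm, (3, 4, 12, 15))      # a small RoI
large = roi_pool(fm, (0, 0, 49, 37))      # an RoI covering the whole map
assert small.shape == large.shape == (512, 7, 7)
```

(torchvision provides a batched `roi_pool` operator; the loop version here is only meant to show the mechanism.)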
1) How to understand mapping a region proposal from the original image. In the figure (omitted here), the upper path convolves the cropped region alone as input, while the lower path convolves the entire original image. Each position of the feature map has a corresponding position in the original image; because this correspondence exists, we can take a region found on the original image and map it directly onto the feature map.
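Concretely, the mapping amounts to dividing coordinates by the network's total stride (16 for VGG16's conv layers). A sketch, where the rounding convention is my own simplification:

```python
def map_roi_to_feature(box, stride=16):
    """Project a region proposal from image pixels to feature-map cells.

    VGG16's conv layers shrink the image by a total stride of 16, so a box
    (x1, y1, x2, y2) in image coordinates lands at roughly box/16 on conv5.
    Here x1, y1 are rounded down and x2, y2 up, so the mapped window covers
    the whole proposal; real implementations differ slightly in rounding.
    """
    x1, y1, x2, y2 = box
    return (x1 // stride, y1 // stride,
            -(-x2 // stride), -(-y2 // stride))   # ceil for the far corner

# A 160x160 proposal at (160, 80) maps to a 10x10 window at (10, 5).
assert map_roi_to_feature((160, 80, 320, 240)) == (10, 5, 20, 15)
```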

2) How to understand how the RoI pooling layer converts feature inputs of different sizes into outputs of the same size: it divides each RoI's window into a fixed 7×7 grid of sub-windows and max-pools each sub-window, so the output size is fixed no matter how large the window is.
2. Multi-task loss

FRCN has two output layers. One outputs the probability of each class using softmax (which works better than SVMs here); the other outputs per-class bounding-box regression. To combine the two in forward and backward propagation, the loss function is defined as:
L(p, u, tᵘ, v) = L_cls(p, u) + λ·[u ≥ 1]·L_loc(tᵘ, v),  with L_cls(p, u) = −log p_u

where L_cls(p, u) is the classification loss for the true class u, L_loc is the bounding-box loss, and λ is a hyperparameter that adjusts the ratio between the classification loss and the bounding-box loss. The indicator [u ≥ 1] is 0 when u = 0 (background, which has no ground-truth box) and 1 in all other cases.
The localization loss applies a smooth L1 function to each box coordinate:

L_loc(tᵘ, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(tᵢᵘ − vᵢ)

The smooth L1 function improves robustness.
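As an illustrative sketch (my own numpy code, not the paper's implementation), the multi-task loss for a single RoI can be written as:

```python
import numpy as np

def smooth_l1(x):
    """0.5*x^2 for |x| < 1, |x| - 0.5 otherwise (elementwise)."""
    a = np.abs(x)
    return np.where(a < 1, 0.5 * a ** 2, a - 0.5)

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = -log p_u + lam * [u >= 1] * sum_i smooth_L1(t_i - v_i)

    p: (K+1,) softmax class probabilities; u: true class (0 = background);
    t_u: (4,) predicted box deltas for class u; v: (4,) regression targets.
    """
    L_cls = -np.log(p[u])
    L_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() if u >= 1 else 0.0
    return L_cls + lam * L_loc

# Foreground RoI (u = 1): both terms contribute.
loss_fg = multi_task_loss(np.array([0.1, 0.7, 0.2]), 1,
                          [0.2, 0.1, 0.0, 0.3], [0.0, 0.0, 0.0, 0.0])
# Background RoI (u = 0): only the classification term contributes.
loss_bg = multi_task_loss(np.array([0.9, 0.05, 0.05]), 0, None, None)
assert np.isclose(loss_bg, -np.log(0.9))
```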

L2 loss function (mean squared error):

L2(x) = x²

  • Advantage: smooth and differentiable everywhere, including at the turning point
  • Disadvantage: the gradient grows with the error, so points far from the center (outliers) can cause exploding gradients; it is not robust

L1 loss function (mean absolute error):

L1(x) = |x|

  • Advantage: stable (bounded) gradient
  • Disadvantage: the center point x = 0 is a turning point where the function is not differentiable

Smooth L1:

smooth_L1(x) = 0.5x² if |x| < 1, |x| − 0.5 otherwise

It is insensitive to points far from the center (which improves robustness), and it is differentiable at the turning point.
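A small numeric comparison of the three losses (illustrative numpy code) makes the trade-off visible: near zero, smooth L1 behaves like L2 (differentiable at the turning point); far from zero it grows linearly like L1, so outliers do not blow up the gradient.

```python
import numpy as np

def l2(x):
    return x ** 2

def l1(x):
    return np.abs(x)

def smooth_l1(x):
    a = np.abs(x)
    return np.where(a < 1, 0.5 * a ** 2, a - 0.5)

x = np.array([0.0, 0.5, 1.0, 5.0])
assert np.allclose(l2(x),        [0.0, 0.25,  1.0, 25.0])  # quadratic growth
assert np.allclose(l1(x),        [0.0, 0.5,   1.0,  5.0])  # linear growth
assert np.allclose(smooth_l1(x), [0.0, 0.125, 0.5,  4.5])  # quadratic near 0,
                                                           # linear far away
```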

3. Updating the network during fine-tuning

An important capability of Fast R-CNN is that it can use backpropagation to update all weights in the network.

Why can't the weights below the spatial pyramid pooling layer in R-CNN and SPPnet be updated by backpropagation? The root cause is that backpropagation is extremely inefficient when the training samples (RoIs) come from different images. (As mentioned earlier, an R-CNN training mini-batch samples 128 regions across all classes, drawn from many different images, whereas an FRCN mini-batch samples R/N = 64 regions from each of N = 2 images; according to the author, RoIs from different images make backpropagation very inefficient.) The reason for the inefficiency: an RoI may have a very large receptive field, often spanning almost the entire input image, and since the forward pass must process the whole receptive field, the training input is very large.

The author proposes sharing features during training. The mini-batch for stochastic gradient descent uses hierarchical sampling: first N images are sampled, then R/N RoIs are sampled from each image, giving R RoIs in total. RoIs from the same image share computation and memory in the forward and backward passes. (With N = 2 and R = 128, FRCN with shared computation is roughly 64 times faster than forming a mini-batch from RoIs drawn from 128 different images, as R-CNN does.)

(Sampling in fine-tuning: each SGD (stochastic gradient descent) mini-batch has size R = 128 with N = 2, i.e. 64 RoIs sampled from each image. 25% of the RoIs are taken from those whose IoU with a ground-truth box is at least 0.5; these are treated as foreground, with u ≥ 1. The remaining RoIs are sampled from those with IoU in [0.1, 0.5) and serve as background, with u = 0.)
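The hierarchical sampling above can be sketched as follows (an illustrative simplification; `sample_rois` and its input format are my own, not from the paper's code):

```python
import numpy as np

def sample_rois(rois_per_image, R=128, N=2, fg_frac=0.25):
    """Hierarchical sampling for one SGD mini-batch (Fast R-CNN fine-tuning).

    rois_per_image: list of N lists of (roi_id, max_iou) pairs from SS.
    Picks R/N RoIs per image: 25% foreground (IoU >= 0.5, label u >= 1),
    the rest background drawn from IoU in [0.1, 0.5) with label u = 0.
    """
    rng = np.random.default_rng(0)
    per_image = R // N                     # 64 RoIs from each image
    n_fg = int(per_image * fg_frac)        # 16 foreground RoIs per image
    batch = []
    for rois in rois_per_image:
        fg = [r for r, iou in rois if iou >= 0.5]
        bg = [r for r, iou in rois if 0.1 <= iou < 0.5]
        batch += list(rng.choice(fg, n_fg, replace=len(fg) < n_fg))
        batch += list(rng.choice(bg, per_image - n_fg,
                                 replace=len(bg) < per_image - n_fg))
    return batch   # R RoI ids; RoIs from the same image share the conv pass

# Two synthetic images, 300 proposals each, with random IoU values.
r = np.random.default_rng(1)
imgs = [list(enumerate(r.uniform(0, 1, 300))) for _ in range(2)]
batch = sample_rois(imgs)
assert len(batch) == 128
```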

4. Ensure the same scale

Single-scale method: brute-force resize the input image to a fixed size.

Multi-scale method: an image pyramid, which is a special case of SPP. The image is sampled at a set of fixed scales and pooled to obtain a fixed-length feature vector.

5. FRCN testing

Testing can be performed after fine-tuning. The network takes as input an image and its region proposals (about 2000). The image is convolved to obtain a feature map, each RoI is mapped onto it, and the RoIs of different sizes pass through the RoI pooling layer to make their scales consistent. For each test RoI, the network outputs a per-class probability and per-class bounding-box regression offsets, and non-maximum suppression is then applied independently for each class.
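The per-class non-maximum suppression step can be sketched as a standard greedy NMS (written from scratch here for illustration):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression, applied per class at test time.

    boxes: (n, 4) as (x1, y1, x2, y2); scores: (n,).
    Returns indices of kept boxes, highest score first.
    """
    order = np.argsort(scores)[::-1]       # boxes sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top-scoring box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[order[1:], 2] - boxes[order[1:], 0]) *
                 (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]   # drop overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
assert nms(boxes, scores) == [0, 2]   # box 1 overlaps box 0 and is suppressed
```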

3. Results

1. Accuracy


2. Training and testing time

Three models pre-trained on ImageNet are used: AlexNet, VGG_CNN_M_1024, and VGG16, denoted S, M, and L respectively.

A trick: truncated SVD. At detection time, computing the fully connected layers takes up almost half of the forward-pass time. The author uses truncated singular value decomposition: large fully connected layers are easily compressed by it, which reduces detection time. (I don't fully understand the details.)
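The idea can be sketched with numpy (illustrative code with sizes reduced from the real fc layers): keeping only the t largest singular values replaces one big layer of u×v weights with two thin layers totalling t×(u+v) weights.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))   # an fc weight matrix (size reduced)
x = rng.standard_normal(1024)

# Truncated SVD: W ~ U_t * Sigma_t * V_t^T with only the t largest
# singular values kept.
t = 64
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W1 = Vt[:t]                    # first thin layer: (t, 1024), no nonlinearity
W2 = U[:, :t] * S[:t]          # second thin layer: (1024, t), absorbs Sigma_t

y_full = W @ x                 # one big matrix-vector product
y_svd = W2 @ (W1 @ x)          # two thin matrix-vector products

params_before = W.size                   # 1024 * 1024 = 1,048,576
params_after = W1.size + W2.size         # 2 * 64 * 1024 =   131,072
assert y_full.shape == y_svd.shape
assert (params_before, params_after) == (1048576, 131072)
```

(With a random matrix the rank-64 approximation is poor; in a trained fc layer the spectrum decays quickly, which is why the compression costs little accuracy.)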


3. Fine-tune which layers

Experiments show that fine-tuning through some of the convolutional layers works best, but not starting from the first convolutional layer (the lowest-level, generic features do not need fine-tuning, and fine-tuning them brings no improvement).

4. SVM and softmax




Origin blog.csdn.net/cyl_csdn_1/article/details/108981714