Fast-RCNN study notes

Background

From our earlier study of RCNN, we know that RCNN first obtains about 2k region proposals from the input image through selective search, and then resizes each proposal to 227×227 (the AlexNet convolutional network requires a fixed input size of 227×227). The fully connected layers of a convolutional network place a fixed requirement on the size of the incoming feature map (because the number of weights in a fully connected layer is fixed), so every candidate region must be resized before being sent into the network for feature extraction. However, whether the region is cropped or warped, the information of the original image cannot be fully preserved. To remove this restriction of traditional convolutional networks on input image size, Kaiming He and others proposed the Spatial Pyramid Pooling (SPP) layer.

SPP-Net

[Figure: SPP-Net network structure]
The figure above is the SPP-Net structure proposed in Kaiming He's paper. It was designed for the mainstream convolutional networks of that time, so the method in the paper may differ slightly from the SPP variants we use today, but the idea is the same. The SPP layer is inserted between the convolutional layers and the fully connected layers, replacing the ordinary pool5 pooling layer of a traditional CNN (taking AlexNet as an example, between the fifth hidden layer and the sixth, which can also be seen in the figure above). The main pipeline of SPP-Net: first input an image (of arbitrary size, since the input is not fixed here), then pass it through the five-layer convolutional feature extraction network to obtain feature maps (since the input size is not fixed, these feature maps are not of fixed size either). Here comes the key point:
the feature maps are fed into the spatial pyramid pooling layer, which pools them at multiple scales (three scales in the paper). After the pyramid pooling layer we obtain fixed-size outputs of 4×4×256, 2×2×256, and 1×1×256 (where 256 is the number of convolution kernels). These feature vectors of different scales are flattened and concatenated into one fixed-length vector of (16+4+1)×256 = 5376 dimensions, which is sent to the fully connected layers for training (FC6 and FC7 in the figure above); its length always matches the number of weights of the fully connected layer.
From this description we can clearly see that the main role of SPP-Net (for traditional AlexNet and VGG16 networks) is to act as a bridge between the feature extraction network (the backbone) and the fully connected layers: it converts the feature map information produced by the backbone into a vector whose size matches the number of weight parameters of the fully connected layer.
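To make this concrete, here is a minimal PyTorch sketch of a three-scale SPP layer (my own illustration, not the paper's code; `adaptive_max_pool2d` stands in for the paper's window/stride arithmetic):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    """Pool the feature map at several fixed grid sizes and
    concatenate the flattened results into one fixed-length vector."""
    def __init__(self, levels=(4, 2, 1)):  # the three scales: 4x4, 2x2, 1x1
        super().__init__()
        self.levels = levels

    def forward(self, x):  # x: (N, C, H, W), H and W may vary
        pooled = []
        for size in self.levels:
            # adaptive pooling yields a fixed (size x size) grid
            # regardless of the input H and W
            p = F.adaptive_max_pool2d(x, output_size=size)
            pooled.append(p.flatten(start_dim=1))  # (N, C*size*size)
        return torch.cat(pooled, dim=1)  # length C*(16+4+1), always fixed

# Inputs of different sizes produce the same output length:
spp = SpatialPyramidPooling()
for h, w in [(13, 13), (20, 31)]:
    print(spp(torch.randn(1, 256, h, w)).shape)  # torch.Size([1, 5376])
```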
The figure below shows the network structure of the classic convolutional neural network AlexNet, to make the notes above easier to follow:
[Figure: AlexNet network structure]

Advantages of SPP-Net over RCNN

Advantages of SPP-Net:
1. It accepts input images of arbitrary size while fixing the size of the feature information received by the fully connected layers.
2. Computation time is greatly reduced. In RCNN, every candidate box is resized and sent through the CNN one by one for feature extraction before the fixed-size features reach the fully connected layers; SPP-Net instead runs the convolutional layers only once per image and pools each region's features to a fixed length afterwards.

Fast-RCNN

Flaws of RCNN

Drawing on Kaiming He's SPP-Net, Ross Girshick (RBG) proposed the Fast-RCNN algorithm on the basis of RCNN.
In the paper he points out three shortcomings of RCNN:
1. Training is a multi-stage pipeline.
Simply put, RCNN splits training into three separate stages: fine-tuning the convolutional layers (log loss), training the classifier (SVM), and learning the bounding-box regressors.
2. Training is expensive in both time and space.
RCNN sends the roughly 2k region proposals of an image into the network one by one for training, and these proposals contain a large number of overlapping parts, which causes a great deal of repeated computation (and the extracted features must also be cached to disk for the SVM stage, which costs storage).
3. Detection is slow: predicting a single image takes too long.

The proposal of Fast-RCNN

[Figure: Fast-RCNN algorithm flow chart from the paper]
The figure above is the algorithm flow chart from the Fast-RCNN paper.
Here I briefly summarize the innovations that Fast-RCNN introduces on top of RCNN:

1. Innovation in the input method

During training, Fast-RCNN first computes about 2k region proposals (RPs) with the selective search (ss) algorithm, and then sends the whole original image directly into the backbone for feature extraction. Once the feature map is obtained, the RoI corresponding to each RP is located on the feature map through the mapping relationship between the original image and the feature map. One detail worth noting: Fast-RCNN samples 64 RoIs per image, and each mini-batch consists of 2 images, so one mini-batch contains 128 RoIs. For each image, the IoU between each RoI and the ground truth is computed with a threshold of 0.5: 25% of the sampled RoIs are positives (IoU ≥ 0.5) and the rest are negatives (in the prediction stage, each input image keeps all ~2000 RP regions). This removes a great deal of repeated computation: in RCNN, the RPs of the original image are obtained by ss first, and each RP is resized and sent into the CNN separately; different RPs overlap heavily, which leads to redundant calculation, and one image ends up going through the CNN 2k times (once per RP). Fast-RCNN instead passes each image through the convolutional layers only once. A small sketch of the coordinate mapping and RoI sampling follows below.
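As a rough illustration (my own sketch, not the paper's code; the total stride of 16 is an assumption matching a VGG16-style backbone), mapping a proposal to feature-map coordinates is essentially division by the backbone's stride, and mini-batch sampling follows the 25%-positive rule described above:

```python
import random

def map_rois_to_feature(rois, stride=16):
    """Map proposals from image coordinates to feature-map coordinates
    by dividing by the backbone's total stride (assumed 16 here).
    rois: list of (x1, y1, x2, y2) boxes in image pixels."""
    return [(x1 // stride, y1 // stride, x2 // stride, y2 // stride)
            for (x1, y1, x2, y2) in rois]

def sample_rois(rois_with_iou, n_per_image=64, pos_fraction=0.25):
    """Sample 64 RoIs for one image: 25% positives (IoU >= 0.5),
    the rest negatives. rois_with_iou: list of (roi, iou) pairs."""
    positives = [r for r, iou in rois_with_iou if iou >= 0.5]
    negatives = [r for r, iou in rois_with_iou if iou < 0.5]
    n_pos = min(int(n_per_image * pos_fraction), len(positives))
    n_neg = min(n_per_image - n_pos, len(negatives))
    return random.sample(positives, n_pos) + random.sample(negatives, n_neg)
```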

2. RoI pooling

RoI pooling is used to normalize the size of each RoI's features, because the feature map fed into the fully connected layers must have a uniform size (the number of weight parameters in a fully connected layer is fixed). The RoI pooling layer replaces the traditional pool5 pooling layer (i.e. the pooling layer in the fifth hidden layer of AlexNet) and pools every RoI into a uniform 7×7 grid; the RoI regions obtained through the image-to-feature-map mapping are then fed into the fully connected layers for classification and regression. (Note that the spp strategy used by Fast-RCNN is a simplified, single-scale SPP; see the figure below, where the 4×4 grid should be 7×7 in Fast-RCNN.) The feature map produced by RoI pooling is flattened and passed through two fully connected layers to obtain the RoI feature vector, which is then sent in parallel into the classification and regression fully connected layers.
[Figure: single-scale RoI pooling (drawn as 4×4; Fast-RCNN uses 7×7)]
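A minimal sketch of this step using torchvision's ready-made operator (the stride-16 backbone, and hence `spatial_scale=1/16`, is an assumption carried over from above):

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 32, 32)  # conv feature map of one image
# RoIs in image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0., 16., 16., 240., 320.],
                     [0., 64., 48., 200., 200.]])
# Every RoI, whatever its size, is pooled to the same 7x7 grid
out = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(out.shape)  # torch.Size([2, 256, 7, 7])
```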

3. Multi-task loss function

Fast-RCNN uses a multi-task loss function: the classification loss and the bounding-box regression loss of each candidate region are added into a single objective, so the classification and regression parameters can be trained jointly inside the CNN. A softmax classifier replaces the SVM for classification, and the regression boxes are computed the same way as in RCNN. The RoI feature vector feeds two heads in parallel, one for classification and one for regression. As a result, Fast-RCNN does not need to train the CNN, train the SVMs, and train the BBox regressors separately the way RCNN does, because both the classification loss and the regression loss live in one loss function. The following two pictures show this optimization intuitively:
This is RCNN:
[Figure: RCNN's multi-stage training pipeline]
This is Fast-RCNN:
[Figure: Fast-RCNN's joint training pipeline]

Let us first fix the notation: p is the predicted class distribution of an RoI, u is its ground-truth class, t^u is the predicted bounding-box offset for class u, and v is the ground-truth box target. The multi-task loss of Fast-RCNN is expressed as:
$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda\,[u \ge 1]\,L_{loc}(t^u, v)$$

Here the first term is the classification loss and the second term is the regression-box loss; the indicator [u ≥ 1] is 1 for object classes and 0 for the background class u = 0, so background RoIs contribute no regression loss.
The classification loss of the first term is expressed as:
$$L_{cls}(p, u) = -\log p_u$$
That is, the classification loss is the log loss of the softmax probability for the true class u. The targets are divided into K+1 categories: K object categories plus 1 background category.
The location loss function of the second part is expressed as:
$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i^u - v_i)$$
where:
$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
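A small PyTorch sketch of these formulas (my own illustration; the tensor shapes are assumptions, and torch also ships an equivalent built-in smooth L1 as `F.smooth_l1_loss`):

```python
import torch
import torch.nn.functional as F

def smooth_l1(x):
    """smooth_L1(x) = 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = x.abs()
    return torch.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

def multi_task_loss(cls_scores, bbox_pred, u, v, lam=1.0):
    """cls_scores: (N, K+1) raw class scores; bbox_pred: (N, K+1, 4)
    per-class box offsets; u: (N,) ground-truth labels (0 = background);
    v: (N, 4) ground-truth regression targets."""
    l_cls = F.cross_entropy(cls_scores, u)      # -log softmax(p)_u
    t_u = bbox_pred[torch.arange(len(u)), u]    # offsets t^u of the true class
    fg = (u >= 1).float()                       # the indicator [u >= 1]
    per_roi = smooth_l1(t_u - v).sum(dim=1)     # sum over x, y, w, h
    l_loc = (per_roi * fg).sum() / fg.sum().clamp(min=1)  # mean over fg RoIs
    return l_cls + lam * l_loc
```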

4. Accelerating the fully connected layers with SVD

Truncated SVD is used to decompose the weight matrices of the fully connected layers, which significantly speeds up processing an image at test time (during detection the fully connected layers run once per RoI, so they dominate the forward cost, and compressing them pays off).
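A minimal sketch of the idea (my own illustration, not the paper's code): a weight matrix W of shape u×v is approximated by its top t singular values, W ≈ U_t Σ_t V_t^T, so one large fully connected layer becomes two smaller ones with t(u+v) parameters instead of uv:

```python
import torch
import torch.nn as nn

def compress_fc(fc: nn.Linear, t: int) -> nn.Sequential:
    """Replace one Linear layer (weight W: out x in) with two smaller
    ones via truncated SVD: W ~= (U_t @ Sigma_t) @ V_t^T."""
    U, S, Vh = torch.linalg.svd(fc.weight.data, full_matrices=False)
    first = nn.Linear(fc.in_features, t, bias=False)
    first.weight.data = Vh[:t].clone()               # V_t^T: (t, in)
    second = nn.Linear(t, fc.out_features, bias=fc.bias is not None)
    second.weight.data = (U[:, :t] * S[:t]).clone()  # U_t @ Sigma_t: (out, t)
    if fc.bias is not None:
        second.bias.data = fc.bias.data.clone()
    return nn.Sequential(first, second)

fc = nn.Linear(4096, 4096)
small = compress_fc(fc, t=1024)
x = torch.randn(1, 4096)
print((fc(x) - small(x)).abs().max())  # error introduced by the truncation
```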
