Detailed explanation of the algorithm process of the Fast R-CNN model for deep-learning object detection (super detailed theory)

1. Fast R-CNN paper background
2. Fast R-CNN algorithm flow
3. Fast R-CNN problems and shortcomings

This article explains Fast R-CNN by comparing it with R-CNN. If you are not very familiar with the R-CNN network, the link below gives a quick overview.

Detailed explanation of the algorithm process of the R-CNN model for deep-learning object detection (super detailed theory)

1. Fast R-CNN paper background

Paper address: https://arxiv.org/abs/1504.08083

  Fast R-CNN is a 2015 paper by Ross Girshick titled "Fast R-CNN". The paper aims to solve several problems in object detection, in particular the tension between speed and accuracy in earlier detection methods.

  Abstract: This paper proposes a fast region-based convolutional network approach (Fast R-CNN) to object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared with previous work, Fast R-CNN introduces several innovations that improve training and testing speed while also improving detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test time, and achieves a higher mAP on PASCAL VOC 2012. Compared with SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate.

2. Fast R-CNN algorithm flow

1. The four steps of the R-CNN algorithm:

(1) Obtain candidate regions: for an input image, the selective search algorithm first produces about 2,000 candidate regions. Because these regions have inconsistent sizes, while the fully connected layers of the subsequent convolutional neural network require a fixed-size input, each region is warped to a fixed size before entering the network;

(2) Obtain image features: each candidate region is fed into a convolutional neural network to extract its features. Commonly used image CNNs such as VGGNet or AlexNet can be used for this step.

(3) Obtain the region's category: with an initial target location in hand, the target's class still needs to be determined. In this step an SVM classifier decides which category the current region belongs to.

(4) Fine-tune the region's location: the candidate region gives only a rough target location, so a regressor is used to fine-tune the bounding box.

2. The Fast R-CNN algorithm flow:


(1) Input the image;
(2) Extract image features with the convolutional layers of a deep network (the convolutional layers of VGG, AlexNet, ResNet, etc.) to obtain the image's feature map;
(3) Obtain the regions of interest of the image (usually about 2,000) with the selective search algorithm;
(4) Apply ROI pooling (region-of-interest pooling) to each region of interest: project its coordinates onto the feature map to find the feature region corresponding to the region of interest in the input image, then max-pool that region, so that every region of interest yields features of a uniform size, as shown in Figure 2;
(5) Use the output of the ROI pooling layer (the max-pooled feature map corresponding to each region of interest) as the feature vector of that region of interest;
(6) Feed each region's feature vector into fully connected layers and define a multi-task loss function, connecting to a softmax classifier and a bounding-box regressor to obtain the category and the bounding-box coordinates of the current region of interest, respectively;
(7) Apply non-maximum suppression (NMS) to all the resulting bounding boxes to obtain the final detection result.
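The steps above can be sketched in a few lines of Python. This is a minimal illustration of the control flow only: `selective_search`, `backbone`, `roi_pool`, `head`, and `nms` are hypothetical stand-ins, not real implementations.

```python
def selective_search(image):
    # ~2,000 class-agnostic proposals (x1, y1, x2, y2) in practice;
    # two dummy boxes here for illustration.
    return [(0, 0, 64, 48), (10, 20, 120, 90)]

def fast_rcnn_detect(image, backbone, roi_pool, head, nms):
    feature_map = backbone(image)            # (2) ONE forward pass per image
    proposals = selective_search(image)      # (3) regions of interest
    results = []
    for box in proposals:
        feat = roi_pool(feature_map, box)    # (4)-(5) fixed-size RoI feature
        cls_scores, box_deltas = head(feat)  # (6) softmax + bbox regressor
        results.append((cls_scores, box_deltas, box))
    return nms(results)                      # (7) non-maximum suppression

# Usage with trivial stand-ins:
detections = fast_rcnn_detect(
    image="img",
    backbone=lambda im: "features",
    roi_pool=lambda fm, box: "roi_feat",
    head=lambda f: ([0.1, 0.9], [0.0, 0.0, 0.0, 0.0]),
    nms=lambda r: r,
)
```

The key structural difference from R-CNN is visible here: the backbone runs once per image, and only the lightweight RoI pooling and head run per proposal.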




3. Improvements of Fast R-CNN over R-CNN

(1) Fast R-CNN still uses selective search to generate about 2,000 proposals, but instead of feeding each proposal into the convolutional network, it feeds the whole original image through the network once to obtain a feature map, and then uses the proposals to extract feature regions from that map. The advantage is that the original proposals overlap heavily, so per-proposal convolution repeats a great deal of computation; computing the convolution only once per image location greatly reduces the cost.
(2) Because the proposals differ in size, the extracted feature regions must be converted to a common size. This step is achieved by the ROI pooling layer (ROI, region of interest, i.e. the target region).
(3) There are no SVM classifiers or separate regressors in Fast R-CNN; both the classification and the location and size of the predicted boxes are output by the convolutional neural network itself.
(4) To speed up computation, the network finally uses truncated SVD to compress the fully connected layers.
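The SVD trick in point (4) can be illustrated with NumPy: a fully connected weight matrix W is replaced by its truncated SVD, turning one large matrix-vector product into two much smaller ones. A minimal sketch with a toy low-rank layer (the real Fast R-CNN fc layers are e.g. 4096 x 4096):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected layer with deliberately low rank for the demo.
n, r = 512, 64
W = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-64 weights
x = rng.standard_normal(n)

# Truncated SVD: W ~= U_t @ diag(s_t) @ Vt_t, keeping the top-t singular values.
# One n x n layer becomes two thin layers (n x t and t x n): fewer multiply-adds.
t = 64
U, s, Vt = np.linalg.svd(W, full_matrices=False)
y_svd = U[:, :t] @ (s[:t] * (Vt[:t, :] @ x))   # two small products

params_before = W.size           # n * n weights in the original layer
params_after = n * t + t * n     # the two thin replacement layers
```

Here the approximation is exact because rank(W) <= t; in practice t is chosen smaller than the effective rank, trading a small accuracy drop for a large speedup at test time.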

4. ROI Pooling (Region-of-Interest Pooling)

  Its input is a feature map and its output is a fixed-size channels x H x W tensor. ROI pooling maps region proposals of different sizes onto a fixed-size (H x W) grid: given the position coordinates of a proposal, it pools the corresponding region of the feature map into a fixed-size feature map for the subsequent classification and box-regression outputs. This speeds up processing.

ROI pooling therefore has two inputs: the feature map produced by passing the image through the CNN, and the region boundaries. Its output is a tensor of shape region_nums x channels x H x W.

  RoI pooling can be regarded as a simplified version of SPP. The original SPP concatenates pooled features from multiple scales into a new feature, whereas RoI pooling uses a single scale, which is enough to map a feature matrix of any size to a fixed size. Concretely, the paper divides the region's height and width into a 7x7 grid of small blocks and applies max pooling within each block, leaving the channel dimension unchanged. This fixes the output dimension, and because RoI pooling is not multi-scale, back-propagating gradients through it is straightforward, which makes fine-tuning the convolutional layers possible. (SPP Net cannot fine-tune its convolutional layers.)
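The 7x7 max-pooling scheme described above can be sketched in NumPy. This is a simplified single-RoI version (real implementations also handle batching, the backbone stride, and sub-pixel coordinates):

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=7):
    """Max-pool one RoI on a (C, H, W) feature map into (C, out_size, out_size).

    `roi` = (x1, y1, x2, y2) in feature-map coordinates (already projected
    from the image onto the feature map).
    """
    c = feature_map.shape[0]
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2, x1:x2]
    h, w = region.shape[1:]
    out = np.empty((c, out_size, out_size))
    # Split the region into an out_size x out_size grid of roughly equal
    # bins and take the max inside each bin (channels are untouched).
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            bin_ = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = bin_.max(axis=(1, 2))
    return out
```

Whatever the RoI's size, the output is always (C, 7, 7), which is exactly what lets the fully connected layers that follow accept proposals of arbitrary shape.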



5. SPP Net

  In R-CNN, the selective search method generates nearly 2,000 region proposals on the original image; each proposal is resized to a fixed size and fed into the CNN. That is, a single image requires about 2,000 forward inference passes, with a great deal of redundant computation.
  The main contributions of SPP Net are shared convolution computation and spatial pyramid pooling, so that each image needs only one forward pass through the CNN. (In R-CNN, resizing each image patch to a fixed size also inevitably introduces deformation and distortion.)

The first row in the figure is the R-CNN pipeline, which feeds every image patch into the network.
The second row is the SPP approach, which greatly reduces the amount of computation. The concrete method and process are described in detail below.
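For contrast with single-scale RoI pooling, spatial pyramid pooling can be sketched as follows: the feature map is max-pooled on several grid sizes and the results are concatenated, giving a vector of the same length for any input H x W. A minimal NumPy sketch (the grid levels here are illustrative, not the exact ones from the SPP paper):

```python
import numpy as np

def spp(feature_map, levels=(4, 2, 1)):
    """Spatial pyramid pooling: max-pool a (C, H, W) map on several grid
    sizes and concatenate, yielding a fixed-length vector for any H, W."""
    c, h, w = feature_map.shape
    parts = []
    for n in levels:
        # Split H and W into n roughly equal bins at this pyramid level.
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                bin_ = feature_map[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                      xs[j]:max(xs[j + 1], xs[j] + 1)]
                parts.append(bin_.max(axis=(1, 2)))
    return np.concatenate(parts)   # length = C * (16 + 4 + 1) for (4, 2, 1)
```

Two feature maps of completely different sizes produce vectors of the same length, which is what removes the fixed-input-size constraint. RoI pooling is the special case with a single level.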

Training process: take the ImageNet-pretrained AlexNet model and run the image through a forward pass to obtain the conv5 features; use the selective search algorithm to obtain candidate boxes on the original image, map these candidate boxes onto the conv5 feature map, and extract the corresponding SPP features; then fine-tune the fully connected layers on the SPP features (the convolutional layers are not fine-tuned) to obtain fully-connected-layer features. The rest is consistent with R-CNN: the fully-connected features are fed into SVM classifiers for classification, and a bounding-box linear-regression model is trained on the SPP features to correct the positions of the candidate boxes.
Inference process: consistent with training, the original image is fed into the CNN, the SPP features are extracted for the selective-search proposals, the fully connected layers produce the classification features, which are fed into the SVM classifiers, and the SPP features are fed into the bounding-box regression model.

6. Multi-task loss function

The multi-task loss is

L(p, u, t^u, v) = L_cls(p, u) + λ · [u ≥ 1] · L_loc(t^u, v)

where the indicator [u ≥ 1] equals 1 for foreground classes and 0 for the background class (u = 0), so the localization loss is applied only to foreground regions.
Here p = (p0, p1, ...) is the softmax probability distribution predicted by the classifier,
u is the ground-truth class label of the target, t^u is the regression parameters predicted by the bounding-box regressor for class u, and v is the regression parameters of the ground-truth box.

The classification loss is the negative log-likelihood loss. (Since p is computed with a softmax, this is equivalent to computing the classification loss with CrossEntropyLoss.)

L_cls(p, u) = −log p_u
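Putting the two terms together, here is a minimal NumPy version of the multi-task loss, using smooth L1 for L_loc as in the paper (the probabilities and box targets below are purely illustrative):

```python
import numpy as np

def smooth_l1(x):
    # Robust loss used for the localization term L_loc in Fast R-CNN.
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def multitask_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = -log p_u + lam * [u >= 1] * sum_i smooth_l1(t^u_i - v_i)."""
    l_cls = -np.log(p[u])                      # NLL of the true class
    l_loc = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() if u >= 1 else 0.0
    return l_cls + lam * l_loc

# Foreground example: 3-way softmax output, true class u = 1,
# box regression slightly off in the first coordinate.
p = np.array([0.1, 0.7, 0.2])
loss = multitask_loss(p, u=1, t_u=[0.5, 0.0, 0.0, 0.0], v=[0.0, 0.0, 0.0, 0.0])
```

For a background region (u = 0) only the classification term contributes, matching the [u ≥ 1] indicator in the loss.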

3. Fast R-CNN problems and shortcomings

1. Long training and inference time: candidate regions are generated by an external selective search step rather than learned inside the network, and this step alone is slow; at inference time the whole image must be forward-propagated and thousands of proposals processed, so the overall computation remains time-consuming.

2. Fixed output size of ROI pooling: the ROI pooling operation maps candidate regions of different sizes to a fixed-size feature map. This fixed-size mapping can lose or distort information, especially for very small or very large object regions.

3. The quality of the candidate region generator: Fast R-CNN relies on selective search to generate candidate regions, and the quality of this generator directly affects detection accuracy. If it cannot provide proposals that actually contain the objects, the final detection results suffer.

4. Reliance on pre-trained models: Fast R-CNN is usually fine-tuned from a pre-trained convolutional neural network (CNN), so it depends on the choice and quality of that model. A pre-trained model that is not accurate enough or not suited to the specific task can hurt Fast R-CNN's performance.

5. Hand-crafted candidate region generation: Fast R-CNN still uses a sliding-window-style method (selective search) to generate candidate regions, which can require a large amount of computation on large images, since the entire image must be scanned.

  On a GPU, Fast R-CNN needs only about 0.32 s per image for network inference, but selective search takes about 2 s. In other words, selective search severely limits Fast R-CNN's speed and is the main bottleneck. (Faster R-CNN later proposed the RPN network to solve this problem.)


Origin blog.csdn.net/qq_55433305/article/details/131367568