Classic object detection algorithms: the basic ideas and network structures of RCNN, Fast RCNN, and Faster RCNN

1. Basic introduction to object detection

1.1 What is object detection?

Object detection means finding the targets we care about in an image and determining their categories and positions; it is one of the core problems in computer vision. Because objects vary in appearance, color, and size, and imaging brings challenges such as lighting changes and occlusion, object detection has been the subject of continuous research and optimization.

1.2 Classification of object detection algorithms

Traditional object detection methods are built on hand-crafted features such as SIFT (Scale-Invariant Feature Transform) and HOG (Histogram of Oriented Gradients), and on detectors such as DPM (a component-based detection algorithm).

Deep-learning-based object detection algorithms fall into two categories: two-stage algorithms and one-stage algorithms.

  • Two-stage algorithms: first generate region proposals (candidate boxes), then perform classification and regression refinement with a convolutional neural network. Common algorithms include RCNN, SPPNet, Fast RCNN, Faster RCNN, and R-FCN. Two-stage algorithms give more accurate detections.
  • One-stage algorithms: no candidate boxes are generated; the network extracts features and directly predicts the class and location of each object. Common algorithms include SSD, the YOLO series, and RetinaNet. One-stage algorithms are faster.

2. RCNN

2.1 Introduction to RCNN

The RCNN (Regions with CNN features) algorithm appeared in 2014. It is the pioneering work that applied deep learning to object detection: with the excellent feature-extraction ability of convolutional neural networks, detection performance improved dramatically.

RCNN raised the detection mAP from 35.1% to 53.7% on the PASCAL VOC2012 dataset, made CNNs the norm in object detection, and prompted the community to explore the great potential of CNNs in other computer vision tasks.

Paper: "Rich feature hierarchies for accurate object detection and semantic segmentation"
Author: Ross Girshick
Source code (provided by the author): https://github.com/rbgirshick/rcnn

2.2 RCNN algorithm flow

RCNN inherits the idea of traditional detection pipelines and treats detection as a classification problem: first extract a set of candidate regions, then classify each candidate region.

The algorithm consists of the following 4 steps:

(1) Generate candidate regions:

Use a region proposal algorithm (such as Selective Search) to split the image into small regions, then merge regions that are likely to belong to the same object and output the merged regions as candidates. Certain merging strategies are required here, and different candidate regions will overlap, as shown in the figure below (the black boxes are candidate regions):

(figure: overlapping candidate regions)

Around 1000-2000 candidate regions are generated (take 2000 as an example), and each region is then normalized, i.e. warped to a fixed size (227×227).
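
As a rough illustration, this proposal step can be sketched with OpenCV's selective search implementation (a minimal sketch assuming the opencv-contrib-python package is installed; "example.jpg" is a hypothetical input file):

```python
import cv2

img = cv2.imread("example.jpg")  # hypothetical input image

# OpenCV's selective search (ships with opencv-contrib-python)
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # the faster, coarser merging strategy
rects = ss.process()               # array of (x, y, w, h) candidate regions

# Keep at most 2000 proposals and warp each to the fixed CNN input size
warped = []
for (x, y, w, h) in rects[:2000]:
    region = img[y:y + h, x:x + w]
    warped.append(cv2.resize(region, (227, 227)))
```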

(2) Use a CNN to extract features from each candidate region:

Here a pre-trained neural network (such as AlexNet or VGG) is chosen in advance and its fully connected layers are retrained on the detection data, i.e. fine-tuning is applied.

Each candidate region is fed into the trained AlexNet CNN to obtain a fixed-dimensional feature vector (4096 dimensions), yielding a 2000×4096 feature matrix.
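
A minimal sketch of this feature-extraction step with torchvision's pretrained AlexNet (the `weights` argument assumes a recent torchvision version; 200 random tensors stand in for the warped proposals):

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained AlexNet; dropping the last FC layer leaves the 4096-d fc7 output
alexnet = models.alexnet(weights="DEFAULT")
alexnet.classifier = nn.Sequential(*list(alexnet.classifier.children())[:-1])
alexnet.eval()

# 200 warped 227x227 regions (R-CNN uses ~2000); batched to limit memory
regions = torch.rand(200, 3, 227, 227)
with torch.no_grad():
    features = torch.cat([alexnet(batch) for batch in regions.split(64)])
print(features.shape)  # torch.Size([200, 4096])
```
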
(3) Use one SVM classifier per class to classify the CNN output features:

Take the PASCAL VOC dataset as an example: it has 20 categories, so 20 SVM classifiers are used.

Multiplying the 2000×4096 feature matrix by the 4096×20 weight matrix formed by the 20 SVMs gives a 2000×20 matrix, where each row contains the scores of one candidate region for the 20 categories.

Non-maximum suppression is then performed on each column (i.e. each category) of this 2000×20 matrix to remove overlapping proposal boxes, keeping only the highest-scoring candidate boxes for that category.

Non-maximum suppression removes overlapping proposal boxes as follows:

Step 1: Define the IoU (Intersection over Union) metric: (A∩B) / (A∪B), i.e. the ratio of the overlapping area of A and B to the area of their union. Intuitively, IoU measures how much A and B overlap; the larger the IoU, the greater the overlap and the more similar the two boxes.

Step 2: For each category, find the highest-scoring region among the 2000 candidates, compute the IoU between every other region and it, and delete all candidates whose IoU exceeds the threshold. Repeating this keeps only a few candidates with low overlap and removes duplicated regions.

For example, suppose A is the highest-scoring box among all candidates for the sunflower class and B is another box. If the IoU of A and B exceeds the threshold, A and B are considered to cover the same object (the same sunflower), so A is kept and B is deleted. This is non-maximum suppression.
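
A minimal NumPy sketch of this greedy procedure (boxes in (x1, y1, x2, y2) format; the 0.3 threshold is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression for one class.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) class scores."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]                        # highest-scoring remaining box
        keep.append(i)
        # IoU of box i with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # drop heavily overlapping boxes
    return keep
```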


(4) Use regressors to refine the positions of the candidate regions:

The positions produced by the Selective Search algorithm are not necessarily accurate, so 20 regressors perform regression on the boxes remaining in each of the 20 categories, yielding the corrected target region for each category. The idea is as follows:

Here the yellow box denotes the candidate region (Region Proposal), the green window the manually annotated ground truth, and the red window the predicted region after the Region Proposal is regressed. This linear regression problem can be solved with least squares.

The regressor produces four parameters for each candidate region: the x and y offsets of its center and the scaling factors of its width and height. These four parameters fine-tune the candidate region's position to produce the red predicted region.
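
A minimal sketch of how these four parameters adjust a box, following the standard R-CNN parameterization (box coordinates are (x1, y1, x2, y2)):

```python
import numpy as np

def apply_bbox_deltas(box, deltas):
    """Apply R-CNN-style regression deltas (dx, dy, dw, dh) to a proposal box."""
    w = box[2] - box[0]
    h = box[3] - box[1]
    cx = box[0] + 0.5 * w
    cy = box[1] + 0.5 * h
    dx, dy, dw, dh = deltas
    # shift the centre by a fraction of the width/height, scale w/h exponentially
    pred_cx = cx + dx * w
    pred_cy = cy + dy * h
    pred_w = w * np.exp(dw)
    pred_h = h * np.exp(dh)
    return [pred_cx - 0.5 * pred_w, pred_cy - 0.5 * pred_h,
            pred_cx + 0.5 * pred_w, pred_cy + 0.5 * pred_h]
```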


2.3 RCNN flow chart

(figure: RCNN flow chart)

2.4 RCNN framework

RCNN consists of four parts: the Selective Search (SS) algorithm, the CNN, the SVMs, and the bbox regressors.

(figure: RCNN framework)

2.5 Disadvantages of RCNN

(1) Training and testing are slow and require multi-step training, which is very cumbersome.

(2) Because the classifier ends in fully connected layers, the candidate regions fed to the CNN must be warped to a fixed size, which degrades accuracy.

(3) Candidate-region features must be extracted and saved in advance, which consumes a lot of space. For a very deep network such as VGG16, the features extracted from the 5000 images of the VOC07 training set require hundreds of GB of storage, which is a fatal problem.

RCNN became the state-of-the-art (SOTA) object detection algorithm of its time. Although it is rarely used now, its ideas are still well worth studying.

3. Fast RCNN

3.1 Introduction to Fast RCNN

After RCNN, SPPNet solved two of its problems: repeated convolution computation and the fixed input size. SPPNet's main contribution is to compute a global feature map over the entire image just once; for a specific candidate proposal, the features are simply taken from the global feature map at the corresponding coordinates. However, SPPNet still has drawbacks: features must still be saved to disk, and it remains slow.

The Fast RCNN algorithm was proposed by Ross Girshick in 2015, improving on both RCNN and SPPNet. As the name suggests, Fast RCNN is faster and stronger: its detector is trained in a single, nearly end-to-end stage. Based on the VGG16 network, it trains 9 times faster than RCNN and tests 213 times faster, reaching 68.4% mAP on the PASCAL VOC2012 dataset.

Paper: "Fast R-CNN"
Source code (provided by the author): https://github.com/rbgirshick/fast-rcnn

3.2 Fast RCNN algorithm flow

(1) Generate 1K~2K candidate regions for the image (using the Selective Search algorithm, abbreviated SS); each candidate region is called an RoI (Region of Interest).

(2) Input the whole image into the network to obtain its feature map, and project the candidate boxes generated by the SS algorithm onto the feature map to obtain the corresponding feature matrices.

R-CNN vs Fast RCNN:

R-CNN feeds the 2000 candidate regions into the convolutional neural network one by one to extract features, which involves a great deal of redundant computation and takes a long time.
Fast RCNN sends the entire image through the network once and computes the features of the whole image in a single pass, so the feature map of any candidate region can be read off by its coordinates without repeated computation.

(3) Scale each feature matrix to a 7x7 feature map through the ROI pooling layer.

As mentioned earlier, RCNN must warp each candidate region to a fixed size (227×227), whereas Fast RCNN does not: the RoI pooling layer converts the feature map of each candidate region to 7×7, as shown below:
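
This pooling is available directly in torchvision; a minimal sketch with a dummy feature map (the box coordinates are arbitrary):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)   # batch of 1, 256 channels, 50x50

# One RoI as (batch_index, x1, y1, x2, y2) in feature-map coordinates
rois = torch.tensor([[0, 4.0, 4.0, 30.0, 20.0]])

# Every RoI is pooled to a fixed 7x7 grid regardless of its original size
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([1, 256, 7, 7])
```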


(4) Flatten the feature map into a vector, and obtain the prediction result through a series of fully connected layers and softmax.

3.3 Fast RCNN flow chart

3.3.1 Overall process

(figure: Fast RCNN overall flow)

As shown in the figure, an image is passed through the Deep ConvNet to obtain its feature map (conv feature map); using the coordinate mapping between each RoI and the whole image (RoI projection), the feature matrix of every candidate region (RoI) is extracted from that feature map.

Each feature matrix is passed through the RoI pooling layer, pooled to a fixed size (7×7), and flattened into a vector; after two fully connected layers (FC), the RoI feature vector is obtained.

The RoI feature vector then feeds two parallel FC branches: one for class probability prediction (softmax) and one for bounding box parameter regression (bbox regressor, where bbox means bounding box).
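
A minimal sketch of these two parallel branches (the class name FastRCNNHead is illustrative; N = 20 for PASCAL VOC):

```python
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Two sibling FC layers on top of the 4096-d RoI feature vector."""
    def __init__(self, in_dim=4096, num_classes=20):
        super().__init__()
        self.cls_score = nn.Linear(in_dim, num_classes + 1)        # N+1 incl. background
        self.bbox_pred = nn.Linear(in_dim, 4 * (num_classes + 1))  # 4 params per class

    def forward(self, roi_features):
        return self.cls_score(roi_features), self.bbox_pred(roi_features)
```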

3.3.2 softmax classifier

The softmax classifier outputs probabilities for N+1 categories. The PASCAL VOC2012 dataset has 20 categories, so 21 probabilities are output: the first is the background probability and the remaining 20 are the per-class probabilities. The FC layer before the softmax therefore has N+1 nodes.


3.3.3 Bounding box regressor (bbox regressor)

For each RoI, the regressor predicts a set of bounding box parameters for every one of the N+1 classes, and each set has four parameters x, y, w, h, so the FC layer of the bbox regressor has 4(N+1) nodes.

The bounding box regression parameters are applied as follows: given a proposal with center (Px, Py), width Pw, and height Ph, and predicted parameters (dx, dy, dw, dh), the corrected box is

Ĝx = Pw·dx + Px,  Ĝy = Ph·dy + Py,  Ĝw = Pw·exp(dw),  Ĝh = Ph·exp(dh)

3.4 Calculation of the loss in Fast RCNN

Because Fast RCNN must predict both the probabilities of the N+1 categories and the bounding box regression parameters, two loss functions are defined: a classification loss and a bounding box regression loss.

The total loss for one RoI is

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)

where p is the predicted class distribution, u the true class, L_cls(p, u) = −log p_u is the classification log loss, L_loc is the smooth L1 loss between the predicted box parameters t^u and the regression targets v, and the indicator [u ≥ 1] disables the regression term for background RoIs (u = 0).
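
A minimal PyTorch sketch of this multi-task loss (function and argument names are illustrative, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_logits, bbox_deltas, labels, regression_targets, lam=1.0):
    """Multi-task loss sketch: classification + bounding-box regression.
    class_logits: (R, N+1); bbox_deltas: (R, 4*(N+1));
    labels: (R,) with 0 = background; regression_targets: (R, 4)."""
    cls_loss = F.cross_entropy(class_logits, labels)

    # The regression term only counts for foreground RoIs (label > 0)
    fg = torch.nonzero(labels > 0).squeeze(1)
    deltas_per_class = bbox_deltas.view(bbox_deltas.size(0), -1, 4)
    fg_deltas = deltas_per_class[fg, labels[fg]]  # deltas of each RoI's true class
    reg_loss = F.smooth_l1_loss(fg_deltas, regression_targets[fg], reduction="sum")

    return cls_loss + lam * reg_loss / labels.numel()
```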

3.5 Fast RCNN framework

First review the framework of RCNN:

(figure: RCNN framework)

RCNN consists of four parts, so multi-step training is required, which is very cumbersome.

Fast RCNN fuses feature extraction (CNN), classification (SVM in RCNN, now softmax), and bounding box regression into a single CNN. Fast RCNN therefore has only two parts: first obtain candidate boxes with the SS algorithm, then complete feature extraction, classification, and bounding box regression in one CNN.

(figure: Fast RCNN framework)

Naturally, the next step, taken by the Faster RCNN algorithm, is to move region proposal into the CNN as well, merging the whole pipeline into one network and achieving truly end-to-end object detection.

3.6 Disadvantages of Fast RCNN

1. Although a GPU is used for the network, region proposal still runs on the CPU: extracting the candidate regions of one image with the SS algorithm takes about 2s on the CPU, while the entire CNN takes only about 0.32s. The computational bottleneck of Fast RCNN is therefore region proposal.

2. It cannot satisfy real-time applications, and training and testing are still not truly end-to-end.

4. Faster RCNN

4.1 Introduction to Faster RCNN

Faster RCNN is another masterpiece involving Ross Girshick (first-authored by Shaoqing Ren) after RCNN and Fast RCNN. Also using VGG16 as the backbone, its inference speed reaches 5 fps on a GPU (including generating candidate regions), its accuracy is further improved, and it won first place in several tracks of the 2015 ILSVRC and COCO competitions.

4.2 Faster RCNN algorithm flow

Faster RCNN = RPN + Fast RCNN

RPN stands for Region Proposal Network. In Faster RCNN, the RPN replaces the SS algorithm used by Fast RCNN.

Algorithm flow:

(1) Input the image into the network to obtain the corresponding feature map.

(2) Use the RPN to generate candidate boxes, and project them onto the feature map to obtain the feature matrix of each RoI region.

(3) Scale the feature matrix of each RoI region to a 7x7 feature map through the RoI pooling layer, flatten it into a vector, and obtain the prediction result through a series of fully connected layers.

The basic structure of the Faster RCNN network is as follows:

(figure: Faster RCNN network structure)
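
For a quick feel of the whole pipeline, torchvision ships a ready-made Faster R-CNN (with a ResNet-50 FPN backbone rather than the paper's VGG16; the weights argument assumes a recent torchvision version):

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 600, 1000)   # a dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])   # the model takes a list of images

# Each prediction holds the detected boxes, class labels, and scores
print(predictions[0]["boxes"].shape)
print(predictions[0]["labels"][:5], predictions[0]["scores"][:5])
```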

4.3 RPN network

4.3.1 RPN network structure

(figure: RPN network structure)
The conv feature map in the figure is the feature map produced by the backbone from the input image. A sliding window over it produces a 256-d vector at each position, which feeds two fully connected layers that output classification scores and bounding box regression coordinates respectively. Here k is the number of anchor boxes per position: the 2k scores are the probabilities of each anchor being foreground or background (note that only foreground and background are distinguished at this stage; every object category counts as foreground), and there are 4k coordinates because each anchor box has four regression parameters.
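
In practice the sliding window and the two sibling output layers are implemented as a 3×3 convolution followed by two 1×1 convolutions; a minimal sketch (the class name RPNHead and the channel count are illustrative):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """RPN sliding-window head: a 3x3 conv followed by two sibling
    1x1 convs for objectness scores and box deltas, k anchors per position."""
    def __init__(self, in_channels=256, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, 2 * k, 1)   # 2k foreground/background scores
        self.reg = nn.Conv2d(in_channels, 4 * k, 1)   # 4k box regression parameters
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feature_map):
        t = self.relu(self.conv(feature_map))
        return self.cls(t), self.reg(t)
```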

4.3.2 Definition of anchor

So what is an anchor?

First of all, it must be clear that the anchor is not a candidate box (Proposal), and the difference between the two will be mentioned later.

Pick a point on the feature map; it corresponds to a pixel in the original image. Draw 9 boxes of different sizes and aspect ratios centered on that pixel: these are the anchors. As shown in the figure below, an anchor may or may not contain a target. Since the sizes and aspect ratios of the objects in an image are not fixed, 9 anchors of different sizes and aspect ratios are used for prediction.

(figure: 9 anchors centered on one pixel)
So why are there 9 anchors?

The area and aspect ratio of each anchor are given in the paper: three scales (areas of 128², 256², 512² pixels) and three aspect ratios (1:1, 1:2, 2:1).

Therefore, each position in the feature map generates 3 × 3 = 9 anchors in the original image: in the paper's illustration, the three blue anchors have an area of 128×128, the red ones 256×256, and the green ones 512×512.
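
A minimal sketch of anchor generation at a single position (scales and ratios as in the paper):

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate the 9 anchors centred at (cx, cy): each anchor keeps the
    area scales[i]**2 while its width/height ratio varies over ratios."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)   # chosen so that w * h == s * s and w / h == r
            h = s / np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(make_anchors(0, 0).shape)  # (9, 4) boxes in (x1, y1, x2, y2) format
```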

4.3.3 The process of RPN generating proposals

For a 1000×600×3 image (three channels), the backbone produces a feature map of roughly 60×40, so there are 60×40×9 ≈ 20,000 anchors in total. After ignoring anchors that cross the image boundary, about 6000 anchors remain.

For these 6000 anchors, each anchor is adjusted into a proposal using the bounding box regression parameters produced by the RPN (as mentioned earlier, the RPN outputs 2 probabilities and 4 regression parameters per anchor). Here you can see the difference between an anchor and a proposal: this adjustment is exactly how the RPN generates candidate boxes.

The candidate boxes generated by the RPN overlap heavily, so non-maximum suppression is applied based on each box's cls score with an IoU threshold of 0.7, leaving only about 2000 candidate boxes per image.
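
A minimal sketch of this filtering step with torchvision's built-in NMS (random boxes stand in for the decoded proposals):

```python
import torch
from torchvision.ops import nms

# ~6000 decoded proposals as (x1, y1, x2, y2) plus their objectness scores
proposals = torch.rand(6000, 4) * 300
proposals[:, 2:] += proposals[:, :2]   # guarantee x2 > x1 and y2 > y1
scores = torch.rand(6000)

keep = nms(proposals, scores, iou_threshold=0.7)  # kept indices, best first
proposals = proposals[keep[:2000]]                # top 2000 proposals per image
```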

4.4 Faster RCNN framework

(figure: Faster RCNN framework)
Faster RCNN goes a step further than Fast RCNN by integrating candidate box generation into the CNN, so that candidate box generation, feature extraction, candidate box classification, and bounding box regression are all combined in a single network. Step-by-step training is avoided and truly end-to-end object detection is achieved.

5. Comparison of the three: RCNN, Fast RCNN, Faster RCNN

All three are two-stage algorithms. Comparison of their network frameworks:

(figure: network framework comparison)

From RCNN to Fast RCNN to Faster RCNN, the network framework becomes more and more concise, and the object detection results get better and better.

Comparison of the advantages and disadvantages of the three:

(figure: advantages and disadvantages comparison table)

