Evolution of object detection technology based on deep learning: R-CNN, Fast R-CNN, Faster R-CNN [repost]

The original author's writing is clear and easy to follow, so it is reposted here directly. Original link: https://www.cnblogs.com/skyfsm/p/6806246.html

My understanding of object detection is: given a picture, accurately find the location of the object and mark its category. So the problem object detection has to solve is the whole pipeline of where the object is and what it is. However, this problem is not so easy to solve: object sizes vary widely, the angle and pose of the object are uncertain, the object can appear anywhere in the picture, and on top of that it can belong to any of many categories.

Evolution of object detection technology:
R-CNN -> SPP-Net -> Fast R-CNN -> Faster R-CNN

Starting from the task of image recognition, here is an image task:
not only to recognize the object in the picture, but also to frame its position with a box.

 

In professional terms, the above task is: image recognition + localization.
Image recognition (classification):
Input: picture
Output: object category
Evaluation metric: accuracy

Localization:
Input: picture
Output: the position of the box in the picture (x, y, w, h)
Evaluation metric: intersection over union (IoU)
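
IoU measures how well a predicted box overlaps the ground-truth box: the area of their intersection divided by the area of their union. A minimal sketch in plain Python (boxes given as (x, y, w, h) as above; the function name is just for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh

    # Overlap rectangle (zero if the boxes do not intersect).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Example: two partially overlapping boxes.
print(iou((0, 0, 10, 10), (5, 5, 10, 10)))  # ~0.143
```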

A convolutional neural network (CNN) already solves the image recognition task for us (deciding whether the picture is a cat or a dog), so we only need to add some extra functionality to complete the localization task.

What are the solutions to the problem of positioning?
Idea 1: Treat it as a regression problem
As a regression problem, we need to predict the values of the four parameters (x, y, w, h) to get the position of the box.



Step 1:
  • First solve the simpler problem: build a neural network that recognizes the image
  • Fine-tune AlexNet, VGG, or GoogLeNet

 

Step 2:
  • Extend the tail of the above network (that is, keep the front of the CNN unchanged and improve its end by adding two heads: a "classification head" and a "regression head"; see the sketch after this list)
  • It becomes a classification + regression model
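
A minimal PyTorch sketch of what "adding two heads" could look like (assuming a recent torchvision for the pretrained AlexNet; the layer sizes and the choice of 20 classes are illustrative, not taken from any specific paper):

```python
import torch.nn as nn
import torchvision

class ClassifyAndLocalize(nn.Module):
    """Shared CNN trunk with a classification head and a regression head."""
    def __init__(self, num_classes=20, feat_dim=4096):
        super().__init__()
        backbone = torchvision.models.alexnet(weights="DEFAULT")  # pretrained classifier
        self.features = backbone.features
        self.avgpool = backbone.avgpool
        # Keep the fully connected trunk up to fc7; drop the original 1000-way classifier.
        self.trunk_fc = nn.Sequential(*list(backbone.classifier.children())[:-1])
        self.cls_head = nn.Linear(feat_dim, num_classes)  # "classification head"
        self.reg_head = nn.Linear(feat_dim, 4)            # "regression head": (x, y, w, h)

    def forward(self, x):
        x = self.avgpool(self.features(x)).flatten(1)
        x = self.trunk_fc(x)
        return self.cls_head(x), self.reg_head(x)
```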


Step 3:
  • Use a Euclidean distance (L2) loss for the regression part
  • Train with SGD (a training-step sketch follows)
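
A hedged sketch of the corresponding training step (cross-entropy for the classification head, an MSE/Euclidean loss for the regression head, optimized with SGD; `model` is the two-headed sketch above and the equal loss weighting is an arbitrary choice):

```python
import torch
import torch.nn as nn

model = ClassifyAndLocalize()          # the two-headed network sketched above
criterion_cls = nn.CrossEntropyLoss()  # for the classification head
criterion_reg = nn.MSELoss()           # Euclidean-distance-style loss for (x, y, w, h)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels, boxes):
    """images: (N, 3, H, W); labels: (N,) class ids; boxes: (N, 4) normalized (x, y, w, h)."""
    optimizer.zero_grad()
    cls_logits, box_preds = model(images)
    loss = criterion_cls(cls_logits, labels) + criterion_reg(box_preds, boxes)
    loss.backward()
    optimizer.step()
    return loss.item()
```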

 

Step 4:
  • In the prediction phase, attach both heads
  • Each head completes a different function

 


Here you need to fine-tune twice:
the first time on AlexNet; the second time, swap the head for a regression head, keep the front unchanged, and fine-tune again.

 

Where is the Regression part added?

There are two processing methods:
  • After the last convolutional layer (e.g., VGG)
  • After the last fully connected layer (e.g., R-CNN)

 

Regression is hard to train well, so we should try to convert it into a classification problem whenever possible.
Regression parameters take much longer to converge, so the network above uses the classification network to compute the connection weights of the shared part of the network.

 

Idea 2: take image windows
  • Still the classification + regression idea from before
  • But now take "boxes" of different sizes
  • Let the box appear at different positions and compute a confidence score for each placement
  • Take the box with the highest score


Black box in the upper left corner: score 0.5

Black box in the upper right corner: score 0.75

Black box at bottom left: score 0.6

Black box at bottom right: score 0.8

According to the score, we choose the black box in the lower right corner as the prediction of the target location.
Note: sometimes the two boxes with the highest scores are both kept, and the intersection of the two boxes is taken as the final position prediction.

Question: how big should the box be?
Take boxes of different sizes and slide each one from the top-left corner to the bottom-right corner in turn. Very brute force.

To sum up the idea:
for a picture, crop patches using boxes of various sizes (traversing the entire picture), feed each crop to the CNN, and have the CNN output the classification of that crop together with the corresponding x, y, w, h of the box (regression).
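
A brute-force sketch of this sliding-window idea (pure illustration: `score_patch` stands in for running the crop through the CNN, and the window sizes and stride are arbitrary):

```python
def sliding_window_detect(image, score_patch, window_sizes=((64, 64), (128, 128)), stride=32):
    """Slide windows of several sizes over the image and keep the best-scoring box.

    image: array of shape (H, W, C); score_patch(crop) -> confidence score (a CNN in practice).
    Returns (score, (x, y, w, h)) of the highest-scoring window.
    """
    height, width = image.shape[:2]
    best = (float("-inf"), None)
    for win_w, win_h in window_sizes:
        for y in range(0, height - win_h + 1, stride):
            for x in range(0, width - win_w + 1, stride):
                crop = image[y:y + win_h, x:x + win_w]
                score = score_patch(crop)
                if score > best[0]:
                    best = (score, (x, y, win_w, win_h))
    return best
```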


This method is far too time-consuming and needs to be optimized.
The original network looks like this:



The optimization looks like this: replace the fully connected layers with convolutional layers, which speeds things up.
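
The idea is that a fully connected layer applied to a C x H x W feature map is mathematically the same as a convolution whose kernel covers the whole H x W map, so the converted network can then slide over a larger image in a single forward pass. A small PyTorch sketch of the conversion (the shapes are illustrative):

```python
import torch
import torch.nn as nn

# A fully connected layer expecting a flattened 256 x 6 x 6 feature map.
fc = nn.Linear(256 * 6 * 6, 4096)

# The equivalent convolution: the kernel covers the whole 6 x 6 map, with 4096 output channels.
conv = nn.Conv2d(256, 4096, kernel_size=6)
conv.weight.data.copy_(fc.weight.data.view(4096, 256, 6, 6))
conv.bias.data.copy_(fc.bias.data)

x = torch.randn(1, 256, 6, 6)
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4)

# On a larger input, the "fully connected" layer now slides like a convolution,
# producing a grid of outputs instead of a single vector.
y = conv(torch.randn(1, 256, 10, 10))
print(y.shape)   # torch.Size([1, 4096, 5, 5])
```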

 

Object Detection
What do we do when the image contains many objects? The difficulty suddenly increases.

The task then becomes: multi-object recognition + localization of multiple objects.
So can we still treat this task as a classification problem?

What's wrong with seeing it as a classification problem?
  • You need to find a lot of positions for a lot of boxes of different sizes
  • You also need to classify the images inside the boxes
  • Of course, if your GPU is very powerful, well, go for it...

Treating it as classification, is there any way to optimize? I don't want to try so many boxes at so many positions!
Someone came up with a good method:
find the boxes that are likely to contain objects (that is, candidate boxes; say, 1000 of them). These boxes may overlap and contain each other, so we avoid brute-force enumeration of all possible boxes.



Experts have invented many methods for selecting candidate boxes, such as EdgeBoxes and Selective Search.
Below is a performance comparison of various methods for selecting candidate boxes.



One big question remains: how does the selective search algorithm used to extract candidate boxes actually choose them? For that you would have to read its paper carefully, so I won't introduce it here.
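
If you just want to see what selective search produces without reading the paper, OpenCV's contrib package ships an implementation; a hedged sketch (requires opencv-contrib-python; "test.jpg" is a placeholder path):

```python
import cv2

img = cv2.imread("test.jpg")  # placeholder image path
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # or switchToSelectiveSearchQuality() for more proposals
rects = ss.process()               # candidate boxes as (x, y, w, h)
print(len(rects), "candidate boxes, first few:", rects[:3])
```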


R-CNN was born
Based on the above ideas, R-CNN appeared.

Step 1: Train (or download) a classification model (such as AlexNet)

Step 2: Fine-tuning the model
  • Change the number of classes from 1000 to 20
  • Remove the last fully connected layer


Step 3: Feature extraction
  • Extract all candidate boxes of the image (selective search)
  • For each region: warp it to the CNN's input size, run a forward pass, and save the output of the fifth pooling layer (i.e., the features extracted for that candidate box) to disk

Step 4: Train an SVM classifier (binary classification) to determine the category of the object in each candidate box.
Each category has its own SVM, which decides whether a box belongs to that category: if so, it is a positive sample; otherwise, negative.
The figure below, for example, shows the SVM for the dog class.
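
A minimal sketch of this per-class SVM step with scikit-learn (the feature array and labels below are random placeholders standing in for the pool5 features saved in step 3 and their positive/negative labels):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder data standing in for pool5 features saved to disk in step 3.
features = np.random.randn(1000, 9216)       # one row per candidate box
is_dog = np.random.randint(0, 2, size=1000)  # 1 = positive (dog), 0 = negative

dog_svm = LinearSVC(C=1.0)    # one binary SVM per class
dog_svm.fit(features, is_dog)

# At test time, score the features of new candidate boxes.
scores = dog_svm.decision_function(features[:5])
print(scores)
```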


Step 5: Use a regressor to fine-tune the candidate box position: for each class, train a linear regression model to adjust the box so that it frames the object more accurately.

 

 

In the evolution of R-CNN, the ideas of SPP Net contributed a great deal, so here is a brief introduction to SPP Net.

SPP Net
SPP: Spatial Pyramid Pooling (spatial pyramid pooling)
It has two characteristics:

1. Combine the spatial pyramid method to realize multi-scale input for CNNs.
Generally, a CNN is followed by fully connected layers or a classifier, which require a fixed input size, so the input data has to be cropped or warped. This preprocessing causes information loss or geometric distortion. The first contribution of SPP Net is to bring the pyramid idea into CNNs, realizing multi-scale data input.

As shown in the figure below, an SPP layer is added between the convolutional layers and the fully connected layers. The input of the network can now be of any scale: in the SPP layer, each pooling window is resized according to the input, so the output of the SPP layer always has a fixed size.
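
A minimal PyTorch sketch of such an SPP layer (the pyramid levels are an illustrative choice; adaptive pooling plays the role of "resizing each pooling window according to the input"):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPLayer(nn.Module):
    """Pools a feature map into fixed grids (e.g. 4x4, 2x2, 1x1) and concatenates them."""
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.levels = levels

    def forward(self, x):                      # x: (N, C, H, W), any H and W
        pooled = [F.adaptive_max_pool2d(x, output_size=level).flatten(1)
                  for level in self.levels]
        return torch.cat(pooled, dim=1)        # (N, C * (16 + 4 + 1)), independent of H, W

spp = SPPLayer()
print(spp(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
print(spp(torch.randn(1, 256, 24, 17)).shape)  # same size despite a different input
```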

 

2. Extract convolutional features from the original image only once.
In R-CNN, each candidate box is first resized to a uniform size and then used as the input to the CNN, which is very inefficient.
SPP Net fixes this shortcoming: convolve the original image only once to get the feature map of the whole image, then find the mapped patch of each candidate box on that feature map, and feed this patch, as the convolutional features of that candidate box, into the SPP layer and subsequent layers. This saves a great deal of computation and is roughly a hundred times faster than R-CNN.


Fast R-CNN
SPP Net is a really good method. Fast R-CNN, the advanced version of R-CNN, adopts the SPP Net idea on top of R-CNN and improves it to further boost performance.

What are the differences between R-CNN and Fast RCNN?
Let's talk about R-CNN's shortcomings first: even with preprocessing steps such as selective search to extract potential bounding boxes as input, R-CNN still has a serious speed bottleneck. The reason is obvious: when the computer extracts features from all of the regions, there is a lot of repeated computation. Fast R-CNN was born to solve exactly this problem.

The experts proposed a network layer that can be regarded as a single-level SPP Net, called ROI Pooling. This layer maps inputs of different sizes to a fixed-length feature vector. We know that operations such as conv, pooling, and ReLU do not require a fixed-size input, so after applying them to the original image, the resulting feature maps differ in size when the input images differ in size, and they cannot be fed directly into a fully connected layer for classification. By inserting this magical ROI Pooling layer, however, we can extract a fixed-dimensional feature representation for each region and then do category recognition with an ordinary softmax.

In addition, R-CNN's previous pipeline was: first generate proposals, then extract features with a CNN, then classify with SVMs, and finally do bbox regression. In Fast R-CNN, the author cleverly moved the bbox regression inside the neural network and merged it with region classification into a multi-task model, and experiments proved that the two tasks can share convolutional features and promote each other. A very important contribution of Fast R-CNN is that it let people see the hope of real-time detection within the Region Proposal + CNN framework: multi-class detection really can improve processing speed while maintaining accuracy, which laid the groundwork for the later Faster R-CNN.
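
A hedged sketch of the ROI Pooling step using torchvision's built-in op (the feature map, the boxes, and the spatial_scale value are all illustrative; boxes are (batch_index, x1, y1, x2, y2) in image coordinates):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)      # conv5 output for one image
# Candidate boxes in image coordinates; the first column is the batch index.
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0],
                     [0, 50.0, 40.0, 300.0, 320.0]])

# spatial_scale maps image coordinates onto the feature map (e.g. 1/16 after conv5).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)   # torch.Size([2, 256, 7, 7]) -- fixed size regardless of box size
```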

Key points:
R-CNN has some considerable shortcomings (remove them and you get Fast R-CNN).
Big shortcoming: since every candidate box has to go through the CNN separately, it takes a great deal of time.
Solution: share the convolutional layers. Instead of feeding each candidate box into the CNN as a separate input, feed the complete image once and obtain the features of each candidate box from the fifth convolutional layer.

The original method: many candidate boxes (say two thousand) --> CNN --> features of each candidate box --> classification + regression
The current method: one complete image --> CNN --> features of each candidate box --> classification + regression

So it is easy to see why Fast R-CNN is faster than R-CNN: instead of feeding each candidate region into the deep network separately as R-CNN does, it extracts features from the whole image once and then maps each candidate box onto conv5, so (as with SPP) the features are computed only once and all remaining operations work on the conv5 feature map.

The performance improvement is also quite obvious:

Faster R-CNN
Problems with Fast R-CNN: There is a bottleneck: selective search to find all candidate boxes, which is also very time-consuming. So can we find a more efficient way to find these candidate boxes?
Solution: Add a neural network that extracts edges, that is to say, the job of finding candidate boxes is also handed over to the neural network.
A neural network that does such a task is called a Region Proposal Network (RPN).

Specific approach:
  • Put the RPN after the last convolutional layer
  • The RPN is trained directly to produce candidate regions

 

Introduction to the RPN:
  • Slide a window over the feature map
  • Build a small network for object classification (object vs. not) + box position regression (a sketch of such a head follows this list)
  • The position of the sliding window provides the coarse location of the object
  • The box regression provides a more precise box position
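
A minimal sketch of such an RPN head in PyTorch (channel sizes and the number of anchors per location are illustrative; the real RPN also needs anchor generation and proposal selection, which are omitted here):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)  # sliding window
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)  # object vs. not, per anchor
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # box deltas, per anchor

    def forward(self, feature_map):            # (N, C, H, W) conv feature map
        h = torch.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)        # (N, 2k, H, W), (N, 4k, H, W)

rpn = RPNHead()
scores, deltas = rpn(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)   # (1, 18, 38, 50) and (1, 36, 38, 50)
```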

 


One network, four loss functions (a combined-loss sketch follows the list):
  • RPN classification (anchor: object or not)
  • RPN regression (anchor -> proposal)
  • Fast R-CNN classification (over all classes)
  • Fast R-CNN regression (proposal -> box)
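
A hedged sketch of how those four terms might be combined into one training objective (cross-entropy for the two classification losses, smooth L1 for the two regressions; all inputs are placeholder tensors and the equal weighting is illustrative, not the paper's exact formulation):

```python
import torch.nn.functional as F

def faster_rcnn_loss(rpn_logits, rpn_labels, rpn_deltas, rpn_targets,
                     frcnn_logits, frcnn_labels, frcnn_deltas, frcnn_targets):
    loss_rpn_cls = F.cross_entropy(rpn_logits, rpn_labels)    # anchor: object or not
    loss_rpn_reg = F.smooth_l1_loss(rpn_deltas, rpn_targets)  # anchor -> proposal
    loss_cls = F.cross_entropy(frcnn_logits, frcnn_labels)    # proposal over all classes
    loss_reg = F.smooth_l1_loss(frcnn_deltas, frcnn_targets)  # proposal -> final box
    return loss_rpn_cls + loss_rpn_reg + loss_cls + loss_reg
```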

Speed comparison:

The main contribution of Faster R-CNN is its design of the Region Proposal Network (RPN) for extracting candidate regions, which replaces the time-consuming selective search and greatly improves detection speed.


Finally, summarize the steps of the major algorithms:
R-CNN
  1. Determine about 1000-2000 candidate boxes in the image (using selective search)
  2. Scale the image patch inside each candidate box to the same size and feed it into a CNN for feature extraction
  3. For the features extracted from each candidate box, use a classifier to decide whether it belongs to a particular class
  4. For a candidate box belonging to a certain class, use a regressor to further adjust its position

Fast R-CNN
  1. Determine about 1000-2000 candidate boxes in the image (using selective search)
  2. Feed the entire image into the CNN to get the feature map
  3. Find the mapped patch of each candidate box on the feature map, and feed this patch into the SPP layer and subsequent layers as the convolutional features of that candidate box
  4. For the features extracted from each candidate box, use a classifier to decide whether it belongs to a particular class
  5. For a candidate box belonging to a certain class, further adjust its position with a regressor

Faster R-CNN
  1. Feed the entire image into the CNN to get the feature map
  2. Feed the convolutional features into the RPN to get the feature information of the candidate boxes
  3. For the features extracted from each candidate box, use a classifier to decide whether it belongs to a particular class
  4. For a candidate box belonging to a certain class, use a regressor to further adjust its position
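
As a closing usage note, torchvision ships a ready-made Faster R-CNN that bundles all of these pieces (backbone, RPN, ROI heads); a hedged sketch of running it on one image (requires a recent torchvision; the 0.5 score threshold is arbitrary):

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)          # placeholder for a real image scaled to [0, 1]
with torch.no_grad():
    (prediction,) = model([image])       # the model takes a list of images

keep = prediction["scores"] > 0.5        # arbitrary confidence threshold
print(prediction["boxes"][keep], prediction["labels"][keep])
```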

 

 

In general, along the path from R-CNN to SPP-Net, Fast R-CNN, and Faster R-CNN, the pipeline of deep-learning-based object detection has become increasingly streamlined, with higher accuracy and faster speed. It is fair to say that the region-proposal-based R-CNN family is currently the most important branch in the field of object detection.

