Object detection: a brief introduction to R-CNN, Fast R-CNN, Faster R-CNN, YOLO, and SSD


Reposted from: Watch Express

Original link: https://kuaibao.qq.com/s/20180723B05TXK00?refer=cp_1026


1. Common object detection algorithms

Object detection means accurately locating the objects in a given image and labeling each object's category. The problem object detection must solve is therefore the whole question of where each object is and what it is.

However, this problem is not so easy to solve. Objects vary widely in size, their angle and pose are uncertain, and they can appear anywhere in the image, not to mention that objects can also belong to many different categories.

The object detection algorithms emerging from academia and industry currently fall into three categories:

1. Traditional object detection algorithms: Cascade + HOG/DPM + Haar/SVM, along with many improvements and optimizations of these methods;

2. Candidate region/box + deep learning classification: extract candidate regions, then classify each region with a deep-learning method, for example:

R-CNN(Selective Search + CNN + SVM)

SPP-net(ROI Pooling)

Fast R-CNN(Selective Search + CNN + ROI)

Faster R-CNN(RPN + CNN + ROI)

R-FCN

and other series of methods;

3. Deep-learning regression methods: YOLO, SSD, DenseBox, and similar methods; plus recent work such as RRC detection (combined with an RNN) and Deformable CNN (combined with DPM).

The traditional object detection pipeline (a code sketch follows the list):

1) Region selection (an exhaustive strategy: slide windows of different sizes and aspect ratios across the image; the time complexity is high)

2) Feature extraction (SIFT, HOG, etc.; diversity in shape, illumination, and background makes the features poorly robust)

3) Classification (mainly SVM, AdaBoost, etc.)
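
To make the pipeline concrete, here is a minimal sketch of sliding-window detection with HOG features and a linear SVM. The window size, stride, threshold, and the pretrained `svm` classifier are illustrative assumptions, not details from the original post:

```python
from skimage.feature import hog    # hand-crafted feature extractor
from sklearn.svm import LinearSVC  # `svm` below is assumed to be a trained LinearSVC

def sliding_window_detect(image, svm, win=(64, 64), stride=16, thresh=0.0):
    """Exhaustively score every window of a grayscale image; note the
    nested loops -- this is the high time complexity mentioned above."""
    detections = []
    H, W = image.shape[:2]
    for y in range(0, H - win[1] + 1, stride):
        for x in range(0, W - win[0] + 1, stride):
            patch = image[y:y + win[1], x:x + win[0]]
            feat = hog(patch)                         # step 2: feature extraction
            score = svm.decision_function([feat])[0]  # step 3: classification
            if score > thresh:
                detections.append((x, y, win[0], win[1], score))
    return detections
```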

2. Traditional object detection algorithms

2.1 Starting from the task of image recognition

Here is an image task: recognize the object in the picture and frame its position with a box.

This task is essentially two problems: one, image recognition; two, localization.

Image recognition (classification):

Input: picture

Output: the class of the object

Evaluation Method: Accuracy

Positioning (localization):

Input: picture

Output: the position of the box in the picture (x, y, w, h)

Evaluation method: the detection evaluation function intersection-over-union (for what IoU is, see question 55 under this deep learning category: https://www.julyedu.com/question/big/kp_id/26/ques_id/2138)
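
As a quick aside, the IoU computation itself is short enough to show inline. A minimal helper, with boxes in the (x, y, w, h) format used for the localization output above:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(xa, xb), max(ya, yb)
    ix2, iy2 = min(xa + wa, xb + wb), min(ya + ha, yb + hb)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

# Example: two half-overlapping 2x2 boxes -> IoU = 2 / 6
print(iou((0, 0, 2, 2), (1, 0, 2, 2)))  # 0.333...
```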

The convolutional neural network (CNN) has already solved the image recognition task for us (deciding whether an image shows a cat or a dog); we only need to add some extra capability to complete the localization task.

What are the solutions to the localization problem?

Idea 1: treat it as a regression problem

As a regression problem, we need to predict the values of the four parameters (x, y, w, h) to obtain the box's position.

Step 1:

• First solve the simple problem: build a neural network for image recognition

• Fine-tune on AlexNet/VGG/GoogLeNet (for what fine-tuning is, see question 54 under this deep learning category: https://www.julyedu.com/question/big/kp_id/26/ques_id/2137)

Step 2:

• Extend the tail of the network above (that is, the front of the CNN stays unchanged; we modify the end of the CNN by adding two heads: a "classification head" and a "regression head")

• This turns the network into a classification + regression model

Step 3:

• The regression part uses a Euclidean-distance loss

• Train with SGD

Step 4:

• In the prediction stage, attach both heads

• Each head performs its own function (a minimal sketch of this two-head setup follows)
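
Here is a minimal sketch of the two-head idea in PyTorch, assuming a tiny stand-in backbone instead of a fine-tuned AlexNet/VGG; all layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes=20, feat_dim=4096):
        super().__init__()
        # Shared front end -- a tiny stand-in for a fine-tuned AlexNet/VGG trunk
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.cls_head = nn.Linear(feat_dim, num_classes)  # "classification head"
        self.reg_head = nn.Linear(feat_dim, 4)            # "regression head": (x, y, w, h)

    def forward(self, x):
        feat = self.backbone(x)
        return self.cls_head(feat), self.reg_head(feat)

model = ClassifyAndLocalize()
logits, box = model(torch.randn(1, 3, 224, 224))
cls_loss = nn.CrossEntropyLoss()(logits, torch.tensor([3]))  # classification head
reg_loss = nn.MSELoss()(box, torch.rand(1, 4))               # Euclidean loss (Step 3)
loss = cls_loss + reg_loss                                   # trained jointly with SGD
```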

Two rounds of fine-tuning are required here: the first is done on AlexNet; the second keeps the front of the network unchanged, swaps the head for a regression head, and fine-tunes again.

Where is the Regression part added?

There are two ways to handle this:

• after the last convolutional layer (as in VGG)

• after the last fully connected layer (as in R-CNN)

Regression is hard to train well, so we should try to convert it into a classification problem.

Regression parameters take much longer to converge in training, so the network above uses the classification network to compute the connection weights of the shared part of the network.

Idea 2: take image windows

• It is still the classification + regression idea just now

• Let's take "boxes" of different sizes

• Let the box appear at different positions, and compute a judgment score for each placement

• Take the box with the highest score

Black box in the upper left corner: score 0.5

Black box in the upper right corner: score 0.75

Black box in the lower left corner: score 0.6

Black box in the lower right corner: score 0.8

Based on these scores, we select the black box in the lower right corner as the predicted target location.

Note: sometimes the two highest-scoring boxes are selected, and the intersection of the two boxes is taken as the final position prediction.

Question: how big should the box be?

Take boxes of different sizes and slide each in turn from the top-left corner to the bottom-right corner. Very crude.

To sum up the idea:

For one image, use boxes of various sizes (traversing the whole image) to crop out patches, feed each patch to the CNN, and have the CNN output the box's score (classification) and the (x, y, w, h) of the object within the box (regression).

This method is too time-consuming and needs to be optimized.

The original network ends in fully connected layers. The optimization: replace the fully connected layers with convolutional layers, which speeds things up, as in the sketch below.
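
A minimal sketch of why this conversion works (the OverFeat-style trick): a fully connected layer over a 7x7 feature map is mathematically a 7x7 convolution, so the converted network can evaluate the "sliding window" everywhere in one pass. The sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 4096)           # original fully connected layer
conv = nn.Conv2d(512, 4096, kernel_size=7)  # its convolutional equivalent
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)  # reuse the same weights
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
assert torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4)

# On a larger feature map, the conv version produces a whole grid of
# predictions at once -- the sliding window evaluated everywhere in one pass:
y = conv(torch.randn(1, 512, 15, 15))
print(y.shape)  # torch.Size([1, 4096, 9, 9])
```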

2.2 Object Detection

What if the image contains many objects? The difficulty jumps suddenly.

The task then becomes: multi-object recognition + localization of multiple objects

So should we treat this task as a classification problem?

What's wrong with treating it as classification?

• You need to find a lot of positions, give a lot of boxes of different sizes

• You also need to classify the images inside the box

• Of course, if your GPU is powerful enough, well, go for it...

Therefore, the main problems of traditional object detection are:

1) The sliding-window region selection strategy is untargeted, its time complexity is high, and the windows are redundant

2) Handcrafted features are not robust to diversity changes

Treating it as classification, is there any way to optimize? We don't want to try so many boxes at so many positions!

3. Candidate region/window + deep learning classification

3.1 The birth of R-CNN

Someone thought of a good idea: find in advance the positions where targets are likely to appear in the image, i.e., candidate regions (region proposals). Using cues such as texture, edges, and color in the image guarantees a high recall while selecting relatively few windows (a few thousand or even a few hundred).

The problem thus turns into finding the regions/boxes that may contain objects (candidate regions/boxes, e.g. 2000 of them). These boxes may overlap and contain one another, which lets us avoid brute-force enumeration of all possible boxes.

Researchers have invented many methods for selecting candidate boxes (region proposals), such as Selective Search and EdgeBoxes. How does "selective search", the algorithm used here to extract candidate boxes, actually select them? For details, see the PAMI 2015 paper "What makes for effective detection proposals?"

The following is a performance comparison of various methods for selecting candidate boxes.

With candidate regions in hand, the remaining work is simply classifying the image within each candidate region (feature extraction + classification).

For image classification, one has to mention the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), in which Geoffrey Hinton led his student Alex Krizhevsky to use a convolutional neural network to reduce the top-5 error of the ILSVRC classification task to 15.3%, while the runner-up, using traditional methods, had a top-5 error as high as 26.2%. Since then, convolutional neural networks have held absolute dominance in image classification tasks.

In 2014, Ross Girshick (RBG) used region proposals + CNNs to replace the sliding windows + hand-designed features of traditional object detection, designing the R-CNN framework. It was a huge breakthrough for object detection and kicked off the wave of deep-learning-based object detection.

The brief steps of R-CNN are as follows (a condensed code sketch follows the list):

(1) Input a test image

(2) Use the Selective Search algorithm to extract, bottom-up, about 2000 candidate regions (region proposals) that may contain objects

(3) Because the extracted regions differ in size, warp each region proposal to a uniform 227x227 size before feeding it to the CNN, and use the output of the CNN's fc7 layer as the feature

(4) Feed the CNN feature extracted from each region proposal to an SVM for classification
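
A condensed sketch of steps (2)-(4) at inference time. Selective Search is not part of torchvision, so the `proposals` argument stands in for its output here; taking the fc7 activations as features follows the description above, while everything else is an illustrative assumption:

```python
import torch
import torchvision
from torchvision import transforms

cnn = torchvision.models.alexnet(weights="DEFAULT")
cnn.classifier = cnn.classifier[:-1]  # drop the last layer; the 4096-d fc7 output is the feature
cnn.eval()

warp = transforms.Compose([
    transforms.Resize((227, 227)),    # step (3): warp every proposal to 227x227
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def rcnn_features(image, proposals):
    """image: a PIL.Image; proposals: (x1, y1, x2, y2) boxes from Selective Search."""
    feats = []
    with torch.no_grad():
        for (x1, y1, x2, y2) in proposals:       # one full CNN pass per proposal --
            crop = image.crop((x1, y1, x2, y2))  # this is exactly why R-CNN is slow
            feats.append(cnn(warp(crop).unsqueeze(0)))
    return torch.cat(feats)  # (num_proposals, 4096), fed to per-class SVMs in step (4)
```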

The specific steps are as follows

Step 1: Train (or download) a classification model (such as AlexNet)

Step 2: Fine-tune the model

• Change the number of categories from 1000 to 21, such as 20 object categories + 1 background

• Remove the last fully connected layer

Step 3: Feature Extraction

• Extract all candidate boxes of an image (Selective Search)

• For each region: resize the region to fit the CNN input, run one forward pass, and save the output of the fifth pooling layer (i.e., the features extracted for the candidate box) to disk

Step 4: Train one SVM classifier per class (binary classification) to judge the category of the object in a candidate box

Each category gets its own SVM: if the object belongs to that category, the sample is positive; otherwise it is negative.

For example, the picture below is the SVM for dog classification

Step 5: Use regressors to fine-tune candidate box positions: for each class, train a linear regression model to judge whether the box is well placed.

Careful readers may have noticed the problem. Although R-CNN no longer exhaustively enumerates windows as traditional methods do, the first step of its pipeline still extracts as many as 2000 candidate boxes from the original image via Selective Search. Each of these 2000 boxes requires CNN feature extraction + SVM classification, a huge amount of computation, so R-CNN detection is very slow: 47 seconds for a single image.

Is there a way to speed this up? Yes. Aren't those 2000 region proposals all parts of the same image? Then we can run the convolutional layers over the image just once, and simply map each region proposal's position in the original image onto the convolutional feature map. That way, for one image we extract convolutional features only once, and then feed each region proposal's convolutional features to the fully connected layers for the subsequent steps.

But now the problem is that each region proposal has a different scale, while the fully connected layer's input must be a fixed length, so the features cannot be fed to the fully connected layer directly. SPP-net solves exactly this problem.

3.2 SPP-net

SPP: Spatial Pyramid Pooling

SPP-net comes from the 2015 IEEE paper "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition".

As is well known, a CNN generally contains a convolutional part and a fully connected part. The convolutional layers do not require a fixed image size, but the fully connected layers require a fixed-size input.

Therefore, when faced with input data of various sizes, we first need to crop (cut a patch of the network's input size out of a larger image, e.g. 227×227) or warp (resize the content of a bounding box to 227×227) the data to unify the image size, e.g. 224×224 (ImageNet), 32×32 (LeNet), 96×96, and so on.

Hence what we saw above in R-CNN: "Because the extracted regions differ in size, warp each region proposal to a uniform 227x227 size before feeding it to the CNN."

However, warp/crop preprocessing causes problems: the object is either stretched and deformed or left incomplete, which limits recognition accuracy. Not clear? Put bluntly: if you resize a 16:9 picture into a 1:1 picture, doesn't it look distorted?

The authors of SPP-net, Kaiming He et al., thought in reverse: it is the existence of the fully connected (FC) layers that forces an ordinary CNN to fix the input image size in order to fix the FC layers' input. Since the convolutional layers can adapt to any size, why not add some structure after the convolutional layers so that the input reaching the fully connected layers becomes fixed?

This "turning decay into magic" structure is the spatial pyramid pooling layer.

The following figure compares the detection pipelines of R-CNN and SPP-net:

SPP-net has two notable features:

1. It combines the spatial pyramid method to give CNNs multi-scale input.

SPP-net's first contribution is connecting a pyramid pooling layer after the last convolutional layer, ensuring that the input to the following fully connected layer is fixed.

In other words, under the usual CNN mechanism, the input image size is fixed (e.g. 224*224 pixels) and the output is a fixed-dimensional vector. SPP-net adds a spatial pyramid pooling layer to the ordinary CNN structure, so that the network's input image can be of any size while the output stays the same: a fixed-dimensional vector.

In short, an ordinary CNN has fixed input and fixed output; a CNN with SPP added accepts input of any size while keeping a fixed output. Amazing, right?

The SPP layer generally follows the last convolutional layer. The network's input can then be of any scale: inside the SPP layer, each pooling window is resized according to the input, and the SPP output is a fixed-dimensional vector, which is then handed to the fully connected (FC) layers.
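
A minimal spatial pyramid pooling layer in PyTorch, following the description above. The (4, 2, 1) levels are one possible pyramid; everything else here is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPyramidPooling(nn.Module):
    def __init__(self, levels=(4, 2, 1)):  # one possible pyramid: 4x4, 2x2, 1x1 grids
        super().__init__()
        self.levels = levels

    def forward(self, x):                  # x: (N, C, H, W) with arbitrary H and W
        outs = []
        for n in self.levels:
            # Adaptive pooling resizes its windows to the input, yielding an n x n grid
            outs.append(F.adaptive_max_pool2d(x, n).flatten(1))
        return torch.cat(outs, dim=1)      # (N, C * (16 + 4 + 1)), independent of H, W

spp = SpatialPyramidPooling()
for size in [(13, 13), (20, 31)]:          # two different input sizes...
    print(spp(torch.randn(1, 256, *size)).shape)  # ...same output: torch.Size([1, 5376])
```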

2. It extracts convolutional features from the original image only once

In R-CNN, each candidate box is first resized to a uniform size, and then used as the input of CNN, which is very inefficient.

SPP-net optimizes away this shortcoming: it performs the convolution only once over the original image to obtain the feature map of the entire image, then finds the mapped patch of each candidate box on the feature map and feeds that patch, as the candidate box's convolutional feature, to the SPP layer and subsequent layers to complete feature extraction.

This way, whereas R-CNN computes convolutions once per region, SPP-net computes them only once in total, which saves a great deal of computation time and makes it roughly 100x faster than R-CNN.

3.3 Fast R-CNN

SPP-net really is a good method. Fast R-CNN, an advanced version of R-CNN, adopts the SPP-net approach on top of R-CNN and improves R-CNN further, raising performance again.

What are the differences between R-CNN and Fast R-CNN?

First, R-CNN's shortcoming: even with preprocessing such as Selective Search to extract potential bounding boxes as input, R-CNN still has a serious speed bottleneck, because feature extraction repeats computation over overlapping regions. Fast R-CNN was born to solve this problem.

Comparing the framework diagrams of R-CNN and Fast R-CNN reveals two main differences: first, an ROI pooling layer is added after the last convolutional layer; second, the loss function is a multi-task loss that adds bounding-box regression directly into the CNN for training (for what bounding-box regression is, see question 56 under this deep learning category: https://www.julyedu.com/question/big/kp_id/26/ques_id/2139).

(1) The ROI pooling layer is actually a simplified version of SPP-net: where SPP-net uses pyramids of several sizes for each proposal, the ROI pooling layer down-samples each proposal to just one 7x7 feature map. For VGG16's conv5_3 layer there are 512 feature maps, so every region proposal yields a 7*7*512-dimensional feature vector as input to the fully connected layers.

In other words, this layer maps inputs of different sizes to a fixed-length feature vector. We know that operations such as conv, pooling, and relu do not need fixed-size inputs, so when the original images differ in size, the resulting feature maps differ in size too and cannot be connected directly to a fully connected layer for classification. The magic of the ROI pooling layer is that it extracts a fixed-dimensional feature representation for each region, after which an ordinary softmax performs category recognition.
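
A short illustration of this step using the ROI pooling operator that ships with torchvision; the feature map, proposal boxes, and the 1/16 stride (matching a VGG16 conv5 backbone) are illustrative assumptions:

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 38, 50)  # conv5 features of one image
# Proposals in original-image coordinates: (batch_index, x1, y1, x2, y2)
proposals = torch.tensor([[0, 10.0, 20.0, 300.0, 250.0],
                          [0, 50.0, 60.0, 200.0, 400.0]])
# Every proposal, whatever its size, becomes a 512 x 7 x 7 tensor
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7]) -> flattened for the FC layers
```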

(2) R-CNN's training is divided into three stages, while Fast R-CNN directly replaces the SVM classification with softmax and, through the multi-task loss, brings box regression into the network as well, making the whole training process end-to-end (apart from the region proposal extraction stage).

In other words, R-CNN's earlier pipeline was: first generate proposals, then extract CNN features, then run an SVM classifier, and finally do box regression. In Fast R-CNN, the author cleverly puts box regression inside the neural network and merges it with region classification into one multi-task model. Experiments also confirm that the two tasks can share convolutional features and promote each other.
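
A sketch of that multi-task loss: softmax classification plus smooth-L1 box regression trained jointly. For brevity the box deltas here are class-agnostic (the paper's are per-class), and the lambda weighting and shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, box_deltas, labels, target_deltas, lam=1.0):
    """cls_scores: (R, K+1) logits; box_deltas: (R, 4); labels: (R,); target_deltas: (R, 4)."""
    cls_loss = F.cross_entropy(cls_scores, labels)  # softmax replaces the per-class SVMs
    fg = labels > 0                                 # regress boxes only for non-background ROIs
    reg_loss = F.smooth_l1_loss(box_deltas[fg], target_deltas[fg]) if fg.any() else 0.0
    return cls_loss + lam * reg_loss

# Example with 8 random ROIs over 20 classes + background
loss = fast_rcnn_loss(torch.randn(8, 21), torch.randn(8, 4),
                      torch.randint(0, 21, (8,)), torch.randn(8, 4))
```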

Therefore, a very important contribution of Fast R-CNN is that it let people see the hope of real-time detection within the Region Proposal + CNN framework: multi-class detection really can improve processing speed while maintaining accuracy. It also laid the groundwork for the later Faster R-CNN.

To highlight the key point:

R-CNN has some considerable shortcomings (remove them all and it becomes Fast R-CNN).

Big disadvantage: since each candidate box passes through the CNN individually, it takes a great deal of time.

Solution: share the convolutional layers. Now, instead of feeding each candidate box into the CNN, one complete image is fed in, and each candidate box's features are obtained at the fifth convolutional layer

The original method: many candidate boxes (such as two thousand) --> CNN --> get the features of each candidate box --> classification + regression

The current method: one complete image --> CNN --> get the features of each candidate box --> classification + regression

So it is easy to see why Fast R-CNN is faster than R-CNN: unlike R-CNN, which extracts features through the deep network for every candidate region separately, Fast R-CNN extracts features for the entire image once and then maps each candidate box onto conv5. As with SPP, the features are computed only once, and everything else operates on the conv5 layer.

The performance improvement is also quite obvious:

3.4 Faster R-CNN

Problem with Fast R-CNN: there is still a bottleneck: Selective Search, which finds all the candidate boxes, is also very time-consuming. Can we find a more efficient way to obtain these candidate boxes?

Solution: add a neural network that extracts candidate regions; in other words, hand the work of finding candidate boxes to a neural network as well.

Therefore, RBG and colleagues introduced the Region Proposal Network (RPN) into Fast R-CNN to replace Selective Search, and introduced anchor boxes to handle variation in target shape (an anchor is a box of fixed position and size, which can be understood as a proposal fixed in advance).

Specific approach:

• Put the RPN after the last convolutional layer

• Train the RPN directly to obtain candidate regions

Introduction to RPN:

• Sliding window on feature map

• Build a small network for object classification + box-position regression

• The position of the sliding window provides the general position information of the object

• Box regression provides more precise location of boxes

One network, four loss functions (a minimal RPN head sketch follows the list):

• RPN classification (anchor: object vs. not)

• RPN regression (anchor -> proposal)

• Fast R-CNN classification (over classes)

• Fast R-CNN regression (proposal -> box)
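
A minimal RPN head in PyTorch, following the description above: a small network slides over the conv feature map (implemented as a 3x3 convolution) and, for each of the anchors at every position, predicts 2 objectness scores and 4 box offsets. The channel counts follow the paper's VGG setting; the rest is illustrative:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):  # 9 = 3 scales x 3 aspect ratios
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)  # the "sliding window"
        self.cls = nn.Conv2d(512, num_anchors * 2, 1)  # object vs. not, per anchor
        self.reg = nn.Conv2d(512, num_anchors * 4, 1)  # anchor -> proposal offsets

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)

head = RPNHead()
scores, deltas = head(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)  # (1, 18, 38, 50) and (1, 36, 38, 50)
```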

Speed comparison:

Faster R-CNN's main contribution is designing the RPN, a network for extracting candidate regions, to replace the time-consuming Selective Search, which greatly improves detection speed.

Finally, a summary of the steps of the major algorithms:

R-CNN

1. Determine about 1000-2000 candidate boxes in the image (using Selective Search)

2. Scale the image patch inside each candidate box to the same size, and feed it to the CNN for feature extraction

3. For the features extracted in the candidate box, use the classifier to determine whether it belongs to a specific class

4. For the candidate boxes belonging to a certain category, use the regressor to further adjust its position

Fast R-CNN

1. Determine about 1000-2000 candidate boxes in the image (using selective search)

2. Feed the whole image to the CNN to obtain the feature map

3. Find the mapped patch of each candidate box on the feature map, and feed this patch, as the candidate box's convolutional feature, to the ROI pooling layer and subsequent layers

4. For the features extracted in the candidate box, use the classifier to determine whether it belongs to a specific class

5. For the candidate boxes belonging to a certain category, use the regressor to further adjust its position

Faster R-CNN

1. Feed the whole image to the CNN to obtain the feature map

2. Feed the convolutional features to the RPN to obtain the candidate boxes' feature information

3. For the features extracted in the candidate box, use the classifier to determine whether it belongs to a specific class

4. For the candidate boxes belonging to a certain category, use the regressor to further adjust its position

In short, as listed at the beginning of this article:

R-CNN(Selective Search + CNN + SVM)

SPP-net(ROI Pooling)

Fast R-CNN(Selective Search + CNN + ROI)

Faster R-CNN(RPN + CNN + ROI)

Overall, along the path from R-CNN through SPP-net and Fast R-CNN to Faster R-CNN, the deep-learning-based object detection pipeline has become more and more streamlined, with higher accuracy and faster speed. It is fair to say that the region-proposal-based R-CNN series is the most important branch of object detection technology.
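
For readers who want to try the end point of this lineage directly, the whole Faster R-CNN pipeline ships prepackaged in torchvision; a short usage sketch (the random tensor just demonstrates the interface):

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()
with torch.no_grad():
    # The model takes a list of CHW float images in [0, 1]
    preds = model([torch.rand(3, 480, 640)])
print(preds[0].keys())  # dict_keys(['boxes', 'labels', 'scores'])
```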

4. Regression methods based on deep learning

4.1 YOLO (CVPR2016, oral)

(You Only Look Once: Unified, Real-Time Object Detection)

Faster R-CNN is currently the mainstream object detection method, but its speed cannot meet real-time requirements, which is where methods such as YOLO gradually show their importance. These methods use the idea of regression: they take the entire image as the network input and, at multiple positions in the image, directly regress the target bounding box at that position together with the target's category.

Let's look directly at YOLO's object detection flow chart above:

(1) Given an input image, first divide the image into a 7*7 grid

(2) For each grid cell, predict 2 bounding boxes (including each box's confidence of containing a target, and each box region's probabilities over the multiple categories)

(3) From the previous step, 7*7*2 target windows are predicted. Windows with low probability are removed by thresholding, and finally NMS removes the redundant windows (for what non-maximum suppression (NMS) is, see question 58 under this deep learning category: https://www.julyedu.com/question/big/kp_id/26/ques_id/2141). A small NMS sketch follows the list.
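
A plain-Python sketch of NMS as used in that last step: repeatedly keep the highest-scoring window and discard windows that overlap it too much. It reuses the iou() helper defined earlier in this article; the 0.5 threshold is an illustrative assumption:

```python
def nms(detections, iou_thresh=0.5):
    """detections: list of (x, y, w, h, score); returns the kept windows."""
    detections = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while detections:
        best = detections.pop(0)  # keep the highest-scoring window...
        kept.append(best)
        detections = [d for d in detections
                      if iou(best[:4], d[:4]) < iou_thresh]  # ...drop its near-duplicates
    return kept
```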

As you can see, the whole process is very simple: no intermediate region proposals are needed to find targets; direct regression completes the determination of both position and category.

Summary: YOLO converts the object detection task into a regression problem, which greatly speeds up detection, allowing YOLO to process 45 images per second. Moreover, since the network uses whole-image information when predicting each target window, the proportion of false positives drops substantially (plenty of contextual information).

But YOLO also has problems: without the region proposal mechanism, regressing on only a 7*7 grid makes target localization imprecise, which also keeps YOLO's detection accuracy from being very high.

4.2 SSD

(SSD: Single Shot MultiBox Detector)

The problems with YOLO were analyzed above: regressing targets on a coarse 7*7 grid from whole-image features is not very precise. Can the region proposal idea be brought in to achieve more precise localization? SSD achieves this by combining YOLO's regression idea with Faster R-CNN's anchor mechanism.

The picture above is SSD's framework diagram. First, SSD obtains the target position and category the same way YOLO does, via regression; but whereas YOLO predicts a given position from the features of the whole image, SSD predicts a given position from the features around that position (which feels a bit more reasonable).

So how is the correspondence between a position and its features established? As you may have guessed: with Faster R-CNN's anchor mechanism. As shown in SSD's framework diagram, if some layer's feature map (figure b) is 8*8, then a 3*3 sliding window extracts the features at each position, and these features are regressed to obtain the target's coordinate information and category information (figure c).

Unlike Faster R-CNN, these anchors sit on multiple feature maps, so multi-layer features are used and multiple scales come naturally (the 3*3 sliding window has a different receptive field on each layer's feature map). A sketch follows.
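
A sketch of that multi-scale prediction: each selected feature map gets its own 3x3 convolutional predictor, so earlier (larger) maps handle smaller objects and deeper (smaller) maps handle larger ones. The channel counts, map sizes, and the number of default boxes per position are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_classes, num_priors = 21, 4  # priors = default boxes per feature-map position
feature_maps = [torch.randn(1, 512, 38, 38),   # from an early layer (small objects)
                torch.randn(1, 1024, 19, 19),  # from a deeper layer
                torch.randn(1, 512, 10, 10)]   # deeper still (large objects)

for fmap in feature_maps:
    c = fmap.shape[1]
    loc = nn.Conv2d(c, num_priors * 4, 3, padding=1)(fmap)             # box offsets
    conf = nn.Conv2d(c, num_priors * num_classes, 3, padding=1)(fmap)  # class scores
    print(tuple(loc.shape), tuple(conf.shape))
```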

Summary: SSD combines the regression idea of YOLO with the anchor mechanism of Faster R-CNN, using multi-scale local features at every position over the whole image for regression. This both keeps YOLO's high speed and makes the window predictions about as accurate as Faster R-CNN's. SSD reaches 72.1% mAP on VOC2007, and its speed reaches 58 frames per second on a GPU.

Origin blog.csdn.net/leiduifan6944/article/details/105800189