Evolution of deep-learning-based object detection: R-CNN, Fast R-CNN, Faster R-CNN, YOLO, SSD

Object detection means accurately locating the objects in a given picture and labeling their categories. So the problem object detection has to solve is the whole of "where is the object, and what is it?"


However, this problem is not so easy to solve. Object sizes vary widely, angles and poses are uncertain, objects can appear anywhere in the picture, and on top of that they may belong to many different categories.


Current object detection algorithms in academia and industry fall into three categories:
1. Traditional detection algorithms: cascades with HOG/DPM or Haar features plus SVM, and the many improvements and optimizations built on them;


2. Candidate window + deep-learning classification: extract candidate regions, then classify each region with a deep-learning method, for example:
R-CNN (Selective Search + CNN + SVM)
SPP-net (ROI Pooling)
Fast R-CNN (Selective Search + CNN + ROI)
Faster R-CNN (RPN + CNN + ROI)
R-FCN
and related methods;


3. Regression methods based on deep learning: YOLO, SSD, DenseBox, and so on; plus recent work such as RRC detection (combined with an RNN) and Deformable CNN (combined with DPM ideas).


The traditional object detection pipeline:
1) Region selection (an exhaustive strategy: slide windows of different sizes and aspect ratios over the whole image; time complexity is high)
2) Feature extraction (SIFT, HOG, etc.; variation in morphology, illumination, and background makes hand-crafted features less robust)
3) Classification (mainly SVM, AdaBoost, etc.)


Starting from the task of image recognition
Here is an image task: not only recognize the object in the picture, but also frame its position with a box.
{{1517392987_360.jpg}}


In technical terms, the task above is image recognition + localization.
Image recognition (classification):
Input: picture
Output: object category
Evaluation metric: accuracy
{{1517393007_344.jpg}}


Localization:
Input: picture
Output: the position of the box in the picture (x, y, w, h)
Evaluation metric: intersection-over-union (IoU)
{{1517393013_253.png}}
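For concreteness, here is a minimal IoU computation in plain Python for boxes given as (x, y, w, h) with (x, y) the top-left corner (a common convention; many libraries use (x1, y1, x2, y2) instead):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 10, 10)))  # 25 / 175 ≈ 0.143
```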


Convolutional neural networks (CNNs) have already taken care of image recognition for us (cat or dog?); we only need to add some extra functionality to complete the localization task.


What are the possible solutions to the localization problem?
Idea 1: treat it as a regression problem. We predict the values of the four parameters (x, y, w, h) to obtain the position of the box.
{{1517393027_336.png}}


Step 1:
  • First solve a simpler problem: build a neural network that recognizes images
  • Fine-tune on AlexNet / VGG / GoogLeNet
{{1517393041_799.jpg}}
 
Step 2:
  • Extend the tail of the network above (keep the front of the CNN unchanged and modify only its end: add two heads, a "classification head" and a "regression head")
  • The model becomes classification + regression
{{1517393046_527.png}}
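As a rough sketch of this two-headed design, here is a minimal PyTorch model with a shared backbone plus a classification head and a regression head. The backbone and layer sizes are illustrative stand-ins, not the actual AlexNet/VGG architecture:

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        # Shared convolutional backbone (stand-in for AlexNet/VGG conv layers)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 6)),
            nn.Flatten(),
        )
        self.cls_head = nn.Linear(128 * 6 * 6, num_classes)  # "classification head"
        self.reg_head = nn.Linear(128 * 6 * 6, 4)            # "regression head": (x, y, w, h)

    def forward(self, images):
        feats = self.backbone(images)
        return self.cls_head(feats), self.reg_head(feats)

model = ClassifyAndLocalize()
scores, boxes = model(torch.randn(2, 3, 224, 224))
print(scores.shape, boxes.shape)  # torch.Size([2, 20]) torch.Size([2, 4])
```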


Step 3:
  • Train the regression part with a Euclidean-distance (L2) loss
  • Train with SGD
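Continuing the sketch above (and reusing its model), one SGD training step with a cross-entropy loss on the classification head and a Euclidean (MSE) loss on the regression head might look like this; the labels and boxes are synthetic placeholders:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

images = torch.randn(2, 3, 224, 224)          # placeholder batch
labels = torch.tensor([3, 7])                 # ground-truth classes
gt_boxes = torch.rand(2, 4)                   # ground-truth (x, y, w, h), normalized

scores, boxes = model(images)
loss = F.cross_entropy(scores, labels) + F.mse_loss(boxes, gt_boxes)  # L2 on the box
optimizer.zero_grad()
loss.backward()
optimizer.step()
```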


Step 4:
  • At prediction time, attach both heads
  • Each head performs its own function (class and box).


Two rounds of fine-tuning are needed here: the first is done on AlexNet; the second keeps the front of the network unchanged and swaps the head for a regression head.


Where should the regression part be added?


There are two processing methods:
  • Add it after the last convolutional layer (such as VGG)
  • Add it after the last fully connected layer (such as R-CNN)


Regression is hard to get right, so convert it into a classification problem whenever you can.
The regression parameters take much longer to converge, so the network above uses the classification network to obtain the connection weights of the shared layers.


Idea 2: take image windows
  • Keep the classification + regression idea from above
  • Take "boxes" of different sizes
  • Slide the boxes to different positions and compute a score for each box
  • Take the box with the highest score


Black box in the upper left corner: score 0.5
{{1517393142_364.jpg}}


Black box in the upper right corner: score 0.75
{{1517393149_762.jpg}}


Black box in the lower left corner: score 0.6
{{1517393156_659.jpg}}


Black box in the lower right corner: score 0.8
{{1517393161_204.jpg}}


Based on the scores, we select the black box in the lower right corner as the prediction of the target position.
Note: sometimes the two boxes with the highest scores are both selected, and the intersection of the two boxes is taken as the final position prediction.


Question: how big should the boxes be?
Take boxes of different sizes and slide each from the upper left corner to the lower right corner in turn. Very crude.


To summarize the idea:
For one picture, use boxes of various sizes to traverse the entire image, crop out each window, and feed it to a CNN; the CNN then outputs the classification of the crop together with the (x, y, h, w) of the box (regression).
{{1517393174_730.jpg}}
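A brute-force sketch of that procedure is below; `cnn` is assumed to be some trained classifier that returns a score for a crop (a hypothetical helper, not a real API):

```python
def sliding_window_detect(image, cnn, window_sizes=((64, 64), (128, 128)), stride=32):
    """Exhaustively score crops of several sizes; return the best-scoring box."""
    H, W = image.shape[:2]
    best_score, best_box = float("-inf"), None
    for wh, ww in window_sizes:
        for y in range(0, H - wh + 1, stride):
            for x in range(0, W - ww + 1, stride):
                crop = image[y:y + wh, x:x + ww]
                score = cnn(crop)                  # assumed: crop -> objectness score
                if score > best_score:
                    best_score, best_box = score, (x, y, ww, wh)
    return best_box, best_score
```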


This method is too time-consuming, so an optimization is needed.
The original network looks like this:
{{1517393184_685.jpg}}


It is optimized like this: change the fully connected layers into convolutional layers, which speeds things up.
{{1517393192_919.jpg}}
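The underlying trick (popularized by OverFeat) is that a fully connected layer over a C×H×W feature map is mathematically a convolution whose kernel covers the whole map, so the converted network can score every window position in one forward pass. A small PyTorch sketch of the equivalence:

```python
import torch
import torch.nn as nn

# A 4096-unit FC layer over a 512x7x7 feature map ...
fc = nn.Linear(512 * 7 * 7, 4096)
# ... is equivalent to a conv layer with a 7x7 kernel and 4096 output channels
conv = nn.Conv2d(512, 4096, kernel_size=7)
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)
conv.bias.data = fc.bias.data

feat = torch.randn(1, 512, 7, 7)
out_fc = fc(feat.flatten(1))      # shape (1, 4096)
out_conv = conv(feat)             # shape (1, 4096, 1, 1)
print(torch.allclose(out_fc, out_conv.flatten(1), atol=1e-4))  # True
```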


Object detection
What do we do when there are many objects in the image? The difficulty suddenly increases.


Then the task becomes: multi-object recognition + localization of multiple objects.
So can we treat this task as a classification problem?
{{1517393210_554.jpg}}


What's wrong with treating it as a classification problem?
  • You need to try a huge number of positions and boxes of many different sizes
  • You also need to classify the image inside every box
  • Of course, if your GPU is very powerful, well, go for it...


So, the main problems of traditional object detection are:
1) The sliding-window region selection strategy is untargeted; its time complexity is high and the windows are redundant
2) Hand-designed features are not robust to variations in morphology, illumination, and background


For the classification step, is there any way to optimize? I don't want to try so many boxes in so many positions!


R-CNN was born.


Someone thought of a good method: find in advance the places in the image where the target is likely to appear, i.e., the candidate regions (region proposals). By exploiting texture, edge, color, and other cues in the image, a high recall can be maintained while selecting relatively few windows (thousands or even hundreds).


Therefore, the problem turns into finding the boxes that may contain objects (candidate boxes; say, 1000 of them). These boxes can overlap and contain one another, so we avoid brute-force enumeration of every possible box.
{{1517393217_390.jpg}}


Researchers have invented many methods for selecting region-proposal candidate boxes, such as Selective Search and EdgeBoxes. How does the Selective Search algorithm used here actually pick its candidate boxes? For details, see the PAMI 2015 paper "What makes for effective detection proposals?".
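For reference, OpenCV's contrib module ships an implementation of Selective Search; a minimal usage sketch (requires the opencv-contrib-python package, and the image path is a placeholder):

```python
import cv2

img = cv2.imread("example.jpg")  # placeholder input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # trades quality for speed; a Quality variant also exists
rects = ss.process()               # proposals as (x, y, w, h)
print(f"{len(rects)} proposals; keeping the first 1000")
proposals = rects[:1000]
```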


The following is a performance comparison of various methods for selecting candidate boxes.
{{1517393227_723.jpg}}


With candidate regions in hand, the remaining work is really just image classification on those regions (feature extraction + classification). Speaking of image classification: at the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), Geoffrey Hinton and his student Alex Krizhevsky used a convolutional neural network to bring the top-5 error of the classification task down to 15.3%, while the second-place entry, using traditional methods, had a top-5 error as high as 26.2%. Since then, convolutional neural networks have dominated image classification.


In 2014, RBG (Ross B. Girshick) used region proposals + CNN to replace the sliding windows + hand-designed features of traditional detection and designed the R-CNN framework, a huge breakthrough that set off the boom in deep-learning-based object detection.
{{1517393238_127.png}}


The brief steps of R-CNN are as follows:
(1) Input a test image
(2) Use the Selective Search algorithm to extract, bottom-up, about 2000 region proposals that may contain objects
(3) Because the extracted regions vary in size, warp each region proposal to a uniform 227×227 and feed it to the CNN, taking the output of the CNN's fc7 layer as the feature
(4) Feed the CNN feature extracted from each region proposal to an SVM for classification.
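Put together, R-CNN inference is essentially the loop below. This is a schematic sketch: `selective_search`, `cnn_features`, and the per-class `svms` are assumed helpers standing in for the components described in the detailed steps that follow:

```python
import cv2

def rcnn_detect(image, selective_search, cnn_features, svms, top_k=2000):
    """Schematic R-CNN inference: proposals -> warp -> CNN features -> per-class SVMs."""
    detections = []
    for (x, y, w, h) in selective_search(image)[:top_k]:
        warped = cv2.resize(image[y:y + h, x:x + w], (227, 227))  # warp to 227x227
        feat = cnn_features(warped)                # e.g. the fc7 activations
        for cls, svm in svms.items():
            score = svm.decision_function(feat.reshape(1, -1))[0]
            if score > 0:                          # positive margin: class detected
                detections.append((cls, (x, y, w, h), score))
    return detections
```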


The detailed steps are as follows:
Step 1: Train (or download) a classification model (such as AlexNet)
{{1517393249_806.jpg}}


Step 2: Fine-tune the model
  • Change the number of classes from 1000 to 20
  • Remove the last fully connected layer
{{1517393259_895.png}}


Step 3: Feature extraction
  • Extract all candidate boxes of the image (Selective Search)
  • For each region: warp it to fit the CNN input, do one forward pass, and save the output of the fifth pooling layer (i.e., the feature extracted for the candidate box) to disk
{{1517393264_688.png}}


Step 4: Train one SVM classifier per class (binary classification) to judge the category of the object in a candidate box.
Each class has its own SVM, which decides whether a sample belongs to that class: positive if it does, negative otherwise.


For example, the figure below shows the SVM for the dog class:
{{1517393277_461.png}}
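A sketch of what training one such per-class SVM might look like with scikit-learn, assuming the CNN features were cached to disk in step 3 (the file names are placeholders):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Assumed: fifth-pooling-layer features were cached to disk in step 3
pos_feats = np.load("dog_positive_feats.npy")   # proposals overlapping dogs
neg_feats = np.load("dog_negative_feats.npy")   # background proposals

X = np.vstack([pos_feats, neg_feats])
y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])

dog_svm = LinearSVC(C=1e-3).fit(X, y)           # one binary SVM just for "dog"
print(dog_svm.decision_function(X[:5]))         # positive score => "dog"
```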


Step 5: Use regressors to finely correct the positions of the candidate boxes. For each class, train a linear regression model to judge whether the box is well placed and refine it.
{{1517393285_829.png}}
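In the R-CNN paper this correction is a per-class linear (ridge) regression from pool5 features to four normalized offsets (t_x, t_y, t_w, t_h); a hedged sketch with synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import Ridge

def regression_targets(proposal, gt):
    """R-CNN-style targets: shift of the box center, log-scale of width/height."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([
        ((gx + gw / 2) - (px + pw / 2)) / pw,   # t_x
        ((gy + gh / 2) - (py + ph / 2)) / ph,   # t_y
        np.log(gw / pw),                        # t_w
        np.log(gh / ph),                        # t_h
    ])

feats = np.random.randn(100, 9216)    # stand-in for pool5 features (256*6*6 in AlexNet)
targets = np.random.randn(100, 4)     # stand-in for stacked regression_targets rows
reg = Ridge(alpha=1000.0).fit(feats, targets)   # one regressor per class in practice
```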


Attentive readers may have spotted the problem. In R-CNN, after Selective Search extracts the region proposals (about 2000) from the original image, each proposal is treated as a separate image in the subsequent processing (CNN feature extraction + SVM classification), so feature extraction and classification effectively run 2000 times per image. As a result, R-CNN is very slow: about 47 seconds per image.


{{1525103713_502.png}}


Is there a way to speed this up? Yes. Aren't those 2000 region proposals all parts of the same image? We can extract the convolutional features once for the whole image and then simply map each region proposal's position in the original image onto the convolutional feature map. That way, the convolutional features are computed only once per image, and the convolutional features of each region proposal are then fed to the fully connected layers for the subsequent steps.


But now each region proposal has a different scale, while the input to the fully connected layer must have a fixed length, so the features cannot be fed to it directly. SPP-Net solves exactly this problem.


SPP-Net
SPP: Spatial Pyramid Pooling


SPP-Net comes from the 2015 IEEE paper "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition".


As we all know, a CNN generally contains a convolutional part and a fully connected part: the convolutional layers do not require a fixed-size image, but the fully connected layers do require a fixed-size input.


{{1525243658_939.png}}


Therefore, when input data comes in various sizes, it must first be unified to a fixed size such as 224×224 (ImageNet), 32×32 (LeNet), or 96×96, either by cropping (crop: cutting a patch of the network's input size, e.g. 227×227, out of a larger image) or by warping (warp: resizing the content of a bounding box to, e.g., 227×227).


{{1525249316_603.png}}


So, as noted above, in R-CNN "because the size of the extracted regions is different, each region proposal needs to be warped to a uniform 227×227 and input to the CNN".


However, warp/crop preprocessing causes some data loss or geometric distortion, which limits recognition accuracy. Not convinced? Imagine resizing a 16:9 image into a 1:1 one: doesn't it look distorted?


Kaiming He et al., the authors of SPP-Net, thought in reverse: since it is the fully connected (FC) layers that force an ordinary CNN to fix its input-image size, while the convolutional layers can adapt to any size, why not add some structure at the end of the convolutional layers so that the input arriving at the FC layers becomes fixed?


This "turning decay into magic" structure is the spatial pyramid pooling layer. The following figure is a comparison of the detection process of R-CNN and SPP Net:


{{1525249330_874.png}}


It has two characteristics:


1. It combines the spatial pyramid method to realize multi-scale input for CNNs.


The first contribution of SPP-Net is connecting a pyramid pooling layer after the last convolutional layer, which guarantees that the input passed to the fully connected layer is of fixed length.


In other words, in the ordinary CNN setup the size of the input image is fixed (say 224×224 pixels) and the output is a fixed-dimensional vector. SPP-Net adds the spatial pyramid pooling layer (the idea later simplified into the ROI pooling layer) to the ordinary CNN structure, so that the network's input image can be of any size while the output is unchanged: still a fixed-dimensional vector.


In short, a CNN originally had fixed input and fixed output; with SPP added, it takes arbitrary input while keeping a fixed output. Magical, right?


The SPP/ROI pooling layer generally follows the convolutional layers; at that point the network's input can be of any scale. In the SPP layer, each pooling filter is resized according to the input, and the output of SPP is a fixed-dimensional vector, which is then handed to the fully connected (FC) layers.
{{1517393299_167.jpg}}
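A minimal spatial pyramid pooling layer in PyTorch: max-pool the feature map at several grid resolutions (here 4×4, 2×2, 1×1) and concatenate, so the output length depends only on the channel count and the pyramid, never on the input size:

```python
import torch
import torch.nn.functional as F

def spp(feature_map, levels=(4, 2, 1)):
    """feature_map: (N, C, H, W) of any H, W -> (N, C * sum(l*l for l in levels))."""
    pooled = [
        F.adaptive_max_pool2d(feature_map, output_size=(l, l)).flatten(1)
        for l in levels
    ]
    return torch.cat(pooled, dim=1)

for size in [(13, 13), (20, 31)]:   # different input sizes...
    out = spp(torch.randn(1, 256, *size))
    print(out.shape)                # ...same output: (1, 256 * 21) = (1, 5376)
```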


2. It extracts the convolutional features of the original image only once.


In R-CNN, each candidate box is first resized to a uniform size and then used as CNN input, which is very inefficient.
SPP-Net fixes exactly this shortcoming: run the convolution computation only once on the original image to obtain the feature map of the whole image, then find the mapped patch of each candidate box on the feature map and feed that patch, as the candidate box's convolutional feature, into the SPP layer and subsequent layers to complete the feature extraction.


Thus R-CNN computes convolutions once per region while SPP-Net computes them only once per image, saving a great deal of computation time: roughly a hundred times faster than R-CNN.
{{1517393307_311.jpg}}


Fast R-CNN


SPP-Net really is a good method. R-CNN's upgraded version, Fast R-CNN, adopts the SPP-Net idea on top of R-CNN and improves it, pushing performance further.


What are the differences between R-CNN and Fast R-CNN?
First, R-CNN's shortcoming: even with a preprocessing step such as Selective Search extracting potential bounding boxes as input, R-CNN still has a serious speed bottleneck. The reason is obvious: feature extraction is computed repeatedly over overlapping regions. Fast R-CNN was born to solve this problem.
{{1517393313_544.png}}


Comparing with the R-CNN framework diagram, two main differences stand out: first, an ROI pooling layer is added after the last convolutional layer; second, the loss function is a multi-task loss that brings bounding-box regression directly into the CNN for training.
(1) The ROI pooling layer is actually a simplified version of SPP-Net: where SPP-Net uses pyramids of several grid sizes for each proposal, the ROI pooling layer downsamples only to a single 7×7 feature map. For the conv5_3 layer of the VGG16 network there are 512 feature maps, so every region proposal maps to a 7×7×512-dimensional feature vector that serves as input to the fully connected layer.
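torchvision ships an RoI pooling op that performs exactly this mapping; a small sketch (the feature-map and box values are made up, and spatial_scale converts image coordinates to feature-map coordinates for a stride-16 network):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 38, 50)        # e.g. conv5_3 of VGG16, stride 16
boxes = torch.tensor([[0, 48.0, 64.0, 320.0, 288.0]])  # (batch_idx, x1, y1, x2, y2)
rois = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1 / 16)
print(rois.shape)  # torch.Size([1, 512, 7, 7]) -> 7*7*512 features per proposal
```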


In other words, this layer maps inputs of different sizes to a fixed-length feature vector. We know that operations such as conv, pooling, and ReLU do not require fixed-size inputs; but after running them on the original image, the resulting feature maps differ in size with the input image, so they cannot be connected directly to a fully connected layer for classification. The magic of the ROI pooling layer is that it extracts a fixed-dimensional feature representation for each region, after which an ordinary softmax performs category recognition.


(2) R-CNN's training is divided into three stages, while Fast R-CNN replaces the SVM with softmax classification and, via the multi-task loss function, also brings bounding-box regression into the network, making the whole training process end-to-end (apart from the region-proposal extraction stage).


That is, R-CNN's former pipeline was: generate proposals, extract features with a CNN, classify with SVMs, and finally do bbox regression. In Fast R-CNN, the author cleverly moved bbox regression inside the neural network and combined it with region classification to form a multi-task model. Experiments also proved that the two tasks can share convolutional features and promote each other.
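A simplified sketch of the multi-task loss: softmax cross-entropy over classes plus a smooth-L1 loss on box offsets for non-background RoIs (the real Fast R-CNN predicts 4 deltas per class; this version predicts a single set of 4 for brevity):

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, box_deltas, labels, target_deltas, lam=1.0):
    """cls_scores: (R, K+1); box_deltas, target_deltas: (R, 4); labels: (R,)."""
    loss_cls = F.cross_entropy(cls_scores, labels)
    fg = labels > 0                               # label 0 = background: no box loss
    loss_box = F.smooth_l1_loss(box_deltas[fg], target_deltas[fg]) if fg.any() else 0.0
    return loss_cls + lam * loss_box
```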


{{1525104022_285.png}}


Therefore, a very important contribution of Fast R-CNN is letting people see the hope of real-time detection within the region proposal + CNN framework: multi-class detection really can gain processing speed while preserving accuracy. This laid the groundwork for the later Faster R-CNN.


A key point:
R-CNN has considerable shortcomings (remove them all and you get Fast R-CNN).
Big shortcoming: each candidate box passes through the CNN on its own, which costs a lot of time.
Solution: share the convolutional layers. Instead of feeding each candidate box into the CNN as input, feed one complete image and take each candidate box's features from the fifth convolutional layer.


Original method: many candidate boxes (say 2000) --> CNN --> features of each candidate box --> classification + regression
Current method: one complete image --> CNN --> features of each candidate box --> classification + regression


So the reason Fast R-CNN is faster than R-CNN is easy to see: unlike R-CNN, which runs the deep network once per candidate region, it runs the network on the entire image once, then maps the candidate boxes onto conv5. As in SPP-Net, the features are computed only once, and all remaining work operates on the conv5 layer.


The performance improvement is also quite obvious:
{{1517393334_157.png}}


Faster R-CNN


Problem with Fast R-CNN: there is still a bottleneck: Selective Search has to find all the candidate boxes, and that too is very time-consuming. So can we find a more efficient way to obtain these candidate boxes?


Solution: add a neural network that extracts edges; in other words, hand the job of finding candidate boxes over to a neural network as well.


Hence RBG and collaborators introduced the Region Proposal Network (RPN) to replace Selective Search in Fast R-CNN, and introduced anchor boxes to cope with changes in target shape.


The specific approach:
  • Put the RPN after the last convolutional layer
  • Train the RPN directly to produce candidate regions
{{1517393348_780.png}}


An introduction to the RPN:
  • Slide a window over the feature map
  • Build a small neural network for object classification + box-position regression
  • The position of the sliding window provides the rough location of the object
  • The box regression provides a more precise box position
{{1517393355_598.png}}
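The RPN head itself is tiny; schematically, a 3×3 conv slides over the feature map and two sibling 1×1 convs emit, for each of k anchors per position, 2 objectness scores and 4 box offsets. A sketch with k = 9 (3 scales × 3 aspect ratios):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):     # k anchors per sliding position
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, 3, padding=1)  # the "sliding window"
        self.cls = nn.Conv2d(512, 2 * k, 1)       # object vs. not-object per anchor
        self.reg = nn.Conv2d(512, 4 * k, 1)       # box deltas per anchor

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)

scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)  # (1, 18, 38, 50) (1, 36, 38, 50)
```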


One network, four loss functions:
  • RPN classification (anchor good / bad)
  • RPN regression (anchor -> proposal)
  • Fast R-CNN classification (over classes)
  • Fast R-CNN regression (proposal -> box)
{{1517393380_922.png}}


Speed comparison
{{1517393385_752.png}}


The main contribution of Faster R-CNN is the design of the RPN, the network that extracts candidate regions in place of the time-consuming Selective Search, greatly improving detection speed.


Finally, a summary of the steps of the major algorithms:
R-CNN
1. Determine about 1000-2000 candidate boxes in the image (using Selective Search)
2. Warp the image patch inside each candidate box to the same size and feed it into a CNN for feature extraction
3. For the features extracted from each candidate box, use classifiers to decide whether it belongs to a particular class
4. For candidate boxes belonging to some class, use a regressor to further adjust their positions


Fast R-CNN
1. Determine about 1000-2000 candidate boxes in the image (using Selective Search)
2. Feed the entire image into a CNN to get the feature map
3. Find the mapped patch of each candidate box on the feature map and feed it, as the candidate box's convolutional feature, into the ROI pooling layer and subsequent layers
4. For the features extracted from each candidate box, use classifiers to decide whether it belongs to a particular class
5. For candidate boxes belonging to some class, use a regressor to further adjust their positions


Faster R-CNN
1. Feed the entire image into a CNN to get the feature map
2. Feed the convolutional features into the RPN to get the candidate boxes' feature information
3. For the features extracted from each candidate box, use classifiers to decide whether it belongs to a particular class
4. For candidate boxes belonging to some class, use a regressor to further adjust their positions.


In short, as listed at the beginning of this article:
R-CNN (Selective Search + CNN + SVM)
SPP-net (ROI Pooling)
Fast R-CNN (Selective Search + CNN + ROI)
Faster R-CNN (RPN + CNN + ROI)


In general, along the road from R-CNN through SPP-Net and Fast R-CNN to Faster R-CNN, the pipeline of deep-learning-based object detection has become more and more streamlined, with ever higher accuracy and ever faster speed. It is fair to say that the region-proposal-based R-CNN family is the most important branch in today's object detection field.


YOLO (CVPR2016, oral)
(You Only Look Once: Unified, Real-Time Object Detection)


The Faster R-CNN methods are currently the mainstream in object detection, but their speed cannot meet real-time requirements, which is where methods such as YOLO gradually show their importance. These methods adopt the idea of regression: given an input image, directly regress, at multiple positions of the image, the target box and the target category at that position.


{{1525171091_647.jpg}}


Let's look directly at the flowchart of YOLO's detection above:
(1) Given an input image, first divide it into a 7×7 grid
(2) For each grid cell, predict 2 bounding boxes (including the confidence that each bounding box contains a target and the probability of each bounding-box region over the multiple categories)
(3) This yields 7×7×2 predicted target windows; remove the low-probability windows by thresholding, then let NMS remove the redundant windows.
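In YOLOv1 terms, the raw network output for 20 VOC classes is a 7×7×30 tensor (per cell: 2 boxes × 5 values + 20 class probabilities). A rough decoding sketch, with thresholding and NMS as described above (tensor layout assumed; the real implementation differs in details):

```python
import torch
from torchvision.ops import nms

def decode_yolo(pred, conf_thresh=0.2, iou_thresh=0.5):
    """pred: (7, 7, 30) tensor; per cell: 2 * (x, y, w, h, conf) + 20 class probs."""
    boxes, scores = [], []
    for i in range(7):
        for j in range(7):
            cell = pred[i, j]
            class_probs = cell[10:]
            for b in range(2):
                x, y, w, h, conf = (float(v) for v in cell[5 * b: 5 * b + 5])
                score = conf * float(class_probs.max())   # class-specific confidence
                if score > conf_thresh:
                    # (x, y) are offsets within the cell; w, h are image-relative
                    cx, cy = (j + x) / 7, (i + y) / 7
                    boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
                    scores.append(score)
    if not boxes:
        return torch.empty(0, 4), torch.empty(0)
    boxes, scores = torch.tensor(boxes), torch.tensor(scores)
    keep = nms(boxes, scores, iou_thresh)                 # drop redundant windows
    return boxes[keep], scores[keep]

boxes, scores = decode_yolo(torch.rand(7, 7, 30))
print(boxes.shape, scores.shape)
```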


As you can see, the whole pipeline is very simple: no intermediate region proposals are needed to find targets; direct regression determines both position and category.
{{1525171142_149.jpg}}


Summary: YOLO converts the detection task into a regression problem, which greatly speeds up detection: YOLO can process 45 images per second. Moreover, since the network uses full-image information to predict each target window, the proportion of false positives drops sharply (ample contextual information).


But YOLO has problems too: without a region-proposal mechanism, regressing on nothing finer than a 7×7 grid makes target localization imprecise, which is also why YOLO's detection accuracy is not very high.


SSD
(SSD: Single Shot MultiBox Detector)


Consider YOLO's problems as analyzed above: regressing with whole-image features on a coarse 7×7 grid cannot localize targets very precisely. Could the region-proposal idea be combined in to achieve more precise localization? SSD does exactly this, combining YOLO's regression idea with Faster R-CNN's anchor mechanism.


{{1525171268_230.jpg}}


The picture above is SSD's framework diagram. SSD obtains target positions and categories the same way YOLO does, by regression; but whereas YOLO predicts a given position using the features of the whole image, SSD predicts a position using the features around that position (which feels more reasonable).


So how do we establish the correspondence between a position and its features? You may already have guessed: with Faster R-CNN's anchor mechanism. As shown in SSD's framework diagram, if the feature map of some layer (Figure b) has size 8×8, a 3×3 sliding window extracts the feature at each position, and that feature is then regressed to obtain the target's coordinate information and category information (Figure c).


Unlike in Faster R-CNN, these anchors sit on multiple feature maps, so multi-layer features are used and multiple scales come naturally (the 3×3 sliding windows on feature maps of different layers have different receptive fields).
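Schematically, SSD attaches a small 3×3 convolutional predictor to each of several feature maps; per position and per anchor it outputs class scores and 4 offsets. A sketch (channel counts and anchor count are illustrative):

```python
import torch
import torch.nn as nn

class SSDHead(nn.Module):
    """One 3x3 predictor per feature map; k anchors per position."""
    def __init__(self, channels=(512, 1024, 512), num_classes=21, k=6):
        super().__init__()
        self.cls = nn.ModuleList(nn.Conv2d(c, k * num_classes, 3, padding=1) for c in channels)
        self.reg = nn.ModuleList(nn.Conv2d(c, k * 4, 3, padding=1) for c in channels)

    def forward(self, feature_maps):
        # One prediction set per scale (e.g. 38x38, 19x19, 10x10 in SSD300)
        return [(c(f), r(f)) for c, r, f in zip(self.cls, self.reg, feature_maps)]

head = SSDHead()
feats = [torch.randn(1, 512, 38, 38), torch.randn(1, 1024, 19, 19), torch.randn(1, 512, 10, 10)]
for cls_out, reg_out in head(feats):
    print(cls_out.shape, reg_out.shape)
```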


Summary: SSD combines the regression idea of YOLO with the anchor mechanism of Faster R-CNN and regresses using multi-scale region features at every position of the whole image. It keeps YOLO's fast speed while making window predictions about as accurate as Faster R-CNN's. SSD reaches 72.1% mAP on VOC2007, running at 58 frames per second on a GPU.


Main references and further reading


1 https://www.cnblogs.com/skyfsm/p/6806246.html, by @Madcola
2 https://mp.weixin.qq.com/s?__biz=MzI1NTE4NTUwOQ==&mid=502841131&idx=1&sn=bb3e8e6aeee2ee1f4d3f22459062b814#rd
3 https://zhuanlan.zhihu.com/p/27546796
4 https://blog.csdn.net/v1_vivian/article/details/73275259
5 https://blog.csdn.net/tinyzhao/article/details/53717136
6 Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, by Kaiming He et al.
7 https://zhuanlan.zhihu.com/p/24774302
