Research progress in object detection based on deep learning


Foreword


Before starting, take a look at the picture above left. What objects do you see in it, and where are they? That is not hard: there is a cat and a person in the picture, located at the two bounding boxes shown on the right side of the image. What you just did is object detection. Object detection means: given an image or video frame, find the positions of all targets in it and give the specific category of each one.

Object detection is a simple task for humans, but to a computer an image is just an array of values ranging from 0 to 255, so it is hard to directly obtain high-level semantic concepts such as "person" or "cat", and it is unclear in which region of the image a target appears. A target may appear anywhere in the image, its shape can vary widely, and image backgrounds differ greatly. These factors make object detection far from easy to solve.

Thanks to deep learning, mainly convolutional neural networks (CNNs) and region proposal algorithms, object detection has made huge breakthroughs since 2014. This article analyzes and summarizes deep-learning-based object detection algorithms in four parts: Part 1 briefly introduces the pipeline of traditional object detection; Part 2 introduces the frameworks that combine region proposals with CNN classification, represented by R-CNN (R-CNN, SPP-NET, Fast R-CNN, Faster R-CNN); Part 3 introduces the frameworks that cast object detection as a regression problem, represented by YOLO (YOLO, SSD); Part 4 introduces some techniques and methods that improve object detection performance.

1. Traditional object detection methods
As shown in the figure above, a traditional object detection pipeline generally has three stages: first select some candidate regions in the given image, then extract features from those regions, and finally classify them with a trained classifier. We introduce these three stages in turn below.

1) Region selection

This step locates the target. Since a target may appear anywhere in the image and its size and aspect ratio are unknown in advance, the earliest strategy was to traverse the whole image with sliding windows at multiple scales and aspect ratios. Although this exhaustive strategy covers every possible target position, its drawbacks are obvious: the time complexity is far too high and it produces a large number of redundant windows, which severely hurts the speed and performance of the subsequent feature extraction and classification. (In practice, because of the time complexity, sliding-window aspect ratios are usually limited to a few fixed settings, so for multi-category detection with widely varying aspect ratios, even exhaustive sliding-window traversal cannot yield good candidate regions.)

2) Feature extraction

Because of the morphological diversity of targets, the diversity of illumination changes, and the diversity of backgrounds, designing a robust feature is not easy, yet the quality of the extracted features directly affects classification accuracy. (Commonly used features at this stage include SIFT and HOG.)

3) Classifier

The main classifiers are SVM, AdaBoost, and so on.

Summary: traditional object detection has two main problems. First, the sliding-window region selection strategy is untargeted, has high time complexity, and produces redundant windows. Second, hand-designed features are not very robust to the diversity of appearance changes.

2. Region-proposal-based deep learning object detection algorithms

How can we solve the two main problems of traditional object detection?

For the sliding-window problem, region proposals provide a good solution. A region proposal (candidate region) finds in advance the positions in the image where targets are likely to appear. Because region proposal methods exploit texture, edge, color, and other image cues, they can maintain a high recall rate while selecting relatively few windows (thousands or even hundreds). This greatly reduces the time complexity of subsequent operations, and the candidate windows are of higher quality than sliding windows (which have fixed aspect ratios). Commonly used region proposal algorithms include Selective Search and EdgeBoxes. To learn more about region proposals, see "What makes for effective detection proposals?" (PAMI 2015).
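As a concrete illustration, here is a minimal sketch of generating region proposals with the Selective Search implementation shipped in OpenCV's contrib module (this assumes opencv-contrib-python is installed; the image path is a placeholder):

```python
import cv2

img = cv2.imread("example.jpg")  # placeholder path for any test image

# Selective Search from the OpenCV contrib module (opencv-contrib-python).
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()  # the "fast" mode trades some recall for speed
rects = ss.process()              # candidate boxes as (x, y, w, h)

proposals = rects[:2000]          # keep ~2000 proposals, as R-CNN does
print(len(rects), "windows proposed")
```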

With candidate regions in hand, the remaining work is essentially image classification of those regions (feature extraction + classification). Speaking of image classification: at the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), Geoffrey Hinton and his student Alex Krizhevsky used a convolutional neural network to reduce the top-5 error of the ILSVRC classification task to 15.3%, while the runner-up using traditional methods had a top-5 error as high as 26.2%. Since then, convolutional neural networks have dominated image classification, and the top-5 errors of Microsoft's latest ResNet and Google's Inception V4 have dropped below 4%, surpassing human ability on this specific task. So using a CNN to classify the candidate regions after obtaining them is a natural choice.

In 2014, Ross Girshick (RBG) replaced the sliding windows and hand-designed features of traditional object detection with region proposals plus CNNs, designing the R-CNN framework. It was a huge breakthrough for object detection and started the wave of deep-learning-based detection.


1) R-CNN (CVPR 2014, TPAMI 2015)

(Region-based Convolutional Networks for Accurate Object Detection and Segmentation)


The frame diagram above clearly shows R-CNN's object detection process:

(1) Input a test image.

(2) Use the Selective Search algorithm to extract about 2000 region proposals from the image.

(3) Warp each region proposal to 227x227 and feed it into the CNN, taking the output of the CNN's fc7 layer as the feature.

(4) Feed the CNN features of each region proposal into the SVMs for classification.

Here are some explanations for the above framework:

* The frame diagram above is the test-time pipeline. Before testing, we must train the CNN used for feature extraction and the SVMs used for classification: fine-tune a model pre-trained on ImageNet (AlexNet/VGG16) to obtain the feature-extraction CNN, then use that CNN to extract features from the training set and train the SVMs.

* Each region proposal is scaled to the same size because the input dimension of the CNN's fully connected layers must be fixed.

* One step is missing from the figure: bounding-box regression is applied to the region proposals classified by the SVMs. Bounding-box regression is a linear regression algorithm that corrects a region proposal so that the proposed window fits the ground-truth window better. It is needed because proposals are never as accurate as manual annotations: if a proposal lies far from the target, then even if the classification is correct, an IoU (the intersection-over-union ratio between the region proposal and the ground-truth window) below 0.5 means the target is effectively still not detected.
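For reference, here is a minimal sketch of the IoU computation, with boxes given as (x1, y1, x2, y2) corner coordinates (the example boxes are arbitrary):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A proposal matching a ground truth with IoU < 0.5 still counts as a miss.
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143
```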

Summary: R-CNN raised the detection result on PASCAL VOC 2007 from 34.3% (DPM HSC) straight to 66% (mAP). Such a large improvement shows the huge advantage of region proposals + CNNs.

But the R-CNN framework also has many problems:

(1) Training is divided into multiple stages and the steps are cumbersome: fine-tune the network + train the SVMs + train the bounding-box regressors.

(2) Training is time-consuming and takes up a lot of disk space: 5000 images generate feature files of several hundred gigabytes.

(3) Detection is slow: even on a GPU, the VGG16 model takes 47 s to process one image.

For the speed problem, SPP-NET gives a good solution.

2) SPP-NET (ECCV 2014, TPAMI 2015)

(Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition)

Let's first see why R-CNN detection is so slow: 47 s for one image! Looking closely at the R-CNN pipeline, after about 2000 region proposals are extracted from an image, each proposal is treated as a separate image for the subsequent processing (CNN feature extraction + SVM classification). In effect, feature extraction and classification are run 2000 times on a single image!

Is there a way to speed this up? There is. The 2000 region proposals are all parts of the same image, so we can extract the convolutional features over the whole image once, map each region proposal's position in the original image onto the convolutional feature map, and feed only each proposal's convolutional features into the fully connected layers for the subsequent steps. (For a CNN, most of the computation is spent in the convolutions, so this saves a great deal of time.) The remaining problem is that the region proposals have different sizes, and the fully connected layers require a fixed-length input, so the mapped features cannot be fed in directly. SPP-NET solves exactly this:


The picture above shows the SPP-NET network structure. Feed any image into the CNN; after the convolution operations we obtain the convolutional features (for example, conv5_3, the last convolutional layer of VGG16, produces 512 feature maps). A "window" in the figure is the region of the feature map corresponding to one region proposal in the original image. We only need to map these windows of different sizes to features of the same dimension to use as the fully connected input, which guarantees that the convolutional features are extracted only once per image. SPP-NET does this with spatial pyramid pooling: each window is divided into 4*4, 2*2, and 1*1 grids of blocks, and each block is downsampled with max pooling, so after the SPP layer every window yields a feature vector of length (4*4+2*2+1)*512, which becomes the input of the fully connected layers for the subsequent operations.
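Here is a minimal sketch of spatial pyramid pooling over one mapped window, written in PyTorch; the 512 channels, the stride of 16, and the example proposal coordinates are assumptions matching the VGG16 description above:

```python
import torch
import torch.nn.functional as F

def spp(window_feats, levels=(4, 2, 1)):
    """window_feats: (512, h, w) conv features cropped for one proposal."""
    pooled = []
    for n in levels:
        # Divide the window into an n*n grid and max-pool each block.
        p = F.adaptive_max_pool2d(window_feats.unsqueeze(0), n)
        pooled.append(p.flatten(1))
    return torch.cat(pooled, dim=1)   # (1, (16+4+1)*512) = (1, 10752)

conv5 = torch.randn(512, 40, 60)      # conv5_3 features of the whole image
x1, y1, x2, y2 = 160, 80, 480, 400    # a proposal in image coordinates
# Map the proposal onto the feature map by dividing by the stride (16).
fx1, fy1, fx2, fy2 = (c // 16 for c in (x1, y1, x2, y2))
window = conv5[:, fy1:fy2, fx1:fx2]
print(spp(window).shape)              # torch.Size([1, 10752])
```

However large or small the window is, the SPP output has the fixed length (4*4+2*2+1)*512 = 10752 that the fully connected layers expect.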

Summary: compared with R-CNN, SPP-NET greatly speeds up object detection, but many problems remain:

(1) Training is still divided into multiple stages and the steps are cumbersome: fine-tune the network + train the SVMs + train the bounding-box regressors.

(2) SPP-NET freezes the convolutional layers when fine-tuning and only fine-tunes the fully connected layers, but for a new task the convolutional layers should be fine-tuned as well. (Features extracted by a classification model emphasize high-level semantics, while detection also needs the target's location information.)

To address these two problems, RBG proposed Fast R-CNN, a streamlined and fast object detection framework.


3) Fast R-CNN (ICCV 2015)

With R-CNN and SPP-NET introduced above, let's look directly at the frame diagram of Fast R-CNN:


Compared with the R-CNN frame diagram, there are two main differences: first, an RoI pooling layer is added after the last convolutional layer; second, the loss function is a multi-task loss, so bounding-box regression is trained directly inside the CNN.

(1) The RoI pooling layer is actually a single-level simplification of SPP-NET: whereas SPP-NET pools each proposal with pyramids of several grid sizes, the RoI pooling layer simply downsamples each proposal to a 7x7 feature map. For VGG16's conv5_3 with 512 feature maps, every region proposal thus becomes a 7*7*512-dimensional feature vector that serves as the input of the fully connected layers.

(2) R-CNN training is divided into three stages, whereas Fast R-CNN replaces the SVMs with softmax classification and, through the multi-task loss, brings bounding-box regression into the network as well, making the whole training process end to end (excluding the region proposal extraction stage); a sketch of the multi-task loss idea appears after this list.

(3) During network fine-tuning, Fast R-CNN also fine-tunes some convolutional layers, achieving better detection results.
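The following is a minimal sketch of the multi-task loss idea (softmax cross-entropy plus smooth-L1 box regression); note that for simplicity the box head here is class-agnostic and the loss weight lam is an assumption, whereas Fast R-CNN actually predicts per-class box deltas:

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_scores, box_deltas, labels, box_targets, lam=1.0):
    """cls_scores: (N, K+1) class logits; box_deltas/box_targets: (N, 4);
    labels: (N,) integer class labels with 0 = background."""
    loss_cls = F.cross_entropy(cls_scores, labels)
    fg = labels > 0                      # background RoIs get no box loss
    if fg.any():
        loss_box = F.smooth_l1_loss(box_deltas[fg], box_targets[fg])
    else:
        loss_box = cls_scores.new_zeros(())
    return loss_cls + lam * loss_box

# Toy batch of 8 RoIs with 20 VOC classes + background.
scores = torch.randn(8, 21, requires_grad=True)
deltas = torch.randn(8, 4, requires_grad=True)
labels = torch.randint(0, 21, (8,))
targets = torch.randn(8, 4)
multitask_loss(scores, deltas, labels, targets).backward()
```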

Summary: Fast R-CNN combines the strengths of R-CNN and SPP-NET and introduces the multi-task loss, making training and testing of the whole network very convenient. Trained on the PASCAL VOC 2007 training set, it reaches 66.9% mAP on the VOC 2007 test set; trained on VOC 2007+2012, it reaches 70% (expanding the dataset can significantly improve detection performance). With VGG16 it takes about 3 s in total per image.

Drawback: region proposals are still extracted by Selective Search, which is where most of the detection time is spent (2~3 s for region proposals versus only 0.32 s for feature extraction and classification). This cannot meet real-time requirements, and because Selective Search runs first, the pipeline is not trained and tested truly end to end. So can a CNN directly generate the region proposals as well as classify them? Faster R-CNN is the detection framework that answers this need.

4) Faster R-CNN (NIPS 2015)

(Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks)

In region proposal + CNN detection frameworks, the quality of the region proposals directly affects detection accuracy. If we can find a way to extract only a few hundred or fewer high-quality candidate windows while keeping recall high, we can both speed up detection and improve its performance (fewer false positives). The RPN (Region Proposal Network) came into being for this.

The core idea of the RPN is to generate region proposals directly with a convolutional neural network; the method used is still essentially sliding windows. The design of the RPN is ingenious: it slides only once over the last convolutional layer, and thanks to the anchor mechanism and bounding-box regression it obtains region proposals at multiple scales and aspect ratios.


Look at the RPN network structure diagram above (using the ZF model). Given an input image (say 600*1000), the convolution operations produce the final convolutional feature map (about 40*60). A 3*3 convolution kernel (the sliding window) is applied to this feature map; since the last convolutional layer has 256 feature maps, each 3*3 position yields a 256-dimensional feature vector, followed by a cls layer and a reg layer for classification and bounding-box regression respectively (similar to Fast R-CNN, except that here the only categories are "target" and "background"). Each position of the 3*3 sliding window simultaneously predicts region proposals at 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1) of the input image; this mapping mechanism is called the anchor. So for the 40*60 feature map there are about 20,000 (40*60*9) anchors in total, i.e. about 20,000 region proposals are predicted.
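Here is a minimal sketch of generating the 9 anchors at one sliding-window position; keeping the anchor area at scale squared while varying the height/width ratio is the convention assumed here:

```python
def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """9 anchors (x1, y1, x2, y2) centered at one sliding-window position."""
    boxes = []
    for s in scales:
        for r in ratios:
            # Keep the area at s*s while setting height/width = r.
            w = s / r ** 0.5
            h = s * r ** 0.5
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# One anchor set per feature-map cell: ~40*60 cells -> ~40*60*9 anchors.
for box in anchors_at(cx=300, cy=500):
    print([round(c, 1) for c in box])
```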

What are the benefits of this design? Although a sliding-window strategy is still used, the sliding window now operates on the convolutional feature map, whose resolution is 16*16 times smaller than the original image (after four 2*2 pooling operations); the anchors provide 9 candidates per position, covering three scales and three aspect ratios, and bounding-box regression follows, so even targets whose shapes fall outside these 9 anchors can obtain a region proposal close to them.

The NIPS 2015 version of Faster R-CNN performs detection with the RPN and the Fast R-CNN network kept separate. The overall pipeline is the same as Fast R-CNN, except that the region proposals are now extracted by the RPN (instead of Selective Search). To let the RPN and the Fast R-CNN network share the weights of the convolutional layers, the authors train them in four stages:

(1) Initialize the network parameters with a model pre-trained on ImageNet and fine-tune the RPN;

(2) Use the RPN from (1) to extract region proposals and train the Fast R-CNN network;

(3) Use the Fast R-CNN network from (2) to re-initialize the RPN, fix the shared convolutional layers, and fine-tune the RPN-specific layers;

(4) Keep the convolutional layers of the Fast R-CNN from (2) fixed, and fine-tune its remaining layers using the region proposals extracted by the RPN from (3).

With shared weights, the RPN and Fast R-CNN together improve object detection accuracy.

With the trained RPN, given a test image we directly obtain the region proposals after bounding-box regression, sort them by the RPN's classification (objectness) score, and take the top 300 windows as the input of Fast R-CNN for detection. Trained on the VOC 07+12 training set, it reaches 73.2% mAP on the VOC 2007 test set (versus 70% for Selective Search + Fast R-CNN), and detection runs at 5 frames per second (versus 2~3 s per image for Selective Search + Fast R-CNN).

Note that the latest version merges the RPN and the Fast R-CNN network: the proposals produced by the RPN connect directly to the RoI pooling layer, truly realizing an end-to-end object detection framework within a single CNN.

Summary: Faster R-CNN unifies the previously separate region proposal and CNN classification into one end-to-end network, improving both speed and accuracy. However, Faster R-CNN still falls short of real-time detection: it first obtains the region proposals and then classifies each one, which is computationally expensive. Fortunately, the emergence of methods such as YOLO makes real-time detection possible.

Overall, along the path from R-CNN through SPP-NET and Fast R-CNN to Faster R-CNN, deep-learning-based object detection has become ever more streamlined, more accurate, and faster. The region-proposal-based R-CNN series is arguably the most important branch of current object detection.

3. Regression-based deep learning object detection algorithms

Faster R-CNN is currently the mainstream object detection method, but its speed cannot meet real-time requirements, so methods such as YOLO have gradually shown their importance. These methods use the idea of regression: given an input image, they directly regress the target's bounding box and category at multiple positions of the image.

1) YOLO (CVPR 2016, oral)

(You Only Look Once: Unified, Real-Time Object Detection)


Let's look directly at YOLO's object detection flow chart above:

(1) Given an input image, first divide it into a 7*7 grid.

(2) For each grid cell, predict 2 bounding boxes (each box comes with a confidence that it contains a target, and each cell additionally predicts the probabilities over the object classes).

(3) This yields 7*7*2 predicted windows; windows with low confidence are removed by thresholding, and the redundant windows are then removed by NMS (sketched below).

The whole process is very simple: no intermediate region proposal step is needed to find targets; regression alone determines both position and category.
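Step (3) above relies on non-maximum suppression; here is a minimal, self-contained sketch of greedy NMS (the boxes and scores are toy values):

```python
def _iou(a, b):
    # Same computation as the iou() sketch in the R-CNN section above.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring box, dropping boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if _iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 100, 100), (10, 10, 110, 110), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much
```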

So how can the target's position and category be regressed directly at the different grid locations? Above is YOLO's network structure diagram. The front of the network resembles the GoogLeNet model; the key part is the last two layers: the convolutional output is followed by a 4096-dimensional fully connected layer, which is then fully connected to a 7*7*30 tensor. Here 7*7 is the number of grid cells, and on each cell we predict two candidate target boxes plus the class probabilities of that location. Each box carries 4 coordinate values (center coordinates plus width and height) and 1 confidence value, and there are 20 classes (on VOC), for a total of (4+1)*2+20 = 30 dimensions. In this way, the information needed for detection (box plus category) is regressed directly on each grid cell from the preceding 4096-dimensional whole-image feature.
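To make the bookkeeping concrete, here is a minimal sketch of how one cell of the 7*7*30 output tensor splits into box predictions and class probabilities (the tensor here is random, just to show the layout):

```python
import numpy as np

S, B, C = 7, 2, 20                     # grid size, boxes per cell, classes
out = np.random.rand(S, S, B * 5 + C)  # the (7, 7, 30) output tensor

cell = out[3, 4]                       # predictions of one grid cell
boxes = cell[:B * 5].reshape(B, 5)     # each row: (x, y, w, h, confidence)
class_probs = cell[B * 5:]             # 20 conditional class probabilities

# Class-specific score of each box = box confidence * class probability.
scores = boxes[:, 4:5] * class_probs   # shape (2, 20)
print(scores.shape)
```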

Summary: YOLO turns the object detection task into a regression problem, which greatly speeds up detection: YOLO can process 45 images per second. Moreover, because the network predicts each window from whole-image features, the proportion of false positives drops sharply (ample context information). But YOLO also has problems: without a region proposal mechanism, regressing only on a coarse 7*7 grid localizes targets imprecisely, which also keeps YOLO's detection accuracy relatively low.

2) SSD

(SSD: Single Shot MultiBox Detector)

The analysis above shows YOLO's problem: regressing whole-image features on a coarse 7*7 grid does not localize targets precisely. Could combining the idea of region proposals achieve more precise localization? SSD does exactly that, combining YOLO's regression idea with Faster R-CNN's anchor mechanism.


The picture above is SSD's frame diagram. SSD obtains target positions and categories the same way YOLO does, by regression; but whereas YOLO predicts each position from the features of the whole image, SSD predicts each position from the features around that position (which feels more reasonable). So how is the correspondence between a position and its features established? You may have guessed it: with Faster R-CNN's anchor mechanism. As the frame diagram shows, if some layer's feature map (figure b) is 8*8, a 3*3 sliding window extracts the features at each position, and these features are regressed to obtain the target's coordinate information and category information (figure c).

Unlike Faster R-CNN, these anchors live on multiple feature maps, so multi-layer features are used and multi-scale detection comes naturally (the 3*3 sliding windows on feature maps of different layers have different receptive fields).

Summary: SSD combines YOLO's regression idea with Faster R-CNN's anchor mechanism, regressing from the multi-scale local features at each position of the whole image. It keeps YOLO's speed while making window predictions as accurate as Faster R-CNN's. SSD reaches 72.1% mAP on VOC 2007 and runs at 58 frames per second on a GPU.

Overall: YOLO offered a new way of thinking about object detection, and SSD's performance shows that real-time object detection in practical applications is genuinely possible.

4. Methods for improving object detection

The R-CNN series and the YOLO series give us two basic frameworks for object detection. Beyond these, researchers have proposed a series of methods that improve detection performance from other directions.

(1) Hard negative mining

R-CNN uses hard example mining when training its SVM classifiers, but Fast R-CNN and Faster R-CNN, with their end-to-end training strategy, do not (they simply fix the positive/negative sampling ratio and sample randomly). "Training Region-based Object Detectors with Online Hard Example Mining" (CVPR 2016 oral) embeds the hard example mining mechanism into SGD, letting Fast R-CNN automatically select the region proposals with the highest loss as the positive and negative examples to train on. Experiments show that the OHEM (Online Hard Example Mining) mechanism raises Fast R-CNN's mAP by about 4% on both VOC 2007 and VOC 2012.
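Here is a minimal sketch of OHEM's core selection step: compute the loss of every candidate RoI in the forward pass, then backpropagate only the hardest ones. The 2000 RoIs, the batch size of 128, and the random losses are illustrative assumptions; the actual OHEM also applies NMS among RoIs to avoid selecting highly correlated hard examples.

```python
import torch

num_rois, batch_size = 2000, 128
per_roi_loss = torch.rand(num_rois)              # forward-pass loss of each RoI

hard = per_roi_loss.topk(batch_size).indices     # indices of the 128 hardest RoIs
ohem_loss = per_roi_loss[hard].mean()            # only these drive the gradients
print(hard[:5], float(ohem_loss))
```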

(2) Multi-layer feature fusion

Fast R-CNN and Faster R-CNN both detect using only the features of the last convolutional layer, but high-level convolutional features have lost much detail information (due to pooling), which hurts localization accuracy. Methods such as HyperNet fuse multi-layer CNN features for object detection, using the semantic information of high-level features together with the fine texture information of low-level features, making detection and localization more accurate.

(3) Use context information

When extracting region proposal features for detection, incorporating the proposal's context information often improves results. (Context information is used in papers such as "Object detection via a multi-region & semantic segmentation-aware CNN model" and "Inside-Outside Net".)
