Object detection is an important application of deep learning: it identifies the objects in an image and marks their positions. Generally, two steps are required:

1. Classification: identify what the object is

2. Localization: find out where the object is

In addition to detecting a single object, a detector must also support detecting multiple objects, as shown in the following figure:

This problem is not easy to solve: objects vary greatly in size, can appear at any angle, and take uncertain poses; moreover, there are many object categories, and several kinds of objects may appear anywhere in the same image. Object detection is therefore a relatively complex problem.

The most direct method is to build a deep neural network, take the image and the labeled position as sample input, pass it through a CNN, and then identify what the object is through a fully connected classification head while computing its position through a fully connected regression head, as shown in the following figure:
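The two-head idea above can be sketched in a few lines of numpy. This is a minimal illustration, not a real detector: the weights are random stand-ins for trained parameters, and the 4096-dimensional feature is the size AlexNet's fc7 layer would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical shared CNN feature for one image (random placeholder).
feat = rng.standard_normal(4096)

# Classification head: one fully connected layer -> C class scores.
C = 20  # e.g. the 20 PASCAL VOC classes
W_cls = rng.standard_normal((C, 4096)) * 0.01
class_probs = softmax(W_cls @ feat)

# Regression head: one fully connected layer -> 4 box numbers (x, y, w, h).
W_reg = rng.standard_normal((4, 4096)) * 0.01
box = W_reg @ feat

print(class_probs.shape, box.shape)  # (20,) (4,)
```

Both heads consume the same shared feature, which is exactly the structure the figure describes.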

But "regression" is hard to do well here: the amount of computation is too large and convergence takes too long, so we should find a way to convert it into "classification". A natural idea is boxing: take "boxes" of different sizes, place them at different positions, compute a score for each box, and take the highest-scoring box as the prediction, as shown in the following figure:

According to the scores compared above, the black box in the lower right corner is selected as the prediction of the target's position.

But the question is: how big should the box be? If it is too small, the object will only be partially recognized; if it is too large, the result will include a lot of irrelevant background. What then? Try boxes of various sizes and evaluate them all.

As shown in the figure below (identifying a bear), boxes of various sizes repeatedly crop the image; each crop is fed into the CNN for recognition and scoring, and finally the target's category and location are determined.

This method is very inefficient and far too time-consuming. Is there a more efficient approach to object detection?
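The brute-force sliding-window baseline just described can be written down directly, which also makes its cost obvious: the classifier runs once per position per window size. Everything here is a toy (the "classifier" just measures mean brightness):

```python
import numpy as np

def sliding_window_detect(image, classifier, window_sizes, stride=32):
    """Exhaustively score crops of several sizes; return the best box.

    `classifier` maps a crop to a confidence score. This is the
    brute-force baseline the text describes, not R-CNN.
    """
    H, W = image.shape[:2]
    best_score, best_box = -1.0, None
    for wh, ww in window_sizes:
        for y in range(0, H - wh + 1, stride):
            for x in range(0, W - ww + 1, stride):
                crop = image[y:y + wh, x:x + ww]
                score = classifier(crop)
                if score > best_score:
                    best_score, best_box = score, (x, y, ww, wh)
    return best_box, best_score

# Toy demo: an image with one bright square; the best window lands on it.
img = np.zeros((128, 128))
img[64:96, 64:96] = 1.0
box, score = sliding_window_detect(img, lambda c: c.mean(), [(32, 32), (64, 64)])
print(box, score)  # (64, 64, 32, 32) 1.0
```

Even on this tiny image the triple loop runs the classifier dozens of times; on a real image with a real CNN the cost explodes, which is the inefficiency R-CNN set out to fix.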

**1. R-CNN was born**

R-CNN (Region CNN, region-based convolutional neural network) can be considered the pioneering work in applying deep learning to object detection. Its author, Ross Girshick, has won the PASCAL VOC object detection competition multiple times; in 2010 he led the team that won its Lifetime Achievement Award, and he now works at Facebook AI Research (FAIR).

The R-CNN algorithm proceeds as follows:

1. Input image

2. Generate 1K~2K candidate regions for each image

3. For each candidate region, use a deep network to extract features (AlexNet, VGG, or other CNNs can be used)

4. Feed the features into one SVM classifier per class to determine whether the region belongs to that class

5. Use a regressor to finely correct the position of the candidate box
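The five steps above can be strung together as a runnable skeleton. Every component here is a hypothetical stand-in (random proposals, random features, random scores) just to show how the stages connect; a real R-CNN uses Selective Search, AlexNet/VGG, per-class SVMs, and learned box regressors.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_regions(image, n=2000):            # step 2 (Selective Search stand-in)
    h, w = image.shape[:2]
    boxes = rng.integers(0, min(h, w) // 2, size=(n, 4))
    boxes[:, 2:] += 16                         # ensure positive width/height
    return boxes                               # each row is (x, y, w, h)

def extract_features(image, box):              # step 3 (CNN stand-in)
    return rng.standard_normal(4096)

def svm_scores(feat, n_classes=20):            # step 4 (per-class SVM stand-in)
    return rng.standard_normal(n_classes)

def refine_box(feat, box):                     # step 5 (regressor stand-in)
    return box                                 # identity: no learned offsets

image = np.zeros((224, 224, 3))                # step 1: input image
detections = []
for box in propose_regions(image, n=5):        # only 5 proposals for the demo
    feat = extract_features(image, box)
    scores = svm_scores(feat)
    cls = int(np.argmax(scores))
    detections.append((cls, refine_box(feat, box)))
print(len(detections))  # 5
```

Note that the CNN runs once per proposal; with ~2000 proposals per image, this per-region forward pass is exactly the redundancy Fast R-CNN later removes.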

Let’s expand on each step.

**1. Generate candidate regions**

Use the Selective Search method to generate about 2000-3000 candidate regions per image. The basic idea is as follows:

(1) Use an over-segmentation method to divide the image into small regions

(2) Look at the existing small regions and repeatedly merge the two with the highest merging likelihood, until the whole image has been merged into a single region. Merge the following regions first:

- regions with similar color (color histograms)

- regions with similar texture (gradient histograms)

- regions whose combined area after merging is small (this keeps the scale of the merging operation relatively uniform and prevents one large region from successively "eating" the smaller ones)

- regions whose combined area occupies a large proportion of its bounding box (this keeps the merged shape regular)

(3) Output all regions that have ever existed during the process; these are the so-called candidate regions
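The merging loop in steps (2)-(3) can be sketched with a heavy simplification: here regions are represented only by color histograms and similarity is histogram intersection, whereas real Selective Search also uses texture, size, and fill cues and starts from an over-segmentation of actual pixels.

```python
import numpy as np

def hist_similarity(h1, h2):
    # Histogram intersection: higher = more similar color distributions.
    return np.minimum(h1, h2).sum()

def selective_search_sketch(histograms):
    """Greedily merge the most similar pair until one region remains,
    recording every region ever seen as a candidate."""
    regions = [h / h.sum() for h in histograms]   # normalized histograms
    n_candidates = len(regions)                   # initial regions count too
    while len(regions) > 1:
        best, pair = -1.0, None
        for i in range(len(regions)):             # find most similar pair
            for j in range(i + 1, len(regions)):
                s = hist_similarity(regions[i], regions[j])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair
        merged = (regions[i] + regions[j]) / 2    # merged region's histogram
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        n_candidates += 1                         # the merged region is a candidate too
    return n_candidates

rng = np.random.default_rng(0)
n = selective_search_sketch([rng.random(8) for _ in range(4)])
print(n)  # 4 initial regions + 3 merges = 7 candidate regions
```

This is why a single image yields on the order of thousands of candidates: every intermediate region along the merge hierarchy is output, not just the leaves.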

**2. Feature extraction**

Before using the deep network to extract features, first warp each candidate region to the same size, 227×227.
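A minimal version of this warping step, using nearest-neighbor sampling in plain numpy (R-CNN itself uses anisotropic image warping with context padding; this only shows the fixed-size idea):

```python
import numpy as np

def warp_region(image, box, size=227):
    """Crop a candidate box (x, y, w, h) and warp it to size x size
    with nearest-neighbor sampling."""
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    ys = (np.arange(size) * h / size).astype(int)  # source row per output row
    xs = (np.arange(size) * w / size).astype(int)  # source col per output col
    return crop[ys][:, xs]

img = np.arange(100 * 100, dtype=float).reshape(100, 100)
patch = warp_region(img, (10, 20, 30, 40))
print(patch.shape)  # (227, 227)
```

Whatever the candidate region's aspect ratio, the CNN always receives a 227×227 input.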

A CNN model such as AlexNet is used for training, generally in a slightly simplified form, as shown in the figure below:

**3. Category judgment**

For each target category, a linear SVM binary classifier makes the judgment. The input is the 4096-dimensional feature output by the deep network (AlexNet in the figure above); the output is whether the region belongs to that category.
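After training, each per-class linear SVM is just a weight vector and a bias over the 4096-D feature, so scoring a region against all classes is a single matrix product. The weights below are random placeholders standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

n_classes, dim = 20, 4096
W = rng.standard_normal((n_classes, dim)) * 0.01  # one SVM weight vector per class
b = np.zeros(n_classes)                           # one bias per class

def svm_decisions(feat):
    """Signed decision value per class; > 0 means 'belongs to this class'."""
    return W @ feat + b

feat = rng.standard_normal(dim)                   # CNN feature of one region
scores = svm_decisions(feat)
positive = np.flatnonzero(scores > 0)             # classes that claim this region
print(scores.shape)  # (20,)
```

Because each class has its own independent binary classifier, a region can in principle score positive for zero, one, or several classes; in practice the highest-scoring class is kept.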

**4. Position refinement**

The measurement standard in object detection is the overlap area: many seemingly accurate detections score poorly because the candidate box is not precise enough and the overlap is small, so a position refinement step is needed. For each class, a linear regression model is trained to judge whether the box fits well, as shown in the figure below:
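The refinement regressor predicts four deltas (tx, ty, tw, th) that nudge a proposal toward the ground truth, in the style of the R-CNN paper: the center is shifted by a fraction of the box size and the width/height are rescaled exponentially. A small sketch of applying such deltas:

```python
import numpy as np

def apply_box_deltas(box, deltas):
    """Apply R-CNN-style regression targets to a proposal (x, y, w, h)."""
    x, y, w, h = box
    tx, ty, tw, th = deltas
    cx, cy = x + w / 2, y + h / 2          # proposal center
    cx, cy = cx + tx * w, cy + ty * h      # shift the center
    w, h = w * np.exp(tw), h * np.exp(th)  # rescale width and height
    return (cx - w / 2, cy - h / 2, w, h)

# Zero deltas leave the box unchanged; small deltas nudge it.
print(apply_box_deltas((10, 10, 20, 20), (0, 0, 0, 0)))  # (10.0, 10.0, 20.0, 20.0)
```

Parameterizing the offsets relative to the box size makes the regression targets scale-invariant, so one regressor per class works for boxes of any size.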

After R-CNN introduced deep learning into the detection field, the detection rate on PASCAL VOC jumped from 35.1% to 53.7% in one fell swoop.

**2. Fast R-CNN speeds things up greatly**

Following the launch of R-CNN in 2014, Ross Girshick launched Fast R-CNN in 2015. With a refined design and a more compact pipeline, it greatly improves the speed of object detection.

Compared with R-CNN, Fast R-CNN cuts training time from 84 hours to 9.5 hours and test time from 47 seconds to 0.32 seconds per image, while accuracy on PASCAL VOC 2007 is almost the same, around 66%-67%.

Fast R-CNN mainly solves the following problems of R-CNN:

1. Training and testing are slow

In R-CNN, the candidate boxes in an image overlap heavily, so extracting features for each one is highly redundant. Fast R-CNN instead normalizes the whole image and sends it through the deep network once, then extracts the candidate regions afterwards; the early-layer features of these candidate regions need not be recomputed.

2. Training requires a large amount of storage

The independent classifiers and regressors in R-CNN require a large number of features as training samples. Fast R-CNN integrates category judgment and position fine-tuning into the deep network itself, so no additional storage is required.

The details are introduced below.

**1. In the feature extraction stage,** operations such as conv, pooling, and relu in a CNN (such as AlexNet) do not require fixed-size input. Therefore, after performing these operations on the original image, different input image sizes yield feature maps of different sizes, which cannot be fed directly into a fully connected layer for classification.

In Fast R-CNN, the author proposes a network layer called ROI Pooling, which maps inputs of different sizes to a fixed-length feature vector. The ROI Pooling layer evenly divides each candidate region into M×N blocks and performs max pooling on each block, converting candidate regions of different sizes on the feature map into outputs of a uniform size for the next layer. In this way, even though input images differ in size and so do their feature maps, this clever ROI Pooling layer extracts a fixed-dimensional feature representation for each region, after which an ordinary softmax can perform category recognition.
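A minimal numpy version of the ROI Pooling idea (a simplified, integer-grid variant; the real layer works on continuous ROI coordinates and backpropagates gradients):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=(2, 2)):
    """Divide an ROI on the feature map into M x N blocks and max-pool
    each block: the output size is fixed regardless of the ROI's size."""
    x0, y0, x1, y1 = roi                        # ROI in feature-map coordinates
    region = feature_map[y0:y1, x0:x1]
    M, N = out_size
    h, w = region.shape
    ys = np.linspace(0, h, M + 1).astype(int)   # row boundaries of the blocks
    xs = np.linspace(0, w, N + 1).astype(int)   # column boundaries of the blocks
    out = np.empty((M, N))
    for i in range(M):
        for j in range(N):
            out[i, j] = region[ys[i]:ys[i+1], xs[j]:xs[j+1]].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)
print(roi_pool(fmap, (0, 0, 6, 6)))  # full map pooled to 2x2
print(roi_pool(fmap, (1, 1, 4, 5)))  # a smaller ROI, same 2x2 output size
```

Two ROIs of completely different sizes both come out as 2×2, which is exactly what lets a fully connected layer follow regions of arbitrary shape.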

**2. In the classification and regression stage,** R-CNN first generates candidate boxes, then extracts features through the CNN, then classifies with SVMs, and finally runs a regression to get the precise position (bbox regression). In Fast R-CNN, the author cleverly moves the final bbox regression into the neural network itself and merges it with region classification into one multi-task model, as shown in the following figure:

Experiments show that the two tasks can share convolutional features and promote each other.
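The multi-task objective combines a cross-entropy loss on the class probabilities with a smooth-L1 loss on the four box deltas, as in the Fast R-CNN paper. A sketch (the class names and numbers are made up for the example):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber-like) loss used for the bbox regression branch."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x**2, ax - 0.5)

def multitask_loss(class_probs, true_class, pred_deltas, true_deltas, lam=1.0):
    """Joint objective: cross-entropy over classes plus smooth-L1 over
    the 4 box deltas, weighted by `lam`. The box loss applies only when
    the true class is an object (index 0 is background)."""
    l_cls = -np.log(class_probs[true_class])
    l_box = smooth_l1(pred_deltas - true_deltas).sum() if true_class > 0 else 0.0
    return l_cls + lam * l_box

probs = np.array([0.1, 0.7, 0.2])  # hypothetical: background, cat, dog
loss = multitask_loss(probs, 1, np.array([0.1, 0.0, 0.0, 0.0]), np.zeros(4))
print(round(float(loss), 4))  # 0.3617
```

Because both terms are differentiated through the same shared convolutional trunk, one backward pass trains classification and localization together, which is what replaces R-CNN's separate SVM and regressor training.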

A very important contribution of Fast R-CNN is that it let people see real hope for real-time detection within the Region Proposal + CNN (candidate region + convolutional neural network) framework: multi-class detection really can improve processing speed while maintaining accuracy.

**3. Faster R-CNN is faster and stronger**

Following R-CNN in 2014 and Fast R-CNN in 2015, the Ross Girshick team, leaders in the object detection field, launched another masterpiece in 2015: Faster R-CNN. With a simple network it reaches a detection speed of 17 fps with 59.9% accuracy on PASCAL VOC; with a complex network it reaches 5 fps with 78.8% accuracy.

Fast R-CNN still has a bottleneck: Selective Search. Finding all the candidate boxes this way remains very time-consuming. Is there a more efficient way to find them?

Faster R-CNN adds a neural network for extracting candidate boxes, handing that job to the network as well. In this way, the four basic steps of object detection (candidate region generation, feature extraction, classification, and position refinement) are finally unified into a single deep network framework, as shown in the figure below:

Faster R-CNN can simply be regarded as "region proposal network + Fast R-CNN", using a Region Proposal Network (RPN) to replace Fast R-CNN's Selective Search. The RPN is shown in the figure below:

The working steps of the RPN are as follows:

- Slide a window over the feature map
- Build a small neural network for object classification plus box position regression
- The sliding window's position supplies the coarse location of the object
- The box regression supplies a more precise box position
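At each sliding-window position, the RPN scores a small set of reference boxes called anchors: several scales and aspect ratios centered on that position, mapped back to image coordinates through the feature-map stride. The scales and ratios below are the paper's defaults; the 4×4 feature map is just for the demo:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors as in the RPN: at every feature-map position, k boxes of
    several scales and aspect ratios, in image coordinates (x, y, w, h)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # center of this position mapped back to the image
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    # area = s^2, aspect ratio w/h = r
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, w, h))
    return np.array(anchors)

# A 4x4 feature map with 3 scales x 3 ratios = 9 anchors per position.
a = generate_anchors(4, 4)
print(a.shape)  # (144, 4)
```

The classification branch then scores each anchor as object/background and the regression branch refines it, so candidate boxes come straight out of the network instead of Selective Search.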

Faster R-CNN designs the RPN network for extracting candidate regions, replacing time-consuming Selective Search and greatly improving detection speed. The following table compares the detection speeds of R-CNN, Fast R-CNN, and Faster R-CNN:

**Summary**

From R-CNN to Fast R-CNN to Faster R-CNN, deep-learning-based object detection has become steadily more streamlined, more accurate, and faster. The R-CNN series of region-proposal-based detectors is one of the most important branches of object detection technology.

**Strongly Recommended**

From 2014 to 2016, Ross Girshick et al. published the classic papers on R-CNN, Fast R-CNN, and Faster R-CNN: "Rich feature hierarchies for accurate object detection and semantic segmentation", "Fast R-CNN", and "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". These papers introduce the ideas, principles, and experimental results of the detectors in detail; reading them is recommended for a full understanding of the models.

Follow my official account "Big Data and Artificial Intelligence Lab" (BigdataAILab) and reply with the keyword "**thesis**" to read the classic papers online.

**Recommended related reading**

- Dahua Convolutional Neural Network (CNN)
- Dahua Recurrent Neural Network (RNN)
- Dahua Deep Residual Network (DRN)
- Dahua Deep Belief Network (DBN)
- Dahua CNN classic model: LeNet
- Dahua CNN classic model: AlexNet
- Dahua CNN classic model: VGGNet
- Dahua CNN classic model: GoogLeNet
- A brief introduction to "transfer learning"
- What is "Reinforcement Learning"
- Analysis of the Principle of AlphaGo Algorithm
- How many Vs are there in big data?
- Apache Hadoop 2.8 fully distributed cluster building super detailed tutorial
- Apache Hive 2.1.1 installation and configuration super detailed tutorial
- Apache HBase 1.2.6 fully distributed cluster building super detailed tutorial
- Offline installation of Cloudera Manager 5 and CDH5 (latest version 5.13.0) super detailed tutorial