[Deep Learning] Framework Evolution - Classic CNN-Based Object Detection Frameworks

Object detection methods based on convolutional neural networks are commonly divided into two-stage and single-stage algorithms according to their detection pipeline. Two-stage detection is based on region proposals: proposals are first generated, coarsely refined, and filtered of background, and then classified and adjusted with bounding-box regression, as in R-CNN, Fast R-CNN, and Faster R-CNN. Single-stage detection is based on regression; it merges these two steps into one, typically following an "anchor + classification refinement" scheme, as in YOLO and SSD. In addition, there are search-based detection and recognition algorithms, such as AttentionNet, which is based on visual attention, and algorithms based on reinforcement learning.


This article mainly introduces the classic two-stage and single-stage object detection algorithms and traces their framework evolution and core ideas. The content is largely drawn from Deep Learning and Object Detection.


Two-Stage Object Detection Methods


For more than a decade before the region-based convolutional neural network (regions with CNN features, R-CNN) appeared, most visual recognition tasks relied on hand-crafted features such as the scale-invariant feature transform (SIFT) or the histogram of oriented gradients (HOG). When AlexNet shone in the ILSVRC 2012 classification task, researchers realized that CNNs can learn very robust and expressive features. Girshick et al. therefore proposed R-CNN, which became the pioneering CNN-based object detection model.


2014 R-CNN

Ross Girshick, original link: Rich feature hierarchies for accurate object detection and semantic segmentation

When detecting objects in an image, R-CNN first uses selective search to extract about 2,000 region proposals. Each proposal is then warped to a fixed size (227×227) and fed into AlexNet to extract a feature vector. Next, for each category, an SVM classifier for that category scores all of the feature vectors, yielding per-category scores for every proposal in the image. Then, independently for each category, greedy non-maximum suppression (NMS) screens the proposals: any lower-scoring proposal whose IoU with a higher-scoring one exceeds a threshold is discarded. Finally, bounding-box regression fine-tunes the position and size of the surviving proposals so that they enclose the target more accurately.


[Figure: the R-CNN detection pipeline]
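The greedy NMS step can be made concrete with a short sketch. Below is a minimal NumPy version of per-class NMS, assuming boxes in (x1, y1, x2, y2) form and scores taken from that class's SVM:

```python
# A minimal NumPy sketch of greedy per-class NMS, as described above.
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Keep the highest-scoring box, drop overlapping boxes, repeat."""
    order = scores.argsort()[::-1]               # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # discard boxes above threshold
    return keep
```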

Note that R-CNN classifies with SVMs rather than with the softmax layer at the end of the CNN, because the positive/negative-sample thresholds used when fine-tuning the CNN differ from those used when training the SVMs. With SVM classifiers the mAP is 54.2%; with softmax it drops to 50.9%. The key contribution of R-CNN is introducing deep learning into object detection.


2014 SPP-Net

Kaiming He, original link: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

A convolutional neural network usually consists of a convolutional part and a fully connected part. The convolutional part can operate on an image of any size and produce a feature map, while the fully connected part requires a fixed-size input; the fixed-size constraint therefore comes entirely from the fully connected layers. Because region proposals vary in size, adjusting them to a common size by scaling, stretching, or cropping distorts the original image to varying degrees, and even careful pre-processing cannot completely eliminate the unwanted effects of resizing.
[Figure: distortions introduced by cropping or warping proposals to a fixed size]


To solve the problems caused by this resizing, SPP-Net introduces spatial pyramid pooling (SPP) to remove the network's fixed-size constraint. The SPP layer is placed after the last convolutional layer; it pools the feature map into a fixed-length output, which is then used as the input of the fully connected layers. As shown below, the first level pools the entire feature map into 1 bin; the second level divides it into 4 bins; the third level divides it into 16 bins. The resulting 1 + 4 + 16 = 21 features per channel are concatenated and fed to the fully connected layers, where 256 is the number of filters in the last convolutional layer. The output of the whole SPP layer is thus a k×M-dimensional vector, with M = 21 and k = 256.

[Figure: the spatial pyramid pooling layer]
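A minimal sketch of the SPP layer, using PyTorch's adaptive pooling for the bins (an implementation convenience rather than the paper's exact windowed pooling; the fixed 21×k output is the same):

```python
# A sketch of spatial pyramid pooling: any H x W feature map becomes a
# fixed-length vector of k * (1 + 4 + 16) features for levels (1, 2, 4).
import torch
import torch.nn.functional as F

def spp(feature_map, levels=(1, 2, 4)):
    """feature_map: (N, k, H, W) with arbitrary H, W -> (N, k * 21)."""
    n = feature_map.shape[0]
    outputs = []
    for size in levels:
        # Pool the whole map into a size x size grid, regardless of H and W.
        pooled = F.adaptive_max_pool2d(feature_map, output_size=size)
        outputs.append(pooled.view(n, -1))   # (N, k * size * size)
    return torch.cat(outputs, dim=1)         # (N, k * (1 + 4 + 16))

x = torch.randn(1, 256, 13, 19)              # any spatial size works
print(spp(x).shape)                          # torch.Size([1, 5376]) = 256 * 21
```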

Another innovation of SPP-Net is that it runs the convolutional layers only once per image. R-CNN first extracts proposals, resizes each of them, and then runs the CNN on every proposal, which is very inefficient. SPP-Net instead computes the feature map of the whole image once, locates the patch on the feature map corresponding to each proposal, and feeds that patch through the SPP layer as the proposal's convolutional feature. This speeds up detection to 24-120 times that of R-CNN.


2015 Fast R-CNN

Ross Girshick, original link: Fast R-CNN

Fast R-CNN absorbs the idea of SPP-Net: it uses a region-of-interest pooling layer (RoI pooling layer) similar to the SPP layer, and moves the classification and bounding-box regression steps after feature extraction into the deep network so that they are trained jointly. This makes training Fast R-CNN simpler, and cheaper in both time and space, than the multi-stage training of R-CNN: training is about 9 times faster than R-CNN, and detection about 200 times faster.

[Figure: the Fast R-CNN architecture]
Fast R-CNN extracts proposals the same way as R-CNN, using an external method such as selective search. For feature extraction it uses the convolutional part of an image classification network such as VGG16; the whole image to be detected is fed in once to obtain the final feature map. Every proposal is then passed to the RoI pooling layer, which locates the corresponding region on the feature map and pools it to a fixed size. Two fully connected layers follow, producing a fixed-size RoI feature vector. Each RoI thus yields a fixed-dimensional feature, which serves as the input to the two tasks of object classification and bounding-box regression, producing the detection result.
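A sketch of the RoI pooling step using torchvision's roi_pool; the shapes and the 1/16 spatial scale are illustrative assumptions for a VGG16-style backbone:

```python
# A sketch of RoI pooling: image-space proposals are mapped onto the feature
# map (spatial_scale = 1/16 assumes four 2x2 poolings in the backbone) and
# each region is pooled to a fixed 7 x 7 grid.
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 38, 50)                 # (N, C, H, W)
proposals = torch.tensor([[0.,  48.,  64., 320., 256.],   # (batch_idx, x1, y1, x2, y2)
                          [0., 128.,  32., 400., 300.]])  # coordinates in image space
rois = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(rois.shape)   # torch.Size([2, 512, 7, 7]) -> flattened into the FC layers
```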

Note that experiments on Fast R-CNN show that classification with the softmax function outperforms the SVMs, which is due to the different roles the two play in the two architectures. In R-CNN, softmax is the last layer of the backbone AlexNet; its parameters are fine-tuned by transfer learning on randomly sampled training data, so it performs worse than SVMs trained with hard negative mining. In Fast R-CNN, the layers after the backbone VGG16 have been replaced, and softmax is a new, independent head; since the softmax function itself introduces competition between classes, it achieves better results than the SVMs.


2015 Faster R-CNN

Shaoqing Ren, original link: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks


Faster R-CNN integrates the proposal-extraction step into the deep network, becoming the first truly end-to-end, near-real-time deep learning object detector. As an upgrade of Fast R-CNN, Faster R-CNN can simply be viewed as RPN + Fast R-CNN, with the RPN and Fast R-CNN sharing part of the convolutional layers. As shown in the figure below, when an image is sent to Faster R-CNN, the conv layers of the backbone (e.g. VGG16 or ZF) are the structure shared by the RPN and Fast R-CNN. The image passes through the conv layers to produce a feature map; the feature map is fed to the RPN to obtain proposals; and the proposals, together with the feature map, are fed to the remaining network (Fast R-CNN, starting from the RoI pooling layer) to produce the detection result.

[Figure: the Faster R-CNN architecture]


The RPN is a neural network that extracts region proposals. Its defining feature is that candidate boxes are extracted with a sliding window: the center of the window visits every point on the feature map, and each point has a set of k preset anchors centered on it. These k anchors have different aspect ratios and scales; k is usually 9 (3 scales × 3 aspect ratios). For each anchor, the RPN outputs two class scores (foreground vs. background, determined by IoU with the ground truth) and 4 bounding-box regression parameters (x, y, w, h).

[Figure: the region proposal network and its anchors]
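A sketch of generating the k = 9 anchors at one feature-map location; the 3 scales × 3 aspect ratios below are the defaults reported in the paper:

```python
# A sketch of anchor generation at one sliding-window position: every
# (scale, ratio) pair yields one box centred on (cx, cy).
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)         # chosen so that w * h ≈ s * s ...
            h = s * np.sqrt(r)         # ... and h / w = r
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)             # (9, 4) anchors centred on (cx, cy)

print(anchors_at(cx=120, cy=80).shape)  # (9, 4)
```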


One-Stage Object Detection Methods


To make object detection meet real-time requirements, researchers proposed single-stage detection methods. A single-stage detector no longer uses proposals for "coarse detection + refinement"; instead, it produces the result in one step. Because it performs only a single feed-forward pass, its speed is greatly improved.


2015 YOLO

Joseph Redmon, original link: You Only Look Once: Unified, Real-Time Object Detection

YOLO is the earliest single-stage method and the first to achieve real-time object detection. YOLO runs at 45 frames per second, and its mAP is twice or more that of other real-time detection systems. YOLO v1 treats detection as a regression problem, so its processing pipeline is simple and direct: the input image is resized to 448×448 pixels, a convolutional network runs over it, and fully connected layers produce the detections. YOLO divides the input image into an S×S grid; if the center of an object falls into a grid cell, that cell is responsible for detecting the object. Each grid cell predicts B bounding boxes and a confidence score for each; the confidence reflects both how certain YOLO is that the box contains an object and how accurate it believes the predicted box to be.

[Figure: the YOLO detection pipeline]
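A sketch of YOLO v1's output layout and grid assignment, using the paper's PASCAL VOC settings (S = 7, B = 2, C = 20):

```python
# A sketch of YOLO v1's output: an S x S grid where each cell predicts
# B boxes (x, y, w, h, confidence) plus C class probabilities, giving the
# paper's 7 x 7 x 30 tensor for PASCAL VOC.
import numpy as np

S, B, C = 7, 2, 20
prediction = np.zeros((S, S, B * 5 + C))   # the final FC layer, reshaped

def responsible_cell(x_center, y_center, img_size=448):
    """An object is assigned to the grid cell containing its centre point."""
    col = int(x_center / img_size * S)
    row = int(y_center / img_size * S)
    return row, col

print(prediction.shape)            # (7, 7, 30)
print(responsible_cell(224, 100))  # -> (1, 3): row 1, column 3
```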

YOLO's limitations are also obvious. Compared with two-stage detectors, it makes more localization errors and lags in accuracy, especially on small objects. YOLO also imposes a strong spatial constraint on its predictions: each grid cell predicts only two bounding boxes and a single class. This constraint limits the number of nearby objects YOLO can detect, so it performs poorly on flocks, crowds, convoys, and other clusters of objects.

It is worth noting that, in pursuit of real-time, high-accuracy detection, the YOLO family has kept evolving; at the time of writing it has been iterated to v7, with many variants besides.


2015 SSD

Wei Liu, original link: SSD: Single Shot MultiBox Detector


Wei Liu et al. proposed SSD in the same year YOLO appeared. SSD absorbs YOLO's idea of fast detection, combines the advantages of the RPN in Faster R-CNN, and improves the handling of multi-scale targets by no longer predicting only from the top-level feature map. Because different convolutional layers capture features of different sizes, SSD adopts a feature-pyramid style of prediction, merging the detection results of several convolutional layers to detect objects of different sizes. Faster R-CNN predicts from a single feature map, the one at the top of the backbone; SSD predicts on multiple feature maps, detecting objects of different sizes on feature maps of different resolutions.

As a single-stage method, SSD does not predict proposals like Faster R-CNN; it predicts the target's bounding box directly. For this it introduces the concept of the default box (the counterpart of the anchor in Faster R-CNN). Since SSD detects objects of different sizes on feature maps of different sizes, default boxes of different scales are assigned to different feature maps: the closer a feature map is to the top of the network, the larger its default boxes, and the closer to the bottom, the smaller. As shown below, the default boxes on the 8×8 feature map are small and detect the smaller cat, while those on the 4×4 feature map are larger and detect the larger dog. At each point of each feature map, default boxes of several aspect ratios are also used to match differently shaped objects.
[Figure: SSD default boxes on the 8×8 and 4×4 feature maps]
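The assignment of scales to feature maps follows a simple linear rule in the SSD paper; a sketch:

```python
# A sketch of the SSD scale rule: lower feature maps get small default boxes,
# higher ones get large boxes. s_min = 0.2 and s_max = 0.9 are the paper's
# defaults; scales are fractions of the input image size.
def default_box_scales(m=6, s_min=0.2, s_max=0.9):
    """Scale for each of the m prediction feature maps (k = 1 is the lowest)."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

print([round(s, 2) for s in default_box_scales()])
# [0.2, 0.34, 0.48, 0.62, 0.76, 0.9] -> the 8x8 map uses smaller boxes than the 4x4 map
```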
The anchor-based process of Faster R-CNN and the default-box-based process of SSD are similar, but SSD goes one step further with the default box, directly performing multi-class classification and bounding-box prediction. It is worth noting that classic networks such as YOLO and SSD are the basis of subsequent evolution, which is why these pioneering works are still discussed.


2017 RetinaNet

Tsung-Yi Lin, original link: Focal Loss for Dense Object Detection


RetinaNet uses ResNet plus a feature pyramid network (FPN) as its basic framework. The FPN yields several feature maps of different sizes, and each level is connected to two subnetworks: a box subnet (bounding-box regression) and a class subnet (object classification). To deal with the severe class imbalance faced by the class subnet, the method computes the loss with the focal loss function.

[Figure: the RetinaNet architecture]

In a feature pyramid, the high-level, low-resolution features carry rich semantic information and are good at recognizing objects; the low-level, high-resolution features carry little semantic information and are poor at recognition. Addressing SSD's failure to exploit low-level features, FPN proposes a new pyramid structure that fuses high-level features into low-level ones, enriching the semantics of the low-level features so that recognition can also be performed there; this improves the detection of small objects.

In addition, the newly proposed focal loss is equivalent to weighting each sample by a factor related to the model's predicted probability: it reduces the loss weight of easy samples while retaining, as much as possible, the loss of hard samples, thereby addressing the imbalance between easy and hard samples.
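A sketch of the focal loss for binary classification, with the paper's defaults γ = 2 and α = 0.25:

```python
# A sketch of the focal loss: the (1 - p_t)^gamma factor down-weights easy,
# well-classified examples, so hard examples dominate the loss.
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: tensors of the same shape; targets are 0 or 1."""
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class weighting
    return (-alpha_t * (1 - p_t) ** gamma
            * torch.log(p_t.clamp(min=1e-8))).mean()

logits = torch.tensor([3.0, -3.0, 0.1])   # easy positive, easy negative, hard example
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))        # the easy examples contribute almost nothing
```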


2017 RefineDet

Shifeng Zhang, original link: Single-Shot Refinement Neural Network for Object Detection


RefineDet can be regarded as a combination of SSD, RPN, and FPN, merging the advantages of single-stage and two-stage methods. It is a single-stage detector built from two interconnected modules: the ARM (anchor refinement module) and the ODM (object detection module). The TCB (transfer connection block) converts the features produced by the ARM and passes them to the ODM, providing feature fusion.

[Figure: the RefineDet architecture]
The ARM identifies and filters out background regions (negative anchors) and coarsely adjusts the sizes and positions of the anchors to ease the subsequent precise localization of bounding boxes. This resembles the role of the RPN in Faster R-CNN, except that the RPN operates on a single feature map while the ARM processes multiple feature maps of different sizes. The anchors refined by the ARM are fed into the ODM, where bounding-box regression and object classification are completed. Like SSD, the ODM detects on multiple feature maps of different scales; but where SSD uses fixed anchors (default boxes), the ODM uses anchors that have been screened and coarsely corrected, and therefore achieves better detections. The two interconnected modules mimic the structure of two-stage detection, allowing the model to improve accuracy while staying efficient.

The TCB performs the feature-conversion operation, turning the output feature maps of the ARM into the inputs of the ODM. Its feature fusion is similar to FPN's: it upsamples the deeper, higher-level feature map and fuses it with the current one, which makes RefineDet better than SSD at detecting small objects.
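A sketch of a TCB-style fusion block; the channel count and exact layers are illustrative assumptions, but the pattern (transform the current feature, upsample the deeper one, add, smooth) follows the description above:

```python
# A sketch of TCB-style fusion, in the spirit of FPN's top-down path:
# the deeper feature is upsampled by a deconvolution and added to the
# transformed ARM feature of the current level.
import torch
import torch.nn as nn

class TCB(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.up = nn.ConvTranspose2d(channels, channels, 2, stride=2)  # 2x upsample
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, arm_feat, deeper_feat):
        # Transform the ARM feature, fuse with the upsampled deeper feature.
        fused = self.conv1(arm_feat) + self.up(deeper_feat)
        return torch.relu(self.conv2(fused))

tcb = TCB()
arm = torch.randn(1, 256, 40, 40)     # ARM feature at this level
deep = torch.randn(1, 256, 20, 20)    # fused feature from the deeper level
print(tcb(arm, deep).shape)           # torch.Size([1, 256, 40, 40])
```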


"Deep Learning and Target Detection" Du Peng et al. Electronic Industry Press Chapter 4 and Chapter 5
"Practical Principles, Architecture and Optimization of Deep Neural Networks on Mobile Platforms" Chapter 8 Lu Yusheng Machinery Industry Press


Source: blog.csdn.net/weixin_47305073/article/details/128226605