Literature review of small target detection algorithms based on deep learning

I recently read the literature on small target detection algorithms. While searching the literature, I also learned about the development of target detection and its typical algorithms. The following summarizes the literature on small target detection algorithms in four parts, following my presentation slides: a brief introduction to target detection; the background and difficulties of small target detection; an introduction to small target detection algorithms; and a summary and outlook. This blog can also be used as a presentation.

Brief introduction to target detection

Target detection example
Traditional target detection to target detection based on deep learning
The target detection process can be simply divided into two sub-tasks: localization and recognition. Localization answers where a target is; recognition answers what the target is, which is a classification problem. The development of target detection can likewise be divided into two stages: traditional target detection, and target detection based on deep learning.

Traditional target detection

Traditional target detection can be divided into three processes: obtaining the detection window, manually designing the features of the target of interest, and training the classifier.

In 1998, Papageorgiou published A general framework for object detection, introducing Haar wavelet features for object detection, later used in classifiers for detecting human faces. A Haar-like feature sums the pixels in adjacent rectangular regions of a detection window and takes their difference, and that difference is used as the feature for classification; the advantage of this method is speed, since the rectangle sums can be computed in constant time from an integral image. In 1999, David Lowe proposed the scale-invariant feature transform (SIFT), which he refined in 2004. SIFT extracts local features of an image that are unaffected by image rotation, scale change, or brightness change; it is also stable under affine transformation and noise, and is fast and extensible. Navneet Dalal and Bill Triggs proposed the Histogram of Oriented Gradients (HOG) at CVPR 2005. HOG builds features by computing gradient histograms over local regions of the image, which makes it robust to geometric and photometric deformations, and the HOG+SVM combination was also widely used in pedestrian detection.
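The speed of Haar-like features comes from the integral image, which turns any rectangle sum into four array lookups. A minimal sketch (the 4×4 "image" and the two-rectangle feature layout are illustrative assumptions, not taken from the original paper):

```python
def integral_image(img):
    """ii[y][x] = sum of img over the rectangle (0, 0)-(x, y), inclusive."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the inclusive rectangle (x0, y0)-(x1, y1), in O(1)."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total

img = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
ii = integral_image(img)
# Two-rectangle Haar-like feature: right half minus left half of the window.
left = rect_sum(ii, 0, 0, 1, 3)   # -> 8
right = rect_sum(ii, 2, 0, 3, 3)  # -> 40
print(right - left)               # -> 32
```

Once the integral image is built, every candidate window reuses it, which is why sliding a Haar classifier over the whole image stays cheap.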
By 2010, Pedro Felzenszwalb had proposed the Deformable Part Model (DPM), which combines improved HOG features with an SVM classifier and sliding windows, and handles the multi-view problem and object deformation with a multi-component strategy and a graph-structured part model, respectively. In addition, multi-instance learning is used to automatically determine latent variables such as the sample's category and the positions of the part models. Thanks to its excellent detection performance, DPM won the Pascal VOC (Visual Object Classes) challenge in 2007, 2008 and 2009, and Felzenszwalb received a lifetime achievement award in 2010.

Traditional target detection algorithms can reach good accuracy, but they suffer from high time complexity and heavy window redundancy: they scan the whole image with a sliding window, which is completely untargeted with respect to the objects to be detected. With the development of computer science and of GPUs and other hardware, target detection algorithms based on deep learning began to emerge.

Target detection based on deep learning

Target detection based on deep learning applies deep networks to target detection. According to the detection principle, these algorithms can be divided into two-stage methods and single-stage methods.
Algorithms for target detection based on deep learning

Target detection based on candidate regions

Detection methods based on region extraction first extract candidate regions through Selective Search, an RPN (Region Proposal Network), or similar methods, and then classify the candidate regions and regress their positions. Representative algorithms include RCNN, SPP-Net, Fast RCNN, Faster RCNN and Mask RCNN.

In 2014, Girshick et al. proposed RCNN, which reached 48% mAP on the VOC2007 test set. The RCNN algorithm has 4 main steps: ① use the Selective Search algorithm to extract candidate regions and scale them to the same size; ② use a convolutional neural network to extract features from each candidate region; ③ classify the candidate-region features with SVM classifiers; ④ predict the bounding box with a border-regression algorithm. As the first relatively mature detection algorithm based on deep learning, RCNN made great progress compared with traditional machine-learning algorithms, but its disadvantages are also obvious: the 4 separate steps make end-to-end training impossible, and each detection generates more than 2,000 candidate boxes, each requiring its own convolution pass; the overlapping candidate boxes cause a lot of repeated computation, which greatly hurts detection speed.
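The border regression in step ④ learns offsets that map a proposal box onto the ground-truth box. A minimal sketch of the standard R-CNN box parameterization (the box coordinates below are made up for illustration):

```python
import math

def bbox_regression_targets(proposal, gt):
    """Offsets the regressor is trained to predict, mapping a proposal
    onto the ground truth. Boxes are (center_x, center_y, width, height)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    tx = (gx - px) / pw          # shift, normalized by proposal size
    ty = (gy - py) / ph
    tw = math.log(gw / pw)       # log-space scale change
    th = math.log(gh / ph)
    return (tx, ty, tw, th)

def apply_deltas(proposal, deltas):
    """Invert the parameterization: decode predicted offsets into a box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))

proposal = (50.0, 50.0, 20.0, 40.0)
gt = (55.0, 48.0, 30.0, 36.0)
deltas = bbox_regression_targets(proposal, gt)
decoded = apply_deltas(proposal, deltas)
print(decoded)  # recovers the ground-truth box
```

Normalizing by the proposal size and using log-space scales keeps the targets comparable across boxes of very different sizes, which is why this parameterization survived into Fast/Faster RCNN and the anchor-based single-stage detectors.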

In 2014, Kaiming He proposed SPP-Net, a network with a pyramid pooling layer. Because a traditional CNN ends in fully connected layers, the input image must be cropped or stretched to meet their fixed input size; but cropping or stretching distorts the image and deforms the target features. To solve this problem, SPP-Net adds a spatial pyramid pooling (SPP) layer to the CNN. Regardless of the input size, SPP fixes its output to the same length, which improves the scale invariance of the network and reduces overfitting. With SPP, only one convolution pass over the image is required, and each candidate region's features can be obtained by mapping the region from the original image onto the feature map, which greatly reduces detection time.[1]
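The fixed-length output of SPP can be sketched in a few lines. This toy version max-pools a single-channel 2D map over 1×1, 2×2 and 4×4 grids (the level sizes and the inputs are illustrative; a real network does this per channel on the conv feature map):

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into fixed n x n grids and concatenate:
    output length is 1 + 4 + 16 = 21 regardless of the input size."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # Bin boundaries; max(...) keeps every bin non-empty.
                y0, y1 = i * h // n, max((i + 1) * h // n, i * h // n + 1)
                x0, x1 = j * w // n, max((j + 1) * w // n, j * w // n + 1)
                out.append(max(feature_map[y][x]
                               for y in range(y0, y1)
                               for x in range(x0, x1)))
    return out

# Two different input sizes produce the same output length.
a = [[float(x + y) for x in range(6)] for y in range(6)]
b = [[float(x * y) for x in range(9)] for y in range(7)]
print(len(spatial_pyramid_pool(a)), len(spatial_pyramid_pool(b)))  # -> 21 21
```

Because the bin count, not the bin size, is fixed, the fully connected layers downstream always see the same input length, which is exactly what frees the network from a fixed input image size.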

Girshick proposed the Fast RCNN algorithm in 2015. Unlike RCNN, Fast RCNN first extracts features from the whole image with a neural network, then selects candidate regions and pools the differently sized ROIs to the same size, and finally performs border regression and classification on the output of the fully connected layers. ROI pooling avoids rescaling the candidate regions and reduces the running time of the algorithm, but selecting the candidate regions is still time-consuming.

In 2015, Ren et al. proposed Faster RCNN. It first extracts image features; second, it feeds the features into an RPN (Region Proposal Network) to obtain candidate regions; then it extracts the bounding-box features; finally, it predicts the object's bounding box and category from the candidate-box features. Compared with Fast RCNN, the most important improvement is using an RPN instead of Selective Search to extract candidate regions. Another extremely important improvement is the introduction of prior (anchor) boxes, which are also used in single-stage algorithms such as YOLO.
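The prior boxes the RPN scores are just a fixed menu of shapes tiled at every feature-map location. A minimal sketch in the spirit of the common Faster RCNN setup (3 ratios × 3 scales; the exact sizes here are illustrative assumptions):

```python
def generate_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Anchor boxes centered at the origin, one per (ratio, scale) pair,
    as (x0, y0, x1, y1); in practice these are shifted to every
    feature-map location."""
    anchors = []
    for ratio in ratios:            # height / width aspect ratio
        for scale in scales:
            area = (base_size * scale) ** 2
            w = (area / ratio) ** 0.5   # keep the area fixed per scale
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors

anchors = generate_anchors()
print(len(anchors))  # -> 9 anchors per feature-map location
```

The RPN then only has to classify each anchor as object/background and regress small offsets from it, instead of proposing boxes from scratch, which is what makes it so much faster than Selective Search.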

Regression-based target detection

Regression-based target detection algorithms run only one convolutional pass over the image, locating and recognizing targets directly in that pass. Such algorithms include YOLO (You Only Look Once), SSD, YOLOv2, etc.

In 2016, Redmon et al. proposed the YOLO network. YOLO treats detection as a regression problem and runs only one convolutional pass, so detection is very fast. However, because of how YOLO divides the image into grid cells, it is not ideal for detecting small targets: if two target centers fall into the same grid cell, the algorithm cannot detect both. Notes I made when I read YOLO before: YOLOv1 reading notes (scanned version).
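The grid-cell collision is easy to see concretely. A toy assignment of box centers to a YOLOv1-style S×S grid (the center coordinates are made up for illustration):

```python
def assign_to_grid(centers, S=7):
    """Assign each box center (x, y in [0, 1)) to an S x S grid cell, as in
    YOLOv1. Returns a cell -> centers map; when two centers share a cell,
    only one object can be predicted for that cell."""
    cells = {}
    for (x, y) in centers:
        cell = (int(x * S), int(y * S))
        cells.setdefault(cell, []).append((x, y))
    return cells

# Two small targets whose centers fall into the same 7x7 cell collide.
centers = [(0.30, 0.30), (0.33, 0.31), (0.80, 0.50)]
cells = assign_to_grid(centers)
collisions = [c for c, boxes in cells.items() if len(boxes) > 1]
print(collisions)  # -> [(2, 2)]
```

Small targets clustered together (a flock of birds, distant pedestrians) are exactly the case where several centers land in one cell, which is why YOLO's grid design hurts small target detection in particular.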

In 2016, Liu et al. proposed SSD (Single Shot MultiBox Detector). SSD regresses the target category and location directly in one network, predicts with convolutional layers, and uses feature maps of different scales in the network structure, finally achieving both good detection accuracy and good detection speed.

In 2016, Redmon et al. proposed YOLOv2 on the basis of YOLO. YOLOv2's improvements over YOLOv1 include: batch normalization, fine-tuning the model on high-resolution images, the use of prior (anchor) boxes, clustering to extract prior box scales, constraining the predicted box locations, a passthrough layer for fine-grained features, multi-scale training, and the Darknet-19 backbone network. Overall, YOLOv2 improves on YOLO in both detection speed and detection accuracy.

The performance of different detection algorithms on VOC2007

Background introduction and difficulties of small target detection

At present, general target detection algorithms are fairly mature, and YOLOv5 has now been released, but small target detection remains one of the harder problems in the field. Small targets can be defined by absolute or relative size: in absolute terms, a small target is no larger than 32×32 pixels; in relative terms, it is no larger than one-tenth of the image size. Small targets in complex scenes are especially challenging in satellite remote-sensing images, whose main characteristics are high resolution, a large field of view, and relatively tiny targets. Current mainstream target detection algorithms have a low recall and low accuracy on small targets, and struggle to meet real-time requirements. Therefore, research on small target detection in high-resolution images is both a focus and a difficulty of target detection, and it is of great significance in real applications.
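The two definitions above can be written as a small check. Note that "one-tenth of the image size" is read here as one-tenth of each image side, which is an assumption; some papers use one-tenth of the image area instead:

```python
def is_small_target(box_w, box_h, img_w, img_h):
    """Return (absolute, relative) smallness per the two definitions:
    absolute = no larger than 32 x 32 pixels; relative = no larger than
    one tenth of each image side (per-side reading is an assumption)."""
    absolute = box_w <= 32 and box_h <= 32
    relative = box_w <= img_w / 10 and box_h <= img_h / 10
    return (absolute, relative)

# A 30 x 30 object in a 4000 x 3000 remote-sensing image is small by both.
print(is_small_target(30, 30, 4000, 3000))  # -> (True, True)
```

In very high-resolution remote-sensing imagery the two definitions usually agree, as above; in low-resolution images an object can be relatively small without being absolutely small.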

Introduction to small target detection algorithm

Current small target detection methods can be roughly divided into three categories: multi-scale prediction, deconvolution and upsampling, and generative adversarial networks (GAN). These three categories improve small target detection in different ways.

Multi-scale prediction

Earlier regression-based detectors applied multi-scale feature extraction, but still made predictions from only one final feature map. Multi-scale prediction not only extracts image features at multiple scales but also predicts at each scale. Representative algorithms include the FPN model proposed by Lin in 2017, the SNIP algorithm proposed by Singh in 2018, and the MFDSSD algorithm proposed by Zhao Yanan et al. in 2018. The following introduces multi-scale prediction through the FPN algorithm.

FPN (Feature Pyramid Networks for Object Detection)
FPN network structure
FPN mainly improves the network structure. It first uses a backbone such as ResNet or VGG to build a bottom-up feature pyramid, then builds a top-down pathway through upsampling, and connects the two with lateral connections. Most earlier detectors predicted only from top-level features; low-level features carry less semantic information but localize targets more precisely, while high-level features carry richer semantics but localize more coarsely. Moreover, although some algorithms also fuse multi-scale features, they predict only from the final fused feature, whereas FPN predicts from every fused level.
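One top-down merge step can be sketched on toy 2D maps. This version uses nearest-neighbor upsampling and an element-wise sum (in the real network a 1×1 conv precedes the lateral connection and a 3×3 conv smooths the sum; the values here are illustrative):

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out

def fpn_merge(top, lateral):
    """One FPN merge: upsample the coarser (top-down) map and add the
    lateral (bottom-up) map element-wise."""
    up = upsample2x(top)
    return [[up[y][x] + lateral[y][x] for x in range(len(lateral[0]))]
            for y in range(len(lateral))]

c5 = [[1, 2], [3, 4]]               # coarse level: semantically strong
c4 = [[10] * 4 for _ in range(4)]   # finer level: spatially precise
p4 = fpn_merge(c5, c4)
print(p4[0])  # -> [11, 11, 12, 12]
```

Repeating this merge down the pyramid gives every level both strong semantics (inherited from the top) and fine localization (from the lateral map), and FPN attaches a predictor to each merged level.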

Deconvolution and upsampling

Small targets occupy few pixels in the image, and their outlines are coarse. If the resolution of the image features can be increased, enlarging the features of small targets, small target detection benefits. Applying deconvolution and upsampling in the network increases the size of the feature maps, and fusing them with low-level features improves the expressiveness of the features and helps predict small targets. In 2017, Fu et al. proposed the DSSD model to address SSD's poor performance on small targets; in 2015, Jonathan Long proposed the FCN network; in 2018, Fan Qinming proposed the AFFSSD network. All of these methods use deconvolution and similar operations to enrich the semantic information of low-resolution features. The following gives only a rough introduction to DSSD.

DSSD network structure
DSSD's improvements over the SSD network are mainly two: 1. the backbone is changed from VGG to ResNet-101, which strengthens feature extraction; 2. deconvolution layers are used to add a large amount of context information. The low-resolution feature maps serve as context: each is fused, through deconvolution, with the feature map at twice its resolution from the previous level, and the result is fed to a prediction module.
Schematic diagram of the deconvolution module
A deconvolution operation is applied to the red layer to bring it to the same scale as the blue layer of the previous level; the two are then merged into a new red layer used for prediction.

Generative adversarial networks (GAN)

PGAN uses a Perceptual GAN to improve the detection rate of small objects. The generator converts the features of low-resolution small targets into features resembling those of high-resolution large objects, while the discriminator and the generator compete to tell these features apart; over repeated iterations, the representation gap between small and large targets shrinks, improving small target detection accuracy. PGAN mines the structural associations between objects of different scales and improves the feature representation of small objects so that it resembles that of large objects. It contains two sub-networks: a generator network and a perceptual discriminator network. The generator is a deep residual feature-generation model that converts the original poor features into highly descriptive features by introducing low-level fine-grained features. The discriminator, on one hand, distinguishes the high-resolution features generated for small objects from the features of real large objects, and on the other hand uses a perceptual loss to improve the detection rate.


Summary and outlook

Current small target detection algorithms are basically improvements on existing deep networks that adapt them to small target recognition. Secondly, compared with large target detection, small target datasets are fewer and of lower quality. Let that count as the outlook, haha.


  1. Zhang Xin, A review of small target detection algorithms based on deep learning, 2020.↩︎

Origin blog.csdn.net/weixin_42279314/article/details/109332268