A review of research on target detection in remote sensing images

The particularity of remote sensing images

  • Diversity of scale: Aerial remote sensing images can be taken from heights of several hundred meters to nearly 10,000 meters, and even ground targets of the same class vary greatly in size. For example, ships in ports range from only tens of meters to more than 300 meters in length.
  • Particularity of perspective: Aerial remote sensing images are basically taken from a high-altitude, top-down perspective, whereas most conventional data sets are taken from ground level, so the same target looks different. A detector that performs well on conventional data sets may perform poorly on aerial remote sensing images.
  • Small target problem: Many targets in aerial remote sensing images are small (dozens of pixels or even just a few), so they carry little information. CNN-based detection methods perform well on conventional detection data sets, but for small targets the pooling layers of a CNN further reduce the amount of information. A 24x24 target is only about 1 pixel after 4 pooling layers, which leaves too few dimensions to distinguish it.
  • Multi-directional problem: Aerial remote sensing images are shot from above, so the orientation of a target is uncertain (whereas in conventional data sets it is usually fairly fixed; pedestrians and vehicles, for example, are basically upright). The detector therefore needs to be robust to orientation.
  • High background complexity: Aerial remote sensing images have a relatively large field of view (usually covering several square kilometers), and the field of view may contain a variety of backgrounds, which will cause strong interference to target detection.
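
The small-target arithmetic in the list above can be checked with a few lines of Python. This is only a sketch: it assumes every pooling layer uses stride 2 and rounds down, and the helper name `size_after_pooling` is made up for illustration.

```python
def size_after_pooling(size: int, num_layers: int, stride: int = 2) -> int:
    """Side length in pixels that remains after `num_layers` pooling layers."""
    for _ in range(num_layers):
        # Assumption: each pooling layer divides the side length by the
        # stride and rounds down, as a stride-2 pooling without padding does.
        size = size // stride
    return size


# A 24x24 target shrinks to 12, 6, 3 and finally 1 pixel per side.
for layers in range(5):
    print(layers, size_after_pooling(24, layers))
```

Under these assumptions the target occupies a single feature-map cell after the fourth pooling layer, which matches the "about 1 pixel" figure quoted in the list.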

1. Overview of Target Detection Research

1. Introduction

Object detection has always been one of the research hotspots in the field of computer vision. Its task is to return the category and the rectangular bounding box coordinates of one or more specific objects in a given image. Object detection is highly challenging and has broad application prospects, such as autonomous driving, face recognition, pedestrian detection, and medical detection. At the same time, it serves as the research basis for more complex computer vision tasks such as image segmentation, image description, target tracking, and action recognition.

  • General Object Detection: Explore methods to detect different types of objects under a unified framework to simulate human vision and cognition.
  • Detection Applications: Detection in specific application scenarios, such as pedestrian detection, face detection, and text detection.

2. Traditional target detection

Most early object detection algorithms were built on hand-crafted features. Because effective image representations were lacking at the time, researchers had no choice but to design complex feature representations, together with various acceleration techniques to make the most of limited computing resources.

  1. Viola Jones Detectors
    In 2001, P. Viola and M. Jones achieved the first real-time detection of human faces without any constraints (such as skin color segmentation). On a 700MHz Pentium III CPU, at the same detection accuracy, the detector was dozens or even hundreds of times faster than other algorithms. This detection algorithm later became known as the "Viola-Jones" (VJ) detector. The VJ detector uses the most direct detection method, the sliding window: it looks at all possible positions and scales in the image and checks whether any window contains a face. The VJ detector combines three important techniques, "integral image", "feature selection" and "detection cascade", which greatly improve the detection speed.
  2. HOG Detector
    The Histogram of Oriented Gradients (HOG) feature descriptor was originally proposed by N. Dalal and B. Triggs in 2005. HOG can be considered an important improvement over the scale-invariant feature transform and shape contexts of the time. To balance feature invariance (including translation, scale, and illumination) against nonlinearity (discriminating between object classes), the HOG descriptor is computed on a dense grid of evenly spaced cells and uses overlapping local contrast normalization (on "blocks") to improve accuracy. Although HOG can be used to detect various object classes, its main motivation was the pedestrian detection problem. To detect objects of different sizes, the HOG detector rescales the input image multiple times while keeping the detection window size unchanged. Over the years, HOG detectors have been an important foundation for many object detectors and various computer vision applications.
  3. Deformable Part-based Model (DPM)
    As the winner of the VOC-07, -08, and -09 detection challenges, DPM is the pinnacle of traditional target detection methods. DPM was originally proposed by P. Felzenszwalb in 2008 as an extension of the HOG detector, and was subsequently improved by R. Girshick. DPM follows the detection idea of "divide and conquer": training can be regarded simply as learning a proper way to decompose an object, and inference as assembling the detections of the different object parts. A typical DPM detector consists of a root filter and several part filters. The method does not require the part filter configurations (such as size and location) to be specified manually. Instead, DPM develops a weakly supervised learning method in which all part filter configurations are learned automatically as latent variables. R. Girshick further formulated this process as a special case of multi-instance learning. Important techniques such as "hard negative mining", "bounding box regression" and "context priming" are also used to improve detection accuracy. To speed up detection, R. Girshick developed a technique that "compiles" the detection model into a faster one implementing a cascade structure, achieving more than 10 times acceleration without sacrificing any accuracy.
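
The "integral image" used by the VJ detector in the list above can be sketched as follows. This is a minimal pure-Python illustration rather than the original implementation; the function names `integral_image` and `box_sum` and the zero-padded table layout are assumptions made for readability.

```python
def integral_image(img):
    """Return an (H+1) x (W+1) summed-area table with a zero first row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above this row plus the running sum of this row.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom][left:right] using only four table lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3))  # sum of the lower-right 2x2 block
```

Because any rectangle sum costs four lookups regardless of its size, rectangle-based features can be evaluated at every window position and scale in constant time, which is what makes the exhaustive sliding window affordable.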

3. Target detection based on deep learning

As the performance of hand-crafted features saturated, object detection reached a plateau after 2010. In 2014, R. Girshick et al. took the lead in applying convolutional neural networks to target detection and proposed Regions with CNN features (R-CNN). Since then, object detection has developed at an unprecedented rate.
A target detection algorithm consists of three main steps: image feature extraction, candidate area generation, and candidate area classification.

Two-stage algorithm represented by R-CNN series

First, a series of candidate areas where potential targets may exist are generated on the image through heuristic methods or convolutional neural networks, and then classification and boundary regression are performed on each candidate area in turn.

  1. R-CNN
    R-CNN works in four steps: (1) use selective search to generate candidate areas that may contain potential targets; (2) resample all candidate areas to a fixed resolution and feed them into the convolutional neural network one by one to extract fixed-length feature vectors; (3) use multiple support vector machines to classify all feature vectors; (4) apply regression correction to the rectangular box based on the known category and the extracted feature vector to further improve localization accuracy.
    Advantages : Compared with traditional algorithms, the biggest innovation of R-CNN is that it no longer requires manually designed feature operators. Instead, it introduces a convolutional neural network that learns automatically how to extract better features. Experimental results also show that this is more effective.
    Disadvantages : (1) Although a CNN is used to extract feature vectors, the selective search algorithm that generates candidate areas still relies on low-level visual features, so the quality of the candidate boxes is not high; (2) The three modules of the algorithm are independent of each other, which makes training cumbersome, rules out end-to-end training, and prevents a globally optimal solution; (3) When feature vectors are extracted, each candidate area is cropped individually from the original image and fed into the neural network in turn, which not only takes up a lot of disk space but also causes a lot of repeated computation, making training and inference very slow.
  2. SPP-Net
    SPP-Net no longer passes the candidate areas into the CNN one by one. Instead, it computes the feature map of the entire image once and then crops out the features of each candidate area. To unify the length of the feature vector before the fully connected layer, a new SPP layer converts an input of any size into a fixed-length output through pooling.
    Advantages : Significantly speeds up the training and inference process.
    Disadvantages : The accuracy of SPP-Net is not significantly different from that of R-CNN, its pipeline still consists of multiple independent modules, and saving feature vectors still requires a lot of storage space.
  3. Fast R-CNN
    Fast R-CNN absorbs the idea of SPP-Net and computes the features of the entire image only once. The newly proposed RoI pooling layer is equivalent to a simplified version of the SPP layer. In addition, to simplify the pipeline, Fast R-CNN no longer uses support vector machines for classification or additional regressors. Instead, it designs a multi-task loss function and trains the CNN directly with two new network branches that perform classification and regression respectively.
    Advantages : Feature extraction, classification, and regression are integrated into one step, so feature vectors no longer need to be saved midway, which solves the storage problem; the whole network can also be optimized jointly during training, which yields higher accuracy.
    Disadvantages : (1) The generation of candidate boxes is still completely independent. Traditional algorithms such as selective search generate candidate regions directly from the low-level visual features of the image and cannot learn from a specific data set. (2) Selective search is very time-consuming: it takes 2 seconds to process an image on the CPU.
  4. Faster R-CNN
    Faster R-CNN designs a region proposal network (RPN) to generate candidate boxes. (1) The input of the RPN is the feature map of the entire image extracted by the existing Fast R-CNN skeleton network. This shared-feature design not only makes full use of the feature extraction capability of the CNN but also saves computation; (2) It proposes the concept of anchor points. The RPN performs classification (foreground or background) and regression based on preset anchor points, which both ensures that multi-scale candidate boxes are generated and makes the model easier to converge. After the RPN generates the candidate regions, the rest of the algorithm is the same as Fast R-CNN.
    Advantages : The RPN replaces the selective search algorithm, and Faster R-CNN finally reaches a detection speed of 5FPS on the GPU, breaking the record on the PASCAL VOC data set; it is also the first detection algorithm to truly implement end-to-end training, marking the formal establishment of the two-stage detector.
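
The anchor points used by the RPN in the list above can be illustrated with a short sketch. It generates the anchors for a single feature-map position only; the function name `make_anchors` and the default scales and aspect ratios are illustrative choices, not values stated in this review.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchor boxes centred on (cx, cy).

    Each anchor has area scale**2 and height/width equal to `ratio`.
    The default scales and ratios are assumptions for illustration.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(300, 200)
print(len(anchors))  # 3 scales x 3 ratios = 9 anchors for this position
```

In a full RPN the same set would be repeated at every position of the shared feature map, and the network would predict a foreground score and four box offsets relative to each anchor.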

Since the advent of Faster R-CNN, almost all new two-stage detectors have been based on it. To further improve the efficiency of Faster R-CNN, the R-FCN proposed by Dai et al. removes the time-consuming fully connected layers that are computed independently for each candidate region, and designs a position-sensitive score map and a position-sensitive RoI pooling layer to retain spatial information, which significantly improves inference speed and accuracy. Considering that deep features carry strong semantic information while shallow features carry strong spatial information, Lin et al. proposed the FPN architecture, which fuses deep feature maps with shallow feature maps one by one through repeated upsampling and outputs the final feature maps based on this multi-layer fusion. It detects targets of different scales better and is a milestone in multi-scale target detection. Mask R-CNN, proposed by He et al., builds on Faster R-CNN: it replaces the RoI pooling layer with an RoI alignment layer, so that the feature map is aligned more accurately with the pixels of the original image, and adds a new mask branch for instance segmentation. Surprisingly, by training the classification, regression, and mask branches jointly as a multi-task problem, the algorithm not only performs excellently on instance segmentation but also improves target detection. Qin et al. proposed the lightweight two-stage detector ThunderNet: with SNet, a lightweight skeleton network customized for the detection task, a compressed RPN and detection head, and added modules such as CEM and SAM, the model surpasses many one-stage detectors in both speed and accuracy.

One-stage algorithm represented by YOLO and SSD

Only one convolutional neural network is used to directly locate and classify all targets on the entire image, skipping the step of generating candidate regions.

  1. OverFeat
    (1) Convolutional layers replace the fully connected layers to form a fully convolutional neural network, which accepts input images of different resolutions and is equivalent to using convolution to implement the sliding window algorithm quickly; (2) The same convolutional neural network serves as a shared skeleton network, and classification, localization, and detection are implemented by changing the network head.
    Advantages : OverFeat is 9 times faster than R-CNN in detection.
    Disadvantages : Its accuracy is lower than that of R-CNN from the same period.
  2. YOLO
    YOLO divides the input image into a 7*7 grid. Each grid cell is responsible for predicting the targets whose center points fall inside it, and regresses the position of the center point relative to the cell together with the length, width, and category of the target. YOLO's loss function consists of three parts: localization loss, confidence loss, and classification loss, where confidence refers to whether a target exists. YOLO is therefore an end-to-end algorithm with no concept of candidate boxes: an image is input, and the required attributes are returned at the same time as the foreground is detected.
    Advantages : The YOLO algorithm truly achieves real-time target detection, with a detection speed of 45FPS. Fast YOLO even reaches 155FPS, an order of magnitude faster than two-stage detectors. In addition, YOLO takes more background information into account during detection, so it is much less likely than Fast R-CNN to mistake background for foreground.
    Disadvantages : (1) Each grid cell detects only 2 targets, and they must belong to the same category, which makes it difficult for the algorithm to handle dense targets; (2) The accuracy is worse than that of Fast R-CNN, especially in localization. The main reason is that the latter goes through two rounds of box regression, from the whole to the local part, while YOLO goes through only one; (3) Because of the fully connected layer, the resolution of the input image is fixed; (4) Targets are detected on only a single feature map, which makes it difficult for the algorithm to detect multi-scale targets.
  3. SSD
    (1) The network is trained to predict targets of different scales on multiple feature layers of different depths, and the results are integrated at the end; (2) The anchor point concept of Faster R-CNN is introduced to make the model easier to converge and to ensure that feature maps with different receptive fields adapt to target detection at different scales; (3) A fully convolutional neural network is used to accept image inputs of different resolutions; (4) The loss function consists of localization loss and classification loss. There is no counterpart of YOLO's foreground confidence, because during classification the background is treated directly as one more category and predicted together with the other categories. In addition, SSD lays dense anchor points on the feature maps, and only a very small number of them actually match a target. If all samples were used directly for training, positive and negative samples would be seriously imbalanced, so SSD adopts hard example mining to alleviate this problem.
    Advantages : The detection speed of SSD is comparable to YOLO, and the accuracy is comparable to Faster R-CNN.
    Disadvantages : Compared with Faster R-CNN, the detection results of small targets have not been significantly improved.
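
The grid assignment that YOLO performs in the list above can be sketched as follows. This is a simplified illustration: the function name `yolo_cell_target`, the tuple layout, and the 448x448 example image are assumptions, and details of the real training target such as per-cell box slots are left out.

```python
def yolo_cell_target(box, img_w, img_h, grid=7):
    """Map a ground-truth box (x1, y1, x2, y2) in pixels to its grid cell.

    Returns (row, col, tx, ty, tw, th): the responsible cell, the centre
    offset inside that cell in [0, 1), and the box size relative to the image.
    """
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2 / img_w * grid  # centre x measured in grid cells
    cy = (y1 + y2) / 2 / img_h * grid  # centre y measured in grid cells
    col = min(int(cx), grid - 1)       # clamp centres lying on the far edge
    row = min(int(cy), grid - 1)
    tx, ty = cx - col, cy - row
    tw = (x2 - x1) / img_w
    th = (y2 - y1) / img_h
    return row, col, tx, ty, tw, th


# A 64x64 box centred in a 448x448 image falls in the middle cell (3, 3).
print(yolo_cell_target((192, 192, 256, 256), 448, 448))
```

Only the cell that contains the box centre is responsible for the target, which is why two dense targets whose centres share a cell are hard for the original YOLO to separate.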

Although the newly emerged one-stage detectors generally have a clear speed advantage, there is also a non-negligible accuracy gap between them and the top two-stage detectors. Lin et al. argue that the most essential difference between the two types of algorithms is that two-stage detectors ensure the high quality and class balance of the second-stage training samples by screening candidate boxes, while one-stage detectors must make a prediction for every sliding window on the image; in other words, they suffer from a serious imbalance between positive and negative samples and between hard and easy samples. They therefore designed a new loss function, Focal Loss, for one-stage detectors. Focal Loss adds two new parameters to the cross-entropy loss: one reduces the weight of negative samples and the other reduces the weight of easy samples, so that during training the model is not distracted by the large number of negative and easy samples that a one-stage algorithm produces. In their experiments, the authors built a simple one-stage detector, RetinaNet, from ResNet and the feature pyramid network architecture, and trained it with Focal Loss. It achieved higher accuracy than Faster R-CNN on the MS COCO test set, especially on small targets.
After YOLOv2, Redmon et al. upgraded the algorithm again and proposed YOLOv3. YOLOv3 has three main improvements: (1) multiple logistic regression classifiers replace the softmax classifier, so that the model can be applied to classification tasks with overlapping categories; (2) the feature pyramid network architecture is introduced: the deepest feature map is upsampled twice and fused with shallower features each time, and different anchor points are set on the resulting three feature layers to predict targets at different scales; (3) borrowing the idea of the residual network, Darknet-53 is designed as a new skeleton network that is comparable to ResNet-101 and ResNet-152 in accuracy but faster. YOLOv3 achieved the best trade-off between speed and accuracy at the time and is one of the preferred algorithms for target detection in industry.
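
The Focal Loss discussed above for RetinaNet can be written down for a single binary prediction as a small sketch. The function name `focal_loss` is made up, and the default values of `alpha` and `gamma` are assumptions for illustration, since this review does not state them.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one prediction.

    p: predicted probability of the positive class, 0 < p < 1.
    y: ground-truth label, 1 for positive and 0 for negative.
    alpha re-weights the two classes; gamma down-weights easy samples.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative (p = 0.1) contributes far less than a hard one (p = 0.9).
print(focal_loss(0.1, 0), focal_loss(0.9, 0))
```

With `gamma = 0` the modulating factor disappears and the expression reduces to a class-weighted cross-entropy, which is a convenient sanity check for any implementation.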

2. Overview of research on multi-scale target detection

The fundamental reason why detectors perform poorly on data sets with a large scale span is that, as the convolutional neural network deepens, its ability to express abstract features grows stronger while shallow spatial information is correspondingly lost. As a result, deep feature maps cannot provide the fine-grained spatial information needed for precise target localization, and the semantic information of small targets is gradually lost during downsampling.
When detecting targets with large scale and rich detailed features, stronger semantic information is needed as the basis for classification; when detecting targets with small scale and low deviation tolerance, finer-grained spatial information is needed to achieve precise positioning.
A general idea for solving scale problems: constructing multi-scale feature expressions

1. Multi-scale target detection based on image pyramid

In the training phase, images of different scales are input at random, which forces the neural network to adapt to target detection at different scales; in the testing phase, the same image is detected several times at different scales, and a non-maximum suppression algorithm finally integrates all results, so that the detector covers targets over the widest possible range of scales.
Advantages : Improve overall accuracy to a certain extent.
Disadvantages : High-resolution image input increases both memory overhead and computation time. This makes it difficult to use larger batch sizes during training, which affects model accuracy, and the sharply increased inference time further raises the threshold for putting the algorithm into practical use.
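
The test-time procedure described above, detecting at several scales and then integrating the results with non-maximum suppression, can be sketched as follows. This is a minimal greedy NMS in pure Python; the function names, the detection tuple layout `(x1, y1, x2, y2, score)`, and the 0.5 IoU threshold are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(dets, iou_thresh=0.5):
    """Greedy NMS: keep the best box, drop boxes that overlap it, repeat."""
    remaining = sorted(dets, key=lambda d: d[4], reverse=True)
    keep = []
    while remaining:
        best = remaining.pop(0)
        keep.append(best)
        remaining = [d for d in remaining if iou(best, d) <= iou_thresh]
    return keep


def merge_pyramid_detections(dets_per_scale, iou_thresh=0.5):
    """Map detections from each resized image back to the original and merge.

    dets_per_scale: {scale: [(x1, y1, x2, y2, score), ...]} where the boxes
    are expressed in the coordinates of the image resized by `scale`.
    """
    merged = []
    for scale, dets in dets_per_scale.items():
        for x1, y1, x2, y2, score in dets:
            merged.append((x1 / scale, y1 / scale, x2 / scale, y2 / scale, score))
    return nms(merged, iou_thresh)


result = merge_pyramid_detections({
    1.0: [(10, 10, 50, 50, 0.9)],
    2.0: [(22, 22, 102, 102, 0.8), (300, 300, 340, 340, 0.7)],
})
print(result)
```

In this example the first box found at scale 2.0 maps back onto almost the same region as the box found at scale 1.0, so only the higher-scoring one survives, while the small target that was found only in the enlarged image is kept.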

Image pyramid based on scale generation network

When multi-scale detection is performed, many layers of the pyramid do not actually detect any valid target, which is an obvious waste of resources. The reason is that the scale distribution of targets differs significantly from image to image: some images contain targets of only one scale, so only one layer of the pyramid needs to be processed; others contain only medium and large targets, so the highest-resolution layer of the pyramid is not needed, and that happens to be the most computationally expensive layer. To improve detection efficiency, the authors of this approach argue that if the scale distribution of targets in an image can be estimated before formal detection, the redundant layers of the image pyramid can be removed. Moreover, once the target scale is known, subsequent detection can be optimized further. They therefore designed a scale generation network that splits the original target detection task into two steps, scale estimation and single-scale target detection, as shown in the figure.
The scale generation network is trained with image-level supervision signals and outputs a scale histogram vector. After mean filtering and one-dimensional non-maximum suppression, a discrete target scale distribution is obtained. Since the target scale is known, the subsequent detector only needs to detect targets of a single scale, so the number of RPN anchor sizes can be reduced to 1, which further increases detection speed without affecting accuracy. Finally, the image is resampled in turn to the resolution corresponding to each target scale and detected at that resolution, and all results are combined to complete the detection of multi-scale targets.
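
The post-processing of the scale histogram described above (mean filtering followed by one-dimensional non-maximum suppression) can be sketched as follows. The window size, the threshold, and the example histogram are hypothetical values chosen for illustration, not settings taken from the method itself.

```python
def mean_filter(hist, k=3):
    """Smooth a histogram with a sliding mean of width k (shorter at the ends)."""
    half = k // 2
    out = []
    for i in range(len(hist)):
        window = hist[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def nms_1d(hist, threshold=0.3):
    """Return the indices of local maxima whose value reaches `threshold`."""
    peaks = []
    for i, v in enumerate(hist):
        left = hist[i - 1] if i > 0 else float("-inf")
        right = hist[i + 1] if i < len(hist) - 1 else float("-inf")
        if v >= threshold and v > left and v >= right:
            peaks.append(i)
    return peaks


# Hypothetical network output: one bin per discretised target scale.
scale_hist = [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.7, 0.8, 0.1]
print(nms_1d(mean_filter(scale_hist)))  # indices of the scales worth detecting
```

Each surviving index stands for one target scale, so the image only has to be resampled and detected once per peak instead of once per pyramid layer.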

Image pyramid based on scale normalization

To address the challenges posed by the large number of small targets in the MS COCO data set, Singh et al. proposed a training strategy called Scale Normalization for Image Pyramids (SNIP): an image pyramid is used to train the model, but each layer only provides supervision signals for targets within an appropriate scale range, as shown in the figure.
The fundamental purpose of this is to let the model focus on detecting targets within a certain scale range, while the pyramid ensures that all training data can still be learned from. The image pyramid is also used when the model is validated. This strategy can be applied to both stages of Faster R-CNN and improves detection accuracy for targets of all scales. SNIP is essentially an improvement on the traditional multi-scale training strategy that addresses the inherent defects of CNNs and brings the advantages of the image pyramid into the mechanism. However, this training strategy does not solve the memory and time overhead of the image pyramid. Later, Singh et al. upgraded SNIP to SNIPER. To overcome the memory limitation of the image pyramid during training, SNIPER no longer trains on complete images but cuts out fragments with a fixed resolution of 512x512 from each layer of the pyramid as training units. The fragment size is used as the grid unit on each layer, and the grid cells that contain valid targets at that scale are selected as fragments, which serve as positive samples during training. To prevent the detector from mistaking background for targets, the authors also use fragments that contain several false positives as negative samples in training. Because the fragments have a smaller resolution, the memory problem of the image pyramid is effectively solved and a larger batch size can be used during training, which both speeds up training and improves detection accuracy. However, when the model is actually applied to detect targets, the complete image pyramid must still be processed, so the inference time problem remains to be solved.
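
The per-layer supervision rule of SNIP described above can be sketched as a simple filter on the ground-truth boxes. This is only an illustration: the function name `snip_valid_boxes`, the use of the square root of the box area as the target size, and the valid range of 32 to 128 pixels are assumptions, not settings from the method.

```python
import math


def snip_valid_boxes(boxes, scale, valid_range):
    """Keep the boxes whose size, after resizing by `scale`, is in `valid_range`.

    boxes: list of (x1, y1, x2, y2) in original-image pixels.
    valid_range: (low, high) bounds on the square root of the resized box area.
    """
    low, high = valid_range
    kept = []
    for x1, y1, x2, y2 in boxes:
        side = math.sqrt((x2 - x1) * (y2 - y1)) * scale
        if low <= side <= high:
            kept.append((x1, y1, x2, y2))
    return kept


boxes = [(0, 0, 20, 20), (0, 0, 200, 200)]
# On the enlarged layer only the small target supplies a training signal ...
print(snip_valid_boxes(boxes, scale=2.0, valid_range=(32, 128)))
# ... and on the reduced layer only the large one does.
print(snip_valid_boxes(boxes, scale=0.5, valid_range=(32, 128)))
```

Every target still falls into the valid range of at least one layer in this example, which reflects the idea that the pyramid as a whole sees all training data while each layer trains on a normalized scale.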

Image pyramid based on attention mechanism

AZ-Net, proposed by Lu et al., was the first to introduce zoom-in operations into deep learning target detection. They argue that the anchor point strategy of the RPN is essentially an exhaustive search with fixed sliding window sizes, which is neither efficient nor well suited to multi-scale targets. They therefore designed AZ-Net, a candidate region generation algorithm based on adaptive search. The algorithm starts its search from the entire image and produces two outputs: adjacent area predictions and a zoom indicator. The former is a set of candidate areas close in scale to the search area, and the latter indicates whether the current search area contains smaller targets. If it does, the current search area is divided into five areas (upper left, lower left, upper right, lower right, and middle), each of which becomes a new search starting point in turn, until no area contains small targets any more. Experiments on the PASCAL VOC data set show that the candidate regions generated by this algorithm are fewer but of higher quality than those generated by the RPN, although the accuracy advantage is not obvious. Gao et al. continued the search idea of AZ-Net and designed a coarse-to-fine strategy for detecting targets in high-resolution images by introducing reinforcement learning with decision-making capabilities: first, a coarse Fast R-CNN detects the downsampled low-resolution image and generates an accuracy gain probability map; reinforcement learning then finds areas that may contain small targets, and a finer detector detects targets in the high-resolution version of those areas. Each such area is then used as a new input and passed through the coarse detector again, and so on until it no longer contains small targets.
Experimental results show that, with almost no loss of accuracy, the algorithm reduces the number of processed pixels by 50% and the inference time by 25% on the Caltech pedestrian detection data set, and reduces the number of processed pixels by 70% and the inference time by 50% on the YFCC100M data set. Uzkent et al. continued the approach of Gao et al. and also introduced reinforcement learning to select the areas of the image that need a closer look. The difference is that their algorithm also determines whether an area is dominated by large targets or small targets and then uses a different detector for each case, in order to save further computation. In general, these algorithms derive from the idea of the attention mechanism and treat multi-scale target detection as a recursive process from coarse to fine, from the whole to the details. The process is shown in the figure.
These algorithms can be seen as an optimization of the image pyramid: detection starts from the top of the pyramid, and reinforcement learning determines which part of the next layer contains potential targets, and so on until the next layer no longer contains targets. The algorithm is therefore equivalent to using the decision-making ability of reinforcement learning as a guide to remove the redundant parts of the image pyramid, which solves the problem of heavy inference time that still exists in the SNIPER strategy.
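
The recursive coarse-to-fine search described above can be sketched with the five-way split used by AZ-Net. The exact geometry of the five sub-areas and the `should_zoom` callback, which stands in for the learned zoom indicator or the reinforcement learning policy, are assumptions made for illustration.

```python
def split_five(region):
    """Split (x1, y1, x2, y2) into upper-left, lower-left, upper-right,
    lower-right and middle sub-areas, each half the width and height."""
    x1, y1, x2, y2 = region
    w, h = x2 - x1, y2 - y1
    mx, my = x1 + w / 2, y1 + h / 2
    return [
        (x1, y1, mx, my),                              # upper left
        (x1, my, mx, y2),                              # lower left
        (mx, y1, x2, my),                              # upper right
        (mx, my, x2, y2),                              # lower right
        (x1 + w / 4, y1 + h / 4, x2 - w / 4, y2 - h / 4),  # middle
    ]


def adaptive_search(region, should_zoom, max_depth=3):
    """Return every area visited, zooming only where `should_zoom` says so."""
    visited = [region]
    if max_depth > 0 and should_zoom(region):
        for sub in split_five(region):
            visited.extend(adaptive_search(sub, should_zoom, max_depth - 1))
    return visited


# Hypothetical indicator: keep zooming while the area is wider than 50 pixels.
areas = adaptive_search((0, 0, 100, 100), lambda r: (r[2] - r[0]) > 50)
print(len(areas))  # the whole image plus its five sub-areas
```

Areas for which the indicator returns false are never subdivided, which is how the redundant parts of the pyramid are skipped.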

2. Multi-scale target detection based on feature pyramid within the network

Early detectors represented by R-CNN made predictions directly on the last feature map of the neural network. Because fine-grained spatial features were lacking, detection of small targets was poor, so multi-scale feature representations were needed. Although the image pyramid can extract features of different scales from inputs of different resolutions, it incurs serious memory and time overhead and is impractical. If a multi-scale feature representation can instead be constructed inside the convolutional neural network, the multi-scale features that an image pyramid would extract can be approximated from a single input image at a much lower computational cost. At this stage, the feature pyramid within the network is mainly constructed in two ways: (1) using cross-layer connections to fuse feature maps of different depths in the network and obtain feature representations of different scales; (2) using parallel branches with different receptive fields to build a spatial pyramid.

Build feature pyramid based on cross-layer connections

Given the layer-by-layer structure of the convolutional neural network, the deeper the feature map, the larger the receptive field, so the feature maps at different depths form a natural multi-scale representation. The SSD algorithm and the MS-CNN algorithm therefore both proposed detecting targets directly on these feature maps of different scales and integrating the results, with shallow feature maps responsible for small targets and deep feature maps responsible for large targets. Judging from the experimental results, however, the detection accuracy of small targets did not improve significantly. The reason is that these feature layers differ in depth and in feature representation capability, so there is a significant semantic gap between them. Although the shallow feature layers retain more fine-grained spatial information, their feature representation is too weak and lacks effective semantic information, so detection is poor. It is therefore inappropriate to predict targets of different scales directly on feature maps at different depths; a feature pyramid with sufficient feature information in every layer must be constructed first. In response to the shortcomings of the SSD algorithm, Lin et al. proposed the famous feature pyramid network, FPN. The core idea of FPN is to fuse feature information from different depths of the network, but its top-down, layer-by-layer fusion structure is open to discussion, and a series of algorithms have emerged to discuss and improve it.
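
The top-down fusion at the core of FPN, as described above, can be sketched on tiny single-channel feature maps. This is a deliberately simplified illustration: it assumes each level is exactly half the size of the previous one, uses nearest-neighbour upsampling, and omits the lateral and smoothing convolutions of the real architecture. All function names are made up.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling of a 2D list by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_fuse(pyramid):
    """Fuse feature maps ordered from shallow (largest) to deep (smallest).

    Each output level is its own map plus the upsampled fused map of the
    level above it, so semantic information flows down to the shallow levels.
    """
    fused = [pyramid[-1]]
    for lateral in reversed(pyramid[:-1]):
        fused.append(add_maps(lateral, upsample2x(fused[-1])))
    return fused[::-1]


c3 = [[1] * 4 for _ in range(4)]   # shallow: fine spatial detail
c4 = [[10] * 2 for _ in range(2)]
c5 = [[100]]                       # deep: strong semantics
p3, p4, p5 = top_down_fuse([c3, c4, c5])
print(p3[0], p4[0], p5[0])
```

After fusion the shallowest level carries contributions from every deeper level, which is what gives each layer of the pyramid enough semantic information for prediction.
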
The above methods all modify the feature fusion scheme proposed by FPN, while Li et al. improved the skeleton network of FPN itself. Most detectors use classification networks (such as ResNet) as skeleton networks, and pre-training is also done on classification data sets, which causes two problems: (1) Detectors such as FPN introduce additional network stages that are not involved in pre-training; (2) The receptive field and downsampling factor of the skeleton network are both large. Although this benefits image classification, the lack of spatial information hinders the precise localization of large targets, and the loss of semantic information during downsampling hinders the recognition of small targets. Even introducing the FPN architecture does not solve the underlying problem. To this end, they designed a new skeleton network, DetNet-59, specifically for detection tasks. Compared to ResNet-50, there are three main differences: (1) The network has the same number of stages as FPN, so all stages can take part in pre-training; (2) Starting from the fourth stage, the downsampling factor of DetNet is fixed at 16 and the number of channels is fixed at 256; (3) Atrous convolution is introduced in the residual module to enlarge the receptive field. Judging from the experimental results, the parameter count of DetNet lies between those of ResNet-50 and ResNet-101, but its performance on the detection task is better than both. Looking at targets of different scales, DetNet is particularly good at localizing large targets and finding small targets, which is in line with the authors' expectations.

Construct feature pyramid based on parallel branches

To construct multi-scale feature representations, parallel branches with different parameters can be designed within the network. Each branch extracts feature maps at a different spatial scale according to its own receptive field, and together they form a spatial pyramid. In deep learning, the spatial pyramid can be traced back to the Inception module proposed in GoogLeNet. The module contains four branches: the first three apply convolutions with 1x1, 3x3, and 5x5 kernels respectively, the fourth applies max pooling, and the outputs of all branches are finally fused, as shown in the figure.
Although their implementations differ greatly, the Inception module and SPM share the same idea: extracting image features at different spatial scales. The SPP module of SPP-Net also adopts SPM's multi-scale partitioning; by pooling within each block, it converts a feature map of any size into a fixed-length feature vector. In short, building a spatial pyramid is another feasible way to address the scale problem in target detection. To combine global and local information, Zhao et al. designed a pyramid pooling module similar to the SPP module. It contains four branches that pool to 1x1, 2x2, 3x3, and 6x6 bins to extract multi-scale information, and it brought a significant improvement on semantic segmentation tasks. The PFPNet proposed by Kim et al. is likewise based on fusing contextual information at different scales, introducing an SPP module with three branches into a single-stage detector. In addition, the authors designed an MSCA module for the pooled feature map of each branch: it upsamples or downsamples the feature maps of the other two branches and then concatenates them with the features of the current branch. Finally, target detection is performed on the output feature maps of the three branches, and non-maximum suppression is used to merge the results. Judging from the experimental results on the MS COCO dataset, PFPNet is slightly better than YOLOv3, which uses the FPN architecture. The process is shown in the figure.
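How spatial pyramid pooling turns a feature map of any size into a fixed-length vector can be sketched as follows. This is a simplified illustration that assumes a single-channel map stored as a nested list, max pooling in every bin, and the 1, 2, 3, 6 bin layout mentioned above; the function name is hypothetical.

```python
import math


def spp_vector(fmap, levels=(1, 2, 3, 6)):
    """Max-pool a 2-D feature map into n x n bins for each n in levels and
    concatenate the results into one fixed-length list."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:
        for i in range(n):
            # floor for the start and ceil for the end, so no bin is empty
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                out.append(max(fmap[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
    return out
```

With the default levels the output always has 1 + 4 + 9 + 36 = 50 values, whatever the height and width of the input map.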

3. Other strategies for multi-scale target detection

Anchor points

In early target detection, targets of different scales could be detected not only by sliding a fixed-size window layer by layer over an image pyramid, but also by sliding windows of different sizes over the same image in turn. The anchor point concept introduced with the RPN of Ren et al. is equivalent to placing nine sliding windows of different sizes on the feature map extracted by the backbone network as prior information for detection, so that the network covers targets across as wide a scale range as possible. Although the model's detection accuracy for small targets is not ideal, the multi-scale anchor point strategy has become standard in most later detectors, and combining it with a feature pyramid extends the scale range of the anchor points even further.
Disadvantages: (1) The sizes of the anchor points must be defined in advance, and a poor choice significantly reduces model performance; (2) to ensure sufficient recall, a large number of anchor points is often required, yet most of them do not contribute to the detection results.
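The nine anchor points per location can be generated as three scales times three aspect ratios. The sketch below assumes scales of 128, 256, and 512 pixels and height-to-width ratios of 0.5, 1, and 2; the function name and the (x1, y1, x2, y2) box layout are choices made for this illustration.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred at (cx, cy).

    Each anchor has area scale**2 and height / width equal to ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Both disadvantages are visible here: the scales and ratios are hand-picked constants, and every feature-map location produces nine boxes regardless of whether a target is nearby.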

Intersection over union (IoU) threshold

During training, positive and negative samples are usually determined by the intersection over union (IoU) between the predicted box and the ground-truth label. For example, a box with IoU greater than 0.5 is a positive sample, and one with IoU less than 0.3 is a negative sample. Such thresholds are set mainly from experience and are not necessarily optimal. A fixed IoU threshold is even less appropriate for multi-scale target detection, because the same coordinate deviation affects the IoU of a small target much more than that of a large target. To address this, Cai et al. proposed Cascade R-CNN, which cascades three R-CNN networks with IoU thresholds of 0.5, 0.6, and 0.7. The rationale is that raising the IoU threshold directly on a single network rapidly reduces the number of positive samples, which significantly degrades accuracy. The cascade instead improves the quality of the generated boxes step by step: the output of each detection network serves as the input of the next, so the networks adapt to progressively higher IoU thresholds, and each network detects targets within a specific IoU range.
Disadvantages: The cascade structure significantly improves accuracy, but it also significantly increases training and inference time. The concern that a fixed IoU threshold is unreasonable also remains.
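The IoU computation and the threshold-based labelling described in this section can be sketched as below. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the "ignored" label for boxes between the two thresholds is an assumption of this sketch rather than something stated above.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # no overlap
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def label(box, gt, pos_thr=0.5, neg_thr=0.3):
    """Assign a training label to a box from its IoU with the ground truth."""
    v = iou(box, gt)
    if v > pos_thr:
        return "positive"
    if v < neg_thr:
        return "negative"
    return "ignored"  # assumed handling of the in-between range
```

A 4-pixel horizontal shift gives an IoU of 0.6 for a 16 x 16 box but more than 0.9 for a 128 x 128 box, which is the scale sensitivity that makes a single fixed threshold questionable.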

Dynamic convolution

Traditional convolutional neural networks have an inherent flaw: the convolution kernel size is fixed and so is the pooling scale, so the receptive field of every feature layer is fixed, which makes it hard to perceive targets at different scales. A series of methods therefore try to make the convolution operation dynamic. For example, dilated convolution lets a convolutional layer enlarge its receptive field monotonically with the dilation rate while keeping the number of parameters unchanged, which makes it easier for the network to capture multi-scale features. The deformable convolution proposed by Dai et al. goes further and adds an offset to the position of each sampling point in the convolution, so that the kernel can take on various shapes; dilated convolution is then a special case of deformable convolution. Similarly, the pooling layer can be turned into deformable pooling by adding offsets. Judging from the visualizations in the experiments, deformable convolution does help the network adapt to targets of different shapes and scales. However, Zhu et al. found that, because the offsets are unconstrained, deformable convolution can bring in too much contextual information, which may have negative effects. They therefore upgraded deformable convolution so that it learns not only the offsets but also a weight for each sampling point, which is equivalent to a local attention mechanism. Overall, deformable convolution significantly increases the degrees of freedom of a convolutional neural network and is compatible with other detectors.
Disadvantages: Detection accuracy improves, but the number of parameters grows to about 3-4 times that of the original model, so it is currently difficult to extend to mature detection networks.
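The way dilation enlarges the receptive field without adding parameters follows from a short calculation. This sketch uses the standard effective-kernel formula k + (k - 1)(d - 1) and accumulates the receptive field over a stack of layers; the helper names and the (kernel, stride, dilation) tuple layout are assumptions of the example.

```python
def effective_kernel(k, dilation):
    """Span covered by a k-tap kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    layers: list of (kernel, stride, dilation) tuples, input to output.
    """
    rf, jump = 1, 1  # jump = distance between adjacent outputs, in input pixels
    for k, s, d in layers:
        rf += (effective_kernel(k, d) - 1) * jump
        jump *= s
    return rf
```

Three stacked 3x3 layers see 7 pixels with dilation 1 everywhere, but 15 pixels with dilation rates 1, 2, 4, using the same nine weights per layer.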

Bounding box loss function

The L1 and L2 norms are classic regression losses that can be used for bounding box regression in object detection. However, the L1 loss converges slowly and its solution is unstable, while the L2 loss is sensitive to outliers and not robust enough. Girshick therefore proposed the smooth L1 loss, which combines the strengths of both: compared with the L1 loss, its gradient becomes small near the true value, so it converges faster; compared with the L2 loss, its gradient for outliers is smaller, so it is more robust.
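The piecewise form of the smooth L1 loss makes both properties easy to see. The sketch below is written for a single scalar error x (prediction minus target) with the transition point at |x| = 1; the function names are chosen for this illustration.

```python
def smooth_l1(x):
    """Smooth L1 loss for a scalar error x."""
    ax = abs(x)
    if ax < 1.0:
        return 0.5 * ax * ax  # quadratic (L2-like) near zero
    return ax - 0.5           # linear (L1-like) for large errors


def smooth_l1_grad(x):
    """Derivative of smooth_l1 with respect to x."""
    if abs(x) < 1.0:
        return x              # shrinks towards zero as the error shrinks
    return 1.0 if x > 0 else -1.0  # bounded, so outliers cannot dominate
```

Near the target the gradient scales with the error, as with L2, while for large errors it is capped at magnitude 1, as with L1.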
Disadvantages: (1) These losses all penalize offsets in the corner coordinates, width, and height of the box, which does not directly reflect the similarity between the predicted box and the ground-truth box; (2) none of them is scale-invariant. To solve this, Yu et al. proposed the IoU loss, which treats the box as a whole and uses the logarithm of the IoU, which is a ratio, to guide bounding box regression. This loss is therefore scale-invariant and, compared with the L2 loss, works significantly better on multi-scale targets.
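Scale invariance can be checked numerically. The sketch below assumes the loss takes the form -ln(IoU) on (x1, y1, x2, y2) boxes and compares it with a plain L2 loss on the coordinates; the small eps clamp and the function names are additions made for this illustration.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


def iou_loss(pred, gt, eps=1e-9):
    """Negative log of the IoU; eps avoids log(0) for disjoint boxes."""
    return -math.log(max(iou(pred, gt), eps))


def l2_loss(pred, gt):
    """Sum of squared coordinate differences, for comparison."""
    return sum((p - g) ** 2 for p, g in zip(pred, gt))
```

Multiplying every coordinate of both boxes by 10 leaves the IoU loss unchanged but makes the L2 loss 100 times larger, so under L2 large targets dominate the training signal.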

Decoupling classification and positioning

The target detection task consists of two parts: target classification and target localization. Traditional algorithms such as Faster R-CNN generally extract features from the candidate region through a shared fully connected layer in the second stage, and then perform classification and regression on two branches. Based on heat map analysis, Song et al. pointed out that the sensitive region for the classification task is the salient region of the target, while the sensitive region for the localization task is the boundary region of the target, and the two cannot be spatially aligned. For multi-scale target detection, this spatial misalignment between classification and localization becomes more serious as the scale of the target increases. Similarly, Wu et al., considering the characteristics of fully connected and convolutional layers, argue that the spatial sensitivity of the former makes it better suited to classification, while the weight sharing of the latter makes the extracted features more spatially correlated and better suited to boundary regression; their experimental results support this. To resolve the potential conflict between classification and regression, the most intuitive idea is to decouple the two tasks.

Small target feature reconstruction

In the MS-CNN algorithm, to better detect small-scale targets, a deconvolution layer is added to the network to upsample the feature map, which effectively reduces memory usage and computation time compared with upsampling the input image. The STOD algorithm proposed by Zhou et al. uses DenseNet-169 as the backbone network and designs a scale-transfer module that rearranges the final multi-channel feature maps, by tiling channels into spatial positions, into feature maps with higher resolution and fewer channels, which are used to detect small targets. To supply the semantic information that the shallow features of SSD lack when detecting small targets, the DES algorithm proposed by Zhang et al. adds a segmentation branch that performs semantic segmentation and applies the resulting feature map to the shallow feature map as a weighting, which is equivalent to an attention mechanism. Judging from the visualization results, irrelevant features on the shallow feature map are effectively suppressed.
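The channel-to-space rearrangement behind the scale-transfer module can be sketched with nested lists. The channel ordering used here (each group of r*r consecutive channels fills one r x r output tile) is an assumption of this example, as is the function name.

```python
def scale_transfer(channels, r):
    """Rearrange C = r*r*C_out maps of size H x W into C_out maps of size
    (H*r) x (W*r), trading channels for spatial resolution."""
    c = len(channels)
    if c % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    h, w = len(channels[0]), len(channels[0][0])
    c_out = c // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for k in range(c_out):
        for y in range(h * r):
            for x in range(w * r):
                # which source channel supplies this position inside its tile
                src = k * r * r + (y % r) * r + (x % r)
                out[k][y][x] = channels[src][y // r][x // r]
    return out
```

No values are created or discarded: four 1 x 1 channels become one 2 x 2 map, so resolution increases without any learned upsampling layer.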

Data augmentation

Data augmentation is another feasible way to alleviate scale problems, for example the random multi-scale training strategy of YOLOv2. In addition, Kisantal et al. took Mask R-CNN as the baseline and proposed two data augmentation methods to address the poor detection accuracy for small targets on the MS COCO dataset: (1) an oversampling strategy, to address the small number of images that contain small targets; (2) copying and pasting the segmentation masks of small targets within the same image, so that the anchor point strategy matches more small-target positive samples and small targets carry more weight in the loss function. The essence of this idea is to change the target scale distribution of the training data so that the model is more inclined to perceive small targets. Judging from the experimental results, the detection accuracy for large targets drops slightly while that for small targets improves. In object detection tasks, to improve the overall performance of the detector, additional datasets are also commonly used either to pre-train the model before fine-tuning on the target dataset or directly in joint training.
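The oversampling strategy in (1) amounts to repeating images that contain small targets. The sketch below assumes each image is a dict with an "id" and a list of (x1, y1, x2, y2) boxes, treats a box with area below 32 x 32 pixels as small, and uses a hypothetical repetition factor of 3; none of these details come from the text above.

```python
def oversample_small(images, small_area=32 * 32, factor=3):
    """Repeat every image that contains at least one small box.

    images: list of {"id": ..., "boxes": [(x1, y1, x2, y2), ...]} dicts.
    Returns a new list in which such images appear `factor` times.
    """
    out = []
    for img in images:
        has_small = any((x2 - x1) * (y2 - y1) < small_area
                        for x1, y1, x2, y2 in img["boxes"])
        out.extend([img] * (factor if has_small else 1))
    return out
```

The repetition shifts the scale distribution of the training set towards small targets, which is consistent with the slight drop reported for large targets.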

3. Other target detection tasks

  • Pedestrian detection
  • Face detection
  • Text detection
  • Traffic sign and traffic light detection
  • Detection in specific fields, such as remote sensing target detection
    Multi-scale target detection in high-resolution images: When detecting targets in high-resolution images, detailed information about small targets is usually available, but it is difficult to balance accuracy against computing resources. Because of constraints such as memory and detection speed, algorithms such as Faster R-CNN and YOLO first downsample a high-resolution image to a certain resolution before passing it to the network, which loses information. If a sliding window is used for exhaustive detection instead, the overall speed is too slow. The strategy proposed by Gao et al., which uses reinforcement learning to guide fine-grained detection, brings some benefit for high-resolution images captured by everyday devices. Whether it remains effective for images with denser fine-grained information (such as drone aerial photography), and whether simpler algorithms can be designed, remain to be studied.

4. Evaluating the effectiveness of target detectors

       In early research on pedestrian detection, “miss rate versus false positives per window (FPPW)” was often used as the evaluation criterion for detection performance. However, the per-window measurement (FPPW) can be flawed and in some cases fails to predict performance on full images. In 2009, the California Institute of Technology (Caltech) established a pedestrian detection benchmark, and since then the evaluation metric has changed from false positives per window (FPPW) to false positives per image (FPPI).
       Since VOC2007, the most commonly used evaluation metric for target detection has been "average precision (AP)". AP is defined as the average detection precision over different recall levels and is usually evaluated per category. To compare performance across all object categories, the mean AP (mAP) over all categories is often used as the final performance measure. To measure localization accuracy, Intersection over Union (IoU) is used: the IoU between the predicted box and the ground-truth box is checked against a predefined threshold, such as 0.5. If it is greater, the object is counted as successfully detected; otherwise it is counted as not detected. mAP at an IoU threshold of 0.5 has therefore been the de facto metric for object detection for years.
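A minimal sketch of the AP computation for one category follows. It assumes each detection has already been marked as a true or false positive at the chosen IoU threshold, and it uses all-point interpolation of the precision-recall curve, which is one of several conventions; the function names are hypothetical.

```python
def average_precision(scores, is_tp, num_gt):
    """Area under the interpolated precision-recall curve.

    scores: confidence of each detection
    is_tp:  True if the detection matched a ground-truth box
    num_gt: number of ground-truth boxes for this category (must be > 0
            whenever there are detections)
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # interpolation: precision at a recall level is the best precision
    # achieved at that recall or any higher one
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the plain mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, three detections ranked true, false, true against two ground-truth boxes give an AP of 5/6 under this convention.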
       After 2014, with the popularity of the MS-COCO dataset, researchers began to pay more attention to the accuracy of bounding box localization. MS-COCO AP does not use a fixed IoU threshold but averages over multiple IoU thresholds, ranging from 0.5 (coarse localization) to 0.95 (near-perfect localization). This change in metric encourages more precise object localization and may be important for some practical applications (for example, imagine a robot arm trying to grasp a wrench).
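The effect of averaging over thresholds can be illustrated on a single detection. The sketch assumes ten thresholds from 0.50 to 0.95 in steps of 0.05 and counts a detection as correct when its IoU is at least the threshold; the helper name is hypothetical, and a full evaluation would average AP values rather than per-detection hits.

```python
# Assumed threshold grid: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def hit_fraction(iou_value, thresholds=COCO_IOU_THRESHOLDS):
    """Fraction of IoU thresholds at which a detection counts as correct."""
    hits = sum(1 for t in thresholds if iou_value >= t)
    return hits / len(thresholds)
```

A box with IoU 0.72 is fully correct under a single 0.5 threshold but is credited at only half of the thresholds here, which is how the averaged metric rewards tighter localization.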
       In recent years, evaluation has developed further on the Open Images dataset, for example by taking group-of boxes and non-exhaustive image-level class hierarchies into account. Some researchers have also proposed alternative metrics, such as "localization recall precision". Despite these recent changes, VOC/COCO-based mAP remains the most commonly used evaluation metric for target detection.


Origin blog.csdn.net/weixin_43312470/article/details/124086107