A Survey of the Four Pillars for Small Object Detection: Multiscale Representation, Contextual Information, Super-Resolution, and Region Proposal

Article address: https://ieeexplore.ieee.org/document/9143165

This paper reviews the latest research on deep learning-based small object detection. It first briefly introduces the four pillars of small object detection: multi-scale representation, contextual information, super-resolution, and region proposal. It then lists state-of-the-art datasets for small object detection. Furthermore, it surveys state-of-the-art small object detection networks, paying special attention to how they differ from and improve upon common object detection architectures. Finally, some promising directions and tasks are proposed for future work on small object detection.

1. Introduction to the article:

[Figure: example contrasting small object detection with general object detection]

The detailed definition of a small object can be specified in different ways. For example, one definition requires the length and width of the small object's bounding box to be less than 32 pixels; another requires the bounding box to cover less than 1% of the original image. Small object detection is more difficult than ordinary object detection because of lower image coverage, fewer appearance cues, and the under-representation of small objects in large-scale datasets. The figure above illustrates small object detection versus general object detection. The figure below shows the relevant classic networks as of 2020, where average precision at IoU 0.5:0.95 means AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05; average precision at IoU 0.5 corresponds to AP with IoU = 0.5, and average precision at IoU 0.75 corresponds to AP with IoU = 0.75. Object size is measured by area: small objects (less than $32^2$ pixels), medium objects (from $32^2$ to $96^2$ pixels), and large objects (greater than $96^2$ pixels). In the table below, the authors also define a metric called the degree of reduction (DOR) to illustrate the large performance gap between large and small object detection. It can be seen that average precision (AP) is much lower for small objects than for medium or large objects. Almost all general-purpose object detectors trained on such datasets perform poorly on small objects, because medium and large objects far outnumber small ones.
[Table: AP of classic detection networks as of 2020 on small, medium, and large objects, with the degree of reduction (DOR)]
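To make the size conventions concrete, the sketch below classifies a COCO-style bounding box into the small/medium/large buckets by area. The thresholds follow the $32^2$/$96^2$ convention quoted above; the helper function itself is illustrative and not from any particular library.

```python
# Minimal sketch: COCO-style object-size bucketing by bounding-box area.
# The 32^2 / 96^2 thresholds follow the convention quoted above; the
# function is illustrative, not part of any particular library.

def size_category(width: float, height: float) -> str:
    """Classify a box as 'small', 'medium', or 'large' by its area."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"

# AP at IoU 0.5:0.95 averages AP over ten IoU thresholds (step 0.05).
iou_thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95

print(size_category(20, 25))   # 'small'  (area 500 < 1024)
print(size_category(50, 60))   # 'medium' (1024 <= 3000 <= 9216)
```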
With the development of deep learning-based object detection, many new detection networks for small objects have been proposed. In this paper, small object detection methods are divided into four pillars. The division follows popular object detection frameworks, as defined in mmdetection, which split a detector into several modules, such as Backbone, Neck, AnchorHead, RoIExtractor, and RoIHead. The first two pillars, multi-scale representation and contextual information, belong to the neck component, which refines or reconfigures the original feature maps generated by the backbone. The region proposal pillar is mainly related to the AnchorHead component. Super-resolution does not correspond to any of these components; instead, it adds two branch networks, a generator and a discriminator, on top of a baseline detector. Considering that it has become an independent research direction for small object detection, it is also described as a pillar.

2. The details of the article:

Multi-scale representation: On the one hand, detailed information in shallow conv layers is necessary for object localization. On the other hand, semantic information in deep conv layers greatly facilitates object classification. Due to the tiny size and low resolution of small objects, location details are gradually lost in high-level feature maps. Most general-purpose detectors take only the output of the last layer for the detection task, which contains rich semantic information but lacks detailed information. Multi-scale representation is a strategy that combines the detailed location information in low-level feature maps with the rich semantic information in high-level feature maps.
Contextual information: Exploiting the relationship between real-world objects and the environments they coexist with, contextual information is another approach to improving small object detection accuracy. Medium and large objects provide sufficient RoI features in common detectors, but few RoI features can be extracted from small objects, so additional contextual information is needed to complement the original RoI features.
Super-resolution: As mentioned above, fine details are critical for localizing object instances. Super-resolution techniques attempt to restore or reconstruct a low-resolution image at a higher resolution, which means more details of small objects can be recovered. For example, GANs are built around a generator network and a discriminator network; in the adversarial process, the generator's ability to produce realistic images and the discriminator's ability to distinguish real from fake images improve together.
Region proposal: Region proposal is a strategy aimed at designing anchor boxes better suited to small objects. The anchor boxes of current mainstream detectors are tuned mainly for ordinary objects, which means that the size, shape, and number of anchor boxes used in ordinary detectors do not match small objects well. If these anchor parameters are applied directly to small objects, the extra noise leads to huge computational cost and reduced detection accuracy.
Frameworks for small object detection fall mainly into two types. One uses handcrafted features and shallow classifiers to detect objects such as obstacles or traffic signs on the road; because the feature extraction is weak, performance is usually poor. The other uses DCNNs to extract image features and then modifies a mainstream general object detection network to achieve a good compromise between accuracy and computational cost. To significantly improve on traditional small object detection, various new methods have been proposed. In this paper, research on small object detection is divided into five categories: multi-scale representation, contextual information, super-resolution, region proposal, and other methods. The top-performing networks in each category are described in detail, while similar networks are described briefly to clarify each category.

(1) Multi-scale representation

The weak feature representation of small objects is the main reason for poor detection performance. After repeated downsampling in convolution and pooling layers, few small object features survive in the final feature map. Furthermore, as the number of network layers increases, the inherent hierarchical structure produces feature maps with different spatial resolutions. Specifically, deeper layers offer larger receptive fields, stronger semantics, and higher robustness to deformation, overlap, and illumination changes, but the resolution of their feature maps is lower and more detail is lost. In contrast, shallow layers have smaller receptive fields and higher resolution but lack semantic information.
1) Multi-feature map fusion: Some popular object detectors, such as R-CNN, Fast R-CNN, Faster R-CNN, and YOLO, use only the feature maps of the last layer to locate objects and predict confidence scores, as shown in (a) below. Due to the lack of detailed information, these models often fail to detect small objects. SSD introduces pyramid-level features, drawing on feature maps from bottom to top network layers as shown in (b) below, thus improving small object detection. However, considering features at all levels may generate a lot of unnecessary representation noise and high computational complexity. To simplify the network and improve detection efficiency, some researchers employ deconvolution layers and select only the few important feature maps that contain the most detailed and semantic information.
MDSSD: A deconvolution fusion block is proposed that adopts skip connections to fuse more contextual features. In this model, three high-level semantic feature maps of different scales (conv8_2, conv9_2, and conv10_2 of the SSD layers) are first passed through deconvolution layers and then added element-wise to three shallow layers (conv3_3, conv4_3, and conv7). Note that the deconvolution layers up-sample the high-level feature maps to the same resolution as the corresponding low-level layers. SSD is the backbone of the whole model, and the fusion is done in the fusion block. The basic idea is shown in (c) below.
[Figure: (a) detection on the last feature map only; (b) pyramid-level features as in SSD; (c) deconvolution fusion block as in MDSSD]
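A minimal sketch of this style of fusion, assuming a deep map at half the spatial resolution of a shallow map: the deep features are up-sampled with a transposed convolution and added element-wise to the shallow features. Layer names and channel counts are illustrative, not MDSSD's exact configuration.

```python
# Minimal sketch of deconvolution-based fusion (MDSSD-style), assuming the
# deep map has half the spatial resolution of the shallow map. Channel
# counts are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn

class DeconvFusion(nn.Module):
    def __init__(self, deep_ch: int, shallow_ch: int):
        super().__init__()
        # Transposed conv up-samples the deep map 2x and matches channels.
        self.deconv = nn.ConvTranspose2d(deep_ch, shallow_ch,
                                         kernel_size=2, stride=2)
        self.smooth = nn.Conv2d(shallow_ch, shallow_ch, 3, padding=1)

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor):
        up = self.deconv(deep)              # e.g. 10x10 -> 20x20
        fused = shallow + up                # element-wise addition
        return torch.relu(self.smooth(fused))

fuse = DeconvFusion(deep_ch=512, shallow_ch=256)
deep = torch.randn(1, 512, 10, 10)     # high-level, low-resolution map
shallow = torch.randn(1, 256, 20, 20)  # low-level, high-resolution map
print(fuse(deep, shallow).shape)       # torch.Size([1, 256, 20, 20])
```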
DR-CNN: Different from the element-wise addition strategy adopted by MDSSD, the deconvolutional region-based convolutional neural network (DR-CNN) adopts a concatenation strategy to fuse multi-scale feature maps for small traffic sign detection. DR-CNN selects conv3, conv4, and conv5 from VGG16 to form a fused feature map for the subsequent RPN and detection. After each deconvolution module, an L2 normalization layer is used to ensure that the features are concatenated at the same scale. Another innovation of this network concerns the loss function. Hard negative samples are of great benefit in the training phase, but the ordinary cross-entropy loss has difficulty distinguishing easy positive samples from hard negative samples. Therefore, to fully exploit hard negative samples for better performance, the common cross-entropy loss is replaced with a new two-stage classification adaptive loss function in the RPN and the fully connected network.
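Below is a hedged sketch of the concatenation-with-L2-normalization idea: each (already size-matched) map is L2-normalized along the channel dimension with a learnable per-channel scale, in the style of ParseNet-like normalization, before concatenation, so that feature magnitudes from different layers stay comparable. The module, shapes, and initial scale value are illustrative, not DR-CNN's exact configuration.

```python
# Sketch: L2-normalize each (already size-matched) feature map along the
# channel axis, rescale with a learnable per-channel weight, then
# concatenate. Shapes and the initial scale value are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class L2NormScale(nn.Module):
    def __init__(self, channels: int, init_scale: float = 20.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((channels,), init_scale))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, p=2, dim=1)               # unit L2 norm per pixel
        return x * self.scale.view(1, -1, 1, 1)      # learnable rescaling

norms = [L2NormScale(256), L2NormScale(512), L2NormScale(512)]
maps = [torch.randn(1, c, 40, 40) for c in (256, 512, 512)]  # conv3/4/5-like
fused = torch.cat([n(m) for n, m in zip(norms, maps)], dim=1)
print(fused.shape)  # torch.Size([1, 1280, 40, 40])
```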
MR-CNN: A multi-scale region-based convolutional neural network (MR-CNN) is proposed for small traffic sign recognition, where multi-scale deconvolution operations up-sample features from deeper convolutional layers, which are then directly concatenated with shallower layers to construct a fused feature map. The fused feature maps generate fewer region proposals while achieving higher recall. Test results show that the method effectively enhances the feature representation and improves small traffic sign detection.
Other briefly introduced methods: fused multi-scale feature maps are used to locate objects while deep-layer information performs classification; a backward feature enhancement network (BFEN) transfers more semantic information from higher layers to lower layers; fine-grained features are fed into a spatial layout preserving network (SLPN), which preserves the spatial information of the RoI pooling layer and achieves better localization accuracy; the feature maps of the third, fourth, and fifth convolutional layers are extracted and combined into a one-dimensional vector for classification and localization; anchor sizes are optimized and multi-level feature maps fused for road litter detection; and a feature fusion mechanism inspired by the Inception module takes YOLOv3 as its basic framework, using multi-scale convolution kernels to form receptive fields of different sizes and thus make full use of low-level information.

2) Connection methods for different feature maps:

Although many multi-scale representation methods have been proposed to improve small object detection, there is little work on how exactly to fuse high-level and low-level feature maps. CADNet: A channel-aware deconvolutional network (CADNet) studies the relationship between feature maps in different channels of deep layers to avoid naive superposition of feature maps. By exploiting the correlation between features at different scales, the recall of small objects can be improved at lower computational cost. As shown in the figure below, the framework consists of three steps: a scale transfer layer, a convolutional layer, and an element-wise summation module. In particular, the scale transfer layer rearranges every four channels into four pixels at the same 2D location, recovering location details and increasing the resolution of the feature maps. More semantic information is then extracted through a convolutional layer with a kernel size of 4 × 4, and the result is combined with the previous layer by element-wise summation. The fused layer therefore contains both low-level detail and high-level semantics. In general, multi-feature map fusion helps capture detailed information and rich semantic information, facilitating object localization and classification, respectively. However, many multi-scale representation methods increase the computational burden while improving detection performance, and redundant fusion designs may introduce background noise, degrading performance.
[Figure: CADNet framework with scale transfer layer, convolutional layer, and element-wise summation module]
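The scale transfer operation described above behaves like a depth-to-space (pixel shuffle) rearrangement; a minimal sketch using PyTorch's built-in PixelShuffle follows. Treating CADNet's scale transfer layer as equivalent to PixelShuffle is an assumption based on the description above, not the paper's code.

```python
# Sketch of a scale-transfer step as a depth-to-space rearrangement:
# every group of 4 channels becomes a 2x2 spatial block, quartering the
# channel count while doubling height and width.
import torch
import torch.nn as nn

scale_transfer = nn.PixelShuffle(upscale_factor=2)

deep = torch.randn(1, 256, 10, 10)   # deep, low-resolution feature map
out = scale_transfer(deep)
print(out.shape)                     # torch.Size([1, 64, 20, 20])
```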
(2) Contextual information

Since small objects occupy only a small part of the image, the information directly obtained from fine-grained local regions is very limited. Generic object detectors typically ignore contextual features outside these local regions. Every object exists in a specific environment or coexists with other objects, so detection methods based on contextual information have been proposed to exploit the relationship between small objects and other objects or the background. The area surrounding a small object can provide useful contextual information for detecting it, and detection accuracy can be significantly improved by adding dedicated context modules. Several important network models using contextual information are described in detail next.
ContextNet: Augmented R-CNN can be considered the first detector focused on small object detection. In this work, a novel Region Proposal Network (RPN) is proposed to encode contextual information around small object proposals. First, to match the size of small objects, the RPN anchor sizes are scaled from the original $128^2$, $256^2$, and $512^2$ pixels down to $16^2$, $40^2$, and $100^2$ pixels, and small object proposals are extracted from the conv4_3 feature map instead of VGG16's conv5_3. Second, the ContextNet module, which consists of three subnets, is designed to obtain contextual information around proposals, as shown below. The two identical front-end subnetworks consist of several convolutional layers and one fully connected layer; the back-end subnet consists of two fully connected layers. The proposal region extracted by the improved RPN and a larger context region sharing the same center point are passed to the two front-end networks, respectively. The two 4096-D feature vectors from the front-end networks are then concatenated before being fed into the back-end network. Experimental results show that this augmented R-CNN model improves small object detection mAP by 29.8% over the original R-CNN.
[Figure: ContextNet module with two front-end subnets (proposal region and context region) and one back-end subnet]
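A hedged sketch of the "larger context region with the same center" idea: given a proposal box, enlarge it about its center by a fixed factor and clip it to the image. The enlargement factor is an illustrative choice, not a value from the paper.

```python
# Sketch: build a context box sharing the proposal's center, enlarged by
# `factor`, clipped to image bounds. The factor is illustrative.
def context_box(box, img_w, img_h, factor=2.0):
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2          # shared center point
    w, h = (x2 - x1) * factor, (y2 - y1) * factor  # enlarged extents
    return (max(0.0, cx - w / 2), max(0.0, cy - h / 2),
            min(float(img_w), cx + w / 2), min(float(img_h), cy + h / 2))

proposal = (100, 80, 140, 120)               # a 40x40 small-object proposal
print(context_box(proposal, img_w=640, img_h=480))  # (80.0, 60.0, 160.0, 140.0)
```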
Inside–Outside Net: Spatial recurrent neural networks (RNNs) are employed in the Inside–Outside Net (ION) to gather contextual information outside the target region, while skip pooling obtains multi-level feature maps inside the region. Two stacked four-directional spatial RNN units move across the image, and the model concatenates multi-scale and contextual information for detection. In ION, the contextual feature maps are generated by an IRNN module on top of the network. Notably, the IRNN is an RNN composed of ReLUs with recurrent weights initialized to the identity. Four copies of the conv5 layer of the original VGG16 are passed through 1×1 convolutional layers as input to the four directional RNNs (left-to-right, right-to-left, top-to-bottom, bottom-to-top); the outputs of all directions are then concatenated as input to the next IRNN unit, finally yielding the contextual features.
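A minimal sketch of one left-to-right IRNN sweep under the identity-recurrence simplification $h_j = \mathrm{ReLU}(h_{j-1} + x_j)$; a full module would run four directions and concatenate the results. This is a toy illustration, not ION's implementation.

```python
# Toy sketch of a left-to-right IRNN sweep over a feature map, using the
# identity-initialized recurrence h_j = ReLU(h_{j-1} + x_j). A full ION
# module runs four directions and concatenates the outputs.
import torch

def irnn_left_to_right(x: torch.Tensor) -> torch.Tensor:
    """x: (C, H, W) feature map; returns hidden states of the same shape."""
    C, H, W = x.shape
    h = torch.zeros(C, H)
    out = torch.empty_like(x)
    for j in range(W):                       # sweep columns left to right
        h = torch.relu(h + x[:, :, j])       # identity recurrent weights
        out[:, :, j] = h
    return out

feat = torch.randn(64, 16, 16)
ctx = irnn_left_to_right(feat)
print(ctx.shape)  # torch.Size([64, 16, 16])
```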
VSSA-NET: A multi-resolution feature fusion network is designed for traffic sign detection, using deconvolution layers with skip connections and a vertical spatial sequence attention module. The network has two stages. The first is a multi-scale feature extraction module, which forms multi-resolution feature maps through MobileNet and deconvolution layers. The second builds the vertical spatial sequence attention module: to take full advantage of contextual information, each column of the three feature maps is treated as a spatial sequence, and the traditional LSTM-based encoder-decoder model is improved by introducing an attention mechanism in the decoding stage, which can encode contextual features while suppressing noise.
MFFD: As detection accuracy increases, deeper detection networks mean higher computational costs. A lightweight modular network model called the Modular Feature Fusion Detector (MFFD) not only performs well on small object detection but can also be embedded in resource-limited devices such as advanced driver assistance systems (ADASs). Two novel modules are designed in this network: the front-end module uses small convolution kernels in its convolutional layers to reduce information loss, while the smaller module changes the number of input channels through a point-wise convolution (1×1) before entering its convolutional layer. The advantage is that the network gathers multi-scale contextual information from the stacked modules rather than directly from a single layer, enabling efficient computation.
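The point-wise trick is sketched below: a 1×1 convolution shrinks the channel count before a 3×3 convolution, cutting multiply-adds. Channel counts are illustrative, not MFFD's configuration.

```python
# Sketch: a 1x1 point-wise convolution reduces channels before the 3x3
# convolution, which is the main cost saver. Channel counts illustrative.
import torch
import torch.nn as nn

bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),             # squeeze: 256 -> 64
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # cheap 3x3 on 64 channels
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 256, 38, 38)
print(bottleneck(x).shape)  # torch.Size([1, 128, 38, 38])
# Direct 3x3 on 256 channels would cost 256*128*9 weights; the squeeze
# path costs 256*64 + 64*128*9, roughly a 3x reduction.
```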
Other briefly introduced methods: concatenation or element-wise summation modules in a multi-level feature fusion module introduce contextual information into SSD, and a special context layer is designed in CSSD to integrate multi-scale context information; this context layer employs dilated convolution and deconvolution to extract background information from multi-scale feature maps. A memory network is introduced to store semantic information and preserve the conditional distribution of previous detections; the memory-boosted score is added to the Faster R-CNN score, which is then optimized to complete region classification. PCNN consists of three blocks, where global features are obtained from an SE module and part features are extracted by a part localization network (PLN); a second classification network (PCN) then concatenates local features and global image features into a joint feature for final classification. The TL-SSD network uses Inception-style modules to connect receptive fields of different sizes; feature stitching combines shallow and deep feature layers, where the shallow layers provide accurate location and state information and the deep layers decide whether an object is a traffic light. Multi-level contextual information obtained through pyramid pooling is used to construct context-aware features, and a context fusion module adds contextual information at multiple scales to the feature map. Context-aware RoI pooling is also designed to avoid damaging the structure of small objects while preserving context, with a scale-dense convolutional neural network applied to the vehicle detection scene. Leng et al. combined a UV-disparity algorithm with Faster R-CNN to fuse internal and contextual information.
Like multi-scale representation, contextual information aims to provide more information to the final detection network. The difference is that contextual information is mainly gathered from around the region of interest, improving object classification by learning the relationship between objects and their surroundings. Redundant contextual information can therefore likewise introduce information noise.

(3) Super-resolution

Super-resolution methods aim to recover high-resolution images from corresponding low-resolution ones. High-resolution images provide finer details of the original scene and are well suited to small object detection. GAN-based algorithms have been proposed for reconstructing high-resolution images. Generative adversarial networks, which have made great progress in image super-resolution, consist of two sub-networks: a generator and a discriminator. The generator produces super-resolution images to fool the discriminator, which tries to distinguish real images from fake images generated by the generator. The common form of GAN-based methods is shown below.
[Figure: general form of GAN-based super-resolution detection methods, with generator and discriminator networks]
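A compact sketch of the adversarial objective described above: one alternating update of discriminator and generator with standard non-saturating losses. The tiny networks are stand-in placeholders, and all names are illustrative rather than any paper's architecture.

```python
# Sketch of one adversarial update for an SR GAN, assuming a generator G
# mapping low-res crops to high-res and a discriminator D scoring realness.
# The tiny conv stacks below are placeholders, not any paper's networks.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                  nn.Upsample(scale_factor=4), nn.Conv2d(32, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

low_res = torch.randn(4, 3, 16, 16)    # small-object crops
high_res = torch.randn(4, 3, 64, 64)   # matching real high-res crops
ones, zeros = torch.ones(4, 1), torch.zeros(4, 1)

# Discriminator step: real -> 1, generated -> 0.
fake = G(low_res)
loss_d = bce(D(high_res), ones) + bce(D(fake.detach()), zeros)
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make D label the fake images as real.
loss_g = bce(D(fake), ones)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```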

Perceptual GAN: This was the first use of a GAN for the small object detection task. A new conditional generator is introduced that takes low-level features as input to obtain more details for super-resolved representations. The generator includes multiple residual blocks to learn the residual representation between small objects and similar large objects. The discriminator consists of two branches: an adversarial branch and a perception branch. The adversarial branch distinguishes the generated super-resolved representations of small objects from those of similar large objects, while the regular object detection task is done in the perception branch. The discriminator is trained to minimize its loss, while the generator is trained to maximize the probability of the discriminator making a wrong judgment.
GAN with refinement module: However, the high-resolution images generated by GANs are still not sharp enough, so a refinement module is added to recover details for small face detection. First, MB-FCN is chosen as the baseline detector to generate face and non-face regions, which are passed to the generator and discriminator, respectively. Second, low-resolution faces go through an up-sampling module and the refinement module to obtain clear super-resolved regions. Third, non-face regions serve as negative data to train the discriminator, which performs two tasks simultaneously: distinguishing super-resolved regions from real high-resolution regions, and distinguishing face regions from non-face regions.
SOD-MTGAN: A novel multi-task generative adversarial network (MTGAN). In MTGAN, super-resolution images are produced by a generator network, and a multi-task discriminator network is introduced to distinguish real high-resolution images from fake ones while also predicting object categories and refining bounding boxes. More importantly, the classification and regression losses are back-propagated to further guide the generator toward super-resolution images that are easier to classify and localize. The generator loss in MTGAN includes an adversarial loss, a pixel-wise MSE loss, a classification loss, and a bounding-box regression loss, so that the reconstructed image resembles a real high-resolution image containing high-frequency details. Compared with previous GANs, adding the classification and regression losses of the generated super-resolution images to the generator loss ensures that the recovered images are more realistic than those of previous methods.
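A hedged sketch of the multi-task generator objective just described: adversarial, pixel-wise MSE, classification, and box-regression terms combined with weights. The weight values and stand-in tensors are illustrative, not the paper's settings.

```python
# Sketch: compose an MTGAN-style generator objective from four terms.
# The weights and the stand-in tensors are illustrative only.
import torch
import torch.nn.functional as F

def generator_loss(d_logits_fake, sr_img, hr_img,
                   cls_logits, cls_target, box_pred, box_target,
                   w_adv=1e-3, w_cls=1.0, w_box=1.0):
    adv = F.binary_cross_entropy_with_logits(
        d_logits_fake, torch.ones_like(d_logits_fake))  # fool discriminator
    mse = F.mse_loss(sr_img, hr_img)                    # pixel-wise fidelity
    cls = F.cross_entropy(cls_logits, cls_target)       # category loss
    box = F.smooth_l1_loss(box_pred, box_target)        # bbox regression
    return mse + w_adv * adv + w_cls * cls + w_box * box

loss = generator_loss(torch.randn(4, 1), torch.rand(4, 3, 64, 64),
                      torch.rand(4, 3, 64, 64), torch.randn(4, 10),
                      torch.randint(0, 10, (4,)), torch.randn(4, 4),
                      torch.randn(4, 4))
print(loss.item())
```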
JCS Network: Focused on small pedestrian detection, the JCS network consists of a classification sub-network and a super-resolution sub-network, which are integrated into a unified network by combining the classification loss and the super-resolution loss. A residual structure similar to VDSR is adopted in the super-resolution sub-network to explore the relationship between large-scale and small-scale pedestrians, thereby recovering the details of small-scale pedestrians. The reconstructed small-scale pedestrian therefore contains both the original information of the small-scale pedestrian and the output of the super-resolution sub-network. In the training phase, multi-layer channel features (MCF) based on HOG+LUV and the JCS network are used to train the detector, and multi-scale representation is combined with MCF to enhance detection ability.

(4) Region proposal

Before the advent of deep learning techniques, the dominant implementation of region proposals was the selective search algorithm, whose computational efficiency is very limited. Faster R-CNN first introduced the RPN to identify regions of interest; R-FCN was then proposed to generate k×k×(C+1) position-sensitive score maps instead of a single feature map, with one group of maps responsible for each class. However, because the anchor boxes are large, small objects are still difficult to localize accurately.
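To illustrate why anchor scale matters, the sketch below generates anchor boxes at one feature-map cell for two scale sets: Faster R-CNN-style defaults (128, 256, 512) and a small-object-friendly set (16, 40, 100) echoing ContextNet above. The scale values come from the survey text; the helper itself is illustrative.

```python
# Sketch: centered anchor boxes for one feature-map cell. The two scale
# sets contrast Faster R-CNN defaults with small-object-friendly sizes
# (the 16/40/100 values echo ContextNet, described earlier).
def make_anchors(cx, cy, scales, ratios=(0.5, 1.0, 2.0)):
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s * r ** 0.5, s / r ** 0.5   # keep area ~ s^2
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

generic = make_anchors(300, 300, scales=(128, 256, 512))
small = make_anchors(300, 300, scales=(16, 40, 100))
print(len(generic), len(small))   # 9 9
print(small[3])                   # a 40-scale anchor around (300, 300)
```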
Based on FastMask, AttentionMask generates object proposals tailored to small objects. An additional, finer scale (S8) is added to the feature scale space early in the base network. In particular, to reduce the number of sampling windows, a scale-specific objectness attention mechanism (SOAM) selects the most promising windows on the feature map of each scale. All scales are jointly ranked by their attention values, and only the most promising windows are sampled and processed, saving the memory and GPU resources needed to add the small-object scale (S8). Separately, anchor boxes at more precise locations often have lower confidence and are thus more likely to be rejected by NMS post-processing. Smoothed NMS (SNMS) is designed to retain these anchor boxes and employs IoU predictions to provide additional classification evidence. Furthermore, the input image is cyclically shifted by several pixels in four directions to avoid losing small objects located in the gaps between neighboring anchor boxes.
Since the parameters in an RPN are determined by prior knowledge, the trained RPN model often suffers from a mismatch problem. A reinforced RPN (SRPN) is therefore designed by augmenting these parameters, and particle swarm optimization and bacterial foraging optimization are introduced to find optimal parameter values, yielding high-quality proposals. Oversampling images containing small objects and small object augmentation have also been introduced to make the model pay more attention to small objects. Here, small object augmentation means copy-pasting small object regions multiple times within an image, ensuring pasted objects do not overlap existing ones; this increases the number of positively matched anchors and of region proposals containing small objects. Results on MS COCO show that 3× oversampling combined with the copy-paste strategy achieves the largest gain, with relative improvements of 9.7% on small object segmentation and 7.1% on small object detection compared to the original Mask R-CNN.
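A hedged sketch of the copy-paste augmentation: copies of a small-object patch are placed at random positions that do not overlap any existing ground-truth box, using a simple rectangle-intersection test. Sizes, copy count, and retry limit are illustrative.

```python
# Sketch of copy-paste augmentation for small objects: paste copies of a
# small-object patch at random positions that do not overlap any existing
# ground-truth box. Sizes and the retry limit are illustrative.
import random

def overlaps(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def paste_copies(img_w, img_h, obj_box, gt_boxes, n_copies=2, tries=50):
    w, h = obj_box[2] - obj_box[0], obj_box[3] - obj_box[1]
    placed, boxes = [], list(gt_boxes)
    for _ in range(n_copies):
        for _ in range(tries):
            x = random.uniform(0, img_w - w)
            y = random.uniform(0, img_h - h)
            cand = (x, y, x + w, y + h)
            if not any(overlaps(cand, b) for b in boxes):
                placed.append(cand)     # object pixels would be blitted here
                boxes.append(cand)
                break
    return placed

new_boxes = paste_copies(640, 480, obj_box=(10, 10, 40, 40),
                         gt_boxes=[(100, 100, 160, 160)])
print(new_boxes)
```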
Processing background regions in neural networks consumes a lot of time and memory. A cascaded mask generation framework is proposed to strike a balance between computational speed and accuracy: the original image is first resized to multiple scales; each scale then produces region proposals and masks through a mask generation module (MGM) inspired by RoI convolution; finally, the feature maps of all scales are concatenated for RoI extraction and detection. In another method, after training an SSD model, the detections on the input image are divided into ambiguous and salient object samples according to their confidence. Salient object samples carry enough detail for identification, while ambiguous samples (mainly distant small objects) are confirmed by SSD re-detection, object size checking, duplicate removal, and out-of-range object removal; this method also applies to other detection models without architectural modification. A further region proposal approach crops regions of the original image containing at least one object and up-scales them to the network input size, making the original small objects appear more like large objects, which are easier for an ordinary SSD detector to detect.
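A minimal sketch of the crop-and-zoom step described last: a region known to contain objects is cropped and resized to the detector input size, so small objects occupy more pixels. The region coordinates and sizes are illustrative.

```python
# Sketch: crop a region that contains objects and up-scale it to the
# detector input size so small objects cover more pixels. Values are
# illustrative; bilinear interpolation does the up-scaling.
import torch
import torch.nn.functional as F

def crop_and_zoom(image: torch.Tensor, region, input_size=(300, 300)):
    """image: (C, H, W); region: (x1, y1, x2, y2) containing objects."""
    x1, y1, x2, y2 = region
    crop = image[:, y1:y2, x1:x2]          # e.g. a 60x60 crop
    return F.interpolate(crop.unsqueeze(0), size=input_size,
                         mode="bilinear", align_corners=False).squeeze(0)

frame = torch.rand(3, 480, 640)            # stand-in for a real frame
zoomed = crop_and_zoom(frame, (100, 100, 160, 160))
print(zoomed.shape)                        # torch.Size([3, 300, 300])
```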

(5) Other structures

Based on Faster R-CNN, KB-RANN focuses on traffic sign detection, where a pre-trained SqueezeNet generates feature maps and an RNN architecture with an attention mechanism (LSTM) searches for contextual information. In another work, since the original region proposals from Faster R-CNN are too large for traffic signs, the pool4 layer of VGG-16 is shrunk and ResNet is adopted to extract features for small signs; this is combined with Online Hard Example Mining (OHEM) to make the network more robust.


Origin blog.csdn.net/qq_52302919/article/details/123933399