[Computer Vision] Research progress on small target detection: research ideas explained in detail

1. Research ideas on small target detection

1.1 Data augmentation

Data augmentation is the simplest and most effective way to improve small target detection performance. Different augmentation strategies expand the training set and enrich its diversity, thereby enhancing the robustness and generalization ability of the detection model. In a relatively early study, Yaeger et al. [20] significantly improved handwriting recognition accuracy with augmentations such as distortion, rotation, and scaling. Strategies such as elastic deformation [21], random cropping [22], and translation [23] were derived later, and these augmentation strategies are now widely used in target detection.

In recent years, convolutional neural networks based on deep learning have achieved great success in computer vision tasks. This success is largely due to the size and quality of datasets: large-scale, high-quality data can greatly improve the generalization ability of a model. Data augmentation strategies are widely used in target detection, such as the horizontal flipping used in Fast R-CNN [24] and Cascade R-CNN [25], and the exposure and saturation adjustments used in YOLO [26] and YOLO9000 [27]. CutOut [28], MixUp [29], and CutMix [30] are also commonly used. More recently, strategies such as mosaic augmentation (YOLOv4 [31]) and information-preserving augmentation (KeepAugment [32]) have been proposed, but these strategies mainly target conventional target detection.
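For concreteness, here is a minimal NumPy sketch of MixUp [29]; the helper name and the choice alpha=1.5 are our own illustrative assumptions, not from the cited paper.

```python
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha=1.5):
    """Blend two training images and their one-hot labels (MixUp [29]).

    The same coefficient lam, drawn from a Beta(alpha, alpha) distribution,
    mixes both the pixels and the labels.
    """
    lam = np.random.beta(alpha, alpha)
    img = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    label = lam * label_a + (1.0 - lam) * label_b
    return img, label
```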

Focusing on small target detection, small targets face challenges such as low resolution, few extractable features, scarce samples, and uneven distribution, so data augmentation becomes especially important. In recent years, several augmentation methods tailored to small targets have emerged (Table 1); a sketch of the copy-paste idea follows the table. Yu et al. [17] proposed a scale-matching strategy that crops according to target size to reduce the gap between targets of different sizes, avoiding the loss of small target information that conventional scaling operations tend to cause. Kisantal et al. [33] addressed the small area covered by small targets, the lack of diversity in their locations, and the fact that the intersection over union between predicted boxes and ground-truth boxes is far smaller than the expected threshold: by copying and pasting small targets multiple times within an image, they increase the number of small target training samples and thus improve small target detection performance. Building on this, Chen et al. [34] proposed an adaptive resampling augmentation strategy in RRNet, which uses a pre-trained semantic segmentation network to paste target images with contextual awareness, eliminating the background and scale mismatches that naive copying can introduce and achieving a better augmentation effect. Chen et al. [35] started from the small proportion and low information content of small targets: during training, images are scaled and stitched so that large targets in the dataset become medium-sized and medium targets become small, improving the quantity and quality of medium and small targets while keeping computational cost in mind. Beyond augmentation strategies designed around small target characteristics, Zoph et al. [36] went past the limits of hand-designed transformations and used reinforcement learning to adaptively select the best augmentation strategy for small target detection, achieving some performance gains.

Table 1 Data augmentation methods for small target detection
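The copy-paste idea of Kisantal et al. [33] is simple enough to sketch directly. The following is a simplified illustration, not the authors' implementation: it assumes pixel boxes in (x1, y1, x2, y2) format, pastes without blending, and skips the overlap checks a real implementation would need.

```python
import numpy as np

def copy_paste_small_objects(image, boxes, size_thresh=32, n_copies=2, rng=None):
    """Simplified copy-paste augmentation for small targets (after [33]).

    image: HxWx3 uint8 array; boxes: list of (x1, y1, x2, y2) pixel boxes.
    Every target smaller than size_thresh on both sides is pasted n_copies
    times at random locations (naive paste: no blending, no overlap checks).
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    out_boxes = list(boxes)
    for (x1, y1, x2, y2) in boxes:
        bw, bh = x2 - x1, y2 - y1
        if not (0 < bw < size_thresh and 0 < bh < size_thresh):
            continue  # only small targets are duplicated
        patch = image[y1:y2, x1:x2].copy()
        for _ in range(n_copies):
            nx = int(rng.integers(0, w - bw))
            ny = int(rng.integers(0, h - bh))
            image[ny:ny + bh, nx:nx + bw] = patch
            out_boxes.append((nx, ny, nx + bw, ny + bh))
    return image, out_boxes
```
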
Although data augmentation alleviates the scarcity of small target information and the lack of appearance and texture features to some extent, effectively improving the generalization ability of the network and yielding good detection results, it also increases computational cost. Moreover, in practical applications augmentation must be designed around the characteristics of the targets: a poorly designed strategy may introduce new noise and harm feature extraction, which poses its own challenge for algorithm design.

1.2 Multi-scale learning

Compared with conventional targets, small targets have fewer available pixels, making it difficult to extract good features; moreover, as the number of network layers increases, the feature and position information of small targets is gradually lost, making them hard for the network to detect. Small targets therefore require both deep semantic information and shallow representational information. Multi-scale learning, which combines the two, is an effective strategy for improving small target detection performance.

Early multi-scale detection followed two ideas. The first uses convolution kernels of different sizes to capture information at different scales through different receptive fields; however, this is computationally expensive and the range of receptive fields is limited. After Simonyan and Zisserman [13] demonstrated the advantages of replacing large convolution kernels with stacks of small ones, kernels of different sizes were gradually abandoned. Later, the dilated (atrous) convolution proposed by Yu et al. [37] and the deformable convolution proposed by Dai et al. [38] opened up new ways of obtaining multi-scale information through different receptive field sizes. The second idea comes from image processing: the image pyramid [39], which detects targets of different scales by feeding in images at different resolutions. This approach was used in early target detection [40-41] (see Figure 2(a)). However, training a convolutional neural network on image pyramids places extreme demands on computing power and memory, so image pyramids are now rarely used in practice; only methods such as [42-43] use them to handle excessive scale differences within a dataset.
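As a sketch of the receptive-field idea behind dilated convolution [37], the following PyTorch module runs parallel 3x3 convolutions with different dilation rates; the module name, channel widths, and rates are illustrative choices, not taken from the cited papers.

```python
import torch
import torch.nn as nn

class MultiDilationBlock(nn.Module):
    """Parallel 3x3 convolutions with different dilation rates.

    Each branch keeps the 3x3 kernel cost but sees a different effective
    receptive field (3, 7 and 15 pixels for dilations 1, 3 and 7), in the
    spirit of dilated convolution [37].
    """

    def __init__(self, channels, dilations=(1, 3, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations)
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        # Concatenate the multi-receptive-field responses, fuse with a 1x1 conv.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```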

Figure 2 Four ways of multi-scale learning

Most classic detection networks, such as Fast R-CNN [24], Faster R-CNN [44], SPPNet [45], and R-FCN [46], predict only from the last layer of a deep network. However, small targets are difficult to detect in deep feature maps because their spatial and detailed feature information has been lost. In a deep network, shallow layers have smaller receptive fields, weaker semantics, and less context, but they retain more spatial and detailed feature information. Starting from this observation, Liu et al. [47] proposed the multi-scale detector SSD (Single Shot MultiBox Detector), which detects smaller targets on shallower feature maps and larger targets on deeper ones, as shown in Figure 2(b). Cai et al. [48] proposed a unified multi-scale deep convolutional neural network to address the difficulty of matching small, information-poor targets with conventional networks; by using deconvolution layers to increase feature-map resolution, it significantly improves small target detection while reducing memory and computational cost.
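The multi-scale prediction scheme of SSD [47] reduces to attaching a detection head to several feature maps. A minimal sketch follows, with illustrative channel counts and anchor numbers rather than the paper's exact configuration.

```python
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    """SSD-style prediction [47]: one detection head per feature map.

    Shallow, high-resolution maps are responsible for small targets and
    deeper maps for large ones. Channel counts and the number of anchors
    per location are illustrative.
    """

    def __init__(self, in_channels=(128, 256, 512), num_classes=20, anchors=4):
        super().__init__()
        out = anchors * (num_classes + 4)  # class scores + 4 box offsets
        self.heads = nn.ModuleList(
            nn.Conv2d(c, out, 3, padding=1) for c in in_channels)

    def forward(self, feature_maps):
        # One prediction tensor per pyramid level, finest level first.
        return [head(f) for head, f in zip(self.heads, feature_maps)]
```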

To address the susceptibility of small targets to environmental interference, Bell et al. [49] proposed ION (Inside-Outside Network), which crops features of the same region of interest from feature maps at different scales and fuses these multi-scale features for prediction, improving detection performance. In a similar spirit, Kong et al. [50] proposed an effective multi-scale fusion network, HyperNet, which significantly improves recall, and thus small target detection, by integrating shallow high-resolution features, deep semantic features, and intermediate-layer features (see Figure 2(c)). These methods exploit information at different scales effectively and are an effective means of strengthening small target feature representations, but the extensive repeated computation across scales is expensive in both memory and compute.

To save computing resources and achieve better feature fusion, Lin et al. [51] combined the advantages of single feature maps, pyramidal feature hierarchies, and integrated features to propose the Feature Pyramid Network (FPN), currently the most popular multi-scale network. FPN introduces a bottom-up, top-down structure that enhances features by fusing those of adjacent layers (see Figure 2(d)). Building on FPN, Liang et al. [52] proposed a deep feature pyramid network that uses a pyramid structure with lateral connections to strengthen the semantic features of small targets, supplemented by specially designed anchor boxes and loss functions for training. To improve small target detection speed, Cao et al. [53] proposed a multi-level feature fusion algorithm, Feature-Fused SSD, which adds context information to SSD to better balance speed and accuracy for small targets. However, SSD-based feature pyramids must extract feature maps of different scales from different layers for prediction, and they struggle to fully fuse features across scales. To address this, Li and Zhou [54] proposed the Feature Fusion Single Shot Multibox Detector (FSSD), which uses a lightweight fusion module to concatenate and fuse the features of each layer at a larger scale and then builds a feature pyramid on the resulting feature map for detection, improving small target detection at a small cost in speed. To improve the low accuracy of small target recognition in airport video surveillance, Han Songchen et al. [55] proposed a small target detection method for airport pavements that combines multi-scale feature fusion with online hard example mining; it uses ResNet-101 as the feature extraction network and builds a top-down feature fusion module with upsampling on it, generating high-resolution feature maps with richer semantic information.
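The core of FPN [51] is a top-down pathway with 1x1 lateral connections. Below is a minimal PyTorch sketch under assumed backbone channel widths; the class name and sizes are our own, not from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal top-down feature pyramid in the style of FPN [51].

    1x1 lateral convolutions project each backbone stage to a common width;
    each level is then summed with the upsampled, coarser level above it.
    """

    def __init__(self, in_channels=(256, 512, 1024), width=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1) for _ in in_channels)

    def forward(self, feats):  # feats: finest (e.g. C3) to coarsest (e.g. C5)
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down fusion
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]
```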

Recently, multi-scale feature fusion methods have been extended further. Nayan et al. [56] proposed a new real-time detection algorithm to counter the loss of small target feature information across deep networks; it uses upsampling and skip connections to extract multi-scale features at different network depths during training, significantly improving both the accuracy and the speed of small target detection. To reduce the computational cost of high-resolution images, Liu et al. [57] proposed a high-resolution detection network that processes high-resolution images with a shallow network and low-resolution images with a deep network, preserving as much small target position information as possible while extracting richer semantics, improving small target detection at lower computational cost. Deng et al. [58] found that although multi-scale fusion effectively improves small target detection, feature coupling across scales still hurts performance, so they proposed an extended feature pyramid network with an additional high-resolution pyramid level dedicated to small target detection.

In general, multi-scale feature fusion combines shallow representational information with deep semantic information, which benefits small target feature extraction and can effectively improve detection performance. However, existing multi-scale learning methods raise detection performance at the price of extra computation, and interference noise is hard to avoid during fusion; these problems make it difficult to push multi-scale small target detection much further.

1.3 Contextual learning

In the real world, "target and scene" and "target and target" usually co-occur, and exploiting these relationships can help improve small target detection. Before deep learning, research [59] had already shown that properly modeling context improves detection performance, especially for small targets with unclear appearance features. With the widespread application of deep neural networks, some studies have tried to integrate the context around a target into the network, with some success. The following briefly reviews the research status and trends from two directions: detection based on implicit context feature learning, and detection based on explicit context reasoning.

(1) Target detection based on implicit context feature learning. Implicit context features are background features around the target region or global scene features. In fact, the convolution operation in a convolutional neural network already captures the implicit context around a target region to some degree. To exploit the context around a target, Li et al. [60] proposed a detection method based on multi-scale context feature enhancement: it first generates a series of candidate regions in the image, then generates context windows of different scales around each target, and finally uses the features in these windows to enhance the target's feature representation (see Figure 3(a)). Subsequently, Zeng et al. [61] proposed a gated bidirectional convolutional neural network that also generates support regions containing context at different scales from candidate regions; the difference is that this network lets information flow between support regions generated at different scales and resolutions, so that optimal features are learned jointly. To better detect tiny faces in complex environments, Tang et al. [62] proposed a context-based single-stage face detector that designs new context anchor boxes to extract facial features while taking surrounding context, such as head and body information, into account. Zheng Chenbin et al. [63] proposed an enhanced context model network that uses a dual dilated convolution structure to save parameters; it strengthens shallow context information by enlarging the effective receptive field, does little damage to the original detection network, and can act flexibly on the network's shallow prediction layers. However, most of these methods depend on the design of the context window or are limited by receptive field size, which may lose important contextual information.
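The context-window idea above can be sketched as pooling features from progressively enlarged boxes and concatenating them. The scale factors and function name below are illustrative assumptions, not from [60].

```python
import torch
from torchvision.ops import roi_align

def context_enhanced_features(fmap, boxes, scales=(1.0, 1.5, 2.0), out_size=7):
    """Pool a target box together with progressively larger context windows.

    fmap: (1, C, H, W) feature map; boxes: (N, 4) float tensor of
    (x1, y1, x2, y2) in feature-map coordinates. Returns a tensor of shape
    (N, C * len(scales), out_size, out_size): the target feature concatenated
    with the features of its surrounding context windows.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    pooled = []
    for s in scales:
        window = torch.stack([cx - s * w / 2, cy - s * h / 2,
                              cx + s * w / 2, cy + s * h / 2], dim=1)
        pooled.append(roi_align(fmap, [window], out_size))
    return torch.cat(pooled, dim=1)
```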

Figure 3 The exploration of context in target detection

To make full use of context, some methods integrate global context into the detection model (see Figure 3(b)). For early detection algorithms, a common way of integrating global context was a statistical summary of the elements making up the scene, such as Gist [64]. Torralba et al. [65] modeled visual context by computing statistical correlations between low-level global scene features and target feature descriptors. Felzenszwalb et al. [66] then proposed a detection method based on a hybrid multi-scale deformable part model, which improves the reliability of results by re-scoring detections with context. Current deep-learning detectors perceive global context in three ways: larger receptive fields, global pooling of convolutional features, or treating the global context as sequence information. Bell et al. [49] proposed a recurrent-network-based context transfer method that encodes the context of the entire image from four directions and concatenates the four resulting feature maps to achieve global context awareness; however, this complicates the model and makes training heavily dependent on parameter initialization. Ouyang et al. [67] improved detection by learning image-level classification scores and using them as supplementary context features. To improve the feature representation of candidate regions, Chen et al. [68] proposed a context refinement network that first finds context regions related to the target region by computing similarity and then uses their features to enhance the target region's features. Barnea et al. [69] cast the use of context as an optimization problem, asked to what extent context or other additional information can improve detection scores, and showed that simple co-occurrence relations are the most effective contextual information. Chen et al. [70] proposed a hierarchical context embedding framework that can be used as a plug-and-play component to strengthen candidate-region features by mining context clues, improving final detection performance. Recently, Zhang Ruiyan et al. [71] proposed a global-context detection model for optical remote sensing targets that combines global context features with local features of target center points to generate high-resolution heat maps, using global features for target classification. Some methods also exploit global context through semantic segmentation. He et al. [72] proposed a unified instance segmentation framework that optimizes the detector with pixel-level supervision, jointly optimizing detection and instance segmentation in a multi-task manner. Although segmentation can significantly improve detection, pixel-level annotation is expensive; Zhao et al. [73] therefore proposed generating pseudo-segmentation labels to optimize the detector, with good results. Furthermore, Zhang et al. [74] proposed an unsupervised segmentation method that enhances the feature maps used for detection by jointly optimizing detection and segmentation without pixel-level annotation. Global-context methods have made great progress in target detection, but how to find, within the global scene, the contextual information that actually benefits small target detection remains a research difficulty.
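Of the three routes to global context mentioned above, global pooling is the simplest to sketch: append a globally pooled scene descriptor to every candidate-region feature. The layer sizes below are illustrative assumptions, not from any cited paper.

```python
import torch
import torch.nn as nn

class GlobalContextHead(nn.Module):
    """Append a globally pooled scene descriptor to every ROI feature.

    roi_feats: (N, roi_dim) per-candidate features; fmap: (1, C, H, W)
    whole-image convolutional features. Sizes are illustrative.
    """

    def __init__(self, roi_dim=1024, fmap_channels=512, ctx_dim=256, num_classes=21):
        super().__init__()
        self.ctx = nn.Linear(fmap_channels, ctx_dim)
        self.cls = nn.Linear(roi_dim + ctx_dim, num_classes)

    def forward(self, roi_feats, fmap):
        g = fmap.mean(dim=(2, 3))                 # global average pooling -> (1, C)
        g = self.ctx(g).expand(roi_feats.size(0), -1)
        return self.cls(torch.cat([roi_feats, g], dim=1))
```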

(2) Target detection based on explicit context reasoning. Explicit context reasoning uses clearly identified contextual information in the scene to help infer the location or category of a target, for example using the relationship between a sky region and a target to infer the target's category. Contextual relationships usually refer to the constraints and dependencies between targets and the scene, or among targets within a scene (see Figure 3(c)). To exploit such relationships, Chen et al. [75] proposed adaptive context modeling with iterative refinement, improving classification and detection by using the output of one task as context for the other. Gupta et al. [76] then proposed a spatial-context-based detection method that accurately captures the spatial relationship between the context and the target of interest and effectively uses the appearance features of context regions. Liu et al. [77] proposed a structure inference network that improves detection by fully considering the relationships between scene context and targets. To exploit prior knowledge, Xu et al. [78] proposed Reasoning-RCNN, built on Faster R-CNN [44], which encodes contextual relationships in a knowledge graph and lets prior relations influence detection. Chen et al. [79] proposed a spatial memory network: spatial memory essentially reassembles target instances into a pseudo-image representation that is fed back into a convolutional neural network for relational reasoning, forming a sequential reasoning architecture. On the basis of the attention mechanism, Hu et al. [80] proposed a lightweight object relation network that models relations between objects by introducing their appearance and geometric relationships as constraints; the network needs no extra supervision, is easy to embed in existing networks, and can effectively filter redundant boxes, improving detection performance.

Context learning methods have developed further in recent years. Lim et al. [81] proposed using context to connect multi-scale features: features at different depths of the network serve as context, and an attention mechanism focuses on the target in the image, making full use of target context and improving small target detection accuracy in real scenes. For indoor small-scale crowd detection, where target features overlap with background features and boundaries are hard to distinguish, Shen et al. [82] proposed an indoor crowd detection framework whose feature aggregation module (FAM) aggregates contextual feature information through fusion and decomposition operations, providing more detailed information and significantly improving detection of small-scale indoor crowds. Fu et al. [83] proposed a novel context reasoning method that models and infers the intrinsic semantic and spatial layout relationships between targets, extracting semantic features of small targets while preserving their spatial information as much as possible, effectively reducing false and missed detections. To improve target classification results, Pato et al. [84] proposed context-based rescoring of detections: a recurrent neural network with self-attention passes information between candidate regions to build a context representation, which is then used to re-evaluate the detections.

Context learning makes full use of target-related information in the image and can effectively improve small target detection. However, existing methods neither account for the possible absence of contextual information in a scene nor deliberately use easily detected objects in the scene to assist small target detection. Future research could therefore proceed from two directions: (1) build a context memory model based on category-semantic pooling, using the context of historical memories to offset missing context in the current image; (2) apply graph reasoning to small target detection, combining graph models with detection models to improve small target performance in a targeted way.

1.4 Generative adversarial learning

Generative adversarial learning aims to map the features of low-resolution small targets to features equivalent to those of high-resolution targets, achieving the same detection performance as for larger targets. Although the data augmentation, feature fusion, and context learning methods above can effectively improve small target detection, the gains they bring are often limited by computational cost. To address the low resolution of small targets, Haris et al. [85] proposed jointly training super-resolution and detection models end to end, which improves detection of low-resolution targets to some extent; however, the method demands much of the training data, and its improvement on small targets is insufficient.

Currently, an effective approach is to combine detection with a generative adversarial network (GAN) [86] to raise the resolution of small targets, narrowing the feature gap between small and medium/large targets and strengthening small target feature representations, thereby improving small target detection. After Radford et al. [87] proposed DCGAN (deep convolutional GAN), many computer vision tasks began using generative adversarial models to tackle task-specific problems. For insufficient training samples, Sixt et al. [88] proposed RenderGAN, which uses adversarial learning to generate additional images for data augmentation. To make detection models more robust, Wang et al. [89] improved performance on hard targets by automatically generating samples containing occlusion and deformation. Subsequently, Li et al. [90] proposed a perceptual GAN specifically for small target detection: the generator converts small target representations into super-resolved representations similar enough to those of real large targets, while the discriminator competes with the generator, distinguishing generated representations from real ones and imposing conditional requirements on the generator. By pitting generator against discriminator, the method lifts small target representations to "super-resolution", giving them characteristics similar to large targets and achieving better small target detection performance.
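The generator/discriminator interplay for feature-level super-resolution can be reduced to a few lines. The sketch below keeps only the adversarial skeleton of the perceptual GAN idea [90]; the real method adds detection losses and conditioning, and every architecture and name here is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Generator refines small-target features; discriminator tries to tell them
# apart from real large-target features. Architectures are deliberately tiny.
G = nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(256, 256, 3, padding=1))
D = nn.Sequential(nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

def adversarial_step(small_feat, large_feat, opt_g, opt_d):
    """One adversarial update on feature maps of shape (N, 256, H, W)."""
    fake = small_feat + G(small_feat)  # residual "super-resolved" features
    # Discriminator: real large-target features -> 1, generated ones -> 0.
    d_loss = (bce(D(large_feat), torch.ones(large_feat.size(0), 1)) +
              bce(D(fake.detach()), torch.zeros(fake.size(0), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator into scoring generated features as real.
    g_loss = bce(D(fake), torch.ones(fake.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```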

In recent years, GAN-based super-resolution reconstruction of small targets has developed further. Bai et al. [91] proposed a multi-task generative adversarial network (MTGAN) for small targets. In MTGAN, the generator is a super-resolution network that upsamples small blurred images into fine ones and restores detailed information for more accurate detection, while the discriminator is a multi-task network that distinguishes real images from super-resolved ones and outputs class scores and bounding-box regression offsets. To let the generator recover more detail, the classification and regression losses of the discriminator are back-propagated into the generator during training. Because MTGAN can recover clear super-resolved targets from blurred small ones, it greatly improves small target detection. In view of the lack of direct supervision in existing super-resolution models for small target detection, Noh et al. [92] proposed a feature-level super-resolution method that uses dilated convolutions to keep the generated high-resolution features at the same receptive field size as the low-resolution features from the feature extractor, avoiding the erroneous super-resolved features that a receptive field mismatch would produce. In addition, Deng et al. [58] designed an extended feature pyramid network whose feature texture module generates an ultra-high-resolution pyramid level, enriching small target feature information.

Detection algorithms based on generative adversarial models can significantly improve performance by enhancing small target feature information. Moreover, super-resolving small targets with a generative adversarial model requires no special structural design, and existing adversarial models combine easily with existing detectors. Two unavoidable problems remain, however: (1) GANs are hard to train, and a good balance between generator and discriminator is difficult to reach; (2) the diversity of samples the generator produces during training is limited, so performance gains saturate after a certain level of training.

1.5 Anchor-free mechanisms

The anchor box mechanism plays an important role in target detection, and many advanced detectors are built on it, but anchor box design is extremely unfriendly to small targets. Existing designs struggle to balance small target recall against computational cost, and they create an extreme imbalance between small target positive samples and large target positive samples, making the model attend more to large targets and neglect small ones. In the extreme case, if the designed anchor boxes are much larger than a small target, the small target gets no positive samples at all, and the algorithm learns a detector suited only to larger targets. In addition, anchor boxes introduce many hyperparameters, such as their number, aspect ratios, and sizes, making the network hard to train and small target performance hard to improve. In recent years, anchor-free methods have become a research hotspot and have achieved good results in small target detection.

One way to discard anchor boxes is to recast detection as keypoint estimation, i.e., keypoint-based target detection. Keypoint-based methods fall into two main categories: corner-based and center-based. Corner-based detectors predict bounding boxes by grouping corner points learned from convolutional feature maps. DeNet [93] formulates detection as estimating the probability distributions of a target's four corners: top-left, top-right, bottom-left, and bottom-right (see Figure 4(a)). It first trains a convolutional neural network on annotated data and uses the network to predict the corner distributions; the corner distributions and a naive Bayes classifier then determine whether the candidate region defined by each corner contains a target. After DeNet, Wang et al. [94] proposed representing targets through links between corner points and center points, named PLN (Point Linking Network). PLN first regresses four corners, as in DeNet, plus the target's center, predicts with a fully convolutional network whether keypoints are linked, and then combines corners with their linked centers to generate bounding boxes. PLN performs well on dense targets and targets with extreme aspect ratios; however, when there are no target pixels around a corner, the limited receptive field makes the corner hard to detect. Following PLN, Law et al. [95] proposed a new corner-based detector, CornerNet, which converts detection into corner detection: it first predicts the top-left and bottom-right corners of all targets, matches the corners into pairs, and uses the paired corners to generate bounding boxes. An improved version, CornerNet-Lite [96], reduces both the number of pixels processed and the computation per pixel, addressing two key use cases: improving efficiency without sacrificing accuracy, and improving accuracy under real-time constraints. Compared with anchor-based detectors, the CornerNet series has a simpler framework and improves both efficiency and accuracy, but it still predicts many incorrect bounding boxes due to incorrect corner matching.

Figure 4 Four forms of anchor-free mechanism

To further improve detection performance, Duan et al. [97] proposed a center-prediction-based framework called CenterNet (see Figure 4(b)). CenterNet first predicts the top-left and bottom-right corners together with a center keypoint, determines bounding boxes by corner matching, and finally uses the predicted centers to eliminate the incorrect boxes that corner mismatching produces. Similarly, Zhou et al. [98] proposed ExtremeNet, a bottom-up detector that matches extreme points with center points: a standard keypoint estimation network predicts the topmost, bottommost, leftmost, and rightmost extreme points plus the center, and groups the five points into a bounding box when they are geometrically aligned. However, keypoint-based networks such as ExtremeNet and CornerNet require a keypoint grouping stage, which slows the whole algorithm down. To address this, Zhou et al. [99] modeled each target as a single point, the center of its bounding box, removing keypoint grouping and other post-processing; keypoint estimation finds the center, and all other object properties, such as size and position, are regressed from it. This method strikes a good balance between detection accuracy and speed.
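Decoding in the "target as a point" formulation [99] is notably simple: find local peaks in the center heatmap and read off the regressed sizes. A sketch with illustrative shapes follows; a full implementation would also apply the predicted sub-pixel center offsets, which are omitted here.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, k=100):
    """Decode center-point predictions into boxes (sketch after [99]).

    heatmap: (1, C, H, W) per-class center scores after a sigmoid;
    wh: (1, 2, H, W) predicted box width/height at every location.
    A 3x3 max-pool acts as the NMS substitute: only local peaks survive.
    """
    peaks = (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)) * heatmap
    scores, idx = peaks.view(-1).topk(k)
    _, h, w = heatmap.shape[1:]
    ys = (idx % (h * w)) // w
    xs = idx % w
    bw, bh = wh[0, 0, ys, xs], wh[0, 1, ys, xs]
    boxes = torch.stack([xs - bw / 2, ys - bh / 2,
                         xs + bw / 2, ys + bh / 2], dim=1)
    return boxes, scores, idx // (h * w)  # boxes, confidences, class indices
```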

Keypoint-based detection methods have since been extended. Yang et al. [100] proposed RepPoints (representative points), a finer-grained representation that delineates targets more precisely; the method automatically learns the target's spatial information and local semantic features, improving small target detection accuracy to some extent (see Figure 4(c)). Kong et al. [101], inspired by the fovea of the human eye (the central region of the retina, where most cones are concentrated and which is responsible for high-acuity vision), proposed FoveaBox, which directly predicts the probability that a target exists along with its bounding-box coordinates: it first predicts the probability of target existence to produce a category-sensitive semantic map, and then generates a class-agnostic bounding box for each location that may contain a target. Freed from the constraints of anchor boxes, FoveaBox is robust and generalizes well to small targets and targets with arbitrary aspect ratios, with greatly improved detection accuracy. Similarly, Tian et al. [102] applied the idea of semantic segmentation to detection and proposed FCOS (fully convolutional one-stage detection), which avoids the many hyperparameters and training difficulties of anchor-based methods (see Figure 4(d)); experiments also show that replacing the first stage of a two-stage detector with FCOS effectively improves detection performance. Zhu et al. [103] then used the anchor-free mechanism to improve feature assignment in the feature pyramid, selecting features for each target by its semantic information rather than by anchor boxes, improving both the accuracy and the speed of small target detection. Zhang et al. [104] started from the essential difference between anchor-based and anchor-free mechanisms, namely how positive and negative samples are defined during training, and proposed an adaptive training sample selection strategy that chooses positives and negatives automatically from the statistical characteristics of each object. For small ships that are hard to detect in complex scenes, Fu et al. [105] proposed a feature balancing and refinement network: it adopts a general anchor-free strategy of directly learning encoded bounding boxes, eliminating the negative effect of anchor boxes on detection, and uses a semantic-information-based attention mechanism to balance multiple feature levels, achieving state-of-the-art performance. To handle multi-scale detection more effectively within the anchor-free framework, Yang et al. [106] proposed a feature pyramid network with a dedicated attention mechanism that generates feature pyramids suited to targets of different sizes, better handling multi-scale detection and significantly improving small target detection.
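As an illustration of anchor-free training targets, the sketch below computes FCOS-style [102] per-location (l, t, r, b) regression targets and the centerness weight for a single ground-truth box; the function and variable names are our own.

```python
import torch

def fcos_targets(locations, box):
    """FCOS-style [102] per-location regression targets for one ground-truth box.

    locations: (N, 2) tensor of (x, y) grid points mapped into image space;
    box: (4,) tensor (x1, y1, x2, y2). Points inside the box are positives;
    centerness down-weights points far from the box center.
    """
    l = locations[:, 0] - box[0]
    t = locations[:, 1] - box[1]
    r = box[2] - locations[:, 0]
    b = box[3] - locations[:, 1]
    ltrb = torch.stack([l, t, r, b], dim=1)
    inside = ltrb.min(dim=1).values > 0
    centerness = torch.sqrt(
        (torch.minimum(l, r) / torch.maximum(l, r).clamp(min=1e-6)).clamp(min=0) *
        (torch.minimum(t, b) / torch.maximum(t, b).clamp(min=1e-6)).clamp(min=0))
    return ltrb, inside, centerness
```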

1.6 Other optimization strategies

In small target detection, beyond the major categories summarized above, there are many other excellent methods. For the scarcity of small target training samples, Kisantal et al. [33] proposed an oversampling strategy that improves small target detection by increasing the contribution of small targets to the loss function. Besides increasing the weight of small target samples, another idea is to increase the number of anchor boxes devoted to small targets. Zhang et al. [107] proposed a dense anchor box strategy that improves small target recall by placing multiple anchor boxes at the center of one receptive field. Similarly, Zhang et al. [108] defined anchor box scales from the effective receptive field at equally proportioned intervals and proposed a scale-compensating anchor matching strategy to raise the recall of small face targets. Increasing the number of anchor boxes is very effective for small target accuracy but adds huge extra computation. Eggert et al. [109] instead optimized anchor box scales: by deriving the relationship between small target sizes, they selected appropriate anchor scales for small targets and obtained better results in trademark detection. Later, Wang et al. [110] proposed a guided anchoring strategy based on semantic features that improves small target detection by jointly predicting likely target center locations together with target scales and aspect ratios; the strategy can be integrated into any anchor-based method. However, these improvements do not fundamentally resolve the tension between detection accuracy and computational cost.

In recent years, with growing computational resources, more and more networks use the cascade idea to balance the miss rate against the false detection rate. Cascading is a long-standing idea [111] widely used in target detection; it follows a coarse-to-fine philosophy: cheap computation filters out most easy background windows, and more complex processing handles the harder ones. In the deep learning era, Cai et al. [25] proposed the classic Cascade R-CNN, which progressively refines predictions by cascading several detection networks trained with increasing IoU thresholds. Li et al. [112] later extended Cascade R-CNN to further improve small target detection. Inspired by the cascade idea, Liu et al. [113] proposed a progressive localization strategy that improves pedestrian detection accuracy by continually raising the IoU threshold. In addition, [114-116] show cascade networks applied to hard target detection, which also improves small target detection to some extent.

Another idea is staged detection, balancing missed and false detections through cooperation between stages. Chen et al. [117] proposed a dual detector in which a first-scale detector detects small targets as far as possible and a second-scale detector handles objects the first cannot identify. Drenkow et al. [118] designed a more efficient method that first scans the whole scene at low resolution and then uses the saliency map from that stage to guide subsequent detection at high resolution, trading off detection accuracy and speed well. In addition, [119-121] segment foreground from background for hard target recognition in aerial images, separating important from unimportant regions, which improves detection performance while reducing computational cost.

Optimizing the loss function is also an effective way to improve small target detection. Redmon et al. [26] found that during training small targets are more susceptible to random error; they later addressed this [27] with a loss function that weights targets by size, improving small target detection. Lin et al. [122] proposed the focal loss in RetinaNet, effectively resolving the foreground-background class imbalance that arises during training. Zhang et al. [123] combined the cascade idea with focal loss in Cascade RetinaNet, further improving small target accuracy. For the foreground-background imbalance to which small targets are prone, Deng et al. [58] proposed a loss that accounts for foreground-background balance, improving the feature quality of both through a global reconstruction loss and a positive-sample patch loss, further improving small target detection.
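The focal loss [122] itself is compact; the sketch below follows the commonly used sigmoid formulation with the paper's default alpha=0.25 and gamma=2.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss (Lin et al. [122]).

    Down-weights easy examples by (1 - p_t)^gamma so that the huge number of
    easy background locations no longer dominates the gradient.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```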

To trade off small target accuracy and speed, Sun et al. [124] proposed a weakly supervised segmentation network with multiple receptive fields that focuses on small targets: multiple receptive field blocks attend to targets and their adjacent background, and weights are set at different spatial positions to make features more discriminative. In addition, Yoo et al. [125] reformulated multi-target detection as density estimation over bounding boxes and proposed a mixture-density object detector; this reformulation avoids the tedious matching of ground-truth and predicted boxes and heuristic anchor box design, and also mitigates the foreground-background imbalance to some extent.

2. References

  1. YAEGER L,LYON R,WEBB B.Effective training of a neural network character classifier for word recognition[J].Advances in Neural Information Processing Systems,1996,9: 807‑816. [百度学术]
  2. SIMARD P Y, STEINKRAUS D, PLATT J C. Best practices for convolutional neural networks applied to visual document analysis[C]//Proceedings of ICDAR. [S.l.]: IEEE, 2003, 3(2003). [百度学术]
  3. KRIZHEVSKY A, SUTSKEVER I, HINTON G E.Imagenet classification with deep convolutional neural networks[J].Advances in Neural Information Processing Systems,2012,25: 1097‑1105. [百度学术]
  4. WAN L, ZEILER M, ZHANG S, et al. Regularization of neural networks using dropconnect[C]//Proceedings of International Conference on Machine Learning. [S.l.]: PMLR, 2013: 1058‑1066. [百度学术]
  5. GIRSHICK R. Fast R‑CNN[C]// Proceedings of the IEEE International Conference on Computer Vision. New York: IEEE, 2015: 1440‑1448. [百度学术]
  6. CAI Z, VASCONCELOS N. Cascade R‑CNN: Delving into high quality object detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 6154‑6162. [百度学术]
  7. REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: Unified, real‑time object detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 779‑788. [百度学术]
  8. REDMON J, FARHADI A. YOLO9000: Better, faster, stronger[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 7263‑7271. [百度学术]
  9. DEVRIES T,TAYLOR G W.Improved regularization of convolutional neural networks with cutout[EB/OL].(2017‑08‑15)[2017‑11‑29].https://arxiv.org/abs/1708.04552. [百度学术]
  10. ZHANG H,CISSE M,DAUPHIN YN,et al.Mixup: Beyond empirical risk minimization[EB/OL].(2017‑10‑25)[2018‑04‑27].https://arxiv.org/abs/1710.09412 . [Baidu Academic]
  11. YUN S, HAN D, OH S J, et al. Cutmix: Regularization strategy to train strong classifiers with localizable features[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. New York: IEEE, 2019: 6023‑6032. [百度学术]
  12. BOCHKOVSKIY A,WANG CY,LIAO HY M.Yolov4: Optimal speed and accuracy of object detection[EB/OL].(2020‑04‑23)[2020‑04‑23].https://arxiv.org/abs/ 2004.10934. [Baidu Academic]
  13. GONG C,WANG D,LI M,et al.KeepAugment: A simple information‑preserving data augmentation approach[EB/OL].(2020‑11‑23)[2020‑11‑23].https://arxiv.org /abs/2011.11778. [Baidu Academic]
  14. KISANTAL M,WOJNA Z,MURAWSKI J,et al. Augmentation for small object detection[EB/OL].(2019‑02‑19)[2019‑02‑19]. https://arxiv.org/abs/1902.07296. [Baidu Academic]
  15. CHEN C, ZHANG Y, LV Q, et al. RRNet: A hybrid detector for object detection in drone‑captured images[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. Los Alamitos: IEEE, 2019: 100‑108. [百度学术]
  16. CHEN Y,ZHANG P,LI Z,et al.Stitcher: Feedback‑driven data provider for object detection[EB/OL].(2020‑04‑26)[2021‑03‑14]. https://arxiv.org /abs/2004.12432. [Baidu Academic]
  17. ZOPH B, CUBUK E D, GHIASI G, et al. Learning data augmentation strategies for object detection[C]//Proceedings of European Conference on Computer Vision. Cham: Springer, 2020: 566‑583. [百度学术]
  18. YU F,KOLTUN V.Multi‑scale context aggregation by dilated convolutions[EB/OL].(2015‑11‑23)[2016‑04‑30].https://arxiv.org/abs/1511.07122. [Baidu Academic ]
  19. DAI J, QI H, XIONG Y, et al.Deformable convolutional networks[C]// Proceedings of the IEEE International Conference on Computer Vision. New York: IEEE, 2017: 764‑773. [百度学术]
  20. ADELSON EH,ANDERSON CH,BERGEN JR,et al.Pyramid methods in image processing[J].RCA Engineer,1984,29(6): 33‑41. [Baidu Academic]
  21. LOWE D G.Distinctive image features from scale‑invariant keypoints[J].International Journal of Computer Vision,2004,60(2): 91‑110. [百度学术]
  22. DALAL N, TRIGGS B. Histograms of oriented gradients for human detection[C]//Proceedings of IEEE Computer Society Conference on Computer Vision & Pattern Recognition. [S.l.]: IEEE, 2005. [百度学术]
  23. SINGH B, DAVIS L S. An analysis of scale invariance in object detection snip[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 3578‑3587. [百度学术]
  24. SINGH B,NAJIBI M,DAVIS L S.Sniper: Efficient multi‑scale training[EB/OL].(2018‑05‑23)[2018‑12‑13].https://arxiv.org/abs/1805.09300. [Baidu Academic]
  25. REN S,HE K,GIRSHICK R,et al.Faster R‑CNN: Towards real‑time object detection with region proposal networks[EB/OL].(2015‑06‑04)[2016‑01-06].https://arxiv.org/abs/1506.01497. [百度学术]
  26. HE K,ZHANG X,REN S,et al.Spatial pyramid pooling in deep convolutional networks for visual recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2015,37(9): 1904-1916. [百度学术]
  27. DAI J, LI Y, HE K, et al.R-FCN: Object detection via region-based fully convolutional networks[EB/OL].(2016-05-20)[2016-06-21].https://arxiv.org/abs/1605.06409. [百度学术]
  28. LIU W, ANGUELOV D, ERHAN D, et al. SSD: Single shot multibox detector[C]//Proceedings of European Conference on Computer Vision. Cham: Springer, 2016: 21-37. [Baidu Academic]
  29. CAI Z, FAN Q, FERIS R S, et al. A unified multi-scale deep convolutional neural network for fast object detection[C]//Proceedings of European Conference on Computer Vision. Cham: Springer, 2016: 354-370. [百度学术]
  30. BELL S, ZITNICK C L, BALA K, et al. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 2874-2883. [百度学术]
  31. KONG T, YAO A, CHEN Y, et al. Hypernet: Towards accurate region proposal generation and joint object detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 845-853. [百度学术]
  32. LIN T Y, DOLLÁR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 2117-2125. [百度学术]
  33. LIANG Z, SHAO J, ZHANG D, et al. Small object detection using deep feature pyramid networks[C]//Proceedings of Pacific Rim Conference on Multimedia. Cham: Springer, 2018: 554-564. [百度学术]
  34. CAO G, XIE X, YANG W, et al. Feature-fused SSD: Fast detection for small objects[C]//Proceedings of Ninth International Conference on Graphic and Image Processing (ICGIP 2017). Bellingham: SPIE-int SOC Optical Engineering, 2018: 106151E. [百度学术]
  35. LI Z,ZHOU F.FSSD: Feature fusion single shot multibox detector[EB/OL].(2017-12-04)[2018-05-17].https://arxiv.org/abs/1712.00960. [Baidu Academic ]
  36. Han Songchen, Zhang Bihao, Li Wei, et al. Small target object detection algorithm in airport scenes based on improved Faster-RCNN [J]. Journal of Nanjing University of Aeronautics and Astronautics, 2019, 51(6): 735-741. [Baidu Academic]
  37. NAYAN AA,SAHA J,MOZUMDER AN,et al.Real time detection of small objects[EB/OL].(2020-03-17)[2020-04-14].https://arxiv.org/abs/2003.07442 . [Baidu Academic]
  38. LIU Z,GAO G,SUN L,et al.HRDNet: High-resolution detection network for small objects[EB/OL].(2020-06-13)[2020-06-13].https://arxiv.org/abs/2006.07607. [百度学术]
  39. DENG C,WANG M,LIU L,et al.Extended feature pyramid network for small object detection[EB/OL].(2020-05-16)[2020-04-09].https://arxiv.org/abs /2003.07021. [Baidu Academic]
  40. OLIV A,TORRALBA A.The role of context in object recognition[J].Trends in Cognitive Sciences,2007,11(12): 520-527. [Statement sheet]
  41. LI J, WEI Y, LIANG X, et al. Attentive contexts for object detection [J]. IEEE Transactions on Multimedia, 2016, 19(5): 944-954. [Baidu Academic]
  42. ZENG X,OUYANG W,YAN J,et al.Crafting gbd-net for object detection[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,40(9): 2109-2123. [百度学术]
  43. TANG X, DU D K, HE Z, et al. Pyramidbox: A context-assisted single shot face detector[C]// Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer, 2018: 797-813. [百度学术]
  44. ZHENG Chenbin, ZHANG Yong, HU Hang, et al. Object detection enhanced context model[J]. Journal of Zhejiang University (Engineering Science), 2020, 54(3): 529-539.
  45. DIVVALA S K, HOIEM D, HAYS J H, et al. An empirical study of context in object detection[C]// Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2009: 1271-1278.
  46. TORRALBA A, SINHA P. Statistical context priming for object detection[C]// Proceedings of the Eighth IEEE International Conference on Computer Vision. New York: IEEE, 2001: 763-770.
  47. FELZENSZWALB P F, GIRSHICK R B, MCALLESTER D, et al. Object detection with discriminatively trained part-based models[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 32(9): 1627-1645.
  48. OUYANG W, WANG X, ZENG X, et al. DeepID-Net: Deformable deep convolutional neural networks for object detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2015: 2403-2412.
  49. CHEN Z, HUANG S, TAO D. Context refinement for object detection[C]// Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer, 2018: 71-86.
  50. BARNEA E, BEN-SHAHAR O. Exploring the bounds of the utility of context for object detection[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2019: 7412-7420.
  51. CHEN Z M, JIN X, ZHAO B, et al. Hierarchical context embedding for region-based object detection[C]// Proceedings of the European Conference on Computer Vision. Cham: Springer, 2020: 633-648.
  52. ZHANG Ruiyan, JIANG Xiujie, AN Junshe, et al. Design of global-contextual detection model for optical remote sensing targets[J]. Chinese Optics, 2020, 13(73): 138-149.
  53. HE K, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN[C]// Proceedings of the IEEE International Conference on Computer Vision. New York: IEEE, 2017: 2961-2969.
  54. ZHAO X, LIANG S, WEI Y. Pseudo mask augmented object detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 4061-4070.
  55. ZHANG Z, QIAO S, XIE C, et al. Single-shot object detection with enriched semantics[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 5813-5821.
  56. CHEN Q, SONG Z, DONG J, et al. Contextualizing object detection and classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 37(1): 13-27.
  57. GUPTA S, HARIHARAN B, MALIK J. Exploring person context and local scene context for object detection[EB/OL]. (2015-11-25)[2015-11-25]. https://arxiv.org/abs/1511.08177.
  58. LIU Y, WANG R, SHAN S, et al. Structure inference net: Object detection using scene-level context and instance-level relationships[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 6985-6994.
  59. XU H, JIANG C H, LIANG X, et al. Reasoning-RCNN: Unifying adaptive global reasoning into large-scale object detection[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2019: 6419-6428.
  60. CHEN X, GUPTA A. Spatial memory for context reasoning in object detection[C]// Proceedings of the IEEE International Conference on Computer Vision. New York: IEEE, 2017: 4086-4096.
  61. HU H, GU J, ZHANG Z, et al. Relation networks for object detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 3588-3597.
  62. LIM J S, ASTRID M, YOON H J, et al. Small object detection using context and attention[EB/OL]. (2019-12-13)[2019-12-16]. https://arxiv.org/abs/1912.06319.
  63. SHEN W, QIN P, ZENG J. An indoor crowd detection network framework based on feature aggregation module and hybrid attention selection module[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. Los Alamitos: IEEE, 2019: 82-90.
  64. FU K, LI J, MA L, et al. Intrinsic relationship reasoning for small object detection[EB/OL]. (2020-09-02)[2020-09-02]. https://arxiv.org/abs/2009.00833.
  65. PATO L V, NEGRINHO R, AGUIAR P M Q. Seeing without looking: Contextual rescoring of object detections for AP maximization[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2020: 14610-14618.
  66. HARIS M, SHAKHNAROVICH G, UKITA N. Task-driven super resolution: Object detection in low-resolution images[EB/OL]. (2018-03-30)[2018-03-30]. https://arxiv.org/abs/1803.11316.
  67. GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[EB/OL]. (2014-06-10)[2014-06-10]. https://arxiv.org/abs/1406.2661.
  68. RADFORD A, METZ L, CHINTALA S. Unsupervised representation learning with deep convolutional generative adversarial networks[EB/OL]. (2015-11-19)[2016-01-07]. https://arxiv.org/abs/1511.06434.
  69. SIXT L, WILD B, LANDGRAF T. RenderGAN: Generating realistic labeled data[J]. Frontiers in Robotics and AI, 2018, 5: 66.
  70. WANG X, SHRIVASTAVA A, GUPTA A. A-Fast-RCNN: Hard positive generation via adversary for object detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 2606-2615.
  71. LI J, LIANG X, WEI Y, et al. Perceptual generative adversarial networks for small object detection[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 1222-1230.
  72. BAI Y, ZHANG Y, DING M, et al. SOD-MTGAN: Small object detection via multi-task generative adversarial network[C]// Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer, 2018: 206-221.
  73. NOH J, BAE W, LEE W, et al. Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. New York: IEEE, 2019: 9725-9734.
  74. TYCHSEN-SMITH L, PETERSSON L. DeNet: Scalable real-time object detection with directed sparse sampling[C]// Proceedings of the IEEE International Conference on Computer Vision. New York: IEEE, 2017: 428-436.
  75. WANG .
  76. LAW H, DENG J. CornerNet: Detecting objects as paired keypoints[C]// Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer, 2018: 734-750.
  77. LAW H, TENG Y, RUSSAKOVSKY O, et al. CornerNet-Lite: Efficient keypoint based object detection[EB/OL]. https://arxiv.org/abs/1904.08900.
  78. DUAN K, BAI S, XIE L, et al. CenterNet: Keypoint triplets for object detection[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. New York: IEEE, 2019: 6569-6578.
  79. ZHOU X, ZHUO J, KRAHENBUHL P. Bottom-up object detection by grouping extreme and center points[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2019: 850-859.
  80. ZHOU
  81. YANG Z, LIU S, HU H, et al. RepPoints: Point set representation for object detection[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. New York: IEEE, 2019: 9657-9666.
  82. KONG T, SUN F, LIU H, et al. FoveaBox: Beyond anchor-based object detection[J]. IEEE Transactions on Image Processing, 2020, 29: 7389-7398.
  83. TIAN Z, SHEN C, CHEN H, et al. FCOS: Fully convolutional one-stage object detection[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. New York: IEEE, 2019: 9627-9636.
  84. ZHU C, HE Y, SAVVIDES M. Feature selective anchor-free module for single-shot object detection[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2019: 840-849.
  85. ZHANG S, CHI C, YAO Y, et al. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2020: 9759-9768.
  86. FU J, SUN X, WANG Z, et al. An anchor-free method based on feature balancing and refinement network for multiscale ship detection in SAR images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 59(2): 1331-1344.
  87. YAN J, ZHAO L, DIAO W, et al. AF-EMS detector: Improve the multi-scale detection performance of the anchor-free detector[J]. Remote Sensing, 2021, 13(2): 160.
  88. ZHANG S, ZHU X, LEI Z, et al. FaceBoxes: A CPU real-time face detector with high accuracy[C]// Proceedings of the 2017 IEEE International Joint Conference on Biometrics (IJCB). New York: IEEE, 2017: 1-9.
  89. ZHANG S, ZHU X, LEI Z, et al. S3FD: Single shot scale-invariant face detector[C]// Proceedings of the IEEE International Conference on Computer Vision. New York: IEEE, 2017: 192-201.
  90. EGGERT C, ZECHA D, BREHM S, et al. Improving small object proposals for company logo detection[C]// Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. New York: ACM, 2017: 167-174.
  91. WANG J, CHEN K, YANG S, et al. Region proposal by guided anchoring[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2019: 2965-2974.
  92. VIOLA P, JONES M. Rapid object detection using a boosted cascade of simple features[C]// Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001). New York: IEEE, 2001: 1-9.
  93. LI A, YANG .
  94. LIU W, LIAO S, HU W, et al. Learning efficient single-stage pedestrian detectors by asymptotic localization fitting[C]// Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer, 2018: 618-634.
  95. YANG B, YAN J, LEI Z, et al. CRAFT objects from images[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 6043-6051.
  96. YANG F, CHOI W, LIN Y. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2016: 2129-2137.
  97. GAO M, YU R, LI A, et al. Dynamic zoom-in network for fast object detection in large images[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 6926-6935.
  98. CHEN S, LI J, YAO C, et al. DuBox: No-prior box objection detection via residual dual scale detectors[EB/OL]. (2019-04-15)[2019-04-16]. https://arxiv.org/abs/1904.06883.
  99. DRENKOW N, BURLINA P, FENDLEY N, et al. Objectness-guided open set visual search and closed set detection[EB/OL]. (2020-12-11)[2021-04-14]. https://arxiv.org/abs/2012.06509.
  100. YANG F, FAN H, CHU P, et al. Clustered object detection in aerial images[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. New York: IEEE, 2019: 8311-8320.
  101. ZHANG J, HUANG J, CHEN X, et al. How to fully exploit the abilities of aerial image detectors[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. Los Alamitos: IEEE, 2019: 1-8.
  102. LI C, YANG T, ZHU S, et al. Density map guided object detection in aerial images[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Los Alamitos: IEEE, 2020: 190-191.
  103. LIN T Y, GOYAL P, GIRSHICK R, et al. Focal loss for dense object detection[C]// Proceedings of the IEEE International Conference on Computer Vision. New York: IEEE, 2017: 2980-2988.
  104. ZHANG H, CHANG H, MA B, et al. Cascade RetinaNet: Maintaining consistency for single-stage object detection[EB/OL]. (2019-07-16)[2019-07-16]. https://arxiv.org/abs/1907.06881.
  105. SUN S, YIN Y, WANG X, et al. Multiple receptive fields and small-object-focusing weakly-supervised segmentation network for fast object detection[EB/OL]. (2019-04-19)[2019-05-22]. https://arxiv.org/abs/1904.12619.
  106. YOO J, LEE H, CHUNG I, et al. Density-based object detection: Learning bounding boxes without ground truth assignment[EB/OL]. (2019-11-28)[2020-10-04]. https://arxiv.org/abs/1911.12721.
