[Reading notes for target detection papers] Feature-Enhanced CenterNet for Small Object Detection in Remote Sensing Images

Abstract:

        Compared with anchor-based detectors, anchor-free detectors have the advantages of flexibility and lower computational complexity. However, in complex remote sensing scenarios, the constrained geometric size, weak features of targets, and widely distributed environmental elements similar to target features make small target detection a challenging task. To address these issues, we propose an anchor-free detector named FE-CenterNet, which can accurately detect small objects such as vehicles in complex remote sensing scenes. First, we design a Feature Enhancement Module (FEM) consisting of a Feature Aggregation Structure (FAS) and an Attention Generation Structure (AGS). This module suppresses the interference of false alarms in the scene by mining multi-scale contextual information and combining a coordinate attention mechanism, thereby improving the perception of small objects. Meanwhile, to meet the high localization accuracy requirement for small objects, we propose a new loss function that requires no additional computation or time cost during inference. Finally, to verify the performance of the algorithm and provide a basis for subsequent research, we build a dim small vehicle dataset (DSVD) containing various objects and complex scenes. Experimental results show that the proposed method outperforms mainstream object detectors. Specifically, the Average Precision (AP) of our method is 7.2% higher than that of the original CenterNet, at a cost of only 1.3 FPS.


1. Introduction

        Automatic object detection in remote sensing images has drawn increasing attention in commercial and military domains and can be widely used in aerial reconnaissance, traffic monitoring, and area surveillance applications. However, due to the limited resolution and quality of remote sensing images, most objects of interest such as vehicles [1-4] exhibit the following characteristics: small size, dim and fuzzy features, and low contrast [5,6]. In addition, the unique remote sensing imaging system leads to complex scenes and variable target orientations, which brings great difficulties to the detection task. Therefore, it is of great significance to study an effective small target detection method for remote sensing images. In this paper, we define objects with an area smaller than 32×32 pixels as small objects [7], focusing on small vehicle detection in remote sensing images.

        Convolutional Neural Networks (CNNs) [8] can achieve end-to-end detection by adaptively learning representative features without handcrafted features. Typical detection networks can be roughly divided into two categories: anchor-based and anchor-free detectors. Anchor-based detectors, such as Faster-RCNN [9] and YOLO [10], need to fine-tune the anchor parameters according to the aspect ratios of objects in the dataset to achieve promising performance. However, the aspect ratios of objects in different remote sensing scenes are highly diverse, which makes adjusting anchor parameters difficult and time-consuming [11]. By discarding the choice of anchors entirely, anchor-free detectors are independent of anchor hyperparameters, which reduces the computational complexity of the algorithm. In addition, anchor-free detectors rely on keypoints to detect objects from high-resolution feature maps, which makes it easier to capture objects within a small extent.

        CenterNet [12], as a typical representative of anchor-free detectors, directly predicts the center point of the object from the extracted feature map. Compared with other methods, CenterNet's concise object detection framework gives it the potential to strike a balance between detection accuracy and speed. In addition, CenterNet performs prediction on high-resolution features downsampled only four times from the input image, which helps achieve ideal detection performance for small and dense objects. However, due to the diversity and complexity of scenes and the monotonous appearance of small objects, it is difficult to extract robust features for adequate representation. Since the performance of CenterNet relies heavily on the quality of the extracted feature maps, its application in complex remote sensing scenarios is limited.

        In this paper, we propose the Feature Enhancement Center Network (FE-CenterNet) by designing a Feature Enhancement Module (FEM) that helps the network enhance useful features while suppressing unnecessary details. At the same time, we adopt a new loss function in the CenterNet framework to ensure the localization accuracy of small objects. All the above improvements are achieved without many extra parameters or much computational cost. Furthermore, to evaluate the performance of small object detectors in remote sensing images, we construct a dim small vehicle dataset (DSVD) consisting of various objects and complex scenes. Experiments demonstrate that FE-CenterNet has significant advantages in small object detection and achieves state-of-the-art performance on DSVD.

The main contributions of our work are as follows:

• An anchor-free detector with excellent performance for small object detection in complex remote sensing scenes.

• A feature enhancement module, which greatly improves the feature extraction and representation capabilities for small objects by mining multi-scale features and integrating attention mechanisms.

• A dim and blurred small vehicle dataset that helps evaluate the performance of small object detection algorithms.

        The remainder of this paper is organized as follows. After introducing related work on small object detection in Section 2, we elaborate on the proposed FE-CenterNet architecture in Section 3. In Section 4, we briefly introduce the constructed dim small vehicle dataset and describe the experiments performed to compare the performance of the proposed method with typical methods. Finally, we conclude in Section 5.


2. Related Works

        With the rapid development of deep learning technology, remote sensing target detection based on convolutional neural networks (CNNs) has gradually attracted attention. Mainstream methods are divided into anchor-based and anchor-free frameworks. This section introduces the main development trends of the two categories and analyzes existing problems. On this basis, we explain the reason for choosing the anchor-free framework and outline the design of the proposed small object detection method for remote sensing images.


2.1. Anchor-Based Framework for Object Detection

        After 2012, the rise of CNNs drove huge progress in target detection. The problems of poor accuracy and redundant computation associated with handcrafted feature descriptors can be alleviated by automatically mining important features. Anchor-based detectors use hyperparameters of number, size, and aspect ratio to predict and classify object candidates generated from various anchors. According to whether region candidates are generated, the anchor-based framework is divided into two-stage and one-stage detectors. The former uses a Region Proposal Network (RPN) [9] to extract regions of interest (ROI) in the first stage, followed by precise bounding box regression and object classification.

        R-CNN [13], the earliest two-stage detector, first uses a selective search method to generate candidates and then uses a CNN to extract features. The problem is that feature extraction must be performed individually for every candidate, which is an iterative and time-consuming process. To address this issue, Fast-RCNN [14] directly extracts features from the whole image and then maps them to the regions of interest. Meanwhile, in order to reduce the time consumed by traditional region proposal algorithms, Faster-RCNN [9] introduces the RPN to achieve end-to-end object detection based on deep learning, which further simplifies the detection pipeline. Subsequent improvements mainly build on the modules of Faster-RCNN. Mask RCNN [15] utilizes RoIAlign instead of RoIPool to solve the region mismatch problem during feature map mapping. Cascade-RCNN [16] determines positive and negative samples through different IOU (intersection over union) thresholds and cascades multiple networks to optimize prediction results. For small target detection, Wang et al. [17] combined a region-based fully convolutional network (R-FCN) and deformable convolution (DCN) to take full advantage of the limited information available on small vehicles. Zhang et al. [18] utilized K-means to generate hyperparameters for anchors; they introduced an improved VGG16 and Soft-NMS into Faster-RCNN to achieve effective detection performance for small aircraft. The two-stage network establishes a coarse-to-fine object detection pipeline based on the anchor box mechanism, which achieves good detection performance. However, the two-stage process significantly increases computational cost and inference time.

        Compared with two-stage detectors, one-stage detectors treat object detection as a regression problem: they directly regress the extracted features to obtain target category probabilities and position coordinates. Detectors of the YOLO architecture, as mainstream single-stage object detectors, divide the input image into multiple grids of the same size, then classify suspicious objects and regress locations based on grid-centered bounding boxes. YOLOv2 [10] designed Darknet as the feature extraction network and added batch normalization (BN) after all convolutional layers. YOLOv3 [19] further improved on a backbone network named Darknet53, which downsamples feature maps by convolutions instead of pooling. In order to solve the imbalance between positive and negative samples in one-stage networks, RetinaNet [20] proposed focal loss to adjust the weight of indistinguishable samples in the loss function. Many improved versions [3, 21, 22] have since been proposed based on YOLOv3, which apply and combine a large number of advanced detection techniques and gradually achieve an excellent balance between accuracy and speed. For small object detection, Bashir et al. [23] combined a recurrent Generative Adversarial Network (GAN) with the YOLO detection architecture to achieve image super-resolution for small objects. Zhou et al. [24] applied gamma correction in image preprocessing to brighten the shadowed parts of images and proposed a feature fusion structure, IR-PANet, to improve the recognition of small targets. Kim et al. [25] proposed an efficient channel attention pyramid YOLO (ECAP-YOLO), which adds a detection layer dedicated to small objects.

        In summary, the anchor-based method is the earliest major branch of target detection, and its detection performance largely depends on the number of positive and negative samples and on the anchor-box hyperparameters. Due to the unique overhead perspective of remote sensing, the varying orientations of objects require a large number of anchor boxes with different scales and aspect ratios, which significantly increases the computational complexity of the algorithm and limits detection performance and speed in remote sensing scenarios.


2.2. Anchor-Free Framework for Object Detection

        Anchor-free detectors remove the anchor mechanism and apply keypoints to generate candidates. These methods do not need to set anchor hyperparameters, which reduces the computational complexity of the algorithm. Keypoint-based anchor-free frameworks usually detect objects on high-resolution feature maps, so they tend to perceive small-scale objects well. This approach offers higher flexibility and generality for object detection in bird's-eye-view remote sensing images, and has the potential to achieve effective detection of dim and small objects in complex scenes.

        CornerNet [26] achieves object localization and detection by predicting diagonal corners; it utilizes the distance between corner embedding vectors to match corners belonging to the same object. Based on CornerNet, CentripetalNet [27] introduces a centripetal displacement module and cross-star deformable convolution to achieve higher-quality bounding box prediction. ExtremeNet [28] predicts the center keypoint together with the four extreme keypoints of each category and uses brute-force enumeration for matching. FCOS [29] directly predicts the distances from the four sides of the bounding box to the center. Although the above methods remove the anchor box mechanism, they all include complicated post-processing to match keypoints. Compared with other methods, CenterNet [12], as a compact object detection framework, strikes a balance between detection accuracy and speed: it predicts a heatmap of the object center and obtains quantities such as length and width from features around the center. Inspired by the Region Proposal Network (RPN), FII-CenterNet [30] introduces foreground information to reduce the influence of complex scenes and focus on objects of interest. In [31], a probabilistic two-stage detector was constructed based on CenterNet, combining object likelihood and conditional classification scores. In [32], CenterNet++ combines center keypoints with corner keypoints to detect objects as triplets and capture salient information globally. These improved methods are mainly inspired by the two-stage detector and can improve detection accuracy to a certain extent. However, the introduction of foreground information or region proposals breaks the structural simplicity of CenterNet and significantly increases computational complexity and inference time. Therefore, we propose FE-CenterNet to achieve promising detection performance with little increase in detection time. To enhance the perception of small objects, we design a feature enhancement module integrating a coordinate attention mechanism and multi-scale feature extraction. Meanwhile, a loss function integrating robust localization information is introduced during training, which improves the regression accuracy of localization without extra computation at inference.


3. Proposed Method

        The main architecture of FE-CenterNet is shown in Figure 1. Similar to CenterNet, FE-CenterNet utilizes the modified DLA-34 [12] as the backbone to extract multi-level features and obtain feature maps downsampled four times. We then insert a feature enhancement module (FEM) after the backbone network, which improves the representation ability of small object features. This module consists of a Feature Aggregation Structure (FAS) and an Attention Generation Structure (AGS); detailed explanations of these two structures are provided in Section 3.1. FAS combines multi-scale features with the introduction of contextual information to suppress false alarms in complex scenes. In addition, AGS embeds coordinate relationships into the attention mechanism, which strengthens the perception of small objects. During training, we propose a new loss function to accommodate the high requirement on localization accuracy, which is elaborated in Section 3.2. The loss function improves detection performance without additional computation during inference.
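        To make the pipeline concrete, below is a minimal PyTorch sketch of this forward pass. It is not the authors' code: the real backbone is a modified DLA-34, for which a small convolutional stub stands in here, and the FEM placeholder is filled by the Section 3.1 sketches.

```python
import torch
import torch.nn as nn

class BackboneStub(nn.Module):
    """Stand-in for the modified DLA-34: image -> stride-4 feature map."""
    def __init__(self, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, out_ch, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.net(x)

def head(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1),
                         nn.ReLU(inplace=True),
                         nn.Conv2d(in_ch, out_ch, 1))

class FECenterNet(nn.Module):
    def __init__(self, num_classes=1, ch=64, fem=None):
        super().__init__()
        self.backbone = BackboneStub(ch)
        self.fem = fem if fem is not None else nn.Identity()  # FEM, Sec. 3.1
        self.heatmap = head(ch, num_classes)  # center-point heatmap
        self.size = head(ch, 2)               # box width/height
        self.offset = head(ch, 2)             # sub-pixel center offset

    def forward(self, x):
        f = self.fem(self.backbone(x))
        return torch.sigmoid(self.heatmap(f)), self.size(f), self.offset(f)

hm, wh, off = FECenterNet()(torch.randn(1, 3, 512, 512))  # hm: 1x1x128x128
```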


3.1. Feature Enhancement Module

        The detection of small targets in remote sensing images mainly faces two challenges: (1) scenes are complex and contain many false alarms, which interfere with the detection of small targets; (2) the target scale is small and the features are weak, making it difficult to capture effective features. To address these issues, we propose a Feature Enhancement Module (FEM), which consists of a Feature Aggregation Structure (FAS) and an Attention Generation Structure (AGS). This module extracts multi-scale features to aggregate contextual information in images and enhances the perception of effective features of small-scale objects through an attention mechanism. Through FEM, the aggregation and enhancement of the high-resolution feature maps extracted by the backbone network can effectively improve the detection performance of small objects. The main structure of the feature aggregation enhancement module is shown in Figure 2.

        Due to the complexity of remote sensing images, false alarms with characteristics similar to the target are prone to occur, which greatly affects detection performance. Such false alarms are difficult to distinguish from targets based on their own characteristics alone. Therefore, it is necessary to introduce global context information and use the semantic information of the scene to suppress false alarms. Ordinary convolution has a fixed receptive field and can only extract features from local areas of fixed size. In the Feature Aggregation Structure (FAS), inspired by the ASPP block, we utilize multiple parallel dilated convolutions to collect multi-scale information from feature maps [33]. Due to the aggregation of effective semantic information, the resulting output is less affected by complex scenes.

        Compared with ordinary convolution, dilated convolution obtains receptive fields of different scales by adjusting the dilation rate. Here, we denote a dilated convolution with dilation rate m and kernel size n × n as $D_m^{n \times n}$. For an input feature map $F_{in} \in \mathbb{R}^{h \times w \times c}$, h and w represent the height and width of the feature map, respectively, and c is the number of channels. The feature extraction result under a specific receptive field, with the same dimensions as the input feature map, is:

$F_m = D_m^{n \times n}(F_{in})$
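        As a quick sanity check of the shape claim above, a 3×3 convolution with dilation m and padding m leaves the spatial size unchanged (illustrative PyTorch, not from the paper):

```python
import torch
import torch.nn as nn

f_in = torch.randn(1, 64, 128, 128)       # F_in with h = w = 128, c = 64
for m in (6, 12, 18):                     # dilation rates used later in FAS
    d_m = nn.Conv2d(64, 64, kernel_size=3, dilation=m, padding=m)
    assert d_m(f_in).shape == f_in.shape  # D_m(F_in) keeps h x w x c
```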

        We set the dilation rates to 6, 12, and 18 and perform dilated convolution on the input feature maps to obtain aggregated features at different scales. Furthermore, a 1×1 convolution is used to keep a feature representation at the same resolution as the input feature maps. Then, a concatenation operator and a 1×1 convolution are applied to get the final output. The calculation process is shown below, where $F_{cat}$ is the aggregation result of multi-scale information; the final 1×1 convolution makes the number of output channels the same as that of the input map:

$F_{cat} = \mathrm{concat}\left(\mathrm{conv}_{1 \times 1}(F_{in}),\ F_6,\ F_{12},\ F_{18}\right)$

$F_{out} = \mathrm{conv}_{1 \times 1}(F_{cat})$
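        A sketch of FAS as just described (ASPP-style parallel branches; module and variable names are illustrative, not the authors'):

```python
import torch
import torch.nn as nn

class FAS(nn.Module):
    def __init__(self, ch=64, rates=(6, 12, 18)):
        super().__init__()
        self.local = nn.Conv2d(ch, ch, 1)  # 1x1 branch at input resolution
        self.dilated = nn.ModuleList(
            [nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates])
        self.fuse = nn.Conv2d(ch * (1 + len(rates)), ch, 1)  # back to c channels

    def forward(self, f_in):
        branches = [self.local(f_in)] + [d(f_in) for d in self.dilated]
        f_cat = torch.cat(branches, dim=1)  # multi-scale aggregation F_cat
        return self.fuse(f_cat)             # F_out

out = FAS()(torch.randn(1, 64, 128, 128))   # same shape as the input
```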

        The multi-scale features extracted in different channels make different contributions to detecting small objects. Therefore, we add a channel attention mechanism to assign different weights to each channel according to its importance in feature fusion. The channel attention mechanism automatically learns the importance of each channel, thereby enhancing edge details and semantic information. Inspired by the SE block [34], we obtain the global information of each channel through a spatial pooling operator, yielding a 1×1×C channel feature vector. The k-th channel of the feature vector $v_c$ is expressed as:

$v_c^k = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} F_{cat}^k(i, j)$

where $F_{cat}^k(i, j)$ represents the value of the k-th channel of $F_{cat}$ at the i-th row and j-th column.

        After that, we use a bottleneck layer consisting of two fully connected layers: the dimension of the feature vector is first reduced and then restored to the original dimension. The bottleneck layer better models the complex correlations between channels while reducing the amount of computation. A sigmoid function processes the feature vector to obtain the normalized weight of each channel. Finally, the feature map of each channel is multiplied by its weight factor to get the rescaled result. The final output is written as:

$F_{out} = F_{cat} \cdot \mathrm{Sigmoid}\left(FC_2\left(\mathrm{ReLU}\left(FC_1(v_c)\right)\right)\right)$

where FC represents a fully connected layer, and ReLU and Sigmoid are nonlinear activation functions.
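        A sketch of this SE-style channel attention applied to $F_{cat}$ (the reduction ratio r=16 is an assumption; the text does not state it):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.ReLU(inplace=True),  # reduce dimension
            nn.Linear(ch // r, ch), nn.Sigmoid())           # restore + normalize

    def forward(self, f_cat):
        v_c = f_cat.mean(dim=(2, 3))          # spatial pooling -> 1x1xC vector
        w = self.fc(v_c)[:, :, None, None]    # per-channel weight factors
        return f_cat * w                      # rescaled feature maps

out = ChannelAttention(256)(torch.randn(1, 256, 128, 128))
```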

        Due to their limited geometric scale, small objects lack texture details. At the same time, the localization accuracy requirements for small targets are higher than those for large targets: slight deviations in the center position may lead to inaccurate bounding box regression. Therefore, after using the Feature Aggregation Structure (FAS) to obtain feature maps that incorporate multi-scale information, we propose an Attention Generation Structure (AGS) based on a coordinate attention mechanism. By embedding coordinate positions, this structure enhances the effective features of small objects, improving localization and perception.

        The attention mechanism helps the network improve the perception of specific detailed features and semantic information by assigning different importance to channels and regions. Inspired by the CA block [35], the AGS consists of a spatial information encoding and decoding procedure. Coordinate embedding helps mine spatial information that is beneficial for localizing small objects.

        First, AGS embeds coordinate information by pooling along each spatial direction separately. Compared with global pooling, this pooling preserves coordinate information while obtaining a channel description. Due to the embedding of coordinate positions, the encoded feature maps can capture the spatial information of the region of interest, which helps satisfy the dependence of small object detection on position information. For a feature map $F \in \mathbb{R}^{h \times w \times c}$ from FAS, the pooling vectors $v_h$ and $v_w$, each along a single spatial dimension, can be expressed as follows:

$v_h^k(i) = \frac{1}{w} \sum_{j=1}^{w} F^k(i, j)$

$v_w^k(j) = \frac{1}{h} \sum_{i=1}^{h} F^k(i, j)$

where $v_h^k(i)$ is the pooling result of the k-th channel at the i-th position along the vertical direction, and $v_w^k(j)$ is the pooling result of the k-th channel at the j-th position along the horizontal direction.

        For the pooled vectors computed from equations (5) and (6), we apply channel concatenation to obtain an aggregated vector. In addition, a 1×1 convolution is utilized to achieve channel dimensionality reduction; this channel compression helps represent channel correlations while reducing the number of parameters. The final encoding result $v_{encode}$ is expressed as:

$v_{encode} = \mathrm{conv}_{1 \times 1}\left(\mathrm{concat}(v_h, v_w)\right)$

where $\mathrm{conv}_{1 \times 1}$ denotes a 1×1 convolution transformation.

        After obtaining the feature vector $v_{encode}$ that encodes the spatial information, the next step is to decode the spatial information and apply the decoded attention weights to the input feature map. The encoding vector $v_{encode}$ is split along the vertical and horizontal dimensions to obtain the unidirectional encoding vectors $s_h$ and $s_w$:

$[s_h, s_w] = \mathrm{split}(v_{encode})$

where split(·) represents the dimension splitting operator.

        For the split vectors, a 1×1 convolution transformation is used to undo the channel reduction, restoring the same channel dimension as the input feature map. The attention weights decoded along the different spatial directions can be written as:

$w_x = \mathrm{Sigmoid}\left(\mathrm{conv}_{1 \times 1}(s_h)\right), \quad w_y = \mathrm{Sigmoid}\left(\mathrm{conv}_{1 \times 1}(s_w)\right)$

where $w_x$ and $w_y$ are a pair of attention weights embedding vertical and horizontal spatial information, respectively. By applying the decoded attention weights, the final output feature map can be expressed as:

$F_{out}(i, j) = F(i, j) \cdot w_x(i) \cdot w_y(j)$
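        A sketch of AGS following the coordinate-attention recipe above (the bottleneck width and the ReLU in the encoder are assumptions):

```python
import torch
import torch.nn as nn

class AGS(nn.Module):
    def __init__(self, ch, r=16):
        super().__init__()
        mid = max(ch // r, 8)
        self.encode = nn.Sequential(nn.Conv2d(ch, mid, 1), nn.ReLU(inplace=True))
        self.decode_h = nn.Conv2d(mid, ch, 1)   # recovers channels for w_x
        self.decode_w = nn.Conv2d(mid, ch, 1)   # recovers channels for w_y

    def forward(self, f):
        n, c, h, w = f.shape
        v_h = f.mean(dim=3, keepdim=True)       # N x C x H x 1, vertical pooling
        v_w = f.mean(dim=2, keepdim=True)       # N x C x 1 x W, horizontal pooling
        # concatenate along the spatial axis, encode, then split back
        v_encode = self.encode(torch.cat([v_h, v_w.transpose(2, 3)], dim=2))
        s_h, s_w = torch.split(v_encode, [h, w], dim=2)
        w_x = torch.sigmoid(self.decode_h(s_h))                  # one weight per row
        w_y = torch.sigmoid(self.decode_w(s_w.transpose(2, 3)))  # one per column
        return f * w_x * w_y

out = AGS(64)(torch.randn(1, 64, 128, 128))
```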


3.2. Loss Function

        In order to improve the regression accuracy of small object bounding boxes, the original loss of CenterNet is extended with the Complete Intersection over Union (CIOU) loss [36], so that the final loss consists of keypoint heatmap, CIOU, size, and center offset terms. The whole function $L_{det}$ is expressed as:

$L_{det} = L_k + \lambda_{CIOU} L_{CIOU} + \lambda_{size} L_{size} + \lambda_{off} L_{off}$

        We set $(\lambda_{CIOU}, \lambda_{size}, \lambda_{off}) = (1, 0.1, 1)$; these are hyperparameters that adjust the weights of the corresponding parts of the loss function.

        CenterNet detects objects as points and generates a keypoint heatmap $\hat{Y} \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}$, size predictions $\hat{S} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$, and center offsets $\hat{O} \in \mathbb{R}^{\frac{W}{R} \times \frac{H}{R} \times 2}$ [11], where W, H, and C denote the width, height, and number of object categories, respectively. R is the downsampling stride, which we set to 4 to ensure sufficiently high-resolution feature maps for small object detection. The keypoint loss is defined as:

$L_k = -\frac{1}{N} \sum_{xyc} \begin{cases} (1 - \hat{Y}_{xyc})^a \log(\hat{Y}_{xyc}) & \text{if } P_{xyc} = 1 \\ (1 - P_{xyc})^b (\hat{Y}_{xyc})^a \log(1 - \hat{Y}_{xyc}) & \text{otherwise} \end{cases}$

where $P_{xyc}$ is the ground-truth heatmap generated by the same Gaussian function as CenterNet. Since only locations with $P_{xyc} = 1$ are considered positive samples, which creates an imbalance between positive and negative samples, we use focal loss to alleviate this situation. a and b are the focal loss hyperparameters, set to 2 and 4 by default. N is the total number of keypoints, used for normalization.
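        A sketch of this keypoint focal loss (pred is the sigmoid heatmap $\hat{Y}$, gt the Gaussian target $P$; a=2, b=4 as above):

```python
import torch

def keypoint_focal_loss(pred, gt, a=2, b=4, eps=1e-6):
    pos = gt.eq(1).float()                  # centers where P_xyc == 1
    neg = 1.0 - pos
    pos_loss = ((1 - pred) ** a) * torch.log(pred + eps) * pos
    neg_loss = ((1 - gt) ** b) * (pred ** a) * torch.log(1 - pred + eps) * neg
    n = pos.sum().clamp(min=1)              # number of keypoints, N
    return -(pos_loss.sum() + neg_loss.sum()) / n

loss = keypoint_focal_loss(torch.rand(1, 1, 128, 128),
                           torch.zeros(1, 1, 128, 128))
```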

        For the k-th ground-truth bounding box, denoted by $(x_1^k, y_1^k, x_2^k, y_2^k)$, its size is $s_k = (x_2^k - x_1^k,\ y_2^k - y_1^k)$ and its center position is $p_k = \left(\frac{x_1^k + x_2^k}{2},\ \frac{y_1^k + y_2^k}{2}\right)$. Both size and offset are trained with the L1 loss, calculated as follows:

$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{\tilde{p}_k} - s_k \right|$

$L_{off} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{O}_{\tilde{p}_k} - \left( \frac{p_k}{R} - \tilde{p}_k \right) \right|$

where $\tilde{p}_k = \left\lfloor \frac{p_k}{R} \right\rfloor$ represents the integer part of the position after downsampling by R.

        The original CenterNet loss function optimizes the center position and target size independently, resulting in poor localization accuracy for small targets. Therefore, we introduce CIOU into the loss computation to train under the supervision of the overlap between predicted and ground-truth bounding boxes. CIOU jointly considers the center distance, overlap, and aspect ratio, comprehensively optimizing the match between the predicted and ground-truth boxes. The CIOU loss is written as:

$L_{CIOU} = 1 - IOU + \frac{\rho^2(p_{pred}, p_{gt})}{d^2} + \alpha v$

where IOU is the intersection-over-union ratio between the predicted and ground-truth bounding boxes, $p_{pred}$ and $p_{gt}$ are the center points of the prediction and ground truth, respectively, ρ represents the Euclidean distance operator, d is the diagonal length of the smallest box enclosing both boxes, and α is the weight factor. The aspect ratio similarity term is:

$v = \frac{4}{\pi^2} \left( \arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w_{pred}}{h_{pred}} \right)^2, \quad \alpha = \frac{v}{(1 - IOU) + v}$

The addition of the CIOU loss improves the localization accuracy of CenterNet for small targets and the convergence efficiency of the network.
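        A sketch of the CIOU term for (x1, y1, x2, y2) boxes, with the combined objective under the weight assignment assumed in this section:

```python
import math
import torch

def ciou_loss(pred, gt, eps=1e-7):
    # IOU
    xy1 = torch.maximum(pred[:, :2], gt[:, :2])
    xy2 = torch.minimum(pred[:, 2:], gt[:, 2:])
    inter = (xy2 - xy1).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_g = (gt[:, 2:] - gt[:, :2]).prod(dim=1)
    iou = inter / (area_p + area_g - inter + eps)
    # normalized center distance rho^2 / d^2
    c_p, c_g = (pred[:, :2] + pred[:, 2:]) / 2, (gt[:, :2] + gt[:, 2:]) / 2
    rho2 = ((c_p - c_g) ** 2).sum(dim=1)
    enc1 = torch.minimum(pred[:, :2], gt[:, :2])
    enc2 = torch.maximum(pred[:, 2:], gt[:, 2:])
    d2 = ((enc2 - enc1) ** 2).sum(dim=1) + eps
    # aspect-ratio similarity v and its weight alpha
    wh_p, wh_g = pred[:, 2:] - pred[:, :2], gt[:, 2:] - gt[:, :2]
    v = (4 / math.pi ** 2) * (torch.atan(wh_g[:, 0] / (wh_g[:, 1] + eps))
                              - torch.atan(wh_p[:, 0] / (wh_p[:, 1] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + rho2 / d2 + alpha * v).mean()

# combined objective, weights as assumed above:
# L_det = L_k + 1.0 * L_ciou + 0.1 * L_size + 1.0 * L_off
print(ciou_loss(torch.tensor([[10., 10., 50., 30.]]),
                torch.tensor([[12., 11., 52., 33.]])))
```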


4. Experimental Results

4.1. Dim Small Vehicle Dataset

        We construct a dim small vehicle dataset (DSVD) based on the UNICORN 2008 dataset [37] to evaluate the small object detection performance of the proposed algorithm. The UNICORN 2008 source dataset is a wide-area motion imagery (WAMI) dataset containing 6471 images. Each image covers an area of approximately 5 km × 5 km, and the image size is approximately 10,000 × 10,000 pixels. The small target dataset based on UNICORN 2008 poses the following detection difficulties:

        • Vehicles have a relatively drab appearance, small size, dull features, and low contrast, so it is difficult to obtain robust feature representations for these objects. Local areas around targets are shown in Figure 3.

        • The images cover wide areas with complex and diverse scenes, such as parking lots, roads, and residential areas. In addition, there are a large number of suspicious objects in the scenes, which can easily become sources of false alarms. A complex scene is shown in Figure 4.

        The complexity of the above scenes and the weak features of the objects make vehicle detection in UNICORN 2008 rather challenging. We divide the images in UNICORN 2008 into 640 × 640 pixel blocks covering different scenes, as sketched below. For the 3225 selected images, we thoroughly label vehicles with rectangular bounding boxes. A total of 2257 images are randomly selected for network training, and the remaining 968 images are used as test data for performance evaluation.
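        A minimal sketch of this block-splitting step (illustrative only; the actual tiling and scene-selection code is not given in the paper):

```python
import numpy as np

def tile(image, size=640):
    """Split a large WAMI frame into non-overlapping size x size blocks."""
    h, w = image.shape[:2]
    return [image[i:i + size, j:j + size]
            for i in range(0, h - size + 1, size)
            for j in range(0, w - size + 1, size)]

blocks = tile(np.zeros((10000, 10000, 3), dtype=np.uint8))  # 15 x 15 = 225 blocks
```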


4.2. Evaluation Metrics

        We apply precision, recall, F1-score, and AP (average precision) metrics to evaluate the detection performance of the proposed method. The intersection-over-union ratio (IOU) threshold between detected and ground-truth bounding boxes is set to 0.5. Among these metrics, precision and recall evaluate false positives and false negatives, respectively, calculated as follows:

$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}$

where TP, FP, and FN represent true positive, false positive, and false negative samples, respectively.

        The F1-score and AP evaluate the detection method more comprehensively. F1, the harmonic mean of precision and recall, is written as:

$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$

        AP is defined as the area under the recall-precision curve:

$AP = \int_0^1 P(R)\, dR$
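        A sketch of these metrics for scored detections already matched to ground truth at IOU 0.5 (matching logic omitted; AP uses a simple trapezoidal area rather than VOC-style interpolation):

```python
import numpy as np

def pr_f1_ap(scores, is_tp, num_gt):
    order = np.argsort(-scores)            # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / num_gt
    precision = tp / (tp + fp)
    f1 = (2 * precision * recall / (precision + recall + 1e-9)).max()
    ap = np.trapz(precision, recall)       # area under the P-R curve
    return precision[-1], recall[-1], f1, ap

p, r, f1, ap = pr_f1_ap(np.array([0.9, 0.8, 0.3]),
                        np.array([True, False, True]), num_gt=2)
```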

To evaluate the inference speed of the detection algorithm, we used the FPS (frames per second) metric.


4.3. Implementation Details and Ablation Analysis

        All experiments are performed using the PyTorch framework on an Intel Xeon® Silver 4210R CPU and an NVIDIA Quadro RTX 4000 GPU. During training, the input resolution is 512×512 pixels, yielding a 128×128 feature map for prediction. We train the model for 140 epochs with a batch size of 4. An Adam optimizer with a learning rate of 8 × 10⁻⁵ is chosen, and the learning rate is reduced by a factor of 10 at epochs 90 and 120.
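        The optimizer schedule above corresponds to the following PyTorch setup (a plain convolution stands in for the detector):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3)  # placeholder; FE-CenterNet in practice
optimizer = torch.optim.Adam(model.parameters(), lr=8e-5)
# 140 epochs total; the learning rate drops by 10x at epochs 90 and 120
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[90, 120], gamma=0.1)
for epoch in range(140):
    # ... one pass over the DSVD training split with batch size 4 ...
    scheduler.step()
```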

        Figures 5 and 6 show some detection results of our method, which achieves good performance on small and dim objects. As shown in Figure 5, some objects with limited appearance features have low contrast, yet our method detects all objects without false negatives. For complex scenes with a large number of distractors similar to the target, as in Figure 6, our method also performs well without false alarms.

        Based on the constructed DSVD, we conduct ablation experiments to evaluate the performance improvement in small object detection. The same policy and parameters are applied during training and evaluation to ensure a fair comparison. As shown in Table 1, we use precision, recall, F1-score, and AP to evaluate detection performance, and FPS to compare detection speed. Compared with CenterNet, the proposed method has clear advantages in detection metrics with little impact on inference speed. The addition of FEM and the improvement of the loss function increase AP by 4.3% and 2.7%, respectively, while FPS decreases only slightly. Overall, compared with the original CenterNet, FE-CenterNet improves AP by 7.2% while adding almost no additional inference time (FPS drops from 17.9 to 16.6). Meanwhile, to intuitively illustrate the effect of FEM, we provide some feature visualizations in Figure 7. Before enhancement, a large amount of noise is highlighted in the feature map; after FEM enhancement, it is greatly suppressed. In addition, objects with weaker features on the left and right sides are not clearly visible in the original feature map; through feature enhancement, these objects are highlighted.

        To show the improvement in detection performance more intuitively, the detection results of CenterNet and the proposed FE-CenterNet are visualized in Figure 8, covering three typical scenes in the dataset: roads, parking lots, and residential communities. It can be seen that, thanks to the feature enhancement module and the improved loss function, FE-CenterNet is not easily disturbed by false alarm sources and has a stronger perception ability for small targets. In remote sensing images with complex and varied scenes, natural landscapes and man-made objects with characteristics similar to the target appear frequently, such as shadow blocks (regions 1, 3, 5, 6, 9, and 10), roofs (region 2), and trees (regions 4 and 7). CenterNet does not make good use of contextual information, which makes it difficult to distinguish objects from such false alarms. The proposed network aggregates multi-scale features through the feature aggregation structure, effectively reducing false alarms and improving precision. At the same time, the vehicle targets in remote sensing images have small geometric scales and weak texture and structural features, which are difficult for the network to perceive fully. As shown in regions 8 and 11, CenterNet fails to detect tiny objects, whereas FE-CenterNet perceives them more effectively through the attention generation structure, improving the recall rate.


4.4. Algorithm Performance Comparison

        To verify the overall performance of our proposed method, we compare it with several representative detection algorithms under the same implementation environment, dataset, and evaluation metrics. Among the selected methods, Cascade-RCNN [16] is a two-stage detector improved from Faster-RCNN, which generally outperforms single-stage and anchor-free detectors at the price of higher computational complexity. ImYOLOv3 [3] introduces an attention mechanism into single-stage YOLOv3 and performs well in remote sensing target detection. YOLOv7 [21] is currently the latest algorithm of the YOLO architecture, combining a large number of advanced detection techniques. FII-CenterNet [30], also based on the anchor-free CenterNet, improves traffic object detection by introducing a foreground region proposal network borrowed from two-stage detectors.

        The evaluation results are shown in Table 2. Among these methods, the coarse-to-fine detection process of Cascade-RCNN and the introduction of foreground information in FII-CenterNet help improve precision, but they struggle to detect all objects. The two-stage design also significantly reduces inference speed. ImYOLOv3 improves the perception of small and dim targets by applying the attention mechanism and achieves relatively high recall; however, it is susceptible to false alarms similar to the target. YOLOv7 uses an anchor-free mechanism and has an obvious speed advantage, and its combination of advanced training and inference strategies makes it outperform the other compared algorithms. However, it is not as effective as our FE-CenterNet due to the lack of strategies designed for small objects. Our method, based on the multi-scale feature fusion structure and the attention generation structure, achieves the highest precision, recall, F1, and AP while maintaining a relatively high detection speed.

        The above conclusions are intuitively reflected by the visualized detection results in Figure 9, which shows the results of imYOLOv3, Cascade-RCNN, FII-CenterNet, and the proposed method. The displayed images are selected from a variety of complex scenes such as parking lots, neighborhoods, and roads, where occlusion and low contrast between objects and scenes make detection difficult. In regions 1 to 3, where the targets are densely distributed, Cascade-RCNN and FII-CenterNet miss a large number of targets, while imYOLOv3 and the proposed method detect more targets in such scenes. For the occluded target in region 4 and the objects with weaker features in regions 5 to 8, the comparison methods have difficulty perceiving the targets, resulting in many missed detections. In contrast, the proposed method achieves the best detection performance in these complex scenarios without missed or false detections.


5. Conclusions

        In this paper, we propose an anchor-free detector named FE-CenterNet for dim and small object detection in complex remote sensing scenes. First, we introduce multi-scale contextual information to suppress false alarms similar to objects and integrate a coordinate attention mechanism to improve the perception of small objects, yielding the proposed FEM. Then, to improve localization regression accuracy, we propose a new loss function that combines CenterNet's original loss with the CIOU loss. Finally, to verify detection performance, we construct the DSVD consisting of various complex scenes and objects. Experimental results show that our method achieves better detection performance than other typical algorithms while maintaining a relatively high inference speed, demonstrating its potential for small object detection in complex remote sensing scenarios.


Source: blog.csdn.net/YoooooL_/article/details/130562448