[Reading notes for target detection papers] Detection of plane in remote sensing images using super-resolution

Abstract

        Because of the large number of small targets, instance-level noise, and cloud occlusion, target detection accuracy in remote sensing images is low, and the missed-detection and false-detection rates are high. This paper proposes a new object detection model based on SRGAN and YOLOv3, called SR-YOLO. It resolves SRGAN's sensitivity to hyperparameters and its mode collapse. In addition, the FPN in YOLOv3 is replaced by PANet, which shortens the distance between the bottom layer and the top layer. By adding this enhanced path, SR-YOLO enriches the features of each layer, and it shows strong robustness and high detection ability. Experimental results on the UCAS high-resolution aerial object detection dataset (UCAS-AOD) show that SR-YOLO performs well. Compared with YOLOv3, the average precision (AP) of SR-YOLO increases from 92.35% to 96.13%, the log-average miss rate (MR-2) decreases from 22% to 14%, and the recall increases from 91.36% to 95.12%.


1 Introduction

        Remote sensing image target detection is widely used in civilian and military fields, such as guiding fruit picking, traffic management, environmental analysis, military surveying and mapping, and military target reconnaissance. Compared with field surveys, remote sensing images are more accurate, because they capture ground information in real time and in detail. Accurately identifying objects such as aircraft, ships, and automobiles in remote sensing images is of great significance in military operations and traffic management. To improve detection on low-resolution images, methods that combine resolution enhancement with object detection have been proposed. In [3], S2R2 applies super-resolution techniques, with regularization, to low-resolution face recognition. [4] employs translation invariance and global methods for feature extraction, eliminates artifacts and discontinuities in low-resolution images, and performs super-resolution reconstruction of face images to improve detection accuracy. Moreover, in some detection tasks, detection accuracy is improved by deblurring [5-8] or denoising [9] the images. These methods improve resolution using traditional image processing techniques, but because of their own limitations they are still affected by the large number of small targets, instance-level noise, and cloud occlusion, and are difficult to apply to target detection in remote sensing images.

        This paper investigates a super-resolution method that leverages end-to-end training in deep learning to combine low-level and high-level vision tasks, producing what we call "You Only Look Once with Super-Resolution" (SR-YOLO). Super-resolution images contain more distinguishable features, which can improve the accuracy of object detection. As a means of making object detection more robust to low-resolution inputs, this approach gives significantly better results than other object detection methods and may be applicable to a wide range of remote sensing satellite image processing tools and high-level tasks. Compared with previous studies, this paper adopts SRGAN super-resolution and the third version of You Only Look Once (YOLOv3) for target detection, applies the combination of the two to aircraft detection in remote sensing satellite images, and improves the network structure so that it is better suited to detection in remote sensing images. SR-YOLO first solves the hyperparameter sensitivity and mode collapse problems of SRGAN, and then enriches the semantic information of small objects with PANet [10]. Finally, the super-resolution output drives the detector, which addresses the difficulty of detecting small targets in remote sensing images.

        The improvements in this paper fall into two parts. 1) Improving the SRGAN network. A residual network replaces the normalization layers of the generator network, and a penalty mechanism is added to reconstruct the loss functions of the discriminator and generator. As a result, the training process of our SRGAN network is more stable, a more comprehensive feature space is obtained, and the generated images are more fine-grained. 2) Improving the YOLOv3 network. First, based on the dataset we use, we redefine a new set of bounding boxes suitable for aircraft detection. Second, the path aggregation network (PANet) is used instead of the feature pyramid network (FPN) [11] as the neck network: sub-sampling is introduced, the features of all levels are pooled together, the distance between the lower and upper levels is shortened, and the enhanced path is used to enrich the features of all levels.
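
        To make the difference between the FPN and PANet necks concrete, the following is a minimal sketch, not the paper's implementation. It assumes 1-D lists as stand-ins for feature maps, nearest-neighbour upsampling, stride-2 sub-sampling, and fusion by element-wise addition; all function and variable names are hypothetical.

```python
# Hypothetical 1-D stand-ins for feature maps: each level is a list of floats.
# Fusion by element-wise addition is an assumption made for brevity.

def upsample2(x):
    """Nearest-neighbour upsampling by a factor of 2."""
    return [v for v in x for _ in range(2)]

def downsample2(x):
    """Stride-2 sub-sampling (keep every second element)."""
    return x[::2]

def add(a, b):
    """Element-wise sum of two equally sized levels."""
    return [u + v for u, v in zip(a, b)]

def fpn_top_down(c3, c4, c5):
    """FPN: high-level semantics flow from the top level down."""
    p5 = c5
    p4 = add(c4, upsample2(p5))
    p3 = add(c3, upsample2(p4))
    return p3, p4, p5

def panet_bottom_up(p3, p4, p5):
    """PANet's extra path: low-level detail flows from the bottom level back up."""
    n3 = p3
    n4 = add(p4, downsample2(n3))
    n5 = add(p5, downsample2(n4))
    return n3, n4, n5

c3 = [1.0] * 8     # high resolution, low-level detail
c4 = [10.0] * 4
c5 = [100.0] * 2   # low resolution, high-level semantics

p3, p4, p5 = fpn_top_down(c3, c4, c5)
n3, n4, n5 = panet_bottom_up(p3, p4, p5)
```

        In this toy setup the top FPN level `p5` contains nothing from `c3`, whereas the top PANet level `n5` receives the bottom-level signal after only two sub-sampling steps, which is the "shortened distance" described above.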

        This paper presents our proposed method in five sections. Section 1 introduces the research background, the existing problems and our solutions, and outlines the structure of the paper. Section 2 presents related work on super-resolution and object detection. Section 3 presents our method in detail. Section 4 describes the experiments, including the comparison with other algorithms and the analysis of the experimental results on the UCAS-AOD benchmark dataset. Section 5 summarizes the contributions and limitations of this paper.


2 Related work

        There is already a considerable body of research on improving the detection accuracy of low-resolution images through image reconstruction. In contrast, because of the constraints of combining super-resolution reconstruction with object detection, there are relatively few studies on improving the detection accuracy of remote sensing images. We review this work from two directions.

2.1 Image Super Resolution

        Various super-resolution networks have been proposed, including Super-Resolution Generative Adversarial Networks (SRGAN), Enhanced Deep Super-Resolution (EDSR), Deep Back-Projection Networks (DBPN), Super-Resolution DenseNets, and Deep Laplacian Pyramid Networks (DLPN) [12-16]. These networks upscale images effectively and greatly improve visual quality, and they are better suited to images with complex backgrounds. For example, [17] reconstructs low-resolution images with DBPN and then feeds them to an SSD detection network to improve detection accuracy on images with complex backgrounds. With the introduction of more efficient convolutional neural networks (CNN), super-resolution techniques have also developed rapidly. The super-resolution convolutional neural network (SRCNN) [18] first uses bicubic interpolation to enlarge the low-resolution image to the target size, then fits the nonlinear mapping with a three-layer convolutional network, and finally outputs the high-resolution image. The network structure of SRCNN is very simple, using only three convolutional layers. Some studies have improved SRCNN [19, 20] by introducing residual networks. [21] introduced recurrent layers, but data augmentation using hand-crafted layers is still limited. Inspired by [21, 23], DRRN [22] adopts a deeper network structure to improve performance. EDSR removes the redundant modules of SRResNet (Super-Resolution ResNet) [12], which allows the model size to be increased and thus improves the quality of the results. Although the deep features of deep convolutional neural networks (DCNN) can preserve the real texture of high-frequency images, eliminating blurring and artifacts remains difficult. [24] introduces perceptual loss and [25] introduces adversarial loss to address this problem. SRGAN uses both perceptual loss and adversarial loss to improve the realism and fine texture details of generated images. However, SRGAN suffers from hyperparameter sensitivity and mode collapse, resulting in an unstable training process. Currently, few super-resolution techniques have been combined with remote sensing images to solve the problem of object detection in remote sensing images.


2.2 Remote sensing image target detection

        Object detection algorithms can be divided into two categories: two-stage and one-stage. Two-stage algorithms split the detection problem into two stages: generating region proposals, then classifying and refining the candidate regions [27-30]. One-stage algorithms are regression-based and skip the region proposal stage; they directly obtain the class probabilities and position coordinates of objects without a complicated framework [31-34]. Generally, two-stage algorithms have higher detection accuracy but lower speed, and suit scenarios that require higher accuracy. One-stage algorithms have lower detection accuracy but higher speed, and can achieve real-time detection [35].

        To improve the detection accuracy of objects in remote sensing images, [36] proposed an unsupervised score-based bounding box regression (USB-BBR) algorithm, combined with non-maximum suppression, to optimize the bounding boxes of detected object regions. For small objects in large-scale remote sensing images and large scenes, [37] proposes the Tiny-Net object detection method, which consists of a TinyNet backbone, an intermediate global attention block, a final classifier, and a detector. To detect specific objects in remote sensing images, the model in [38] trains multiple detectors, each dedicated to buildings of a specific size. Furthermore, that model implicitly uses contextual information by training on the road extraction task and the building detection task simultaneously. [39] proposed a new deep network, the Rotatable Region Residual Network (R3-Net), to detect multi-oriented vehicles in aerial images and videos.
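
        Non-maximum suppression, mentioned above as the companion step in [36], can be sketched as follows. This is the generic greedy variant, not code from [36]; the box format `(x1, y1, x2, y2)`, the 0.5 IoU threshold, and the function names are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap any already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping detections of one aircraft plus one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

        Here the second box overlaps the first with an IoU of 81/119, so it is suppressed and only the first and third boxes are kept.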

        To improve the efficiency and accuracy of aircraft detection in remote sensing images, [40] proposed a coupled CNN-based weakly supervised learning framework for aircraft detection. [41] proposed an end-to-end semi-supervised object detection method, in contrast to previous, more complex multi-stage methods. End-to-end training gradually improves the quality of the pseudo-labels over the course of training, and increasingly accurate pseudo-labels in turn benefit object detection training. [42] proposed a hybrid variable-weighted stacked autoencoder (HVW-SAE) for learning quality-related features for soft sensor modeling. By prioritizing the reconstruction constraints on the more quality-related variables, it ensures that the learned features contain more information for quality prediction. [43] proposed a novel and flexible backbone framework, CBNetV2, to build high-performance detectors from existing open-source pre-trained backbones under the pre-training and fine-tuning paradigm. [44] proposed a novel dynamic head framework to unify object detection heads with attention. The proposed method significantly improves the representation power of object detection heads without any computational overhead. [45] proposed Spectral Spatial Weighted Kernel Manifold Embedding Distribution Alignment (SSWK-MEDA) for remote sensing image classification. This method applies a novel spatial information filter, effectively uses the similarity between adjacent sample pixels, avoids the influence of non-sample pixels, and uses the geometric structure of features in the manifold space to solve the problem of feature distortion of remote sensing data in transfer learning scenarios.


3 Proposed method

        This paper proposes a new detection model, SR-YOLO. We explore a better combination of the SRGAN super-resolution network and the YOLOv3 detection network. To do so, we must first solve the instability of the SRGAN training process and improve the quality of the generated images. Second, the ability of YOLOv3 to detect small objects is important. This section therefore introduces our improvements in two parts: SRGAN network improvements and YOLOv3 network improvements.

3.1 SRGAN network improvement

Generator network fine-tuning:

        First, the BN layers in the SRGAN generator network are replaced inside the residual blocks. [13, 46] demonstrated that removing BN layers improves performance and reduces computational complexity in PSNR-oriented tasks. Removing the BN layers also makes network training more stable and improves the generalization ability of the network. After each BN layer is replaced with a 3×3 convolution and a PReLU activation layer, the depth and complexity of the network increase, the features after each convolution are fully used, and the generator handles edge features better.


Loss function reconstruction:

        [47] analyzed why GAN training is unstable: when the distributions p and q do not overlap, the JS divergence used in the GAN cannot smoothly measure the distance between them, so no effective gradient information is produced there, which causes mode collapse. Borrowing ideas from [47], we reconstruct the loss functions of the discriminator and generator networks; as a result, the training process is more stable and the loss converges faster.
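
        The saturation of the JS divergence for non-overlapping distributions can be checked numerically. The sketch below is only an illustration: it uses point masses on a small 1-D grid, and it contrasts JS with the Wasserstein-1 distance purely as an assumed example of a smoother measure, since these notes do not state which measure [47] adopts. All names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein1(p, q):
    """Wasserstein-1 distance on a unit-spaced 1-D grid: sum of |CDF_p - CDF_q|."""
    total, cp, cq = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        total += abs(cp - cq)
    return total

def point_mass(position, size=10):
    """Distribution with all probability on one grid cell."""
    return [1.0 if i == position else 0.0 for i in range(size)]

p = point_mass(0)
shifts = (1, 4, 8)
js_values = [js(p, point_mass(d)) for d in shifts]
w1_values = [wasserstein1(p, point_mass(d)) for d in shifts]
```

        For every shift, `js_values` stays at log 2, so it gives no signal about how far apart the two distributions are, while `w1_values` grows with the shift (1, 4, 8).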


3.1.1 Generator network fine-tuning

        We use a network interpolation method to preserve perceptual quality while removing the artifacts and noise produced by GAN training. Specifically, we first train a PSNR-oriented network GPSNR, and then obtain a GAN-based network GGAN by fine-tuning it. We interpolate all corresponding parameters of the two networks to obtain an interpolated model GINTERP, whose parameters are given by Equation 1:

$$\theta_G^{\mathrm{INTERP}} = (1-\alpha)\,\theta_G^{\mathrm{PSNR}} + \alpha\,\theta_G^{\mathrm{GAN}} \tag{1}$$

Here θGINTERP, θGPSNR and θGGAN are the parameters of GINTERP, GPSNR and GGAN respectively, and α∈[0,1] is the interpolation parameter. Experiments show that PSNR reaches the ideal level when α is 0.2.
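
        The parameter interpolation can be sketched in a few lines. This is a minimal illustration, assuming the usual linear blend θINTERP = (1 − α)·θPSNR + α·θGAN (the orientation of α is an assumption) and representing each network's parameters as a dictionary of flat lists; the parameter names and values are hypothetical.

```python
def interpolate_params(theta_psnr, theta_gan, alpha):
    """Blend two parameter sets: (1 - alpha) * theta_psnr + alpha * theta_gan.

    Assumption: alpha weights the GAN-based network, so alpha = 0 returns
    the PSNR-oriented parameters unchanged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if theta_psnr.keys() != theta_gan.keys():
        raise ValueError("both networks must share the same parameter names")
    return {
        name: [(1.0 - alpha) * a + alpha * b
               for a, b in zip(theta_psnr[name], theta_gan[name])]
        for name in theta_psnr
    }

# Hypothetical toy parameters standing in for the two trained generators.
theta_psnr = {"conv1.weight": [1.0, 2.0], "conv1.bias": [0.0]}
theta_gan = {"conv1.weight": [3.0, 6.0], "conv1.bias": [1.0]}

theta_interp = interpolate_params(theta_psnr, theta_gan, alpha=0.2)
```

        With α = 0.2, each interpolated value sits one fifth of the way from the PSNR-oriented parameter to the GAN-based one, for example roughly 1.4 for the first weight.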

        We improve the residual blocks in the generator network. The residual block of the original generator, shown in Figure 1, applies a 3×3 convolution followed by a BN layer, then a PReLU activation, and finally a second 3×3 convolution and normalization. We add a very small number of parameters to the original residual block to make the feature information richer.

        Combining the feature vectors obtained from the two convolutions with the original feature vectors preserves the integrity of the feature information. The original generator stacks 16 residual blocks, for a total of 16×2 BN layers. In super-resolution tasks, the output image is usually required to match the original image in color, contrast, brightness, and so on; only the resolution and some details should change. However, BN in the SRGAN generator stretches the contrast of the image and normalizes its color distribution, which destroys the original contrast information and degrades the quality of the output image. When the statistics of the training set differ from those of the test set, BN layers tend to produce undesirable artifacts and limit the generalization ability of the model. [13, 46] demonstrated that removing BN layers improves performance and reduces computational complexity in PSNR-oriented tasks. Removing the BN layers also improves the stability of network training and the generalization ability of the network. Therefore, as shown by our residual block in Figure 1, we replace each BN layer of the original residual block with a 3×3 convolution and a PReLU activation layer. This increases the depth and complexity of the network, makes full use of the features after each convolution, and improves the generator's handling of edge features.
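
        The effect of normalization on contrast can be illustrated with a small sketch. It is a simplification under stated assumptions: each list stands for the activations of one channel over one batch, and only the zero-mean, unit-variance step of BN is applied, without the learned scale and shift. The values and names are hypothetical.

```python
import math

def normalize(values, eps=1e-8):
    """Zero-mean, unit-variance normalization: the core step of BN,
    without the learned scale and shift parameters."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# Two hypothetical intensity patterns that differ only in contrast.
low_contrast = [0.45, 0.50, 0.55, 0.50]
high_contrast = [0.10, 0.50, 0.90, 0.50]

norm_low = normalize(low_contrast)
norm_high = normalize(high_contrast)

# Largest remaining difference between the two normalized patterns.
max_gap = max(abs(a - b) for a, b in zip(norm_low, norm_high))
```

        After normalization the two patterns become almost identical (`max_gap` is close to zero), so the eightfold difference in contrast between the inputs is no longer visible to later layers.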

        In our generator network, shown in Fig. 2, the 16 residual blocks are connected via 9×9 convolutional layers to obtain the full underlying feature space. Then two-times upsampling and PReLU activations are applied. Finally, a 9×9 convolutional layer is attached to recover the high-resolution remote sensing data.


Origin blog.csdn.net/YoooooL_/article/details/130367905