[Unsupervised GAN pansharpening based on recursive mixed-scale feature fusion]

Pansharpening Using Unsupervised Generative Adversarial Networks With Recursive Mixed-Scale Feature Fusion

(Unsupervised generative adversarial network pan-sharpening based on recursive mixed-scale feature fusion)

Pansharpening is an important technology for improving the spatial resolution of multispectral (MS) images. Most models are trained at reduced resolution, which leads to suboptimal results at full resolution. Furthermore, the complex relationship between MS and panchromatic (PAN) images is often overlooked in detail injection. To address these problems, an unsupervised generative adversarial network based on recursive mixed-scale feature fusion (RMFF-UPGAN) is established to improve spatial resolution while retaining spectral information. RMFF-UPGAN consists of a generator and two U-shaped discriminators. A dual-stream trapezoidal branch is designed in the generator to obtain multi-scale information. On this basis, a recursive mixed-scale feature fusion subnetwork is designed: the extracted MS and PAN features of the same scale first undergo a prior fusion, and mixed-scale fusion is then performed on the preceding fine-scale and coarse-scale fusion results. Fusion proceeds sequentially in this manner, forming a recursive mixed-scale fusion structure that finally generates the key information. A compensation information mechanism is designed to reconstruct the key information and compensate for information loss. To overcome the distortion caused by ignoring the complex relationship between MS and PAN images, a nonlinear rectification block for the reconstructed information is proposed. Two U-shaped discriminators are designed and a new composite loss function is defined. The proposed model is validated on data from two satellites, and the results show that it outperforms other state-of-the-art methods in both visual evaluation and objective metrics.

INTRODUCTION

Remote sensing images are widely used in geological exploration, terrain classification, agricultural yield prediction, pest and disease detection, disaster prediction, national defense, environmental change detection and other fields. These applications require images with high spatial resolution, high spectral resolution, or high temporal resolution. However, due to limitations of sensor technology, we obtain low spatial resolution multispectral or hyperspectral (LRMS/LRHS) images, low temporal resolution multispectral or hyperspectral images, and low spectral resolution panchromatic (PAN) images. Fusion technology is therefore needed to fuse LRMS and PAN images into high spatial resolution multispectral (HRMS) images. This fusion technique is called pansharpening. Pansharpening methods are generally divided into component substitution (CS) methods, multi-resolution analysis (MRA) techniques, variational optimization (VO) methods and deep learning (DL) models.
CS techniques mainly include intensity-hue-saturation (IHS) and its variants, Gram-Schmidt (GS), GS Adaptive (GSA), principal component analysis (PCA) and band-dependent spatial detail (BDSD). The LRMS image is first projected into another spatial domain, the spatial structure component is then replaced with that of the high-resolution PAN image, and the result is finally transformed back into the original space to obtain the fused image. The advantages of CS are that it is simple, widely used, easy to integrate into software, easy to implement, and greatly improves the spatial resolution of LRMS images. Its disadvantages include spectral distortion, oversharpening, aliasing, and blurring.
MRA methods mainly include smoothing filter-based intensity modulation (SFIM), the Laplacian pyramid (LP) transform, the generalized LP (GLP) transform, the curvelet transform, the contourlet transform, the nonsubsampled contourlet transform (NSCT), and the modulation transfer function-GLP (MTF-GLP) transform and its variants. The MRA method decomposes the LRMS and PAN images, fuses them according to certain rules, and generates the fused image through an inverse transform. Compared with the CS method, the MRA method retains more spectral information and reduces spectral distortion, but its spatial resolution is relatively low.
The VO method can be divided into two parts: an energy function and an optimization method. Its core is the optimization of variational models, such as the panchromatic and multispectral image (P+XS) model and non-local variational pansharpening models. Compared with the CS and MRA methods, the VO method has higher spectral fidelity but is more computationally complex.
Convolutional neural networks (CNNs) and generative adversarial networks (GANs) have been widely used in image processing, and some results have been achieved in pansharpening of remote sensing images. Early on, a three-layer CNN for pansharpening (PNN) was designed based on super-resolution reconstruction: LRMS and PAN image pairs are fed into PNN, and the nonlinear mapping of the CNN is used to generate HRMS images. PNN is relatively simple and easy to implement, but it is prone to overfitting. Subsequently, a target-adaptive CNN (TA-CNN) was proposed, which uses a target-adaptive adjustment stage to address data source mismatch and insufficient training data. Yang et al. proposed a deep pansharpening network based on the ResNet module, namely PanNet, which takes the high-frequency information of the LRMS and PAN images as input and outputs the residual between the HRMS and LRMS images. However, PanNet ignores low-frequency information, resulting in spectral distortion. Wei et al. proposed a deep residual pansharpening neural network (DRPNN) built on ResNet blocks. Although DRPNN exploits the powerful nonlinear capabilities of CNNs, the number of samples required grows with the depth of the network to avoid overfitting, and for training in the spatial domain the generalization ability of the model still needs to be improved. Deng et al. proposed the FusionNet model based on the CS and MRA detail injection models, where the injected details are obtained through a deep CNN (DCNN). Unlike other networks, its input is the difference between the PAN image, replicated to the same number of channels, and the LRMS image, so the network can introduce multispectral information and reduce spectral distortion. Hu et al. proposed a multi-scale dynamic convolutional neural network (MDCNN), which mainly contains three modules: a filter generation network, a dynamic convolutional network, and a weight generation network. MDCNN uses multi-scale dynamic convolution to extract multi-scale features of the LRMS and PAN images, and designs a weight generation network to adjust the relationship between features at different scales and improve the adaptability of the network. Although dynamic convolution improves the flexibility of the network, the design is more complex, and the network tends to lose effective detail and spectral information when extracting features of the LRMS and PAN images simultaneously. Wu et al. proposed RDFNet, based on a distributed fusion structure and residual modules, to extract multi-level features of the LRMS and PAN images separately; the MS and PAN features of the corresponding levels are then gradually fused with the fusion result of the previous step to obtain the HRMS image. Although the network exploits the multi-level LRMS and PAN features as much as possible, the depth of the network limits how much detail and spectral information can be recovered. Wu et al. also designed TDPNet based on cross-scale fusion and multi-scale detail compensation. GANs offer great potential for generating images. Shao et al. proposed a supervised conditional GAN containing a residual encoder-decoder, namely RED-cGAN, which enhances sharpening under the constraint of the PAN image. Liu et al. developed a deep CNN-based pansharpening GAN, namely PsGAN, which consists of a dual-stream generator and a discriminator that distinguishes the generated MS images from the reference images. Benzenati et al.
introduced a detail injection GAN (DIGAN) consisting of a dual-stream generator and a relativistic average discriminator. RED-cGAN, PsGAN and DIGAN are supervised methods trained on degraded-resolution data; however, the resulting models do not transfer well to full-resolution data. Ozcelik et al. built a self-supervised learning framework that treats pansharpening as colorization, namely PanColorGAN, which reduces blur through color injection and random-scale downsampling. Li et al. proposed a self-supervised method based on a cycle-consistent GAN trained on reduced-resolution data, which constructs two generators and two discriminators: the LRMS and PAN images are fed into the first generator to obtain the predicted image, which is then fed into the second generator to produce a PAN image consistent with the input PAN. Several unsupervised GANs have been proposed for the case where no reference HRMS image is available. Ma et al. proposed an unsupervised pansharpening GAN (Pan-GAN), which consists of a generator and two discriminators (a spectral discriminator and a spatial discriminator). The generator produces HRMS images from concatenated MS and PAN images. The spectral discriminator judges the spectral information between the HRMS and LRMS images, so that the spectrum of the HRMS data stays consistent with that of the LRMS data. The spatial discriminator judges the spatial information between the HRMS and PAN images, so that the spatial information of the generated HRMS image stays consistent with that of the PAN image. Pan-GAN uses two discriminators to better retain spectral and spatial structure information, solving the ambiguity problem caused by downsampling during supervised training. However, because the input is simply the concatenated MS and PAN images, the detail and spectral information are insufficient. Zhou et al. proposed an unsupervised dual-discriminator GAN (PGMAN), which utilizes a dual-stream generator to generate HRMS images and two discriminators to preserve spectral information and details respectively. Both Pan-GAN and PGMAN are trained directly on the original data without reference images. They obtain good results at full resolution, but the results obtained on reduced-resolution data are not ideal, which indicates poor generalization ability. Although various pansharpening networks have been proposed and achieve certain fusion effects, most models are trained on reduced-resolution data; due to the change in resolution, it is difficult to fuse full-resolution data well, and problems of spectral distortion and loss of detail remain. Furthermore, in detail injection models, details are added directly to the upsampled MS image, ignoring the complex relationship between the MS and PAN images, which is likely to cause spectral distortion or ringing. To address these problems, the unsupervised pansharpening GAN with recursive mixed-scale feature fusion (RMFF-UPGAN) is trained on the observed data without reference images to improve spatial resolution and retain spectral information.
The main contributions of this paper are as follows:
1) A dual-stream trapezoidal branch is designed in the generator to obtain multi-scale information. ResNeXt blocks and residual learning blocks are used to obtain spatial structure and spectral information at four scales.
2) By sequentially executing prior fusion and mixed-scale fusion, a recursive mixed-scale feature fusion structure is designed to generate key information.
3) A compensation information mechanism is designed to reconstruct the key information and compensate for information loss during reconstruction.
4) In order to overcome the distortion caused by ignoring the complex relationship between MS and PAN images, a nonlinear rectification block for the reconstructed information is proposed.
5) Two U-shaped discriminators are designed and a new composite loss function is defined to better retain spectral information and details.

RELATED WORK

MRA-Based Detail Injection Model

The MRA method is a class of image fusion methods that is particularly common in the field of remote sensing. These methods have good multi-scale spatial-frequency decomposition properties, singular structure representation capabilities and visual perception properties. Efficient wavelet filter-bank implementations make it feasible to fuse large-scale remote sensing images. In the MRA approach, the image is first decomposed into low-frequency and high-frequency components by a chosen decomposition method, the high-frequency and low-frequency components are then fused according to a fusion rule, and finally the fused components are inversely transformed and reconstructed to generate the fused image. The MRA-based detail injection model can be represented by a general detail-injection framework.
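A commonly used general form of this framework (standard in the pansharpening literature, written here with ↑M for the upsampled MS image) is

HM_k = ↑M_k + g_k (P − P_L),  k = 1, …, B,

where ↑M_k is the k-th band of the upsampled MS image, P_L is the low-pass component of the PAN image obtained from the MRA decomposition, g_k is the injection gain of band k, and B is the number of MS bands.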

ResNeXt

Xie et al. proposed the ResNeXt structure, an improvement on ResNet. This network uses group convolution to reduce the complexity of the network while improving its expressive ability. The core of ResNeXt is the notion of cardinality, which is used to measure the complexity of the model. ResNeXt shows that, for similar computational complexity and model parameters, increasing the cardinality achieves better expressive power than increasing the depth or width of the network. The ResNeXt structure follows the split-transform-merge idea, but the topology of every branch is the same, which reduces the computational complexity.
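In its standard form, the aggregated residual transformation of ResNeXt is

y = x + Σ_{i=1}^{C} T_i(x),

where C is the cardinality, the transformations T_i(·) all share the same topology, and the residual connection adds the input x back to the aggregated output.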

METHODOLOGY

RMFF-UPGAN is modeled to improve spatial resolution and preserve spectral information. RMFF-UPGAN is trained directly on the original full-resolution data to reduce the impact of resolution changes on the results. The overall architecture of RMFF-UPGAN is shown in Figure 1; it consists of a dual-stream generator and two U-shaped relativistic average discriminators (i.e., U-RaLSDpe and U-RaLSDpa).
In Figure 1, M and P represent the original MS and PAN images respectively, ↑M represents the upsampled MS image, and HM represents the fused image. For the generator, a dual-stream trapezoidal branch is first designed to obtain multi-scale information; the ResNeXt block extracts fine-scale low-level semantic information, and the residual learning blocks extract medium-scale and coarse-scale high-level semantic information, yielding spatial structure and spectral information at four scales. Second, residual learning is used to design a recursive mixed-scale feature fusion subnetwork: the extracted MS and PAN features of the same scale first undergo a prior fusion, mixed-scale fusion is performed on the fine-scale and coarse-scale fusion results, and fusion proceeds sequentially in this manner to construct a recursive mixed-scale fusion structure that finally generates the key information. Then, the key information is reconstructed, and a supplementary information structure is designed for the key-information reconstruction to compensate for the information. Finally, a rectification block for the reconstructed information is established to obtain the fused image, which overcomes the distortion caused by ignoring the complex relationship between MS and PAN images. Two U-shaped discriminators are designed to better preserve spectral information and details. The U-RaLSDpa discriminator distinguishes the details in the HM image from the details in the P image and drives the details in HM to be consistent with those in P. The U-RaLSDpe discriminator distinguishes the spectral information of HM from that of M, and drives the spectral information of HM to be consistent with that of the M image.

Dual-Stream Generator

The designed dual-stream generator consists of a dual-stream trapezoidal multi-scale feature extraction module, a recursive mixed-scale feature fusion module, a dual-stream multi-scale feature reconstruction module and a reconstructed information rectification module. The architecture of each module is detailed as follows.
1) Dual-Stream Trapezoidal Multiscale Feature Extraction (DSTMFE): The DSTMFE branch structure of the generator is shown in Figure 2. It consists of two independent branches, which differs from our previous work TDPNet. We replace the max-pooling operation with Conv4, a convolution with kernel size 4 and stride 2. The top branch extracts four scales of features from the PAN image and the bottom branch extracts four scales of features from the MS image, where P1-P4 denote the four scale features extracted from the PAN image and M1-M4 denote the four scale features extracted from the MS image. The sizes of the four extracted scale features are 256 × 256 × 32, 128 × 128 × 64, 64 × 64 × 128, and 32 × 32 × 256 respectively. Since the PAN and MS images represented by low-level semantic features are the most informative, the grouped-convolution ResNeXt structure, which provides multiple convolution branches, offers a better way to retain information: it increases the cardinality and improves network accuracy while reducing network complexity. Therefore, in order to retain more of the original information and reduce network complexity, ResNeXt modules extract the first-scale features P1 and M1 respectively. At the latter three scales, residual learning blocks and the downsampling operation (i.e., Conv4) extract the P2-P4 and M2-M4 features respectively. The structures of the ResNeXt block and the residual learning block used in RMFF-UPGAN are shown in Figures 3(a) and 3(b). In Figure 3(a), the parameters of the ResNeXt block are 1(4), 1 × 1, 4, where 1(4) denotes the number of channels of the PAN (MS) image, and 1 × 1 and 4 denote the kernel size and the number of convolutions. In Figure 3(b), the leaky ReLU (LReLU) activation is used.

The expressions for extracting features of the MS and PAN images with the ResNeXt module are given in (3) and (4) respectively, and the expressions for extracting features of the MS and PAN images with the residual learning module are given in equations (5) to (8), where i = 2, 3, 4.
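To make this stage concrete, the following PyTorch sketch illustrates the three building blocks described above for one stream of the trapezoidal branch; the internal topology of the ResNeXt block (here a grouped 3 × 3 convolution with cardinality 4), the activation slope and the layer order are assumptions, not the authors' released code.

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Grouped-convolution residual block used for the first-scale features."""
    def __init__(self, in_ch, out_ch=32, groups=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)          # lift 1 (PAN) or 4 (MS) channels
        self.body = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=groups),  # grouped conv = aggregated branches
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, 1))
    def forward(self, x):
        x = self.proj(x)
        return x + self.body(x)                                      # residual aggregation

class ResidualBlock(nn.Module):
    """Plain residual learning block with LReLU, used at the three coarser scales."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

def conv4(in_ch, out_ch):
    """'Conv4' downsampling: kernel size 4, stride 2 (replaces max pooling)."""
    return nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)

# Example: one stream producing the four PAN scales P1-P4.
pan = torch.randn(1, 1, 256, 256)
p1 = ResNeXtBlock(1, 32)(pan)                 # 1 x 32 x 256 x 256
p2 = ResidualBlock(64)(conv4(32, 64)(p1))     # 1 x 64 x 128 x 128
p3 = ResidualBlock(128)(conv4(64, 128)(p2))   # 1 x 128 x 64 x 64
p4 = ResidualBlock(256)(conv4(128, 256)(p3))  # 1 x 256 x 32 x 32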
2) Recursive Mixed-Scale Feature Fusion:
Based on the four-scale MS and PAN features generated in the DSTMFE stage, a recursive mixed-scale feature fusion (RMSFF) subnetwork is designed using residual learning, as shown in Figure 4. It is composed of prior fusion blocks and mixed-scale fusion blocks. For the four-scale features of the MS and PAN images, a prior fusion block (PFB) is designed to aggregate the information of the MS and PAN images. The PFB facilitates the learning of multi-modal information and the preliminary fusion of MS and PAN image features.
The "concatate +Conv3+residual block" mode is used to build the PFB, as shown in Figure 5(a). Conv3 first performs convolution operation, then performs primary fusion with LReLU function, adaptively adjusts the number of channels, and then further fuses the residual block. The kernel size of Conv3 and residual blocks is 3 × 3 with a stride of 1. The numbers of convolution kernels are 32, 64, 128, and 256 respectively. Mixed scale fusion block (MSFB) carries information of different scales, as shown in Figure 5(b). MSFB is constructed using scale transfer block (STB), concatenation, Conv3 and residual block, where Hi represents the fine-scale image and Li +1 represents the coarse-scale image. The STB is shown in Figure 6. STB downsamples the fine-scale image Hi , generates an image of the same scale as Li +1 , and then fuses it with Li +1 . Conv4 is used for downlink sampling, and the numbers are 64, 128, and 256 respectively. The fusion of mixed scales obtains the result of three scales, namely Mix_f 5, Mix_f 9 and Mix_f 13 .
As shown in Figure 4, the PFB first fuses the same-scale features Mi and Pi (i = 1,2,3,4) to generate P_Mi (i = 1,2,3,4). Then, the MSFB fuses the previous fusion result P_Mi (i = 1,2,3) with the next-scale result P_Mi+1 to generate the feature Mix_fi+4 (i = 1,2,3), which has the same scale as P_Mi+1. Mixed-scale information fusion proceeds recursively in this order and finally generates the key information Mix_f13. The entire fusion subnetwork thus constitutes a recursive mixed-scale fusion architecture that exploits MS and PAN image information of different modalities and scales, reducing information loss in the MS and PAN images.
The expressions of the PFB and the MSFB follow directly from the structures in Figures 5(a) and 5(b).
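A minimal PyTorch sketch of the PFB and MSFB as described above (the STB is realized as a Conv4 downsampling); the number of layers inside the residual block is an assumption.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class PFB(nn.Module):
    """Prior fusion: concatenate same-scale MS/PAN features, Conv3 + LReLU, then a residual block."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.LeakyReLU(0.2, inplace=True))
        self.res = ResidualBlock(ch)
    def forward(self, m_i, p_i):
        return self.res(self.conv(torch.cat([m_i, p_i], dim=1)))

class MSFB(nn.Module):
    """Mixed-scale fusion: the STB downsamples the fine-scale input Hi, which is fused with Li+1."""
    def __init__(self, fine_ch, coarse_ch):
        super().__init__()
        self.stb = nn.Conv2d(fine_ch, coarse_ch, 4, stride=2, padding=1)  # STB via Conv4
        self.conv = nn.Sequential(nn.Conv2d(2 * coarse_ch, coarse_ch, 3, padding=1),
                                  nn.LeakyReLU(0.2, inplace=True))
        self.res = ResidualBlock(coarse_ch)
    def forward(self, h_fine, l_coarse):
        return self.res(self.conv(torch.cat([self.stb(h_fine), l_coarse], dim=1)))

# Example: fuse P_M1 (fine) with P_M2 (coarse) to obtain Mix_f5.
p_m1 = PFB(32)(torch.randn(1, 32, 256, 256), torch.randn(1, 32, 256, 256))
p_m2 = PFB(64)(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
mix_f5 = MSFB(32, 64)(p_m1, p_m2)             # 1 x 64 x 128 x 128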
3) Dual-Stream Multiscale Feature Reconstruction: In order to obtain more accurate reconstruction information, the dual-stream multiscale reconstruction (DSMR) subnetwork is designed to reconstruct the key information Mix_f13, as shown in Figure 7.
The two branches reconstruct features at different levels of the same scale. In order to compensate for the information, we design a compensation information mechanism (CIM) for the reconstruction at each scale, shown by the green arrows in Figure 7. For the reconstruction of Mix_f13, the prior fusion results from the RMSFF stage whose scales are the same as, or finer than, the information to be reconstructed are introduced into the CIM, together with the mixed-scale fusion results of the same scale as the information to be reconstructed. The upper branch uses the reconstruction result of the previous step and the CIM to generate multi-scale information through multi-scale reconstruction blocks (MRB). The bottom branch uses the reconstruction result of the previous step, the upper-branch result, the prior fusion results of Mix_f13, and the CIM to generate multi-scale information. The reconstruction results of the upper branch, M_R2 and M_R4, provide supplementary information for the reconstruction of M_R3 and M_R5 respectively. This multi-scale information gradually generates the final reconstruction information TR.
The MRB is shown in Figure 5(c). Relative to the scale of the information to be reconstructed, H denotes finer-scale information, S denotes same-scale information, and L denotes coarser-scale information. Before reconstruction, the multi-scale information is converted to the same scale: coarse-scale information is converted into fine-scale information through a deconvolution operation, and fine-scale information is converted into coarse-scale information through a downsampling operation (the STB in Figure 6). The Conv3 and residual learning blocks used in the MRB have a kernel size of 3 × 3 and a stride of 1, with 128, 64, and 32 kernels respectively.
The proposed DSMR structure reuses the extracted underlying features for reconstruction through multi-scale skip connections. The underlying features contain rich details such as edges and contours, which can reduce the loss of details. This not only reduces the loss of details in PAN images and MS images, but also improves spatial resolution.
4) Reconstructed Information Rectification: Owing to the physical imaging characteristics of different sensors, the relationship between MS and PAN images is very complex. The band ranges of the MS and PAN images do not completely overlap, and a linear combination of MS bands cannot accurately represent the PAN image. The detail injection model directly adds the injected details to the upsampled MS image, as in expression (1); it ignores the complex relationship between PAN and MS images, which may lead to spectral distortion. Therefore, we design a "concatenate + Conv1 + conv(3 × 3)" pattern to build a simple reconstructed information rectification block (RIRB), which constructs a nonlinear injection relationship. The RIRB is shown in the orange box in Figure 7. The kernel size of Conv1 is 1 × 1 and the number of kernels is 12, followed by the LReLU function. The kernel size of conv(3 × 3) is 3 and the number of kernels is 4. The HM image is generated by a nonlinear mapping of the ↑M image and the reconstruction information TR.
The overall generator of the pansharpening model thus maps the ↑M and P inputs through the DSTMFE, RMSFF and DSMR stages and the RIRB to produce the fused HM image.
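A minimal PyTorch sketch of the RIRB described above; the channel width of the reconstruction information TR (here 32) is an assumption.

import torch
import torch.nn as nn

class RIRB(nn.Module):
    """Nonlinear rectification: concatenate, Conv1 (12 kernels) + LReLU, then conv 3x3 (4 kernels)."""
    def __init__(self, ms_bands=4, tr_ch=32):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(ms_bands + tr_ch, 12, kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True))
        self.conv3 = nn.Conv2d(12, ms_bands, kernel_size=3, padding=1)
    def forward(self, ms_up, tr):
        # Nonlinear mapping of the upsampled MS image and the reconstruction information TR.
        return self.conv3(self.conv1(torch.cat([ms_up, tr], dim=1)))

hm = RIRB()(torch.randn(1, 4, 256, 256), torch.randn(1, 32, 256, 256))  # 1 x 4 x 256 x 256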

U-Shaped Relativistic Average Least-Squares Discriminator

In order to improve the performance and stability of the pansharpening model, we use a relativistic average discriminator to judge the relative probability between generated and real images, and optimize the model with the least-squares loss, i.e., a relativistic average least-squares discriminator (RaLSD). The architecture of RaLSD is similar to Real-ESRGAN, using a U-shaped structure to enhance the discriminator's capability. The difference is that we use a residual structure to replace the plain convolution operations, and in the skip connections we use the "concatenate + SN(conv1-1) + LReLU" pattern to replace the sum operation to increase the discriminative ability of the network. SN(conv1-1) denotes spectral normalization (SN) applied to a convolution with kernel size 1 and stride 1.
The structure of the proposed U-shaped RaLSD (U-RaLSD) network is shown in Figure 8. The network consists of a spectral discriminator U-RaLSDpe and a detail discriminator U-RaLSDpa, which share the same structure. Figure 8 explains the colored arrows in the U-shaped structure. Except for the convolution of the last layer, all convolutions use the SN operation.
Figures 9(a) and 9(b) show the architectures of the DRB and URB used in the U-shaped structure. In the DRB and URB, we use a stride-2 convolution instead of max pooling for downsampling, i.e., SN(conv3-2) denotes a spectrally normalized convolution with kernel size 3 and stride 2. In addition, we use a stride-2 deconvolution instead of interpolation for upsampling, i.e., SN(deconv3-2) denotes a spectrally normalized transposed convolution with kernel size 3 and stride 2. The FURB operation is a simple fusion, i.e., the "concatenate + SN(conv1-1) + LReLU" pattern, followed by a URB. The U-RaLSDpe discriminator takes as input the original MS image or DHMpa, a spatially degraded version of the HM image, and outputs relativistic probabilities. The U-RaLSDpa discriminator takes as input the original PAN image or DHMpe, the spectrally degraded version of the HM image.
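The following PyTorch sketch shows plausible DRB and URB blocks matching this description; the layout of the residual branch is an assumption, while SN(conv3-2) and SN(deconv3-2) follow the text.

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class DRB(nn.Module):
    """Residual block followed by a spectrally normalized stride-2 convolution (downsampling)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.res = nn.Sequential(
            spectral_norm(nn.Conv2d(in_ch, in_ch, 3, padding=1)), nn.LeakyReLU(0.2, inplace=True),
            spectral_norm(nn.Conv2d(in_ch, in_ch, 3, padding=1)))
        self.down = nn.Sequential(
            spectral_norm(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)),  # SN(conv3-2)
            nn.LeakyReLU(0.2, inplace=True))
    def forward(self, x):
        return self.down(x + self.res(x))

class URB(nn.Module):
    """Spectrally normalized stride-2 transposed convolution (upsampling) followed by a residual block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            spectral_norm(nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2,
                                             padding=1, output_padding=1)),   # SN(deconv3-2)
            nn.LeakyReLU(0.2, inplace=True))
        self.res = nn.Sequential(
            spectral_norm(nn.Conv2d(out_ch, out_ch, 3, padding=1)), nn.LeakyReLU(0.2, inplace=True),
            spectral_norm(nn.Conv2d(out_ch, out_ch, 3, padding=1)))
    def forward(self, x):
        y = self.up(x)
        return y + self.res(y)

x = torch.randn(1, 32, 128, 128)
y = URB(64, 32)(DRB(32, 64)(x))   # down to 64 channels at 64 x 64, back up to 32 channels at 128 x 128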
The expressions of U-RaLSD are given in (12) and (13).
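In the standard relativistic average least-squares formulation, a discriminator D with real samples x_r and generated samples x_f is trained with an objective of the form

L_D = E_xr[ (D(x_r) − E_xf[D(x_f)] − 1)^2 ] + E_xf[ (D(x_f) − E_xr[D(x_r)] + 1)^2 ].

For U-RaLSDpe the real sample is the original MS image and the generated sample is DHMpa; for U-RaLSDpa the real sample is the original PAN image and the generated sample is DHMpe.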

Composite Loss Function

We establish a new composite loss function consisting of a spatial consistency loss function, a spectral consistency loss function, a reference-free loss function and two adversarial loss functions.
The spatial consistency loss constrains the spatial information of the fused image to be consistent with that of the PAN image.

The goal is to integrate the spatial information of PAN images into MS images. Since the reference image does not exist, we utilize the high-frequency information and gradient information of the PAN image to enhance the spatial information of the MS image.
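One plausible instantiation of such a loss, assuming an L1 penalty and a band-averaged intensity image I(HM) of the fused result (the actual operators and weights may differ), is

L_sc = || HF(I(HM)) − HF(P) ||_1 + || ∇I(HM) − ∇P ||_1,

where HF(·) denotes a high-pass filter and ∇ the image gradient.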
In order to maintain the consistency of spectral information between HMt and the original MS image, a spectral consistency loss is used, where Lmc denotes the spectral consistency loss, ds denotes the resolution-reduction operation consisting of a blur operation and a downsampling operation, and Mt is the original MS image.
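A typical form of this constraint, assuming an L1 penalty (the norm actually used may differ), is

L_mc = || ds(HM_t) − M_t ||_1,

i.e., the fused image, once blurred and downsampled back to the MS resolution, should match the original MS image.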
Since there is no reference data, we use the no-reference index QNR to measure the quality of the generated image. The desired QNR value is 1, i.e., the generated image has neither spectral loss nor spatial detail loss; the reference-free loss function is therefore built from QNR. QNR is computed from the spectral distortion index Dλ and the spatial distortion index DS, where B is the number of spectral bands, and Mn and Fn are the nth band of the LRMS image and of the generated HRMS image, respectively. Q is the image quality index.
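For reference, the standard definitions are

QNR = (1 − Dλ)^α · (1 − DS)^β,  usually with α = β = 1,
Dλ = ( 1/(B(B − 1)) · Σ_{n=1..B} Σ_{m≠n} | Q(Fn, Fm) − Q(Mn, Mm) |^p )^{1/p},
DS = ( 1/B · Σ_{n=1..B} | Q(Fn, P) − Q(Mn, P_d) |^q )^{1/q},

where P_d is the PAN image degraded to the MS resolution and p = q = 1 in the usual setting. The universal image quality index between two images x and y is

Q(x, y) = 4 σ_xy μ_x μ_y / [ (σ_x^2 + σ_y^2)(μ_x^2 + μ_y^2) ],

with μ, σ^2 and σ_xy the means, variances and covariance. Given that the desired QNR value is 1, the reference-free loss is naturally of the form L_QNR = 1 − QNR.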
We use the relativistic average least-squares (RaLS) loss to optimize the adversarial model and improve its performance and stability. The generator has one adversarial loss term for each of the U-RaLSDpe and U-RaLSDpa discriminators.
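In the relativistic average least-squares setting, the corresponding generator adversarial term for a discriminator D with real samples x_r and generated samples x_f has the standard form

L_G_adv = E_xr[ (D(x_r) − E_xf[D(x_f)] + 1)^2 ] + E_xf[ (D(x_f) − E_xr[D(x_r)] − 1)^2 ],

and one such term is used for each of the U-RaLSDpe and U-RaLSDpa discriminators.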


Origin blog.csdn.net/weixin_43690932/article/details/132627413