[Bidirectional transfer ConvLSTM network: Pan-Sharpening]

D2TNet: A ConvLSTM Network With Dual-Direction Transfer for Pan-Sharpening

(D2TNet: Bidirectional transfer ConvLSTM network for pan-sharpening)

In this paper, we propose an efficient convolutional long short-term memory (ConvLSTM) network with bidirectional transfer for pan-sharpening, called D2TNet. We design a specially structured ConvLSTM network that enables two-way communication of both multi-scale and multi-level information. On the one hand, because spatial information is sensitive to scale and spectral information is sensitive to level, extracting multi-scale and multi-level information allows fuller use of the source images. On the other hand, ConvLSTM is used to capture the strong dependence between multi-scale and multi-level information. In addition, we introduce a multi-scale loss that lets the different scales reinforce each other, thereby producing high-resolution multispectral images that are closer to the ground truth.

INTRODUCTION

Thanks to satellites' powerful Earth-observation capabilities, the remote sensing images captured by their sensors contain rich ground information. Low-resolution multispectral (LRMS) images and panchromatic (PAN) images are two common types of captured images. The former has high spectral resolution and low spatial resolution, while the latter has the opposite characteristics. To meet the needs of practical applications such as land survey, environmental monitoring, and target detection, pan-sharpening fuses the captured LRMS and PAN images to produce the desired high-resolution multispectral (HRMS) images. Because the generated HRMS images combine the strengths of both inputs, pan-sharpening has become a research hotspot in remote sensing image processing.
Over the past few decades, the pan-sharpening field has received increasing attention, and various traditional methods have been proposed. Generally speaking, traditional pan-sharpening methods can be roughly divided into four categories: component substitution (CS) methods, multiresolution analysis (MRA) methods, CS/MRA hybrid methods, and model-based methods. Due to the complexity of ground objects and the diversity of spectral characteristics captured by different sensors, the hand-crafted designs of traditional methods make it difficult to establish an accurate mapping between the source images and the target HRMS image.
Fortunately, in the past few years, owing to the powerful feature extraction capabilities and nonlinearity of neural networks, deep learning has attracted wide attention and has been introduced into a variety of tasks, including image fusion. Pan-sharpening methods based on deep learning can be divided into convolutional neural network (CNN)-based methods and generative adversarial network (GAN)-based methods. Most CNN-based methods build networks to extract features, fuse features, and reconstruct the HRMS image. Encoder-decoder networks, dense convolutional networks, and residual convolutional networks are commonly used structures. On this basis, GAN-based methods introduce a generator and a discriminator and realize the fusion process through the min-max game between them, which can even be carried out without ground truth. After training on a large amount of data, both CNN-based and GAN-based methods can establish a robust nonlinear mapping from source images to target images, thereby overcoming the limitations of traditional methods and achieving state-of-the-art performance.
Although current deep-learning-based pan-sharpening algorithms have achieved impressive results, some pressing issues remain. On the one hand, most previous works directly feed the original-size LRMS and PAN images into the network. However, the characteristics of different ground objects captured by different sensors differ substantially, so images at different scales can contain partially non-overlapping information. By considering multi-scale information and enhancing the interaction between scales, the source images can be exploited more fully, so that the fusion result contains more feature information. On the other hand, although some pan-sharpening methods consider multi-scale information, they tend to associate information at different scales and levels through dense blocks or Resblocks. However, information at different scales and levels is strongly interdependent; passing information indiscriminately can increase invalid or redundant information while weakening the contribution of valid information. How to transfer information correctly is therefore a question worth considering.

To inherit the advantages of deep learning and address the above problems, we propose an effective bidirectional-transfer pan-sharpening method called D2TNet. Specifically, the two-way transfer covers both multi-scale and multi-level information interaction. Exploiting the strength of convolutional long short-term memory (ConvLSTM) in handling long-term information dependencies, we design a figure-eight ConvLSTM network, as shown in Figure 1, to better realize this two-way information interaction. This special structure utilizes the three gates in ConvLSTM to achieve long-term information interaction between different scales and levels. It makes fuller use of the original information, thereby yielding richer spatial details and more realistic spectral characteristics. In addition to the figure-eight ConvLSTM structure, we also introduce a three-scale loss into the total loss function, making the spatial and spectral distributions of the generated HRMS image closer to the ground truth.
Our contributions can be summarized as follows:
1) An effective bidirectional information-transfer pan-sharpening method based on a specific ConvLSTM structure is proposed, which realizes long-term information interaction between different scales and levels, thereby making fuller use of the original information and obtaining richer spatial details and more realistic spectral characteristics.
2) A new loss function containing a three-scale loss is introduced to enhance the consistency of the fusion results with the ground truth.
3) Extensive experiments are conducted to verify that our D2TNet outperforms state-of-the-art methods while having high efficiency.

RELATED WORK

Deep-Learning-Based Pan-Sharpening Methods

In recent years, with the development of deep learning in image processing, pan-sharpening methods based on deep learning have become increasingly popular. These methods can be roughly divided into CNN-based and GAN-based methods. Inspired by the CNN-based image super-resolution method SRCNN, Masi et al. introduced PNN to solve the pan-sharpening problem; this was the first CNN-based pan-sharpening method. It stacks the interpolated LRMS and the original PAN images and takes them as input to generate an HRMS image. The PNN architecture is simple and efficient. In addition, Liu et al. proposed TFNet, a CNN with stronger feature-extraction capability: it constructs an encoder-decoder network to implement feature extraction, feature fusion, and reconstruction. Xu et al. proposed SDPNet, which focuses on spatial and spectral information; specifically, a spatial encoder-decoder and a spectral encoder-decoder are designed to capture the distinctive features of the two source images. In addition, Wang et al. introduced MPNet based on ConvLSTM; it uses the original ConvLSTM to fuse features of LRMS and PAN images at different levels but does not fully exploit ConvLSTM to drive the fused image to contain more effective information. Beyond the above methods, there are also methods based on multi-scale features. Wang et al. proposed MSDRN, a multi-scale deep residual network: it downsamples the concatenated source images to different scales and connects them through up-convolution and concatenation. Xu et al. proposed a multi-scale network called CPNet: the PAN image is downsampled by factors of 2 and 4, and the LRMS image is upsampled by the corresponding factors, yielding three pairs of inputs at different scales. In our approach, we follow CPNet's way of obtaining multi-scale images. However, CPNet connects images at different scales through pixel shuffle, a hand-crafted choice that risks information loss. Later, Jin et al. proposed a pan-sharpening method that uses a Laplacian pyramid to separate images into different scales; for each scale, a fusion CNN is designed to obtain the fusion result. However, it only correlates multi-scale features through shared parameters, which is weak and insufficient to fully exploit multi-scale features. Moreover, the multi-level information transfer in the above methods is implemented entirely through dense blocks or Resblocks, ignoring the relationship between shallow and deep layers.
Unlike CNN-based methods, GAN-based methods achieve fusion through an adversarial process between a generator and a discriminator. Liu et al. proposed PSGAN, which introduced GANs into pan-sharpening for the first time: a generator fuses the PAN and MS images, and a discriminator is used to reduce the gap between the fused image and the ground truth. Later, Shao et al. proposed RED-cGAN, which uses a residual encoder-decoder network; its conditional discriminator further enriches the spatial information in the final results. Furthermore, Ma et al. proposed Pan-GAN with dual discriminators, an unsupervised method that requires no ground truth. The dual discriminators make the result resemble both a PAN image and an LRMS image, so that it possesses both the spatial information of the PAN image and the spectral information of the LRMS image.
In the above methods, the multi-scale and multi-level information of the two source images is either not exploited or not properly correlated, either of which may lead to spectral or spatial distortion. This paper proposes a new method that enables effective multi-scale and multi-level information communication, so that the original information can be exploited more fully.

Convolutional Long Short-Term Memory

Long short-term memory (LSTM) is a network that excels at long-sequence memory problems. Compared with an ordinary network structure, LSTM changes the internal structure by adding three gates: an input gate, an output gate, and a forget gate. The input gate applies a nonlinear transformation to two elements (the output of the previous timestamp and the input of the current timestamp) to obtain a new input. The forget gate selectively updates the state vector based on the states of the previous and current timestamps. The output gate controls the output of the current timestamp according to the updated cell state.
When the sequential data are three-dimensional images, it is difficult for an ordinary LSTM to describe the complex spatial relationships among pixels. To better describe the spatiotemporal relationship between images, ConvLSTM was introduced. It was first proposed by Xingjian et al., who verified experimentally that ConvLSTM outperforms LSTM in capturing spatiotemporal relationships.
Owing to ConvLSTM's success in conveying image information, it is widely used in image processing, including image classification and image segmentation. In pan-sharpening, only Wang et al. have introduced a ConvLSTM-based method, MPNet. However, they used the original ConvLSTM to fuse different levels of LRMS and PAN features and did not fully exploit ConvLSTM to drive the fused image to contain more effective information. Since ConvLSTM can reasonably filter useful information and pass it to the next timestamp, we utilize it to enhance information communication across multiple scales and multiple levels.
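To make the mechanism concrete, the following is a minimal, generic ConvLSTM cell sketch in PyTorch. It is not the authors' implementation: it computes the four gate pre-activations with a single convolution over the concatenated input and hidden state, and it omits the peephole terms of the original formulation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: LSTM gates computed with convolutions,
    so hidden and cell states keep their spatial layout (B, C, H, W)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)

    def forward(self, x, h_prev, c_prev):
        # Concatenate current input and previous hidden state along channels.
        z = self.gates(torch.cat([x, h_prev], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c_prev + i * g      # forget old state, write new candidate
        h = o * torch.tanh(c)       # gated output / next hidden state
        return h, c
```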

PROPOSED METHOD

Problem Formulation

On the one hand, it is necessary to extract hierarchical features at different levels because they help represent the original information more comprehensively. Moreover, the deep low-frequency features extracted by a CNN can be regarded as further refinements of the shallow high-frequency features, so the deep layers depend strongly on the shallow layers. Therefore, we design the multi-level ConvLSTM to capture the relationship between them, thereby learning more accurate hierarchical spectral features. On the other hand, because spatial details and spectral characteristics differ across scales, correlating multi-scale information helps preserve richer spatial details and more realistic spectral characteristics. Furthermore, low-scale and high-scale information are interdependent for the same reason. Therefore, we also design a multi-scale ConvLSTM to correlate multi-scale features.
Therefore, to better utilize the original information and effectively exchange multi-scale and multi-level information, we take advantage of ConvLSTM's strength in information transfer and propose D2TNet, a pan-sharpening method with bidirectional (multi-scale and multi-level) transfer through a ConvLSTM network.
The entire framework is shown in Figure 2. We first generate multi-scale images to obtain hierarchical information. Specifically, the LRMS image is upsampled to obtain LRMS↑2 and LRMS↑4, and the PAN image is downsampled to obtain PAN↓2 and PAN↓4. The three groups of same-scale images are concatenated and fed into the three-stream (top, middle, and bottom) network, respectively, as shown in Figure 2.
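A minimal sketch of this input preparation, assuming a PAN/MS resolution ratio of 4 and bicubic resampling (the exact resampling kernel used by the paper is not specified here):

```python
import torch
import torch.nn.functional as F

def build_multiscale_inputs(lrms, pan):
    """Build the three same-scale input groups described above.
    lrms: (B, C, h, w) low-resolution multispectral image
    pan:  (B, 1, 4h, 4w) panchromatic image (assumed 4x resolution ratio)."""
    lrms_up2 = F.interpolate(lrms, scale_factor=2, mode='bicubic', align_corners=False)
    lrms_up4 = F.interpolate(lrms, scale_factor=4, mode='bicubic', align_corners=False)
    pan_dn2 = F.interpolate(pan, scale_factor=0.5, mode='bicubic', align_corners=False)
    pan_dn4 = F.interpolate(pan, scale_factor=0.25, mode='bicubic', align_corners=False)
    # Concatenate the same-size pairs along the channel dimension.
    top    = torch.cat([lrms_up4, pan], dim=1)      # full resolution
    middle = torch.cat([lrms_up2, pan_dn2], dim=1)  # half resolution
    bottom = torch.cat([lrms, pan_dn4], dim=1)      # quarter resolution
    return top, middle, bottom
```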

To achieve our goal, we design a figure-eight ConvLSTM network to connect information across different scales and levels. To provide the same type of features to the ConvLSTM network, the convolutional layers before the ConvLSTM share parameters. Furthermore, since our loss function uses the outputs of all three streams, the last convolutional layer also shares parameters to ensure that the middle and bottom streams contribute to generating the HRMS image.
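Continuing the sketch above, parameter sharing simply means the same convolution module (hence the same weights) is applied to all three scales. The channel numbers here are illustrative assumptions (a 4-band MS image plus PAN gives 5 input channels; 32 output channels and a 0.2 LeakyReLU slope are placeholders), not the paper's exact configuration.

```python
import torch.nn as nn

# The same module instance is reused, so all three streams see identical weights.
shared_head = nn.Sequential(
    nn.Conv2d(5, 32, kernel_size=3, padding=1),  # 4 MS bands + 1 PAN (assumed)
    nn.LeakyReLU(0.2),
)

feat_top    = shared_head(top)     # full-resolution features
feat_middle = shared_head(middle)  # half-resolution features
feat_bottom = shared_head(bottom)  # quarter-resolution features
```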

Network Architectures

The final network structure is shown in Figure 2, and the network parameters of the top-stream network are shown in Figure 3.
In fact, in the top, middle, and bottom stream networks, the corresponding convolutional layers have the same numbers of input and output channels; only their spatial sizes differ. For simplicity, we therefore only give the parameters of the top-stream network. The three parameters of Conv(·) denote the kernel size, the number of input channels, and the number of output channels, respectively. Except for the last layer, which uses tanh, the activation functions of all convolutional layers are leaky rectified linear units (LeakyReLU). The three parameters of ConvLSTM(·) denote the number of units, the input channels of the first unit, and the output channels of the last unit, respectively. More specifically, every unit has the same 32 input channels and 32 output channels, which makes it easy to transfer states between multiple scales and levels. Furthermore, residual connections are used throughout the implementation for their learning-efficiency advantages.

For each unit of the ConvLSTM, the internal architecture is shown in Figure 4, and its computation is expressed by (1)-(5), where · denotes element-wise multiplication and * denotes convolution.
In our method, X_t denotes the input of the current unit, C_{t−1} and H_{t−1} denote the cell state and hidden state passed from the previous unit, and H_t is the unit's output. When the unit is the first unit, we set both C_{t−1} and H_{t−1} to zero, which is also called the initial state. From Figure 2 we find that a unit may receive two input states; for example, unit 5 of ConvLSTM1 receives not only the state passed from unit 2 but also the state passed from unit 4. In this case, the state of unit 4 is first upsampled to the size of unit 2, and then all input states are summed to obtain the final input state. The specific operations of each unit follow (1)-(5). First, X_t, H_{t−1}, and C_{t−1} are integrated through convolution into the input gate, so that the effective information of X_t is kept in C_t. Similarly, the same components are fed into the forget gate to screen the information passed from C_{t−1} to C_t, and the output gate controls how much information is output from C_t to H_t.
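Equations (1)-(5) are not reproduced in this copy; the standard ConvLSTM gate formulation, which matches the description of the input, forget, and output gates above (with · for element-wise multiplication and * for convolution), reads as follows. The paper's exact parameterization may differ slightly from this standard form.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \cdot C_{t-1} + b_i\right) &\text{(1)}\\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \cdot C_{t-1} + b_f\right) &\text{(2)}\\
C_t &= f_t \cdot C_{t-1} + i_t \cdot \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) &\text{(3)}\\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \cdot C_t + b_o\right) &\text{(4)}\\
H_t &= o_t \cdot \tanh(C_t) &\text{(5)}
\end{aligned}
```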

Loss Functions

Our loss function contains three parts, corresponding to the three-stream network. Compared with the traditional constraint only on the fused image, this constraint is stronger, making the final fused image closer to the ground truth. The entire loss function can be expressed as:
L = L_top + λ1 · L_middle + λ2 · L_bottom    (6)
where L_top, L_middle, and L_bottom represent the loss functions of these three stream networks, respectively. λ1 and λ2 are used to balance the three parts in (6).

  1. Loss Function of Top-Stream Network: For the top-stream network, we expect the generated HRMS image to be as close to the ground truth as possible, and we constrain its generation from both spectral and spatial perspectives. Specifically, we use the structural similarity (SSIM) index and the Frobenius norm to constrain the similarity of spectral information between the HRMS image and the ground truth, and we use a gradient loss to constrain the similarity of spatial details. In addition, to further constrain the features, we downsample the generated HRMS image to the LRMS size and force its feature information to be consistent. L_top is then defined by equation (7); a hedged sketch of a loss of this form is given after this list.
    Here HRMS denotes the image generated by the top-stream network, which is also the final result. G represents the ground truth, obtained according to the Wald protocol introduced in [31]. H, W, and C respectively denote the height, width, and number of channels of the HRMS image. SSIM(·) denotes the SSIM between two elements. ξ1 and ξ2 are used to balance the four parts in equation (7).
  2. Loss Functions of Middle and Bottom Stream Networks: For the middle and bottom stream networks, we constrain them in the same way as the top-stream network; their loss functions take the same form as (7), applied to the corresponding downscaled images.
    Here HRMS↓2 and HRMS↓4 represent the outputs of the middle-stream and bottom-stream networks, respectively, and G↓2 and G↓4 are the ground truth reduced to half and a quarter of the original size.
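Since equation (7) and the middle/bottom losses are not reproduced above, the following is a hedged Python sketch of losses with the ingredients described: an SSIM term, a Frobenius-norm term, a gradient term, and a downsampled-consistency term for the top stream, plus the same form reused on the downscaled products as in (6). The weights xi1, xi2, lam1, lam2, the bicubic resampling, and the choice of reference for the downsampled term are assumptions, and the `ssim` helper is assumed to come from the third-party pytorch_msssim package.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed third-party SSIM implementation


def gradient_diff(a, b):
    """Mean absolute difference of finite-difference gradients (spatial term)."""
    dax, day = a[..., :, 1:] - a[..., :, :-1], a[..., 1:, :] - a[..., :-1, :]
    dbx, dby = b[..., :, 1:] - b[..., :, :-1], b[..., 1:, :] - b[..., :-1, :]
    return (dax - dbx).abs().mean() + (day - dby).abs().mean()


def loss_top(hrms, gt, lrms=None, xi1=0.1, xi2=0.1):
    """Hedged sketch of a loss with the four ingredients described for (7)."""
    # SSIM term: spectral/structural similarity to the ground truth
    # (images assumed normalized to [0, 1]).
    l_ssim = 1.0 - ssim(hrms, gt, data_range=1.0)
    # Frobenius-norm term, normalized by H * W * C.
    l_fro = torch.norm(hrms - gt, p='fro') ** 2 / hrms[0].numel()
    # Gradient term constraining spatial details.
    l_grad = gradient_diff(hrms, gt)
    loss = l_ssim + l_fro + xi1 * l_grad
    if lrms is not None:
        # Downsampled-consistency term: the fused result reduced to LRMS size.
        hrms_down = F.interpolate(hrms, size=lrms.shape[-2:], mode='bicubic',
                                  align_corners=False)
        loss = loss + xi2 * F.l1_loss(hrms_down, lrms)
    return loss


def total_loss(hrms, hrms_d2, hrms_d4, gt, lrms, lam1=0.5, lam2=0.25):
    """Hedged sketch of equation (6): L = L_top + λ1·L_middle + λ2·L_bottom."""
    g_d2 = F.interpolate(gt, scale_factor=0.5, mode='bicubic', align_corners=False)
    g_d4 = F.interpolate(gt, scale_factor=0.25, mode='bicubic', align_corners=False)
    return (loss_top(hrms, gt, lrms)
            + lam1 * loss_top(hrms_d2, g_d2)
            + lam2 * loss_top(hrms_d4, g_d4))
```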
