【arXiv2309】RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework

RingMo-lite: A Remote Sensing Multi-task Lightweight Network with CNN-Transformer Hybrid Framework, arXiv 2309

Paper: https://arxiv.org/abs/2309.09003

Code: Not yet open source

MindSpore/RingMo-Framework

Summary

In recent years, remote sensing (RS) vision foundation models such as RingMo have achieved excellent performance on various downstream tasks. However, their high demand for computing resources limits their use on edge devices, so a more lightweight foundation model is needed to support on-orbit RS image interpretation. Existing methods struggle to be lightweight while remaining general across RS image interpretation tasks. The reason is that RS images contain complex high- and low-frequency spectral components, which makes a traditional single CNN or vision Transformer unsuitable for the task.
Therefore, this paper proposes RingMo-lite, an RS multi-task lightweight network with a CNN-Transformer hybrid framework, which exploits the frequency-domain characteristics of RS images to optimize the interpretation process. Through a dual-branch structure, it uses the Transformer module as a low-pass filter to extract the global features of the RS image and the CNN module as a stack of high-pass filters to extract fine-grained details. Furthermore, in the pre-training stage, the proposed frequency-domain masked image modeling (FD-MIM) combines the high- and low-frequency characteristics of each image patch to capture the latent feature representations in RS data.

As shown in the figure, compared with RingMo, the proposed RingMo-lite reduces parameters by more than 60% across various RS image interpretation tasks, with an average accuracy drop of less than 2% in most scenarios, and it achieves SOTA performance compared with models of similar size.

Introduction

Motivation

The emergence of the RingMo large RS model effectively addresses the insufficient generalization of existing methods. However, it demands substantial computing and storage resources, is not flexible enough, and is difficult to deploy on edge servers or terminals. This paper therefore aims to design a lightweight foundation model.

In general-purpose computer vision, there are three types of approaches to building lightweight vision foundation models:

  • Knowledge distillation: a form of transfer learning, but it requires an additional teacher model.
  • Neural architecture search (NAS): automatically searches for a suitable network structure, but requires a lot of computing resources and processing time.
  • Network structure design: the outcome depends on the specific design, but it can achieve good results with a modest amount of computation.

Examples of frequency-domain comparisons between specific target areas and large-scale scene areas in different RS scenarios. The 3D frequency-domain plot in the second row is computed from the spectral components; points closer to the center represent low frequencies and points closer to the periphery represent high frequencies. The third and fourth rows show the images after high-pass and low-pass filtering, respectively.
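
The frequency-domain views described in this caption can be reproduced on a toy image. The following is a minimal sketch, assuming a grayscale image, an ideal circular filter, and a hypothetical cutoff `radius`; the function names and the toy scene are illustrative and do not come from the paper.

```python
import numpy as np

def centered_spectrum(img):
    """Log-magnitude 2D spectrum with the zero frequency shifted to the center."""
    return np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(img))))

def radial_mask(shape, radius):
    """Boolean mask that is True within `radius` of the spectrum center."""
    h, w = shape
    yy, xx = np.ogrid[:h, :w]
    return (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2

def frequency_filter(img, radius, keep="low"):
    """Ideal low-pass (keep='low') or high-pass (keep='high') filter."""
    spec = np.fft.fftshift(np.fft.fft2(img))
    mask = radial_mask(img.shape, radius)
    if keep == "high":
        mask = ~mask
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))

# Toy scene: a smooth gradient (large-scale area) plus a small sharp object.
img = np.linspace(0.0, 1.0, 64)[None, :] * np.ones((64, 1))
img[30:34, 30:34] += 1.0

low = frequency_filter(img, radius=8, keep="low")    # keeps the smooth background
high = frequency_filter(img, radius=8, keep="high")  # keeps edges of the small object
```

Because the two masks are complementary, `low + high` recovers the original image, which is a quick sanity check on the decomposition.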

RS image interpretation poses two challenges:

  • First, remote sensing images cover different resolutions and azimuth ranges, and the distribution of objects is complex. As a result, they often contain both specific target areas and large-scale ground objects, with large scale differences between them. The pixels of densely packed small objects change drastically in the spatial dimension, while the pixels of large ground objects change more uniformly and slowly. These multi-scale differences pose a major challenge to the generalization ability of the model.
  • Second, different remote sensing interpretation tasks tend to focus on different target regions. For example, scene classification involves a wide range of spatial scales and therefore needs more global information. In contrast, RS target detection must pay more attention to the local details of targets such as aircraft, ships, and vehicles. The pixel changes of key objects in an RS image have corresponding representations in the frequency domain, where different frequencies reflect how intensely the features change. These differences in high- and low-frequency information partly determine interpretation accuracy in different downstream tasks.

Although many network designs combine CNNs and Transformers, they mainly use CNNs to replace some of the Transformer blocks in order to reduce computation. Most existing methods overlook the respective strengths of CNNs and Transformers in extracting high- and low-frequency information from RS images.

In summary, this paper proposes RingMo-lite, a new lightweight foundation model suitable for various remote sensing image interpretation tasks. First, to fully extract both the detailed features of specific target areas and the global features of large-scale scenes, the paper designs a lightweight CNN-Transformer dual-branch hybrid architecture. Specifically:

  • The Transformer structure establishes global relationships and long-range dependencies through the self-attention mechanism, enabling a deeper understanding of the structure and semantics of an image. In the frequency domain of the input image, the Transformer can therefore be regarded as a low-pass filter that extracts low-frequency information, which suits large-scale ground features.
  • The CNN architecture focuses on local details within the sliding convolution window through matrix operations. The CNN branch therefore aims to further alleviate the spatial position bias and capture local features such as textures and details. In the frequency domain, a CNN can be viewed as a superposition of multiple high-pass filters, which suits extracting high-frequency information and handling specific targets.

Combining the strengths of the two structures, the proposed dual-branch block decouples the hybrid structure along the channel dimension, makes comprehensive use of the high- and low-frequency information in RS images, and improves interpretation accuracy.

Second, this paper designs frequency-domain masked image modeling (FD-MIM), tailored to the high- and low-frequency information of RS images, and uses this self-supervised strategy to improve the pre-training of lightweight foundation models. FD-MIM matches the proposed CNN-Transformer hybrid framework: it helps reconstruct image details better under masking and helps the lightweight model learn rich feature representations suited to different downstream tasks.

Contributions

  • To achieve lightweight on-orbit interpretation, this paper proposes RingMo-lite, a dual-branch CNN-Transformer hybrid framework suitable for various RS image interpretation tasks. The method takes full account of the high- and low-frequency information of remote sensing images and tasks, improving interpretation accuracy.
  • Considering the frequency-domain characteristics of RS object areas, this paper designs the FD-MIM self-supervised pre-training strategy, which helps the proposed framework learn richer feature representations and improves its generalization ability on downstream tasks.
  • Compared with RingMo, RingMo-lite has more than 60% fewer parameters across various RS image interpretation tasks, with an average accuracy drop of less than 2%. Compared with models of the same size, RingMo-lite achieves SOTA performance on four downstream tasks: RS image classification, target detection, semantic segmentation, and change detection.

Method

RingMo-lite Network Framework

As shown in the figure, the input image is first split into non-overlapping patches (size 4 × 4) by patch partitioning, and each patch is treated as a token. These tokens are fed into a linear embedding layer. The image representation is then obtained through four stages of processing. Each stage contains a different number of high- and low-frequency information fusion blocks (FIFBs); the number per stage follows the (2, 2, 6, 2) configuration of Swin-Tiny.
Patch merging layers are inserted between stages to reduce the number of tokens as the network deepens. Each FIFB is divided into a low-frequency (LF) branch and a high-frequency (HF) branch. To make the best use of the feature extraction capabilities of CNNs and Transformers, the input features of an FIFB are sent to both branches to capture low-frequency and high-frequency information respectively, and the results are fused and passed to the next block or patch merging layer.

  • The LF branch follows the main structure of Swin Transformer and obtains global features.
  • The HF branch further divides the input features into two parts and uses CNN to extract detailed features.
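
The stage layout described above can be tabulated with a few lines of arithmetic. This sketch assumes the standard Swin-Tiny settings (a 224 × 224 input, embedding dimension 96, and patch merging that halves each spatial side while doubling the channels). The source only states the 4 × 4 patch size and the (2, 2, 6, 2) depths, so the other values and the function name `stage_layout` are assumptions.

```python
def stage_layout(image_size=224, patch_size=4, depths=(2, 2, 6, 2), embed_dim=96):
    """Return per-stage block count, token count, and channel width."""
    layout = []
    side = image_size // patch_size  # tokens per side after patch partitioning
    dim = embed_dim
    for i, depth in enumerate(depths):
        if i > 0:
            # Patch merging between stages (assumed Swin behavior):
            # 2x2 neighboring tokens are merged, channels are doubled.
            side //= 2
            dim *= 2
        layout.append(
            {"stage": i + 1, "blocks": depth, "tokens": side * side, "channels": dim}
        )
    return layout

for row in stage_layout():
    print(row)
```

Under these assumptions the token count shrinks by a factor of four at each stage boundary, which is why deeper stages can afford more blocks.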

Network details

High-Low Frequency Information Fusion Block (FIFB)

Revisiting ViT and CNN: ViT uses multi-head self-attention (MSA) to exchange information between non-overlapping tokens. Acting as a low-pass filter, MSA is good at modeling long-range dependencies and capturing low-frequency information. However, its spatial smoothing of feature maps tends to attenuate high-frequency signals, resulting in feature representations dominated by low-frequency information. In contrast, a CNN uses local convolutions (Convs) within the receptive field to obtain local information. Unlike MSA, Convs act as high-pass filters that can effectively extract high-frequency representations of images. MSA and Convs are therefore complementary: MSA captures global dependencies and low-frequency information, while Convs preserve local details and high-frequency information.
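
The filter intuition can be illustrated with two hand-picked 1D kernels. This is a simplified sketch, not the paper's analysis: the averaging kernel stands in for one row of softmax attention weights (non-negative, summing to 1), and the Laplacian-style kernel stands in for an edge-sensitive convolution. Learned attention maps and convolution kernels are not restricted to these shapes, and `frequency_response` is a hypothetical helper.

```python
import numpy as np

def frequency_response(kernel, n=64):
    """Magnitude response of a 1D filter, from DC (index 0) to Nyquist (index n // 2)."""
    return np.abs(np.fft.rfft(kernel, n=n))

# Averaging weights: non-negative and summing to 1, like a row of softmax attention.
averaging = np.array([1.0, 1.0, 1.0]) / 3.0
# Difference weights: summing to 0, like an edge-detecting convolution kernel.
difference = np.array([-1.0, 2.0, -1.0])

avg_resp = frequency_response(averaging)
diff_resp = frequency_response(difference)

print("averaging  DC / Nyquist:", avg_resp[0], avg_resp[-1])
print("difference DC / Nyquist:", diff_resp[0], diff_resp[-1])
```

The averaging kernel passes the constant (DC) component unchanged and attenuates the highest frequency, while the difference kernel removes the constant component and amplifies the highest frequency.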

Frequency features in remote sensing tasks: Typically, the global structure of scenes and objects conveys the low-frequency information in an image, while local spatial details such as edges and textures appear as high-frequency information. Remote sensing images inherently contain both small objects and extensive geographical features. The pixels of densely distributed small-scale objects vary sharply in space, while large-scale features change relatively uniformly and slowly. Among RS image interpretation tasks, scene classification emphasizes extracting comprehensive global information, while target detection focuses on capturing details, and more fine-grained tasks require still more local detail. Based on these considerations, the paper proposes FIFB, which combines high- and low-frequency information and thereby improves the model's multi-task generalization ability on RS images.

FIFB: As shown in Figure 4, the input features of FIFB, $F \in \mathbb{R}^{N \times N \times C}$, are fed to two different branches: the LF branch and the HF branch. The LF branch follows the Swin Transformer architecture to capture long-range dependencies.

The HF branch splits the input features into two parts, $F_1 \in \mathbb{R}^{N \times N \times C/2}$ and $F_2 \in \mathbb{R}^{N \times N \times C/2}$, and extracts high-frequency information from them by leveraging the sharp sensitivity of the max filter and the detail perception of Convs, respectively. Concatenating the two processed parts yields a comprehensive feature map $H$ rich in high-frequency information.
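
The HF branch described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the 3 × 3 window size, edge padding, depthwise convolution, and fixed Laplacian kernels are all hypothetical choices, and any linear layers the real block may contain are omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def windows3x3(x):
    """3x3 neighborhoods of an (N, N, C) map, returned with shape (N, N, C, 3, 3)."""
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
    return sliding_window_view(padded, (3, 3), axis=(0, 1))

def max_filter(x):
    """Per-channel 3x3 max filter with stride 1."""
    return windows3x3(x).max(axis=(-2, -1))

def depthwise_conv(x, kernels):
    """Per-channel 3x3 cross-correlation; kernels has shape (C, 3, 3)."""
    return np.einsum("ijckl,ckl->ijc", windows3x3(x), kernels)

def hf_branch(F, kernels):
    """Split channels in half, filter each half, and concatenate into H."""
    C = F.shape[-1]
    F1, F2 = F[..., : C // 2], F[..., C // 2 :]
    return np.concatenate([max_filter(F1), depthwise_conv(F2, kernels)], axis=-1)

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 4))  # N = 8, C = 4
laplacian = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)
kernels = np.stack([laplacian] * 2)  # one hypothetical kernel per F2 channel
H = hf_branch(F, kernels)
print(H.shape)  # same shape as F
```

In a trained model the convolution kernels would be learned rather than fixed; the Laplacian is used here only because its high-pass behavior is easy to verify.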

The output of the FIFB process is the fusion of low-frequency features L and high-frequency features H:

 

Frequency Domain Masked Image Modeling

A common practice is to design a pre-training strategy that captures both local and global image features in order to improve the efficiency and generalization ability of the model. One promising approach is to use masking to emphasize specific features in an image. Masked image modeling (MIM) can exploit inherent relationships in the data to guide the model toward a better understanding of complex RS images. By exploiting the structure of the input image and the correlation between neighboring pixels, it enables the model to learn meaningful representations without explicit labels.

Many MIM methods adopt a random masking strategy: they select a certain proportion of image patches, mask them, and train the model to complete them. RS images have unique imaging mechanisms, with more complex backgrounds and many small-scale objects, which limits the effectiveness of purely random masking in RS image interpretation. In this context, the paper introduces high- and low-frequency-domain masked image modeling (FD-MIM), which matches the proposed CNN-Transformer hybrid framework. The method extracts latent representations of the masked image and uses them to reconstruct the original signal of the masked region. By appropriately retaining high- and low-frequency information in complex RS images, it helps reconstruct image details better under masking. The learned encoder is suitable for various optical RS downstream tasks, and an L1 regression loss measures the difference between the reconstructed and original pixels.

  • First, FD-MIM randomly selects 50% of the image patches from each RS image in the dataset. Frequency-domain analysis is performed on these patches using the Discrete Fourier Transform (DFT), and each selected patch is classified as high-frequency or low-frequency. The classification compares the proportion of high-frequency content with the proportion of low-frequency content within the patch: patches with predominantly high-frequency content are designated high-frequency patches, and patches with predominantly low-frequency content are designated low-frequency patches.
  • To further emphasize the high- and low-frequency information, the paper applies high-pass filtering to the high-frequency patches and low-pass filtering to the low-frequency patches. The former enhances the distinctive characteristics of the high-frequency portion, while the latter helps preserve important low-frequency information. This step separates the frequency components more cleanly while maintaining key frequency-domain properties.
  • Finally, to enhance the robustness and generalization ability of the model, the paper introduces random pixel masking: pixels are randomly selected from the frequency-separated patches and masked. This strategy makes the reconstruction task harder during training, helping the model focus on learning the most relevant and discriminative features.
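
The three steps above can be sketched end to end. This is a rough illustration, not the authors' pipeline. It assumes a square grayscale image, an ideal circular filter with a hypothetical cutoff `radius`, a classification rule that compares spectral energy outside and inside that cutoff with the DC term ignored, a hypothetical `pixel_mask_ratio`, and an L1 loss evaluated only on masked pixels.

```python
import numpy as np

def fd_mim_mask(img, patch=16, select_ratio=0.5, radius=2,
                pixel_mask_ratio=0.5, seed=0):
    """Return the masked image, the boolean pixel mask, and per-patch labels."""
    rng = np.random.default_rng(seed)
    out = img.astype(float).copy()
    mask = np.zeros(img.shape, dtype=bool)
    grid = img.shape[0] // patch
    yy, xx = np.ogrid[:patch, :patch]
    low = (yy - patch // 2) ** 2 + (xx - patch // 2) ** 2 <= radius ** 2
    # Step 1: randomly select a fraction of the patches.
    chosen = rng.choice(grid * grid, size=int(select_ratio * grid * grid),
                        replace=False)
    labels = {}
    for idx in chosen:
        r, c = divmod(int(idx), grid)
        sl = (slice(r * patch, (r + 1) * patch), slice(c * patch, (c + 1) * patch))
        spec = np.fft.fftshift(np.fft.fft2(out[sl]))
        energy = np.abs(spec) ** 2
        energy[patch // 2, patch // 2] = 0.0  # ignore the DC term (assumption)
        is_high = energy[~low].sum() > energy[low].sum()
        # Step 2: high-pass filter high-frequency patches, low-pass the others.
        keep = ~low if is_high else low
        filtered = np.real(np.fft.ifft2(np.fft.ifftshift(spec * keep)))
        # Step 3: random pixel masking inside the frequency-separated patch.
        drop = rng.random((patch, patch)) < pixel_mask_ratio
        filtered[drop] = 0.0
        out[sl] = filtered
        mask[sl] = drop
        labels[(r, c)] = "high" if is_high else "low"
    return out, mask, labels

def l1_loss(pred, target, mask):
    """Mean absolute error over masked pixels (masked-only is an assumption)."""
    return np.abs(pred - target)[mask].mean()

# Toy image: smooth gradient on the left, noise-like texture on the right.
rng = np.random.default_rng(1)
img = np.empty((64, 64))
img[:, :32] = np.linspace(0.0, 1.0, 32)[None, :]
img[:, 32:] = rng.random((64, 32))

masked, mask, labels = fd_mim_mask(img)
print(labels)
```

On this toy image, selected patches from the smooth half are labeled low-frequency and those from the textured half are labeled high-frequency, mirroring the split between large-scale ground features and small dense objects.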

Experiments

RS scene classification

RS target detection

 

 

RS semantic segmentation 

RS change detection 

RingMo and RingMo-lite comparison


Origin blog.csdn.net/m0_61899108/article/details/133550980