【Read the Paper】SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer

Paper: https://ieeexplore.ieee.org/document/9812535
If there is any infringement, please contact the blogger

Introduction

Keywords

  • Swin Transformer
  • Long-range dependencies, global information
  • Cross-domain fusion

A brief introduction

This paper was published in the IEEE/CAA Journal of Automatica Sinica in 2022. Its author is the same as that of FusionGAN, which we are already familiar with.

Simply put, this paper proposes a method based on CNN and Swin Transformer that extracts features containing both local and global information, and then fuses these features within and across domains.

There are several keywords here: local/global information, intra-domain, and cross-domain. Let's go through them first.
The first is local information. The paper uses a CNN to extract it: a convolution only attends to the pixels inside its kernel window, so the features it produces carry local information.
The second is global information. The paper uses the Swin Transformer to extract it: the Swin Transformer can model long-range dependencies, so each feature carries global information.
Intra-domain means the Swin Transformer operation is applied to the infrared image features and the visible image features separately.
Cross-domain means using the K and V of the infrared features with the Q of the visible features (and vice versa) in the Swin Transformer attention, so as to obtain infrared features influenced by the visible features and visible features influenced by the infrared features.

Next, let’s take a closer look at how the author achieved it.

Network Architecture

Overall architecture

[Figure: overall network architecture]
The overall architecture consists of three parts, namely feature extraction (CNN+Swin Transformer), feature fusion (cross-domain fusion and intra-domain information extraction based on Swin Transformer) and image reconstruction.

Feature extraction

[Figure: feature extraction module]

The feature extraction network is shown in the figure above.

Shallow feature extraction consists of two convolutional layers with a kernel size of 3; extracting these shallow features first lets the subsequent fusion and extraction achieve better results.
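As a rough illustration of this shallow feature extraction stage, here is a minimal PyTorch sketch with two 3x3 convolutions. The channel width (60) and the LeakyReLU activation are my own assumptions for the example, not values taken from this post.

```python
import torch
import torch.nn as nn

# Two 3x3 convolutions that turn one source image into shallow (local) features.
# Channel width 60 and LeakyReLU are illustrative assumptions only.
shallow_extract = nn.Sequential(
    nn.Conv2d(1, 60, kernel_size=3, padding=1),
    nn.LeakyReLU(inplace=True),
    nn.Conv2d(60, 60, kernel_size=3, padding=1),
)

x = torch.randn(1, 1, 128, 128)      # e.g. a grayscale infrared image
print(shallow_extract(x).shape)      # torch.Size([1, 60, 128, 128])
```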

Deep feature extraction contains four Swin Transformer layers. Built on top of the shallow features, they extract features containing global information. The architecture here is actually quite simple; the harder part is understanding the Swin Transformer itself (for details, see 【Read the Paper】Swin Transformer). After reading the source code, I found that the author does not seem to split the image into patches, i.e. the patch_size is 1, so the preprocessing for the Swin Transformer is relatively simple. The feature map is first divided into multiple windows, with the window size set to 8 and the shift distance equal to half of window_size; this is repeated until the deep feature extraction is complete.
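To make the windowing concrete, the sketch below shows standard Swin-style window handling under the settings described above: patch_size effectively 1, a window size of 8, and a shift of window_size // 2 implemented as a cyclic roll. This is generic Swin Transformer preprocessing, not the authors' exact code.

```python
import torch

def window_partition(x, window_size=8):
    """Split a (B, H, W, C) feature map into (num_windows*B, window_size**2, C) tokens."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

feat = torch.randn(1, 128, 128, 60)                  # patch_size = 1, so tokens are pixels
regular = window_partition(feat)                     # windows for W-MSA
shifted = window_partition(torch.roll(feat, shifts=(-4, -4), dims=(1, 2)))  # SW-MSA
print(regular.shape, shifted.shape)                  # (256, 64, 60) for each
```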

Feature fusion

[Figure: feature fusion module]

Feature fusion contains two blocks, and the structures of the two blocks are the same, as shown in the figure above.

Each block contains two Swin Transformer blocks. MCA and MSA differ only in name; their internal structure is exactly the same. The difference is that in MSA the Q, K, and V all come from the features of a single image, while in MCA they come from different images. For example, K and V come from the infrared image features and Q comes from the visible image features; after the multi-head attention computation, the infrared feature information is influenced by the visible feature information, so the two kinds of features can be considered fused. This is the fusion module of the paper, and in my opinion the most impressive part of the article.
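To illustrate the MSA/MCA idea, here is a hedged sketch using PyTorch's nn.MultiheadAttention: the attention module is identical in both cases, and only where Q, K, and V come from changes. The token shapes and this particular API are a simplification of mine, not the paper's implementation.

```python
import torch
import torch.nn as nn

dim, heads = 60, 6
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

tokens_ir  = torch.randn(1, 64, dim)   # one 8x8 window of infrared features
tokens_vis = torch.randn(1, 64, dim)   # the matching window of visible features

# Intra-domain (MSA): Q, K, V all come from the same modality.
ir_self, _ = attn(tokens_ir, tokens_ir, tokens_ir)

# Cross-domain (MCA): Q from one modality, K and V from the other (and vice versa).
vis_queries_ir, _ = attn(tokens_vis, tokens_ir, tokens_ir)
ir_queries_vis, _ = attn(tokens_ir, tokens_vis, tokens_vis)
print(vis_queries_ir.shape)            # torch.Size([1, 64, 60])
```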

The two blocks are performed sequentially to complete the fusion of infrared features and visible features.

After this, the features go through a convolutional layer, that is, local information is extracted once more from the features that contain global information.

Image reconstruction

[Figure: image reconstruction module]
The last part, image reconstruction, has two stages: a reconstruction module based on the Swin Transformer and a reconstruction module based on CNN. The paper mentions P Swin Transformer layers; I simply take P as the number of layers. The author uses four Swin Transformer layers to fully capture the global information of the fused features, followed by three convolutional layers that extract local information and reduce the channel dimension back to that of the input, which completes the image fusion.
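As a sketch of the CNN part of this reconstruction stage, here are three 3x3 convolutions that step the fused feature channels back down to the input dimension. The intermediate channel widths are assumptions of mine, and the Swin-based reconstruction layers are omitted.

```python
import torch
import torch.nn as nn

# Three 3x3 convolutions reducing fused features back to a single-channel image.
# Channel widths 60 -> 30 -> 1 are illustrative assumptions.
cnn_reconstruct = nn.Sequential(
    nn.Conv2d(60, 60, kernel_size=3, padding=1),
    nn.LeakyReLU(inplace=True),
    nn.Conv2d(60, 30, kernel_size=3, padding=1),
    nn.LeakyReLU(inplace=True),
    nn.Conv2d(30, 1, kernel_size=3, padding=1),
)

fused_feat = torch.randn(1, 60, 128, 128)
print(cnn_reconstruct(fused_feat).shape)   # torch.Size([1, 1, 128, 128])
```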

Loss functions

The structure loss, the texture loss, the intensity loss, and the overall loss that combines them are all loss functions we are already familiar with, so I won't go into detail here.
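Since the formulas were given as figures in the original post, the block below is only a hedged reconstruction of the typical form of these losses, with I_1, I_2 the source images, I_f the fused image, \nabla an image gradient operator, and w_1, w_2, \alpha, \beta placeholder weights rather than the paper's exact values.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{SSIM}}  &= w_1\bigl(1-\mathrm{SSIM}(I_f, I_1)\bigr) + w_2\bigl(1-\mathrm{SSIM}(I_f, I_2)\bigr)\\
\mathcal{L}_{\mathrm{text}}  &= \frac{1}{HW}\,\bigl\lVert\, |\nabla I_f| - \max\bigl(|\nabla I_1|, |\nabla I_2|\bigr) \bigr\rVert_1\\
\mathcal{L}_{\mathrm{int}}   &= \frac{1}{HW}\,\bigl\lVert\, I_f - \max(I_1, I_2) \bigr\rVert_1\\
\mathcal{L}_{\mathrm{total}} &= \mathcal{L}_{\mathrm{SSIM}} + \alpha\,\mathcal{L}_{\mathrm{text}} + \beta\,\mathcal{L}_{\mathrm{int}}
\end{aligned}
```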

Summary

Having read the whole article, here is what I personally find most impressive:

  • Applying the Swin Transformer to the field of image fusion
  • The cross-domain fusion, which feeds K, V and Q from different images into the multi-head attention computation
  • Both the texture loss and the intensity loss take the maximum over the source images, so the most salient texture details and intensity information are retained as far as possible

Of course, this paper covers not only infrared and visible image fusion but also fusion of other modalities. I won't expand on that in this post; if you are interested, please read the original paper.

Interpretations of other image fusion papers

【Read the Paper】DIVFusion: Darkness-free infrared and visible image fusion

【Read the Paper】RFN-Nest: An end-to-end residual fusion network for infrared and visible images

【Read the Paper】DDcGAN

【Read the Paper】Self-supervised feature adaption for infrared and visible image fusion

【Read the Paper】FusionGAN: A generative adversarial network for infrared and visible image fusion

【Read the Paper】DeepFuse: A Deep Unsupervised Approach for Exposure Fusion with Extreme Exposure Image Pairs

【Read the Paper】DenseFuse: A Fusion Approach to Infrared and Visible Images

Reference

[1] SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer, IEEE/CAA Journal of Automatica Sinica, 2022.
