[Read the paper] TCPMFNet

Paper: https://www.sciencedirect.com/science/article/pii/S1350449522003863
If there is any infringement, please contact the blogger.

A brief introduction

What I want to introduce today is TCPMFNet, an infrared and visible image fusion method that incorporates a Vision Transformer. This is also my first contact with this topic, so let's take a look at the paper together.

Network structure

As usual, let's first take a look at the architecture of the whole network.

[Figure: overall TCPMFNet architecture]

This structure may look a bit familiar: it is somewhat similar to RFN-Nest, so we can keep RFN-Nest in mind as a comparison.

Simply put, the whole architecture consists of three parts: the encoder, the feature fusion networks and the decoder. The encoder extracts infrared and visible image features at four scales, the features at each scale are fed into the corresponding fusion network, and the fused features are then passed to the decoder to generate the final fused image. Next, let's break the architecture down and go through it part by part.
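To make that data flow concrete, here is a minimal PyTorch-style sketch. The names `encoder`, `fusion_nets` and `decoder` are placeholders of mine, not the authors' code; it only mirrors the three-part structure described above.

```python
import torch

def fuse(ir_img: torch.Tensor, vis_img: torch.Tensor,
         encoder, fusion_nets, decoder) -> torch.Tensor:
    # Extract features of both source images at four scales with the same encoder.
    ir_feats = encoder(ir_img)
    vis_feats = encoder(vis_img)
    # One fusion network per scale combines the two modalities.
    fused = [fusion_nets[i](ir_feats[i], vis_feats[i]) for i in range(4)]
    # The nest-connection decoder reconstructs the fused image from all scales.
    return decoder(fused)
```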

Encoder

[Figure: architecture of the encoder, with the main and auxiliary autoencoders]

The architecture of the encoder is shown in the figure above. Interestingly, there are two encoders, with data passed between them. The author calls them the main autoencoder (right in the figure) and the auxiliary autoencoder (left in the figure); the two share the same network structure and parameter configuration.

Looking at the structure in the figure above, you will find that each encoder has five layers. Starting from the second layer they are stage0, stage1, stage2 and stage3, and except for stage0, the input of each stage of the main encoder also includes the output of the auxiliary encoder. For example, the input of stage1 of the main encoder is the sum of the output of stage0 of the main encoder and the output of stage1 of the auxiliary encoder (since max pooling and convolution change the size of the feature maps, the output of stage1 of the auxiliary encoder first has to be upsampled to the same size as the output of stage0 of the main encoder).

So why go through all of this first?

Because it makes the following formula easier to understand.

[Formula: the input MSFIN of stage i+1 of the main encoder, formed from MSFO, ASFO, UP and Conv]

Here MSFIN is the input of stage i+1 of the main encoder, MSFO is the output of stage i of the main encoder, and ASFO is the output of stage i+1 of the auxiliary encoder; UP is the upsampling operation and Conv is a convolution. With the description above in mind, the formula is straightforward.

So why design such a network structure? What is the advantage over a single encoder?

Let's read the author's explanation.

Fusing the feature map from the auxiliary autoencoder with the feature map from the main autoencoder can distribute the extracted source image features into more channels, thereby improving the performance of feature extraction.

As shown in the figure below, the feature information marked with purple circles is easy to see. Looking closely, the infrared and visible features in channel 24 of the main encoder are relatively weak, while both are well preserved in channel 56. In the auxiliary encoder it is exactly the opposite: the two are directly complementary, and adding them together retains the feature information better. That is why this structure is used.
[Figure: feature maps of the main and auxiliary encoders, with feature information marked by purple circles]

Image fusion network

Vision Transformer

In the blogger's opinion this is the most important part of the whole architecture, and also the part I gained the most from. Let's go through it next.

To understand this part, we first need to know what the Vision Transformer (ViT) is.

Take a look at the figure below; this is the network structure given in the original Vision Transformer paper. It does not look very complicated, and honestly, it really isn't.

[Figure: Vision Transformer architecture from the original ViT paper]
First, let's look at the left half, which is the overall architecture of the Vision Transformer. Recall the Transformer: the original Transformer was built for natural language processing, where each word becomes a vector that is fed into the network. Here, however, the input is three-dimensional data (a three-channel image), which is hard to feed in the same way; treating every pixel as a token would make the sequence, and the amount of computation, far too large.

This is where ViT was proposed. So how does it work? Let's keep reading.

The first step is to split the image into multiple patches and then, through a series of operations, turn these patches into a set of one-dimensional vectors that are fed into the Transformer. That is the operation inside the red box in the figure below. How is it done?

[Figure: ViT patch embedding, marked by the red box]

Let's go through the process in detail. The first thing to do is divide the whole image into blocks; in the figure above it is simply split into 9 patches. These 9 patches pass through the Linear Projection of Flattened Patches (a fully connected layer) and produce 9 outputs, which are already one-dimensional vectors. Adding the position encodings to them gives the input the Transformer needs.

So how can this process be achieved?

We can implement this whole process directly with a convolution. Take a small image as an example: suppose we have a 9x9x3 image (just an example, real images are rarely this small). Set the kernel size to 3, the stride to 3 and the number of kernels to 9, then convolve the whole image; the output is 3x3x9. Flattening the 3x3 spatial dimensions then gives 9 tokens, each of dimension 9, and the 2D image has been successfully converted into 1D tokens. From here the normal Transformer procedure applies: add a (1D) position encoding to each token and continue exactly as in the original Transformer. You may also notice an extra token with position code 0; the output corresponding to this class token is used for classification.

In the same way, for a 224x224 image with a patch size of 16, we set the kernel size to 16, the stride to 16, and the number of kernels to the desired embedding dimension (768 in ViT-Base), which yields 14x14 = 196 tokens.
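This conv-as-patch-embedding trick is easy to check in PyTorch; the sketch below uses the ViT-Base embedding dimension of 768, but any dimension works.

```python
import torch
import torch.nn as nn

# Patch embedding as a strided convolution: kernel size = stride = patch size.
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)           # a dummy 224x224 RGB image
feat = patch_embed(x)                     # -> (1, 768, 14, 14)
tokens = feat.flatten(2).transpose(1, 2)  # -> (1, 196, 768): 196 tokens of dim 768
print(tokens.shape)
```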

Now that we know how to turn image data into something the Transformer can accept, let's look at what the Transformer encoder actually does.

As shown in the figure below, the architecture is actually quite simple. Let's focus on the Multi-Head Attention, i.e. the multi-head attention mechanism.

[Figure: Transformer encoder block]

First, what is the attention mechanism of the Transformer? Take a look at this formula:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V

On its own it looks quite abstract, so let's break it down. Simply put, the formula involves three things: Q, K and V. How should we understand them?

Take a popular talent show as an example: think of Q as the evaluation criterion (singing ability, dancing ability and so on), K as a contestant's actual singing and dancing ability, and V as the contestant's base score. The better your ability K matches the criterion Q, the higher your final score.

What is the specific calculation process? Let's continue to take a look

As shown in the figure below, q and k are both vectors here, and the matrix is what we get by multiplying Q with the transpose of K. What does this matrix mean?

[Figure: the matrix QK^T computed from the q and k vectors]

Look at the matrix above, taking the first row as an example: each entry is computed from q1 and one of the k vectors, so the first row is really the degree of match between q1 and every k, and the same holds for the other rows. Next comes one more step of calculation (the softmax is omitted here; you can add it in yourself), as shown below.

[Figure: the outputs as weighted combinations of v1, v2 and v3]

Looking at the final result, you will find that each output is no longer determined by a single value: it is a combination of v1, v2 and v3, and the weight of each one depends on how well the corresponding q matches the corresponding k.
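The whole computation fits in a few lines. Below is a minimal PyTorch version for intuition only; it includes the 1/sqrt(d_k) scaling from the formula, with no masking or dropout.

```python
import math
import torch

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # (n, d) @ (d, n) -> (n, n): row i holds the match scores between q_i and every k
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    # every output row is a weighted mixture of all the v vectors
    return weights @ v

q = k = v = torch.randn(3, 4)  # three tokens of dimension 4, as in the toy example
out = attention(q, k, v)       # out[i] mixes v1, v2, v3 according to how well q_i matches each k
```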

After attention, every value is influenced by all of the values. What use is this for images? We will come back to that point together with the content of the paper.

So what does "multi-head" mean?

Simply put, the original Q, K and V are split into several parts along the feature dimension. For example, if all three are 24-dimensional and we use 4-head attention, each head works on 6-dimensional inputs; after the attention calculation, the results of the heads are concatenated to form the final output.

So why use multiple heads?

Looking back, you will notice that the whole calculation above is fixed. How can the network learn anything when it has to handle different tasks? The answer to that question is exactly why multi-head attention is used.

Let's first look at the multi-head attention formula from the original Transformer paper.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O, where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

It is easy to see that Q, K and V are each multiplied by a W. These W matrices are learnable, and that is what lets the mechanism adapt to different tasks.
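A compact sketch of how the heads are formed, matching the 24-dimensional, 4-head example above; the projections W_q, W_k, W_v and W_o are the learnable parts. This is a generic implementation, not the paper's code.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention: learnable projections, heads split along the feature dim."""

    def __init__(self, dim: int = 24, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.w_o = nn.Linear(dim, dim)  # recombines the concatenated heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape

        def split(t):  # (b, n, d) -> (b, heads, n, head_dim): each head gets its own subspace
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        out = torch.softmax(scores, dim=-1) @ v        # (b, heads, n, head_dim)
        out = out.transpose(1, 2).reshape(b, n, d)     # concatenate the heads again
        return self.w_o(out)

x = torch.randn(1, 9, 24)            # 9 tokens of dimension 24
y = MultiHeadAttention()(x)          # output has the same shape: (1, 9, 24)
```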

Having said so much, we can start to look at the content of the paper.


Feature fusion network

[Figure: architecture of the transformer fusion block]

The network itself is fairly simple. From bottom to top, the first layer is a convolution; its role should be the same as the convolution described earlier that converts image data into one-dimensional tokens, i.e. it convolves the whole input and then flattens the spatial dimensions of the result.

After that, the output of the convolution layer is copied three times and used as Q, K and V for the multi-head attention (what that computes was covered in the previous section). The output of the multi-head attention is added to its input, the sum is fed into the last layer (an MLP), and the output of the MLP is added to its own input; that sum is the final fused feature.

The formula for this part is as follows: Res denotes a residual connection, ATT is the output of the multi-head attention, and TFO is the final output.

[Formula: TFO expressed through the residual connections Res, the attention output ATT and the MLP]
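Putting the two residual connections together, the block corresponds roughly to the sketch below. This is my guess at the structure from the figure and the formula; the head count, MLP width and the absence of normalization layers are placeholders of mine, not values from the paper.

```python
import torch
import torch.nn as nn

class TransformerFusionBlock(nn.Module):
    """Sketch: ATT = MHA(tokens); TFO = MLP(ATT + tokens) + (ATT + tokens)."""

    def __init__(self, dim: int, num_heads: int = 4, mlp_ratio: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Q, K and V are all the same token sequence (self-attention)
        att, _ = self.attn(tokens, tokens, tokens)
        res = att + tokens           # first residual connection
        return self.mlp(res) + res   # second residual connection
```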
Having understood the structure, you may ask: why use ViT here instead of continuing to use a CNN?

Recall what was said earlier: after the Transformer, every output token contains information from all the tokens. Carried over to ViT, each token holds the information of one image patch, so after processing every token effectively carries information about the whole image, whereas the information a CNN gathers is generally limited to the size of its convolution kernel. That is the clear difference between the two: CNNs are good at capturing local information, while ViT can capture global information. Each has its own strengths. In this paper, although ViT captures global information well, some local processing is still handled better by a CNN, so the author combines the two. The resulting fusion network architecture is shown below.

[Figure: transformer–convolutional parallel mixed fusion network]

There are three fusion paths in the network, namely the convolution fusion path, the transformer fusion path and the hybrid path.

Let's take a look at the formula.

[Formula: CFPO, TFPO, MPO and FM expressed through Conv and TFO]

It is quite clear: Conv is the convolution operation, TFO is the transformer fusion block, CFPO is the output of the convolution path, TFPO is the output of the transformer path, MPO is the output of the hybrid path, and FM is the final result.

One question I had here: as I understand it, if the tokens output by ViT are added directly to the convolution result, don't the tokens first need to be reshaped back to the right dimensions?
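On that question: in a typical implementation the token sequence is indeed reshaped back into a (C, H, W) feature map before being added to the convolution result. The sketch below only illustrates that reshaping and the parallel three-path structure; how the paper actually wires CFPO, TFPO and MPO into FM may differ, and the stock nn.TransformerEncoderLayer here is just a stand-in for the TFO block.

```python
import torch
import torch.nn as nn

class ParallelMixedFusion(nn.Module):
    """Illustrative wiring of the three paths; the paper's exact combination may differ."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.reduce = nn.Conv2d(channels * 2, channels, kernel_size=1)  # merge IR + VIS channels
        self.conv_path = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.mix_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Stock transformer encoder layer as a stand-in for the TFO block sketched earlier
        self.tfo = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, dim_feedforward=channels * 2, batch_first=True
        )

    def forward(self, ir_feat: torch.Tensor, vis_feat: torch.Tensor) -> torch.Tensor:
        x = self.reduce(torch.cat([ir_feat, vis_feat], dim=1))
        b, c, h, w = x.shape

        def to_tokens(t):  # (B, C, H, W) -> (B, H*W, C)
            return t.flatten(2).transpose(1, 2)

        def to_map(t):     # (B, H*W, C) -> (B, C, H, W): tokens back to a feature map
            return t.transpose(1, 2).reshape(b, c, h, w)

        cfpo = self.conv_path(x)                              # convolution fusion path
        tfpo = to_map(self.tfo(to_tokens(x)))                 # transformer fusion path
        mpo = to_map(self.tfo(to_tokens(self.mix_conv(x))))   # hybrid path: conv then transformer
        return cfpo + tfpo + mpo                              # one plausible way to form FM
```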

Mesh connection decoder

[Figure: mesh connection decoder]

The decoder is relatively simple. The fused features at the four scales are used as inputs; the network upsamples and downsamples them as inputs to the different convolution nodes, everything is finally gathered into C2,0, and a Final_conv produces the final result.

You can read the original text for the specific configuration, so I won’t give a redundant description here.

Loss function

The overall loss function is shown below, where Ld is the detail loss and Lf is the feature loss.

[Formula: the total loss combining Ld and Lf]
The detail loss function is relatively simple and is still our old friend SSIM.
[Formula: the detail loss Ld based on SSIM]
Now take a look at the feature loss. Here Ff is the fused feature, m indexes the scale, Fvi is the visible feature, and Fir is the infrared feature.

Looking at Lf, the author sets β1 to 0.6 and β2 to 0.4. I think the reasoning is roughly the same as in RFN-Nest: Ld already pushes the network to retain the image's appearance, while Lf has to preserve both the infrared and the visible features as much as possible. Since Ld already favours the visible features, the weights here are biased towards preserving the infrared features. The final Lf is the sum of the feature losses over all the scales, which is also quite interesting.
[Formula: the feature loss Lf summed over the scales with weights β1 and β2]
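For concreteness, here is how I read the losses as code. This is heavily hedged: the assumption that Ld is computed against the visible image (as in RFN-Nest), the weighted-target form of Lf, the Frobenius norm and the trade-off weight `lam` are all my own reading or placeholders, not taken from the paper's released code; the SSIM comes from the pytorch_msssim package as one available implementation.

```python
import torch
from pytorch_msssim import ssim  # one available SSIM implementation

def detail_loss(fused: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
    # Ld: 1 - SSIM, assumed here to be computed against the visible image
    return 1 - ssim(fused, vis, data_range=1.0)

def feature_loss(fused_feats, ir_feats, vis_feats, beta1: float = 0.6, beta2: float = 0.4):
    # Lf: summed over the four scales; the larger weight beta1 sits on the infrared features
    loss = 0.0
    for ff, fir, fvi in zip(fused_feats, ir_feats, vis_feats):
        loss = loss + torch.norm(ff - (beta1 * fir + beta2 * fvi)) ** 2
    return loss

def total_loss(fused, vis, fused_feats, ir_feats, vis_feats, lam: float = 1.0):
    # lam is a placeholder trade-off weight, not a value from the paper
    return detail_loss(fused, vis) + lam * feature_loss(fused_feats, ir_feats, vis_feats)
```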

Summary

I won't go over the training and ablation experiments here. This paper really opened the door to a new world for me; it was my first contact with ViT. Most of the summary of ViT above reflects what I understood after watching Mu Shen's Transformer explanation, so please point out any mistakes.

For more notes on image fusion papers, see my paper-reading column:

【Read the paper】DIVFusion: Darkness-free infrared and visible image fusion

【Read the paper】RFN-Nest: An end-to-end residual fusion network for infrared and visible images

【Read the paper】DDcGAN

【Read the paper】Self-supervised feature adaption for infrared and visible image fusion

【Read the paper】FusionGAN: A generative adversarial network for infrared and visible image fusion

【Read the paper】DeepFuse: A Deep Unsupervised Approach for Exposure Fusion with Extreme Exposure Image Pairs

【Read the paper】DenseFuse: A Fusion Approach to Infrared and Visible Images

References

[1] TCPMFNet: An infrared and visible image fusion network with composite auto encoder and transformer–convolutional parallel mixed fusion strategy
[2] AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
[3] Attention Is All You Need
