Deep Learning: COLA-Net


This is the first paper to combine local attention and non-local (global) attention for image restoration. The authors design a network in which the two attention mechanisms cooperate to reconstruct images — the Collaborative Attention Network (COLA-Net). The core contribution is a new patch-wise self-attention structure: unlike ViT, besides capturing long-range correlation across the image it also captures some local correlation, although only to a limited degree — it does not match a CNN's ability to model local correlation.

Note:

  1. Local attention and global attention: local attention refers to mechanisms such as the channel attention in RCAN or the temporal and spatial attention in EDVR's TSA, which are produced by local operators such as convolution and pooling; global attention, i.e. non-local attention, refers to the self-attention mechanism, the core structure of the Transformer.
  2. Local correlation and global correlation: local correlation is what CNN-based networks exploit in image recognition and similar tasks, where a convolution window extracts features from a small neighborhood; global correlation is long-range correlation across the image, which convolution cannot capture directly, so a self-attention mechanism is usually used to model the correlation between patches at different positions in the image.
  3. The core of the attention mechanism is correlation, which is measured by a similarity index; in the Transformer this is realized by dot products.

References:
Transformer basics
Source code
Things you don't know about Transformer

Abstract

Why do we need to combine the local and global attention mechanisms?
With only local attention the receptive field is small and features over longer or larger ranges cannot be captured; conversely, with only global attention the network captures long-range correlation but misses fine-grained local features. If the two kinds of attention cooperate in some way, the strengths of both can be exploited, improving image restoration performance.
Note:

  1. Image restoration includes super-resolution, deblurring, denoising, deblocking, and so on.

Main points of this paper:

  1. Local attention and non-local attention can both deliver strong performance on image restoration, but current restoration methods generally use only one of the two; the authors therefore propose a new structure combining them, the Collaborative Attention Network (COLA-Net), which is the first to combine both kinds of attention for image restoration.
  2. The authors also propose a new structure for applying self-attention to images. Unlike ViT, which splits the image into patches, embeds them, and then applies three projections to obtain Q, K, V, this structure first applies three convolutions to obtain Q, K, V, then splits them into patches with a sliding-window unfold, and reshapes the patches into Q, K, V. This design has two benefits: ① because unfold is based on a sliding window, patches may overlap, tying neighboring patches more closely together and indirectly adding some local correlation; ② if the three Q/K/V projections use convolution kernels larger than 1, the patches are taken from feature maps, i.e. each patch carries local features into Q, K, V, adding local correlation on top of the global correlation that self-attention captures. This offers a way to introduce the Transformer into restoration tasks that need local feature information, such as super-resolution.
  3. COLA-Net achieves state-of-the-art PSNR and visual quality on three major restoration tasks: synthetic image denoising, real image denoising, and compression artifact reduction.

1 Introduction

  1. Local attention and global attention complement each other to improve performance. Specifically, when an image contains many repeated, similar details, global correlation helps the network learn the features of those details; when an image contains complex texture details, global attention alone produces over-smoothed artifacts, and local attention helps learn those textures.
  2. COLA-Net therefore needs to learn both kinds of attention. For local attention the authors introduce channel-wise attention and enlarge the receptive field of the local operations by using features at different scales; for global (non-local) attention the authors introduce a new patch-wise self-attention network to build long-range dependencies between different patches.

Summary

  1. This paper introduces COLA-Net, a network model that combines the local and global attention mechanisms.
  2. This paper introduces a patch-wise self-attention structure, an extension of the Transformer self-attention module to computer vision, which can capture long-range correlation in images.
  3. The authors validate COLA-Net on three restoration tasks — synthetic image denoising, real-scene denoising, and compression artifact reduction — achieving state-of-the-art performance and demonstrating the model's effectiveness.

2 Related Work

Non-Local Attention

Let $\mathbb{Q}$ be the search range, the query items be $y_i$, the key items be $y_j$, let $\phi(y_i,y_j)$ denote the similarity between them, and let $G(\cdot)$ denote an embedding function. The self-attention mechanism can then be written as:
$$\hat{x_i} = \frac{1}{z_i}\sum_{j\in \mathbb{Q}} \phi(y_i,y_j)\, G(y_j), \quad \forall i. \tag{1}$$
Here $i\in\mathbb{I}$ indexes all pixels/patches of the image and $\mathbb{Q}\subset \mathbb{I}$; $y_i, y_j$ are vectors, each representing a patch; $z_i=\sum_{j\in\mathbb{Q}}\phi(y_i,y_j)$ is a normalization factor; $G(y_j)$ is another representation of $y_j$; and $\hat{x_i}$ is the attention-weighted representation of patch $i$, combining the correlation between $y_i$ and every $y_j$.

In a neural network, i.e. inside the Transformer, $y_i$ and $y_j$ correspond to $Q$ and $K$ respectively, $G(y_j)$ is usually the output of a convolutional layer (that is, $V$, so $K$ and $V$ are different representations of the same token), and $\phi(y_i,y_j)=y_i^T y_j$. Formula (1) therefore says that the $i$-th patch looks at all $y_j$ within the search range, retrieves the corresponding $G(y_j)$ weighted by the pairwise similarity, and the resulting weighted average is the output of the $i$-th patch after aggregating its dependencies on all patches.

Note:

  1. Softmax normalizes the similarity of each patch to every other patch, yielding the weights in formula (1); $\hat{x_i}$ is the updated value, i.e. the attention-weighted representation of the token.
  2. In the Transformer, $K$ and $V$ are different representations of the same token, analogous to a key-value pair, while $Q$ is usually a representation of another token.
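
As a concrete illustration, here is a minimal PyTorch sketch of formula (1) with dot-product similarity, where softmax plays the role of $\phi/z_i$; the toy shapes and the separate `g` tensor are my own choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def non_local_attention(y, g):
    """Formula (1): y holds N patch vectors (N, d), g holds the embedded values G(y_j) (N, d)."""
    sim = y @ y.t()                    # phi(y_i, y_j) = y_i^T y_j, shape (N, N)
    weights = F.softmax(sim, dim=-1)   # each row sums to 1, i.e. the 1/z_i normalization
    return weights @ g                 # weighted average of G(y_j) -> x_hat, shape (N, d)

# toy usage: 16 patches, each a 64-dimensional vector
y = torch.randn(16, 64)
g = torch.randn(16, 64)
print(non_local_attention(y, g).shape)  # torch.Size([16, 64])
```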

Local Attention
The attention produced by local operators is called local attention, e.g. the channel attention in RCAN and the spatio-temporal attention in EDVR, both generated by local operations such as convolution and pooling. Its advantage is that it captures local correlation; its drawback is that the receptive field is relatively small, so long-range correlation cannot be captured.


Why is $QK^T$ used to measure similarity?
Each row vector of the learned $Q$ and $K$ matrices represents one patch, so $QK^T$ can be seen as inner products between these vectors; if every vector were a unit vector, $QK^T$ would measure cosine similarity. However, the Transformer does not normalize $Q$ and $K$, so strictly speaking $QK^T$ does not measure similarity. A more principled form would be
$$softmax\Big(\frac{QK^T}{\sqrt{d_k}\cdot \big(\prod_{i=0}^{N-1}\prod_{j=0}^{N-1}\|Q_i\|\cdot \|K^T_j\|\big)}\Big)$$
but I personally suspect that good performance can still be achieved without the normalization, and omitting it also saves some computation, so it is simply left out.
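
For comparison, here is a small sketch of both options: the standard scaled dot-product weights and a cosine-normalized variant in which each entry is divided by $\|Q_i\|\,\|K_j\|$ (one reasonable reading of the per-pair normalization above); both are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_weights(Q, K):
    """Standard Transformer weights: softmax(QK^T / sqrt(d_k))."""
    return F.softmax(Q @ K.t() / Q.shape[-1] ** 0.5, dim=-1)

def cosine_weights(Q, K, eps=1e-8):
    """Cosine-normalized variant: each entry divided by ||Q_i|| * ||K_j||."""
    Qn = Q / (Q.norm(dim=-1, keepdim=True) + eps)
    Kn = K / (K.norm(dim=-1, keepdim=True) + eps)
    return F.softmax(Qn @ Kn.t(), dim=-1)

Q, K = torch.randn(8, 64), torch.randn(8, 64)
print(scaled_dot_product_weights(Q, K).shape, cosine_weights(Q, K).shape)  # (8, 8) each
```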


Why does Transformer need a large amount of data to learn?
In a word: because its inductive bias (inductive preference) is weaker, the Transformer needs more training data.
When AI researchers build new machine learning models and training paradigms, they often rely on a specific set of assumptions, called inductive biases, to help a model learn more general solutions from less data. Over the past decade the success of deep learning has been partly attributed to strong inductive biases: the convolutional architecture has proven very successful in vision tasks, and its hard inductive bias makes sample-efficient learning possible, at the cost of a possibly lower performance ceiling. Vision Transformers such as ViT rely on the more flexible self-attention layer and have recently surpassed CNNs on some image classification tasks, but they demand far more samples: they usually require large amounts of extra data and longer training, which hinders the practical use of Transformers in vision. An important reason is that existing work treats the image as a one-dimensional sequence and ignores the inductive biases specific to vision — the local correlation of images and the scale invariance of objects — so the model cannot use data efficiently, hurting both convergence speed and performance. CNNs, which are very successful on vision tasks, rely on two inductive biases built into the architecture itself: ① local correlation — neighboring pixels are related; ② weight sharing — different parts of an image are processed in the same way regardless of their absolute position. In contrast, self-attention-based vision models such as DeiT and DETR minimize inductive bias. These models match or even outperform CNNs when trained on large datasets, but they often struggle to learn meaningful representations on small datasets. There is a trade-off here: the strong inductive biases of CNNs enable high performance even with very little data but may limit the model when data is plentiful, whereas the Transformer's minimal inductive bias limits it in the small-data setting but allows it to outperform CNNs on large data.
To alleviate the Transformer's heavy data consumption on image tasks, there are generally two approaches: ① use a pure Transformer structure as in ViT, where large-scale datasets are found to overcome the limitation of weak inductive bias — pre-train on a large dataset and fine-tune on a medium or small one for better performance; ② combine the Transformer with a CNN, whether by replacing part of a CNN with Transformer blocks, combining the two pipelines, or adding a self-attention mechanism to a CNN, so that local and global correlation are exploited simultaneously and the Transformer's training requirements are reduced.

Note:

  1. For inductive bias (inductive preference), see Section 1.4 of the Watermelon Book.
  2. A rough explanation of inductive preference: in a data-fitting scenario there are many possible fits, but if we assume, say, that a smooth curve is the better solution, we can add regularization terms to make the curve smoother; the solution that then minimizes the loss is the one we want, and we know it will be fairly smooth, as shown below: (figure: two curves fitting the same data points)
    The inductive bias built into neural networks for vision can be read as follows: AI researchers assume that exploiting local correlation, the 2D neighborhood structure, and spatial invariance gives better convergence, and based on this assumption they designed the CNN's sliding-window structure to use the local correlation of images for tasks such as classification. Practice has shown that CNNs do achieve good accuracy and convergence speed, confirming that this inductive bias assumption is correct: CNNs converge well and fast and use datasets efficiently. As for the Transformer, by the no-free-lunch theorem it can also solve tasks such as image classification — like the CNN it is one point in the solution space — but it does not have the CNN's inductive bias, much like curve B in the figure above. It exploits the global correlation of the image: because the natural local correlation of images is not built into its inductive bias, it has to learn the correlation between patches from scratch, which means learning from more data.
  3. From another angle, the Transformer needs more data because it learns global correlation and therefore has a very large receptive field; for a CNN to reach a comparable receptive field the network must be stacked very deep, and such a complex model inevitably needs a large amount of data to learn — hence the Transformer's need for large datasets.

3 COLA-Net

3.1 Framework

The COLA-Net pipeline is as follows:
(figure: COLA-Net overall pipeline)
The whole network can be divided into 3 parts: ① shallow feature extraction; ② a cascade of Collaborative Attention Blocks (CABs); ③ a reconstruction module.

Shallow Feature Extraction
Shallow features $F_s$ are extracted from the input image:
$$F_s = \mathcal{H}_{SF}(I_{LQ}).\tag{2}$$
where $\mathcal{H}_{SF}(\cdot)$ denotes the shallow feature extraction operator.

CAB Cascade Module
As shown in the figure above, the authors cascade 4 CABs; the input is the shallow feature $F_s$ and the output is $F_{CAB}^4$, which serves as the input to the reconstruction module.
The mathematical expression is:
$$F^i_{CAB} = \mathcal{H}_{CAB}^i(\cdots \mathcal{H}_{CAB}^1(F_s)\cdots).\tag{3}$$
where $F^i_{CAB}$ and $\mathcal{H}_{CAB}^i$ denote the output feature map of the $i$-th CAB and the $i$-th CAB operator, respectively.

The CAB is the core of COLA-Net and consists of two parts: ① a Feature Extraction Module (FEM); ② a Dual-branch Fusion Module (DFM).

FEM module:
(figure: the two FEM variants)

The authors design two deep feature extraction sub-networks. One is relatively lightweight (Basic): a cascade of 6 $conv\to BN\to ReLU$ blocks, with every convolutional layer of size $3\times 3\times 64$. The other is an enhanced (Enhanced) residual structure: a cascade of 5 residual blocks, again with $3\times 3\times 64$ convolutional layers.
COLA-Net with the first FEM version is denoted COLA-B; with the second version it is denoted COLA-E.
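
A minimal sketch of the two FEM variants as described above (6 conv-BN-ReLU blocks vs. 5 residual blocks, all 3×3 convolutions with 64 channels); the padding and the internal layout of the residual block are my assumptions.

```python
import torch.nn as nn

def basic_fem(channels=64, n_blocks=6):
    """'Basic' FEM (COLA-B): a cascade of conv -> BN -> ReLU blocks."""
    layers = []
    for _ in range(n_blocks):
        layers += [nn.Conv2d(channels, channels, 3, padding=1),
                   nn.BatchNorm2d(channels),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class ResidualBlock(nn.Module):
    """conv -> ReLU -> conv with an identity shortcut (layout assumed)."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

def enhanced_fem(channels=64, n_blocks=5):
    """'Enhanced' FEM (COLA-E): a cascade of residual blocks."""
    return nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])
```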

DFM module:
The DFM is the core of the CAB. It runs local attention and non-local attention in parallel and finally fuses them into a feature map with attention applied, namely $F^i_{CAB}$; the fusion step is itself an attention process.

Reconstruction Module
This reconstruction module can be understood at three levels: ① deeper feature extraction; ② going from feature-wise back to image-wise; ③ feature correction and adjustment.
Besides further processing the features output by the CABs, it also fuses them with low-level features. This structure is commonly used in image restoration and has four benefits: ① fusing features from different levels lets them complement each other and yields better reconstructions; ② it forms a residual structure, so the network focuses on learning the residual, which eases the learning burden and improves stability; ③ the skip connection acts as a regularizer; ④ as noted in DRCN, feeding the original image directly to the end of the network compensates for the feature loss caused by the convolution process.
The mathematical expression is:
$$I_{REC} = I_{LQ} + \mathcal{H}_{DF}(F_{CAB}^4) = \mathcal{H}_{COLA}(I_{LQ}).\tag{4}$$
where $\mathcal{H}_{DF}(\cdot)$ and $\mathcal{H}_{COLA}(\cdot)$ denote the final convolutional layer and the entire COLA network, respectively.
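
Putting formulas (2)-(4) together, a skeletal sketch of the overall forward pass might look like the following; the CAB internals are left as a placeholder and the shallow extractor and reconstruction head are assumed to be single convolutions.

```python
import torch
import torch.nn as nn

class COLANetSkeleton(nn.Module):
    """Skeleton of the COLA-Net pipeline: shallow features -> cascaded CABs -> residual reconstruction."""
    def __init__(self, cab_factory, in_channels=3, channels=64, n_cabs=4):
        super().__init__()
        self.shallow = nn.Conv2d(in_channels, channels, 3, padding=1)       # H_SF, formula (2)
        self.cabs = nn.Sequential(*[cab_factory() for _ in range(n_cabs)])  # formula (3)
        self.recon = nn.Conv2d(channels, in_channels, 3, padding=1)         # H_DF, formula (4)

    def forward(self, i_lq):
        f_s = self.shallow(i_lq)
        f_cab = self.cabs(f_s)
        return i_lq + self.recon(f_cab)  # global residual: I_REC = I_LQ + H_DF(F_CAB^4)

# usage with a placeholder CAB
model = COLANetSkeleton(cab_factory=nn.Identity)
print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 64, 64])
```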

Loss Function
The L2 loss is used:
$$L(\Theta) = \frac{1}{B}\sum^B_{b=1}\|I^b_{HQ} - \mathcal{H}_{COLA}(I^b_{LQ})\|_2^2.\tag{5}$$
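
In PyTorch this is just a batch mean-squared error between the restored and ground-truth images, e.g. (note that `mse_loss` averages over all elements, which differs from formula (5) only by a constant factor):

```python
import torch.nn.functional as F

def cola_loss(model, i_lq, i_hq):
    """L2 loss of formula (5): MSE between H_COLA(I_LQ) and I_HQ."""
    return F.mse_loss(model(i_lq), i_hq)
```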

3.2 Dual-branch Fusion Module

(figure: structure of the Dual-branch Fusion Module)

The whole DFM consists of 3 parts: ① a local attention sub-module; ② a non-local attention sub-module; ③ a fusion sub-module.
The DFM uses the local attention branch to capture local correlation and reconstruct complex texture features, and the non-local branch to capture global correlation and reconstruct details that are correlated over long distances, such as highly repetitive patterns. The fusion module is itself essentially a kind of local attention: it applies a further attention step to the outputs of the two branches to extract the more useful features.
Note:

  1. The core of local attention is local operations (e.g. convolution and pooling); the core of non-local attention is the self-similarity computation shown in formula (1).

Concretely, the feature map output by the FEM, $F^i_{FEM}$, is the input; the local and non-local attention branches then run in parallel and output $F^i_L$ and $F^i_{NL}$ respectively:
$$\begin{cases} F^i_L = \mathcal{H}_L(F^i_{FEM}),\\ F_{NL}^i = \mathcal{H}_{NL}(F^i_{FEM}).\end{cases}\tag{6}$$
where $\mathcal{H}_L(\cdot)$ and $\mathcal{H}_{NL}(\cdot)$ denote the local and non-local attention sub-modules respectively. Both outputs have the same number of channels, denoted $C$.

Fusion Submodule

Next comes the fusion sub-network, which is essentially channel-wise attention: first $F_{NL}^i$ and $F^i_L$ are added; then global average pooling produces $v\in \mathbb{R}^C$; two fully connected layers $\mathcal{H}_{FC}^1(v)$ and $\mathcal{H}^2_{FC}(v)$ each output a $C$-channel tensor, and the two are concatenated along the channel dimension into a $2C$ tensor; softmax then maps this to $[0,1]$, and the result is split back into two $C$-channel weight vectors $w_L, w_{NL}\in\mathbb{R}^C$, which are multiplied element-wise with $F^i_L$ and $F_{NL}^i$ to give the final result $F^i_{CAB}$ (a code sketch follows the notes below).
The whole fusion process is:
$$\begin{cases} v = GlobalPooling(F_{NL}^i + F^i_L),\\ w_{NL}, w_L = softmax([\mathcal{H}_{FC}^1(v),\mathcal{H}^2_{FC}(v)]),\\ F^i_{CAB} = F^i_{NL}\cdot w_{NL} + F^i_L\cdot w_L.\end{cases}\tag{7}$$
Note:

  1. Here $w_L, w_{NL}$ play the role of selecting an appropriate split between local attention and global attention.
  2. The two channel-attention outputs are not independent, because the softmax ties the two branches together; in contrast, in the local attention sub-module discussed later the two channel-attention outputs are independent of each other.
  3. $[\cdot, \cdot]$ denotes concatenation.
  4. Global average pooling is just average pooling over the entire feature map; with adaptive pooling we only specify the desired output size and the required kernel size and stride are derived automatically, yielding a tensor with the same number of channels and the target spatial size. See PyTorch's nn.AdaptiveAvgPool2d for details.
  5. The fusion process is learnable!
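
Below is a minimal sketch of the fusion sub-module following formula (7). The text above describes a softmax over the concatenated $2C$ vector; here I apply it across the two branches for each channel, which I believe matches the intent, and I use `nn.Linear` for the two FC layers — both are assumptions rather than the paper's exact code.

```python
import torch
import torch.nn as nn

class FusionSubmodule(nn.Module):
    """Channel-wise fusion of the local and non-local branches (formula (7))."""
    def __init__(self, channels=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # GlobalPooling -> (B, C, 1, 1)
        self.fc1 = nn.Linear(channels, channels)  # H_FC^1
        self.fc2 = nn.Linear(channels, channels)  # H_FC^2

    def forward(self, f_l, f_nl):
        b, c, _, _ = f_l.shape
        v = self.pool(f_l + f_nl).view(b, c)                     # v in R^C
        scores = torch.stack([self.fc1(v), self.fc2(v)], dim=1)  # (B, 2, C)
        weights = torch.softmax(scores, dim=1)                   # the two branches compete per channel
        w_l = weights[:, 0].view(b, c, 1, 1)
        w_nl = weights[:, 1].view(b, c, 1, 1)
        return f_l * w_l + f_nl * w_nl                           # F^i_CAB

# usage
f_l, f_nl = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
print(FusionSubmodule(64)(f_l, f_nl).shape)  # torch.Size([2, 64, 32, 32])
```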

To verify the effect of the fusion, the authors visualize it with a heat map. The heat value $h$ is defined as
$$h = \frac{1}{M}\sum^M_{m=1} a_m,\qquad a_m = \begin{cases}1, & w^m_L\ge w_{NL}^m,\\0, & otherwise.\end{cases}\tag{8}$$
where $M$ is the length of $w_L$; $h$ reflects the preference between the two kinds of attention, like a vote.
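
In code this vote is just a channel-wise comparison of the two weight vectors, for example:

```python
import torch

def fusion_heat(w_l, w_nl):
    """Formula (8): fraction of channels where the local-attention weight wins."""
    return (w_l >= w_nl).float().mean()

print(fusion_heat(torch.rand(64), torch.rand(64)))  # a value in [0, 1]
```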
The resulting heat maps are as follows:
(figure: heat maps showing the preference between local and global attention)
The heat map reflects each branch's contribution to the final result: for a given region of the fused feature map, we want to know whether local attention or global attention contributes more.

  1. In the figure, the brighter the color, the larger the contribution of local attention there (i.e. the greater the demand for local attention), and vice versa for global attention.
  2. We also find that the regions where local attention contributes more are mainly detailed, textured areas such as leaves and mountain peaks, while global attention dominates in large, highly repetitive areas such as the surface of the sea.

3.3 Local Attention submodule

(figure: local attention sub-module)
The figure above shows the local attention sub-network. The authors design a pair of branches with two different receptive fields and again select between the receptive-field schemes with a channel-wise attention mechanism. Unlike the channel-wise attention of the fusion sub-network, the two attention outputs here are independent: each uses its own softmax to obtain its attention, and the two results are added element-wise.
The attention of this sub-module is generated by local operations; its mathematical expression is:
$$F^i_L = \mathcal{H}_L(F^i_{FEM}).$$
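
A rough sketch of how such a dual-receptive-field, channel-wise local attention branch could look; the kernel sizes (3×3 and 5×5) and the exact channel-attention layout are my assumptions, since the post does not spell them out.

```python
import torch
import torch.nn as nn

class LocalAttentionBranch(nn.Module):
    """One receptive-field branch: a conv feature plus its own, independent channel attention."""
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):
        feat = self.conv(x)
        b, c, _, _ = feat.shape
        w = torch.softmax(self.fc(self.pool(feat).view(b, c)), dim=-1)  # independent softmax per branch
        return feat * w.view(b, c, 1, 1)

class LocalAttentionSubmodule(nn.Module):
    """Two branches with different receptive fields, merged by element-wise addition -> F^i_L."""
    def __init__(self, channels=64):
        super().__init__()
        self.small = LocalAttentionBranch(channels, kernel_size=3)
        self.large = LocalAttentionBranch(channels, kernel_size=5)

    def forward(self, f_fem):
        return self.small(f_fem) + self.large(f_fem)

print(LocalAttentionSubmodule(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```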

3.4 Non-local Attention submodule

Comparison and Analysis
We first compare some existing non-local attention mechanisms, as shown in the figure below:

(figure: comparison of four non-local attention designs (a)-(d))
The blue region is the area to be reconstructed; the red regions are the related long-range dependencies; the yellow region is the search area.
We compare along three axes: ① whether attention is computed over pixels or patches; ② whether the update is pixel-wise or patch-wise, i.e. what $\hat{x_i}$ in formula (1) refers to; ③ whether the search range is a window or the entire image.

  1. (a): attention over pixels; each update is a single pixel; the search area is the whole image.
  2. (b): attention over patches; but each update only changes the centre pixel of the patch; the search area is a window.
  3. (c): attention over patches; the whole patch is updated each time; the search area is a window.
  4. (d): the non-local attention module proposed in this paper: attention over patches; the whole patch is updated each time; the search area is the entire image.

Patch-wise Non-local Attention Model

(figure: the patch-wise non-local attention module)

The figure above shows the new self-attention module for vision proposed in this paper; its structure differs from ViT.
①: Let the input be $F^i_{FEM}\in\mathbb{R}^{C\times W\times H}$. Three independent embedding functions $\theta(\cdot), \varphi(\cdot), g(\cdot)$, with parameters $W_\theta, W_\varphi, W_g$, are applied first; they are $1\times 1$ convolutional layers that output $Q$, $K$, $V$:
$$\begin{cases} Q = \theta(F^i_{FEM}),\\ K = \varphi(F^i_{FEM}),\\ V = g(F^i_{FEM}). \end{cases}\tag{9}$$
②: Next comes the core operation, unfold, which generates the patches. Unlike ViT, which splits the image directly into non-overlapping blocks, this structure slides a window with stride $s$ to extract patches of size $C\times W_p\times H_p$; each of $Q$, $K$, $V$ is unfolded into $N$ 3D patches ($N\ge W/W_p$), as shown in the boxed region of the figure. The advantage of this design is that it adds a certain amount of local correlation on top of the global attention, so it is more flexible than ViT, whose blocks do not overlap; the drawback is the extra computation.

③: The 3D patches of $Q$ and $K$ are then reshaped: each patch is flattened into a 1D vector, giving the $N\times (C\cdot W_p\cdot H_p)$ 2D matrices $\tilde{Q}$ and $\tilde{K}$.
$M = \tilde{Q} \tilde{K}^T$ is the attention matrix, which measures the correlation between different 3D patches, where $M\in\mathbb{R}^{N\times N}$.
Softmax is then applied to each row of $M$:
$$\phi(\tilde{Q},\tilde{K}) = softmax(\tilde{Q} \tilde{K}^T).\tag{10}$$
④: Finally, $\phi$ is used as the weights over $V$. Following formula (1), each row of $\phi$ weights the patches of $V$ and their weighted sum gives $\hat{x}_i\in\mathbb{R}^{C\times W_p\times H_p}$. The $N$ results are then folded back, as shown in the small box at the lower right of the figure, to recover an output of the original size $C\times H\times W$; simply summing overlapping patches would inflate the values, so the overlapping parts are averaged. The fold operation can be regarded as the inverse of unfold.

⑤: Finally, a convolutional layer is applied and the result is added to the input, which eases the residual learning and avoids information loss caused by the convolutional layer.
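
The following sketch puts steps ①-⑤ together using PyTorch's unfold/fold, with the overlap averaging done by folding a tensor of ones; the patch size, stride, and the final 1×1 output convolution are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchWiseNonLocalAttention(nn.Module):
    """Patch-wise self-attention: 1x1 convs -> overlapping unfold -> attention -> fold with averaging."""
    def __init__(self, channels=64, patch=8, stride=4):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, 1)  # Q projection
        self.phi = nn.Conv2d(channels, channels, 1)    # K projection
        self.g = nn.Conv2d(channels, channels, 1)      # V projection
        self.out = nn.Conv2d(channels, channels, 1)    # conv before the residual add (step 5)
        self.patch, self.stride = patch, stride

    def forward(self, x):
        b, c, h, w = x.shape
        p, s = self.patch, self.stride
        # steps 1-2: 1x1-conv embeddings, then sliding-window unfold into N overlapping patches
        q = F.unfold(self.theta(x), p, stride=s).transpose(1, 2)  # (B, N, C*p*p)
        k = F.unfold(self.phi(x), p, stride=s).transpose(1, 2)
        v = F.unfold(self.g(x), p, stride=s).transpose(1, 2)
        # step 3: attention matrix M = Q~ K~^T with a row-wise softmax (formula (10))
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)       # (B, N, N)
        # step 4: weighted sum over V patches, fold back, and average the overlapping regions
        out = (attn @ v).transpose(1, 2)                          # (B, C*p*p, N)
        out = F.fold(out, (h, w), p, stride=s)                    # overlaps are summed here
        ones = torch.ones(b, c, h, w, device=x.device)
        count = F.fold(F.unfold(ones, p, stride=s), (h, w), p, stride=s)
        out = out / count                                         # averaging instead of summing
        # step 5: final conv plus the residual connection to the input
        return x + self.out(out)

# usage (H and W should be covered exactly by the patch/stride choice)
y = PatchWiseNonLocalAttention(64, patch=8, stride=4)(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```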

Summary

  1. The self-attention sub-network proposed in this paper and the self-attention network in ViT are two different designs.
  2. The proposed sub-network extracts patches during unfold with a sliding window whose stride satisfies $s\leq W_p$, so it adds a certain amount of local correlation.
  3. ViT's self-attention is learnable because three convolutional (projection) layers operate on the input; the network in this paper is likewise learnable because its first stage uses three $1\times 1$ convolutional layers, so the proposed self-attention model generates learnable patches.
  4. If the $1\times 1$ convolutional layers $\theta, \varphi, g$ are replaced by kernels of larger size, the 3D patches will carry more local correlation obtained from local operations such as convolution into the global correlation computation, which benefits reconstruction tasks that need local correlation, such as super-resolution. For example, Video Super-Resolution Transformer uses $3\times 3$ kernels for $\theta, \varphi, g$ to exploit local and global correlation at the same time: detailed regions improve reconstruction quality, while the larger receptive field helps overcome large motion.
  5. ViT extended the Transformer to computer vision, but splitting the image directly into non-overlapping blocks (image-wise) inevitably destroys some spatial structure of the original image. The self-attention module of COLA-Net instead takes overlapping blocks feature-wise with stride $s$, which preserves some spatial structure and adds a certain amount of local correlation. The comparison is shown below: (figure: ViT-style non-overlapping image patches vs. the overlapping feature-wise patches used here)

To show that the proposed self-attention network captures more useful long-range dependency information than the method in Figure 4(a), the authors compare the two with heat maps, shown below; the pixel-level attention mechanism can be seen to produce many useless dependencies.
(figure: heat-map comparison of pixel-wise and patch-wise non-local attention)

4 Experiments

(omitted)

5 Conclusion

  1. This paper is the first to fuse local attention and global attention to solve image restoration tasks, and the fusion process is learnable.
  2. The authors combine a channel-wise local attention sub-network and a patch-wise self-attention sub-network through a channel-wise attention fusion network; with this framework as the core they propose the Collaborative Attention Network (COLA-Net).
  3. The authors introduce a new structure for generating self-attention; unlike ViT, this network can generate learnable patches.


Source: blog.csdn.net/MR_kdcon/article/details/124775804