I. Paper Summary

1. First authors: Xiuchao Sui, Shaohua Li

2. Year: 2022

3. Venue: CVPR 2022 (also available on arXiv)

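4. Keywords: optical flow, Transformer, self-attention, cross-attention, correlation volume

5. Motivation: Because of the locality and rigid weights of convolution, only limited contextual information is incorporated into pixel features, and the computed correlations are highly random, so most high correlation values are spurious matches. As a result, existing methods struggle with large displacements accompanied by motion blur.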

Although newest methods are very accurate on benchmark data, under certain conditions, such as large displacements with motion blur, flow errors could still be large.

The current paradigm computes the pairwise pixel similarity as the dot product of two convolutional feature vectors. Due to the locality and rigid weights of convolution, limited contextual information is incorporated into pixel features, and the computed correlations suffer from a high level of randomness, such that most of the high correlation values are spurious matches. Noises in the correlations increase with noises in the input images, such as loss of texture, lighting variations and motion blur. Naturally, noisy correlations may lead to unsuccessful image matching and inaccurate output flow. This problem becomes more prominent when there are large displacements. Reducing noisy correlations can lead to substantial improvements of flow estimation.

6. Goal: address the problems above with Vision Transformers (ViTs).

An important advantage of Vision Transformers (ViTs) over convolution is that, transformer features better encode global context, by attending to pixels with dynamic weights based on their contents. For the optical flow task, useful information can propagate from clear areas to blurry areas, or from non-occluded areas to occluded areas, to improve the flow estimation of the latter. A recent study suggests that, ViTs are low-pass filters that do spatial smoothing of feature maps. Intuitively, after transformer self-attention, similar feature vectors take weighted sums of each other, smoothing out irregularities and high-frequency noises.

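7. Core idea: The paper proposes the "CRoss-Attentional Flow Transformer" (CRAFT), a new architecture for optical flow estimation. CRAFT uses two novel components to revitalize the correlation volume computation. In addition, to test how robust different models are to large motions, the authors design an image shifting attack that shifts the input images to generate large artificial motions.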

A semantic smoothing transformer layer fuses the features of one image, making them more global and semantically smoother.

A cross-frame attention layer replaces the dot-product operator for correlation computation. It provides an additional level of feature filtering through the Query and Key projections, so that the computed correlations are more accurate.

8. Experimental results: state of the art (SOTA)

On Sintel (Final) and KITTI (foreground) benchmarks, CRAFT has achieved new state-of-the-art (SOTA) performance.

In addition, to test the robustness of different models on large motions, we designed an image shifting attack that shifts input images to generate large artificial motions. As the motion magnitude increases, CRAFT performs robustly, while two representative methods, RAFT and GMA, deteriorate severely.
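As a rough illustration of the image shifting attack quoted above, the sketch below shifts frame 2 by an integer offset and adds the same offset to the ground-truth flow. It is a minimal sketch under stated assumptions, not the paper's protocol: the function names `shift_image` and `shift_flow` are hypothetical, vacated pixels are filled with zeros, and only frame 2 is shifted. The paper may use a different padding scheme or shift a different frame.

```python
def shift_image(image, dx, dy, fill=0):
    """Shift a 2D image (list of rows) dx pixels right and dy pixels down.

    Pixels shifted outside the image are dropped; vacated pixels get `fill`
    (zero fill is an assumption made for this sketch).
    """
    h, w = len(image), len(image[0])
    shifted = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                shifted[ny][nx] = image[y][x]
    return shifted


def shift_flow(flow, dx, dy):
    """Add the artificial motion (dx, dy) to every ground-truth flow vector (u, v)."""
    return [[(u + dx, v + dy) for (u, v) in row] for row in flow]


# Toy 3x3 frame 2 and an all-zero ground-truth flow (a static scene).
frame2 = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
flow = [[(0.0, 0.0)] * 3 for _ in range(3)]

# Shifting frame 2 one pixel to the right turns the static scene into a
# uniform motion of (1, 0) between frame 1 and the attacked frame 2.
attacked_frame2 = shift_image(frame2, dx=1, dy=0)
attacked_flow = shift_flow(flow, dx=1, dy=0)
```

Increasing `dx` and `dy` increases the artificial motion magnitude. The sketch does not mark flow vectors whose targets leave the image as invalid.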

9. Paper download:

https://openaccess.thecvf.com/content/CVPR2022/papers/Sui_CRAFT_Cross-Attentional_Flow_Transformer_for_Robust_Optical_Flow_CVPR_2022_paper.pdf

https://github.com/askerlee/craft

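II. Implementation

1. Overview of CRAFT

The network inherits the RAFT pipeline. The main contribution is to revamp the correlation volume computation (the dashed green rectangle in the architecture figure) with two new components: a semantic smoothing transformer applied to the frame 2 features, and a cross-frame attention layer that computes the correlation volume. Both components are highlighted as boxes with red borders, and both help suppress spurious correlations in the correlation volume. The GMA module at the bottom is the global motion aggregation module.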

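2. Semantic Smoothing Transformer

Given two consecutive frames, frame 1 and frame 2, as input, the first step of the optical flow pipeline is to extract frame features with a convolutional feature network. To enrich the frame features with better global context, the frame 2 features are transformed by a semantic smoothing transformer (SSTrans for short). To adapt better to diverse features, SSTrans uses Expanded Attention rather than the commonly used multi-head attention (MHA). Expanded Attention is a mixture-type model with higher capacity, and it has shown advantages over MHA on image segmentation tasks.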

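An Expanded Attention (EA) layer consists of N modes (sub-transformers). It computes N sets of features, which are aggregated into a single set by dynamic mode attention.

Here B(k) is the mode attention score of mode k, and the mode attention probability G is the softmax of all B(k) along the mode dimension. The output feature EA(X) is a linear combination of all mode features, weighted by G. To better preserve the original frame features, a weighted skip connection with a learnable weight w1 is added.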
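The mode aggregation and the weighted skip connection described above can be sketched for a single pixel as follows. This is a minimal sketch under assumptions: the per-mode features and the scores B(k) are given as plain inputs (the sub-transformers that produce them are not implemented), the function names are hypothetical, and the skip connection is written as w1 * EA(X) + (1 - w1) * X, which is one plausible form of a weighted skip connection rather than a confirmed formula from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_modes(mode_features, mode_scores):
    """Combine the N per-mode feature vectors of one pixel into one vector.

    mode_features: N vectors, one per mode (outputs of the sub-transformers).
    mode_scores:   N scalars B(k); their softmax over modes gives G.
    """
    probs = softmax(mode_scores)  # G: mode attention probabilities
    dim = len(mode_features[0])
    return [sum(p * feat[d] for p, feat in zip(probs, mode_features))
            for d in range(dim)]


def weighted_skip(original, transformed, w1):
    """Assumed skip form: w1 * EA(X) + (1 - w1) * X."""
    return [w1 * t + (1.0 - w1) * o for o, t in zip(original, transformed)]


# Four modes with equal scores, so each mode receives probability 0.25.
mode_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
mode_scores = [0.0, 0.0, 0.0, 0.0]
ea_out = aggregate_modes(mode_features, mode_scores)
fused = weighted_skip([2.0, 2.0], ea_out, w1=0.5)
```

With unequal scores, the softmax shifts weight toward the modes with higher B(k), which is what makes the aggregation dynamic per pixel.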

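To impose a spatial bias, the authors found that conventional positional embeddings do not yield meaningful biases, so a relative position bias is used instead. The bias is a matrix B ∈ R^((2r+1)×(2r+1)) that is added to the computed attention, where r is the radius that specifies the local range of the bias.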

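Specifically, suppose the original attention matrix is reshaped into a 4D tensor A ∈ R^(H×W×H×W), where H and W are the height and width of the frame features. For each pixel (i,j), A(i,j) is a matrix holding the attention weights between pixel (i,j) and all pixels in the same frame. The relative position bias B is added to the radius-r neighborhood of pixel (i,j).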
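The step above can be sketched as a loop over pixels and offsets. This is a minimal sketch under assumptions: the function name `add_relative_position_bias` is hypothetical, the bias is indexed as `bias[dy + r][dx + r]` for an offset (dy, dx), offsets that fall outside the frame are skipped, and the bias is added to the attention values before any normalization.

```python
def add_relative_position_bias(attn, bias, r):
    """Return a copy of attn with the bias added inside each pixel's neighborhood.

    attn[i][j][m][n]: attention value from pixel (i, j) to pixel (m, n).
    bias: (2r+1) x (2r+1) matrix; bias[dy + r][dx + r] applies to offset (dy, dx).
    """
    h, w = len(attn), len(attn[0])
    # Copy the 4D tensor so the input is left unchanged.
    out = [[[row[:] for row in attn[i][j]] for j in range(w)] for i in range(h)]
    for i in range(h):
        for j in range(w):
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    m, n = i + dy, j + dx
                    if 0 <= m < h and 0 <= n < w:  # skip out-of-frame offsets
                        out[i][j][m][n] += bias[dy + r][dx + r]
    return out


# Toy 3x3 feature map with radius 1 (the notes report r = 7 in practice).
H, W, r = 3, 3, 1
attn = [[[[0.0] * W for _ in range(H)] for _ in range(W)] for _ in range(H)]
# Illustrative bias: negative at the center, positive around it.
bias = [[0.5, 0.5, 0.5],
        [0.5, -2.0, 0.5],
        [0.5, 0.5, 0.5]]
biased = add_relative_position_bias(attn, bias, r)
```

The same (2r+1)×(2r+1) matrix is shared by every pixel, so the bias depends only on the relative offset, not on the absolute position.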

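In the implementation, the number of modes is 4 and the radius r of the relative position bias is 7. The figure below visualizes the relative position bias of CRAFT trained on Sintel. Two interesting patterns can be observed:

The smallest bias value is about -2, at (0,0). This means that when the new feature of pixel (i,j) is computed, this bias term reduces the weight of the pixel's own feature by 2. Without this term, the attention weight of pixel (i,j) on itself could dominate the weights of other pixels, because a feature vector is most similar to itself. The term lowers the share of a pixel's old feature in the combined output feature, which effectively encourages new information to flow in from other pixels.
The largest bias values lie 2 to 3 pixels away from the center pixel, which means that the features of these surrounding pixels are the ones most often used to supplement the feature of the center pixel.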

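Both observations are confirmed in the figure below, which shows heatmaps of the SSTrans self-attention between a query point (red rectangle) and all pixels in the same image. The most intense regions are where the query point attends most strongly and from which it draws features to enrich itself. Setting the position bias to 0 degrades performance.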

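It is tempting to apply the transformer to the features of both frames. In experiments, however, doing so degrades performance. The authors' hypothesis builds on the common belief that image matching relies heavily on local, structural high-frequency (HF) features, while a large amount of HF noise contaminates the informative features and hinders matching. SSTrans acts as a low-pass filter that suppresses HF noise, but it may also weaken HF features while strengthening low-frequency (LF) features. The model therefore learns a trade-off between the LF and HF components of frame 2 in order to match frame 1. When SSTrans is applied to both frames, both contain fewer HF and more LF components. Matching them may produce many spurious correlations and hurt the accuracy of the optical flow.

This intuition is confirmed in the figure below.

Figure caption: correlations between a query point in frame 1 and frame 2 on the Sintel (Final pass) test set. The images are cropped. The standard CRAFT setting ("Single SSTrans") has the fewest noisy correlations, while "Double SSTrans" produces more noisy correlations.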

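3. Cross-Frame Attention for the Correlation Volume

In the current paradigm, the correlation volume is the basis of cross-frame pixel matching. After the frame features f1 and f2 are computed, the correlation volume is computed as a 4D tensor in R^(H×W×H×W). Traditionally, it is the pairwise dot product of f1 and f2: the entry for pixel (i,j) in frame 1 and pixel (m,n) in frame 2 is the dot product of the feature vectors f1(i,j) and f2(m,n).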
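The traditional dot-product correlation volume can be sketched directly from this definition. This is a minimal pure-Python sketch for small inputs; the function name `correlation_volume` and the nested-list layout are choices made for this example, and real implementations compute the same quantity with batched matrix multiplication.

```python
def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def correlation_volume(f1, f2):
    """Pairwise dot products between all pixels of two feature maps.

    f1, f2: H x W grids of D-dimensional feature vectors.
    Returns corr with corr[i][j][m][n] = f1[i][j] . f2[m][n].
    """
    h, w = len(f1), len(f1[0])
    return [[[[dot(f1[i][j], f2[m][n]) for n in range(w)]
              for m in range(h)]
             for j in range(w)]
            for i in range(h)]


# Toy 1x2 feature maps with 2-dimensional features.
f1 = [[[1.0, 0.0], [0.0, 1.0]]]
f2 = [[[0.0, 1.0], [1.0, 0.0]]]
corr = correlation_volume(f1, f2)
# Pixel (0,0) of frame 1 matches pixel (0,1) of frame 2 and vice versa.
```

The result has H×W×H×W entries, so its size grows quadratically with the number of pixels.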

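Conceptually, the correlation volume is essentially transformer cross-attention without the feature transformation performed by the query and key projections. The query/key projections can be viewed as feature filters that select the most informative features for correlation. Moreover, multiple query and key projections can be used to obtain diverse correlations, as in Expanded Attention (EA). VCN pursues similar multi-faceted correlations by using multiple channels. These benefits motivate the paper to replace the dot product with a simplified EA.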
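To make the feature-filter view concrete, the sketch below computes correlations after a query projection on frame 1 features and a key projection on frame 2 features. It is a single-mode illustration under assumptions, not the paper's simplified EA: the function names and the matrices `w_q` and `w_k` are hypothetical, the projections are fixed by hand rather than learned, and the aggregation across multiple modes is not shown.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vec) for row in mat]


def projected_correlation(f1, f2, w_q, w_k):
    """Correlations between projected features of two H x W feature maps.

    corr[i][j][m][n] = (w_q @ f1[i][j]) . (w_k @ f2[m][n]).
    With identity projections this reduces to the plain dot product.
    """
    h, w = len(f1), len(f1[0])
    q = [[matvec(w_q, f1[i][j]) for j in range(w)] for i in range(h)]
    k = [[matvec(w_k, f2[m][n]) for n in range(w)] for m in range(h)]
    return [[[[dot(q[i][j], k[m][n]) for n in range(w)]
              for m in range(h)]
             for j in range(w)]
            for i in range(h)]


# Channel 0 carries the useful signal; channel 1 is treated as a noisy channel.
f1 = [[[1.0, 5.0]]]   # one pixel in frame 1
f2 = [[[1.0, -5.0]]]  # its true match in frame 2, with the noisy channel flipped

identity = [[1.0, 0.0], [0.0, 1.0]]
keep_ch0 = [[1.0, 0.0], [0.0, 0.0]]  # hand-picked filter that drops channel 1

plain = projected_correlation(f1, f2, identity, identity)
filtered = projected_correlation(f1, f2, keep_ch0, keep_ch0)
```

With identity projections the noisy channel dominates and the true match scores -24.0, while the filtering projections give it a score of 1.0. In a trained model the projections are learned rather than chosen by hand.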