I. Paper Overview

1. First author: Jie Zhu

2. Year: 2021

3. Venue: arXiv

4. Keywords: MVS, 3D reconstruction, Transformer, self-attention, cross-attention

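5. Motivation: CNNs have inherent limitations. The lack of global context often causes local ambiguities in textureless or weakly textured regions, which reduces matching robustness. In addition, previous methods extract the features of each view independently of the other views, and such independently extracted features are hardly optimal for 3D reconstruction.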

CNN-based approach: These learning-based methods make use of convolutional neural networks (CNNs) to infer a depth map for each view, and carry out a separate multi-view depth fusion process to reconstruct 3D point clouds. For depth map inference, they generally first utilize a 2D CNN to extract dense features of each view separately, and then perform robust feature matching to regress the depth map.

Problem 1: However, these methods often suffer from matching ambiguities and mismatches in challenging regions, such as texture-less areas or non-Lambertian surfaces. One of the main reasons is that dense features extracted by CNNs with a limited receptive field are difficult to capture global context. The lack of global context usually leads to local ambiguities in untextured or texture-less regions, thus reducing the robustness of matching. Although some recent works try to obtain large context using deformable convolution or multi-scale information aggregation, the solution of mining the global context in each view has not been explored yet for MVS.

Problem 2: Besides, in previous methods, the feature of each view is extracted independently from other views. These independently extracted features are hardly optimal for 3D reconstruction. For MVS, due to the widespread existence of non-Lambertian surfaces, the features without being 3D-consistent at the same 3D position may vary considerably across different views, which leads to mismatches. Therefore, exploring a way to access 3D-consistent features is critical for robust and reliable matching in MVS.

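6. Goal: Although some recent works try to obtain a large context using deformable convolution or multi-scale information aggregation, mining the global context within each view has not yet been explored for MVS. In addition, finding a way to obtain 3D-consistent features is critical for robust and reliable matching in MVS.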

7. Core idea: The proposed MVSTR takes full advantages of Transformer to enable features to be extracted under the guidance of global context and 3D geometry, which brings significant improvement on reconstruction results.

A new MVS network built upon Transformer, termed MVSTR, is proposed. To our best knowledge, this is the first Transformer architecture for MVS.

A global-context Transformer is proposed to explore the intra-view global context.

To acquire 3D-consistent features, a 3D-geometry Transformer with the well-designed cross-view attention mechanism is proposed to efficiently enable inter-view information interaction for extraction of multi-view features.

8. Experimental results:

The proposed method outperforms the state-of-the-art methods on the DTU dataset and achieves robust generalization on the Tanks & Temples benchmark dataset.

9. Paper download:

https://arxiv.org/pdf/2112.00336.pdf

II. Method

1. Overview of MVSTR

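The architecture is shown in the figure. Given one reference image and N source images:

A 2D CNN extracts local features, which are given a position encoding and flattened into sequences.
For the features of each view, a global-context Transformer module explores the intra-view global context.
A 3D-geometry Transformer module built on cross-view attention produces 3D-consistent dense features and enables efficient information interaction across views.
The two modules are alternated Z times, so that the Transformer features of each view perceive both the intra-view global context and the inter-view 3D geometry.
The Transformer features are combined with the local features, and the widely used coarse-to-fine regression generates the depth map.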

2. Global-Context Transformer

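To obtain dense features that carry intra-view global information, the global-context Transformer module uses multi-head self-attention to learn long-range dependencies. Before being fed into the module, every pixel of the reference and source features is first augmented with a learnable 2D position encoding P, which has one entry per pixel; j denotes the number of pixels in each view's CNN feature map, and the same P is used for all views. The position-aware features of the reference view and the source views are then flattened into the sequences Xr and Xs.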
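The position-encoding and flattening step can be illustrated with a minimal sketch. It assumes plain nested lists shaped [C][H][W] and a row-major pixel order; the function name `add_position_and_flatten` and the toy values are hypothetical, and the constant `pos` only stands in for the learnable P.

```python
def add_position_and_flatten(feature, pos):
    """feature, pos: nested lists shaped [C][H][W]; returns H*W tokens of length C."""
    channels = len(feature)
    height = len(feature[0])
    width = len(feature[0][0])
    tokens = []
    for y in range(height):          # row-major order: pixel (y, x) -> index y * W + x
        for x in range(width):
            tokens.append([feature[c][y][x] + pos[c][y][x] for c in range(channels)])
    return tokens

# Toy example: 2 channels on a 2x3 feature map, so j = 6 tokens of dimension 2.
ref_feature = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
               [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
# Stand-in for the learnable position encoding P (the same P is reused for every view).
pos = [[[0.1] * 3, [0.2] * 3],
       [[0.0] * 3, [0.0] * 3]]
x_r = add_position_and_flatten(ref_feature, pos)
```

Each source view would go through the same function with the same `pos`, giving one sequence per view.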

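The structure of the global-context Transformer module is shown in the figure below; its core is Transformer Layer-S. A separate Transformer Layer-S is applied to each view to explore its intra-view global context. The layer is described here for the reference view, with Xr as input.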

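In Transformer Layer-S, Concat(·,·) denotes concatenation and LN(·) denotes layer normalization. MSA(Qr, Kr, Vr) denotes multi-head self-attention with query Qr, key Kr, and value Vr; it lets every pixel build dependencies with all other pixels in the same view. FFN(·) denotes a fully connected feed-forward network, which improves the fitting ability of the model. Cr denotes the context-aware feature of the reference view. The context-aware feature Csi of each source view is obtained in the same way. Because the global-context Transformer module explores the global context within each view, it reduces local ambiguities in large textureless or weakly textured regions.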
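As a rough sketch only, one way the operators named above could be composed is shown here. It assumes a LoFTR-style layer in which Qr, Kr, and Vr are all derived from Xr and a residual connection is added at the end; the ordering of the operators and the intermediate symbol X'_r are assumptions, not taken from the text.

```latex
\begin{aligned}
X_r' &= \mathrm{Concat}\bigl(X_r,\ \mathrm{LN}(\mathrm{MSA}(Q_r, K_r, V_r))\bigr),\\
C_r  &= X_r + \mathrm{LN}\bigl(\mathrm{FFN}(X_r')\bigr).
\end{aligned}
```

Under this reading, the concatenation lets the feed-forward network see both the original token and the attended message before the result is added back to Xr.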

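3. 3D-Geometry Transformer

The 3D-geometry Transformer module obtains dense 3D-consistent features and efficiently facilitates information interaction across multiple views. Using a cross-view attention mechanism, Transformer Layer-Cr is constructed first so that the reference view can access information from all source views, yielding the 3D-consistent reference feature Tr. Then, based on Tr, Transformer Layer-Cs is constructed to use the information in Tr to obtain source-view features that are 3D-consistent with it.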

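Details are shown in the figure above. The context-aware features Cr and Cs produced by the global-context Transformer module are fed into the 3D-geometry Transformer module. To fuse source-view information into the reference view, N cross-view attentions are applied first, each using one Cs to enhance Cr. Together, these N cross-view attentions are denoted Transformer Layer-Cr.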

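MCA denotes the multi-head cross-attention used in Transformer Layer-Cr, and Cri is the reference-view feature enhanced by Csi. The enhanced features are then averaged to obtain the 3D-consistent reference feature Tr, that is, Tr = (1/N) Σ_{i=1..N} Cri.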

Next, Transformer Layer-Cs is constructed from another N cross-view attentions, which use the 3D-consistent reference feature Tr to enhance each Csi and thereby obtain the 3D-consistent source feature Tsi.

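Thanks to this mechanism, 3D-consistent features are obtained for both the reference view and the source views, which effectively alleviates mismatches on non-Lambertian surfaces and improves 3D reconstruction.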
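The data flow of the module (Layer-Cr, averaging, Layer-Cs) can be sketched in a few lines. This is a simplified illustration under stated assumptions: single-head softmax attention without learned projections, and a plain residual addition in place of the layer normalization and feed-forward parts. The function names are hypothetical.

```python
import math

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of token vectors."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)                      # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, values)) / total
                    for c in range(len(values[0]))])
    return out

def residual(x, update):
    """Token-wise x + update (assumed stand-in for the LN/FFN part of the layer)."""
    return [[a + b for a, b in zip(tx, tu)] for tx, tu in zip(x, update)]

def geometry_transformer(c_r, c_sources):
    # Layer-Cr: the reference view queries each source view in turn, giving C_r^i.
    enhanced = [residual(c_r, attention(c_r, c_s, c_s)) for c_s in c_sources]
    # Average the N enhanced reference features to get T_r.
    n = len(enhanced)
    t_r = [[sum(e[t][c] for e in enhanced) / n for c in range(len(c_r[0]))]
           for t in range(len(c_r))]
    # Layer-Cs: each source view queries T_r, giving T_s^i.
    t_sources = [residual(c_s, attention(c_s, t_r, t_r)) for c_s in c_sources]
    return t_r, t_sources

# Toy input: 2 tokens of dimension 2 per view, N = 2 source views.
c_r = [[1.0, 0.0], [0.0, 1.0]]
c_sources = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
t_r, t_sources = geometry_transformer(c_r, c_sources)
```

Note the ordering: the source views are updated only after Tr has aggregated information from all of them, so every Tsi indirectly sees every other source view through Tr.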

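4. Loss Function

Similar to existing coarse-to-fine MVS methods, a smooth L1 loss is applied at each scale to supervise the depth estimates at different resolutions. The total loss is the weighted sum over scales:

L = Σ_{m=1..M} αm · Lm

where M is the total number of scales of the network, set to 3, and Lm and αm are the loss and the corresponding loss weight at scale m. Here m = 1 is the coarsest scale and m = 3 the finest. As m increases from 1 to 3, αm is set to 0.5, 1.0, and 2.0, respectively.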
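A small worked example of the weighted multi-scale loss, as a sketch. It assumes the common smooth L1 form with threshold beta = 1.0 and ignores valid-pixel masking; the function names and the toy depth values are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss over a flat list of depths; beta is an assumed threshold."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)

def multi_scale_loss(preds, targets, weights=(0.5, 1.0, 2.0)):
    """preds/targets: per-scale flat depth lists, ordered coarsest (m=1) to finest (m=3)."""
    return sum(w * smooth_l1(p, t) for w, p, t in zip(weights, preds, targets))

# Toy depth maps at three scales (2, 4, and 8 pixels).
preds = [[1.0, 2.0], [1.0, 2.0, 3.0, 4.0], [1.5] * 8]
targets = [[1.5, 2.0], [1.0, 2.0, 3.0, 6.0], [1.5] * 8]
loss = multi_scale_loss(preds, targets)   # 0.5 * 0.0625 + 1.0 * 0.375 + 2.0 * 0.0
```

The largest weight sits on the finest scale, so errors in the final full-resolution depth map dominate the total.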

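5. Implementation

1) The 2D CNN that extracts features from each view is an eight-layer architecture similar to the one used in MVSNet. To improve computational efficiency, the batch normalization layer and ReLU activation are replaced with a unified in-place activated batch normalization layer, which saves nearly 40% of memory. The output for each view is a 32-channel feature map downsampled by a factor of 4 relative to the input image. The 2D CNN weights are shared across views.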

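2) The global-context Transformer module and the 3D-geometry Transformer module are stacked alternately 4 times. In the global-context Transformer module, the weights of Transformer Layer-S are shared across views. Likewise, for Transformer Layer-Cr and Transformer Layer-Cs, the cross-view attention weights are shared across views. All Transformer layers use linearized multi-head attention with 4 heads.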
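To see why linearized attention matters here, note that a 640×512 training image at 1/4 resolution already gives 20480 tokens per view. The sketch below shows one common linearized form, assuming the kernel feature map elu(x) + 1; which variant the network actually uses is not stated in this text, and the single-head function `linear_attention` is hypothetical.

```python
import math

def feature_map(x):
    """elu(x) + 1, a positive kernel feature map (an assumed choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """out_i = phi(q_i) (sum_n phi(k_n) v_n^T) / (phi(q_i) . sum_n phi(k_n))."""
    d, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]   # sum_n phi(k_n) v_n^T, shape d x d_v
    k_sum = [0.0] * d                      # sum_n phi(k_n)
    for k, v in zip(keys, values):         # one pass over keys/values: O(j * d * d_v)
        phi_k = feature_map(k)
        for a in range(d):
            k_sum[a] += phi_k[a]
            for b in range(d_v):
                kv[a][b] += phi_k[a] * v[b]
    out = []
    for q in queries:                      # one pass over queries: O(j * d * d_v)
        phi_q = feature_map(q)
        norm = sum(pq * ks for pq, ks in zip(phi_q, k_sum))
        out.append([sum(phi_q[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(d_v)])
    return out

# Tokens per view at the training resolution, after the 4x downsampling of the 2D CNN.
num_tokens = (640 // 4) * (512 // 4)

queries = [[0.3, -0.2], [1.0, 2.0]]
keys = [[0.1, 0.5], [-1.0, 0.2], [0.7, -0.3]]
values = [[2.0, -1.0]] * 3
out = linear_attention(queries, keys, values)
```

Because the key-value summary `kv` is computed once and reused for every query, the cost grows linearly with the number of tokens instead of quadratically, as it would with a full softmax attention matrix.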

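3) Coarse-to-fine depth regression consists of cost volume pyramid construction and 3D CNN regularization at 3 scales. To build the pyramid, the cost volume at the coarsest scale is first constructed from the Transformer features via differentiable homography warping and average group-wise correlation with 8 groups. At finer scales, the Transformer features are first upsampled by bilinear interpolation and fused with the features of the corresponding scale from the 2D CNN, which are passed through a 1×1 convolution layer. The finer cost volumes are then built from the fused features in the same way as at the coarsest scale. The number of depth hypotheses and the corresponding depth intervals follow CasMVSNet.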
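The group-wise correlation at a single pixel and depth hypothesis can be sketched as follows. The sketch assumes the source features have already been warped into the reference view, and it reads "average" as the mean over source views, which is an assumption; the function names are hypothetical.

```python
def group_correlation(ref, src, groups=8):
    """Per-group mean of the elementwise product of two C-dimensional feature vectors."""
    channels = len(ref)
    assert channels % groups == 0
    size = channels // groups              # 32 channels / 8 groups = 4 channels per group
    return [sum(ref[g * size + c] * src[g * size + c] for c in range(size)) / size
            for g in range(groups)]

def average_group_correlation(ref, warped_sources, groups=8):
    """Average the group-wise correlation over the N warped source features."""
    per_view = [group_correlation(ref, src, groups) for src in warped_sources]
    return [sum(v[g] for v in per_view) / len(per_view) for g in range(groups)]

# Toy example: one source matches the reference perfectly, the other not at all.
ref = [1.0] * 32
warped_sources = [[1.0] * 32, [0.0] * 32]
cost = average_group_correlation(ref, warped_sources)
```

Each pixel and depth hypothesis thus contributes an 8-dimensional similarity vector rather than the full 32 channels, which keeps the cost volume small before 3D regularization.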

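4) For 3D CNN cost regularization, 3D U-Nets without shared weights are applied at the 3 scales. As in the 2D CNN, the batch normalization layers and ReLU activations in the 3D U-Nets are replaced with unified in-place activated batch normalization layers. Finally, the soft argmin operation regresses the depth maps at the different scales.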
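Soft argmin regresses depth as an expectation over the depth hypotheses. A minimal per-pixel sketch follows; it assumes the regularized volume is treated as scores where a higher value means a better match (negate the input if it is a cost), and the depth values are hypothetical.

```python
import math

def soft_argmin(scores, depth_hypotheses):
    """Expected depth under a softmax over per-hypothesis scores (higher = better match)."""
    peak = max(scores)                               # subtract the max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w / total * d for w, d in zip(weights, depth_hypotheses))

# Two equally likely neighbouring hypotheses yield a depth between them.
depths = [425.0, 500.0, 575.0, 650.0]
depth = soft_argmin([0.0, 5.0, 5.0, 0.0], depths)   # 537.5 by symmetry
```

Unlike a hard argmax, this expectation is differentiable and can land between hypotheses, which gives sub-interval depth precision.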

6. Experiments

6.1. Implementation Details

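The network is implemented in PyTorch and trained on an NVIDIA GeForce GTX 1080Ti GPU and an Intel Core i9-9900K CPU @ 3.60 GHz. During training, the number of source images N is set to 2 and the input image resolution is set to 640×512. The network is optimized with Adam.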

6.2. Comparison with State-of-the-Art Methods

6.3. Ablation Study

For a fair comparison, a fixed input size of 1152 × 864 is used to evaluate the computational cost on a single GPU of NVIDIA GeForce GTX 1080Ti.

[Paper Summary] Multi-View Stereo with Transformer (arXiv 2021)