[Paper Description] CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo (IJCAI 2023)

1. Brief description of the paper

1. First authors: Weitao Chen, Hongbin Xu

2. Year of publication: 2023

3. Published in: IJCAI (conference)

4. Keywords: MVS, 3D reconstruction, cost volume, Transformer

5. Exploration motivation: Existing methods handle the cost volume either with hand-crafted processing techniques, which are agnostic to severe deformation, or with CNNs whose limited receptive fields cannot distinguish locally consistent mismatches. Although Transformers have demonstrated strong performance in many applications, the time and memory complexity of the key-query dot-product interaction in conventional self-attention grows quadratically with the spatial resolution of the input. Simply replacing a 3D CNN with a Transformer may therefore incur unexpected additional memory footprint and inference latency.

Correspondence estimation framework: Regardless of the task this correspondence estimation problem is applied to, the matching task boils down to a classical matching pipeline: (1) feature extraction and (2) cost aggregation.

Evolution of feature extraction: In learning-based MVS methods, the transition from traditional hand-crafted features to CNN-based features inherently solves the first step of the classical matching pipeline by providing powerful feature representations learned from large-scale data.

Problem of the raw cost volume: Handling the cost aggregation step by matching similarities between features without any prior usually suffers from ambiguities caused by repetitive patterns or background clutter.

6. Work goal: Use efficient Transformer to improve the accuracy of cost aggregation.

7. Core idea: In this work, we focus on the cost aggregation step of the cost volume and propose a novel cost aggregation Transformer (CostFormer) to tackle the issues above. We further introduce the Transformer architecture into an iterative multi-scale learnable PatchMatch pipeline.

  1. We propose a novel Transformer-based cost aggregation network called CostFormer, which can be plugged into learning-based MVS methods to improve cost volume effectively.
  2. CostFormer applies an efficient Residual Depth-Aware Cost Transformer to cost volume, extending 2D spatial attention to 3D depth and spatial attention.
  3. CostFormer applies an efficient Residual Regression Transformer between cost aggregation and depth regression, keeping spatial attention.

8. Experimental results:

The proposed CostFormer brings benefits to learning-based MVS methods when evaluated on the DTU, Tanks & Temples, ETH3D and BlendedMVS datasets.

9. Paper download:

https://arxiv.org/pdf/2305.10320.pdf

2. Implementation process

1. Overview

As shown in the figure below, CostFormer is built on PatchMatchNet: it extracts feature maps from the multi-view images, initializes and propagates depth hypotheses, and warps the source-view feature maps to the reference view. The cost volume is then constructed by group-wise correlation, and the per-view costs are aggregated with pixel-wise view weights wi(p).

The cost aggregation module first applies a small 3D convolutional network with 1×1×1 kernels to obtain a single-channel cost C ∈ H×W×D. For a spatial window of K pixels organized into a grid, an additional per-pixel offset Δpk is learned for spatial adaptation. The spatially aggregated cost C~(p, j) is defined as:
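Based on the description below, and consistent with PatchMatchNet's adaptive spatial cost aggregation, the formula presumably takes the form:

$$\tilde{C}(p, j) = \frac{1}{\sum_{k=1}^{K} w_k d_k} \sum_{k=1}^{K} w_k d_k \, C(p + p_k + \Delta p_k, j)$$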

Here, wk and dk weight the cost C according to feature similarity and depth similarity, respectively. Given the sampling positions {p + pk + Δpk}, k = 1, …, K, the corresponding features of F0 are extracted via bilinear interpolation. Group correlation is then applied between these features and the feature at p at each sampling location. The results are concatenated into a volume, on which a 3D convolutional layer with 1×1×1 kernels and a sigmoid nonlinearity outputs the normalized weights {wk}. The absolute difference between the inverse depth of each sampling point and that of pixel p under the j-th hypothesis is collected, and a sigmoid is applied to the inverted values to obtain {dk}.
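A minimal PyTorch-style sketch of this weighting and aggregation, assuming the sampled costs, the group correlations of the sampled features, and the inverse-depth differences have already been gathered; all module, tensor names and shapes here are hypothetical, not the authors' implementation:

```python
import torch
import torch.nn as nn


class AdaptiveSpatialCostAgg(nn.Module):
    """Sketch of the adaptive spatial cost aggregation described above (not the authors' code)."""

    def __init__(self, groups: int):
        super().__init__()
        # 1x1x1 3D conv + sigmoid turning per-sample group correlations into weights {w_k}
        self.weight_net = nn.Conv3d(groups, 1, kernel_size=1)

    def forward(self, sampled_cost, sampled_corr, inv_depth_diff):
        # sampled_cost:   (B, K, D, H, W)  cost at the K sampled positions p + p_k + Δp_k
        # sampled_corr:   (B, G, K, H, W)  group correlation between F0(p + p_k + Δp_k) and F0(p)
        # inv_depth_diff: (B, K, D, H, W)  |inverse-depth difference| between each sample and p
        w_k = torch.sigmoid(self.weight_net(sampled_corr))   # (B, 1, K, H, W)
        w_k = w_k.squeeze(1).unsqueeze(2)                    # (B, K, 1, H, W)
        d_k = torch.sigmoid(-inv_depth_diff)                 # one reading of "sigmoid of the inverted values"
        weights = w_k * d_k                                  # (B, K, D, H, W)
        agg = (weights * sampled_cost).sum(dim=1) / weights.sum(dim=1).clamp(min=1e-6)
        return agg                                           # (B, D, H, W): aggregated cost C~(p, j)
```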

It is worth noting that this cost aggregation inevitably suffers from the ambiguity caused by repetitive patterns or background clutter. Such local mechanisms appear in many of its operations, for example local propagation and spatial adaptation via small learnable offsets. CostFormer alleviates these problems through RDACT and RRT.

After RRT, depth is regressed with a soft argmin. Finally, a depth refinement module is designed to improve the regressed depth. For CascadeMVS and other cascade structures, CostFormer can be plugged in similarly.
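A minimal sketch of soft-argmin style depth regression, assuming the aggregated volume stores a matching score whose softmax over the depth dimension gives a probability per hypothesis (function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F


def soft_argmin_depth(cost: torch.Tensor, depth_hypotheses: torch.Tensor) -> torch.Tensor:
    # cost:             (B, D, H, W) aggregated matching score (negate first if it encodes a distance)
    # depth_hypotheses: (B, D, H, W) per-pixel depth hypotheses
    prob = F.softmax(cost, dim=1)                 # probability over the D hypotheses
    return (prob * depth_hypotheses).sum(dim=1)   # (B, H, W) expected (regressed) depth
```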

2. Residual Depth-Aware Cost Transformer (RDACT)

The Residual Depth-Aware Cost Transformer (RDACT) consists of two parts. The first part is a stack of depth-aware Transformer layers (DATL) and depth-aware shifted Transformer layers (DASTL), which processes the cost volume to fully explore the relationships within it. The second part is the re-embedding cost layer (REC), which recovers the cost volume from the output of the first part.

Given a cost volume C0 ∈ H×W×D×G, intermediate cost volumes C1, C2, …, CL ∈ H×W×D×E are first extracted by alternately applying DATL and DASTL:
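The alternation can presumably be written as:

$$C_k = \mathrm{DASTL}_k\big(\mathrm{DATL}_k(C_{k-1})\big), \quad k = 1, 2, \dots, L$$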

where DATLk is the k-th depth-aware Transformer layer with a regular window, DASTLk is the k-th depth-aware Transformer layer with a shifted window, and E is the embedding dimension of DATLk and DASTLk.

Then a re-embedding cost layer is applied to the last cost volume CL to recover G from E. The output of RDACT is:
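Given the residual design described next, the output presumably reads:

$$C_{out} = \mathrm{REC}(C_L) + C_0$$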

REC is the re-embedding cost layer, which is a three-dimensional convolution with G output channels. If E=G, then Cout can be simply expressed as:
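Presumably the re-embedding convolution can then be dropped, leaving a plain residual:

$$C_{out} = C_L + C_0$$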

This residual connection allows cost aggregation at different levels; Cout, instead of C0, is then aggregated by the original aggregation network.

Before introducing the construction of DATL and DASTL, their core components, depth-aware multi-head self-attention (DA-MSA) and depth-aware shifted multi-head self-attention (DAS-MSA), are described; both are based on the depth-aware self-attention mechanism. As preliminaries for the depth-aware self-attention mechanism, depth-aware block embedding and depth-aware windows are introduced first.

Depth-aware block embedding: Applying the attention mechanism directly at the pixel level of the feature maps is obviously expensive in GPU memory. To address this, a depth-aware block embedding is proposed to reduce the memory cost and provide additional regularization. Specifically, before aggregation, the group-wise cost volume C ∈ H×W×D×G is passed through the depth-aware block embedding to obtain cost tokens. The embedding consists of a 3D convolution with kernel size h×w×d and a layer normalization. To downsample the spatial size of the cost volume while keeping the depth hypotheses intact, h and w are set greater than 1 and d is set to 1; the downsampling rate can thus be adapted to the memory budget and run time. Before the convolution, the cost volume is padded to fit the spatial size and downsampling ratio. After layer normalization (LN), the embedded blocks are further partitioned by depth-aware windows.
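A small PyTorch-style sketch of such an embedding, under the assumption that it is a strided 3D convolution (kernel and stride h×w on the spatial axes, 1 on the depth axis) followed by LayerNorm; the module and argument names are made up:

```python
import torch
import torch.nn as nn


class DepthAwareBlockEmbed(nn.Module):
    """Sketch: downsample H and W of a group-wise cost volume while keeping the depth dimension."""

    def __init__(self, in_groups: int, embed_dim: int, h: int = 2, w: int = 2, d: int = 1):
        super().__init__()
        self.proj = nn.Conv3d(in_groups, embed_dim,
                              kernel_size=(d, h, w), stride=(d, h, w))  # depth kernel d = 1
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, cost: torch.Tensor) -> torch.Tensor:
        # cost: (B, G, D, H, W), already padded so H and W divide evenly by h and w
        x = self.proj(cost)                # (B, E, D, H/h, W/w)
        x = x.permute(0, 2, 3, 4, 1)       # channels-last for LayerNorm: (B, D, H/h, W/w, E)
        return self.norm(x)
```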

Depth-aware windows: Compared with global self-attention (whether vanilla or linearized), local self-attention within a window has proven more effective and efficient. Taking 2D windows as an example, Swin Transformer applies multi-head self-attention on non-overlapping 2D windows to avoid the huge computational complexity of global attention over all tokens. Extending from 2D spatial windows, the embedded cost volume with depth information is divided into non-overlapping 3D windows. These local windows are then transposed and reshaped into local cost tokens. Assuming that the size of these windows is hs×ws×ds, the total number of tokens is:
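Presumably this follows from the non-overlapping window partition (after padding):

$$\Big\lceil \tfrac{H}{h_s} \Big\rceil \times \Big\lceil \tfrac{W}{w_s} \Big\rceil \times \Big\lceil \tfrac{D}{d_s} \Big\rceil \ \text{windows}, \quad h_s \cdot w_s \cdot d_s \ \text{tokens per window}$$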

These local tokens are further processed by the multi-head self-attention mechanism.
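For illustration, a generic 3D window-partition routine in the style of Swin Transformer (not taken from the paper's code; the layout of the input tensor is an assumption):

```python
import torch


def window_partition_3d(x: torch.Tensor, hs: int, ws: int, ds: int) -> torch.Tensor:
    # x: (B, H, W, D, E) embedded cost volume, with H, W, D divisible by hs, ws, ds after padding
    B, H, W, D, E = x.shape
    x = x.view(B, H // hs, hs, W // ws, ws, D // ds, ds, E)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    # each row of the result is one local window of hs * ws * ds cost tokens
    return x.view(-1, hs * ws * ds, E)
```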

Depth-aware self-attention mechanism: For a local window token T ∈ hs×ws×ds×G, the query, key and value matrices Q, K and V ∈ hs×ws×ds×G are computed as follows:
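presumably via the usual shared linear projections:

$$Q = T P_Q, \quad K = T P_K, \quad V = T P_V$$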

where PQ, PK and PV ∈ G×G are projection matrices shared across different windows. For each head, by introducing a depth- and spatial-aware relative position bias B1 ∈ (hs×hs)×(ws×ws)×(ds×ds), the depth-aware self-attention (DA-SA1) within a 3D local window is computed as:
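presumably the standard scaled dot-product attention with the bias added to the logits:

$$\mathrm{DA\text{-}SA1}(Q_1, K_1, V_1) = \mathrm{SoftMax}\!\left(\frac{Q_1 K_1^{\top}}{\sqrt{G}} + B_1\right) V_1$$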

Here, Q1, K1 and V1 ∈ hswsds×G are reshaped from Q, K and V ∈ hs×ws×ds×G. Within DATL, LayerNorm (LN) and the multi-head DA-SA1 of the current layer are applied in a residual fashion; the full formulation is given below together with DA-SA2.

 

For each head, the depth-aware self-attention along the depth dimension (DA-SA2), an alternative attention module within DATL, introduces a depth-aware relative position bias B2 ∈ ds×ds and is computed as:
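presumably in the same form as DA-SA1, but restricted to the depth dimension:

$$\mathrm{DA\text{-}SA2}(Q_2, K_2, V_2) = \mathrm{SoftMax}\!\left(\frac{Q_2 K_2^{\top}}{\sqrt{G}} + B_2\right) V_2$$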

Q2, K2 and V2 are reshaped into hsws×ds×G. Along the depth dimension, B1 and B2 lie in the range [-ds+1, ds-1]; along the height and width dimensions, B1 lies in [-hs+1, hs-1] and [-ws+1, ws-1]. In the implementation, the values of B1 are taken from a smaller parameterized bias matrix B1~ ∈ (2hs−1)×(2ws−1)×(2ds−1). The attention function is executed f times in parallel, and the depth-aware multi-head self-attention (DA-MSA) outputs are then concatenated. With layer normalization (LN) in the current layer, the multi-head DA-SA1 and DA-SA2 steps of DATL are expressed as:
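A plausible formulation, following the Swin-style pre-norm residual layout, is:

$$X = \mathrm{DA\text{-}MSA1}\big(\mathrm{LN}(C_{k-1})\big) + C_{k-1}, \qquad X' = \mathrm{DA\text{-}MSA2}\big(\mathrm{LN}(X)\big) + X$$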

Then, further feature transformation is performed with an MLP module consisting of two fully connected layers with a GELU nonlinearity between them:
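presumably completing the DATL block as:

$$\mathrm{DATL}_k(C_{k-1}) = \mathrm{MLP}\big(\mathrm{LN}(X')\big) + X'$$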

Compared to global attention, local attention enables high-resolution computation.

However, there are no connections between the local windows under a fixed partition. Therefore, regular and shifted window partitioning are used alternately to build cross-window connections. At the next layer, the window partition is shifted along the height, width and depth axes by (hs/2, ws/2, ds/2). Depth-aware self-attention is then computed within these shifted windows (DAS-MSA); the entire DASTL process can be expressed as:
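Writing Y = DATLk(C_{k-1}) for the output of the preceding regular-window layer, a plausible form mirroring DATL is:

$$X = \mathrm{DAS\text{-}MSA1}\big(\mathrm{LN}(Y)\big) + Y, \qquad X' = \mathrm{DAS\text{-}MSA2}\big(\mathrm{LN}(X)\big) + X, \qquad C_k = \mathrm{MLP}\big(\mathrm{LN}(X')\big) + X'$$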

DAS-MSA1 and DAS-MSA2 correspond to multi-head Attention 1 and Attention 2 within a shifted window, respectively. If the number of stages is n, then there are n RDACT blocks in CostFormer.

3. Residual Regression Transformer (RRT)

After aggregation, the cost volume C ∈ H×W×D is used for depth regression. To further explore the spatial relationships at each depth, a Transformer block is applied to C before the softmax. Inspired by RDACT, the full residual regression Transformer (RRT) can be expressed as:
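presumably analogous to RDACT:

$$C_k = \mathrm{RST}_k\big(\mathrm{RT}_k(C_{k-1})\big), \quad k = 1, \dots, L, \qquad C_{out} = \mathrm{RER}(C_L) + C_0$$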

where RTk is the k-th regression Transformer layer with a regular window, RSTk is the k-th regression Transformer layer with a shifted window, and RER is a re-embedding regression layer that restores the depth dimension from CL; it can be a 2D convolution with D output channels.

RRT also computes self-attention within local windows. Compared with RDACT, RRT focuses more on spatial relationships. Compared with regular Swin Transformer blocks, RRT treats depth as the channel dimension: the nominal channel number is 1, and that channel is squeezed before RRT. The embedding parameters are set to accommodate cost aggregation across different iterations. If the embedding dimension is D, then Cout can be simply expressed as:
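presumably mirroring the RDACT case:

$$C_{out} = C_L + C_0$$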

Since a stage may iterate multiple times under different depth assumptions, the number of RRT blocks should be set to the same as the number of iterations.

4. Loss function

The final loss combines the losses of all iterations of all stages with the loss of the final refinement module:
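presumably of the form (any per-stage weighting is omitted here):

$$L = \sum_{k=1}^{n} \sum_{i=1}^{n_k} L_k^i + L_{ref}$$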

where Lk^i is the regression or unified loss of the i-th iteration of the k-th stage, and Lref is the regression or unified loss of the refinement module. If the refinement module is not present, Lref is set to zero.

5. Experiment

5.1. Implementation details

NVIDIA Tesla V100 GPU

5.2. Comparison with state-of-the-art methods

DTU: 2693 MB memory and 0.231 s runtime


Origin blog.csdn.net/qq_43307074/article/details/132100988