[Paper Brief] WT-MVSNet: Window-based Transformers for Multi-view Stereo (arxiv 2023)

1. Brief introduction of the paper

1. First author: Jinli Liao, Yikang Ding

2. Year of publication: 2023

3. Publication venue: arXiv (preprint)

4. Keywords: MVS, 3D reconstruction, Transformer, epipolar, geometric constraints

5. Motivation for Exploration:

1. Matching every pixel between the reference and source images without epipolar geometric constraints leads to matching redundancy. A recent attempt to perform attention-based matching along the epipolar lines of source images (MVS2D) is instead sensitive to inaccurate camera poses and calibration, which can result in erroneous matches.

2. While learned MVS methods aim to estimate the likelihood of depth hypotheses from multi-view feature consistency, they compute only the absolute error between the ground truth and the predicted depth expectation, without geometric consistency supervision.

6. Work goal: to solve the above problems.

7. Core idea: The proposed WT-MVSNet takes full advantage of Transformers so that features are extracted and matched under the guidance of global context and geometric constraints, which brings a significant improvement in reconstruction quality:

  1. We introduce a Window-based Epipolar Transformer (WET) for enhancing patch-to-patch matching between the reference feature and corresponding windows near epipolar lines in source features.
  2. We propose a window-based Cost Transformer (CT) to better aggregate global information within the cost volume and improve smoothness.
  3. We design a novel geometric consistency loss (Geo Loss) to supervise the estimated depth map with geometric multi-view consistency.

8. Experimental results:

Extensive experiments show that WT-MVSNet achieves state-of-the-art performance on multiple datasets and ranks 1st on the online Tanks and Temples benchmark.

9. Paper download:

https://arxiv.org/pdf/2112.00336.pdf

2. Implementation process

1. Overview of WT-MVSNet

The overall structure is shown in the figure below. The inputs are a reference image I0, source images Ii, the corresponding camera extrinsics Ti and intrinsics Ki, and a depth range [dmin, dmax]. Following CasMVSNet, the first step is to extract multi-scale features Fi at 1/4, 1/2 and full image resolution with a Feature Pyramid Network (FPN). To strengthen global feature interaction within and across views, a Window-based Epipolar Transformer (WET) is applied, which alternates intra-attention and cross-attention on the extracted features. The transformed source features are then warped to the reference view to construct a 3D cost volume V of size H×W×C×D, where D is the number of depth hypotheses. Next, V is regularized by the proposed Cost Transformer (CT), which aggregates global cost information to produce a probability volume P of size H×W×D, from which the depth map is estimated. Finally, a cross-entropy loss (CE loss) supervises the probability volume, and the proposed geometric consistency loss (Geo loss) penalizes regions where multi-view geometric consistency is not satisfied.
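To make the tensor shapes concrete, here is a minimal PyTorch sketch of one coarse stage of such a pipeline using dummy tensors; the variance-based cost aggregation, the fake warp, and the depth range are illustrative stand-ins, not the authors' implementation.

```python
import torch

# Dummy sizes for the coarsest (1/4-resolution) stage.
B, C, H, W = 1, 32, 128, 160      # batch, feature channels, feature height/width
D = 48                            # number of depth hypotheses at stage 1
N = 5                             # 1 reference view + 4 source views

# Placeholder multi-view features after FPN + WET (reference is index 0).
feats = [torch.randn(B, C, H, W) for _ in range(N)]

# Plane-sweep cost volume: warp each source feature to the reference view at
# every depth hypothesis and aggregate, e.g. by feature variance.
ref_volume = feats[0].unsqueeze(2).expand(B, C, D, H, W)
warped = [ref_volume]
for src in feats[1:]:
    # In the real pipeline this is a differentiable homography warp that
    # depends on K_i, T_i and the depth hypotheses; here we fake it.
    warped.append(src.unsqueeze(2).expand(B, C, D, H, W))
volume = torch.stack(warped, dim=0)               # (N, B, C, D, H, W)
cost_volume = volume.var(dim=0)                   # (B, C, D, H, W) variance-based cost

# Cost regularization (CT in the paper) collapses the channel dimension and
# produces a probability volume; a plain channel mean stands in for it here.
prob_volume = torch.softmax(cost_volume.mean(dim=1), dim=1)   # (B, D, H, W)

# Depth via expectation over the hypotheses (illustrative DTU-like depth range).
depth_values = torch.linspace(425.0, 935.0, D).view(1, D, 1, 1)
depth = (prob_volume * depth_values).sum(dim=1)   # (B, H, W)
print(depth.shape)
```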

2. Window-based Epipolar Transformer

Most existing learning-based MVS methods construct cost volumes directly from warped features, which lack global context information, and point-to-line matching along epipolar lines is sensitive to inaccurate camera calibration. To address this, a Window-based Epipolar Transformer (WET) is introduced, which reduces matching redundancy through epipolar constraints by matching windows near the epipolar lines instead of individual points.

2.1. Preliminaries

Attention mechanism. Swin Transformer proposes a hierarchical feature representation with only linear computational complexity. A Swin Transformer block contains window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA), which can be expressed as:
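That is, the standard Swin Transformer block computation (consecutive W-MSA / SW-MSA blocks with residual connections):

$$
\begin{aligned}
\hat{z}^{l} &= \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}, &
z^{l} &= \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}, &
z^{l+1} &= \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}.
\end{aligned}
$$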

Here, LN and MLP denote LayerNorm and a multilayer perceptron, and ẑ^l and z^l are the outputs of the (S)W-MSA and the MLP of the l-th block, respectively. Swin Transformer divides the features into non-overlapping windows, groups them into query Q, key K and value V, and measures the similarity of the extracted features through the dot product of Q and K, which weights the corresponding V, defined as:
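That is, the standard windowed attention with a relative position bias:

$$
\text{Attention}(Q, K, V) = \text{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V
$$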

where d is the dimension of the query and key, and B is the relative position bias.

Intra-attention and cross-attention. When Q and K are extracted from the same feature map, the attention layer gathers relevant information within that feature map. Conversely, when Q and K are taken from different feature maps, the attention layer enhances the contextual interaction between different views.

2.2.  Window-based epipolar cross-attention

Following TransMVSNet, cross-attention is performed between F0 and each Fi, and only Fi is updated. Specifically, cross-attention is computed between windows of F0 and the corresponding windows of each Fi near the epipolar lines. First, F0 is divided into M non-overlapping windows W0j of the same size hwin×wwin, and the center point pj of each W0j is warped to Fi via a differentiable homography; the warped center point in the i-th source view is denoted pij. To perform cross-attention, a window Wij of the same size hwin×wwin is carved out around each pij, so that the epipolar line corresponding to pj passes through this window in the source feature. Cross-attention therefore strengthens the long-range global context interaction between each reference feature window and the window near the epipolar line in the source features.
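As a rough illustration of the warping step, the sketch below projects reference-window center points into a source view given a depth hypothesis; the function name and the pinhole back-project/re-project formulation are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def warp_points_to_source(pts_ref: torch.Tensor,
                          depth: torch.Tensor,
                          K_ref: torch.Tensor,
                          K_src: torch.Tensor,
                          T_ref2src: torch.Tensor) -> torch.Tensor:
    """Project reference pixel coordinates into a source view.

    pts_ref:   (N, 2) pixel coordinates (x, y) of window centers in the reference view.
    depth:     (N,)   depth hypothesis assigned to each center point.
    K_ref, K_src: (3, 3) intrinsics of the reference / source camera.
    T_ref2src: (4, 4) relative extrinsics mapping reference camera coords to source.
    Returns:   (N, 2) projected pixel coordinates in the source view.
    """
    ones = torch.ones(pts_ref.shape[0], 1)
    pix_h = torch.cat([pts_ref, ones], dim=1)             # (N, 3) homogeneous pixels
    cam_ref = (torch.inverse(K_ref) @ pix_h.T) * depth    # (3, N) back-project to 3D
    cam_ref_h = torch.cat([cam_ref, ones.T], dim=0)       # (4, N)
    cam_src = (T_ref2src @ cam_ref_h)[:3]                 # (3, N) in source camera frame
    pix_src = K_src @ cam_src                             # (3, N) project to source image
    return (pix_src[:2] / pix_src[2:].clamp(min=1e-6)).T  # (N, 2) de-homogenize

# Example: warp a single window center at depth 600 with an identity relative pose.
pts = torch.tensor([[80.0, 64.0]])
K = torch.tensor([[320.0, 0.0, 160.0], [0.0, 320.0, 128.0], [0.0, 0.0, 1.0]])
print(warp_points_to_source(pts, torch.tensor([600.0]), K, K, torch.eye(4)))
```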

2.3. WET structure

The architecture of WET is shown in the figure below. WET mainly consists of intra-attention and cross-attention modules. In the intra-attention module, each extracted feature Fi is divided into non-overlapping windows; each window is flattened and fed sequentially to W-MSA and SW-MSA. Since W-MSA only attends within each divided window, it cannot capture the global context of the entire input feature. To address this, SW-MSA with a shifted-window partitioning strategy is used to enhance the information exchange between different windows and obtain global context. To reduce matching redundancy and mitigate the effect of inaccurate camera poses and calibration, window-based epipolar cross-attention is then performed between the reference and source views. In the cross-attention module, F0 is divided into non-overlapping windows, and each window center is warped to the source features to carve out the corresponding window. After flattening the divided windows, cross-attention between each window in F0 and the corresponding window in each Fi is computed to transform and update Fi.
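For reference, a minimal sketch of the Swin-style window partitioning and flattening assumed throughout this section (standard helpers, not taken from the paper's code):

```python
import torch

def window_partition(x: torch.Tensor, win: int) -> torch.Tensor:
    """Split a feature map into non-overlapping, flattened windows.

    x: (B, H, W, C) feature map with H and W divisible by `win`.
    Returns (num_windows * B, win * win, C) token sequences, one per window,
    ready to be fed to (cross-)attention.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(-1, win * win, C)

def window_reverse(windows: torch.Tensor, win: int, H: int, W: int) -> torch.Tensor:
    """Inverse of window_partition: (num_windows*B, win*win, C) -> (B, H, W, C)."""
    B = windows.shape[0] // ((H // win) * (W // win))
    x = windows.view(B, H // win, W // win, win, win, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)

# Round-trip check on a toy feature map.
feat = torch.randn(1, 8, 8, 32)
tokens = window_partition(feat, win=4)            # (4, 16, 32): 4 windows of 16 tokens
assert torch.allclose(window_reverse(tokens, 4, 8, 8), feat)
```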

3. Cost Transformer

The authors further study the impact of different regularization schemes and find that the global receptive field has a significant effect on the final performance; a window-based Cost Transformer (CT) is therefore proposed to aggregate global information within the cost volume. As shown in the figure below, as the receptive field expands, the maximum probability along the depth dimension becomes smoother, more complete, and more confident (yellow areas indicate higher probability, i.e., higher confidence). Compared with 3D CNN regularization or no regularization at all, the probability volume generated by CT is of higher quality.

3D attention. To exploit a global receptive field in cost volume regularization, W-MSA and SW-MSA are extended to 3D; for this, the 3D volume is flattened along both the spatial and depth dimensions. The subsequent operations are analogous to the 2D attention described above.

CT structure. As shown in the figure below, CT consists of an encoder, a decoder, and skip connections. Given an input cost volume V, the encoder first divides V into non-overlapping 3D blocks, each of which is flattened from 2×4×4×C to 32C. A linear embedding layer then projects the channel dimension 32C to C', yielding an embedded cost volume V'. Afterwards, V' is further divided into non-overlapping 3D windows, and each 3D window is flattened from dwin×hwin×wwin×C' to (dwin·hwin·wwin)×C'. The flattened windows are fed into N 3D attention blocks, each consisting of 3D W-Intra-Att, 3D SW-Intra-Att and a block pooling layer; the block pooling layers perform spatial downsampling and increase the channel dimension. In the decoder, deconvolution is used to recover the resolution. To reduce the loss of spatial information caused by block pooling, shallow and deep features are concatenated, i.e., the multi-scale features of the encoder and decoder are fused through skip connections. Together with a linear embedding layer and a patch expansion layer, the transformed V' is restored to the same dimensions as V. Finally, a 3D convolution with a 1×1×1 kernel produces the final probability volume P.
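A minimal sketch of the 3D block embedding and 3D window flattening described above, with made-up sizes; the strided Conv3d stands in for the "flatten + linear embedding" step and is an assumption, not the authors' code:

```python
import torch
import torch.nn as nn

# Toy cost volume: (B, C, D, H, W); sizes chosen only so the 2x4x4 blocks fit.
B, C, D, H, W = 1, 8, 8, 32, 32
V = torch.randn(B, C, D, H, W)

# 1) Non-overlapping 3D block embedding: each 2x4x4xC block -> 32C -> C_embed.
#    A strided Conv3d is the usual way to implement "flatten + linear embedding".
C_embed = 32
block_embed = nn.Conv3d(C, C_embed, kernel_size=(2, 4, 4), stride=(2, 4, 4))
V_embed = block_embed(V)                        # (B, C_embed, D/2, H/4, W/4) = (1, 32, 4, 8, 8)

# 2) Non-overlapping 3D windows of size d_win x h_win x w_win, flattened to tokens.
d_win, h_win, w_win = 2, 4, 4
x = V_embed.permute(0, 2, 3, 4, 1)              # (B, D', H', W', C_embed)
Bp, Dp, Hp, Wp, Ce = x.shape
x = x.view(Bp, Dp // d_win, d_win, Hp // h_win, h_win, Wp // w_win, w_win, Ce)
x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
tokens = x.view(-1, d_win * h_win * w_win, Ce)  # one token sequence per 3D window
print(tokens.shape)                             # torch.Size([8, 32, 32])
```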

 4. Loss function

Geometric consistency loss. In general, depth estimation is supervised only in the reference view without exploiting multi-view consistency, which is usually applied only at inference time to filter outliers. In this paper, multi-view consistency is brought into the training stage, and a new geometric consistency loss (Geo loss) is proposed to penalize regions where geometric consistency is not satisfied. First, each pixel p in the estimated depth map D0 of the reference view is warped to obtain the corresponding pixel p'i in a neighboring source view, where D0(p) denotes the depth value at pixel p. In turn, p'i is back-projected into 3D space and reprojected to the reference view as p'':
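Written out explicitly (following the standard MVS cross-view reprojection convention, with $\pi(\cdot)$ denoting dehomogenization; the paper's notation may differ slightly):

$$
p'_i = \pi\!\left(K_i\, T_i\, T_0^{-1} K_0^{-1}\,[p;1]\,D_0(p)\right), \qquad
p'' = \pi\!\left(K_0\, T_0\, T_i^{-1} K_i^{-1}\,[p'_i;1]\,D^{gt}_i(p'_i)\right).
$$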

Here, Dgt_i(p'i) denotes the ground-truth depth value at p'i. Two reprojection errors are then defined as:
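A plausible form, matching the pixel and depth consistency checks commonly used in MVS filtering (an assumption, since the original equations are not reproduced here), where $D''(p)$ is the depth of the back-projected point re-expressed in the reference view:

$$
\xi_p = \left\lVert p - p'' \right\rVert_2, \qquad
\xi_d = \frac{\left| D_0(p) - D''(p) \right|}{D_0(p)}.
$$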

Therefore, the final Geo Loss LGeo can be written as: 
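One plausible form, reconstructed from the description below (an assumption; the paper's exact normalization may differ), is to average the sigmoid-normalized combined reprojection error over valid pixels that fail the consistency check:

$$
\mathcal{L}_{Geo} = \frac{1}{\left| p_v \right|} \sum_{p \,\in\, p_v \setminus p_g} \Phi\!\left(\gamma \left(\xi_p + \xi_d\right)\right).
$$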

where Φ is the sigmoid function used to normalize the combined reprojection error with hyperparameter γ, pv is the set of valid spatial coordinates obtained from the valid mask, and pg is the set of all pixels whose reprojection errors are within given thresholds, i.e. ξp < τ1 and ξd < τ2, where τ1 and τ2 are hyperparameters that decrease as the stage index increases.

Total loss. In summary, the loss function consists of the cross-entropy loss (CE loss) and the Geo loss:
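With the loss weights λ1 and λ2 quoted in the implementation details, this presumably takes the form:

$$
\mathcal{L} = \lambda_1\, \mathcal{L}_{CE} + \lambda_2\, \mathcal{L}_{Geo}.
$$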

5. Experiment

5.1. Implementation Details

The model is implemented in PyTorch and trained on the DTU training set. Following the three stages of CasMVSNet at 1/4, 1/2 and full image resolution, the corresponding depth intervals are attenuated by factors of 0.25 and 0.5 from stage 1 to stage 3, and the numbers of depth hypotheses for the three stages are 48, 32 and 8, respectively. When training on DTU, the number of input images is set to N = 5 and the image resolution to 512×640. The model is trained with Adam for 16 epochs with an initial learning rate of 0.001, which is decayed by a factor of 0.5 after epochs 6, 8 and 12. The combination coefficient is set to γ = 100.0 and the loss weights to λ1 = 2.0 and λ2 = 1.0; the reprojection-error thresholds at the three resolutions are τ1 = 3.0, 2.0, 1.0 and τ2 = 0.1, 0.05, 0.01. Training with a batch size of 1 on 8 Tesla V100 GPUs typically takes 15 hours and occupies about 13 GB of memory per GPU.
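Collected as a hypothetical configuration sketch (key names are illustrative, not the authors' config schema; values are the ones quoted above):

```python
# Hypothetical training configuration mirroring the quoted hyperparameters.
WT_MVSNET_DTU_CONFIG = {
    "num_views": 5,                       # 1 reference + 4 source images
    "image_size": (512, 640),             # H x W during DTU training
    "num_depth_hypotheses": [48, 32, 8],  # stages 1-3 (1/4, 1/2, full resolution)
    "depth_interval_scale": [0.25, 0.5],  # interval attenuation from stage 1 to 3
    "optimizer": "Adam",
    "epochs": 16,
    "learning_rate": 1e-3,
    "lr_decay_epochs": [6, 8, 12],
    "lr_decay_factor": 0.5,
    "geo_loss": {
        "gamma": 100.0,                   # sigmoid combination coefficient
        "tau1": [3.0, 2.0, 1.0],          # pixel reprojection thresholds per stage
        "tau2": [0.1, 0.05, 0.01],        # depth reprojection thresholds per stage
    },
    "loss_weights": {"lambda_ce": 2.0, "lambda_geo": 1.0},
    "batch_size_per_gpu": 1,
    "gpus": 8,                            # Tesla V100
}
```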

5.2. Comparison with state-of-the-art methods

5.3. Limitations

In the cross-attention module, the warped points are fixed to the centers of the reference feature windows, regardless of the relative importance of different center points. In addition, introducing Transformers inevitably incurs a high memory cost during training and slows down inference.
