[Paper Summary] MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth (TMLR 2023)

1. Brief introduction of the paper

1. First author: Chenjie Cao

2. Year of publication: 2023

3. Published Journal: TMLR

4. Keywords: MVS, 3D reconstruction, pre-training, Vision Transformers

5. Motivation for Exploration: regularization alone cannot fully correct the ambiguous feature matching caused by reflective or texture-less regions, and pre-trained CNN backbones turn out to be of limited help for MVS, as detailed below:

  1. Such regularization cannot completely rectify ambiguous feature matching from reflections or texture-less regions with unreliable 2D image features. Therefore, it is still of great significance to learn good representative features during feature extraction to improve the generalization of MVS.
  2. Few previous efforts have explicitly explored the features from Convolutional Neural Networks (CNNs) pre-trained on extra image data, e.g., ResNet (He et al., 2016), as such pre-trained CNNs may have some problems in MVS: 1) low-level features of CNNs only consider limited receptive fields, which lack holistic image understanding and fail to tackle reflections and texture-less areas; 2) high-level features of CNNs are highly semantically abstract, thus best suited for classification rather than fine-grained visual feature matching. We empirically validate that pre-trained CNN models fail to achieve significant improvement in MVS.

6. Work goal: Can the feature representation learning of MVS be significantly enhanced by Vision Transformers (ViTs) pre-trained on other 2D image datasets? The advantages are as follows:

  1. For the issues of reflections and texture-less regions in MVS, ViTs equipped with long-range attention modules can provide a global understanding for MVS models rather than relying only on low-level textures.
  2. The patch-wise feature encoding of ViTs works reasonably well for feature matching. Since depth prediction is intrinsically a 1D feature matching problem along epipolar lines, ViTs should be a good recipe for learning-based MVS.

7. Core idea:

  1. To the best of our knowledge, this is the first work that systematically explores the influence of pre-trained ViTs on MVS. Learning a better feature representation in the feature extractor is important to set up a bridge between 2D vision and 3D MVS tasks.
  2. We propose a novel ViT enhanced MVS network – MVSFormer, which is further trained with the efficient multi-scale training strategy to be generalized for various resolutions.
  3. We analyze the merits and limitations of regression and classification-based MVS, and propose a simple but effective way to unify both. Classification-based confidence can filter outliers for the real-world reconstruction. Our temperature-based depth predictions also enjoy superior point cloud metrics. 

8. Experimental results: SOTA

  1. The proposed Twins-small pre-trained MVSFormer remarkably reduces the overall error of point cloud reconstruction on DTU from 0.312 to 0.289 compared with the CNN-based pre-trained ResNet, with competitive computation and all other model settings unchanged.
  2. The proposed method achieves state-of-the-art performance on both the DTU dataset and Tanks-and-Temples.

9. Paper & code download:

https://arxiv.org/pdf/2208.02541.pdf

https://github.com/ewrfcas/MVSFormer

2. Implementation process

1. Overview of MVSFormer

MVSFormer learns feature representations during feature extraction and augments them with a hierarchical ViT (Twins, Fig. Aa) or a plain ViT (DINO, Fig. Ab), together with several novel training strategies. The input to the ViT is down-sampled to 1/2 resolution. Then, a multi-stage cost volume formulation and regularization (Figure B) computes the probability of depth hypotheses from coarse to fine. Finally, MVSFormer is optimized with a cross-entropy loss, while depth expectations are inferred at test time.

Preliminary knowledge: (1) Twins is a hierarchical ViT pre-trained in a supervised manner, as shown in Figure (a). To reduce complexity, Twins builds each attention block from separable locally-grouped self-attention and global sub-sampled attention. This global-and-local design outperforms classic pyramidal ViTs. (2) DINO is a plain ViT pre-trained in a self-supervised manner via self-distillation, as shown in Figure (b). A notable property of DINO is that the attention maps of its last layer capture class-specific features, enabling unsupervised object segmentation. Thanks to the self-supervised training and multi-crop strategy, DINO's feature representation generalizes well to various environments, lighting conditions, and resolutions.

2. Feature extraction

An FPN is used as the main feature extractor and is augmented with a pre-trained ViT. In MVSFormer, the ViT formulates global feature correlations, while the FPN is dedicated to learning detailed local features. Before feeding the reference and source images into the ViT, the images are first down-sampled to (H/2, W/2) to save computation and memory. Bicubic interpolation is then used to resize the absolute positional encoding of the pre-trained ViT to accommodate different image scales. The hierarchical-ViT output F(h) or plain-ViT output F(p) is directly added to the highest-level feature of the FPN encoder. As shown in Figure (A), by upscaling from (H/8, W/8) to the original resolution (H, W) through the FPN decoder, coarse-to-fine features {F(l)}, l = 1, ..., 4, are obtained. These features incorporate priors from both the ViT and the CNN and are leveraged to formulate more reliable cost volumes. Other feature fusion strategies are tried in the appendix, but the differences are negligible, so simple feature addition is adopted in MVSFormer.
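The fusion described above can be sketched as follows (a rough PyTorch sketch under several assumptions, not the released implementation: the `vit`, `fpn_encoder`, and `pos_embed` interfaces, the absence of a class token, and the patch size are all illustrative):

```python
import torch
import torch.nn.functional as F

def fuse_vit_with_fpn(img, vit, fpn_encoder, pos_embed, patch=16):
    """Run the ViT on a half-resolution input, resize its absolute positional
    encoding bicubically, and add the ViT feature map to the coarsest FPN feature."""
    B, _, H, W = img.shape

    # 1) ViT branch uses a 1/2-resolution input to save computation and memory.
    img_half = F.interpolate(img, scale_factor=0.5, mode="bilinear", align_corners=False)
    gh, gw = img_half.shape[-2] // patch, img_half.shape[-1] // patch

    # 2) Resize the absolute positional embedding (no class token assumed) to the new grid.
    L, C = pos_embed.shape[-2], pos_embed.shape[-1]
    side = int(L ** 0.5)
    pe = pos_embed.reshape(1, side, side, C).permute(0, 3, 1, 2)
    pe = F.interpolate(pe, size=(gh, gw), mode="bicubic", align_corners=False)
    pe = pe.permute(0, 2, 3, 1).reshape(1, gh * gw, C)

    # 3) ViT tokens -> spatial feature map (assumed interface for the ViT forward pass).
    tokens = vit(img_half, pos_embed=pe)                   # (B, gh * gw, C)
    vit_feat = tokens.transpose(1, 2).reshape(B, C, gh, gw)

    # 4) Add the (resized) ViT feature to the highest-level FPN encoder feature.
    feats = fpn_encoder(img)                               # list of multi-scale CNN features
    vit_feat = F.interpolate(vit_feat, size=feats[-1].shape[-2:],
                             mode="bilinear", align_corners=False)
    feats[-1] = feats[-1] + vit_feat                       # simple feature addition, as in the paper
    return feats                                           # fed to the FPN decoder afterwards
```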

MVSFormer with trainable Twins. Twins is the backbone of the default MVSFormer and yields the best reconstruction performance. To be fine-tuned in MVSFormer at different resolutions, the ViT backbone needs to meet two conditions: an efficient attention mechanism and positional encodings that are robust across scales. Twins addresses both well: besides its pyramid structure, the Conditional Position Encoding (CPE) in Twins learns positional cues from zero padding and breaks the permutation equivariance of ViTs with an appropriate CNN inductive bias. As shown in (b), MVSFormer encodes 4 multi-scale features {Fs} at (1/8, 1/16, 1/32, 1/64) of the original resolution, which are then up-sampled with another FPN.

Thanks to the efficient attention design, the pre-trained Twins can be fine-tuned at various resolutions at a relatively low learning rate during the training phase.

MVSFormer with frozen DINO. A plain-ViT, vanilla-attention variant, MVSFormer-P, is also explored, in which the pre-trained DINO backbone is kept frozen. As a result, MVSFormer-P requires only a modest extra memory overhead compared with regular MVS FPN training, while achieving results comparable to the fully trainable MVSFormer.

3. Efficient multi-scale training

Although ViTs are powerful, their lack of translation invariance and locality makes it difficult to handle inputs of various resolutions. However, most MVS benchmarks are tested at different high resolutions (HR) (from 1200×1600 to 1080×1920). CNN-based methods can largely address this issue through dynamic kernels and random cropping; most importantly, CNNs can handle inputs of arbitrary size thanks to their inductive biases, i.e., translation equivariance and locality. For the trainable Twins in MVSFormer, training at a single resolution tends to overfit to that input size and fails to generalize to the HR case.

Therefore, multi-scale training, as used in ViT-based detection tasks, is adopted to refine the learned features. In particular, efficient multi-scale training must ensure that 1) all images within a batch share the same size, and 2) the batch size is changed dynamically according to the image size to make full use of limited memory. Specifically, the model is trained with dynamic resolutions ranging from 512 to 1280 and aspect ratios randomly sampled between 0.67 and 0.8. Aided by gradient accumulation, multi-scale training is preserved with the maximum batch size instead of compromising to the minimum one. Gradient accumulation divides a batch into several sub-batches and accumulates their gradients before updating the model. All instances are randomly grouped into different resolutions and sub-batch sizes at the beginning of each epoch; note that larger images get smaller sub-batch sizes to balance memory cost, and vice versa. Training with a larger batch size helps convergence, reduces variance, and benefits BatchNorm layers, so gradient accumulation significantly improves the multi-scale training efficiency of MVSFormer. A dynamic training size from 512 to 1280 proves sufficient to generalize MVSFormer to the Tanks-and-Temples dataset at 2K resolution and above.
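A minimal sketch of this schedule (illustrative only: the `make_schedule` helper, the scale-to-sub-batch mapping, and the `loader.sample` API are assumptions, not the authors' code):

```python
import random

def make_schedule(num_batches, full_batch=8):
    """Pick a random training resolution per full-size batch and the matching
    sub-batch size; larger images get smaller sub-batches (illustrative mapping)."""
    scales = {512: 8, 896: 4, 1280: 2}
    return [(s, scales[s]) for s in random.choices(list(scales), k=num_batches)]

def train_epoch(model, loader, optimizer, loss_fn, full_batch=8):
    for scale, sub_batch in make_schedule(len(loader.dataset) // full_batch, full_batch):
        optimizer.zero_grad()
        accum_steps = full_batch // sub_batch
        for _ in range(accum_steps):
            batch = loader.sample(scale=scale, batch_size=sub_batch)   # assumed sampler API
            loss = loss_fn(model(batch["imgs"]), batch["depth_gt"]) / accum_steps
            loss.backward()        # gradients accumulate across sub-batches of the same scale
        optimizer.step()           # one update per full-size (maximum) batch
```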

4. Cost volume construction

The cost volume and its regularization are orthogonal to the main idea of this paper, since the focus is on better MVS feature extraction. The cost volume is constructed with group-wise correlation. A 2D CNN is also trained to learn a pixel-wise visibility weight for each source view by normalizing the correlation entropy. The N−1 source-view feature correlations can then be fused with their visibility weights:
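A plausible form of the visibility-weighted fusion, written in standard MVS notation (a reconstruction, not necessarily the paper's exact formula):

$$\mathbf{C} = \frac{\sum_{i=1}^{N-1} w_i \odot \mathbf{C}_i}{\sum_{i=1}^{N-1} w_i},$$

where C_i is the group-wise correlation volume between the reference view and the i-th source view, and w_i is its predicted pixel-wise visibility weight.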

After regularization by a 3D U-Net, each stage outputs a 3D cost volume C ∈ R^(D×H×W).

5. Temperature-based depth prediction

Regression depth Dreg and classification depth Dcla:
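In standard coarse-to-fine MVS notation (a reconstruction consistent with the surrounding description, not necessarily the paper's exact formulas):

$$D_{reg} = \sum_{j=1}^{D} d_j \cdot \mathrm{softmax}(\mathbf{C})_j, \qquad D_{cla} = d_k,\quad k = \arg\max_j \mathrm{softmax}(\mathbf{C})_j,$$

where d_j denotes the j-th depth hypothesis and C the regularized cost volume.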

Dreg is optimized with an L1 loss against the ground-truth depth, while Dcla is optimized with a cross-entropy (CE) loss against a one-hot ground-truth depth volume.

Prior works observe that REG suffers from over-fitting, leading to over-smoothed depth predictions, while CLA is more robust but fails to achieve accurate depth results. In contrast, this paper empirically reaches a different conclusion: confidence maps from CLA outperform those from REG, which should not be overlooked, especially for the widely used multi-stage MVS models. In particular, MVS networks cannot guarantee that all predicted depths are correct, due to reflections, occlusions, or the lack of reliable source views, so providing a reliable confidence (uncertainty) map is also important for good point clouds reconstructed by MVS. As shown in Figure (a), in stages 2, 3, and 4, REG maintains high confidence even for out-of-range depth hypotheses, so REG has difficulty filtering outliers without discarding other correct depth points. Since CE cannot handle out-of-range depth labels, all depth outliers are masked during training. The authors also tried optimizing MVS with a masked L1 loss, but its performance was not as good as regular regression.

(Figure (a) caption: multi-stage confidence of regression vs. classification. In stage 1, the depth range of both the regression- and classification-based methods misses the true depth (8.8), yet the regression still assigns a high probability to the clamped depth.)

While CLA has many merits in MVS, REG outperformed CLA on both depth and point cloud metrics in early experiments. Therefore, the inaccurate depth prediction of CLA is targeted. Unified Focal Loss (UFL) treats CE as multiple Binary Cross-Entropy (BCE) terms, and focal loss introduces several hyperparameters to handle the imbalance problem in BCE. Different from UFL, this paper proposes a simple method to unify REG and CLA that only adjusts the inference process without retraining the model: the cost volume C is first multiplied by a temperature t before the softmax, and Dreg is rewritten as a temperature-based depth expectation Dtmp:
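A reconstruction of the referenced formula consistent with the surrounding description (not necessarily the paper's exact notation):

$$D_{tmp} = \sum_{j=1}^{D} d_j \cdot \mathrm{softmax}(t \cdot \mathbf{C})_j ,$$

so a larger t sharpens the softmax toward the arg-max behaviour of classification, while t = 1 recovers the plain regression expectation.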

Obviously, when t = ∞ or t = 1, Dtmp is equivalent to Dcla or Dreg, respectively. The core idea is to adjust t during inference to unify REG and CLA. For the low-resolution early stages, a larger t is set so that the model behaves like CLA, with better global discrimination; for the high-resolution later stages, the model uses a lower t and behaves like REG to smooth local details. In practice, {t1, t2, t3, t4} = {5, 2.5, 1.5, 1} performs better than pure classification (t = ∞), pure regression (t = 1), and other t settings. Note that Dtmp is only used during testing, since the masked CLA optimized with CE is robust enough for MVS learning; thus MVSFormer uses Dcla in the training phase. Although adjusting t at test time may be affected by the train-test discrepancy, the model only tends toward regression in the later stages, where only a few hypotheses remain around the predicted depth, and this t setting proves sufficiently general and effective across various datasets.
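A minimal inference-time sketch of this per-stage temperature schedule (assumed tensor shapes and function names, not the released code):

```python
import torch
import torch.nn.functional as F

STAGE_T = [5.0, 2.5, 1.5, 1.0]   # {t1, t2, t3, t4} from the paper

def temperature_depth(cost_volume, depth_hypotheses, stage):
    """cost_volume: (B, D, H, W); depth_hypotheses: (B, D, H, W) per-pixel hypotheses."""
    prob = F.softmax(STAGE_T[stage] * cost_volume, dim=1)   # temperature-scaled probabilities
    depth = torch.sum(prob * depth_hypotheses, dim=1)       # expectation over hypotheses -> (B, H, W)
    confidence = prob.max(dim=1).values                     # classification-style confidence map
    return depth, confidence
```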

6. Experiment

6.1. Datasets

DTU, Tanks-and-Temples, ETH3D

6.2. Implementation Details

MVSFormer is trained with N = 5 views in 4 coarse-to-fine stages with 32-16-8-4 depth hypotheses. For multi-scale training, the sub-batch size is dynamically changed from 8 down to 2 as the scale grows from 512 to 1280, with a maximum batch size of 8. Thanks to mixed-precision training, MVSFormer takes only about 22 and 15 hours to train for 10 epochs on DTU and BlendedMVS, respectively, using two 32GB Tesla V100 GPUs.

6.3. Comparison with state-of-the-art methods

SOTA


Origin blog.csdn.net/qq_43307074/article/details/129232704