[Paper Reading] Vis-MVSNet: Visibility-aware Multi-view Stereo Network

Today's post is a re-read of a classic. The paper was published at BMVC 2020 and tackles the visibility problem in MVS. It was recently extended and published in IJCV; this write-up is based on the extended IJCV version, whose content is more detailed.
Article link: BMVC2020 version and IJCV version
Code repository: Github

Abstract

Few existing networks explicitly consider pixel-level visibility, leading to false cost aggregation of occluded pixels. In this paper, we explicitly infer and integrate pixel-level occlusion information in MVS networks via matching uncertainty estimation. Pairwise uncertainty maps are jointly inferred with pairwise depth maps, which are further used as weighting guides during multi-view cost-volume fusion. In this way, the adverse effect of occluded pixels is suppressed in cost fusion.

1 Intro

The introduction covers the basics of MVS.
An end-to-end network structure is proposed that takes pixel-level visibility into account. Depth maps are estimated from multi-view images in two steps. First, the ref image is matched against each src image to obtain a latent volume representing the pairwise matching quality. This volume is further regressed into an intermediate depth map and an uncertainty map, where the uncertainty is derived from the depth-wise entropy of the probability volume. Second, using the pairwise matching uncertainty as a weighting guide, all pairwise latent volumes are fused into a multi-view volume so that mismatched pixels are attenuated. The fused volume is regularized and regressed into the final depth estimate. Group-wise correlation and a coarse-to-fine strategy are also integrated to further improve the overall reconstruction quality. The network is end-to-end trainable, and the uncertainty part is trained in an unsupervised manner, so existing MVS datasets with ground-truth depth maps can be used for training directly.

2 Related Work

The related work covers learning-based MVS, visibility estimation, and uncertainty estimation.

3 Method

3.1 Overview

The pipeline is similar to CasMVSNet. First, the ref image $I_0$ and a set of $N$ src images $\{I_i\}_{i=1}^{N}$ are fed into a 2D UNet for multi-scale feature extraction; depth maps and uncertainty maps are then estimated in three stages, from low resolution to high resolution. For the $k$-th stage of reconstruction, the latent volumes are fused according to uncertainty to construct the cost volume, which is regularized and used to estimate the depth map $D_{k,0}$. The intermediate depth map from the previous stage is used for cost-volume construction in the next stage. Finally, $D_{3,0}$ is taken as the final output of the system, $D_0$.
(Figure: overview of the network architecture.)
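To make the data flow concrete, below is a minimal PyTorch-style sketch of the three-stage pipeline, under the assumption that the feature extractor, pairwise matching module, and fusion module are available as callables. All names (`feature_net`, `pair_net`, `fuse_net`, `make_hypotheses`) are hypothetical placeholders, not the authors' actual interfaces.

```python
def vis_mvsnet_forward(ref_img, src_imgs, feature_net, pair_net, fuse_net,
                       make_hypotheses, num_stages=3):
    """Hypothetical sketch of the three-stage, visibility-aware pipeline.
    All callables are placeholders for the modules described in Sec. 3."""
    ref_feats = feature_net(ref_img)                  # features at 1/8, 1/4, 1/2 resolution
    src_feats = [feature_net(s) for s in src_imgs]

    depth = None                                      # previous-stage depth D_{k-1,0}
    for k in range(num_stages):
        hyps = make_hypotheses(depth, stage=k)        # fixed for k=0, dynamic afterwards
        latents, log_uncerts = [], []
        for f_src in src_feats:                       # one latent volume per ref-src pair
            V_i, D_i, S_i = pair_net(ref_feats[k], f_src[k], hyps)   # D_i used for the pair-wise loss
            latents.append(V_i)
            log_uncerts.append(S_i)
        # fuse latent volumes weighted by uncertainty, regularize, regress D_{k,0}
        depth = fuse_net(latents, log_uncerts, hyps)
    return depth                                      # D_{3,0}, i.e. the final output D_0
```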

3.2 Feature Extraction

An hourglass-shaped encoder-decoder (UNet) is used; the outputs of the three stages are 32-channel feature maps at $\frac{1}{8}$, $\frac{1}{4}$, and $\frac{1}{2}$ of the input resolution.
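As a rough illustration, here is a minimal hourglass-style feature extractor that produces 32-channel outputs at the three resolutions; the layer sizes and depths are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFeatureNet(nn.Module):
    """Minimal encoder-decoder sketch: 32-channel features at 1/8, 1/4, 1/2 resolution."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 16, 3, stride=2, padding=1)   # 1/2 resolution
        self.enc2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)  # 1/4 resolution
        self.enc3 = nn.Conv2d(32, 64, 3, stride=2, padding=1)  # 1/8 resolution
        self.out3 = nn.Conv2d(64, 32, 1)                       # stage-1 features, 1/8
        self.dec2 = nn.Conv2d(64 + 32, 32, 3, padding=1)       # stage-2 features, 1/4
        self.dec1 = nn.Conv2d(32 + 16, 32, 3, padding=1)       # stage-3 features, 1/2

    def forward(self, x):
        e1 = F.relu(self.enc1(x))
        e2 = F.relu(self.enc2(e1))
        e3 = F.relu(self.enc3(e2))
        f8 = self.out3(e3)
        up2 = F.interpolate(e3, scale_factor=2, mode="bilinear", align_corners=False)
        f4 = self.dec2(torch.cat([up2, e2], dim=1))
        up1 = F.interpolate(f4, scale_factor=2, mode="bilinear", align_corners=False)
        f2 = self.dec1(torch.cat([up1, e1], dim=1))
        return [f8, f4, f2]   # low-to-high resolution, 32 channels each
```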

3.3 Cost Volume and Regularization

At the $k$-th stage, we first construct a pair-wise cost volume for each ref-src pair instead of directly building a unified cost volume from all views. For the $i$-th pair, under the depth hypothesis $d$ for the ref image, we obtain the warped src feature map $F_{k, i \to 0}(d)$. We apply group-wise correlation to compute the cost map between the ref feature map and the warped src feature map. Specifically, given two 32-channel feature maps, we divide the channels into 8 groups of 4 channels each; the correlation between each corresponding pair of groups is then computed, yielding 8 values per pixel. The cost maps for all depth hypotheses are stacked together as the cost volume. The final cost volume $C_{k,i}$ of the $i$-th pair at the $k$-th stage has size $N_{d,k} \times H \times W \times N_c$, where $N_{d,k}$ is the number of depth hypotheses at stage $k$ and $N_c = 8$ is the number of groups in the group-wise correlation. The hypothesis set of the first stage is predetermined, while the hypothesis sets of the second and third stages are determined dynamically from the depth map output of the previous stage.
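The group-wise correlation for one depth hypothesis can be sketched as follows (assuming the src features have already been homography-warped to the ref view); this is an illustrative implementation, not the authors' code.

```python
def groupwise_correlation(ref_feat, warped_src_feat, num_groups=8):
    """Group-wise correlation between the ref feature map and one warped src
    feature map for a single depth hypothesis.

    ref_feat, warped_src_feat : [B, C, H, W] with C = 32
    returns                   : [B, num_groups, H, W] cost map (8 values per pixel)
    """
    B, C, H, W = ref_feat.shape
    assert C % num_groups == 0
    ch_per_group = C // num_groups                      # 32 / 8 = 4 channels per group
    ref = ref_feat.view(B, num_groups, ch_per_group, H, W)
    src = warped_src_feat.view(B, num_groups, ch_per_group, H, W)
    # average inner product within each group -> one correlation value per group
    return (ref * src).mean(dim=2)

# Stacking the per-hypothesis cost maps over all N_{d,k} depth hypotheses gives
# the pairwise cost volume C_{k,i} of size N_{d,k} x H x W x N_c (N_c = 8).
```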

Cost regularization is done in two steps. First, for the $i$-th pair at the $k$-th stage, each pairwise cost volume is regularized into a latent volume $V_{k,i}$. Then all latent volumes are fused into $V_k$, which is further regularized into the probability volume $P_k$, from which the depth map of the current stage $D_{k,0}$ is regressed via the soft-argmax operation. To measure visibility, pairwise depth and uncertainty are inferred jointly: each latent volume is converted into a probability volume $P_{k,i}$ by an additional 3D CNN and a softmax operation, and then the depth map $D_{k,i}$ and the corresponding uncertainty map $U_{k,i}$ are jointly inferred via soft-argmax and an entropy operation. The uncertainty map is used as a weighting guide during latent-volume fusion.

3.4 Pair-wise Joint Depth and Uncertainty Estimation

The depth map is regressed from the probability volume with a soft-argmax operation. For simplicity, the stage index $k$ is omitted below. Denoting the probability distribution over all depth hypotheses as $\{P_{i,j}\}_{j=1}^{N_d}$, the soft-argmax operation is equivalent to computing the expectation of the distribution, so $D_i$ is calculated as:

$$D_i = \sum_{j=1}^{N_d} d_j \, P_{i,j}$$
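A minimal sketch of this soft-argmax regression, assuming the probability volume is stored as `[B, N_d, H, W]`:

```python
def soft_argmax_depth(prob_volume, depth_hyps):
    """prob_volume : [B, N_d, H, W], softmax-normalized over the depth dimension
    depth_hyps  : [N_d] (or [B, N_d, H, W] for per-pixel hypotheses)
    returns     : [B, H, W] expected depth D_i = sum_j d_j * P_{i,j}"""
    if depth_hyps.dim() == 1:
        depth_hyps = depth_hyps.view(1, -1, 1, 1)
    return (prob_volume * depth_hyps).sum(dim=1)
```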
To jointly regress the depth estimate and its uncertainty, we assume that the depth estimate follows a Laplace distribution. In this case, the estimated depth and uncertainty should maximize the likelihood of the observed ground truth:

$$p(D_{gt,i} \mid D_i, U_i) = \frac{1}{2U_i} \cdot \exp\left(-\frac{|D_i - D_{gt,i}|}{U_i}\right)$$
where $U_i$ is the uncertainty of the depth estimate at that pixel. Note that the probability distribution $\{P_{i,j}\}_{j=1}^{N_d}$ also reflects the matching quality. We therefore take the entropy map $H_i$ of $\{P_{i,j}\}_{j=1}^{N_d}$ as a measure of depth-estimation quality and convert $H_i$ into the uncertainty map $U_i$ through a function $f_u$, implemented as a shallow 2D CNN:

$$U_i = f_u(H_i) = f_u\left(-\sum_{j=1}^{N_d} P_{i,j} \log P_{i,j}\right)$$
Entropy is used because the randomness of the distribution is negatively correlated with its unimodality, and unimodality is an indicator of high confidence in the depth estimate.
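A small sketch of this step, combining the entropy computation with a hypothetical shallow 2D CNN head (the layer sizes are assumptions); as described below, the head directly outputs the log-uncertainty $S_i = \log U_i$ rather than $U_i$ itself.

```python
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    """Sketch of f_u: a shallow 2D CNN mapping the per-pixel entropy map H_i of
    the probability volume to the log-uncertainty map S_i = log U_i."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, prob_volume):                  # [B, N_d, H, W]
        eps = 1e-12                                  # avoid log(0)
        entropy = -(prob_volume * (prob_volume + eps).log()).sum(dim=1, keepdim=True)
        return self.net(entropy)                     # [B, 1, H, W] log-uncertainty S_i
```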
To jointly learn the depth estimate $D_i$ and its uncertainty $U_i$, we minimize the negative log-likelihood above:

$$L_i^{joint} = \frac{1}{|I_0^{valid}|} \sum_{x \in I_0^{valid}} -\log\left(\frac{1}{2U_i}\exp\left(-\frac{|D_i - D_{gt,i}|}{U_i}\right)\right) = \frac{1}{|I_0^{valid}|} \sum_{x \in I_0^{valid}} \left(\frac{1}{U_i}|D_i - D_{gt,i}| + \log U_i\right)$$

Constants are omitted from the formula. For numerical stability, in practice we directly infer $S_i = \log U_i$ instead of $U_i$. The log-uncertainty map $S_i$ is likewise produced from the entropy map $H_i$ by a shallow 2D CNN.
This loss can also be interpreted as an L1 loss between the estimate and the ground truth, attenuated by the uncertainty and balanced by the $\log U_i$ regularization term. The intuition is that the influence of noisy, erroneous samples should be down-weighted during training.
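A sketch of this joint loss in PyTorch, parameterized by the log-uncertainty $S_i$ as described above (tensor shapes are assumptions):

```python
import torch

def joint_depth_uncertainty_loss(depth, log_uncert, depth_gt, valid_mask):
    """Negative log-likelihood under a Laplace model, using S_i = log U_i for
    numerical stability (constants dropped), averaged over valid pixels.

    depth, depth_gt : [B, H, W]   log_uncert : [B, H, W]   valid_mask : [B, H, W] bool"""
    abs_err = (depth - depth_gt).abs()
    nll = abs_err * torch.exp(-log_uncert) + log_uncert   # |D - D_gt| / U_i + log U_i
    return nll[valid_mask].mean()
```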

3.5 Volume Fusion

Omitting the stage index $k$, given the pairwise latent volumes $\{V_i\}_{i=1}^{N_v}$, a single volume $V$ is fused from them by a weighted sum, where the weights are negatively related to the estimated pairwise uncertainties:

$$V = \left(\sum_{i=1}^{N_v} \frac{1}{\exp S_i}\right)^{-1} \sum_{i=1}^{N_v} \left(\frac{1}{\exp S_i} V_i\right)$$
According to our observation, pixels with larger uncertainty are more likely to lie in occluded regions, so their contributions in the latent volume are suppressed.
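A sketch of this uncertainty-weighted fusion, assuming each latent volume is stored as `[B, C, N_d, H, W]` and each log-uncertainty map as `[B, 1, H, W]`:

```python
import torch

def fuse_latent_volumes(latent_volumes, log_uncerts):
    """Uncertainty-weighted fusion of pairwise latent volumes (sketch of the
    weighted sum above).

    latent_volumes : list of [B, C, N_d, H, W] latent volumes V_i
    log_uncerts    : list of [B, 1, H, W] log-uncertainty maps S_i"""
    weights = [torch.exp(-s).unsqueeze(2) for s in log_uncerts]   # 1 / exp(S_i), broadcast over depth
    weight_sum = torch.stack(weights, dim=0).sum(dim=0)           # normalization term
    fused = sum(w * v for w, v in zip(weights, latent_volumes))
    return fused / weight_sum.clamp(min=1e-12)
```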

An alternative to the weighted sum would be to threshold $S_i$ and perform hard per-pixel visibility selection. However, without an interpretation of the absolute value of $S_i$, we could only apply an empirical threshold, which may not generalize. In contrast, the weighted-sum formulation naturally incorporates all views and uses the log-uncertainty $S_i$ in a relative way.

3.6 Coarse-to-Fine Architecture

A hierarchical coarse-to-fine structure is introduced, similar to CasMVSNet.

3.7 Training Loss

For each stage, the pair-wise L1 loss, the pair-wise joint loss, and the L1 loss of the final depth map are computed; the total loss is the weighted sum of the losses of all three stages. To normalize the scale across different training scenes, all depth differences are divided by a predefined depth interval of the final stage.
$$L = \sum_{k=1}^{3} \lambda_k \left[ L_{1,k}^{final} + \frac{1}{N_v} \sum_{i=1}^{N_v} \left( L_{1,k,i}^{pair} + L_{k,i}^{joint} \right) \right]$$
The pair-wise L1 loss is also included because the joint uncertainty loss tends to over-relax the pair-wise depth and uncertainty estimates; adding it ensures reasonable pair-wise depth map estimates.
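Putting the pieces together, here is a sketch of the total loss. The stage weights `lambdas` and the layout of `stage_outputs` are illustrative assumptions, and it reuses the `joint_depth_uncertainty_loss` sketch from Section 3.4.

```python
def total_loss(stage_outputs, depth_gt_pyramid, valid_pyramid, lambdas=(0.5, 1.0, 2.0)):
    """Sketch of the total training loss: per-stage final L1 plus the averaged
    pairwise L1 and joint losses, weighted by lambda_k.

    stage_outputs[k] is assumed to hold:
      'final_depth'  : [B, H, W]            fused depth D_{k,0}
      'pair_depths'  : list of [B, H, W]    pairwise depths D_{k,i}
      'pair_log_unc' : list of [B, H, W]    pairwise log-uncertainties S_{k,i}
    Depth differences are assumed to be pre-divided by the final-stage depth interval."""
    loss = 0.0
    for k, out in enumerate(stage_outputs):
        gt, valid = depth_gt_pyramid[k], valid_pyramid[k]
        l1_final = (out['final_depth'] - gt).abs()[valid].mean()
        pair_terms = []
        for d_i, s_i in zip(out['pair_depths'], out['pair_log_unc']):
            l1_pair = (d_i - gt).abs()[valid].mean()
            joint = joint_depth_uncertainty_loss(d_i, s_i, gt, valid)
            pair_terms.append(l1_pair + joint)
        loss = loss + lambdas[k] * (l1_final + sum(pair_terms) / len(pair_terms))
    return loss
```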

3.8 Point Cloud Generation

Describes how point clouds are generated.

4 Experiment

The experimental results are not reproduced here; interested readers can check the paper themselves.

4.1 Implementation

4.2 Benchmarking on Tanks and Temples Dataset

4.3 Benchmarking on ETH3D Dataset

4.4 Benchmarking on DTU Dataset

4.5 Ablation Study

4.6 Memory and Time Consumption

5 Conclusion

We propose a visibility-aware depth inference framework for multi-view stereo reconstruction, consisting of a two-step cost-volume regularization, joint inference of pairwise depth and uncertainty, and uncertainty-guided weighted-average fusion of the pairwise volumes. The proposed method has been extensively evaluated on multiple datasets. Qualitatively, the system produces more accurate and denser point clouds, which demonstrates the effectiveness of the proposed visibility-aware depth inference framework.

Origin blog.csdn.net/YuhsiHu/article/details/131754889