Vis-MVSNet: Visibility-Aware Multi-view Stereo Network (IJCV 2022)

Abstract: Explicitly infer and integrate pixel-level occlusion information in MVS networks via matching uncertainty estimation. The pairwise uncertainty map is jointly inferred with the pairwise depth map and is then used as a weighting guide during multi-view cost-volume fusion. In this way, the adverse effect of occluded pixels is suppressed in the cost fusion. The proposed framework, Vis-MVSNet, significantly improves depth accuracy in heavily occluded scenes.

Innovation: Depth maps are estimated from multi-view images in two steps. First, each reference-source image pair is matched to obtain a latent volume representing the pairwise matching quality. This volume is further regressed into an intermediate depth map and an uncertainty map, where the uncertainty is derived from the entropy of the depth probability volume. Second, using the pairwise matching uncertainty as a weighting guide, all pairwise latent volumes are fused into a multi-view cost volume so that mismatched pixels are attenuated. The fused volume is then regularized and regressed to the final depth estimate.

1. Network structure

The basic architecture is similar to CasMVSNet. Given a reference image I_0 and a set of adjacent source images {I_i}, i = 1…N_v, the framework predicts a reference depth map D_0 aligned with I_0. First, all images are fed into a 2D UNet to extract multi-scale image features, which are used for depth estimation in three stages from low to high resolution. At stage k, a cost volume is constructed, regularized, and used to estimate a depth map D_k,0 with the same resolution as the input feature map. The intermediate depth map from the previous stage is used for cost volume construction in the next stage. Finally, D_3,0 serves as the final output D_0 of the system, as shown below.
(Figure: overall coarse-to-fine architecture of Vis-MVSNet)

2. Feature extraction

The image features for depth estimation are extracted by an hourglass encoder-decoder (UNet) architecture, which outputs 32-channel feature maps at 1/8, 1/4, and 1/2 of the input resolution.
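For concreteness, here is a minimal PyTorch sketch of such an hourglass extractor that returns 32-channel feature maps at 1/8, 1/4, and 1/2 resolution; the layer counts, channel widths, and skip-connection layout are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FeatureUNet(nn.Module):
    """Hourglass-style feature extractor returning 32-channel maps at
    1/8, 1/4 and 1/2 of the input resolution (coarse to fine).
    Channel widths and layer counts are illustrative assumptions."""

    def __init__(self, base=16):
        super().__init__()
        def block(cin, cout, stride=1):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True))
        # encoder: full -> 1/2 -> 1/4 -> 1/8
        self.enc1 = block(3, base)
        self.enc2 = block(base, base * 2, stride=2)
        self.enc4 = block(base * 2, base * 4, stride=2)
        self.enc8 = block(base * 4, base * 8, stride=2)
        # decoder with skip connections, plus a 32-channel head per scale
        self.up4 = nn.ConvTranspose2d(base * 8, base * 4, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1)
        self.out8 = nn.Conv2d(base * 8, 32, 1)   # stage 1 features (1/8)
        self.out4 = nn.Conv2d(base * 4, 32, 1)   # stage 2 features (1/4)
        self.out2 = nn.Conv2d(base * 2, 32, 1)   # stage 3 features (1/2)

    def forward(self, img):
        f1 = self.enc1(img)
        f2 = self.enc2(f1)
        f4 = self.enc4(f2)
        f8 = self.enc8(f4)
        d4 = self.up4(f8) + f4
        d2 = self.up2(d4) + f2
        return self.out8(f8), self.out4(d4), self.out2(d2)
```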

3. Cost volume and regularization

The feature maps at different scales are warped by differentiable homographies to construct the feature volumes.
When constructing the cost volume, a pairwise cost volume is first built for each reference-source pair. For the i-th pair and a depth hypothesis d of the reference image, a warped feature map F_k,i→0(d) is obtained from the source view. Group-wise correlation is applied to compute a cost map between the reference and warped source feature maps. Specifically, given two 32-channel feature maps, all channels are divided into 8 groups of 4 channels, and the correlation between each corresponding group pair is computed, yielding 8 values per pixel. The cost maps of all depth hypotheses are then stacked together into a cost volume.
The pairwise cost volume C_k,i at stage k has size N_d,k × H × W × N_c, where N_d,k is the number of depth hypotheses at stage k. The hypothesis set of the first stage is predetermined, while those of the second and third stages are determined dynamically from the depth map output by the previous stage.
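To make the group-wise correlation concrete, the following sketch divides two 32-channel feature maps into 8 groups of 4 channels and stacks the per-depth cost maps into a volume; the `warp_fn` callback that performs the homography warping and the (B, groups, N_d, H, W) memory layout are assumptions for illustration.

```python
import torch

def group_correlation(ref_feat, src_feat_warped, num_groups=8):
    """Group-wise correlation between the reference feature map and a source
    feature map warped to one depth hypothesis.
    ref_feat, src_feat_warped: (B, 32, H, W) -> returns (B, 8, H, W)."""
    b, c, h, w = ref_feat.shape
    ref = ref_feat.view(b, num_groups, c // num_groups, h, w)
    src = src_feat_warped.view(b, num_groups, c // num_groups, h, w)
    # average inner product within each 4-channel group -> one value per group
    return (ref * src).mean(dim=2)

def build_pair_cost_volume(ref_feat, warp_fn, depth_hypotheses, num_groups=8):
    """Stack the per-depth cost maps into a pairwise cost volume of shape
    (B, groups, N_d, H, W).  `warp_fn(d)` is assumed to return the source
    feature map warped to the reference view via the homography for depth d
    (the warping itself is not implemented here)."""
    cost_maps = [group_correlation(ref_feat, warp_fn(d), num_groups)
                 for d in depth_hypotheses]
    return torch.stack(cost_maps, dim=2)
```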

Cost volume regularization is divided into two steps. First, each pairwise cost volume is regularized into a latent volume V_k,i. Then all latent volumes are fused into V_k, which is further regularized into a probability volume P_k, and the final depth map D_k,0 of the current stage is regressed through a soft-argmax operation. The fusion of latent volumes is visibility-aware. Specifically, visibility is measured by jointly inferring depth and uncertainty: each latent volume is transformed into a probability volume P_k,i by additional 3D CNNs and a softmax operation, and the depth map D_k,i together with the corresponding uncertainty map U_k,i is then jointly inferred by soft-argmax and entropy operations. The uncertainty map is used as a weighting guide during the fusion of latent volumes.

4. Pairwise joint depth and uncertainty estimation

As described in the previous section, a pairwise probability volume is obtained for joint depth and uncertainty estimation, and the depth map is regressed by soft-argmax. For simplicity, the probability distribution over all depth hypotheses is denoted {P_i,j}, omitting the stage number k. The soft-argmax operation is equivalent to computing the expectation of the distribution, so the depth D_i of pixel i is calculated as

D_i = Σ_j P_i,j · d_j,

where d_j is the j-th depth hypothesis.
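A minimal sketch of the soft-argmax regression, assuming a regularized score volume of shape (B, N_d, H, W) and broadcastable depth hypotheses:

```python
import torch
import torch.nn.functional as F

def soft_argmax_depth(score_volume, depth_hypotheses):
    """Regress depth as the expectation of the depth distribution.
    score_volume: (B, N_d, H, W) regularized scores;
    depth_hypotheses: (N_d,) or (B, N_d, H, W) depth values."""
    prob = F.softmax(score_volume, dim=1)                 # P_{i,j}
    if depth_hypotheses.dim() == 1:
        depth_hypotheses = depth_hypotheses.view(1, -1, 1, 1)
    depth = torch.sum(prob * depth_hypotheses, dim=1)     # D_i = sum_j P_{i,j} * d_j
    return depth, prob
```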

To jointly regress the depth estimate and its uncertainty, the depth estimate is assumed to follow a Laplacian distribution. In this case, the estimated depth and uncertainty maximize the likelihood of the observed ground truth:

P(d̂_i | D_i, U_i) = 1 / (2 U_i) · exp( −|d̂_i − D_i| / U_i ),
where d̂_i is the ground-truth depth and U_i is the uncertainty of the depth estimate at pixel i. Note that the probability distribution {P_i,j} also reflects the matching quality. Therefore, the entropy map H_i of {P_i,j}, j = 1…N_d, is used to measure the depth estimation quality, and H_i is transformed into the uncertainty map U_i by a function f_u, implemented as a shallow 2D CNN. The reason for using entropy is that the randomness of the distribution is negatively correlated with its unimodality, and unimodality is an indicator of high confidence in the depth estimate. To jointly learn the depth estimate D_i and its uncertainty U_i, the negative log-likelihood of the distribution above is minimized (constants omitted):

L_i = |D_i − d̂_i| / U_i + log U_i.

For numerical stability, S_i = log U_i is inferred in practice instead of U_i directly; the log-uncertainty map S_i is likewise obtained from the entropy map H_i by a shallow 2D CNN, and the loss becomes

L_i = exp(−S_i) · |D_i − d̂_i| + S_i.

This loss can also be interpreted as an L1 loss between the estimate and the ground truth, attenuated by the uncertainty, plus a regularization term. The intuition is that the influence of noisy, erroneous samples should be reduced during training.
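The following sketch shows one plausible implementation of the entropy-to-log-uncertainty head and the attenuated L1 loss; the depth and width of the shallow CNN and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class LogUncertaintyHead(nn.Module):
    """Shallow 2D CNN mapping the per-pixel entropy H_i of the depth
    distribution to the log-uncertainty S_i = log U_i.  The two-layer
    design and hidden width are illustrative assumptions."""

    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 3, padding=1))

    def forward(self, prob):
        # prob: (B, N_d, H, W) probability volume of one reference-source pair
        entropy = -(prob * torch.log(prob.clamp_min(1e-12))).sum(dim=1, keepdim=True)
        return self.net(entropy).squeeze(1)               # S_i, shape (B, H, W)

def joint_depth_uncertainty_loss(depth, log_unc, gt_depth, valid):
    """Negative log-likelihood under a Laplacian, with S = log U:
    exp(-S) * |D - d_gt| + S (attenuated L1 plus a regularizer)."""
    loss = torch.exp(-log_unc) * torch.abs(depth - gt_depth) + log_unc
    return loss[valid].mean()
```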

5. Volume fusion

Given the pairwise latent cost volumes {V_i}, the fused volume V is obtained by a weighted sum, where the weights are negatively correlated with the estimated pairwise uncertainty.
(Equation: V is a per-pixel weighted average of the pairwise latent volumes V_i, with weights that decrease as the log-uncertainty S_i increases.)
Pixels with high uncertainty are more likely to lie in occluded areas, so their values in the latent volume are attenuated. An alternative to the weighted sum is to threshold S_i and perform a hard visibility selection for each pixel. However, without an interpretation of the S_i values, only empirical thresholds can be chosen, which may not generalize. The weighted sum instead fuses all views naturally while taking the log-uncertainty S_i into account.
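A sketch of the weighted fusion, assuming per-pixel weights exp(−S_i) = 1/U_i and a normalized weighted average; the exact weighting function used in the paper may differ, so treat this purely as an illustration of the idea.

```python
import torch

def fuse_latent_volumes(latent_volumes, log_uncertainties, eps=1e-6):
    """Visibility-aware fusion of pairwise latent volumes.
    latent_volumes: list of (B, C, N_d, H, W); log_uncertainties: list of (B, H, W).
    The per-pixel weight exp(-S_i) = 1/U_i and the normalized weighted average
    are assumptions for illustration; the paper's exact weighting may differ."""
    weights = [torch.exp(-s)[:, None, None] for s in log_uncertainties]  # (B,1,1,H,W)
    fused = sum(w * v for w, v in zip(weights, latent_volumes))
    return fused / (sum(weights) + eps)
```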

6. Coarse-to-fine structure

At all stages, depth hypotheses are uniformly sampled from a depth range. The first stage uses low-resolution image features to construct cost volumes with predetermined depth ranges and larger depth intervals, while subsequent stages use higher spatial resolutions with narrower depth ranges and smaller depth intervals.

For the first stage, the depth range is [d_min, d_min + 2Δd) with N_d,1 hypotheses, where d_min, Δd, and N_d,1 are predetermined. For the k-th stage (k ∈ {2, 3}), the depth range, number of samples, and interval are reduced. The range is centered on the depth estimate from the previous stage and differs per pixel: the depth range of pixel x is [D_k−1,0(x) − w_k Δd, D_k−1,0(x) + w_k Δd) with p_k N_d,k hypotheses, where w_k < 1 and p_k < 1 are predefined scaling factors and D_k−1,0(x) is the final depth estimate of pixel x from the previous stage k−1.
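A small sketch of the per-pixel hypothesis sampling for stages 2 and 3, assuming the previous-stage depth map has already been upsampled to the current resolution and that `delta_d`, `w_k`, and the hypothesis count are given:

```python
import torch

def stage_depth_hypotheses(prev_depth, delta_d, w_k, num_hypotheses):
    """Per-pixel depth hypotheses for stage k (k = 2, 3): uniform samples in
    [D_{k-1,0}(x) - w_k*delta_d, D_{k-1,0}(x) + w_k*delta_d).
    prev_depth: (B, H, W) previous-stage depth upsampled to the current
    resolution; returns (B, N, H, W)."""
    steps = torch.arange(num_hypotheses, device=prev_depth.device) / num_hypotheses
    lower = prev_depth - w_k * delta_d                    # per-pixel lower bound
    span = 2.0 * w_k * delta_d                            # width of the range
    return lower[:, None] + steps.view(1, -1, 1, 1) * span
```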

7. Loss calculation

For each stage, a pairwise L1 loss, a pairwise joint (depth-uncertainty) loss, and the L1 loss of the final depth map are computed; the total loss is a weighted sum of the losses of all three stages. To normalize scales across different training scenes, all depth differences are divided by the predefined depth interval of the final stage.
Since the uncertainty loss tends to over-relax the depth and uncertainty estimates, the L1 loss is included as well; it guarantees a qualified depth map estimate.
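One way the three loss terms and the stage weighting could be assembled is sketched below; the dictionary field names, stage weights, and the exact normalization are assumptions, and the joint term reuses the attenuated-L1 form from Section 4.

```python
import torch

def total_loss(stage_outputs, gt_depths, stage_weights=(0.5, 1.0, 2.0), interval=1.0):
    """Weighted sum of the per-stage losses (pairwise L1 + pairwise joint + final L1).
    `interval` is the final-stage depth interval used to normalize depth differences.
    The dict field names and the stage weights are illustrative assumptions."""
    loss = 0.0
    for w, out, gt in zip(stage_weights, stage_outputs, gt_depths):
        valid = gt > 0                                     # pixels with ground truth
        stage = 0.0
        for d, s in zip(out["pair_depths"], out["pair_log_unc"]):
            diff = torch.abs(d - gt) / interval
            stage = stage + diff[valid].mean()                         # pairwise L1
            stage = stage + (torch.exp(-s) * diff + s)[valid].mean()   # pairwise joint
        final_diff = torch.abs(out["final_depth"] - gt) / interval
        stage = stage + final_diff[valid].mean()                       # final-depth L1
        loss = loss + w * stage
    return loss
```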

8. Generating a point cloud

After generating depth maps for all views, the depth maps are fused into a unified point cloud by filtering for photometric and geometric consistency.

Probability map (photometric consistency). A probability map is additionally generated to filter out unreliable pixels. The final probability map is obtained by summing the probabilities of the depth hypotheses within the [D−2, D+2] window around the final depth estimate D. In addition, the coarse-to-fine architecture produces probability maps at each stage, so the filtering criterion is: a pixel in the reference view is preserved if and only if the probability maps of all three stages are above the corresponding thresholds p_t,1, p_t,2, p_t,3.
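A rough sketch of the photometric confidence computation and the three-stage thresholding; here the peak index of the probability volume is used as a proxy for the regressed depth index, and the threshold values are placeholders rather than the paper's settings.

```python
import torch

def confidence_map(prob_volume, window=2):
    """Photometric confidence: total probability of the hypotheses within
    +/- `window` indices of the peak of the distribution (used here as a
    proxy for the regressed depth index).  prob_volume: (B, N_d, H, W)."""
    n_d = prob_volume.shape[1]
    peak = prob_volume.argmax(dim=1, keepdim=True)                    # (B, 1, H, W)
    idx = torch.arange(n_d, device=prob_volume.device).view(1, -1, 1, 1)
    in_window = (idx - peak).abs() <= window                          # [D-2, D+2]
    return (prob_volume * in_window).sum(dim=1)                       # (B, H, W)

def photometric_keep_mask(stage_confidences, thresholds=(0.1, 0.3, 0.5)):
    """Keep a pixel only if the confidence of all three stages exceeds the
    corresponding threshold p_t,k.  The threshold values are placeholders."""
    keep = None
    for conf, t in zip(stage_confidences, thresholds):
        m = conf > t
        keep = m if keep is None else keep & m
    return keep
```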

Geometric consistency. This is performed in the same way as in previous methods.

Geometric visibility fusion. All source depth maps are projected onto the reference view, where each pixel of the reference depth map can receive a varying number of depth values. For each pixel, the following metrics are computed for each candidate depth: (1) occlusion, the number of candidate depths that occlude this one (i.e., have a smaller depth value); (2) violation, the number of candidates for which this depth lies in free space after projection; (3) stability, defined as occlusion minus violation. Finally, the minimum depth value with non-negative stability is selected as the new depth of the pixel. More details on occlusion, violation, and stability can be found in the original paper (Merrell et al., 2007). Compared with simply picking the median of the candidate depths, this visibility-based fusion slightly improves point cloud quality.
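The sketch below shows a simplified, single-ray version of the stability test for one reference pixel: occlusions and violations are counted only among the depths reprojected onto that pixel, whereas the original method of Merrell et al. (2007) checks free-space violations in the source views, so this is an approximation of the idea rather than the actual algorithm.

```python
def stability_fusion(candidate_depths, tol=0.01):
    """Pick, for one reference pixel, the smallest reprojected depth with
    non-negative stability.  In this single-ray simplification a candidate is
    occluded by smaller reprojected depths and violated by larger ones;
    stability = occlusions - violations.  `tol` is an illustrative relative
    tolerance, not a value from the paper."""
    for d in sorted(candidate_depths):
        occlusions = sum(1 for c in candidate_depths if c < d * (1 - tol))
        violations = sum(1 for c in candidate_depths if c > d * (1 + tol))
        if occlusions - violations >= 0:
            return d
    return None  # no stable depth; leave the pixel unfused
```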

Geometric mean fusion. Noise in the estimated depth values can be reduced by averaging the reprojected depths from the source views. For a pixel p_0 with depth d_0 in the reference depth map, the reprojected depths {d_i}, i ∈ I_c, are collected from all consistent source views I_c, and all depths are averaged to obtain the final result.

Small segment filtering. Small clusters of flying points in space are usually noise, and they can be easily removed according to their cluster size; this can be done at the depth map level. Given a depth map, a graph is constructed in which an edge connects two adjacent pixels if both are valid and their depth difference is small. Connected components containing only a few pixels are then removed. In practice, a relative depth difference threshold of 0.05% and a minimum cluster size of 10 are used.
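A possible depth-map-level implementation with union-find; the 4-neighbour connectivity and the unoptimized Python loops are choices made for clarity, while the 0.05% relative threshold and minimum cluster size of 10 follow the values quoted above.

```python
import numpy as np

def small_segment_filter(depth, valid, rel_thresh=0.0005, min_size=10):
    """Invalidate small clusters of flying points at the depth-map level.
    Adjacent (4-neighbour) pixels are linked if both are valid and their
    relative depth difference is below `rel_thresh` (0.05%); connected
    components smaller than `min_size` pixels are removed."""
    h, w = depth.shape
    parent = np.arange(h * w)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]                 # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    def linked(a, b):
        return (valid.flat[a] and valid.flat[b] and
                abs(depth.flat[a] - depth.flat[b])
                <= rel_thresh * max(depth.flat[a], depth.flat[b]))

    for y in range(h):
        for x in range(w):
            i = y * w + x
            if x + 1 < w and linked(i, i + 1):
                union(i, i + 1)
            if y + 1 < h and linked(i, i + w):
                union(i, i + w)

    roots = np.array([find(i) for i in range(h * w)])
    sizes = np.bincount(roots, minlength=h * w)
    keep = valid & (sizes[roots].reshape(h, w) >= min_size)
    return keep
```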

The entire filtering and fusion mechanism is as follows:

Probability map filtering;
geometric consistency filtering;
geometric visibility fusion;
geometric consistency filtering;
geometric mean fusion;
geometric consistency filtering;
small segment filtering.
If a pixel is filtered out at a certain step, it is not involved in the subsequent steps.

Origin blog.csdn.net/qq_44708206/article/details/129048584