[Paper Intensive Reading] RA-MVSNet: Multi-View Stereo Representation Revisit: Region-Aware MVSNet

Today I read a paper published at CVPR 2023; the authors are from Zhejiang University and Alibaba.
Article link: Multi-View Stereo Representation Revisit: Region-Aware MVSNet

Abstract

Deep learning-based multi-view stereo has become a powerful paradigm for reconstructing objects with full geometric details from multiple views. Most existing methods estimate pixel-level depth values only by minimizing the gap between the predicted point and the ray's intersection with the surface, which usually ignores the surface topology. This is critical for untextured areas and surface boundaries that cannot be reconstructed correctly. To address this issue, we propose to exploit the point-to-surface distance to enable the model to perceive a wider range of surfaces. To this end, we predict the distance volume from the cost volume to estimate signed distances to points around the surface. Our proposed RA-MVSNet is patch-aware because the perceptual range is enhanced by associating hypothetical planes with surface patches. Therefore, it can increase the completion of texture-free regions and reduce outliers at boundaries. Furthermore, mesh topologies with fine details can be generated through the introduced distance volumes. Compared with traditional deep learning-based multi-view stereo methods, our proposed RA-MVSNet method utilizes signed distance supervision to obtain more complete reconstruction results. Experiments on DTU and Tanks & Temples datasets show that our proposed method achieves state-of-the-art results.

1 Introduction

Briefly, the main contributions are as follows:
• Introduce point-to-surface distance supervision of sampling points to expand the perceptual range of model predictions, thereby achieving complete estimation of texture-free regions and reducing outliers in object boundary regions.
• To address the lack of a real mesh, signed distances between point sets are computed under a triangular-mesh assumption, which balances accuracy and speed.
• Experimental results on MVS benchmarks show that the proposed method performs best on both the indoor dataset DTU and the large-scale outdoor dataset Tanks and Temples.

2 Related Work

This section reviews prior work on traditional MVS and learning-based MVS.

3 Method

3.1 Cost Volume Construction

Cost volume construction is basically the same as in the original MVSNet, except that a Recursive Feature Pyramid structure is used as the image encoder to obtain feature maps at three stages.

3.2 Signed Distance Supervision

Following the implicit SDF representation, a distance volume is constructed to predict the distance from each point to the surface.

Because distance prediction is introduced, the ground truth (GT) must be extended from depth maps to signed distance fields. A depth map alone only contains sampled points whose signed distance is 0, so it lacks ground truth for points around the surface.

For each exact query point $p_i$ on a hypothetical plane of the cost volume $C$, we compute the distance from $p_i$ to the surface sampling point $p'$ as the GT of the signed distance. As shown in Figure 3, we use the two-point distance $d(p_i, p'_j)$ as the GT signed distance.

To speed up the process, a local search is used instead of finding the nearest neighbor among all surface sample points, as shown in Figure 4.
This patch-based local search keeps the number of candidate points as small as reasonably possible, which reduces the time complexity of the search. Let the resolution of the depth map be $H \times W$ and the number of query points be $n$. The time complexity of the naive computation is $O(n \times H \times W)$, which is proportional to the resolution of the depth map. In contrast, the time complexity of the patch-based local search is $O(n \times k \times k)$, where $k$ is the patch size, usually set to $5$. Because $k$ is a small constant, this simplifies to $O(n)$: the total cost is proportional only to the number of query points $n$, and the search time for each query point is constant.
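The patch-based search described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: the helper names `backproject` and `signed_distance_gt`, the pinhole intrinsics `K`, and the sign convention (positive in front of the surface, negative behind it) are all hypothetical choices made for the example.

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) to camera-space surface points (H, W, 3)."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]                      # pixel rows and columns
    pix = np.stack([u, v, np.ones_like(u)], -1)    # homogeneous pixels (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                # K^-1 [u, v, 1]^T per pixel
    return rays * depth[..., None]                 # scale each ray by its depth

def signed_distance_gt(surface, depth, K, u, v, d_hyp, k=5):
    """Approximate signed distance for the query point at pixel (u, v) and
    hypothesised depth d_hyp, searching only a k x k patch of surface points."""
    H, W = depth.shape
    r = k // 2
    query = np.linalg.inv(K) @ np.array([u, v, 1.0]) * d_hyp
    # Candidate surface points: the k x k window around (u, v), clipped to the image.
    patch = surface[max(v - r, 0):min(v + r + 1, H),
                    max(u - r, 0):min(u + r + 1, W)].reshape(-1, 3)
    dist = np.linalg.norm(patch - query, axis=1).min()
    # Assumed sign convention: positive in front of the surface, negative behind.
    sign = 1.0 if d_hyp <= depth[v, u] else -1.0
    return sign * dist
```

Each query inspects at most $k \times k$ surface points (25 for $k = 5$) instead of all $H \times W$ of them, which is where the constant per-query cost comes from. Computing `surface` once and reusing it for every query point keeps the back-projection out of the per-query cost.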

3.3 Volume Fusion

After obtaining the probability volume and the distance volume, the two must be fused to obtain the final depth map. Typically, a softmax-based regularization network is used to predict the depth.
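For reference, the usual softmax-based depth regression in MVSNet-style networks takes the expectation of the depth hypotheses under the probability volume, i.e. $D(u, v) = \sum_{d} d \cdot P(d, u, v)$. A minimal NumPy sketch follows; the function names and the `(D, H, W)` array layout are assumptions made for this example, not taken from the paper's code.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the depth-hypothesis axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_argmin_depth(prob, depth_values):
    """Expected depth per pixel.

    prob:         (D, H, W) probabilities over D depth hypotheses
    depth_values: (D,) depth of each hypothetical plane
    returns:      (H, W) depth map
    """
    return np.tensordot(depth_values, prob, axes=(0, 0))
```

Every hypothesis plane contributes to the sum, including planes far from the true surface.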

However, this method has accuracy problems because many invalid planes are involved in the calculation. The depth value of a pixel $(u, v)$ is related only to the several hypothetical planes corresponding to that pixel, and cannot be associated with other sampling points on the surface.
As shown in Fig. 5, we fuse the probability volume P and the introduced distance volume S to compute the depth map such that each pixel is related to the surrounding surface patches. Specifically, S can be viewed as filtering probability values by thresholding. The fusion process of these two volumes is shown in Algorithm 1. Finally, we supervise the two volumes P and S using the depth map ground truth and the generated signed distance ground truth, adopting an L1 loss for both the depth map and the signed distance.
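Algorithm 1 is not reproduced in this post, so the following is only one plausible reading of "filtering probability values by thresholding", not the paper's exact procedure. It assumes that planes whose predicted absolute signed distance exceeds a threshold `theta` are dropped, that the remaining probabilities are renormalised, and that pixels with no surviving plane are marked invalid. The names `fuse_volumes` and `l1_loss` are hypothetical.

```python
import numpy as np

def fuse_volumes(P, S, depth_values, theta):
    """Fuse probability volume P and distance volume S, both (D, H, W).

    Assumption: a plane is kept only if |S| <= theta, i.e. it is predicted
    to lie close to the surface. Returns a depth map and a validity mask.
    """
    keep = np.abs(S) <= theta                  # planes predicted near the surface
    w = np.where(keep, P, 0.0)                 # zero out filtered planes
    total = w.sum(axis=0)                      # remaining probability mass (H, W)
    valid = total > 0                          # pixels with at least one plane left
    w = w / np.where(valid, total, 1.0)        # renormalise, avoiding division by 0
    depth = np.tensordot(depth_values, w, axes=(0, 0))
    return depth, valid

def l1_loss(pred, gt, mask):
    """Mean absolute error over the masked (valid) elements."""
    return np.abs(pred - gt)[mask].mean()
```

Under this reading, a larger `theta` keeps more planes and leaves fewer pixels invalid, at the cost of letting planes farther from the surface influence the depth. The same `l1_loss` could be applied to the depth map and to the signed distances; how the two terms are weighted is not stated in this post.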

3.4 Supervision of SDF Branch

Since we generate the ground-truth point-to-surface distances from the corresponding depth maps, an error-bound analysis is necessary. A reasonable assumption is that the surface can be represented by a triangular mesh. There are three different cases, as shown in Figure 6.
Detailed analysis can be found in the paper.

4 Experiments

4.1 Implementation Setup

The training and testing configurations are presented.

4.2 Results on Tanks and Temples

[Image: results on Tanks and Temples]

4.3 Results on DTU

[Image: results on DTU]

4.4 Ablation Studies

The ablation experiments cover three main aspects:
• SDF branch: it has a large effect on accuracy because it removes outliers.
• Local patch size: it also affects accuracy, and performance improves as $k$ increases.
• Fusion threshold: as $\theta$ increases, accuracy gets worse but completeness gets better.

5 Conclusion

In summary, the future direction is to use only the SDF for MVS in order to reduce memory overhead.

Origin blog.csdn.net/YuhsiHu/article/details/131565904