[Paper Brief] Multi-View Stereo Representation Revisit: Region-Aware MVSNet (CVPR 2023)

1. Brief introduction of the paper

1. First author: Yisu Zhang

2. Year of publication: 2023

3. Publication venue: CVPR

4. Keywords: MVS, 3D reconstruction, signed distance field

5. Exploration Motivation: Pixel-wise depth estimation still suffers from two stubborn flaws. First, estimation confidence is low in texture-free regions. Second, there are many outliers near object boundaries. Both stem mainly from treating the surface as a set of uncorrelated sampling points without topology. Since each ray is associated with only one surface sample point, the model cannot attend to adjacent regions of the surface. As shown in the figure below, each depth value is constrained by a single surface sampling point and cannot be inferred from the surrounding surface. In untextured regions and at object boundaries, however, inference is difficult without broader surface information. This overly small perception range therefore limits existing learning-based MVS methods.

6. Work goal: To solve the above problems by utilizing surface information.

7. Core idea: A new RA-MVSNet framework is proposed that associates each hypothetical plane with a wider surface region through the point-to-surface distance. The method can therefore draw on surrounding surface information in texture-free regions and at object boundaries.

  1. We introduce point-to-surface distance supervision of sampled points to expand the perception range of the model, which achieves complete estimation in textureless areas and reduces outliers in object boundary regions.
  2. To tackle the challenge of lacking the ground-truth mesh, we compute the signed distance between point sets based on the triangulated mesh, which trades off between accuracy and speed.

8. Experimental results:

Experimental results on challenging MVS datasets show that the proposed approach achieves the best performance on both the indoor DTU dataset and the large-scale outdoor Tanks and Temples dataset.

9. Paper download:

https://arxiv.org/pdf/2304.13614.pdf

2. Implementation process

1. Overview of RA-MVSNet

The overall framework has three parts: cost volume construction, multi-scale depth map and signed distance prediction, and ground-truth processing. It consists of two branches: the first predicts the probability volume, and the second estimates the signed distance volume. RA-MVSNet fuses the two branches to obtain a filtered depth map, while the SDF branch can also generate an implicit representation. Because the point-to-surface distance supervision uses an extra branch to compute the signed distance of sampled points around the surface from the cost volume, it can be added to existing learning-based MVS schemes with minor modifications. Cascade MVSNet (Cas-MVSNet) is used as the baseline, with two branches that predict depth and signed distance respectively.

2. Cost volume construction

Following MVSNet, the cost volume is constructed by homography warping. A recursive feature pyramid (RFP) serves as the shared-weight image encoder and extracts features at three scales. To handle an arbitrary number of source images, an adaptive strategy aggregates all feature volumes V_i into a single cost volume C ∈ ℝ^{D×C'×H'×W'}, and several 3D CNN layers predict the pixel-wise weighting matrices W_i. The final cost volume can be calculated as follows:

where C is the cost volume of the reference view, ⊙ denotes element-wise multiplication, and V_i and V_0 are the features extracted from the source images and the reference view by the image encoder.
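The aggregation formula itself is not reproduced in this brief. As a minimal sketch, assume the cost volume is a weighted mean of squared feature differences, C = Σ_i W_i ⊙ (V_i − V_0)² / Σ_i W_i. This form is an assumption chosen to match the symbols defined above, not a confirmed transcription of the paper's equation. The function name, tensor layout, and eps term are hypothetical, and random arrays stand in for real features and for the weights that the 3D CNN layers would predict.

```python
import numpy as np

def aggregate_cost_volume(ref_volume, src_volumes, weights, eps=1e-8):
    """Weighted mean of squared feature differences (assumed form).

    ref_volume:  (D, C, H, W)       reference features V_0 repeated over D depth planes
    src_volumes: (N-1, D, C, H, W)  warped source feature volumes V_i
    weights:     (N-1, D, 1, H, W)  pixel-wise weights W_i, one per source view
    """
    diff_sq = (src_volumes - ref_volume[None]) ** 2   # (N-1, D, C, H, W)
    weighted = (weights * diff_sq).sum(axis=0)        # weights broadcast over channels
    return weighted / (weights.sum(axis=0) + eps)     # (D, C, H, W)

rng = np.random.default_rng(0)
D, C, H, W, n_src = 4, 8, 5, 6, 3
v0 = rng.normal(size=(D, C, H, W))
vi = rng.normal(size=(n_src, D, C, H, W))
wi = rng.uniform(0.1, 1.0, size=(n_src, D, 1, H, W))

cost = aggregate_cost_volume(v0, vi, wi)
print(cost.shape)  # one volume regardless of the number of source views
```

Because the weights are normalized per pixel, the output shape does not depend on the number of source views, which is what allows an arbitrary number of source images.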

3. Signed distance supervision

The distance from a point to a surface is usually expressed as a signed distance field (SDF). The core of this implicit representation is computing the distance from sampling points near the surface to the object. Following the SDF idea, a distance volume is therefore constructed to predict the point-to-surface distance and so take advantage of the implicit representation.

For the 3D cost volume C that aggregates the features of the reference and source views, a regularization network is usually used to obtain the probability volume P, which is regarded as the weights of the different depth hypothesis planes:

$$P = F_{softmax}(C)$$

where F_softmax is a softmax-based 3D CNN regularization network. The distance volume S represents the signed distances of these hypothetical planes:

$$S = F_{tanh}(C)$$

where F_tanh is a tanh-based 3D CNN regularization network. Since points far from the surface usually do not help reconstruction, tanh is adopted as the activation layer for the distance volume, so that the network focuses on sampling points near the surface.
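To make the two output heads concrete, here is a small NumPy sketch of the final activations only. The 3D CNN regularization networks are not implemented; random arrays (hypothetical stand-ins named prob_logits and dist_logits) take the place of their pre-activation outputs.

```python
import numpy as np

def softmax_over_depth(logits):
    """Numerically stable softmax along axis 0, the depth-hypothesis axis."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
D, H, W = 8, 4, 5
# Hypothetical stand-ins for the pre-activation outputs of the two branches.
prob_logits = rng.normal(size=(D, H, W))
dist_logits = rng.normal(scale=3.0, size=(D, H, W))

P = softmax_over_depth(prob_logits)  # per-pixel weights over depth planes, summing to 1
S = np.tanh(dist_logits)             # signed distances squashed into [-1, 1]
```

The tanh saturates for large inputs, so hypotheses far from the surface all map to values near ±1 and the usable resolution is concentrated near the surface.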

Since the prediction of the distance is introduced, the ground truth of the depth map needs to be extended to a signed distance field. However, the depth map only contains sampled points at distance 0 and lacks ground truth for points around the surface.

For each query point p_i on a hypothetical plane of the cost volume C, the shortest distance from p_i to the surface sampling points {p'_j} is computed as the ground truth of the signed distance. As shown in the figure, each point on a hypothetical plane is regarded as a sampling point around the surface; its nearest-neighbor surface sampling point is found, and the point-to-point distance d(p_i, p'_j), computed with Kaolin, is used as the ground-truth signed distance.

To speed up this process, the nearest-neighbor search over all surface sample points is replaced by a patch-based local search, as shown in the figure. Because the nearest neighbor usually lies near the query point, a large number of irrelevant surface sampling points can be discarded, keeping only those inside the local patch around the intersection point.

This patch-based local search restricts the computation to as few points as possible within a reasonable range, reducing the time complexity of the search. Assuming the depth map has resolution H×W and there are n query points, the naive computation costs O(n×H×W), which is proportional to the depth map resolution. The patch-based local search costs O(n×k×k), where k is the patch size, usually set to 5. Since k is a constant, this reduces to O(n): the cost is proportional only to the number n of query points, and the search time per query point is constant.
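The patch-based search can be sketched in NumPy as below. This is an illustration under stated assumptions, not the Kaolin-based implementation the paper uses: the function names are hypothetical, a pinhole camera with intrinsics K is assumed, and the sign convention (positive in front of the surface along the ray, negative behind it) is also an assumption.

```python
import numpy as np

def backproject(u, v, depth, K_inv):
    """Lift pixel (u, v) at the given depth to a 3D point in the camera frame."""
    return depth * (K_inv @ np.array([u, v, 1.0]))

def patch_signed_distance(depth_gt, K, u, v, d_hyp, k=5):
    """Signed distance from the query point at pixel (u, v), depth d_hyp,
    to the nearest surface sample inside a k x k patch of the depth map."""
    K_inv = np.linalg.inv(K)
    H, W = depth_gt.shape
    query = backproject(u, v, d_hyp, K_inv)
    r = k // 2
    best = np.inf
    for vv in range(max(0, v - r), min(H, v + r + 1)):
        for uu in range(max(0, u - r), min(W, u + r + 1)):
            if depth_gt[vv, uu] <= 0:      # skip pixels without valid depth
                continue
            surf = backproject(uu, vv, depth_gt[vv, uu], K_inv)
            best = min(best, np.linalg.norm(query - surf))
    # Assumed sign convention: positive if the query lies in front of the surface.
    sign = 1.0 if d_hyp <= depth_gt[v, u] else -1.0
    return sign * best

K = np.array([[100.0, 0.0, 8.0], [0.0, 100.0, 8.0], [0.0, 0.0, 1.0]])
depth = np.full((16, 16), 2.0)             # fronto-parallel plane at depth 2
front = patch_signed_distance(depth, K, 8, 8, 1.5)
behind = patch_signed_distance(depth, K, 8, 8, 2.3)
full = patch_signed_distance(depth, K, 8, 8, 1.5, k=33)  # patch covers the whole map
print(front, behind, full)
```

Each query visits at most k×k depth pixels, so the per-query cost is independent of H and W. On this simple plane the 5×5 patch returns the same distance as a search over the whole map.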

4. Volume fusion

Once the probability volume P ∈ ℝ^{D×H'×W'} and the distance volume S ∈ ℝ^{D×H'×W'} are obtained, the two volumes are fused to obtain the final depth map D ∈ ℝ^{H'×W'}. Typically, softmax-based regularization networks predict the depth map from P, which is treated as the weights of the hypothesized planes at different depths. The depth map is then computed as:

$$D = \sum_{d=d_{min}}^{d_{max}} d \cdot P(d)$$

where d_min and d_max are the distances of the nearest and farthest hypothetical planes, respectively. However, this computation includes many invalid planes and suffers from precision problems. The depth value of a pixel (u, v) depends only on the hypothetical planes along that pixel's ray and cannot draw on other sampling points on the surface. As shown in the figure, the probability volume P and the introduced distance volume S are therefore fused to compute the depth map, so that each pixel is related to the surrounding surface patches.
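The expectation-based regression described at the start of this step (depth as the probability-weighted sum over all hypothesis planes between d_min and d_max) can be sketched as follows. The function name and the toy probability volumes are illustrative only.

```python
import numpy as np

def regress_depth(prob, depth_values):
    """Expected depth per pixel: sum over hypotheses d of d * P(d).

    prob:         (D, H, W) probability volume, sums to 1 along axis 0
    depth_values: (D,)      depths of the hypothesis planes, d_min .. d_max
    """
    return np.tensordot(depth_values, prob, axes=(0, 0))  # (H, W)

depth_values = np.linspace(1.0, 2.0, 5)    # d_min = 1.0, d_max = 2.0

# Confident prediction: all mass on the middle plane.
one_hot = np.zeros((5, 2, 3))
one_hot[2] = 1.0
depth_sharp = regress_depth(one_hot, depth_values)

# Ambiguous prediction: mass split between the nearest and farthest planes.
bimodal = np.zeros((5, 2, 3))
bimodal[0] = 0.5
bimodal[4] = 0.5
depth_ambiguous = regress_depth(bimodal, depth_values)
```

Both cases regress to the same depth, although in the second one no probability mass sits on that plane. This is the weakness of averaging over every plane that the fusion with S is meant to address.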

Specifically, S can be viewed as a threshold filter on the probability values. The fusion process of the two volumes is described in the algorithm.
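The algorithm listing is not reproduced in this brief, so the following is only one possible reading of "S as a threshold filter", not the paper's procedure. It assumes that hypotheses whose predicted absolute signed distance is at least a threshold tau are discarded, the remaining probabilities are renormalized, and the expected depth is taken over what is left. The function name, tau, and the fallback to unfiltered probabilities are all hypothetical.

```python
import numpy as np

def fuse_volumes(prob, dist, depth_values, tau=0.5, eps=1e-8):
    """Filter the probability volume with the distance volume (assumed scheme).

    prob, dist:   (D, H, W) probability volume P and signed distance volume S
    depth_values: (D,)      depths of the hypothesis planes
    """
    mask = (np.abs(dist) < tau).astype(prob.dtype)   # keep planes predicted near the surface
    filtered = prob * mask
    norm = filtered.sum(axis=0, keepdims=True)       # (1, H, W)
    # If no plane survives at a pixel, fall back to the unfiltered probabilities.
    weights = np.where(norm > eps, filtered / np.maximum(norm, eps), prob)
    return np.tensordot(depth_values, weights, axes=(0, 0))  # (H, W)

depth_values = np.array([1.0, 2.0, 3.0, 4.0])
prob = np.array([0.05, 0.5, 0.05, 0.4]).reshape(4, 1, 1)   # spurious peak at depth 4
dist = np.array([0.9, 0.1, -0.4, -0.9]).reshape(4, 1, 1)   # planes 0 and 3 far from surface

plain = np.tensordot(depth_values, prob, axes=(0, 0))      # unfiltered expectation
fused = fuse_volumes(prob, dist, depth_values)
print(plain[0, 0], fused[0, 0])
```

In this toy pixel the unfiltered expectation is pulled toward the spurious far peak, while the filtered result stays close to the plane that the distance volume marks as near the surface.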

Finally, the two volumes P and S are supervised with the ground-truth depth map and the generated ground-truth signed distance, using L1 losses as follows:

where D_i^* and S_i^* are the ground-truth depth map and point-to-surface distance at stage i, respectively, and D_i and S_i are the corresponding predictions of the two branches. The total loss is the weighted sum of the two branch losses:

λ is the weight to balance the two terms and is set to 0.1 in all experiments. 
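A minimal sketch of such a two-branch objective is given below. The loss equations are not reproduced in this brief, so several details are assumptions: the per-stage losses are summed without stage weights, the L1 loss uses mean reduction, and λ multiplies the signed-distance term. Function names are hypothetical.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error (mean reduction is an assumption)."""
    return np.abs(pred - target).mean()

def total_loss(depth_preds, depth_gts, sdf_preds, sdf_gts, lam=0.1):
    """Sum of per-stage L1 losses for both branches.

    Each argument is a list with one array per cascade stage i.
    Assumption: lam weights the signed-distance term.
    """
    loss_depth = sum(l1_loss(d, g) for d, g in zip(depth_preds, depth_gts))
    loss_sdf = sum(l1_loss(s, g) for s, g in zip(sdf_preds, sdf_gts))
    return loss_depth + lam * loss_sdf

# One-stage toy example: depth is off by 0.1, signed distance by 0.5.
depth_preds = [np.full((2, 2), 2.1)]
depth_gts = [np.full((2, 2), 2.0)]
sdf_preds = [np.zeros((3, 2, 2))]
sdf_gts = [np.full((3, 2, 2), 0.5)]

loss = total_loss(depth_preds, depth_gts, sdf_preds, sdf_gts, lam=0.1)
print(loss)
```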

5. SDF branch supervision

Since the ground-truth point-to-surface distances are generated from the corresponding depth maps, an error-bound analysis is necessary. A reasonable assumption is that the surface is represented by a triangular mesh. There are three different cases, as shown in the figure.

In (a), the largest sphere centered at the query point p is tangent to the object surface at point o, so the true signed distance at p is d(p, o). The distance from the query point p to the sample point set {p'} is d(p, p'_j). Since p'_j coincides with the tangent point o, the squared error e_a^2 of case (a) is:

$$e_a^2 = \left| d(p, o) - d(p, p'_j) \right|^2 = 0$$

where d(p, o) and d(p, p'_j) denote the true and approximate values of the signed distance, respectively.

In (b) and (c), a similar analysis is used. Suppose o and o'' are the points of tangency between the surface and the sphere centered at p. In (b), the true signed distance is d(p, o), and in (c) it is d(p, o''). The error bounds of cases (b) and (c) can therefore be expressed by the following equations:

where e_b^2 and e_c^2 are the squared errors of (b) and (c), respectively. Combining the three cases, which cover all possibilities, gives the final error bound for the query point p:

$$e^2 \le d(p'_j, p'_{j+1})^2$$

where e is the overall error at the query point p, and p'_j and p'_{j+1} are two adjacent surface points. The inequality states that the squared error e^2 does not exceed the squared distance between the two points back-projected from two adjacent pixels.
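The bound can be checked numerically in a simplified setting. The sketch below is not the paper's derivation: it assumes a 2D cross-section in which the surface between two adjacent samples p'_j and p'_{j+1} is a single straight edge, and it compares the true distance to that edge with the approximation given by the nearer sample. All names and coordinates are illustrative.

```python
import math
import random

def dist_point_segment(p, a, b):
    """Euclidean distance from point p to the segment ab in 2D."""
    ax, ay = a
    bx, by = b
    px, py = p
    abx, aby = bx - ax, by - ay
    t = ((px - ax) * abx + (py - ay) * aby) / (abx * abx + aby * aby)
    t = max(0.0, min(1.0, t))          # clamp the projection onto the segment
    return math.hypot(px - (ax + t * abx), py - (ay + t * aby))

random.seed(0)
a, b = (0.0, 1.0), (0.3, 1.2)          # two adjacent surface samples
spacing = math.dist(a, b)

worst = 0.0
for _ in range(10000):
    p = (random.uniform(-1.0, 1.3), random.uniform(0.0, 2.2))   # random query point
    true_d = dist_point_segment(p, a, b)                        # distance to the edge
    approx_d = min(math.dist(p, a), math.dist(p, b))            # distance to nearest sample
    worst = max(worst, abs(approx_d - true_d))

print(worst ** 2 <= spacing ** 2)
```

In this single-edge setting the triangle inequality actually gives a tighter bound of half the sample spacing, so the squared error stays well inside the stated limit.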

6. Experiment

6.1. Implementation Details

The method is implemented in PyTorch with a batch size of 2. Training on DTU uses two NVIDIA RTX 2080Ti GPUs, and training on BlendedMVS uses a single NVIDIA Tesla P40. Following AA-RMVSNet, a finer DTU ground truth is used. Evaluation on DTU and Tanks and Temples is run on an NVIDIA Tesla P40 GPU with 24 GB of memory.

6.2. Comparison with state-of-the-art methods
