[Paper Description] Rethinking Disparity: A Depth Range Free Multi-View Stereo Based on Disparity (AAAI 2023)

1. Brief description of the paper

1. First author: Qingsong Yan

2. Year of publication: 2023

3. Published in: AAAI (conference)

4. Keywords: MVS, 3D reconstruction, parallax, epipolar line, flow

5. Exploration motivation: Existing learning-based multi-view stereo (MVS) methods rely on a depth range to construct a three-dimensional cost volume, which may fail when the depth range is too large or unreliable. The depth range determines the three-dimensional distribution of the cost volume that the network tries to fit, and the size of the cost volume is limited by compute and memory, so these methods easily overfit the configured depth range. As the figure below shows, the performance of two state-of-the-art methods, GBiNet and IterMVS, drops sharply when the depth range is enlarged, because they cannot capture enough matching information with a fixed number of depth values.

6. Work goal: Solve the above problems through epipolar parallax.

7. Core idea: We propose a new MVS pipeline that allows the CNN to focus only on the matching problem between two different views and relies on multi-view geometry to recover the depth from the matching results.

  1. Instead of constructing a 3D cost volume, this paper constructs only a 2D cost volume to match pixels within each pair and generates the depth map by triangulation. In other words, DispMVS exploits multi-view geometry to reduce the burden on the network and does not rely on a computationally expensive 3D cost volume to estimate depth.
  2. We redesign the flow to handle multi-view stereo matching without applying stereo rectification to each pair. First, we propose the epipolar disparity flow (E-flow) and reveal the relationship between E-flow and depth. Then we extend E-flow from the two-view to the multi-view case and iteratively update the E-flow with a fused depth to maintain multi-view consistency.

8. Experimental results:

DispMVS achieves state-of-the-art results on the DTUMVS and Tanks&Temples datasets without a 3D cost volume, demonstrating the effectiveness of combining multi-view geometry with a learning-based approach.

9. Paper download:

https://arxiv.org/pdf/2211.16905.pdf

2. Implementation process

1. Flow and Depth

Given a reference view r and a source view s, with intrinsic matrices Kr, Ks and relative extrinsic parameters [Rs, Ts], define dr, ds as the depth maps and frs, fsr as the flow of each view. Assuming the scene is static, depth can be converted to flow and vice versa based on multi-view geometry.

Depth. To describe the three-dimensional shape of a scene, pixels on the image plane can be reprojected into three-dimensional space. Equation 1 reprojects the pixel pr in r to the 3D point Ppr using its depth dr(pr), where p̃r is the homogeneous representation of pr. Ppr can also be projected into s through Equation 2.
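For reference, the standard back-projection and projection relations that Equations 1 and 2 describe can be written as follows; this is a reconstruction from the surrounding text, so the symbols may not match the paper's exact numbering:

$$
P_{p_r} = d_r(p_r)\, K_r^{-1}\, \tilde{p}_r, \qquad
d_s(p_s)\, \tilde{p}_s = K_s \left( R_s\, P_{p_r} + T_s \right),
$$

where $\tilde{p}$ denotes homogeneous pixel coordinates and $[R_s, T_s]$ is the relative pose from r to s.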

Flow. Flow describes the movement of pixels on the image plane between two images. For a matching pixel pair, pr in r and ps in s, the flow frs is calculated by Equation 3. In general, flow does not need to obey geometric constraints and has two degrees of freedom.

Depth to flow. Equation 4 shows how to convert dr(pr) to frs(pr): first reproject pr to Ppr through dr(pr), then project Ppr into s to obtain the matching pixel ps, and finally compute frs(pr) via Equation 3.
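As a concrete illustration of this depth-to-flow conversion, here is a minimal NumPy sketch of the standard reproject-and-project chain; the function and argument names are ours, not the paper's code:

```python
import numpy as np

def depth_to_flow(p_r, d_r, K_r, K_s, R_s, T_s):
    """Convert the depth of a reference pixel into its flow toward the source view.

    p_r      : (2,) pixel coordinate (x, y) in the reference image
    d_r      : scalar depth of that pixel
    K_r, K_s : (3, 3) intrinsic matrices
    R_s, T_s : (3, 3) rotation and (3,) translation from the reference to the source frame
    """
    p_r_h = np.array([p_r[0], p_r[1], 1.0])      # homogeneous pixel coordinate
    P = d_r * (np.linalg.inv(K_r) @ p_r_h)       # back-project into 3D (reference camera frame)
    q = K_s @ (R_s @ P + T_s)                    # project the 3D point into the source camera
    p_s = q[:2] / q[2]                           # perspective division gives the matching pixel
    return p_s - np.asarray(p_r, dtype=float)    # flow frs(pr) = ps - pr
```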

Flow to depth. While triangulation is a straightforward way to convert frs to dr, it requires solving a homogeneous linear system that is not friendly to differentiation. Therefore, a differentiable closed-form solution is used to compute depth, even though it is not optimal. Given pr and frs(pr), ps can be obtained from Equation 3. Based on multi-view geometric consistency, the constraints in Equation 5 are obtained.

It can be seen from Equation 6 that there are two ways to compute the depth, d^x_r(pr) and d^y_r(pr), because frs(pr) is a two-dimensional vector with an x-component f^x_rs(pr) and a y-component f^y_rs(pr). In theory, d^x_r(pr) equals d^y_r(pr). However, the smaller flow component is numerically unstable and introduces noise into the triangulation. Therefore, the depth triangulated from the larger flow component is chosen according to Equation 7.
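The closed-form choice between the two triangulated depths can be sketched as below. This is our own derivation of the standard relation the text refers to (Equations 5-7), not the paper's released code:

```python
import numpy as np

def flow_to_depth(p_r, flow, K_r, K_s, R_s, T_s):
    """Recover the reference depth from a 2D flow vector in closed form.

    With a = Ks @ Rs @ inv(Kr) @ pr_h and b = Ks @ Ts, the projection of the
    3D point satisfies  d_s * [u_s, v_s, 1]^T = d_r * a + b.  Eliminating d_s
    via the third row gives one depth estimate per flow component.
    """
    p_s = np.asarray(p_r, dtype=float) + np.asarray(flow, dtype=float)
    a = K_s @ R_s @ np.linalg.inv(K_r) @ np.array([p_r[0], p_r[1], 1.0])
    b = K_s @ T_s
    d_x = (b[0] - b[2] * p_s[0]) / (a[2] * p_s[0] - a[0])   # depth from the x component
    d_y = (b[1] - b[2] * p_s[1]) / (a[2] * p_s[1] - a[1])   # depth from the y component
    # keep the depth triangulated from the larger flow component (numerically more stable)
    return d_x if abs(flow[0]) >= abs(flow[1]) else d_y
```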

2. E-flow: epipolar disparity flow

Flow describes the motion of pixels on the image plane but does not obey epipolar geometry, which introduces ambiguity when triangulating depth. Therefore, epipolar geometry is used to constrain the flow, and the epipolar disparity flow (E-flow) is defined by Equation 8, where edir is the normalized direction vector of the epipolar line and · is the dot product.

E-flow and flow. Unlike flow, E-flow is a scalar and moves only along the epipolar line. In static scenes, flow and E-flow are two different ways to describe pixel motion, and their relationship is shown in Equation 9.
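Written out, the definition in Equation 8 and the relationship in Equation 9 amount to the following (reconstructed from the description above, not copied verbatim from the paper):

$$
e_{rs}(p_r) = f_{rs}(p_r) \cdot e_{dir}, \qquad
f_{rs}(p_r) = e_{rs}(p_r)\, e_{dir},
$$

where the second equality holds because, in a static scene, the true motion of pr is constrained to the epipolar line, so the scalar E-flow together with the epipolar direction fully determines the flow.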

The figure below shows the relationship between the flow frs(pr) and the E-flow ers(pr) of pixel pr; ps is drawn in r to visualize both. dpr is the depth of pr, and frs(pr) and ers(pr) both describe how pr moves. In static scenes, dpr, frs(pr), and ers(pr) are interrelated and can be converted into each other based on multi-view geometry.

E-flow and depth. Given the relationship between E-flow and flow, and between flow and depth, E-flow can be converted to depth by Equation 10, and vice versa.

3. DispMVS

Given a reference image r and N source images si, DispMVS first extracts features from all input images and then iteratively updates the depth from a random initialization. DispMVS estimates the E-flow of each pair separately through a 2D cost volume, converts each E-flow into a depth, and fuses the depths by a weighted sum for multi-view consistency. The figure below shows the first iteration of DispMVS in the coarse stage.

Feature extraction. Following RAFT, feature extraction uses two encoders with the same structure. One encoder extracts matching features from the input images to compute similarity, while the other extracts contextual features from the reference image for iteratively updating the E-flow. Because a coarse-to-fine strategy is adopted to improve efficiency and accuracy, these encoders use a UNet structure to extract coarse features cr, csi at 1/16 resolution and fine features fr, fsi at 1/4 resolution. In addition, deformable convolutions are used in the decoder to capture valuable information.

Initialization of E-flow. DispMVS relies on E-flow to estimate depth and therefore requires an initial depth as a starting point. It adopts the initialization strategy of PatchMatch and uses Equation 11 to initialize dr,0, where rand is drawn from a normal distribution and (dmin, dmax) are the lower and upper bounds of the depth range. Given dr,0, DispMVS initializes the E-flow through Equation 10; the subscript 0 denotes the first iteration.

E-flow estimation. After initialization, DispMVS estimates ersi for each pair (r, si) with a GRU module. For a pixel pr with coarse feature cr(pr), DispMVS uses ersi(pr) to find the matching point psi on si and samples features from csi(psi). To enlarge the receptive field, DispMVS applies ms average-pooling operations on csi to generate features at different scales and evenly samples mp points along the epipolar line around psi, with a sampling distance of 1 pixel at each scale. DispMVS then computes the similarity between cr(pr) and the mp×ms sampled features through Equation 12 to construct a 2D cost volume. Finally, DispMVS feeds the cost volume and ersi(pr) into the GRU module to estimate the updated ersi(pr) and a weight wsi(pr). In DispMVS, the coarse stage sets ms=4, mp=9, and the fine stage sets ms=2, mp=5.
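The following PyTorch sketch illustrates how such a 2D cost could be built for a single reference pixel: sample mp points one pixel apart along the epipolar line at each pooled scale and take dot-product similarities. The function name, the pyramid handling, and the exact similarity of Equation 12 are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def epipolar_cost(c_r, c_s_pyramid, p_s, e_dir, m_p):
    """Build a 2D cost vector for one reference pixel against one source view.

    c_r         : (C,) reference feature at pr
    c_s_pyramid : list of (C, H, W) source feature maps (ms average-pooled scales)
    p_s         : (2,) current match estimate (x, y) on the source image
    e_dir       : (2,) normalized epipolar direction in the source image
    m_p         : number of samples along the epipolar line per scale
    """
    offsets = torch.arange(m_p, dtype=torch.float32) - (m_p - 1) / 2   # centered, 1 px apart
    costs = []
    for c_s in c_s_pyramid:
        _, H, W = c_s.shape
        pts = p_s[None, :] + offsets[:, None] * e_dir[None, :]         # (m_p, 2) sample locations
        grid = torch.stack([pts[:, 0] / (W - 1), pts[:, 1] / (H - 1)], dim=-1) * 2 - 1
        feats = F.grid_sample(c_s[None], grid[None, None], align_corners=True)  # (1, C, 1, m_p)
        costs.append(feats[0, :, 0].t() @ c_r)                          # (m_p,) dot-product similarity
    return torch.cat(costs)                                             # (ms * m_p,) cost fed to the GRU
```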

The figure below compares the two sampling spaces. (a) is the sampling space of existing methods, which sample uniformly in 3D space to construct 3D cost volumes. (b) is the sampling space of DispMVS, which samples uniformly on each source image plane and estimates depth through triangulation; its distribution is determined by the relative pose of each source image.

Multi-view E-flow estimation. Since E-flow is defined only between two views, DispMVS uses a weighted sum to extend it to the multi-view setting. DispMVS converts each ersi to drsi through Equation 10 and then fuses them into dr through Equation 13, where the weights wsi are normalized by a softmax. Note that DispMVS reconstructs the depth iteratively, which means there are multiple conversions between depth and E-flow, further improving multi-view consistency.
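A minimal sketch of this softmax-weighted fusion step (Equation 13 in spirit; the tensor shapes are an assumption):

```python
import torch

def fuse_depths(depths, weights):
    """Fuse per-source-view depths into a single reference depth map.

    depths  : (N, H, W) depth triangulated from the E-flow of each source view
    weights : (N, H, W) per-view confidence predicted by the GRU
    """
    w = torch.softmax(weights, dim=0)   # normalize the weights across the N source views
    return (w * depths).sum(dim=0)      # weighted sum per pixel
```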

Coarse-to-fine strategy. DispMVS iterates tc times in the coarse stage using cr and csi, and tf times in the fine stage using fr and fsi. DispMVS starts from a random depth in the coarse stage and then upsamples the coarse depth into the fine stage for further refinement. In general, DispMVS uses a larger tc and a smaller tf to balance efficiency and accuracy.

Loss function. Since DispMVS outputs a depth map at every iteration, an L1 loss is computed over all depth maps to speed up convergence. To improve training stability, the ground-truth depth (gt) and the predicted depths dr,i are normalized using the inverse depth range, with γ = 0.9.
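A plausible reconstruction of this loss, in the RAFT-style exponentially weighted form such iterative losses usually take (an assumption, not the paper's verbatim equation):

$$
\mathcal{L} = \sum_{i=1}^{T} \gamma^{\,T-i}\, \big\lVert \hat{d}_{gt} - \hat{d}_{r,i} \big\rVert_1 , \qquad \gamma = 0.9,
$$

where T is the total number of iterations and $\hat{d}$ denotes depths normalized by the inverse depth range.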

4. Experiment

4.1.  Implementation details

The training process is completed on two V100s.

4.2. Comparison with state-of-the-art methods

5. Limitations

Since DispMVS must repeatedly construct a two-dimensional cost volume during the iterative process, its computational efficiency is relatively low. In experiments, DispMVS takes approximately 0.7 seconds per view on DTUMVS, whereas IterMVS takes only about 0.3 seconds per view, so DispMVS would benefit from a more efficient epipolar matching module. In addition, DispMVS requires approximately 48 GB of GPU memory during training, because it runs the GRU module for multiple iterations to update the depth and must keep all gradients and intermediate results.

Source: https://blog.csdn.net/qq_43307074/article/details/131858829