[Paper Description] Multi-View Guided Multi-View Stereo (IROS 2022)

1. Brief description of the paper

1. First authors: Matteo Poggi, Andrea Conti

2. Year of publication: 2022

3. Published at: IROS (conference)

4. Keywords: depth estimation, 3D reconstruction, deep learning, depth hints

5. Exploration motivation: Despite the more robust feature representations extracted by 2D CNNs and the strong regularization achieved through 3D convolutions, the demanding computational requirements still limit the full deployment of such solutions, often forcing a trade-off between accuracy and complexity: for instance, inferring depth at a resolution lower than that of the input images, or implementing coarse-to-fine strategies. Moreover, several long-standing challenges, such as dealing with untextured regions, thin objects, or occlusions, remain open.

6. Objectives of the work: Most of the challenges mentioned so far are inherent to the image domain itself. Thus, their impact could be significantly softened given additional information from a different modality, for instance a sparse set of depth measurements perceived by an active sensor. Nowadays, such sensors are readily available as standalone off-the-shelf devices and are increasingly integrated into consumer products like mobile phones and tablets (e.g., Apple iPhones and iPads). However, despite their ever-increasing diffusion, they often provide only sparse depth data, i.e., at a much lower resolution than standard cameras.

7. Core idea: A guided multi-view stereo depth estimation framework is proposed:

  1. Multi-view guided multi-view stereo depth estimation: assuming the availability of a sparse set of depth measurements acquired together with the images, we modulate the cost volume built by any state-of-the-art MVS network to provide stronger guidance to the architecture towards inferring more accurate depth maps.
  2. We introduce coarse-to-fine guidance by applying the cost volume modulation multiple times during the forward pass, consistently with the coarse-to-fine strategy followed by recent MVS networks.
  3. We implement the proposed mvgMVS framework within five state-of-the-art deep architectures, each characterized by different regularization and optimization strategies.

8. Experimental results:

To validate this claim, we run an exhaustive set of experiments, training a variety of state-of-the-art MVS architectures and their guided counterparts on the BlendedMVG and DTU datasets and assessing their accuracy. The results show that our framework consistently boosts the accuracy achievable with any of the considered deep networks, in terms of both depth map estimation and overall 3D reconstruction, when guidance is available.

9. Paper and code download:

Multi-View Guided Multi-View Stereo

https://github.com/andreaconti/multi-view-guided-multi-view-stereo

2. Implementation process

1. Deep Multi-View Stereo background

Most learning-based MVS pipelines follow the same pattern. Given a set of N images, one serving as the reference image and the other N−1 as source images, a deep MVS network processes them to predict a dense depth map aligned with the reference image. Most deep networks designed for this purpose build a cost volume that encodes the feature similarity between pixels in the reference image and potential matching candidates in the source images. Given the intrinsic and extrinsic parameters K, E of each image, the candidates are retrieved along the epipolar lines in the source views. Specifically, for a given depth hypothesis z ∈ [z_min, z_max], the features F_i extracted from a source view i are warped through a homography-based operation π.
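A standard MVSNet-style form of this warping, assuming the reference camera as the world origin (so that $E_0 = [I\,|\,0]$ and $E_i = [R_i\,|\,t_i]$), is:

$$
\mathbf{F}_i^{z}(q) = \mathbf{F}_i\big(\pi_i(q, z)\big), \qquad \pi_i(q, z) \sim K_i \left( z \, R_i K_0^{-1} \hat{q} + t_i \right)
$$

where $\hat{q}$ denotes pixel $q$ in homogeneous coordinates and $\sim$ indicates equality up to dehomogenization.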

Then, in order to encode the similarity between the reference features F_0 and the warped source features F_i^z, a variance-based volume is defined.
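In its standard MVSNet-style form, with $\mathbf{F}_0^{z} = \mathbf{F}_0$ since the reference features need no warping, this volume reads:

$$
\mathbf{V}(z) = \frac{1}{N} \sum_{i=0}^{N-1} \left( \mathbf{F}_i^{z} - \overline{\mathbf{F}^{z}} \right)^2, \qquad \overline{\mathbf{F}^{z}} = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{F}_i^{z}
$$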

Therefore, for a given pixel, the lower the variance score, the more similar the features retrieved from the source views, and thus the more likely it is that the hypothesis z is its correct depth. However, computing this volume at full resolution is memory-hungry and computationally expensive, so several state-of-the-art networks implement coarse-to-fine solutions, building a set of variance-based cost volumes, one per scale.
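By analogy with the full-resolution volume above, these per-scale volumes take the form:

$$
\mathbf{V}^{s}(z) = \frac{1}{N} \sum_{i=0}^{N-1} \left( \mathbf{F}_i^{z,s} - \overline{\mathbf{F}^{z,s}} \right)^2
$$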

Here, s is the resolution, or scale, at which the cost volume is computed. Correspondingly, the features from image i are extracted and warped at resolution s.
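A plausible per-scale counterpart of the warping above, written by analogy, is:

$$
\mathbf{F}_i^{z,s}(q) = \mathbf{F}_i^{s}\big(\pi_i^{s}(q, z)\big)
$$

with $\mathbf{F}_i^{s}$ the features extracted at scale s and $\pi_i^{s}$ the warping computed with the intrinsics rescaled accordingly.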

2. Guided Multi-View Stereo

By assuming a setup consisting of a standard camera and a low-resolution depth sensor (e.g., a LiDAR), we exploit the output of the latter to shape the behavior of a depth network estimating depth from a set of color images. When the set is limited to a single frame, a neural network is typically trained to complete the sparse depth points guided by the color image. When multiple images are available, this mechanism is often reversed and the depth measurements are used as hints to guide the image-based estimation process. For instance, the Guided Stereo framework implements this strategy for binocular stereo by applying a Gaussian modulation to the feature volume so that it peaks in correspondence of the depth hint z*.
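For reference, the Guided Stereo modulation has the following general shape (notation adapted to this post):

$$
\mathcal{G}(d) = (1 - v) + v \cdot k \cdot e^{-\frac{(d - z^{*})^2}{2c^2}}
$$

applied multiplicatively, so that the volume is enhanced where the disparity hypothesis d agrees with the hint z*.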

Similarly, this mechanism can be applied to MVS to implement Guided Multi-View Stereo (gMVS): the variance can be modulated in an analogous way. In this case, since a low variance encodes a high likelihood that the corresponding depth hypothesis z is correct, the Gaussian curve is flipped so as to force the variance-based cost volume to have a minimum in correspondence of the depth hint z*.
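One plausible flipped modulation consistent with this description, written here as an assumption rather than the paper's exact equation, leaves the variance untouched at the hint and amplifies it everywhere else:

$$
\mathbf{V}'(z) = \Big[ (1 - v) + v \cdot \big( 1 + k \cdot (1 - e^{-\frac{(z - z^{*})^2}{2c^2}}) \big) \Big] \cdot \mathbf{V}(z)
$$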

Here, v is a binary mask equal to 1 for pixels with a valid hint (0 otherwise), and k and c are the magnitude and width of the Gaussian itself. The gMVS formulation outlined so far extends the Guided Stereo framework to MVS.
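As a concrete illustration, here is a minimal PyTorch sketch of this kind of variance modulation. The function name, tensor layout, and the flipped-Gaussian expression are assumptions made for illustration, not the authors' released code (see the GitHub repository linked above for that):

```python
import torch

def modulate_variance_volume(volume, depth_values, hints, validhints, k=10.0, c=0.01):
    """Apply a flipped-Gaussian modulation to a variance-based cost volume.

    volume:       (B, D, H, W) variance cost volume (lower = better match)
    depth_values: (B, D) depth hypotheses z
    hints:        (B, H, W) sparse depth hints z* (arbitrary where invalid)
    validhints:   (B, H, W) binary validity mask v
    """
    B, D, H, W = volume.shape
    z = depth_values.view(B, D, 1, 1)        # hypotheses, broadcast over pixels
    z_star = hints.unsqueeze(1)              # (B, 1, H, W)
    v = validhints.unsqueeze(1).float()      # (B, 1, H, W)
    gauss = torch.exp(-((z - z_star) ** 2) / (2 * c ** 2))
    # flipped Gaussian: leave the volume untouched at the hint,
    # amplify the variance (i.e., the cost) elsewhere for guided pixels
    factor = (1 - v) + v * (1 + k * (1 - gauss))
    return volume * factor
```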

3. Multi-View Guided Multi-View Stereo

MVS relies on the availability of multiple images acquired from different viewpoints. Here, we further assume that sparse depth measurements registered with the color images are available, so that a different set of hints comes with each source image. In this case, we argue that aggregating the multiple sets of depth hints from all viewpoints can provide stronger guidance to the network and further improve over the baseline gMVS framework. To do so, two main steps are performed.

Depth hints aggregation. Given a pixel with homogeneous 2D coordinates q_i in any source image i ∈ [1, N−1], for which a depth hint d*_{q_i} is available, its 3D coordinates p_0 in the reference viewpoint are computed.

A new depth hint d*_{q_0}, expressed in the reference viewpoint, is then obtained from p_0 by projecting it onto the reference image plane according to K_0, at coordinates q_0.
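A standard reconstruction of these two steps, assuming the extrinsics E map world coordinates to camera coordinates, is:

$$
p_0 = \left( E_0 \, E_i^{-1} \begin{bmatrix} d^{*}_{q_i} K_i^{-1} \hat{q}_i \\ 1 \end{bmatrix} \right)_{1:3}, \qquad d^{*}_{q_0} \, \hat{q}_0 = K_0 \, p_0
$$

i.e., the source pixel is back-projected to 3D using its hint depth, moved into the reference camera frame, and projected with K_0; the third component of $K_0 \, p_0$ yields the new hint $d^{*}_{q_0}$.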

This allows aggregating depth hints over the reference image: as shown in the figure, depth hints from many views (left) can be aggregated onto the reference viewpoint (right), resulting in a denser hint map and thus stronger guidance. This extension of the gMVS framework, which modulates the volumes in the network with hints aggregated from all views, is called multi-view guided Multi-View Stereo (mvgMVS).
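A minimal PyTorch sketch of this naive projection step follows (function name and tensor conventions are assumptions; the occlusion filtering discussed next is deliberately omitted):

```python
import torch

def project_hints_to_reference(hints_src, K_src, K_ref, T_ref_src, H, W):
    """Reproject sparse depth hints from a source view into the reference view.

    hints_src: (H, W) sparse depth map of the source view (0 where invalid)
    K_src, K_ref: (3, 3) camera intrinsics
    T_ref_src: (4, 4) rigid transform from source-camera to reference-camera frame
    """
    v, u = torch.nonzero(hints_src > 0, as_tuple=True)
    d = hints_src[v, u]
    ones = torch.ones_like(d)
    pix = torch.stack([u.float(), v.float(), ones], dim=0)      # (3, M) homogeneous pixels
    pts_src = torch.linalg.inv(K_src) @ pix * d                 # back-project with hint depths
    pts_ref = (T_ref_src @ torch.cat([pts_src, ones[None]], 0))[:3]
    proj = K_ref @ pts_ref                                      # project into the reference view
    z = proj[2]
    u0 = torch.round(proj[0] / z).long()
    v0 = torch.round(proj[1] / z).long()
    # keep points landing inside the image with positive depth
    keep = (z > 0) & (u0 >= 0) & (u0 < W) & (v0 >= 0) & (v0 < H)
    out = torch.zeros(H, W)
    out[v0[keep], u0[keep]] = z[keep]                           # new hints d*_{q0}
    return out
```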

Depth hints filtering. Because of the different viewpoints, some depth measurements acquired in one of the source views may belong to areas occluded in the reference view. Given the sparse nature of the hints, naively projecting them into the reference view without reasoning on their visibility would therefore aggregate several erroneous values, as shown in Figure (b).

In this case, the deep network would be guided with incorrect depth hints, reducing its accuracy. To detect and remove these outliers, a filtering strategy is deployed, which labels as an outlier any pixel q0 for which at least one pixel s exists in its neighborhood S(q0) such that:

q0 has changed its relative position with respect to s because it is occluded: this happens if the differences between the pixel coordinates of q0 and s and between their angles (in spherical coordinates) have opposite signs, i.e., if (x_q0 − x_s)(θ_q0 − θ_s) or (y_q0 − y_s)(φ_q0 − φ_s) is negative;

q0 lies at a much larger distance from the camera than s, i.e., d_q0 > d_s + ε, where ε is set according to the specific dataset used.

Although simple, this strategy allows the removal of most outliers at a small computational cost.
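As an illustration, the depth-consistency test can be implemented cheaply with a min-pooling over the hint map. The sketch below covers only the second condition above (the sign-consistency test on spherical angles is omitted, and the helper itself is hypothetical):

```python
import torch
import torch.nn.functional as F

def filter_depth_hints(hints, eps, window=3):
    """Drop hints that are much farther than the closest hint in their
    local neighborhood, a cheap proxy for occlusion (second condition only).

    hints: (H, W) sparse hint map, 0 where invalid
    """
    valid = hints > 0
    # min-pool over the window, ignoring invalid (zero) pixels
    big = torch.where(valid, hints, torch.full_like(hints, float("inf")))
    neigh_min = -F.max_pool2d(-big[None, None], window, stride=1,
                              padding=window // 2)[0, 0]
    keep = valid & (hints <= neigh_min + eps)
    return torch.where(keep, hints, torch.zeros_like(hints))
```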

4. Coarse-to-Fine Guidance

Unlike deep stereo networks, which are usually built around a single cost volume regularized by stacked 3D convolutions, MVS networks are often designed to perform coarse-to-fine estimation to reduce the computational burden. We argue that each of the multiple cost volumes built by the network represents a possible entry point for guidance. Therefore, all the volumes V^s are modulated during the forward pass.
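Under the same assumed flipped-Gaussian form sketched earlier, this per-scale modulation would read:

$$
\mathbf{V}'^{s}(z) = \Big[ (1 - v^{s}) + v^{s} \cdot \big( 1 + k \cdot (1 - e^{-\frac{(z - z^{*s})^2}{2c^2}}) \big) \Big] \cdot \mathbf{V}^{s}(z)
$$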

Here, v^s and z*^s are the binary mask v and the hint map z*, respectively, downsampled to resolution s using nearest-neighbor interpolation.

5. Experiment

5.1. Dataset

Since none of the existing MVS datasets provides sparse depth points, the availability of sparse hints is simulated by randomly sampling 3% of the pixels from the ground-truth depth maps. Consequently, only datasets providing such ground truth can be selected for the experiments; Tanks & Temples, for instance, cannot be evaluated.
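Simulating hints this way is straightforward; a minimal sketch (a hypothetical helper, not the authors' dataloader):

```python
import torch

def sample_hints(gt_depth, density=0.03):
    """Simulate sparse hints by randomly keeping `density` of the valid GT pixels."""
    valid = gt_depth > 0
    keep = (torch.rand_like(gt_depth) < density) & valid
    return torch.where(keep, gt_depth, torch.zeros_like(gt_depth))
```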

5.2. Implementation details

In the PyTorch implementation, we set k = 10 and c = 0.01; the filtering parameter is set to 3. Experiments with gMVS and its variants were conducted on five state-of-the-art networks: MVSNet, D2HC-RMVSNet, CAS-MVSNet, UCSNet, and PatchMatchNet.

During both training and testing, the number of images processed by the network is set to 5; accordingly, depth hints are accumulated from 5 views for mvgMVS. Networks are first trained on BlendedMVG, then fine-tuned on DTU, and tested on the test sets of BlendedMVG and DTU.

5.3. Comparison with the state of the art

6. Limitations

Although the experiments highlight the potential of the multi-view guided multi-view stereo framework, which is effective on both synthetic and real datasets, one limitation may be important in some settings: networks trained with a specific hint density fail to generalize to sparser hints. Specifically, once a guided network has been trained with a fixed density of input depth points, performance degrades if that density is not guaranteed at test time. This behavior was investigated with further experiments on MVSNet, trained with 3% of hints aggregated across views and tested at different densities. Reducing the number of hints degrades performance, which nonetheless remains better than that of the original MVSNet trained without guidance. However, when hints are withheld entirely, performance drops well below the original MVSNet. This highlights that a network trained with hints exploits them almost blindly and loses substantial accuracy when they are unavailable at deployment. Future research will explore better training protocols to minimize the degradation under such conditions. Furthermore, the current evaluation simulates the availability of depth hints from a sensor; further experiments with real sensing devices will allow assessing the robustness of the framework to noise in the sparse depth points.
