[Paper Reading] MVSNet: Depth Inference for Unstructured Multi-view Stereo [2018]

Abstract

Problem addressed: end-to-end depth map inference from multi-view images.

We present an end-to-end deep learning architecture for depth map inference from multi-view images.

Method:

In the network, we first extract deep visual image features, and then build the 3D cost volume upon the reference camera frustum via the differentiable homography warping. Next, we apply 3D convolutions to regularize and regress the initial depth map, which is then refined with the reference image to generate the final output. Our framework flexibly adapts arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature.

  1. Extract deep visual image features
  2. Build the 3D cost volume upon the reference camera frustum via differentiable homography warping
  3. Apply 3D convolutions to regularize and regress the initial depth map
  4. Refine the depth map with the reference image to generate the final output

The framework flexibly accepts an arbitrary number N of input views, using a variance-based metric that maps the multiple feature volumes into one cost feature.

Experimental results:

With simple post-processing, our method not only significantly outperforms previous state-of-the-arts, but also is several times faster in runtime.

Far outperforms the previous state of the art, is several times faster at runtime, and generalizes well.

1. Introduction

What is MVS?

Multi-view stereo (MVS) estimates the dense representation from overlapping images, which is a core problem of computer vision extensively studied for decades.

Traditional methods:

Limitations: low-textured, specular and reflective regions of the scene.
Accuracy is already good; reconstruction completeness still has large room for improvement.

While these methods have shown great results under ideal Lambertian scenarios, they suffer from some common limitations. For example, low-textured, specular and reflective regions of the scene make dense matching intractable and thus lead to incomplete reconstructions.
It is reported in recent MVS benchmarks [1,18] that, although current state-of-the-art algorithms [7,36,8,32] perform very well on the accuracy, the reconstruction completeness still has large room for improvement.

CNN: two-view stereo matching (binocular stereo matching)

Advantage: able to learn global semantic information

Conceptually, the learning-based method can introduce global semantic information such as specular and reflective priors for more robust matching.

In fact, the stereo matching task is perfectly suitable for applying CNN-based methods, as image pairs are rectified in advance and thus the problem becomes the horizontal pixel-wise disparity estimation without bothering with camera parameters.

multi-view stereo

Limitation: arbitrary camera geometries.
Two-view stereo does not extend easily to multi-view stereo, because multi-view input involves arbitrary camera geometries.

Some related work: SurfaceNet [14], Learned Stereo Machine (LSM) [15]

Limitation: volumetric representation of regular grids

However, both the two methods exploit the volumetric representation of regular grids. As restricted by the huge memory consumption of 3D volumes, their networks can hardly be scaled up:

MVSNet

End-to-end; it computes one depth map at a time rather than the whole 3D scene at once.
we propose an end-to-end deep learning architecture for depth map inference, which computes one depth map at each time, rather than the whole 3D scene at once.

Input: one reference image and several source images. Output: the depth map for the reference image.
MVSNet takes one reference image and several source images as input, and infers the depth map for the reference image.

Key operation: the differentiable homography warping.
The key insight here is the differentiable homography warping operation, which implicitly encodes camera geometries in the network to build the 3D cost volumes from 2D image features and enables the end-to-end training.
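
Below is a minimal PyTorch-style sketch of how such a warping step can be implemented (my own illustration following common open-source reimplementations, not the authors' original code; the function name `homo_warp` and the convention that projection matrices are 4×4 matrices with K·[R|t] in the top 3×4 rows are assumptions):

```python
import torch
import torch.nn.functional as F

def homo_warp(src_feat, proj_ref, proj_src, depth_values):
    """Warp a source-view feature map into the reference camera frustum.

    src_feat:     [B, C, H, W]  source-view feature map
    proj_ref:     [B, 4, 4]     reference projection matrix (assumed K @ [R|t], padded to 4x4)
    proj_src:     [B, 4, 4]     source projection matrix (same convention)
    depth_values: [B, D]        depth hypotheses sampled in the reference frustum
    returns:      [B, C, D, H, W] warped source features, one slice per hypothesis
    """
    B, C, H, W = src_feat.shape
    D = depth_values.shape[1]
    device, dtype = src_feat.device, src_feat.dtype

    # relative transform: reference image coordinates (scaled by depth) -> source image coordinates
    proj = proj_src @ torch.inverse(proj_ref)            # [B, 4, 4]
    rot, trans = proj[:, :3, :3], proj[:, :3, 3:]        # [B, 3, 3], [B, 3, 1]

    # homogeneous pixel grid of the reference view
    y, x = torch.meshgrid(
        torch.arange(H, device=device, dtype=dtype),
        torch.arange(W, device=device, dtype=dtype), indexing="ij")
    pix = torch.stack([x, y, torch.ones_like(x)], dim=0).reshape(3, -1)   # [3, H*W]
    pix = pix.unsqueeze(0).expand(B, -1, -1)                              # [B, 3, H*W]

    # back-project every pixel at every depth hypothesis, then project into the source view
    pts = (rot @ pix).unsqueeze(2) * depth_values.view(B, 1, D, 1)        # [B, 3, D, H*W]
    pts = pts + trans.unsqueeze(3)                                        # [B, 3, D, H*W]
    xy = pts[:, :2] / (pts[:, 2:3] + 1e-6)                                # [B, 2, D, H*W]

    # normalize to [-1, 1]; bilinear grid_sample keeps the whole warp differentiable
    gx = xy[:, 0] / ((W - 1) / 2) - 1
    gy = xy[:, 1] / ((H - 1) / 2) - 1
    grid = torch.stack([gx, gy], dim=-1).view(B, D * H, W, 2)

    warped = F.grid_sample(src_feat, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    return warped.view(B, C, D, H, W)
```

Repeating this for each source view and stacking the results together with the (un-warped) reference feature volume gives the N aligned feature volumes that the variance-based metric aggregates.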

To adapt arbitrary number of source images in the input, we propose a variance-based metric that maps multiple features into one cost feature in the volume. This cost volume then undergoes multi-scale 3D convolutions and regress an initial depth map. Finally, the depth map is refined with the reference image to improve the accuracy of boundary areas.
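
A rough sketch of the variance-based cost metric, assuming each view's features have already been warped into the reference frustum as above (shapes and names are mine):

```python
import torch

def variance_cost_volume(feat_volumes):
    """Aggregate N per-view feature volumes into a single cost volume via variance.

    feat_volumes: [N, B, C, D, H, W] -- the reference feature volume plus the N-1
                  warped source feature volumes, all aligned in the reference frustum.
    returns:      [B, C, D, H, W]    -- element-wise variance across the N views;
                  the cost is low where the views agree on a depth hypothesis.
    """
    mean = feat_volumes.mean(dim=0)
    # Var[V] = E[V^2] - (E[V])^2, averaged over the view dimension
    return (feat_volumes ** 2).mean(dim=0) - mean ** 2
```

Because the mean and variance are taken over the view dimension, the same operation works for any number of input views N, which is what lets the network adapt to arbitrary N-view inputs.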

Two major differences from other methods [15, 14] (the contributions): 1. the cost volume is built upon the camera frustum; 2. MVS reconstruction is decoupled into smaller problems of per-view depth map estimation.

There are two major differences between our method and previous learned approaches [15,14].
First, for the purpose of depth map inference, our 3D cost volume is built upon the camera frustum instead of the regular Euclidean space.
Second, our method decouples the MVS reconstruction to smaller problems of per-view depth map estimation, which makes large-scale reconstruction possible.

2. Related work

MVS Reconstruction

According to output representations, MVS methods can be categorized into 1) direct point cloud reconstructions [22,7], 2) volumetric reconstructions [20,33,14,15] and 3) depth map reconstructions [35,3,8,32,38].

  • Direct point cloud reconstructions: hard to parallelize, long processing time
  • Volumetric reconstructions: discretization of space, high memory consumption
  • Depth map reconstructions: most flexible; decomposes the problem per view, and the depth maps are easy to fuse into point clouds or volumetric representations

Learned Stereo.

Learned MVS.

3. MVSNet

3.1 Image Features

3.2 Cost Volume

Differentiable Homography
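
As far as I recall from the paper, the homography between the i-th feature map and the reference feature map at depth $d$ (with $\mathbf{n}_1$ the principal axis of the reference camera) is

$$
\mathbf{H}_i(d) = \mathbf{K}_i \cdot \mathbf{R}_i \cdot \left( \mathbf{I} - \frac{(\mathbf{t}_1 - \mathbf{t}_i)\,\mathbf{n}_1^{T}}{d} \right) \cdot \mathbf{R}_1^{T} \cdot \mathbf{K}_1^{-1}
$$

where $\mathbf{K}$, $\mathbf{R}$, $\mathbf{t}$ are the intrinsics and extrinsics of view $i$ and of the reference view 1. Since the warping reduces to matrix products plus bilinear interpolation, it is differentiable and gradients flow back into the 2D feature network.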

Cost Metric
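
As far as I recall, with $\mathbf{V}_i$ denoting the N warped feature volumes and $\overline{\mathbf{V}}$ their element-wise average, the cost volume is the per-element variance

$$
\mathbf{C} = \frac{1}{N} \sum_{i=1}^{N} \left( \mathbf{V}_i - \overline{\mathbf{V}} \right)^2
$$

which is small where all views see consistent features for a depth hypothesis and is well defined for any number of views N.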

Cost Volume Regularization

3.3 Depth Map

Initial Estimation
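
If I read the paper correctly, the regularized cost volume is converted into a probability volume $\mathbf{P}$ by a softmax along the depth dimension, and the initial depth is taken as the expectation over the depth hypotheses (the soft argmin operation):

$$
\mathbf{D}(p) = \sum_{d = d_{\min}}^{d_{\max}} d \times \mathbf{P}(p, d)
$$

This yields a continuous depth estimate and keeps the whole pipeline differentiable, unlike a hard argmax over $d$.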

Probability Map

Depth Map Refinement

3.4 Loss
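
From memory, both the initial depth map $\hat{d}_i$ and the refined depth map $\hat{d}_r$ are supervised with the ground-truth depth $d$ using the mean absolute (L1) difference over valid pixels, with a weight $\lambda$ balancing the two terms:

$$
Loss = \sum_{p \in \mathbf{p}_{valid}} \left\| d(p) - \hat{d}_i(p) \right\|_1 + \lambda \cdot \left\| d(p) - \hat{d}_r(p) \right\|_1
$$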

Reposted from blog.csdn.net/weixin_43154149/article/details/121555381