Accurate, Dense, and Robust Multiview Stereopsis (PMVS)

Abstract

This paper proposes a novel algorithm for multiview stereopsis that outputs a dense set of small rectangular patches covering the surfaces visible in the images. Stereopsis is implemented as a match, expand, and filter procedure, starting from a sparse set of matched keypoints, and repeatedly expanding these before using visibility constraints to filter away false matches. The keys to the performance of the proposed algorithm are effective techniques for enforcing local photometric consistency and global visibility constraints. Simple but effective methods are also proposed to turn the resulting patch model into a mesh which can be further refined by an algorithm that enforces both photometric consistency and regularization constraints. The proposed approach automatically detects and discards outliers and obstacles and does not require any initialization in the form of a visual hull, a bounding box, or valid depth ranges. We have tested our algorithm on various data sets including objects with fine surface details, deep concavities, and thin structures, outdoor scenes observed from a restricted set of viewpoints, and “crowded” scenes where moving obstacles appear in front of a static structure of interest. A quantitative evaluation on the Middlebury benchmark [1] shows that the proposed method outperforms all others submitted so far for four out of the six data sets.

1 INTRODUCTION

Multiview stereo (MVS) matching and reconstruction is a key ingredient in the automated acquisition of geometric object and scene models from multiple photographs or video clips, a process known as image-based modeling or 3D photography. Potential applications range from the construction of realistic object models for the film, television, and video game industries, to the quantitative recovery of metric information (metrology) for scientific and engineering data analysis. According to a recent survey provided by Seitz et al. [2], state-of-the-art MVS algorithms achieve relative accuracy better than 1/200 (1 mm for a 20 cm wide object) from a set of low-resolution (640 × 480) images. They can be roughly classified into four classes according to the underlying object models. Voxel-based approaches [3], [4], [5], [6], [7], [8], [9] require knowing a bounding box that contains the scene, and their accuracy is limited by the resolution of the voxel grid. Algorithms based on deformable polygonal meshes [10], [11], [12] demand a good starting point (for example, a visual hull model [13]) to initialize the corresponding optimization process, which limits their applicability. Approaches based on multiple depth maps [14], [15], [16] are more flexible, but require fusing individual depth maps into a single 3D model. Finally, patch-based methods [17], [18] represent scene surfaces by collections of small patches (or surfels, i.e., surface elements). They are simple and effective and often suffice for visualization purposes via point-based rendering techniques [19], but require a postprocessing step to turn them into a mesh model that is more suitable for image-based modeling applications.

MVS algorithms can also be thought of in terms of the data sets they can handle, for example, images of

  1. objects, where a single, compact object is usually fully visible in a set of uncluttered images taken from all around it, and it is relatively straightforward to extract the apparent contours of the object and
    compute its visual hull;
  2. scenes, where the target object(s) may be partially occluded and/or embedded in clutter and the range of viewpoints may be severely limited, preventing the computation of effective bounding volumes
    (typical examples are outdoor scenes with buildings, vegetation, etc.); and
  3. crowded scenes, where moving obstacles appear in different places in multiple images of a static
    structure of interest (e.g., people passing in front of a building).

The underlying object model is an important factor in determining the flexibility of an approach, and voxel-based or polygonal mesh-based methods are often limited to object data sets, for which it is relatively easy to estimate an initial bounding volume or often possible to compute a visual hull model. Algorithms based on multiple depth maps and collections of small surface patches are better suited to the more challenging scene data sets.
Crowded scenes are even more difficult. Strecha et al. [15] use expectation maximization and multiple depth maps to reconstruct a crowded scene despite the presence of occluders, but their approach is limited to a small number of images (typically three) as the complexity of their model is exponential in the number of input images. Goesele et al. [21] have also proposed an algorithm to handle Internet photo collections containing obstacles and produce impressive results with a clever view selection scheme.

In this paper, we take a hybrid approach that is applicable to all three types of input data. More concretely, we first propose a flexible patch-based MVS algorithm that outputs a dense collection of small oriented rectangular patches, obtained from pixel-level correspondences and tightly covering the observed surfaces except in small textureless or occluded regions. The proposed algorithm consists of a simple match, expand, and filter procedure (Fig. 1): 1) Matching: Features found by Harris and difference-of-Gaussians operators are first matched across multiple pictures, yielding a sparse set of patches associated with salient image regions. Given these initial matches, the following two steps are repeated n times (n = 3 in all our experiments). 2) Expansion: A technique similar to [17], [18], [22], [23], [24] is used to spread the initial matches to nearby pixels and obtain a dense set of patches. 3) Filtering: Visibility (and a weak form of regularization) constraints are then used to eliminate incorrect matches.

Fig. 1. (1) A sample input image; (2) detected features; (3) reconstructed patches after the initial matching; (4) final patches after expansion and filtering; (5) the mesh model.

Although our patch-based algorithm is similar to the method proposed by Lhuillier and Quan [17], it replaces their greedy expansion procedure by iteration between expansion and filtering steps, which allows us to process complicated surfaces and reject outliers more effectively. Optionally, the resulting patch model can be turned into a triangulated mesh by simple but efficient techniques, and this mesh can be further refined by a mesh-based MVS algorithm that enforces the photometric consistency with regularization constraints. The additional computational cost of the optional step is balanced by the even higher accuracy it affords. Our algorithm does not require any initialization in the form of a visual hull model, a bounding box, or valid depth ranges. In addition, unlike many other methods that basically assume fronto-parallel surfaces and only estimate the depth of recovered points, it actually estimates the surface orientation while enforcing the local photometric consistency, which is important in practice to obtain accurate models for data sets with sparse input images or without salient textures.
As shown by our experiments, the proposed algorithm effectively handles the three types of data mentioned above, and, in particular, it outputs accurate object and scene models with fine surface detail despite low-texture regions, large concavities, and/or thin, high-curvature parts. A quantitative evaluation on the Middlebury benchmark [1] shows that the proposed method outperforms all others submitted so far in terms of both accuracy and completeness for four out of the six data sets.
The rest of this paper is organized as follows: Section 2 presents the key building blocks of the proposed approach. Section 3 presents our patch-based MVS algorithm, and Section 4 describes how to convert a patch model into a mesh and our polygonal mesh-based refinement algorithm. Experimental results and discussion are given in Section 5, and Section 6 concludes the paper with some future work. The implementation of the patch-based MVS algorithm (PMVS) is publicly available at [25]. A preliminary version of this paper appeared in [26].
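
The iteration structure of the match, expand, and filter procedure described in this introduction can be sketched in a few lines of Python. This is only an illustration of the control flow, not the PMVS implementation: the names `match_expand_filter`, `expand`, and `filter_patches` are hypothetical, and the toy example uses integer cells on a line in place of real oriented patches.

```python
from typing import Callable, Iterable, List, TypeVar

Patch = TypeVar("Patch")


def match_expand_filter(
    initial_matches: Iterable[Patch],
    expand: Callable[[List[Patch]], List[Patch]],
    filter_patches: Callable[[List[Patch]], List[Patch]],
    n_iterations: int = 3,  # the text reports n = 3 in all experiments
) -> List[Patch]:
    """Alternate expansion and filtering, starting from sparse matches.

    `expand` is assumed to return only the newly created patches, and
    `filter_patches` is assumed to return the patches that survive.
    """
    patches = list(initial_matches)
    for _ in range(n_iterations):
        patches = patches + expand(patches)   # densify around existing patches
        patches = filter_patches(patches)     # discard inconsistent patches
    return patches


# Toy stand-ins (assumptions for illustration only): a "patch" is an integer
# cell, expansion proposes free neighbouring cells, and filtering keeps only
# cells that lie on the "true surface" 2..6.
def toy_expand(cells: List[int]) -> List[int]:
    occupied = set(cells)
    new_cells = []
    for c in cells:
        for neighbour in (c - 1, c + 1):
            if neighbour not in occupied:
                occupied.add(neighbour)
                new_cells.append(neighbour)
    return new_cells


def toy_filter(cells: List[int]) -> List[int]:
    return [c for c in cells if 2 <= c <= 6]


result = match_expand_filter([4], toy_expand, toy_filter)
print(sorted(result))
```

Passing the two steps in as callables keeps the loop itself independent of how expansion and filtering are actually carried out, which is the part the later sections of the paper specify.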

2 KEY ELEMENTS OF THE PROPOSED APPROACH

The proposed approach can be decomposed into three steps: a patch-based MVS algorithm that is the core reconstruction step in our approach and reconstructs a set of oriented points (or patches) covering the surface of an object or a scene of interest; the conversion of the patches into a polygonal mesh model; and finally a polygonal mesh-based MVS algorithm that refines the mesh. In this section, we introduce a couple of fundamental building blocks of the patch-based MVS algorithm, some of which are also used in our mesh refinement algorithm.

2.1 Patch Model

A patch p is essentially a local tangent plane approximation of a surface. Its geometry is fully determined by its center c(p), unit normal vector n(p) oriented toward the cameras observing it, and a reference image R(p) in which p is visible (see Fig. 2). More concretely, a patch is a (3D) rectangle, which is oriented so that one of its edges is parallel to the x-axis of the reference camera (the camera associated with R(p)). The extent of the rectangle is chosen so that the smallest axis-aligned square in R(p) containing its image projection is of size u × u pixels (u is either 5 or 7 in all of our experiments).
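
As a rough illustration of this definition, the sketch below builds the rectangle of a patch from c(p), n(p), and a reference camera. It is a minimal sketch under stated assumptions, not the authors' code: the `Camera` class is a hypothetical ideal pinhole model with no principal point offset, the in-plane x-edge is taken as the reference camera's x-axis projected onto the tangent plane, and the edge length is set with a local linear approximation of the projection so that the rectangle covers about u × u pixels.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def scale(a: Vec3, s: float) -> Vec3:
    return (a[0] * s, a[1] * s, a[2] * s)


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a: Vec3) -> Vec3:
    length = math.sqrt(dot(a, a))
    if length < 1e-12:
        raise ValueError("cannot normalize a zero-length vector")
    return scale(a, 1.0 / length)


@dataclass
class Camera:  # hypothetical ideal pinhole camera
    center: Vec3
    x_axis: Vec3   # unit image x direction, in world coordinates
    y_axis: Vec3   # unit image y direction, in world coordinates
    z_axis: Vec3   # unit viewing direction, in world coordinates
    focal: float   # focal length in pixels

    def project(self, point: Vec3) -> Tuple[float, float]:
        d = sub(point, self.center)
        depth = dot(d, self.z_axis)
        return (self.focal * dot(d, self.x_axis) / depth,
                self.focal * dot(d, self.y_axis) / depth)


@dataclass
class Patch:
    center: Vec3     # c(p)
    normal: Vec3     # n(p), pointing toward the observing cameras
    ref_image: int   # index of the reference image R(p)


def patch_axes(patch: Patch, cam: Camera) -> Tuple[Vec3, Vec3]:
    """In-plane edge vectors whose length projects to about one pixel."""
    n = normalize(patch.normal)
    # Camera x-axis with its normal component removed lies in the tangent plane.
    x_dir = normalize(sub(cam.x_axis, scale(n, dot(cam.x_axis, n))))
    y_dir = cross(n, x_dir)
    c2d = cam.project(patch.center)
    dist = math.sqrt(dot(sub(patch.center, cam.center),
                         sub(patch.center, cam.center)))
    eps = 1e-3 * dist  # small step for a local estimate of pixels per unit

    def one_pixel(direction: Vec3) -> Vec3:
        q = cam.project(add(patch.center, scale(direction, eps)))
        pixels_per_unit = math.hypot(q[0] - c2d[0], q[1] - c2d[1]) / eps
        return scale(direction, 1.0 / pixels_per_unit)

    return one_pixel(x_dir), one_pixel(y_dir)


def patch_corners(patch: Patch, cam: Camera, u: int = 5) -> List[Vec3]:
    """Four 3D corners of a rectangle covering roughly u x u pixels."""
    px, py = patch_axes(patch, cam)
    h = u / 2.0
    return [add(patch.center, add(scale(px, sx * h), scale(py, sy * h)))
            for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1))]


cam = Camera((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
             (0.0, 0.0, 1.0), 500.0)
patch = Patch(center=(0.0, 0.0, 10.0), normal=(0.0, 0.0, -1.0), ref_image=0)
for corner in patch_corners(patch, cam, u=5):
    print(cam.project(corner))
```

For a fronto-parallel patch the projected rectangle is exactly u pixels wide under this model; for strongly slanted patches perspective makes the footprint only approximately u × u, so a real implementation would need to size the rectangle more carefully.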

Origin blog.csdn.net/Vpn_zc/article/details/108942531