[Multi-view Stereo Vision] Introduction to MVSNet papers and evaluation indicators

This article translates and interprets MVSNet, the first deep-learning method for learning-based multi-view stereo, adds some personal understanding, and introduces the principles of the two evaluation metrics used with MVSNet (the distance metric and the percentage metric).

Summary

Propose an end-to-end deep learning architecture: from multi-view images, infer depth maps.

network:

  1. Extract deep visual image features.
  2. Construct a 3D cost volume on the ref camera frustum via differentiable homography warping.
  3. Apply 3D convolutions to regularize the volume and regress an initial depth map; then refine the initial depth map with the reference image to generate the final output.

Innovation:

It flexibly adapts to arbitrary N-view inputs by using a variance-based cost metric that maps multiple feature volumes into a single cost feature.

Trained on DTU and evaluated on Tanks and Temples.

With simple post-processing, MVSNet generates point clouds that are both better and faster to produce than prior methods.

Strong generalization, better results without fine-tuning

1. Introduction

Review

MVS: Estimating dense representations from overlapping images

MVS traditional method:

Dense correspondences are computed with hand-crafted similarity measures and engineered regularization, and 3D points are then recovered; this is computationally intensive.

Limitation: low-texture and specular areas of the scene lead to incomplete reconstructions.

Recent algorithms:

accuracy is good, but completeness still needs improvement.

CNN-based approach:

Learning-based methods can introduce global semantic information (for example, a prior on specular reflections).

[Stereo matching] is well suited to CNN methods: the image pair is pre-rectified, so the problem becomes [disparity estimation] along the horizontal pixel direction, without needing the camera parameters.

[MVS] Unlike stereo matching, the input images come with arbitrary camera geometry, which is a major obstacle for learning-based methods.

SurfaceNet: Build Colored Voxel Cubes in advance, combine all image pixel colors and camera information into a single volume as network input

LSM (Learned Stereo Machine): directly applies differentiable projection/unprojection to achieve end-to-end training and inference.

Both methods use a regular-grid volumetric representation; the huge memory consumption of 3D volumes makes it hard to scale the networks up.

LSM: only handles [low volume resolution] synthetic objects;

SurfaceNet: With a heuristic divide-and-conquer strategy, large-scale reconstruction takes a long time.

The method of this article:

Compute one depth map at a time, instead of reconstructing the entire 3D scene at once.

  1. [a ref image] and [multiple src images] are used as input to infer the depth map of ref.

  2. key insight: differentiable homography warping

This operation implicitly encodes the camera geometry into the network, builds the 3D cost volume from 2D image features, and enables end-to-end training.

Camera geometry: It should refer to the geometric relationship between the camera and the object.

  3. Propose a variance-based metric

that can accommodate any number of src views in the input.

This metric maps multiple feature volumes to one cost feature in the volume.

  4. Apply multi-scale 3D convolutions to the cost volume and regress the initial depth map.

  5. Use the ref image to refine the depth map and improve accuracy in boundary areas.

Different from SurfaceNet and LSM:

  1. For depth map inference, the 3D cost volume in this paper is built on the reference camera frustum instead of a regular Euclidean grid.

    (Figure: schematic diagram of a camera frustum.)

  2. The method of this paper decouples the MVS reconstruction problem into smaller per-view depth map estimation problems, enabling large-scale reconstruction.

Experiments:

Train and evaluate on DTU

After simple post-processing, completeness and the overall score are the best among previous methods.

Generalization ability is verified on Tanks and Temples

with no fine-tuning required.

Running time is faster than previous techniques.

2. Related Work

MVS Reconstruction

The MVS method is divided into:

1) Direct point cloud reconstruction: operates directly on 3D points and relies on a propagation strategy to gradually densify the reconstruction; since propagation is sequential, it is hard to parallelize and slow to process.

2) Volumetric reconstruction: divides 3D space into a regular grid and estimates whether each voxel belongs to the surface; suffers from spatial discretization error and high memory consumption.

3) Depth map reconstruction: the most flexible; it decouples the complex MVS problem into relatively small per-view depth map estimation problems, handling only one reference image and a few source images at a time.

Learned Stereo

Deep learning technology: used to better match pairs of patches

In stereo matching, a cost volume is usually constructed to compute a disparity map, from which the depth map is then derived.

Cost Volume

https://www.zhihu.com/question/366970399/answer/1340892604

The cost volume stores the pixel-wise matching cost.

x: a pixel in the ref img

X_i: the 3D point at a given depth d_i along the viewing direction of the ref camera

x_i: the position where X_i projects onto the matching img

The cost volume records the matching cost (or similarity) between x and each x_i.

The relative pose between the ref img and the matching img is known. A pixel in the ref img determines a search space along its viewing direction, which projects onto the matching img as an epipolar line.

A pixel on the ref img can only correspond to a pixel on the epipolar line in the matching img.

Taking different depths in the search space gives different 3D points (at depths d_1, d_2, …, d_9), which correspond to the points (x_1, x_2, …, x_9) on the epipolar line.

For each pair (x, x_i), the matching cost is computed from their neighborhood information, giving the degree of match between x and x_i. Taking the points (x_1, x_2, …, x_9) above as an example, this yields 9 matching costs, which form a vector.

This vector is the matching cost of pixel x over the depth hypotheses (d_1, d_2, …, d_9).

Every pixel in the ref img produces such a vector; stacking these vectors gives a three-dimensional tensor, which is the cost volume.
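To make the idea concrete, here is a minimal sketch (not from any paper) of how the per-pixel cost vector could be computed with a window-based SAD cost; the function name and the candidate-pixel list are hypothetical.

```python
# Illustrative sketch: the cost vector of one ref pixel over D depth hypotheses,
# using window-based SAD as the matching cost. ref_img / match_img are 2D
# grayscale float arrays; candidates are the epipolar-line pixels x_1..x_D.
import numpy as np

def matching_cost_vector(ref_img, match_img, x, candidates, win=3):
    """x = (row, col) in the ref image; returns one cost per depth hypothesis."""
    r = win // 2
    ref_patch = ref_img[x[0]-r:x[0]+r+1, x[1]-r:x[1]+r+1]
    costs = []
    for (ri, ci) in candidates:                       # one candidate per depth d_i
        cand_patch = match_img[ri-r:ri+r+1, ci-r:ci+r+1]
        costs.append(np.abs(ref_patch - cand_patch).sum())   # SAD over the window
    return np.array(costs)                            # length-D cost vector

# Stacking such a vector for every ref pixel yields an H x W x D tensor: the cost volume.
```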

Learned MVS

SurfaceNet

LSM

3. MVSNet

(Figure: the MVSNet network architecture.)
Main process:

  1. Image Feature Extraction: N input images --> deep features
  2. Differentiable Homography: deep features --> feature volumes
  3. Cost Metric: feature volumes --> cost volume
  4. Cost Volume Regularization: cost volume --> probability volume
  5. Initial Estimation: probability volume --> depth map
  6. Depth Map Refinement: depth map --> refined depth map

3.1 Image Feature

The first step of MVSNet: extract the deep features $\{F_i\}_{i=1}^N$ of the N input images $\{I_i\}_{i=1}^N$.

Use: Eight-layer 2D convolutional network

The third and sixth layers use a stride of 2, down-sampling the features to half their size; this divides the feature tower into three scales.

Inside each scale, two convolutional layers are used to extract higher-level image representations.

Except for the last green convolution, each layer is Conv+batch-normalization+ReLU.

Similar to common matching tasks, among all feature towers, parameters are shared for efficient learning.

The structure inside a feature tower: (input image is WxH)

| Content | Kernel Size | Filter Number | Stride | Output |
| --- | --- | --- | --- | --- |
| Conv+BN+ReLU | 3 x 3 | 8 | 1 | 8 x W x H |
| Conv+BN+ReLU | 3 x 3 | 8 | 1 | 8 x W x H |
| Conv+BN+ReLU | 5 x 5 | 16 | 2 | 16 x W/2 x H/2 |
| Conv+BN+ReLU | 3 x 3 | 16 | 1 | 16 x W/2 x H/2 |
| Conv+BN+ReLU | 3 x 3 | 16 | 1 | 16 x W/2 x H/2 |
| Conv+BN+ReLU | 5 x 5 | 32 | 2 | 32 x W/4 x H/4 |
| Conv+BN+ReLU | 3 x 3 | 32 | 1 | 32 x W/4 x H/4 |
| Conv | 3 x 3 | 32 | 1 | 32 x W/4 x H/4 |

The output of the entire 2D network has 32 channels, and the spatial size of each channel is reduced to W/4 x H/4.

The increase in channels compensates for the reduction in spatial size.

Each image is thus turned into a 32 x W/4 x H/4 feature map.
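As a reference, here is a minimal PyTorch sketch of such a feature tower following the table above; the padding choices (to keep the output at W/4 x H/4) are my assumption, and this is not the authors' implementation.

```python
# A minimal sketch of the 8-layer feature tower described in the table above.
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, k, stride):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class FeatureTower(nn.Module):
    """3 x H x W image -> 32 x H/4 x W/4 feature map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_bn_relu(3, 8, 3, 1),
            conv_bn_relu(8, 8, 3, 1),
            conv_bn_relu(8, 16, 5, 2),    # first down-sampling (stride 2)
            conv_bn_relu(16, 16, 3, 1),
            conv_bn_relu(16, 16, 3, 1),
            conv_bn_relu(16, 32, 5, 2),   # second down-sampling (stride 2)
            conv_bn_relu(32, 32, 3, 1),
            nn.Conv2d(32, 32, 3, stride=1, padding=1),  # last layer: no BN/ReLU
        )

    def forward(self, x):                 # x: B x 3 x H x W
        return self.net(x)                # B x 32 x H/4 x W/4

# The same tower (shared weights) is applied to every input view.
```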

3.2 Cost Volume

Step 2: Build a 3D cost volume from the extracted features and input camera.

The cost volume is constructed on the reference camera frustum.

$I_1$: the ref image

$\{I_i\}_{i=2}^N$: the src images

$\{K_i, R_i, t_i\}_{i=1}^N$: the camera intrinsics, rotations, and translations (extrinsics) corresponding to each image

Differentiable Homography

Homography matrix:

https://www.sohu.com/a/223594989_100007727 (a reprint is available at https://blog.csdn.net/lyhbkz/article/details/82254893, in case the Sohu article goes offline)
https://www.cnblogs.com/wangguchangqing/p/8287585.html

Homography matrix: transforms the coordinates (u1, v1) in image 1 to the corresponding position (u2, v2) in image 2.

Consider the coordinates (u, v) of a point in an image and the coordinates (xw, yw) of the same point in the world coordinate system (the point lies on the world plane z_w = 0). Following the camera calibration model, (u, v) can be computed from (xw, yw) through a scale factor, the intrinsic parameters, and the extrinsic parameters (rotation and translation):

$$ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ 1 \end{bmatrix} $$

Multiplying the parameters in the middle gives a 3x3 matrix:

$$ H = K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix} $$

That is, $x = H \cdot x_w$ (up to scale); this H can be regarded as the homography between $x_w$ and $x$.

Consider the corresponding two points x1, x2 in the two images, corresponding to the point xw in the world coordinate system

$x_1 = H_1 \cdot x_w$

$x_2 = H_2 \cdot x_w$

This gives $x_2 = H_2 \cdot H_1^{-1} \cdot x_1 = H \cdot x_1$; this H can be regarded as the homography between $x_1$ and $x_2$.

According to the feature map extracted in the previous step, all feature maps (including ref) are warped to different parallel planes of the ref camera.

According to the homography transform $x' \sim H_i(d) \cdot x$, the feature map $F_i$ of the $i$-th view is warped into the feature map $V_i(d)$ at depth $d$.

For each view of the image, at all depths, a warp is performed?

$\sim$: projective equality

$H_i(d)$: the homography between the $i$-th feature map and the ref feature map at depth $d$:

$$ H_i(d) = K_i \cdot R_i \cdot \left(I - \frac{(t_1 - t_i)\, n_1^T}{d}\right) \cdot R_1^T \cdot K_1^{-1} $$

$n_1$: the principal axis of the ref camera

The homography is a 3x3 matrix, and the homography from ref to itself is the identity matrix.

The final result is N feature volumes $\{V_i\}_{i=1}^N$.

The warping process is similar to: classical plane sweeping stereo

The difference: differentiable linear interpolation samples from the feature maps $\{F_i\}_{i=1}^N$,

rather than from the images $\{I_i\}_{i=1}^N$.

As the step connecting 2D feature extraction and the 3D regularization network, the warping is implemented differentiably, which in turn enables end-to-end training of depth map inference.
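Below is a small numpy sketch of the per-depth homography, assuming the standard MVSNet form $H_i(d) = K_i \cdot R_i \cdot (I - (t_1 - t_i) n_1^T / d) \cdot R_1^T \cdot K_1^{-1}$ with view 1 as the reference; the function and argument names are illustrative, not the authors' code.

```python
# A sketch of the per-depth homography used for warping; plain numpy, for illustration.
import numpy as np

def homography(K_ref, R_ref, t_ref, K_src, R_src, t_src, n_ref, depth):
    """3x3 homography x' ~ H x mapping ref pixels to src pixels at a given depth.

    n_ref: unit principal axis of the reference camera (3-vector).
    """
    I = np.eye(3)
    return (K_src @ R_src
            @ (I - np.outer(t_ref - t_src, n_ref) / depth)
            @ R_ref.T @ np.linalg.inv(K_ref))

# Warping: for every ref pixel x and depth d, x' ~ H_i(d) x gives a (sub-pixel)
# location in F_i that is sampled with differentiable bilinear interpolation
# (e.g. torch.nn.functional.grid_sample), producing the feature volume V_i.
```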

Cost Metric

The multiple feature volumes $\{V_i\}_{i=1}^N$ are aggregated into a single cost volume $C$.

To accommodate any number of input views, a variance-based N-view similarity cost metric M is proposed.

W, H, and D are the image width, image height, and number of depth samples (D = 256); F is the number of channels of the feature map (F = 32).

$V = \frac{W}{4} \cdot \frac{H}{4} \cdot D \cdot F$ is the size of each feature volume. N such feature volumes were obtained in the previous step.

The cost metric of this paper defines a mapping M that maps the N feature volumes to a single cost volume:
$$ C = M(V_1, \dots, V_N) = \frac{\sum_{i=1}^{N} \left(V_i - \overline{V_i}\right)^2}{N} $$
$\overline{V_i}$ is the element-wise average of all feature volumes, and all operations above are performed element-wise.

Although the schematic shows only one cost volume, there are actually F (i.e., 32) channels of cost volume, each of size $\frac{W}{4} \cdot \frac{H}{4} \cdot D$.

In other words, the cost metric above only collapses the view dimension: for each channel, the N per-view feature volumes are mapped to a single cost volume of that channel.

The traditional MVS method aggregates the cost between ref and all src.

In this paper, all views contribute equally to the matching cost; no priority is given to the ref view.

Many methods use the mean of features across multiple views to infer similarity, but the mean itself carries no information about the differences between features.

Variance-based: Explicitly measures feature variance across multiple views
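A minimal PyTorch sketch of this variance-based aggregation (equivalent to the formula above, written purely for illustration):

```python
# N per-view feature volumes of shape (B, F, D, H, W) are reduced to one cost
# volume of the same shape by taking the element-wise variance over views.
import torch

def variance_cost(feature_volumes):
    """feature_volumes: list of N tensors, each of shape (B, F, D, H, W)."""
    volumes = torch.stack(feature_volumes, dim=0)      # (N, B, F, D, H, W)
    mean = volumes.mean(dim=0)                         # element-wise average volume
    cost = ((volumes - mean) ** 2).mean(dim=0)         # variance over the N views
    return cost                                        # (B, F, D, H, W)

# An equivalent one-pass form, E[V^2] - (E[V])^2, is often used to save memory
# by accumulating sums over views instead of stacking them.
```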

Cost Volume Regularization

The raw cost volume computed from image features may be contaminated by noise (non-Lambertian surfaces, object occlusions) and should be combined with smoothness constraints to infer the depth map.

The purpose of regularization: refine the cost volume $C$ above into a probability volume $P$ for depth inference.

Regularization using [Multi-scale 3D-CNN]

The four-scale network here is similar to the 3D version of UNet

Use an encoder-decoder structure

Aggregates neighboring information from large receptive fields at relatively low memory and computational cost

To reduce computational requirements:

The first 3D convolutional layer reduces the cost volume of 32 channels to 8 channels

Convolutions within each scale changed from 3 layers to 2 layers

The last convolutional layer outputs a single-channel volume

Along the depth direction, apply softmax for probability normalization

These steps produce a probability volume that is well suited to depth map inference:

it can be used not only for per-pixel depth estimation,

but also to measure the confidence of that estimate.

Section 3.3 will show that, by analyzing the probability distribution along depth, the quality of the depth reconstruction can be easily assessed, leading to a simple and effective outlier filtering strategy.
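For illustration, a simplified PyTorch sketch of such a 3D encoder-decoder is shown below; the channel counts, number of scales, and skip connections are assumptions and do not match the paper's exact configuration.

```python
# A simplified 3D encoder-decoder that maps a (B, 32, D, H, W) cost volume to a
# (B, D, H, W) probability volume via softmax along depth. Assumes D, H, W are
# divisible by 4; this is an illustration, not the paper's network.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3d_bn_relu(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True),
    )

class CostRegularization(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv0 = conv3d_bn_relu(32, 8)               # reduce 32 -> 8 channels first
        self.down1 = conv3d_bn_relu(8, 16, stride=2)     # encoder
        self.down2 = conv3d_bn_relu(16, 32, stride=2)
        self.up1 = nn.ConvTranspose3d(32, 16, 3, stride=2, padding=1, output_padding=1)
        self.up2 = nn.ConvTranspose3d(16, 8, 3, stride=2, padding=1, output_padding=1)
        self.prob = nn.Conv3d(8, 1, 3, padding=1)        # single-channel output volume

    def forward(self, cost):                             # cost: (B, 32, D, H, W)
        c0 = self.conv0(cost)
        c1 = self.down1(c0)
        c2 = self.down2(c1)
        x = self.up1(c2) + c1                            # U-Net-style skip connections
        x = self.up2(x) + c0
        logits = self.prob(x).squeeze(1)                 # (B, D, H, W)
        return F.softmax(logits, dim=1)                  # normalize along the depth axis
```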

3.3 Depth Map


Initial Estimation

The simplest way to retrieve the depth map $D$ from $P$ is pixel-wise winner-take-all, i.e. argmax (the dimensions of $P$ are D x W x H, so for each pixel the depth with the maximum probability is selected).

But argmax cannot produce sub-pixel estimates,

and cannot be trained using backpropagation due to its non-differentiable nature

Instead, the expected value along the depth direction (i.e., the probability-weighted sum over all depth hypotheses) is computed, as follows:
$$ D = \sum_{d = d_{min}}^{d_{max}} d \times P(d) $$
P(d): At depth d, the probability estimate of all pixels.

This operation can be called soft argmin, fully differentiable, and able to approximate the result of argmax.

Although the depth hypotheses are uniformly sampled within $[d_{min}, d_{max}]$ when the cost volume is constructed, the expectation here is able to produce a continuous depth estimate.

The output depth map is the same size as the 2D image feature map

Compared with the input image, the depth map and the 2D image feature map are downscaled by a factor of 4 in each dimension (because of the two stride-2 convolutions).
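A minimal sketch of this soft argmin regression (assuming a probability volume of shape B x D x H x W and a vector of the D sampled depth values):

```python
# Expected depth under the probability volume: the probability-weighted sum of
# the depth hypotheses, which is fully differentiable.
import torch

def soft_argmin_depth(prob_volume, depth_values):
    """prob_volume: (B, D, H, W); depth_values: (D,) sampled depths.
    Returns the expected depth map of shape (B, H, W)."""
    d = depth_values.view(1, -1, 1, 1)            # broadcast along the depth dimension
    return torch.sum(prob_volume * d, dim=1)      # probability-weighted sum over depths
```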

Probability Map

Although the multi-scale 3D CNN has a strong ability to regularize the probability into a unimodal distribution,

for incorrectly matched pixels the probability distribution is scattered and does not concentrate at a single peak.

Therefore, the quality of a depth estimate $\hat{d}$ is defined as the probability that the ground-truth depth lies within a small range around the estimate.

Since depth hypotheses are discretely sampled along the camera frustum, it is only necessary to take the probability sum of the four nearest depth hypotheses to measure the quality of the estimate.

Other statistical measures can also be used, such as standard deviation, entropy

Experiments in the paper show that these alternative measures bring no significant improvement to depth map filtering, while the probability-sum formulation gives better control over the outlier-filtering threshold.
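A sketch of this confidence measure, taking the expected depth index and summing the probabilities of the four nearest hypotheses; this mirrors the idea described above but is not the authors' exact implementation.

```python
# Probability (confidence) map: for each pixel, sum the probabilities of the
# four depth hypotheses nearest to the estimated depth index.
import torch

def probability_map(prob_volume):
    """prob_volume: (B, D, H, W) -> confidence map of shape (B, H, W)."""
    B, D, H, W = prob_volume.shape
    idx = torch.arange(D, dtype=prob_volume.dtype, device=prob_volume.device)
    expected_idx = torch.sum(prob_volume * idx.view(1, D, 1, 1), dim=1)   # (B, H, W)

    # indices of the 4 hypotheses around the expectation, clamped to [0, D-1]
    base = expected_idx.round().long().clamp(1, D - 2)
    neighbors = torch.stack([base - 1, base, base + 1,
                             (base + 2).clamp(max=D - 1)], dim=1)         # (B, 4, H, W)
    return torch.gather(prob_volume, 1, neighbors).sum(dim=1)             # (B, H, W)
```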

Depth Map Refinement

Although the depth map retrieved from the probability volume is already a valid output, the large receptive field involved in the regularization may over-smooth the reconstruction boundaries.

The ref image naturally contains boundary information, so it is used as a guide to refine the depth map.

Inspired by the latest image matting algorithm, a depth residual learning network is applied at the end of MVSNet.

[Initial depth map] and [Resized ref image] are connected as 4-channel input.

The depth residual is learned through three 32-channel 2D convolutional layers and one 1-channel convolutional layer.

The first three layers are Conv+BN+ReLU with stride 1; the last layer also has stride 1 but no BN or ReLU, so that negative residuals can be learned.

Adding back the initial depth map produces a refined depth map.

To avoid bias toward a particular depth scale (i.e., different depth ranges causing inconsistent results), the initial depth magnitude is pre-scaled to the range [0, 1] and converted back to the original scale after refinement.
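A minimal PyTorch sketch of such a refinement network; the layer shapes follow the text above, everything else is an assumption rather than the authors' implementation.

```python
# Depth residual learning: concatenate the scaled initial depth map with the
# resized ref image (4 channels), predict a residual, and add it back.
import torch
import torch.nn as nn

class DepthRefinement(nn.Module):
    def __init__(self):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1, bias=False),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.body = nn.Sequential(block(4, 32), block(32, 32), block(32, 32),
                                  nn.Conv2d(32, 1, 3, 1, 1))  # no BN/ReLU: residual may be negative

    def forward(self, init_depth, ref_img, d_min, d_max):
        # pre-scale depth to [0, 1] to avoid bias toward a particular depth range
        scaled = (init_depth - d_min) / (d_max - d_min)         # (B, 1, H, W)
        x = torch.cat([scaled, ref_img], dim=1)                 # (B, 4, H, W)
        refined = scaled + self.body(x)                         # add the learned residual
        return refined * (d_max - d_min) + d_min                # back to the original scale
```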

3.4 Loss

This paper considers: the loss of [initial depth map] and [refined depth map].

Use the mean absolute difference between [ground truth depth map] and [estimated depth map] as the training loss.

Because the ground truth depth map is not always complete, only those pixels with [effective ground truth labels] are considered.
$$ Loss = \sum_{p \in p_{valid}} \left\| d(p) - \hat{d_i}(p) \right\|_1 + \lambda \cdot \left\| d(p) - \hat{d_r}(p) \right\|_1 $$
$p_{valid}$: the set of pixels with valid ground truth

d(p): the ground-truth depth value at pixel p

$\hat{d_i}(p)$: the initial depth estimate

$\hat{d_r}(p)$: the refined depth estimate

$\lambda$: controls the relative weight of the initial and refined depth losses in the total loss; set to 1 in the experiments
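A sketch of this loss, assuming invalid ground-truth pixels are marked with depth <= 0 (the actual mask convention depends on how the GT depth maps are generated):

```python
# Mean absolute error on the initial and refined depth maps, restricted to
# pixels with valid ground truth; lambda weights the refined term (1 in the paper).
import torch

def mvsnet_loss(gt_depth, init_depth, refined_depth, lam=1.0):
    """All depth maps have shape (B, H, W); GT pixels <= 0 are treated as invalid."""
    valid = gt_depth > 0                                    # mask of valid GT pixels
    loss_init = torch.abs(gt_depth[valid] - init_depth[valid]).mean()
    loss_refined = torch.abs(gt_depth[valid] - refined_depth[valid]).mean()
    return loss_init + lam * loss_refined
```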

4. Implementations

4.1 Training

Data Preparation

MVS dataset: ground truth is given in the form of point cloud or grid.

A ground truth depth map needs to be generated.

Screened Poisson surface reconstruction (SPSR) is used to generate a mesh surface; the mesh is then rendered into each viewpoint to generate the corresponding ground truth depth map.

Divide the dataset:

Validation set: scans {3, 5, 17, 21, 28, 35, 37, 38, 40, 43, 56, 59, 66, 67, 82, 86, 106, 117}.

Evaluation/Testing set: scans {1, 4, 9, 10, 11, 12, 13, 15, 23, 24, 29, 32, 33, 34, 48, 49, 62, 75, 77, 110, 114, 118}.

Training set: the other 79 scans.

In the DTU dataset, for each object, 49 different viewing angles are scanned, and 7 different lighting conditions are adopted for each viewing angle

Treat each image as a ref img during training, that is, each image is used as a training sample

The total number of training samples is therefore 79 x 49 x 7 = 27,097.

View Selection

In the training of this article, the number of pictures is set to N=3, that is, 1 ref + 2 src.

Based on the sparse points, a score(i, j) is computed for each image pair:
$$ score(i,j) = \sum_{p} G\big(\theta_{ij}(p)\big) $$
p: a common track of views i and j

$\theta_{ij}(p)$: the baseline angle of p:

$$ \theta_{ij}(p) = \frac{180}{\pi} \arccos\left(\frac{(c_i - p)\cdot(c_j - p)}{\|c_i - p\|\,\|c_j - p\|}\right) $$

c: the camera center

G: a piecewise Gaussian function centered at a certain baseline angle $\theta_0$:

$$ G(\theta) = \begin{cases} \exp\left(-\frac{(\theta - \theta_0)^2}{2\sigma_1^2}\right), & \theta \le \theta_0 \\ \exp\left(-\frac{(\theta - \theta_0)^2}{2\sigma_2^2}\right), & \theta > \theta_0 \end{cases} $$
The two src views with the highest scores with respect to the ref image are selected; these two plus the ref are the three images input to the network.
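A sketch of the view-selection score under the formulas above; the values of $\theta_0$, $\sigma_1$, $\sigma_2$ used here are illustrative defaults, not values I can guarantee from the paper.

```python
# View-selection score: sum of a piecewise Gaussian of the baseline angle over
# the tracks seen by both views.
import numpy as np

def baseline_angle(p, c_i, c_j):
    """Angle (degrees) at 3D track point p between camera centers c_i and c_j."""
    v1, v2 = c_i - p, c_j - p
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))

def G(theta, theta0=5.0, sigma1=1.0, sigma2=10.0):   # illustrative hyperparameters
    sigma = sigma1 if theta <= theta0 else sigma2
    return np.exp(-((theta - theta0) ** 2) / (2 * sigma ** 2))

def score(common_tracks, c_i, c_j):
    """common_tracks: iterable of 3D points observed by both view i and view j."""
    return sum(G(baseline_angle(p, c_i, c_j)) for p in common_tracks)

# The two src views with the highest score w.r.t. the ref are chosen (N = 3).
```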

Data preprocessing:

Because: 1) the image is down-sampled during feature extraction, and 2) the four-scale encoder-decoder in the 3D regularization down-samples it further,

So: the size of the input image must be divisible by 32

In this paper, the images are resized and cropped to W = 640, H = 512, and the camera parameters are adjusted accordingly.

Depth assumption: uniform sampling from 425mm to 935mm with a resolution of 2mm, D=256

Train parameters:

| input view number | image width | image height | depth sample number |
| --- | --- | --- | --- |
| N = 3 | W = 640 | H = 512 | D = 256 |

4.2 Post-processing

Depth Map Filter

What the network estimates is the depth value of each pixel.

Outliers in the background and occluded regions are filtered out before converting the result to a dense point cloud.

  1. photometric consistency

    Measure the quality of the match

    According to the probability map, pixels with a probability value less than 0.8 are regarded as outliers.

  2. geometric constraint

    Measuring depth consistency across multiple views

    Similar to the left-right disparity check in stereo matching:

    1) Project ref pixel p1 through its depth d1 to pixel pi in another view

    2) Reproject pi to pixel p_reproj in ref according to its depth di

    3) For the original pixel p1 with depth d1 and the reprojected pixel p_reproj with depth d_reproj, if
    $$ |p_{reproj} - p_1| < 1 \quad \text{and} \quad |d_{reproj} - d_1| < 0.01 $$
    then the depth estimate d1 of p1 is considered consistent across the two views.

    The standard selection is N=3, so the three views need to be consistent.

    Pixels in the ref view that do not satisfy the consistency check are regarded as outliers (a sketch of this check is given below).
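A sketch of the geometric consistency check for a single ref pixel, assuming pinhole cameras with world-to-camera poses (R, t); the helper names and the depth-lookup callback are hypothetical.

```python
# Geometric consistency: project a ref pixel into a src view using its depth,
# reproject it back with the src depth, and compare positions and depths.
import numpy as np

def unproject(pixel, depth, K, R, t):
    """Pixel (u, v) with depth -> 3D point in world coordinates (x_cam = R X + t)."""
    x_cam = depth * (np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0]))
    return R.T @ (x_cam - t)

def project(X, K, R, t):
    """3D world point -> ((u, v), depth in that camera)."""
    x_cam = R @ X + t
    uv = (K @ x_cam) / x_cam[2]
    return uv[:2], x_cam[2]

def consistent(p1, d1, cam_ref, cam_src, src_depth_lookup):
    """cam_* = (K, R, t); src_depth_lookup(u, v) returns the src depth at a pixel."""
    # 1) project ref pixel p1 through its depth d1 into the src view
    X = unproject(p1, d1, *cam_ref)
    p_src, _ = project(X, *cam_src)
    d_src = src_depth_lookup(p_src[0], p_src[1])
    # 2) reproject the src pixel back into the ref view with its own depth
    X_back = unproject(p_src, d_src, *cam_src)
    p_reproj, d_reproj = project(X_back, *cam_ref)
    # 3) accept if both the reprojection error and the depth error are small
    #    (some implementations use a relative depth threshold |d_reproj - d1| / d1 < 0.01)
    return (np.linalg.norm(p_reproj - np.asarray(p1, dtype=float)) < 1.0
            and abs(d_reproj - d1) < 0.01)
```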

Depth Map Fusion

Depth map fusion integrates the depth maps from different views into a unified point cloud representation.

The reconstruction uses a visibility-based fusion algorithm,

which minimizes depth occlusions and conflicts between different viewpoints.

To further suppress noise, in the filtering step, the visible view of each pixel is determined, and the average of all reprojected depths is used as the final depth estimate for the pixel.

The fused depth maps are then reprojected directly into space to generate the 3D point cloud.

5. Experiments

5.1 Benchmarking on DTU dataset

Evaluation is performed on the DTU test split defined above.

Test parameters:

| input view number | image width | image height | depth sample number |
| --- | --- | --- | --- |
| N = 5 | W = 1600 | H = 1184 | D = 256 |


1)distance metric

This evaluation metric is introduced in the paper Large_scale_data_for_multiple_view_stereopsis.

Accuracy is measured as the distance from the MVS reconstruction to the structured light reference, encapsulating the quality of the reconstructed MVS points.

Completeness is measured as the distance from the reference to the MVS reconstruction, encapsulating how much of the surface is captured by the MVS reconstruction.

Overall: since the percentage metric has an F-score that combines acc and comp, MVSNet adds an overall score for the distance metric, defined as the average of acc and comp.

From the paper "A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms":

We now describe how we evaluate reconstructions by geometric comparison to the ground truth model.

Let us denote the ground truth model as G and the submitted reconstruction result to be evaluated as R. The goal of our evaluation is to assess both the accuracy of R (how close R is to G), and the completeness of R (how much of G is modeled by R). For the purposes of this paper, we assume that R is itself a triangle mesh.

How close R is to G (for each point in R, find the corresponding point in G and see how far away it is)

How much of G is modeled by R (for each point in G, check whether a corresponding point exists in R; boundary points are excluded)
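A sketch of the two distance metrics as nearest-neighbor distances between the reconstructed point cloud R and the ground truth G; the observability masks and outlier handling of the official DTU evaluation are omitted for brevity.

```python
# Accuracy: mean distance from R to G; completeness: mean distance from G to R;
# overall: the average of the two (as defined above).
import numpy as np
from scipy.spatial import cKDTree

def distance_metrics(R_pts, G_pts):
    """R_pts, G_pts: (N, 3) arrays of 3D points. Returns (acc, comp, overall)."""
    acc = cKDTree(G_pts).query(R_pts)[0].mean()     # R -> G: accuracy
    comp = cKDTree(R_pts).query(G_pts)[0].mean()    # G -> R: completeness
    return acc, comp, (acc + comp) / 2.0            # overall = mean of the two
```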

2)percentage metric

Introduced in the paper Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction.

  • Accuracy

    It can also be called precision, which measures the distance from the reconstructed point cloud (R) to the ground truth (G)

    Equation (3) defines the distance from a point r in R to G:

    Among the distances from r to all points in G, the smallest distance is the distance from r to the entire G

    Formula (4) defines the precision of R under the threshold d:

    For all points in R, calculate the distance from these points to G;

    Count the number of all points whose distance is less than d;

    After dividing by the total number of points in R, multiply by 100, and the final result can be converted into a percentage, so it is called percentage metric
    $$ e_{r \to G} = \min_{g \in G} \|r - g\| \qquad (3) $$

    $$ P(d) = \frac{100}{|R|} \sum_{r \in R} \mathbb{1}\big[\, e_{r \to G} < d \,\big] \qquad (4) $$

  • Completeness

    It can also be called recall to measure the distance from G to R.

    Equation (5) defines the distance from a point g to R in G, similar to Equation (3).

    Formula (6) defines the recall of R under the threshold d. Similar to formula (4).
    $$ e_{g \to R} = \min_{r \in R} \|g - r\| \qquad (5) $$

    $$ R(d) = \frac{100}{|G|} \sum_{g \in G} \mathbb{1}\big[\, e_{g \to R} < d \,\big] \qquad (6) $$

  • F-score

    Combine precision and recall to calculate F-score

The precision quantifies the accuracy of the reconstruction: how closely the reconstructed points lie to the ground truth. The recall quantifies the reconstruction’s completeness: to what extent all the ground-truth points are covered. Precision alone can be maximized by producing a very sparse set of precisely localized landmarks. Recall alone can be maximized by densely covering the space with points. However, either of these schemes will drive the other measure and the F-score to 0. A high F-score for a stringent distance threshold can only be achieved by a reconstruction that is both accurate and complete.

precision: Quantify the accuracy of the reconstruction, how close the reconstructed point is to GroundTruth

recall: Quantify the integrity of the reconstruction, to what extent the reconstructed points cover all GroundTruth

Precision can be maximized by generating a very sparse set of accurate local landmarks; recall can be maximized by densely covering the space with points.

However, either of these schemes drives the other measure, and with it the F-score, toward 0.

A good reconstruction must balance the two and raise the F-score as much as possible.
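A sketch of the percentage metric following equations (3)-(6) above, combined into an F-score; this is an illustration, not the official evaluation code.

```python
# Precision/recall under a distance threshold d, plus the F-score that combines them.
import numpy as np
from scipy.spatial import cKDTree

def percentage_metrics(R_pts, G_pts, d):
    """R_pts: reconstructed points, G_pts: ground-truth points, d: distance threshold."""
    dist_r_to_g = cKDTree(G_pts).query(R_pts)[0]            # eq. (3) for every r in R
    dist_g_to_r = cKDTree(R_pts).query(G_pts)[0]            # eq. (5) for every g in G
    precision = 100.0 * np.mean(dist_r_to_g < d)            # eq. (4)
    recall = 100.0 * np.mean(dist_g_to_r < d)               # eq. (6)
    f_score = 2 * precision * recall / (precision + recall + 1e-8)
    return precision, recall, f_score
```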

5.2 Generalization on Tanks and Temples dataset

Generalization parameters:

| input view number | image width | image height | depth sample number |
| --- | --- | --- | --- |
| N = 5 | W = 1920 | H = 1056 | D = 256 |

Using the model trained on DTU, only the number of input views, the image size, and the depth hypotheses need to be changed to obtain good generalization results on TAT.

5.3 Ablations

Ablation experiments were performed on the validation set

Validation parameters:

| input view number | image width | image height | depth sample number |
| --- | --- | --- | --- |
| N = 3 | W = 640 | H = 512 | D = 256 |


View Number

Previously trained model View Number N=3

Increasing the number of views reduces the validation loss, which is consistent with the general understanding of MVS reconstruction.

At the same time, although the model is trained with N=3, it performs better with N=5, indicating that the model is very flexible and can be applied to different input settings.

Image Features

Cost Metric

Comparison: Variance-based and mean-based loss metrics.

Variance-based converges faster and has lower validation loss.

An explicit difference measure, computing multi-view feature similarity, is more reasonable.

Depth Refinement

Comparison: Network with depth map refinement, network without depth map refinement

Whether to refine or not has little effect on the validation loss.

But with a refined network, the results of the percentage metric indicator are improved.

5.4 Discussions

Running Time

MVSNet is more efficient. It takes about 230s to rebuild a scan, with an average of 4.7s per view.

5x faster than Gipuma, 100x faster than COLMAP, 160x faster than SurfaceNet

Question: Is this the speed of reconstructing the point cloud?

If so, is most of the time spent on generating the point cloud or on estimating depth?

Does the later FastMVSNet improve the speed of depth estimation or the speed of point cloud generation?

If most of the time goes to point cloud generation and FastMVSNet only speeds up depth estimation, then an algorithm that accelerates point cloud generation should be sought.

At the same time, it is worth investigating whether the 3D point cloud can be template-matched against a CAD model.

GPU Memory

Required GPU memory: related to the size of the input image and the number of depth samples D.

A Tesla P100 (16 GB) or a consumer-level GTX 1080 Ti (11 GB) is sufficient.

Training Data

The DTU dataset provides ground truth point clouds with normal information, so they can be converted into mesh surfaces, from which ground truth depth maps are rendered.

However, the TAT dataset has no normal information or mesh surfaces, so the model cannot be fine-tuned on it for better performance.

The depth map obtained by rendering DTU works well, but there are still limitations:

1) The provided GT mesh is not 100% complete, and some triangles behind the foreground will be incorrectly rendered into the depth map as valid pixels, deteriorating the training process.

2) If a pixel is occluded in all other views, it should not be used for training. But without a complete mesh surface, the occluded pixels cannot be correctly identified.

The author hopes that the future MVS dataset can provide a GT depth map with complete occlusion and background information.

6. Conclusion

Propose a DL framework for MVS refactoring

Taking unstructured images as input, it infers the depth map of the ref image in an end-to-end manner.

Core contributions:

1) Encode the camera parameters as a differentiable homography to build a cost volume on the camera frustum.

2) The constructed cost volume bridges the 2D feature extraction network and the 3D cost regularization network.

MVSNet is better, faster, and has strong generalization on TAT without fine-tuning.

Origin blog.csdn.net/qq_41340996/article/details/124786530