[Paper Intensive Reading 1] Detailed Explanation of the Components of the MVSNet Architecture

MVSNet (ECCV 2018) is a classic paper on using neural networks for 3D reconstruction, but most existing explanations simply restate the paper, leaving the principles and purpose of the most important steps in network training hazy. Thanks to @百一, whose blog post uses a book as a metaphor for feature maps, I gained a general understanding of the whole pipeline.
The walkthrough below combines the diagrams with my own understanding and introduces the process step by step. I think it is relatively easy to follow; discussion and feedback are of course welcome!

MVSNet Architecture

1. Training process

1. Feature extraction

As in traditional 3D reconstruction methods, the first step is to extract image features (traditionally hand-crafted ones such as SIFT). The difference is that this paper uses an eight-layer convolutional network to extract deeper feature representations from each image. The network structure is shown in the figure below:

Input: N 3-channel images of width W and height H
Output: N 32-channel feature maps, each of size H/4 × W/4

(figure: the 8-layer feature extraction network)
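To make the shapes concrete, here is a minimal PyTorch sketch of such an eight-layer 2D extractor (my own sketch, not the authors' code). The two stride-2 layers give the 4× downsampling; the exact kernel sizes and channel widths are my reading of the paper and may differ in detail, but the stated shapes (3×H×W in, 32×H/4×W/4 out) hold.

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Minimal sketch of the 8-layer 2D feature extractor.
    Input:  (B, 3, H, W) image.  Output: (B, 32, H/4, W/4) feature map.
    Layer widths/strides are an assumption based on the paper's description."""
    def __init__(self):
        super().__init__()
        def block(cin, cout, k, s):
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, s, padding=k // 2, bias=False),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.net = nn.Sequential(
            block(3, 8, 3, 1),  block(8, 8, 3, 1),
            block(8, 16, 5, 2), block(16, 16, 3, 1), block(16, 16, 3, 1),
            block(16, 32, 5, 2), block(32, 32, 3, 1),
            nn.Conv2d(32, 32, 3, 1, padding=1))  # last layer: no BN/ReLU
    def forward(self, x):
        return self.net(x)

feats = FeatureNet()(torch.rand(1, 3, 512, 640))   # -> (1, 32, 128, 160)
```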

2. Build a Feature Volume

2.1 Homography transformation

Homography transformation. Simply put, for a point X in 3D space, camera 1 takes a photo in which X appears as the 2D pixel P(x, y); camera 2, at another position, takes a photo in which X appears as the 2D pixel P'(x', y'). Through the correct homography matrix H (which involves the relative pose R, T between cameras 1 and 2 and the distance d from camera 1 to point X), we have P' = HP. In other words, given the camera intrinsics and extrinsics in advance, only one unknown, the depth value, is needed to relate a point P on the reference image to its corresponding point P' on a source image.

(figure: homography warping between two camera views)

If the camera parameters of the two poses are known (the relative pose of cameras 1 and 2), we now set a depth interval [d1, d2] with resolution Δd, which gives D = (d2 − d1)/Δd depth planes; each depth d then corresponds to one homography matrix h.
If every pixel of an image is warped with the matrix hi corresponding to di, we obtain a warped image: it tells us, assuming the true depth of each pixel is di, which feature values those pixels correspond to in the other pose.
Since we hypothesize D depths, we obtain D warped images. Each one represents the warped feature values the pixels would take if their true depth were the current hypothesized depth; this corresponds to the progressively lighter blue layers in the figure above.

Why is it called a "view frustum (cone)"? Because when the real depth associated with the homography matrix differs, the near-large/far-small principle means that the number of feature points the camera at the current pose can cover shrinks as the depth decreases, so the swept volume takes on a cone (frustum) shape. (Bilinear interpolation is used in the next step when constructing the feature volume, which keeps the warped feature maps at all depths the same size.) A sketch of the per-depth homography is given below.
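For reference, a small NumPy sketch of the plane-sweep homography, following the form of Eq. (2) in the paper; the exact sign convention depends on how the extrinsics R, t are defined, and the depth range and sample count below are only example values.

```python
import numpy as np

def plane_sweep_homography(K_ref, R_ref, t_ref, K_src, R_src, t_src, d,
                           n_ref=np.array([0.0, 0.0, 1.0])):
    """Homography mapping reference-view pixels onto a source view for the
    fronto-parallel plane at depth hypothesis d (a sketch of Eq. (2) in the
    paper; sign conventions depend on how R and t are defined)."""
    I = np.eye(3)
    return (K_src @ R_src @ (I - np.outer(t_ref - t_src, n_ref) / d)
            @ R_ref.T @ np.linalg.inv(K_ref))

# one homography per hypothesized depth d_i in [d1, d2] -> D warped images
depth_hypotheses = np.linspace(425.0, 935.0, 192)   # example values only
```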

2.2 Feature body construction

(figure: feature volume construction)

Through feature extraction we obtained N feature maps (1 from the reference image, N−1 from the source images), each with 32 channels.
Each feature map can be understood as a book with 32 pages, and the size of each page is HxW.
Then, through the homography transformation of 2.1, each book (each feature map) becomes a stack of books (a feature volume): the original first page is warped by the matrices [h1, …, hD] corresponding to depths [d1, …, dD] into a first page at each of the D depths, the original second page becomes a second page at each of the D depths, and so on. At this point:
a book = a certain depth d
a certain page of a book = a certain feature channel at a certain depth
a certain word on a certain page = the feature value of a certain point in a certain feature channel at a certain depth

Note that the feature map of the reference image (Ref) is simply copied at each depth, because the multiple source images (Src) are all warped into this reference view. (A sketch of the warping is given below.)
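A minimal sketch of this warping step using PyTorch's `grid_sample` (the bilinear interpolation mentioned in 2.1). It assumes the homographies `Hs` map reference pixel coordinates to source pixel coordinates; real implementations batch over views and depths rather than looping.

```python
import torch
import torch.nn.functional as F

def build_feature_volume(src_feat, Hs):
    """Warp one source feature map (1, C, H, W) into the reference view for
    every depth hypothesis, given per-depth 3x3 homographies Hs of shape
    (D, 3, 3) mapping reference pixels -> source pixels.
    Returns a feature volume of shape (1, C, D, H, W).  Bilinear sampling
    keeps every depth slice at the same H x W size."""
    _, C, H, W = src_feat.shape
    D = Hs.shape[0]
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    ref_pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
    slices = []
    for d in range(D):
        src_pix = ref_pix @ Hs[d].T                       # homogeneous coords
        src_pix = src_pix[:, :2] / src_pix[:, 2:3].clamp(min=1e-6)
        gx = 2.0 * src_pix[:, 0] / (W - 1) - 1.0          # normalize to [-1, 1]
        gy = 2.0 * src_pix[:, 1] / (H - 1) - 1.0
        grid = torch.stack([gx, gy], dim=-1).reshape(1, H, W, 2)
        slices.append(F.grid_sample(src_feat, grid, mode="bilinear",
                                    padding_mode="zeros", align_corners=True))
    return torch.stack(slices, dim=2)                      # (1, C, D, H, W)

# the reference feature map is just copied at every depth:
# ref_volume = ref_feat.unsqueeze(2).expand(-1, -1, D, -1, -1)
```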

3. Generate the Cost Volume

(figure: cost volume construction from the N feature volumes)
Through the second step, we obtained N stacks of books. Now look at the first book of every stack together: it represents the warped feature values of every pixel of every feature map under the assumption that the depth is d1. If the true depth of a pixel really is close to d1, then after warping, the feature values in that pixel's "column" across the N stacks should be similar.
Based on this idea, a variance is computed for every pixel of every page of every book; it measures how different the corresponding feature points of the N image feature maps are, per channel, when the assumed depth is di. The smaller the variance, the more similar the features, and the more likely the true depth of that point is di.

In this step we obtain a single stack of books:
each book = one depth,
each page = one channel of the aggregated feature map at that depth,
a point on a page = how similar the N views' features are at that point and channel under that depth; the smaller the variance, the more similar.

This is why the paper says the network can accept an arbitrary number N of input views: because the aggregation is a variance, the resulting cost volume has the same size no matter how many views are fed in. (A sketch of this step follows below.)
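A short sketch of the variance aggregation; note that the output shape does not depend on how many feature volumes are passed in.

```python
import torch

def variance_cost_volume(feature_volumes):
    """Aggregate N feature volumes (a list of tensors, each (B, C, D, H, W))
    into one cost volume by taking the per-element variance over the views."""
    vols = torch.stack(feature_volumes, dim=0)        # (N, B, C, D, H, W)
    mean = vols.mean(dim=0)
    cost = (vols ** 2).mean(dim=0) - mean ** 2        # Var = E[V^2] - (E[V])^2
    return cost                                       # (B, C, D, H, W)
```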

4. Cost Volume Regularization

In the third step we obtained the cost volume, but as the paper says, "The raw cost volume computed from image features could be noise-contaminated": because of non-Lambertian surfaces, occlusion, and so on, the cost volume contains noise and needs regularization to produce a probability volume P. A network structure similar to a (3D) U-Net encodes and decodes the raw cost volume and finally compresses the channel dimension to 1, i.e., it turns the stack of books back into a single book.
(figure: cost volume regularization network)
At this point:
the book = the probability volume
each page = a certain depth
a point on a page = the probability that this point lies at that depth

For example, for a point (x, y) on the H×W plane, if its probability is largest at depth d, then that point's depth is most likely d.

My own intuition for compressing the multi-channel features into a single channel would be to keep the channel with the smallest variance (the one most likely to belong to the current depth): the smaller the variance, the more likely the pixel's depth is the current layer's depth. The paper, however, says the purpose of regularizing with the network is to suppress noise and obtain the final probability volume; I still do not fully understand why this U-Net-like network is used.
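As a toy stand-in for the paper's multi-scale 3D U-Net, the sketch below only shows the role of this stage: 3D convolutions smooth the cost volume, the 32 channels are compressed to 1, and a softmax along the depth dimension turns the result into a probability volume. The real network is much deeper and multi-scale.

```python
import torch
import torch.nn as nn

class TinyCostRegularizer(nn.Module):
    """Toy stand-in for the paper's multi-scale 3D U-Net: a few 3D convolutions
    smooth the cost volume and compress 32 channels to 1, then a softmax over
    the depth dimension produces the probability volume."""
    def __init__(self, in_ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(8, 8, 3, padding=1),     nn.ReLU(inplace=True),
            nn.Conv3d(8, 1, 3, padding=1))     # -> (B, 1, D, H, W)
    def forward(self, cost):                   # cost: (B, 32, D, H, W)
        logits = self.body(cost).squeeze(1)    # (B, D, H, W)
        return torch.softmax(logits, dim=1)    # probability volume P
```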

5. Depth Map Initial Estimation

(figure: depth map regression from the probability volume)
Using the probability volume from step 4, compute the expectation along the depth direction d to obtain the initial depth value of each pixel; taking this expectation at every pixel turns the probability volume into a depth map.
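The regression itself is just an expectation over the depth hypotheses (the paper's "soft argmin"); a minimal sketch:

```python
import torch

def depth_regression(prob_volume, depth_values):
    """Expected depth along the depth axis.
    prob_volume:  (B, D, H, W) probabilities summing to 1 over D.
    depth_values: (D,) hypothesized depths d_1..d_D.
    Returns the initial depth map of shape (B, H, W)."""
    return torch.sum(prob_volume * depth_values.view(1, -1, 1, 1), dim=1)
```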

6. Depth Map Refinement

(figure: depth map refinement network)
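As I understand the paper, the refinement step concatenates the reference image with the initial depth map and lets a small 2D network predict a residual that is added back to the initial estimate. The layer widths below are a sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DepthRefineNet(nn.Module):
    """Sketch of the refinement step: concatenate the reference image and the
    initial depth map into a 4-channel input, predict a depth residual with a
    small 2D conv net, and add it back to the initial estimate."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1),  nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1))           # residual, no activation
    def forward(self, ref_img, init_depth):           # (B,3,H,W), (B,1,H,W)
        residual = self.net(torch.cat([ref_img, init_depth], dim=1))
        return init_depth + residual                   # refined depth map
```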

7. Loss Calculation

(figure: loss computation)
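The loss is the mean absolute (L1) error of the initial depth map plus λ times that of the refined depth map, computed only over pixels with valid ground-truth depth (λ is set to 1 in the paper). A sketch:

```python
import torch
import torch.nn.functional as F

def mvsnet_loss(init_depth, refined_depth, gt_depth, valid_mask, lam=1.0):
    """Two-term L1 loss over valid ground-truth pixels: initial depth error
    plus lambda times refined depth error (lambda = 1 in the paper)."""
    l_init = F.l1_loss(init_depth[valid_mask], gt_depth[valid_mask])
    l_refined = F.l1_loss(refined_depth[valid_mask], gt_depth[valid_mask])
    return l_init + lam * l_refined
```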

2. Post-processing

The abstract mentions that *"with simple post-processing"* the model already works very well. This post-processing consists mainly of two parts: depth map filtering and depth map fusion.

1. Depth Map Filter

(figure: depth map filtering)
Depth map filtering mainly involves two constraints: a photometric constraint and a geometric constraint.

1.1 Geometric constraints

First, the geometric constraint is relatively simple: the reference pixel p1 is projected through its estimated depth d1 to the pixel pi in a source view, and pi is then reprojected through its own depth estimate di back to the reference view, landing at pixel p_reproj with reprojected depth d_reproj. If they satisfy

|p_reproj − p1| < 1 (pixel)  and  |d_reproj − d1| / d1 < 0.01,

the pixel is said to satisfy the geometric constraint. In the paper, every kept pixel must be consistent in at least three views.
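A NumPy sketch of this check for a single pixel (no bounds checking; `E_*` are assumed to be 4×4 world-to-camera extrinsic matrices and `depth_src` the source view's depth map):

```python
import numpy as np

def geometric_consistent(p_ref, d_ref, K_ref, E_ref, K_src, E_src, depth_src):
    """Sketch of the geometric constraint for one reference pixel."""
    # 1) lift the reference pixel to 3D and project it into the source view
    X_ref = np.linalg.inv(K_ref) @ np.array([p_ref[0], p_ref[1], 1.0]) * d_ref
    X_w = np.linalg.inv(E_ref) @ np.append(X_ref, 1.0)
    x_src = K_src @ (E_src @ X_w)[:3]
    p_src = x_src[:2] / x_src[2]
    # 2) look up the source view's own depth estimate at that pixel...
    d_src = depth_src[int(round(p_src[1])), int(round(p_src[0]))]
    # 3) ...and project it back into the reference view
    X_src = np.linalg.inv(K_src) @ np.append(p_src, 1.0) * d_src
    X_w2 = np.linalg.inv(E_src) @ np.append(X_src, 1.0)
    x_re = K_ref @ (E_ref @ X_w2)[:3]
    p_re, d_re = x_re[:2] / x_re[2], x_re[2]
    # reprojection error < 1 px and relative depth error < 1% (paper thresholds)
    return (np.linalg.norm(p_re - np.asarray(p_ref)) < 1.0
            and abs(d_re - d_ref) / d_ref < 0.01)
```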

1.2 Photometric constraints

The photometric constraint relies on the probability map that is obtained alongside the initial depth map from the probability volume: the probability value at a pixel measures how confident the depth estimate is, and low-confidence pixels (probability below 0.8 in the paper) are filtered out.
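A simplified sketch of this filter (the paper actually sums the probability over the four depth hypotheses nearest the estimate; here I just look up the closest one):

```python
import torch

def photometric_mask(prob_volume, depth_map, depth_values, thresh=0.8):
    """Sketch of photometric filtering: keep pixels whose winning depth has
    probability above the threshold (0.8 in the paper)."""
    # index of the hypothesis closest to the regressed depth, per pixel
    idx = torch.argmin((depth_values.view(1, -1, 1, 1)
                        - depth_map.unsqueeze(1)).abs(), dim=1, keepdim=True)
    prob_map = torch.gather(prob_volume, 1, idx).squeeze(1)   # (B, H, W)
    return prob_map > thresh
```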

2. Depth Map Fusion

(figure: depth map fusion)
That is, depth maps are inferred from multiple viewpoints and then merged with a fusion algorithm; for each pixel, the reprojected depths computed during the geometric-consistency check are averaged, and this mean is used as the final depth estimate.
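A tiny sketch of this per-pixel fusion rule:

```python
import numpy as np

def fuse_pixel_depth(d_ref, consistent_reprojected_depths):
    """Average the reference estimate with the depths reprojected from the
    views that passed the geometric-consistency check (a sketch of the fusion
    rule described above)."""
    return float(np.mean([d_ref] + list(consistent_reprojected_depths)))
```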

3. Summary

(figure: summary of the MVSNet pipeline)


Original post: blog.csdn.net/qq_41794040/article/details/127692123