[Paper Intensive Reading 4] Detailed Explanation of MVSNet Series Papers-CVP-MVSNet

The full title of CVP-MVSNet is "Cost Volume Pyramid Based Depth Inference for Multi-View Stereo" (CVPR 2020, CCF A). Its main innovation is to construct a cost volume pyramid in a coarse-to-fine manner. The process is as follows:

(1) Construct image pyramids (Image Pyramid) of different resolutions, from level 0 (original resolution) down to the lowest-resolution level L+1. First, use the N images at level L+1 to infer a depth map D_{L+1} following the standard MVSNet pipeline, then upsample it to obtain the initial depth map of level L.
(2) Based on this initial depth map, combined with the N images at level L, build a partial cost volume by reprojection and infer the residual depth of every pixel (residual depth, i.e. Δd relative to the initial depth); adding the residuals to the initial depth gives the final depth map D_L of level L.
(3) Repeat step (2) down to level 0, whose output is the depth map at the original resolution (a rough sketch of this loop is given below).
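
As a rough illustration of this coarse-to-fine loop, here is a minimal sketch (not the authors' code): `mvsnet_depth` and `residual_depth` are hypothetical helpers standing in for the standard MVSNet inference at the coarsest level and the residual-depth inference described in section 2.2.

```python
import torch.nn.functional as F

def coarse_to_fine_depth(images, cams, L, mvsnet_depth, residual_depth):
    """Sketch of CVP-MVSNet's coarse-to-fine inference loop.

    images : per-view image pyramids (level 0 = original resolution, level L+1 = coarsest)
    cams   : camera parameters, scaled per pyramid level
    """
    # (1) standard MVSNet-style inference at the coarsest level L+1
    depth = mvsnet_depth(images, cams, level=L + 1)          # [B, 1, h, w]

    # (2)-(3) walk down the pyramid: upsample, then add an inferred residual
    for level in range(L, -1, -1):
        # upsample the previous depth to the current level's resolution
        depth_up = F.interpolate(depth, scale_factor=2,
                                 mode="bilinear", align_corners=False)
        # build the partial cost volume around depth_up and regress Δd
        delta = residual_depth(images, cams, depth_up, level=level)
        depth = depth_up + delta

    return depth  # depth map at the original resolution (level 0)
```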


This article is the fourth in the MVSNet series. It is recommended to read [Paper Intensive Reading 1] MVSNet Series Papers Detailed Explanation-MVSNet first for easier understanding.

1. Problem introduction & innovation points

This paper aims to improve accuracy and address the runtime problem of learning-based MVS reconstruction. It points out that the memory-optimized R-MVSNet reduces memory consumption but at the cost of longer runtime, and that Point-MVSNet, which like this paper refines the depth map coarse-to-fine, is slow because it operates directly on point clouds.

The innovation points are mainly summarized in two points:

  • A cost volume pyramid built in a coarse-to-fine manner, yielding an MVS depth-inference network that is 6 times faster than Point-MVSNet and consumes less memory.
  • When constructing each level of the cost volume pyramid, in particular the partial cost volumes below the coarsest level, a residual depth search range is used (i.e. how far to search for Δd around the initial depth), and a detailed analysis is given of how this range should be chosen relative to the image resolution.

2. Network model

[Figure: overall network architecture of CVP-MVSNet]

1. Feature Pyramid

First, an image pyramid is built, and a 9-layer convolutional neural network with shared parameters (Leaky-ReLU as the activation function) extracts features from every level. The output has 16 channels, with spatial size [W/2^l, H/2^l] at level l, forming a feature pyramid for subsequent use.
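
A minimal sketch of what such a shared feature extractor could look like; the paper specifies 9 convolutional layers, Leaky-ReLU and 16 output channels, while the intermediate channel widths below are assumptions.

```python
import torch.nn as nn

class FeatureNet(nn.Module):
    """9-layer CNN with shared weights, applied to every image at every pyramid level."""
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        layers = []
        ch = in_ch
        # 8 conv + LeakyReLU blocks, then a final conv producing 16 feature channels
        for i in range(8):
            nxt = 16 if i < 2 else 32   # assumed intermediate widths
            layers += [nn.Conv2d(ch, nxt, 3, padding=1),
                       nn.LeakyReLU(0.1, inplace=True)]
            ch = nxt
        layers += [nn.Conv2d(ch, out_ch, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, img):           # img: [B, 3, H/2^l, W/2^l]
        return self.net(img)          # features: [B, 16, H/2^l, W/2^l]
```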

2. Cost Volume Pyramid

2.1 Cost volume for coarse depth map estimation (level L+1)

This step is the standard MVSNet inference process. It is worth noting that the paper explains the role of the homography matrix H: it describes the possible correspondence between a pixel x in the reference view and a pixel x_i in source view i, expressed as λ_i·x_i = H_i(d)·x, where λ_i is the depth of x_i in source view i.
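
A tiny sketch of the relation λ_i·x_i = H_i(d)·x: given an already-assembled 3×3 homography for depth d, apply it to a homogeneous reference pixel and dehomogenize (how H_i(d) is built from intrinsics and extrinsics is omitted here).

```python
import numpy as np

def warp_pixel(H_d, x):
    """Map a reference-view pixel x = (u, v) through the plane homography H(d).

    H_d : 3x3 homography for the depth-d hypothesis plane
    Returns the source-view pixel x_i and the scale factor λ_i.
    """
    x_h = np.array([x[0], x[1], 1.0])   # homogeneous pixel
    y = H_d @ x_h                       # equals λ_i * [u_i, v_i, 1]
    lam = y[2]
    return y[:2] / lam, lam
```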

2.2 Cost volumes for multi-scale depth residual estimation (level L down to level 0)

In 2.1 we obtained the depth map D_{L+1} of level L+1; upsampling it gives the initial depth map of level L, denoted D_{L+1}↑. The goal of this step is to estimate a residual depth and combine it with the initial depth to obtain the final depth map of level L: D_L = D_{L+1}↑ + ΔD_L.
This step is then repeated down to level 0 to obtain the final depth map.


This step is the core of the method; the figure below must be understood!

First, for the initial depth map D_{L+1}↑ obtained by upsampling in 2.1, define the depth of pixel p(u,v) in the level-L image as d_p = D_{L+1}↑(u,v).

The figure below illustrates the two steps of this operation: the left shows the reprojection, and the right shows feature extraction and construction of the local cost volume.
[Figure: reprojection (left) and local cost volume construction (right)]

2.2.1 Reprojection (left side of the figure)

Find the corresponding 3D point (green) from the initial depth of the current pixel p, then add and subtract a value to obtain the farthest and nearest possible true 3D points (purple and red). The residual search range s_p is the distance between the purple and red points (how this range is chosen is explained in detail in 3.1). The residual depth planes divide this range into M hypothesis planes with spacing Δd_p = s_p / M, so the depth values of the M hypothesized 3D points are D_{L+1}↑(u,v) + m·Δd_p, where m ∈ {−M/2, …, M/2−1}.
[In other words, the initial depth is the center, and the initial depth ± s_p/2 gives the farthest and nearest possible depths; this is also the meaning of the residual depth Δd_p.]
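
A small sketch of how the M residual depth hypotheses around the upsampled depth could be laid out (names are illustrative; `depth_up` is the upsampled depth D_{L+1}↑ and `s_p` the per-pixel search range from 3.1).

```python
import torch

def residual_depth_hypotheses(depth_up, s_p, M=8):
    """Build the M depth hypotheses d_p + m*Δd_p, m in {-M/2, ..., M/2-1}.

    depth_up : [B, H, W]  upsampled initial depth D_{L+1}↑
    s_p      : [B, H, W]  per-pixel residual search range
    returns  : [B, M, H, W] hypothesized depths
    """
    delta = s_p / M                                           # Δd_p = s_p / M
    m = torch.arange(-M // 2, M // 2, device=depth_up.device).float()
    return depth_up.unsqueeze(1) + m.view(1, M, 1, 1) * delta.unsqueeze(1)
```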

At this point, for pixel p in the reference view, we can project its M hypothesized 3D points at different depths according to the following formula and obtain the features corresponding to the M depths in each source view; as shown in the figure, the purple, green, and red depth points each correspond to one feature in every source view.
[Equation: projection of the M hypothesized 3D points of pixel p into source view i]
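
A hedged sketch of this projection step (the exact formula is the paper's equation above; here K0/R0/t0 and Ki/Ri/ti are the reference and source camera parameters, and the world-to-camera convention X_cam = R·X_world + t is an assumption).

```python
import numpy as np

def project_hypotheses(u, v, depths, K0, R0, t0, Ki, Ri, ti):
    """Lift reference pixel (u, v) to 3D at each hypothesized depth,
    then project into source view i.

    depths : (M,) hypothesized depth values d_p + m*Δd_p
    returns: (M, 2) source-view pixel coordinates
    """
    x_h = np.array([u, v, 1.0])
    pix = []
    for d in depths:
        X_cam0 = d * (np.linalg.inv(K0) @ x_h)   # point in the reference camera frame
        X_world = R0.T @ (X_cam0 - t0)           # camera -> world
        X_cami = Ri @ X_world + ti               # world -> source camera frame
        p = Ki @ X_cami
        pix.append(p[:2] / p[2])                 # dehomogenize
    return np.stack(pix)
```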

2.2.2 Building the local cost volume (right side of the figure)

On the right, the variance of the features of these hypothesized 3D points after projection is computed, and the variance is used as the cost of pixel p at that depth. Since there are M hypothesized depths and H×W pixels in total, this forms a local cost volume of size [H, W, M]; passing it through the same kind of 3D CNN then infers the residual depth Δd.

[Both this paper and MVSNet build the cost volume with the variance metric, but MVSNet assumes a series of depth planes and warps each reference-view pixel to the corresponding source-view pixel via the homography H to fetch its feature value, whereas here the hypothesized true 3D point positions are used to look up the corresponding features in each source view. In other words, MVSNet goes from reference-view pixels to source-view pixel coordinates and samples there, while this paper goes from reference-view pixels to hypothesized 3D points and then to the corresponding source-view features.]
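
A minimal sketch of the variance aggregation over views, assuming the per-view feature volumes (the reference features plus the warped source features at the M residual depths) have already been gathered into one tensor.

```python
import torch

def variance_cost_volume(feat_volumes):
    """feat_volumes : [N, B, C, M, H, W]  features of the N views sampled at the
    M hypothesized depths (reference features simply repeated along M).
    Returns the local cost volume [B, C, M, H, W] as the per-element variance.
    """
    mean = feat_volumes.mean(dim=0)
    var = (feat_volumes ** 2).mean(dim=0) - mean ** 2   # E[x^2] - E[x]^2
    return var
```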

3. Depth map inference

3.1 Depth Sampling of Cost Volume Pyramid

[Figure: effect of depth sampling density relative to image resolution]
The paper observes that the sampling interval of the hypothesized depth planes should be related to the image resolution: as shown in the figure above, overly dense depth sampling makes the projected image features of neighboring 3D points too close together to provide any additional information for depth inference, and is therefore unnecessary.
[Figure: determining the depth search range from a 2-pixel offset along the epipolar line]
Therefore, the paper first uses the initial depth of pixel p to find the corresponding 3D point (green) and projects it into each source view. Along the epipolar line (epipolar search, as mentioned earlier), it finds the pixels 2 pixels away from the projection point in both directions and back-projects them into 3D rays; the intersections of these rays with the viewing ray of p in the reference view then determine the depth search range.

Original text: "find points that are two pixels away from the its projection along the epipolar line in both directions (see Fig. 3 '2 pixel length')". According to the original text, the points on each side should be 2 pixels away from the projection point, so the span marked "2 pixel length" in the figure should presumably be 4 pixels long.
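
A rough numerical sketch of this idea (my own reading, not the authors' code): project the 3D point at the current depth into a source view, step 2 pixels along the epipolar direction, back-project that pixel as a ray, and take the depth of the closest point between this ray and the reference viewing ray as one bound of the search range. The world-to-camera convention X_cam = R·X_world + t is an assumption.

```python
import numpy as np

def depth_from_epipolar_offset(u, v, d0, K0, R0, t0, Ki, Ri, ti, offset_px=2.0):
    """Depth bound corresponding to a 2-pixel shift along the epipolar line in source view i."""
    inv_K0 = np.linalg.inv(K0)
    x_h = np.array([u, v, 1.0])
    # 3D point at the current depth, in world coordinates
    X0 = R0.T @ (d0 * (inv_K0 @ x_h) - t0)
    # project it into the source view, and find the epipolar direction through it
    p = Ki @ (Ri @ X0 + ti); p = p[:2] / p[2]
    e = Ki @ (Ri @ (-R0.T @ t0) + ti); e = e[:2] / e[2]   # epipole (projected reference camera centre)
    direction = (p - e) / np.linalg.norm(p - e)
    q = p + offset_px * direction                          # point 2 px further along the epipolar line
    # back-project q as a ray from the source camera centre, and build the reference ray through (u, v)
    ci = -Ri.T @ ti
    vi = Ri.T @ (np.linalg.inv(Ki) @ np.array([q[0], q[1], 1.0]))
    c0 = -R0.T @ t0
    v0 = R0.T @ (inv_K0 @ x_h)
    # closest point between the two rays: solve [v0, -vi] @ [s, r]^T ≈ ci - c0 in least squares
    A = np.stack([v0, -vi], axis=1)
    s, r = np.linalg.lstsq(A, ci - c0, rcond=None)[0]
    # since v0 = R0^T K0^{-1} [u, v, 1]^T, the ray parameter s equals depth in the reference camera
    return s
```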

3.2 Depth Map Inference

Afterwards, as in MVSNet, the cost volume is fed into a 3D convolutional network to aggregate context information and output a probability volume, from which a depth map is obtained by soft-argmax (expectation). The special point is that the cost volumes are built from the top of the pyramid downward, so the final depth map of each level (except level L+1) is computed as the upsampled depth plus the expected residual depth inferred from the probability volume of that level's local cost volume:
D_l(u,v) = D_{l+1}↑(u,v) + Σ_m m·Δd_p·P_p(m), with m ∈ {−M/2, …, M/2−1}.
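
A small sketch of this update at one level, assuming `prob` is the softmax probability volume over the M residual hypotheses produced by the 3D CNN and `delta_d` the per-pixel interval Δd_p.

```python
import torch

def update_depth(depth_up, prob, delta_d, M):
    """depth_up : [B, H, W]    upsampled depth D_{l+1}↑
    prob      : [B, M, H, W]  softmax probabilities over the residual hypotheses
    delta_d   : [B, H, W]     per-pixel residual interval Δd_p
    """
    m = torch.arange(-M // 2, M // 2, device=prob.device).float().view(1, M, 1, 1)
    expected_residual = (prob * m * delta_d.unsqueeze(1)).sum(dim=1)   # soft-argmax
    return depth_up + expected_residual
```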

4. Loss function

[Equation: Loss = Σ over pyramid levels l and valid pixels p of ‖D_l(p) − D_l^GT(p)‖_1]
Consistent with MVSNet, the loss is the sum over all pyramid levels of the l1 distance between each predicted depth map and the ground-truth depth.
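
A minimal sketch of such a multi-level l1 loss; masking out pixels without ground truth is assumed, as is typical for depth supervision.

```python
import torch

def pyramid_l1_loss(pred_depths, gt_depths, masks):
    """pred_depths / gt_depths / masks: lists over pyramid levels, each [B, H_l, W_l]."""
    loss = 0.0
    for pred, gt, mask in zip(pred_depths, gt_depths, masks):
        valid = mask > 0
        loss = loss + torch.abs(pred[valid] - gt[valid]).mean()
    return loss
```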

3. Summary

  • The figures are very nice.
  • The idea is quite ingenious: working on images from coarse to fine is certainly faster than operating on point clouds. A small initial depth map is obtained first, then repeatedly upsampled and refined into the final depth map. The local cost volume built over residual depths is essentially MVSNet's variance method, except that, in order to exploit the existing depth map when looking up features, the approximate 3D point positions are computed first and then used to fetch the corresponding source-view features.
  • A detailed analysis is given of how to choose the residual sampling range: the projection point and the points 2 pixels to either side are back-projected into rays to define the search depth, the reasoning being that points sampled more densely than the image resolution provide no extra information, which reduces the memory cost of over-sampling. The size and scheme of this depth sampling were also improved in P-MVSNet, but its inverse-depth setting is not discussed in detail, much like the 3D convolutional network at the end of this paper, which only appears in the supplementary material; this suggests there is still room for innovation here.

Origin: blog.csdn.net/qq_41794040/article/details/127897080