[Paper Description] N2MVSNet: Non-Local Neighbors Aware Multi-View Stereo Network (ICASSP 2023)

1. Brief description of the paper

1. First author: Zhe Zhang

2. Date: 2023

3. Venue: ICASSP

4. Keywords: MVS, 3D reconstruction, depth estimation, deep learning, RGB guided depth refinement

5. Exploration motivation: Current methods use convolutions with fixed-size kernels, so the extracted features lack anisotropy in low-textured areas and produce invalid depth blending at the edges between foreground and background. In addition, under the coarse-to-fine scheme, severe mismatches in the coarse stage lead to the accumulation and propagation of errors in the finer stages.

However, current works are limited to fixed-size convolution kernels, leading to suboptimal features that lack anisotropy in low-textured regions and tend to produce invalid depth blending at the edges between foreground and background.

AA-RMVSNet introduces deformable convolution to extract image features, and PatchmatchNet augments the traditional propagation and cost evaluation steps of Patchmatch with learnable modules. However, these methods do not fully exploit the pixel-wise depth correlation between neighbors.

Besides, following the coarse-to-fine scheme, severe mismatches in coarse stages result in error accumulation and propagation in finer stages.

6. Work objectives: Address the above limitations of fixed-size kernels and of coarse-to-fine error propagation.

7. Key idea: The key idea is based on the observation that the depth of a pixel is closely related to that of its surrounding neighbors. For example, for a foreground pixel, neighbors inside the object contour contribute positively, while neighbors belonging to the background have the opposite effect (Figure 1(a)).

  1. We propose the Adaptive Non-local Neighbors Matching (ANNM) strategy, which leverages the pixel-wise spatial correlation of neighbors and extends it to a voxel-wise 3D ANNM for better depth perception.
  2. We apply RGB guided depth refinement to correct mispredictions from the coarser stage by highlighting valuable contours, preventing the accumulation and propagation of errors into the finer stages.

8. Experimental results:

The proposed network is extensively evaluated on the DTU and Tanks and Temples datasets, where it achieves state-of-the-art performance.

9. Paper download:

https://ieeexplore.ieee.org/document/10095299/authors#authors

2. Implementation process

1. Network structure

A coarse-to-fine scheme is adopted (two stages are shown for demonstration). The network is built around the adaptive non-local neighbors matching (ANNM) strategy and its three-dimensional extension, RGB guided depth refinement, and an energy aggregation loss.

2. Adaptive non-local neighbors matching

For a pixel p(x, y) on image I, the purpose of the proposed adaptive non-local neighbors matching (ANNM) strategy is to adaptively find a set of sampled pixels in a larger region, together with their spatial correlation, so as to construct a non-local neighbor-aware feature for p. First, as shown in Figure (a), channel unification and resampling are performed on I. Inspired by [17], which observes that the blue channel is more robust for feature recognition, the RGB channels (IR, IG, IB) are merged into a unified single-channel representation I′:
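A plausible form of this merge (a sketch assuming a simple weighted sum of the three channels, not necessarily the paper's exact normalization) is:

$$ I' = w_R \cdot I_R + w_G \cdot I_G + w_B \cdot I_B $$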

where wR = 0.1, wG = 0.3, wB = 1.0. Then I′ is downsampled with stride r, and the r² shifted sub-images are stacked along the channel dimension to obtain a resampled image Î of shape r² × H/r × W/r. After resampling, the effective receptive field is enlarged more efficiently and the weights are shared across the resampled channels.
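As a minimal sketch of this channel unification and resampling step (assuming the merge is the weighted channel sum above and the resampling is a pixel-unshuffle / space-to-depth operation with stride r; the function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def unify_and_resample(img, r=2, w=(0.1, 0.3, 1.0)):
    """img: (B, 3, H, W) RGB image -> (B, r*r, H/r, W/r) resampled tensor.

    Step 1: merge RGB into a single channel with fixed weights
            (wR=0.1, wG=0.3, wB=1.0, blue weighted highest).
    Step 2: downsample with stride r and stack the r*r shifted
            sub-images along the channel dimension (pixel-unshuffle)."""
    w_r, w_g, w_b = w
    merged = w_r * img[:, 0:1] + w_g * img[:, 1:2] + w_b * img[:, 2:3]  # (B, 1, H, W)
    resampled = F.pixel_unshuffle(merged, downscale_factor=r)           # (B, r*r, H/r, W/r)
    return resampled
```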

After image transformation, two ANNM sub-modules K and G are proposed to generate the non-local spatial similarity weight wp of shape k² × 1 × 1 and the sampling offset Op of shape 2k² × 1 × 1, where k is the size of the weighted kernel. The specific expression is as follows:
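A plausible formulation (a sketch, assuming K and G both take the resampled image Î as input) is:

$$ w_p = K(\hat{I}), \qquad O_p = G(\hat{I}) $$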

As shown in Figure (b), each pixel in the neighbor-aware feature is obtained as a weighted average using the learned weights and offsets. Based on the features extracted by a classic FPN (local fixed-size kernels), the final non-local neighbor-aware feature F can be calculated as:
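One plausible reading of this aggregation (a sketch consistent with the symbols explained below, where F_FPN denotes the FPN feature and Î the resampled image; not the paper's exact equation) is:

$$ F(p) \;=\; F_{\mathrm{FPN}}(p) \,\oplus\, \mathrm{ps}\!\left( \sum_{i=1}^{N} w_p^{\,i} \,\odot\, \hat{I}\big(p \oplus (O_{\Delta x}^{\,i},\, O_{\Delta y}^{\,i})\big) \right) $$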

where OΔx represents the horizontal offset, OΔy the vertical offset, ⊕ and ⊙ denote element-wise addition and multiplication respectively, and N = k² is the number of sampled pixels around the central pixel. Pixel shuffle (ps) is used to transform the non-local similarity from the feature domain back to the spatial domain, restoring the shape of I0.
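A minimal PyTorch sketch of the per-pixel weighted aggregation over adaptively offset neighbors (illustrative only; the function name, tensor shapes, and bilinear sampling via grid_sample are my assumptions, and the pixel shuffle back to the spatial domain and fusion with the FPN feature are omitted):

```python
import torch
import torch.nn.functional as F

def annm_aggregate_2d(feat, weights, offsets, k=3):
    """Illustrative 2D non-local neighbor aggregation.

    feat:    (B, C, H, W)     base feature map
    weights: (B, k*k, H, W)   learned similarity weight per sampled neighbor
    offsets: (B, 2*k*k, H, W) learned (dx, dy) pixel offsets per neighbor
    Returns the weighted sum over the k*k adaptively sampled neighbors."""
    B, C, H, W = feat.shape
    N = k * k
    # Base sampling grid in normalized [-1, 1] coordinates
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=feat.device),
        torch.linspace(-1, 1, W, device=feat.device),
        indexing="ij",
    )
    out = torch.zeros_like(feat)
    for i in range(N):
        dx = offsets[:, 2 * i]     / max(W - 1, 1) * 2.0   # pixel offset -> normalized
        dy = offsets[:, 2 * i + 1] / max(H - 1, 1) * 2.0
        grid = torch.stack((xs.unsqueeze(0) + dx, ys.unsqueeze(0) + dy), dim=-1)  # (B, H, W, 2)
        sampled = F.grid_sample(feat, grid, mode="bilinear",
                                padding_mode="border", align_corners=True)
        out = out + weights[:, i:i + 1] * sampled          # weighted accumulation (⊙ then ⊕)
    return out
```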

The 2D features are then encoded into the 3D camera frustum. An intuitive idea is to further extend 2D ANNM to 3D space to better explore explicit non-local depth-aware cost matching. 3D ANNM is applied only to the reference feature volume, for the simple reason that the ultimate goal is to estimate the reference depth map. As shown in Figure (c), given the feature volume Vbase directly warped from the reference feature F0, two 3D sub-modules learn a non-local depth similarity weight Wq of shape k³ × 1 × 1 and a 3D sampling offset Oq of shape 3k³ × 1 × 1. The final depth-aware feature volume V integrates depth cues from surrounding voxels and is calculated as follows:
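A plausible form of this 3D aggregation (a sketch consistent with the symbols explained below, not the paper's exact equation) is:

$$ V(q) \;=\; \sum_{i=1}^{N} W_q^{\,i} \,\odot\, V_{\mathrm{base}}\big(q \oplus (O_{\Delta x}^{\,i},\, O_{\Delta y}^{\,i},\, O_{\Delta d}^{\,i})\big) $$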

where q is a voxel in Vbase, the number of sampled voxels is N = k³, and OΔd denotes the depth offset. The learned depth-aware feature volume provides more robust neighborhood matching; the volumes {Vi} from all views are then aggregated into a single cost volume to estimate the depth map.

3. RGB guided depth refinement

The depth map generated in the coarser stage is used to guide the finer stages. However, serious mispredictions at the coarser stage will accumulate in the finer stages. To this end, the recent SOTA RGB guided depth refinement network DCTNet is adopted at the coarsest stage to better highlight contours that are valuable for depth refinement. In particular, the coarse depth D is upsampled to the shape of I0 with the aid of high-resolution RGB contour information. ΦD and ΦI0 are the features of D and I0 generated by the feature extraction module, while the edge attention weight WI0 learned from I0 is used to prevent excessive texture transfer between the RGB and depth images. The refined depth D̂ can be obtained as:
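One plausible instantiation (a sketch; the exact way the three inputs enter the module is an assumption on my part) is:

$$ \hat{D} = \mathrm{DCT}\big(\Phi_D,\; W_{I_0} \odot \Phi_{I_0},\; D\big) $$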

DCT(·, ·, ·) is computed as follows:

Here λ is a learnable parameter, L is the Laplacian filter, F is the DCT operation, F⁻¹ is its inverse, I is the identity matrix, ⊘ denotes element-wise division, and K is the 2D basis image of I0. This refinement corrects outliers predicted in the coarse stage and provides tighter depth hypothesis guidance for the fine stage.
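For intuition about why a DCT appears here, below is an illustrative (not DCTNet's actual formulation) DCT-domain solver for a screened-Laplacian system (I + λL)x = b, which reduces to an element-wise division in the DCT domain; the function name and the Neumann boundary assumption are my own:

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_screened_solve(b, lam):
    """Solve (I + lam * L) x = b, with L the negative Laplacian under
    Neumann (reflective) boundaries, by exploiting that the DCT-II
    basis diagonalizes L: the solve becomes element-wise division."""
    H, W = b.shape
    # Eigenvalues of the 2D discrete Laplacian in the DCT basis
    lam_y = 2.0 * (1.0 - np.cos(np.pi * np.arange(H) / H))
    lam_x = 2.0 * (1.0 - np.cos(np.pi * np.arange(W) / W))
    denom = 1.0 + lam * (lam_y[:, None] + lam_x[None, :])
    B = dctn(b, norm="ortho")      # forward DCT  (the "F" in the text)
    X = B / denom                  # element-wise division (the "⊘")
    return idctn(X, norm="ortho")  # inverse DCT  (the "F⁻¹")
```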

4. Loss function

The MVS problem is formulated as a classification task, and the ground-truth distribution PGT is supervised over the valid pixel set Ω. A cross-entropy loss LossCE between the predicted depth distribution P and PGT is first used as a constraint. In addition, exploiting the decorrelation and energy compaction properties of the DCT transform [22], a novel energy aggregation constraint LossEA is proposed as:

The total loss of the model is expressed as:
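A plausible form of the total loss (a sketch, assuming the two terms are simply weighted by λ₁ and λ₂ and summed over the stages) is:

$$ \mathrm{Loss} \;=\; \sum_{\text{stages}} \big( \lambda_1\, \mathrm{Loss}_{CE} \;+\; \lambda_2\, \mathrm{Loss}_{EA} \big) $$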

where λ₁ = λ₂ = 0.5.
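As a minimal sketch of the classification-style cross-entropy term over the valid pixel set Ω (hypothetical function name and tensor shapes; the energy aggregation term is omitted since its exact formula is not reproduced here):

```python
import torch
import torch.nn.functional as F

def depth_ce_loss(prob_volume, depth_hypotheses, gt_depth, mask):
    """Classification-style depth loss (illustrative sketch).

    prob_volume:      (B, D, H, W) softmax probabilities over D depth hypotheses
    depth_hypotheses: (B, D, H, W) per-pixel hypothesis depth values
    gt_depth:         (B, H, W)    ground-truth depth
    mask:             (B, H, W)    valid-pixel set Omega (1 = valid)"""
    # Ground-truth class = index of the hypothesis closest to the true depth
    gt_index = torch.argmin((depth_hypotheses - gt_depth.unsqueeze(1)).abs(), dim=1)
    log_p = torch.log(prob_volume.clamp(min=1e-6))
    ce = F.nll_loss(log_p, gt_index, reduction="none")   # (B, H, W)
    mask = mask.float()
    return (ce * mask).sum() / mask.sum().clamp(min=1.0)
```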

5. Experiment

Implemented in PyTorch and trained on an NVIDIA Tesla V100 GPU for 16 epochs with an initial learning rate of 0.001. At full resolution, inference takes 0.6 s and 4.7 GB of GPU memory.

5.1. Comparison with state-of-the-art methods
