[Paper Brief] RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo (CVPR 2023)

1. Brief introduction of the paper

1. First author: Changjiang Cai

2. Year of publication: 2023

3. Publication venue: CVPR (conference)

4. Keywords: MVS, depth map, GRU, Transformer, pose network

5. Motivation for Exploration: existing methods that apply 2D or 3D convolutional encoders to aggregate and regularize cost volumes have two main flaws.

  1. The 2D CNN methods use multi-level features as skip connections to help decode the cost volume for depth regression. Even though the skip connections improve the depth maps, they weaken the role of the cost volume and the geometric knowledge embedded in it. Hence, 2D CNN methods suffer degraded generalization when tested on unseen domains.
  2. The 3D CNN methods use soft-argmin to regress the depth map as the expectation over the cost volume distribution, and hence cannot pick the best candidate but only an averaged one when facing a flat or multi-modal distribution caused by textureless, repeated, or occluded regions (see the toy example below).
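For intuition, consider a toy numerical example (ours, not from the paper): suppose the cost volume distribution at a pixel puts probability 0.5 on a depth of 1 m and 0.5 on a depth of 3 m. Soft-argmin returns the expectation

```latex
\hat{d} \;=\; \sum_i p_i\, d_i \;=\; 0.5 \times 1\,\text{m} \;+\; 0.5 \times 3\,\text{m} \;=\; 2\,\text{m},
```

a depth supported by neither mode of the distribution.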

6. Goal of the work: Since the cost volume plays a crucial role in multi-view geometry coding, the goal is to improve its construction at the pixel and frame level.

7. Core idea: a “learning-to-optimize” paradigm that iteratively indexes a plane-sweeping cost volume and regresses the depth map via a convolutional Gated Recurrent Unit (GRU).

  1. We learn to index the cost volume, approaching the correct depth planes per pixel via an index field (a grid of indices identifying the depth hypotheses).
  2. To facilitate the optimization, we improve the cost volume at the pixel and frame levels, respectively. At the pixel level, a transformer block is asymmetrically applied to the reference view (but not to the source views). At the frame level, we propose a residual pose network to rectify the camera poses, which are usually obtained via visual SLAM and inevitably contain noise.

8. Experimental results: SOTA

We conduct extensive experiments on indoor-scene datasets, including ScanNet, DTU, 7-Scenes, and RGB-D Scenes V2. We also perform well-designed ablation studies to verify the effectiveness and generalization of our approach. Our method achieves state-of-the-art performance in terms of both within-dataset evaluation and cross-dataset generalization.

9. Paper download:

https://arxiv.org/pdf/2205.14320.pdf

2. Implementation process

1. RIAV-MVS comparison

First, RIAV-MVS builds on RAFT's GRU-based iterative optimization. However, whereas RAFT operates on an all-pairs correlation volume for optical flow, without multi-view geometric constraints (Figures (a) and (c)), this paper targets multi-view depth estimation by constructing a plane-sweeping cost volume (Figure (b)). Second, IterMVS iteratively predicts the depth and rebuilds the plane-sweeping cost volume with updated depth planes centered at the current prediction (Figure (d)). In contrast, as shown in Figure (e), the index field proposed in this paper is a new design that bridges cost volume optimization (i.e., learning better image features through backpropagation) and depth map estimation (i.e., sampling the swept planes), keeping the pipeline differentiable in both the forward and backward passes.

2. Overview of RIAV-MVS

The pipeline consists of feature extraction blocks (i.e., F-Net, the transformer layer, and C-Net), cost volume construction, cost volume optimization (the GRU-based index field optimization block and the residual pose block), and depth prediction.

3. Feature extraction

Given a reference image I0 and source images IS, F-Net extracts the matching features of I0 and each IS, while C-Net provides contextual features for I0.

Local matching feature extraction. The feature extractor F-Net is based on PairNet: a lightweight Feature Pyramid Network (FPN) built on top of the first 14 layers of MnasNet. Specifically, the reference image I0 of size H×W is spatially downscaled to 1/32 scale and progressively restored to 1/2 scale, producing multi-scale features. An additional fusion layer G aggregates them into a matching feature f0 at 1/4 scale, namely:
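The formula itself was an image in the original post and is missing here; based on the description of G that follows, a plausible reconstruction (an assumption, not copied from the paper) is the channel-wise concatenation of the rescaled multi-scale features followed by G:

```latex
f_0 \;=\; \mathcal{G}\big(\,\langle\, u_{x}(f^{(x)}) \,\rangle_{x}\,\big)
```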

where the fusion layer G is a sequence of operations (Conv3×3, batch normalization, ReLU, Conv1×1), with downsampling or upsampling by scale x to align resolutions, ⟨·⟩ denotes concatenation along the channel dimension, and f0 ∈ H/4×W/4×F1 with F1 = 128. Likewise, F-Net (sharing weights with the one applied to I0) is applied to each source image IS.
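To make the fusion step concrete, here is a minimal PyTorch sketch. The layer widths, pyramid handling, and module names are assumptions for illustration; only the operation sequence (rescale, concatenate along channels, then Conv3×3 → BN → ReLU → Conv1×1) follows the text.

```python
# Minimal sketch of the fusion layer G (shapes/names are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionLayerG(nn.Module):
    def __init__(self, in_channels: int, out_channels: int = 128):
        super().__init__()
        self.conv3x3 = nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.conv1x1 = nn.Conv2d(out_channels, out_channels, 1)

    def forward(self, pyramid_feats):
        # Resize every multi-scale feature map to 1/4 resolution
        # (we assume pyramid_feats[0] is already the 1/4-scale map).
        target = pyramid_feats[0].shape[-2:]
        resized = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
                   for f in pyramid_feats]
        # Concatenate along channels, then Conv3x3 -> BN -> ReLU -> Conv1x1.
        x = torch.cat(resized, dim=1)
        x = F.relu(self.bn(self.conv3x3(x)))
        return self.conv1x1(x)  # matching feature f0 with F1 = 128 channels
```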

Global matching features for the reference view. In addition to the local pixel-level features extracted by the CNN, global long-range information is exploited to better guide feature matching. To this end, a transformer layer (a four-head self-attention layer with positional encoding) is applied to the local features f0 of the reference image to construct an aggregated feature f0a ∈ H/4×W/4×F1:
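The equation was likewise an image in the original post; given the symbols defined next, a standard reconstruction (assuming the usual scaled dot-product self-attention with a zero-initialized residual weight, not verbatim from the paper) is

```latex
f_0^a \;=\; f_0 + w_\alpha \cdot \mathrm{MHSA}(f_0), \qquad
\mathrm{MHSA}(f_0) \;=\; \Big\Vert_{h=1}^{4}\;
\sigma\!\left(\frac{(f_0 w_q^h)(f_0 w_k^h)^{\top}}{\sqrt{d_h}}\right)(f_0 w_v^h)
```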

where σ(·) is the softmax operator, wα is a learned scalar weight initialized to zero, wq, wk, and wv are the projection matrices for the query, key, and value features, and h = 4 is the number of attention heads. The final output f0a contains both local and global information, balanced by the parameter wα, to enhance the subsequent cost volume construction.

It is worth noting that this transformer self-attention is applied only to the reference image, while the source features remain local representations from the CNN. This asymmetric use of the transformer layer better balances high-frequency features (via the high-pass CNN) and low-frequency features (via self-attention). High-frequency features benefit matching in local and structured regions, while low-frequency features suppress noise through the transformer's spatial smoothing (acting as a low-pass filter), providing more global context cues for robust matching, especially in regions with low texture, repeated patterns, occlusions, etc. This way, the network can learn where to rely on global features rather than local ones, and vice versa.

4. Cost volume construction

The cost volume is constructed from the global matching feature map f0a of I0 and the local matching features fS of IS. The cost (or matching) volume uniformly samples M0 = 64 plane hypotheses in inverse depth space, i.e., 1/d is sampled uniformly over [1/dmax, 1/dmin], where dmin and dmax are the near and far planes of the 3D frustum, respectively. For indoor scenes (such as ScanNet), dmin = 0.25 and dmax = 20 meters. The similarity between the two feature maps is computed via group-wise correlation, finally giving C0 ∈ H/4×W/4×M0.
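A minimal sketch of the plane-sweep construction with group-wise correlation follows. The homography warping helper `warp_to_plane` and the group count are assumptions (the warping depends on intrinsics and poses and is omitted); averaging the groups into a single score is also an assumption, made here only to match the stated shape of C0.

```python
# Sketch of group-wise correlation over plane-sweep hypotheses.
import torch

def group_correlation(ref_feat, src_feat_warped, groups=8):
    # ref_feat, src_feat_warped: [B, C, H, W]; split channels into groups.
    B, C, H, W = ref_feat.shape
    r = ref_feat.view(B, groups, C // groups, H, W)
    s = src_feat_warped.view(B, groups, C // groups, H, W)
    return (r * s).mean(dim=2)  # [B, G, H, W] per-group similarity

def build_cost_volume(ref_feat, src_feat, inv_depths, warp_to_plane, groups=8):
    # inv_depths: M0 = 64 hypotheses sampled uniformly in inverse depth.
    costs = []
    for inv_d in inv_depths:
        warped = warp_to_plane(src_feat, 1.0 / inv_d)  # hypothetical helper
        gc = group_correlation(ref_feat, warped, groups)
        costs.append(gc.mean(dim=1))  # aggregate groups (an assumption)
    return torch.stack(costs, dim=-1)  # C0: [B, H/4, W/4, M0]
```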

5. Iterative optimization based on GRU

Depth prediction is cast as a learning-to-optimize dense stereo matching problem. Given the cost volume C0, depth estimation for the reference image amounts to finding the optimal solution D* = argmin_D E(D, C0), where the energy function E(D, C0) contains a data term and a smoothness term. Unfortunately, this global minimization is NP-complete for such discontinuity-preserving energies. Instead of directly optimizing the energy function E, this paper learns to process the cost volume C0.
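For reference, a common form of such an energy (the standard MRF stereo formulation; the paper's exact instantiation may differ) is:

```latex
E(D, C_0) \;=\; \underbrace{\sum_{p} C_0\big(p, D(p)\big)}_{\text{data term}}
\;+\; \underbrace{\lambda \sum_{(p,q) \in \mathcal{N}} V\big(D(p), D(q)\big)}_{\text{smoothness term}}
```

where N is the set of neighboring pixel pairs and V penalizes depth differences while preserving discontinuities.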

To address the problem that SGM cannot be trained end-to-end, a GRU-based module is proposed to implicitly optimize over the matching volume. It estimates a sequence of index fields φt by unrolling the optimization problem into iterative updates (along the descent direction), mimicking the updates of a first-order optimizer. At each iteration t, the index field φt is estimated as an index grid that iteratively approaches the depth hypotheses with lower matching costs (i.e., closer to the ground truth). Specifically, the predicted residual index field δφt serves as the update direction for the next iteration, namely φt+1 = φt + δφt (similar to the direction r in SGM), and is obtained by training the whole system (feature encoder, transformer layer, residual pose network, etc.) to minimize the loss between the predicted depth map and the ground truth. The recurrent estimation of the index field anchors the learning directly in the cost volume domain. This indexing paradigm distinguishes the method from other depth estimation designs, such as depth regression that mixes the cost volume with multi-level 2D CNN skip features, and soft-argmin.

The index field is updated iteratively. Three chained GRUs estimate the index field sequence φt ∈ H/4×W/4, starting from an initial point φ0 given by a soft-argmin over the cost volume C0, i.e., φ0 = Σ_i i·σ(C0), where σ(·) is the softmax operator along the last dimension of C0, converting it into a probability for each index i. This warm start benefits the convergence of the predictions. Similar to RAFT, a four-level matching pyramid Ci ∈ H/4×W/4×M0/2^i is constructed by repeatedly pooling the cost volume C0 along the depth dimension with a kernel size of 2. To index the matching pyramid, a lookup operator is defined: given the current estimate of the index field φt, a 1D grid is constructed around φt with integer offsets r = ±4. Since φt is real-valued, the grid is indexed from each level of the matching pyramid by linear interpolation. The retrieved cost values are then concatenated into a single feature map Cφt ∈ H/4×W/4×F2. The retrieved cost features Cφt are concatenated with the context features f0c and fed into the GRU layers together with the latent hidden state ht. The GRU outputs a residual index field δφt and a new hidden state ht+1:
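The lookup operator can be sketched as follows, in the spirit of RAFT but along the depth dimension. Tensor layouts, boundary handling, and the ConvGRU cell are assumptions for illustration.

```python
# Sketch of the per-pixel 1D lookup around the current index field.
import torch

def lookup(cost_pyramid, phi, radius=4):
    """cost_pyramid: list of [B, H, W, M/2^i] tensors; phi: [B, H, W]."""
    feats = []
    for lvl, C in enumerate(cost_pyramid):
        centers = phi / (2 ** lvl)                  # index at this pyramid level
        offsets = torch.arange(-radius, radius + 1, device=phi.device)
        idx = centers.unsqueeze(-1) + offsets       # [B, H, W, 2r+1], real-valued
        lo = idx.floor().long().clamp(0, C.shape[-1] - 1)
        hi = (lo + 1).clamp(0, C.shape[-1] - 1)
        w = idx - idx.floor()                       # linear-interpolation weight
        c = (1 - w) * torch.gather(C, -1, lo) + w * torch.gather(C, -1, hi)
        feats.append(c)
    return torch.cat(feats, dim=-1)  # concatenated cost features C_t^phi

# One GRU iteration (gru is assumed to be a ConvGRU cell defined elsewhere):
#   x = concat(lookup(pyramid, phi_t), context_features)
#   h_next, delta_phi = gru(h_t, x)
#   phi_next = phi_t + delta_phi
```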

Upsampling and depth estimation. Given an index field φt, depth hypotheses are sampled by linear interpolation to estimate the depth map at iteration t. Since φt is at 1/4 resolution, it is upsampled to full resolution by a convex combination over 3×3 neighborhoods. Specifically, a weight mask W0 ∈ H/4×W/4×(4×4×9) is predicted from the hidden state ht with two convolutional layers, and a softmax is taken over the weights of the 9 neighbors. The final high-resolution index field φut is obtained as the weighted combination of the 9 neighbors, reshaped to resolution H×W.
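A sketch of the RAFT-style convex upsampling of the index field; the mask is assumed to come from two convolutional layers on the hidden state (not shown), and the 4× factor and 3×3 neighborhood follow the text.

```python
# RAFT-style convex upsampling of the 1/4-resolution index field.
import torch
import torch.nn.functional as F

def convex_upsample(phi, mask, factor=4):
    """phi: [B, 1, H/4, W/4]; mask: [B, 9 * factor**2, H/4, W/4]."""
    B, _, H, W = phi.shape                       # here H, W are the 1/4-scale dims
    mask = mask.view(B, 1, 9, factor, factor, H, W)
    mask = torch.softmax(mask, dim=2)            # convex weights over 9 neighbors
    nbr = F.unfold(phi, kernel_size=3, padding=1)  # gather 3x3 neighborhoods
    nbr = nbr.view(B, 1, 9, 1, 1, H, W)
    up = (mask * nbr).sum(dim=2)                 # weighted combination
    up = up.permute(0, 1, 4, 2, 5, 3).reshape(B, 1, factor * H, factor * W)
    return up                                    # full-resolution index field
```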

When constructing the cost volume, M0 = 64 depth hypotheses B0 = {di} are used; a smaller M0 reduces computation and memory. However, if the planes B0 are sampled directly with the upsampled index field φut, discontinuities appear in the inferred depth map, even though the quantitative evaluation is not hurt. To alleviate this, a coarse-to-fine scheme is proposed, using a denser set of hypotheses B1 = {di} with M1 = 256. As in the upsampling of optical flow or binocular stereo disparity, where the flow or disparity values themselves must be scaled during spatial upsampling, the depth index field is scaled by sD = M1/M0 = 4. To mimic the convex combination described above, a similar weighted summation is applied along the depth dimension when sampling depths from B1. Specifically, another mask W ∈ H×W×sD×M0 is predicted from the hidden state with three convolutional layers and further reshaped to H×W×M1. Given a pixel p and the upsampled index field φut, the final depth Dt is estimated as:
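The final equation was an image in the original post; a plausible reconstruction from the where-clause below (with the scaled index sD·φut(p) as the aggregation center, which is an assumption based on the scaling just described) is

```latex
D_t(p) \;=\; \sum_{\,|\,i - s_D\,\varphi^u_t(p)\,| \,\le\, r} W(p, i)\;\hat{B}_1[i],
\qquad
\hat{B}_1[i] \;=\; (1-\alpha)\,B_1\big[\lfloor i \rfloor\big] + \alpha\,B_1\big[\lfloor i \rfloor + 1\big],
\;\; \alpha = i - \lfloor i \rfloor
```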

where the neighbors of a given pixel p within a radius r = 4 centered on the index φut(p) are aggregated, ⌊i⌋ denotes the largest integer less than or equal to i, and B1 is indexed by linear interpolation because the index i is real-valued.

It is worth mentioning that this method combines regression (akin to soft-argmin) and classification (akin to argmin) over existing designs, making it robust to multi-modal distributions while achieving sub-pixel accuracy through linear interpolation. The combination of classification and regression has appeared in IterMVS, but here no argmax operator is needed for the "classification" part, thanks to the proposed index field estimation, which directly connects cost volume indices and depth hypotheses with sub-pixel accuracy.

6.  Residual pose network

An accurate cost volume benefits the GRU-based iterative optimization. The quality of the generated cost volume C0 is determined not only by the matching features (f0a and fS) but also by the homography warping. In practice, however, camera poses are usually obtained by visual SLAM algorithms and inevitably contain noise. Therefore, a residual pose network is proposed to correct the camera poses for accurate feature matching. The reference image and the warped source image are encoded with an ImageNet-pretrained ResNet18 backbone. Specifically, given the current estimated depth map Dt and the ground truth Dgt at iteration t, the source image Ii is warped toward the reference image via a homography using the (noisy) camera pose Θ and either Dt or Dgt; Dt is chosen with probability prob(Dt) = 0.6 during training, while the predicted depth Dt is always used at inference. The input of the pose network is the concatenation of I0 and the warped source image, and the output is an axis-angle representation, which is further converted into a residual rotation matrix ∆θi used to update θi' = ∆θi·θi. In this way, a residual pose ∆Θ is predicted for each source-reference pair and the poses are corrected as Θ' = ∆Θ·Θ. The updated poses Θ' are used to compute a more accurate cost volume C1, followed by the remaining GRU iterations.
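A sketch of the pose-correction step: converting the predicted axis-angle residual to a rotation matrix via Rodrigues' formula and left-multiplying the noisy rotation. The pose network itself (ResNet18 on the concatenated images) is omitted; treating the residual as rotation-only follows the text.

```python
# Sketch of composing the predicted residual rotation with the noisy pose.
import torch

def axis_angle_to_matrix(aa):
    """Rodrigues' formula; aa: [B, 3] axis-angle vectors."""
    theta = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = aa / theta                                   # unit rotation axis
    K = torch.zeros(aa.shape[0], 3, 3, device=aa.device)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]       # skew-symmetric matrix of k
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    I = torch.eye(3, device=aa.device).expand_as(K)
    s, c = torch.sin(theta)[..., None], torch.cos(theta)[..., None]
    return I + s * K + (1 - c) * (K @ K)

def correct_pose(aa_residual, R_noisy):
    """Apply the residual rotation: theta' = delta_theta @ theta."""
    return axis_angle_to_matrix(aa_residual) @ R_noisy
```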

7. Loss function

The network is supervised with an inverse-depth L1 loss between the predicted depth Dt and the ground truth Dgt, evaluated on valid pixels (i.e., those with non-zero ground-truth depth). With exponentially growing weights over the iterations, the depth loss is defined as:
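The formula was an image in the original post; a reconstruction assuming RAFT-style exponential weighting over the T iterations (consistent with the symbols defined next) is

```latex
L_D \;=\; \sum_{t=1}^{T} \gamma^{\,T-t}\, \frac{1}{N_v} \sum_{p \,\in\, \mathcal{V}}
\left| \frac{1}{D_t(p)} - \frac{1}{D_{gt}(p)} \right|
```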

where Nv is the number of valid pixels and γ = 0.9. The residual pose network is supervised with a photometric loss LP. The total loss is defined as L = LD + LP.

8. Experiments

Comparison with the state of the art


Source: blog.csdn.net/qq_43307074/article/details/130050914