[Paper Brief] DSC-MVSNet: attention aware cost volume regularization based on depthwise separable convolution for multi-view stereo (CIS 2023)

1. Brief description of the paper

1. First author: Song Zhang

2. Year of publication: 2023

3. Published in journal: Complex & Intelligent Systems

4. Keywords: MVS, 3D reconstruction, depthwise separable convolution, channel attention

5. Exploration motivation: It is difficult for MVS methods based on deep learning to balance efficiency and effectiveness.

6. Work goal: The main research problem is how to significantly reduce the computational cost while maintaining reconstruction quality.

7. Core idea: We propose DSC-MVSNet, a novel coarse-to-fine, end-to-end framework for more efficient and more accurate depth estimation in MVS.

  1. We propose a 3D UNet-shaped network that, for the first time, applies depthwise separable convolution to 3D cost volume regularization, which effectively improves model efficiency while maintaining performance.
  2. We propose a 3D attention module that strengthens cost volume regularization, fully aggregating the valuable information in the cost volume and alleviating the feature-mismatch problem.
  3. We propose an effective and efficient feature transfer module that upsamples the LR depth map into an HR depth map for higher-quality reconstruction.

8. Experimental results:

On the DTU dataset and the Tanks and Temples benchmark, DSC-MVSNet achieves reconstruction quality competitive with state-of-the-art learning-based MVS methods while requiring substantially less memory and computation.

9. Paper download:

https://link.springer.com/content/pdf/10.1007/s40747-023-01106-3.pdf?pdf=button

https://github.com/zs670980918/DSC-MVSNet

2. Implementation process

1. Overview

  1. Use the information feature extraction network to extract feature maps from the input images;
  2. Use the DSC-Attention 3D UNet to regularize the coarse cost volume of size C×D×1/8H×1/8W;
  3. Use the feature transfer module to upsample the LR depth map Ds∈1×1/8H×1/8W to the HR depth map Dd∈1×1/4H×1/4W;
  4. Feed the input images and the HR depth map through the Gauss-Newton network layer to obtain the refined depth map Dr∈1×1/4H×1/4W;
  5. Finally, fuse the refined depth maps of all views into a point cloud (a shape-level sketch of this pipeline follows below).
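The following minimal, shape-level sketch mirrors only the tensor shapes listed above. Every stage is a placeholder (random tensors, bicubic interpolation standing in for the FTM), and the batch size, channel count C = 32, hypothesis count D = 48 and depth range are assumptions, so it illustrates the data flow rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

B, C, D, H, W = 1, 32, 48, 512, 640        # batch, feature channels, depth hypotheses, image size (assumed)

# 1. feature extraction -> per-view feature maps at 1/8 resolution (placeholder)
features = torch.randn(B, C, H // 8, W // 8)

# 2. cost volume built from all views, regularized by the DSC-Attention 3D UNet (placeholder),
#    then turned into a probability volume by a softmax over the depth hypotheses
cost_volume = torch.randn(B, C, D, H // 8, W // 8)
prob_volume = torch.softmax(torch.randn(B, D, H // 8, W // 8), dim=1)

# 3. soft-argmax over the hypotheses gives the LR depth map Ds (1 x H/8 x W/8)
depth_values = torch.linspace(425.0, 935.0, D).view(1, D, 1, 1)   # hypothetical depth range
lr_depth = torch.sum(prob_volume * depth_values, dim=1, keepdim=True)

# 4. the feature transfer module upsamples Ds to the HR depth map Dd (1 x H/4 x W/4);
#    bicubic interpolation is only a stand-in for the FTM here
hr_depth = F.interpolate(lr_depth, scale_factor=2, mode="bicubic", align_corners=False)

# 5. a Gauss-Newton refinement layer would further improve hr_depth before point-cloud fusion
print(lr_depth.shape, hr_depth.shape)      # torch.Size([1, 1, 64, 80]) torch.Size([1, 1, 128, 160])
```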

2. 3D depthwise separable convolution (3D-DSC)

The standard 3D convolution is factorized into a 3D depthwise convolution (applied independently to each channel, aggregating the cost volume information along the depth and spatial dimensions within that channel) and a 3D pointwise convolution (a 1×1×1 convolution that aggregates the cost volume information across channels).

3D depthwise convolution. The 3D depthwise convolution is performed independently on each channel of the cost volume to obtain channel-independent intermediate feature maps, defined as follows:

In the formula, W1 represents the weights of the 3D depthwise convolution, V∈C×D×H×W represents the cost volume, i, j, u represent the position indices, and K, L, M represent the kernel sizes of the convolution.
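The equation itself is not reproduced above; based on these symbols, a plausible reconstruction (one filter per channel c, with no summation over channels) is:

$$\hat{V}_{c,i,j,u} \;=\; \sum_{k=1}^{K}\sum_{l=1}^{L}\sum_{m=1}^{M} W^{1}_{c,k,l,m}\, V_{c,\,i+k,\,j+l,\,u+m}$$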

3D pointwise convolution. 3D pointwise convolution acts on these channel-independent feature maps to aggregate channel-related information, as defined:

In the formula, W2 represents the weights of the 3D pointwise convolution, V̂∈C×D×H×W represents the intermediate feature maps, and N represents the kernel size of the convolution.
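Assuming the standard 1×1×1 point-wise form, a plausible reconstruction of this equation (a weighted sum over the channels of the intermediate feature maps) is:

$$\tilde{V}_{c',i,j,u} \;=\; \sum_{c=1}^{C} W^{2}_{c',c}\, \hat{V}_{c,i,j,u}$$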

These two convolutions are applied in sequence to form the complete depthwise separable convolution: the pointwise convolution acts on the output of the depthwise convolution.
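In PyTorch this composition corresponds to a grouped 3D convolution (groups equal to the channel count) followed by a 1×1×1 convolution. The sketch below is a minimal illustration under these assumptions; layer names, channel widths and the BN/ReLU placement are not taken from the released code.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    """3D depthwise separable convolution: a per-channel 3D convolution
    (groups = in_channels) followed by a 1x1x1 point-wise convolution
    that mixes channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, padding=1):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=padding, groups=in_ch, bias=False)
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (B, C, D, H, W) cost volume
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

# cost volume of shape (batch, channels, depth hypotheses, H/8, W/8)
v = torch.randn(1, 32, 48, 64, 80)
print(DepthwiseSeparableConv3d(32, 32)(v).shape)   # torch.Size([1, 32, 48, 64, 80])
```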

The paper theoretically compares the 3D-DSC regularization scheme with other mainstream regularization schemes and argues for its effectiveness. In the paper's illustration, the receptive field of a voxel is shown in cyan, the horizontal axis is the depth dimension, the vertical axis is the channel dimension, and the height H and width W are collapsed into a single dimension.

(a) Spatial regularization (SR) filters the cost volume at different depths, but its small receptive field strongly limits the quality of the regularization.
(b) 3D CNN regularization (3D-CNN) uses a 3D CNN to obtain a larger receptive field for cost volume regularization, but brings a much higher computational cost.
(c) Recurrent regularization is an RNN-based approach that processes the cost volume sequentially as depth-independent cost maps to reduce the computational cost.
(d) 3D-DSC regularization is a DSC-based approach that splits the cost volume into channel-independent intermediate feature maps and then applies a pointwise convolution to relate these feature maps, maintaining the performance of the model.
Compared with SR, 3D-DSC obtains a larger receptive field; compared with 3D-CNN regularization, which performs well but is computationally expensive, 3D-DSC maintains comparable performance at a much lower cost, as the rough parameter-count comparison below suggests.
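As a back-of-the-envelope comparison (the standard parameter-count argument for depthwise separable convolutions, not figures quoted from the paper): a full 3D convolution with Cin input channels, Cout output channels and a K×K×K kernel has Cin·Cout·K³ weights, while the 3D-DSC pair needs only Cin·K³ + Cin·Cout, giving the ratio

$$\frac{C_{in}K^{3} + C_{in}C_{out}}{C_{in}C_{out}K^{3}} \;=\; \frac{1}{C_{out}} + \frac{1}{K^{3}}$$

With K = 3 and Cout = 32, for example, this is about 1/27 + 1/32 ≈ 0.07, i.e. more than an order of magnitude fewer weights per layer.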

3. 3D attention module (3DA)

Although 3D-DSC can effectively aggregate cost volume information, a feature-mismatch problem remains that degrades the quality of the cost volume. Feature mismatch occurs when the features of different keypoints are matched to each other, which causes the cost volume to have similar confidence values at different depths and ultimately leads to inaccurate depth estimation. Specifically, as shown in the lower part of Figure 3, a reference feature matches two similar source features (the two hands of the Buddha statue) at different depths, so the confidence values at those depths are similar in the cost volume. These similar confidences degrade the quality of the depth map, and the 3DA module is used to alleviate the problem. (In the figure, red voxels represent similar confidences and light red represents weakened confidences.)

Since an attention mechanism can highlight important information by computing different weights, it is used to address the feature-mismatch problem. The 3D attention module consists of two blocks that alleviate the problem by exploiting the information of the entire cost volume to compute attention weights, which enhance or weaken the confidences of similar matches at different depths.

Channel attention block. The channel attention block applies attention to the channel dimension. It is built from a multi-layer perceptron (MLP) that acts on the channels of the cost volume V∈C×D×H×W to produce the channel attention weight Ŵ. Multiplying the channel weight Ŵ with the cost volume V gives the channel-refined cost volume V'∈C×D×H×W. The channel attention block is defined as:

Here MaxPool denotes max pooling and AvgPool denotes average pooling, Ŵ∈C is the channel attention enhancement weight, and the two branches share the weights of the MLP.
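A minimal PyTorch sketch of such a block is given below, assuming the usual CBAM-style form (shared two-layer MLP on the max- and average-pooled channel descriptors, a sigmoid, and per-channel rescaling); the reduction ratio r is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Channel attention over a cost volume V of shape (B, C, D, H, W):
    global max/avg pooling over (D, H, W), a shared two-layer MLP and a
    sigmoid produce one weight per channel."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, v):                          # v: (B, C, D, H, W)
        b, c = v.shape[:2]
        max_desc = v.flatten(2).max(dim=2).values  # (B, C)
        avg_desc = v.flatten(2).mean(dim=2)        # (B, C)
        w = torch.sigmoid(self.mlp(max_desc) + self.mlp(avg_desc))  # shared MLP
        return v * w.view(b, c, 1, 1, 1)           # channel-refined cost volume V'

v = torch.randn(1, 32, 48, 64, 80)
print(ChannelAttention3D(32)(v).shape)             # torch.Size([1, 32, 48, 64, 80])
```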

Spatial depth attention block. Unlike ordinary attention that uses full perception (without distinguishing between space and depth), the spatial depth attention block perceives cost information along two different dimensions, space and depth, according to the composition of the cost volume. First, the cost volume is filtered along the spatial direction with a space-oriented anisotropic convolution with a kernel size of 1×7×7 (different positions at the same depth), which reduces noise while preserving useful matching information at the same depth and provides more accurate spatial information for the following depth-oriented convolution. Then, a depth-oriented anisotropic convolution with a kernel size of 7×1×1 (the same position at different depths) acts on the depth dimension, effectively enhancing or weakening the matching information at different depths of the same spatial position. Finally, an isotropic convolution with a kernel size of 7×7×7 acts on both space and depth to fully aggregate the information of the previous steps. The spatial depth attention block is defined as:

where σ is the activation function, W̄∈1×D×H×W is the spatial-depth weight, f1×7×7 is the spatial convolution, f7×1×1 is the depth convolution, and f7×7×7 is the overall (isotropic) convolution.
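Under the kernel sizes stated above, a minimal sketch of this block could look as follows; the channel widths, padding choices and the placement of the sigmoid are assumptions rather than details from the released code.

```python
import torch
import torch.nn as nn

class SpatialDepthAttention3D(nn.Module):
    """Spatial-depth attention: a 1x7x7 space-oriented convolution, a 7x1x1
    depth-oriented convolution and a 7x7x7 isotropic convolution produce a
    single-channel weight volume of shape (1, D, H, W)."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, (1, 7, 7), padding=(0, 3, 3))
        self.depth   = nn.Conv3d(channels, channels, (7, 1, 1), padding=(3, 0, 0))
        self.fuse    = nn.Conv3d(channels, 1, (7, 7, 7), padding=3)

    def forward(self, v):                                # v: (B, C, D, H, W)
        w = torch.sigmoid(self.fuse(self.depth(self.spatial(v))))  # (B, 1, D, H, W)
        return v * w                                     # broadcast over the channel dimension

v = torch.randn(1, 32, 48, 64, 80)
print(SpatialDepthAttention3D(32)(v).shape)              # torch.Size([1, 32, 48, 64, 80])
```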

These two blocks are cascaded to form the 3D attention module: the cost volume is first refined by the channel attention block and the result is then refined by the spatial depth attention block.

After regularization, a softmax operation along the depth direction maps all values into [0, 1] to form a probability volume P for depth estimation. Finally, the hypothesized depth values of the different planes are weighted by the probability volume P to obtain the LR depth map D~s:
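Written out, this is the standard soft-argmax (an expectation over the depth hypotheses) used in MVSNet-style networks; with dhyp(d) denoting the d-th hypothesized depth value, a consistent form is:

$$\tilde{D}_s(p) \;=\; \sum_{d=1}^{D} d_{hyp}(d)\cdot P(d, p)$$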

4. Feature transfer module

The high-resolution depth map obtained by upsampling directly affects the quality of the point cloud results. In order to obtain high-resolution and accurate depth maps, a Feature Transfer Module (FTM) for low-resolution (LR) depth map upsampling is proposed.

The input of the FTM is the three-channel reference image I0∈3×H×W and the single-channel LR depth map Ds∈1×1/8H×1/8W. To unify the scales of the inputs, the LR depth map Ds is first upsampled with bicubic interpolation to obtain a larger-scale depth map D~s∈1×1/4H×1/4W, and the reference image is downsampled into a 16-channel feature map I0∈16×1/4H×1/4W. After unification, a shared offset-and-weight extraction backbone is used to obtain the offsets of the reference image and the depth map. The backbone consists of a seven-layer convolutional feature extraction network, an offset convolution, a weight convolution and a sigmoid layer, and is defined as:

In the formula, fFE denotes the feature extraction network, foc denotes the offset convolution, fwc denotes the weight convolution, and sigmoid denotes the sigmoid layer.

The OWC block is then used to compute weights ∈ k²/16×1/4H×1/4W and offsets ∈ k²/8×1/4H×1/4W for guided depth map upsampling, where k is a hyperparameter set to 12. Specifically, the corresponding offsets and weights are multiplied together and the results are passed through PixelShuffle to obtain the target offsets and weights. The offsets are then used to guide feature sampling, and the sampled features are multiplied by the weights to obtain the final result. Finally, the HR depth map is obtained through a residual addition block. The above process is defined as:

Here fps denotes PyTorch's PixelShuffle operation, fgs denotes the grid_sample function, and Dres denotes the depth residual.
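A minimal sketch of the sampling mechanics (pixel_shuffle plus grid_sample, as named above) is given below. The channel layouts of the offset and weight tensors, the sigmoid on the weights and the residual composition are assumptions, and the tensors themselves are random stand-ins for the outputs of the OWC block.

```python
import torch
import torch.nn.functional as F

B, h, w = 1, 64, 80                                    # 1/8-resolution LR depth map
lr_depth = torch.randn(B, 1, h, w)
H, W = 2 * h, 2 * w                                    # 1/4-resolution target

offsets = torch.randn(B, 2 * 4, h, w) * 0.01           # 2 coordinates x 2^2 sub-positions (assumed layout)
weights = torch.randn(B, 1 * 4, h, w)

offsets = F.pixel_shuffle(offsets, 2)                  # (B, 2, H, W)
weights = torch.sigmoid(F.pixel_shuffle(weights, 2))   # (B, 1, H, W)

# base sampling grid in normalized [-1, 1] coordinates, perturbed by the learned offsets
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)      # (B, H, W, 2)
grid = grid + offsets.permute(0, 2, 3, 1)              # offset-guided sampling positions

sampled = F.grid_sample(lr_depth, grid, align_corners=True)        # (B, 1, H, W)
upsampled = F.interpolate(lr_depth, size=(H, W), mode="bicubic",
                          align_corners=False)
hr_depth = upsampled + sampled * weights               # residual-style addition (assumed composition)
print(hr_depth.shape)                                  # torch.Size([1, 1, 128, 160])
```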

5. Information feature extraction network

Many previous methods use only sequential convolution operations to extract feature maps from the input images {Ii}, so the feature maps contain only high-level semantic information, and the loss of low-level spatial information affects the quality of the reconstruction results. Therefore, an information feature extraction network is proposed that uses skip connections to propagate low-level spatial information and aggregate multi-level feature information. The network has three components (Encoder, Decoder, Adjuster); the architecture is detailed in a table in the paper. Each convolutional layer denotes a block of convolution, batch normalization (BN) and ReLU, and "sp" denotes a skip connection.
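Since the architecture table is given only in the paper, the sketch below is purely illustrative of the described structure (conv+BN+ReLU blocks, an encoder, a decoder level with a skip connection, and an "Adjuster" head); the layer counts and channel widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout, stride=1):
    # convolution + BN + ReLU, the "conv block" described above
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class InformationFeatureExtractor(nn.Module):
    """Illustrative encoder-decoder with a skip connection ("sp")."""
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 8)               # H x W
        self.enc2 = conv_block(8, 16, stride=2)    # 1/2
        self.enc3 = conv_block(16, 32, stride=2)   # 1/4
        self.enc4 = conv_block(32, 64, stride=2)   # 1/8
        self.enc5 = conv_block(64, 64, stride=2)   # 1/16
        self.dec4 = conv_block(64 + 64, 64)        # decoder level fed by the skip connection from enc4
        self.adjust = nn.Conv2d(64, 32, 1)         # "Adjuster": final 32-channel feature map

    def forward(self, img):                        # img: (B, 3, H, W)
        e4 = self.enc4(self.enc3(self.enc2(self.enc1(img))))
        e5 = self.enc5(e4)
        up = F.interpolate(e5, scale_factor=2, mode="bilinear", align_corners=False)
        return self.adjust(self.dec4(torch.cat([up, e4], dim=1)))   # (B, 32, H/8, W/8)

print(InformationFeatureExtractor()(torch.randn(1, 3, 512, 640)).shape)  # torch.Size([1, 32, 64, 80])
```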

6. Cost volume construction

The cost volume is constructed from the warped feature volumes {Vi} of all views and is defined as:

where V̄ is the average volume of all feature volumes.
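Assuming the variance-based metric of MVSNet (which is consistent with the mention of the average volume), the construction over N warped feature volumes can be written as:

$$\overline{\mathbf{V}} \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathbf{V}_i, \qquad \mathbf{C} \;=\; \frac{1}{N}\sum_{i=1}^{N}\big(\mathbf{V}_i - \overline{\mathbf{V}}\big)^{2}$$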

7. Depth map refinement

The depth map obtained in the previous step is of limited quality and needs further refinement. In Fast-MVSNet, the Gauss-Newton network layer is an effective and efficient depth map refinement module, so a Gauss-Newton network layer is used here to refine the depth map D∈1×1/4H×1/4W for MVS reconstruction.
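For reference, the Gauss-Newton layer in Fast-MVSNet performs a single per-pixel least-squares update of the depth; the form below is sketched from the Fast-MVSNet formulation (with Fi the deep features of view i and p'i the reprojection of pixel p under the current depth), not reproduced from this paper:

$$r_i(p) = F_i\big(p'_i(D(p))\big) - F_0(p), \qquad \delta(p) = -\big(J(p)^{\top}J(p)\big)^{-1}J(p)^{\top}r(p), \qquad D_r(p) = D(p) + \delta(p)$$

where r(p) stacks the residuals of all source views and J(p) their Jacobians with respect to the depth.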

8. Training loss

The mean absolute error between the predicted depth maps and the ground-truth depth map is used as the training loss:

In the formula, D~d is the HR depth map, D~r is the refined depth map, D~ is the ground-truth depth map, pvalid is the set of valid points of the ground-truth depth map, and λ balances loss1(p) and loss2(p). During training, λ is usually set to 1.0.
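A reconstruction consistent with these symbols, up to normalization over the valid points and with the assumption that λ scales the HR-depth term, is:

$$Loss \;=\; \sum_{p \in p_{valid}} \Big( \lambda\,\underbrace{\big|\tilde{D}_d(p) - \tilde{D}(p)\big|}_{loss_1(p)} \;+\; \underbrace{\big|\tilde{D}_r(p) - \tilde{D}(p)\big|}_{loss_2(p)} \Big)$$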

9. Experiment

9.1. Implementation details

The RMSProp optimizer is used with an initial learning rate of 0.0008 and a decay weight of 0.002 per epoch. The batch size is set to 16, and training is performed on 6 NVIDIA RTX 2080 Ti GPUs.
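A minimal sketch of this optimizer setup is shown below; `model` is a placeholder module, and interpreting the per-epoch decay of 0.002 as a multiplicative learning-rate factor of (1 - 0.002) is an assumption.

```python
import torch

model = torch.nn.Linear(8, 1)   # stand-in for DSC-MVSNet
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.0008)
# per-epoch learning-rate decay of 0.002, interpreted as lr <- lr * (1 - 0.002)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=1 - 0.002)
```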

9.2. Comparison with state-of-the-art methods
