[Brief description of the paper] Adaptive region aggregation for MVS matching using deformable convolutional network (2023)

1. Brief description of the paper

1. First author: Han Hu

2. Date: 2023

3. Published in journal: The Photogrammetric Record

4. Keywords: MVS, 3D reconstruction, deep learning, adaptive convolution

5. Exploration motivation: Despite encouraging results obtained using adaptive matching windows, the determination of geometric priors remains a challenge.

Ideally, the estimation of photo-consistency should be confined to a certain region, for example, within the boundaries of the target object. Low-level geometric features, such as line contours, planar segments or superpixels, are typically used to adaptively select the supporting domain. However, low-level features are vulnerable to noise and may not coincide with object boundaries. For instance, even correctly detected line contours do not necessarily delineate the contours of discontinuous regions. Therefore, determining meaningful geometric priors requires a high-level semantic understanding of the object rather than low-level geometric cues.

6. Work goal: Improve the effectiveness of feature matching.

7. Core idea: This paper proposes an MVS adaptive region aggregation method using deformable convolutional networks (DCNs), with two main contributions:

  1. a learnable adaptive region aggregation method for MVSNet, based on DCNs, for effective matching of descriptors;
  2. a dedicated offset regulariser on the learnable offsets of the DCN to enhance its convergence.

8. Experimental results:

The proposed method outperforms the state-of-the-art method in dynamic areas with a significant error reduction of 21.3%, while retaining its superiority in overall performance on KITTI. It also achieves better generalisation than the competing methods in dynamic areas on the DDAD dataset.

9. Paper download:

https://github.com/YueyueBird-su/DFPN_Module_in_MVS

https://onlinelibrary.wiley.com/doi/10.1111/phor.12459

2. Implementation process

1. Deformable feature extractor with adaptive aggregation window

The impressive feature-learning capabilities of CNNs allow the creation of advanced features. As mentioned above, determining suitable support domains for computing photometric consistency requires a semantic understanding of the scene; that is, the adaptive aggregation window should exploit the latent features obtained through the CNN layers. DCNs introduce additional pixel shifts into the convolution and sample features at the shifted locations, resulting in irregular receptive fields, which makes them well suited to modelling adaptive aggregation windows for image matching. The figure in the paper demonstrates how a 3 × 3 deformable convolution learns offsets from N-dimensional features and performs the convolution at the shifted locations.

This can effectively overcome the problem of inaccurate depth maps in boundary areas, which are caused by image content that is inconsistent across different viewpoints.
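A minimal sketch of how such an adaptive aggregation window can be built from a deformable convolution, using torchvision's DeformConv2d (an illustrative layer only, not the authors' exact network; the module name and channel sizes are assumptions):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableAggregation(nn.Module):
    """3 x 3 deformable convolution whose sampling offsets are predicted
    from the input features, giving an adaptive aggregation window."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Plain convolution that regresses 2 offsets (dy, dx) per kernel point.
        self.offset_conv = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size=kernel_size, padding=pad)
        nn.init.zeros_(self.offset_conv.weight)  # start from a regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(channels, channels,
                                        kernel_size=kernel_size, padding=pad)

    def forward(self, x: torch.Tensor):
        offset = self.offset_conv(x)        # (B, 2*K*K, H, W)
        out = self.deform_conv(x, offset)   # features aggregated at shifted locations
        return out, offset                  # offsets are kept for the regulariser below

# Example: one level of a matching feature extractor.
feat = torch.randn(1, 32, 64, 80)
layer = DeformableAggregation(32)
out, offset = layer(feat)                   # out: (1, 32, 64, 80), offset: (1, 18, 64, 80)
```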

2. Loss functions

In the coarse-to-fine matching strategy, multi-scale depth maps D^l are generated. Because the quality of the fine-level matching depends on the coarse-level results, all depth maps contribute to the final loss, and an L1 loss is used at each scale. The ground-truth depth map is downsampled by bilinear interpolation to match each scale. The final loss over the multi-scale depth maps is defined as:
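A reconstruction consistent with the description above (the notation may differ slightly from the paper), with $\hat{D}^l$ the downsampled ground-truth depth and $\Omega^l$ the set of valid pixels at scale $l$:

$$L_D = \sum_{l} \lambda_l \, \frac{1}{|\Omega^l|} \sum_{p \in \Omega^l} \bigl| D^l(p) - \hat{D}^l(p) \bigr|$$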

where λ = {2, 1, 0.5} is a set of decreasing weights that balances the difference in image resolution across scales.
Since the depth loss L_D models deviations of continuous values, which correspond to low-frequency information, it is less informative for training than the one-hot representation used in semantic classification; networks trained with semantically informative labels generally converge better. Although the feature extraction network has significantly fewer parameters than a classification network, the kernel offsets of the DCN still converge poorly when supervised by the depth loss alone. Because the descriptors used for similarity matching are usually localised, and the framework already stacks multiple layers to enlarge the receptive field, the following offset regulariser is used to constrain the offset distances in all offset maps of the DCN:
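A plausible form of this regulariser, reconstructed from the description below (the exact formula in the paper may differ), in which offsets shorter than 3 pixels are truncated to zero and thus incur no penalty:

$$L_O = \frac{1}{N} \sum_{k} \sum_{p} \tilde{o}_k(p), \qquad \tilde{o}_k(p) = \begin{cases} o_k(p), & o_k(p) \ge 3 \\ 0, & \text{otherwise} \end{cases}$$

where $k$ indexes the offset maps, $p$ the kernel points, and $N$ is the total number of summed terms.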

Here the offset distance of each kernel point p is recorded as o_k(p) = √(x² + y²), and the offset is truncated to 0 if it is less than 3 pixels. The two losses are balanced by an empirical weight and jointly used for backpropagation: L = L_D + 10L_O.
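A minimal sketch of how this regulariser could be computed from the offset maps produced by the deformable layer sketched earlier, together with the combined loss (the variable names, the exact truncation form and the offset layout are assumptions):

```python
import torch

def offset_regulariser(offsets: torch.Tensor, threshold: float = 3.0) -> torch.Tensor:
    """Penalise large DCN sampling offsets.

    offsets: (B, 2*K*K, H, W) tensor holding a (dy, dx) pair for each of the
    K*K kernel points. Offsets shorter than `threshold` pixels are truncated
    to zero and therefore incur no penalty.
    """
    b, c, h, w = offsets.shape
    pairs = offsets.view(b, c // 2, 2, h, w)          # regroup into (dy, dx) pairs
    dist = torch.linalg.vector_norm(pairs, dim=2)     # o_k(p) = sqrt(dx^2 + dy^2)
    dist = torch.where(dist < threshold, torch.zeros_like(dist), dist)
    return dist.mean()

# Combined objective as described above: L = L_D + 10 * L_O.
depth_loss = torch.tensor(0.42)        # placeholder for the multi-scale L1 depth loss
offsets = torch.randn(1, 18, 64, 80)   # offsets from the deformable layer
total_loss = depth_loss + 10.0 * offset_regulariser(offsets)
```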


Reposted from: blog.csdn.net/qq_43307074/article/details/132101597