[Paper Description] Occlusion-Aware Depth Estimation with Adaptive Normal Constraints (ECCV 2020)

1. Brief description of the paper

1. First author: Xiaoxiao Long

2. Year of publication: 2020

3. Published in: ECCV (conference)

4. Keywords: multi-view depth estimation, normal constraints, occlusion awareness strategy, deep learning

5. Exploration motivation: Most existing deep learning methods fail to preserve prominent 3D shape features, such as corners, sharp edges, and planes, because they use only depth for supervision, so their loss functions are not designed to preserve these structures. This problem is particularly detrimental when reconstructing indoor scenes with man-made objects or regular shapes. Another issue is the performance degradation caused by depth ambiguity in occluded regions, which is ignored by most existing works.


6. Work objectives: A new depth estimation method using a single moving color camera is proposed, which aims to preserve important local features (edges, corners, high-curvature features) as well as planar regions. One video frame serves as the reference image and the other frames serve as source images for estimating the depth of the reference frame.

7. Core idea: Initial depth maps are first predicted for the reference view from each reference/source image pair; a new occlusion-aware strategy then merges these initial depth maps into the final reference-view depth map, together with an occlusion probability map.

  1. Our first main contribution is a new structure-preserving constraint enforced during training: a new Combined Normal Map (CNM) constraint that adapts to local geometric features, covering both local high-curvature regions and global planar regions. To train the network, a differentiable least-squares module computes normals directly from the estimated depth, and the CNM is used as ground truth in addition to the standard depth loss.
  2. Our second contribution is a new neural network that combines depth maps predicted with individual source images into one final reference-view depth map, together with an occlusion probability map. It uses a novel occlusion-aware loss function which assigns higher weights to the non-occluded regions. Importantly, this network is trained without any occlusion ground truth.

8. Experimental results:

We show experimentally that our method significantly outperforms the state-of-the-art multi-view stereo from monocular video, both quantitatively and qualitatively. Furthermore, we show that our depth estimation algorithm, when integrated into a fusion-based handheld 3D scene scanning approach, enables interactive scanning of man-made scenes and objects with much higher shape quality.

9. Paper and code download:

https://arxiv.org/abs/2004.00845

https://github.com/xxlong0/CNMNet

2. Implementation process

1. Overall network structure

As shown in the figure, frames within a local time window of the video are taken as input. The video frame rate is 30 fps, and frames are sampled at an interval of 10 frames. With a time window of size 3, the middle frame is used as the reference image Iref, and the two adjacent sampled frames serve as source images, which should have sufficient overlap with the reference image. The goal is to compute an accurate depth map for the reference view.

First, an initial 3D cost volume is formed through differentiable homography operations. Next, cost aggregation is applied to the initial cost volume to correct erroneous cost values, and the initial depth map is extracted from the aggregated cost volume. In addition to pixel-wise depth supervision on the initial depth map, a new local-and-global geometric constraint, the combined normal map (CNM), is imposed to train the network to produce better results. Depth accuracy is then further improved by a new occlusion-aware strategy that aggregates the depth predictions from the different neighboring views into a single reference-view depth map, together with an occlusion probability map.
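To make this data flow concrete, here is a minimal sketch of the two-stage pipeline, with the cost-volume builder, DepthNet, and RefineNet passed in as callables; all names are illustrative placeholders, not the repository's actual API.

```python
# Hedged sketch of the overall pipeline described above; the components are
# assumed to be provided as callables, and all names are placeholders.
def estimate_reference_depth(ref_img, src_imgs, poses, K,
                             build_cost_volume, depth_net, refine_net,
                             depth_hypotheses):
    """ref_img: reference frame; src_imgs/poses: neighboring source frames and
    their relative poses (R, t); returns the fused depth and occlusion map."""
    initial_depths, cost_volumes = [], []
    for src, (R, t) in zip(src_imgs, poses):
        # Stage 1: plane-sweep cost volume + DepthNet for each ref/source pair.
        cv = build_cost_volume(ref_img, src, K, R, t, depth_hypotheses)
        depth_i, aggregated = depth_net(cv, src)
        initial_depths.append(depth_i)
        cost_volumes.append(aggregated)

    # Stage 2: occlusion-aware fusion of the per-pair predictions (RefineNet).
    final_depth, occlusion_prob = refine_net(initial_depths, cost_volumes)
    return final_depth, occlusion_prob
```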

2. Cost volume construction

Previous work uses feature maps extracted from the image pairs to construct a 4D cost volume. This paper builds the cost volume directly from the images, avoiding the large memory footprint and the time-consuming 3D convolutions required by a 4D cost volume.
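The paper builds the cost volume with differentiable homography warping. The sketch below (PyTorch) illustrates one plausible plane-sweep implementation, using per-pixel absolute color difference as the matching cost; the variable names, pose convention, and cost measure are assumptions for illustration, not the authors' code.

```python
# Minimal plane-sweep cost-volume sketch (PyTorch), illustrating how a 3D cost
# volume can be built directly from images via differentiable homographies.
import torch
import torch.nn.functional as F

def plane_sweep_cost_volume(ref_img, src_img, K, R, t, depth_hypotheses):
    """ref_img, src_img: (B, 3, H, W); K: (B, 3, 3) intrinsics;
    R, t: (B, 3, 3), (B, 3, 1) pose from reference to source camera;
    depth_hypotheses: iterable of fronto-parallel plane depths.
    Returns a (B, D, H, W) cost volume (per-plane photometric error)."""
    B, _, H, W = ref_img.shape
    device = ref_img.device

    # Pixel grid of the reference view in homogeneous coordinates, (3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().reshape(3, -1)   # (3, HW)

    K_inv = torch.inverse(K)                                          # (B, 3, 3)
    n = torch.tensor([0.0, 0.0, 1.0], device=device).view(1, 1, 3)    # plane normal

    costs = []
    for d in depth_hypotheses:
        # Homography induced by the fronto-parallel plane at depth d, assuming
        # R, t map reference-camera coordinates to source-camera coordinates
        # (X_src = R @ X_ref + t):  H_d = K (R + t n^T / d) K^{-1}
        H_d = K @ (R + (t @ n) / d) @ K_inv                           # (B, 3, 3)
        warped = H_d @ pix.unsqueeze(0)                               # (B, 3, HW)
        uv = warped[:, :2] / warped[:, 2:].clamp(min=1e-6)            # (B, 2, HW)

        # Normalize to [-1, 1] for grid_sample and warp the source image.
        u = 2.0 * uv[:, 0] / (W - 1) - 1.0
        v = 2.0 * uv[:, 1] / (H - 1) - 1.0
        grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
        src_warped = F.grid_sample(src_img, grid, align_corners=True,
                                   padding_mode="zeros")

        # Per-plane matching cost: mean absolute color difference.
        costs.append((src_warped - ref_img).abs().mean(dim=1))        # (B, H, W)

    return torch.stack(costs, dim=1)                                  # (B, D, H, W)
```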

3. DepthNet for initial depth prediction

After obtaining a cost volume Ci for each image pair, DepthNet first aggregates each Ci to correct erroneous cost values using information from neighboring pixels. To exploit more detailed contextual information, Ci is stacked together with the source image Is as input to DepthNet. The initial depth map Drefi is then extracted from the aggregated cost volume Vi by a 2D convolutional layer. Note that two initial depth maps, Dref1 and Dref2, are generated for the reference view Iref.
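As a minimal illustration of this data flow, the toy module below concatenates the cost volume with the image, aggregates it with 2D convolutions, and regresses a one-channel depth map with a final 2D convolution. The real DepthNet is a much deeper encoder-decoder; the layer sizes and names here are assumptions.

```python
# Toy stand-in for the "aggregate cost volume + regress initial depth" step.
import torch
import torch.nn as nn

class ToyDepthNet(nn.Module):
    def __init__(self, num_depth_planes: int):
        super().__init__()
        # Input: cost volume (D channels) concatenated with a 3-channel image.
        self.aggregate = nn.Sequential(
            nn.Conv2d(num_depth_planes + 3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_depth_planes, 3, padding=1),     nn.ReLU(inplace=True),
        )
        # A final 2D convolution regresses a one-channel depth map (no activation shown).
        self.to_depth = nn.Conv2d(num_depth_planes, 1, 3, padding=1)

    def forward(self, cost_volume, image):
        x = torch.cat([cost_volume, image], dim=1)    # (B, D+3, H, W)
        aggregated = self.aggregate(x)                # aggregated cost volume Vi
        return self.to_depth(aggregated), aggregated  # initial depth Drefi, Vi
```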

Training the network with depth supervision. With only depth supervision, point clouds converted from the estimated depth cannot preserve regular structures such as sharp edges and planar regions. Therefore, a normal constraint is additionally imposed for further improvement. Instead of purely local surface normals or global virtual normals, a so-called combined normal map (CNM) is used, which combines local surface normals with global planar structure in an adaptive way.
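The normal constraint requires normals computed from the estimated depth (the paper uses a differentiable least-squares plane-fitting module for this). As a simpler stand-in to illustrate the idea, the sketch below back-projects the depth map to a 3D point map and takes the cross product of local tangent vectors; names and details are assumptions.

```python
# Illustrative normals-from-depth: the paper fits local planes by least squares;
# here the cross product of local tangent vectors serves as a simpler stand-in.
import torch
import torch.nn.functional as F

def normals_from_depth(depth, K):
    """depth: (B, 1, H, W); K: (B, 3, 3) intrinsics. Returns (B, 3, H, W) unit normals."""
    B, _, H, W = depth.shape
    device = depth.device
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()    # (3, H, W)
    rays = torch.inverse(K) @ pix.reshape(3, -1)                   # (B, 3, HW)
    points = (rays * depth.reshape(B, 1, -1)).reshape(B, 3, H, W)  # back-projected 3D points

    # Tangent vectors along x and y, then their cross product as the normal.
    dx = points[..., :, 1:] - points[..., :, :-1]
    dy = points[..., 1:, :] - points[..., :-1, :]
    dx = F.pad(dx, (0, 1, 0, 0))   # pad last column
    dy = F.pad(dy, (0, 0, 0, 1))   # pad last row
    n = torch.cross(dx, dy, dim=1)
    return F.normalize(n, dim=1)
```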

Pixel-wise depth loss. The standard per-pixel depth loss is as follows:

In the formula, Q is the set of valid pixels with ground-truth depth, D^(q) is the ground-truth depth value at pixel q, and Di(q) is the initially estimated depth value at pixel q.
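The formula itself was an image in the original post and is missing here; given the symbol definitions above, a standard per-pixel L1 form is the natural reading (a reconstruction, not a quote from the paper):

$$
L_{depth} = \frac{1}{|Q|} \sum_{q \in Q} \left| D_i(q) - \hat{D}(q) \right|
$$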

Combined normal map. To preserve both the local and the global structure of the scene, the Combined Normal Map (CNM) is introduced as the ground truth for the normal constraint. To obtain the CNM, a plane segmentation network (PlaneRCNN) first extracts planar regions such as walls, tables, and floors. Local surface normals are then used in the non-planar regions, while the average surface normal of each planar region is assigned to every pixel of that region. A visual comparison of the local normal map and the combined normal map is shown in the figure.

The key idea is to use local surface normals to capture rich geometric detail in high-curvature regions, and to use the averaged normal in planar regions to filter out noise in the surface normals and preserve the global structure. In this way, CNM yields noticeably better depth predictions and recovery of the scene's 3D structure than supervision with only local or only global normals.
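A small sketch of how a CNM could be assembled from a plane segmentation and a local surface-normal map, following the description above; `plane_masks` is assumed to come from an off-the-shelf plane detector, and all names are illustrative.

```python
# Sketch of assembling a Combined Normal Map (CNM): local normals in non-planar
# regions, the averaged normal inside each detected planar region.
import torch.nn.functional as F

def combined_normal_map(local_normals, plane_masks):
    """local_normals: (3, H, W) unit normals; plane_masks: (P, H, W) boolean masks,
    one per detected planar region. Returns the (3, H, W) CNM."""
    cnm = local_normals.clone()
    for mask in plane_masks:
        if mask.sum() == 0:
            continue
        # Replace normals inside each planar region by the region's average normal.
        mean_n = local_normals[:, mask].mean(dim=1, keepdim=True)   # (3, 1)
        cnm[:, mask] = F.normalize(mean_n, dim=0)
    return cnm
```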

Combined normal loss. The combined normal loss built on the CNM is defined as:

Here, Q is the set of valid pixels with ground truth, N^(q) is the combined normal at pixel q, and Ni(q) is the surface normal of the 3D point corresponding to pixel q, computed from the estimated depth; all normals are normalized to unit vectors. To obtain accurate depth maps while preserving geometric features, the pixel-wise depth loss and the combined normal loss are combined to supervise the network output. The total loss is:

where λ is the trade-off parameter, which is set to 1 in all experiments.
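Both loss formulas were images in the original post. Consistent with the definitions above, a plausible reconstruction uses a cosine-style penalty between the unit normals and a weighted sum for the total loss (the exact penalty should be checked against the paper):

$$
L_{normal} = \frac{1}{|Q|} \sum_{q \in Q} \left( 1 - N_i(q) \cdot \hat{N}(q) \right), \qquad
L = L_{depth} + \lambda \, L_{normal}
$$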

4. Occlusion-aware RefineNet

The next step is to combine the initial depth maps Dref1 and Dref2 predicted for the reference image from the different image pairs into the final depth map Dfin. An occlusion-aware refinement network, RefineNet, is designed for this, which also predicts an occlusion probability map. Occlusion here refers to areas of Iref that cannot be observed in Is1 or Is2.

When computing the loss, lower weights are assigned to occluded areas and higher weights to non-occluded areas, which shifts the network's focus to the non-occluded areas, where depth predictions are more reliable. Furthermore, the occlusion probability map predicted by the network can be used to filter out unreliable depth samples, as shown in the figure, which is very useful when fusing depth maps for 3D reconstruction.

RefineNet predicts the final depth map and the occlusion probability map from the average V¯ of the two cost volumes and the two initial depth maps. RefineNet has one encoder and two decoders. The first decoder estimates the occlusion probability from V¯ and the occlusion information encoded in the initial depth maps. Intuitively, a pixel in a non-occluded area has its strongest (peak) response at the layer of V¯ corresponding to its depth dn, and its values in Dref1 and Dref2 are similar. A pixel in an occluded area, in contrast, has a scattered response across the depth layers of V¯ and very different values in the two initial depth maps. The second decoder predicts the refined depth map under the depth and CNM constraints. To train RefineNet, a new occlusion-aware loss function is designed as follows:

Here, Dr(q) is the refined estimated depth at pixel q, Nr(q) is the surface normal of the corresponding 3D point, D^(q) is the ground-truth depth at q, N^(q) is the combined normal at q, and P(q) is the occlusion probability at q (P(q) is higher when q is occluded). The weight α is set to 0.2 and β to 1.
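The occlusion-aware loss was also an image in the original post. One form consistent with the symbol definitions, where the depth and normal terms are down-weighted by the occlusion probability and an extra term discourages the trivial solution of marking everything as occluded, is sketched below; this exact expression is an assumption, so consult the paper for the precise formulation:

$$
L_{refine} = \frac{1}{|Q|} \sum_{q \in Q} \Big[ \big(1 - P(q)\big) \Big( \big| D_r(q) - \hat{D}(q) \big| + \beta \, \big( 1 - N_r(q) \cdot \hat{N}(q) \big) \Big) + \alpha \, P(q) \Big]
$$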

5. Loss function

Given the predicted depth D and the true depth D~, the loss can be described as:

where LSI is the scale-invariant loss, LVNL is the virtual normal loss [42,43], and β is a weight factor, set to 4. Since there are both a monocular depth prediction Dmono and a final depth prediction Dt, the final loss Lfinal is:
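Writing out the formulas that were images in the original post, and assuming the two loss terms and the two depth predictions are simply summed, this reads as:

$$
L(D, \tilde{D}) = L_{SI}(D, \tilde{D}) + \beta \, L_{VNL}(D, \tilde{D}), \qquad
L_{final} = L(D_{mono}, \tilde{D}) + L(D_t, \tilde{D})
$$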

6. Limitations

First, the performance of the method relies on the quality of the CNM, which depends on global planar-region segmentation, and existing plane segmentation methods are not robust in all scenarios. One possible solution is to jointly learn segmentation labels and depth prediction. Second, for video-based 3D reconstruction, one could design an end-to-end network that produces 3D reconstructions directly from video, without the explicit step of integrating the estimated depth maps via TSDF fusion.

