NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo

Abstract: The method combines traditional SfM reconstruction with learning-based priors. The key idea is to use the learned prior to guide the optimization process of NeRF. The system first adapts a monocular depth network to the target scene by fine-tuning it on the sparse SfM reconstruction. It is then shown that NeRF's shape-radiance ambiguity persists in indoor environments, and this problem is addressed by using the adaptive depth prior to guide the sampling process of volume rendering.

1. Network Architecture

First, the depth obtained from COLMAP is used to train a monocular depth network specific to the current scene. The depth maps predicted by this network are then used to guide the learning of NeRF. Finally, filters based on the view synthesis results further improve the quality of the output depth maps.
The core of the method is to use the depth prior predicted by the network to guide the optimization of the neural radiance field. The figure below shows the architecture of the system, which integrates the extra information in the learning-based prior into the NeRF training pipeline.
[Figure: system architecture of NerfingMVS, integrating the learning-based prior into the NeRF training pipeline]

2. Scene-Specific Depth Prior

The method aims to leverage learning-based depth priors to help optimize the geometry at test time. However, instead of using the same monocular depth network for all test scenes, the network is adapted to each scene to obtain scene-specific depth priors. Empirically, this test-time adaptation greatly improves the quality of the final depth output.
The scene-specific depth prior is obtained by fine-tuning the monocular depth network on the traditional SfM reconstruction; the purpose of this step is essentially to overfit the network to the current scene. Concretely, COLMAP is used to obtain a multi-view fused point cloud, which is projected into each view to get a sparse depth map for that view. Since the fused point cloud has already passed a geometric consistency check, the depth is sparse but relatively accurate. Furthermore, because of scale ambiguity, a scale-invariant loss function is used:

$$\mathcal{L}_{depth} = \sum_{i=1}^{N} \left\| \alpha\!\left(D_p^i, D_{SfM}^i\right) D_p^i - D_{SfM}^i \right\|_1$$
where $D_p^i$ is the predicted depth map, $D_{SfM}^i$ is the sparse depth obtained from COLMAP, and the loss is evaluated only at pixels with valid sparse depth. The scaling factor $\alpha(D_p^i, D_{SfM}^i)$ aligns the scale of the predicted depth map with the sparse depth supervision and can be computed by averaging over all valid pixels:
$$\alpha\!\left(D_p^i, D_{SfM}^i\right) = \frac{1}{|\Omega_i|} \sum_{q \in \Omega_i} \frac{D_{SfM}^i(q)}{D_p^i(q)}$$

where $\Omega_i$ is the set of pixels in view $i$ with valid sparse depth.
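A minimal PyTorch sketch of this fine-tuning loss, assuming the mean-ratio form of $\alpha$ above and a convention where zeros in the sparse depth map mark pixels without SfM points (both are illustrative assumptions, not taken from the paper):

```python
import torch

def scale_invariant_loss(d_pred: torch.Tensor, d_sfm: torch.Tensor) -> torch.Tensor:
    """d_pred: (H, W) predicted depth; d_sfm: (H, W) sparse SfM depth,
    where 0 marks pixels with no reprojected SfM point."""
    valid = d_sfm > 0                                        # sparse supervision mask
    # Scale factor alpha: average per-pixel ratio over valid pixels,
    # aligning the scale-ambiguous prediction to the SfM scale.
    alpha = (d_sfm[valid] / d_pred[valid].clamp(min=1e-6)).mean()
    # L1 difference between the rescaled prediction and the sparse depth;
    # alpha is detached so it acts as a constant scale in each step.
    return (alpha.detach() * d_pred[valid] - d_sfm[valid]).abs().mean()
```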
The fine-tuned monocular depth network is a stronger prior, tailored to the specific target scene. The quality of this adaptive depth prior can be further improved through the guided optimization of NeRF, described next.

3. Guided Optimization of the Neural Radiance Field

The implicit volume is optimized directly by integrating the adaptive depth prior described above. After a brief review of NeRF's underlying principles, the paper points out that NeRF usually performs poorly in weakly textured regions; the wall (textureless region) in the figure below exhibits shape-radiance ambiguity. NeRF fits the RGB images of the training views well (Figure (a)) but does not learn the correct 3D structure of the scene (Figure (c)). The essential reason is that, for the same set of RGB images, there exist multiple neural radiance fields consistent with them. In addition, RGB images of real indoor scenes tend to be blurry and the pose changes between images are relatively large, which weakens the network's ability to learn and exacerbates the problem.
[Figure: NeRF fits the training-view RGB image (a) but recovers incorrect scene geometry (c) in the textureless wall region]
By explicitly restricting the sampling range around the depth prior, most of NeRF's degradations in indoor scenes are avoided, enabling accurate depth estimation while optimizing directly on RGB images. The error map of the adaptive depth prior is first obtained through a geometric consistency check. For $N$ input views, the adapted depth priors are denoted $\{D_i\}_{i=1}^{N}$. The depth map of each view is projected to all other views:
$$D_{i \to j}\, \tilde{p}_{i \to j} = K\, T_{i \to j}\, D_i(p)\, K^{-1}\, \tilde{p}$$
Here $K$ is the camera intrinsic matrix, $T_{i \to j}$ is the relative pose from view $i$ to view $j$, $\tilde{p}$ is the homogeneous coordinate of pixel $p$, and $p_{i \to j}$ and $D_{i \to j}$ are the projected 2D coordinates and depth in the $j$-th view. The relative error between $D_j'$ (the depth prior of view $j$ sampled at $p_{i \to j}$) and $D_{i \to j}$ then gives the cross-view depth reprojection error; note that some pixels do not overlap for some view pairs. The error map $e_i$ of the depth prior for each view is therefore defined as the average of the top-$K$ smallest cross-view depth projection errors.
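A NumPy sketch of this consistency check. The nearest-pixel lookup in the target view, the pose convention ($T_{ij}$ maps view-$i$ camera coordinates to view-$j$), and the `rel_poses` dictionary are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def reproject_depth(depth_i, K, T_ij):
    """Project every pixel of view i into view j.
    Returns projected 2D coords p_ij (H, W, 2) and depths D_ij (H, W)."""
    H, W = depth_i.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    cam = np.linalg.inv(K) @ pix * depth_i.reshape(1, -1)    # back-project to 3D
    cam_j = T_ij[:3, :3] @ cam + T_ij[:3, 3:4]               # move to view j's frame
    d_ij = cam_j[2]                                          # projected depth D_ij
    p_ij = (K @ (cam_j / d_ij))[:2].T                        # projected coords p_ij
    return p_ij.reshape(H, W, 2), d_ij.reshape(H, W)

def error_map(depths, K, rel_poses, i, top_k=4):
    """Error map e_i: mean of the top-K smallest relative depth
    reprojection errors of view i against all other views."""
    H, W = depths[i].shape
    errs = []
    for j in range(len(depths)):
        if j == i:
            continue
        p_ij, d_ij = reproject_depth(depths[i], K, rel_poses[(i, j)])
        x = np.round(p_ij[..., 0]).astype(int)
        y = np.round(p_ij[..., 1]).astype(int)
        e = np.full((H, W), np.inf)                          # inf = no overlap
        inside = (x >= 0) & (x < W) & (y >= 0) & (y < H)
        d_j = depths[j][y[inside], x[inside]]                # D_j' at p_ij
        e[inside] = np.abs(d_j - d_ij[inside]) / d_j         # relative error
        errs.append(e)
    errs = np.sort(np.stack(errs), axis=0)                   # sort across views
    return errs[:top_k].mean(axis=0)
```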
In short, the depth of each view is projected into all other views and compared against their depth priors; the resulting error maps $\{e_i\}_{i=1}^{N}$ are then used to compute an adaptive sampling range $[t_n, t_f]$ for each camera ray:
$$[t_n, t_f] = \big[\,(1 - e_i(q))\, D_i(q),\;\; (1 + e_i(q))\, D_i(q)\,\big]$$
That is, the sampling interval of each ray is centred at the depth prior at the corresponding pixel, and its width is determined by the error map: the smaller the error, the higher the confidence in the depth prior and the narrower the sampling range; conversely, the larger the error, the lower the confidence and the wider the sampling range.
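A minimal sketch of turning one view's depth prior and error map into per-pixel near/far bounds. The linear widening by the error value follows the formula above; the `min_width` floor is an added assumption to keep the interval from collapsing:

```python
import numpy as np

def sample_range(depth_prior, err, min_width=0.05):
    """depth_prior, err: (H, W) arrays for one view.
    Returns per-pixel near/far bounds centred on the depth prior."""
    half_width = np.maximum(err, min_width) * depth_prior    # low error => tight interval
    t_near = np.clip(depth_prior - half_width, 1e-3, None)   # keep bounds positive
    t_far = depth_prior + half_width
    return t_near, t_far
```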

As shown in the figure below, multi-view consistency checks are performed on the adaptive depth prior to obtain an error map, which is used to compute the adaptive sampling range along each camera ray.
[Figure: multi-view consistency checks on the adaptive depth prior produce an error map, which determines the adaptive per-ray sampling range]

4. Inference and View Synthesis

For inference, the depth map of each input view can be predicted directly by sampling within the range defined above and computing the expected ray termination depth. Equipped with the proposed guided optimization scheme, NeRF produces accurate output depth in this way.
Explanation of the following equation: during volume rendering, NeRF uses the near bound $t_n$ and far bound $t_f$, computed from the sparse 3D reconstruction, to restrict the sampling space along each ray. Specifically, it partitions $[t_n, t_f]$ into $M$ evenly spaced bins and draws one query point uniformly at random within each bin:
$$t_k \sim \mathcal{U}\!\left[\, t_n + \tfrac{k-1}{M}\,(t_f - t_n),\;\; t_n + \tfrac{k}{M}\,(t_f - t_n) \right], \qquad k = 1, \dots, M$$
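A PyTorch sketch of this per-bin stratified sampling together with the expected-depth readout mentioned above. The `(num_rays, M)` tensor shapes and the standard NeRF weight computation from densities are assumptions for illustration:

```python
import torch

def stratified_samples(t_near, t_far, M):
    """One uniform random sample per bin of [t_near, t_far], per ray.
    t_near, t_far: (num_rays,) tensors; returns (num_rays, M)."""
    bins = torch.linspace(0.0, 1.0, M + 1)                   # M+1 bin edges in [0, 1]
    lo = t_near[:, None] + (t_far - t_near)[:, None] * bins[:-1]
    hi = t_near[:, None] + (t_far - t_near)[:, None] * bins[1:]
    return lo + (hi - lo) * torch.rand_like(lo)

def expected_depth(t, sigma):
    """Expected ray termination depth from densities sigma at samples t,
    using the standard NeRF volume rendering weights."""
    delta = torch.cat([t[:, 1:] - t[:, :-1],
                       torch.full_like(t[:, :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)                  # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]                                      # accumulated transmittance
    weights = trans * alpha
    return (weights * t).sum(dim=-1)                         # (num_rays,)
```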

To further improve depth quality, the view synthesis results of NeRF are used to compute a per-pixel confidence for the predicted geometry. If the rendered RGB at a particular pixel does not match the input training image, a relatively low confidence is attached to the depth prediction at that pixel. The confidence $S_i^j$ of the $j$-th pixel in the $i$-th view is specifically defined as
[Equation: per-pixel confidence $S_i^j$, a decreasing function of the error between the rendered RGB and the input training image at that pixel]
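A sketch of one plausible form of this confidence. The exponential mapping from RGB error to confidence and the `beta` sharpness parameter are assumptions; the text only requires that a larger rendering error yield a lower confidence:

```python
import torch

def pixel_confidence(rendered_rgb, input_rgb, beta=10.0):
    """rendered_rgb, input_rgb: (H, W, 3) images in [0, 1].
    Returns per-pixel confidence in (0, 1]."""
    err = (rendered_rgb - input_rgb).abs().mean(dim=-1)      # per-pixel L1 RGB error
    return torch.exp(-beta * err)                            # high error -> low confidence
```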
For a more detailed explanation of this part, see https://zhuanlan.zhihu.com/p/407123751?utm_id=0

While the proposed guided optimization strategy requires an adaptive depth prior as input to guide point sampling along each camera ray, novel view synthesis is still possible by directly reusing the adaptive depth prior of the nearest visible training view.

Original post: blog.csdn.net/qq_44708206/article/details/129092692