Classic literature reading -- NoPe-NeRF (optimizing a neural radiance field without pose priors)

0. Introduction

Training Neural Radiance Fields (NeRF) without pre-computed camera poses is challenging. Recent advances in this direction have shown that NeRF and camera poses can be jointly optimized in forward-facing scenes. However, these methods still struggle with dramatic camera motion. We tackle this challenging problem by incorporating undistorted monocular depth priors. These priors are generated by correcting scale and shift parameters during training, which enables us to constrain the relative poses between consecutive frames. This constraint is enforced by our proposed novel loss terms. Experiments on real-world indoor and outdoor scenes show that our method can handle challenging camera trajectories and outperforms existing methods in terms of novel view rendering quality and pose estimation accuracy. The project page of the paper "NoPe-NeRF: Optimizing Neural Radiance Field with No Pose Prior" is https://nope-nerf.active.vision.

1. Main contributions

In summary, we propose a method that jointly optimizes camera poses and NeRF from an image sequence with large camera motion. Our system is enabled by three contributions.

  1. We propose a novel way to integrate monocular depth into pose-free NeRF training by explicitly modelling scale and shift distortions.

  2. We supply relative poses to the camera-NeRF joint optimization through an inter-frame loss based on undistorted monocular depth maps.

  3. We further regularize our relative pose estimation via a depth-based surface rendering loss.

2. Details

In this paper, we tackle the challenge of handling large camera motion in pose-free NeRF training. Given a sequence of images, the camera intrinsics, and monocular depth estimates, our method simultaneously recovers the camera poses and optimizes the NeRF. We assume the camera intrinsics are available in the image metadata and run an off-the-shelf monocular depth network, DPT [28], to obtain the monocular depth estimates. Rather than re-deriving the benefits of monocular depth, our design centers on efficiently integrating monocular depth into unposed-NeRF training.
Training is a joint optimization of the NeRF, the camera poses, and the distortion parameters of each monocular depth map. The distortion parameters are supervised by minimizing the difference between the monocular depth maps and the depth maps rendered from NeRF, which are multi-view consistent. In turn, the undistorted depth maps effectively alleviate the shape-radiance ambiguity, thus simplifying the joint training of NeRF and camera poses.
Specifically, the undistorted depth maps provide two constraints. We obtain relative poses between adjacent images via a Chamfer-distance-based correspondence between the two point clouds back-projected from their undistorted depth maps, which constrains the global pose estimates. Furthermore, we regularize the relative pose estimation with a surface-based photometric consistency term, by treating the undistorted depth as a surface.
(Figure: method overview)
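To make the point-cloud constraint above concrete, below is a minimal sketch (not the authors' implementation) of a Chamfer-distance loss between two point clouds back-projected from undistorted depth maps. The helper names (`backproject`, `point_cloud_loss`), the use of `torch.cdist`, and the assumption of a shared intrinsic matrix are illustrative choices; in practice the point clouds would usually be subsampled before computing pairwise distances.

```python
import torch

def backproject(depth, K_inv):
    """Lift an undistorted depth map D*_i of shape (H, W) to a point cloud (H*W, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ K_inv.T                                                  # camera-frame directions
    return rays * depth.reshape(-1, 1)                                    # scale by depth

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point clouds P (M, 3) and Q (N, 3)."""
    d = torch.cdist(P, Q)                                                 # (M, N) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def point_cloud_loss(depth_i, depth_j, K_inv, R_ij, t_ij):
    """Constrain the relative pose (R_ij, t_ij) between consecutive frames i and j."""
    P_i = backproject(depth_i, K_inv)       # cloud from undistorted depth D*_i
    P_j = backproject(depth_j, K_inv)       # cloud from undistorted depth D*_j
    P_i_in_j = P_i @ R_ij.T + t_ij          # move cloud i into the frame of j
    return chamfer_distance(P_i_in_j, P_j)
```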

3. NeRF and Pose

3.1 NeRF

Neural Radiance Fields (NeRF) [24] represent a scene as a mapping function $F_\Theta: (x, d) \to (c, \sigma)$, where $x \in \mathbb{R}^3$ is a 3D position, $d \in \mathbb{R}^3$ is a viewing direction, $c \in \mathbb{R}^3$ is the radiance color, and $\sigma$ is the volume density. The mapping is usually parameterized by a neural network $F_\Theta$. Given $N$ images $I = \{I_i \mid i = 0 \dots N-1\}$ and the corresponding camera poses $\Pi = \{\pi_i \mid i = 0 \dots N-1\}$, NeRF is optimized by minimizing the photometric error between the synthesized images $\hat{I}_i$ and the captured images $I_i$:

$$L_{rgb} = \sum_i^N \| I_i - \hat{I}_i \|_2^2 \tag{1}$$
Here, $\hat{I}_i$ is obtained by rendering: the radiance colors along each camera ray $r(h) = o + h d$ are aggregated between the near and far bounds $h_n$ and $h_f$. More specifically, we synthesize $\hat{I}_i$ with the volume rendering function

$$\hat{I}_i(r) = \int_{h_n}^{h_f} T(h)\, \sigma(r(h))\, c(r(h), d)\, dh, \tag{2}$$

where $T(h) = \exp\left(-\int_{h_n}^{h} \sigma(r(s))\, ds\right)$ is the accumulated transmittance along the ray. See [24] for more details.
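For reference, here is a minimal sketch of how the integral in Eq. (2) is typically discretized in practice: density and color are sampled at discrete distances along each ray and accumulated with transmittance weights. The tensor shapes and argument names are assumptions for illustration, not the authors' code; the same weights also give the rendered depth used later in Eq. (6).

```python
import torch

def volume_render(sigma, color, h_vals):
    """Numerical quadrature of Eq. (2) for a batch of rays.

    sigma:  (R, S)    density at S samples along each of R rays
    color:  (R, S, 3) radiance color at the same samples
    h_vals: (R, S)    sample distances between the near/far bounds h_n and h_f
    """
    # Interval lengths between adjacent samples (last interval repeated).
    deltas = h_vals[:, 1:] - h_vals[:, :-1]
    deltas = torch.cat([deltas, deltas[:, -1:]], dim=-1)                  # (R, S)

    # Per-sample opacity alpha_j = 1 - exp(-sigma_j * delta_j).
    alpha = 1.0 - torch.exp(-sigma * deltas)                              # (R, S)

    # Accumulated transmittance T_j = prod_{k<j} (1 - alpha_k), i.e. T(h).
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)

    weights = trans * alpha                                               # (R, S)
    rgb = (weights[..., None] * color).sum(dim=-2)                        # (R, 3): Eq. (2)
    depth = (weights * h_vals).sum(dim=-1)                                # (R,): rendered depth
    return rgb, depth
```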

3.2 Joint optimization of pose and NeRF

Previous works [12, 18, 45] have shown that camera parameters and NeRF can be estimated simultaneously by minimizing the photometric error $L_{rgb}$ above, using the same volume rendering procedure as in Eq. (2).
The key is to condition the cast camera rays on the camera parameters $\Pi$, now treated as variables, since a camera ray $r$ is a function of the camera pose. Mathematically, this joint optimization can be written as

$$\Theta^*, \hat{\Pi}^* = \arg\min_{\Theta, \hat{\Pi}} L_{rgb}(\hat{I} \mid \hat{\Pi}), \tag{3}$$
where the symbol $\hat{\Pi}$ denotes the camera parameters being updated during optimization. Note that the only difference between Eq. (1) and Eq. (3) is that Eq. (3) treats the camera parameters as variables.
In general, the camera parameters $\Pi$ include the camera intrinsics, poses, and lens distortion. This paper only considers estimating the camera poses; i.e., the camera pose of the $i$-th frame is a transformation $T_i = [R_i \mid t_i]$, where $R_i \in SO(3)$ is the rotation and $t_i \in \mathbb{R}^3$ is the translation.
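To make the "camera parameters as variables" idea concrete, the sketch below treats each pose $T_i = [R_i \mid t_i]$ as a learnable parameter that is optimized together with the NeRF weights. The axis-angle parameterization and the Adam optimizer are illustrative assumptions; the paper only requires that the poses be optimizable.

```python
import torch
import torch.nn as nn

class LearnablePoses(nn.Module):
    """Per-frame camera poses T_i = [R_i | t_i] treated as free variables."""

    def __init__(self, num_frames):
        super().__init__()
        # Axis-angle rotation and translation, initialized to identity poses.
        self.rot = nn.Parameter(torch.zeros(num_frames, 3))
        self.trans = nn.Parameter(torch.zeros(num_frames, 3))

    def forward(self, i):
        # Rodrigues' formula: axis-angle -> rotation matrix R_i in SO(3).
        theta = self.rot[i].norm() + 1e-8
        k = self.rot[i] / theta
        zero = torch.zeros_like(theta)
        K = torch.stack([torch.stack([zero, -k[2], k[1]]),
                         torch.stack([k[2], zero, -k[0]]),
                         torch.stack([-k[1], k[0], zero])])
        R = torch.eye(3, device=k.device) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)
        return R, self.trans[i]

# Joint optimization of NeRF weights and poses (illustrative usage):
# poses = LearnablePoses(N)
# optimizer = torch.optim.Adam(list(nerf.parameters()) + list(poses.parameters()), lr=1e-3)
```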

3.3 Correction of monocular depth

Using an off-the-shelf monocular depth network such as DPT [28], we generate a sequence of monocular depth maps $D = \{D_i \mid i = 0 \dots N-1\}$. As expected, these monocular depth maps are not multi-view consistent, so our goal is to recover a sequence of multi-view consistent depth maps, which are further exploited in our relative pose loss terms.

Specifically, we consider two linear transformation parameters for each monocular depth map, giving a sequence of transformation parameters for all frames $\Psi = \{(\alpha_i, \beta_i) \mid i = 0 \dots N-1\}$, where $\alpha_i$ and $\beta_i$ denote the scale and shift, respectively. Under the multi-view consistency constraint of NeRF, our goal is to recover from $D_i$ a multi-view consistent depth map $D_i^*$,

$$D_i^* = \alpha_i D_i + \beta_i, \tag{4}$$

by jointly optimizing $\alpha_i$, $\beta_i$ and NeRF. This joint optimization is achieved mainly by enforcing consistency between the undistorted depth maps $D_i^*$ and the depth maps $\hat{D}_i$ rendered by NeRF. The consistency is enforced with a depth loss:

$$L_{depth} = \sum_i^N \| D_i^* - \hat{D}_i \|, \tag{5}$$
where the rendered depth $\hat{D}_i$ is computed with the same volume rendering procedure as Eq. (2), accumulating sample distances instead of colors:

$$\hat{D}_i(r) = \int_{h_n}^{h_f} T(h)\, \sigma(r(h))\, h\, dh. \tag{6}$$
Equation (5) benefits both NeRF and the monocular depth maps. On the one hand, the monocular depth maps provide a strong geometric prior for NeRF training, reducing the shape-radiance ambiguity. On the other hand, NeRF provides multi-view consistency, so we can recover a set of multi-view consistent depth maps for relative pose estimation.
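The following sketch shows one way Eqs. (4) and (5) could be instantiated: each frame owns a learnable scale $\alpha_i$ and shift $\beta_i$, the undistorted depth $D_i^* = \alpha_i D_i + \beta_i$ is compared against the NeRF-rendered depth $\hat{D}_i$, and everything is trained jointly. The class name, initialization, and the L1 form of the penalty are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DepthDistortion(nn.Module):
    """Per-frame distortion parameters Psi = {(alpha_i, beta_i)} from Eq. (4)."""

    def __init__(self, num_frames):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_frames))   # scale, initialized to 1
        self.beta = nn.Parameter(torch.zeros(num_frames))    # shift, initialized to 0

    def forward(self, i, mono_depth):
        # Eq. (4): D*_i = alpha_i * D_i + beta_i
        return self.alpha[i] * mono_depth + self.beta[i]

def depth_loss(undistorted_depth, rendered_depth):
    # Eq. (5): consistency between D*_i and the NeRF-rendered depth
    # (an L1 penalty is used here for illustration).
    return (undistorted_depth - rendered_depth).abs().mean()
```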

…For details, please refer to Gu Yueju

Origin blog.csdn.net/lovely_yoshino/article/details/129366012