[3D reconstruction] SceneRF: self-supervised monocular 3D scene reconstruction based on NeRF


Summary

3D reconstruction from 2D images has been extensively studied, but typically trained with depth supervision. To relax the reliance on expensively acquired datasets, we propose SceneRF, a self-supervised monocular scene reconstruction method trained using only sequences of posed images. We handle large scenes efficiently with explicit depth optimization and a novel probabilistic sampling strategy. At inference time, a single input image is sufficient to synthesize novel depth views, which are fused together to obtain a 3D scene reconstruction. Experiments show that we outperform all recent baselines on novel depth view synthesis and scene reconstruction, on indoor BundleFusion and outdoor SemanticKITTI.



I. Introduction

While binocular vision is a clear evolutionary advantage for perceiving the environment, physiological studies have shown that humans can perceive depth even with monocular vision.

A small fraction of the 3D field has dealt with the reconstruction of complex scenes from a single image, but these methods all require depth supervision and cannot be trained from images alone. Meanwhile, NeRF optimizes a radiance field self-supervisedly from one or more viewpoints, spawning many derivatives with unprecedented performance on novel view synthesis. However, when it comes to single-view input, they are mostly limited to objects. For complex scenes, all except [33] train on synthetic data [62] or require additional geometric cues to train on real data. Reducing the need for supervision on complex scenes reduces the reliance on costly-to-acquire datasets.

In this work, we approach single-view reconstruction of complex (possibly large) scenes in a fully self-supervised manner. SceneRF trains only on sequences of posed images to optimize a large NeRF. Figure 1 illustrates inference: a single RGB image is sufficient to reconstruct a 3D scene by fusing novel depth views synthesized at arbitrary locations. We build on PixelNeRF [78] and propose specific design choices to optimize depth explicitly.
[Figure 1: from a single input image, novel depth views are synthesized and fused into a 3D scene reconstruction]

To address the challenge of large scenes, we introduce a new probabilistic ray sampling strategy that efficiently selects sparse sample locations to optimize within large radiance volumes, and a spherical U-Net that generates features beyond the field of view of the input image.

2. Method

SceneRF learns to infer scene geometry from monocular RGB images, trained in a self-supervised fashion using an image-conditioned Neural Radiance Field (NeRF). The training set consists of S sequences, each holding m RGB images with their corresponding camera poses, denoted {(I_k, P_k), k = 1, …, m} per sequence, where P_k is the pose of frame I_k. We estimate a neural representation conditioned on the first frame of each sequence, with the conditioning network shared across all sequences, and optimize it self-supervisedly using the remaining frames.

2.1 NeRF for novel depth synthesis

The original NeRF optimizes a continuous volumetric radiance field: for a given 3D point x ∈ ℝ³ and viewing direction d ∈ ℝ³, it returns a density σ and an RGB color c. We build on PixelNeRF [78] to learn a radiance field that generalizes across sequences, and introduce new design choices to efficiently synthesize novel depth views.

The training process of SceneRF is shown in Figure 2. Given the first input frame I_1 of a sequence, we extract a feature volume with our SU-Net: W = E(I_1). We then randomly select a source future frame I_j, 2 ≤ j ≤ m, and randomly sample L pixels from it. Knowing the pose of the source frame and the camera intrinsics, N points can be sampled along the ray passing through each pixel. Each sampled point x is projected onto a sphere with ψ(·), and the corresponding input-image feature vector W(ψ(x)) is retrieved by bilinear interpolation. This feature, together with the viewing direction d and the positional encoding γ(x), is passed to the NeRF MLP f(·) to predict the point density σ and RGB color c in the input frame's coordinates:

[Figure 2: training overview of SceneRF]

(σ, c) = f(γ(x), d, W(ψ(x)))    (1)
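A minimal PyTorch sketch of the conditioned query in Eq. (1); the MLP depth and width, encoding frequencies and the 64-dim feature size are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freq=10):
    # gamma(x): sin/cos encoding of the 3D position at several frequencies
    enc = [x]
    for i in range(n_freq):
        enc += [torch.sin((2 ** i) * x), torch.cos((2 ** i) * x)]
    return torch.cat(enc, dim=-1)

class NerfMLP(nn.Module):
    """f(.): maps (gamma(x), d, W(psi(x))) to a density sigma and an RGB color c."""
    def __init__(self, feat_dim=64, hidden=128, n_freq=10):
        super().__init__()
        in_dim = 3 * (2 * n_freq + 1) + 3 + feat_dim   # gamma(x) + view direction + image feature
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                      # outputs [sigma, r, g, b]
        )

    def forward(self, x, d, feat):
        h = self.net(torch.cat([positional_encoding(x), d, feat], dim=-1))
        sigma = torch.relu(h[..., :1])                 # non-negative density
        color = torch.sigmoid(h[..., 1:])              # RGB in [0, 1]
        return sigma, color
```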

The original NeRF uses quadrature to approximate the color Ĉ of a camera ray r:

Ĉ(r) = Σ_{i=1..N} T_i (1 − exp(−σ_i δ_i)) c_i,    with T_i = exp(−Σ_{j<i} σ_j δ_j)
where T_i is the cumulative transmittance and δ_i the distance between adjacent sampling points.

2.1.1 Depth Estimation

Unlike most NeRFs, we explicitly recover depth from the radiance volume (where d_i is the distance from sample i to the ray origin):

D̂(r) = Σ_{i=1..N} T_i (1 − exp(−σ_i δ_i)) d_i
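Both renderings share the same weights T_i (1 − exp(−σ_i δ_i)). A minimal sketch for a single ray, assuming the densities, colors, sample distances and spacings are already available as tensors:

```python
import torch

def render_ray(sigma, color, dist, delta):
    # sigma: (N,) densities, color: (N, 3) RGB, dist: (N,) distance of each sample
    # to the ray origin, delta: (N,) spacing between adjacent samples along the ray.
    alpha = 1.0 - torch.exp(-sigma * delta)              # per-segment opacity
    T = torch.exp(-torch.cumsum(sigma * delta, dim=0))   # transmittance after sample i
    T = torch.cat([torch.ones(1), T[:-1]])               # T_i accumulates samples j < i
    w = T * alpha                                        # rendering weights
    C_hat = (w.unsqueeze(-1) * color).sum(dim=0)         # rendered color  C^(r)
    D_hat = (w * dist).sum(dim=0)                        # rendered depth  D^(r)
    return C_hat, D_hat
```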

To optimize depth without ground truth, we take inspiration from self-supervised depth methods [19, 20] and apply a photometric reprojection loss between the warped source frame I_j and its previous frame I_{j−1}, which serves as the target. Consecutive frames are chosen to ensure maximum overlap. Using the sparse depth estimate D̂_j, the photometric reprojection loss is

L_photo = Σ_i pe( I_{j−1}(i), I_j⟨proj(i)⟩ ),

where proj(·) projects the 2D coordinate i of I_{j−1} into I_j using the sparse depth estimate D̂_j together with the camera intrinsics and relative pose, ⟨·⟩ denotes bilinear sampling, and pe(·, ·) is the standard photometric error. Importantly, while D̂_j is sparse (only some rays are estimated), the randomness of these rays provides statistically dense supervision. To account for moving objects, we apply the per-pixel auto-masking strategy of [Digging into self-supervised monocular depth prediction].
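A sketch of the photometric error pe(·, ·) in the Monodepth2 style (weighted SSIM + L1) that such losses typically use; the weighting α and window size are assumptions, and the warping with proj(·) (e.g. via grid_sample) is omitted here.

```python
import torch
import torch.nn.functional as F

def ssim_dist(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    # Simplified SSIM distance over 3x3 windows, as in Monodepth2-style losses.
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sig_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sig_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sig_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sig_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sig_x + sig_y + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_error(target, warped, alpha=0.85):
    # pe(., .): weighted SSIM + L1, averaged over channels; inputs are (B, 3, H, W).
    l1 = (target - warped).abs().mean(1, keepdim=True)
    return alpha * ssim_dist(target, warped).mean(1, keepdim=True) + (1 - alpha) * l1
```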


2.2 Probabilistic Ray Sampling (PrSamp)

Previous studies [24, 44, 47] have shown that, for volume rendering, sampling points closer to the surface improves performance and reduces computational cost by requiring fewer evaluations of f(·). Since we train without depth supervision, this is a circular problem: the surface locations are unknown.

Our probabilistic ray sampling strategy (PrSamp) approximates the continuous density along each ray with a mixture of 1D Gaussians that guides point sampling. It implicitly learns to associate high mixture values with surface locations, leading to better sampling with significantly fewer points: optimizing a 100 m volume requires only 64 points per ray.

Following the notation and steps of Figure 3, for each ray r we first uniformly sample k points between the near and far bounds.

[Figure 3: the steps of probabilistic ray sampling (PrSamp)]

(1) Taking these points and their corresponding features as input, a dedicated MLP g(·) predicts a mixture of k 1D Gaussians {G1, …, Gk}.
(2) We then sample m points from each Gaussian, plus 32 additional uniform points, for a total of N = k×m + 32 points per ray (a minimal sampling sketch is given at the end of this subsection). The uniform points are essential for exploring the scene volume and preventing g(·) from falling into local minima.
(3) All points are passed to f(·) from Equation (1) for volume rendering of the color Ĉ(r) and depth D̂(r).
(4) Intuitively, the densities {σ1, …, σN} inferred by f(·) are a cue to the 3D surface position, which we use to update the Gaussian mixture; this requires solving the underlying point-to-Gaussian assignment problem.
(5) For this we rely on the Probabilistic Self-Organizing Map (PrSOM) of [2]. In short, PrSOM assigns points to Gaussians according to the likelihood that each Gaussian generated them, while strictly preserving the topology of the mixture. For each Gaussian G_i and its assigned points X_i, the updated G'_i is the mean of all points j ∈ X_i, weighted by the conditional probability p(j | G_i) and by the occupancy probability of j given by the NeRF.

Finally, (6) the Gaussian predictor g(·) is updated by minimizing the average KL divergence between the current and updated Gaussians:

L_KL = (1/k) Σ_{i=1..k} KL( G_i ∥ G'_i )

To further encourage a Gaussian to lie on the visible surface, we also minimize the distance between the rendered depth and the nearest Gaussian:

L_dist = min_i | D̂(r) − μ_i |,   where μ_i is the mean of Gaussian G_i.
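As referenced in step (2), here is a minimal sketch of the point sampling driven by the predicted mixture; how g(·) parameterizes the Gaussians and how samples are clipped to the ray bounds are assumptions.

```python
import torch

def prsamp_points(means, stds, near, far, m=8, n_uniform=32):
    # means, stds: (R, k) parameters of the k 1D Gaussians predicted by g(.) for R rays.
    R, k = means.shape
    gauss = means.unsqueeze(-1) + stds.unsqueeze(-1) * torch.randn(R, k, m)  # m draws per Gaussian
    uniform = near + (far - near) * torch.rand(R, n_uniform)                 # exploratory samples
    depths = torch.cat([gauss.flatten(1), uniform], dim=1).clamp(near, far)
    return depths.sort(dim=1).values   # N = k*m + n_uniform ordered samples per ray
```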


2.3 Spherical U-Net (SU-Net)

By definition, the effective domain of f(·) is limited to the feature volume W(·), which for a standard U-Net is the camera's field of view (FOV); this prevents color and depth from being estimated outside the FOV, since no features can be retrieved there. This is unsuitable for scene reconstruction. Instead, SU-Net is equipped with a decoder that convolves in the spherical domain. Because spherical projections are less distorted than their planar counterparts, the FOV can be enlarged (typically to around 120°) to fill in colors and depths outside the FOV of the source image.

In the bottleneck, the encoder features are mapped with ψ(·) onto an arbitrary sphere and then passed to the spherical decoder, which uses lightweight dilated convolutions to increase the receptive field at low cost. As in a standard U-Net, we use multi-scale skip connections to improve gradient flow, only here the features are first mapped with ψ(·).

In practice, we map a 2D pixel [x, y]ᵀ to its normalized latitude-longitude spherical coordinates [θ, φ]. Consider the ray [x̄, ȳ, 1]ᵀ ∼ K⁻¹[x, y, 1]ᵀ going through the pixel and the camera center. The projection is:
[Equation: ψ(·) maps the ray direction (x̄, ȳ, 1) to latitude-longitude angles (θ, φ)]
[θ, φ] is uniformly discretized when fed into the decoder, and the features are stored in a tensor covering an arbitrarily large FOV.
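A sketch of one possible implementation of ψ(·); only the back-projection with K⁻¹ follows the text, while the exact latitude-longitude convention and normalization are assumptions.

```python
import numpy as np

def psi(pixels, K):
    # pixels: (N, 2) image coordinates, K: 3x3 camera intrinsics.
    ones = np.ones((pixels.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.concatenate([pixels, ones], axis=1).T).T  # [x_bar, y_bar, 1]
    x_b, y_b = rays[:, 0], rays[:, 1]
    theta = np.arctan2(x_b, 1.0)                        # longitude
    phi = np.arctan2(y_b, np.sqrt(x_b ** 2 + 1.0))      # latitude
    return np.stack([theta, phi], axis=1)               # later uniformly discretized onto a grid
```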


2.4 Scene reconstruction scheme

As shown in Figure 4, given an input frame we synthesize novel depths along an imaginary straight path, uniformly sampled every ρ meters up to a given distance. At each location we also vary the horizontal viewing angle over Φ = {−φ, 0, φ}.
[Figure 4: novel depths synthesized along an imaginary path are fused into the scene reconstruction]
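A sketch of how such synthesis poses could be generated; the coordinate convention (z as the forward axis, yaw about the vertical axis) is an assumption.

```python
import numpy as np

def novel_view_poses(max_dist, rho, angles_deg):
    # Camera poses along an imaginary straight path: advance every rho meters,
    # and at each position rotate the view by each horizontal angle in Phi.
    poses = []
    for z in np.arange(0.0, max_dist + 1e-6, rho):
        for a in np.deg2rad(angles_deg):
            R = np.array([[ np.cos(a), 0.0, np.sin(a)],
                          [ 0.0,       1.0, 0.0      ],
                          [-np.sin(a), 0.0, np.cos(a)]])   # yaw around the vertical axis
            T = np.eye(4)
            T[:3, :3] = R
            T[:3, 3] = [0.0, 0.0, z]                       # translation along the forward axis
            poses.append(T)
    return poses                                           # list of 4x4 camera-to-world matrices
```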

The synthesized depths are then converted to TSDFs using [3dmatch], and the overall scene TSDF for a voxel v takes the minimum over all of them:

V(v) = min_i V_i(v),

where i spans all synthesized depths. Traditionally, the voxel-wise TSDF is a weighted average of all TSDFs [10, 48], but our experiments (Appendix C.2) show that the minimum gives better results. We speculate that this is related to the depth error growing linearly with distance.
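A minimal sketch of the voxel-wise fusion, contrasting the minimum used here with the traditional weighted average; the sign convention of the per-view TSDFs V_i is assumed to be the usual truncated signed distance.

```python
import numpy as np

def fuse_tsdfs(tsdf_list, weight_list=None, mode="min"):
    # tsdf_list: per-view TSDF volumes V_i(v), all with identical shape.
    V = np.stack(tsdf_list, axis=0)
    if mode == "min":                   # fusion used here: minimum over all views
        return V.min(axis=0)
    W = np.stack(weight_list, axis=0)   # traditional weighted average [10, 48]
    return (W * V).sum(axis=0) / np.clip(W.sum(axis=0), 1e-6, None)
```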

3. Experiments

We evaluate SceneRF on two main tasks, novel depth synthesis and scene reconstruction, plus one auxiliary task, novel view synthesis. All three tasks are tested on SemanticKITTI and BundleFusion. SemanticKITTI contains large driving scenes (≈100 m depth) whose image sequences are captured from a front-facing camera, providing little viewpoint change. In contrast, BundleFusion has shallower indoor scenes (≈10 m) and sequences with larger lateral motion.

In PrSamp we use k = 4 Gaussians with m = 8 samples per Gaussian, but vary the novel depth/view sampling for reconstruction: for SemanticKITTI we sample every ρ = 0.5 m with angles Φ = {−10°, 0°, +10°}; for BundleFusion we sample every ρ = 0.2 m up to 2.0 m with Φ = {−20°, 0°, +20°}.
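For reference, these settings can be grouped into a small configuration (the dictionary names are illustrative; the values are the ones listed above).

```python
# Hyperparameters as listed in the text.
PRSAMP = {"k_gaussians": 4, "samples_per_gaussian": 8, "n_uniform": 32}
NOVEL_DEPTH_SAMPLING = {
    "SemanticKITTI": {"rho_m": 0.5, "angles_deg": (-10, 0, 10)},
    "BundleFusion":  {"rho_m": 0.2, "max_dist_m": 2.0, "angles_deg": (-20, 0, 20)},
}
```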

Some qualitative results:
[Figure: qualitative results]

Comparison with baselines:
[Figure: comparison with baselines]


Conclusion

SceneRF reconstructs a 3D scene from a single input image while being trained only on sequences of posed RGB images. Explicit depth optimization, probabilistic ray sampling (PrSamp) and the spherical SU-Net let it scale to large scenes and generate geometry beyond the input field of view, outperforming recent baselines on novel depth synthesis and scene reconstruction on both SemanticKITTI and BundleFusion.

Origin: blog.csdn.net/qq_45752541/article/details/131921263