[3D Editing] Interpretation of Removing Objects From Neural Radiance Fields

Title: Removing Objects From NeRF
Paper: https://arxiv.org/abs/2212.11966
Authors: Silvan Weder, Guillermo Garcia-Hernando, et al. (Niantic, ETH Zurich, University College London). Project page: nianticlabs.github.io/nerf-object-removal



Summary

NeRFs are becoming a commonly used method for representing 3D scenes, enabling novel view synthesis. When using NeRFs for 3D reconstruction, personal information or unsightly objects sometimes need to be removed. Such deletion is not easy to achieve with existing NeRF editing frameworks, so we propose a method for removing objects from NeRF representations created from RGB-D sequences. Our NeRF inpainting method leverages recent work on 2D image inpainting and is guided by user-supplied masks. Our algorithm is based on a confidence-based view-selection procedure: it chooses which individual 2D inpainted images to use when building the NeRF, so that the resulting inpainted NeRF is 3D consistent. We show that our method synthesizes plausible inpaintings in a multi-view consistent manner. We validate our method on the NeRF inpainting task using a new and still challenging dataset.



I. Introduction

Since Neural Radiance Fields (NeRF) were first published, there has been an explosion of extensions to the original framework. The use of NeRFs now goes beyond the original task of novel view synthesis, and applications such as editing in NeRF or live capture and training are already in the hands of novices. Among these, it is desirable to be able to delete parts of a scene. For example, a scan of a house shared on a property-sales website may need unsightly or personally identifiable objects removed. Similarly, in augmented-reality applications, objects may be removed so they can be replaced, for example removing a chair from a scan to see how a new chair fits into the environment.

Some editing work on NeRFs has appeared. For example, object-centric representations separate labeled objects from the background, which allows editing trained scenes with user-guided transformations; semantic decomposition allows selective editing and making certain semantic parts of the scene transparent. However, these methods only rearrange information from the input scans, limiting their ability to hallucinate elements that are not observed from any angle. To remove objects while filling in the resulting holes, the problem requires: a) exploiting multi-view information when parts of the scene are observed in some frames but occluded in others, and b) using a generative process to fill in areas that have never been observed. To this end, we pair the multi-view consistency of NeRF with the generative power of 2D inpainting models trained on large-scale 2D image datasets. Such 2D inpaintings are not multi-view consistent by construction and may contain severe artefacts, which would lead to corrupted reconstructions; we therefore devise a new confidence-based view-selection scheme that iteratively removes inconsistent inpaintings from the optimization, yielding multi-view consistent results.
Contributions :
1. We propose the first method to inpaint NeRF by exploiting the power of single image inpainting.
2. A new view selection mechanism that automatically removes inconsistent views from optimization.
3. A new dataset is proposed to evaluate indoor and outdoor object removal and inpainting scenarios.

2. Related work

Image inpainting: This line of work addresses the problem of plausibly filling missing regions in a single image. A typical approach is to use an image-to-image network with an adversarial loss, or a diffusion model. Different methods have been proposed to encode the input image, for example using masked or Fourier convolutions. However, these methods provide neither temporal consistency between video frames nor the ability to synthesize new viewpoints.

Removing moving objects from videos: While video inpainting is a well-studied research problem in computer vision, most work focuses on removing moving objects. This is usually achieved with guidance from nearby frames, e.g. by estimating optical flow and sometimes depth. Moving objects make the task easier because their motion reveals the background, so most of the scene is visible in at least one frame. In contrast, inpainting videos of static scenes requires generative methods, since large parts of the scene are never observed.

Removing static objects from video: These methods can inpaint regions whose occluded pixels are visible in other frames of the sequence. For example, static foreground distractors such as fences and raindrops are removed from videos. However, there are usually still some pixels that are not visible in any view, so another mechanism is needed to fill them: for example, patches are propagated from visible pixels into the region to be inpainted, or missing pixels are filled by patch matching. Kim et al. rely on precomputed meshes for each scene to remove objects. Our key difference from these methods is that our inpainting can be extrapolated to new viewpoints.

Inpainting in novel view synthesis: Inpainting is often used as an integral part of novel view synthesis to estimate the texture of regions not observed in the input, e.g. for panorama generation. Philip et al. allow object removal in image-based rendering, but assume that background regions are locally planar.

2.1. New View Synthesis and NeRF

NeRF is a very popular image-based rendering method that uses a differentiable volume rendering formulation to represent scenes; a multi-layer perceptron (MLP) is trained to regress color and density. This combines an implicit 3D scene representation with light-field rendering and novel view synthesis. Extensions include work that reduces aliasing artifacts, handles unbounded scenes, reconstructs scenes from only sparse views, or improves NeRF efficiency, for example by using octrees or other sparse structures.

Depth-aware neural radiance fields: To overcome some of the limitations of NeRF, especially the requirement for dense views and the quality of the reconstructed geometry, depth can be used during training. These depths can be sparse depths from structure-from-motion, or depths from sensors.

Object-centric and semantic NeRFs for editing: One direction of progress in NeRF is to decompose a scene into its constituent objects. This decomposition is based on motion for dynamic scenes, or on instance segmentation for static scenes. Both lines of work also treat the scene background as a separate model. However, similar to video inpainting, dynamic scenes allow for better background modeling because more of the background is visible. In contrast, visual artifacts can be seen in the background representation of methods that model a static scene. Methods that decompose scenes based on semantics can also be used for object deletion, but they do not attempt to complete the scene when a semantic part is removed.

Generative models for novel view synthesis: 3D-aware generative models can synthesize views of an object or scene from different viewpoints in a 3D-consistent manner. Generative models can hallucinate views of new objects by sampling in a latent space, but their ability to stay faithful to the source views may be limited, in contrast to NeRF models fitted to one specific scene. To train such generative models, most algorithms require a large dataset of indoor scenes with RGB images and camera poses. In contrast, the 2D pre-trained inpainting network we use can be trained on any image, depends less on specific training data, and is not constrained to indoor scenes.

3. Method

We assume an RGB-D sequence with camera poses and intrinsics. Depth and pose can be obtained, for example, using a dense structure-from-motion pipeline. In most experiments, we capture the RGB-D sequences directly using Apple's ARKit framework, but we also show that this requirement can be relaxed by using a multi-view stereo approach to estimate depth for RGB-only sequences. We additionally assume access to per-frame masks of the objects to be removed. The goal is to learn a NeRF model from this input that can synthesize consistent new views in which the masked regions of each frame are plausibly inpainted. An overview of the method is shown in the figure below.
[Figure: method overview]

3.1. RGB and depth inpainting networks

Our method relies on a 2D single-image inpainting method to inpaint each RGB image individually. We also need a depth inpainting network. We use these two networks as black boxes; our method is agnostic to the particular choice, so future improvements in single-image inpainting translate directly into improvements of our method. Given an input image I_n and the corresponding mask M_n, the single-image inpainting algorithm generates an inpainted image Î_n. Similarly, the depth inpainting algorithm generates an inpainted depth map D̂_n. Figure 2 shows some results of the 2D inpainting network (good results on the left; poor results on the right, containing severe artifacts that would corrupt the optimization).
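As an illustration of this black-box usage, a minimal sketch is given below; the wrapper functions `inpaint_rgb` and `inpaint_depth` are hypothetical placeholders for whatever single-image inpainting networks are chosen, not the authors' API.

```python
def inpaint_sequence(images, depths, masks, inpaint_rgb, inpaint_depth):
    """Run single-image inpainting independently on every frame.

    images: list of HxWx3 RGB frames, depths: list of HxW depth maps,
    masks:  list of HxW boolean masks (True = region to remove),
    inpaint_rgb / inpaint_depth: black-box 2D inpainting callables.
    """
    inpainted_rgb, inpainted_depth = [], []
    for img, dep, msk in zip(images, depths, masks):
        inpainted_rgb.append(inpaint_rgb(img, msk))      # inpainted image Î_n
        inpainted_depth.append(inpaint_depth(dep, msk))  # inpainted depth D̂_n
    return inpainted_rgb, inpainted_depth
```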

3.2. NeRF background

Following the original NeRF paper, we represent a scene as an MLP F_Θ that predicts a color c = [r, g, b] and a density σ for a 5-dimensional input containing the x, y, z position and two viewing-direction angles. The predicted color Î_n(r) of pixel r is obtained by the volume rendering function along its associated ray:

[Equation 1: volume rendering of the pixel color along ray r]

Here, K is the number of samples along the ray, t_i is the sample position, δ_i = t_{i+1} − t_i is the distance between adjacent samples, and w_i is the accumulated alpha-compositing weight, whose sum is less than or equal to 1 by construction. The NeRF loss is then (with an additional depth loss if input depth labels are available):

[Equations 2-3: RGB reconstruction loss and depth loss]
Ω_n denotes the two-dimensional domain of image n, D_n(r) is the input depth at pixel r, and D̂_n(r) is the corresponding predicted depth.
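For reference, a sketch of these equations based on the symbol definitions above and the standard NeRF formulation; the exact notation and norms may differ from the paper.

```latex
% Volume rendering of the color of pixel r (cf. Eq. 1):
\hat{I}_n(\mathbf{r}) = \sum_{i=1}^{K} w_i \, \mathbf{c}_i ,
\qquad
w_i = \exp\!\Big(-\sum_{j<i} \sigma_j \delta_j\Big)\big(1 - \exp(-\sigma_i \delta_i)\big)

% Photometric loss and optional depth loss (cf. Eqs. 2-3):
\mathcal{L}_{\mathrm{RGB}} = \sum_{\mathbf{r} \in \Omega_n}
    \big\lVert \hat{I}_n(\mathbf{r}) - I_n(\mathbf{r}) \big\rVert_2^2 ,
\qquad
\mathcal{L}_{\mathrm{depth}} = \sum_{\mathbf{r} \in \Omega_n}
    \big\lVert \hat{D}_n(\mathbf{r}) - D_n(\mathbf{r}) \big\rVert_2^2
```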

Finally, we use the distortion regularization loss from Mip-NeRF 360 to better constrain the NeRF and remove "floaters". It encourages the non-zero accumulated weights w_i to be concentrated in a small region of the ray, so for each pixel r:
[Equation 4: distortion regularizer]
The first term penalizes weight that is spread across distant intervals of the ray (pulling the distribution together), while the second term penalizes the weight within each individual interval, encouraging compact, opaque surfaces.
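For reference, the distortion loss as defined in the Mip-NeRF 360 paper, applied here per pixel r; s_i denotes the (normalized) sample positions and δ_i = s_{i+1} − s_i.

```latex
\mathcal{L}_{\mathrm{dist}}(\mathbf{s}, \mathbf{w}) =
    \sum_{i,j} w_i w_j \left| \frac{s_i + s_{i+1}}{2} - \frac{s_j + s_{j+1}}{2} \right|
  + \frac{1}{3} \sum_i w_i^{2} \, \delta_i
```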

3.3. Confidence-Based View Selection

Although most of the individual inpainted RGB images look realistic, they still suffer from two problems: 1) some inpaintings are simply incorrect, and 2) even when individually plausible, they are not multi-view consistent (the same region observed in multiple views is not necessarily inpainted in a consistent way). We therefore introduce a confidence-based view-selection scheme that automatically selects which views to use in the NeRF optimization. We associate each image n with a non-negative uncertainty measure u_n. The corresponding per-image confidence e^{-u_n} is used to reweight the NeRF loss. This confidence value can be seen as a loss-attenuation term, similar to standard uncertainty-prediction formulations. The RGB loss becomes (Equation 5):
[Equation 5: confidence-weighted RGB loss]
where the color of pixel r is supervised by the inpainted image for pixels inside the mask and by the input RGB image for pixels outside the mask. The second term of this loss is computed only over a subset of images P ⊆ {1, …, N}; in practice, this means that only the inpainted regions of the selected images are used in the NeRF optimization. We discuss how to choose the set P below.
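A hedged sketch of Equation 5, reconstructed from the description above; Ĩ_n here denotes the 2D-inpainted image of Section 3.1 (written Î_n there) to distinguish it from the NeRF-rendered color Î_n(r), M_n is the mask of image n, and u_n the per-image uncertainty. The exact weighting in the paper may differ.

```latex
\mathcal{L}_{\mathrm{RGB}} =
    \sum_{n=1}^{N} \; \sum_{\mathbf{r} \in \Omega_n \setminus M_n}
        \big\lVert \hat{I}_n(\mathbf{r}) - I_n(\mathbf{r}) \big\rVert_2^2
  \;+\; \sum_{n \in P} \; \sum_{\mathbf{r} \in M_n}
        e^{-u_n} \big\lVert \hat{I}_n(\mathbf{r}) - \tilde{I}_n(\mathbf{r}) \big\rVert_2^2
```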
We construct a depth loss in the same way, using pixels inside and outside the segmentation mask:
Finally, we include two regularizers. One acts on the uncertainty weights, L^P_reg, and prevents the trivial solution where e^{-u_n} goes to 0. The other is the Mip-NeRF-style distortion regularizer detailed in Equation (4). So:
[Equations: confidence-weighted depth loss and regularization terms]
Viewing direction and multi-view consistency: When optimizing the NeRF, we make three observations: a) multi-view inconsistency of the inpaintings is modeled by the network via the viewing direction; b) we can strengthen multi-view consistency by removing the viewing direction from the input; c) when the viewing direction is not used as input, the inconsistency instead introduces cloud-like artifacts in the density.

To prevent a) and c), and to correctly optimize the uncertainty variables u_n that capture the quality of the inpainted images Î_n, we propose to: 1) use a color branch F_{Θ_MV} in the NeRF that does not take the viewing direction as input, and 2) stop the gradient from the inpainted-color branch F_{Θ_MV} to the density, leaving the uncertainty variable u_n as the only view-dependent input. This design forces the model to encode inconsistencies between inpaintings in the uncertainty predictions, while keeping the model consistent across different views. The loss term for F_{Θ_MV} follows Equation 5. The figure below shows the structure:
[Figure: network architecture with the view-independent branch F_{Θ_MV} and stop-gradient]

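Below is a minimal PyTorch-style sketch of this two-branch design, assuming standard positionally-encoded inputs; the layer widths loosely follow Section 3.4 but the color branches are shortened, and the per-image uncertainty variables are omitted for brevity. This is an illustrative reconstruction, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ObjectRemovalNeRF(nn.Module):
    """Shared density trunk with a view-dependent and a view-independent color head."""
    def __init__(self, pos_dim=63, dir_dim=27, hidden=256):
        super().__init__()
        layers = [nn.Linear(pos_dim, hidden), nn.ReLU()]
        for _ in range(7):                                   # 8 x 256-d layers in total
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)                  # shared F_sigma
        self.sigma_head = nn.Linear(hidden, 1)               # density
        self.color_head = nn.Sequential(                     # view-dependent F_c
            nn.Linear(hidden + dir_dim, 128), nn.ReLU(), nn.Linear(128, 3))
        self.color_mv_head = nn.Sequential(                  # view-independent F_MV
            nn.Linear(hidden, 128), nn.ReLU(), nn.Linear(128, 3))

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.nn.functional.softplus(self.sigma_head(h))
        rgb = torch.sigmoid(self.color_head(torch.cat([h, d_enc], dim=-1)))
        # Stop-gradient: the branch supervised by 2D inpaintings cannot
        # modify the shared density, as described above.
        rgb_mv = torch.sigmoid(self.color_mv_head(h.detach()))
        return sigma, rgb, rgb_mv
```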

The total loss is optimized over the MLP parameters {Θ_σ, Θ_c, Θ_MV} and the uncertainty weights U_P = {u_n, n ∈ P}:
[Equation: total loss]
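A sketch of how these terms are likely combined, using the loss weights reported in Section 3.4; the paper's exact grouping may differ.

```latex
\mathcal{L}(\Theta_\sigma, \Theta_c, \Theta_{MV}, U_P) =
    \lambda_{\mathrm{RGB}}\,\mathcal{L}_{\mathrm{RGB}}
  + \lambda_{\mathrm{depth}}\,\mathcal{L}_{\mathrm{depth}}
  + \lambda_{\mathrm{dist}}\,\mathcal{L}_{\mathrm{dist}}
  + \lambda_{\mathrm{reg}}\,\mathcal{L}^{P}_{\mathrm{reg}}
```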

Iterative refinement: Using the predicted per-image uncertainties u_n, non-confident images are gradually removed from the NeRF optimization; that is, the set of images P that contribute to the loss inside the mask is updated iteratively. After optimizing L_P for K steps, we compute the median m of the estimated confidences. We then remove from the training set all 2D inpainted regions whose confidence score is smaller than m, retrain the NeRF with the updated training set, and repeat these steps. Note that images excluded from P still participate in the optimization, but only through rays in non-masked regions, since those still contain valuable information about the scene. Pseudocode is shown on the right side of the figure above.
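A sketch of this iterative filtering loop, under the assumption of a hypothetical `train_nerf_steps` callable that runs the optimization for a fixed number of steps and returns the model plus the current per-image uncertainties u_n; this is not the authors' actual API.

```python
import numpy as np

def iterative_view_selection(frames, train_nerf_steps, n_outer=4, k_steps=50_000):
    """Iteratively drop inpainted views whose confidence falls below the median."""
    P = set(range(len(frames)))       # images whose inpainted regions supervise the mask
    model = None
    for _ in range(n_outer):
        model, u = train_nerf_steps(frames, P, k_steps, init=model)
        conf = {n: float(np.exp(-u[n])) for n in P}      # per-image confidence e^{-u_n}
        m = np.median(list(conf.values()))
        # Keep only inpaintings at or above the median confidence;
        # excluded images still supervise rays outside their masks.
        P = {n for n in P if conf[n] >= m}
    return model, P
```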

3.4 Implementation Details

Masking the objects to be removed: Our method requires a mask for each frame as input. Manually annotating a 2D mask in every frame, as done in other inpainting methods, is time-consuming. Instead, we manually annotate a 3D box containing the object, which only needs to be done once per scene. We reconstruct a 3D point cloud of the scene from the camera poses and input depth maps, import it into 3D visualization and editing software such as MeshLab, and specify the 3D bounding box of the object to be removed there. Alternatively, we can rely on 2D object segmentation methods (such as Mask R-CNN) or on a 3D object bounding-box detector such as Objectron to mask objects.

Mask refinement: In practice, the masks obtained from the annotated 3D bounding boxes are rather coarse and contain a lot of background. We propose a mask refinement step (removing the empty space inside the 3D bounding box) to obtain a tighter mask around the object. We first take all points of the reconstructed 3D point cloud that lie inside the 3D bounding box. A refined mask is then obtained by rendering these points into each image and comparing against the depth map to check for occlusions in the current view. The resulting mask is cleaned up with binary dilation and erosion to handle pixel leakage caused by sensor noise. The effect of the mask refinement step is shown in Figure 5.

[Figure 5: effect of the mask refinement step]
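A hedged sketch of this refinement step: points of the reconstructed cloud that fall inside the annotated 3D box are projected into each frame, depth-tested against the sensor depth, and the resulting mask is cleaned with morphological dilation and erosion. Function names, the depth tolerance, and the kernel size are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def refine_mask(points_in_box, K, T_wc, depth, depth_tol=0.05, ksize=5):
    """points_in_box: (N,3) world points inside the 3D bbox,
    K: (3,3) intrinsics, T_wc: (4,4) world-to-camera, depth: (H,W) in metres."""
    H, W = depth.shape
    pts_h = np.concatenate([points_in_box, np.ones((len(points_in_box), 1))], axis=1)
    cam = (T_wc @ pts_h.T).T[:, :3]                      # points in camera frame
    valid = cam[:, 2] > 0
    uv = (K @ cam[valid].T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    z = cam[valid, 2]
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z = u[inside], v[inside], z[inside]
    # Keep only points not occluded by closer geometry in this view.
    visible = z <= depth[v, u] + depth_tol
    mask = np.zeros((H, W), dtype=bool)
    mask[v[visible], u[visible]] = True
    # Morphological clean-up against sensor noise / pixel leakage.
    kernel = np.ones((ksize, ksize), dtype=bool)
    return binary_erosion(binary_dilation(mask, kernel), kernel)
```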
Inpainting network: We use the LaMa model [Resolution-robust large mask inpainting with Fourier convolutions] to inpaint both RGB and depth. The RGB image and the depth map are inpainted independently, using the reference implementation provided with that model. The depth map is preprocessed by clipping to 5 m and linearly mapping depths in [0 m, 5 m] to pixel values in [0, 255]; this single-channel map is replicated into an H×W×3 tensor for input to the model.
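A minimal sketch of this depth pre-processing, assuming `depth` is an HxW float array in metres:

```python
import numpy as np

def depth_to_inpainting_input(depth, max_depth=5.0):
    """Clip depth, map [0, max_depth] m to [0, 255], replicate to 3 channels."""
    d = np.clip(depth, 0.0, max_depth)
    d8 = (d / max_depth * 255.0).astype(np.uint8)         # [0 m, 5 m] -> [0, 255]
    return np.repeat(d8[..., None], 3, axis=-1)           # H x W x 3
```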

NeRF estimation: Our implementation is built on top of Mip-NeRF and RegNeRF. The shared MLP F_Θσ consists of eight 256-dimensional layers, while the branches F_Θc and F_ΘMV each consist of four 128-dimensional layers. The final outputs use a softplus activation for the density, a ReLU for the uncertainty, and a sigmoid for the color channels. We weight the loss terms with λ_RGB = λ_depth = λ_dist = 1 and λ_reg = 0.005, and optimize with Adam at an initial learning rate of 0.0005. Filtering steps: low-confidence images are removed every K_grad = 50,000 steps, for K_outer = 4 filtering steps in total.

4. Experiments

4.1. Datasets

We introduce an RGB-D dataset of real scenes aimed at evaluating the quality of object removal. The dataset will be made public and comes in two variants:

  1. Real objects: There are 17 scenes, indoor and outdoor, in landscape and portrait orientation, each focusing on a small region containing at least one object, one of which is marked as the object of interest. They vary in difficulty with respect to background texture, object size, and complexity of the scene geometry. For each scene, we collect two sequences, one with and one without the object to be removed. These sequences were captured using ARKit on an iPhone 12 Pro with LiDAR and contain RGB-D images (resolution 192×256) and poses. The two sequences of the same scene share the same camera coordinate system, and sequence lengths vary from 194 to 693 frames.

As mentioned before, masks are obtained by annotating the 3D bounding box of the object of interest and are refined for all scenes. For each scene, we train the NeRF model on the sequence with the object and the corresponding masks, and use the sequence without the object for testing. Using real objects makes it easier to evaluate how the system handles realistic shadows and reflections, as well as novel view synthesis.

  2. Synthetic objects: Most video and image inpainting methods do not perform novel view synthesis, meaning they cannot be evaluated fairly on the "real objects" dataset. Therefore, we introduce a separate, synthetically augmented variant of our dataset. It uses the same scenes as the real-objects dataset, but only the sequences that do not contain objects. We then manually place a 3D object mesh from ShapeNet [7] in each scene. The object is placed so that it has a plausible position and size, for example a laptop on a table. Masks are obtained by projecting the mesh into the input images, which is our only use of the 3D object mesh. For this synthetic dataset, every 8th frame is used for testing and the remaining frames are used to train the NeRF model.

ARKitScenes: We further validate our method qualitatively on ARKitScenes, an RGB-D dataset of 1661 scenes where the depth was captured with the iPhone LiDAR.

4.2. Evaluation metrics

All metrics in this paper are computed only within the masked regions (comparing the system's output on the test images against the ground-truth images). We use three standard metrics for NeRF evaluation: PSNR, SSIM and LPIPS. To evaluate the quality of the completed geometry, we also compute the L1 and L2 errors between the rendered depth map and the ground-truth depth inside the mask region (averaged over all frames of a sequence, then averaged over all sequences).
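As an illustration, a sketch of computing masked PSNR and the masked depth L1 error for a single frame (SSIM and LPIPS would come from their usual libraries); this is not the authors' evaluation code.

```python
import numpy as np

def masked_psnr_and_depth_l1(pred_rgb, gt_rgb, pred_depth, gt_depth, mask):
    """pred_rgb/gt_rgb in [0,1], shape HxWx3; depths HxW in metres; mask HxW bool."""
    m = mask.astype(bool)
    mse = np.mean((pred_rgb[m] - gt_rgb[m]) ** 2)
    psnr = -10.0 * np.log10(mse + 1e-12)                # PSNR over masked pixels only
    depth_l1 = np.mean(np.abs(pred_depth[m] - gt_depth[m]))
    return psnr, depth_l1
```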

Comparison with baselines and state-of-the-art methods:
[Tables and figures: quantitative and qualitative comparison with baselines and state-of-the-art methods]


Summary


This paper proposes a framework for training neural radiance fields from which objects can be removed in the output renderings. The method leverages existing 2D inpainting work and introduces an automatic confidence-based view-selection scheme that selects single-view inpaintings which are multi-view consistent. We experimentally verify that the proposed method improves novel view synthesis of inpainted 3D scenes compared to existing work. We also introduce a dataset to evaluate this task, which sets a benchmark for other researchers in the field.


Origin blog.csdn.net/qq_45752541/article/details/131002788