ICCV 2023 Oral | Cross-Ray Neural Radiance Fields for Novel-View Synthesis from Unconstrained Image Collections


Article link: https://arxiv.org/abs/2307.08093
Code link: https://github.com/YifYang993/CR-NeRF-PyTorch.git

01. Introduction

This work aims to provide immersive 3D experiences by synthesizing novel-view images from unconstrained image collections, such as images crawled from the Internet. The method lets users view world-famous landmarks, such as the Brandenburg Gate in Berlin, Germany, and the Trevi Fountain in Rome, Italy, in any season and from multiple viewpoints.

Specifically, suppose a user wants to visit the Brandenburg Gate and enjoy the scenery at different times and under different weather conditions, but cannot travel there in person because of study, work, or cost. How can they “virtually tour” this attraction in various weather conditions, at various times, and from various angles without leaving home?

This is where our CR-NeRF comes in handy. Users only need to collect photos of the Brandenburg Gate from the Internet, whether taken during the day or at night, in spring, summer, autumn, or winter, and CR-NeRF can then generate novel-view images of the Brandenburg Gate, rendered according to the camera viewpoint and image style the user specifies. In this way, users can experience the diverse scenery of the Brandenburg Gate in a virtual environment and observe how the landscape changes with time and weather, visiting world-famous places from home and enjoying an immersive travel experience. This technology not only saves travel cost and time, but also gives users more possibilities to explore the world.

02. Summary

Neural Radiance Fields (NeRF) are a revolutionary approach to scene rendering, demonstrating an impressive ability to synthesize novel views of static scenes by sampling a single ray per pixel. In practice, however, we often need to recover a NeRF from unconstrained image collections, which poses two challenges:

1) Images usually exhibit varying appearance due to differences in shooting time and camera settings;

2) Images may contain transient objects such as people and cars, which cause occlusions and artifacts.
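For context, standard NeRF renders each pixel independently by volume rendering along a single camera ray. The equation below is the standard NeRF rendering formula, reproduced here only for reference (notation follows the original NeRF paper, not this article):

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$$

where $\sigma_i$ and $\mathbf{c}_i$ are the density and color predicted at the $i$-th sample along ray $\mathbf{r}$, and $\delta_i$ is the distance between adjacent samples. The cross-ray paradigm introduced below generalizes this strictly per-ray computation to groups of rays.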

Traditional approaches address these challenges by exploiting individual rays locally. In contrast, humans typically perceive appearance and objects by leveraging information globally across multiple pixels. To mimic this perception process, we propose Cross-Ray NeRF (CR-NeRF) in this paper, which exploits interactive information across multiple rays to synthesize novel views that are free of occlusions and share the appearance of a reference image. Specifically, to model varying appearance, we first propose a novel cross-ray feature to represent multiple rays, and then recover the appearance by fusing global statistics of the rays (i.e., the covariance of the ray features) with the image appearance.

Furthermore, to avoid occlusions introduced by transient objects, we propose a transient object handler and introduce a grid sampling strategy for masking out transient objects. We show theoretically that exploiting the correlation between multiple rays helps capture more global information. Experimental results on large real-world datasets further verify the effectiveness of CR-NeRF.

03. Method motivation

With CR-NeRF, we take photos captured under different lighting conditions as input and reconstruct a 3D scene with controllable appearance while removing occlusions from the images. Reconstructing a NeRF from Internet image collections faces the following two challenges.

  1. Varying appearance:  Even if two tourists take photos from the same viewpoint, the conditions still differ: shooting time, weather (e.g., sunny, rainy, foggy), and camera settings (e.g., aperture, shutter speed, ISO). As a result, multiple photos of the same scene taken from the same angle may have very different appearances.

  2. Transient occlusion:  Transient objects such as cars and pedestrians may occlude the scene. Since such objects often appear in only a single image, reconstructing them with high quality is usually impractical. These challenges conflict with NeRF's static-scene assumption, leading to inaccurate reconstructions, over-smoothing, and ghosting artifacts [1].

Recently, researchers have proposed several methods (NeRF-W [1]; Ha-NeRF [2]) to address these challenges. As shown in Figure 1(a), NeRF-W and Ha-NeRF reconstruct the 3D scene in a single-ray manner. Specifically, they fuse appearance features and occlusion features with single-ray features separately, and then synthesize the color of each novel-view pixel independently. A potential problem with this approach is that it relies on the local information of each ray (i.e., a single image pixel) to identify appearance and transient objects.

In contrast, humans tend to exploit global information (i.e., information across multiple image pixels), which provides a more comprehensive understanding of an object when observing its appearance and dealing with occlusions. Motivated by this, we propose a cross-ray paradigm for handling varying appearance and transient objects (see Figure 1(b)), in which we utilize global information from multiple rays to recover appearance and handle transient objects, and then synthesize entire regions of the novel view simultaneously.

[Figure 1: Motivation of CR-NeRF — the single-ray paradigm (a) vs. the cross-ray paradigm (b)]

04. Method

Based on the cross-ray paradigm, we propose Cross-Ray Neural Radiance Fields (CR-NeRF), as shown in Figure 2. CR-NeRF consists of two parts:

  1. To model varying appearance, we propose a new cross-ray feature that represents the information of multiple rays. We then fuse the cross-ray features with the appearance features of the input image through a cross-ray transformation network, using global statistics (e.g., the feature covariance across rays). The fused features are fed into a decoder to obtain the colors of multiple pixels simultaneously.

  2. For transient object handling, we take a new perspective and cast it as a segmentation problem, detecting transient objects by considering the global information of image regions. Specifically, we segment the input image to obtain a visibility map of the scene. To reduce computational overhead, we introduce a grid sampling strategy that samples the input rays and the segmentation map identically so that the two stay paired. We show theoretically that exploiting the correlation between multiple rays captures more global information.

Next, we describe the two parts of CR-NeRF in detail.

PS: We assume readers are already familiar with NeRF, camera models, etc. If not, please refer to the preliminaries section of the CR-NeRF paper.

[Figure 2: CR-NeRF method pipeline]

4.1 Cross-ray appearance transfer module
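The full details of this module are given in the paper and the released code. As a rough, non-authoritative sketch of the core idea described above (fusing cross-ray features with the appearance features of a reference image through global statistics such as feature covariance), the PyTorch-style snippet below simply matches second-order statistics between the two feature sets. All names (covariance, match_covariance, cross_ray_feats, app_feats) are hypothetical; this is not the authors' implementation.

```python
import torch

def covariance(feats: torch.Tensor, eps: float = 1e-5):
    """feats: (C, N) channel-wise features. Returns mean (C, 1) and covariance (C, C)."""
    mean = feats.mean(dim=1, keepdim=True)
    centered = feats - mean
    cov = centered @ centered.t() / (feats.shape[1] - 1)
    return mean, cov + eps * torch.eye(feats.shape[0], device=feats.device)

def match_covariance(cross_ray_feats: torch.Tensor, app_feats: torch.Tensor) -> torch.Tensor:
    """Whitening/coloring-style fusion: make the cross-ray features carry the
    second-order statistics (covariance) of the appearance features.
    cross_ray_feats: (C, N) features of N sampled rays.
    app_feats:       (C, M) features extracted from the appearance/reference image.
    """
    mu_r, cov_r = covariance(cross_ray_feats)
    mu_a, cov_a = covariance(app_feats)

    # Whiten the cross-ray features (remove their own covariance).
    evals_r, evecs_r = torch.linalg.eigh(cov_r)
    whiten = evecs_r @ torch.diag(evals_r.clamp_min(1e-8).rsqrt()) @ evecs_r.t()

    # Re-color with the appearance covariance.
    evals_a, evecs_a = torch.linalg.eigh(cov_a)
    color = evecs_a @ torch.diag(evals_a.clamp_min(1e-8).sqrt()) @ evecs_a.t()

    fused = color @ (whiten @ (cross_ray_feats - mu_r)) + mu_a
    return fused  # (C, N), decoded into N pixel colors at once
```

The actual CR-NeRF transformation network is learned end to end; the point of the sketch is only that a covariance-based statistic couples all rays in a batch, which a per-ray (per-pixel) operation cannot do.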

4.2 Occlusion processing module
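Again, see the paper for the exact architecture of the transient object handler. As a hedged illustration of the grid sampling idea described in Section 04 (sampling the input rays and the segmented visibility map on the same pixel grid so they stay paired, then down-weighting occluded pixels during training), here is a minimal sketch with hypothetical names (grid_sample_rays, masked_rgb_loss), not the released implementation:

```python
import torch

def grid_sample_rays(rays: torch.Tensor, visibility: torch.Tensor, stride: int = 4):
    """Pair rays with a predicted visibility map by sampling both on the same grid.
    rays:       (H, W, 6) per-pixel ray origins and directions.
    visibility: (H, W)    soft mask in [0, 1]; ~0 where a transient object occludes the scene.
    Returns subsampled rays (h*w, 6) and matching visibility weights (h*w,).
    """
    sub_rays = rays[::stride, ::stride]        # (h, w, 6)
    sub_vis = visibility[::stride, ::stride]   # (h, w), same grid -> exact pairing
    return sub_rays.reshape(-1, 6), sub_vis.reshape(-1)

def masked_rgb_loss(pred_rgb, gt_rgb, vis_weights):
    """Down-weight pixels covered by transient objects when supervising the static scene."""
    per_pixel = ((pred_rgb - gt_rgb) ** 2).mean(dim=-1)  # (h*w,)
    return (vis_weights * per_pixel).sum() / vis_weights.sum().clamp_min(1e-8)
```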

05. Experiment

5.1 Quantitative results

We conduct extensive experiments on the Brandenburg Gate, Sacre Coeur, and Trevi Fountain datasets. As shown in Table 1, the original NeRF performs the worst among all methods because it assumes the scene captured by the training images is static. By modeling appearance embeddings and handling transient objects, NeRF-W and Ha-NeRF achieve comparable performance in PSNR, SSIM, and LPIPS. Thanks to the cross-ray design, our CR-NeRF outperforms both NeRF-W and Ha-NeRF.
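For readers who want to reproduce this kind of comparison, PSNR, SSIM, and LPIPS are standard image-quality metrics. The snippet below is a minimal evaluation sketch, not the authors' script; it assumes a recent scikit-image and the lpips package:

```python
import lpips
import numpy as np
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual metric commonly used in NeRF papers

def evaluate(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: (H, W, 3) float images in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # lpips expects (1, 3, H, W) tensors scaled to [-1, 1]
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```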

[Table 1: Quantitative comparison between CR-NeRF and SOTA methods (PSNR / SSIM / LPIPS)]

5.2 Qualitative results

We present qualitative results for all compared methods in Figure 3. We observe that NeRF produces hazy artifacts and inaccurate appearance. NeRF-W and Ha-NeRF reconstruct more promising 3D geometry and model appearance from the ground-truth images. However, the reconstructed geometry is not precise enough; see, for example, the shape of the greenery at the Brandenburg Gate, the ghosting around its columns, and the cavities of Sacre Coeur. In addition, existing methods generate insufficiently realistic appearance, such as the sunlight on the statue of Sacre Coeur and the colors of the blue sky and gray roofs at the Trevi Fountain. In comparison, our CR-NeRF introduces a cross-ray paradigm and thus models appearance more realistically and reconstructs consistent geometry by suppressing transient objects.

[Figure 3: Qualitative comparison between CR-NeRF and SOTA methods]

5.3 Ablation study of the cross-ray appearance transfer module and the transient object handling module

Table 2 shows the ablation results of CR-NeRF on the Brandenburg Gate, Sacre Coeur, and Trevi Fountain datasets. We observe that the performance of our baseline (CR-NeRF-B) improves steadily as we add the cross-ray appearance transfer module (CR-NeRF-A) and the transient object handling module (CR-NeRF-T).

[Table 2: Ablation study of CR-NeRF]

5.4 Inference speed

[Table 3: Comparison of inference time between CR-NeRF and Ha-NeRF]

5.5 More experiments

We also conducted interpolation experiments on appearance features, compared appearance transfer against SOTA methods, and produced video demos. Please see our paper and the GitHub repository for details.

06. Summary and outlook

6.1 Summary

The contributions of this work are summarized as follows:

  • A new cross-ray paradigm for synthesizing novel views from unconstrained photo collections:  We find that existing methods fail to produce satisfactory visual results from unconstrained photo collections with a single-ray paradigm, mainly because they neglect the potential cooperative interactions among multiple rays. To address this problem, we propose a new cross-ray paradigm that exploits global information across multiple rays.

  • An interactive and global scheme for handling varying appearance:  Unlike existing methods that handle each ray independently, we represent multiple rays with cross-ray features, which enables interaction between rays through feature covariance. This allows us to inject a global representation of appearance into the scene, resulting in more realistic and efficient appearance modeling. Our theoretical analysis demonstrates the necessity of considering multiple rays in appearance modeling.

  • A new segmentation-based technique for handling transient objects:  We reformulate transient object handling as a segmentation problem and segment the unconstrained images into visibility maps using their global information. In addition, we employ grid sampling to pair the visibility map with multiple rays. Experimental results show that CR-NeRF removes transient objects from the reconstructed images.

6.2 Outlook

There is still considerable room for improvement. For example, as noted at the end of the paper, there is currently no ground-truth supervision for transient objects, so we rely entirely on the deep model to learn their patterns from the data, and more refined modeling is still lacking. More importantly, we believe that the very definition of transient objects remains an open problem, which we leave to future work.

References

[1] Martin-Brualla, Ricardo, et al. "Nerf in the wild: Neural radiance fields for unconstrained photo collections." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[2] Chen, Xingyu, et al. "Hallucinated neural radiance fields in the wild." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[3] Schwarz, Katja, et al. "Graf: Generative radiance fields for 3d-aware image synthesis." Advances in Neural Information Processing Systems 33 (2020): 20154-20166.


