ICCV 2023 Oral | CR-NeRF: Cross-Ray Neural Radiance Fields for Novel View Synthesis from Unconstrained Image Collections


  • Paper: Cross-Ray Neural Radiance Fields for Novel-view Synthesis from Unconstrained Image Collections

  • Paper link: https://arxiv.org/abs/2307.08093

  • Code link: https://github.com/YifYang993/CR-NeRF-PyTorch.git

Introduction

This work aims to provide a 3D immersive experience by synthesizing novel-view images from unconstrained image collections, such as images crawled from the Internet. The method lets users appreciate international landmarks, such as the Brandenburg Gate in Berlin, Germany, or the Trevi Fountain in Rome, Italy, from multiple viewpoints and in any season. Suppose a user wants to visit the Brandenburg Gate and enjoy the scenery at different times and in different weather, but cannot travel there in person because of study, work, or cost. How can they "tour" the attraction in various weather conditions, at various times, and from multiple angles without leaving home? This is where our proposed CR-NeRF comes in. Users only need to collect photos of the Brandenburg Gate from the Internet, whether taken during the day or at night, in spring, summer, autumn, or winter, and CR-NeRF can then generate novel-view images of the landmark, rendered according to a user-specified camera pose and image style. In this way, users can experience the diverse scenes of the Brandenburg Gate in a virtual environment and observe how the landscape changes with time and weather, enjoying the world's famous attractions from home with an immersive travel experience. This technology not only saves travel cost and time, but also offers users more possibilities for exploring the world.

Examples of 3D scenes reconstructed by CR-NeRF can be found in the video demos linked from the GitHub repository.

Abstract

Neural Radiance Fields (NeRF) are a revolutionary approach to scene rendering that can generate novel views from images of a static scene by sampling a single ray per pixel. In practice, however, we usually need to recover a NeRF from an unconstrained collection of images, which poses two challenges: 1) the images often vary in appearance due to different capture times and camera settings; 2) the images may contain transient objects, such as people and cars, which cause occlusions and artifacts. Traditional approaches address these challenges by operating locally on individual rays. In contrast, humans typically perceive appearance and objects by exploiting information globally, across multiple pixels. To mimic this perceptual process, we propose Cross-Ray NeRF (CR-NeRF), which exploits interactive information across multiple rays to synthesize novel views that are occlusion-free and share the appearance of the input images. Specifically, to model diverse appearances, we first represent multiple rays with novel cross-ray features, and then recover the appearance by fusing global statistics of the rays, namely the covariance of the ray features and the image appearance. Furthermore, to avoid occlusions introduced by transient objects, we propose a transient-object handler and introduce a grid sampling strategy for masking out transient objects. We show theoretically that exploiting the correlation between multiple rays helps capture more global information, and experimental results on large real-world datasets verify the effectiveness of CR-NeRF.

Motivation

With CR-NeRF, we input photos taken under different lighting conditions to reconstruct a 3D scene with controllable appearance while removing occlusions from the images. Reconstructing a NeRF from Internet image collections faces two challenges. 1) Varying appearance: even two tourists shooting from the same viewpoint do so under different conditions: different capture times, different weather (sunny, rainy, foggy), and different camera settings (aperture, shutter, ISO). As a result, multiple shots of the same scene from the same viewpoint may look drastically different. 2) Transient occlusion: transient objects such as cars and pedestrians may occlude the scene. Since these objects usually appear in only a single image, reconstructing them with high quality is generally impractical. Both challenges conflict with NeRF's static-scene assumption, leading to inaccurate reconstructions, over-smoothing, and ghosting artifacts. Recently, researchers have proposed several methods (NeRF-W [1]; Ha-NeRF [2]) to address these challenges. As shown in Figure 1(a), NeRF-W and Ha-NeRF reconstruct 3D scenes in a single-ray manner: they fuse appearance features and occluder features with single-ray features separately, and then synthesize the color of each novel-view pixel independently. A potential problem with this approach is that it relies on the local information of each ray (i.e., information from a single image pixel) to recognize appearance and transient objects. In contrast, humans tend to use global information (information across multiple image pixels), which provides a more comprehensive understanding of a scene, to perceive appearance and handle occlusion. Motivated by this, we propose a cross-ray paradigm for handling varying appearance and transient objects (see Figure 1(b)): we exploit the global information from multiple rays to recover appearance and handle transient objects, and then synthesize a whole region of the novel view simultaneously.

Figure 1: Motivation diagram of CR-NeRF

Method

Based on the cross-ray paradigm, we propose Cross-Ray Neural Radiance Fields (CR-NeRF), shown in Figure 2. CR-NeRF consists of two parts. 1) To model varying appearance, we propose a new cross-ray feature that represents the information of multiple rays. We then fuse the cross-ray features with the appearance features of the input image via a cross-ray transformation network that uses global statistics (e.g., the feature covariance of the cross rays). The fused features are fed into a decoder to obtain the colors of multiple pixels simultaneously. 2) For transient-object handling, we take the unique perspective of treating it as a segmentation problem, detecting transient objects from the global information of image regions. Specifically, we segment the input image to obtain an object visibility map. To reduce computational overhead, we introduce a grid sampling strategy that samples the input rays and the segmentation map equally so that the two remain paired. We show theoretically that exploiting the correlation between multiple rays captures more global information. Next, we describe the two parts of CR-NeRF in detail.

PS: We assume the reader is familiar with NeRF, camera models, and related background. If not, please refer to the preliminaries section of the CR-NeRF paper.

Figure 2: Method pipeline of CR-NeRF

Style Transfer Module

To model appearance from multi-view observations, we first represent the scene information with multiple rays. To this end, we propose a new cross-ray feature:

$$\mathbf{F}_{cr} = \mathrm{VR}\big(\mathrm{MLP}(\mathbf{r}(t_i), \mathbf{d})\big), \tag{1}$$

where MLP is a multilayer perceptron. For each sampled point $\mathbf{r}(t_i)$ along a ray with viewing direction $\mathbf{d}$, we query the MLP at its 3D position and viewing direction, and aggregate the per-point features with volume rendering (VR) to obtain the cross-ray features.
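As a concrete illustration of the aggregation in Eq. (1), the following PyTorch sketch accumulates per-point MLP features into a single feature per ray using standard volume-rendering weights. This is a minimal sketch under assumed tensor names and shapes, not the authors' implementation:

```python
import torch

def ray_features(point_feats, sigmas, deltas):
    """Volume-render per-point MLP features into one feature per ray.

    point_feats: (N_rays, N_samples, C) features from the MLP
    sigmas:      (N_rays, N_samples)    predicted densities
    deltas:      (N_rays, N_samples)    distances between adjacent samples
    returns:     (N_rays, C)            one cross-ray feature row per ray
    """
    alpha = 1.0 - torch.exp(-sigmas * deltas)  # per-sample opacity
    # Transmittance: probability the ray travels unoccluded up to each sample.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1,
    )[:, :-1]
    weights = alpha * trans                    # (N_rays, N_samples)
    return (weights.unsqueeze(-1) * point_feats).sum(dim=1)
```

Stacking the resulting rows for all sampled rays gives the cross-ray feature matrix used in the fusion step below.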

With the cross-ray features in hand, we fuse them with the style of the input image, injecting appearance into the scene representation. The key to our cross-ray appearance modeling is to exploit the complementary information between the cross-ray features and the appearance features of a given appearance image, so that the appearance of that image is transferred into the scene representation. To achieve this, we learn a transformation $T$ that aligns the global statistics of the transformed cross-ray features with those of the appearance features, with an auxiliary identity term:

$$\min_{T}\ \big\|\operatorname{cov}\!\big(T(\mathbf{F}_{cr})\big) - \operatorname{cov}(\mathbf{F}_{s})\big\| + \beta\,\big\|T - \mathbf{I}\big\|, \tag{2}$$

where $\mathbf{F}_{s}$ is the corresponding style feature, $\beta$ is a hyperparameter, and $\mathbf{I}$ is a constant (identity) matrix that keeps the transformed features close to the original cross-ray features. In our paper, we theoretically analyze why multiple rays are necessary for solving this problem.
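One classical way to realize such covariance alignment is a whitening-coloring transform. The sketch below matches the second-order statistics of the cross-ray features to those of the style features; it illustrates the statistic-matching idea behind Eq. (2) rather than reproducing CR-NeRF's learned transformation network, and all names are hypothetical:

```python
import torch

def match_covariance(f_cr, f_style, eps=1e-5):
    """Align second-order statistics of ray features to style features.

    f_cr:    (N, C) cross-ray features, one row per ray
    f_style: (M, C) appearance/style features from the reference image
    returns: (N, C) transformed features whose covariance matches f_style's
    """
    mu_c, mu_s = f_cr.mean(0, keepdim=True), f_style.mean(0, keepdim=True)
    xc, xs = f_cr - mu_c, f_style - mu_s
    cov_c = xc.T @ xc / (xc.shape[0] - 1) + eps * torch.eye(xc.shape[1])
    cov_s = xs.T @ xs / (xs.shape[0] - 1) + eps * torch.eye(xs.shape[1])
    # Whiten with the content covariance, then color with the style covariance.
    ec, vc = torch.linalg.eigh(cov_c)
    es, vs = torch.linalg.eigh(cov_s)
    whiten = vc @ torch.diag(ec.clamp_min(eps).rsqrt()) @ vc.T
    color = vs @ torch.diag(es.clamp_min(eps).sqrt()) @ vs.T
    return (xc @ whiten @ color) + mu_s
```

Because the covariance is computed across all sampled rays, each ray's transformed feature depends on the statistics of the whole batch, which is exactly the cross-ray interaction the paper argues for.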

To generate a novel-view image with a satisfactory appearance from the transformed features, we train a decoder $D$ during appearance modeling. Inspired by Equation (1), the appearance-modeling loss compares the image decoded from the transformed features against the input view. In addition, we apply an encoder to the decoded image so that the content of the transformed image closely matches that of the original image. In this way, we can synthesize a novel-view image by decoding the transformed cross-ray features.
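The sketch below shows one plausible form of such a loss: a pixel-space reconstruction term plus a feature-space content term computed with an encoder. The exact weighting and encoder choice are assumptions for illustration, not the paper's definition:

```python
import torch.nn.functional as F

def appearance_loss(decoder, encoder, fused_feats, target_img):
    """Decode fused cross-ray features to pixels, then match the target view
    in both pixel space and encoder feature space (content consistency)."""
    pred = decoder(fused_feats)                               # (B, 3, H, W)
    pixel_term = F.mse_loss(pred, target_img)                 # color reconstruction
    content_term = F.mse_loss(encoder(pred), encoder(target_img))
    return pixel_term + content_term
```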

Occlusion Processing Module

To handle transient objects in novel view synthesis from unconstrained photo collections, we propose a new perspective on the problem: obtaining a visibility map of the scene by segmenting the reference image. The receptive field of a deep segmentation network lets different pixels and rays interact, introducing more global information; we deploy a lightweight segmentation network $S$ for this purpose. During training, limited GPU memory prevents us from sampling all interacting rays, so naively pairing the full-resolution visibility map with all rays is impractical. We therefore apply a grid sampling strategy (GS) [3] that samples the visibility map so it stays paired with the sampled rays (see Figure 2). The prediction process is

$$\mathbf{M} = \mathrm{GS}\big(S(\mathbf{I})\big) \in \mathbb{R}^{H' \times W'}, \tag{3}$$

where $H'$ and $W'$ are the height and width of $\mathbf{M}$. The visibility map is learned without ground-truth segmentation masks as supervision. During training, to save computation, we set $H' \times W'$ to be smaller than the full image resolution $H \times W$. The occlusion-handling loss we design is

$$\mathcal{L}_{o} = \big\|\mathbf{M} \odot (\hat{\mathbf{I}} - \mathbf{I})\big\|, \tag{4}$$

where $\odot$ denotes element-wise multiplication, $\hat{\mathbf{I}}$ is the rendered image, and $\mathbf{I}$ is the input image. The loss masks out transient objects so they do not corrupt the reconstruction.
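A minimal sketch of the grid-sampling idea follows: sample the same regular grid from the image and the visibility map so that each sampled ray's target color stays paired with its visibility value, then down-weight transient pixels in the color loss. Function and variable names are hypothetical:

```python
import torch

def grid_sample_pairs(image, vis_map, h_s, w_s):
    """Sample a regular pixel grid so ray colors and visibility stay paired.

    image:   (3, H, W) reference image
    vis_map: (1, H, W) predicted visibility map (1 = static scene, 0 = transient)
    h_s, w_s: grid resolution, chosen smaller than (H, W) to save memory
    """
    H, W = image.shape[-2:]
    ys = torch.linspace(0, H - 1, h_s).long()
    xs = torch.linspace(0, W - 1, w_s).long()
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    colors = image[:, gy, gx].reshape(3, -1).T   # (h_s*w_s, 3) target colors
    vis = vis_map[:, gy, gx].reshape(-1)         # (h_s*w_s,) paired visibility
    return colors, vis, gy.reshape(-1), gx.reshape(-1)

def masked_color_loss(pred_rgb, target_rgb, vis):
    # Suppress pixels the segmentation network marks as transient, as in Eq. (4).
    return (vis.unsqueeze(-1) * (pred_rgb - target_rgb) ** 2).mean()
```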

Experiments

Quantitative Results

We conduct extensive experiments on the Brandenburg Gate, Sacre Coeur, and Trevi Fountain datasets. As shown in Table 1, the original NeRF performs the worst among all methods, because it assumes the scene captured in the training images is static. By modeling appearance embeddings and handling transient objects, NeRF-W and Ha-NeRF achieve comparable performance in terms of PSNR, SSIM, and LPIPS. Thanks to the cross-ray design, our CR-NeRF outperforms both NeRF-W and Ha-NeRF.

Table 1: Comparison between CR-NeRF and SOTA methods

Qualitative Results

We present qualitative results for all compared methods in Figure 3. NeRF produces foggy artifacts and inaccurate appearance. NeRF-W and Ha-NeRF reconstruct more promising 3D geometry and can model appearance from the ground-truth images. However, their reconstructed geometry is not precise enough: see, for example, the shape of the greenery at the Brandenburg Gate, the ghosting around its columns, and the holes in the Sacre Coeur. Moreover, existing methods generate less realistic appearance, such as the sunlight on the Sacre Coeur statue and the blue sky and gray roof colors of the Trevi Fountain. In comparison, our CR-NeRF adopts the cross-ray paradigm and thus models appearance more realistically while reconstructing consistent geometry by suppressing transient objects.

Figure 3: Comparison between CR-NeRF and SOTA methods

Ablation Experiments on the Cross-Ray Appearance Transfer and Transient Object Handling Modules

Table 2 shows the ablation results of CR-NeRF on the Brandenburg, Sacre, and Trevi datasets. The performance of our baseline (CR-NeRF-B) improves progressively as we add the cross-ray appearance transfer module (CR-NeRF-A) and the transient object handling module (CR-NeRF-T).

Table 2: Ablation experiments of CR-NeRF

Inference Speed

Our CR-NeRF significantly outperforms Ha-NeRF in inference efficiency (2.12 s vs. 24.09 s in Table 3) when rendering multiple images with different appearances from a fixed camera pose. This is because CR-NeRF runs its NeRF backbone to generate the cross-ray features only once, then synthesizes each appearance by fusing Fcr with the per-image appearance embedding. In contrast, Ha-NeRF must run its MLP backbone for every estimation. For fairness, we also tried to speed up Ha-NeRF by caching intermediate results; however, moving those results to host memory incurs significant extra I/O time, because Ha-NeRF's intermediate results exceed the memory capacity of a single TITAN Xp GPU.
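The amortization argument can be made concrete with a short sketch: the expensive backbone pass is hoisted out of the per-appearance loop, and only the cheap fusion and decoding run per style image. All function names here are assumptions standing in for the corresponding CR-NeRF components:

```python
import torch

@torch.no_grad()
def render_appearances(backbone, encoder, transform, decoder, rays, style_images):
    """Amortized rendering: run the NeRF backbone once per camera pose and
    reuse the cached cross-ray features for every requested appearance."""
    f_cr = backbone(rays)                  # expensive pass, computed once
    views = []
    for img in style_images:               # cheap per-appearance passes
        f_style = encoder(img)             # appearance embedding of the reference
        views.append(decoder(transform(f_cr, f_style)))
    return views
```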

Table 3: Comparison of inference time between CR-NeRF and Ha-NeRF

More Experiments

We also conduct interpolation experiments on appearance features, compare appearance transfer against SOTA methods, and provide video demos. Please read our paper and visit the GitHub repository for details.

Summary and Outlook

Summary

The contributions of this work are summarized as follows:

  • A new cross-ray paradigm for novel view synthesis from unconstrained photo collections: We find that existing methods fail to produce satisfactory visual results from unconstrained photo collections because their single-ray paradigm ignores the potential cooperative interactions among multiple rays. To address this, we propose a new cross-ray paradigm that exploits global information across multiple rays.

  • An interactive, global scheme for handling varying appearance: Unlike existing methods that process each ray independently, we represent multiple rays with the proposed cross-ray features, which enable interaction between rays through feature covariance. This lets us infuse the appearance representation with global information from the scene, enabling more realistic and efficient appearance modeling. Our theoretical analysis demonstrates the necessity of considering multiple rays in appearance modeling.

  • A new segmentation-based technique for handling transient objects: We reformulate transient-object handling as a segmentation problem and leverage the global information of unconstrained images to segment the visible scene. Additionally, we employ grid sampling to pair the visibility map with multiple rays. Experimental results show that CR-NeRF eliminates transient objects in the reconstructed images.

Outlook

There is still much room for improvement in this work. As we note at the end of the paper, transient-object handling currently has no ground-truth supervision and relies entirely on the deep model learning patterns from the data, so fine-grained modeling is still lacking. More importantly, we believe the very definition of transient objects remains an open problem, which we leave to future work.

References

[1] Martin-Brualla, Ricardo, et al. "NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[2] Chen, Xingyu, et al. "Hallucinated Neural Radiance Fields in the Wild." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[3] Schwarz, Katja, et al. "GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis." Advances in Neural Information Processing Systems 33 (2020): 20154-20166.