Daily Academic Express 4.25

CV - Computer Vision  | ML - Machine Learning  | RL - Reinforcement Learning  | NLP - Natural Language Processing

Subjects: cs.CV

1. Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models

Title: Long-term photometric consistent novel view synthesis with diffusion models

Authors: Jason J. Yu, Fereshteh Forghani, Constantine G. Derpanis, Marcus A. Brubaker

Article link: https://arxiv.org/abs/2304.10700

Project code: https://yorkucvil.github.io/Photoconsistent-NVS/

Summary:

        Synthesizing novel views from a single input image is a challenging task: the goal is to generate new views of a scene from desired camera poses that may be separated by large motion. The high degree of uncertainty in this task, due to unobserved elements both within the scene (i.e., occlusions) and outside the field of view, makes generative models attractive for capturing the wide variety of possible outputs. In this paper, we propose a novel generative model capable of producing a sequence of photorealistic images consistent with a specified camera trajectory and a single starting image. Our approach centers on an autoregressive conditional diffusion model that interpolates visible scene elements and extrapolates unobserved regions of a view in a geometrically consistent manner. Conditioning is limited to an image capturing a single camera view and the (relative) pose of the new camera view. To measure the consistency of a sequence of generated views, we introduce a new metric, the Thresholded Symmetric Epipolar Distance (TSED), which counts the number of consistent frame pairs in a sequence. While previous methods have been shown to produce high-quality images and consistent semantics across pairs of views, we show empirically with our metric that they are often inconsistent with the desired camera poses. In contrast, we demonstrate that our method produces both realistic and view-consistent images.
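For readers who want to experiment with a TSED-style check, the sketch below is a minimal, hedged illustration (not the authors' implementation): given matched keypoints between two generated frames and the fundamental matrix implied by the relative camera pose, a frame pair would be counted as consistent when, for example, the median symmetric epipolar distance falls below a threshold. The function names, the median rule, and the threshold value are assumptions for illustration; the precise TSED definition is given in the paper.

```python
import numpy as np

def symmetric_epipolar_distance(pts1, pts2, F):
    """Symmetric epipolar distance for N matched points.

    pts1, pts2: (N, 2) pixel coordinates in frame 1 and frame 2.
    F: (3, 3) fundamental matrix mapping frame-1 points to epipolar lines in frame 2.
    """
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])            # homogeneous points in frame 1
    x2 = np.hstack([pts2, ones])            # homogeneous points in frame 2
    Fx1 = x1 @ F.T                          # epipolar lines in frame 2
    Ftx2 = x2 @ F                           # epipolar lines in frame 1
    x2Fx1 = np.sum(x2 * Fx1, axis=1)        # algebraic epipolar error x2^T F x1
    return x2Fx1 ** 2 * (1.0 / (Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2)
                         + 1.0 / (Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2))

def is_consistent_pair(pts1, pts2, F, tau=2.0):
    """Illustrative consistency rule: median symmetric epipolar distance below tau."""
    return np.median(symmetric_epipolar_distance(pts1, pts2, F)) < tau
```

A TSED-style score would then be the count (or fraction) of consecutive frame pairs in a generated sequence for which `is_consistent_pair` returns true.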

2. VisFusion: Visibility-aware Online 3D Scene Reconstruction from Videos (CVPR 2023)

Title: VisFusion: Visibility-aware Online 3D Scene Reconstruction from Videos

Authors: Huiyu Gao, Wei Mao, Miaomiao Liu

Article link: https://arxiv.org/abs/2304.10687

Project code: https://github.com/huiyu-gao/VisFusion

Summary:

        We present VisFusion, a visibility-aware online 3D scene reconstruction approach from posed monocular videos. In particular, we aim to reconstruct the scene from volumetric features. Unlike previous reconstruction methods, which aggregate the features of each voxel from all input views regardless of its visibility, we improve feature fusion by explicitly inferring each voxel's visibility from a similarity matrix computed over its projected features in every image pair. Following previous work, our model is a coarse-to-fine pipeline that includes volume sparsification. Unlike prior methods that sparsify voxels globally with a fixed occupancy threshold, we perform sparsification on local feature volumes along each visual ray, preserving at least one voxel per ray to retain finer detail. The sparsified local volume is then fused with a global volume for online reconstruction. We further propose to predict TSDFs in a coarse-to-fine manner by learning their residuals across scales, yielding better TSDF predictions. Experimental results on benchmarks demonstrate that our method achieves superior performance with more scene detail.
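As a hedged illustration of the ray-local sparsification idea (not the authors' code), the sketch below contrasts a fixed global occupancy threshold with thresholding along each ray while always keeping that ray's best-scoring voxel. The array layout (rays x depth samples) and the threshold value are assumptions made for the example.

```python
import numpy as np

def global_sparsify(occ, thresh=0.5):
    """Baseline: keep voxels whose occupancy score clears one fixed global threshold."""
    return occ > thresh

def raywise_sparsify(occ, thresh=0.5):
    """Threshold locally along each ray, but always keep that ray's best-scoring voxel."""
    keep = occ > thresh
    best = np.argmax(occ, axis=1)                  # highest-scoring depth sample per ray
    keep[np.arange(occ.shape[0]), best] = True     # guarantee at least one voxel per ray
    return keep

occ = np.random.rand(4, 8)                   # toy scores: 4 rays, 8 depth samples each
print(global_sparsify(occ).sum(axis=1))      # some rays may retain no voxels at all
print(raywise_sparsify(occ).sum(axis=1))     # every ray retains at least one voxel
```

The design point the example tries to capture is that a purely global cutoff can empty out entire rays in poorly observed regions, whereas the ray-local rule keeps a candidate surface sample everywhere.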

3. Factored Neural Representation for Scene Understanding

Title: Decomposed Neural Representations for Scene Understanding

Authors: Yu-Shiang Wong, Niloy J. Mitra

Article link: https://arxiv.org/abs/2304.10950

Project code: https://yushiangw.github.io/factorednerf/

Summary:

        A long-standing goal of scene understanding is to obtain interpretable and editable representations that can be constructed directly from raw monocular RGB-D video, without specialized hardware setups or priors. The problem is significantly more challenging in the presence of multiple moving and/or deforming objects. Traditional approaches tackle this setting with a mix of simplifications, scene priors, pretrained templates, or known deformation models. The advent of neural representations, especially neural implicit representations and radiance fields, opens up the possibility of end-to-end optimization to jointly capture geometry, appearance, and object motion. However, current approaches produce global scene encodings, assume multi-view capture with limited or no motion in the scene, and do not facilitate easy manipulation beyond novel view synthesis. In this work, we introduce a factored neural scene representation that can be learned directly from monocular RGB-D video to produce object-level neural representations with per-object motion (e.g., rigid trajectories) and/or deformation (e.g., non-rigid motion). We evaluate our method against a set of neural approaches on synthetic and real data to demonstrate that the resulting representations are efficient, interpretable, and editable (e.g., changing object trajectories).
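To make the "factored" idea concrete, here is a minimal, hedged sketch (not the paper's architecture): the scene is represented as a set of object-level fields, each queried in its own canonical frame after inverting that object's rigid pose, with per-object densities composited at the end. The `ObjectField` class, the constant-density sphere, and the simple additive compositing are illustrative stand-ins; under this factoring, editing an object's trajectory amounts to changing its pose.

```python
import numpy as np

class ObjectField:
    """Toy object-level field: a unit sphere of constant density (stand-in for an MLP)."""
    def __init__(self, density=5.0):
        self.density = density

    def query(self, pts_canonical):
        inside = np.linalg.norm(pts_canonical, axis=-1) < 1.0
        return np.where(inside, self.density, 0.0)

def world_to_canonical(pts_world, R, t):
    """Map world-space points into an object's canonical frame (pose: x_w = R x_c + t)."""
    return (pts_world - t) @ R           # row-vector form of R^T (x_w - t)

def scene_density(pts_world, objects):
    """Composite the factored scene as a sum of per-object densities (illustrative)."""
    total = np.zeros(pts_world.shape[0])
    for field, (R, t) in objects:
        total += field.query(world_to_canonical(pts_world, R, t))
    return total

# Two object instances following different rigid poses (one step of a trajectory each).
R = np.eye(3)
scene = [(ObjectField(), (R, np.array([0.0, 0.0, 0.0]))),
         (ObjectField(), (R, np.array([3.0, 0.0, 0.0])))]
pts = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
print(scene_density(pts, scene))     # nonzero at the two object centres, zero in between
```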

More AI information: AiCharm

Source: https://blog.csdn.net/muye_IT/article/details/130383153