Daily Academic Express 5.26

CV - Computer Vision | ML - Machine Learning | RL - Reinforcement Learning | NLP - Natural Language Processing

Subjects: cs.CV

1. Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields

Title: Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields

Authors: Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, Jing Liao

Article link: https://arxiv.org/abs/2305.11588

Project code: https://eckertzhang.github.io/Text2NeRF.github.io/

Summary:

        Text-driven 3D scene generation is widely applicable to video games, the film industry, and metaverse applications, all of which have a large demand for 3D scenes. However, existing text-to-3D generation methods are limited to producing dreamlike 3D objects with simple geometries and lacking realism. In this work, we present Text2NeRF, which is able to generate a wide range of 3D scenes with complex geometric structures and high-fidelity textures purely from a text prompt. To this end, we adopt NeRF as the 3D representation and leverage a pre-trained text-to-image diffusion model to constrain the NeRF's 3D reconstruction to reflect the scene description. Specifically, we employ the diffusion model to infer text-related images as content priors and use a monocular depth estimation method to provide geometric priors. Both the content and geometric priors are used to update the NeRF model. To guarantee textural and geometric consistency between different views, we introduce a progressive scene inpainting and updating strategy for novel view synthesis of the scene. Our method requires no additional training data, only a natural language description of the scene as input. Extensive experiments demonstrate that Text2NeRF outperforms existing methods in producing photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of natural language prompts.
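
The pipeline above boils down to fitting a NeRF against two priors inferred from the text: a diffusion-generated reference image (content prior) and its estimated monocular depth (geometric prior). Below is a minimal, self-contained sketch of that training signal; it is not the authors' code, and the diffusion model, depth estimator, and renderer are toy stand-ins invented for illustration.

```python
import numpy as np

H, W = 32, 32  # tiny resolution, for illustration only


def fake_diffusion_prior(prompt: str) -> np.ndarray:
    """Stand-in for a pretrained text-to-image diffusion model (content prior)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((H, W, 3))


def fake_depth_prior(image: np.ndarray) -> np.ndarray:
    """Stand-in for a monocular depth estimator (geometric prior)."""
    return image.mean(axis=-1) + 1.0  # positive pseudo-depth


def render(params: np.ndarray):
    """Stand-in for NeRF rendering: per-pixel color and depth from parameters."""
    rgb = 1.0 / (1.0 + np.exp(-params[..., :3]))  # sigmoid -> colors in [0, 1]
    depth = np.exp(params[..., 3])                # exp -> positive depth
    return rgb, depth


def fit_reference_view(prompt: str, steps: int = 200, lr: float = 0.1) -> np.ndarray:
    target_rgb = fake_diffusion_prior(prompt)   # content prior from the text
    target_depth = fake_depth_prior(target_rgb) # geometric prior from the image
    params = np.zeros((H, W, 4))                # toy per-pixel "NeRF" parameters
    for _ in range(steps):
        rgb, depth = render(params)
        # Photometric loss against the diffusion image + depth loss against the prior,
        # written as hand-derived gradients of the two squared errors.
        grad_rgb = (rgb - target_rgb) * rgb * (1.0 - rgb)
        grad_depth = (depth - target_depth) * depth
        params[..., :3] -= lr * grad_rgb
        params[..., 3] -= lr * grad_depth
    return params


params = fit_reference_view("a cozy reading room with a fireplace")
rgb, _ = render(params)
err = np.mean((rgb - fake_diffusion_prior("a cozy reading room with a fireplace")) ** 2)
print("final photometric error on the reference view:", float(err))
```

In the real method this fitting happens for the reference view first, and the progressive inpainting-and-updating strategy then extends supervision to novel views; the sketch only covers the reference-view step.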

2. Segment Any Anomaly without Training via Hybrid Prompt Regularization

Title: Segment Any Anomaly without Training via Hybrid Prompt Regularization

Authors: Yunkang Cao, Xiaohao Xu, Chen Sun, Yuqi Cheng, Zongwei Du, Liang Gao, Weiming Shen

Article link: https://arxiv.org/abs/2305.10724

Project code: https://github.com/caoyunkang/Segment-Any-Anomaly

Summary:

        We propose a new framework, Segment Any Anomaly+ (SAA+), for zero-shot anomaly segmentation with hybrid prompt regularization to improve the adaptability of modern foundation models. Existing anomaly segmentation models typically rely on domain-specific fine-tuning, which limits their generalization across countless anomaly patterns. In this work, inspired by the strong zero-shot generalization ability of foundation models such as Segment Anything, we first explore their assembly to leverage diverse multimodal priors for anomaly localization. To adapt foundation models to anomaly segmentation without any parameter tuning, we further introduce hybrid prompts derived from domain expert knowledge and the target image context as regularization. The proposed SAA+ achieves state-of-the-art performance on several anomaly segmentation benchmarks, including VisA, MVTec-AD, MTD, and KSDD2, in the zero-shot setting.
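
As a rough illustration of what hybrid prompt regularization means in practice, the sketch below filters candidate anomaly masks (which a real pipeline would obtain from assembled foundation models such as a text-promptable detector plus Segment Anything) using both domain-expert rules and target-image context. This is a hypothetical simplification written for this digest, not the released SAA+ code.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class Candidate:
    mask: np.ndarray   # boolean HxW mask from the segmentation model
    score: float       # confidence from the text-promptable detector


def hybrid_prompt_filter(
    candidates: list,
    image_area: int,
    max_area_ratio: float = 0.05,  # expert prior: anomalies are small regions
    top_k: int = 3,                # expert prior: few anomalies per image
) -> list:
    """Keep candidates consistent with expert priors and this image's own statistics."""
    if not candidates:
        return []
    # Image-context regularization: compare each score to the image's mean score,
    # so a uniformly "suspicious" image does not flag everything.
    mean_score = float(np.mean([c.score for c in candidates]))
    kept = [
        c for c in candidates
        if c.mask.sum() / image_area <= max_area_ratio and c.score > mean_score
    ]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:top_k]


# Toy usage with random masks standing in for foundation-model outputs.
rng = np.random.default_rng(0)
H = W = 64
cands = [Candidate(mask=rng.random((H, W)) > 0.97, score=float(s))
         for s in rng.random(8)]
anomalies = hybrid_prompt_filter(cands, image_area=H * W)
print(f"kept {len(anomalies)} candidate anomaly regions")
```

The key point is that no model weights are updated: adaptation comes entirely from the prompt-level rules applied on top of frozen foundation models.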

3. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Title: VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Authors: Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai

Article link: https://arxiv.org/abs/2305.11175

Project code: https://github.com/OpenGVLab/VisionLLM

Summary:

        Large language models (LLMs) have notably accelerated progress toward artificial general intelligence (AGI), and their impressive zero-shot capability on user-tailored tasks gives them great potential in a wide range of applications. However, in the field of computer vision, although numerous powerful vision foundation models (VFMs) are available, they are still restricted to tasks in a pre-defined form and struggle to match the open-ended task capabilities of LLMs. In this work, we present VisionLLM, an LLM-based framework for vision-centric tasks. The framework provides a unified perspective on vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed through language instructions. An LLM-based decoder can then make appropriate predictions for open-ended tasks based on these instructions. Extensive experiments show that VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, with good results throughout. Notably, with a generalist LLM-based framework, our model achieves over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models.
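
To make the "images as a foreign language" idea concrete, here is a small hypothetical sketch: image patches are mapped to discrete tokens, concatenated with a language instruction, and an LLM-based decoder would then emit the task output as plain text (here, a detection result). The tokenizer and decoder below are stand-ins invented for illustration, not the VisionLLM implementation.

```python
import numpy as np


def image_to_tokens(image: np.ndarray, patch: int = 16, vocab: int = 1024) -> list:
    """Stand-in visual tokenizer: hash each patch into a discrete <img_k> token."""
    H, W, _ = image.shape
    tokens = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            code = int(image[y:y + patch, x:x + patch].sum() * 1000) % vocab
            tokens.append(f"<img_{code}>")
    return tokens


def build_prompt(image: np.ndarray, instruction: str) -> str:
    """Interleave the instruction with image tokens so the task reads as a language task."""
    img_tokens = image_to_tokens(image)
    return f"{instruction} Image: {' '.join(img_tokens)} Output:"


def mock_decoder(prompt: str) -> str:
    """Stand-in for the LLM-based open-ended decoder; a real model would predict this text."""
    return "<box>(12, 30), (88, 140)</box> person"


image = np.random.default_rng(0).random((64, 64, 3))
prompt = build_prompt(image, "Detect all objects in the image and return boxes with class names.")
print(mock_decoder(prompt))
```

Because the task is specified purely by the instruction string, swapping detection for captioning or grounding only changes the prompt, which is what allows one decoder to serve many vision-centric tasks.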

More AI information: Princess AiCharm

Origin: blog.csdn.net/muye_IT/article/details/130911699