Daily Academic Express 4.19

CV - Computer Vision | ML - Machine Learning | RL - Reinforcement Learning | NLP - Natural Language Processing

Subjects: cs.CV

1. Visual Instruction Tuning

Title: Visual Instruction Tuning

Authors: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

Article link: https://arxiv.org/abs/2304.08485

Project code: https://llava-vl.github.io/

Summary:

        Instruction tuning of large language models (LLMs) on machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but this idea is less explored in the multimodal domain. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction-tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make the GPT-4-generated visual instruction tuning data, our model, and codebase publicly available.
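
To make the architecture concrete, here is a minimal PyTorch sketch of the LLaVA-style connection described above: a learned projection maps the vision encoder's patch features into the LLM's embedding space, and the resulting visual tokens are prepended to the instruction token embeddings. The dimensions and module names are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space.

    LLaVA v1 uses a single linear layer as the vision-language connector;
    the dimensions here (1024 -> 4096) are illustrative assumptions.
    """
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

def build_multimodal_input(patch_features, text_token_embeds, projector):
    """Prepend projected visual tokens to the embedded instruction tokens."""
    visual_tokens = projector(patch_features)
    return torch.cat([visual_tokens, text_token_embeds], dim=1)

# Dummy tensors standing in for real encoder outputs.
projector = VisualProjector()
patch_features = torch.randn(1, 256, 1024)   # e.g. ViT patch features
text_embeds = torch.randn(1, 32, 4096)       # embedded instruction tokens
inputs_embeds = build_multimodal_input(patch_features, text_embeds, projector)
print(inputs_embeds.shape)  # torch.Size([1, 288, 4096])
```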

2. Learning to Render Novel Views from Wide-Baseline Stereo Pairs (CVPR 2023)

Title: Learning to render novel views from wide-baseline stereo pairs

Authors: Yilun Du, Cameron Smith, Ayush Tewari, Vincent Sitzmann

Article link: https://arxiv.org/abs/2304.08463

Project code: https://yilundu.github.io/wide_baseline/


Summary:

        We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are typically observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing methods for novel view synthesis from sparse observations fail in this setting, both because they recover incorrect 3D geometry and because the high cost of differentiable rendering prevents them from scaling to large-scale training. We take a step towards addressing these shortcomings by formulating a multi-view transformer encoder, an efficient image-space epipolar line sampling scheme to assemble image features for a target ray, and a lightweight cross-attention-based renderer. Our contributions enable our method to be trained on large-scale real-world datasets of indoor and outdoor scenes. We demonstrate that our method learns a strong multi-view geometry prior while reducing rendering time. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations, and achieve multi-view-consistent novel view synthesis.
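
The cross-attention rendering idea can be illustrated with a small sketch: each target ray acts as a query that attends over image features gathered along its epipolar line in the input views, and the attended feature is decoded to a color. The layer sizes and interfaces below are assumptions for illustration only; the project code linked above is the reference implementation.

```python
import torch
import torch.nn as nn

class EpipolarCrossAttentionRenderer(nn.Module):
    """Schematic cross-attention renderer: one query per target ray attends over
    features sampled along that ray's epipolar line in the input images."""
    def __init__(self, feat_dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.to_rgb = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                    nn.Linear(feat_dim, 3))

    def forward(self, ray_queries, epipolar_features):
        # ray_queries:       (num_rays, 1, feat_dim) one query per target ray
        # epipolar_features: (num_rays, num_samples, feat_dim) features sampled
        #                    along each ray's epipolar line in the input views
        attended, _ = self.attn(ray_queries, epipolar_features, epipolar_features)
        return self.to_rgb(attended.squeeze(1))  # (num_rays, 3) RGB per ray

# Dummy usage: 4096 target rays, 32 epipolar samples per ray.
renderer = EpipolarCrossAttentionRenderer()
queries = torch.randn(4096, 1, 256)
samples = torch.randn(4096, 32, 256)
colors = renderer(queries, samples)
print(colors.shape)  # torch.Size([4096, 3])
```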

3. DETRs Beat YOLOs on Real-time Object Detection

Title: DETRs beat YOLOs on real-time object detection

Authors: Wenyu Lv, Shangliang Xu, Yian Zhao, Guanzhong Wang, Jinman Wei, Cheng Cui, Yuning Du, Qingqing Dang, Yi Liu

Article link: https://arxiv.org/abs/2304.08069

Summary:

        Recently, end-to-end transformer-based detectors (DETRs) have achieved remarkable performance. However, the high computational cost of DETRs has not been effectively addressed, which limits their practical application and prevents them from fully exploiting the benefit of requiring no post-processing such as non-maximum suppression (NMS). In this paper, we first analyze the impact of NMS on the inference speed of modern real-time object detectors and establish an end-to-end speed benchmark. To avoid the inference delay caused by NMS, we propose the Real-Time Detection Transformer (RT-DETR), which is, to our knowledge, the first real-time end-to-end object detector. Specifically, we design an efficient hybrid encoder that processes multi-scale features by decoupling intra-scale interaction and cross-scale fusion, and we propose IoU-aware query selection to improve the initialization of object queries. In addition, our proposed detector supports flexible adjustment of inference speed by using different numbers of decoder layers without retraining, which facilitates the practical application of real-time object detectors. Our RT-DETR-L achieves 53.0% AP on COCO val2017 and 114 FPS on a T4 GPU, while RT-DETR-X achieves 54.8% AP and 74 FPS, outperforming all YOLO detectors of the same scale in both speed and accuracy. Furthermore, our RT-DETR-R50 achieves 53.1% AP and 108 FPS, which is 2.2% AP higher in accuracy and about 21 times faster in FPS than DINO-Deformable-DETR-R50. Source code and pre-trained models will be available in PaddleDetection.
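
As a rough illustration of the query-selection idea, the sketch below picks the top-k encoder tokens by a predicted confidence score to initialize the decoder's object queries; in RT-DETR this score is trained to be IoU-aware. The scoring head and sizes are illustrative assumptions, not the PaddleDetection implementation.

```python
import torch
import torch.nn as nn

class QuerySelector(nn.Module):
    """Selects the top-k encoder tokens by a predicted score to serve as
    initial object queries for the decoder (illustrative sketch only)."""
    def __init__(self, dim: int = 256, num_queries: int = 300):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # per-token confidence score
        self.num_queries = num_queries

    def forward(self, encoder_tokens: torch.Tensor) -> torch.Tensor:
        # encoder_tokens: (batch, num_tokens, dim) flattened multi-scale features
        scores = self.score_head(encoder_tokens).squeeze(-1)       # (batch, num_tokens)
        topk = scores.topk(self.num_queries, dim=1).indices        # (batch, num_queries)
        idx = topk.unsqueeze(-1).expand(-1, -1, encoder_tokens.size(-1))
        return encoder_tokens.gather(1, idx)                       # (batch, num_queries, dim)

selector = QuerySelector()
tokens = torch.randn(2, 8400, 256)   # e.g. tokens from flattened multi-scale feature maps
queries = selector(tokens)
print(queries.shape)  # torch.Size([2, 300, 256])
```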

More AI information: Princess AiCharm


Origin blog.csdn.net/muye_IT/article/details/130264800