Daily Academic Express 5.19

CV - Computer Vision | ML - Machine Learning | RL - Reinforcement Learning | NLP - Natural Language Processing

Subjects: cs.CV

1. On the Hidden Mystery of OCR in Large Multimodal Models

Title: On the Hidden Mystery of OCR in Large Multimodal Models

Authors: Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, Xiang Bai

Article link: https://arxiv.org/abs/2305.07895

Project code: https://github.com/Yuliang-Liu/MultimodalOCR

Summary:

        Large-scale models have recently played a leading role in natural language processing and multimodal vision-language learning, but their efficacy on text-related visual tasks remains relatively unexplored. We conduct a comprehensive study of existing publicly available multimodal models, evaluating their performance on text recognition, text-based visual question answering, and key information extraction. Our findings shed light on the strengths and weaknesses of these models: they rely primarily on semantic understanding to recognize words and perceive individual character shapes poorly; they are also indifferent to text length and limited in their ability to detect fine-grained features in images. These results suggest that even the most powerful current large-scale multimodal models cannot compete with domain-specific methods on traditional text tasks and face greater challenges on more complex tasks. Most importantly, the baseline results presented in this study provide a fundamental framework for the design and evaluation of innovative strategies aimed at enhancing zero-shot multimodal techniques. The evaluation pipeline is available at https://github.com/Yuliang-Liu/MultimodalOCR.
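
To make the evaluation setup concrete, below is a minimal sketch of the kind of zero-shot text-recognition scoring such a benchmark performs. It is not the paper's exact protocol: the `predict` function stands in for whatever multimodal model is being queried, and the containment-based matching rule and normalization are illustrative assumptions.

```python
import re

def normalize(text: str) -> str:
    """Lower-case and strip non-alphanumeric characters so superficial
    formatting differences do not count as recognition errors."""
    return re.sub(r"[^0-9a-z]", "", text.lower())

def recognition_accuracy(samples, predict) -> float:
    """Fraction of images whose ground-truth word appears in the model output.

    `samples` is an iterable of (image, ground_truth_text) pairs and
    `predict(image)` is any function that queries a multimodal model and
    returns its free-form text answer (both are placeholders here).
    """
    hits, total = 0, 0
    for image, gt in samples:
        pred = normalize(predict(image))
        gt_norm = normalize(gt)
        if gt_norm and gt_norm in pred:  # containment-style match (assumption)
            hits += 1
        total += 1
    return hits / max(total, 1)

# Example usage with a dummy "model" that always answers the same string:
if __name__ == "__main__":
    fake_samples = [("img_0", "hello"), ("img_1", "world")]
    acc = recognition_accuracy(fake_samples, predict=lambda img: "HELLO!")
    print(f"zero-shot word accuracy: {acc:.2f}")  # 0.50 on the dummy data
```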

2. BlendFields: Few-Shot Example-Driven Facial Modeling (CVPR 2023)

Title: BlendFields: Few-Shot Example-Driven Facial Modeling

Authors: Kacper Kania, Stephan J. Garbin, Andrea Tagliasacchi, Virginia Estellers, Kwang Moo Yi, Julien Valentin, Tomasz Trzciński, Marek Kowalski

Article link: https://arxiv.org/abs/2305.07514

Project page: https://blendfields.github.io/

3. CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Title: CodeT5+: Open Code Large Language Models for Code Understanding and Generation

Authors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi

Article link: https://arxiv.org/abs/2305.07922

Summary:

        Large language models (LLMs) pre-trained on large amounts of source code have achieved remarkable progress in code intelligence. However, existing code LLMs suffer from two major limitations in terms of architecture and pre-training tasks. First, they usually employ a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm limits flexibility in application, while in the latter the model is treated as a single system for all tasks, leading to poor performance on a subset of them. Second, they usually use a limited set of pre-training objectives, which may not be relevant to some downstream tasks and thus cause a substantial drop in performance. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. This flexibility is enabled by our proposed mixture of pre-training objectives, which mitigates the pre-training-fine-tuning discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pre-training tasks on unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs instead of training from scratch to efficiently scale our models, and we explore instruction tuning to align with natural language instructions. We extensively evaluate CodeT5+ on more than 20 code-related benchmarks in different settings, including zero-shot, fine-tuning, and instruction tuning, and observe state-of-the-art (SoTA) performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval. In particular, our instruction-tuned CodeT5+ 16B achieves new SoTA results against other open code LLMs on the HumanEval code generation task.
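
Since CodeT5+ follows an encoder-decoder design, its smaller checkpoints can be driven like any seq2seq model. Below is a minimal generation sketch; the checkpoint name Salesforce/codet5p-220m is an assumption (the article only links the arXiv page), and the snippet uses the standard Hugging Face Transformers API rather than anything specific to the paper.

```python
# Minimal code-completion sketch with a publicly released CodeT5+ checkpoint.
# The checkpoint name below is assumed, not taken from the article.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "Salesforce/codet5p-220m"  # assumed small encoder-decoder checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)

# Ask the encoder-decoder model to continue a partial function definition.
inputs = tokenizer("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_length=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```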

More AI information: Princess AiCharm
