Daily Academic Express 5.20

CV - Computer Vision | ML - Machine Learning | RL - Reinforcement Learning | NLP - Natural Language Processing

Subjects: cs.CV

1. Improved baselines for vision-language pre-training

Title: Improved baselines for vision-language pre-training

Authors: Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

Article link: https://arxiv.org/abs/2305.08675

Summary:

        Contrastive learning has emerged as an effective framework for learning multimodal representations. CLIP is a seminal work in this field, achieving impressive results by training on paired image-text data with a contrastive loss. Recent work claims to improve CLIP with additional non-contrastive losses inspired by self-supervised learning. However, it is sometimes difficult to separate the contribution of these additional losses from other implementation details (such as data augmentation or regularization techniques) used to train the model. To shed light on this question, in this paper we first propose, implement and evaluate several baselines obtained by combining contrastive learning with recent advances in self-supervised learning. In particular, we align image and text modalities using a loss function that has proven successful for visual self-supervised learning. We find that these baselines outperform basic implementations of CLIP. However, the advantage disappears when a stronger training recipe is used. In fact, we find that a simple CLIP baseline can also be substantially improved by using well-known training techniques popular in other subfields, with relative improvements of up to 25% on downstream zero-shot tasks. Furthermore, we find that applying image and text augmentation is sufficient to recover most of the improvements reported by previous work. With our improved CLIP training recipe, we achieve state-of-the-art performance on four standard datasets and consistently outperform previous work (up to +4% on the largest dataset), while being much simpler.
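
The symmetric contrastive objective at the heart of CLIP, referenced throughout this abstract, pulls matched image-text pairs together and pushes mismatched pairs apart within a batch. A minimal PyTorch sketch of that loss shape, assuming `(batch, dim)` embeddings from arbitrary encoders (the function name and temperature value are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```

The baselines studied in the paper combine a loss of this shape with self-supervised objectives and stronger augmentation and regularization; the exact recipes are in the paper.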

Subjects: cs.CL

2. ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4

Title: ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4

Authors: Zhengqing Yuan, Huiwen Xue, Xinyi Wang, Yongming Liu, Zhuanzhe Zhao, Kun Wang

Article link: https://arxiv.org/abs/2305.07490

Project code: https://huggingface.co/Tyrannosaurus/ArtGPT-4

Summary:

        In recent years, large language models (LLMs) have made significant progress in natural language processing (NLP), and models such as ChatGPT and GPT-4 have achieved impressive capabilities in various language tasks. However, training such a large-scale model is challenging, and it is often difficult to find a dataset that matches the size of the model. Fine-tuning and training models with fewer parameters using new methods has emerged as a promising way to overcome these challenges. MiniGPT-4 is one such model: it achieves visual-language understanding comparable to GPT-4 by utilizing a novel pre-trained model and an innovative training strategy. However, the model still faces some challenges in image understanding, especially for artistic images. A novel multimodal model called ArtGPT-4 is proposed to address these limitations. ArtGPT-4 was trained on image-text pairs in just 2 hours on a Tesla A100 device, using only about 200 GB of data. The model can depict artistic images and generate visual code, including beautiful HTML/CSS web pages. Furthermore, this paper proposes novel benchmarks for evaluating the performance of vision-language models. On these benchmarks, ArtGPT-4 scores more than 1 point higher than the current state-of-the-art model and only 0.25 points lower than artists on a 6-point scale.
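
"Adapter-enhanced" fine-tuning, as in MiniGPT-4-style models, typically means inserting small trainable bottleneck modules into an otherwise frozen transformer, which is what makes a 2-hour, single-A100 training budget plausible. A minimal sketch of one such bottleneck adapter (the layer sizes and GELU activation are generic assumptions, not ArtGPT-4's published configuration):

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted after a frozen transformer sublayer."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen model's behavior as the baseline.
        return x + self.up(self.act(self.down(x)))
```

During fine-tuning, only adapter parameters like these are updated while the backbone stays frozen, so the trainable parameter count stays tiny relative to the full model.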

3. StructGPT: A General Framework for Large Language Model to Reason over Structured Data

Title: StructGPT: A General Framework for Large Language Model to Reason over Structured Data

Authors: Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, Ji-Rong Wen

Article link: https://arxiv.org/abs/2305.09645

Project code: https://github.com/RUCAIBox/StructGPT

Summary:

        In this paper, we investigate how to improve the zero-shot reasoning capabilities of large language models (LLMs) on structured data in a unified manner. Inspired by research on tool augmentation for LLMs, we develop an Iterative Reading-then-Reasoning (IRR) method, called StructGPT, to solve question answering tasks over structured data. In our approach, we build specialized functions to collect relevant evidence from structured data (reading), and let the LLM concentrate on the reasoning task based on the collected information (reasoning). In particular, we propose an invoking-linearization-generation procedure to support LLM reasoning over structured data with the help of external interfaces. By iterating this procedure with the provided interfaces, our method can gradually approach the target answer for a given query. Extensive experiments on three types of structured data demonstrate the effectiveness of our method, which can significantly improve the performance of ChatGPT and achieve comparable performance to full-data supervised tuning baselines.
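
The invoking-linearization-generation loop described above alternates between pulling evidence out of the structured source (reading) and prompting the LLM over the linearized evidence (reasoning), repeating until an answer is produced. A schematic sketch of that control flow; `invoke_interface`, `linearize`, and `call_llm` are hypothetical stubs, not functions from the StructGPT repository:

```python
from typing import Any

def invoke_interface(data: Any, question: str, context: str) -> Any:
    """Hypothetical reading step: query the table/KG/database for evidence."""
    return data  # stub: a real interface would return only relevant rows/triples

def linearize(evidence: Any) -> str:
    """Hypothetical linearization step: flatten structured evidence into text."""
    return repr(evidence) + "\n"

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    return "stub answer"

def iterative_read_then_reason(question: str, data: Any, max_steps: int = 5) -> str:
    """Schematic IRR loop: read evidence, then let the LLM reason over it."""
    context = ""
    reply = ""
    for _ in range(max_steps):
        evidence = invoke_interface(data, question, context)  # invoking
        context += linearize(evidence)                        # linearization
        reply = call_llm(                                     # generation
            f"Question: {question}\nEvidence:\n{context}\n"
            "Answer the question, or reply CONTINUE if more evidence is needed."
        )
        if reply.strip() != "CONTINUE":
            break
    return reply
```

Each pass adds more linearized evidence to the prompt until the LLM can commit to an answer, which is how the method handles sources too large to fit in a single prompt.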

More AI information: Princess AiCharm

Source: blog.csdn.net/muye_IT/article/details/130780031