Which vision-language model is better, InstructBLIP or MiniGPT-4? The comprehensive evaluation benchmark LVLM-eHub has the answer


Original article from Xi Xiaoyao Technology Talk | Author: Wang Siruo

Large language models such as LLaMA and GPT-3 have achieved strong natural-language understanding and reasoning, giving the AI community powerful language foundation models. Going further, the continually iterating GPT-4 adds the visual ability to process images.

Today, building powerful multimodal models has become a community consensus, and large vision-language models (LVLMs) such as BLIP2, LLaVA, MiniGPT-4, mPLUG-Owl, and InstructBLIP have been proposed one after another in a veritable explosion.

Do existing vision-language models truly align the image and text modalities? Which vision-language model is better?

How strong existing vision-language models really are is naturally a focus for researchers. The Shanghai Artificial Intelligence Laboratory has built the evaluation benchmark LVLM-eHub to comprehensively evaluate eight vision-language models, including InstructBLIP and MiniGPT-4.

The study finds that existing instruction-tuned vision-language models such as InstructBLIP severely overfit existing tasks and generalize poorly to real-world scenarios. In addition, these models are highly prone to object hallucination, generating descriptions of objects that do not appear in the image.


Paper title:
LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Paper address:
https://arxiv.org/pdf/2306.09265.pdf

1. Building quantitative evaluation datasets for six multimodal capabilities, plus an online interactive evaluation platform

LVLM-eHub consists of a quantitative capability assessment and an online interactive evaluation platform. On the one hand, the quantitative assessment extensively evaluates LVLMs on six multimodal capabilities: visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence.

On the other hand, the online interactive evaluation platform pits vision-language models against each other in anonymous, randomized pairwise battles in a crowdsourced fashion, providing user-level model rankings in open-world question-answering scenarios.


Visual perception:  Visual perception is the ability to recognize scenes or objects in an image, the most basic ability of the human visual system. It covers image classification, multi-class identification, and object counting tasks.

Visual knowledge acquisition:  Visual knowledge acquisition requires going beyond perception to understand images and acquire knowledge. It covers optical character recognition (OCR), key information extraction, and image captioning tasks.

Visual reasoning:  Visual reasoning requires a comprehensive understanding of images and the associated text. Three tasks are used to evaluate the visual reasoning ability of LVLMs: visual question answering (VQA), visual entailment, and knowledge-grounded image description.

Visual commonsense:  This evaluation tests the model's grasp of commonly shared human knowledge using ImageNetVC and Visual Commonsense Reasoning (VCR). Specifically, ImageNetVC is used for zero-shot commonsense assessment of attributes such as color and shape, while VCR covers various scenarios such as spatial, causal, and psychological commonsense.

Object hallucination:  Vision-language models suffer from object hallucination, i.e., the objects mentioned in a generated description are inconsistent with those actually present in the target image. This paper evaluates object hallucination on the MSCOCO dataset (a minimal illustrative check is sketched after these capability descriptions).

Embodied intelligence:  Embodied intelligence aims to create agents such as humanoid robots that learn to solve complex tasks requiring interaction with the environment. This paper uses the high-level tasks in EmbodiedGPT as the benchmark.
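
To make the object-hallucination evaluation above more concrete, here is a minimal illustrative check in Python: compare the objects mentioned in a generated caption against the ground-truth object annotations of an MSCOCO image. This is only a sketch under assumed inputs (`caption`, `annotated`, and a small object `vocabulary`); the paper's actual hallucination metric may be computed differently.

```python
# Minimal, illustrative hallucination check (not the paper's exact metric).
# Assumes we already have a generated caption and the set of objects
# annotated for the image in MSCOCO.

MSCOCO_SYNONYMS = {
    "bike": "bicycle",
    "motorbike": "motorcycle",
}

def mentioned_objects(caption: str, vocabulary: set[str]) -> set[str]:
    """Return vocabulary objects that the caption mentions (naive word match)."""
    words = {w.strip(".,!?").lower() for w in caption.split()}
    words = {MSCOCO_SYNONYMS.get(w, w) for w in words}
    return words & vocabulary

def hallucinated_objects(caption: str,
                         annotated: set[str],
                         vocabulary: set[str]) -> set[str]:
    """Objects the caption mentions that are NOT annotated in the image."""
    return mentioned_objects(caption, vocabulary) - annotated

# Example usage with made-up data:
vocab = {"person", "dog", "frisbee", "car", "bicycle"}
annotated = {"person", "dog"}                 # ground-truth MSCOCO labels
caption = "A person throws a frisbee to a dog near a car."
print(hallucinated_objects(caption, annotated, vocab))
# -> {'frisbee', 'car'}  (objects described but not present in the image)
```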

To evaluate the six categories of capabilities above, the paper investigates the zero-shot capabilities of vision-language models on a variety of new tasks. Specifically, zero-shot evaluation is treated as prompt engineering tailored to different task formats:

  • Question answering: design appropriate visual question prompts so that vision-language models generate meaningful results, e.g., “what is written in the image?” as the text prompt for OCR tasks.

  • Prefix-based score: for multiple-choice tasks, each candidate answer is combined with the image and the prompt, the model scores the likelihood of the resulting image-text pair, and the candidate that yields the maximum likelihood is taken as the answer (see the sketch after this list).

  • Multi-turn reasoning: an LLM such as ChatGPT generates sub-questions for a given question, the vision-language model provides the corresponding sub-answers, and another LLM evaluates the quality of those sub-answers; the process repeats until a satisfactory answer is obtained or a predefined maximum number of iterations is reached (a sketch of this loop also follows the list).

  • User voting: humans rate the quality, relevance, and usefulness of the text generated by vision-language models in specific contexts. To keep the evaluation fair, the model outputs are anonymized and presented in randomly shuffled order.
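
As a concrete illustration of the prefix-based score above, here is a minimal sketch of likelihood-based multiple-choice answering. The `image_text_log_likelihood` callable is a hypothetical hook into an LVLM's scoring interface, not any specific model's API.

```python
# Prefix-based scoring for multiple-choice tasks: score each candidate answer
# appended to the prompt and pick the one the model finds most likely.
# `image_text_log_likelihood` is a hypothetical function that returns
# log p(text | image) (or a comparable sequence score) from an LVLM.

from typing import Callable, Sequence

def prefix_based_answer(
    image,                                 # image tensor or path, model-dependent
    prompt: str,                           # e.g. "What color is the car? Answer:"
    candidates: Sequence[str],             # e.g. ["red", "blue", "green"]
    image_text_log_likelihood: Callable[[object, str], float],
) -> str:
    """Return the candidate whose (prompt + candidate) text the model scores highest."""
    scores = {
        cand: image_text_log_likelihood(image, f"{prompt} {cand}")
        for cand in candidates
    }
    return max(scores, key=scores.get)
```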

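The multi-turn reasoning protocol can likewise be sketched as a simple loop. The callables `llm_generate_subquestion`, `vlm_answer`, and `llm_judge` are assumed placeholders standing in for the LLM and LVLM calls described above, not real APIs.

```python
# Multi-turn reasoning loop (illustrative sketch): an LLM decomposes the
# question, the vision-language model answers the sub-questions, and a judge
# LLM decides when enough evidence has been gathered.

from typing import Callable, List, Optional, Tuple

def multi_turn_reasoning(
    image,
    question: str,
    llm_generate_subquestion: Callable[[str, List[Tuple[str, str]]], str],
    vlm_answer: Callable[[object, str], str],
    llm_judge: Callable[[str, List[Tuple[str, str]]], Optional[str]],
    max_turns: int = 5,
) -> str:
    """Iteratively ask sub-questions until the judge LLM returns a final answer."""
    history: List[Tuple[str, str]] = []        # (sub-question, sub-answer) pairs
    for _ in range(max_turns):
        sub_q = llm_generate_subquestion(question, history)
        sub_a = vlm_answer(image, sub_q)
        history.append((sub_q, sub_a))
        final = llm_judge(question, history)   # None means "not satisfied yet"
        if final is not None:
            return final
    return llm_judge(question, history) or "unable to answer"
```
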
More interestingly, the study also built an interactive evaluation platform that pairs vision-language models tournament-style. Users chat with the paired models on any topic using image and text inputs, simulating real-world conditions. After the chat phase, users act as referees and vote for the better model, which can lead to more convincing evaluation results than traditional metrics.
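
One natural way to turn such pairwise votes into user-level rankings is an Elo-style rating update, as used in chatbot arenas; the sketch below only illustrates the general mechanism and is an assumption, not the platform's exact scoring formula.

```python
# Elo-style update for arena-style pairwise votes (illustrative only).
# Each user vote between two anonymized models shifts their ratings
# toward the observed outcome.

def elo_update(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one A-vs-B vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: two models start at 1000; the user prefers model A in one battle.
a, b = elo_update(1000.0, 1000.0, a_wins=True)
print(round(a, 1), round(b, 1))   # 1016.0 984.0
```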

It is like holding a "Pokémon World Championship" for multimodal models: I choose you, Pikachu... that is, the LLaVA model~


2. Evaluation results for existing vision-language models

The paper evaluates eight representative models: BLIP2, LLaVA, LLaMA-Adapter V2, MiniGPT-4, mPLUG-Owl, Otter, InstructBLIP, and VPGTrans.

All of these models show reasonably good zero-shot capability across the six categories of tasks, and InstructBLIP in particular achieves performance far superior to the other models on almost all tasks.

Figure: InstructBLIP achieves performance far better than the other models across the various tasks.

But the authors pessimistically point out that this superior performance is the result of overfitting.

On the one hand, InstructBLIP was instruction-tuned on roughly 1.6 million VQA samples, far more than the other vision-language models, so it performs extremely well in the quantitative evaluation of existing in-domain tasks. On the other hand, in the online interactive evaluation, which is closer to real-world scenarios, InstructBLIP fares much worse than the other models, while mPLUG-Owl and MiniGPT-4 perform best.

Figure: instruction fine-tuning datasets of the eight major vision-language models.

Figure: InstructBLIP performs poorly in the online interactive evaluation, which is closer to real-world scenarios, while models such as mPLUG-Owl, MiniGPT-4, and Otter perform well.

The good news is that a larger instruction fine-tuning dataset can improve a model's performance on in-domain tasks; the bad news is that the model ends up overfitting that data. There is still a long way to go toward building powerful, broadly general vision-language models!
