Overview | Which is the best open-source multimodal large model?

Compiled from QbitAI | Official account QbitAI

The explosive debut of GPT-4 has set off a wave of research on multimodal large models in academia.

However, opinions differ on how to measure the performance of such models, and there is still no evaluation standard with sufficiently broad coverage.

Nor is there a comprehensive survey that defines and systematically studies them.

With this in mind, Tencent Youtu Lab, together with the University of Science and Technology of China and Xiamen University, published two papers on multimodal large models.

Not only is there the first survey of multimodal large models:

[Figure: the survey paper]

There is also a comprehensive evaluation leaderboard!

[Figure: the evaluation leaderboard]

The related project has become popular on GitHub, receiving 2,200+ stars as of July 3.


So, which open-source multimodal large models currently perform best? And how are such models defined, and what are their key techniques, advantages, and remaining challenges?

Let's take a look together.

Multimodal large model TOP 12 ranking

The researchers set up a total of 16 leaderboards, comprising two overall leaderboards and 14 sub-task leaderboards.

The two overall leaderboards can be regarded as scoring a model's "overall ability", split into perception and cognition. The 14 sub-tasks are finer-grained tasks that reveal whether a multimodal large model is particularly good at certain things.
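To make the leaderboard structure concrete, below is a minimal sketch of how the two overall scores can be aggregated from the sub-task scores (as noted later, each overall score is just the sum of its sub-task scores). The sub-task names follow the paper's perception/cognition split, but the numbers are made-up placeholders, not actual evaluation results.

```python
# Illustrative sketch only: sub-task names follow the paper's perception/cognition
# split, but all scores below are made-up placeholders, not real results.
perception_scores = {
    "existence": 180.0,
    "count": 140.0,
    "position": 75.0,
    "color": 160.0,
}
cognition_scores = {
    "commonsense_reasoning": 110.0,
    "numerical_calculation": 55.0,
    "text_translation": 60.0,
    "code_reasoning": 57.5,
}

# Each overall leaderboard score is simply the sum of its sub-task scores.
overall_perception = sum(perception_scores.values())
overall_cognition = sum(cognition_scores.values())
print(f"Perception total: {overall_perception}, Cognition total: {overall_cognition}")
```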


The researchers selected a total of 12 open-source multimodal large models to "demonstrate" the evaluation criteria.

The overall perception leaderboard sums the scores of all perception tasks; BLIP-2 comes out on top:

[Figure: overall perception leaderboard]

The overall cognition leaderboard sums the scores of the cognition tasks; here MiniGPT-4 scores highest:

[Figure: overall cognition leaderboard]

According to the evaluation results, BLIP-2 and InstructBLIP stay in the top three of both leaderboards; they are indeed the "top players" among current open-source multimodal large models.

On the 14 individual sub-tasks, the rankings differ from model to model.

The results are shown below; it is clear at a glance which models are more "specialized" and which are more well-rounded across tasks:

[Figure: results on the 14 sub-tasks]

So, how were the scores on these leaderboards derived?

How the scoring criteria were derived

The paper argues that a good scoring standard for multimodal large models should have the following four characteristics:

(1) It should cover as many areas as possible, including perception and cognition (perception is the basis of cognition).

Here, the former refers to identifying objects, including their existence, quantity, location, and color; the latter refers to performing more complex reasoning that combines perceptual information with the knowledge in the LLM, including tasks such as commonsense reasoning, numerical calculation, text translation, and code reasoning.

(2) Its data and annotations should avoid existing public datasets as much as possible, to reduce the risk of data leakage.

Therefore, all instruction-answer pairs in the evaluation should be manually constructed, and for the few public datasets that are used, only their images are taken, without relying on their original annotations. Data is also collected through manual photography and image generation where possible.

(3) The instruction design should be as concise as possible and conform to human cognitive habits.

Different instruction designs can greatly affect a model's output, so all models are evaluated under a unified, concise instruction to ensure fairness. A good multimodal large model should be able to generalize to such concise instructions, so that the evaluation does not degenerate into prompt engineering.

(4) Under such concise instructions, the model's output should be intuitive and easy to score quantitatively.

Open-ended answers from multimodal large models pose great challenges for quantitative evaluation. Existing methods tend to rely on GPT or manual scoring, which can suffer from inaccuracy and subjectivity.

Therefore, the final evaluation questions look like this:

[Figure: an example evaluation question]

Scores are then computed from the accuracy of the model's answers.
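As a rough illustration of this accuracy-based scoring, the sketch below assumes each evaluation instruction is a yes/no question (an assumption consistent with the concise-instruction design above; the paper's exact metric may differ) and simply counts how often the parsed answer matches the ground-truth label.

```python
# Hedged sketch: assumes yes/no questions and exact-match scoring; the benchmark's
# actual metric may be defined differently (e.g., combining per-question and
# per-image accuracy).
def parse_yes_no(output: str) -> str | None:
    """Extract a yes/no answer from free-form model output, if present."""
    text = output.strip().lower()
    if text.startswith("yes"):
        return "yes"
    if text.startswith("no"):
        return "no"
    return None  # unparseable answers are simply counted as wrong

def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of questions whose parsed answer matches the ground-truth label."""
    correct = sum(parse_yes_no(pred) == label for pred, label in zip(predictions, labels))
    return correct / len(labels)

# Example with made-up model outputs:
preds = ["Yes, there is a dog in the image.", "no", "The answer is unclear."]
labels = ["yes", "no", "no"]
print(f"Accuracy: {accuracy(preds, labels):.2%}")
```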

It is worth mentioning that the authors also tried multiple-choice instructions, but found that current multimodal large models still struggle to follow such more complex instructions. (doge)

The first survey of multimodal large models

Of course, the evaluation criteria for this list are not "groundless".

To understand why the leaderboards are scored this way, take a look at the other paper, a survey of multimodal large models that carefully organizes their definition, key techniques, and challenges.

Specifically, the paper defines a multimodal large model (MLLM) as "a model extended from an LLM with the ability to receive and reason about multimodal information".

This type of model has the following advantages over popular unimodal LLMs:

  • It better matches how humans perceive the world. Humans have multiple senses that receive information in multiple modalities, which are usually complementary and synergistic; using multimodal information therefore generally leads to better understanding and task completion.

  • A more powerful and user-friendly interface. With multimodal input, users can convey information in more flexible ways.

  • Broader task support. LLMs can usually only handle pure-text tasks, while MLLMs can additionally handle tasks involving other modalities, such as image captioning and visual question answering.

To study such multimodal large models, three key techniques usually need to be mastered:

1. Multimodal Instruction Tuning (M-IT) — see the toy data example below
2. Multimodal In-Context Learning (M-ICL)
3. Multimodal Chain of Thought (M-CoT)

In addition, one important application also needs to be studied: multimodal systems with an LLM at the core, i.e., LLM-Aided Visual Reasoning (LAVR).
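To give a feel for multimodal instruction tuning (M-IT), the first of the three techniques, here is a toy example of what a single training sample might look like. The field names and content are hypothetical illustrations, not a schema prescribed by the survey.

```python
# Hypothetical M-IT sample: an image paired with an instruction and a target response.
# Field names and values are illustrative; real datasets define their own schemas.
sample = {
    "image": "images/0001.jpg",  # path to the visual input
    "instruction": "Describe the main object in the image and state its color.",
    "response": "A red bicycle is leaning against a brick wall.",
}

# During instruction tuning, the image is typically encoded by a vision encoder,
# projected into the LLM's token space, and the model learns to generate
# `response` conditioned on the image tokens plus the `instruction` text.
```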


However, multimodal large models are still in their infancy, and challenges remain: limited perception ability, fragile reasoning chains, instruction-following that needs further improvement, and frequent object hallucinations.

For more details on the survey and the leaderboards, see the papers below~

Multimodal large model leaderboard:
https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation

Paper address:
[1] Review: https://arxiv.org/abs/2306.13549
[2] Evaluation: https://arxiv.org/abs/2306.13394

