A new multi-modal large model dominates the leaderboards! It supports mixed image-and-text input, and what it doesn't already know it can learn on the spot.

Cressy, from Aofei Temple
Qubit | Official account QbitAI

The multi-modal large model family has a new member!

Not only can it jointly analyze multiple images combined with text, it can also handle the spatio-temporal relationships in video.

This free, open-source model has topped both the MMBench and MME leaderboards, and it still sits in the top three as the rankings fluctuate.

△The MMBench leaderboard. MMBench is a comprehensive multi-modal capability evaluation system based on ChatGPT, jointly launched by Shanghai AI Lab and Nanyang Technological University

△The MME leaderboard. MME is a multi-modal large language model evaluation benchmark built by Tencent Youtu Lab and Xiamen University

This multi-modal large model is called MMICL, and it was jointly released by Beijing Jiaotong University, Peking University, UCLA, the company Zuzhi Multi-Mode and other institutions.

MMICL comes in two versions built on different LLMs: one based on Vicuna and one based on FlanT5XL.

Both versions are open source; the FlanT5XL version may be used commercially, while the Vicuna version is restricted to research use.

In MME's multi-task test, the FlanT5XL version of MMICL has maintained its leading position for several weeks.

On the cognition track it scored 428.93 in total (out of 800), ranking first and clearly ahead of the other models.

On perception its total score is 1381.78 (out of 2000), trailing only Alibaba's Qianwen-7B and Kunlun Wanwei's Tiangong model on the latest version of the leaderboard.

As for hardware, the team says training requires six A40 GPUs, while inference can run on a single A40.

The second training stage needs only 0.5M samples built from open-source datasets and takes just a few dozen hours to complete.

So, what are the characteristics of this large multi-modal model?

It can watch videos and "learn on the spot"

MMICL accepts prompts with interleaved text and images, making it as natural to use as chatting on WeChat.

Feed it two pictures in plain conversational language, and it will analyze their similarities and differences.
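
To make this concrete, here is a minimal sketch of what such an interleaved query could look like. The `<imageN>` placeholder convention and the file names are illustrative assumptions, not MMICL's documented prompt syntax (see the GitHub repo linked at the end for the real interface).

```python
# Illustrative only: an interleaved two-image comparison query. The <imageN>
# placeholder convention and file names are assumptions for illustration,
# not MMICL's documented prompt format.
prompt = (
    "image 0 is <image0>, image 1 is <image1>. "
    "What are the similarities and differences between image 0 and image 1?"
)
images = ["photo_a.jpg", "photo_b.jpg"]  # the two pictures fed alongside the text
print(prompt, images)
```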

Beyond strong image analysis, MMICL can also "learn on the spot".

For example, we gave MMICL a picture of a pixel-style horse from Minecraft.

Since its training data consists entirely of real-world scenes, MMICL failed to recognize this overly abstract pixel style.

But after learning just a few examples, MMICL can quickly reason by analogy.

In the example below, MMICL was shown three cases (a horse, a donkey, and an empty scene) and then correctly identified the pixel horse even after the background changed.
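
This "learn on the spot" behavior is multi-modal in-context learning: the demonstrations are packed into the same interleaved prompt ahead of the query. Below is a hedged sketch of how such a few-shot prompt might be assembled; the placeholder tokens and the blank stand-in images are again illustrative, not the project's actual format.

```python
# Sketch of a multi-modal few-shot prompt: three labelled demonstrations
# followed by the query image. Placeholder tokens and stand-in images are
# illustrative, not MMICL's documented format.
from PIL import Image

demos = [
    ("a photo of a horse", Image.new("RGB", (224, 224))),
    ("a photo of a donkey", Image.new("RGB", (224, 224))),
    ("an empty field with no animal", Image.new("RGB", (224, 224))),
]
query_image = Image.new("RGB", (224, 224))  # stand-in for the pixel-style Minecraft horse

parts, images = [], []
for i, (caption, img) in enumerate(demos):
    parts.append(f"image {i} is <image{i}>: {caption}.")
    images.append(img)

q = len(images)
parts.append(f"image {q} is <image{q}>. What animal does image {q} show, if any?")
images.append(query_image)

prompt = " ".join(parts)
print(prompt)
```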

Beyond still images, video poses no problem for MMICL either: it not only understands the content of each frame, but also accurately analyzes the spatio-temporal relationships between frames.

Take this football clash between Brazil and Argentina: MMICL accurately described the actions of both teams.

You can also ask MMICL about the details in the video, such as how the Brazilian players blocked the Argentine players.
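
Since the article describes frame-level understanding, one plausible way to feed a clip is to sample frames and treat them as a multi-image input with a spatio-temporal question attached. Below is a rough sketch that uses OpenCV for the sampling; the clip name and placeholder syntax are hypothetical, not the model's actual interface.

```python
# Sketch: sample a few frames from a video and build a multi-image prompt for
# a spatio-temporal question. The placeholder syntax and file name are
# illustrative; frame sampling uses plain OpenCV.
import cv2

def sample_frames(path: str, n_frames: int = 8):
    """Evenly sample up to n_frames frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // n_frames, 1)
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames[:n_frames]

frames = sample_frames("brazil_vs_argentina.mp4")  # hypothetical clip
prompt = " ".join(f"frame {i} is <image{i}>." for i in range(len(frames)))
prompt += " How did the Brazilian players block the Argentine players?"
print(prompt)
```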

In addition to accurately grasping spatio-temporal relationships in videos, MMICL also supports real-time video stream input.

In the surveillance footage below, a person falls; MMICL detects the anomaly and asks whether help is needed.

Putting the top five models from the MME perception and cognition leaderboards into a single chart, you can see that MMICL performs well across the board.

So, how does MMICL do it, and what are the technical details behind it?

Training is completed in two stages

MMICL aims to solve the difficulties vision-language models face in understanding complex multi-modal inputs that contain multiple images.

MMICL uses the Flan-T5 XXL model as its backbone. The structure and workflow of the whole model are shown in the figure below:

MMICL adopts a structure similar to BLIP-2, but it can accept interleaved image-and-text input.

MMICL treats images and text equally: the processed image and text features are spliced, according to the input format, into an interleaved image-text sequence and fed into the language model for training and inference.

Like InstructBLIP, MMICL's development process freezes the LLM, trains the Q-Former, and fine-tunes on specific datasets.
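
To make the data flow concrete, here is a toy PyTorch sketch of this kind of pipeline: a frozen vision encoder and frozen language model, a trainable Q-Former-style module that compresses each image into a fixed number of visual tokens, and an interleaved sequence of visual and text embeddings. Every module, size, and tensor here is a random stand-in for illustration, not the actual MMICL or BLIP-2 implementation.

```python
# Toy sketch (not the official implementation) of the BLIP-2-style pipeline the
# article describes: a frozen LLM, a trainable Q-Former that compresses image
# features into a fixed number of query embeddings, and an interleaved sequence
# of visual and text embeddings fed to the language model.
import torch
import torch.nn as nn

D_MODEL, N_QUERY = 512, 32  # hidden size and number of Q-Former queries (illustrative)


class ToyQFormer(nn.Module):
    """Cross-attends a fixed set of learnable queries over image patch features."""

    def __init__(self, d_model: int, n_query: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_query, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)  # project into the LLM embedding space

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, n_patches, d_model) from a frozen vision encoder
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.proj(out)  # (batch, n_query, d_model)


# Stand-ins for the frozen components (random weights here, pretrained in reality).
vision_encoder = nn.Linear(768, D_MODEL)   # pretend ViT patch features -> d_model
llm_embed = nn.Embedding(32000, D_MODEL)   # the LLM's token embedding table
llm_body = nn.TransformerEncoder(nn.TransformerEncoderLayer(D_MODEL, 8, batch_first=True), 2)
for m in (vision_encoder, llm_embed, llm_body):
    for p in m.parameters():
        p.requires_grad = False            # freeze everything except the Q-Former

qformer = ToyQFormer(D_MODEL, N_QUERY)     # the only trainable part in this sketch

# Interleave: [text chunk A] [image 0] [text chunk B] [image 1]
text_a = llm_embed(torch.randint(0, 32000, (1, 6)))       # e.g. "image 0 is"
img0 = qformer(vision_encoder(torch.randn(1, 196, 768)))  # 32 visual tokens for image 0
text_b = llm_embed(torch.randint(0, 32000, (1, 8)))       # e.g. "image 1 is ... what differs?"
img1 = qformer(vision_encoder(torch.randn(1, 196, 768)))

sequence = torch.cat([text_a, img0, text_b, img1], dim=1)  # one interleaved sequence
hidden = llm_body(sequence)                                # fed to the (frozen) language model
print(hidden.shape)                                        # (1, 6 + 32 + 8 + 32, 512)
```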

The training process and data structure of MMICL are shown in the figure below:

Specifically, the training of MMICL is divided into two stages:

  • Pre-training stage: uses the LAION-400M dataset (cf. LLaVA)

  • Multi-modal in-context tuning stage: uses the team's own MIC (Multi-Modal In-Context Learning) dataset (a toy sketch of this stage follows the list)
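
As a rough illustration of what the second stage means in code, the sketch below freezes stand-in backbone and head modules and hands only an adapter (playing the Q-Former's role) to the optimizer. Modules, sizes, and data are random toys, not the real training script.

```python
# Toy sketch of the second-stage tuning described above: the LLM backbone stays
# frozen and only the Q-Former-style adapter receives gradients.
import torch
import torch.nn as nn

backbone = nn.Linear(512, 512)   # stand-in for the frozen LLM
adapter = nn.Linear(512, 512)    # stand-in for the trainable Q-Former
head = nn.Linear(512, 32000)     # stand-in LM head (frozen as well)
for p in list(backbone.parameters()) + list(head.parameters()):
    p.requires_grad = False

# Only parameters that still require gradients go into the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in adapter.parameters() if p.requires_grad], lr=1e-4
)
loss_fn = nn.CrossEntropyLoss()

for step in range(3):                       # a few fake steps on fake data
    feats = torch.randn(4, 512)             # fake fused image-text features
    labels = torch.randint(0, 32000, (4,))  # fake target tokens
    logits = head(backbone(adapter(feats)))
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.3f}")
```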

The MIC dataset is constructed from public datasets and has the following characteristics:

The first is explicit references established between images and text. MIC inserts image declarations into the interleaved image-text data, uses image proxy tokens to stand in for the different images, and uses natural language to establish referential relationships between the images and the text.
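
A small sketch of this idea: each image gets a declaration with a proxy token so the surrounding text can refer to it by name. The `[IMG{i}]` token format and the helper below are illustrative, not the exact MIC schema.

```python
# Sketch of the "explicit reference" idea: insert an image declaration with a
# proxy token for each image so the text can refer to images by name. The
# token format and fields below are illustrative, not the exact MIC schema.
def build_declared_example(captions, question):
    """Turn per-image captions plus a question into one declared prompt."""
    declarations = []
    for i, caption in enumerate(captions):
        # "[IMG{i}]" acts as the proxy token standing in for image i's features.
        declarations.append(f"image {i} is [IMG{i}]: {caption}")
    return " ".join(declarations) + " " + question

example = build_declared_example(
    ["a player in yellow dribbling", "a player in blue sliding in"],
    "What is the spatial relationship between image 0 and image 1?",
)
print(example)
```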

The second is multi-image data interconnected in space, time, or logic, which ensures that MMICL can understand the relationships between images more accurately.

The third is a demonstration dataset that mirrors MMICL's "learn on the spot" behavior: it uses multi-modal in-context learning to strengthen MMICL's understanding of complex inputs with interleaved images and text.

MMICL achieves better results than BLIP-2 and InstructBLIP, which also use FlanT5XXL, on multiple test datasets.

The improvement is especially pronounced on tasks involving multiple images, precisely the kind of complex interleaved image-text input described above.

The research team believes that one reason for these strong results is that MMICL addresses the language bias that commonly afflicts vision-language models.

Most vision-language models ignore visual content when faced with large amounts of textual context, which is a fatal flaw when answering questions that require visual information.

Thanks to the research team's approach, MMICL successfully alleviates this language bias in vision-language models.

Readers interested in this large multi-modal model can check out the GitHub page or paper for more details.

GitHub page:
https://github.com/HaozheZhao/MIC
Paper:
https://arxiv.org/abs/2309.07915
Online demo:
http://www.testmmicl.work/

Source: blog.csdn.net/QbitAI/article/details/133053131