Cressy from Aofei Temple
Qubit | WeChat official account QbitAI
The multi-modal large model family has a new member!
It can not only jointly analyze multiple images and text, but also process spatio-temporal relationships in video.
This free and open-source model has topped both the MMBench and MME leaderboards, and currently remains in the top three as the rankings fluctuate.
△MMBench leaderboard. MMBench is a comprehensive multi-modal capability evaluation system based on ChatGPT, jointly launched by Shanghai AI Lab and Nanyang Technological University.
△MME leaderboard. MME is a multi-modal large language model evaluation conducted by Tencent Youtu Lab and Xiamen University.
This multi-modal large model is called MMICL, jointly launched by Beijing Jiaotong University, Peking University, UCLA, the multi-modal company Zuzhi, and other institutions.
MMICL comes in two versions built on different LLM backbones: Vicuna and FlanT5XL.
Both versions are open source. The FlanT5XL version can be used commercially, while the Vicuna version is restricted to research purposes.
In MME's multi-task test, the FlanT5XL version of MMICL has maintained its leading position for several weeks.
In the cognition category, it achieved a total score of 428.93 (out of 800), ranking first and significantly surpassing other models.
Its total score in perception is 1381.78 (out of 2000), second only to Alibaba's Qwen-7B and Kunlun Wanwei's Tiangong model in the latest version of the list.
As for hardware requirements, the team states that six A40 GPUs are needed during the training phase, while inference can run on a single A40.
The second training stage requires only 0.5M samples built from open-source datasets and takes just tens of hours.
So, what are the characteristics of this large multi-modal model?
Can watch videos, and can "learn on the fly"
MMICL supports prompts with interleaved text and images, making it as natural to use as chatting on WeChat.
Feed two pictures to MMICL in plain conversational language, and it can analyze their similarities and differences.
In addition to its strong image analysis capabilities, MMICL can also "learn on the fly".
For example, we gave MMICL a picture of a pixel-style horse in “Minecraft”.
Since the training data are all real-world scenes, MMICL does not recognize this overly abstract pixel style.
But as long as we let MMICL learn from a few examples, it can quickly perform analogical reasoning.
In the picture below, MMICL learned three examples: a horse, a donkey, and neither, and then correctly identified the pixel horse even against a different background.
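The few-shot process above is essentially multi-modal in-context learning: demonstrations are interleaved with the query before being fed to the model. A minimal sketch of how the text side of such a prompt might be assembled is below; the `[IMGi]` placeholder syntax and file names are invented for illustration, not MMICL's actual API.

```python
# Sketch of assembling a few-shot multimodal prompt. The [IMGi] proxy
# tokens mark where each image's visual features would be inserted;
# this format is illustrative, not MMICL's exact one.

def build_few_shot_prompt(examples, query_image):
    """Interleave (image, label) demonstrations before a query image.

    Returns the text prompt plus the ordered list of images whose
    visual features would fill the [IMGi] slots.
    """
    parts, images = [], []
    for i, (image, answer) in enumerate(examples):
        parts.append(f"image {i}: [IMG{i}] This is {answer}.")
        images.append(image)
    n = len(examples)
    parts.append(f"image {n}: [IMG{n}] What is this?")
    images.append(query_image)
    return " ".join(parts), images

# Three demonstrations (horse / donkey / neither), then the pixel horse
demos = [("horse.jpg", "a horse"), ("donkey.jpg", "a donkey"),
         ("empty.jpg", "nothing")]
prompt, images = build_few_shot_prompt(demos, "pixel_horse.jpg")
print(prompt)
```

The model then answers the final question by analogy with the three labeled demonstrations.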
In addition to pictures, dynamic videos are also not a problem for MMICL. It not only understands the content of each frame, but also accurately analyzes the spatio-temporal relationship.
Take this football match between Brazil and Argentina: MMICL accurately analyzed the actions of both teams.
You can also ask MMICL about the details in the video, such as how the Brazilian players blocked the Argentine players.
In addition to accurately grasping the spatiotemporal relationship in the video, MMICL also supports real-time video stream input.
When the person in the surveillance footage falls, MMICL detects the anomaly and issues a prompt asking whether help is needed.
Comparing the top five models in perception and cognition on the MME leaderboard in a single chart, MMICL performs well across the board.
So, how does MMICL do it, and what are the technical details behind it?
Training is completed in two phases
MMICL aims to solve the problems encountered by visual language models in understanding complex multi-modal inputs with multiple images.
MMICL uses the Flan-T5 XXL model as the backbone. The structure and process of the entire model are shown in the figure below:
MMICL adopts a structure similar to BLIP-2, but can accept interleaved image-text input.
It treats images and text equally: the processed image and text features are spliced into an interleaved image-text sequence according to the input format, which is then fed into the language model for training and inference.
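The splicing described above can be sketched as follows. The embedding functions, dimensions, and query count are stand-ins (a Q-Former-style encoder typically emits a fixed number of query embeddings per image), not MMICL's actual components.

```python
# Sketch of interleaving image and text features into one input
# sequence, assuming each image has already been encoded into a fixed
# number of query embeddings. All shapes are illustrative.
import numpy as np

d_model = 8   # embedding width (illustrative)
n_query = 4   # query embeddings per image (illustrative)

def embed_text(tokens):
    # stand-in for the LLM's token embedding table
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tokens), d_model))

def encode_image(_image):
    # stand-in for vision encoder + Q-Former-style projection
    rng = np.random.default_rng(1)
    return rng.normal(size=(n_query, d_model))

# An interleaved input (text, image, text, image), spliced in order
segments = [
    embed_text(["Compare", "these", ":"]),
    encode_image("img_a"),
    embed_text(["and"]),
    encode_image("img_b"),
]
sequence = np.concatenate(segments, axis=0)
print(sequence.shape)  # (12, 8): 3 + 4 + 1 + 4 positions, width 8
```

The language model then attends over this single sequence, so image features and text tokens are handled uniformly.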
Similar to InstructBLIP, MMICL's development process freezes the LLM, trains the Q-Former, and then fine-tunes on specific datasets.
The training process and data structure of MMICL are shown in the figure below:
Specifically, the training of MMICL is divided into two stages:
In the pre-training stage, the LAION-400M dataset is used (following LLaVA)
In the multi-modal in-context tuning stage, the team's own MIC (Multi-Modal In-Context Learning) dataset is used
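The freeze-the-LLM, train-the-Q-Former setup mentioned above can be sketched in PyTorch. The module names and sizes here are placeholders standing in for the real backbone and Q-Former, not MMICL's actual classes.

```python
# Minimal sketch of freezing the language model while keeping the
# Q-Former trainable. Linear layers stand in for the real modules.
import torch.nn as nn

class VLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.llm = nn.Linear(16, 16)       # stand-in for the frozen LLM backbone
        self.q_former = nn.Linear(16, 16)  # stand-in for the trainable Q-Former

model = VLM()
for p in model.llm.parameters():
    p.requires_grad = False  # freeze the language model

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only q_former parameters remain trainable
```

Only the trainable parameters would be handed to the optimizer, which is what keeps the training cost low enough for a handful of A40s.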
The MIC dataset is constructed from public datasets; the figure above shows its contents. MIC also has the following characteristics:
The first is explicit references established between images and text: MIC inserts image declarations into the interleaved image-text data, uses image proxy tokens to stand in for different images, and uses natural language to establish referential relationships between images and text.
The second is multi-image data that is interconnected in space, time, or logic, ensuring that MMICL gains a more accurate understanding of the relationships between images.
The third is demonstration data, which mirrors MMICL's "on-the-fly learning" process: it uses multi-modal in-context learning to strengthen MMICL's understanding of complex input with interleaved images and text.
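The "image declaration" idea from the first characteristic can be illustrated with a tiny prompt builder. The `<imagei>` token syntax here is invented for illustration; MIC's exact format may differ.

```python
# Illustrative sketch of image declarations: each image gets a proxy
# token, and natural language then refers to images by name. The token
# syntax is invented for this example, not MIC's exact format.

def declare_images(names):
    """Build one declaration per image, e.g. 'image 0 is <image0>'."""
    return [f"image {i} is <image{i}>" for i in range(len(names))]

decls = declare_images(["cat.jpg", "dog.jpg"])
question = "What is the difference between image 0 and image 1?"
prompt = "; ".join(decls) + ". " + question
print(prompt)
# prints: image 0 is <image0>; image 1 is <image1>. What is the
# difference between image 0 and image 1?
```

Because each proxy token is explicitly declared, the question can refer back to "image 0" and "image 1" unambiguously even with many images in context.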
MMICL achieves better results on multiple test datasets than BLIP-2 and InstructBLIP, which also use FlanT5XXL.
Especially on tasks involving multiple images, MMICL shows large improvements on such complex image-text input.
The research team believes that one reason for MMICL's strong results is that it addresses the language bias that often exists in visual language models.
Most visual language models ignore visual content when faced with large amounts of textual context, which is a fatal flaw when answering questions that require visual information.
Thanks to the research team's approach, MMICL successfully alleviates this language bias in visual language models.
Readers interested in this large multi-modal model can check out the GitHub page or paper for more details.
GitHub page:
https://github.com/HaozheZhao/MIC
paper address:
https://arxiv.org/abs/2309.07915
Online demo:
http://www.testmmicl.work/