University of Science and Technology of China and Tencent release the first survey on Multimodal Large Language Models


Project page for "A Survey on Multimodal Large Language Models" (updated in real time with the latest papers; already over 2.4K stars):

https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

Reply "Multimodal LLM" in the CVer official account backend to download the PDF of this survey and the accompanying project.

Recently, Multimodal Large Language Models (MLLMs) have attracted widespread attention and become an emerging research hotspot. An MLLM is typically built on a Large Language Model (LLM) and incorporates non-textual modalities to handle a variety of multimodal tasks. Compared with conventional multimodal models, MLLMs exhibit surprising emergent capabilities, such as writing poetry about an image and OCR-free mathematical reasoning. These capabilities suggest that MLLMs are a promising route toward artificial general intelligence.

Researchers from the University of Science and Technology of China and Tencent Youtu Lab have examined the research progress on MLLMs in depth and published the first survey in this field, "A Survey on Multimodal Large Language Models":


Paper link: https://arxiv.org/abs/2306.13549

Project page (updated in real time with the latest papers; already over 2.4K stars):

https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

We define an MLLM as "a model, extended from an LLM, that can receive and reason over multimodal information." Compared with popular unimodal LLMs, this type of model has the following advantages:

  • It better matches how humans perceive the world. Humans receive information through multiple senses whose modalities are usually complementary and synergistic, so using multimodal information generally leads to better understanding and better task completion.

  • A more powerful and user-friendly interface. With multimodal input, users can convey information in more flexible ways.

  • Broader task support. LLMs can usually only handle pure-text tasks, whereas MLLMs can additionally handle multimodal tasks such as image captioning and visual question answering.

This survey revolves around three key techniques of MLLMs and one application:

  • Multimodal Instruction Tuning (M-IT)

  • Multimodal In-Context Learning (M-ICL)

  • Multimodal Chain of Thought (M-CoT)

  • LLM-Aided Visual Reasoning (LAVR)

The first three techniques form the foundation of MLLMs, while the last is a class of multimodal systems built with an LLM at their core. All three techniques have been widely studied in NLP as representative capabilities of LLMs, but extending them to the multimodal domain raises many new characteristics and challenges. LLM-aided visual reasoning systems involve several typical design patterns, namely using the LLM as a controller, a decision maker, or a semantics refiner. The CVPR 2023 best paper Visual Programming [1] adopts the LLM-as-controller design. This article gives a brief overview of these topics and the related challenges; for more details, please refer to the original paper.

Multimodal Instruction Tuning (M-IT)

An instruction is a description of a task. Multimodal instruction tuning fine-tunes a pre-trained MLLM on instruction-formatted data. With this technique, an MLLM can generalize to unseen tasks by following new instructions, improving zero-shot performance. The multimodal instruction format is shown below:

Figure 1. The M-IT instruction format

The basic form of multimodal instruction data can be summarized as an (instruction, multimodal input, answer) triplet. Instructions can be designed in two ways: manually or with GPT assistance. The former hand-crafts a set of instruction templates for each task. For example, for a traditional visual question answering task, the instruction can be written as "<image> What is the answer to the question? {question}", where <image> and {question} (corresponding to <text> in Figure 1) are the image and the question from the original VQA sample. The GPT-assisted approach instead hand-crafts a small number of examples and prompts GPT to generate richer instructions. For multimodal instruction tuning, we summarize existing work from three aspects: data, modality bridging, and evaluation, as shown in the following figure:

Figure 2. Summary of M-IT work (data, modality bridging, and evaluation)
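To make the data format concrete, here is a minimal sketch of a single M-IT training sample following the (instruction, multimodal input, answer) triplet described above. The field names, file path, and instruction template are illustrative assumptions, not taken from the survey.

```python
# A minimal, hypothetical sketch of one multimodal instruction-tuning sample.
# The (instruction, multimodal input, answer) triplet is serialized as a dict;
# field names and the instruction template are illustrative only.
import json

sample = {
    # Manually designed instruction template for a VQA-style task.
    "instruction": "<image> What is the answer to the question? {question}",
    # Multimodal input: the image reference plus the text that fills the template.
    "image": "images/coco_000123.jpg",
    "question": "How many dogs are in the picture?",
    # Target output the MLLM is fine-tuned to produce.
    "answer": "There are two dogs in the picture.",
}

print(json.dumps(sample, indent=2, ensure_ascii=False))
```

GPT-assisted instruction design would simply replace the hand-written "instruction" field with templates generated by prompting GPT on a few seed examples.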

Multimodal In-Context Learning (M-ICL)

Multimodal in-context learning provides a small number of examples in the prompt to elicit the model's latent abilities and regularize the format of its output. An example is shown in the figure below:

Figure 3. An M-ICL example

Research on M-ICL, represented by Flamingo [2], is still relatively scarce. LLMs usually acquire ICL ability without any special training, whereas current MLLMs still rely on training for it, and in-depth studies of example selection and example ordering are still lacking.
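The sketch below assembles a few-shot multimodal prompt in the spirit of M-ICL. The <image> placeholder, the demonstration pairs, and the query are all hypothetical; a real MLLM would interleave actual visual tokens at the placeholder positions.

```python
# Hypothetical sketch: building a few-shot (in-context) multimodal prompt.
# "<image>" marks where a real MLLM would insert visual tokens/features.

demonstrations = [
    ("<image> Question: What fruit is shown? Answer:", "An apple."),
    ("<image> Question: What color is the car? Answer:", "It is red."),
]
query = "<image> Question: How many people are in the photo? Answer:"

def build_icl_prompt(demos, query):
    """Concatenate demonstration pairs followed by the unanswered query."""
    parts = [f"{q} {a}" for q, a in demos]
    parts.append(query)  # the model is expected to complete this line
    return "\n\n".join(parts)

print(build_icl_prompt(demonstrations, query))
```

Example selection and ordering, which the survey flags as under-explored, correspond here simply to which pairs go into `demonstrations` and in what order.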

Multimodal Chain of Thought (M-CoT)

A multimodal chain of thought obtains the answer to a multimodal task through explicit step-by-step reasoning (producing intermediate reasoning steps). Compared with directly outputting the answer, M-CoT achieves better performance on more complex reasoning tasks. We summarize current research from four aspects: modality bridging, learning paradigm, chain configuration, and generation pattern:

Figure 4. Summary of M-CoT work

Research on M-CoT is still scarce and remains at a preliminary, exploratory stage.
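As a minimal illustration of the idea (the exact wording of the reasoning trigger is an assumption, not prescribed by the survey), the sketch below contrasts a direct-answer prompt with an M-CoT prompt for the same multimodal question.

```python
# Hypothetical sketch: a direct prompt vs. an M-CoT prompt for the same task.
# "<image>" again stands in for the visual input an MLLM would receive.

question = "Is the object on the table heavier than the object on the chair?"

# Direct answering: the model is asked for the final answer only.
direct_prompt = f"<image> {question} Answer with yes or no."

# M-CoT: explicitly ask for intermediate reasoning steps before the answer.
cot_prompt = (
    f"<image> {question}\n"
    "Let's reason step by step: first identify the two objects, "
    "then compare their likely weights, and finally give the answer."
)

print(direct_prompt)
print(cot_prompt)
```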

LLM-Aided Visual Reasoning (LAVR)

This line of work leverages the rich knowledge and strong capabilities embedded in LLMs, together with other tools, to build visual reasoning systems. Compared with traditional visual reasoning models, these systems have several desirable properties: (1) strong zero-/few-shot generalization; (2) emergent capabilities, enabling more complex tasks such as explaining the deeper meaning of a meme; (3) better interactivity and controllability. We summarize current progress from three perspectives: training paradigm, the roles the LLM plays, and evaluation:

Figure 5. Summary of LAVR work
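To illustrate the LLM-as-controller pattern used by systems such as Visual Programming [1], the sketch below has an LLM produce a short tool-call plan that Python then dispatches to vision modules. The tool names, the plan format, and the `call_llm`/`detect`/`count` functions are all placeholders, not the actual Visual Programming interface.

```python
# Hypothetical sketch of the LLM-as-controller pattern in LAVR systems:
# the LLM plans which vision tools to call; Python executes the plan.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; returns a canned plan here."""
    return "DETECT(person) -> COUNT(person)"

def detect(image, label):          # placeholder vision tool
    return [f"{label}_box_{i}" for i in range(3)]

def count(boxes):                  # placeholder vision tool
    return len(boxes)

TOOLS = {"DETECT": detect, "COUNT": count}

def run(image, question):
    plan = call_llm(f"Write a tool plan to answer: {question}")
    result = None
    for step in plan.split("->"):
        name, arg = step.strip().rstrip(")").split("(")
        if name == "DETECT":
            result = TOOLS[name](image, arg)
        else:  # COUNT consumes the previous step's output
            result = TOOLS[name](result)
    return result

print(run("photo.jpg", "How many people are in the image?"))  # -> 3
```

The same skeleton covers the decision-maker and semantics-refiner roles: instead of planning tool calls, the LLM would pick among candidate answers or polish the final textual response.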

Challenges and Future Directions

The development of MLLMs is still in its infancy, and many challenges and open research problems remain, in both the underlying techniques and concrete applications. We summarize the main ones below:

  • The perception ability of existing MLLMs is limited, yielding incomplete or incorrect visual information that in turn causes downstream reasoning errors. This may stem from the trade-off between information capacity and computational cost in current models.

  • The reasoning chains of MLLMs are fragile: even on simple multimodal reasoning problems, the model sometimes gives wrong answers because its reasoning chain breaks down.

  • The instruction-following ability of MLLMs needs further improvement: even after instruction tuning, some MLLMs still fail to produce the expected answer for relatively simple instructions.

  • Object hallucination is widespread: an MLLM's response may not match the image content, for example by fabricating objects that are not present, which undermines its reliability.

  • Parameter-efficient training. Given the large capacity of MLLMs and limited computational resources, parameter-efficient training is expected to unlock more MLLM capabilities; a minimal sketch follows this list.
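The survey does not prescribe a specific method, so as one possible illustration the sketch below applies LoRA adapters (via the `peft` library) to a small public language model standing in for an MLLM's LLM backbone; the model name and hyperparameters are illustrative assumptions.

```python
# A minimal sketch of parameter-efficient tuning with LoRA via the `peft`
# library. A small public LLM stands in for an MLLM's language backbone;
# in a real MLLM the vision encoder would typically stay frozen as well.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

backbone = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_cfg = LoraConfig(
    r=8,                                  # low-rank update dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```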

The first four issues above are evaluated and discussed in detail in a companion paper from the same team (arxiv.org/abs/2306.13394), which readers are encouraged to consult. Beyond these issues, MLLMs have only been explored preliminarily in specific sub-directions; for example, M-ICL still lacks in-depth study of example selection and ordering.

For more details, please read

Paper link: https://arxiv.org/abs/2306.13549

Project link: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models


[1] Tanmay Gupta and Aniruddha Kembhavi. Visual Programming: Compositional Visual Reasoning Without Training. CVPR 2023.

[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: A Visual Language Model for Few-Shot Learning. NeurIPS 2022.
