Leveraging large language models for multimodal tasks

Author: Hu Anwen (Multimodality, NLP)

From: Artificial Intelligence and Algorithmic Learning

Large language models (LLMs) have strong general knowledge and logical reasoning abilities, but they can only process text. Although the released GPT-4 can reportedly understand images, its multimodal input interface has not been opened to the public and no technical details about the model have been disclosed. At this stage, therefore, how to use an LLM for multimodal tasks still has real research value.

This article surveys some of the work from the past two years that builds on LLMs for vision-language tasks, dividing it into four categories:

  1. Using the LLM as an understanding center that invokes multimodal models, such as Visual ChatGPT (2023) [1] and MM-REACT (2023) [2];

  2. Converting vision into text that is fed to the LLM as input, such as PICa (2022) [3], PromptCap (2022) [4], and ScienceQA (2022) [5];

  3. Using the visual modality to influence the LLM's decoding, such as ZeroCap [6] and MAGIC [7];

  4. Freezing the LLM and training additional structures such as visual encoders to adapt to it, such as Frozen [8], BLIP2 [9], Flamingo [10], and PaLM-E [11].

Next, a representative work is briefly introduced for each category.

1. Using the LLM as an understanding center to invoke multimodal models

Take Microsoft's Visual ChatGPT [1] as an example. Its goal is a system that can not only discuss visual content with users but also draw and edit pictures. To this end, Visual ChatGPT uses ChatGPT as the understanding center that communicates with users and integrates multiple Visual Foundation Models. Through prompt engineering (the Prompt Manager), ChatGPT is told the purpose and the input/output format of each foundation model, so that it can decide how to invoke these models to satisfy the user's request, as shown in Figure 1.


Figure 1: Schematic diagram of the Visual ChatGPT system

MM-REACT [2], proposed later by another Microsoft team, follows the same idea. The main differences lie in the prompt engineering design, and MM-REACT focuses more on general visual understanding and interpretation, wrapping many Microsoft Azure APIs such as celebrity recognition, receipt recognition, and Bing search.
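To make the division of labor concrete, below is a minimal sketch of this "LLM as controller" pattern. The tool names, prompt wording, and the `chat_completion` / `tool_impls` arguments are illustrative assumptions, not the actual Prompt Manager of Visual ChatGPT or the MM-REACT implementation.

```python
# Minimal sketch of the "LLM as understanding center" pattern used by
# Visual ChatGPT / MM-REACT. Tool names, prompt wording and the
# `chat_completion` helper are illustrative assumptions, not the real APIs.

TOOLS = {
    "image_captioning": "Describe an image. Input: image path. Output: text caption.",
    "text_to_image":    "Generate an image from text. Input: text. Output: image path.",
    "vqa":              "Answer a question about an image. Input: image path and question. Output: text.",
}

SYSTEM_PROMPT = (
    "You can call the following visual tools. "
    "Reply with the line `TOOL: <name> | <input>` to call one, "
    "or `ANSWER: <text>` when you can answer directly.\n"
    + "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
)

def run_agent(user_request: str, chat_completion, tool_impls, max_steps: int = 5) -> str:
    """Loop: let the LLM decide which visual tool to invoke until it answers."""
    history = [{"role": "system", "content": SYSTEM_PROMPT},
               {"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = chat_completion(history)           # e.g. a ChatGPT-style API call
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        if reply.startswith("TOOL:"):
            name, _, tool_input = reply[len("TOOL:"):].partition("|")
            result = tool_impls[name.strip()](tool_input.strip())  # call the visual model
            history += [{"role": "assistant", "content": reply},
                        {"role": "user", "content": f"TOOL RESULT: {result}"}]
    return "Stopped: too many tool calls."
```

The prompt-engineering difficulty mentioned above lives almost entirely in `SYSTEM_PROMPT`: the controller LLM only sees tool descriptions as text, so their wording decides how reliably the right model gets called.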

2. Converting vision into text as input to the LLM

Take PICa [3] as an example. Its goal is to exploit the massive knowledge stored in the LLM for knowledge-based QA. Given an image and a question, previous work mainly retrieved relevant background knowledge from external sources such as Wikipedia to assist answer generation. PICa instead describes the image as text, concatenates that description directly with the question as the LLM's input, and lets the LLM generate the answer through in-context learning, as shown in Figure 2.


Figure 2: Schematic diagram of the PICa method

The quality of the examples/demonstrations largely determines how well in-context learning works. For this reason, the PICa authors use CLIP to select, for each test example, the 16 training examples whose questions and images are most similar to it as demonstrations.
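As a rough illustration of this pipeline (not PICa's released code), the sketch below uses Hugging Face CLIP to rank training examples by joint question-plus-image similarity and then assembles a caption + question prompt for a text-only LLM. The `caption`, `question`, `answer`, and `image` fields of each example are assumed to be prepared elsewhere (e.g. captions from an off-the-shelf captioning model).

```python
# Rough sketch of PICa-style prompting: images are turned into captions,
# 16 in-context examples are picked by CLIP similarity, and the resulting
# prompt is sent to a text-only LLM. Model names and data fields are assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(example):
    """Joint question+image embedding used to rank candidate demonstrations."""
    with torch.no_grad():
        t = clip.get_text_features(**proc(text=[example["question"]],
                                          return_tensors="pt", padding=True, truncation=True))
        v = clip.get_image_features(**proc(images=example["image"], return_tensors="pt"))
    return torch.nn.functional.normalize(torch.cat([t, v], dim=-1), dim=-1)

def build_prompt(test_ex, train_set, n_shots=16):
    """Pick the n_shots most similar training examples and format the prompt."""
    q = embed(test_ex)
    sims = torch.cat([q @ embed(ex).T for ex in train_set]).squeeze(-1)
    shots = [train_set[i] for i in sims.topk(n_shots).indices.tolist()]
    blocks = [f"Context: {ex['caption']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}"
              for ex in shots]
    blocks.append(f"Context: {test_ex['caption']}\nQuestion: {test_ex['question']}\nAnswer:")
    return "\n\n".join(blocks)   # feed this string to GPT-3 or another text-only LLM
```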

3. Using the visual modality to influence the LLM's decoding

Take MAGIC [7] as an example. Its goal is to let the LLM perform image captioning. The core idea is to boost the generation probability of visually relevant words at each decoding step; the formula is shown in Figure 3.


Figure 3: Schematic diagram of MAGIC decoding formula

The formula consists of three parts: 1) the word probability predicted by the LLM; 2) a degeneration penalty (orange); 3) visual relevance (red). The degeneration penalty encourages each generated word to carry new information, while the visual-relevance part uses CLIP to compute the correlation between every candidate word and the image and takes the probability after a softmax as the prediction.
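Below is a hedged sketch of a single MAGIC-style decoding step showing how the three terms might be combined. The weighting coefficients are illustrative, and the CLIP image-text similarities for the candidate continuations are assumed to be computed elsewhere; this is not the authors' implementation.

```python
# Hedged sketch of one MAGIC-style decoding step: combine the LM probability,
# a degeneration penalty, and a CLIP-based visual relevance score over the
# top-k candidate words. alpha / beta / k values are illustrative.
import torch

def magic_step(lm_logits, hidden_cand, hidden_prev, clip_image_text_sim,
               alpha=0.6, beta=2.0, k=45):
    """
    lm_logits:           (vocab,) next-token logits from the frozen LLM
    hidden_cand:         (k, d)   hidden states of the top-k candidate continuations
    hidden_prev:         (t, d)   hidden states of the already generated tokens
    clip_image_text_sim: (k,)     CLIP similarity between the image and each
                                  candidate continuation of the caption
    """
    probs = lm_logits.softmax(-1)
    top_p, top_ids = probs.topk(k)                       # 1) model confidence
    h_c = torch.nn.functional.normalize(hidden_cand, dim=-1)
    h_p = torch.nn.functional.normalize(hidden_prev, dim=-1)
    degen = (h_c @ h_p.T).max(dim=-1).values             # 2) degeneration penalty
    visual = clip_image_text_sim.softmax(-1)             # 3) visual relevance (softmax over candidates)
    score = (1 - alpha) * top_p - alpha * degen + beta * visual
    return top_ids[score.argmax()]                       # chosen next-token id
```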

4. Training additional structures such as visual encoders to adapt to the LLM

This line of work currently attracts the most attention, because it has the potential to "extend an LLM into a multimodal model at a cost far lower than training a general multimodal model from scratch". Frozen, published by DeepMind in 2021, Flamingo in 2022, and BLIP2 from Salesforce in 2023 all follow this route, as shown in Figure 4.


Figure 4: Schematic diagram of Frozen, Flamingo, BLIP2.

During training, Frozen encodes each picture into 2 vision tokens that serve as a prefix to the LLM, with the objective of generating the subsequent text; Conceptual Captions is used as the training corpus. Frozen's few-shot/in-context performance on downstream VQA and image classification is not very strong, but some multimodal in-context learning ability is already observable.
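The sketch below illustrates this recipe under simplifying assumptions (a GPT-2 stand-in for the LLM and a pooled visual feature plus a linear projection standing in for the trainable image encoder). The class name and shapes are hypothetical; only the structure matches the description above: the LLM is frozen, and the 2 vision tokens are prepended to the text embeddings before computing the captioning loss.

```python
# Minimal sketch of the Frozen idea: a trainable visual side maps each picture
# to 2 "vision tokens" that are prepended to the frozen LLM's text embeddings.
# Shapes and the use of GPT-2 here are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class FrozenStyle(nn.Module):
    def __init__(self, lm_name="gpt2", n_vision_tokens=2, vis_dim=2048):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)
        for p in self.lm.parameters():           # the LLM stays frozen
            p.requires_grad = False
        d = self.lm.config.hidden_size
        self.n = n_vision_tokens
        # stand-in for the trainable image encoder + projection
        self.to_prefix = nn.Linear(vis_dim, n_vision_tokens * d)

    def forward(self, image_feats, input_ids, labels):
        # image_feats: (B, vis_dim) pooled features from a vision backbone
        B = image_feats.size(0)
        prefix = self.to_prefix(image_feats).view(B, self.n, -1)
        text_emb = self.lm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([prefix, text_emb], dim=1)
        # prefix positions are masked out of the language-modeling loss
        prefix_labels = torch.full((B, self.n), -100,
                                   dtype=labels.dtype, device=labels.device)
        return self.lm(inputs_embeds=inputs_embeds,
                       labels=torch.cat([prefix_labels, labels], dim=1)).loss
```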

To handle visual feature maps of inconsistent size (especially for multi-frame video), Flamingo uses a Perceiver Resampler (a DETR-like decoder) to produce a fixed-length sequence of visual features (64 tokens), and inserts before each LLM layer an additional cross-attention layer that attends to the visual features, so that generation is more strongly conditioned on vision. Flamingo has far more trainable parameters than Frozen and therefore uses a large amount of data:

  1. MultiModal MassiveWeb (M3W): interleaved image-text data collected from 43 million web pages, converted into sequences in which the position of each <image> token among the text tokens is determined by the image's relative position on the page;

  2. ALIGN (alt-text & image pairs): 1.8 billion image-text pairs;

  3. LTIP (Long Text & Image Pairs): 312 million image-text pairs;

  4. VTP (Video & Text Pairs): 27 million video-text pairs (average video length 22 s, sampled at 1 FPS).

Like an ordinary LLM, Flamingo's training objective is text generation, but it assigns different weights to the datasets: 1.0, 0.2, 0.2, and 0.03 for the four parts above, respectively. The interleaved image-text M3W data thus carries by far the most weight, and the authors also stress that this kind of data is a key factor behind multimodal in-context learning. Flamingo achieves very good zero-shot and few-shot performance on many tasks.
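The following is a rough sketch of a Perceiver-Resampler-style module as described above, not Flamingo's actual implementation: a fixed set of 64 learned latent queries cross-attends to a variable-length visual feature sequence, so every image or video clip yields the same number of visual tokens. Layer counts and dimensions are illustrative.

```python
# Hedged sketch of a Perceiver-Resampler-style module: learned latent queries
# cross-attend to a variable-length visual sequence and always return 64 tokens.
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, dim=1024, n_latents=64, n_layers=2, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, n_heads, batch_first=True),
                "ff": nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                    nn.GELU(), nn.Linear(4 * dim, dim)),
            })
            for _ in range(n_layers)
        ])

    def forward(self, visual_feats):
        # visual_feats: (B, T, dim) with T varying per image / video clip
        B = visual_feats.size(0)
        x = self.latents.unsqueeze(0).expand(B, -1, -1)      # (B, 64, dim)
        for layer in self.layers:
            # the latents query the (possibly long) visual sequence
            attn_out, _ = layer["attn"](x, visual_feats, visual_feats)
            x = x + attn_out
            x = x + layer["ff"](x)
        return x  # (B, 64, dim): fixed-length visual tokens for the LLM's cross-attention
```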

BLIP2 adopts a visual encoding structure similar to Flamingo's but a more complex training strategy, consisting of two stages. The first stage mainly teaches the visual encoding structure to extract the most critical visual information; its training tasks are Image-Text Contrastive learning, Image-grounded Text Generation, and Image-Text Matching. The second stage mainly adapts the output of the visual encoding structure to the LLM, with language modeling as the training task. BLIP2's training data includes MSCOCO, Visual Genome, CC15M, SBU, 115M images from LAION400M, and captions generated by BLIP for web images. BLIP2 achieves strong zero-shot captioning and VQA ability, but the authors note that no in-context learning ability was observed: providing in-context examples does not improve its performance. Their analysis attributes this to the absence of the interleaved image-text data used by Flamingo. However, Frozen did not use such data either yet still showed some in-context learning ability, so multimodal in-context learning capability is probably related to training data, training tasks, and position-encoding methods together.
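To show how the three first-stage objectives might fit together, here is a hedged sketch that assumes the embeddings and logits have already been produced by the visual encoding structure (the Q-Former); it only combines the losses and is not BLIP2's actual training code.

```python
# Hedged sketch of combining BLIP2-style first-stage objectives:
# image-text contrastive (ITC), image-grounded text generation (ITG),
# and image-text matching (ITM). All inputs are assumed to come from
# the Q-Former; how it computes them is abstracted away.
import torch
import torch.nn.functional as F

def stage1_loss(img_emb, txt_emb, itg_logits, caption_ids, itm_logits, itm_labels,
                temperature=0.07):
    """
    img_emb, txt_emb: (B, d)        unit-normalised image / text embeddings
    itg_logits:       (B, L, vocab) logits for image-grounded caption generation
    caption_ids:      (B, L)        target caption tokens
    itm_logits:       (N, 2)        match / no-match logits for sampled pairs
    itm_labels:       (N,)          1 for matched image-text pairs, 0 otherwise (long)
    """
    # 1) ITC: each image should match its own caption within the batch
    sim = img_emb @ txt_emb.T / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.T, targets)) / 2
    # 2) ITG: standard token-level cross entropy for caption generation
    itg = F.cross_entropy(itg_logits[:, :-1].reshape(-1, itg_logits.size(-1)),
                          caption_ids[:, 1:].reshape(-1))
    # 3) ITM: binary classification on (mis)matched image-text pairs
    itm = F.cross_entropy(itm_logits, itm_labels)
    return itc + itg + itm
```

The second stage, as described above, then reduces to a plain language-modeling loss with the projected query outputs prepended to the frozen LLM's text embeddings, much like the Frozen-style sketch earlier.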

Summary

"Using LLM as the understanding center to invoke multimodal models" can easily and quickly deploy a multimodal understanding and generation system based on LLM. The difficulty lies in the design of prompt engineering to schedule different multimodal models;

"Converting vision into text as the input of LLM" and "using visual modality to influence the decoding of LLM" can directly use LLM to do some multimodal tasks, but the upper limit may be low, and its performance depends on the external multimodal model. ability;

"Training additional structures such as visual encoders to adapt to LLM" has higher research value, because it has the potential to integrate any modality into LLM and realize a true multimodal model. The difficulty lies in how to achieve a strong in- The ability of context learning.

Related work

Omitted……



Source: blog.csdn.net/qq_27590277/article/details/130612952