Large multimodal models based on LLMs (Visual ChatGPT, PICa, MM-REACT, MAGIC)

Now that LLMs have very strong dialogue ability, the next hot spot is how to give them multimodal capabilities such as vision and speech (GPT-4 already has some of this). This series will be updated from time to time with articles on using LLMs for multimodal tasks.

Intuitively, directly training a ChatGPT-like multimodal framework from scratch would consume an enormous amount of data and compute, and every time a new modality is added the whole framework would have to be retrained, a price that is hard for most universities or companies to bear. Therefore, current papers try to use strategies or adaptation methods to connect language models with other models, especially vision and language.

This post first sorts out papers that do not train any visual model; they mainly rely on prompting strategies to let LLMs complete multimodal tasks.


[Figure: Visual ChatGPT overview — the user's image and instruction are decomposed by the Prompt Manager and executed step by step by visual foundation models]

Visual ChatGPT
Visual ChatGPT is an agent built on LLMs: it uses the LLM as the language center, tells it the input and output formats of each visual foundation model (VFM), and then lets ChatGPT select and call the models according to the user's needs.

  • As shown in the figure above, the user uploads an image plus an instruction (for example: generate red flowers according to the depth of the image, then change the style to cartoon, step by step).
  • The Prompt Manager decomposes the user's instruction into multiple executable steps and then calls models from its foundation model library (22 VFMs).
  • That is, it first predicts the image depth with a depth-estimation model, then generates red flowers conditioned on that depth, and finally uses Stable Diffusion for style transfer (a minimal sketch of this chain follows the list).
    [Figure: the decomposed execution chain — depth estimation → depth-conditioned generation → style transfer]
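
Under the hood this is just a chain of tool calls. The sketch below is not the authors' code: depth_estimation, depth_to_image and style_transfer are hypothetical stand-ins for three of the 22 VFMs, each passing an image file path to the next tool.

```python
# Minimal sketch of the decomposed chain; the three functions are placeholders, not real VFMs.

def depth_estimation(image_path: str) -> str:
    """Predict a depth map for the input image (placeholder)."""
    return image_path.replace(".png", "_depth.png")

def depth_to_image(depth_path: str, prompt: str) -> str:
    """Generate a new image conditioned on the depth map and a text prompt (placeholder)."""
    return depth_path.replace("_depth.png", "_red_flowers.png")

def style_transfer(image_path: str, style: str) -> str:
    """Re-render the image in the requested style, e.g. via Stable Diffusion (placeholder)."""
    return image_path.replace(".png", f"_{style}.png")

# "Generate red flowers from the image's depth, then make it cartoon style" becomes:
result = style_transfer(depth_to_image(depth_estimation("input.png"), prompt="red flowers"),
                        style="cartoon")
print(result)  # input_red_flowers_cartoon.png
```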

Since ChatGPT is used as the base, this is naturally a multi-round dialogue framework. As shown in the middle of the figure, four parts are fed into the Prompt Manager:

  • System principles P: system rules turned into a prompt that ChatGPT can understand, helping it integrate multiple foundation vision models, e.g. how to access VFMs, how to reference images by file name, and chain-of-thought rules for decomposing user commands (as in the figure above, the query is broken into several callable steps). There are also principles that constrain reasoning and system reliability.
  • Visual foundation models F: the library of callable VFMs. To make calling easy, each one is defined with a name, usage, input/output format, and an (optional) example.
  • User query Q: the user's query at the current moment.
  • History of dialogue H: the full dialogue history, truncated to fit ChatGPT's maximum input length.

So for a dialogue $S = \{(Q_1, A_1), (Q_2, A_2), \dots, (Q_N, A_N)\}$, the reply obtained in the $i$-th round after $j$ tool invocations is $A_i^{(j+1)} = \text{ChatGPT}(M(P), M(F), M(H_{<i}), M(Q_i), M(R_i^{(<j)}), M(F(A_i^{(j)})))$, where $M$ is the Prompt Manager, which turns each component into a reasonable prompt and hands it to ChatGPT, and $R_i^{(<j)}$ denotes the intermediate results accumulated so far in round $i$.
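
As a rough illustration of this loop (not the authors' implementation), the sketch below shows how a prompt manager might stitch P, F, H and Q into a single prompt and keep feeding tool outputs back until the LLM stops requesting tools; call_llm(), the TOOLS registry and the "Action:/Final:" reply convention are all hypothetical placeholders.

```python
# Minimal sketch of a Visual ChatGPT-style loop: the prompt manager M assembles P, F, H, Q
# and the intermediate results into one prompt, and iterates until no more tools are requested.

TOOLS = {
    # name -> (usage description, callable); one hypothetical entry shown
    "depth_estimation": ("Predict a depth map. Input: image path. Output: image path.",
                         lambda path: path + ".depth.png"),
}

def call_llm(prompt: str) -> str:
    """Placeholder for ChatGPT; returns either 'Action: <tool> <arg>' or 'Final: <answer>'."""
    return "Final: done"

def prompt_manager(principles, tools, history, query, intermediate):
    tool_desc = "\n".join(f"{name}: {usage}" for name, (usage, _) in tools.items())
    return (f"{principles}\n\nTools:\n{tool_desc}\n\nHistory:\n{history}\n\n"
            f"Query: {query}\n\nIntermediate results:\n{intermediate}")

def run_dialogue_turn(principles, history, query, max_steps=5):
    intermediate = ""
    for _ in range(max_steps):                       # the j tool-calling iterations
        reply = call_llm(prompt_manager(principles, TOOLS, history, query, intermediate))
        if reply.startswith("Final:"):               # A_i^{(j+1)} with no further tool use
            return reply.removeprefix("Final:").strip()
        _, tool_name, arg = reply.split(maxsplit=2)  # parse "Action: <tool> <arg>"
        usage, fn = TOOLS[tool_name]
        intermediate += f"\n{tool_name}({arg}) -> {fn(arg)}"   # F(A_i^{(j)})
    return intermediate

print(run_dialogue_turn("Follow the rules...", history="", query="make the flowers red"))
```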

paper:Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
arxiv:https://arxiv.org/abs/2303.04671
github:https://github.com/microsoft/TaskMatrix


[Figure: PICa overview — image captions plus few-shot QA examples are fed to a frozen GPT-3]

PICa
Constantly calling the OpenAI API like this is not a long-term solution. If some strategy can turn vision into part of the prompt, a lot of computation can be avoided. The most intuitive idea is to first convert the visual input into text and then feed that text into the LLM.

Therefore, PICa mainly converts vision into text (in an in-context learning fashion) and then performs knowledge-based VQA. As shown in the lower-left corner of the figure above, the input to the model is

  • [N-shot VQA examples] [Question] [Image textual descriptions], which is fed into the frozen LLM (GPT-3) to take advantage of the power of the large model.

Specifically, the text converted from the image is concatenated directly with the question and used as the LLM input. Since in-context learning needs both quality and quantity of examples, the author proposes two strategies: in-context example selection and multi-query ensemble.

  • In-context example selection. Examples suitable for the current question should be similar to it, so CLIP (ViT-B/16) is used to select the n most similar training samples as few-shot examples (n = 16), letting the LLM generate the answer directly.
  • Multi-query ensemble. k prompts are generated with different sets of n examples, and the highest-scoring of the k answers is taken as the output (a sketch of the full pipeline follows this list).
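
A minimal sketch of this pipeline is below, under two assumptions: embed() stands in for the CLIP ViT-B/16 text encoder and call_gpt3() for the frozen LLM, and the "Context / Q / A" prompt layout is only illustrative, not the paper's exact template.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)) + 1e-8)

def embed(text):            # hypothetical stand-in for the CLIP text encoder
    return [float(len(w)) for w in text.split()[:4]] + [0.0] * 4

def call_gpt3(prompt):      # hypothetical frozen-LLM call; returns (answer, score)
    return "yes", 0.5

def select_examples(question, pool, n=16):
    """In-context example selection: the n training examples whose questions are most similar."""
    q = embed(question)
    return sorted(pool, key=lambda ex: cosine(q, embed(ex["question"])), reverse=True)[:n]

def build_prompt(examples, caption, question):
    shots = "\n".join(f"Context: {ex['caption']}\nQ: {ex['question']}\nA: {ex['answer']}"
                      for ex in examples)
    return f"{shots}\nContext: {caption}\nQ: {question}\nA:"

def pica_answer(caption, question, pool, n=16, k=5):
    """Multi-query ensemble: k prompts over n*k selected examples, keep the best-scoring answer."""
    ranked = select_examples(question, pool, n * k)
    candidates = [call_gpt3(build_prompt(ranked[i::k], caption, question)) for i in range(k)]
    return max(candidates, key=lambda pair: pair[1])[0]

pool = [{"caption": "a man rides a horse", "question": "is the man riding?", "answer": "yes"}] * 20
print(pica_answer("a dog on a couch", "is the dog sleeping?", pool))
```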

paper:An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
arxiv:https://arxiv.org/abs/2109.05014
code:https://github.com/microsoft/PICa


However, because some visual information is lost when converting images into captions, other models first extract visual information that is more relevant to the query, for example by adding an Image-Question Matching module or using attention such as the Q-Former. These models will be sorted out in the next blog post: LLM-based multimodal large models (Flamingo, BLIP-2, KOSMOS-1).

Here is an article that uses a question-generation model to produce questions matched to the image.

From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models
[Figure: Img2LLM overview — captions are generated, keywords extracted, and questions synthesized to build exemplar QA pairs]

The idea is quite intuitive. As shown in the figure above, a caption model first generates the caption of the image; nouns, adjectives, etc. are then extracted, because they are likely to be keywords of the answer; finally these words are passed to a question-generation model to produce the corresponding questions, yielding synthetic (question, answer) pairs.
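
A minimal sketch of this exemplar-building step is below, assuming a caption model has already produced the captions; generate_question() is a hypothetical stand-in for the question-generation model, and the keyword filter is a crude proxy for real POS tagging.

```python
# Build synthetic (question, answer) pairs from captions, Img2LLM-style (illustrative only).

STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "with", "and"}

def extract_answer_candidates(caption: str) -> list[str]:
    """Keep content words (rough proxy for nouns/adjectives) as likely answer keywords."""
    return [w.strip(".,").lower() for w in caption.split() if w.lower() not in STOPWORDS]

def generate_question(answer: str, caption: str) -> str:
    """Hypothetical question-generation model conditioned on the candidate answer."""
    return f"What is the {answer} in the image?"

def build_exemplars(captions: list[str]) -> list[tuple[str, str]]:
    pairs = []
    for cap in captions:
        for ans in extract_answer_candidates(cap):
            pairs.append((generate_question(ans, cap), ans))
    return pairs

# These synthetic (question, answer) pairs are then prepended, together with the captions,
# to the real question as the prompt for the frozen LLM.
print(build_exemplars(["A brown dog sits on a red couch."])[:3])
```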

paper: https://arxiv.org/abs/2212.10846
code: https://github.com/salesforce/LAVIS/tree/main/projects/img2llm-vqa (the implementation version of LAVIS)


[Figure: MM-REACT overview — ChatGPT decides whether to invoke vision experts (caption, OCR, Bing search, ...) before replying]
MM-REACT
MM-REACT can be seen as a synthesis of the ideas of the two models above. On the one hand, it converts images into text through a caption model and feeds them into the large model; on the other hand, it lets ChatGPT call various visual models to accomplish a variety of multimodal tasks.

As shown in the figure above, the user's query is first handed to ChatGPT, which decides whether a visual model needs to be called (e.g. caption, OCR, Bing search). If so, the corresponding action is executed and its result is fed back; otherwise ChatGPT's output is returned to the user directly.
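
A minimal sketch of such a dispatch loop is below; call_chatgpt(), the "ACTION:" marker and the EXPERTS table are hypothetical placeholders, not MM-REACT's actual watchwords.

```python
import re

# ChatGPT's reply is watched for a tool-invocation pattern; if one appears, the matching
# expert runs and its observation is appended to the conversation, otherwise the reply
# is returned to the user as-is.

EXPERTS = {
    "caption": lambda path: "a dog on a couch",   # placeholder caption model
    "ocr":     lambda path: "no text detected",   # placeholder OCR model
}

def call_chatgpt(messages):                       # placeholder for the LLM call
    return "Final answer: the image shows a dog."

def mm_react(user_query, image_path, max_turns=5):
    messages = [f"User: {user_query} (image: {image_path})"]
    for _ in range(max_turns):
        reply = call_chatgpt(messages)
        match = re.search(r"ACTION:\s*(\w+)", reply)
        if not match:                             # no tool requested: hand the reply back
            return reply
        tool = match.group(1)
        observation = EXPERTS[tool](image_path)
        messages += [f"Assistant: {reply}", f"Observation ({tool}): {observation}"]
    return reply

print(mm_react("What is in this picture?", "photo.png"))
```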

paper:MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
arxiv:https://arxiv.org/abs/2303.11381
code:https://github.com/microsoft/MM-REACT


MAGIC
Finally, there is MAGIC (iMAge-guided text GeneratIon with CLIP). Its advantage is that it requires no multimodal training data: with only an off-the-shelf language model (such as GPT-2) and an image-text matching model (such as CLIP), it can complete multimodal generation tasks zero-shot with high quality.

Why can it work without multimodal training data? Because it uses visual information to guide the generation process of the pre-trained language model directly: as shown in the figure below, the visual features only participate in the language model's decoding process, i.e. the MAGIC Search decoding algorithm.
[Figure: MAGIC overview — CLIP-computed visual features steer the language model only at decoding time (MAGIC Search)]
Since the idea of MAGIC is to add a visual constraint while the LLM generates, so that the generated words stay close to the image, the most critical part is the token-selection rule of MAGIC Search, which consists of three terms:

$x_t = \arg\max_{v \in V^{(k)}} \big\{ (1-\alpha)\, p_\theta(v \mid x_{<t}) - \alpha \max_{1 \le j \le t-1} \cos(h_v, h_{x_j}) + \beta\, f_{\text{magic}}(v \mid x_{<t}, I, V^{(k)}) \big\}$

where $V^{(k)}$ is the top-$k$ candidate set from the LM and $I$ is the image:

  • Model confidence: the probability the LM assigns to the candidate word, i.e. the ordinary LM output.
  • Degeneration penalty: $h_v$ is the representation of the concatenation $[x_{<t} : v]$ and $h_{x_j}$ is the representation of the sequence $x_{<j+1}$; the maximum cosine similarity between them is subtracted, penalizing candidates that merely repeat the context and encouraging each generated word to bring new information.
  • Magic score: visual relevance, computed as a softmax over the CLIP scores between the image and each candidate continuation, i.e. the $f$ function. (A worked sketch of one decoding step follows this list.)
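
To make the decoding step concrete, here is a minimal sketch of selecting one token with these three terms; lm_topk() and clip_score() are hypothetical stand-ins for the language model and CLIP, and the alpha/beta weights are illustrative values, not the paper's settings.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)) + 1e-8)

def magic_step(prefix_tokens, prefix_hiddens, image, lm_topk, clip_score,
               alpha=0.1, beta=2.0):
    candidates = lm_topk(prefix_tokens)                 # [(token, probability, h_v), ...]
    # magic score: softmax over candidates of the CLIP score of image vs. extended text
    clip_logits = [clip_score(image, prefix_tokens + [tok]) for tok, _, _ in candidates]
    z = sum(math.exp(c) for c in clip_logits)
    magic = [math.exp(c) / z for c in clip_logits]

    best, best_score = None, -math.inf
    for (tok, prob, h_v), m in zip(candidates, magic):
        degeneration = max(cosine(h_v, h_x) for h_x in prefix_hiddens)  # similarity to context
        score = (1 - alpha) * prob - alpha * degeneration + beta * m
        if score > best_score:
            best, best_score = tok, score
    return best

# Toy usage with stand-in functions: "dog" wins on LM probability and a lower degeneration penalty.
print(magic_step(
    prefix_tokens=["a"], prefix_hiddens=[[1.0, 0.0]], image=None,
    lm_topk=lambda toks: [("dog", 0.6, [0.9, 0.1]), ("the", 0.4, [1.0, 0.0])],
    clip_score=lambda img, toks: float(len(toks)),
))
```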

paper:Language Models Can See: Plugging Visual Controls in Text Generation
arxiv:https://arxiv.org/abs/2205.02655
code:https://github.com/yxuansu/MAGIC


The next blog post will continue by sorting out papers that do train visual modules to adapt them to multimodal large models, which is also the current mainstream direction.

Origin blog.csdn.net/qq_39388410/article/details/130756614