DAMO Academy's open-source multimodal dialogue model mPLUG-Owl

The buzz around MiniGPT-4 has not yet died down, and less than half a month after the launch of LLaVA, yet another image-grounded chat model has arrived. The model introduced today, mPLUG-Owl, is a multimodal dialogue generation model in the same vein as MiniGPT-4 and LLaVA.

  • Paper link: https://arxiv.org/abs/2304.14178
  • Project link: https://github.com/X-PLUG/mPLUG-Owl
  • Online demo: https://modelscope.cn/studios/damo/mPLUG-Owl/summary

mPLUG-Owl demonstrates strong image-text understanding ability.
The author of this article also tried the online demo firsthand.

The contributions of this paper are as follows:

  • Proposes a new modular training paradigm for multimodal large models
  • Proposes the evaluation set OwlEval to test the ability of multimodal models on vision-related tasks
  • Open-sources the model code, demo code, and model weights so that researchers can conduct further research

mPLUG-Owl

Model architecture


This paper proposes mPLUG-Owl, whose overall architecture is shown in Figure 2. It consists of a visual foundation model $f_V$, a visual abstraction module $f_K$, and a pre-trained language model $f_L$. The visual abstraction module summarizes the long sequence of fine-grained image features into a small number of learnable tokens, enabling efficient modeling of visual information. The resulting visual tokens are fed into the language model together with the text query to generate the corresponding response.
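The visual abstraction module can be pictured as a small set of learnable query tokens that cross-attend to the dense patch features from the vision encoder. Below is a minimal PyTorch sketch of this idea, not the released implementation; the module name, the number of queries (64), and the feature dimensions (1024 for the visual encoder, 4096 for the language model) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualAbstractor(nn.Module):
    """Compress dense patch features into a few visual tokens (illustrative sketch)."""
    def __init__(self, num_queries=64, vis_dim=1024, lm_dim=4096, num_heads=8):
        super().__init__()
        # Learnable query tokens that summarize the image features.
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        # Project the summarized tokens into the language model's embedding space.
        self.proj = nn.Linear(vis_dim, lm_dim)

    def forward(self, patch_feats):              # patch_feats: (B, N_patches, vis_dim)
        b = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        summarized, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.proj(summarized)              # (B, num_queries, lm_dim)
```

The output visual tokens are then concatenated with the embedded text query and passed to the language model, which matches the information flow described for Figure 2.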

Training strategy


As shown in Figure 1, there are currently three main strategies for training end-to-end multimodal LLMs:

  1. Freeze both the vision module and the language module during pre-training and instruction fine-tuning, and tune only a limited set of parameters, as in MiniGPT-4.
  2. Freeze the vision module and train the language module, as in Kosmos-1.
  3. Freeze the vision module during instruction fine-tuning and train the language module, as in LLaVA.

However, all of these approaches keep the vision module's parameters frozen, which limits the alignment between modalities. They also lack joint training on unimodal and multimodal data, making it difficult to fully unlock the potential of large models.

To overcome these limitations, mPLUG-Owl adopts a different training strategy. First, it trains the vision module on multimodal data while keeping the language module frozen, so that visual features learn to align with linguistic features. Then, it jointly tunes the LoRA parameters of the language module on both multimodal and unimodal data while freezing the vision module. In this way, the model learns a wide range of unimodal and multimodal instructions and acquires the ability to hold both unimodal and multimodal multi-turn dialogues.
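The two-stage schedule can be summarized in code. The sketch below assumes a PyTorch model that exposes hypothetical `vision_encoder`, `visual_abstractor`, and `language_model` submodules and uses the `peft` library for LoRA; it illustrates the freezing pattern described above, not the official training script.

```python
from peft import LoraConfig, get_peft_model

def freeze(module, frozen=True):
    """Toggle requires_grad for every parameter of a submodule."""
    for p in module.parameters():
        p.requires_grad = not frozen

def configure_stage(model, stage):
    """Apply the two-stage freezing schedule (submodule names are assumptions)."""
    if stage == 1:
        # Stage 1: multimodal pre-training -- train the vision side so that
        # visual features align with the frozen language model's space.
        freeze(model.vision_encoder, frozen=False)
        freeze(model.visual_abstractor, frozen=False)
        freeze(model.language_model, frozen=True)
    else:
        # Stage 2: joint instruction tuning -- freeze the vision side and tune
        # only LoRA adapters of the LM on mixed unimodal + multimodal data.
        freeze(model.vision_encoder, frozen=True)
        freeze(model.visual_abstractor, frozen=True)
        model.language_model = get_peft_model(
            model.language_model,
            LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]),
        )
    return model
```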

Experiments

Quantitative analysis


As shown in Figure 3, the paper manually evaluates mPLUG-Owl on the constructed multimodal evaluation set OwlEval. Each response is assigned one of four grades, A through D, in descending order of generation quality. The results show that mPLUG-Owl achieves the best overall performance.

To examine performance on single-turn and multi-turn dialogue separately, the paper also extracts a set of single-turn and multi-turn dialogues from OwlEval for manual evaluation. As shown in Figure 4, mPLUG-Owl demonstrates strong multi-turn dialogue ability.
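For readers who want to reproduce this kind of comparison on the released OwlEval annotations, the aggregation itself is simple. The snippet below is a hypothetical illustration (the ratings are made up), showing how per-response A-D grades could be turned into the grade distributions plotted in Figures 3 and 4.

```python
from collections import Counter

def grade_distribution(ratings):
    """ratings: list of 'A'/'B'/'C'/'D' labels for one model's responses."""
    counts = Counter(ratings)
    total = len(ratings)
    return {grade: counts.get(grade, 0) / total for grade in "ABCD"}

# Dummy example, not real OwlEval data:
print(grade_distribution(["A", "A", "B", "C", "A", "B", "D", "A"]))
# {'A': 0.5, 'B': 0.25, 'C': 0.125, 'D': 0.125}
```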

Ablation experiments


To explore the impact of the training strategy and of the instruction data on model performance, the paper also reports ablation experiments, as shown in Table 2.

In addition, the paper observes an interesting phenomenon: learning from multimodal data can improve the model's text-only ability. As shown in Table 3, when ChatGPT is used to score the generated outputs, mPLUG-Owl, which tunes only LoRA parameters, beats a fully fine-tuned Alpaca in plain-text generation.
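The ChatGPT-based scoring mentioned here is an instance of the now-common LLM-as-judge protocol. Below is a hedged sketch of how such scoring could be run with the OpenAI Python client (openai>=1.0); the prompt wording and the 0-10 scale are assumptions for illustration and are not taken from the paper.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_response(instruction: str, response: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask an LLM judge to rate one response (illustrative prompt, not the paper's)."""
    prompt = (
        "Rate the following response to the instruction on a scale of 0-10 "
        "for helpfulness and correctness. Reply with a single number.\n\n"
        f"Instruction: {instruction}\nResponse: {response}"
    )
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content.strip()
```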

Qualitative analysis


Figure 6 shows that mPLUG-Owl has strong multi-turn dialogue ability.

Figure 7 shows that mPLUG-Owl also has strong reasoning ability.

Although mPLUG-Owl has strong image-text understanding ability, there is still a gap compared with GPT-4. As shown in Figure 8, mPLUG-Owl understands the joke correctly but mistakenly identifies the VGA connector as a USB connector.

Figure 9 shows some additional joke interpretation examples.

As shown in Figure 10, although no multi-image data was used during training, mPLUG-Owl exhibits a certain ability to relate multiple images.

As shown in Figure 11, although mPLUG-Owl was exposed only to English data during training, it exhibits interesting multilingual capability. This may be because its language module adopts LLaMA, which supports multiple languages.

Although mPLUG-Owl is not trained on annotated document data, it still demonstrates some text recognition and document understanding capability. The test results are shown in Figure 12.

As shown in Figures 13 and 14, mPLUG-Owl shows strong ability at open-ended multimodal continuation.
The paper also presents several more interesting examples.

More open source applications

Collections of the smart traffic team's models, papers, blog posts, and live broadcasts are available to browse, including the following models:

  • DamoFD face detection 0.5G
  • RetinaFace face detection and keypoint model
  • Face liveness detection model - IR
  • Face liveness detection model - RGB
  • FLCM face keypoint confidence model
  • Facial expression recognition model FER
  • Face attribute recognition model FairFace

Origin: blog.csdn.net/sunbaigui/article/details/130576294