The popularity of MiniGPT-4 has not yet faded, and less than half a month after the launch of LLaVA, yet another image-chat model has arrived. The model introduced today is a multimodal dialogue generation model in the same vein as MiniGPT-4 and LLaVA, and its name is mPLUG-Owl.
- Paper link: https://arxiv.org/abs/2304.14178
- Project link: https://github.com/X-PLUG/mPLUG-Owl
- Online demo: https://modelscope.cn/studios/damo/mPLUG-Owl/summary
mPLUG-Owl demonstrates strong image-text understanding ability.
The following are trial results from the author of this article:
The contributions of this paper are as follows:
- Proposes a new modular training paradigm for multimodal large models
- Proposes the evaluation set OwlEval to test the ability of multimodal models on vision-related tasks
- Open-sources the model code, demo code, and model weights to facilitate further research
mPLUG-Owl
model architecture
This paper proposes mPLUG-Owl, whose overall architecture is shown in Figure 2. It consists of a visual foundation model f_V, a visual abstraction module f_K, and a pre-trained language model f_L. The visual abstraction module summarizes long, fine-grained image features into a small number of learnable tokens, thereby achieving efficient modeling of visual information. The generated visual tokens are fed into the language model along with text queries to generate the corresponding responses.
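The abstraction step described above can be sketched as a cross-attention between a small set of learnable query tokens and the long image feature sequence. The NumPy snippet below is a minimal illustration of this idea only; the function name, dimensions, and single-head attention are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def visual_abstractor(image_feats, query_tokens):
    """Summarize a long sequence of image features into a few tokens.

    image_feats:  (n_patches, d) fine-grained features from the vision encoder
    query_tokens: (n_queries, d) learnable tokens, with n_queries << n_patches
    Returns (n_queries, d) visual tokens to feed into the language model.
    """
    d = image_feats.shape[-1]
    # Cross-attention: each query attends over all image patches.
    scores = query_tokens @ image_feats.T / np.sqrt(d)  # (n_queries, n_patches)
    weights = softmax(scores, axis=-1)
    return weights @ image_feats                        # (n_queries, d)

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 64))   # e.g. 256 patch features from a ViT
queries = rng.normal(size=(8, 64))   # 8 learnable tokens (illustrative count)
tokens = visual_abstractor(feats, queries)
print(tokens.shape)  # (8, 64)
```

The key point is the shape reduction: 256 patch features are compressed to 8 tokens, so the language model's sequence length grows only slightly per image.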
training strategy
As shown in Figure 1, there are currently three main training strategies to train end-to-end multimodal LLM models. These strategies are:
- Freeze both the vision module and the language module during pre-training and instruction fine-tuning, tuning only a limited number of parameters, as in MiniGPT-4.
- Freeze the vision module and train the language module, as in Kosmos-1.
- Freeze the vision module during the instruction fine-tuning stage and train the language module, as in LLaVA.
However, these models all freeze the parameter tuning of the vision module, thus limiting the alignment between different modalities. In addition, they lack the joint training of unimodal and multimodal data, making it difficult to effectively stimulate the various potentials of large models.
To overcome these limitations, mPLUG-Owl employs a different training strategy. First, it trains the vision module on multimodal data while freezing the language module, which aligns visual features with linguistic features. Then, it jointly tunes the LoRA parameters of the language module on both multimodal and unimodal data while freezing the vision module. In this way, the model learns a variety of unimodal and multimodal instructions and acquires the ability to hold both unimodal and multimodal multi-turn dialogues.
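The two-stage schedule above amounts to toggling which parameter groups receive gradient updates. The following is a toy sketch under stated assumptions: the parameter-group names and dictionary-based "model" are invented for illustration and do not come from the mPLUG-Owl codebase.

```python
# Toy sketch of mPLUG-Owl's two-stage freezing schedule.
# Group names are illustrative assumptions, not the real codebase.
model = {
    "vision_encoder":    {"trainable": False},
    "visual_abstractor": {"trainable": False},
    "language_model":    {"trainable": False},
    "language_lora":     {"trainable": False},
}

def set_stage(model, stage):
    """Stage 1: train the vision modules, freeze the language model.
    Stage 2: freeze the vision modules, tune only the LoRA adapters."""
    if stage == 1:
        trainable = {"vision_encoder", "visual_abstractor"}
    elif stage == 2:
        trainable = {"language_lora"}
    else:
        raise ValueError(f"unknown stage: {stage}")
    for name, group in model.items():
        group["trainable"] = name in trainable
    return model

set_stage(model, 1)
print([k for k, v in model.items() if v["trainable"]])
# ['vision_encoder', 'visual_abstractor']
set_stage(model, 2)
print([k for k, v in model.items() if v["trainable"]])
# ['language_lora']
```

In a real framework such as PyTorch, the same effect is achieved by setting `requires_grad` on each parameter group before building the optimizer.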
experiment
quantitative analysis
As shown in Figure 3, this paper manually evaluates mPLUG-Owl on the constructed multimodal evaluation set OwlEval. The evaluation results are divided into four grades, A to D, representing generation quality in descending order. The evaluation shows that mPLUG-Owl achieves the best results.
To explore the performance of mPLUG-Owl on single-turn and multi-turn dialogues, this paper also samples some single-turn and multi-turn dialogues from OwlEval for manual evaluation. The results, shown in Figure 4, indicate that mPLUG-Owl has strong multi-turn dialogue ability.
Ablation experiment
In order to explore the impact of the training strategy and the use of instruction data on the model results, this paper also shows the results of the ablation experiments, as shown in Table 2.
In addition, this paper reports an interesting phenomenon: learning from multimodal data can improve the model's unimodal text ability. As shown in Table 3, using ChatGPT to score the generated results, mPLUG-Owl, which tunes only LoRA parameters, outperforms fully fine-tuned Alpaca in plain-text generation.
qualitative analysis
It can be seen from Figure 6 that mPLUG-Owl has a strong multi-round dialogue ability.
It can be found from Figure 7 that mPLUG-Owl also has a strong reasoning ability.
Although mPLUG-Owl has strong image-text understanding ability, there is still some gap compared with GPT-4. As shown in Figure 8, although mPLUG-Owl correctly understood the joke, it mistakenly identified the VGA plug as a USB plug.
Figure 9 shows some additional joke interpretation examples.
As shown in Figure 10, although no multi-image correlation data was used during training, mPLUG-Owl exhibits certain multi-image correlation capabilities.
As shown in Figure 11, although mPLUG-Owl was exposed only to English data during training, it exhibits interesting multilingual capabilities. This may be because the language module of mPLUG-Owl is based on LLaMA, which supports multiple languages.
Although mPLUG-Owl is not trained on document data with annotations, it still demonstrates certain text recognition and document understanding capabilities. The test results are shown in Figure 12.
As shown in Figures 13 and 14, mPLUG-Owl shows strong ability in multimodal open-ended continuation.
Here are some more interesting examples: