LaVIN—Efficient Instruction Fine-tuning for Multimodal Dialogue Models

From: Smarter



Paper address: https://arxiv.org/pdf/2305.15023.pdf

Code address: https://github.com/luogen1996/LaVIN

Adapting large language models to multimodal instructions is usually expensive. BLIP-2 and MiniGPT-4 both require pre-training on a large number of image-text pairs, while LLaVA fine-tunes the entire language model. These approaches greatly increase the cost of multimodal adaptation and also tend to degrade the model's original text-only capability.

This paper proposes an efficient mixture-of-modality instruction fine-tuning scheme that quickly adapts large language models to both text-only and text+image instructions. Based on this scheme, the authors build new multimodal large models (LaVIN-7B, LaVIN-13B) with the following advantages:

  • Parameter efficiency (only 3~5M trainable parameters)

  • Training efficiency (fine-tuning on the multimodal science question answering dataset takes as little as 1.4 hours)

  • Strong performance (nearly six points higher than LLaMA-Adapter)

  • Support for both text-only and text+image instruction dialogue

[Figure: overview of the LaVIN architecture]

Network Structure and Training

As shown in the figure above, LaVIN is built on LLaMA, and the overall structure is very simple.

  • End-to-end jointly optimized architecture. The CLIP backbone is connected directly to LLaMA without any other complicated design. CLIP and the LLM are both kept completely frozen and are trained only through added adapters; inserting adapters into CLIP as well allows the whole model to be optimized end to end. Compared with LLaVA, this end-to-end optimization removes the separate pre-training stage for aligning CLIP with the LLM.

  • Dynamic multimodal inference. Inside the language model, the paper designs a new module called the Mixture-of-Modality Adapter. This module switches the adapter's inference path according to the modality of the input instruction, which decouples the two modalities during training. Simply put, when a text-only instruction is given, the model adapts through one set of adapter paths; when the input is an image+text instruction, it switches to another set of adapter paths (see the sketch after this list).

  • Multimodal hybrid training. During training, LaVIN simply mixes text-only data and image-text data and packs them into the same batches; beyond that there is no extra optimization stage or other complicated design.
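
To make the three points above concrete, here is a minimal PyTorch-style sketch, not the authors' implementation (the official repo has that). The MMAdapter class, its bottleneck width, the freeze_backbones helper, and the has_image flag are all hypothetical names chosen for illustration; the routing here is a hard per-sample switch on the modality flag, which is one way to realize the path switching the paper describes.

```python
import torch
import torch.nn as nn


class MMAdapter(nn.Module):
    """Two lightweight adapter paths plus a per-sample modality switch: a rough
    sketch of the Mixture-of-Modality Adapter idea (not the paper's exact code)."""

    def __init__(self, dim: int, bottleneck: int = 8):
        super().__init__()
        # One down-project / activation / up-project path per modality.
        self.text_adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.SiLU(), nn.Linear(bottleneck, dim)
        )
        self.mm_adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.SiLU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x: torch.Tensor, has_image: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) hidden states from a frozen transformer block.
        # has_image: (batch,) bool flags marking which samples carry an image.
        route = has_image.view(-1, 1, 1).float()
        delta = route * self.mm_adapter(x) + (1.0 - route) * self.text_adapter(x)
        return x + delta  # residual update; the backbone weights stay frozen


def freeze_backbones(clip_encoder: nn.Module, llama: nn.Module) -> None:
    """Freeze CLIP and LLaMA so that only the inserted adapters receive gradients."""
    for module in (clip_encoder, llama):
        for p in module.parameters():
            p.requires_grad = False


# A mixed batch: text-only and image+text samples packed together, routed per sample.
adapter = MMAdapter(dim=4096)
hidden = torch.randn(4, 16, 4096)                      # (batch, seq_len, dim)
has_image = torch.tensor([True, False, True, False])   # mixed-modality batch
out = adapter(hidden, has_image)                       # same shape as hidden
```

Because the routing is decided per sample, text-only and image+text instructions can sit in the same batch, which is exactly what the hybrid training recipe relies on.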

Although LaVIN's design and training are very simple, it benefits from joint optimization of the whole model, dynamic inference, and mixed-modality training, and its actual performance is comparable to LLaVA's. LaVIN reaches 90.8 on multimodal science question answering, nearly six points higher than LLaMA-Adapter and very close to LLaVA (90.9). After fine-tuning on about 200k GPT-3/GPT-4 instruction samples, LaVIN can hold high-quality text-only and image-grounded dialogues. Moreover, this adapter-based paradigm still has plenty of room for optimization: the training in this paper uses almost no speed or memory optimizations, and adding quantization-based training strategies such as QLoRA could reduce LaVIN's training cost by another order of magnitude.


Origin blog.csdn.net/qq_27590277/article/details/130960077