MiniGPT-4, open source

Introduction

MiniGPT-4 aims to align visual information from a pretrained vision encoder with an advanced large language model (LLM). Specifically, on the language side the authors use Vicuna as the language decoder, and on the visual side they use the same vision encoder as BLIP-2; both the language and vision models are open source. The core idea of the work is to bridge the gap between the vision encoder and the LLM with a single linear projection layer. The model architecture is shown below:

[Figure: MiniGPT-4 architecture, with a frozen vision encoder connected to the frozen Vicuna LLM through a single linear projection layer]
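
To make the role of the linear projection layer concrete, here is a minimal PyTorch sketch (not the official implementation) of a single layer that maps frozen visual tokens into the LLM's embedding space. The dimensions (768-dim query tokens from the BLIP-2 Q-Former, 5120-dim hidden size for Vicuna-13B) are illustrative assumptions.

    import torch
    import torch.nn as nn

    class VisionToLLMProjector(nn.Module):
        # Single linear layer bridging the frozen vision encoder and the frozen LLM;
        # in this sketch it is the only trainable component.
        def __init__(self, vision_dim: int = 768, llm_dim: int = 5120):
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)

        def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
            # vision_tokens: (batch, num_tokens, vision_dim) from the frozen encoder.
            # Returns (batch, num_tokens, llm_dim) tokens that are fed to the LLM
            # alongside the text embeddings.
            return self.proj(vision_tokens)

    # Example: 32 query tokens projected into the LLM embedding space.
    features = torch.randn(1, 32, 768)
    print(VisionToLLMProjector()(features).shape)  # torch.Size([1, 32, 5120])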
Key characteristics:

  • MiniGPT-4 aligns the frozen vision encoder from BLIP-2 with the frozen LLM, Vicuna, using only one projection layer.
  • We train MiniGPT-4 in two stages. The first, traditional pre-training stage uses roughly 5 million aligned image-text pairs and takes about 10 hours on 4 A100s. After this stage, Vicuna is able to understand images, but its generation ability is severely degraded.
  • To address this issue and improve usability, we propose a novel way to create high-quality image-text pairs using the model itself together with ChatGPT. On this basis, we build a small (about 3,500 pairs in total) but high-quality dataset.
  • The second fine-tuning stage trains the model on this dataset with a dialogue template (see the sketch after this list), which significantly improves its generation reliability and overall usability. Surprisingly, this stage is computationally efficient, taking only around 7 minutes on a single A100.
  • MiniGPT-4 exhibits many emergent vision-language capabilities similar to those demonstrated in GPT-4.
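
As a rough illustration of the second-stage dialogue format mentioned above, the sketch below wraps an image-description pair in a conversational prompt. The template strings and the <ImageHere> placeholder are assumptions for illustration and may differ from the exact format used in the repository.

    # Hedged sketch: turn an image-description pair into a dialogue-style
    # training prompt for the second fine-tuning stage.
    def build_dialogue_prompt(description: str) -> str:
        # The projected image tokens are injected where <ImageHere> appears.
        human_turn = "###Human: <Img><ImageHere></Img> Describe this image in detail."
        assistant_turn = f"###Assistant: {description}"
        return f"{human_turn} {assistant_turn}"

    print(build_dialogue_prompt("A corgi rides a skateboard through a sunny park."))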

Project repository: https://github.com/Vision-CAIR/MiniGPT-4#online-demo
Project page / online demo: https://minigpt-4.github.io/

Quick start

  1. Prepare code and environment

git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd MiniGPT-4
conda env create -f environment.yml
conda activate minigpt4

  2. Prepare the pretrained Vicuna weights

The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B. Please refer to the Vicuna instructions to prepare the weights. The final weights should sit in a single folder with the following structure:

vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
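
As an optional sanity check (not part of the official instructions), the merged weights folder can be loaded with Hugging Face transformers to confirm it is complete:

    # Illustrative snippet: verify the merged Vicuna weights load correctly.
    # Assumes transformers is installed and the folder is named vicuna_weights.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("vicuna_weights", use_fast=False)
    model = AutoModelForCausalLM.from_pretrained("vicuna_weights", torch_dtype=torch.float16)
    print(model.config.hidden_size)  # 5120 for the 13B model
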
  3. Start the demo locally

python demo.py --cfg-path eval_configs/minigpt4_eval.yaml

Training

The training of MiniGPT-4 consists of two alignment stages.

  1. In the first pre-training stage, the model is trained on image-text pairs from the Laion and CC datasets to align the vision and language models. To download and prepare the datasets, please check the first-stage dataset preparation instructions: https://github.com/Vision-CAIR/MiniGPT-4/blob/main/dataset/README_1_STAGE.md . After the first stage, the visual features are mapped into a space the language model can understand. To launch the first stage of training, run the command below. In our experiments, we use 4 A100s. You can change the save path in the configuration file train_configs/minigpt4_stage1_pretrain.yaml .

    torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml

  2. In the second stage, we use a small, self-curated, high-quality image-text pair dataset converted into a dialogue format to further align MiniGPT-4. To download and prepare the Stage 2 dataset, check the Stage 2 dataset preparation instructions: https://github.com/Vision-CAIR/MiniGPT-4/blob/main/dataset/README_2_STAGE.md . To launch the second-stage alignment, first specify the path to the Stage 1 checkpoint in train_configs/minigpt4_stage2_finetune.yaml ; you can also specify the output path there. Then run the following command. In our experiments, we use a single A100.
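
Mirroring the Stage 1 launch but pointing at the Stage 2 configuration file (the exact invocation may differ slightly in the official repository):

    torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml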

Experimental results

[Example results: screenshots of MiniGPT-4 image-grounded conversations]

Source: blog.csdn.net/qq_45066628/article/details/130231186