Multimodal transition state - latent modality

background:

With the advancement of large models, single-modal large models can no longer meet the needs of real-world work, and many research teams and institutions have started multimodal research. Several multimodal models were introduced in previous articles, so they will not be covered again here. The ideal multimodality would be modality-free: a single model accepts every kind of input data without distinction and can generate whatever output is desired according to the control requirements. In other words, one model aligns all the relationships between modalities. To achieve such a goal, there are at least three architectures:

1. Lay the inputs/outputs of the various modalities out side by side, design a sub-block for each modality, and through task design let one kind of input data predict the value of the output data, so that with enough training data the model learns the correspondences between modalities

2. Use only one kind of modal data as input/output at a time during training, and let the model learn the correspondences between modalities purely through training

3. Map the various modalities into a unified space, and establish the mapping relationships between modalities through alignment in that unified space

The first two architectures are one-stage training methods, meaning that data from the various modalities does not need to be mapped into a latent space first; information of different modalities and granularities can be aligned directly. In theory, given enough data, one-stage models have the stronger expressive power. But because of the difficulty of data collection and processing, the training cost, and the inference compute required by downstream tasks, the current mainstream multimodal approach is a two-stage alignment training architecture, namely:

1. First map each modality's data into its own space separately

2. Then design multiple modal data alignment tasks to align the modalities

However, within modality alignment, the task design can be divided at a finer granularity according to: whether the data sequence is considered, the difficulty of the training tasks, and the generality of the training tasks. The expressiveness of the resulting model also varies greatly under different task designs.

The three models introduced below, CLIP, BLIP-2, and MiniGPT-4, differ as follows:

1. CLIP: the training data basically does not consider the order between samples, and the task design is image-text similarity computation

2. BLIP-2: the training data basically does not consider the order; task design: similarity computation, modality-fusion tasks, VQA, and image-to-text generation

3. MiniGPT-4: considers the data sequence (multi-turn dialogue about a single image) and generates multi-turn dialogue about the image

Transition models:

CLIP

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on (image, text) pairs. Given an image, it can be instructed in natural language to predict the most relevant text snippet, and it has zero-shot prediction capabilities similar to GPT-2 and GPT-3. CLIP matches the original ResNet-50 on zero-shot ImageNet classification without using any of the 1.28 million labeled examples that ResNet-50 was trained on.

The training tasks of CLIP include: 1. image-text pair similarity computation; 2. given an image embedding, predicting the object category in the image through text prompts. Training data: 400M image-text pairs collected by web crawling under weak supervision (presumably with multiple rounds of bootstrapping), built from roughly 500,000 internet queries with up to 20,000 (image, text) pairs recalled per query. The image embedding was extracted with two architectures, ResNet and ViT; the text embedding uses a 63M-parameter, 12-layer, 512-wide Transformer with 8 attention heads and masked self-attention. The sheer amount of data and the conversion of image classification into text prompts greatly improved zero-shot performance and image-text alignment, and the representation ability improved substantially. (Task design and data selection appear to be the important factors.)
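As a rough illustration of the two task types above, here is a minimal sketch of CLIP-style contrastive training and prompt-based zero-shot classification, assuming the image/text embeddings are already produced by the ResNet/ViT and Transformer encoders; this is a sketch, not the official CLIP implementation.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss and prompt-based
# zero-shot classification. Embeddings below are random stand-ins for the
# outputs of the (assumed) image and text encoders described above.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (N, N) pairwise similarities
    targets = torch.arange(logits.size(0))              # matching pairs lie on the diagonal
    loss_i = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return (loss_i + loss_t) / 2

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose prompt embedding (e.g. 'a photo of a {label}') is most similar."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    return (image_emb @ class_text_embs.t()).argmax(dim=-1)

# toy usage with random embeddings standing in for encoder outputs
imgs, txts = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(imgs, txts).item())
print(zero_shot_classify(imgs, torch.randn(10, 512)))
```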

BLIP-2

With end-to-end training of large-scale models, vision-and-language pre-training has become increasingly costly. To address the expense of training large models, BLIP-2 proposes a general and efficient pre-training strategy that bootstraps vision-language pre-training from an existing frozen pre-trained image encoder and a frozen large language model. BLIP-2 bridges the modality gap with a lightweight Querying Transformer (Q-Former), which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from the frozen image encoder. The second stage bootstraps vision-to-language generative learning from the frozen language model. Despite having far fewer trainable parameters than existing methods, BLIP-2 achieves state-of-the-art performance on a variety of vision-language tasks.

Overview of the BLIP-2 framework: a two-stage strategy pre-trains a lightweight Querying Transformer to bridge the gap between the two modalities. The first stage bootstraps vision-language representation learning from the frozen image encoder. The second stage bootstraps vision-to-language generative learning from the frozen LLM, which enables zero-shot image-to-text generation.

With the vision pre-trained model frozen, a Q-Former is trained through three tasks to encode the semantics of the image input into a feature space close to the text feature space. Specifically, the model extracts features from images using K learnable query embeddings and a cross-attention mechanism. The three tasks, combined into one training loss as sketched after the list below, are:

1. Image-text matching: classify input (image, text) pairs and decide whether they are related

2. Image-grounded text generation: given an image input, generate the corresponding text description

3. Image-text contrastive learning: pull image features closer to their corresponding text features and push them away from irrelevant text features
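A minimal sketch of how these three objectives could be combined into a single stage-1 loss; the feature shapes, the ITM head, and the 0.07 temperature are illustrative assumptions, not the actual BLIP-2 implementation.

```python
# Sketch of the three Q-Former objectives: image-text matching (ITM),
# image-grounded text generation (ITG), and image-text contrastive (ITC).
import torch
import torch.nn as nn
import torch.nn.functional as F

itm_head = nn.Linear(768, 2)  # binary matched / not-matched classifier (assumed head)

def stage1_loss(img_feats, txt_feats, fused_feats, match_labels, lm_logits, caption_ids):
    # 1. image-text matching: classify fused (image, text) pair features as related or not
    loss_itm = F.cross_entropy(itm_head(fused_feats), match_labels)
    # 2. image-grounded text generation: next-token prediction of the caption
    loss_itg = F.cross_entropy(lm_logits[:-1], caption_ids[1:])
    # 3. image-text contrastive: pull matching pairs together, push others apart
    sims = F.normalize(img_feats, dim=-1) @ F.normalize(txt_feats, dim=-1).t() / 0.07
    targets = torch.arange(sims.size(0))
    loss_itc = (F.cross_entropy(sims, targets) + F.cross_entropy(sims.t(), targets)) / 2
    return loss_itm + loss_itg + loss_itc

# toy usage with random stand-ins for Q-Former outputs
loss = stage1_loss(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768),
                   torch.randint(0, 2, (4,)), torch.randn(9, 32000), torch.randint(0, 32000, (9,)))
print(loss)
```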

The output of the Q-Former is fed into a frozen large language model through a fully connected layer, and the visual features that have already been roughly aligned with text features are further encoded, via the image-grounded text generation task, into an input the large language model can understand.
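The structural idea can be sketched as follows: K learnable queries cross-attend to the frozen image features, and a fully connected layer projects the result into the frozen LLM's input space. The layer sizes and module choices here are assumptions for illustration, not the real Q-Former (which is a BERT-style module with a text branch).

```python
# Minimal sketch of the Q-Former idea: learnable queries + cross-attention to
# frozen image features + a linear projection into the frozen LLM's embedding space.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=768, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # K learnable query embeddings
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.self_block = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.to_llm = nn.Linear(dim, llm_dim)                        # FC layer into the frozen LLM

    def forward(self, image_feats):                                  # image_feats: (B, num_patches, dim)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        q, _ = self.cross_attn(q, image_feats, image_feats)          # queries attend to frozen image features
        q = self.self_block(q)                                       # queries interact with each other
        return self.to_llm(q)                                        # (B, K, llm_dim) soft visual tokens

frozen_image_feats = torch.randn(2, 257, 768)                        # e.g. ViT patch features (kept frozen)
soft_tokens = TinyQFormer()(frozen_image_feats)
print(soft_tokens.shape)                                             # torch.Size([2, 32, 4096])
```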

advantage:

By freezing the unimodal pre-trained models, BLIP-2 greatly reduces the compute and data resources required for pre-training.

With only 129M image-text pairs, a single 16-A100 (40G) machine can train the largest BLIP-2 in less than 9 days.

By freezing the parameters of the large language model, BLIP-2 retains the instruction-following ability of the LLM.

shortcoming:

The model does not have multimodal In-Context-Learning capabilities

MiniGPT-4

BLIP-2 shows a two-stage training method: the first stage learns an image-text representation, and the second stage uses the image embedding as a soft prompt, a prefix that conditions text generation. This goes one step beyond CLIP, which only learns image and text representations, and better aligns image and text in the mapped space by guiding text generation through the generated prompt.

The goal of MiniGPT-4 is to align a well-trained visual encoder with an advanced large language model (LLM). It uses Vicuna as the language decoder, which is built on LLaMA and capable of performing a wide range of complex language tasks. For visual perception, it uses the same visual encoder as BLIP-2 together with its pre-trained Q-Former. Both the language and vision models are open source. A single linear projection layer is used to align the visual encoder and the LLM through a designed dialogue template, and a two-stage training method is proposed.

1. The initial stage involves pre-training the model on a large number of aligned image-text pairs to acquire vision-language knowledge.

2. In the second stage, a smaller but high-quality image-text dataset is used, and the pre-trained model is fine-tuned with a designed dialogue template to improve the reliability and usability of its generations.

The first pre-training phase
In the first pre-training phase, the task is designed so that the model acquires vision-language knowledge from aligned image-text pairs. The output of the injected projection layer is used as a soft prompt for the LLM, prompting it to generate the corresponding ground-truth text.
Throughout pre-training, both the pre-trained visual encoder and the LLM are kept frozen, and only the linear projection layer is trained. The publicly available Conceptual Captions, SBU, and LAION datasets are used to train the model. Training runs for 20,000 steps with a batch size of 256, covering about 5 million image-text pairs. Using 4 A100 (80GB) GPUs, the whole process takes about 10 hours. The task turns the image embedding into a soft prompt for text generation:

<Img><ImageFeature></Img> Describe this image in detail. Give as many details as possible. Say everything you see. ###Assistant:
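The soft-prompt construction can be sketched as follows: the projected image features are spliced between the token embeddings of the surrounding prompt text, and only the linear projection layer would receive gradients. The tokenizer/embedding calls are stand-ins, not the MiniGPT-4 code.

```python
# Sketch of injecting projected image features as a soft prompt for the frozen LLM.
import torch
import torch.nn as nn

llm_dim = 4096
project = nn.Linear(768, llm_dim)                     # the only trainable module in stage 1
embed = nn.Embedding(32000, llm_dim)                  # stand-in for the frozen LLM embedding table

def build_soft_prompt(image_feats, prefix_ids, suffix_ids):
    """Return LLM input embeddings: <prefix tokens> <projected image tokens> <suffix tokens>."""
    img_tokens = project(image_feats)                 # (K, llm_dim) soft visual tokens
    prefix = embed(prefix_ids)                        # embeddings of "<Img>" etc.
    suffix = embed(suffix_ids)                        # embeddings of "</Img> Describe this image ..."
    return torch.cat([prefix, img_tokens, suffix], dim=0)

inputs = build_soft_prompt(torch.randn(32, 768),
                           torch.tensor([1, 2, 3]),
                           torch.tensor([4, 5, 6, 7]))
print(inputs.shape)                                   # torch.Size([39, 4096])
```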

After the first pre-training phase, MiniGPT-4 shows that it possesses rich knowledge and can provide reasonable answers to human questions. But it is observed that it sometimes struggles to generate coherent language output, such as repeated words or sentences, broken sentences, or irrelevant content. These issues hamper MiniGPT-4's ability to hold a fluent visual conversation with humans.
Similar problems also existed in GPT-3. Although GPT-3 was pre-trained on a wide range of language datasets, it could not directly generate language output matching user intent. Through instruction fine-tuning and reinforcement learning from human feedback, GPT-3 gradually evolved into GPT-3.5, capable of generating more human-like output. This is analogous to the state of MiniGPT-4 after the first pre-training stage: at this point the model may struggle to generate fluent and natural language output.

To enhance the naturalness of the generated language and improve the usability of the model, a second alignment stage is required. While instruction-tuning and dialogue datasets are readily available in the NLP domain, no comparable datasets exist in the vision-language domain. To address this, a high-quality image-text dataset has to be carefully built for alignment; this dataset is then used in the second alignment stage to fine-tune MiniGPT-4.
Initial aligned image-description generation. In the initial step, a detailed description of a given image is generated with the model obtained from the first pre-training stage. To make the model produce more detailed image descriptions, a prompt following the Vicuna dialogue format is designed, as follows:
###Human: <Img><ImageFeature></Img> Describe this image in detail. Give as many details as possible. Say everything you see. ###Assistant:
In this prompt, <ImageFeature> represents the visual features produced by the linear projection layer.

To identify incomplete sentences, check whether the generated description exceeds 80 words. If it does not, another prompt, ###Human: Continue ###Assistant: , is appended to push MiniGPT-4 to continue generating. By concatenating the outputs of the two steps, a more comprehensive image description is created, and the model produces more detailed and informative descriptions. 5,000 images are randomly selected from the Conceptual Captions dataset, and a corresponding language description is generated for each image with this method.
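A minimal sketch of this two-step description generation, where `generate` stands in for whatever decoding call the model exposes; it is an assumption, not an actual MiniGPT-4 API.

```python
# Two-step detailed captioning: ask for a description, and if it looks truncated,
# append a "Continue" turn and concatenate the two outputs.
DESCRIBE = ("###Human: <Img><ImageFeature></Img> Describe this image in detail. "
            "Give as many details as possible. Say everything you see. ###Assistant:")
CONTINUE = "###Human: Continue ###Assistant:"

def detailed_caption(generate, image_feature, min_words=80):
    first = generate(DESCRIBE, image_feature)
    if len(first.split()) >= min_words:               # long enough: assume the description is complete
        return first
    second = generate(DESCRIBE + first + CONTINUE, image_feature)
    return (first + " " + second).strip()             # combine both steps into one description

# toy usage with a fake generator
print(detailed_caption(lambda prompt, img: "a short caption", image_feature=None))
```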
The image descriptions produced by this pipeline still contain plenty of noise and errors, such as repeated words or sentences and sentences with incoherent statements. To alleviate these problems, ChatGPT is used to refine the descriptions with follow-up prompts.
Post-processing. After the automatic refinement, each image description is manually checked for correctness to ensure its quality. Specifically, each generated description is checked against the desired format, and the captions are manually polished by removing repeated words or sentences that ChatGPT failed to detect. In the end only about 3,500 image-text pairs meet the requirements, and these pairs are then used in the second-stage alignment process.

The second-stage fine-tuning
In the second stage, the curated high-quality image-text pairs are used to fine-tune the model. During fine-tuning, predefined prompts from the following template are used:
###Human: <Img><ImageFeature></Img> <Instruction> ###Assistant:
In this prompt, <Instruction> represents an instruction randomly sampled from a predefined instruction set containing variations such as "Describe this image in detail" or "Could you describe the contents of this image for me". It is important to point out that no regression loss is computed for this image-text prompt itself.
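The loss masking described here can be sketched as follows: tokens belonging to the image-text prompt are labeled with -100 so the language-modeling loss is only computed on the answer tokens. Token ids are toy values; this is not the MiniGPT-4 code.

```python
# Mask the prompt portion out of the next-token loss; only answer tokens are supervised.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # ignored by F.cross_entropy

def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer ids; mask the prompt portion out of the loss."""
    input_ids = torch.cat([prompt_ids, answer_ids])
    labels = torch.cat([torch.full_like(prompt_ids, IGNORE_INDEX), answer_ids])
    return input_ids, labels

def lm_loss(logits, labels):
    """Standard next-token loss that skips positions labeled IGNORE_INDEX."""
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)

input_ids, labels = build_labels(torch.tensor([11, 12, 13]), torch.tensor([21, 22]))
logits = torch.randn(input_ids.size(0), 32000)        # stand-in for LLM output logits
print(lm_loss(logits, labels))
```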
As a result, MiniGPT-4 is able to generate more natural and reliable responses. Moreover, this fine-tuning process is very efficient: it requires only 400 training steps with a batch size of 12 and completes in about 7 minutes on a single A100 GPU.

advantage:

1. We show that emerging visual-language capabilities can be achieved by aligning a frozen visual encoder with the state-of-the-art large-scale language model Vicuna.
2. By using a pre-trained visual encoder and a large language model, MiniGPT-4 achieves higher computational efficiency. The findings show that training only one projection layer can effectively align visual features with large language models. MiniGPT-4 only needs about 10 hours of training on 4 A100 GPUs.
3. Aligning visual features with a large language model using only raw image-text pairs from publicly available datasets does not yield a well-performing MiniGPT-4 model; such a model can produce unnatural language output that lacks coherence, including repetition and broken sentences. Addressing this requires a high-quality, well-aligned dataset, which significantly improves usability.

summary:

1. At present, large multimodal image-text models are still dominated by image-text alignment and text generation; it is still not possible to generate images and text together

2. From CLIP to BLIP-2 to MiniGPT-4, a unified input format and unified task design have become a trend

3. MiniGPT-4 also does not realize true in-context image generation, but the learning tasks of the first and second stages are the same: image and text inputs go through the same softmax generation head, unifying the representation and generation problems into a single generation task; the second-stage data keeps the same format but with longer sequences, plus the requirement of high-quality, human-screened data for alignment

4. Judging from the trend in image-text multimodality, the task design should cover input: instruction, image, text, content ==> output: image, text, content, annotation; only in this way can truly in-context image-text multimodality be achieved, solving in-context image-text tasks and introducing the time-series variables of the human physical world

5. Whether latent-modality learning will be fused with raw text and pixel-granularity data, or will gain access to finer-grained image and text features, to solve the controllability of fine-grained image-text generation is a direction worth watching (of course, doing so places very high demands on training techniques and training compute)
