CV multimodal principle analysis in the AIGC era: from CLIP/BLIP to Stable Diffusion/Midjourney and GPT4

Foreword

I have finally come to the core topic of this CV multimodal series: Stable Diffusion. Three things made me determined to write about it:

  1. When Stable Diffusion and Midjourney became hugely popular last year, I already wanted to write about them, since they kept appearing in my feeds, but the timing never worked out
  2. After ChatGPT came out at the end of last November, I started writing about the technical principles behind ChatGPT in early January this year. In February, a reader nicknamed "Heaven's Proud Son" left a comment under my ChatGPT principles article: "Nice. I read your SVM notes ten years ago, but it feels like you haven't written for many years. You could also write about the recent AI painting model Stable Diffusion and its related sampling acceleration algorithms." I replied at the time: "Ha, ten years ago... welcome back, and thank you, old readers and old friends."

     Indeed, many readers have read my SVM notes, which have been hugely influential, but after those notes I kept writing many new blog posts, including but not limited to: XGBoost, CNN, RNN, LSTM, BERT, and so on.

     Going forward there will be an update plan every quarter, so you are welcome to visit often.
     As for Stable Diffusion, you can start with the illustrated guide to Stable Diffusion (that article is also one of the important references for this one).
  3. In mid-March this year, when OpenAI announced that GPT-4 has CV multimodal capabilities, I had an even stronger motivation to research AI painting and CV multimodality and to write up the technical details behind them.
     I wanted to write it then, but at the time I was busy writing about the principles, deployment, and fine-tuning of various open-source ChatGPT alternatives, so there was no time; the 100-paper reading plan I had set earlier was also delayed

On April 23, after finishing the ChatGPT principles course I had been teaching, I finally had time to write this multimodal blog post. However, to explain the technical details behind Stable Diffusion and Midjourney clearly, I had to start from the diffusion model, hence the earlier article "The Origin of AI Drawing Ability: From VAE, Diffusion Model DDPM, DETR to ViT/MAE/Swin Transformer"

To quote that earlier article: "With the launch of Stable Diffusion and Midjourney last year, AI painting made text-to-image generation extremely popular. Character design in all kinds of games and product/page design for online stores now use AI painting tools, and many friends have earned a good income with AI painting, saving time and effort while making money."

Following on from the above, this article will clearly cover the models listed in the timeline below

Year    Models covered (in rough chronological order of release)
2020    DETR, DDPM, Vision Transformer
2021    CLIP, DALL·E, Swin Transformer, MAE, Swin Transformer V2
2022    BLIP, DALL·E 2, Stable Diffusion, BEiT-3, Midjourney V3
2023    BLIP2, Visual ChatGPT, GPT4, Midjourney V5, SAM (Segment Anything Model)

Along the way, it will also introduce MiniGPT-4, VisualGPT, HuggingGPT, and AutoGPT

Part 1 From CLIP to BLIP1/BLIP2 and DALL·E/DALL·E 2

1.1 CLIP: Pre-training Method Based on Contrastive Text-Image Pairs

When I read the CLIP paper for the first time, my first reaction was simply: this is incredibly powerful

CLIP, released by OpenAI in January 2021, has two defining features:

  1. It extracts visual features through very large-scale model pre-training and performs contrastive learning between images and texts (a simple, rough analogy: when people post on Weibo or Moments, they like to write a short caption and attach one or several pictures; CLIP learns exactly this kind of correspondence)
  2. After pre-training, it performs inference directly without fine-tuning (i.e. zero-shot: it uses the image features it has learned to judge the category of unseen images, rather than fine-tuning on a downstream training set). As a result, on ImageNet the CLIP model, without using a single image from the ImageNet dataset for training, matches the accuracy of a supervised ResNet-50 (76.2% zero-shot accuracy on ImageNet, something previously considered impossible)

To train CLIP, OpenAI collected 400 million text-image pairs from the Internet, which the paper calls WIT (WebImageText). WIT is of high quality and was cleaned very carefully, and its scale is comparable to JFT-300M, which is one of the reasons CLIP is so powerful (the DALL-E model was also trained on WIT)

Its training process is as follows:

  1. As shown in the first step of the figure below, CLIP's input is a batch of paired image-text pairs (for example, an input image of a dog whose paired text also says it is a dog). The texts and images are passed through a Text Encoder and an Image Encoder respectively to produce the corresponding features, and contrastive learning is then performed on these text and image features.

     If the model is given n image-text pairs, the n pairs that actually belong together are the positive samples (the blue diagonal of the output feature matrix in the figure below), and the remaining n^2 - n pairings are all negative samples. Training therefore maximizes the similarity of the n positive samples while minimizing the similarity of the n^2 - n negative samples.
     The Text Encoder can be a text transformer commonly used in NLP.
     The Image Encoder can be a common CNN, a vision transformer, or a similar model.
     The similarity is the cosine similarity between the text features and the image features.

     With this in place, CLIP can perform zero-shot image classification directly, without any further training or fine-tuning; it only takes the two simple steps in points 2 and 3 below
  2. Construct a description text for each category from the task's class labels, e.g. "A photo of {label}", and feed these texts into the Text Encoder to obtain the corresponding text features. If there are n categories, n text features are obtained
  3. Feed the image to be classified into the Image Encoder to obtain its image feature, compute the scaled cosine similarity with the n text features (consistent with the training procedure), and pick the category whose text has the highest similarity as the prediction.
     Furthermore, these similarities can be regarded as logits; passing them through a softmax gives the predicted probability of each category

The following is the corresponding pseudocode (from the CLIP paper)

# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of input images
# T[n, l]       - minibatch of input texts, l is the sequence length

# W_i[d_i, d_e] - learned projection of image features to the embedding space
# W_t[d_t, d_e] - learned projection of text features to the embedding space
# t             - learned temperature parameter

# extract the image features and text features separately
I_f = image_encoder(I)  # [n, d_i]
T_f = text_encoder(T)   # [n, d_t]

# linearly project both features to the same dimension d_e and L2-normalize them
# to keep the scales consistent; joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities: [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)  # the diagonal entries are the positive pairs
loss_i = cross_entropy_loss(logits, labels, axis=0)  # image-to-text loss
loss_t = cross_entropy_loss(logits, labels, axis=1)  # text-to-image loss
loss   = (loss_i + loss_t)/2  # symmetric objective
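
To make the zero-shot procedure in steps 2 and 3 above concrete, here is a minimal inference sketch using the open-source openai/CLIP package (a sketch only: the label list and image path are illustrative, and the package must be installed separately)

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # pretrained image & text encoders

labels = ["dog", "cat", "plane"]                            # illustrative label set
texts = clip.tokenize([f"A photo of a {c}" for c in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # illustrative image path

with torch.no_grad():
    image_features = model.encode_image(image)              # [1, d_e]
    text_features = model.encode_text(texts)                # [n, d_e]
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # scaled cosine similarity, then softmax over the n candidate captions
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(labels[probs.argmax().item()])                        # predicted category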

In October 2021, Disco Diffusion, released by Accomplice, became the first open-source AI painting tool to combine the CLIP model with a diffusion model; its core is a CLIP-guided diffusion model (a rough sketch of the idea follows).
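
The core idea of CLIP-guided diffusion can be sketched in a few lines: at each denoising step, the gradient of the CLIP image-text similarity with respect to the noisy sample is used to steer generation toward the prompt. The sketch below is conceptual only; diffusion.predict_x0, diffusion.p_mean_variance and the other names are hypothetical placeholders, not Disco Diffusion's actual API

import torch

def clip_guided_step(x_t, t, diffusion, clip_image_encoder, text_features, guidance_scale):
    # estimate the clean image x0 from the current noisy sample x_t
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = diffusion.predict_x0(x_t, t)
    # cosine similarity between the denoised estimate and the text prompt, measured by CLIP
    image_features = clip_image_encoder(x0_pred)
    sim = torch.cosine_similarity(image_features, text_features, dim=-1).sum()
    # the gradient of the similarity w.r.t. x_t tells us how to nudge the sample
    grad = torch.autograd.grad(sim, x_t)[0]
    # shift the reverse-process mean along the CLIP gradient (classifier-guidance style)
    mean, var = diffusion.p_mean_variance(x_t, t)
    return mean + guidance_scale * var * grad
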
CLIP also inspired a series of improved and derived models, such as LSeg, GroupViT, ViLD, and GLIP.

1.2 From BLIP1 and BLIP2 to MiniGPT-4

1.2.1 BLIP1: ViT + BERT, unifying understanding and generation with an encoder-decoder

With the rapid development of AI, multimodality is becoming a trend, and the Vision-Language Pre-training (VLP) + fine-tuning => zero-shot / few-shot paradigm is a good way to quickly solve multiple downstream tasks. VLP is the starting point of this paradigm, so there is a great deal of research on it. BLIP is a new VLP architecture that can be flexibly and quickly transferred to downstream tasks such as image-text retrieval, image captioning, and VQA

Simply put, BLIP's main feature is that it combines an encoder and a decoder into a unified multimodal model for both understanding and generation. When building on BLIP, you can use its understanding ability (the encoder) or its generation ability (the decoder), which broadens the applications of multimodal models

1.2.1.1 Model structure of BLIP

CLIP uses an image encoder (ViT/ResNet) and a text encoder (transformer), then directly compares the cosine similarity of the image features and text features to get its result, whereas BLIP's approach is considerably more involved

As shown in the figure below, in order to pre-train a unified model with both understanding and generation capabilities, BLIP consists of four main parts, from left to right:

  • Part 1 of the figure above: the visual encoder, Image Encoder (ViT), which extracts image features
    The visual encoder is essentially a ViT: it splits the input image into patches and encodes them into a sequence of image embeddings, with an additional [CLS] token representing the global image feature
  • Part 2 of the figure above: the text encoder, Text Encoder (BERT), which extracts text features
    The text encoder is a BERT; a [CLS] token is prepended to the text input to summarize the sentence, and its role is to extract text features for contrastive learning against the image features from part 1

During this process, a contrastive learning objective (Image-Text Contrastive Loss, ITC) is trained. ITC acts on the visual encoder (ViT) in part 1 and the text encoder (BERT) in part 2. Its goal is to align the feature spaces of vision and text by making the similarity of positive image-text pairs higher and that of negative pairs lower, the same idea used in ALBEF. The authors also reuse ALBEF's momentum encoder here, whose purpose is to produce pseudo-labels to assist training

For ease of comparison, here is the BLIP model structure diagram again

  • Part 3 of the figure above: the image-grounded text encoder (a BERT variant), which inserts cross-attention layers into BERT so that it can perform binary classification over image features and text features
    Concretely, in each transformer block of the text encoder (BERT), an extra cross-attention (Cross-Attention) layer is inserted between the bidirectional self-attention (Bi Self-Att) layer and the feed-forward network (Feed Forward) to inject visual features. Its role is to do binary classification based on the image features from the ViT and the text input, so an encoder is used; the attention here is bidirectional self-attention, and an additional [Encode] token is added as the joint representation of the image-text pair

During this process, an image-text matching objective (Image-Text Matching Loss, ITM) is trained. ITM acts on the visual encoder in part 1 and the image-grounded text encoder in part 3. It is a binary classification task whose goal is to learn a joint image-text representation: a classification head predicts whether an image-text pair is a positive (matched) or negative (unmatched) pair. The aim is to learn a multimodal representation that captures fine-grained alignment between vision and language; the authors again use ALBEF's hard negative mining technique here

  • Part 4 of the figure above: the image-grounded text decoder (a BERT variant), which generates text conditioned on image features and text input
    The image-grounded text decoder also uses cross-attention: given the image features from the ViT and the text input, it performs text generation, so a decoder is used, and the Bi Self-Att of the image-grounded text encoder in part 3 is replaced with causal self-attention (Causal Self-Att). The objective is to predict the next token, and an additional [Decode] token and an end-of-sequence token mark the start and end of the generated output.
    One point to note: modules drawn in the same color share parameters, i.e. the image-grounded text encoder and the image-grounded text decoder share all parameters except their self-attention layers. For each image-text input, the image only needs to pass through the ViT once, while the text passes through the text model three times

During this process, a language modeling objective (Language Modeling Loss, LM) is trained. Since BLIP contains a decoder that is used for generation tasks, it needs a language modeling objective suited to generation. LM acts on the visual encoder in part 1 and the image-grounded text decoder in part 4; its goal is to generate a textual description of a given image autoregressively. Compared with the masked language modeling (MLM, i.e. cloze) loss widely used in VLP, LM lets the model turn visual information into coherent captions. A condensed sketch of the three objectives follows
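
Putting the three objectives together, a condensed sketch of BLIP's pre-training loss might look like the following. This is an illustrative simplification, not the official BLIP code: the four modules are passed in as placeholders standing for parts 1-4 above, the ITM branch only shows positive pairs, and ALBEF's momentum encoder and hard negative mining are omitted

import torch
import torch.nn.functional as F

def blip_pretraining_losses(images, text_ids, vit, text_encoder,
                            grounded_encoder, itm_head, grounded_decoder, temp=0.07):
    # Part 1: ViT visual encoder, [CLS] token as the global image feature
    image_feats = vit(images)                                            # [n, d]
    # Part 2 + ITC: BERT text encoder, CLIP-style symmetric contrastive loss
    text_feats = text_encoder(text_ids)                                  # [n, d]
    sim = F.normalize(image_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T / temp
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.T, targets)) / 2
    # Part 3 + ITM: image-grounded text encoder plus a binary matched / not-matched head
    joint = grounded_encoder(text_ids, image_feats)                      # cross-attention to the image
    itm_logits = itm_head(joint)                                         # [n, 2]
    itm_labels = torch.ones(sim.size(0), dtype=torch.long, device=sim.device)  # positives only in this sketch
    loss_itm = F.cross_entropy(itm_logits, itm_labels)
    # Part 4 + LM: image-grounded text decoder, autoregressive next-token prediction
    lm_logits = grounded_decoder(text_ids[:, :-1], image_feats)          # causal self-attention
    loss_lm = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                              text_ids[:, 1:].reshape(-1))
    return loss_itc + loss_itm + loss_lm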

1.2.1.2 BLIP's captioning and filtering method: CapFilt

Throughout the above process there is a problem that cannot be ignored: high-quality human-annotated image-text pairs {(I_h, T_h)} (e.g. COCO) are not available in large quantities because they are expensive to obtain.

  • CLIP's data comes from image-text pairs crawled from the web {(I_w, T_w)}, so the dataset is easy to scale up, and since contrastive learning is essentially self-supervised, no manual labeling is required;
  • BLIP addresses the drawback of using raw web data directly by proposing a Captioning and Filtering (CapFilt) module, which reduces noise and enriches the data; it consists of two components, a Captioner and a Filter

As shown below

The CapFilt method consists of two modules:

  1. Captioner: it generates captions for web images. It is the image-grounded text decoder (part 4 of the BLIP structure above), fine-tuned with the LM objective on the COCO dataset to decode text from a given image; given a web image I_w, the Captioner generates a synthetic caption T_s
  2. Filter: it filters out noisy image-text pairs. It is the image-grounded text encoder (part 3 of the BLIP structure above), which checks whether a text matches an image and is fine-tuned with the ITC and ITM objectives.
     The Filter removes noisy texts from both the original web texts T_w and the synthetic texts T_s: if the ITM head predicts that a text does not match the image, that text is considered noisy

Finally, the filtered image-text pairs are combined with the human-annotated pairs to form a new dataset, which the authors use to pre-train a new model; a rough sketch of this bootstrapping loop follows
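
A rough illustration of the data bootstrapping loop (illustrative only: captioner, filter_itm, and the dataset variables are hypothetical placeholders, not BLIP's actual interfaces)

def capfilt(web_pairs, human_pairs, captioner, filter_itm, match_threshold=0.5):
    """Bootstrap a cleaner dataset from noisy web image-text pairs."""
    bootstrapped = []
    for image, web_text in web_pairs:                      # (I_w, T_w) crawled from the web
        synth_text = captioner.generate(image)             # Captioner produces a synthetic caption T_s
        for text in (web_text, synth_text):
            # the Filter keeps a text only if the ITM head judges it to match the image
            if filter_itm.match_prob(image, text) >= match_threshold:
                bootstrapped.append((image, text))
    # combine the filtered pairs with human-annotated pairs (I_h, T_h) to pre-train a new model
    return bootstrapped + list(human_pairs)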

The image below shows the visualization of the text accepted and rejected by the filter (green text is approved by the filter, while red text is rejected by the filter)

// to be updated

1.2.2 BLIP2

// to be updated

1.2.3 MiniGPT4

Model architecture: Vicuna (a LLaMA fine-tune) + BLIP-2's visual components + a linear projection layer

MiniGPT-4 exhibits many capabilities similar to those demonstrated by GPT-4, such as generating detailed image descriptions, creating websites from hand-drawn sketches, writing stories and poems inspired by a given image, offering solutions to problems shown in an image, and teaching users how to cook from food photos

The model architecture of MiniGPT-4 is a language model stitched together with a vision model, with a linear projection layer added for alignment (see the sketch after this list). Specifically:

  • It uses Vicuna, fine-tuned from LLaMA, as the language decoder

  • For visual perception, it adopts the same pre-trained vision components as BLIP-2 (ViT-G/14 from EVA-CLIP plus the Q-Former)

  • A single projection layer is then added to align the encoded visual features with the Vicuna language model, while all other vision and language components are kept frozen
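
The wiring can be summarized in a short sketch. Assumptions: visual_encoder, q_former, and llm stand in for the pretrained ViT-G/14, the BLIP-2 Q-Former, and Vicuna respectively; only the linear projection is trainable; this is not the official MiniGPT-4 code

import torch
import torch.nn as nn

class MiniGPT4Sketch(nn.Module):
    def __init__(self, visual_encoder, q_former, llm, q_former_dim, llm_dim):
        super().__init__()
        self.visual_encoder = visual_encoder                 # frozen ViT-G/14 (EVA-CLIP)
        self.q_former = q_former                             # frozen Q-Former from BLIP-2
        self.llm = llm                                       # frozen Vicuna (LLaMA-based)
        self.proj = nn.Linear(q_former_dim, llm_dim)         # the only trainable layer
        for module in (self.visual_encoder, self.q_former, self.llm):
            for p in module.parameters():
                p.requires_grad = False

    def forward(self, images, prompt_embeds):
        patch_feats = self.visual_encoder(images)            # image patches -> visual features
        query_feats = self.q_former(patch_feats)             # Q-Former query outputs
        visual_tokens = self.proj(query_feats)               # align to the LLM embedding space
        # prepend the projected visual tokens to the text prompt embeddings
        inputs = torch.cat([visual_tokens, prompt_embeds], dim=1)
        return self.llm(inputs)                              # frozen LLM consumes [visual tokens; text]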

Model training: pre-training (5 million image-text pairs) followed by fine-tuning

Training follows the classic pre-training then fine-tuning recipe:

  1. Throughout pre-training, both the pre-trained visual encoder and the LLM are kept frozen, and only the linear projection layer is trained. Specifically, the authors use the combined Conceptual Captions, SBU, and LAION datasets: 20,000 training steps with a batch size of 256, covering about 5 million image-text pairs. The whole process takes about 10 hours on 4 x A100 (80GB) GPUs (a minimal outline of this projection-only training loop follows the list)
  2. However, simply aligning visual features to the LLM is not enough to train a high-quality chatbot-like visual conversation model, and the noise in raw image-text pairs can lead to incoherent language output. The authors therefore collected another 3,500 high-quality aligned image-text pairs and further fine-tuned the model with a designed conversation template (only 400 training steps with a batch size of 12, completed in about 7 minutes on a single A100 GPU) to improve the naturalness and usability of the generated language
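
In both stages, only the projection layer is updated; a minimal outline of such a training loop (compute_loss, the dataloader, and the learning rate are assumptions for illustration, not the paper's exact settings)

import torch

def train_projection_only(model, dataloader, num_steps, compute_loss, lr=1e-4):
    # only the linear projection layer receives gradients; the ViT, Q-Former and Vicuna stay frozen
    optimizer = torch.optim.AdamW(model.proj.parameters(), lr=lr)
    for step, (images, prompt_embeds, targets) in enumerate(dataloader):
        if step >= num_steps:                                # stage 1: 20,000 steps; stage 2: 400 steps
            break
        loss = compute_loss(model(images, prompt_embeds), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()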

1.3 From DALL·E to DALL·E 2

1.3.1 DALL-E: Zero-Shot Text-to-Image Generation

Interestingly, DALL·E, like CLIP, was released in early 2021; the corresponding paper is "Zero-Shot Text-to-Image Generation"

Dataset: 250 million image-text pairs; model size: 12 billion parameters

// to be updated

1.3.2 DALL-E 2

DALL·E 2 mainly consists of two parts:

  1. The first part is the Prior: it converts the user input into an image representation, i.e. it takes a text caption and produces a CLIP image embedding.
     The text and image embeddings used here come from the CLIP network introduced earlier (Contrastive Language-Image Pre-training), which, given an input image, returns the best-matching caption. CLIP does the opposite of DALL·E 2: it maps images to text, while DALL·E 2 maps text to images. CLIP is used because it has learned the connection between the visual and textual representations of objects
  2. The second part converts this representation into an actual image (the Decoder): it takes a CLIP image embedding and generates an image

After the model has been trained, the inference process is as follows (see the sketch after the list):

  1. The input text is converted into a CLIP text embedding using a neural network.
  2. The dimensionality of the text embedding is reduced using Principal Component Analysis (PCA).
  3. An image embedding is created from the text embedding.
  4. In the Decoder step, a diffusion model converts the image embedding into an image.
  5. The image is upscaled from 64×64 to 256×256, and finally to 1024×1024, using a convolutional neural network
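
Stringing the five steps together gives a sketch like the one below (clip_text_encoder, prior, decoder, and the two upsamplers are hypothetical placeholders for the components described above, not a real DALL·E 2 API; the PCA dimensionality-reduction step is folded into the prior here)

def dalle2_generate(prompt, clip_text_encoder, prior, decoder, upsample_256, upsample_1024):
    text_emb = clip_text_encoder(prompt)       # step 1: prompt -> CLIP text embedding
    image_emb = prior(text_emb)                # steps 2-3: prior maps the text embedding to a CLIP image embedding
    image_64 = decoder(image_emb)              # step 4: diffusion decoder turns the embedding into a 64x64 image
    image_256 = upsample_256(image_64)         # step 5: upscale 64x64 -> 256x256
    return upsample_1024(image_256)            #         then 256x256 -> 1024x1024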

// To be updated..

Part 2 A popular understanding of Stable Diffusion

// to be updated

Part 3 From Visual ChatGPT and GPT4 to Midjourney V5 and SAM (Segment Anything Model)

// to be updated

References and Recommended Reading

  1. Learning Transferable Visual Models From Natural Language Supervision (the original CLIP paper)
  2. The video "An intensive paragraph-by-paragraph reading of the CLIP paper" and one of its companion notes: "CLIP and its follow-up improvement works"
  3. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (the original BLIP paper)
  4. Super-detailed multimodal interpretation (6): BLIP, a bootstrapped multimodal model unifying understanding and generation; Jizhi AI | a detailed explanation of the BLIP algorithm
  5. Understanding how DALL·E 2, Stable Diffusion and Midjourney work
  6. After reading the DALL-E paper, we found that large datasets also have affordable alternatives

Record of creation, revision, and additions after first publication

  1. During the three-day Dragon Boat Festival holiday, continued to improve the sections on BLIP/BLIP2, DALL·E/DALL·E 2, and other related content


Originally published at blog.csdn.net/v_JULY_v/article/details/131205615