[Computer Vision | Natural Language Processing] BLIP: Unified Vision-Language Understanding and Generation Tasks (Paper Explanation)

1. Introduction

The paper we are going to introduce today is BLIP, and the full title of the paper is Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation.

The paper address is:

https://arxiv.org/pdf/2201.12086.pdf

The code address is:

https://github.com/salesforce/BLIP

The demo address is:

https://huggingface.co/spaces/akhaliq/BLIP

Before diving into the paper, let's try out the demo.

2. Trying the demo

How well does BLIP work? You simply upload an image, or click a built-in example to load one, and you are done.

The BLIP model has two functions:

  • image captioning
  • visual question answering

The Gradio demo of BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (Salesforce Research). Note that image uploads have been disabled since March 23, 2023, so you can only click one of the built-in examples to load an image.

Uploading the famous oil painting "Starry Night", the image captioning function outputs "caption: a painting with many clouds above the city".
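If you would rather try something similar locally instead of through the Gradio page, a minimal sketch using the Hugging Face transformers BLIP classes is shown below. It assumes the transformers and Pillow packages are installed, that the Salesforce/blip-image-captioning-base and Salesforce/blip-vqa-base checkpoints are acceptable choices, and that "starry_night.jpg" is replaced with your own image path; it is not the code behind the official demo.

```python
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          BlipForQuestionAnswering)

# Load an image from disk ("starry_night.jpg" is a placeholder path).
image = Image.open("starry_night.jpg").convert("RGB")

# Function 1: image captioning
cap_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
inputs = cap_processor(images=image, return_tensors="pt")
out = cap_model.generate(**inputs, max_new_tokens=30)
print("caption:", cap_processor.decode(out[0], skip_special_tokens=True))

# Function 2: visual question answering
vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
inputs = vqa_processor(images=image, text="what is in the sky?", return_tensors="pt")
out = vqa_model.generate(**inputs)
print("answer:", vqa_processor.decode(out[0], skip_special_tokens=True))
```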


Let's also look at a few results from the blogger's earlier tests:

(Figures: additional captioning examples from the blogger's earlier tests)

Judging from these results, the model performs quite well.

3. Research Background

Vision-Language Pre-training (VLP) has improved the performance of many vision-language tasks. However, most existing pre-trained models perform well only on understanding-based tasks or only on generation-based tasks, and rarely achieve strong results on both.

Existing VLP methods mainly suffer from two limitations:

  1. From a model perspective, most existing methods adopt either encoder-based or encoder-decoder architectures. However, encoder-based models are difficult to transfer directly to text generation tasks, while encoder-decoder models have not been successfully adopted for image-text retrieval tasks;
  2. From a data perspective, SOTA models such as CLIP and SimVLM are pre-trained on image-text pairs collected from the web. Although scaling up the dataset improves performance, the noisy web text is a suboptimal source of supervision for VLP.

Recently, researchers from Salesforce Research proposed BLIP (Bootstrapping Language-Image Pre-training), which is used to unify vision-language understanding and generation tasks.

BLIP is a new VLP framework that can support a wider range of downstream tasks than existing methods.

BLIP effectively utilizes noisy web data by bootstrapping the captions: a captioner generates synthetic captions and a filter removes the noisy ones.

The research achieves SOTA performance on vision-language tasks: for example, average recall@1 on image-text retrieval improves by +2.7%, CIDEr on image captioning improves by +2.8%, and the VQA score improves by +1.6%.

BLIP also shows strong generalization ability when it is directly transferred to the video-language task in a zero-shot manner.

4. Model structure

The researchers' proposed BLIP is a unified vision-language pre-training (VLP) framework that learns from noisy image-text pairs.

Next, we explain in detail the model architecture MED (Multimodal mixture of Encoder-Decoder), its pre-training objectives, and CapFilt, the method used to bootstrap the dataset.

The following figure shows the pre-training model architecture and objectives of BLIP:

(Figure: the pre-training model architecture and objectives of BLIP)

BLIP uses a Visual Transformer (ViT) as the image encoder: the input image is divided into patches, the patches are encoded into a sequence of embeddings, and an additional [CLS] token represents the global image feature. Compared with using a pre-trained object detector for visual feature extraction, using a ViT is more computation-friendly and has gradually become the mainstream choice.
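To make the "patches plus [CLS] token" idea concrete, here is a toy sketch (not the actual ViT/BLIP code) of how an image becomes a sequence of patch embeddings with a learnable [CLS] token prepended; the dimensions follow the common ViT-B/16 setup, and position embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class ToyPatchEmbed(nn.Module):
    """Toy ViT-style front end: split the image into 16x16 patches, project each
    patch to an embedding, and prepend a learnable [CLS] token whose output is
    later used as the global image feature."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        # A strided convolution is the standard trick for "split into patches + linear projection".
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, x):                                     # x: (B, 3, H, W)
        patches = self.proj(x).flatten(2).transpose(1, 2)     # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)       # (B, 1, D)
        return torch.cat([cls, patches], dim=1)               # (B, 1 + N, D)

x = torch.randn(2, 3, 224, 224)
print(ToyPatchEmbed()(x).shape)   # torch.Size([2, 197, 768]) -> 196 patches + 1 [CLS]
```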

To pre-train a unified model with both understanding and generation capabilities, the researchers propose a Multimodal mixture of Encoder-Decoder (MED), which can operate as any of the following three modules:

  1. Unimodal encoder, which encodes images and text separately. The text encoder is the same as BERT, with a [CLS] token appended to the beginning of the text input to summarize the sentence.
  2. Image-grounded text encoder, which injects visual information by inserting one additional cross-attention (CA) layer between the self-attention (SA) layer and the feed-forward network (FFN) in each Transformer block of the text encoder. A task-specific [Encode] token is appended to the text, and the output embedding of [Encode] is used as the multimodal representation of the image-text pair.
  3. Image-grounded text decoder, which replaces the bidirectional self-attention layers of the image-grounded text encoder with causal self-attention layers. A [Decode] token signals the beginning of a sequence, and an end-of-sequence token signals its end.

Through the three modules above, the MED model can handle generation-based tasks and understanding-based tasks at the same time. A toy sketch of the three modes follows.
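The toy block below (a simplification, not the actual MED implementation) illustrates how one Transformer layer can play the three roles by switching the attention mask and toggling cross-attention. In the real model, the decoder uses its own causal self-attention parameters while the cross-attention and FFN layers are shared with the encoder; here everything is folded into a single module purely for illustration.

```python
import torch
import torch.nn as nn

class ToyMEDBlock(nn.Module):
    """One toy Transformer block that can play the three MED roles:
      mode="text"       - bidirectional self-attention only (unimodal text encoder)
      mode="multimodal" - self-attention + cross-attention to image features (image-grounded encoder)
      mode="decoder"    - causal self-attention + cross-attention (image-grounded decoder)"""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text, image=None, mode="text"):
        causal_mask = None
        if mode == "decoder":
            # Causal mask: each position may only attend to itself and earlier positions.
            L = text.size(1)
            causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        h, _ = self.self_attn(text, text, text, attn_mask=causal_mask)
        text = self.norm1(text + h)
        if mode in ("multimodal", "decoder"):
            # Cross-attention injects the visual information from the image encoder.
            h, _ = self.cross_attn(text, image, image)
            text = self.norm2(text + h)
        return self.norm3(text + self.ffn(text))

blk = ToyMEDBlock()
txt = torch.randn(2, 10, 256)    # (batch, text_len, dim)
img = torch.randn(2, 197, 256)   # (batch, num_patches + 1, dim) from the image encoder
print(blk(txt, mode="text").shape)
print(blk(txt, img, mode="multimodal").shape)
print(blk(txt, img, mode="decoder").shape)
```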

5. Pre-training objectives

The paper uses three objectives during pre-training, namely two understanding-based objectives and one generation-based objective (a toy sketch combining all three follows the list below).

  1. Image-Text Contrastive Loss (ITC): using the idea of contrastive learning, align the feature spaces of the visual transformer and the text transformer in order to obtain better representations of images and text.
  2. Image-Text Matching Loss (ITM): aims to learn an image-text multimodal representation that captures fine-grained alignment between vision and language. Simply put, it is an image-text matching task whose final output is a binary classification: matched (positive) or unmatched (negative).
  3. Language Modeling Loss (LM): the generation objective among the three; given an image, the model generates a corresponding textual description. Compared with the MLM loss widely used in VLP, LM enables the model to generalize visual information into coherent captions.
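Here is a toy numerical sketch of how the three losses could be computed and summed, with random tensors standing in for the real encoder and decoder outputs; the temperature value and the equal loss weighting are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

B, D, L, V = 8, 256, 12, 1000                        # batch, feature dim, caption length, toy vocab size

# ITC: contrastive alignment of the unimodal image and text [CLS] features.
img_feat = F.normalize(torch.randn(B, D), dim=-1)    # stands in for the image encoder's [CLS] output
txt_feat = F.normalize(torch.randn(B, D), dim=-1)    # stands in for the text encoder's [CLS] output
logits = img_feat @ txt_feat.t() / 0.07              # similarity matrix with a temperature of 0.07
targets = torch.arange(B)                            # the matching pair sits on the diagonal
itc_loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# ITM: binary matched / unmatched prediction from the multimodal [Encode] embedding.
multimodal_cls = torch.randn(B, D)                   # stands in for the image-grounded encoder output
itm_head = torch.nn.Linear(D, 2)
itm_labels = torch.randint(0, 2, (B,))               # 1 = matched pair, 0 = negative pair
itm_loss = F.cross_entropy(itm_head(multimodal_cls), itm_labels)

# LM: autoregressive caption generation with the image-grounded decoder.
decoder_logits = torch.randn(B, L, V)                # per-position vocabulary logits
caption_ids = torch.randint(0, V, (B, L))            # ground-truth caption token ids
lm_loss = F.cross_entropy(decoder_logits[:, :-1].reshape(-1, V),
                          caption_ids[:, 1:].reshape(-1))   # predict each next token

total_loss = itc_loss + itm_loss + lm_loss
print(itc_loss.item(), itm_loss.item(), lm_loss.item(), total_loss.item())
```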

6. CapFilt architecture

Since the image-text pairs used for large-scale pre-training are usually collected from the web, the text often does not accurately describe the visual content of the image; such pairs are a noisy signal and suboptimal for learning vision-language alignment.

Therefore, the authors propose the CapFilt architecture to improve the quality of image-text pairs.

(Figure: the learning framework of CapFilt)

As shown in the figure above, $(I_w, T_w)$ denotes web image-text pairs, and $(I_h, T_h)$ denotes high-quality human-annotated image-text pairs.

CapFilt introduces two modules: a captioner that generates captions for web images, and a filter that removes noisy image-text pairs. Both the captioner and the filter are initialized from the same pre-trained MED model and fine-tuned separately on the COCO dataset; this fine-tuning is a lightweight process.

The whole process is roughly as follows: first pre-train the MED model, then use $(I_h, T_h)$ to fine-tune the captioner and the filter separately. The captioner generates a caption for each web image. The filter uses ITM to judge whether the original web image-text pair and the (web image, generated caption) pair each match, and filters out the ones that do not. Finally, the remaining image-text pairs, together with $(I_h, T_h)$, are used to pre-train a new model. My personal understanding is that this works somewhat like a novel form of online self-knowledge distillation. A minimal sketch of the bootstrapping loop is given below.
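Below is a minimal sketch of this bootstrapping step expressed in code. The captioner, filter_model, and threshold are hypothetical stand-ins for the fine-tuned MED modules and the ITM-based filtering decision; only the control flow mirrors the procedure described above.

```python
def bootstrap_dataset(web_pairs, human_pairs, captioner, filter_model, threshold=0.5):
    """Return a bootstrapped training set: filtered web texts, filtered synthetic
    captions, and the original human-annotated pairs.

    `captioner` and `filter_model` are hypothetical stand-ins for the two
    fine-tuned MED modules; `itm_score` is assumed to return the probability
    that an (image, text) pair matches."""
    clean_pairs = []
    for image, web_text in web_pairs:
        # Keep the original web text only if the filter's ITM head says it matches the image.
        if filter_model.itm_score(image, web_text) >= threshold:
            clean_pairs.append((image, web_text))
        # Generate a synthetic caption and keep it only if it also passes the filter.
        synthetic_caption = captioner.generate(image)
        if filter_model.itm_score(image, synthetic_caption) >= threshold:
            clean_pairs.append((image, synthetic_caption))
    # Pre-train the new model on the cleaned web pairs plus the human-annotated pairs.
    return clean_pairs + list(human_pairs)
```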

7. Experiments

(Figure: ablation of the effect of the captioner and the filter)

The figure above shows the impact of the proposed captioner and filter on the final result.

(Figure: ablation of the parameter-sharing strategy)

The figure above shows the impact of the parameter-sharing strategy on the final result.

(Figure: comparison with SOTA methods on image-text retrieval)
The figure above compares BLIP with other SOTA methods on image-text retrieval; a clear improvement can be seen.

(Figure: comparison with SOTA image captioning methods)

The figure above is a comparison with other SOTA image captioning methods.

8. Conclusion

The BLIP architecture proposed by the authors achieves SOTA results on a wide range of downstream tasks, including both understanding-based and generation-based tasks. At the same time, the dataset bootstrapping method addresses the problem of the large amount of noisy data collected from the web.

The authors also propose several potential ways to improve the performance of BLIP:

  • Perform multiple rounds of dataset bootstrapping
  • Generate multiple captions for each picture to expand the corpus
  • Train multiple captioners and filters, and perform model ensemble
