[AIGC] 7, BLIP | Unifying understanding and generation to produce higher-quality text descriptions for images


Paper: BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Code: https://github.com/salesforce/BLIP

Online experience: https://huggingface.co/spaces/Salesforce/BLIP

Source: ICML 2022 | Salesforce Research

Time: 2022.02

Contributions:

  • A multimodal mixture of encoder-decoder model (MED) that jointly trains understanding and generation tasks is proposed; it can operate as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder
  • The CapFilt method is proposed, which generates higher-quality text descriptions for images collected from the web, improving the reliability and descriptive richness of the dataset and bringing performance gains on downstream tasks


1. Background

Vision-Language Pre-training (VLP) has greatly improved performance on many vision-language tasks

But existing models have two main problems:

  • Model level: many existing methods use encoder-only or encoder-decoder models, but encoder-based models cannot be directly used for text generation tasks, and encoder-decoder models do not transfer well to image-text retrieval tasks.
  • Data level: existing methods such as CLIP, ALBEF, and SimVLM are trained on large numbers of image-text pairs collected from the web (scaling up the dataset improves performance, but the paper's study shows that noisy web text is suboptimal for vision-language learning)

BLIP, proposed in this paper, is a new VLP framework that provides a unified Language-Image Pre-training framework for both vision-language understanding and generation tasks. The main contributions are:

  • Multimodal mixture of Encoder-Decoder (MED): a model architecture that supports efficient multi-task pre-training and flexible transfer to downstream tasks

    MED can be used as a unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder

    MED is pre-trained with three vision-language objectives: image-text contrastive learning, image-text matching, and language modeling

  • Captioning and Filtering (CapFilt): a dataset bootstrapping method for learning from noisy image-text pairs

    The pre-trained MED is fine-tuned into two modules: a captioner that generates captions for images and a filter that removes noisy captions

BLIP performs well on the following tasks:

  • Image-text retrieval
  • Image captioning
  • Visual question answering (VQA)

2. Method


2.1 Model structure

The image encoder uses a Transformer (ViT) structure: the image is divided into patches, which are linearly projected, combined with position embeddings and a [CLS] token, and then fed into the Transformer layers (a minimal sketch of this input pipeline is shown below).
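
Below is a minimal PyTorch sketch of that patchify-project-prepend pipeline. The dimensions and module names are illustrative (roughly ViT-B/16-like), not the exact BLIP implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify an image, linearly project the patches, prepend [CLS], add position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution patchifies and linearly projects in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)         # (B, N, D) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)      # one [CLS] token per image
        return torch.cat([cls, x], dim=1) + self.pos_embed  # ready for the Transformer layers

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))          # -> (2, 197, 768)
```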

To train a unified model that handles both understanding and generation, MED is proposed: a multi-task model that can operate in any of the following three modes (see the sketch after this list):

  • Unimodal encoder: encodes images or text separately; the text encoder is similar to BERT
  • Image-grounded text encoder: visual information is injected by inserting a cross-attention layer between each self-attention layer and the FFN; an additional [Encode] token is appended to the text, and its output embedding represents the image-text pair
  • Image-grounded text decoder: the bidirectional self-attention layers of the image-grounded text encoder are replaced with causal self-attention layers; a [Decode] token marks the beginning of a sequence and an end-of-sequence token marks its end
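
The sketch below (illustrative, not the official implementation) shows how a single text Transformer block can serve all three roles: only the self-attention mask and the presence of the cross-attention branch change between modes. Layer norms are omitted for brevity, and `MEDBlock` is a hypothetical name.

```python
import torch
import torch.nn as nn

class MEDBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text, image_feats=None, mode="unimodal"):
        L = text.size(1)
        # Decoder mode -> causal mask; both encoder modes -> bidirectional self-attention.
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), 1) if mode == "decoder" else None
        h, _ = self.self_attn(text, text, text, attn_mask=causal)
        text = text + h
        if mode in ("encoder", "decoder") and image_feats is not None:
            # Image-grounded modes inject visual information via cross-attention.
            h, _ = self.cross_attn(text, image_feats, image_feats)
            text = text + h
        return text + self.ffn(text)
```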


2.2 Pre-training Objectives

In the pre-training phase, the authors jointly optimize three objective functions: two understanding-based objectives and one generation-based objective.


1. Image-Text Contrastive Loss (ITC)

Used to train the unimodal encoders; the goal is for correctly matched image-text pairs to obtain more similar representations than mismatched pairs

The authors follow the ITC loss of ALBEF, which introduces a momentum encoder to generate features and uses soft labels produced by the momentum encoder as training targets, to account for potential positives hidden among the negative pairs (a minimal contrastive-loss sketch follows)
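
The following is a minimal in-batch image-text contrastive loss, written as a sketch: the real BLIP/ALBEF ITC additionally maintains a momentum encoder and mixes its softened similarities into the targets, while here plain one-hot targets are used for brevity.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, temperature=0.07):
    # image_feats, text_feats: (B, D) projected [CLS] features of the unimodal encoders
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)   # matched pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                    # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)                # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```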

2. Image-Text Matching Loss (ITM)

Used to train the image-grounded text encoder; it learns a multimodal image-text representation that captures fine-grained alignment between vision and language

ITM is a binary classification task: the model uses an ITM head to predict whether an image-text pair is matched (positive) or unmatched (negative)

In addition, to obtain more informative negatives, the authors use hard negative mining: negative pairs with higher contrastive similarity within a batch are more likely to be selected for computing the loss (see the sketch below)
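
A simplified sketch of that hard-negative mining step is shown below: for each image, a non-matching text is sampled with probability proportional to its ITC similarity, so the binary ITM head sees pairs that are hard to tell apart. Names and details are illustrative.

```python
import torch

def sample_hard_negative_texts(sim_i2t):
    # sim_i2t: (B, B) image-to-text similarities from the ITC branch
    weights = sim_i2t.softmax(dim=1).clone()
    weights.fill_diagonal_(0)                        # exclude the true (positive) caption
    return torch.multinomial(weights, 1).squeeze(1)  # one hard negative text index per image

# The ITM head then classifies (image, positive text) as "match" and
# (image, hard negative text) as "no match" on top of the image-grounded
# text encoder's multimodal embedding.
```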

3. Language Modeling Loss (LM)

Used to train an image-grounded text decoder for generating a text description of a given image

The model is trained with a cross-entropy loss, maximizing the likelihood of the text in an autoregressive manner

In addition, label smoothing of 0.1 is applied when computing the loss

Compared with the MLM loss commonly used in VLP, the LM loss gives the model better generalization ability for converting visual information into coherent captions (a loss sketch follows)
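
A minimal sketch of this objective, assuming the decoder already produced next-token logits (names here are illustrative), is:

```python
import torch.nn.functional as F

def lm_loss(logits, target_ids, pad_id=0):
    # logits: (B, L, V) decoder outputs; target_ids: (B, L) next-token targets
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        target_ids.view(-1),
        ignore_index=pad_id,      # ignore padding positions
        label_smoothing=0.1,      # label smoothing value stated above
    )
```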

4. Others

To make pre-training more efficient, the text encoder and text decoder share all parameters except the self-attention (SA) layers. The two differ as described below, while their intermediate layers (CA and FFN) play similar roles, so sharing those parameters improves training efficiency (see the sketch after this list):

  • The encoder uses bi-directional self-attention to build representations of the current input tokens
  • The decoder uses causal self-attention to predict the next token
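
One simple way to express this sharing scheme in PyTorch (a sketch under the assumption that each block exposes `self_attn`, `cross_attn`, and `ffn` submodules, as in the hypothetical MEDBlock above) is to build fresh decoder blocks and tie everything except the self-attention:

```python
def build_decoder_sharing_all_but_sa(encoder_blocks, make_block):
    """Create decoder blocks whose CA and FFN weights are shared with the encoder."""
    decoder_blocks = []
    for enc in encoder_blocks:
        dec = make_block()               # fresh block: its own (causal) self-attention weights
        dec.cross_attn = enc.cross_attn  # tie cross-attention parameters with the encoder
        dec.ffn = enc.ffn                # tie feed-forward parameters with the encoder
        decoder_blocks.append(dec)
    return decoder_blocks

# e.g. decoder_blocks = build_decoder_sharing_all_but_sa(encoder_blocks, MEDBlock)
```

Assigning the same module object ties the parameters, so gradients from both the encoding and decoding objectives update the shared CA/FFN weights while each task keeps its own SA weights.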

2.3 CapFilt


Because high-quality human annotation is expensive, existing human-labeled image-text datasets are small

Many current VLP methods therefore use image and alt-text pairs collected directly from the Internet. Alt-text follows a fixed format and often does not describe the image well, so it guides model learning poorly and introduces noise.

So the authors propose Captioning and Filtering (CapFilt), a method for improving the quality of the text corpus. It consists of two modules, both initialized from the pre-trained MED and fine-tuned on COCO:

  • captioner: an image-grounded text decoder fine-tuned with the LM objective; it decodes one text for each image, i.e. generates a synthetic caption for the given image
  • filter: an image-grounded text encoder fine-tuned with the ITC and ITM objectives; it learns whether an image and a text match and removes noisy texts from both the web texts and the generated texts: if the ITM head predicts that a text does not match its image, the text is considered noisy and removed. In effect, the filter keeps, from the web text and the synthetic text, the text that matches the image
  • Used jointly, the captioner first generates a caption for each image, and the filter then selects the better descriptions to update the text paired with each image, improving the caption quality of the web dataset
  • Finally, the filtered image-text pairs are combined with the human-annotated pairs to form a new dataset for pre-training (a high-level sketch of this bootstrapping loop follows this list)
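
A high-level, pseudocode-style sketch of this bootstrapping loop is given below; `captioner`, `filter_model`, and the dataset structure are illustrative stand-ins rather than the actual BLIP code.

```python
def capfilt(web_pairs, human_pairs, captioner, filter_model):
    """Bootstrap a cleaner pre-training corpus from noisy web image-text pairs."""
    bootstrapped = list(human_pairs)                    # human annotations are kept as-is
    for image, web_text in web_pairs:
        synthetic_text = captioner.generate(image)      # captioner: image-grounded text decoder
        for text in (web_text, synthetic_text):
            if filter_model.is_match(image, text):      # filter: image-grounded encoder (ITM head)
                bootstrapped.append((image, text))      # keep only texts predicted to match
    return bootstrapped                                 # used to pre-train a new model
```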

If this is hard to follow, look at Figure 4 below:

  • $T_w$: web text taken directly from the Internet for the corresponding image; web texts vary in quality
  • $T_s$: synthetic text generated by the captioner for the image; synthetic texts also vary in quality
  • An image keeps only text that matches it, so the filter's job is to decide which of $T_w$ and $T_s$ to keep; green marks text retained by the filter, red marks text filtered out
  • That is, the filter keeps, from the web text and the synthetic text, whichever matches the image


3. Results

3.1 Training Details

Pre-training is done on two 16-GPU nodes

  • The image transformer is a ViT pre-trained on ImageNet; two variants are explored, ViT-B/16 and ViT-L/16 (unless otherwise stated, experiments use ViT-B)
  • The text transformer is BERT
  • Batch sizes of 2880 (ViT-B) and 2400 (ViT-L) are used for pre-training, for 20 epochs
  • AdamW optimizer, weight decay = 0.05
  • The image resolution is 224x224 during pre-training and 384x384 during fine-tuning

Datasets:

  • Pre-training uses a total of 14 million images
    • Includes two human-annotated datasets (COCO and Visual Genome)
    • Three network datasets (Conceptual Captions, Conceptual 12M, SBU captions)
  • There is also a noisier web dataset, LAION, containing 115 million images in total, which is used in some experiments

3.2 Effect of CapFilt


Table 1 compares the effect of applying CapFilt on downstream tasks such as image-text retrieval and image captioning, for models pre-trained on different datasets.

  • Using only the captioner or only the filter on the 14M dataset already brings some performance improvement
  • Using the captioner and filter together works even better

The role of CapFilt:

  • Performance can be further improved with larger datasets and larger backbone networks, which verifies that CapFilt scales along both the model and data dimensions
  • In addition, using the larger ViT-L for the captioner and filter also improves the performance of the base model

Figure 4 shows examples of web texts and synthetic texts (green indicates text accepted by the filter, red indicates text rejected), illustrating the captioner's ability to synthesize captions and the filter's ability to remove noisy text.


3.3 Diversity is Key for Synthetic Captions

In CapFilt, the authors use nucleus sampling to generate synthetic text descriptions

  • Nucleus sampling is a stochastic decoding method: each token is sampled from the smallest set of tokens whose cumulative probability exceeds a threshold p (e.g. 0.9). This introduces greater sample diversity and hence more information that benefits learning, so it works better (see the sketch after this list)
  • Beam search is a deterministic decoding method: the highest-scoring text is selected, which reduces diversity
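
The sketch below contrasts one nucleus (top-p) sampling step with a greedy argmax step standing in for deterministic decoding; it is illustrative, not the exact BLIP decoding code.

```python
import torch

def nucleus_sample_step(logits, p=0.9):
    """Sample the next token from the smallest set of tokens whose probability mass reaches p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = cumulative - sorted_probs < p               # nucleus: tokens needed to reach mass p
    kept = sorted_probs * keep                         # zero out the low-probability tail
    choice = torch.multinomial(kept / kept.sum(), 1)   # random draw -> more diverse captions
    return sorted_idx[choice]

def greedy_step(logits):
    return logits.argmax(dim=-1)                       # deterministic -> safer but less diverse
```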


3.4 Parameter sharing and decoupling

During pre-training, the text encoder and decoder share all parameters except the self-attention layers. Table 3 compares several different parameter-sharing strategies, with pre-training on the 14M dataset and the LAION web dataset.

The results show:

  • Sharing all parameters except the SA layers performs better than not sharing, and it also reduces the number of model parameters and speeds up training
  • If the SA layers are also shared, performance drops because of the conflict between the encoding and decoding tasks


In CapFilt, the captioner and filter are fine-tuned separately on COCO

As shown in Table 4, the authors also tested sharing parameters between the captioner and filter, which degrades performance on downstream tasks

This is likely because, with shared parameters, noisy captions generated by the captioner are less likely to be filtered out by the filter: the ratio of text flagged as noisy drops from 25% to 8%.


3.5 Comparison with SOTA

1. Image-Text Retrieval

The authors evaluate image-to-text retrieval (TR) and text-to-image retrieval (IR) on the COCO and Flickr30K datasets

The pre-trained model is fine-tuned with the ITC and ITM losses

To improve efficiency, k candidates are first selected based on the image-text feature similarity, and these candidates are then re-ranked by their ITM scores (a sketch of this two-stage scheme follows)
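
A sketch of this coarse-to-fine ranking (function names and the `itm_score` callback are illustrative) looks like:

```python
import torch

def retrieve_texts(image_feat, text_feats, itm_score, k=128):
    # image_feat: (D,) unimodal image feature; text_feats: (N, D) unimodal text features
    sims = text_feats @ image_feat                             # stage 1: cheap ITC-style similarity
    topk = sims.topk(k).indices                                # keep k candidate captions
    scores = torch.tensor([itm_score(int(i)) for i in topk])   # stage 2: ITM matching score
    return topk[scores.argsort(descending=True)]               # final ranking by ITM score
```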

As shown in Tables 5 and 6, BLIP achieves good results



2. Image Captioning

NoCaps and COCO are used as evaluation datasets

The authors add a prompt, "a picture of", at the beginning of each caption (see the example below)
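
As an illustration, prompted caption generation can be run through the Hugging Face transformers BLIP integration (the checkpoint name and API below are assumed from the public model hub; the paper's own code is at the GitHub link above):

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")                 # any local image
inputs = processor(image, "a picture of", return_tensors="pt")   # prompt used as the caption prefix
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```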

As shown in Table 7, BLIP trained with 14M data outperforms other methods

The effect of BLIP trained with 129M data is similar to that of LEMON trained with 200M data


3. Visual Question Answering (VQA)

VQA requires the model to answer a question about an input image


4. Natural Language Visual Reasoning (NLVR2)

5. Visual Dialog (VisDial)



Origin blog.csdn.net/jiaoyangwm/article/details/130036782