VLP for multi-modal image-text tasks (4)

        Image-text retrieval, visual question answering (VQA), and image captioning are arguably the three most widely studied image-text tasks in the literature. They require AI systems to understand the content of both the input image and the input text. Inspired by the great success of language model pre-training, and aided by the convergence of architectures used in the NLP and CV communities, there has been a surge of research interest in developing VLP methods for image-text tasks. Specifically, a large number of image-caption pairs are fed into a model that processes both images and text, so that pre-training endows the model with rich multi-modal knowledge that facilitates downstream tasks.

In this chapter, we conduct a systematic review of this emerging training paradigm.

1 We outline representative VLP models and classify them into several categories.

2 We describe the Transformer-based model architecture for VLP and analyze the model design from multiple aspects, including image encoder, text encoder, multi-modal fusion, etc.

3 We introduce commonly used pre-training objectives and pre-training datasets.

4 We list several advanced research topics, including foundation models, multi-modal few-shot learning, unified VL modeling, knowledge in VLP, robustness evaluation, model compression, etc.

1. Overview of VLP models

We roughly divide VLP methods into two categories:

(i) dual encoder and (ii) fusion encoder.

1.1 Dual encoder

        For dual encoders, images and text are encoded separately, and modal interaction is handled only through a dot product (e.g., cosine similarity) between the image and text feature vectors. This architecture is very effective for image retrieval tasks, and when scaled up, a powerful image encoder can be learned from scratch via large-scale contrastive pre-training; representative examples include CLIP and ALIGN. However, due to the lack of deep multi-modal fusion, CLIP performs poorly on VQA and visual reasoning tasks.
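To make the dual-encoder interaction concrete, the following is a minimal sketch (not the actual CLIP or ALIGN implementation; dimensions and the temperature value are illustrative) of how similarity scores are computed from separately encoded features:

```python
import torch
import torch.nn.functional as F

def dual_encoder_similarity(image_feats, text_feats, temperature=0.07):
    """Score image-text pairs the way a dual encoder (CLIP/ALIGN-style) does.

    image_feats: (B, D) pooled image embeddings from the image encoder
    text_feats:  (B, D) pooled text embeddings from the text encoder
    Returns a (B, B) matrix of scaled cosine similarities; the diagonal
    holds the scores of the matching pairs.
    """
    image_feats = F.normalize(image_feats, dim=-1)  # unit norm -> dot product = cosine similarity
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feats @ text_feats.t() / temperature

# toy usage with random features
sim = dual_encoder_similarity(torch.randn(4, 512), torch.randn(4, 512))
print(sim.shape)  # torch.Size([4, 4])
```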

1.2 Fusion encoder

         For fusion encoders, in addition to the image encoder and text encoder, additional Transformer layers are typically used to model deep interactions between image and text representations. Major examples include UNITER, VinVL, SimVLM and METER.

Advantages: this fusion-encoder architecture achieves excellent performance on VQA and image captioning tasks.

Disadvantages: it is very inefficient when applied to image retrieval, because all possible image-text pairs (matching or not) must be encoded to compute the ranking similarity scores.

        Recent studies, such as ALBEF, UFO, and VLMo, have also shown that the dual-encoder and fusion-encoder designs can be combined into a single framework, making the model suitable both for fast image retrieval and for VQA and image captioning tasks.

        Among fusion encoder-based methods, we further classify the models into two categories based on whether they can be pre-trained end-to-end. This classification also roughly reflects how VLP methods have evolved over time. Specifically, most early VLP methods adopt a two-stage pre-training process, which first extracts image region features from a pre-trained object detector. More recently, end-to-end pre-training methods have become popular, where image features are extracted with convolutional neural networks (CNNs), vision Transformers (ViTs), or simply image patch embeddings, and gradients can be backpropagated into the visual backbone for end-to-end training. End-to-end VLP methods achieve new state-of-the-art results on all major VL tasks.

Evolution of representative VLP models for image-text tasks over time

Glossary of representative VLP models. OD: Object Detector. Xformer: Transformer. Emb: Embedding. MLM/MIM: Masked Language/Image Modeling. ITM: Image-Text Matching. ITC: Image-Text Contrastive Learning. WRA: Word-Region Alignment. TP: Token Prediction. CA: Contrastive Alignment. GC: Grounding + Captioning. (†) In many cases (such as Flamingo, CoCa, and GIT), the multi-modal fusion module itself also directly serves as the text decoder.

1.2.1 VLP models based on object detectors:

        Early methods use a pre-trained object detector (OD) to extract visual features. Among them, ViLBERT and LXMERT use co-attention for multi-modal fusion, in which two Transformers are first applied to the region features and text features respectively, and the representations of the two modalities are then fused by another Transformer at a later stage.

        

        On the other hand, VisualBERT, Unicoder-VL, VL-BERT and UNITER use a merged-attention module, feeding region features and text features into a single Transformer. OSCAR additionally feeds image tags into the Transformer, while VinVL uses a more powerful pre-trained object detector for feature extraction and demonstrates state-of-the-art performance on VL tasks.

      

        On the one hand, region features are object-level and semantically rich; on the other hand, extracting region features can be time-consuming, and the pre-trained object detector is usually kept frozen during pre-training, which may limit the capacity of the VLP model.

1.2.2 End-to-end VLP models:

        Researchers have explored different ways of pre-training VL models in an end-to-end manner. Specifically, we further divide these methods into two subcategories based on how the images are encoded.

  • CNN-based grid features. PixelBERT and CLIP-ViL directly feed grid features from a CNN, together with the text, into the Transformer. SOHO first discretizes the grid features using a learned visual dictionary, and then feeds the discretized features into the cross-modal module. While using grid features directly can be efficient, it often requires different optimizers for the CNN and the Transformer. For example, PixelBERT and CLIP-ViL use AdamW to optimize the Transformer and SGD to optimize the CNN.

  • ViT-based image patch features. In recent years, vision Transformers (ViTs) have become an increasingly active research topic in CV. Among them, ViLT directly feeds image patch features and text token embeddings into a pre-trained ViT model, and then pre-trains the model on image-text datasets. ViTCAP further extends ViLT to image captioning tasks. This also led to subsequent work such as UFO and VLMo: UFO uses the same Transformer for both image/text encoding and multi-modal fusion, while VLMo includes additional multi-modal expert layers. In addition, Visual Parsing, ALBEF, METER, BLIP, X-VLM and FIBER all use a ViT as the image encoder (e.g., plain ViT or Swin Transformer) and design different objectives for model pre-training.

        

1.3 Research progress

Research progress driven by large-scale VLP, using the VQA task as a case study. From August 2017 to August 2019, many task-specific methods were developed. Since August 2019, OD-based VLP models became popular. Subsequently, with the emergence of vision Transformers, end-to-end VLP models became mainstream.

        Now, we take the VQA task as a case study to illustrate the research progress driven by large-scale VLP.

  •  From August 2017 to August 2019, many task-specific methods were developed, including the use of object-centric visual features, advanced attention mechanism designs, object relationship modeling, and the application of Transformers. The corresponding VQA accuracy increased from about 66% to about 71%. 
  • From August 2019 to August 2021, vision-language pre-training (VLP) became the mainstream trend. It started with OD-based VLP models, which improved VQA accuracy from about 71% to about 78%; then end-to-end VLP methods based on convolutional networks and vision Transformers came to dominate the field. 
  • From August 2021 to August 2022, we witnessed the rapid development of large-scale multi-modal foundation models such as SimVLM, Florence, Flamingo, CoCa, GIT, and BEiT-3. When these models are scaled up in both model size and pre-training dataset size, VQA performance further improves, from about 80% to about 84%.

2. Model framework

        Given an image-text pair, the VL model first extracts text features w = \{w_1, \dots, w_N\} via a text encoder and visual features v = \{v_1, \dots, v_M\} via a visual encoder. Here, N is the number of tokens in the sentence and M is the number of visual features of the image, which can be the number of image regions/grids/patches, depending on the specific visual encoder used. The text and visual features are then fed to a multi-modal fusion module to produce cross-modal representations, which are then optionally fed into a decoder to generate the final output. A schematic diagram of this general framework is shown in the figure:

Schematic diagram of the general framework of Transformer-based visual language model

        In many cases, there is no clear boundary between the image/text backbone, the multi-modal fusion module and the decoder. In this paper, we refer to the part of the model that only receives image/text features as input as the corresponding visual/text encoder, and the part of the model that receives both image and text features as input as the multi-modal fusion module. Furthermore, if there are other modules that take multi-modal features as input to generate output, we call them decoders.
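The following is a minimal, hypothetical PyTorch skeleton of this general framework; the placeholder encoders, dimensions, and the VQA-style output head are assumptions for illustration rather than any specific model:

```python
import torch
import torch.nn as nn

class VLModel(nn.Module):
    """Minimal sketch of the general framework: visual encoder + text encoder
    + multi-modal fusion (+ an output head standing in for the decoder)."""
    def __init__(self, d_model=768, num_fusion_layers=6):
        super().__init__()
        self.visual_encoder = nn.Linear(2048, d_model)        # stands in for OD/CNN/ViT features -> v_1..v_M
        self.text_encoder = nn.Embedding(30522, d_model)      # stands in for a BERT-style encoder -> w_1..w_N
        fusion_layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_fusion_layers)  # merged-attention fusion
        self.head = nn.Linear(d_model, 3129)                  # e.g., a VQA answer classifier

    def forward(self, region_feats, token_ids):
        v = self.visual_encoder(region_feats)                 # (B, M, d)
        w = self.text_encoder(token_ids)                      # (B, N, d)
        x = self.fusion(torch.cat([w, v], dim=1))             # cross-modal representation
        return self.head(x[:, 0])                             # predict from the [CLS] position

out = VLModel()(torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 20)))
print(out.shape)  # torch.Size([2, 3129])
```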

2.1 Visual Encoder

There are three types of visual encoders:

(i) Object Detector (OD)

(ii) Ordinary convolutional neural network (CNN)

(iii) Vision Transformer (ViT)

OD:

        In VL research, the most widely used object detector is Faster R-CNN, pre-trained on the Visual Genome (VG) dataset as in BUTD. In VinVL, a more powerful OD model based on the ResNeXt-152 C4 architecture is pre-trained on multiple public OD datasets (including COCO, OpenImages, Objects365, and VG); with this stronger OD model, significant performance improvements are observed across a wide range of VL tasks. Additional steps are taken to encode the position information of each image region, usually represented as a 7-dimensional vector. The visual features and positional features are then fed into fully-connected (FC) layers to project them into the same embedding space. The final embedding of each region is obtained by summing the two FC outputs and then passing the result through a layer normalization.

Paper: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
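As a sketch of the region-embedding step described above (assuming the common convention that the 7-dimensional position vector contains the normalized box corners, width, height, and area; exact details vary across models):

```python
import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Project region features and 7-d box geometry into a shared space,
    sum them, and normalize (UNITER/VinVL-style region embedding sketch)."""
    def __init__(self, feat_dim=2048, pos_dim=7, d_model=768):
        super().__init__()
        self.feat_fc = nn.Linear(feat_dim, d_model)
        self.pos_fc = nn.Linear(pos_dim, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, region_feats, boxes, image_size):
        h, w = image_size
        x1, y1, x2, y2 = boxes.unbind(-1)
        # normalized corners, width, height, and relative area -> 7-d position vector
        pos = torch.stack([x1 / w, y1 / h, x2 / w, y2 / h,
                           (x2 - x1) / w, (y2 - y1) / h,
                           (x2 - x1) * (y2 - y1) / (w * h)], dim=-1)
        return self.norm(self.feat_fc(region_feats) + self.pos_fc(pos))

# toy usage: 36 regions per image; box values are random and only for shape checking
emb = RegionEmbedding()(torch.randn(2, 36, 2048), torch.rand(2, 36, 4) * 400, (480, 640))
print(emb.shape)  # torch.Size([2, 36, 768])
```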

CNN:

        In PixelBERT and SOHO, ResNet-50, ResNet-101 and ResNeXt-152 pre-trained on ImageNet classification are used. In CLIP-ViL, ResNet-50, ResNet-101 and ResNet-50x4 pre-trained with CLIP are used. SimVLM uses the first three blocks of ResNet-101 and ResNet-152 (excluding the Conv stem) as its base and large models respectively, and a larger variant of ResNet-152 with more channels as its huge model. It is generally observed that a stronger CNN backbone leads to stronger downstream performance.

ViT:

        The image is first divided into patches, which are then flattened into vectors and linearly projected to obtain patch embeddings. A trainable special [CLS] token embedding is also prepended to the sequence. These patch embeddings, together with learnable 1D position embeddings and optionally an image type embedding, are fed into a multi-layer Transformer block to obtain the final output image features. Different ViT variants have been studied for VLP, such as plain ViT, DeiT, BEiT, Swin Transformer and CLIP-ViT.
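A minimal sketch of this patch-embedding step (the optional image type embedding is omitted; patch size and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, linearly project them, prepend [CLS],
    and add learnable 1D position embeddings (ViT-style)."""
    def __init__(self, image_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # a strided convolution is equivalent to flatten-and-project per patch
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

    def forward(self, images):                                 # images: (B, 3, H, W)
        x = self.proj(images).flatten(2).transpose(1, 2)       # (B, num_patches, d)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed     # (B, 1 + num_patches, d)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```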

        

In short, regardless of the visual encoder used, the input image is represented as a set of feature vectors.

2.2 Text encoder

         Following BERT and RoBERTa, VLP models first segment the input sentence into a sequence of subwords, and then insert two special tokens at the beginning and end of the sentence to form the input text sequence. After obtaining the text embeddings, existing works either feed them directly into the multi-modal fusion module, or pass them through several text-specific layers before fusion. For the former, the fusion module is usually initialized with BERT, so the roles of text encoding and multi-modal fusion are entangled and absorbed into a single BERT model; in this case, we regard the text encoder as simply the word embedding layer.
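For example, with a BERT-style tokenizer from the Hugging Face transformers library (the checkpoint name and sentence are only for illustration), the two special tokens are [CLS] and [SEP]:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("A dog is playing in the park.")
# prints the subword sequence wrapped with the [CLS] and [SEP] special tokens
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```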

        Language model (LM) pre-training has achieved impressive performance on various tasks, and different pre-trained LMs have been proposed. In METER, the authors studied text encoding using BERT, RoBERTa, ELECTRA, ALBERT, and DeBERTa. In Flamingo, a giant pre-trained LM with 70B parameters is used as the text encoder and kept frozen during the VLP process for multi-modal few-shot learning.

        In short, regardless of the text encoder used, the input text is represented as a set of feature vectors.

2.3 Multi-modal fusion

        For dual encoders such as CLIP and ALIGN, fusion is performed via a dot product between the image and text feature vectors. A fusion encoder, by contrast, takes both v = \{v_1, \dots, v_M\} and w = \{w_1, \dots, w_N\} as input and learns contextualized multi-modal representations, denoted \widetilde{v} = \{\widetilde{v}_1, \dots, \widetilde{v}_M\} and \widetilde{w} = \{\widetilde{w}_1, \dots, \widetilde{w}_N\}. There are two main types of fusion modules: merged attention (single-stream) and co-attention (dual-stream). As shown in the figure:

Co-attention and merged-attention designs for multi-modal fusion

        

  • Single-stream mode (merged attention): the text features and visual features are simply concatenated together and then fed into a single Transformer block. Because the single-stream architecture fuses the multi-modal inputs with a single attention module, it is often called merged attention. It is also more parameter-efficient, since both modalities share the same set of parameters.

  • Dual-stream mode (co-attention): the visual and text features are not concatenated, but are fed independently into two different Transformer blocks. The two Transformer blocks do not share parameters; instead, cross-modal interaction is achieved through cross-attention, hence the name co-attention. A minimal sketch of both designs is given below.
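The sketch below contrasts the two fusion designs (layer counts, dimensions, and the omission of the co-attention blocks' own self-attention/feed-forward sublayers are simplifying assumptions):

```python
import torch
import torch.nn as nn

d, heads = 768, 12

# Merged attention (single-stream): concatenate the two sequences and run self-attention
# over the joint sequence with one shared set of parameters.
merged_block = nn.TransformerEncoderLayer(d, heads, batch_first=True)

def merged_attention(w, v):
    return merged_block(torch.cat([w, v], dim=1))

# Co-attention (dual-stream): each modality keeps its own block and attends to the other
# modality via cross-attention (self-attention and feed-forward sublayers omitted here).
text_cross = nn.MultiheadAttention(d, heads, batch_first=True)
image_cross = nn.MultiheadAttention(d, heads, batch_first=True)

def co_attention(w, v):
    w_out, _ = text_cross(query=w, key=v, value=v)    # text attends to image
    v_out, _ = image_cross(query=v, key=w, value=w)   # image attends to text
    return w_out, v_out

w, v = torch.randn(2, 20, d), torch.randn(2, 36, d)   # toy text/image features
print(merged_attention(w, v).shape, co_attention(w, v)[0].shape)
```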

        

        For region-based VLP models, co-attention and merged-attention modules achieve comparable performance; the merged-attention module is more parameter-efficient, since the same set of parameters is used for both modalities. For end-to-end VLP models, as shown in METER, co-attention performs better. However, there is no definitive conclusion on which module is better, and this remains largely an empirical design choice. In mPLUG, a combination of co-attention and merged-attention modules is used for multi-modal fusion. In BLIP and FIBER, fusion is performed by simply inserting cross-attention modules into the image and text backbones, which can be more lightweight and efficient. In MLP-ViL, the authors study multi-modal fusion with an MLP architecture.

        

2.4 Discussion: Unified modeling with a shared backbone

        Transformers have now become a general-purpose computation engine. In UFO, the authors use the same shared Transformer backbone for image/text encoding and multi-modal fusion. In MS-CLIP and VATT, a shared backbone is also used for contrastive pre-training across modalities. In VLMo, mixture-of-modality-experts layers are further added, while the same self-attention layers are shared for image/text encoding and multi-modal fusion. This mixture-of-experts design achieves good performance on multiple vision-language tasks.

2.5 Encoder-decoder vs encoder-only

Comparison of encoder-only and encoder-decoder model architectures

        Most VLP models adopt an encoder-only architecture, where the cross-modal representation is fed directly into an MLP-based output layer to produce the final output. This encoder-only design is naturally suited to VL understanding tasks such as VQA and visual reasoning. When used for image captioning, the same encoder acts as a decoder, generating the caption token by token using a causal attention mask.

        Inspired by T5 and BART in NLP, VL-T5, OFA, and DaVinci advocate a Transformer-based encoder-decoder architecture, where the cross-modal representation is first fed into a decoder and then into an output layer. In these models, the decoder attends to both the encoder representations and the previously generated tokens, producing the output autoregressively. The encoder-decoder architecture can unify various image-text tasks and is well suited to zero/few-shot learning with VLP models; it is also a natural choice for generation tasks. MDETR also adopts an encoder-decoder architecture, but its decoder is designed to generate bounding boxes in parallel, following the pioneering work of DETR.

3. Pre-training objectives

        Now, we introduce how pre-training objectives are designed. First, we review Masked Language Modeling (MLM) and Image-Text Matching (ITM), which are widely used in almost every VLP model. We then shift our focus to Image-Text Contrastive (ITC) learning and various Masked Image Modeling (MIM) objectives.

3.1 Masked Language Modeling (MLM)

        In VLP, MLM applied to image-text pairs has also proven useful. In MLM, given an image-text pair, we randomly mask the input words with 15% probability and replace the masked tokens \widetilde{w}_m with a special [MASK] token. The goal is to predict these masked tokens based on the surrounding words \widetilde{w}_{\setminus m} and the paired image \widetilde{v}, by minimizing the negative log-likelihood:

\mathcal{L}_{\mathrm{MLM}}(\theta) = -\mathbb{E}_{(\widetilde{w}, \widetilde{v}) \sim D} \log P_{\theta}(\widetilde{w}_m \mid \widetilde{w}_{\setminus m}, \widetilde{v})

        where θ denotes the trainable parameters, and each pair (\widetilde{w}, \widetilde{v}) is sampled from the whole training set D. Several MLM variants are used in VLP.

  • Seq-MLM: To adapt the pre-trained model to image caption generation, it has been observed to be beneficial to add a seq2seq causal mask during pre-training. That is, in Seq-MLM, the model can only use the preceding context to predict a masked token, which is consistent with how the model generates captions at inference time.
  • LM: BLIP and CoCa use a direct language modeling objective in VLP, where the model autoregressively predicts the image caption token by token, conditioned on the image.
  • Prefix-LM: SimVLM adopts an encoder-decoder framework and proposes a PrefixLM pre-training objective. A sentence is first split into two parts: bidirectional attention is enabled over the prefix sequence and the input image, while a causal attention mask is applied to the remaining tokens. A small sketch of these masking schemes is shown below.
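To make the different masking schemes concrete, here is a small illustrative sketch of the attention masks (using the convention that True means a position may be attended to; this convention is an assumption for illustration, not any particular codebase's):

```python
import torch

def causal_mask(n):
    """Seq-MLM / LM style: each token may attend only to itself and earlier tokens."""
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def prefix_lm_mask(prefix_len, total_len):
    """PrefixLM style (as in SimVLM): bidirectional attention over the prefix
    (e.g., the image plus the first part of the sentence), causal over the rest."""
    mask = causal_mask(total_len)
    mask[:, :prefix_len] = True   # the prefix is fully visible to every position
    return mask

print(causal_mask(4))
print(prefix_lm_mask(prefix_len=2, total_len=4))
```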

3.2 Image-Text Matching (ITM)

        In ITM, given a batch of matching or mismatching image-caption pairs, the model needs to identify which images and captions correspond to each other. Most VLP models treat image-text matching as a binary classification problem. Specifically, a special token (i.e., [CLS]) is prepended to the input sentence to learn a global cross-modal representation. The model is then fed either a matching or a non-matching image-caption pair <\widetilde{v}, \widetilde{w}> with equal probability, and a classifier is added on top of the [CLS] token to predict a binary label y indicating whether the sampled pair matches. Denoting the output score as s_{\theta}(\widetilde{v}, \widetilde{w}), we apply a binary cross-entropy loss for optimization:

\mathcal{L}_{\mathrm{ITM}}(\theta) = -\mathbb{E}_{(\widetilde{v}, \widetilde{w}) \sim D} \left[ y \log s_{\theta}(\widetilde{v}, \widetilde{w}) + (1 - y) \log \left(1 - s_{\theta}(\widetilde{v}, \widetilde{w})\right) \right]

        In addition to randomly sampling negative image-text pairs, more difficult negative pairs can also be mined through the image-text contrastive loss introduced below. This approach has been reported to be effective in improving downstream performance in ALBEF, VLMo and FIBER.

3.3 Image-Text Contrastive Learning (ITC)

        Early VLP models, such as UNITER and VinVL, did not use ITC in the pre-training stage (one exception is LightningDOT). Although the ITC loss had been extensively studied before VLP, in the context of end-to-end VLP it is mainly used by CLIP and ALIGN to pre-train dual encoders; subsequently, it has also been used to pre-train fusion encoders, as in ALBEF. Note that the ITC loss is applied to the outputs of the image and text encoders before multi-modal fusion (i.e., it uses v and w rather than \widetilde{v} and \widetilde{w}). Specifically, given a batch of N image-text pairs, ITC aims to predict the N matched pairs out of all N^2 possible image-text pairs. For the image-to-text and text-to-image directions, we have:

\mathcal{L}_{i2t} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(s(v_i, w_i)/\sigma\right)}{\sum_{j=1}^{N} \exp\left(s(v_i, w_j)/\sigma\right)}, \qquad \mathcal{L}_{t2i} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(s(w_i, v_i)/\sigma\right)}{\sum_{j=1}^{N} \exp\left(s(w_i, v_j)/\sigma\right)}

        Here, σ is a learned temperature hyperparameter, s(\cdot, \cdot) denotes the similarity (dot product) between the projected image and text features, and \mathcal{L}_{i2t} and \mathcal{L}_{t2i} are the image-to-text and text-to-image contrastive losses, respectively. The ITC loss can be further enhanced through triple contrastive learning (TCL), a multi-modal learnable codebook (CODIS), or loop interaction between ITC and ITM (LoopITR).
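A compact sketch of this symmetric contrastive loss (CLIP/ALBEF-style in spirit; the function name, dimensions, and the fixed temperature are illustrative assumptions, and in practice σ is a learned parameter):

```python
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, sigma=0.07):
    """Symmetric image-text contrastive loss over a batch of N matched pairs.

    image_feats, text_feats: (N, D) L2-normalized features from the two encoders,
    taken before multi-modal fusion. Matching pairs share the same batch index.
    """
    logits = image_feats @ text_feats.t() / sigma         # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text-to-image direction
    return (loss_i2t + loss_t2i) / 2

v = F.normalize(torch.randn(8, 256), dim=-1)
w = F.normalize(torch.randn(8, 256), dim=-1)
print(itc_loss(v, w))
```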

3.4 Masked Image Modeling (MIM)

        Similar to the MLM objective, researchers have studied various masked image modeling (MIM) objectives for pre-training. Specifically, the model is trained to reconstruct the masked patches or regions \widetilde{v}_m given the remaining visible patches or regions \widetilde{v}_{\setminus m} and all of the words \widetilde{w}.

The design of MIM can be divided into two categories.

For OD-based VLP models

        Examples include LXMERT and UNITER, in which some input regions are randomly masked (i.e., the visual features of the masked regions are replaced with zeros), and the model regresses the original region features by minimizing a mean squared error loss. Researchers have also tried first using the pre-trained object detector to generate an object label for each region, which carries high-level semantic information, and then training the model to predict the object labels of the masked regions instead of their raw region features.
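For reference, the regression variant of this objective is typically written as follows (following UNITER's masked region feature regression; h_\theta denotes the model's prediction for a masked region and r(\cdot) its original ROI feature):

\mathcal{L}_{\mathrm{MRFR}}(\theta) = \mathbb{E}_{(\widetilde{w}, \widetilde{v}) \sim D} \sum_{i} \left\| h_{\theta}\big(\widetilde{v}_m^{(i)}\big) - r\big(\widetilde{v}_m^{(i)}\big) \right\|_2^2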

For end-to-end VLP models

        Examples include ViLT and METER, where researchers have studied masked image modeling via masked patch regression or classification. In particular,

  • For MIM with discrete VQ tokens, inspired by BEiT, the discrete VQ tokens of the input patches are first extracted, and the model is then trained to reconstruct these discrete tokens. Specifically, each image is first decomposed into a sequence of discrete tokens using the VQ-VAE model from DALL-E. The image is then resized so that the number of patches equals the number of tokens, so that each patch corresponds to one discrete token. We then randomly mask 15% of the patches and feed the masked image patches into the model as before, but the model is now trained to predict the discrete tokens rather than the masked patches themselves.
  • For MIM with in-batch negative samples, imitating MLM over a text vocabulary, a dynamic vocabulary is built from in-batch negatives, and the model is trained to select the original patch for each masked patch. Specifically, in each training step we sample a batch of image-caption pairs \{<v_k, w_k>\}_{k=1}^{B}, where B is the batch size, and regard all patches in \{v_k\}_{k=1}^{B} as candidate patches. We randomly mask 15% of the input patches; for each masked patch, the model needs to select the original patch from the set of candidate patches, and is trained to maximize its probability, similar to noise contrastive estimation.

        It is worth noting that recent state-of-the-art VLP models (e.g., VinVL, ALBEF, VLMo) do not use MIM during pre-training, and in ViLT and METER the authors also show that MIM does not contribute to downstream performance. However, some recent studies adopt masked vision-language modeling (e.g., MaskVLM and VL-BEiT), randomly masking patches/tokens in one modality while keeping the other modality intact.

3.5 Other Pre-training Tasks

In addition to the typical pre-training tasks presented above, the researchers also explored other possibilities. For example:

  • UNITER proposes a word-region alignment objective that aligns image and text features using optimal transport.
  • In E2E-VLP, MDETR, GLIP and X-VLM, bounding box prediction and phrase localization from object detection are directly used as fine-grained pre-training tasks.

References:

Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

LXMERT: Learning Cross-Modality Encoder Representations from Transformers, 2019

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019

UNITER: Universal Image-Text Representation Learning

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021
