The Chinese Academy of Sciences proposes an overview of Vision-Language Pre-training (VLP): learn about the latest developments in multimodality!


Authors: Feilong Chen et al.

Reprinted from: Heart of the Machine | Editor: Chen Ping

Learn about the latest advances and new areas in vision-language pretraining in this article.

Getting machines to respond in ways similar to humans has long been a goal of AI research. To give machines the ability to perceive and think, researchers have conducted a series of related studies, such as face recognition, reading comprehension, and human-machine dialogue, using these tasks to train and evaluate machine intelligence in specific areas. Typically, domain experts build standard datasets by hand and then train and evaluate the relevant models on those datasets. However, due to the limitations of related techniques, obtaining strong models often requires a large amount of labeled training data.

Pre-trained models based on the Transformer architecture alleviate this problem. They are first pre-trained with self-supervised learning on large-scale unlabeled data to learn a general representation, and they can then achieve surprisingly strong results when fine-tuned on downstream tasks with only a small amount of manually labeled data. Since BERT was applied to NLP tasks, various pre-trained models have developed rapidly in the unimodal domain, such as Vision Transformer (ViT) and Wav2Vec. Extensive work has shown that they benefit downstream unimodal tasks and avoid training new models from scratch.

Similar to the unimodal domain, the multimodal domain also suffers from a shortage of high-quality annotated data. A natural question is whether the above pre-training methods can be applied to multimodal tasks. Researchers have explored this question and made significant progress.

In this paper, researchers from the Institute of Automation, Chinese Academy of Sciences and the University of Chinese Academy of Sciences survey recent advances and new frontiers in vision-language pre-training (VLP), including image-text pre-training and video-text pre-training. VLP learns the semantic correspondence between different modalities by pre-training on large-scale data. For example, in image-text pre-training, the model is expected to associate the word "dog" in text with the appearance of dogs in images; in video-text pre-training, it is expected to map objects and actions in text to objects and actions in video.


Paper address: https://arxiv.org/abs/2202.09061

To achieve this goal, researchers need to carefully design pre-training objectives and model architectures that allow the model to mine the associations between different modalities.

To give readers a better overall grasp of VLP, the survey first reviews its recent progress from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. It then summarizes specific VLP models in detail and finally discusses new frontiers of VLP. To the authors' knowledge, this is the first survey in the field of VLP, and they hope it can shed light on future research in the area.

Overview of VLP

A review of five aspects of VLP and its recent progress

In terms of feature extraction: the paper introduces how VLP models preprocess and represent images, videos, and text to obtain the corresponding features.

To generate visual or textual representations, VLP models can randomly initialize standard transformer encoders or, to take full advantage of unimodal pre-trained models, encode features with them: on the vision side, ViT patch features (ViT-PF) are encoded with pre-trained vision transformers such as ViT and DeiT; on the text side, text features are encoded with a pre-trained text transformer such as BERT. For simplicity, the study names these transformers Xformer.
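As a concrete illustration of this feature-extraction step, below is a minimal PyTorch sketch using the HuggingFace transformers library; the checkpoints and shapes are illustrative assumptions rather than the choices of any particular VLP model.

```python
# Minimal sketch: obtaining unimodal features from pre-trained "Xformers".
# The checkpoints below are illustrative placeholders, not those of a specific VLP model.
import torch
from transformers import ViTModel, ViTImageProcessor, BertModel, BertTokenizer

# Vision side: a pre-trained vision transformer yields patch features (ViT-PF).
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

# Text side: a pre-trained text transformer (e.g., BERT) yields token features.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def extract_features(image, caption):
    pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
    visual_feats = vit(pixel_values).last_hidden_state   # (1, 1 + num_patches, 768)
    text_inputs = tokenizer(caption, return_tensors="pt")
    text_feats = bert(**text_inputs).last_hidden_state   # (1, seq_len, 768)
    return visual_feats, text_feats
```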

In terms of model architecture: the paper introduces VLP model architectures from two perspectives: (1) from the perspective of multimodal fusion, comparing single-stream and dual-stream architectures; (2) from the perspective of overall architecture design, comparing encoder-only and encoder-decoder designs.

A single-stream architecture combines the textual and visual features and feeds them into a single transformer block, as shown in Figure 1(a) below. It fuses the multimodal inputs by applying attention over the merged sequence, and it is more parameter-efficient because both modalities share the same set of parameters.

A dual-stream architecture does not combine the textual and visual features; instead, they are fed independently into two separate transformer blocks that do not share parameters, as shown in Figure 1(b). For higher performance, cross-attention (shown by the dashed lines in Figure 1(b)) can be used to enable cross-modal interaction. For higher efficiency, the cross-attention between the visual and textual transformer blocks can also be omitted.

[Figure 1: (a) single-stream and (b) dual-stream VLP architectures]
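To make the contrast between the two fusion styles concrete, here is a small PyTorch sketch; the layer counts, hidden sizes, and the direction of the cross-attention are illustrative assumptions, not the design of any specific model.

```python
# Sketch of single-stream vs. dual-stream fusion; all hyperparameters are illustrative.
import torch
import torch.nn as nn

d_model = 768

# (a) Single-stream: concatenate text and visual tokens and run one shared transformer.
single_stream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True),
    num_layers=6,
)

def single_stream_fusion(text_feats, visual_feats):
    fused_input = torch.cat([text_feats, visual_feats], dim=1)  # one joint sequence
    return single_stream(fused_input)                           # both modalities share parameters

# (b) Dual-stream: one transformer per modality, optionally bridged by cross-attention.
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True), num_layers=6)
visual_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=12, batch_first=True), num_layers=6)
cross_attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=12, batch_first=True)

def dual_stream_fusion(text_feats, visual_feats):
    t = text_encoder(text_feats)
    v = visual_encoder(visual_feats)
    # Cross-modal interaction: text tokens attend to visual tokens (the dashed lines in Fig. 1(b)).
    # Dropping this call corresponds to the more efficient variant without cross-attention.
    t_fused, _ = cross_attention(query=t, key=v, value=v)
    return t_fused, v
```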

Many VLP models adopt an encoder-only architecture, in which the different modal representations are fed directly into the output layer. In contrast, other VLP models advocate a transformer encoder-decoder architecture, in which the different modal representations are first fed into the decoder and then into the output layer.

In terms of pre-training objectives: the paper pre-trains VLP models with different pre-training objectives and summarizes them into four categories: completion, matching, temporal, and particular types.

Completion refers to reconstructing masked elements from the unmasked parts. Take masked language modeling (MLM) as an example: first proposed by Taylor, it became widely known as a pre-training task through BERT. MLM in VLP models is similar to MLM in pre-trained language models (PLMs), but the masked text tokens are predicted not only from the remaining text tokens but also from the visual tokens. As a rule of thumb, VLP models following BERT randomly mask each input text token at a rate of 15%, replacing the masked token with the special token [MASK] 80% of the time, with a random text token 10% of the time, and with the original token the remaining 10% of the time. However, in the paper "Should You Mask 15% in Masked Language Modeling?", Danqi Chen and colleagues at Princeton University found that with an efficient pre-training scheme, masking 40-50% of the input text can yield better downstream performance than the default 15%.
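A minimal sketch of this 80/10/10 corruption scheme at a 15% mask rate is shown below; it ignores special tokens and padding, and the token IDs are illustrative.

```python
# Sketch of BERT-style text masking for MLM; special tokens/padding are not handled here.
import torch

def mask_text_tokens(input_ids, mask_token_id, vocab_size, mask_rate=0.15):
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_rate   # choose ~15% of positions
    labels[~masked] = -100                             # only masked positions enter the MLM loss

    # Of the masked positions: 80% -> [MASK], 10% -> random token, 10% -> left unchanged.
    use_mask = masked & (torch.rand(input_ids.shape) < 0.8)
    use_random = masked & ~use_mask & (torch.rand(input_ids.shape) < 0.5)

    corrupted = input_ids.clone()
    corrupted[use_mask] = mask_token_id
    corrupted[use_random] = torch.randint(vocab_size, input_ids.shape)[use_random]
    return corrupted, labels
```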

Masked vision modeling (MVM), like MLM, samples visual (image or video) regions or patches and masks their visual features, typically with a probability of 15%. The VLP model must then reconstruct the masked visual features given the remaining visual features and all of the textual features.
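One common instantiation of MVM regresses the original features of the masked regions. The sketch below assumes zero-masking of the selected features and an MSE reconstruction objective, which are illustrative choices rather than the only variants used in practice.

```python
# Sketch of masked vision modeling as feature regression; shapes and the objective are illustrative.
import torch
import torch.nn.functional as F

def mask_visual_features(visual_feats, mask_rate=0.15):
    """visual_feats: (B, N, D) patch or region features."""
    masked = torch.rand(visual_feats.shape[:2]) < mask_rate   # sample ~15% of visual tokens
    corrupted = visual_feats.clone()
    corrupted[masked] = 0.0                                   # zero out the masked visual features
    return corrupted, masked

def mvm_loss(original_feats, reconstructed_feats, masked):
    # Regress the original features only at the masked positions, given the
    # remaining visual features and all textual features (handled by the VLP model).
    return F.mse_loss(reconstructed_feats[masked], original_feats[masked])
```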

Vision-language matching (VLM) is the most commonly used pre-training objective for aligning vision and language. In single-stream VLP models, the representation of the special token [CLS] serves as the fused representation of the two modalities. In dual-stream VLP models, the visual representation of the special vision token [CLSV] and the textual representation of the special text token [CLST] are concatenated as the fused representation. The VLP model feeds this fused representation into a fully connected (FC) layer followed by a sigmoid to predict a score between 0 and 1, where 0 indicates a vision-language mismatch and 1 a match. During training, the VLP model samples positive or negative pairs from the dataset at each step.
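A minimal sketch of such a matching head is given below; the hidden size and the way the fused representation is obtained are illustrative assumptions.

```python
# Sketch of a vision-language matching (VLM) head; hyperparameters are illustrative.
import torch
import torch.nn as nn

class VLMHead(nn.Module):
    def __init__(self, d_model=768, single_stream=True):
        super().__init__()
        in_dim = d_model if single_stream else 2 * d_model
        self.fc = nn.Linear(in_dim, 1)
        self.single_stream = single_stream

    def forward(self, cls_repr=None, cls_v=None, cls_t=None):
        if self.single_stream:
            fused = cls_repr                              # [CLS] of the joint sequence
        else:
            fused = torch.cat([cls_v, cls_t], dim=-1)     # concatenate [CLSV] and [CLST]
        return torch.sigmoid(self.fc(fused))              # score in (0, 1): 1 = matched pair

# Training would typically apply binary cross-entropy to this score, with target 1 for
# sampled positive (matched) pairs and 0 for sampled negative (mismatched) pairs.
```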

In terms of pre-training datasets: most datasets for VLP are constructed by combining public datasets across multimodal tasks. Here, some mainstream corpora and their details are shown in Table 1 below.

[Table 1: mainstream pre-training corpora and their statistics]

In terms of downstream tasks: a wide variety of tasks require the fusion of visual and linguistic knowledge. This section of the paper introduces the basic details and goals of such tasks and divides them into five categories: classification, regression, retrieval, generation, and other tasks, where classification, regression, and retrieval tasks are also known as understanding tasks.

Classification tasks include Visual Question Answering (VQA), Visual Reasoning and Compositional Question Answering (GQA), Video-Language Inference (VLI), Natural Language Visual Reasoning (NLVR), Visual Commonsense Reasoning (VCR), and others. VQA takes an image or video as visual input and is usually treated as a classification task in which the model predicts the most suitable answer from a pool of choices. GQA can be seen as an upgraded version of VQA, aiming to advance research on visual reasoning over natural scenes. In VLI, given a video clip with aligned captions as a premise and a natural language hypothesis based on the video content, the model must infer whether the hypothesis contradicts the given video clip.
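As an illustration of treating VQA as classification over a fixed answer pool, here is a hypothetical fine-tuning head; the layer sizes and the answer-vocabulary size are illustrative assumptions.

```python
# Sketch of a VQA classification head over a fixed answer pool; sizes are illustrative.
import torch.nn as nn

class VQAClassifier(nn.Module):
    def __init__(self, d_model=768, num_answers=3129):   # e.g., a pool of frequent answers
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model * 2),
            nn.GELU(),
            nn.Linear(d_model * 2, num_answers),
        )

    def forward(self, fused_repr):
        # fused_repr: the fused image+question representation produced by the VLP model.
        return self.head(fused_repr)                      # logits over the answer pool
```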

In regression tasks, multimodal sentiment analysis (MSA) aims to detect sentiment in videos using multimodal signals such as vision and language, predicting the sentiment orientation of an utterance as a continuous intensity variable.

In retrieval tasks, vision-language retrieval (VLR) requires understanding both vision (image or video) and language with an appropriate matching strategy. It consists of two subtasks, vision-to-text retrieval and text-to-vision retrieval, where vision-to-text retrieval fetches the most relevant textual description from a larger pool of descriptions based on a visual query, and vice versa.
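A minimal sketch of vision-to-text retrieval by similarity ranking follows; the cosine-similarity scoring is an illustrative choice, and in practice single-stream models often score each candidate pair with their matching head instead of pre-computing embeddings.

```python
# Sketch of vision-to-text retrieval by ranking candidate descriptions; scoring is illustrative.
import torch
import torch.nn.functional as F

def retrieve_texts(visual_emb, text_embs, top_k=5):
    """visual_emb: (D,) embedding of the visual query; text_embs: (N, D) candidate descriptions."""
    sims = F.cosine_similarity(visual_emb.unsqueeze(0), text_embs, dim=-1)  # (N,) similarity scores
    return torch.topk(sims, k=top_k).indices   # indices of the most relevant descriptions
```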

In generation tasks, visual captioning (VC) aims to generate semantically and syntactically appropriate textual descriptions for a given visual (image or video) input. In addition, the paper introduces other downstream tasks such as multimodal machine translation (MMT), vision-language navigation (VLN), and optical character recognition (OCR).

SOTA VLP models

Image-text VLP models. VisualBERT, known as the first image-text pre-trained model, uses Faster R-CNN to extract visual features, concatenates the visual features with the text embeddings, and feeds the concatenated sequence into a single transformer initialized from BERT. Many VLP models follow feature extraction and architectures similar to VisualBERT while adjusting the pre-training objectives and pre-training datasets. Recently, VLMO leverages image patch embeddings and text word embeddings, feeding the combined embeddings together with modality experts into a single transformer, and achieves impressive performance. METER explores how to use unimodal pre-trained models and proposes a dual-stream architecture to handle multimodal fusion, achieving state-of-the-art performance on many downstream tasks.

Video-text VLP models. VideoBERT, known as the first video-text pre-trained model, extends BERT to handle video and text simultaneously. It uses a pre-trained ConvNet and S3D to extract video features and concatenates them with text word embeddings, which are fed into a transformer initialized from BERT. When training VideoBERT, the ConvNet and S3D are frozen, so the method is not end-to-end. More recently, inspired by ViT, Frozen and Region-Learner first process video clips into frames and obtain patch embeddings for each frame in the same way ViT processes an image; they are optimized end-to-end and achieve SOTA performance.

Table 2 below summarizes more existing mainstream VLP models:

[Table 2: a summary of mainstream VLP models]

Looking ahead, building on existing work, the researchers hope that VLP can be further developed in the following directions:

  • Incorporating acoustic information: most previous multimodal pre-training studies emphasize the joint modeling of language and vision while ignoring the information hidden in audio;

  • Knowledge learning and cognition: although existing VLP models achieve remarkable performance, they essentially fit large-scale multimodal datasets, and making VLP models more knowledgeable is important for future VLP;

  • Prompt tuning: by designing discrete or continuous prompts and using MLM for specific downstream tasks, these models can reduce the computational cost of fine-tuning a large number of parameters and bridge the gap between pre-training and fine-tuning.

 
  