ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Translator | Zhang Ruiqing

Unit | Natural Language Processing Laboratory, Northeastern University

From | Machine Translation Academy


About the authors


Wonjae Kim, Research Scientist at Naver AI Lab. Before joining Naver, worked as a research scientist at Kakao. Graduated from Seoul National University with a BSc and MSc in Computer Science and Engineering.


Bokyung Son, Research Scientist at Naver AI Lab. Previously worked as an artificial intelligence researcher at Kakao. Graduated from Seoul National University with a BA and MA in Linguistics, specializing in Natural Language Processing and Computational Linguistics.

Translator's note

Multimodality refers to the involvement of multiple different perception modalities or sources of information in information processing, transmission, and expression. These perceptual modalities can include language, vision, hearing, touch, etc., which work together to convey richer and more comprehensive information. In a multimodal system, information between different modalities can complement and interact with each other, thereby providing deeper and more comprehensive understanding and communication.

Taking human perception as an example, we usually receive multiple kinds of sensory information at the same time in daily life. When we watch a movie, we not only rely on visual information to understand the plot and characters, but also on auditory information (dialogue, sound effects), linguistic information (subtitles or dialogue), and emotional and tactile experiences. These signals interact and intertwine, together constituting our perception and understanding of the movie.

The concept of multimodality is also widely used in computer science and artificial intelligence. For example, in natural language processing, multimodal models can combine text and image information, enabling computers to better understand and generate rich content. In human-computer interaction, multimodal interfaces can allow users to interact with computers more naturally and conveniently through voice, touch, and gestures. In addition, multimodality also has broad application prospects in areas such as autonomous driving, medical diagnosis, and sentiment analysis.

Original link: https://arxiv.org/abs/2102.03334

Main text:

The alignment of semantics across different modalities has always been a key topic in multimodal artificial intelligence research. Multimodal data in the traditional sense includes visual data, text data, sound data, tactile data, and so on. As research has progressed, multimodal data has been further refined into image data, video data, natural-language text data, other text-like data (such as code), sound data, speech data, infrared data, 3D point cloud data, and many other forms.

1 Introduction


Figure 1. Vision Transformer overview

Data from different modalities have different semantic densities, different signal-to-noise ratios, and cover different ranges of knowledge. Aligning them therefore faces great difficulty. Taking the semantic alignment of vision-language data as an example, aligning data from different modalities requires solving two problems: alignment of form and alignment of content.

The first is the issue of formal alignment. ELMo [3], a classic work in NLP, proposed extracting contextual word representations through bidirectional language-model prediction. BERT [4] improved on this with a cloze-style sentence reconstruction task, exploiting the high parallelism and greater depth of the Transformer [5] to remove the manual-labeling cost of language data and pave the way for the era of large models. The rapid development of NLP in turn stimulated computer vision researchers to explore the potential of the Transformer for vision. The most influential result is Google's Vision Transformer (ViT) [6], which splits a 224*224 image into 14*14 patches, each of resolution 16*16, treats every patch as a token, and feeds the sequence to a Transformer encoder. With this, the Transformer unified the data format and computation of language data and visual data, solving the problem of formal alignment.
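As a rough illustration of this formal alignment (a hedged sketch, not code from the paper), the snippet below turns a 224*224 image into a sequence of patch tokens with a single strided linear projection; the embedding width of 768 follows common ViT-Base configurations and is an assumption here.

```python
import torch
import torch.nn as nn

# Patchify a 224x224 image into 14x14 = 196 tokens of 16x16 pixels each.
patch_size, embed_dim = 16, 768            # 768 is a ViT-Base-style width (assumption)
proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)        # one RGB image
tokens = proj(image)                       # (1, 768, 14, 14): one vector per patch
tokens = tokens.flatten(2).transpose(1, 2) # (1, 196, 768): a token sequence, like text
print(tokens.shape)                        # torch.Size([1, 196, 768])
```

Once images are expressed as token sequences of the same shape as text embeddings, the same Transformer machinery can process both modalities.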

The paper I want to share with you today, ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, is a pioneering multimodal Transformer work built on the developments above.

2 Related work


Figure 2. Four types of visual language models

In the first part of the paper, the authors summarize multimodal methods up to 2021 and classify them by their computational makeup. As shown in Figure 2, VE denotes the visual embedder, TE the textual embedder, and MI the modality interaction (fusion) module; the area of each box indicates its relative amount of computation.

Before the emergence of ViT, the visual part of vision-language models was dominated by CNN [8] backbones, and the VE part was almost always a pixel-level, CNN-based embedder, as in Pixel-BERT [7]. The paper notes that most vision-language research focused on improving the performance of the VE part, and because region features are usually cached in advance during training to reduce the burden of feature extraction, the drawbacks of an overweight VE part were often overlooked in academic work: in real application scenarios, extracting visual features incurs a huge overhead and inference time, and the resulting slow inference greatly limits practical usability. The authors also argue that such methods rely too heavily on the generalization ability and training-data volume of the CNN backbone, leaving considerable room for optimization.

Therefore, in this work the authors focus on a lightweight, fast visual embedding and use only a Transformer as the main body of the network to process the two modalities in a unified way. Unlike previous vision-language models, ViLT contains no convolutional network: by design, the deep embedder dedicated to visual input is removed, which significantly reduces model size and running time. As Figure 3 shows, the parameter-efficient model developed in this work runs tens of times faster than VLP (Vision-Language Pre-training) models that use region features, and more than four times faster than VLP models that use grid features. In terms of performance, there is no significant drop relative to those models, and ViLT is even better on some tasks.


Figure 3. Comparison of inference costs

The main contributions of the ViLT work are summarized by the authors as follows:

ViLT is the simplest visual-language model architecture so far. It uses Transformer to process visual and language features in a unified way. This design significantly reduces running time and improves parameter efficiency.

ViLT proves for the first time that CNN is not the only solution for vision-language tasks, and it achieves satisfactory performance on vision-and-language tasks without applying any CNN network.

ViLT demonstrates that using whole word masking and image augmentation can further enhance model performance in training for multimodal tasks.

3 Background

3.1 Classification of visual language models

The authors propose a taxonomy of vision-and-language models based on two points: (1) whether the two modalities are evenly expressive in terms of dedicated parameters and/or computation; (2) whether the two modalities interact in a deep network. Combinations of these two points lead to the four prototypes in Figure 2.

Visual Semantic Embedding (VSE) models such as VSE++ [9] and SCAN [10] belong to Figure 2a. They use different embedders for images and text, with the image embedder being much heavier. They then use simple dot products or shallow attention layers to represent the similarity of the embedded features from the two modalities.

CLIP [11] (Radford et al., 2021) belongs to Figure 2b because it uses a separate but equally expensive Transformer embedder for each modality. The interaction between the pooled image vector and text vector is still shallow (a dot product). Although CLIP performs well on image-to-text retrieval, the same level of performance is not observed on other vision-and-language downstream tasks. For example, fine-tuning an MLP head on NLVR2 [12] using the dot product of CLIP's pooled visual and text vectors as the multimodal representation gives a low dev accuracy of 50.99 ± 0.38 (over three different seeds); since chance-level accuracy is 0.5, the representation evidently cannot learn this task. This is consistent with the finding of Suhr et al. (2018) [13] that models which simply fuse multimodal representations fail to learn NLVR2.

This result supports the authors' speculation that even a simple fusion of outputs from high-performance unimodal embedders may not be sufficient for learning complex vision-and-language tasks, underscoring the need for a more rigorous scheme of inter-modal interaction.
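A minimal sketch of the kind of shallow-fusion probe described above (an illustration, not the paper's code): two frozen unimodal encoders produce pooled vectors, which are fused by a simple element-wise product before a small MLP classification head. The encoder interfaces, the fusion choice, and the dimensions here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ShallowFusionProbe(nn.Module):
    """Illustrative probe: frozen unimodal encoders + shallow fusion + MLP head."""
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.image_encoder = image_encoder   # assumed to return pooled (B, dim) vectors
        self.text_encoder = text_encoder
        for p in list(self.image_encoder.parameters()) + list(self.text_encoder.parameters()):
            p.requires_grad = False          # only the head is fine-tuned
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_classes))

    def forward(self, image, text):
        v = self.image_encoder(image)        # (B, dim) pooled visual vector
        t = self.text_encoder(text)          # (B, dim) pooled text vector
        fused = v * t                        # shallow fusion: element-wise product (one simple choice)
        return self.head(fused)              # logits, e.g. for NLVR2-style binary classification
```

All the cross-modal reasoning in such a probe has to happen in the tiny head after fusion, which is why it struggles on tasks like NLVR2.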

Unlike models with shallow interactions, the newer VLP models in Figure 2c use a deep Transformer to model the interaction of image and text features. However, beyond the interaction module, a convolutional network is still involved in extracting and embedding image features, and it accounts for most of the computation shown in Figure 3. Modulation-based vision-and-language models [14] also belong to Figure 2c: their visual CNN stems correspond to the visual embedder, the RNNs that produce modulation parameters correspond to the textual embedder, and the modulated CNN performs the modality interaction.

ViLT, proposed in this paper, is the first model in the Figure 2d category: its embedding layer for raw pixels is shallow and computationally light, on par with the text embedder. This architecture therefore concentrates most of the computation on modeling modality interaction.

3.2 Modal interaction

Transformer techniques are at the heart of current VLP models, which take sequences of visual and textual embeddings as input, model intra-modal and inter-modal interactions throughout their layers, and then output a sequence of contextualized features.

Bugliarello et al. [15] categorized interaction schemes into two types: (1) single-stream approaches (e.g., VisualBERT [16], UNITER [17]), where the layers operate jointly on a concatenation of image and text inputs; and (2) dual-stream approaches (e.g., ViLBERT [18], LXMERT [19]), where the two modalities are not concatenated at the input level. For the interaction Transformer module, the authors follow the single-stream approach, since the dual-stream approach introduces additional parameters.
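A hedged sketch of the single-stream idea (the dimensions, token counts, and modality-type embeddings below are illustrative assumptions, not ViLT's exact configuration): text tokens and image patch tokens, each tagged with a modality-type embedding, are concatenated into one sequence and processed by a single Transformer encoder, so intra- and inter-modal attention happen in the same layers.

```python
import torch
import torch.nn as nn

dim, n_text, n_patches = 768, 40, 196          # illustrative sizes
text_emb  = torch.randn(1, n_text, dim)        # word embeddings + positions (assumed precomputed)
image_emb = torch.randn(1, n_patches, dim)     # patch embeddings + positions (assumed precomputed)

# Modality-type embeddings distinguish the two segments inside one shared sequence.
type_emb = nn.Embedding(2, dim)
text_emb  = text_emb  + type_emb(torch.zeros(1, n_text, dtype=torch.long))
image_emb = image_emb + type_emb(torch.ones(1, n_patches, dtype=torch.long))

# Single-stream: one Transformer encoder attends over the concatenated sequence.
encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
fused = encoder(torch.cat([text_emb, image_emb], dim=1))   # (1, 236, 768)
```

A dual-stream design would instead run two separate stacks with cross-attention between them, which is where the extra parameters come from.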

3.3 Visual Embedding

All high-performance VLP models share the same text embedder, BERT's, but they differ in their visual embedders. Nevertheless, visual embedding remains the bottleneck of existing VLP models in most (if not all) cases. ViLT reduces this bottleneck by introducing patch projection instead of region or grid features, which require much larger extraction modules.

Region features. VLP models have mainly used region features, also known as bottom-up features [20]. They are obtained from an off-the-shelf object detector such as Faster R-CNN [21].

The general pipeline for generating region features is as follows. First, a region proposal network (RPN) proposes regions of interest (RoIs) based on grid features pooled from a CNN backbone. Non-maximum suppression (NMS) then reduces the number of RoIs to a few thousand. After being pooled by operations such as RoI Align [22], the RoIs pass through the RoI head and become region features. NMS is applied again per class, finally reducing the number of features to fewer than 100.
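As a rough, hedged sketch of that pipeline (the `backbone`, `rpn`, and `roi_head` callables are hypothetical placeholders, and the second, per-class NMS step is omitted for brevity), something like the following conveys why region features are expensive: every image must pass through a detector before the vision-language model even starts.

```python
from torchvision.ops import nms, roi_align

def extract_region_features(image, backbone, rpn, roi_head, iou_thr=0.7, top_k=100):
    """Illustrative region-feature extraction; callables are hypothetical stand-ins."""
    feat = backbone(image)                         # (1, C, H/16, W/16) grid features from a CNN
    boxes, scores = rpn(feat)                      # (N, 4) proposals and (N,) objectness scores
    keep = nms(boxes, scores, iou_thr)             # suppress heavily overlapping proposals
    boxes = boxes[keep]
    # RoI Align pools a fixed-size feature map per box from the shared grid features.
    pooled = roi_align(feat, [boxes], output_size=(7, 7), spatial_scale=1 / 16)
    region_feats = roi_head(pooled.flatten(1))     # (len(keep), D) region features
    return region_feats[:top_k]                    # detectors typically keep <100 regions per image
```

Each of these stages adds latency at inference time, which is exactly the cost that Figure 3 attributes to region-feature VLP models.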

The pipeline above involves several factors that affect performance and runtime: the backbone, the style of NMS, and the RoI head. Previous work has been lenient about controlling these factors, as follows:

• Backbone: ResNet-101 [1] and ResNeXt-152 [1] are two commonly used backbones.

• NMS: NMS is usually performed class by class. Applying NMS per class becomes a major runtime bottleneck when there are many classes, e.g. the 1.6K classes of the VG dataset [23]. Class-agnostic NMS [24] was recently introduced to address this issue.

• RoI head: C4 heads were used originally [20]; FC heads were introduced later [25]. The head imposes a significant runtime burden because it runs for every RoI.

However lightweight the object detection module becomes, it is unlikely to be lighter than the backbone alone or a single-layer convolution. Freezing the visual backbone and caching region features in advance only helps during training, not during inference, not to mention that it can hold back performance.

Grid features. Besides object-detection heads, the output feature grid of a convolutional network (such as a ResNet) can also be used as visual features for vision-and-language pre-training. Direct use of grid features was first explored by VQA-specific models [26], mainly to avoid the severely slow region-selection operations.

X-LXMERT [27] revisited grid features by fixing the target regions to a grid instead of using regions from a region proposal network. However, its caching of features precludes further tuning of the backbone.

Pixel-BERT is the only VLP model that replaces the VG-pretrained object detector with a ResNet-variant backbone pretrained on ImageNet classification. Unlike the frozen detectors in region-feature-based VLP models, the backbone of Pixel-BERT is tuned during vision-and-language pre-training. The downstream performance of Pixel-BERT with ResNet-50 falls below that of region-feature-based VLP models, while its variant with the much heavier ResNeXt-152 is comparable to those competitors.

Still, the authors argue that grid features are not the best choice either, since deep convolutional networks are expensive and account for a large share of the overall computation, as shown in Figure 3.

Patch projection. To minimize overhead, the authors adopt the simplest visual embedding scheme: a linear projection operating on image patches. ViT introduced patch-projection embedding for image classification. Patch projection drastically simplifies the visual embedding step down to the level of text embedding, which likewise consists of simple projection operations.

The authors use a 32 × 32 patch projection, which requires only 2.4M parameters. This is in stark contrast to complex ResNe(X)t backbones and object-detection components. Its running time is also negligible, as shown in Figure 3; a detailed runtime analysis appears in Section 4.6 of the original paper.
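As a quick, hedged sanity check of that parameter count (the hidden size of 768 is the ViT-B value and is an assumption here): a 32 × 32 patch projection is just a strided convolution, and counting its weights indeed gives roughly 2.4M.

```python
import torch.nn as nn

# 32x32 patch projection as a strided convolution (equivalent to a linear layer per patch).
hidden = 768                                   # ViT-B/32 hidden size (assumption)
patch_proj = nn.Conv2d(3, hidden, kernel_size=32, stride=32)

n_params = sum(p.numel() for p in patch_proj.parameters())
print(n_params)                                # 3*32*32*768 + 768 = 2,360,064 ≈ 2.4M
```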

4 ViLT

4.1 Model overview

The ViLT model proposed in this paper is a concise, single-stream vision-language model with the smallest VE module among all the methods discussed above. Refer to Figure 2 for the overall structure.

Somewhat counterintuitively, when the authors initialized model parameters to speed up training, directly using BERT weights for initialization worked poorly. Across many experiments, they tried initializing the interaction (IM) module from a pre-trained ViT while initializing ViLT's patch embedding with ViT's patch embedding, and found that initializing the interaction module from the pre-trained ViT works best. A structural difference between ViT and BERT lies in the position of layer normalization (LN): in ViT, LN comes before the multi-head attention and FC (MLP) sub-layers, whereas in BERT it comes after them. The pre-trained model used in this work is ViT-B/32, pre-trained on ImageNet.
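To make the LN-placement difference concrete, here is a minimal, hedged sketch of the two block styles (simplified: dropout and other details are omitted, and the sub-layer widths are illustrative):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """ViT-style block: LayerNorm is applied before attention and the MLP."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]    # LN -> attention -> residual
        x = x + self.mlp(self.norm2(x))  # LN -> MLP -> residual
        return x

class PostNormBlock(nn.Module):
    """BERT-style block: LayerNorm is applied after the residual additions."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])  # attention -> residual -> LN
        x = self.norm2(x + self.mlp(x))            # MLP -> residual -> LN
        return x
```

Because the two layouts place LN differently around the residual paths, BERT weights cannot simply be dropped into a ViT-style (pre-norm) interaction stack, which is consistent with ViT initialization working better here.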

4.2 Pre-training objectives

The authors train ViLT with two objectives commonly used for VLP models: image text matching (ITM) and masked language modeling (MLM).

ITM (Image Text Matching). With probability 0.5, the authors randomly replace the aligned image with a different image. A single linear ITM head projects the pooled output feature p onto the two classes, and the negative log-likelihood is computed as the ITM loss.
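A hedged sketch of the ITM objective as described (the pooled-feature extraction and batch construction are simplified assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

itm_head = nn.Linear(768, 2)                    # single linear layer over the pooled feature

def itm_loss(pooled, is_matched):
    """pooled: (B, 768) pooled multimodal features; is_matched: (B,) 1 if the image-text pair is aligned."""
    logits = itm_head(pooled)
    return F.cross_entropy(logits, is_matched)  # negative log-likelihood over the two classes

# During batch construction, each text keeps its paired image with probability 0.5
# and is otherwise paired with a randomly drawn replacement image (label 0).
```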

In addition, the authors use a word-patch alignment (WPA) idea based on the optimal-transport principle to assist the training of the ITM task.

MLM (Masked Language Modeling). The authors randomly mask text tokens with a probability of 0.15 and train the model to reconstruct the masked tokens, computing the loss in the same way as BERT.

In addition, unlike BERT, the authors apply whole word masking, so that masked words must be reconstructed with the help of visual information, strengthening the interaction between modalities.
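A minimal sketch of whole word masking over WordPiece tokens (the tokens and helper below are illustrative, not ViLT's training code): sub-word pieces marked with "##" are grouped with the preceding piece, and the whole group is masked together so a word cannot be guessed from its surviving pieces.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Group WordPiece tokens into whole words and mask entire words together."""
    words, current = [], []
    for tok in tokens:
        if tok.startswith("##") and current:
            current.append(tok)          # continuation piece belongs to the previous word
        else:
            if current:
                words.append(current)
            current = [tok]
    if current:
        words.append(current)

    masked = []
    for word in words:
        if random.random() < mask_prob:
            masked.extend([mask_token] * len(word))  # mask every piece of the word
        else:
            masked.extend(word)
    return masked

# e.g. ["gi", "##raf", "##fe"] is either fully masked or fully kept, so the model
# cannot recover "giraffe" from a leftover sub-word piece and must use the image instead.
print(whole_word_mask(["a", "gi", "##raf", "##fe", "standing"]))
```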

4.3 Image Enhancement

In multimodal tasks, image augmentation is rarely used because it can change the semantics of an image: cropping, for example, may change the number of visible objects, and color-altering transformations may change the colors that the paired text refers to. However, augmentation is a well-known way to improve a model's robustness and generalization, so the authors still apply some augmentation methods that do not affect image semantics during fine-tuning.
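As an illustrative, hedged sketch only (this specific policy is an assumption for demonstration, not the paper's exact augmentation recipe), one could compose conservative transforms that largely preserve image semantics:

```python
from torchvision import transforms

# Conservative, semantics-preserving augmentations for fine-tuning (illustrative choice).
augment = transforms.Compose([
    transforms.RandomResizedCrop(384, scale=(0.8, 1.0)),   # gentle crops keep most objects in view
    transforms.ColorJitter(brightness=0.1, contrast=0.1),  # mild photometric changes, hue untouched
    transforms.ToTensor(),
])
```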

5 Experiments


Figure 4. Dataset

The specific experimental setup is not repeated here; this section briefly introduces the datasets and evaluation results. The work pre-trains on four datasets with a total of about 4M images and roughly 10M captions, as shown in Figure 4, and evaluates on VQAv2 and NLVR2; the results are shown in Figures 5-7 below.


Figure 5. Evaluation result-1


Figure 6. Evaluation result-2

The trend in the figures is clear: ViLT's main improvement is in time, greatly increasing inference speed. In terms of scores, ViLT does not degrade despite the large speedup, and even improves slightly on some tasks.


Figure 7. Ablation experiment

The ablation study in Figure 7 explores which design choices improve performance. Whole word masking and image augmentation each bring a few points of improvement.

6 Visualization effects


Figure 8. Visual multimodal alignment display

The authors visualize ViLT's cross-modal alignment on images. When more computation is allocated to the interaction module, the model shows strong alignment ability. The authors believe that their design of the WPA objective enhances the model's alignment ability.

References

1. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. "Deep Residual Learning for Image Recognition"

2.He K, Chen X, Xie S, et al. Masked autoencoders are scalable vision learners[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 16000-16009.

3.Sarzynska-Wawer J, Wawer A, Pawlak A, et al. Detecting formal thought disorder by deep contextualized word representations[J]. Psychiatry Research, 2021, 304: 114135.

4.Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.

5.Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.

6.Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020.

7. Huang Z, Zeng Z, Liu B, et al. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. arXiv preprint arXiv:2004.00849, 2020. http://arxiv.org/abs/2004.00849. DOI: 10.48550/arXiv.2004.00849.

8.Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020.

9.Faghri F, Fleet D J, Kiros J R, et al. Vse++: Improving visual-semantic embeddings with hard negatives[J]. arXiv preprint arXiv:1707.05612, 2017.

10.Lee K H, Chen X, Hua G, et al. Stacked cross attention for image-text matching[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 201-216.

11.Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//International conference on machine learning. PMLR, 2021: 8748-8763.

12.Suhr, A., Zhou, S., Zhang, A., Zhang, I., Bai, H., and Artzi, Y. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491, 2018.

13.Suhr A, Zhou S, Zhang A, et al. A corpus for reasoning about natural language grounded in photographs[J]. arXiv preprint arXiv:1811.00491, 2018.

14.Perez E, Strub F, De Vries H, et al. Film: Visual reasoning with a general conditioning layer[C]//Proceedings of the AAAI conference on artificial intelligence. 2018, 32(1).

15. Bugliarello E, Cotterell R, Okazaki N, et al. Unmasked Multimodal Pretraining: Unifying the Vision and Language BERTs. 2020. DOI: 10.48550/arXiv.2011.15124.

16.Li, L. H., Yatskar, M., Yin, D., Hsieh, C.-J., and Chang, K.-W. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.

17. Chen, Y.-C., Li, L., Yu, L., Kholy, A. E., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. Uniter: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.

18.Lu, J., Batra, D., Parikh, D., and Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for visionand-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23, 2019.

19.Tan, H. and Bansal, M. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.

20.Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086, 2018.

21.Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence, 39(6):1137–1149, 2016.

22. He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969, 2017.

23. Jiang, H., Misra, I., Rohrbach, M., Learned-Miller, E., and Chen, X. In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10267–10276, 2020.

24.Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., and Gao, J. Vinvl: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529, 2021.

25. Jiang, Y., Natarajan, V., Chen, X., Rohrbach, M., Batra, D., and Parikh, D. Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956, 2018.

26. Jiang, H., Misra, I., Rohrbach, M., Learned-Miller, E., and Chen, X. In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10267–10276, 2020.

27.Cho, J., Lu, J., Schwenk, D., Hajishirzi, H., and Kembhavi, A. X-lxmert: Paint, caption and answer questions with multi-modal transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8785–8805, 2020.


