[Paper & Model Explanation] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

0 Preface

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Paper URL: https://arxiv.org/abs/2102.03334
Source URL: https://github.com/dandelin/vilt

1 Summary

  Vision-and-Language Pre-training (VLP) improves performance on a variety of joint vision-and-language downstream tasks. Current VLP methods rely heavily on the image feature extraction process (historically, the better the visual backbone, the better the final result), and most of them involve region supervision (e.g. object detection, the Region Feature path in Figure 1) or convolutional architectures (e.g. ResNet, the Grid Feature path in Figure 1). The authors find problems in two respects:
(1) Efficiency/speed. Extracting image features takes far more time than the multimodal fusion itself. Intuitively, handling each single modality well does not guarantee a good multimodal result; multimodal performance depends mostly on how well the fusion is done.
(2) Expressive power. If image features come only from a frozen pre-trained extractor, the capability of the multimodal model is limited. For example, a pre-trained object detector is trained on datasets with few categories, so the classes it can detect are very limited and cannot cover everything. A model that is not end-to-end and merely consumes features from a pre-trained extractor is therefore very likely not the optimal solution.

  In this paper, the authors propose a minimal VLP model, the Vision-and-Language Transformer (ViLT), minimal in the sense that the processing of visual input is drastically simplified to the same convolution-free scheme used for text input. The authors show that ViLT is tens of times faster than previous VLP models, yet performs competitively or better on downstream tasks.

Figure 1: Comparison of traditional VLP architectures and ViLT.

In terms of images, there are three categories:

  1. Region Feature: e.g. ViLBERT, UNITER, which extract region features. Given an image, a convolutional backbone (e.g. ResNet-50, ResNet-101) produces feature maps, and then region operations (RoI pooling, NMS, etc., the dark purple ~810 ms block in Figure 1) yield a set of region features. This is essentially an object detection task: the output is a set of bounding boxes, all in discrete form, which can be thought of as a sequence. The drawback is obvious: as the figure shows, with the detection-based approach the whole model takes about 900 ms per example, of which the visual part takes 885 ms (75 + 810) and the text part only 15 ms. To achieve good results, the model spends far too many resources on vision.
  2. Grid Feature: e.g. Pixel-BERT, which uses a ResNet pre-trained on ImageNet and feeds the resulting feature map, flattened into a discrete sequence, to a Transformer (much like the ViT hybrid). Only the CNN backbone remains; the detection-related steps such as RoI pooling and NMS (non-maximum suppression) are removed. This greatly shortens the running time, but the performance drops too much, so it is not satisfactory.
  3. Patch Projection: the method used by ViLT. It is the same as the preprocessing in the Vision Transformer (ViT): a linear embedding layer turns each image patch into a token. The visual part of ViLT takes only about 0.4 ms, a huge reduction compared with traditional models, while the performance does not drop much.

On the text side, these models are basically the same: the text is turned into word tokens through an embedding matrix.
After obtaining the visual sequence and the text sequence, both are fed into the Modality Interaction module (essentially a Transformer) for cross-modal fusion.

  Although ViLT's inference time is drastically shorter than that of traditional models, its training time is not short, and is even longer than that of many earlier methods. Nor does ViLT outperform the earlier Region Feature methods. ViLT's main achievement is its exceptionally short inference time.


2 Introduction

  In previous studies, pre-training is basically done on image-text pairs, and the objective functions are essentially image-text matching and masked language modeling (the mask-and-predict objective used in NLP models such as BERT); although some works add other objectives, these two are used in virtually all VLP models. The pre-trained model is then fine-tuned on downstream tasks, which usually involve both modalities.
  For VLP, the text side will undoubtedly use a Transformer, so the image pixels need to be turned into some form of discrete, semantically rich feature representation, so that the image can be aligned with the language tokens and fed to the Transformer together. In the Vision Transformer, the image is split into 16×16 patches, which are then passed to the Transformer. Before ViT was proposed, most VLP work still relied on an object detector; VLP models typically use an object detector pre-trained on the Visual Genome dataset (1,600 object categories and 400 attribute categories).

Why use an object detector?

  1. As mentioned above, VLP wants a discrete and semantically strong feature representation, and object detection provides exactly such a discretization: it returns bounding boxes, i.e. detected objects, which carry clear semantics and are discrete, and their features can be extracted directly with RoI pooling. A detected region can be thought of as playing the same role as a word in a sentence.
  2. It also matched the downstream tasks of VLP at the time, mainly VQA (Visual Question Answering), image captioning, image retrieval, etc., which have a very direct connection to, and a strong dependence on, objects.

  However, using an object detector to extract image features is too resource-intensive, so people began trying to reduce the computation here. One attempt is Pixel-BERT, which uses a ResNet pre-trained on ImageNet and feeds the resulting feature map, flattened into a discrete sequence, to a Transformer (like the ViT hybrid). The computation is then just the CNN backbone, with no detection-related RoI pooling or NMS (non-maximum suppression), so it runs much faster.

  But the authors argue this is still not enough: current VLP research still focuses on improving performance through better visual embedders. In experiments, the feature extraction time is often ignored, because during training the features of the whole dataset can be extracted once and cached on disk. In real applications, however, the data arrive in real time, so new features must be extracted on the fly; this cannot be cached in advance, and its cost cannot be ignored.

  So the authors focus on designing a lighter and faster way to extract image features. Following ViT (An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale; see [Paper & Model Explanation] Vision Transformer for details), the image is split into patches, and each patch is turned into an embedding by a linear projection layer, replacing the previously cumbersome image feature extraction pipeline.

Contributions of ViLT:

  • ViLT is the simplest vision-and-language model to date: apart from the Transformer used for modality fusion, no other networks are needed (previous work used a ResNet or even an object detection network). This gives ViLT a very short running time and far fewer parameters.
  • While reducing the computational complexity, ViLT suffers little or no performance degradation without using region features or a CNN, which previous work had not achieved.
  • More data augmentation is used during training, such as whole word masking on the text side and image augmentation on the image side, which improves the model's performance.

3 Background (Short Overview)

3.1 Vision-and-Language model classification

The authors classify current VLP models based on two points:

  1. Whether the expressive power devoted to images and text is balanced, i.e. the number of parameters and the amount of computation spent on each modality. Intuitively, images and text should be about equally important, and the image side should not dominate the text side as in most previous methods.
  2. How the two modalities are fused.

Based on the above two points, the author classifies VLP models into four categories, as shown in Figure 2 , where VE, TE, and MI represent visual embedder, textual embedder, and modality interaction, respectively.
[Figure 2: the four categories of vision-and-language models, compared by the relative size of VE, TE, and MI]

  • (a) VE > TE > MI: e.g. the VSE (visual semantic embedding) series: the text and fusion parts are light, but the image part is heavy.
  • (b) VE = TE > MI: e.g. CLIP: the expressive power spent on images and text is roughly the same and the computation is similar, while the modality fusion is very light, just a dot product between the two modalities' features. This is well suited to feature extraction and retrieval, but not to tasks such as VQA, because CLIP only produces the two modalities' features separately, whereas VQA needs to fuse them to capture their correspondence.
  • (c) VE > MI > TE: e.g. ViLBERT, UNITER: the image part uses object detection and the fusion part uses a Transformer. As mentioned above, this approach performs really well, achieving good results on various downstream tasks.
  • (d) MI > VE = TE: the model in this paper, ViLT, which follows the idea of ViT to make the image part very lightweight.

3.2 Modality fusion

There are two main categories:

  • single-stream approaches: use a single model. How are the two modal inputs handled? The simplest way is to concatenate the two inputs, merging the two sequences into one sequence, and feed that to one model.
  • dual-stream approaches: use two models. Each model first processes its own input to fully exploit the single-modality information, and fusion happens afterwards.

  The authors use the single-stream approach, i.e. they concatenate the features of the two modalities and pass them to the Transformer. The dual-stream approach uses two models and therefore requires more parameters.

3.3 Visual Embedding method

  On the text side, most models use the tokenizer and embedding layer of pre-trained BERT, so this part is the same everywhere and very lightweight. The text side therefore needs no further discussion; the focus is on visual feature extraction.

Region Feature
  A backbone (e.g. ResNet-101, ResNeXt-152) first extracts feature maps, some RoIs are extracted (e.g. through an FPN), NMS reduces them to a fixed number (which effectively becomes the sequence length), and an RoI head then turns each remaining bounding box into a one-dimensional feature vector. These vectors are the region features.
  This sounds reasonable: a continuous image is turned into discrete bounding boxes, each with its own features, to be matched against the text. But the whole process is very resource-intensive. Even though there are now fast, lightweight detectors, they are still not as fast as a simple backbone or a patch embedding.
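
  To make this pipeline concrete, here is a rough, self-contained sketch of the region-feature idea using torchvision's nms and roi_align on random tensors. The image size, thresholds, and the cap of 36 regions are illustrative assumptions, not the exact settings of any particular VLP model.

```python
import torch
from torchvision.ops import nms, roi_align

# Illustrative shapes only: a 400x400 image whose backbone feature map is 50x50 (stride 8).
fmap = torch.randn(1, 256, 50, 50)           # backbone/FPN feature map
xy = torch.rand(100, 2) * 200                # top-left corners of hypothetical proposals
wh = torch.rand(100, 2) * 200                # widths / heights
boxes = torch.cat([xy, xy + wh], dim=1)      # proposals as (x1, y1, x2, y2) inside the image
scores = torch.rand(100)                     # proposal confidences

keep = nms(boxes, scores, iou_threshold=0.5)[:36]                  # NMS, keep at most 36 regions
rois = torch.cat([torch.zeros(len(keep), 1), boxes[keep]], dim=1)  # prepend batch index for roi_align
region_feats = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=50 / 400)
region_tokens = region_feats.flatten(1)      # one vector per region: the "sequence" fed to the VLP model
print(region_tokens.shape)                   # (num_regions, 256 * 7 * 7)
```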

Grid Feature
  The authors list several Grid Feature methods, but they are still not lightweight enough, and their performance drops considerably.
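
  For contrast, a minimal sketch of the grid-feature idea (my own illustration, not Pixel-BERT's actual code): the ResNet feature map is simply flattened into a token sequence, with no detection steps at all.

```python
import torch
import torch.nn as nn
import torchvision

# Grid features: no detector, just the CNN backbone's feature map flattened into tokens.
resnet = torchvision.models.resnet50()                    # pre-trained on ImageNet in Pixel-BERT
backbone = nn.Sequential(*list(resnet.children())[:-2])   # drop the avgpool and fc layers

image = torch.randn(1, 3, 384, 640)                       # (B, C, H, W)
fmap = backbone(image)                                    # (1, 2048, 12, 20): stride-32 feature map
grid_tokens = fmap.flatten(2).transpose(1, 2)             # (1, 240, 2048): one token per grid cell
```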

Patch Projection
  ViLT uses ViT's patch projection embedding. Not only is the visual part much lighter, the model's performance is basically unchanged compared with the original Region Feature methods.
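
  And a minimal sketch of ViT-style patch projection, with hyperparameters assumed to follow ViT-B/32 (32×32 patches, hidden size 768): a strided convolution is equivalent to cutting the image into patches and applying one linear layer to each.

```python
import torch
import torch.nn as nn

patch_size, hidden_size = 32, 768
patch_embed = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 384, 640)                # (B, C, H, W)
patches = patch_embed(image)                       # (1, 768, 12, 20)
image_tokens = patches.flatten(2).transpose(1, 2)  # (1, 240, 768): one token per patch
```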


4 ViLT(Vision-and-Language Transformer)

4.1 Model

[Figure 3: ViLT model architecture]

As shown in Figure 3:

Input section:

The input is text and image respectively:

  • The text sequence is turned into word embeddings by the BERT tokenizer and embedding layer. If the text sequence length is L and the embedding dimension is H, the text input to the Transformer is an L×H matrix.
  • The image is first split into patches, and each patch becomes a token through patch embedding. If the image token sequence length is N and the embedding dimension is again H, the image input to the Transformer is an N×H matrix.

The figure below shows the tokens produced after the text and the image pass through their embedding layers.

[Figure: detail of the embedding layer in Figure 3]

  • The asterisks are the [CLS] tokens: the left one is the text [CLS] token, the right one is the image [CLS] token.
  • The gray parts are the modal-type embeddings: 0 for the text part and 1 for the image part, as shown in the figure. In the single-stream approach, text and image tokens are concatenated into one sequence as the Transformer input. If the model is not told which part is text and which is image, learning may be harder; with this information, the model can better learn the relationship between the text and the image.
  • Dark green and dark purple (the 0 1 2 3 4 5 6 in the figure) are the position embeddings for the text and the image, respectively.
  • Light green and light purple are the tokens produced by the text and image embedding layers, respectively.

  However, these three parts (gray: modal-type embedding; dark green/purple: token/patch position embedding; light green/purple: the text and image tokens) are not concatenated as the figure might suggest, but summed element-wise (see vilt/modules/vilt_module.py in the source code).

The concat step then concatenates the summed text part and the summed image part into a single sequence.

This is the input to the Transformer: the sequence length is 1 + L + 1 + N = 2 + L + N, so the entire input is a (2 + L + N) × H matrix.
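
  A minimal sketch of how this input is assembled, with hypothetical variable names and random tensors standing in for the learned embeddings (the real implementation is in vilt/modules/vilt_module.py): the token, position, and modal-type embeddings are summed per modality, and only then are the two parts concatenated.

```python
import torch
import torch.nn as nn

H, L, N = 768, 40, 200                 # hidden size, text length, number of image patches
text_tokens = torch.randn(1, L, H)     # word embeddings from the BERT tokenizer/embedder
image_tokens = torch.randn(1, N, H)    # patch embeddings from the linear projection

cls_text = torch.randn(1, 1, H)        # [CLS] token of the text part
cls_image = torch.randn(1, 1, H)       # [CLS] token of the image part
pos_text = torch.randn(1, L + 1, H)    # position embeddings (text)
pos_image = torch.randn(1, N + 1, H)   # position embeddings (image)
modal_type = nn.Embedding(2, H)        # 0 = text, 1 = image

text_part = torch.cat([cls_text, text_tokens], dim=1) + pos_text + modal_type(torch.tensor(0))
image_part = torch.cat([cls_image, image_tokens], dim=1) + pos_image + modal_type(torch.tensor(1))

transformer_input = torch.cat([text_part, image_part], dim=1)   # (1, 2 + L + N, H)
```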

Output part:

  Image Text Matching and Word Patch Alignment are both used to measure whether the text and the image match; Masked Language Modeling is applied to the text part.

  Image Text Matching is essentially a binary classification task: do the text and the image match? It uses only the output at the first position of the whole sequence (like a [CLS] token), not all of the outputs. The pooler in the figure is an H×H matrix that maps this 1×H vector to another 1×H vector, and after an FC layer the binary ITM prediction is made. Most VLP models use the ITM objective.
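
  A minimal sketch of this ITM head with hypothetical module names: pool the output at the first position through an H×H layer, then classify match vs. mismatch with a small FC layer.

```python
import torch
import torch.nn as nn

H = 768
pooler = nn.Sequential(nn.Linear(H, H), nn.Tanh())    # the H x H "pooler" in the figure
itm_head = nn.Linear(H, 2)                            # binary match / mismatch classifier

transformer_output = torch.randn(8, 2 + 40 + 200, H)  # (batch, 2 + L + N, H)
cls_feature = transformer_output[:, 0]                # output at the first position of the sequence
itm_logits = itm_head(pooler(cls_feature))            # (batch, 2)
itm_labels = torch.randint(0, 2, (8,))                # 1 = matching pair, 0 = mismatched pair
itm_loss = nn.functional.cross_entropy(itm_logits, itm_labels)
```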

  Word Patch Alignment also measures the similarity between text features and image features. Using optimal transport, the text outputs and the image outputs are treated as two probability distributions, and the distance between these two distributions is computed; the smaller the distance, the better.
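
  ViLT computes this distance with the IPOT algorithm; the sketch below uses plain Sinkhorn iterations instead, purely to illustrate the idea of treating the text and image outputs as two uniform distributions and measuring an optimal-transport distance between them (all names and values are my own).

```python
import torch
import torch.nn.functional as F

def ot_distance(text_feats, image_feats, eps=0.1, n_iters=50):
    """Entropy-regularized OT (Sinkhorn) distance between token sets.
    ViLT's Word Patch Alignment uses IPOT; this is only an illustration."""
    t = F.normalize(text_feats, dim=-1)    # (L, H)
    v = F.normalize(image_feats, dim=-1)   # (N, H)
    C = 1 - t @ v.T                        # cost matrix: pairwise cosine distance, (L, N)

    a = torch.full((C.size(0),), 1.0 / C.size(0))   # uniform weights over text tokens
    b = torch.full((C.size(1),), 1.0 / C.size(1))   # uniform weights over image patches

    K = torch.exp(-C / eps)
    u = torch.ones_like(a)
    for _ in range(n_iters):               # Sinkhorn fixed-point iterations
        w = b / (K.T @ u)
        u = a / (K @ w)
    plan = u[:, None] * K * w[None, :]     # transport plan between words and patches
    return (plan * C).sum()                # smaller distance = better aligned pair

wpa_distance = ot_distance(torch.randn(40, 768), torch.randn(200, 768))
```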

  Masked Language Modeling is the fill-in-the-blank objective: mask out a word and have the model reconstruct it. It is used in virtually all NLP pre-training.
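
  A minimal sketch of the MLM loss over the text positions (hypothetical names; the vocabulary size is that of BERT-base): only the masked positions contribute to the loss.

```python
import torch
import torch.nn as nn

H, vocab_size = 768, 30522
mlm_head = nn.Linear(H, vocab_size)              # predicts the original token id

text_output = torch.randn(8, 40, H)              # Transformer outputs at the text positions
labels = torch.randint(0, vocab_size, (8, 40))   # original token ids
labels[torch.rand(8, 40) > 0.15] = -100          # non-masked positions are ignored (-100)

logits = mlm_head(text_output)
mlm_loss = nn.functional.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100
)
```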

  One possible improvement would be to add a fill-in-the-blank objective on the image side as well. However, BEiT and MAE did not exist at the time and image reconstruction objectives were not yet effective, so the authors did not add one. Later, VL-BEiT added a reconstruction loss on the visual side.

4.2 Whole Word Masking

  Whole word masking masks out an entire word. The paper gives an example: the word "giraffe" tokenized with a BPE-style tokenizer becomes ["gi", "##raf", "##fe"], where each piece is a separate token. If only the middle token "##raf" is masked, giving ["gi", "[MASK]", "##fe"], the model can easily infer "##raf" from the text alone, since very few English words start with "gi" and end with "##fe"; the blank can be filled without looking at the image, and the loss loses its purpose. The authors therefore mask the whole word, i.e. remove "giraffe" entirely from the sentence. To reconstruct "giraffe", the model must now use the image information, which strengthens the connection between the image and the text.
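
  A sketch of the whole-word-masking idea on already tokenized wordpieces (my own illustration, not the authors' implementation): pieces starting with "##" are grouped with the word they continue, and a word is only ever masked as a whole.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Group '##' continuation pieces with their word and mask whole words only."""
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(i)            # continuation piece of the current word
        else:
            if current:
                words.append(current)
            current = [i]                # start a new word
    if current:
        words.append(current)

    masked = list(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:
                masked[i] = mask_token   # mask every piece of the chosen word
    return masked

tokens = ["a", "gi", "##raf", "##fe", "eating", "leaves"]
print(whole_word_mask(tokens))
# e.g. ['a', '[MASK]', '[MASK]', '[MASK]', 'eating', 'leaves'] --
# "gi", "##raf", "##fe" are always masked together, never individually.
```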

4.3 Image Augmentation

  In previous studies, little data augmentation was used in VLP models. The authors apply RandAugment but exclude two of its operations: color inversion, because the text often also contains color information, and cutout, because it may remove small but important objects scattered across the image. With these tweaks, augmentation brings clear gains.
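
  A rough PIL-based sketch of a RandAugment-style policy with those two operations left out; the op list, magnitudes, and ranges are illustrative, not the exact policy in the ViLT code.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# Candidate ops: RandAugment minus Invert and Cutout (the two excluded in ViLT).
OPS = [
    lambda img, m: ImageOps.autocontrast(img),
    lambda img, m: ImageOps.equalize(img),
    lambda img, m: img.rotate(30 * m),                            # rotate by up to 30 * m degrees
    lambda img, m: ImageOps.solarize(img, int(255 * (1 - m))),
    lambda img, m: ImageOps.posterize(img, max(1, 8 - int(4 * m))),
    lambda img, m: ImageEnhance.Contrast(img).enhance(1 + m),
    lambda img, m: ImageEnhance.Brightness(img).enhance(1 + m),
    lambda img, m: ImageEnhance.Sharpness(img).enhance(1 + m),
    lambda img, m: ImageEnhance.Color(img).enhance(1 + m),
]

def rand_augment(img, n_ops=2, magnitude=0.3):
    """Apply n_ops randomly chosen ops at the given magnitude."""
    for op in random.sample(OPS, n_ops):
        img = op(img, magnitude)
    return img

augmented = rand_augment(Image.new("RGB", (384, 640), "white"))
```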


5 Experiments

5.1 Dataset

5.2 Comparative experiment

Classification task

  The traditional Region Feature methods still perform relatively well. ViLT's running time is drastically shorter than that of traditional methods, and its performance is better than VisualBERT's. Compared with the best models, OSCAR and VinVL, ViLT remains competitive on VQAv2 but is clearly behind on NLVR2. ViLT's main achievement is that its combination of speed and accuracy beats previous models: inference is extremely fast and the accuracy is still very competitive.

Retrieval task

zero-shot:

[Table: zero-shot text/image retrieval results]

fine-tuning:

[Table: fine-tuned text/image retrieval results]

  In short, ViLT does not match the best previous models in raw performance, but it offers a better time/performance trade-off: the performance is slightly lower, while the simplicity and speed are greatly improved.

5.3 Ablation experiment

[Table: ablation study]
The Training Steps column shows that longer training improves model performance.
w indicates whether whole word masking is used; it improves the model, but only slightly.
m indicates whether the MPP objective is used (fill-in-the-blank on image patches, i.e. reconstructing the image); the experiments showed no gain from MPP, so the authors dropped it (BEiT and MAE had not been published yet, and image reconstruction was not yet effective).
a indicates whether RandAugment, i.e. data augmentation on the image, is used; it clearly improves model performance.

5.4 Comparison of VLP models


6 Conclusion

  This paper proposes a minimal VLP model, ViLT, which drops the heavy embedding pipelines used previously (such as Faster R-CNN region features or ResNet grid features) and uses only a simple patch embedding for image feature extraction. The whole model is simpler and faster, and the results are decent. Although ViLT's performance does not reach the SOTA, it shows that VLP can work without convolution or region supervision.

The authors suggest several possible future research directions:

  • Scalability: with a larger model and larger datasets, the results should be even better.
  • Masked Modeling for Visual Inputs: apply reconstruction objectives to images as well. BEiT and MAE now exist, and later papers have already improved ViLT in this direction.
  • Augmentation Strategies: as the ablation shows, data augmentation is indeed very effective.

Origin blog.csdn.net/Friedrichor/article/details/127167784