BERT-based Multi-modal Learning: Notes on VL-BERT

Foreword

The arrival of BERT gave NLP a big leap forward; some even joked that there is nothing left to do in NLP except compete on money and machines. But progress in any field only raises the bar further: research never ends, and the demands are never fully met. Multi-modality, which asks a machine to perceive along multiple dimensions, is an even harder challenge, and the topic is becoming a new hot spot, as the number of papers published since 2019 shows.

So, following VideoBERT, a wave of work combining images with BERT has appeared. Below we walk through VL-BERT, a model from MSRA, to get a glimpse of the current state of image + BERT research.

Model Introduction

VL-BERT uses the Transformer as its backbone and extends BERT to take both text and image as input. The question, then, is how to fuse the two modalities. Let's speculate about the design ideas:

  1. Images and text cannot be aligned directly, so brute-force it: feed in the whole image

This corresponds to the part framed by the red dashed box in the figure: the image, text, segment and position embeddings are simply added together and fed into the model. Doing MLM this way is no problem, but how can we be sure the model actually extracts useful information from the image?

  2. Extract the important parts of the image and add them as extra, non-text input

Since a whole image carries far more content than a single text token, feeding the entire picture in one shot is clearly not conducive to interaction between visual and textual information. Therefore, an object detection tool is used to carve the picture into regions and extract the RoIs (regions of interest), which are marked with [IMG] tokens and fed into the model (the pale green solid boxes in the figure). To avoid losing global information, a position corresponding to the whole image is added at [END]. In addition, the different regions are assumed to have no ordering at all, so they share the same position embedding.

By analogy with the text input, where the model receives the word embedding of each text token (subword), every image input (whether the whole image or an RoI) is passed through a pre-trained R-CNN detector to extract a 2048-dimensional visual feature embedding, which is then fed into the model.
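To make the input construction concrete, here is a minimal PyTorch-style sketch (my own illustration, not the authors' code; module names such as `visual_proj` are made up) of summing the four kinds of embeddings described above:

```python
import torch.nn as nn

class VLBertEmbeddings(nn.Module):
    """Sketch of a VL-BERT-style input layer: every element is the sum of a
    token embedding, a visual-feature embedding, a segment embedding and a
    position embedding. Sizes are illustrative, not taken from the paper."""

    def __init__(self, vocab_size=30522, hidden=768, visual_dim=2048,
                 n_segments=3, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)    # text tokens plus [CLS]/[SEP]/[IMG]/[END]
        self.visual_proj = nn.Linear(visual_dim, hidden)     # projects 2048-d R-CNN features
        self.segment_emb = nn.Embedding(n_segments, hidden)  # e.g. text A / text B / image
        self.position_emb = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, visual_feats, segment_ids, position_ids):
        # token_ids / segment_ids / position_ids: (batch, seq_len)
        # visual_feats: (batch, seq_len, visual_dim); text positions carry the
        # whole-image feature, [IMG] positions carry their own RoI feature.
        # All RoI elements can share one position id, since regions are unordered.
        return (self.token_emb(token_ids)
                + self.visual_proj(visual_feats)
                + self.segment_emb(segment_ids)
                + self.position_emb(position_ids))
```

Note how, in this reading, the "no ordering among regions" assumption simply means feeding the same position id to every RoI element.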

Self-supervised learning tasks (pre-training)

Combined with the model structure described above, the paper uses two pre-training tasks:

  1. Masked Language Modeling with Visual Clues

Predict masked text tokens from the text plus the image, an upgraded version of MLM. The only difference is that, besides the unmasked text, the prediction of a masked word can also lean on visual information. In the example image above, the masked word sequence is "a kitten drinking from [MASK]"; without the visual information from the picture, there is no way to predict that the masked word is "bottle".
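A hedged sketch of the text-side masking (my own illustration; `is_text` is an assumed boolean mask marking which positions are word tokens rather than [IMG] elements):

```python
import torch

MASK_ID = 103  # assumed [MASK] id in a BERT-style vocabulary

def mask_text_tokens(token_ids, is_text, mask_prob=0.15):
    """Randomly mask text positions only; [IMG] elements stay intact so their
    visual features remain available as clues. Returns the corrupted ids and
    the MLM labels (-100 = position not predicted)."""
    labels = torch.full_like(token_ids, -100)
    chosen = (torch.rand(token_ids.shape) < mask_prob) & is_text
    labels[chosen] = token_ids[chosen]        # targets are the original ids
    corrupted = token_ids.clone()
    corrupted[chosen] = MASK_ID
    return corrupted, labels
```

In the example above, the text stream becomes "a kitten drinking from [MASK]" while the bottle's RoI feature is still in the input, and that visual clue is exactly what lets the model recover "bottle".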

  2. Masked RoI Classification with Linguistic Clues

Predict the categories of masked RoIs from the text plus the image, the "MLM" of the image side. Take the figure below as an example: an object detector first extracts the RoIs and their categories, and a random region (the leaf) is masked. Note that, because the model also receives the whole image as input, the corresponding part of the whole image has to be masked as well to avoid information leakage. Finally, the model predicts the category of the masked region from the text and from the image content that was not masked.
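A rough single-image sketch of the RoI-side masking (again my own illustration; `roi_labels` stands for the categories produced by the object detector):

```python
import torch

def mask_rois(roi_feats, roi_labels, mask_prob=0.15):
    """Randomly mask RoI elements: zero their visual features and keep the
    detector-assigned category as the target (-100 = position not predicted)."""
    chosen = torch.rand(roi_feats.size(0)) < mask_prob
    targets = torch.full((roi_feats.size(0),), -100, dtype=torch.long)
    targets[chosen] = roi_labels[chosen]
    masked_feats = roi_feats.clone()
    masked_feats[chosen] = 0.0   # information removed from the RoI stream
    # As the post notes, the pixels of these regions must also be blanked in
    # the whole image that is fed in, otherwise the answer leaks through the
    # global image feature.
    return masked_feats, targets
```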

Downstream tasks (fine-tuning)

By taking <text, image> pairs as input and learning general cross-modal representations through the self-supervised tasks, the model can be applied to many natural cross-modal tasks. Following the original BERT setup, the final output at [CLS] can predict the relationship between the text and the picture (sentence-image relation), while the outputs at masked text tokens or at RoIs can be used for word-level or RoI-level predictions.

Let's look at how each downstream task is handled ~

  1. Visual Commonsense Reasoning (VCR)

Given a picture with RoIs and a question (Q), the model must select the answer (A) and explain why (R). The VCR task goes beyond object detection: it is a complex reasoning task that requires cognitive-level understanding. The figure below shows two examples from the dataset [1]; it really is hard.

The overall task {Q->AR} can be decomposed into two subtasks: {Q->A} (predict the answer A from the question Q) and {QA->R} (infer the reason R from Q and A). Both subtasks are multiple choice: the model only needs to pick the most appropriate option from the candidate answers. Accordingly, the text input consists of two parts, the Question (known information) and the Answer (candidate answer), while the image input uses the manually annotated RoIs. For the {Q->A} task, the known text information is the question Q. For the {QA->R} task, the known text information is the question Q plus the answer A. Finally, the output at [CLS] predicts whether the candidate answer (A / R) is correct.
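A minimal sketch of the multiple-choice scoring (hypothetical interface; `backbone` stands in for a VL-BERT-style encoder that returns per-position hidden states):

```python
import torch.nn as nn

class VCRChoiceScorer(nn.Module):
    """Score one (known text, candidate, RoIs) pack with the [CLS] output;
    at inference the highest-scoring candidate is selected."""

    def __init__(self, backbone, hidden=768):
        super().__init__()
        self.backbone = backbone               # assumed VL-BERT-style encoder
        self.cls_head = nn.Linear(hidden, 1)   # "is this candidate correct?"

    def forward(self, token_ids, visual_feats, segment_ids, position_ids):
        hidden_states = self.backbone(token_ids, visual_feats,
                                      segment_ids, position_ids)
        return self.cls_head(hidden_states[:, 0]).squeeze(-1)  # [CLS] is position 0

# {Q -> A}:  known text = question Q,            candidate = answer A
# {QA -> R}: known text = question Q + answer A, candidate = reason R
```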

There is something slightly unreasonable here: the natural way of thinking is to arrive at the correct answer A via a sound reason R. The model's logic above, however, is to start from the correct answer and then look for a good reason, which reverses cause and effect.

As for the results, whether compared with the task-specific model R2C or with other multi-modal models, VL-BERT shows a very clear advantage.

  2. Visual Question Answering (VQA)

The paper follows the experimental setup of BUTD, a model designed specifically for VQA, turning VQA into a multi-class classification problem over 3k+ candidate answers; the answer is predicted from the last-layer output at the masked answer token.
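A sketch of the answer head under that setup (the 3,129-answer vocabulary is the size commonly used in BUTD-style VQA, assumed here; `answer_pos` is hypothetical bookkeeping for where the masked answer token sits):

```python
import torch
import torch.nn as nn

class VQAAnswerHead(nn.Module):
    """BUTD-style answer classification: logits over 3k+ candidate answers,
    computed from the output at the [MASK]ed answer position."""

    def __init__(self, hidden=768, n_answers=3129):
        super().__init__()
        self.classifier = nn.Linear(hidden, n_answers)

    def forward(self, hidden_states, answer_pos):
        # hidden_states: (batch, seq_len, hidden)
        # answer_pos:    (batch,) index of the masked answer token per example
        answer_vec = hidden_states[torch.arange(hidden_states.size(0)), answer_pos]
        return self.classifier(answer_vec)
```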

Compared with the specially designed network structure of BUTD, VL-BERT improves accuracy by about 5%, and it is on par with other multi-modal pre-trained models.

  3. Referring Expression Comprehension (visual grounding)

This task locates a specific region of the picture from a natural-language description, i.e. it decides which part of the picture the sentence is talking about. Since the picture has already been carved into RoIs, we only need to take the final output of each RoI and run a region classification (binary) to decide whether that region matches the query description.
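A small sketch of that per-region head (hypothetical helper; `roi_positions` marks where the RoI elements sit in the input sequence):

```python
import torch
import torch.nn as nn

class RegionGroundingHead(nn.Module):
    """Binary classification on each RoI output: does this region match the
    query expression? The highest-scoring RoI is returned as the grounding."""

    def __init__(self, hidden=768):
        super().__init__()
        self.scorer = nn.Linear(hidden, 1)

    def forward(self, hidden_states, roi_positions):
        # hidden_states: (batch, seq_len, hidden)
        # roi_positions: (batch, n_rois) indices of the RoI elements
        idx = roi_positions.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
        roi_vecs = torch.gather(hidden_states, 1, idx)
        return self.scorer(roi_vecs).squeeze(-1)   # (batch, n_rois) match logits
```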

Analysis

With the Transformer as backbone, VL-BERT extends BERT to accept text and image input at the same time and learn cross-modal representations. On three downstream tasks it clearly surpasses task-specific SOTA models and achieves results comparable to, or slightly better than, other pre-trained models.

Its main advantage is the deep interaction between text and image. In contrast, the contemporaneous LXMERT [2] encodes text and image with separate single-modal Transformers before a cross-modal Transformer, whereas VL-BERT uses a single cross-modal Transformer, letting textual and visual information interact much earlier.

Still, I think there are places in this work that deserve a question mark, or at least further study.

Both self-supervised tasks used in the paper are derived from MLM; there is no task that judges whether the text and the picture match, i.e. the typical Sentence-Image Relation Prediction task.

In the comparative experiments, the paper mentions that adding Sentence-Image Relation Prediction to pre-training actually hurts downstream performance, and attributes this to data quality: the sentence-image correspondence signal is noisy. Yet intuitively the correspondence between text and picture is a strong signal for learning cross-modal representations, and the same task yields positive gains in ViLBERT [3] and LXMERT.

If the data quality were improved and the noise in the sentence-image correspondence signal reduced, could VL-BERT's performance be further optimized?

If the gain were still negative, would that mean the other two self-supervised tasks already cover the sentence-image correspondence information, so that adding this task only introduces noise?

Are these three self-supervised tasks contradictory or in conflict with one another? What is their relationship? This is worth further research and exploration.

Origin blog.csdn.net/xixiaoyaoww/article/details/105036223