BERT Applications in CV

AI's three core fields (CV / Speech / NLP) have made great progress in recent years. But, as the saying goes, what made it succeed is also what holds it back: deep learning is still criticized for its weak generalization and robustness, and general AI remains out of reach in the foreseeable future.

However, thanks to the recent success of pretrained models, cross-modal problems (VQA, "describe the picture" captioning, etc.) have become much more tractable. Pretraining-based cross-modal solutions fall into two branches: Video-Linguistic BERT (feeding video data into BERT) and Visual-Linguistic BERT (feeding image data into BERT). The main difficulty in both is how to fit non-text information into the BERT framework. This article covers only Video-Linguistic BERT.

A video can be understood as a sequence of pictures played in order; each picture is called a frame. Video data is typically processed by first extracting frames from the video at some frequency x (fps), and then grouping every n consecutive frames into a clip, so the video is cut into many non-overlapping clips. For each clip (containing m frames), a pretrained CV model (e.g. ResNet) extracts a feature vector (visual features), so the video is finally represented as a sequence of feature vectors.
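To make the generic pipeline above concrete, here is a minimal sketch, not the exact pipeline of either paper: frames are assumed to be pre-decoded into a tensor, a 2D ResNet-50 stands in for whatever pretrained ConvNet a given paper actually uses, and the clip length and mean-pooling over frames are illustrative choices.

```python
import torch
import torchvision.models as models

def video_to_clip_features(frames: torch.Tensor, clip_len: int = 30) -> torch.Tensor:
    """frames: [T, 3, 224, 224] float tensor, already sampled at the chosen fps.
    Returns one feature vector per clip: [num_clips, feat_dim]."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled features
    backbone.eval()

    num_clips = frames.shape[0] // clip_len
    clip_feats = []
    with torch.no_grad():
        for i in range(num_clips):
            clip = frames[i * clip_len:(i + 1) * clip_len]   # [clip_len, 3, H, W]
            per_frame = backbone(clip)                       # [clip_len, 2048]
            clip_feats.append(per_frame.mean(dim=0))         # average over frames
    return torch.stack(clip_feats)                           # [num_clips, 2048]
```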

The features extracted from a video are continuous real-valued vectors (they live in a real vector space), which is very different from discrete text. There are currently two main ways to inject video feature vectors into BERT:

(1) Pipeline: discretize the real-valued vectors and feed them into the BERT model alongside the aligned text tokens;

(2) End-to-end: modify the BERT model structure so that the real-valued vectors participate in the computation directly.

Without further ado, Xiao Xi will introduce the two approaches through the two papers below. Reading on requires a fairly solid understanding of BERT; readers who need a refresher can start with a quick BERT review. There is also an Easter egg at the end of the article, don't miss it ~

《VideoBERT: A Joint Model for Video and Language Representation Learning》

This is a classic work that feeds video into BERT to learn cross-modal representations. It discretizes the extracted video feature vectors by clustering, adds visual tokens alongside the text tokens, and learns visual and textual information jointly.

1 Method

1.1 Video and language processing

For video processing, frames are first sampled from the input video at 20 frames per second (20 fps), and every 30 frames form a clip. A pretrained ConvNet extracts a feature vector (dimension 1024) for each clip. However, these feature vectors live in R^1024 and are uncountable. To obtain tokens analogous to text tokens and keep BERT's original MLM task, the authors run hierarchical k-means clustering on all the extracted feature vectors, obtaining 20,736 cluster centers in total. The cluster centers serve as visual tokens, and each visual feature vector is represented by the center of the cluster it falls into.
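Below is a minimal sketch of this hierarchical k-means quantization, assuming the per-clip features have already been extracted. The branching factor and depth follow the numbers implied by the 20,736 centers (4 levels of k = 12, since 12^4 = 20736); everything else, including the handling of small clusters, is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans_tokens(features: np.ndarray, k: int = 12, depth: int = 4) -> np.ndarray:
    """features: [N, D] clip features. Returns a visual-token id in [0, k**depth) per clip."""
    tokens = np.zeros(len(features), dtype=np.int64)

    def split(idx: np.ndarray, level: int) -> None:
        if level == depth:
            return
        if len(idx) < k:                                # too few points to split further:
            tokens[idx] *= k ** (depth - level)         # pad the remaining base-k digits with 0
            return
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features[idx])
        for c in range(k):
            sub = idx[km.labels_ == c]
            tokens[sub] = tokens[sub] * k + c           # append this level's digit (base k)
            split(sub, level + 1)

    split(np.arange(len(features)), 0)
    return tokens
```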

For text processing, an off-the-shelf automatic speech recognition (ASR) tool extracts the text from the video, and an LSTM-based language model adds punctuation. Subsequent processing follows the original BERT: the text is tokenized into WordPieces with a vocabulary of 30,000.

1.2 Input format

After the processing above, both the visual and the language information have become discrete tokens. VideoBERT keeps the original BERT input design, adding only the special token [>] to separate text tokens from visual tokens.
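A minimal sketch of this input layout follows. The exact special tokens and the segment-id split are placeholders for illustration (the real model reuses BERT's WordPiece vocabulary plus the 20,736 visual-token ids and the [>] separator).

```python
def build_videobert_input(text_tokens, visual_tokens):
    """Concatenate text and visual tokens in the VideoBERT layout and return
    matching segment ids (0 = text side, 1 = visual side; assigning [>] to the
    text side is an arbitrary choice here)."""
    tokens = ["[CLS]"] + text_tokens + ["[>]"] + visual_tokens + ["[SEP]"]
    segment_ids = [0] * (len(text_tokens) + 2) + [1] * (len(visual_tokens) + 1)
    return tokens, segment_ids

# Example:
# build_videobert_input(["now", "let", "me", "show"], ["vid_01234", "vid_19999"])
```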

1.3 Self-supervised tasks (pretraining)

The original BERT has two self-supervised tasks:

(1) Cloze / MLM (Masked Language Model): predict the masked text tokens;

(2) NSP (Next Sentence Prediction): predict whether two sentences are consecutive.

The first task extends naturally to visual tokens: visual tokens are masked just like text tokens, and the masked visual tokens are predicted from the unmasked text and visual tokens. This is a multi-class classification problem trained with a softmax (cross-entropy) loss.

The second task, NSP, becomes in VideoBERT the prediction of whether the text sequence and the visual sequence are aligned, that is, whether the two were extracted from the same video. As in the original BERT, a visual sequence taken from a different video serves as a negative example, and the visual sequence from the same video as a positive example; it is a binary classification problem.
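Here is a minimal sketch of how these two objectives could be set up on one concatenated sequence. The masking rate, the special tokens, and the 50/50 negative sampling follow the usual BERT conventions and are assumptions, not details quoted from the paper.

```python
import random

MASK, PAD = "[MASK]", "[PAD]"
SPECIAL = {"[CLS]", "[SEP]", "[>]"}

def make_mlm_example(tokens, mask_prob=0.15):
    """Mask both text and visual tokens; labels keep the original token at
    masked positions and PAD (ignored by the loss) elsewhere."""
    inputs, labels = [], []
    for tok in tokens:
        if tok not in SPECIAL and random.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)          # predict this token (multi-class softmax)
        else:
            inputs.append(tok)
            labels.append(PAD)          # position not scored
    return inputs, labels

def make_alignment_example(text_tokens, visual_tokens, random_visual_tokens):
    """Linguistic-visual alignment: positive = visual tokens from the same
    video, negative = visual tokens drawn from a different video."""
    if random.random() < 0.5:
        return text_tokens, visual_tokens, 1          # aligned
    return text_tokens, random_visual_tokens, 0       # not aligned
```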

1.4 Downstream tasks

Through these two self-supervised tasks, VideoBERT in effect learns a joint visual-linguistic representation (distribution) p(x, y), where x is the visual sequence and y is the text sequence. The joint distribution can be used for the following three kinds of tasks:

(1) text-to-video: predict video from text, e.g. automatic illustration of a piece of text.

(2) video-to-text: predict text from video, e.g. automatic video captioning / summarization.

(3) unimodal: use the marginal distribution of the text or the video alone to predict what comes next. For text this is the familiar language model; for video, we can predict what might happen next from the preceding video content.

2 Experiments

The paper designs two downstream tasks to verify the effectiveness of the learned cross-modal joint representation.

2.1 Describing the video (zero-shot classification)

Given a video and the fixed template "now let me show you how to [MASK] the [MASK]", the model predicts the masked keywords (a verb and a noun). The paper's qualitative figure shows three examples; each shows the cluster centers of two clips of the video together with the top predicted verbs and nouns.
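Below is a sketch of how this zero-shot probe could be wired up, reusing the build_videobert_input helper from the input-format sketch above. The videobert object and its predict_masked method are hypothetical stand-ins for a pretrained VideoBERT with its MLM head, not the paper's actual API.

```python
def zero_shot_action(videobert, visual_tokens, top_k=5):
    template = "now let me show you how to [MASK] the [MASK] .".split()
    tokens, segment_ids = build_videobert_input(template, visual_tokens)
    mask_positions = [i for i, t in enumerate(tokens) if t == "[MASK]"]
    # predict_masked is assumed to return, per masked slot, the top_k most
    # likely WordPiece tokens; the first slot is read as the verb, the second
    # as the noun.
    verb_cands, noun_cands = videobert.predict_masked(tokens, segment_ids,
                                                      mask_positions, top_k=top_k)
    return verb_cands, noun_cands
```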

The quantitative table compares different methods on this task. S3D is a classic supervised model; apart from S3D, none of the methods use a supervised training signal (zero-shot classification, directly using the pretrained model). BERT (language prior) means using the original BERT directly; VideoBERT (language prior) means additionally training BERT on the text extracted from the video data; VideoBERT (cross modal) is the complete model, trained jointly on video and text data. The comparison shows that top-5 accuracy increases across the three BERT settings, confirming the value of the extracted text and of the multimodal data, and that the final zero-shot VideoBERT (cross modal) reaches an accuracy close to the supervised S3D. The reason the top-1 accuracy of all BERT variants is slightly lower is that BERT's WordPiece-based classification favors open-vocabulary prediction, emphasizing semantic correctness rather than exact matches.

2.2 Video captioning

The authors use this task to verify the effectiveness of VideoBERT as a feature extractor. The same transformer encoder-decoder model is used to generate the video captions; the only difference is the input features of the model:

(1) features extracted by S3D (the baseline);

(2) features extracted by VideoBERT;

(3) VideoBERT features concatenated with S3D features (the strongest setting; a sketch of the concatenation follows the list).
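A minimal sketch of setting (3): concatenating the two clip-level feature streams along the channel dimension before handing them to the caption encoder-decoder. The dimensions are illustrative, and the captioner itself is whatever transformer model is being reused.

```python
import torch

def fuse_features(videobert_feats: torch.Tensor, s3d_feats: torch.Tensor) -> torch.Tensor:
    """videobert_feats: [num_clips, d1], s3d_feats: [num_clips, d2]
    -> [num_clips, d1 + d2] inputs for the caption model."""
    assert videobert_feats.shape[0] == s3d_feats.shape[0]
    return torch.cat([videobert_feats, s3d_feats], dim=-1)
```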

The qualitative examples show that the captions generated with VideoBERT features contain more detail and are more vivid and specific. On the quantitative metrics, VideoBERT + S3D achieves the best results; the features learned by VideoBERT bring a clear improvement on the downstream video captioning task.

《Learning Video Representations Using Contrastive Bidirectional Transformer》

Having read the previous work, some readers may have a doubt: when the continuous real-valued feature vectors (visual features) are restricted to a limited set of cluster centers, doesn't that throw away a lot of the detailed information contained in the video (⊙⊙)? So this paper no longer discretizes the continuous real-valued visual features by clustering; instead it uses the real-valued vectors directly and makes BERT multimodal by adapting the model and the training objective.

1 Method

First, an overview of the model (see the overview figure in the paper): above the dashed line is the pretraining stage, below it the downstream fine-tuning. The gray box is the plain-text BERT model, which is pretrained and then kept fixed. The white box with the black outline is the CBT model pretrained on pure video data, and the red part is the cross-modal transformer that combines the two, pretrained on multimodal data. Below, Xiao Xi unveils each part one by one ~

1.1 Plain-text BERT model

The self-supervised task is still the original BERT MLM: randomly mask text tokens and predict them from the surrounding unmasked text.

$$\mathcal{L}_{\text{text}} = -\,\mathbb{E}_{t}\big[\log p(y_t \mid y_{\setminus t})\big]$$

where y_t is the correct token at a masked position and y_{\setminus t} is the text sequence with y_t masked out. This is the usual MLM loss: maximize the probability of correctly predicting y_t. The difference lies in how the probability of predicting y_t is defined:

$$p(y_t \mid y_{\setminus t}) = \frac{\exp\!\big(e(y_t)^{\top} f(y_{\setminus t})_t\big)}{\sum_{y' \in \mathcal{V}} \exp\!\big(e(y')^{\top} f(y_{\setminus t})_t\big)}$$

where f(y_{\setminus t})_t is the transformer output at position t and e(.) is the token embedding. The optimization goal is to make the representation of the masked position similar (in the inner-product sense) to the true embedding of y_t.

This is essentially the same as the original BERT, except that the softmax probability is written in terms of inner products between the transformer output and the token embeddings. This small change echoes the visual part of the model below, which makes the overall structure very elegant.
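A minimal sketch of this masked-text objective as written above: the score of each vocabulary word is the inner product between its embedding and the transformer output at the masked position, normalized with a softmax. The inputs transformer_out and word_embeddings are assumed to be given.

```python
import torch
import torch.nn.functional as F

def text_mlm_loss(transformer_out: torch.Tensor,   # [num_masked, d]  f(y_\t)_t
                  word_embeddings: torch.Tensor,   # [vocab_size, d]  e(.)
                  target_ids: torch.Tensor         # [num_masked]     true y_t
                  ) -> torch.Tensor:
    logits = transformer_out @ word_embeddings.t()     # [num_masked, vocab_size]
    return F.cross_entropy(logits, target_ids)         # -E[ log p(y_t | y_\t) ]
```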

1.2 Visual CBT model

The self-supervised task for the video data is also MLM, which carries over seamlessly; but because the visual features are continuous real-valued vectors, the authors use an NCE (Noise Contrastive Estimation) loss:

$$\mathcal{L}_{\text{visual}} = -\,\mathbb{E}_{t}\big[\log \mathrm{NCE}(x_t \mid x_{\setminus t})\big]$$

$$\mathrm{NCE}(x_t \mid x_{\setminus t}) = \frac{\exp\!\big(x_t^{\top} e_t\big)}{\exp\!\big(x_t^{\top} e_t\big) + \sum_{x' \in \mathrm{Neg}(t)} \exp\!\big(x'^{\top} e_t\big)}$$

Compare this with the BERT probability definition above: the NCE definition looks almost identical!

Here e_t is the output of the visual transformer at the masked position of the visual sequence. Because the visual features are uncountable, we cannot enumerate all negative examples the way the text part enumerates the vocabulary; the negatives are instead obtained by negative sampling. The optimization goal is to make e_t at the masked position similar to the true visual feature x_t.
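A minimal sketch of the visual NCE loss above: the positive score is the inner product between the transformer output e_t at a masked position and the true clip feature x_t, and the negatives are the other clip features in the batch. The shapes and the "negatives = rest of the batch" choice are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def visual_nce_loss(e_t: torch.Tensor,        # [B, d] transformer outputs at masked positions
                    x_t: torch.Tensor         # [B, d] true visual features at those positions
                    ) -> torch.Tensor:
    scores = e_t @ x_t.t()                    # [B, B]: diagonal = positives, off-diagonal = negatives
    targets = torch.arange(scores.shape[0], device=scores.device)
    return F.cross_entropy(scores, targets)   # -E[ log NCE(x_t | x_\t) ]
```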

1.3 Cross-modal CBT model

The two modules introduced so far are unimodal. To learn the correspondence between the video (the visual features extracted from it, denoted y = y_{1:T}) and the text (the tokens extracted from the video with ASR, denoted x = x_{1:T'}), and to model their multimodal interaction, we now come to the cross-modal CBT module ~

Although the visual features y and the text x come from the same stretch of video, even in instructional videos the two do not correspond exactly frame by frame (at the frame level), so we cannot force the model to predict y_t from x_t or x_t from y_t. We only require the correspondence to hold at the sequence level (a model that understands the video should be able to tell whether x and y belong together). The same NCE form of loss is used:

$$\mathcal{L}_{\text{cross}} = -\,\mathbb{E}_{(x,y)}\big[\log \mathrm{NCE}(y \mid x)\big]$$

$$\mathrm{NCE}(y \mid x) = \frac{\exp\!\big(\mathrm{MI}(x, y)\big)}{\exp\!\big(\mathrm{MI}(x, y)\big) + \sum_{y' \in \mathrm{Neg}} \exp\!\big(\mathrm{MI}(x, y')\big)}$$

where x and y are first encoded by the text BERT and the visual CBT model respectively, then fed into the cross-modal transformer to compute an interaction representation, on top of which a shallow MLP produces the mutual-information score MI(x, y):

$$\mathrm{MI}(x, y) = \mathrm{MLP}\!\big(\mathrm{CrossTransformer}(\mathrm{BERT}(x),\ \mathrm{CBT}(y))\big)$$

The optimization goal is similar to the previous two: positive pairs (x, y) should have large mutual information, and negative pairs (x, y') small mutual information.
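A minimal sketch of this sequence-level objective follows. The cross_transformer callable is assumed to return one pooled vector per (text, video) pair, the shallow MLP turns it into a scalar MI estimate, and the negatives reuse the other videos in the batch; all module shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalNCE(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.mi_head = nn.Sequential(nn.Linear(d_model, d_model),
                                     nn.ReLU(),
                                     nn.Linear(d_model, 1))   # shallow MLP -> MI(x, y)

    def forward(self, cross_transformer, text_seq, video_feats_batch):
        # Score the text against its own video (assumed to be index 0, the
        # positive) and against every other video in the batch (negatives),
        # then apply the NCE loss.
        scores = []
        for video_feats in video_feats_batch:                  # B candidate videos
            pooled = cross_transformer(text_seq, video_feats)  # assumed: [d_model]
            scores.append(self.mi_head(pooled))
        scores = torch.cat(scores).unsqueeze(0)                # [1, B]
        target = torch.zeros(1, dtype=torch.long)              # index 0 = the true pairing
        return F.cross_entropy(scores, target)
```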

1.4 Overall model

The overall model integrates the three parts above. Although the inputs of the three parts differ slightly, their objectives are very consistent and symmetric, which fits together beautifully.
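A minimal sketch of how the pieces combine during CBT pretraining, assuming the text BERT stays frozen (as in the overview above) and the two trainable objectives are simply weighted and summed; the weights are illustrative hyperparameters, not values from the paper.

```python
def cbt_total_loss(visual_nce, cross_modal_nce, w_visual=1.0, w_cross=1.0):
    # The plain-text BERT contributes fixed representations only, so no text
    # MLM term is optimized here.
    return w_visual * visual_nce + w_cross * cross_modal_nce
```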

2 Experiments

2.1 Action recognition

Action recognition is used as the downstream task to verify the effectiveness of the visual representations. The left table in the paper compares CBT with two other pretraining strategies (Shuffle & Learn and 3DRotNet) and a random-initialization baseline, evaluated both with fine-tuning and with fixed features, on two datasets (UCF101 and HMDB51). The results show the effectiveness of the visual CBT model proposed in the paper. The right table compares CBT with a variety of state-of-the-art supervised models; CBT shows a very significant improvement over these models as well.

2.2 Action anticipation

The paper uses three different datasets: the Breakfast dataset, the 50Salads dataset, and the ActivityNet 200 dataset. Readers unfamiliar with action anticipation can simply think of it as a video-based multi-class classification task. In this experiment, the authors not only show that CBT outperforms other existing methods, but also that CBT represents long videos well.

The left table compares CBT with several other methods; CBT outperforms them on all three datasets (reporting results on all three makes the case very convincing). Here self-super = Y means the method uses the pretrain-finetune paradigm, and self-super = N means end-to-end training.

The right table compares different models across different video lengths. CBT is consistently and clearly better than the two baselines (AvgPool and LSTM) on all three datasets, and as the videos get longer, CBT's performance keeps improving. Models generally degrade on long text or long video, as the two baselines in the table do, but CBT instead learns a better representation from longer videos, so its performance improves. (Amazing!!)

2.3 Other video tasks

The paper also compares CBT on two more tasks, video captioning and action segmentation. On video captioning, CBT improves over the VideoBERT model discussed earlier; this gain may come from removing the clustering step and the information loss it introduces.

[Here is the Easter egg]

Reply "videoBERT" in the backstage of the official account to get the original papers (with Xiao Xi's own notes made while reading them).

Reading the papers alongside the notes makes them much easier ~~
