[Paper & Model Explanation] VideoBERT: A Joint Model for Video and Language Representation Learning

Foreword

Title of the paper: VideoBERT: A Joint Model for Video and Language Representation Learning
Paper URL: https://arxiv.org/abs/1904.01766
Source code URL: https://github.com/ammesatyajit/VideoBERT

  Building on the success of BERT in NLP, the authors bring BERT into the video (visual-language) domain and propose VideoBERT. Experiments show that when VideoBERT performs zero-shot inference directly on an action classification task, it achieves results comparable to a previously supervised-trained S3D model. When VideoBERT is used for the downstream task of video captioning, it also significantly outperforms S3D. The authors further combine VideoBERT and S3D, and the hybrid model even exceeds the previous state of the art.

Some examples:

Figure 1: VideoBERT text-to-video generation and future prediction.

Top half: Given some recipe text split into sentences $y = y_{1:T}$, VideoBERT computes $x_t^* = \arg\max_k p(x_t = k \mid y)$ to generate a sequence of video tokens $x = x_{1:T}$.

Bottom half: Given a video token, the top three future tokens predicted by VideoBERT at different time scales are shown. In this case, VideoBERT predicts that a bowl of flour and cocoa is likely to be baked in an oven and may become a brownie or cupcake.

Video tokens are visualized using the training-set images whose features are closest to the cluster centroids in feature space.

Figure 2: Additional text-to-video generation and future prediction examples from VideoBERT, as in Figure 1.

Figure 6: Examples of captions generated by VideoBERT and the S3D baseline (GT is the ground truth). In the last example, VideoBERT fails to exploit the full temporal context because it misses the paper towel.

0 Abstract

  Self-supervised learning is becoming increasingly important for exploiting the abundance of unlabeled data on platforms such as YouTube. Since most existing methods learn low-level representations, this paper proposes a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by the recent success of BERT in language modeling, the authors build on the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and from off-the-shelf speech recognition output, respectively. VideoBERT is applied to several tasks, including action classification and video captioning. The authors show that it can be used for open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, VideoBERT outperforms the state of the art on video captioning, and quantitative results verify that the model learns high-level semantic features.


Low-level vs. high-level image features: understanding semantic information in images

  • Low-level image features: contours, edges, color, texture, shape, and so on.
  • High-level image features: the high-level semantic features of an image are what we actually "see". For example, if the low-level features extracted from a face include the outline of the face, the nose, and the eyes, then the high-level feature is the face itself. High-level features carry rich semantic information, but localize the target only coarsely.

1 Introduction

  Deep learning can benefit greatly from labeled data, but extremely large amounts of labeled data are hard to obtain. There has therefore been a surge of interest in self-supervised learning, where a model is trained on various "auxiliary tasks" in the hope that it discovers features or representations useful for downstream tasks. A variety of such auxiliary tasks have been proposed in the image and video domains; however, most of these methods focus on low-level features (e.g., textures) and short timescales (e.g., motion patterns lasting a second or less). The authors are instead interested in actions and activities that unfold over longer timescales (e.g., minutes) and correspond to high-level semantic features, since such representations should be helpful for a variety of video understanding tasks.

  In this paper, the authors exploit the fact that human language has evolved words to describe high-level objects and events, and thus provides a natural source of self-supervision. In particular, they propose a simple way to model the relationship between the visual and linguistic domains by combining three off-the-shelf components:

  • an Automatic Speech Recognition (ASR) system that converts speech into text;
  • vector quantization (VQ) applied to spatiotemporal visual features extracted from a pre-trained video classification model;
  • the recently proposed BERT model, used to learn joint distributions over sequences of discrete tokens.

  VideoBERT uses BERT to learn a model of the form $p(x, y)$, where $x$ is a sequence of "visual words" and $y$ is a sequence of spoken words. Given such a joint model, a variety of interesting tasks can be tackled easily. For example, we can perform text-to-video prediction, which can be used to automatically illustrate a set of instructions (such as a recipe), as shown in the top halves of Figures 1 and 2. We can also perform the more traditional video-to-text task of dense video captioning, as in Figure 6. In Section 4.6, the authors show that their video captioning method significantly outperforms the previous state of the art on the YouCook II dataset.

  We can also use the model in a "unimodal" way. For example, the implied marginal distribution $p(x)$ is a language model for visual words, which can be used for long-term forecasting. The examples in the bottom halves of Figures 1 and 2 illustrate this point. Of course, the future is uncertain, but the model can generate plausible guesses at a higher level of abstraction than other deep generative models for video, such as those based on VAEs or GANs, which tend to predict small low-level changes to the scene, such as the position or pose of a few objects.

  In summary, the authors' main contribution in this paper is a simple way to learn high-level video representations that capture semantically meaningful and temporally long-range structure. The remainder of the paper describes this contribution in detail: Section 2 briefly reviews related work, Section 3 describes how recent advances in natural language modeling can be applied to the video domain, Section 4 presents results on action recognition and video captioning tasks, and Section 5 concludes.


2 Related Work

Supervised learning

  Most video representation learning methods use large labeled datasets to train convolutional neural networks for video classification. However, collecting such labeled data is expensive, and the corresponding label vocabularies are often too small to capture the nuances of many behaviors (e.g., "sipping", "drinking", and "gulping" differ only slightly). Furthermore, these methods are designed to represent short video clips, typically only a few seconds long, whereas this work focuses on the long-term evolution of events in videos and does not require manually provided labels.

Unsupervised learning

  In recent years, various methods for learning density models from videos have been proposed. These methods either use a single static stochastic variable that is decoded into a sequence with an RNN, or use VAE- or GAN-style loss functions. More recent work uses temporal stochastic variables, such as SV2P (Stochastic Variational Video Prediction) and SVGLP (Stochastic Video Generation with a Learned Prior), and there are also various GAN-based approaches such as SAVP (Stochastic Adversarial Video Prediction) and MoCoGAN (MoCoGAN: Decomposing Motion and Content for Video Generation). This paper differs from the above work in that BERT is applied to visual tokens derived from the video, without any explicit stochastic latent variables. The model is therefore not a generative model of pixels, but it is a generative model of features derived from pixels.

Self-supervised learning

  To avoid the difficulty of learning the full joint model $p(x_{1:T})$, it has become popular to learn conditional models such as $p(x_{t+1:T} \mid x_{1:t})$, where the signal is partitioned into two or more blocks (such as grayscale and color, or the previous frames and the future frames) and one block is predicted from the other. The approach in this paper is similar, except that quantized visual words are used instead of pixels. Furthermore, although a set of conditional distributions is learned, the model is a proper joint generative model, as explained in Section 3.

Cross-modal learning

  The multimodal nature of video is also an extensive source of supervision for learning video representations, and this paper builds on that idea. Since most videos contain synchronized audio and visual signals, the two modalities can supervise each other to learn strong self-supervised video representations. In this work, the authors use speech (provided by ASR) rather than low-level sound as a source of cross-modal supervision.

Natural language models

  Large-scale language models such as ELMo and BERT have achieved state-of-the-art results on a variety of NLP tasks, both at the word level (e.g., part-of-speech tagging) and at the sentence level (e.g., semantic classification). BERT was subsequently extended to pre-train on multilingual data as well. This paper builds on BERT to model sequences in both the language and vision domains.

Image and video descriptions

  There has been a lot of recent work on image captioning, which models $p(y \mid x)$, where $y$ is a manually provided caption and $x$ is the image. There is also some work on video captioning, using either manually provided or estimated temporal segments. This paper uses the joint model $p(x, y)$ and applies it to video captioning, achieving state-of-the-art results, as discussed in Section 4.6.

Instructional videos

  Various papers have trained models to analyze instructional videos, such as cooking videos. This work differs in that it does not use any manual labeling, and it learns a large-scale generative model over both words and discretized visual signals.

3 Model

  In this section, we briefly summarize BERT and then describe how it can be extended to jointly model video and language data.

3.1 BERT

  BERT proposes using a "masked language model" to learn language representations. Let $x = \{x_1, \dots, x_L\}$ be a set of discrete tokens. We can define a joint probability distribution over this set as follows:

$$p(x \mid \theta) = \frac{1}{Z(\theta)} \prod_{l=1}^{L} \phi_l(x \mid \theta) \propto \exp\left(\sum_{l=1}^{L} \log \phi_l(x \mid \theta)\right)$$

where $\phi_l(x)$ is the $l$-th potential function with parameters $\theta$, and $Z$ is the partition function.

  The above model is permutation invariant. To capture order information, we can "tag" each word with its position in the sentence. BERT learns an embedding for each word token as well as for these tags, and then sums the embedding vectors to obtain a continuous representation of each token. The log potential (energy) function for each position is defined as:

$$\log \phi_l(x \mid \theta) = x_l^T f_\theta(x_{\backslash l})$$

where $x_l$ is a one-hot vector for the $l$-th token (and its tag), and

$$x_{\backslash l} = (x_1, \dots, x_{l-1}, \text{MASK}, x_{l+1}, \dots, x_L)$$

The function $f(x_{\backslash l})$ is a multi-layer bidirectional Transformer that takes an $L \times D_1$ tensor containing the $D_1$-dimensional embedding vectors corresponding to $x_{\backslash l}$ and returns an $L \times D_2$ tensor, where $D_2$ is the size of each Transformer node's output. The model is trained to approximately maximize the pseudo log-likelihood:

$$L(\theta) = \mathbb{E}_{x \sim D} \sum_{l=1}^{L} \log p(x_l \mid x_{\backslash l}; \theta)$$

  In practice, we can stochastically optimize the log-loss (computed from the softmax over the predictions of $f$) by sampling positions and training sentences.
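As a concrete illustration, here is a minimal PyTorch-style sketch of this masked-token objective: positions are sampled, replaced by a mask id, and the negative log-likelihood of the original tokens is computed from the model's predictions. The `transformer` callable and `mask_id` argument are placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(token_ids, transformer, mask_id, num_masks=1):
    """Minimal sketch of the masked-token (pseudo-log-likelihood) objective.

    token_ids:   (L,) long tensor of token indices for one sentence
    transformer: any module mapping (1, L) token ids -> (1, L, V) logits
    mask_id:     vocabulary index of the [MASK] token
    """
    corrupted = token_ids.clone()
    # Sample positions to mask, as in the cloze task.
    positions = torch.randperm(token_ids.numel())[:num_masks]
    corrupted[positions] = mask_id

    logits = transformer(corrupted.unsqueeze(0))             # (1, L, V)
    log_probs = F.log_softmax(logits[0, positions], dim=-1)  # (num_masks, V)

    # Negative log-likelihood of the original tokens at the masked positions.
    targets = token_ids[positions]
    return -log_probs.gather(1, targets.unsqueeze(1)).mean()
```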

  BERT can be extended to model two sentences by concatenating them together. However, we are often interested not only in modeling the extended sequence, but also in the relationship between the two sentences (e.g., is this pair consecutive in the source document, or is the second sentence randomly selected?). BERT handles this by prepending a classification token [CLS] and joining the sentences with a separator token [SEP]. The final hidden state corresponding to the [CLS] token is used as the aggregate sequence representation from which the label for the classification task is predicted; otherwise it is ignored. In addition to separating sentences with the [SEP] token, BERT also tags each token according to the sentence it comes from. The corresponding joint model can be written as $p(x, y, c)$, where $x$ is the first sentence, $y$ is the second sentence, and $c \in \{0, 1\}$ is a label indicating whether the sentences are consecutive in the source document.

  For consistency with the original BERT setup, the [SEP] token is also added at the end of the sequence, although it is not strictly required. A typical pair of masked-out training sentences might therefore look like this: [CLS] let's make a traditional [MASK] cuisine [SEP] orange chicken with [MASK] sauce [SEP]. In this case, the corresponding class label would be $c = 1$, indicating that $x$ and $y$ are consecutive.
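A small sketch of how such a sentence-pair example could be assembled; the helper name and segment-id convention are illustrative assumptions, not code from the paper.

```python
def build_sentence_pair(tokens_a, tokens_b, consecutive):
    """Sketch of BERT's sentence-pair input format.

    tokens_a, tokens_b: lists of word pieces (some already replaced by "[MASK]")
    consecutive:        True if sentence b follows sentence a in the source document
    Returns the token sequence, per-token segment ids, and the label c.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    c = 1 if consecutive else 0
    return tokens, segment_ids, c

# The example pair from the text:
tokens, segments, c = build_sentence_pair(
    "let's make a traditional [MASK] cuisine".split(),
    "orange chicken with [MASK] sauce".split(),
    consecutive=True,
)
```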

3.2 VideoBERT

Figure 3: Illustration of the masked token prediction (“cloze”) task in a video and text context. This task also allows training with text-only and video-only data, and VideoBERT can also be trained with a language-visual alignment classification objective (not shown here).

  To extend BERT to video while still being able to reuse the pre-trained language model and its scalable implementation for inference and learning, the authors make a minimal change: the raw visual data is converted into a sequence of discrete tokens. To this end, they propose generating a sequence of "visual words" by applying hierarchical vector quantization to features extracted from the video with a pre-trained model. Besides its simplicity, this approach encourages the model to focus on high-level semantics and longer-range temporal dynamics in the video, in sharp contrast to most existing self-supervised approaches to video representation learning, which focus on low-level features such as local texture and motion.

  We can combine a linguistic sentence (generated from the video using ASR) with a visual sentence to produce data such as: [CLS] orange chicken with [MASK] sauce [>] v01 [MASK] v08 v72 [SEP], where v01 and v08 are visual tokens and [>] is a special token used to join the text and video sentences.
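A minimal sketch of assembling such a combined sequence; the helper name is hypothetical, and the visual tokens are just cluster ids rendered as strings.

```python
def build_video_text_pair(text_tokens, visual_tokens):
    """Sketch of VideoBERT's combined linguistic-visual sentence.

    text_tokens:   word pieces from the ASR sentence (some masked)
    visual_tokens: visual words, e.g. "v01" for cluster 1 (some masked)
    """
    return ["[CLS]"] + text_tokens + ["[>]"] + visual_tokens + ["[SEP]"]

sequence = build_video_text_pair(
    "orange chicken with [MASK] sauce".split(),
    ["v01", "[MASK]", "v08", "v72"],
)
# ['[CLS]', 'orange', 'chicken', 'with', '[MASK]', 'sauce', '[>]',
#  'v01', '[MASK]', 'v08', 'v72', '[SEP]']
```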

  While the cloze task extends naturally to sequences of linguistic and visual tokens, applying BERT's next-sentence prediction task is less straightforward. The authors propose a linguistic-visual alignment task, where the final hidden state of the [CLS] token is used to predict whether the linguistic sentence is temporally aligned with the visual sentence. Note that this is a noisy indicator of semantic relatedness, since even in instructional videos the speaker may refer to something that is not visually present.

  To combat this, neighboring sentences are first randomly concatenated into a single long sentence, which lets the model learn semantic correspondence even when the two are not well aligned in time. Second, since the pace of state transitions varies greatly between videos, even for the same action, a subsampling rate of 1 to 5 steps is randomly chosen for the video tokens (see the sketch below). This not only makes the model more robust to variations in video speed, but also allows it to capture temporal dynamics over larger time scales and learn longer-range state transitions.
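A sketch of that random temporal subsampling; the random starting offset is an assumption, since the paper only specifies the 1-5 step rate.

```python
import random

def subsample_video_tokens(visual_tokens, min_rate=1, max_rate=5):
    """Keep every r-th visual token, with r drawn uniformly from [min_rate, max_rate]."""
    rate = random.randint(min_rate, max_rate)
    offset = random.randrange(rate)   # random starting phase (assumption)
    return visual_tokens[offset::rate]
```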

  Overall, there are three training regimes corresponding to the different input data modalities: text-only, video-only, and video-text. For text-only and video-only inputs, the standard masked-token objective is used to train the model. For text-video inputs, the language-visual alignment classification objective described above is used. The overall training objective is a weighted sum of the individual objectives: the text objective forces VideoBERT to do well at language modeling; the video objective forces it to learn a "language model for video", which can be used for learning dynamics and forecasting; and the text-video objective forces it to learn the correspondence between the two domains.
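The overall objective can be pictured as a weighted sum of per-regime losses, as in the sketch below; the batch layout, the loss callables, and the unit weights are illustrative assumptions rather than details from the paper.

```python
def videobert_total_loss(batch, mlm_loss, align_loss,
                         w_text=1.0, w_video=1.0, w_align=1.0):
    """Weighted sum of the three regime-specific training objectives."""
    total = 0.0
    if batch.get("text_only") is not None:    # masked LM on ASR text
        total = total + w_text * mlm_loss(batch["text_only"])
    if batch.get("video_only") is not None:   # masked LM on visual tokens
        total = total + w_video * mlm_loss(batch["video_only"])
    if batch.get("video_text") is not None:   # linguistic-visual alignment
        total = total + w_align * align_loss(batch["video_text"])
    return total
```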

  Once the model is trained, it can be used in a variety of downstream tasks; the authors quantitatively evaluate two applications. In the first, the model is used as a probabilistic model and asked to predict or impute the symbols that have been masked out, as in the zero-shot classification of Section 4.4. In the second, the predicted representation of the [CLS] token (the model's internal activation at that position) is extracted and used as a dense vector representation of the entire input. This can be combined with other features derived from the input and fed to downstream supervised learning tasks, such as the video captioning of Section 4.6.

4 Experiments and Analysis

4.1 Dataset

  In both the language and vision domains, prior work has shown that larger datasets yield better-performing models, so the authors want to train VideoBERT on a fairly large-scale video dataset. Since they want to explore the connection between language and vision, they look for videos in which the spoken words are more likely to refer to the visual content. Intuitively, this is often the case for instructional videos, and the authors focus in particular on cooking videos, since cooking is a well-studied domain with existing annotated datasets available for evaluation. Unfortunately, such datasets are relatively small, so the authors turn to YouTube to collect a large-scale video dataset for training.

  The authors extract a set of publicly available cooking videos from YouTube, using the YouTube video annotation system to retrieve videos related to "cooking" and "recipe". They also filter by video length, removing videos longer than 15 minutes, resulting in a set of 312K videos. The total duration of this dataset is 23,186 hours, roughly 966 days. For reference, this is more than two orders of magnitude larger than the largest previous cooking video dataset, YouCook II, which consists of 2K videos totaling 176 hours.

  To obtain text from the videos, YouTube's automatic speech recognition (ASR) toolkit (provided by the YouTube Data API) is used to retrieve time-stamped speech information. The API returns the word sequence and the predicted language. Of the 312K videos, ASR can be retrieved for 180K, of which 120K are predicted to be in English. In the experiments, only the text from English ASR is used for VideoBERT's text-only and video-text objectives, while all videos are used for the video-only objective.

  VideoBERT is evaluated on the YouCook II dataset, which contains 2000 YouTube videos with an average duration of 5.26 minutes, totaling 176 hours. The videos have manually annotated segment boundaries and captions; on average, each video has 7.7 segments and each caption has 8.8 words. Using the provided dataset split, 1333 videos are used for training and 457 for validation. To avoid potential bias during pre-training, any video appearing in YouCook II is also removed from the pre-training set.

4.2 Video and Language Preprocessing

  For each input video, frames are sampled at 20 fps and clips are formed from non-overlapping 30-frame (1.5-second) windows. For each 30-frame clip, features are extracted with a pre-trained video ConvNet: S3D (Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification), which adds separable temporal convolutions to an Inception (Going Deeper with Convolutions) backbone. The feature activations before the final linear classifier are taken and 3D average pooling is applied to obtain a 1024-dimensional feature vector. S3D is pre-trained on the Kinetics dataset, which covers a wide range of behaviors in YouTube videos, and serves as a general representation for each individual clip.
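A rough sketch of this clip-level preprocessing; the `s3d_model` interface (mapping a 30-frame clip to pre-classifier activations) is an assumption.

```python
import numpy as np

def clip_features(video_frames, s3d_model, clip_len=30):
    """Group frames (already sampled at 20 fps) into non-overlapping 30-frame
    clips and reduce each to a 1024-d vector by 3D average pooling over the
    pre-classifier S3D activations."""
    features = []
    for start in range(0, len(video_frames) - clip_len + 1, clip_len):
        clip = np.stack(video_frames[start:start + clip_len])  # (30, H, W, 3)
        activ = s3d_model(clip)                                 # (t, h, w, 1024), assumed
        features.append(activ.mean(axis=(0, 1, 2)))             # -> (1024,)
    return np.stack(features)                                   # (num_clips, 1024)
```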

  The visual features are tokenized with hierarchical k-means, where the hierarchy depth $d$ and the number of clusters per level $k$ are tuned by visually inspecting cluster coherence and representativeness. With $d = 4$ and $k = 12$, there are $12^4 = 20736$ clusters in total. Figure 4 illustrates the result of this "vector quantization" process.

Figure 4: Examples of video-sentence pairs from the pre-training videos. Each video clip is quantized into a token, which is then represented by its corresponding visual centroid. For each row, the original frames are shown on the left and the visual centroids on the right. We can see that the tokenization process preserves semantic information rather than low-level visual appearance. (Translation of the sentences in the figure: "But at the same time, you're just moving around the cake board, and you can keep reusing it, making sure your serving is clean, so you can get this all done, but it's a very interesting thing, especially at a birthday party."; "Spread a little butter on one side, place some of the filling on it, spread another slice of bread evenly, and spread a little more butter on top because we're going to be toasting the sandwich.")
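Below is a sketch of one way hierarchical k-means tokenization could be implemented with scikit-learn; the exact clustering procedure (tree construction, leaf ordering) is an assumption beyond the d = 4, k = 12 setting stated above.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans_fit(features, k=12, depth=4, seed=0):
    """Recursively split the S3D features into k clusters per level for
    `depth` levels, giving k**depth leaves (12**4 = 20736 visual words)."""
    def split(points, level):
        km = KMeans(n_clusters=k, random_state=seed).fit(points)
        if level == depth - 1:
            return {"km": km, "children": None}
        children = [split(points[km.labels_ == c], level + 1) for c in range(k)]
        return {"km": km, "children": children}
    return split(features, 0)

def tokenize(feature, node, k=12):
    """Map a single 1024-d feature to its leaf cluster id (visual word)."""
    token = 0
    while node is not None:
        c = int(node["km"].predict(feature[None, :])[0])
        token = token * k + c
        node = node["children"][c] if node["children"] else None
    return token
```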

  For each ASR word sequence, an off-the-shelf LSTM-based language model is used to add punctuation, breaking the stream of words into sentences. For each sentence, BERT's standard text preprocessing steps are followed and the text is tokenized into WordPieces. The authors use the same vocabulary provided by the BERT authors, which contains 30,000 tokens.

  Unlike language, which naturally decomposes into sentences, it is unclear how to decompose a video into semantically coherent segments. The authors use a simple heuristic: when an ASR sentence is available, it comes with start and end timestamps, and the video tokens falling in that interval are treated as one segment. When ASR is not available, 16 consecutive tokens are simply treated as a segment.
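A small sketch of this segmentation heuristic; the data layout (per-token timestamps and per-sentence intervals) is an assumption about how the alignment could be represented.

```python
def segment_video_tokens(token_times, asr_sentences, fallback_len=16):
    """Sketch of the segmentation heuristic above.

    token_times:   list of (start, end) times in seconds for each visual token
    asr_sentences: list of (text, start, end) tuples, or None when unavailable
    """
    if not asr_sentences:
        # No ASR: chop the token sequence into fixed-length segments.
        return [list(range(i, min(i + fallback_len, len(token_times))))
                for i in range(0, len(token_times), fallback_len)]
    segments = []
    for _, start, end in asr_sentences:
        # Visual tokens whose time span overlaps the sentence's interval.
        idx = [i for i, (s, e) in enumerate(token_times) if s < end and e > start]
        segments.append(idx)
    return segments
```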

4.3 Model pre-training

  The BERT weights are initialized from a text-pretrained checkpoint. Specifically, the $BERT_{Large}$ model released by the BERT authors is used, with the same backbone architecture: 24 Transformer layers, each with 1024 hidden units and 16 self-attention heads.

  To support video tokens, 20736 new entries are added to the word embedding lookup table, one for each "visual word". These entries are initialized with the S3D features of their corresponding cluster centroids and are frozen during pre-training.
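A sketch of how the embedding table could be extended; conveniently, both the S3D features and BERT-Large's hidden size are 1024-dimensional, so the centroids can be copied in directly. The gradient-hook trick for freezing only the new rows is one possible implementation, not necessarily the paper's.

```python
import torch
import torch.nn as nn

def extend_embeddings(bert_embedding, centroid_features):
    """Append the 20736 visual-word entries to BERT's token embedding table.

    bert_embedding:    nn.Embedding of shape (text_vocab, 1024) from the checkpoint
    centroid_features: (20736, 1024) tensor of S3D cluster centroids
    """
    n_text = bert_embedding.weight.shape[0]
    new_table = torch.cat([bert_embedding.weight.data, centroid_features], dim=0)
    extended = nn.Embedding.from_pretrained(new_table, freeze=False)

    # Keep the centroid-initialized rows fixed during pre-training by zeroing
    # their gradient (one simple way to implement the freezing described above).
    def zero_visual_grads(grad):
        grad = grad.clone()
        grad[n_text:] = 0
        return grad
    extended.weight.register_hook(zero_visual_grads)
    return extended
```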

  The training process largely follows BERT's setup: 4 Cloud TPUs in the Pod configuration with a total batch size of 128, trained for 500,000 iterations (roughly 8 epochs). The Adam optimizer is used with an initial learning rate of 1e-5 and a linearly decaying learning rate. Training takes about 2 days.

4.4 Zero-shot Action Classification

  After pre-training, VideoBERT can be used for zero-shot classification on new datasets such as YouCook II ("zero-shot" here means the model never sees YouCook II data during pre-training; it is trained on other data and transferred directly to YouCook II for classification at test time). More precisely, we want to compute $p(y \mid x)$, where $x$ is a sequence of visual tokens and $y$ is a sequence of words. Since the model is trained to predict sentences, $y$ is defined as the fixed template sentence "now let me show you how to [MASK] the [MASK]," and the verb and noun labels are extracted from the tokens predicted at the first and second masked positions, respectively. See Figure 5.

Figure 5: Using VideoBERT to predict nouns and verbs given a video clip. See text for details. The video is first converted to video tokens (two are shown per example here), which are visualized using their centroids.
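A sketch of this zero-shot protocol; `predict_masked` is a hypothetical interface for reading out ranked predictions at the masked positions, not an actual VideoBERT API.

```python
def zero_shot_verb_noun(visual_tokens, videobert, top_k=5):
    """Append the video tokens to the fixed template and read off the
    predictions at the two [MASK] positions (first = verb, second = noun)."""
    template = "now let me show you how to [MASK] the [MASK]".split()
    sequence = ["[CLS]"] + template + ["[>]"] + visual_tokens + ["[SEP]"]
    mask_positions = [i for i, t in enumerate(sequence) if t == "[MASK]"]

    predictions = videobert.predict_masked(sequence, mask_positions, top_k=top_k)
    verbs, nouns = predictions[0], predictions[1]
    return verbs, nouns
```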

  The YouCook II dataset is used for quantitative evaluation. In Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction, the authors collected ground-truth bounding boxes for the 63 most common objects in the YouCook II validation set. However, there are no ground-truth labels for actions, and many other common objects are not labeled either. The authors therefore collect action and object labels from the ground-truth captions to address this shortcoming: an off-the-shelf part-of-speech tagger is run on the captions to retrieve the 100 most common nouns and 45 most common verbs, which are used to derive the ground truth. While VideoBERT's WordPiece vocabulary makes it effective for open-vocabulary classification, it also makes the model more likely to produce semantically correct predictions that do not exactly match the more limited ground truth. The authors therefore report both top-1 and top-5 classification accuracy, where the latter is meant to alleviate this problem, leaving more sophisticated evaluation techniques to future work. Finally, if more than one verb or noun is associated with a video clip, a prediction is considered correct if it matches any of them. Performance is reported on the YouCook II validation set.

Table 1: Action classification performance on YouCook II dataset.

  Table 1 shows the top-1 and top-5 accuracies of VideoBERT and its ablations. To verify that VideoBERT actually uses the video input, the video input is first removed and only the language model $p(y)$ is used to make predictions. The language prior of a text-only BERT model, which is not fine-tuned on cooking videos, is also used as a baseline. VideoBERT performs significantly better than both baselines. As expected, VideoBERT's language prior adapts to cooking sentences and outperforms the vanilla BERT model.

  The authors then compare with a fully supervised classifier trained on YouCook II's training split. This classifier uses pre-computed S3D features (the same input as VideoBERT), with average pooling over time followed by a linear layer. As shown in Table 1, the supervised framework outperforms VideoBERT in verb accuracy, which is not surprising since VideoBERT has an effectively open vocabulary. (See Figure 5 for an illustration of the ambiguity of action labels.) However, the top-5 accuracy metric shows that VideoBERT achieves performance comparable to the fully supervised S3D baseline without using any supervision from YouCook II, demonstrating that the model is competitive in this zero-shot setting.

4.5 Benefits of Large Training Sets

Table 2: Action classification performance on YouCook II dataset as a function of pre-training data size.

  As shown in Table 2 , the effect of pre-training dataset size on model performance is investigated. In this experiment, 10K, 50K, and 100K video subsets are randomly sampled from the pre-training set, and VideoBERT is pre-trained using the same settings as above for the same amount of time. It can be seen that as the amount of data increases, the accuracy grows monotonically without any sign of saturation. This suggests that VideoBERT may benefit from a larger pre-training dataset.

4.6 Transfer Learning for Captioning

  The effectiveness of VideoBERT as a feature extractor is further demonstrated. To extract features given only video input, a simple fill-in-the-blank task is again used: the video tokens are appended to the template sentence "now let's [MASK] the [MASK] to the [MASK], and then [MASK] the [MASK]." Features are extracted for the video tokens and the masked-out text tokens, averaged separately, and the two are concatenated to be used by a supervised model in the downstream task.
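A sketch of this feature extraction step; `videobert.encode` (returning one hidden vector per input token) and the exact pooling are assumptions about the interface.

```python
import numpy as np

def videobert_caption_features(visual_tokens, videobert):
    """Average the hidden states of the video tokens and of the masked text
    tokens, then concatenate the two as the input to the captioning model."""
    template = "now let's [MASK] the [MASK] to the [MASK] , " \
               "and then [MASK] the [MASK] .".split()
    sequence = ["[CLS]"] + template + ["[>]"] + visual_tokens + ["[SEP]"]
    hidden = videobert.encode(sequence)            # (len(sequence), hidden_dim), assumed

    video_idx = [i for i, t in enumerate(sequence) if t in visual_tokens]
    mask_idx = [i for i, t in enumerate(sequence) if t == "[MASK]"]

    video_feat = np.mean([hidden[i] for i in video_idx], axis=0)
    text_feat = np.mean([hidden[i] for i in mask_idx], axis=0)
    return np.concatenate([video_feat, text_feat])
```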

  The extracted features are evaluated on video captioning, following the setting of End-to-End Dense Video Captioning with Masked Transformer, where ground-truth video segments are used to train a supervised model that maps video segments to captions. The same model as theirs is used, a Transformer encoder-decoder, but the encoder input is replaced with the features from VideoBERT. The authors also concatenate the VideoBERT features with average-pooled S3D features; the baseline uses only the S3D features, without VideoBERT. The hyperparameter settings are listed below (a config sketch follows the list):

  • The number of Transformer block layers is set to 2
  • The hidden unit size is set to 128
  • Dropout probability is set to 0.4
  • Use 5-fold cross-validation on the training split to set hyperparameters and report performance on the validation set
  • The model was trained for 40K iterations
  • The batch size is 128
  • Use the same Adam optimizer as VideoBERT pre-training, set the initial learning rate to 1e-3, and use a linear decay schedule
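Collected into a single configuration object for reference (the structure and field names are illustrative, not from the paper's code):

```python
from dataclasses import dataclass

@dataclass
class CaptioningConfig:
    """Hyperparameters listed above for the captioning Transformer."""
    num_transformer_layers: int = 2
    hidden_size: int = 128
    dropout: float = 0.4
    num_cv_folds: int = 5          # cross-validation on the training split
    train_iterations: int = 40_000
    batch_size: int = 128
    optimizer: str = "adam"
    initial_lr: float = 1e-3
    lr_schedule: str = "linear_decay"
```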
Table 3: Video captioning performance on YouCook II. We follow the setup in [End-to-End Dense Video Captioning with Masked Transformer](https://arxiv.org/abs/1804.00819) and demonstrate captioning performance on the validation set, given ground truth video clips. The higher the numbers in the table, the better the performance.

  As shown in Table 3, the authors follow the standard practice in machine translation and compute micro-averaged BLEU and METEOR scores at the corpus level, and also compute ROUGE-L and CIDEr scores. For the baseline method (End-to-End Dense Video Captioning with Masked Transformer), the metrics are recomputed from the predictions provided by its authors. VideoBERT consistently outperforms the S3D baseline, especially on CIDEr. Cross-modal pre-training also outperforms the video-only version. Furthermore, by concatenating the features from VideoBERT and S3D, the model achieves the best performance on all evaluation metrics.

Figure 6: Examples of captions generated by VideoBERT and the S3D baseline (GT is the ground truth). In the last example, VideoBERT fails to exploit the full temporal context because it misses the paper towel.

  Some qualitative results are shown in Figure 6. We note that the predicted word sequences are rarely exactly equal to the ground truth, which explains why the absolute values of the n-gram based metrics in Table 3 are low. Semantically, however, the results seem reasonable.

5 Conclusion

  This paper employs the powerful BERT model to learn joint visual-linguistic representations for video. Experimental results show that VideoBERT learns high-level semantic representations and outperforms the state of the art in video captioning on the YouCook II dataset. It is also shown that the model can be used directly for open-vocabulary classification, and that its performance grows monotonically with the size of the training set.

  This work is a first step towards learning such joint representations. For many applications, including cooking, it is important to use spatially fine-grained visual representations, not just working at the frame or clip level, so that we can distinguish individual objects and their properties. We envision either using pretrained object detection and semantic segmentation models, or unsupervised techniques to achieve broader coverage.

  In addition to improving the model, the authors also plan to evaluate the method on other video understanding tasks, as well as domains other than cooking. (For example, the recently released COIN dataset of human-labeled instructional videos can be used.) The authors are quite optimistic about the future of large-scale representation learning from video and language.
