Video pre-training model summary


ViLBERT (2019)

Paper title: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Structure: a dual-stream multimodal model (text + vision).
Innovation: the main contribution of ViLBERT is a dual-stream design, i.e., separate streams for vision and language that interact through co-attentional Transformer layers. This structure lets each modality be processed on its own while still providing interaction between the modalities.
Network structure:
the query-conditioned key-value attention mechanism of BERT is modified into a multi-modal co-attention Transformer module: the key-value pairs are exchanged between the two streams in multi-head attention, realizing sparse cross-modal interaction through co-attention (a minimal sketch follows at the end of this section);
Pre-training:
a random masking operation is applied, and the model is trained to reconstruct the masked content.
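
The co-attention described above can be illustrated with a short sketch. This is not the ViLBERT implementation; the class name, dimensions, and the use of PyTorch's nn.MultiheadAttention are assumptions made for illustration. The key point is that each stream keeps its own queries but consumes the other stream's keys and values.

```python
# Minimal co-attention sketch (illustrative only, not the ViLBERT code).
# Each stream uses its own queries but the *other* stream's keys/values.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        # Text queries attend over image keys/values, and vice versa.
        txt_out, _ = self.txt_attends_img(query=txt, key=img, value=img)
        img_out, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        return txt + txt_out, img + img_out  # residual connections

txt = torch.randn(2, 20, 768)   # (batch, text tokens, dim)
img = torch.randn(2, 36, 768)   # (batch, region features, dim)
txt2, img2 = CoAttentionBlock()(txt, img)
print(txt2.shape, img2.shape)   # torch.Size([2, 20, 768]) torch.Size([2, 36, 768])
```

In the paper, such co-attentional layers are interleaved with ordinary Transformer layers within each stream.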

HERO

Paper title: HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
Structure: 1. A Cross-modal Transformer fuses each subtitle sentence with its corresponding local video frames;
2. A Temporal Transformer uses all surrounding frames as global context to obtain sequential contextual embeddings for each video frame. The proposed hierarchical model first absorbs visual and textual local context at the frame level, and then transfers it to the global video-level temporal context (a minimal sketch follows after the pre-training tasks).
Pre-training tasks: 1. Text masking (MLM)
2. Frame masking (MFM)
3. Video-subtitle alignment (VSM)
4. Frame order modeling (FOM)
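
A minimal sketch of the two-level hierarchy described above, under simplifying assumptions: it fuses all frames with all subtitle tokens at once instead of clip by clip, and the layer counts and dimensions are made up. It is not the HERO code.

```python
# Illustrative two-level hierarchical encoder (not the HERO implementation).
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, dim=768, heads=12, layers=2):
        super().__init__()
        make = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=layers)
        self.cross_modal = make()   # fuses frames with subtitle tokens (local context)
        self.temporal = make()      # models global temporal context over frames

    def forward(self, frames, subtitles):
        # frames: (batch, n_frames, dim); subtitles: (batch, n_tokens, dim)
        # 1) Local fusion: frames and subtitle tokens attend to each other.
        fused = self.cross_modal(torch.cat([frames, subtitles], dim=1))
        frame_embs = fused[:, :frames.size(1)]   # keep only the frame positions
        # 2) Global context: temporal Transformer over the frame sequence.
        return self.temporal(frame_embs)

out = HierarchicalEncoder()(torch.randn(2, 8, 768), torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 8, 768])
```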

Uni-Perceiver (2021)

Not open source
Paper title: Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
Structure: a Transformer is used as the backbone network for multimodal tasks (such as visual-language recognition). Inputs from different modalities are converted by modality-specific tokenizers into a unified sequence of input tokens, and a modality-agnostic Transformer encoder shares its parameters across the different input modalities and target tasks, encoding the different token sequences into a shared representation space.


Innovation: inputs and targets from arbitrary modalities are encoded into the same representation space, and the joint probability of input and target is modeled by the similarity of their representations.
Various unimodal and multimodal tasks can then be solved by comparing embedding similarities. The authors therefore propose a dual-encoder structure: one encoder processes the input and the other processes the target, and the similarity between input and target is computed with cosine similarity. The pre-training objective is to maximize the similarity between matching input-target pairs.

Joint probability and loss:
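
The formulas appear only as images in the original post. Based on the description above, one plausible form, assuming cosine similarity with a temperature τ and writing f for the input encoder and g for the target encoder, is:

```latex
% Hedged reconstruction from the text above, not copied from the paper.
P(x, y) \propto \exp\!\big( \cos( f(x), g(y) ) / \tau \big)

\mathcal{L} = -\log
  \frac{ \exp\!\big( \cos( f(x), g(y^{+}) ) / \tau \big) }
       { \sum_{y' \in \mathcal{Y}} \exp\!\big( \cos( f(x), g(y') ) / \tau \big) }
```

Here y+ is the target that matches input x and 𝒴 is the set of candidate targets; minimizing the loss maximizes the similarity of the matching pair relative to the other candidates.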

Data2vec (2022)

Paper title: Data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Structure:
1. Modality-specific encoding: images are encoded with the ViT strategy into a series of 16x16-pixel patches, which are fed to a linear transformation module; speech is encoded with a multi-layer 1-D convolutional neural network that maps the 16 kHz waveform to a 50 Hz representation; text is pre-processed into sub-word units, which are then embedded into the representation space via learnable embedding vectors.
2. Masking.
3. Training: sharing the parameters of the feature encoder and positional encoder between the teacher and student networks is more efficient and more accurate.

Innovation: self-distillation on top of a standard Transformer architecture. First, a representation of the full input is built to serve as the target of the learning task (teacher model); next, a masked version of the input sample is encoded and used to predict the full-input representation (student model). Training thus amounts to predicting the model's representation of the complete input given only a partial view of it (a minimal sketch follows below). Loss: Smooth L1 loss.
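
A minimal sketch of one self-distillation step, under the assumptions that the teacher is an EMA copy of the student and that the target is simply the teacher's final-layer output (the paper averages several top layers); the encoder, mask token, and hyperparameters are placeholders, not the official data2vec code.

```python
# Illustrative data2vec-style self-distillation step (not the official code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 768
student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, 12, batch_first=True), num_layers=4)
teacher = copy.deepcopy(student)             # teacher = EMA copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)

def ema_update(teacher, student, decay=0.999):
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1 - decay)

x = torch.randn(2, 50, dim)                  # already-tokenized input (any modality)
mask = torch.rand(2, 50) < 0.15              # positions the student must predict

with torch.no_grad():
    target = teacher(x)                      # teacher sees the full input
masked_x = x.clone()
masked_x[mask] = 0.0                         # crude mask token, for illustration only
pred = student(masked_x)                     # student sees the masked input

loss = F.smooth_l1_loss(pred[mask], target[mask])
loss.backward()
ema_update(teacher, student)
print(float(loss))
```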

ViLT (2021 ICML)

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Paper address: https://arxiv.org/abs/2102.03334

Code address: https://github.com/dandelin/vilt

Adjustments made on the basis of UNITER-style models greatly speed up inference.
Ideas: 1. Efficiency/speed: extracting the input features takes more computation than the multi-modal interaction itself; 2. Expressiveness: the capability of the visual embedder and its predefined visual vocabulary determines the upper bound of the overall model performance; 3. Hence, the adjustments are made in the modal interaction part.

Network framework: 1. Visual information: instead of a feature extractor pre-trained on other datasets, this paper adopts the linear projection used in ViT; with no convolution, it is fast (a minimal sketch of the patch projection follows after this list).
2. The text side also uses a linear mapping.
3. Most of the computation is spent on modal interaction, while lightweight networks are used for modal feature extraction.
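
A minimal sketch of the ViT-style linear patch projection referred to in item 1: the image is cut into non-overlapping patches and each flattened patch is mapped by a single linear layer, with no convolutional backbone or region detector. The 32-pixel patch size and dimensions are illustrative, not necessarily ViLT's exact configuration.

```python
# Illustrative ViT-style linear patch embedding (not the ViLT code).
import torch
import torch.nn as nn

patch, dim = 32, 768
proj = nn.Linear(3 * patch * patch, dim)     # one linear map per flattened patch

img = torch.randn(2, 3, 224, 224)            # (batch, channels, H, W)
B, C, H, W = img.shape
# Cut the image into non-overlapping patches and flatten each one.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)        # (B, C, H/p, W/p, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
tokens = proj(patches)                       # (B, num_patches, dim)
print(tokens.shape)                          # torch.Size([2, 49, 768])
```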

Pre-training:
1. Image-text matching (ITM): 1) with probability 0.5, the picture paired with the text is randomly replaced by a different picture; 2) a linear ITM head maps the pooled output feature for the text flag into binary logits used to decide whether the image and text match. The negative log-likelihood (cross-entropy) is then used as the ITM loss (a minimal sketch of this objective follows after this list). In addition, the authors design Word Patch Alignment (WPA), which computes an alignment score between the textual subset and the visual subset of the output.
2. Masked language modeling (MLM): the goal is to predict the masked words from the context vectors. The mask probability is set to 0.15. A two-layer MLM head produces logits over the vocabulary for the masked positions, and the MLM loss is the negative log-likelihood for the masked tokens.
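
As referenced in item 1, a minimal sketch of the ITM objective, assuming a pooled multimodal feature is already available; the swapped negatives are faked with random features here just to keep the sketch short, whereas the real model would re-encode the text with the wrong image.

```python
# Illustrative image-text matching (ITM) objective (not the ViLT code).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 768
itm_head = nn.Linear(dim, 2)                     # binary match / mismatch logits

pooled = torch.randn(8, dim)                     # pooled multimodal output per pair (placeholder)
labels = torch.ones(8, dtype=torch.long)         # 1 = matched image-text pair
swap = torch.rand(8) < 0.5                       # replace the image half of the time
labels[swap] = 0                                 # swapped pairs become negatives
# In the real model the swapped pairs would be re-encoded with the wrong image;
# random features stand in for that here.
pooled[swap] = torch.randn(int(swap.sum()), dim)

itm_loss = F.cross_entropy(itm_head(pooled), labels)   # negative log-likelihood
print(float(itm_loss))
```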

Summary: the innovation lies in the visual feature extraction. Using the linear projection of ViT makes the model very fast, but its performance falls short of region-feature methods.

ViTAEv2 (2022 NeurIPS)

A 600-million-parameter model: a bigger model, more tasks, higher efficiency.
Network framework: visual feature extraction uses multi-scale convolutions to introduce scale invariance, and their outputs are concatenated.
Innovation points:
1. RC module: the Reduction Cell uses multi-scale convolutions to introduce scale invariance into the Transformer, and concatenates the results;
2. NC module: the Normal Cell uses a parallel convolution branch to introduce a local inductive bias without harming the Transformer's global modeling ability (a minimal sketch follows below).
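
A minimal sketch of the "attention plus parallel convolution branch" idea behind the NC module; the depthwise 3x3 convolution, the way the branches are summed, and all dimensions are assumptions for illustration, not the paper's exact design.

```python
# Illustrative "attention + parallel conv branch" block (not the ViTAE/ViTAEv2 code).
import torch
import torch.nn as nn

class ParallelConvAttnBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # local inductive bias

    def forward(self, x, h, w):
        # x: (batch, h*w tokens, dim)
        attn_out, _ = self.attn(x, x, x)                          # global modeling
        spatial = x.transpose(1, 2).reshape(x.size(0), -1, h, w)  # back to a feature map
        conv_out = self.conv(spatial).flatten(2).transpose(1, 2)  # local modeling
        return x + attn_out + conv_out                            # fuse both branches

tokens = torch.randn(2, 14 * 14, 256)
print(ParallelConvAttnBlock()(tokens, 14, 14).shape)  # torch.Size([2, 196, 256])
```
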
VideoCLIP (2021 EMNLP)

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

Paper address: https://arxiv.org/pdf/2109.14084.pdf

Code address: https://github.com/pytorch/fairseq/tree/main/examples/MMPT

Network structure: 1. Text and video are fed into two separately trained Transformers;
2. A contrastive loss is used to learn the correspondence between video and text; the learning objective minimizes the sum of the contrastive losses between the two modalities (a minimal sketch follows below).
Innovation points: 1. Sample the text first (if the video clip were sampled first, it might have no corresponding text);
2. Sample a timestamp within the boundaries of the text clip as the center of the video clip;
3. Grow the video clip outward from that center timestamp.

M3P (CVPR 2021)

M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training

Paper address: https://arxiv.org/abs/2006.02635

Code address: https://github.com/microsoft/M3P

The goal is to map objects that appear in different modalities, or texts expressed in different languages, into a common semantic space:
1) Through multilingual pre-training, the model learns to represent multilingual data from multilingual corpora;

2) Multilingual multimodal representations are learned by randomly replacing some English words with translations in other languages (a minimal sketch follows after this list);

3) These representations are learned through the multi-task objective to handle multilingual multimodal tasks.
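
A minimal sketch of the code-switching in (2): some English words are randomly replaced by translations before the sentence is fed to the model. The tiny dictionary, replacement probability, and function name are made up for illustration, not taken from the M3P code.

```python
# Illustrative code-switching of an English caption (not the M3P implementation).
import random

# Toy bilingual dictionaries; a real setup would use full translation lexicons.
translations = {
    "dog": {"fr": "chien", "de": "Hund"},
    "red": {"fr": "rouge", "de": "rot"},
}

def code_switch(sentence, p=0.3, seed=0):
    random.seed(seed)
    words = []
    for w in sentence.split():
        options = translations.get(w.lower())
        if options and random.random() < p:
            lang = random.choice(sorted(options))    # pick a target language at random
            words.append(options[lang])              # replace the word with its translation
        else:
            words.append(w)
    return " ".join(words)

print(code_switch("a red dog runs in the park", p=0.9))
```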
