[Paper Notes] VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

1 Introduction

1.1 Two mainstream architectures for vision-language (VL) pre-training

(1) Dual encoder: encodes images and text separately

Advantages: well suited to retrieval tasks, since image and text features can be computed independently and matched cheaply

Disadvantages: the shallow interaction between image and text is not sufficient for complex VL classification tasks

(2) Fusion encoder: a single encoder that models image-text pairs with cross-modal attention

Advantages: achieves excellent performance on VL classification tasks

Disadvantages: every candidate image-text pair has to be jointly encoded, which makes retrieval over large datasets impractically slow.

1.2 Introduction to VLMO paper

The paper proposes a unified vision-language pre-training model (VLMO) that can be used either as a dual encoder, separately encoding images and text for retrieval tasks, or as a fusion encoder that models the deep interactions of image-text pairs for classification tasks.

1.3 Contribution

In addition to VLMO itself, I personally think the two major contributions of this paper are:

(1) The multimodal Transformer shares self-attention (SA) across modalities and differentiates them only through the modality-expert FFNs

(2) Staged pre-training strategy: train on single modalities first, and only then on multimodal image-text data

2 Method

2.1 Input representation

2.1.1 Image representation

Image representation = patch embedding + learnable 1D position embedding + image type embedding (the image is split into patches and embedded as in ViT)

2.1.2 Text representation

  • Add a sequence start token ([T_CLS]) and a special boundary token ([T_SEP])

Text representation = word embedding + text position embedding + text type embedding

2.1.3 Image-text representation

Concatenate the image and text input vectors; a minimal sketch of the full input pipeline is given below.
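To make the input representation concrete, here is a minimal PyTorch sketch (class and attribute names are my own, not the official implementation) of summing the three embedding terms per modality and then concatenating the two token sequences:

```python
import torch
import torch.nn as nn

class VLInputEmbedding(nn.Module):
    """Hypothetical VLMO-style input embedding: ViT patch embedding for images,
    word embedding for text (already containing [T_CLS]/[T_SEP] ids), each summed
    with its position and type embeddings, then concatenated."""

    def __init__(self, img_size=224, patch_size=16, vocab_size=30522,
                 max_text_len=40, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # image side: patch embedding + learnable 1D position + type embedding
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.img_cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # text side: word embedding + text position + type embedding
        self.word_embed = nn.Embedding(vocab_size, dim)
        self.text_pos = nn.Parameter(torch.zeros(1, max_text_len, dim))
        # type embedding: index 0 = text, index 1 = image
        self.type_embed = nn.Embedding(2, dim)

    def forward(self, image, text_ids):
        B = image.size(0)
        # image tokens: (B, N+1, dim)
        v = self.patch_embed(image).flatten(2).transpose(1, 2)
        v = torch.cat([self.img_cls.expand(B, -1, -1), v], dim=1)
        v = v + self.img_pos + self.type_embed.weight[1]
        # text tokens: (B, L, dim)
        t = self.word_embed(text_ids)
        t = t + self.text_pos[:, : t.size(1)] + self.type_embed.weight[0]
        # image-text representation = concatenation of the two token sequences
        return torch.cat([t, v], dim=1)
```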

2.2 Mixture-of-Modality-Experts Transformer (MOME)

  • MOME Transformer introduces a mixture of modality experts as an alternative to the standard Transformer’s feedforward network.
  • Each MOME Transformer block captures modality-specific information by switching to a different modality expert, and uses multi-head self-attention (MSA) shared across modalities to align visual and linguistic content.

In fact, the design is easy to understand from the figure in the paper: the three modalities share multi-head self-attention (MSA) but use separate FFN experts to capture modality-specific information. This idea has since been borrowed by many follow-up works. A minimal sketch of such a block is given below.
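Here is a hedged sketch of one MOME block, with assumed module names (not the official code): a single multi-head self-attention shared by all inputs, plus three FFN experts (V-FFN, L-FFN, VL-FFN) selected according to the modality of the input:

```python
import torch
import torch.nn as nn

def ffn(dim, hidden=3072):
    # standard Transformer feed-forward network
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

class MOMEBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # multi-head self-attention shared across all modalities
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # modality experts: vision FFN, language FFN, vision-language FFN
        self.experts = nn.ModuleDict({"v": ffn(dim), "l": ffn(dim), "vl": ffn(dim)})

    def forward(self, x, modality="vl"):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # switch to the expert that matches the input modality
        x = x + self.experts[modality](self.norm2(x))
        return x
```

Usage would look like `block(image_tokens, modality="v")`, `block(text_tokens, modality="l")`, or `block(concat_tokens, modality="vl")` for the fusion case.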

2.3 Training objectives

The three classic objectives: ① ITC (image-text contrastive learning), ② MLM (masked language modeling), ③ ITM (image-text matching).

The paper proposes global hard negative mining: hard negative image-text pairs are sampled from the training examples gathered across all GPUs, unlike ALBEF, which mines hard negatives only within a single GPU's batch. A hedged sketch is given below.
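A rough sketch of the idea, with illustrative function names (the real implementation differs in details): features are all-gathered from every GPU, and ITM negatives are sampled in proportion to their contrastive similarity:

```python
import torch
import torch.distributed as dist

def all_gather_cat(x):
    """Gather a tensor from every process and concatenate along the batch dim."""
    if not (dist.is_available() and dist.is_initialized()):
        return x
    out = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
    dist.all_gather(out, x)
    return torch.cat(out, dim=0)

def sample_hard_negative_texts(img_feat, txt_feat, temperature=0.07):
    # similarities of local images against the *global* batch of texts
    g_txt = all_gather_cat(txt_feat)                 # (B_global, D)
    sim_i2t = img_feat @ g_txt.t() / temperature     # (B_local, B_global)
    # mask out this rank's positive pairs before sampling
    rank = dist.get_rank() if dist.is_initialized() else 0
    B = img_feat.size(0)
    pos_idx = torch.arange(B, device=img_feat.device) + rank * B
    sim_i2t.scatter_(1, pos_idx.unsqueeze(1), float("-inf"))
    # higher similarity -> more likely to be picked as a hard negative
    weights = sim_i2t.softmax(dim=1)
    neg_idx = torch.multinomial(weights, 1).squeeze(1)
    return g_txt[neg_idx]                            # hard negative text features
```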

2.4 Staged pre-training

First perform vision pre-training on image-only data, then language pre-training on text-only data, so that the model learns general image and text representations before the final image-text stage; the freezing schedule is sketched after the list below.

  • Image-only data: pre-train the vision expert (V-FFN) and the self-attention module
  • Text-only data: freeze the parameters of the vision expert and the self-attention module, and train the language expert (L-FFN)
  • Image-text data: pre-train the entire model on vision-language data with the three objectives above
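A rough sketch of how this stage-wise freezing could be expressed, reusing the (assumed) attribute names from the sketches above:

```python
# Attribute names such as "experts.l" or "word_embed" come from the hypothetical
# sketches above, not the official implementation.
def set_stage(model, stage):
    text_only = ("experts.l", "word_embed", "text_pos")
    for name, p in model.named_parameters():
        if stage == "image":
            # stage 1: train V-FFN and the shared self-attention on image-only data
            p.requires_grad = not any(k in name for k in text_only + ("experts.vl",))
        elif stage == "text":
            # stage 2: freeze V-FFN and self-attention, train L-FFN and text embeddings
            p.requires_grad = any(k in name for k in text_only)
        else:
            # stage 3: full vision-language pre-training, everything trainable
            p.requires_grad = True
```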

3 Code

The essence of the code is here hahahahaha... 


Origin blog.csdn.net/weixin_50862344/article/details/131366319