【Paper Notes】Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

0. Preface 

[Reference] Multimodal paper walkthrough lectures

Looking at the research up to 2021, several big trends stand out:

(1) The model's visual encoder should be stronger than its text encoder

(2) Multimodal fusion deserves a deeper design, not just a simple dot product (as in CLIP)

(3) Choice of loss functions: ① image-text contrastive (ITC) ② masked language modeling (MLM) ③ image-text matching (ITM)

The word-patch alignment (WPA) loss is dropped because it adds considerable training cost (see ViLT).

1. Introduction

1.1 Challenges

(1) Image features and word token embeddings live in their own embedding spaces, which makes it hard to model the relationship between them

(2) Object detectors are too expensive, both in bounding-box annotation and in computation

(3) Noise in web-crawled alt-text data hurts training (such data is mostly keyword lists that do not describe the image well)

1.2 The core contribution of this paper

  • Proposed ALign BEfore Fuse (ALBEF): align the image and text representations with a contrastive loss before fusing them
  • Proposed Momentum Distillation (MoD): learn from the pseudo-targets of a momentum model to cope with noisy supervision

2. ALBEF

2.1 Model Architecture

  • Visual encoder: a 12-layer ViT-base
  • Text encoder: the first 6 layers of BERT-base [40]
  • Multimodal encoder: the last 6 layers of BERT-base

[Figure: model architecture illustration, image from the ViLT paper]
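As a rough sketch of this split (assuming the HuggingFace `transformers` BertModel; in the real ALBEF code the upper six layers additionally gain cross-attention over the ViT image features):

```python
from transformers import BertModel

# Minimal sketch: load one BERT-base and split its 12 layers in two.
bert = BertModel.from_pretrained("bert-base-uncased")
text_layers = bert.encoder.layer[:6]    # unimodal text encoder (layers 0-5)
fusion_layers = bert.encoder.layer[6:]  # multimodal encoder (layers 6-11)
# In ALBEF, each fusion layer also cross-attends to the ViT patch features.
```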

2.2 Three loss functions for pre-training

Image-text contrastive learning (ITC) is applied on the unimodal encoders; masked language modeling (MLM) and image-text matching (ITM) are applied on the multimodal encoder.

2.2.1 Image-Text Contrastive Learning (ITC)

Regular softmax:

$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j} \exp(z_j)}$$

Softmax in this paper, with a learnable temperature parameter $\tau$ (image-to-text direction; text-to-image is symmetric):

$$p^{\mathrm{i2t}}_m(I) = \frac{\exp\left(s(I, T_m)/\tau\right)}{\sum_{m=1}^{M} \exp\left(s(I, T_m)/\tau\right)}$$

where $s(I, T)$ is the dot-product similarity of the projected [CLS] embeddings of image and text.

The ITC loss is the cross-entropy $H$ between the softmax-normalized similarity $p$ and the one-hot ground-truth distribution $y$:

$$\mathcal{L}_{\mathrm{itc}} = \frac{1}{2}\,\mathbb{E}_{(I,T)\sim D}\left[ H\!\left(y^{\mathrm{i2t}}(I), p^{\mathrm{i2t}}(I)\right) + H\!\left(y^{\mathrm{t2i}}(T), p^{\mathrm{t2i}}(T)\right) \right]$$
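A minimal in-batch sketch of the ITC loss with the learnable temperature (the actual ALBEF implementation also maintains queues of momentum features; names such as `image_feat` are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# In-batch ITC sketch: image_feat and text_feat are the projected,
# L2-normalized [CLS] embeddings of B matched image-text pairs.
class ITCLoss(nn.Module):
    def __init__(self, init_temp=0.07):
        super().__init__()
        self.temp = nn.Parameter(torch.tensor(init_temp))  # learnable temperature

    def forward(self, image_feat, text_feat):
        sim_i2t = image_feat @ text_feat.t() / self.temp  # (B, B) scaled similarities
        sim_t2i = sim_i2t.t()
        # the matched pair sits on the diagonal, so the one-hot target is arange
        targets = torch.arange(image_feat.size(0), device=image_feat.device)
        # cross-entropy between softmax-normalized similarities and one-hot labels
        return (F.cross_entropy(sim_i2t, targets) + F.cross_entropy(sim_t2i, targets)) / 2
```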

2.2.2 Masked Language Modeling (MLM)

The classic BERT training objective: input tokens are randomly masked with a probability of 15% and predicted from the image plus the masked text. Because ITC and ITM consume the original text while MLM consumes the masked text, a second forward pass is required here.
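A hedged sketch of BERT's masking recipe (hypothetical helper, not the actual ALBEF code): of the chosen 15% of tokens, 80% become [MASK], 10% become a random token, and 10% are left unchanged.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    input_ids = input_ids.clone()  # don't mutate the caller's tensor
    labels = input_ids.clone()
    # pick 15% of all positions as prediction targets
    masked = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~masked] = -100  # positions ignored by the MLM loss
    # 80% of the targets: replace with [MASK]
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id
    # 10% of the targets: replace with a random token
    random_tok = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
    input_ids[random_tok] = torch.randint(vocab_size, labels.shape)[random_tok]
    # remaining 10%: keep the original token
    return input_ids, labels
```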

2.2.3 Image-Text Matching (ITM)

A strategy is proposed to sample hard negatives for the ITM task with zero extra computational overhead: reuse the softmax-normalized similarities from image-text contrastive learning, and for each image (or text) sample an in-batch text (or image) that is highly similar to it but not its match.
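This can be written directly on top of the ITC similarity matrix; a sketch under the assumption that `sim_i2t` is the (B, B) in-batch image-to-text similarity already computed for the ITC loss:

```python
import torch

def sample_hard_negative_texts(sim_i2t):
    weights = torch.softmax(sim_i2t, dim=1)  # per-image distribution over texts
    weights = weights.clone()
    weights.fill_diagonal_(0)                # the diagonal holds the positives
    # sample one highly-similar (hence hard) negative text index per image
    return torch.multinomial(weights, num_samples=1).squeeze(1)

# Hard negative images per text work the same way on sim_i2t.t().
```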

2.3 Momentum Distillation (MoD)

The idea starts from the observation that image-text pairs crawled from the web do not necessarily describe the image fully, so the one-hot training targets can be too strict.

Momentum model: an exponential moving average (EMA) of the unimodal and multimodal encoders, whose outputs serve as soft pseudo-targets during training.
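A sketch of the two pieces of MoD: the EMA parameter update and the mixing of one-hot labels with the momentum model's soft targets (momentum 0.995 and weight α = 0.4 follow the paper; function names are assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(model, momentum_model, momentum=0.995):
    # momentum model = exponential moving average of the base model
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(momentum).add_(p.data, alpha=1.0 - momentum)

def mod_loss(logits, targets, momentum_logits, alpha=0.4):
    # hard part: cross-entropy against the one-hot ground truth (class indices)
    hard = F.cross_entropy(logits, targets)
    # soft part: KL divergence to the momentum model's (detached) softmax output
    log_p = F.log_softmax(logits, dim=1)
    soft = -(F.softmax(momentum_logits.detach(), dim=1) * log_p).sum(dim=1).mean()
    return (1.0 - alpha) * hard + alpha * soft
```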

3. Code

For the code, refer to [Read the paper and see the code] Multimodal series - ALBEF. It is detailed enough that I won't reinvent the wheel here.
