【Paper Notes】BEiT: BERT Pre-Training of Image Transformers


1 Introduction

1.1 Challenges

  1. The input unit of a vision Transformer, the image patch, has no pre-existing vocabulary (unlike words in BERT).
  2. Predicting the raw pixels of a masked patch tends to waste modeling capacity on short-range dependencies and high-frequency details.

1.2 Review of BERT's basic architecture and process

  1. Input encoding: each word in the input text is converted into a fixed-dimensional vector representation via a tokenizer and embedding layer.
  2. Transformer encoder: a multi-layer Transformer encoder captures the contextual information of the input text.

1.3 Key points to grasp

  • How are images tokenized? Through the latent codes of a discrete VAE (Dalle_VAE is used in the code)!
  • How are images masked? By randomly masking a certain percentage of image patches!
  • What does the model learn to predict? Visual tokens!

2. Method

Figure 1: Overview of BEiT pre-training.

2.1 Image representation

2.1.1 Image patch

Basically the same as ViT: each 224×224 image is split into a 14×14 grid of 16×16 image patches.
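As a quick sanity check of the numbers (224 / 16 = 14, so 196 patches of dimension 16×16×3 = 768), here is a minimal patch-splitting sketch; it is illustrative only, not the repo's patch-embedding code:

import torch

# split a 224x224 image into a 14x14 grid of 16x16 patches, ViT-style
img = torch.randn(1, 3, 224, 224)                     # (B, C, H, W)
p = 16
patches = img.unfold(2, p, p).unfold(3, p, p)         # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * p * p)
print(patches.shape)                                  # torch.Size([1, 196, 768])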

2.1.2 Visual token representation

The image is represented as a sequence of discrete tokens obtained by an "image tokenizer", rather than as raw pixels.

The image tokenizer is learned with a discrete variational autoencoder (dVAE). Visual token learning involves two modules, a tokenizer and a decoder:

  • Tokenizer: maps image pixels x to discrete tokens z
  • Decoder: reconstructs the input image x from the visual tokens z

Since the latent visual tokens are discrete, model training is non-differentiable; the paper uses a Gumbel-softmax relaxation during dVAE training to get around this.
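In the paper's notation, with tokenizer $q_\phi(z \mid x)$ and decoder $p_\psi(x \mid z)$, the dVAE reconstruction objective is roughly:

$$\mathbb{E}_{z \sim q_{\phi}(z \mid x)}\big[\log p_{\psi}(x \mid z)\big]$$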

2.2 Backbone network: Transformer

  • A special token [S] is prepended to the input sequence (lower-left corner of Figure 1).
  • Standard learnable 1D position embeddings are added to the patch embeddings.
  • The encoder is an L-layer Transformer.
  • The output of the last layer is used as the encoded representations of the image patches.
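In the paper's notation, with $E$ the linear patch-embedding matrix and $E_{pos}$ the learnable position embeddings, the encoder input and layer updates are:

$$H_0 = \big[\, e_{[S]},\; E\,x^p_1,\; \dots,\; E\,x^p_N \,\big] + E_{pos}, \qquad H^l = \mathrm{Transformer}(H^{l-1}),\quad l = 1, \dots, L$$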

2.3 Pre-training BEIT

  • Given an input image x, it is split into N image patches and tokenized into N visual tokens. About 40% of the image patches are randomly masked, and the masked patches are replaced with a learnable embedding e[M] ∈ R^D.
  • The corrupted image patches are fed into the L-layer Transformer, and the final hidden vectors are treated as the encoded representations of the input patches.
  • A softmax classifier predicts the corresponding visual tokens for the masked positions.

The pre-training objective is to maximize the log-likelihood of the correct visual tokens given the corrupted image:
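With $\mathcal{D}$ the training corpus, $\mathcal{M}$ the set of masked positions, and $x^{\mathcal{M}}$ the corrupted image, the objective from the paper reads:

$$\max \sum_{x \in \mathcal{D}} \mathbb{E}_{\mathcal{M}} \Big[ \sum_{i \in \mathcal{M}} \log p_{\mathrm{MIM}}\big(z_i \mid x^{\mathcal{M}}\big) \Big]$$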

3. Code

The core code does two things: the input image is passed through d_vae (Dalle_VAE is used in the code) to obtain its discrete coded representation, and a ViT is used to predict the coded representation of the masked image patches; a rough sketch of the full step follows.
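A hedged sketch of one pre-training step (variable names follow the snippets quoted below; the repo's engine code differs in detail, and model, d_vae, optimizer, and data_loader are assumed to already be built):

import torch
import torch.nn.functional as F

for samples, images, bool_masked_pos in data_loader:
    with torch.no_grad():
        # targets: dVAE codebook indices at the masked patch positions
        input_ids = d_vae.get_codebook_indices(images).flatten(1)
        bool_masked_pos = bool_masked_pos.flatten(1).to(torch.bool)
        labels = input_ids[bool_masked_pos]

    # the ViT returns logits over the visual-token vocabulary
    # for the masked positions only
    logits = model(samples, bool_masked_pos=bool_masked_pos)
    loss = F.cross_entropy(logits, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()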

3.1 Dataset

Each batch yields three outputs:

samples, images, bool_masked_pos = batch

The dataset construction is only four short lines:

def build_beit_pretraining_dataset(args):
    # DataAugmentationForBEiT bundles the ViT transform, the dVAE transform,
    # and the random mask generator; ImageFolder is torchvision's dataset class
    transform = DataAugmentationForBEiT(args)
    print("Data Aug = %s" % str(transform))
    return ImageFolder(args.data_path, transform=transform)

Of the three outputs of DataAugmentationForBEiT, the first two are the data transforms for the ViT branch and the dVAE branch respectively, and the third randomly generates the mask; a rough sketch of this structure is given below.
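A minimal, hypothetical sketch of that three-output structure (the real DataAugmentationForBEiT uses block-wise masking and DALL-E-specific preprocessing; the sizes and transforms here are only illustrative):

import numpy as np
from torchvision import transforms

class SimpleBEiTAugmentation:
    def __init__(self, input_size=224, patch_size=16, mask_ratio=0.4):
        self.num_patches = (input_size // patch_size) ** 2
        self.num_mask = int(mask_ratio * self.num_patches)
        # transform for the ViT encoder branch
        self.patch_transform = transforms.Compose([
            transforms.Resize((input_size, input_size)),
            transforms.ToTensor(),
            transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
        ])
        # transform for the dVAE tokenizer branch
        self.visual_token_transform = transforms.Compose([
            transforms.Resize((input_size // 2, input_size // 2)),
            transforms.ToTensor(),
        ])

    def __call__(self, image):
        # 1 = masked patch position, 0 = visible (random here, not block-wise)
        mask = np.zeros(self.num_patches, dtype=int)
        mask[np.random.choice(self.num_patches, self.num_mask, replace=False)] = 1
        return self.patch_transform(image), self.visual_token_transform(image), mask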

3.2 Visual token representation

In code, it is this part:

        with torch.no_grad():
            # dVAE tokenizer: map the whole image to a grid of codebook indices
            input_ids = d_vae.get_codebook_indices(images).flatten(1)
            bool_masked_pos = bool_masked_pos.flatten(1).to(torch.bool)
            # targets are the visual tokens at the masked positions only
            labels = input_ids[bool_masked_pos]

This obtains the discrete codes of the entire image through the DiscreteVAE (Dalle_VAE is used in the code).

Codebook: the codebook here is like a lookup table or dictionary, somewhat analogous to the principal component vectors in principal component analysis. [From the referenced article on deep quantization learning, on what "codebook" means; a bit abstract, but I think it captures the right flavor.]
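A toy illustration of what the codebook is: a table of K learnable vectors, where a visual token is just an index into that table (K = 8192 matches the DALL-E dVAE vocabulary; the embedding dimension here is made up):

import torch

codebook = torch.nn.Embedding(num_embeddings=8192, embedding_dim=256)

token_ids = torch.tensor([[17, 4021, 903]])   # three visual tokens, one per patch
latents = codebook(token_ids)                 # (1, 3, 256) vectors looked up from the table
print(latents.shape)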

[Note to self] Should go back and read the follow-up papers in the VAE series.

3.3 Image Transformer

It is essentially a ViT with a cls_token and a mask_token added, plus a linear layer that predicts the coded representation (the visual token) of each masked image patch; a hedged sketch of this structure follows.
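The sketch below is not the repo's exact VisionTransformerForMaskedImageModeling: the layer sizes, class name, use of nn.TransformerEncoder, and the omission of position embeddings are all assumptions made for brevity.

import torch
import torch.nn as nn

class MaskedImageModelingHead(nn.Module):
    def __init__(self, embed_dim=768, vocab_size=8192, depth=12, num_heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.lm_head = nn.Linear(embed_dim, vocab_size)  # classifier over visual tokens

    def forward(self, patch_embed, bool_masked_pos):
        # patch_embed: (B, N, D) patch embeddings; bool_masked_pos: (B, N) boolean mask
        B, N, _ = patch_embed.shape
        # replace masked patch embeddings with the learnable mask token e[M]
        w = bool_masked_pos.unsqueeze(-1).type_as(self.mask_token)
        x = patch_embed * (1 - w) + self.mask_token.expand(B, N, -1) * w
        # prepend the special [S]/cls token and run the L-layer Transformer
        x = torch.cat([self.cls_token.expand(B, -1, -1), x], dim=1)
        x = self.blocks(x)
        # logits over the visual-token vocabulary, for masked positions only
        return self.lm_head(x[:, 1:][bool_masked_pos])

With the labels computed in 3.2, the pre-training loss is then just a cross-entropy over the masked positions, e.g. torch.nn.functional.cross_entropy(logits, labels).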


Origin blog.csdn.net/weixin_50862344/article/details/131260769