[Paper Notes] BEiT-3 — Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

BEiT and BEiT v2 are still single-modal works; by the third generation the series has become multi-modal. (The author is not sure whether the official spelling is BEiT3 or BEiT-3.) You can still see the shadow of this group's other work in it: the BEiT series, naturally, but also VLMo and others. It can be regarded as a comprehensive synthesis of that line of research.

(Figure: the "hexagon warrior" radar chart from the paper, showing BEiT-3's results across a wide range of vision and vision-language benchmarks.)

1. BEIT 3 

1.1 Basic skeleton: Multiway Transformer

Each layer contains a shared self-attention module followed by modality-specific feed-forward "experts": a vision expert and a language expert.

The last three layers additionally contain vision-language experts, designed for use as a fusion encoder.
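The routing idea above can be sketched in a few lines. This is an illustrative toy, not the real BEiT-3 code: the expert functions, names, and 2-dimensional "tokens" are all made up for this note; the point is only that each token is dispatched to the expert matching its modality after the shared attention step.

```python
def vision_expert(x):
    # stand-in for the vision FFN (hypothetical: just scales features)
    return [v * 2.0 for v in x]

def language_expert(x):
    # stand-in for the language FFN (hypothetical: just shifts features)
    return [v + 1.0 for v in x]

def multiway_layer(tokens, modalities):
    """Route each token to the expert matching its modality tag."""
    out = []
    for tok, mod in zip(tokens, modalities):
        expert = vision_expert if mod == "image" else language_expert
        out.append(expert(tok))
    return out

# two 2-dim "tokens": one image patch, one text token
tokens = [[1.0, 2.0], [3.0, 4.0]]
mods = ["image", "text"]
print(multiway_layer(tokens, mods))  # [[2.0, 4.0], [4.0, 5.0]]
```

In the real model the experts are full feed-forward networks and the self-attention parameters are shared across modalities; only the FFN branch is switched.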

1.2 Pre-training tasks

The difference from previous work is that the classic trio of objectives used in earlier vision-language pre-training (e.g. image-text contrastive and image-text matching) is abandoned; there is only one training task: masked data modeling. (It really echoes the title — images are treated as a foreign language.)

(1) Text data

Text is tokenized with a SentencePiece tokenizer, and 15% of the tokens are randomly masked.
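A minimal sketch of the 15% random masking step, assuming tokenization has already produced a list of integer token ids. `MASK_ID` and the fixed seed are hypothetical choices for this illustration, not values from the paper.

```python
import random

MASK_ID = 0  # hypothetical id for the [MASK] token

def mask_tokens(token_ids, ratio=0.15, seed=0):
    """Return a masked copy of token_ids plus the masked positions."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(token_ids) * ratio))
    positions = rng.sample(range(len(token_ids)), n_mask)
    masked = list(token_ids)
    for p in positions:
        masked[p] = MASK_ID
    return masked, sorted(positions)

ids = list(range(100, 120))        # 20 pretend token ids
masked, pos = mask_tokens(ids)
print(len(pos))                    # 3  (15% of 20, rounded down)
```

The model is then trained to predict the original ids at the masked positions.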

(2) Image data

The image data is tokenized by BEIT v2’s tokenizer to obtain discrete visual tokens as reconstruction targets, masking 40% of the image patches.

(3) Image-text pair

Randomly mask 50% of the text tokens and 40% of the image patches.
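For intuition, the joint masking ratios translate into concrete counts as follows. The helper and the example sizes (40 text tokens; 196 patches, i.e. a 14×14 grid of 16×16 patches from a 224×224 image) are illustrative assumptions, not taken from the paper's code.

```python
def mask_counts(n_text, n_patches, text_ratio=0.5, image_ratio=0.4):
    """Number of text tokens and image patches to mask for an image-text pair."""
    return int(n_text * text_ratio), int(n_patches * image_ratio)

print(mask_counts(40, 196))  # (20, 78)
```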

2. Code

2.1 beit3

The most basic code of BEiT-3 lives in the torchscale library:

from torchscale.model.BEiT3 import BEiT3

 


Origin blog.csdn.net/weixin_50862344/article/details/131384233