BEiT and BEiT v2 are still single-modal works; by the third generation, the series has become multi-modal (whether there will be a fourth remains to be seen...). This paper still shows the influence of the group's other work — needless to say the BEiT series itself, as well as VLMo — so it can be regarded as a comprehensive, summarizing work.
A "hexagon warrior" (Chinese slang for an all-rounder that is strong along every axis).
1. BEIT 3
1.1 Basic skeleton: Multiway Transformer
Each layer shares the self-attention module across modalities and contains modality experts: a vision expert (V-FFN) and a language expert (L-FFN).
The top three layers additionally contain vision-language experts (VL-FFN), used when the model serves as a fusion encoder.
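The routing idea can be sketched in plain Python. This is a toy illustration, not the real implementation: `shared_attention`, `vision_ffn`, `language_ffn`, `vl_ffn`, and `multiway_layer` are all invented stand-ins; the actual Multiway Transformer uses learned sublayers and routes each token's feed-forward computation by its modality.

```python
# Toy sketch of Multiway routing: every token passes through the shared
# self-attention, then is dispatched to a per-modality feed-forward "expert".
# All names here are invented for illustration only.

def shared_attention(tokens):
    # Stand-in for the multi-head self-attention shared across modalities.
    return tokens

def vision_ffn(tok):    return ("V-FFN", tok)
def language_ffn(tok):  return ("L-FFN", tok)
def vl_ffn(tok):        return ("VL-FFN", tok)

def multiway_layer(tokens, modalities, use_vl_expert=False):
    """Route each token to the expert matching its modality.

    tokens:        list of token representations
    modalities:    parallel list of "image" / "text" tags
    use_vl_expert: True in the top layers when acting as a fusion encoder
    """
    tokens = shared_attention(tokens)
    out = []
    for tok, mod in zip(tokens, modalities):
        if use_vl_expert:
            out.append(vl_ffn(tok))       # fusion-encoder path
        elif mod == "image":
            out.append(vision_ffn(tok))   # vision expert
        else:
            out.append(language_ffn(tok)) # language expert
    return out

print(multiway_layer(["img0", "txt0"], ["image", "text"]))
# → [('V-FFN', 'img0'), ('L-FFN', 'txt0')]
```

Because only the FFN is swapped per modality, one shared backbone can behave as an image encoder, a text encoder, or a fusion encoder depending on the input.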
1.2 Pre-training tasks
The difference from previous work is that the three classic pre-training tasks (image-text contrastive learning, image-text matching, and masked language modeling) are abandoned; there is only one training objective: masked data modeling. (It really echoes the title.)
(1) Text data
Tokenized with a SentencePiece tokenizer; 15% of the text tokens are randomly masked.
(2) Image data
The image data is tokenized by BEIT v2’s tokenizer to obtain discrete visual tokens as reconstruction targets, masking 40% of the image patches.
(3) Image-text pair
Randomly mask 50% of the text tokens and 40% of the image patches.
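The three masking ratios above can be sketched as follows. This is a toy example: `mask_tokens` is a made-up helper that samples positions uniformly, whereas BEiT-style pre-training actually uses block-wise masking for image patches.

```python
import random

def mask_tokens(tokens, ratio, mask_token="[MASK]", seed=0):
    """Replace a random `ratio` of tokens with a mask symbol (toy helper)."""
    rng = random.Random(seed)
    n_mask = int(len(tokens) * ratio)
    idx = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in idx else t for i, t in enumerate(tokens)]

text = [f"t{i}" for i in range(100)]     # SentencePiece text tokens
patches = [f"p{i}" for i in range(196)]  # e.g. 14x14 image patches

masked_text_only   = mask_tokens(text, 0.15)     # text-only data: 15%
masked_patches     = mask_tokens(patches, 0.40)  # image data: 40%
masked_text_paired = mask_tokens(text, 0.50)     # text in image-text pairs: 50%

print(masked_text_only.count("[MASK]"))  # → 15
```

For image data, the reconstruction targets at the masked positions are the discrete visual tokens produced by the BEiT v2 tokenizer, not raw pixels.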
2. Code
2.1 beit3
The core implementation of BEiT-3 lives in the torchscale library:
from torchscale.model.BEiT3 import BEiT3
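A minimal usage sketch, with heavy hedging: the `EncoderConfig` import path and field names are assumptions about torchscale's API, and the hyperparameter values below only roughly follow a BEiT3-base-style recipe — verify both against the library before relying on them. The import is guarded so the sketch degrades gracefully when torchscale is not installed.

```python
# Sketch of instantiating BEiT-3 via torchscale. Field names and values
# are assumptions (roughly BEiT3-base); check the library documentation.
try:
    from torchscale.architecture.config import EncoderConfig
    from torchscale.model.BEiT3 import BEiT3
    HAVE_TORCHSCALE = True
except ImportError:  # torchscale (and torch) may not be installed
    HAVE_TORCHSCALE = False

base_cfg = dict(
    encoder_embed_dim=768,      # hidden size (assumed base-scale value)
    encoder_attention_heads=12,
    encoder_ffn_embed_dim=3072,
    encoder_layers=12,
    multiway=True,              # enable the per-modality experts
    vocab_size=64010,           # assumed SentencePiece vocabulary size
)

if HAVE_TORCHSCALE:
    model = BEiT3(EncoderConfig(**base_cfg))
```

The `multiway=True` flag is what turns the plain encoder into the Multiway Transformer described above.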