Flamingo

Flamingo builds a multi-modal model on top of existing pre-trained image and text models. The final model takes images, videos, and text as input and produces text as output.

The vision encoder is a pre-trained Normalizer-Free ResNet (NFNet), further trained with an image-text contrastive loss. An image passed through the vision encoder yields a 2D grid of features; a video, sampled at 1 FPS, yields a 3D grid. Both are flattened into a 1D sequence and sent to the Perceiver Resampler.
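A minimal sketch of this flattening step, assuming PyTorch; the grid shapes and feature dimension are illustrative assumptions, not the paper's exact values.

```python
# Sketch (not the official implementation): flatten the vision encoder's
# 2D (image) or 3D (video) feature grid into a 1D token sequence before
# the Perceiver Resampler.
import torch

def flatten_visual_features(features: torch.Tensor) -> torch.Tensor:
    """features: [H, W, D] for an image, or [T, H, W, D] for a video
    sampled at 1 FPS (T frames). Returns [N, D] with N = H*W or T*H*W."""
    d = features.shape[-1]
    return features.reshape(-1, d)

image_grid = torch.randn(16, 16, 1536)      # assumed NFNet output for one image
video_grid = torch.randn(8, 16, 16, 1536)   # 8 frames sampled at 1 FPS

image_tokens = flatten_visual_features(image_grid)  # [256, 1536]
video_tokens = flatten_visual_features(video_grid)  # [2048, 1536]
```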

The Perceiver Resampler converts the variable-length image or video features into a fixed-length representation: a set of learnable latent queries is passed through attention and FFW (feed-forward) layers, attending to the visual features, to produce a fixed number of visual tokens.
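A minimal single-block sketch of a Perceiver-Resampler-style module, assuming PyTorch. The layer count, dimensions, use of `nn.MultiheadAttention`, and normalization placement are assumptions, not the paper's exact architecture; the point is that the output length is fixed by the number of latents regardless of the input length.

```python
# Sketch: learnable latent queries cross-attend to the flattened visual
# tokens, then pass through a feed-forward (FFW) layer, producing a fixed
# number of output tokens.
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, dim: int = 1024, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # learnable queries
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: [batch, n, dim], where n varies between images and videos
        b = visual_tokens.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(visual_tokens)
        out, _ = self.attn(self.norm_q(q), kv, kv)  # latents attend to visual tokens
        out = q + out                               # residual connection
        out = out + self.ffw(out)                   # FFW with residual
        return out                                  # [batch, num_latents, dim] -- fixed length

resampler = PerceiverResampler()
print(resampler(torch.randn(2, 2048, 1024)).shape)  # torch.Size([2, 64, 1024])
```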

The text model is based on the Chinchilla language model.

Visual and text features are combined through gated cross-attention dense modules. These use a tanh-gating mechanism: the output of the cross-attention between the text and visual modalities is multiplied by tanh(a), where a is a learnable scalar initialized to 0. The tanh gating ensures that at initialization the model is unaffected by the visual features, so its output equals the output of the language model.
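A minimal sketch of the tanh-gating idea, assuming PyTorch. Module names, dimensions, and layer layout are illustrative assumptions; what matters is that the gates start at 0, so the block passes the language-model hidden states through unchanged at initialization.

```python
# Sketch: cross-attention output scaled by tanh(alpha), alpha initialized to 0.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))  # a for the attention path, init to 0
        self.ffw_gate = nn.Parameter(torch.zeros(1))   # a for the FFW path, init to 0

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text:   [batch, n_text, dim] -- hidden states from the language model
        # visual: [batch, n_vis, dim]  -- tokens from the Perceiver Resampler
        attn_out, _ = self.cross_attn(text, visual, visual)
        text = text + torch.tanh(self.attn_gate) * attn_out   # gated residual: zero at init
        text = text + torch.tanh(self.ffw_gate) * self.ffw(text)
        return text

block = GatedCrossAttentionBlock()
x = torch.randn(2, 32, 1024)
v = torch.randn(2, 64, 1024)
assert torch.allclose(block(x, v), x)  # at initialization the text passes through unchanged
```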

Cross-attention between vision and text is computed per single image: a mask ensures that each text token can only attend to the visual tokens of the image that immediately precedes it in the interleaved sequence, as in the sketch below.
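A minimal sketch of such a per-image attention mask, assuming PyTorch. The helper name and its inputs (a per-token index of the most recent preceding image) are assumptions for illustration.

```python
# Sketch: build a boolean mask allowing each text token to attend only to
# the visual tokens of the single image that most recently precedes it.
import torch

def per_image_cross_attention_mask(text_to_image_index: torch.Tensor,
                                   num_images: int,
                                   tokens_per_image: int) -> torch.Tensor:
    """text_to_image_index[i] is the index of the image immediately preceding
    text token i, or -1 if no image precedes it.
    Returns a [n_text, num_images * tokens_per_image] mask (True = allowed)."""
    image_ids = torch.arange(num_images).repeat_interleave(tokens_per_image)  # [n_vis]
    return text_to_image_index.unsqueeze(1) == image_ids.unsqueeze(0)         # [n_text, n_vis]

# Example: 5 text tokens interleaved with 2 images; the first token comes
# before any image, the next two follow image 0, the last two follow image 1.
idx = torch.tensor([-1, 0, 0, 1, 1])
print(per_image_cross_attention_mask(idx, num_images=2, tokens_per_image=3).int())
```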

The training data includes both public and in-house datasets: M3W (43 million webpages with interleaved images and text), the ALIGN dataset (1.8 billion images paired with alt-text), LTIP (312 million image-text pairs), and VTP (27 million short video-text pairs).
