VLP and Multimodal Learning: Advanced Topics (6)

As research on image-text vision-language pre-training grows, many other interesting research topics have emerged. Below, we briefly discuss each of them, including large models, few-shot learning, unified modeling, robustness evaluation, and more.

Due to limited space, we only list some representative works under each topic. In addition, if a model has already been discussed under one topic, it will not be repeated under another. For example, Flamingo is covered under the few-shot learning topic and therefore not under the large models topic.

1. Large models

Scale is considered an important factor in achieving state-of-the-art performance and in building general-purpose foundation models. As observed in natural language processing, larger and larger language models have been pre-trained, from the 340M-parameter BERT-large to GPT-3 with 175B parameters and the recent PaLM with 540B parameters. A similar trend has been observed in vision-language pre-training. As shown in the table below, we summarize some recent large-scale VLP models, including model size, pre-training dataset size, and pre-training tasks. Here are some observations:

Summary of model sizes, pre-training dataset sizes, and pre-training tasks for recent large-scale VLP models (table caption and footnotes reproduced below). Note that some of the numbers in the table are based on our best estimates.
1: For VLP, 20 million image-text pairs are used for training, while 900 million data points are used to pre-train the Florence image encoder.
2: This is the model size of the object detector used in VinVL (Zhang et al., 2021b).
3: Includes 2.1 billion image-text pairs and 27 million video-text pairs.
4: Before filtering; includes 1.8 billion image-text pairs and 3 billion image-label pairs.
5: The shared attention block contains another 317 million parameters.
6: Includes 21 million image-text pairs and 14 million images from ImageNet-21K (an additional 160GB of text documents is omitted here).
7: The complete set of pre-training tasks of PaLI (Chen et al., 2022e) includes language modeling (LM), prefix language modeling (PrefixLM), visual question answering (VQA), visual question generation (VQG), optical character recognition (OCR), and object detection (OD).
Note: In our context, a module that receives both image and text features as input is considered a fusion module, while a module that only receives text as input is considered a text encoder. Fusion modules are sometimes referred to as text decoders in the literature, e.g., in SimVLM (Wang et al., 2022k), Flamingo (Alayrac et al., 2022), and GIT (Wang et al., 2022d). ITC: image-text contrastive loss. ITM: image-text matching. MLM/LM: (masked) language modeling. MIM: masked image modeling. MVLM: masked vision-language modeling.

• Most large-scale VLP models are obtained through contrastive pre-training, generative pre-training, or a combination of both. Image-text contrastive (ITC) pre-training enables fast image-text retrieval and open-vocabulary image classification, while applying masked language modeling (MLM) or language modeling (LM) on top of a fusion module supports multimodal understanding tasks such as image captioning and visual question answering (a minimal ITC loss sketch is given after the architecture overview below).

• Current large-scale VLP models typically contain about 1 billion parameters and are pre-trained on roughly 1 to 10 billion image-text pairs.

• Flamingo uses a large frozen language model (70B parameters) to preserve the in-context few-shot learning capability inherited from the pre-trained language model, while GIT uses a large contrastively pre-trained image encoder coupled with a relatively small text decoder.

• Both Flamingo and GIT first pre-train the image encoder through contrastive learning and then perform generative pre-training. However, in Flamingo, both the image encoder and the language model are kept frozen, while in GIT, the text decoder is randomly initialized and the image encoder is not frozen during the generative pre-training stage.

• Unlike methods that perform contrastive and generative pre-training in separate stages, CoCa performs both simultaneously in a single stage.

• BEiT-3 achieves state-of-the-art performance on VQA and other vision-language tasks by using only masked data modeling together with a multiway Transformer design.

Examples of recent large-scale VLP models are illustrated below: (a) Contrastive pre-training, including models such as CLIP (Radford et al., 2021), ALIGN (Jia et al., 2021), Florence (Yuan et al., 2021), and BASIC (Pham et al., 2021). (b) Generative pre-training, including models such as GIT (Wang et al., 2022d) and Flamingo (Alayrac et al., 2022). LEMON (Hu et al., 2022) and most other OD-based (object-detector-based) VLP models also adopt this architecture but use additional pre-training losses such as MLM and ITM. (c) Joint contrastive and generative pre-training, such as CoCa (Yu et al., 2022a). METER (Dou et al., 2022b) also uses this architecture but is pre-trained with MLM and ITM. Base-scale models such as ALBEF (Li et al., 2021a) and FIBER (Dou et al., 2022a) also employ both ITC and MLM losses. (d) Generative pre-training with an encoder-decoder architecture, including models such as SimVLM (Wang et al., 2022k) and PaLI (Chen et al., 2022e). (e) VL-BEiT (Bao et al., 2022b) and BEiT-3 (Wang et al., 2022g) adopt a multiway Transformer design and perform unified masked data modeling.
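To make the ITC objective concrete, here is a minimal sketch of a CLIP-style image-text contrastive loss in PyTorch. The tensor shapes and the temperature value are illustrative assumptions, not the exact recipe of any particular model.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_features, text_features, temperature=0.07):
    """Minimal CLIP-style image-text contrastive (ITC) loss.

    image_features, text_features: [batch, dim] embeddings from the two encoders.
    Matched image-text pairs share the same row index.
    """
    # L2-normalize so that dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix scaled by the temperature
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text: targets are the diagonal indices
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

In practice, the temperature is usually a learnable parameter and the batch is gathered across devices so that each pair sees many negatives, but the core objective is as above.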

2. In-context few-shot learning

While achieving state-of-the-art performance through full model fine-tuning is already valuable, it would be even more desirable to have a model that can quickly adapt to different downstream tasks given only a few in-context examples. In the context of language model pre-training, GPT-3 demonstrated this capability through extensive pre-training on large-scale text corpora. Inspired by this, researchers have begun to explore multimodal in-context few-shot learning. Below, we mainly discuss three works: Frozen, PICa, and Flamingo.

• Frozen (Tsimpoukelli et al., 2021) is a pioneering study in this area. It shows that strong in-context few-shot learning performance can be obtained by keeping a large language model frozen and learning an image encoder that aligns the image and text embedding spaces via a simple image captioning objective. However, this method encodes each image with only two global vectors and thus cannot capture all of the visual information in the image. In addition, the frozen language model has only 7B parameters, which may not be large enough.

• To retain the powerful in-context few-shot learning capability of the 175B-parameter GPT-3 (Brown et al., 2020), PICa (Yang et al., 2022d) proposes to prompt GPT-3 with image captions for multimodal few-shot learning, since GPT-3 can only read text, not images (a sketch of this prompt construction is given after this list). This simple approach can already outperform supervised state-of-the-art methods on the challenging OK-VQA benchmark, which requires external knowledge to correctly answer questions about the input image. However, its performance improvement on the VQAv2 dataset is limited, because captions cannot capture every detail of an image and fine-grained visual information may be lost. More recently, following a similar idea, VidIL (Wang et al., 2022j) performs few-shot video-language learning by inheriting the in-context learning capability of GPT-3.

• To address the above challenges, Flamingo proposes to use both a contrastively pre-trained frozen image encoder and a large frozen language model, and to insert gated cross-attention modules to connect the two frozen models. Through large-scale pre-training and the use of a 70B-parameter frozen language model, Flamingo reports state-of-the-art results on in-context few-shot learning.

Overall, these works demonstrate the potential of in-context few-shot learning enabled by multimodal approaches and large-scale pre-training. Further research in this area could lead to models with even stronger in-context adaptation capabilities.
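To illustrate the PICa-style idea of converting an image into text for a frozen language model, below is a minimal sketch of the prompt construction. The captioning step, the example pool, and the exact prompt format are illustrative assumptions rather than PICa's actual implementation.

```python
def build_pica_style_prompt(caption, question, in_context_examples):
    """Build a text-only prompt for a frozen LLM (e.g., GPT-3) for few-shot VQA.

    caption: a caption describing the test image (produced by a captioning model).
    question: the question about the test image.
    in_context_examples: list of (caption, question, answer) triples.
    """
    prompt = "Please answer the question according to the context.\n\n"
    for ex_caption, ex_question, ex_answer in in_context_examples:
        prompt += f"Context: {ex_caption}\nQuestion: {ex_question}\nAnswer: {ex_answer}\n\n"
    # The test example: the LLM is asked to complete the answer
    prompt += f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return prompt

# Example usage with hypothetical data
examples = [("A man riding a wave on a surfboard.", "What sport is this?", "surfing")]
print(build_pica_style_prompt("A plate of pasta with tomato sauce.",
                              "What cuisine is this?", examples))
```

The resulting string is then sent to the frozen language model, whose completion is taken as the predicted answer.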

In addition to relying on large language models, researchers have explored other routes to few-shot learning. In FewVLM (Jin et al., 2022), the authors propose to train a VL-T5-like model (Cho et al., 2021) with PrefixLM and MLM, and find that PrefixLM is very helpful for zero/few-shot image captioning, while MLM works well for zero/few-shot VQA. In TAP-C (Song et al., 2022), the authors show that CLIP (Radford et al., 2021) can be used as a few-shot learner for VQA and visual entailment. For VQA, the task is reformulated as image-text retrieval; for visual entailment, caption-hypothesis (text-text) pairs are used during training, while image-hypothesis (image-text) pairs are used at inference time.

Zero-shot image captioning. An important benefit of training large VLP models is zero-shot generalization. For image-text tasks, although zero-shot retrieval can be readily achieved by using a contrastive loss during pre-training, evaluation of zero-shot image captioning is rare, mainly because models that are not pre-trained on web-scale noisy image-text pairs show poor zero-shot performance. Quantitative evaluation of zero-shot captioning is provided in SimVLM (Wang et al., 2022k) and FewVLM (Jin et al., 2022), while LEMON (Hu et al., 2022) and CM3 (Aghajanyan et al., 2022) provide qualitative visual examples. Zero-shot image captioning can also be achieved by jointly using CLIP and GPT-2, as discussed in MAGIC (Su et al., 2022) and ZeroCap (Tewel et al., 2022).

3. Unified image-text modeling

Image-text tasks can be roughly divided into four categories:

(i) closed-set classification, such as VQA, image-text retrieval, and visual reasoning (e.g., NLVR2);

(ii) open-ended text generation, such as image captioning, visual storytelling, and open-ended VQA;

(iii) box/mask localization, such as phrase grounding, referring expression comprehension/segmentation, and grounded captioning;

(iv) pixel prediction, such as text-to-image generation and text-based image editing.

How to design a unified image-text model that can support all of these downstream tasks has become an increasingly important topic. Below, we briefly summarize some recent attempts toward this goal.

• Unify image-text tasks as text generation. VL-T5 (Cho et al., 2021) draws on the ideas of T5 (Raffel et al., 2020) and BART (Lewis et al., 2020a) and proposes a sequence-to-sequence (seq2seq) encoder-decoder framework that unifies different VL tasks as text generation, so that different tasks can be supported directly without introducing task-specific heads. Since a pre-trained object detector is used to (pre-)extract bounding boxes and their corresponding region features, the box prediction problem in phrase grounding and referring expression comprehension becomes a region-index classification problem. However, the model cannot be pre-trained end-to-end, which leads to weaker downstream performance. SimVLM (Wang et al., 2022k) proposes a simple end-to-end seq2seq learning framework that, like VL-T5, treats VQA as a text generation task, and performs large-scale pre-training.
• Unify text generation and box prediction as language modeling. The above methods unify certain image-text tasks (such as VQA, visual reasoning, and image captioning) as text generation, but bounding box coordinates cannot be predicted directly. By quantizing bounding box coordinates into discrete tokens (a small quantization sketch is given after this list), Pix2Seq (Chen et al., 2022c) and Pix2SeqV2 (Chen et al., 2022d) propose to cast object detection (OD) as a language modeling task within the seq2seq framework. Building on this, UniTAB (Yang et al., 2021c) unifies text generation and bounding box prediction in a single Transformer encoder-decoder architecture, enabling it to handle different VL tasks with a single set of parameters while generating the desired text and box outputs and the alignments between words and boxes.
• Unify text generation and image generation as language modeling. By using VQ-VAE (van den Oord et al., 2017; Razavi et al., 2019), images can also be represented as a sequence of discrete image tokens, so image generation can naturally be viewed as a language modeling task. Recent work such as Taming Transformer (Esser et al., 2021b), DALL-E (Ramesh et al., 2021), and Parti (Yu et al., 2022b) has shown that this approach can generate high-quality, realistic images. Inspired by this, recent work has shown that image generation and text generation (e.g., image captioning) can be unified, for example ERNIE-ViLG (Zhang et al., 2021a), L-Verse (Kim et al., 2022), and DU-VLG (Huang et al., 2022). Furthermore, DaVinci (Diao et al., 2022) combines a prefix image modeling task with prefix language modeling (as used in SimVLM) for pre-training. Aghajanyan et al. (2022) introduce CM3, a causally masked generative model pre-trained on a large corpus of structured multimodal documents that can contain both text and image tokens (obtained from a pre-trained VQ-VAE-GAN). After pre-training, the authors show that the model can generate images both unconditionally and conditionally, and can perform image captioning in a zero-shot setting.
• Unify text generation, box prediction, and image generation. OFA (Wang et al., 2022e) proposes to unify text generation, box prediction, and image generation by combining the ideas of Pix2Seq (Chen et al., 2022c) and VQ-VAE (van den Oord et al., 2017). Following the same idea, Unified-IO (Lu et al., 2022a) further supports diverse modalities such as images, masks, keypoints, bounding boxes, and text, as well as diverse tasks such as depth estimation, inpainting, semantic segmentation, captioning, and reading comprehension. However, Unified-IO's current performance on downstream tasks is not yet satisfactory.
• Unify localization and VL understanding. Serializing bounding boxes into token sequences makes it possible to design a unified model that handles all tasks without task-specific heads, which is very attractive. However, downstream object detection (OD) performance in this paradigm has either not been evaluated or still lags far behind the state of the art. Another line of work attempts to unify localization and VL understanding while still using an additional OD head to output bounding boxes. This includes GPV-1 (Gupta et al., 2022a), MDETR (Kamath et al., 2021), UniT (Hu and Singh, 2021), GLIPv2 (Zhang et al., 2022b), and FIBER (Dou et al., 2022a). Specifically, GPV-1 (Gupta et al., 2022a) and GPV-2 (Kamath et al., 2022) promote the concept of a general-purpose vision system. MDETR (Kamath et al., 2021) and GLIP (Li et al., 2022h) propose to unify object detection and phrase grounding into grounded pre-training, which further inspired GLIPv2 (Zhang et al., 2022b) to unify localization and VL understanding. FIBER (Dou et al., 2022a) provides another solution for handling both localization and VL understanding tasks by designing a new fusion-in-the-backbone architecture and a new pre-training strategy: coarse-grained pre-training on image-text data first, followed by fine-grained pre-training on image-text-box data.
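As a concrete illustration of how bounding boxes can be serialized into discrete tokens in the spirit of Pix2Seq and UniTAB, here is a minimal quantization sketch; the number of bins and the token layout are illustrative assumptions.

```python
def box_to_tokens(box, image_width, image_height, num_bins=1000):
    """Quantize a bounding box into discrete coordinate tokens (Pix2Seq-style sketch).

    box: (x_min, y_min, x_max, y_max) in absolute pixel coordinates.
    Returns a list of 4 integer tokens in [0, num_bins - 1].
    """
    x_min, y_min, x_max, y_max = box

    def quantize(value, size):
        # Normalize to [0, 1], then map to a discrete bin index
        return min(int(value / size * num_bins), num_bins - 1)

    return [
        quantize(x_min, image_width),
        quantize(y_min, image_height),
        quantize(x_max, image_width),
        quantize(y_max, image_height),
    ]

def tokens_to_box(tokens, image_width, image_height, num_bins=1000):
    """Invert the quantization (up to binning error)."""
    x_min, y_min, x_max, y_max = [(t + 0.5) / num_bins for t in tokens]
    return (x_min * image_width, y_min * image_height,
            x_max * image_width, y_max * image_height)

# Example: a box on a 640x480 image becomes four discrete tokens
print(box_to_tokens((32, 48, 320, 240), 640, 480))  # [50, 100, 500, 500]
```

These coordinate tokens are simply appended to the text vocabulary, so a single seq2seq decoder can emit words and boxes in one output sequence.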

In addition to unifying different tasks within one framework, there is also work on designing unified Transformers. For example, UFO (Wang et al., 2021a) develops a unified Transformer that can be flexibly used as either a dual encoder or a fusion encoder. VLMo (Wang et al., 2021c) introduces additional modality experts, and its scaled-up version BEiT-3 (Wang et al., 2022g) recently achieved state-of-the-art results on VQA and other VL tasks.

4. Knowledge

We mainly focus on knowledge-based VQA tasks, which require external knowledge to answer questions correctly. Below, we divide the discussion into three parts.

• Datasets. The earliest explicitly knowledge-based VQA datasets are KB-VQA (Wang et al., 2017b) and FVQA (Wang et al., 2017a); however, the knowledge required by these datasets is contained in the knowledge graphs used to construct them. KVQA (Shah et al., 2019b) is based on images from Wikipedia articles. OK-VQA (Marino et al., 2019) is a popular VQA dataset that requires external open-domain knowledge to answer questions about the input images. More recently, WebQA (Chang et al., 2022) is a dataset collected using web queries, and A-OKVQA (Schwenk et al., 2022) is a crowdsourced dataset consisting of questions that require broader commonsense and world knowledge to answer.

• Knowledge sources. Knowledge sources can be divided into two categories: (i) explicit, structured, symbolic knowledge bases, such as Wikipedia, ConceptNet, WordNet, and Google Images; and (ii) implicit, unstructured knowledge, i.e., large-scale pre-trained language models such as GPT-3 (Brown et al., 2020), which contain rich encyclopedic and commonsense knowledge.

• Methods. Most studies adopt a two-step approach to knowledge-based VQA: first retrieve knowledge from external resources, and then reason over the retrieved knowledge, the input image, and the question to predict the answer (a minimal sketch of this pipeline is given below). Here, we mainly discuss methods designed for OK-VQA. Specifically, Shevchenko et al. (2021) propose to build a knowledge base of knowledge embeddings and inject these embeddings into the VLP model. KRISP (Marino et al., 2021) proposes to retrieve implicit knowledge stored in pre-trained language models as a supplementary resource to structured knowledge bases. MAVEx (Wu et al., 2022c) proposes an answer validation method to better utilize noisily retrieved knowledge. Recently, PICa (Yang et al., 2022d) demonstrated that state-of-the-art results can be achieved by prompting GPT-3 with image captions and in-context few-shot examples. This approach is further enhanced in KAT (Gui et al., 2022), which additionally retrieves knowledge from an explicit knowledge base.
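Below is a minimal sketch of the two-step retrieve-then-reason pipeline described above. The toy word-overlap retriever and the placeholder reader stand in for whatever components a specific method (e.g., a dense retriever plus a VLP model or a prompted LLM) would actually use.

```python
def retrieve(question, knowledge_base, top_k=3):
    """Toy retriever: rank knowledge snippets by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(snippet.lower().split())), snippet)
              for snippet in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:top_k]]

def answer(image, question, knowledge_base, reader, top_k=3):
    """Two-step pipeline: retrieve external knowledge, then reason with a reader model.

    `reader(image, question, snippets) -> answer` is a placeholder for whatever
    model a specific method uses (e.g., a fine-tuned VLP model or a prompted LLM).
    """
    snippets = retrieve(question, knowledge_base, top_k)   # step 1: retrieval
    return reader(image, question, snippets)               # step 2: reasoning

# Toy usage with a placeholder reader that just echoes the top snippet
kb = ["Bananas are a good source of potassium.",
      "A fire hydrant is used by firefighters to access water."]
reader = lambda image, question, snippets: f"(answer based on: {snippets[0]})"
print(answer(image=None, question="Why is a fire hydrant placed on the sidewalk?",
             knowledge_base=kb, reader=reader))
```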

In addition to knowledge-based VQA, which explicitly requires external knowledge, models such as ERNIE-ViL and ROSITA (Cui et al., 2021) use knowledge embedded in scene graphs to improve performance on standard VL tasks (such as VQAv2 and image-text retrieval). Through pre-training on large-scale image-text data, the recent GIT work (Wang et al., 2022d) shows that rich multimodal knowledge about the visual world is already encoded in the model weights: when fine-tuned on the TextCaps dataset (Sidorov et al., 2020), the pre-trained model can recognize scene text, tables/charts, food, logos, landmarks, characters, products, and so on, and express this knowledge in natural language. A related survey on knowledge-intensive NLP tasks is Yin et al. (2022).

5. Robustness and exploratory analysis

In most of the vision-language pre-training (VLP) literature, models are typically evaluated on standard benchmark datasets such as VQAv2 (Goyal et al., 2017b), image captioning, NLVR2 (Suhr et al., 2019), visual entailment (Xie et al., 2019), image-text retrieval, and referring expression comprehension (Yu et al., 2018a). These benchmarks have driven tremendous progress in the field (see, e.g., Figure 3.2), and some large VLP models have even surpassed human performance on certain tasks. While this progress is meaningful and exciting, we should not focus solely on leaderboard rankings, and we should avoid both overclaiming and underestimating a model's learned capabilities (discussed in detail below). To date, the robustness of these pre-trained models remains unclear. Next, we review common approaches to robustness analysis along the following dimensions: (i) diagnostic tests; (ii) challenging datasets that test out-of-distribution generalization; (iii) human-in-the-loop adversarial attacks; and (iv) exploratory analysis.

Diagnostic tests. Diagnostic tests are designed to verify a specific capability or a specific type of robustness of a VLP model. For example, Li et al. (2020c) conducted extensive evaluations of OD-based VLP models, including: (i) robustness to linguistic variation via VQA-Rephrasings (Shah et al., 2019a); (ii) robustness to logical reasoning via VQA-LOL (Gokhale et al., 2020); and (iii) robustness to visual content manipulation via IV-VQA and CV-VQA (Agarwal et al., 2020). CLEVR (Johnson et al., 2017a) is a diagnostic dataset for testing compositional visual reasoning. GQA (Hudson and Manning, 2019b) provides a large-scale, rule-based question set built from scene graphs of real-world images to test the spatial and relational reasoning capabilities of VQA models. Winoground (Thrush et al., 2022) is a carefully curated dataset for probing the visio-linguistic compositionality of VLP models in an image-text matching task. In addition, Parcalabescu et al. (2020) propose to test VL models on counting tasks. The Visual Commonsense Tests (ViComTe) dataset (Zhang et al., 2022a) is designed to test how well unimodal (language-only) and multimodal (image and language) models capture a broad range of visually salient attributes. VALSE (Parcalabescu et al., 2021) is designed to test VLP models centered on linguistic phenomena. CARET (Jimenez et al., 2022) aims to systematically measure the consistency and robustness of modern VQA models via six fine-grained capability tests.

Out-of-distribution generalization. Typically, VL models are evaluated on held-out data drawn from the same distribution as the training data. However, this assumption rarely holds when VL systems are actually deployed. One of the most popular VQA datasets for testing out-of-distribution generalization is VQA-CP (Agrawal et al., 2018), which is built by re-splitting the examples in VQAv2. GQA-OOD (Kervadec et al., 2021) builds on the GQA dataset and aims to evaluate the performance gap between in-distribution and out-of-distribution splits. Beyond VQA, VLUE (Zhou et al., 2022d) also creates out-of-distribution test sets for other VL tasks, including image-text retrieval, image captioning, and visual grounding. Gupta et al. (2022b) introduce the GRIT benchmark, which aims to test the performance, robustness, and calibration of vision systems across 7 vision and VL tasks, multiple data sources, and diverse concepts. Recently, Agrawal et al. (2022) conducted a comprehensive study of the out-of-distribution generalization ability of modern VLP models via cross-dataset evaluation.

Human-in-the-loop adversarial attacks. To build benchmarks that can evolve organically over time, Li et al. (2021b) and Sheng et al. (2021) collected adversarial VQA datasets through an iterative human-and-model-in-the-loop process (Nie et al., 2020). Interestingly, they found that during data collection, non-expert annotators can easily and successfully attack modern VLP models, and the performance of these VLP models on the new benchmarks is much lower than on the standard VQAv2 dataset. Recently, Bitton et al. (2022) introduced WinoGAViL, an online game for collecting vision-language associations that serves as a dynamic benchmark for evaluating state-of-the-art VLP models. On the one hand, these benchmarks are valuable because they expose the weaknesses of SoTA VLP models and provide new perspectives for robustness research in the community. On the other hand, we should also be careful not to underestimate what the models have learned, since these datasets are specifically collected to fool them.

Exploratory analysis. Besides testing the robustness of VLP models on various benchmark datasets, there is also a line of work aimed at probing and understanding what VLP models actually learn (Cao et al., 2020; Li et al., 2020d; Salin et al., 2022), for example through cross-modal input ablation (Frank et al., 2021), verb understanding (Hendricks and Nematzadeh, 2021), bias analysis (Srinivasan and Bisk, 2022), and disentangling the roles of data, attention, and losses in VLP models (Hendricks et al., 2021), among others.

6. VL for language, model compression, multilingual VLP, and others

Vision-language (VL) learning uses both image and text data. In recent years, with the emergence of VLP models such as CLIP and ALIGN, the effectiveness of this approach has become widely accepted: these models can learn powerful image encoders from scratch and enable zero-shot image classification. At the same time, human language understanding is also grounded in visual knowledge, such as color, size, and shape. A natural question is therefore whether image-text data can also help learn better language representations.

To enrich learned language representations, Vokenization and iACE propose to pair language tokens with related images, treated as "vokens". In VidLanKD, the authors use knowledge transfer from video via distillation to improve language understanding tasks involving world knowledge, physical reasoning, and temporal reasoning. VaLM proposes to augment text tokens with relevant images retrieved by CLIP, using a visual knowledge fusion layer for visually grounded language modeling. On the model compression side, MiniVLM studies how to design compact vision-language models and proposes to replace the commonly used object detector with an efficient, low-cost one, while DistilVLM applies knowledge distillation to VLP (a generic distillation-loss sketch is given below). VL-Adapter and ladder side-tuning propose parameter-efficient methods for adapting large language models.
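As an illustration of distillation-based compression, here is a generic soft-label knowledge distillation loss sketch; the temperature and the loss weighting are illustrative choices, not DistilVLM's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Generic soft-label knowledge distillation loss (sketch).

    Combines a KL term between softened teacher and student distributions
    with the usual cross-entropy against the ground-truth labels.
    """
    # Soft targets from the (frozen) teacher, softened by the temperature
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the hard labels
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

In VLP distillation, the same idea is typically applied to task logits and sometimes to intermediate attention maps or hidden states as well.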

Multilingual VLP is a relatively under-explored area. UC2 and M3P propose to add multilingual text encoders and jointly pre-train on English and multilingual data. MURAL extends ALIGN to multilingual scenarios by introducing a large number of translation pairs. CCLM introduces a cross-lingual, cross-modal, cross-view language modeling method whose zero-shot cross-lingual transfer ability surpasses that of English-only VL models.

In unsupervised VLP, Li et al. explored how to perform vision-language pre-training without parallel image-text data. They propose masked-modeling pre-training on text-only and image-only data, and use object tags detected by an object detector as anchors to bridge the two modalities. Zhou et al. argue that tags alone are not enough and propose a retrieval-based approach to construct a weakly aligned image-text corpus, applying a set of multi-granularity alignment pre-training tasks to narrow the gap between the two modalities.

"Socratic Models" is a concept that combines basic models in different fields in a zero-sample or few-sample manner, and uses language as a representation to reason together. Models such as PICa, MAGIC, BEST and PaLI fall into this category.

In addition to standard VL tasks, VLP can also be applied to areas such as TextVQA, TextCaps, visual dialog, fashion-domain tasks, and vision-and-language navigation.

In summary, VLP has broad applications and active research across areas such as enriching language representations, model compression, multilingual learning, unsupervised learning, and Socratic Models.

7. Text-to-image generation

Another important image-text task not yet covered in this chapter is text-to-image (T2I) generation, whose goal is to produce an image that correctly reflects the meaning of a text description; it can also be regarded as the inverse of image captioning (Chen et al., 2015). Before the emergence of VLP, Mansimov et al. (2016) pioneered T2I generation, showing that a recurrent variational autoencoder can generate novel visual scenes conditioned on image descriptions, although the quality of the generated images was not satisfactory. Research on T2I generation then developed rapidly with the rise of generative adversarial networks. Reed et al. (2016) extended conditional GANs to T2I generation and showed that they work on restricted datasets (e.g., Oxford-102 Flowers and CUB-200 Birds) at small image resolutions (64x64). In recent years, the field has made significant progress thanks to improved multimodal encoding (e.g., StackGAN (Zhang et al., 2017), StackGAN++ (Zhang et al., 2018c)), novel attention mechanisms (e.g., AttnGAN (Xu et al., 2018), SEGAN (Tan et al., 2019), ControlGAN (Li et al., 2019a)), the use of cycle structures (e.g., MirrorGAN (Qiao et al., 2019)), and more.

Figure: Illustration of autoregressive models over discrete tokens, such as DALL-E (Ramesh et al., 2021) and Parti (Yu et al., 2022b), and diffusion-based models, such as DALL-E 2 (Ramesh et al., 2022) and Imagen (Saharia et al., 2022), for high-quality text-to-image generation with large-scale pre-training.

To extend the success of GANs to limited-data regimes, pre-training is often used, i.e., the optimization is initialized from a GAN model pre-trained on a large dataset (Grigoryev et al., 2022). However, most GAN-based pre-training is performed only on image datasets and does not exploit image-text pairs for vision-language pre-training (VLP), with the exception of recent work that uses the CLIP model within a GAN-based method (e.g., LAFITE (Zhou et al., 2022f)), which is the first work to train a T2I generation model without explicitly using text data.

In the context of VLP, although GAN-based methods remain popular for image synthesis, a new paradigm shift is emerging in T2I generation. We classify these methods into two categories: (i) VQ-token-based autoregressive methods (e.g., DALL-E (Ramesh et al., 2021) and Parti (Yu et al., 2022b)), and (ii) diffusion-based methods (e.g., DALL-E 2 (Ramesh et al., 2022) and Imagen (Saharia et al., 2022)). Examples of these methods are shown in Figure 3.11. Below is a brief review of these recent works.

7.1 VQ-token-based autoregressive methods

Autoregressive and diffusion-based text-to-image/video models have developed rapidly over time; only some representative works are discussed below.

Discrete token representation. VQ-VAE (van den Oord et al., 2017), proposed in 2017, provides a simple yet powerful generative model that learns discrete representations for high-quality image reconstruction. Subsequently, VQ-VAE-2 (Razavi et al., 2019) showed that high-fidelity, high-resolution images can be generated. With the popularity of the Transformer (Vaswani et al., 2017), which has brought impressive improvements to language modeling (Devlin et al., 2019) and generative image pre-training (Chen et al., 2020b), modeling sequences of VQ tokens is also naturally handled by Transformers (Esser et al., 2021b).

Autoregressive modeling. These Transformer-based advances open up a path for T2I generation to benefit from large-scale VLP. Specifically, DALL-E (Ramesh et al., 2021) demonstrated that training a large-scale autoregressive Transformer on a large number of image-text pairs yields a high-fidelity generative model with controllable synthesis. NUWA (Wu et al., 2022b) proposes a unified multimodal pre-trained model that can generate or manipulate visual data (i.e., images and videos) using a 3D Transformer encoder-decoder framework and a 3D Nearby Attention (3DNA) mechanism. In NUWA-Infinity (Wu et al., 2022a), the authors further propose an autoregressive-over-autoregressive generation method for high-resolution "infinite" visual synthesis, capable of generating images with arbitrary aspect ratios. Parti (Yu et al., 2022b) adopts a similar Transformer-based encoder-decoder architecture, trains the model at scale, and demonstrates impressive image generation results. Make-A-Scene (Gafni et al., 2022) proposes to use an additional segmentation map (which can either be generated or provided) as an extra input to further guide the image generation process. Other examples include CogView (Ding et al., 2021) and CogView2 (Ding et al., 2022a), which are also similar to DALL-E (Ramesh et al., 2021).

Bidirectional image-text generation. ERNIE-ViLG (Zhang et al., 2021a), L-Verse (Kim et al., 2022), and OFA (Wang et al., 2022f) demonstrate the possibility of large-scale generative joint pre-training over both text and image tokens (from VQ-VAE) for a variety of downstream tasks, such as style learning (domain-specific text-to-image generation), super-resolution (image-to-image), image captioning (image-to-text), and even image-text retrieval.
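To make the discrete-token idea concrete, below is a minimal sketch of the vector-quantization step that maps continuous image features to their nearest codebook entries, as in VQ-VAE-style tokenizers; the codebook size and feature shapes are illustrative.

```python
import torch

def vector_quantize(features, codebook):
    """Map continuous features to their nearest codebook entries (VQ-VAE-style sketch).

    features: [N, dim] continuous encoder outputs (e.g., one row per image patch).
    codebook: [K, dim] embedding vectors.
    Returns the discrete token ids [N] and the quantized vectors [N, dim].
    """
    # Euclidean distance between every feature and every codebook entry
    distances = torch.cdist(features, codebook)   # [N, K]
    token_ids = distances.argmin(dim=-1)          # nearest entry per feature
    quantized = codebook[token_ids]               # look up the discrete codes
    return token_ids, quantized

# Example: 196 patch features of dim 64, codebook of 8192 entries
features = torch.randn(196, 64)
codebook = torch.randn(8192, 64)
ids, quantized = vector_quantize(features, codebook)
print(ids.shape, quantized.shape)  # torch.Size([196]) torch.Size([196, 64])
```

The resulting token ids form the discrete "image sentence" that an autoregressive Transformer models jointly with the text tokens.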

7.2 Diffusion-based methods

Continuous diffusion. Recently, diffusion models such as the Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., 2020) have achieved great success in image generation. Dhariwal and Nichol (2021) show that their image synthesis quality is higher than that of VQ-token-based models and GANs. In addition, the Denoising Diffusion Implicit Model (DDIM) (Song et al., 2021) further accelerates the sampling process and achieves near-perfect inversion. For a comprehensive survey of diffusion models, see Yang et al. (2022c). To extend diffusion-based methods to T2I generation, GLIDE (Nichol et al., 2021) employs continuous diffusion and compares CLIP guidance with classifier-free guidance, concluding that, under human evaluation, a 3.5B-parameter diffusion model with classifier-free guidance is preferred over DALL-E (a minimal classifier-free guidance sketch is given at the end of this subsection). Recently, DALL-E 2 (Ramesh et al., 2022), Imagen (Saharia et al., 2022), and Stable Diffusion (a scaled-up version of Latent Diffusion) (Rombach et al., 2022) have pushed the field to a new level, especially the open-source Stable Diffusion. The Latent Diffusion model (Rombach et al., 2022) performs diffusion in a continuous latent space rather than in pixel space as in DALL-E 2 (Ramesh et al., 2022) and Imagen (Saharia et al., 2022).

Discrete diffusion. By combining VQ-token-based and diffusion-based methods, recent studies such as ImageBART (Esser et al., 2021a) and VQ-Diffusion (Gu et al., 2022c) propose to model the discrete latent code space of VQ-VAE (Razavi et al., 2019) with a parameterized conditional DDPM for T2I generation.

Text-to-video generation. This field is developing at a rapid pace. Recent work goes beyond text-to-image generation: Make-A-Video (Singer et al., 2022), Imagen Video (Ho et al., 2022), and Phenaki (Villegas et al., 2022) have raised the quality of text-to-video generation to a new level.
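As a concrete illustration of the classifier-free guidance mentioned above, here is a minimal sketch of a single guidance step; the denoiser interface and the guidance scale are illustrative assumptions rather than the recipe of any specific model.

```python
import torch

def classifier_free_guidance(denoiser, noisy_image, timestep, text_embedding,
                             null_embedding, guidance_scale=7.5):
    """One classifier-free guidance step for a text-conditioned diffusion model (sketch).

    denoiser(noisy_image, timestep, cond) -> predicted noise, for any conditioning `cond`.
    The same network is run with and without the text condition, and the two noise
    predictions are extrapolated by the guidance scale.
    """
    eps_cond = denoiser(noisy_image, timestep, text_embedding)    # text-conditioned
    eps_uncond = denoiser(noisy_image, timestep, null_embedding)  # unconditional (null text)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

A larger guidance scale makes samples follow the text prompt more closely at the cost of diversity; training simply drops the text condition for a fraction of examples so that the same denoiser learns both the conditional and unconditional predictions.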

Reference: Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
