Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Paper reading)

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, Google Research, Brain Team, NeurIPS 2022, Cited: 619, Code, Paper

1. Abstract

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on large Transformer language models for understanding text and relies on the strength of diffusion models for high-fidelity image generation. Our key finding is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen improves sample fidelity and image-text alignment far more than increasing the size of the image diffusion model. Imagen achieves a state-of-the-art FID score of 7.27 on the COCO dataset without ever training on COCO, and human raters find that Imagen samples are on par with the COCO data itself in terms of image-text alignment. For a more in-depth evaluation of text-to-image models, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. Using DrawBench, we compare Imagen with recent methods, including VQ-GAN+CLIP, Latent Diffusion Models, GLIDE, and DALL-E 2, and find that human raters prefer Imagen in side-by-side comparisons, both in terms of sample quality and image-text alignment.

2. Introduction

In recent years, multimodal learning has come to the fore, with text-to-image synthesis and image-text contrastive learning being the most prominent directions. These models have transformed the research community and gained widespread public attention through creative image generation and editing applications. To push this research direction further, we introduce Imagen, a text-to-image diffusion model that combines the power of transformer language models (LMs) with high-fidelity diffusion models to deliver unprecedented photorealism and a deeper level of language understanding. In contrast to prior work that trains models using only image-text data, the key finding behind Imagen is that text embeddings from a large LM, pretrained on a text-only corpus, are remarkably effective for text-to-image synthesis.
Although Imagen is conceptually simple and easy to train, it produces surprisingly strong results. Imagen's zero-shot FID-30K on COCO is 7.27, significantly outperforming prior work such as GLIDE (12.4) and the concurrent work DALL-E 2 (10.4). Our zero-shot FID score also beats state-of-the-art models trained on COCO, e.g., Make-A-Scene (7.6). Furthermore, human raters judge samples generated by Imagen to be on par with the reference images for COCO captions in terms of image-text alignment.

We introduce DrawBench, a new suite of structured text prompts for text-to-image evaluation. DrawBench provides deeper insight through a multidimensional evaluation of text-to-image models, with prompts designed to probe different semantic properties of the model. These include compositionality, cardinality, spatial relationships, the ability to handle complex prompts or prompts with rare words, and creative prompts that push the limits of a model's ability to generate highly implausible scenes far beyond the scope of the training data. Using DrawBench, extensive human evaluation shows that Imagen outperforms other recent methods. We further demonstrate some clear advantages of using large pretrained language models over multimodal embeddings such as CLIP as the text encoder for Imagen.

The main contributions are:

  1. We find that large frozen language models trained only on text data are very effective text encoders for text-to-image generation, and that scaling the size of the frozen text encoder improves sample quality significantly more than scaling the size of the image diffusion model.
  2. We introduce dynamic thresholding, a new diffusion sampling technique that takes advantage of high guidance weights to generate more photorealistic and detailed images than previously possible.
  3. We highlight several important diffusion architecture design choices and propose Efficient U-Net, a new architectural variant that is simpler, converges faster, and is more memory efficient.
  4. We achieve a new state-of-the-art COCO FID of 7.27. Human raters find Imagen to be on par with the reference images in terms of image-text alignment.
  5. We introduce DrawBench, a new comprehensive and challenging evaluation benchmark for text-to-image tasks. In DrawBench human evaluation, we find that Imagen outperforms all other works, including the concurrent work on DALL-E 2 [54].

3. Imagen

Imagen consists of a text encoder that maps text to a sequence of embeddings, and a cascade of conditional diffusion models that map these embeddings to images of increasing resolution (see figure). In the following subsections, we describe each component in detail:
(Figure: the Imagen pipeline: a frozen text encoder, a 64×64 base diffusion model, and two text-conditional super-resolution diffusion models.)
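To make the pipeline concrete, here is a minimal sketch of the cascaded sampling flow in plain PyTorch. The text encoder and the three samplers are placeholder stubs with illustrative names and shapes, not the actual Imagen models:

```python
# Minimal sketch of Imagen's cascaded sampling pipeline. All four components are
# illustrative stand-ins (random tensors of the right shape), not real models.
import torch

def encode_text(prompts):
    # Placeholder frozen text encoder: returns (batch, seq_len, dim) embeddings.
    return torch.randn(len(prompts), 128, 1024)

def base_64(text_emb):
    # Placeholder base diffusion sampler: text embeddings -> 64x64 images.
    return torch.rand(text_emb.shape[0], 3, 64, 64)

def sr_256(low_res, text_emb):
    # Placeholder text-conditional super-resolution sampler: 64x64 -> 256x256.
    return torch.rand(low_res.shape[0], 3, 256, 256)

def sr_1024(low_res, text_emb):
    # Placeholder text-conditional super-resolution sampler: 256x256 -> 1024x1024.
    return torch.rand(low_res.shape[0], 3, 1024, 1024)

def imagen_sample(prompts):
    emb = encode_text(prompts)        # 1. frozen LM maps text to embeddings
    img_64 = base_64(emb)             # 2. base diffusion model generates 64x64
    img_256 = sr_256(img_64, emb)     # 3. first SR model upsamples to 256x256
    return sr_1024(img_256, emb)      # 4. second SR model upsamples to 1024x1024

print(imagen_sample(["a corgi riding a skateboard"]).shape)  # torch.Size([1, 3, 1024, 1024])
```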

3.1 Pretrained Text Encoder

Text-to-image models require powerful semantic text encoders to capture the complexity and compositionality of arbitrary natural-language inputs. In current text-to-image models, the standard practice is to use text encoders trained on paired image-text data, either trained from scratch or pretrained on image-text data (e.g., CLIP). The image-text training objectives suggest that these text encoders encode visually meaningful, semantically rich representations that are especially relevant for the text-to-image generation task. Another option for encoding text is a large language model. Recent advances in large language models (e.g. BERT, GPT, T5) have shown leaps in text understanding and generation capabilities. Language models are trained on text-only corpora, which are much larger than paired image-text data, and are thus exposed to a very rich and wide distribution of text. These models are also typically much larger than the text encoders in current image-text models.

It is therefore natural to explore both families of text encoders for the text-to-image task. Imagen explores several pretrained text encoders: BERT, T5, and CLIP. For simplicity, we freeze the weights of these text encoders. Freezing has several advantages, such as allowing offline computation of embeddings, adding negligible computation or memory overhead during training of the text-to-image model. In our work, we find clear evidence that scaling the size of the text encoder improves the quality of text-to-image generation.
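As a concrete illustration of the frozen-encoder idea, the sketch below encodes a prompt with a small public T5 checkpoint via Hugging Face transformers; the checkpoint name is only for demonstration (Imagen uses much larger T5 variants), and this is not the paper's actual code:

```python
# Sketch: computing frozen text embeddings offline with a small T5 encoder.
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-small")   # illustrative checkpoint
encoder = T5EncoderModel.from_pretrained("t5-small")
encoder.requires_grad_(False)  # freeze: no gradients ever flow into the text encoder
encoder.eval()

with torch.no_grad():
    tokens = tokenizer(["a photograph of an astronaut riding a horse"],
                       return_tensors="pt", padding=True)
    text_emb = encoder(**tokens).last_hidden_state  # (batch, seq_len, d_model)

# Because the encoder is frozen, text_emb can be precomputed and cached, so the
# diffusion model's training loop never needs to run the language model at all.
print(text_emb.shape)
```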

3.2 Diffusion model and classifier-free guidance

Here, we briefly introduce diffusion models. Diffusion models are a class of generative models that convert Gaussian noise into samples from the learned data distribution through an iterative denoising process. These models can be conditional, e.g. based on class labels, text, or low-resolution images. The form of the loss function:
$$\mathbb{E}_{\mathbf{x}, \mathbf{c}, \boldsymbol{\epsilon}, t}\left[w_{t}\left\|\hat{\mathbf{x}}_{\theta}\left(\alpha_{t} \mathbf{x}+\sigma_{t} \boldsymbol{\epsilon}, \mathbf{c}\right)-\mathbf{x}\right\|_{2}^{2}\right]$$
This differs from the form you usually see: here the model predicts $\mathbf{x}_{0}=\mathbf{x}$ directly, and $\mathbf{c}$ is the condition corresponding to $\mathbf{x}_{0}$. The weight $w_{t}$ depends on $t$, meaning that different values of $t$ receive different weights and contribute different amounts of loss; in short, the timesteps $t$ are graded by difficulty.
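A small sketch of this $\mathbf{x}_{0}$-prediction loss, assuming an illustrative denoiser `x_hat(z_t, c, t)` and a toy cosine schedule for $\alpha_{t}, \sigma_{t}$; the schedule and the constant weighting are placeholders, not the paper's exact choices:

```python
# Sketch of the x0-prediction diffusion training loss above.
import torch

def diffusion_loss(x0, cond, x_hat, T=1000):
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                                 # random timestep per sample
    alpha_t = torch.cos(0.5 * torch.pi * t / T).view(b, 1, 1, 1)  # toy VP schedule
    sigma_t = torch.sin(0.5 * torch.pi * t / T).view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    z_t = alpha_t * x0 + sigma_t * eps        # noised input alpha_t*x + sigma_t*eps
    w_t = 1.0                                 # placeholder t-dependent weighting
    pred = x_hat(z_t, cond, t)                # model predicts x0 directly, not eps
    return (w_t * (pred - x0) ** 2).mean()

# Toy usage with a dummy "model" that just echoes its noisy input.
loss = diffusion_loss(torch.rand(4, 3, 64, 64), None, x_hat=lambda z, c, t: z)
print(loss.item())
```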

Classifier guidance is a technique that uses gradients from a pretrained classifier during sampling to improve sample quality while reducing diversity in conditional diffusion models; see Guided Diffusion / Diffusion Models Beat GANs on Image Synthesis (Paper reading) for details. Classifier-free guidance is an alternative technique that avoids this pretrained classifier: by randomly dropping the condition $\mathbf{c}$ during training (e.g., with 10% probability), a single diffusion model is jointly trained on the conditional and unconditional objectives. Note that this paper works with the $\hat{\mathbf{x}}$-prediction; that is, the model still predicts the noise $\epsilon_{\theta}$, which is first converted into $\hat{\mathbf{x}}$ via the standard formula. The adjusted noise prediction is:
$$\tilde{\epsilon}_{\theta}\left(x_{t}, c\right)=w\, \epsilon_{\theta}\left(x_{t}, c\right)+(1-w)\, \epsilon_{\theta}\left(x_{t}\right)$$
This is effectively a balance between the conditional and unconditional predictions: $w=1$ is purely conditional, and for $w>1$ the effect of the condition is amplified.
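In code, the guidance step is just a weighted blend of two forward passes of the same network, and the training-time trick is to occasionally replace the condition with nothing. The sketch below is illustrative; `eps_model` is a stand-in, not the paper's API:

```python
# Sketch of classifier-free guidance: one model, two forward passes, blended with w.
import torch

def maybe_drop_condition(cond, p_drop=0.1):
    # During training, drop the condition ~10% of the time so the same model
    # learns both the conditional and the unconditional objective.
    return None if torch.rand(()) < p_drop else cond

def guided_eps(eps_model, z_t, cond, t, w=3.0):
    eps_cond = eps_model(z_t, cond, t)    # conditional prediction   eps(z_t, c)
    eps_uncond = eps_model(z_t, None, t)  # unconditional prediction eps(z_t)
    # tilde_eps = w * eps(z_t, c) + (1 - w) * eps(z_t):
    # w = 1 recovers the purely conditional model, w > 1 amplifies the condition.
    return w * eps_cond + (1.0 - w) * eps_uncond
```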

3.3 Large guidance weight samplers

We corroborate the results of recent work on text-guided diffusion and find that increasing the classifier-free guidance weight improves image-text alignment but hurts image fidelity, producing highly saturated and unnatural images. We find that this is due to a train-test mismatch caused by high guidance weights. At each sampling step $t$, the $\hat{\mathbf{x}}$-prediction must lie within the same bounds as the training data $\mathbf{x}$, i.e. within $[-1, 1]$, but we empirically find that high guidance weights push the $\hat{\mathbf{x}}$-prediction outside these bounds. This train-test mismatch causes the sampling process to produce unnatural images and sometimes to diverge, since the diffusion model is repeatedly applied to its own output throughout sampling. To address this issue, we investigate static thresholding and dynamic thresholding. The effect is visualized below:

Thresholding techniques on 256×256 samples for “a photograph of an astronaut riding a horse”. The guidance weight increases from 1 to 5, top to bottom. Sampling without thresholding yields poor images at high guidance weights. Static thresholding is an improvement but still leads to oversaturated samples. Our dynamic thresholding produces the highest-quality images.
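The two strategies can be sketched as follows; the percentile `p` is a tunable hyperparameter (the paper reports values around 99.5%), and both functions are applied to the $\hat{\mathbf{x}}$-prediction at every sampling step. This is a minimal reading of the method, not the reference implementation:

```python
# Sketch of static vs. dynamic thresholding of the x0-prediction.
import torch

def static_threshold(x_hat):
    # Clip the prediction back into the training-data range [-1, 1].
    return x_hat.clamp(-1.0, 1.0)

def dynamic_threshold(x_hat, p=0.995):
    # Pick s as the p-th percentile of |x_hat| per image; if s > 1, clip to
    # [-s, s] and rescale by s, pulling saturated pixels inward instead of
    # piling them up at the boundary.
    b = x_hat.shape[0]
    s = torch.quantile(x_hat.abs().reshape(b, -1), p, dim=1)
    s = torch.clamp(s, min=1.0).view(b, 1, 1, 1)
    return x_hat.clamp(-s, s) / s

x = torch.randn(2, 3, 64, 64) * 2.0      # pretend over-saturated prediction
print(dynamic_threshold(x).abs().max())  # always <= 1
```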

3.4 Robust Cascade Diffusion Model

Imagen uses a pipeline consisting of a base 64×64 model and two text-conditional super-resolution diffusion models that upsample the generated 64×64 image to 256×256 and then to 1024×1024. Cascaded diffusion models with noise conditioning augmentation are very effective at progressively generating high-fidelity images (Cascaded diffusion models for high fidelity image generation). Moreover, making the super-resolution models aware of the amount of added noise, via noise level conditioning, significantly improves sample quality and helps make the super-resolution models more robust to artifacts produced by the lower-resolution models. Imagen uses noise conditioning augmentation for both super-resolution models, and we find this to be critical for generating high-fidelity images.

Given a conditioning low-resolution image and an augmentation level `aug_level` (e.g. the strength of Gaussian noise or blur), we corrupt the low-resolution image with the augmentation corresponding to `aug_level`, and condition the diffusion model on `aug_level`. During training, `aug_level` is chosen randomly, while during inference we sweep over different values to find the best sample quality. In our case we use Gaussian noise as the augmentation, applying a variance-preserving Gaussian noise augmentation similar to the forward process used in the diffusion model.
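A minimal sketch of this variance-preserving noise augmentation, treating `aug_level` as a value in [0, 1] that both sets the corruption strength and is fed to the model as an extra conditioning signal; the cosine/sine parameterization is an assumption for illustration, not necessarily the paper's exact formula:

```python
# Sketch of noise conditioning augmentation for the super-resolution models.
import torch

def noise_augment(low_res, aug_level):
    # Variance-preserving corruption, analogous to the diffusion forward process:
    # z = alpha * x + sigma * eps  with  alpha^2 + sigma^2 = 1.
    alpha = torch.cos(0.5 * torch.pi * aug_level)
    sigma = torch.sin(0.5 * torch.pi * aug_level)
    eps = torch.randn_like(low_res)
    return alpha * low_res + sigma * eps

low_res = torch.rand(4, 3, 64, 64)
aug_level = torch.rand(4, 1, 1, 1)            # training: sampled at random per example
corrupted = noise_augment(low_res, aug_level)
# The SR model is conditioned on both `corrupted` and `aug_level`; at inference
# time one sweeps a few fixed aug_level values and keeps the best samples.
print(corrupted.shape)
```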

4. Evaluation and experimentation

We train a 2-billion-parameter model for 64×64 text-to-image synthesis, and 600M- and 400M-parameter models for the 64→256 and 256→1024 super-resolution stages, respectively. We use a batch size of 2048 and 2.5M training steps for all models. The base 64×64 model uses 256 TPU-v4 chips, and each super-resolution model uses 128 TPU-v4 chips. By my rough calculation, the hardware alone costs about 6 million yuan per day.

5. Efficient U-Net

We introduce a new architectural variant for our super-resolution models, which we call Efficient U-Net. We find that Efficient U-Net is simpler, converges faster, and is more memory efficient than some previous implementations, especially at high resolutions. We make several key modifications to the U-Net architecture, such as shifting model parameters from high-resolution blocks to low-resolution blocks, scaling skip connections by $1/\sqrt{2}$, and reversing the order of the downsampling/upsampling operations to speed up the forward pass. In detail, Efficient U-Net makes the following key modifications to the typical U-Net model:

  1. We transfer model parameters from high-resolution blocks to low-resolution blocks by adding more residual blocks for lower resolutions. Since lower-resolution blocks typically have more channels, this allows us to increase model capacity with more model parameters without incurring prohibitive memory and computational costs.
  2. When using a large number of residual blocks at lower resolutions (we use 8 residual blocks at the lower resolutions), we find that scaling the skip connections by $1/\sqrt{2}$ significantly improves convergence speed.
  3. In the downsampling block of a typical U-Net, the downsampling operation happens after the convolutions, while in the upsampling block the upsampling operation happens before the convolutions. We reverse this order for both blocks, which significantly speeds up the forward pass of the U-Net with no drop in performance (a minimal sketch of points 2 and 3 follows this list).
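Below is a minimal sketch of modifications 2 and 3, with illustrative block definitions rather than the paper's exact Efficient U-Net blocks: the skip connection is scaled by $1/\sqrt{2}$ when it is merged back in, the downsampling block pools before its convolution, and the upsampling block interpolates after its convolution:

```python
# Sketch: scaled skip connections and reversed down/upsampling order.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientDBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)

    def forward(self, x):
        x = F.avg_pool2d(x, 2)       # downsample first ...
        return F.silu(self.conv(x))  # ... then convolve at the cheaper resolution

class EfficientUBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)

    def forward(self, x, skip):
        x = x + skip / math.sqrt(2)              # skip connection scaled by 1/sqrt(2)
        x = F.silu(self.conv(x))                 # convolve at the lower resolution ...
        return F.interpolate(x, scale_factor=2)  # ... then upsample last
```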

With these key simple modifications, Efficient U-Net is simpler, converges faster, and is more memory efficient than some previous U-Net implementations. The figure below shows the complete architecture of the efficient U-Net, while Figures A.28 and A.29 show the detailed description of the downsampling and upsampling blocks of the efficient U-Net, respectively.

(Figures: the complete Efficient U-Net architecture; Figures A.28 and A.29: its downsampling and upsampling blocks.)
