Generative AI New World | Interpretation of papers in the field of Text-to-Image


The author of this article is Huang Haowen

Amazon Cloud Technology Senior Developer Evangelist

In the last article, we started to explore another rapidly advancing field of Generative AI: Text-to-Image. We gave an overview of its basic building blocks, such as CLIP, OpenCLIP, the diffusion model, the DALL-E-2 model, and the Stable Diffusion model.

The content of this issue will be the interpretation of the main papers in the direction of Text-to-Image.

Interpretation of Variational Autoencoder VAE Paper

Variational Auto-Encoder

1. Auto-Encoder architecture

An autoencoder (Auto-Encoder) is an unsupervised neural network that learns a compressed representation of the input data. Specifically, it can be divided into two parts:

Encoder: responsible for compressing data into a low-dimensional representation;

Decoder: responsible for restoring the low-dimensional representation to the original data.

82a4f396a2dc46f2549aa16152a719b3.png

Source: https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798
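To make the encoder/decoder split concrete, here is a minimal autoencoder sketch in PyTorch. It is my own illustration rather than code from any of the papers discussed; the 784-dimensional input and 32-dimensional latent code are arbitrary assumptions for an MNIST-like image.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal autoencoder: compress a 784-dim input into a 32-dim latent code."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: high-dimensional data -> low-dimensional representation
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: low-dimensional representation -> reconstruction of the input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)      # compress
        return self.decoder(z)   # reconstruct
```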

After reading this, some readers may ask: since the decoder only needs a low-dimensional vector as input to output high-dimensional image data, can we use the decoder directly as a generative model? For example, randomly generate some vectors in the low-dimensional space and then feed them to the decoder to generate pictures.

The reason we do not do this is that the vast majority of randomly generated vectors decode to meaningless noise. Since we do not explicitly model the distribution, we do not know which vectors will generate useful pictures. The dataset we use for training is usually finite and therefore covers only a limited region, while the entire low-dimensional space is very large; if we only sample randomly in this space, the probability of sampling a vector that generates a useful picture is not high.

VAE (Variational Auto-Encoder) builds on the AE and explicitly models the distribution, helping the autoencoder become a qualified, or even excellent, generative model.

2. Dimensionality reduction and latent space

Dimensionality Reduction and latent space

In the previous section we touched on the concept of dimensionality reduction (Dimensionality Reduction), which is very important throughout generative AI. In this section I will give a layman's explanation.

In machine learning, dimensionality reduction is the process of reducing the number of features that describe some data. This reduction can be achieved by selection (keeping only some of the existing features) or by extraction (creating a smaller number of new features from the old ones). It is very useful in the many cases where low-dimensional data is required (data visualization, data storage, heavy computation, etc.).

First, let us call the process of producing a "new feature" representation from an "old feature" representation (whether by selection or extraction) encoding, and the reverse process decoding. Dimensionality reduction can then be interpreted as data compression, where the encoder compresses the data (from the initial space to the encoded space, also known as the latent space) and the decoder decompresses it. Of course, depending on the initial data distribution, the latent space dimension, and the encoder definition, this compression can be lossy, meaning that some information is lost during encoding and cannot be recovered upon decoding.

59a699b829655bc9772baa1ddc3fbb9c.png

Source: https://theaisummer.com/latent-variable-models/

The autoencoder (Auto-Encoder) uses neural networks to perform the dimensionality reduction. The general idea of an autoencoder is very simple: set both the encoder and the decoder as neural networks, and learn the best encoding and decoding scheme through an iterative optimization process. In each iteration, we feed some data to the autoencoder architecture (the encoder followed by the decoder), compare the encoded-then-decoded output with the original data, and backpropagate the error through the architecture to update the weights of the networks.

The entire autoencoder architecture ensures that only the main structural part of the information passes through and is reconstructed. From the perspective of the overall framework, the considered family of encoders E is defined by the encoder network architecture and the considered family of decoders D is defined by the decoder network architecture; the search for the encoder and decoder parameters that minimize the reconstruction error is done by gradient descent (Gradient Descent). As shown below:

82ff63818b44b4c106ef5d40510d5ce5.png

Source: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
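As a minimal sketch of the iterative optimization just described, assuming the `AutoEncoder` module from the earlier sketch and a hypothetical `dataloader` that yields batches of flattened images, each step compares the reconstruction with the original input and backpropagates the error:

```python
import torch
import torch.nn.functional as F

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for x in dataloader:                # x: batch of flattened images, shape (B, 784); dataloader is assumed
        x_hat = model(x)                # encode then decode
        loss = F.mse_loss(x_hat, x)     # reconstruction error
        optimizer.zero_grad()
        loss.backward()                 # backpropagate the error through encoder and decoder
        optimizer.step()                # gradient descent update of the network weights
```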

This autoencoder structure faces two main challenges in the real world.

First, a significant dimensionality reduction without reconstruction loss often comes at the cost of a latent space that lacks interpretable and exploitable structure, or, more simply, lacks regularity. Second, in most cases the ultimate goal of dimensionality reduction is not only to reduce the number of dimensions of the data, but also to keep most of the structural information of the data in the reduced representation.

For these two reasons, in the real world we have to carefully control and tune the size of the latent space and the "depth" of the autoencoder (which defines the degree of compression and quality) according to the ultimate goal of dimensionality reduction. As shown below:

6ee4481e716b5d4707d40c42a5bd397f.png

Source: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

3. Variational Autoencoder

Variational Auto-Encoder

With this background in place, we can finally explore the essence of the VAE paper.

a026d50f6cf5e6c003578cd20b6424eb.png

Source: https://arxiv.org/pdf/1312.6114.pdf

So far, we have discussed the problem of dimensionality reduction and introduced autoencoders, which are encoder-decoder architectures that can be trained with gradient descent. Now let us connect this with the content generation problem, look at the limitations of autoencoders for this problem, and then introduce the variational autoencoder (Variational Auto-Encoder).

Regarding combining content generation with autoencoders, we may ask: if the latent space is regular enough, can we randomly pick points from the latent space and decode them to obtain new content? As shown below:

aca2279bf11a370252b92b0daf4e1dc6.png

Source: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

Definition of Variational Autoencoder

Therefore, to be able to use the decoder of an autoencoder for generative purposes, we must ensure that the latent space is sufficiently regular. A possible solution to obtain such regularity is to introduce explicit regularization during training. A variational autoencoder can be defined as an autoencoder whose training is regularized to avoid overfitting and ensure that the latent space is well-characterized for the generative process.

Just like a standard autoencoder, a variational autoencoder is an architecture consisting of an encoder and a decoder, trained to minimize the reconstruction error between the encoded-then-decoded data and the original data. However, to introduce some regularization of the latent space, we make a slight modification to the encoding-decoding process: instead of encoding the input as a single point, we encode it as a distribution over the latent space. The model is then trained as follows:

  1. The input is encoded as a distribution over the latent space

  2. A point in the latent space is sampled from this distribution

  3. The sampled point is decoded and the reconstruction error is computed

  4. Reconstruction errors are backpropagated through the network

As shown below:

b70c5af33eba25883315ed49cc16272e.png

Source: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

In fact, the encoded distributions are chosen to be normal, so that the encoder can be trained to return the means and covariance matrices that describe these Gaussians. The reason the input is encoded as a distribution with some variance rather than as a single point is that this expresses latent space regularization very naturally: the variance controls local regularization and the mean controls global regularization, so both local and global regularization of the latent space are ensured.

Therefore, the loss function minimized when training a VAE consists of a "reconstruction term" (on the final layer) and a "regularization term" (on the latent layer); the latter regularizes the organization of the latent space by pushing the distributions returned by the encoder close to a standard normal distribution. This regularization is expressed as the KL divergence (Kullback-Leibler Divergence) between the returned distribution and a standard Gaussian. Since the KL divergence between two Gaussian distributions has a closed form, it can be written directly in terms of the means and covariance matrices of the two distributions. As shown below:

344755396e995a08cfccf52f7d71c8a8.png

Source: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
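A minimal sketch of this idea, assuming the same MNIST-like setup as the autoencoder sketches above (my own illustration, not code from the paper): the encoder returns a mean and a log-variance, a point is sampled with the reparameterization trick, and the loss adds the closed-form KL term D_KL(N(μ, σ²) ‖ N(0, I)) = ½ Σ (μ² + σ² − log σ² − 1) to the reconstruction error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)       # "encoder" head for the mean
        self.fc_logvar = nn.Linear(256, latent_dim)   # "encoder" head for the (log) variance
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.backbone(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        eps = torch.randn_like(mu)                    # randomness isolated in eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps        # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")     # reconstruction term
    # Closed-form KL divergence between N(mu, sigma^2) and the standard Gaussian N(0, I)
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
    return recon + kl
```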

My personal understanding of VAE is that the core of the VAE architecture is two encoder heads: one computes the mean and the other computes the variance, and both are implemented by the VAE as neural networks.

On top of our conventional autoencoder, VAE essentially adds "Gaussian noise" to the output of the encoder (the network that computes the mean), so that the resulting decoder is robust to noise. The additional KL loss (whose purpose is to push the mean to 0 and the variance to 1) is effectively a regularization term on the encoder, encouraging everything coming out of the encoder to have zero mean.

The other encoder head (the network that computes the variance) is used to dynamically adjust the noise intensity. When the decoder is not yet well trained (the reconstruction error is much larger than the KL loss; KL is short for Kullback-Leibler, a classic measure of the similarity between probability distributions), the noise is appropriately reduced (the KL loss increases), which makes fitting easier (the reconstruction error starts to decrease). Conversely, when the decoder is well trained (the reconstruction error is smaller than the KL loss), the noise is increased (the KL loss decreases), which makes fitting harder (the reconstruction error starts to increase) and pushes the decoder to find ways to improve its generative ability.

In the VAE paper, this essence is explained with an elegant mathematical formulation, as shown in the screenshot of the paper below:

8272af131f7c838cb936182361e4e8e3.png

Source: https://arxiv.org/pdf/1312.6114.pdf

4. Key Points of the VAE Paper

The main points of this VAE paper are:

  • Dimensionality reduction is the process of reducing the number of features describing some data (either selecting only a subset of the initial features or merging them into a reduced number of new features), so it can be thought of as an encoding process;

  • Autoencoders are neural network architectures consisting of an encoder and a decoder that create a bottleneck for the data to pass through; they are trained (by iterations of gradient descent) to lose as little information as possible during the encoding-decoding process, with the aim of minimizing the reconstruction error;

  • Due to overfitting, the latent space of an autoencoder can be extremely irregular (close points in the latent space can give drastically different decoded data, and some points of the latent space can give meaningless content once decoded); so we cannot really define a generative process that simply consists of sampling a point from the latent space and passing it through the decoder to obtain new data;

  • A variational autoencoder (VAE) is an autoencoder that addresses the problem of latent space irregularity by making the encoder return a distribution over the latent space instead of a single point, and by adding to the loss function a regularization term over that returned distribution, so as to ensure a better organization of the latent space;

  • Assuming a simple underlying probabilistic model to describe our data, the loss function of the VAE, composed of a reconstruction term and a regularization term, can be carefully derived using the statistical technique of variational inference (hence the name "variational autoencoder").

Interpretation of Diffusion Model Series Papers

Diffusion Models

Before diffusion models became the mainstream in the text-to-image field, there were three main types of generative models. They are:

  • GAN (Generative Adversarial Network)

  • VAE (Variational Auto-Encoder)

  • Flow-based models

These models have all had great success in generating high-quality samples, but each has its own limitations. It is well known that, due to the adversarial nature of its training, GAN training can be unstable and produce samples with low diversity; VAEs rely on a surrogate loss (Surrogate Loss) function; and flow-based models must use specialized architectures to construct invertible transformations.

  • Surrogate Loss function:

    https://baike.baidu.com/item/%E4%BB%A3%E7%90%86%E6%8D%9F%E5%A4%B1%E5%87%BD%E6%95%B0/22787203

Diffusion Models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps that slowly adds random noise to the data, and then learn to reverse the diffusion process to construct the desired data samples from noise. Unlike VAE or flow-based models, diffusion models are learned with a fixed procedure, and the latent variables have high dimensionality (the same as the original data). As shown below:

08a16f5db60318a9c956baec39aabde5.png

Source: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

One of the important contributions of the diffusion model is that, during training (for example, DDPM training), a noise-estimation model ε_θ(x_t, t) predicts the real noise, so as to minimize the difference between the estimated noise and the real noise. We will elaborate on this contribution later on.

1. Overview of Diffusion Models

Several major papers are based on similar ideas about diffusion generative models, including diffusion probabilistic models (Sohl-Dickstein et al., 2015), noise-conditioned score networks (NCSN; Yang & Ermon, 2019), and denoising diffusion probabilistic models (DDPM; Ho et al., 2020).

  • Sohl-Dickstein et al., 2015:

    https://arxiv.org/abs/1503.03585

  • Yang & Ermon, 2019:

    https://arxiv.org/abs/1907.05600

  • Ho et al., 2020:

    https://arxiv.org/abs/2006.11239

1

Forward Diffusion Process

Given a data point x_0 ~ q(x) sampled from the real data distribution, let us define a forward diffusion process. In this process, we add a small amount of Gaussian noise to the sample over T steps, generating a sequence of noisy samples x_1, …, x_T, with step sizes controlled by the variance schedule {β_t ∈ (0,1)}_{t=1}^T:

322bdfe5b8ade4dd21cb345802e1cb20.jpeg
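For readers who cannot view the formula image above, the forward process defined in the DDPM paper can be written as:

$$
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t\mathbf{I}\right),
\qquad
q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})
$$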

The data sample x_0 gradually loses its salient features as the step t becomes larger. When T → ∞, x_T is equivalent to an isotropic Gaussian distribution. As shown below:

9bd0ab58c76fae8084ea804ca00cf625.png

The Markov chain of forward (reverse) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020 with a few additional annotations)

Both the forward and the reverse diffusion processes are Markov processes; the only difference is:

The mean and variance of the Gaussian distribution of each conditional probability in the forward diffusion process are already determined (they depend on β_t and x_0), while the mean and variance in the reverse diffusion process have to be learned by a neural network.

  • Markov process:

    https://zhuanlan.zhihu.com/p/426290103

Another nice property of the above process is that x_t at an arbitrary time step can be sampled in closed form using the reparameterization trick, as shown in the following diagram:

0fb3ed39755a6ff1712042b6f64d11cd.png

Illustration of how the reparameterization trick makes the sampling process trainable.(Image source: Slide 12 in Kingma’s NIPS 2015 workshop talk)
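Concretely, writing α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^t α_s, the closed-form marginal referred to above (a transcription of the standard DDPM result, given here for reference) is:

$$
q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right),
\qquad
\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},
\quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})
$$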

The reparameterization trick also works for other types of distributions, not just Gaussians. In the multivariate Gaussian case, the model is made trainable by using the reparameterization technique described above and learning the mean μ and variance σ of the distribution, while the randomness is isolated in a random variable ε ~ N(0, I). The figure below is a schematic diagram of a variational autoencoder model under a multivariate Gaussian assumption; this variational autoencoder model was discussed in detail in the previous chapter.

b17aaf486261a72a393ba739393f7354.png

Source: https://lilianweng.github.io/posts/2018-08-12-vae/#reparameterization-trick

I will not elaborate on the mathematical derivation process of the forward diffusion process in this space. Interested students can refer to the content of the "Forward diffusion process" section of the following article:

https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

Here is just a summary:

Compared with the standard stochastic gradient descent (SGD) method, the diffusion model draws on stochastic gradient Langevin dynamics, which injects Gaussian noise into the parameter updates to avoid collapsing into local minima.

2

Reverse Diffusion Process

If we can reverse the above process and sample from q(x_{t-1} | x_t), we can recreate real samples from a Gaussian noise input x_T ~ N(0, I). Note that if β_t is small enough, q(x_{t-1} | x_t) will also be Gaussian. However, q(x_{t-1} | x_t) cannot be easily estimated, because estimating it requires using the entire dataset, as shown in the following figure:

80c790962ca2c2175488e558580831ad.png

Image source: Ho et al. 2020 with a few additional annotations

Therefore, a model p_θ needs to be trained to approximate these conditional probabilities so that the reverse diffusion process can be run:

d2a2e5923ed87e63452e2708999edd7c.jpeg
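In the standard DDPM notation, the learned reverse process approximated by p_θ (transcribed here for readers who cannot view the formula image) is:

$$
p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T)\prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t),
\qquad
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\!\left(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\right)
$$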

The figure below shows an example of training a diffusion model to model 2D Swiss roll data from the "Sohl-Dickstein et al., 2015" paper.

  • Sohl-Dickstein et al., 2015

    https://arxiv.org/abs/1503.03585

9b72891e77662a1eab258b7b92fb7f7b.png

Image source: Sohl-Dickstein et al., 2015

The first row shows time slices from the forward trajectory q(x_{0:T}): the data distribution (left) undergoes Gaussian diffusion and is gradually transformed into an identity-covariance Gaussian (right).

The middle row shows the corresponding time slices of the trained reverse trajectory p_θ(x_{0:T}): an identity-covariance Gaussian (right) undergoes a diffusion process with learned mean and covariance functions and is gradually transformed back into the data distribution (left).

The last row shows the drift term μ_θ(x_t, t) − x_t for the same reverse diffusion process.

3

DDPM paper and the parameterization of L_t

As mentioned before, we need to learn a neural network to approximate the conditional probability distributions in the reverse diffusion process:

ce4e11d2d10fe50ec69cdd5b568a6d61.jpeg

We want to train μ_θ to predict:

668cab2f33455739c97394a8f02209e1.jpeg

Since x_t is available as input at training time, we can instead reparameterize the Gaussian noise term so that the network predicts ε_t from the input x_t at time step t:

0562859896b739ddf9387152ed376d7b.jpeg

The objective can then be simplified with some algebra. Readers interested in the detailed mathematical derivation can refer to the following article:

https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

The simplified result can be found in the DDPM paper:

17659b52cf73d5ac8b993e9844584163.png

Source: https://arxiv.org/abs/2006.11239

ecee6ebc94bc0e66af65a23e0e2aeb5a.png

The training and sampling algorithms in DDPM (Image source: Ho et al. 2020)

The reason for the above simplification is mentioned in the DDPM paper: Ho et al. (2020) found empirically that training the diffusion model works better with a simplified objective that ignores the weighting term:

2f0d455218154008f303af6a9e0d50f0.jpeg

The final simplified formula is:

7aa96421a3b8b8d6bbc9d9f34a31089d.jpeg

where C is a constant independent of θ.

  • Ho et al. (2020):

    https://arxiv.org/abs/2006.11239
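As a hedged illustration of the training procedure above (Algorithm 1 in the DDPM paper), here is a minimal PyTorch-style sketch. `eps_model` is an assumed noise-prediction network ε_θ(x_t, t) (for example a U-Net), and `alpha_bar` is the precomputed cumulative product ᾱ_t of the noise schedule; both names are my own.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear variance schedule, as in the DDPM paper
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def ddpm_training_step(eps_model, x0, optimizer):
    """One training step of the simplified DDPM objective L_simple."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # t ~ Uniform({1, ..., T})
    eps = torch.randn_like(x0)                           # eps ~ N(0, I)
    a = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))    # broadcast \bar{alpha}_t over image dims
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps           # closed-form forward sample of x_t
    loss = F.mse_loss(eps_model(x_t, t), eps)            # predict the injected noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```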

2. Accelerated Sampling of Diffusion Models

Speed up Diffusion Model Sampling

1

Interpretation of the DDIM paper

Generating samples from a DDPM by following the reverse diffusion Markov chain is very slow, since T can be on the order of a few thousand steps. The 2020 DDIM paper by Song et al. notes: "For example, it takes around 20 hours to sample 50k images of size 32 × 32 from a DDPM, but less than a minute to do so from a GAN on an Nvidia 2080 Ti GPU."

  • The 2020 DDIM paper by Song et al.:

    https://arxiv.org/abs/2010.02502

37a8f862160d6aa2ef06d97c71a6deaa.png

Source: https://arxiv.org/pdf/2010.02502.pdf

DDIM has the same marginal noise distribution as DDPM, but deterministically maps noise back to the original data samples. During generation, we sample only a subset {τ_1, …, τ_S} of the diffusion steps, and the inference process of DDIM becomes:

43fc6b5d4fe05d028bd2221c3667e2b5.jpeg

Although all models in the experiments were trained with T=1000 diffusion steps, the authors observed that DDIM (η=0) produces the best-quality samples when S is small, whereas DDPM (η=1) performs much worse for small S. DDPM only does better when the full reverse Markov diffusion chain can be run (S=T=1000).

With DDIM, a diffusion model can be trained with an arbitrary number of forward steps but sample from only a subset of those steps during generation. The comparative test results for DDPM and DDIM in the paper are shown in the figure below:

31bd89918f7e4a136d82efb374a493fd.png

Source: https://arxiv.org/pdf/2010.02502.pdf

The comparison between DDIM and DDPM is summarized as follows:

  • Generate higher quality samples using fewer steps

  • The generative process is deterministic, which means that multiple samples conditioned on the same latent variable share similar high-level features

  • DDIM enables meaningful interpolation among latent variables due to consistency
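To make the subset-of-steps idea concrete, here is a hedged sketch of deterministic DDIM sampling (η = 0). It reuses the `eps_model` and `alpha_bar` names assumed in the DDPM training sketch above, and follows the standard DDIM update: first estimate x_0 from the predicted noise, then move to the previous (sub-sampled) step.

```python
@torch.no_grad()
def ddim_sample(eps_model, shape, n_steps=50):
    """Deterministic DDIM sampling (eta = 0) over a sub-sequence of the T training steps."""
    taus = torch.linspace(T - 1, 0, n_steps).long()   # subset {tau_1, ..., tau_S} of diffusion steps
    x = torch.randn(shape)                            # start from pure Gaussian noise x_T
    for i in range(len(taus) - 1):
        t, t_prev = taus[i], taus[i + 1]
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        eps = eps_model(x, t.repeat(shape[0]))
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted x_0
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps   # deterministic step (eta = 0)
    return x
```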

2

Interpretation of the LDM paper

Another important paper is the Latent Diffusion Model (LDM; Rombach & Blattmann et al., 2022) paper (shown below), which proposes to run the diffusion process in latent space instead of pixel space, thereby reducing training cost and speeding up inference.

  • LDM: Rombach & Blattmann et al., 2022:

    https://arxiv.org/abs/2112.10752

ba242caaa9c64b6c532a5503385af860.png

Source: https://arxiv.org/pdf/2112.10752.pdf

The paper is motivated by the observation that most bits of an image contribute to perceptual detail, while the semantic and conceptual composition persists even after aggressive compression. LDM loosely decomposes perceptual compression and semantic compression through generative modeling: it first trims off pixel-level redundancy with an autoencoder, and then manipulates/generates semantic concepts with a diffusion process on the learned latent.

03f6e81c9f5709204a449275ea32287a.png

Illustrating perceptual and semantic compression

Most parts of a digital image correspond to imperceptible detail. Although a diffusion model already suppresses this semantically meaningless information by minimizing the relevant loss term, it still has to evaluate gradients (during training) and the neural network backbone (during both training and inference) over all pixels, which leads to redundant computation and unnecessarily expensive optimization and inference.

Therefore, the LDM paper proposes the latent diffusion model as an efficient generative model with a separate, lightweight compression stage.

The perceptual compression stage relies on an autoencoder model. An encoder ε is used to compress the input image x ∈ R^{H×W×3} into a smaller 2D latent z = ε(x) ∈ R^{h×w×c}, with downsampling rate f = H/h = W/w = 2^m, m ∈ N; a decoder D then reconstructs the image from the latent, x̃ = D(z).

The diffusion and denoising processes take place on the latent z. The denoising model is a time-conditioned U-Net augmented with a cross-attention mechanism to handle flexible conditioning information for image generation (e.g. class labels, semantic maps, blurred variants of an image). This design is equivalent to fusing representations of different modalities into the model via a cross-attention mechanism. Each type of conditioning information is paired with a domain-specific encoder τ_θ that projects the conditioning input y into an intermediate representation τ_θ(y) ∈ R^{M×d_τ}, which can then be mapped into the cross-attention component:

deb33079dc8c5c6feec9f36c236a339b.jpeg
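The cross-attention used here takes the standard form given in the LDM paper (transcribed for readers who cannot view the formula image), where φ_i(z_t) denotes a flattened intermediate representation of the U-Net:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right) V,
\qquad
Q = W_Q^{(i)}\,\varphi_i(z_t),\quad
K = W_K^{(i)}\,\tau_\theta(y),\quad
V = W_V^{(i)}\,\tau_\theta(y)
$$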

b2036310702ab6ec5590d5ab6666963d.png

The architecture of latent diffusion model. (Image source: Rombach & Blattmann, et al. 2022)

3. Conditional Generation of Diffusion Models

Conditioned Generation

When training a generative model on images with conditioning information, for example the ImageNet dataset with class labels, it is common to generate samples conditioned on a class label or on a piece of descriptive text.

1

Classifier Guidance for Diffusion Models

Classifier Guided Diffusion

The GLIDE paper (shown below) presents recent work in the area of guidance for diffusion models.

7aa885eb6a2f01d3d666d5aa0fe989e2.png

Source: https://arxiv.org/pdf/2112.10741.pdf

To explicitly incorporate class information into the diffusion process, Dhariwal & Nichol (2021) trained a classifier f_φ(y | x_t, t) on noisy images x_t, and use the gradient ∇_{x_t} log f_φ(y | x_t) to guide the diffusion sampling process toward the conditioning information y (e.g. a target class label) by altering the noise prediction. Their resulting ablated diffusion model (ADM) and the model with additional classifier guidance (ADM-G) achieve better results than state-of-the-art generative models such as BigGAN.

  • Dhariwal & Nichol (2021)

    https://arxiv.org/abs/2105.05233

7b0e3f84af014e4cfc2cf4a0b96478a9.png

The algorithms use guidance from a classifier to run conditioned generation with DDPM and DDIM. (Image source: Dhariwal & Nichol, 2021)
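In the notation of Dhariwal & Nichol, classifier guidance shifts the predicted mean (for DDPM sampling) or the predicted noise (for DDIM sampling) using the classifier gradient scaled by a guidance weight s; hedging on exact constants, the commonly quoted forms are:

$$
\hat{\boldsymbol{\mu}}_\theta(\mathbf{x}_t \mid y) = \boldsymbol{\mu}_\theta(\mathbf{x}_t) + s\,\boldsymbol{\Sigma}_\theta(\mathbf{x}_t)\,\nabla_{\mathbf{x}_t} \log f_\phi(y \mid \mathbf{x}_t),
\qquad
\hat{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t) - \sqrt{1-\bar{\alpha}_t}\; s\,\nabla_{\mathbf{x}_t} \log f_\phi(y \mid \mathbf{x}_t)
$$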

2

Classifier-Free Guidance for Diffusion Models

Classifier-Free Guidance

The GLIDE paper also describes sampling from GLIDE with classifier-free guidance. From the sample images provided in the paper, it can be observed that the GLIDE model can generate realistic images with shadows and reflections, combine multiple concepts, produce artistic renderings of novel concepts, and more.

1edbd1dc6565afd33e82cedd1e6c9cdc.png

71438c4f24fc25831dfb409926f43252.png

Source: https://arxiv.org/pdf/2112.10741.pdf

The GLIDE paper also explores the two guidance strategies, CLIP guidance and classifier-free guidance, in detail, and finds that the latter is preferred.
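The classifier-free guided noise prediction combines a conditional and an unconditional prediction with a guidance scale s; this is the standard form used by GLIDE and related work, transcribed here for reference:

$$
\hat{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t \mid y) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t \mid \varnothing) + s\left(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t \mid y) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t \mid \varnothing)\right)
$$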

4. High resolution and image quality of the diffusion model

1

CDM Paper

The paper "Cascaded Diffusion Models for High Fidelity Image Generation" suggests a series of higher resolution multiple diffusion models. Noise conditioning augmentation between pipeline models is crucial to the final image quality, i.e. applying strong data augmentation to the conditioning input z of each super-resolution model p θ (x|z), conditioning the noise helps to reduce compound errors in pipeline setups.

  • Cascaded Diffusion Models for High Fidelity Image Generation:

    https://arxiv.org/abs/2106.15282

f11f8bf42038733ca9a823f1061c1f2e.png

Source: https://arxiv.org/pdf/2106.15282.pdf

In diffusion modeling for generating high-resolution images, U-Net is a common choice of model architecture. The paper mentioned that in the pipeline of Cascaded Diffusion Models, each model uses the U-Net architecture. As shown below:

869d880d240c740ad38254fb9036e316.png

Source: https://arxiv.org/pdf/2106.15282.pdf

The paper also states that the most effective noise they found is Gaussian noise applied at low resolutions and Gaussian blur applied at high resolutions. Additionally, they explore two forms of conditioning augmentation that require minor modifications to the training procedure. The conditioning noise is applied only during training, not at inference.

2

UnCLIP Paper

The two-stage diffusion model unCLIP (Ramesh et al., 2022) paper proposes to leverage the CLIP text encoder to generate high-quality text-guided images.

  • Ramesh et al. 2022

    https://arxiv.org/abs/2204.06125

5b1f43478fe79f72092de1696e68aae6.png

Source: https://arxiv.org/abs/2204.06125

Given a pretrained CLIP model c and paired training data (x, y) for the diffusion model, where x is an image and y is the corresponding caption, we can compute the CLIP text embedding c^t(y) and image embedding c^i(x).

UnCLIP learns two models simultaneously:

  1. Prior model p(c^i | y): given the text y, outputs the CLIP image embedding c^i;

  2. Decoder p(x | c^i, [y]): given a CLIP image embedding c^i and (optionally) the original text y, outputs an image x.

These two models support conditional generation because:

3bb39e4bf0ad8c8b7439918c94b515c6.jpeg
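For readers who cannot view the formula image, the factorization (as presented in the unCLIP write-up) is that, because c^i is a deterministic function of the image x,

$$
P(x \mid y) = P(x, c^{i} \mid y) = P(x \mid c^{i}, y)\, P(c^{i} \mid y)
$$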

1bdc76e2320039330cd9bf6663812ddc.png

The architecture of unCLIP. (Image source: Ramesh et al. 2022)

3

Imagen Paper

The Imagen paper (Saharia et al., 2022) does not use a CLIP model; instead it uses a pre-trained large language model (a frozen T5-XXL text encoder) to encode the text for image generation.

  • Saharia et al. 2022

    https://arxiv.org/abs/2205.11487

300b31da180b89a9846ad61eed4c9b87.png

Source: https://arxiv.org/pdf/2205.11487.pdf

The general trend is that larger model sizes lead to better image quality and text-image alignment. The paper's research team found that T5-XXL and CLIP text encoders achieve similar performance on MS-COCO.

ba112abc868176edab01aae94d56ee9e.png

Source: https://arxiv.org/pdf/2205.11487.pdf

Imagen modifies several designs in the U-Net to make it an Efficient U-Net. For example:

  • Shift model parameters from the high-resolution blocks to the low-resolution blocks by adding more residual blocks at the lower resolutions

  • Scale the skip connections by 1/√2

  • Reverse the order of downsampling (moving it before the convolutions) and upsampling (moving it after the convolutions) to improve the speed of the forward pass

The paper team's experience summary includes:

  • Noise conditioning augmentation, dynamic thresholding, and the Efficient U-Net are critical to image quality (a sketch of dynamic thresholding follows this list)

  • Scaling text encoder size is more important than U-Net size
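Dynamic thresholding, as described in the Imagen paper, clips the predicted x_0 at each sampling step to a per-sample percentile s of its absolute pixel values and, whenever s > 1, rescales by s so pixels stay within [-1, 1]. Here is a minimal sketch of my own (the percentile value is an assumption, and the function and variable names are hypothetical):

```python
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    """Clip the predicted x_0 to [-s, s], where s is a per-sample percentile of |x_0|,
    then rescale by s (a sketch of Imagen-style dynamic thresholding)."""
    b = x0_pred.shape[0]
    s = torch.quantile(x0_pred.abs().reshape(b, -1), percentile, dim=1)   # per-sample threshold
    s = torch.clamp(s, min=1.0).view(b, *([1] * (x0_pred.dim() - 1)))     # only act when s > 1
    return torch.clamp(x0_pred, -s, s) / s
```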

Summary

In this issue, we began interpreting the main papers in the Text-to-Image direction, covering major work on diffusion models and related fields such as VAE, DDPM, DDIM, GLIDE, Imagen, unCLIP, CDM, and LDM.

From our analysis, the main advantages and disadvantages of the diffusion model are as follows:

  • Pros: Tractability and flexibility are two conflicting goals in generative modeling. Tractable models can be evaluated analytically and fit to data cheaply (e.g. via a Gaussian or Laplace distribution), but they cannot easily describe the structure of rich datasets. Flexible models can fit arbitrary structure in data, but evaluating, training, or sampling from them is usually expensive. Diffusion models are both analytically tractable and flexible;

  • Cons: Diffusion models rely on a long Markov chain of diffusion steps to generate samples, so they can be expensive in terms of time and computation. Although new methods speed up the process, sampling is still slower than with GANs.


In "Generative AI New World | Hands-on Practice in Vincent Graph Field: Deployment and Reasoning of Pre-trained Models" , we will take you into the hands-on practice session. I will lead you to use services such as Amazon SageMaker of Amazon Cloud Technology to experience the application of building large models in the field of Text-to-Image in the cloud.

Please continue to pay attention to the "Amazon Cloud Developer" WeChat official account to learn more about technology sharing and cloud development trends for developers!

The author of this article

6bf4acb23d3fa2903696d63c5bd819d3.jpeg

Huang Haowen

Senior developer evangelist at Amazon Cloud Technology, focusing on AI/ML, data science, and related areas. He has more than 20 years of experience in architecture design, technology, and entrepreneurial management across the telecommunications, mobile Internet, and cloud computing industries. He has worked at Microsoft, Sun Microsystems, China Telecom, and other companies, focusing on providing consulting services in AI/ML, data analytics, and enterprise digital transformation for corporate clients in gaming, e-commerce, media, advertising, and other sectors.



Origin blog.csdn.net/u012365585/article/details/132574049