The author of this article is Huang Haowen
Amazon Cloud Technology Senior Developer Evangelist
In the previous article, we began exploring another rapidly advancing area of Generative AI: Text-to-Image. It gave an overview of the basics, including CLIP, OpenCLIP, diffusion models, the DALL-E 2 model, and the Stable Diffusion model.
This issue interprets the main papers in the Text-to-Image direction.
Interpretation of Variational Autoencoder VAE Paper
Variational Auto-Encoder
1. Auto-Encoder architecture
An auto-encoder is an unsupervised neural network that learns a compressed representation of its input data. It can be divided into two parts:
Encoder: responsible for compressing data into a low-dimensional representation;
Decoder: responsible for restoring the low-dimensional representation to the original data.
Source: https://towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798
After reading this, some readers may ask: since the decoder takes only low-dimensional vectors as input and outputs high-dimensional image data, can we use the decoder directly as a generative model? For example, we could randomly generate some vectors in the low-dimensional space and feed them to the decoder to generate pictures.
The reason not to do this is that the vast majority of randomly generated vectors decode to meaningless noise. Since we have not explicitly modeled the distribution of the latent vectors, we do not know which vectors will generate useful pictures. The training dataset is finite, so only a finite number of latent vectors are known to decode well, while the low-dimensional space as a whole is very large. If we simply sample at random in this space, the probability of hitting a vector that generates a useful picture is low.
The VAE (Variational Auto-Encoder) builds on the auto-encoder by explicitly modeling this distribution, which turns the autoencoder into a qualified, or even excellent, generative model.
2. Dimensionality reduction and latent space
The previous section touched on dimensionality reduction, a concept that is important across generative AI. This section gives a plain-language explanation.
In machine learning, dimensionality reduction is the process of reducing the number of features that describe some data. This reduction can be achieved by selection (keeping only some of the existing features) or by extraction (creating a smaller number of new features from the old ones). It is very useful in the many situations that call for low-dimensional data (data visualization, data storage, heavy computation, and so on).
First, let's call the process that produces the "new feature" representation from the "old feature" representation (by selection or extraction) the encoder, and the reverse process the decoder. Dimensionality reduction can then be interpreted as data compression, where the encoder compresses the data (from the initial space to the encoded space, also known as the latent space) and the decoder decompresses it. Of course, depending on the initial data distribution, the latent space dimension, and the encoder definition, this compression can be lossy, meaning that some information is lost during encoding and cannot be recovered upon decoding.
Source: https://theaisummer.com/latent-variable-models/
An auto-encoder uses neural networks to perform dimensionality reduction. The general idea is simple: set up the encoder and the decoder as neural networks, and use an iterative optimization process to learn the best encoding-decoding scheme. In each iteration, we feed some data to the autoencoder architecture (the encoder followed by the decoder), compare the encoded-then-decoded output with the original data, and backpropagate the error through the architecture to update the weights of the networks.
The overall architecture creates a bottleneck that ensures only the main structured part of the information passes through and is reconstructed. Within this framework, the family E of candidate encoders is defined by the encoder network architecture, the family D of candidate decoders is defined by the decoder network architecture, and the search for the encoder and decoder that minimize the reconstruction error is carried out by gradient descent over the parameters of these networks. As shown below:
Source: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
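The training loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions that are not in the original article: the encoder and decoder are single linear maps, the data is a synthetic 5-D point cloud lying on a 2-D subspace, and the function name `train_linear_autoencoder` and its hyperparameters are hypothetical.

```python
import numpy as np

def train_linear_autoencoder(x, latent_dim=2, lr=0.01, steps=500, seed=0):
    """Train a linear encoder/decoder pair by gradient descent on the
    mean squared reconstruction error (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w_enc = rng.normal(scale=0.1, size=(d, latent_dim))  # encoder weights
    w_dec = rng.normal(scale=0.1, size=(latent_dim, d))  # decoder weights
    losses = []
    for _ in range(steps):
        z = x @ w_enc          # encode: initial space -> latent space
        x_hat = z @ w_dec      # decode: latent space -> initial space
        err = x_hat - x        # reconstruction error per sample
        losses.append(float(np.mean(np.sum(err ** 2, axis=1))))
        # Backpropagate the error through decoder and encoder.
        grad_x_hat = 2.0 * err / n
        grad_w_dec = z.T @ grad_x_hat
        grad_w_enc = x.T @ (grad_x_hat @ w_dec.T)
        # Gradient descent update of the network weights.
        w_dec -= lr * grad_w_dec
        w_enc -= lr * grad_w_enc
    return w_enc, w_dec, losses

# Synthetic data (an assumption for the demo): 5-D points on a 2-D subspace.
data_rng = np.random.default_rng(1)
basis = data_rng.normal(size=(2, 5))
data = data_rng.normal(size=(200, 2)) @ basis

w_enc, w_dec, losses = train_linear_autoencoder(data)
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

A real autoencoder would use nonlinear layers and a deep learning framework; the linear version is used here only because the gradients can be written out by hand.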
This autoencoder structure faces two main challenges in the real world.
First, strong dimensionality reduction with no reconstruction loss often comes at the cost of a lack of interpretable and exploitable structure in the latent space, or, more simply, a lack of regularity. Second, in most cases the ultimate goal of dimensionality reduction is not only to reduce the number of dimensions of the data, but to do so while retaining most of the data's structural information in the reduced representation.
For these two reasons, in the real world we have to carefully control and tune the size of the latent space and the "depth" of the autoencoder (which defines the degree of compression and quality) according to the ultimate goal of dimensionality reduction. As shown below:
Source: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
3. Variational Autoencoder
With this background in place, we can finally explore the essence of the VAE paper.
Source: https://arxiv.org/pdf/1312.6114.pdf
So far, we have discussed dimensionality reduction and introduced autoencoders, which are encoder-decoder architectures that can be trained by gradient descent. Now let's connect this to the content generation problem, look at the limitations of autoencoders for that problem, and then introduce the variational autoencoder.
On combining content generation with autoencoders, we might ask: if the latent space is regular enough, can we pick a random point from the latent space and decode it to obtain new content? As shown below:
Source: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
Definition of Variational Autoencoder
Therefore, to be able to use the decoder of an autoencoder for generative purposes, we must ensure that the latent space is sufficiently regular. A possible solution to obtain such regularity is to introduce explicit regularization during training. A variational autoencoder can be defined as an autoencoder whose training is regularized to avoid overfitting and ensure that the latent space is well-characterized for the generative process.
Just like a standard autoencoder, a variational autoencoder is an architecture consisting of an encoder and a decoder, trained to minimize the reconstruction error between the encoded-then-decoded data and the original data. However, to regularize the latent space, we make a slight modification to the encoding-decoding process: instead of encoding the input as a single point, we encode it as a distribution over the latent space. The model is then trained as follows:
The input is encoded as a distribution over the latent space
A point in the latent space is sampled from this distribution
The sampled point is decoded and the reconstruction error is computed
The reconstruction error is backpropagated through the network
As shown below:
Source: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
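The four training steps above can be sketched as one forward pass. This is a minimal NumPy illustration, assuming linear maps in place of real encoder and decoder networks; the parameter names (`w_mu`, `w_log_var`, `w_dec`) and the function `vae_forward` are hypothetical. The sample is drawn as mean plus scaled noise, which keeps the sampling step differentiable with respect to the encoder outputs.

```python
import numpy as np

def vae_forward(x, params, rng):
    """One forward pass of a toy VAE with linear encoder and decoder."""
    # Step 1: encode the input as a distribution (mean and log-variance).
    mu = x @ params["w_mu"]
    log_var = x @ params["w_log_var"]
    # Step 2: sample a latent point, z = mu + sigma * eps with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    # Step 3: decode the sampled point and compute the reconstruction error.
    x_hat = z @ params["w_dec"]
    recon_error = float(np.mean(np.sum((x_hat - x) ** 2, axis=1)))
    # Step 4 (not shown): backpropagate recon_error through the network.
    return mu, log_var, z, x_hat, recon_error

rng = np.random.default_rng(0)
params = {
    "w_mu": rng.normal(size=(3, 2)),
    "w_log_var": rng.normal(scale=0.1, size=(3, 2)),
    "w_dec": rng.normal(size=(2, 3)),
}
x = rng.normal(size=(4, 3))
mu, log_var, z, x_hat, recon_error = vae_forward(x, params, rng)
print(z.shape, x_hat.shape, recon_error)
```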
In practice, the encoded distributions are chosen to be normal, so that the encoder can be trained to return the mean and the covariance matrix that describe these Gaussians. The reason the input is encoded as a distribution with some variance rather than as a single point is that this makes it very natural to express the latent space regularization: the distributions returned by the encoder are pushed toward a standard normal distribution. In this way, both local and global regularization of the latent space are ensured (the variance controls local regularization, the mean controls global regularization).
Therefore, the loss function minimized when training a VAE consists of a "reconstruction term" (on the final layer) and a "regularization term" (on the latent layer). The latter organizes the latent space by pushing the distributions returned by the encoder toward a standard normal distribution. This regularization term is expressed as the KL divergence (Kullback-Leibler divergence) between the returned distribution and the standard Gaussian. Since the KL divergence between two Gaussian distributions has a closed form, it can be expressed directly in terms of the means and covariance matrices of the two distributions. As shown below:
Source: https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73
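For a diagonal Gaussian and a standard normal prior, the closed-form KL term mentioned above is short enough to write out. The sketch below assumes the encoder returns a mean and a log-variance per latent dimension (a common convention, not something stated in the article); the function name is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

# The KL term vanishes when the encoder already returns a standard normal,
# and grows as the mean moves away from zero.
kl_zero = kl_to_standard_normal(np.zeros(3), np.zeros(3))
kl_shifted = kl_to_standard_normal(np.array([1.0]), np.array([0.0]))
print(kl_zero, kl_shifted)
```

The full VAE loss would add this term to the reconstruction error, possibly with a weighting factor between the two.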
My personal understanding is that the core of the VAE architecture is two encoders: one computes the mean and the other computes the variance, and both are computed by neural networks.
In essence, compared with a conventional autoencoder, a VAE adds Gaussian noise to the output of the encoder (the network that computes the mean), so that the resulting decoder is robust to noise. The additional KL loss (whose purpose is to drive the mean to 0 and the variance to 1) acts as a regularization term on the encoder, encouraging its outputs to have zero mean.
The other encoder (the network that computes the variance) dynamically adjusts the noise intensity. When the decoder is not yet well trained (the reconstruction error is much larger than the KL loss), the noise is reduced appropriately (the KL loss increases), which makes fitting easier (the reconstruction error starts to decrease). Conversely, when the decoder is well trained (the reconstruction error is smaller than the KL loss), the noise increases (the KL loss decreases), which makes fitting harder (the reconstruction error starts to increase) and pushes the decoder to improve its generative ability. Here KL stands for Kullback-Leibler; the KL divergence is a classic measure of the difference between two probability distributions.
The VAE paper captures this idea in an elegant mathematical formula, shown in the screenshot of the paper below:
Source: https://arxiv.org/pdf/1312.6114.pdf
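For readers without access to the screenshot, the objective in the VAE paper is usually written as the variational lower bound below. This is a transcription of the standard form in the paper's notation, where $q_\phi(z \mid x)$ is the encoder and $p_\theta(x \mid z)$ the decoder; the second line is the closed-form KL term for a diagonal Gaussian encoder with a standard normal prior over $J$ latent dimensions.

```latex
\mathcal{L}(\theta, \phi; x)
  = -D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)
    + \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]

-D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big)
  = \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)
```

The first term is the regularization term and the second is the reconstruction term discussed above.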
4. Key points of the VAE paper
The main points are:
Dimensionality reduction is the process of reducing the number of features that describe some data (either by selecting only a subset of the initial features or by combining them into a smaller number of new features), so it can be thought of as an encoding process;
Autoencoders are neural network architectures composed of an encoder and a decoder that create a bottleneck for the data to pass through, and that are trained (by gradient descent iterations that reduce the reconstruction error) to lose as little information as possible during the encoding-decoding process;
Due to overfitting, the latent space of an autoencoder can be extremely irregular (close points in the latent space can give very different decoded data, and some points of the latent space can give meaningless content once decoded), so we cannot really define a generative process that simply consists of sampling a point from the latent space and passing it through the decoder to obtain new data;
A variational autoencoder (VAE) is an autoencoder that tackles this latent space irregularity by making the encoder return a distribution over the latent space instead of a single point, and by adding to the loss function a regularization term on the returned distribution, to ensure better organization of the latent space;
Assuming a simple underlying probabilistic model to describe our data, the loss function of the VAE, composed of a reconstruction term and a regularization term, can be carefully derived using the statistical technique of variational inference (hence the name "variational" autoencoder).
Interpretation of Diffusion Model Series Papers
Diffusion Models
Before the diffusion model became the mainstream model in the Text-to-Image field, there were three main types of generative models:
GAN (Generative Adversarial Network)
VAE (Variational Auto-Encoder)
Flow-based models
These models have all had great success in generating high-quality samples, but each has its own limitations. It is well known that, because of their adversarial training, GANs can be unstable to train and can produce samples with low diversity; VAEs rely on a surrogate loss function; and flow-based models must use specialized architectures to construct reversible transformations.
Surrogate Loss function:
https://baike.baidu.com/item/%E4%BB%A3%E7%90%86%E6%8D%9F%E5%A4%B1%E5%87%BD%E6%95%B0/22787203
Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps that slowly adds random noise to the data, and then learn to reverse the diffusion process to construct the desired data samples from noise. Unlike VAEs or flow-based models, diffusion models are learned with a fixed procedure, and the latent variable has high dimensionality (the same as the original data). As shown below:
Source: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
One important contribution of the diffusion model is that, during training (for example in DDPM), a noise estimation model $\epsilon_\theta(x_t, t)$ predicts the real noise and is trained to minimize the difference between the estimated noise and the real noise. We will elaborate on this contribution later.
1. Overview of Diffusion Models
Several major papers on diffusion-based generative models share similar ideas, including diffusion probabilistic models (Sohl-Dickstein et al., 2015), noise-conditioned score networks (NCSN; Song & Ermon, 2019), and denoising diffusion probabilistic models (DDPM; Ho et al., 2020).
Sohl-Dickstein et al., 2015:
https://arxiv.org/abs/1503.03585
Song & Ermon, 2019:
https://arxiv.org/abs/1907.05600
Ho et al., 2020:
https://arxiv.org/abs/2006.11239
(1) Forward diffusion process
Given a data point $x_0 \sim q(x)$ sampled from the real data distribution, let us define a forward diffusion process in which we add a small amount of Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $x_1, \dots, x_T$. The step sizes are controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^{T}$:
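In the notation of Ho et al. (2020), the forward process described above is usually written as follows. This is the standard form from the DDPM literature, given here because the original equation image is not reproduced in this text.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\big)
\qquad
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})
```

Each step shrinks the previous sample slightly and adds Gaussian noise with variance $\beta_t$.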
The data sample $x_0$ gradually loses its distinguishable features as the step $t$ becomes larger. Eventually, when $T \to \infty$, $x_T$ is equivalent to an isotropic Gaussian distribution. As shown below:
The Markov chain of forward (reverse) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020 with a few additional annotations)
Both the forward and the reverse diffusion processes are Markov processes. The only difference is:
In the forward diffusion process, the mean and variance of each Gaussian conditional distribution are already determined (they depend on $\beta_t$ and $x_0$), whereas in the reverse diffusion process the mean and variance must be learned by a neural network.
Markov process:
https://zhuanlan.zhihu.com/p/426290103
Another nice property of the above process is that $x_t$ at any arbitrary time step $t$ can be sampled in closed form using the reparameterization trick, as shown in the following diagram:
Illustration of how the reparameterization trick makes the sampling process trainable.(Image source: Slide 12 in Kingma’s NIPS 2015 workshop talk)
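A small sketch of the closed-form sampling property: with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$, a noisy sample can be drawn directly as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The linear schedule from 1e-4 to 0.02 over 1000 steps follows Ho et al. (2020); the function names and the 0-based time index are assumptions made for this example.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule and the cumulative products it induces."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 0-based step index."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps

betas, alphas, alpha_bars = make_schedule()
rng = np.random.default_rng(0)
x0 = np.ones((4, 2))
x_t, eps = q_sample(x0, t=500, alpha_bars=alpha_bars, rng=rng)
print(alpha_bars[0], alpha_bars[-1])
```

Because `alpha_bars` shrinks toward zero, the last steps are almost pure Gaussian noise, which matches the limiting behavior described earlier.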
The reparameterization trick also works for other types of distributions, not just Gaussians. In the multivariate Gaussian case, the trick makes the model trainable by learning the mean $\mu$ and the standard deviation $\sigma$ of the distribution explicitly, while the randomness is carried by the random variable $\epsilon \sim \mathcal{N}(0, I)$. The figure below is a schematic of a variational autoencoder with the multivariate Gaussian assumption. This model was discussed in detail in the VAE section above.
Source: https://lilianweng.github.io/posts/2018-08-12-vae/#reparameterization-trick
I will not go through the mathematical derivation of the forward diffusion process here. Interested readers can refer to the "Forward diffusion process" section of the following article:
https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
Here is just a summary:
The diffusion model draws on stochastic gradient Langevin dynamics. Compared with standard stochastic gradient descent (SGD), this method injects Gaussian noise into each update to avoid collapsing into local minima.
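To make the Langevin idea concrete, here is a minimal sketch of the update $x \leftarrow x + \frac{\delta}{2} \nabla_x \log p(x) + \sqrt{\delta}\, \epsilon$. The target density (a 1-D Gaussian with mean 3 and unit variance, whose score is known analytically), the step size, and the function names are all assumptions chosen for the demo, not taken from the article.

```python
import numpy as np

def langevin_sample(score_fn, x_init, step_size=0.01, num_steps=2000, seed=0):
    """Run Langevin dynamics: a gradient step on log p(x) plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = np.array(x_init, dtype=float)
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * noise
    return x

TARGET_MEAN = 3.0  # assumed target: N(3, 1)

def gaussian_score(x):
    """Gradient of log N(x; TARGET_MEAN, 1) with respect to x."""
    return -(x - TARGET_MEAN)

# Start 5000 independent chains at zero and let them diffuse to the target.
samples = langevin_sample(gaussian_score, x_init=np.zeros(5000))
print(samples.mean(), samples.std())
```

Without the noise term the chains would all collapse onto the mode at 3; the injected noise is what makes them spread out into samples from the distribution.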
(2) Reverse diffusion process
If we can reverse the above process and sample from $q(x_{t-1} \mid x_t)$, we will be able to recreate a true sample from a Gaussian noise input $x_T \sim \mathcal{N}(0, I)$. Note that if $\beta_t$ is small enough, $q(x_{t-1} \mid x_t)$ will also be Gaussian. Unfortunately, $q(x_{t-1} \mid x_t)$ cannot be estimated easily, because doing so requires the entire dataset, as shown in the following figure:
Image source: Ho et al. 2020 with a few additional annotations
Therefore a model $p_\theta$ needs to be trained to approximate these conditional probabilities in order to run the reverse diffusion process:
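In the DDPM literature the learned reverse process is usually written as a Markov chain of Gaussians with learned mean and covariance. The form below is the standard one from Ho et al. (2020), given here because the original equation image is not reproduced in this text.

```latex
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)
\qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)
```

Here $p(x_T) = \mathcal{N}(x_T; 0, I)$ is the Gaussian noise the chain starts from.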
The figure below shows an example of training a diffusion model to model 2D Swiss roll data from the "Sohl-Dickstein et al., 2015" paper.
Sohl-Dickstein et al., 2015
https://arxiv.org/abs/1503.03585
Image source: Sohl-Dickstein et al., 2015
The first row shows time slices from the forward trajectory $q(x_{0:T})$. The data distribution (left) undergoes Gaussian diffusion, which gradually transforms it into an identity-covariance Gaussian (right).
The middle row shows the corresponding time slices from the trained reverse trajectory $p_\theta(x_{0:T})$. An identity-covariance Gaussian (right) undergoes a Gaussian diffusion process with learned mean and covariance functions, and is gradually transformed back into the data distribution (left).
The last row shows the drift term $\mu_\theta(x_t, t) - x_t$ for the same reverse diffusion process.
(3) The DDPM paper and the parameterization of $L_t$
As mentioned before, we need to learn a neural network to approximate the conditional probability distributions in the reverse diffusion process:
We want to train $\mu_\theta$ to predict:
Since $x_t$ is available as input at training time, we can instead reparameterize the Gaussian noise term so that the network predicts $\epsilon_t$ from the input $x_t$ at time step $t$:
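For reference, the two expressions involved in this reparameterization are usually written as below, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. This is the standard form from Ho et al. (2020), given here because the original equation images are not reproduced in this text.

```latex
\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_t \Big)

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \Big)
```

The first line is the target mean expressed through the true noise $\epsilon_t$; the second replaces the true noise with the network's prediction $\epsilon_\theta(x_t, t)$.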
The training objective can then be simplified with some further algebra. Readers interested in the derivation can refer to the following article:
https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
The simplified result appears in the DDPM paper:
Source: https://arxiv.org/abs/2006.11239
The training and sampling algorithms in DDPM (Image source: Ho et al. 2020)
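The two algorithms can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the paper's implementation: the noise model is passed in as a plain function, the variance of each reverse step is fixed to $\beta_t$, time steps are 0-based, and the demo uses a toy data distribution (a point mass at zero) for which the exact noise predictor is known analytically, so no training is needed.

```python
import numpy as np

def ddpm_loss(x0, eps_model, alpha_bars, rng):
    """Training step: noise x0 to a random step t and score the prediction."""
    t = int(rng.integers(0, len(alpha_bars)))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

def ddpm_sample(eps_model, shape, betas, rng):
    """Sampling: start from Gaussian noise and denoise step by step."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        # No noise is added on the final step.
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_model(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z  # assumes sigma_t^2 = beta_t
    return x

# Toy setup (assumption): all data equals zero, so x_t = sqrt(1 - abar_t) * eps
# and the exact noise predictor is x_t / sqrt(1 - abar_t).
betas = np.linspace(1e-4, 0.02, 50)
alpha_bars = np.cumprod(1.0 - betas)

def exact_eps(x_t, t):
    return x_t / np.sqrt(1.0 - alpha_bars[t])

rng = np.random.default_rng(0)
loss = ddpm_loss(np.zeros((8, 2)), exact_eps, alpha_bars, rng)
sample = ddpm_sample(exact_eps, (8, 2), betas, rng)
print(loss, np.abs(sample).max())
```

With a perfect noise predictor the loss is zero and the sampler returns to the data (here, zero); in practice `eps_model` would be a trained neural network such as a U-Net.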
The DDPM paper gives the reason for this simplification: Ho et al. (2020) found empirically that training the diffusion model works better with a simplified objective that ignores the weighting term:
The final simplified formula is:
where C is a constant independent of θ.
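The simplified objective referred to here is usually written as follows, using the same $\bar{\alpha}_t$ as in the forward process. This is the standard form from Ho et al. (2020), given here because the original equation images are not reproduced in this text.

```latex
L_t^{\text{simple}} = \mathbb{E}_{t \sim [1, T],\, x_0,\, \epsilon_t}
  \Big[ \big\| \epsilon_t - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t,\ t\big) \big\|^2 \Big]

L_{\text{simple}} = L_t^{\text{simple}} + C
```

In words: pick a random time step, noise the data in closed form, and penalize the squared error between the true noise and the predicted noise.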
Ho et al. (2020):
https://arxiv.org/abs/2006.11239
2. Accelerated Sampling of Diffusion Models
(1) Interpretation of the DDIM paper
Generating samples from a DDPM by following the Markov chain of the reverse diffusion process is very slow, as it can take up to a few thousand steps. The 2020 DDIM paper by Song et al. reports that it takes about 20 hours to sample 50,000 images of size 32×32 from a DDPM, but less than a minute to do so from a GAN on an Nvidia 2080 Ti GPU.
The 2020 DDIM paper by Song et al.:
https://arxiv.org/abs/2010.02502
Source: https://arxiv.org/pdf/2010.02502.pdf
DDIM has the same marginal noise distribution, but deterministically maps the noise back to the original data samples. During generation, we only sample a subset $\{\tau_1, \dots, \tau_S\}$ of the diffusion steps, and the inference process of DDIM becomes:
Although all models were trained with $T = 1000$ diffusion steps in the experiments, the authors observed that DDIM ($\eta = 0$) produces the best-quality samples when $S$ is small, while DDPM ($\eta = 1$) performs much worse for small $S$. DDPM does perform better when we can afford to run the full reverse Markov diffusion chain ($S = T = 1000$).
With DDIM, a diffusion model can be trained with any number of forward steps but sampled from only a subset of those steps during generation. The comparative test results of DDPM and DDIM in the paper are shown in the figure below:
Source: https://arxiv.org/pdf/2010.02502.pdf
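A single DDIM update can be sketched as below. The function takes the current sample, the predicted noise, and the cumulative products $\bar{\alpha}$ at the current and the previous selected step; `eta` interpolates between the deterministic DDIM update (`eta=0`) and a DDPM-like stochastic update (`eta=1`). The function name and argument layout are assumptions for this example.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta=0.0, rng=None):
    """One reverse step from the current step to the previous selected step."""
    # Noise scale; zero when eta = 0, which makes the update deterministic.
    sigma = (eta
             * np.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t))
             * np.sqrt(1.0 - alpha_bar_t / alpha_bar_prev))
    # Estimate of the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Component pointing back toward x_t.
    direction = np.sqrt(1.0 - alpha_bar_prev - sigma ** 2) * eps_pred
    noise = 0.0
    if eta > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        noise = sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(alpha_bar_prev) * x0_pred + direction + noise

# Sanity check with a known x0 and known noise (values chosen for the demo).
x0 = np.array([1.0, -2.0])
eps = np.array([0.5, 0.25])
ab_t, ab_prev = 0.5, 0.8
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
print(x_prev)
```

Because the update only needs $\bar{\alpha}$ at two steps, the two steps do not have to be adjacent, which is what allows sampling on a short subset of the training steps.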
Compared with DDPM, DDIM is able to:
Generate higher-quality samples using fewer steps
Provide a "consistency" property: because the generative process is deterministic, multiple samples conditioned on the same latent variable have similar high-level features
Perform semantically meaningful interpolation in the latent variable, thanks to this consistency
(2) Interpretation of the LDM paper
Another important paper is the Latent Diffusion Model (LDM: Rombach & Blattmann et al., 2022) paper (shown below), which proposes running the diffusion process in latent space instead of pixel space, reducing the training cost and speeding up inference.
LDM: Rombach & Blattmann et al., 2022:
https://arxiv.org/abs/2112.10752
Source: https://arxiv.org/pdf/2112.10752.pdf
The paper is motivated by the observation that most bits of an image contribute to perceptual details, while the semantic and conceptual composition remains after aggressive compression. LDM loosely decomposes perceptual compression and semantic compression in generative modeling: it first trims off pixel-level redundancy with an autoencoder, and then manipulates and generates semantic concepts with a diffusion process on the learned latent.
Illustrating perceptual and semantic compression
Most bits of a digital image correspond to imperceptible details. Although the diffusion model can suppress this semantically meaningless information by minimizing the corresponding loss term, the gradients (during training) and the neural network backbone (during both training and inference) still have to be evaluated on all pixels, which leads to redundant computation and unnecessarily expensive optimization and inference.
Therefore, the LDM paper proposes the Latent Diffusion Model as an efficient generative model with a separate, mild compression stage.
The perceptual compression process relies on an autoencoder model. An encoder $\mathcal{E}$ compresses the input image $x \in \mathbb{R}^{H \times W \times 3}$ into a smaller 2D latent $z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}$, where the downsampling rate is $f = H/h = W/w = 2^m$, $m \in \mathbb{N}$. A decoder $\mathcal{D}$ then reconstructs the image from the latent, $\tilde{x} = \mathcal{D}(z)$.
Diffusion and denoising take place on the latent vector $z$. The denoising model is a time-conditioned U-Net augmented with a cross-attention mechanism to handle flexible conditioning information for image generation (e.g. class labels, semantic maps, blurred variants of an image). This design is equivalent to fusing representations of different modalities into the model through cross-attention. Each type of conditioning information is paired with a domain-specific encoder $\tau_\theta$ that projects the conditioning input $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which can be mapped into the cross-attention component:
The architecture of latent diffusion model. (Image source: Rombach & Blattmann, et al. 2022)
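The cross-attention component can be illustrated with a small NumPy sketch: queries come from the (flattened) latent features inside the U-Net, while keys and values come from the encoded conditioning input $\tau_\theta(y)$. The shapes, the random projection matrices, and the function names are assumptions for illustration; a real model learns these projections and uses multiple heads.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V.

    Q is projected from the latent features, K and V from the
    conditioning representation, so each latent position can read
    from every conditioning token.
    """
    q = latent_tokens @ w_q   # (N, d)
    k = cond_tokens @ w_k     # (M, d)
    v = cond_tokens @ w_v     # (M, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, M)
    return weights @ v, weights

rng = np.random.default_rng(0)
latent_tokens = rng.normal(size=(6, 16))  # N=6 latent positions, 16 features
cond_tokens = rng.normal(size=(4, 12))    # M=4 conditioning tokens, d_tau=12
w_q = rng.normal(size=(16, 8))
w_k = rng.normal(size=(12, 8))
w_v = rng.normal(size=(12, 8))
out, weights = cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```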
3. Conditional Generation of Diffusion Models
When training generative models on images with conditioning information, such as ImageNet, it is common to generate samples conditioned on a class label or a piece of descriptive text.
(1) Classifier Guidance for Diffusion Models
The GLIDE paper (pictured below) presents recent work in the area of guided diffusion models.
Source: https://arxiv.org/pdf/2112.10741.pdf
To explicitly incorporate class information into the diffusion process, Dhariwal & Nichol (2021) trained a classifier $f_\phi(y \mid x_t, t)$ on noisy images $x_t$ and used the gradients $\nabla_{x} \log f_\phi(y \mid x_t)$ to guide the diffusion sampling process toward the conditioning information $y$ (e.g. a target class label) by altering the noise prediction. The resulting ablated diffusion model (ADM), and the variant with additional classifier guidance (ADM-G), achieve better results than state-of-the-art generative models such as BigGAN.
Dhariwal & Nichol (2021)
https://arxiv.org/abs/2105.05233
The algorithms use guidance from a classifier to run conditioned generation with DDPM and DDIM. (Image source: Dhariwal & Nichol, 2021)
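For the DDIM variant, the guidance amounts to shifting the predicted noise by the scaled classifier gradient: $\hat{\epsilon} = \epsilon_\theta(x_t) - s \sqrt{1 - \bar{\alpha}_t}\, \nabla_{x_t} \log f_\phi(y \mid x_t)$. The sketch below assumes the classifier gradient has already been computed and is passed in as an array; the function name and the guidance scale `scale` are hypothetical.

```python
import numpy as np

def classifier_guided_eps(eps_pred, classifier_grad, alpha_bar_t, scale=1.0):
    """Shift the noise prediction toward higher classifier log-probability.

    classifier_grad is the gradient of log f(y | x_t) with respect to x_t,
    supplied by the caller (in practice via automatic differentiation).
    """
    return eps_pred - scale * np.sqrt(1.0 - alpha_bar_t) * classifier_grad

eps_pred = np.array([0.0, 0.0])
classifier_grad = np.array([1.0, -1.0])
unguided = classifier_guided_eps(eps_pred, classifier_grad, 0.75, scale=0.0)
guided = classifier_guided_eps(eps_pred, classifier_grad, 0.75, scale=2.0)
print(unguided, guided)
```

A larger scale pushes samples harder toward the target class, typically trading diversity for fidelity.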
(2) Classifier-Free Guidance for Diffusion Models
In addition, the GLIDE paper shows selected samples generated by GLIDE with classifier-free guidance. From the sample images provided in the paper, it can be observed that the GLIDE model can generate realistic images with shadows and reflections, can combine multiple concepts, can generate artistic renderings of new concepts, and more.
Source: https://arxiv.org/pdf/2112.10741.pdf
The GLIDE paper also explores two guidance strategies in detail, CLIP guidance and classifier-free guidance, and finds that the latter is preferred.
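Classifier-free guidance needs no separate classifier: the same model is evaluated with and without the conditioning input, and the two noise predictions are extrapolated, $\hat{\epsilon} = \epsilon_\theta(x_t \mid \emptyset) + s\,(\epsilon_\theta(x_t \mid c) - \epsilon_\theta(x_t \mid \emptyset))$. The sketch below assumes both predictions are already available as arrays; the function name and the scale values are illustrative.

```python
import numpy as np

def classifier_free_guided_eps(eps_cond, eps_uncond, scale):
    """Extrapolate from the unconditional toward the conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.0])
eps_cond = np.array([1.0, -1.0])
same_as_cond = classifier_free_guided_eps(eps_cond, eps_uncond, scale=1.0)
amplified = classifier_free_guided_eps(eps_cond, eps_uncond, scale=3.0)
print(same_as_cond, amplified)
```

A scale of 1 reproduces the conditional prediction; scales above 1 amplify the effect of the condition.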
4. High resolution and image quality of the diffusion model
(1) The CDM paper
The paper "Cascaded Diffusion Models for High Fidelity Image Generation" proposes a pipeline of multiple diffusion models at increasing resolutions. Noise conditioning augmentation between the pipeline models is crucial to the final image quality: strong data augmentation is applied to the conditioning input $z$ of each super-resolution model $p_\theta(x \mid z)$. This conditioning noise helps reduce compounding error in the pipeline setup.
Cascaded Diffusion Models for High Fidelity Image Generation:
https://arxiv.org/abs/2106.15282
Source: https://arxiv.org/pdf/2106.15282.pdf
In diffusion modeling for generating high-resolution images, U-Net is a common choice of model architecture. The paper mentioned that in the pipeline of Cascaded Diffusion Models, each model uses the U-Net architecture. As shown below:
Source: https://arxiv.org/pdf/2106.15282.pdf
The paper also reports that the most effective noise is Gaussian noise at low resolutions and Gaussian blur at high resolutions. In addition, the authors explore two forms of conditioning augmentation that require only small modifications to the training procedure. Note that the conditioning noise is applied only during training, not at inference.
(2) The unCLIP paper
The two-stage diffusion model unCLIP (Ramesh et al. 2022) proposes using the CLIP text encoder to generate high-quality text-guided images.
Ramesh et al. 2022
https://arxiv.org/abs/2204.06125
Source: https://arxiv.org/abs/2204.06125
Given a pretrained CLIP model $c$ and paired training data for the diffusion model, $(x, y)$, where $x$ is an image and $y$ is the corresponding caption, we can compute the CLIP text and image embeddings, $c^t(y)$ and $c^i(x)$, respectively.
unCLIP learns two models in parallel:
Prior model $P(c^i \mid y)$: given the text $y$, outputs a CLIP image embedding $c^i$;
Decoder $P(x \mid c^i, [y])$: given a CLIP image embedding $c^i$ and (optionally) the original text $y$, outputs an image $x$.
These two models enable conditional generation because:
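The factorization behind this claim is usually written as follows. This is the form given in the unCLIP paper, reproduced here because the original equation image is not included in this text; the first equality holds because the image embedding $c^i$ is a deterministic function of the image $x$.

```latex
P(x \mid y) = P(x, c^i \mid y) = P(x \mid c^i, y)\, P(c^i \mid y)
```

Sampling an image for a caption therefore reduces to sampling $c^i$ from the prior and then sampling $x$ from the decoder.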
The architecture of unCLIP. (Image source: Ramesh et al. 2022)
(3) The Imagen paper
Imagen (Saharia et al. 2022) does not use the CLIP model; instead it uses a large pre-trained language model (a frozen T5-XXL text encoder) to encode the text for image generation.
Saharia et al. 2022
https://arxiv.org/abs/2205.11487
Source: https://arxiv.org/pdf/2205.11487.pdf
The general trend is that larger model sizes lead to better image quality and text-image alignment. The paper's research team found that T5-XXL and CLIP text encoders achieve similar performance on MS-COCO.
Source: https://arxiv.org/pdf/2205.11487.pdf
Imagen modifies several design choices in the U-Net to make it an Efficient U-Net. For example:
Shift model parameters from the high-resolution blocks to the low-resolution blocks by adding more residual blocks at the lower resolutions
Scale the skip connections by 1/√2
Reverse the order of the downsampling (moved before the convolutions) and upsampling (moved after the convolutions) operations to improve the speed of the forward pass
The paper team's findings include:
Noise conditioning augmentation, dynamic thresholding, and the Efficient U-Net are critical to image quality
Scaling the text encoder size is more important than scaling the U-Net size
Summary
In this issue, we interpreted the main papers in the Text-to-Image direction, including VAE, DDPM, DDIM, GLIDE, Imagen, unCLIP, CDM, LDM, and other major work in the diffusion model field.
From our analysis, the main advantages and disadvantages of the diffusion model are as follows:
Pros: Tractability and flexibility are two conflicting goals in generative modeling. Tractable models can be evaluated analytically and can fit data cheaply (e.g. with a Gaussian or Laplace distribution), but they cannot easily describe the structure in rich datasets. Flexible models can fit arbitrary structure in data, but they are usually expensive to evaluate, train, or sample from. Diffusion models are both analytically tractable and flexible;
Disadvantages: Diffusion models rely on long chains of Markovian diffusion steps to generate samples and thus can be expensive in terms of time and computation. Although there are some new ways to speed up the process, the sampling speed is still slower than GAN.
In the next article, "Generative AI New World | Hands-on Practice in the Text-to-Image Field: Deployment and Inference of Pre-trained Models", we will move on to hands-on practice. I will show you how to use services such as Amazon SageMaker from Amazon Cloud Technology to build large-model applications for Text-to-Image in the cloud.
Please continue to pay attention to the "Amazon Cloud Developer" WeChat official account to learn more about technology sharing and cloud development trends for developers!
The author of this article
Huang Haowen
Senior Developer Evangelist at Amazon Cloud Technology, focusing on AI/ML and data science. He has more than 20 years of experience in architecture design, technology, and entrepreneurial management in the telecommunications, mobile Internet, and cloud computing industries, and has worked at Microsoft, Sun Microsystems, China Telecom, and other companies, providing solution consulting services in AI/ML, data analysis, and enterprise digital transformation to corporate clients in gaming, e-commerce, media, and advertising.