[AI Drawing Study Notes] Variational Autoencoder VAE

Reference articles: "VAE for unsupervised learning: variational autoencoder explained in detail" and "Machine learning methods, elegant models (1): the variational autoencoder (VAE)".

Honestly, just read those two articles. This post mainly collects the points that I failed to understand at first while reading them and watching related videos.


Look here, look here!

In case anyone skips the preface, let me repeat it: read these two articles first: "VAE for unsupervised learning: variational autoencoder explained in detail" and "Machine learning methods, elegant models (1): the variational autoencoder (VAE)".


Some terminology explanations

NN is the abbreviation of neural network.

Latent code can be translated as latent encoding: the representation of the data in latent space.

dim stands for dimension.

PCA is principal component analysis; both of these dimensionality reduction methods (PCA and SVD) are introduced in my earlier PCA and SVD article.

Conv is the abbreviation for convolution.

BN is Batch Normalization. In a neural network, normalization is needed to keep the distributions of layer inputs and outputs consistent. With batch normalization, the data is split into mini-batches for stochastic gradient descent, and during the forward pass of each batch the activations at each layer are normalized over that batch. A BN layer is inserted between each fully connected layer and the activation (excitation) layer to perform this standardization.

Insert image description here
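As a minimal sketch (in PyTorch; the layer sizes here are arbitrary assumptions, not from the referenced articles) of where a BN layer sits between a fully connected layer and its activation:

```python
import torch
import torch.nn as nn

# A fully connected block with Batch Normalization between the linear layer
# and the activation ("excitation") layer, as described above.
block = nn.Sequential(
    nn.Linear(784, 256),   # fully connected layer
    nn.BatchNorm1d(256),   # normalize activations over the mini-batch
    nn.ReLU(),             # activation layer
)

x = torch.randn(32, 784)   # a mini-batch of 32 samples
y = block(x)               # each layer's activations are normalized per batch
print(y.shape)             # torch.Size([32, 256])
```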

RBM stands for Restricted Boltzmann Machine. An RBM has only two layers, a visible layer and a hidden layer. Structurally it looks very much like a two-layer neural network, but unlike an ordinary feedforward network, the visible layer and the hidden layer propagate information to each other in both directions. In addition, the units of an RBM are binary: each unit can only take the value 0 or 1, as shown in the figure below.

Robustness (an awkward transliteration in the Chinese literature, which can also be rendered as resistance to disturbance) refers to a system's ability to keep functioning when it is perturbed by external interference; in other words, the strength of its stability.

Bottleneck is an interesting concept. We can view the entire AE network as a vase (in shape); the thinnest part in the middle, the latent space, is called the bottleneck of the vase, as shown below:
Insert image description here


latent coding

Reference notes: "Quickly understand latent code / latent space in deep learning" and "Understanding the latent space in machine learning".

What is a latent code? A latent code can be understood as a dimensionality-reduced or compressed representation of the data, aiming to express the essence of the data with less information. For comparison, when we discussed kernel functions we mentioned the kernel trick as a dimensionality-raising method: its essence is to map the feature points of the original nonlinear space into a high-dimensional feature space where the problem becomes linear, and that new high-dimensional space can also be understood as a latent space. In our AE example, the latent space is a dimensionality-reduced version of the original space, and the latent code is obtained through the encoder's transformation.
Insert image description here
The picture above shows a simple encoder-decoder architecture. If the entire network is viewed as a vase, the narrowest part is the bottleneck. Dimensionality reduction here is a form of lossy compression, and if what gets lost is noise or useless information, that is exactly what we want (it is how information compression is achieved).

After compression by the encoder, what matters more is recovery: only what can be recovered has been successfully compressed. If the input can be reconstructed, we can believe that the latent space representation really captures the most critical information in the input image.
Insert image description here
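A minimal sketch of such an encoder-bottleneck-decoder in PyTorch (the dimensions, activations, and the MSE reconstruction loss below are my assumptions for illustration, not the exact networks from the referenced articles):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=30):
        super().__init__()
        # Encoder: compress the input down to the bottleneck (latent code)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstruct the input from the latent code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)       # latent code at the "bottleneck"
        x_hat = self.decoder(z)   # reconstruction
        return x_hat, z

model = AutoEncoder()
x = torch.rand(16, 784)                          # e.g. flattened 28x28 images
x_hat, z = model(x)
recon_loss = nn.functional.mse_loss(x_hat, x)    # only what can be recovered is "successfully compressed"
```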

In principle, however, AE behaves more like a collapsible cup: the encoder flattens the input into the latent space to obtain the latent code, and the decoder expands it back into a reconstructed image, so the output does not differ much from the input. If the results are always close to the original inputs, then AE by itself is not suitable as a generative model, especially when we want to obtain images other than the training ones.

Similar latent codes

Insert image description here

We need to understand how a neural network decides that two objects, say two chairs, are of the same kind. The yellow chair and the black chair in the picture above are the same kind under our cognitive definition, but if they are represented by features in the network, the yellow chair's features are {general chair features, individual features, yellow} and the black chair's features are {general chair features, individual features, black}. Imagine the features as points in feature space: the feature distributions of the yellow chair and the black chair will not be particularly similar, because color, shape, structure and other individual features all affect how close the two distributions are. If we compress the latent space, for example by removing those individual features and keeping only the "general chair features", then the two chairs' distributions in the latent space become very similar, and the network naturally regards the two chairs as the same kind.

By learning patterns of edges, angles and so on, the model can "understand" such features, which are wrapped into the latent space representation of the data. As the dimensionality is reduced, the "external" information that distinguishes the images (for example, chair color) is "removed" from the latent representation, because only the most important features of each image are kept there. As a result, the lower the dimensionality, the less distinct and the more similar the two chairs' representations become; if we imagine them in space, they end up "close" together.

Therefore, finding the latent space is an important step in learning, rather than crudely computing the Euclidean distance between two images pixel by pixel.

If too many features are removed, the network might even decide that chairs and tables are the same kind; after all, both have four legs.

latent space interpolation

Things of the same kind are close together in latent space. Suppose the vectors of two chairs are [0.1, 0.1] and [0.12, 0.12]; feeding either into the network of course generates a chair. What about the input [0.11, 0.11]? Of course it is also a chair, and that is interpolation: if the latent code is taken within the corresponding interval, the decoded result will be close to what we expect. The figure below shows the effect of interpolation (horizontal and vertical interpolation give different results). We can interpolate in the latent space and use the decoder to reconstruct the latent representation into an image with the same dimensions as the original input, generating different facial structures. Interpolations around the same kind are similar but slightly different; the simplest application is data augmentation to expand a data set.

Insert image description here
Insert image description here
(The picture above shows Pokémon images generated by a VAE. Some of the interpolated results in the middle really do look like real Pokémon, which is exactly the kind of generated result we hoped for.)

The picture below shows linear interpolation between two chairs: as the interpolation point moves, the generated result changes in a correspondingly gradual, linear way.

Insert image description here
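A sketch of this kind of latent-space interpolation, assuming a trained `model` with `encoder`/`decoder` attributes as in the AE sketch above; `chair_a`, `chair_b`, and the number of steps are hypothetical placeholders:

```python
import torch

@torch.no_grad()
def interpolate(model, x_a, x_b, steps=8):
    """Linearly interpolate between the latent codes of two inputs
    and decode each intermediate point back to image space."""
    z_a = model.encoder(x_a)
    z_b = model.encoder(x_b)
    outputs = []
    for t in torch.linspace(0.0, 1.0, steps):
        z_t = (1 - t) * z_a + t * z_b      # e.g. [0.1, 0.1] -> [0.11, 0.11] -> [0.12, 0.12]
        outputs.append(model.decoder(z_t))
    return torch.stack(outputs)            # (steps, batch, 784), same dimensions as the input

# usage (placeholders): frames = interpolate(model, chair_a, chair_b)
```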


Gaussian mixture

Insert image description here

When we talked about the Fourier transform in the introduction to convolutional networks, we mentioned some signal-processing facts: a signal can be decomposed into a superposition of sine waves of different frequencies, amplitudes, and phases. Similarly, the overall distribution of a mixture model can be decomposed into a superposition of several component distributions, so we can use a superposition of normal distributions to approximate essentially any distribution.
Insert image description here
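A small numerical sketch of this idea in Python/NumPy: superposing a few weighted normal distributions (the weights, means, and standard deviations below are arbitrary illustrative values) and sampling from the resulting mixture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture of 3 normal components: weights, means, standard deviations (illustrative values)
weights = np.array([0.5, 0.3, 0.2])
means   = np.array([-2.0, 0.5, 3.0])
stds    = np.array([0.6, 1.0, 0.4])

def sample_mixture(n):
    """Sample from the mixture: first pick a component, then sample from its Gaussian."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comp], stds[comp])

def mixture_pdf(x):
    """Density of the mixture: weighted sum (superposition) of the component densities."""
    x = np.asarray(x)[..., None]
    comp_pdf = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return (comp_pdf * weights).sum(axis=-1)

samples = sample_mixture(10_000)
print(samples.mean(), mixture_pdf([0.0, 1.0, 2.0]))
```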


Increase the number of layers of AE

Insert image description here
The figure above is from the original article. The model shown on the right of the upper half is the original single-layer AE, and its results are not very good: the AE compresses the 784-pixel input into a 30-dimensional latent code and then uses the decoder to reconstruct it back to 784 pixels. The output is very blurry, because only one hidden layer is used for learning.

The lower half is a Deep AE with more layers: the width is increased (for example, widening from 784 to 1000 units) and the depth is increased (several hidden layers are added). A sentence from the original article summarizes the benefit of deeper and wider networks well: the deeper the network, the more abstract the high-level semantic features it can learn; the wider the network, the richer the feature representations each hidden layer can learn. As a result, the final reconstruction looks much clearer than the single-layer AE's result, showing that the deeper model works much better. The Deep AE results on MNIST shown in the upper-left of the figure confirm the same point.
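A sketch of such a Deep AE in PyTorch; the 784, 1000, 500, 250, 30 layer widths follow the "widen first, then compress step by step" idea in the figure, but treat the exact sizes and activations as assumptions:

```python
import torch.nn as nn

# Deep AE: widen from 784 to 1000 first, then compress step by step to a 30-dim code;
# the decoder mirrors the encoder back up to 784.
deep_encoder = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 500), nn.ReLU(),
    nn.Linear(500, 250),  nn.ReLU(),
    nn.Linear(250, 30),                   # 30-dim latent code
)
deep_decoder = nn.Sequential(
    nn.Linear(30, 250),   nn.ReLU(),
    nn.Linear(250, 500),  nn.ReLU(),
    nn.Linear(500, 1000), nn.ReLU(),
    nn.Linear(1000, 784), nn.Sigmoid(),
)
```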


Limitations of AE

Above we used AE to build an autoencoder whose reconstructions are clearer than PCA's, but it is still not a true generative model. A proper generative model should generally satisfy the following two points:

(1) The encoder and decoder can be split independently (analogous to the Generator and Discriminator of GAN);

(2) Any code sampled under a fixed dimension should be able to produce a clear and realistic picture through the decoder.

So why do we say that AE is not a generative model? Take the second point as an example. As shown in the figure below, we train an AE with a full-moon picture and a half-moon picture; after training, the model can reconstruct both pictures very well. Next, we pick a point in the middle of the latent space, that is, a point between the codes of the two pictures, and hand it to the decoder. Intuitively, we expect a picture somewhere between the full moon and the half moon (for example, a three-quarter moon). In reality, however, the image the AE produces at that point is not only blurry but essentially garbage.

Insert image description here
(Assuming the AE can reconstruct the full-moon and half-moon pictures, then intuitively, if the code point on the left decodes to the full moon and the code point on the right decodes to the half moon, any code point chosen between them should decode to a moon somewhere between full and half. The actual situation is not like this: the picture the AE produces is not only blurry but essentially garbage.)

Why does this happen? An intuitive explanation is that both the Encoder and the Decoder of AE are DNNs (deep neural networks). If the network were a linear transformation, the intuition above would hold; but not every neural network is a linear transformation (as explained in an earlier article, if all hidden layers were linear, a multi-layer network would be no different from a single-layer one). A DNN is a nonlinear transformation, so the way points are arranged and moved around in the latent space often follows no simple rule.

How can this be solved? One idea is to introduce noise to enlarge the coding region of each picture, so that the distorted blank regions of the code space get covered. Bluntly put, we enhance the robustness of the output by increasing the diversity of the input. When a little noise is added before encoding the input image, the code of each image can land anywhere within the range of the green arrow, so the resulting latent space covers more code points. Now if we pick a point in the middle and decode it, we get an output much closer to what we want, as shown below:

Insert image description here

Adding noise works because it lets the latent space cover more area, but many places are still left uncovered; for example, the yellow region on the right of the picture above is not covered because it is far from any encoded point. So can we use even more noise, so that the code of every input sample covers the entire code space? What we must ensure is that codes near the original code get a high probability value and codes far from the original code get a low probability value. In short, we want to stretch each original single point over the entire code space, that is, extend the discrete code point into a continuous coding curve close to a normal distribution, as shown below:

Insert image description here
The model obtained by adding noise to AE in this way is the VAE, the variational autoencoder. The relationship between noise and the latent space in VAE is: noise is added to the input data, the encoder maps the input to a latent space representation, and the decoder then uses this representation to generate the output. In other words, adding noise makes the latent space cover a larger code space, which lets us interpolate between codes to obtain new generated results; the distribution over the codes can also be viewed as a Gaussian mixture of the normal distributions associated with the different results.


VAE model architecture

Insert image description here

As we have also introduced above, VAE adds appropriate noise to the code on top of the original AE structure. First we feed the input to the NN Encoder and compute two sets of codes: a mean code $m$ and a variance code $\sigma$. The variance code $\sigma$ is used to assign weights to the noise $(e_1, e_2, e_3)$; to keep these weights non-negative, we exponentiate $\sigma$ and compute $c_i = \exp(\sigma_i) \times e_i + m_i$. In addition, an auxiliary loss term has to be added, and the final objective minimizes it together with the reconstruction error:
Insert image description here
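A sketch of this computation in PyTorch, matching $c_i = \exp(\sigma_i)\times e_i + m_i$; the layer sizes are assumptions, and I assume the auxiliary loss takes the usual form $\sum_i\big(\exp(\sigma_i) - (1+\sigma_i) + m_i^2\big)$ (the $e^{\sigma}-(1+\sigma)$ part is analysed just below; the $m_i^2$ part additionally keeps the means from drifting):

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=3):
        super().__init__()
        self.hidden   = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_m     = nn.Linear(256, latent_dim)   # mean code m
        self.to_sigma = nn.Linear(256, latent_dim)   # variance code sigma

    def forward(self, x):
        h = self.hidden(x)
        m, sigma = self.to_m(h), self.to_sigma(h)
        e = torch.randn_like(sigma)                  # noise (e1, e2, e3) ~ N(0, I)
        c = torch.exp(sigma) * e + m                 # c_i = exp(sigma_i) * e_i + m_i
        # assumed auxiliary loss: sum_i( exp(sigma_i) - (1 + sigma_i) + m_i^2 )
        aux_loss = (torch.exp(sigma) - (1 + sigma) + m**2).sum(dim=1).mean()
        return c, aux_loss
```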

Finally, we add the mean code $m$ and the weighted noise together to obtain the new latent code, and feed it to the NN Decoder. Let's postpone the question of where this auxiliary loss comes from and first look at what it does:

If this loss term were not added, then in order to keep the generated image quality high (since we minimize the reconstruction error), the encoder would want the noise to interfere with its reconstructions as little as possible, that is, it would give the noise as little weight as possible. Without any constraint, the network simply pushes the variance code towards negative infinity: as $\sigma \to -\infty$, $e^{\sigma} \to 0$, which is equivalent to introducing no noise at all, and then the "VAE" we get is no VAE at all. The training results look very good, but the images it generates are often very poor.

Having reasoned about it backwards, let's now understand it from the forward direction: why does adding this auxiliary loss help? (How the formula is derived will be discussed later.) Taking the derivative of $e^{\sigma} - (1+\sigma)$ with respect to $\sigma$ gives $e^{\sigma} - 1$; setting it to 0 shows that the minimum is reached at $\sigma = 0$. This keeps the variance code from cheating its way towards negative infinity, so the term plays the role of a regularization constraint.

Insert image description here
The blue curve is $e^{\sigma}$, the red curve is $(1+\sigma)$, and subtracting the red from the blue gives the green curve $e^{\sigma} - (1+\sigma)$. Its minimum value of 0 is reached when the noise parameter $\sigma$ is 0, and as $\sigma$ tends to negative infinity the loss value increases instead, which limits how much the model can shrink the noise parameter.

An example given in the original article is quite entertaining; retold: if you only want high scores, you just set the difficulty coefficient of the exam's knowledge points (the noise parameter) extremely low, as low as you like. But an exam that does not meet the required difficulty is not what the teacher wants, so we need a question-setting teacher (the loss function): even if the difficulty coefficient is low, the question setter can turn a low-difficulty knowledge point into a computation-heavy problem, keeping the exam difficulty "always online" just to spite you.

Insert image description here
Taking the trained VAE and looking at the mixture probability distribution over the codes in the latent space, we find an interesting phenomenon: at the high-probability points we get outputs that look like real Pokémon, such as Bulbasaur and Charizard in the picture above, while at the low points we get blurry generated images that do not look good. If we can find the high points when interpolating, we can generate convincing Pokémon images.


VAE principle

Insert image description here
(The original article is not very clear here, so I will expand on it a bit.)
When talking about Gaussian mixtures earlier, we said that any distribution can be approximated by a mixture of normal distributions, so let us sample $z$ from the standard normal distribution and then mix. In practice, however, we do not simply pick a few fixed normal distributions: as shown in the figure above, the Gaussian corresponding to each sampled point is not a standard one, and we hand $z$ to an NN to obtain $\mu(z)$ and $\sigma(z)$.

This method is called the reparameterization (resampling) trick. It is used because $z$ is a latent variable: if we sample directly from the normal distribution $z \sim N(\mu, \sigma^2)$, then first the integral is hard to compute, and second there is no explicit functional relationship to write down, and without an explicit function we cannot do backpropagation. So instead we first sample $z' \sim N(0, I)$ and set $z = \mu + \sigma z'$. Now $z$ is an explicit function: $\mu$ and $\sigma$ become parameters, $z'$ is the input and $z$ is the output, so $z$ can be learned, and the effect is equivalent to sampling $z$ directly from $N(\mu, \sigma)$. In the NN above, this corresponds to taking the $z$ sampled from the normal distribution and directly outputting the two Gaussian parameters $\mu(z)$ and $\sigma(z)$.
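A tiny sketch (PyTorch, arbitrary values) showing why the reparameterization makes backpropagation possible: the randomness enters only through $z'$, while $z$ is an explicit differentiable function of $\mu$ and $\sigma$:

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)      # learnable mean
sigma = torch.tensor([1.2], requires_grad=True)   # learnable std

z_prime = torch.randn(1)          # z' ~ N(0, I): the randomness lives outside the graph
z = mu + sigma * z_prime          # z = mu + sigma * z', an explicit differentiable function

loss = (z ** 2).sum()             # any downstream loss built from z
loss.backward()
print(mu.grad, sigma.grad)        # gradients flow back to mu and sigma
```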
Insert image description here
We can draw the probabilistic graphical model above, where $z$ is a sample point following the normal distribution (for example, $z_1, z_2, \dots, z_n$ all follow the normal distribution), $X$ denotes the Gaussian distribution determined from the sample, and $\theta$ denotes the parameters $\mu, \sigma$ of that Gaussian. Repeating this $N$ times yields a mixed Gaussian distribution.

Insert image description here
Look at the contour plot on the right, where the mixed Gaussian distribution is shown as a bivariate normal distribution. The samples can clearly be split into two Gaussian components $c_1$ and $c_2$. Pay attention to the sample at the red point: it can be regarded as following the $c_1$ distribution or the $c_2$ distribution, but in terms of probability it is more likely to follow $c_1$. A better representation is $x \sim z$: the probability density of $z$ expresses how likely $x$ is to follow each of these distributions, and the mixed Gaussian should be a weighted average of multiple Gaussian distributions.

Insert image description here

Going back to the VAE model structure: under this structure we can regard the data set as generated by a random process in which $z$ is an unobservable latent variable. This data-generating process consists of two steps:

1. Sample a $z_i$ from the prior distribution $p(z)$;
2. Given $z_i$, sample a data point $x_i$ from the conditional distribution $p(x|z_i)$.
(In the neural network we will use resampling techniques to calculate the corresponding Gaussian distribution)

Insert image description here

We randomly sample $m$ discrete points from the distribution $P(X)$; each sampled point $m_i$ corresponds to a Gaussian distribution $N(\mu^{m}, \sigma^{m})$. The mixture distribution can then be expressed as:

$$P(x) = \displaystyle\sum_m P(m)\,P(x|m) = \int_z P(z)\,P(x|z)\,dz$$

Here $m \sim P(m)$ and $x|m \sim N(\mu^{m}, \sigma^{m})$. From this transformation we see that the original discrete sum over samples, $\sum_m P(m)P(x|m)$, can be turned into continuous sampling over the noise $z$, $\int_z P(z)P(x|z)\,dz$. This corresponds to the picture below:
Insert image description here

After the above operations, the original discrete coding scheme with its large distorted regions is converted into a continuous and effective coding scheme. Here $P(x)$ is known while $P(x|z)$ is unknown; since $x|z \sim N(\mu(z), \sigma(z))$, the problem reduces to finding expressions for $\mu(z)$ and $\sigma(z)$.
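As a sketch of what $P(x) = \int_z P(z)P(x|z)\,dz$ means in practice, the integral can be approximated by Monte Carlo sampling of $z$. The `decoder` below is a made-up one-dimensional stand-in for the NN that outputs $\mu(z), \sigma(z)$, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z):
    """Stand-in for the NN mapping z to the parameters of P(x|z) = N(mu(z), sigma(z))."""
    mu = np.tanh(z)                  # hypothetical mu(z)
    sigma = 0.5 + 0.1 * z**2         # hypothetical sigma(z), kept positive
    return mu, sigma

def log_px(x, n_samples=10_000):
    """Monte Carlo estimate of P(x) = E_{z~N(0,1)}[ P(x|z) ], returned as log P(x)."""
    z = rng.standard_normal(n_samples)           # z ~ P(z) = N(0, 1)
    mu, sigma = decoder(z)
    px_given_z = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(px_given_z.mean())

print(log_px(0.3))
```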

Insert image description here
Insert image description here

Let's zoom in on the architecture of the Decoder. We feed the Decoder a $z_i$ sampled from a normal distribution; what we actually hope is that the Decoder, parameterized by $\theta$, learns a mapping that outputs the distribution of $X$ corresponding to $z_i$, that is, $p_{\theta}(X|z_i)$.


Formula Derivation

Insert image description here


The proof above is given in the encoder derivation part; a few details can be filled in, for example:
$\log P(x) = \log P(x) \times 1 = \log P(x)\int_z q(z|x)\,dz$; since $\log P(x)$ does not depend on $z$, this equals
$\int_z q(z|x)\,\log P(x)\,dz$, where $\log P(x) = \log\frac{P(z,x)}{P(z|x)}$ is obtained from the Bayes formula, and the rest of the derivation follows.

Recall the formula for KL divergence: $D_{KL}(p\,||\,q) = \displaystyle\sum^n_{i=1} p(x)\log\frac{p(x)}{q(x)}$; the right-hand part of the expression above can then be written as a KL divergence.
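Putting these pieces together, my reconstruction of the derivation summarized in the figure above reads:

$$
\begin{aligned}
\log P(x) &= \int_z q(z|x)\,\log P(x)\,dz \\
&= \int_z q(z|x)\,\log\frac{P(z,x)}{P(z|x)}\,dz \\
&= \int_z q(z|x)\,\log\left(\frac{P(z,x)}{q(z|x)}\cdot\frac{q(z|x)}{P(z|x)}\right)dz \\
&= \int_z q(z|x)\,\log\frac{P(x|z)\,P(z)}{q(z|x)}\,dz + \int_z q(z|x)\,\log\frac{q(z|x)}{P(z|x)}\,dz \\
&= L_b + D_{KL}\big(q(z|x)\,\|\,P(z|x)\big)
\end{aligned}
$$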

When I watched another video, the following formula was given. The upper and lower formulas are equivalent:

Insert image description here

In the picture above, the $dz$ is dropped and the KL divergence term is omitted, and the expression is written in expectation form. At first I was confused about how this was transformed, but then I realized we had already covered this part:

Insert image description here

That is, an integral weighted by a distribution can be rewritten as an expectation under that distribution. This is just a brief aside; I personally think the first formula given is better.

Now we find a lower bound for $\log P(x)$:

$$\log P(x) \geq \int_z q(z|x)\,\log\!\left(\frac{P(x|z)\,P(z)}{q(z|x)}\right)dz = L_b$$

Substituting back into the original formula, we can write $\log P(x) = L_b + D_{KL}(q(z|x)\,||\,P(z|x))$; that is, finding $P(x)$ is turned into the problem of simultaneously finding $P(x|z)$ and $q(z|x)$. Now let's look at the relationship between $\log P(x)$ and $L_b$:

Insert image description here

Since $P(x) = \int_z P(z)P(x|z)\,dz$, when $P(x|z)$ is fixed, $P(z)$ is fixed as well, so $\log P(x)$ (the blue line segment) is a constant. The equation $\log P(x) = \mathrm{KL} + L_b$ then says that if maximum likelihood requires maximizing $L_b$, and by the formula for $L_b$ this means maximizing over $q(z|x)$ (giving the result on the right), then as $L_b$ becomes as large as possible, the KL term becomes as small as possible (recall the earlier point that maximum likelihood is equivalent to minimum KL divergence). Once the KL divergence is approximately 0, we have $L_b = \log P(x)$, and the two distributions $q(z|x)$ and $P(z|x)$ end up essentially identical.

In other words, solving the maximum-likelihood problem means $\max \log P(x) = \max L_b$. From a macro perspective, since $x|z \sim N(\mu(z), \sigma(z))$, adjusting $P(x|z)$ means adjusting the NN Decoder; conversely, since $z|x \sim N(\mu'(x), \sigma'(x))$, adjusting $q(z|x)$ means adjusting the NN Encoder.
Insert image description here

Therefore the VAE training procedure is: every time the Decoder improves, the Encoder is adjusted to stay consistent with it, and the constraint forces the Decoder to "only move forward, never backward" during training.

Insert image description here
Expanding $L_b$ as shown above, we get $L_b = -D_{KL}(q(z|x)\,||\,P(z)) + \int_z q(z|x)\,\log P(x|z)\,dz$.

We write the first term on the right-hand side as $-A^*$ and the second term as $B^*$, that is, $L_b = -A^* + B^*$. Therefore $\max L_b$ is equivalent to finding the minimum of $A^*$ and the maximum of $B^*$.

Insert image description here

We want $A^*$ to be as small as possible; after expansion it becomes exactly the constraint loss term introduced earlier.
The meaning of the expectation above is: given the distribution $q(z|x)$ output by the NN Encoder, in order to maximize $B^*$, the Decoder should make $P(x|z)$ as large as possible. If the variance is not considered, this is essentially the same as an Auto-Encoder loss function.

Insert image description here

(If the variance is not considered, then the overall structure is actually equivalent to AE)
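Putting the two terms together, here is a sketch of the full VAE training objective in PyTorch, assuming (as is standard) that $q(z|x)$ is a diagonal Gaussian with mean `m` and log-variance `log_var` and that the prior $P(z)$ is $N(0, I)$; the network sizes are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=30):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.enc_m = nn.Linear(256, latent_dim)        # mean of q(z|x)
        self.enc_logvar = nn.Linear(256, latent_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        m, log_var = self.enc_m(h), self.enc_logvar(h)
        z = m + torch.exp(0.5 * log_var) * torch.randn_like(m)   # reparameterization
        return self.dec(z), m, log_var

def vae_loss(x, x_hat, m, log_var):
    # B*: reconstruction term E_q[log P(x|z)]; with the variance ignored it reduces to an MSE,
    # just like an ordinary Auto-Encoder loss.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # A*: KL(q(z|x) || P(z)) with P(z) = N(0, I), in closed form.
    kl = -0.5 * torch.sum(1 + log_var - m.pow(2) - log_var.exp())
    return recon + kl

model = VAE()
x = torch.rand(16, 784)
x_hat, m, log_var = model(x)
loss = vae_loss(x, x_hat, m, log_var)
loss.backward()
```

With the KL term removed, only the reconstruction (MSE) term remains, which is exactly the plain-AE case mentioned above.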


Limitations

Insert image description here
An important limitation is that VAE does not really generate new images; it only interpolates among the images it has memorized, so the final results remain quite similar to the existing images.

Although VAE is much better than the plain AE model, anyone who has trained a VAE knows that, compared with GANs, which use adversarial learning directly, the pictures it generates are blurry. This is because VAE directly computes the mean squared error between the generated image and the original image, so what it ends up producing is an "average image".

  1. What’s the difference between VAE and AE?

(1) The distribution of hidden layer representation in AE is unknown, while the hidden variables in VAE obey the normal distribution;

(2) What is learned in AE is only the NN Encoder and Decoder, while VAE also learns the distribution of hidden variables, that is, the mean and variance of the Gaussian distribution;

(3) AE can only obtain the corresponding reconstructed x from one sample x; while VAE learns the parameters of the Gaussian distribution that the hidden variable z obeys, and can continuously generate new z, thereby generating new sample x.


Summary

By introducing noise into AE, VAE turns the codes in the latent space from discrete points into a continuous region, so that we can interpolate between codes in that continuous region and generate samples lying between certain training codes. These samples bear some similarity to the original inputs, but VAE's generated images are quite blurry, so it cannot really serve directly as a generative model on its own.


Original post: blog.csdn.net/milu_ELK/article/details/129781919