Variational Autoencoder (VAE): So this is what it's all about | Open-source code attached

Table of contents

1. Introduction

2. Distribution transformation

3. A slow walk through VAE

Classic review

The emergence of VAE

Distribution standardization

Derivation

Reparameterization trick

4. Follow-up analysis

What is the essence?

Why the normal distribution?

Where is the "variational" part?

Conditional VAE

5. Code

Closing remarks


1. Introduction

Although I had never looked at it carefully, I always had the impression that the Variational Auto-Encoder (VAE) is a good thing. Riding my recent short-lived burst of enthusiasm for probabilistic graphical models, I decided to try to understand VAE as well.

So I browsed a lot of material on the Internet and, without exception, found it all rather vague: plenty of formulas, yet I remained confused. Just when I finally felt I understood it, I looked at the implementation code and felt that the code was completely different from the theory.

Finally, after piecing together the probabilistic-modeling knowledge I had accumulated during this period, and repeatedly checking against the original paper Auto-Encoding Variational Bayes, I felt that I had at last figured it out.

In fact, the real VAE is quite different from what many tutorials describe: they write a great deal without touching the key points of the model. So I wrote this article, hoping that the text below gives a reasonably clear first explanation of VAE.

2. Distribution transformation

We usually compare VAE with GAN. Indeed, their goals are basically the same: both hope to build a model that generates the target data X from latent variables Z; they differ in how they realize it.

More precisely, both assume that Z obeys some common distribution (such as a normal or uniform distribution), and then hope to train a model X = g(Z) that maps this prior distribution onto the probability distribution of the training set. In other words, their purpose is to transform one distribution into another.

The difficult problem in the generative model is to judge the similarity between the generated distribution and the real distribution, because we only know the sampling results of the two, but do not know their distribution expressions.

Now suppose Z obeys the standard normal distribution. Then I can sample from it to get several Z1, Z2, …, Zn, and transform them to get X̂1 = g(Z1), X̂2 = g(Z2), …, X̂n = g(Zn). How do we judge whether the distribution of this data set constructed through g is the same as the distribution of our target data set?

Did some readers say: don't we have the KL divergence? Unfortunately not, because the KL divergence measures the similarity of two probability distributions through their expressions, and at the moment we do not know the expression of either distribution.

We only have a batch of data {X̂1, X̂2, …, X̂n} sampled from the constructed distribution, and a batch of data {X1, X2, …, Xn} sampled from the real distribution (namely the training set). We only have the samples themselves, not the distribution expressions, so there is no way to compute the KL divergence.

Although we run into difficulties, we still have to find a way to solve them. GAN's idea is very direct and crude: since there is no suitable metric, why not train that metric with a neural network as well?

This is how WGAN was born; for the detailed process, please refer to The Art of Mutual Confrontation: From Zero to WGAN-GP. VAE, on the other hand, uses a more refined and roundabout technique.

3. A slow walk through VAE

In this part, we first review how VAE is introduced in general tutorials, then explore what problems there are, and then naturally discover the true face of VAE.

Classic review

First, we have a batch of data samples {X1, …, Xn}, described collectively by X. Ideally we would obtain the distribution p(X) of X from {X1, …, Xn}; if we could, then by sampling directly from p(X) we could obtain all possible X's (including those outside {X1, …, Xn}), which would be the ultimate ideal generative model.

Of course, this ideal is hard to realize, so we rewrite the distribution as:

$$p(X)=\sum_Z p(X|Z)\,p(Z)$$

Here we do not distinguish between summation and integration, as long as the meaning is clear. At this point, p(X|Z) describes a model that generates X from Z, and we assume that Z obeys the standard normal distribution, that is, p(Z) = N(0, I). If this ideal can be realized, then we can first sample a Z from the standard normal distribution and then compute an X from that Z, which would also be a splendid generative model.
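To make the "sample Z, then map it through g" idea concrete, here is a minimal numpy sketch; the generator g below is only an untrained stand-in (a random linear map followed by tanh), not the model we will actually build.

```python
# A toy illustration of "generation = sample Z ~ N(0, I), then compute X = g(Z)".
# The generator g here is a hypothetical, untrained stand-in, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 784                 # illustrative sizes (MNIST-like)
W = rng.normal(size=(latent_dim, data_dim))   # pretend these weights were learned

def g(z):
    """Hypothetical generator X = g(Z); a real VAE would use a trained decoder."""
    return np.tanh(z @ W)

z = rng.normal(size=(5, latent_dim))          # five draws of Z from N(0, I)
x_hat = g(z)                                  # five candidate samples of X
print(x_hat.shape)                            # (5, 784)
```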

The next step is to combine this with an autoencoder to achieve reconstruction, ensuring that effective information is not lost, add a series of derivations, and finally implement the model. The schematic diagram of this framework is as follows:

 

The emergence of VAE

In fact, in the entire VAE model, we do not use the assumption that  p ( Z ) (prior distribution) is a normal distribution. We use the assumption that  p ( Z | X ) (posterior distribution) is a normal distribution .

Specifically, given a real sample Xk, we assume that there exists a distribution p(Z|Xk) exclusive to Xk (formally, the posterior distribution), and we further assume that this distribution is an (independent, multivariate) normal distribution.

Why emphasize "exclusive"? Because we will train a generator  X = g ( Z ) later, we hope to be able to  restore a Zk sampled  from the distribution p ( Z | Xk )  to  Xk .

If we assume that  p ( Z ) is normally distributed, and then  sample a  Z from p ( Z ) , then how do we know   which real  X this Z corresponds to  ? Now that p ( Z | Xk ) belongs exclusively to Xk , we have reason to say that Z sampled from this distribution should be restored to Xk .     

 In fact, this point is also particularly emphasized in the application section of the paper  Auto-Encoding Variational Bayes :

In this case, we can let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure:

$$\log q_\phi\big(z|x^{(i)}\big)=\log \mathcal{N}\big(z;\,\mu^{(i)},\,\sigma^{2(i)}\mathbf{I}\big)$$

Equation (9) in the paper is the key to realizing the entire model. I don’t know why many tutorials do not highlight it when introducing VAE. Although the paper also mentions that  p ( Z ) is a standard normal distribution, that is not essentially important.

I emphasize again that at this time, each  Xk  is equipped with an exclusive normal distribution, which facilitates the subsequent generator to restore. But there are as many  normal distributions as there are X's  . We know that the normal distribution has two sets of parameters: mean  μ  and variance  σ ^2 (in multivariate terms, they are both vectors).

So how do I find  the mean and variance of the normal distribution p ( Z | Xk ) specific to Xk ?  There doesn't seem to be any direct idea. 

Well, I will use a neural network to fit it . This is the philosophy of the neural network era: we all use neural networks to fit difficult calculations. We have experienced it once with WGAN, and now we experience it again.

So we construct two neural networks, μk = f1(Xk) and log σ²k = f2(Xk), to compute them. We choose to fit log σ² rather than σ² directly because σ² is always non-negative and would need to be passed through an activation function, whereas log σ² can be either positive or negative and needs no activation function.
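As a concrete (hypothetical) example of what f1 and f2 might look like, here is a small Keras sketch with two output heads sharing one hidden layer; the layer sizes and names are illustrative assumptions, not the linked repository's code.

```python
# A sketch of f1 (mean) and f2 (log-variance) as two heads of one Keras encoder.
# Sizes and names are illustrative assumptions, not the linked repo's code.
from tensorflow import keras
from tensorflow.keras import layers

original_dim, latent_dim = 784, 2

x_in = keras.Input(shape=(original_dim,))
h = layers.Dense(256, activation='relu')(x_in)
z_mean = layers.Dense(latent_dim)(h)       # f1(Xk): the mean, no activation needed
z_log_var = layers.Dense(latent_dim)(h)    # f2(Xk): log(sigma^2), free to be negative
encoder = keras.Model(x_in, [z_mean, z_log_var])
encoder.summary()
```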

At this point, I can know the mean and variance specific to  Xk  , and also know what its normal distribution looks like. Then I sample a  Zk from this exclusive distribution  , and then use a generator to get  X̂k = g ( Zk ).

Now we can safely minimize D(X̂k, Xk)², because Zk is sampled from the distribution exclusive to Xk, so this generator should be able to restore the original Xk. Thus we can draw the schematic diagram of VAE:

▲  In fact, VAE constructs an exclusive normal distribution for each sample and then samples from it to reconstruct the sample.

Distribution standardization

Let us think about what the final result will be based on the training process in the picture above.

First, we hope to reconstruct  X , that is, to minimize  D ( X̂k , Xk )^2, but this reconstruction process is affected by noise because Zk  is resampled and not directly calculated by the encoder.

Obviously noise will increase the difficulty of reconstruction, but fortunately the noise intensity (that is, the variance) is calculated through a neural network, so the final model will definitely try its best to make the variance 0 in order to reconstruct better.

If the variance is 0, there is no randomness, so no matter how you sample you only get a fixed result (namely the mean). Fitting a single value is of course easier than fitting many, and the mean itself is computed by another neural network.

To put it bluntly, the model will slowly degrade into a normal AutoEncoder, and noise will no longer play a role .

Wouldn't this be a waste of effort? What about the generative model mentioned?

Don't worry. In fact, VAE also makes all of the p(Z|X) align with the standard normal distribution, which prevents the noise from dropping to zero and at the same time guarantees that the model keeps its generative ability.

How do you understand "guarantees generative ability"? If all p(Z|X) are close to the standard normal distribution N(0, I), then by definition:

$$p(Z)=\sum_X p(Z|X)\,p(X)=\sum_X \mathcal{N}(0,I)\,p(X)=\mathcal{N}(0,I)\sum_X p(X)=\mathcal{N}(0,I)$$

In this way we arrive at our prior assumption that p(Z) is a standard normal distribution, and we can then safely sample from N(0, I) to generate images.

▲  In order for the model to have generative ability, VAE makes every p(Z|X) align with the standard normal distribution.

So how do we make all of the p(Z|X) align with N(0, I)? Without any external knowledge, the most direct approach would be to add an extra loss on top of the reconstruction error:

$$\mathcal{L}_\mu=\|f_1(X_k)\|^2,\qquad \mathcal{L}_{\sigma^2}=\|f_2(X_k)\|^2$$

Because f1(Xk) and f2(Xk) represent the mean μk and the log-variance log σ² respectively, achieving N(0, I) amounts to hoping that both are as close to zero as possible. However, this raises the problem of how to choose the ratio between these two losses; if the ratio is chosen poorly, the generated images will be blurry.

So the original paper directly computes, as this extra loss, the KL divergence KL(N(μ, σ²)‖N(0, I)) between a general normal distribution (with independent components) and the standard normal distribution. The result is:

$$KL\big(\mathcal{N}(\mu,\sigma^2)\,\big\|\,\mathcal{N}(0,I)\big)=\frac{1}{2}\sum_{i=1}^{d}\Big(\mu_{(i)}^2+\sigma_{(i)}^2-\log\sigma_{(i)}^2-1\Big)$$
Here d is the dimension of the latent variable Z, and μ_(i) and σ²_(i) denote the i-th components of the mean vector and variance vector of the general normal distribution, respectively. Supplementing the loss directly with this formula removes the need to consider the relative weighting of the mean loss and the variance loss.

Obviously, this loss can also be understood in two parts:

$$\mathcal{L}_{\mu,\sigma^2}=\mathcal{L}_{\mu}+\mathcal{L}_{\sigma^2},\qquad
\mathcal{L}_{\mu}=\frac{1}{2}\sum_{i=1}^{d}\mu_{(i)}^2=\frac{1}{2}\|f_1(X)\|^2,\qquad
\mathcal{L}_{\sigma^2}=\frac{1}{2}\sum_{i=1}^{d}\Big(\sigma_{(i)}^2-\log\sigma_{(i)}^2-1\Big)$$

Derivation

Since we are considering a multivariate normal distribution with independent components, we only need to derive the univariate case. By definition we can write:

$$\begin{aligned}
KL\big(\mathcal{N}(\mu,\sigma^2)\,\big\|\,\mathcal{N}(0,1)\big)
&=\int \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/2\sigma^2}
\log\frac{e^{-(x-\mu)^2/2\sigma^2}\big/\sqrt{2\pi\sigma^2}}{e^{-x^2/2}\big/\sqrt{2\pi}}\,dx\\
&=\frac{1}{2}\int \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/2\sigma^2}
\Big[-\log\sigma^2+x^2-\frac{(x-\mu)^2}{\sigma^2}\Big]dx
\end{aligned}$$

The whole expression splits into three integrals. The first term is −log σ² multiplied by the integral of the probability density (which is 1), so it contributes −log σ²; the second term is the second moment of the normal distribution, which readers familiar with the normal distribution will know equals μ² + σ²; and by definition the third term is "minus the variance divided by the variance", i.e. −1. So the total result is:

$$KL\big(\mathcal{N}(\mu,\sigma^2)\,\big\|\,\mathcal{N}(0,1)\big)=\frac{1}{2}\big(-\log\sigma^2+\mu^2+\sigma^2-1\big)$$
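If you want to convince yourself of this closed form without redoing the integrals, a quick Monte Carlo check (my own sanity check, not part of the original derivation) is enough:

```python
# Sanity check: compare 0.5*(mu^2 + sigma^2 - log sigma^2 - 1) with a Monte
# Carlo estimate of E_{x~N(mu,sigma^2)}[log p(x) - log q(x)], where q = N(0, 1).
import numpy as np
from scipy.stats import norm

mu, sigma = 1.3, 0.6
closed_form = 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1)

rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=1_000_000)
mc_estimate = np.mean(norm.logpdf(x, mu, sigma) - norm.logpdf(x, 0, 1))

print(closed_form, mc_estimate)  # the two values should agree to ~3 decimals
```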

Reparameterization trick

Finally, there is a trick needed to implement the model; its English name is the Reparameterization Trick, and I will simply call it reparameterization here.

▲  The reparameterization trick

It is actually very simple: we need to sample a Zk from p(Z|Xk). Although we know that p(Z|Xk) is a normal distribution, its mean and variance are computed by the model, and we rely on this sampling process to in turn optimize the mean/variance networks. The trouble is that the "sampling" operation itself is not differentiable, while the sampling result is, so we exploit the following fact:

$$Z\sim\mathcal{N}(\mu,\sigma^2)\;\Longleftrightarrow\;Z=\mu+\varepsilon\,\sigma,\quad \varepsilon\sim\mathcal{N}(0,I)$$

that is, sampling a Z from N(μ, σ²) is equivalent to sampling an ε from N(0, I) and letting Z = μ + ε × σ.

Thus we change from sampling from N(μ, σ²) to sampling from N(0, I), and then obtain a sample from N(μ, σ²) through the parameter transformation Z = μ + ε × σ. In this way the "sampling" operation no longer needs to participate in gradient descent; only the sampling result does, which makes the whole model trainable.

How to implement it specifically? If you look at the above text against the code, you will understand it at once.
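As a rough sketch (not the linked repo's exact code), the trick boils down to a few lines in Keras: draw ε inside a Lambda layer and return μ + σ·ε, so that gradients flow through μ and log σ² while the random draw stays outside the gradient path.

```python
# A minimal sketch of the reparameterization trick in Keras; shapes and names
# are illustrative assumptions, not the linked repo's exact code.
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras import layers

latent_dim = 2

def sampling(args):
    z_mean, z_log_var = args
    eps = K.random_normal(shape=K.shape(z_mean))   # eps ~ N(0, I)
    return z_mean + K.exp(0.5 * z_log_var) * eps   # Z = mu + sigma * eps

# Wrapping the function in a Lambda layer makes the sampled Z an ordinary,
# differentiable node of the computation graph.
z_mean = keras.Input(shape=(latent_dim,))
z_log_var = keras.Input(shape=(latent_dim,))
z = layers.Lambda(sampling)([z_mean, z_log_var])
```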

4. Follow-up analysis

Even if we understand everything above, we may still have many questions when facing VAE.

What is the essence?

What is the nature of VAE? Although VAE is also called a type of AE (AutoEncoder), its approach (or its interpretation of the network) is unique.

In VAE there are two encoders: one computes the mean and the other computes the variance. This is surprising: the encoders are not used to encode, but to compute the mean and the variance. That really is big news; aren't the mean and variance statistics? How come they are computed with a neural network?

In fact, I think  VAE started from variational and Bayesian theories, which are daunting to ordinary people, and finally landed on a specific model . Although it took a long way, the final model is actually very down-to-earth.

Essentially, it takes our conventional autoencoder and adds "Gaussian noise" to the encoder output (corresponding to the network in VAE that computes the mean), so that the decoder becomes robust to noise; the extra KL loss (whose purpose is to make the mean 0 and the variance 1) is in effect a regularization term on the encoder, hoping that everything coming out of the encoder has zero mean.

What about the role of the other encoder (corresponding to the network that calculates the variance)? It is used to dynamically adjust the intensity of noise .

Intuitively, when the decoder has not been trained well (the reconstruction error is much larger than KL loss), the noise will be appropriately reduced (KL loss increases), making the fitting easier (the reconstruction error begins to decrease) .

Conversely, if the decoder is trained well (the reconstruction error is smaller than the KL loss), then the noise increases (the KL loss decreases), making fitting harder (the reconstruction error starts to rise again), and the decoder then has to find ways to improve its generative ability.

▲  The essential structure of VAE

To put it bluntly, the reconstruction process wants no noise, while the KL loss wants Gaussian noise; the two are opposed. So VAE, like GAN, actually contains an adversarial process inside, except that in VAE the two sides are blended together and co-evolve.

From this perspective, the idea of VAE seems to be more advanced, because in GAN, when the counterfeiter evolves, the discriminator remains unmoved, and vice versa. Of course, this is just one aspect and does not mean that VAE is better than GAN.

The real brilliance of GAN is that it even directly trains the metric, and this metric is often better than what we think artificially (however, GAN itself also has various problems, which I will not expand on).

Why the normal distribution?

Regarding the distribution p(Z|X), readers may wonder: does it have to be a normal distribution? Can I choose a uniform distribution instead?

First of all, this is ultimately an experimental question: try both distributions and you will know. But intuitively the normal distribution is more reasonable than the uniform distribution, because the normal distribution has two independent sets of parameters, the mean and the variance, while the uniform distribution has only one set.

We said earlier that in VAE, reconstruction and noise oppose each other: the reconstruction error and the noise intensity are two mutually antagonistic indicators. In principle, when changing the noise intensity we need to be able to keep the mean unchanged; otherwise, when the reconstruction error increases, it is hard to tell whether the mean has shifted (the encoder's fault) or the variance has grown (the noise's fault).

The uniform distribution cannot change the variance while keeping the mean unchanged, so the normal distribution should be more reasonable.

Where is the "variational" part?

Another interesting (though not important) question: VAE is called a "variational autoencoder", so what does it have to do with the calculus of variations? In the VAE paper and the related explanations, the calculus of variations seems nowhere to be found.

In fact, if the reader already accepts the KL divergence, then VAE does not seem to have much to do with variation, because the KL divergence is defined as:

$$KL\big(p(x)\,\big\|\,q(x)\big)=\int p(x)\,\log\frac{p(x)}{q(x)}\,dx$$
If the distributions are discrete, the integral is written as a summation. What we need to prove is: for a given probability distribution p(x) (or a fixed q(x)), for any probability distribution q(x) (or p(x)) we have KL(p(x)‖q(x)) ≥ 0, with equality only when p(x) = q(x).

Because KL(p(x)‖q(x)) is in fact a functional, and finding the extremum of a functional requires the calculus of variations. Of course, the variational method here is only a parallel extension of ordinary calculus; the really complicated calculus of variations is not involved. The variational lower bound of VAE is obtained directly from the KL divergence, so if one accepts the KL divergence outright, there is essentially nothing "variational" left.
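For completeness, here is the standard one-line argument for KL(p‖q) ≥ 0 (a sketch using the elementary inequality log t ≥ 1 − 1/t for t > 0, with equality only at t = 1; this is my own aside, not how the paper presents it):

$$KL\big(p(x)\,\big\|\,q(x)\big)=\int p(x)\log\frac{p(x)}{q(x)}\,dx
\;\ge\;\int p(x)\left(1-\frac{q(x)}{p(x)}\right)dx=\int\big(p(x)-q(x)\big)\,dx=1-1=0,$$

with equality if and only if p(x) = q(x) almost everywhere.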

In a word, the "variation" in the name of VAE is because its derivation process uses KL divergence and its properties.

Conditional VAE

Finally, because the current VAE is unsupervised training, it is natural to think: if there is labeled data, can the label information be added to assist in generating samples?

The intention of this question is often to control a certain variable to generate a certain type of image. Of course, this is definitely possible. We call this situation  Conditional VAE , or CVAE (correspondingly, we also have a CGAN in GAN).

However, CVAE is not one specific model but a class of models; in short, there are many ways to integrate label information into VAE, serving different purposes. Based on the previous discussion, a very simple CVAE is given here.

▲  A simple CVAE structure

In the previous discussion, we hope that  after X  is encoded, the distribution of Z  will have zero mean and unit variance. This "hope" is achieved by adding KL loss.

If we now additionally have category information Y, we can instead hope that samples of the same class share an exclusive mean μ^Y (with the variance unchanged, still the unit variance), and this μ^Y is left for the model to train by itself.

In this case, there are as many normal distributions as there are classes, and when generating, we can control the category of the generated image by controlling the mean .

In fact, this is probably the scheme that adds the least code to a VAE to implement a CVAE, because this "new hope" only requires modifying the KL loss:

$$\mathcal{L}_{\mu,\sigma^2}=\frac{1}{2}\sum_{i=1}^{d}\Big[\big(\mu_{(i)}-\mu^{Y}_{(i)}\big)^2+\sigma_{(i)}^2-\log\sigma_{(i)}^2-1\Big]$$
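A minimal sketch of this modified KL term in Keras might look as follows; the trainable class means μ^Y are stored in an Embedding layer looked up by the label, and all names and sizes here are illustrative assumptions rather than the linked repo's code.

```python
# A sketch of the CVAE KL term: each class y owns a trainable mean mu^Y, and
# the encoder mean is pulled toward mu^Y instead of toward 0.
# Names and sizes are illustrative assumptions, not the linked repo's code.
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras import layers

num_classes, latent_dim = 10, 2

y_in = keras.Input(shape=(1,), dtype='int32')
class_mean = layers.Embedding(num_classes, latent_dim)(y_in)   # trainable mu^Y per class
class_mean = layers.Reshape((latent_dim,))(class_mean)

def cvae_kl_loss(z_mean, z_log_var, class_mean):
    # 0.5 * sum_i [ (mu_i - mu^Y_i)^2 + sigma_i^2 - log sigma_i^2 - 1 ]
    kl = 0.5 * K.sum(K.square(z_mean - class_mean)
                     + K.exp(z_log_var) - z_log_var - 1, axis=-1)
    return K.mean(kl)
```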

The figure below shows that this simple CVAE does have some effect, though because the encoder and decoder are both relatively simple (plain MLPs), the control over generation is not perfect.

▲  Using this CVAE to control the generation of the digit 9, we find that 9's of various styles are generated and that they slowly transition toward 7, so a preliminary observation is that this CVAE is effective.

Readers who want a more complete CVAE are invited to study it on their own. Recently there has been work combining CVAE and GAN: CVAE-GAN: Fine-Grained Image Generation through Asymmetric Training. Model recipes are endlessly varied.

5. Code

I copied the official VAE code of Keras, then fine-tuned it and added Chinese comments based on the previous content. I also implemented the simple CVAE mentioned at the end for readers' reference.

Code: https://github.com/bojone/vae

Closing remarks

With a few bumps along the way, we have reached the end of the article once again. I am not sure whether I have explained things clearly, and I hope readers will offer their opinions.

Overall, the idea behind VAE is still very elegant. It is not that it gives us a particularly good generative model (in fact the images it generates are not great, rather blurry), but that it provides a beautiful example of combining probabilistic graphical models with deep learning, and that example contains much worth thinking about.
