Li Hongyi Machine Learning Notes - Generative Models

Three methods are introduced: PixelRNN, VAE, and GAN. These notes focus mainly on VAE.

PixelRNN is the easiest to understand: it goes from the known to the unknown, predicting each new pixel from the pixels already generated.

This method can also be applied to other domains, such as speech generation.
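To make the pixel-by-pixel idea concrete, here is a rough sketch of the sampling loop; `predict_next_pixel` is a placeholder for a trained model, not anything from the lecture:

```python
import numpy as np

def sample_image(predict_next_pixel, height=28, width=28, num_colors=256):
    """Autoregressive sampling sketch: each pixel is drawn from a
    distribution conditioned on all previously generated pixels."""
    image = np.zeros((height, width), dtype=np.int64)
    for r in range(height):
        for c in range(width):
            # predict_next_pixel stands in for a trained RNN/CNN that returns
            # a probability vector over the possible pixel values.
            probs = predict_next_pixel(image, r, c)          # shape: (num_colors,)
            image[r, c] = np.random.choice(num_colors, p=probs)
    return image
```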

There is a tip worth mentioning here. Each pixel in an image is normally represented by three RGB values. The problem is that when the three RGB values are close to each other, the resulting pixel tends to look grayish. To make the generated images brighter, the gap between the three values needs to be enlarged. In short, instead of representing a color with three numbers, each pixel is now represented by a single value (one choice out of a fixed set of colors).
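As a minimal sketch of that single-value representation (not code from the lecture; the palette below is made up for illustration, and a real model would instead cluster the colors seen in training):

```python
import numpy as np

# A toy palette of distinct RGB colors; each pixel will be reduced to an index into it.
palette = np.array([[0, 0, 0], [255, 255, 255], [255, 0, 0],
                    [0, 255, 0], [0, 0, 255], [255, 255, 0]], dtype=np.float32)

def rgb_to_index(image):
    """Map each RGB pixel (H, W, 3) to the index of the nearest palette color."""
    dists = np.linalg.norm(image[..., None, :] - palette, axis=-1)  # (H, W, P)
    return dists.argmin(axis=-1)                                    # (H, W)

def index_to_rgb(indices):
    """Recover an RGB image from palette indices."""
    return palette[indices]
```

The generator can then predict a distribution over palette indices instead of three correlated color channels.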

VAE is considerably more complicated. In fact, I had to watch this video to understand how it works.

The overall structure of VAE looks like this. When training is over, the decoder is taken out on its own; feeding it a vector of values produces a generated image.

VAE seems to have a wide range of applications, such as generating verses (strictly speaking the example is not really verse, but that is how the original blog described it).

Why use VAE instead of AE?

Let's look at this structural diagram again.

Here m plays the role of \mu and e^{\sigma} plays the role of \Sigma; both are outputs of the neural network.

VAE adds a perturbation on top of AE, namely the term e * e^{\sigma} (with e sampled from a normal distribution) added to m.

In this way \Sigma can vary, but we must force it not to become too small: when \Sigma shrinks to 0, VAE degenerates into a plain AE.

An extra term is added to the loss function; its value corresponds to the green curve in the figure, and it is smallest when \sigma = 0. (In the lecture this extra term is \sum_i \big( e^{\sigma_i} - (1 + \sigma_i) + (m_i)^2 \big).)

When \sigma = 0, \Sigma = e^{\sigma} = 1 rather than 0.

With that as a basis, let's go into more detail.

What we have to do now is estimate P(x), that is, the probability distribution of the input images (the figure above uses one dimension as an example).

The distribution in the figure above is actually a combination of many Gaussian distributions.

When we want to sample a value, there are many Gaussians to choose from; each Gaussian has its own weight, as well as its own \mu and \Sigma. We first pick a Gaussian according to its weight and then sample an x from the chosen Gaussian.
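A small sketch of this two-step sampling; the weights, means and standard deviations below are arbitrary numbers chosen only for illustration:

```python
import numpy as np

# A toy Gaussian mixture: each component has a weight, a mean and a std.
weights = np.array([0.5, 0.3, 0.2])
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.8])

def sample_from_mixture(n):
    # Step 1: pick a component according to its weight.
    ks = np.random.choice(len(weights), size=n, p=weights)
    # Step 2: sample x from the chosen Gaussian.
    return np.random.normal(means[ks], stds[ks])

samples = sample_from_mixture(10000)
```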

The next step is the key point: the mapping!

Suppose z follows a normal distribution (one-dimensional for simplicity); then each point on the z axis corresponds to one Gaussian distribution in the figure.

Which Gaussian it corresponds to is determined by a function of z: in the lecture, x|z \sim N(\mu(z), \sigma(z)), where \mu(z) and \sigma(z) are produced by a neural network.

To be honest, I still did not understand this mapping relationship until I saw this expression: P(x) = \int_z P(z)\,P(x|z)\,dz.

All in all, the distribution above is a combination of many Gaussians. If there were, say, only 10 of them, then there would be at most 10 components to sample from. But with the mapping approach, z has infinitely many possible values, and so \mu(z) and \Sigma(z) also have infinitely many possible values.
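To make the "infinite mixture" concrete, here is a sketch that estimates P(x) = \int_z P(z) P(x|z) dz by Monte Carlo; \mu(z) and \sigma(z) are stand-in functions here, whereas in a real VAE they would be the output of a neural network:

```python
import numpy as np

# Stand-ins for the network outputs; in a real VAE these are learned functions of z.
mu    = lambda z: np.tanh(z)           # mean of P(x|z)
sigma = lambda z: 0.5 + 0.1 * z**2     # std of P(x|z), kept positive

def estimate_px(x, num_samples=100_000):
    """Monte Carlo estimate of P(x) = E_{z ~ N(0,1)}[ N(x; mu(z), sigma(z)) ]."""
    z = np.random.normal(size=num_samples)   # z ~ P(z) = N(0, 1)
    s = sigma(z)
    densities = np.exp(-0.5 * ((x - mu(z)) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return densities.mean()

print(estimate_px(0.3))
```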

So the purpose of the neural network is to find a set of weights that maximizes the likelihood of the observed data, i.e. maximizes \sum_x log(P(x)).

As the figure shows, instead of looking only for P(x|z), we now look for both P(x|z) and q(z|x).

When we increase L_{b} by adjusting q(z|x), the value of KL(q(z|x)||P(z|x)) keeps shrinking. (Why? log(P(x)) does not depend on q(z|x), so it stays fixed; as L_{b} increases, the KL term naturally decreases.)
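As I understand the decomposition used in the lecture, for any distribution q(z|x):

```latex
\log P(x) \;=\; \underbrace{\int_z q(z|x)\,\log\frac{P(x,z)}{q(z|x)}\,dz}_{L_b}
\;+\; \underbrace{KL\big(q(z|x)\,\|\,P(z|x)\big)}_{\geq\,0}
```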

When the KL value keeps decreasing, q(z|x) gets closer and closer to P(z|x).

Let's continue parsing the expression for L_{b}.

So what is q?

q(z|x) can be any distribution; minimizing this KL term actually pushes the distribution q(z|x) as close as possible to the prior P(z).
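Writing P(x,z) = P(x|z)P(z), L_{b} itself splits into the two terms the rest of these notes keep referring to (my own restatement of the lecture derivation):

```latex
L_b \;=\; \int_z q(z|x)\,\log\frac{P(x|z)\,P(z)}{q(z|x)}\,dz
\;=\; -\,KL\big(q(z|x)\,\|\,P(z)\big) \;+\; \int_z q(z|x)\,\log P(x|z)\,dz
```

Maximizing L_{b} therefore pulls q(z|x) toward the prior P(z) while also asking the decoder to reconstruct x well.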

A quick digression on how AE works.

HOWEVER

Some supplements about VAE:

The content above was hard to understand on a first pass; it took me half a day of looking through other material before it suddenly clicked, so I would like to share that here.

The best starting point for understanding VAE is the question of why we use VAE instead of AE.

  1. The dimension of X is high, and the dimension of Z cannot be made very low, so the computational cost is large.
  2. For a given instance x, the z values strongly correlated with it are quite limited, and a huge number of samples may be needed to hit those few z.

Look at the second point first.

As the figure above shows, when the problem's dimensionality is high, a one-to-one correspondence between a sampled variable z_i and a real sample x_i cannot be guaranteed; without that correspondence between generated samples and real samples, the loss is hard to compute.

So VAE makes one change: instead of putting n samples into the encoder together, the samples are put through the encoder one by one, so each x gets its own z. That's all.

The following applies to both AE and VAE: feed in a sample, and the output is the mean \mu and \log(\sigma^{2}) of the distribution the sample is mapped to.

AE directly uses the mean as the code, i.e. z = \mu. The other change VAE makes is the reparameterization z = \mu + \varepsilon \cdot e^{\frac{1}{2}\log\sigma^{2}} (that is, z = \mu + \varepsilon\sigma, with \varepsilon drawn from a standard normal).
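A minimal numpy sketch of that difference, assuming the encoder has already produced a mean and a log-variance for one sample (the variable names are mine, not from the lecture):

```python
import numpy as np

def ae_latent(mu):
    """AE: the code is just the mean."""
    return mu

def vae_latent(mu, log_var):
    """VAE: add noise via the reparameterization trick,
    z = mu + eps * sigma, with sigma = exp(0.5 * log_var)."""
    eps = np.random.normal(size=mu.shape)   # eps ~ N(0, I)
    return mu + eps * np.exp(0.5 * log_var)

mu, log_var = np.array([0.3, -1.2]), np.array([-0.5, 0.1])
z = vae_latent(mu, log_var)
```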

How is the first problem solved? The heavy computation is mainly concentrated in pθ(z|x).

Directly computing pθ(z|x) is very complicated, but if we construct another distribution and keep shrinking the distance between the two distributions, we solve the problem in a roundabout way.

We create a new distribution called qϕ(z|x)

How do we distinguish the meanings of qϕ(z|x), pθ(z|x) and pθ(z)?

pθ(z): the prior distribution, i.e. the distribution of z itself, generally taken to be a Gaussian distribution.

pθ(z|x): the true posterior distribution of z given x, which is hard to compute.

qϕ(z|x): the encoder. The point of the repeated backward passes is to make qϕ(z|x) closer and closer to pθ(z|x); and since we generally assume that distribution is Gaussian, this in practice means making qϕ(z|x) closer and closer to N(0, I).

How do we make qϕ(z|x) get closer and closer to N(0, I)?

The loss function mentioned above is obtained in this way

As the figure above shows, when μ = 0 and e^{\log\sigma^{2}} = \sigma^{2} = 1, the loss is smallest; in other words, the loss is 0 when the distribution is N(0, 1).
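For reference, the standard closed-form KL between the encoder's Gaussian and the standard normal (written per dimension) is:

```latex
KL\big(\mathcal{N}(\mu,\sigma^{2})\,\|\,\mathcal{N}(0,1)\big)
\;=\; \tfrac{1}{2}\big(\mu^{2} + \sigma^{2} - \log\sigma^{2} - 1\big)
```

which is zero exactly when \mu = 0 and \sigma^{2} = 1, matching the statement above.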

This is how I understand VAE. I still don't feel I grasp its essential structure, but after these two days of study I at least have a rough picture, and it may well be revised later.

Second update: after going through teacher Li Hongyi's explanation of the VAE loss function a second time, I came away with a new feeling for it.

Our aim is to maximize log(P(x)).

Through simplification, we split it into two parts: L_{b} and a KL divergence.

L_{b} can be increased by changing P(x|z), but there is a problem: L_{b} does increase, yet the KL divergence may decrease at the same time, which means log(P(x)) might not increase and could even decrease.

For this reason, both P(x|z) and q(z|x) are adjusted to increase L_{b} together. This has several advantages: q(z|x) keeps approaching P(z|x), so the KL divergence keeps decreasing; in the best case the KL divergence drops to 0, and from then on whenever L_{b} increases, log(P(x)) must increase with it.

Next, let's talk about how to change P(x|z) and q(z|x) to increase L_{b}.

Minimizing the KL divergence here (not the one mentioned above, but KL(q(z|x)||P(z))) essentially means q(z|x) keeps approaching the normal distribution N(0,1); after some fairly involved algebra this part can be turned into the familiar form (the content of the yellow box).

The second part is the one to be maximized; in fact it is the core content: it makes the generated value \mu(x) keep approaching x.

In other words, the loss function of VAE mainly includes two parts:

First, the encoder q(z|x) should get closer and closer to N(0,1)

Second, the reconstruction loss, i.e. the gap between the reconstructed x and the original sample x.
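Putting the two parts together, here is a minimal numpy sketch of the per-sample VAE loss; it uses the closed-form KL above, and mean squared error stands in for the reconstruction term (that choice is my assumption, not something fixed by the lecture):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """VAE loss = reconstruction error + KL(q(z|x) || N(0, I))."""
    recon = np.sum((x - x_recon) ** 2)                          # part 2: reconstruction gap
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1)    # part 1: pull q(z|x) toward N(0, I)
    return recon + kl
```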


Original article: blog.csdn.net/weixin_62375715/article/details/129838752