Variational Autoencoders Explained


Translated from https://jaan.io/what-is-variational-autoencoder-vae-tutorial/

Why do variational autoencoders confuse both deep learning researchers and probabilistic machine learning researchers? What is a variational autoencoder? Why does the term cause confusion?

The reason is that neural networks and probabilistic models differ in their basic concepts and in the language used to describe them. The goal of this tutorial is to bridge that divide, enable more collaboration and discussion between the two fields, and provide a consistent implementation.

Variational autoencoders are cool: they let us design complex generative models of data and fit them to large datasets. They can generate images of fictional celebrity faces or high-resolution digital artwork. These models have achieved very good results in image generation and reinforcement learning.

The rest of this article explains them from two perspectives: neural networks and probability models.

Neural network perspective

In the language of neural networks, a VAE consists of three parts: an encoder, a decoder, and a loss function. The encoder compresses data into a latent space \(z\); the decoder reconstructs the data from the latent state \(z\).

The encoder is a neural network whose input is a data point \(x\) and whose output is a latent state \(z\); its parameters \(\theta\) are the weights and biases. To be concrete, suppose \(x\) is a \(28 \times 28\) handwritten-digit image, usually flattened into a 784-dimensional vector. The encoder must encode the 784-dimensional data \(x\) into the latent space \(z\), whose dimensionality is much smaller than 784. This is often called the "bottleneck", because the encoder has to learn an efficient compression of the data into this low-dimensional space. Denote the encoder by \(q_{\theta}(z|x)\). Note that the low-dimensional latent space is stochastic: the encoder outputs the parameters of \(q_{\theta}(z|x)\), a Gaussian probability density, from which we can then sample noisy values of \(z\).
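As a minimal sketch of such an encoder (my own illustration in PyTorch; the original post does not prescribe a framework, and the layer sizes and latent dimensionality are assumptions):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """q_theta(z|x): maps 784 pixels to the mean and log-variance of a Gaussian over z."""
    def __init__(self, latent_dim=2):           # latent_dim chosen for illustration
        super().__init__()
        self.hidden = nn.Linear(784, 400)
        self.mu = nn.Linear(400, latent_dim)
        self.logvar = nn.Linear(400, latent_dim)

    def forward(self, x):                        # x: (batch, 784)
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.logvar(h)        # parameters of q_theta(z|x)
```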

The decoder is another neural network whose input is the latent state \(z\) and whose output is a probability distribution over the data; its parameters \(\phi\) are likewise weights and biases, and the decoder is written \(p_{\phi}(x|z)\). Continuing the example above, suppose each pixel is either 0 or 1, so the distribution of a single pixel can be represented by a Bernoulli distribution. The decoder takes \(z\) as input and outputs 784 Bernoulli parameters, one per pixel, each giving the probability that the pixel is 1. Information about the original 784-dimensional image \(x\) is not directly available, because the decoder only sees the compressed latent state \(z\). This means there is information loss.
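A matching decoder sketch under the same assumptions: it maps a latent vector to 784 Bernoulli logits, one per pixel.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """p_phi(x|z): maps a latent vector to one Bernoulli logit per pixel."""
    def __init__(self, latent_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 784),                 # logits; sigmoid gives pixel probabilities
        )

    def forward(self, z):                        # z: (batch, latent_dim)
        return self.net(z)
```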

The loss function of the variational autoencoder is the negative log-likelihood with a regularizer. Because there is no representation shared across data points, the loss for each data point \(l_i\) is independent, and the total loss \(\mathcal{L} = \sum_{i=1}^N l_i\) is the sum of the per-data-point losses. The loss \(l_i\) for data point \(x_i\) can be written as:

\[l_i(\theta,\phi)=-\mathbb{E}_{z \sim q_{\theta}(z|x_i)}[\log p_{\phi}(x_i|z)] + KL(q_{\theta}(z|x_i)||p(z)) \]

The first term is the reconstruction loss, the negative expected log-likelihood of the data point \(x_i\). The second term, the KL divergence, is a regularizer: it measures how close the distribution \(q\) is to \(p\), that is, how much information is lost when using \(q\) to represent \(p\).
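Written out, the KL divergence in the loss above is the expectation, under \(q\), of the log-ratio of the two densities:

\[KL(q_{\theta}(z|x_i)||p(z)) = \mathbb{E}_{z \sim q_{\theta}(z|x_i)}\left[\log q_{\theta}(z|x_i) - \log p(z)\right] \]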

In the variational autoencoder, \(p(z)\) is chosen to be a standard normal distribution, \(p(z) = \text{Normal}(0,1)\). If the encoder outputs values of \(z\) that do not follow the standard normal distribution, it is penalized in the loss. The regularizer keeps the latent states \(z\) of different handwritten digits sufficiently diverse yet meaningful. Without it, the encoder could simply map each data point to a different region of Euclidean space, which causes a problem: two images of the handwritten digit 2, say \(2_a\) and \(2_b\), could be encoded into very different latent states \(z_a\) and \(z_b\). We want latent states of the same digit to be close to each other in latent space, hence the need for the regularization constraint.
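As a sketch of the per-data-point loss \(l_i\) (again my own PyTorch illustration, using the outputs of the `Encoder`/`Decoder` sketches above): the reconstruction term is a Bernoulli negative log-likelihood, and the KL term has a closed form for a diagonal Gaussian against the standard normal prior.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, mu, logvar, logits):
    """Negative ELBO for a batch: reconstruction loss plus KL regularizer.

    x      : (batch, 784) binary pixels
    mu     : (batch, latent_dim) means from the encoder
    logvar : (batch, latent_dim) log-variances from the encoder
    logits : (batch, 784) Bernoulli logits from the decoder
    """
    # Reconstruction term: -log p_phi(x|z) for Bernoulli pixels.
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=1)
    # KL(N(mu, diag(sigma^2)) || N(0, I)) = 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2).
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar, dim=1)
    return (recon + kl).mean()                   # average of l_i over the batch
```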

Probability model perspective

Now, forget everything about deep learning and neural networks for a moment, and look at the variational autoencoder again from the perspective of probability models. At the end, we will return to neural networks.

In the probability model framework, the variational autoencoder expresses the joint probability of the data points \(x\) and the latent variables \(z\) as \(p(x,z) = p(x|z)p(z)\). For each data point \(i\), the generative process is as follows (a code sketch follows the list):

  • Sample a latent variable \(z_i \sim p(z)\)
  • Sample a data point \(x_i \sim p(x|z)\)
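A minimal sketch of this two-step generative story, assuming a trained decoder network like the one sketched earlier (the names are mine, not the original post's):

```python
import torch

def generate(decoder, n_samples=16, latent_dim=2):
    # Step 1: sample latent variables z_i from the prior p(z) = Normal(0, I).
    z = torch.randn(n_samples, latent_dim)
    # Step 2: sample data points x_i from p(x|z), a Bernoulli over each pixel.
    probs = torch.sigmoid(decoder(z))            # (n_samples, 784) pixel probabilities
    return torch.bernoulli(probs)                # binary sampled images
```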

This can be represented as a probabilistic graphical model.

Now we can state the core problem of variational autoencoders from the probability-model perspective. The latent state \(z\) is sampled from the prior \(p(z)\), and the data point \(x\) is then generated from the conditional distribution \(p(x|z)\) given \(z\). The model defines a joint distribution over data and latent states, \(p(x,z) = p(x|z)p(z)\); for handwritten digits, \(p(x|z)\) is a Bernoulli distribution.

Now consider how to infer the latent variables given observed data, that is, how to compute the posterior \(p(z|x)\). By Bayes' theorem:

\[p(z|x)=\frac{p(x|z)p(z)}{p(x)} \]

Consider the denominator \(p(x)\), which can be computed as \(p(x) = \int p(x|z)p(z)\,dz\). Unfortunately, this integral takes exponential time to evaluate, because it must be computed over all configurations of the latent variables. We therefore need to approximate the posterior distribution.

Variational inference approximates the posterior with a family of distributions \(q_{\lambda}(z|x)\), where the parameter \(\lambda\) indexes a particular member of the family. For example, if \(q\) is Gaussian, then \(\lambda\) consists of the mean and variance of the latent state for each data point, \(\lambda_{x_i} = (\mu_{x_i}, \sigma_{x_i}^2)\).

How do we know how well the variational distribution \(q(z|x)\) approximates the true posterior \(p(z|x)\)? We can measure this with the KL divergence:

\[KL\left(q_{\lambda}(z|x)||p(z|x)\right) = \\ \mathbb{E}_q[\log q_{\lambda}(z|x)] - \mathbb{E}_q[\log p(x,z)] + \log p(x) \]

Our goal is to find the variational parameters \(\lambda\) that minimize this KL divergence. The optimal approximate posterior can then be written as:

\[q_{\lambda^*}(z|x)=\arg\min_{\lambda}KL\left(q_{\lambda}(z|x)||p(z|x)\right) \]

But this still cannot be computed directly, because it still involves \(p(x)\), so we need one more step. Introduce the following function:

\[ELBO(\lambda)= \mathbb{E}_q[\log p(x,z)] - \mathbb{E}_q[\log q_{\lambda}(z|x)] \]

Combining the ELBO with the KL divergence formula above, we obtain:

\[\log p(x)= ELBO(\lambda) + KL\left(q_{\lambda}(z|x)||p(z|x)\right) \]
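To see where this identity comes from, expand the KL divergence using \(p(z|x) = p(x,z)/p(x)\) and note that \(\log p(x)\) does not depend on \(z\):

\[KL\left(q_{\lambda}(z|x)||p(z|x)\right) = \mathbb{E}_q[\log q_{\lambda}(z|x)] - \mathbb{E}_q[\log p(x,z)] + \log p(x) = \log p(x) - ELBO(\lambda) \]

Rearranging the last equality gives the decomposition of \(\log p(x)\) above.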

Since the KL divergence is always greater than or equal to zero, minimizing the KL divergence is equivalent to maximizing the ELBO. The ELBO (Evidence Lower BOund) lets us do approximate posterior inference without the intractable KL minimization: instead of minimizing the KL divergence, we maximize the ELBO, which is computationally tractable.

In the variational autoencoder model, the latent state \(z\) of each data point is independent, so the ELBO decomposes into a sum of terms, one per data point. This allows us to use stochastic gradient descent to update the shared parameters \(\lambda\). The ELBO for a single data point is:

\[ELBO_i(\lambda)=\mathbb{E}_{q_{\lambda}(z|x_i)}[\log p(x_i|z)] - KL(q_{\lambda}(z|x_i)||p(z)) \]

Now we can connect this back to neural networks. We parametrize the approximate posterior \(q_{\theta}(z|x,\lambda)\) with an inference network (the encoder), which takes data \(x\) as input and outputs the parameters \(\lambda\). We parametrize the likelihood \(p(x|z)\) with a generative network (the decoder), which takes latent states as input and outputs the data distribution \(p_{\phi}(x|z)\). Here \(\theta\) and \(\phi\) are the parameters of the inference network and the generative network. With these networks, we can rewrite the ELBO as:

\[ELBO_i(\theta,\phi)=\mathbb{E}_{q_{\theta}(z|x_i)}[\log p_{\phi}(x_i|z)] - KL(q_{\theta}(z|x_i)||p(z)) \]

Notice that \(ELBO_i(\theta,\phi)\) differs from the loss function in the neural network perspective only by a sign, that is, \(ELBO_i(\theta,\phi) = -l_i(\theta,\phi)\). We can still read the KL divergence as a regularizer and the expected log-likelihood as a reconstruction term. But the probability model makes the meaning of these terms explicit: we are minimizing the KL divergence between the approximate posterior \(q_{\lambda}(z|x)\) and the model posterior \(p(z|x)\).
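Putting the two perspectives together, here is a sketch of one training step, reusing the hypothetical `Encoder`, `Decoder`, and `vae_loss` from the earlier sketches; the latent sample uses the re-parameterization trick explained in the last section.

```python
import torch

def training_step(encoder, decoder, optimizer, x):
    """One stochastic gradient step on the negative ELBO (i.e. on the sum of l_i)."""
    mu, logvar = encoder(x)                                   # inference network q_theta(z|x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # re-parameterized sample
    logits = decoder(z)                                       # generative network p_phi(x|z)
    loss = vae_loss(x, mu, logvar, logits)                    # negative ELBO from the earlier sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```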

What about the model parameters? We have glossed over them, but they are important. The term "variational inference" usually refers to maximizing the ELBO with respect to the variational parameters \(\lambda\). We can also maximize the ELBO with respect to the model parameters \(\phi\). This technique is called variational EM (expectation maximization), because we maximize the expected log-likelihood of the data with respect to the model parameters.

That's it. We have followed the recipe of variational inference and defined:

  • a probability model \(p\) of latent variables and data
  • a variational family \(q\) over the latent variables to approximate the posterior

We then use the variational inference algorithm to learn the variational parameters (gradient ascent on the ELBO with respect to \(\lambda\)), and the variational EM algorithm to learn the model parameters (gradient ascent on the ELBO with respect to \(\phi\)).

Experiments

Now we can run some experiments with the model. There are two ways to measure progress: sampling from the prior or sampling from the posterior. To better interpret the learned latent space, we can also visualize the approximate posterior distribution of the latent variables, \(q_{\lambda}(z|x)\).
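As a sketch of these two checks with the hypothetical networks from earlier (not the code in the repository linked below): decode samples drawn from the prior, and use the approximate posterior means to place data points in the latent space for visualization.

```python
import torch

def sample_from_prior(decoder, n=16, latent_dim=2):
    """Decode z ~ p(z) = Normal(0, I) into pixel probabilities."""
    z = torch.randn(n, latent_dim)
    return torch.sigmoid(decoder(z))             # (n, 784) generated images

def posterior_embedding(encoder, x):
    """Use the mean of q_lambda(z|x) as a low-dimensional embedding of x."""
    mu, _ = encoder(x)
    return mu                                    # (batch, latent_dim) points to plot
```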

Code can be found in the author's repository: https://github.com/altosaar/variational-autoencoder.

Mean-field versus amortized inference

This topic was confusing for me, and it may be even more confusing for people coming from a deep learning background. In deep learning we think about inputs and outputs, encoders and decoders, and loss functions. When learning probabilistic modeling, this can lead to vague, imprecise concepts.

Let's discuss the difference between mean-field inference and amortized inference. This is a choice we face when doing approximate inference to estimate the posterior distribution over latent variables. It depends on several questions: Do we have a lot of data? Do we have a lot of computing power? Are the latent variables local to each data point, or shared globally?

Mean-field variational inference refers to inference over \(N\) data points with distributions that share no parameters:

\[q(z)=\prod_i^N q(z_i;\lambda_i) \]

This means each data point has its own free parameters \(\lambda_i\) (for example, for a Gaussian latent variable, \(\lambda_i = (\mu_i, \sigma_i)\)). For a new data point, we need to maximize the ELBO with respect to its mean-field parameters \(\lambda_i\).

Amortized inference means "amortizing" the cost of inference across data points. One way to do this is to share the variational parameters \(\lambda\) across data points. For example, in the variational autoencoder, the parameters \(\theta\) of the inference network are global parameters shared across all data points. If we see a new data point and want its approximate posterior \(q(z_i)\), we can either rerun variational inference (maximizing the ELBO until convergence) or simply reuse the shared parameters. This can be an advantage over mean-field variational inference.
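To make the contrast concrete, here is a rough sketch (my own illustration, with a Gaussian \(q\) and the decoder sketched earlier): mean-field inference fits fresh parameters \(\lambda_i = (\mu_i, \log\sigma_i^2)\) for each data point by gradient steps on the ELBO, while amortized inference is a single forward pass through the shared encoder.

```python
import torch
import torch.nn.functional as F

def mean_field_posterior(decoder, x_i, latent_dim=2, steps=200, lr=1e-2):
    """Fit free per-data-point parameters lambda_i by maximizing ELBO_i for one x_i."""
    mu = torch.zeros(latent_dim, requires_grad=True)
    logvar = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([mu, logvar], lr=lr)
    for _ in range(steps):
        z = mu + torch.exp(0.5 * logvar) * torch.randn(latent_dim)   # re-parameterized sample
        logits = decoder(z.unsqueeze(0)).squeeze(0)
        recon = F.binary_cross_entropy_with_logits(logits, x_i, reduction="sum")
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar)
        loss = recon + kl                                            # negative ELBO_i
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mu.detach(), logvar.detach()

def amortized_posterior(encoder, x_i):
    """Amortized inference: reuse the shared encoder, one forward pass."""
    return encoder(x_i.unsqueeze(0))             # (mu, logvar) for this data point
```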

Which one is more flexible? Mean-field variational inference is strictly more expressive, because it shares no parameters: the independent per-data-point parameters \(\lambda_i\) can yield the most accurate approximate posterior. Amortized inference, on the other hand, may be limited by the capacity of the chosen family of distributions or by the parameter sharing across data points (for example, a neural network whose weights and biases are shared across the data).

The re-parameterization trick

The last thing needed to implement a variational autoencoder is how to take derivatives with respect to the parameters of a random variable. Given \(z\) drawn from the distribution \(q_{\theta}(z|x)\), how do we take derivatives of a function of \(z\) with respect to \(\theta\)? As it stands, the sampling step is not differentiable, so we cannot backpropagate through the model.

For some distributions, the sample can be cleverly re-parameterized. For example, for a normally distributed variable with mean \(\mu\) and standard deviation \(\sigma\), we can obtain a sample as:

\[z = \mu + \sigma \odot \epsilon \]

where \(\epsilon \sim \text{Normal}(0,1)\). We sample from a Gaussian with mean 0 and standard deviation 1, then shift and scale it to get \(z\). Going from \(\epsilon\) to \(z\) then involves only linear operations (shifting and scaling), and the sampling operation sits outside the network in the computation graph.

The figure in the original post shows the re-parameterized form: round nodes are stochastic and diamond nodes are deterministic.
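A minimal code sketch of this trick (PyTorch assumed, as in the earlier sketches): the randomness enters only through \(\epsilon\), so gradients flow through \(\mu\) and \(\sigma\).

```python
import torch

def reparameterized_sample(mu, logvar):
    """Draw z ~ Normal(mu, sigma^2) as z = mu + sigma * epsilon with epsilon ~ Normal(0, 1)."""
    sigma = torch.exp(0.5 * logvar)              # standard deviation
    epsilon = torch.randn_like(sigma)            # stochastic node, outside the gradient path
    return mu + sigma * epsilon                  # shift and scale: differentiable in mu, logvar
```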

Given my limited knowledge, this text will inevitably contain some errors; corrections and discussion are welcome!


Origin: www.cnblogs.com/weilonghu/p/12567793.html