Machine Learning Whiteboard Derivation Series (12) Notes: Variational Inference


0 Notes

These notes follow [Machine Learning] [Whiteboard Derivation Series] [Collection 1~23]. While studying I work through the derivations on paper along with the UP master; this blog is a second, written-up arrangement of those notes, with extra material added where my own study needs it.

Note: These notes are mainly for my own later review, and I do type every word and formula myself. For complicated formulas, since I have not learned LaTeX, I upload a handwritten picture instead (the phone photo may not be perfectly clear, but I will do my best to keep the content fully legible). For that reason I mark the blog as [original]; if you think this is inappropriate, please send me a private message and, based on your reply, I will decide whether to make the post visible only to you or handle it some other way. Thank you!

This blog covers (Series 12); the corresponding videos are: [(Series 12) Variational Inference 1 - Background Introduction], [(Series 12) Variational Inference 2 - Formula Derivation], [(Series 12) Variational Inference 3 - Looking Back], [(Series 12) Variational Inference 4 - SGVI-1], [(Series 12) Variational Inference 5 - SGVI-2].

The text starts below.


1 Background introduction

1.1 The frequentist school

Studying machine learning algorithms from the frequentist perspective ultimately turns them into optimization problems. Linear regression and SVM, both discussed earlier, illustrate why.

Look at linear regression first. Assume the data set D contains N samples, D = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, with x_i ∈ R^p and y_i ∈ R for each sample, i = 1, …, N. Construct two matrices X and Y: X = (x_1, x_2, …, x_N)^T, an N×p matrix, and Y = (y_1, y_2, …, y_N)^T, an N×1 matrix. The setup is as follows:
(handwritten formula image)
Look at SVM again:
(handwritten formula image)
Sure enough, both turn into optimization problems.
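For reference, here is a plain-LaTeX sketch of the standard objectives that linear regression and the soft-margin SVM reduce to (my own restatement of the usual least-squares and SVM problems, so the notation may differ from the handwritten pictures):

```latex
% Linear regression by least squares: frequentist estimation becomes an optimization problem
\hat{W} = \arg\min_{W} \sum_{i=1}^{N} \left( W^{T} x_i - y_i \right)^2

% Soft-margin SVM: a constrained optimization problem
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{N}\xi_i
\quad \text{s.t.}\quad y_i\left( w^{T}x_i + b \right) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\dots,N
```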

1.2 Bayesian

Studying machine learning algorithms from the Bayesian perspective ultimately turns them into integration problems.
(handwritten formula image)
Isn't the term on the right-hand side of the picture above exactly an integral?

Bayesian decision making means making predictions, and a prediction looks like this:
(handwritten formula image)
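In plain LaTeX, the two integrals the pictures above refer to are presumably the standard posterior and prediction formulas (stated from memory of the standard forms, not copied from the images):

```latex
% Posterior: the denominator is an integral over the parameter theta
p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}

% Bayesian prediction for a new point \tilde{x}: another integral, now against the posterior
p(\tilde{x} \mid X) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid X)\, d\theta
```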
Bayesian inference means finding the posterior probability. It splits into exact inference and approximate inference, and the latter further splits into deterministic approximate inference and stochastic approximate inference.

The topic of this post, Variational Inference (VI), belongs to deterministic approximate inference.


2 Formula derivation

Suppose X is the observed data and Z collects the latent variables and parameters; (X, Z) is called the complete data. Then log p(X) can be written as:
(handwritten formula image)
That is, log p(x) = ELBO + KL[q(z)||p(z|x)]. Writing the ELBO as L(q(z)) gives log p(x) = L(q(z)) + KL[q(z)||p(z|x)]; L(q(z)) is the variational term, i.e. the evidence lower bound.
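A plain-LaTeX reconstruction of this decomposition (the standard derivation; the handwritten picture above presumably shows the same steps):

```latex
\log p(x) = \log p(x, z) - \log p(z \mid x)
          = \log \frac{p(x, z)}{q(z)} - \log \frac{p(z \mid x)}{q(z)}

% Take the expectation of both sides under q(z); the left side does not depend on z:
\log p(x)
  = \underbrace{\int_{z} q(z)\,\log \frac{p(x, z)}{q(z)}\, dz}_{\text{ELBO} \,=\, L(q(z))}
  + \underbrace{\int_{z} q(z)\,\log \frac{q(z)}{p(z \mid x)}\, dz}_{KL\left[\, q(z)\,\|\, p(z \mid x) \,\right]}
```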

When q(z) → p(z|x), i.e. the closer q(z) gets to p(z|x), the closer KL[q(z)||p(z|x)] gets to 0. In log p(x) = L(q(z)) + KL[q(z)||p(z|x)], if x is fixed then log p(x) is fixed, i.e. the left-hand side is fixed. What we need to do now is find a q(z) that is close to p(z|x), so:
(handwritten formula image)
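Since KL ≥ 0 and the left-hand side is fixed for a given x, maximizing the ELBO is the same as pushing q(z) toward p(z|x); the picture presumably states:

```latex
\hat{q}(z) = \arg\max_{q(z)} L(q(z)) \quad\Longrightarrow\quad \hat{q}(z) \approx p(z \mid x)
```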
Now suppose z is split into M groups that are mutually independent (the mean-field assumption); then q(z) factorizes as:
(handwritten formula image)
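Written out, the mean-field factorization the picture shows should be:

```latex
q(z) = \prod_{i=1}^{M} q_i(z_i)
```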
The ELBO L(q(z)) splits into two terms:
(handwritten formula image)
so ELBO = L(q(z)) = ①-②. Assume that among the M groups, q(z_i) for groups 1, 2, …, j-1, j+1, …, M has already been fixed, i.e. i = 1, 2, …, j-1, j+1, …, M, and what we need to solve for is q(z_j) of the j-th group. Terms ① and ② are now treated separately; first, term ①:
(handwritten formula image)
Next, term ②:
(handwritten formula image)
The formula above is the simplified form of term ②; look first at its first term:
(handwritten formula image)
So the simplification of ② in the previous picture can be reduced further to:
(handwritten formula image)
Then ELBO = L(q(z)) = ①-② becomes:
(handwritten formula image)
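For reference, the standard conclusion this derivation reaches is the classical mean-field coordinate update (the pictures presumably carry the same result, up to notation):

```latex
% With q_i(z_i), i \neq j, held fixed, the ELBO is maximized over q_j(z_j) by
\log q_j^{*}(z_j) = \mathbb{E}_{\prod_{i \neq j} q_i(z_i)}\!\left[ \log p(x, z) \right] + \text{const}
\quad\Longleftrightarrow\quad
q_j^{*}(z_j) \propto \exp\!\left( \mathbb{E}_{\prod_{i \neq j} q_i(z_i)}\!\left[ \log p(x, z) \right] \right)
```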


3 Symbol correction

This section revises some of the symbols used in Section 2 to avoid confusing variable names.

X = {x^(1), x^(2), …, x^(N)} are the N samples, with x^(i) the i-th sample; Z = {z^(1), z^(2), …, z^(N)} are the N latent variables, with z^(i) the i-th. Suppose x is the observed variable, z is the latent variable, and θ is the parameter, with x ∈ R^p (x_i is the i-th dimension of a sample) and z ∈ R^p (z_i is the i-th dimension of the latent variable), and let q(z) be the distribution of z. Then log p_θ(X) can be written as:
(handwritten formula image)
where log p_θ(x^(i)) is:
(handwritten formula image)
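In plain LaTeX, treating the N samples as i.i.d., the two pictures above presumably state:

```latex
\log p_{\theta}(X) = \sum_{i=1}^{N} \log p_{\theta}\!\left(x^{(i)}\right),
\qquad
\log p_{\theta}\!\left(x^{(i)}\right)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[ \log \frac{p_{\theta}(x^{(i)}, z)}{q(z)} \right]}_{\text{ELBO}}
  + KL\!\left[\, q(z) \,\|\, p_{\theta}\!\left(z \mid x^{(i)}\right) \right]
```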


4 SGVI

SGVI stands for Stochastic Gradient Variational Inference.

Suppose the latent variable z has distribution q(z) with distribution parameter φ, and write it as q_φ(z). Then the ELBO from the last picture of Section [3 Symbol correction] becomes:
(handwritten formula image)
Let ELBO be L(φ), then:
(handwritten formula image)
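Written out (and consistent with the f(φ, z) defined later in this section), the ELBO as a function of φ should read:

```latex
L(\varphi) = \mathbb{E}_{q_{\varphi}(z)}\!\left[ \log p_{\theta}\!\left(x^{(i)}, z\right) - \log q_{\varphi}(z) \right]
```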
Now take the derivative of L(φ) with respect to φ, i.e. compute the gradient ∇_φ L(φ):
(handwritten formula image)
From the figure above, ∇_φ L(φ) = ① + ②. First look at term ②:
(handwritten formula image)
That is, term ② vanishes, so ∇_φ L(φ) = ① + ② = ①. Also, because:
(handwritten formula image)
now look at term ① again:
(handwritten formula image)
That is, ∇_φ L(φ) = E_{q_φ(z)}{ ∇_φ[log q_φ(z)] · [log p_θ(x^(i), z) - log q_φ(z)] }.
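To make this expectation concrete, here is a minimal numerical sketch of the score-function (REINFORCE-style) Monte Carlo estimator that the formula suggests. The 1-D toy model, the Gaussian choice of q_φ(z), and all names below are my own illustrative assumptions, not from the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative assumption): p(z) = N(0, 1), p(x | z) = N(z, 1), one observed point x_i = 2.0
x_i = 2.0

def log_joint(z):
    """log p_theta(x_i, z) = log p(x_i | z) + log p(z) for the toy Gaussian model."""
    return -0.5 * (x_i - z) ** 2 - 0.5 * z ** 2 - np.log(2.0 * np.pi)

# Variational family (illustrative assumption): q_phi(z) = N(mu, sigma^2), phi = (mu, log_sigma)
def log_q(z, mu, log_sigma):
    sigma = np.exp(log_sigma)
    return -0.5 * ((z - mu) / sigma) ** 2 - log_sigma - 0.5 * np.log(2.0 * np.pi)

def grad_log_q(z, mu, log_sigma):
    """Score function: grad_phi log q_phi(z), for phi = (mu, log_sigma)."""
    sigma = np.exp(log_sigma)
    d_mu = (z - mu) / sigma ** 2
    d_log_sigma = ((z - mu) / sigma) ** 2 - 1.0
    return np.array([d_mu, d_log_sigma])

def score_function_gradient(mu, log_sigma, n_samples=5000):
    """Monte Carlo estimate of
    grad_phi L(phi) = E_q[ grad_phi log q_phi(z) * (log p_theta(x_i, z) - log q_phi(z)) ]."""
    z = rng.normal(mu, np.exp(log_sigma), size=n_samples)
    weights = log_joint(z) - log_q(z, mu, log_sigma)      # shape (n_samples,)
    scores = grad_log_q(z, mu, log_sigma).T               # shape (n_samples, 2)
    return (scores * weights[:, None]).mean(axis=0)       # average over samples

print(score_function_gradient(mu=0.0, log_sigma=0.0))
```

This estimator is unbiased but typically noisy (high variance), which is exactly what motivates the reparameterization trick introduced next.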

I don't fully understand the content below; I am simply transcribing it from the UP master's video.

Now apply the reparameterization trick. Suppose z = g_φ(ε, x^(i)), where ε ~ p(ε) and z ~ q_φ(z|x^(i)); then:
(handwritten formula image)
We have:
(handwritten formula image)
so ∇_φ L(φ) is:
(handwritten formula image)
Set f(φ, z) = log p_θ(x^(i), z) - log q_φ(z); by the chain rule:
(handwritten formula image)
Continuing, ∇_φ L(φ) simplifies to:
(handwritten formula image)
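For reference, the reparameterized gradient that this simplification leads to is usually written, together with its Monte Carlo approximation, as follows (standard form; the picture above presumably matches up to notation):

```latex
\nabla_{\varphi} L(\varphi)
  = \mathbb{E}_{p(\varepsilon)}\!\left[ \nabla_{z} f(\varphi, z)\big|_{z = g_{\varphi}(\varepsilon,\, x^{(i)})}
      \cdot \nabla_{\varphi}\, g_{\varphi}\!\left(\varepsilon, x^{(i)}\right) \right]
  \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_{z} f\!\left(\varphi, z^{(s)}\right)
      \cdot \nabla_{\varphi}\, g_{\varphi}\!\left(\varepsilon^{(s)}, x^{(i)}\right),
\quad \varepsilon^{(s)} \sim p(\varepsilon),\ z^{(s)} = g_{\varphi}\!\left(\varepsilon^{(s)}, x^{(i)}\right)
```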
After that, the MCMC method still needs to be applied; that part will be added in a later update.
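In the meantime, here is a minimal sketch of how the expectation over p(ε) can be approximated by sampling, using the reparameterization z = g_φ(ε, x^(i)) = μ + σε for a Gaussian q_φ. As with the earlier sketch, the toy model and all names are my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy model as the earlier sketch (illustrative assumption): p(z) = N(0, 1), p(x | z) = N(z, 1), x_i = 2.0
x_i = 2.0

def grad_z_f(z, mu, log_sigma):
    """grad_z f(phi, z), where f = log p_theta(x_i, z) - log q_phi(z) and q_phi(z) = N(mu, sigma^2)."""
    sigma = np.exp(log_sigma)
    grad_log_joint = (x_i - z) - z               # d/dz of -0.5*(x_i - z)^2 - 0.5*z^2
    grad_log_q = -(z - mu) / sigma ** 2          # d/dz of log N(z; mu, sigma^2)
    return grad_log_joint - grad_log_q

def reparam_gradient(mu, log_sigma, n_samples=5000):
    """Pathwise Monte Carlo estimate of grad_phi L(phi), phi = (mu, log_sigma),
    using z = g_phi(eps) = mu + sigma * eps with eps ~ N(0, 1).
    The direct d/dphi of -log q_phi(z) (with z held fixed) has zero expectation,
    so it is dropped here, matching the formula above."""
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n_samples)
    z = mu + sigma * eps                         # z = g_phi(eps, x_i)
    gz = grad_z_f(z, mu, log_sigma)
    d_mu = gz * 1.0                              # dz/dmu = 1
    d_log_sigma = gz * sigma * eps               # dz/d(log sigma) = sigma * eps
    return np.array([d_mu.mean(), d_log_sigma.mean()])

print(reparam_gradient(mu=0.0, log_sigma=0.0))
```

In practice this pathwise estimator usually has much lower variance than the score-function version, which is why SGVI-style methods rely on the reparameterization trick.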


END

Source: blog.csdn.net/qq_40061206/article/details/113859304