Machine Learning Whiteboard Derivation Series (2) Notes: Gaussian Distribution and Probability


0 Notes

These notes follow [Machine Learning] [Whiteboard Derivation Series] [Collections 1~23]; while studying, I work through the derivations on paper along with the uploader. This blog is a second written pass over those notes, and depending on my own learning needs I may add extra material where necessary.

Note: these notes are mainly for my own future review, and I do type every word and formula myself. When I run into a complex formula, since I have not learned LaTeX, I upload a handwritten picture instead (the phone camera may not be perfectly clear, but I will try my best to keep the content fully legible). For this reason I mark the blog as [original]; if you think that is inappropriate, you can send me a private message and, based on your reply, I will decide whether to make the blog visible only to you or take some other action. Thank you!

This blog contains the notes for (Series 2); the corresponding videos are: [(Series 2) Mathematical Foundation - Probability - Gaussian Distribution 1 - Maximum Likelihood Estimation], [(Series 2) Mathematical Foundation - Probability - Gaussian Distribution 2 - Maximum Likelihood Estimation - Unbiased VS Biased], [(Series 2) Mathematical Foundation - Probability - Gaussian Distribution 3 - Observation from the Perspective of Probability Density], [(Series 2) Mathematical Foundation - Probability - Gaussian Distribution 4 - Limitations], [(Series 2) Mathematical Foundation - Probability - Gaussian Distribution 5 - Finding Marginal Probabilities and Conditional Probabilities], [(Series 2) Mathematical Foundation - Probability - Gaussian Distribution 6 - Finding the Joint Probability Distribution].

The text starts below.


1 Gaussian distribution

There are N samples in the data set X, and each sample has p dimensions. In symbols: X = (x_1, x_2, ..., x_N)^T, x_i ∈ R^p, i = 1...N, so X is an N×p matrix.

Assume the x_i are independently and identically distributed according to a high-dimensional (p-dimensional) Gaussian N(α, β), i.e., x_i ~ N(α, β), i = 1...N. The parameter is θ = (α, β), and the probability density function P(x) is:
Insert picture description here
For convenience of discussion, let p = 1 and θ = (μ, σ^2), i.e., [α = μ, β = σ^2]. Now x_i ~ N(μ, σ^2), i = 1...N, the expectation of x_i is E(x_i) = μ, and the distribution becomes a one-dimensional Gaussian (one-dimensional normal) distribution. The probability density function P(x) is:
Insert picture description here
From the picture in the section [2 Frequency school: θ is an unknown constant] of the post [Machine Learning - Whiteboard Derivation - Series (1) Notes: Frequency School / Bayesian School], we have:
Insert picture description here
Since θ = (μ, σ^2) here, finding θ_MLE comes down to finding [μ_MLE] and [σ_MLE].
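For reference, this is the quantity being maximized (the handwritten picture presumably expands the same expression); a sketch of the standard one-dimensional Gaussian log-likelihood under the i.i.d. assumption:

```latex
\log P(X \mid \theta)
  = \sum_{i=1}^{N} \log \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)
  = -\frac{N}{2}\log(2\pi) - N\log\sigma
    - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\mu)^2
```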

1.1 Finding μ_MLE

Insert picture description here
Then take the derivative of the above expression with respect to μ and set the derivative to 0:
Insert picture description here
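As a reference reconstruction of this step (standard result; the picture should arrive at the same place), setting the derivative of the log-likelihood with respect to μ to zero gives the sample mean:

```latex
\frac{\partial}{\partial\mu}\log P(X \mid \theta)
  = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\mu) = 0
  \;\Longrightarrow\;
  \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x_i
```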

1.2 Finding σ_MLE

Insert picture description here
Then take the derivative of the above expression with respect to σ and set the derivative to 0:
Insert picture description here
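Again as a reference reconstruction of the standard result (the picture presumably reaches the same conclusion):

```latex
\frac{\partial}{\partial\sigma}\log P(X \mid \theta)
  = -\frac{N}{\sigma} + \frac{1}{\sigma^{3}}\sum_{i=1}^{N}(x_i-\mu_{MLE})^{2} = 0
  \;\Longrightarrow\;
  \sigma^{2}_{MLE} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_{MLE})^{2}
```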


2 Biased and unbiased estimates

A biased estimator is one whose expectation deviates from the true value; an unbiased estimator is one whose expectation equals the true value. For example: let μ_1 be an estimator of μ; if its expectation satisfies E(μ_1) = μ, then μ_1 is an unbiased estimate of μ. Let σ²_1 be an estimator of σ²; if its expectation satisfies E(σ²_1) ≠ σ², then σ²_1 is a biased estimate of σ².

So the question is: what kind of estimates are the μ_MLE and σ²_MLE obtained in the previous section, [1 Gaussian distribution]?

2.1 μ_MLE is an unbiased estimate

Insert picture description here
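For reference, the standard argument (which the picture presumably carries out) is a one-line computation:

```latex
E(\mu_{MLE})
  = E\!\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)
  = \frac{1}{N}\sum_{i=1}^{N} E(x_i)
  = \frac{1}{N}\cdot N\mu
  = \mu
```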

2.2 σ²_MLE is a biased estimate

The first step is to simplify:
Insert picture description here
The second step is to judge:
Insert picture description here
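A small numerical check of this conclusion (my own addition, not part of the original notes; the values of μ, σ², N below are made up): on average the divide-by-N estimator should come out near (N-1)/N · σ², while the divide-by-(N-1) estimator should come out near σ².

```python
import numpy as np

# Empirical check that sigma^2_MLE (divide by N) is biased,
# while the divide-by-(N-1) estimator is unbiased.
rng = np.random.default_rng(0)
mu, sigma2, N, trials = 0.0, 4.0, 10, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_mle = samples.mean(axis=1, keepdims=True)

var_mle = ((samples - mu_mle) ** 2).mean(axis=1)                # divide by N
var_unbiased = ((samples - mu_mle) ** 2).sum(axis=1) / (N - 1)  # divide by N-1

print("average of sigma^2_MLE :", var_mle.mean())        # ≈ (N-1)/N * sigma2 = 3.6
print("average of unbiased est:", var_unbiased.mean())   # ≈ sigma2 = 4.0
print("(N-1)/N * sigma^2      :", (N - 1) / N * sigma2)
```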


3 Probability density function of Gaussian distribution

Now there are N samples in a data set X, and each sample has p dimensions. In symbols: X = (x_1, x_2, ..., x_N)^T, x_i ∈ R^p, i = 1...N.

x is a random variable (note: lowercase), itself a p-dimensional vector, x = (x_1, x_2, ..., x_p)^T. Assume x ~ N(μ, Σ). μ is the expectation of x, i.e., [E(x) = μ], so μ is also a p-dimensional vector, μ = (μ_1, μ_2, ..., μ_p)^T; Σ is the covariance matrix of x, a symmetric positive semi-definite matrix. The following figure shows the Σ matrix:
Insert picture description here
The following figure is the probability density function of the high-dimensional Gaussian distribution ([(x-μ)^T Σ^{-1} (x-μ)] is essentially a quadratic form, in general only semi-definite, but for convenience of discussion Σ is assumed positive definite below):
Insert picture description here
[(x-μ)^T Σ^{-1} (x-μ)] is the Mahalanobis distance between the vector x and μ; its dimensions are [(1×p)×(p×p)×(p×1) = 1], i.e., a scalar. When Σ is the p-dimensional identity matrix, the Mahalanobis distance reduces to the Euclidean distance. Next, perform an eigendecomposition (also called spectral decomposition) of Σ:
Insert picture description here
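For reference (the picture presumably shows the same decomposition), with U = (u_1, ..., u_p) orthogonal (U U^T = U^T U = I) and Λ = diag(λ_1, ..., λ_p), the decomposition and the resulting inverse are:

```latex
\Sigma = U\Lambda U^{T} = \sum_{i=1}^{p} u_i\,\lambda_i\,u_i^{T},
\qquad
\Sigma^{-1} = U\Lambda^{-1}U^{T} = \sum_{i=1}^{p} u_i\,\frac{1}{\lambda_i}\,u_i^{T}
```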
Substitute the decomposition of Σ obtained above into [(x-μ)^T Σ^{-1} (x-μ)]:
Insert picture description here
Use a little trick (according to the uploader, the scalar y_i is the projection of the vector (x−μ) onto the direction of the eigenvector u_i; my linear algebra is not very solid, so I don't fully understand this for now), as follows:
Insert picture description here
p is the dimension; let p = 2. For convenience of writing, let [Δ = (x-μ)^T Σ^{-1} (x-μ)]; then:
Insert picture description here
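A small numerical sketch of the same idea (my own addition; μ, Σ, x below are made-up example values): Δ computed directly and Δ computed through the spectral decomposition agree, and for p = 2 the level sets of Δ are exactly the ellipses discussed above.

```python
import numpy as np

# Mahalanobis distance Delta = (x - mu)^T Sigma^{-1} (x - mu),
# computed directly and via Sigma = U diag(lam) U^T.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])            # symmetric positive definite
x = np.array([2.5, 1.0])

d = x - mu
delta_direct = d @ np.linalg.inv(Sigma) @ d

lam, U = np.linalg.eigh(Sigma)            # columns of U are the eigenvectors u_i
y = U.T @ d                               # y_i = projection of (x - mu) onto u_i
delta_spectral = np.sum(y ** 2 / lam)     # sum_i y_i^2 / lambda_i

print(delta_direct, delta_spectral)       # the two values agree
```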


4 Limitations of Gaussian distribution

Insert picture description here


5 Solution of marginal probability and conditional probability

Now divide x into two parts: let x = (x_a, x_b)^T, where x_a is an m-dimensional vector, x_b is an n-dimensional vector, and m + n = p. It is not hard to see that the joint probability distribution of x_a and x_b is just the probability distribution of x.

Similarly, divide μ into two parts: let μ = (μ_a, μ_b)^T, where μ_a is an m-dimensional vector, μ_b is an n-dimensional vector, and m + n = p.

The Σ matrix is also divided into four blocks:
Insert picture description here
Since Σ is a symmetric matrix, Σ_ab^T = Σ_ba, Σ_aa^T = Σ_aa, and Σ_bb^T = Σ_bb.

The problem now is to solve for: ① the marginal probability distributions P(x_a) and P(x_b); ② the conditional probability distributions P(x_a|x_b) and P(x_b|x_a).

First a theorem: let x ~ N(μ, Σ) and y = Ax + B, where A is a matrix and B a vector; then y ~ N(Aμ + B, AΣA^T). Denote this theorem as * (it will be used below, so be sure to remember it).
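A quick simulation-based sanity check of theorem * (my own sketch; μ, Σ, A, B below are made-up example values):

```python
import numpy as np

# Theorem *: if x ~ N(mu, Sigma) and y = A x + B, then y ~ N(A mu + B, A Sigma A^T).
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
B = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + B

print("empirical mean :", y.mean(axis=0))       # ≈ A mu + B
print("theoretical    :", A @ mu + B)
print("empirical cov  :\n", np.cov(y.T))        # ≈ A Sigma A^T
print("theoretical    :\n", A @ Sigma @ A.T)
```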

Now start the solution.

5.1 Marginal probability distributions P(x_a) and P(x_b)

Insert picture description here
Then the marginal probability distributions P(x_a) and P(x_b) can be given by the probability density function of the corresponding Gaussian distribution.
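For reference, the result the derivation in the picture should arrive at (the standard one, obtained by applying theorem * with A = (I_m 0), B = 0, and with A = (0 I_n), B = 0, respectively):

```latex
x_a \sim N(\mu_a, \Sigma_{aa}),
\qquad
x_b \sim N(\mu_b, \Sigma_{bb})
```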

5.2 Conditional probability distributions P(x_a|x_b) and P(x_b|x_a)

Insert picture description here
Now another theorem for the Gaussian distribution: let x ~ N(μ, Σ); then Mx ⊥ Nx ⇔ MΣN^T = 0, where Mx ⊥ Nx means that Mx and Nx are mutually independent, M and N are both matrices, and Σ is still the block matrix above:
Insert picture description here
Denote the above theorem as ** (it will be used below, so be sure to remember it). The following proves the independence of x_{b·a} and x_a using theorem **:
Insert picture description here
Because MΣN^T = 0, x_{b·a} and x_a are mutually independent, and combining conditioning with independence gives [P(x_{b·a}|x_a) = P(x_{b·a})]. Continue the derivation below:
Insert picture description here
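For reference, the standard conditional distribution this derivation ends with, written with the quantities used in the pictures (x_{b·a} = x_b − Σ_ba Σ_aa^{-1} x_a, μ_{b·a} = μ_b − Σ_ba Σ_aa^{-1} μ_a, Σ_{bb·a} = Σ_bb − Σ_ba Σ_aa^{-1} Σ_ab):

```latex
x_b \mid x_a
  \sim N\!\left(\mu_{b\cdot a} + \Sigma_{ba}\Sigma_{aa}^{-1}x_a,\; \Sigma_{bb\cdot a}\right)
  = N\!\left(\mu_b + \Sigma_{ba}\Sigma_{aa}^{-1}(x_a-\mu_a),\;
             \Sigma_{bb} - \Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab}\right)
```

P(x_a|x_b) follows by swapping the roles of a and b.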


6 Solving the joint probability distribution

It is known that x ~ N(μ, Λ^{-1}), where Λ is called the precision matrix, i.e., the inverse of the covariance matrix. Also y = Ax + b + ε, where A and b are coefficients, ε ~ N(0, L^{-1}), and ε is independent of x; therefore y|x ~ N(Ax + b, L^{-1}). What we want now is: ① p(y); ② p(x|y).

6.1 Solving p(y)

Insert picture description here
Then p(y) can be given by the probability density function of the corresponding Gaussian distribution.
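For reference, the standard result the computation in the picture should give (using E(y) = A E(x) + b and Var(y) = A Var(x) A^T + Var(ε), since x and ε are independent):

```latex
y \sim N\!\left(A\mu + b,\; L^{-1} + A\Lambda^{-1}A^{T}\right)
```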

6.2 Solving p(x|y)

Insert picture description here
E(z) and Var(z) are calculated above; the joint probability distribution of x and y, i.e., the distribution of z, is then N(E(z), Var(z)).

In the section [5 Solution of marginal probability and conditional probability], with x = (x_a, x_b)^T, the distribution of x_a|x_b is:
Insert picture description here
where each symbol is defined as:
Insert picture description here
According to the above formula, x|y ~ N(μ_{x·y} + Σ_xy Σ_yy^{-1} y, Σ_{xx·y}); correspondingly, the symbols in the previous formula become:
Insert picture description here
Then p(x|y) can be given by the probability density function of the corresponding Gaussian distribution.
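A numerical sketch of the whole section 6.2 recipe (my own addition; Λ, A, b, L, μ and the observed y below are made-up example values): build the blocks of Var(z), then plug them into the conditional formula from section 5, here written in the equivalent form with (y − E(y)).

```python
import numpy as np

# Linear-Gaussian model: x ~ N(mu, inv(Lam)), y = A x + b + eps, eps ~ N(0, inv(L)).
mu = np.array([0.0, 1.0])
Lam = np.array([[2.0, 0.5],
                [0.5, 1.5]])      # precision matrix of x
A = np.array([[1.0, -1.0]])
b = np.array([0.5])
L = np.array([[4.0]])             # precision matrix of the noise eps

Sigma_xx = np.linalg.inv(Lam)                      # Var(x)
Sigma_xy = Sigma_xx @ A.T                          # Cov(x, y)
Sigma_yy = np.linalg.inv(L) + A @ Sigma_xx @ A.T   # Var(y)
mu_y = A @ mu + b                                  # E(y)

# Conditional formula from section 5 applied to z = (x, y):
y_obs = np.array([2.0])
cond_mean = mu + Sigma_xy @ np.linalg.solve(Sigma_yy, y_obs - mu_y)
cond_cov = Sigma_xx - Sigma_xy @ np.linalg.solve(Sigma_yy, Sigma_xy.T)

print("E(x | y)  :", cond_mean)
print("Var(x | y):\n", cond_cov)
```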


END
