Machine Learning: Understanding the Gaussian Mixture Model (GMM) under a Unified Framework



I. Introduction

  1. This blog records only my own point of view and thought process. You are welcome to point out blind spots in my thinking, but I hope each of us can form our own understanding.
  2. This post draws on a lot of material from the web, especially the Bilibili videos of the uploader shuhuai008, whose way of explaining is my favorite: understanding the same problem from multiple perspectives.

II. Understanding

A unified framework for machine learning:

1. Model
2. Strategy (Loss)
3. Algorithm


Model

A digression: the so-called model is the product of the modeling process, i.e. our assumption about the observed data. For example, the SVM and LR introduced earlier both make the assumption that the data can be separated by a hyperplane. The model therefore embodies our inductive bias, or inductive preference.

Geometric perspective:

For the observed data \(X = \{x^1, x^2, \cdots, x^n\}\), a generative probabilistic model assumes they are generated from \(P(X|\theta)\), where \(\theta\) denotes the parameters of the model. Since we simply do not know what form \(P(X|\theta)\) should take, the Gaussian mixture model assumes (inductive preference) that the data are generated by a mixture of K Gaussian models, i.e. \(P(X)\) is a superposition of K Gaussians.

The probability density function is a superposition (weighted average) of several Gaussian densities:
\[p(x) = \sum_{k=1}^{K} \alpha_k\, p(x|\mu_k,\Sigma_k), \;\; \sum_{k=1}^K \alpha_k = 1, \;\; \alpha_k \; \text{denotes a weight}\]

The geometric perspective takes \(p(x)\) directly as the model: a weighted sum of several Gaussians, handled by splitting the sum apart.
The data-generating perspective below instead looks at \(p(x)\) as a whole from the angle of how the data are generated, focusing on the generation process, i.e. on splitting apart the process that produces the data.
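
As a quick illustration of the geometric perspective, here is a minimal 1-D sketch (assuming numpy/scipy are available; the component parameters below are made up) that evaluates \(p(x)\) as a weighted sum of Gaussian densities:

```python
import numpy as np
from scipy.stats import norm

def gmm_density(x, alphas, mus, sigmas):
    """p(x) = sum_k alpha_k * N(x | mu_k, sigma_k): the weighted superposition above (1-D)."""
    return sum(a * norm.pdf(x, loc=m, scale=s)
               for a, m, s in zip(alphas, mus, sigmas))

# Two made-up components; the weights alpha_k must sum to 1
alphas = [0.3, 0.7]
mus    = [-2.0, 1.0]
sigmas = [0.5, 1.5]

print(gmm_density(0.0, alphas, mus, sigmas))
```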


Data-generating perspective

Introduce a latent variable z: z indicates which Gaussian distribution a sample x belongs to. z is a discrete random variable (it is hard to say with certainty which particular distribution a sample belongs to, so this is represented by a random variable).

The generation of a single sample can then be understood as: first draw z to decide which Gaussian component the sample belongs to, then generate the sample from that Gaussian distribution (random sampling).
For example, imagine a biased die: first roll the die to obtain a number, then generate a sample (random sampling) from the corresponding Gaussian model. The die determines the prior: which Gaussian model each sample is generated from.
In short: roll the die -> random sampling, or in mathematical language:
\[p(z=k) \rightarrow N(\mu_k, \Sigma_k)\]

\[
\begin{array}{c|cccc}
z & 1 & 2 & \cdots & K \\
\hline
p(z) & p_1 & p_2 & \cdots & p_K
\end{array}
\]

\[\sum_{k=1}^K p_k=1\]
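
The die-then-sample story can be written out as a short sketch (again a 1-D toy with made-up parameters, assuming numpy): roll the die to pick the component, then sample from that Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "die" probabilities p_k and one 1-D Gaussian per face
p      = np.array([0.3, 0.7])      # p(z = k), sums to 1
mus    = np.array([-2.0, 1.0])
sigmas = np.array([0.5, 1.5])

def sample_gmm(n):
    """First draw z (which Gaussian the sample belongs to), then draw x from N(mu_z, sigma_z)."""
    z = rng.choice(len(p), size=n, p=p)   # roll the die
    x = rng.normal(mus[z], sigmas[z])     # random sampling from the chosen Gaussian
    return x, z

x, z = sample_gmm(5)
print(z)  # which component each sample came from
print(x)  # the samples themselves
```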

Therefore, from this perspective the probability density function can be written as:
\[
\begin{aligned}
p(x) &= \int_{z} p(x,z)\,dz \\
&= \int_{z} p(x|z)p(z)\,dz \\
&= \sum_{z} p(x|z)p(z) \\
&= \sum_{k=1}^K p(x|z=k)p(z=k) \\
&= \sum_{k=1}^K p_k\, p(x|\mu_k,\Sigma_k)
\end{aligned}
\]

At this point it is clear that the two perspectives lead to exactly the same formula.

  • Geometric perspective: \(\alpha_k\) is the weight of the k-th Gaussian model in the superposition
  • Data-generating perspective: \(p_k\) is the prior probability that a sample belongs to the k-th Gaussian model

Use \(\theta\) to denote all the parameters: \(\theta = \{p_1, p_2, \cdots, p_K; \mu_1, \mu_2, \cdots, \mu_K; \Sigma_1, \Sigma_2, \cdots, \Sigma_K\}\)


Strategy

We want to find the set of parameters \(\theta\) under which the observed data have the highest probability of occurring; that \(\theta\) is taken as the model parameters.

The goal we want to achieve:
\[\max \; logP(X|\theta)\]

The actual optimization objective:
\[\theta^{(t+1)} = \arg \max_{\theta} \int_Z \; P(Z|X,\theta^{(t)}) \; logP(X,Z|\theta)\]

For how this formula is derived, see my earlier blog post on the different ways of deriving the EM algorithm and my understanding of them.


Algorithm

Maximum likelihood estimate

The model is now written down and the observed data are given, so first try to solve directly with maximum likelihood estimation:
\[
\begin{aligned}
\hat{\theta}_{MLE} &= \arg \max_{\theta} \; logP(X) \\
&= \arg \max_{\theta} \; log \prod_{i=1}^n p(x^i) \\
&= \arg \max_{\theta} \; \sum_{i=1}^n log \sum_{k=1}^K p(z^i=k)\, p(x^i|\mu_k,\Sigma_k)
\end{aligned}
\]

Because there is a sum inside the \(log\), the MLE has no analytical (closed-form) solution, so we resort to numerical methods: gradient descent or the EM algorithm.
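
To make the "sum inside the log" concrete, here is a minimal sketch of the objective \(logP(X|\theta)\) (1-D case, assuming numpy/scipy). Such a function can be handed to a gradient-based optimizer, since no closed-form maximizer exists:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_log_likelihood(x, p, mus, sigmas):
    """logP(X|theta) = sum_i log sum_k p_k N(x^i | mu_k, sigma_k), 1-D case; x is an array of samples."""
    # log_comp[i, k] = log p_k + log N(x^i | mu_k, sigma_k)
    log_comp = np.log(p) + norm.logpdf(x[:, None], loc=mus, scale=sigmas)
    # The sum over k sits inside the log, hence logsumexp per sample
    return logsumexp(log_comp, axis=1).sum()
```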


EM algorithm

\[\theta^{(t+1)} = \arg \max_{\theta} \int_Z \; P(Z|X,\theta^{(t)}) \; logP(X,Z|\theta)\]
where \(Q(\theta,\theta^{(t)}) = \int_Z \; P(Z|X,\theta^{(t)}) \; logP(X,Z|\theta)\) is the \(Q\) function commonly seen in the literature.

In fact, the strategy used by the EM algorithm is still maximum likelihood estimation; what changes is the object being maximized. The original method (algorithm) was fine, but it was aimed at the wrong target (the optimized object, the loss function, the function to be maximized).
It is like looking for a partner: the approach you use (clingy pursuit, tall-handsome-and-cold, domineering-CEO style, warm-and-caring style, honest-and-hardworking style, bohemian-prodigal style, and so on; think of these as personas) may not be wrong at all. If you cannot find a partner, you probably just targeted the wrong crowd. For example, courting a mature "royal sister" who wants a calmer, more settled life while playing the bohemian prodigal is destined to be very difficult. So to find a partner you can change things in one of two ways:

  • Keep the method / persona the same, i.e. keep using the bohemian-prodigal style (keep the algorithm, maximum likelihood estimation), but switch the target group, from the "royal sister" crowd to the naive, romantic young-girl crowd (change the optimization target, the optimized object or loss function)
  • Keep the target group the same, still courting the "royal sister" (the optimization objective is unchanged), but use a different method / change the persona, from bohemian prodigal to warm-and-caring or tall-handsome-and-cold (switch from maximum likelihood to gradient descent)

The method corresponds to the persona, and changing one's persona is hard, so for most people it is easier to find a target that matches their existing persona. But most people chase their ideal target, their dream lover, and then they do have to change their persona, which takes a great deal of effort.
More realistically, a given method can only optimize certain targets. For example, MLE (maximum likelihood estimation) cannot handle \(\arg \max_{\theta} \; logP(X)\) here, whereas GD (gradient descent) can, and it is applicable to essentially every target; in the real world this is simply the rich-and-handsome persona. In other words, some people never need to change their persona to attract many different crowds (different optimization objectives), because that persona (that optimizer) works everywhere.


The explanations of the EM algorithm available online basically stop here; below, the formula is expanded and the derivation is worked out in detail.

Expansion 1:

\(\sum_{k=1}^K p(z^i=k|x^i,\theta^{(t)}) = 1\). This identity will be used repeatedly below; if it is not obvious why it equals 1, read it a few more times.

\[ \begin{aligned} Q(\theta,\theta^{(t)}) &= \int_Z \; P(Z|X,\theta^{(t)})\;logP(X,Z|\theta)\\ &=\sum_{Z} \{\prod_{j=1}^n p(z^j|x^j,\theta^{(t)}) \sum_{i=1}^n log \;p(x^i,z^i|\theta) \}\\ &=\sum_{z^1,\cdots,z^n} \{\prod_{j=1}^n p(z^j|x^j,\theta^{(t)}) \sum_{i=1}^n log \;p(x^i,z^i|\theta) \}\\ &=\sum_{z^1,\cdots,z^n}\{log\;p(x^1,z^1|\theta) \prod_{j=1}^n p(z^j|x^j,\theta^{(t)})+\cdots+ log\;p(x^n,z^n|\theta) \prod_{j=1}^n p(z^j|x^j,\theta^{(t)}) \}\\ &=\sum_{i=1}^n \{[\sum_{k=1}^K log\;p(x^i,z^i=k|\theta)\;p(z^i=k|x^i,\theta^{(t)})]\prod_{j\neq i}[\sum_{k=1}^K p(z^j=k|x^j,\theta^{(t)})]\}\\ &=\sum_{i=1}^n \sum_{k=1}^K log\;p(x^i,z^i=k|\theta)\;p(z^i=k|x^i,\theta^{(t)}) \end{aligned} \]


Expansion 2: derive by expanding the original expression

When the EM algorithm is applied to the GMM, we take \(q(z) = p(z|x,\theta^{(t)})\)

\[ \begin{aligned} log P(X|\theta) &= \sum_{i=1}^n log\;p(x^i|\theta)=\sum_{i=1}^n log \int_{z^i}p(x^i,z^i|\theta)dz^i\\ &=\sum_{i=1}^n log \int_{z^i} \frac{p(x^i,z^i|\theta)}{q(z^i)} q(z^i)dz^i\\ &=\sum_{i=1}^n log \;E_{q(z^i)}[ \frac{p(x^i,z^i|\theta)}{q(z^i)}]\\ & \geq \sum_{i=1}^nE_{q(z^i)}[log \frac{p(x^i,z^i|\theta)}{q(z^i)}]\\ & = \sum_{i=1}^n \sum_{k=1}^K {q(z^i=k)} log \frac{p(x^i,z^i=k|\theta)}{q(z^i=k)}\\ & = \sum_{i=1}^n \sum_{k=1}^K {q(z^i=k)} log {p(x^i,z^i=k|\theta)}-\sum_{i=1}^n \sum_{k=1}^K {q(z^i=k)} log\; {q(z^i=k)}\\ & = \sum_{i=1}^n \sum_{k=1}^K {q(z^i=k)} log {p(x^i,z^i=k|\theta)}\\ & = \sum_{i=1}^n \sum_{k=1}^K {p(z^i=k|x^i,\theta^{(t)})}\; log \;{p(x^i,z^i=k|\theta)}\\ \end{aligned} \]


The term \(-\sum_{i=1}^n \sum_{k=1}^K q(z^i=k)\, log\; q(z^i=k)\) in the expansion above does not depend on \(\theta\) and can be dropped when maximizing, so we now obtain
\[ \begin{aligned} Q(\theta,\theta^{(t)})&=\sum_{i=1}^n \sum_{k=1}^K {p(z^i=k|x^i,\theta^{(t)})}\; log \;{p(x^i,z^i=k|\theta)}\\ &=\sum_{i=1}^n \sum_{k=1}^K \frac{p(x^i|z^i=k,\theta^{(t)})\,p(z^i=k|\theta^{(t)})}{\sum_{j=1}^K p(x^i|z^i=j,\theta^{(t)})\,p(z^i=j|\theta^{(t)})}\; log \;[{p(x^i|z^i=k,\theta)\,p(z^i=k|\theta)}]\\ &= \sum_{i=1}^n \sum_{k=1}^K \frac {N(x^i|\mu_k^{(t)},\Sigma_k^{(t)})\,p_k^{(t)}}{\sum_{j=1}^K N(x^i|\mu_j^{(t)},\Sigma_j^{(t)})\,p_j^{(t)} }\; log\;[N(x^i|\mu_k,\Sigma_k)\;p_k] \end{aligned} \]
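
The posterior weights \(p(z^i=k|x^i,\theta^{(t)})\) appearing in \(Q(\theta,\theta^{(t)})\) are exactly what the E-step computes. A minimal 1-D sketch (assuming numpy/scipy; `gamma` is just a conventional name for these responsibilities):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, p, mus, sigmas):
    """gamma[i, k] = p(z^i = k | x^i, theta^(t)) for a 1-D GMM."""
    # Numerator: p_k^(t) * N(x^i | mu_k^(t), sigma_k^(t))
    weighted = p * norm.pdf(x[:, None], loc=mus, scale=sigmas)
    # Denominator: sum over components j, so each row sums to 1
    return weighted / weighted.sum(axis=1, keepdims=True)
```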

Use maximum likelihood estimation to solve for \(p_k\):
\[\arg \max_{p} \sum_{i=1}^n \sum_{k=1}^K log \; p_k \; {p(z^i=k|x^i,\theta^{(t)})}, \;\; \sum_{k=1}^K p_k = 1\]

\[L(\lambda,p_k) =\sum_{i=1}^n \sum_{k=1}^K {p(z^i=k|x^i,\theta^{(t)})} log\;p_k+\lambda(1-\sum_{k=1}^Kp_k)\\ \]

\[ \begin{aligned} &\frac {\partial L}{\partial p_k} = \sum_{i=1}^n \frac{p(z^i=k|x^i,\theta^{(t)})}{p_k}-\lambda=0\\ &\sum_{i=1}^n {p(z^i=k|x^i,\theta^{(t)})}-\lambda\;p_k=0\\ &\sum_{i=1}^n\sum_{k=1}^K{p(z^i=k|x^i,\theta^{(t)})}=n\\ &\sum_{k=1}^K \lambda p_k=\lambda \;\;-->\lambda=n\\ &p_k = \frac{1}{n}\sum_{i=1}^n {p(z^i=k|x^i,\theta^{(t)})} \end{aligned} \]
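
The result of the Lagrange-multiplier calculation translates directly into the weight update of the M-step. A minimal sketch, reusing the responsibilities `gamma` from the E-step sketch above (the updates for \(\mu_k\) and \(\Sigma_k\) would come from the same \(Q\) function, but are not derived here):

```python
import numpy as np

def m_step_weights(gamma):
    """p_k = (1/n) * sum_i p(z^i = k | x^i, theta^(t)), with gamma[i, k] from the E-step."""
    n = gamma.shape[0]
    return gamma.sum(axis=0) / n
```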


The remaining derivations will be added later.

