Machine Learning: EM algorithm

EM algorithm

Types of estimation

Maximum likelihood estimation

Maximum Likelihood Estimation (MLE): using known sample results, infer backwards the parameter values that most likely (with the greatest probability) produced those results.

Put simply: we are given some data and assume it was drawn from a known family of distributions whose specific parameter values are unknown. In other words, the model is known but the parameters are not, and MLE is used to estimate the parameters of that model.

The goal of MLE is to find the set of model parameters that maximizes the probability of the observed data:
\[ argmax_θ\,P(X;θ) \]

MLE solution process

  • Write down the likelihood function (i.e., the joint probability of the sample);

Likelihood function: with the sample fixed, the probability of that sample viewed as a function of the parameters θ.

  • Take the logarithm of the likelihood function and simplify;
  • Differentiate with respect to the parameters and set the derivative to zero;
  • Solve the resulting likelihood equation for the parameters.

\[ L(X;θ) → l(θ)=ln(L(X;θ)) → \frac{∂l}{∂θ}=0 \]

For example:

Suppose we have several boxes of black and white balls; the number of balls and the proportion of black to white in each box are unknown. From each box we draw 10 balls with replacement and record the results. Estimate the proportion of white balls in each box.

| Box | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | black | black | black | black | black | black | black | black | black | black |
| 2 | black | black | white | white | black | black | black | white | black | black |
| 3 | black | white | black | black | white | black | white | black | white | white |
| 4 | white | black | white | white | black | white | black | white | white | white |
| 5 | white | white | white | white | white | white | white | white | white | white |

Let p be the proportion of white balls, so the proportion of black balls is 1 − p. Then:
\[ L(X;p)=P(x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9,x_{10};p)=\prod_{i=1}^{10}{P(x_i;p)} \]

\[ l(p)=ln(L(X;p))=\sum_{i=1}^{10}{ln\,P(x_i;p)} \]

\[ \text{Set: } \frac{∂l}{∂p}=0 \]

Box 1 (all draws black):
\[ L(X;p)=(1-p)^{10} → l(p)=ln(L(X;p))=10ln(1-p) \]
Since \(0 \leq p \leq 1\), l is maximized at p = 0.

Box 2:
\[ L(X;p)=p^3(1-p)^{7} → l(p)=3lnp+7ln(1-p) → \frac{∂l}{∂p}=\frac{3}{p}-\frac{7}{1-p}=0 → p=0.3 \]

Box 3:
\[ L(X;p)=p^5(1-p)^{5} → l(p)=5lnp+5ln(1-p) → \frac{∂l}{∂p}=\frac{5}{p}-\frac{5}{1-p}=0 → p=0.5 \]

Box 4:
\[ L(X;p)=p^7(1-p)^{3} → l(p)=7lnp+3ln(1-p) → \frac{∂l}{∂p}=\frac{7}{p}-\frac{3}{1-p}=0 → p=0.7 \]

Box 5 (all draws white):
\[ L(X;p)=p^{10} → l(p)=ln(L(X;p))=10lnp \]
Since \(0 \leq p \leq 1\), l is maximized at p = 1.

| Box | P(white) | P(black) |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 0.3 | 0.7 |
| 3 | 0.5 | 0.5 |
| 4 | 0.7 | 0.3 |
| 5 | 1 | 0 |
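
As a quick sanity check, for draws with replacement the MLE of the white-ball proportion is simply the fraction of white balls in the sample. A minimal Python sketch (counts read off the table above):

```python
# White-ball counts out of 10 draws with replacement, read off the table above
white_counts = {1: 0, 2: 3, 3: 5, 4: 7, 5: 10}
n_draws = 10

for box, n_white in white_counts.items():
    p_hat = n_white / n_draws   # MLE of a binomial proportion: the root of dl/dp = 0
    print(f"box {box}: estimated P(white) = {p_hat}")
```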

Bayesian estimation

Bayesian estimation computes the posterior probability of an event from its prior probability and the observed sample (via Bayes' theorem).

Common concepts:

P(A): the prior (marginal) probability of event A;

P(A|B): the conditional probability that A occurs given that B has occurred, also called the posterior probability of A;

P(B|A): the conditional probability that B occurs given that A has occurred, also called the posterior probability of B;

P(B): the prior (marginal) probability of event B.

\[ P(AB)=P(A)P(B|A)=P(B)P(A|B)⇒P(A|B)=\frac{P(B|A)P(A)}{P(B)} \]

For a set of mutually exclusive events \(A_i\), the discrete form of Bayes' formula is:
\[ P(A_i|B)=\frac{P(B|A_i)P(A_i)}{\sum_{j}{P(B|A_j)P(A_j)}} \]

Another example:

There are five boxes, each containing black and white balls in the proportions below. One of the five boxes is chosen (uniformly at random) and two balls are drawn from it with replacement; both balls are white. Which box were the two balls most likely drawn from?

| Box | P(white) | P(black) |
|---|---|---|
| 1 | 0 | 1 |
| 2 | 0.3 | 0.7 |
| 3 | 0.5 | 0.5 |
| 4 | 0.7 | 0.3 |
| 5 | 1 | 0 |

If we used MLE here (treating the white-ball proportion p of the chosen box as the unknown), we would get:
\[ L(X;p)=p^2 ⇒ \hat{p}=1 \]
i.e., MLE would simply point to box 5.

Using Bayesian estimation instead, let B be the event "a white ball is drawn" and \(A_i\) the event "the balls are drawn from box i". Then:
\[ P(A_1|B)=\frac{P(A_1)P(B|A_1)}{P(B)}=\frac{0.2 \times 0^2}{0.2 \times 0^2+0.2 \times 0.3^2+0.2 \times 0.5^2+0.2 \times 0.7^2+0.2 \times 1^2}=\frac{0}{0.366}=0 \]

\[ P(A_2|B)=\frac{P(A_2)P(B|A_2)}{P(B)}=0.049~~~~~~~~~~~~~~~P(A_3|B)=\frac{P(A_3)P(B|A_3)}{P(B)}=0.137 \]

\[ P(A_4|B)=\frac{P(A_4)P(B|A_4)}{P(B)}=0.268~~~~~~~~~~~~~~~P(A_5|B)=\frac{P(A_5)P(B|A_5)}{P(B)}=0.546 \]

Box 5 is therefore the most likely source, but boxes 2-4 still have non-zero posterior probability.
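
The same posterior calculation as a short Python sketch (uniform prior of 0.2 per box; the likelihood of two white draws with replacement from box i is \(p_i^2\)):

```python
# White-ball proportions of the five boxes (from the table above)
p_white = [0.0, 0.3, 0.5, 0.7, 1.0]
prior = 0.2                                # each box is equally likely a priori

joint = [prior * p ** 2 for p in p_white]  # P(A_i) * P(B | A_i)
evidence = sum(joint)                      # P(B) = 0.366
posterior = [round(j / evidence, 3) for j in joint]
print(posterior)                           # [0.0, 0.049, 0.137, 0.268, 0.546]
```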

Maximum a posteriori (MAP) estimation

Maximum a posteriori (MAP) estimation, like maximum likelihood estimation (MLE), estimates the parameter θ from sample values.

MLE seeks the parameter θ that maximizes the likelihood P(x|θ); this is equivalent to assuming that all values of θ have equal prior probability.

MAP instead seeks the θ that maximizes P(x|θ)P(θ): it requires not only that θ gives a large likelihood, but also that the prior probability of θ itself is relatively large.

MAP can be viewed as Bayesian estimation with the denominator dropped:
\[ P(θ'|X)=\frac{P(θ')P(X|θ')}{P(X)} ⇒ argmax_{θ'}P(θ'|X)=argmax_{θ'}P(θ')P(X|θ') \]
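
As a simple illustration of the difference (not from the original text; it assumes a Bernoulli likelihood with a Beta(a, b) prior, whose posterior mode has a closed form):

```python
# Hypothetical example: k white balls in n draws, with a Beta(a, b) prior on p
k, n = 3, 10     # e.g. 3 white balls out of 10 draws
a, b = 5, 5      # prior pseudo-counts that pull the estimate toward 0.5

p_mle = k / n                           # argmax_p P(x | p)
p_map = (k + a - 1) / (n + a + b - 2)   # argmax_p P(x | p) P(p): mode of the Beta posterior
print(p_mle, p_map)                     # 0.3 vs ~0.389: the prior shifts the estimate
```

With a flat prior (a = b = 1) the MAP estimate reduces to the MLE, which matches the statement above that MLE amounts to assuming all parameter values are equally likely a priori.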

EM algorithm

An introductory example

Background: a company has three male colleagues = [A, B, C] and several female colleagues = [Little A, Little Zhang, Little B]. You suspect that some of these female colleagues and male colleagues are "involved" with each other. To verify your guess scientifically, you observe them carefully.

You make the following observations:

  1. A goes out together with Little A and Little B;
  2. B goes out together with Little A and Little Zhang;
  3. B goes out together with Little Zhang and Little B;
  4. C goes out together with Little B;

After collecting the data, the EM-style calculation proceeds as follows:

Initialization: with no other information, every pairing is treated the same, so the probability that any given male colleague is "involved" with any given female colleague is 1/3.

E-step (expected outing counts under the current probabilities):

  1. A went out with Little A 1/2 × 1/3 = 1/6 times, and with Little B also 1/6 times;
  2. B went out with Little A 1/6 times, and with Little Zhang also 1/6 times;
  3. B went out with Little Zhang 1/6 times, and with Little B also 1/6 times;
  4. C went out with Little B 1/3 times;

M-step: update your gossip probabilities.

The probability that A is "involved" with Little A (and likewise with Little B) is:
\[ \frac{\frac{1}{6}}{\frac{1}{6}+\frac{1}{6}}=\frac{1}{2} \]

The probability that B is "involved" with Little A (and likewise with Little B) is:
\[ \frac{\frac{1}{6}}{\frac{1}{6} \times 4}=\frac{1}{4} \]
and with Little Zhang:
\[ \frac{\frac{1}{6} \times 2}{\frac{1}{6} \times 4}=\frac{1}{2} \]

The probability that C is "involved" with Little B is 1.

E-step: recompute the expected counts using the latest probabilities.

  1. A went out with Little A 1/2 × 1/2 = 1/4 times, and with Little B also 1/4 times;
  2. B went out with Little A 1/2 × 1/4 = 1/8 times, and with Little Zhang 1/2 × 1/2 = 1/4 times;
  3. B went out with Little B 1/2 × 1/4 = 1/8 times, and with Little Zhang 1/2 × 1/2 = 1/4 times;
  4. C went out with Little B 1 time;

M-step: update your gossip probabilities again.

The probability that A is "involved" with Little A (and likewise with Little B) is:
\[ \frac{\frac{1}{4}}{\frac{1}{4}+\frac{1}{4}}=\frac{1}{2} \]

The probability that B is "involved" with Little A (and likewise with Little B) is:
\[ \frac{\frac{1}{8}}{\frac{1}{8} \times 2+\frac{1}{4} \times 2}=\frac{1}{6} \]
and with Little Zhang:
\[ \frac{\frac{1}{4} \times 2}{\frac{1}{8} \times 2+\frac{1}{4} \times 2}=\frac{2}{3} \]

The probability that C is "involved" with Little B is 1.

Well, now, you seem to have got the truth.
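A small Python sketch of the informal procedure above (a rough heuristic rather than textbook EM: each outing contributes current probability × 1/group-size to the expected counts, and the M-step renormalizes per man, exactly as in the walk-through):

```python
# Observed outings: each male colleague with the group of women he went out with
observations = {
    "A": [["Little A", "Little B"]],
    "B": [["Little A", "Little Zhang"], ["Little Zhang", "Little B"]],
    "C": [["Little B"]],
}
women = ["Little A", "Little Zhang", "Little B"]

# Initialization: every pair is equally likely (probability 1/3)
prob = {m: {w: 1 / len(women) for w in women} for m in observations}

for _ in range(2):   # two rounds, matching the E/M steps worked out above
    for m, outings in observations.items():
        # E-step: expected "outing counts" for each woman under the current probabilities
        counts = {w: 0.0 for w in women}
        for group in outings:
            for w in group:
                counts[w] += prob[m][w] / len(group)
        # M-step: renormalize the expected counts into updated probabilities
        total = sum(counts.values())
        prob[m] = {w: counts[w] / total for w in women}

print(prob)  # A: {0.5, 0, 0.5}, B: {1/6, 2/3, 1/6}, C: {0, 0, 1}
```

Running more iterations sharpens B's probability further toward Little Zhang.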

The EM algorithm (Expectation-Maximization algorithm) is an iterative algorithm for finding maximum likelihood or maximum a posteriori estimates of the parameters of a probabilistic model when the model contains unobserved hidden (latent) variables.

The EM algorithm proceeds as follows:

  • Initialize the distribution parameters;
  • Repeat the following two steps until convergence:
    • E-step: using the current parameters, estimate the distribution of the hidden variables and the resulting expected log-likelihood;
    • M-step: re-estimate the distribution parameters so as to maximize that expectation.

EM algorithm principle

Given m training samples:
\[ \{x^{(1)},x^{(2)},...,x^{(m)}\} \]
assumed to be independent of each other, we want to find the model parameters θ that maximize the log-likelihood:
\[ θ = argmax_θ\sum_{i=1}^{m}{log(P(x^{(i)};θ))} \]
Now suppose the sample data involves hidden (latent) variables z:
\[ Z=\{z^{(1)},z^{(2)},...,z^{(K)}\} \]
The maximum-likelihood objective then becomes:
\[ θ = argmax_θ\sum_{i=1}^{m}{log(P(x^{(i)};θ))} \]

\[ =argmax_θ\sum_{i=1}^{m}{log[\sum_{z^{(j)}}{P(z^{(j)})P(x^{(i)}|z^{(j)};θ)}]} \]

\[ =argmax_θ\sum_{i=1}^{m}{log[\sum_{z^{(j)}}{P(x^{(i)},z^{(j)};θ)}]} \]

Let Q(z;θ) be a distribution over Z, satisfying:
\[ Q(z^{(j)};θ) \geq 0,~~~\sum_{z^{(j)}}{Q(z^{(j)};θ)}=1 \]
Then the following lower bound holds:

\[ l(θ)=\sum_{i=1}^{m}{log[\sum_{z^{(j)}}{P(x^{(i)},z^{(j)};θ)}]} \geq \sum_{i=1}^{m}{\sum_{z^{j}}{Q(z^{(j)};θ)log(\frac{P(x^{(i)},z^{(j)};θ)}{Q(z^{(j)};θ)})}} \]

The derivation is as follows:
\[ l(θ)=\sum_{i=1}^{m}{log[\sum_{z^{(j)}}{Q(z^{(j)};θ)\cdot\frac{P(x^{(i)},z^{(j)};θ)}{Q(z^{(j)};θ)}}]} \]

\[ =\sum_{i=1}^{m}{log[E_Q(\frac{P(x^{(i)},z;θ)}{Q(z;θ)})]}~~~\text{(the inner sum is an expectation over Q)} \]

\[ \geq \sum_{i=1}^{m}{E_Q[log(\frac{P(x^{(i)},z;θ)}{Q(z;θ)})]}~~~\text{(Jensen's inequality, since log is concave)} \]

\[ =\sum_{i=1}^{m}{\sum_{z^{(j)}}{Q(z^{(j)};θ)log(\frac{P(x^{(i)},z^{(j)};θ)}{Q(z^{(j)};θ)})}}~~~\text{(expanding the expectation)} \]

A note on Jensen's inequality:

If f is a convex function, then:
\[ f(θx+(1-θ)y) \leq θf(x)+(1-θ)f(y) \]
and, more generally:
\[ f(θ_1x_1+...+θ_kx_k) \leq θ_1f(x_1)+...+θ_kf(x_k) \]

\[ f(E(x)) \leq E(f(x)) \]

where \(θ_1,...,θ_k \geq 0\) and \(θ_1+...+θ_k=1\). For a concave function such as log, the inequalities are reversed: \(f(E(x)) \geq E(f(x))\), which is the form used in the derivation above.
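
As a quick numerical check (values chosen purely for illustration), take f = log with \(x_1=1\), \(x_2=4\) and \(θ_1=θ_2=1/2\):
\[ log(E(x))=log(0.5 \times 1+0.5 \times 4)=log(2.5)≈0.916 \geq 0.5\,log(1)+0.5\,log(4)≈0.693=E(log(x)) \]
This is exactly the direction of the inequality used in the lower-bound derivation above.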

Jensen's inequality holds with equality (i.e., the lower bound above is tight and equals l(θ)) when the quantity inside the expectation is a constant:
\[ \frac{P(x,z;θ)}{Q(z;θ)}=c \]

\[ ⇒Q(z;θ)=\frac{P(x,z;θ)}{c} \]

\[ \text{Since } \sum_{z^{(j)}}{Q(z^{(j)};θ)}=1 ⇒ Q(z;θ)=\frac{P(x,z;θ)}{c\cdot\sum_{z^{(j)}}{Q(z^{(j)};θ)}} \]

\[ =\frac{P(x,z;θ)}{\sum_{z^{(j)}}{c\cdot Q(z^{(j)};θ)}}=\frac{P(x,z;θ)}{\sum_{z^{(j)}}{P(x,z^{(j)};θ)}}=\frac{P(x,z;θ)}{P(x;θ)}=\frac{P(z|x;θ)P(x;θ)}{P(x;θ)}=P(z|x;θ) \]

That is, Q is simply the conditional distribution of z given x under the current θ: the E-step sets Q(z;θ) = P(z|x;θ).

Recall that our original goal was to find the model parameters θ that maximize the log-likelihood. With the derivation above, each iteration maximizes the (tight) lower bound built with Q fixed at the previous parameter value:
\[ θ^{new}=argmax_θ\,l(θ)=argmax_θ\sum_{i=1}^{m}{\sum_{z^{(j)}}{Q(z^{(j)};θ^{old})log(\frac{P(x^{(i)},z^{(j)};θ)}{Q(z^{(j)};θ^{old})})}} \]

\[ =argmax_θ\sum_{i=1}^{m}{\sum_{z^{(j)}}{P(z^{(j)}|x^{(i)};θ^{old})log(\frac{P(x^{(i)},z^{(j)};θ)}{P(z^{(j)}|x;θ^{old})})}} \]

\[ =argmax_θ\sum_{i=1}^{m}{\sum_{z^{j}}{P(z^{(j)}|x^{(i)};θ^{old})log(P(x^{(i)},z^{(j)};θ))}}-C \]

EM algorithm flow

Given sample data \(x=\{x_1,x_2,...,x_m\}\), the joint distribution P(x,z;θ), the conditional distribution P(z|x;θ), and a maximum number of iterations J:

  • Randomly initialize the model parameter θ to an initial value \(θ^0\);

  • EM iteration:

    • E-step: compute the conditional distribution of the hidden variables and the expectation of the joint log-likelihood:

    \[ Q^j=P(Z|X;θ^j)~~~~~~~l(θ)=\sum_{i=1}^{m}{\sum_{z^{(j)}}{P(z^{(j)}|x^{(i)};θ^{j})log(P(x^{(i)},z^{(j)};θ))}} \]

    • M-step: maximize the function l(θ) to obtain the new value of θ:

    \[ θ^{j+1}=argmax_θ\,l(θ) \]

    • If the new θ has converged, stop and output the final model parameter θ; otherwise, continue iterating.
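
The flow above can be written as a small generic skeleton. This is only a sketch (the callback names `e_step`/`m_step` and the array-valued convergence test are assumptions for illustration), with the model-specific work delegated to the two callbacks:

```python
import numpy as np

def em(x, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate E and M steps until θ stops changing.

    e_step(x, theta) -> Q, the posterior over the hidden variables given θ^j
    m_step(x, Q)     -> new θ maximizing the expected complete-data log-likelihood
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        q = e_step(x, theta)                           # E-step: Q^j = P(Z | X; θ^j)
        new_theta = np.asarray(m_step(x, q), dtype=float)
        if np.max(np.abs(new_theta - theta)) < tol:    # convergence check on θ
            theta = new_theta
            break
        theta = new_theta
    return theta
```

The two-box example below and the GMM updates later in this post both follow this pattern, just with concrete E and M steps.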

An intuitive EM example:

Suppose there are two boxes containing black and white balls in unknown proportions; the probability of drawing a white ball from box 1 and from box 2 is \(p_1\) and \(p_2\) respectively. To estimate these probabilities, in each round one box is chosen and 5 balls are drawn from it with replacement, with the following results:

| Box | 1 | 2 | 3 | 4 | 5 | Tally |
|---|---|---|---|---|---|---|
| 1 | white | white | black | white | black | 3 white, 2 black |
| 2 | black | black | white | white | black | 2 white, 3 black |
| 1 | white | black | black | black | black | 1 white, 4 black |
| 2 | white | black | white | black | white | 3 white, 2 black |
| 1 | black | white | black | white | black | 2 white, 3 black |

Using MLE (the rounds from box 1 give 6 white and 9 black in total):
\[ l(p_1)=log(p_1^6(1-p_1)^9)=6logp_1+9log(1-p_1) \]

\[ \frac{∂l(p_1)}{∂p_1}=0→p_1=0.4 \]

Similarly, we obtain \(p_2=0.5\).

Now suppose we do not know which box each round came from, but we still want to estimate \(p_1\) and \(p_2\). This introduces a hidden variable z, where \(z_i\) denotes the box chosen in round i (for example, \(z_1\) is 1 or 2 depending on whether round 1 used box 1 or box 2):

| Round (box = z) | 1 | 2 | 3 | 4 | 5 | Tally |
|---|---|---|---|---|---|---|
| z1 | white | white | black | white | black | 3 white, 2 black |
| z2 | black | black | white | white | black | 2 white, 3 black |
| z3 | white | black | black | black | black | 1 white, 4 black |
| z4 | white | black | white | black | white | 3 white, 2 black |
| z5 | black | white | black | white | black | 2 white, 3 black |

  • Randomly initialize the probabilities: the probability of drawing a white ball from box 1 is p1 = 0.1 and from box 2 is p2 = 0.9. Then, for each round, compute the likelihood that the round's draws came from each box, normalize these likelihoods into probabilities for the hidden variable z, and finally re-estimate the p values by weighted maximum likelihood. For round 1:

\[ L(z_1=1|x;p_1)=p_1^3 \times (1-p_1)^2=0.1^3 \times 0.9^2=0.00081 \]

\[ L(z_1=2|x;p_2)=p_2^3 \times (1-p_2)^2=0.9^3 \times 0.1^2=0.00729 \]

| Round | Box 1 likelihood | Box 2 likelihood | Normalized: box 1 | Normalized: box 2 |
|---|---|---|---|---|
| 1 | 0.00081 | 0.00729 | 0.1 | 0.9 |
| 2 | 0.00729 | 0.00081 | 0.9 | 0.1 |
| 3 | 0.06561 | 0.00009 | 0.999 | 0.001 |
| 4 | 0.00081 | 0.00729 | 0.1 | 0.9 |
| 5 | 0.00729 | 0.00081 | 0.9 | 0.1 |

  • Recompute the p values:

\[ l(p_1)=log[p_1^{0.1×3+0.9×2+0.999×1+0.1×3+0.9×2}(1-p_1)^{0.1×2+0.9×3+0.999×4+0.1×2+0.9×3}] \]

\[ log[p_1^{5.199}(1-p_1)^{9.796}]=5.199logp_1+9.796log(1-p_1) \]

\[ \frac{∂l(p_1)}{∂p_1}=0→p_1=0.347 \]

Similarly, we compute \(p_2=0.58\).

  • Using the new p values, again compute the probability that each round was drawn from each box:

| Round | Box 1 likelihood | Box 2 likelihood | Normalized: box 1 | Normalized: box 2 |
|---|---|---|---|---|
| 1 | 0.0178 | 0.0344 | 0.34 | 0.66 |
| 2 | 0.0335 | 0.0249 | 0.57 | 0.43 |
| 3 | 0.0630 | 0.0180 | 0.78 | 0.22 |
| 4 | 0.0178 | 0.0344 | 0.34 | 0.66 |
| 5 | 0.0335 | 0.0249 | 0.57 | 0.43 |

  • Using the new z probabilities, compute new p values by MLE:

\[ p_1=0.392~~~~~~~~~~~~~~~~~~~~~~~~~p_2=0.492 \]

Continue iterating until convergence; the p values at convergence are the desired estimates.
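
Below is a minimal Python sketch of this two-box EM iteration (assuming, as in the walk-through above, that both boxes are a priori equally likely in every round):

```python
# Each round: (number of white balls, number of black balls) out of 5 draws
rounds = [(3, 2), (2, 3), (1, 4), (3, 2), (2, 3)]

p1, p2 = 0.1, 0.9          # initial guesses, as in the text

for step in range(50):
    # E-step: probability that each round came from box 1 vs box 2
    w1 = b1 = w2 = b2 = 0.0
    for w, b in rounds:
        like1 = p1 ** w * (1 - p1) ** b    # likelihood under box 1
        like2 = p2 ** w * (1 - p2) ** b    # likelihood under box 2
        r1 = like1 / (like1 + like2)       # normalized responsibility of box 1
        r2 = 1.0 - r1
        w1 += r1 * w; b1 += r1 * b         # expected white/black counts for box 1
        w2 += r2 * w; b2 += r2 * b         # expected white/black counts for box 2
    # M-step: weighted maximum likelihood estimates of p1 and p2
    new_p1, new_p2 = w1 / (w1 + b1), w2 / (w2 + b2)
    if abs(new_p1 - p1) < 1e-6 and abs(new_p2 - p2) < 1e-6:
        break
    p1, p2 = new_p1, new_p2

print(p1, p2)   # converged estimates; the very first update reproduces ~0.347 and ~0.58
```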

Proof of EM convergence

The essence of the EM algorithm is finding the maximum likelihood estimate of the parameters. Therefore, at each iteration it suffices that the likelihood computed with the updated parameters \(θ^{j+1}\) is at least as large as the likelihood computed with the previous parameters \(θ^{j}\):
\[ \sum_{i=1}^{m}log(p(x^{(i)};θ^{j+1})) \geq \sum_{i=1}^{m}log(p(x^{(i)};θ^{j})) \]

The detailed proof is omitted here.

GMM

An introductory example:

Randomly select 1000 users and measure their heights. The sample contains both men and women, whose heights follow the Gaussian distributions \(N(μ_1,σ_1)\) and \(N(μ_2,σ_2)\) respectively. Estimate the parameters \(μ_1,σ_1,μ_2,σ_2\).

  • If we know exactly which samples are which (i.e., the male and female data are separated), we can estimate these parameters with maximum likelihood;
  • If the samples are mixed together and cannot be clearly separated, maximum likelihood cannot be applied directly; this is where GMM comes in.

A GMM (Gaussian Mixture Model) is a model formed by a weighted linear combination (mixture) of several Gaussian models; each Gaussian model is called a component. A GMM describes a distribution that the data itself is assumed to follow.

The GMM algorithm is commonly used for clustering, where the number of components can be taken as the number of clusters.

Assume the GMM is a linear mixture of K Gaussian distributions; its probability density function is:
\[ p(x)=\sum_{k=1}^{K}{p(k)p(x|k)}=\sum_{k=1}^{K}{π_kp(x;μ_k,Σ_k)} \]
where \(π_k\) is the probability of choosing the k-th component, and \(μ_k,Σ_k\) are its mean and covariance matrix. The log-likelihood is:
\[ l(π,μ,Σ)=\sum_{i=1}^{N}{log[\sum_{k=1}^{K}{π_kp(x^{(i)};μ_k,Σ_k)}]} \]

The GMM solution process

E-step: given x, the probability that sample i belongs to the j-th component:
\[ w_j^{(i)}=Q_i(z^{(i)}=j)=p(z^{(i)}=j|x^{(i)};π,μ,Σ) \]
M-step: maximize the log-likelihood l(π,μ,Σ) and update the parameters π, μ, Σ:

(The detailed derivation is rather tedious and is omitted; the resulting updates are:)

\[ μ_j=\frac{\sum_{i=1}^{m}{w_j^{(i)}x^{(i)}}}{\sum_{i=1}^{m}{w_j^{(i)}}} \]

\[ Σ_j=\frac{\sum_{i=1}^{m}{w_j^{(i)}(x^{(i)}-μ_j)(x^{(i)}-μ_j)^T}}{\sum_{i=1}^{m}{w_j^{(i)}}} \]

\[ π_j=\frac{1}{m}\sum_{i=1}^{m}{w_j^{(i)}} \]

After π, μ, Σ are updated, the E-step is performed again, and the two steps are iterated until convergence.

After convergence, the converged parameters can be used to compute, for any test sample x, the probability that it belongs to each component j; the component with the largest probability is taken as the prediction.
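
A minimal 1-D NumPy sketch of these GMM updates (the "height" data is synthetic and the component means/variances chosen here are assumptions purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "height" sample: a mixture of two Gaussian components
x = np.concatenate([rng.normal(162, 5, 500), rng.normal(175, 6, 500)])

K, m = 2, len(x)
pi = np.full(K, 1.0 / K)                 # mixing weights π_k
mu = np.array([150.0, 190.0])            # initial means
var = np.array([100.0, 100.0])           # initial variances σ_k²

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(200):
    # E-step: responsibilities w_j^{(i)} = P(z = j | x_i; π, μ, Σ)
    dens = np.stack([pi[k] * gauss(x, mu[k], var[k]) for k in range(K)])   # shape (K, m)
    w = dens / dens.sum(axis=0)
    # M-step: the weighted updates for μ_j, Σ_j and π_j given above
    Nk = w.sum(axis=1)
    mu = (w @ x) / Nk
    var = np.array([(w[k] * (x - mu[k]) ** 2).sum() / Nk[k] for k in range(K)])
    pi = Nk / m

print(pi, mu, np.sqrt(var))   # should recover weights ≈ 0.5/0.5 and means near 162 and 175
```

In practice one would typically use a library implementation such as sklearn.mixture.GaussianMixture, which performs the same EM iterations.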

Practical examples

  • EM1 (case 1): a first look at EM classification and a GMM algorithm implementation
  • EM2 (case 2): GMM classification and parameter selection
  • EM3 (case 3): different GMM parameters
  • EM4 (case 4): unsupervised classification of the iris dataset with EM

GitHub


Origin www.cnblogs.com/zhuchengchao/p/11930371.html