Generative model algorithms: EM algorithm steps and formula derivation

Introduction

The EM algorithm is an iterative algorithm, proposed by Dempster et al. in 1977, for the maximum likelihood estimation, or maximum a posteriori estimation, of the parameters of probability models containing hidden (latent) variables. Each iteration of the EM algorithm consists of two steps: the E step, which computes an expectation, and the M step, which performs a maximization. The method is therefore called the expectation-maximization algorithm, or EM algorithm for short.

EM Algorithm Examples and Solutions

Three-coin model: Suppose there are three coins A, B, and C, whose probabilities of landing heads are $\pi$, $p$, and $q$ respectively. Carry out the following experiment: first toss coin A; if it lands heads, select coin B, otherwise select coin C. Then toss the selected coin and record the result, writing 1 for heads and 0 for tails. The experiment is repeated independently $n$ times (here $n=10$), and the observed results are

$$1,\,1,\,0,\,1,\,0,\,0,\,1,\,0,\,1,\,1$$

Assume that only the result of each final coin toss can be observed, not the tossing process (i.e., which coin was tossed). How can we estimate the probabilities of heads of the three coins, that is, the parameters of the three-coin model?
The model expression is:
$$\begin{aligned} P(y\mid\theta) &= \sum_{z}P(y,z\mid\theta)=\sum_{z}P(z\mid\theta)P(y\mid z,\theta) \\ &= \pi p^{y}(1-p)^{1-y}+(1-\pi)q^{y}(1-q)^{1-y} \end{aligned}$$
Here the random variable $y$ is the observed variable, indicating whether the result of a single trial is 1 or 0; the random variable $z$ is the hidden variable, indicating the unobserved result of tossing coin A; and $\theta=(\pi,p,q)$ is the model parameter. This model is the generative model of the data above. Note that the data for the random variable $y$ can be observed, while the data for the random variable $z$ cannot.
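For instance, substituting $y=1$ and $y=0$ into this expression gives the marginal probabilities of observing a head or a tail directly:

$$P(y=1\mid\theta)=\pi p+(1-\pi)q,\qquad P(y=0\mid\theta)=\pi(1-p)+(1-\pi)(1-q)$$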
Write the observed data as $Y=(Y_1,Y_2,\dots,Y_n)^T$ and the unobserved data as $Z=(Z_1,Z_2,\dots,Z_n)^T$; then the likelihood function of the observed data is
$$P(Y\mid\theta)=\sum_{Z}P(Z\mid\theta)P(Y\mid Z,\theta)$$

$$P(Y\mid\theta)=\prod_{j=1}^{n}\left[\pi p^{y_j}(1-p)^{1-y_j}+(1-\pi)q^{y_j}(1-q)^{1-y_j}\right]$$
Consider finding the maximum likelihood estimate of the model parameters $\theta=(\pi,p,q)$, that is,

$$\hat{\theta}=\arg\max_{\theta}\log P(Y\mid\theta)$$

This problem has no analytical solution and can only be solved by an iterative method. The EM algorithm is such an iterative algorithm. The EM algorithm for this problem is given below; its derivation is omitted for now.
The EM algorithm first selects initial values for the parameters, denoted $\theta^{(0)}=(\pi^{(0)},p^{(0)},q^{(0)})$, and then iteratively updates the parameter estimates until convergence. The estimate after the $i$-th iteration is denoted $\theta^{(i)}=(\pi^{(i)},p^{(i)},q^{(i)})$. The $(i+1)$-th iteration of the EM algorithm is as follows.
E step: under the current model parameters $\pi^{(i)},p^{(i)},q^{(i)}$, compute the probability that observation $y_j$ came from coin B:

$$\mu_j^{(i+1)}=\frac{\pi^{(i)}(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}}{\pi^{(i)}(p^{(i)})^{y_j}(1-p^{(i)})^{1-y_j}+(1-\pi^{(i)})(q^{(i)})^{y_j}(1-q^{(i)})^{1-y_j}}$$
M step: compute new estimates of the model parameters:
$$\pi^{(i+1)}=\frac{1}{n}\sum_{j=1}^{n}\mu_j^{(i+1)}$$

$$p^{(i+1)}=\frac{\sum_{j=1}^{n}\mu_j^{(i+1)}y_j}{\sum_{j=1}^{n}\mu_j^{(i+1)}}$$

$$q^{(i+1)}=\frac{\sum_{j=1}^{n}\left(1-\mu_j^{(i+1)}\right)y_j}{\sum_{j=1}^{n}\left(1-\mu_j^{(i+1)}\right)}$$
Carrying out the numerical calculation, suppose the initial values of the model parameters are

$$\pi^{(0)}=0.5,\quad p^{(0)}=0.5,\quad q^{(0)}=0.5$$
For both $y_j=1$ and $y_j=0$ we have $\mu_j^{(1)}=0.5$.
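To see why, note that with all three initial parameters equal to 0.5 the numerator is exactly half the denominator in the E-step formula, regardless of $y_j$:

$$\mu_j^{(1)}=\frac{0.5\cdot 0.5^{y_j}\cdot 0.5^{1-y_j}}{0.5\cdot 0.5^{y_j}\cdot 0.5^{1-y_j}+0.5\cdot 0.5^{y_j}\cdot 0.5^{1-y_j}}=\frac{1}{2}$$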
According to the M-step formulas,

$$\pi^{(1)}=0.5,\quad p^{(1)}=0.6,\quad q^{(1)}=0.6$$
According to the E step,

$$\mu_j^{(2)}=0.5,\quad j=1,2,\dots,10$$

Continuing the iteration gives

$$\pi^{(2)}=0.5,\quad p^{(2)}=0.6,\quad q^{(2)}=0.6$$
So the maximum likelihood estimate of the model parameter $\theta$ is

$$\hat{\pi}=0.5,\quad \hat{p}=0.6,\quad \hat{q}=0.6$$
$\pi=0.5$ means that coin A is fair, and this result is easy to understand.
If instead the initial values are $\pi^{(0)}=0.4,\ p^{(0)}=0.6,\ q^{(0)}=0.7$, then the maximum likelihood estimate of the model parameters obtained is $\hat{\pi}=0.4064,\ \hat{p}=0.5368,\ \hat{q}=0.6432$. That is, the result of the EM algorithm depends on the choice of initial values; different initial values may lead to different parameter estimates.
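The iteration above is easy to reproduce in a few lines of code. The following is a minimal sketch in Python/NumPy, using the data, initial values, and update formulas from this section; the function name `three_coin_em` and the fixed iteration count are illustrative choices, not part of the original text.

```python
import numpy as np

def three_coin_em(y, pi, p, q, n_iter=20):
    """EM iteration for the three-coin model, starting from (pi, p, q)."""
    y = np.asarray(y, dtype=float)
    for _ in range(n_iter):
        # E step: probability that each observation y_j came from coin B
        num = pi * p**y * (1 - p)**(1 - y)
        den = num + (1 - pi) * q**y * (1 - q)**(1 - y)
        mu = num / den
        # M step: re-estimate the parameters from the responsibilities mu
        pi = mu.mean()
        p = (mu * y).sum() / mu.sum()
        q = ((1 - mu) * y).sum() / (1 - mu).sum()
    return pi, p, q

y = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]      # observed tosses from the text
print(three_coin_em(y, 0.5, 0.5, 0.5))  # settles at (0.5, 0.6, 0.6) after one step
print(three_coin_em(y, 0.4, 0.6, 0.7))  # approaches (0.4064, 0.5368, 0.6432)
```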

EM Algorithm Steps and Description

In general, $Y$ denotes the data of the observed random variable and $Z$ denotes the data of the hidden random variable. Together, $Y$ and $Z$ are called the complete data, and the observed data $Y$ alone is called the incomplete data. Suppose the probability distribution of the observed data $Y$ is $P(Y\mid\theta)$, where $\theta$ is the model parameter to be estimated; then the likelihood function of the incomplete data $Y$ is $P(Y\mid\theta)$, and its log-likelihood function is $L(\theta)=\log P(Y\mid\theta)$. Suppose the joint probability distribution of $Y$ and $Z$ is $P(Y,Z\mid\theta)$; then the complete-data log-likelihood function is $\log P(Y,Z\mid\theta)$.
The EM algorithm computes the maximum likelihood estimate of $L(\theta)=\log P(Y\mid\theta)$ by iteration, and each iteration contains two steps: the E step, computing an expectation, and the M step, performing a maximization. The EM algorithm is stated below.
Input: observed variable data $Y$, hidden variable data $Z$, joint distribution $P(Y,Z\mid\theta)$, conditional distribution $P(Z\mid Y,\theta)$;
Output: model parameter $\theta$.
(1) Select initial values of the parameters $\theta^{(0)}$ and start the iteration;
(2) E step: let $\theta^{(i)}$ be the parameter estimate of the $i$-th iteration. At the E step of the $(i+1)$-th iteration, compute the function

$$\begin{aligned} Q(\theta,\theta^{(i)}) &= E_Z\left[\log P(Y,Z\mid\theta)\mid Y,\theta^{(i)}\right] \\ &= \sum_{Z}\log P(Y,Z\mid\theta)\,P(Z\mid Y,\theta^{(i)}) \end{aligned}$$
Here $P(Z\mid Y,\theta^{(i)})$ is the conditional probability distribution of the hidden variable data $Z$ given the observed data $Y$ and the current parameter estimate $\theta^{(i)}$;
(3) M step: find the $\theta$ that maximizes $Q(\theta,\theta^{(i)})$ and take it as the parameter estimate of the $(i+1)$-th iteration, $\theta^{(i+1)}$:
$$\theta^{(i+1)}=\arg\max_{\theta}Q(\theta,\theta^{(i)})$$
(4) Repeat steps (2) and (3) until convergence.
The function $Q(\theta,\theta^{(i)})$ is the core of the EM algorithm and is called the Q function.
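As a worked instance (not part of the original derivation), one can check that for the three-coin model of the previous section the Q function takes the form

$$Q(\theta,\theta^{(i)})=\sum_{j=1}^{n}\left\{\mu_j^{(i+1)}\left[\log\pi+y_j\log p+(1-y_j)\log(1-p)\right]+\left(1-\mu_j^{(i+1)}\right)\left[\log(1-\pi)+y_j\log q+(1-y_j)\log(1-q)\right]\right\}$$

where $\mu_j^{(i+1)}$ is the responsibility computed in the E step; setting the derivatives with respect to $\pi$, $p$, and $q$ to zero recovers exactly the M-step formulas given earlier.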
The following are some explanations about the EM algorithm:
Step (1): the initial values of the parameters can be chosen arbitrarily, but note that the EM algorithm is sensitive to the initial values.
Step (2): the E step computes $Q(\theta,\theta^{(i)})$. In the Q-function formula, $Z$ is the unobserved data and $Y$ is the observed data. Note that the first argument of $Q(\theta,\theta^{(i)})$ denotes the parameter to be maximized, while the second denotes the current estimate of the parameter. Each iteration actually seeks to maximize the Q function.
Step (3): the M step maximizes $Q(\theta,\theta^{(i)})$ to obtain $\theta^{(i+1)}$, completing one iteration $\theta^{(i)}\to\theta^{(i+1)}$. It will be proved later that each iteration increases the likelihood function, or the algorithm reaches a local extremum.
Step (4) gives the condition for stopping the iteration: generally, for small positive numbers $\epsilon_1,\epsilon_2$, stop iterating when

$$\left\|\theta^{(i+1)}-\theta^{(i)}\right\|<\epsilon_1 \quad\text{or}\quad \left\|Q(\theta^{(i+1)},\theta^{(i)})-Q(\theta^{(i)},\theta^{(i)})\right\|<\epsilon_2$$
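The loop structure and the stopping rule translate directly into code. Below is a minimal sketch of the generic EM loop in Python, assuming the caller supplies problem-specific `e_step` and `m_step` functions; all names, the vector representation of $\theta$, and the use of only the first stopping criterion are illustrative assumptions rather than part of the text.

```python
import numpy as np

def em(theta0, e_step, m_step, eps=1e-8, max_iter=1000):
    """Generic EM loop: alternate E and M steps until the parameters stop moving.

    e_step(theta) should return whatever expected statistics the M step needs
    (the ingredients of Q(theta, theta_i)); m_step(stats) should return the
    parameter vector that maximizes Q. Both are problem-specific.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(theta)                                # E step
        theta_new = np.asarray(m_step(stats), dtype=float)   # M step
        # Stopping criterion: ||theta^(i+1) - theta^(i)|| < eps_1
        if np.linalg.norm(theta_new - theta) < eps:
            return theta_new
        theta = theta_new
    return theta
```

With the three-coin E and M steps from the previous section plugged in as `e_step` and `m_step`, this loop reproduces the fixed points shown there.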


Origin blog.csdn.net/weixin_42491648/article/details/132031864