"Machine Learning Formula Derivation and Code Implementation" chapter22-EM algorithm

"Machine Learning Formula Derivation and Code Implementation" study notes, record your own learning process, please buy the author's book for detailed content.

EM algorithm

The EM algorithm (expectation maximization, 期望极大值算法) is an iterative algorithm used for maximum likelihood estimation of the parameters of probability models that contain hidden variables.

The EM algorithm consists of two steps: the E step, which computes an expectation, and the M step, which performs a maximization.

1 Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a classic parameter estimation method in statistics. For a random sample drawn from a known family of probability distributions whose parameters are unknown, maximum likelihood estimation allows us to estimate the parameter values from the results of a number of experiments.

To illustrate with a classic example: suppose we want to know the height distribution of students in a certain college. We assume the heights of the students follow a normal distribution $N(\mu,\sigma^{2})$, where the parameters $\mu$ and $\sigma^{2}$ are unknown. There are tens of thousands of students in the school, and measuring them one by one is clearly unrealistic, so we decide to use statistical sampling and randomly select 100 students to measure their heights.

To estimate the heights of all students in the school from the heights of these 100 people, the following questions need to be clarified. The first question is: what is the probability of drawing exactly these 100 people? Because each person is selected independently, the probability of drawing the 100 people can be expressed as the product of the individual probabilities:
$$L(\theta)=L(x_{1},x_{2},\cdots,x_{n};\theta)=\prod_{i=1}^{n}p(x_{i}\mid\theta)$$
The formula above is the likelihood function. For convenience of calculation, we take its logarithm:
$$H(\theta)=\ln L(\theta)=\ln\prod_{i=1}^{n}p(x_{i}\mid\theta)=\sum_{i=1}^{n}\ln p(x_{i}\mid\theta)$$
The second question is: why were exactly these 100 people selected? According to the theory of maximum likelihood estimation, out of all the students in the school we happened to select these 100 students rather than another 100, precisely because the probability of these 100 students appearing is the highest, that is, the corresponding likelihood function is maximized:
$$\hat{\theta}=\underset{\theta}{\arg\max}\,L(\theta)$$
The last question is how to solve this maximization: differentiate $L(\theta)$ (or its logarithm) with respect to $\theta$ and set the derivative to zero.

Therefore, maximum likelihood estimation can be regarded as deducing the conditions backwards from the sampling results: some parameter value makes the probability of observing these samples the highest, and we directly take that value as the estimate of the true parameter.
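As a quick illustration (a minimal sketch with simulated data, not real survey measurements), for the normal distribution this maximization can be done in closed form: setting the derivative of the log-likelihood to zero gives the sample mean and the biased sample variance as the maximum likelihood estimates.

import numpy as np

np.random.seed(42)
heights = np.random.normal(loc=170, scale=8, size=100)  # simulated heights of 100 sampled students

# MLE for a normal distribution: sample mean and biased sample variance
mu_hat = heights.mean()
sigma2_hat = ((heights - mu_hat) ** 2).mean()  # note: divides by n, not n-1
print(f'mu_hat = {mu_hat:.2f}, sigma2_hat = {sigma2_hat:.2f}')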

2 EM algorithm

Assuming that the heights of all students in the school follow a single distribution is too coarse; in fact the distributions of men and women are different. Suppose the heights of boys follow $N(\mu_{1},\sigma_{1}^{2})$ and the heights of girls follow $N(\mu_{2},\sigma_{2}^{2})$. To estimate the heights of the students in this school, we can no longer use a single-distribution assumption.

Suppose 50 boys and 50 girls are sampled and we want to estimate the two distributions separately, but we do not know whether each sample comes from a boy or a girl.

The students' heights are the observed variable (observable variable), and the gender of each sample is a hidden variable (hidden variable).

Now there are two things to estimate: one is whether each sample is a boy or a girl, and the other is the parameters of the normal height distributions of boys and girls. In this situation, plain maximum likelihood estimation is not applicable. To estimate the height distributions of male and female students, we first need to know whether each student is male or female; conversely, to judge whether a student is male or female, we need the height distributions. The two estimates depend on each other and cannot be computed directly by maximum likelihood estimation.

For this kind of parameter estimation problem involving hidden variables, the EM (expectation maximization) algorithm is generally used. For the height estimation problem above, the idea of the EM algorithm is: since the two problems depend on each other, the solution must be an iterative process. We simply start from initial values for the male and female height distributions, estimate the probability that each sample is male or female from these values (the E step), then use maximum likelihood to re-estimate the height distribution parameters of male and female students (the M step), and iterate these adjustments until a termination condition is met, as sketched in the example below.
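To make this alternation concrete, here is a minimal sketch of EM for the height example, assuming a two-component one-dimensional Gaussian mixture; the simulated data, initial values, and the normal_pdf helper are illustrative assumptions, not taken from the book.

import numpy as np

def normal_pdf(x, mu, sigma):  # density of N(mu, sigma^2)
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

np.random.seed(0)
# simulated heights: 50 "boys" ~ N(175, 6^2) and 50 "girls" ~ N(162, 5^2), mixed together
x = np.concatenate([np.random.normal(175, 6, 50), np.random.normal(162, 5, 50)])

mu = np.array([180.0, 160.0])   # initial guesses for the two component means
sigma = np.array([10.0, 10.0])  # initial guesses for the two standard deviations
pi = np.array([0.5, 0.5])       # initial mixing weights

for _ in range(200):
    # E step: posterior probability (responsibility) that each sample belongs to each component
    dens = np.array([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)])  # 2 x n
    gamma = dens / dens.sum(axis=0)                                              # 2 x n

    # M step: weighted maximum likelihood re-estimation of the parameters
    nk = gamma.sum(axis=1)
    mu_new = (gamma * x).sum(axis=1) / nk
    sigma = np.sqrt((gamma * (x - mu_new[:, None]) ** 2).sum(axis=1) / nk)
    pi = nk / len(x)

    if np.max(np.abs(mu_new - mu)) < 1e-6:  # simple convergence check on the means
        mu = mu_new
        break
    mu = mu_new

print(mu, sigma, pi)

Each pass fixes the parameters to infer the hidden gender probabilities, then fixes those probabilities to re-estimate the parameters, which is exactly the E step / M step interplay described above.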

The application scenario of the EM algorithm is parameter estimation for probability models containing hidden variables. Given the observed variable data $Y$, the hidden variable data $Z$, the joint distribution $P(Y,Z\mid\theta)$, and the conditional distribution of the hidden variables $P(Z\mid Y,\theta)$, the EM algorithm estimates the model parameters $\theta$ as follows:
(1) Initialize the model parameters $\theta^{(0)}$ and start iterating.
(2) E step: let $\theta^{(i)}$ denote the parameter estimate at the $i$-th iteration. At the $(i+1)$-th iteration, compute the Q function:
$$Q(\theta,\theta^{(i)})=E_{Z}\left[\log P(Y,Z\mid\theta)\mid Y,\theta^{(i)}\right]=\sum_{Z}\log P(Y,Z\mid\theta)\,P(Z\mid Y,\theta^{(i)})$$
where $P(Z\mid Y,\theta^{(i)})$ is the conditional probability distribution of the hidden data $Z$ given the observed data $Y$ and the current parameter estimate $\theta^{(i)}$. The key to the E step is this Q function, defined as the expectation of the complete-data log-likelihood $\log P(Y,Z\mid\theta)$ with respect to the conditional distribution of the unobserved data $Z$ given the observed data $Y$ and the current parameter $\theta^{(i)}$.
(3) M step: find the parameter $\theta$ that maximizes the Q function, which gives the parameter estimate for the $(i+1)$-th iteration:
$$\theta^{(i+1)}=\underset{\theta}{\arg\max}\,Q(\theta,\theta^{(i)})$$
(4) Repeat step E and step M until convergence.

As can be seen from the EM procedure, the key is to determine the Q function in the E step. The E step estimates the distribution of the hidden variables with the model parameters fixed, while the M step re-estimates the model parameters with the hidden variable distribution fixed. The two are carried out alternately until the algorithm's convergence condition is met.
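A standard fact behind this alternation (stated here without the full derivation) is that each EM iteration never decreases the observed-data log-likelihood, because by Jensen's inequality

$$\log P(Y\mid\theta)-\log P(Y\mid\theta^{(i)})\ \ge\ Q(\theta,\theta^{(i)})-Q(\theta^{(i)},\theta^{(i)})$$

so choosing $\theta^{(i+1)}$ to maximize $Q(\theta,\theta^{(i)})$ guarantees that the likelihood does not decrease from one iteration to the next.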

(Figure omitted: the dynamic iterative process of the EM algorithm.)

3 Three-coin model

(Figures omitted: the book's description and derivation of the three-coin model.)
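In place of the omitted figures, here is a brief summary (my own sketch, not the book's exact derivation) of the updates that the NumPy implementation below computes. Assume each experiment uses one of two coins, B or C, chosen with equal prior probability; experiment $j$ yields $h_{j}$ heads and $t_{j}$ tails in 10 tosses, and $\theta_{k}$ denotes the heads probability of coin $k$. The E step computes the posterior probability that experiment $j$ used coin $k$, and the M step re-estimates each coin's heads probability as a responsibility-weighted proportion of heads:

$$w_{kj}=\frac{\theta_{k}^{h_{j}}(1-\theta_{k})^{t_{j}}}{\sum_{k'}\theta_{k'}^{h_{j}}(1-\theta_{k'})^{t_{j}}},\qquad \theta_{k}^{\text{new}}=\frac{\sum_{j}w_{kj}h_{j}}{\sum_{j}w_{kj}(h_{j}+t_{j})}$$

These two quantities correspond to the arrays ws and thetas in the code below.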

4 Implementing the three-coin model with NumPy

import numpy as np

## Definition of the EM procedure
def em(data, thetas, max_iter=50, eps=1e-3): # data: observed data, thetas: initial parameter estimates, eps: convergence threshold

    ll_old = 0 # initialize the log-likelihood value
    for i in range(max_iter):
        # E step: compute the hidden-variable distribution
        log_like = np.array([np.sum(data*np.log(theta), axis=1) for theta in thetas]) # log-likelihood, 2*5
        like = np.exp(log_like) # likelihood, 2*5
        ws = like/like.sum(0) # hidden-variable distribution (responsibilities), 2*5
        ll_new = np.sum([w*l for w, l in zip(ws, log_like)]) # expectation

        # M step: update the parameter values
        vs = np.array([w[:, None] * data for w in ws]) # probability-weighted counts, 2*5*2
        thetas = np.array([v.sum(0)/v.sum() for v in vs]) # 2*2

        # print the results
        print(f'Iteration:{i+1}')
        print(f'theta_B = {thetas[0,0]:.2}, theta_C = {thetas[1,0]:.2}, {ll_new}')

        # exit the iteration when the convergence condition is met
        if np.abs(ll_new - ll_old) < eps:
            break
        ll_old = ll_new

    return thetas

Solving the three-coin problem with the EM algorithm:

# Observed data: 5 independent experiments, each recording the numbers of heads and tails in 10 tosses
observed_data = np.array([(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]) # e.g. the first experiment gave 5 heads and 5 tails
# Initialize the parameter values: heads probability 0.6 for coin B, 0.5 for coin C
thetas = np.array([[0.6, 0.4], [0.5, 0.5]])
# Optimize with the EM algorithm
thetas = em(observed_data, thetas, max_iter=30)
thetas
Iteration:1
theta_B = 0.71, theta_C = 0.58, -32.68721052517165
Iteration:2
theta_B = 0.75, theta_C = 0.57, -31.258877917413145
Iteration:3
theta_B = 0.77, theta_C = 0.55, -30.760072598843628
Iteration:4
theta_B = 0.78, theta_C = 0.53, -30.33053606687176
Iteration:5
theta_B = 0.79, theta_C = 0.53, -30.071062062760774
Iteration:6
theta_B = 0.79, theta_C = 0.52, -29.95042921516964
Iteration:7
theta_B = 0.8, theta_C = 0.52, -29.90079955867412
Iteration:8
theta_B = 0.8, theta_C = 0.52, -29.881202814860167
Iteration:9
theta_B = 0.8, theta_C = 0.52, -29.873553692091832
Iteration:10
theta_B = 0.8, theta_C = 0.52, -29.870576075992844
Iteration:11
theta_B = 0.8, theta_C = 0.52, -29.86941691676721
Iteration:12
theta_B = 0.8, theta_C = 0.52, -29.868965223428773

array([[0.7967829 , 0.2032171 ],
       [0.51959543, 0.48040457]])

The algorithm converges at the twelfth iteration, and the final probabilities of heads for coin B and coin C are 0.80 and 0.52, respectively.
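Since EM only guarantees convergence to a local maximum of the likelihood, the result can depend on the initial parameter values. A quick (hypothetical) check is to rerun the optimization from a different starting point and compare; for a mixture-type model like this one, the roles of the two coins may simply swap (label switching), which is expected.

# rerun from a different (illustrative) initialization to check sensitivity to starting values
thetas_alt = em(observed_data, np.array([[0.3, 0.7], [0.8, 0.2]]), max_iter=30)
thetas_alt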

Notebook_Github address


Origin blog.csdn.net/cjw838982809/article/details/131969878