Principles and formula derivations of several machine learning modules - the EM algorithm

Notes on the EM Algorithm

This article is just my personal notes, taken so that I don't forget; please do not use it for other purposes~
If I have referenced other posts, I hope the original authors will contact me and I will add links to them at the end. Finally, I hope these notes are helpful to you~

Contents

  • 1. The principle of EM algorithm
  • 2. Formula derivation of EM algorithm

1. The principle of EM algorithm

  • 1.1 Introduction to the EM algorithm - in which situations is the EM algorithm used? (What kind of problem does it solve?)
  • 1.2 Description of the principle of the EM algorithm
  • 1.3 The relationship between the EM algorithm and K-means

1.1 Introduction to the EM algorithm

Probabilistic models sometimes contain both observable variables and hidden (latent) variables. If all of a model's variables are observable, then given the data, the model parameters can be estimated directly by maximum likelihood estimation or by Bayesian estimation. However, when the model contains hidden variables, these methods can no longer be applied directly; the EM algorithm is a method for estimating the parameters of probabilistic models that contain hidden variables. - [Li Hang, "Statistical Learning Methods", P155]


The EM (Expectation-Maximization) algorithm is a commonly used tool for estimating the parameters of models with latent variables. It is an iterative method. In real applications, incomplete data sets are often encountered. For example, if a watermelon's root has fallen off, there is no way to tell whether the root was "curled" or "stiff", so the value of the "root" attribute of that training sample is unknown. The EM algorithm can be used in this situation. The basic idea of the EM algorithm is:
① If the parameter Θ is known, the value of the optimal latent variable Z can be inferred from the training data (E-step);
② If the value of Z is known, it is then easy to perform maximum likelihood estimation of the parameter Θ (M-step).
- [Zhou Zhihua, "Machine Learning", P163]

Before introducing the EM algorithm, two concepts are reviewed: one is maximum likelihood estimation, and the other is conditional probability.

Maximum Likelihood Estimation

Principle: Maximum likelihood estimation is a statistical method based on the maximum likelihood principle, and is an application of probability theory in statistics. (The maximum likelihood principle can be illustrated with an example. There are two boxes of the same shape. Box A contains 99 white balls and 1 black ball; box B contains 99 black balls and 1 white ball. In one experiment a ball is drawn, and it turns out to be black. Which box was the black ball more likely drawn from? The answer is that it is "more likely" to have been drawn from box B, and "more likely" here means "maximum likelihood". This idea is called the "maximum likelihood principle".) Maximum likelihood estimation provides a way to estimate the parameters of a model given observed data, that is: "the model is given, the parameters are unknown". We perform several experiments, observe the results, and use those results to find the parameter value that maximizes the probability of the observed sample; that value is the maximum likelihood estimate.
[Figure illustrating maximum likelihood estimation, taken from http://blog.csdn.net/zengxiantao1994/article/details/72787849]
[Figure illustrating maximum likelihood estimation, taken from http://blog.csdn.net/xueyingxue001/article/details/51374100]
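To make the two-box example above concrete, the likelihood of the observed black ball under each hypothesis can be written out directly from the counts given (a small addition of my own, not from the cited blogs):

```latex
% Likelihood of drawing a black ball under each hypothesis:
P(\text{black} \mid \text{box A}) = \tfrac{1}{100} = 0.01, \qquad
P(\text{black} \mid \text{box B}) = \tfrac{99}{100} = 0.99
% Since 0.99 > 0.01, the maximum likelihood choice is box B.
```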

Solving for Maximum Likelihood Estimation

The general steps for solving a maximum likelihood estimate are (a small worked example follows below):
1. Write down the likelihood function;
2. Take the logarithm of the likelihood function and simplify it;
3. Take the derivative (or partial derivatives), set it to 0, and obtain the likelihood equation;
4. Solve the likelihood equation; the solution is the required parameter estimate.
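Here is a minimal sketch of these four steps in Python for a Bernoulli (coin-flip) sample; the data and variable names are my own assumptions, not from the article. The derivative step is done on paper (it gives the closed form p̂ = number of heads / n) and then checked numerically, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Assumed toy data: 10 coin flips (1 = heads), not from the original article.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def neg_log_likelihood(p):
    # Steps 1-2: likelihood L(p) = prod p^x_i * (1-p)^(1-x_i); take the log and negate.
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Steps 3-4 done analytically: d/dp log L = sum(x)/p - (n - sum(x))/(1 - p) = 0
# gives the closed-form solution p_hat = mean(x).
p_closed_form = x.mean()

# Numerical check of the same maximization.
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")

print(p_closed_form, res.x)  # both approximately 0.7
```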

Conditional Probability

In the EM algorithm, when writing down the maximum likelihood function, there is one step that confused me for a long time because I did not understand conditional probability well. So conditional probability is introduced first, and the formula is then derived from it (see the derivation of the EM algorithm in Section 2.1 for details).

The definition of conditional probability: P(A|B)=P(AB) / P(B)

Explanation: the event to the right of "|" is the event known to have occurred, i.e. the condition; the event to the left of "|" is the event whose probability we want. The conditional probability of A given B is equal to the probability that both A and B occur, divided by the probability that B occurs.
Ps: the events on either side of "|" may be a single event or more than one. If there are more than one, they may or may not be separated by commas; that is, P(A,B|C) is the same as P(AB|C), and P(A|B,C) is the same as P(A|BC). By the definition, there is no essential difference from P(A|B); substitute into the definition of conditional probability and you can see what each expression equals. Conditional probability is very helpful in deriving P(Y,Z | θ) = P(Y | Z, θ) * P(Z | θ), as worked out below.

Usage of EM Algorithm

To use the EM algorithm, the samples being processed must contain hidden variables. This can be understood as follows. For some function f(x; θ), x is known but θ is unknown, so θ can be obtained by maximum likelihood estimation. However, if the function contains an additional hidden variable z, i.e. the function becomes f(x, z; θ), where x is known but both z and θ are unknown, then maximum likelihood estimation can no longer be carried out directly. The EM algorithm was invented to solve exactly this situation.
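Written as a formula (standard notation, not quoted from the article), the difficulty is that the hidden variable z has to be summed out inside the logarithm, so the usual write-the-likelihood / take-the-log / differentiate steps no longer go through cleanly:

```latex
% Observed-data log-likelihood when a hidden variable z is present:
\log L(\theta)
  = \sum_{i=1}^{n} \log P(x_i \mid \theta)
  = \sum_{i=1}^{n} \log \sum_{z} P(x_i, z \mid \theta)
% The "log of a sum" is what blocks direct maximum likelihood estimation.
```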

Disadvantages of the EM algorithm

The EM algorithm is not guaranteed to find the global optimum: it may get stuck in a local optimum instead.

1.2 Description of the principle of the EM algorithm

In simple terms, the EM algorithm computes by alternating two steps:
First step, expectation (E-step): compute the expected value of the log-likelihood using the current parameter estimate Θ(i). (Why the log-likelihood function here? Because 1. the logarithm turns the product in the original likelihood function into a sum; and 2. the values in the likelihood function are usually small, and multiplying many small numbers on a computer easily causes floating-point underflow, so we take the logarithm and turn the product into a sum.)
Second step, maximization (M-step): find the parameter value Θ(i+1) that maximizes the likelihood expectation produced by the E-step.
Θ(i+1) is then used again in the E-step, and the iteration repeats until it converges to a local optimum.

EM algorithm flow

[Figure: flow of the EM algorithm, taken from https://www.cnblogs.com/pinard/p/6912636.html]
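A minimal runnable sketch of this alternation for a two-component 1D Gaussian mixture (my own toy example, not the one in the cited blog): the E-step computes posterior responsibilities under the current Θ(i), and the M-step re-estimates the means, standard deviations, and mixing weights.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed toy data: a mixture of two 1D Gaussians.
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])

# Initial parameter guesses Theta(0): means, std deviations, mixing weights.
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for it in range(100):
    # E-step: posterior responsibility of each component for each point,
    # computed with the current parameter estimate.
    dens = pi * norm.pdf(x[:, None], mu, sigma)      # shape (n, 2)
    gamma = dens / dens.sum(axis=1, keepdims=True)   # responsibilities

    # M-step: re-estimate parameters by maximizing the expected log-likelihood.
    nk = gamma.sum(axis=0)
    new_mu = (gamma * x[:, None]).sum(axis=0) / nk
    new_sigma = np.sqrt((gamma * (x[:, None] - new_mu) ** 2).sum(axis=0) / nk)
    new_pi = nk / len(x)

    if np.allclose(new_mu, mu, atol=1e-6):           # simple convergence test
        break
    mu, sigma, pi = new_mu, new_sigma, new_pi

print(mu, sigma, pi)  # should be close to the generating parameters
```

The convergence check here simply tests whether the means stop moving; in practice one would monitor the change in the log-likelihood or in Θ, as mentioned in Section 2.1.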

1.3 The relationship between the EM algorithm and K-means

K-means clustering algorithm

The K-means algorithm is a commonly used clustering algorithm. It assumes that, for a cluster formed by a set of data points, the distances between points inside the cluster should be smaller than the distances between those points and points outside the cluster. The K-means clustering algorithm can also be seen as solving a latent-variable problem: the goal is to divide the samples into k classes, i.e. for the observed variable x, find the hidden class assignment y, and finally measure the clustering result by maximum likelihood estimation.
The specific description of the K-means clustering algorithm:
1. Manually specify the value of k, and randomly select k cluster centroids μ1, μ2, ..., μk.
2. Repeat the following process until convergence (a code sketch follows the figure below):
a. For each sample x(i), compute the class it should belong to: c(i) = argmin_j ||x(i) - μj||²
b. For each class j, recompute the centroid of that class: μj = Σ_i 1{c(i)=j} x(i) / Σ_i 1{c(i)=j}
[Figure: illustration of K-means iterations, taken from http://www.cnblogs.com/jerrylead/archive/2011/04/06/2006910.html]
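A minimal NumPy sketch of steps 2a and 2b above, on randomly generated 2D data (the data and variable names are my own assumptions; empty clusters are not handled):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: three 2D blobs.
X = np.concatenate(
    [rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [4, 4], [0, 5])]
)

k = 3
# Step 1: randomly pick k samples as the initial centroids mu_1..mu_k.
mu = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Step 2a: assign each sample to the class of the nearest centroid.
    dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # shape (n, k)
    c = dist.argmin(axis=1)

    # Step 2b: recompute each centroid as the mean of the samples assigned to it.
    new_mu = np.array([X[c == j].mean(axis=0) for j in range(k)])

    if np.allclose(new_mu, mu):  # converged: assignments no longer move the centroids
        break
    mu = new_mu

print(mu)  # roughly the three blob centers
```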

★★★★ The relationship between the EM algorithm and the K-means clustering algorithm

1. First, the K-means clustering algorithm and the EM algorithm are both used for maximum likelihood estimation of the parameters of probabilistic models with hidden variables.
2. Secondly, both the K-means clustering algorithm and the EM algorithm need a manually specified parameter before starting: K-means needs the value of k, and the EM algorithm needs the initial parameter estimate Θ.
3. Both the K-means clustering algorithm and the EM algorithm proceed iteratively. In each iteration, given the assumed (or previously obtained) Θ, the likelihood expectation can be computed; and by maximizing that likelihood expectation, a new parameter Θ can be obtained.
4. (Important) The K-means algorithm can be viewed as a special case of solving a GMM with EM. (For details, refer to the blog http://blog.csdn.net/tingyue_/article/details/70739671):
Description:
a. The K-means algorithm is in fact the special case of the EM solution of a GMM in which each Gaussian component's covariance is ϵI with ϵ → 0.
b. The K-means algorithm performs "hard assignment" of data points to clusters, i.e. each data point belongs to exactly one cluster; the EM solution of a GMM, in contrast, performs "soft assignment" based on the posterior probability distribution, i.e. every Gaussian component contributes to the clustering of each data point, but with different contributions.
c. In practical applications, for K-means we usually run the algorithm a number of times and keep the best result. Since each iteration of a GMM is much more expensive than one of K-means, a popular approach when using a GMM is to first run K-means (repeated, keeping the best result) to obtain a rough clustering, use it as the initial value (it is enough to pass the cluster centers obtained by K-means to the GMM), and then let the GMM iterate more carefully, as sketched below.
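A hedged sketch of this K-means-then-GMM initialization using scikit-learn (assuming scikit-learn is available; the parameter names below are scikit-learn's, not the cited blog's):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Assumed toy data: two 2D blobs.
X = np.concatenate([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])

k = 2
# Rough result: run K-means several times (n_init) and keep the best run.
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Careful result: pass the K-means centers to the GMM as initial means,
# then let EM refine the means, covariances, and mixing weights.
gmm = GaussianMixture(
    n_components=k, means_init=km.cluster_centers_, random_state=0
).fit(X)

print(km.cluster_centers_)
print(gmm.means_)  # refined by EM; "soft" responsibilities via gmm.predict_proba(X)
```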

2. Formula derivation of EM algorithm

  • 2.1 Derive the EM algorithm by approximately solving the maximization problem of the log-likelihood function of the observed data
  • 2.2 Proof of Convergence of EM Algorithm

2.1 Formula derivation of EM algorithm

In the derivation of the EM algorithm formulas, the following points deserve attention:
1. The initial parameter value can be chosen arbitrarily, but the EM algorithm is sensitive to the choice of initial value.
2. Each iteration of the EM algorithm actually consists of computing the Q function and maximizing it.
3. The E-step substitutes the current parameter estimate into the Q function and obtains the expression of the Q function.
4. The M-step obtains the new parameter value θ by maximizing the Q function.
5. When using the EM algorithm, a condition for stopping the iteration must be given. Usually the iteration stops when the change in Q between two successive updates, or the change in the parameter values between two successive updates, is smaller than some small positive number.
★★★★ The EM algorithm maximizes the log-likelihood function by repeatedly maximizing a lower bound on it. For details, see the blog https://www.cnblogs.com/pinard/p/6912636.html and Li Hang's "Statistical Learning Methods", P158.
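For reference, here are the Q function and the lower bound referred to above, written in the notation of Li Hang's book (Y = observed data, Z = hidden data, θ(i) = the current parameter estimate):

```latex
% Q function: expectation of the complete-data log-likelihood
% with respect to P(Z | Y, \theta^{(i)})
Q(\theta, \theta^{(i)})
  = E_{Z}\big[\log P(Y, Z \mid \theta) \,\big|\, Y, \theta^{(i)}\big]
  = \sum_{Z} P(Z \mid Y, \theta^{(i)})\, \log P(Y, Z \mid \theta)

% Lower bound on the observed-data log-likelihood (Jensen's inequality):
\log P(Y \mid \theta)
  = \log \sum_{Z} P(Y, Z \mid \theta)
  \ge \sum_{Z} P(Z \mid Y, \theta^{(i)}) \log \frac{P(Y, Z \mid \theta)}{P(Z \mid Y, \theta^{(i)})}

% E-step: evaluate Q(\theta, \theta^{(i)});
% M-step: \theta^{(i+1)} = \arg\max_{\theta} Q(\theta, \theta^{(i)})
```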

2.2 Proof of Convergence of EM Algorithm

For details, see Li Hang's "Statistical Learning Methods" P161
