Gaussian Mixture Model (GMM)

    There are many clustering methods, and k-means is the simplest. The general idea is to divide the data into several groups, each group being one class. Each group has a cluster center (the result of learning is the k cluster centers), which is the mean of all the data in that class, and every point in a group is closer to its own cluster center than to the centers of the other classes (classification is then just a matter of comparing an unknown point against the k cluster centers: whichever center is closest decides the class). k-means really is the most intuitive and easy-to-understand clustering method. The principle is to group the most similar data together, where "similar" is something we define ourselves, for example by minimum Euclidean distance. If you want to understand the detailed k-means algorithm, please see here. In this blog post, I want to introduce another popular clustering method - GMM (Gaussian Mixture Model).
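    To make the two alternating steps concrete, here is a minimal NumPy sketch (my own illustration, not the author's code; the data and the helper name kmeans are made up): assign every point to its nearest center, then recompute each center as the mean of its assigned points.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # k data points as initial centers
    for _ in range(n_iter):
        # distance of every point to every center, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                      # each point joins its closest center
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):          # stop when the centers no longer move
            break
        centers = new_centers                          # (no empty-cluster handling in this sketch)
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # two toy blobs
centers, labels = kmeans(X, k=2)
```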

    GMM and k-means are actually very similar; the main difference is that GMM introduces probability. Before going further, let me add something. Statistical learning models come in two kinds: probabilistic models and non-probabilistic models. A probabilistic model means that the model we want to learn has the form P(Y|X), so that during classification, for unknown data X we obtain a probability distribution over the values of Y. In other words, the trained model outputs not a single value but the probability of a series of values (for a classification problem, the probability of each class), and we can then select the class with the highest probability as the decision (this is called soft assignment). A non-probabilistic model means that what we learn is a decision function Y = f(X): the input data X is mapped to a unique Y, which is the decision result (hard assignment). Back to GMM: its learning process trains several probability distributions. The so-called Gaussian mixture model estimates the probability density of the sample as a weighted sum of several Gaussian models (how many of them must be fixed before training). Each Gaussian model represents a class (a cluster). The sample data is projected onto the several Gaussian models and the probability under each class is obtained separately; we can then choose the class with the highest probability as the decision result.
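    As a tiny illustration of the difference (the numbers below are made up): a probabilistic model returns a distribution over the classes, and the corresponding hard assignment is just the argmax of that distribution.

```python
import numpy as np

# hypothetical output of a probabilistic model P(Y|X) for one sample, over three classes
soft = np.array([0.51, 0.47, 0.02])  # soft assignment: one probability per class
hard = int(soft.argmax())            # hard assignment: commit to the most probable class
print(soft, "->", hard)              # [0.51 0.47 0.02] -> 0
```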

    What is the benefit of getting probabilities? We are able to make smart judgments because we use various models to analyze the things and phenomena we observe. When you see a dog on the road, it may look somewhat like your neighbor's dog, but also a little like your girlfriend's dog: say the probability that it is your girlfriend's dog is 51% and the probability that it is the neighbor's dog is 49%. That falls in an ambiguous region, and at this point you can use other information to figure out whose dog it is. With a hard classification, you would simply conclude that it is your girlfriend's dog; there is no notion of "how much it looks like" each one, which makes it inconvenient to combine multiple models.

    From the point of view of the central limit theorem, it is reasonable to assume that the components of the mixture are Gaussian. Of course, a mixture model of any other distribution can also be defined according to the actual data, but the Gaussian choice offers some computational convenience. In theory, any probability distribution can be approximated by a GMM by increasing the number of components.

    The Gaussian mixture model is defined as:

    p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

    Here K is the number of components, \pi_k is the weight of the k-th Gaussian, and \mathcal{N}(x \mid \mu_k, \Sigma_k) is the probability density function of the k-th Gaussian, with mean \mu_k and covariance \Sigma_k. Estimating this probability density means determining the variables \pi_k, \mu_k and \Sigma_k. Once the expression is obtained, the individual terms of the sum give (after normalization) the probability that the sample x belongs to each class.
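    A minimal sketch of this definition (assuming NumPy and SciPy are available; the two-component parameters below are made up): the density at a point x is simply the \pi_k-weighted sum of the K component densities.

```python
import numpy as np
from scipy.stats import multivariate_normal

# made-up 2-component mixture in two dimensions: weights, means, covariances
pis  = np.array([0.4, 0.6])
mus  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

def gmm_pdf(x, pis, mus, covs):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in zip(pis, mus, covs))

x = np.array([1.0, 1.0])
print(gmm_pdf(x, pis, mus, covs))
# normalizing the k-th term pi_k * N(x | mu_k, Sigma_k) over k gives the
# probability that x belongs to class k
```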

    For parameter estimation, the method usually used is maximum likelihood: maximize the probability of the sample points under the estimated probability density. Since each probability value is generally very small, when N is large the product of N such values becomes extremely small, which easily causes floating-point underflow. So we usually take the logarithm and rewrite the objective as:

    \max \sum_{i=1}^{N} \log p(x_i)

    That is, we maximize the log-likelihood function, whose full form is:

    \max_{\pi, \mu, \Sigma} \; \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right)
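    A small sketch of this objective (same kind of made-up parameters as above; SciPy's logsumexp is used so the inner sum is evaluated in log space, which is exactly the underflow concern mentioned before):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, covs):
    """sum_i log( sum_k pi_k * N(x_i | mu_k, Sigma_k) ), computed in log space."""
    # log_comp[i, k] = log pi_k + log N(x_i | mu_k, Sigma_k)
    log_comp = np.stack([np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=cov)
                         for pi, mu, cov in zip(pis, mus, covs)], axis=1)
    return logsumexp(log_comp, axis=1).sum()

X = np.random.randn(100, 2)                      # toy data
pis  = np.array([0.5, 0.5])
mus  = [np.zeros(2), 3.0 * np.ones(2)]
covs = [np.eye(2), np.eye(2)]
print(gmm_log_likelihood(X, pis, mus, covs))
```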

    For parameter estimation, we usually find the extremum by taking derivatives with respect to the unknowns and setting them to zero. In the formula above, however, there is a summation inside the log, so the resulting system of equations is very complicated and this approach is hard to carry out (there is no closed-form solution). The method that can be used is the EM algorithm, which splits the solution into two steps: the first step estimates, for each data point, the weight (responsibility) of each Gaussian component, assuming that the parameters of every Gaussian are known (they can be initialized, or taken from the result of the previous iteration); the second step goes back and re-estimates the parameters of the Gaussians from those estimated weights. The two steps are repeated until the change becomes small and an extremum is approximately reached (note that this is an extremum, not necessarily the global optimum: the EM algorithm can fall into a local optimum). The specific procedure is as follows (a minimal NumPy sketch of the whole loop is given after the three steps):

  

    1. (E step) For the i-th sample x_i, the probability that it was generated by the k-th component, i.e. the responsibility \gamma(i, k), is:

    \gamma(i, k) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

    In this step, we assume that the parameters \mu_k and \Sigma_k of each Gaussian and the weights \pi_k are known (either carried over from the previous iteration or set by the initial values).

    2. (M step) With the responsibilities \gamma(i, k) in hand, the parameters of each Gaussian are re-estimated as weighted maximum-likelihood solutions:

    N_k = \sum_{i=1}^{N} \gamma(i, k), \qquad
    \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(i, k) \, x_i, \qquad
    \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma(i, k) \, (x_i - \mu_k)(x_i - \mu_k)^{\mathsf{T}}, \qquad
    \pi_k = \frac{N_k}{N}

    3. Repeat the two steps above until the algorithm converges (the algorithm is guaranteed to converge; for the specific proof, refer back to the EM algorithm. I have not looked into it in detail and will add it later).
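    Putting the steps together, here is the minimal NumPy sketch promised above (my own illustration following the standard EM formulas, not the original author's code; no empty-cluster handling, and a small jitter is added to the covariances for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=200, tol=1e-6, seed=0):
    """Fit a K-component GMM to X of shape (n, d) with EM."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pis  = np.full(K, 1.0 / K)                              # equal initial weights
    mus  = X[rng.choice(n, size=K, replace=False)]          # K random points as initial means
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)   # shared initial covariance
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: gamma[i, k] = pi_k N(x_i | mu_k, Sigma_k) / sum_j pi_j N(x_i | mu_j, Sigma_j)
        dens = np.stack([pi * multivariate_normal.pdf(X, mean=mu, cov=cov)
                         for pi, mu, cov in zip(pis, mus, covs)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters from the responsibilities
        Nk   = gamma.sum(axis=0)                            # effective number of points per component
        mus  = (gamma.T @ X) / Nk[:, None]                  # weighted means
        covs = np.array([(gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
                         + 1e-6 * np.eye(d)                 # jitter keeps the covariance invertible
                         for k in range(K)])
        pis  = Nk / n
        # stop when the log-likelihood no longer improves noticeably
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pis, mus, covs, gamma

# toy usage: two well-separated blobs; gamma[i] holds the soft class probabilities of point i
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
pis, mus, covs, gamma = gmm_em(X, K=2)
print(pis)
print(mus)
```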

 

    Finally, to sum up: the advantage of using GMM is that a sample point, after projection, does not receive a definite class label but rather a probability for each class, which is an important piece of information. Each iteration of GMM involves a fairly large amount of computation, more than k-means. Since GMM is solved with the EM algorithm, it may fall into a local extremum, so the result depends strongly on the choice of initial values. GMM can be used not only for clustering but also for probability density estimation.
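    For practical use, a library implementation is usually preferable. A hedged sketch with scikit-learn's GaussianMixture (assuming scikit-learn is installed) illustrates the points above: predict_proba returns the soft class probabilities, n_init restarts EM from several initial values to reduce the risk of a bad local optimum, and score_samples gives the estimated log-density.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(200, 2), np.random.randn(200, 2) + 4])  # toy data

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignment (argmax of the soft probabilities)
probs  = gmm.predict_proba(X)  # soft assignment: probability of each class for every point
logpdf = gmm.score_samples(X)  # log of the estimated mixture density at each point
```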

 Original address: https://blog.csdn.net/jwh_bupt/article/details/7663885
