Gaussian Mixture Models (GMM)

1. Disadvantages of k-means clustering

        When k-means clustering uses Euclidean distance as its distance function, the cluster boundary in two dimensions is a circle centered at each cluster's centroid. The original data are partitioned and classified by these circles, but the actual data distribution is not necessarily circular; it may, for example, be elliptical. This makes k-means an unsatisfactory fit for many kinds of data:

        1) The cluster shapes are not flexible enough, so the fitted clusters can differ substantially from the actual distribution and the accuracy is limited.

        2) Cluster membership is hard: each sample either belongs to a cluster or it does not (only yes or no), with no measure of uncertainty, so applications lack robustness.

2. Gaussian mixture model

        The basic idea: use a weighted combination of multiple Gaussian distribution functions (normal distributions) to approximate a probability distribution of arbitrary shape. The data points to be clustered are treated as samples drawn from this mixture, the parameters of the Gaussian distributions are estimated from those samples by maximum likelihood (solved with the EM algorithm), and the fitted parameters give the membership probabilities used to classify the data points.

         The probability density function of GMM:

P(x \mid \theta) = \sum_{k=1}^{K} P(\theta_k)\, P(x \mid \theta_k)

        where:

        1) K is the number of component models, i.e., the number of clusters.

        2) P(\theta_k) is the probability that a data sample belongs to the kth Gaussian component (the prior distribution, obtained from prior knowledge before observing the data), which satisfies:

\sum_{k=1}^{K} P(\theta_k) = 1

        3) P(x \mid \theta_k) is the probability density of the kth Gaussian, where:

        \theta_k = (\mu_k, \sigma_k^2)

                \mu_k is the mean and \sigma_k^2 is the variance.
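
        Written out, each component density is the standard one-dimensional normal density (stated here for completeness; it is not written out in the original):

P(x \mid \theta_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left( -\frac{(x - \mu_k)^2}{2\sigma_k^2} \right)

        A minimal numeric sketch of the mixture density formula in Python; the weights, means, and variances below are made-up illustrative values, not from the original:

```python
import math

# Illustrative parameters for K = 2 components.
weights = [0.6, 0.4]      # P(theta_k); must sum to 1
means = [0.0, 4.0]        # mu_k
variances = [1.0, 2.0]    # sigma_k^2

def mixture_density(x):
    """p(x | theta) = sum_k P(theta_k) * N(x; mu_k, sigma_k^2)."""
    return sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
               for w, m, v in zip(weights, means, variances))

print(mixture_density(1.0))  # mixture density evaluated at x = 1.0
```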

        Algorithm steps:

        1) Set k, the number of components, and initialize the Gaussian mixture model by assigning initial Gaussian distribution parameters (weight, mean, and variance) to each cluster.

        2) Calculate, for each point, the probability that it belongs to each Gaussian model (the E-step).

        3) Recalculate the parameters \alpha_k, \theta_k of each Gaussian model (where \alpha_k = P(\theta_k) is the mixture weight) from the points, weighted by the membership probabilities computed in step 2) (the M-step).

        4) Repeat the iterative calculation of steps 2) and 3) until the parameters converge; a code sketch follows below.
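
        A minimal hand-rolled EM sketch for a one-dimensional GMM, following steps 1)-4) above; the function and variable names are illustrative, not from the original:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated elementwise at x."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, k, n_iter=100):
    # 1) Initialize: equal weights, means spread across the data, shared variance.
    weights = np.full(k, 1.0 / k)
    means = np.quantile(x, np.linspace(0.1, 0.9, k))
    variances = np.full(k, np.var(x))
    for _ in range(n_iter):
        # 2) E-step: probability that each point belongs to each Gaussian.
        dens = np.array([w * gaussian_pdf(x, m, v)
                         for w, m, v in zip(weights, means, variances)])  # (k, n)
        resp = dens / dens.sum(axis=0)
        # 3) M-step: re-estimate each Gaussian's parameters from those probabilities.
        nk = resp.sum(axis=1)
        weights = nk / len(x)
        means = (resp * x).sum(axis=1) / nk
        variances = (resp * (x - means[:, None]) ** 2).sum(axis=1) / nk
        # 4) A fixed iteration count stands in for a convergence test, for brevity.
    return weights, means, variances, resp

# Toy usage: two overlapping 1-D Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1.5, 300)])
w, mu, var, resp = em_gmm(x, k=2)
```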

        Supplementary notes:

        1) GMM assumes as a premise that the data samples are drawn from (a mixture of) Gaussian distributions.

        2) k-means can be viewed as a special case of GMM: when every component has the same variance in all dimensions, the clusters appear circular, as in k-means.

        3) The computation per iteration of GMM is much larger than that of k-means, so k-means can be run first (repeated several times and the best result kept) to obtain initial cluster centers, which then serve as the initial values for GMM's iterations; see the sketch below.
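
        A short sketch of this strategy using scikit-learn (assumed available): GaussianMixture's init_params='kmeans', its default, runs k-means internally to seed the EM iterations, and n_init repeats the run several times and keeps the best result:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 2-D data: two blobs (illustrative values).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 2.0, size=(200, 2))])

# k-means-seeded EM, best of 5 restarts.
gmm = GaussianMixture(n_components=2, init_params='kmeans',
                      n_init=5, random_state=0)
labels = gmm.fit_predict(X)    # hard cluster assignments
probs = gmm.predict_proba(X)   # soft membership probabilities per component
```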

        

    

Origin: blog.csdn.net/weixin_43284996/article/details/127349987