Explain the principle of Gaussian mixture model in detail

Text/Chen Yunwen, CEO of Daguan Data

What is a Gaussian mixture model? The Gaussian Mixture Model, usually abbreviated as GMM, is a clustering algorithm widely used in industry. The method uses the Gaussian distribution as its parametric model and is trained with the Expectation Maximization (EM) algorithm.

This article gives an easy-to-understand explanation of how the method works, in the hope that readers can grasp its principles more intuitively. At the end of the article, the Gaussian mixture model is also compared with another common clustering algorithm, K-means; in fact, under certain constraints, the K-means algorithm can be regarded as a special form of the Gaussian mixture model (GMM). (Daguan Data, Chen Yunwen)

1 What is a Gaussian distribution? The Gaussian distribution, sometimes called the normal distribution, is the most common form of distribution and occurs abundantly in nature. Before giving the precise mathematical definition, a simple example is used to illustrate it.

If we randomly sample height data from a large population and plot the heights as a histogram, we get the graph shown in Figure 1 below. The figure simulates the height statistics of 334 adults. It can be seen that the heights that occur most often cluster around 180cm (the histogram groups the data into bins of roughly 2.5cm).

Figure 1 Normal distribution histogram formed by the height data of 334 individuals

This graph shows the shape of the Gaussian distribution very intuitively. Next, let's look at the strict definition. The probability density function of the Gaussian distribution is as follows:

N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

The formula contains two parameters: μ, the mean, and σ, the standard deviation. The mean corresponds to the center of the normal distribution; in this example we can infer that the mean is around 180cm. The standard deviation measures how spread out the data is around the mean.

Students who studied advanced mathematics in college may remember a standard fact about the normal distribution: about 95% of the data falls within 2 standard deviations of the mean. In this example the standard deviation is roughly 20 to 30, because most of the data lie between 120cm and 240cm.

The above formula is a probability density function: given the parameters, it takes an input value x and returns the corresponding probability density. Another thing to note is that the probability distribution must be normalized, that is, the total area under the curve must equal 1, which ensures that the returned probability densities fall within a valid range.
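As a minimal sketch of evaluating this density (using the rough guesses from above, a mean of 180 and a standard deviation of 28, purely for illustration), the value at a given height and the normalization of the curve can be checked numerically with scipy:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 180.0, 28.0          # parameters guessed from the histogram

# Probability density at a single height value (a density, not a probability)
print(norm.pdf(175.0, loc=mu, scale=sigma))

# The total area under the density curve should be 1 (normalization)
area, _ = quad(lambda x: norm.pdf(x, loc=mu, scale=sigma), -np.inf, np.inf)
print(area)   # ~1.0
```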

If you need the probability of a specified interval, you can compute the area under the curve between the two endpoints of the interval. Instead of integrating directly, a simpler way to obtain the same result is to use the cumulative distribution function (CDF): subtract the CDF value at the lower endpoint from the CDF value at the upper endpoint, because CDF(x) gives the probability that the value is less than or equal to x.
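For example, under the same assumed parameters (μ = 180, σ = 28), the probability that a height falls in the interval [170, 190] can be obtained from the CDF difference; a small illustrative sketch:

```python
from scipy.stats import norm

mu, sigma = 180.0, 28.0
lo, hi = 170.0, 190.0

# Probability of the interval = CDF at the upper endpoint minus CDF at the lower endpoint
p_interval = norm.cdf(hi, loc=mu, scale=sigma) - norm.cdf(lo, loc=mu, scale=sigma)
print(p_interval)   # roughly 0.28 for these parameters
```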

Going back to the previous example, let's compare the estimated parameters against the actual data. Suppose we use vertical bars to represent distribution probabilities: each bar gives the probability of the corresponding height value among the 334 people, obtained by dividing the number of people with that height by the total (334). In Figure 2 these are shown as the red bars on the left (Sample Probability).

If we set the parameters μ = 180 and σ = 28 and use the cumulative distribution function to compute the corresponding probability values, shown as the green bars on the right (Model Probability), we can judge how well the model fits by eye.
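A rough sketch of this comparison (with synthetic heights standing in for the 334 samples, which are not included here; the 10cm bin width is also an assumption for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
heights = rng.normal(180, 28, size=334)      # synthetic stand-in for the real 334 samples

mu, sigma = 180.0, 28.0
bins = np.arange(100, 261, 10)               # assumed 10cm bins for illustration

# Sample probability: fraction of the 334 people falling in each bin
counts, _ = np.histogram(heights, bins=bins)
sample_prob = counts / len(heights)

# Model probability: CDF difference over each bin under the Gaussian model
model_prob = norm.cdf(bins[1:], mu, sigma) - norm.cdf(bins[:-1], mu, sigma)

for left, sp, mp in zip(bins[:-1], sample_prob, model_prob):
    print(f"{left:3.0f}-{left + 10:3.0f}cm  sample={sp:.3f}  model={mp:.3f}")
```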

Figure 2 Sampling probabilities of the height distribution shown as a red histogram, and the probabilities computed by the Gaussian model with parameters μ=180 and σ=28 shown as a green histogram


Looking at Figure 2, we can see that the mean of 180 and standard deviation of 28 that we guessed fit the data quite well, although the guessed values may be slightly off. Of course we could keep tweaking the parameters by hand to get a better fit, but a more rigorous approach is to learn them algorithmically, a process called model training. The most commonly used method is the Expectation Maximization (EM) algorithm, which is explained in detail below.

By the way, there is always some difference between the distribution of the sampled data and that of the overall population. Here we first assume that the 334 collected height values are representative of the height distribution of the whole population. We also assume that the underlying distribution is Gaussian, draw the distribution curve accordingly, and use this as the premise for estimating the underlying distribution. As more and more data are collected, the height distribution usually gets closer and closer to a Gaussian (although other sources of uncertainty remain), and the purpose of model training is to reduce this uncertainty as much as possible under these assumptions. (Daguan Data, Chen Yunwen)

2 Expectation maximization (EM) and Gaussian model training. The training process is intuitively as follows: we judge whether a model fits well by observing how close the sampled probability values are to the model's probability values. We then adjust the model so that the new model fits the sampled probabilities better. This process is iterated many times, until the two sets of probability values are very close, at which point we stop updating and finish training the model.

Now let's implement this process algorithmically: use the model to compute a likelihood value for the data, that is, compute the expectation of the data under the model, and maximize this expectation by updating the parameters μ and σ. The process is iterated until the change in the parameters between iterations is very small. This is very similar to the training process of k-means (k-means keeps updating the cluster centers to maximize its objective), except that for the Gaussian model here we need to update two parameters at the same time: the mean and the standard deviation of the distribution.
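A minimal sketch of "learning the parameters algorithmically" for the single-Gaussian case: instead of hand-tuning μ and σ, maximize the log-likelihood of the data numerically. This uses a generic optimizer rather than EM, purely to illustrate the idea, and the data are the same synthetic stand-in heights assumed earlier:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
heights = rng.normal(180, 28, size=334)   # synthetic stand-in data

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return -np.sum(norm.logpdf(heights, loc=mu, scale=sigma))

# Start from a rough guess and let the optimizer update mu and sigma
result = minimize(neg_log_likelihood, x0=[170.0, 20.0], method="Nelder-Mead")
print(result.x)   # fitted mean and standard deviation, close to 180 and 28
```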

3 Gaussian mixture model (GMM). The Gaussian mixture model is a simple extension of the Gaussian model: a GMM uses a combination of multiple Gaussian distributions to describe the data distribution.

For example: imagine that instead of looking at the heights of all users together, we model male and female heights in the same model. Assuming the previous sample contained both men and women, the Gaussian distribution drawn earlier is actually the result of superimposing two Gaussian distributions. Instead of using a single Gaussian for modeling, we can now use two (or more) Gaussian distributions (Chen Yunwen):

p(x) = \sum_{i=1}^{K} \varphi_i \, N(x \mid \mu_i, \sigma_i), \quad \varphi_i > 0, \quad \sum_{i=1}^{K} \varphi_i = 1

This formula is very similar to the previous one, with a few differences in detail. First, the distribution is a weighted sum of K Gaussian distributions. Each Gaussian has its own μ and σ parameters as well as a corresponding weight parameter. The weights must be positive and sum to 1, which ensures that the formula yields a valid probability density; in other words, if we integrate the formula over the whole input space, the result equals 1.
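A small sketch of this formula: the mixture density of a hypothetical two-component model (one component for women, one for men; the parameter values are made up for illustration) is simply the weighted sum of Gaussian densities:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters: (weight, mean, std) for women and men
components = [
    (0.5, 165.0, 7.0),   # women
    (0.5, 178.0, 8.0),   # men
]
assert abs(sum(w for w, _, _ in components) - 1.0) < 1e-9  # weights must sum to 1

def gmm_pdf(x, components):
    # Mixture density: sum of weight_i * N(x | mu_i, sigma_i)
    return sum(w * norm.pdf(x, loc=mu, scale=sigma) for w, mu, sigma in components)

print(gmm_pdf(170.0, components))
```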

Going back to the previous example, women are usually shorter than men in height distribution, as shown in Figure 3.

Figure 3 Probability distribution map of male and female heights


The probability values shown on the y-axis of Figure 3 are computed under the assumption that the gender of each user is known. But usually we do not have this information (perhaps it was not recorded when the data were collected), so in addition to learning the parameters of each distribution, we also need to estimate the gender split, i.e. the weights \varphi_i. When computing the expectation, the height probability values for men and women must be generated separately using the weights and then added together.

Note that although the model is now more complex, it can still be trained with the same technique as before: compute the expectation (now based on data whose component assignments are mixed together), then update the parameters with the expectation-maximization strategy.
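A minimal sketch of EM for a one-dimensional two-component mixture (responsibilities in the E-step, parameter updates in the M-step); the synthetic data and the fixed number of iterations are assumptions for illustration only:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Synthetic heights: a mix of "women" and "men"
data = np.concatenate([rng.normal(165, 7, 200), rng.normal(178, 8, 200)])

# Initial guesses for weights, means and standard deviations
w = np.array([0.5, 0.5])
mu = np.array([160.0, 185.0])
sigma = np.array([10.0, 10.0])

for _ in range(100):
    # E-step: responsibility of each component for each point
    dens = np.stack([w[k] * norm.pdf(data, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0, keepdims=True)

    # M-step: update weights, means and standard deviations from the responsibilities
    nk = resp.sum(axis=1)
    w = nk / len(data)
    mu = (resp * data).sum(axis=1) / nk
    sigma = np.sqrt((resp * (data - mu[:, None]) ** 2).sum(axis=1) / nk)

print(w, mu, sigma)   # learned weights, means and standard deviations
```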

4 Learning example of the Gaussian mixture model. The simple example above uses a one-dimensional Gaussian model: there is only one feature (height). But the Gaussian is not limited to one dimension; it is easy to extend the mean to a vector and the standard deviation to a covariance matrix, and use an n-dimensional Gaussian distribution to describe multi-dimensional features. The program listing below shows how to run clustering with scikit-learn's Gaussian mixture model and visualize the results.
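The original listing is not reproduced here; the following is a rough re-creation under the current scikit-learn API (GaussianMixture with max_iter, rather than the older GMM class with n_iter referenced below), with ellipses drawn from the eigendecomposition of each 2×2 covariance sub-matrix:

```python
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.mixture import GaussianMixture

iris = datasets.load_iris()
X = iris.data                      # 4 features per sample

# Fit a 3-component GMM (newer API: GaussianMixture / max_iter instead of GMM / n_iter)
gmm = GaussianMixture(n_components=3, covariance_type="full",
                      max_iter=100, random_state=0)
labels = gmm.fit_predict(X)

def make_ellipses(gmm, ax, x_idx=0, y_idx=1):
    # Draw one ellipse per component from the 2x2 covariance of the chosen features
    for k in range(gmm.n_components):
        cov = gmm.covariances_[k][np.ix_([x_idx, y_idx], [x_idx, y_idx])]
        eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues / eigenvectors
        angle = np.degrees(np.arctan2(eigvecs[1, -1], eigvecs[0, -1]))
        width, height = 2 * np.sqrt(eigvals[-1]), 2 * np.sqrt(eigvals[0])
        mean = gmm.means_[k][[x_idx, y_idx]]
        ax.add_patch(mpl.patches.Ellipse(mean, width, height, angle=angle, alpha=0.3))

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=labels, s=15)
make_ellipses(gmm, ax, 0, 1)
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
plt.show()
```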

When initializing the GMM algorithm, the following parameters are passed in:

- n_components — the number of Gaussian components in the mixture. In the previous example there were 2.
- covariance_type — constrains the properties of the covariance matrix, i.e. the shape of the Gaussian distributions. See the documentation for details: http://scikit-learn.org/stable/modules/mixture.html
- n_iter — the number of EM iterations to run (newer scikit-learn versions call this max_iter).

The results computed on the Iris data set are shown below.

About make_ellipses: make_ellipses comes from the plot_gmm_classifier example by Ron Weiss and Gael Varoquaux in scikit-learn. From the covariance matrix projected onto two dimensions, the coordinate directions with the largest and second-largest variance can be found, together with their corresponding magnitudes; these axes are then used to draw the corresponding Gaussian ellipse. The axis directions and magnitudes are called eigenvectors and eigenvalues, respectively.

Figure 4 Mapping of the 4-dimensional Gaussian clustering results on the Iris dataset onto a two-dimensional space


The make_ellipses method is conceptually very simple. It takes the gmm object (the trained model), a coordinate axis, and the x and y coordinate indices as parameters, and draws the corresponding ellipse on the specified axes.

5 The relationship between k-means and GMM. Under certain conditions, k-means and GMM can express each other's ideas. In k-means, each point is labeled with the class of its closest cluster center, which implicitly assumes that the clusters are of similar scale and that the features are distributed homogeneously. This also explains why normalizing the data before using k-means is effective. The Gaussian mixture model is not subject to this constraint, because it models the covariance of the features separately for each cluster.

The k-means algorithm can be viewed as a special form of the Gaussian mixture model (GMM). Overall, the Gaussian mixture model has stronger descriptive power, because during clustering a point's assignment depends not only on its nearest center but also on the shape of the cluster. The shape of an n-dimensional Gaussian distribution is determined by the covariance of each cluster. After adding specific constraints on the covariance matrix, GMM and k-means can produce the same results.

In practice, if the covariance matrices of all clusters are tied together (that is, they are identical), and the values on the diagonal of the matrix are all the same while all other entries are 0, then the clusters generated are circular and of the same size. Under this condition, each point always belongs to the class of the nearest cluster center. (Chen Yunwen, Daguan Data)

Like k-means, training a Gaussian mixture model with EM is very sensitive to the initial values. Compared with k-means, GMM has more initial conditions to set: in practice, not only the initial cluster centers but also the covariance matrices and the mixing weights must be specified. One option is to run k-means first to generate the cluster centers and use them as the initial conditions for the Gaussian mixture model, as sketched below. As can be seen, the two algorithms have similar procedures, and the main difference lies in the complexity of the models.
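A small sketch of this idea (the Iris features are reused here simply as an example feature matrix): run k-means first and pass its cluster centers as the initial means of the GMM.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = datasets.load_iris().data

# Run k-means first to obtain initial cluster centers
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Use the k-means centers as the initial means of the Gaussian mixture model
gmm = GaussianMixture(n_components=3, means_init=kmeans.cluster_centers_, random_state=0)
gmm_labels = gmm.fit_predict(X)
print(gmm_labels[:10])
```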

Taken as a whole, all unsupervised machine learning algorithms follow a simple pattern: given a set of data, train a model that describes the regularities in the data (and the underlying process expected to have generated it). The training process usually iterates until the parameters can no longer be improved, yielding a model that fits the data better.

Editor's note: the above is selected from the "Daguan Data Technology Practice Special Issue" compiled by the Daguan Research Institute. This collection summarizes technical practice in three of the most popular areas of artificial intelligence: natural language processing, personalized recommendation, and vertical search engines. It brings together the technical insights gained by the Daguan technical team after serving hundreds of enterprises in different industries, such as Huawei, ZTE, China Merchants Bank, Ping An, and JD Cloud, and is the first domestic electronic journal to systematically introduce the practical application of AI technologies such as NLP and deep learning. All technical enthusiasts are welcome to download it.
