Machine Learning Day 15: Gaussian Mixture Model

K-means disadvantages

  • The K value must be set manually in advance, and the chosen value may not match the real data distribution

  • K-means only converges to a local optimum, and the result is strongly affected by the initial centroids

  • It is sensitive to noise

  • Each sample point is assigned to exactly one cluster (a hard assignment), with no membership probability

Gaussian mixture model

The Gaussian Mixture Model (GMM) is another common clustering algorithm, fitted iteratively with the EM algorithm. It assumes that the data in each cluster follows a normal (Gaussian) distribution, so the observed data distribution is a superposition of the per-cluster Gaussians.

When the data clearly cannot be fitted by a single normal distribution, we generalize to a superposition of several normal distributions and fit the data with that. This is the Gaussian mixture model: a linear combination of normal density functions is used to fit the data distribution. In theory, a Gaussian mixture model can approximate any kind of distribution.
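In standard GMM notation, with K components, component means μ_k, variances σ_k², and mixing weights π_k, the mixture density is:

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```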

Gaussian mixture model assumption

We assume that data in the same cluster follows one normal distribution, and data in different clusters follows different normal distributions.
We need to estimate the parameters of each normal distribution: its mean μ_k and variance σ_k². We also attach a weight π_k to each normal distribution, i.e. the probability that a data point is generated by that component.
The Gaussian mixture model is a generative model. Take the simplest case: two one-dimensional normal components, N(0,1) and N(5,1), with weights 0.7 and 0.3. To generate the first data point, a component is chosen at random according to the weights, and a value is then drawn from that component's distribution using its parameters; the second point is generated the same way, and so on until all data points are generated.
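A minimal sketch of that generative process in Python (NumPy only; the sample size and random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture from the example above: two 1-D Gaussians N(0, 1) and N(5, 1)
# with mixing weights 0.7 and 0.3.
weights = np.array([0.7, 0.3])
means = np.array([0.0, 5.0])
stds = np.array([1.0, 1.0])

n_samples = 1000
# Step 1: pick a component for each point according to the weights.
components = rng.choice(len(weights), size=n_samples, p=weights)
# Step 2: draw each point from the chosen component's normal distribution.
samples = rng.normal(means[components], stds[components])
```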

Under normal circumstances we cannot observe the parameters of the Gaussian mixture model directly; we only observe data points. Given an approximate number of clusters K, we then look for the K normal distributions that best explain the data, i.e. the best means μ_k, variances σ_k², and weights π_k.
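In practice this fit can be done, for example, with scikit-learn's GaussianMixture (not mentioned in the original, but it implements exactly this EM-based estimation); a small sketch reusing the samples generated above:

```python
from sklearn.mixture import GaussianMixture

# scikit-learn expects a 2-D array of shape (n_samples, n_features).
X = samples.reshape(-1, 1)

# K must be supplied in advance, just like the K in K-means.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print("weights:", gmm.weights_)                 # estimated pi_k
print("means:", gmm.means_.ravel())             # estimated mu_k
print("variances:", gmm.covariances_.ravel())   # estimated sigma_k^2
labels = gmm.predict(X)                         # hard cluster labels
```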

Solving for these parameters by direct maximum likelihood is extremely complicated, so the EM algorithm is used instead.
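A bare-bones sketch of the EM iteration for a one-dimensional mixture (the initialisation and fixed iteration count are simplifying assumptions; real implementations add convergence checks and numerical safeguards):

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """Illustrative EM for a 1-D Gaussian mixture with k components."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Crude initialisation: random data points as means, shared variance, equal weights.
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means, and variances from the responsibilities.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

    return pi, mu, var
```

Run on the samples from the earlier sketch, `em_gmm_1d(samples, k=2)` should recover weights and means close to (0.7, 0.3) and (0, 5).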


Origin: blog.51cto.com/15069488/2578582