Gaussian Mixture Model (GMM) - Personal Notes

1. Model introduction

1. From a geometric perspective: a Gaussian mixture model is formed by superimposing multiple Gaussian distributions as a weighted average.

A mixture model contains several Gaussian distributions (in the original figures, not reproduced here, one plot showed red, green, and blue curves as three different Gaussians, and another showed two). These Gaussians all overlap: a given x can belong to any of them, but with different probabilities. The probability of belonging to the k-th Gaussian is \alpha_{k}, and the density of x under that Gaussian is \phi(x | \theta_{k}); multiplying the two gives that component's contribution to the probability of x under the mixture model.

The next task is to learn the parameters \alpha_{k} and \theta_{k}, where \theta_{k} = (\mu_{k}, \sigma_{k}^{2}) determines the position and width of each single Gaussian model.
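For reference, the full mixture density (a standard formula, written out here since the notes only describe it in words) is

p(x) = \sum_{k=1}^{K} \alpha_{k} \, \phi(x | \theta_{k}), \qquad \sum_{k=1}^{K} \alpha_{k} = 1, \quad \alpha_{k} \geq 0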

Parameter estimation:

 2. Maximum likelihood

For a single Gaussian model, maximum likelihood can be used to estimate the parameter \theta.

But it does not work directly for Gaussian mixture models (the specific mathematics was unclear to me at first; the sketch below shows where it breaks down).
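Concretely (a standard derivation step, added here to fill the gap noted above), the log-likelihood of the mixture over N samples is

\log L(\theta) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \alpha_{k} \, \phi(x_{i} | \theta_{k}) \right)

The inner sum sits inside the logarithm, so the log does not split into per-component terms, and setting the derivatives to zero gives no closed-form solution. This is why the EM algorithm is used instead.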

3. EM algorithm solution

1. It is an unsupervised classification model: it does not use the points' category labels to learn the model parameters.

2. In the EM algorithm:

E step: estimate the probability that each point belongs to each category (the responsibilities).

M step: maximize, i.e., recompute the model parameters for the next iteration: using the estimated responsibilities, update each Gaussian's mean and variance, as well as the prior probability of each category.

Repeat the two steps until the distributions converge (the M step no longer changes the parameters).

3. The initialization scheme affects the classification result; different initializations produce different results.

argmax is the operator that returns the argument (the parameter setting) at which a function attains its maximum. (Come back for the detailed solution steps after learning more probability theory.)
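For example, in standard notation:

\hat{\theta} = \arg\max_{\theta} \log L(\theta)

reads: \hat{\theta} is the value of \theta at which the log-likelihood attains its maximum.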

In conclusion:

The responsibility \gamma_{ik} is the probability that the i-th sample belongs to the k-th component:

\gamma_{ik} = \frac{\pi_{k} N(x_{i} | \mu_{k}, \Sigma_{k})}{\sum_{j=1}^{K} \pi_{j} N(x_{i} | \mu_{j}, \Sigma_{j})}

First set initial values for \pi_{k}, \mu_{k}, \Sigma_{k}; here N(x | \mu, \Sigma) is the multi-parameter (multivariate) Gaussian model, not just the single-variable one. After computing the responsibilities, the M step updates

N_{k} = \sum_{i=1}^{N} \gamma_{ik}, \qquad \pi_{k} = \frac{N_{k}}{N}, \qquad \mu_{k} = \frac{1}{N_{k}} \sum_{i=1}^{N} \gamma_{ik} x_{i}, \qquad \Sigma_{k} = \frac{1}{N_{k}} \sum_{i=1}^{N} \gamma_{ik} (x_{i} - \mu_{k})(x_{i} - \mu_{k})^{T}

which completes one update of (\mu_{k}, \Sigma_{k}) and \pi_{k}.

By continuing to repeat these updates, the above values can be determined.

Here N(x_{i} | \mu_{j}, \Sigma_{j}) (the value of the normal density) is calculated with the formulas below. For a Gaussian with more parameters than (\mu_{k}, \sigma_{k}^{2}), that is, the multivariate case, use the second formula, where \Sigma is the covariance matrix, meaning there are multiple variables.

Gaussian model (x is a single variable)

Given any x, this gives the probability density of observing it.
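The original note showed the formula as an image; the standard single-variable Gaussian density is

N(x | \mu, \sigma^{2}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right)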

 

When X has multiple features (variables), each sample x is a vector and the data form a matrix. Likewise, \mu is a vector; for example, \mu_{1} is the mean of x_{1}. In this way, the single-variable model generalizes to a multivariate probability density function.
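Again the formula was an image in the original; the standard multivariate Gaussian density, for d variables with covariance matrix \Sigma, is

N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right)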

(Question: what do these parameters mean, and why can new values be calculated?)

4. Code implementations of the Gaussian mixture model (two approaches: from scratch, and with sklearn)

Binary classification (two Gaussian distributions)

import numpy as np
import matplotlib.pyplot as plt

# Generate data with mean 1.71 and standard deviation 0.056
np.random.seed(0)
mu_m = 1.71       # mean
sigma_m = 0.056   # standard deviation
num_m = 10000     # number of samples
rand_data_m = np.random.normal(mu_m, sigma_m, num_m)  # generate the data
y_m = np.ones(num_m)   # generate the labels

# Generate data with mean 1.58 and standard deviation 0.051
np.random.seed(0)
mu_w = 1.58       # mean
sigma_w = 0.051   # standard deviation
num_w = 10000     # number of samples
rand_data_w = np.random.normal(mu_w, sigma_w, num_w)  # generate the data
y_w = np.zeros(num_w)  # generate the labels

# Merge the two datasets
data = np.append(rand_data_m, rand_data_w)
data = data.reshape(-1, 1)
y = np.append(y_m, y_w)
print(data)
print(y)

# Iterate
# Import the multivariate normal distribution for the density computations
from scipy.stats import multivariate_normal

num_iter = 1000  # number of iterations
n, d = data.shape
# Initialize the parameters
# Hard-coding two of everything only works for two components; with more Gaussians,
# loop over components 1..k (mu1..muk, and likewise sigma and pi). sklearn does this
# automatically, so there is no need to write that loop by hand here.
mu1 = data.min(axis=0)
mu2 = data.max(axis=0)
sigma1 = np.identity(d)  # here the variable is one-dimensional, so sigma is a single number; with multiple variables sigma is a full covariance matrix, and identity may no longer be a suitable starting point
sigma2 = np.identity(d)
pi = 0.5

for i in range(num_iter):
    # E step: compute gamma (the responsibilities)
    # Again hard-coded for two components; with more Gaussians, loop from 1 to k
    norm1 = multivariate_normal(mu1, sigma1)
    norm2 = multivariate_normal(mu2, sigma2)
    tau1 = pi * norm1.pdf(data)        # pi times the density of each point under the first Gaussian
    tau2 = (1 - pi) * norm2.pdf(data)
    gamma = tau1 / (tau1 + tau2)       # a vector: for each sample, the probability of belonging to component 1

    # M step: update mu1
    mu1 = np.dot(gamma, data) / np.sum(gamma)
    # update mu2
    mu2 = np.dot(1 - gamma, data) / np.sum(1 - gamma)
    # update sigma1
    sigma1 = np.dot(gamma * (data - mu1).T, data - mu1) / np.sum(gamma)  # multiply by (data - mu1).T first to avoid a dimension error
    # update sigma2
    sigma2 = np.dot((1 - gamma) * (data - mu2).T, data - mu2) / np.sum(1 - gamma)
    # update pi
    pi = np.sum(gamma) / n

print('Class prior:\t', pi)
print('Means:\t', mu1, mu2)
print('Variances:\n', sigma1, '\n\n', sigma2, '\n')

Multiple Gaussian distributions: using sklearn to implement the Gaussian mixture model

This is easier to use and also works for more than two Gaussian distributions.

But a question: what if this category's Gaussian distribution has more parameters than (\mu_{k}, \sigma_{k}^{2}), i.e., the data are multivariate? Do you build a multi-dimensional matrix and pass it to GaussianMixture? (See the sketch below.)
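A minimal sketch answering that question (the synthetic 2-D data here is my own choice, not from the original post): GaussianMixture simply takes an (n_samples, n_features) array, so multivariate data just means more columns, and the fitted covariances_ come back as one d x d matrix per component.

import numpy as np
from sklearn.mixture import GaussianMixture

# Two 2-D Gaussian blobs stacked into one (n_samples, 2) array
np.random.seed(0)
blob1 = np.random.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], 500)
blob2 = np.random.multivariate_normal([4, 4], [[1.0, -0.2], [-0.2, 0.5]], 500)
X = np.vstack([blob1, blob2])

g = GaussianMixture(n_components=2, covariance_type='full').fit(X)
print(g.means_)        # shape (2, 2): one mean vector per component
print(g.covariances_)  # shape (2, 2, 2): one 2x2 covariance matrix per component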

I found the sklearn documentation; it is well written and covers some of the principles: https://www.sklearncn.cn/20/

But the examples are difficult to read, so I will come back when I have time: GMM covariances — scikit-learn 1.2.0 documentation


Explanation: give it known mixed data, data. These points come from multiple Gaussian distributions but are mixed together. Given only the data, we want to determine which Gaussian distribution each point belongs to.

The specific process: first tell it how many Gaussian distributions there are, then iteratively train those Gaussians through sklearn (the fit call in the code below). Finally, predict tells you which Gaussian distribution each point in the dataset belongs to.

Note: y (the combination of y_m and y_w) is known in this example, but in a real application it would not be. Our goal is to obtain y_hat, the predicted labels.

y_hat is generated by g.predict. Its labels are 0 and 1 by default, and they have nothing to do with the 0 and 1 labels we assigned to y_w and y_m when generating the data; we set the two to match here purely for comparison.

import numpy as np
import matplotlib.pyplot as plt
import sklearn

# Generate dataset 1
np.random.seed(0)
mu_m = 1.71       # mean
sigma_m = 0.056   # standard deviation
num_m = 10000     # number of samples
rand_data_m = np.random.normal(mu_m, sigma_m, num_m)  # generate the data
y_m = np.ones(num_m)   # generate the labels: label 1 marks points from dataset 1
# print(rand_data_m)

# Generate dataset 2
np.random.seed(0)
mu_w = 1.58       # mean
sigma_w = 0.051   # standard deviation
num_w = 10000     # number of samples
rand_data_w = np.random.normal(mu_w, sigma_w, num_w)  # generate the data
y_w = np.zeros(num_w)  # generate the labels: label 0 marks points from dataset 2
# print(y_w)

# Merge the two datasets
data = np.append(rand_data_m, rand_data_w)
print(data)
data = data.reshape(-1, 1)
y = np.append(y_m, y_w)
# print(data)
# print(y)

# Fit the model
from sklearn.mixture import GaussianMixture
g = GaussianMixture(n_components=2, covariance_type='full', tol=1e-6, max_iter=1000)  # the key parameters are n_components (how many classes) and max_iter (how many iterations)
g.fit(data)  # fit trains the model on data

print('Class prior:\t', g.weights_[1])
print('Class prior:\t', g.weights_[0])
print('Means:\t\n', g.means_, '\n')            # print the means
print('Covariances:\n', g.covariances_, '\n')  # print the covariances

# Prediction
from sklearn.metrics import accuracy_score
y_hat = g.predict(data)  # predict a component label for every point and return the result
print(type(y_hat))
print(y_hat)
print(accuracy_score(y, y_hat))  # compare with the true labels to measure the accuracy
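One caveat I noticed (my own addition, not from the original post): the component indices that GaussianMixture assigns are arbitrary, so y_hat may come out with 0 and 1 swapped relative to y, and the printed accuracy would then be close to 0 instead of close to 1. For this two-component case, a simple flip-insensitive report is:

acc = accuracy_score(y, y_hat)
print(max(acc, 1 - acc))  # correct whichever way the two components were numbered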

Learning resources:

Gaussian Mixture Model (GMM) - Zhihu (zhihu.com)

Machine Learning - Whiteboard Derivation Series (10) - EM Algorithm (Expectation Maximization) - bilibili
Statistical Models (3) - EM Algorithm - bilibili

[07-11-1 From-scratch implementation of the Gaussian mixture model] https://www.bilibili.com/video/BV1M94y1R7VY/?share_source=copy_web&vd_source=98fbab4e0ff3ef4e18cd477db479634d


Origin: blog.csdn.net/AaliyahShylock/article/details/128194369