1. Code
import numpy as np, matplotlib.pyplot as mp
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn import mixture
np.random.seed(8)  # fix the random state for reproducibility
# Create random samples
X, _ = datasets.make_blobs(centers=[[0, 0]])
X1 = np.dot(X, [[4, 1], [1, 1]])
X2 = np.dot(X[:50], [[1, 1], [1, -5]]) - 2
X = np.concatenate((X1, X2))
y = [0] * 100 + [1] * 50
# KMeans
kmeans = KMeans(n_clusters=2)
y_kmeans = kmeans.fit(X).predict(X)
# Plot: true labels (left) vs. K-Means labels (right)
for e, labels in enumerate([y, y_kmeans], 1):
    mp.subplot(1, 2, e)
    mp.scatter(X[:, 0], X[:, 1], c=labels, s=40, alpha=0.6)
    mp.xticks(())
    mp.yticks(())
mp.show()
# GMM
gmm = mixture.GaussianMixture(n_components=2, covariance_type='full')
y_gmm = gmm.fit(X).predict(X)
# Plot: true labels (left) vs. GMM labels (right)
for e, labels in enumerate([y, y_gmm], 1):
    mp.subplot(1, 2, e)
    mp.scatter(X[:, 0], X[:, 1], c=labels, s=40, alpha=0.6)
    mp.xticks(())
    mp.yticks(())
mp.show()
2. Effect
GMM is more flexible than K-Means with respect to cluster shape: each component is a full Gaussian, so a cluster can be any ellipsoid rather than only a sphere. As the figures show, GMM therefore recovers these elongated clusters correctly. GMM also produces probabilistic (soft) assignments: every point receives a membership probability for each cluster, which matters especially when a point lies in the overlap between two clusters.
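To make the soft-assignment point concrete, the fitted model's `predict_proba` can be inspected. The sketch below reuses the same synthetic data as the code above; each row of the returned matrix gives one point's membership probabilities over the two components and sums to 1.

```python
import numpy as np
from sklearn import datasets, mixture

np.random.seed(8)
X, _ = datasets.make_blobs(centers=[[0, 0]])  # 100 samples by default
X = np.concatenate((np.dot(X, [[4, 1], [1, 1]]),
                    np.dot(X[:50], [[1, 1], [1, -5]]) - 2))

gmm = mixture.GaussianMixture(n_components=2, covariance_type='full').fit(X)
proba = gmm.predict_proba(X)  # shape (150, 2): per-cluster membership probabilities

print(proba.shape)
print(np.allclose(proba.sum(axis=1), 1))  # each row is a probability distribution
```

Points deep inside one cluster get probabilities near 0 or 1, while points in the overlap region receive intermediate values; `predict` simply takes the argmax of these rows.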
Figure: K-Means clustering result (true labels vs. predicted)
Figure: GMM clustering result (true labels vs. predicted)