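We generate simulated data ourselves and run K-Means on it, mainly to get familiar with the API. Using scikit-learn's API for creating synthetic data, we build a clustering dataset, cluster it with the K-Means algorithm, and obtain the cluster centers as well as the total sum of squared distances from the samples to their cluster centers.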

API overview

1) make_blobs

This is the API for generating data (blob-shaped data, that is, data drawn from isotropic Gaussian distributions).
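Parameters

n_samples : int or array-like, optional (default=100). Number of samples.
n_features : int, optional (default=2). Number of features.
centers : int or array of shape [n_centers, n_features], optional. Number of centers to generate, or the fixed center locations.
cluster_std : float or sequence of floats, optional (default=1.0). Standard deviation of the clusters.
center_box : pair of floats (min, max), optional (default=(-10.0, 10.0)). Bounding box for each cluster center when the centers are generated at random.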
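To make the parameters above concrete, here is a minimal sketch of a make_blobs call. The variable names X_demo and y_demo and the cluster_std values are illustrative assumptions, not part of the original article.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Three 2-D blobs; the per-cluster standard deviations are illustrative values.
X_demo, y_demo = make_blobs(
    n_samples=300,
    n_features=2,
    centers=3,
    cluster_std=[0.5, 1.0, 2.0],
    center_box=(-10.0, 10.0),
    random_state=0,
)

print(X_demo.shape)         # feature matrix: one row per sample
print(y_demo.shape)         # integer label of the blob each sample came from
print(np.bincount(y_demo))  # number of samples generated for each center

# Empirical spread of each blob, to compare with the requested cluster_std
for k in range(3):
    print(k, X_demo[y_demo == k].std(axis=0))
```

The printed per-blob standard deviations should be roughly in line with the requested cluster_std values, which is a quick way to check that the parameter does what the list above says.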

2) KMeans API

This is the API for K-Means.
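Parameters

n_clusters : the value of k (number of clusters), default 8
init : {'k-means++', 'random' or an ndarray}, how the initial centroids are chosen, default 'k-means++'
n_init : int, default: 10. Number of times the algorithm is run with different initial centroids. (K-Means is sensitive to its initial values, and different starting points can lead to different cluster assignments. To keep this sensitivity from producing a poor final result, the algorithm is run from several sets of initial centroids and the best run is kept.)
max_iter : int, default: 300, maximum number of iterations
tol : float, default: 1e-4, relative tolerance; iteration stops when the change in the cluster centers between two consecutive iterations falls below it
precompute_distances : {'auto', True, False}, whether to precompute distances before clustering (this parameter was deprecated in scikit-learn 0.23 and removed in 1.0)
……
algorithm : "auto", "full" or "elkan", default="auto", which K-Means variant to use, not a distance metric: "full" is the classic EM-style (Lloyd) algorithm and "elkan" uses the triangle inequality to skip distance computations (recent scikit-learn releases use "lloyd" in place of "full")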
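The role of n_init described above can be illustrated with a small sketch: run K-Means once per seed with n_init=1, look at how the resulting inertia varies, and compare with a single call that uses n_init=10. The names X_demo, km_single and km_multi are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=500, centers=4, random_state=28)

# One run per seed (n_init=1), each starting from a different random initialization
inertias = []
for seed in range(10):
    km_single = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed)
    km_single.fit(X_demo)
    inertias.append(km_single.inertia_)

print("inertia per single run:", np.round(inertias, 2))
print("best of the 10 runs:   ", round(min(inertias), 2))

# n_init=10 performs the restarts internally and keeps the run with the lowest inertia
km_multi = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X_demo)
print("n_init=10 inertia:     ", round(km_multi.inertia_, 2))
```

If some single runs end with a clearly higher inertia than others, those are the poor local optima that multiple restarts are meant to guard against.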

Code implementation

1) Using the API

Next, we use the two APIs above for a simple K-Means implementation.

# Import packages
import numpy as np
import sklearn
from sklearn.datasets import make_blobs  # function that generates simulated data
from sklearn.cluster import KMeans  # the KMeans class

# 1. Generate simulated data
N = 1000
centers = 4
X, Y = make_blobs(n_samples=N, n_features=2, centers=centers, random_state=28)

# 2. Build the model
km = KMeans(n_clusters=centers, init='random', random_state=28)
km.fit(X)

# The actual y values
Y

array([2, 0, 2, 0, 3, 2, 2, 1, 0, 1, 1, 2, 3, 0, 1, 1, 0, 0, 0, 3, 0, 2,
3, 0, 2, 2, 1, 0, 0, 0, 3, 1, 3, 1, 1, 1, 3, 2, 3, 3, 0, 0, 1, 0,
3, 1, 1, 0, 2, 1, 3, 0, 2, 2, 3, 0, 3, 1, 0, 2, 0, 3, 2, 3, 2, 2,
0, 2, 0, 2, 1, 2, 1, 0, 2, 0, 0, 1, 0, 2, 1, 2, 3, 1, 1, 2, 0, 3,
3, 2, 2, 1, 1, 0, 3, 1, 0, 0, 0, 0, 2, 0, 2, 3, 0, 0, 3, 2, 0, 3,
0, 0, 3, 0, 0, 0, 1, 1, 1, 3, 1, 2, 0, 3, 1, 3, 2, 0, 0, 3, 1, 2,
0, 0, 3, 3, 1, 1, 1, 3, 0, 0, 1, 0, 3, 2, 0, 1, 2, 1, 2, 1, 3, 3,
3, 3, 0, 1, 1, 2, 0, 1, 0, 1, 1, 1, 3, 3, 3, 3, 2, 3, 0, 3, 1, 0,
2, 2, 0, 3, 1, 3, 0, 2, 3, 0, 1, 3, 2, 0, 1, 0, 1, 0, 3, 0, 0, 2,
1, 1, 1, 2, 1, 1, 2, 3, 1, 1, 1, 2, 0, 1, 0, 0, 2, 3, 2, 3, 3, 3,
3, 0, 0, 2, 0, 3, 0, 3, 0, 1, 2, 3, 2, 3, 3, 0, 3, 2, 3, 1, 2, 0,
3, 0, 1, 3, 3, 0, 0, 2, 2, 0, 0, 0, 3, 1, 3, 0, 1, 3, 1, 0, 0, 1,
1, 3, 1, 1, 0, 1, 0, 3, 1, 1, 2, 1, 1, 3, 3, 0, 0, 2, 3, 0, 2, 3,
2, 3, 2, 2, 2, 2, 3, 0, 2, 2, 1, 0, 1, 2, 2, 1, 0, 2, 3, 3, 0, 1,
1, 3, 1, 2, 3, 1, 3, 2, 0, 3, 1, 3, 0, 1, 2, 2, 0, 3, 1, 1, 0, 3,
2, 1, 2, 2, 0, 1, 1, 1, 0, 3, 1, 3, 2, 0, 2, 2, 2, 1, 0, 1, 2, 2,
2, 2, 2, 1, 1, 1, 1, 1, 3, 3, 3, 0, 1, 2, 3, 1, 0, 2, 3, 1, 1, 3,
1, 2, 0, 2, 3, 0, 3, 0, 1, 2, 1, 0, 0, 3, 2, 3, 1, 3, 2, 0, 0, 0,
1, 3, 1, 0, 3, 3, 0, 1, 2, 0, 1, 1, 3, 3, 0, 2, 2, 3, 3, 1, 3, 3,
1, 0, 1, 0, 1, 1, 2, 0, 3, 2, 3, 0, 2, 1, 2, 2, 1, 3, 2, 1, 0, 0,
0, 1, 2, 2, 2, 3, 3, 1, 0, 3, 3, 3, 0, 3, 2, 0, 2, 0, 0, 2, 2, 1,
1, 3, 3, 0, 0, 1, 1, 0, 0, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 0, 1, 0,
1, 2, 2, 1, 2, 0, 3, 1, 2, 0, 1, 0, 1, 0, 2, 1, 1, 2, 1, 0, 2, 3,
1, 0, 0, 2, 3, 2, 0, 3, 3, 3, 2, 3, 0, 1, 2, 2, 2, 2, 3, 0, 2, 2,
0, 1, 3, 2, 2, 3, 0, 3, 2, 3, 0, 2, 0, 0, 2, 2, 1, 3, 3, 1, 2, 0,
3, 0, 2, 1, 3, 1, 1, 2, 2, 0, 2, 3, 0, 3, 1, 2, 3, 0, 0, 3, 1, 2,
3, 2, 0, 3, 0, 1, 2, 0, 3, 3, 0, 3, 3, 3, 1, 3, 1, 0, 3, 2, 0, 1,
1, 3, 2, 3, 3, 0, 0, 1, 0, 3, 2, 2, 2, 2, 2, 3, 0, 2, 3, 1, 2, 3,
1, 0, 3, 3, 3, 0, 1, 3, 3, 3, 3, 2, 1, 2, 3, 2, 3, 0, 2, 0, 3, 0,
1, 3, 0, 0, 3, 1, 3, 2, 2, 0, 3, 2, 3, 2, 1, 2, 2, 1, 1, 1, 0, 0,
3, 2, 1, 2, 2, 3, 3, 3, 0, 1, 1, 2, 3, 3, 3, 3, 3, 3, 1, 0, 3, 1,
1, 2, 1, 3, 2, 1, 0, 2, 0, 1, 2, 0, 3, 2, 0, 1, 0, 1, 1, 1, 2, 1,
3, 0, 2, 3, 2, 2, 3, 0, 2, 0, 1, 1, 2, 2, 2, 1, 3, 0, 1, 2, 0, 2,
1, 1, 2, 0, 2, 0, 2, 3, 3, 1, 0, 1, 1, 3, 2, 2, 0, 2, 2, 0, 3, 0,
0, 3, 1, 1, 1, 2, 3, 3, 1, 2, 3, 2, 1, 1, 1, 0, 3, 2, 3, 2, 0, 1,
2, 0, 1, 3, 3, 3, 1, 1, 3, 0, 1, 0, 1, 0, 2, 0, 1, 0, 1, 2, 3, 3,
0, 2, 2, 3, 2, 1, 3, 2, 2, 1, 2, 0, 3, 2, 3, 2, 1, 2, 1, 2, 0, 3,
0, 0, 2, 0, 2, 0, 0, 1, 0, 1, 1, 2, 0, 1, 2, 1, 0, 3, 2, 1, 3, 0,
1, 1, 0, 2, 2, 0, 1, 0, 1, 1, 2, 0, 1, 2, 2, 3, 1, 2, 1, 2, 0, 2,
1, 0, 0, 3, 1, 0, 3, 2, 0, 2, 2, 0, 2, 2, 3, 0, 1, 0, 0, 2, 3, 1,
1, 2, 3, 3, 3, 1, 2, 1, 2, 2, 3, 3, 0, 1, 0, 1, 2, 2, 3, 1, 3, 0,
3, 2, 2, 2, 0, 1, 3, 1, 3, 1, 1, 2, 2, 1, 1, 1, 2, 1, 0, 0, 1, 1,
1, 3, 2, 0, 2, 2, 3, 2, 3, 3, 3, 3, 0, 0, 0, 3, 0, 0, 3, 3, 2, 3,
2, 2, 0, 0, 3, 1, 2, 1, 1, 3, 3, 2, 1, 0, 3, 0, 3, 1, 0, 0, 0, 1,
1, 3, 3, 3, 0, 1, 0, 2, 1, 1, 2, 3, 2, 3, 1, 3, 3, 3, 3, 3, 0, 1,
1, 3, 1, 0, 3, 0, 0, 1, 2, 0])

# Model predictions (cluster indices are arbitrary and need not match the labels in Y)
y_hat = km.predict(X[:10])
y_hat

array([2, 3, 2, 2, 0, 2, 2, 1, 2, 1])
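The predicted indices differ from the values in Y because K-Means numbers its clusters arbitrarily; predict simply assigns each sample to the nearest cluster center. A minimal sketch of that rule, written with NumPy, is shown here. The names X_demo, km_demo and manual_labels are assumptions made for this example, and n_init=10 is set explicitly only to avoid version-dependent defaults.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=1000, n_features=2, centers=4, random_state=28)
km_demo = KMeans(n_clusters=4, init="random", n_init=10, random_state=28).fit(X_demo)

# Squared Euclidean distance from every sample to every center: shape (1000, 4)
diff = X_demo[:, None, :] - km_demo.cluster_centers_[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

manual_labels = sq_dist.argmin(axis=1)  # index of the nearest center
sk_labels = km_demo.predict(X_demo)

agreement = np.mean(manual_labels == sk_labels)
print("fraction of identical assignments:", agreement)
```

The two assignments should agree on essentially every sample; a sample lying almost exactly between two centers is the only place where rounding could make them differ.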

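print("Total sum of squared distances from samples to their cluster centers: %.5f" % km.inertia_)
print("Mean squared distance from a sample to its cluster center: %.5f" % (km.inertia_ / N))

print("Coordinates of all cluster centers:")
cluster_centers = km.cluster_centers_
print(cluster_centers)

print("score is the negative of the sum of squared distances from samples to their cluster centers:")
print(km.score(X))

Total sum of squared distances from samples to their cluster centers: 1764.19457
Mean squared distance from a sample to its cluster center: 1.76419
Coordinates of all cluster centers:
[[-6.32351035 7.09545595]
[-7.51888142 -2.01003574]
[ 6.0528514 0.24636947]
[ 4.26881816 1.08317321]]
-1764.19457007324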
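Note that inertia_ is a sum of squared distances, not of plain distances. The following sketch recomputes it by hand to make that explicit; X_demo, km_demo and manual_inertia are names assumed for this example, and n_init=10 is set explicitly only to avoid version-dependent defaults.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=1000, n_features=2, centers=4, random_state=28)
km_demo = KMeans(n_clusters=4, init="random", n_init=10, random_state=28).fit(X_demo)

# Center assigned to each sample, then squared distance to that center
assigned = km_demo.cluster_centers_[km_demo.labels_]
sq_dist = ((X_demo - assigned) ** 2).sum(axis=1)

manual_inertia = sq_dist.sum()
print("manual sum of squared distances:", manual_inertia)
print("km.inertia_:                    ", km_demo.inertia_)
print("km.score(X):                    ", km_demo.score(X_demo))

# For contrast: the sum of plain (unsquared) distances is a different quantity
print("sum of unsquared distances:     ", np.sqrt(sq_dist).sum())
```

The first two printed values should match up to floating-point rounding, and score should be their negative.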

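2) Effect of the data distribution on K-Means

Now that we are familiar with the API, let's see how different data distributions affect the K-Means algorithm.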

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import sklearn.datasets as ds
import matplotlib.colors
from sklearn.cluster import KMeans  # import KMeans

## Font settings so that Chinese characters and minus signs in plot labels render correctly
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False

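## Generate simulated data
N = 1500
centers = 4
data, y = ds.make_blobs(N, n_features=2, centers=centers, random_state=28)
# data2 is meant to have a different variance per cluster. The cluster_std values
# below are illustrative assumptions; without cluster_std, data2 is identical to data.
data2, y2 = ds.make_blobs(N, n_features=2, centers=centers, cluster_std=(1, 2.5, 0.5, 2), random_state=28)
# data3 has a different number of samples in each cluster
data3 = np.vstack((data[y == 0][:200], data[y == 1][:100], data[y == 2][:10], data[y == 3][:50]))
y3 = np.array([0] * 200 + [1] * 100 + [2] * 10 + [3] * 50)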

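# 1. Data preprocessing is the same as for the earlier models
# 2. Build the model
km = KMeans(n_clusters=centers, init='random', random_state=28)
# n_clusters is the value of K, i.e. the number of clusters
# init is the initialization method: 'k-means++', 'random', or a user-supplied ndarray
km.fit(data, y)  # y is optional and ignored; it is passed only so the call looks like the supervised examples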

y_hat = km.predict(data)
print("Total sum of squared distances from samples to their cluster centers:", km.inertia_)
print("Mean squared distance to the cluster center:", (km.inertia_ / N))
cluster_centers = km.cluster_centers_
print("Cluster centers:", cluster_centers)

Total sum of squared distances from samples to their cluster centers: 2592.9990199021127
Mean squared distance to the cluster center: 1.7286660132680751
Cluster centers: [[-7.44342199e+00 -2.00152176e+00]
[ 5.80338598e+00 2.75272962e-03]
[-6.36176159e+00 6.94997331e+00]
[ 4.34372837e+00 1.33977807e+00]]

y_hat2 = km.fit_predict(data2)
y_hat3 = km.fit_predict(data3)

def expandBorder(a, b):
    d = (b - a) * 0.1
    return a-d, b+d

## 5. Plotting
cm = mpl.colors.ListedColormap(list('rgbmyc'))
plt.figure(figsize=(15, 9), facecolor='w')
plt.subplot(241)
plt.scatter(data[:, 0], data[:, 1], c=y, s=30, cmap=cm, edgecolors='none')

x1_min, x2_min = np.min(data, axis=0)
x1_max, x2_max = np.max(data, axis=0)
x1_min, x1_max = expandBorder(x1_min, x1_max)
x2_min, x2_max = expandBorder(x2_min, x2_max)
plt.xlim((x1_min, x1_max))
plt.ylim((x2_min, x2_max))
plt.title(u'Original data')
plt.grid(True)

plt.subplot(242)
plt.scatter(data[:, 0], data[:, 1], c=y_hat, s=30, cmap=cm, edgecolors='none')
plt.xlim((x1_min, x1_max))
plt.ylim((x2_min, x2_max))
plt.title(u'K-Means clustering result')
plt.grid(True)

m = np.array(((1, 1), (0.5, 5)))
data_r = data.dot(m)
y_r_hat = km.fit_predict(data_r)
plt.subplot(243)
plt.scatter(data_r[:, 0], data_r[:, 1], c=y, s=30, cmap=cm, edgecolors='none')

x1_min, x2_min = np.min(data_r, axis=0)
x1_max, x2_max = np.max(data_r, axis=0)
x1_min, x1_max = expandBorder(x1_min, x1_max)
x2_min, x2_max = expandBorder(x2_min, x2_max)

plt.xlim((x1_min, x1_max))
plt.ylim((x2_min, x2_max))
plt.title(u'Original data after linear transformation')
plt.grid(True)

plt.subplot(244)
plt.scatter(data_r[:, 0], data_r[:, 1], c=y_r_hat, s=30, cmap=cm, edgecolors='none')
plt.xlim((x1_min, x1_max))
plt.ylim((x2_min, x2_max))
plt.title(u'K-Means prediction after linear transformation')
plt.grid(True)

plt.subplot(245)
plt.scatter(data2[:, 0], data2[:, 1], c=y2, s=30, cmap=cm, edgecolors='none')
x1_min, x2_min = np.min(data2, axis=0)
x1_max, x2_max = np.max(data2, axis=0)
x1_min, x1_max = expandBorder(x1_min, x1_max)
x2_min, x2_max = expandBorder(x2_min, x2_max)
plt.xlim((x1_min, x1_max))
plt.ylim((x2_min, x2_max))
plt.title(u'Original data with unequal variances')
plt.grid(True)

plt.subplot(246)
plt.scatter(data2[:, 0], data2[:, 1], c=y_hat2, s=30, cmap=cm, edgecolors='none')
plt.xlim((x1_min, x1_max))
plt.ylim((x2_min, x2_max))
plt.title(u'K-Means result with unequal variances')
plt.grid(True)

plt.subplot(247)
plt.scatter(data3[:, 0], data3[:, 1], c=y3, s=30, cmap=cm, edgecolors='none')
x1_min, x2_min = np.min(data3, axis=0)
x1_max, x2_max = np.max(data3, axis=0)
x1_min, x1_max = expandBorder(x1_min, x1_max)
x2_min, x2_max = expandBorder(x2_min, x2_max)
plt.xlim((x1_min, x1_max))
plt.ylim((x2_min, x2_max))
plt.title(u'Original data with unequal cluster sizes')
plt.grid(True)

plt.subplot(248)
plt.scatter(data3[:, 0], data3[:, 1], c=y_hat3, s=30, cmap=cm, edgecolors='none')
plt.xlim((x1_min, x1_max))
plt.ylim((x2_min, x2_max))
plt.title(u'K-Means result with unequal cluster sizes')
plt.grid(True)

plt.tight_layout(pad=2, rect=(0, 0, 1, 0.97))  # pad must be passed by keyword in recent matplotlib
plt.suptitle(u'Effect of data distribution on K-Means clustering', fontsize=18)
plt.show()

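Let's look at the result. The figure (image not reproduced here) is a 2×4 grid of scatter plots: each pair of panels shows the true labels next to the K-Means assignments for the original data, the linearly transformed data, the data with unequal variances, and the data with unequal cluster sizes.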
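The visual comparison can also be summarized numerically. The sketch below is an addition to the article: it scores each scenario with the adjusted Rand index (adjusted_rand_score from sklearn.metrics), which compares two labelings regardless of how the clusters are numbered. The cluster_std values for the unequal-variance case are illustrative assumptions, and n_init=10 is set explicitly only to avoid version-dependent defaults.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

N, centers = 1500, 4
data, y = make_blobs(N, n_features=2, centers=centers, random_state=28)

# Assumed illustrative spreads for the unequal-variance case
data2, y2 = make_blobs(N, n_features=2, centers=centers,
                       cluster_std=(1, 2.5, 0.5, 2), random_state=28)

# Unequal cluster sizes: 200 / 100 / 10 / 50 samples, as in the text
data3 = np.vstack((data[y == 0][:200], data[y == 1][:100],
                   data[y == 2][:10], data[y == 3][:50]))
y3 = np.array([0] * 200 + [1] * 100 + [2] * 10 + [3] * 50)

# Same linear transformation as in the text
data_r = data.dot(np.array(((1, 1), (0.5, 5))))

scenarios = {
    "original": (data, y),
    "linearly transformed": (data_r, y),
    "unequal variance": (data2, y2),
    "unequal cluster sizes": (data3, y3),
}

scores = {}
for name, (X_s, y_s) in scenarios.items():
    km_s = KMeans(n_clusters=centers, init="random", n_init=10, random_state=28)
    scores[name] = adjusted_rand_score(y_s, km_s.fit_predict(X_s))
    print(f"{name:>22s}: ARI = {scores[name]:.3f}")
```

A value of 1 means the clustering reproduces the generating labels exactly, and values near 0 mean it does no better than chance, so a lower score marks a distribution that K-Means handles less well.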

Author: 张连海

Machine Learning (Clustering, Part 3): K-Means Code Implementation