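The K-means Algorithm

1 The K-means Algorithm

K-means is one of the most classic and easy-to-use data clustering models.

The algorithm requires the number of clusters to be fixed in advance. It then iteratively updates the cluster centers until the sum of squared distances from every data point to its assigned cluster center stabilizes.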
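
Writing the objective out can make it more concrete. The notation here is chosen for illustration and does not come from the original text: x_i is the i-th of n data points, c_i is the index of the cluster it is currently assigned to, and \mu_k is the center of cluster k.

```latex
J = \sum_{i=1}^{n} \left\lVert x_i - \mu_{c_i} \right\rVert^2
```

Neither reassigning points to their nearest center nor moving a center to the mean of its points can increase J, which is why the iteration settles down.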

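The algorithm proceeds as follows:

<1> Randomly place K points in the feature space as the initial cluster centers.

<2> For each data point, use its feature vector to find the nearest of the K cluster centers, and label the point as belonging to that center.

<3> Once every data point has been labeled, recompute each of the K cluster centers as the mean of the points assigned to its cluster.

<4> If, after a full round, no data point's assignment has changed from the previous round, stop iterating; otherwise return to step <2> and continue.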
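
The four steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation used later in this post. The function name `kmeans_sketch`, the `max_iter` and `seed` parameters, and the choice to draw the initial centers from the data points themselves are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-means on an (n_samples, n_features) array. Illustrative only."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step <1>: pick k distinct data points as initial centers (an assumption;
    # the text only says the centers are placed randomly).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step <2>: assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step <4>: stop once no assignment changed since the previous round.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step <3>: move each center to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers, labels
```

Calling `kmeans_sketch(X, 2)` on a small array with two well-separated groups should return one center near each group. Like any K-means run, the result can depend on the random initialization.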

2 Experiment Code and Result Screenshots

This experiment uses handwritten digit images as the data source to analyze the K-means algorithm.

#coding:utf-8
# Import modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Read the training data and the test data
digits_train=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra',header=None)
digits_test=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes',header=None)

# Separate the 64-dimensional pixel features from the 1-dimensional digit target
x_train=digits_train[np.arange(64)]
y_train=digits_train[64]

x_test=digits_test[np.arange(64)]
y_test=digits_test[64]

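# Import the KMeans model
from sklearn.cluster import KMeans
# Initialize the model with 10 cluster centers
kmeans=KMeans(n_clusters=10)
kmeans.fit(x_train)
# Assign each test image to its nearest cluster center
y_predict=kmeans.predict(x_test)

# Evaluate clustering performance with the Adjusted Rand Index (ARI)
from sklearn import metrics
print(metrics.adjusted_rand_score(y_test, y_predict))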

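# Use the silhouette coefficient to evaluate K-means with different numbers of clusters
# Import silhouette_score to compute the silhouette coefficient
from sklearn.metrics import silhouette_score

# Split the figure into a 3x2 grid of subplots and draw in subplot 1
plt.subplot(3,2,1)
# Initialize the raw data points
x1=np.array([1,2,3,1,5,6,5,5,6,7,8,9,7,9])
x2=np.array([1,3,2,2,8,6,7,6,7,1,2,1,1,3])
# list() is needed because zip returns an iterator in Python 3
X=np.array(list(zip(x1,x2))).reshape(len(x1),2)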

# Plot the distribution of the raw data points in subplot 1
plt.xlim([0,10])
plt.ylim([0,10])
plt.title('Instances')
plt.scatter(x1, x2)
colors=['b','g','r','c','m','y','k','b']
markers=['o','s','D','v','^','p','*','+']
clusters=[2,3,4,5,8]
subplot_counter=1
sc_scores=[]
for t in clusters:
    subplot_counter+=1
    plt.subplot(3,2,subplot_counter)
    kmeans_model=KMeans(n_clusters=t).fit(X)

    # Draw each point with the color and marker of its cluster
    for i,l in enumerate(kmeans_model.labels_):
        plt.plot(x1[i],x2[i],color=colors[l],marker=markers[l],ls='None')
    plt.xlim([0,10])
    plt.ylim([0,10])
    # Score each clustering once, so sc_scores has one entry per value in clusters
    sc_score=silhouette_score(X, kmeans_model.labels_,metric='euclidean')
    sc_scores.append(sc_score)
    plt.title('K=%s,silhouette coefficient=%0.03f'%(t,sc_score))
plt.figure()
plt.plot(clusters,sc_scores,'*-')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Coefficient Score')
plt.show()

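Figure: K-means clustering results for different numbers of clusters, evaluated with the silhouette coefficient

Figure: silhouette coefficient versus number of clusters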
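
To make the scores in these figures less of a black box, the silhouette coefficient can also be computed by hand. For each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the point's score is (b - a) / max(a, b). The overall score is the mean over all points. The sketch below is an illustration under those definitions. The function name `silhouette_by_hand` is made up for this example, it assumes Euclidean distance and at least two clusters, and it scores singleton clusters as 0.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean silhouette coefficient with Euclidean distances. Illustrative only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix, shape (n, n).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: leave the score at 0
        # a: mean distance to the other members of the same cluster
        # (the distance of the point to itself is 0, so only the divisor changes)
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean() for c in cluster_ids if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()
```

Scores close to 1 indicate compact, well-separated clusters, while scores near 0 or below suggest overlapping clusters or a poor choice of K.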
