The following example clusters beers by their ingredient content in order to group the different brands.
1. Import and prepare the data
import pandas as pd
beer = pd.read_csv('data.txt', sep=' ')
beer
| | name | calories | sodium | alcohol | cost |
---|---|---|---|---|---|
0 | Budweiser | 144 | 15 | 4.7 | 0.43 |
1 | Schlitz | 151 | 19 | 4.9 | 0.43 |
2 | Lowenbrau | 157 | 15 | 0.9 | 0.48 |
3 | Kronenbourg | 170 | 7 | 5.2 | 0.73 |
4 | Heineken | 152 | 11 | 5.0 | 0.77 |
5 | Old_Milwaukee | 145 | 23 | 4.6 | 0.28 |
6 | Augsberger | 175 | 24 | 5.5 | 0.40 |
7 | Srohs_Bohemian_Style | 149 | 27 | 4.7 | 0.42 |
8 | Miller_Lite | 99 | 10 | 4.3 | 0.43 |
9 | Budweiser_Light | 113 | 8 | 3.7 | 0.40 |
10 | Coors | 140 | 18 | 4.6 | 0.44 |
11 | Coors_Light | 102 | 15 | 4.1 | 0.46 |
12 | Michelob_Light | 135 | 11 | 4.2 | 0.50 |
13 | Becks | 150 | 19 | 4.7 | 0.76 |
14 | Kirin | 149 | 6 | 5.0 | 0.79 |
15 | Pabst_Extra_Light | 68 | 15 | 2.3 | 0.38 |
16 | Hamms | 139 | 19 | 4.4 | 0.43 |
17 | Heilemans_Old_Style | 144 | 24 | 4.9 | 0.43 |
18 | Olympia_Goled_Light | 72 | 6 | 2.9 | 0.46 |
19 | Schlitz_Light | 97 | 7 | 4.2 | 0.47 |
X = beer[["calories","sodium","alcohol","cost"]]
# Import the KMeans class to cluster with K-means
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3).fit(X)
km2 = KMeans(n_clusters=2).fit(X)
# Add the two label columns, cluster and cluster2, directly to the data
beer['cluster'] = km.labels_
beer['cluster2'] = km2.labels_
beer.sort_values('cluster')
| | name | calories | sodium | alcohol | cost | cluster | cluster2 |
---|---|---|---|---|---|---|---|
0 | Budweiser | 144 | 15 | 4.7 | 0.43 | 0 | 1 |
1 | Schlitz | 151 | 19 | 4.9 | 0.43 | 0 | 1 |
2 | Lowenbrau | 157 | 15 | 0.9 | 0.48 | 0 | 1 |
3 | Kronenbourg | 170 | 7 | 5.2 | 0.73 | 0 | 1 |
4 | Heineken | 152 | 11 | 5.0 | 0.77 | 0 | 1 |
5 | Old_Milwaukee | 145 | 23 | 4.6 | 0.28 | 0 | 1 |
6 | Augsberger | 175 | 24 | 5.5 | 0.40 | 0 | 1 |
7 | Srohs_Bohemian_Style | 149 | 27 | 4.7 | 0.42 | 0 | 1 |
17 | Heilemans_Old_Style | 144 | 24 | 4.9 | 0.43 | 0 | 1 |
16 | Hamms | 139 | 19 | 4.4 | 0.43 | 0 | 1 |
10 | Coors | 140 | 18 | 4.6 | 0.44 | 0 | 1 |
14 | Kirin | 149 | 6 | 5.0 | 0.79 | 0 | 1 |
12 | Michelob_Light | 135 | 11 | 4.2 | 0.50 | 0 | 1 |
13 | Becks | 150 | 19 | 4.7 | 0.76 | 0 | 1 |
9 | Budweiser_Light | 113 | 8 | 3.7 | 0.40 | 1 | 0 |
8 | Miller_Lite | 99 | 10 | 4.3 | 0.43 | 1 | 0 |
11 | Coors_Light | 102 | 15 | 4.1 | 0.46 | 1 | 0 |
19 | Schlitz_Light | 97 | 7 | 4.2 | 0.47 | 1 | 0 |
15 | Pabst_Extra_Light | 68 | 15 | 2.3 | 0.38 | 2 | 0 |
18 | Olympia_Goled_Light | 72 | 6 | 2.9 | 0.46 | 2 | 0 |
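K-means starts from randomly chosen centroids, so the label numbers (0, 1, 2) can change from run to run even when the grouping itself is stable. A minimal sketch of pinning the result with `random_state`, using six rows copied from the table above; the values `random_state=0` and `n_init=10` are illustrative choices, not settings taken from the original run:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Six rows copied from the beer table above
sample = pd.DataFrame({
    "name": ["Budweiser", "Schlitz", "Miller_Lite", "Coors_Light",
             "Pabst_Extra_Light", "Olympia_Goled_Light"],
    "calories": [144, 151, 99, 102, 68, 72],
    "sodium": [15, 19, 10, 15, 15, 6],
    "alcohol": [4.7, 4.9, 4.3, 4.1, 2.3, 2.9],
    "cost": [0.43, 0.43, 0.43, 0.46, 0.38, 0.46],
})
features = sample[["calories", "sodium", "alcohol", "cost"]]

# With the same random_state, two separate fits return identical labels
labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features).labels_
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features).labels_
print(labels_a, labels_b)
```

Without a fixed seed, compare groupings rather than raw label numbers when checking results against the table.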
2. Compute the per-feature means of each cluster
# The import below is outdated: plotting has been moved out of pandas.tools
## from pandas.tools.plotting import scatter_matrix
from pandas.plotting import scatter_matrix
%matplotlib inline
cluster_centers = km.cluster_centers_
cluster_centers_2 = km2.cluster_centers_
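# numeric_only=True skips the string column "name", which recent pandas versions refuse to average
beer.groupby("cluster").mean(numeric_only=True)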
cluster | calories | sodium | alcohol | cost | cluster2 | scaled_cluster |
---|---|---|---|---|---|---|
0 | 150.00 | 17.0 | 4.521429 | 0.520714 | 1 | 0 |
1 | 102.75 | 10.0 | 4.075000 | 0.440000 | 0 | 1 |
2 | 70.00 | 10.5 | 2.600000 | 0.420000 | 0 | 2 |
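The group means above should coincide with the centroids that K-means stores in `cluster_centers_`, because each centroid is the mean of the points assigned to it once the algorithm has converged. A small check of that relationship on six beers from step 1 (only two features are used here to keep the sketch short; `random_state=0` is an arbitrary choice):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Calories and alcohol for six beers, copied from the table in step 1
sample = pd.DataFrame({
    "calories": [144, 151, 99, 102, 68, 72],
    "alcohol": [4.7, 4.9, 4.3, 4.1, 2.3, 2.9],
})
km_sample = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sample)

# Mean of each feature within each cluster, sorted by label 0, 1, 2
group_means = sample.groupby(km_sample.labels_).mean()

# cluster_centers_[k] is the centroid of label k, so the rows line up
centers_match = np.allclose(group_means.to_numpy(), km_sample.cluster_centers_)
print(centers_match)
```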
beer.groupby("cluster2").mean(numeric_only=True)
cluster2 | calories | sodium | alcohol | cost | cluster | scaled_cluster |
---|---|---|---|---|---|---|
0 | 91.833333 | 10.166667 | 3.583333 | 0.433333 | 1.333333 | 1.333333 |
1 | 150.000000 | 17.000000 | 4.521429 | 0.520714 | 0.000000 | 0.000000 |
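The tables above contain a `scaled_cluster` column that this section never creates. One way such a column might be produced, assuming it holds K-means labels fitted on standardized features; the use of `StandardScaler` is an assumption, not something the original code shows:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Six rows copied from the table in step 1
sample = pd.DataFrame({
    "calories": [144, 151, 99, 102, 68, 72],
    "sodium": [15, 19, 10, 15, 15, 6],
    "alcohol": [4.7, 4.9, 4.3, 4.1, 2.3, 2.9],
    "cost": [0.43, 0.43, 0.43, 0.46, 0.38, 0.46],
})

# Rescale every feature to zero mean and unit variance so that calories,
# which has the largest numeric range, does not dominate the distances
X_scaled = StandardScaler().fit_transform(sample)

# Hypothetical reconstruction of the scaled_cluster column
scaled_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled).labels_
sample["scaled_cluster"] = scaled_labels
print(sample)
```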
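3. Plotting and analysis
The clustering result on the calories and alcohol dimensions is shown below.
# Compute the cluster centers
centers = beer.groupby("cluster").mean(numeric_only=True).reset_index()
# Plot the clustering result for 3 clusters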
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14
# Define four colors
import numpy as np
colors = np.array(['red', 'green', 'blue', 'yellow'])
# Plot calories (x) against alcohol (y), colored by cluster
plt.scatter(beer["calories"], beer["alcohol"],c=colors[beer["cluster"]])
plt.scatter(centers.calories, centers.alcohol, linewidths=3, marker='+', s=300, c='black')
plt.xlabel("Calories")
plt.ylabel("Alcohol")
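The `c=colors[beer["cluster"]]` argument relies on NumPy integer-array indexing: indexing the color array with an array of labels returns one color per sample. A tiny illustration, where the label values are made up for the example:

```python
import numpy as np

colors = np.array(['red', 'green', 'blue', 'yellow'])
labels = np.array([0, 0, 1, 2, 1])  # hypothetical cluster labels

# Indexing with an integer array picks colors[label] for every label
point_colors = colors[labels]
print(point_colors)  # ['red' 'red' 'green' 'blue' 'green']
```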
Next, draw a scatter plot for every pair of features, colored by cluster.
- Result for K = 3
scatter_matrix(beer[["calories","sodium","alcohol","cost"]],s=100, alpha=1, c=colors[beer["cluster"]], figsize=(10,10))
plt.suptitle("With 3 centroids initialized")
- Result for K = 2
scatter_matrix(beer[["calories","sodium","alcohol","cost"]],s=100, alpha=1, c=colors[beer["cluster2"]], figsize=(10,10))
plt.suptitle("With 2 centroids initialized")
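4. Silhouette coefficient analysis
Because the results for k = 2 and k = 3 above look similar, we can bring in the silhouette coefficient, a measure of clustering quality, to compare them.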
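- Compute ai, the average distance from sample i to the other samples in its own cluster. The smaller ai is, the more sample i belongs in that cluster. ai is called the intra-cluster dissimilarity of sample i.
- Compute bij, the average distance from sample i to all samples of another cluster Cj; this is the dissimilarity between sample i and cluster Cj. The inter-cluster dissimilarity of sample i is defined as bi = min{bi1, bi2, …, bik}.
- The silhouette coefficient of sample i is si = (bi - ai) / max(ai, bi).
- If si is close to 1, sample i is clustered reasonably.
- If si is close to -1, sample i would fit better in another cluster.
- If si is approximately 0, sample i lies on the boundary between two clusters.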
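The definitions above can be sketched directly in NumPy and compared with scikit-learn's `silhouette_samples`. The helper name `silhouette_by_hand` is hypothetical, Euclidean distance is assumed, and the six points are calories and alcohol values copied from the table in step 1 with labels assigned by hand:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_by_hand(X, labels):
    """Per-sample silhouette s_i = (b_i - a_i) / max(a_i, b_i)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a cluster with one sample gets s_i = 0
        # a_i: mean distance to the other members of the same cluster
        a_i = dist[i, same].sum() / (n_same - 1)
        # b_i: smallest mean distance to the members of any other cluster
        b_i = min(dist[i, labels == c].mean()
                  for c in np.unique(labels) if c != labels[i])
        scores[i] = (b_i - a_i) / max(a_i, b_i)
    return scores

X_sample = np.array([[144, 4.7], [151, 4.9], [99, 4.3],
                     [102, 4.1], [68, 2.3], [72, 2.9]])
labels_sample = np.array([0, 0, 1, 1, 2, 2])

by_hand = silhouette_by_hand(X_sample, labels_sample)
by_sklearn = silhouette_samples(X_sample, labels_sample)
print(by_hand)
print(by_sklearn)
```

`metrics.silhouette_score` is the mean of these per-sample values.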
from sklearn import metrics
# Note: scaled_cluster is not created in this section; this line assumes the column already exists in beer
score_scaled = metrics.silhouette_score(X,beer.scaled_cluster)
score = metrics.silhouette_score(X,beer.cluster)
print(score_scaled, score)
# Silhouette score for each k from 2 to 19
scores = []
for k in range(2,20):
    labels = KMeans(n_clusters=k).fit(X).labels_
    score = metrics.silhouette_score(X, labels)
    scores.append(score)
plt.plot(list(range(2,20)), scores)
plt.xlabel("Number of Clusters Initialized")
plt.ylabel("Silhouette Score")
0.6731775046455796 0.6731775046455796
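The plot shows that the silhouette coefficient is closest to 1 when n_clusters = 2, so that setting fits best. In clustering, however, an evaluation metric is only a reference; a real dataset still needs to be analyzed case by case.
When modeling with the sklearn toolkit, switching to a different algorithm is very convenient: only the estimator class needs to change.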
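Instead of reading the best k off the plot, the scores can be kept in a dictionary and the maximum picked programmatically. A sketch on eight rows copied from the table in step 1; the range of k and `random_state=0` are illustrative assumptions, and the result on this subset need not match the full dataset:

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

# calories, sodium, alcohol, cost for eight beers from step 1
X_sample = np.array([
    [144, 15, 4.7, 0.43], [151, 19, 4.9, 0.43],
    [170, 7, 5.2, 0.73], [152, 11, 5.0, 0.77],
    [99, 10, 4.3, 0.43], [102, 15, 4.1, 0.46],
    [68, 15, 2.3, 0.38], [72, 6, 2.9, 0.46],
])

# Silhouette score for each candidate number of clusters
scores_by_k = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_sample).labels_
    scores_by_k[k] = metrics.silhouette_score(X_sample, labels)

# The k with the highest score is the candidate the metric prefers
best_k = max(scores_by_k, key=scores_by_k.get)
print(best_k, scores_by_k[best_k])
```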
from pandas.plotting import scatter_matrix
# Switch to the DBSCAN algorithm
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=10, min_samples=2).fit(X)
labels = db.labels_
beer['cluster_db'] = labels
beer.sort_values('cluster_db')
beer.groupby('cluster_db').mean(numeric_only=True)
# DBSCAN labels noise points as -1, which indexes the last color ('yellow')
scatter_matrix(X, c=colors[beer.cluster_db], figsize=(10,10), s=100)
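Unlike K-means, DBSCAN chooses the number of clusters itself and marks samples that belong to no dense region as noise with the label -1. A small sketch with the same `eps=10, min_samples=2` settings on six rows copied from the table in step 1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# calories, sodium, alcohol, cost
X_sample = np.array([
    [144, 15, 4.7, 0.43],  # Budweiser
    [151, 19, 4.9, 0.43],  # Schlitz
    [140, 18, 4.6, 0.44],  # Coors
    [99, 10, 4.3, 0.43],   # Miller_Lite
    [102, 15, 4.1, 0.46],  # Coors_Light
    [68, 15, 2.3, 0.38],   # Pabst_Extra_Light
])
labels = DBSCAN(eps=10, min_samples=2).fit(X_sample).labels_

# Count clusters without the noise label, and count noise points separately
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(labels, n_clusters, n_noise)  # [ 0  0  0  1  1 -1] 2 1
```

Schlitz and Coors are more than 10 apart, yet both lie within 10 of Budweiser, so they end up in one cluster; Pabst_Extra_Light has no neighbor within 10 and is labeled as noise.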