Machine learning: clustering algorithms

The following example clusters beer brands by their measured attributes (calories, sodium, alcohol, and cost).

1. Import and process the data

import pandas as pd
beer = pd.read_csv('data.txt', sep=' ')
beer
name calories sodium alcohol cost
0 Budweiser 144 15 4.7 0.43
1 Schlitz 151 19 4.9 0.43
2 Lowenbrau 157 15 0.9 0.48
3 Kronenbourg 170 7 5.2 0.73
4 Heineken 152 11 5.0 0.77
5 Old_Milwaukee 145 23 4.6 0.28
6 Augsberger 175 24 5.5 0.40
7 Srohs_Bohemian_Style 149 27 4.7 0.42
8 Miller_Lite 99 10 4.3 0.43
9 Budweiser_Light 113 8 3.7 0.40
10 Coors 140 18 4.6 0.44
11 Coors_Light 102 15 4.1 0.46
12 Michelob_Light 135 11 4.2 0.50
13 Becks 150 19 4.7 0.76
14 Kirin 149 6 5.0 0.79
15 Pabst_Extra_Light 68 15 2.3 0.38
16 Hamms 139 19 4.4 0.43
17 Heilemans_Old_Style 144 24 4.9 0.43
18 Olympia_Goled_Light 72 6 2.9 0.46
19 Schlitz_Light 97 7 4.2 0.47
X = beer[["calories","sodium","alcohol","cost"]]

# Import the KMeans class to cluster with K-means
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3).fit(X)
km2 = KMeans(n_clusters=2).fit(X)

# Add the cluster and cluster2 columns directly to the data
beer['cluster'] = km.labels_
beer['cluster2'] = km2.labels_
beer.sort_values('cluster')
name calories sodium alcohol cost cluster cluster2
0 Budweiser 144 15 4.7 0.43 0 1
1 Schlitz 151 19 4.9 0.43 0 1
2 Lowenbrau 157 15 0.9 0.48 0 1
3 Kronenbourg 170 7 5.2 0.73 0 1
4 Heineken 152 11 5.0 0.77 0 1
5 Old_Milwaukee 145 23 4.6 0.28 0 1
6 Augsberger 175 24 5.5 0.40 0 1
7 Srohs_Bohemian_Style 149 27 4.7 0.42 0 1
17 Heilemans_Old_Style 144 24 4.9 0.43 0 1
16 Hamms 139 19 4.4 0.43 0 1
10 Coors 140 18 4.6 0.44 0 1
14 Kirin 149 6 5.0 0.79 0 1
12 Michelob_Light 135 11 4.2 0.50 0 1
13 Becks 150 19 4.7 0.76 0 1
9 Budweiser_Light 113 8 3.7 0.40 1 0
8 Miller_Lite 99 10 4.3 0.43 1 0
11 Coors_Light 102 15 4.1 0.46 1 0
19 Schlitz_Light 97 7 4.2 0.47 1 0
15 Pabst_Extra_Light 68 15 2.3 0.38 2 0
18 Olympia_Goled_Light 72 6 2.9 0.46 2 0
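
K-means starts from a random initialization, so the cluster numbers (0, 1, 2) are arbitrary and can change from run to run. The sketch below is one way to make a run repeatable. It assumes a handful of rows copied from the table above and loaded from an in-memory string rather than data.txt; the names sample, features, labels_a and labels_b, and the choices random_state=0 and n_init=10, are illustrative and not taken from the original notebook.

```python
import io

import pandas as pd
from sklearn.cluster import KMeans

# A few rows copied from the table above, in the space-separated layout
# that pd.read_csv(..., sep=' ') expects.
raw = """name calories sodium alcohol cost
Budweiser 144 15 4.7 0.43
Heineken 152 11 5.0 0.77
Coors 140 18 4.6 0.44
Miller_Lite 99 10 4.3 0.43
Coors_Light 102 15 4.1 0.46
Pabst_Extra_Light 68 15 2.3 0.38
"""
sample = pd.read_csv(io.StringIO(raw), sep=" ")
features = sample[["calories", "sodium", "alcohol", "cost"]]

# random_state fixes the random initialization (assumed value 0);
# n_init=10 keeps the best of 10 starts.
labels_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features).labels_
labels_b = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features).labels_

print(labels_a)
print((labels_a == labels_b).all())  # same seed, same labels
```

Without a fixed seed the grouping of the beers is usually the same, but which group is called 0 and which is called 1 may swap.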

2. Calculate the mean of each feature within each cluster

# The old import below no longer works: plotting has been moved out of pandas.tools
## from pandas.tools.plotting import scatter_matrix
from pandas.plotting import scatter_matrix
%matplotlib inline

cluster_centers = km.cluster_centers_
cluster_centers_2 = km2.cluster_centers_
beer.groupby("cluster").mean(numeric_only=True)
calories sodium alcohol cost cluster2 scaled_cluster
cluster
0 150.00 17.0 4.521429 0.520714 1 0
1 102.75 10.0 4.075000 0.440000 0 1
2 70.00 10.5 2.600000 0.420000 0 2
beer.groupby("cluster2").mean(numeric_only=True)
calories sodium alcohol cost cluster scaled_cluster
cluster2
0 91.833333 10.166667 3.583333 0.433333 1.333333 1.333333
1 150.000000 17.000000 4.521429 0.520714 0.000000 0.000000
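
The code above stores km.cluster_centers_ but never compares it with the group means. As a rough check, the sketch below suggests that once K-means has converged, each row of cluster_centers_ is the mean of the points assigned to that cluster. It uses a few calories and alcohol values taken from the table; the names toy, km_toy and group_means and the settings random_state=0 and n_init=10 are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# A few calories/alcohol values from the table, forming two separated groups.
toy = pd.DataFrame({
    "calories": [144, 152, 140, 99, 102, 68],
    "alcohol": [4.7, 5.0, 4.6, 4.3, 4.1, 2.3],
})

km_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Per-cluster means computed by pandas, ordered by cluster label 0, 1.
group_means = toy.groupby(km_toy.labels_).mean()

# Row i of cluster_centers_ is the centroid of the cluster labelled i.
print(group_means.to_numpy())
print(km_toy.cluster_centers_)
print(np.allclose(group_means.to_numpy(), km_toy.cluster_centers_))
```

This is why grouping by the label column and taking the mean, as done above, is a convenient way to read off the centers with column names attached.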

3. Plot analysis

The following shows the clustering result on two of the features, calories and alcohol

# Compute the cluster centers
centers = beer.groupby("cluster").mean(numeric_only=True).reset_index()

# Plot the clustering result for 3 clusters
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14

# Define four colors
import numpy as np
colors = np.array(['red', 'green', 'blue', 'yellow'])

# Plot calories (x) against alcohol (y), colored by cluster
plt.scatter(beer["calories"], beer["alcohol"],c=colors[beer["cluster"]])

plt.scatter(centers.calories, centers.alcohol, linewidths=3, marker='+', s=300, c='black')

plt.xlabel("Calories")
plt.ylabel("Alcohol")
Text(0, 0.5, 'Alcohol')

[Figure: scatter plot of calories against alcohol, colored by cluster, with the cluster centers marked by black crosses (output_10_1.png)]

Draw pairwise scatter plots of the features after clustering

  • Result when K = 3
scatter_matrix(beer[["calories","sodium","alcohol","cost"]],s=100, alpha=1, c=colors[beer["cluster"]], figsize=(10,10))
plt.suptitle("With 3 centroids initialized")
Text(0.5, 0.98, 'With 3 centroids initialized')

[Figure: scatter matrix of the four features, colored by the K = 3 clusters (output_13_1.png)]

  • Result when K = 2
scatter_matrix(beer[["calories","sodium","alcohol","cost"]],s=100, alpha=1, c=colors[beer["cluster2"]], figsize=(10,10))
plt.suptitle("With 2 centroids initialized")
Text(0.5, 0.98, 'With 2 centroids initialized')

[Figure: scatter matrix of the four features, colored by the K = 2 clusters (output_15_1.png)]

4. Silhouette coefficient analysis

Since the results for k = 2 and k = 3 above do not differ much, the silhouette coefficient, a measure of clustering quality, can be used to compare them.

The silhouette coefficient of sample i is s_i = (b_i - a_i) / max(a_i, b_i), where a_i and b_i are defined as follows:

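  • Compute the average distance a_i from sample i to the other samples in the same cluster. The smaller a_i is, the more sample i belongs in that cluster. a_i is called the within-cluster dissimilarity of sample i.
  • Compute the average distance b_ij from sample i to all samples of another cluster C_j, called the dissimilarity between sample i and cluster C_j. The between-cluster dissimilarity of sample i is defined as b_i = min{b_i1, b_i2, …, b_ik}.
  • If s_i is close to 1, sample i is clustered appropriately.
  • If s_i is close to -1, sample i would be better assigned to another cluster.
  • If s_i is approximately 0, sample i lies on the boundary between two clusters.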
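
The definitions above can be sketched directly in NumPy and compared with scikit-learn's silhouette_samples. This is a minimal illustration on made-up one-dimensional data with a fixed cluster assignment; the helper silhouette_by_hand and the arrays X_toy and labels_toy are hypothetical names, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data (hypothetical values) with a fixed assignment to two clusters.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    """Return s_i = (b_i - a_i) / max(a_i, b_i) for every sample."""
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # singleton cluster: s_i is taken to be 0
        # a_i: mean distance to the other samples of the same cluster.
        a_i = dist[i, same].mean()
        # b_i: smallest mean distance to the samples of any other cluster.
        b_i = min(dist[i, labels == other].mean()
                  for other in set(labels) if other != labels[i])
        scores[i] = (b_i - a_i) / max(a_i, b_i)
    return scores

manual = silhouette_by_hand(X_toy, labels_toy)
print(manual)
print(np.allclose(manual, silhouette_samples(X_toy, labels_toy)))
```

metrics.silhouette_score, used below, is the mean of these per-sample values.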
from sklearn import metrics
# Note: beer.scaled_cluster is not created anywhere in this post; it comes from a
# step that is not shown, so that column must exist before the next line runs.
score_scaled = metrics.silhouette_score(X,beer.scaled_cluster)
score = metrics.silhouette_score(X,beer.cluster)
print(score_scaled, score)

# Silhouette score for each number of clusters from 2 to 19
scores = []
for k in range(2,20):
    labels = KMeans(n_clusters=k).fit(X).labels_
    score = metrics.silhouette_score(X, labels)
    scores.append(score)

plt.plot(list(range(2,20)), scores)
plt.xlabel("Number of Clusters Initialized")
plt.ylabel("Silhouette Score")
0.6731775046455796 0.6731775046455796

Text(0, 0.5, 'Silhouette Score')

[Figure: silhouette score against the number of clusters (output_20_2.png)]

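As the figure shows, the silhouette coefficient is closest to 1 when n_clusters = 2, so that value is the better choice. Still, for clustering algorithms an evaluation metric is only a reference; each real dataset needs its own analysis.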
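
Reading the best value off a plot can also be done in code. The sketch below picks the k with the highest silhouette score on made-up data with two obvious groups; X_toy, scores_by_k and best_k are illustrative names, and random_state=0 and n_init=10 are assumed settings rather than the ones used above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data (hypothetical values): two clearly separated squares of four points.
X_toy = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [10, 10], [10, 11], [11, 10], [11, 11],
], dtype=float)

# Silhouette score for each candidate number of clusters.
scores_by_k = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores_by_k[k] = silhouette_score(X_toy, labels)

# Keep the k whose score is highest.
best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k)
print("best k:", best_k)
```

As the text says, the highest score is a hint rather than a verdict; a slightly lower-scoring k can still be the more useful grouping for a given problem.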

When modeling with the sklearn toolkit, switching to another algorithm is very convenient: only the estimator class needs to change.

from pandas.plotting import scatter_matrix

# Switch to the DBSCAN algorithm
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=10, min_samples=2).fit(X)
labels = db.labels_
beer['cluster_db'] = labels
beer.sort_values('cluster_db')
beer.groupby('cluster_db').mean(numeric_only=True)
# DBSCAN labels noise points as -1, which indexes the last color ('yellow')
scatter_matrix(X, c=colors[beer.cluster_db], figsize=(10,10), s=100)

[Figure: scatter matrix of the four features, colored by DBSCAN cluster label (output_22_1.png)]
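
Unlike K-means, DBSCAN does not take the number of clusters as input and may mark some points as noise with the label -1. The sketch below shows this on made-up data; X_toy, n_clusters, n_noise and the values eps=2 and min_samples=2 are assumptions chosen for the illustration, not the settings used for the beer data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (hypothetical values): two tight groups plus one far-away outlier.
X_toy = np.array([
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [50, 50],
], dtype=float)

labels = DBSCAN(eps=2, min_samples=2).fit(X_toy).labels_
print(labels)

# Noise points carry the label -1 and do not count as a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)

# Indexing a color array with -1 picks its last entry, so reserve it for noise.
colors = np.array(["red", "green", "blue", "gray"])
point_colors = colors[labels]
print(point_colors)
```

The same indexing behavior applies to the colors array used in the plots above: any beer that DBSCAN labels as noise is drawn in the last color of that array.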

Origin blog.csdn.net/weixin_44751294/article/details/114133614