"Machine Learning"------Experiment 4 (k-means clustering)

Experiment content:

1. Reproduce two K-means examples: iris clustering, and clustering cities by latitude and longitude.
2. For the given project, write your own program that uses the K-means algorithm to cluster fruit juice drinks by their contents:

An enterprise collects a dataset of the contents of a fruit juice beverage produced on its own assembly line and applies the K-Means algorithm to it. Clustering is used to judge the production quality of the beverage under a given standard for content deviation and to determine each beverage's category.

  1. Load the dataset, read the data, and explore it.
  2. Convert the sample data (a pandas DataFrame can be converted to a NumPy array) and visualize it with a scatter plot to observe the distribution, which suggests several candidate values of k.
  3. For each value of k, do the following:
    • Configure and train the K-Means model.
    • Output the clustering results and evaluate the clustering quality. The CH (Calinski-Harabasz) index can be used here. Comparing the CH values obtained for each k shows which value of k gives the best clustering. Note: no external class labels are available, so an internal evaluation index (CH) is used: metrics.calinski_harabasz_score(), spelled calinski_harabaz_score in older scikit-learn releases.
    • Output the cluster labels and the cluster centers, so as to judge the juice content and sugar content of each cluster.
    • Visualize the clustering results and the cluster centers (scatter plot) to observe how the clusters are distributed. (Different clusters indicate different deviations in juice and sugar content.)
    • [Extended] (optional): Choose a range of k values, cluster for each, and evaluate the results. Suggested approach: set the range of k; train a model for each k; compute the Euclidean distance from every object to every cluster center to build a distance table; extract each object's distance to the center of its own cluster and sum these distances; store the totals in order; draw a line graph of the total distance against k.
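
The extended step above can be sketched as follows. This is a minimal sketch, assuming synthetic two-feature data from `make_blobs` in place of the beverage dataset; the function name `total_distance_per_k` and the fixed `random_state` are illustrative choices, not part of the assignment. `KMeans.transform` returns the Euclidean distance from every sample to every cluster center, which serves as the distance table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def total_distance_per_k(X, k_values, random_state=0):
    """Return, for each k, the summed distance from every sample to its own cluster center."""
    totals = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        # Distance table: one row per sample, one column per cluster center.
        distance_table = model.transform(X)
        # Distance from each sample to the center of its assigned cluster.
        own_distance = distance_table[np.arange(len(X)), model.labels_]
        totals.append(float(own_distance.sum()))
    return totals


# Synthetic stand-in for the beverage data (assumption: 4 well-separated groups).
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.6, random_state=0)
k_values = list(range(2, 8))
totals = total_distance_per_k(X, k_values)
for k, total in zip(k_values, totals):
    print(f"k={k}: total distance = {total:.2f}")
```

Plotting `totals` against `k_values` with `plt.plot` gives the line graph; the point where the curve stops dropping sharply suggests a reasonable k.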

The iris dataset can be loaded directly (the code below uses the copy built into sklearn), so it is not posted here. The other two datasets:
https://pan.baidu.com/s/1mgZTpGkxX7pTU6xAEkO-WA
Extraction code: 5613

1. Reproduce the iris example:

# Author: Asita
# Created: 2021/11/26 19:41

# Import the required packages
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn import metrics

# 1. Load the dataset
iris = datasets.load_iris()  # load the iris dataset
# Split the data
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris.data, iris.target, test_size=0.2)
# Note: the dataset is loaded directly from sklearn here; you can also load your own data.

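# 2. Configure the model
Kmeans = KMeans(n_clusters=3)  # K-Means model with 3 clusters
# 3. Train the model
kmeans_fit = Kmeans.fit(iris_X_train)
# 4. Predict
y_predict = Kmeans.predict(iris_X_train)
# print(y_predict)
# 5. Evaluate
# Note: K-means cluster IDs are arbitrary, so this accuracy is only meaningful
# when the cluster IDs happen to line up with the class labels; it can change between runs.
# iris_y_train[iris_y_train==11]=0
# print("adjusted", iris_y_train)  # show the adjusted labels
score = metrics.accuracy_score(iris_y_train, y_predict)
print('Accuracy: {0:f}'.format(score))  # the author's run reported an accuracy of 0.8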
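# 6. Visualize the results
# A plot has only two axes, so only the first and second features are used as the X and Y positions.
# The third and fourth features are used in the clustering but not in the plot.
x1 = iris_X_train[:, 0]  # sepal length
y1 = iris_X_train[:, 1]  # sepal width
plt.scatter(x1, y1, c=y_predict, cmap='viridis')  # plot each sample

centers = Kmeans.cluster_centers_  # center of each cluster
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)  # cluster centers
plt.show()  # display the figure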


Experimental results:
[Figures: iris clustering results]
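
K-means assigns cluster IDs arbitrarily, so cluster 0 need not correspond to class 0, and `accuracy_score` on the raw cluster labels can vary from run to run. A minimal sketch of one common workaround follows: match clusters to classes with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) before computing accuracy. The helper name `align_clusters` is hypothetical, and the sketch fits on the full iris data with a fixed `random_state` rather than on the training split used above. `adjusted_rand_score` is included as an alternative that does not depend on how the clusters are numbered.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import datasets, metrics
from sklearn.cluster import KMeans


def align_clusters(y_true, y_cluster):
    """Relabel clusters so that they agree with the true classes as often as possible."""
    # Rows are true classes, columns are cluster IDs.
    cm = metrics.confusion_matrix(y_true, y_cluster)
    # Maximizing the matched counts equals minimizing their negation.
    class_idx, cluster_idx = linear_sum_assignment(-cm)
    mapping = dict(zip(cluster_idx, class_idx))
    return np.array([mapping[c] for c in y_cluster])


iris = datasets.load_iris()
y_cluster = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)
y_aligned = align_clusters(iris.target, y_cluster)

print("raw accuracy:    ", metrics.accuracy_score(iris.target, y_cluster))
print("aligned accuracy:", metrics.accuracy_score(iris.target, y_aligned))
print("adjusted Rand:   ", metrics.adjusted_rand_score(iris.target, y_cluster))
```

The aligned accuracy is never lower than the raw one, because leaving the labels unchanged is one of the assignments the matching considers.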
Reproduce the city clustering example (by latitude and longitude):

# Author: Asita
# Created: 2021/11/26 20:31

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# 1. Get the data
# ------ 1.1 Load the data ------
df = pd.read_csv('F:/data/China_cities.csv', encoding='GB18030')  # path to your dataset
# print(df)
print(df.shape)   # print the data dimensions
print(df.head())  # show the first 5 rows

# 2. Preprocess the data
# ------ 2.1 Extract latitude and longitude ------
x = df.drop('省级行政区', axis=1)  # drop the 省级行政区 (province-level region) column
# print(x)
x = x.drop("城市", axis=1)  # drop the 城市 (city) column
# print(x)
x_np = np.array(x)  # convert x to a NumPy array
# print(x)

# 3. Build and train the model
# ------ 3.1 Build the K-Means clusterer ------
n_clusters = 7                             # number of clusters
estimator = KMeans(n_clusters=n_clusters)  # build the clusterer
# ------ 3.2 Train the K-Means clusterer ------
estimator.fit(x)

# 4. Visualize the data
markers = ['*', 'v', '+', '^', 's', 'x', 'o']      # marker styles
colors = ['r', 'g', 'm', 'c', 'y', 'b', 'orange']  # marker colors
labels = estimator.labels_      # cluster label of each city

plt.figure(figsize=(9, 6))
plt.title("Major Cities in China", fontsize=25)
plt.xlabel('East Longitude', fontsize=18)
plt.ylabel('North Latitude', fontsize=18)

for i in range(n_clusters):
    members = labels == i      # boolean mask of the cities in cluster i
    plt.scatter(
        x_np[members, 1],      # longitudes
        x_np[members, 0],      # latitudes
        marker=markers[i],     # marker style
        c=colors[i]            # marker color
    )   # draw the scatter plot
plt.grid()
plt.show()

Experimental results:
[Figure: scatter plot "Major Cities in China", East Longitude vs. North Latitude, one marker style per cluster]
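
The plot shows where the clusters lie but not which cities belong to each one. As a sketch, the labels can be attached back to the table and grouped. The DataFrame below is a small hypothetical stand-in for `China_cities.csv`, with assumed column names `city`, `lat`, `lon` and approximate coordinates; the real file's column names may differ.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for China_cities.csv; coordinates are approximate.
cities = pd.DataFrame({
    "city": ["Beijing", "Tianjin", "Shanghai", "Hangzhou", "Guangzhou", "Shenzhen"],
    "lat": [39.9, 39.1, 31.2, 30.3, 23.1, 22.5],
    "lon": [116.4, 117.2, 121.5, 120.2, 113.3, 114.1],
})

model = KMeans(n_clusters=3, n_init=10, random_state=0)
cities["cluster"] = model.fit_predict(cities[["lat", "lon"]])

# Collect the city names that fall into each cluster.
members = cities.groupby("cluster")["city"].apply(list)
for cluster_id, names in members.items():
    print(cluster_id, names)
```

The cluster numbers themselves carry no meaning; only the grouping does.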
2. Beverage-based clustering:

Clustering is run with different k values (2 to 7):

Full code:

# Author: Asita
# Created: 2021/11/28 9:43

import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# 1. Get the data
# ------ 1.1 Load the data ------
df = pd.read_csv('F:/研究生/课程/机器学习/聚类/beverage.csv', encoding='GB18030')  # path to your dataset
# print(df)
print(df.shape)   # print the data dimensions
print(df.head())  # show the first 5 rows

# 2. Visualize the data
def showOrgData(dataMat):
    data = np.array(dataMat)
    print(type(data))
    plt.scatter(data[:, 0], data[:, 1], color='m', marker='o', label='Org_data')
    plt.xlabel('juice')
    plt.ylabel('sweet')
    plt.legend(loc=2)  # put the legend in the upper-left corner; see the official docs
    plt.show()

# 3. Process the data
df = np.array(df)  # convert the pandas DataFrame to a NumPy array

# Scatter plot of the sample data (before clustering)
showOrgData(df)

# 4. Train K-Means for each value of k
score_all = []
list1 = range(2, 8)

for i in list1:

    estimator = KMeans(n_clusters=i)
    y_pred = estimator.fit_predict(df)  # fit the model and get the labels in one call

    # 5. Plot the clustering result for this k
    plt.scatter(df[:, 0], df[:, 1], c=y_pred, label=i)
    plt.legend(loc=2)  # put the legend in the upper-left corner; see the official docs
    plt.xlabel('juice')
    plt.ylabel('sweet')

    # The cluster_centers_ attribute holds the centroids
    centroid = estimator.cluster_centers_
    print("k=%d:" % i)
    print("centroid:\n", centroid)
    # Plot the cluster centers
    plt.scatter(
        centroid[:, 0],
        centroid[:, 1],
        marker="x",
        c="black",
        s=48
    )

    # 6. Record the CH index for this k
    score = metrics.calinski_harabasz_score(df, y_pred)
    score_all.append(score)
    print("score=", score)
    print('------------------------------')
    plt.show()

# 7. Plot the CH index against k (line graph)
plt.plot(list1, score_all)
plt.xlabel('k')
plt.ylabel('CH')
plt.show()

Experimental results:
Cluster label values and cluster centers output for each value of k:
[Figures: program output for each k]
Original dataset:
[Figure: scatter plot of the original data, juice vs. sweet]
Inspecting the distribution by eye suggests that k is probably between 2 and 4. To compare more clustering results, however, k was run from 2 to 7.

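When k=2:
[Figure: clustering result and cluster centers, k=2]
When k=3:
[Figure: clustering result and cluster centers, k=3]
When k=4:
[Figure: clustering result and cluster centers, k=4]
When k=5:
[Figure: clustering result and cluster centers, k=5]
When k=6:
[Figure: clustering result and cluster centers, k=6]
When k=7:
[Figure: clustering result and cluster centers, k=7]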
Line graph of the internal evaluation index CH against k:
[Figure: CH index vs. k]
The line graph shows that, judged by the CH index, clustering of this dataset is best when k=4.
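
For reference, the CH index compares the dispersion between cluster centers with the dispersion inside clusters, so a larger value indicates tighter, better-separated clusters. The sketch below computes it by hand on synthetic `make_blobs` data (an assumption standing in for the beverage set), checks the result against `metrics.calinski_harabasz_score`, and picks the k with the highest score. The helper name `ch_index` is hypothetical.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def ch_index(X, labels):
    """Calinski-Harabasz index: between-cluster over within-cluster dispersion."""
    n_samples = X.shape[0]
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for cid in cluster_ids:
        points = X[labels == cid]
        center = points.mean(axis=0)
        # Squared distance of the cluster center from the overall mean, weighted by size.
        between += len(points) * np.sum((center - overall_mean) ** 2)
        # Squared distances of the cluster's points from their own center.
        within += np.sum((points - center) ** 2)
    return (between / (k - 1)) / (within / (n_samples - k))


# Synthetic stand-in for the beverage data (assumption).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=42)
k_values = list(range(2, 8))
scores = []
for k in k_values:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores.append(ch_index(X, labels))
    # The hand-written value should match scikit-learn's implementation.
    assert np.isclose(scores[-1], metrics.calinski_harabasz_score(X, labels))

best_k = k_values[int(np.argmax(scores))]
print("best k by CH:", best_k)
```

Choosing the k with the highest CH value mirrors reading the peak off the line graph.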

Origin blog.csdn.net/Naruto_8/article/details/121592850