Experiment 7: K-means clustering
1. Experimental purpose
Learn the basic principles of the K-means algorithm and use it to cluster the Iris data set.
2. Experimental content
Apply the K-means algorithm to cluster the iris data set.
3. Experimental results and analysis
0: Output the basic information of the data set
In the main function, the reference code first prints the data, feature names, target values, and target names of the iris data set. The results are shown in the figure below.
【Data】
There are 150 samples, each with 4 features.
【Feature names】
The features of each sample are sepal length, sepal width, petal length, and petal width. Here "sepal" refers to the calyx (this report uses the two terms interchangeably), and "petal" refers to the flower petals.
【Target values】
The data set contains 3 species of iris, labelled 0, 1, and 2. The K-means algorithm should therefore group the samples into 3 clusters.
【Target names】
The labels 0, 1, and 2 correspond to setosa, versicolor, and virginica respectively.
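The basic information described above can be reproduced with a few lines. This is a minimal sketch, assuming scikit-learn is installed; it is not the reference code itself, and the printed wording is illustrative.

```python
from sklearn import datasets

# Load the iris data set bundled with scikit-learn (no download needed).
iris = datasets.load_iris()

# Number of samples and number of features per sample.
print("data shape:", iris.data.shape)
# Names of the four measured features.
print("feature names:", iris.feature_names)
# Distinct integer labels and the species names they stand for.
print("label values:", sorted(set(iris.target.tolist())))
print("label names:", list(iris.target_names))
```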
In addition, the reference code uses the show_data function to plot sepal length against sepal width and petal length against petal width. The results are shown in the figure below.
【Sepal】
【Petal】
1: Call Kmeans for clustering
In task 1, the sepal features and the petal features are clustered separately. The KMeans class from the [sklearn] library is used, the number of clusters is set to 3, and the fit method trains the model. The resulting cluster labels are [kmeans_sepal.labels_] and [kmeans_petal.labels_]. The overall code is shown in the figure below.
The head method is then used to print the clustering results for the first few samples. The program output is shown in the figure below.
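The clustering step of task 1 can be sketched as follows. The column names and the kmeans_sepal / kmeans_petal variables follow the appendix code; n_init=10 and random_state=0 are assumptions added here so that repeated runs give the same labels, and are not part of the reference code.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
x = pd.DataFrame(
    iris.data,
    columns=["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"],
)

# Cluster the two sepal features into 3 groups.
# n_init and random_state are assumed values, chosen for repeatability.
kmeans_sepal = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_sepal.fit(x[["Sepal_Length", "Sepal_Width"]])

# Cluster the two petal features into 3 groups.
kmeans_petal = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_petal.fit(x[["Petal_Length", "Petal_Width"]])

# Store the cluster ID of every sample next to its features.
x["Sepal_Cluster"] = kmeans_sepal.labels_
x["Petal_Cluster"] = kmeans_petal.labels_

# Show the first few rows, including the two cluster columns.
print(x.head())
```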
2: Draw pictures before and after clustering
In task 2, we set the figure size and define 4 subplots, which show scatter plots of the sepal data before and after clustering and of the petal data before and after clustering. The overall code is shown in the figure below.
The program output is shown in the figure below. Original Sepal Data shows the original sepal data, Sepal Clustering Overlay shows the clustered sepal data, Original Petal Data shows the original petal data, and Petal Clustering Overlay shows the clustered petal data.
3: Calculate and output the accuracy
In task 3, accuracy_score is imported with [from sklearn.metrics import accuracy_score]. The true labels of the data set and the labels produced by K-means clustering are passed in, and the resulting accuracy is printed. The overall code is shown in the figure below.
The program output is shown in the figure below. It can be seen that the calculated accuracy of sepal clustering is 25.33%, and the accuracy of petal clustering is 1.33%.
Comparing the plots in Task 2 shows that the original labels and the cluster labels are related as in the table below. The rows in which the result label differs from the original label (highlighted in orange in the original table) show the mismatch. The accuracy computed above is therefore unreliable: cluster IDs are arbitrary, so accuracy_score is only meaningful when each cluster happens to receive the same number as the class it matches. Evaluation metrics suited to unsupervised learning should be used to judge the quality of the results instead.
Cluster object | Original label | Result label
Sepal (calyx) | red | black
Sepal (calyx) | green | green
Sepal (calyx) | black | red
Petal | red | black
Petal | green | red
Petal | black | green
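One way to obtain an accuracy that does not depend on how the clusters are numbered is to first find the best one-to-one mapping between cluster IDs and true labels, and only then count matches. The sketch below is not part of the reference code: matched_accuracy is a hypothetical helper, the mapping is found with SciPy's linear_sum_assignment on the confusion matrix, and n_init=10 and random_state=0 are assumed values.

```python
from scipy.optimize import linear_sum_assignment
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix


def matched_accuracy(y_true, y_pred):
    """Accuracy after renaming clusters to best match the true labels."""
    # Rows are true labels, columns are cluster IDs.
    cm = confusion_matrix(y_true, y_pred)
    # Choose the one-to-one pairing that maximises the matched count.
    rows, cols = linear_sum_assignment(cm, maximize=True)
    return cm[rows, cols].sum() / cm.sum()


iris = datasets.load_iris()
petal = iris.data[:, 2:4]  # petal length and petal width
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(petal)

# Direct comparison: depends on the arbitrary cluster numbering.
raw_acc = accuracy_score(iris.target, labels)
# Comparison after the best relabelling: independent of the numbering.
fixed_acc = matched_accuracy(iris.target, labels)
print("raw accuracy: {:.2f}%".format(raw_acc * 100))
print("matched accuracy: {:.2f}%".format(fixed_acc * 100))
```

Because the identity mapping is one of the candidate pairings, the matched accuracy can never be lower than the raw accuracy.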
4: Calculate and output the silhouette coefficient (additional task)
In task 4, silhouette_score is imported with [from sklearn.metrics import silhouette_score]. The feature values of the data set and the labels produced by K-means clustering are passed in, and the resulting silhouette coefficient is printed. The overall code is shown in the figure below.
The program output is shown in the figure below. It can be seen that the calculated silhouette coefficient of the sepal clustering is about 0.45, and that of the petal clustering is about 0.66.
5: Calculate and output the Adjusted Rand Index (additional task)
In task 5, adjusted_rand_score is imported with [from sklearn.metrics import adjusted_rand_score]. The true labels of the data set and the labels produced by K-means clustering are passed in, and the resulting ARI is printed. The overall code is shown in the figure below.
The program output is shown in the figure below. It can be seen that the calculated ARI of the sepal clustering is about 0.60, and the ARI of the petal clustering is about 0.89.
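The silhouette coefficient needs only the features and the cluster labels, so it can also be computed for several cluster counts and compared. The sketch below does this for the petal features; the range of K values, n_init=10, and random_state=0 are assumptions chosen for illustration, and the K with the highest score need not equal the number of species.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Petal length and petal width only.
X = datasets.load_iris().data[:, 2:4]

scores = {}
for k in range(2, 7):
    # Cluster with k groups and score the result without using true labels.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print("K = {}: silhouette = {:.3f}".format(k, scores[k]))

# K with the highest mean silhouette coefficient among those tried.
best_k = max(scores, key=scores.get)
print("best K by silhouette:", best_k)
```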
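ARI compares how samples are grouped, not which number each group carries. The toy example below (made-up labels, not iris results) renames the clusters of one labelling and shows that accuracy changes while ARI stays the same.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Made-up ground truth: three classes with three samples each.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# A clustering that groups the samples like y_true, with one mistake.
labels_a = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])
# The same grouping with the clusters renamed: 0 -> 1, 1 -> 2, 2 -> 0.
permutation = np.array([1, 2, 0])
labels_b = permutation[labels_a]

acc_a = accuracy_score(y_true, labels_a)
acc_b = accuracy_score(y_true, labels_b)
ari_a = adjusted_rand_score(y_true, labels_a)
ari_b = adjusted_rand_score(y_true, labels_b)

# Accuracy depends on the cluster numbering; ARI does not.
print("accuracy:", acc_a, acc_b)
print("ARI:", ari_a, ari_b)
```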
4. Problems encountered and solutions
Problem 1: The accuracy computed for the K-means clustering is different every time the program is executed.
Solution 1: Comparing the printed data labels with the cluster labels shows that K-means may assign a different ID to each cluster on every run. There is only some mapping between the two label sets (for example, label 0 in the data set may correspond to label 1 in the K-means labels rather than to label 0), and this mapping is not guaranteed to stay the same from run to run. An index such as ARI, which does not depend on how the clusters are numbered, is therefore needed to evaluate the quality of the clustering.
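The run-to-run variation in Problem 1 comes from the random initialization of the cluster centers. As a complement to using ARI, the sketch below fixes the seed so that two runs produce identical cluster IDs. The parameter values (random_state=42 and the rest of the params dictionary) are assumptions for illustration and do not appear in the reference code.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

# Petal length and petal width only.
X = datasets.load_iris().data[:, 2:4]

# Assumed settings; random_state is what makes the runs repeatable.
params = dict(n_clusters=3, init="k-means++", n_init=10,
              max_iter=300, tol=1e-4, random_state=42)

run_1 = KMeans(**params).fit(X)
run_2 = KMeans(**params).fit(X)

# With the same seed, both runs assign the same ID to every sample.
same_labels = np.array_equal(run_1.labels_, run_2.labels_)
print("identical labels:", same_labels)
print("iterations used in the best run:", run_1.n_iter_)
```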
5. Experiment summary and insights
1: The following parameters of KMeans can be modified: n_clusters (the number of clusters), init (how the cluster centers are initialized: random picks the initial centers at random, while k-means++ spreads the initial centers out to speed up convergence), n_init (the number of times the algorithm is run with different initial centers, the best run being kept), max_iter (the maximum number of iterations in a single run), tol (the convergence threshold), random_state (the seed of the random number generator, which makes the results repeatable), and algorithm (the K-means variant to use; older scikit-learn versions offered full, elkan, and auto, while recent versions offer lloyd and elkan). Older versions also had precompute_distances (whether to precompute distances, trading memory for speed), which has since been removed. In general, the most important choice is the number of clusters n_clusters, because it directly affects the clustering result.
2: ARI evaluates the agreement between the clustering result and the true labels. Its value is at most 1: the closer it is to 1, the better the agreement and the clustering, values near 0 correspond to random labelling, and negative values are possible when the agreement is worse than chance.
3: Silhouette coefficient is used to measure the similarity between a data point and the cluster to which it belongs and the difference between it and other clusters. Its value range is between [-1, 1]. A silhouette coefficient close to 1 means that the data point is very similar to other data points in the cluster to which it belongs, while at the same time it is very different from the data points in other clusters, which usually means that the data point is correctly assigned to the appropriate cluster. A silhouette coefficient close to 0 means that the similarity between the data point and the data points within the cluster to which it belongs is about the same as that of data points in other clusters, usually indicating that the data point may be located on the boundary of two or more clusters. A silhouette coefficient close to -1 indicates that the data point is very different from other data points in the cluster to which it belongs, but is highly similar to data points in other clusters, usually indicating that the data point has been mistakenly assigned to an inappropriate cluster. Silhouette coefficients can be used to select the best K value, compare the performance of different clustering algorithms, or evaluate the quality of clustering results.
4: The main idea of K-means clustering is to iteratively assign each data point to its nearest cluster center and then recompute the centers. When applying a clustering algorithm, it helps to run experiments and evaluate the clustering performance: visualization and evaluation metrics give a better understanding of the structure of the data and help select an appropriate K value.
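The iteration described in point 4 can be written out directly. The following is a minimal NumPy sketch of that idea, not the scikit-learn implementation: kmeans_numpy is a hypothetical helper, the initial centers are simply k randomly chosen data points, and the toy data is made up. Like any K-means run, the result depends on the initial centers.

```python
import numpy as np


def assign(X, centers):
    """Return the index of the nearest center for every point."""
    # Squared distance from every point to every center: shape (n, k).
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans_numpy(X, k, n_iter=100, seed=0):
    """Alternate between assigning points and recomputing centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the final centers.
    return assign(X, centers), centers


# Made-up toy data: three groups of 20 two-dimensional points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
labels, centers = kmeans_numpy(X, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
print("centers:\n", centers)
```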
6. Appendix
(1) Complete program source code (including comments)
The code for each task is placed under its own banner comment, and each piece of code is annotated with comments.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics import silhouette_score


def print_data(want_print, print_iris):
    """
    Display part of the iris data
    :return: None
    """
    print("iris {0}:\n{1}".format(want_print, print_iris))
    print("=" * 85)


def show_data(length, width, title):
    """
    Draw a scatter plot
    :param length: length
    :param width: width
    :param title: title
    :return: None
    """
    # Create a canvas
    plt.figure(figsize=(14, 7))
    # colormap and y are module-level names defined in the main block below
    plt.scatter(length, width, c=colormap[y.Targets], s=40)
    plt.title(title)
    plt.show()


if __name__ == '__main__':
    # Load the iris data
    iris = datasets.load_iris()
    # Display the iris data
    print_data(want_print="data", print_iris=iris.data)
    # Display the iris feature names
    print_data(want_print="feature names", print_iris=iris.feature_names)
    # Display the target values
    print_data(want_print="target values", print_iris=iris.target)
    # Display the names of the target values
    print_data(want_print="target names", print_iris=iris.target_names)

    # For convenience, convert the iris data to pandas structures and name the columns
    # Convert the iris data to a pandas DataFrame
    x = pd.DataFrame(iris.data)
    # Name the columns 'Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'
    x.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
    # Convert the iris target values to a pandas DataFrame as well
    y = pd.DataFrame(iris.target)
    # Name the target column 'Targets'
    y.columns = ['Targets']

    # Create the color map
    colormap = np.array(['red', 'lime', 'black'])
    # Plot sepal length against sepal width
    show_data(length=x.Sepal_Length, width=x.Sepal_Width, title='Sepal')
    # Plot petal length against petal width
    show_data(length=x.Petal_Length, width=x.Petal_Width, title='Petal')

    ###########################################################################################
    # Call KMeans for clustering
    # for sepal
    kmeans_sepal = KMeans(n_clusters=3)
    kmeans_sepal.fit(x[['Sepal_Length', 'Sepal_Width']])
    x['Sepal_Cluster'] = kmeans_sepal.labels_
    # for petal
    kmeans_petal = KMeans(n_clusters=3)
    kmeans_petal.fit(x[['Petal_Length', 'Petal_Width']])
    x['Petal_Cluster'] = kmeans_petal.labels_
    # Print the first few rows (including the clustering results)
    print(x.head())
    # print(y.Targets)

    ###########################################################################################
    # Calculate the silhouette coefficient
    # Silhouette coefficient of the sepal clustering
    silhouette_sepal = silhouette_score(x[['Sepal_Length', 'Sepal_Width']], x['Sepal_Cluster'])
    print("Silhouette Score for Sepal Clustering:", silhouette_sepal)
    # Silhouette coefficient of the petal clustering
    silhouette_petal = silhouette_score(x[['Petal_Length', 'Petal_Width']], x['Petal_Cluster'])
    print("Silhouette Score for Petal Clustering:", silhouette_petal)

    ###########################################################################################
    # Draw the plots before and after clustering
    # Before/after comparison for the sepal data
    plt.figure(figsize=(16, 7))
    # Scatter plot of the original sepal data
    plt.subplot(2, 2, 1)
    plt.scatter(x['Sepal_Length'], x['Sepal_Width'], c=colormap[y['Targets']], s=40, label='Original Data')
    plt.title('Original Sepal Data')
    # Scatter plot of the sepal clustering result
    plt.subplot(2, 2, 2)
    for cluster in np.unique(x['Sepal_Cluster']):
        cluster_data = x[x['Sepal_Cluster'] == cluster]
        plt.scatter(cluster_data['Sepal_Length'], cluster_data['Sepal_Width'],
                    c=colormap[cluster], s=40, label=f'Cluster {cluster}')
    plt.title('Sepal Clustering Overlay')
    plt.legend()
    # Before/after comparison for the petal data
    # Scatter plot of the original petal data
    plt.subplot(2, 2, 3)
    plt.scatter(x['Petal_Length'], x['Petal_Width'], c=colormap[y['Targets']], s=40, label='Original Data')
    plt.title('Original Petal Data')
    # Scatter plot of the petal clustering result
    plt.subplot(2, 2, 4)
    for cluster in np.unique(x['Petal_Cluster']):
        cluster_data = x[x['Petal_Cluster'] == cluster]
        plt.scatter(cluster_data['Petal_Length'], cluster_data['Petal_Width'],
                    c=colormap[cluster], s=40, label=f'Cluster {cluster}')
    plt.title('Petal Clustering Overlay')
    plt.legend()
    plt.tight_layout()
    plt.show()

    ###########################################################################################
    # Calculate and output Accuracy
    # Note: cluster IDs are arbitrary, so this direct comparison is only
    # meaningful when each cluster ID happens to equal its class label.
    # acc for sepal
    accuracy_sepal = accuracy_score(iris.target, kmeans_sepal.labels_)
    print("Accuracy for Sepal Clustering: {:.2f}%".format(accuracy_sepal * 100))
    # acc for petal
    accuracy_petal = accuracy_score(iris.target, kmeans_petal.labels_)
    print("Accuracy for Petal Clustering: {:.2f}%".format(accuracy_petal * 100))

    ###########################################################################################
    # Calculate and output ARI (adjusted_rand_score)
    # ARI (Adjusted Rand Index): evaluates the agreement between the clustering
    # result and the true labels. The closer to 1, the better the clustering.
    # ARI for sepal
    ari_score_sepal = adjusted_rand_score(iris.target, x['Sepal_Cluster'])
    print("ARI for Sepal Clustering:", ari_score_sepal)
    # ARI for petal
    ari_score_petal = adjusted_rand_score(iris.target, x['Petal_Cluster'])
    print("ARI for Petal Clustering:", ari_score_petal)
(2) Data set text file
"Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
"1" 5.1 3.5 1.4 0.2 "setosa"
"2" 4.9 3 1.4 0.2 "setosa"
"3" 4.7 3.2 1.3 0.2 "setosa"
"4" 4.6 3.1 1.5 0.2 "setosa"
"5" 5 3.6 1.4 0.2 "setosa"
"6" 5.4 3.9 1.7 0.4 "setosa"
"7" 4.6 3.4 1.4 0.3 "setosa"
"8" 5 3.4 1.5 0.2 "setosa"
"9" 4.4 2.9 1.4 0.2 "setosa"
"10" 4.9 3.1 1.5 0.1 "setosa"
"11" 5.4 3.7 1.5 0.2 "setosa"
"12" 4.8 3.4 1.6 0.2 "setosa"
"13" 4.8 3 1.4 0.1 "setosa"
"14" 4.3 3 1.1 0.1 "setosa"
"15" 5.8 4 1.2 0.2 "setosa"
"16" 5.7 4.4 1.5 0.4 "setosa"
"17" 5.4 3.9 1.3 0.4 "setosa"
"18" 5.1 3.5 1.4 0.3 "setosa"
"19" 5.7 3.8 1.7 0.3 "setosa"
"20" 5.1 3.8 1.5 0.3 "setosa"
"21" 5.4 3.4 1.7 0.2 "setosa"
"22" 5.1 3.7 1.5 0.4 "setosa"
"23" 4.6 3.6 1 0.2 "setosa"
"24" 5.1 3.3 1.7 0.5 "setosa"
"25" 4.8 3.4 1.9 0.2 "setosa"
"26" 5 3 1.6 0.2 "setosa"
"27" 5 3.4 1.6 0.4 "setosa"
"28" 5.2 3.5 1.5 0.2 "setosa"
"29" 5.2 3.4 1.4 0.2 "setosa"
"30" 4.7 3.2 1.6 0.2 "setosa"
"31" 4.8 3.1 1.6 0.2 "setosa"
"32" 5.4 3.4 1.5 0.4 "setosa"
"33" 5.2 4.1 1.5 0.1 "setosa"
"34" 5.5 4.2 1.4 0.2 "setosa"
"35" 4.9 3.1 1.5 0.2 "setosa"
"36" 5 3.2 1.2 0.2 "setosa"
"37" 5.5 3.5 1.3 0.2 "setosa"
"38" 4.9 3.6 1.4 0.1 "setosa"
"39" 4.4 3 1.3 0.2 "setosa"
"40" 5.1 3.4 1.5 0.2 "setosa"
"41" 5 3.5 1.3 0.3 "setosa"
"42" 4.5 2.3 1.3 0.3 "setosa"
"43" 4.4 3.2 1.3 0.2 "setosa"
"44" 5 3.5 1.6 0.6 "setosa"
"45" 5.1 3.8 1.9 0.4 "setosa"
"46" 4.8 3 1.4 0.3 "setosa"
"47" 5.1 3.8 1.6 0.2 "setosa"
"48" 4.6 3.2 1.4 0.2 "setosa"
"49" 5.3 3.7 1.5 0.2 "setosa"
"50" 5 3.3 1.4 0.2 "setosa"
"51" 7 3.2 4.7 1.4 "versicolor"
"52" 6.4 3.2 4.5 1.5 "versicolor"
"53" 6.9 3.1 4.9 1.5 "versicolor"
"54" 5.5 2.3 4 1.3 "versicolor"
"55" 6.5 2.8 4.6 1.5 "versicolor"
"56" 5.7 2.8 4.5 1.3 "versicolor"
"57" 6.3 3.3 4.7 1.6 "versicolor"
"58" 4.9 2.4 3.3 1 "versicolor"
"59" 6.6 2.9 4.6 1.3 "versicolor"
"60" 5.2 2.7 3.9 1.4 "versicolor"
"61" 5 2 3.5 1 "versicolor"
"62" 5.9 3 4.2 1.5 "versicolor"
"63" 6 2.2 4 1 "versicolor"
"64" 6.1 2.9 4.7 1.4 "versicolor"
"65" 5.6 2.9 3.6 1.3 "versicolor"
"66" 6.7 3.1 4.4 1.4 "versicolor"
"67" 5.6 3 4.5 1.5 "versicolor"
"68" 5.8 2.7 4.1 1 "versicolor"
"69" 6.2 2.2 4.5 1.5 "versicolor"
"70" 5.6 2.5 3.9 1.1 "versicolor"
"71" 5.9 3.2 4.8 1.8 "versicolor"
"72" 6.1 2.8 4 1.3 "versicolor"
"73" 6.3 2.5 4.9 1.5 "versicolor"
"74" 6.1 2.8 4.7 1.2 "versicolor"
"75" 6.4 2.9 4.3 1.3 "versicolor"
"76" 6.6 3 4.4 1.4 "versicolor"
"77" 6.8 2.8 4.8 1.4 "versicolor"
"78" 6.7 3 5 1.7 "versicolor"
"79" 6 2.9 4.5 1.5 "versicolor"
"80" 5.7 2.6 3.5 1 "versicolor"
"81" 5.5 2.4 3.8 1.1 "versicolor"
"82" 5.5 2.4 3.7 1 "versicolor"
"83" 5.8 2.7 3.9 1.2 "versicolor"
"84" 6 2.7 5.1 1.6 "versicolor"
"85" 5.4 3 4.5 1.5 "versicolor"
"86" 6 3.4 4.5 1.6 "versicolor"
"87" 6.7 3.1 4.7 1.5 "versicolor"
"88" 6.3 2.3 4.4 1.3 "versicolor"
"89" 5.6 3 4.1 1.3 "versicolor"
"90" 5.5 2.5 4 1.3 "versicolor"
"91" 5.5 2.6 4.4 1.2 "versicolor"
"92" 6.1 3 4.6 1.4 "versicolor"
"93" 5.8 2.6 4 1.2 "versicolor"
"94" 5 2.3 3.3 1 "versicolor"
"95" 5.6 2.7 4.2 1.3 "versicolor"
"96" 5.7 3 4.2 1.2 "versicolor"
"97" 5.7 2.9 4.2 1.3 "versicolor"
"98" 6.2 2.9 4.3 1.3 "versicolor"
"99" 5.1 2.5 3 1.1 "versicolor"
"100" 5.7 2.8 4.1 1.3 "versicolor"
"101" 6.3 3.3 6 2.5 "virginica"
"102" 5.8 2.7 5.1 1.9 "virginica"
"103" 7.1 3 5.9 2.1 "virginica"
"104" 6.3 2.9 5.6 1.8 "virginica"
"105" 6.5 3 5.8 2.2 "virginica"
"106" 7.6 3 6.6 2.1 "virginica"
"107" 4.9 2.5 4.5 1.7 "virginica"
"108" 7.3 2.9 6.3 1.8 "virginica"
"109" 6.7 2.5 5.8 1.8 "virginica"
"110" 7.2 3.6 6.1 2.5 "virginica"
"111" 6.5 3.2 5.1 2 "virginica"
"112" 6.4 2.7 5.3 1.9 "virginica"
"113" 6.8 3 5.5 2.1 "virginica"
"114" 5.7 2.5 5 2 "virginica"
"115" 5.8 2.8 5.1 2.4 "virginica"
"116" 6.4 3.2 5.3 2.3 "virginica"
"117" 6.5 3 5.5 1.8 "virginica"
"118" 7.7 3.8 6.7 2.2 "virginica"
"119" 7.7 2.6 6.9 2.3 "virginica"
"120" 6 2.2 5 1.5 "virginica"
"121" 6.9 3.2 5.7 2.3 "virginica"
"122" 5.6 2.8 4.9 2 "virginica"
"123" 7.7 2.8 6.7 2 "virginica"
"124" 6.3 2.7 4.9 1.8 "virginica"
"125" 6.7 3.3 5.7 2.1 "virginica"
"126" 7.2 3.2 6 1.8 "virginica"
"127" 6.2 2.8 4.8 1.8 "virginica"
"128" 6.1 3 4.9 1.8 "virginica"
"129" 6.4 2.8 5.6 2.1 "virginica"
"130" 7.2 3 5.8 1.6 "virginica"
"131" 7.4 2.8 6.1 1.9 "virginica"
"132" 7.9 3.8 6.4 2 "virginica"
"133" 6.4 2.8 5.6 2.2 "virginica"
"134" 6.3 2.8 5.1 1.5 "virginica"
"135" 6.1 2.6 5.6 1.4 "virginica"
"136" 7.7 3 6.1 2.3 "virginica"
"137" 6.3 3.4 5.6 2.4 "virginica"
"138" 6.4 3.1 5.5 1.8 "virginica"
"139" 6 3 4.8 1.8 "virginica"
"140" 6.9 3.1 5.4 2.1 "virginica"
"141" 6.7 3.1 5.6 2.4 "virginica"
"142" 6.9 3.1 5.1 2.3 "virginica"
"143" 5.8 2.7 5.1 1.9 "virginica"
"144" 6.8 3.2 5.9 2.3 "virginica"
"145" 6.7 3.3 5.7 2.5 "virginica"
"146" 6.7 3 5.2 2.3 "virginica"
"147" 6.3 2.5 5 1.9 "virginica"
"148" 6.5 3 5.2 2 "virginica"
"149" 6.2 3.4 5.4 2.3 "virginica"
"150" 5.9 3 5.1 1.8 "virginica"