[Artificial Intelligence I] Experiment 7: K-means clustering experiment

1. Experimental purpose

Learn the basic principles of the K-means algorithm and use it to cluster the Iris data set.

2. Experimental content

Apply the K-means algorithm to cluster the iris data set.

3. Experimental results and analysis

0: Output the basic information of the data set

The reference code first prints the data, feature names, target values, and target names in the main block. The output for the iris data set is shown in the figure below.

【Data】

    There are 150 samples, each with 4 features.

【Feature names】

    Each sample has four features: sepal length, sepal width, petal length, and petal width (the sepals make up the calyx, the outer whorl of the flower).

【Target values】

    The data set contains 3 species of iris, labeled 0, 1, and 2. This is why the K-means algorithm is asked to group the samples into 3 clusters.

【Target names】

    Labels 0, 1, and 2 correspond to the species setosa, versicolor, and virginica, respectively.
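The four items above can be checked directly from the object returned by load_iris. The following is a minimal sketch, assuming scikit-learn is installed; it is not the reference code itself, only a quick way to confirm the shape and the label names.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Number of samples and number of features per sample
print(iris.data.shape)

# Feature names as stored with the data set
print(iris.feature_names)

# Distinct integer labels and the species names they index
print(sorted(set(iris.target.tolist())))
print(list(iris.target_names))
```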

The reference code also uses the show_data function to plot sepal length against sepal width and petal length against petal width. The results are shown in the figures below.

【Sepal】

【Petal】

1: Call KMeans for clustering

In task 1, the sepal features and the petal features are clustered separately. The KMeans class from the [sklearn] library is used, with the number of clusters set to 3, and the fit method trains each model. The resulting cluster labels are [kmeans_sepal.labels_] and [kmeans_petal.labels_]. The overall code is shown in the figure below.

The head method is then used to print the cluster assignments of the first few rows. The program output is shown in the figure below.
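As a compact illustration of the call described above, here is a sketch for the petal features only. The n_init and random_state arguments are assumptions added here so that the run is repeatable; the reference code passes only n_clusters.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
x = pd.DataFrame(
    iris.data,
    columns=['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'],
)

# n_init and random_state are illustrative additions, not part of the reference code
kmeans_petal = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_petal.fit(x[['Petal_Length', 'Petal_Width']])

# One cluster index (0, 1, or 2) per sample
x['Petal_Cluster'] = kmeans_petal.labels_

print(x.head())
# One center per cluster, in (Petal_Length, Petal_Width) coordinates
print(kmeans_petal.cluster_centers_)
```

The cluster indices printed by head are arbitrary identifiers chosen by the algorithm; they are not guaranteed to line up with the species labels 0, 1, and 2.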

2: Draw pictures before and after clustering

In task 2, we set the figure size and create 4 subplots, which show scatter plots of the sepal data before clustering, the sepal data after clustering, the petal data before clustering, and the petal data after clustering. The overall code is shown in the figure below.

The program output is shown in the figure below. Original Sepal Data shows the sepal data colored by the original labels, Sepal Clustering Overlay shows the sepal data colored by cluster, Original Petal Data shows the petal data colored by the original labels, and Petal Clustering Overlay shows the petal data colored by cluster.

3: Calculate and output the accuracy

In task 3, [from sklearn.metrics import accuracy_score] imports the accuracy metric. It takes the labels of the data set itself and the labels produced by K-means clustering, compares them position by position, and returns the accuracy. The overall code is shown in the figure below.

The program output is shown in the figure below. The calculated accuracy of the sepal clustering is 25.33%, and the accuracy of the petal clustering is 1.33%.

Comparing the plots from task 2 gives the relationship between the original labels and the cluster labels shown in the table below. Rows in which the result label differs from the original label (highlighted in orange in the original table) are counted as errors by accuracy_score even when the cluster itself is correct. The reported accuracy is therefore misleading: it is meaningful only when each cluster index happens to equal the label of the matching class. Evaluation metrics suited to unsupervised learning should be used to judge the quality of the results instead.

Cluster object    Original label    Result label
Sepal             red               black
Sepal             green             green
Sepal             black             red
Petal             red               black
Petal             green             red
Petal             black             green
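One way to see how much of the low accuracy comes from the numbering alone is to try every relabeling of the clusters and keep the best match. The sketch below is an illustration under stated assumptions: aligned_accuracy is a hypothetical helper, not part of sklearn or of the reference code, and the brute-force search over permutations is only practical for a small number of clusters such as 3.

```python
from itertools import permutations

import numpy as np


def aligned_accuracy(y_true, y_pred, n_clusters=3):
    """Best accuracy over all relabelings of the cluster indices (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    best = 0.0
    for perm in permutations(range(n_clusters)):
        # perm[c] is the class label assigned to cluster index c
        remapped = np.array(perm)[y_pred]
        best = max(best, float(np.mean(remapped == y_true)))
    return best


# Toy labels: the grouping is perfect, but cluster indices 0 and 2 are swapped
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 1, 1, 0, 0])

raw = float(np.mean(y_pred == y_true))
print(raw)                                # only the middle pair matches by index
print(aligned_accuracy(y_true, y_pred))   # every point matches after relabeling
```

The toy labels mirror the sepal rows of the table: one cluster keeps its index and the other two are swapped, so the raw accuracy is low although the grouping is identical.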

4: Calculate and output the silhouette coefficient (additional task)

In task 4, [from sklearn.metrics import silhouette_score] imports the silhouette coefficient. It takes the feature values of the data set and the labels produced by K-means clustering and returns the silhouette coefficient. The overall code is shown in the figure below.

The program output is shown in the figure below. The calculated silhouette coefficient of the sepal clustering is about 0.45, and that of the petal clustering is about 0.66.
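To make the number returned by silhouette_score more concrete, the sketch below computes the per-point silhouette value s = (b - a) / max(a, b) by hand, where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The helper silhouette_values is hypothetical and written only for illustration; it assumes Euclidean distance and at least two points per cluster.

```python
import numpy as np


def silhouette_values(points, labels):
    """Per-point silhouette values (hypothetical helper, Euclidean distance)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.empty(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False                     # exclude the point itself
        a = dist[i, same].mean()            # mean distance within its own cluster
        b = min(                            # nearest other cluster, by mean distance
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores


# Two tight, well-separated clusters on a line
points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]

s = silhouette_values(points, labels)
print(s)          # one value per point
print(s.mean())   # the overall silhouette coefficient is the mean of these values
```

For these four points every value is close to 1, because each point is much nearer to its own cluster than to the other one.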

5: Calculate and output the Adjusted Rand Index (additional task)

In task 5, [from sklearn.metrics import adjusted_rand_score] imports the ARI metric. It takes the labels of the data set itself and the labels produced by K-means clustering and returns the ARI. The overall code is shown in the figure below.

The program output is shown in the figure below. The calculated ARI of the sepal clustering is about 0.60, and the ARI of the petal clustering is about 0.89.
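The difference between accuracy and ARI can be shown on a toy example in which the grouping is perfect but every cluster index differs from the class label. This is a minimal sketch with made-up labels, not data from the experiment.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 2, 2, 0, 0]   # identical grouping, different cluster indices

acc = accuracy_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)

print(acc)   # 0.0: no index coincides with its class label
print(ari)   # 1.0: the two partitions are identical
```

ARI compares which samples are grouped together rather than which number each group carries, so it is unaffected by how K-means numbers its clusters.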

4. Problems encountered and solutions

Problem 1: The Accuracy calculated for the K-means clustering is different every time the program is executed.

Solution 1: Comparing the printed data set labels with the cluster labels shows that K-means may assign a different index to each cluster on every run. The cluster indices and the class labels are related only by some mapping (for example, label 0 in the data set may correspond to cluster index 1 in the K-means labels rather than to cluster index 0), and there is no guarantee that this mapping is the same from one run to the next. It is therefore necessary to use an evaluation index such as ARI, which does not depend on how the clusters are numbered, to judge the quality of the clustering.
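A complementary measure, not used in the reference code, is to fix the random seed so that repeated runs produce the same cluster indices. The sketch below assumes the random_state and n_init arguments of KMeans; fit_labels is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

# Columns 2 and 3 of the iris data are petal length and petal width
petal = datasets.load_iris().data[:, 2:4]


def fit_labels(seed):
    """Cluster the petal features with a fixed seed (hypothetical helper)."""
    model = KMeans(n_clusters=3, n_init=10, random_state=seed)
    return model.fit(petal).labels_


labels_a = fit_labels(0)
labels_b = fit_labels(0)

# With the same seed the two runs assign identical cluster indices
print(np.array_equal(labels_a, labels_b))
```

Fixing the seed makes the accuracy figure stable, but it does not make the cluster indices agree with the class labels, so an index-independent metric such as ARI is still the appropriate measure.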

5. Experiment summary and insights

1: The KMeans class exposes the following parameters: n_clusters (the number of clusters), init (how the cluster centers are initialized: random picks samples at random as initial centers, while k-means++ picks initial centers that are spread out across the data), n_init (the number of times the algorithm is run with different initial centers, with the best run kept), max_iter (the maximum number of iterations in a single run), tol (the convergence tolerance), random_state (the seed of the random number generator, which makes the results repeatable), and algorithm (the K-means variant to use; older scikit-learn releases accepted full, elkan, and auto, while recent releases use lloyd and elkan). The precompute_distances parameter (whether to precompute distances, trading memory for speed) existed in older releases and has since been removed. Generally speaking, the most important choice is the number of clusters n_clusters, because it directly determines the clustering result.

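2: ARI is used to evaluate the consistency between the clustering result and the true labels, independently of how the clusters are numbered. Its maximum value is 1, which means the two groupings are identical; values near 0 correspond to a random assignment, and negative values mean the agreement is worse than chance. The closer the value is to 1, the better the clustering.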

3: The silhouette coefficient measures how close a data point is to the other points in its own cluster compared with how close it is to the points of the nearest other cluster. Its value lies in [-1, 1]. A value close to 1 means the point is very close to the points in its own cluster and far from the points in other clusters, which usually means it has been assigned to an appropriate cluster. A value close to 0 means the point is about as close to another cluster as it is to its own, which usually indicates that it lies on the boundary between two or more clusters. A value close to -1 means the point is far from the points in its own cluster but close to those of another cluster, which usually indicates that it has been assigned to the wrong cluster. The silhouette coefficient can be used to select the best K value, to compare the performance of different clustering algorithms, or to evaluate the quality of a clustering result.

4: The main idea of K-means clustering is to iterate two steps: assign each data point to its nearest cluster center, then recompute each center as the mean of the points assigned to it. When applying a clustering algorithm, it helps to run experiments and evaluate the clustering performance. Visualization and metric-based evaluation give a better understanding of the structure of the data and help in choosing an appropriate K value.
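As an illustration of choosing K with a metric, the sketch below fits K-means on the petal features for several values of K and reports the silhouette coefficient of each. The range of K values, n_init, and random_state are assumptions made for this example and are not taken from the experiment.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Columns 2 and 3 of the iris data are petal length and petal width
petal = datasets.load_iris().data[:, 2:4]

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(petal)
    scores[k] = silhouette_score(petal, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")

best_k = max(scores, key=scores.get)
print("highest silhouette at k =", best_k)
```

The K with the highest silhouette coefficient need not equal the known number of species, so the metric is best read together with the scatter plots rather than on its own.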

6. Appendix

(1) Complete program source code (including comments)

Each task is implemented under its own banner comment, and each block of code is annotated with explanatory comments.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, adjusted_rand_score, silhouette_score

def print_data(want_print, print_iris):
    """
    Print one part of the iris data set.
    :param want_print: name of the part being printed
    :param print_iris: the data to print
    :return: None
    """
    print("iris {0}:\n{1}".format(want_print, print_iris))
    print("=" * 85)

 

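def show_data(length, width, title):
    """
    Draw a scatter plot of length against width, colored by the true labels.
    :param length: values for the x-axis
    :param width: values for the y-axis
    :param title: plot title
    :return: None
    """
    # Create a canvas
    plt.figure(figsize=(14, 7))
    # colormap and y are module-level variables defined in the main block below
    plt.scatter(length, width, c=colormap[y.Targets], s=40)
    plt.title(title)
    plt.show()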

if __name__ == '__main__':
    # Load the iris data set
    iris = datasets.load_iris()
    # Print the iris data
    print_data(want_print="data", print_iris=iris.data)
    # Print the iris feature names
    print_data(want_print="feature names", print_iris=iris.feature_names)
    # Print the target values
    print_data(want_print="target values", print_iris=iris.target)
    # Print the names of the target values
    print_data(want_print="target names", print_iris=iris.target_names)
    # For convenience, convert the iris data to pandas structures and name the columns
    # Convert the iris data to a pandas DataFrame
    x = pd.DataFrame(iris.data)
    # Name the columns 'Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width'
    x.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
    # Convert the iris target values to a pandas DataFrame as well
    y = pd.DataFrame(iris.target)
    # Name the target column 'Targets'
    y.columns = ['Targets']
    # Create the color palette (one color per label 0, 1, 2)
    colormap = np.array(['red', 'lime', 'black'])
    # Plot sepal length against sepal width
    show_data(length=x.Sepal_Length, width=x.Sepal_Width, title='Sepal')
    # Plot petal length against petal width
    show_data(length=x.Petal_Length, width=x.Petal_Width, title='Petal')

   

    ###########################################################################################
    # Call KMeans for clustering

    # for sepal
    # No random_state is passed, so the cluster indices can differ between runs
    kmeans_sepal = KMeans(n_clusters=3)
    kmeans_sepal.fit(x[['Sepal_Length', 'Sepal_Width']])
    x['Sepal_Cluster'] = kmeans_sepal.labels_
    # for petal
    kmeans_petal = KMeans(n_clusters=3)
    kmeans_petal.fit(x[['Petal_Length', 'Petal_Width']])
    x['Petal_Cluster'] = kmeans_petal.labels_
    # Print the first few rows (including the clustering results)
    print(x.head())
    # print(y.Targets)

   

    ###########################################################################################
    # Calculate the silhouette coefficient

    # Silhouette coefficient of the sepal clustering
    silhouette_sepal = silhouette_score(x[['Sepal_Length', 'Sepal_Width']], x['Sepal_Cluster'])
    print("Silhouette Score for Sepal Clustering:", silhouette_sepal)
    # Silhouette coefficient of the petal clustering
    silhouette_petal = silhouette_score(x[['Petal_Length', 'Petal_Width']], x['Petal_Cluster'])
    print("Silhouette Score for Petal Clustering:", silhouette_petal)

   

    ###########################################################################################
    # Plot the data before and after clustering

    # Compare the sepal data before and after clustering
    plt.figure(figsize=(16, 7))
    # Scatter plot of the original sepal data, colored by the true labels
    plt.subplot(2, 2, 1)
    plt.scatter(x['Sepal_Length'], x['Sepal_Width'], c=colormap[y['Targets']], s=40, label='Original Data')
    plt.title('Original Sepal Data')
    # Scatter plot of the sepal clustering result, one color per cluster index
    plt.subplot(2, 2, 2)
    for cluster in np.unique(x['Sepal_Cluster']):
        cluster_data = x[x['Sepal_Cluster'] == cluster]
        plt.scatter(cluster_data['Sepal_Length'], cluster_data['Sepal_Width'], c=colormap[cluster], s=40, label=f'Cluster {cluster}')
    plt.title('Sepal Clustering Overlay')
    plt.legend()
    # Compare the petal data before and after clustering
    # Scatter plot of the original petal data, colored by the true labels
    plt.subplot(2, 2, 3)
    plt.scatter(x['Petal_Length'], x['Petal_Width'], c=colormap[y['Targets']], s=40, label='Original Data')
    plt.title('Original Petal Data')
    # Scatter plot of the petal clustering result, one color per cluster index
    plt.subplot(2, 2, 4)
    for cluster in np.unique(x['Petal_Cluster']):
        cluster_data = x[x['Petal_Cluster'] == cluster]
        plt.scatter(cluster_data['Petal_Length'], cluster_data['Petal_Width'], c=colormap[cluster], s=40, label=f'Cluster {cluster}')
    plt.title('Petal Clustering Overlay')
    plt.legend()
    plt.tight_layout()
    plt.show()

   

    ###########################################################################################
    # Calculate and output Accuracy
    # Note: accuracy_score compares raw indices, so the result depends on how
    # KMeans happens to number the clusters in this run

    # acc for sepal
    accuracy_sepal = accuracy_score(iris.target, kmeans_sepal.labels_)
    print("Accuracy for Sepal Clustering: {:.2f}%".format(accuracy_sepal * 100))

    # acc for petal
    accuracy_petal = accuracy_score(iris.target, kmeans_petal.labels_)
    print("Accuracy for Petal Clustering: {:.2f}%".format(accuracy_petal * 100))

   

    ###########################################################################################
    # Calculate and output ARI (adjusted_rand_score)
    # ARI (Adjusted Rand Index):
    # evaluates the consistency between the clustering result and the true labels.
    # Its maximum value is 1; the closer to 1, the better the clustering.

    # ARI for sepal
    ari_score_sepal = adjusted_rand_score(iris.target, x['Sepal_Cluster'])
    print("ARI for Sepal Clustering:", ari_score_sepal)

    # ARI for petal
    ari_score_petal = adjusted_rand_score(iris.target, x['Petal_Cluster'])
    print("ARI for Petal Clustering:", ari_score_petal)

(2) Data set text file

"Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

"1" 5.1 3.5 1.4 0.2 "setosa"

"2" 4.9 3 1.4 0.2 "setosa"

"3" 4.7 3.2 1.3 0.2 "setosa"

"4" 4.6 3.1 1.5 0.2 "setosa"

"5" 5 3.6 1.4 0.2 "setosa"

"6" 5.4 3.9 1.7 0.4 "setosa"

"7" 4.6 3.4 1.4 0.3 "setosa"

"8" 5 3.4 1.5 0.2 "setosa"

"9" 4.4 2.9 1.4 0.2 "setosa"

"10" 4.9 3.1 1.5 0.1 "setosa"

"11" 5.4 3.7 1.5 0.2 "setosa"

"12" 4.8 3.4 1.6 0.2 "setosa"

"13" 4.8 3 1.4 0.1 "setosa"

"14" 4.3 3 1.1 0.1 "setosa"

"15" 5.8 4 1.2 0.2 "setosa"

"16" 5.7 4.4 1.5 0.4 "setosa"

"17" 5.4 3.9 1.3 0.4 "setosa"

"18" 5.1 3.5 1.4 0.3 "setosa"

"19" 5.7 3.8 1.7 0.3 "setosa"

"20" 5.1 3.8 1.5 0.3 "setosa"

"21" 5.4 3.4 1.7 0.2 "setosa"

"22" 5.1 3.7 1.5 0.4 "setosa"

"23" 4.6 3.6 1 0.2 "setosa"

"24" 5.1 3.3 1.7 0.5 "setosa"

"25" 4.8 3.4 1.9 0.2 "setosa"

"26" 5 3 1.6 0.2 "setosa"

"27" 5 3.4 1.6 0.4 "setosa"

"28" 5.2 3.5 1.5 0.2 "setosa"

"29" 5.2 3.4 1.4 0.2 "setosa"

"30" 4.7 3.2 1.6 0.2 "setosa"

"31" 4.8 3.1 1.6 0.2 "setosa"

"32" 5.4 3.4 1.5 0.4 "setosa"

"33" 5.2 4.1 1.5 0.1 "setosa"

"34" 5.5 4.2 1.4 0.2 "setosa"

"35" 4.9 3.1 1.5 0.2 "setosa"

"36" 5 3.2 1.2 0.2 "setosa"

"37" 5.5 3.5 1.3 0.2 "setosa"

"38" 4.9 3.6 1.4 0.1 "setosa"

"39" 4.4 3 1.3 0.2 "setosa"

"40" 5.1 3.4 1.5 0.2 "setosa"

"41" 5 3.5 1.3 0.3 "setosa"

"42" 4.5 2.3 1.3 0.3 "setosa"

"43" 4.4 3.2 1.3 0.2 "setosa"

"44" 5 3.5 1.6 0.6 "setosa"

"45" 5.1 3.8 1.9 0.4 "setosa"

"46" 4.8 3 1.4 0.3 "setosa"

"47" 5.1 3.8 1.6 0.2 "setosa"

"48" 4.6 3.2 1.4 0.2 "setosa"

"49" 5.3 3.7 1.5 0.2 "setosa"

"50" 5 3.3 1.4 0.2 "setosa"

"51" 7 3.2 4.7 1.4 "versicolor"

"52" 6.4 3.2 4.5 1.5 "versicolor"

"53" 6.9 3.1 4.9 1.5 "versicolor"

"54" 5.5 2.3 4 1.3 "versicolor"

"55" 6.5 2.8 4.6 1.5 "versicolor"

"56" 5.7 2.8 4.5 1.3 "versicolor"

"57" 6.3 3.3 4.7 1.6 "versicolor"

"58" 4.9 2.4 3.3 1 "versicolor"

"59" 6.6 2.9 4.6 1.3 "versicolor"

"60" 5.2 2.7 3.9 1.4 "versicolor"

"61" 5 2 3.5 1 "versicolor"

"62" 5.9 3 4.2 1.5 "versicolor"

"63" 6 2.2 4 1 "versicolor"

"64" 6.1 2.9 4.7 1.4 "versicolor"

"65" 5.6 2.9 3.6 1.3 "versicolor"

"66" 6.7 3.1 4.4 1.4 "versicolor"

"67" 5.6 3 4.5 1.5 "versicolor"

"68" 5.8 2.7 4.1 1 "versicolor"

"69" 6.2 2.2 4.5 1.5 "versicolor"

"70" 5.6 2.5 3.9 1.1 "versicolor"

"71" 5.9 3.2 4.8 1.8 "versicolor"

"72" 6.1 2.8 4 1.3 "versicolor"

"73" 6.3 2.5 4.9 1.5 "versicolor"

"74" 6.1 2.8 4.7 1.2 "versicolor"

"75" 6.4 2.9 4.3 1.3 "versicolor"

"76" 6.6 3 4.4 1.4 "versicolor"

"77" 6.8 2.8 4.8 1.4 "versicolor"

"78" 6.7 3 5 1.7 "versicolor"

"79" 6 2.9 4.5 1.5 "versicolor"

"80" 5.7 2.6 3.5 1 "versicolor"

"81" 5.5 2.4 3.8 1.1 "versicolor"

"82" 5.5 2.4 3.7 1 "versicolor"

"83" 5.8 2.7 3.9 1.2 "versicolor"

"84" 6 2.7 5.1 1.6 "versicolor"

"85" 5.4 3 4.5 1.5 "versicolor"

"86" 6 3.4 4.5 1.6 "versicolor"

"87" 6.7 3.1 4.7 1.5 "versicolor"

"88" 6.3 2.3 4.4 1.3 "versicolor"

"89" 5.6 3 4.1 1.3 "versicolor"

"90" 5.5 2.5 4 1.3 "versicolor"

"91" 5.5 2.6 4.4 1.2 "versicolor"

"92" 6.1 3 4.6 1.4 "versicolor"

"93" 5.8 2.6 4 1.2 "versicolor"

"94" 5 2.3 3.3 1 "versicolor"

"95" 5.6 2.7 4.2 1.3 "versicolor"

"96" 5.7 3 4.2 1.2 "versicolor"

"97" 5.7 2.9 4.2 1.3 "versicolor"

"98" 6.2 2.9 4.3 1.3 "versicolor"

"99" 5.1 2.5 3 1.1 "versicolor"

"100" 5.7 2.8 4.1 1.3 "versicolor"

"101" 6.3 3.3 6 2.5 "virginica"

"102" 5.8 2.7 5.1 1.9 "virginica"

"103" 7.1 3 5.9 2.1 "virginica"

"104" 6.3 2.9 5.6 1.8 "virginica"

"105" 6.5 3 5.8 2.2 "virginica"

"106" 7.6 3 6.6 2.1 "virginica"

"107" 4.9 2.5 4.5 1.7 "virginica"

"108" 7.3 2.9 6.3 1.8 "virginica"

"109" 6.7 2.5 5.8 1.8 "virginica"

"110" 7.2 3.6 6.1 2.5 "virginica"

"111" 6.5 3.2 5.1 2 "virginica"

"112" 6.4 2.7 5.3 1.9 "virginica"

"113" 6.8 3 5.5 2.1 "virginica"

"114" 5.7 2.5 5 2 "virginica"

"115" 5.8 2.8 5.1 2.4 "virginica"

"116" 6.4 3.2 5.3 2.3 "virginica"

"117" 6.5 3 5.5 1.8 "virginica"

"118" 7.7 3.8 6.7 2.2 "virginica"

"119" 7.7 2.6 6.9 2.3 "virginica"

"120" 6 2.2 5 1.5 "virginica"

"121" 6.9 3.2 5.7 2.3 "virginica"

"122" 5.6 2.8 4.9 2 "virginica"

"123" 7.7 2.8 6.7 2 "virginica"

"124" 6.3 2.7 4.9 1.8 "virginica"

"125" 6.7 3.3 5.7 2.1 "virginica"

"126" 7.2 3.2 6 1.8 "virginica"

"127" 6.2 2.8 4.8 1.8 "virginica"

"128" 6.1 3 4.9 1.8 "virginica"

"129" 6.4 2.8 5.6 2.1 "virginica"

"130" 7.2 3 5.8 1.6 "virginica"

"131" 7.4 2.8 6.1 1.9 "virginica"

"132" 7.9 3.8 6.4 2 "virginica"

"133" 6.4 2.8 5.6 2.2 "virginica"

"134" 6.3 2.8 5.1 1.5 "virginica"

"135" 6.1 2.6 5.6 1.4 "virginica"

"136" 7.7 3 6.1 2.3 "virginica"

"137" 6.3 3.4 5.6 2.4 "virginica"

"138" 6.4 3.1 5.5 1.8 "virginica"

"139" 6 3 4.8 1.8 "virginica"

"140" 6.9 3.1 5.4 2.1 "virginica"

"141" 6.7 3.1 5.6 2.4 "virginica"

"142" 6.9 3.1 5.1 2.3 "virginica"

"143" 5.8 2.7 5.1 1.9 "virginica"

"144" 6.8 3.2 5.9 2.3 "virginica"

"145" 6.7 3.3 5.7 2.5 "virginica"

"146" 6.7 3 5.2 2.3 "virginica"

"147" 6.3 2.5 5 1.9 "virginica"

"148" 6.5 3 5.2 2 "virginica"

"149" 6.2 3.4 5.4 2.3 "virginica"

"150" 5.9 3 5.1 1.8 "virginica"

Origin blog.csdn.net/m0_65787507/article/details/134828707