K-means clustering algorithm (with Python3 implementation code)

The code and data address of this article has been uploaded to github: https://github.com/helloWorldchn/MachineLearning

1. Basic idea of the K-means algorithm

1. Partition-based clustering

The idea of partitioning algorithms is to divide the data objects of a given data set into K groups (K ≤ N, where N is the number of objects in the data set), each group representing one cluster. Two conditions must hold: every data object belongs to exactly one cluster, and every cluster contains at least one data object.
Such algorithms usually require a parameter K to be given before the algorithm starts, which determines the number of clusters in the result. The algorithm builds an initial grouping according to K and then repeatedly applies the iterative relocation technique to reassign the data objects among the clusters, finally arriving at a satisfactory clustering result. A clustering is considered good when the data objects within a cluster are as close to each other as possible and the data objects in different clusters are as far apart as possible. K-medoids and K-means are two classic partitioning algorithms, and many other partitioning algorithms have evolved from or improved on these two.

2. Introduction to K-means

Lloyd first proposed the k-means algorithm in 1957. In 1967, MacQueen presented the classic k-means algorithm in the literature, described its complete theory, and studied it in detail. As the most classic partitioning clustering algorithm, k-means is not complicated to implement and scales well. It is also reliable and efficient, which makes it one of the most widely used clustering algorithms.

3. K-means algorithm process

The K-means algorithm accepts a parameter K that determines the number of clusters in the result. At the start of the algorithm, K data objects are randomly selected from the data set as the initial centers of the K clusters, and each remaining data object is assigned to the cluster whose center is nearest to it. The mean of all data objects in each cluster is then recomputed and used as the new cluster center; this process is repeated until the objective function converges.

The specific steps of the algorithm are described below:

  1. For a given data set, randomly initialize K cluster centers (centroids).
  2. Compute the distance from each data object to every cluster center (the Euclidean distance is generally used) and assign the object to the nearest cluster.
  3. Based on the resulting clusters, recompute each cluster center.
  4. Repeat steps 2 and 3 until the cluster centers no longer change, or change by less than a specified threshold.
    [Figure: flow chart of the K-means algorithm]

4. K-means pseudocode

Input: a set S = {x1, x2, ..., xn} of n data objects and the number of clusters k.
Output: k cluster centers Zj and k clusters of data objects Cj.

Procedure K-means(S, k)
  m = 1
  for j = 1 to k: initialize the cluster center Zj
  do {
      for i = 1 to n
          for j = 1 to k
              D(Xi, Zj) = ||Xi - Zj||
          if D(Xi, Zj) = min{ D(Xi, Zj) } then Xi ∈ Cj    // assign Xi to the nearest cluster
      Jc(m) = Σ_{j=1..k} Σ_{Xi∈Cj} ||Xi - Zj||²           // objective: within-cluster sum of squares
      m = m + 1
      for j = 1 to k
          Zj = ( Σ_{Xi∈Cj} Xi ) / |Cj|                    // recompute the cluster centers
  } while |Jc(m) - Jc(m-1)| > ξ
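To make the pseudocode concrete, here is a minimal NumPy sketch of the same loop. It is an illustration only; the function name `kmeans` and its parameters are my own, and the experiments below use scikit-learn's KMeans instead.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data objects as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every object to the nearest center (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster becomes empty
                new_centers[j] = members.mean(axis=0)
        # Step 4: stop when the centers (and hence the objective Jc) stop changing
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers

For example, `labels, centers = kmeans(X, 3)` clusters a feature matrix `X` into three groups.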

2. K-means code implementation

The data sets used in this article come from the UCI repository: the iris data set Iris, the wine data set Wine, and the wheat seed data set Seeds. All three were downloaded from the UCI website and placed in the same folder as the Python file. Because of how the program reads them, the column order of the data sets has been changed slightly so that the class label is the last column. The data sets are summarized below:

Data set    Number of samples    Attribute dimension    Number of classes
Iris        150                  4                      3
Wine        178                  13                     3
Seeds       210                  7                      3
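For reference, the code below assumes that each CSV file stores the feature columns first and the class label in the last column, roughly like these hypothetical first rows of Iris.csv (the actual column names in your copy may differ):

SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa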

The data sets are also available in my homepage resources as a free download; if you cannot download them, you can send me a private message.

1. Python3 code implementation

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score, accuracy_score, normalized_mutual_info_score, rand_score
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA


# The data are stored in .csv files
iris = pd.read_csv("dataset/Iris.csv", header=0)  # Iris data set  class=3
wine = pd.read_csv("dataset/wine.csv")  # Wine data set  class=3
seeds = pd.read_csv("dataset/seeds.csv")  # wheat seed data set Seeds  class=3
wdbc = pd.read_csv("dataset/wdbc.csv")  # Breast Cancer Wisconsin (Diagnostic) data set  class=2
glass = pd.read_csv("dataset/glass.csv")  # Glass Identification data set  class=6
df = iris  # select the data set to cluster
# print(df)

columns = list(df.columns)  # column names; the header row holds the feature names
features = columns[:len(columns) - 1]  # feature names (the last column holds the labels, so it is excluded)
dataset = df[features]  # the data without the label column
attributes = len(df.columns) - 1  # number of attributes (dimensionality of the data set)
class_labels = list(df[columns[-1]])  # ground-truth labels

k = 3

# We already know these data sets have 3 classes; for other data sets this parameter must be tuned
model = KMeans(n_clusters=k)
# Fit the model
model.fit(dataset)
# Predict the cluster of every data object
label = model.predict(dataset)
print(label)


def clustering_indicators(labels_true, labels_pred):
    if type(labels_true[0]) != int:  # if the labels are text, encode them as integer labels
        labels_true = LabelEncoder().fit_transform(df[columns[len(columns) - 1]])
    f_measure = f1_score(labels_true, labels_pred, average='macro')  # F-measure
    accuracy = accuracy_score(labels_true, labels_pred)  # ACC
    normalized_mutual_information = normalized_mutual_info_score(labels_true, labels_pred)  # NMI
    rand_index = rand_score(labels_true, labels_pred)  # RI
    return f_measure, accuracy, normalized_mutual_information, rand_index


F_measure, ACC, NMI, RI = clustering_indicators(class_labels, label)
print("F_measure:", F_measure, "ACC:", ACC, "NMI", NMI, "RI", RI)

if attributes > 2:
    dataset = PCA(n_components=2).fit_transform(dataset)  # if there are more than 2 attributes, reduce to 2 dimensions with PCA
else:
    dataset = np.asarray(dataset)  # ensure a NumPy array so the positional indexing below works
# Scatter plot of the raw data
plt.scatter(dataset[:, 0], dataset[:, 1], marker='o', c='black', s=7)  # original data
plt.show()
colors = np.array(["red", "blue", "green", "orange", "purple", "cyan", "magenta", "beige", "hotpink", "#88c999"])
# Plot the k clusters, each in a different color
for i in range(k):
    plt.scatter(dataset[np.nonzero(label == i), 0], dataset[np.nonzero(label == i), 1], c=colors[i], s=7)
plt.show()

2. Analysis of clustering results

In this article, the F-measure (FM), accuracy (ACC), normalized mutual information (NMI), and Rand index (RI) are chosen as evaluation indicators. Each has a value range of [0, 1], and larger values indicate clustering results that better match expectations.

The F-measure combines two indicators, precision and recall; its value is the harmonic mean of the two. The formulas are:

$$Precision=\frac{TP}{TP+FP}$$

$$Recall=\frac{TP}{TP+FN}$$

$$F\text{-}measure=\frac{2 \times Recall \times Precision}{Recall+Precision}$$

ACC is the ratio of the number of correctly classified samples to the total number of samples in the data set:

$$ACC=\frac{TP+TN}{TP+TN+FP+FN}$$

Here TP (True Positive) is the number of positive samples predicted as positive, TN (True Negative) is the number of negative samples predicted as negative, FP (False Positive) is the number of negative samples predicted as positive, and FN (False Negative) is the number of positive samples predicted as negative.
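As a quick illustration of these formulas (the numbers are chosen for the example only): with TP = 40, TN = 45, FP = 10, and FN = 5, Precision = 40/50 = 0.8, Recall = 40/45 ≈ 0.889, F-measure ≈ 0.842, and ACC = 85/100 = 0.85.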

NMI quantifies how well the clustering result matches the known class labels. Unlike ACC, the value of NMI is not affected by the permutation of the cluster labels. It is computed as:

$$NMI=\frac{I\left(U,V\right)}{\sqrt{H\left(U\right)H\left(V\right)}}$$

Here I(U, V) is the mutual information between the true partition U and the clustering result V, H(U) is the entropy of the correct classification, and H(V) is the entropy of the result produced by the algorithm.

The implementation is as follows.
Since the labels given in a data set may be text rather than numbers, the code first checks whether the labels are numeric and, if not, converts them into numeric labels:

def clustering_indicators(labels_true, labels_pred):
    if type(labels_true[0]) != int:  # if the labels are text, encode them as integer labels
        labels_true = LabelEncoder().fit_transform(df[columns[len(columns) - 1]])
    f_measure = f1_score(labels_true, labels_pred, average='macro')  # F-measure
    accuracy = accuracy_score(labels_true, labels_pred)  # ACC
    normalized_mutual_information = normalized_mutual_info_score(labels_true, labels_pred)  # NMI
    rand_index = rand_score(labels_true, labels_pred)  # RI
    return f_measure, accuracy, normalized_mutual_information, rand_index


F_measure, ACC, NMI, RI = clustering_indicators(class_labels, label)
print("F_measure:", F_measure, "ACC:", ACC, "NMI", NMI, "RI", RI)

To compute these clustering evaluation indicators, simply insert the above code into the K-means implementation code.
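One caveat the code above does not handle: KMeans numbers its clusters arbitrarily, so ACC and the F-measure depend on how the predicted cluster numbers happen to line up with the true labels (NMI and RI do not). A common remedy, not part of the original code and requiring SciPy, is to remap the predicted labels to the true labels with the Hungarian algorithm before computing ACC; the helper name `best_map` here is my own:

import numpy as np
from scipy.optimize import linear_sum_assignment

def best_map(labels_true, labels_pred):
    # Both label arrays must be integers (encode text labels first).
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    # Contingency matrix: cost[p, t] counts samples with predicted
    # cluster p and true label t.
    n = max(labels_true.max(), labels_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(labels_true, labels_pred):
        cost[p, t] += 1
    # Find the cluster-to-label assignment that maximizes agreement
    rows, cols = linear_sum_assignment(cost, maximize=True)
    mapping = dict(zip(rows, cols))
    return np.array([mapping[p] for p in labels_pred])

For example, `accuracy_score(labels_true, best_map(labels_true, label))` gives an ACC that no longer depends on the arbitrary cluster numbering.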

3. Clustering results

  1. Iris data set Iris
     [Figure: original scatter plot of the Iris data set]
     [Figure: K-means clustering result on the Iris data set]

  2. Wine data set Wine
     [Figure: original scatter plot of the Wine data set]
     [Figure: K-means clustering result on the Wine data set]

  3. Wheat seed data set Seeds
     [Figure: original scatter plot of the Seeds data set]
     [Figure: K-means clustering result on the Seeds data set]

4. Insufficiency of K-means algorithm

The core step of the K-means algorithm is to update the cluster centers iteratively so as to minimize the within-cluster distances. Its time complexity is low (roughly O(nkt) for n objects, k clusters, and t iterations), which is why the algorithm is so widely used. It nevertheless has several shortcomings, the main ones being:

  1. The number of clusters must be specified by the user. K-means requires the user to provide the number of clusters K, and the choice of K directly affects the clustering result. K is usually chosen from the user's experience and understanding of the data set, so the specified value may not be ideal and the quality of the clustering cannot be guaranteed (a sketch of the elbow-method heuristic follows this list).
  2. The initial centers are selected at random. K-means depends heavily on the choice of initial centers: a poor choice strongly affects the subsequent clustering process, may prevent the algorithm from reaching the optimal clustering result, and may also increase the number of iterations. The randomness of the initial centers therefore introduces great uncertainty and directly affects the clustering quality.
  3. K-means measures similarity with the Euclidean distance, so it is difficult to obtain good clustering results on non-convex data sets.
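For point 1, a common heuristic (not used in this article's experiments) is the elbow method: run K-means for a range of K values and look for the bend in the curve of the within-cluster sum of squares, which scikit-learn exposes as `inertia_`. A minimal sketch, reusing the preprocessed `dataset` variable from before the PCA step:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
ks = range(1, 11)
for k in ks:
    # inertia_ is the within-cluster sum of squares (the objective Jc)
    inertias.append(KMeans(n_clusters=k, n_init=10).fit(dataset).inertia_)
plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters K')
plt.ylabel('within-cluster sum of squares')
plt.show()

For point 2, scikit-learn's KMeans already mitigates the sensitivity to initialization: it defaults to the k-means++ seeding strategy and restarts the algorithm n_init times, keeping the best run.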
