Hebei University of Technology Data Mining Experiment Five: k-means Clustering Algorithm

1. Experimental purpose

  1. Become familiar with the k-means clustering algorithm.
  2. Write a program that performs k-means clustering on the training sample set, run the k-means clustering algorithm on the task-related data, and debug the experiment.
  3. Master distance calculation methods and clustering evaluation criteria.
  4. Write a lab report.

2. Experimental principles

1. k-means clustering

K-means clustering is a centroid-based partitioning technique. The specific iterative calculation steps are as follows (a compact sketch of one iteration is given after the list):

  1. K centroid coordinates are randomly generated in the attribute vector space.
  2. For each data object $T_i$ ($1 \leq i \leq n$) in the data set $D$, calculate the distance measure $Dist(i,j)$ ($1 \leq i \leq n$, $1 \leq j \leq k$) to each of the $k$ centroids, and assign $T_i$ to the cluster with the minimum distance measure, i.e. $T_i \in C_J$, meaning that $T_i$ is gathered into the $J$-th cluster, where $J = \arg\min_j Dist(i,j)$, i.e. $J$ is the value of $j$ for which $Dist(i,j)$ is minimal.
  3. Recompute the centroid coordinates of each cluster according to the definition of the centroid, forming the next generation of $k$ centroid coordinates.
  4. If the termination condition is not met, go to 2) to continue iteration; otherwise, end.
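
As a compact illustration of steps 2) and 3) only, one assignment-and-update pass can be written in NumPy roughly as follows. The array names X and centroids are my own choice, and empty clusters are not handled in this sketch (the full experiment code later in this report deals with that case):

import numpy as np

def kmeans_step(X, centroids):
    # Step 2: distances from every object (row of X) to every centroid, shape (n, k),
    # then assign each object to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recompute each centroid as the mean of the objects assigned to it
    new_centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids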

The centroid of a cluster can be defined in different ways: for example, it can be the mean of the attribute vectors of the data objects in the cluster (i.e. the center of gravity), or a central representative point (medoid), etc. The distance measure can also be defined in different ways; commonly used measures are the Euclidean distance, the Manhattan (city-block) distance, and the Minkowski distance, a few of which are sketched below. The termination condition can be, for example, that no reassignment of objects occurs any more, at which point the iteration ends.
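
For illustration, the three distance measures just mentioned could be written as follows. These helper functions are my own sketch; the experiment code below defines only its own Euclidean function euclid.

import numpy as np

def euclidean(p, q):
    # Euclidean distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):
    # Manhattan (city-block) distance: sum of absolute coordinate differences
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def minkowski(p, q, r=2):
    # Minkowski distance of order r; r=1 gives Manhattan, r=2 gives Euclidean
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)) ** r) ** (1.0 / r)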

2. Termination conditions

The termination condition can be any of the following (a minimal convergence check is sketched after this list):

  1. No objects (or only a minimal number of objects) are reassigned to different clusters.
  2. No cluster centers (or only a minimal number of cluster centers) change any further.
  3. The sum of squared errors reaches a local minimum.
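
A minimal sketch of how the first two conditions could be checked in code is given below. The argument names and the tolerance tol are assumptions made for this illustration; the experiment code later in this report simply tests the old and new centroids for exact equality.

import numpy as np

def has_converged(old_centroids, new_centroids, old_labels, new_labels,
                  tol=1e-6, max_reassigned=0):
    # condition 2: the cluster centers have (almost) stopped moving
    centers_stable = np.allclose(old_centroids, new_centroids, atol=tol)
    # condition 1: no more than max_reassigned objects changed cluster
    reassigned = int(np.sum(np.asarray(old_labels) != np.asarray(new_labels)))
    return centers_stable or reassigned <= max_reassigned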

3. Experimental content and procedures

1. Experimental content

  1. According to the calculation steps of the k-means clustering algorithm, draw the program flow chart for k = 3;
  2. Implement the k-means clustering algorithm by programming it from the k-means program flow chart;
  3. Show a series of screenshots of the k-means clustering process in the experiment report, illustrating the gradual evolution of each cluster;
  4. In the report, point out and explain the choice of initial centroids, the choice of termination condition, and the choice of distance metric in the experimental code.

2. Experimental steps

The program implements the following functions:

  1. First, the attribute vectors in the data set D={D1,D2,D3} are input as experimental data;
  2. The k-means clustering algorithm is programmed by the k-means program flow chart and run with experimental data;
  3. During execution, the program pauses at suitable iteration counts and displays the intermediate results, such as the current cluster center positions and the assignment of each object to its nearest center (a sketch of such a pause-and-display step follows this list);
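
A helper of the following shape could be called inside the iteration loop to produce such intermediate output. The function name show_progress and the pause via input() are my own illustrative choices; the arguments follow the variable names epoch, centroids and classify used in the experiment code further below.

def show_progress(epoch, centroids, classify):
    # display the current state of the iteration and wait for the user
    print("Iteration", epoch)
    print("Current cluster centers:")
    print(centroids)
    for j, members in enumerate(classify):
        print("Cluster", j, "contains", len(members), "objects:", members)
    input("Press Enter to continue the iteration...")

It could be invoked inside the loop, for example, as: if epoch % 5 == 0: show_progress(epoch, centroids, classify).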

3. Program block diagram

[Figure: k-means program flow chart (block diagram)]

4. Experimental samples

data.txt

5. Experimental code

#!/usr/bin/env python  
# -*- coding: utf-8 -*-
#
# Copyright (C) 2021 #
# @Time    : 2022/5/30 21:29
# @Author  : Yang Haoyuan
# @Email   : [email protected]
# @File    : Exp5.py
# @Software: PyCharm
import math
import random
import argparse

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import calinski_harabasz_score

parser = argparse.ArgumentParser(description="Exp5")
parser.add_argument("--epochs", type=int, default=100)
parser.add_argument("--k", type=int, default=3)
parser.add_argument("--n", type=int, default=2)
parser.add_argument("--dataset", type=str, default="data.txt")

parser.set_defaults(augment=True)
args = parser.parse_args()
print(args)


# Read the data set from a text file
def loadDataset(filename):
    dataSet = []
    with open(filename, 'r') as file_to_read:
        while True:
            lines = file_to_read.readline()  # read the file line by line
            if not lines:
                break
            p_tmp = [str(i) for i in lines.split(sep="\t")]
            p_tmp[len(p_tmp) - 1] = p_tmp[len(p_tmp) - 1].strip("\n")
            for i in range(len(p_tmp)):
                p_tmp[i] = float(p_tmp[i])
            dataSet.append(p_tmp)

    return dataSet


# Compute the Euclidean distance between two n-dimensional points
def euclid(p1, p2, n):
    distance = 0
    for i in range(n):
        distance = distance + (p1[i] - p2[i]) ** 2
    return math.sqrt(distance)


# Initialize the k cluster centers at random positions within the attribute value ranges
def init_centroids(dataSet, k, n):
    _min = dataSet.min(axis=0)
    _max = dataSet.max(axis=0)

    centre = np.empty((k, n))
    for i in range(k):
        for j in range(n):
            centre[i][j] = random.uniform(_min[j], _max[j])
    return centre


# Compute the Euclidean distance from every data object to every centroid
def cal_distance(dataSet, centroids, k, n):
    dis = np.empty((len(dataSet), k))
    for i in range(len(dataSet)):
        for j in range(k):
            dis[i][j] = euclid(dataSet[i], centroids[j], n)
    return dis


# K-Means clustering
def KMeans_Cluster(dataSet, k, n, epochs):
    epoch = 0
    # initialize the cluster centers
    centroids = init_centroids(dataSet, k, n)
    # iterate at most epochs times
    while epoch < epochs:
        # compute the Euclidean distances
        distance = cal_distance(dataSet, centroids, k, n)

        classify = []
        for i in range(k):
            classify.append([])

        # compare distances and assign each object to the nearest center
        for i in range(len(dataSet)):
            List = distance[i].tolist()
            # Because the initial centers are chosen completely at random, the first
            # assignment may leave some clusters empty, so their centroids become nan.
            # In that case distance[i] contains nan and looking up its minimum fails;
            # the exception is caught here and clustering is restarted recursively
            # until it succeeds, returning the labels and centroids of that run.
            try:
                index = List.index(distance[i].min())
            except ValueError:
                labels, centroids = KMeans_Cluster(dataSet=dataSet, k=k, n=n, epochs=epochs)
                return labels, centroids

            classify[index].append(i)
        # compute the new centroid of each cluster
        new_centroids = np.empty((k, n))
        for i in range(len(classify)):
            for j in range(n):
                new_centroids[i][j] = np.sum(dataSet[classify[i]][:, j:j + 1]) / len(classify[i])

        # check whether the new centroids are identical to the old ones
        if (new_centroids == centroids).all():
            # centroids unchanged: stop iterating
            label_pred = np.empty(len(dataSet))
            # return the cluster label of each sample together with the centroids
            for i in range(k):
                label_pred[classify[i]] = i

            return label_pred, centroids
        else:
            centroids = new_centroids
            epoch = epoch + 1

    # If the iteration limit is reached without convergence,
    # return the assignment from the last completed iteration
    label_pred = np.empty(len(dataSet))
    for i in range(k):
        label_pred[classify[i]] = i
    return label_pred, centroids


# Visualize the clustering result
def show(label_pred, X, centroids):
    x = []
    for i in range(args.k):
        x.append([])

    for k in range(args.k):
        for i in range(len(label_pred)):
            _l = int(label_pred[i])
            x[_l].append(X[i])
    for i in range(args.k):
        plt.scatter(np.array(x[i])[:, 0], np.array(x[i])[:, 1], color=plt.cm.Set1(i % 8), label='label' + str(i))
    plt.scatter(x=centroids[:, 0], y=centroids[:, 1], marker='*', label='pred_center')
    plt.legend(loc=3)
    plt.show()


if __name__ == "__main__":
    # read the data
    data_set = loadDataset(args.dataset)
    # show the original data distribution
    plt.scatter(np.array(data_set)[:, :1], np.array(data_set)[:, 1:])
    plt.show()
    # obtain the clustering result
    labels, centroids = KMeans_Cluster(dataSet=np.array(data_set), k=args.k, n=args.n, epochs=args.epochs)

    print("Classes: ", labels)
    print("Centers: ", centroids)
    # evaluate the clustering result with the Calinski-Harabasz criterion
    scores = calinski_harabasz_score(data_set, labels)
    print("Scores: ", round(scores, 2))
    # show the clustering result
    show(X=np.array(data_set), label_pred=labels, centroids=centroids)

4. Experimental results

[Figure: scatter plot of the original data distribution]

[Figure: scatter plot of the cluster assignments and cluster centers after a "good" clustering run]

[Figure: cluster labels, cluster centers and CH index score for the "good" clustering run]

[Figure: scatter plot of the cluster assignments and cluster centers after a "bad" clustering run]

[Figure: cluster labels, cluster centers and CH index score for the "bad" clustering run]

5. Experimental analysis

This experiment is mainly about the implementation of the K-Means algorithm.

For the distance between samples I adopt the Euclidean metric, the similarity/dissimilarity measure classically used with K-Means and proven to work well in practice.

As the termination condition I use either the cluster centers no longer changing or the number of iterations reaching its upper limit; the iteration cap rules out non-convergent cases that would otherwise make the program run for too long.

Since the initial cluster centers are selected completely at random, the first assignment may produce fewer than the specified k clusters. In that case I simply restart the clustering recursively inside the exception handler and return the result of the recursive run. For the same reason, even on the same data set with the same parameters, different runs can produce very different clustering results. To judge whether a clustering result is "good" or "bad", I evaluate it with the Calinski-Harabasz (CH) criterion, a classic index that compares the between-cluster and within-cluster covariance. Compared with the Silhouette index, the CH index is faster to compute. Its formula is:
$CH = \frac{BGSS}{k-1} \Big/ \frac{WGSS}{n-k}$
Here BGSS is the between-cluster covariance and WGSS is the within-cluster covariance. A good clustering should have a large between-cluster covariance and a small within-cluster covariance, so that the CH index reaches a local or global maximum. The experimental results confirm that the empirically better clustering result indeed has the larger CH score.
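
For completeness, with $n$ data objects, $k$ clusters $C_1,\dots,C_k$ of sizes $n_1,\dots,n_k$, cluster centroids $c_1,\dots,c_k$ and overall data centroid $c$ (notation introduced here for clarity), the two terms are commonly defined as:

$BGSS = \sum_{j=1}^{k} n_j \, \lVert c_j - c \rVert^2, \qquad WGSS = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2$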
