Machine Learning Essentials--Detailed Analysis and Implementation of the K-Means Algorithm


1. Experimental algorithm design

  1. Read the watermelon dataset.
  2. Randomly select k samples as the initial cluster centers.
  3. Compute the distance between each sample and each cluster center, and assign each sample to its nearest cluster center; at this point all samples are divided into k groups.
  4. Update the cluster centers, taking the mean of the samples in each group as that group's new cluster center.
  5. Repeat steps 3 and 4 until the cluster centers become stable or the maximum number of iterations is reached.
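
A minimal sketch of this loop, written as a standalone toy function (my own illustration of how the steps fit together, not the implementation described below), might look like:

    import numpy as np

    def kmeans_sketch(x: np.ndarray, k: int, max_iter: int = 10):
        """Toy K-means: x is an (n, d) array; returns (centers, labels)."""
        # Step 2: randomly pick k samples as the initial centers
        centers = x[np.random.choice(len(x), size=k, replace=False)]
        for _ in range(max_iter):  # Step 5: repeat until stable or max iterations
            # Step 3: assign every sample to its nearest center
            dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 4: move each center to the mean of its assigned samples
            new_centers = np.array([
                x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels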

Experimental analysis

Clustering the watermelon dataset with K-means

A simple analysis of the watermelon dataset yields the following results:

| Feature | Attribute type | Role |
| --- | --- | --- |
| serial number | discrete | serial number |
| density | continuous | feature |
| sugar content | continuous | feature |
| good melon | discrete | label |

Therefore, the features density and sugar content were selected for the cluster analysis.

2. Core code of the K-means cluster analysis

  1. Import required libraries

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    In this experiment, I chose pandas as the main tool for reading the dataset, numpy to speed up the main mathematical operations, and matplotlib for data visualization and analysis.

  2. Defining the K-means clustering class `KMeans`

    1. Define `__init__()` to initialize the model

      class KMeans:
          def __init__(self, x: pd.DataFrame):
              self.x = x  # dataset features used for clustering
          ...

      where `x` represents the dataset features.

    2. Predefine the distance function `distanceAll()`

      def distanceAll(center, rest):
          # Distance from every sample in `rest` to every center: shape (n_samples, k)
          distances = np.apply_along_axis(_distances, 1, rest, center)
          # WSS: sum, over all samples, of the distance to the nearest (assigned) center
          return distances.min(axis=1).sum()
      
      def _distances(point: np.ndarray, centers: np.ndarray):
          # Distances from one sample to every center
          distances = np.apply_along_axis(_distance, 1, centers, point)
          return distances
      
      def _distance(x, y):
          # Euclidean distance in expanded form: sqrt(x·x - 2·x·y + y·y)
          return np.sqrt(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y))

      Here I have made several optimizations; the specific points are as follows:

      Avoid `for` loops to run faster

      In the first function `distanceAll`, the incoming `center` and `rest` are both multi-dimensional matrices. The pairwise distances between `center` and `rest` are computed without any explicit `for` loop, which greatly improves the running speed.

      Reuse the `_distance(x, y)` intermediate results

      The general formula for the Euclidean distance is:

      $$d_{ij} = \sqrt{\sum_{k=1}^{m} \left| x_{ki} - x_{kj} \right|^{2}}$$

      However, the formula I use here is its expanded form:

      $$d_{ij} = \sqrt{\sum_{k=1}^{m} \left( x_{ki}^{2} - 2\, x_{ki}\, x_{kj} + x_{kj}^{2} \right)}$$

      The squared terms $x_{ki}^2$ and $x_{kj}^2$ are needed many times when computing Euclidean distances, so this form allows them to be reused and avoids unnecessary repeated computation.
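
      As a quick sanity check (my own illustrative snippet, not part of the original code), the expanded form computed by `_distance` agrees with `np.linalg.norm`:

      a = np.array([0.697, 0.460])  # e.g. density and sugar content of one sample
      b = np.array([0.774, 0.376])
      assert np.isclose(_distance(a, b), np.linalg.norm(a - b))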

    3. Predefine the core method `allocateAll()` to find the nearest cluster center for each point

      def allocateAll(center, rest):
          # Compute the distance from every sample to each cluster center
          # and assign each sample to its nearest center
          allocates = np.apply_along_axis(_allocate, 1, rest, center)
          copied = rest.copy()
          copied["allocations"] = allocates
          groups = copied.groupby("allocations").groups
          # Plot the current assignment: samples colored by cluster, centers marked with "x"
          ax = rest.plot.scatter(x=0, y=1, c=allocates, colormap='viridis', legend=True)
          center.iloc[list(groups.keys())].plot.scatter(x=0,
                                                        y=1,
                                                        c=list(groups.keys()),
                                                        marker="x",
                                                        colormap='viridis',
                                                        s=200,
                                                        ax=ax)
          plt.show()
          return groups
      
      def _allocate(point: np.ndarray, centers: np.ndarray):
          # Distance from this point to every center; return the index of the nearest one
          distances = np.apply_along_axis(_distance, 1, centers, point)
          nearest_center = np.argmin(distances)
          return nearest_center

      At the same time, while assigning each point to its nearest center, a plotting routine is embedded. This visualization draws the clustering process shown later.
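
      For reference, a minimal toy usage of `allocateAll` (my own example; it assumes default integer column labels 0 and 1, which the plotting code above indexes):

      points = pd.DataFrame(np.random.rand(30, 2))  # 30 random 2-D samples
      centers = points.sample(n=3)                  # 3 samples as tentative centers
      groups = allocateAll(centers, points)         # {cluster label: row indices}
      print(groups)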

    4. Define `train()` to train iteratively on the dataset

      class KMeans:
          ...
          def train(self, k):
              print(f" === k = {k} === ")
              batch = self.x.shape[0]
              features = self.x.shape[1]
              # 1. Randomly pick k samples as the initial cluster centers
              index = np.random.randint(0, batch, size=k)
              centers: pd.DataFrame = self.x.iloc[index]  # cluster centers
              allocations = allocateAll(centers, self.x)
              for i in range(10):
                  last_centers = centers
                  centers = np.empty((k, features))
                  # 2. Take the mean of each group as that group's new center
                  for label, points in allocations.items():
                      group = self.x.iloc[points]
                      centers[label] = np.average(group, axis=0)
                  # 3. Stop once the centers no longer move noticeably
                  if np.isclose(last_centers, centers).all():
                      print(f"k = {k} converged, stopping!")
                      return distanceAll(pd.DataFrame(centers), self.x)
                  allocations = allocateAll(pd.DataFrame(centers), self.x)
              # Maximum number of iterations reached without convergence
              return distanceAll(pd.DataFrame(centers), self.x)

      In this code, I cap each training run at 10 iterations; in general, only about 5 iterations are needed for the cluster centers to converge.

      The code has two parts. The first set of cluster centers is chosen at random from the samples. After the first assignment, each subsequent iteration takes the mean point of every cluster from the previous round as the new centers. When the centers of one round differ only slightly from those of the previous round, iteration stops and the WSS distance at that point is returned.

3. Experimental data and result analysis

Using K-means clustering on the watermelon dataset

  1. Import required libraries

    import matplotlib.pyplot as plt
    import pandas as pd
    
    from model import KMeans

    Here we import the `KMeans` class just written, along with the plotting tool matplotlib, to draw the WSS curve.

  2. Read the dataset and build the model

    df = pd.read_csv("kmeansdata.csv")
    model = KMeans(df[["m", "h"]])

    Here the watermelon dataset is read in, and the features `m` and `h` are selected to build the model.

  3. KMeans model training, visualization, and WSS curve analysis

    wss = []
    for i in range(2, 10):
        wss.append(model.train(k=i))
    plt.plot(range(2, 10), wss)
    plt.savefig("result.png")

    Here I choose k values from 2 to 9, train the KMeans model with each of them, record the WSS distance returned after each run, and finally visualize the WSS values.

    Visualization of the training process (k = 3)

    First, three samples are chosen at random from the dataset as cluster centers:

    [Figure: three randomly selected initial cluster centers]

    It can be seen that the selected cluster centers lie toward the bottom. The first iteration is then performed:

    [Figure: cluster assignment after the first iteration]

    Within each cluster, its mean point is chosen as the next cluster center, every point is then reassigned to a cluster, and the next iteration is carried out:

    [Figure: cluster assignment after the second iteration]

    It can be seen that the centers have shifted toward the middle and the partition is more reasonable. One more iteration is performed:

    [Figure: cluster assignment after another iteration]

    [Figure: cluster assignment after the final iteration]

    After this, the centers no longer change noticeably between iterations, which means the cluster centers have converged and this clustering run ends.

    WSS curve visualization

    [Figure: WSS curve for k = 2 to 9]

4. Summary and reflections

  1. On a simple dataset such as the watermelon dataset, the clustering works well and converges within a few iterations.
  2. From the visual analysis of different k values, the "elbow" appears at k = 3, so k = 3 is the optimal value. A k larger than 3 produces too many clusters and loses statistical meaning, while a k that is too small produces too few clusters and makes the within-cluster distance rise sharply.
  3. Implementing the heavy numerical work through C-backed interfaces such as numpy is more efficient than writing it in pure Python (a rough sketch follows after this list).
  4. I learned some simple data visualization methods and how to use the pyplot functions in the matplotlib library. Turning large amounts of data into figures greatly simplifies the analysis and comparison of results and makes it easier to spot patterns and draw conclusions.
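
To illustrate point 3, here is a rough micro-benchmark sketch (my own example, not part of the original experiment) comparing a pure-Python distance loop with the vectorized numpy version:

    import timeit
    import numpy as np

    x = np.random.rand(1000, 2)
    y = np.random.rand(1000, 2)

    def loop_distances():
        # pure-Python loop over samples and coordinates
        return [sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for p, q in zip(x, y)]

    def numpy_distances():
        # vectorized, executed in numpy's C code
        return np.linalg.norm(x - y, axis=1)

    print("python loop:", timeit.timeit(loop_distances, number=100))
    print("numpy      :", timeit.timeit(numpy_distances, number=100))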

5. Suggestions for improving the experimental process, methods, and tools

  1. When visualizing the dataset, crudely taking the first two dimensions of high-dimensional features for visual analysis loses the information carried by the other dimensions. A dimensionality reduction method such as PCA can instead be used to project the high-dimensional features onto a two-dimensional plane for visual analysis (a sketch follows after this list).
  2. More complex datasets could be tried.
  3. Other distance functions could also be considered.
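
A minimal sketch of suggestion 1, assuming scikit-learn is available and reusing the `KMeans` class defined above:

    import pandas as pd
    from sklearn.decomposition import PCA

    from model import KMeans

    df = pd.read_csv("kmeansdata.csv")       # same file as in the experiment
    features = df.select_dtypes("number")    # keep only the numeric columns
    projected = PCA(n_components=2).fit_transform(features)
    model = KMeans(pd.DataFrame(projected))  # cluster in the projected 2-D space
    model.train(k=3)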


Origin juejin.im/post/7083118254645837861