Machine Learning Essentials--Detailed Analysis and Implementation of the K-Means Algorithm


1. Experimental algorithm design

  1. Read the watermelon dataset.
  2. Randomly select k samples as the initial cluster centers.
  3. Compute the distance between each sample and each cluster center, and assign each sample to its nearest cluster center; at this point all samples are divided into k groups.
  4. Update the cluster centers, taking the mean of the samples in each group as that group's new cluster center.
  5. Repeat steps 3 and 4 until the cluster centers become stable or the maximum number of iterations is reached.
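
A minimal sketch of this loop, written as a standalone toy function (my own illustration of how the steps fit together, not the implementation described below), might look like:

    import numpy as np

    def kmeans_sketch(x: np.ndarray, k: int, max_iter: int = 10):
        """Toy K-means: x is an (n, d) array; returns (centers, labels)."""
        # Step 2: randomly pick k samples as the initial centers
        centers = x[np.random.choice(len(x), size=k, replace=False)]
        for _ in range(max_iter):  # Step 5: repeat until stable or max iterations
            # Step 3: assign every sample to its nearest center
            dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 4: move each center to the mean of its assigned samples
            new_centers = np.array([
                x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels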

Experimental analysis

Clustering the watermelon dataset with K-means

A simple analysis of the watermelon dataset yields the following results:

| Feature | Attribute type | Role |
| --- | --- | --- |
| serial number | discrete | serial number |
| density | continuous | feature |
| sugar content | continuous | feature |
| good melon | discrete | label |

Therefore, the features density and sugar content were selected for the cluster analysis.

2. Core code of the K-means cluster analysis

  1. Import required libraries

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    In this experiment, I chose pandas as the main tool for reading the dataset, numpy to speed up the main mathematical operations, and matplotlib for data visualization and analysis.

  2. Defining the K-means clustering class `KMeans`

    1. Define `__init__()` to initialize the model

      class KMeans:
          def __init__(self, x: pd.DataFrame):
              self.x = x  # dataset features used for clustering
          ...

      where `x` represents the dataset features.

    2. Predefine the distance function `distanceAll()`

      def distanceAll(center, rest):
          # Distance from every sample in `rest` to every center: shape (n_samples, k)
          distances = np.apply_along_axis(_distances, 1, rest, center)
          # WSS: sum, over all samples, of the distance to the nearest (assigned) center
          return distances.min(axis=1).sum()
      
      def _distances(point: np.ndarray, centers: np.ndarray):
          # Distances from one sample to every center
          distances = np.apply_along_axis(_distance, 1, centers, point)
          return distances
      
      def _distance(x, y):
          # Euclidean distance in expanded form: sqrt(x·x - 2·x·y + y·y)
          return np.sqrt(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y))

      Here I have made several optimizations; the specific points are as follows:

      Avoid `for` loops to run faster

      In the first function `distanceAll`, the incoming `center` and `rest` are both multi-dimensional matrices. The pairwise distances between `center` and `rest` are computed without any explicit `for` loop, which greatly improves the running speed.

      Reuse the `_distance(x, y)` intermediate results

      The general formula for the Euclidean distance is:

      $$d_{ij} = \sqrt{\sum_{k=1}^{m} \left| x_{ki} - x_{kj} \right|^{2}}$$

      However, the formula I use here is its expanded form:

      $$d_{ij} = \sqrt{\sum_{k=1}^{m} \left( x_{ki}^{2} - 2\, x_{ki}\, x_{kj} + x_{kj}^{2} \right)}$$

      The squared terms $x_{ki}^2$ and $x_{kj}^2$ are needed many times when computing Euclidean distances, so this form allows them to be reused and avoids unnecessary repeated computation.
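
      As a quick sanity check (my own illustrative snippet, not part of the original code), the expanded form computed by `_distance` agrees with `np.linalg.norm`:

      a = np.array([0.697, 0.460])  # e.g. density and sugar content of one sample
      b = np.array([0.774, 0.376])
      assert np.isclose(_distance(a, b), np.linalg.norm(a - b))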

    3. Predefine the core method `allocateAll()` to find the nearest cluster center for each point

      def allocateAll(center, rest):
          # Compute the distance from every sample to each cluster center
          # and assign each sample to its nearest center
          allocates = np.apply_along_axis(_allocate, 1, rest, center)
          copied = rest.copy()
          copied["allocations"] = allocates
          groups = copied.groupby("allocations").groups
          # Plot the current assignment: samples colored by cluster, centers marked with "x"
          ax = rest.plot.scatter(x=0, y=1, c=allocates, colormap='viridis', legend=True)
          center.iloc[list(groups.keys())].plot.scatter(x=0,
                                                        y=1,
                                                        c=list(groups.keys()),
                                                        marker="x",
                                                        colormap='viridis',
                                                        s=200,
                                                        ax=ax)
          plt.show()
          return groups
      
      def _allocate(point: np.ndarray, centers: np.ndarray):
          # Distance from this point to every center; return the index of the nearest one
          distances = np.apply_along_axis(_distance, 1, centers, point)
          nearest_center = np.argmin(distances)
          return nearest_center

      At the same time, while assigning each point to its nearest center, a plotting routine is embedded. This visualization draws the clustering process shown later.
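
      For reference, a minimal toy usage of `allocateAll` (my own example; it assumes default integer column labels 0 and 1, which the plotting code above indexes):

      points = pd.DataFrame(np.random.rand(30, 2))  # 30 random 2-D samples
      centers = points.sample(n=3)                  # 3 samples as tentative centers
      groups = allocateAll(centers, points)         # {cluster label: row indices}
      print(groups)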

    4. Define `train()` to train iteratively on the dataset

      class KMeans:
          ...
          def train(self, k):
              print(f" === k = {k} === ")
              batch = self.x.shape[0]
              features = self.x.shape[1]
              # 1. Randomly pick k samples as the initial cluster centers
              index = np.random.randint(0, batch, size=k)
              centers: pd.DataFrame = self.x.iloc[index]  # cluster centers
              allocations = allocateAll(centers, self.x)
              for i in range(10):
                  last_centers = centers
                  centers = np.empty((k, features))
                  # 2. Take the mean of each group as that group's new center
                  for label, points in allocations.items():
                      group = self.x.iloc[points]
                      centers[label] = np.average(group, axis=0)
                  # 3. Stop once the centers no longer move noticeably
                  if np.isclose(last_centers, centers).all():
                      print(f"k = {k} converged, stopping!")
                      return distanceAll(pd.DataFrame(centers), self.x)
                  allocations = allocateAll(pd.DataFrame(centers), self.x)
              # Maximum number of iterations reached without convergence
              return distanceAll(pd.DataFrame(centers), self.x)

      In this code, I cap each training run at 10 iterations; in general, only about 5 iterations are needed for the cluster centers to converge.

      The code has two parts. The first set of cluster centers is chosen at random from the samples. After the first assignment, each subsequent iteration takes the mean point of every cluster from the previous round as the new centers. When the centers of one round differ only slightly from those of the previous round, iteration stops and the WSS distance at that point is returned.

3. Experimental data and result analysis

Using K-means clustering on the watermelon dataset

  1. Import required libraries

    import matplotlib.pyplot as plt
    import pandas as pd
    
    from model import KMeans

    Here we import the `KMeans` class just written, along with the plotting tool matplotlib, to draw the WSS curve.

  2. Read the dataset and build the model

    df = pd.read_csv("kmeansdata.csv")
    model = KMeans(df[["m", "h"]])

    Here the watermelon dataset is read in, and the features `m` and `h` are selected to build the model.

  3. KMeans model training, visualization, and WSS curve analysis

    wss = []
    for i in range(2, 10):
        wss.append(model.train(k=i))
    plt.plot(range(2, 10), wss)
    plt.savefig("result.png")

    Here I choose k values from 2 to 9, train the KMeans model with each of them, record the WSS distance returned after each run, and finally visualize the WSS values.

    Visualization of the training process (k = 3)

    First, three samples are chosen at random from the dataset as cluster centers:

    [Figure: three randomly selected initial cluster centers]

    It can be seen that the selected cluster centers lie toward the bottom. The first iteration is then performed:

    [Figure: cluster assignment after the first iteration]

    Within each cluster, its mean point is chosen as the next cluster center, every point is then reassigned to a cluster, and the next iteration is carried out:

    [Figure: cluster assignment after the second iteration]

    It can be seen that the centers have shifted toward the middle and the partition is more reasonable. One more iteration is performed:

    [Figure: cluster assignment after another iteration]

    [Figure: cluster assignment after the final iteration]

    After this, the centers no longer change noticeably between iterations, which means the cluster centers have converged and this clustering run ends.

    WSS curve visualization

    [Figure: WSS curve for k = 2 to 9]

4. Summary and reflections

  1. On a simple dataset such as the watermelon dataset, the clustering works well and converges within a few iterations.
  2. From the visual analysis of different k values, the "elbow" appears at k = 3, so k = 3 is the optimal value. A k larger than 3 produces too many clusters and loses statistical meaning, while a k that is too small produces too few clusters and makes the within-cluster distance rise sharply.
  3. Implementing the heavy numerical work through C-backed interfaces such as numpy is more efficient than writing it in pure Python (a rough sketch follows after this list).
  4. I learned some simple data visualization methods and how to use the pyplot functions in the matplotlib library. Turning large amounts of data into figures greatly simplifies the analysis and comparison of results and makes it easier to spot patterns and draw conclusions.
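
To illustrate point 3, here is a rough micro-benchmark sketch (my own example, not part of the original experiment) comparing a pure-Python distance loop with the vectorized numpy version:

    import timeit
    import numpy as np

    x = np.random.rand(1000, 2)
    y = np.random.rand(1000, 2)

    def loop_distances():
        # pure-Python loop over samples and coordinates
        return [sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for p, q in zip(x, y)]

    def numpy_distances():
        # vectorized, executed in numpy's C code
        return np.linalg.norm(x - y, axis=1)

    print("python loop:", timeit.timeit(loop_distances, number=100))
    print("numpy      :", timeit.timeit(numpy_distances, number=100))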

5. Suggestions for improving the experimental process, methods, and tools

  1. When visualizing the dataset, crudely taking the first two dimensions of high-dimensional features for visual analysis loses the information carried by the other dimensions. A dimensionality reduction method such as PCA can instead be used to project the high-dimensional features onto a two-dimensional plane for visual analysis (a sketch follows after this list).
  2. More complex datasets could be tried.
  3. Other distance functions could also be considered.
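
A minimal sketch of suggestion 1, assuming scikit-learn is available and reusing the `KMeans` class defined above:

    import pandas as pd
    from sklearn.decomposition import PCA

    from model import KMeans

    df = pd.read_csv("kmeansdata.csv")       # same file as in the experiment
    features = df.select_dtypes("number")    # keep only the numeric columns
    projected = PCA(n_components=2).fit_transform(features)
    model = KMeans(pd.DataFrame(projected))  # cluster in the projected 2-D space
    model.train(k=3)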


Origin juejin.im/post/7083118254645837861