"Machine Learning Formula Derivation and Code Implementation" chapter17-kmeans

"Machine Learning Formula Derivation and Code Implementation" study notes, record your own learning process, please buy the author's book for detailed content.

Cluster analysis and k-means clustering algorithm

Cluster analysis is a class of classic unsupervised learning algorithms. Given a set of samples, cluster analysis automatically divides them into several groups by measuring feature similarity or distance.

1 Distance measure and similarity measure

Distance measures and similarity measures are core concepts in cluster analysis, and most clustering algorithms are built on a distance measure. Commonly used distance measures include the Minkowski distance and the Mahalanobis distance; commonly used similarity measures include the correlation coefficient and the angle cosine.

(1) The Minkowski distance: given a set of m-dimensional vector samples X, for x_i, x_j ∈ X with x_i = (x_{1i}, x_{2i}, ..., x_{mi})^T, the Minkowski distance between sample x_i and sample x_j is defined as:

$$d_{ij}=\left ( \sum_{k=1}^{m}\left | x_{ki}-x_{kj} \right | ^{p} \right )^{\frac{1}{p} },\quad p\ge 1$$

When p = 1, the Minkowski distance becomes the Manhattan distance:

$$d_{ij}=\sum_{k=1}^{m}\left | x_{ki}-x_{kj} \right |$$

When p = 2, the Minkowski distance becomes the Euclidean distance:

$$d_{ij}=\left ( \sum_{k=1}^{m}\left | x_{ki}-x_{kj} \right | ^{2} \right )^{\frac{1}{2} }$$

When p = ∞, the Minkowski distance is also called the Chebyshev distance:

$$d_{ij}=\max_{k}\left | x_{ki}-x_{kj} \right |$$
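As a quick illustration (a minimal numpy sketch, not from the book), the whole Minkowski family can be computed with one function; p=1, p=2 and p=∞ recover the Manhattan, Euclidean and Chebyshev distances:

import numpy as np

# Minkowski distance; p=1 Manhattan, p=2 Euclidean, p=inf Chebyshev
def minkowski_distance(x, y, p=2):
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max() # Chebyshev: maximum coordinate-wise difference
    return np.power(np.sum(diff ** p), 1.0 / p)

xi, xj = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(minkowski_distance(xi, xj, p=1))      # Manhattan: 5.0
print(minkowski_distance(xi, xj, p=2))      # Euclidean: ~3.61
print(minkowski_distance(xi, xj, p=np.inf)) # Chebyshev: 3.0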
(2) The Mahalanobis distance is a clustering measure that takes the correlation between features into account. Given a sample set X = (x_{ij})_{m×n} with sample covariance matrix S, the Mahalanobis distance between sample x_i and sample x_j is defined as:

$$d_{ij}=\left [\left(x_{i}-x_{j}\right)^{T} S^{-1}\left(x_{i}-x_{j}\right)\right] ^{\frac{1}{2}}$$

When S is the identity matrix, that is, when the features are independent of each other and each has variance 1, the Mahalanobis distance reduces to the Euclidean distance.
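A minimal numpy sketch (not from the book) of the Mahalanobis distance; the toy matrix X below is an assumption for illustration, with rows as samples and columns as features, and S is estimated from it:

import numpy as np

# Mahalanobis distance between two samples, given the covariance matrix S
def mahalanobis_distance(xi, xj, S):
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(S) @ diff))

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]) # toy samples (assumption)
S = np.cov(X, rowvar=False) # sample covariance matrix of the features
print(mahalanobis_distance(X[0], X[2], S))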

(3) The correlation coefficient is the most commonly used way to measure the similarity between samples. There are several definitions; the most common is the Pearson correlation coefficient. The closer the correlation coefficient is to 1, the more similar the two samples are. The correlation coefficient between sample x_i and sample x_j can be defined as:

$$r_{ij}=\frac{\sum_{k=1}^{m}\left ( x_{ki}-\bar{x}_{i}\right )\left ( x_{kj}-\bar{x}_{j}\right )}{\left [ \sum_{k=1}^{m} \left ( x_{ki}-\bar{x}_{i}\right )^{2} \sum_{k=1}^{m} \left ( x_{kj}-\bar{x}_{j}\right )^{2} \right ] ^{\frac{1}{2} } }$$

The formula looks a bit complicated, but it is simply:

$$r\left ( X,Y \right ) =\frac{ Cov\left ( X,Y \right ) }{\sqrt{Var\left [ X \right ] Var\left [ Y \right ] } }$$
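A small sketch (not from the book) showing that the element-wise formula agrees with the covariance form; numpy's corrcoef returns the same Pearson correlation:

import numpy as np

xi = np.array([1.0, 2.0, 3.0, 4.0])
xj = np.array([2.0, 1.0, 4.0, 3.0])

# Element-wise Pearson correlation, following the formula above
num = np.sum((xi - xi.mean()) * (xj - xj.mean()))
den = np.sqrt(np.sum((xi - xi.mean()) ** 2) * np.sum((xj - xj.mean()) ** 2))
print(num / den)                 # 0.6

# Covariance form: off-diagonal entry of the correlation matrix
print(np.corrcoef(xi, xj)[0, 1]) # 0.6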
(4) The angle cosine is another way to measure the similarity of two samples. The closer the cosine of the angle is to 1, the more similar the two samples are:

$$similarity=\cos\left ( \theta \right ) =\frac{A\cdot B} {\left\|A\right\|\left\|B\right\|}$$

The angle cosine between sample x_i and sample x_j can be defined as:

$$AC_{ij}=\frac{\sum_{k=1}^{m}x_{ki}x_{kj}}{\left [ \sum_{k=1}^{m}x_{ki}^{2} \sum_{k=1}^{m}x_{kj}^{2}\right ] ^{\frac{1}{2}}}$$
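A minimal sketch (not from the book) of the angle cosine between two sample vectors:

import numpy as np

# Cosine of the angle between vectors a and b
def angle_cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(angle_cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])) # 1.0: same direction
print(angle_cosine([1.0, 0.0], [0.0, 1.0]))           # 0.0: orthogonal vectors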

2 List of clustering algorithms

A clustering algorithm groups similar samples into the same cluster, so that samples within a cluster are as similar as possible while samples in different clusters are as different as possible. Commonly used clustering algorithms include the following:

  • Distance-based clustering: the goal is to make intra-cluster distances small and inter-cluster distances large; the most typical example is the k-means algorithm.
  • Density-based clustering: samples are grouped according to the density of their neighborhoods; the most common density-based algorithm is DBSCAN.
  • Hierarchical clustering: including agglomerative (merging) and divisive (splitting) hierarchical clustering.
  • Spectral clustering, based on graph theory.

[Figure] Comparison of the effects of sklearn's 10 clustering algorithms on different data sets.

3 K-means algorithm principle

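As a brief textbook-style summary (a sketch of the standard algorithm, not the book's own derivation): given the number of clusters k, k-means alternates between assigning every sample to its nearest centroid and recomputing each centroid as the mean of the samples assigned to it, which locally minimizes the within-cluster sum of squared Euclidean distances:

$$\min_{C_{1},\dots ,C_{k}} \sum_{l=1}^{k} \sum_{x_{i}\in C_{l}} \left \| x_{i}-\mu _{l} \right \| ^{2}, \quad \mu _{l}=\frac{1}{\left | C_{l} \right |} \sum_{x_{i}\in C_{l}} x_{i}$$

The iteration stops when the assignments (equivalently, the centroids) no longer change or a maximum number of iterations is reached; this is exactly the loop implemented in the numpy code below.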

4 K-means algorithm numpy implementation

import numpy as np

# Euclidean distance between two vectors
def euclidean_distance(x, y):
    distance = 0
    for i in range(len(x)):
        distance += np.power((x[i] - y[i]), 2)
    return np.sqrt(distance)

# Centroid initialization
def centroids_init(X, k): # training samples, number of centroids (number of clusters)
    m, n = X.shape # number of samples and number of features
    centroids = np.zeros((k, n)) # centroid matrix of shape (number of centroids, number of features)
    for i in range(k):
        centroid = X[np.random.choice(range(m))] # pick a random training sample as a centroid
        centroids[i] = centroid
    return centroids # centroids: k samples of length n chosen from the m training samples
# Index of the nearest centroid for a single sample
def closest_centroid(x, centroids): # single sample, centroid matrix
    closest_i, closest_dist = 0, float('inf')
    for i, centroid in enumerate(centroids):
        distance = euclidean_distance(x, centroid)
        if distance < closest_dist:
            closest_i = i
            closest_dist = distance
    return closest_i # closest_i: index of the nearest centroid
# Build clusters by assigning samples
def build_clusters(centroids, k, X): # centroid matrix, number of centroids, training samples
    clusters = [[] for _ in range(k)] # initialize the list of clusters
    for x_i, x in enumerate(X):
        centroid_i = closest_centroid(x, centroids) # index of the sample's nearest centroid
        clusters[centroid_i].append(x_i) # add the sample index to the corresponding cluster
    return clusters # list of clusters
# Compute the new centroids
def calculate_centroids(clusters, k, X):
    n = X.shape[1] # number of features
    centroids = np.zeros((k, n)) # initialize the centroid matrix
    for i, cluster in enumerate(clusters):
        centroid = np.mean(X[cluster], axis=0) # the mean of each cluster becomes the new centroid
        centroids[i] = centroid # update the centroid matrix
    return centroids # new centroid matrix
# Get the cluster label of each sample
def get_cluster_labels(clusters, X):
    y_pred = np.zeros(X.shape[0]) # one label per sample
    for cluster_i, cluster in enumerate(clusters):
        for sample_i in cluster:
            y_pred[sample_i] = cluster_i
    return y_pred # predicted cluster labels
# Wrap everything up as the k-means algorithm
def kmeans(X, k, max_iterations):
    centroids = centroids_init(X, k) # training samples, number of centroids (number of clusters)

    # Iterate until convergence
    for i in range(max_iterations):
        clusters = build_clusters(centroids, k, X) # assign samples and build clusters
        new_centroids = calculate_centroids(clusters, k, X) # compute the new centroids
        print(f'Iteration {i + 1}')
        diff = centroids - new_centroids
        centroids = new_centroids
        if not diff.any(): # stop once the centroids no longer change
            break
    return get_cluster_labels(clusters, X) # cluster label of each sample
from sklearn import datasets
import matplotlib.pyplot as plt

# Test the algorithm on the iris data set
data = datasets.load_iris()
iris, y = data.data, data.target
label_pred = kmeans(iris, 3, 100)

# Take two of the feature dimensions (petal length and petal width) to visualize the clustering result
X = iris[:,2:]
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
plt.scatter(x0[:, 0], x0[:, 1], c = "red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c = "green", marker='*', label='label1')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(loc=2)
plt.show()


5 K-means algorithm based on sklearn

from sklearn.cluster import KMeans

kmeans_sk = KMeans(n_clusters=3, random_state=2023).fit(iris)
label_pred = kmeans_sk.labels_ # fitted cluster labels

X = iris[:,2:]
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
plt.scatter(x0[:, 0], x0[:, 1], c = "red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c = "green", marker='*', label='label1')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(loc=2)
plt.show()
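As an optional extra check (not in the original notes), the cluster assignments can be compared with the true iris species labels using the adjusted Rand index, which is invariant to how the cluster labels are numbered:

from sklearn.metrics import adjusted_rand_score

# Agreement between the k-means assignments and the true iris labels (1.0 = perfect, up to relabeling)
print('ARI (sklearn KMeans vs. true labels):', adjusted_rand_score(y, kmeans_sk.labels_))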

Notebook_Github address

Source: blog.csdn.net/cjw838982809/article/details/131350937