Study notes on "Machine Learning Formula Derivation and Code Implementation", recording my own learning process. For the full content, please buy the author's book.
Cluster analysis and k-means clustering algorithm
Cluster analysis is a class of classic unsupervised learning algorithms. Given a set of samples, cluster analysis automatically divides them into several categories by measuring feature similarity or distance.
1 Distance measure and similarity measure
Distance measures and similarity measures are the core concepts of cluster analysis, and most clustering algorithms are built on a distance measure. Commonly used distance measures include the Minkowski distance and the Mahalanobis distance; commonly used similarity measures include the correlation coefficient and the cosine of the included angle.
(1) Minkowski distance
Given a set $X$ of $m$-dimensional vector samples, for $x_i, x_j \in X$ with $x_i = (x_{1i}, x_{2i}, \ldots, x_{mi})^T$, the Minkowski distance between sample $x_i$ and sample $x_j$ is defined as:
$$d_{ij}=\left(\sum_{k=1}^{m}\left|x_{ki}-x_{kj}\right|^{p}\right)^{\frac{1}{p}},\quad p\ge 1$$
It is easy to see that when $p=1$, the Minkowski distance becomes the Manhattan distance:
$$d_{ij}=\sum_{k=1}^{m}\left|x_{ki}-x_{kj}\right|$$
When $p=2$, the Minkowski distance becomes the Euclidean distance:
$$d_{ij}=\left(\sum_{k=1}^{m}\left|x_{ki}-x_{kj}\right|^{2}\right)^{\frac{1}{2}}$$
When $p=\infty$, the Minkowski distance is also called the Chebyshev distance:
$$d_{ij}=\max_{k}\left|x_{ki}-x_{kj}\right|$$
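The three special cases above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `minkowski_distance` and the two example vectors are illustrative choices, not something taken from the book.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """Minkowski distance between two 1-D vectors; p = np.inf gives Chebyshev."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        # Limit case p -> infinity: the largest coordinate-wise difference
        return diff.max()
    return np.sum(diff ** p) ** (1.0 / p)

# Hypothetical example vectors, chosen so the results are easy to verify by hand
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

d_manhattan = minkowski_distance(a, b, 1)       # |3| + |4| + |0| = 7
d_euclidean = minkowski_distance(a, b, 2)       # sqrt(9 + 16 + 0) = 5
d_chebyshev = minkowski_distance(a, b, np.inf)  # max(3, 4, 0) = 4
print(d_manhattan, d_euclidean, d_chebyshev)
```

The infinite case is handled separately because raising to the power `1/p` is not meaningful in floating point when `p` is infinite.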
(2) Mahalanobis distance
The Mahalanobis distance is a distance measure that takes the correlation between features into account. Given a sample set $X=(x_{ij})_{m\times n}$ with sample covariance matrix $S$, the Mahalanobis distance between sample $x_i$ and sample $x_j$ is defined as:
$$d_{ij}=\left[\left(x_{i}-x_{j}\right)^{T}S^{-1}\left(x_{i}-x_{j}\right)\right]^{\frac{1}{2}}$$
When $S$ is the identity matrix, that is, when the sample features are uncorrelated and each has variance 1, the Mahalanobis distance reduces to the Euclidean distance.
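To make the role of the covariance matrix concrete, here is a minimal sketch, assuming NumPy and a covariance matrix that is passed in directly. The function name `mahalanobis_distance` and the example values are illustrative assumptions.

```python
import numpy as np

def mahalanobis_distance(xi, xj, S):
    """Mahalanobis distance between two vectors for a given covariance matrix S."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    # Solve S z = diff rather than forming the explicit inverse of S
    z = np.linalg.solve(S, diff)
    return np.sqrt(diff @ z)

# Hypothetical example vectors
xi = np.array([1.0, 2.0])
xj = np.array([4.0, 6.0])

# With S equal to the identity matrix the result is the Euclidean distance (5.0)
d_identity = mahalanobis_distance(xi, xj, np.eye(2))

# With a diagonal S each squared difference is divided by that feature's variance
S_diag = np.array([[9.0, 0.0], [0.0, 16.0]])
d_scaled = mahalanobis_distance(xi, xj, S_diag)  # sqrt(9/9 + 16/16) = sqrt(2)
print(d_identity, d_scaled)
```

In practice `S` would be estimated from the data, for example with `np.cov`, which by default treats each row as one feature.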
(3) Correlation coefficient
The correlation coefficient is the most commonly used way to measure the similarity of samples. There are several ways to define it; the most common is the Pearson correlation coefficient. The closer the correlation coefficient is to 1, the more similar the two samples are. The correlation coefficient between sample $x_i$ and sample $x_j$ can be defined as:
$$r_{ij}=\frac{\sum_{k=1}^{m}\left(x_{ki}-\bar{x}_{i}\right)\left(x_{kj}-\bar{x}_{j}\right)}{\left[\sum_{k=1}^{m}\left(x_{ki}-\bar{x}_{i}\right)^{2}\sum_{k=1}^{m}\left(x_{kj}-\bar{x}_{j}\right)^{2}\right]^{\frac{1}{2}}}$$
The above formula looks a bit complicated, but it is actually:
$$r\left(X,Y\right)=\frac{\mathrm{Cov}\left(X,Y\right)}{\sqrt{\mathrm{Var}\left[X\right]\mathrm{Var}\left[Y\right]}}$$
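The equivalence of the two forms can be checked numerically. Below is a minimal sketch, assuming NumPy; the function name `pearson_correlation` and the example vectors are illustrative. It implements the summation form term by term and compares it with NumPy's `np.corrcoef`.

```python
import numpy as np

def pearson_correlation(xi, xj):
    """Pearson correlation coefficient of two 1-D vectors (summation form)."""
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    di = xi - xi.mean()  # deviations from the mean of the first sample
    dj = xj - xj.mean()  # deviations from the mean of the second sample
    return np.sum(di * dj) / np.sqrt(np.sum(di ** 2) * np.sum(dj ** 2))

# Hypothetical example vectors
u = np.array([1.0, 2.0, 3.0, 4.0])
v = np.array([2.0, 4.0, 6.0, 8.5])

r_manual = pearson_correlation(u, v)
r_numpy = np.corrcoef(u, v)[0, 1]              # NumPy's built-in estimate
r_perfect = pearson_correlation(u, 2 * u + 1)  # exact linear relation
print(r_manual, r_numpy, r_perfect)
```

The normalisation factors of the covariance and the variances cancel, which is why the summation form needs no division by the number of components.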
(4) Cosine of the included angle
The cosine of the included angle (cosine similarity) is also a way to measure the similarity of two samples. The closer the cosine is to 1, the more similar the two samples are:
$$\text{similarity}=\cos\left(\theta\right)=\frac{A\cdot B}{\left\|A\right\|\left\|B\right\|}$$
The cosine of the included angle between sample $x_i$ and sample $x_j$ can be defined as:
$$AC_{ij}=\frac{\sum_{k=1}^{m}x_{ki}x_{kj}}{\left[\sum_{k=1}^{m}x_{ki}^{2}\sum_{k=1}^{m}x_{kj}^{2}\right]^{\frac{1}{2}}}$$
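As a quick illustration of the formula, here is a minimal sketch, assuming NumPy; the function name `angle_cosine` and the test vectors are illustrative assumptions.

```python
import numpy as np

def angle_cosine(xi, xj):
    """Cosine of the angle between two 1-D vectors."""
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    # Dot product divided by the product of the Euclidean norms
    return (xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

# Vectors pointing in the same direction have cosine 1, regardless of length
cos_same = angle_cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal vectors have cosine 0
cos_orth = angle_cosine([1.0, 0.0], [0.0, 1.0])
print(cos_same, cos_orth)
```

Unlike the distances above, this measure ignores vector length, so it is undefined when either vector is all zeros.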
2 List of clustering algorithms
A clustering algorithm groups similar samples into the same cluster, so that samples within a cluster are as similar as possible and samples in different clusters are as different as possible. Commonly used clustering algorithms include:
- Distance-based clustering: the goal of this type of algorithm is to make the intra-cluster distance small and the inter-cluster distance large. The most typical example is the k-means clustering algorithm.
- Density-based clustering: this type of algorithm partitions the samples according to the density of each sample's neighborhood. The most common density-based algorithm is DBSCAN.
- Hierarchical clustering: includes agglomerative (merging) and divisive (splitting) hierarchical clustering.
- Spectral clustering, which is based on graph theory.
Figure: comparison of 10 sklearn clustering algorithms on different data sets.
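One representative of each family listed above can be tried with scikit-learn. The following is a minimal sketch, not the comparison from the figure: the toy data set (`make_moons`) and all hyperparameter values (`eps`, `min_samples`, `n_clusters`, `affinity`) are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, SpectralClustering

# Toy data: two interleaving half circles (an illustrative assumption)
X_toy, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# One estimator per family; hyperparameters are untuned, illustrative values
models = {
    "k-means (distance-based)": KMeans(n_clusters=2, n_init=10, random_state=0),
    "DBSCAN (density-based)": DBSCAN(eps=0.3, min_samples=5),
    "agglomerative (hierarchical)": AgglomerativeClustering(n_clusters=2),
    "spectral (graph-based)": SpectralClustering(
        n_clusters=2, affinity="nearest_neighbors", random_state=0
    ),
}

toy_labels = {}
for name, model in models.items():
    # fit_predict returns one cluster label per sample
    toy_labels[name] = model.fit_predict(X_toy)
    # DBSCAN marks samples it considers noise with the label -1
    print(name, np.unique(toy_labels[name]))
```

Note that DBSCAN does not take the number of clusters as input; it infers it from the density parameters.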
3 K-means algorithm principle
4 K-means algorithm numpy implementation
import numpy as np

# Euclidean distance between two vectors
def euclidean_distance(x, y):
    distance = 0
    for i in range(len(x)):
        distance += np.power((x[i] - y[i]), 2)
    return np.sqrt(distance)
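
# Centroid initialization
def centroids_init(X, k):  # training samples, number of centroids (clusters)
    m, n = X.shape  # number of samples and number of features
    centroids = np.zeros((k, n))  # centroid matrix of shape (k, n)
    # Sample k distinct indices so the same sample is not picked twice,
    # which would otherwise leave one cluster empty
    indices = np.random.choice(m, k, replace=False)
    for i, idx in enumerate(indices):
        centroids[i] = X[idx]
    return centroids  # k vectors of length n drawn from the m samples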

# Index of the centroid closest to a single sample
def closest_centroid(x, centroids):  # single sample, centroid matrix
    closest_i, closest_dist = 0, float('inf')
    for i, centroid in enumerate(centroids):
        distance = euclidean_distance(x, centroid)
        if distance < closest_dist:
            closest_i = i
            closest_dist = distance
    return closest_i  # index of the closest centroid
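
# Build the clusters by assigning each sample to its closest centroid
def build_clusters(centroids, k, X):  # centroid matrix, number of centroids, training samples
    clusters = [[] for _ in range(k)]  # one list of sample indices per cluster
    for x_i, x in enumerate(X):
        centroid_i = closest_centroid(x, centroids)  # index of the closest centroid
        clusters[centroid_i].append(x_i)  # add the sample index to that cluster
    return clusters  # list of clusters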

# Compute the new centroids
def calculate_centroids(clusters, k, X):
    n = X.shape[1]  # number of features
    centroids = np.zeros((k, n))  # centroid matrix
    for i, cluster in enumerate(clusters):
        centroid = np.mean(X[cluster], axis=0)  # mean of each cluster is its new centroid
        centroids[i] = centroid  # update the centroid matrix
    return centroids  # new centroid matrix

# Get the cluster label of every sample
def get_cluster_labels(clusters, X):
    y_pred = np.zeros(X.shape[0])  # one label per sample
    for cluster_i, cluster in enumerate(clusters):
        for sample_i in cluster:
            y_pred[sample_i] = cluster_i
    return y_pred  # predicted labels

# Wrap the steps into the k-means algorithm
def kmeans(X, k, max_iterations):
    centroids = centroids_init(X, k)  # initialize the centroids
    # Iterate until convergence
    for i in range(max_iterations):
        clusters = build_clusters(centroids, k, X)  # assign samples and build clusters
        new_centroids = calculate_centroids(clusters, k, X)  # compute the new centroids
        print(f'Iteration {i}')
        diff = centroids - new_centroids
        centroids = new_centroids
        if not diff.any():  # stop once the centroids no longer change
            break
    return get_cluster_labels(clusters, X)  # cluster label of every sample

import matplotlib.pyplot as plt
from sklearn import datasets

# Test the algorithm
data = datasets.load_iris()
iris, y = data.data, data.target
label_pred = kmeans(iris, 3, 100)
# Use two of the dimensions (petal length and petal width) to view the clustering result
X = iris[:, 2:]
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
x2 = X[label_pred == 2]
plt.scatter(x0[:, 0], x0[:, 1], c="red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c="green", marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c="blue", marker='+', label='label2')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(loc=2)
plt.show()
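The per-sample loops in the implementation above can also be written with NumPy broadcasting. The following is a minimal sketch of the assignment and update steps only, under the assumption that an empty cluster should simply keep its previous centroid; the names `assign_labels` and `update_centroids` and the demo data are illustrative and not from the book.

```python
import numpy as np

def assign_labels(X, centroids):
    """Return the index of the nearest centroid for every row of X."""
    # diff has shape (m, k, n): every sample minus every centroid
    diff = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)  # squared Euclidean distances, shape (m, k)
    return np.argmin(sq_dist, axis=1)    # ties go to the lowest centroid index

def update_centroids(X, labels, centroids):
    """Mean of each cluster; an empty cluster keeps its previous centroid."""
    new_centroids = centroids.astype(float).copy()
    for j in range(centroids.shape[0]):
        members = X[labels == j]
        if len(members) > 0:
            new_centroids[j] = members.mean(axis=0)
    return new_centroids

# Hypothetical demo data: two well-separated groups of two points each
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
c_demo = np.array([[0.0, 0.0], [10.0, 10.0]])

labels_demo = assign_labels(X_demo, c_demo)
new_c_demo = update_centroids(X_demo, labels_demo, c_demo)
print(labels_demo)
print(new_c_demo)
```

Comparing squared distances is enough for the assignment step, because the square root does not change which centroid is nearest.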
5 K-means algorithm based on sklearn
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

kmeans_sk = KMeans(n_clusters=3, random_state=2023).fit(iris)
label_pred = kmeans_sk.labels_  # fitted cluster labels
X = iris[:, 2:]
x0 = X[label_pred == 0]
x1 = X[label_pred == 1]
x2 = X[label_pred == 2]
plt.scatter(x0[:, 0], x0[:, 1], c="red", marker='o', label='label0')
plt.scatter(x1[:, 0], x1[:, 1], c="green", marker='*', label='label1')
plt.scatter(x2[:, 0], x2[:, 1], c="blue", marker='+', label='label2')
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.legend(loc=2)
plt.show()