Basics of machine learning algorithms--hierarchical clustering method

1. Introduction to the principle of hierarchical clustering method

# Agglomerative clustering (hierarchical clustering)
"""
1. Hierarchical clustering, as the name suggests, clusters the sample set according to some "hierarchy". The hierarchy here is not a literal hierarchy; it actually refers to some definition of distance (and we have already seen many distance definitions).
2. The goal of hierarchical clustering is to reduce the number of clusters bottom-up, much like the way a dendrogram grows from its leaf nodes toward the root node.
3. Put more simply, hierarchical clustering treats the initial clusters as tree nodes; in each iteration the two closest clusters are merged, and this repeats until only one cluster (the root node) remains.
"""

2. Demonstration of basic algorithm of hierarchical clustering method

Three linkage methods for hierarchical clustering:
Depending on how inter-cluster similarity (distance) is defined, hierarchical clustering uses one of three linkage methods:
1. Single-linkage: the distance compared is the minimum distance between pairs of elements, one from each cluster.
2. Complete-linkage: the distance compared is the maximum distance between pairs of elements, one from each cluster.
3. Group average: the distance compared is the average distance over all pairs of elements across the two clusters.
We first take some data to walk through this most basic algorithm. The figure below gives the pairwise distances between the five points A, B, C, D, E:
[Figure: pairwise distance matrix for A, B, C, D, E]

2.1. Demonstration of calculation method of Single-linkage

Single-linkage: the distance compared is the minimum distance between pairs of elements, so at every step we look for the smallest remaining inter-cluster distance.
Step 1: The point closest to A is B, so A and B are merged first, recorded as {AB}.
[Figure: merging A and B]
Step 2: Treating {AB} as a single cluster, examine which merge involving C is best.
[Figure: candidate distances for C]
It turns out that the C-D distance is the smallest, so C and D are merged, recorded as {CD}.
Step 3: Treating {AB} and {CD} as wholes, examine where E should merge.
[Figure: candidate distances for E]
The distance from {CD} to E turns out to be the smallest, so the merge is recorded as {CDE}.
Step 4: Merge the last two remaining clusters, {AB} and {CDE}.
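The merge sequence above can be reproduced with scipy. The distance matrix from the figure is not available here, so the pairwise distances below are hypothetical values chosen only so that the merges happen in the same order ({AB}, then {CD}, then {CDE}, then everything):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical pairwise distances between A, B, C, D, E, given in
# scipy's condensed order: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE.
dists = np.array([2.0, 6.0, 7.0, 9.0, 5.0, 8.0, 10.0, 3.0, 4.5, 4.0])

# Single linkage: merge the two clusters with the smallest minimum distance.
Z = linkage(dists, method='single')

# Each row of Z is (cluster_i, cluster_j, merge_distance, new_cluster_size):
# A+B at 2, C+D at 3, {CD}+E at 4, {AB}+{CDE} at 5.
print(Z)
```

Reading the linkage matrix row by row confirms the hand calculation: each merge happens at the minimum cross-cluster distance.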

2.2.Complete-linkage calculation method demonstration

Complete-linkage: the distance compared is the maximum distance between pairs of elements, so at every step we merge the two clusters whose maximum pairwise distance is smallest.
Step 1: For singleton clusters the maximum distance is just the pairwise distance, and the point closest to A is B, so A and B are merged first, recorded as {AB}.
Step 2: The maximum distance from C to each existing cluster is as follows:
[Figure: maximum distances from C]
The smallest of these maximum distances is the one to D, so C and D are merged, recorded as {CD}.
Step 3: Treating {AB} and {CD} as wholes, examine where E should merge.
[Figure: candidate maximum distances for E]
The maximum distance from {CD} to E turns out to be the smallest, so the merge is recorded as {CDE}.
Step 4: Merge the last two remaining clusters, {AB} and {CDE}.
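On the same hypothetical distances used earlier (the real matrix from the figure is not available), complete linkage happens to produce the same merge order, but the merge heights change because the maximum rather than the minimum cross-cluster distance is used:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical pairwise distances for A..E in scipy's condensed order:
# AB, AC, AD, AE, BC, BD, BE, CD, CE, DE.
dists = np.array([2.0, 6.0, 7.0, 9.0, 5.0, 8.0, 10.0, 3.0, 4.5, 4.0])

# Complete linkage: merge the two clusters with the smallest *maximum* distance.
Z = linkage(dists, method='complete')

# {CD}+E now merges at max(4.5, 4.0) = 4.5, and the final merge
# happens at the largest cross distance, 10.
print(Z)
```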

2.3. Demonstration of calculation method of Group-average

Group-average: the distance compared is the average distance over all pairs of elements across two clusters, so at every step we merge the two clusters with the smallest average distance.
Step 1: For singleton clusters the average distance is just the pairwise distance, and the point closest to A is B, so A and B are merged first, recorded as {AB}.
Step 2: The average distance from C to each existing cluster is as follows:
[Figure: average distances from C]
The smallest of these average distances is the one to D, so C and D are merged, recorded as {CD}.
Step 3: Treating {AB} and {CD} as wholes, examine where E should merge.
[Figure: candidate average distances for E]
The average distance from {CD} to E turns out to be the smallest, so the merge is recorded as {CDE}.
Step 4: Merge the last two remaining clusters, {AB} and {CDE}.
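Again on the same hypothetical distance values (the figure's real matrix is unavailable), group average falls between the previous two methods: {CD}+E merges at (4.5 + 4.0) / 2 = 4.25, and the final merge at the mean of the six cross-pair distances, 7.5:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical pairwise distances for A..E in scipy's condensed order:
# AB, AC, AD, AE, BC, BD, BE, CD, CE, DE.
dists = np.array([2.0, 6.0, 7.0, 9.0, 5.0, 8.0, 10.0, 3.0, 4.5, 4.0])

# Group average: merge the two clusters with the smallest *average* distance.
Z = linkage(dists, method='average')

print(Z)
```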

3. Introduction to extended hierarchical clustering algorithms

Source: https://blog.csdn.net/huangguohui_123/article/details/106995538

3.1. Introduction to the principle of centroid method

[Figure: illustration of the centroid method]
If, after two clusters merge, the minimum distance found at the next merge step decreases (because the centroid keeps moving), we call this a reversal (inversion); it shows up as a crossover in the dendrogram.

In hierarchical clustering methods such as single linkage, complete linkage, and average linkage, inversions cannot occur: these distance measures are monotonic. The centroid method is clearly not monotonic.
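An inversion under the centroid method is easy to reproduce: with three points forming a near-equilateral triangle, the centroid of the first merged pair ends up closer to the third point than the two original points were to each other. The coordinates below are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three points in a near-equilateral triangle (illustrative coordinates).
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.5, 0.9]])

Z = linkage(X, method='centroid')

# First merge: the two base points, at distance 1.0.
# Second merge: the centroid (0.5, 0) is only 0.9 away from the apex,
# so the merge height *decreases* -- an inversion.
print(Z[:, 2])
```

In a dendrogram this shows up as the second fork sitting below the first, i.e. the crossover described above.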

3.2. Centroid method based on midpoint

[Figure: illustration of the midpoint-based centroid method]

3.3.Ward method

[Figure: illustration of the Ward method]
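As a brief substitute for the missing figure: Ward's method merges, at each step, the pair of clusters whose fusion causes the smallest increase in the total within-cluster sum of squared deviations; unlike the centroid method, it is monotonic. A minimal sketch with illustrative points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight pairs plus an outlier (illustrative coordinates).
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [5.0, 0.0], [5.0, 1.0],
              [10.0, 5.0]])

# Ward linkage: merge heights grow with the increase in
# within-cluster sum of squares, so they never decrease.
Z = linkage(X, method='ward')
print(Z[:, 2])
```

The merge heights come out in non-decreasing order, which is exactly the monotonicity property the centroid method lacks.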

4. Practical application of hierarchical clustering method

4.1. Clustering application of hierarchical clustering method

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
#%%
# Read the data
data = pd.read_excel('Clustering_5.xlsx')
# Extract features and labels
X = data.iloc[:, :2].values
y = data['y'].values
# Build the agglomerative clustering model
n_clusters = 5
agg_clustering = AgglomerativeClustering(n_clusters=n_clusters)
# Fit the model and predict cluster labels
labels = agg_clustering.fit_predict(X)
#%%
# Plot the clustering result
plt.figure(figsize=(10, 6))
for i in range(n_clusters):
    cluster_points = X[labels == i]
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1],
                label=f'Cluster {i + 1}', s=16)

plt.title('Agglomerative clustering')
plt.legend()
plt.show()
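The file Clustering_5.xlsx is the author's own dataset and is not included with the article. If it is unavailable, synthetic data with a similar five-cluster structure can be generated with the already-imported make_blobs (the sample count, centers, and random_state below are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Generate five synthetic blobs in place of Clustering_5.xlsx
# (parameters are illustrative, not from the original article).
X, y = make_blobs(n_samples=500, centers=5, cluster_std=1.0,
                  random_state=42)

# Same model as above: agglomerative clustering into five clusters.
agg_clustering = AgglomerativeClustering(n_clusters=5)
labels = agg_clustering.fit_predict(X)

print(np.unique(labels))  # five cluster labels: 0..4
```

The resulting X and labels can be dropped straight into the plotting loop above.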

The clustering result looks reasonably good.
[Figure: scatter plot of the clustering result]

4.2. Hierarchical clustering method clustering tree drawing

4.2.1.Single-linkage connection method
#%%
linked = linkage(X, 'single')  # compute the linkage with the single method
dendrogram(linked, orientation='top', 
           distance_sort='descending', show_leaf_counts=True)
plt.title('Single-linkage method')
plt.show()

[Figure: single-linkage dendrogram]

4.2.2.Complete-linkage connection method
#%%
linked = linkage(X, 'complete')  # compute the linkage with the complete method
dendrogram(linked, orientation='top', 
           distance_sort='descending', show_leaf_counts=True)
plt.title('Complete-linkage method')
plt.show()

[Figure: complete-linkage dendrogram]

4.2.3.Group-average connection method
#%%
linked = linkage(X, 'average')  # compute the linkage with the average method
dendrogram(linked, orientation='top', 
           distance_sort='descending', show_leaf_counts=True)
plt.title('Group-average method')
plt.show()

[Figure: group-average dendrogram]

4.2.4.Centroid connection method
#%%
linked = linkage(X, 'centroid')  # compute the linkage with the centroid method
dendrogram(linked, orientation='top', 
           distance_sort='descending', show_leaf_counts=True)
plt.title('Centroid method')
plt.show()

[Figure: centroid dendrogram]

4.2.5.Ward connection method
# Draw the dendrogram (clustering tree)
linked = linkage(X, 'ward')  # compute the linkage with the ward method
dendrogram(linked, orientation='top', 
           distance_sort='descending', show_leaf_counts=True)
plt.title('Ward method')
plt.show()

[Figure: Ward dendrogram]

5. Acknowledgments

This chapter could not have been completed without the inspiration and help of the following articles, listed here; if anything is still unclear, you can go to the corresponding article for further explanation.
1. Basic algorithm demonstration of hierarchical clustering: https://blog.csdn.net/qq_40206371/article/details/123057888
2. Advanced algorithm demonstration of hierarchical clustering: https://blog.csdn.net/huangguohui_123/article/details/106995538
Once again, sincere thanks to the authors!


Origin blog.csdn.net/m0_71819746/article/details/133433905