Statistical learning method 01-14.2 Hierarchical clustering

This blog post is mainly my own review of these topics; all references are listed at the end of the article. If you spot an error, please let me know so we can discuss it.

Basic knowledge

Hierarchical clustering assumes that there is a hierarchical structure among the categories and groups the samples into a hierarchy of classes. It is a hard clustering method.

Hierarchical clustering

  • Agglomerative clustering (bottom-up)
  • Divisive clustering (top-down) (not covered in this post)

Agglomerative clustering

  • The overall procedure: for a given sample set, start by placing each sample in its own class ==> according to a chosen rule (for example, smallest inter-class distance), merge the two classes that best satisfy the rule ==> repeat, reducing the number of classes by one each time, until a stopping condition is met, for example all samples belong to a single class. (A rough sketch of the whole procedure is given in the code after this list.)
  • Elements that must be specified in advance (different combinations of these elements yield different clustering methods):
    • Distance or similarity: how to measure the distance between samples (Minkowski distance, Mahalanobis distance, correlation coefficient, cosine of the angle, etc.)
    • Merging rule: which two classes to merge, typically the pair with the smallest inter-class distance, where the inter-class distance can be the longest distance, the shortest distance, the center distance, or the average distance between their members.
    • Stopping condition: for example, all samples have been merged into a single class.
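The sketch below is my own illustration of this procedure, assuming Euclidean distance between samples, the shortest distance (single linkage) between classes as the merging rule, and a target number of classes as the stopping condition; the scipy and sklearn implementations shown later are what you would normally use in practice.

import numpy as np

def single_linkage_clustering(X, n_clusters=1):
    # Start with every sample in its own class.
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of classes with the smallest single-linkage distance,
        # i.e. the shortest Euclidean distance between any two of their members.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))   # record the merge
        clusters[a] = clusters[a] + clusters[b]        # merge class b into class a
        del clusters[b]
    return clusters, merges

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
print(single_linkage_clustering(X, n_clusters=2)[0])   # [[0, 1], [2, 3]]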

Example

  • Distance between samples: Euclidean distance; merging rule: merge the two classes with the smallest inter-class distance (shortest distance, i.e. single linkage); stopping condition: the number of classes is 1.
  • Algorithm 14.1
  • The complexity of the agglomerative hierarchical clustering algorithm is $O(n^3 m)$, where $m$ is the sample dimension and $n$ is the number of samples.

    At level $t$ there are $n-t$ clusters. To determine the pair of clusters to merge at level $t+1$, $C_{n-t}^2 = (n-t)(n-t-1)/2$ cluster pairs must be examined. The total number of cluster pairs considered over the whole clustering process is therefore $C_n^2 + C_{n-1}^2 + \dots + C_2^2 = C_{n+1}^3 = (n-1)n(n+1)/6$; the factor $m$ comes from computing each pairwise distance. (A quick numerical check of this identity is given after this list.)

  • Example 14.1:
    The figure in the lower right corner of the original post may not be entirely accurate (it can be ignored).
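A quick numerical check of the pair-count identity from the complexity note above (my own sanity check, not from the book):

# Check that C(n,2) + C(n-1,2) + ... + C(2,2) equals C(n+1,3) = (n-1)n(n+1)/6.
from math import comb

for n in (5, 10, 100):
    total_pairs = sum(comb(k, 2) for k in range(2, n + 1))
    assert total_pairs == comb(n + 1, 3) == (n - 1) * n * (n + 1) // 6
    print(n, total_pairs)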

Advantages

  • No need to choose the number of clusters K or to initialize cluster centers
  • Distance and similarity measures and merging rules are easy to define, with few restrictions
  • Can reveal the hierarchical relationships among classes

Disadvantages

  • The computational complexity is too high
  • Outliers (singular points) can also have a large impact
  • Each step greedily merges the two most similar clusters, which only achieves a local optimum. Earlier steps cannot be revised in light of later results, so the global optimum may never be reached; if a wrong merge happens early in the algorithm, the error persists and cannot be corrected.
  • The algorithm tends to produce chain-like clusters (the chaining effect; see the sketch below)
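The chaining effect is easiest to see on a toy data set where two tight groups are connected by a thin bridge of points. The sketch below is my own illustration (it uses the scipy linkage function introduced in the next section): with single linkage the bridge lets the two groups join at a small merge distance, while complete linkage only joins them at a much larger distance.

# Chaining illustration: two tight groups joined by a thin "bridge" of points.
# Single linkage (shortest distance) hops along the bridge and joins the two
# groups at a small merge distance; complete linkage only joins them at a
# much larger distance.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
left = rng.normal([0.0, 0.0], 0.1, size=(20, 2))
right = rng.normal([10.0, 0.0], 0.1, size=(20, 2))
bridge = np.column_stack([np.linspace(0.5, 9.5, 20), np.zeros(20)])
X_toy = np.vstack([left, right, bridge])

for method in ('single', 'complete'):
    Z = linkage(X_toy, method=method)
    print(method, 'final merge distance:', round(Z[-1, 2], 2))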

Python implementation

Scipy

Two functions are mainly used:

  • linkage(y, method='single', metric='euclidean', optimal_ordering=False): performs hierarchical (agglomerative) clustering
  • fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None): forms flat clusters from the hierarchical clustering defined by the given linkage matrix
# -*- coding: utf-8 -*-
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import fcluster  # used below for flat clustering
from sklearn.cluster import AgglomerativeClustering
from itertools import cycle
from sklearn.datasets import make_blobs
  • scipy.spatial.distance: the distance computation module
  • scipy.cluster is the clustering package in scipy; it contains two kinds of clustering algorithms:
    • Vector quantization (scipy.cluster.vq): supports vector quantization and k-means clustering
    • Hierarchical clustering (scipy.cluster.hierarchy): supports hierarchical and agglomerative clustering
  • itertools is Python's iterator module; it provides efficient, memory-saving tools and lets you build custom iterators for efficient loops.
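For reference, pdist computes pairwise distances and returns them as a condensed vector of length n(n-1)/2 (the flattened upper triangle of the distance matrix), which is exactly what linkage accepts. A small illustration of my own:

# pdist returns the condensed distance vector (upper triangle, length n*(n-1)/2);
# squareform converts it back to the full symmetric distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

pts = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
d = pdist(pts, metric='euclidean')
print(d)              # [ 5. 10.  5.]  -> pairs (0,1), (0,2), (1,2)
print(squareform(d))  # the full 3x3 symmetric distance matrix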
# ========== Generate the data ===========
# number of samples to generate and the cluster centers
n_samples = 300
centers = [[2, 2], [-1, -1], [1, -2]]

X, labels = make_blobs(n_samples=n_samples, centers=centers, cluster_std=1, random_state=0)
variables = ['X', 'Y']
  • make_blobs parameters: cluster_std is the standard deviation of the clusters; random_state fixes the random seed, so the same seed always generates the same data.
# =========== Hierarchical clustering ============
df = pd.DataFrame(X, columns=variables, index=labels)
row_clusters = linkage(pdist(df, metric='euclidean'), method='single')
print(pd.DataFrame(row_clusters, columns=['row label1', 'row label2', 'distance', 'no. of items in clust.'],
                   index=['cluster %d' % (i + 1) for i in range(row_clusters.shape[0])]))

# Flat clustering
f = fcluster(row_clusters, 0.6, 'distance')  # the second argument is the distance threshold
print("Flat clustering result:", f)

# Plot
row_dendr = dendrogram(row_clusters, labels=labels)  # use a dendrogram to see how the clusters are formed at each step
plt.tight_layout()  # tight_layout automatically adjusts subplot parameters so the plot fills the figure area
plt.title('canberra-complete')
plt.show()

Output:

              row label1  row label2  distance  no. of items in clust.
cluster 1         2656.0      2668.0  0.000975                     2.0
cluster 2          242.0      1964.0  0.001864                     2.0
cluster 3         1043.0      1133.0  0.002340                     2.0
cluster 4         1258.0      1272.0  0.002940                     2.0
cluster 5          278.0      2328.0  0.003447                     2.0
...                  ...         ...       ...                     ...
cluster 2995      5981.0      5993.0  0.668463                  2996.0
cluster 2996       875.0      5994.0  0.700367                  2997.0
cluster 2997      2292.0      5995.0  0.704809                  2998.0
cluster 2998       719.0      5996.0  0.739750                  2999.0
cluster 2999       916.0      5997.0  0.818214                  3000.0

[2999 rows x 4 columns]
  • The linkage matrix is an (n-1)×4 matrix: the first and second fields are the indices of the two clusters being merged, the third field is the distance between those two clusters, and the fourth field is the number of elements contained in the newly formed cluster.
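One indexing detail worth knowing (a small illustration of scipy's convention, not part of the original post): indices smaller than n refer to the original samples, and the cluster created in row i of the linkage matrix is given the new index n + i, which is how later rows refer to clusters formed earlier.

# Tiny illustration of linkage-matrix indexing: indices < n are original samples;
# the cluster formed in row i gets index n + i and is referenced by later rows.
import numpy as np
from scipy.cluster.hierarchy import linkage

pts = np.array([[0.0], [0.1], [5.0], [5.2]])
print(linkage(pts, method='single'))
# Expected pattern (4 samples -> 3 merges):
# row 0 merges samples 0 and 1             -> new cluster index 4
# row 1 merges samples 2 and 3             -> new cluster index 5
# row 2 merges clusters 4 and 5 (all four points)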

(The original post showed two dendrograms here, one for 30 samples and one for 300 samples.)

Sklearn

In the sklearn library, hierarchical clustering is provided by AgglomerativeClustering in sklearn.cluster:

def __init__(self, n_clusters=2, affinity="euclidean",
             memory=None,
             connectivity=None, compute_full_tree='auto',
             linkage='ward', distance_threshold=None):
  • affinity: the metric used to compute the linkage distances (e.g. 'euclidean')
  • linkage: the strategy for computing the distance between clusters: ward (minimize variance), complete (maximum distance), average (average distance), single (minimum distance)
n_clusters_ = 3 

ac = AgglomerativeClustering(n_clusters=n_clusters_, affinity='euclidean', linkage='ward')  # agglomerative hierarchical clustering
clustering = ac.fit_predict(X)
print('Cluster labels: %s' % clustering)

# Plot
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(0, 3), colors):
    # build a True/False array according to whether each value in clustering equals k
    my_members = clustering == k
    # X[my_members, 0] selects the x-coordinates of the samples where my_members is True
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
plt.title('euclidean-ward')
plt.show()

Here ward works best as the inter-class distance: in the original post the true labels were shown in the left figure, the ward clustering in the middle figure, and single (minimum-distance) linkage in the right figure, and ward was clearly the better choice as the linkage strategy.
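The figures are not reproduced here, but the comparison can be repeated on the X generated above by switching the linkage strategy (a small sketch of my own):

# Compare ward vs. single linkage on the same blobs generated above:
# ward tends to give three roughly balanced clusters, while single linkage
# often lumps most points into one cluster on overlapping data like this.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

for link in ('ward', 'single'):
    labels_pred = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(link, np.bincount(labels_pred))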

References

  1. Li Hang, Statistical Learning Methods (Second Edition), 2019: 261-263.
  2. Complexity part: https://www.cnblogs.com/emanlee/archive/2012/02/28/2371273.html
  3. Code: https://blog.csdn.net/zcmlimi/article/details/87929070
  4. Hierarchical clustering with the linkage package: https://blog.csdn.net/yibo492387/article/details/88065036
  5. scipy, sklearn: https://blog.csdn.net/pentiumCM/article/details/105695414
