Clustering - Hierarchical clustering

1. Introduction to hierarchical clustering

Hierarchical clustering (the hierarchical cluster method, also known as systematic cluster analysis) is a method of cluster analysis. The procedure starts by treating each sample as its own class; the closest samples (i.e., the pair with the minimum distance) are merged into subclasses first, then the resulting subclasses are merged in turn according to the distances between them, and the process continues until all subclasses are aggregated into a single class.

In general, when clustering efficiency is the main concern, we choose flat clustering; when the potential problems of flat clustering (insufficient structure, a predetermined number of clusters, non-determinism) become the focus, we choose hierarchical clustering. In addition, many researchers believe that hierarchical clustering produces better clusters than flat clustering.

Hierarchical clustering is a clustering algorithm that creates a nested hierarchical cluster tree by calculating the similarity between data points of different categories. In the cluster tree, the raw data points of the different categories form the lowest layer, and the top of the tree is the root node of a single cluster. The cluster tree can be created by two methods: bottom-up merging and top-down splitting.

For example, consider a dendrogram of countries and regions (the original figure is not reproduced here). Cutting the tree at different distances yields different groupings: cutting at distance 23 gives two classes, China versus the other countries and regions; cutting at 17 gives three classes, with China as its own class, the Philippines and Japan as a second class, and the remaining countries and regions as the third.

The bottom-up (agglomerative) merging algorithm works as follows:

1. Treat each sample as its own cluster.
2. Compute the distance between every pair of clusters and merge the two closest clusters.
3. Repeat step 2 until all samples belong to a single cluster.
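As a minimal illustration of this procedure (not part of the original post; the small random dataset is assumed for demonstration), SciPy's hierarchy module performs exactly this bottom-up merging, and cutting the resulting tree at a distance threshold reproduces the dendrogram-cutting idea from the example above:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
X = rng.rand(10, 2)  # ten random 2D samples, assumed for demonstration

# Each row of Z records one merge: the indices of the two clusters joined,
# the distance between them, and the size of the newly formed cluster.
Z = linkage(X, method='average')

# Cutting the tree at a distance threshold (here 0.5, chosen arbitrarily)
# yields a flat clustering, like reading the dendrogram at a given height.
labels = fcluster(Z, t=0.5, criterion='distance')
print(labels)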

2. The distance between two combined data points (linkage methods):

(1) Single linkage (minimum distance)

This method takes the distance between the two closest member points of the two combined data points as the distance between the combinations.

This method is susceptible to extreme values: two combined data points that are not similar overall may be grouped together because a pair of their extreme points happens to be close.
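In standard notation (the post itself gives no formula), single linkage between combined data points $A$ and $B$ can be written as:

$$ d_{\text{single}}(A, B) = \min_{a \in A,\, b \in B} d(a, b) $$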

(2) Complete linkage (maximum distance)

Complete linkage is calculated in the opposite way to single linkage: the distance between the two farthest member points of the two combined data points is taken as the distance between the combinations.

The problem with complete linkage is also the opposite of single linkage's: two combined data points that are quite similar overall may fail to merge because some extreme points within them are far apart.
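Correspondingly (again a standard formula, not given in the post), complete linkage is:

$$ d_{\text{complete}}(A, B) = \max_{a \in A,\, b \in B} d(a, b) $$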

(3) Average linkage (average distance)

For average linkage, the distance from every point in one combined data point to every point in the other is calculated, and the mean of all these distances is taken as the distance between the two combinations.

This method is computationally more expensive, but its results are more reasonable than those of the previous two methods.
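In the same notation (standard, not from the post), average linkage is the mean over all pairs:

$$ d_{\text{average}}(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b) $$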

 

We use average linkage to compute the distance between combined data points. Below, the distance from the combined data point (A, F) to (B, C) is computed as the mean of all pairwise distances between (A, F) and (B, C).

 
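The original figure showing this computation is not reproduced here; below is a minimal sketch of the same calculation, with made-up 2D coordinates for A, F, B, and C:

import numpy as np

# Hypothetical coordinates, chosen only for illustration;
# the original figure used its own values.
points = {
    'A': np.array([1.0, 1.0]),
    'F': np.array([1.5, 0.5]),
    'B': np.array([4.0, 4.0]),
    'C': np.array([4.5, 3.5]),
}

# Average linkage: the mean of the pairwise distances between
# every point of (A, F) and every point of (B, C).
pair_distances = [np.linalg.norm(points[p] - points[q])
                  for p in ('A', 'F')
                  for q in ('B', 'C')]
print(np.mean(pair_distances))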

(4) Ward linkage (sum of squared deviations)

For details, see the following article:

http://blog.sciencenet.cn/blog-2827057-921772.html
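As a brief supplement (a standard statement of Ward's method, not taken from the linked article), Ward linkage merges the pair of clusters whose union produces the smallest increase in the within-cluster sum of squared deviations:

$$ \Delta(A, B) = \sum_{x \in A \cup B} \lVert x - m_{A \cup B} \rVert^2 - \sum_{x \in A} \lVert x - m_A \rVert^2 - \sum_{x \in B} \lVert x - m_B \rVert^2 $$

where $m_S$ denotes the centroid of cluster $S$.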

 

3. A hierarchical clustering example

The swiss roll example

 

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets
# Agglomerative clustering: bottom-up clustering
from sklearn.cluster import AgglomerativeClustering, KMeans
from mpl_toolkits.mplot3d.axes3d import Axes3D

# Load the data
X, t = datasets.make_swiss_roll(n_samples=1500, noise=0.05)
# 'Thin' the vertical axis so the layers sit closer together
X[:, 1] *= 0.5
# (1500, 3) and (1500,): the data is three-dimensional
display(X.shape, t.shape)

# Plot the data
fig = plt.figure(figsize=(9, 6))
axes3D = Axes3D(fig)
axes3D.scatter(X[:, 0], X[:, 1], X[:, 2], c=t)
# Adjust the viewing angle
axes3D.view_init(7, -80)

[Figure: 3D scatter plot of the swiss roll data, colored by the roll position t]

# Use KMeans for comparison
kmeans = KMeans(6)
kmeans.fit(X)
y_ = kmeans.predict(X)

fig = plt.figure(figsize=(9, 6))
axes3D = Axes3D(fig)
axes3D.scatter(X[:, 0], X[:, 1], X[:, 2], c=y_)
axes3D.view_init(7, -80)

[Figure: KMeans clustering result on the swiss roll (6 clusters)]

KMeans does not perform well here; just compare this result with the previous figure.

 

from sklearn.neighbors import kneighbors_graph

# linkage : {"ward", "complete", "average", "single"}
# Connectivity constraint: each sample may only merge with its
# 5 nearest neighbors, so clusters follow the manifold's local structure.
conn = kneighbors_graph(X, 5)
agg = AgglomerativeClustering(n_clusters=6, linkage='ward', connectivity=conn)
agg.fit(X)
y_ = agg.labels_

fig = plt.figure(figsize=(9, 6))
axes3D = Axes3D(fig)
axes3D.scatter(X[:, 0], X[:, 1], X[:, 2], c=y_)
axes3D.view_init(7, -80)

[Figure: agglomerative clustering result with the connectivity constraint]

Hierarchical clustering also requires choosing suitable parameters, such as the linkage (distance) method.

Without a connectivity constraint, the algorithm ignores the structure of the data itself and forms clusters that span across different folds of the manifold.

Adding the connectivity constraint gives a much better result.

The figure below shows the result without connectivity:

[Figure: agglomerative clustering result without the connectivity constraint]
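The original post does not show the code for this no-connectivity variant; a sketch, assuming the same X and imports as in the example above:

# Same clustering but without the connectivity constraint:
# the merges ignore the manifold structure, so clusters
# can span across different folds of the roll.
agg_no_conn = AgglomerativeClustering(n_clusters=6, linkage='ward')
y_no_conn = agg_no_conn.fit_predict(X)

fig = plt.figure(figsize=(9, 6))
axes3D = Axes3D(fig)
axes3D.scatter(X[:, 0], X[:, 1], X[:, 2], c=y_no_conn)
axes3D.view_init(7, -80)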
