Hierarchical Clustering: Principles and a Python Implementation

I. Introduction to the Algorithm

Mainstream clustering algorithms can be roughly divided into hierarchical clustering algorithms, partitioning clustering algorithms (graph-theoretic methods, K-means), density-based clustering algorithms (DBSCAN), and grid-based clustering algorithms, among others.

1.1 Basic Concepts

  1. Hierarchical clustering (Hierarchical Clustering) is a clustering algorithm that builds a nested cluster tree by computing the similarity between data points of different categories. In the cluster tree, the original data points form the lowest layer (the leaves), and the top of the tree is a single root cluster containing all the data.

  2. A cluster tree can be built in two ways: bottom-up merging (agglomerative) or top-down splitting (divisive). (This article covers the first.)

1.2 Agglomerative (Merging) Hierarchical Clustering

Agglomerative hierarchical clustering works by computing the similarity between pairs of data points or clusters, merging the two most similar ones, and iterating this process. In short, similarity is determined by the distance between each data point (or cluster) and all other data points: the smaller the distance, the higher the similarity. At every step, the two closest data points or clusters are merged, eventually producing a cluster tree. The merging process is as follows (a code sketch follows the list):

  1. Compute a distance matrix X, where element x(i, j) denotes the distance between data point i and data point j. Merge the two data points with the smallest distance into a combined data point, denoted G.
  2. Distance between a data point and a combined data point: to compute the distance between a point and G, we need the distances between that point and every point inside G.
  3. Distance between two combined data points: the main methods are Single Linkage, Complete Linkage, and Average Linkage. The three methods are described below:

    Single Linkage
    Single Linkage defines the distance between two clusters as the distance between their two closest points. This method is susceptible to extreme values: two otherwise dissimilar clusters may be merged simply because one pair of extreme points happens to be close.

    Complete Linkage
    Complete Linkage is the opposite of Single Linkage: the distance between two clusters is defined as the distance between their two farthest points. Its problem is likewise the mirror image of Single Linkage's: two otherwise similar clusters may fail to merge because one pair of extreme points happens to be far apart.

    Average Linkage
    Average Linkage computes the distance between every point in one cluster and every point in the other, and takes the average of all these pairwise distances as the distance between the two clusters. This method is computationally more expensive, but its results are generally more reasonable than the first two.
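
To make the procedure concrete, here is a minimal, illustrative sketch of the agglomerative loop over a precomputed distance matrix. This is not the original article's code; the names `cluster_distance` and `naive_agglomerative` are invented for illustration:

import numpy as np

def cluster_distance(X, a, b, method='average'):
    """Linkage distance between clusters a and b (lists of point indices),
    given the full pairwise distance matrix X."""
    d = X[np.ix_(a, b)]          # all pairwise distances between a and b
    if method == 'single':
        return d.min()           # closest pair
    if method == 'complete':
        return d.max()           # farthest pair
    return d.mean()              # average of all pairwise distances

def naive_agglomerative(X, n_clusters, method='average'):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    X = np.asarray(X)
    clusters = [[i] for i in range(len(X))]   # start with every point alone
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest linkage distance
        pairs = [(i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs,
                   key=lambda p: cluster_distance(X, clusters[p[0]],
                                                  clusters[p[1]], method))
        clusters[i] += clusters.pop(j)        # merge cluster j into cluster i
    return clusters

This naive version recomputes the linkage distances between all cluster pairs on every iteration, so it is cubic or worse in the number of points; scipy's linkage (used below) implements the same idea far more efficiently.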
   

II. Python Implementation

scipy.cluster.hierarchy.linkage can be used directly!

The following code performs hierarchical clustering on a small distance matrix.

# -*- coding: utf-8 -*-
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from matplotlib import pyplot as plt

def hierarchy_cluster(data, method='average', threshold=5.0):
    '''Hierarchical clustering.

    Arguments:
        data [[0, float, ...], [float, 0, ...]] -- distance between document i and document j

    Keyword Arguments:
        method {str} -- linkage method: single, complete, average, centroid, median, ward (default: {'average'})
        threshold {float} -- distance threshold between clusters

    Return:
        cluster_number int -- number of clusters
        cluster [[idx1, idx2,..], [idx3]] -- indices of the members of each cluster
    '''
    data = np.array(data)

    # NOTE: linkage() treats a 2-D array as raw observation vectors, not as a
    # distance matrix -- see the note after the running results below.
    Z = linkage(data, method=method)
    cluster_assignments = fcluster(Z, threshold, criterion='distance')
    print(type(cluster_assignments))
    num_clusters = cluster_assignments.max()
    indices = get_cluster_indices(cluster_assignments)

    return num_clusters, indices



def get_cluster_indices(cluster_assignments):
    '''Map each cluster to the indices of the original data points.

    Arguments:
        cluster_assignments -- flat cluster labels from the hierarchical clustering

    Returns:
        [[idx1, idx2,..], [idx3]] -- indices of the members of each cluster
    '''
    n = cluster_assignments.max()
    indices = []
    for cluster_number in range(1, n + 1):
        indices.append(np.where(cluster_assignments == cluster_number)[0])
    
    return indices


if __name__ == '__main__':
    

    arr = [[0., 21.6, 22.6, 63.9, 65.1, 17.7, 99.2],
    [21.6, 0., 1., 42.3, 43.5, 3.9, 77.6],
    [22.6, 1., 0, 41.3, 42.5, 4.9, 76.6],
    [63.9, 42.3, 41.3, 0., 1.2, 46.2, 35.3],
    [65.1, 43.5, 42.5, 1.2, 0., 47.4, 34.1],
    [17.7, 3.9, 4.9, 46.2, 47.4, 0, 81.5],
    [99.2, 77.6, 76.6, 35.3, 34.1, 81.5, 0.]]

    arr = np.array(arr)
    r, c = arr.shape
    # force the matrix to be symmetric by copying the lower triangle
    for i in range(r):
        for j in range(i, c):
            if arr[i][j] != arr[j][i]:
                arr[i][j] = arr[j][i]
    # sanity check: report any remaining asymmetric entries
    for i in range(r):
        for j in range(i, c):
            if arr[i][j] != arr[j][i]:
                print(arr[i][j], arr[j][i])

    num_clusters, indices = hierarchy_cluster(arr)


    print "%d clusters" % num_clusters
    for k, ind in enumerate(indices):
        print "cluster", k + 1, "is", ind

Running results:
5 clusters
cluster 1 is [1 2]
cluster 2 is [5]
cluster 3 is [0]
cluster 4 is [3 4]
cluster 5 is [6]
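
Note: scipy's linkage interprets a 2-D input array as a matrix of raw observation vectors, not as a precomputed distance matrix, so the run above actually clusters the rows of arr as 7-dimensional points under Euclidean distance (recent SciPy versions emit a ClusterWarning when the input looks like a square distance matrix). To cluster on the precomputed distances themselves, the matrix should first be converted to condensed form with scipy.spatial.distance.squareform. A minimal sketch, reusing arr from the script above:

from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# condensed (1-D) form of the square, symmetric distance matrix
condensed = squareform(arr)
Z = linkage(condensed, method='average')
clusters = fcluster(Z, 5.0, criterion='distance')

With this correction the flat clustering changes: for example, point 5 then joins the {1, 2} cluster at threshold 5.0, since its average distance to points 1 and 2 is (3.9 + 4.9) / 2 = 4.4.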

Visualizing the result:

(dendrogram figure omitted)
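The script imports dendrogram and pyplot but never calls them; a minimal sketch of how the cluster tree could be drawn, reusing arr from the script above:

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

Z = linkage(np.array(arr), method='average')  # same call as in hierarchy_cluster
dendrogram(Z)                                 # draw the cluster tree
plt.xlabel('data point index')
plt.ylabel('merge distance')
plt.show()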

Source: blog.csdn.net/u012328476/article/details/78978113