Clustering Algorithms in the Watermelon Book (Machine Learning): An Overview

What is the clustering task?

"Unsupervised learning" the most studied, the most widely used learning tasks, in addition, there are an estimated density (density estimation) and anomaly detection (anomaly detection). In unsupervised learning, the training sample tag information is unknown, the goal is to reveal the intrinsic nature of the data and rules on training samples unmarked learning, provide the basis for further data analysis.

Clustering divides the samples of a data set into several (typically disjoint) subsets; each subset is called a "cluster", and each cluster may correspond to some underlying concept (class). These concepts are unknown to the clustering algorithm in advance: the clustering process only forms the cluster structure automatically, and the semantics of each cluster must be grasped and named by the user.


What is clustering used for, and how?

Clustering can serve as a standalone process, used to find the internal distribution structure of the data; it can also serve as a precursor to other learning tasks: based on the clustering result, each cluster is defined as a class, and a classification model is then trained on these classes.


How is the quality of a clustering result assessed?

That is, what performance measures should be used?

A good clustering result groups similar samples into the same cluster and keeps samples from different clusters as dissimilar as possible. In other words, a good clustering has high "intra-cluster similarity" and low "inter-cluster similarity".

        Clustering performance measures are also called clustering "validity indices". Generally there are two categories: one compares the clustering result against a "reference model", yielding "external indices"; the other inspects the clustering result directly without using any reference model, yielding "internal indices".

Commonly used external indices (a computational sketch follows the list):

    Jaccard Coefficient (JC)

    Fowlkes and Mallows Index (FMI)

    Rand Index (RI)
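
Here is a minimal sketch of the three external indices, computed from their pairwise definitions: for every pair of samples, count whether the clustering result and the reference model agree on putting the pair together. The array names labels_true (reference model) and labels_pred (clustering result) are illustrative.

```python
import itertools
import math

def external_indices(labels_true, labels_pred):
    """Compute JC, FMI and RI from pairwise agreement counts."""
    m = len(labels_true)
    a = b = c = d = 0
    for i, j in itertools.combinations(range(m), 2):
        same_pred = labels_pred[i] == labels_pred[j]
        same_true = labels_true[i] == labels_true[j]
        if same_pred and same_true:
            a += 1  # same cluster in both partitions
        elif same_pred:
            b += 1  # together in the clustering, apart in the reference
        elif same_true:
            c += 1  # apart in the clustering, together in the reference
        else:
            d += 1  # apart in both partitions
    jc = a / (a + b + c)                            # Jaccard Coefficient
    fmi = math.sqrt((a / (a + b)) * (a / (a + c)))  # Fowlkes and Mallows Index
    ri = 2 * (a + d) / (m * (m - 1))                # Rand Index
    return jc, fmi, ri

# Toy example: two partitions of five samples.
print(external_indices([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))
```

All three indices take values in [0, 1], and larger values indicate better agreement with the reference model.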

Commonly used internal indices (a sketch of the DB Index follows the list):

    Davies-Bouldin Index (DBI)

    Dunn Index (DI)
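
Here is a minimal sketch of the Davies-Bouldin Index under Euclidean distance, following the textbook definition DBI = (1/k) Σ_i max_{j≠i} (avg(C_i) + avg(C_j)) / d(μ_i, μ_j), where avg(C) is the average pairwise distance within cluster C and μ its centroid; smaller values indicate a better clustering. The function and argument names are illustrative. (Note that scikit-learn's davies_bouldin_score uses the average distance to the centroid rather than the average pairwise distance, so its values differ slightly.)

```python
import numpy as np
from scipy.spatial.distance import euclidean, pdist

def davies_bouldin(X, labels):
    """DB Index of a clustering (X: (n, d) array, labels: (n,) array)."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    mus = [c.mean(axis=0) for c in clusters]  # cluster centroids
    # avg(C): mean pairwise distance inside each cluster (0 for singletons)
    avgs = [pdist(c).mean() if len(c) > 1 else 0.0 for c in clusters]
    k = len(clusters)
    return sum(
        max((avgs[i] + avgs[j]) / euclidean(mus[i], mus[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k
```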

When evaluating a clustering result with internal indices, a measure of the distance between samples is needed.

Commonly used distance measures:

    Minkowski distance: applicable to continuous attributes and ordinal attributes

    VDM (Value Difference Metric): applicable to non-ordinal (unordered) attributes

    Note: attributes are divided into continuous attributes and discrete attributes, and discrete attributes are further divided into ordinal attributes and non-ordinal attributes. A minimal sketch of both distance measures follows.
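
The sketch below assumes NumPy arrays. For the VDM, m_{u,a,i} is the number of samples that take value a on attribute u and fall into cluster i; here these counts are estimated from one attribute column and a given cluster assignment. All names are illustrative.

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance: (sum_u |x_u - y_u|^p)^(1/p).
    p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1 / p))

def vdm(column, clusters, a, b, p=2):
    """VDM_p(a, b) = sum_i |m_uai / m_ua - m_ubi / m_ub|^p for one
    non-ordinal attribute: values a and b count as close when they
    are distributed similarly across the clusters."""
    column, clusters = np.asarray(column), np.asarray(clusters)
    m_a, m_b = (column == a).sum(), (column == b).sum()
    total = 0.0
    for i in np.unique(clusters):
        m_ai = ((column == a) & (clusters == i)).sum()
        m_bi = ((column == b) & (clusters == i)).sum()
        total += abs(m_ai / m_a - m_bi / m_b) ** p
    return total
```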


Categories of clustering algorithms

Clustering algorithms can be roughly divided into three categories: prototype clustering, density clustering, and hierarchical clustering. Their basic ideas are as follows:

Prototype clustering

Also known as "prototype-based clustering", such algorithms assume that the cluster structure can be characterized by a set of prototypes, an assumption that holds in very many real-world clustering tasks. Typically, the algorithm first initializes the prototypes and then iteratively updates and solves for them; different prototype representations and different solving procedures produce different algorithms. Common prototype clustering algorithms include the k-means algorithm, Learning Vector Quantization (LVQ), and Gaussian mixture clustering (Mixture of Gaussians).
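
To make the "initialize prototypes, then iteratively update them" loop concrete, here is a minimal k-means sketch in NumPy. The prototypes are cluster mean vectors; initializing from k random samples and keeping the old prototype for an emptied cluster are simplifications chosen for brevity, not the book's exact procedure.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=k, replace=False)]  # initial prototypes
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest prototype.
        dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each prototype becomes the mean of its cluster.
        new_mus = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else mus[j] for j in range(k)])
        if np.allclose(new_mus, mus):  # prototypes stable: converged
            break
        mus = new_mus
    return labels, mus
```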


Density clustering

Also known as "density-based clustering", such algorithms assume that the cluster structure can be determined by how tightly the samples are distributed. Typically, a density clustering algorithm examines the connectivity between samples from the perspective of sample density and expands clusters from density-connectable samples until the final clustering result is obtained. A common density clustering algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
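
Here is a short usage sketch with scikit-learn's DBSCAN implementation. The neighborhood radius eps and the density threshold min_samples are illustrative values and must be tuned to the data set.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
# Samples labeled -1 are noise points that belong to no cluster.
```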


Hierarchical clustering

Hierarchical clustering attempts to partition the data set at different levels, forming a tree-shaped clustering structure. The partitioning can follow a "bottom-up" agglomerative strategy or a "top-down" divisive strategy.

AGNES (AGglomerative NESting) is a hierarchical clustering algorithm that uses the bottom-up agglomerative strategy. It treats each sample in the data set as an initial cluster; then, at every step, it finds the two closest clusters and merges them, repeating this process until the preset number of clusters is reached.
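
Here is a short sketch of AGNES-style bottom-up clustering via scikit-learn's AgglomerativeClustering; the preset number of clusters and the average linkage are illustrative choices (single and complete linkage correspond to merging by minimum and maximum inter-cluster distance, respectively).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.default_rng(0).normal(size=(100, 2))  # toy data
agnes = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = agnes.fit_predict(X)  # merge closest clusters until 3 remain
```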


References

[1] Zhou Zhihua. Machine Learning. Beijing: Tsinghua University Press, 2016, pp. 197-217.
