What is cluster analysis? Categories of clustering methods

Cluster analysis is the process of grouping a collection of data objects into multiple classes, each made up of similar objects.

Basic concepts

Clustering is a technique for discovering the internal structure of data. It organizes all data instances into groups of similar instances, and these groups are called clusters. Instances in the same cluster are similar to each other, while instances in different clusters differ from each other.

Clustering is commonly called unsupervised learning. It differs from supervised learning in that the class labels or grouping information of the data are not available in advance.

Similarity between data objects is measured by a distance or by a similarity coefficient. Figure 1 shows an example of clustering by the distance between data objects: objects that are close to one another are placed in the same cluster.

Figure 1. Schematic of cluster analysis
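
As a minimal sketch of the two kinds of measure mentioned above, the following Python computes a Euclidean distance and a cosine similarity coefficient. The function names are illustrative choices, and these two measures are only common examples, not the only ones a clustering method may use.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))   # 5.0
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))    # 0.0
```

A small distance or a large similarity coefficient both indicate that two objects are candidates for the same cluster.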

Cluster analysis can be applied to data preprocessing: for multidimensional data with a complex structure, cluster analysis can aggregate the data and so standardize the complex structure.

Cluster analysis can also be used to discover dependencies among data items, so that closely dependent items can be removed or merged. It can also serve as a preprocessing step for other data mining methods (such as association rules and rough set methods).

In business, cluster analysis is an effective tool for market segmentation. It is used to discover distinct customer groups, to characterize each group, to study consumer behavior, and to search for new potential markets.

In biology, cluster analysis is used to classify plants, animals, and genes in order to understand the inherent structure of populations.

In the insurance industry, cluster analysis can identify groups of car insurance policyholders by their average spending, and can group a city's residential properties by housing type, value, and location.

In Internet applications, cluster analysis is used to classify online documents.

In e-commerce, cluster analysis groups customers with similar browsing behavior and analyzes the characteristics they share, which helps e-commerce businesses understand their customers and provide more appropriate services.

Categories of clustering methods

There are many clustering algorithms, and the choice among them depends on the type of data and on the purpose and application of the clustering. Clustering algorithms fall into five main categories: partition-based methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

1. Partition-based clustering methods

Partition-based clustering is a top-down approach: for a given data set D of n data objects, it organizes the objects into k (k ≤ n) partitions, where each partition represents a cluster. Figure 2 is a schematic of the partition-based clustering method.

Figure 2. Schematic of the hierarchical clustering algorithm

The most classic partition-based algorithms are the k-means algorithm and the k-medoids algorithm; many other algorithms are refinements of these two.

The advantage of partition-based clustering is fast convergence. The disadvantage is that the number of clusters k must be reasonably estimated in advance, and that the choice of initial centers and the presence of noise can strongly affect the clustering result.
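
The following is a minimal k-means sketch in plain Python, assuming Euclidean distance and random selection of the initial centers. The function name, the `seed` parameter, and the handling of empty clusters are illustrative choices for this sketch rather than part of any particular library.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means on a list of equal-length tuples.

    Returns (centers, labels). Initial centers are k points sampled
    without replacement, so the result can depend on `seed`.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_labels == labels and new_centers == centers:
            break  # nothing changed: converged
        labels, centers = new_labels, new_centers
    return centers, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(labels)
```

Because k and the initial centers are inputs, the sketch makes the drawback described above visible: a poor k or an unlucky initialization changes the partition that comes out.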

2. Hierarchical clustering methods

Hierarchical clustering decomposes the given data set level by level until some condition is met. Depending on the order of decomposition, the algorithm is either bottom-up or top-down, that is, an agglomerative hierarchical clustering algorithm or a divisive hierarchical clustering algorithm.

1) Bottom-up method.

At the start, each data object is its own cluster. The distances between data objects are computed, and the closest objects are merged into the same cluster. Then the distances between clusters are computed, and the two nearest clusters are merged into a larger cluster. Merging continues until everything has been combined into a single cluster or a termination condition is reached.

Methods for computing the distance between two clusters include the shortest distance method, the intermediate distance method, and the group average method. The shortest distance method defines the distance between two clusters as the shortest distance between a data object in one cluster and a data object in the other. The representative bottom-up algorithm is AGNES (AGglomerative NESting).
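
The bottom-up procedure with the shortest distance method can be sketched as follows. This is a simplified illustration in the spirit of AGNES, not a reference implementation; the function names and the `target_clusters` stopping condition are assumptions made for the example.

```python
import math

def single_linkage(cluster_a, cluster_b):
    """Shortest distance between any point of one cluster and any point of the other."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerative(points, target_clusters):
    """Merge the two closest clusters until `target_clusters` remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > target_clusters:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for cluster in agglomerative(points, target_clusters=3):
    print(cluster)
```

The nested search over all cluster pairs at every merge is what makes the computational cost of this family of methods high on large data sets.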

2) Top-down method.

At the start, all data objects belong to a single cluster. The cluster is then split step by step into smaller clusters until each data object is in its own cluster or a termination condition is reached. The representative top-down algorithm is DIANA (DIvisive ANAlysis).

The main advantages of hierarchical clustering are that distances and similarity rules are easy to define, there are few restrictions, the number of clusters does not have to be fixed in advance, and hierarchical relationships among clusters can be discovered. The main disadvantages are that the computational complexity is high, outliers can have a large impact, and the algorithm may produce chain-shaped clusters.

3. Density-based clustering methods

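The main goal of density-based clustering is to find high-density regions that are separated by low-density regions. Unlike distance-based clustering algorithms, whose results are spherical clusters, density-based algorithms can discover clusters of arbitrary shape.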

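Density-based clustering starts from the density of the regions in which the data objects are distributed. If, within a given neighborhood, the density of the data objects in a cluster exceeds a certain threshold, clustering continues.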

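By connecting regions of higher density, this approach can form clusters of different shapes, eliminate the effect of isolated points and noise on clustering quality, and discover clusters of arbitrary shape, as shown in Figure 3.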

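The most representative density-based algorithms are DBSCAN, OPTICS, and DENCLUE.

Figure 2 is a schematic of the hierarchical clustering algorithms: the upper part shows the steps of the AGNES algorithm, and the lower part shows the steps of the DIANA algorithm. Neither approach is inherently better than the other; in practice, the characteristics of the data and the desired number of clusters determine whether bottom-up or top-down is faster.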

Figure 3. Schematic of the density-based clustering algorithm
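
To make the density threshold idea concrete, here is a compact sketch in the style of DBSCAN. It is a simplified, quadratic-time illustration rather than a reference implementation; the parameter names `eps` (neighborhood radius) and `min_pts` (density threshold) follow common usage, and the `NOISE` marker is an arbitrary choice for this example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    A point is a core point if at least `min_pts` points (itself
    included) lie within distance `eps` of it.
    """
    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None = not yet visited
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point at (50, 50) never reaches the density threshold and is left as noise, which is how this family of methods keeps outliers from distorting the clusters.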

4. Grid-based clustering methods

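Grid-based clustering quantizes the space into a finite number of cells, forming a grid structure on which all clustering is performed. The basic idea is to split the possible values of each attribute into many adjacent intervals and so create a set of grid cells. Each object falls into one grid cell, namely the cell whose region of attribute space contains the object's values, as shown in Figure 4.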

Figure 4. Schematic of the grid-based clustering algorithm

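The main advantage of grid-based clustering is its processing speed: the processing time is independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space. The drawback is that these algorithms can only find clusters whose boundaries are horizontal or vertical and cannot detect oblique boundaries. In addition, when the data is high-dimensional, the number of grid cells grows exponentially with the number of attribute dimensions.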
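
The quantization idea can be sketched for two-dimensional data as follows. This is a generic illustration, not one of the named grid-based algorithms; the function name, the fixed `cell_size`, and the `min_count` density threshold for a cell are all assumptions made for the example.

```python
import math
from collections import defaultdict

def grid_cluster(points, cell_size, min_count):
    """Cluster 2-D points by joining adjacent dense grid cells.

    Returns a list of clusters, each a list of points. Points in cells
    holding fewer than `min_count` points are left out as sparse.
    """
    # Quantize: map every point to the integer index of its cell.
    cells = defaultdict(list)
    for x, y in points:
        cells[(math.floor(x / cell_size), math.floor(y / cell_size))].append((x, y))

    dense = {c for c, members in cells.items() if len(members) >= min_count}

    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        # Flood fill over dense cells that share an edge.
        seen.add(start)
        stack, members = [start], []
        while stack:
            cx, cy = stack.pop()
            members.extend(cells[(cx, cy)])
            for nxt in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nxt in dense and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(members)
    return clusters

points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.3, 0.4),
          (5.1, 5.1), (5.2, 5.3), (9.5, 0.5)]
for cluster in grid_cluster(points, cell_size=1.0, min_count=2):
    print(cluster)
```

After the single pass that assigns points to cells, all remaining work is done on cells only, and clusters are unions of axis-aligned cells, which matches both the speed advantage and the boundary limitation described above.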

5. Model-based clustering methods

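Model-based clustering tries to optimize the fit between the given data and some mathematical model. It assumes a model for each cluster and then looks for the best fit of the data to that model. The assumed model may be a density function, or some other function, that describes how the data objects are distributed in space. The underlying principle is the assumption that the target data set is generated by a set of underlying probability distributions.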

Figure 5 compares a partition-based clustering method with a model-based clustering method. The left side shows the result of the distance-based method, whose core principle is to group together points that are close to one another. The right side shows the result of the model-based method that uses probability distributions; the distribution model used here is an ellipse with a certain curvature.

Two points in Figure 5 are drawn as solid dots. These two points are very close together, so the distance-based method places them in the same cluster, whereas the model-based method assigns them to different clusters in order to fit the specific probability distribution model.

Figure 5. Schematic comparison of clustering methods

In model-based clustering, the number of clusters is determined automatically from standard statistics, and noise or outliers are likewise analyzed statistically. Model-based clustering attempts to optimize the fit between the given data and some mathematical model.
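
As one possible illustration of fitting data to a probability model, the sketch below runs expectation-maximization for a one-dimensional mixture of two Gaussians. The choice of a Gaussian mixture, the fixed number of two components, the initialization from the minimum and maximum of the data, and all function names are assumptions made for this example; the sketch does not select the number of clusters automatically.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, iterations=50):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (weights, means, variances).
    """
    means = [min(data), max(data)]
    spread = (max(data) - min(data)) or 1.0
    variances = [spread ** 2 / 4, spread ** 2 / 4]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            scores = [w * gaussian_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M-step: re-estimate the parameters from the weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances

def assign(x, weights, means, variances):
    """Index of the component most likely to have generated x."""
    scores = [w * gaussian_pdf(x, m, v)
              for w, m, v in zip(weights, means, variances)]
    return scores.index(max(scores))

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_gmm_1d(data)
print(means)
print(assign(1.05, weights, means, variances), assign(4.9, weights, means, variances))
```

Here a point is assigned to the component under which it is most probable rather than to the nearest center, which is the distinction between distance-based and model-based assignment drawn in the comparison above.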

Origin blog.csdn.net/dsdaasaaa/article/details/94590153