Machine Learning - hierarchical clustering algorithm

Hierarchical clustering methods (algorithms we use fairly rarely) perform a level-by-level decomposition or merging of a given data set until certain conditions are satisfied.
At present, traditional hierarchical clustering algorithms are divided into two categories:
  ● Agglomerative hierarchical clustering: the AGNES algorithm (AGglomerative NESting) ==> uses a bottom-up strategy.
Initially every object is treated as its own cluster; the clusters are then merged step by step according to some criterion, where the distance
between two clusters can be determined by the closest pair of data points belonging to the two different clusters. The merging process is repeated
until the desired number of clusters is reached. The agglomerative approach is used more often (a minimal sketch follows this list).
  ● Divisive hierarchical clustering: the DIANA algorithm (DIvisive ANAlysis) ==> uses a top-down strategy. First
all objects are placed in a single cluster, which is then gradually split into smaller and smaller clusters according to some established rule (such as maximum
Euclidean distance), until a termination condition is reached (the target number of clusters or a cluster-distance threshold).
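
As a rough illustration of the bottom-up AGNES process described above, here is a minimal sketch (not taken from the original post) that starts with every point as its own cluster and repeatedly merges the two closest clusters, using single linkage, i.e. the closest-pair distance mentioned above, until a target number of clusters remains. The function name agnes, the toy data, and the target cluster count are assumptions made for this example.

```python
# Minimal AGNES-style sketch: bottom-up merging with single linkage.
import numpy as np

def agnes(points, n_clusters):
    points = np.asarray(points, dtype=float)
    # start with every object as its own cluster (lists of point indices)
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(a, b):
        # single linkage: distance between the closest pair of points
        # drawn from the two different clusters
        return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # find the two closest clusters and merge them
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

if __name__ == "__main__":
    data = [[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9], [5.2, 5.1]]
    print(agnes(data, n_clusters=2))   # -> [[0, 1], [2, 3, 4]]
```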

Advantages and disadvantages:

Simple and easy to understand.
Choosing merge / split points is not easy.
Merge / split operations cannot be undone.
Not suitable for large data sets (the amount of data does not fit in memory).
Low efficiency: O(t * n²), where t is the number of iterations and n is the number of sample points.

Inter-cluster distance: hierarchical clustering does more than simply merge clusters; how the distance between two clusters is measured determines which clusters get merged.

 Hierarchical clustering merge (linkage) strategies: ward minimizes the variance within the merged cluster, complete uses the maximum distance between clusters, and average uses the average distance (single, the minimum distance, is the remaining common option).
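
Assuming the linkage names above refer to scikit-learn's AgglomerativeClustering (whose linkage parameter accepts exactly these values), a short sketch of comparing the merge strategies could look like the following; the toy data and n_clusters=2 are made up for the example.

```python
# Compare the linkage (merge) strategies on a toy data set.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.3, 0.2], [0.1, 0.4],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])

for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X)
    print(f"{linkage:>8}: {labels}")
```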

For data without outliers, the maximum distance (complete linkage) gives better results.

When the data contain outliers, clustering with the minimum distance (single linkage) works better.

For outlier detection on non-convex data, hierarchical clustering is not appropriate.

 

 Hierarchical clustering optimization algorithm (to be honest, it is rarely used):

Suppose we have four data points x1, x2, x3, x4 and a given threshold of 5. Take a random sample, say x1, place it as a left subtree, and describe it by three groups of numbers: the first group is the number of samples (1 sample), the second group is x1's coordinates, and the third group is the squares of its horizontal and vertical coordinates.

Then take another sample at random, say x2, and compute its distance to x1. If the distance is less than the threshold 5, the two points go into the same cluster: the first group becomes 2 samples, the horizontal and vertical coordinates are added into the second group respectively, and the squares of the coordinates are added into the third group.

Purpose: this makes averaging convenient, since dividing the second group by the first group directly gives the cluster center; the third group is used when computing statistics such as the distance from the next sample point to the cluster.
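
A small sketch of the bookkeeping this paragraph describes (it closely resembles the clustering feature used in BIRCH): each cluster keeps three groups of numbers, i.e. the sample count, the coordinate-wise sums, and the sums of squared coordinates, so the center is just the second group divided by the first, and a new point is assigned by comparing its distance to that center against the threshold. The class and function names below are my own; the threshold of 5 and the labels x1 ... x4 follow the text, and in this sketch the third group is only used for the radius statistic.

```python
# Incremental cluster summaries: (count, coordinate sums, squared-coordinate sums).
import numpy as np

class ClusterSummary:
    def __init__(self, point):
        point = np.asarray(point, dtype=float)
        self.n = 1               # first group: number of samples
        self.ls = point.copy()   # second group: sum of coordinates
        self.ss = point ** 2     # third group: sum of squared coordinates

    @property
    def centroid(self):
        # dividing the second group by the first gives the cluster center
        return self.ls / self.n

    @property
    def radius(self):
        # the third group lets per-cluster statistics (here, the root mean squared
        # distance of members from the center) be updated without revisiting points
        return float(np.sqrt(max(self.ss.sum() / self.n - (self.centroid ** 2).sum(), 0.0)))

    def add(self, point):
        point = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += point
        self.ss += point ** 2

def assign(points, threshold=5.0):
    clusters = []
    for p in points:
        p = np.asarray(p, dtype=float)
        # nearest existing cluster center, if any
        nearest = min(clusters, key=lambda c: np.linalg.norm(c.centroid - p), default=None)
        if nearest is not None and np.linalg.norm(nearest.centroid - p) < threshold:
            nearest.add(p)       # update the three groups in place
        else:
            clusters.append(ClusterSummary(p))
    return clusters

if __name__ == "__main__":
    x1, x2, x3, x4 = [0, 0], [1, 2], [10, 10], [11, 9]
    for c in assign([x1, x2, x3, x4], threshold=5.0):
        print(c.n, c.centroid, c.radius)
```

With these toy points, x2 falls within the threshold of x1's cluster center and x4 within that of x3's, so two summaries of two samples each come out, without ever storing the individual points again.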
