Cluster analysis

Excerpted from "Introduction to Data Mining"

Cluster analysis groups data objects based only on the information found in the data that describes the objects and their relationships. The goal is that objects within a group are similar (related) to one another, while objects in different groups are different (unrelated). The greater the similarity (homogeneity) within a group and the greater the difference between groups, the better the clustering.

There are several different types of clusters, and they also form the basis of the clustering algorithms. The main focus here is clustering according to distance and density.

Common clustering algorithms include K-means, agglomerative hierarchical clustering, and DBSCAN.

K-means first assumes that there are K clusters in the data set. At the beginning, K centroids are placed at random and each data point is assigned to the centroid closest to it. Each centroid is then moved to the mean of the points assigned to it, and the assignment and update steps are repeated until the centroids no longer change, which drives the error (the sum of squared distances from the points to their centroids) to a minimum.
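The assign-and-update loop described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the book's listing: the function names `kmeans` and `sse`, the `seed` and `max_iter` parameters, the choice of initial centroids from the data points, and the sample points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from K distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster became empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Made-up sample data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels, round(sse(points, centroids, labels), 3))
```

Which group receives label 0 depends on the random start, so only the grouping itself is meaningful.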





Since the initial centroids of K-means are placed at random, the result varies from run to run and may be a local rather than the global optimum, so it is necessary to run the algorithm several times and keep the solution with the lowest error. There are other ways to reduce the effect of the random initialization (such as bisecting K-means).
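The "try several times and keep the best" idea might look like the sketch below. It repeats a compact K-means run from different random starts and keeps the run with the lowest sum of squared errors. The helper names `kmeans_once` and `best_of_restarts`, the default of 10 runs, and the sample points are assumptions for illustration only.

```python
import math
import random


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from random initial centroids; returns (sse, centroids)."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Group points by their nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    error = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return error, centroids


def best_of_restarts(points, k, runs=10, seed=0):
    """Repeat K-means with different random starts and keep the lowest-SSE run."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(runs)), key=lambda r: r[0])


# Made-up sample data: three pairs of points; a poor start can merge two pairs.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
best_error, best_centroids = best_of_restarts(points, k=3)
print(best_error, best_centroids)
```

Restarting does not guarantee the global optimum; it only makes a poor local optimum less likely to be the one that is kept.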

The advantage of K-means is that it requires little computation and is simple to understand. The disadvantage is that it clusters poorly when the clusters are non-spherical or have different densities.

Agglomerative Hierarchical Clustering

Hierarchical clustering can be divided into two approaches, agglomerative (merging) and divisive (splitting), which are inverse processes of each other.

Agglomerative clustering starts by treating each individual point as its own cluster. Each step then merges the two closest clusters into one, until there is only one cluster left in the set.
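The merge loop can be sketched as follows. The text does not say how the distance between two clusters is measured, so this example assumes single link (the distance between the closest pair of points, one from each cluster). The names `single_link` and `agglomerative` and the sample points are likewise assumptions. The search is a naive one over all cluster pairs, written for clarity rather than speed.

```python
import math


def single_link(a, b):
    """Cluster distance under single link: the closest pair across the two clusters."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerative(points, linkage=single_link):
    """Merge the two closest clusters until one remains; returns the merge history."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        distance = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        history.append((clusters[i], clusters[j], distance))
        # Replace the two merged clusters with their union (j > i, so pop j first).
        clusters.pop(j)
        clusters.pop(i)
        clusters.append(merged)
    return history


# Made-up sample data.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.5), (10.0, 0.0)]
history = agglomerative(points)
for left, right, distance in history:
    print(left, "+", right, "at distance", round(distance, 3))
```

Cutting the merge history at a chosen distance yields a flat clustering. Passing a different `linkage` function changes which clusters count as closest.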





DBSCAN

DBSCAN is a density-based clustering algorithm.

DBSCAN first fixes a neighborhood radius and then classifies each point as a core point, a border point, or a noise point, according to how densely other points surround it within that radius.
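A small sketch of this classification step follows, using the usual DBSCAN definitions, which the text above does not spell out. A core point has at least `min_pts` points (itself included) within radius `eps`. A border point is not a core point but lies within `eps` of one. Everything else is noise. The parameter names `eps` and `min_pts`, the function name `classify_points`, and the sample points are assumptions for illustration.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' for a given radius and threshold."""
    # Neighborhood of a point: indices of all points within eps (the point itself included).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nbrs in enumerate(neighbors) if len(nbrs) >= min_pts}
    labels = []
    for i, nbrs in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            labels.append("border")  # not dense itself, but within eps of a core point
        else:
            labels.append("noise")
    return labels


# Made-up sample data: a small square, one point hanging off it, and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (6.0, 6.0)]
kinds = classify_points(points, eps=1.0, min_pts=3)
print(kinds)  # → ['core', 'core', 'core', 'core', 'border', 'noise']
```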




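The algorithm of DBSCAN can be described as follows: discard the noise points, connect all core points that lie within the radius of each other, make each group of connected core points into a separate cluster, and assign each border point to the cluster of one of its associated core points.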
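One way to sketch the whole procedure in plain Python is shown below. It grows each cluster outward from an unvisited core point, which gives the same grouping of core points as connecting every pair of core points within the radius. The names `dbscan`, `eps`, and `min_pts`, the use of `-1` for noise, and the sample points are assumptions for this example. The neighbor search is the naive quadratic one, written for readability.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nbrs) >= min_pts for nbrs in neighbors]
    labels = [-1] * n  # everything starts as noise
    cluster_id = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Grow a new cluster outward from an unvisited core point.
        labels[start] = cluster_id
        frontier = [start]
        while frontier:
            i = frontier.pop()
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    if is_core[j]:
                        frontier.append(j)  # only core points extend the cluster
        cluster_id += 1
    return labels


# Made-up sample data: two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0),
    (5.0, 4.0),
]
cluster_ids = dbscan(points, eps=1.0, min_pts=3)
print(cluster_ids)  # → [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

In this sketch, a border point that lies within the radius of core points from two different clusters goes to whichever cluster reaches it first.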




DBSCAN is an effective density-based clustering algorithm that can find clusters of any shape and size. The disadvantages are that the radius needs to be estimated, and that it has difficulty when cluster densities vary widely or when the data is high-dimensional.

