How does clustering work?

Clustering is an unsupervised learning method that partitions the samples in a data set into several (usually disjoint) subsets, each of which is called a "cluster". The goal is to group similar samples into the same cluster and separate dissimilar samples into different clusters. Similarity in cluster analysis is usually defined by a distance measure, such as Euclidean distance, cosine distance, or Manhattan distance.

The following are the basic principles of some common clustering algorithms:

  1. K-Means algorithm: K-Means is one of the most widely used clustering algorithms. It first randomly initializes k cluster centers and then repeats two steps: assign each sample to the cluster whose center is nearest, then update each cluster center to the mean of the samples assigned to it. These two steps are iterated until the cluster centers stop changing, or change very little (a minimal sketch is given after this list).

  2. DBSCAN algorithm: DBSCAN is a density-based clustering algorithm. For a given radius ε, the ε-neighborhood of a sample is the set of all samples whose distance from it is at most ε, and a sample is a core object if its ε-neighborhood contains at least MinPts samples. DBSCAN starts from an unvisited core object and forms a cluster from that object and all samples in its ε-neighborhood; it then examines the core objects inside the neighborhood and adds the samples in their ε-neighborhoods as well, expanding the cluster until no new samples can be added. It then starts again from another unvisited core object, and this process continues until all samples have been visited; samples that end up in no cluster are treated as noise (a usage example is given after this list).

  3. Hierarchical clustering algorithm: Hierarchical clustering produces clusterings at multiple levels, forming a tree-like structure (a dendrogram). There are two approaches: agglomerative (bottom-up) and divisive (top-down) hierarchical clustering. Agglomerative hierarchical clustering initially treats each sample as its own cluster and then repeatedly merges the two closest clusters until all samples are merged into one cluster. Divisive hierarchical clustering starts with all samples in a single cluster and recursively splits clusters until each sample forms its own cluster, or a stopping criterion is reached (an agglomerative example is given after this list).
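To make the two alternating steps of K-Means concrete, here is a minimal NumPy sketch; the function name, the random initialization, and the stopping rule are illustrative choices rather than a reference implementation, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: alternate assignment and center-update steps."""
    rng = np.random.default_rng(seed)
    # Randomly pick k samples as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each sample to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of the samples assigned to it
        # (assumes no cluster becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```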
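For DBSCAN, the usual practice is to call an existing implementation and tune ε and MinPts; the scikit-learn example below uses a two-moons toy data set, and the parameter values are only illustrative:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-convex shape that density-based
# clustering handles well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the ε-neighborhood radius, min_samples is MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(db.labels_)  # cluster index per sample; -1 marks noise points
```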
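And a short agglomerative hierarchical clustering example, also with scikit-learn; the linkage choice ("ward" here) decides how the distance between two clusters is measured and is only one of several options:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up clustering: each sample starts as its own cluster, and the two
# closest clusters are merged repeatedly until n_clusters clusters remain
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(agg.labels_)
```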

When performing cluster analysis, an appropriate distance measure and clustering algorithm must be chosen in order to obtain good results. In addition, the quality of a clustering can be evaluated with metrics such as the silhouette coefficient and the Davies-Bouldin index.

The silhouette coefficient compares, for each sample, the average distance to the other samples in its own cluster (cohesion) with the average distance to the samples in the nearest other cluster (separation). A good clustering has high similarity within clusters and low similarity between clusters, which corresponds to a silhouette coefficient close to 1.

The Davies-Bouldin index is based on the ratio of the average within-cluster distance to the distance between clusters. A smaller Davies-Bouldin index indicates a better clustering, because it means samples within each cluster are close together while the clusters themselves are far apart.
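Both metrics are available in scikit-learn, so one simple way to use them is to cluster with several candidate values of k and compare the scores; the data set and parameter values below are only for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Higher silhouette coefficient is better; lower Davies-Bouldin index is better
for k in (2, 3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"davies_bouldin={davies_bouldin_score(X, labels):.3f}")
```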

Clustering algorithms also have limitations: most of them rely on some form of distance measure, which may not reflect the true structure of the data. In high-dimensional data, for example, pairwise distances tend to become nearly indistinguishable, which makes distance-based methods ineffective; this is the so-called "curse of dimensionality". When performing cluster analysis, appropriate methods and parameters should therefore be chosen and tuned according to the characteristics of the actual problem and data.
