Four ways to determine the optimal number of clusters in K-means (K-means++) - with code

Table of contents

Summary

1. K-means algorithm

2. Calinski-Harabasz Criterion (CH value)

3. Davies-Bouldin Criterion (DB value)

4. Gap Value

5. Silhouette Coefficient

6. MATLAB-based K-means clustering and optimal cluster number selection results

7. MATLAB code implementation for this article


Summary:

In the K-means algorithm, the value K determines how many clusters the algorithm produces. K-means is sensitive to initialization: for the same K, different initial centers can change both the clustering result and the number of iterations. This article determines the optimal number of clusters for K-means by computing four indicators on the original data — the CH value, the DB value, the Gap value, and the silhouette coefficient — then clusters the data with K-means and visualizes the result.

The code in this article has been standardized, so you can use it simply by importing your own data.

With this code you can determine the optimal number of clusters, which makes it suitable for applications such as mathematical modeling.

1. K-means algorithm

The k-means clustering algorithm is an iterative cluster-analysis algorithm. Its steps are: decide in advance to divide the data into K groups, randomly select K objects as the initial cluster centers, then compute the distance between every object and each cluster center and assign each object to the nearest center. A cluster center together with the objects assigned to it forms a cluster. Each time a sample is assigned, the center of the affected cluster is recomputed from the objects currently in it. This process repeats until a termination condition is met: no (or a minimal number of) objects are reassigned to different clusters, no (or a minimal number of) cluster centers change, or the sum of squared errors reaches a local minimum.

Calculation steps of K-means:
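The article's implementation is in MATLAB (linked in section 7). Purely as an illustrative sketch of the steps above — not the author's code — the loop can be written in plain Python (function and variable names are my own):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means: random initial centers, then alternating
    assignment and center-update steps until convergence."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 1: pick K initial centers
    for _ in range(iters):
        # step 2: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # step 3: recompute each center as the mean of its cluster
        new_centers = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:               # step 4: stop when no center moves
            break
        centers = new_centers
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
              for p in points]
    return centers, labels
```

For example, on six points forming two well-separated groups, `kmeans(points, 2)` assigns each group its own label.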

2. Calinski-Harabasz Criterion (CH value)

The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). The Calinski-Harabasz index is defined as

VRC_k = (SSB / (k − 1)) / (SSW / (N − k))

where SSB is the overall between-cluster variance, SSW is the overall within-cluster variance, k is the number of clusters, and N is the number of observations.

Well-defined clusters have large between-cluster variance (SSB) and small within-cluster variance (SSW). The larger the ratio VRC_k, the better the data partition. To determine the optimal number of clusters, maximize VRC_k with respect to k: the optimal number of clusters corresponds to the solution with the highest Calinski-Harabasz index value.

The Calinski-Harabasz criterion is best suited to k-means clustering solutions that use squared Euclidean distance.
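As a sketch of the formula above (in Python rather than the article's MATLAB; names are my own), the CH value can be computed for a given labeling like this:

```python
import math

def calinski_harabasz(points, labels):
    """CH index: (SSB / (k-1)) / (SSW / (N-k)); larger is better."""
    n = len(points)
    ks = sorted(set(labels))
    k = len(ks)
    overall = tuple(sum(x) / n for x in zip(*points))   # global centroid
    ssb = ssw = 0.0
    for c in ks:
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = tuple(sum(x) / len(members) for x in zip(*members))
        # between-cluster term: cluster size times squared centroid offset
        ssb += len(members) * math.dist(centroid, overall) ** 2
        # within-cluster term: squared distances to the cluster centroid
        ssw += sum(math.dist(p, centroid) ** 2 for p in members)
    return (ssb / (k - 1)) / (ssw / (n - k))
```

A tight, well-separated labeling scores much higher than an arbitrary one, which is exactly the "maximize VRC_k" rule.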

3. Davies-Bouldin Criterion (DB value)

The Davies-Bouldin criterion is based on the ratio of within-cluster distances to between-cluster distances. The Davies-Bouldin index is defined as:

DB = (1/k) Σ_{i=1..k} max_{j≠i} D_{i,j}

where D_{i,j} is the within-to-between cluster distance ratio for the i-th and j-th clusters, i.e.:

D_{i,j} = (d̄_i + d̄_j) / d_{i,j}

Here d̄_i is the average distance between each point in the i-th cluster and the centroid of the i-th cluster, d̄_j is the average distance between each point in the j-th cluster and the centroid of the j-th cluster, and d_{i,j} is the Euclidean distance between the centroids of the i-th and j-th clusters. The maximum value of D_{i,j} represents the worst within-to-between ratio for cluster i. The optimal clustering solution has the smallest Davies-Bouldin index value.
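Sketching the same definition in Python (again, not the article's MATLAB code; names are my own):

```python
import math

def davies_bouldin(points, labels):
    """DB index: mean over clusters of max_j (d_i + d_j) / d_ij; smaller is better."""
    ks = sorted(set(labels))
    centroids, spreads = [], []
    for c in ks:
        members = [p for p, l in zip(points, labels) if l == c]
        cent = tuple(sum(x) / len(members) for x in zip(*members))
        centroids.append(cent)
        # d_i: average distance of the cluster's points to its centroid
        spreads.append(sum(math.dist(p, cent) for p in members) / len(members))
    worst = []
    for i in range(len(ks)):
        # worst (largest) within-to-between ratio for cluster i
        worst.append(max(
            (spreads[i] + spreads[j]) / math.dist(centroids[i], centroids[j])
            for j in range(len(ks)) if j != i
        ))
    return sum(worst) / len(worst)
```

Consistent with the criterion, a compact well-separated labeling yields a smaller DB value than a mixed one.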

4. Gap Value

A common graphical approach to cluster evaluation is to plot an error measure against several candidate numbers of clusters and look for the "elbow" of the curve — the point where the error measure drops most sharply. The gap criterion formalizes this by estimating the location of the elbow as the number of clusters with the largest gap value. Under the gap criterion, the optimal number of clusters therefore corresponds to the solution with the largest local or global gap value within a tolerance range. The gap value is defined as:

Gap_n(k) = E*_n{log(W_k)} − log(W_k)

where n is the sample size, k is the number of clusters being evaluated, W_k is a pooled measure of within-cluster dispersion, and E*_n denotes the expectation under a reference null distribution.
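A rough Python sketch of the gap statistic (not the article's MATLAB code): the reference expectation is estimated by Monte Carlo, clustering uniform samples drawn from the data's bounding box. The helper `kmeans_labels` is my own minimal K-means.

```python
import math
import random

def kmeans_labels(points, k, rng, iters=50):
    """Bare-bones K-means returning only the final labels."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: math.dist(p, centers[c]))].append(p)
        centers = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]

def log_wk(points, labels):
    """log of the pooled within-cluster sum of squared distances to centroids."""
    w = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        cent = tuple(sum(x) / len(members) for x in zip(*members))
        w += sum(math.dist(p, cent) ** 2 for p in members)
    return math.log(w)

def gap(points, k, b=10, seed=0):
    """Gap(k): mean log(W_k) over b uniform reference sets minus log(W_k) of the data."""
    rng = random.Random(seed)
    data_logw = log_wk(points, kmeans_labels(points, k, rng))
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    ref_logw = []
    for _ in range(b):
        # reference sample: uniform points over the data's bounding box
        ref = [tuple(rng.uniform(lo[d], hi[d]) for d in dims) for _ in points]
        ref_logw.append(log_wk(ref, kmeans_labels(ref, k, rng)))
    return sum(ref_logw) / b - data_logw
```

On data with two well-separated groups, the gap value at k = 2 exceeds the value at k = 1, matching the "largest gap" selection rule.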

5. Silhouette Coefficient

The silhouette coefficient of each point measures how similar that point is to the other points in its own cluster, compared with points in other clusters. The silhouette coefficient s_i of the i-th point is defined as:

s_i = (b_i − a_i) / max(a_i, b_i)

where a_i is the average distance from the i-th point to the other points in the same cluster, and b_i is the minimum, taken over the other clusters, of the average distance from the i-th point to the points in that cluster. If the i-th point is the only point in its cluster, its silhouette coefficient s_i is set to 1. Silhouette coefficients range from −1 to 1. A high silhouette coefficient indicates that the point matches its own cluster well and the other clusters poorly. If most points have high silhouette coefficients, the clustering scheme is suitable; if many points have low or negative silhouette coefficients, the clustering scheme may have too many or too few clusters. The silhouette coefficient can be used as a clustering evaluation criterion with any distance metric.
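The definition above can be sketched directly in Python (names are my own; this is not the article's MATLAB implementation):

```python
import math

def silhouette(points, labels):
    """Mean silhouette value, with s_i = (b_i - a_i) / max(a_i, b_i)."""
    ks = sorted(set(labels))
    idx = {c: [i for i, l in enumerate(labels) if l == c] for c in ks}
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        own = [j for j in idx[l] if j != i]
        if not own:                 # singleton cluster: s_i = 1 by convention
            scores.append(1.0)
            continue
        # a_i: average distance to the other points of the same cluster
        a = sum(math.dist(p, points[j]) for j in own) / len(own)
        # b_i: smallest average distance to the points of any other cluster
        b = min(sum(math.dist(p, points[j]) for j in idx[c]) / len(idx[c])
                for c in ks if c != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

A good two-cluster labeling of well-separated data gives a mean silhouette close to 1, while a mixed labeling scores much lower.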

6. MATLAB-based K-means clustering and optimal cluster number selection results:

Evaluation plots for each of the metrics:

Through this program, the most suitable number of K-means clusters can be selected.

Visualization of K-means clustering results:

7. MATLAB code implementation for this article:


Origin blog.csdn.net/widhdbjf/article/details/129119234