Data mining: model selection - K-means

K-means introduction

K-means is an unsupervised clustering algorithm that partitions the samples into K clusters (that is, K categories) according to the distances between sample points. After the partition, the desired result is that the points within each cluster are as close to one another as possible, while the distance between clusters is as large as possible.

Algorithm flow

The algorithm proceeds as follows; a minimal code sketch in NumPy is given after the list.

  1. Randomly select K samples as the initial centroids; each centroid defines one cluster, so there are K different clusters.
  2. Calculate the distance from every sample to each centroid, and assign each sample to the cluster of its nearest centroid.
  3. Recompute each centroid as the mean of all sample points currently in its cluster. (Because the centroid of each cluster has moved, the samples must be reassigned in the next iteration.)
  4. Repeat steps 2-3 until the centroid positions no longer change, then stop iterating; the clustering is complete.
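
The flow above can be sketched in a few lines of NumPy. This is only a minimal illustration (Euclidean distance, no handling of empty clusters), not the implementation of any particular library:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the samples assigned to it
        # (empty clusters are not handled in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```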

As can be seen, because the initial centroids are chosen randomly, the desired result can still be reached through repeated iterations, but the computational cost is very high: in every iteration the distances from all samples to the centroids have to be recalculated. This training cost is what gave rise to the various optimized versions of K-means.

K-means optimization

Initialization optimization: K-Means++

The K-Means++ strategy for initializing the centroids is roughly as follows: the first centroid is chosen uniformly at random from the samples; each subsequent centroid is then chosen from the remaining samples with probability proportional to its squared distance to the nearest centroid already selected, so that new centroids tend to lie far from the existing ones; this is repeated until K centroids have been picked. (It feels like a pre-screening of the initial centroid points before the iterations start.)
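
For illustration, scikit-learn's KMeans uses this initialization by default via its init parameter; the data and parameter values below are invented for the example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data for the example (not from the original post).
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# init="k-means++" is the default; n_init restarts the whole algorithm with
# several different initializations and keeps the best result.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)  # final centroids
print(km.inertia_)          # within-cluster sum of squared distances
```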

Distance calculation optimization: elkan K-Means

The idea of the elkan K-Means algorithm is to exploit the triangle inequality: the sum of two sides of a triangle is greater than or equal to the third side, and the absolute difference of two sides is less than or equal to the third side. This reduces the number of distance calculations.
The elkan K-Means algorithm uses two rules of this kind. For a sample x and two centroids μ1 and μ2: first, if the distance between the two centroids satisfies D(μ1, μ2) ≥ 2·D(x, μ1), then D(x, μ2) ≥ D(x, μ1), so the distance from x to μ2 does not need to be computed at all; second, D(x, μ2) ≥ max(0, D(x, μ1) - D(μ1, μ2)), which gives a lower bound on the distance without computing it. Using these two rules, the iteration speed of the traditional K-Means clustering algorithm can be improved to some extent. However, if the sample features are sparse or contain missing values, some of the distances cannot be computed and this algorithm cannot be used.
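
In scikit-learn this acceleration can be selected through the algorithm parameter of KMeans; the clustering result is the same as with the standard iteration, only fewer distances are computed. A minimal example with invented data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=8, random_state=0)

# algorithm="elkan" enables the triangle-inequality acceleration described above.
km = KMeans(n_clusters=8, algorithm="elkan", n_init=10, random_state=0).fit(X)
print(km.inertia_)
```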

Large-sample optimization: Mini Batch K-Means

Even with the optimized elkan K-Means algorithm, the computational overhead is still very large, especially in the era of big data. The Mini Batch K-Means algorithm therefore came into being: only a subset of the samples is used in each round of computation, and the algorithm is typically run several times with different random subsets to keep the best result, trading a little accuracy for a much shorter running time.
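
For illustration, this variant is available in scikit-learn as MiniBatchKMeans; the dataset and parameters below are invented for the example:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger toy dataset, where a full K-Means pass would be more expensive.
X, _ = make_blobs(n_samples=100_000, centers=5, random_state=0)

# Each update uses only batch_size randomly drawn samples instead of all of them.
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0).fit(X)
print(mbk.inertia_)
```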

Measure of similarity

After clustering, we want the similarity of the samples within a cluster to be large and the similarity between clusters to be small; that is, the differences within a cluster should be small and the differences between clusters large. This similarity / difference is generally measured by the distance from a sample point to the centroid of its cluster.
For a single cluster, the smaller the sum of the distances from all of its sample points to the centroid, the more similar we consider the samples in that cluster to be, and the smaller the within-cluster difference.
Common measures of distance include the Euclidean distance, the Manhattan distance, and the cosine distance; we generally choose the Euclidean distance.
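
For reference, the standard definitions of these distances between a sample x = (x1, ..., xn) and a centroid μ = (μ1, ..., μn) are:

$$
d_{\mathrm{Euclidean}}(x,\mu)=\sqrt{\sum_{i=1}^{n}(x_i-\mu_i)^2},\qquad
d_{\mathrm{Manhattan}}(x,\mu)=\sum_{i=1}^{n}\lvert x_i-\mu_i\rvert,\qquad
\cos\theta=\frac{\sum_{i=1}^{n}x_i\,\mu_i}{\lVert x\rVert\,\lVert\mu\rVert}
$$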
Does K-means have a loss function?
What K-Means seeks is a set of centroids that minimizes the within-cluster sum of squares. Seeing "minimize", the first thing that comes to mind is the optimization of a loss function; however, because K-Means, like a decision tree, does not need to solve for model parameters, K-Means and decision trees have no loss function, and the within-cluster sum of squares is better treated as a measure of the clustering result.
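
For reference, the within-cluster sum of squares that K-Means minimizes (reported by scikit-learn as inertia_) can be written, for clusters C_1, ..., C_K with centroids μ_1, ..., μ_K, as:

$$
\mathrm{WCSS}=\sum_{j=1}^{K}\sum_{x\in C_j}\lVert x-\mu_j\rVert^{2}
$$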

Evaluation index of clustering algorithm

Because the ultimate goal of clustering is small within-cluster differences and large between-cluster differences, we can measure the effect of clustering by measuring the differences within clusters.
Previously, we used the within-cluster sum of squares for this: the smaller it is, the better. However, this indicator has the following problems:

  1. It is not bounded. We do not know how small the within-cluster sum of squares of a given model can ultimately become, so a single value is hard to interpret on its own.
  2. Its computation is affected by the number of features. Given how K-means is computed, the amount of computation explodes as the data grow, so it is not suitable for evaluating a model over and over again.
  3. It is sensitive to the hyperparameter K. The larger K is, the smaller the within-cluster sum of squares will necessarily be, but a larger K does not mean a better clustering (the small experiment after this list illustrates this).
  4. It makes assumptions about the data distribution. It assumes the data form convex clusters (roughly, blob-shaped regions when plotted in two dimensions) and that the data are isotropic, i.e., different directions of the feature space carry the same meaning. Real data are often not like this, so using the within-cluster sum of squares as an evaluation metric makes the clustering algorithm perform poorly on elongated clusters, ring-shaped clusters, or irregularly shaped manifolds.
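
Point 3 can be seen directly with a small experiment (toy data invented for illustration): the within-cluster sum of squares keeps shrinking as K grows, even past the true number of clusters, so values obtained for different K cannot be compared fairly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# inertia_ keeps decreasing as K increases, which is why it is a poor
# criterion for choosing K by itself.
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))
```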

The following two cases are distinguished when evaluating a clustering algorithm.

When the true labels are known


Although we do not feed true labels to a clustering algorithm, this does not mean the data we have can never come with true labels or other reference information. If we do have true labels, we would usually prefer a classification algorithm, but this does not rule out the possibility of still using clustering.
If we know the true grouping of the samples, we can measure the clustering effect by comparing the result of the clustering algorithm with the true result. (This case is easy to understand; in practice it is usually handled directly by classification, and it is the "true labels unknown" case below that really needs clustering.) Commonly used methods include mutual-information-based scores, the V-measure, and the adjusted Rand index.
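
For illustration, these label-based metrics are available in sklearn.metrics; they compare the predicted partition with the true labels and do not care how the cluster ids happen to be numbered (toy labels invented for the example):

```python
from sklearn import metrics

labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]  # same grouping, permuted cluster ids

print(metrics.adjusted_rand_score(labels_true, labels_pred))         # adjusted Rand index
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))  # mutual-information based
print(metrics.v_measure_score(labels_true, labels_pred))             # V-measure
```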

When the true labels are unknown

Silhouette coefficient

In this case, evaluating the clustering relies entirely on the density within clusters (small within-cluster differences) and the degree of dispersion between clusters (large between-cluster differences). The silhouette coefficient is the most commonly used evaluation metric for clustering. It is defined for each individual sample and measures, at the same time:
  1. The similarity of the sample to its own cluster: a, the mean distance between the sample and all other samples in the same cluster (the within-cluster difference).
  2. The similarity of the sample to other clusters: b, the mean distance between the sample and all samples in the nearest other cluster (the between-cluster difference).
According to the requirement of clustering that "within-cluster differences are small and between-cluster differences are large", we hope that b is always greater than a, and the larger the gap the better. The silhouette coefficient of a single sample is then defined as s = (b - a) / max(a, b). Analysing this formula, the silhouette coefficient lies in the range (-1, 1): the closer it is to 1, the better the clustering for that sample, and the closer it is to -1, the worse.

In sklearn, we use silhouette_score from the metrics module to compute the silhouette coefficient; it returns the mean silhouette coefficient over all samples in a dataset. The metrics module also provides silhouette_samples, which takes the same parameters but returns the silhouette coefficient of every individual sample in the dataset.
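
A minimal example of how the two functions are typically called, with invented data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

print(silhouette_score(X, labels))        # mean silhouette coefficient of the whole dataset
print(silhouette_samples(X, labels)[:5])  # silhouette coefficient of the first five samples
```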

Advantages and disadvantages of the silhouette coefficient: it is bounded in (-1, 1) and easy to interpret, but it is relatively expensive to compute on large datasets and, like the within-cluster sum of squares, it tends to score convex clusters higher than irregularly shaped ones.

Calinski-Harabasz Index

In addition to the most commonly used silhouette coefficient, the Calinski-Harabasz index (CHI for short, also called the variance ratio criterion), the Davies-Bouldin index, and the contingency matrix can also be used.
Here we focus on the Calinski-Harabasz index; the higher it is, the better. For N samples clustered into k clusters, the score is s(k) = [tr(B_k) / tr(W_k)] * [(N - k) / (k - 1)], where tr(B_k) is the trace of the between-group dispersion matrix and tr(W_k) is the trace of the within-cluster dispersion matrix. The greater the dispersion between groups, the larger B_k; the smaller the dispersion within groups, the smaller W_k. Therefore, the larger the value of this formula, the better it matches the goal of clustering: small differences within clusters and large differences between clusters.
The Calinski-Harabasz index is unbounded, and on convex data it can also be misleadingly high. But compared with the silhouette coefficient it has one huge advantage: it is very fast to compute (it relies only on matrix computations).
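
A minimal example with invented data (in current scikit-learn versions the function is spelled calinski_harabasz_score):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Higher is better; because it only needs matrix computations it is fast,
# which makes it convenient for quickly comparing several values of K.
print(calinski_harabasz_score(X, labels))
```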

Comparison between K-Means algorithm and KNN


  • Similarities: both K-Means and KNN (the K-nearest-neighbors algorithm) look for the points closest to a given point, i.e., both rely on the nearest-neighbor idea.
  • Differences:
    1. K-Means is an unsupervised learning algorithm with no labelled sample output; KNN is a supervised learning algorithm that outputs a category for each sample.
    2. K-Means iteratively searches for the best centroids of K clusters and thereby determines the K cluster categories; KNN simply finds the K points in the training set that are closest to a given point.

K-Means algorithm advantages and disadvantages

  • Advantages:
    1. Simple and easy to understand, and the algorithm converges in relatively few iterations
    2. Strong interpretability of the algorithm
  • Disadvantages:
    1. The choice of the value of K generally requires prior knowledge (expert experience)
    2. Because it is an iterative method, the result obtained may only be a local optimum
    3. Since the centroids are computed from all points in their clusters, the algorithm is fairly sensitive to noise and outliers
    4. If the sizes of the underlying categories are severely unbalanced, or their variances differ greatly, the clustering result is poor

References

https://www.bilibili.com/video/BV1vJ41187hk?from=search&seid=15670729090205346470
https://blog.csdn.net/weixin_46032351/article/details/104565711

