"Machine Learning in Practice" - Chapter 10 K-MEANS Algorithm Learning Summary

1. K-means clustering

Clustering is an unsupervised learning technique that groups similar objects into the same cluster, and it can be applied to almost any kind of object; the more similar the objects within a cluster, the better the clustering result. The "k" in K-means is the number of clusters the data is grouped into, and "means" refers to the fact that each cluster is described by its centroid, the mean of the data points assigned to it.

The biggest difference between clustering and classification is that in classification the target categories are known in advance, while in clustering they are not: there are no predefined labels, which is why clustering is also called unsupervised learning. Cluster analysis tries to put similar objects into the same cluster and dissimilar objects into different clusters, so a suitable similarity measure is obviously needed. There are many such measures, for example Euclidean distance, cosine distance, and Hamming distance. Although the K-means algorithm is fairly easy to implement, it may converge to a local optimum, and it converges relatively slowly on large data sets.

2. K-means clustering algorithm process

First, k initial centroids are chosen at random. Then every point in the data set is assigned to a cluster: for each point, find the nearest centroid and assign the point to that centroid's cluster. After this step is complete, the centroid of each cluster is updated to the mean of all points in that cluster. The pseudo code is as follows:

Create k points as the starting centroids, chosen at random within the data boundaries
While the cluster assignment of any point changes:
    for each point in the data set:
        for each centroid:
            calculate the distance between the centroid and the point
        assign the point to the cluster of the closest centroid
    for each cluster, compute the mean of all points in the cluster and use that mean as the new centroid

In this algorithm the default similarity measure is Euclidean distance, but other similarity functions, such as cosine distance, can be used as well. The k centroids are initialized randomly, and the initial centroids must lie within the boundaries of the data set. This can be done by finding the minimum and maximum value of every dimension and then taking, for each dimension, minimum + (maximum - minimum) × a random number between 0 and 1, which guarantees that the random point stays inside the data boundaries.
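Below is a minimal NumPy sketch of these two pieces, a Euclidean distance function and the within-bounds random initialization just described. The function and variable names are my own, not taken from the book:

import numpy as np

def euclidean_dist(a, b):
    # Euclidean distance between two 1-D vectors.
    return np.sqrt(np.sum((a - b) ** 2))

def random_centroids(data, k):
    # Pick k random centroids inside the bounding box of the data:
    # for every feature j the coordinate is min_j + (max_j - min_j) * U(0, 1),
    # so the centroid cannot fall outside the data range.
    n_features = data.shape[1]
    centroids = np.zeros((k, n_features))
    for j in range(n_features):
        min_j = data[:, j].min()
        range_j = data[:, j].max() - min_j
        centroids[:, j] = min_j + range_j * np.random.rand(k)
    return centroids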

The actual K-means algorithm iterates by repeatedly assigning points to centroids and then recomputing the centroids. The stopping condition is naturally that no point in the data set changes its cluster assignment. When the assignments stop changing and the centroids of all clusters have been updated, the algorithm returns a list of the k centroids (usually as vectors) together with a two-dimensional matrix that stores, for every data point, its cluster index and the squared distance to its centroid. The squared error distance of each point is stored to make later post-processing easier. Because the k initial centroids are chosen at random, the clustering result may fall into a local optimum; a local optimum can still look reasonable, but it is not as good as the global optimum. Therefore, after the algorithm finishes, post-processing can be applied to help it escape the local optimum, move toward the global optimum, and obtain the best clustering result.
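Building on the helpers above, here is a rough sketch of the full loop (an illustration under my own naming, not the book's exact implementation): it keeps reassigning points and recomputing centroids until no assignment changes, and returns the centroid array together with the per-point matrix of cluster index and squared error described above.

def k_means(data, k, dist_func=euclidean_dist, init_func=random_centroids):
    # Returns (centroids, assignment); assignment[i] = (cluster index, squared error) for point i.
    n_points = data.shape[0]
    assignment = np.zeros((n_points, 2))
    centroids = init_func(data, k)
    changed = True
    while changed:                               # stop once no point switches cluster
        changed = False
        for i in range(n_points):
            dists = [dist_func(centroids[j], data[i]) for j in range(k)]
            best = int(np.argmin(dists))
            if assignment[i, 0] != best:
                changed = True
            assignment[i] = best, dists[best] ** 2
        for j in range(k):                       # recompute each centroid as the mean of its members
            members = data[assignment[:, 0] == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignment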

If the algorithm has fallen into a local minimum, how can the effect of the K-means algorithm be further improved?

One index used to measure clustering quality is SSE, the sum of squared errors: the sum, over all data points in all clusters, of the squared distance from each point to its cluster center. The smaller the SSE, the closer the points are to their centroids and the better the clustering. Because the errors are squared, points that are far from their center are weighted more heavily.
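Assuming the assignment matrix returned by the k_means sketch above (cluster index in column 0, squared error in column 1), the total SSE, or the SSE of a single cluster, can be read off directly:

def total_sse(assignment):
    # Sum of squared errors over all points; smaller means tighter clusters.
    return assignment[:, 1].sum()

def cluster_sse(assignment, j):
    # SSE contributed by cluster j alone.
    return assignment[assignment[:, 0] == j, 1].sum()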

From this we conclude that one way to improve the clustering result is to reduce the SSE. So how can the quality of the clusters be improved while keeping the number of clusters unchanged?

One approach is to split the cluster with the largest SSE into two clusters (the cluster with the largest SSE is generally the one whose points are farthest from their cluster center). Concretely, the points belonging to that cluster are filtered out and the K-means algorithm is run on just those points with k set to 2, as sketched below. After splitting the largest cluster in two, the number of clusters has grown by one, so to keep it unchanged we can then merge two of the clusters.
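A hypothetical sketch of that split step, reusing the helpers defined earlier: it finds the cluster with the largest SSE and re-clusters just its points with k set to 2.

def split_largest_sse_cluster(data, centroids, assignment):
    # Locate the cluster with the largest SSE and run 2-means on its points only.
    k = centroids.shape[0]
    sse_per_cluster = [cluster_sse(assignment, j) for j in range(k)]
    worst = int(np.argmax(sse_per_cluster))
    points_in_worst = data[assignment[:, 0] == worst]
    sub_centroids, sub_assignment = k_means(points_in_worst, 2)
    return worst, sub_centroids, sub_assignment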

There are two ways to choose which clusters to merge. One is to merge the clusters whose centroids are closest: compute the distances between all pairs of centroids and merge the two clusters whose centroids are nearest. The other is to merge the two clusters whose merger increases the SSE the least. Merging two clusters always increases the SSE, so for the best clustering result the total SSE should stay as small as possible; concretely, compute the total SSE that would result from merging every pair of clusters, and merge the pair that gives the smallest total SSE. Either way, the number of clusters stays constant.
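For the first merge strategy, a small sketch that finds the pair of centroids closest to each other (the corresponding clusters would then be merged):

def closest_centroid_pair(centroids):
    # Return the indices (a, b) of the two centroids nearest to one another.
    k = centroids.shape[0]
    best_pair, best_dist = None, np.inf
    for a in range(k):
        for b in range(a + 1, k):
            d = euclidean_dist(centroids[a], centroids[b])
            if d < best_dist:
                best_pair, best_dist = (a, b), d
    return best_pair

The second strategy would instead try every pair of clusters, recompute the total SSE after each hypothetical merge, and keep the pair whose merge increases the SSE the least.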

Without changing k, the post-processing above helps to some extent and improves the clustering result. Next we introduce a variant of K-means that addresses the problem of converging to a local minimum: the bisecting K-means algorithm.

3. Bisecting K-means algorithm

The bisecting K-means algorithm first treats all points as one cluster and splits that cluster in two. It then repeatedly selects one of the existing clusters to split again, choosing the cluster whose split reduces the SSE the most. This splitting process is repeated until the number of clusters reaches the value specified by the user. The pseudo code is as follows:

Treat all points as one cluster
While the number of clusters is less than k:
    for each cluster:
        calculate the total error
        perform k-means clustering (k = 2) on the cluster
        calculate the total error after splitting the cluster in two
    select the split that minimizes the total error and commit it

Alternatively, the cluster with the largest SSE can simply be chosen for splitting, again until the number of clusters reaches the user-specified value. In the algorithm above, the loop does not stop until the number of clusters reaches k. In each pass, every cluster is trial-split, the total error resulting from each trial split is computed, and the split that gives the smallest total error is the one actually carried out. After a split is committed, the list of cluster centroids, the cluster assignments of the data points, and the squared errors are updated.

Specifically, suppose the cluster being split is the i-th of the current m clusters (m < k). After the cluster is split in two, one of the halves replaces the split cluster and becomes the new i-th cluster, and its centroid is computed; the other half is treated as a new cluster, becomes the (m+1)-th cluster, and its centroid is computed as well. The algorithm also stores each data point's cluster assignment and squared error, so this stored information is updated at the same time. The process repeats until the number of clusters reaches k. With this procedure, data that previously got stuck in a local minimum is gradually driven, split by split, toward the global minimum by the bisecting K-means algorithm, yielding a satisfactory clustering result.
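Putting the pieces together, here is one possible sketch of bisecting K-means along the lines described above, again reusing k_means and euclidean_dist from the earlier sketches; the bookkeeping (one half keeps the old cluster index, the other becomes a new cluster) follows the paragraph above.

def bisecting_k_means(data, k):
    # Start with one cluster, then repeatedly commit the split that lowers total SSE the most.
    n_points = data.shape[0]
    assignment = np.zeros((n_points, 2))
    centroids = [data.mean(axis=0)]                       # the single starting centroid
    for i in range(n_points):
        assignment[i, 1] = euclidean_dist(centroids[0], data[i]) ** 2

    while len(centroids) < k:
        best_sse = np.inf
        for j in range(len(centroids)):
            members = data[assignment[:, 0] == j]
            if len(members) < 2:                          # cannot split a cluster of one point
                continue
            sub_cents, sub_assign = k_means(members, 2)   # trial split of cluster j
            sse_split = sub_assign[:, 1].sum()            # error of the two new halves
            sse_rest = assignment[assignment[:, 0] != j, 1].sum()  # error of untouched clusters
            if sse_split + sse_rest < best_sse:
                best_sse = sse_split + sse_rest
                best_j, best_cents, best_assign = j, sub_cents, sub_assign.copy()

        # Commit the best split: half 0 keeps index best_j, half 1 becomes a new cluster.
        best_assign[best_assign[:, 0] == 1, 0] = len(centroids)
        best_assign[best_assign[:, 0] == 0, 0] = best_j
        centroids[best_j] = best_cents[0]
        centroids.append(best_cents[1])
        assignment[assignment[:, 0] == best_j] = best_assign
    return np.array(centroids), assignment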
