Clustering algorithms are generally divided into three types: partition-based clustering, density-based clustering, and hierarchical clustering. This article introduces the most common partition-based method, the K-means clustering algorithm. It consists of notes and supplements from studying "Basics of Machine Learning Algorithms" by Qin Bingfeng.
1. The difference between clustering and classification
The samples of a classification algorithm are labeled, while the samples of a clustering algorithm are unlabeled: without knowing in advance which category the data belong to, clustering groups the data according to their features. Clustering therefore belongs to unsupervised learning, while classification belongs to supervised learning.
2. Introduction to the K-means algorithm
Parameter K: Both the K-means algorithm and the KNN algorithm have a parameter K, but the two Ks mean different things. In K-means, K is the number of clusters into which the n input data objects are divided, specified in advance, such that the resulting clusters satisfy: objects within the same cluster are highly similar, while objects in different clusters are dissimilar.
Algorithm idea: clustering is organized around k center points in the space; the objects closest to each center are grouped with it. Through iteration, the value of each cluster center is updated in turn until the best clustering result is obtained.
3. Implementation process of K-means algorithm
- First, randomly select k elements from the unlabeled element set A as the centroids of the k subsets; that is, first decide how many classes/clusters there will be in total.
- Calculate the distance from each remaining element to each of the k centroids (Euclidean distance can be used here), and assign each element to the nearest subset.
- Based on the clustering result, recalculate each centroid (the centroid is computed as the arithmetic mean of each dimension over all elements in the subset).
- Re-cluster all elements of set A according to the new centroids.
- Repeat the above steps until the clustering result no longer changes.
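The steps above can be sketched in a few lines of NumPy. This is a minimal illustration with my own function and variable names, not a production implementation; in particular it assumes no cluster ever becomes empty during the iteration.

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100, seed=0):
    """Minimal K-means sketch. X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Step 1: randomly pick k elements of the set as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centroids = np.array(init, dtype=float)
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every element to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # assign each element to the nearest subset
        # Step 3: recompute each centroid as the per-dimension arithmetic mean
        # of its assigned elements (assumes each cluster is non-empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 4-5: stop once the centroids, and hence the clustering, no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

The optional `init` argument makes experiments reproducible by letting you fix the initial centroids instead of sampling them.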
4. Example of K-means algorithm
There are four points in the plane, recorded as A(1,1), B(2,1), C(4,3), and D(5,4). We cluster these four points, assuming that (1,1) and (2,1) are selected as the two initial cluster centers, as shown below.
First, calculate the distances from the four points A, B, C, and D to each cluster center, as shown in the matrix D0 in the figure below. The first row of D0 holds the distance from each point to the first cluster center, and the second row the distance from each point to the second cluster center.
After computing D0, compare each point's distances to the two cluster centers and assign each point to the closer one, yielding the matrix G0, as shown below.
Then compute the two new cluster centers for the next round. From the matrix we can see that group-1 contains only point A, so its cluster center is unchanged, while group-2 contains the three points B, C, and D, so its new cluster center is recomputed from them.
Then a new round of iteration begins, with the same operations as above.
The result of calculating the distance from each point to the center point of the new round is shown in the matrix D1 in the figure:
The resulting new classification is shown in Figure G1:
Then compute the cluster centers of the next round according to the new assignment.
Then comes the third round of classification.
The result of calculating the distance from each point to the center point of the new round is shown in the matrix D2 in the figure:
The resulting new classification is shown in Figure G2:
At this point the clustering result no longer changes, so the iteration stops. The final result of the K-means algorithm is therefore two clusters: {A, B} and {C, D}.
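The whole iteration can be checked numerically with a short script (variable names are my own; the rows of the distance matrix correspond to the D0/D1/D2 matrices in the figures):

```python
import numpy as np

points = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)  # A, B, C, D
centers = np.array([[1, 1], [2, 1]], dtype=float)                 # initial cluster centers

while True:
    # Row i of D is the distance from cluster center i to A, B, C, D (the D matrices)
    D = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=2)
    labels = D.argmin(axis=0)                   # nearest-center assignment (the G matrices)
    new_centers = np.array([points[labels == j].mean(axis=0) for j in range(2)])
    if np.array_equal(new_centers, centers):    # clustering unchanged: stop iterating
        break
    centers = new_centers

print(labels)   # [0 0 1 1] -> clusters {A, B} and {C, D}
print(centers)  # final centers (1.5, 1) and (4.5, 3.5)
```

The loop converges after the third round, matching the hand computation above.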
5. Mini Batch K-Means
The Mini Batch K-Means algorithm is a variant of K-Means that uses small random subsets (mini-batches) of the data to reduce computation time. Here, a mini-batch is a subset of the data sampled at random in each training step. Training on these randomly drawn subsets greatly reduces computation time, and the results are generally only slightly worse than those of the standard algorithm.
The algorithm iterates over two steps:
- Randomly sample some data from the dataset to form a mini-batch and assign its points to the nearest centroids.
- Update the centroids: unlike standard K-means, the centroids are updated using only the points of the current mini-batch.
Mini Batch K-Means converges faster than K-Means at the cost of somewhat lower clustering quality, although in practical projects the difference is usually not noticeable.
From the clustering results we can also see that Mini Batch K-Means performs similarly to standard K-means.
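The two steps can be sketched as follows. This is a simplified illustration (not the exact implementation of any library) using the common per-center step size of 1/count, so that each center tracks the running mean of the points assigned to it:

```python
import numpy as np

def mini_batch_kmeans(X, k, batch_size=32, n_iter=100, init=None, seed=0):
    """Simplified Mini Batch K-Means sketch."""
    rng = np.random.default_rng(seed)
    if init is None:
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        centers = np.array(init, dtype=float)
    counts = np.zeros(k)                         # points absorbed by each center so far
    for _ in range(n_iter):
        # Step 1: sample a random mini-batch and assign it to the nearest centers
        batch = X[rng.choice(len(X), size=batch_size)]
        d = np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 2: update only the centers touched by this batch,
        # moving each one toward its point with a shrinking step size 1/count
        for x, j in zip(batch, labels):
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]
    return centers
```

Because each iteration touches only `batch_size` points instead of the whole dataset, each update is much cheaper than a full K-means pass.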
6. Disadvantages of K-means algorithm
1. It is sensitive to the choice of the k initial centroids and easily falls into a local minimum. For example, running the algorithm above several times may produce different results, such as the two cases below: K-means still converges, but only to a local minimum.
2. The value of k is specified by the user, and different values of k can give very different results. In the figure below, the left shows the result for k=3: the blue cluster is too sparse and should be split further into two clusters. The right shows the result for k=5: the red and blue clusters should be merged into one.
3. It has limitations with non-spherical data distributions such as the one shown below. Intuitively, the expected result is the first figure, but the K-means algorithm actually produces the second.
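A common remedy for the first weakness is to run K-means from several random initializations and keep the result with the smallest within-cluster sum of squared distances (the idea behind scikit-learn's `n_init` parameter). A minimal sketch with my own function names, assuming no cluster becomes empty:

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """One K-means run from a random initialization; returns labels, centers, SSE."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    sse = ((X - centers[labels]) ** 2).sum()    # within-cluster sum of squares
    return labels, centers, sse

def best_of_n(X, k, n_init=10, seed=0):
    """Run K-means n_init times and keep the lowest-SSE result."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng) for _ in range(n_init)), key=lambda r: r[2])
```

This does not guarantee the global optimum, but it makes a bad local minimum much less likely.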
7. K-means clustering algorithm visualization
Visualizing K-Means Clustering (visualization website)
On this website we can freely set the parameter K and clearly observe the result of each clustering step.
Let's pick one of the examples to experiment with, as shown in the figure below. Ideally the final result should be three clusters, so we set the parameter K to 3 and then keep iterating.
The final clustering result is shown in the figure below. At this point the clustering no longer changes, and the iteration stops.
Here is another example where the K-means clustering result is very poor.
For a smiley-face sample like this, the ideal result is four clusters: the two eyes, the mouth, and the outer circle. But when we apply the K-means algorithm, the result is very poor, as shown below.
In this case a density-based clustering method such as the DBSCAN algorithm should be used instead. The result of applying DBSCAN is shown in the figure below.
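DBSCAN's core idea is to grow clusters outward from "core" points that have at least `min_pts` neighbors within radius `eps`, which lets it follow arbitrarily shaped dense regions. A compact sketch of that idea (my own simplified implementation, not scikit-learn's; it labels unreachable points as noise with -1):

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    """Simplified DBSCAN: returns a cluster label per point, -1 for noise."""
    n = len(X)
    labels = np.full(n, -1)                      # -1 means noise / not yet assigned
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    # neighborhood of each point within radius eps (includes the point itself)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue                             # skip visited and non-core points
        # grow a new cluster outward from core point i
        visited[i] = True
        labels[i] = cluster
        queue = [i]
        while queue:
            j = queue.pop()
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster          # border or core point joins the cluster
                if not visited[k]:
                    visited[k] = True
                    if len(neighbors[k]) >= min_pts:
                        queue.append(k)          # only core points keep expanding
        cluster += 1
    return labels
```

Because clusters are defined by density-connected regions rather than distance to a centroid, shapes like the smiley face's rings are separated correctly where K-means fails.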