Machine Learning: K-means Clustering Algorithm

Clustering algorithms are generally divided into three types: partition-based clustering, density-based clustering, and hierarchical clustering. Here the author introduces the most common partition-based method, the K-means clustering algorithm. This article consists of notes on and supplements to "Basics of Machine Learning Algorithms" by Qin Bingfeng.

1. The difference between clustering and classification

The samples of a classification algorithm are labeled, while the samples of a clustering algorithm are unlabeled. Clustering does not know in advance which category each sample belongs to; it groups the data purely according to the features of the data. Classification therefore belongs to supervised learning, while clustering belongs to unsupervised learning.

[Figure: classification (labeled data) vs. clustering (unlabeled data)]

2. Introduction of K-means algorithm

Parameter K: Both the K-means algorithm and the KNN algorithm have a parameter K, but it means different things in the two algorithms. In K-means, K is the number of clusters into which the n input data objects are divided, such that the resulting clusters satisfy: objects within the same cluster are highly similar, while objects in different clusters have low similarity.

Algorithm idea: clustering is performed around k center points in the space, with each object assigned to the nearest center. The value of each cluster center is then updated iteratively until the best clustering result is obtained.

3. Implementation process of K-means algorithm

  • First randomly select k elements from the unlabeled element set A as the centers of gravity of k subsets; that is, first decide how many classes/clusters there are in total.
  • Calculate the distance from each remaining element to each of the k centers of gravity (Euclidean distance is commonly used here), and assign each element to the subset of the nearest center.
  • According to the clustering result, recalculate each center of gravity (the center of gravity is the arithmetic mean of each dimension over all elements in the subset).
  • Re-cluster all elements of set A according to the new centers of gravity.
  • Repeat the above steps until the clustering result no longer changes.
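
The steps above can be sketched in a few lines of NumPy (a minimal sketch for illustration, not production code; it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(A, k, max_iter=100, seed=0):
    """Plain K-means following the five steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k elements of A as the initial centers of gravity.
    centers = A[rng.choice(len(A), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every element to every center,
        # then assign each element to the subset of its nearest center.
        dists = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new center of gravity = arithmetic mean of each dimension.
        new_centers = np.array([A[labels == j].mean(axis=0) for j in range(k)])
        # Steps 4-5: re-cluster with the new centers; stop once nothing moves.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```
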

4. Example of K-means algorithm

[Figure: the four sample points plotted on the coordinate axes]

There are four points on the coordinate axes, recorded as A(1,1), B(2,1), C(4,3), and D(5,4). These four points are to be clustered; assume that (1,1) and (2,1) are selected as the two initial classification center points, as shown below.

[Figure: the initial classification center points]

First, calculate the distance from each of the four points A, B, C, and D to each classification center point, giving the matrix D0 shown in the figure below. The first row of D0 holds the distances from each point to the first classification center point, and the second row the distances from each point to the second classification center point.

[Figure: the distance matrix D0]

After computing the D0 matrix, compare each point's distances to the two classification center points and assign each point to the nearer one. This gives the matrix G0, as shown below.

[Figure: the assignment matrix G0]

Then calculate the two new center points for the next round. From the matrix we can see that group-1 contains only point A itself, so the classification center point of group-1 is unchanged. Group-2 contains the three points B, C, and D, so its new classification center point must be recalculated.

[Figure: the updated classification center points]
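
These first-round numbers can be checked with a few lines of NumPy (a sketch; only Euclidean distances and arithmetic means are involved):

```python
import numpy as np

pts = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)  # A, B, C, D
centers = np.array([[1, 1], [2, 1]], dtype=float)              # initial centers

# D0: row i holds the Euclidean distances from every point to center i.
D0 = np.linalg.norm(pts[None, :, :] - centers[:, None, :], axis=2)
print(D0.round(2))  # row 1: 0, 1, 3.61, 5 ; row 2: 1, 0, 2.83, 4.24

# G0: each point joins the nearer center (0 = group-1, 1 = group-2).
G0 = D0.argmin(axis=0)
print(G0)  # [0 1 1 1] -> A alone in group-1; B, C, D in group-2

# New group-2 center: arithmetic mean of B, C, D.
print(pts[G0 == 1].mean(axis=0).round(2))  # [3.67 2.67]
```
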

Then start a new round of iteration; the operations are the same as above.

[Figure: the second round of iteration]

The distances from each point to the new round's center points are shown in the matrix D1 in the figure:

[Figure: the distance matrix D1]

The resulting new classification is shown in matrix G1:

[Figure: the assignment matrix G1]

Then calculate the next round's classification center points according to the new classification:

[Figure: the updated classification center points]

Then comes the third round of classification.

[Figure: the third round of iteration]

The distances from each point to this round's center points are shown in the matrix D2 in the figure:

[Figure: the distance matrix D2]

The resulting new classification is shown in matrix G2:

[Figure: the assignment matrix G2]

At this point the clustering result no longer changes, so the algorithm stops iterating. The result of the K-means clustering algorithm is therefore to divide these four points into two clusters: {A, B} and {C, D}.
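
For comparison, scikit-learn's `KMeans` reproduces the same partition on these four points (a sketch; `random_state` is just for reproducibility):

```python
import numpy as np
from sklearn.cluster import KMeans

pts = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)

print(km.labels_)           # A and B share one label, C and D the other
print(km.cluster_centers_)  # (1.5, 1) and (4.5, 3.5), in some order
```
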

5. Mini Batch K-Means

The Mini Batch K-Means algorithm is a variant of K-Means that uses small random subsets of the data (mini-batches) to reduce computation time. A mini-batch here is a subset randomly sampled from the dataset each time the algorithm trains. Training on these randomly drawn subsets greatly reduces computation time, and the results are generally only slightly worse than those of the standard algorithm.

Each iteration of the algorithm has two steps:

  1. Randomly sample some data from the dataset to form a mini-batch and assign its points to the nearest centroids.
  2. Update the centroids. Unlike standard K-means, the update is computed on each small sample set rather than on the full dataset.

Mini Batch K-Means converges faster than K-Means but slightly reduces clustering quality; in practical projects the difference is usually not noticeable.
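
A quick comparison using scikit-learn's `MiniBatchKMeans` (a sketch on synthetic data; the sample size, `batch_size`, and seeds are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data: 10,000 points around 3 centers.
X, _ = make_blobs(n_samples=10_000, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10,
                      random_state=0).fit(X)

# Inertia (sum of squared distances to the nearest center) is usually
# only slightly worse for the mini-batch variant.
print(km.inertia_, mbk.inertia_)
```
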

[Figure: Mini Batch K-Means vs. K-Means clustering results]

From the clustering results we can also see that the effect of Mini Batch K-Means is similar to that of K-means.

6. Disadvantages of K-means algorithm

1. K-means is sensitive to the selection of the k initial centroids and easily falls into a local minimum. For example, running the above algorithm several times may produce different results, such as the two cases below: K-means has converged in both, but only to a local minimum.
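
This sensitivity is easy to demonstrate with scikit-learn (a sketch; the dataset and seeds are illustrative). With a single purely random initialization per run, different seeds can converge to different local minima, which is why `k-means++` seeding and multiple restarts (`n_init`) are the usual remedies:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Several blobs make bad initializations easy to observe.
X, _ = make_blobs(n_samples=1_500, centers=6, random_state=1)

# One purely random initialization per run (n_init=1): different seeds
# can converge to different local minima with different inertia.
inertias = [KMeans(n_clusters=6, init="random", n_init=1,
                   random_state=s).fit(X).inertia_ for s in range(5)]
print([round(v, 1) for v in inertias])

# Usual remedies: k-means++ seeding plus several restarts.
best = KMeans(n_clusters=6, init="k-means++", n_init=10, random_state=0).fit(X)
print(round(best.inertia_, 1))
```
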

[Figure: two different local minima reached from different initializations]

2. The value of k is specified by the user, and different values of k can give quite different results, as shown in the figure below. On the left is the result for k=3: the blue cluster is too sparse and should be split further into two clusters. On the right is the result for k=5: the red and blue clusters should be merged into one cluster.
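
One common heuristic for choosing k is the "elbow" method: compute the inertia (within-cluster sum of squared distances) for a range of k and pick the point where the curve flattens. A sketch on synthetic data (sizes and seeds are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Inertia always drops as k grows, but the drop flattens sharply
# once k passes the true number of clusters (the "elbow").
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
    print(k, round(km.inertia_, 1))
```
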

[Figure: clustering results for k=3 and k=5]

3. K-means has limitations with non-spherical data distributions, such as the one shown in the figure below. Intuitively, the final classification should look like the first result, but the K-means clustering algorithm actually produces the second.

[Figure: expected vs. actual K-means result on a non-spherical distribution]

7. K-means clustering algorithm visualization

Visualizing K-Means Clustering (visualization website)

On this website, we can freely set the parameter K and clearly see the result of each clustering step.

[Figures: the visualization website interface]

Choosing one of the datasets to experiment with, we can observe the effect, as shown below. Ideally the final classification should contain three clusters, so here we set the parameter K to 3 and then iterate.

[Figures: successive K-means iterations in the visualization tool]

The final clustering result is shown in the figure below. At this point the clustering no longer changes, and the iteration of the algorithm stops.

[Figure: the final clustering result]

Here is another example, in which the K-means clustering result is very poor.

[Figure: the smiley-face dataset]

For a smiley-face sample like this, the ideal result would be four clusters: the two eyes, the mouth, and the outer circle. But when we use the K-means algorithm, the effect is very poor, as shown below.

[Figure: K-means result on the smiley-face dataset]

In this case we should use a density-based clustering method, such as the DBSCAN algorithm; its result is shown in the figure below.
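
The same failure mode can be reproduced with scikit-learn's two-rings dataset, where DBSCAN recovers the rings that K-means cannot (a sketch; `eps` and `min_samples` are illustrative values tuned for this data):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: a non-spherical shape K-means cannot separate,
# much like the outer circle of the smiley face above.
X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# K-means cuts both rings in half; DBSCAN follows the density and
# recovers each ring as one cluster (-1 marks noise points, if any).
ari_km = adjusted_rand_score(y, km.labels_)
ari_db = adjusted_rand_score(y, db.labels_)
print(f"K-means ARI: {ari_km:.2f}, DBSCAN ARI: {ari_db:.2f}")
```
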

[Figure: DBSCAN result on the smiley-face dataset]


Origin: blog.csdn.net/qq_45138078/article/details/127613230