Research and Improvement of K-Means Clustering Algorithm

Code: https://github.com/dengsiying/K-Means-improvement.git


(School of Computer Science, Central China Normal University, Wuhan 430079, Hubei)

Abstract: The K-Means algorithm is a typical partition-based clustering algorithm. However, it still has defects: randomly chosen initial cluster centers make the algorithm unstable. This paper studies the idea, principle, advantages, and disadvantages of the traditional K-Means algorithm and, targeting its dependence on initial values, studies an improved algorithm, K-Means++, which improves the method of selecting the initial cluster centers. Experiments show that the K-Means++ algorithm effectively improves the efficiency and stability of the algorithm and reduces its overhead.

Keywords: clustering algorithm, K-Means algorithm, data mining


The K-Means clustering algorithm is the most classic and most widely used partition-based clustering algorithm, and it belongs to the family of distance-based clustering algorithms. A distance-based clustering algorithm is one that uses distance as its similarity measure: the closer two objects are, the greater their similarity. The goal of such algorithms is to group nearby objects into the same cluster, so as to obtain compact and well-separated clusters. As an important branch of data mining, the K-Means clustering algorithm has the advantages of being simple, easy to implement, easy to extend, and able to handle large data sets. However, K-Means also has some unavoidable shortcomings: it is sensitive to initial values, and different initial cluster centers lead to different clustering results, making the algorithm unstable and prone to falling into local optima. This paper proposes an improved algorithm that addresses this shortcoming. Experiments show that the improved algorithm effectively improves the efficiency and stability of the algorithm and reduces its loss.

Section 1 of this paper introduces the principle, steps, and advantages and disadvantages of the traditional K-Means algorithm. Section 2 proposes an improved algorithm, K-Means++, which addresses the instability caused by randomly selecting the initial cluster centers; its main idea is that the initial cluster centers should be chosen to be as far apart from one another as possible. Section 3 implements both the K-Means and K-Means++ algorithms and compares them experimentally, showing that the K-Means++ algorithm effectively improves the efficiency and stability of the algorithm. Finally, the paper is summarized.

1 Traditional K-Means algorithm

1.1 Principle of K-Means Algorithm

The K-Means algorithm adopts an iterative-update approach. Given the input parameter k (the number of clusters into which the data objects should be grouped), the basic idea is as follows. First, randomly select k data objects as the centers of the initial clusters. Then calculate the distance from each remaining data object to the k initial cluster centers, and assign each object to the cluster whose center is closest to it. Next, recompute the center of each cluster and use it as the cluster center for the next iteration. Repeat this process until the cluster centers no longer change or the specified maximum number of iterations is reached. With each iteration the chosen cluster centers move closer to the true cluster centers, so the clustering result improves, and finally all objects are divided into k clusters.

1.2 K-Means algorithm steps

The steps of the K-Means algorithm are as follows:

(1) Randomly select k data objects from the data set as the initial cluster centers.
(2) For each remaining data object, compute its distance to each of the k cluster centers and assign it to the cluster with the nearest center.
(3) Recompute the center of each cluster as the mean of the data objects assigned to it.
(4) Repeat steps (2) and (3) until the cluster centers no longer change or the maximum number of iterations is reached.

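The procedure described above can be sketched in Python. This is a minimal illustration, not the paper's own implementation; the function and parameter names (`kmeans`, `init`, `max_iter`) are our own:

```python
import numpy as np

def kmeans(X, k, max_iter=100, init=None, rng=None):
    """Plain K-Means: (randomly) chosen initial centers, then iterative
    assignment and center updates until the centers stop moving."""
    rng = np.random.default_rng(rng)
    # Step (1): k initial cluster centers (random rows of X by default)
    centers = X[rng.choice(len(X), size=k, replace=False)] if init is None else init
    for _ in range(max_iter):
        # Step (2): assign every point to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step (3): recompute each center as the mean of its member points;
        # a center with no members keeps its previous position
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        # Step (4): stop when no center moves any more
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

With two well-separated groups of points and one point from each group as the initial centers, the loop converges in a couple of iterations.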
1.3 Advantages and disadvantages of K-Means algorithm

The K-Means algorithm is a classic algorithm for solving clustering problems. It is simple and fast, and when the clusters are dense and well separated, the clustering results are good. When dealing with large amounts of data, the algorithm has high scalability and efficiency; its time complexity is O(nkt), where n is the number of sample objects, k is the number of clusters, and t is the number of iterations. In general, k << n and t << n.

However, the traditional K-Means algorithm also has many shortcomings and needs further optimization.

(1) The K-Means clustering algorithm requires the user to specify the number of clusters k in advance. In many cases, the user does not know at the outset how many categories the data set should be divided into, so the value of k is difficult to estimate.

(2) It is sensitive to the initial cluster centers: different choices of centers produce different clustering results with different accuracy. Randomly selecting the initial cluster centers makes the algorithm unstable and liable to fall into a local optimum.

(3) It is sensitive to noise and outlier data. Since the K-Means algorithm takes the centroid of a cluster as the cluster center for the next round of computation, a small amount of such data can strongly affect the mean, making the results unstable or even wrong.

(4) It cannot find clusters of arbitrary shape and can generally only find spherical clusters. Because the K-Means algorithm uses the Euclidean distance function to measure the similarity between data objects and uses the sum of squared errors as its criterion function, it can usually only find spherical clusters in which the data objects are fairly uniformly distributed.

This paper mainly analyzes the second problem, the influence of the initial values on the clustering results, and considers improving the selection of the initial cluster centers, thereby reducing the K-Means algorithm's dependence on initial values and improving its efficiency.

2 Improved algorithm K-Means++

2.1 Principle of K-Means++ algorithm

The traditional K-Means algorithm is sensitive to the initial cluster centers, and the clustering result fluctuates with different choices of initial centers. To address this defect of random initialization, the K-Means++ algorithm selects the initial cluster centers based on the distribution of the data.

The overall structure of the K-Means++ algorithm differs little from that of K-Means; it also adopts an iterative-update approach. Its main improvement lies in the first step: instead of randomly selecting k data objects from the entire data set as the initial cluster centers, it selects k initial centers following the principle that the centers should be as far apart from one another as possible. The main idea of the selection is as follows: suppose n initial cluster centers (0 < n < k) have already been selected; then, when selecting the (n+1)-th center, points farther away from the current n centers have a higher probability of being chosen. The first cluster center (n = 1) is still selected at random.

2.2 K-Means++ algorithm steps

The steps of the K-Means++ algorithm are as follows:

(1) Randomly select one data object from the data set as the first cluster center.
(2) For each data object x, compute D(x), its distance to the nearest cluster center already chosen.
(3) Select the next cluster center at random, with each data object x chosen with probability proportional to D(x)^2.
(4) Repeat steps (2) and (3) until k cluster centers have been selected.
(5) Run the standard K-Means algorithm with these k centers as the initial cluster centers.


In summary, the above are the complete steps of the K-Means++ algorithm.
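The seeding procedure can be sketched as follows. This is an illustrative sketch; the helper name `kmeans_pp_init` is our own:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """K-Means++ seeding: the first center is uniform random; each later
    center is drawn with probability proportional to D(x)^2, the squared
    distance from x to its nearest already-chosen center."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]  # first center: uniform random
    while len(centers) < k:
        # D(x)^2 for every point: squared distance to its nearest chosen center
        d2 = np.min(
            ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # sample the next center with probability D(x)^2 / sum of all D(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)
```

The k centers returned here would then replace the random initialization in step (1) of the plain K-Means loop.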

3 Comparison of Improved Algorithm K-Means++ and K-Means Algorithm

3.1 Dataset 

We used a simple two-dimensional data set containing 300 samples, each with two attributes; its visualization is shown in Figure 1 below. The data set falls roughly into 3 clusters, so we set k to 3 and, on this basis, implemented the K-Means algorithm and the improved K-Means++ algorithm.

 

Figure 1   Dataset
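The paper's data set itself is not reproduced here, but a comparable set of 300 two-attribute samples falling into roughly 3 clusters can be generated for experimentation (the blob centers and spread below are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# three Gaussian blobs of 100 two-dimensional samples each
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.8, size=(100, 2)) for c in blob_centers])
print(X.shape)  # (300, 2)
```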

3.2 Evaluation Criteria

As mentioned above, the K-Means algorithm uses the sum-of-squared-errors criterion function, and the goal of the algorithm is to minimize it; the stopping condition used in this paper's implementation is reaching the maximum number of iterations. Therefore, we use the sum of squared errors (SSE) as the evaluation criterion for the algorithms.

SSE = ∑_{i=1}^{k} ∑_{x∈C_i} dist(c_i, x)²                    (1)

where dist denotes the distance from each point to the center of the cluster to which it belongs, so the SSE value is the sum of squared distances from all sample points to their cluster centers. Under the same number of iterations, the smaller the SSE value, the better the algorithm and the smaller the loss.
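Equation (1) translates directly into code (a minimal sketch; the function name is our own):

```python
import numpy as np

def sse(X, centers, labels):
    """Sum of squared Euclidean distances from each sample point to the
    center of the cluster it belongs to -- the SSE of equation (1)."""
    diffs = X - centers[labels]  # per-point offset from its own center
    return float((diffs ** 2).sum())
```

For example, two points each at distance 1 from their shared center give SSE = 2.0.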

3.3 Algorithm Comparison

The K-Means algorithm and the K-Means++ algorithm were each run on the data set; the relationship between the number of iterations and the SSE value is shown in Table 1 (values rounded to two decimal places).

 

Table 1  Comparison of SSE values of the two algorithms

Number of iterations    K-Means SSE value    K-Means++ SSE value
2                       872.47               266.82
3                       851.15               266.66
4                       690.14               266.66
6                       276.68               266.66
8                       266.66               266.66
10                      266.66               266.66

From the table it can be seen that the traditional K-Means algorithm converges to the global minimum at the 8th iteration, where the SSE value is 266.66, while the K-Means++ algorithm has already converged to the global minimum by the 3rd iteration, where the SSE value is also 266.66. This shows that the efficiency of the improved K-Means++ algorithm is greatly improved: it converges to the global minimum quickly and stably with less overhead, and achieves a better clustering result.

When the number of iterations is 4, the clustering results of the two algorithms are visualized in Figures 2 and 3 below. Comparing the two, the clustering result of the K-Means++ algorithm is better than that of the K-Means algorithm: the original data points are more clearly divided into 3 clusters according to distance similarity, with the red X marking the final cluster centers.

In addition, experiments show that when the number of iterations and the number of runs are large enough, both the K-Means and K-Means++ algorithms can converge to the global minimum and finally achieve a good clustering result. However, the random selection of initial cluster centers makes the K-Means algorithm unstable; after improving the method of selecting the initial cluster centers, the K-Means++ algorithm converges faster and performs better.

To sum up, the improved K-Means++ algorithm reduces the traditional K-Means algorithm's dependence on initial values to a certain extent, reduces the algorithm's instability, effectively improves its efficiency, and reduces its overhead.

 

Figure 2  K-Means clustering result (4 iterations)

 

Figure 3  K-Means++ clustering result (4 iterations)

4 Summary

As a typical partition-based clustering algorithm, the K-Means clustering algorithm is widely used in data mining and is often employed as a preprocessing step. This paper has discussed the idea, principle, and advantages and disadvantages of the K-Means algorithm and, addressing the instability caused by the traditional algorithm's random selection of initial cluster centers, studied the improved algorithm K-Means++. K-Means++ selects the initial cluster centers based on the data distribution, following the principle that the centers should be as far apart from one another as possible; this paper also discussed its idea and principle. The improvement effectively avoids the blindness of randomly selecting initial cluster centers. Experiments show that the improved K-Means++ algorithm greatly improves on the traditional K-Means algorithm in both stability and speed.

 


 

 
