Machine Learning (9) Clustering K-means

What is clustering

Clustering divides a large data set with unknown labels into categories according to the characteristics of the data itself, so that samples within a category are relatively similar and samples in different categories are relatively dissimilar. It is an unsupervised learning method.

The core of a clustering algorithm is computing the similarity between samples, which is often expressed as the distance between samples.

Difference from classification: a classification algorithm is supervised learning, and its model is built from labeled historical data. A clustering algorithm is unsupervised learning, and the data in the data set is not labeled.

Similarity/Distance Formula

Pearson correlation coefficient
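
As a minimal sketch of how this similarity measure can be computed, the snippet below implements the usual definition of the Pearson correlation coefficient: the covariance of two sequences divided by the product of their standard deviations. The function names `pearson` and `pearson_distance` are illustrative, and using `1 - r` as a distance is one common convention assumed here rather than something this article prescribes.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (variances up to the same factor).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def pearson_distance(x, y):
    """Assumed convention: turn the similarity r into a distance 1 - r."""
    return 1.0 - pearson(x, y)
```

The coefficient lies between -1 and 1, so under this convention two samples that vary together perfectly are at distance 0 and two that vary in exactly opposite directions are at distance 2.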


K-means algorithm:
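
The standard K-means procedure alternates two steps until the assignments stop changing: assign every sample to its nearest center, then move each center to the mean of the samples assigned to it. The following is a minimal sketch of that loop using squared Euclidean distance. The names `kmeans`, `squared_distance` and `mean_point`, the random initialization, and the choice to keep an empty cluster's old center are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels) for a list of points and a cluster count k."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct samples picked at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean_point(members)
    return centers, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

Because the starting centers are random, different seeds can lead to different results on less clearly separated data.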


Advantages and disadvantages of K-means algorithm:

Disadvantages: The value of K must be chosen by the user. It is unknown before the data is processed, and different values of K give different results.

               The result is sensitive to the initial cluster centers.

               Not suitable for finding non-convex clusters or clusters that differ greatly in size.

               Outliers have a large effect on the model.

Advantages: Easy to understand, and the clustering results are usually good.

            Scales well and stays efficient on large data sets.

            Works very well when the clusters are approximately Gaussian.

K-means optimization:

Bisecting K-Means algorithm



K-means++ algorithm
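
K-means++ changes only the initialization: the first center is picked uniformly at random, and each later center is picked with probability proportional to the squared distance from a sample to its nearest already-chosen center, which spreads the starting centers apart. Below is a minimal sketch of that seeding step. The function name `kmeans_pp_init` and the fallback used when all samples coincide with chosen centers are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from points using K-means++ style seeding."""
    rng = random.Random(seed)
    # First center: uniform random choice among the samples.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of each sample: squared distance to its nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Assumed fallback: every sample already coincides with a center.
            centers.append(rng.choice(points))
        else:
            # Samples far from all current centers are more likely to be picked.
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

The returned centers can replace the random starting centers of an ordinary K-means loop; the assignment and update steps stay the same.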



Mini Batch K-means algorithm:
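
Mini Batch K-means trades a little accuracy for speed by updating the centers from a small random batch of samples in each iteration instead of the full data set. The sketch below follows the commonly described variant in which every center keeps a count of the samples it has absorbed and moves toward each new sample with step size `1 / count`. The function name `minibatch_kmeans`, the default batch size, and the fixed iteration count are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def minibatch_kmeans(points, k, batch_size=10, n_iter=100, seed=0):
    """Return k centers estimated from random mini-batches of points."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct samples picked at random.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # number of samples each center has absorbed so far
    for _ in range(n_iter):
        # Draw a mini-batch with replacement.
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Assign the whole batch using the centers as they are right now.
        nearest = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in batch
        ]
        # Move each assigned center toward its sample with a shrinking step.
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return [tuple(c) for c in centers]
```

Each center ends up as a running average of the batch samples assigned to it, so the result is close to, but generally not identical with, what full K-means would give.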


Evaluation Metric for Clustering Algorithms: Silhouette Coefficient
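
For a single sample, let a be its mean distance to the other samples in its own cluster and b be the smallest mean distance to the samples of any other cluster. The sample's silhouette value is (b - a) / max(a, b), and the silhouette coefficient of a clustering is the mean of these values over all samples. The following is a minimal sketch of that definition with Euclidean distance. The function name `silhouette_coefficient` and the convention of scoring a single-member cluster as 0 are assumptions made for illustration.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value of a clustering given points and cluster labels."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for single-member clusters
            continue
        # a: mean distance to the other members (the distance to itself is 0).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that many samples sit closer to another cluster than to their own. Computing the coefficient for several values of K is one way to compare them.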


