What is clustering
Clustering divides a large set of unlabeled data into categories (clusters) according to characteristics of the data itself, so that samples within a cluster are relatively similar and samples in different clusters are relatively dissimilar. It is a form of unsupervised learning.
The core of a clustering algorithm is computing the similarity between samples, often expressed as the distance between them.
Difference from classification: a classification algorithm is supervised learning, and its model is built from labeled historical data. A clustering algorithm is unsupervised learning, and the data in the data set carries no labels.
Similarity/Distance Formula
Pearson correlation coefficient
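As an illustration of a similarity measure, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function names `pearson` and `pearson_distance` are hypothetical, and turning the similarity into a distance as `1 - r` is one common convention, assumed here rather than taken from these notes.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

def pearson_distance(x, y):
    """Assumed convention: distance = 1 - r, which lies in [0, 2]."""
    return 1.0 - pearson(x, y)
```

A value of r close to 1 means the two samples vary together, close to -1 means they vary in opposite directions, and close to 0 means no linear relationship.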
K-means algorithm:
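The standard K-means iteration (Lloyd's algorithm) alternates between assigning every sample to its nearest center and moving each center to the mean of its samples. The sketch below assumes squared Euclidean distance, points given as tuples, and random initial centers drawn from the data. The names `kmeans`, `squared_distance` and `mean_point` are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Assumption: the initial centers are k distinct samples picked at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean_point(members)
    return centers, labels
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` separates the three points near the origin from the three points near (10, 10).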
Advantages and disadvantages of K-means algorithm:
Disadvantages: K must be chosen by the user before the data is processed, when the right value is usually unknown, and different K values produce different results.
The result is sensitive to the initial cluster centers.
It is not suited to finding non-convex clusters or clusters that differ greatly in size.
Outliers have a large influence on the model, because each center is the mean of its samples.
Advantages: The algorithm is easy to understand and usually clusters well.
It scales well and stays efficient on large data sets.
It works particularly well when the clusters are approximately Gaussian.
K-means optimization:
Bisecting K-Means algorithm
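The variant named in this heading starts with all samples in one cluster and repeatedly splits one cluster into two with a plain 2-means step until K clusters exist. The sketch below assumes that the cluster with the largest sum of squared errors (SSE) is the one split each time; splitting the largest cluster is another common choice. All function names are hypothetical.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)

def two_means(points, rng, max_iter=50):
    """Split one cluster into two halves with K-means for K = 2."""
    centers = rng.sample(points, 2)
    halves = None
    for _ in range(max_iter):
        new_halves = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            new_halves[nearer].append(p)
        if new_halves == halves:
            break
        halves = new_halves
        centers = [centroid(h) if h else c for h, c in zip(halves, centers)]
    return halves

def bisecting_kmeans(points, k, seed=0):
    """Return a list of clusters, each a list of points."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) >= 2]
        if not candidates:
            break  # only single-point clusters remain
        # Assumption: split the cluster with the largest SSE.
        worst = max(candidates, key=sse)
        left, right = two_means(worst, rng)
        if not left or not right:
            break  # degenerate split, e.g. duplicated points
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters
```

Because every split is only a 2-means problem, the outcome depends less on a single global choice of K initial centers than plain K-means does.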
K-means++ algorithm
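K-means++ changes only the choice of initial centers: the first center is picked uniformly at random, and each later center is picked with probability proportional to its squared distance from the nearest center chosen so far. The sketch below shows only that seeding step, with the hypothetical name `kmeans_pp_init`; the resulting centers would then be passed to the usual K-means iteration.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from points using K-means++ seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Far-away points are proportionally more likely to be chosen.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        d2 = [min(old, sq_dist(p, centers[-1])) for old, p in zip(d2, points)]
    return centers
```

Points that are already centers have weight zero, so they are not chosen twice, and the initial centers tend to spread across the data instead of landing in the same cluster.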
Mini Batch K-means algorithm:
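Mini Batch K-means updates the centers from small random batches instead of the full data set, trading a little accuracy for speed. The sketch below assumes the commonly described update in which each center keeps a count of the samples it has absorbed and moves toward every new sample with step size 1 / count. The name `mini_batch_kmeans` and the default parameter values are hypothetical.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mini_batch_kmeans(points, k, batch_size=10, n_iter=100, seed=0):
    """Return k centers estimated from random mini-batches."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # number of samples each center has absorbed so far
    for _ in range(n_iter):
        # Draw a batch with replacement from the data.
        batch = [rng.choice(points) for _ in range(batch_size)]
        # Assign the whole batch using the centers as they are right now.
        nearest = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in batch
        ]
        # Move each assigned center a little toward its sample.
        for p, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center step size shrinks over time
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return [tuple(c) for c in centers]
```

With this step size each center is the running mean of all samples ever assigned to it, so no pass over the full data set is needed.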
Evaluation metric for clustering algorithms: the silhouette coefficient
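For one sample, let a be its mean distance to the other samples in its own cluster and b the smallest mean distance to the samples of any other cluster. Its silhouette value is s = (b - a) / max(a, b), and the silhouette coefficient of a clustering is the mean of s over all samples. The sketch below assumes Euclidean distance and assigns s = 0 to a sample that is alone in its cluster, a common convention. The function name `silhouette_coefficient` is hypothetical.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_coefficient(points, labels):
    """Mean silhouette value over all samples; requires at least 2 clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members (distance to itself is 0).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

The result lies in [-1, 1]: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest samples assigned to the wrong cluster.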