K-means algorithm (unsupervised learning)

1 What is unsupervised learning

In practice we often face this problem: we lack sufficient prior knowledge, so labeling categories by hand is difficult or too expensive. Naturally, we hope computers can do this work for us, or at least provide some help. Solving pattern-recognition problems from training samples whose categories are unknown (unlabeled) is called unsupervised learning. Two examples:

  • An advertising platform needs to divide the U.S. population into groups with similar demographics and buying habits, so that advertisers can reach their target customers with relevant ads.
  • A rental platform needs to group its housing listings into neighborhoods so that users can browse the listings more easily.

How can we summarize and group the data most usefully? How can we represent the data effectively in a compressed format? These are the goals of unsupervised learning, which is called unsupervised because it learns from unlabeled data.

2 K-means principle

The K-means algorithm proceeds in four steps (a minimal from-scratch sketch follows the list):

  1. Randomly place K points in the feature space as the initial cluster centers
  2. For each remaining point, compute its distance to each of the K centers and assign it to the cluster of the nearest center
  3. After all points have been assigned, recompute each cluster's center as the mean of the points in that cluster
  4. If the new centers are the same as the previous centers, stop; otherwise repeat from step 2
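
To make these steps concrete, here is a minimal from-scratch sketch in Python with NumPy. The function name kmeans, the random seed, and the convergence check are illustrative choices, not part of the original post, and the sketch ignores the rare empty-cluster case.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # X is an (n_samples, n_features) array
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick K samples as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels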

3 K-means API 

sklearn.cluster.KMeans(n_clusters=8, init='k-means++')

  • n_clusters: the number of clusters, i.e. the number of initial cluster centers
  • init: initialization method; the default is 'k-means++'
  • labels_: the cluster label assigned to each sample, which can be compared with the true labels (compare cluster membership, not the raw label values)

4 Example code
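
A minimal runnable sketch of the API above. The make_blobs toy data, the random seeds, and the choice of 4 clusters are illustrative assumptions, not from the original post.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate toy data: 500 samples drawn around 4 centers
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Fit K-means with 4 clusters and the default k-means++ initialization
estimator = KMeans(n_clusters=4, init='k-means++', random_state=42)
estimator.fit(X)

# labels_ holds the cluster label assigned to each sample
print(estimator.labels_[:10])
# predict assigns new points to the nearest learned center
print(estimator.predict(X[:5]))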

5 K-means performance evaluation metric

5.1 Silhouette coefficient

The silhouette coefficient of a sample i is computed as follows:

s_i = (b_i - a_i) / max(a_i, b_i)

For each sample i in the clustered data, a_i is the average distance from i to the other samples in its own cluster, and b_i is the smallest of the average distances from i to the samples of each other cluster. The final score is the mean of the silhouette coefficients over all samples.

5.2 Numerical analysis of the silhouette coefficient

Analysis process (take blue point 1 as an example):

  • Compute a_i, the average distance between blue point 1 and all other points in its own cluster
  • Compute the average distance from blue point 1 to each of the other two clusters (the red average and the green average), and take the smaller one as b_i
  • Consider the extreme cases of the formula: if b_i >> a_i, the result approaches 1; if a_i >> b_i, the result approaches -1
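
A quick worked example with illustrative numbers: if a_i = 0.2 and b_i = 1.0, then s_i = (1.0 - 0.2) / max(0.2, 1.0) = 0.8, close to 1 (the point sits well inside its cluster); if a_i = 1.0 and b_i = 0.2, then s_i = (0.2 - 1.0) / max(1.0, 0.2) = -0.8, close to -1 (the point was likely assigned to the wrong cluster).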

5.3 Conclusion

If b_i >> a_i, the coefficient approaches 1 and the clustering is good; if b_i << a_i, it approaches -1 and the clustering is poor. The silhouette coefficient always lies in [-1, 1], and the closer it is to 1, the better the cohesion and separation.

5.4 Silhouette coefficient API

sklearn.metrics.silhouette_score(X, labels)

  • X: the feature matrix
  • labels: the cluster labels produced by the clustering
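
A short sketch of how this is used together with KMeans, reusing the make_blobs toy data from the example above; sweeping several values of K is an illustrative pattern, not part of the original post.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Compare the silhouette score for several choices of K;
# the K with the highest score is usually the better clustering
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))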

6 Summary

  • Advantages: an iterative algorithm that is intuitive, easy to understand, and very practical
  • Disadvantages: it easily converges to a local optimum (mitigated by running the clustering multiple times; see the sketch below)
  • Note: clustering is generally done before classification
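
A common mitigation for the local-optimum issue is to rerun K-means from several different random initializations and keep the best run; scikit-learn's KMeans supports this via its n_init parameter. The toy data below is an illustrative assumption.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# n_init=10 runs K-means 10 times with different initial centers
# and keeps the run with the lowest inertia
estimator = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
print(estimator.inertia_)  # sum of squared distances to the closest centers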
