[Clustering algorithm] K-means clustering


0. Preface

A summary of the K-means clustering algorithm.

1. Main text

1.1 Introduction

Put simply: given a set of data, first select k samples as the cluster centers; then assign every sample to its nearest cluster center according to distance; for each resulting cluster, compute the mean of its members and use that as the new cluster center; repeat this process until the cluster centers stabilize. At that point the clustering of the data is complete.

The overall process is fairly easy to follow.

1.2 Steps

  1. Preprocess the data, mainly standardization and outlier filtering.
  2. Randomly select k samples as the initial cluster centers.
  3. Compute the distance from each sample to the k centers and assign it to the nearest center.
  4. For each resulting cluster, recompute the center as the mean of its members.
  5. Repeat steps 3 and 4 until the centers stabilize (a minimal code sketch follows this list).
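The loop above maps directly to code. Below is a minimal NumPy sketch of plain K-means, assuming Euclidean distance; the `kmeans` function name, the tolerance, and the toy data are illustrative, not from the original post:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means on X of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each sample to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its cluster
        # (keep the old center if a cluster happens to be empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: stop once the centers are stable
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers

# Toy usage: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers = kmeans(X, k=2)
```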

1.3 Choosing k

1.3.1 Inflection point method (elbow method)

Compute the within-cluster sum of squared distances for a range of k values. As k increases, this sum decreases monotonically. The point where the slope of the curve suddenly changes from steep to gentle (the "elbow") is taken as an appropriate value of k.

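As an illustration, here is a sketch of the elbow method using scikit-learn's KMeans; the synthetic data and the range of k are assumptions for demonstration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with 3 natural clusters
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])

# inertia_ is the within-cluster sum of squared distances of a fitted model
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# The printed sums drop sharply up to k = 3 and only slowly afterwards;
# that change of slope is the "elbow" that suggests k = 3.
```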

1.3.2 Silhouette coefficient

Each sample has a corresponding silhouette coefficient, built from two quantities:

  • a: the average distance from the sample to the other samples in its own cluster (within-cluster; quantifies cohesion)
  • b: the average distance from the sample to all samples in the nearest neighboring cluster (between-cluster; quantifies separation)

$$S = \frac{b - a}{\max(a, b)}$$
The value of S lies in [-1, 1]; the closer it is to 1, the better the sample fits its own cluster.

Silhouette coefficient of a data set: the average of the silhouette coefficients of all samples in the data set.
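scikit-learn provides both the per-sample coefficient and the data-set average; a small sketch, again with assumed synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Per-sample S = (b - a) / max(a, b)
print(silhouette_samples(X, labels)[:5])
# Silhouette coefficient of the whole data set: the mean over all samples
print(silhouette_score(X, labels))
```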

1.4 Advantages and disadvantages

(1). Advantages

  1. It is unsupervised learning, so no labels are required
  2. The principle is simple and the implementation is easy
  3. The results are interpretable

(2). Disadvantages

  1. The number of clusters k must be chosen in advance; an improper choice may give unsatisfactory results
  2. It may converge to a local optimum, and convergence is slow on large-scale data
  3. It is sensitive to noise and outliers

1.5 Algorithm improvements

The main improvements are the following (not expanded on here):

  • K-means++
  • Bisecting K-means
  • Mini-batch K-means
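For reference, scikit-learn exposes two of these directly: K-means++ as the default initializer of KMeans, and MiniBatchKMeans as a separate estimator (recent versions also include a BisectingKMeans estimator). The data and parameters below are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))

# K-means++: spread the initial centers apart before iterating
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)

# Mini-batch K-means: update centers from small random batches,
# trading a little accuracy for much faster training on large data
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0).fit(X)
```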

