Machine learning: the K-means clustering algorithm, K-means++, and K-means II

 

How similarity is measured:

  ①, by a distance formula — we mainly use the Euclidean distance

  ②, by the cosine of the angle between two vectors: the larger it is, the higher the similarity; it is computed as the inner product divided by the product of the vector norms

  ③, by the Jaccard similarity coefficient and the correlation coefficient

 

 

 The Jaccard similarity coefficient, |x1 ∩ x2| / |x1 ∪ x2|, measures how similar two samples x1 and x2 are. To stay consistent with the properties of a distance (smaller means more similar), 1 minus the Jaccard similarity coefficient is used as the Jaccard distance.
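A minimal NumPy sketch of these three measures (the vectors x1 and x2 below are made-up example data, not from the original post):

```python
import numpy as np

x1 = np.array([1.0, 0.0, 1.0, 1.0])   # made-up example vectors
x2 = np.array([1.0, 1.0, 0.0, 1.0])

# 1) Euclidean distance: smaller means more similar
euclidean = np.linalg.norm(x1 - x2)

# 2) Cosine of the angle: inner product divided by the product of the norms;
#    larger means more similar
cosine = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))

# 3) Jaccard similarity for binary vectors: |intersection| / |union|;
#    1 - Jaccard similarity behaves like a distance
a, b = x1.astype(bool), x2.astype(bool)
jaccard = (a & b).sum() / (a | b).sum()
jaccard_distance = 1.0 - jaccard

print(euclidean, cosine, jaccard, jaccard_distance)
```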

 

 Cluster: a category obtained after clustering is called a cluster.

A clustering result can only be judged as reasonable or unreasonable; there is no absolute good or bad.

K-means:

  Randomly select K points from the samples as the initial cluster centers; compute the distance from every other sample to these centers and assign each sample to the class of its nearest center.

  After all the sample points have been assigned in this way, replace each cluster's center with the mean of all the samples in that cluster.

  Then re-assign all the samples to classes according to the new cluster centers.

  Repeat this cycle until the cluster centers no longer change (or change only minimally), or until the squared error (the sum of squared distances from all the sample points in a cluster to their cluster center) no longer changes. A minimal sketch of this loop follows below.
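A minimal NumPy sketch of the loop described above (the helper name kmeans and the toy data are my own, not from the original post; a real implementation would also handle empty clusters):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    # Randomly pick k samples as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each sample to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace each center with the mean of the samples assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centers (almost) no longer move
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels

# Tiny made-up data set: two obvious blobs
X = np.vstack([np.random.randn(50, 2) + [0, 0],
               np.random.randn(50, 2) + [5, 5]])
centers, labels = kmeans(X, k=2)
print(centers)
```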

 

Factors that affect the clustering result:

  ①, the choice of the value of k

  ②, the choice of the initial cluster centers; if the data is unbalanced, the clustering result is likely to be poor

 Why take the mean as the cluster center? The principle:
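The figure with the original derivation is missing; the standard argument, reconstructed here, is that within one cluster the mean is the point that minimizes the sum of squared distances:

```latex
\min_{c}\; J(c) = \sum_{i=1}^{n} \lVert x_i - c \rVert^2,
\qquad
\frac{\partial J}{\partial c} = \sum_{i=1}^{n} 2\,(c - x_i) = 0
\;\Longrightarrow\;
c = \frac{1}{n}\sum_{i=1}^{n} x_i .
```

So updating each center to the cluster mean is exactly the step that minimizes the within-cluster squared error.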

 

When there are outliers, the mean can be pulled far away from the bulk of the cluster, so the resulting center may represent the cluster poorly.

 

A point the K-means algorithm needs to address: if some of the randomly selected initial cluster centers happen to be very close together, the clustering result can be ruined.

Advantages and disadvantages: K-means is simple and fast, but k must be chosen in advance, the result depends heavily on the initial cluster centers, and the mean-based centers are sensitive to outliers.

 K-means++ algorithm: this can be described as an improved K-means.

 

 To address the shortcoming of K-means described above, the initial cluster centers are chosen so that they end up spread far apart: the first center is picked at random, and each subsequent center is picked with probability proportional to the squared distance from a sample to its nearest already-chosen center.
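A minimal NumPy sketch of this seeding step (the original sentence above is cut off, so this reconstructs the standard K-means++ initialization; the function name kmeans_pp_init is my own, and its output would be fed into an ordinary K-means loop such as the one sketched earlier):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    rng = np.random.default_rng(rng)
    # First center: chosen uniformly at random from the samples
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every sample to its nearest chosen center
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2,
            axis=1)
        # Next center: sampled with probability proportional to that squared
        # distance, so far-away points are more likely to be picked
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```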

K-means II algorithm: in essence, first run K-means clustering on a small batch of the data, and then use the resulting cluster centers as the initial centers for a K-means run over all the data.
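A minimal sketch of that scheme, assuming scikit-learn is available (the function name kmeans_ii and the batch size are my own choices, not from the original post):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_ii(X, k, batch_size=1000, rng=None):
    rng = np.random.default_rng(rng)
    # Step 1: K-means on a small random batch of the data
    idx = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
    small = KMeans(n_clusters=k, n_init=10).fit(X[idx])
    # Step 2: use those centers to initialize K-means over all the data
    full = KMeans(n_clusters=k, init=small.cluster_centers_, n_init=1).fit(X)
    return full.cluster_centers_, full.labels_
```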

 


Origin www.cnblogs.com/qianchaomoon/p/12129080.html