Similarity can be measured in several ways:
① Distance formulas: Euclidean distance is used most often; the smaller the distance, the higher the similarity.
② Cosine of the included angle: the larger the value, the higher the similarity. It is the inner product divided by the product of the vector norms: cos θ = (x1 · x2) / (‖x1‖‖x2‖).
③ Jaccard similarity coefficient and correlation coefficient: the Jaccard similarity coefficient measures how similar two sets x1 and x2 are. To keep the properties of a distance, the Jaccard distance is defined as 1 − Jaccard similarity coefficient; it carries the same information, just expressed as a dissimilarity.
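A minimal sketch of the three measures (Python with NumPy; the function names here are my own, not from the notes):

```python
import numpy as np

def euclidean_distance(x, y):
    # Smaller distance = higher similarity.
    return np.linalg.norm(np.asarray(x) - np.asarray(y))

def cosine_similarity(x, y):
    # Inner product divided by the product of the vector norms.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard_similarity(a, b):
    # |intersection| / |union|; the Jaccard distance is 1 minus this.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(euclidean_distance([1, 2, 3], [2, 4, 6]))   # 3.7416...
print(cosine_similarity([1, 2, 3], [2, 4, 6]))    # 1.0 (same direction)
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))   # 0.5 -> Jaccard distance 0.5
```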
Cluster : a category obtained after clustering is called a cluster.
A clustering result can only be judged as reasonable or unreasonable; there is no absolute good or bad.
K-means:
Randomly select K points from the samples as the initial cluster centers; compute the distance from every other sample to these centers and assign each sample to the class of its nearest center.
Once all samples have been assigned, replace each cluster's center with the mean of all the samples in that cluster.
Then use the new cluster centers to re-divide all the samples into classes.
Repeat this cycle until the cluster centers no longer change (or change only minimally), or until the squared error (SSE, i.e., the sum of the squared distances from all the sample points in a cluster to the cluster center) stops changing.
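The four steps above as a minimal NumPy sketch (the function name is my own; the empty-cluster edge case is ignored for brevity):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every sample to the class of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: replace each center with the mean of its cluster's samples.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, k=2)
```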
Factors that affect the clustering result:
①, the choice of the k value
②, the choice of the initial cluster centers: if these fall badly (for example, when the data is unbalanced), the clustering result is likely to be poor, as the small experiment below illustrates.
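An experiment for point ② (scikit-learn is my assumption; the notes name no library). With n_init=1 each run uses a single random initialization, so different seeds can end at noticeably different squared errors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unbalanced data: one large cluster and one small, distant cluster.
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(8, 1, (20, 2))])

for seed in range(3):
    km = KMeans(n_clusters=2, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, km.inertia_)  # inertia_ = within-cluster sum of squared distances
```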
Why take the mean as the cluster center? The principle: within a cluster, the mean is exactly the point that minimizes the squared error (see the derivation below). When there are outliers, however, the mean is easily pulled toward them, so the center may no longer represent the cluster well.
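A one-line sketch of why the mean is the minimizer, by setting the gradient of the within-cluster squared error to zero:

```latex
\frac{\partial}{\partial \mu} \sum_{i=1}^{n} \lVert x_i - \mu \rVert^2
  = -2 \sum_{i=1}^{n} (x_i - \mu) = 0
\quad\Longrightarrow\quad
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
```

An outlier enters this average with full weight, which is exactly why the center gets dragged toward it.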
A point the K-means algorithm needs to address: if the few randomly selected initial cluster centers happen to land very close to each other, the clustering is doomed to a bad result.
Advantages and disadvantages : K-means is simple and converges quickly, but k must be chosen in advance, the result depends on the initial centers, and the mean-based center is sensitive to outliers.
K-means++ algorithm : an improved K-means, one could say.
To address the initialization shortcoming of K-means, it spreads the initial centers apart: the first center is chosen uniformly at random, and each subsequent center is sampled with probability proportional to its squared distance from the nearest center already picked (sketch below).
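A sketch of this seeding step (NumPy; kmeans_pp_init is my own name):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every sample to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # Sample the next center proportionally to that squared distance.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

Far-away samples get a high probability, so the initial centers rarely end up close together.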
K-means Ⅱ algorithm : in fact, it first runs K-means on a small batch of the data, then uses that result as the initial cluster centers and runs K-means on all the data (sketch below).
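A sketch of the two-stage idea (scikit-learn and the helper name are my assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_kmeans(X, k, sample_size=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Stage 1: K-means on a small random batch of the data.
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    small = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
    # Stage 2: use those centers to initialize K-means on all the data.
    # n_init=1 because the explicit centers are used exactly once.
    return KMeans(n_clusters=k, init=small.cluster_centers_, n_init=1).fit(X)
```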