K-means clustering

With an unlabeled data set — only the feature values x, no target y — there is nothing to predict and no ground truth to check an answer against. What we can do with such data is unsupervised machine learning, most commonly clustering or dimensionality reduction. What does clustering do? It mines the structure that exists in the data by grouping similar samples together, helping us explore how a sample data set is divided. For example, users can be grouped into segments, and each segment targeted with a different marketing strategy. Clustering itself comprises many different algorithms.

 

The basic idea of clustering is: birds of a feather flock together. It works by computing the similarity (distance) between samples.

 

K-means clustering:

Step 1: Choose a hyper-parameter k — the number of clusters the samples will be grouped into.

Step 2: From all the samples, randomly select k points as the initial cluster centers.

Step 3: For each remaining point, compute its distance to each of the k cluster centers in turn, and find the center nearest to that sample point.

Step 4: Assign every point to the cluster represented by its nearest cluster center.

Step 5: All samples are now divided into k classes. For each of the k groups of data, compute its centroid; these become the new cluster centers.

Step 6: With the k new cluster centers, repeat steps 3-5.

Step 7: Termination conditions: (a) during the repeated clustering passes, the assignments of all sample points no longer change; or (b) the maximum number of iterations you set is reached, e.g. max_iter = 200.
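The seven steps above can be sketched directly in NumPy. This is a minimal illustration of the procedure, not the library implementation; the function name `kmeans` is my own:

```python
import numpy as np

def kmeans(X, k, max_iter=200, seed=0):
    """Minimal k-means following the seven steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):                      # step 7(b): iteration cap
        # Steps 3-4: assign every point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each cluster's centroid (keep the old center
        # if a cluster happens to receive no points).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 7(a): stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers                      # step 6: repeat with new centers
    return labels, centers
```

On well-separated data this converges in a handful of iterations; the result can depend on the random initialization, which is why libraries restart from several seeds.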

 

The algorithm in scikit-learn:

 https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans 
 
 KMeans(
    n_clusters=8,                  # int, the number of clusters to form
    init='k-means++',              # method for choosing the initial centroids
    n_init=10,                     # run k-means with this many different centroid seeds; the best result in terms of inertia is kept
    max_iter=300,                  # maximum number of iterations for a single run; iteration stops once this is exceeded
    tol=0.0001,                    # tolerance on the change in inertia used to declare convergence and stop iterating
    precompute_distances='auto',
    verbose=0,
    random_state=None,             # random number seed
    copy_x=True,
    n_jobs=None,                   # number of CPU cores to use
    algorithm='auto',
)
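A typical fit-and-inspect round trip with this estimator might look as follows (the data here is synthetic, two blobs of my own construction; assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),    # blob around (0, 0)
               rng.normal(8, 1, size=(20, 2))])   # blob around (8, 8)

km = KMeans(n_clusters=2, init='k-means++', n_init=10,
            max_iter=300, random_state=42)
km.fit(X)

print(km.cluster_centers_)  # the k learned centroids
print(km.labels_)           # cluster index assigned to each sample
print(km.inertia_)          # sum of squared distances to the nearest center
```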

Model Assessment

The following covers how to assess the results of the K-means algorithm. Model evaluation is measured along the following aspects:

1. Sum of squared distances of samples to their closest cluster center (inertia)


# inertia_: an attribute of a fitted K-means model object, representing the sum of squared distances of the samples to their closest cluster center. Since there are no true class labels, it serves as an unsupervised evaluation metric. The smaller the value the better: a smaller value indicates that the samples within each class are more concentrated, i.e. the within-class distances are smaller.
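As a small sanity check on that definition, `inertia_` on a fitted model can be recomputed by hand (assuming scikit-learn is installed; the toy data is mine):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Sum of squared distances from each sample to its assigned center.
manual = sum(np.sum((x - km.cluster_centers_[lab]) ** 2)
             for x, lab in zip(X, km.labels_))
```

Each pair of points sits 1 apart, so each cluster contributes 0.25 + 0.25 and the total inertia is 1.0.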

2. Silhouette coefficient


# Silhouette coefficient: computed as the mean silhouette coefficient over all samples, using for each sample the mean distance to the other points in its own cluster and the mean distance to the points in the nearest neighbouring cluster. It is an unsupervised evaluation metric. Its best value is 1 and its worst value is -1; values near 0 indicate overlapping clusters, and a negative value usually indicates that a sample has been assigned to the wrong cluster.

 

 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score 
 
Finding the best value of k via the silhouette coefficient is, in effect, parameter tuning.
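Such a tuning loop can be sketched with scikit-learn's `silhouette_score` (the data here is synthetic, three well-separated blobs of my own construction):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0, 5, 10)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all samples

best_k = max(scores, key=scores.get)  # k with the highest silhouette
```

For this data the silhouette is maximized at k = 3, the true number of blobs.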
 

3. CH index (Calinski-Harabasz)

 

The CH index measures tightness within a class by the within-class sum of squared distances, and measures separation between classes by the sum of squared distances between each class center and the center of the whole data set. The CH index is then obtained as the ratio of between-class separation to within-class tightness. Thus, a larger CH means the classes themselves are tighter and more dispersed from each other, i.e. a better clustering result.
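scikit-learn exposes this metric as `calinski_harabasz_score`, which can be used the same way as the silhouette to compare candidate values of k (synthetic three-blob data again, my own construction):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0, 5, 10)])

ch = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    ch[k] = calinski_harabasz_score(X, labels)  # larger is better

best_k = max(ch, key=ch.get)
```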
 
 


Origin www.cnblogs.com/BC10/p/11791334.html