K-means clustering algorithm

        Cluster analysis, also known as group analysis, is a statistical method for classifying samples or indicators, and an important technique in data mining. It is an unsupervised learning method.

        K-Means is one of the most basic clustering algorithms and an important method in unsupervised learning.

        The basic idea of the algorithm is as follows:

        1) Randomly choose k initial points as the cluster centroids.

        2) Calculate the distance between each data sample and the centroid of each cluster using a chosen distance function, and assign the sample to the nearest cluster.

        3) From the newly assigned clusters, recompute the centroids of the k clusters.

        4) Repeat steps 2) and 3) until the termination condition is reached (for example, the centroids move less than a certain threshold between two consecutive iterations). The clustering is then complete.
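
        The four steps above can be sketched in Python with NumPy. This is only a minimal illustration under some assumptions that are not in the original text: the function name kmeans, the parameters max_iter, tol and seed, the use of Euclidean distance, picking the initial centroids from the samples themselves, and leaving an empty cluster's centroid where it was.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Cluster the rows of X into k groups; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples at random as the initial centroids
    # (an assumption; any k random points would match the description).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every sample to every centroid,
        # then assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid (assumption).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 4: stop once no centroid has moved more than tol.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Usage: two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
```

        The max_iter cap is there only as a safety net, so the loop ends even if the threshold is never met.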

        Because the initial centroids are chosen at random, ordinary K-Means may converge to a local optimum. To mitigate this, the algorithm can be run with several random initializations and the best clustering result kept.

        However, when k is large, the best of several random initializations may be only slightly better than a single run, because as k increases, the randomness of each initial centroid's position contributes less uncertainty to the overall result.
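
        One way to sketch the multiple-initialization idea in Python is shown below. The text does not say how the "best" result is judged; this sketch assumes it is the run with the smallest sum of squared Euclidean distances from each sample to its assigned centroid (SSE). The names kmeans_once, kmeans_restarts and n_init are hypothetical and used only for illustration.

```python
import numpy as np

def kmeans_once(X, k, rng, max_iter=100, tol=1e-4):
    """One K-means run from a random start; returns (centroids, labels, sse)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Mean of each cluster; an empty cluster keeps its old centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        moved = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if moved < tol:
            break
    # Final assignment and sum of squared distances to the assigned centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sse = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return centroids, labels, sse

def kmeans_restarts(X, k, n_init=10, seed=0):
    """Run K-means n_init times and keep the run with the lowest SSE."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])
```

        Each restart draws its initial centroids from the same random generator, so the runs differ only in their starting points.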

        To measure the distance from a data point to a cluster centroid, Euclidean distance and cosine distance are commonly used.

        Cosine distance is based on the cosine of the angle between two vectors:

        Cosine of two vectors = dot product of the two vectors / product of the magnitudes of the two vectors

        Cosine distance = 1 - cosine of the two vectors
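
        The two distance measures mentioned above can be sketched in Python with NumPy. The function names are illustrative only, and cosine distance is taken here as 1 minus the cosine of the angle, which is the common convention rather than something stated in the text.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes.

    Undefined if either vector is all zeros (division by zero).
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """1 - cosine similarity: near 0 for vectors pointing the same way."""
    return 1.0 - cosine_similarity(a, b)
```

        Euclidean distance depends on vector length, while cosine distance depends only on direction, so [1, 2] and [2, 4] are apart in the first measure but essentially identical in the second.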

 

 

Origin blog.csdn.net/weixin_43284996/article/details/127349451