Clustering (1) - K-means

Clustering

K-means clustering

  Clustering is one of the main research directions in machine learning and data mining, and it is an unsupervised learning method. As a graduate student, the editor worked for some time on "adaptive clustering algorithms for data streams" and so has a fairly deep understanding of clustering algorithms. He has decided to start a series on clustering, hoping it helps readers get started with and study clustering-related algorithms. Clustering can be used as a standalone task to discover the internal distribution structure of data, and it often serves as a preliminary step for other learning tasks, so it is widely used. Today, the editor will take you through the basics of clustering and introduce the first clustering algorithm, K-means.

 

Q: What is clustering?

A: Clustering divides a data set into different classes (clusters) according to a particular criterion (a similarity measure such as Euclidean distance), so that data objects in the same cluster are as similar as possible and objects in different clusters are as different as possible. A visualized clustering result is shown below:

image.png

    Clustering has a very wide range of applications. For example, clustering spending records can reveal different customer groups, clustering gene expression data can be used to study the genetic characteristics of different populations, and clustering text can quickly find articles on related topics.

 

Q: How is similarity measured?

A: Commonly used similarity measures include the following:

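  • Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$

  • Minkowski distance: $d(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$

  • Manhattan distance: $d(x, y) = \sum_{i=1}^{d} |x_i - y_i|$

  • Cosine distance: $d(x, y) = 1 - \cos\theta$, where $\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$

  • Jaccard similarity coefficient: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

  • Correlation coefficient: $\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$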
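The measures listed above can be sketched in a few lines of plain Python. This is a minimal illustration using the standard textbook definitions; the function names are chosen here for illustration and do not come from the original post. Note that the code computes cosine similarity (the cosine of the angle between two vectors), from which a cosine distance is commonly derived as one minus that value.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (both must be non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(a, b):
    """Jaccard coefficient of two collections treated as sets (not both empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

print(euclidean((0, 0), (3, 4)))      # 5.0
print(manhattan((0, 0), (3, 4)))      # 7
print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
```

Which measure fits best depends on the data: distances such as Euclidean suit numeric coordinates, while the Jaccard coefficient compares sets.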



 

Q: What are the commonly used clustering algorithms?

A:

  • Partition-based clustering: k-means, mean shift

  • Hierarchical Clustering: BIRCH

  • Density Clustering: DBSCAN

  • Model-based clustering: GMM

  • Affinity propagation

  • Spectral clustering

    The list above only introduces these algorithms by name. In the following posts we will discuss specific classical clustering algorithms and study their principles, advantages and disadvantages, and application scenarios. Today we will learn the most classic clustering algorithm, K-means.

 

K-means

 

K-means clustering principle

    The idea of K-means is very simple: given a sample set, partition it into k clusters according to the distances between samples, so that each point belongs to the cluster whose center (i.e., the mean of the cluster's points) is nearest to it. It is called K-means because it finds k clusters (k is specified by the user), and each cluster's center is represented by the mean of the data belonging to that cluster.

 

K-means clustering algorithm

     Each sample in the data set X = {x1, ..., xn} is an unlabeled d-dimensional data point. K-means clustering aims to assign these n points to k clusters S = {S1, ..., Sk} such that the sum of squared distances from the points in each cluster to that cluster's center (mean) is minimized, i.e., to find the optimal solution of the following objective function:

  $$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

  where μi is the mean of the points in cluster Si.
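To make the objective concrete, here is a small sketch that evaluates it for a given partition: the sum, over all clusters, of the squared distances from each point to its cluster mean. The function name `within_cluster_sse` and the toy points are assumptions made for illustration.

```python
def within_cluster_sse(clusters):
    """Sum of squared distances from every point to the mean of its cluster.

    `clusters` is a list of clusters; each cluster is a non-empty list of
    equal-length tuples.
    """
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Cluster center: coordinate-wise mean of the cluster's points.
        mu = [sum(p[j] for p in points) / len(points) for j in range(dim)]
        total += sum((p[j] - mu[j]) ** 2 for p in points for j in range(dim))
    return total

# Two ways to split the same three points into two clusters.
good = [[(0, 0), (2, 0)], [(10, 10)]]
bad = [[(0, 0)], [(2, 0), (10, 10)]]
print(within_cluster_sse(good))  # 2.0
print(within_cluster_sse(bad))   # 82.0
```

The partition that groups the two nearby points together has the smaller objective value, which is the partition K-means prefers.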

    However, finding the exact solution of this objective is not easy, because it is an NP-hard problem, so K-means uses a heuristic, iterative method:

    First, k objects are selected at random as the initial cluster centers. Then the distance from each sample to each cluster center is computed, and the sample is assigned to the nearest cluster center. Once all objects have been assigned, each cluster's center (mean) is recalculated and used as the center for the next iteration. This process is repeated until any of the following conditions is satisfied:

  • No object is reassigned to a new class;

  • Cluster centers no longer change;

  • The sum of squared errors reaches a local minimum.
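The iteration described above can be sketched in plain Python as follows. This is a minimal sketch rather than a tuned implementation: the function names, the `max_iter` cap, the `seed` parameter, and the choice to keep an old center when a cluster becomes empty are all assumptions made here. It stops when no point changes cluster, which is the first stopping condition in the list.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels) for a list of points and a cluster count k."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k randomly chosen input points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point was reassigned, so the centers are stable too
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = mean(members)
    return centers, labels

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, k=2)
print(centers)
print(labels)
```

On this toy data the two well-separated groups are recovered, although which group receives label 0 depends on the random initialization.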

 

Tips:

  • Choosing k: prior knowledge about the data usually suggests a suitable k; if not, an appropriate k can be selected by cross-validation;

  • Initializing the k center points: they can be selected at random, or chosen so that they are as far from each other as possible.
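One possible reading of "as far from each other as possible" is the greedy farthest-first rule sketched below: start from one point, then repeatedly add the point whose distance to its nearest already-chosen center is largest. The original post does not specify a procedure, so the function name and the `first_index` parameter are assumptions for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_centers(points, k, first_index=0):
    """Greedily pick k initial centers that are spread far apart."""
    centers = [points[first_index]]
    while len(centers) < k:
        # Pick the point that is farthest from its nearest chosen center.
        next_center = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers),
        )
        centers.append(next_center)
    return centers

print(farthest_first_centers([(0, 0), (1, 0), (10, 0), (5, 0)], k=3))
# [(0, 0), (10, 0), (5, 0)]
```

Because this rule always takes the single farthest point, it tends to pick outliers as centers; sampling in proportion to distance instead softens that tendency.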

K-means++ algorithm

  As mentioned above, the choice of the k initial centers has a significant impact on both the result and the running time of the clustering algorithm. The K-means++ algorithm was proposed to improve on random initialization of the cluster centers:

  1. Randomly select one point from the input data set as the first cluster center μ1;

  2. For each data point xi, compute D(xi), its distance to the nearest cluster center already selected;

  3. Select a new data point as the next cluster center, such that points with larger D(xi) are more likely to be selected (in K-means++ the selection probability is proportional to D(xi)^2);

  4. Repeat steps 2 and 3 until k cluster centers have been selected;
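The seeding steps above can be sketched as follows. The sketch samples each new center with probability proportional to D(xi)^2, the weighting used in K-means++. The function names and the `seed` parameter are assumptions for illustration, and the code assumes the data contains at least k distinct points.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_centers(points, k, seed=0):
    """Pick k initial centers with K-means++ style seeding."""
    rng = random.Random(seed)
    # Step 1: the first center is a uniformly random data point.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Step 2: squared distance from each point to its nearest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        # Step 3: sample the next center with probability proportional to D(x)^2.
        # Points that are already centers have weight 0 and are not drawn again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans_pp_centers(data, k=2))
```

The returned centers would then replace the purely random initial centers before running the usual assignment and update iterations.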

    

K-means algorithm summary

Advantages:

  • The principle is simple, it is easy to implement, and it converges quickly;

  • The clustering quality is generally good;

  • The results are highly interpretable and intuitive;

  • There is only one parameter to set, k;

Disadvantages:

  • The choice of k has a large impact on the clustering result;

  • Convergence is more difficult on data sets that are not convex;

  • Clustering quality is poor when the classes are imbalanced;

  • The result may be only a local optimum;

  • It is sensitive to noise;

  • The resulting clusters are spherical.
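The sensitivity to noise follows from using the mean as the cluster center: a single extreme value can pull the center far away from the bulk of the cluster. A tiny one-dimensional illustration, with made-up numbers:

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

cluster = [1.0, 2.0, 3.0]
with_outlier = cluster + [100.0]

print(mean(cluster))       # 2.0
print(mean(with_outlier))  # 26.5
```

With the outlier included, the center lands at 26.5, far from any of the three original points.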

 

Summary:

    Today was the first part of our study of clustering algorithms. The content is simple but very important: K-means is often used as a building block of other algorithms, for example in semi-supervised learning and in the spectral clustering algorithm that will be covered later. I hope you gained something from today's study. The next post in the clustering series covers spectral clustering, so stay tuned!


Origin www.cnblogs.com/PJQOOO/p/11825586.html