[Clustering Models ②] Hierarchical clustering (system clustering): solving the remaining problem of k-means clustering

In the previous blog, we mentioned that although k-means clustering can divide samples into k classes, even the improved k-means++ method still has a problem: the clustering result depends heavily on the user-chosen number of classes k.

So is there a way to solve this problem? Teacher Qingfeng's tutorial introduces the hierarchical clustering algorithm ↓ (the figures in this article come from teacher Yu Jingxian of Liaoning University of Petroleum and Chemical Technology)

Hierarchical clustering steps

Overall description

  1. Treat each sample as its own class
  2. Compute the distance between classes with a chosen method, and merge the closest classes into one larger class
  3. Treating the newly merged classes as subclasses, repeat step 2 and draw the cluster pedigree chart (dendrogram) until all samples belong to a single class
  4. From the resulting dendrogram and the chosen number of classes k, read off the k-class partition
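The four steps above can be sketched in plain Python. This is a minimal illustration, not the tutorial's code: it assumes Euclidean distance and the shortest-distance (single linkage) rule, and the five-point data set is invented for the example.

```python
import numpy as np

def agglomerative(X, k):
    """Merge the two nearest classes (single linkage) until only k remain."""
    clusters = [[i] for i in range(len(X))]   # step 1: each sample is its own class

    def dist(a, b):
        # shortest distance between any point of class a and any point of class b
        return min(np.linalg.norm(X[i] - X[j]) for i in a for j in b)

    while len(clusters) > k:                  # steps 2-3: repeatedly merge the closest pair
        p, q = min(((p, q) for p in range(len(clusters))
                    for q in range(p + 1, len(clusters))),
                   key=lambda pq: dist(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] += clusters.pop(q)
    return clusters                           # step 4: the k-class partition

# Hypothetical "students": 1 and 5 close, 2 and 4 close, 3 far from everyone.
X = np.array([[1, 2], [8, 8], [20, 1], [8.5, 7.5], [1.5, 2.5]])
print(agglomerative(X, 2))
```

Recording the merge order and heights instead of stopping at k is what yields the dendrogram; here we stop early only to keep the sketch short.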

The cluster pedigree chart (dendrogram)

According to the results of each merge, draw a tree-shaped pedigree chart like the figure below:

Taking the figure as an example, the classification process is as follows: in the first merge, students 1 and 5 form one class and students 2 and 4 form one class, while student 3 stays in a class of its own. In the second merge, students 1, 5, 2 and 4 form one class, and 3 is still a class of its own. In the last merge, all students fall into a single class (the full set of all samples).

(dendrogram figure omitted)

Finally, the k classes are read off the chart above by cutting it at different heights:

(figure omitted)

As the figure shows, when k = 2 the partition puts student 3 in one class and students 1, 2, 4 and 5 in the other.

When k = 3, students 1 and 5 form one class, 2 and 4 another, and 3 a class of its own, and so on.

Five ways to compute the distance between classes

At the start of the classification, each sample is a class of its own, so the distance between sample points is the distance between classes.

From then on, there are five common methods for computing the distance between classes:

  1. The shortest distance method (nearest neighbor)
    takes the shortest distance between points of the two classes as the inter-class distance D(G_p, G_q), the length of the red line in the figure below:
    (figure omitted)

  2. The longest distance method (furthest neighbor)
    takes the longest distance between points of the two classes as the inter-class distance D(G_p, G_q), the red line in the figure below:
    (figure omitted)

  3. The between-group average distance method
    computes the distance between every point of one class and every point of the other (the red lines in the figure below) and takes the average of all these distances as the inter-class distance:
    (figure omitted)

  4. The within-group average distance method
    merges the two classes, computes the distances between all pairs of points in the merged class (the red lines in the figure below), and takes their average as the inter-class distance:
    (figure omitted)

  5. The centroid (center of gravity) method
    takes the centroid of each class's points as the class center, and the distance between the two centers as the inter-class distance D(G_p, G_q):
    (figure omitted)
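The five definitions are easy to compare side by side on two small clusters. This is a NumPy sketch with made-up point coordinates, not a prescribed implementation.

```python
import numpy as np

# Two hypothetical classes G_p and G_q (rows are sample points on a line).
Gp = np.array([[0.0, 0.0], [1.0, 0.0]])
Gq = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise distances between a point of G_p and a point of G_q.
cross = np.array([[np.linalg.norm(p - q) for q in Gq] for p in Gp])

d_single   = cross.min()    # 1. shortest distance (nearest neighbor)
d_complete = cross.max()    # 2. longest distance (furthest neighbor)
d_between  = cross.mean()   # 3. between-group average distance

# 4. within-group average: mean distance over all point pairs of the merged class
merged = np.vstack([Gp, Gq])
pairs = [np.linalg.norm(a - b)
         for i, a in enumerate(merged) for b in merged[i + 1:]]
d_within = np.mean(pairs)

# 5. centroid method: distance between the two class centroids
d_centroid = np.linalg.norm(Gp.mean(axis=0) - Gq.mean(axis=0))

print(d_single, d_complete, d_between, d_within, d_centroid)
```

Note how the five numbers differ on the same pair of classes; which definition to use changes which classes get merged first, and hence the shape of the dendrogram.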

Improvements made by hierarchical clustering

We know that k-means clustering works by first fixing the number of classes k, then selecting initial cluster centers, and then classifying from there.

Unlike the k-means algorithm, hierarchical clustering does not first decide "how many classes to split into". Instead, it classifies directly from the features of the samples, and only at the end, according to actual needs, reads the k-class partition off the finished classification results.

If k-means clustering classifies from front to back, hierarchical clustering classifies from back to front. Since its classification process does not directly depend on the number of classes k we need, it effectively solves the remaining problem of k-means clustering √

Finally, I recommend teacher Qingfeng's mathematics course; this is the entrance to the trial class.

Origin: blog.csdn.net/weixin_44559752/article/details/107869202