In the previous blog, we mentioned that although k-means clustering can divide samples into k classes, even the improved k-means++ method still has a problem: the clustering result depends heavily on the number of classes k that the user must set in advance.
So is there a way around this problem? Yes: the hierarchical clustering algorithm, introduced in teacher Qingfeng's tutorial↓ (the figures in this article come from teacher Yu Jingxian of Liaoning University of Petroleum and Chemical Technology)
Hierarchical clustering steps
Overall description
- Treat each sample as its own class.
- Use a chosen method to compute the distance between classes, and merge the closest classes into one larger class.
- Treat each newly merged class as a single class and repeat step 2, drawing the cluster pedigree map, until all samples are merged into one class.
- According to the resulting cluster pedigree and the chosen number of classes k, read off the k-class result.
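The merging loop in the steps above can be sketched in pure Python. The five 1-D "student scores" below are hypothetical values, chosen so that the merge order matches the five-student example discussed later; shortest distance (single linkage) is used as the class-to-class distance.

```python
# A minimal pure-Python sketch of the steps above (single-linkage merging).
# The sample values are illustrative assumptions, not data from the article.

samples = {1: 1.0, 2: 4.0, 3: 9.0, 4: 4.5, 5: 1.2}  # five 1-D "students"

# Step 1: each sample starts as its own class.
classes = [[i] for i in samples]
merges = []

def class_dist(a, b):
    # Shortest-distance (single-linkage) definition of D(Gp, Gq).
    return min(abs(samples[i] - samples[j]) for i in a for j in b)

# Steps 2-3: repeatedly merge the two closest classes until one remains.
while len(classes) > 1:
    p, q = min(
        ((p, q) for p in range(len(classes)) for q in range(p + 1, len(classes))),
        key=lambda pq: class_dist(classes[pq[0]], classes[pq[1]]),
    )
    merged = sorted(classes[p] + classes[q])
    merges.append(merged)
    classes = [c for k, c in enumerate(classes) if k not in (p, q)] + [merged]

print(merges)  # the merge history is exactly the cluster pedigree
# → [[1, 5], [2, 4], [1, 2, 4, 5], [1, 2, 3, 4, 5]]
```

Each entry of `merges` is one level of the pedigree; step 4 amounts to stopping the read-out at the level that leaves k classes.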
Cluster pedigree (dendrogram)
According to the results of each round of classification, draw a tree-like pedigree diagram similar to the following figure:
Taking the figure above as an example, the classification process is as follows: in the first round, students 1 and 5 are merged into one class, students 2 and 4 into another, and student 3 stays in a class of its own. In the second round, the classes {1, 5} and {2, 4} are merged, with 3 still on its own. In the last round, all students are merged into a single class (the complete set of all samples).
Finally, the k classes are obtained by cutting the pedigree in the figure above at different levels. It can be seen that when k = 2, student 3 forms one class and students 1, 2, 4, and 5 form the other. When k = 3, students 1 and 5 form one class, students 2 and 4 another, and student 3 a class of its own, and so on.
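Cutting the pedigree at a chosen k can be sketched with SciPy. The 1-D "student scores" below are hypothetical values chosen so that the result matches the five-student example above.

```python
# A sketch, assuming SciPy is available, of cutting the pedigree (dendrogram)
# at k = 2 and k = 3; the scores are illustrative, not from the article.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows are students 1..5 (hypothetical 1-D scores).
scores = np.array([[1.0], [4.0], [9.0], [4.5], [1.2]])

# Build the merge history using the shortest-distance (single-linkage) method.
Z = linkage(scores, method="single")

# Cut the tree into exactly k classes.
labels_k2 = fcluster(Z, t=2, criterion="maxclust")  # {3} vs. {1, 2, 4, 5}
labels_k3 = fcluster(Z, t=3, criterion="maxclust")  # {1, 5}, {2, 4}, {3}
print(labels_k2, labels_k3)
```

Note that `linkage` is run only once: both cuts are read off the same pedigree, which is precisely why k need not be chosen before clustering.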
5 ways to calculate the distance between classes
At the start of the classification, each sample forms its own class, so the distance between sample points is the distance between classes.
After that, there are five common methods for computing the distance between classes:
- Shortest distance method (nearest neighbor): take the shortest distance between points of the two classes as the distance between them; $D(G_p, G_q)$ is the length of the red line in the figure below.
- Longest distance method (furthest neighbor): take the longest distance between points of the two classes as $D(G_p, G_q)$, the red line in the figure below.
- Between-group average method: compute the distance between every point of one class and every point of the other (the red lines in the figure below), and take the average of all these distances as the distance between the two classes.
- Within-group average method: merge the two classes, compute the distances between all pairs of points in the merged class (the red lines in the figure below), and take their average as the distance between the two classes.
- Center of gravity method: take the center of gravity of each class's points as that class's center, and use the distance between the two centers as $D(G_p, G_q)$.
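To make the five definitions concrete, here is a small pure-Python sketch computing $D(G_p, G_q)$ for two fixed 1-D classes under each definition; the point values are illustrative assumptions.

```python
# Computing D(Gp, Gq) under the five definitions above, for two small
# hypothetical 1-D classes.
from itertools import combinations

Gp = [1.0, 1.2]  # class p
Gq = [4.0, 4.5]  # class q

between = [abs(x - y) for x in Gp for y in Gq]  # all cross-class distances

d_single = min(between)                # shortest distance (nearest neighbor)
d_complete = max(between)              # longest distance (furthest neighbor)
d_average = sum(between) / len(between)  # between-group average

# Within-group average: merge the classes and average ALL pairwise distances.
merged = Gp + Gq
all_pairs = [abs(x - y) for x, y in combinations(merged, 2)]
d_within = sum(all_pairs) / len(all_pairs)

# Center of gravity method: distance between the two class centroids.
cp, cq = sum(Gp) / len(Gp), sum(Gq) / len(Gq)
d_centroid = abs(cp - cq)

print(d_single, d_complete, d_average, d_within, d_centroid)
```

These definitions correspond to the `single`, `complete`, `average`, and `centroid` methods of SciPy's `scipy.cluster.hierarchy.linkage`; the within-group average has no direct SciPy counterpart and is computed by hand here.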
Improvements made by hierarchical clustering
We know that k-means clustering first selects the number of classes k, then picks the initial cluster centers, and classifies from there.
Hierarchical clustering is different: it does not begin by choosing "how many classes", but instead classifies directly from the structure of the samples; the k classes actually needed are then simply read off from the finished clustering result.
If k-means clustering is a front-to-back classification method, hierarchical clustering is a back-to-front one. Since its classification process does not depend on the number of classes k in advance, it effectively solves the remaining problem of k-means clustering.
Finally, I recommend teacher Qingfeng's mathematics course; the entrance to the trial class is below.