Table of contents
Systematic (hierarchical) clustering
Common distances between samples
Clustered pedigree diagram (dendrogram)
How to Determine the K Value: Elbow Rule
Aggregation coefficient: total distortion degree
After determining K, use SPSS to draw a graph
DBSCAN algorithm: density-based clustering method
K-means clustering algorithm
Steps
- Specify the number of clusters K, which is the number of classification categories
- Specify K initial clustering centers
- Calculate the distance between each remaining point and the cluster centers, and assign each sample point to the cluster whose center is closest to it
- Recalculate the center of each cluster as the new cluster center
- Repeat the previous two steps until the cluster centers converge or the specified number of iterations is reached
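The steps above can be sketched in Python. This is a minimal illustration using NumPy, assuming Euclidean distance and random initial centers drawn from the samples; the function and variable names are my own:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick K distinct sample points as the initial cluster centers
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each cluster's center as the mean of its points
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 5: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

On two well-separated groups of points this recovers the expected partition; with unlucky initial centers, plain K-means can still converge to a poor local optimum, which is the motivation for K-means++ below.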
K-means++ addresses the last two shortcomings of K-means. Because K-means++ keeps the cluster centers as far apart as possible, an isolated point far from all other points is likely to be chosen as a center, so such outliers end up in their own class. For the same reason, the choice of initial centers is no longer arbitrary.
K-means++
Basic principle: optimize the random selection of the initial cluster centers so that the initial centers are as far apart as possible
Steps
- Randomly select a sample point as the first cluster center
- Calculate the distance between each remaining sample point and the existing cluster centers (if there are several centers, first compute the centroid of those centers, then measure each remaining point's distance to that centroid). The larger the distance, the greater the probability of being selected as the next cluster center; assign each point such a probability and draw the next center by the roulette-wheel method
- Repeat until K initial clustering centers are selected
- Continue the steps of K-means
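The initialization steps can be sketched as follows. Note this is the standard form of K-means++, which measures each point's distance to its *nearest* existing center (squared, as usual) rather than to the centroid of the centers described in the note above; the names are illustrative:

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: a uniformly random sample point becomes the first center
    centers = [X[rng.integers(len(X))]]
    while len(centers) < K:
        # squared distance from each sample to its nearest existing center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # roulette wheel: farther points get a larger selection probability
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

An already-chosen center has distance zero and hence probability zero, so the K initial centers are always distinct sample points; ordinary K-means then continues from them.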
SPSS
Problems:
- However, neither of the above methods solves the problem that K must be specified manually; you can only try several values of K and see which result is easiest to interpret
- Dimensional (unit) effects: standardize the data first
Systematic (hierarchical) clustering
Steps
- Initially, treat each sample as its own class and calculate the distances between sample points;
- Merge the two classes with the smallest distance into a new class;
- Recalculate the distance between the new class and every other class (a between-class distance);
- Repeat until only one class remains
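These steps can be sketched directly. This is a minimal agglomerative clustering using average (between-groups) linkage, stopped once K classes remain instead of merging down to one; the implementation is my own illustration, not from the notes:

```python
import numpy as np

def hierarchical(X, K):
    # start with every sample as its own class
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > K:
        best = None
        # find the pair of classes with the smallest average linkage distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([np.linalg.norm(X[i] - X[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # merge the two closest classes into a new class
        clusters[a] += clusters.pop(b)
    return clusters
```

Recording the merge distance at each step is exactly what the dendrogram and the aggregation-coefficient table below visualize.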
The scores of 60 students in 6 subjects are known
Clustering the samples: e.g., classifying the students
Clustering the indicators: e.g., classifying the six courses
Common distances between samples
distance between indicators
distance between classes
Between-groups linkage and within-groups linkage are used most often
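The two linkage distances named here can be written down from their usual definitions (a sketch; function names are my own): between-groups linkage averages the distances between points of the two classes, while within-groups linkage averages all pairwise distances in the merged class.

```python
import numpy as np

def between_groups(A, B):
    # average distance over all cross-class pairs
    return np.mean([np.linalg.norm(a - b) for a in A for b in B])

def within_groups(A, B):
    # average distance over all pairs inside the merged class
    M = np.vstack([A, B])
    n = len(M)
    return np.mean([np.linalg.norm(M[i] - M[j])
                    for i in range(n) for j in range(i + 1, n)])
```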
SPSS
Clustered pedigree diagram (dendrogram)
How to Determine the K Value: Elbow Rule
Aggregation coefficient: total distortion degree
The larger the number of categories K, the smaller the aggregation coefficient J
In the iteration table that SPSS produces, the coefficients column corresponds to J and the stage column corresponds to K; plot them in Excel and interpret the elbow:
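The aggregation coefficient itself is easy to compute once a clustering is fixed. A sketch, assuming J is the total within-cluster sum of squared distances to each cluster center (the name `aggregation_coefficient` is mine):

```python
import numpy as np

def aggregation_coefficient(X, labels):
    # J = sum over clusters of squared distances to the cluster's own center
    J = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        J += np.sum((pts - pts.mean(axis=0)) ** 2)
    return J
```

Computing J for K = 1, 2, 3, ... and plotting it gives the elbow curve: J always shrinks as K grows, so the K to pick is where the decrease suddenly flattens, not where J is smallest.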
After determining K, use SPSS to draw a graph
This kind of graph can only be drawn when there are 2 or 3 indicators
After determining K, run the hierarchical clustering again and enter K as the number of clusters under "Save"
DBSCAN algorithm: density-based clustering method
The first two algorithms are distance-based; DBSCAN is a density-based clustering method that handles noise
- Core point: has at least MinPts points within its radius-Eps neighborhood
- Border point: has fewer than MinPts points within radius Eps, but falls within the neighborhood of some core point
- Noise point: a point that is neither a core point nor a border point (draw a circle of radius Eps around the point; if it contains fewer than MinPts points and the point does not lie within any core point's neighborhood, it is noise)
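These three definitions translate directly into code. A sketch that only classifies points as core / border / noise (the cluster-growing step of full DBSCAN is omitted; parameter names `eps` and `min_pts` follow the text, and a point's Eps-neighborhood is taken to include the point itself):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    # pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    # neighbors[i, j] is True when j lies in i's Eps-neighborhood
    neighbors = dists <= eps
    counts = neighbors.sum(axis=1)
    # core: at least MinPts points within radius Eps
    core = counts >= min_pts
    # border: not core, but inside some core point's neighborhood
    border = ~core & (neighbors & core[None, :]).any(axis=1)
    # noise: neither core nor border
    noise = ~core & ~border
    return core, border, noise
```

Full DBSCAN then links core points whose neighborhoods overlap into clusters and attaches each border point to a nearby core point's cluster, while noise points stay unassigned.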