Mathematical Modeling 13: Clustering Models

Table of contents

K-means clustering algorithm

step

K-means++

step

SPSS

Systematic (hierarchical) clustering 

step

Common distances between samples

distance between indicators

distance between classes

SPSS

Cluster pedigree diagram (dendrogram)

How to Determine the K Value - the Elbow Rule

Aggregation coefficient: total distortion

After determining K, use SPSS to draw a graph

DBSCAN algorithm: density-based clustering method


K-means clustering algorithm

step

  1. Specify the number of clusters K, i.e. the number of categories.
  2. Choose K initial cluster centers.
  3. Compute the distance from each remaining point to every cluster center and assign each sample point to the cluster whose center is nearest.
  4. Recompute the center of each cluster and take it as the new cluster center.
  5. Repeat steps 3 and 4 until the centers converge or the specified number of iterations is reached (see the sketch below).
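
A minimal sketch of these steps, assuming a hypothetical data matrix X and K = 3, using scikit-learn's KMeans (which implements exactly this loop):

```python
# Minimal K-means sketch; the data X and the choice K = 3 are placeholders.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 2))                    # 200 sample points, 2 indicators

km = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0)
labels = km.fit_predict(X)                  # steps 3-5: assign, recenter, loop
print(km.cluster_centers_)                  # the converged cluster centers
```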

Advantages:
(1) The algorithm is simple and fast.
(2) It is relatively efficient for processing large data sets.
Shortcomings:
(1) The user must give the number of clusters K in advance.
(2) It is sensitive to the initial centers.
(3) It is sensitive to isolated (outlier) points.
K-means++ addresses the last two shortcomings:
K-means++ keeps the cluster centers as far apart as possible, so an isolated point that is far from all other points is likely to become a cluster center, which puts the isolated point in its own class;
at the same time, keeping the centers far apart means the initial centers are no longer chosen arbitrarily.

K-means++

Basic principle: the random selection of the initial cluster centers is optimized so that the initial centers are as far apart as possible.

step

  1. Randomly select one sample point as the first cluster center.
  2. Compute the distance from each remaining sample point to its nearest existing cluster center; the larger this distance, the greater the probability of that point being selected as the next cluster center. Assign probabilities accordingly and draw the next center with the roulette-wheel method (see the sketch after this list).
  3. Repeat until K initial cluster centers have been selected.
  4. Continue with the ordinary K-means steps.
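
A sketch of this seeding in NumPy (the helper name kmeanspp_init is mine; standard K-means++ weights each point by its squared distance to the nearest already-chosen center):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """K-means++ seeding: roulette-wheel draw with probability
    proportional to the squared distance to the nearest chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                 # step 1: random first center
    for _ in range(k - 1):
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)  # distance^2 to nearest center
        idx = rng.choice(len(X), p=d2 / d2.sum())       # step 2: roulette wheel
        centers.append(X[idx])
    return np.asarray(centers)                          # step 3: K initial centers

centers = kmeanspp_init(np.random.default_rng(1).random((200, 2)), k=3)
```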

SPSS

Problems:

  1. Neither of the two methods above removes the need to specify K by hand; you can only try several values of K and see which result is easiest to interpret.
  2. Both are affected by dimensional (scale) effects, so standardize the data first (see the sketch below).
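
A one-liner for the second problem, with a hypothetical mixed-scale matrix X:

```python
# Z-score standardization removes dimensional (scale) effects
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).random((200, 2)) * [1, 1000]  # mixed scales (hypothetical)
X_std = StandardScaler().fit_transform(X)                   # each column: mean 0, std 1
```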

Systematic (hierarchical) clustering 

step

  1. Initially, treat each sample as its own class and compute the distances between sample points;
  2. Merge the two classes with the smallest distance into a new class;
  3. Recompute the distance between the new class and every other class (a between-class distance);
  4. Repeat until only one class remains.

Example: the scores of 60 students on 6 subjects are known.

Clustering samples: e.g., grouping the 60 students.

Clustering indicators: e.g., grouping the 6 courses.
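
A SciPy sketch of this example (random numbers stand in for the real score matrix, which is not given here):

```python
# Systematic (hierarchical) clustering of a hypothetical 60 x 6 score matrix
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

scores = np.random.default_rng(0).random((60, 6)) * 100  # placeholder scores

Z = linkage(scores, method="average")            # between-group (average) linkage
dendrogram(Z)                                    # the cluster pedigree diagram
plt.show()
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 classes
# to cluster the 6 courses (indicators) instead, pass scores.T to linkage
```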

Common distances between samples
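
The textbook definitions, for samples x_i and x_j described by p indicators:

```latex
% Minkowski distance of order q between samples x_i and x_j
d_q(x_i, x_j) = \Big( \sum_{t=1}^{p} \lvert x_{it} - x_{jt} \rvert^{q} \Big)^{1/q}
% q = 1 gives the absolute-value (Manhattan) distance,
% q = 2 gives the Euclidean distance (the usual default)
```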

distance between indicators
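
Distances between indicators (columns of the data matrix) are usually built from the correlation coefficient; a common textbook choice:

```latex
% Correlation-based distance between indicators j and k,
% r_{jk} being their Pearson correlation coefficient
d_{jk} = 1 - \lvert r_{jk} \rvert
```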

distance between classes

The between-group and within-group linkage methods are the most commonly used.

Shortest distance method (Nearest Neighbor)

Longest distance method (Furthest Neighbor)

Between-group average linkage (Between-group Linkage)

Within-group average linkage (Within-group Linkage)

Center of gravity method (Centroid clustering)
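
Their standard definitions, for classes G1 and G2 with n1 and n2 samples and centers of gravity x̄1 and x̄2:

```latex
D_{\min}(G_1, G_2) = \min_{x \in G_1,\, y \in G_2} d(x, y)   % shortest distance
D_{\max}(G_1, G_2) = \max_{x \in G_1,\, y \in G_2} d(x, y)   % longest distance
D_{B}(G_1, G_2) = \frac{1}{n_1 n_2} \sum_{x \in G_1} \sum_{y \in G_2} d(x, y)
% between-group linkage; within-group linkage instead averages d(x, y)
% over all pairs in G_1 \cup G_2
D_{C}(G_1, G_2) = d(\bar{x}_1, \bar{x}_2)                    % center of gravity
```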

SPSS

Cluster pedigree diagram (dendrogram)

How to Determine the K Value - the Elbow Rule

Aggregation coefficient: total distortion

The larger the number of categories K, the smaller the aggregation coefficient J; the best K sits at the "elbow", where further increases in K stop reducing J noticeably.
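
With C_k denoting cluster k and mu_k its center, the distortion is commonly defined as the within-cluster sum of squares:

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2}
```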

In the agglomeration schedule table that SPSS produces, the Coefficients column corresponds to J and the Stage column corresponds to K; export them to Excel, draw the curve, and read off the elbow (a Python alternative is sketched below).
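
The same curve can be produced in Python instead of Excel (X is hypothetical; KMeans's inertia_ attribute is exactly J):

```python
# Elbow rule: plot the distortion J against K and look for the bend
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 2))   # hypothetical data
ks = range(1, 11)
J = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(list(ks), J, "o-")
plt.xlabel("number of clusters K")
plt.ylabel("aggregation coefficient J")          # the elbow is where the drop flattens
plt.show()
```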

After determining K, use SPSS to draw a graph

This kind of plot can only be drawn when there are 2 or 3 indicators.

After determining K, run the hierarchical clustering again and enter K as the number of clusters under "Save".

DBSCAN algorithm: density-based clustering method

The previous two algorithms are distance-based; DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clusters by density and tolerates noise.

The DBSCAN algorithm divides data points into three categories (a sketch follows the list):
  • Core point: has at least MinPts points within its radius Eps
  • Border point: has fewer than MinPts points within radius Eps but falls inside the neighborhood of some core point
  • Noise point: a point that is neither a core point nor a border point (draw a circle of radius Eps around the point; if it contains fewer than MinPts points and the point lies in no core point's neighborhood, it is noise)
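
A minimal scikit-learn sketch (eps and min_samples play the roles of Eps and MinPts; the data and parameter values are hypothetical):

```python
# DBSCAN: no K required; the label -1 marks noise points
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((200, 2))   # hypothetical data
db = DBSCAN(eps=0.1, min_samples=5).fit(X)      # Eps and MinPts
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int((labels == -1).sum()), "noise points")
```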

Advantages:
1. Being density-based, it can handle clusters of arbitrary shape and size;
2. It can find outliers while clustering;
3. Unlike K-means, the number of clusters does not have to be given in advance.
Shortcomings:
1. It is sensitive to the input parameters Eps (the radius) and MinPts, which are difficult to choose;
2. Because Eps and MinPts are global to the whole data set, clustering quality is poor when cluster densities are uneven or between-class distances differ greatly;
3. When the data set is large, computing the density units is expensive.
When there are only two indicators and the scatter plot shows the dense, irregularly shaped clusters that suit DBSCAN, use DBSCAN for clustering.

Source: blog.csdn.net/m0_54625820/article/details/128704673