Mathematical Modeling--(7) Clustering Model

Clustering Model

        "Things of a feather flock together, and people are divided into groups." The so-called clustering is the process of dividing samples into multiple classes composed of similar objects . After clustering, we can use statistical models in each class to estimate, analyze or predict more accurately ; we can also explore the correlation and main differences between different classes.
        The difference between clustering and classification: classification is a known category, and clustering is unknown.

Table of contents

Clustering Model

1. K-means clustering algorithm

1. Process

2. Graphical K-means

3. Algorithm advantages and disadvantages

2. K-means++ clustering algorithm

1. Process

2. Graphical K-means++

3. Algorithm disadvantages

4. SPSS operation

3. Systematic (hierarchical) clustering algorithm

1. Sample-to-sample distance

2. The distance between indicators

3. Distance between classes

4. Process of system clustering method

5. Case Analysis

4. Elbow rule

5. DBSCAN clustering algorithm

1. Basic classification

2. Advantages and disadvantages

6. Summary


1.  K-means clustering algorithm

1. Process

  1. Specify K, the number of clusters (classes) to be formed;
  2. Randomly select K data objects as the initial cluster centers (they need not be sample points);
  3. Calculate the distance from each remaining data object to each of the K cluster centers, and assign the object to the cluster whose center is nearest to it;
  4. Recalculate the center of each newly formed cluster;
  5. Repeat steps 3 and 4 until the centers converge (no longer change) or the maximum number of iterations is reached;
  6. Finish.
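As a concrete illustration, here is a minimal sketch of this procedure in Python with scikit-learn; the data `X` and the choice of K = 3 are made-up assumptions, not part of the original example:

```python
# A minimal K-means run following the steps above.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                 # 300 two-dimensional samples

# init="random" matches step 2 (random initial centers); fitting performs
# steps 3-5: assign each point to its nearest center, recompute the
# centers, and repeat until convergence or max_iter is reached.
kmeans = KMeans(n_clusters=3, init="random", n_init=10, max_iter=300)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)                # the converged cluster centers
```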

2. Graphical K-means


3. Algorithm advantages and disadvantages

Advantages:

  • The algorithm is simple and fast.
  • For processing large data sets, the algorithm is relatively efficient.

Disadvantages:

  • Requires the user to specify the number of clusters k in advance;
  • Sensitive to the initial cluster centers;
  • Sensitive to outliers.

The K-means++ algorithm addresses shortcomings 2 and 3.


2. K-means++ clustering algorithm

The basic principle by which the K-means++ algorithm selects the initial cluster centers is that the initial cluster centers should be as far away from each other as possible.


1. Process

  1. Randomly select one sample as the first cluster center;
  2. For each sample, calculate its shortest distance to the existing cluster centers (that is, the distance to its nearest cluster center). The larger this value, the higher the probability of the sample being selected as the next cluster center; then use the roulette-wheel method (selection with probability proportional to distance) to pick the next cluster center;
  3. Repeat step 2 until K cluster centers have been selected, then continue with the standard K-means algorithm (a sketch of this initialization follows this list).
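A hand-rolled sketch of this initialization, assuming the caller supplies the data matrix `X` and the number of clusters `k`; in practice scikit-learn's `KMeans(init="k-means++")` performs the same job:

```python
# K-means++ initialization: pick centers far apart via roulette-wheel sampling.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]        # step 1: random first center
    for _ in range(k - 1):
        # step 2: squared distance from each sample to its nearest center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                  # farther points more likely
        idx = rng.choice(len(X), p=probs)      # roulette-wheel selection
        centers.append(X[idx])
    return np.array(centers)                   # step 3: k centers chosen
```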
     

2. Graphical K-means++

        Suppose there are 5 points in the plane. Randomly select A as the first cluster center.

        Calculate the shortest distance between each of the remaining 4 samples and the current cluster center; the larger this value, the higher the probability of being selected as the next cluster center (as shown in Figure 1).

        Use the roulette-wheel method to select D as the second cluster center (E could also have been selected).

        With D as the second cluster center, take the center of gravity of the first cluster center A and the second cluster center D as a virtual cluster center, compute the distances from the other three points B, C, and E to this virtual cluster center, and calculate the corresponding relative probabilities (as shown in Figure 2).

        Then use the roulette-wheel method to select the next cluster center, and so on, until K cluster centers have been selected; finally run the K-means algorithm.


3. Algorithm disadvantages

  1. Although the K-means++ algorithm solves the K-means algorithm's sensitivity to the initial centers and to outliers, K still has to be chosen manually, so the choice of K depends on experience; it is worth trying several values.
  2. If the indicators' dimensions (units) are inconsistent, problems arise, so the data should first be standardized with the formula z_i = (x_i − x̄) / s_x, that is, subtract the mean and divide by the standard deviation. This eliminates the effect of dimension; it can also be done in SPSS. A sketch of this standardization follows this list.
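A minimal Python sketch of this z-score standardization; the sample matrix `X` is a made-up assumption:

```python
# Z-score standardization: subtract each column's mean and divide by its
# standard deviation, so every indicator has mean 0 and standard deviation 1.
import numpy as np

X = np.array([[170.0, 60.0],
              [180.0, 80.0],
              [165.0, 55.0]])         # made-up data: height (cm), weight (kg)

Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.mean(axis=0), Z.std(axis=0))  # approximately 0 and exactly 1 per column
```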


4. SPSS operation

Analysis → Classification → K-Means Cluster; the K-means++ algorithm is used by default here. The Save and Options dialogs are set as shown in the figure below:


3. Systematic (hierarchical) clustering algorithm

        The merging algorithm of systematic clustering computes the distances between classes of data points, merges the two closest classes, and repeats this process until all data points are merged into a single class, producing a cluster pedigree diagram (dendrogram). In addition, systematic clustering can solve the problem of choosing the number of clusters K; the SPSS procedure is given later.

1. Sample-to-sample distance

Example: the table below shows the scores of 30 students in six courses. Classify the 30 students according to their scores.

(Score table omitted.)


Common distances between samples (sample i and sample j):
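For reference, the usual textbook definitions, where each sample has p indicators and x_ik denotes the k-th indicator of sample i:

```latex
% Common sample-to-sample distances
d(x_i, x_j) = \sum_{k=1}^{p} |x_{ik} - x_{jk}|
    \quad \text{(absolute-value distance)}
d(x_i, x_j) = \Bigl( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \Bigr)^{1/2}
    \quad \text{(Euclidean distance)}
d(x_i, x_j) = \Bigl( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^q \Bigr)^{1/q}
    \quad \text{(Minkowski distance)}
d(x_i, x_j) = \max_{1 \le k \le p} |x_{ik} - x_{jk}|
    \quad \text{(Chebyshev distance)}
d(x_i, x_j) = \sqrt{(x_i - x_j)^{\mathsf{T}} \Sigma^{-1} (x_i - x_j)}
    \quad \text{(Mahalanobis distance, with covariance matrix } \Sigma \text{)}
```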


Example: (worked calculation figures omitted.)


2. The distance between indicators

Example: according to the scores of these 30 students, divide the six courses into two categories.

Common distances between indicators (indicator i and indicator j):

These are used less often; a general understanding is enough.
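For reference, two usual definitions, based on the sample correlation coefficient r_ij or on the cosine of the angle between the two indicator vectors:

```latex
% Distances between indicators i and j
d_{ij} = 1 - |r_{ij}|
    \quad \text{(correlation-coefficient distance)}
d_{ij} = 1 - |\cos\theta_{ij}|
    \quad \text{(angle-cosine distance)}
```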


3. Distance between classes

The distance between classes is the distance between sets.

The distance between classes has the following two properties:

  1. A class consisting of a single sample is the most basic class; if every class consists of a single sample, the inter-class distance is simply the inter-sample distance.
  2. If a class contains more than one sample, the inter-class distance must be defined, and it is defined in terms of the inter-sample distance.

  1. Shortest distance method (single linkage):

  2. Longest distance method (complete linkage):

  3. Between-group average linkage method:

  4. Within-group average linkage method:

  5. Center of gravity method (centroid linkage):
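For reference, the standard definitions of these inter-class distances for two classes G_p and G_q containing n_p and n_q samples, built on a sample-to-sample distance d(x_i, x_j):

```latex
% Inter-class distances; \bar{x}_p and \bar{x}_q are the class centers of gravity
D(G_p, G_q) = \min_{x_i \in G_p,\; x_j \in G_q} d(x_i, x_j)
    \quad \text{(shortest distance)}
D(G_p, G_q) = \max_{x_i \in G_p,\; x_j \in G_q} d(x_i, x_j)
    \quad \text{(longest distance)}
D(G_p, G_q) = \frac{1}{n_p n_q} \sum_{x_i \in G_p} \sum_{x_j \in G_q} d(x_i, x_j)
    \quad \text{(between-group average linkage)}
D(G_p, G_q) = d(\bar{x}_p, \bar{x}_q)
    \quad \text{(center of gravity method)}
```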


4. Process of system clustering method

 The flow chart is as follows:
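In code, the whole agglomerative procedure can be sketched with SciPy; the data matrix `X`, the average-linkage choice, and the cut into 3 classes are illustrative assumptions:

```python
# A minimal sketch of systematic (hierarchical) clustering with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))          # e.g. 30 students x 6 course scores

Z = linkage(X, method="average", metric="euclidean")  # repeatedly merge the
                                                      # two closest classes
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 classes
dendrogram(Z)                                    # the cluster pedigree diagram
plt.show()
```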

 


5. Case Analysis

Classify the five students based on their grades in the six subjects. (Score table omitted.)


1. Write the distance matrix between samples (take Euclidean distance as an example)

 

 

2. Treat each sample as its own class, namely G1, G2, G3, G4, G5. Observing that D(G1, G5) = 15.8 is the smallest, merge G1 and G5 into one class, recorded as G6. Calculate the distances between the new class and the other classes to obtain a new distance matrix D1.

3. Observing that D(G2, G4) = 15.9 is the smallest, merge G2 and G4 into one class, recorded as G7. Calculate the distances between the new class and the other classes to obtain a new distance matrix D2.

4. Observing that D(G6, G7) = 18.2 is the smallest, merge G6 and G7 into one class, recorded as G8. Calculate the distances between the new class and the other classes to obtain a new distance matrix D3.

 

5. Finally, merge G8 and G3 into one class, recorded as G9; all five samples now belong to a single class.

 


SPSS operation

Analysis → Classification → System Clustering: add the variables and the labeling field to the corresponding boxes. The Save and Plots settings are as shown below; under Method, choose Z-score standardization if the indicators' dimensions differ.

 

Result analysis :

In the pedigree diagram (dendrogram) produced by SPSS, you can choose the number of clusters according to your needs by drawing a vertical line through the diagram, as shown in the figure below; here the samples are divided into 3 categories: the first is Guangdong and Shanghai; the second is Beijing, Zhejiang, Tianjin, Tibet, and Fujian; the third is the remaining provinces. (The sample data are the average per-capita annual consumption expenditures of urban households in 31 provinces nationwide in 1999.)

4. Elbow rule

        The elbow rule roughly estimates the optimal number of clusters from a graph: take the coefficients from the agglomeration schedule in the SPSS output, sort them, and plot them with Excel, as shown in the figure below:

It can be seen from the figure above that the distortion changes most as K goes from 1 to 5; beyond 5, the change in distortion drops off markedly. Therefore K can be taken as 5 (or even 3), as long as the choice is justified.
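The same curve can be produced in code, using K-means inertia (the total within-cluster sum of squares) as the distortion; the data `X` and the range of K are made-up assumptions:

```python
# Elbow rule: plot distortion against K and look for the bend.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")    # the "elbow" suggests a reasonable K
plt.xlabel("K")
plt.ylabel("distortion (inertia)")
plt.show()
```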


After determining K, you can re-cluster. Note that under Save you should select a single solution with the chosen number of clusters. For example, with K = 3:

5. DBSCAN clustering algorithm

        DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering method proposed by Martin Ester, Hans-Peter Kriegel, et al. in 1996. It does not require the number of clusters to be specified in advance; the number of clusters it produces is variable (it depends on the data).

        The algorithm uses a density-based notion of clustering: it requires that the number of objects (points or other spatial objects) contained in a given region of the clustering space be no less than a given threshold.
        This method can find clusters of arbitrary shape in a spatial database with noise, can connect adjacent regions of sufficient density, and can handle abnormal (outlier) data effectively.


1. Basic classification

The DBSCAN algorithm divides data points into three categories:

  • Core points: at least MinPts points lie within radius Eps;
  • Border points: fewer than MinPts points lie within radius Eps, but the point falls within the neighborhood of a core point;
  • Noise points: points that are neither core points nor border points.
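A minimal scikit-learn sketch showing these three roles in action; `eps` and `min_samples` correspond to Eps and MinPts above, and the data are made up:

```python
# DBSCAN: clusters grow from core points; -1 labels mark noise points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),   # a dense blob
               rng.normal(3, 0.3, size=(100, 2)),   # another dense blob
               rng.uniform(-2, 5, size=(10, 2))])   # sparse noise points

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))    # cluster ids, with -1 for noise
```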

2. Advantages and disadvantages

Advantages:

  1. Based on the density definition, it can handle clusters of arbitrary shapes and sizes;
  2. Outliers can be found while clustering;
  3. Compared with K-means, there is no need to input the number of clusters to be divided.

Disadvantages:

  1. Sensitive to the input parameters ε and MinPts, which are difficult to determine;
  2. Since ε and MinPts are global in the DBSCAN algorithm, clustering quality is poor when cluster densities are uneven and inter-cluster distances differ greatly;
  3. When the amount of data is large, computing the density units is computationally expensive.

6. Summary

  1. The K-means algorithm is simple and fast, and relatively efficient on large data sets.
  2. The K-means++ algorithm optimizes the initialization of the K cluster centers in K-means, but choosing K still needs further discussion; in practice the choice of K can be settled with hierarchical clustering.
  3. If there are only two indicators and a scatter plot shows the data is very "DBSCAN" (dense, arbitrarily shaped clusters), use DBSCAN for clustering; in all other cases, use systematic clustering.
  4. Different clustering methods generally give different results (especially when there are many samples); it is best to apply several methods and look for what is common among them.
  5. Pay attention to the dimensions of the indicators: if they differ too much, the clustering result will be unreasonable.
  6. The result of a cluster analysis may not be satisfactory, because clustering is only a mathematical procedure; we still need to find a reasonable interpretation of the result.

Origin blog.csdn.net/qq_58602552/article/details/130352766