【Mathematical Modeling】Clustering Models

  • Classification vs. clustering
    • Classification: the categories are known in advance, and each sample is assigned to one of the existing categories.
    • Clustering: the categories are not known in advance; the samples themselves are divided into groups.

1. K-means clustering algorithm

1.1 Understanding K-means algorithm

  • Algorithm process (a flowchart is recommended: it is more concise and helps lower the similarity score in plagiarism checks)

    • Flowchart drawing: Edraw, PPT, Visio, and similar software can be used (figure)
  • Illustration (figure)

  • K-means cluster visualization
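  • A minimal MATLAB sketch of the process above, using the built-in kmeans from the Statistics and Machine Learning Toolbox (the variable names X and k below are placeholders for your data matrix and chosen cluster count):

X = rand(100, 2);                  % placeholder: 100 samples with 2 indicators
k = 3;                             % number of clusters (must be given in advance)
[idx, C] = kmeans(X, k, 'Replicates', 10);  % idx: cluster label of each sample; C: centers
                                            % 'Replicates' reruns with fresh starts to
                                            % reduce sensitivity to initialization
gscatter(X(:,1), X(:,2), idx);     % visualize the clusters (works for 2 indicators)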

1.2 K-means algorithm evaluation

  • Advantages:
    • ① The algorithm is simple and fast.
    • ② It is relatively efficient on large data sets.
  • Disadvantages:
    • ① The user must specify the number of clusters K in advance.
    • ② It is sensitive to the initial cluster centers.
    • ③ It is sensitive to outliers (isolated points).
    • (② and ③ can be alleviated by K-means++)

2. K-means++ algorithm

2.1 Understanding K-means++ algorithm

  • The basic principle for selecting the initial cluster centers: the initial cluster centers should be as far away from one another as possible.
  • Algorithm process:
    • (Only the "initialize K cluster centers" step of the K-means algorithm is changed)
    • Step 1: Randomly select one sample as the first cluster center;
    • Step 2: Compute the shortest distance between each sample and the existing cluster centers (i.e., the distance to its nearest cluster center); the larger this value, the higher the probability of being selected as the next center. Then use the roulette-wheel method (selection with probability proportional to this value) to pick the next cluster center;
    • Step 3: Repeat Step 2 until K cluster centers have been selected, then run the standard K-means algorithm from these initial centers.
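  • As a concrete illustration, here is a minimal MATLAB sketch of the seeding steps above (assuming, as standard K-means++ does, that the selection probability is proportional to the squared shortest distance; X is an n-by-p data matrix):

function C = kmeanspp_init(X, k)
% K-means++ seeding: pick initial centers that are far apart.
n = size(X, 1);
C = X(randi(n), :);                    % Step 1: first center chosen at random
for j = 2:k
    D = min(pdist2(X, C).^2, [], 2);   % Step 2: squared distance to nearest center
    p = D / sum(D);                    % selection probability proportional to D
    i = find(rand <= cumsum(p), 1);    % roulette-wheel selection
    C = [C; X(i, :)];                  % Step 3: add the chosen sample as a center
end
end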

2.2 Implementing K-means++ in SPSS

  • Specific operations after importing the data (select K-means clustering; the K-means++ algorithm is used by default) (figures)
  • Analysis results:
    • ① The clustering results (figure)
    • ② The distances between the k cluster centers after partitioning into k classes (figure)
    • ③ The number of samples in each cluster (figure)
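  • (The same three outputs can be reproduced in MATLAB if SPSS is unavailable; a minimal sketch, assuming X is the data matrix and k the cluster count:)

[idx, C] = kmeans(X, k);            % ① idx gives the cluster of each sample
Dcenters = squareform(pdist(C));    % ② matrix of distances between the k centers
counts = accumarray(idx, 1);        % ③ number of samples in each cluster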

2.3 Some discussion of the K-means algorithm

  • ① How should the number of clusters K be determined?
    • Answer: The choice mainly depends on experience and judgment. The usual approach is to try several values of K and see which partition is easiest to interpret and best matches the purpose of the analysis.
    • For example, when segmenting consumer groups, either two or three classes may be acceptable: two classes can be interpreted as high and low consumption, and three classes as high, medium, and low consumption.
  • ② What if the indicators have inconsistent dimensions (units)?
    • Answer: If the indicators are measured in different units, distances computed on the raw data are meaningless.
    • For example, if X1 is measured in meters and X2 in tons, the distance formula would add "meters squared" to "tons squared" and then take a square root; the result has no mathematical meaning.
    • Processing method: standardize the data first (in SPSS, the Descriptives dialog can save standardized values as new variables); see the sketch after this list. The descriptive statistics table generated along the way can be placed in the paper. (figures)
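  • A minimal MATLAB sketch of the standardization step (z-scores: subtract each column's mean, divide by its standard deviation), assuming X is the raw data matrix:

Xz = zscore(X);   % every column now has mean 0 and standard deviation 1,
                  % so the units cancel; cluster on Xz instead of X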

3. System/hierarchical clustering

  • Both K-means and K-means++ require the final number of clusters to be set manually; system clustering does not require this step.
  • The merging algorithm of system clustering computes the distances between classes of data points, merges the two closest classes, and repeats this process until all data points are merged into a single class, producing a clustering pedigree (dendrogram).
  • (This section refers to the teaching pdf of Liaoning Petrochemical University)

3.1 Common distances between samples (sample i and sample j)

  • Applicable situation: measuring the distance between two samples (figure)
  • Formulas:
    • Absolute (Manhattan) distance, commonly used for grid-like paths: d(x_i, x_j) = Σ_k |x_ik − x_jk|
    • Euclidean distance, commonly used in other cases: d(x_i, x_j) = √( Σ_k (x_ik − x_jk)² )
  • Example (figure)
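  • A minimal MATLAB sketch of the two sample distances above (X is an n-by-p matrix, one sample per row):

Dabs = squareform(pdist(X, 'cityblock'));   % absolute (Manhattan) distance
Deuc = squareform(pdist(X, 'euclidean'));   % Euclidean distance (pdist's default)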

3.2 Commonly used “distance” between indicators (indicator i and indicator j)

  • Problems that require clustering indicators (rather than samples) are rare.

  • Applicable situation: measuring the similarity between two indicators (figure)

  • Formula: commonly defined from the correlation coefficient between the two indicators (figure)

  • Example (figure)
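  • If an indicator-level "distance" is needed, one common convention is 1 minus the correlation between indicators; MATLAB's pdist supports this directly (a sketch, assuming the indicators are the columns of X):

Dind = squareform(pdist(X.', 'correlation'));  % rows of X.' are the indicators;
                                               % 'correlation' gives 1 - r between them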

3.3 Common distances between classes

  • Applicable situation: measuring the distance between two classes (figure)

  • 1. A class consisting of a single sample is the most basic class; if every class consists of a single sample, then the distance between samples is the inter-class distance.

  • 2. If a class contains more than one sample, the inter-class distance must be defined. It is defined in terms of the distances between samples, and there are several common definitions:

    • The default is the center of gravity (centroid) method;
    • In system clustering, between-groups and within-groups average linkage are relatively common choices;
    • (Any of the clustering methods may be chosen, as long as the model can be explained clearly) (figure)
① Shortest distance method (single linkage)

The inter-class distance is the smallest distance between a sample in one class and a sample in the other.

② Longest distance method (complete linkage)

The inter-class distance is the largest distance between a sample in one class and a sample in the other.

③ Between-groups average linkage method

The inter-class distance is the average of the distances between all pairs of samples taken one from each class.

④ Within-groups average linkage method

The inter-class distance is the average distance over all pairs of samples in the combined class.

⑤ Center of gravity (centroid) method

The inter-class distance is the distance between the centroids (mean vectors) of the two classes.

3.4 System clustering process

  • Do not use the figure below directly in a paper; modify it yourself first (change the form, add a description of the content, etc.)
  • The algorithm flow of systematic (hierarchical) clustering:
    • ① Treat each object as its own class and compute the distance between every pair of objects;
    • ② Merge the two classes with the smallest distance into a new class;
    • ③ Recalculate the distance between the new class and all classes;
    • ④ Repeat steps 2 and 3 until all categories are finally merged into one category;
    • ⑤ End.

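The whole procedure above is a few lines in MATLAB (a minimal sketch; Xz is the standardized data matrix and the linkage method name is one of the choices from Section 3.3):

Z = linkage(pdist(Xz), 'single');   % 'single' = shortest distance; 'complete',
                                    % 'average', 'centroid', 'ward' also available
dendrogram(Z);                      % the clustering pedigree diagram
labels = cluster(Z, 'maxclust', 3); % cut the tree into, e.g., 3 classes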

3.5 Complete problem-solving process

【Problem】
  • Classify five students based on their scores in six courses (figure: the 5 × 6 score table)
[Solution 1: Use the shortest distance system clustering method]
  • (1) Calculation process

    • 1. Write the distance matrix between samples (taking Euclidean distance as an example) (figure)
    • 2. Treat each sample as its own class, namely G1, G2, G3, G4, G5. Observing that D(G1, G5) = 15.8 is the smallest, G1 and G5 are merged into one class, denoted G6. Compute the distances between the new class and the other classes to obtain the new distance matrix D1 (figures)
    • 3. Observing that D(G2, G4) = 15.9 is the smallest, G2 and G4 are merged into one class, denoted G7. Compute the distances between the new class and the other classes to obtain the new distance matrix D2 (figures)
    • 4. Observing that D(G6, G7) = 18.2 is the smallest, G6 and G7 are merged into one class, denoted G8. Compute the distances between the new class and the other classes to obtain the new distance matrix D3 (figure)
    • 5. Finally, G8 and G3 are merged into one class, denoted G9.
  • (2) Clustering pedigree diagram (dendrogram) (figure)

[Solution 2: Use the longest distance system clustering method]
  • (1) Calculation process

    • 1. Write the distance matrix between samples (taking Euclidean distance as an example) (figure)
    • 2. Treat each sample as its own class, namely G1, G2, G3, G4, G5. Observing that D(G1, G5) = 15.8 is the smallest, G1 and G5 are merged into one class, denoted G6. Compute the distances between the new class and the other classes to obtain the new distance matrix D1 (figure)
    • 3. Observing that D(G2, G4) = 15.9 is the smallest, G2 and G4 are merged into one class, denoted G7. Compute the distances between the new class and the other classes to obtain the new distance matrix D2 (figure)
    • 4. Observing that D(G3, G7) = 32.4 is the smallest, G3 and G7 are merged into one class, denoted G8. Compute the distances between the new class and the other classes to obtain the new distance matrix D3 (figure)
    • 5. Finally, G8 and G6 are merged into one class, denoted G9.
  • (2) Clustering pedigree diagram (dendrogram) (figure)

[Other solutions]
  • Between-groups average linkage system clustering
  • Centroid system clustering
  • Within-groups average linkage system clustering
  • Note: these methods differ only in how the distance between the new class and the other classes is computed. Any of them may be used, as long as it can be explained clearly. A verification sketch follows.
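  • A minimal MATLAB sketch for checking the hand computations above, assuming X is the 5-by-6 score matrix from the figure (one row per student):

D  = squareform(pdist(X));          % the distance matrix between the 5 students
Zs = linkage(pdist(X), 'single');   % Solution 1: shortest distance method
Zc = linkage(pdist(X), 'complete'); % Solution 2: longest distance method
dendrogram(Zs);                     % pedigree diagram for Solution 1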

3.6 Issues that need attention in cluster analysis

  1. For a practical problem, indicators should be selected according to the purpose of the classification; different choices of indicators generally give different classification results.
  2. Different definitions of the distance between samples generally give different clustering results.
  3. Different clustering methods generally give different results (especially when there are many samples); it is best to run several methods and look for commonalities.
  4. Pay attention to the dimensions (units) of the indicators: if the scales differ too much, the clustering results will be unreasonable.
  5. The results of a cluster analysis may not be satisfactory, because clustering is only a mathematical procedure; we still have to find a reasonable interpretation of the results.

3.7 SPSS operation for system clustering

  • Specific operations (figures)
  • Result analysis
    • The pedigree chart (dendrogram) is a feature of newer SPSS versions: the horizontal axis represents the (rescaled) distance between classes, and the number of clusters can be read off the graph.
    • The SPSS output also contains another kind of graph, the icicle plot, which is rarely used nowadays. (figure)
      It is recommended to use at most 5 categories (more than that makes interpretation difficult). (figure)

3.8 Graphically estimating the number of clusters

  • Elbow method: roughly estimate the optimal number of clusters from a graph. (figure)

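If MATLAB is used instead of the SPSS-plus-Excel route below, the merge heights stored in the linkage output play the role of the agglomeration coefficients (a minimal sketch; Z is the linkage result from the sketch in Section 3.4):

coeff = flipud(Z(:, 3));              % merge heights, sorted in descending order
plot(1:numel(coeff), coeff, '-o');    % look for the "elbow" where the curve flattens
xlabel('Number of clusters');
ylabel('Agglomeration coefficient');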

How to draw the agglomeration coefficient line chart
  • ① Copy the coefficient column from SPSS into a new Excel sheet and sort it in descending order (figures)

  • ② Build the chart (figures)

  • ③ Interpret the chart

    • (1) According to the agglomeration coefficient line chart, when the number of categories reaches 5 the downward trend of the line levels off, so the number of categories can be set to 5.
    • (2) As the figure shows, the distortion changes most as K goes from 1 to 5; beyond 5 the change in distortion drops off markedly. The elbow is therefore at K = 5, so the number of categories can be set to 5. (K = 3 can also be justified, since the decline from 3 to 4 is also relatively gentle.)
    • (Use whichever choice you can explain best.)
  • ④ After determining K, save the clustering results and draw a schematic diagram (a code sketch follows this list)

    • Note: such a schematic can only be drawn when the number of indicators is 2 or 3. (figure)

    • After generating k classes (3 in the figure below), the k classes should be interpreted in the paper (explain why the data divide into these classes). (figure)

    • Chart construction (figure)

    • Set the horizontal and vertical axis labels (figure)

    • Set the point labels (figures)

    • Modify the style of the diagram (figures)

    • Finally, copy the finished chart into the paper. (figure)
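If the cluster labels were saved in MATLAB rather than SPSS, the same schematic takes one line (a sketch, assuming two indicator columns in X and cluster labels idx, e.g. from kmeans or cluster):

gscatter(X(:,1), X(:,2), idx);   % one color per cluster
xlabel('Indicator 1'); ylabel('Indicator 2');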


4. DBSCAN algorithm

4.1 Basic concepts of DBSCAN algorithm

  • K-means clustering and hierarchical clustering are distance-based methods; the DBSCAN algorithm in this section is a density-based clustering algorithm (Density-Based Spatial Clustering of Applications with Noise).
  • There is no need to pre-specify the number of clusters before clustering with the DBSCAN algorithm, and the number of generated clusters is variable (depending on the data).
    • The algorithm is built on a density-based notion of clusters: the number of objects (points or other spatial objects) contained within a given region of the clustering space must be no less than a given threshold. → "Whoever is close to me is my brother; my brother's brother is also my brother."
    • Advantage: the method can find clusters of arbitrary shape in spatial databases with noise, connect adjacent regions of sufficient density, and handle abnormal data effectively.
    • DBSCAN algorithm visualization (figure)

4.2 Classification

  • Core points (red in the figure below): at least MinPts points lie within radius Eps of the point
  • Boundary points (yellow in the figure below): fewer than MinPts points lie within radius Eps, but the point falls within the neighborhood of a core point
  • Noise points (blue in the figure below): points that are neither core points nor boundary points (figure)

4.3 Matlab code

% Copyright (c) 2015, Yarpiz (www.yarpiz.com)
% All rights reserved. Please read the "license.txt" for license terms.
%
% Project Code: YPML110
% Project Title: Implementation of DBSCAN Clustering in MATLAB
% Publisher: Yarpiz (www.yarpiz.com)
%
% Developer: S. Mostapha Kalami Heris (Member of Yarpiz Team)
%
% Contact Info: [email protected], [email protected]
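
The body of the Yarpiz implementation is omitted above. As a minimal alternative sketch, MATLAB R2019a and later ship a built-in dbscan in the Statistics and Machine Learning Toolbox (the parameter values here are placeholders, not recommendations):

epsilon = 0.5;                      % neighborhood radius Eps
minpts  = 5;                        % density threshold MinPts
idx = dbscan(X, epsilon, minpts);   % idx == -1 marks noise points
gscatter(X(:,1), X(:,2), idx);      % visualize clusters and noise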

4.4 Advantages and Disadvantages

  • Advantages:

    1. Being density-based, it can handle clusters of arbitrary shape and size;
    2. Outliers can be identified during clustering;
    3. Unlike K-means, the number of clusters does not have to be specified in advance.
  • Disadvantages:

    1. It is sensitive to the input parameters Eps and MinPts, which are difficult to choose;
    2. Because Eps and MinPts are global to the whole data set in DBSCAN, the clustering quality is poor when the cluster densities are uneven or the inter-cluster distances vary widely;
    3. When the amount of data is large, computing the density units is expensive.
  • Teacher Qingfeng’s suggestions:

    • If there are only two indicators and the scatter plot shows that the data have a distinctly "DBSCAN-like" shape (irregular, non-spherical clusters), then use DBSCAN for clustering.
    • In all other cases, use system clustering (K-means can also be used, but it leaves less to write about in the paper).

Postscript


Source: blog.csdn.net/SHIE_Ww/article/details/131990278