Summary and Introduction of Clustering Algorithms

Clustering Algorithm

"Things of a feather flock together, and people are divided into groups." The so-called clustering is the process of dividing samples into multiple classes composed of similar objects. The category for the dataset is unknown.

| Algorithm | Needs K given? | Clusters by | Scope of use |
| --- | --- | --- | --- |
| K-means | Yes | Distance | Low |
| K-means++ | Yes | Distance | Medium |
| Hierarchical clustering | No | Distance | High |
| DBSCAN | No | Density | Used when the scatter plot clearly shows density-based (DBSCAN-type) structure |

K-means clustering algorithm

Algorithm Description:

  1. Give K, the number of classes.
  2. Select K initial data centers.
  3. Compute the distance from each remaining point to each data center, and assign each point to the class of its nearest center.
  4. Update each data center to be the centroid (center of gravity) of its class.
  5. Repeat steps 3 and 4 until the centers stabilize; each repetition counts as one iteration.
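
The article carries these steps out in SPSS, but for intuition, here is a minimal NumPy sketch of the five steps above (a rough illustration only; the function and variable names are my own, not from the original):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means; X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 2: choose k initial data centers at random from the samples
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):                      # step 5: iterate
        # Step 3: assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each center to the centroid of its class
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):    # stop once centers stabilize
            break
        centers = new_centers
    return labels, centers
```

In practice one would call a library routine (e.g., sklearn.cluster.KMeans) rather than hand-rolling this loop.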

Advantages:

  1. The algorithm is simple and fast.
  2. It is relatively efficient at processing large datasets.

Shortcomings:

  1. The number of clusters K must be given in advance.
  2. Sensitive to the choice of initial centers.
  3. Sensitive to outliers.

Notes:

  • Shortcomings 2 and 3 can be addressed with the K-means++ algorithm.

  • The algorithm steps can be drawn as a clearly layered flowchart; in a paper this earns bonus points and also lowers the duplication-check rate.

  • Edraw, PowerPoint, and Visio can all be used to draw flowcharts.

K-means++ algorithm

K-means++ changes only how K-means selects the initial data centers: it adds the rule that the initial centers should be kept as far apart from one another as possible. All remaining steps are identical to K-means.

  • This addresses shortcomings 2 and 3 of the K-means algorithm.

Description of the initial-center selection algorithm

  1. Randomly select one sample as the first cluster center;

  2. For every other sample point, compute the shortest distance to the existing cluster centers (that is, the distance to its nearest center), and assign the point a selection probability based on that distance (the farther the point, the more likely it is to become a cluster center; in the standard formulation the probability is proportional to the squared distance). Then use the roulette-wheel method (selection by probability) to pick the next cluster center;

  3. Repeat step 2 until K cluster centers have been selected.
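
As a companion to the steps above, here is a minimal Python sketch of the seeding procedure (my own illustrative code; the probability is proportional to the squared distance, per the standard K-means++ formulation):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """K-means++ seeding following the three steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the first center uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance from every point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # Roulette-wheel selection: farther points are more likely to be chosen
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```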

SPSS operation

  • SPSS uses the K-means++ algorithm by default.


Note:

  • Eliminating dimensional effects (units): standardization, z = (x − x̄) / s, where x̄ is the sample mean and s is the sample standard deviation. SPSS's standardization option produces exactly this z-score variable.

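For readers not using SPSS, the same standardization is one line of NumPy (a sketch; I assume SPSS's Z scores use the sample standard deviation, hence ddof=1):

```python
import numpy as np

def zscore(X):
    # z = (x - mean) / s, computed column by column;
    # ddof=1 gives the sample standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```
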
Hierarchical clustering

The K value does not need to be given before clustering; the number of clusters is judged from the resulting dendrogram (pedigree diagram).

Algorithm process

  1. Treat each object as its own class and compute all pairwise distances;
  2. Merge the two classes with the smallest distance into one class;
  3. Compute the distance between the new class and every other class;
  4. Repeat steps 2 and 3 until all objects are merged into a single class;
  5. Finish.
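
Outside SPSS, the same procedure is available in SciPy; a minimal sketch on toy data (the data and parameter choices here are my own, not from the article):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(30, 2))   # toy data

# Agglomerative merging; method='single' merges the two classes with
# the smallest minimum pairwise distance, as in steps 1-2 above
Z = linkage(X, method='single')

# Once K has been chosen from the dendrogram, cut the tree into K classes
labels = fcluster(Z, t=3, criterion='maxclust')

# dendrogram(Z) draws the pedigree diagram (needs matplotlib to display)
```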

SPSS operation

  • Statistics: generally can be left at the defaults.
  • Plots: the dendrogram (pedigree diagram) shows the classification process; its checkbox should be ticked.
  • Method: "Cluster Method" defines the distance between classes; "Transform Values" handles standardization (Z scores = Z standardization).
  • Save: if the K value has been determined, enter it as the "Number of Clusters" (generally K ≤ 5, which is easy to explain).

The Elbow Rule - Estimating the Number of Clusters Graphically

Elbow method: roughly estimate the optimal number of clusters from a graph.


Usage steps

  1. Copy the coefficients for each stage from SPSS into Excel and sort them in descending order (double-click the table in the SPSS output to make it copyable).
  2. Insert → Recommended Charts → Scatter plot.
  3. Generally, the K value at the turning point ("elbow") is chosen as the final number of classes; a K that makes the classification convenient to explain is also acceptable.
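
The article builds the elbow plot in Excel from SPSS output; a rough Python equivalent on toy data (all names and data here are my own illustration) is:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data

# Total within-cluster sum of squares (inertia) for K = 1..9
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters K')
plt.ylabel('within-cluster sum of squares')
plt.show()   # choose the K at the "elbow" where the curve flattens
```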

After determining the K value, save the clustering results and draw a graph

SPSS can be used to visualize the clustering results, but only when there are 2 or 3 variables.

You can double-click the generated chart to modify its parameters and beautify the chart.
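
Outside SPSS, the same 2-variable visualization can be sketched in Python (illustrative only; toy data and parameter choices are my own):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))   # exactly 2 variables
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 2-D scatter plot of the clustering result, one color per class
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()
```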

DBSCAN Algorithm

The DBSCAN algorithm is a density-based clustering method. The number of clusters does not need to be specified in advance; the number of clusters produced varies with the data.

In short: a reference point is selected at random, and then, according to the parameters (radius Eps and minimum number of points per class MinPts), the surrounding points, and the points surrounding those, are absorbed one by one until the final groups are formed.

Data Point Classification

| Data point | Feature | Belongs to a class? |
| --- | --- | --- |
| Core point | No fewer than MinPts points within radius Eps | Yes |
| Border point | Fewer than MinPts points within radius Eps, but falls within the neighborhood of a core point | Yes |
| Noise point | Neither a core point nor a border point | No |

Code

A DBSCAN implementation recommended on the official MATLAB website can be downloaded.
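
If MATLAB is not at hand, scikit-learn also ships a DBSCAN implementation; a minimal sketch (toy data and parameter values are my own, not from the article):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data

# eps is the radius Eps, min_samples is MinPts from the table above
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_                      # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, 'clusters found;', np.sum(labels == -1), 'noise points')
```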

Advantages

  1. Because clusters are defined by density, the number of clusters K does not need to be given;

  2. It can detect abnormal points (noise points);

Shortcomings

  1. It is sensitive to the input parameters ε (the radius Eps) and MinPts, and these parameters are difficult to choose;

  2. Because ε and MinPts are single global values in DBSCAN, clustering quality is poor when cluster densities are uneven and inter-cluster distances differ greatly;

  3. When the dataset is large, computing the density units is computationally expensive.

Graphical display of prediction results


  • On the "Pimpled Smiley" dataset, the DBSCAN algorithm divides the points into three classes according to their density and removes the noise points.
  • See also the DBSCAN visualization website: Visualizing DBSCAN Clustering (naftaliharris.com).


Source: blog.csdn.net/qq_61539914/article/details/126800936