Artificial Intelligence: Clustering Notes

Introduction to the principles of clustering algorithms

Clustering concept

Clustering is the grouping of data points. Given a set of data points, a clustering algorithm assigns each point to a specific group. Points in the same group should have similar attributes or features, while points in different groups should have highly dissimilar attributes or features. Clustering is an unsupervised machine learning method (it needs no labels) and a statistical data analysis technique used across many fields; it is sometimes used to preprocess sparse features for supervised learning, and sometimes for outlier detection.
Application scenarios: news clustering, user purchase modeling (cross-selling), imaging and genetic technology, etc. The main difficulties of clustering are evaluation and parameter tuning (since there are no labels).

K-means algorithm

k-means is an unsupervised machine learning method that partitions data into k groups without any supervision signal.

Number of clusters: the K value (the number of clusters) must be specified in advance;
Centroid: the mean of a cluster, i.e., the average of the cluster's points along each dimension.
Distance measure: Euclidean distance and cosine distance are commonly used (standardize the data first).
Optimization objective:
$$\min \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$
where $C_i$ is the $i$-th cluster and $\mu_i$ is its centroid; this quantity is the SSE discussed in the details section below.
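In code form, the objective is just the SSE. A minimal sketch, assuming NumPy arrays and a hypothetical helper name:

```python
import numpy as np

def kmeans_sse(X, centroids, labels):
    """SSE: sum of squared distances from each point to its assigned
    centroid -- the quantity k-means tries to minimize."""
    return sum(np.sum((X[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))
```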

Work process
(1) Specify the value of K and randomly initialize K centroids.
(2) Traverse all sample points, compute the distance from each point to every centroid, and assign each point to the nearest one.
(3) Update the position of each centroid according to the resulting clusters.
(4) Re-traverse the sample points, recompute the distances to the centroids, and reassign the clusters.
(5) Keep updating and reassigning until the centroid positions no longer change significantly.
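The steps above can be sketched in a few lines of NumPy. A minimal, hypothetical implementation (the function name and defaults are ours, not from the original notes):

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=None):
    """Minimal k-means sketch following steps (1)-(5) above."""
    rng = np.random.default_rng(seed)
    # (1) randomly initialize k centroids by sampling k distinct points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # (2)/(4) assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # (5) stop once the centroids no longer change significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```

Usage is simply `centroids, labels = kmeans(X, k=3, seed=0)` on an (n, d) array `X`.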
Advantages: Simple and fast, suitable for conventional data sets.
Disadvantages:
1. The K value is difficult to determine; the complexity grows linearly with the number of samples; and it is hard to discover clusters of arbitrary shape.
2. k-means only converges to a local optimum and is easily affected by the initial centroids.
Solution: the bisecting k-means algorithm is not very sensitive to the choice of the initial centroids, because only one centroid is selected initially.
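A related, widely used mitigation is to run k-means several times from different random initializations and keep the run with the lowest SSE; scikit-learn's KMeans exposes this as its n_init parameter. A small sketch on toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data: 300 points drawn around 3 blob centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init=10 runs the algorithm ten times with different random
# centroids and keeps the run with the lowest inertia (SSE),
# reducing the risk of a poor local optimum.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)          # final SSE of the best run
print(km.cluster_centers_)  # the 3 centroids found
```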

Details of k-means
  1. How do we determine the value of k? How do we know how many clusters the data should be divided into?
    There is no fixed answer; the number of clusters is mostly determined by experience and judgment. The usual practice is to try several values of k and see which clustering result is easier to interpret and better suits the purpose of the analysis. Alternatively, compare the SSE computed under various values of k and choose the k beyond which SSE stops dropping sharply (see the elbow-method sketch after this list).
  2. How do we select the initial k centroids?
    They are usually selected randomly. The choice of initial centroids affects the final clustering result, so the algorithm should be executed several times and the most reasonable result kept.
    There are also some optimization methods (sketched after this list):
    The first is to choose points as far apart as possible: pick the first point, then pick as the second point the one farthest from the first; the third point is chosen so that the sum of its distances to the first two points is the largest, and so on.
    The second is to first obtain a clustering result from another clustering algorithm (such as hierarchical clustering) and pick one point from each resulting cluster.
  3. Can k-means get stuck endlessly updating centroids and never stop?
    No: there is a mathematical proof that k-means converges. The general idea is based on the SSE (sum of squared errors), i.e., the sum of squared distances from each point to the centroid it is assigned to. Both the assignment step and the centroid-update step can only decrease this SSE or leave it unchanged, and since SSE is bounded below by zero, the algorithm must converge.
  4. How do we determine which centroid each point belongs to?
    The first measure: Euclidean distance (details to be supplemented later).
    The second: cosine similarity, which uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals. Compared with distance measures, cosine similarity pays more attention to the difference in the direction of the two vectors, not to distance or length (see the distance sketch after this list).
    There are other ways to calculate distance, but they are all derived from Euclidean distance and cosine similarity: Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, adjusted cosine similarity, Jaccard similarity coefficient...
  5. The units of all dimensions must be consistent (standardize the data first, otherwise dimensions on larger scales dominate the distance).
  6. How is the new centroid selected in each iteration?
    Take the arithmetic mean of each dimension over the cluster's points. For example, for (x1,y1,z1), (x2,y2,z2), (x3,y3,z3), the new centroid is [(x1+x2+x3)/3, (y1+y2+y3)/3, (z1+z2+z3)/3]. Note that the new centroid is not necessarily an actual data point.
  7. Regarding outliers:
    Outliers are highly abnormal, very particular data points far from the bulk of the data. Outliers such as "extremely large" and "extremely small" values should be removed before clustering, otherwise they will affect the clustering result. However, outliers are often very valuable for analysis in their own right, and can be analyzed as a separate category.
  8. What does the ANOVA (one-way analysis of variance) in SPSS's k-means clustering output mean?
    Simply put, it determines whether each variable used for clustering contributes to the clustering result. The more significant a variable is in the ANOVA test, the greater its impact on the clustering result; variables that are not significant can be considered for removal from the model.
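For detail 1 above, here is a sketch of comparing SSE across k values (the elbow method), assuming scikit-learn, whose inertia_ attribute is the SSE:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# SSE always decreases as k grows, so look for the "elbow":
# the k after which the drop in SSE flattens out.
for k in range(1, 9):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))
```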
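For detail 2, the "points furthest from each other" initialization can be sketched as follows (the function name is ours; a randomized relative of this idea is the well-known k-means++ initialization):

```python
import numpy as np

def farthest_point_init(X, k):
    """Pick k initial centroids that are far apart: start with one
    point, then repeatedly add the point whose summed distance to
    the centroids chosen so far is largest."""
    centroids = [X[0]]  # first centroid: an arbitrary starting point
    for _ in range(k - 1):
        dist_sum = np.zeros(len(X))
        for c in centroids:
            dist_sum += np.linalg.norm(X - c, axis=1)
        centroids.append(X[dist_sum.argmax()])
    return np.array(centroids)
```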
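For detail 4, a minimal sketch of Euclidean distance and cosine similarity side by side; the example vectors point in the same direction but differ in length, which only the Euclidean distance notices:

```python
import numpy as np

def euclidean(a, b):
    # straight-line distance; sensitive to magnitude and scale
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # cosine of the angle between the vectors; depends on
    # direction only, not on length
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 2.0]), np.array([2.0, 4.0])
print(euclidean(a, b))          # ~2.236: lengths differ
print(cosine_similarity(a, b))  # 1.0: same direction
```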

DBSCAN


Origin: blog.csdn.net/qq122716072/article/details/104872867