Data Mining Algorithms and Practice (16): Clustering Algorithm

The previous algorithm is too focused on understanding and derivation, and the following algorithm will focus more on the use in sklearn and the official data set test. This article talks about clustering. Clustering is a general term for a class of algorithms, which is a classic unsupervised learning. After the sample is trained, the model is applied to the new data, and the data is directly clustered to obtain certain types of data. The classic scenario is the abnormal handling of the industrial production environment, and the common distance- based clustering (represented by the K-means algorithm) And density- based clustering (the representative is the DBscan algorithm), master the most commonly used k-means, agglomerative clustering, dbscan;

Refer to the sklearn Chinese forum: https://apachecn.gitee.io/sklearn-doc-zh/docs/master/22.html

One, clustering algorithm

The following sklearn shows the approximate clustering effect of the clustering algorithm. Take kmeans and bdscan as examples. It is found that the former will be close to the division by region, and the latter tends to divide the high density of contiguous patches into the same region, and agglomerative clustering (such as AgglomerativeClustering ) Is similar to the phagocytic process, and its classification effect is closer to the k-mean;

二、k-means

The K-means algorithm tries to find the cluster center that represents a specific area of the data. The algorithm iteratively performs two steps: ① Allocate each data point to the nearest cluster center; ② Each cluster center is set to the average value of all the assigned data points. If the cluster allocation does not change, the algorithm ends. There are several problems with k-means: ① It is required to specify the number of clusters to find; ② k-means can only find relatively simple shapes; ③ k-means only considers the distance to the nearest cluster center ④ The output of the algorithm depends on the random seed;

Three, agglomerative clustering

Agglomerative clustering is a clustering algorithm constructed based on the same principle. First, each point is regarded as its own cluster, and then the two most similar clusters are merged until a certain stopping criterion is met. For example, the stopping criterion is the number of clusters, so similar The clusters of are merged until the specified number of clusters are left, and the two nearest clusters are merged in an iterative manner. "Best" means that the sum of the cluster variances is the smallest, and the process is similar to the process of phagocytosis;

Four, DBscan

DBSCAN is a density-based spatial clustering application with noise, including 3 types of points: ① core samples; ② boundary points; ③ noise, and 2 important parameters: eps and min_sample