Density-based clustering algorithm (1) - DBSCAN detailed explanation

Density-based clustering algorithm (1) - detailed explanation of DBSCAN
Density-based clustering algorithm (2) - detailed explanation of OPTICS
Density-based clustering algorithm (3) - detailed explanation of DPC

1. Introduction to DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a classic density-based spatial clustering algorithm. Unlike K-Means and BIRCH, which are generally only suitable for clustering convex sample sets, DBSCAN can be applied to both convex and non-convex sample sets. The algorithm groups regions of sufficient density into clusters and can discover clusters of arbitrary shape in a spatial database with noise; it defines a cluster as the largest set of density-connected points.

The algorithm rests on the idea of density-based clustering: the number of objects (points or other spatial objects) contained in a given region of the clustering space must be no less than a given threshold. The notable advantages of DBSCAN are that it clusters quickly, handles noise points effectively, and finds spatial clusters of arbitrary shape. However, when the density of the clusters is uneven and the distances between clusters vary greatly, the clustering quality is poor.

2. The advantages and disadvantages of DBSCAN
Compared with the traditional K-Means algorithm, the biggest difference of DBSCAN is that it does not require the number of clusters k as input. Its biggest advantage is that it can find clusters of arbitrary shape, unlike K-Means, which is generally only suitable for clustering convex sample sets. At the same time, it can identify outliers while clustering, in which respect it is similar to the BIRCH algorithm.

So when should DBSCAN be used for clustering? Generally speaking, if the data set is dense and not convex, DBSCAN works much better than K-Means. If the data set is not dense, DBSCAN is not recommended.
(1) Advantages of DBSCAN:
1) It can cluster dense data sets of any shape; in contrast, clustering algorithms such as K-Means are generally only suitable for convex data sets.
2) It can find outliers while clustering and is not sensitive to outliers in the data set.
3) The clustering result has no initialization bias; by contrast, the initial values of clustering algorithms such as K-Means have a great influence on the result.
(2) Disadvantages of DBSCAN:
1) If the density of the sample set is uneven and the within-cluster distances differ greatly between clusters, the clustering quality is poor; DBSCAN is generally not suitable in this case.
2) If the sample set is large, clustering takes a long time to converge; this can be mitigated by limiting the size of the KD-tree or ball tree built for nearest-neighbor search.
3) Compared with traditional clustering algorithms such as K-Means, parameter tuning is slightly more complicated: the distance threshold ϵ and the neighborhood sample count threshold MinPts must be tuned jointly, and different parameter combinations have a large effect on the final clustering result.

3. Detailed description of DBSCAN and meaning of parameters
DBSCAN describes how tightly the sample set is packed in terms of neighborhoods, using the parameter pair (ϵ, MinPts) to characterize the density of the sample distribution within a neighborhood. Here ϵ is the distance threshold that defines the neighborhood of a sample, and MinPts is the threshold on the number of samples required within distance ϵ of that sample. Assuming the sample set is D=(x1,x2,...,xm), the density-related concepts of DBSCAN are defined as follows:
  1) ϵ-neighborhood: for xj∈D, its ϵ-neighborhood is the subset of samples in D whose distance from xj is not greater than ϵ, i.e. Nϵ(xj)={xi∈D | distance(xi,xj)≤ϵ}; the size of this subset is written |Nϵ(xj)|.
  2) Core object: a sample xj∈D is a core object if its ϵ-neighborhood contains at least MinPts samples, i.e. |Nϵ(xj)|≥MinPts.
  3) Directly density-reachable: xi is directly density-reachable from xj if xi lies in the ϵ-neighborhood of xj and xj is a core object. Note that the converse does not necessarily hold: xj cannot be said to be directly density-reachable from xi unless xi is also a core object.
  4) Density-reachable: for xi and xj, if there is a sample sequence p1,p2,...,pT with p1=xi, pT=xj, and pt+1 directly density-reachable from pt, then xj is said to be density-reachable from xi; that is, density-reachability is transitive. The intermediate samples p1,p2,...,pT−1 in the sequence are all core objects, because only a core object can make other samples directly density-reachable. Note that density-reachability is not symmetric, which follows from the asymmetry of direct density-reachability.
  5) Density-connected: for xi and xj, if there is a core object xk from which both xi and xj are density-reachable, then xi and xj are said to be density-connected. Note that density-connectedness is symmetric.
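To make the definitions above concrete, here is a minimal sketch (plain NumPy, with made-up data and assumed values of ϵ and MinPts, not taken from the text) that computes ϵ-neighborhoods by brute force and checks the core-object condition:

import numpy as np

# Toy sample set D and assumed parameters, for illustration only.
rng = np.random.RandomState(42)
D = rng.rand(100, 2)
eps, min_pts = 0.15, 5

# Pairwise Euclidean distances between all samples.
dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)

# eps-neighborhood N_eps(x_j): indices of samples within distance eps of x_j.
neighborhoods = [np.where(dists[j] <= eps)[0] for j in range(len(D))]

# x_j is a core object if |N_eps(x_j)| >= MinPts (the sample itself is counted).
is_core = np.array([len(n) >= min_pts for n in neighborhoods])
print("core objects:", is_core.sum(), "of", len(D))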

The figure below makes the definitions above easy to see. In the figure MinPts=5; the red points are all core objects, because their ϵ-neighborhoods contain at least 5 samples. The black points are non-core objects. Every sample that is directly density-reachable from a core object lies within the hypersphere centered on that red core object; samples outside the hypersphere are not directly density-reachable. The core objects connected by the green arrows form density-reachable sample sequences, and all samples in the ϵ-neighborhoods of these sequences are density-connected to one another.
[Figure: core objects (red), non-core samples (black), and density-reachable chains (green arrows), with MinPts=5]
4. DBSCAN idea
  The clustering definition of DBSCAN is very simple: each maximal density-connected sample set derived from the density-reachability relation is one category, or cluster, of the final clustering.
  A DBSCAN cluster can contain one or more core objects. If there is only one core object, all the other non-core samples in the cluster lie in the ϵ-neighborhood of that core object. If there are multiple core objects, the ϵ-neighborhood of any core object in the cluster must contain at least one other core object; otherwise the two core objects would not be density-reachable from each other. The cluster is formed by the collection of all samples in the ϵ-neighborhoods of these core objects.
    
  So how do we find such a cluster? The method used by DBSCAN is very simple: it arbitrarily selects an unlabeled core object as a seed, then finds all samples that are density-reachable from it; these form one cluster. It then selects another unlabeled core object and finds its density-reachable sample set, obtaining another cluster, and repeats until every core object has been assigned to a cluster.

This is basically the main idea of the DBSCAN algorithm, but three issues remain to be addressed:
  1) Some anomalous samples, or a few samples lying far from every cluster, are not in the neighborhood of any core object. In DBSCAN, these samples are generally marked as noise points.
  2) Distance measurement, i.e. how to compute the distance between a sample and a core object. DBSCAN generally adopts the nearest-neighbor idea and measures sample distance with some metric, such as the Euclidean distance; this is exactly the same as the nearest-neighbor idea in the KNN classification algorithm. For a small number of samples, nearest neighbors can be found by directly computing all pairwise distances; for large sample sets, a KD-tree or ball tree is generally used to search nearest neighbors quickly (see the sketch after this list).
  3) The third issue is somewhat special: a non-core sample may lie within distance ϵ of two core objects that are not density-reachable from each other and do not belong to the same cluster. How is the category of such a sample decided? DBSCAN adopts a first-come-first-served rule: whichever cluster is formed first marks the sample with its category. In other words, DBSCAN is not a completely deterministic algorithm.
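As a hedged illustration of point 2), the sketch below uses scikit-learn's NearestNeighbors with a KD-tree index to retrieve the ϵ-neighborhood of every sample; the data and the eps / min_samples values are assumptions made for the example, not taken from the text.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data and assumed parameters, just for illustration.
X = np.random.RandomState(0).rand(200, 2)
eps, min_samples = 0.1, 5

# Build a KD-tree index once, then query the eps-neighborhood of each sample.
nn = NearestNeighbors(radius=eps, algorithm='kd_tree').fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)

# A sample is a core object if its eps-neighborhood holds at least min_samples points
# (radius_neighbors includes the query point itself when querying the training set).
core_mask = np.array([len(idx) >= min_samples for idx in neighborhoods])
print("number of core objects:", core_mask.sum())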

5. Steps of the DBSCAN algorithm
The main steps of the DBSCAN clustering algorithm are as follows:
  Input: sample set D=(x1,x2,...,xm), neighborhood parameters (ϵ, MinPts), and a sample distance metric.
  Output: cluster division C.

  1) Initialize the core object set Ω=∅, the number of clusters k=0, the unvisited sample set Γ=D, and the cluster division C=∅.
  2) For j=1,2,...,m, find all the core objects as follows:
    a) Using the distance metric, find the ϵ-neighborhood sub-sample set Nϵ(xj) of sample xj.
    b) If this sub-sample set contains enough samples, i.e. |Nϵ(xj)|≥MinPts, add xj to the core object set: Ω=Ω∪{xj}.
  3) If the core object set Ω=∅, the algorithm ends; otherwise go to step 4.
  4) From the core object set Ω, randomly select a core object o, initialize the current cluster's core object queue Ωcur={o}, set the category number k=k+1, initialize the current cluster sample set Ck={o}, and update the unvisited sample set Γ=Γ−{o}.
  5) If the current cluster's core object queue Ωcur=∅, the current cluster Ck is complete: update the cluster division C={C1,C2,...,Ck}, update the core object set Ω=Ω−Ck, and go to step 3. Otherwise update the core object set Ω=Ω−Ck and go to step 6.
  6) Take a core object o′ out of the current cluster's core object queue Ωcur, find its ϵ-neighborhood sub-sample set Nϵ(o′) using the neighborhood distance threshold ϵ, let Δ=Nϵ(o′)∩Γ, update the current cluster sample set Ck=Ck∪Δ, update the unvisited sample set Γ=Γ−Δ, update Ωcur=Ωcur∪(Δ∩Ω)−{o′}, and go to step 5.
  The output is the cluster division C={C1,C2,...,Ck}.
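To connect the steps above to working code, here is a minimal from-scratch sketch (not the scikit-learn implementation) whose variable names mirror the notation of the steps; the data and parameter values at the bottom are assumed for illustration.

import numpy as np

def dbscan(X, eps=0.5, min_pts=5):
    """Toy DBSCAN following steps 1)-6); labels 0..k-1 are clusters, -1 is noise."""
    m = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Step 2: eps-neighborhood of every sample and the core object set Omega.
    neigh = [set(np.where(dists[j] <= eps)[0]) for j in range(m)]
    omega = {j for j in range(m) if len(neigh[j]) >= min_pts}

    labels = np.full(m, -1)      # unassigned samples stay -1 (noise)
    gamma = set(range(m))        # unvisited sample set
    k = 0
    while omega:                 # step 3: stop when no unassigned core objects remain
        o = next(iter(omega))    # step 4: seed a new cluster with a core object
        omega_cur = {o}
        cluster = {o}
        gamma -= {o}
        while omega_cur:         # steps 5-6: expand the cluster
            o_prime = omega_cur.pop()
            delta = neigh[o_prime] & gamma
            cluster |= delta
            gamma -= delta
            omega_cur |= (delta & omega)
        labels[list(cluster)] = k
        omega -= cluster
        k += 1
    return labels

# Example usage with made-up data: two well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) * 0.2, rng.randn(50, 2) * 0.2 + 2.0])
print(dbscan(X, eps=0.3, min_pts=5))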

6. DBSCAN implementation in Python with scikit-learn
  In scikit-learn, the DBSCAN algorithm is implemented in the class sklearn.cluster.DBSCAN. To use the DBSCAN class proficiently, in addition to a solid understanding of the principle of DBSCAN itself, a basic understanding of the nearest-neighbor idea is also needed.
  The important parameters of DBSCAN are also divided into two categories, one is the parameters of the DBSCAN algorithm itself, and the other is the parameters of the nearest neighbor measure:
  1) eps: DBSCAN algorithm parameter, i.e. the distance threshold ϵ of the ϵ-neighborhood; sample points whose distance from a sample exceeds ϵ are not in its ϵ-neighborhood. The default value is 0.5. An appropriate value generally has to be selected from several candidates. If eps is too large, more points fall into the ϵ-neighborhood of each core object, the number of categories may decrease, and samples that should not belong to the same category are merged into one. Conversely, if eps is too small, the number of categories may increase, and samples that originally belonged to one category are split apart.
  2) min_samples: DBSCAN algorithm parameter, i.e. the threshold on the number of samples in the ϵ-neighborhood required for a sample point to become a core object. The default value is 5. An appropriate value generally has to be selected from several candidates, and it is usually tuned together with eps. For a fixed eps, if min_samples is too large there are too few core objects, some samples that originally belonged to one cluster may be marked as noise points, and the number of categories increases. Conversely, if min_samples is too small, a large number of core objects are generated, which may lead to too few categories.
  3) metric: nearest-neighbor distance metric parameter. Many distance metrics can be used; generally, the default Euclidean distance (i.e. Minkowski distance with p=2) is enough for DBSCAN. The distance metrics that can be used include:
    a) Euclidean distance "euclidean"
    b) Manhattan distance "manhattan"
    c) Chebyshev distance "chebyshev"
    d) Minkowski distance "minkowski"
    e) Weighted Minkowski distance "wminkowski"
    f) Standardized Euclidean distance "seuclidean": the Euclidean distance computed after each feature dimension has been standardized, so that every feature dimension has mean 0 and variance 1.
    g) Mahalanobis distance "mahalanobis": when the sample features are mutually independent, the Mahalanobis distance reduces to the Euclidean distance.
 There are also some other distance metrics, defined for non-real-valued data, that are generally not used with the DBSCAN algorithm, so they are not listed here.
  
  4) algorithm: nearest-neighbor search algorithm parameter. There are three search algorithms: brute force, the KD-tree, and the ball tree. The parameter accepts four values: 'brute' for the brute-force implementation, 'kd_tree' for the KD-tree implementation, 'ball_tree' for the ball-tree implementation, and 'auto', which makes a trade-off among the three and chooses the best-fitting one. Note that if the input sample features are sparse, scikit-learn will ultimately use the brute-force implementation 'brute' no matter which value is chosen. From personal experience, the default 'auto' is usually enough. If the amount of data is large or there are many features, building a tree under 'auto' may take a long time and be inefficient; in that case the KD-tree implementation 'kd_tree' is recommended. If 'kd_tree' turns out to be slow, or the sample distribution is known to be uneven, try 'ball_tree'. And again, if the input samples are sparse, whichever algorithm is chosen, the actual computation uses 'brute'.
  5) leaf_size: nearest-neighbor search algorithm parameter. When the KD-tree or ball tree is used, this is the threshold on the number of samples in a leaf node at which subtree construction stops. The smaller the value, the larger and deeper the resulting KD-tree or ball tree and the longer the build time; conversely, the larger the value, the smaller and shallower the tree and the shorter the build time. The default is 30. Since this value mainly affects the running speed and memory usage of the algorithm, it can usually be ignored.
  6) p: nearest-neighbor distance metric parameter, used only to choose the p value of the Minkowski and weighted Minkowski distances: p=1 is the Manhattan distance, p=2 is the Euclidean distance. If the default Euclidean distance is used, this parameter can be ignored.
  The above are the main parameters of the DBSCAN class. The two parameters eps and min_samples need to be tuned, and the combination of these two values has a great impact on the final clustering result; in other words, the result of the DBSCAN clustering algorithm is sensitive to its parameters. The OPTICS clustering algorithm, developed from DBSCAN, solves this problem well; a detailed explanation of OPTICS will follow in a later post.
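As a hedged illustration of the parameters above, the snippet below instantiates sklearn.cluster.DBSCAN with each of them spelled out explicitly; the data and the specific values are assumptions made for the example, not recommendations.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Toy data; the parameter values below are illustrative, not tuned.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=9)

db = DBSCAN(
    eps=0.5,             # distance threshold of the eps-neighborhood
    min_samples=5,       # neighborhood sample-count threshold for a core object
    metric='minkowski',  # nearest-neighbor distance metric
    p=2,                 # p of the Minkowski metric (p=2 is the Euclidean distance)
    algorithm='auto',    # 'brute', 'kd_tree', 'ball_tree', or 'auto'
    leaf_size=30,        # leaf-size threshold for the KD-tree / ball tree
)
labels = db.fit_predict(X_demo)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", list(labels).count(-1))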

To show DBSCAN's advantage on non-convex data, we generate a set of random data consisting of three clusters, two of which are non-convex:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
%matplotlib inline
X1, y1 = datasets.make_circles(n_samples=5000, factor=.6, noise=.05)
X2, y2 = datasets.make_blobs(n_samples=1000, n_features=2, centers=[[1.2, 1.2]],
                             cluster_std=[[.1]], random_state=9)

X = np.concatenate((X1, X2))
plt.scatter(X[:, 0], X[:, 1], marker='o')
plt.show()

[Figure: scatter plot of the generated sample data]

1) Using K-Means clustering, the code is as follows:

from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters=3, random_state=9).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

[Figure: K-Means clustering result with n_clusters=3]
2) Use DBSCAN, with default parameters, the code is as follows:

from sklearn.cluster import DBSCAN
y_pred = DBSCAN().fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

[Figure: DBSCAN clustering result with default parameters]
But the clustering result contains only one class.
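A quick check (a small follow-up, assuming the y_pred computed in the cell above) confirms that the label array contains a single value:

import numpy as np

# With the default parameters every sample ends up in the same cluster.
print(np.unique(y_pred))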

3) Using DBSCAN, adjust two important parameters:
As can be seen from the figure above, the number of categories is too small, so it needs to be increased. This can be achieved by shrinking the ϵ-neighborhood: reduce eps from the default 0.5 to 0.1 and observe the effect. The code is as follows:

y_pred = DBSCAN(eps = 0.1).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

[Figure: DBSCAN clustering result with eps=0.1]
The clustering effect has clearly improved. To continue adjusting the parameters and increase the number of categories, there are two possible directions: continue to reduce eps, or increase min_samples. Increase min_samples from the default 5 to 10; the code is as follows:

y_pred = DBSCAN(eps = 0.1, min_samples = 10).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

[Figure: DBSCAN clustering result with eps=0.1, min_samples=10]
The clustering effect at this time is basically satisfactory.
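To quantify the result (a small follow-up sketch, again assuming the y_pred from the cell above), we can count the clusters found and the points labeled as noise (-1):

import numpy as np

values, counts = np.unique(y_pred, return_counts=True)
n_clusters = len(values[values != -1])
n_noise = counts[values == -1].sum() if -1 in values else 0
print("clusters:", n_clusters, "noise points:", n_noise)
print(dict(zip(values.tolist(), counts.tolist())))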

7. Summary
DBSCAN has two parameters that need tuning, eps and min_samples. The combination of these two values has a great impact on the final clustering result; in other words, the result of the DBSCAN clustering algorithm is sensitive to its parameters.
The OPTICS clustering algorithm, developed from DBSCAN, solves this problem very well; a detailed explanation and application of the OPTICS algorithm will follow in a later post.
In addition, a MATLAB implementation of DBSCAN has been uploaded to the resources; if you need it, you can also leave a comment or message to ask me:
https://download.csdn.net/download/weixin_50514171/85192429?spm=1001.2014.3001.5503
