DBSCAN clustering algorithm and Python implementation

DBSCAN clustering algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that can divide data points into different clusters and can identify noise points (points that do not belong to any cluster).

The basic idea of the DBSCAN clustering algorithm is to divide the points of a given data set into core points, border points and noise points according to the density of the other data points around each point. A core point is a data point that has enough other data points within a certain radius around it. A border point is a data point that does not meet the requirement for a core point but lies within the radius of some core point. A noise point is a point that meets neither condition. Then, starting from the core points, clusters are grown by repeatedly adding the points that are density-connected to them.
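The three kinds of points can be illustrated with a short sketch. It is not taken from any library: the function name classify_points, the sample points and the values eps=1.5 and min_pts=4 are all chosen here for illustration, and a point is assumed to count itself as part of its own neighborhood.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' (illustrative helper)."""
    # Neighborhood of each point: indices of all points within eps,
    # with the point itself included (an assumption of this sketch).
    neighborhoods = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighborhoods]

    kinds = []
    for i, neighbors in enumerate(neighborhoods):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbors):
            # Not dense enough itself, but inside the radius of a core point.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


# Hypothetical data: a dense square, one point on its edge, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.5, 1), (10, 10)]
kinds = classify_points(points, eps=1.5, min_pts=4)
for p, kind in zip(points, kinds):
    print(p, kind)
```

Here the four corners of the square are core points, (2.5, 1) is a border point because it lies within the radius of the core point (1, 1), and (10, 10) is noise.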

The advantages of the DBSCAN algorithm are that it can handle clusters of arbitrary shape, does not need the number of clusters to be specified in advance, and automatically identifies noise points and excludes them from the clusters. Its main disadvantage is that it may not cluster effectively when the density differs greatly between regions of the data set, because a single radius and density threshold is applied everywhere. In addition, the two parameters of the algorithm, the radius and the minimum number of points, need to be chosen sensibly according to the characteristics of the data set.

Example

Suppose we have the following set of data points: [(1,1), (1,2), (2,1), (8,8), (8,9), (9,8), (15,15)]

We can use the DBSCAN algorithm to separate these points into different clusters. First, we need to set two parameters: the radius a and the minimum number of samples minPts. Here we set a=2, minPts=3.

Next, we pick a point from the data set, say the first point (1,1), as a seed. Its neighborhood of radius a contains three points, namely (1,1) itself, (1,2) and (2,1), which meets the minPts = 3 threshold, so we mark (1,1) as a "core point". We then start a new cluster from it and add every point within distance a of it, here (1,2) and (2,1), which are "directly density-reachable" from (1,1). Both of them are core points as well, but their neighborhoods contain no new points, so the cluster stops growing.

Next, we select the next unclassified point, (8,8). It is also a core point, so we start a second cluster and add all points within distance a of it, namely (8,9) and (9,8).

Finally, we select the last unclassified point, (15,15). Its neighborhood of radius a contains only the point itself, which is fewer than minPts, and it is not within distance a of any core point, so it is marked as a noise point.

Therefore, the final clustering result is:

Cluster 1: [(1,1), (1,2), (2,1)]
Cluster 2: [(8,8), (8,9), (9,8)]
Noise: [(15,15)]

It can be seen that the DBSCAN algorithm successfully divides the data points into two clusters and excludes the noise point (15,15) from the cluster.
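The hand calculation above can be reproduced with a minimal from-scratch sketch. This is a simplified illustration rather than the scikit-learn implementation: the names region_query and dbscan are made up here, the neighbor search is a plain linear scan, and a point is assumed to count itself when its neighborhood is compared with min_pts.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core too: keep expanding
    return labels


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (15, 15)]
labels = dbscan(points, eps=2, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Labels 0 and 1 correspond to Cluster 1 and Cluster 2 above, and -1 marks the noise point.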

Python implementation

Example 1

Let's implement the example above in Python:

from sklearn.cluster import DBSCAN
import numpy as np

# Input data
X = np.array([(1,1), (1,2), (2,1), (8,8), (8,9), (9,8), (15,15)])

# Create the DBSCAN object, setting the radius and the minimum number of samples
dbscan = DBSCAN(eps=2, min_samples=3)

# Run the clustering
labels = dbscan.fit_predict(X)

# Print the clustering result (label -1 marks noise)
for i in range(max(labels)+1):
    print(f"Cluster {i+1}: {list(X[labels==i])}")
print(f"Noise: {list(X[labels==-1])}")

The result is:

Cluster 1: [array([1, 1]), array([1, 2]), array([2, 1])]
Cluster 2: [array([8, 8]), array([8, 9]), array([9, 8])]
Noise: [array([15, 15])]

It is consistent with the result of hand calculation.

In the Python implementation above, we first define a data set X, which contains 7 two-dimensional data points. We then create a DBSCAN object, setting the radius to 2 and the minimum number of samples to 3. Here we use the DBSCAN implementation provided by the scikit-learn library.

We pass the data set X to the DBSCAN object by calling its fit_predict() method, which performs the clustering and returns the cluster label of each data point. A label of -1 indicates that the point is a noise point.

Finally, we iterate over all cluster labels and print the data points in each cluster. Because scikit-learn numbers the clusters from 0, we add 1 to each label when printing so that the first cluster is shown as Cluster 1.

The output results show that the data points are divided into two clusters and one noise point, which is consistent with the previous manual calculation results.
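The grouping step can also be written with a dictionary keyed by label, which avoids the loop over range(max(labels)+1). The sketch below uses only the standard library. The helper name group_by_label is made up for illustration, and the labels are typed in by hand to match the result shown above instead of being computed.

```python
from collections import defaultdict


def group_by_label(points, labels):
    """Collect the points belonging to each label (illustrative helper)."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[int(label)].append(tuple(point))
    return dict(clusters)


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (15, 15)]
labels = [0, 0, 0, 1, 1, 1, -1]  # same labels as in the result above

groups = group_by_label(points, labels)
noise = groups.pop(-1, [])  # label -1 is reserved for noise
for label in sorted(groups):
    print(f"Cluster {label + 1}: {groups[label]}")
print(f"Noise: {noise}")
```

The same function accepts the NumPy array returned by fit_predict(), since each label is converted with int() before being used as a key.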

Detailed Algorithm Parameters

The parameters of the DBSCAN class in the sklearn.cluster module are described below. Its signature is DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)

The algorithm provides several adjustable parameters to control the clustering effect of the algorithm. The commonly used parameters are described in detail below:

  • eps: The neighborhood radius, that is, the maximum distance between two data points for one to be considered in the neighborhood of the other. It is not an upper bound on the distance between points within a cluster. The default value is 0.5.

  • min_samples: The number of data points (counting the point itself) that must lie within eps of a point for it to be a core point. The default value is 5.

  • metric: The metric used to calculate the distance between points. The available options include Euclidean distance (euclidean), Manhattan distance (manhattan) and so on. The default is Euclidean distance.

  • algorithm: The algorithm used by the nearest-neighbor search to find the points within eps of each point. The options are Ball Tree (ball_tree), KD Tree (kd_tree) and brute force (brute). The tree-based algorithms are generally efficient for low- to moderate-dimensional data, while brute force tends to be the better choice for high-dimensional data or small data sets. The default value is auto, which attempts to select a suitable algorithm automatically.

  • leaf_size: If the Ball Tree or KD Tree algorithm is used, this parameter specifies the size of the leaf node. The default value is 30.

  • p: The power of the Minkowski metric (minkowski) used to calculate the distance between points: p=1 gives the Manhattan distance and p=2 the Euclidean distance. The default value is None, which is equivalent to p=2.

  • n_jobs: Specifies the number of parallel jobs to run. The default value is None, which means 1 job unless a joblib parallel backend context is active. If -1, all available processors are used.

  • metric_params: If you need to set additional parameters when using some metrics, you can pass these parameters through this parameter. The default value is None.
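To make the effect of metric and p more concrete, the following sketch computes the Minkowski distance by hand for two values of p. The function minkowski and the sample points are written here for illustration only. In scikit-learn itself, the distance is selected through the metric and p parameters described above.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two points (illustrative)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0, 0), (1, 1)
eps = 1.5

d_manhattan = minkowski(a, b, p=1)  # |1| + |1| = 2.0
d_euclidean = minkowski(a, b, p=2)  # sqrt(1 + 1), about 1.414

print(f"Manhattan: {d_manhattan:.3f}, neighbors: {d_manhattan <= eps}")
print(f"Euclidean: {d_euclidean:.3f}, neighbors: {d_euclidean <= eps}")
```

With eps=1.5 the two points are neighbors under the Euclidean distance but not under the Manhattan distance, so a value of eps is only meaningful together with the metric it was chosen for.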

These parameters are very important to control the clustering effect of the DBSCAN algorithm, and need to be selected and adjusted according to specific data sets and requirements. When using the DBSCAN algorithm, we usually need to conduct multiple experiments and adjustments on these parameters to achieve the best clustering effect.
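One simple way to run such experiments is to sweep a parameter and record how many clusters and noise points result. The sketch below does this for eps on the small data set from Example 1, with min_samples fixed at 3. The candidate eps values are arbitrary choices made for illustration, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (15, 15)])

results = {}
for eps in (0.5, 1.5, 2.0, 10.0):
    labels = DBSCAN(eps=eps, min_samples=3).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})  # -1 is noise, not a cluster
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

On this data a very small eps marks every point as noise, while a very large eps merges everything into a single cluster. The values in between recover the two clusters and one noise point found earlier.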

Example 2: Iris data set

Next, we apply DBSCAN to the well-known iris data set:

from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the data set
iris = load_iris()
X = iris.data

# Preprocess the data by standardizing the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Cluster with the DBSCAN algorithm
dbscan = DBSCAN(eps=0.5, min_samples=5)
y_pred = dbscan.fit_predict(X)

# Print the clustering result
print('Clustering result:', y_pred)

The code above first loads the iris data set with the load_iris() function and then standardizes the features with StandardScaler. It then creates a DBSCAN object, passing values for the eps and min_samples parameters. Finally, it calls the fit_predict() method to cluster the data, stores the clustering result in the variable y_pred, and prints it.

The result is as follows:

Clustering result: [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0 -1 -1  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0 -1 -1  0  0  0  0  0  0  0 -1  0  0  0  0  0  0
  0  0  1  1  1  1  1  1 -1 -1  1 -1 -1  1 -1  1  1  1  1  1 -1  1  1  1
 -1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1 -1  1  1  1  1  1 -1  1  1
  1  1 -1  1 -1  1  1  1  1 -1 -1 -1 -1 -1  1  1  1  1 -1  1  1 -1 -1 -1
  1  1 -1  1  1 -1  1  1  1 -1 -1 -1  1  1  1 -1 -1  1  1  1  1  1  1  1
  1  1  1  1 -1  1]

However, in this result many points are labeled as noise (they could, of course, be treated as a category of their own). If the current result is not satisfactory, you can adjust the parameters of the algorithm further, for example eps and min_samples.
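One commonly used heuristic for choosing the radius is to look at the sorted distance from each point to its k-th nearest neighbor and to pick eps near the point where these distances jump. The sketch below is a plain-Python illustration on the small data set from Example 1, not on the iris data. The function name k_distances and the choice k=2 (the minimum number of samples minus one, since a point counts itself) are assumptions made here.

```python
from math import dist


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result)


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (15, 15)]
k_dists = k_distances(points, k=2)
print([round(d, 2) for d in k_dists])
# [1.0, 1.0, 1.41, 1.41, 1.41, 1.41, 9.22]
```

The jump from about 1.41 to 9.22 suggests that any eps between those two values separates the dense points from the isolated one, which agrees with the value of 2 used in Example 1.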

Origin blog.csdn.net/weixin_64338372/article/details/130021175