[Clustering algorithm] DBSCAN: density-based clustering


0. Preface

Suddenly k-means doesn't seem so appealing any more, haha.

Note: this algorithm does not just cluster; it can also remove outliers. After clustering, the noise points (outliers), which carry the label -1, can simply be filtered out.

1. Main text

1.1 Concept

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm. It partitions the data into clusters based on density, which allows it to discover clusters of arbitrary shape in noisy data.

1.1.1 Visual understanding

Suppose there is a group of people among whom we want to build a micro-business (direct-selling) network. A person's recruitment only counts as successful if the people recruited are immediate family members and their number is greater than 5. With this rule it is easy to divide the crowd into distinct "micro-business groups", which is exactly what clustering does. (Pretty short, right?)

1.1.2 General concepts

(1). Three kinds of points

  • core point
  • boundary point
  • noise point

As shown in the figure below, if a point has at least a certain number of other points (greater than 5 in the example above, which sets the density) within a specified radius (the "immediate family members" above, which sets the distance), then that point is called a core point.

Core points keep "recruiting a downline" with the same radius-and-count rule until it can no longer be satisfied. The final layer of "downline" points, which belong to a cluster but are not themselves core points, are the boundary points; points that end up in no cluster are noise points.
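To make the three roles concrete, here is a minimal sketch (my own illustration, not sklearn's internal logic) that labels every point of a moons dataset as core, boundary, or noise; the eps and min_samples values are arbitrary choices for demonstration:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(200, noise=0.1, random_state=1)
eps, min_samples = 0.2, 5   # arbitrary values for illustration

# indices of all points within eps of each point (the point itself included)
nn = NearestNeighbors(radius=eps).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)

is_core = np.array([len(nb) >= min_samples for nb in neighborhoods])
# a boundary point is not core but lies in some core point's neighborhood
is_boundary = ~is_core & np.array([is_core[nb].any() for nb in neighborhoods])
is_noise = ~is_core & ~is_boundary

print(is_core.sum(), 'core,', is_boundary.sum(), 'boundary,', is_noise.sum(), 'noise')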

(2). Four relationships

  1. If P is a core point and Q is in P's neighborhood, then Q is directly density-reachable from P.
    • Every core point is directly density-reachable from itself.
    • Direct density-reachability is not symmetric: Q being directly density-reachable from P does not mean P is directly density-reachable from Q. (Note: the micro-business analogy breaks down here; thinking in terms of distances between data points makes this easier to see. A tiny numeric example is given below.)
  2. If P2 is directly density-reachable from core point P1, P3 is directly density-reachable from P2, ..., and Q is directly density-reachable from Pn, then Q is density-reachable from P1. (Density-reachability is not symmetric either, which follows from direct density-reachability not being symmetric.)
  3. If there is a core point S from which both P and Q are density-reachable, then P and Q are density-connected. Density-connectedness is symmetric, and two density-connected points belong to the same cluster.
  4. If two points are not density-connected, they belong to different clusters, or at least one of them is noise.

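A tiny one-dimensional example (the numbers are made up purely for illustration) shows the asymmetry of direct density-reachability:

import numpy as np

pts = np.array([0.0, 1.0, 2.0, 5.0])
eps, min_samples = 1.5, 3

def n_neighbors(i):
    # neighbors within eps, counting the point itself
    return int(np.sum(np.abs(pts - pts[i]) <= eps))

# point at 1.0 sees {0.0, 1.0, 2.0} -> 3 points -> core
# point at 0.0 sees {0.0, 1.0}      -> 2 points -> not core
print(n_neighbors(1) >= min_samples)  # True : 0.0 is directly reachable from 1.0
print(n_neighbors(0) >= min_samples)  # False: 1.0 is NOT directly reachable from 0.0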

1.2 Algorithm steps

  1. Traverse all sample points. If the number of samples within the specified radius of a point exceeds the specified threshold, add that point to the core-point list, and let the points directly density-reachable from it form a temporary cluster.
  2. For each temporary cluster, check whether its points are core points; if so, merge the temporary cluster formed by that point into the current temporary cluster, producing a new temporary cluster (the "recruiting a downline" process).
  3. Repeat step 2 until every point in the current temporary cluster is either not in the core-point list, or already has all of its directly density-reachable points inside the temporary cluster. The temporary cluster is then promoted to a cluster.
  4. Continue with the remaining temporary clusters until all of them have been processed. (A from-scratch sketch of this expansion is given below.)
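The steps above translate almost directly into code. Below is a readable from-scratch sketch of the expansion process (for study only; the function name is my own, and in practice you should use sklearn's implementation):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def dbscan_sketch(X, eps, min_samples):
    # precompute each point's eps-neighborhood and the core-point flags
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)
    is_core = np.array([len(nb) >= min_samples for nb in neighborhoods])

    labels = np.full(len(X), -1)   # -1 = noise until proven otherwise
    cluster_id = 0
    for i in range(len(X)):
        if not is_core[i] or labels[i] != -1:
            continue
        # start a new temporary cluster from this core point and expand it
        labels[i] = cluster_id
        queue = list(neighborhoods[i])
        while queue:
            j = queue.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster_id        # j is density-reachable from i
            if is_core[j]:                # only core points keep expanding
                queue.extend(neighborhoods[j])
        cluster_id += 1
    return labels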


1.3 Demo

Note: the code below is meant to run in a Jupyter notebook.

Generate sample points

%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import numpy as np
import pandas as pd
from sklearn import datasets

# two interleaved half-moons: a shape k-means cannot separate
X, _ = datasets.make_moons(500, noise=0.1, random_state=1)
df = pd.DataFrame(X, columns=['feature1', 'feature2'])

df.plot.scatter('feature1', 'feature2', s=100, alpha=0.6,
                title='dataset by make_moon');

Clustering

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

from sklearn.cluster import dbscan

# eps is the neighborhood radius, min_samples the minimum number of points
core_samples, cluster_ids = dbscan(X, eps=0.2, min_samples=20)
# in cluster_ids, -1 marks a noise point

df = pd.DataFrame(np.c_[X, cluster_ids], columns=['feature1', 'feature2', 'cluster_id'])
df['cluster_id'] = df['cluster_id'].astype('i2')

df.plot.scatter('feature1', 'feature2', s=100,
                c=list(df['cluster_id']), cmap='rainbow', colorbar=False,
                alpha=0.6, title='sklearn DBSCAN cluster result');
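Since noise points carry the label -1, removing outliers, as promised in the preface, is a one-line filter on the dbscan output above:

print('noise points:', (cluster_ids == -1).sum())
X_clean = X[cluster_ids != -1]   # dataset with outliers removed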


1.4 Choosing the eps value

1.4.1 Method 1: OPTICS Algorithm

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# generate random data
X, y = make_blobs(n_samples=1000, centers=3, random_state=42)

# OPTICS orders the points by reachability distance instead of
# requiring a fixed eps up front
optics = OPTICS(min_samples=10, xi=0.05)
optics.fit(X)

# OPTICS exposes no eps_ attribute; one heuristic is to take a high
# percentile of the finite reachability distances as a candidate eps
reach = optics.reachability_[np.isfinite(optics.reachability_)]
print("Estimated eps:", np.percentile(reach, 90))
print("Cluster labels:", optics.labels_)

1.4.2 Method 2: K-distance

from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

# generate random data
X, y = make_blobs(n_samples=1000, centers=3, random_state=42)

# compute each point's distances to its 10 nearest neighbors
knn = NearestNeighbors(n_neighbors=10)
knn.fit(X)
distances, indices = knn.kneighbors(X)

# estimate eps as the mean distance to the 10th nearest neighbor
eps = distances[:, 9].mean()

print("Estimated eps:", eps)

1.5 Animation Understanding

A recommended interactive visualization site:
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/


1.6 Display of noise points (outliers)

Display one-dimensional data

import numpy as np
import matplotlib.pyplot as plt

def show_scatter_1D(data, labels, noise_black=False):
    plt.clf()
    if noise_black:
        # noise points (label -1) drawn as white dots with a red edge
        noise_data = data[labels == -1, 0]
        plt.scatter(noise_data, np.zeros_like(noise_data),
                    edgecolor='red', c='white', label='Noise')
    for l in set(labels) - {-1}:
        norm_data = data[labels == l, 0]
        plt.scatter(norm_data, np.zeros_like(norm_data), label=f'Cluster {l}')
    plt.legend()
    plt.show()

Display 2D data

def show_scatter_2D(data, labels, noise_black=False):
    plt.clf()
    if noise_black:
        # noise points (label -1) drawn as white dots with a red edge
        plt.scatter(data[labels == -1, 0], data[labels == -1, 1],
                    edgecolor='red', c='white', label='Noise')
    for l in set(labels) - {-1}:
        plt.scatter(data[labels == l, 0], data[labels == l, 1], label=f'Cluster {l}')
    plt.legend()
    plt.show()
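A quick usage example, feeding the moons data and dbscan labels from section 1.3 into the 2D helper:

from sklearn import datasets
from sklearn.cluster import dbscan

X, _ = datasets.make_moons(500, noise=0.1, random_state=1)
core_samples, cluster_ids = dbscan(X, eps=0.2, min_samples=20)
show_scatter_2D(X, cluster_ids, noise_black=True)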


