[Machine Learning] Clustering Algorithms: DBSCAN Basics and Practical Cases

Foreword

In machine learning, clustering is a common unsupervised learning method whose goal is to divide the data points in a dataset into groups such that points within each group share similar characteristics. Clustering is used in many applications, such as image segmentation, social media analysis, and medical data analysis. DBSCAN is one such clustering algorithm, and it is widely used across many fields.

1. The principle of DBSCAN algorithm

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. Its principle is to determine clusters based on the density around data points: high-density regions are treated as clusters, while low-density regions are treated as noise.

The algorithm flow of DBSCAN is as follows:

  • Select a data point as a starting point and find all data points within a specified distance of it.
  • If the number of data points within that distance is greater than or equal to the specified threshold, mark the point as a core point.
  • For each core point, all data points within the specified distance are grouped into the same cluster; if the neighborhoods of two core points share data points, the two are merged into the same cluster.
  • Non-core points that fall inside a core point's neighborhood become border points of that cluster; the remaining non-core points are marked as noise. (A minimal sketch of this flow follows.)
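
To make this flow concrete, here is a minimal, unoptimized sketch of the algorithm in plain Python/NumPy. It is illustrative only: the function name dbscan_sketch and its internals are my own, and real implementations use spatial indexes (kd trees, ball trees) rather than a full distance matrix.

import numpy as np

def dbscan_sketch(X, eps=0.5, min_samples=5):
    """Toy DBSCAN: returns one label per point, -1 for noise."""
    n = len(X)
    # Pairwise distances; fine for a sketch, too slow for real data.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)  # -1 = noise until proven otherwise
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster  # start a new cluster from this core point
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # core or border point joins
                if core[j]:               # only core points keep expanding
                    queue.extend(neighbors[j])
        cluster += 1
    return labels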

Basic concepts:

  • Density-connected: if, starting from a core point p, points q and k are both density-reachable, then q and k are said to be density-connected.
  • Border point: a non-core point that belongs to a cluster (it lies within the neighborhood of a core point) but cannot extend the cluster any further itself.
  • Directly density-reachable: if a point p lies in the r-neighborhood of a point q, and q is a core point, then p is directly density-reachable from q.
  • Noise point: a point that belongs to no cluster and is not density-reachable from any core point.

In DBSCAN, there are three parameters that need to be specified:

  • eps: the neighborhood radius, i.e. the maximum distance for two points to count as neighbors.
  • min_samples: the minimum number of data points that must lie around a point for it to be a core point.
  • metric: the metric used to calculate the distance between points.

2. Basic use

  • Import the packages
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN

make_circles is a scikit-learn function for generating a random dataset of concentric circles. Its parameters include n_samples, the number of samples in the generated dataset; noise, the standard deviation of the Gaussian noise added to the data; factor, the scale factor between the inner and outer circles; and random_state, the random seed, used to make repeatedly generated datasets consistent.
The make_circles function generates two circular clusters on a two-dimensional plane, nested inside each other, to simulate a non-linearly separable problem.
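
As a quick sanity check of these parameters (a minimal sketch; the values below are arbitrary):

import numpy as np
from sklearn.datasets import make_circles

X_demo, y_demo = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
print(X_demo.shape)       # (200, 2): two-dimensional coordinates
print(np.unique(y_demo))  # [0 1]: outer vs. inner circle labels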

Introduction to the DBSCAN() instantiation parameters

  • eps: The maximum distance between two samples for them to be considered neighbors; the default value is 0.5.
  • min_samples: The minimum number of samples within a point's eps-neighborhood for that point to qualify as a core point; the default value is 5.
  • metric: The metric used to calculate distances; the default is Euclidean distance ('euclidean'). Manhattan distance ('manhattan'), cosine distance ('cosine'), etc. can also be chosen.
  • metric_params: Additional keyword parameters for the metric.
  • algorithm: The algorithm used to compute neighborhoods. You can choose an efficient kd-tree-based algorithm ('kd_tree') or an efficient ball-tree-based algorithm ('ball_tree'); the default is automatic selection.
  • leaf_size: The leaf size when building the kd tree or ball tree; the default is 30.
  • p: The exponent used for the Minkowski distance. When p=1 it is the Manhattan distance; when p=2 it is the Euclidean distance; in general it gives the L_p distance. The default corresponds to p=2.
  • n_jobs: The number of parallel jobs used in the computation; by default a single job is used, and -1 means use all available CPU cores.
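
Putting a few of these parameters together (a minimal sketch; the specific values below are arbitrary examples, not recommendations):

from sklearn.cluster import DBSCAN

# Manhattan distance, a looser density requirement, and all CPU cores
db = DBSCAN(eps=0.2, min_samples=10, metric='manhattan', n_jobs=-1)
print(db.get_params())  # inspect the resulting parameter settings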

Introduction to methods and attributes

fit_predict(X): Trains the model and, at the same time, returns the clustering result, i.e. the cluster number each sample point belongs to (-1 for noise points). X is the input data.

  • labels_: After training, the cluster number of each sample point; noise points get -1.
  • core_sample_indices_: After training, the indices of the core samples.
  • components_: After training, a copy of the core samples themselves (one row per core sample).
  • get_params(): Gets the parameter settings of the current model.
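
A short, self-contained example of these in use (a sketch; the dataset and parameter values here are just for illustration):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles

X_tmp, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X_tmp)      # one cluster id per point, -1 = noise

print(np.unique(labels))            # e.g. [0 1], plus -1 if there is noise
print(db.core_sample_indices_[:5])  # indices of the first few core samples
print(db.components_.shape)         # (n_core_samples, 2): the core samples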

  • Dataset

X, y = make_circles(factor=0.3, n_samples=1000, random_state=42, noise=0.1)
X  # inspect the generated coordinates

  • Visualize the generated data

plt.plot(X[:,0],X[:,1],'b.',marker='*')
plt.show()

(Figure: scatter plot of the nested-circles dataset)

  • Since different eps values have different effects on the results, we define several candidate eps values in the code below and train with each of them:

def plot_show(eps, model):
    core_mask = np.zeros_like(model.labels_, dtype=bool)

    # mark the core samples
    core_mask[model.core_sample_indices_] = True

    # mark the noise points (label -1)
    anomaly_mask = model.labels_ == -1

    # everything else is a border (non-core) point
    non_core_mask = ~(core_mask | anomaly_mask)

    cores = model.components_
    anomalies = X[anomaly_mask]
    non_cores = X[non_core_mask]

    # core samples: circles colored by cluster, with a star overlay
    plt.scatter(cores[:, 0], cores[:, 1], c=model.labels_[core_mask], marker='o', cmap='Paired')
    plt.scatter(cores[:, 0], cores[:, 1], marker='*', s=20, c=model.labels_[core_mask])

    # noise points: red X markers
    plt.scatter(anomalies[:, 0], anomalies[:, 1], c='red', marker='X', s=70)

    # border points: small dots colored by cluster
    plt.scatter(non_cores[:, 0], non_cores[:, 1], c=model.labels_[non_core_mask], marker='.')

    plt.axis('off')
    plt.title(f'eps: {eps:.2f}')


# nine candidate eps values for the 3x3 grid (chosen here for illustration)
epes = np.linspace(0.05, 0.45, 9)

for i, eps in enumerate(epes):
    dbscan = DBSCAN(eps=eps, min_samples=5)
    dbscan.fit(X)
    plt.subplot(331 + i)
    plot_show(eps, dbscan)

plt.show()

(Figure: 3x3 grid of clustering results for the nine candidate eps values)
As the figure shows, different eps values produce very different clustering results, so choosing an appropriate eps is particularly important. The red X markers in the figure are the noise points, and the starred circles are the core samples.

3. Comparison of DBSCAN and KMeans

DBSCAN has an advantage when processing nonlinear clusters of arbitrary shape, which KMeans cannot handle well. Let's observe this experimentally.

  • Load the data
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.04, random_state=42)

plt.plot(X[:,0], X[:,1], 'b.', marker='*')
plt.show()

make_moons is a scikit-learn function for generating a random two-moons dataset. Its parameters include n_samples, the number of samples in the generated dataset; noise, the standard deviation of the Gaussian noise added to the data; and random_state, the random seed, used to make repeatedly generated datasets consistent.
The make_moons function generates two half-moon-shaped clusters on a two-dimensional plane and interleaves them.

(Figure: scatter plot of the two interleaved half-moons)

  • Compare the two clustering methods

from sklearn.cluster import KMeans

plt.subplot(121)
dbscan = DBSCAN(eps=0.1, min_samples=5)
dbscan.fit(X)
plt.scatter(X[:,0], X[:,1], c=dbscan.labels_)
plt.title('DBSCAN')

plt.subplot(122)
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_)
plt.title('KMEANS')

plt.show()

(Figure: DBSCAN and KMeans clusterings of the two moons, side by side)
As the figure shows, KMeans fails to separate the two half-moons, while DBSCAN recovers them correctly.
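
To back the visual impression with a number, one option (not in the original post) is to compare each clustering against the true moon labels using the adjusted Rand index; a small sketch, reusing y, dbscan and kmeans from the code above:

from sklearn.metrics import adjusted_rand_score

# 1.0 = perfect agreement with the true labels, values near 0 = random
print('DBSCAN ARI:', adjusted_rand_score(y, dbscan.labels_))
print('KMeans ARI:', adjusted_rand_score(y, kmeans.labels_))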

Summary

The DBSCAN algorithm has the following characteristics:

  • There is no need to set the number of clusters in advance.
  • Ability to identify clusters of arbitrary shape.
  • Noise points can be identified.
  • It is sensitive to its parameter settings, but usually only two parameters need to be tuned: the radius ε (eps) and the minimum number of samples MinPts (min_samples); a common heuristic for choosing ε is sketched below.
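
A common heuristic for choosing ε is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbor and look for the "elbow" in the curve. A sketch using scikit-learn's NearestNeighbors, assuming the two-moons data X from the previous section:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_samples = 5
# distance from each point to its min_samples-th nearest neighbor
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist)
plt.xlabel('points sorted by k-distance')
plt.ylabel(f'distance to {min_samples}th nearest neighbor')
plt.show()  # a reasonable eps often sits near the elbow of this curve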

However, I ran into a problem when using DBSCAN for image segmentation: if min_samples is set too small, the segmented image shows no significant change; if it is set too large, some lower-density points are removed as noise, so parts of the image are lost and the shapes no longer match the original. I have no good solution at present; if one turns up later, I will share it as soon as possible.

Due to my limited ability, there may be mistakes in the content above; corrections are welcome.
I will keep working hard to share and learn, and I hope you will support me.
