Python implementation of DBSCAN clustering

1. Description

        DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It groups data points into clusters according to their density and automatically identifies outliers. The core idea of DBSCAN is to assign high-density data points to the same cluster and to treat isolated low-density points as noise. By defining a distance threshold and a density threshold between data points, DBSCAN can perform clustering without the number of clusters being specified in advance.

2. Overview of DBSCAN

        Clustering is applied to a data set to group similar data points together. It looks for similarities and differences between data points and gathers the similar ones into groups. The resulting clusters carry no labels: clustering is a type of unsupervised learning that aims to discover the underlying structure of a data set.

2.1 Types of clustering algorithms:

  • Partition-based clustering
  • Fuzzy clustering
  • Hierarchical clustering
  • Grid-based clustering
  • Exclusive clustering
  • Overlapping clustering
  • Density-based clustering

        In this blog, we will focus on density-based clustering methods, specifically the DBSCAN algorithm as implemented in scikit-learn. Density-based algorithms are good at finding high-density regions and outliers, which makes them well suited to anomaly detection and to clustering non-linear data sets.

2.2 DBSCAN properties

1) The advantages of the DBSCAN clustering algorithm include:

  • There is no need to specify the number of clusters in advance.
  • Ability to identify outliers and noise points.
  • Ability to handle clusters of arbitrary shapes.
  • Can perform well for dense clusters.

2) The disadvantages of the DBSCAN clustering algorithm include:

  • For data sets whose clusters vary widely in density, a single eps cannot fit every cluster, so sparse clusters may end up labeled as noise and results may be worse than with other algorithms.
  • For high-dimensional data, distances become less discriminative, so clustering may not perform well.

3) Applications of DBSCAN clustering algorithm include:

  • Image segmentation
  • Spatial clustering in geographic information systems
  • Object tracking in computer vision
  • Community discovery in social network analysis
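The "clusters of arbitrary shapes" advantage can be seen in a small sketch. The example below (our own illustration, using scikit-learn's synthetic make_moons data rather than anything from the original post) compares DBSCAN with K-means on two interleaving half-moons, a shape K-means cannot separate cleanly:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

# DBSCAN follows the density of the moons; parameters here are illustrative
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-means assumes roughly spherical clusters and cuts straight through the moons
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Count DBSCAN clusters, excluding the noise label -1
print("DBSCAN clusters:", len(set(db_labels) - {-1}))
```

With this amount of noise, DBSCAN typically recovers the two moons exactly, while K-means splits each moon between its two clusters.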

 

3. Construction of DBSCAN

        DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based unsupervised learning algorithm. It examines each point's neighborhood to find arbitrarily shaped clusters and outliers, whereas K-means clustering tends to produce spherical clusters.

DBSCAN does not require the number of clusters K up front. Instead, it requires two parameters: eps and minPts.

  • eps : the radius of a point's neighborhood. If the distance between two points is less than or equal to eps, they are considered neighbors.
  • minPts : the minimum number of data points required within an eps-neighborhood to form a cluster.

        DBSCAN uses these two parameters to classify each data point as a core point, a border point, or a noise point.
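These definitions can be computed directly. The sketch below (the toy data and the classify_points helper are our own illustration, not part of the original post) labels each point of a small 2-D data set using only eps and minPts:

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (illustrative sketch)."""
    n = len(X)
    # Pairwise Euclidean distances, shape (n, n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighborhood counts include the point itself
    neighbor_counts = (dists <= eps).sum(axis=1)
    core = neighbor_counts >= min_pts
    labels = []
    for i in range(n):
        if core[i]:
            labels.append('core')
        elif core[dists[i] <= eps].any():
            # Not core itself, but within eps of some core point
            labels.append('border')
        else:
            labels.append('noise')
    return labels

# Toy data: a tight cluster of five points plus one far-away point
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [10, 10]], dtype=float)
print(classify_points(X, eps=1.5, min_pts=4))
# → ['core', 'core', 'core', 'core', 'core', 'noise']
```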


4. How does the DBSCAN clustering algorithm work?

  1. Randomly select a point p . If its eps-neighborhood contains at least minPts data points, p is called a core point.
  2. Using eps and minPts, identify all points that are density-reachable from p.
  3. If p is a core point, a cluster is formed from p and all points density-reachable from it.
  4. If p is a border point (its eps-neighborhood contains fewer than minPts points, but it lies in the neighborhood of a core point), the algorithm moves on to the next data point.
  5. The algorithm continues until all points have been visited; points that belong to no cluster are labeled noise.
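The steps above can be sketched as a minimal, unoptimized DBSCAN in plain NumPy (the function and toy data are our own illustration; in practice you would use sklearn.cluster.DBSCAN):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch; returns integer labels, with -1 meaning noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)          # -1 = unassigned (ends up meaning noise)
    cluster = 0
    for p in range(n):
        if labels[p] != -1 or len(neighbors[p]) < min_pts:
            continue                 # already clustered, or not a core point
        # Grow a new cluster from core point p via density-reachability
        labels[p] = cluster
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbors[q]) >= min_pts:   # q is core: expand further
                    queue.extend(neighbors[q])
        cluster += 1
    return labels

# Toy demo: two compact clusters and one isolated noise point
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [5, 5]], dtype=float)
print(dbscan(X, eps=1.5, min_pts=3))
```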

5. DBSCAN clustering in Python

        We will use a Deepnote notebook to run this example. It comes with Python packages pre-installed, so we only need to import NumPy, pandas, seaborn, matplotlib, and scikit-learn.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

We are using the Mall Customer Segmentation data from Kaggle. It contains each customer's age, gender, annual income, and spending score. We will use these features to create clusters.

        First, we will load the dataset using pandas `read_csv`. We will then select three columns ("Age", "Annual Income (k$)", "Spending Score (1-100)") to create the X_train data frame.

df = pd.read_csv('Mall_Customers.csv')
X_train = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

        We will fit the DBSCAN algorithm to X_train with eps=12.5 and min_samples=4. After that, we will copy X_train into DBSCAN_dataset and add a "Cluster" column from clustering.labels_.

clustering = DBSCAN(eps=12.5, min_samples=4).fit(X_train)
DBSCAN_dataset = X_train.copy()
DBSCAN_dataset.loc[:,'Cluster'] = clustering.labels_

        To visualize the distribution of clusters, we apply value_counts() to the "Cluster" column and convert the result to a dataframe.

DBSCAN_dataset.Cluster.value_counts().to_frame()

As you can see, we have 5 clusters and 1 outlier. The "0" cluster has the largest size, with 112 rows.
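The cluster count and noise count can also be read straight from the labels. A minimal sketch, using a small hypothetical labels array in place of clustering.labels_:

```python
import numpy as np

# Hypothetical labels array of the kind clustering.labels_ produces
labels = np.array([0, 0, 1, 1, 2, -1, 2, 0])

# -1 marks noise, so exclude it when counting clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)  # → 3 1
```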

In this section, we will use the above information to draw scatter plots.

There are two graphs: "Annual Income vs. Spending Score" and "Age vs. Spending Score." Clusters are distinguished by color, and outliers are drawn as small black dots.

The visualization clearly shows how each customer belongs to one of the 5 clusters. We can use this information to offer high-end deals to customers in the purple cluster and cheaper deals to customers in the dark green cluster.

outliers = DBSCAN_dataset[DBSCAN_dataset['Cluster'] == -1]

fig2, axes = plt.subplots(1, 2, figsize=(12, 5))

# Clustered points (label != -1), colored by cluster
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)',
                data=DBSCAN_dataset[DBSCAN_dataset['Cluster'] != -1],
                hue='Cluster', ax=axes[0], palette='Set2', legend='full', s=200)
sns.scatterplot(x='Age', y='Spending Score (1-100)',
                data=DBSCAN_dataset[DBSCAN_dataset['Cluster'] != -1],
                hue='Cluster', palette='Set2', ax=axes[1], legend='full', s=200)

# Outliers (label == -1) as small black dots
axes[0].scatter(outliers['Annual Income (k$)'], outliers['Spending Score (1-100)'],
                s=10, label='outliers', c='k')
axes[1].scatter(outliers['Age'], outliers['Spending Score (1-100)'],
                s=10, label='outliers', c='k')
axes[0].legend()
axes[1].legend()

plt.setp(axes[0].get_legend().get_texts(), fontsize='12')
plt.setp(axes[1].get_legend().get_texts(), fontsize='12')

plt.show()


6. Conclusion

        DBSCAN is one of many algorithms that can be used for customer segmentation; depending on the data, K-means or hierarchical clustering may give better results. Clustering algorithms are commonly used in recommendation engines, market and customer segmentation, social network analysis, and document analysis.

        In this blog, we learned the basics of the density-based algorithm DBSCAN and how to use it to create customer segments with scikit-learn. You can improve the results by using silhouette scores and heatmaps to find the best eps and min_samples.
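That tuning idea can be sketched as a small grid search over eps and min_samples scored by silhouette (this uses a synthetic make_blobs dataset as a stand-in for the mall data, and the grid values are our own illustrative assumption):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_train; the real data would come from the CSV
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

best = None
for eps in [0.5, 1.0, 1.5, 2.0, 2.5]:
    for min_samples in [3, 4, 5]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters < 2:
            continue  # silhouette_score needs at least 2 clusters
        # Score only the non-noise points
        mask = labels != -1
        score = silhouette_score(X[mask], labels[mask])
        if best is None or score > best[0]:
            best = (score, eps, min_samples)

print(best)  # (best silhouette score, eps, min_samples)
```

The combination with the highest silhouette score is a reasonable starting point; a heatmap of the score over the (eps, min_samples) grid makes the same comparison visual.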

Origin blog.csdn.net/gongdiwudu/article/details/132783438