DBSCAN, one of scikit-learn's clustering algorithms

Algorithm flow

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm treats clusters as high-density regions separated by low-density regions.
Core sample: a sample in the data set for which there are at least min_samples other samples within a distance of eps; those samples are defined as the neighbors of the core sample;

1. Find all core samples;
2. Randomly pick an unassigned core sample, find the core samples among its neighbors, then the neighbors of those newly found core samples, and so on until no new core samples can be reached;
3. The samples gathered in step 2 form one cluster: a set of core samples plus the non-core samples that are neighbors of those core samples;
4. Repeat steps 2 and 3 until every core sample belongs to some cluster.

By definition, every core sample is part of a cluster, and any sample that is not a core sample and is farther than eps from every core sample is treated as an outlier (noise).
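The flow above can be condensed into a short sketch. This is an illustrative, unoptimized implementation written for this article (the function name dbscan_sketch and its brute-force neighbor search are assumptions, not scikit-learn's actual code); it follows steps 1-4 directly:

import numpy as np

def dbscan_sketch(X, eps=0.5, min_samples=5):
    """Illustrative DBSCAN with a brute-force O(n^2) neighbor search; label -1 means noise."""
    n = len(X)
    # Step 1: eps-neighborhoods and the set of core samples
    # (this sketch counts the sample itself in its own neighborhood, as scikit-learn does)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)              # -1 marks noise / not yet assigned
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Steps 2-3: grow a new cluster outward from this core sample
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if is_core[j]:           # only core samples keep expanding the cluster
                    queue.extend(neighbors[j])
        cluster += 1
    return labels                        # step 4: the loop runs until all core samples are assigned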

Advantages and disadvantages of the algorithm

Advantages:
1. It can cluster dense data sets of arbitrary shape; by comparison, clustering algorithms such as K-Means are generally only suitable for convex data sets (see the sketch after this list);
2. It is insensitive to outliers in the data set;
Disadvantages:
1. If the density of the sample set is not uniform, the quality of the clustering will be poor;
2. If the sample set is large, clustering takes a long time to converge; this can be mitigated by limiting the size of the KD tree or ball tree built for the nearest-neighbor search;
3. Compared with K-Means and other traditional clustering algorithms, parameter tuning is slightly more involved: the distance threshold eps and the neighborhood sample-count threshold min_samples must be tuned jointly, and their combination has a great influence on the final clustering result;
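To illustrate advantage 1, the sketch below (written for this article, with arbitrary parameter values) clusters the non-convex make_moons data set with both DBSCAN and K-Means; DBSCAN recovers the two interleaving crescents while K-Means, which assumes convex clusters, splits them incorrectly:

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: a dense but non-convex data set
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("DBSCAN  ARI: %0.3f" % adjusted_rand_score(y_true, db_labels))  # close to 1
print("K-Means ARI: %0.3f" % adjusted_rand_score(y_true, km_labels))  # clearly lower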

Parameters in sklearn

class sklearn.cluster.DBSCAN
eps=0.5: float, the distance threshold of the ϵ-neighborhood;
min_samples=5: int, the number of samples required in a sample's ϵ-neighborhood for that sample to become a core sample;
the above two parameters need to be tuned jointly;
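A common heuristic for tuning the two together (a general technique, not something prescribed by the scikit-learn documentation) is to fix min_samples, plot the sorted distance of every point to its min_samples-th nearest neighbor, and take eps near the "elbow" where the curve bends sharply upward; a sketch:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=750, centers=3, cluster_std=0.4, random_state=0)

min_samples = 10
# Distance of each point to its min_samples-th nearest neighbor
# (the query point itself counts as the first neighbor here)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

plt.plot(k_dist)
plt.xlabel('points sorted by k-distance')
plt.ylabel('distance to %d-th nearest neighbor' % min_samples)
plt.show()   # read eps off the y-axis at the elbow of the curve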

metric='euclidean': string or callable, the distance metric used for the nearest-neighbor search;

metric_params=None: dict, parameters of distance metric;
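As a small illustration of metric and how distances can be supplied (a sketch for this article; the cosine metric is an arbitrary choice), DBSCAN accepts either a named metric or, with metric='precomputed', a ready-made distance matrix:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# Named metric: distances are computed internally
db_cos = DBSCAN(eps=0.1, min_samples=5, metric='cosine').fit(X)

# Precomputed metric: pass an (n_samples, n_samples) distance matrix instead of X
D = pairwise_distances(X, metric='cosine')
db_pre = DBSCAN(eps=0.1, min_samples=5, metric='precomputed').fit(D)

print(np.array_equal(db_cos.labels_, db_pre.labels_))  # expected: True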

algorithm='auto': {'auto', 'ball_tree', 'kd_tree', 'brute'}, the nearest-neighbor search algorithm. 'brute' is a brute-force search, 'kd_tree' uses a KD tree, 'ball_tree' uses a ball tree, and 'auto' chooses the most suitable of the three. If the input features are sparse, scikit-learn will fall back to the brute-force search no matter which option is chosen. In general the default 'auto' is enough; if the data set is large or has many features, 'kd_tree' or 'ball_tree' can be tried;

leaf_size=30: int, a parameter of the tree-based search algorithms: the number of samples at which a KD tree or ball tree node stops being split into subtrees. The smaller this value, the larger and deeper the resulting tree and the longer it takes to build; the larger the value, the smaller and shallower the tree and the faster it is built. Since this value mainly affects running speed and memory usage, it can usually be left at the default;

p=None: float, the power parameter of the Minkowski distance used by the nearest-neighbor metric: p=1 gives Manhattan distance, p=2 gives Euclidean distance. When using the default Euclidean distance this parameter does not need to be set;

n_jobs=1: int, the number of parallel jobs:
     1: no parallelism;
    -1: use all CPUs;
    n_jobs < 0: (n_cpus + 1 + n_jobs) CPUs are used, so n_jobs=-2 uses all CPUs but one;
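Putting the remaining parameters together, a typical instantiation for a larger, low-dimensional dense data set might look like the following sketch (the specific values are illustrative assumptions, not recommendations from the scikit-learn documentation):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=20000, centers=5, cluster_std=0.5, random_state=0)

db = DBSCAN(
    eps=0.3,               # epsilon-neighborhood radius
    min_samples=10,        # neighborhood size required for a core sample
    metric='minkowski',
    p=2,                   # p=2 makes the Minkowski metric the Euclidean distance
    algorithm='kd_tree',   # tree-based neighbor search for large, low-dimensional data
    leaf_size=30,
    n_jobs=-1,             # use all CPUs for the neighbor queries
).fit(X)

print('clusters found:', len(set(db.labels_)) - (1 if -1 in db.labels_ else 0))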

Sample code

Demo of DBSCAN clustering algorithm

"""Demo of DBSCAN clustering algorithm"""
print(__doc__)

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator is deprecated/removed in newer versions
from sklearn.preprocessing import StandardScaler


# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
