scikit-learn --- Machine Learning Clustering Algorithms: DBSCAN

DBSCAN clustering algorithm

  • DBSCAN is a density-based spatial clustering algorithm: it groups regions of sufficient density into clusters, can discover clusters of arbitrary shape in a data space containing noise, and defines a cluster as the maximal set of density-connected points.

  • DBSCAN requires two parameters to be specified:

    • epsilon: the radius of the neighborhood around a point

    • minPts: the minimum number of points the neighborhood must contain

      Combining these two parameters with the epsilon-neighborhood, the sample points are divided into three categories:

      • Core point (core point): satisfies NBHD(p, epsilon) >= minPts; these are the core sample points
      • Border point (border point): NBHD(p, epsilon) < minPts, but the point falls within the neighborhood of some core point
      • Outlier (outlier): a point that is neither a core point nor a border point
  • If this is hard to follow, look at the figure below. A number of samples are distributed in the sample space and are to be clustered into one group. Around point A we draw a red circle of the specified radius (the epsilon neighborhood); it ends up containing 5 nearby points, so A is marked in red and assigned to the red cluster, and the remaining points are handled by the same rule. (Intuitively, the algorithm picks a sample, draws a circle of the given radius around it, and requires the circle to contain at least a specified number of points. If enough sample points fall inside the circle, the circle's center shifts to another sample inside it and that sample's neighbors are circled in turn, much like a pyramid scheme recruiting its downline, until a rolled circle encloses fewer points than the threshold and stops. The starting point is then a core point, like A in the figure below; the points where the rolling stops are border points, like B and C; and points never reached by any circle are outliers, like N. A small snippet illustrating the three categories follows.)
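
  • As a rough illustration of the three categories (a minimal sketch only, not scikit-learn's internal logic; the helper classify_points and the toy values epsilon=0.5, min_pts=5 are made up for this example):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def classify_points(X, epsilon, min_pts):
        """Label each sample as 'core', 'border' or 'outlier' (illustrative helper)."""
        # NBHD(p, epsilon): for every point, the indices of all samples within radius epsilon
        nbhd = NearestNeighbors(radius=epsilon).fit(X).radius_neighbors(X, return_distance=False)
        # Core point: its epsilon-neighborhood contains at least min_pts points
        is_core = np.array([len(idx) >= min_pts for idx in nbhd])
        labels = []
        for i, idx in enumerate(nbhd):
            if is_core[i]:
                labels.append("core")
            elif is_core[idx].any():      # lies in the neighborhood of some core point
                labels.append("border")
            else:
                labels.append("outlier")  # neither core nor border
        return np.array(labels)

    # toy data: a dense cloud plus one far-away point
    X = np.vstack([np.random.RandomState(0).normal(0, 0.2, size=(30, 2)), [[5.0, 5.0]]])
    print(classify_points(X, epsilon=0.5, min_pts=5))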

1.1 Differences between DBSCAN and the K-Means clustering algorithm

  • The K-Means clustering algorithm can only handle roughly spherical (convex) clusters that aggregate around a centroid, but real-world data often comes in other shapes. In such cases the traditional clustering algorithm breaks down, and clustering by density is the way out; DBSCAN is such a density-based clustering algorithm.

1.2 DBSCAN steps (assuming minPts and epsilon are known)

  • Pick an arbitrary point p that has not yet been assigned to a class or marked as an outlier, and compute NBHD(p, epsilon) to decide whether it is a core point. If it is, create a class around this point; otherwise, mark it as an outlier.
  • Traverse the other points: once a class has been established, first add the directly density-reachable points to it, then add the density-reachable points. If a point previously marked as an outlier is added this way, change its state to border point.
  • Repeat steps 1 and 2 until every point either belongs to a class (as a core point or border point) or is marked as an outlier. (A minimal sketch of this procedure follows the list.)
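
  • A minimal sketch of the procedure above, assuming a plain Euclidean NBHD; the helpers region_query and dbscan_sketch are made-up names for illustration, not scikit-learn's API (the real DBSCAN class is used in section 1.4):

    import numpy as np

    def region_query(X, i, epsilon):
        """NBHD(p, epsilon): indices of all samples within distance epsilon of X[i]."""
        return np.where(np.linalg.norm(X - X[i], axis=1) <= epsilon)[0]

    def dbscan_sketch(X, epsilon, min_pts):
        UNVISITED, OUTLIER = -2, -1
        labels = np.full(len(X), UNVISITED)
        cluster = 0
        for i in range(len(X)):
            if labels[i] != UNVISITED:
                continue                              # already in a class or already marked
            nbhd = region_query(X, i, epsilon)
            if len(nbhd) < min_pts:
                labels[i] = OUTLIER                   # step 1: not a core point, mark as outlier
                continue
            labels[i] = cluster                       # step 1: core point, create a class around it
            seeds = list(nbhd)
            while seeds:                              # step 2: add density-reachable points
                j = seeds.pop()
                if labels[j] == OUTLIER:
                    labels[j] = cluster               # an earlier "outlier" is really a border point
                if labels[j] != UNVISITED:
                    continue
                labels[j] = cluster
                j_nbhd = region_query(X, j, epsilon)
                if len(j_nbhd) >= min_pts:            # j is also a core point: keep expanding
                    seeds.extend(j_nbhd)
            cluster += 1                              # repeat steps 1 and 2 with the next unassigned point
        return labels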

1.3 Important parameters of the DBSCAN class

    The important parameters of the DBSCAN class fall into two categories: the parameters of DBSCAN itself, and the nearest-neighbor measurement parameters. Below is a summary of these parameters.

    1) eps : a DBSCAN algorithm parameter, the distance threshold ε of the ε-neighborhood; sample points farther than ε from a point are not in its ε-neighborhood. The default is 0.5. A suitable value usually has to be chosen from several candidate settings. If eps is too large, more points fall into each core object's ε-neighborhood; the number of clusters may then shrink, and samples that should not be in the same cluster get merged into one. If eps is too small, the number of clusters may grow, and samples that belong together get split apart.

    2) min_samples : a DBSCAN algorithm parameter, the minimum number of samples the ε-neighborhood must contain for a sample point to be a core object. The default is 5. A suitable value usually has to be chosen from several candidate settings, and it is normally tuned together with eps. With eps fixed, if min_samples is too large there will be too few core objects; samples that really belong inside a cluster may then be marked as noise, and the number of clusters grows. Conversely, if min_samples is too small, a large number of core objects are generated, which may lead to too few clusters.

    3) metric : the nearest-neighbor distance metric. Many distance metrics can be used; the default Euclidean distance (i.e. the Minkowski distance with p = 2) generally meets our needs. Usable metrics include the Euclidean, Manhattan, Chebyshev, and Minkowski distances, among others.

    4) algorithm : the nearest-neighbor search algorithm. There are three algorithms: the brute-force implementation, the KD-tree implementation, and the ball-tree implementation. These three methods are explained in the principles part of the K Nearest Neighbors (KNN) summary; review it if they are unfamiliar. For this parameter there are four possible inputs: 'brute' corresponds to the brute-force implementation, 'kd_tree' to the KD-tree implementation, 'ball_tree' to the ball-tree implementation, and 'auto' lets scikit-learn weigh the three algorithms and choose the best-fitting one. Note that if the input sample features are sparse, scikit-learn will end up using the brute-force implementation 'brute' no matter which method we choose. In my experience, the default 'auto' is usually enough. If the data set is large or has many features, building with 'auto' can take a very long time and be inefficient, so the KD-tree implementation 'kd_tree' is recommended; if 'kd_tree' turns out to be slow, or the sample distribution is already known to be very non-uniform, try 'ball_tree'. And again, if the input samples are sparse, whichever method you choose, the one actually run in the end is 'brute'.

    5) leaf_size : a nearest-neighbor search algorithm parameter. When the KD tree or ball tree is used, this is the threshold on the number of samples in a leaf node at which subtree construction stops. The smaller this value, the larger and deeper the resulting KD tree or ball tree and the longer it takes to build; conversely, a larger value gives a smaller, shallower tree that builds faster. The default is 30. Because this value generally only affects the running speed and memory usage of the algorithm, it can usually be left alone.

    6) p : a nearest-neighbor distance metric parameter. It is only used to choose the value of p for the Minkowski distance and the weighted Minkowski distance: p = 1 is the Manhattan distance and p = 2 is the Euclidean distance. If you use the default Euclidean distance, you do not need to worry about this parameter.

    These are the main parameters of the DBSCAN class. In practice only two of them, eps and min_samples, need tuning, and the combination of these two values has a large influence on the final clustering result.
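
    As a rough illustration (the values shown are arbitrary examples, not recommendations), the parameters above map onto the DBSCAN constructor like this:

      from sklearn.cluster import DBSCAN

      # Illustrative settings only; eps and min_samples must be tuned for each data set.
      db = DBSCAN(
          eps=0.5,             # radius of the epsilon-neighborhood (default 0.5)
          min_samples=5,       # minimum samples in the neighborhood for a core object (default 5)
          metric='minkowski',  # nearest-neighbor distance metric
          p=2,                 # p=2 makes the Minkowski distance the Euclidean distance
          algorithm='auto',    # 'auto', 'brute', 'kd_tree' or 'ball_tree'
          leaf_size=30,        # leaf-node threshold for the KD tree / ball tree (default 30)
      )
      # labels = db.fit_predict(X)   # X is the sample matrix; label -1 marks noise points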

1.4 Code example:

  • To show DBSCAN's strength at clustering non-convex data, first generate a set of random data:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    %matplotlib inline

    # two concentric circles (non-convex clusters) plus one Gaussian blob
    X1,y1 = datasets.make_circles(n_samples=5000,factor=.6,noise=.05)
    X2,y2 = datasets.make_blobs(n_samples=1000,n_features=2,centers=[[1.2,1.2]],cluster_std=[[.1]],random_state=9)
    X = np.concatenate((X1,X2))   # combine them into a single data set
  • K-Means clustering result; the code is as follows:

    from sklearn.cluster import KMeans

    # K-Means with 3 clusters: it can only find roughly spherical clusters
    y_pred = KMeans(n_clusters=3,random_state=9).fit_predict(X)
    plt.scatter(X[:,0],X[:,1],c=y_pred)
    plt.show()

  • DBSCAN clustering result (with default parameters); the code is as follows:

    from sklearn.cluster import DBSCAN

    # DBSCAN with default parameters (eps=0.5, min_samples=5)
    y_pred = DBSCAN().fit_predict(X)
    plt.scatter(X[:,0],X[:,1],c=y_pred)
    plt.show()

  • But this is not what we expect: it puts all the data into a single class. We need to tune DBSCAN's two key parameters, eps and min_samples. Let's reduce eps to 0.1 and see the effect (a further hypothetical tuning step is sketched after this code):

    from sklearn.cluster import DBSCAN

    # shrink the neighborhood radius so that fewer points are merged into one cluster
    y_pred = DBSCAN(eps=0.1).fit_predict(X)
    plt.scatter(X[:,0],X[:,1],c=y_pred)
    plt.show()
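
  • If tuning eps alone is not enough, min_samples can be adjusted at the same time. As a hypothetical further step (not part of the original run), one might raise min_samples from the default 5 to 10, which makes it harder for a point to qualify as a core object:

    from sklearn.cluster import DBSCAN

    # hypothetical further tuning: keep eps=0.1 and require 10 samples per neighborhood
    y_pred = DBSCAN(eps=0.1,min_samples=10).fit_predict(X)
    plt.scatter(X[:,0],X[:,1],c=y_pred)
    plt.show()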
