[Clustering algorithm] OPTICS: density-based clustering

every blog every motto: You can do more than you think.
https://blog.csdn.net/weixin_39190382?type=blog

0. Preface

OPTICS clustering, a complement to DBSCAN.

1. Text

1.0 Problems with DBSCAN

Earlier we introduced DBSCAN, which clusters points according to density.
It has the following problem, however:

  • When eps is large, clusters A, B, and C are found
  • When eps is small, clusters C1, C2, and C3 are found (A and B are treated as noise), so
    no single eps yields A, B, C, C1, C2, and C3 together

Different clusters may have different densities, while DBSCAN uses a single fixed eps and therefore misses some of them.
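
A minimal sketch of this problem, assuming 1-D toy data. The helper dbscan_1d is hypothetical: a stripped-down DBSCAN written only for this illustration, not a library function. One sparse group stands in for cluster A, and two dense neighbouring groups stand in for C1 and C2 (together they form C).

```python
def dbscan_1d(xs, eps, min_samples):
    """Tiny DBSCAN for 1-D points; returns one label per point (-1 = noise)."""
    n = len(xs)
    # Each neighbourhood includes the point itself.
    neigh = [[j for j in range(n) if abs(xs[i] - xs[j]) <= eps] for i in range(n)]
    core = [len(nb) >= min_samples for nb in neigh]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            q = stack.pop()
            if not core[q]:
                continue  # border points join a cluster but do not expand it
            for j in neigh[q]:
                if labels[j] == -1:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels


# Sparse cluster "A", then two dense groups "C1" and "C2" that together form "C".
xs = [0.0, 1.0, 2.0, 3.0, 10.0, 10.1, 10.2, 10.3, 11.0, 11.1, 11.2, 11.3]
labels_large = dbscan_1d(xs, eps=1.0, min_samples=3)   # finds A and C (C1, C2 merged)
labels_small = dbscan_1d(xs, eps=0.15, min_samples=3)  # finds C1 and C2, A is noise
print(labels_large)
print(labels_small)
```

With the large eps the two dense groups cannot be told apart; with the small eps they can, but the sparse group is lost as noise.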


1.1 Basic concepts

Core point: refer to DBSCAN

Core distance: the smallest radius that makes point p a core point.

e.g.: Let the neighborhood radius be eps = 2 and the minimum number of samples be min_samples = 5.
If there are 8 sample points within radius eps = 2 of point p, then p is a core point. If the
fifth-closest point to p lies at a distance of 0.8 from p, the core distance of p is 0.8.

Note: The count compared with min_samples includes the point itself, so with min_samples = 5 point p needs only 4 other points within eps to be a core point.
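
The definition can be sketched in a few lines. This is a minimal sketch, assuming Euclidean distance, a brute-force search, and that p itself counts toward min_samples (so the sorted distance list starts with p at distance 0). The helper core_distance and the sample coordinates are hypothetical, chosen to mirror the example above.

```python
import math


def core_distance(points, p, eps, min_samples):
    """Return the core distance of points[p], or None if it is not a core point."""
    # Distances from p to every point, p itself included (distance 0).
    dists = sorted(math.dist(points[p], q) for q in points)
    within = [d for d in dists if d <= eps]
    if len(within) < min_samples:
        return None  # fewer than min_samples points inside eps: not a core point
    return within[min_samples - 1]


# p at the origin, 7 other points within eps = 2, and one far-away point.
points = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (0.6, 0.0), (0.8, 0.0),
          (1.0, 0.0), (1.5, 0.0), (1.9, 0.0), (5.0, 0.0)]
cd_p = core_distance(points, 0, eps=2, min_samples=5)    # about 0.8
cd_far = core_distance(points, 8, eps=2, min_samples=5)  # None: not a core point
print(cd_p, cd_far)
```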


Reachable distance: the reachable distance of point p with respect to core point o is the larger of the distance between p and o and the core distance of o, i.e. reach-dist(p, o) = max(core-dist(o), d(p, o)). (This is similar to the reachable distance in the LOF algorithm, which you can refer to.)

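In plain terms: a point may lie inside or outside the core distance of the core point. Inside it, the reachable distance is the core distance; outside it, the reachable distance is the actual distance.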

eg1:

  • Point p is outside the core distance of core point o1, so the reachable distance is reach-dist = d(p, o1)
  • Point p is within the core distance of core point o2, so the reachable distance is reach-dist = d_5(o2), the core distance of o2

eg2:
Points 1, 2, and 3 lie within the core distance of the core point, so their reachable distance with respect to the core point equals its core distance.
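
The two cases of the reachable distance can be checked with a short sketch. It assumes Euclidean distance; the helper reachability_distance, the coordinates, and the core distance of 0.8 are hypothetical values chosen for illustration.

```python
import math


def reachability_distance(p, o, core_dist_o):
    """reach-dist(p, o) = max(core-dist(o), d(p, o)); None if o is not a core point."""
    if core_dist_o is None:
        return None
    return max(core_dist_o, math.dist(p, o))


o = (0.0, 0.0)
core_dist_o = 0.8                                                 # assumed core distance of o
r_inside = reachability_distance((0.3, 0.0), o, core_dist_o)      # p inside: core distance
r_outside = reachability_distance((1.5, 0.0), o, core_dist_o)     # p outside: d(p, o)
print(r_inside, r_outside)
```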

1.2 Algorithm process

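Generally speaking, this is still the same expansion routine as in DBSCAN, where a core point recruits the points in its neighborhood. The only difference is that the recruited points are ranked by how close they are, that is, by reachable distance.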

  1. Define two queues
    • Ordered queue: stores the samples still to be processed, i.e. core points and their directly density-reachable points (points in the neighborhood of a core point), in ascending order of reachable distance
    • Result queue: stores the processed samples in output order
  2. Select an unprocessed core point and put it into the result queue, calculate the reachable distance of the samples in its neighborhood, and put them into the ordered queue in ascending order of reachable distance
  3. Take the first sample (the one with the smallest reachable distance) out of the ordered queue and put it into the result queue. If it is a core point, calculate the reachable distance of the unprocessed samples in its neighborhood and put them into the ordered queue, keeping the smaller value for samples that are already queued; if it is not a core point, do not expand it. When the ordered queue is empty, select a new core point and repeat step 2
  4. Iterate steps 2 and 3 until all sample points are processed, then output the samples in the result queue together with their reachable distances.
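
The steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference implementation: Euclidean distance, brute-force neighbor search, a plain dict as the ordered queue, and start points taken in index order. One deviation from step 2 is deliberate: a start point that is not a core point is written to the result queue with an undefined (infinite) reachable distance instead of being skipped, so that every sample appears in the output. The name optics_order and the toy data are hypothetical.

```python
import math


def optics_order(points, eps, min_samples):
    """Return (order, reach): processing order and reachable distance per point."""
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    def neighbors(i):
        # eps-neighborhood of i, including i itself
        return [j for j in range(n) if dist(i, j) <= eps]

    def core_dist(i, nbrs):
        if len(nbrs) < min_samples:
            return None  # not a core point
        return sorted(dist(i, j) for j in nbrs)[min_samples - 1]

    reach = [math.inf] * n   # inf stands for an undefined reachable distance
    processed = [False] * n
    order = []               # the result queue

    def update(i, nbrs, cd, seeds):
        # Queue unprocessed neighbors, keeping the smaller reachable distance
        # for points that are already in the ordered queue.
        for j in nbrs:
            if not processed[j]:
                new_reach = max(cd, dist(i, j))
                if new_reach < reach[j]:
                    reach[j] = new_reach
                    seeds[j] = new_reach

    for start in range(n):
        if processed[start]:
            continue
        nbrs = neighbors(start)
        cd = core_dist(start, nbrs)
        processed[start] = True
        order.append(start)
        if cd is None:
            continue
        seeds = {}           # the ordered queue: index -> reachable distance
        update(start, nbrs, cd, seeds)
        while seeds:
            q = min(seeds, key=seeds.get)  # smallest reachable distance first
            del seeds[q]
            processed[q] = True
            order.append(q)
            q_nbrs = neighbors(q)
            q_cd = core_dist(q, q_nbrs)
            if q_cd is not None:           # only core points are expanded
                update(q, q_nbrs, q_cd, seeds)
    return order, reach


# Two dense groups with interleaved indices, plus one isolated point.
pts = [(0.0, 0.0), (5.0, 0.0), (0.1, 0.0), (5.1, 0.0), (0.2, 0.0),
       (5.2, 0.0), (0.3, 0.0), (5.3, 0.0), (20.0, 0.0)]
order, reach = optics_order(pts, eps=1.0, min_samples=3)
print(order)
print([round(reach[i], 2) for i in order])
```

Although the input interleaves the two groups, the output order visits one group completely before the other, and the first point of each group keeps an undefined reachable distance.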

1.3 Results

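The output is the processing order of the sample points together with their reachable distances. Plotting the reachable distance against the order gives the graph discussed below: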

  • Clusters appear as valleys in the graph; the deeper the valley, the tighter the cluster
  • Noise (drawn in yellow in the graph) does not form valleys
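
One common way to read clusters off such a graph is to cut it at a fixed threshold, which gives a DBSCAN-style labeling. The sketch below assumes that approach; extract_clusters, the input lists, and the thresholds are hypothetical. Both lists are given in output order, with math.inf standing for an undefined value.

```python
import math


def extract_clusters(reach_in_order, core_in_order, threshold):
    """Label the points of an OPTICS ordering by cutting the graph at threshold."""
    labels = []
    current = -1  # id of the cluster being filled; -1 means none yet
    for r, c in zip(reach_in_order, core_in_order):
        if r > threshold:
            # A peak: the point is not reachable from the points before it.
            if c <= threshold:
                current += 1       # dense enough to open a new valley
                labels.append(current)
            else:
                labels.append(-1)  # noise
        else:
            labels.append(current)  # still inside the current valley
    return labels


inf = math.inf
reach_in_order = [inf, 0.2, 0.1, 0.1, inf, 0.2, 0.1, 0.1, inf]
core_in_order = [0.2, 0.1, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, inf]
labels_loose = extract_clusters(reach_in_order, core_in_order, threshold=0.5)
labels_tight = extract_clusters(reach_in_order, core_in_order, threshold=0.15)
print(labels_loose)
print(labels_tight)
```

Lowering the threshold keeps only the deepest parts of each valley, so points on the valley edges turn into noise.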


1.4 Summary

Under the given parameters, the OPTICS algorithm outputs the points of each cluster ordered by reachable distance. To calculate the reachable distances, it starts from a core point, finds all points in its ϵ-neighborhood, calculates the reachable distance of each of these points with respect to the core point, and sorts them. If one of the points in that ϵ-neighborhood is itself a core point, the algorithm finds all sample points in the new ϵ-neighborhood, calculates their reachable distances with respect to the new core point, and sorts again. If a sample point in the new ϵ-neighborhood was already in the old ϵ-neighborhood, its reachable distance is defined as the smaller of its reachable distances with respect to the two core points. Continuing in this way, within one cluster we see an ordering in which the reachable distances tend to increase.

When the ϵ-neighborhood of a core point of the first cluster contains no further core points, the boundary of the cluster has been reached. The next core point is then selected, which marks the beginning of another cluster. This is why the curve of the OPTICS output rises to a high point and then drops quickly to a low one: the drop indicates another dense cluster. The new cluster is processed in the same way as the first, and the full output curve is obtained once every cluster has been processed.



Origin blog.csdn.net/weixin_39190382/article/details/131477317