Clustering algorithms: the DBSCAN algorithm

1. DBSCAN algorithm

Introduction

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It treats clusters as dense regions of objects in the data space, separated by regions of low density; by grouping regions of sufficiently high density into clusters, it can find clusters of arbitrary shape in noisy data sets.

Basic concepts

There are two important parameters in the DBSCAN algorithm: Eps and MinPts. Eps is the neighborhood radius used when defining density, and MinPts is the threshold used when defining core points.

The center-based method of defining density divides points into three categories:

(1) Core point: Given a user-specified threshold MinPts, if the number of points in the Eps-neighborhood of a point (counting the point itself) is at least MinPts, then the point is called a core point.

(2) Boundary point: A boundary point is not a core point, but it falls within the Eps-neighborhood of some core point.

(3) Noise point: A noise point is neither a core point nor a boundary point.
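The three categories can be illustrated with a minimal Java sketch. It assumes a core point is one with at least MinPts points in its Eps-neighborhood, itself included, which matches the `count>=minp` test in the code later in this article. The class name `PointKinds`, the `Kind` enum, and the five sample points are hypothetical and chosen only for illustration.

```java
public class PointKinds {
    enum Kind { CORE, BORDER, NOISE }

    // Euclidean distance between two points given as {x, y}
    static double dist(int[] a, int[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Number of points within eps of pts[i], counting pts[i] itself
    static int neighborCount(int[][] pts, int i, double eps) {
        int n = 0;
        for (int[] q : pts) {
            if (dist(pts[i], q) <= eps) n++;
        }
        return n;
    }

    static Kind[] classify(int[][] pts, double eps, int minPts) {
        int n = pts.length;
        Kind[] kinds = new Kind[n];
        // First pass: a point with enough neighbors is a core point
        for (int i = 0; i < n; i++) {
            kinds[i] = neighborCount(pts, i, eps) >= minPts ? Kind.CORE : Kind.NOISE;
        }
        // Second pass: a non-core point within eps of a core point is a boundary point
        for (int i = 0; i < n; i++) {
            if (kinds[i] == Kind.CORE) continue;
            for (int j = 0; j < n; j++) {
                if (kinds[j] == Kind.CORE && dist(pts[i], pts[j]) <= eps) {
                    kinds[i] = Kind.BORDER;
                    break;
                }
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, with Eps = 2 and MinPts = 3
        int[][] pts = { {0, 0}, {0, 1}, {0, 2}, {0, 4}, {9, 9} };
        Kind[] kinds = classify(pts, 2.0, 3);
        for (int i = 0; i < pts.length; i++) {
            System.out.println("(" + pts[i][0] + "," + pts[i][1] + ") -> " + kinds[i]);
        }
    }
}
```

With these inputs, (0,0), (0,1) and (0,2) come out as core points, (0,4) as a boundary point (it has only two neighbors but lies within Eps of the core point (0,2)), and (9,9) as noise.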

Density-based clusters are defined as follows:

(1) Directly density-reachable: Given an object set D, if p is in the Eps-neighborhood of q, and q is a core object, then p is said to be directly density-reachable from object q.

(2) Density-reachable: if there is a chain of objects $p_1, p_2, \dots, p_n$ with $p_1 = q$ and $p_n = p$ such that, for $1 \le i < n$, $p_{i+1}$ is directly density-reachable from $p_i$ with respect to Eps and MinPts, then object p is density-reachable from object q with respect to Eps and MinPts.

(3) Density-connected: If there is an object o in the object set D such that objects p and q are both density-reachable from o with respect to Eps and MinPts, then objects p and q are density-connected with respect to Eps and MinPts.

(4) Density reachability is the transitive closure of direct density reachability. This relation is asymmetric in general; only core objects are mutually density-reachable. Density connectivity, by contrast, is symmetric.
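The asymmetry is easier to see on a concrete case. The following is a minimal Java sketch, not a definitive implementation: it assumes a core object has at least MinPts objects in its Eps-neighborhood (itself included), and the class name `Reachability`, its method names, and the four sample points are hypothetical. Points are referred to by their index in the array.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class Reachability {
    static double dist(int[] a, int[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Assumption: a core object has at least minPts objects within eps, itself included
    static boolean isCore(int[][] pts, int i, double eps, int minPts) {
        int n = 0;
        for (int[] r : pts) {
            if (dist(pts[i], r) <= eps) n++;
        }
        return n >= minPts;
    }

    // p is directly density-reachable from q: q is core and p lies in q's Eps-neighborhood
    static boolean directlyReachable(int[][] pts, int p, int q, double eps, int minPts) {
        return isCore(pts, q, eps, minPts) && dist(pts[p], pts[q]) <= eps;
    }

    // p is density-reachable from q: follow chains of direct reachability starting at q
    static boolean reachable(int[][] pts, int p, int q, double eps, int minPts) {
        boolean[] seen = new boolean[pts.length];
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(q);
        seen[q] = true;
        while (!queue.isEmpty()) {
            int cur = queue.poll();
            // A chain can only be extended from a core object
            if (!isCore(pts, cur, eps, minPts)) continue;
            for (int next = 0; next < pts.length; next++) {
                if (dist(pts[cur], pts[next]) <= eps) {
                    if (next == p) return true;
                    if (!seen[next]) {
                        seen[next] = true;
                        queue.add(next);
                    }
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical data with Eps = 2, MinPts = 3: index 0 is a core object, index 3 is not
        int[][] pts = { {0, 0}, {0, 1}, {0, 2}, {0, 4} };
        System.out.println(reachable(pts, 3, 0, 2.0, 3)); // true: (0,0) -> (0,2) -> (0,4)
        System.out.println(reachable(pts, 0, 3, 2.0, 3)); // false: (0,4) is not a core object
    }
}
```

Here (0,4) is density-reachable from (0,0) through the core object (0,2), but nothing is density-reachable from (0,4) because it is not a core object. Both points are still density-connected, since both are density-reachable from (0,2).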

Algorithm process

In layman's terms, the algorithm flow is as follows:

(1) Mark all data objects as core objects, boundary objects or noise objects.

(2) Delete noise objects.

(3) Add an edge between every pair of core objects whose distance is within Eps.

(4) Each group of connected core objects forms a cluster.

(5) Assign each boundary object to the cluster of a core object whose Eps-neighborhood contains it.
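The five steps above can be sketched in Java as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in the example section: a core object is assumed to have at least MinPts objects in its Eps-neighborhood (itself included), a boundary object that is near several clusters is assigned to the first matching core object, and the class name `DbscanSketch`, the label value `NOISE = -1`, and the sample points are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class DbscanSketch {
    static final int NOISE = -1;

    static double dist(int[] a, int[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Returns one cluster label per point; NOISE (-1) marks noise objects
    static int[] cluster(int[][] pts, double eps, int minPts) {
        int n = pts.length;
        // Step 1: find the core objects
        boolean[] core = new boolean[n];
        for (int i = 0; i < n; i++) {
            int count = 0;
            for (int j = 0; j < n; j++) {
                if (dist(pts[i], pts[j]) <= eps) count++;
            }
            core[i] = count >= minPts;
        }
        // Step 2: every object starts as noise; objects never assigned below stay noise
        int[] label = new int[n];
        Arrays.fill(label, NOISE);
        // Steps 3 and 4: connected components of core objects linked by distance <= eps
        int clusters = 0;
        for (int i = 0; i < n; i++) {
            if (!core[i] || label[i] != NOISE) continue;
            int id = clusters++;
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(i);
            label[i] = id;
            while (!stack.isEmpty()) {
                int cur = stack.pop();
                for (int j = 0; j < n; j++) {
                    if (core[j] && label[j] == NOISE && dist(pts[cur], pts[j]) <= eps) {
                        label[j] = id;
                        stack.push(j);
                    }
                }
            }
        }
        // Step 5: a boundary object takes the cluster of the first core object within eps
        for (int i = 0; i < n; i++) {
            if (core[i]) continue;
            for (int j = 0; j < n; j++) {
                if (core[j] && dist(pts[i], pts[j]) <= eps) {
                    label[i] = label[j];
                    break;
                }
            }
        }
        return label;
    }

    public static void main(String[] args) {
        // Hypothetical data with Eps = 2, MinPts = 3
        int[][] pts = {
            {0, 0}, {0, 1}, {0, 2}, {0, 4},
            {10, 10}, {10, 11}, {11, 10}, {20, 20}
        };
        System.out.println(Arrays.toString(cluster(pts, 2.0, 3)));
        // prints [0, 0, 0, 0, 1, 1, 1, -1]
    }
}
```

In this run the first three points are core objects of cluster 0, (0,4) joins cluster 0 as a boundary object, the three points around (10,10) form cluster 1, and (20,20) remains noise. Rather than deleting noise objects outright, the sketch keeps them with the label -1 so that every input point has an entry in the result.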

2. Example of DBSCAN algorithm

Case description and data

The input is a set of two-dimensional points (x, y). You can define your own points or use the data below; save the data in a text file, one point per line, and put it in the directory that the code reads from. Eps is set to 2 and MinPts is set to 3. Use the DBSCAN algorithm to cluster the points.

0,0
0,1
0,2
0,3
0,4
0,5
12,1
12,2
12,3
12,4
12,5
12,6
0,6
0,7
12,7
0,8
0,9
1,1
6,8
8,7
6,7
3,5

Code

// Point class definition
public class Point {
    private int x;
    private int y;
    private boolean isKey;
    private boolean isClassed;
    public boolean isKey() {
        return isKey;
    }
    public void setKey(boolean isKey) {

        this.isKey = isKey;
        this.isClassed = true;
    }
    public boolean isClassed() {
        return isClassed;
    }
    public void setClassed(boolean isClassed) {

        this.isClassed = isClassed;
    }
    public int getX() {
        return x;
    }
    public void setX(int x) {
        this.x = x;
    }
    public int getY() {
        return y;
    }
    public void setY(int y) {
        this.y = y;
    }
    public Point() {
        x = 0;
        y = 0;
    }
    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
    public Point(String str) {
        String[] p = str.split(",");
        this.x = Integer.parseInt(p[0]);
        this.y = Integer.parseInt(p[1]);
    }
    public String print() {
        return "(" + this.x + "," + this.y + ")";
    }
}

// Utility operations on points
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;
public class Utility {


    /**
     * Compute the distance between two points
     * @param p a point
     * @param q a point
     * @return the Euclidean distance between the two points
     */
    public static double getDistance(Point p,Point q){

        int dx=p.getX()-q.getX();
        int dy=p.getY()-q.getY();
        double distance=Math.sqrt(dx*dx+dy*dy);
        return distance;
    }

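    /**
     * Check whether the given point is a core point
     * @param lst list holding all points
     * @param p the point to test
     * @param e the Eps radius
     * @param minp the density threshold (MinPts)
     * @return the Eps-neighborhood of p if p is a core point, otherwise null
     */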
    public static List<Point> isKeyPoint(List<Point> lst,Point p,int e,int minp){

        int count=0;
        List<Point> tmpLst=new ArrayList<Point>();
        for (Point q : lst) {
            if (getDistance(p, q) <= e) {
                ++count;
                if (!tmpLst.contains(q)) {
                    tmpLst.add(q);
                }
            }
        }
        if(count>=minp){
            p.setKey(true);
            return tmpLst;
        }
        return null;
    }

    /**
     * Mark every point in the list as classified
     * @param lst the points to mark
     */
    public static void setListClassed(List<Point> lst){

        for (Point p : lst) {
            if (!p.isClassed()) {
                p.setClassed(true);
            }
        }
    }

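    /**
     * If b contains any element of a, merge b into a
     * @param a the list that receives the merged points
     * @param b the list to merge into a
     * @return true if the lists shared a point and were merged, otherwise false
     */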
    public static boolean mergeList(List<Point> a,List<Point> b){
        boolean merge=false;
        for (Point value : b) {
            if (a.contains(value)) {
                merge = true;
                break;
            }
        }
        if(merge){
            for (Point point : b) {
                if (!a.contains(point)) {
                    a.add(point);
                }
            }
        }
        return merge;
    }

    /**
     * Read the data
     * @return the list of points in the text file
     */
    public static List<Point> getPointsList() throws IOException{
        List<Point> lst=new ArrayList<Point>();
        //String txtPath="D:"+File.separatorChar+"Points.txt";
        String txtPath="E:\\myfile\\Points.txt";
        BufferedReader br=new BufferedReader(new FileReader(txtPath));

        String str="";
        while((str = br.readLine()) != null && !str.equals("")){
            lst.add(new Point(str));
        }
        br.close();
        return lst;
    }
}

// Main algorithm

import java.io.*;
import java.util.*;

public class DBScan {
    private static List<Point> pointsList = new ArrayList<Point>(); // initial list of points
    private static List<List<Point>> resultList = new ArrayList<List<Point>>(); // clustering result
    private static int e = 2; // Eps radius
    private static int minp = 3; // density threshold (MinPts)
    //private static int minp = 4; // alternative density threshold

    /**
     * Print the result
     **/
    private static void display() {
        int index = 1;
        for (List<Point> lst : resultList) {
            if (lst.isEmpty()) {
                continue;
            }
            System.out.println("*****Cluster " + index + "*****");
            for (Point p : lst) {
                System.out.print(p.print());

            }
            System.out.print("\n");
            index++;
        }
    }

    /**
     * Run the algorithm
     */
    private static void applyDbscan() {
        try {
            pointsList = Utility.getPointsList();
            for (Point p : pointsList) {
                if (!p.isClassed()) {
                    List<Point> tmpLst = new ArrayList<Point>();
                    if ((tmpLst = Utility.isKeyPoint(pointsList, p, e, minp)) != null) {
                        // Mark all points that have been clustered
                        Utility.setListClassed(tmpLst);
                        resultList.add(tmpLst);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

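    /**
     * Merge the result sets
     * @return the merged list of clusters (merged-away lists are left empty)
     */
    private static List<List<Point>> getResult() {
        applyDbscan(); // find the neighborhoods of the core points
        int length = resultList.size();
        boolean merged = true;
        // Repeat until nothing merges, so that lists linked only through a later merge are not missed
        while (merged) {
            merged = false;
            for (int i = 0; i < length; ++i) {
                for (int j = i + 1; j < length; ++j) {
                    if (Utility.mergeList(resultList.get(i), resultList.get(j))) {
                        resultList.get(j).clear();
                        merged = true;
                    }
                }
            }
        }
        return resultList;
    }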

    /**
     * Program entry point
     * @param args command-line arguments (unused)
     */
    public static void main(String[] args) {
        getResult();
        display();

    }
}

3. Running results

4. Summary

The DBSCAN algorithm can cluster dense data sets of any shape. By contrast, clustering algorithms such as K-Means and Mean Shift are generally only suitable for convex data sets. In addition, DBSCAN finds outliers while clustering and is insensitive to outliers in the data set.

The DBSCAN algorithm also has shortcomings. If the density of the sample set is uneven and the spacing between clusters varies widely, the clustering quality will be poor; when the sample set is large, clustering takes longer to run; and tuning Eps and MinPts together is more difficult than tuning a single parameter.

In practice, we should choose a clustering algorithm that suits the type of data at hand.
