MATLAB implementation of the DBSCAN algorithm (with detailed comments on each line of code)

This article was written mainly to complete a homework assignment and to deepen my own understanding of the algorithm. I hope it is also helpful to readers.


1. What is the DBSCAN algorithm

DBSCAN is a density-based clustering algorithm built on the idea of high-density connected regions. It groups regions of sufficiently high density into clusters and can find clusters of arbitrary shape in noisy data. In short, the goal of DBSCAN is to find maximal sets of density-connected objects. The basic principle is as follows: DBSCAN needs a distance measure. For the data set to be clustered, the distance between any two points reflects the density around them and determines whether they can be grouped into the same class. Because density is hard to define meaningfully for high-dimensional data, DBSCAN is best suited to low-dimensional data; for points in two-dimensional space, the Euclidean distance can be used as the measure.

2. Significance of DBSCAN algorithm

First, a word about clustering algorithms in general. A clustering algorithm performs cluster analysis: it classifies data using the cluster structure and patterns inherent in the data itself. It does not need labeled training samples to obtain prior knowledge, which avoids the cost of a training phase. Clustering can also be combined with deep learning: the network learns features of the data's internal structure and patterns, a preliminary clustering is obtained from those features, and that clustering is then refined step by step into the final classification. Such combined approaches are reported to ease the tension between using comprehensive information and the curse of dimensionality, with good practicality and good agreement with subjective judgment.

The DBSCAN clustering algorithm is widely used in fields such as face recognition and traffic analysis.

3. DBSCAN algorithm code analysis

3.1 Key concepts

Eps parameter: the neighborhood radius used when defining density;

MinPts parameter: the threshold used when defining a core point;

Eps-neighborhood: the region within radius Eps of a given object is called the Eps-neighborhood of that object;

Core object: if the Eps-neighborhood of a given object contains at least MinPts sample points, the object is called a core object;

Direct density reachability: for a sample set D, if the sample point q is in the Eps-neighborhood of p and p is a core object, then q is directly density-reachable from p;

Density reachability: for a sample set D, given a chain of sample points p1, p2, ..., pn with p = p1 and q = pn, if each pi is directly density-reachable from pi-1 (i = 2, ..., n), then q is density-reachable from p;

Density connection: if there is a point o in the sample set D such that both p and q are density-reachable from o, then p and q are density-connected.
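
To make these definitions concrete, here is a minimal sketch on a tiny one-dimensional data set. The data, the Eps and MinPts values, and the helper name directlyReachable are all made up for illustration; none of them come from the Yarpiz code discussed below.

```matlab
% Five points on a line; the last one lies far away from the rest.
X = [0; 1; 2; 3; 10];
Eps = 1.5;      % neighborhood radius (illustrative value)
MinPts = 3;     % core-object threshold (illustrative value)

% Pairwise distances. In one dimension the Euclidean distance is |a - b|.
D = abs(bsxfun(@minus, X, X'));

% inEps(p, q) is true when q lies in the Eps-neighborhood of p.
% Every point counts as a member of its own neighborhood.
inEps = D <= Eps;

% Core objects: at least MinPts points inside the Eps-neighborhood.
isCore = sum(inEps, 2) >= MinPts;

% q is directly density-reachable from p when q is in the
% Eps-neighborhood of p AND p is a core object.
directlyReachable = @(p, q) inEps(p, q) && isCore(p);

disp(isCore');                  % only points 2 and 3 are core objects
disp(directlyReachable(2, 1));  % true: point 2 is core, point 1 is nearby
disp(directlyReachable(1, 2));  % false: point 1 is not a core object
```

The last two lines show that direct density reachability is not symmetric: a border point can be reached from a core point, but not the other way round.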

3.2 General idea

The definition of a cluster in DBSCAN is simple: a maximal set of density-connected samples, derived from the density-reachability relation, is one cluster of the final clustering. A DBSCAN cluster can contain one or more core points. If it contains only one core point, all the other (non-core) samples in the cluster lie in the Eps-neighborhood of that core point. If it contains several core points, the Eps-neighborhood of every core point in the cluster must contain at least one other core point; otherwise the core points could not be density-reachable from one another. The set of all samples in the Eps-neighborhoods of these core points forms one DBSCAN cluster.
       The description of the DBSCAN algorithm is as follows:

  • Input: dataset, neighborhood radius Eps, threshold MinPts of the number of data objects in the neighborhood;
  • Output: Density-connected clusters.

The specific processing flow is as follows:
1) Select an unprocessed data object point p from the data set (the order is arbitrary);
2) If p is a core point for the parameters Eps and MinPts, find all data objects that are density-reachable from p; together they form a cluster;
3) If p is an edge point (not a core point), skip it and select another data object point;
4) Repeat steps (2) and (3) until all points are processed.
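
The flow above depends on knowing whether a point is a core point, an edge (border) point, or noise. The following is a minimal sketch of that classification on a small two-dimensional data set. The data and parameter values are invented for illustration, and the sketch only labels point types; it does not assign cluster numbers and is not the implementation annotated below.

```matlab
% Six points in the plane; the last one is isolated.
X = [0 0; 0 1; 1 0; 1 1; 2 1; 8 8];
Eps = 1.1;      % neighborhood radius (illustrative value)
MinPts = 3;     % core-point threshold (illustrative value)
n = size(X, 1);

% Pairwise Euclidean distances, computed with plain loops so that
% no toolbox function is required.
D = zeros(n);
for a = 1:n
    for b = 1:n
        D(a, b) = norm(X(a, :) - X(b, :));
    end
end

% Core points have at least MinPts points (including themselves)
% within distance Eps.
isCore = sum(D <= Eps, 2) >= MinPts;

% A non-core point within Eps of some core point is an edge point;
% a non-core point with no core point nearby is noise.
nearCore = any(D(:, isCore) <= Eps, 2);
isBorder = ~isCore & nearCore;
isNoise  = ~isCore & ~nearCore;

disp([isCore isBorder isNoise]);   % one row per point: core, edge, noise
```

With these values the four points of the unit square are core points, (2, 1) is an edge point attached to (1, 1), and (8, 8) is noise.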

3.3 Detailed comments on each line of code

(1) The DBSCAN.m file is annotated as follows:

%
% Copyright (c) 2015, Yarpiz (www.yarpiz.com)
% All rights reserved. Please read the "license.txt" for license terms.
%
% Project Code: YPML110
% Project Title: Implementation of DBSCAN Clustering in MATLAB
% Publisher: Yarpiz (www.yarpiz.com)
% 
% Developer: S. Mostapha Kalami Heris (Member of Yarpiz Team)
% 
% Contact Info: [email protected], [email protected]
%
% The block above is the copyright and license header; it does not affect execution and needs no further explanation.

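function [IDX, isnoise]=DBSCAN(X,epsilon,MinPts)    % DBSCAN clustering function

    C=0;                       % cluster counter, initialized to 0
    
    n=size(X,1);               % number of rows of X, i.e. the number of points
    IDX=zeros(n,1);            % n-by-1 vector of cluster labels; 0 means "not assigned to any cluster"
    
    D=pdist2(X,X);             % n-by-n matrix of pairwise Euclidean distances between the rows of X (pdist2 is in the Statistics and Machine Learning Toolbox)
    
    visited=false(n,1);        % n-by-1 flags, all false: no point has been visited yet
    isnoise=false(n,1);        % n-by-1 flags, all false: no point has been marked as noise yet
    
    for i=1:n                  % loop over all n points
        if ~visited(i)         % only process points that have not been visited
            visited(i)=true;   % mark point i as visited
            
            Neighbors=RegionQuery(i);     % indices of all points within epsilon of point i (including i itself)
            if numel(Neighbors)<MinPts    % fewer than MinPts neighbors: i is not a core point
                % X(i,:) is NOISE        
                isnoise(i)=true;          % mark i as noise; the flag is not cleared if i later joins a cluster as an edge point
            else              % at least MinPts neighbors: i is a core point
                C=C+1;        % start a new cluster and increment the cluster count
                ExpandCluster(i,Neighbors,C);    % grow cluster C outward from core point i
            end
            
        end
    
    end                    % all n points have been processed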
    
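    function ExpandCluster(i,Neighbors,C)    % grow cluster C from core point i and its neighbors
        IDX(i)=C;                            % assign point i to cluster C
        
        k = 1;                               % position in the (growing) Neighbors list
        while true                           % loop until the Neighbors list is exhausted
            j = Neighbors(k);                % the k-th candidate point
            
            if ~visited(j)                   % if j has not been visited yet
                visited(j)=true;             % mark it as visited
                Neighbors2=RegionQuery(j);   % indices of all points within epsilon of j
                if numel(Neighbors2)>=MinPts % j is itself a core point
                    % append j's neighbors so the cluster keeps expanding (duplicates are possible)
                    Neighbors=[Neighbors Neighbors2];   %#ok
                end
            end                              % end of "if ~visited(j)"
            if IDX(j)==0                     % if j does not belong to any cluster yet
                IDX(j)=C;                    % assign j to cluster C
            end                              % end of "if IDX(j)==0"
            
            k = k + 1;                       % move on to the next candidate
            if k > numel(Neighbors)          % stop once every candidate has been processed
                break;
            end
        end
    end                                      % end of ExpandCluster
    
    function Neighbors=RegionQuery(i)        % return the indices of all points within epsilon of point i
        Neighbors=find(D(i,:)<=epsilon);
    end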

end

(2) The mydata.mat file is annotated as follows:

The original data is two-dimensional, that is, points in the plane. The data shipped with the source code is as follows:

(Supplement on April 18: many readers have asked what this data is, so here is an explanation. These are the coordinates of all the points; each row holds the abscissa and ordinate of one point. The matrix has 1000 rows and 2 columns, meaning there are 1000 points. The coordinates of the first point are (0.8514, -0.4731), those of the second are (-0.0143, 0.6897), and so on. This is the data provided on the official website; you can also generate your own data.)
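
For readers who want to generate their own data, the following is a minimal sketch that builds a 1000-by-2 matrix with the same layout (one point per row) and saves it under the variable name X, which is the name main.m reads. The two-blob layout, the spread values and the file name mydata_synthetic.mat are assumptions chosen for illustration; this is not how the original mydata.mat was produced.

```matlab
% Two Gaussian blobs plus uniformly scattered background points.
nPerCluster = 450;   % points per blob (illustrative value)
nNoise = 100;        % background points (illustrative value)

% Blob 1 centered at (-1, 0), blob 2 centered at (1, 0),
% both with standard deviation 0.3 in each coordinate.
c1 = bsxfun(@plus, 0.3 * randn(nPerCluster, 2), [-1 0]);
c2 = bsxfun(@plus, 0.3 * randn(nPerCluster, 2), [ 1 0]);

% Background points drawn uniformly from the square [-2, 2] x [-2, 2].
noise = 4 * rand(nNoise, 2) - 2;

% Stack everything into one matrix: each row is (abscissa, ordinate).
X = [c1; c2; noise];

% Save under a separate, hypothetical file name so the original
% mydata.mat is not overwritten.
save('mydata_synthetic.mat', 'X');
```

To use the generated file, change the load call in main.m to load('mydata_synthetic'). The values of epsilon and MinPts may need adjusting for data with a different scale or density.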

(3) The PlotClusterinResult.m file is annotated as follows:

%
% Copyright (c) 2015, Yarpiz (www.yarpiz.com)
% All rights reserved. Please read the "license.txt" for license terms.
%
% Project Code: YPML110
% Project Title: Implementation of DBSCAN Clustering in MATLAB
% Publisher: Yarpiz (www.yarpiz.com)
% 
% Developer: S. Mostapha Kalami Heris (Member of Yarpiz Team)
% 
% Contact Info: [email protected], [email protected]
%
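% The block above is again the copyright and license header; it needs no further explanation.

function PlotClusterinResult(X, IDX)                % plot the clustering result

    k=max(IDX);                                     % largest cluster label, i.e. the number of clusters

    Colors=hsv(k);                                  % one distinct color per cluster

    Legends = {};
    for i=0:k                                       % loop over noise (label 0) and every cluster (labels 1..k)
        Xi=X(IDX==i,:);                             % the points with label i
        if i~=0                                     % a real cluster
            Style = 'x';                            % marker symbol: x
            MarkerSize = 8;                         % marker size: 8
            Color = Colors(i,:);                    % color of cluster i
            Legends{end+1} = ['Cluster #' num2str(i)]; 
        else                                        % label 0: noise points
            Style = 'o';                            % marker symbol: o
            MarkerSize = 6;                         % marker size: 6
            Color = [0 0 0];                        % noise is drawn in black
            if ~isempty(Xi)
                Legends{end+1} = 'Noise';           % add a legend entry only if there are noise points
            end
        end
        if ~isempty(Xi)
            plot(Xi(:,1),Xi(:,2),Style,'MarkerSize',MarkerSize,'Color',Color);
        end
        hold on;                                 % keep what is already drawn so later groups are added to the same axes
    end
    hold off;                                    % restore the default: the next plot command replaces the axes contents
    axis equal;                                  % use the same unit length on both axes
    grid on;                                     % show grid lines
    legend(Legends);
    legend('Location', 'NorthEastOutside');      % the default location is NorthEast; move the legend outside the axes

end                                              % end of function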

(4) The main.m file is annotated as follows:

%
% Copyright (c) 2015, Yarpiz (www.yarpiz.com)
% All rights reserved. Please read the "license.txt" for license terms.
%
% Project Code: YPML110
% Project Title: Implementation of DBSCAN Clustering in MATLAB
% Publisher: Yarpiz (www.yarpiz.com)
% 
% Developer: S. Mostapha Kalami Heris (Member of Yarpiz Team)
% 
% Contact Info: [email protected], [email protected]
%
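% The block above is once more the copyright and license header; it needs no further explanation.

clc;                    % clear the command window
clear;                  % clear workspace variables so they cannot affect the code below
close all;              % close all figure windows

%% Load Data

data=load('mydata');    % load mydata.mat into a struct
X=data.X;               % the matrix of point coordinates, one point per row


%% Run DBSCAN Clustering Algorithm

epsilon=0.5;                          % neighborhood radius
MinPts=10;                            % core-point threshold
IDX=DBSCAN(X,epsilon,MinPts);         % run the clustering; IDX holds the cluster label of each point (0 = noise)


%% Plot Results

PlotClusterinResult(X, IDX);          % plot the points, colored by cluster label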
title(['DBSCAN Clustering (\epsilon = ' num2str(epsilon) ', MinPts = ' num2str(MinPts) ')']);

4. Summary

The four parts above make up the complete code of this DBSCAN implementation. If you need the mydata.mat data file (the variable X), send me a private message and I can send it to you. In practical applications, of course, you would run the algorithm on your own data.

That is all for the line-by-line comments on the DBSCAN code. I hope it is helpful to everyone, and I would be glad to discuss the algorithm with readers.

Origin blog.csdn.net/TaloyerG/article/details/123916617