MVO-Optimized DBSCAN for Clustering

Table of Contents

I. MVO

1. Basic concepts

2. Algorithm principle

3. Advantages and disadvantages of the algorithm

4. Algorithm flow

II. DBSCAN

1. Basic concepts

2. Algorithm principle

3. Advantages and disadvantages of the algorithm

4. Algorithm flow

5. Parameter setting

6. MATLAB code

III. MVO-Optimized DBSCAN Clustering

References


I. MVO

1. Basic concepts

The MVO (Multi-Verse Optimizer) algorithm is inspired by the multiverse theory in physics: it builds a mathematical description of white holes, black holes, and wormholes, together with their interaction mechanisms, and uses it to solve optimization problems.

White hole: a special celestial body that only emits matter and never absorbs it; it is regarded as the main component in the birth of a universe.

Black hole: the opposite of a white hole; it attracts all matter in the universe, and the known laws of physics break down inside it.

Wormhole: a multi-dimensional space-time tunnel connecting white holes and black holes. It can transport individuals to any corner of a universe, or even from one universe to another. Through the interaction of white holes, black holes, and wormholes, the multiverse reaches a stable state.

2. Algorithm principle

The MVO algorithm builds a mathematical model around the three concepts above. Each candidate solution is defined as a universe, and its fitness is the universe's inflation rate. In each iteration, every candidate solution acts as a black hole, while universes with high fitness are selected as white holes by roulette-wheel selection. Black holes and white holes exchange matter (individual dimensions are swapped), and some universes travel through wormholes toward the best universe found so far (a local search around the group best).

3. Advantages and disadvantages of the algorithm

3.1 Advantages

The algorithm has few control parameters; the main ones are the wormhole existence probability (WEP) and the travelling distance rate (TDR). Numerical experiments on low-dimensional problems show relatively good performance.

3.2 Disadvantages

Performance degrades on large-scale optimization problems, and the algorithm lacks a strong mechanism for escaping local optima, so it may fail to find the global optimum.

4. Algorithm flow

The MVO algorithm can be described by the following pseudo-code:

Create random universes (U)
Initialize WEP, TDR, Best_universe
while the end criterion is not satisfied
      Update WEP and TDR
      Evaluate the fitness (inflation rate) of all universes
      SU = universes sorted by fitness
      NI = normalized inflation rates of the universes
      for each universe indexed by i
          Black_hole_index = i;
          for each object (dimension) indexed by j
              r1 = random([0,1]);
              if r1 < NI(i)
                 White_hole_index = RouletteWheelSelection(-NI);
                 U(Black_hole_index,j) = SU(White_hole_index,j);
              end if

              r2 = random([0,1]);
              if r2 < Wormhole_existence_probability
                    r3 = random([0,1]);
                    r4 = random([0,1]);
                    if r3 < 0.5
                        U(i,j) = Best_universe(j) + Travelling_distance_rate*((ub(j)-lb(j))*r4 + lb(j));
                    else
                        U(i,j) = Best_universe(j) - Travelling_distance_rate*((ub(j)-lb(j))*r4 + lb(j));
                    end if
              end if
          end for
      end for
      Update Best_universe
end while

For a detailed introduction to the MVO algorithm, see: MVO algorithm and its pseudo code (in Chinese).
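As a concrete illustration of the pseudo-code, here is a minimal NumPy sketch of MVO (in Python rather than the MATLAB used later). The rank-based roulette wheel and the exact WEP/TDR schedules are simplifying assumptions of this sketch, not a reference implementation:

```python
import numpy as np

def mvo(fitness, lb, ub, n_universes=30, n_iter=200, p=6, seed=0):
    """Minimal Multi-Verse Optimizer sketch (minimization)."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    dim = lb.size
    U = rng.uniform(lb, ub, size=(n_universes, dim))    # random universes
    best_u, best_f = U[0].copy(), float("inf")
    ranks = np.arange(n_universes, 0, -1, dtype=float)  # rank-based roulette weights
    probs = ranks / ranks.sum()                         # index 0 = best sorted universe
    for t in range(1, n_iter + 1):
        WEP = 0.2 + t * (1.0 - 0.2) / n_iter            # wormhole existence probability
        TDR = 1 - (t / n_iter) ** (1 / p)               # travelling distance rate
        f = np.array([fitness(u) for u in U])
        order = np.argsort(f)                           # lowest (best) fitness first
        if f[order[0]] < best_f:
            best_f, best_u = f[order[0]], U[order[0]].copy()
        SU = U[order].copy()                            # sorted universes
        NI = f / (np.abs(f).sum() + 1e-12)              # normalized inflation rates
        for i in range(n_universes):
            for j in range(dim):
                if rng.random() < NI[i]:                # white/black hole exchange
                    k = rng.choice(n_universes, p=probs)
                    U[i, j] = SU[k, j]
                if rng.random() < WEP:                  # wormhole travel near the best
                    step = TDR * ((ub[j] - lb[j]) * rng.random() + lb[j])
                    U[i, j] = best_u[j] + step if rng.random() < 0.5 else best_u[j] - step
        U = np.clip(U, lb, ub)
    return best_u, best_f

# usage: minimize the 2-D sphere function
best_x, best_val = mvo(lambda x: float(np.sum(x**2)), lb=[-5, -5], ub=[5, 5])
```

The wormhole step dominates exploitation: as TDR shrinks over the iterations, universes are perturbed ever closer to the best universe found so far.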


II. DBSCAN

1. Basic concepts

DBSCAN is a density-based clustering algorithm with two key parameters: Eps, the neighborhood radius used to define density, and MinPts, the threshold used to define a core point.

(1) Core point: a point that has at least MinPts points within its Eps-neighborhood.

(2) Boundary point: a point that has fewer than MinPts points within its Eps-neighborhood, but falls within the Eps-neighborhood of some core point.

(3) Noise point: a point that is neither a core point nor a boundary point.

Figure 1 Core points, boundary points, and noise points
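The three point types can be checked directly from pairwise distances. A small NumPy sketch (the function name `classify_points` and the convention that a point counts itself as a neighbor are assumptions for illustration):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'boundary', or 'noise'."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    counts = (D <= eps).sum(axis=1)          # neighbors within Eps, including self
    is_core = counts >= min_pts
    # boundary: not core, but inside some core point's Eps-neighborhood
    if is_core.any():
        near_core = (D[:, is_core] <= eps).any(axis=1)
    else:
        near_core = np.zeros(len(X), dtype=bool)
    return np.where(is_core, "core", np.where(near_core, "boundary", "noise"))

# usage: a tight cluster plus one far-away outlier
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.1, 0.1], [0.05, 0.05], [10.0, 10.0]])
kinds = classify_points(X, eps=0.5, min_pts=4)
```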

(4) Directly density-reachable: if x_j lies within the Eps-neighborhood of a core point x_i, then x_j is said to be directly density-reachable from x_i.

(5) Density-reachable: for x_i and x_j, if there exists a sequence of points p_1, p_2, ..., p_n with p_1 = x_i and p_n = x_j such that each p_{i+1} is directly density-reachable from p_i, then x_j is density-reachable from x_i.

(6) Density-connected: for x_i and x_j, if there exists a point x_k such that both x_i and x_j are density-reachable from x_k, then x_i and x_j are density-connected.

Figure 2 Directly density-reachable, density-reachable, and density-connected

As shown in Figure 2, the dashed circles represent Eps-neighborhoods. x_1 to x_5 are core points; x_2 and x_4 are directly density-reachable from x_1; x_3 and x_5 are density-reachable from x_1; and x_3 and x_5 are density-connected.

2. Algorithm principle

The algorithm first finds all core points according to the parameters Eps and MinPts, then starts from an arbitrary unvisited core point and collects all points that are density-reachable from it to form a cluster, repeating until every core point has been visited.
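A minimal Python sketch of this procedure (NumPy only; `dbscan` is an illustrative re-implementation, not the MATLAB `DBSCAN` function used below; label 0 marks noise):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: returns integer labels, 0 = noise, 1..k = clusters."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    nbrs = [np.flatnonzero(D[i] <= eps) for i in range(n)]     # Eps-neighborhoods (incl. self)
    is_core = np.array([len(nb) >= min_pts for nb in nbrs])
    labels = np.zeros(n, dtype=int)
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != 0:
            continue                   # start only from unvisited core points
        cluster += 1
        labels[i] = cluster
        stack = [i]
        while stack:                   # collect everything density-reachable from i
            p = stack.pop()
            if not is_core[p]:
                continue               # boundary points join but do not expand
            for q in nbrs[p]:
                if labels[q] == 0:
                    labels[q] = cluster
                    stack.append(q)
    return labels

# usage: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.1, (40, 2)),
               rng.normal([3, 3], 0.1, (40, 2))])
labels = dbscan(X, eps=0.5, min_pts=4)
```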

3. Advantages and disadvantages of the algorithm

3.1 Advantages

(1) The number of clusters does not need to be specified in advance;

(2) It can find clusters of arbitrary shape;

(3) It can identify the noise points in the data and is insensitive to them.

3.2 Disadvantages

(1) The clustering result depends on the choice of distance metric; for high-dimensional data, the curse of dimensionality degrades any distance measure;

(2) When the density of the data set is not uniform, it is difficult to choose suitable values of Eps and MinPts, which degrades the clustering result.

4. Algorithm flow

Figure 3 DBSCAN algorithm flow chart

5. Parameter setting

(1) Eps: In the original DBSCAN paper, Eps is chosen with a k-distance graph. Definition: given the neighborhood parameter k, compute for each point in the data set the distance to its k-th nearest neighbor, sort all of these k-distances in descending order, and plot the sorted values. The k-distance at the first valley (elbow) of the graph is taken as Eps, which usually yields a good clustering result.

Note: the parameter k in the k-distance graph is usually set to 4.
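The k-distance computation itself is a few lines of NumPy (Euclidean distance assumed; `k_distance` is an illustrative helper, separate from the MATLAB version below):

```python
import numpy as np

def k_distance(X, k=4):
    """k-th nearest neighbor distance of every point, sorted in descending order."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    D.sort(axis=1)                 # column 0 is each point's distance to itself (0)
    return np.sort(D[:, k])[::-1]  # column k is the k-th nearest neighbor distance

# usage: points evenly spaced on a line, so every 1-NN distance is 1
X = np.arange(10, dtype=float).reshape(-1, 1)
kdist_sorted = k_distance(X, k=1)
```

Plotting the returned curve against the point index and reading off the first valley gives the Eps value.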

(2) MinPts: a common rule of thumb is MinPts ≥ dim + 1, where dim is the dimensionality of the data to be clustered.

When MinPts = 1, every isolated point is a core point and forms a cluster of its own; when MinPts ≤ 2, the result is equivalent to single-link hierarchical clustering cut at height Eps. Therefore MinPts should be chosen greater than or equal to 3.

6. MATLAB code

%%% DBSCAN clustering demo
clear;
close all;
clc;
tic;
%% Data set to be clustered
% First group
mu1=[0 0];          % mean
S1=[0.1 0;0 0.1];   % covariance
data1=mvnrnd(mu1,S1,100);   % Gaussian-distributed samples
% Second group
mu2=[1.25 1.25];
S2=[0.1 0;0 0.1];
data2=mvnrnd(mu2,S2,100);
% Third group
mu3=[-1.25 1.25];
S3=[0.1 0;0 0.1];
data3=mvnrnd(mu3,S3,100);
data=[data1;data2;data3];

%%
% k-distance graph (KNN), used to determine epsilon
% For each point, compute the distance to its k-th nearest neighbor; k is usually 4.
k=4;
numData=size(data,1);
% Search the k+1 nearest neighbors of every point within the full data set;
% the first neighbor of each point is the point itself (distance 0), so
% column k+1 holds the k-th nearest neighbor distance.
[~,Dist]=knnsearch(data,data,'Distance','euclidean','K',k+1);
Kdist=Dist(:,k+1);
sortKdist=sort(Kdist,'descend');   % k-distances of all points in descending order
distX=(1:numData)';

%% Plot
figure(1);
plot(distX,sortKdist,'r+-','LineWidth',2);
grid on;
% Run DBSCAN with the chosen parameters
epsilon=0.2;
MinPts=4;
labels=DBSCAN(data,epsilon,MinPts);

figure(2);
PlotClusterinResult(data, labels);
title(['DBSCAN Clustering (\epsilon = ' num2str(epsilon) ', MinPts = ' num2str(MinPts) ')']);
toc;

III. MVO-Optimized DBSCAN Clustering

Although Eps can be selected from a k-distance graph, the value of k itself must still be set manually, which can lead to a poor choice of Eps. To address this, this post uses MVO to optimize DBSCAN: the search capability of MVO is used to find a suitable Eps value and thus improve the clustering result.

The full source code for MVO-optimized DBSCAN clustering includes the MVO algorithm, the DBSCAN algorithm, and the MVO-DBSCAN optimization procedure.
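The key ingredient is the fitness function that scores a candidate Eps by the quality of the resulting clustering. The Python sketch below illustrates the idea: the noise-fraction-plus-penalty score, the compact DBSCAN helper, and the random search standing in for the MVO loop are all assumptions of this sketch, not the blog's actual objective or code:

```python
import numpy as np

def dbscan_labels(X, eps, min_pts=4):
    """Compact DBSCAN helper; label 0 marks noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    nbrs = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in nbrs])
    labels, c = np.zeros(n, dtype=int), 0
    for i in range(n):
        if not core[i] or labels[i] != 0:
            continue
        c += 1
        labels[i], stack = c, [i]
        while stack:
            p = stack.pop()
            if not core[p]:
                continue
            for q in nbrs[p]:
                if labels[q] == 0:
                    labels[q] = c
                    stack.append(q)
    return labels

def eps_fitness(eps, X, min_pts=4):
    """Score a candidate Eps (to be minimized): fraction of noise points,
    plus a penalty when fewer than two clusters are found."""
    labels = dbscan_labels(X, eps, min_pts)
    noise_frac = float(np.mean(labels == 0))
    return noise_frac + (1.0 if labels.max() < 2 else 0.0)

# usage: random search over Eps as a stand-in for the MVO loop
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.15, (60, 2)) for m in ([0, 0], [2, 2], [-2, 2])])
candidates = rng.uniform(0.05, 1.5, size=40)
best_eps = min(candidates, key=lambda e: eps_fitness(e, X))
```

In the actual MVO-DBSCAN setup, each universe is a candidate Eps value and this kind of score plays the role of the inflation rate.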

Code address: MVO-DBSCAN


References

[1] Mirjalili S, Mirjalili S M, Hatamlou A. Multi-verse optimizer: A nature-inspired algorithm for global optimization[J]. Neural Computing and Applications, 2016, 27(2): 495-513.

[2] Zhao Shijie, Gao Leifu, Tu Jun, et al. Improved MVO algorithm coupled with horizontal and vertical individual update strategy[J]. Control and Decision, 2018, 33(8): 1423.

[3] Lai Wenhao, Zhou Mengran, Li Daping. Application of unsupervised learning AE and MVO-DBSCAN combined with LIF in coal mine water inrush identification[J]. Spectroscopy and Spectral Analysis, 2019, 39(8): 2439.

[4] Liu Xiaolong. Improved multiverse algorithm to solve large-scale real-valued optimization problems[J]. Journal of Electronics and Information, 2019, 41(7): 1667.

[5] Clustering method: DBSCAN algorithm research (1)-DBSCAN principle, process, parameter setting, advantages and disadvantages and algorithm

[6] Primary exploration of clustering algorithm (5) DBSCAN

Origin blog.csdn.net/weixin_45317919/article/details/109403792