A Little Progress Every Day: ML - DBSCAN

The conventions are the same as in the previous article. First, a tribute to my idol.
[Image]

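1: Introduction to DBSCAN
Given a set of sample points in feature space, some regions are dense and others are sparse. A density-based clustering algorithm takes advantage of this: points that are packed closely together are treated as one cluster, and points in low-density regions are treated as noise.
Compared with K-means, which considers only distance, this algorithm also takes density into account, so it can discover clusters of arbitrary shape.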

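2: How the Algorithm Works
Think of a crowd of people who sort themselves into circles of friends. At the start nobody is classified. Pick an unclassified person at random and fix the size of a circle around them. If, with you at the center, your circle contains at least a threshold number of friends, you are a core point and your circle counts as a circle of friends. Then, for every friend in your circle, search their circles in turn, going deeper each time. A friend of a friend is a friend, so these linked circles merge into one big circle of friends, and the search continues until no new friends turn up.
If, however, your circle contains fewer friends than the threshold, but there are still a few other people in it besides yourself, you are a border point.
So what is a border point? It sits between a core point and an isolated point. An isolated point has nobody in its circle except itself; in short, it has no friends. (In the standard DBSCAN definition, a border point must additionally lie inside the circle of some core point; a non-core point that is not near any core point ends up as noise.)
We call the radius of this circle Eps, and the minimum number of friends in the circle MinPts (the minimum-count threshold).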
[Figure: a core point, a border point, and an isolated point]
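As the figure above shows, suppose the circle has radius R and the friend threshold is MinPts = 3. The leftmost red point has more friends than the threshold, so it is a core point and its density is high. The red point in the middle has fewer friends than the threshold, so it is a border point. The last red point has no friends at all, so it is an isolated point.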
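The three kinds of points can be sketched in a few lines of Python. This is only an illustration, separate from the Octave script later in the article: the function name `classify_points` and the sample coordinates are made up for the example, and, following that Octave script, a point counts as core when it has at least `min_pts` friends other than itself.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'isolated' using the article's rule."""
    labels = []
    for i, p in enumerate(points):
        # Friends are the other points within distance eps (the point itself is excluded).
        friends = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= eps
        )
        if friends >= min_pts:
            labels.append("core")        # dense enough to start or extend a cluster
        elif friends > 0:
            labels.append("border")      # has some friends, but too few to be core
        else:
            labels.append("isolated")    # nobody else inside the circle
    return labels

# Hypothetical sample: a small dense group, two sparser points, and one far-away point.
points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (-0.3, 0.0), (0.7, 0.0), (5.0, 5.0)]
labels = classify_points(points, eps=0.5, min_pts=3)
print(labels)
```

Note that this rule looks only at the number of friends. Whether a border point really ends up in a cluster is decided later, during the search from the core points.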
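Two more concepts are worth explaining: directly density-reachable and density-reachable. In plain words, these are your direct friends and your indirect friends (friends of your friends). If two core points lie within each other's neighborhoods, they are direct friends and you can walk straight over for a visit; this is called directly density-reachable (for example, the two round red core points in the figure below). If you have to pass through some other neighborhood to get there, it is called density-reachable (for example, the leftmost round red core point and the rightmost triangular red core point in the figure below). Finally, linking these core points and border points together gives a larger circle of friends, and the resulting group is called a cluster.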
[Figure: directly density-reachable and density-reachable core points]
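The difference between direct friends and indirect friends can be checked with a small breadth-first search. The sketch below is an illustration under stated assumptions: the helper names (`neighbors`, `directly_reachable`, `density_reachable`) and the sample points are hypothetical, and a point is treated as core when it has at least `min_pts` neighbors other than itself.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    """Indices of the other points within distance eps of points[i]."""
    return [j for j in range(len(points))
            if j != i and math.dist(points[i], points[j]) <= eps]

def directly_reachable(points, src, dst, eps, min_pts):
    """dst is a direct friend of src: src is core and dst lies in its circle."""
    nbrs = neighbors(points, src, eps)
    return len(nbrs) >= min_pts and dst in nbrs

def density_reachable(points, src, dst, eps, min_pts):
    """dst can be reached from src by hopping from core point to core point."""
    queue = deque([src])
    seen = {src}
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # not a core point, so the chain cannot continue through it
        for j in nbrs:
            if j == dst:
                return True
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return False

# Hypothetical sample: five points in a row, 0.4 apart, plus one far-away point.
points = [(0.4 * k, 0.0) for k in range(5)] + [(10.0, 0.0)]
print(directly_reachable(points, 1, 2, eps=0.5, min_pts=2))  # adjacent points
print(density_reachable(points, 1, 4, eps=0.5, min_pts=2))   # linked through points 2 and 3
print(density_reachable(points, 1, 5, eps=0.5, min_pts=2))   # the far-away point
```

In this sample the two end points of the row have only one neighbor each, so they are not core points: they can be reached, but no chain can start from them or pass through them.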
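During clustering, this algorithm can also detect isolated points, which keeps noise from interfering, and it divides the samples into classes automatically, so you do not have to specify the number of clusters up front. It does, however, require tuning two parameters (Eps and MinPts), and the result is quite sensitive to them.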
As for the distance, you can use Euclidean distance, Manhattan distance, and so on.
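For reference, here is a minimal Python sketch of the two distances mentioned above. The function names are chosen just for this example.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

The Manhattan distance between two points is never smaller than their Euclidean distance, so with the same Eps a Manhattan neighborhood (a diamond) sits inside the Euclidean one (a disk) and tends to contain fewer friends.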
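During the deep search, an isolated point is marked as isolated, the search stops there, and we move on to another sample. A border point is likewise marked as a border point, the search stops there, and we move on. In other words, the search only continues outward from core points. When the final grouping is done, though, core points and border points are put together in one cluster, so each cluster contains both core points and border points.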
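The search described above can be condensed into a short Python sketch. It is not a line-by-line port of the Octave script in the next section: the names `region_query`, `dbscan`, and `NOISE` are chosen for this example, a single label list replaces the separate tag arrays, and a point is assumed to be core when its circle holds at least `min_pts` other points.

```python
import math
from collections import deque

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)      # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = region_query(points, i, eps)
        if len(nbrs) < min_pts + 1:    # fewer than min_pts friends besides itself
            labels[i] = NOISE          # provisional: a later cluster may claim it as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins the cluster, search stops here
                continue
            if labels[j] is not None:
                continue               # already assigned to a cluster
            labels[j] = cluster
            j_nbrs = region_query(points, j, eps)
            if len(j_nbrs) >= min_pts + 1:
                queue.extend(j_nbrs)   # only core points keep the search going
    return labels

# Hypothetical sample: two tight squares, one border point next to the first, one outlier.
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.6, 0.0),
          (5.0, 5.0), (5.2, 5.0), (5.0, 5.2), (5.2, 5.2), (10.0, 0.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

The point at (0.6, 0.0) has only two friends, so it is not a core point, but it lies inside the circle of a core point and is therefore grouped with the first square. The point at (10.0, 0.0) has no friends and stays noise.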

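3: The Algorithm in Practice
Suppose we have the following group of points in two-dimensional space.
[Figure: scatter plot of the sample points]
The computation proceeds as follows: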

clear all
clc

% randomly construct the samples: two noisy arcs
x_1 = [-2:0.05:2];
y_1 = sqrt((4 - x_1 .^ 2) + 4*rand(size(x_1)));

x_2 = [-4:0.05:4];
y_2 = sqrt((16 - x_2 .^ 2) + 8*rand(size(x_2)));

x = [x_1, x_2];
y = [y_1, y_2];
trainData = [x', y'];

figure();
subplot(1, 1, 1);
hold on;

% draw all the points
scatter(trainData(:, 1), trainData(:, 2), 'g', 'linewidth', 3);

xlim([-5,5]);
ylim([0,6]);
%axis equal;
grid on;
xlabel('x1');
ylabel('x2');


% now we have the data.
% begin to run DBSCAN.
m = size(trainData, 1);

% tag of each sample: -1 = not visited, 0 = isolated point, 1 = edge (border) point, 2 and above = cluster id.
tag = -ones(m, 1);
% for edge points only: the id of the cluster each edge point belongs to (0 = none)
edge_ponit_TypeTag = zeros(m, 1);
NormalType = 2;

% R: radius of the neighborhood circle (Eps)
% Minpts: minimum number of neighbors, not counting the point itself, for a core point
R = 0.5;
Minpts = 3;

alone_p = 0;
edge_p = 1;



for i=1:m
	if tag(i) != -1
		continue;
	end;
	
	% calculate the Euclidean distance from sample i to every sample (including itself).
	dis = sqrt(((trainData(i, :) - trainData) .^ 2) * ones(2,1));
	
	% find the points which are in this circle area.
	neghbours = find(dis <= R);
	
	if size(neghbours)(1) == 1
		tag(i) = alone_p;  % alone point
		continue;
	elseif size(neghbours)(1) < (Minpts+1)
		tag(i) = edge_p;  % edge point
		continue;
	end;
	
	% create a new type of cluster
	tag(i) = NormalType;  % core point

	
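	% init the dynamic array (a queue of points still to be examined).
	array = neghbours;
	NormalType  % no semicolon on purpose: echo the id of the cluster being built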
	
	% deep iteration to find all the core points.
	while !isempty(array)
		% pop the first 
		point = array(1);
		array(1) = []; % pop(remove) this point
		
		% ignore the core point itself.
		if point == i
			continue;
		end;
		
		if tag(point) == 1
			edge_ponit_TypeTag(point) = NormalType;  % record this edge point to core point's tag
			continue;
		elseif tag(point) != -1
			continue; % ignore the alone point and tagged point
		end;
		
		% calculate the Euclidean distance from this point to every sample (including itself).
		dis = sqrt(((trainData(point, :) - trainData) .^ 2) * ones(2,1));
		
		% find the points which are in this circle area.
		neghbours = find(dis <= R);
		
		if size(neghbours)(1) == 1
			tag(point) = alone_p;  % alone point
			continue;
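		elseif size(neghbours)(1) < (Minpts+1)
			tag(point) = edge_p;  % edge point
			edge_ponit_TypeTag(point) = NormalType;  % it lies inside a core point's circle, so it joins this cluster
			continue;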
		else
			tag(point) = NormalType;  % core point
		end;
		
		% append this core point's neighbors to the dynamic array.
		array = [array; neghbours];
		
	end;
	
	% another type
	NormalType++;
	
end;

% correct the NormalType
NormalType--;


% convert every edge point to the tag of the cluster it belongs to
idx = find(edge_ponit_TypeTag > 0);
tag(idx) = edge_ponit_TypeTag(idx);

% convert every isolated point to 1 (noise), color(1) = 'black'
idx = find(tag == alone_p);
tag(idx) = 1;

% set color to draw
color = ['k', 'r', 'g', 'b', 'c', 'm', 'y'];



% draw all the points, one subplot per tag
figure();


for i=1:NormalType

	subplot(1, NormalType, i);
	hold on;

	% draw the points that carry tag i
	idx = find(tag == i);
	data = trainData(idx, :);
	
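	% wrap the color index so that more than 7 tags does not run past the color list
	scatter(data(:, 1), data(:, 2), color(mod(i - 1, numel(color)) + 1), 'linewidth', 3);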

	xlim([-5,5]);
	ylim([0,6]);
	%axis equal;
	grid on;
	xlabel('x1');
	ylabel('x2');

end;

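[Figure: the three result plots]
We can see three plots. The first shows the noise points, that is, the isolated points; the second shows one cluster; and the third shows another cluster.

So, ignoring the noise points, the whole sample set is divided into two clusters, and the clustering is done.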
