First, the purpose of the experiment
- Understand the basic principles of DBSCAN;
- Write code that implements the DBSCAN algorithm and use it to cluster data
Second, the experimental procedures
- Data set: the multishapes data set from the R package factoextra
- The function first takes two parameters:
(1) Epsilon (eps): the radius of the neighborhood around a point
(2) minPts: the minimum number of points that the neighborhood must contain
According to these two parameters, the sample points can be divided into three categories:
Core point: a point whose neighborhood contains >= minPts points
Border point: a point whose neighborhood contains < minPts points, but which lies in the neighborhood of some core point
Noise point: a point that is neither a core point nor a border point, so it does not belong to any cluster
- Compute disMatrix, the matrix of Euclidean distances between every pair of points
- Add a column visited to the data set: 0 means unvisited, 1 means visited
- Determine the type of each point and store it in data_N
- Delete the noise points and store the remaining points in data_C
- Recompute the neighborhoods and merge those that intersect
- Assign a label to each point according to its cluster
- Plot the result
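The three point categories can be illustrated on a tiny example. The sketch below is only an illustration: the data frame toy, its values, and the settings eps = 0.15 and minPts = 3 are made up for this purpose and are not part of the experiment. It uses the usual definition of a border point (a non-core point that lies within eps of a core point) and counts each point as a member of its own neighborhood.

```r
# Toy data (hypothetical values): five points on a line and one distant outlier.
toy <- data.frame(x = c(0, 0.1, 0.2, 0.3, 0.42, 5), y = 0)
eps <- 0.15
minPts <- 3

d <- as.matrix(dist(toy, method = "euclidean"))  # pairwise Euclidean distances
neighbors <- d <= eps                # TRUE where point j is within eps of point i (i itself included)
n_neighbors <- rowSums(neighbors)    # size of each eps-neighborhood

is_core <- n_neighbors >= minPts
# Border point: not a core point, but within eps of at least one core point.
is_border <- !is_core & rowSums(neighbors[, is_core, drop = FALSE]) > 0
type <- ifelse(is_core, "core", ifelse(is_border, "border", "noise"))

print(data.frame(toy, n_neighbors, type))
```

The two end points of the chain have only two points in their neighborhoods, so they are border points; the isolated point at x = 5 is noise.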
Third, the experiment code
library(factoextra)
library(ggplot2)
data<-data.frame(multishapes[,1:2])
ggplot(data,aes(x,y))+geom_point()
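# Main function
DBSCAN = function(data,eps,MinPts){
  rows = nrow(data)
  disMatrix<-as.matrix(dist(data, method = "euclidean")) # pairwise Euclidean distances
  data$visited <- rep(0,rows)
  names(data)<-c("x","y","visited")
  # Neighborhood table N: point index, number of points in the neighborhood, point type
  data_N = data.frame(matrix(NA,nrow =rows,ncol=3))
  names(data_N)<-c("index","pts","cluster")
  # Determine the type of each point: "1" core point, "2" border point, "0" noise point
  for(i in 1:rows){
    if(data$visited[i] == 0){ # point has not been visited yet
      data$visited[i] = 1 # mark the point as visited
      index <- which( disMatrix[i,] <= eps)
      pts <- length(index) # the count includes point i itself
      if(pts >= MinPts){
        data_N[i,]<-c(i,pts,"1")
      }else if(pts>1 && pts< MinPts){
        # treated as a border point here: at least one other neighbor, but fewer than MinPts points
        data_N[i,]<-c(i,pts,"2")
      }else{
        data_N[i,]<-c(i,pts,"0")
      }
    }
  }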
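  # Delete the noise points
  data_C<-data[which(data_N$cluster!=0),]
  # Neighborhoods after the noise points have been removed (distances on the coordinates only)
  disMatrix2<-as.matrix(dist(data_C[,c("x","y")], method = "euclidean"))
  Cluster<-list()
  for(i in 1:nrow(data_C)){
    Cluster[[i]]<-names(which(disMatrix2[i,]<= eps)) # row names of the points within eps of point i
  }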
  # Merge neighborhoods that intersect into a new cluster
  for(i in 1:length(Cluster)){
    for(j in 1:length(Cluster)){
      if(i!=j && any(Cluster[[j]] %in% Cluster[[i]])){
        # merge only when the first point listed in neighborhood i is a core point
        if(data_N[Cluster[[i]][1],]$cluster=="1"){
          Cluster[[i]]<-unique(append(Cluster[[i]],Cluster[[j]])) # merge and drop duplicates
          Cluster[[j]]<-list()
        }
      }
    }
  }
  newCluster<-list() # drop the empty lists
  for(i in 1:length(Cluster)){
    if(length(Cluster[[i]])>0){
      newCluster[[length(newCluster)+1]]<-Cluster[[i]]
    }
  }
  # Give the points of the same cluster the same label
  data_C[,4]<-NA_integer_ # new label column, filled in below
  for(i in 1:length(newCluster)){
    for(j in 1:length(newCluster[[i]])){
      data_C[newCluster[[i]][j],4]<-i
    }
  }
  return(data_C)
}
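# Run
test<-DBSCAN(data,0.15,6) # set eps to 0.15 and MinPts to 6
test$cluster<-factor(test[,4]) # cluster label as a factor for plotting
ggplot(test,aes(x,y,colour=cluster))+
  geom_point(shape=as.integer(test$cluster)) # one colour and one point shape per cluster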
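For comparison, the code above builds clusters by merging neighborhoods that intersect. The textbook formulation of DBSCAN instead grows each cluster outward from core points: a border point joins a cluster but does not extend it. The following is a minimal sketch of that variant, not the method used in this experiment; the function name dbscan_sketch and the toy data are hypothetical, and the label 0 is used here to mark noise.

```r
# Sketch of the textbook expansion step (hypothetical helper, not the code above).
dbscan_sketch <- function(xy, eps, minPts) {
  d <- as.matrix(dist(xy, method = "euclidean"))
  n <- nrow(d)
  # eps-neighborhood of every point (each point is its own neighbor)
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbors) >= minPts
  label <- integer(n)                 # 0 = not assigned yet; stays 0 for noise
  cluster_id <- 0L
  for (i in seq_len(n)) {
    if (!is_core[i] || label[i] != 0L) next   # start a cluster only at an unlabeled core point
    cluster_id <- cluster_id + 1L
    label[i] <- cluster_id
    queue <- i
    while (length(queue) > 0) {
      p <- queue[1]
      queue <- queue[-1]
      if (!is_core[p]) next           # border points join the cluster but do not expand it
      new_pts <- neighbors[[p]][label[neighbors[[p]]] == 0L]
      label[new_pts] <- cluster_id
      queue <- c(queue, new_pts)
    }
  }
  label
}

# Toy data (made-up values): two groups of three points and one outlier.
toy <- data.frame(x = c(0, 0.1, 0.2, 5, 5.1, 5.2, 10), y = 0)
labels <- dbscan_sketch(toy, eps = 0.15, minPts = 3)
print(labels)
```

Because a cluster only spreads through core points, two dense groups that merely share a border point are not fused, which is one way this variant can differ from merging on any intersection.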
Fourth, the results
The original data
The data after clustering