[Machine Learning] Using DBSCAN (density-based spatial clustering) to compute the center point of a set of latitude/longitude coordinates

Goal: use a clustering algorithm to obtain a center point from a batch of latitude/longitude coordinates and filter out the outliers.

Clustering rule: grow each cluster by absorbing the point nearest to the cluster's edge.

Algorithm flow:

Core point: the number of samples within its radius (eps) neighborhood is at least the minimum point count (min_samples).

Border point: not a core point itself, but lies within the radius of at least one core point.

Noise point: neither a core point nor a border point (no core point lies within its radius).
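These definitions can be checked directly with NumPy on a toy sample (the points, eps and min_pts below are made up for illustration; as in scikit-learn, the neighbor count includes the point itself):

```python
import numpy as np

# Hypothetical toy data: three tight points, one point hanging off
# their edge, and one far-away point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.45, 0.0], [5.0, 5.0]])
eps, min_pts = 0.3, 3  # assumed radius and minimum point count

# Pairwise Euclidean distances, shape (n, n)
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

is_core = (dists <= eps).sum(axis=1) >= min_pts
# A border point is not core but lies within eps of some core point
is_border = ~is_core & (dists[:, is_core] <= eps).any(axis=1)
is_noise = ~is_core & ~is_border

print(np.where(is_core)[0])    # [0 1 2]
print(np.where(is_border)[0])  # [3]
print(np.where(is_noise)[0])   # [4]
```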

1. Using the radius and the minimum point count, classify every sample as a core, border, or noise point.

2. First, assign the core points to clusters:

while core points remain unassigned:

    randomly select an unassigned core point as the seed of the current cluster

    find the core point nearest to the current cluster, i.e. nearest to the cluster's edge

    if that core point lies within the radius of an edge core point of the current cluster:

        assign it to the current cluster

    if every remaining unassigned core point is farther than the radius from the current cluster's edge:

        note: the core point nearest to cluster numC lies outside its neighborhood, so cluster numC is fully assigned; create a new cluster and seed it with a randomly chosen unassigned core point

    if all core points have been assigned:

        stop iterating

3. Then, using the radius and the already-clustered core points, assign each border point to a cluster.
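The three steps above can be sketched as a naive, unoptimized implementation (`naive_dbscan` is a name made up here, not a library function; real DBSCAN implementations use spatial indexes rather than a full distance matrix):

```python
import numpy as np

def naive_dbscan(points, eps, min_pts):
    """Sketch of the flow above: classify points, grow clusters from core
    points, then attach border points. O(n^2) memory -- illustration only."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    is_core = (dists <= eps).sum(axis=1) >= min_pts
    labels = np.full(n, -1)  # -1 = unassigned (ends up meaning noise)

    cluster = 0
    for i in np.where(is_core)[0]:
        if labels[i] != -1:
            continue
        # step 2: start a new cluster from an unassigned core point...
        labels[i] = cluster
        frontier = [i]
        while frontier:
            j = frontier.pop()
            # ...and keep absorbing unassigned core points within eps of its edge
            for k in np.where(is_core & (labels == -1) & (dists[j] <= eps))[0]:
                labels[k] = cluster
                frontier.append(k)
        cluster += 1

    # step 3: attach each border point to the cluster of its nearest core point
    for i in np.where(~is_core)[0]:
        near = np.where(is_core & (dists[i] <= eps))[0]
        if len(near):
            labels[i] = labels[near[np.argmin(dists[i, near])]]
    return labels

toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.45, 0.0], [5.0, 5.0]])
print(naive_dbscan(toy, eps=0.3, min_pts=3))  # [ 0  0  0  0 -1]
```

The first four points form one cluster (the fourth joins as a border point); the distant point stays labeled -1 as noise.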

Advantages:

1. It determines the number of clusters automatically; no assumption about k is needed.

2. It can discover clusters of arbitrary shape, whereas K-means only separates convex, roughly spherical clusters.

3. It identifies noise points, so it is fairly robust to noise.
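Advantages 1 and 2 can be seen on scikit-learn's two-moons toy dataset (the eps/min_samples values here are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters that K-means cannot separate
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

labels = DBSCAN(eps=0.2, min_samples=5).fit(X).labels_
n_clusters = len(set(labels) - {-1})
print(n_clusters)  # number of clusters found, with k never specified
```

DBSCAN recovers both moons as separate clusters; K-means with k=2 would split the data along a straight boundary instead.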

Disadvantages:

1. It does not perform well on high-dimensional data.

2. If the density of the sample set is uneven, a single eps/min_samples setting cannot fit all regions and the results suffer.

from sklearn.cluster import DBSCAN
import numpy as np
# Latitude/longitude data (the last point is a deliberate outlier)
data = np.array([[44.34855, 129.4578], [44.34855, 129.45781], [44.348557, 129.45782], [44.348526, 129.4578], [44.34851, 129.4578], [44.348545, 129.45784], [44.348526, 129.4578],
                 [44.348545, 129.45782], [44.348537, 129.4578], [44.348537, 129.45781], [44.348534, 129.4578], [44.34853, 129.4578], [44.348534, 129.4578],
                 [44.348553, 129.45789], [44.348545, 129.45781], [44.348526, 129.4578], [44.34851, 129.45778], [44.34855, 129.4578], [44.34853, 129.45784], [44.34852, 129.45778],
                 [44.348534, 129.45784], [44.34852, 129.45778], [44.34853, 129.45784], [44.348526, 129.45778], [44.348537, 129.45778], [44.348522, 129.45782],
                 [44.348553, 129.45782], [44.348526, 129.45776], [44.348534, 129.45776], [44.348534, 129.45778], [44.34853, 129.45782], [44.348534, 129.45782],
                 [44.348522, 129.45782], [44.34851, 129.45782], [44.348534, 129.45781], [44.348537, 129.4578], [44.34855, 129.45782], [44.348537, 129.45782],
                 [44.348526, 129.45781], [44.348553, 129.45781],[44.34853, 129.45784], [44.34855, 129.4578], [44.348545, 129.45781], [44.348537, 129.4578],
                 [44.34853, 129.45781],[50.3,112.2]])
# Cluster: note that eps is in degrees here (0.001 deg of latitude is roughly 110 m)
dbscan = DBSCAN(eps=0.001, min_samples=2).fit(data)
# Cluster labels (-1 marks noise points)
labels = dbscan.labels_
# Compute each cluster's center and collect the outliers
centers = []
outliers = []
for i in set(labels):
    if i == -1: # noise points
        outliers.extend(data[labels == i])
        continue
    centers.append(np.mean(data[labels == i], axis=0))
# For each cluster center, compute its distance to the nearest sample
distances = []
for center in centers:
    distance = np.sqrt(np.sum(np.square(data - center), axis=1))
    distances.append(np.min(distance))
# Use the center with the smallest nearest-sample distance as the most central point
index = np.argmin(distances)
result = centers[index]
print("Most central point:", result)
print("Outliers:", outliers)

Result:

Most central point: [ 44.34853458 129.45780822]

Outliers: [array([ 50.3, 112.2])]
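One caveat about the code above: eps=0.001 is measured in raw degrees, and a degree of longitude shrinks with latitude, so Euclidean distance on lat/lon is only an approximation. scikit-learn's DBSCAN also accepts metric='haversine' on coordinates in radians, which lets eps be expressed as a true ground distance. A minimal sketch, using an assumed 100 m threshold and a few of the points from above:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres

# A few of the coordinates from above plus the outlier
data = np.array([[44.34855, 129.4578], [44.348537, 129.45781],
                 [44.348526, 129.45778], [50.3, 112.2]])

eps_m = 100  # assumed cluster radius in metres
db = DBSCAN(eps=eps_m / EARTH_RADIUS_M, min_samples=2,
            metric='haversine').fit(np.radians(data))
print(db.labels_)  # [ 0  0  0 -1] -- the distant point is flagged as noise
```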


Reposted from blog.csdn.net/qq_44992785/article/details/129385819