Quickly Finding the 10 Nearest Stations for Every Station with Python

Author: 小小明, a Pandas data-processing expert dedicated to helping data practitioners solve their data-processing problems.

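I wrote this article half a year ago. Some of the methods in it may be a little dated, but the approach is still a useful reference, so I am now publishing it on CSDN.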


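A friend asked me the following question:

[Image: image-20200611172140721]

I ended up implementing this with the nearest-neighbor lookup that the KNN algorithm provides, and I am sharing the implementation here.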

Data format

Base station database

import pandas as pd
data = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
                     sheet_name="现网基站数据库",
                     usecols=[1, 2, 3])
data
CELL_NAME LON LAT
0 LFZ贵港市电信城北局O4 109.59128 23.11093
1 LFZ贵港市电信城北局O5 109.59128 23.11093
2 LFZ贵港市电信城北局O6 109.59128 23.11093
3 LFZ贵港市电信城北局O10 109.59128 23.11093
4 LFZ贵港市电信城北局O49 109.59128 23.11093
9514 LFZ贵港市人民医院外科楼25F楼梯间I5 109.59482 23.09028
9515 LFZ贵港市人民医院妇科楼13F楼梯间I6 109.59482 23.09028
9516 newLFZ贵港市唐人街室分21I 109.59644 23.09433
9517 newLFZ贵港市唐人街室分22I 109.59644 23.09433
9518 newLFZ贵港市港北区锦泰公馆1栋2单元B1F电梯旁B1FIQ20 109.58687 23.10925

9519 rows × 3 columns

There are 9,519 sample rows in total.

Stations whose nearest stations need to be found (16 sample rows):

Data for the stations to look up

find = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
                     sheet_name="需找出最近距离的基站",
                     usecols=[1, 2, 3])
find
基站中文名 lon lat
0 LFZ贵港港北区唐人街1号楼17F弱电井I4 109.596440 23.094330
1 newLFZ桂平社坡镇禄全村二(村西山头)O19 110.190283 23.353451
2 LFZ桂平市高铁站I10 110.112020 23.318939
3 newLFZ桂平大洋镇什字村二(旺冲良)O56 109.980278 23.112500
4 newLFZ桂平社坡镇禄全村二(村西山头)O17 110.190283 23.353451
5 newLFZ桂平社坡镇禄全村二(村西山头)O18 110.190283 23.353451
6 LFZ贵港市塔山部队装甲团O53 109.651942 23.106775
7 newLFZ贵港黄练镇水村O54 109.253430 23.190430
8 LFZ贵港市世纪经典悉尼座4单元12楼线井I11 109.588060 23.111116
9 LFZ贵港黄练镇新谭三中O57 109.274750 23.180370
10 LFZ贵港奇石乡六马村六良屯O52 109.637593 23.365757
11 newLFZ贵港市体育中心西看台4楼I5 109.559160 23.115079
12 newLFZ贵港市体育中心综合馆I2 109.559160 23.115079
13 LFZ桂平市郁江湾7号楼B1F电井外墙 I8 110.072415 23.381235
14 newLFZ贵港山北乡大王村O53 109.411290 23.353670
15 LFZ贵港市园博园主展馆2FI10 109.561264 23.076138

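The goal is to find, for each of these 16 stations, its 10 nearest stations in the base station database.

The usual approach is a brute-force scan, in which every query station is compared against all 9,519 database rows. That is tolerable for this sample, but once the database grows beyond 100,000 stations the computation takes a long time. To speed things up, I use the ball_tree algorithm of the KNN implementation.

Note: sklearn's KNN also offers the brute algorithm, which likewise accepts a custom distance function, but in practice ball_tree turned out to be somewhat faster.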
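For comparison, the brute-force baseline described above can be sketched with the standard library alone. This is only an illustration under my own assumptions: the helper names `haversine_m` and `brute_force_nearest` and the station names "A" to "D" are hypothetical, while the coordinates are copied from the sample output in this post.

```python
import heapq
from math import asin, cos, radians, sin, sqrt


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres on a sphere of radius 6371 km."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * asin(sqrt(a)) * 6371 * 1000


def brute_force_nearest(stations, query, k):
    """Scan every (name, lon, lat) station and keep the k closest to query=(lon, lat)."""
    scored = ((haversine_m(query[0], query[1], lon, lat), name)
              for name, lon, lat in stations)
    return heapq.nsmallest(k, scored)


# Hypothetical station names; the coordinates come from the sample data in this post.
stations = [
    ("A", 109.596440, 23.094330),
    ("B", 109.597461, 23.094762),
    ("C", 109.597226, 23.095149),
    ("D", 110.190283, 23.353451),
]
nearest = brute_force_nearest(stations, (109.596440, 23.094330), k=3)
print([(name, round(dist)) for dist, name in nearest])
```

Each query costs one pass over the whole database, which is exactly the cost that a tree-based index tries to avoid.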

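Using a KNN classifier to find each station's 10 nearest stations

Selecting the longitude/latitude features for training

Select the longitude/latitude columns from the base station database:

# Select the longitude/latitude columns from the station database to fit the KNN classifier
data_fit = data.iloc[:, [1, 2]]
# y normally labels the class of each row; the classification itself is not used here, so every row gets class 1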
y = [1] * len(data_fit)
data_fit
LON LAT
0 109.59128 23.11093
1 109.59128 23.11093
2 109.59128 23.11093
3 109.59128 23.11093
4 109.59128 23.11093
9514 109.59482 23.09028
9515 109.59482 23.09028
9516 109.59644 23.09433
9517 109.59644 23.09433
9518 109.58687 23.10925

9519 rows × 2 columns

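Select the longitude/latitude columns of the stations whose 10 nearest points are needed:

# Select the longitude/latitude columns of the stations whose 10 nearest points are needed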
find_x = find.iloc[:, [1, 2]]
find_x.head()
lon lat
0 109.596440 23.094330
1 110.190283 23.353451
2 110.112020 23.318939
3 109.980278 23.112500
4 110.190283 23.353451

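Building the KNN classifier

Import the KNN classifier:

# Import the KNN classifier and the math functions the distance function needs
from sklearn.neighbors import KNeighborsClassifier  # KNN
from math import asin, cos, radians, sin, sqrt

Create a function that computes the distance between two longitude/latitude points:

# Distance in metres between two longitude/latitude points
def distancefuc(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(
        radians,
        [float(lon1), float(lat1),
         float(lon2), float(lat2)])  # convert degrees to radians
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    distance = 2 * asin(sqrt(a)) * 6371 * 1000  # mean Earth radius, 6371 km
    distance = round(distance, 0)
    return distance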
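For reference, the function above is an implementation of the haversine formula. Written out, with latitude φ and longitude λ in radians and R the mean Earth radius (the symbols are my own notation, not the original post's):

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
   + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
d &= 2R \,\arcsin\!\left(\sqrt{a}\right), \qquad R = 6371 \times 1000\ \text{m}
\end{aligned}
```

As a rough check, the points (109.596440, 23.094330) and (109.597461, 23.094762) from the sample data differ by about 48 m north-south and about 104 m east-west, so d comes out at roughly 115 m.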

Create the KNN classifier:

# Use the ball_tree algorithm
knn = KNeighborsClassifier(n_neighbors=1,
                           algorithm='ball_tree',
                           metric=lambda s1, s2: distancefuc(*s1, *s2))

Training the model and getting the results

Train:

knn.fit(data_fit, y)

Result:

KNeighborsClassifier(algorithm='ball_tree',
                     metric=<function <lambda> at 0x000000000E7C7D90>,
                     n_neighbors=1)

Use the fitted KNN model to find the 10 nearest points for each query station:

distance, points = knn.kneighbors(find_x, n_neighbors=10, return_distance=True)
print(distance[:5])
print(points[:5])
[[   0.    0.    0.    0.    0.  115.  121.  121.  121.  167.]
 [   0.    0.    0. 1093. 1093. 1093. 1657. 1657. 1657. 2509.]
 [   0.  293.  293.  293.  293.  839.  839.  839. 1069. 1069.]
 [   0.    0.    0.    0.    0.    0. 1514. 1514. 2358. 2358.]
 [   0.    0.    0. 1093. 1093. 1093. 1657. 1657. 1657. 2509.]]
[[9517 9137 9139 9136 9516 9492 9016 9015 9017 7559]
 [8234 8235 8233 5190 5189 5188 5867 5865 5866 8878]
 [9209 9283 6041 6039 6040 6848 6846 6845 6864 6865]
 [3775 3776 7916 3774 7918 7917 8722 8723 1751 6749]
 [8234 8235 8233 5190 5189 5188 5867 5865 5866 8878]]
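As an alternative to passing a Python callable as the metric, scikit-learn's `BallTree` also has a built-in `haversine` metric. The sketch below is not the method used in this post; it assumes NumPy and scikit-learn are installed, and the helper name `nearest_stations` is hypothetical. Note that the built-in metric expects `[latitude, longitude]` in radians and returns distances in radians, so the columns are swapped and the result is scaled by the Earth radius.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6371 * 1000  # same mean radius as in distancefuc


def nearest_stations(db_lonlat, query_lonlat, k):
    """Return (distances in metres, row positions) of the k nearest database points."""
    # Swap the (lon, lat) columns to (lat, lon) and convert degrees to radians
    db = np.radians(np.asarray(db_lonlat, dtype=float)[:, ::-1])
    query = np.radians(np.asarray(query_lonlat, dtype=float)[:, ::-1])
    tree = BallTree(db, metric="haversine")
    dist_rad, idx = tree.query(query, k=k)  # neighbours are sorted by distance
    return dist_rad * EARTH_RADIUS_M, idx


# Coordinates copied from the sample data in this post, as (lon, lat) pairs
db = [
    (109.596440, 23.094330),
    (109.597461, 23.094762),
    (109.597226, 23.095149),
    (110.190283, 23.353451),
]
dist_m, idx = nearest_stations(db, [(109.596440, 23.094330)], k=3)
print(idx[0], np.round(dist_m[0]))
```

Because the distance is evaluated in compiled code rather than through a Python lambda, this variant is likely to scale better on large databases, though I have not benchmarked it against the approach above.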

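Tidying up the results

The KNN model has already computed the answer above. Now I will do a bit of wrangling to tidy the output so that it reads nicely, and then save it.

For the first query row:

find.iloc[0]

Result: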

基站中文名    LFZ贵港港北区唐人街1号楼17F弱电井I4
lon                     109.596
lat                     23.0943
Name: 0, dtype: object

How do we turn it into a DataFrame?

s = pd.DataFrame(find.iloc[0]).T
s

Result:

基站中文名 lon lat
0 LFZ贵港港北区唐人街1号楼17F弱电井I4 109.596 23.0943

How do we get the data for this station's 10 nearest stations?

tmp = data.iloc[points[0]]
tmp

Result:

CELL_NAME LON LAT
9517 newLFZ贵港市唐人街室分22I 109.596440 23.094330
9137 LFZ贵港港北区唐人街1号楼17F弱电井I4 109.596440 23.094330
9139 LFZ贵港港北区唐人街2号楼11FI6 109.596440 23.094330
9136 LFZ贵港港北区唐人街A区B1F弱电井I3 109.596440 23.094330
9516 newLFZ贵港市唐人街室分21I 109.596440 23.094330
9492 newLFZ贵港市港北区港福时代广场1栋2单元负2楼电梯旁IQ17 109.597461 23.094762
9016 LFZ贵港市凤凰二街华隆超市微站O50 109.597226 23.095149
9015 LFZ贵港市凤凰二街华隆超市微站O49 109.597226 23.095149
9017 LFZ贵港市凤凰二街华隆超市微站O51 109.597226 23.095149
7559 newLFZ贵港市桥北商贸城O20 109.596790 23.092860

Then add the distance (the 距离 column, in metres):

tmp['距离'] = distance[0]
tmp

Result:

CELL_NAME LON LAT 距离
9517 newLFZ贵港市唐人街室分22I 109.596440 23.094330 0.0
9137 LFZ贵港港北区唐人街1号楼17F弱电井I4 109.596440 23.094330 0.0
9139 LFZ贵港港北区唐人街2号楼11FI6 109.596440 23.094330 0.0
9136 LFZ贵港港北区唐人街A区B1F弱电井I3 109.596440 23.094330 0.0
9516 newLFZ贵港市唐人街室分21I 109.596440 23.094330 0.0
9492 newLFZ贵港市港北区港福时代广场1栋2单元负2楼电梯旁IQ17 109.597461 23.094762 115.0
9016 LFZ贵港市凤凰二街华隆超市微站O50 109.597226 23.095149 121.0
9015 LFZ贵港市凤凰二街华隆超市微站O49 109.597226 23.095149 121.0
9017 LFZ贵港市凤凰二街华隆超市微站O51 109.597226 23.095149 121.0
7559 newLFZ贵港市桥北商贸城O20 109.596790 23.092860 167.0

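Merge the query station with its result rows:

s['距离'] = '被求点0'  # label meaning "query point 0"
s.columns = tmp.columns
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
tmp = pd.concat([s, tmp])
tmp

Result: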

CELL_NAME LON LAT 距离
0 LFZ贵港港北区唐人街1号楼17F弱电井I4 109.596 23.0943 被求点0
9517 newLFZ贵港市唐人街室分22I 109.596 23.0943 0
9137 LFZ贵港港北区唐人街1号楼17F弱电井I4 109.596 23.0943 0
9139 LFZ贵港港北区唐人街2号楼11FI6 109.596 23.0943 0
9136 LFZ贵港港北区唐人街A区B1F弱电井I3 109.596 23.0943 0
9516 newLFZ贵港市唐人街室分21I 109.596 23.0943 0
9492 newLFZ贵港市港北区港福时代广场1栋2单元负2楼电梯旁IQ17 109.597 23.0948 115
9016 LFZ贵港市凤凰二街华隆超市微站O50 109.597 23.0951 121
9015 LFZ贵港市凤凰二街华隆超市微站O49 109.597 23.0951 121
9017 LFZ贵港市凤凰二街华隆超市微站O51 109.597 23.0951 121
7559 newLFZ贵港市桥北商贸城O20 109.597 23.0929 167

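Final merging code:

frames = []
for i, row in find.iterrows():
    tmp = data.iloc[points[i]].copy()
    tmp['距离'] = distance[i]
    s = pd.DataFrame(row).T
    s['距离'] = f'被求点{i}'  # label meaning "query point i"
    s.columns = tmp.columns
    frames.extend([s, tmp])
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same table
result = pd.concat(frames)
result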
CELL_NAME LON LAT 距离
0 LFZ贵港港北区唐人街1号楼17F弱电井I4 109.596 23.0943 被求点0
9517 newLFZ贵港市唐人街室分22I 109.596 23.0943 0
9137 LFZ贵港港北区唐人街1号楼17F弱电井I4 109.596 23.0943 0
9139 LFZ贵港港北区唐人街2号楼11FI6 109.596 23.0943 0
9136 LFZ贵港港北区唐人街A区B1F弱电井I3 109.596 23.0943 0
9058 LFZ贵港市园博园微站7O7 109.561 23.0741 223
9052 LFZ贵港市园博园微站1O1 109.559 23.0764 229
9053 LFZ贵港市园博园微站2O2 109.559 23.0756 229
9059 LFZ贵港市园博园微站8O8 109.562 23.0739 252
9060 LFZ贵港市园博园微站9O9 109.562 23.0741 257

176 rows × 4 columns

Saving the results

result.to_excel(r"D:/hdfs/excel/result/10base.xlsx", index=False)

Complete code

import pandas as pd
from math import asin, cos, radians, sin, sqrt
from sklearn.neighbors import KNeighborsClassifier  # KNN

find = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
                     sheet_name="需找出最近距离的基站",
                     usecols=[1, 2, 3])
data = pd.read_excel(r"D:\hdfs\excel\经纬度计算最近10个基站.xlsx",
                     sheet_name="现网基站数据库",
                     usecols=[1, 2, 3])
# Select the longitude/latitude columns from the station database to fit the KNN classifier
data_fit = data.iloc[:, [1, 2]]
# y normally labels the class of each row; the classification itself is not used here, so every row gets class 1
y = [1] * len(data_fit)
# Select the longitude/latitude columns of the stations whose 10 nearest points are needed
find_x = find.iloc[:, [1, 2]]


# Distance in metres between two longitude/latitude points
def distancefuc(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(
        radians,
        [float(lon1), float(lat1),
         float(lon2), float(lat2)])  # convert degrees to radians
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    distance = 2 * asin(sqrt(a)) * 6371 * 1000  # mean Earth radius, 6371 km
    distance = round(distance, 0)
    return distance


# Use the ball_tree algorithm
knn = KNeighborsClassifier(n_neighbors=1,
                           algorithm='ball_tree',
                           metric=lambda s1, s2: distancefuc(*s1, *s2))
# Fit the model
knn.fit(data_fit, y)
# Find the 10 nearest points for each query station
distance, points = knn.kneighbors(find_x, n_neighbors=10, return_distance=True)

frames = []
for i, row in find.iterrows():
    tmp = data.iloc[points[i]].copy()
    tmp['距离'] = distance[i]
    s = pd.DataFrame(row).T
    s['距离'] = f'被求点{i}'  # label meaning "query point i"
    s.columns = tmp.columns
    frames.extend([s, tmp])
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same table
result = pd.concat(frames)
result.to_excel(r"D:/hdfs/excel/result/10base.xlsx", index=False)

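Finding sample points whose connected station is not among their 6 nearest stations

After I finished the task above, a similar request came in:

[Image: image-20200611191721381]

Reading the data

Import the package:

import pandas as pd

Read the base station longitude/latitude data: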

data = pd.read_excel(r"D:\hdfs\excel\网格和周围LTE站点.xlsx",usecols=[0,2,3])
data
CellName Longitude Latitude
0 FSSDRongGuiHengDeLouDC-EFW-1 113.282501 22.767101
1 FSSDRongGuiNanJieTanDiWCDC-EFW-1 113.267010 22.766701
2 FSSDRongGuiNanJieTanDiWCDC-EFW-2 113.267010 22.766701
3 FSSDRongGuiNanJieTanDiWCDC-EFW-3 113.267010 22.766701
4 FSSDRongGuiNanJieTanDiWCDC-EFW-4 113.267010 22.766701
583 FSSDRongGuiQingGuiGLGYQDDC-EFH-6 113.326050 22.775870
584 FSSDRongGuiRongGangLuBeiDC-EFH-4 113.286910 22.781120
585 FSSDRongGuiRongGangLuBeiDC-EFH-5 113.286910 22.781120
586 FSSDRongGuiRongGangLuBeiDC-EFH-6 113.286910 22.781120
587 FSSDDaLiangCaiHongLuBanQDC-EFH-1 113.292710 22.798050

588 rows × 3 columns

Read the sample-point data:

find = pd.read_excel(r"D:/hdfs/excel/GPS采样点.xlsx", usecols=[2, 3, 6, 8])
find
Longitude Latitude ECI CELLNAME
0 113.272493 22.752590 231103049 FSSDRongGuiHongXingDC-EFH-3
1 113.272502 22.752585 231103049 FSSDRongGuiHongXingDC-EFH-3
2 113.272503 22.752584 231103049 FSSDRongGuiHongXingDC-EFH-3
3 113.272503 22.752584 231103049 FSSDRongGuiHongXingDC-EFH-3
4 113.272548 22.752557 231103049 FSSDRongGuiHongXingDC-EFH-3
5066 113.310089 22.784417 231128648 FSSDDaLiangDeShengQiaoBQDC-EFH-2
5067 113.310091 22.784417 231128648 FSSDDaLiangDeShengQiaoBQDC-EFH-2
5068 113.310103 22.784415 231128648 FSSDDaLiangDeShengQiaoBQDC-EFH-2
5069 113.310103 22.784415 231128648 FSSDDaLiangDeShengQiaoBQDC-EFH-2
5070 113.310103 22.784415 231128648 FSSDDaLiangDeShengQiaoBQDC-EFH-2

5071 rows × 4 columns

Converting station names to indices

Build a DataFrame for looking up each station name and its corresponding index:

cellName_index = data[['CellName']].reset_index()
cellName_index = cellName_index.rename(columns={"index": "cell_index"})
cellName_index
cell_index CellName
0 0 FSSDRongGuiHengDeLouDC-EFW-1
1 1 FSSDRongGuiNanJieTanDiWCDC-EFW-1
2 2 FSSDRongGuiNanJieTanDiWCDC-EFW-2
3 3 FSSDRongGuiNanJieTanDiWCDC-EFW-3
4 4 FSSDRongGuiNanJieTanDiWCDC-EFW-4
583 583 FSSDRongGuiQingGuiGLGYQDDC-EFH-6
584 584 FSSDRongGuiRongGangLuBeiDC-EFH-4
585 585 FSSDRongGuiRongGangLuBeiDC-EFH-5
586 586 FSSDRongGuiRongGangLuBeiDC-EFH-6
587 587 FSSDDaLiangCaiHongLuBanQDC-EFH-1

588 rows × 2 columns

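Convert the name of the station each sample point is connected to into that station's index in the base station database. Note that merge performs an inner join by default, so sample points whose CELLNAME does not appear in the database are dropped: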

find = find.merge(cellName_index,
                  left_on='CELLNAME',
                  right_on='CellName',
                  copy=False)
find = find[["Longitude", "Latitude", "ECI", "cell_index"]]
find
Longitude Latitude ECI cell_index
0 113.272493 22.752590 231103049 14
1 113.272502 22.752585 231103049 14
2 113.272503 22.752584 231103049 14
3 113.272503 22.752584 231103049 14
4 113.272548 22.752557 231103049 14
5066 113.310089 22.784417 231128648 66
5067 113.310091 22.784417 231128648 66
5068 113.310103 22.784415 231128648 66
5069 113.310103 22.784415 231128648 66
5070 113.310103 22.784415 231128648 66
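Conceptually, the merge above is a name-to-position lookup. A minimal standard-library sketch of the same idea, assuming each station name should map to its first row position; the helper `build_name_index` and the name "NOT_IN_DATABASE" are hypothetical, while the other names come from the sample data:

```python
def build_name_index(cell_names):
    """Map each station name to its row position (first occurrence wins)."""
    index = {}
    for position, name in enumerate(cell_names):
        index.setdefault(name, position)
    return index


cell_names = [
    "FSSDRongGuiHengDeLouDC-EFW-1",
    "FSSDRongGuiNanJieTanDiWCDC-EFW-1",
    "FSSDRongGuiNanJieTanDiWCDC-EFW-2",
]
name_to_index = build_name_index(cell_names)

# The second sample name is made up to show what happens to an unmatched station
samples = ["FSSDRongGuiNanJieTanDiWCDC-EFW-2", "NOT_IN_DATABASE"]
connected = [name_to_index.get(name) for name in samples]
print(connected)
```

Here an unmatched name yields `None`, which makes missing stations visible, whereas the inner-join merge silently removes such rows.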

Getting the 6 nearest points

Select the feature data used for training:

data_fit = data.iloc[:, [1, 2]]
y = [1] * len(data_fit)
find_fit = find.iloc[:, [0, 1]]
print(data_fit.head())
print(find_fit.head())
    Longitude   Latitude
0  113.282501  22.767101
1  113.267010  22.766701
2  113.267010  22.766701
3  113.267010  22.766701
4  113.267010  22.766701
    Longitude   Latitude
0  113.272493  22.752590
1  113.272502  22.752585
2  113.272503  22.752584
3  113.272503  22.752584
4  113.272548  22.752557

Get the results with the KNN classifier:

from sklearn.neighbors import KNeighborsClassifier  # KNN
from math import asin, cos, radians, sin, sqrt


# Distance in metres between two longitude/latitude points
def distancefuc(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(
        radians,
        [float(lon1), float(lat1),
         float(lon2), float(lat2)])  # convert degrees to radians
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    distance = 2 * asin(sqrt(a)) * 6371 * 1000  # mean Earth radius, 6371 km
    distance = round(distance, 0)
    return distance


knn = KNeighborsClassifier(n_neighbors=1,
                           algorithm='ball_tree',
                           metric=lambda s1, s2: distancefuc(*s1, *s2))
knn.fit(data_fit, y)
points = knn.kneighbors(find_fit, n_neighbors=6, return_distance=False)
points[:5]

Result:

array([[ 14,  12,  13,  87,  88,  86],
       [ 14,  12,  13,  87,  88,  86],
       [ 14,  12,  13,  87,  88,  86],
       [ 14,  12,  13,  87,  88,  86],
       [ 14,  12,  13, 101, 103, 102]], dtype=int64)

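Getting the sample points whose connected station is not among their 6 nearest stations

# Keep a sample point only if its connected station's index is missing from its 6 nearest indices
result = pd.DataFrame(
    [row for i, row in find.iterrows() if row.cell_index not in points[i]])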
result
Longitude Latitude ECI cell_index
8 113.272701 22.752470 231108424.0 33.0
9 113.272701 22.752470 231108424.0 33.0
10 113.272702 22.752470 231108424.0 33.0
11 113.272735 22.752454 231108424.0 33.0
12 113.272743 22.752450 231108424.0 33.0
5027 113.305510 22.762527 94409287.0 170.0
5028 113.305528 22.762507 94409287.0 170.0
5029 113.305537 22.762497 94409287.0 170.0
5030 113.305537 22.762497 94409287.0 170.0
5031 113.305537 22.762496 94409287.0 170.0

1279 rows × 4 columns
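The same filter can also be sketched without a Python-level loop, using a NumPy broadcast comparison. This is an illustration under my own assumptions rather than the code used in this post: the helper name `not_among_neighbours` is hypothetical, and the small arrays reuse index values from the output above.

```python
import numpy as np


def not_among_neighbours(connected_idx, neighbour_idx):
    """Boolean mask: True where the connected index is absent from that row's neighbours."""
    connected = np.asarray(connected_idx).reshape(-1, 1)  # one column, one row per sample
    neighbours = np.asarray(neighbour_idx)                # shape (n_samples, k)
    return ~(neighbours == connected).any(axis=1)


points_demo = np.array([[14, 12, 13, 87, 88, 86],
                        [14, 12, 13, 101, 103, 102]])
mask = not_among_neighbours([14, 33], points_demo)
print(mask)
```

Applied to the real data, something like `find[not_among_neighbours(find["cell_index"], points)]` would select the same rows while keeping the original integer dtypes of ECI and cell_index, which the row-by-row construction converts to floats.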

Complete code

import pandas as pd
from math import asin, cos, radians, sin, sqrt
from sklearn.neighbors import KNeighborsClassifier  # KNN

# Read the base station longitude/latitude data
data = pd.read_excel(r"D:\hdfs\excel\网格和周围LTE站点.xlsx", usecols=[0, 2, 3])
# Read the sample-point data
find = pd.read_excel(r"D:/hdfs/excel/GPS采样点.xlsx", usecols=[2, 3, 6, 8])

# ## Convert station names to indices

# Build a DataFrame for looking up each station name and its corresponding index
cellName_index = data[['CellName']].reset_index()
cellName_index = cellName_index.rename(columns={"index": "cell_index"})
# Convert the connected station's name into its index in the station database
# (inner join: sample points with an unknown CELLNAME are dropped)
find = find.merge(cellName_index,
                  left_on='CELLNAME',
                  right_on='CellName',
                  copy=False)
find = find[["Longitude", "Latitude", "ECI", "cell_index"]]

# ## Get the 6 nearest points
# Select the feature data used for training
data_fit = data.iloc[:, [1, 2]]
y = [1] * len(data_fit)
find_fit = find.iloc[:, [0, 1]]


# Distance in metres between two longitude/latitude points
def distancefuc(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(
        radians,
        [float(lon1), float(lat1),
         float(lon2), float(lat2)])  # convert degrees to radians
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    distance = 2 * asin(sqrt(a)) * 6371 * 1000  # mean Earth radius, 6371 km
    distance = round(distance, 0)
    return distance


# Get the results with the KNN classifier
knn = KNeighborsClassifier(n_neighbors=1,
                           algorithm='ball_tree',
                           metric=lambda s1, s2: distancefuc(*s1, *s2))
knn.fit(data_fit, y)
points = knn.kneighbors(find_fit, n_neighbors=6, return_distance=False)

# ## Get the sample points whose connected station is not among their 6 nearest stations
result = pd.DataFrame(
    [row for i, row in find.iterrows() if row.cell_index not in points[i]])
result
Longitude Latitude ECI cell_index
8 113.272701 22.752470 231108424.0 33.0
9 113.272701 22.752470 231108424.0 33.0
10 113.272702 22.752470 231108424.0 33.0
11 113.272735 22.752454 231108424.0 33.0
12 113.272743 22.752450 231108424.0 33.0
5027 113.305510 22.762527 94409287.0 170.0
5028 113.305528 22.762507 94409287.0 170.0
5029 113.305537 22.762497 94409287.0 170.0
5030 113.305537 22.762497 94409287.0 170.0
5031 113.305537 22.762496 94409287.0 170.0

1279 rows × 4 columns


Reposted from blog.csdn.net/as604049322/article/details/112385553