Code optimization example

This article works with one month of trajectory data from a certain area and walks through the optimization of the processing code. The format of one row of the selected trajectory data is shown in the table below, where MMSI_paragraph is the unique id of each trajectory, Lat and Lon are the latitude and longitude of the trajectory at a given moment, and Receivedtime (UTC+8) is the timestamp of the historical track point, updated every 30 seconds.

MMSI_paragraph Lat Lon Receivedtime(UTC+8)
44236893_01 113.120742 29.425808 2020-01-28 01:21:42

1. Calculate the sailing time for each trajectory

First, pandas is used to read the trajectory data into DF, and the timestamp column is converted into the corresponding datetime format.

import pandas as pd
from tqdm import tqdm
from datetime import datetime
import time
DF = pd.read_csv('/home/data/AIS_data_process/trajectory_for_lunwen/异常剔除后轨迹并进行上行下行轨迹提取/直线处轨迹/trajectory_toup_01.csv')
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)
BJS_format = "%Y-%m-%d %H:%M:%S"
DF['Receivedtime(UTC+8)'] = DF['Receivedtime(UTC+8)'].parallel_apply(lambda x: datetime.strptime(x, BJS_format))  # parallelized apply
mmsi_para_list = DF['MMSI_paragraph'].unique()
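
As an aside, when the format string is fixed, the vectorized pd.to_datetime parses the whole column in compiled code and can be competitive with a parallel apply; a minimal sketch, assuming the same DF and BJS_format as above:

# Vectorized alternative (a sketch, not the author's method):
# pd.to_datetime parses the entire column at once when given an explicit format
DF['Receivedtime(UTC+8)'] = pd.to_datetime(DF['Receivedtime(UTC+8)'], format=BJS_format)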

The following is the original code for calculating the sailing time of each trajectory. It reads each trajectory individually and then computes its sailing time within the area; both its time complexity and its memory footprint are particularly high.

start1 = time.time()
df_m1_list = []  # store the dataframe computed for each trajectory id
for i in range(len(mmsi_para_list)):
    dfm1 = DF[DF['MMSI_paragraph'] == mmsi_para_list[i]]  # high time complexity: every row of DF is scanned for each id
    total_time = (dfm1.iloc[-1]['Receivedtime(UTC+8)'] - dfm1.iloc[0]['Receivedtime(UTC+8)']).total_seconds()
    dfm1['total_time'] = total_time
    df_m1_list.append(dfm1)  # with many trajectories, keeping them all in df_m1_list consumes a lot of memory
DF1 = pd.concat(df_m1_list)  # the concat operation is also expensive
end1 = time.time()
print(end1 - start1)

After optimization, the code can be rewritten in the following form, which uses assign, groupby, and transform to process the original trajectory data:

  • assign appends the total_time series as a new column at the end of DF.
  • groupby replaces the explicit for loop of the first version of the code, saving the time spent extracting each trajectory separately.
  • transform applies a custom transformation, written here as a lambda anonymous function, and broadcasts the result to every row of the group it belongs to.
    (agg is for per-group scalar aggregation; transform broadcasts a scalar or returns a same-length sequence; filter is for filtering out groups; apply is for multi-column returns — see the toy example after the optimized code below.)

start2=time.time()
DF2=DF.assign(total_time=DF.groupby("MMSI_paragraph")["Receivedtime(UTC+8)"].transform(lambda x: (x.iat[-1]-x.iat[0]).total_seconds()))
end2=time.time()
print(end2-start2)
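
To make the agg/transform distinction concrete, here is a toy illustration on made-up data (the column names g and v are invented for this example):

toy = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 10]})
print(toy.groupby("g")["v"].agg("sum"))        # one scalar per group: a -> 3, b -> 10
print(toy.groupby("g")["v"].transform("sum"))  # broadcast back to every row: 3, 3, 10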

The DF used here has 6.43 million trajectory points in total, covering 26524 trajectories. After the optimization, the calculation time drops from the original 2 hours and 21 minutes to about 4 seconds.

One of the core high-performance principles: do not write O(n) loops explicitly in Python; push them down into vectorized or compiled code.
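
A minimal illustration of this principle (a toy benchmark; absolute timings are machine-dependent):

import numpy as np
import time

arr = np.random.rand(1_000_000)
t0 = time.time()
s = 0.0
for v in arr:  # explicit O(n) loop executed by the Python interpreter
    s += v
t1 = time.time()
s2 = arr.sum()  # the same O(n) work, executed inside NumPy's compiled loop
t2 = time.time()
print(t1 - t0, t2 - t1)  # the interpreted loop is typically orders of magnitude slower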

2. Calculate the sailing length of each trajectory

Each distinct MMSI_paragraph value represents one navigation trajectory, so this task needs to accumulate, for every trajectory, the distance between consecutive rows sharing the same MMSI_paragraph. During the calculation, a function is needed that returns the distance between two different longitude/latitude points.
The difficulty of this problem is how to calculate the voyage length efficiently for large-scale trajectory samples.
The original approach is as follows: first create a list DF_list; extract each trajectory from DF as df in a for loop; then use an inner for loop to traverse every row of that trajectory, applying a spherical longitude/latitude distance function to accumulate the voyage length; finally attach the length to the extracted df and append it to DF_list. After the loop finishes, concat merges all computed df into a new DF. In this way the sailing length of each trajectory is obtained.
The distance function between two longitude/latitude points on the Earth is as follows:

import math
import numpy as np

def calcDistance(Lng_A, Lat_A, Lng_B, Lat_B):
    """
    Compute the distance between two points from their longitudes and latitudes.
    :param Lng_A: longitude 1
    :param Lat_A: latitude 1
    :param Lng_B: longitude 2
    :param Lat_B: latitude 2
    :return: distance in meters
    """
    ra = 6378.140  # equatorial radius, km
    rb = 6356.755  # polar radius, km
    flatten = (ra - rb) / ra
    rad_lat_A = np.radians(Lat_A)
    rad_lng_A = np.radians(Lng_A)
    rad_lat_B = np.radians(Lat_B)
    rad_lng_B = np.radians(Lng_B)
    pA = math.atan(rb / ra * np.tan(rad_lat_A))
    pB = math.atan(rb / ra * np.tan(rad_lat_B))
    xx = math.acos(np.sin(pA) * np.sin(pB) + np.cos(pA) * np.cos(pB) * np.cos(rad_lng_A - rad_lng_B))
    c1 = (np.sin(xx) - xx) * (np.sin(pA) + np.sin(pB)) ** 2 / np.cos(xx / 2) ** 2
    # tested: returns 0 when the two input points have identical coordinates
    if np.sin(xx / 2) == 0:
        return 0
    c2 = (np.sin(xx) + xx) * (np.sin(pA) - np.sin(pB)) ** 2 / np.sin(xx / 2) ** 2
    dr = flatten / 8 * (c1 - c2)
    distance = ra * (xx + dr)
    return distance * 1000
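
A quick sanity check with two made-up points (coordinates are illustrative only):

d = calcDistance(113.120742, 29.425808, 113.130742, 29.425808)
print(round(d, 1), "meters")  # ~0.01 deg of longitude at latitude ~29.4 is roughly 970 m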

The per-row Python loop just described wastes a great deal of time and space and is extremely inefficient, so numba and cython can be used to speed up the processing.

# numba-based acceleration
from math import sin, cos, acos
import numpy as np
from numba import njit  # accelerate the inner loop with numba

ra, rb = 6378.140, 6356.755

@njit
def distance_numba(p, lng):
    '''
    Compute the total length of a trajectory sequence.
    '''
    flatten = (ra - rb) / ra
    sum_value, nrows, xx = 0.0, p.shape[0], 0.0
    if nrows == 1:
        return sum_value
    for i in range(nrows-1):
        pA, pB, lng_A, lng_B = p[i], p[i+1], lng[i], lng[i+1]
        xx = sin(pA)*sin(pB)
        xx += cos(pA)*cos(pB)*cos(lng_A-lng_B)
        xx = acos(min(xx, 1))
        c1 = (sin(xx)-xx) * (sin(pA)+sin(pB)) ** 2 / cos(xx/2+1e-15) ** 2
        c2 = (sin(xx)+xx) * (sin(pA)-sin(pB)) ** 2 / sin(xx/2+1e-15) ** 2
        dr = flatten / 8 * (c1 - c2)
        dist = ra * (xx + dr) * 1000
        sum_value += dist
    return sum_value

DF["p"] = np.arctan(rb/ra*np.tan(np.radians(DF.Lat)))  # precompute the reduced latitude for every row
agg_result = DF.groupby("MMSI_paragraph").apply(lambda x: distance_numba(x.p.values, np.radians(x.Lon.values)))  # use groupby to process each trajectory separately
DF = DF.merge(agg_result.to_frame().rename(columns={0: "Distance"}), on="MMSI_paragraph")  # merge the computed lengths back into DF
# To speed this up further, the last step can also be written as
# DF1 = DF.assign(Distance=agg_result.reindex(DF.MMSI_paragraph).reset_index(drop=True)).drop('p', axis=1)
# In some cases reindex can replace merge: merge is essentially O(n**2), so the larger the data, the slower it gets.
# Whenever reindex can substitute for merge, prefer reindex.
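
Note that numba compiles distance_numba on its first call, so a fair benchmark should trigger compilation before timing; a minimal warm-up sketch on dummy arrays (values made up):

# warm up the JIT so compilation time is not counted in the measurement
_ = distance_numba(np.zeros(2), np.zeros(2))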

The following is the acceleration method based on Cython:

# load the Cython extension in the notebook
%load_ext cython

%%cython
import numpy as np
cimport numpy as np
import cython
from math import acos, sin, cos
# the line above can also be written as a cimport: from libc.math cimport acos, sin, cos

def distance_cython(double[:] p, double[:] lng):
    cdef:
        double ra = 6378.140
        double rb = 6356.755
        double flatten = (ra - rb) / ra
        double sum_value = 0
        double xx, pA, pB, lng_A, lng_B, c1, c2, dr
        int nrows = p.shape[0]
        int i
    if nrows == 1:
        return sum_value
    for i in range(nrows-1):
        pA, pB, lng_A, lng_B = p[i], p[i+1], lng[i], lng[i+1]
        xx = sin(pA)*sin(pB)
        xx += cos(pA)*cos(pB)*cos(lng_A-lng_B)
        xx = acos(min(xx, 1))
        if sin(xx/2)==0:
            sum_value += 0
            continue
        c1 = (sin(xx)-xx) * (sin(pA)+sin(pB)) ** 2 / (cos(xx/2) ** 2)
        c2 = (sin(xx)+xx) * (sin(pA)-sin(pB)) ** 2 / (sin(xx/2) ** 2)
        dr = flatten / 8 * (c1 - c2)
        sum_value += ra * (xx + dr) * 1000
    return sum_value
agg_result = DF.groupby("MMSI_paragraph").apply(lambda x: distance_cython(x.p.values, np.radians(x.Lon.values)))
DF = DF.merge(agg_result.to_frame().rename(columns={0: "Distance"}), on="MMSI_paragraph").drop("p", axis=1)

Computed over the 26524 trajectories, the cython version takes about 13.6 seconds in total, while the numba version takes about 7.6 seconds; compared with the original method, computational efficiency is greatly improved.

3. Obtain part of the trajectory data

This task aims to randomly extract the first 30% to 70% of the points of each trajectory.
The original code is as follows:

import random

DF_train = pd.DataFrame()
mmsi_para_list = DF['MMSI_paragraph'].unique()
for mmsi_para in mmsi_para_list:
    df2 = DF[DF['MMSI_paragraph'] == mmsi_para]  # scans all rows of DF for every id, as before
    ratio = random.uniform(0.3, 0.7)             # random keep ratio for this trajectory
    traj_num = int(len(df2) * ratio)
    df_train = df2.iloc[:traj_num]               # keep the leading rows
    DF_train = DF_train.append(df_train)         # DataFrame.append is deprecated and re-copies on every call

This version still falls short of the performance requirements, so groupby can again be used to optimize it. The code can first be rewritten in the following form and then optimized further:

DF.groupby("MMSI_paragraph",as_index=False, sort=False, group_keys=False).apply(lambda x: x.iloc[:int(x.shape[0]*np.random.uniform(0.3,0.7))])

On this basis, the code can be optimized further. The idea is: scan all MMSI_paragraph values and find the start and end position of each run of identical MMSI_paragraph values; store these run lengths in an array; then, for each run, sample a cut point so that 30%-70% of its rows are kept; finally select rows from the original DF according to these positions. The specific code is as follows:

%%cython

import numpy as np
cimport numpy as np
import cython


def get_count(str[:] arr):
# scan all MMSI_paragraph values to find where each run of identical values starts and ends
# e.g. for aaaabbbcc, the returned array is [4, 3, 2]
    cdef:
        int n_groups = 26673  # hard-coded number of trajectories in this dataset
        str cur = arr[0]
        int[:] count = np.zeros(n_groups, dtype=np.int32)
        int cur_g = 0
        int i
        int n = arr.shape[0]
    count[cur_g] = 1
    for i in range(1, n):
        last = cur
        cur = arr[i]
        if cur != last:
            cur_g += 1
        count[cur_g] += 1
    return np.asarray(count)

The numba helper and the driver code run in a regular (non-cython) cell:

import numpy as np
from numba import njit

# use idx to flag the row positions of DF that should be kept
n = DF.shape[0]

@njit
def generate_idx(sample, count):
    idx = np.zeros(n, dtype=np.int8)  # n is captured by numba as a compile-time constant
    start = 0
    for i in range(len(count)):
        step = sample[i]
        idx[start:start+step] = 1     # keep the first `step` rows of this run
        start += count[i]
    return idx

count = get_count(DF.MMSI_paragraph.values)  # run length of each MMSI_paragraph
sample = (count * np.random.uniform(0.3, 0.7, count.shape[0])).astype("int")  # number of rows to keep per run
result = DF.loc[generate_idx(sample, count).astype("bool")]  # extract the kept rows from the original data
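
A quick check (a sketch) that the sampling behaved as intended, i.e. each trajectory keeps roughly 30%-70% of its rows:

kept = result.groupby("MMSI_paragraph").size()
total = DF.groupby("MMSI_paragraph").size()
frac = (kept / total).dropna()
print(frac.min(), frac.max())  # both values should lie within about [0.3, 0.7]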

Origin: blog.csdn.net/weixin_44133327/article/details/124300313