Use TransBigData to quickly and efficiently process, analyze, and mine taxi GPS data

01. Introduction to TransBigData

TransBigData is a Python package for processing, analyzing, and visualizing spatiotemporal traffic big data. It provides fast, concise methods for common data types such as taxi GPS data, shared-bicycle data, and bus GPS data, and it offers processing methods for each stage of a spatiotemporal traffic analysis. The code is concise, efficient, flexible, and easy to use, so complex data tasks can be completed in a few lines.

Currently, TransBigData mainly provides the following methods:

❖Data preprocessing: Quickly calculate basic information about a data set, such as data volume, time period, and sampling interval, and clean various kinds of data noise.

❖Data rasterization: Generate and match several types of geographic grids (rectangular, triangular, hexagonal, and geohash grids) in the study area, and quickly map spatial point data onto a geographic grid in a vectorized manner.

❖Data visualization: Based on the visualization package keplergl, display data interactively in Jupyter Notebook with simple code.

❖Trajectory processing: Generate trajectory lines from GPS points, and densify or thin trajectory points.

❖Basemaps, coordinate conversion, and calculation: Load and display map basemaps, and convert coordinates between various special coordinate systems.

❖Specific processing methods: Provide processing methods for specific kinds of data, such as extracting order origin and destination points from taxi GPS data, identifying home and work locations from mobile phone signaling data, and building a network topology from subway network GIS data and calculating shortest paths.

TransBigData can be installed via pip or conda. To install it with pip, run the following command in the command prompt:

pip install -U transbigdata

After the installation is complete, run the following code in Python to import the TransBigData package.

In [1]:
import transbigdata as tbd

02. Data preprocessing

TransBigData can seamlessly connect with Pandas and GeoPandas packages commonly used in data processing. First we import the Pandas package and read the taxi GPS data:

In [2]:
import pandas as pd  
# Read the data
data = pd.read_csv('TaxiData-Sample.csv',header = None)   
data.columns = ['VehicleNum','time','lon','lat','OpenStatus','Speed'] 
data.head()

The result is shown in Figure 1.

picture

■ Figure 1 Taxi GPS data

Then, introduce the GeoPandas package, read the area information of the research area and display:


In [3]:
import geopandas as gpd  
# Read the boundary of the study area
sz = gpd.read_file(r'sz/sz.shp')  
sz.plot()

The result is shown in Figure 2.

picture

■ Figure 2 Area information of the research scope

The TransBigData package integrates some common preprocessing methods for spatiotemporal traffic data. Among them, tbd.clean_outofshape takes the data and the boundary of the study area and removes the records that fall outside the study area. tbd.clean_taxi_status removes records in which the passenger status of a taxi changes instantaneously. When using these preprocessing methods, you need to pass in the names of the columns that hold the relevant information in the data table. The code is as follows:


In [4]:
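# Data preprocessing
# Remove data outside the study area. The method first rasterizes the data and then matches the grid cells to the study area,
# so a grid size must also be given here; the smaller the grid, the higher the precision
data = tbd.clean_outofshape(data, sz, col=['lon', 'lat'], accuracy=500)
# Remove records in which the passenger status changes instantaneously
data = tbd.clean_taxi_status(data, col=['VehicleNum', 'time', 'OpenStatus'])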

After running the code above, the records outside the study area and the instantaneous changes in passenger status have been removed from the taxi GPS data.
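
To make the second cleaning step more concrete, here is a minimal sketch in plain Python of one possible rule for spotting an "instantaneous" status change: a record is dropped when the records just before and just after it (for the same vehicle) share one status and the record itself has the other. This rule is an assumption used for illustration, not necessarily what tbd.clean_taxi_status does internally; the function name drop_status_flicker and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def drop_status_flicker(records):
    """Remove single-record status flips.

    records: list of (vehicle, time, status) tuples.
    A record is dropped when the previous and next records of the same
    vehicle share one status and the record itself has the other.
    """
    cleaned = []
    ordered = sorted(records, key=itemgetter(0, 1))  # by vehicle, then time
    for _, group in groupby(ordered, key=itemgetter(0)):
        rows = list(group)
        for i, row in enumerate(rows):
            is_interior = 0 < i < len(rows) - 1
            if (is_interior
                    and rows[i - 1][2] == rows[i + 1][2]
                    and row[2] != rows[i - 1][2]):
                continue  # isolated flip: treat it as noise
            cleaned.append(row)
    return cleaned


# Hypothetical sample: the third record is an isolated flip to status 1
sample = [
    ("A", "10:00:00", 0),
    ("A", "10:00:15", 0),
    ("A", "10:00:30", 1),
    ("A", "10:00:45", 0),
    ("A", "10:01:00", 0),
    ("A", "10:01:15", 1),
    ("A", "10:01:30", 1),
]
for row in drop_status_flicker(sample):
    print(row)
```

The final change from 0 to 1 is kept because the status stays at 1 afterwards, so it looks like a real pickup rather than noise.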

03. Data rasterization

A grid (cells of equal size in geographic space) is the most basic way to express a data distribution. After GPS data is rasterized, each data point carries the information of the grid cell in which it lies. A distribution expressed on a grid is close to the real one.

TransBigData provides a complete, fast, and convenient grid processing system. To rasterize with TransBigData, you first need to determine the rasterization parameters, which can be understood as defining a grid coordinate system. With these parameters, rasterization is quick:

In [5]:
# Define the bounds of the study area
bounds = [113.75, 22.4, 114.62, 22.86]
# Get the rasterization parameters from the bounds
params = tbd.area_to_params(bounds,accuracy = 1000)
params

Out [5]:
{'slon': 113.75, 
 'slat': 22.4, 
 'deltalon': 0.00974336289289822, 
 'deltalat': 0.008993210412845813,
 'theta': 0,
 'method': 'rect',
 'gridsize': 1000}

The rasterization parameters params store the origin coordinates of the grid coordinate system (slon, slat), the width and height of a single grid cell in degrees of longitude and latitude (deltalon, deltalat), the rotation angle of the grid (theta), the shape of the grid (method, whose value can be rect for rectangular, tri for triangular, or hexa for hexagonal), and the size of the grid (gridsize, in meters).
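
To see where deltalon and deltalat come from, the sketch below converts a cell size in meters into degrees using a spherical-Earth approximation. The Earth radius of 6371004 m and the use of the mid-latitude of the bounds are assumptions, chosen because they reproduce the values in Out [5] to several decimal places; the function name sketch_area_to_params is hypothetical, and this is not the library's own code.

```python
import math

EARTH_RADIUS_M = 6371004  # assumed spherical Earth radius, in meters


def sketch_area_to_params(bounds, accuracy=500):
    """Approximate rectangular-grid parameters for a bounding box.

    bounds: [lon1, lat1, lon2, lat2] in degrees.
    accuracy: grid cell size in meters.
    """
    lon1, lat1, lon2, lat2 = bounds
    meters_per_degree = 2 * math.pi * EARTH_RADIUS_M / 360
    mid_lat_rad = math.radians((lat1 + lat2) / 2)
    return {
        "slon": min(lon1, lon2),
        "slat": min(lat1, lat2),
        # one degree of longitude shrinks with cos(latitude)
        "deltalon": accuracy / (meters_per_degree * math.cos(mid_lat_rad)),
        "deltalat": accuracy / meters_per_degree,
    }


params_sketch = sketch_area_to_params([113.75, 22.4, 114.62, 22.86], accuracy=1000)
print(params_sketch)
```

Because a degree of longitude is shorter than a degree of latitude away from the equator, deltalon comes out larger than deltalat for the same cell size in meters.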

After obtaining the rasterization parameters, we can use the methods provided in TransBigData to perform raster matching and generation operations on GPS data. The complete grid processing method system is shown in Figure 3.

picture

■ Figure 3 The raster processing system provided by TransBigData

Use the tbd.GPS_to_grid method to find the grid cell for each taxi GPS point. This method generates the index columns LONCOL and LATCOL, which together identify the grid cell in which each point lies:


In [6]:
# Map the GPS data to the grid and add the resulting grid index columns to the table as two new columns
data['LONCOL'],data['LATCOL'] = tbd.GPS_to_grid(data['lon'],data['lat'],params)
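
As a rough illustration of what this mapping involves for a rectangular grid, the sketch below computes the two indices for a single point with plain Python. It assumes that cell (0, 0) is centered on the origin (slon, slat); that convention, the function name sketch_gps_to_grid, and the rounded cell sizes are assumptions for illustration, not the library's implementation.

```python
import math


def sketch_gps_to_grid(lon, lat, params):
    """Return (LONCOL, LATCOL) for one point on a rectangular grid.

    Assumes cell (0, 0) is centered on (slon, slat), so the cell edges
    sit half a cell away from the origin.
    """
    west_edge = params["slon"] - params["deltalon"] / 2
    south_edge = params["slat"] - params["deltalat"] / 2
    loncol = math.floor((lon - west_edge) / params["deltalon"])
    latcol = math.floor((lat - south_edge) / params["deltalat"])
    return loncol, latcol


# Rounded, hypothetical cell sizes (roughly 1 km at this latitude)
params_sketch = {"slon": 113.75, "slat": 22.4, "deltalon": 0.01, "deltalat": 0.009}

print(sketch_gps_to_grid(113.75, 22.4, params_sketch))   # the origin itself
print(sketch_gps_to_grid(114.05, 22.55, params_sketch))  # a point inside the study area
```

The library method works on whole columns at once, which is what makes it fast on large tables; the per-point version here only shows the arithmetic.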

In the next step, count the records in each grid cell, generate the geometry for each cell, and build a GeoDataFrame:


In [7]:
# Count the records in each grid cell
grid_agg = data.groupby(['LONCOL','LATCOL'])['VehicleNum'].count().reset_index()
# Generate the geometry of each grid cell
grid_agg['geometry'] = tbd.grid_to_polygon([grid_agg['LONCOL'],grid_agg['LATCOL']],params)
# Convert to a GeoDataFrame
grid_agg = gpd.GeoDataFrame(grid_agg)
# Plot the grid
grid_agg.plot(column = 'VehicleNum',cmap = 'autumn_r')

The result is shown in Figure 4.

picture

  

■ Figure 4 The result of data rasterization
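
The two ideas in this step, counting records per cell and turning a cell index back into a rectangle, can be sketched without any GIS library. The example below again assumes that cell (0, 0) is centered on (slon, slat); the function name sketch_cell_corners, the rounded cell sizes, and the sample cell indices are hypothetical.

```python
from collections import Counter


def sketch_cell_corners(loncol, latcol, params):
    """Return the four (lon, lat) corners of a rectangular grid cell,
    assuming cell (0, 0) is centered on (slon, slat)."""
    center_lon = params["slon"] + loncol * params["deltalon"]
    center_lat = params["slat"] + latcol * params["deltalat"]
    half_w = params["deltalon"] / 2
    half_h = params["deltalat"] / 2
    return [
        (center_lon - half_w, center_lat - half_h),  # south-west
        (center_lon + half_w, center_lat - half_h),  # south-east
        (center_lon + half_w, center_lat + half_h),  # north-east
        (center_lon - half_w, center_lat + half_h),  # north-west
    ]


params_sketch = {"slon": 113.75, "slat": 22.4, "deltalon": 0.01, "deltalat": 0.009}

# Hypothetical (LONCOL, LATCOL) pairs, one per GPS record
cells = [(30, 17), (30, 17), (31, 17)]
counts = Counter(cells)  # number of records in each cell

for cell, n in counts.items():
    print(cell, n, sketch_cell_corners(*cell, params_sketch))
```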

For a formal data visualization, we also need to add a basemap, color bar, north arrow, and scale bar. TransBigData provides corresponding functions; the code is as follows:

In [8]:
import matplotlib.pyplot as plt  
fig =plt.figure(1,(8,8),dpi=300)  
ax =plt.subplot(111)  
plt.sca(ax)  
# Add the administrative boundaries as the basemap
sz.plot(ax = ax,edgecolor = (0,0,0,0),facecolor = (0,0,0,0.1),linewidths=0.5)
# Define the position of the color bar
cax = plt.axes([0.04, 0.33, 0.02, 0.3])
plt.title('Data count')
plt.sca(ax)
# Plot the data
grid_agg.plot(column = 'VehicleNum',cmap = 'autumn_r',ax = ax,cax = cax,legend = True)
# Add the north arrow and scale bar
tbd.plotscale(ax,bounds = bounds,textsize = 10,compasssize = 1,accuracy = 2000,rect = [0.06,0.03],zorder = 10)  
plt.axis('off')  
plt.xlim(bounds[0],bounds[2])  
plt.ylim(bounds[1],bounds[3])  
plt.show()  

 The result is shown in Figure 5.

picture

■ Figure 5 Taxi GPS data distribution drawn by tbd package

04. Order origin-destination (OD) extraction and aggregation

For taxi GPS data, TransBigData provides a method to extract the origin-destination (OD) information of taxi orders directly from the data. The code is as follows:


In [9]:  
# Extract OD from the GPS data (column names must match those set in In [2])
oddata = tbd.taxigps_to_od(data,col = ['VehicleNum','time','lon','lat','OpenStatus'])
oddata  

 The result is shown in Figure 6.

picture

■ Figure 6 Taxi OD extracted by the tbd package
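
The idea behind OD extraction can be sketched as follows: within one vehicle's time-ordered records, a change of passenger status from 0 to 1 marks a pickup (origin) and a change from 1 to 0 marks a drop-off (destination). The status encoding, the function name sketch_extract_od, the output keys, and the sample track are all assumptions for illustration; this is not the code inside tbd.taxigps_to_od.

```python
def sketch_extract_od(track):
    """Pair pickups and drop-offs for ONE vehicle.

    track: list of (time, lon, lat, status) tuples sorted by time, where
    status is 1 while carrying a passenger and 0 otherwise (assumed encoding).
    """
    trips = []
    origin = None
    for prev, curr in zip(track, track[1:]):
        if prev[3] == 0 and curr[3] == 1:
            origin = curr  # passenger gets in
        elif prev[3] == 1 and curr[3] == 0 and origin is not None:
            # passenger gets out: close the trip
            trips.append({
                "stime": origin[0], "slon": origin[1], "slat": origin[2],
                "etime": curr[0], "elon": curr[1], "elat": curr[2],
            })
            origin = None
    return trips


# Hypothetical track of a single taxi
track = [
    ("09:00:00", 114.00, 22.50, 0),
    ("09:00:30", 114.01, 22.51, 1),
    ("09:05:00", 114.05, 22.55, 1),
    ("09:05:30", 114.06, 22.56, 0),
    ("09:06:00", 114.06, 22.56, 0),
]
print(sketch_extract_od(track))
```

This is also why the status cleaning in section 02 matters: a single noisy status flip would otherwise be read as a very short trip.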

With the rasterization methods in TransBigData, grids of different sizes and granularities can be defined simply by changing the accuracy parameter. Here we redefine a 2 km × 2 km grid coordinate system and pass its parameters to the tbd.odagg_grid method to aggregate the OD data by grid cell and generate a GeoDataFrame:


In [10]:  
# Redefine the grid and get the rasterization parameters
params = tbd.area_to_params(bounds,accuracy = 2000)
# Rasterize the OD data and aggregate it
od_gdf = tbd.odagg_grid(oddata,params)
od_gdf.plot(column = 'count')

 The result is shown in Figure 7.

picture

■ Figure 7 Grid-level OD aggregated by tbd
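
Grid-level OD aggregation boils down to mapping each trip's origin and destination to a grid cell and counting the trips that share the same pair of cells. The sketch below shows this with a Counter; the cell convention (cell (0, 0) centered on the origin), the function names, the rounded 2 km cell sizes, and the three sample trips are assumptions for illustration only.

```python
import math
from collections import Counter


def cell_of(lon, lat, params):
    """Grid index of a point, assuming cell (0, 0) is centered on (slon, slat)."""
    col = math.floor((lon - (params["slon"] - params["deltalon"] / 2)) / params["deltalon"])
    row = math.floor((lat - (params["slat"] - params["deltalat"] / 2)) / params["deltalat"])
    return col, row


def sketch_odagg_grid(trips, params):
    """Count trips per (origin cell, destination cell) pair."""
    counts = Counter()
    for trip in trips:
        origin_cell = cell_of(trip["slon"], trip["slat"], params)
        dest_cell = cell_of(trip["elon"], trip["elat"], params)
        counts[(origin_cell, dest_cell)] += 1
    return counts


# Rounded, hypothetical cell sizes (roughly 2 km at this latitude)
params_2km = {"slon": 113.75, "slat": 22.4, "deltalon": 0.02, "deltalat": 0.018}

# Hypothetical trips: the first two start and end in the same pair of cells
trips = [
    {"slon": 114.005, "slat": 22.505, "elon": 114.105, "elat": 22.605},
    {"slon": 114.006, "slat": 22.506, "elon": 114.106, "elat": 22.606},
    {"slon": 114.005, "slat": 22.505, "elon": 114.305, "elat": 22.705},
]
od_counts = sketch_odagg_grid(trips, params_2km)
for (origin_cell, dest_cell), n in od_counts.items():
    print(origin_cell, "->", dest_cell, n)
```

A larger cell size merges more trips into each cell pair, which is why changing accuracy changes the granularity of the resulting OD map.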

Add a basemap, color bar, scale bar, and north arrow:


In [11]:
# Create the figure
import matplotlib.pyplot as plt
fig =plt.figure(1,(8,8),dpi=300)
ax =plt.subplot(111)
plt.sca(ax)
# Add the administrative boundaries as the basemap
sz.plot(ax = ax,edgecolor = (0,0,0,1),facecolor = (0,0,0,0),linewidths=0.5)
# Draw the color bar
cax = plt.axes([0.05, 0.33, 0.02, 0.3])
plt.title('Data count')
plt.sca(ax)
# Plot the OD data
od_gdf.plot(ax = ax,column = 'count',cmap = 'Blues_r',linewidth = 0.5,vmax = 10,cax = cax,legend = True)
# Add the scale bar and north arrow
tbd.plotscale(ax,bounds = bounds,textsize = 10,compasssize = 1,accuracy = 2000,rect = [0.06,0.03],zorder = 10)  
plt.axis('off')  
plt.xlim(bounds[0],bounds[2])  
plt.ylim(bounds[1],bounds[3])  
plt.show()  

 The result is shown in Figure 8.

picture

■ Figure 8 Raster OD data drawn by TransBigData

TransBigData also provides a method to aggregate OD data directly into regions:


In [12]:
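# Aggregate OD to regions
# Method 1: without rasterization parameters, match directly by longitude and latitude
od_gdf = tbd.odagg_shape(oddata,sz,round_accuracy=6)
# Method 2: with rasterization parameters, the data is rasterized first and then matched, which is faster; recommended for large data volumes
od_gdf = tbd.odagg_shape(oddata,sz,params = params)
od_gdf.plot(column = 'count')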

 The result is shown in Figure 9.

picture

■ Figure 9 Inter-area OD aggregated by tbd
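
Why can rasterizing first speed up region matching? One plausible reading is that many points fall in the same grid cell, so the expensive point-in-polygon test only has to run once per distinct cell instead of once per point. The sketch below illustrates that idea with a simple ray-casting test and a per-cell cache. It matches each cell by its center, which is an approximation near region borders; the function names, the two square "regions", and the sample points are hypothetical, and the library's actual strategy may differ.

```python
import math


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge crosses the horizontal ray
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def match_regions_via_grid(points, regions, params):
    """Assign each point the region that contains its grid-cell center.

    Each distinct cell is tested against the polygons once and then reused.
    Returns the matched region names and the number of distinct cells tested.
    """
    cell_region = {}
    matched = []
    for lon, lat in points:
        col = math.floor((lon - (params["slon"] - params["deltalon"] / 2)) / params["deltalon"])
        row = math.floor((lat - (params["slat"] - params["deltalat"] / 2)) / params["deltalat"])
        if (col, row) not in cell_region:
            center_lon = params["slon"] + col * params["deltalon"]
            center_lat = params["slat"] + row * params["deltalat"]
            cell_region[(col, row)] = next(
                (name for name, poly in regions.items()
                 if point_in_polygon(center_lon, center_lat, poly)),
                None,
            )
        matched.append(cell_region[(col, row)])
    return matched, len(cell_region)


# Two hypothetical square regions side by side
regions = {
    "west": [(113.7, 22.3), (114.2, 22.3), (114.2, 22.9), (113.7, 22.9)],
    "east": [(114.2, 22.3), (114.7, 22.3), (114.7, 22.9), (114.2, 22.9)],
}
params_2km = {"slon": 113.75, "slat": 22.4, "deltalon": 0.02, "deltalat": 0.018}
points = [(114.005, 22.505), (114.006, 22.506), (114.405, 22.605)]

matched, cells_tested = match_regions_via_grid(points, regions, params_2km)
print(matched, cells_tested)
```

The trade-off is precision: points close to a region border can be assigned to the neighboring region, and a smaller grid reduces that error at the cost of more polygon tests.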

Load the map basemap and adjust the plotting parameters:

        


In [13]:
# Create the figure
import matplotlib.pyplot as plt
fig =plt.figure(1,(8,8),dpi=300)  
ax =plt.subplot(111)  
plt.sca(ax)  
# Add the administrative boundaries as the basemap
sz.plot(ax = ax,edgecolor = (0,0,0,0),facecolor = (0,0,0,0.2),linewidths=0.5)
# Draw the color bar
cax = plt.axes([0.05, 0.33, 0.02, 0.3])
plt.title('count')
plt.sca(ax)
# Plot the OD data
od_gdf.plot(ax = ax,vmax = 100,column = 'count',cax = cax,cmap = 'autumn_r',linewidth = 1,legend = True)
# Add the scale bar and north arrow
tbd.plotscale(ax,bounds = bounds,textsize = 10,compasssize = 1,accuracy = 2000,rect = [0.06,0.03],zorder = 10)  
plt.axis('off')  
plt.xlim(bounds[0],bounds[2])  
plt.ylim(bounds[1],bounds[3])  
plt.show()  

 The results are shown in Figure 10.

picture

■ Figure 10 Inter-area OD visualization results

05. Interactive visualization

In TransBigData, a few lines of code are enough to visualize taxi data interactively in Jupyter Notebook. These visualization methods rely on the keplergl package under the hood. The result is no longer a static picture but a map application that responds to mouse interaction.

The tbd.visualization_data method visualizes the distribution of the data. After the data is passed in, TransBigData first aggregates the data points by grid cell, then generates the grid geometry, and maps the data volume of each cell to a color. The code is as follows:


In [14]:  
# Visualize the distribution of data points
tbd.visualization_data(data,col = ['lon','lat'],accuracy=1000,height = 500)

 The result is shown in Figure 11.

■ Figure 11 Grid visualization of data distribution

For the trip OD extracted from the taxi data, the tbd.visualization_od method draws the OD flows as arcs. This method also aggregates the OD data by grid cell, generates the OD arcs, and maps OD flows of different sizes to different colors. The code is as follows:


In [15]:  
# Visualize the OD distribution as arcs
tbd.visualization_od(oddata,accuracy=2000,height = 500) 

The result is shown in Figure 12. 

 

■ Figure 12 Arc visualization of OD distribution

For individual-level continuous tracking data, the tbd.visualization_trip method can process data points into trajectory information with time stamps and display them dynamically. The code is as follows:


In [16]:  
# Visualize the trajectories dynamically
tbd.visualization_trip(data,col = ['lon','lat','VehicleNum','time'],height = 500)  

The result is shown in Figure 13. Click the play button to see the taxis move along their trajectories.

■ Figure 13 Dynamic visualization of taxi trajectories

Origin blog.csdn.net/qq_41640218/article/details/132145574