Taxi travel time prediction based on DolphinDB machine learning

DolphinDB combines a high-performance time-series database with comprehensive analysis functions, and can be used for the storage, querying, analysis, and real-time computation of massive structured data. It is widely used in industrial IoT scenarios. Taking New York taxi travel time prediction as an example, this article introduces how to use DolphinDB to train machine learning models and make predictions on real-time data, providing a machine-learning-based real-time prediction solution for Internet of Vehicles companies working with intelligent connected vehicles.


1. Overview

With the rapid development of mobile applications and ride-hailing platforms, ride-hailing has gradually become an important mode of urban travel. Compared with other modes, ride-hailing passengers have higher requirements for the timeliness of their trips. This article uses DolphinDB machine learning methods to train a model that predicts ride-hailing travel time from static information such as the passenger's pickup time and location.

On this basis, this article introduces how to use the DolphinDB streaming data system to collect, clean, aggregate, and store the continuously growing ride-hailing order data generated by the business system in real time, and to display the travel time predictions as they are produced.

Real-time travel time prediction workflow

2. Data introduction

2.1 Data source and training method

The training and prediction in this article use the New York City Taxi and Limousine Commission dataset provided by Kaggle, and the training approach follows the model of the competition winner beluga. DolphinDB is used to preprocess the raw data and to perform principal component analysis (PCA) of the location information, location clustering (KMeans), and new feature construction, and the DolphinDB XGBoost plug-in is used for model training and travel time prediction.

To benchmark DolphinDB's machine learning performance, this article also trains and predicts on the same dataset in the same environment with the Python Scikit-Learn library and XGBoost. DolphinDB performs well in both training time and model accuracy.

2.2 Data characteristics

The dataset is pre-divided into a training set and a test set. The training set contains 1,458,644 records and the test set contains 625,134 records. The training set consists of the following 11 columns.

Column name          Type      Description                                                        Example
id                   SYMBOL    Unique identifier of the trip                                      id2875421
vendor_id            INT       Code of the trip record provider                                   2
pickup_datetime      DATETIME  Time when the taxi meter was engaged                               2016/3/14 17:24:55
dropoff_datetime     DATETIME  Time when the taxi meter was disengaged                            2016/3/14 17:32:30
passenger_count      INT       Number of passengers                                               1
pickup_longitude     DOUBLE    Longitude where the meter was engaged                              -73.98215484619139
pickup_latitude      DOUBLE    Latitude where the meter was engaged                               40.76793670654297
dropoff_longitude    DOUBLE    Longitude where the meter was disengaged                           -73.96463012695312
dropoff_latitude     DOUBLE    Latitude where the meter was disengaged                            40.765602111816406
store_and_fwd_flag   CHAR      Whether the record was stored and forwarded (historical) data      N
trip_duration        INT       Travel time in seconds                                             455

The prediction target is the trip_duration column, which equals the difference between dropoff_datetime and pickup_datetime. Because the test set is used only for prediction, it omits the dropoff_datetime and trip_duration columns; its remaining columns (trip ID, locations, and so on) match the table above.

Among the types above, SYMBOL is a special DolphinDB string type stored internally as an encoded dictionary, and DATETIME is a temporal type containing both date and time.

DolphinDB provides the loadText function to read files such as CSV into an in-memory table and the schema function to obtain a table's column information. Data can also be queried with SQL statements.

train = loadText("./taxidata/train.csv") // load the CSV into an in-memory table
train.schema().colDefs                   // inspect column names and types
select count(*) from train               // number of records
select top 5 * from train                // preview the first 5 rows

2.3 Data Storage

After loading the data into in-memory tables, the training and test data can be imported into a DolphinDB distributed database to simplify subsequent reads and model training; a sketch follows. For details on importing data into a distributed database, see database.md · dolphindb/Tutorials_CN - Gitee.
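As a minimal sketch of that import (the database name and the daily VALUE partition scheme here are illustrative assumptions, not the tutorial's exact setup):

// create a DFS database partitioned by day and import the training table
dbName = "dfs://taxi"
if(existsDatabase(dbName)) dropDatabase(dbName) // start from a clean slate
db = database(dbName, VALUE, 2016.01.01..2016.06.30) // one partition per date
pt = db.createPartitionedTable(table=train, tableName=`trainData, partitionColumns=`pickup_datetime)
pt.append!(train) // write the in-memory table into the partitioned table
trainData = loadTable(dbName, `trainData) // reopen it as a distributed table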

3. Model Construction

This section describes how to build a travel time prediction model.

Building the travel time prediction model involves three stages:

  • Data preprocessing: convert possible null values, and turn non-numeric data such as characters into numeric data usable for model training.
  • Location optimization: the latitude and longitude in the raw data are concentrated between 40.70°N–40.80°N and 73.94°W–74.02°W, so the location features differ little between records; principal component analysis and clustering can extract more discriminative information from them.
  • New feature construction: location and time are the two key dimensions of order data. Bearing and distance features computed from the locations add spatial information, and combining the location and time information of different categories yields more complex features that help the model learn deeper spatio-temporal patterns.

3.1 Data preprocessing

During model training, the first step is to check whether the dataset contains null values. Neither the training set nor the test set contains any. If missing values were present, they would need to be handled by deletion, interpolation, or similar operations.
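For illustration only, since this dataset needs neither step (isValid and nullFill are built-in functions; the column choices are arbitrary):

trainData = select * from trainData where isValid(trip_duration) // drop rows whose label is missing
trainData[`passenger_count] = nullFill(trainData[`passenger_count], 1) // fill a missing numeric column with a default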

Next, the data types must be checked. Raw data often contains textual data: as the table in Section 2.2 shows, the store_and_fwd_flag column is character data, and the pickup_datetime and dropoff_datetime columns are temporal data. To make full use of this information for training, it must be converted into numeric data.

In addition, the evaluation metric of this dataset is the Root Mean Squared Logarithmic Error (RMSLE), and the maximum travel time is close to 1,000 hours, so such outliers would hurt training. The logarithm of the travel time, log(trip_duration + 1), is therefore used as the prediction target, which lets the ordinary Root Mean Squared Error (RMSE) be used directly for evaluation (see Section 3.6).

RMSLE = sqrt( (1/n) * Σ ( log(pᵢ + 1) − log(aᵢ + 1) )² ), where pᵢ is the predicted and aᵢ the actual travel time of trip i. Because the target is already log(trip_duration + 1), this is exactly the RMSE on the transformed target.

DolphinDB provides a variety of built-in functions for fast data processing: isNull checks for null values and, combined with aggregate functions such as sum, can scan an entire table quickly; iif works like a conditional operator and simplifies if-else statements; date, weekday, and hour extract different components of temporal data concisely and efficiently; and, like Python and similar languages, DolphinDB supports square-bracket ([]) indexing, which simplifies table lookup, update, and insertion.

sum(isNull(train))  // all zeros: the table contains no null values
trainData[`store_and_fwd_flag_int] = iif(trainData[`store_and_fwd_flag] == 'N', int(0), int(1)) // convert the N/Y characters to 0/1
trainData[`pickup_date] = date(trainData[`pickup_datetime]) // date
trainData[`pickup_weekday] = weekday(trainData[`pickup_datetime]) // day of week
trainData[`pickup_hour] = hour(trainData[`pickup_datetime]) // hour
trainData[`log_trip_duration] = log(double(trainData[`trip_duration]) + 1) // log-transform the travel time: log(trip_duration + 1)
select max(trip_duration / 3600) from trainData // the longest trip in the training set is 979 h

3.2 Principal component analysis (PCA) of location information

The latitude and longitude in the raw data are concentrated between 40.70°N–40.80°N and 73.94°W–74.02°W, so the location features differ little between records. Transforming the coordinates with PCA helps the XGBoost decision trees find good splits. For the DolphinDB pca function, see pca — DolphinDB 2.0 documentation.

DolphinDB's pca returns a dictionary with three keys: components, explainedVarianceRatio, and singularValues. They hold, respectively, the size(colNames)*k principal component matrix, the variance contribution ratio of each of the first k principal components, and the principal component variances (eigenvalues of the covariance matrix). Data can then be transformed with the principal component matrix, analogous to Scikit-Learn's PCA.transform().
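For example, once the model has been fitted (as in the code below), these entries can be read directly:

print pca_model.explainedVarianceRatio // variance contribution of each of the k components
print pca_model.components             // the size(colNames)*k projection matrix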

A sample of the data can be plotted as a latitude-longitude scatter plot to inspect the PCA results.

After processing, the position coordinates are scattered around the origin.

Pick-up location information before PCA

 

Pickup position information after PCA

pca accepts one or more data sources as its argument and performs principal component analysis on the specified columns. The input in-memory table can be built with the table function; DolphinDB also provides matrix operations such as dot (matrix multiplication) and repmat (matrix tiling), so position information can be transformed quickly with a few built-in calls.

PCApara = table(trainData[`pickup_latitude] as latitude, trainData[`pickup_longitude] as longitude)
pca_model = pca(sqlDS(<select * from PCApara>)) // fit PCA on the coordinate data source
// transform: trainPickPara (the training pickup coordinates) and train_n (their row count) are defined in the full script (Section 6.2)
pca_trainpick = dot((matrix(trainPickPara) - repmat(matrix(avg(trainPickPara)), train_n, 1)), pca_model.components)
trainData[`pca_trainpick_0] = flatten(pca_trainpick[:, 0])

DolphinDB provides the plot function for data visualization; the chart type is specified with the chartType parameter. See plot — DolphinDB 2.0 documentation for details.

x = select top 1000 pca_trainpick_1 from trainData
y = flatten(matrix(select top 1000 pca_trainpick_0 from trainData))
plot(x, y, chartType=SCATTER)

3.3 Location Information Clustering (KMeans)

The raw location data is voluminous, and it is hard to mine features shared across many records. KMeans groups points with similar latitude and longitude into the same cluster, which helps summarize the characteristics within each group. This model sets the number of clusters to 100 and the maximum number of centroid iterations to 100, and uses KMeans++ initialization. For the optional parameters of DolphinDB kmeans, see kmeans — DolphinDB 2.0 documentation.

You can use a bar chart to observe the distribution of the clustered data.

KMeans clustering results

kmeans accepts a table as the training set. For models produced by machine learning functions, DolphinDB provides saveModel to save the model to a local file for later prediction (the user specifies an absolute or relative path for the server-side output file), and predict to apply a trained model to test data with the same table structure.

kmeans_set = PCApara[rand(size(PCApara)-1, 500000)] // randomly sample 500,000 rows for clustering
kmeans_model = kmeans(kmeans_set, 100, maxIter=100, init='k-means++') // KMeans++
saveModel(kmeans_model, "./taxidata/KMeans.model") // save the trained model for later prediction
trainData['pickup_cluster'] = kmeans_model.predict(select pickup_latitude, pickup_longitude from trainData)

For details on the saveModel and predict functions, refer to the DolphinDB 2.0 documentation.
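As a minimal sketch, reusing the path saved above:

kmeans_loaded = loadModel("./taxidata/KMeans.model") // reload the saved clustering model
testData['pickup_cluster'] = kmeans_loaded.predict(select pickup_latitude, pickup_longitude from testData)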

3.4 New Feature Construction

The raw data only provides latitude and longitude, and further location features can be derived from it, such as the great-circle distance between two coordinate points on the earth's surface, the Manhattan distance between them, and the bearing from one to the other.

The distance between two points on the earth's surface can be computed accurately with the haversine formula. In this dataset, however, actual trips usually follow paths made of horizontal or vertical streets, so the Manhattan distance (also known as city block distance), the sum of the absolute axis distances between two points in a standard coordinate system, may reflect the actual driving distance more faithfully.

Since the training set contains complete time information, speed features can also be added to it. Training-set speed features cannot be used directly on the test set, but trips sharing the same location cluster or the same time-of-day features tend to have similar travel times and average speeds (for example, speeds are high in the suburbs or early morning and low in the urban core and at rush hour), so patterns found on the training set carry over to the test set. The data can be grouped by cluster or time features, the average speed within each group computed and combined into new features, and the results merged into the corresponding groups of the test set.

The distance and bearing calculations take different parameters but follow the same pattern; DolphinDB supports user-defined functions to encapsulate such calculations in independent code modules (a sketch of one such helper is given below). For features within each class (cluster, hour, date), the groupby function computes an aggregate (e.g., avg) per group: it takes three parameters, grouping by the column given as the third parameter, applying the function given as the first parameter to the column given as the second, and returning a table with one row per group. The grouped features can then be merged back into the feature data via table joins; this article uses fj (full join), with the join column specified as the third parameter, to merge the feature table passed as the first parameter with the groupby result passed as the second.
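As an illustration, a possible definition of haversine_array as a DolphinDB user-defined function follows. This is a sketch based on the standard haversine formula; the authoritative versions of all three helpers are in taxiTrain.dos (Section 6.2).

def haversine_array(lat1, lng1, lat2, lng2) {
    d2r = 3.141592653589793 / 180.0 // degrees to radians
    dLat = (lat2 - lat1) * d2r
    dLng = (lng2 - lng1) * d2r
    a = pow(sin(dLat / 2), 2) + cos(lat1 * d2r) * cos(lat2 * d2r) * pow(sin(dLng / 2), 2)
    return 2 * 6371 * asin(sqrt(a)) // mean earth radius of 6371 km gives the distance in km
}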

// haversine distance, Manhattan distance, and bearing between the pickup and dropoff coordinates
trainData['distance_haversine'] = haversine_array(trainData['pickup_latitude'], trainData['pickup_longitude'], trainData['dropoff_latitude'], trainData['dropoff_longitude'])
trainData['distance_dummy_manhattan'] = dummy_manhattan_distance(trainData['pickup_latitude'], trainData['pickup_longitude'], trainData['dropoff_latitude'], trainData['dropoff_longitude'])
trainData['direction'] = bearing_array(trainData['pickup_latitude'], trainData['pickup_longitude'], trainData['dropoff_latitude'], trainData['dropoff_longitude'])
// aggregate speed and travel time by hour, date, cluster, etc. to produce new features
for(gby_col in ['pickup_hour', 'pickup_date', 'pickup_week_hour', 'pickup_cluster', 'dropoff_cluster']) {
    for(gby_para in ['avg_speed_h', 'avg_speed_m', 'log_trip_duration']) {
        gby = groupby(avg, trainData[gby_para], trainData[gby_col])
        gby.rename!(`avg_ + gby_para, gby_para + '_gby_' + gby_col)
        trainData = fj(trainData, gby, gby_col)
        testData = fj(testData, gby, gby_col)
    }
    trainData.dropColumns!(`gby + gby_col)
}

3.5 Model training (XGBoost)

Before training, the training and test sets must be checked once more. Non-numeric columns such as the ID, dates, and characters must be removed, as must columns that exist only in the training set, such as average speed and travel time, to ensure that the training data and the prediction data share the same structure.

Once data processing and feature construction are complete, the model can be trained with a machine learning method such as XGBoost. To evaluate training, the training dataset is split into a training set and a validation set, as sketched below: 80% of the records are randomly selected for training, the remaining 20% are used for validation, and the root mean squared error measures the deviation between the validation predictions and the true values. Finally, travel time predictions are produced on the test set.
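A minimal sketch of such a split and of the validation metric (variable names are illustrative; the actual flow is in taxiTrain.dos):

n = size(trainData)
idx = shuffle(0..(n - 1)) // random permutation of the row indices
cut = int(n * 0.8)
trainPart = trainData[idx[0 : cut]] // 80% for training
validPart = trainData[idx[cut : n]] // 20% for validation
// after prediction, RMSE against the log-duration target
rmse = sqrt(avg(pow(yvalid_ - validPart[`log_trip_duration], 2)))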

The model's root mean squared error on the validation set is 0.390. A scatter plot of predicted versus true values gives a qualitative view of the model's accuracy.

Predicted vs. true values on the validation set

DolphinDB provides the XGBoost plug-in for model training and prediction. Before use, download the plug-in to the appropriate path and load it. For details on the DolphinDB XGBoost plug-in, see xgboost/README_CN.md · dolphindb/DolphinDBPlugin - Gitee.
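For example (the plug-in file path is an assumption; adjust it to the actual installation directory):

loadPlugin("./plugins/xgboost/PluginXgboost.txt") // load once per session before calling xgboost:: functions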

xgb_pars = {'min_child_weight': 50, 'eta': 0.3, 'colsample_bytree': 0.3, 'max_depth': 10,
            'subsample': 0.8, 'lambda': 1., 'nthread': 48, 'booster' : 'gbtree', 'silent': 1,
            'eval_metric': 'rmse', 'objective': 'reg:linear'} // XGBoost parameters (duplicate nthread entry removed)
xgbModel = xgboost::train(ytrain, train, xgb_pars, 60) // train the model for 60 boosting rounds
yvalid_ = xgboost::predict(xgbModel, valid) // predict on the validation set

3.6 Model evaluation

Three machine learning methods are combined to predict taxi travel time: PCA transforms the latitude-longitude features of the location data; KMeans++ clusters the pickup and dropoff locations, dividing New York City into 100 areas; and XGBoost is trained on the constructed features. The model's root mean squared error on the validation set is 0.390, a good result.

Python Scikit-Learn is one of the mainstream machine learning libraries. This article uses Python to train the same dataset in the same environment; the training times of PCA, KMeans++, and XGBoost are shown below:

Model      DolphinDB    Python
PCA        0.325 s      0.396 s
KMeans++   45.711 s     104.568 s
XGBoost    57.269 s     74.289 s

The validation-set errors of the models trained with DolphinDB and Python are shown below:

           DolphinDB    Python
RMSE       0.390        0.394

In this travel time prediction task, DolphinDB matches Python in accuracy and outperforms it in speed on PCA, KMeans++, and XGBoost alike.

4. Real-time prediction of travel time

Combining a real-world scenario, this section introduces how to use DolphinDB to process real-time order stream data and estimate travel time on the fly with the trained prediction model.

In real-world scenarios, ride-hailing passengers care about timeliness and need the platform to provide accurate travel time estimates, while service providers need to monitor the platform, analyze travel demand, and schedule resources. A prediction model alone cannot process real-time data efficiently enough to meet these needs. The DolphinDB streaming data module solves the problem of fast analysis and computation over real-time data in production: for the real-time data sent by service providers, the DolphinDB streaming engines efficiently perform data preprocessing, information extraction, and feature construction, and the pre-trained model delivers fast, accurate predictions of order travel time, providing a one-stop solution covering model training, stream ingestion, real-time prediction, and online monitoring.

DolphinDB stream data processing framework

4.1 Scenario description

The DolphinDB streaming data module follows a publish-subscribe-consume model. Streaming data is first injected into a stream table, which publishes it; third-party applications can then subscribe to and consume the stream through DolphinDB scripts or APIs.

To predict taxi travel time in real time, the service provider can create DolphinDB stream tables that subscribe to server messages, obtain the trip information created by passengers, use the offline-trained model to predict travel time in real time, and finally let applications subscribe to the prediction results and deliver them to passengers.

4.2 Real-time data simulation and prediction

To ingest trip data and predict travel time with the machine learning model, three stream tables are needed for real-time prediction: an order table that subscribes to passengers' trip information, a feature table that holds the features extracted from the order information, and a prediction table that receives the feature rows streamed from the feature table and outputs the prediction results.
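As a hedged sketch, the order stream table might be defined as follows; the schema mirrors Section 2.2, and traitTable and predictTable are created the same way with the feature and prediction columns (see taxiStream.dos for the exact definitions).

// shared stream table that receives raw order messages
share streamTable(10000:0, `id`vendor_id`pickup_datetime`passenger_count`pickup_longitude`pickup_latitude`dropoff_longitude`dropoff_latitude`store_and_fwd_flag, [SYMBOL, INT, DATETIME, INT, DOUBLE, DOUBLE, DOUBLE, DOUBLE, CHAR]) as orderTable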

Users subscribe to stream data with subscribeTable, specifying how subscribed data is processed through the handler parameter (see subscribeTable — DolphinDB 2.0 documentation for details). In this example, the feature table subscribes to the order table and extracts features from the raw information via the user-defined process function, while the prediction table subscribes to the feature table and predicts travel time from the features via the user-defined predictDuration function; skeletons are sketched below, and the full implementations are in the attached code in Section 6.2.
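The following handler skeletons are heavily abridged sketches, not the attached script's exact code: the feature logic is omitted, and for simplicity the trained model is bound to predictDuration through partial application (predictDuration{xgbModel, predictTable}), which differs slightly from the subscription call shown below. Note how process{traitTable, hisData} likewise binds its leading arguments, so subscribeTable delivers each message batch as the final parameter.

// feature extraction: called with each batch of new orders
def process(traitTable, hisData, msg) {
    // derive time features, cluster IDs, distances, and the group statistics
    // precomputed in hisData (details omitted), then publish to the feature table
    t = select *, hour(pickup_datetime) as pickup_hour, weekday(pickup_datetime) as pickup_weekday from msg
    traitTable.append!(t)
}
// prediction: score the feature rows and publish the results
def predictDuration(model, predictTable, msg) {
    x = select pickup_hour, pickup_weekday from msg // abridged: only the numeric feature columns used in training
    predictTable.append!(table(msg[`id] as id, msg[`pickup_datetime] as pickup_datetime, xgboost::predict(model, x) as duration))
}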

To simulate real-time data, the replay function plays back historical data as if it were being produced live.

// subscribe to the order table: data flows from the order table to the feature table
subscribeTable(tableName="orderTable", actionName="orderProcess", offset=0, handler=process{traitTable, hisData}, msgAsTable=true, batchSize=1, throttle=1, hash=0, reconnect=true)
// subscribe to the feature table: data flows from the feature table to the prediction table
subscribeTable(tableName="traitTable", actionName="predict", offset=0, handler=predictDuration{predictTable}, msgAsTable=true, hash=1, reconnect=true)
// replay historical data to simulate live production data
submitJob("replay", "trade", replay{inputTables=data, outputTables=orderTable, dateColumn=`pickup_datetime, timeColumn=`pickup_datetime, replayRate=25, absoluteRate=true, parallelLevel=1})

4.3 Grafana real-time monitoring

Service providers can connect to DolphinDB through third-party APIs to monitor the travel time prediction service. This article takes Grafana as an example to briefly show how a third-party application can display real-time data dynamically.

Grafana is a dashboard tool for dynamic visualization of time-series data. DolphinDB provides a Grafana data source: users write query scripts in a Grafana panel to interact with DolphinDB, visualize DolphinDB time-series data, and analyze real-time data in the browser. For details, see README.zh.md · dolphindb/grafana-datasource - Gitee.

After adding the data source and creating a dashboard, fill in the following DolphinDB statements under Query to visualize real-time data:

  • Query 1: Display the estimated arrival time and estimated travel time of travel orders for the day
select id as ID, pickup_datetime as pickup_time, (pickup_datetime+int((exp(duration)-1))) as arrival_time, (exp(duration)-1)/60 as duration from predictTable
where date(predictTable.pickup_datetime) == date(exec max(pickup_datetime) from predictTable)
  • Query 2: Count the cumulative number of orders and cumulative passengers on the day
select count(*) from predictTable
where date(predictTable.pickup_datetime) == date(exec max(pickup_datetime) from predictTable)
select sum(passenger_count) from predictTable
where date(predictTable.pickup_datetime) == date(exec max(pickup_datetime) from predictTable)

The estimated arrival time of the order and the number of orders on the day

  • Query 3: Count the boarding positions of passengers on the day
select pickup_latitude as latitude, pickup_longitude as longitude from predictTable
where date(predictTable.pickup_datetime) == date(exec max(pickup_datetime) from predictTable)

Passenger boarding position on the day

  • Query 4: Count the travel time of orders at different times of the day
select pickup_datetime, (exp(duration)-1)/60 as duration from predictTable
where date(predictTable.pickup_datetime) == date(exec max(pickup_datetime) from predictTable)

The travel time of orders at different times of the day

4.4 Data Persistence

If historical data needs to be written to disk, you can subscribe to the order table with subscribeTable and specify the distributed table returned by loadTable as the handler, so that the subscribed data is persisted automatically.

db = database("dfs://taxi") // open the DFS database
if(existsTable("dfs://taxi", "newData")) { dropTable(db, "newData") }
// create a partitioned table whose schema matches the order stream table
db.createPartitionedTable(table=table(1:0, orderTable.schema().colDefs.name, orderTable.schema().colDefs.typeString), tableName=`newData, partitionColumns=`pickup_datetime, sortColumns=`pickup_datetime, compressMethods={datetime:"delta"})
// the handler is the distributed table itself, so each message batch is appended to disk
subscribeTable(tableName="orderTable", actionName="saveToDisk", offset=0, handler=loadTable("dfs://taxi", "newData"), msgAsTable=true, batchSize=100000, throttle=1, reconnect=true)

5. Summary

This article has introduced how to train a taxi travel time prediction model with DolphinDB's machine learning functions and plug-ins. Compared with mainstream alternatives such as Python Scikit-Learn, DolphinDB performs well in both training time and prediction accuracy. Building on this, the article showed how to use DolphinDB's streaming tools for real-time prediction, with Grafana as an example of visualizing DolphinDB time-series data. DolphinDB's built-in computation functions and machine learning methods cover the complete machine learning workflow from data storage, loading, and cleaning through feature construction to model building and evaluation, offering IoT users a comprehensive set of data analysis methods.

6. Appendix

6.1 Test environment

  • Operating system: Linux version 3.10.0-1160.el7.x86_64
  • CPU: Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz, 48 cores
  • Memory: 188 GB
  • Software versions:
    • DolphinDB: 2.00.9
    • Python3: 3.7.12
    • Scikit-Learn: 1.0.2
    • XGBoost: 1.6.2

6.2 Model Code

DolphinDB model training code: taxiTrain.dos

DolphinDB streaming data prediction code: taxiStream.dos 

Python model training code: taxiTrain.py 
