[2023 2nd Dingding Cup College Student Big Data Challenge] Preliminary Round B: Public Bicycle Usage Forecasting and Analysis in New York City, USA. Problem 3: Time Series Forecasting, Python Code Analysis


Related Links

[2023 2nd Dingding Cup College Student Big Data Challenge] Preliminary Round B: Public Bicycle Usage Forecasting and Analysis in New York City, USA. Problem 1: Python Code Analysis

[2023 2nd Dingding Cup College Student Big Data Challenge] Preliminary Round B: Public Bicycle Usage Forecasting and Analysis in New York City, USA. Problem 2: Python Code Analysis

[2023 2nd Dingding Cup College Student Big Data Challenge] Preliminary Round B: Public Bicycle Usage Forecasting and Analysis in New York City, USA. Problem 3: Time Series Forecasting, Python Code Analysis

1 Problem Statement

Citi Bike is a bicycle-sharing program launched in New York City in 2013. It is sponsored by Citibank, hence the name "Citi Bike". The program has 8,000 bikes and 500 stations across Manhattan, Brooklyn, Queens and Jersey City, and gives New York residents and visitors a convenient, fast and cost-effective way to get around by bicycle. Riders can borrow a bike at any Citi Bike station and return it at a station near their destination. Unlike dockless shared bikes, Citi Bike bicycles cannot be picked up and dropped off anywhere by scanning a QR code with a phone; they must be borrowed from and returned to fixed docks.

The data for this case has two parts. The first part is the transaction log of public bicycle borrowing and returning in New York City. It covers 38 months (1,158 days) from July 1, 2013 to August 31, 2016, with one file per month. The files for July 2013 to August 2014 use a different format from the other months, specifically in how the variables starttime and stoptime are stored.

The second part is the weather data for New York City over that period, stored in the file weather_data_NYC.csv, which contains hourly weather records from 2010 to 2016.

Public bicycle data field table

| No. | Variable name | Meaning | Values and description |
|---|---|---|---|
| 1 | trip duration | Trip duration | Riding time, numeric, in seconds |
| 2 | start time | Departure time | Borrowing time, string, m/d/YYYY HH:MM:SS |
| 3 | stop time | End time | Return time, string, m/d/YYYY HH:MM:SS |
| 4 | start station id | Borrowing station number | Qualitative variable, unique station number |
| 5 | start station name | Borrowing station name | String |
| 6 | start station latitude | Latitude of borrowing station | Numeric |
| 7 | start station longitude | Longitude of borrowing station | Numeric |
| 8 | end station id | Return station number | Qualitative variable, unique station number |
| 9 | end station name | Return station name | String |
| 10 | end station latitude | Latitude of return station | Numeric |
| 11 | end station longitude | Longitude of return station | Numeric |
| 12 | bike id | Bicycle number | Qualitative variable, unique bicycle number |
| 13 | user type | User type | Subscriber: annual member; Customer: temporary user with a 24-hour or 7-day pass |
| 14 | birth year | Year of birth | The only column with missing values |
| 15 | gender | Gender | 0: unknown; 1: male; 2: female |

Weather data field table

| No. | Variable name | Meaning | Values and description |
|---|---|---|---|
| 1 | date | Date | String |
| 2 | time | Time | EDT (Eastern Daylight Time), i.e. US Eastern daylight saving time |
| 3 | temperature | Air temperature | Unit: °C |
| 4 | dew_poit | Dew point | Unit: °C |
| 5 | humidity | Humidity | Percentage |
| 6 | pressure | Sea-level pressure | Unit: hPa |
| 7 | visibility | Visibility | Unit: km |
| 8 | wind_direction | Wind direction | Discrete; categories include West, Calm, etc. |
| 9 | wind_speed | Wind speed | Unit: km/h |
| 10 | moment_wind_speed | Instantaneous wind speed | Unit: km/h |
| 11 | precipitation | Precipitation | Unit: mm; has missing values |
| 12 | activity | Activity | Discrete; categories include snow, etc. |
| 13 | conditions | Conditions | Discrete; categories include overcast, light snow, etc. |
| 14 | WindDirDegrees | Wind direction angle | Continuous, values from 0 to 359 |
| 15 | DateUTC | GMT (UTC) date and time | YYYY/m/d HH:MM |

Problems to solve

  1. Bicycle borrowing and returning network:

Build the network graph of bicycle borrowing and returning among all stations for a single day. The graph is directed, with each arrow pointing from the borrowing station to the returning station. (Many stations have both borrowing and returning records, so most pairs of stations are connected in both directions.)

(1) Taking August 3, 2014 as an example, carry out a network analysis: build the bicycle borrowing and returning network graph and calculate its number of nodes, number of edges and network density (the number of edges as a proportion of all possible connections). Give the calculation process and the resulting plot.

(2) Using the network graph above, analyze the local sub-network of stations with latitude between 40.695 and 40.72 and longitude between -74.023 and -73.973, and calculate the average shortest path length (the arithmetic mean of the shortest path lengths between all pairs of nodes) and the network diameter (the largest shortest path length in the network).

  2. Cluster analysis:

Perform cluster analysis on the bicycle data from July 1, 2013 to August 31, 2015. Choose an appropriate number of clusters K, apply at least two clustering algorithms, compare the methods and analyze the clustering results.

  3. Forecasting borrowing volume by station:

Forecast the borrowing volume of public bicycles at every station, predicting the future single-day borrowing volume. Use the data from July 2013 to July 2015 as the training set and the data from August 1 to 31, 2015 as the test set, and predict the daily borrowing volume for August 1 to 31, 2015. Report the MAPE of the predictions for each station and the number of model parameters, then compute the mean MAPE over all stations. (Note: the test set must not be used in training or validation; doing so is treated as a violation.)
$$\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$$
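As a minimal sketch of the MAPE formula, the metric can be computed with NumPy as follows. The function name `mape` and the sample numbers are illustrative assumptions. Days whose true value is zero are skipped here because the ratio is undefined for them; the problem statement does not say how such days should be treated, so this is also an assumption.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # assumption: skip days with zero true demand
    if not mask.any():
        return float("nan")
    ape = np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])
    return float(np.mean(ape) * 100)


# Illustrative values only, not competition data
print(mape([100, 200, 50], [110, 180, 50]))
```

Note that `sklearn.metrics.mean_absolute_percentage_error` returns the same quantity as a fraction rather than a percentage.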

data.csv contains the public bicycle trip records for New York City, with the same fields as the public bicycle data field table above. The analysis is carried out in Python after data preprocessing and feature engineering.

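2 Problem Analysis

2.1 Problem 1

  1. Draw the directed graph

a. Read in the data, extract the two columns "start station id" and "end station id", and build the bicycle borrowing and returning network graph.

b. For the graph built in the first step, compute the number of nodes, the number of edges and the network density. The number of nodes is the number of stations, and the edges come from the borrow-and-return records (in a simple directed graph, each distinct pair of borrowing and returning stations counts as one edge). The network density is the number of edges as a proportion of all possible connections.

c. Plot the bicycle borrowing and returning network graph.

e. Compute the average shortest path length and the network diameter

First select the borrowing and returning stations that meet the condition (latitude between 40.695 and 40.72, longitude between -74.023 and -73.973) and build a subgraph with them as nodes. The average shortest path length and the network diameter can then be computed directly with functions from the networkx library.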
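The steps above can be sketched with networkx roughly as follows. The toy `trips` table is made up for illustration; real input would be one day's trip records. Restricting to the largest strongly connected component is an assumption added here, because `nx.average_shortest_path_length` and `nx.diameter` raise an error on a directed graph that is not strongly connected.

```python
import networkx as nx
import pandas as pd

# Toy trips (illustrative only)
trips = pd.DataFrame({
    "start station id": [1, 2, 3, 1, 3],
    "end station id":   [2, 3, 1, 3, 2],
})

# Directed graph: one edge per distinct (borrowing station, returning station) pair
G = nx.DiGraph()
G.add_edges_from(zip(trips["start station id"], trips["end station id"]))

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
density = nx.density(G)  # for a directed graph: m / (n * (n - 1))

# Path metrics need a strongly connected graph, so keep the largest such component
largest_scc = max(nx.strongly_connected_components(G), key=len)
H = G.subgraph(largest_scc)
avg_path = nx.average_shortest_path_length(H)
diameter = nx.diameter(H)

print(n_nodes, n_edges, density, avg_path, diameter)
```

Round trips that start and end at the same station would show up as self-loops, so it may be worth deciding up front whether to drop them before computing the density.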

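2.2 Problem 2

  1. Data preprocessing: clean the data and extract features. PCA or LDA can be used for dimensionality reduction to lower the computational cost.

  2. Clustering algorithms:
    a. K-means: run several trials with different values of K and keep the best clustering. Evaluation metrics such as the silhouette coefficient and the Calinski-Harabasz index can be used to compare and select.
    b. DBSCAN: clusters data points by density and does not need the number of clusters to be specified in advance. Different clusterings are obtained by tuning the radius parameter and the density parameter (eps and min_samples in scikit-learn).
    c. Hierarchical clustering: comes in top-down and bottom-up variants. The bottom-up variant iteratively computes the similarity between data points and merges them step by step until the final clustering is obtained.

    d. Improved clustering algorithms

    e. Deep clustering algorithms

  3. Analysis of clustering results: after choosing the best clustering, build a profile of the riders in each cluster and analyze the behavior of each group.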
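To illustrate the K selection described for K-means, here is a small sketch using scikit-learn. The data is synthetic (three well-separated blobs standing in for scaled trip features), and picking K by the highest silhouette coefficient is only one of the possible criteria.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stand-in for scaled trip features (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Try several K values and record both evaluation metrics
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (silhouette_score(X, labels), calinski_harabasz_score(X, labels))

# Keep the K with the highest silhouette coefficient
best_k = max(scores, key=lambda k: scores[k][0])
print(best_k, scores[best_k])
```

On the full trip data the silhouette coefficient is expensive to compute, so scoring a random sample of rows is a common workaround.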

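2.3 Problem 3

  1. Import the data and preprocess it, aggregating the borrowing records by station.
  2. Run a time series analysis on the data and use an ARIMA model to forecast the single-day borrowing volume.
  3. Evaluate the model with time series cross-validation and compute the MAPE of the predictions for each station.
  4. Compute the mean MAPE over all stations and report the number of model parameters.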
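The time series cross-validation in step 3 can be sketched as an expanding-window (rolling-origin) evaluation. To keep the example dependency-free, a seasonal-naive forecaster stands in for ARIMA here; it is not the model named above, and an ARIMA fit would be plugged in through the same `forecast_fn` argument. All function names and the synthetic series are assumptions for illustration.

```python
import numpy as np


def seasonal_naive(history, horizon, season=7):
    """Stand-in forecaster: repeat the last observed week."""
    last = np.asarray(history[-season:], dtype=float)
    reps = int(np.ceil(horizon / season))
    return np.tile(last, reps)[:horizon]


def rolling_origin_mape(series, forecast_fn, n_folds=3, horizon=7):
    """Each fold trains on all data before its validation block, never after it."""
    series = np.asarray(series, dtype=float)
    fold_scores = []
    for fold in range(n_folds, 0, -1):
        split = len(series) - fold * horizon
        train, valid = series[:split], series[split:split + horizon]
        pred = forecast_fn(train, horizon)
        mask = valid != 0  # skip zero-demand days, where MAPE is undefined
        ape = np.abs((valid[mask] - pred[mask]) / valid[mask])
        fold_scores.append(np.mean(ape) * 100)
    return float(np.mean(fold_scores))


# Synthetic daily counts with a weekly pattern (illustrative only)
days = np.arange(70)
series = 100 + 20 * np.sin(2 * np.pi * days / 7)
print(rolling_origin_mape(series, seasonal_naive))
```

Because every validation block lies strictly after its training window and before August 2015, this kind of split can be used for model selection without touching the test set.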

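3 Python Code Implementation

3.1 Problem 1

[2023 2nd Dingding Cup College Student Big Data Challenge] Preliminary Round B: Public Bicycle Usage Forecasting and Analysis in New York City, USA. Problem 1: Python Code Analysis

3.2 Problem 2

[2023 2nd Dingding Cup College Student Big Data Challenge] Preliminary Round B: Public Bicycle Usage Forecasting and Analysis in New York City, USA. Problem 2: Python Code Analysis

3.3 Problem 3

(1) Merging the weather data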


import os

import pandas as pd

# Load the monthly trip files and concatenate them
folder_path = '初赛数据集/问题3数据集'
dfs = []
for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):
        csv_path = os.path.join(folder_path, filename)
        # Only the first 5000 rows of each file are read, which keeps the demo fast;
        # remove nrows to use the full data
        tempdf = pd.read_csv(csv_path, nrows=5000)
        dfs.append(tempdf)
bike_data = pd.concat(dfs, axis=0, ignore_index=True)
weather_data = pd.read_csv('初赛数据集/weather_data_NYC(3).csv')

# Inspect the format of both tables and how they relate
print(bike_data.head())
print(weather_data.head())

# Convert the "starttime" and "stoptime" columns to datetime
bike_data['starttime'] = pd.to_datetime(bike_data['starttime'])
bike_data['stoptime'] = pd.to_datetime(bike_data['stoptime'])
weather_data['date'] = pd.to_datetime(weather_data['date'])

# Add a "day" column holding the calendar date to each table
bike_data['day'] = bike_data['starttime'].dt.date
weather_data['day'] = weather_data['date'].dt.date

print(bike_data.head())
print(weather_data.head())
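Since the weather file is hourly, joining it to the trips on `day` matches every trip with all the weather rows of that day. One way around this, sketched below on toy data, is to reduce the weather to one row per day first. Averaging the three columns is an assumption made here for illustration; the original post does not show how the hourly rows are handled.

```python
import pandas as pd

# Toy hourly weather rows (illustrative values only)
weather = pd.DataFrame({
    "date": pd.to_datetime(["2015-08-01", "2015-08-01", "2015-08-02", "2015-08-02"]),
    "temperature": [24.0, 28.0, 22.0, 26.0],
    "humidity": [60.0, 50.0, 70.0, 66.0],
    "wind_speed": [10.0, 14.0, 8.0, 12.0],
})
weather["day"] = weather["date"].dt.date

# One row per day, so a merge on 'day' cannot duplicate trip rows
weather_daily = weather.groupby("day", as_index=False)[
    ["temperature", "humidity", "wind_speed"]
].mean()

# Toy trips (illustrative only)
trips = pd.DataFrame({
    "start station id": [72, 72, 79],
    "day": pd.to_datetime(["2015-08-01", "2015-08-01", "2015-08-02"]).date,
})

# validate= makes pandas raise if the weather side still has duplicate days
merged = trips.merge(weather_daily, on="day", how="left", validate="many_to_one")
print(merged)
```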


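(2) Feature engineering

Extract the features used for modeling from the two datasets.

  • For public bicycle usage, consider the borrowing station, the returning station, the borrowing time, and so on.

  • For weather conditions, consider factors such as temperature, humidity and wind speed.

# Public bicycle usage: select the features used for modeling
bike_data_features = bike_data[['start station id', 'end station id', 'starttime', 'day']]
...
# Weather conditions: select the features used for modeling
# ('day' is kept because it is the merge key below)
weather_data_features = weather_data[['date', 'day', 'temperature', 'humidity', 'wind_speed']]
...
# Next, merge the two datasets into a single dataset for training the model,
# joining bike_data_features and weather_data_features on the date column 'day'.
# Note: the weather data is hourly, so each trip matches every weather row of its day
# unless the weather has first been reduced to one row per day.
model_data = pd.merge(bike_data_features, weather_data_features, on='day', how='left')
# Encode the categorical features
...
# Labels: count the number of borrowings per station per day
BorrowCounts = model_data.groupby(['day', 'start station id']).size().reset_index()
BorrowCounts = BorrowCounts.rename(columns={0: 'count'})
model_data = pd.merge(model_data, BorrowCounts, on=['day', 'start station id'], how='left')
...
print(model_data.head())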
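The training code later in this section uses the columns `starthour`, `is_weekend` and `label`, which are not built in the snippets shown here. The sketch below is one plausible way to derive them on toy data. The definitions (hour of the borrowing time, a Saturday/Sunday flag, and the per-station daily borrowing count as the label) are assumptions, not the original author's code.

```python
import pandas as pd

# Toy trip records (illustrative only)
trips = pd.DataFrame({
    "start station id": [72, 72, 79, 72],
    "starttime": pd.to_datetime([
        "2015-07-31 08:15:00",  # Friday
        "2015-07-31 17:40:00",  # Friday
        "2015-08-01 09:05:00",  # Saturday
        "2015-08-01 10:30:00",  # Saturday
    ]),
})
trips["day"] = trips["starttime"].dt.date

# Assumed definitions: hour of day and a Saturday/Sunday flag
trips["starthour"] = trips["starttime"].dt.hour
trips["is_weekend"] = (trips["starttime"].dt.dayofweek >= 5).astype(int)

# Assumed label: number of borrowings per station per day
trips["label"] = trips.groupby(["day", "start station id"])["starttime"].transform("count")

print(trips)
```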


(3) Model training

This is a regression forecasting problem, so regression models such as XGBoost (XGB), LightGBM (LGB), linear regression and neural network regression can be used, as can common time series forecasting models such as ARIMA, GARCH and LSTM. XGB is used as the example below.

# Split the dataset into a training set and a test set, then build and train the model:

import xgboost as xgb
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


# Split by date: train on days before 2015-08-01, test on 2015-08-01 onward
split_day = pd.to_datetime('2015-08-01').date()
train_data = model_data[model_data['day'] < split_day]
test_data = model_data[model_data['day'] >= split_day]

# Input features and target for the training set
# ('starthour', 'is_weekend' and 'label' are not created in the snippets shown above)
feature_cols = ['start station id', 'end station id', 'starthour', 'is_weekend', 'temperature', 'humidity', 'wind_speed']
data_train = train_data[feature_cols]
Y_train = train_data['label']


# Input features and target for the test set
data_test = test_data[feature_cols]
Y_test = test_data['label']

# Fit the scaler on the training set only, then apply it to the test set
X_train = scaler.fit_transform(data_train)
X_test = scaler.transform(data_test)
# Define and train the model
# XGBoost regressor; linear regression, decision tree regression or neural network regression could also be used
model = xgb.XGBRegressor(
            objective='reg:squarederror',
            n_jobs=-1,
            n_estimators=1000,
            max_depth=7,
            subsample=0.8,
            learning_rate=0.05,
            gamma=0,
            colsample_bytree=0.9,
            random_state=2023,
            reg_alpha=0.3)  # L1 regularization
model.fit(X_train, Y_train)



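(4) Model evaluation and validation


# Predict on the test set
Y_pred = model.predict(X_test)

def calculate_mape(row):
    # sklearn expects (y_true, y_pred) and returns a fraction, not a percentage
    return mean_absolute_percentage_error([row['y_true']], [row['pred']])

# Compute the MAPE for each station
data_test = data_test.copy()  # avoid writing into a slice of model_data
data_test['pred'] = Y_pred
data_test['y_true'] = Y_test
data_test['mape'] = data_test.apply(calculate_mape, axis=1)
mape_by_station = data_test.groupby('start station id')['mape'].mean()
print(mape_by_station)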
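The task asks for a MAPE per station on daily borrowing volume and then the mean over stations. A vectorized sketch of that calculation, on a made-up table with one row per station and day, might look like this. The column names mirror the ones used above, and the numbers are illustrative only.

```python
import pandas as pd

# Toy daily results: one row per station and day (illustrative values only)
results = pd.DataFrame({
    "start station id": [72, 72, 79, 79],
    "day": ["2015-08-01", "2015-08-02", "2015-08-01", "2015-08-02"],
    "y_true": [100, 80, 50, 40],
    "pred": [90, 88, 55, 30],
})

# Absolute percentage error per station-day, in percent
results["ape"] = (results["y_true"] - results["pred"]).abs() / results["y_true"] * 100

# MAPE per station, then the mean over stations
mape_by_station = results.groupby("start station id")["ape"].mean()
mean_mape = mape_by_station.mean()

print(mape_by_station)
print(mean_mape)
```

Averaging the per-station values weights every station equally, which generally differs from a single MAPE computed over all rows when stations have different numbers of rows.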


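# Overall MAPE over all test rows (as a fraction);
# mape_by_station.mean() gives the mean of the per-station MAPEs instead
mape_mean = mean_absolute_percentage_error(Y_test, Y_pred)
print(mape_mean)

0.47444156668192194

# Inspect the XGB model's settings. Note that get_params() returns the
# hyperparameters, not a count of the learned parameters
model.get_params()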


Complete code

See the link at the bottom of the Zhihu article to download the complete code for all problems

zhuanlan.zhihu.com/p/643865954

Origin blog.csdn.net/weixin_43935696/article/details/131895347