2023 2nd DingTalk Cup College Student Big Data Challenge, Preliminary Round, Problem B: Usage Prediction and Analysis of Public Bicycles in New York, USA (Question 2)
1 Problem statement
Citi Bike is a bicycle-sharing program that New York City launched in 2013. It is sponsored by Citibank, hence the name "Citi Bike". It operates 8,000 bikes and 500 stations across Manhattan, Brooklyn, Queens and Jersey City, giving New York residents and tourists a convenient, fast and affordable way to travel by bicycle. Riders can borrow a bike from any Citi Bike station and return it at a station near their destination. The data in this case has two parts. The first part is the transaction log of public bicycle borrowing and returning in New York City. Citi Bike bicycles differ from dockless shared bicycles: they cannot be picked up and dropped off anywhere by scanning a QR code with a mobile phone, but must be borrowed from and returned to fixed docks. The data set covers 38 months (1,158 days) from July 1, 2013 to August 31, 2016, with one file per month. The files from July 2013 to August 2014 use a different data format from the other months, specifically a different storage format for the variables starttime and stoptime.
The second part is the weather data for New York City over that period, stored in the weather_data_NYC.csv file, which contains hourly weather data from 2010 to 2016.
Public bicycle data field table
variable number | variable name | variable meaning | Variable value and description |
---|---|---|---|
1 | trip duration | travel time | Riding time, numerical value, seconds |
2 | start time | departure time | Borrow time, string, m/d/YYYY HH:MM:SS |
3 | stop time | end time | Return time, string, m/d/YYYY HH:MM:SS |
4 | start station id | borrowing station number | Qualitative variable, unique station number |
5 | start station name | borrowing station name | string |
6 | start station latitude | latitude of borrowing station | numeric |
7 | start station longitude | longitude of borrowing station | numeric |
8 | end station id | return station number | Qualitative variable, unique station number |
9 | end station name | return station name | string |
10 | end station latitude | latitude of return station | numeric |
11 | end station longitude | longitude of return station | numeric |
12 | bike id | bike number | Qualitative variable, unique bicycle number |
13 | user type | user type | Subscriber: annual member; Customer: temporary user with a 24-hour or 7-day pass |
14 | birth year | year of birth | Only this column has missing values |
15 | gender | gender | 0: unknown 1: male 2: female |
Weather Data Field Profile Table
variable number | variable name | variable meaning | Variable value and description |
---|---|---|---|
1 | date | date | string |
2 | time | time | EDT (Eastern Daylight Time), the daylight saving time used in the eastern United States |
3 | temperature | air temperature | Unit: ℃ |
4 | dew_poit | dew point | Unit: ℃ |
5 | humidity | humidity | percentage |
6 | pressure | sea level pressure | Unit: hPa |
7 | visibility | visibility | Unit: km |
8 | wind_direction | wind direction | Discrete type, categories include west, calm, etc. |
9 | wind_speed | wind speed | Unit: kilometers per hour |
10 | moment_wind_speed | instantaneous wind speed | Unit: kilometers per hour |
11 | precipitation | precipitation | Unit: mm, there are missing values |
12 | activity | Activity | Discrete type, categories include snow, etc. |
13 | conditions | weather conditions | Discrete type, categories include overcast, light snow, etc. |
14 | WindDirDegrees | wind angle | Continuous type, values from 0 to 359 |
15 | DateUTC | Greenwich Mean Time | YYYY/m/d HH:MM |
Problems to be solved
- Visualizing bicycle borrowing and returning:
Build a network graph of bicycle borrowing and returning among stations for a single day. The graph is directed, with each arrow pointing from the borrowing station to the returning station (many stations have both borrowing and returning records, so most pairs of stations are connected in both directions).
(1) Taking August 3, 2014 as an example, build the bicycle borrowing and returning network graph and calculate its number of nodes, number of edges, and network density (the number of edges as a proportion of all possible connections). Give the calculation process and the resulting plot.
(2) Using the network graph above, analyze the local network in the area with latitude between 40.695 and 40.72 and longitude between -74.023 and -73.973, and calculate the average shortest path length (the arithmetic mean of the shortest path lengths between all pairs of nodes) and the network diameter (the maximum shortest path length in the network).
- Cluster analysis
Perform cluster analysis on the bicycle data from July 1, 2013 to August 31, 2015. Select an appropriate number of clusters K, apply at least two clustering algorithms, compare the methods, and analyze the clustering results.
- Forecasting station borrowing volume:
Forecast the future single-day borrowing volume of public bicycles at every station. Use the data from July 2013 to July 2015 as the training set and the data from August 1-31, 2015 as the test set, and predict the daily borrowing volume for August 1-31, 2015. Report the MAPE of the predictions for each station and the number of model parameters, and finally calculate the mean MAPE over all stations (note: the test set must not be used for training or validation; doing so counts as a violation).
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right| \times 100\%$$
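A minimal sketch of how this MAPE could be computed with NumPy. The function name `mape` and the choice to skip days whose actual volume is zero (where the ratio is undefined) are assumptions made for illustration; the competition statement does not say how such days should be handled.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, over entries with y_true != 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # assumption: skip days with zero actual volume
    if not mask.any():
        return float("nan")
    ratios = np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])
    return float(np.mean(ratios) * 100)

# Three days of actual vs. predicted borrowing volume (made-up numbers).
print(mape([100, 200, 50], [110, 180, 50]))
```

For the final score, the same function would be applied once per station and the per-station values averaged.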
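2 Problem analysis
2.1 Question 1
- Draw the directed graph
a. Read the data, extract the "start station id" and "end station id" columns, and build the bicycle borrowing and returning network graph.
b. For the graph built in the previous step, calculate the number of nodes, the number of edges, and the network density. The number of nodes is the number of stations, and the number of edges is the number of borrow-return trips. The network density is the number of edges as a proportion of all possible connections.
c. Plot the bicycle borrowing and returning network graph.
d. Calculate the average shortest path length and the network diameter
First select the borrowing and returning stations that meet the condition (latitude between 40.695 and 40.72, longitude between -74.023 and -73.973) and build a subgraph with them as nodes. The average shortest path length and the network diameter can then be calculated directly with functions from the networkx library.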
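The steps above can be sketched with networkx as follows. This is an illustration under stated assumptions, not the author's solution: the helper names are hypothetical, the toy trips stand in for one day of data, and each distinct (start, end) pair is stored as a single directed edge with the trip count as its weight, which differs from counting every trip as its own edge. Path statistics are computed on the largest strongly connected component, because shortest paths are undefined between stations that cannot reach each other.

```python
import networkx as nx
import pandas as pd

def build_trip_graph(trips: pd.DataFrame) -> nx.DiGraph:
    """One directed edge per (start, end) station pair; weight = number of trips."""
    pair_counts = trips.groupby(["start station id", "end station id"]).size()
    graph = nx.DiGraph()
    for (start, end), count in pair_counts.items():
        graph.add_edge(start, end, weight=int(count))
    return graph

def stations_in_box(trips: pd.DataFrame, lat_range, lon_range) -> set:
    """Ids of stations (as start or end) whose coordinates fall inside the box."""
    ids = set()
    for side in ("start", "end"):
        lat = trips[f"{side} station latitude"]
        lon = trips[f"{side} station longitude"]
        inside = lat.between(*lat_range) & lon.between(*lon_range)
        ids.update(trips.loc[inside, f"{side} station id"])
    return ids

def path_stats(graph: nx.DiGraph):
    """Average shortest path length and diameter (in hops) of the largest
    strongly connected component."""
    core_nodes = max(nx.strongly_connected_components(graph), key=len)
    core = graph.subgraph(core_nodes)
    return nx.average_shortest_path_length(core), nx.diameter(core)

# Toy trips with hypothetical station ids and coordinates.
coords = {1: (40.70, -74.00), 2: (40.71, -73.99), 3: (40.70, -73.98), 4: (40.80, -73.95)}
pairs = [(1, 2), (2, 3), (3, 1), (1, 2), (3, 4)]
trips = pd.DataFrame({
    "start station id": [s for s, _ in pairs],
    "end station id": [e for _, e in pairs],
    "start station latitude": [coords[s][0] for s, _ in pairs],
    "start station longitude": [coords[s][1] for s, _ in pairs],
    "end station latitude": [coords[e][0] for _, e in pairs],
    "end station longitude": [coords[e][1] for _, e in pairs],
})

g = build_trip_graph(trips)
n_nodes, n_edges = g.number_of_nodes(), g.number_of_edges()
density = nx.density(g)  # directed graph: edges / (n * (n - 1))

local_ids = stations_in_box(trips, (40.695, 40.72), (-74.023, -73.973))
local_avg, local_diam = path_stats(g.subgraph(local_ids))
print(n_nodes, n_edges, density, local_avg, local_diam)
```

Using `nx.MultiDiGraph` instead would keep one edge per trip, if the edge count is meant to equal the number of trips.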
2.2 Question 2
- Data preprocessing: clean the data and extract features. PCA or LDA can be used for dimensionality reduction to lower the computational cost.
- Clustering algorithms:
a. K-means: run several experiments with different K values and keep the best clustering result. Evaluation metrics such as the silhouette coefficient and the Calinski-Harabasz index can be used for comparison and selection.
b. DBSCAN: clusters data points by density, without specifying the number of clusters in advance. Different clustering results can be obtained by adjusting the radius parameter and the density parameter.
c. Hierarchical clustering: can be divided into top-down and bottom-up methods. The bottom-up variant iteratively computes the similarity between clusters and merges the closest ones until the final clustering is obtained.
d. Improved clustering algorithms
e. Deep clustering algorithms
- Clustering result analysis: after selecting the best clustering result, build profiles of the different categories of riders and analyze the behavior characteristics of each category.
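As a rough illustration of the preprocessing and K-selection steps, the sketch below standardizes a feature matrix, reduces it with PCA, and scores K-means for several K values with the silhouette coefficient and the Calinski-Harabasz index. The synthetic data from `make_blobs` (three well-separated groups) is an assumption standing in for the real trip features, and the range of K values is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the trip features: 3 well-separated groups, 5 features.
centers = np.array([[0, 0, 0, 0, 0],
                    [10, 10, 10, 10, 10],
                    [-10, 10, -10, 10, -10]], dtype=float)
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=0.5, random_state=0)

# Standardize so no feature dominates the distance, then reduce to 2 dimensions.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2, random_state=0).fit_transform(X_scaled)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    scores[k] = (silhouette_score(X_reduced, labels),
                 calinski_harabasz_score(X_reduced, labels))

# Higher is better for both metrics; here K is picked by silhouette.
best_k = max(scores, key=lambda k: scores[k][0])
for k, (sil, ch) in scores.items():
    print(f"K={k}: silhouette={sil:.3f}, Calinski-Harabasz={ch:.1f}")
print("best K by silhouette:", best_k)
```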
2.3 Question 3
- Import and preprocess the data, and aggregate the borrowing records by station.
- Run a time series analysis on the data and use an ARIMA model to predict the single-day borrowing volume.
- Evaluate the model with time series cross-validation and calculate the MAPE of each station's forecasts.
- Calculate the mean MAPE over all stations and report the number of model parameters.
3 Python code implementation
3.1 Question 1
3.2 Question 2
3.2.1 Read data
Import packages
import os
import time
import warnings

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering, Birch, MeanShift
from sklearn.decomposition import PCA
from tqdm import tqdm

warnings.filterwarnings("ignore")
tqdm.pandas()
# Merge the monthly CSV files into one DataFrame
folder_path = '初赛数据集/2013_2015'
dfs = []
for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):
        csv_path = os.path.join(folder_path, filename)
        tempdf = pd.read_csv(csv_path)
        dfs.append(tempdf)
data = pd.concat(dfs, axis=0, ignore_index=True)
# Following the field table, drop columns unrelated to the analysis (bike id, birth year)
# and, where necessary, rows with missing values.
data.drop(['bikeid', 'birth year'], axis=1, inplace=True)
data.dropna(inplace=True)  # drop rows with missing values
data.shape
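The problem statement notes that starttime and stoptime are stored in a different format in the July 2013 to August 2014 files, so the merged column may hold two string formats. A hedged sketch of one way to parse both: the helper name is hypothetical, `m/d/YYYY HH:MM:SS` comes from the field table, and the second format (`YYYY-MM-DD HH:MM:SS`) is an assumption that should be checked against the actual files.

```python
import pandas as pd

def parse_mixed_datetimes(values: pd.Series) -> pd.Series:
    """Parse strings written as either m/d/YYYY HH:MM:SS or YYYY-MM-DD HH:MM:SS."""
    us_style = pd.to_datetime(values, format="%m/%d/%Y %H:%M:%S", errors="coerce")
    iso_style = pd.to_datetime(values, format="%Y-%m-%d %H:%M:%S", errors="coerce")
    # Rows that match neither format stay NaT and can be inspected or dropped.
    return us_style.fillna(iso_style)

raw = pd.Series(["2013-07-01 00:00:00", "9/1/2014 00:00:14", "not a date"])
parsed = parse_mixed_datetimes(raw)
print(parsed)
```

Parsing with explicit formats avoids silently swapping day and month, which automatic inference can do on ambiguous strings.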
3.2.2 Feature engineering
Create new features, including: difference between departure time and end time, and distance between departure and end stations (calculated by latitude and longitude), etc.
from math import radians, sin, cos, acos
from datetime import datetime
data['starttime'] = pd.to_datetime(data['starttime'])  # convert the time strings to datetime
data['stoptime'] = pd.to_datetime(data['stoptime'])
# Compute the time difference and the trip distance
data['duration'] = data['stoptime'] - data['starttime']
data['duration'] = data['duration'] / pd.Timedelta(seconds=1)  # convert the time difference to seconds
def get_distance(lat1, lng1, lat2, lng2):
    """
    Compute the distance between two points from their latitude and longitude, in meters.
    """
    ...  # body omitted in the original article
    return distance * 1000
data['distance'] = data.apply(
    lambda row: get_distance(row['start station latitude'], row['start station longitude'],
                             row['end station latitude'], row['end station longitude']), axis=1)
features = ['tripduration', 'start station latitude', 'start station longitude',
            'end station latitude', 'end station longitude', 'duration', 'distance']
clear_data = data[features]
# Note: an Excel sheet holds at most 1,048,576 rows, so this call only works within that limit
clear_data.to_excel('初赛数据集/特征工程后的数据.xlsx', index=False)
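The body of `get_distance` is omitted above. As a stand-in, here is a sketch of a great-circle distance using the haversine formula; this is a swapped-in technique and not necessarily what the author used (the `acos` import hints at the spherical law of cosines). The function name `haversine_m` and the mean Earth radius of 6371 km are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters.

    Inputs are in degrees and may be scalars, NumPy arrays, or pandas Series.
    """
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a)) * 1000

# One degree of longitude along the equator.
d = haversine_m(0.0, 0.0, 0.0, 1.0)
print(round(float(d), 1))
```

Because the function is vectorized, it can be called once on whole columns, for example `haversine_m(data['start station latitude'], data['start station longitude'], data['end station latitude'], data['end station longitude'])`, which is usually much faster than a row-wise `apply`.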
3.2.3 Cluster analysis
K-value analysis, using the elbow method
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# weight is the preprocessed feature matrix; its construction is not shown in this article
start = time.time()
trainingData = weight
SSE = []  # sum of squared errors for each K
k1 = 2
k2 = 10
for k in range(k1, k2):
    estimator = KMeans(n_clusters=k, max_iter=10000, init="k-means++", tol=1e-6)
    estimator.fit(trainingData)
    SSE.append(estimator.inertia_)  # inertia_ is the within-cluster sum of squared distances
end = time.time()
print(f'Elapsed: {end - start}s')
X = range(k1, k2)
plt.figure(figsize=(8, 6))
plt.xlabel('k', fontsize=20)
plt.ylabel('SSE', fontsize=20)
plt.plot(X, SSE, 'o-')
plt.savefig('img/手肘法.png', dpi=300)
plt.show()
3.2.4 K-means clustering
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
start = time.time()
trainingData = weight
clf = KMeans(n_clusters=4, max_iter=10000, init="k-means++", tol=1e-6)
result = clf.fit(trainingData)
source = list(clf.predict(trainingData))  # cluster label of each sample
end = time.time()
label = clf.labels_
print(f'Elapsed: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)
# Project the samples onto a 2-D plane with PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(weight)
# Plot each sample colored by its cluster label
plt.figure(figsize=(8, 6))
sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=source, palette='bright')
plt.savefig('img/kmeans.png', dpi=300)
plt.show()
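The silhouette coefficient needs pairwise distances, so its cost grows quadratically with the number of samples and can become impractical on a large trip table. One option is the `sample_size` parameter of `silhouette_score`, which scores a random subsample. The sketch below uses synthetic data, and the sample size of 2000 is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: 4 groups, 20,000 samples.
X, _ = make_blobs(n_samples=20000,
                  centers=[[0, 0], [8, 8], [-8, 8], [8, -8]],
                  cluster_std=1.0, random_state=0)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Score on a random subsample of 2000 points instead of all 20,000.
score = silhouette_score(X, labels, sample_size=2000, random_state=0)
print(f"sampled silhouette: {score:.3f}")
```

Fixing `random_state` keeps the sampled score reproducible, so different K values or algorithms are compared on the same subsample.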
3.2.5 AGG Hierarchical Clustering
import matplotlib.pyplot as plt
import seaborn as sns
start = time.time()
trainingData = weight
# Agglomerative (bottom-up hierarchical) clustering; Ward linkage always uses Euclidean distance
clf = AgglomerativeClustering(n_clusters=4, linkage='ward')
result = clf.fit(trainingData)
source = list(clf.labels_)
end = time.time()
label = clf.labels_
print(f'Elapsed: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)
# Project the samples onto a 2-D plane with PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(weight)
# AgglomerativeClustering has no predict(), so the fitted labels in source are reused for plotting
# Plot each sample colored by its cluster label
plt.figure(figsize=(8, 6))
sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=source, palette='bright')
plt.savefig('img/agg聚类.png', dpi=300)
plt.show()
3.2.6 DBSCAN clustering
import time
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
start = time.time()
trainingData = weight
clf = DBSCAN(eps=0.08, min_samples=7)
source = list(clf.fit_predict(trainingData))  # fit once; label -1 marks noise points
end = time.time()
label = clf.labels_
print(f'Elapsed: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)
# Project the samples onto a 2-D plane with PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(weight)
# DBSCAN has no predict(), so the labels from fit_predict are reused for plotting
# Plot each sample colored by its cluster label
plt.figure(figsize=(8, 6))
sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=source, palette='bright')
plt.savefig('img/dbscan.png', dpi=300)
plt.show()
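DBSCAN labels noise points as -1, and passing those labels straight to `silhouette_score` treats the noise as one more cluster. A hedged sketch of scoring only the clustered points, on synthetic data with one deliberately isolated point; the `eps` and `min_samples` values here are chosen for the toy data and are not tuned for the bike features.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight synthetic groups plus one far-away point that should become noise.
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.3, random_state=0)
X = np.vstack([X, [[50.0, 50.0]]])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

clustered = labels != -1                       # mask out noise points
n_clusters = len(set(labels[clustered]))
n_noise = int((~clustered).sum())

# The silhouette is only defined when at least two clusters remain.
if n_clusters >= 2:
    score = silhouette_score(X[clustered], labels[clustered])
else:
    score = float("nan")
print(n_clusters, n_noise, round(float(score), 3))
```

Reporting the noise count next to the score matters, since a high silhouette on a small clustered fraction says little about the data as a whole.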
3.2.7 Birch clustering
import matplotlib.pyplot as plt
import seaborn as sns
trainingData = weight
clf = Birch(n_clusters=5, branching_factor=10, threshold=0.01)
start = time.time()
result = clf.fit(trainingData)
source = list(clf.predict(trainingData))  # cluster label of each sample
end = time.time()
label = clf.labels_
print(f'Elapsed: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)
# Project the samples onto a 2-D plane with PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(weight)
# Plot each sample colored by its cluster label
plt.figure(figsize=(8, 6))
sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=source, palette='bright')
plt.savefig('img/Birch聚类.png', dpi=300)
plt.show()
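Since the task asks for a comparison of clustering methods, the per-algorithm snippets above could be folded into one loop that records the silhouette score and running time of each model. A minimal sketch on synthetic data; the model settings and the `results` layout are assumptions for illustration.

```python
import time

from sklearn.cluster import AgglomerativeClustering, Birch, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: 4 groups in 2 dimensions.
X, _ = make_blobs(n_samples=1500,
                  centers=[[0, 0], [8, 8], [-8, 8], [8, -8]],
                  cluster_std=1.0, random_state=0)

models = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=4, linkage="ward"),
    "Birch": Birch(n_clusters=4),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    labels = model.fit_predict(X)
    elapsed = time.perf_counter() - start
    results[name] = {"silhouette": silhouette_score(X, labels), "seconds": elapsed}

for name, row in results.items():
    print(f"{name:14s} silhouette={row['silhouette']:.3f} time={row['seconds']:.3f}s")
```

Scoring every model on the same matrix with the same metric keeps the comparison fair; the timings depend on the machine and are only indicative.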
4 Complete code download
See the link below, which includes the complete code for all questions
zhuanlan.zhihu.com/p/643865954