2023 2nd DingTalk Cup College Student Big Data Challenge, Preliminary Round B: Public Bicycle Usage Prediction and Analysis in New York, USA - Question 2 Python Code Analysis

Related Links

[2023 2nd DingTalk Cup College Student Big Data Challenge] Preliminary Round B: Public Bicycle Usage Prediction and Analysis in New York, USA - Python Code Analysis

[2023 2nd DingTalk Cup College Student Big Data Challenge] Preliminary Round B: Public Bicycle Usage Prediction and Analysis in New York, USA - Question 2 Python Code Analysis

[2023 2nd DingTalk Cup College Student Big Data Challenge] Preliminary Round B: Public Bicycle Usage Prediction and Analysis in New York, USA - Question 3 Time Series Forecasting Python Code Analysis

1 Problem Statement

Citi Bike is a bicycle-sharing program launched by New York City in 2013, sponsored by Citibank and named "Citi Bike". It operates about 8,000 bikes across 500 stations in Manhattan, Brooklyn, Queens, and Jersey City, providing New York residents and tourists with a convenient, fast, and affordable way to get around. Riders can borrow a bike at any station and return it at a station near their destination. The data in this case has two parts. The first part is the transaction flow of public bicycle borrow-and-return records in New York City. Unlike dockless shared bicycles, Citi Bike bikes cannot be borrowed and returned anywhere by scanning a QR code with a mobile phone; they must be borrowed and returned at fixed docking stations. The data set covers 38 months (1,158 days) from July 1, 2013 to August 31, 2016, with one file per month. The files from July 2013 to August 2014 use a different format from the other months, specifically in how the starttime and stoptime variables are stored.
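
Because the July 2013 to August 2014 files store starttime and stoptime differently from the later files, the timestamps have to be normalized when the monthly files are merged. A minimal sketch, assuming the two candidate format strings below (they should be verified against the raw CSVs):

import pandas as pd

def parse_mixed_datetime(series: pd.Series) -> pd.Series:
    """Normalize a timestamp column whose storage format differs across monthly files.
    The two format strings are assumptions to be checked against the raw data."""
    for fmt in ("%m/%d/%Y %H:%M:%S", "%Y-%m-%d %H:%M:%S"):
        try:
            return pd.to_datetime(series, format=fmt)
        except ValueError:
            continue
    # fall back to pandas' slower element-wise inference
    return pd.to_datetime(series, errors="coerce")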

The second part is the weather data for New York City over the same period, stored in the weather_data_NYC.csv file, which contains hourly weather data from 2010 to 2016.

Public bicycle data field table

| No. | Variable name | Meaning | Value and description |
| --- | --- | --- | --- |
| 1 | trip duration | trip duration | riding time; numeric, in seconds |
| 2 | start time | borrow time | string, M/D/YYYY HH:MM:SS |
| 3 | stop time | return time | string, M/D/YYYY HH:MM:SS |
| 4 | start station id | borrowing station ID | qualitative; unique station number |
| 5 | start station name | borrowing station name | string |
| 6 | start station latitude | borrowing station latitude | numeric |
| 7 | start station longitude | borrowing station longitude | numeric |
| 8 | end station id | return station ID | qualitative; unique station number |
| 9 | end station name | return station name | string |
| 10 | end station latitude | return station latitude | numeric |
| 11 | end station longitude | return station longitude | numeric |
| 12 | bike id | bicycle number | qualitative; unique bicycle number |
| 13 | user type | user type | Subscriber: annual member; Customer: 24-hour or 7-day temporary user |
| 14 | birth year | year of birth | the only column with missing values |
| 15 | gender | gender | 0: unknown; 1: male; 2: female |

Weather data field table

| No. | Variable name | Meaning | Value and description |
| --- | --- | --- | --- |
| 1 | date | date | string |
| 2 | time | time | EDT (Eastern Daylight Time), US eastern daylight saving time |
| 3 | temperature | air temperature | in ℃ |
| 4 | dew_point | dew point | in ℃ |
| 5 | humidity | humidity | percentage |
| 6 | pressure | sea-level pressure | in hPa |
| 7 | visibility | visibility | in km |
| 8 | wind_direction | wind direction | discrete; categories include west, calm, etc. |
| 9 | wind_speed | wind speed | in km/h |
| 10 | moment_wind_speed | instantaneous wind speed | in km/h |
| 11 | precipitation | precipitation | in mm; has missing values |
| 12 | activity | activity | discrete; categories include snow, etc. |
| 13 | conditions | conditions | discrete; categories include overcast, light snow, etc. |
| 14 | WindDirDegrees | wind direction in degrees | continuous; 0~359 |
| 15 | DateUTC | GMT time | string, YYYY/M/D HH:MM |

Questions to solve

  1. Network analysis of bike borrowing and returning:

Build a network graph of the bicycles borrowed and returned at each station within one day. The network is a directed graph, with arrows pointing from the borrowing station to the returning station (many pairs of stations have borrow and return records in both directions, so most station pairs are connected bidirectionally).

(1) Taking August 3, 2014 as an example, perform a network analysis: build the bicycle borrow-and-return network graph and compute its number of nodes, number of edges, and network density (the proportion of actual edges among all possible connections). Give the calculation process and the resulting plot.

(2) Using the network built above, analyze the local region with latitude between 40.695 and 40.72 and longitude between -74.023 and -73.973, and compute the average shortest path length (the arithmetic mean of the shortest path lengths between all pairs of nodes) and the network diameter (the maximum shortest path length in the network).

  2. Cluster analysis:

Perform cluster analysis on the bicycle data from July 1, 2013 to August 31, 2015. Choose an appropriate number of clusters K, apply at least two clustering algorithms, compare the different methods, and analyze the clustering results.

  3. Forecasting station borrowing volume:

Forecast the daily borrowing volume of public bicycles at every station. Use the data from July 2013 to July 2015 as the training set and the data from August 1-31, 2015 as the test set, and predict the daily number of bikes borrowed during August 1-31, 2015. Report the MAPE of the predictions for each station and the number of model parameters, and finally compute the mean MAPE over all stations (note: the test set must not participate in training or validation; otherwise it is treated as a violation).
$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$
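
For reference, a direct Python translation of this metric (a minimal sketch; it assumes y_true contains no zeros):

import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100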

data.csv contains the transaction flow records of New York City public bicycles, in the format shown in the public bicycle field table above. Use Python to perform cluster analysis after data preprocessing and feature engineering.


2 Problem Analysis

2.1 Question 1

  1. Draw the directed graph

a. Read in the data, extract the "start station id" and "end station id" columns, and build the bicycle borrow-and-return network graph.

b. For the network built in the first step, compute the number of nodes, the number of edges, and the network density. The number of nodes is the number of stations, the number of edges is the number of borrow-and-return pairs, and the network density is the proportion of actual edges among all possible connections.

c. Draw the bicycle borrow-and-return network graph.

d. Compute the average shortest path length and the network diameter.

First select the borrowing and returning stations that meet the condition (latitude between 40.695 and 40.72, longitude between -74.023 and -73.973) and build a subgraph on these nodes for analysis. The average shortest path length and the network diameter can then be computed directly with functions from the networkx library, as sketched below.
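
The full Question 1 code is in the linked article; the sketch below only illustrates the networkx calls involved. The column names follow the field table above, and the connected-component workaround is an assumption, since shortest-path metrics require a strongly connected directed graph:

import datetime
import pandas as pd
import networkx as nx

# One day of records (assumes `data` is loaded and starttime is already datetime)
day = data[data['starttime'].dt.date == datetime.date(2014, 8, 3)]

# Directed borrow -> return graph
G = nx.from_pandas_edgelist(day, source='start station id',
                            target='end station id', create_using=nx.DiGraph())
print(G.number_of_nodes(), G.number_of_edges(), nx.density(G))

# Stations inside the latitude/longitude box from the problem statement
stations = day[['start station id', 'start station latitude',
                'start station longitude']].drop_duplicates()
in_box = stations[stations['start station latitude'].between(40.695, 40.72) &
                  stations['start station longitude'].between(-74.023, -73.973)]
H = G.subgraph(in_box['start station id'])

# average_shortest_path_length and diameter need a strongly connected digraph,
# so restrict to the largest strongly connected component
core = H.subgraph(max(nx.strongly_connected_components(H), key=len))
print(nx.average_shortest_path_length(core), nx.diameter(core))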

2.2 Question 2

  1. Data preprocessing: clean the data and extract features. PCA or LDA can be used for dimensionality reduction to reduce computational complexity.

  2. Clustering algorithms:
    a. K-means: run multiple experiments with different K values and select the best clustering result. Evaluation indicators such as the silhouette coefficient and the Calinski-Harabasz index can be used for comparison and selection (see the sketch after this list).
    b. DBSCAN: clusters data points by density, without specifying the number of clusters in advance. Different clustering results can be obtained by adjusting the radius parameter (eps) and the density parameter (min_samples).
    c. Hierarchical clustering: can proceed top-down or bottom-up. By iteratively computing the similarity between data points and gradually merging them, the clustering result is obtained.

    d. Improved clustering algorithms

    e. Deep clustering algorithms

  3. Clustering result analysis: after selecting the best clustering result, build profiles of the different categories of cycling users and analyze the behavioral characteristics of each category.
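
For step 2a, the two evaluation indicators can be computed with scikit-learn. A minimal sketch, where X stands for the preprocessed feature matrix:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def evaluate_k_range(X, k_min=2, k_max=10):
    """Fit K-means for each candidate K and print both evaluation indicators."""
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels), calinski_harabasz_score(X, labels))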

2.3 Question 3

  1. Import the data, perform preprocessing, and aggregate the rental records station by station.
  2. Perform time series analysis on the data and use an ARIMA model to predict the daily number of bikes borrowed (a minimal sketch follows this list).
  3. Evaluate the model with time series cross-validation and compute the MAPE of the forecasts for each station.
  4. Compute the mean MAPE over all stations and report the number of model parameters.
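
A rough sketch of this per-station workflow with statsmodels (the ARIMA order and the resampling step are assumptions, not the article's final model):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def forecast_station(daily_counts: pd.Series, horizon: int = 31, order=(1, 1, 1)):
    """Fit an ARIMA model on one station's daily borrow counts and
    forecast `horizon` days ahead."""
    model = ARIMA(daily_counts, order=order).fit()
    return model.forecast(steps=horizon)

# Per-station daily borrow counts (column names follow the field table above):
# daily = (data.set_index('starttime')
#              .groupby('start station id')
#              .resample('D').size())
# pred = forecast_station(daily.loc[some_station_id])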

3 Python code implementation

3.1 Question 1

For the Question 1 code, see the related article: [2023 2nd DingTalk Cup College Student Big Data Challenge] Preliminary Round B: Public Bicycle Usage Prediction and Analysis in New York, USA - Python Code Analysis

3.2 Question 2

3.2.1 Read data

Import the required packages:

import os
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, Birch, AgglomerativeClustering, MeanShift
from sklearn.decomposition import PCA
from sklearn import metrics
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
tqdm.pandas()

# Merge the monthly data files
folder_path = '初赛数据集/2013_2015'
dfs = []
for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):
        csv_path = os.path.join(folder_path, filename)
        tempdf = pd.read_csv(csv_path)
        dfs.append(tempdf)
data = pd.concat(dfs, axis=0)

# Following the field table, drop columns irrelevant to this study
# (bicycle number, birth year), then drop rows with missing values.
data.drop(['bikeid', 'birth year'], axis=1, inplace=True)
data.dropna(inplace=True)    # drop rows with missing values
data.shape

3.2.2 Feature Engineering

Create new features, including the time difference between departure and end times (duration) and the distance between the departure and end stations (computed from latitude and longitude).


from math import radians, sin, cos, acos
from datetime import datetime

data['starttime'] = pd.to_datetime(data['starttime'])  # convert to datetime
data['stoptime'] = pd.to_datetime(data['stoptime'])
# Compute the time difference and the trip distance
data['duration'] = data['stoptime'] - data['starttime']
data['duration'] = data['duration'] / pd.Timedelta(seconds=1)  # convert to seconds


def get_distance(lat1, lng1, lat2, lng2):
    """
    Great-circle distance between two points given their latitude/longitude, in meters.
    The body was omitted in the original post; it is reconstructed here with the
    spherical law of cosines, which matches the radians/sin/cos/acos imports above.
    """
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    # Earth radius of ~6371 km gives the distance in kilometers;
    # min() guards acos against rounding just above 1.0 for identical points
    distance = 6371.0 * acos(
        min(1.0, sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lng1 - lng2)))
    return distance * 1000

data['distance'] = data.apply(
    lambda row: get_distance(row['start station latitude'], row['start station longitude'],
                             row['end station latitude'], row['end station longitude']), axis=1)

features = ['tripduration', 'start station latitude', 'start station longitude', 
            'end station latitude', 'end station longitude', 'duration', 'distance']

clear_data = data[features]
clear_data.to_excel('初赛数据集/特征工程后的数据.xlsx', index=False)
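
All the clustering blocks below operate on a matrix named weight, which this excerpt never constructs. A minimal sketch, assuming weight is simply the standardized feature matrix from the step above (the complete downloadable code may build it differently):

from sklearn.preprocessing import StandardScaler

# Assumption: `weight` is the z-score standardized feature matrix
weight = StandardScaler().fit_transform(clear_data)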

3.2.3 Cluster analysis

Analyze the K value using the elbow method:



start = time.time()
trainingData = weight
SSE = []  # sum of squared errors for each K
k1 = 2
k2 = 10
for k in range(k1, k2):
    estimator = KMeans(n_clusters=k, max_iter=10000, init="k-means++", tol=1e-6)
    estimator.fit(trainingData)
    SSE.append(estimator.inertia_)  # estimator.inertia_ is the within-cluster sum of squares
end = time.time()
print(f'Elapsed time: {end - start}s')
X = range(k1, k2)
plt.figure(figsize=(8, 6))
plt.xlabel('k', fontsize=20)
plt.ylabel('SSE', fontsize=20)
plt.plot(X, SSE, 'o-')
plt.savefig('img/手肘法.png', dpi=300)
plt.show()

3.2.4 K-means clustering

from sklearn.cluster import KMeans

start = time.time()
trainingData = weight
clf = KMeans(n_clusters=4, max_iter=10000, init="k-means++", tol=1e-6)
result = clf.fit(trainingData)
source = list(clf.predict(trainingData))
end = time.time()
label = clf.labels_
print(f'Elapsed time: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)

import matplotlib.pyplot as plt
import seaborn as sns

# Project the samples onto a 2D plane with PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(weight)
source = list(clf.predict(trainingData))
# Plot each sample colored by its cluster label
plt.figure(figsize=(8, 6))
sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=source, palette='bright')
plt.savefig('img/kmeans.png', dpi=300)
plt.show()

[Figure: K-means clustering result projected to 2D with PCA]

3.2.5 Agglomerative hierarchical clustering

start = time.time()
trainingData = weight
# Agglomerative (hierarchical) clustering;
# note: scikit-learn 1.2+ renamed `affinity` to `metric`
clf = AgglomerativeClustering(n_clusters=4, linkage='ward', affinity='euclidean')
result = clf.fit(trainingData)
source = list(clf.labels_)
end = time.time()
label = clf.labels_
print(f'Elapsed time: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)

import matplotlib.pyplot as plt
import seaborn as sns

# Project the samples onto a 2D plane with PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(weight)
source = list(clf.labels_)  # AgglomerativeClustering has no predict(); use labels_
# Plot each sample colored by its cluster label
plt.figure(figsize=(8, 6))
sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=source, palette='bright')
plt.savefig('img/agg聚类.png', dpi=300)
plt.show()

3.2.6 DBSCAN clustering

from sklearn.cluster import DBSCAN

start = time.time()
trainingData = weight
clf = DBSCAN(eps=0.08, min_samples=7)
result = clf.fit(trainingData)
source = list(clf.labels_)  # DBSCAN has no predict(); labels_ holds the assignment (-1 = noise)
end = time.time()
label = clf.labels_

print(f'Elapsed time: {end - start}s')
# note: noise points (label -1) are included in the silhouette computation
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)

import matplotlib.pyplot as plt
import seaborn as sns

# Project the samples onto a 2D plane with PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(weight)
source = list(clf.labels_)  # reuse the fitted labels; DBSCAN has no predict()
# Plot each sample colored by its cluster label
plt.figure(figsize=(8, 6))
sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=source, palette='bright')
plt.savefig('img/dbscan.png', dpi=300)
plt.show()

[Figure: DBSCAN clustering result projected to 2D with PCA]
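
The parameters eps=0.08 and min_samples=7 above are given without derivation; a common heuristic for choosing eps is the k-distance plot, sketched below (not from the original article):

from sklearn.neighbors import NearestNeighbors
import numpy as np
import matplotlib.pyplot as plt

# Sorted distance of every point to its k-th nearest neighbor;
# the "elbow" of this curve is a common heuristic for eps (here k = min_samples)
k = 7
dist, _ = NearestNeighbors(n_neighbors=k).fit(weight).kneighbors(weight)
plt.plot(np.sort(dist[:, -1]))
plt.xlabel('points sorted by distance')
plt.ylabel(f'distance to {k}th nearest neighbor')
plt.show()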

3.2.7 Birch clustering

trainingData = weight
clf = Birch(n_clusters=5, branching_factor=10, threshold=0.01)
start = time.time()
result = clf.fit(trainingData)
source = list(clf.predict(trainingData))
end = time.time()
label = clf.labels_
print(f'Elapsed time: {end - start}s')
silhouette = metrics.silhouette_score(trainingData, label)
print("silhouette: ", silhouette)
import matplotlib.pyplot as plt
import seaborn as sns

# Project the samples onto a 2D plane with PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(weight)
source = list(clf.predict(trainingData))
# Plot each sample colored by its cluster label
plt.figure(figsize=(8, 6))
sns.scatterplot(x=reduced_data[:, 0], y=reduced_data[:, 1], hue=source, palette='bright')
plt.savefig('img/Birch聚类.png', dpi=300)
plt.show()

4 Complete code download

See the link below; it contains the complete code for all questions.

zhuanlan.zhihu.com/p/643865954
