Analysis and prediction of shared bicycle riding data based on data mining


1. Project background

Bike-sharing systems are gaining popularity in big cities, allowing people to enjoy cycling around the city without having to buy bikes for themselves by offering affordable bike rentals. This project utilizes historical data provided by Nice Ride MN in the Twin Cities (Minneapolis/St. Paul, MN). We will explore bike-sharing ride data by looking at bike demand at different sites, bike traffic at each site, the impact of seasonality and weather on riding patterns, and differences in riding patterns between members and non-members.

2. Functional composition

The main functions of the shared bicycle riding data analysis and prediction system based on data mining include:

3. Data reading and preprocessing

import numpy as np
import pandas as pd
stations = pd.read_csv('data/Nice_Ride_2017_Station_Locations.csv')
trips = pd.read_csv('data/Nice_ride_trip_history_2017_season.csv')

Stations and trips basic information:

stations.info()
trips.info()

        The stations and trips datasets look very clean: there are no missing values, and the latitude, longitude, and number of docks are all within the expected ranges.
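        A quick way to back up a "no missing values" claim is to count nulls per column and look at summary statistics. The sketch below runs on a small made-up table (toy_stations and its values are hypothetical); with the real data the same calls would be applied to stations and trips.

```python
import pandas as pd

# Toy stand-in for the stations table (values are made up for illustration)
toy_stations = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Latitude': [44.97, 44.98, 44.95],
    'Longitude': [-93.26, -93.23, -93.09],
    'Total docks': [15, 19, 23],
})

# Count missing values per column; all zeros means nothing is missing
missing_per_column = toy_stations.isnull().sum()
print(missing_per_column)

# Summary statistics help spot out-of-range coordinates or dock counts
print(toy_stations[['Latitude', 'Longitude', 'Total docks']].describe())
```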

        Convert the data format of the time field:

# Convert start and end times to datetime
for col in ['End date', 'Start date']:
    trips[col] = pd.to_datetime(trips[col],
                                format='%m/%d/%Y %H:%M')

4. Data exploratory visual analysis

4.1 Analysis of Station Locations

# On hover, show Station name
tooltips = [("Station", stations['Name'])]

# Plot the stations
p, _ = MapPoints(stations.Latitude, stations.Longitude, 
                 title="Nice Ride Station Locations",
                 tooltips=tooltips,
                 height=fig_height, width=fig_width)

show(p)

        As the map shows, most of the stations are scattered around Minneapolis, but there is also a cluster in downtown St. Paul, as well as several stations along University Avenue and Grand Avenue, which connect Minneapolis to St. Paul.

4.2 Number of docks per station

import matplotlib.pyplot as plt

# Plot histogram of # docks at each station
plt.figure(figsize=(16, 5))
plt.hist(stations['Total docks'])
plt.ylabel('Number of Stations')
plt.xlabel('Number of Docks')
plt.title('Number of Docks Distribution')
plt.show()

        Most stations have around 15 docks.

# On hover, show the station name and number of docks
tooltips = [("Station", stations['Name']), 
            ("Docks", stations['Total docks'])]

# Plot the stations
p, _ = MapPoints(stations.Latitude, stations.Longitude, 
                 tooltips=tooltips, color=stations['Total docks'],
                 size=4*np.sqrt(stations['Total docks']/np.pi),
                 title="Number of Docks at each Station",
                 height=fig_height, width=fig_width)

show(p)

4.3 Station Demand Analysis

demand_df = pd.DataFrame({'Outbound trips': trips.groupby('Start station').size(),
                          'Inbound trips': trips.groupby('End station').size()
                      })
demand_df['Name'] = demand_df.index
sdf = stations.merge(demand_df, on='Name')

plt.figure(figsize=(16, 5))
plt.hist(sdf['Outbound trips'], bins=20)
plt.ylabel('Number of Stations')
plt.xlabel('Number of outbound rentals')
plt.title('Outbound trip distribution')
plt.show()

# Plot num trips ending at each station
plt.figure(figsize=(16, 5))
plt.hist(sdf['Inbound trips'], bins=20)
plt.ylabel('Number of Stations')
plt.xlabel('Number of inbound rentals')
plt.title('Inbound trip distribution')
plt.show()

        Nice Ride MN has to redistribute bicycles from stations with excess bicycles to stations without enough. Stations where more rides end than start will accumulate extra bikes, and Nice Ride MN will have to move those extra bikes to emptier stations. Let's look at which stations have more inbound trips than outbound trips.

        At most stations, the number of inbound trips is similar to the number of outbound trips. However, a few stations are clearly unbalanced: some have more inbound rides than outbound rides, and vice versa.
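        The imbalance can be computed by subtracting outbound from inbound counts per station. The following is a minimal sketch on a made-up trips table (the station names A, B, C are hypothetical); the sign convention, inbound minus outbound, is an assumption chosen so that negative values mean more outgoing trips.

```python
import pandas as pd

# Toy trips table (hypothetical station names)
toy_trips = pd.DataFrame({
    'Start station': ['A', 'A', 'A', 'B', 'C'],
    'End station':   ['B', 'B', 'C', 'A', 'B'],
})

# Trips leaving and arriving at each station
outbound = toy_trips.groupby('Start station').size()
inbound = toy_trips.groupby('End station').size()

# Align on station name; a station missing from one side counts as 0
toy_demand = pd.DataFrame({'Outbound trips': outbound,
                           'Inbound trips': inbound}).fillna(0)

# Assumed convention: positive = more inbound, negative = more outbound
toy_demand['demand_diff'] = (toy_demand['Inbound trips']
                             - toy_demand['Outbound trips'])
print(toy_demand.sort_values('demand_diff'))
```

        In this toy table, station A loses bikes over time, station B gains them, and station C stays balanced.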

        We can plot these distributions on a map to see which stations are unbalanced. We'll use Bokeh to plot the outbound trip count, inbound trip count, and demand difference in three separate tabs. The color and area of each circle indicate the value (number of outbound trips, number of inbound trips, or difference) in the corresponding tab. For the difference tab, the size of each circle indicates the absolute difference (so we can see which stations are most unbalanced), and the color shows the direction of the imbalance. Click each tab at the top of the plot to switch between outbound trips, inbound trips, and the difference between the two.

        At some stations, far more people end their trip than start one (for example, the Lake St & Knox Ave station at the northeast corner of Bde Maka Ska, or the Minnehaha Park station). At others, far more people start a trip than end one (for example, the Coffman Union and Willey Hall stations on the University of Minnesota campus). But most stations have about the same number of inbound and outbound trips.

        It can also be seen that more rides leave downtown Minneapolis and the University of Minnesota campus than arrive there. Downtown Minneapolis is clustered with many large blue circles (stations with more outbound trips), while most of the large red circles (stations with more inbound trips) are farther from downtown and tend to be destinations and parks (such as Minnehaha Park, Bde Maka Ska, Logan Park, and North Mississippi Regional Park).

4.4 Demand Difference Analysis

        Ideally, Nice Ride would have more docks at stations where there is a big difference between inbound and outbound trips. If more rides start at a station than end there, the number of bikes at that station decreases over time, so the station needs enough docks to hold enough bikes that it isn't empty by the end of the day. Conversely, if more rides end at a station than start there, all of its docks will fill up and people won't be able to end their rides there. These stations therefore need enough docks to absorb the traffic during the day.

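# demand_diff is not defined in the code shown; inbound minus outbound is
# assumed here because it matches the 'More Outgoing'/'More Incoming' labels below
sdf['demand_diff'] = sdf['Inbound trips'] - sdf['Outbound trips']
sdf['abs_diff'] = sdf['demand_diff'].abs()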

sdf['Docks'] = sdf['Total docks']/sdf['Total docks'].sum()
sdf['DemandDiff'] = sdf['abs_diff']/sdf['abs_diff'].sum()

sdf['demand_dir'] = sdf['Name']
sdf.loc[sdf['demand_diff']<0, 'demand_dir'] = 'More Outgoing'
sdf.loc[sdf['demand_diff']>0, 'demand_dir'] = 'More Incoming'
sdf.loc[sdf['demand_diff']==0, 'demand_dir'] = 'Balanced'

tidied = (
    sdf[['Name', 'Docks', 'DemandDiff']]
       .set_index('Name')
       .stack()
       .reset_index()
       .rename(columns={'level_1': 'Distribution', 0: 'Proportion'})
)

import seaborn as sns

plt.figure(figsize=(4.5, 35))
station_list = sdf.sort_values('DemandDiff', ascending=False)['Name'].tolist()
sns.barplot(y='Name', x='Proportion', hue='Distribution', 
            data=tidied, order=station_list)
plt.title('Proportion Docks vs Demand Difference')
locs, labels = plt.yticks()
plt.yticks(locs, tuple([s[:15] for s in station_list]))
plt.show()

        The number of docks at a station does not match its overall demand difference very well. However, demand differences may change over time. For example, some stations may have more outbound trips in the morning and more inbound trips in the evening, or vice versa.

4.5 Differences in demand over time

        The balance between inbound and outbound rides at a station isn't static; it changes over time. Around 8am, more people end their rides at the station, probably commuting to work. Around 5pm, more people start rides from that station, probably for their commute home.

4.6 Cumulative demand difference analysis

# Compute the cumulative demand difference
cdiff = trips_hp['Difference'].apply(np.cumsum, axis=1)

        The cumulative demand difference makes it clear that many bicycles are taken from this station in the morning and brought back in the evening. So, if Nice Ride doesn't redistribute bikes to this station, there will be far fewer bikes here between 9am and 4pm than at night.
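        To make the hourly and cumulative differences concrete, here is a small sketch for a single station. The trips, the station names X and Y, and the variable names are all made up for illustration; it is not the code that produced trips_hp above, only one way such hourly counts could be built.

```python
import numpy as np
import pandas as pd

# Toy trips for a hypothetical station 'X': bikes leave in the morning
# and come back in the evening
toy = pd.DataFrame({
    'Start station': ['X', 'X', 'X', 'Y', 'Y'],
    'End station':   ['Y', 'Y', 'Y', 'X', 'X'],
    'Start date': pd.to_datetime(['2017-06-01 08:05', '2017-06-01 08:40',
                                  '2017-06-01 09:10', '2017-06-01 17:15',
                                  '2017-06-01 17:30']),
    'End date': pd.to_datetime(['2017-06-01 08:20', '2017-06-01 08:55',
                                '2017-06-01 09:30', '2017-06-01 17:35',
                                '2017-06-01 17:50']),
})

hours = np.arange(24)
station = 'X'

# Trips leaving the station, counted by the hour they started
out_by_hour = (toy.loc[toy['Start station'] == station, 'Start date']
               .dt.hour.value_counts().reindex(hours, fill_value=0))
# Trips arriving at the station, counted by the hour they ended
in_by_hour = (toy.loc[toy['End station'] == station, 'End date']
              .dt.hour.value_counts().reindex(hours, fill_value=0))

# Net change in bikes per hour, and the running total over the day
hourly_diff = in_by_hour - out_by_hour
cumulative_diff = hourly_diff.cumsum()
print(cumulative_diff.loc[[8, 9, 16, 17]])
```

        The running total is the number of bikes the station has gained or lost since midnight, so its minimum shows how many extra bikes the station would need to start the day with to avoid running empty.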

4.7 Analysis of bicycle flow between stations

        So far we have looked at station locations, the number of docks at each station, the demand at each station, and how demand changes over time. But how do bicycles flow from one station to another? That is, what does the distribution of trips look like? What are the most and least common destinations from each station?

# Count the number of trips from each station to every other station
flow = (
    trips.groupby(['Start station', 'End station'])['Start date']
    .count().to_frame().reset_index()
    .rename(columns={"Start date": "Trips"})
    .pivot(index='Start station', columns='End station')
    .fillna(value=0)
)
# Plot trips to and from each station
sns.set_style("dark")
plt.figure(figsize=(10, 8))
plt.imshow(np.log10(flow.values+0.1),
           aspect='auto',
           interpolation="nearest")
plt.set_cmap('plasma')
cbar = plt.colorbar(ticks=[-1,0,1,2,3])
cbar.set_label('Number of trips')
cbar.ax.set_yticklabels(['0','1','10','100','1000'])
plt.ylabel('Station Number FROM')
plt.xlabel('Station Number TO')
plt.title('Number of trips to and from each station')
plt.show()

# Count trips that end at the same station they started from
sns.set()
plt.figure()
plt.bar([0, 1], 
        [np.trace(flow.values), 
         flow.values.sum()-np.trace(flow.values)],
        tick_label=['Same as start', 'Other'])
plt.xlabel('End station')
plt.ylabel('Number of trips')
plt.title('Number of trips which end\n'+
          'at same station they started from')
plt.show()

        It can be seen that most trips do not actually return to the station from which they originate.

4.8 Analysis of Cycling Duration

# Trips longer than 24 hours are probably anomalies
Ntd = np.count_nonzero(trips['Total duration (Seconds)']>(24*60*60))
print("Number of trips longer than 24 hours: %d ( %0.2g %% )"
      % (Ntd, 100*Ntd/float(len(trips))))

# Trips shorter than 1 hour
Ntd = np.count_nonzero(trips['Total duration (Seconds)']<(60*60))
print("Number of trips shorter than 1 hour: %d ( %0.2g %% )"
      % (Ntd, 100*Ntd/float(len(trips))))

# Plot histogram of ride durations
plt.figure()
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(trips.loc[trips['Total duration (Seconds)']<(4*60*60),
                       'Total duration (Seconds)']/3600, kde=True)
plt.xlabel('Ride duration (hrs)')
plt.ylabel('Number of Trips')
plt.title('Ride durations')
plt.show()

4.9 Distribution of Rides by Month

        July is the most popular month for cycling. There are still plenty of rides in the shoulder months, but April and October see only about half as many rides.
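        Monthly ride counts come from grouping the start times by calendar month. The sketch below uses a handful of made-up timestamps (toy_starts is hypothetical); on the real data the same grouping would be applied to trips['Start date'].

```python
import pandas as pd

# Toy start times (made up for illustration)
toy_starts = pd.Series(pd.to_datetime([
    '2017-04-15 10:00', '2017-07-01 09:00', '2017-07-04 12:00',
    '2017-07-20 18:00', '2017-10-10 08:00', '2017-10-11 08:30',
]))

# Number of rides in each calendar month
rides_per_month = toy_starts.groupby(toy_starts.dt.month).size()
print(rides_per_month)

# Month number with the most rides
busiest_month = rides_per_month.idxmax()
print("Busiest month:", busiest_month)
```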

 4.9 Distribution of Rides by Day in a Year

trips.groupby(trips['Start date'].dt.dayofyear)['Start date'].count().plot()
plt.xlabel('Day of the year')
plt.ylabel('Number of rentals')
plt.title('Number of rentals by day of the year in 2017')
holidays = [("Mother's day", 134),
            ("Memorial day", 149),
            ("4th of July", 185),
            ("Labor day", 247), 
            ("Oct 27", 300)]
for name, day in holidays:
    plt.plot([day,day], [0,6000], 
             'k--', linewidth=0.2)
    plt.text(day, 6000, name, fontsize=8, 
             rotation=90, ha='right', va='top')
plt.show()

 4.10 Weekly ride distribution

plt.figure()
sns.countplot(x=trips['Start date'].dt.weekday)
plt.xlabel('Day')
plt.ylabel('Number of Trips')
plt.title('Number of rentals by day of the week in 2017')
plt.xticks(np.arange(7),
           ['M', 'T', 'W', 'Th', 'F', 'Sa', 'Su'])
plt.show()

4.11 Effect of Weather on Cycling

weather = pd.read_csv('data/WeatherDailyMinneapolis2017.csv')
weather['DATE'] = pd.to_datetime(weather['DATE'],
                                 format='%Y-%m-%d')
# Plot daily min + max temperature
plt.plot(weather.DATE.dt.dayofyear,
         weather.TMAX, 'C2')
plt.plot(weather.DATE.dt.dayofyear,
         weather.TMIN, 'C1')
plt.legend(['Max', 'Min'])
plt.xlabel('Day of year')
plt.ylabel('Temperature (degrees F)')
plt.title('Daily temperatures in Minneapolis for 2017')
plt.show()

        Distribution of temperature and number of bicycle rides:

        Distribution of precipitation and number of rides:

         Distribution of precipitation and riding time:

        There may be some negative correlation between precipitation and ride duration. This is even more apparent when comparing ride times with and without rain.
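        One simple way to check such a relationship is to compare mean ride duration on wet and dry days and to compute a correlation coefficient. The sketch below uses made-up daily values; the mean_duration_min column and the numbers are hypothetical, and PRCP is assumed to be the daily precipitation column from the weather file.

```python
import numpy as np
import pandas as pd

# Toy daily table: precipitation and mean ride duration in minutes
# (all values are made up for illustration)
toy_daily = pd.DataFrame({
    'PRCP': [0.0, 0.0, 0.3, 0.0, 1.2, 0.05],
    'mean_duration_min': [32.0, 30.0, 22.0, 34.0, 18.0, 26.0],
})

# Label each day as wet or dry
toy_daily['rain'] = np.where(toy_daily['PRCP'] > 0, 'rain', 'dry')

# Mean ride duration on dry days vs rainy days
by_rain = toy_daily.groupby('rain')['mean_duration_min'].mean()
print(by_rain)

# Pearson correlation between precipitation and ride duration
corr = toy_daily['PRCP'].corr(toy_daily['mean_duration_min'])
print("Correlation: %0.2f" % corr)
```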

4.12 Member User Analysis

        Analysis of the difference in the number of rides per day between members and non-members: 

        Analysis of the difference in the number of rides per hour between members and non-members: 

        Analysis of the difference in cycling time between members and non-members:
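        These comparisons all come down to grouping trips by account type. The following is a sketch on made-up trips; the 'Account type' column name and its 'Member' and 'Casual' values are assumptions about the trip file, so adjust them to whatever the data actually contains.

```python
import pandas as pd

# Toy trips (made up); 'Account type' and its values are assumed names
toy_trips = pd.DataFrame({
    'Account type': ['Member', 'Member', 'Casual', 'Casual', 'Member'],
    'Start date': pd.to_datetime(['2017-06-05 08:00', '2017-06-05 17:10',
                                  '2017-06-10 14:00', '2017-06-10 15:30',
                                  '2017-06-06 08:15']),
    'Total duration (Seconds)': [600, 720, 3600, 2700, 540],
})

# Rides per hour of day, one column per account type
hour = toy_trips['Start date'].dt.hour.rename('Hour')
rides_by_hour = (toy_trips.groupby([hour, 'Account type'])
                 .size()
                 .unstack(fill_value=0))
print(rides_by_hour)

# Median ride duration in minutes for each account type
median_minutes = (toy_trips.groupby('Account type')
                  ['Total duration (Seconds)'].median() / 60)
print(median_minutes)
```

        Replacing dt.hour with dt.dayofyear gives the per-day comparison in the same way.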

 5. Predict daily ride data based on weather data and historical ride data

        Build a model to predict the number of rides per day from season, weather, and other factors, using a second-order polynomial to model the seasonal effect:

        However, season and temperature are strongly related. If we fit both in the same model, some of the apparent effect of temperature on the number of rides could actually be due to season. So, let's first fit a model that includes every predictor we care about except season, and then model the residuals of that fit as a function of season. In effect, we "remove" the effect of weather on the number of rides and then study how what remains changes with the seasons.

        First, let's fit an ordinary least squares regression model to predict the number of rides per day from the day of the week, the daily maximum temperature, and the daily precipitation.

5.1 OLS regression analysis

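import statsmodels.formula.api as smf

# weather.trips is assumed to hold daily trip counts merged in earlier (code not shown)
df = pd.DataFrame()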
df['Trips'] = weather.trips
df['Date'] = weather.DATE
df['Day'] = weather.DATE.dt.dayofweek
df['Temp'] = weather.TMAX
df['Precip'] = weather.PRCP + 0.001

# Only fit model on days with trips
df = df.loc[~np.isnan(df.Trips), :]
df.reset_index(inplace=True, drop=True)

# Fit the linear regression model
olsfit = smf.ols('Trips ~ Temp + np.log(Precip) + C(Day)', data=df).fit()

# Show a summary of the fit
print(olsfit.summary())

        Our model does capture the effect of weather. In the summary table above, the coef column contains the coefficient for each variable in the leftmost column. The temperature coefficient is ≈39.9, which means that for every 10-degree increase in temperature we can expect ≈400 more rides per day. It is not exactly 400 more rides, though. The two rightmost columns show the 95% confidence interval, which for the temperature coefficient runs from 33.6 to 46.2. So the model is quite sure that the number of rides per day increases with temperature (the whole interval is well above 0), but it is less sure exactly how strong the relationship is.
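        The arithmetic behind that reading of the coefficient is easy to check. The coefficient and interval bounds below are the values quoted in the text; the variable names are made up for this sketch.

```python
# Values quoted from the OLS summary in the text
temp_coef = 39.9               # extra rides per day per degree F
ci_low, ci_high = 33.6, 46.2   # 95% confidence interval for the coefficient

delta_temp = 10                # temperature increase in degrees F

# Expected change in daily rides, with the interval scaled the same way
extra_rides = temp_coef * delta_temp
extra_low = ci_low * delta_temp
extra_high = ci_high * delta_temp

print("Expected extra rides: %.0f (95%% CI %.0f to %.0f)"
      % (extra_rides, extra_low, extra_high))
```

        So a 10-degree rise corresponds to roughly 340 to 460 additional rides per day, holding precipitation and day of week fixed.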

        Also, the precipitation coefficient is significantly negative, meaning that more precipitation means fewer daily rides. This makes sense and fits with our previous weather analysis.

5.2 Predict the number of rides per day

# Predict num rides and compute residual
y_pred = olsfit.predict(df)
resid = df.Trips-y_pred

# Plot Predicted vs actual
plt.figure(figsize=(12, 6))
sns.set_style("darkgrid")
plt.plot(df.Date.dt.dayofyear, y_pred)
plt.plot(df.Date.dt.dayofyear, df.Trips)
plt.legend(['Predicted', 'Actual'])
plt.xlabel('Day of the Year')
plt.ylabel('Number Daily Rentals')
plt.title('Actual vs Predicted Rides Per Day')
plt.show()
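        What is left after removing the weather effect can then be related to the time of year with a second-order polynomial, as described at the start of this section. The following is only a sketch on synthetic numbers: the residual values, the noise level, and the use of np.polyfit are assumptions for illustration, not necessarily the fit used in this project.

```python
import numpy as np

# Synthetic residuals that peak mid-season, plus noise (made-up numbers)
rng = np.random.default_rng(0)
day_of_year = np.arange(100, 300)
true_curve = -0.05 * (day_of_year - 200) ** 2 + 300
toy_resid = true_curve + rng.normal(0, 20, size=day_of_year.size)

# Centre the predictor so the polynomial fit stays well conditioned
x = day_of_year - day_of_year.mean()

# Fit resid = a*x**2 + b*x + c; polyfit returns [a, b, c]
coefs = np.polyfit(x, toy_resid, deg=2)

# Estimated seasonal effect for each day
seasonal_effect = np.polyval(coefs, x)
print("Quadratic coefficient: %0.3f" % coefs[0])
```

        A negative quadratic coefficient means the seasonal effect rises toward mid-season and falls off toward spring and fall.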

6. Summary

        This project used historical data provided by Nice Ride MN in the Twin Cities (Minneapolis/St. Paul, MN). We explored the bike-sharing ride data by looking at bike demand at different stations, bike traffic at each station, the impact of seasonality and weather on riding patterns, and differences in riding patterns between members and non-members. We then fit an ordinary least squares model to predict the number of rides per day from temperature, precipitation, and day of the week.

Due to limited space, only part of the core code is shown.


Origin blog.csdn.net/andrew_extra/article/details/125030769