Today I will share the third Kaggle competition project, New-York-City-Taxi-Fare-Prediction. What sets this project apart is the size of the data set: the file is about 5.3 GB and holds roughly 54 million rows. We do not need that much data for this walkthrough, though. Let's take a look at the project together.
Part1. Data import and preliminary analysis
First, import the data set. Because the file is so large, we load only the first 5 million rows for modeling.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train = pd.read_csv('train.csv',nrows=5000000)
test = pd.read_csv('test.csv')
test_ids = test['key']
train.head()
As you can see, the number of features is small this time: although the data set has many rows, there are only 8 columns.
train.info()
key: index
fare_amount: price
pickup_datetime: time when the taxi picked up the passenger
pickup_longitude: longitude at departure
pickup_latitude: latitude at departure
dropoff_longitude: longitude at arrival
dropoff_latitude: latitude at arrival
passenger_count: number of passengers
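Even a partial load of a file this size takes a fair amount of memory. The sketch below shows one possible way to shrink it by declaring narrower column types in pd.read_csv. The in-memory CSV and its rows are made up and only stand in for train.csv, and the dtype choices are assumptions: float32 keeps roughly five decimal places of a coordinate, and uint8 only works if passenger_count has no missing values.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for train.csv (made-up rows, same column names).
csv_text = """key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
a,4.5,2009-06-15 17:26:21 UTC,-73.84,40.72,-73.85,40.71,1
b,16.9,2010-01-05 16:52:16 UTC,-74.01,40.71,-73.97,40.78,1
c,5.7,2011-08-18 00:35:00 UTC,-73.98,40.76,-73.99,40.75,2
"""

# Narrower numeric types than the float64/int64 defaults (an assumption,
# not something the original post does).
dtypes = {
    "fare_amount": "float32",
    "pickup_longitude": "float32",
    "pickup_latitude": "float32",
    "dropoff_longitude": "float32",
    "dropoff_latitude": "float32",
    "passenger_count": "uint8",
}

# nrows limits how many rows are parsed, as in the post.
sample = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, nrows=2)
print(sample.dtypes)
print(sample.memory_usage(deep=True).sum(), "bytes")
```

With the real file, the same dtype dictionary could be passed alongside nrows=5000000.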
train.describe()
The values circled in red in the describe() output are outliers: negative fares, a minimum passenger count of 0, and a maximum passenger count of 208.
The value marked with a horizontal line is puzzling: what kind of trip costs a fare of 1270?
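The same kinds of problems can be counted directly with boolean masks instead of being read off the describe() table. This is a minimal sketch on made-up rows; the thresholds of 6 passengers and a fare of 500 are illustrative assumptions, not values from the competition.

```python
import pandas as pd

# Made-up rows that mimic the problems described above.
toy = pd.DataFrame({
    "fare_amount": [-3.0, 8.5, 1270.0, 12.0, 6.1],
    "passenger_count": [1, 0, 2, 208, 1],
})

# Each mask is a boolean Series; summing it counts the True values.
suspicious = {
    "negative fare": int((toy.fare_amount < 0).sum()),
    "zero passengers": int((toy.passenger_count == 0).sum()),
    "more than 6 passengers": int((toy.passenger_count > 6).sum()),
    "fare above 500": int((toy.fare_amount > 500).sum()),
}
for name, count in suspicious.items():
    print(f"{name}: {count}")
```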
Part2. Data Analysis
First, look at the distribution of the fare:
train.fare_amount.hist(bins=100,figsize = (16,8))
plt.xlabel("Fare Amount")
plt.ylabel("Frequency")
train[train.fare_amount <100 ].fare_amount.hist(bins=100, figsize = (16,8))
plt.xlabel("Fare Amount")
plt.ylabel("Frequency")
train[train.fare_amount >=100 ].fare_amount.hist(bins=100, figsize = (16,8))
plt.xlabel("Fare Amount")
plt.ylabel("Frequency")
train[train.fare_amount <100].shape
train[train.fare_amount >=100].shape
From the code and plots above, we can draw several conclusions:
1. Most fares are below 100; a small part is above 100.
2. Fares below 100 are mostly concentrated between 0 and 20.
3. Among fares above 100, most are concentrated around 200. A few very large fares may be outliers, or they may be trips to the airport.
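The share of fares in each range can also be computed rather than estimated from the histograms. Here is a small sketch using pd.cut on made-up fares; the bin edges 0, 20, and 100 follow the ranges discussed above, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up fares, only to illustrate the binning.
fares = pd.Series([4.5, 7.0, 12.3, 18.0, 35.0, 57.3, 120.0, 205.0])

bins = [0, 20, 100, float("inf")]
labels = ["0-20", "20-100", "100+"]
# right=False makes each bin include its left edge, so a fare of 100 falls into "100+".
buckets = pd.cut(fares, bins=bins, labels=labels, right=False)

# Fraction of rows per bucket, listed in bin order.
share = buckets.value_counts(normalize=True).sort_index()
print(share)
```

On the real data, fares would be train.fare_amount after the negative values have been removed, since negative values fall outside every bin.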
Next, observe the distribution of the number of passengers:
train.passenger_count.hist(bins=100,figsize = (16,8))
plt.xlabel("passenger_count")
plt.ylabel("Frequency")
train[train.passenger_count<10].passenger_count.hist(bins=10,figsize = (16,8))
plt.xlabel("passenger_count")
plt.ylabel("Frequency")
train[train.passenger_count<7].passenger_count.hist(bins=10,figsize = (16,8))
plt.xlabel("passenger_count")
plt.ylabel("Frequency")
train[train.passenger_count>7].passenger_count.hist(bins=10,figsize = (16,8))
plt.xlabel("passenger_count")
plt.ylabel("Frequency")
train[train.passenger_count >7]
train[train.passenger_count ==0].shape
plt.figure(figsize= (16,8))
# Filter once so that x and y come from the same rows
sns.boxplot(x='passenger_count', y='fare_amount', data=train[train.passenger_count < 7])
train[train.passenger_count <7][['fare_amount','passenger_count']].corr()
From the code and plots above, we can draw several conclusions:
1. Most passenger counts are below 7; a small part is above 7.
2. Among the rows with a passenger count above 7, most have missing coordinates and a passenger count of 208.
3. There are 17,602 rows with a passenger count of 0. These may be taxis transporting goods, or the value may simply be missing.
4. The box plot shows that the average fare is roughly the same for each passenger count below 7.
5. The .corr() result shows that the correlation between passenger_count and fare_amount is low, only 0.013.
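The comparison of fares across passenger counts can be checked numerically with a groupby, which gives the mean and median per group instead of a visual impression. The sketch below uses made-up rows chosen so that every group has the same mean fare; on the real data the frame would be train[train.passenger_count < 7].

```python
import pandas as pd

# Made-up rows: each passenger count has an average fare of 10.
toy = pd.DataFrame({
    "passenger_count": [1, 1, 2, 2, 5, 5],
    "fare_amount": [8.0, 12.0, 9.0, 11.0, 10.0, 10.0],
})

# Per-group statistics for the fare.
summary = toy.groupby("passenger_count")["fare_amount"].agg(["mean", "median", "count"])
print(summary)

# Pearson correlation between the two columns.
corr = toy["passenger_count"].corr(toy["fare_amount"])
print(corr)
```

When the group means barely differ, the correlation is close to zero, which matches the low value reported above.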
Part3. Data processing
1. Null value processing
train.isnull().sum()  # find null values
train = train.dropna(how='any', axis=0)
The 36 missing values are insignificant compared with our 5 million rows, so I simply drop them.
test = pd.read_csv('test.csv')
test_ids = test['key']
test.head()
test.isnull().sum()
Run the same check on the test set; it has no missing values.
2. Outlier handling
train = train[train.fare_amount>=0]
Remove data with negative prices
3. Feature engineering
① Shrink the training set. Because the training set is relatively large, we can reduce it according to the coordinate range of the test set.
print(min(test.pickup_longitude.min(),test.dropoff_longitude.min()))
print(max(test.pickup_longitude.max(),test.dropoff_longitude.max()))
print(min(test.pickup_latitude.min(),test.dropoff_latitude.min()))
print(max(test.pickup_latitude.max(),test.dropoff_latitude.max()))
This gives -74.2 to -73 as the longitude range and 40.5 to 41.8 as the latitude range.
def select_train(df, fw):
    # fw = (min longitude, max longitude, min latitude, max latitude)
    return (df.pickup_longitude >= fw[0]) & (df.pickup_longitude <= fw[1]) & \
           (df.pickup_latitude >= fw[2]) & (df.pickup_latitude <= fw[3]) & \
           (df.dropoff_longitude >= fw[0]) & (df.dropoff_longitude <= fw[1]) & \
           (df.dropoff_latitude >= fw[2]) & (df.dropoff_latitude <= fw[3])
fw = (-74.2, -73, 40.5, 41.8)
train = train[select_train(train, fw)]
Use select_train to reduce the training set data
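To see what the bounding-box filter keeps and drops, here is a small sketch on made-up coordinates. It uses Series.between, which is inclusive on both ends by default, as an alternative spelling of the same comparisons; in_box is a hypothetical helper name, not part of the original code.

```python
import pandas as pd

def in_box(df, fw):
    """Boolean mask: both trip endpoints lie inside (lon_min, lon_max, lat_min, lat_max)."""
    lon_min, lon_max, lat_min, lat_max = fw
    return (
        df.pickup_longitude.between(lon_min, lon_max)
        & df.pickup_latitude.between(lat_min, lat_max)
        & df.dropoff_longitude.between(lon_min, lon_max)
        & df.dropoff_latitude.between(lat_min, lat_max)
    )

# Made-up trips: one valid, one with a dropoff too far west, one with a (0, 0) pickup.
toy = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.98, 0.0],
    "pickup_latitude": [40.75, 40.75, 0.0],
    "dropoff_longitude": [-73.99, -75.50, -73.99],
    "dropoff_latitude": [40.76, 40.76, 40.76],
})

fw = (-74.2, -73, 40.5, 41.8)
mask = in_box(toy, fw)
print(mask.tolist())
print(toy[mask])
```

A trip is kept only when both its pickup and its dropoff fall inside the box, so a single bad coordinate removes the whole row.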
② Construct new time features. The original timestamp is not suitable for direct use. Since taxi fares may vary by time of day, month, and year, we extract the year, month, day of the week, and hour from the timestamp as new features for our model.
def deal_time_features(df):
    # Keep 'YYYY-MM-DD HH:MM' and drop the seconds and the ' UTC' suffix
    df['pickup_datetime'] = df['pickup_datetime'].str.slice(0, 16)
    df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], utc=True, format='%Y-%m-%d %H:%M')
    df['hour'] = df.pickup_datetime.dt.hour
    df['month'] = df.pickup_datetime.dt.month
    df["year"] = df.pickup_datetime.dt.year
    df["weekday"] = df.pickup_datetime.dt.weekday  # Monday is 0, Sunday is 6
    return df
train = deal_time_features(train)
test = deal_time_features(test)
train.head()
The processed time feature consists of hour, month, year, and day of the week
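The extraction can be traced on a couple of timestamps. This is a minimal sketch with two made-up strings in the same "YYYY-MM-DD HH:MM:SS UTC" layout as pickup_datetime; the variable names are hypothetical.

```python
import pandas as pd

# Made-up timestamps in the same layout as the pickup_datetime column.
raw = pd.Series(["2009-06-15 17:26:21 UTC", "2012-12-30 23:59:59 UTC"])

# The first 16 characters are "YYYY-MM-DD HH:MM"; seconds and " UTC" are dropped.
trimmed = raw.str.slice(0, 16)
stamps = pd.to_datetime(trimmed, format="%Y-%m-%d %H:%M", utc=True)

features = pd.DataFrame({
    "hour": stamps.dt.hour,
    "month": stamps.dt.month,
    "year": stamps.dt.year,
    "weekday": stamps.dt.weekday,  # Monday is 0, Sunday is 6
})
print(features)
```

Note that the hour is in UTC, not New York local time, because the timestamps are parsed with utc=True and never converted.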
③ Construct a new distance feature. Raw latitude and longitude coordinates are hard for the model to use directly, so we use the haversine formula to convert each pair of coordinates into a distance.
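def distance(x1, y1, x2, y2):
    # x = latitude, y = longitude, both in degrees
    p = 0.017453292519943295  # pi / 180, converts degrees to radians
    a = 0.5 - np.cos((x2 - x1) * p)/2 + np.cos(x1 * p) * np.cos(x2 * p) * (1 - np.cos((y2 - y1) * p)) / 2
    # 12742 km is the Earth's diameter; 0.6213712 converts km to miles
    dis = 0.6213712 * 12742 * np.arcsin(np.sqrt(a))
    return dis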
train['distance_miles'] = distance(train.pickup_latitude,train.pickup_longitude,train.dropoff_latitude,train.dropoff_longitude)
test['distance_miles'] = distance(test.pickup_latitude, test.pickup_longitude,test.dropoff_latitude,test.dropoff_longitude)
train.head()
train[(train['distance_miles']==0)&(train['fare_amount']==0)]
After constructing the distance feature, we find 15 rows whose distance and fare are both 0. They carry no information and can be deleted.
train = train.drop(index= train[(train['distance_miles']==0)&(train['fare_amount']==0)].index, axis=0)
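The distance function above is the haversine formula written with cosines, since (1 - cos x) / 2 equals sin²(x / 2). The sketch below writes it in the more common sine form and checks it on a case that is easy to verify by hand: one degree of latitude is about 69.09 miles. The function and constant names are hypothetical, and the mean Earth radius of 6371 km is half of the 12742 km used above.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0   # half of the 12742 km diameter used in distance()
KM_TO_MILES = 0.6213712

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return KM_TO_MILES * 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of latitude along the same meridian.
one_degree = haversine_miles(40.0, -74.0, 41.0, -74.0)
print(round(float(one_degree), 2))

# The same point twice gives a distance of zero.
same_point = haversine_miles(40.75, -73.98, 40.75, -73.98)
print(float(same_point))
```

Because the function only uses NumPy operations, it also accepts whole pandas columns, which is how the post applies it to the training and test sets.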
④Special processing
1. Delete data with fare_amount less than 2.5, because the starting fare of taxis in New York is 2.5
train = train.drop(index= train[train['fare_amount'] < 2.5].index, axis=0)
2. Remove data with 7 or more passengers
train[train.passenger_count >= 7]
train = train.drop(index= train[train.passenger_count >= 7].index, axis=0)
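The fare and passenger rules can also be expressed as one boolean mask instead of several drop calls. This is a sketch on made-up rows, not the author's code. One thing it makes visible: once fares below 2.5 are removed, the rows with negative fares and the rows where fare and distance are both 0 are already gone.

```python
import pandas as pd

# Made-up rows covering each rule.
toy = pd.DataFrame({
    "fare_amount": [0.0, 2.0, 2.5, 9.0, 30.0],
    "distance_miles": [0.0, 0.4, 0.0, 2.1, 8.0],
    "passenger_count": [1, 1, 2, 7, 3],
})

keep = (
    (toy.fare_amount >= 2.5)       # at or above the 2.5 starting fare
    & (toy.passenger_count < 7)    # fewer than 7 passengers
)
cleaned = toy[keep]
print(cleaned)
```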
Part4. Data Modeling
Take a look at the data after processing
train.describe().T
Use the .corr interface to see how these new features relate to prices
# numeric_only=True skips the string column 'key' and the datetime column
train.corr(numeric_only=True)['fare_amount']
Now move on to modeling:
df_train = train.drop(columns= ['key','pickup_datetime'], axis= 1).copy()
df_test = test.drop(columns= ['key','pickup_datetime'], axis= 1).copy()
# use the copied data for modeling
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_train.drop('fare_amount',axis=1)
,df_train['fare_amount']
,test_size=0.2
,random_state = 42)
# split into training and test sets with train_test_split
import xgboost as xgb
params = {
    'max_depth': 7,
    'gamma': 0,
    'eta': 0.3,
    'subsample': 1,
    'colsample_bytree': 0.9,
    'objective': 'reg:squarederror',  # called 'reg:linear' in older XGBoost releases
    'eval_metric': 'rmse',
    'verbosity': 1  # replaces the older 'silent': 0
}
def XGBmodel(X_train,X_test,y_train,y_test,params):
    matrix_train = xgb.DMatrix(X_train,label=y_train)
    matrix_test = xgb.DMatrix(X_test,label=y_test)
    # Stop when the test rmse has not improved for 10 rounds
    model=xgb.train(params=params,
                    dtrain=matrix_train,
                    num_boost_round=5000,
                    early_stopping_rounds=10,
                    evals=[(matrix_test,'test')])
    return model
model = XGBmodel(X_train,X_test,y_train,y_test,params)
# build the model
# iteration_range replaces ntree_limit, which newer XGBoost releases no longer accept
prediction = model.predict(xgb.DMatrix(df_test), iteration_range=(0, model.best_iteration + 1))
prediction
# predict on the test data
res = pd.DataFrame()
res['key'] = test_ids
res['fare_amount'] = prediction
res.to_csv('submission.csv', index=False)
# save the results
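The rmse value that XGBoost reports during training is the root mean squared error between the true and predicted fares. As a minimal sketch, it can be reproduced by hand with NumPy; the numbers below are made up and the function name is hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up fares: the errors are -2, 0 and 4.
actual = [10.0, 6.5, 52.0]
predicted = [12.0, 6.5, 48.0]

score = rmse(actual, predicted)
print(round(score, 3))
```

Because the errors are squared before averaging, a single large miss such as an airport trip predicted as a short ride weighs more than several small ones.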
Part5. Summary
My approach is a relatively simple one, because the only factors I could think of that affect taxi fares are the time period and the distance, and these two should have the greatest impact. If you have better ideas or practices, please leave a message in the discussion area to let me know.
In addition, the Kaggle community has a way of constructing new features that represent the distance from the pickup and dropoff coordinates to three local airports. I tried it directly, without tuning any parameters, and the result improved by 0.03. I felt the improvement was small and that these features overlap somewhat with the distance feature, so I did not use them in the end. I am posting the code here to share it with everyone.
# def transform(data):
#     # Distances to nearby airports
#     jfk = (-73.7781, 40.6413)
#     ewr = (-74.1745, 40.6895)
#     lgr = (-73.8740, 40.7769)
#     data['pickup_distance_to_jfk'] = distance(jfk[1], jfk[0],
#                                               data['pickup_latitude'], data['pickup_longitude'])
#     data['dropoff_distance_to_jfk'] = distance(jfk[1], jfk[0],
#                                                data['dropoff_latitude'], data['dropoff_longitude'])
#     data['pickup_distance_to_ewr'] = distance(ewr[1], ewr[0],
#                                               data['pickup_latitude'], data['pickup_longitude'])
#     data['dropoff_distance_to_ewr'] = distance(ewr[1], ewr[0],
#                                                data['dropoff_latitude'], data['dropoff_longitude'])
#     data['pickup_distance_to_lgr'] = distance(lgr[1], lgr[0],
#                                               data['pickup_latitude'], data['pickup_longitude'])
#     data['dropoff_distance_to_lgr'] = distance(lgr[1], lgr[0],
#                                                data['dropoff_latitude'], data['dropoff_longitude'])
#     return data
# train = transform(train)
# test = transform(test)