Kaggle in Practice (3): New York City Taxi Fare Prediction (New-York-City-Taxi-Fare-Prediction)

Today I'm sharing the third Kaggle competition project in this series, New-York-City-Taxi-Fare-Prediction. What stands out about this project is the size of the dataset: about 5.3 GB, with roughly 54 million rows in total. We don't need that much data to work through the project, though. Let's take a look at it together.

Part1. Data import and preliminary analysis

First, import the dataset. Because it is so large, we load only the first 5 million rows for modeling.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv('train.csv',nrows=5000000)
test = pd.read_csv('test.csv')
test_ids = test['key']

train.head()


As you can see, the dataset has many rows but few columns: there are only 8 in total.
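Since memory is the main constraint with a file this size, one option is to tell pandas which dtypes to use while reading. The sketch below is my own addition rather than part of the original workflow: it reads a tiny in-memory CSV with the same column names (the values are made up), and the float32 and uint8 choices are illustrative assumptions. Note that uint8 only works if passenger_count has no missing values and stays below 256.

```python
import io
import pandas as pd

# A tiny stand-in for train.csv with the same columns (all values are made up).
csv_text = """key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
a,5.0,2012-03-01 08:15:00 UTC,-73.98,40.75,-73.97,40.76,1
b,12.5,2013-07-04 22:40:00 UTC,-74.00,40.72,-73.95,40.78,2
c,7.3,2014-11-20 13:05:00 UTC,-73.99,40.74,-73.98,40.73,1
"""

# Illustrative dtype choices: float32 uses half the memory of the default float64.
dtypes = {
    'fare_amount': 'float32',
    'pickup_longitude': 'float32',
    'pickup_latitude': 'float32',
    'dropoff_longitude': 'float32',
    'dropoff_latitude': 'float32',
    'passenger_count': 'uint8',
}

# nrows limits how many rows are parsed, just like in the real import above.
sample = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, nrows=2)
print(sample.dtypes)
print(sample.memory_usage(deep=True).sum(), 'bytes')
```

With the real file, the same dtype dictionary could be passed to pd.read_csv('train.csv', nrows=5000000, dtype=dtypes).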

train.info()


key: row identifier
fare_amount: fare (the value to predict)
pickup_datetime: time when the taxi picked up the passenger
pickup_longitude: longitude at pickup
pickup_latitude: latitude at pickup
dropoff_longitude: longitude at dropoff
dropoff_latitude: latitude at dropoff
passenger_count: number of passengers

train.describe()

The summary statistics show clear outliers: negative fares, a minimum passenger count of 0, and a maximum passenger count of 208.
A fare of 1270 is also puzzling: what kind of trip costs that much?
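To put numbers on observations like these, each suspicious condition can be written as a boolean mask and counted. This is a minimal sketch on a made-up frame; the rule names and the toy values are my own assumptions, not output from the real data.

```python
import pandas as pd

# Made-up rows that mimic the anomalies visible in train.describe().
toy = pd.DataFrame({
    'fare_amount':     [-2.5, 8.0, 1270.0, 5.5],
    'passenger_count': [1, 0, 2, 208],
})

# One boolean mask per rule; the rule names are only labels for the report.
checks = {
    'negative fare': toy.fare_amount < 0,
    'zero passengers': toy.passenger_count == 0,
    'more than 6 passengers': toy.passenger_count > 6,
}

# Summing a boolean mask counts the rows where the rule fires.
counts = {name: int(mask.sum()) for name, mask in checks.items()}
print(counts)
```

Replacing toy with train would give the counts for the real 5 million rows.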

Part2. Data Analysis

First, look at the distribution of the fare:

train.fare_amount.hist(bins=100,figsize = (16,8))
plt.xlabel("Fare Amount")
plt.ylabel("Frequency")


train[train.fare_amount <100 ].fare_amount.hist(bins=100, figsize = (16,8))
plt.xlabel("Fare Amount")
plt.ylabel("Frequency")


train[train.fare_amount >=100 ].fare_amount.hist(bins=100, figsize = (16,8))
plt.xlabel("Fare Amount")
plt.ylabel("Frequency")


train[train.fare_amount <100].shape


train[train.fare_amount >=100].shape

From the code and plots above, we can draw several conclusions:
1. Most fares are below 100; a small share are 100 or above.
2. Fares below 100 are mostly concentrated between 0 and 20.
3. Fares of 100 or above are mostly concentrated around 200. A few much larger fares may be outliers, or they may be airport trips.

Next, observe the distribution of the number of passengers:

train.passenger_count.hist(bins=100,figsize = (16,8))
plt.xlabel("passenger_count")
plt.ylabel("Frequency")


train[train.passenger_count<10].passenger_count.hist(bins=10,figsize = (16,8))
plt.xlabel("passenger_count")
plt.ylabel("Frequency")


train[train.passenger_count<7].passenger_count.hist(bins=10,figsize = (16,8))
plt.xlabel("passenger_count")
plt.ylabel("Frequency")


train[train.passenger_count>7].passenger_count.hist(bins=10,figsize = (16,8))
plt.xlabel("passenger_count")
plt.ylabel("Frequency")


train[train.passenger_count >7]


train[train.passenger_count ==0].shape


plt.figure(figsize= (16,8))
sns.boxplot(x='passenger_count', y='fare_amount', data=train[train.passenger_count < 7])


train[train.passenger_count <7][['fare_amount','passenger_count']].corr()


From the code and plots above, we can draw several conclusions:
1. Most passenger counts are below 7; a small share are above 7.
2. Among the rows with more than 7 passengers, most have missing coordinates and a passenger count of 208.
3. There are 17,602 rows with a passenger count of 0. These may be taxis carrying goods, or the value may simply be missing.
4. The box plot shows that fares for trips with fewer than 7 passengers are close to one another regardless of the passenger count.
5. Checking with .corr() shows that the correlation between passenger_count and fare_amount is low, only 0.013.
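A per-group summary is another way to check whether the fare changes with the passenger count. The sketch below uses a made-up frame, so the numbers say nothing about the real data; it only shows the groupby pattern.

```python
import pandas as pd

# Made-up rows standing in for the training data.
toy = pd.DataFrame({
    'passenger_count': [1, 1, 2, 2, 5, 5, 208],
    'fare_amount':     [6.5, 9.0, 7.0, 8.5, 6.0, 10.0, 3.3],
})

# Keep plausible passenger counts, then summarize the fare within each group.
plausible = toy[toy.passenger_count < 7]
summary = plausible.groupby('passenger_count')['fare_amount'].agg(['count', 'mean', 'median'])
print(summary)
```

If the mean and median columns barely move from group to group, that agrees with the low correlation reported by .corr().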

Part3. Data processing

1. Null value processing

train.isnull().sum()  # count the null values in each column


train = train.dropna(how='any', axis=0)

The 36 missing values are insignificant next to our 5 million rows, so I simply drop the rows that contain them.

test = pd.read_csv('test.csv')
test_ids = test['key']
test.head()
test.isnull().sum()

Running the same check on the test set shows that it has no missing values.

2. Outlier handling

train = train[train.fare_amount>=0]

Remove the rows with negative fares.

3. Feature engineering
① Narrow the scope of the training set. Because the training set is large, we can restrict it to the coordinate range covered by the test set.

print(min(test.pickup_longitude.min(),test.dropoff_longitude.min()))
print(max(test.pickup_longitude.max(),test.dropoff_longitude.max()))
print(min(test.pickup_latitude.min(),test.dropoff_latitude.min()))
print(max(test.pickup_latitude.max(),test.dropoff_latitude.max()))

This gives -74.2 to -73 as the longitude range and 40.5 to 41.8 as the latitude range.

def select_train(df, fw):
    return (df.pickup_longitude >= fw[0]) & (df.pickup_longitude <= fw[1]) & \
           (df.pickup_latitude >= fw[2]) & (df.pickup_latitude <= fw[3]) & \
           (df.dropoff_longitude >= fw[0]) & (df.dropoff_longitude <= fw[1]) & \
           (df.dropoff_latitude >= fw[2]) & (df.dropoff_latitude <= fw[3])
fw = (-74.2, -73, 40.5, 41.8)
train = train[select_train(train, fw)]

Use select_train to reduce the training set data
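The same bounding-box filter can also be written with Series.between, which is inclusive on both ends by default. This is an alternative sketch, not the author's code; in_box and the toy frame are hypothetical names and values introduced for illustration.

```python
import pandas as pd

def in_box(df, box):
    """Boolean mask: True where both trip endpoints fall inside the box.

    box is (min_lon, max_lon, min_lat, max_lat), the same layout as fw.
    """
    min_lon, max_lon, min_lat, max_lat = box
    return (df.pickup_longitude.between(min_lon, max_lon)
            & df.pickup_latitude.between(min_lat, max_lat)
            & df.dropoff_longitude.between(min_lon, max_lon)
            & df.dropoff_latitude.between(min_lat, max_lat))

# Made-up trips: the second one has an implausible pickup longitude of 0.
toy = pd.DataFrame({
    'pickup_longitude':  [-73.98, 0.0],
    'pickup_latitude':   [40.75, 40.75],
    'dropoff_longitude': [-73.95, -73.95],
    'dropoff_latitude':  [40.78, 40.78],
})
mask = in_box(toy, (-74.2, -73, 40.5, 41.8))
print(mask.tolist())
```

Rows with missing coordinates also come out as False, because between returns False for NaN.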

② Construct new time features
The original timestamp is not suitable for direct use. Since taxi fares may differ by time of day, year, and month, we extract the hour, month, year, and day of the week from the timestamp as new features for the model.

def deal_time_features(df):
    df['pickup_datetime'] = df['pickup_datetime'].str.slice(0, 16)
    df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], utc=True, format='%Y-%m-%d %H:%M')
    df['hour'] = df.pickup_datetime.dt.hour
    df['month'] = df.pickup_datetime.dt.month
    df["year"] = df.pickup_datetime.dt.year
    df["weekday"] = df.pickup_datetime.dt.weekday
    return df
train = deal_time_features(train)
test = deal_time_features(test)
train.head()

The processed time features are hour, month, year, and day of the week.
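To see what these steps produce, it helps to run them on a single value. The timestamp below is made up, and its 'YYYY-MM-DD HH:MM:SS UTC' layout is an assumption about how pickup_datetime is stored; only the first 16 characters matter for the parsing.

```python
import pandas as pd

# One made-up timestamp in the assumed layout of pickup_datetime.
raw = pd.Series(['2009-06-15 17:26:21 UTC'])

# Same steps as deal_time_features: keep the first 16 characters, then parse.
trimmed = raw.str.slice(0, 16)  # '2009-06-15 17:26'
parsed = pd.to_datetime(trimmed, utc=True, format='%Y-%m-%d %H:%M')

hour = int(parsed.dt.hour[0])
month = int(parsed.dt.month[0])
year = int(parsed.dt.year[0])
weekday = int(parsed.dt.weekday[0])  # Monday is 0, Sunday is 6
print(hour, month, year, weekday)
```

Slicing off the seconds and the time-zone suffix before parsing lets pandas use one fixed format string, which is usually much faster than letting it infer the format on millions of rows.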

③ Construct a new distance feature
Feeding raw latitude and longitude coordinates to the model is not very helpful, so we use the haversine formula to convert each pair of coordinates into a distance.

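def distance(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance; inputs in degrees, result in miles
    p = 0.017453292519943295  # pi / 180, converts degrees to radians
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    # 12742 km is the Earth's diameter; 0.6213712 converts kilometers to miles
    dis = 0.6213712 * 12742 * np.arcsin(np.sqrt(a))
    return dis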
train['distance_miles'] = distance(train.pickup_latitude,train.pickup_longitude,train.dropoff_latitude,train.dropoff_longitude)
test['distance_miles'] = distance(test.pickup_latitude, test.pickup_longitude,test.dropoff_latitude,test.dropoff_longitude)
train.head()
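A quick sanity check on a formula like this is to feed it two points whose separation is roughly known. The sketch below repeats the same formula under a hypothetical name, haversine_miles, so that it runs on its own, and uses the JFK and LaGuardia coordinates that appear in the airport snippet at the end of this post.

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    # Same formula as distance(): degrees in, miles out.
    p = np.pi / 180
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

# (longitude, latitude) pairs for two New York airports.
jfk = (-73.7781, 40.6413)
lgr = (-73.8740, 40.7769)

d = float(haversine_miles(jfk[1], jfk[0], lgr[1], lgr[0]))
same_point = float(haversine_miles(jfk[1], jfk[0], jfk[1], jfk[0]))
print(round(d, 2), same_point)  # a bit over 10 miles, and zero for identical points
```

Note the argument order: latitude comes first, then longitude, while the coordinate tuples store longitude first.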


train[(train['distance_miles']==0)&(train['fare_amount']==0)]

After constructing the distance feature, we find 15 useless rows whose distance and fare are both 0. These can be deleted.

train = train.drop(index= train[(train['distance_miles']==0)&(train['fare_amount']==0)].index, axis=0)

④ Special processing
1. Delete the rows with fare_amount below 2.5, because the starting fare for taxis in New York is 2.5.

train = train.drop(index= train[train['fare_amount'] < 2.5].index, axis=0)

2. Remove the rows with 7 or more passengers.

train[train.passenger_count >= 7]


train = train.drop(index= train[train.passenger_count >= 7].index, axis=0)
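The row filters from this part can also be collected into one function, which makes it easier to rerun the cleaning on a different sample. This is a sketch under my own naming (clean_trips, min_fare, max_passengers are hypothetical); it leaves out the coordinate-range filter and relies on the fact that requiring a fare of at least 2.5 already removes negative and zero fares.

```python
import pandas as pd

def clean_trips(df, min_fare=2.5, max_passengers=6):
    """Apply the Part 3 row filters in one pass and return a filtered copy."""
    keep = (
        df.notna().all(axis=1)                      # drop rows with any missing value
        & (df.fare_amount >= min_fare)              # also removes negative and zero fares
        & (df.passenger_count <= max_passengers)    # drop rows with 7 or more passengers
    )
    return df[keep].copy()

# Made-up rows, one for each kind of problem plus two valid trips.
toy = pd.DataFrame({
    'fare_amount':     [-3.0, 0.0, 2.5, 10.0, 12.0, 8.0],
    'passenger_count': [1, 1, 1, 208, 2, 0],
    'distance_miles':  [1.0, 0.0, 0.3, 2.0, float('nan'), 1.5],
})
cleaned = clean_trips(toy)
print(cleaned.index.tolist())
```

Like the original steps, this keeps rows with a passenger count of 0; only counts of 7 or more are dropped.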

Part4. Data Modeling

Take a final look at the data after processing:

train.describe().T

Use .corr() to see how the new features relate to the fare:

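train.select_dtypes('number').corr()['fare_amount']  # restrict to numeric columns; recent pandas versions raise an error on text columns such as key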

Now move on to modeling:

df_train = train.drop(columns=['key', 'pickup_datetime']).copy()
df_test = test.drop(columns=['key', 'pickup_datetime']).copy()
# model on copies of the data


from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(df_train.drop('fare_amount',axis=1)
                                                    ,df_train['fare_amount']
                                                    ,test_size=0.2
                                                    ,random_state = 42)


# use train_test_split to split off a training set and a held-out validation set

import xgboost as xgb
params = {
    'max_depth': 7,
    'gamma': 0,
    'eta': 0.3,
    'subsample': 1,
    'colsample_bytree': 0.9,
    'objective': 'reg:squarederror',  # replaces the deprecated name 'reg:linear'
    'eval_metric': 'rmse',
    'verbosity': 1  # replaces the deprecated 'silent' flag
}
def XGBmodel(X_train,X_test,y_train,y_test,params):
    matrix_train = xgb.DMatrix(X_train,label=y_train)
    matrix_test = xgb.DMatrix(X_test,label=y_test)
    model=xgb.train(params=params,
                    dtrain=matrix_train,
                    num_boost_round=5000, 
                    early_stopping_rounds=10,
                    evals=[(matrix_test,'test')])
    return model

model = XGBmodel(X_train,X_test,y_train,y_test,params)

# train the model
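The rmse value printed during training is easier to judge next to a simple baseline. The sketch below defines RMSE with NumPy and scores a constant prediction (the mean fare) on made-up validation fares; the numbers are illustrative only and are not results from this project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up validation fares and a baseline that always predicts their mean.
y_valid = np.array([5.0, 7.0, 9.0, 31.0])
baseline = np.full_like(y_valid, y_valid.mean())
baseline_rmse = rmse(y_valid, baseline)
print(baseline_rmse)
```

With the real split, rmse(y_test, model.predict(xgb.DMatrix(X_test))) gives the model's score on the held-out rows, and a useful model should come in well below the baseline.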



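# ntree_limit was removed in XGBoost 2.0; iteration_range is available from version 1.4 onward
prediction = model.predict(xgb.DMatrix(df_test), iteration_range=(0, model.best_iteration + 1))
prediction
# predict on the test set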


res = pd.DataFrame()
res['key'] = test_ids
res['fare_amount'] = prediction
res.to_csv('submission.csv', index=False)
# save the results

Part5. Summary

My approach is a relatively simple one, because the only factors I could think of that affect taxi fares are the time period and the distance. These two factors should have the largest impact. If you have better ideas or practices, please leave a comment in the discussion area and let me know.
In addition, the Kaggle community has a way of constructing extra features that represent the distance from the trip coordinates to three local airports. I tried it directly, without tuning any parameters, and the score improved by 0.03. I don't think that is a big improvement, and the features overlap somewhat with the distance feature, so I did not use them in the end. I'm posting the code here to share with everyone.

# def transform(data):
#     # Distances to nearby airports, 
#     jfk = (-73.7781, 40.6413)
#     ewr = (-74.1745, 40.6895)
#     lgr = (-73.8740, 40.7769)

#     data['pickup_distance_to_jfk'] = distance(jfk[1], jfk[0],
#                                          data['pickup_latitude'], data['pickup_longitude'])
#     data['dropoff_distance_to_jfk'] = distance(jfk[1], jfk[0],
#                                            data['dropoff_latitude'], data['dropoff_longitude'])
#     data['pickup_distance_to_ewr'] = distance(ewr[1], ewr[0], 
#                                           data['pickup_latitude'], data['pickup_longitude'])
#     data['dropoff_distance_to_ewr'] = distance(ewr[1], ewr[0],
#                                            data['dropoff_latitude'], data['dropoff_longitude'])
#     data['pickup_distance_to_lgr'] = distance(lgr[1], lgr[0],
#                                           data['pickup_latitude'], data['pickup_longitude'])
#     data['dropoff_distance_to_lgr'] = distance(lgr[1], lgr[0],
#                                            data['dropoff_latitude'], data['dropoff_longitude'])
    
#     return data

# train = transform(train)
# test = transform(test)
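For anyone who wants to try the airport features, the same idea can be written as a loop over a dictionary of airports. This is a compact sketch, not the community code itself: haversine_miles, AIRPORTS, and add_airport_distances are hypothetical names, the distance formula is repeated so the block runs on its own, and the coordinates and column names are copied from the snippet above.

```python
import numpy as np
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    # Same haversine formula as distance(): degrees in, miles out.
    p = np.pi / 180
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

# (longitude, latitude) pairs, keyed by the short names used in the snippet above.
AIRPORTS = {
    'jfk': (-73.7781, 40.6413),
    'ewr': (-74.1745, 40.6895),
    'lgr': (-73.8740, 40.7769),
}

def add_airport_distances(data):
    """Return a copy with pickup and dropoff distances to each airport added."""
    data = data.copy()
    for name, (lon, lat) in AIRPORTS.items():
        for end in ('pickup', 'dropoff'):
            data[f'{end}_distance_to_{name}'] = haversine_miles(
                lat, lon, data[f'{end}_latitude'], data[f'{end}_longitude'])
    return data

# Made-up trip that starts at the JFK coordinates and ends at the LaGuardia ones.
toy = pd.DataFrame({
    'pickup_latitude': [40.6413], 'pickup_longitude': [-73.7781],
    'dropoff_latitude': [40.7769], 'dropoff_longitude': [-73.8740],
})
out = add_airport_distances(toy)
print(out.filter(like='distance_to').round(2).T)
```

Adding another landmark only requires a new entry in AIRPORTS, which is the main reason to prefer the loop over six hand-written assignments.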

Thank you for reading!


Origin blog.csdn.net/kiligso/article/details/108696392