[Hands-On Project] Predicting Air Ticket Prices with Python

Airfares in India fluctuate with supply and demand and face few restrictions from regulators. They are therefore often considered unpredictable, and dynamic pricing adds to the confusion.

Our aim is to build a machine learning model that predicts the price of future flights from historical data. The predicted prices can then serve as reference prices for customers or airline service providers.

1. Prepare

Open a terminal in one of the following ways, then enter the commands below to install the dependencies:

1. On Windows, open Cmd (Start > Run > CMD).
2. On macOS, open Terminal (press Command+Space and type Terminal).
3. In the VS Code editor or PyCharm, use the Terminal panel at the bottom of the window.

pip install pandas
pip install numpy
pip install matplotlib
pip install seaborn
pip install scikit-learn
pip install openpyxl

2. Import related datasets

The dataset used in this article is Data_Train.xlsx. First, look at the format of the training set:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


flights = pd.read_excel('./Data_Train.xlsx')
flights.head()


The fields in the training set are airline (Airline), date (Date_of_Journey), origin (Source), destination (Destination), route (Route), departure time (Dep_Time), arrival time (Arrival_Time), elapsed time (Duration), total number of stops (Total_Stops), additional information (Additional_Info), and finally the ticket price (Price).

The test set has the same fields as the training set, except that it lacks the price field.

3. Exploratory Data Analysis

3.1 Clean up missing data

Look at the basic information of all fields:

flights.info()

Every field has 10683 non-null values except Route and Total_Stops, which have 10682 each, so each of these two fields is missing one value.

As a precaution, we drop rows with missing data:

# clearing the missing data
flights.dropna(inplace=True)
flights.info()

Every field now has the same number of non-null values, so the missing data has been cleaned up.
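
As a complement to info(), counting nulls per column shows directly which fields have gaps. The sketch below uses a small made-up frame rather than the real dataset; the name toy and all of its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training set: one row lacks Route and Total_Stops.
toy = pd.DataFrame({
    'Airline': ['IndiGo', 'Air India', 'Jet Airways'],
    'Route': ['BLR-DEL', np.nan, 'DEL-COK'],
    'Total_Stops': ['non-stop', np.nan, '1 stop'],
    'Price': [3897, 7662, 13882],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Dropping incomplete rows leaves only fully populated records.
cleaned = toy.dropna()
print(len(toy), '->', len(cleaned))
```

On the real data, flights.isna().sum() gives the same per-column view.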

3.2 Distribution characteristics of airline companies

Next, look at the distribution characteristics of airlines:

sns.countplot(x='Airline', data=flights)
plt.xticks(rotation=90)
plt.show()

The top three airlines are IndiGo, Air India, and Jet Airways.

Some of them may be low-cost carriers.

3.3 Distribution of origins

sns.countplot(x='Source', data=flights)
plt.xticks(rotation=90)
plt.show()

Some origins may be less popular than others, and tickets for flights departing from them are likely to be in lower demand as well.

3.4 Number distribution of stops

sns.countplot(x='Total_Stops', data=flights)
plt.xticks(rotation=90)
plt.show()

It appears that most flights make only one or no stops mid-flight.

Are some flights with more stops cheaper?
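
One way to probe this question is to compare the average price per stop category. The following is a minimal sketch on made-up numbers; the helper mean_price_by and the frame toy are hypothetical, and on the real data you would pass flights instead.

```python
import pandas as pd

def mean_price_by(df, column):
    """Average Price for each category in `column`, sorted from cheapest to dearest."""
    return df.groupby(column)['Price'].mean().sort_values()

# Hypothetical sample rows, not taken from the dataset.
toy = pd.DataFrame({
    'Total_Stops': ['non-stop', '1 stop', '1 stop', '2 stops', 'non-stop'],
    'Price': [4000, 9000, 11000, 13000, 5000],
})

by_stops = mean_price_by(toy, 'Total_Stops')
print(by_stops)
```

The same helper can be pointed at any other categorical column, such as Additional_Info.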

3.5 How much data contains additional information

plot = plt.figure()
sns.countplot(x='Additional_Info', data=flights)
plt.xticks(rotation=90)
plt.show()

Most flights carry no additional information. Among those that do, the most common notes are that the in-flight meal is not included or that free checked baggage is not included.

This information matters: are flights without these two services cheaper?

3.6 Time dimension analysis

First convert the time format:

# dayfirst=True assumes the dates are stored as day/month/year strings
flights['Date_of_Journey'] = pd.to_datetime(flights['Date_of_Journey'], dayfirst=True)
flights['Dep_Time'] = pd.to_datetime(flights['Dep_Time'],format='%H:%M:%S').dt.time
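
Ambiguous date strings are a common pitfall at this step: a value such as 1/05/2019 can be read as 5 January or as 1 May. If your copy of the file stores dates as day/month/year strings (an assumption; check a few rows first), an explicit format removes the ambiguity. The sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values; the format string is an assumption about the file.
dates = pd.Series(['1/05/2019', '24/03/2019'])
parsed = pd.to_datetime(dates, format='%d/%m/%Y')

# Derive the calendar features used later in this section.
months = parsed.dt.month_name().tolist()
weekdays = parsed.dt.day_name().tolist()
print(months)    # ['May', 'March']
print(weekdays)  # ['Wednesday', 'Sunday']
```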

Next, study the relationship between the day of the week and price:

flights['weekday'] = flights['Date_of_Journey'].dt.day_name()
sns.barplot(x='weekday', y='Price', data=flights)
plt.show()

Average prices barely differ from one weekday to another, which suggests that this feature carries little information about price.

So what is the relationship between the month and the air ticket price?

flights['month'] = flights['Date_of_Journey'].dt.month_name()
sns.barplot(x='month', y='Price', data=flights)
plt.show()

Unexpectedly, the average ticket price in April is only about half of that in other months. April appears to be an off-season for travel in India.

The relationship between departure hour and price:

flights['Dep_Time'] = flights['Dep_Time'].apply(lambda x: x.hour)
flights['Dep_Time'] = pd.to_numeric(flights['Dep_Time'])
sns.barplot(x='Dep_Time', y='Price', data=flights)
plt.show()

Tickets for red-eye flights (late-night and early-morning departures) are cheaper, which matches intuition.

3.7 Clear invalid features

Remove the fields that are not directly related to price:

flights.drop(['Route','Arrival_Time','Date_of_Journey'],axis=1,inplace=True)
flights.head()

4. Model training

Next, we use a model to predict ticket prices. Before that, the data needs preprocessing and outlier handling.

4.1 Data preprocessing

Replace string variables with numbers:

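from sklearn.preprocessing import LabelEncoder

# Assumption: Duration holds strings such as '2h 50m'. Convert them to minutes
# so the column is numeric before it reaches the model.
flights['Duration'] = pd.to_timedelta(flights['Duration']).dt.total_seconds() / 60

# Label-encode the categorical columns
var_mod = ['Airline','Source','Destination','Additional_Info','Total_Stops','weekday','month','Dep_Time']
le = LabelEncoder()
for i in var_mod:
    flights[i] = le.fit_transform(flights[i])
flights.head()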
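
LabelEncoder assigns integer codes in the sorted order of the labels, so the codes do not necessarily follow any natural ordering. The short sketch below uses made-up stop labels to show the mapping; note that 'non-stop' receives the largest code rather than 0.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample labels for illustration.
stops = ['non-stop', '1 stop', '2 stops', '1 stop', 'non-stop']

le = LabelEncoder()
codes = le.fit_transform(stops)

# classes_ lists the labels in sorted order; a label's position is its code.
classes = le.classes_.tolist()
print(classes)         # ['1 stop', '2 stops', 'non-stop']
print(codes.tolist())  # [2, 0, 1, 0, 2]

# A fitted encoder can map codes back to the original labels.
decoded = le.inverse_transform([0, 2]).tolist()
print(decoded)         # ['1 stop', 'non-stop']
```

Because the loop above refits a single encoder for every column, only the mapping for the last column remains available afterwards. Keeping one encoder per column (for example in a dict) would be one way to decode the others later.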

Cap the outliers in each numeric column using the 1.5 x IQR rule, then extract the independent variables (x) and the dependent variable (y):

flights.corr(numeric_only=True)

def outlier(df):
    # Compute the summary statistics once instead of once per column
    stats = df.describe()
    for i in stats.columns:
        Q1 = stats.at['25%', i]
        Q3 = stats.at['75%', i]
        IQR = Q3 - Q1
        LE = Q1 - 1.5 * IQR  # lower bound
        UE = Q3 + 1.5 * IQR  # upper bound
        # Cap values outside [LE, UE] at the nearest bound
        df[i] = df[i].mask(df[i] < LE, LE)
        df[i] = df[i].mask(df[i] > UE, UE)
    return df

flights = outlier(flights)
x = flights.drop('Price', axis=1)
y = flights['Price']
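
To make the capping rule concrete, here is a small worked sketch with NumPy on made-up prices. The function cap_outliers is a hypothetical stand-in that applies the same Q1 - 1.5*IQR and Q3 + 1.5*IQR bounds to a single array.

```python
import numpy as np

def cap_outliers(values):
    """Clip values to the interval [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return np.clip(values, lower, upper), (lower, upper)

# Hypothetical prices with one extreme value at the end.
prices = np.array([3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 80000])
capped, (lower, upper) = cap_outliers(prices)

print(lower, upper)  # -1000.0 15000.0
print(capped)
```

Here Q1 is 5000 and Q3 is 9000, so the IQR is 4000 and the bounds are -1000 and 15000. Only the extreme value 80000 is pulled down to 15000; every other price is unchanged.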

Split the data into training and test sets:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)


4.2 Model training and testing

Model training using random forests:

from sklearn.ensemble import RandomForestRegressor
rfr=RandomForestRegressor(n_estimators=100)
rfr.fit(x_train,y_train)

A random forest can also estimate how important each feature is. In scikit-learn, feature_importances_ measures how much each feature reduces impurity across the splits in the trees:

features=x.columns
importances = rfr.feature_importances_
indices = np.argsort(importances)
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()

It can be seen that Duration (flight duration) is the most influential factor.
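
To see how these importances behave, the sketch below fits a forest on synthetic data whose target depends almost entirely on the first column. The data, the seed, and the names X, y and model are all made up for illustration. The importances are normalized, so they sum to 1, and the dominant column receives nearly all of the weight.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Synthetic target: driven almost entirely by column 0, plus a little noise.
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

importances = model.feature_importances_
print(importances.round(3))
print(importances.sum())
```

A high importance only says that the forest relied on the feature for its splits; it is not evidence of a causal effect on price.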

Predict on the held-out test set and plot the predictions against the true values:

predictions=rfr.predict(x_test)
plt.scatter(y_test,predictions)
plt.show()

A scatter plot alone is not very informative. Next, we evaluate the model numerically.

4.3 Model Evaluation

sklearn provides a convenient module for evaluating models, metrics:

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('r2_score:', (metrics.r2_score(y_test, predictions)))

Result:

MAE: 1453.9350628905618
MSE: 4506308.3645551
RMSE: 2122.806718605135
r2_score: 0.7532074710409375

Of these four values, the R2 score is the easiest to interpret: the closer it is to 1, the better the model fits. This model scores 0.75, meaning it explains roughly 75% of the variance in the test-set prices, which is a reasonably good result.
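
For reference, the four metrics can be computed by hand with NumPy. The sketch below uses four made-up price pairs; the function regression_metrics is a hypothetical helper and is not part of sklearn.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R2': r2}

# Hypothetical true and predicted prices.
scores = regression_metrics([4000, 6000, 8000, 10000], [5000, 6000, 7000, 12000])
print(scores)
```

In this example the errors are -1000, 0, 1000 and -2000, so MAE is 1000, MSE is 1,500,000, RMSE is its square root (about 1224.7), and R2 is 1 - 6,000,000 / 20,000,000 = 0.7. MAE and RMSE are in the same unit as the price, which makes them useful alongside R2.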

Check whether the residuals are approximately normally distributed:

# histplot with kde=True replaces the deprecated distplot
sns.histplot(y_test - predictions, bins=50, kde=True)
plt.show()

Most residuals (true value minus predicted value) fall between -1000 and 1000, which is an acceptable result. The residual histogram is also roughly normal, which suggests that the model is effective.

Origin blog.csdn.net/weixin_56659172/article/details/128314261