Kaggle: Hotel booking demand analysis and cancellation prediction (Hotel booking demand)
Project background: This project studies a hotel's online booking business. From the perspective of hotel operations, it analyzes room type supply, demand in different time periods, core customer groups, and the factors that affect cancellations, and builds a classification model to predict whether a booking will be canceled.
Data source: Kaggle, Hotel booking demand. The project uses the hotel booking data set published on Kaggle; interested readers can download it for practice.
Data introduction:
field name | field meaning |
---|---|
hotel | hotel type (City Hotel or Resort Hotel) |
is_canceled | whether the booking was canceled (1) or not (0) |
lead_time | number of days between the booking date and the arrival date |
arrival_date_year | year of arrival |
arrival_date_month | month of arrival |
arrival_date_week_number | week number of the year of arrival |
arrival_date_day_of_month | day of the month of arrival |
stays_in_weekend_nights | number of weekend nights |
stays_in_week_nights | number of week nights |
adults | number of adults |
children | number of children |
babies | number of babies |
meal | type of meal package booked |
country | country of origin |
market_segment | market segment |
distribution_channel | booking distribution channel |
is_repeated_guest | whether the guest is a repeat customer |
previous_cancellations | number of earlier bookings the customer canceled before this booking |
previous_bookings_not_canceled | number of earlier bookings the customer did not cancel before this booking |
reserved_room_type | code of the room type reserved |
assigned_room_type | code of the room type assigned |
booking_changes | number of changes made to the booking |
deposit_type | type of deposit paid |
agent | travel agency ID |
company | ID of the company that made the booking |
days_in_waiting_list | number of days the booking waited before it was confirmed |
customer_type | type of booking |
adr | average daily rate |
required_car_parking_spaces | number of parking spaces requested by the customer |
total_of_special_requests | number of special requests |
reservation_status | last reservation status |
reservation_status_date | date the reservation status was last set |
It contains a total of 32 fields and 119390 records.
Project flow
- Data preprocessing
  - Missing value handling
  - Data type conversion
  - Outlier handling
- Feature engineering
  - Numerical feature normalization
  - One-hot encoding of categorical features
  - Feature selection
- Model training
- Model prediction and evaluation
1. Data preprocessing
- Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
- View and understand data
df = pd.read_csv('hotel_bookings.csv',encoding='gbk')
df.head()
df.info()
Result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hotel 119390 non-null object
1 is_canceled 119390 non-null int64
2 lead_time 119390 non-null int64
3 arrival_date_year 119390 non-null int64
4 arrival_date_month 119390 non-null object
5 arrival_date_week_number 119390 non-null int64
6 arrival_date_day_of_month 119390 non-null int64
7 stays_in_weekend_nights 119390 non-null int64
8 stays_in_week_nights 119390 non-null int64
9 adults 119390 non-null int64
10 children 119386 non-null float64
11 babies 119390 non-null int64
12 meal 119390 non-null object
13 country 118902 non-null object
14 market_segment 119390 non-null object
15 distribution_channel 119390 non-null object
16 is_repeated_guest 119390 non-null int64
17 previous_cancellations 119390 non-null int64
18 previous_bookings_not_canceled 119390 non-null int64
19 reserved_room_type 119390 non-null object
20 assigned_room_type 119390 non-null object
21 booking_changes 119390 non-null int64
22 deposit_type 119390 non-null object
23 agent 103050 non-null float64
24 company 6797 non-null float64
25 days_in_waiting_list 119390 non-null int64
26 customer_type 119390 non-null object
27 adr 119390 non-null float64
28 required_car_parking_spaces 119390 non-null int64
29 total_of_special_requests 119390 non-null int64
30 reservation_status 119390 non-null object
31 reservation_status_date 119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
The data set has 32 fields and 119,390 rows. The company column is mostly missing (only 6,797 non-null values), and agent, country, and children also contain missing values. In addition, the columns that describe the arrival date (arrival_date_year, arrival_date_month, arrival_date_day_of_month) need to be merged and converted to a date format.
- Date merge and format conversion
Since the month in arrival_date_month is given as an English name, first convert it to a month number, which makes it easier to merge the date later
# Convert the English month names in arrival_date_month to month numbers
import calendar
month = []
for i in df.arrival_date_month:
    # calendar.month_name[1] is "January", so the list index is the month number
    mon = list(calendar.month_name).index(i)
    month.append(mon)
df.insert(4,"arrival_month",month)
Add an arrival_date column holding the arrival date, built by concatenating the original year, month, and day
# Concatenate year, month, and day
# Add the arrival date column arrival_date
date_cols = ["arrival_date_year","arrival_month","arrival_date_day_of_month"]
df[date_cols] = df[date_cols].astype(str)
date = df.arrival_date_year.str.cat([df.arrival_month,df.arrival_date_day_of_month],sep=".")
df.insert(3,"arrival_date",date)
Convert to date format
# Convert to date format
df['arrival_date']=pd.to_datetime(df['arrival_date'])
Delete the original year, month, and day columns and keep only the newly created arrival_date
df.drop(['arrival_date_year','arrival_month','arrival_date_month','arrival_date_week_number'],axis=1,inplace=True)
df.drop(['arrival_date_day_of_month'],axis=1,inplace=True)
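The same date can also be assembled without string concatenation. The sketch below is only an illustration on a three-row toy frame (the values are made up). It relies on `pd.to_datetime` accepting a frame whose columns are named `year`, `month`, and `day`; the names `toy`, `month_number`, and `parts` are introduced here and are not part of the original code.

```python
import calendar
import pandas as pd

# Toy rows standing in for the real data set (values are illustrative only).
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016, 2017],
    "arrival_date_month": ["July", "February", "December"],
    "arrival_date_day_of_month": [1, 29, 31],
})

# Map English month names to month numbers, e.g. "July" -> 7.
# calendar.month_name[0] is an empty string, so it is skipped.
month_number = {name: i for i, name in enumerate(calendar.month_name) if name}

# pd.to_datetime can build dates from columns named year, month, day.
parts = pd.DataFrame({
    "year": toy["arrival_date_year"],
    "month": toy["arrival_date_month"].map(month_number),
    "day": toy["arrival_date_day_of_month"],
})
toy["arrival_date"] = pd.to_datetime(parts)
print(toy["arrival_date"])
```

Mapping the whole column at once avoids the Python-level loop, which may matter on the full 119,390-row data set.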
- Missing value handling
# Count the missing values
df.isnull().sum()
# Compute the missing rate
#df.isnull().sum()/df.shape[0]
Result:
hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 4
babies 0
meal 0
country **488**
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent **16340**
company **112593**
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
dtype: int64
The missing values are concentrated in four fields: children, country, agent, and company. The company field has the most.
1. children has 4 missing values and is a numerical variable, so it is filled with the median
2. country has 488 missing values and is a categorical variable, so it is filled with the mode
3. agent has 16,340 missing values, a missing rate of about 13.7%. That is a relatively large number, but agent identifies the travel agency that made the booking and the missing rate is below 20%, so the column is kept and filled with 0, meaning there is no travel agency ID
4. company has 112,593 missing values, a missing rate of 94.3%, which is above 80%. The column carries too little information to be useful, so it is deleted
# Assign the filled columns back instead of using inplace=True on a column
df["children"] = df["children"].fillna(df["children"].median())
df["country"] = df["country"].fillna(df["country"].mode()[0])
df["agent"] = df["agent"].fillna(0)
df = df.drop(columns=["company"])
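The missing-rate reasoning above can be turned into a small report. The following is a minimal sketch on a made-up ten-row frame; the helper name `missing_report` and the `drop_above` parameter are hypothetical, and the 80% threshold is simply the one used in the text.

```python
import numpy as np
import pandas as pd

def missing_report(frame, drop_above=0.8):
    """Return missing count and rate per column, plus a drop suggestion."""
    report = pd.DataFrame({
        "missing": frame.isnull().sum(),
        "rate": frame.isnull().mean(),
    })
    # Flag columns whose missing rate exceeds the threshold from the text.
    report["suggest_drop"] = report["rate"] > drop_above
    # Keep only columns that actually have missing values, worst first.
    return report[report["missing"] > 0].sort_values("rate", ascending=False)

# Toy data (illustrative values only): 10%, 20%, and 90% missing.
toy = pd.DataFrame({
    "children": [0.0, 1.0, np.nan, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "agent": [9.0, np.nan, 240.0, np.nan, 9.0, 1.0, 1.0, 9.0, 9.0, 9.0],
    "company": [np.nan] * 9 + [40.0],
    "adr": [75.0] * 10,
})
report = missing_report(toy)
print(report)
```

On the real data the same call would be `missing_report(df)`, run before any filling or dropping.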
- Outlier handling
Inspecting the data set shows that children and agent are stored as floating-point numbers although they can only take integer values. Some rows have adults, children, and babies all equal to 0, meaning the booking has no guests, which is not realistic. There is also an outlier greater than 5000 in the average daily rate (adr). We need to handle these so that they do not affect the subsequent modeling.
# children and agent cannot be floating-point numbers, so change their data type
df.children = df.children.astype(int)
df.agent = df.agent.astype(int)
# According to the original data set description, Undefined and SC in the meal field both mean no meal package
df["meal"] = df["meal"].replace("Undefined", "SC")
# Delete the rows with zero guests
zero_guests = (df["adults"] + df["children"] + df["babies"]) == 0
df = df[~zero_guests]
# Check the outliers of the adr variable
sns.boxplot(x=df['adr'])
# Delete the outliers and reset the index so that rows line up with arrays built later
df = df[df["adr"]<5000].reset_index(drop=True)
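Before deleting rows it can help to count how many each rule would remove. The sketch below does this on a four-row toy frame with made-up values; the names `no_guests`, `extreme_adr`, and `clean` are introduced here for illustration only.

```python
import pandas as pd

# Toy bookings (illustrative values): row 1 has no guests, row 2 has an extreme adr.
toy = pd.DataFrame({
    "adults":   [2, 0, 1, 2],
    "children": [0, 0, 1, 0],
    "babies":   [0, 0, 0, 0],
    "adr":      [80.0, 60.0, 5400.0, 120.0],
})

# Boolean masks for the two rules described in the text.
no_guests = (toy["adults"] + toy["children"] + toy["babies"]) == 0
extreme_adr = toy["adr"] >= 5000

print("rows with zero guests:", int(no_guests.sum()))
print("rows with adr >= 5000:", int(extreme_adr.sum()))

# Keep rows that pass both checks and renumber them from 0.
clean = toy[~no_guests & ~extreme_adr].reset_index(drop=True)
print(clean)
```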
Feature engineering
Numerical feature standardization
Because the numerical features have different units and scales, features with large values can dominate when fitting the model. We therefore standardize them so that they share a common scale while preserving the patterns in the data
- First extract the numerical features
# Numerical feature standardization
# Only columns that exist in df are listed here
num_feature = ["lead_time","stays_in_weekend_nights","stays_in_week_nights","adults","children","babies","is_repeated_guest","previous_cancellations","previous_bookings_not_canceled","booking_changes","agent","days_in_waiting_list","adr","required_car_parking_spaces","total_of_special_requests"]
- Standardize the numerical features
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# Fit on the numerical columns only and wrap the result back into a DataFrame
dff = sc_X.fit_transform(df[num_feature])
dff = pd.DataFrame(data=dff, columns=num_feature)
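To make the standardization step concrete, the sketch below checks on a tiny made-up column that `StandardScaler` computes z = (x - mean) / std, using the population standard deviation (NumPy's default, `ddof=0`). The array `values` is illustrative and not taken from the data set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One toy feature column with four values.
values = np.array([[10.0], [20.0], [30.0], [40.0]])

# Standardize with scikit-learn.
scaled = StandardScaler().fit_transform(values)

# Same computation by hand: subtract the column mean, divide by the column std.
by_hand = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, by_hand))
```

After scaling, the column has mean 0 and standard deviation 1, so columns such as lead_time and adr end up on comparable scales.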
Categorical feature encoding
Since the computer can only process numerical values and not string category labels, we encode the categorical features as numeric vectors that the model can use while preserving the information they carry
- Extract the categorical features
cat_feature = ["hotel","meal","country","market_segment","distribution_channel","reserved_room_type","assigned_room_type","deposit_type","customer_type"]
- One-hot encoding
from sklearn.preprocessing import OneHotEncoder
one_hot=OneHotEncoder()
# get_feature_names_out replaces get_feature_names, which newer scikit-learn versions no longer provide
data_temp=pd.DataFrame(one_hot.fit_transform(df[cat_feature]).toarray(),
                       columns=one_hot.get_feature_names_out(cat_feature),dtype='int32')
data_onehot=pd.concat((dff,data_temp),axis=1) # merge or join would also work
data_onehot.head()
# Assign the label by position: df lost rows during cleaning, so its index may no longer match
data_onehot['is_canceled'] = df['is_canceled'].to_numpy()
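One point worth checking when a label column is attached to a freshly built feature frame is row alignment. pandas matches a Series to a frame by index label, not by position, so a frame that has had rows dropped can produce missing labels. The sketch below shows the effect on made-up data; all names in it are illustrative.

```python
import pandas as pd

# A source frame that lost row 1 during cleaning; surviving labels are 0, 2, 3.
df_toy = pd.DataFrame({"is_canceled": [0, 1, 1, 0]})
df_toy = df_toy.drop(index=[1])

# A feature frame rebuilt from an array, so it has a fresh index 0, 1, 2.
features = pd.DataFrame({"f": [0.1, 0.2, 0.3]})

# Assigning a Series aligns on index labels: label 1 is missing, label 3 is ignored.
aligned_by_label = features.copy()
aligned_by_label["is_canceled"] = df_toy["is_canceled"]

# Assigning a NumPy array aligns by position, which is what is wanted here.
aligned_by_position = features.copy()
aligned_by_position["is_canceled"] = df_toy["is_canceled"].to_numpy()

print(aligned_by_label)
print(aligned_by_position)
```

Resetting the index of the source frame after dropping rows is another way to keep labels and positions in agreement.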
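Dimensionality reduction
After one-hot encoding the categorical features, the data set grows from the original 32 fields to 239 fields. This sharp increase in dimensionality lengthens model training and may make the data sparse. To train the model well without losing too much information, we use a decision tree model for feature selection and keep 30 features for training.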
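# Suitable for classification models
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel
# x and y are not defined in the original code; here they are assumed to be
# the encoded features and the is_canceled label from data_onehot
x = data_onehot.drop(columns=['is_canceled'])
y = data_onehot['is_canceled']
def descTree(x,y,n):
    # Fit a decision tree, then keep at most n features ranked by importance
    print('Using the decision tree model')
    tree = DecisionTreeClassifier().fit(x, y)
    model1 = SelectFromModel(tree, prefit=True, max_features=n)
    # Positional indices of the selected columns
    d = model1.get_support(indices=True)
    print('The selected features are...')
    print(d)
    return d
d= descTree(x,y,30)
all_fea = pd.DataFrame(d)
Result:
Using the decision tree model
The selected features are...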
[ 0 1 2 3 4 7 8 9 10 11 12 13 14 16 17 19 20 64
72 77 80 156 204 211 214 220 223 232 236 237]
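After feature selection we have the indices of the 30 selected features. We use them to pick the corresponding columns and assemble the final training data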
fea_num = list(d)
data_stand_fea = x.iloc[:, list(fea_num[:])]
data_stand_fea
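Column indices are hard to read, so it can be useful to map them back to column names. The sketch below does this on a small made-up data set in which only one column carries signal; `X_toy`, `y_toy`, and `selected_names` are illustrative names, and the setup mirrors the `SelectFromModel` call used above.

```python
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Toy data: "signal" separates the classes perfectly, the other columns do not.
X_toy = pd.DataFrame({
    "signal":  [0, 0, 0, 0, 1, 1, 1, 1],
    "noise_a": [1, 0, 1, 0, 1, 0, 1, 0],
    "noise_b": [0, 0, 1, 1, 0, 0, 1, 1],
})
y_toy = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# Fit the tree first, then wrap it in a selector that keeps at most one feature.
tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
selector = SelectFromModel(tree, prefit=True, max_features=1)

# Positional indices of the kept columns, then the matching column names.
idx = selector.get_support(indices=True)
selected_names = X_toy.columns[idx].tolist()
print(selected_names)
```

Applied to the real data, `x.columns[d]` would list the names of the 30 selected features.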
Model training
- Split the data set
# Split the data set 80/20 into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data_stand_fea, y, test_size=0.2)
- Train a RandomForest model
clf3 = RandomForestClassifier(n_estimators=160,
max_features=0.4,
min_samples_split=2,
n_jobs=-1,
random_state=0)
clf3.fit(X_train,y_train)
- Model prediction and evaluation
from sklearn.metrics import accuracy_score
y_pred3 = clf3.predict(X_test)
print('The accuracy of prediction is:', accuracy_score(y_test, y_pred3))
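Accuracy is a single number, and for cancellation prediction it can be helpful to also see which kind of error the model makes. The sketch below computes a confusion matrix, precision, and recall on ten hand-written labels. The labels are made up for illustration and are not results from this project.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hand-written labels: 1 means canceled, 0 means not canceled.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_hat)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
# Precision: share of predicted cancellations that were real cancellations.
prec = precision_score(y_true, y_hat)
# Recall: share of real cancellations that the model caught.
rec = recall_score(y_true, y_hat)

print(cm)
print(acc, prec, rec)
```

With the real model, `y_test` and `y_pred3` would take the place of the toy lists.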
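The random forest scores 0.83. As the result of a first training run this is a decent score. Later we can tune the model parameters or use a more complex model to improve prediction accuracy. Here we perform a simple parameter tuning
- Parameter tuning
GridSearchCV can be used for parameter tuning, but the grid should not contain too many parameters or values; otherwise the amount of computation grows and the search becomes very slow. For this model we tune "n_estimators": the number of decision trees; "max_depth": the depth of each tree (pre-pruning); "max_features": the maximum number of features considered at each split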
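rf = RandomForestClassifier()
# Parameter grid
# "sqrt" replaces "auto", which meant the same thing for classifiers and is rejected by recent scikit-learn versions
param_dict = {
    "n_estimators":[100,150,200],"max_depth":[3,5,8,10,15],"max_features":["sqrt","log2"]}
# Grid search tuner with 3-fold cross-validation
rf_model = GridSearchCV(rf,param_grid=param_dict,cv=3)
# Fit the model. The features were already scaled and encoded above, so the search is fitted directly
rf_model.fit(X_train, y_train)
# Best score across the parameter combinations and the parameters that produced it
print(rf_model.best_score_)
print(rf_model.best_params_)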
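The warning about grid size can be checked with a little arithmetic before the search is started. The sketch below counts the combinations in the grid with `ParameterGrid` and multiplies by the number of folds. It assumes `"sqrt"` stands in for `"auto"` in `max_features`; the variable names `n_candidates` and `n_fits` are introduced here.

```python
from sklearn.model_selection import ParameterGrid

# The grid from the tuning step: 3 x 5 x 2 value combinations.
param_dict = {
    "n_estimators": [100, 150, 200],
    "max_depth": [3, 5, 8, 10, 15],
    "max_features": ["sqrt", "log2"],
}
cv = 3  # number of cross-validation folds

# Each combination is trained once per fold.
n_candidates = len(ParameterGrid(param_dict))
n_fits = n_candidates * cv
print(n_candidates, n_fits)
```

With 30 combinations and 3 folds the search trains 90 forests, plus one final refit on the full training set under the default `refit=True`. Adding one more parameter with four values would multiply that by four, which is why the grid is kept small.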