Kaggle: Hotel Booking Demand Forecast Analysis

Project background: This project studies a hotel's online booking business. From the perspective of hotel operations, it analyzes room type supply, demand in different time periods, core customer groups, and the factors that affect cancellations, and then builds a classification model to predict whether a booking will be canceled.

Data source: the Hotel booking demand dataset on Kaggle. Interested readers can download it for practice.

Data introduction:

Field name                      Meaning
hotel                           Hotel (City Hotel or Resort Hotel)
is_canceled                     Whether the booking was canceled (1) or not (0)
lead_time                       Number of days between booking and arrival
arrival_date_year               Year of arrival
arrival_date_month              Month of arrival
arrival_date_week_number        Week number of the year of arrival
arrival_date_day_of_month       Day of the month of arrival
stays_in_weekend_nights         Number of weekend nights stayed
stays_in_week_nights            Number of week nights stayed
adults                          Number of adults
children                        Number of children
babies                          Number of babies
meal                            Type of meal booked
country                         Country of origin
market_segment                  Market segment
distribution_channel            Booking distribution channel
is_repeated_guest               Whether the guest is a repeat customer
previous_cancellations          Number of earlier bookings the customer canceled before this booking
previous_bookings_not_canceled  Number of earlier bookings the customer did not cancel before this booking
reserved_room_type              Code of the room type reserved
assigned_room_type              Code of the room type assigned
booking_changes                 Number of changes made to the booking
deposit_type                    Type of deposit paid
agent                           Travel agency ID
company                         Company ID
days_in_waiting_list            Number of days on the waiting list before the booking was confirmed
customer_type                   Type of booking
adr                             Average daily rate
required_car_parking_spaces     Number of parking spaces requested by the customer
total_of_special_requests       Number of special requests
reservation_status              Last reservation status
reservation_status_date         Date the reservation status was last set

It contains a total of 32 fields and 119390 records.

Project flow

  • Data preprocessing
    • Missing value handling
    • Data type conversion
    • Outlier handling
  • Feature engineering
    • Numerical feature standardization
    • One-hot encoding of categorical features
    • Feature selection
  • Model training
  • Model prediction and evaluation

1. Data preprocessing

  1. Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
  2. View and understand data
df = pd.read_csv('hotel_bookings.csv',encoding='gbk')
df.head()
df.info()
Result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

The dataset has 32 fields and 119,390 rows. The company column is mostly missing (only 6,797 non-null values), and agent, country, and children also contain missing values. In addition, the columns that describe the arrival date (arrival_date_year, arrival_date_month, arrival_date_day_of_month) need to be merged and converted to a date type.

  3. Date merge and format conversion

The arrival_date_month column stores the month as an English name, so first convert it to a month number (1 to 12), which makes it easy to merge the date parts later.

# Convert the English month names in arrival_date_month to month numbers
import calendar
month_names = list(calendar.month_name)  # index 0 is '', 1..12 are January..December
month = [month_names.index(name) for name in df.arrival_date_month]
df.insert(4,"arrival_month",month)
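
The month conversion above can also be written as a vectorized lookup. The following is only a minimal sketch on a toy DataFrame (the three sample rows are made up for illustration); it builds a name-to-number dictionary from the standard calendar module and applies it with Series.map.

```python
import calendar
import pandas as pd

# Toy data standing in for the real arrival_date_month column (illustrative only).
toy = pd.DataFrame({"arrival_date_month": ["July", "August", "January"]})

# calendar.month_name[0] is an empty string, so skip it; the rest map to 1..12.
month_to_number = {name: i for i, name in enumerate(calendar.month_name) if name}

# Series.map looks each month name up in the dictionary.
toy["arrival_month"] = toy["arrival_date_month"].map(month_to_number)
print(toy)
```

A name that is not in the dictionary becomes NaN, so a quick check with isnull() after mapping is a cheap way to catch misspelled months.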

Add a new arrival_date column by concatenating the original year, month, and day columns.

# Concatenate year, month, and day into one string
# and insert it as the new arrival_date column
date_cols = ["arrival_date_year","arrival_month","arrival_date_day_of_month"]
df[date_cols] = df[date_cols].astype(str)
date = df.arrival_date_year.str.cat([df.arrival_month,df.arrival_date_day_of_month],sep=".")
df.insert(3,"arrival_date",date)

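Convert to date format

# Convert the string to a datetime; the format matches the "year.month.day" strings built above
df['arrival_date']=pd.to_datetime(df['arrival_date'], format="%Y.%m.%d")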
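
As an alternative to string concatenation, pandas can assemble a date directly from numeric year, month, and day columns. This is a sketch on made-up rows, not the original author's method; the only requirement is that the columns passed to pd.to_datetime are named year, month, and day, which is why the toy columns are renamed first.

```python
import pandas as pd

# Toy rows; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_month": [7, 12],
    "arrival_date_day_of_month": [1, 31],
})

# pd.to_datetime accepts a DataFrame whose columns are named year, month, day.
parts = toy.rename(columns={
    "arrival_date_year": "year",
    "arrival_month": "month",
    "arrival_date_day_of_month": "day",
})
toy["arrival_date"] = pd.to_datetime(parts[["year", "month", "day"]])
print(toy["arrival_date"])
```

This route skips the intermediate string column and raises an error for impossible dates such as month 13.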

Delete the original year, month, day, and week-number columns and use only the newly created arrival_date column.

df.drop(['arrival_date_year','arrival_month','arrival_date_month','arrival_date_week_number'],axis=1,inplace=True)
df.drop(['arrival_date_day_of_month'],axis=1,inplace=True)
  4. Missing value handling
# Count the missing values per column
df.isnull().sum()
# Missing rate per column
#df.isnull().sum()/df.shape[0]
Result:
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64

The missing values are concentrated in four fields: children, country, agent, and company, with company missing the most.

1. children has 4 missing values and is a numerical variable, so it is filled with the median.

2. country has 488 missing values and is a categorical variable, so it is filled with the mode.

3. agent has 16,340 missing values, a missing rate of 13.7%. That is a fairly large number, but the rate is below 20% and the field identifies the travel agency that made the booking, so it is kept and filled with 0, meaning there is no travel agency ID.

4. company has 112,593 missing values, a missing rate of 94.3%, well above 80%. It carries too little information to be useful, so it is dropped.

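# Assign the filled columns back instead of calling fillna(inplace=True) on a column,
# which newer pandas versions warn about and may not apply to df
df["children"] = df["children"].fillna(df["children"].median())
df["country"] = df["country"].fillna(df["country"].mode()[0])
df["agent"] = df["agent"].fillna(0)
df = df.drop(columns=["company"])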
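
The rules used above (drop a column when it is mostly missing, otherwise fill numbers with the median and categories with the mode) can be sketched as a small helper. Everything here is illustrative: the function name plan_missing_values, the drop_above parameter, and the toy rows are assumptions, and domain decisions such as filling agent with 0 still have to be made by hand.

```python
import pandas as pd

def plan_missing_values(frame, drop_above=0.8):
    """Return {column: action} for every column that has missing values.

    drop_above is a hypothetical cutoff mirroring the 80% rule in the text.
    """
    plan = {}
    rates = frame.isnull().mean()  # fraction of missing values per column
    for col, rate in rates[rates > 0].items():
        if rate > drop_above:
            plan[col] = "drop"
        elif pd.api.types.is_numeric_dtype(frame[col]):
            plan[col] = "median"
        else:
            plan[col] = "mode"
    return plan

# Made-up rows that imitate the four cases discussed above.
toy = pd.DataFrame({
    "children": [0.0, 1.0, None, 2.0, 0.0, 0.0],
    "country": ["PRT", None, "GBR", "PRT", "PRT", "FRA"],
    "company": [None, None, None, None, None, 40.0],
    "adults": [2, 2, 1, 2, 3, 2],
})
plan = plan_missing_values(toy)
print(plan)
```

Columns without missing values (adults in the toy data) do not appear in the plan at all.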
  5. Outlier handling

Inspecting the data shows several problems. The children and agent columns are stored as floating-point numbers although they should be integers. Some rows have adults, children, and babies all equal to 0, meaning the booking has no guests, which is not realistic. The hotel's average daily rate (adr) contains an outlier greater than 5000. These need to be handled so that they do not affect the model built later.

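# children and agent cannot be fractional, so convert them to int
df["children"] = df["children"].astype(int)
df["agent"] = df["agent"].astype(int)
# Per the dataset description, "Undefined" and "SC" in the meal field both mean no meal package
df["meal"] = df["meal"].replace("Undefined", "SC")
# Drop rows where the total number of guests is 0
zero_guests = (df["adults"] + df["children"] + df["babies"]) == 0
df = df[~zero_guests]
# Check the adr variable for outliers
sns.boxplot(x=df['adr'])
# Drop the outlier and reset the index so later positional joins line up
df = df[df["adr"]<5000].reset_index(drop=True)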
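
The two row filters above are easier to check on a small example. This sketch uses made-up rows and a constant named ADR_CAP (a name chosen here, with the same 5000 cutoff as in the text); it combines both conditions into boolean masks and keeps only the rows that pass both.

```python
import pandas as pd

# Made-up bookings: row 1 has no guests, row 2 has an extreme adr.
toy = pd.DataFrame({
    "adults":   [2, 0, 1, 2],
    "children": [0, 0, 1, 0],
    "babies":   [0, 0, 0, 0],
    "adr":      [75.0, 0.0, 5400.0, 120.5],
})

ADR_CAP = 5000  # same cutoff as in the text

has_guests = (toy["adults"] + toy["children"] + toy["babies"]) > 0
adr_ok = toy["adr"] < ADR_CAP

# Keep rows that satisfy both conditions and renumber the index from 0.
clean = toy[has_guests & adr_ok].reset_index(drop=True)
print(f"kept {len(clean)} of {len(toy)} rows")
```

Resetting the index matters when the cleaned frame is later joined column-wise with arrays or frames that are numbered from 0.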

2. Feature engineering

Numerical feature standardization

The numerical features have different units and scales, so a model fit can be biased toward the features with larger magnitudes. Standardization puts them on a common scale while preserving the pattern within each feature.

  1. Extract the numerical features
# Numerical features to standardize (only columns that exist in df)
num_feature = ["lead_time","stays_in_weekend_nights","stays_in_week_nights","adults","children","babies","is_repeated_guest","previous_cancellations","previous_bookings_not_canceled","booking_changes","agent","days_in_waiting_list","adr","required_car_parking_spaces","total_of_special_requests"]

  2. Standardize the numerical features
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
# Scale each numerical column to zero mean and unit variance
dff = sc_X.fit_transform(df[num_feature])

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
dff = pd.DataFrame(data=dff, columns=num_feature)
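
To see what standardization does to a single column, here is a small worked example with made-up lead_time values. Each value x is replaced by z = (x - mean) / std, where std is the population standard deviation (ddof=0), the same convention StandardScaler uses.

```python
import numpy as np

lead_time = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # made-up values

mean = lead_time.mean()
std = lead_time.std()  # population standard deviation (ddof=0)
z = (lead_time - mean) / std

print("mean:", mean, "std:", round(std, 3))
print("z-scores:", np.round(z, 3))
```

After the transformation the column has mean 0 and standard deviation 1, whatever its original unit was.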

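Categorical feature vectorization

A model can only work with numbers and cannot use string categories directly. To keep the information these columns carry, we vectorize them, converting each category into numerical features the model can use.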

  1. Extract the categorical features
cat_feature = ["hotel","meal","country","market_segment","distribution_channel","reserved_room_type","assigned_room_type","deposit_type","customer_type"]
  2. One-hot encoding
from sklearn.preprocessing import OneHotEncoder

one_hot=OneHotEncoder()

# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
data_temp=pd.DataFrame(one_hot.fit_transform(df[cat_feature]).toarray(),
             columns=one_hot.get_feature_names_out(cat_feature),dtype='int32')
data_onehot=pd.concat((dff,data_temp),axis=1)    # merge or join would also work

data_onehot.head()
# Assign the label by position, because df may still carry its old index after rows were dropped
data_onehot['is_canceled'] = df['is_canceled'].to_numpy()

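Dimensionality reduction

After one-hot encoding the categorical features, the dataset grows from the original 32 fields to 239. The much higher dimensionality lengthens model training and can make the data sparse. To train the model well without losing too much information, we use a decision tree model for feature selection and keep 30 features for training.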

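# Suitable for classification models
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectFromModel


def descTree(x,y,n):
    # x holds the features and y the class labels
    X, y = x, y
    # Decision tree model
    print('Using the decision tree model')
    tree = DecisionTreeClassifier().fit(X, y)
    # Keep at most n features, ranked by the tree's feature importances
    model1 = SelectFromModel(tree, prefit=True, max_features=n)
    d = model1.get_support(indices=True)
    print('The features are...')
    print(d)
    return d

# x and y are not defined earlier in the original; they are assumed here to be
# the encoded features and the is_canceled label
x = data_onehot.drop(columns=['is_canceled'])
y = data_onehot['is_canceled']
d = descTree(x,y,30)
all_fea = pd.DataFrame(d)
Result:
Using the decision tree model
The features are...
[  0   1   2   3   4   7   8   9  10  11  12  13  14  16  17  19  20  64
  72  77  80 156 204 211 214 220 223 232 236 237]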

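Feature selection gives us the indices of the 30 selected features. We use them to pick the matching columns and build the final training data.

fea_num = list(d)
# Select the chosen columns by position
data_stand_fea = x.iloc[:, fea_num]
data_stand_fea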
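
One detail of SelectFromModel is worth knowing: when no threshold is given, it also applies an importance threshold (the mean importance for tree models), so max_features acts as an upper limit rather than an exact count. The sketch below, on synthetic data from make_classification, sets threshold=-np.inf so that exactly N_KEEP features are kept. N_KEEP and the synthetic data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the encoded hotel features (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

N_KEEP = 5  # hypothetical number of features to keep

selector = SelectFromModel(
    DecisionTreeClassifier(random_state=0),
    max_features=N_KEEP,
    threshold=-np.inf,  # disable the importance threshold so only max_features applies
)
selector.fit(X, y)

selected = selector.get_support(indices=True)  # column positions that were kept
X_selected = selector.transform(X)             # data reduced to those columns
print(selected, X_selected.shape)
```

Fixing random_state on the tree keeps the selected indices stable between runs; without it, features with similar importance can swap in and out.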

3. Model training

  1. Split the dataset
# Split the dataset 80/20 into training and test sets
X_train, X_test, y_train, y_test = train_test_split(data_stand_fea, y, test_size=0.2)
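
The split above is random on every run. If repeatable results are wanted, train_test_split also accepts random_state, and stratify keeps the share of canceled bookings the same in both parts. A minimal sketch on made-up data (the 30% positive rate and the seed 42 are arbitrary choices for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 30 of them in the positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 30 + [0] * 70)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,        # keep the class ratio the same in both parts
    random_state=42,   # fixed seed so the split is repeatable
)
print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```

With stratification both parts contain 30% positives, so the test score is not distorted by an unlucky split.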
  2. Train with a RandomForest model
clf3 = RandomForestClassifier(n_estimators=160,
                               max_features=0.4,
                               min_samples_split=2,
                               n_jobs=-1,
                               random_state=0)
clf3.fit(X_train,y_train)
  3. Model prediction and evaluation
from sklearn.metrics import accuracy_score
y_pred3 = clf3.predict(X_test)
print('The accuracy of prediction is:', accuracy_score(y_test, y_pred3))
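
Accuracy alone can hide how well the canceled class itself is predicted, especially when the two classes are not balanced. The sketch below uses made-up labels (not results from this dataset) to show how a confusion matrix, precision, and recall can be read alongside accuracy.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 1 = canceled, 0 = not canceled.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of predicted cancellations, how many were real
recall = recall_score(y_true, y_pred)        # of real cancellations, how many were found

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("accuracy:", accuracy, "precision:", round(precision, 3), "recall:", recall)
```

In this toy case accuracy is 0.7, yet only half of the real cancellations are found, which is the kind of gap a single accuracy number does not show.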


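The random forest scores 0.83, which is a good result for a first training run. The score could be improved later by tuning the model parameters or by training a more complex model. Here we do a simple round of parameter tuning.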

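  4. Parameter tuning

GridSearchCV can be used for parameter tuning, but it is best not to search over too many parameter values, otherwise the amount of computation grows and the search becomes very slow. For this model the parameters searched are "n_estimators", the number of decision trees; "max_depth", the depth of each tree (pre-pruning); and "max_features", the maximum number of features considered at each split.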

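rf = RandomForestClassifier()
# Parameter grid ("sqrt" replaces "auto", which newer scikit-learn versions no longer accept)
param_dict = {
    "n_estimators":[100,150,200],"max_depth":[3,5,8,10,15],"max_features":["sqrt","log2"]}
# Grid search tuner
rf_model = GridSearchCV(rf,param_grid=param_dict,cv=3)
# Fit the search directly: the features were already preprocessed above, and the
# Pipeline and preprocessor used in the original were never imported or defined
rf_model.fit(X_train, y_train)
# Best score across the parameter combinations, and the parameters that produced it
rf_model.best_score_
rf_model.best_params_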
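
To get a feel for why large grids are slow, it helps to count the model fits before starting the search. This sketch enumerates a grid with the same shape as the one above (3 x 5 x 2 values) using only the standard library; only the number of values per parameter matters for the count.

```python
from itertools import product

param_dict = {
    "n_estimators": [100, 150, 200],
    "max_depth": [3, 5, 8, 10, 15],
    "max_features": ["sqrt", "log2"],
}
cv = 3  # number of cross-validation folds

# Every combination of one value per parameter is one candidate.
candidates = [
    dict(zip(param_dict, values)) for values in product(*param_dict.values())
]
total_fits = len(candidates) * cv  # each candidate is fitted once per fold

print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

With the default refit=True, GridSearchCV adds one more fit of the best candidate on the full training set, so adding a single extra value to any parameter quickly multiplies the work.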


Origin blog.csdn.net/m0_51353633/article/details/130381647