Traditional Machine Learning (6) Ensemble Algorithms (1) - Random Forest Algorithm and Case Studies
1 Overview
Ensemble learning
Ensemble learning combines multiple models through some strategy and improves decision accuracy through group decision-making.
The primary questions in ensemble learning are which kind of learner to choose and how to combine multiple base learners, that is, the ensemble strategy.
1.1 Classification of ensemble learning
An effective ensemble requires not only that each base learner performs well, but also that the base learners differ from one another as much as possible (difference here means that the base learners' predictions are not exactly the same). Ensemble learning is often most effective when it combines models with large variance.
-
In Bagging-type methods there is no strong dependency between the base learners, so they can be trained in parallel.
-
In Boosting-type methods there is a strong dependency between the base learners, so they must be trained serially.
-
Stacking: aggregate multiple classification or regression models (this can be done in stages).
Both Boosting and Bagging usually use a single type of base learner, so they are generally called homogeneous ensemble methods.
Stacking is usually built on several different types of base learners, so it is called a heterogeneous ensemble method.
1.1.1 Bagging
Bagging-type methods improve the independence of the base models by randomly constructing training samples, randomly selecting features, and similar techniques. Because the training data differ, the resulting learners differ too. However, if the sampled subsets were completely disjoint, each base learner would see only a small part of the data and could not learn effectively, so overlapping sampled subsets are used instead. Representative methods include Bagging and random forest, described below.
-
Bagging (Bootstrap Aggregating) improves the independence between models by making their training data sets independent. We sample the original training set randomly with replacement to obtain T data sets of m samples each, train T models on them in parallel, and then combine these base learners. To combine the base learners, Bagging usually uses simple voting for classification tasks and averaging for regression tasks. If two classes receive the same number of votes, the tie can be broken by random selection or by examining the confidence of the learners' votes.
-
Random Forest adds random feature selection on top of Bagging to further improve the independence between the base models. In a random forest, every base model is a decision tree. Unlike a traditional decision tree, at each node of each base tree RF first randomly selects a subset of k attributes from that node's attribute set and then chooses the best attribute from this subset for the split, whereas a traditional decision tree chooses the best attribute directly from the node's full attribute set.
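The Bagging procedure described above (sample with replacement, train T models, combine by simple voting) can be sketched as follows. This is only an illustrative sketch: the iris data set, T = 25, and the fixed seed are assumptions made for the example, not choices from the text.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)          # fixed seed, chosen only for reproducibility
X, y = load_iris(return_X_y=True)
m = X.shape[0]                          # each bootstrap data set contains m samples
T = 25                                  # number of base learners (arbitrary choice)

models = []
for _ in range(T):
    idx = rng.integers(0, m, size=m)    # random sampling with replacement
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Simple voting: every model casts one vote per sample
votes = np.stack([mdl.predict(X) for mdl in models])          # shape (T, m)
n_classes = len(np.unique(y))
vote_counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)  # shape (n_classes, m)
y_vote = vote_counts.argmax(axis=0)     # ties go to the lowest class index in this sketch
bagging_accuracy = (y_vote == y).mean() # accuracy on the training data
print(bagging_accuracy)
```

Because the T models are independent of each other, the loop could be run in parallel without changing the result.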
1.1.2 Boosting
Boosting-type methods train the base models one after another in a fixed order, with each model trained specifically on the errors of the previous models. Based on the results of the previous model, the weights of the training samples are adjusted, which increases the difference between the base models. The Boosting process closely resembles human learning, which is often iterative. When we study something for the first time, we remember part of it but also make mistakes, and those mistakes leave a deep impression. On the second pass we concentrate on the material we got wrong, so that similar mistakes become rarer. The cycle repeats until the number of mistakes falls to a very low level.
Boosting is a very powerful family of ensemble methods: as long as each base model is more accurate than random guessing, boosting can significantly improve the accuracy of the ensemble. Representative Boosting methods include AdaBoost, GBDT, etc.
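To make the reweighting idea concrete, here is a minimal sketch of an AdaBoost-style loop with decision stumps. The data set (make_moons), the number of rounds, and the particular weight-update formula (discrete AdaBoost for labels in {-1, +1}) are assumptions chosen for illustration; the text above does not prescribe a specific variant.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_moons(n_samples=300, noise=0.25, random_state=0)
y = 2 * y01 - 1                          # relabel classes as -1 / +1
w = np.full(len(y), 1 / len(y))          # start with uniform sample weights
stumps, alphas = [], []

for _ in range(30):                      # 30 rounds, an arbitrary choice
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)     # train on the current weights
    pred = stump.predict(X)
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)                # say of this base model
    w = w * np.exp(-alpha * y * pred)    # misclassified samples get larger weights
    w /= w.sum()                         # renormalize to a distribution
    stumps.append(stump)
    alphas.append(alpha)

# Weighted vote of all stumps
ensemble_pred = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
first_acc = (stumps[0].predict(X) == y).mean()
boosted_acc = (ensemble_pred == y).mean()
print(first_acc, boosted_acc)
```

Unlike Bagging, each round depends on the weights produced by the previous one, which is why the loop cannot be parallelized across rounds.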
1.1.3 Stacking
For a given problem, we can use different types of learners. Each of these learners can usually learn part of the problem, but not the entire problem space. The approach of Stacking is to first build several first-level learners of different types and use them to obtain first-level predictions, and then build a second-level learner on these first-level predictions to obtain the final prediction.
The motivation of Stacking can be described as: if a certain first-level learner learns a certain region of the feature space by mistake, then the second-level learner can properly correct this error by combining the learning behavior of other first-level learners.
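A two-level stack of this kind can be sketched with scikit-learn's StackingClassifier. The choice of first-level learners (a shallow tree and k-nearest neighbours), the logistic-regression second-level learner, and the iris data are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

stack = StackingClassifier(
    estimators=[                       # first-level learners of different types
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # second-level learner
    cv=5,                              # first-level predictions come from cross-validation
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(stack_accuracy)
```

The second-level learner is trained on cross-validated first-level predictions rather than on predictions for data the first-level learners have already seen, which limits leakage between the two levels.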
1.2 Random Forest Principle
Random Forest is an ensemble algorithm that applies simple bagging to decision trees. It trains multiple decision trees on multiple random samples of the data and combines their decisions. Because it consists of many trees, each built with randomness, it is called a random forest.
1.2.1 Model Expressions
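$$y = \frac{1}{k}\left[t_{1}.prob(x) + t_{2}.prob(x) + \dots + t_{k}.prob(x)\right]$$
where:
t(i): the i-th decision tree
t(i).prob(x): the prediction of the i-th tree for x, output as a row vector of predicted probabilities for each class
k: the size of the forest (number of trees)
That is, the model consists of multiple decision trees, and the final predicted probability is the mean of the trees' probability predictions. The class with the highest probability score is the predicted class.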
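The expression can be checked against scikit-learn's RandomForestClassifier by averaging the per-tree probabilities by hand. The data set and forest size below are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

k = len(forest.estimators_)                       # forest size
# y = 1/k * [t_1.prob(x) + ... + t_k.prob(x)]
y_prob = sum(t.predict_proba(X) for t in forest.estimators_) / k
y_class = forest.classes_[np.argmax(y_prob, axis=1)]  # class with the highest mean probability

print(np.allclose(y_prob, forest.predict_proba(X)))
```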
1.2.2 Model training
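The focus of model training is how to train multiple different weak trees. This can be done as follows: each time, randomly draw a set of samples and train a weak tree using only some of the variables. To keep each tree weak, its depth can also be limited. In short, a single tree's prediction need not be very accurate.
-- 1. The training procedure is as follows
Draw n samples with replacement; the number of times each sample is drawn is used as its sample weight.
Train a weak decision tree on the weighted samples (here a weak decision tree means a maximum of 2 features considered per split), and repeat until k trees have been trained.
Put simply, k trees are generated, each on a randomly drawn sample, and the k trees together form the forest.
-- 2. Training parameters
(1) The maximum number of variables m; m is usually much smaller than the total number of variables M, for example sqrt(M)
(2) The size of the forest (the number of trees k)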
1.2.3 Generalization ability of random forest
Out-of-bag error rate
The generalization ability of a random forest can be evaluated with the out-of-bag (OOB) error rate.
The idea of the out-of-bag error rate is to test the forest's generalization ability using the accuracy on samples that did not take part in training (the out-of-bag samples). Since each sample is used to train only some of the trees, the sample can be predicted using only the sub-forest of trees that did not see it. The out-of-bag error rate is obtained by computing the prediction accuracy over all samples in this way.
Out-of-bag class prediction
For a given sample, the out-of-bag class prediction takes the probability predictions of the trees that were not trained on that sample and aggregates them (summing the probabilities and normalizing); the class with the highest aggregated probability is the out-of-bag predicted class.
The out-of-bag prediction accuracy over all samples is the out-of-bag score (oob_score), and the out-of-bag error rate is: oob_error = 1 - oob_score
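It may help to see how many samples are out-of-bag for a single tree. When n samples are drawn with replacement from n, a given sample is left out with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The short simulation below, with an assumed n = 1000 and 200 repetitions, compares that value with the observed fraction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                  # data set size (assumed for the example)
oob_fractions = []
for _ in range(200):                      # 200 simulated bootstrap draws
    idx = rng.integers(0, n, size=n)      # draw n samples with replacement
    oob_fractions.append(1 - np.unique(idx).size / n)  # share never drawn

simulated = float(np.mean(oob_fractions))
theoretical = (1 - 1 / n) ** n            # probability that a sample is never drawn
print(simulated, theoretical)
```

So roughly a third of the samples are available as out-of-bag data for each tree, which is why the OOB score needs no separate validation split.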
1.2.4 Feature Weight
The feature weight is the proportion of each feature's contribution to the random forest. The higher the feature weight, the more important the feature is to the composition of the forest and the greater the impact on the decision result.
Averaging the per-tree feature scores over all trees in the forest and normalizing the result gives the forest's feature scores.
$$s = norm\left(\frac{1}{k}\sum_{i=1}^{k} s_{i}\right)$$
Here s(i) is a vector holding the i-th tree's score for each feature, and k is the number of trees.
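As a quick check of this formula, the sketch below averages the per-tree importances of a scikit-learn forest, normalizes them, and compares the result with the forest's own feature_importances_. The data set and forest size are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# s_i: feature scores of the i-th tree, one row per tree
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])
s = per_tree.mean(axis=0)      # (1/k) * sum of s_i
s = s / s.sum()                # norm(...): rescale so the scores sum to 1

print(s)
print(forest.feature_importances_)
```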
1.2.5 Advantages and disadvantages of random forest
1. Advantages of random forest algorithm
-
Because it is an ensemble, its accuracy is generally better than that of most individual algorithms
-
It performs well on the test set. Because two sources of randomness are introduced (random samples and random features), a random forest does not overfit easily
-
In industrial settings, the two sources of randomness also give random forests some robustness to noise, an advantage over some other algorithms
-
Due to the combination of trees, the random forest can handle nonlinear data, which itself belongs to the nonlinear classification (fitting) model
-
It can handle very high-dimensional (many features) data, and does not need to do feature selection, and has strong adaptability to data sets: it can handle both discrete data and continuous data, and the data set does not need to be normalized
-
The training speed is fast and can be used on large-scale data sets
-
Can handle missing values (treated as a separate category) without additional processing
-
Thanks to out-of-bag data (OOB), an unbiased estimate of the true error can be obtained during model generation without loss of training data volume
-
During the training process, the mutual influence between features can be detected, and the importance of features can be obtained, which has certain reference significance
-
Since each tree can be built independently and simultaneously, training is easy to parallelize
-
Due to its simple implementation, high precision, and strong anti-overfitting ability, it is suitable as a benchmark model when faced with nonlinear data.
2. Disadvantages of random forest algorithm
-
When the number of decision trees in the random forest is large, the space and time required for training will be relatively large
-
Many aspects of a random forest remain hard to interpret; it is something of a black-box model.
-
On some sample sets with relatively large noise, the RF model is prone to overfitting
2. Manually implement random forest
from sklearn.datasets import load_iris
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
np.random.seed(888)
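# ==================== Load data =====================================================
iris = load_iris()
X = iris.data
Y = iris.target
n_samples = X.shape[0] # number of samples
n_samples_bootstrap = X.shape[0] # number of samples drawn per bootstrap
c_n = np.unique(Y).shape[0] # number of classes
tree_num = 100 # number of decision trees in the forest
trees = [] # list of trained trees
p_oob = np.zeros((n_samples, c_n)) # accumulated OOB votes
random_state = np.random.mtrand._rand # global random state
max_features = 2 # maximum number of features considered at each split (at least one valid split is searched for; if none is found within this limit, the limit is ignored)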
# Build the tree template
base_estimator = DecisionTreeClassifier()
base_estimator.set_params(**{
    'criterion': 'gini',
    'min_samples_split': 2,
    'min_samples_leaf': 1,
    'min_weight_fraction_leaf': 0.0,
    'max_features': max_features,
    'max_leaf_nodes': None,
    'min_impurity_decrease': 0.0,
    'random_state': None,
    'ccp_alpha': 0.0
})
# Train the trees one by one
random_state_list = [random_state.randint(np.iinfo(np.int32).max) for i in range(tree_num)] # random seed for each tree
for i in range(tree_num):
    sample_indices = np.random.RandomState(random_state_list[i]).randint(0, n_samples, n_samples_bootstrap) # bootstrap sampling
    sample_counts = np.bincount(sample_indices, minlength=n_samples) # how many times each sample was drawn
    curr_sample_weight = np.ones((n_samples,), dtype=np.float64) * sample_counts # sample weights
    cur_tree = clone(base_estimator) # new tree from the template
    cur_tree.set_params(**{
        'random_state': random_state_list[i]}) # set the random state of the current tree
    cur_tree.fit(X, Y, sample_weight=curr_sample_weight, check_input=False) # train the tree
    trees.append(cur_tree) # add the trained tree to the list
    # Accumulate the OOB votes
    un_select = ~ np.isin(range(n_samples), sample_indices) # samples not drawn for this tree
    cur_p_oob = cur_tree.predict_proba(X[un_select, :]) # predictions for the samples not drawn
    p_oob[un_select, :] += cur_p_oob # add them to the accumulated votes
# =============== Model metrics ===================================
oob_score = np.mean(Y == np.argmax(p_oob, axis=1), axis=0) # accuracy on the OOB samples is the OOB score
# Compute feature scores
all_importances = [getattr(tree, 'feature_importances_') for tree in trees if tree.tree_.node_count > 1] # feature importances of each tree
all_importances = np.mean(all_importances, axis=0, dtype=np.float64) # average
feature_importances = all_importances / np.sum(all_importances) # normalize
# =============== Model prediction ===================================
sim_p = np.zeros((X.shape[0], c_n), dtype=np.float64) # initialize the vote scores
for i in range(len(trees)): # vote tree by tree
    sim_p += trees[i].predict_proba(X) / len(trees) # add this tree's vote
sim_c = np.argmax(sim_p, axis=1) # the class with the highest score is the prediction
# ================= Print results ==========================
print("\n----First 5 predictions:----")
print(sim_p[0:5]) # print the predicted probabilities
print("\n----Out-of-bag accuracy oob_score:----")
print(oob_score) # print the OOB score
print("\n----Feature scores:----")
print(feature_importances)
----First 5 predictions:----
[[1. 0. 0. ]
[0.99 0.01 0. ]
[1. 0. 0. ]
[1. 0. 0. ]
[1. 0. 0. ]]
----Out-of-bag accuracy oob_score:----
0.9533333333333334
----Feature scores:----
[0.08186032 0.02758341 0.44209899 0.44845728]
3. Bagging and Random Forest in sklearn
3.1 bagging
import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
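'''
Bagging strategy:
First sample the training set several times, so that each sampled data set is different;
train a separate model, for example a tree model, on each sample;
at prediction time, obtain the results of all models and then combine them.
'''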
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=500,noise=0.25,random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
plt.plot(X[:,0][y==0],X[:,1][y==0],'yo',alpha = 0.6)
plt.plot(X[:,0][y==1],X[:,1][y==1],'bs',alpha = 0.6)
'''
1. Plain decision tree
'''
from sklearn import tree
from sklearn.metrics import accuracy_score
tree_clf = tree.DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train,y_train)
y_pred = tree_clf.predict(X_test)
print('Decision tree accuracy:',accuracy_score(y_test,y_pred))
# Decision tree accuracy: 0.936
'''
2. Bagging
'''
from sklearn.ensemble import BaggingClassifier
bag_clf = BaggingClassifier(
    tree.DecisionTreeClassifier(), # base learner fitted on random subsets of the data set
    n_estimators=500, # number of base learners
    max_samples=100, # number of samples drawn for each learner
    bootstrap=True, # whether to sample with replacement
    n_jobs=-1,
    oob_score=True,
    random_state=42
)
bag_clf.fit(X_train,y_train)
y_pred = bag_clf.predict(X_test)
print('Bagging accuracy:',accuracy_score(y_test,y_pred))
# Bagging accuracy: 0.952
'''
3. Decision boundary comparison
'''
from matplotlib.colors import ListedColormap
def plot_decision_boundary(clf,X,y,axes=[-1.5,2.5,-1,1.5],alpha=0.5,contour =True):
    x1s=np.linspace(axes[0],axes[1],100)
    x2s=np.linspace(axes[2],axes[3],100)
    x1,x2 = np.meshgrid(x1s,x2s)
    X_new = np.c_[x1.ravel(),x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1,x2,y_pred,cmap = custom_cmap,alpha=0.3)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1,x2,y_pred,cmap = custom_cmap2,alpha=0.8)
    plt.plot(X[:,0][y==0],X[:,1][y==0],'yo',alpha = 0.6)
    plt.plot(X[:,0][y==1],X[:,1][y==1],'bs',alpha = 0.6)
    plt.axis(axes)
    plt.xlabel('x1')
    plt.ylabel('x2')
plt.figure(figsize = (12,5))
plt.subplot(121)
plot_decision_boundary(tree_clf,X,y)
plt.title('Decision Tree')
plt.subplot(122)
plot_decision_boundary(bag_clf,X,y)
plt.title('Decision Tree With Bagging')
plt.show()
# As with a random forest, generalization ability can be evaluated with the out-of-bag (OOB) error.
# oob_error = 1 - oob_score
oob_error = 1 - bag_clf.oob_score_
oob_error
# 0.06399999999999995
3.2 Random Forest
"""
Random forest demo with sklearn
"""
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import numpy as np
np.random.seed(55)
# ==================== Load data =================
iris = load_iris()
X = iris.data
y = iris.target
# ========================= Model training =============
clf = RandomForestClassifier(
    n_jobs=1,
    oob_score=True,
    max_features=2,
    n_estimators=100
)
clf.fit(X, y)
# =============================== Model prediction ========
pred_prob = clf.predict_proba(X)
pred_c = clf.predict(X)
preds = iris.target_names[pred_c]
# ================= Print results ==========================
print("\n----First 5 predictions:----")
print(pred_prob[0:5])
print("\n----Out-of-bag accuracy oob_score:----")
print(clf.oob_score_)
print("\n----Feature scores:----")
print(clf.feature_importances_)
----First 5 predictions:----
[[1. 0. 0.]
[1. 0. 0.]
[1. 0. 0.]
[1. 0. 0.]
[1. 0. 0.]]
----Out-of-bag accuracy oob_score:----
0.9533333333333334
----Feature scores:----
[0.09910252 0.0317905 0.48356935 0.38553764]
4. Random forest case
4.1 Random forest predicts broadband customer churn
Dataset link: https://pan.baidu.com/s/1vmjldkWZtQWlFopWlFLX9w Extraction code: ad1h
Like neural networks, ensemble models are black boxes with poor interpretability, so we do not need to delve into the specific meaning of each variable in the data set. We only need to focus on the last variable, broadband, and try to use variables such as age, tenure, payment status, data traffic, and call activity to predict as accurately as possible whether broadband customers will renew their subscriptions.
1. Data Exploration
import pandas as pd
import numpy as np
df = pd.read_csv('../data/broadband.csv') # broadband customer data
# Convert all column names to lowercase
df.rename(str.lower, axis='columns', inplace=True)
# The only variable we need to focus on is broadband: 0 - left, 1 - retained
df.head()
 | cust_id | gender | age | tenure | channel | autopay | arpb_3m | call_party_cnt | day_mou | afternoon_mou | night_mou | avg_call_length | broadband |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 34 | 27 | 2 | 0 | 203 | 0 | 0.0 | 0.0 | 0.0 | 3.04 | 1 |
1 | 64 | 0 | 62 | 58 | 1 | 0 | 360 | 0 | 0.0 | 1910.0 | 0.0 | 3.30 | 1 |
2 | 65 | 1 | 39 | 55 | 3 | 0 | 304 | 0 | 437.2 | 200.3 | 0.0 | 4.92 | 0 |
3 | 66 | 1 | 39 | 55 | 3 | 0 | 304 | 0 | 437.2 | 182.8 | 0.0 | 4.92 | 0 |
4 | 67 | 1 | 39 | 55 | 3 | 0 | 304 | 0 | 437.2 | 214.5 | 0.0 | 4.92 | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1114 entries, 0 to 1113
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cust_id 1114 non-null int64
1 gender 1114 non-null int64
2 age 1114 non-null int64
3 tenure 1114 non-null int64
4 channel 1114 non-null int64
5 autopay 1114 non-null int64
6 arpb_3m 1114 non-null int64
7 call_party_cnt 1114 non-null int64
8 day_mou 1114 non-null float64
9 afternoon_mou 1114 non-null float64
10 night_mou 1114 non-null float64
11 avg_call_length 1114 non-null float64
12 broadband 1114 non-null int64
dtypes: float64(4), int64(9)
memory usage: 113.3 KB
from collections import Counter
# Check the distribution of broadband; random forests cope well with imbalanced data sets
print('broadband',Counter(df['broadband']))
broadband Counter({0: 908, 1: 206})
2. Split test set and training set
# The customer id carries no information, so drop the cust_id column
X = df.iloc[:,1:-1]
y = df['broadband']
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(
    X,y,test_size=0.3,random_state=888
)
3. Decision tree modeling
from sklearn import tree
# Optimize the decision tree directly with a cross-validated grid search
from sklearn.model_selection import GridSearchCV
# Grid search parameters: the usual decision tree parameters, such as the split criterion, the tree depth, and the minimum number of samples needed to split a node
# Generally speaking, a tree with a dozen or so levels is already fairly deep
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, 7, 8],
    'min_samples_split': [4, 8, 12, 16, 20, 24, 28]
}
clf_cv = GridSearchCV(
    estimator=tree.DecisionTreeClassifier(),
    param_grid=param_grid,
    scoring='roc_auc',
    cv=5
)
clf_cv.fit(X_train,y_train)
pred_y_test = clf_cv.predict(X_test)
import sklearn.metrics as metrics
print("Decision tree AUC:")
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, pred_y_test)
print('AUC = %.4f' % metrics.auc(fpr_test, tpr_test))
Decision tree AUC:
AUC = 0.7763
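One point worth noting about AUC: a ROC curve built from hard 0/1 predictions has only a single operating point, whereas a curve built from predicted probabilities uses every threshold. The sketch below compares the two on synthetic data; the data set from make_classification and the model settings are assumptions for illustration and are unrelated to the broadband data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (about 80% / 20%)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# AUC from hard class labels: a single threshold
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
# AUC from the predicted probability of the positive class: all thresholds
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(auc_from_labels, auc_from_scores)
```

The two numbers generally differ, so AUC values are only comparable when they are computed the same way.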
4. Random Forest Modeling
# Again, use grid search directly
param_grid = {
    'max_depth':[5, 6, 7, 8], # depth of each decision tree in the forest
    'n_estimators':[11,13,15], # number of trees (specific to random forests)
    'max_features':[0.3,0.4,0.5], # fraction of features considered at each split (specific to random forests; see the principle above)
    'min_samples_split':[4,8,12,16] # minimum number of samples required to split a node
}
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc_cv = GridSearchCV(estimator=rfc,
                      param_grid=param_grid,
                      scoring='roc_auc',
                      cv=5)
rfc_cv.fit(X_train, y_train)
# Predict on the test set with the random forest
test_est = rfc_cv.predict(X_test)
print('Random forest AUC...')
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, test_est) # build the ROC curve; arguments are (y_true, y_score)
print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))
Random forest AUC...
AUC = 0.8181
# Check the best parameters; if any of them lies on the edge of the grid, the search grid should be reset
rfc_cv.best_params_
{'max_depth': 7,
'max_features': 0.3,
'min_samples_split': 4,
'n_estimators': 15}
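Checking whether the best parameters sit on the edge of the grid can be automated. The helper below is a hypothetical utility written for this example (params_on_grid_edge is not part of scikit-learn); it is applied to the first random forest grid and the best parameters reported above.

```python
def params_on_grid_edge(best_params, param_grid):
    """Return the names of parameters whose best value is the first or last grid entry."""
    on_edge = []
    for name, value in best_params.items():
        candidates = param_grid[name]
        if len(candidates) > 1 and value in (candidates[0], candidates[-1]):
            on_edge.append(name)
    return on_edge

# Grid and best parameters copied from the search above
first_grid = {
    'max_depth': [5, 6, 7, 8],
    'n_estimators': [11, 13, 15],
    'max_features': [0.3, 0.4, 0.5],
    'min_samples_split': [4, 8, 12, 16],
}
best_params = {'max_depth': 7, 'max_features': 0.3,
               'min_samples_split': 4, 'n_estimators': 15}

edge_params = params_on_grid_edge(best_params, first_grid)
print(edge_params)
```

Here max_features, min_samples_split and n_estimators all land on an edge of their ranges, which is the motivation for widening the grid in the next step.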
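# Adjust the grid boundaries; this is only a demonstration
param_grid = {
    'max_depth':[7, 8, 10, 12],
    'n_estimators':[11, 13, 15, 17, 19], # number of trees (specific to random forests)
    'max_features':[0.2,0.3,0.4, 0.5, 0.6, 0.7], # fraction of features considered at each split (specific to random forests)
    'min_samples_split':[2, 3, 4, 8, 12, 16] # minimum number of samples required to split a node
}
# Repeat the steps above; they could be wrapped in a function for convenience
rfc_cv = GridSearchCV(estimator=rfc,
                      param_grid=param_grid,
                      scoring='roc_auc',
                      n_jobs=-1,
                      cv=5)
rfc_cv.fit(X_train, y_train)
# Predict on the test set with the random forest
test_est = rfc_cv.predict(X_test)
print('Random forest AUC...')
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, test_est) # build the ROC curve; arguments are (y_true, y_score)
print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))
# The AUC improved considerably here
Random forest AUC...
AUC = 0.8765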
4.2 Random Forest Analysis of Factors Affecting Hotel Reservation Cancellation Rate
Dataset download address: https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand
The data set covers two hotels in Portugal: a resort hotel in the Algarve and a city hotel in Lisbon, the capital. The two hotels are geographically far apart, so the data from one is unlikely to interfere much with the other.
1. Data Exploration
import pandas as pd
import matplotlib.pyplot as plt
# Display settings
pd.set_option("display.max_columns", None) # show all columns
plt.rcParams['font.sans-serif']=['SimHei'] # font that renders Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False # render the minus sign correctly
# Load data
data_origin = pd.read_csv("../data/hotel_bookings.csv")
# Back up the data
data=data_origin.copy()
print(data.shape)
data.head()
 | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
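# hotel hotel type
# is_canceled whether the booking was canceled (0, 1)
# lead_time number of days between booking and arrival
# arrival_date_year year of arrival
# arrival_date_month month of arrival
# arrival_date_week_number week number of arrival
# arrival_date_day_of_month day of the month of arrival
# stays_in_weekend_nights number of weekend nights stayed
# stays_in_week_nights number of weekday nights stayed
# adults number of adults
# children number of children
# babies number of babies
# meal meal package (BB: bed and breakfast, HB: half board, FB: full board, Undefined/SC: no meal package)
# country customer's country of origin
# market_segment market segment
# distribution_channel booking channel
# is_repeated_guest whether the customer is a repeat guest
# previous_cancellations number of previous bookings that were canceled
# previous_bookings_not_canceled number of previous bookings that were not canceled
# reserved_room_type room type reserved
# assigned_room_type room type actually assigned
# booking_changes number of changes made to the booking
# deposit_type deposit type (No Deposit, Non Refund, Refundable)
# agent ID of the travel agency that made the booking
# company ID of the company/entity that made the booking
# days_in_waiting_list number of days on the waiting list
# customer_type customer type
# adr average daily rate of the room
# required_car_parking_spaces number of parking spaces requested
# total_of_special_requests number of special requests
# reservation_status final status of the booking (Canceled, Check-Out, No-Show)
# reservation_status_date date on which the final status was last updated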
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hotel 119390 non-null object
1 is_canceled 119390 non-null int64
2 lead_time 119390 non-null int64
3 arrival_date_year 119390 non-null int64
4 arrival_date_month 119390 non-null object
5 arrival_date_week_number 119390 non-null int64
6 arrival_date_day_of_month 119390 non-null int64
7 stays_in_weekend_nights 119390 non-null int64
8 stays_in_week_nights 119390 non-null int64
9 adults 119390 non-null int64
10 children 119386 non-null float64
11 babies 119390 non-null int64
12 meal 119390 non-null object
13 country 118902 non-null object
14 market_segment 119390 non-null object
15 distribution_channel 119390 non-null object
16 is_repeated_guest 119390 non-null int64
17 previous_cancellations 119390 non-null int64
18 previous_bookings_not_canceled 119390 non-null int64
19 reserved_room_type 119390 non-null object
20 assigned_room_type 119390 non-null object
21 booking_changes 119390 non-null int64
22 deposit_type 119390 non-null object
23 agent 103050 non-null float64
24 company 6797 non-null float64
25 days_in_waiting_list 119390 non-null int64
26 customer_type 119390 non-null object
27 adr 119390 non-null float64
28 required_car_parking_spaces 119390 non-null int64
29 total_of_special_requests 119390 non-null int64
30 reservation_status 119390 non-null object
31 reservation_status_date 119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
data.describe()
 | is_canceled | lead_time | arrival_date_year | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | booking_changes | agent | company | days_in_waiting_list | adr | required_car_parking_spaces | total_of_special_requests |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119386.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 103050.000000 | 6797.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 |
mean | 0.370416 | 104.011416 | 2016.156554 | 27.165173 | 15.798241 | 0.927599 | 2.500302 | 1.856403 | 0.103890 | 0.007949 | 0.031912 | 0.087118 | 0.137097 | 0.221124 | 86.693382 | 189.266735 | 2.321149 | 101.831122 | 0.062518 | 0.571363 |
std | 0.482918 | 106.863097 | 0.707476 | 13.605138 | 8.780829 | 0.998613 | 1.908286 | 0.579261 | 0.398561 | 0.097436 | 0.175767 | 0.844336 | 1.497437 | 0.652306 | 110.774548 | 131.655015 | 17.594721 | 50.535790 | 0.245291 | 0.792798 |
min | 0.000000 | 0.000000 | 2015.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | -6.380000 | 0.000000 | 0.000000 |
25% | 0.000000 | 18.000000 | 2016.000000 | 16.000000 | 8.000000 | 0.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 62.000000 | 0.000000 | 69.290000 | 0.000000 | 0.000000 |
50% | 0.000000 | 69.000000 | 2016.000000 | 28.000000 | 16.000000 | 1.000000 | 2.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 14.000000 | 179.000000 | 0.000000 | 94.575000 | 0.000000 | 0.000000 |
75% | 1.000000 | 160.000000 | 2017.000000 | 38.000000 | 23.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 229.000000 | 270.000000 | 0.000000 | 126.000000 | 0.000000 | 1.000000 |
max | 1.000000 | 737.000000 | 2017.000000 | 53.000000 | 31.000000 | 19.000000 | 50.000000 | 55.000000 | 10.000000 | 10.000000 | 1.000000 | 26.000000 | 72.000000 | 21.000000 | 535.000000 | 543.000000 | 391.000000 | 5400.000000 | 8.000000 | 5.000000 |
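# Inspecting the data reveals one anomaly: the average daily rate (adr) should not be negative, so rows with a negative adr should be removed;
# the maximum adr is 5400, far greater than the mean plus 3 standard deviations.
# Use a box plot to look at the outliers:
import seaborn as sns
plt.figure(figsize=(12, 1))
sns.boxplot(x=list(data["adr"]))
plt.show()
# Check missing values
missing=data.isnull().sum()
missing[missing != 0]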
children 4
country 488
agent 16340
company 112593
dtype: int64
2. Data cleaning
1) Filling missing values
① A missing country cannot be inferred, so missing values are marked as unknown;
② A missing children value can be taken to mean that no children stayed and the field was left blank at booking, so missing values are filled with 0;
③ A missing agent can be taken to mean that the booking was not made through a travel agency and the field was left blank, so missing values are filled with 0;
④ A missing company can be taken to mean that the booking was not made by a company and the field was left blank, so missing values are filled with 0;
data_fill = data.fillna(
    {
        "country":"unknown",
        "children":0,
        "agent":0,
        "company":0
    }
)
missing=data_fill.isnull().sum()
missing[missing != 0]
Series([], dtype: int64)
2) Handling outliers
(1) Rows where the total number of adults and children is 0;
(2) Rows where the average daily rate is negative or higher than 1,000;
(3) Undefined and SC in meal both mean no meal package.
# 2) Handling outliers
# 1. The sum of adults and children is 0
drop_a = data_fill[data_fill[["adults","children"]].sum(axis=1) == 0]
# 2. The average daily rate is negative or higher than 1000
drop_b = data_fill[(data_fill["adr"]<0) | (data_fill["adr"]>1000)]
data_fill.drop(drop_a.index,inplace=True)
data_done=data_fill[(data_fill["adr"]>=0) & (data_fill["adr"] <1000)].copy() # copy so that later edits do not act on a view of data_fill
# 3. Undefined and SC in meal both mean no meal package
data_done["meal"] = data_done["meal"].replace({"Undefined":"SC"})
data_done.shape # 119208 records remain after cleaning
(119208, 32)
3. Standardization of numerical features and numerical encoding of categorical features (one-hot encoding)
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,OneHotEncoder
# 1. Standardize the numerical features
num_features = ["lead_time",
"arrival_date_week_number",
"arrival_date_day_of_month",
"stays_in_weekend_nights",
"stays_in_week_nights",
"adults",
"children",
"babies",
"is_repeated_guest",
"previous_cancellations",
"previous_bookings_not_canceled",
"agent",
"company",
"required_car_parking_spaces",
"total_of_special_requests",
"adr"]
num_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant',fill_value=0)), # fill missing values with a chosen constant
        ('scaler', StandardScaler()) # standardize the data
    ]
)
# 2. Encode the categorical features (one-hot)
cat_features = ["hotel",
"arrival_date_month",
"meal",
"market_segment",
"distribution_channel",
"reserved_room_type",
"deposit_type",
"customer_type"]
cat_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown='ignore'))
    ]
)
features = num_features + cat_features
from sklearn.compose import ColumnTransformer
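'''
The SimpleImputer class replaces missing values, a scaler such as StandardScaler (or MinMaxScaler) scales numeric values, and OneHotEncoder encodes categorical variables.
ColumnTransformer in scikit-learn applies different transformations to selected columns.
To use ColumnTransformer, you must specify a list of transformers.
Each transformer is a three-element tuple giving the transformer's name, the transformation to apply, and the columns to apply it to, e.g. (name, object, columns)
'''
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ]
)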
4. Choose random forest modeling and choose the best parameters
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score,train_test_split,GridSearchCV
X = data_done.drop("is_canceled",axis=1)
y = data_done["is_canceled"]
# First-pass tuning of n_estimators
# scorel = []
# for i in range(0,200,10):
# rfc_model = RandomForestClassifier(n_estimators=i+1,
# n_jobs=-1,
# random_state=0)
# rfc = Pipeline(
# steps=[
# ('preprocessor', preprocessor),
# ('model',rfc_model)
# ]
# )
# split = KFold(n_splits=10, shuffle=True, random_state=42)
# rfc_t_s = cross_val_score(rfc,
# X,
# y,
# cv=split,
# scoring="accuracy",
# n_jobs=-1
# ).mean()
# print('step',rfc_t_s)
# scorel.append(rfc_t_s)
# print(max(scorel),(scorel.index(max(scorel))*10) + 1)
# plt.figure(figsize=[20,5])
# plt.plot(range(1,201,10),scorel)
# plt.show()
# Fine search over n_estimators; note that training on this dataset is slow
n_range = range(151, 171)  # the n_estimators values actually trained
scorel = []
for n in n_range:
    rfc_model = RandomForestClassifier(n_estimators=n,
                                       n_jobs=-1,
                                       random_state=0)
    rfc = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('model', rfc_model)
        ]
    )
    split = KFold(n_splits=10, shuffle=True, random_state=42)
    rfc_t_s = cross_val_score(rfc,
                              X,
                              y,
                              cv=split,
                              scoring="accuracy",
                              n_jobs=-1
                              ).mean()
    print('n_estimators:', n, ', score=', rfc_t_s)
    scorel.append(rfc_t_s)
# The plot suggests that n_estimators should be 161
print(max(scorel), n_range[scorel.index(max(scorel))])
plt.figure(figsize=[20,5])
plt.plot(n_range, scorel)
plt.show()
param_grid = {
    "model__max_depth": [*range(1,40,10)],
    'model__min_samples_leaf': [*range(1,50,10)]
}
rfc_model_t = RandomForestClassifier(n_estimators=161,
                                     criterion="gini",
                                     max_features=0.4,
                                     n_jobs=-1,
                                     random_state=0)
rfc_t = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('model', rfc_model_t)
    ]
)
split_t = KFold(n_splits=10, shuffle=True, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
GS = GridSearchCV(rfc_t, param_grid, cv=split_t)
GS.fit(X_train, y_train)
GS.best_params_
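The grid keys above use the `<step name>__<parameter name>` convention, which is how GridSearchCV reaches a parameter of a step inside a Pipeline. The following is a small sketch of the same pattern on synthetic data; the dataset, the `demo_*` names, and the reduced grid are all hypothetical and chosen only so that the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the hotel data (hypothetical, for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# "model__max_depth" sets max_depth on the pipeline step named "model".
demo_grid = {
    "model__max_depth": [2, 5],
    "model__min_samples_leaf": [1, 10],
}

demo_search = GridSearchCV(
    demo_pipe,
    demo_grid,
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)  # best combination found in the grid
print(demo_search.best_score_)   # mean cross-validated accuracy of that combination
n_candidates = len(demo_search.cv_results_["params"])  # 2 x 2 combinations tried
```

Because the preprocessing lives inside the pipeline, it is refit on each training fold, so the validation folds do not leak into the scaler or encoder.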
5. Train the model with the best parameters
# Train and evaluate the model
rfc_model = RandomForestClassifier(n_estimators=161
                                   ,criterion="gini"
                                   ,max_depth=31
                                   ,min_samples_leaf=1
                                   ,max_features=0.4
                                   ,n_jobs=-1
                                   ,random_state=0)
rfc = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', rfc_model)
])
rfc = rfc.fit(X_train, y_train)
score_ = rfc.score(X_test, y_test)
score_
0.8683835248720745
# Cross-validation
split = KFold(n_splits=10, shuffle=True, random_state=42)
rfc_s = cross_val_score(rfc,
                        X,
                        y,
                        cv=split,
                        scoring="accuracy",
                        n_jobs=-1)
plt.plot(range(1,11), rfc_s, label="RandomForest")
plt.legend()
plt.show()
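Besides plotting the per-fold scores, it is often useful to summarize them as a mean and a standard deviation. A minimal sketch on synthetic data follows; the dataset and the `cv_*` and `fold_scores` names are hypothetical stand-ins for `X`, `y`, `rfc`, and `rfc_s` above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (hypothetical, for illustration only).
X_cv, y_cv = make_classification(n_samples=300, n_features=10, random_state=0)

cv_model = RandomForestClassifier(n_estimators=30, random_state=0)
cv_split = KFold(n_splits=10, shuffle=True, random_state=42)

# One accuracy value per fold.
fold_scores = cross_val_score(cv_model, X_cv, y_cv, cv=cv_split, scoring="accuracy")

# A small spread across folds suggests the single train/test score is not a fluke.
print(f"accuracy: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```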
6. Feature Importance
# Feature importance
import eli5
onehot_columns = list(rfc.named_steps['preprocessor'].
                      named_transformers_['cat'].
                      named_steps['onehot'].
                      get_feature_names_out(input_features=cat_features))
feat_imp_list = num_features + onehot_columns
# Top 10 features by importance, with their weights
feat_imp_df = eli5.formatters.as_dataframe.explain_weights_df(
    rfc.named_steps['model'],
    feature_names=feat_imp_list)
feat_imp_df.head(10)
| | feature | weight | std |
|---|---|---|---|
| 0 | deposit_type_Non Refund | 0.144919 | 0.108617 |
| 1 | lead_time | 0.140804 | 0.016317 |
| 2 | adr | 0.093004 | 0.003363 |
| 3 | deposit_type_No Deposit | 0.075658 | 0.103332 |
| 4 | arrival_date_day_of_month | 0.066274 | 0.002718 |
| 5 | arrival_date_week_number | 0.053368 | 0.002609 |
| 6 | total_of_special_requests | 0.052200 | 0.011193 |
| 7 | agent | 0.043441 | 0.005135 |
| 8 | previous_cancellations | 0.041885 | 0.014677 |
| 9 | stays_in_week_nights | 0.039749 | 0.002220 |
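A table of this shape can also be built without eli5. The sketch below assumes that, for a random forest, `weight` corresponds to scikit-learn's impurity-based `feature_importances_` and `std` to the spread of that importance across the individual trees; that reading of eli5's output is an assumption, and the synthetic data and feature names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names f0..f5.
X_imp, y_imp = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
names = [f"f{i}" for i in range(X_imp.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_imp, y_imp)

# Importance of every feature in every individual tree: shape (n_trees, n_features).
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

imp_df = (
    pd.DataFrame({
        "feature": names,
        "weight": forest.feature_importances_,  # forest-level importance, sums to 1
        "std": per_tree.std(axis=0),            # disagreement between trees
    })
    .sort_values("weight", ascending=False)
    .reset_index(drop=True)
)
print(imp_df.head(10))
```

A large `std` relative to `weight`, as for the two `deposit_type` rows in the table, indicates that the trees disagree strongly about that feature, which can happen when correlated features share the credit.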
7. Further analysis of the factors with a weight greater than 10% (lead time and deposit type)
# 1. Relationship between cancellation and lead time
plt.figure(figsize=(12, 8))
lead_cancel_data = data_done.groupby("lead_time")["is_canceled"].describe()
sns.scatterplot(x=lead_cancel_data.index, y=lead_cancel_data["mean"].values * 100, color='olive')
plt.title("Effect of lead time on cancellation rate", fontsize=16)
plt.xlabel("Lead time", fontsize=16)
plt.ylabel("Cancellations [%]", fontsize=16)
plt.show()
The longer the lead time, the higher the cancellation rate. The hotel should weigh its operating costs and consider capping how far in advance a booking can be made, so that rooms are not tied up by bookings that are likely to be canceled.
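The scatter plot uses one point per distinct lead time, so points backed by only a few bookings can be noisy. A possible complement is to bin lead time and report the booking count next to each rate, as in the sketch below. The toy data, the bin edges, and the assumption that lead time is measured in days are all hypothetical.

```python
import pandas as pd

# Hypothetical bookings: lead time (assumed to be in days) and cancellation flag.
toy_bookings = pd.DataFrame({
    "lead_time":   [3, 10, 25, 40, 70, 95, 150, 200, 320, 400],
    "is_canceled": [0,  0,  0,  1,  0,  1,   1,   0,   1,   1],
})

# Illustrative bin edges; choose edges that suit the real distribution.
bins = [0, 30, 90, 180, 365, float("inf")]
labels = ["0-30", "31-90", "91-180", "181-365", "366+"]
toy_bookings["lead_bin"] = pd.cut(
    toy_bookings["lead_time"], bins=bins, labels=labels, include_lowest=True
)

# Cancellation rate and number of bookings per bin.
binned = toy_bookings.groupby("lead_bin", observed=False)["is_canceled"].agg(["mean", "count"])
binned["cancel_rate_pct"] = binned["mean"] * 100
print(binned)
```

On the real data the same pattern would be applied to `data_done`, and the `count` column shows how much evidence supports each rate.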
# 2. Relationship between cancellation and deposit type
plt.figure(figsize=(12, 8))
deposit_cancel_data = data_done.groupby("deposit_type")["is_canceled"].describe()
sns.barplot(x=deposit_cancel_data.index, y=deposit_cancel_data["mean"] * 100,palette="Blues")
plt.title("Effect of deposit_type on cancelation", fontsize=16)
plt.xlabel("Deposit type", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.show()
Bookings with non-refundable deposits have the highest cancellation rate of all deposit types. This runs counter to what consumer psychology would suggest and calls for further analysis.
Among all market segments, offline travel agency bookings and group bookings have the highest cancellation rates.
By deposit type, bookings without a deposit are concentrated in online bookings, while bookings with non-refundable deposits are concentrated in offline travel agency bookings and group bookings, which drives up the cancellation rate for non-refundable deposits.
The hotel could design a questionnaire or interview offline travel agencies and groups to explore the causes and possible solutions.
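One way to check this explanation is to cross-tabulate deposit type against market segment. The sketch below shows the idea on a toy frame; the rows and the segment labels are hypothetical, and on the real data the same calls would be applied to `data_done`.

```python
import pandas as pd

# Hypothetical bookings, for illustration only.
toy_seg = pd.DataFrame({
    "deposit_type": [
        "No Deposit", "No Deposit", "No Deposit", "No Deposit",
        "Non Refund", "Non Refund", "Non Refund", "Non Refund",
    ],
    "market_segment": [
        "Online TA", "Online TA", "Direct", "Offline TA/TO",
        "Offline TA/TO", "Groups", "Groups", "Offline TA/TO",
    ],
    "is_canceled": [0, 1, 0, 0, 1, 1, 1, 0],
})

# Share of each deposit type's bookings that comes from each market segment
# (every row sums to 1).
segment_share = pd.crosstab(
    toy_seg["deposit_type"], toy_seg["market_segment"], normalize="index"
)

# Cancellation rate for each (deposit type, market segment) pair;
# pairs with no bookings show up as NaN.
cancel_rate = toy_seg.pivot_table(
    index="deposit_type", columns="market_segment",
    values="is_canceled", aggfunc="mean",
)

print(segment_share)
print(cancel_rate)
```

If the non-refundable bookings really are dominated by offline travel agency and group bookings, the `Non Refund` row of `segment_share` will show it directly.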