Traditional Machine Learning (6) Ensemble Algorithms (1): Random Forest Algorithm and Case Details

1. Overview

Ensemble learning combines multiple models through some strategy and improves accuracy through group decision-making.

The primary questions in ensemble learning are which base learners to choose and how to combine them, that is, the combination strategy.

1.1 Categories of ensemble learning

Besides each base learner performing well on its own, an effective ensemble requires the base learners to differ from one another as much as possible (difference: the base learners' predictions are not exactly the same). Ensemble learning is often most effective when it combines models with high variance.

  • In Bagging-style methods there is no strong dependency between the base learners, so they can be trained in parallel.

  • In Boosting-style methods there is a strong dependency between the base learners, so they must be trained serially.

  • Stacking: aggregates multiple classification or regression models (can be done in stages).

Boosting and Bagging usually use the same type of base learner, so they are generally called homogeneous ensemble methods.

Stacking is usually built on several different types of base learners, so it is called a heterogeneous ensemble method.

1.1.1 Bagging

Bagging-style methods improve the independence of the base models by randomly constructing training samples and randomly selecting features. Because the training data differ, the resulting learners differ as well. However, if the sampled subsets were completely disjoint, each base learner could only be trained on a small part of the data and could not learn effectively, so overlapping sampled subsets are used. Representative methods include Bagging and Random Forest, described below.

  • Bagging (Bootstrap Aggregating) improves the independence between models through the independence of their training sets. We sample the original training set randomly with replacement to obtain T data sets of m samples each, train T models on them in parallel, and then combine these base models. To combine the base learners, Bagging typically uses simple voting for classification tasks and averaging for regression tasks. If two classes receive the same number of votes, the tie can be broken at random or by comparing the confidence of the learners' votes.

  • Random Forest introduces random feature selection on top of Bagging to further improve the independence between the base models. In a random forest, each base model is a decision tree. Unlike a traditional decision tree, at each node of each base tree RF first randomly selects a subset of k attributes from that node's attribute set and then chooses the optimal splitting attribute from this subset, whereas a traditional decision tree selects the optimal attribute directly from the full attribute set of the current node (see the sketch after this list).
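To make the two bullets above concrete, here is a minimal sketch of the core loop: bootstrap sampling with replacement, a random feature subset per split, and majority voting. The data set and all settings are illustrative, not part of the original text.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)
T, n = 25, X.shape[0]

trees = []
for _ in range(T):
    idx = rng.randint(0, n, n)  # bootstrap: n draws with replacement
    tree = DecisionTreeClassifier(max_features=2,  # random subset of 2 features per split
                                  random_state=rng.randint(1 << 30))
    trees.append(tree.fit(X[idx], y[idx]))  # each tree sees a different bootstrap set

# Simple voting: each tree casts one vote per sample; the majority class wins.
votes = np.stack([t.predict(X) for t in trees])  # shape (T, n)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("ensemble training accuracy:", (majority == y).mean())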

1.1.2 Boosting

Boosting-style methods train different base models one after another in a fixed order, each model trained specifically on the errors of the previous ones. The weights of the training samples are adjusted according to the previous model's results, which increases the differences between base models. The Boosting process closely resembles how humans learn: learning new knowledge is usually iterative. On a first pass we remember part of the material but make some mistakes, and those mistakes leave a deep impression. On the second pass we focus on what we got wrong, so similar mistakes become rarer. The cycle repeats until the number of mistakes drops to a very low level.

Boosting-style methods are a very powerful family of ensembles: as long as the base models are more accurate than random guessing, the ensemble can significantly improve accuracy. Representative Boosting methods include AdaBoost and GBDT.
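As a quick, hedged illustration of that claim (the data set and settings are made up for this sketch): depth-1 decision "stumps" are only slightly better than random guessing on their own, yet AdaBoost turns them into a much stronger model.

from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)  # a single weak learner
print("single stump:", stump.score(X_te, y_te))

# AdaBoost's default base learner is a depth-1 tree; run 200 rounds of boosting
ada = AdaBoostClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
print("AdaBoost    :", ada.score(X_te, y_te))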

1.1.3 Stacking

For a given problem, we can use different types of learners. Each learner can usually learn part of the problem but not its entire space. The practice of Stacking is: first build several first-level learners of different types and use them to produce first-level predictions; then, on top of these first-level predictions, build a second-level learner that produces the final prediction.

The motivation of Stacking can be described as follows: if some first-level learner has incorrectly learned a certain region of the feature space, the second-level learner can correct the error by combining the learning behavior of the other first-level learners.
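A minimal sketch of this two-level setup using scikit-learn's StackingClassifier; the choice of first-level learners and data set here is illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[('dt', DecisionTreeClassifier(max_depth=3)),  # first-level learners
                ('knn', KNeighborsClassifier())],             # of different types
    final_estimator=LogisticRegression(max_iter=1000),  # second-level learner on their predictions
    cv=5)  # first-level predictions are produced out-of-fold to avoid leakage
print(cross_val_score(stack, X, y, cv=5).mean())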

1.2 Random Forest Principle

Random Forest is an ensemble algorithm that applies simple bagging to decision trees. It trains multiple decision trees on repeated random samples and makes decisions by combining the trees. Because it has many trees and each tree is random, it is called a random forest.


1.2.1 Model Expressions

y = \frac{1}{k}\left[ t_{1}.prob(x) + t_{2}.prob(x) + \dots + t_{k}.prob(x) \right]

where:
 t(i)          : the i-th decision tree
 t(i).prob(x)  : the i-th tree's prediction for x, output as predicted probabilities for each class (a row vector)
 k             : the size of the forest (number of trees)
 
That is, the model consists of multiple decision trees, and the final predicted probability is the mean of the trees' probability predictions. The class with the highest probability score is the predicted class.
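For example (numbers invented purely for illustration), with k = 3 trees and three classes:

import numpy as np

# Hypothetical per-tree probability row-vectors t_i.prob(x) for one sample x
probs = np.array([[0.9, 0.1, 0.0],   # tree 1
                  [0.6, 0.3, 0.1],   # tree 2
                  [0.7, 0.2, 0.1]])  # tree 3
y = probs.mean(axis=0)  # (1/k) * sum of the trees' probability vectors
print(y)                # [0.7333... 0.2 0.0667...]
print(y.argmax())       # 0 -> the class with the highest mean probability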

1.2.2 Model training

The focus of model training is how to train multiple different weak trees. This can be done as follows: each time, randomly select some samples and train a weak tree on some of the variables. To keep each tree weak, its depth can be kept shallow; in short, an individual tree's predictions are not very accurate.

-- 1. Training procedure

Draw n samples with replacement, and use each sample's draw count as its sample weight.
Train a weak decision tree on the weighted samples (weak tree here means, e.g., at most 2 features per split), and repeat until K trees have been trained.
Simply put: generate k trees, each trained on a random bootstrap sample; the k trees together form the forest.

-- 2. Training parameters
 (1) Maximum number of variables per split m; m is usually much smaller than the total number of variables M, e.g. sqrt(M)
 (2) Forest size (number of trees k)

1.2.3 Generalization ability of random forest

Out-of-bag error rate

A random forest can evaluate its generalization ability with the out-of-bag (OOB) error rate.

The idea of the out-of-bag error rate is to test the forest's generalization ability by the accuracy on samples that did not participate in training (out-of-bag samples). Since each sample is used for training by only some of the trees, each sample can be predicted using only the sub-forest of trees it did not help train. The out-of-bag error rate computes the prediction accuracy over all samples in this way.
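A useful piece of standard bootstrap arithmetic (added here for context, not in the original text): the probability that a given sample is never drawn in a bootstrap sample of size n is (1 - 1/n)^n ≈ e^{-1} ≈ 0.368, so each sample is out-of-bag for roughly 36.8% of the trees. This is why the OOB sub-forest behind each sample is typically large enough to give a reliable estimate.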

Out-of-bag class prediction

Out-of-bag class prediction means: for each sample, take the probability predictions of the trees that did not train on that sample, aggregate them (normalize the summed probabilities), and take the class with the highest probability as the out-of-bag predicted class.


The out-of-bag prediction accuracy over all samples is the out-of-bag score (oob_score), and the out-of-bag error rate is: oob_error = 1 - oob_score

1.2.4 Feature Weight

The feature weight is each feature's share of the contribution to the random forest. The higher a feature's weight, the more important it is to the forest and the greater its impact on the decision results.

Averaging the per-tree feature scores over the forest and then normalizing gives the forest's feature scores:

s = \mathrm{norm}\left(\frac{1}{k}\sum\limits_{i=1}^{k}s_{i}\right)

where s(i) is a vector holding the i-th tree's score for each feature, and k is the number of trees.
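A minimal sketch of this computation (assuming trees is a list of fitted scikit-learn decision trees, as in the manual implementation in Section 2 below):

import numpy as np

def forest_feature_scores(trees):
    # (1/k) * sum of the per-tree score vectors s_i, then normalized
    s = np.mean([t.feature_importances_ for t in trees], axis=0)
    return s / s.sum()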

1.2.5 Advantages and disadvantages of random forest

1. Advantages of random forest algorithm

  • Because it is an ensemble, its accuracy is higher than that of most individual algorithms.

  • It performs well on the test set: thanks to the two sources of randomness (random samples, random features), a random forest does not easily overfit.

  • In industry, the two sources of randomness also give random forests some robustness to noise, an advantage over other algorithms.

  • Because it combines trees, a random forest can handle nonlinear data; it is itself a nonlinear classification (and fitting) model.

  • It can handle very high-dimensional data (many features) without feature selection and adapts well to data sets: it handles both discrete and continuous data, and the data does not need to be normalized.

  • Training is fast, and it can be used on large-scale data sets.

  • It can handle missing values (treated as a separate category) without extra preprocessing.

  • Thanks to out-of-bag (OOB) data, an unbiased estimate of the true error can be obtained during training without sacrificing any training data.

  • During training, interactions between features can be detected and feature importances obtained, which is a useful reference.

  • Since the trees are generated independently of each other, training is easy to parallelize.

  • Thanks to its simple implementation, high accuracy, and strong resistance to overfitting, it is a good benchmark model for nonlinear data.

2. Disadvantages of random forest algorithm

  • When the number of decision trees is large, training requires considerable time and space.

  • Many aspects of a random forest are still hard to interpret; it is somewhat of a black-box model.

  • On sample sets with a lot of noise, an RF model is still prone to overfitting.

2. Implementing a random forest by hand

from sklearn.datasets import load_iris
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
np.random.seed(888)

# ==================== Load data =====================================================
iris = load_iris()
X = iris.data
Y = iris.target
n_samples = X.shape[0]            # number of samples
n_samples_bootstrap = X.shape[0]  # bootstrap sample size
c_n = np.unique(Y).shape[0]       # number of classes
tree_num = 100                    # number of trees in the forest
trees = []                        # list of trained trees
p_oob = np.zeros((n_samples, c_n))     # accumulated OOB votes
random_state = np.random.mtrand._rand  # random state
max_features = 2                       # max features per split (at least one split point is required; if none qualifies, the limit is ignored)
# Build the tree template
base_estimator = DecisionTreeClassifier()
base_estimator.set_params(**{'criterion': 'gini',
                             'min_samples_split': 2,
                             'min_samples_leaf': 1,
                             'min_weight_fraction_leaf': 0.0,
                             'max_features': max_features,
                             'max_leaf_nodes': None,
                             'min_impurity_decrease': 0.0,
                             'random_state': None,
                             'ccp_alpha': 0.0
                             })

# Train tree by tree
random_state_list = [random_state.randint(np.iinfo(np.int32).max) for i in range(tree_num)]  # per-tree random states
for i in range(tree_num):
    sample_indices = np.random.RandomState(random_state_list[i]).randint(0, n_samples, n_samples_bootstrap)  # bootstrap sample
    sample_counts = np.bincount(sample_indices, minlength=n_samples)  # draw count per sample
    curr_sample_weight = np.ones((n_samples,), dtype=np.float64) * sample_counts  # sample weights
    cur_tree = clone(base_estimator)  # new tree from the template
    cur_tree.set_params(**{'random_state': random_state_list[i]})  # set this tree's random state
    cur_tree.fit(X, Y, sample_weight=curr_sample_weight, check_input=False)  # fit the tree
    trees.append(cur_tree)  # add the trained tree to the forest

    # Accumulate the OOB votes
    un_select = ~ np.isin(range(n_samples), sample_indices)  # samples not drawn for this tree
    cur_p_oob = cur_tree.predict_proba(X[un_select, :])  # predictions for the out-of-bag samples
    p_oob[un_select, :] += cur_p_oob  # add them to the vote totals

# =============== Model statistics ===================================
oob_score = np.mean(Y == np.argmax(p_oob, axis=1), axis=0)  # OOB accuracy is the OOB score

# Feature scores
all_importances = [getattr(tree, 'feature_importances_') for tree in trees if tree.tree_.node_count > 1]  # per-tree feature scores
all_importances = np.mean(all_importances, axis=0, dtype=np.float64)  # average over trees
feature_importances = all_importances / np.sum(all_importances)       # normalize

# =============== Model prediction ===================================
sim_p = np.zeros((X.shape[0], c_n), dtype=np.float64)  # initialize the vote scores
for i in range(len(trees)):  # vote tree by tree
    sim_p += trees[i].predict_proba(X) / len(trees)    # average the probabilities
sim_c = np.argmax(sim_p, axis=1)  # the highest-scoring class is the prediction

# ================= Print results ==========================
print("\n---- First 5 predictions: ----")
print(sim_p[0:5])
print("\n---- Out-of-bag accuracy oob_score: ----")
print(oob_score)
print("\n---- Feature scores: ----")
print(feature_importances)
---- First 5 predictions: ----
[[1.   0.   0.  ]
 [0.99 0.01 0.  ]
 [1.   0.   0.  ]
 [1.   0.   0.  ]
 [1.   0.   0.  ]]

---- Out-of-bag accuracy oob_score: ----
0.9533333333333334

---- Feature scores: ----
[0.08186032 0.02758341 0.44209899 0.44845728]

3. Bagging and Random Forest in sklearn

3.1 Bagging

import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)


'''
Bagging strategy:
First, sample the training set multiple times so that each sampled data set is different;
then train several models separately, e.g. tree models;
at prediction time, collect the results of all models and aggregate them.
'''

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500,noise=0.25,random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

plt.plot(X[:,0][y==0],X[:,1][y==0],'yo',alpha = 0.6)  # class 0
plt.plot(X[:,0][y==1],X[:,1][y==1],'bs',alpha = 0.6)  # class 1


'''
1. Plain decision tree
'''
from sklearn import tree
from sklearn.metrics import accuracy_score

tree_clf = tree.DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train,y_train)

y_pred = tree_clf.predict(X_test)
print('Plain decision tree accuracy:',accuracy_score(y_test,y_pred))
# Plain decision tree accuracy: 0.936
'''
2. Bagging
'''
from sklearn.ensemble import BaggingClassifier


bag_clf = BaggingClassifier(
    tree.DecisionTreeClassifier(), # base learner fit on random subsets of the data set
    n_estimators=500,              # number of base learners
    max_samples=100,               # max samples drawn for each learner
    bootstrap=True,                # sample with replacement
    n_jobs=-1,
    oob_score=True,
    random_state=42
)
bag_clf.fit(X_train,y_train)

y_pred = bag_clf.predict(X_test)
print('Bagging accuracy:',accuracy_score(y_test,y_pred))
# Bagging accuracy: 0.952
'''
3. Comparing decision boundaries
'''
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf,X,y,axes=[-1.5,2.5,-1,1.5],alpha=0.5,contour =True):
    x1s=np.linspace(axes[0],axes[1],100)
    x2s=np.linspace(axes[2],axes[3],100)
    x1,x2 = np.meshgrid(x1s,x2s)
    X_new = np.c_[x1.ravel(),x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1,x2,y_pred,cmap = custom_cmap,alpha=0.3)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1,x2,y_pred,cmap = custom_cmap2,alpha=0.8)
    plt.plot(X[:,0][y==0],X[:,1][y==0],'yo',alpha = 0.6)
    plt.plot(X[:,0][y==1],X[:,1][y==1],'bs',alpha = 0.6)
    plt.axis(axes)
    plt.xlabel('x1')
    plt.ylabel('x2')
    
    
plt.figure(figsize = (12,5))

plt.subplot(121)
plot_decision_boundary(tree_clf,X,y)
plt.title('Decision Tree')

plt.subplot(122)
plot_decision_boundary(bag_clf,X,y)
plt.title('Decision Tree With Bagging')
plt.show()


# A random forest can evaluate its generalization ability with the out-of-bag (OOB) error.
# oob_error = 1 - oob_score
oob_error = 1 - bag_clf.oob_score_
oob_error
# 0.06399999999999995

3.2 Random Forest

"""
sklearn的随机森林Demo
"""
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import numpy as np

np.random.seed(55)
# ==================== Load data =================
iris = load_iris()
X   = iris.data
y   = iris.target

# ========================= Model training =============
clf = RandomForestClassifier(
    n_jobs=1,
    oob_score=True,
    max_features=2,
    n_estimators=100
)
clf.fit(X, y)

# =============================== Model prediction ========
pred_prob = clf.predict_proba(X)
pred_c    = clf.predict(X)
preds     = iris.target_names[pred_c]

#================= Print results ==========================
print("\n---- First 5 predictions: ----")
print(pred_prob[0:5])
print("\n---- Out-of-bag accuracy oob_score: ----")
print(clf.oob_score_)
print("\n---- Feature scores: ----")
print(clf.feature_importances_)
---- First 5 predictions: ----
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]

---- Out-of-bag accuracy oob_score: ----
0.9533333333333334

---- Feature scores: ----
[0.09910252 0.0317905  0.48356935 0.38553764]

4. Random forest case studies

4.1 Predicting broadband customer churn with a random forest

Dataset link: https://pan.baidu.com/s/1vmjldkWZtQWlFopWlFLX9w Extraction code: ad1h

Like neural networks, ensemble learning is a black-box model with limited interpretability, so we do not need to delve into the specific meaning of every variable in the data set. We only need to pay attention to the last variable, broadband, and try to predict accurately whether a broadband customer will renew, using variables such as age, tenure, payment status, traffic, and call behavior.

1. Data Exploration

import pandas as pd
import numpy as np

df = pd.read_csv('../data/broadband.csv') # broadband customer data

# Lowercase all column names
df.rename(str.lower, axis='columns', inplace=True)

# The variable to focus on is broadband: 0 = churned, 1 = retained
df.head()
cust_id gender age tenure channel autopay arpb_3m call_party_cnt day_mou afternoon_mou night_mou avg_call_length broadband
0 63 1 34 27 2 0 203 0 0.0 0.0 0.0 3.04 1
1 64 0 62 58 1 0 360 0 0.0 1910.0 0.0 3.30 1
2 65 1 39 55 3 0 304 0 437.2 200.3 0.0 4.92 0
3 66 1 39 55 3 0 304 0 437.2 182.8 0.0 4.92 0
4 67 1 39 55 3 0 304 0 437.2 214.5 0.0 4.92 0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1114 entries, 0 to 1113
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   cust_id          1114 non-null   int64  
 1   gender           1114 non-null   int64  
 2   age              1114 non-null   int64  
 3   tenure           1114 non-null   int64  
 4   channel          1114 non-null   int64  
 5   autopay          1114 non-null   int64  
 6   arpb_3m          1114 non-null   int64  
 7   call_party_cnt   1114 non-null   int64  
 8   day_mou          1114 non-null   float64
 9   afternoon_mou    1114 non-null   float64
 10  night_mou        1114 non-null   float64
 11  avg_call_length  1114 non-null   float64
 12  broadband        1114 non-null   int64  
dtypes: float64(4), int64(9)
memory usage: 113.3 KB
from collections import Counter

# Check the distribution of broadband; random forests are good at handling imbalanced data sets
print('broadband',Counter(df['broadband']))
broadband Counter({0: 908, 1: 206})

2. Split into training and test sets

# The customer id is not useful, so drop the cust_id column
X = df.iloc[:,1:-1]

y = df['broadband']

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(
    X,y,test_size=0.3,random_state=888
)

3. Decision tree modeling

from sklearn import tree

# Optimize the decision tree model directly with cross-validated grid search
from sklearn.model_selection import GridSearchCV

# Grid-search parameters: the usual decision-tree parameters - scoring metric, tree depth, minimum samples per split
# Generally speaking, a tree a dozen or so levels deep is already quite deep
param_grid = {
    'max_depth': [2, 3, 4, 5, 6, 7, 8],
    'min_samples_split': [4, 8, 12, 16, 20, 24, 28]
}

clf_cv = GridSearchCV(
    estimator=tree.DecisionTreeClassifier(),
    param_grid=param_grid,
    scoring='roc_auc',
    cv=5
)

clf_cv.fit(X_train,y_train)
pred_y_test = clf_cv.predict(X_test)

import sklearn.metrics as metrics

print("决策树 AUC:")
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, pred_y_test)
print('AUC = %.4f' % metrics.auc(fpr_test, tpr_test))
决策树 AUC:
AUC = 0.7763

4. Random Forest Modeling

# Again, use grid search directly
param_grid = {
    'max_depth':[5, 6, 7, 8],       # depth: here, the depth of each decision tree in the forest
    'n_estimators':[11,13,15],      # number of trees - a random forest specific parameter
    'max_features':[0.3,0.4,0.5],   # fraction of variables used per tree - a random forest specific parameter (cf. the principle above)
    'min_samples_split':[4,8,12,16] # minimum sample count to split a node
}

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc_cv = GridSearchCV(estimator=rfc,
                      param_grid=param_grid,
                      scoring='roc_auc',
                      cv=5)

rfc_cv.fit(X_train, y_train)

# Predict on the test set with the random forest
test_est = rfc_cv.predict(X_test)
print('Random forest AUC...')
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, test_est) # build the ROC curve (true labels come first)
print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))
Random forest AUC...
AUC = 0.8181
# Check the best parameters: if any lie on the boundary of the grid, the search grid needs to be reset
rfc_cv.best_params_
{'max_depth': 7,
 'max_features': 0.3,
 'min_samples_split': 4,
 'n_estimators': 15}
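The boundary check mentioned in the comment above can be automated in a few lines (a sketch, assuming every value list in param_grid supports min/max):

# Flag best parameters that sit on the edge of the search grid
for name, values in param_grid.items():
    best = rfc_cv.best_params_[name]
    if best in (min(values), max(values)):
        print(f"{name}={best} is on the grid boundary; consider widening the grid")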
# Widen the grid boundaries; this is just a demonstration
param_grid = {
    'max_depth':[7, 8, 10, 12],
    'n_estimators':[11, 13, 15, 17, 19],              # number of trees - a random forest specific parameter
    'max_features':[0.2,0.3,0.4, 0.5, 0.6, 0.7],      # fraction of variables used per tree - a random forest specific parameter
    'min_samples_split':[2, 3, 4, 8, 12, 16]          # minimum sample count to split a node
}

# Repeat the steps above; this could be wrapped in a function for convenient reuse
rfc_cv = GridSearchCV(estimator=rfc,
                      param_grid=param_grid,
                      scoring='roc_auc',
                      n_jobs=-1,
                      cv=5)

rfc_cv.fit(X_train, y_train)
# Predict on the test set with the random forest
test_est = rfc_cv.predict(X_test)

print('Random forest AUC...')
fpr_test, tpr_test, th_test = metrics.roc_curve(y_test, test_est) # build the ROC curve (true labels come first)
print('AUC = %.4f' %metrics.auc(fpr_test, tpr_test))
# The AUC improves considerably here
Random forest AUC...
AUC = 0.8765

4.2 Analyzing the factors behind hotel booking cancellations with a random forest

Dataset download address: https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand

The data set covers two hotels, a resort hotel and a city hotel, both located in Portugal: the resort hotel in the Algarve, and the city hotel in Lisbon, the capital. The geographical distance between the two hotels is large, so mutual interference between their data is small.

1. Data Exploration

import pandas as pd
import matplotlib.pyplot as plt

# Display settings
pd.set_option("display.max_columns", None)  # show all columns
plt.rcParams['font.sans-serif']=['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display minus signs correctly


# Load data
data_origin = pd.read_csv("../data/hotel_bookings.csv")

# Back up the data
data=data_origin.copy()

print(data.shape)
data.head()
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies meal country market_segment distribution_channel is_repeated_guest previous_cancellations previous_bookings_not_canceled reserved_room_type assigned_room_type booking_changes deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 0.0 0 BB PRT Direct Direct 0 0 0 C C 3 No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 0.0 0 BB PRT Direct Direct 0 0 0 C C 4 No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 0.0 0 BB GBR Direct Direct 0 0 0 A C 0 No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 0.0 0 BB GBR Corporate Corporate 0 0 0 A A 0 No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 0.0 0 BB GBR Online TA TA/TO 0 0 0 A A 0 No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 2015-07-03
# hotel                           hotel type
# is_canceled                     whether the booking was canceled (0, 1)
# lead_time                       number of days the booking was made in advance
# arrival_date_year               arrival year
# arrival_date_month              arrival month
# arrival_date_week_number        arrival week number
# arrival_date_day_of_month       arrival day of the month
# stays_in_weekend_nights         weekend nights stayed
# stays_in_week_nights            weekday nights stayed
# adults                          number of adults
# children                        number of children
# babies                          number of babies
# meal                            meal type (BB breakfast, HB half board, FB full board, Undefined/SC no meal)
# country                         customer's country of origin
# market_segment                  market segment
# distribution_channel            booking channel
# is_repeated_guest               whether the customer is a repeat guest
# previous_cancellations          number of previously canceled bookings
# previous_bookings_not_canceled  number of previous bookings not canceled
# reserved_room_type              reserved room type
# assigned_room_type              actually assigned room type
# booking_changes                 number of booking changes
# deposit_type                    deposit type (No Deposit, Non Refund, Refundable)
# agent                           booking travel agency ID
# company                         booking company/entity ID
# days_in_waiting_list            days on the waiting list
# customer_type                   customer type
# adr                             average daily rate
# required_car_parking_spaces     number of parking spaces required
# total_of_special_requests       number of special requests
# reservation_status              final booking status (Canceled, Check-Out, No-Show)
# reservation_status_date         date the final status was updated
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
data.describe()
is_canceled lead_time arrival_date_year arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies is_repeated_guest previous_cancellations previous_bookings_not_canceled booking_changes agent company days_in_waiting_list adr required_car_parking_spaces total_of_special_requests
count 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119386.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 103050.000000 6797.000000 119390.000000 119390.000000 119390.000000 119390.000000
mean 0.370416 104.011416 2016.156554 27.165173 15.798241 0.927599 2.500302 1.856403 0.103890 0.007949 0.031912 0.087118 0.137097 0.221124 86.693382 189.266735 2.321149 101.831122 0.062518 0.571363
std 0.482918 106.863097 0.707476 13.605138 8.780829 0.998613 1.908286 0.579261 0.398561 0.097436 0.175767 0.844336 1.497437 0.652306 110.774548 131.655015 17.594721 50.535790 0.245291 0.792798
min 0.000000 0.000000 2015.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 6.000000 0.000000 -6.380000 0.000000 0.000000
25% 0.000000 18.000000 2016.000000 16.000000 8.000000 0.000000 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 9.000000 62.000000 0.000000 69.290000 0.000000 0.000000
50% 0.000000 69.000000 2016.000000 28.000000 16.000000 1.000000 2.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 14.000000 179.000000 0.000000 94.575000 0.000000 0.000000
75% 1.000000 160.000000 2017.000000 38.000000 23.000000 2.000000 3.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 229.000000 270.000000 0.000000 126.000000 0.000000 1.000000
max 1.000000 737.000000 2017.000000 53.000000 31.000000 19.000000 50.000000 55.000000 10.000000 10.000000 1.000000 26.000000 72.000000 21.000000 535.000000 543.000000 391.000000 5400.000000 8.000000 5.000000
# Inspecting the data reveals an anomaly: adr (average daily rate) should not be negative, so negative adr values should be removed.
# The maximum adr is 5400, far above the mean + 3 standard deviations.
# Check the outliers with a box plot:
import seaborn as sns


plt.figure(figsize=(12, 1))
sns.boxplot(x=list(data["adr"]))
plt.show()



# Check missing values
missing=data.isnull().sum()
missing[missing != 0]
children         4
country        488
agent        16340
company     112593
dtype: int64

2. Data cleaning

1) Filling missing values

① country missing: the value cannot be recovered, so mark missing values as "unknown";
② children missing: can be understood as no children staying, left blank at booking, so fill missing values with 0;
③ agent missing: can be understood as not booked through a travel agency, left blank at booking, so fill missing values with 0;
④ company missing: can be understood as not a company booking, left blank at booking, so fill missing values with 0.

data_fill = data.fillna(
    {
        "country":"unknown",
        "children":0,
        "agent":0,
        "company":0
    }
)

missing=data_fill.isnull().sum()
missing[missing != 0]
Series([], dtype: int64)

2) Handling outliers

(1) The total number of adults and children staying is 0;
(2) The average daily rate (adr) is negative, or above 1,000;
(3) In meal, Undefined and SC both mean no meal.

# 2) Handling outliers

# 1. Bookings where adults + children totals 0
drop_a = data_fill[data_fill[["adults","children"]].sum(axis=1) == 0]
# 2. Negative adr, or adr above 1000
drop_b = data_fill[(data_fill["adr"]<0) | (data_fill["adr"]>1000)]
data_fill.drop(drop_a.index,inplace=True)

data_done=data_fill[(data_fill["adr"]>=0) & (data_fill["adr"] <1000)]

# 3. In meal, Undefined and SC both mean no meal
data_done["meal"].replace({"Undefined":"SC"},inplace=True)
data_done.shape # After cleaning, 119208 records remain
(119208, 32)

3. Standardizing numerical features and encoding categorical features (one-hot)

from sklearn.pipeline import  Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,OneHotEncoder

# 1. Standardize the numerical features
num_features = ["lead_time",
                "arrival_date_week_number",
                "arrival_date_day_of_month",
                "stays_in_weekend_nights",
                "stays_in_week_nights",
                "adults",
                "children",
                "babies",
                "is_repeated_guest",
                "previous_cancellations",
                "previous_bookings_not_canceled",
                "agent",
                "company",
                "required_car_parking_spaces",
                "total_of_special_requests",
                "adr"]

num_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='constant',fill_value=0)), # fill nulls with a fixed value
        ('scaler', StandardScaler())                     # standardize the data
    ]
)

# 2. One-hot encode the categorical features
cat_features = ["hotel",
                "arrival_date_month",
                "meal",
                "market_segment",
                "distribution_channel",
                "reserved_room_type",
                "deposit_type",
                "customer_type"]



cat_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown='ignore'))
    ]
)


features = num_features + cat_features

from sklearn.compose import ColumnTransformer
'''
SimpleImputer can replace missing values, MinMaxScaler can scale numeric values,
and OneHotEncoder can encode categorical variables.

ColumnTransformer() in scikit-learn applies transformations to selected columns.

To use ColumnTransformer, you must specify a list of transformers.
Each transformer is a three-element tuple defining the transformer's name, the transformation
to apply, and the columns to apply it to: (name, object, columns)
'''
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features),
        ('cat', cat_transformer, cat_features)
    ]
)

4. Random forest modeling and choosing the best parameters

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score,train_test_split,GridSearchCV



X = data_done.drop("is_canceled",axis=1)
y = data_done["is_canceled"]

# First pass of tuning n_estimators
# scorel = []
# for i in range(0,200,10):
#     rfc_model = RandomForestClassifier(n_estimators=i+1,
#                                        n_jobs=-1,
#                                        random_state=0)
#     rfc = Pipeline(
#         steps=[
#             ('preprocessor', preprocessor),
#             ('model',rfc_model)
#         ]
#     )
#     split = KFold(n_splits=10, shuffle=True, random_state=42)
#     rfc_t_s = cross_val_score(rfc,
#                                  X,
#                                  y,
#                                  cv=split,
#                                  scoring="accuracy",
#                                  n_jobs=-1
#                               ).mean()
#     print('step',rfc_t_s)
#     scorel.append(rfc_t_s)
# print(max(scorel),(scorel.index(max(scorel))*10) + 1)
# plt.figure(figsize=[20,5])
# plt.plot(range(1,201,10),scorel)
# plt.show()
# Tune n_estimators again over a narrower range; note that training on this data set is slow
scorel = []
for i in range(150,170):
    rfc_model = RandomForestClassifier(n_estimators=i+1,
                                       n_jobs=-1,
                                       random_state=0)
    rfc = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('model',rfc_model)
        ]
    )
    split = KFold(n_splits=10, shuffle=True, random_state=42)
    rfc_t_s = cross_val_score(rfc,
                                 X,
                                 y,
                                 cv=split,
                                 scoring="accuracy",
                                 n_jobs=-1
                              ).mean()
    print('step:',i,',score=',rfc_t_s)
    scorel.append(rfc_t_s)

# From the plot, n_estimators should be 161
print(max(scorel),([*range(150,170)][scorel.index(max(scorel))]))
plt.figure(figsize=[20,5])
plt.plot(range(150,170),scorel)
plt.show()
param_grid = {
                "model__max_depth":[*range(1,40,10)],
                'model__min_samples_leaf':[*range(1,50,10)]
             }

rfc_model_t = RandomForestClassifier(n_estimators=161,
                                     criterion="gini",
                                     max_features=0.4,
                                     n_jobs=-1,
                                     random_state=0)
rfc_t = Pipeline(
                    steps=[
                            ('preprocessor', preprocessor),
                            ('model',rfc_model_t)
                    ]
                 )

split_t = KFold(n_splits=10, shuffle=True, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
GS = GridSearchCV(rfc_t, param_grid, cv=split_t)

GS.fit(X_train, y_train)

GS.best_params_
# 

5. Train the model with the best parameters

# Train and evaluate the model
rfc_model = RandomForestClassifier(n_estimators=161
                                 ,criterion="gini"
                                 ,max_depth=31
                                 ,min_samples_leaf=1
                                 ,max_features=0.4
                                 ,n_jobs=-1
                                 ,random_state=0)

rfc = Pipeline(steps=[
                          ('preprocessor', preprocessor),
                          ('model',rfc_model)
                      ])

rfc = rfc.fit(X_train, y_train)
score_=rfc.score(X_test, y_test)
score_
0.8683835248720745
# Cross-validation
split = KFold(n_splits=10, shuffle=True, random_state=42)
rfc_s = cross_val_score(         rfc,
                                 X,
                                 y,
                                 cv=split,
                                 scoring="accuracy",
                                 n_jobs=-1)

plt.plot(range(1,11),rfc_s,label = "RandomForest")



6. Feature Importance

# Feature importances
import eli5
onehot_columns = list(rfc.named_steps['preprocessor'].
                      named_transformers_['cat'].
                      named_steps['onehot'].
                      get_feature_names_out(input_features=cat_features))

feat_imp_list = num_features + onehot_columns

# Top 10 features by importance, with their weights
feat_imp_df = eli5.formatters.as_dataframe.explain_weights_df(
   rfc.named_steps['model'],
    feature_names=feat_imp_list)



feat_imp_df.head(10)
feature weight std
0 deposit_type_Non Refund 0.144919 0.108617
1 lead_time 0.140804 0.016317
2 adr 0.093004 0.003363
3 deposit_type_No Deposit 0.075658 0.103332
4 arrival_date_day_of_month 0.066274 0.002718
5 arrival_date_week_number 0.053368 0.002609
6 total_of_special_requests 0.052200 0.011193
7 agent 0.043441 0.005135
8 previous_cancellations 0.041885 0.014677
9 stays_in_week_nights 0.039749 0.002220
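If eli5 is unavailable, the mean importances (though not the per-tree std column) can be read straight from the fitted pipeline; a sketch reusing feat_imp_list from above:

import pandas as pd

imp = pd.Series(rfc.named_steps['model'].feature_importances_, index=feat_imp_list)
print(imp.sort_values(ascending=False).head(10))  # top-10 features by weight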

7. Further analysis of the factors with a weight above 10% (lead time and deposit type)

# 1. Cancellations vs. lead time
plt.figure(figsize=(12, 8))
lead_cancel_data = data_done.groupby("lead_time")["is_canceled"].describe()
sns.scatterplot(x=lead_cancel_data.index, y=lead_cancel_data["mean"].values * 100,color='olive')

plt.title("Effect of lead time on the cancellation rate", fontsize=16)
plt.xlabel("Lead time", fontsize=16)
plt.ylabel("Cancellation rate [%]", fontsize=16)
plt.show()

The longer in advance a booking is made, the higher its cancellation rate. The hotel should weigh its operating costs and consider capping the maximum lead time to avoid resources being tied up.

# Cancellations vs. deposit type
plt.figure(figsize=(12, 8))
deposit_cancel_data = data_done.groupby("deposit_type")["is_canceled"].describe()
sns.barplot(x=deposit_cancel_data.index, y=deposit_cancel_data["mean"] * 100,palette="Blues")

plt.title("Effect of deposit_type on cancelation", fontsize=16)
plt.xlabel("Deposit type", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.show()

The cancellation rate for non-refundable deposits is the highest of all deposit types. This runs counter to consumer psychology and calls for further analysis.

In fact, among all market segments, offline travel agency bookings and group bookings have the highest cancellation rates. Among deposit types, no-deposit bookings are concentrated in online bookings, while non-refundable deposit bookings are concentrated in offline travel agency and group bookings, which is what drives the high cancellation rate for non-refundable deposits.

The hotel could design a questionnaire, interview offline travel agencies and groups, and explore the reasons and possible solutions.

Origin: blog.csdn.net/qq_44665283/article/details/130332923