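An Introduction to XGBoost and Model Parameter Tuning

Introduction

When a model falls short of expectations, XGBoost is often the data scientist's weapon of last resort. XGBoost is a highly sophisticated algorithm, powerful enough to learn all kinds of irregularities in data.

Building a model with XGBoost is easy, but improving that model takes real effort, because the algorithm exposes many parameters. Tuning them is unavoidable if you want better results, yet it is hard to know which parameters to tune and which values produce a better model.

This article is aimed at XGBoost newcomers and collects practical information to help with parameter tuning.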

What should you know?

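XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. I recommend first reading my earlier translation of another article by the same author.

Advantages of XGBoost

Regularization:
- The standard GBM implementation has no regularization like XGBoost's, which helps reduce overfitting
Parallel Processing:
- XGBoost implements parallel computation and is much faster than GBM
- But boosting is sequential, with each new model built on the models before it, so how can it be parallelized? That is worth exploring
- XGBoost also supports running on Hadoop
High Flexibility:
- XGBoost lets users define custom optimization objectives and evaluation criteria
- This opens up many ways to improve a model
Handling Missing Values:
- XGBoost has a built-in routine for handling missing values
- The user supplies a value, distinct from all other observations, that marks missing entries; XGBoost tries both branches when it meets a missing value at a node and learns which path to take for missing values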
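The idea of learning a path for missing values can be sketched in pure Python. This is a toy illustration, not xgboost's implementation: the function names, the gradient and hessian values, and the simplified gain formula (no gamma term) are all assumptions made for the example. Rows with a missing feature value are sent left and then right, and the side with the larger gain becomes the default direction.

```python
def score(g_sum, h_sum, lam=1.0):
    """Structure score G^2 / (H + lambda) of a node."""
    return g_sum * g_sum / (h_sum + lam)


def gain(left, right, lam=1.0):
    """Gain of splitting into left/right lists of (gradient, hessian) pairs."""
    gl = sum(g for g, _ in left)
    hl = sum(h for _, h in left)
    gr = sum(g for g, _ in right)
    hr = sum(h for _, h in right)
    return 0.5 * (score(gl, hl, lam) + score(gr, hr, lam)
                  - score(gl + gr, hl + hr, lam))


def best_default_direction(rows, threshold, lam=1.0):
    """rows: list of (feature_value_or_None, gradient, hessian)."""
    left = [(g, h) for x, g, h in rows if x is not None and x < threshold]
    right = [(g, h) for x, g, h in rows if x is not None and x >= threshold]
    missing = [(g, h) for x, g, h in rows if x is None]
    # Try both placements of the missing rows and keep the better one
    gain_left = gain(left + missing, right, lam)
    gain_right = gain(left, right + missing, lam)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


rows = [
    (1.0, -1.0, 1.0), (2.0, -1.0, 1.0),   # small values, negative gradients
    (8.0, 1.0, 1.0), (9.0, 1.0, 1.0),     # large values, positive gradients
    (None, 1.0, 1.0), (None, 1.0, 1.0),   # missing rows resemble the large ones
]
direction, best_gain = best_default_direction(rows, threshold=5.0)
print(direction, round(best_gain, 3))
```

In this toy data the missing rows have the same gradients as the large-valued rows, so sending them right yields the higher gain.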
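Tree Pruning:
- GBM stops splitting a node as soon as a split yields a negative loss reduction, so it behaves as a greedy algorithm
- XGBoost instead keeps splitting until max_depth is reached, then prunes backward and removes splits that bring no positive gain
- The benefit shows when a split with a gain of -2 is followed by a split with a gain of +10: GBM stops at the first split, whereas XGBoost continues to the second, sees the combined effect of +8, and keeps both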
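The -2 / +10 example can be replayed with a small sketch. It models only a single chain of splits, each with a given gain, and both function names are hypothetical; real tree growth is more involved.

```python
def greedy_prestop(gains):
    """Stop at the first split whose own gain is negative (GBM-style)."""
    total = 0
    for g in gains:
        if g < 0:
            break
        total += g
    return total


def grow_then_prune(gains):
    """Grow every split first, then prune from the bottom up while the
    deepest remaining split brings no positive gain."""
    kept = list(gains)
    while kept and kept[-1] <= 0:
        kept.pop()
    return sum(kept)


path = [-2, 10]  # first split loses 2, the split below it gains 10
print(greedy_prestop(path))   # greedy stopping never reaches the +10 split
print(grow_then_prune(path))  # pruning afterwards keeps both splits
```

Pruning still removes useless tails: for a path such as `[5, -3]` the last split is dropped and only the gain of 5 remains.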
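Built-in Cross-Validation:
- XGBoost can run cross-validation at every boosting iteration, so the optimal number of boosting iterations is found in a single run
- GBM, by contrast, needs a grid search, and only a limited set of values can be tested
Continue on Existing Model:
- Training can resume from the last iteration of a previous XGBoost run
- The sklearn implementation of GBM has this feature as well

Articles for a deeper understanding of XGBoost:
1. XGBoost Guide – Introduction to Boosted Trees
2. Words from the Author of XGBoost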

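XGBoost Parameters

XGBoost parameters fall into three categories:

General Parameters: guide the overall functioning
Booster Parameters: guide the individual booster (tree) at each step
Learning Task Parameters: guide the optimization performed

1. General Parameters

booster [default=gbtree]:

gbtree: tree-based models
gblinear: linear models

silent [default=0]:

Setting it to 1 activates silent mode, so no running messages are printed
The default of 0 is usually best, because the printed messages help in understanding the model

nthread [default to maximum number of threads available if not set]

Used for parallel processing; set it to the number of cores to use
To run on all cores, leave it unset and the library detects the value automatically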

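2. Booster Parameters

XGBoost has two boosters, but the author discusses only the tree booster here because, in the author's view, it always outperforms the linear booster

eta [default=0.3]
- Analogous to the learning rate in GBM
- Shrinking the weights at each step makes the model more robust
- Typical final values lie in the range 0.01-0.2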
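A toy sketch of shrinkage shows why a smaller eta needs more trees. It assumes an idealized booster in which every new "tree" predicts the current residual exactly; `rounds_to_converge` is a hypothetical helper, not part of xgboost.

```python
def rounds_to_converge(eta, target=1.0, tol=0.01, max_rounds=10000):
    """Count boosting rounds until the residual falls below tol.

    Each round the ensemble adds eta times the current residual,
    so the residual shrinks by a factor of (1 - eta) per round.
    """
    prediction = 0.0
    for n in range(1, max_rounds + 1):
        residual = target - prediction
        prediction += eta * residual
        if abs(target - prediction) < tol:
            return n
    return max_rounds


for eta in (0.3, 0.1, 0.01):
    print(eta, rounds_to_converge(eta))
```

The round count grows quickly as eta shrinks, which is why the tuning procedure later in the article raises the number of trees when it lowers the learning rate.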
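min_child_weight [default=1]
- Defines the minimum sum of instance weights required in a child node
- Similar to min_samples_leaf in GBM but not identical: this is the minimum sum of the weights (hessians) of all observations in a leaf, whereas GBM uses the minimum number of observations
- Used to control overfitting: higher values prevent the model from learning relations specific to a few samples, but values that are too high cause underfitting
- Should be tuned using CV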
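The difference between "sum of weights" and "number of samples" is easiest to see under the binary:logistic objective, where each row contributes its hessian p * (1 - p). The sketch below uses made-up raw scores and hypothetical helper names; with squared-error loss every hessian is 1, so the sum equals the row count.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hessian_sum_logistic(raw_scores):
    """Sum of second-order gradients p * (1 - p) over the rows in a leaf."""
    return sum(sigmoid(z) * (1.0 - sigmoid(z)) for z in raw_scores)


uncertain_leaf = [0.0] * 4     # 4 rows, each predicted at p = 0.5
confident_leaf = [4.0] * 20    # 20 rows, each predicted at p of about 0.98

print(hessian_sum_logistic(uncertain_leaf))   # 4 rows * 0.25
print(hessian_sum_logistic(confident_leaf))   # well below 1 despite 20 rows
```

With min_child_weight=1 the 20-row leaf would be rejected while the 4-row leaf would just pass, because confidently predicted rows carry very little weight.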
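max_depth [default=6]
- The maximum depth of a tree
- Used to control overfitting
- Should be tuned using CV
- Typical values: 3-10
max_leaf_nodes
- The maximum number of leaf nodes in a tree
- Define either this or max_depth, not both
gamma [default=0]
- A node is split only if the split reduces the loss function; gamma specifies the minimum loss reduction required to make a split
- Larger values make the algorithm more conservative; suitable values depend on the loss function and should be tuned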
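How gamma gates a split can be sketched with the standard second-order gain formula used for boosted trees. The gradient and hessian sums are invented for illustration, and `split_gain` is a hypothetical helper rather than an xgboost function.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split minus the complexity penalty gamma.

    g_* and h_* are sums of first- and second-order gradients in each child.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    raw_gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                      - score(g_left + g_right, h_left + h_right))
    return raw_gain - gamma


gain_no_penalty = split_gain(-3.0, 4.0, 2.0, 4.0)
gain_with_gamma = split_gain(-3.0, 4.0, 2.0, 4.0, gamma=1.5)
print(gain_no_penalty)   # positive: the split is worth making
print(gain_with_gamma)   # negative: the same split is rejected
```

The same candidate split is accepted with gamma=0 and rejected with gamma=1.5, which is the sense in which larger values make the algorithm more conservative.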
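max_delta_step [default=0]
- Limits how much each tree's weight estimate may change. A value of 0 means no constraint; a positive value makes the update step more conservative
- Usually not needed, but it can help in logistic regression when the classes are extremely imbalanced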
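A rough sketch of the effect, assuming the cap is applied to the absolute value of a leaf weight computed as -G / (H + lambda). The numbers mimic a leaf of rare positive rows whose hessians are tiny; `clipped_leaf_weight` is a hypothetical name.

```python
def clipped_leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, max_delta_step=0.0):
    """Leaf weight -G / (H + lambda), capped at +/- max_delta_step when it is positive."""
    weight = -grad_sum / (hess_sum + reg_lambda)
    if max_delta_step > 0 and abs(weight) > max_delta_step:
        weight = max_delta_step if weight > 0 else -max_delta_step
    return weight


# Five positive rows predicted near 0: gradients near -1, hessians near 0.01
print(clipped_leaf_weight(-5.0, 0.05))                      # large, unconstrained step
print(clipped_leaf_weight(-5.0, 0.05, max_delta_step=1.0))  # step capped at 1.0
```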
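subsample [default=1]:
- Same as subsample in GBM: the fraction of observations randomly sampled for each tree
- Lower values make the model more conservative and prevent overfitting, but values that are too small cause underfitting
- Typical values: 0.5-1
colsample_bytree [default=1]
- Similar to max_features in GBM: the fraction of features randomly sampled for each tree
- Typical values: 0.5-1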
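The two sampling ratios can be pictured as drawing a random subset of rows and a random subset of columns before each tree is built. This standard-library sketch is only an illustration; xgboost's own sampling and rounding may differ, and `sample_for_tree` is a hypothetical helper.

```python
import random


def sample_for_tree(n_rows, feature_names, subsample=0.8,
                    colsample_bytree=0.8, rng=None):
    """Pick the row indices and feature names that one tree gets to see."""
    rng = rng or random.Random(0)
    n_keep_rows = max(1, int(n_rows * subsample))
    n_keep_cols = max(1, int(len(feature_names) * colsample_bytree))
    row_idx = sorted(rng.sample(range(n_rows), n_keep_rows))
    cols = sorted(rng.sample(feature_names, n_keep_cols))
    return row_idx, cols


rng = random.Random(27)
features = ["f0", "f1", "f2", "f3", "f4"]
for tree in range(3):
    rows, cols = sample_for_tree(10, features, rng=rng)
    print(tree, rows, cols)
```

Each tree sees a different 80% of the rows and 80% of the features, which decorrelates the trees and reduces overfitting.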
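colsample_bylevel [default=1]
- The fraction of features sampled for each split at each level of the tree
- The author does not use it, because subsample and colsample_bytree together do a similar job
lambda [default=1]
- L2 regularization term on weights (analogous to Ridge regression)
- Handles the regularization part of XGBoost; many data scientists rarely use it, but it can help reduce overfitting
alpha [default=0]
- L1 regularization term on weights (analogous to Lasso regression)
- Useful with very high-dimensional data, where it makes the algorithm run faster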
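The effect of the two penalties on a single leaf can be sketched as follows, assuming the usual closed form in which alpha soft-thresholds the gradient sum and lambda enlarges the denominator. `regularized_leaf_weight` is a hypothetical helper with illustrative inputs.

```python
def regularized_leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Leaf weight with L2 (reg_lambda) and L1 (reg_alpha) penalties."""
    # L1: shrink the gradient sum toward zero by alpha (soft thresholding)
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    # L2: a larger lambda pulls the weight toward zero
    return -shrunk / (hess_sum + reg_lambda)


print(regularized_leaf_weight(-4.0, 3.0, reg_lambda=0.0))                 # no penalty
print(regularized_leaf_weight(-4.0, 3.0, reg_lambda=1.0))                 # L2 only
print(regularized_leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=1.0))  # L1 and L2
print(regularized_leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=5.0))  # weight zeroed
```

L2 shrinks every leaf weight smoothly, while a large enough L1 penalty sets weak leaves to exactly zero.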
scale_pos_weight [default=1]
- With highly imbalanced classes, set a value greater than 0 to help the algorithm converge faster
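A commonly suggested starting point for this parameter is the ratio of negative to positive examples. The sketch below computes that ratio for made-up labels; the helper name is hypothetical, and the tuning walkthrough later in this article simply keeps the value at 1.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive examples for binary 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives


labels = [1] * 15 + [0] * 985   # 1.5% positive rate, invented for the example
print(suggested_scale_pos_weight(labels))
```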

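3. Learning Task Parameters

These parameters define the optimization objective and the metric calculated at each step

objective [default=reg:linear]
- Defines the loss function to be minimized; the most common values are
- binary:logistic: logistic regression for binary classification; returns the predicted probability, not the class
- multi:softmax: multiclass classification using the softmax objective; returns the predicted class, not probabilities (num_class must also be set)
- multi:softprob: same as softmax, but returns the predicted probability of each sample belonging to each class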
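The difference in what these objectives return can be sketched on raw scores for a single row. The helper names are chosen to echo the objective names, but they are illustrative functions, not xgboost APIs.

```python
import math


def logistic(raw_score):
    """binary:logistic style output: probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-raw_score))


def softprob(raw_scores):
    """multi:softprob style output: one probability per class."""
    shift = max(raw_scores)                       # subtract max for stability
    exps = [math.exp(s - shift) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_label(raw_scores):
    """multi:softmax style output: index of the most probable class."""
    probs = softprob(raw_scores)
    return probs.index(max(probs))


scores = [0.2, 1.5, -0.3]        # raw scores for three classes
print(logistic(0.0))             # a probability, to be thresholded by the caller
print(softprob(scores))          # three probabilities summing to 1
print(softmax_label(scores))     # a single class index
```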
eval_metric [ default according to objective ]
- The metric used for validation data
- Defaults: rmse for regression, error for classification
- Typical values:
  - rmse – root mean square error
  - mae – mean absolute error
  - logloss – negative log-likelihood
  - error – binary classification error rate (0.5 threshold)
  - merror – multiclass classification error rate
  - mlogloss – multiclass logloss
  - auc – area under the curve
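For reference, several of these metrics can be written in a few lines of plain Python. These are textbook definitions applied to invented predictions, not xgboost's internal code; the AUC here is the simple pairwise version, which is only practical for small inputs.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def logloss(y_true, p_pred, eps=1e-15):
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def error_rate(y_true, p_pred, threshold=0.5):
    wrong = sum(1 for t, p in zip(y_true, p_pred) if (p > threshold) != (t == 1))
    return wrong / len(y_true)


def auc(y_true, p_pred):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for t, p in zip(y_true, p_pred) if t == 1]
    neg = [p for t, p in zip(y_true, p_pred) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.4, 0.6]
for name, fn in [("rmse", rmse), ("mae", mae), ("logloss", logloss),
                 ("error", error_rate), ("auc", auc)]:
    print(name, round(fn(y_true, p_pred), 4))
```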
seed [default=0]
- The random number seed; fixing it makes results reproducible

Some parameters have different names in the Python sklearn wrapper:
1. eta -> learning_rate
2. lambda -> reg_lambda
3. alpha -> reg_alpha
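When moving a parameter dictionary from the native API to the sklearn wrapper, the renaming can be applied mechanically. The mapping below covers only the three names listed above, and `to_sklearn_names` is a hypothetical helper. Note that `lambda` is a reserved word in Python, so it could not be used as a keyword argument in any case.

```python
NATIVE_TO_SKLEARN = {
    "eta": "learning_rate",
    "lambda": "reg_lambda",
    "alpha": "reg_alpha",
}


def to_sklearn_names(native_params):
    """Rename native xgboost parameter names to their sklearn-wrapper names."""
    return {NATIVE_TO_SKLEARN.get(name, name): value
            for name, value in native_params.items()}


native = {"eta": 0.1, "lambda": 1, "alpha": 0, "max_depth": 5}
print(to_sklearn_names(native))
```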


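You may wonder why n_estimators, familiar from GBM, has not been mentioned. It exists as a parameter of XGBClassifier, but in the native xgboost API the number of trees is passed as num_boost_round when calling the training function.

The author recommends the following resources for a deeper understanding of XGBoost:
1. XGBoost Parameters (official guide)
2. XGBoost Demo Codes (xgboost GitHub repository)
3. Python API Reference (official guide)

XGBoost Tuning Steps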

# Import the required libraries and data
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics   # Additional scikit-learn functions
from sklearn.model_selection import GridSearchCV   # Grid search (formerly in sklearn.grid_search)

import matplotlib.pyplot as plt
# Jupyter notebook magic; remove the next line when running as a plain script
%matplotlib inline
from matplotlib import rcParams
rcParams['figure.figsize'] = 12, 4

train = pd.read_csv('train_modified.csv')
target = 'Disbursed'
IDcol = 'ID'
  
  
   

The author uses two forms of XGBoost here:
1. xgb: the native xgboost library, which provides the cv function
2. XGBClassifier: the sklearn wrapper for XGBoost, which allows sklearn's grid search to be used with parallel processing

# Helper function that fits an xgboost model and reports its performance
def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        # Use xgb.cv with early stopping to choose the number of boosting rounds
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc', early_stopping_rounds=early_stopping_rounds, verbose_eval=False)
        alg.set_params(n_estimators=cvresult.shape[0])

    # Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain[target])

    # Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(dtrain[target].values, dtrain_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(dtrain[target], dtrain_predprob))

    # Older xgboost releases had no feature_importances_ on the sklearn wrapper;
    # get_fscore() on the underlying booster serves the same purpose
    feat_imp = pd.Series(alg.get_booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')
  
  
   

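General Approach for Parameter Tuning

The usual procedure is:
1. Choose a relatively high learning rate. A value of 0.1 generally works, but anywhere between 0.05 and 0.3 can suit different problems. Then determine the optimal number of trees for that learning rate; xgboost has a very useful cv function that performs cross-validation at each boosting iteration and returns the optimal number of trees
2. Tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree)
3. Tune the regularization parameters (lambda, alpha)
4. Lower the learning rate and settle on the final parameters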
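The "optimal number of trees" in the first step comes from early stopping on the cross-validated score. The logic can be sketched in plain Python: track the best round so far and stop once no improvement has been seen for a given number of rounds. The scores are invented, and `best_num_rounds` is a hypothetical helper that only mimics the behavior, not xgboost's code.

```python
def best_num_rounds(cv_scores, early_stopping_rounds, maximize=True):
    """Return the number of boosting rounds up to the best CV score.

    cv_scores[i] is the mean validation score after round i + 1.
    """
    best_idx = 0
    for i, s in enumerate(cv_scores):
        better = s > cv_scores[best_idx] if maximize else s < cv_scores[best_idx]
        if better:
            best_idx = i
        elif i - best_idx >= early_stopping_rounds:
            break   # no improvement for early_stopping_rounds rounds
    return best_idx + 1


auc_by_round = [0.80, 0.82, 0.83, 0.835, 0.834, 0.833,
                0.836, 0.835, 0.834, 0.833, 0.832]
print(best_num_rounds(auc_by_round, early_stopping_rounds=3))
print(best_num_rounds(auc_by_round, early_stopping_rounds=2))
```

A shorter patience stops at an earlier local peak, which is why the walkthrough in this article uses a fairly generous early_stopping_rounds of 50.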

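Step 1: Fix learning rate and number of estimators for tuning tree-based parameters

1. Set initial values for the parameters:

max_depth = 5 : should be between 3 and 10; 4-6 are good starting points
min_child_weight = 1 : a small value is chosen because the data is highly imbalanced, so leaf nodes may hold small groups
gamma = 0 : a small value such as 0.1-0.2 also works as a starting point; it is tuned again later
subsample, colsample_bytree = 0.8 : a commonly used starting value; typical values range from 0.5 to 0.9
scale_pos_weight = 1 : because the author's data is highly imbalanced

# Use a fixed learning rate of 0.1 and cv to choose a suitable number of trees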
#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb1, train, predictors)
# The author obtained 140 trees as the optimum; if that is too many for your system, raise the learning rate and rerun
  
  
   

Step 2: Tune max_depth and min_child_weight

These two parameters are tuned first because they have the largest effect on the model outcome

param_test1 = {
 'max_depth':range(3,10,2),
 'min_child_weight':range(1,6,2)
}
gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=5,
 min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1, seed=27), 
 param_grid = param_test1, scoring='roc_auc', n_jobs=4, cv=5)
gsearch1.fit(train[predictors],train[target])
gsearch1.cv_results_, gsearch1.best_params_, gsearch1.best_score_
  
  
   

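The optimal values are max_depth=5 and min_child_weight=5
Because the previous grid used a step of 2, next search one step above and below these optimal values to see whether better parameters exist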
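This coarse-to-fine refinement can be sketched with a small helper that builds the next grid around the best values from the coarse search. `refine_grid` is a hypothetical function, and the count of 5 folds is taken from the cv=5 setting used in this article.

```python
from itertools import product


def refine_grid(best_value, step=1, lower_bound=1):
    """Candidate values one step below and above the current best."""
    candidates = [best_value - step, best_value, best_value + step]
    return [c for c in candidates if c >= lower_bound]


coarse_best = {"max_depth": 5, "min_child_weight": 5}
fine_grid = {name: refine_grid(value) for name, value in coarse_best.items()}
print(fine_grid)

# Each parameter combination is fitted once per fold
n_fits = len(list(product(*fine_grid.values()))) * 5
print(n_fits)
```

The resulting grid of 4, 5 and 6 for each parameter is the one searched next.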

param_test2 = {
 'max_depth':[4,5,6],
 'min_child_weight':[4,5,6]
}
gsearch2 = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=5,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test2, scoring='roc_auc', n_jobs=4, cv=5)
gsearch2.fit(train[predictors],train[target])
gsearch2.cv_results_, gsearch2.best_params_, gsearch2.best_score_
  
  
   

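This run gives the optimal values max_depth=4 and min_child_weight=6. The author's cv scores also show that further gains are hard to come by; still, higher values of min_child_weight can be tried to see the effect: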

param_test2b = {
 'min_child_weight':[6,8,10,12]
}
gsearch2b = GridSearchCV(estimator = XGBClassifier( learning_rate=0.1, n_estimators=140, max_depth=4,
 min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test2b, scoring='roc_auc', n_jobs=4, cv=5)
gsearch2b.fit(train[predictors],train[target])
modelfit(gsearch2b.best_estimator_, train, predictors)
gsearch2b.cv_results_, gsearch2b.best_params_, gsearch2b.best_score_
  
  
   

Step 3: Tune gamma

param_test3 = {
 'gamma':[i/10.0 for i in range(0,5)]
}
gsearch3 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=140, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test3, scoring='roc_auc', n_jobs=4, cv=5)
gsearch3.fit(train[predictors],train[target])
gsearch3.cv_results_, gsearch3.best_params_, gsearch3.best_score_
  
  
   

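With the parameters tuned so far, refit the model to see how it and its features perform; modelfit also re-estimates the number of trees for the updated parameters: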

xgb2 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb2, train, predictors)
  
  
   

Step 4: Tune subsample and colsample_bytree

param_test4 = {
 'subsample':[i/10.0 for i in range(6,10)],
 'colsample_bytree':[i/10.0 for i in range(6,10)]
}
gsearch4 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test4, scoring='roc_auc', n_jobs=4, cv=5)
gsearch4.fit(train[predictors],train[target])
gsearch4.cv_results_, gsearch4.best_params_, gsearch4.best_score_
  
  
   

# The previous step found 0.8 optimal for both parameters; now search around it in steps of 0.05
param_test5 = {
 'subsample':[i/100.0 for i in range(75,90,5)],
 'colsample_bytree':[i/100.0 for i in range(75,90,5)]
}
gsearch5 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test5, scoring='roc_auc', n_jobs=4, cv=5)
gsearch5.fit(train[predictors],train[target])
  
  
   

Step 5: Tuning Regularization Parameters

This step applies regularization to reduce overfitting. Many people skip these parameters because gamma provides a similar way of controlling model complexity

param_test6 = {
 'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]
}
gsearch6 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test6, scoring='roc_auc', n_jobs=4, cv=5)
gsearch6.fit(train[predictors],train[target])
gsearch6.cv_results_, gsearch6.best_params_, gsearch6.best_score_
  
  
   

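The cv score after this step may be lower than before. The values tried were widely spaced, so next search more finely around the best value found, 0.01, to see whether the result improves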

param_test7 = {
 'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]
}
gsearch7 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.1, n_estimators=177, max_depth=4,
 min_child_weight=6, gamma=0.1, subsample=0.8, colsample_bytree=0.8,
 objective= 'binary:logistic', nthread=4, scale_pos_weight=1,seed=27), 
 param_grid = param_test7, scoring='roc_auc', n_jobs=4, cv=5)
gsearch7.fit(train[predictors],train[target])
gsearch7.cv_results_, gsearch7.best_params_, gsearch7.best_score_
  
  
   

Then, with the better value found, look at the overall performance of the model again

xgb3 = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha=0.005,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb3, train, predictors)
  
  
   

Step 6: Reducing Learning Rate

The last step is to lower the learning rate and add more trees

xgb4 = XGBClassifier(
 learning_rate =0.01,
 n_estimators=5000,
 max_depth=4,
 min_child_weight=6,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 reg_alpha=0.005,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)
modelfit(xgb4, train, predictors)
  
  
   

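Finally, the author shares two lessons:
1. It is difficult to get a big leap in performance through parameter tuning alone
2. To improve a model substantially, turn to feature engineering, model ensembling, and stacking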