Simple XGBoost Model Parameter Tuning

1. XGBoost Principles and Important Parameters

1.1 Principles

Bilibili video: XGBoost video (Tianqi Chen's slides)
Boosting trees and gradient boosting trees
XGBoost is closely related to decision trees, boosting trees, and gradient boosting trees; there are many introductions to these algorithms online, for example this one: the relationship between GBT / GBDT / GBRT / XGBoost.
Here we mainly compare GBT with XGBoost:

Compared with GBT, the main difference is that XGBoost uses a second-order Taylor expansion of the loss, while GBT uses a first-order one (both use first- or second-order derivatives to approximately fit the residuals); see the formulation sketched after this list.
XGBoost adds an explicit regularization term to the objective.
XGBoost supports parallelization (at the level of split finding across features within a tree, not across trees).
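
To make the first two points concrete, the objective XGBoost minimizes at iteration t can be written with a second-order Taylor expansion of the loss plus an explicit regularization term (standard formulation from the XGBoost paper, included here only for context):

\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}\Big[g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t^2(x_i)\Big] + \Omega(f_t),
\qquad g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big),
\qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big),
\qquad \Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2

where T is the number of leaves and w the vector of leaf weights; plain gradient boosting keeps only the first-order term g_i f_t(x_i) and has no Ω(f_t).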

1.2 Important Parameters

See the official documentation directly:
XGBoost Parameters
Only the important parameters are listed here.
There are three groups of parameters:
1. General parameters: related to which booster is used (tree / linear)

booster [default=gbtree]: gbtree uses tree-based models; gblinear uses linear models (rarely used);
nthread [defaults to the maximum number of threads available if not set]: number of parallel threads used to run XGBoost

2. Booster parameters (for gbtree):

eta [default=0.3, alias: learning_rate]: learning rate; range: [0,1]
gamma [default=0, alias: min_split_loss]: minimum loss reduction required to make a further split on a leaf node; range: [0,∞]; the larger it is, the more conservative the model
max_depth [default=6]: maximum depth of a tree; range: [0,∞]
min_child_weight [default=1]: minimum sum of instance weight (hessian) needed in a child; range: [0,∞]; the larger it is, the more conservative the model
subsample [default=1]: subsample ratio of the training instances; range: (0,1]; setting it to 0.5 means XGBoost randomly samples half of the training data before growing trees, which helps prevent overfitting
colsample_bytree [default=1]: subsample ratio of columns when constructing each tree
lambda [default=1, alias: reg_lambda]: L2 regularization term on weights
alpha [default=0, alias: reg_alpha]: L1 regularization term on weights

3. Learning task parameters: specify the learning task and the corresponding learning objective

1. objective [default=reg:squarederror]:
(commonly used) reg: regression, binary: binary classification, multi: multiclass classification
binary:logistic: logistic regression for binary classification, outputs probabilities
reg:squarederror: regression with squared loss
reg:logistic: logistic regression

2. eval_metric [default according to objective]: evaluation metric for validation data; a default metric is assigned according to the objective (rmse for regression, classification error for classification, mean average precision for ranking)
Commonly used: mae, logloss, rmse, auc
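
To show how the three groups fit together, here is a minimal sketch of a parameter dict as it would be passed to the native xgb.train API (the values are illustrative defaults, not the ones tuned later in this post):

import xgboost as xgb

# illustrative parameter dict combining general, booster and learning-task parameters
params = {
    'booster': 'gbtree',             # general parameter
    'nthread': 4,                    # general parameter
    'eta': 0.1,                      # booster parameters
    'max_depth': 6,
    'min_child_weight': 1,
    'gamma': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'lambda': 1,
    'alpha': 0,
    'objective': 'binary:logistic',  # learning-task parameters
    'eval_metric': 'auc',
}
# given a DMatrix dtrain (see Section 3.1), training would be:
# bst = xgb.train(params, dtrain, num_boost_round=100)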

2. GridSearchCV & xgboost.cv

sklearn provides a powerful tuning tool, sklearn.model_selection.GridSearchCV, which performs an exhaustive search over specified parameter values for an estimator.
sklearn documentation

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)

Important parameters:

estimator: the learner;
param_grid: the parameters to tune, as a dict, e.g. param_grid={'reg_alpha': [i/10 for i in range(0, 50)]}
scoring: the evaluation metric ('roc_auc', 'accuracy', 'f1', etc.)
cv: the cross-validation splitting strategy; defaults to 5-fold cross-validation, and stratified k-fold, repeated k-fold, etc. can also be used

xgboost itself provides xgboost.cv to find the best number of trees (number of learners). It is essentially the same idea as GridSearchCV: given the training set and a splitting strategy, it splits the training set into training and validation folds and runs cross-validation to find the optimal hyperparameter.

xgboost.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True)

Important parameters:

params (dict): parameters of the learner
dtrain (DMatrix): the training set; DMatrix is an internal data structure used by xgboost, optimized for memory efficiency and training speed
num_boost_round (int): number of boosting rounds (learners)
folds, metrics: analogous to GridSearchCV's cv and scoring parameters
early_stopping_rounds (int): early-stopping patience; e.g. if set to 50, training stops when the validation metric has not improved for 50 consecutive rounds

3. Simple XGBoost Parameter Tuning

Data used: pima-indians-diabetes .xlsx
Workflow:
1. Data preprocessing;
2. Set initial parameter values with a relatively high learning rate and find the best number of learners (via xgboost.cv);
3. Tune max_depth and min_child_weight;
4. Tune gamma;
5. Tune subsample and colsample_bytree;
6. Tune the regularization parameters: reg_alpha / reg_lambda;
7. Set a lower learning rate with more learners and use xgboost.cv again to find the best number of learners.

3.1 Data Preprocessing: Splitting the Data

Required libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import RepeatedKFold,StratifiedKFold,GridSearchCV
from sklearn import metrics

Read the data and split it (stratified split into training and test sets). After stratified splitting, the class distributions of the training and test sets are the same:

data=np.array(pd.read_excel(r'C:\pima-indians-diabetes .xlsx',header=None))
X=data[:,:-1]
Y=data[:,-1]
# per1=np.sum(Y)/len(Y)#0.3489583333333333
x_train,x_test,y_train,y_test=train_test_split(X,Y,stratify=Y,shuffle=True,random_state=1)
# per2=np.sum(y_test)/len(y_test)#0.3489583333333333
# per3=np.sum(y_train)/len(y_train)#0.3489583333333333

Convert to DMatrix

dtrain=xgb.DMatrix(x_train,label=y_train)
dtest=xgb.DMatrix(x_test,label=y_test)

3.2 Setting the Initial Hyperparameters (High Learning Rate)

These are set based on experience and can be adjusted later.

xgb1 = XGBClassifier(max_depth=2,
                     learning_rate=0.1,
                     n_estimators=5000,
                     silent=False,
                     objective='binary:logistic',
                     booster='gbtree',
                     n_jobs=4,
                     gamma=0,
                     min_child_weight=1,
                     subsample=0.8,
                     colsample_bytree=0.8,
                     reg_alpha=0,
                     seed=888)

Use the cv function to find the optimal number of learners

rkf=RepeatedKFold(n_splits=10,n_repeats=5,random_state=88)  # set the cross-validation splitting strategy
cv_result = xgb.cv(xgb1.get_xgb_params(),
                   dtrain,
                   num_boost_round=xgb1.get_xgb_params()['n_estimators'],
                   folds=rkf,
                   metrics='auc',
                   early_stopping_rounds=50,
                   callbacks=[xgb.callback.early_stop(50),
                              xgb.callback.print_evaluation(period=1,show_stdv=True)])

The result shows that the best number of trees is 33 (test-auc here is actually the validation-fold score, not the held-out test set):

#Stopping. Best iteration:
# [33]	train-auc:0.92898+0.00458	test-auc:0.84057+0.05435
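
The best round can also be read programmatically from the returned DataFrame rather than from the printed log (assuming the as_pandas default, the columns are train-/test-auc-mean and -std). A minimal sketch:

# read the best boosting round and its mean validation AUC from the cv DataFrame
best_round = cv_result['test-auc-mean'].idxmax()
print(best_round, cv_result['test-auc-mean'].max())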

Update the parameters (set the number of trees to 33)

xgb1 = XGBClassifier(max_depth=2,
                     learning_rate=0.1,
                     n_estimators=33,
                     silent=False,
                     objective='binary:logistic',
                     booster='gbtree',
                     n_jobs=4,
                     gamma=0,
                     min_child_weight=1,
                     subsample=0.8,
                     colsample_bytree=0.8,
                     reg_alpha=0,
                     seed=888)

3.3 Tuning max_depth and min_child_weight

Using GridSearchCV:

param_grid={'max_depth':range(0,20),
            'min_child_weight':range(0,20)}
grid_search=GridSearchCV(xgb1,param_grid,scoring='roc_auc',cv=rkf,iid=False)
grid_search.fit(x_train,y_train)
print('best_params:',grid_search.best_params_)
print('best_score:',grid_search.best_score_)

Result: max_depth=5, min_child_weight=11

# best_params: {'max_depth': 5, 'min_child_weight': 11}
# best_score: 0.8444164587726978

So update the parameters:

xgb1 = XGBClassifier(max_depth=5, learning_rate=0.1,n_estimators=33, silent=False, objective='binary:logistic', booster='gbtree', n_jobs=4,gamma=0,min_child_weight=11,subsample=0.8,colsample_bytree=0.8,reg_alpha=0,seed=888)

3.4 Tuning gamma

Tune gamma in the same way (gamma is first coarse-tuned; the fine-tuning step works just like the coarse one and was not listed in the original; a sketch is given after the code below):

param_grid={'gamma':[i for i in range(0,20)]}
grid_search=GridSearchCV(xgb1,param_grid,scoring='roc_auc',cv=rkf)
grid_search.fit(x_train,y_train)
print('best_params:',grid_search.best_params_)
print('best_score:',grid_search.best_score_)
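
The fine-tuning grid itself is not shown in the post; a hypothetical version would simply search a narrower range with a smaller step around the coarse optimum, for example:

# hypothetical fine grid around the coarse optimum (values are illustrative only)
param_grid={'gamma':[i/1000 for i in range(0,500)]}
grid_search=GridSearchCV(xgb1,param_grid,scoring='roc_auc',cv=rkf)
grid_search.fit(x_train,y_train)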

Result: gamma=0.299, and the score of 0.845 is a slight improvement over before

#best_params: {'gamma': 0.299}
# best_score: 0.845197258266488

Update the parameters:

xgb1 = XGBClassifier(max_depth=5, learning_rate=0.1,n_estimators=33, silent=False, objective='binary:logistic', booster='gbtree', n_jobs=4,gamma=0.299,min_child_weight=11,subsample=0.8,colsample_bytree=0.8,reg_alpha=0,seed=888)

3.5 Tuning subsample and colsample_bytree

This is again the coarse-tuning pass; the fine-tuning process is the same

param_grid={'subsample':[i/10 for i in range(1,11)],
            'colsample_bytree':[i/10 for i in range(1,11)]}
grid_search=GridSearchCV(xgb1,param_grid,scoring='roc_auc',iid=False,cv=rkf)
grid_search.fit(x_train,y_train)
print('best_params:',grid_search.best_params_)
print('best_score:',grid_search.best_score_)

Result: colsample_bytree=0.38, subsample=0.8, a slight improvement

#best_params: {'colsample_bytree': 0.38, 'subsample': 0.8}
# best_score: 0.8465616625676469

Update the parameters

xgb1 = XGBClassifier(max_depth=5, learning_rate=0.1,n_estimators=33, silent=False, objective='binary:logistic', booster='gbtree', n_jobs=4,gamma=0.299,min_child_weight=11,subsample=0.8,colsample_bytree=0.38,reg_alpha=0,seed=888)

3.6 Tuning the Regularization Parameters

Here reg_alpha is tuned; tuning reg_lambda instead would also be fine

param_grid={'reg_alpha':[i/10 for i in range(0,10)]}
grid_search=GridSearchCV(xgb1,param_grid,scoring='roc_auc',iid=False,cv=rkf)
grid_search.fit(x_train,y_train)
print('best_params:',grid_search.best_params_)
print('best_score:',grid_search.best_score_)

Result: reg_alpha=0.783, a slight improvement

# best_params: {'reg_alpha': 0.783}
# best_score: 0.8482302064155616

3.7 Tuning the Learning Rate and the Optimal Number of Learners

First lower the learning rate and increase the number of learners:

xgb1 = XGBClassifier(max_depth=5,
                     learning_rate=0.01,
                     n_estimators=5000,
                     silent=False,
                     objective='binary:logistic',
                     booster='gbtree',
                     n_jobs=4,
                     gamma=0.299,
                     min_child_weight=11,
                     subsample=0.8,
                     colsample_bytree=0.38,
                     reg_alpha=0.783,
                     seed=888)

Again use the cv function to obtain the optimal number of learners:

cv_result = xgb.cv(xgb1.get_xgb_params(),
                   dtrain,
                   num_boost_round=xgb1.get_xgb_params()['n_estimators'],
                   folds=rkf,
                   metrics='auc',
                   early_stopping_rounds=50,
                   callbacks=[xgb.callback.early_stop(50),
                              xgb.callback.print_evaluation(period=1,show_stdv=True)])

Result:

# Stopping. Best iteration:
# [403]	train-auc:0.90195+0.00506	test-auc:0.84498+0.06081
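
Following step 7 of the workflow, n_estimators would normally be updated to the best iteration found here before the final fit. The original post does not show this step; a sketch would be:

# hypothetical step: set the learner count to the best iteration found by xgb.cv
xgb1.set_params(n_estimators=403)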

Next, train on all of the training data and evaluate on the test data (0.65):

xgb_bst1=xgb1.fit(x_train,y_train)
pred_1=xgb_bst1.predict(x_test)
print(metrics.roc_auc_score(y_test,pred_1))
# 0.6564179104477611
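
One caveat about the evaluation above: predict returns hard class labels, so roc_auc_score is computed on 0/1 predictions rather than scores, which typically understates the AUC. Computing it on the predicted probability of the positive class would look like this (not part of the original post):

# AUC on predicted probabilities of the positive class rather than hard labels
prob_1 = xgb_bst1.predict_proba(x_test)[:, 1]
print(metrics.roc_auc_score(y_test, prob_1))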

NOTE: parameter tuning alone cannot significantly improve performance; what matters more is the quality of the data (distribution differences, outliers, missing data, and whether different features are on the same scale).

References:
1. sklearn documentation
2. XGBoost实战与调参优化
3. xgboost documentation
4. 机器学习系列(12)_XGBoost参数调优完全指南(附Python代码)

Reposted from blog.csdn.net/weixin_44839513/article/details/105299397