Simple XGBoost Model Parameter Tuning

1. XGBoost Principles and Important Parameters

1.1 Principles

Bilibili video: XGBoost video (Tianqi Chen's slides)
Boosting trees and gradient boosting trees
XGBoost is closely related to decision trees, boosting trees, and gradient boosting trees; there are many introductions to these algorithms online, for example this one: the relationship between GBT / GBDT / GBRT / XGBoost.
Here we mainly compare GBT with XGBoost:

Compared with GBT, the main difference is that XGBoost uses a second-order Taylor expansion of the loss, while GBT uses a first-order one (both use first- or second-order derivatives to approximately fit the residuals); see the formulation sketched after this list.
XGBoost adds an explicit regularization term to the objective.
XGBoost supports parallelization (at the level of split finding across features within a tree, not across trees).
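
To make the first two points concrete, the objective XGBoost minimizes at iteration t can be written with a second-order Taylor expansion of the loss plus an explicit regularization term (standard formulation from the XGBoost paper, included here only for context):

\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}\Big[g_i\, f_t(x_i) + \tfrac{1}{2} h_i\, f_t^2(x_i)\Big] + \Omega(f_t),
\qquad g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big),
\qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big),
\qquad \Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2

where T is the number of leaves and w the vector of leaf weights; plain gradient boosting keeps only the first-order term g_i f_t(x_i) and has no Ω(f_t).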

1.2 Important Parameters

See the official documentation directly:
XGBoost Parameters
Only the important parameters are listed here.
There are three groups of parameters:
1. General parameters: related to which booster is used (tree / linear)

booster [default=gbtree]: gbtree uses tree-based models; gblinear uses linear models (rarely used);
nthread [defaults to the maximum number of threads available if not set]: number of parallel threads used to run XGBoost

2. Booster parameters (for gbtree):

eta [default=0.3, alias: learning_rate]: learning rate; range: [0,1]
gamma [default=0, alias: min_split_loss]: minimum loss reduction required to make a further split on a leaf node; range: [0,∞]; the larger it is, the more conservative the model
max_depth [default=6]: maximum depth of a tree; range: [0,∞]
min_child_weight [default=1]: minimum sum of instance weight (hessian) needed in a child; range: [0,∞]; the larger it is, the more conservative the model
subsample [default=1]: subsample ratio of the training instances; range: (0,1]; setting it to 0.5 means XGBoost randomly samples half of the training data before growing trees, which helps prevent overfitting
colsample_bytree [default=1]: subsample ratio of columns when constructing each tree
lambda [default=1, alias: reg_lambda]: L2 regularization term on weights
alpha [default=0, alias: reg_alpha]: L1 regularization term on weights

3. Learning task parameters: specify the learning task and the corresponding learning objective

1. objective [default=reg:squarederror]:
(commonly used) reg: regression, binary: binary classification, multi: multiclass classification
binary:logistic: logistic regression for binary classification, outputs probabilities
reg:squarederror: regression with squared loss
reg:logistic: logistic regression

2. eval_metric [default according to objective]: evaluation metric for validation data; a default metric is assigned according to the objective (rmse for regression, classification error for classification, mean average precision for ranking)
Commonly used: mae, logloss, rmse, auc
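
To show how the three groups fit together, here is a minimal sketch of a parameter dict as it would be passed to the native xgb.train API (the values are illustrative defaults, not the ones tuned later in this post):

import xgboost as xgb

# illustrative parameter dict combining general, booster and learning-task parameters
params = {
    'booster': 'gbtree',             # general parameter
    'nthread': 4,                    # general parameter
    'eta': 0.1,                      # booster parameters
    'max_depth': 6,
    'min_child_weight': 1,
    'gamma': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'lambda': 1,
    'alpha': 0,
    'objective': 'binary:logistic',  # learning-task parameters
    'eval_metric': 'auc',
}
# given a DMatrix dtrain (see Section 3.1), training would be:
# bst = xgb.train(params, dtrain, num_boost_round=100)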

2. GridSearchCV & xgboost.cv

sklearn provides a powerful tuning tool, sklearn.model_selection.GridSearchCV, which performs an exhaustive search over specified parameter values for an estimator.
sklearn documentation

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)

Important parameters:

estimator: the learner;
param_grid: the parameters to tune, as a dict, e.g. param_grid={'reg_alpha': [i/10 for i in range(0, 50)]}
scoring: the evaluation metric ('roc_auc', 'accuracy', 'f1', etc.)
cv: the cross-validation splitting strategy; defaults to 5-fold cross-validation, and stratified k-fold, repeated k-fold, etc. can also be used

xgboost itself provides xgboost.cv to find the best number of trees (number of learners). It is essentially the same idea as GridSearchCV: given the training set and a splitting strategy, it splits the training set into training and validation folds and runs cross-validation to find the optimal hyperparameter.

xgboost.cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True)

Important parameters:

params (dict): parameters of the learner
dtrain (DMatrix): the training set; DMatrix is an internal data structure used by xgboost, optimized for memory efficiency and training speed
num_boost_round (int): number of boosting rounds (learners)
folds, metrics: analogous to GridSearchCV's cv and scoring parameters
early_stopping_rounds (int): early-stopping patience; e.g. if set to 50, training stops when the validation metric has not improved for 50 consecutive rounds

3. Simple XGBoost Parameter Tuning

Data used: pima-indians-diabetes .xlsx
Workflow:
1. Data preprocessing;
2. Set initial parameter values with a relatively high learning rate and find the best number of learners (via xgboost.cv);
3. Tune max_depth and min_child_weight;
4. Tune gamma;
5. Tune subsample and colsample_bytree;
6. Tune the regularization parameters: reg_alpha / reg_lambda;
7. Set a lower learning rate with more learners and use xgboost.cv again to find the best number of learners.

3.1 Data Preprocessing: Splitting the Data

Required libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import RepeatedKFold,StratifiedKFold,GridSearchCV
from sklearn import metrics

Read the data and split it (stratified split into training and test sets). After stratified splitting, the class distributions of the training and test sets are the same:

data=np.array(pd.read_excel(r'C:\pima-indians-diabetes .xlsx',header=None))
X=data[:,:-1]
Y=data[:,-1]
# per1=np.sum(Y)/len(Y)#0.3489583333333333
x_train,x_test,y_train,y_test=train_test_split(X,Y,stratify=Y,shuffle=True,random_state=1)
# per2=np.sum(y_test)/len(y_test)#0.3489583333333333
# per3=np.sum(y_train)/len(y_train)#0.3489583333333333

Convert to DMatrix

dtrain=xgb.DMatrix(x_train,label=y_train)
dtest=xgb.DMatrix(x_test,label=y_test)

3.2 Setting the Initial Hyperparameters (High Learning Rate)

These are set based on experience and can be adjusted later.

xgb1 = XGBClassifier(max_depth=2,
                     learning_rate=0.1,
                     n_estimators=5000,
                     silent=False,
                     objective='binary:logistic',
                     booster='gbtree',
                     n_jobs=4,
                     gamma=0,
                     min_child_weight=1,
                     subsample=0.8,
                     colsample_bytree=0.8,
                     reg_alpha=0,
                     seed=888)

Use the cv function to find the optimal number of learners

rkf=RepeatedKFold(n_splits=10,n_repeats=5,random_state=88)  # set the cross-validation splitting strategy
cv_result = xgb.cv(xgb1.get_xgb_params(),
                   dtrain,
                   num_boost_round=xgb1.get_xgb_params()['n_estimators'],
                   folds=rkf,
                   metrics='auc',
                   early_stopping_rounds=50,
                   callbacks=[xgb.callback.early_stop(50),
                              xgb.callback.print_evaluation(period=1,show_stdv=True)])

The result shows that the best number of trees is 33 (test-auc here is actually the validation-fold score, not the held-out test set):

#Stopping. Best iteration:
# [33]	train-auc:0.92898+0.00458	test-auc:0.84057+0.05435
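
The best round can also be read programmatically from the returned DataFrame rather than from the printed log (assuming the as_pandas default, the columns are train-/test-auc-mean and -std). A minimal sketch:

# read the best boosting round and its mean validation AUC from the cv DataFrame
best_round = cv_result['test-auc-mean'].idxmax()
print(best_round, cv_result['test-auc-mean'].max())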

Update the parameters (set the number of trees to 33)

xgb1 = XGBClassifier(max_depth=2,
                     learning_rate=0.1,
                     n_estimators=33,
                     silent=False,
                     objective='binary:logistic',
                     booster='gbtree',
                     n_jobs=4,
                     gamma=0,
                     min_child_weight=1,
                     subsample=0.8,
                     colsample_bytree=0.8,
                     reg_alpha=0,
                     seed=888)

3.3 Tuning max_depth and min_child_weight

Using GridSearchCV:

param_grid={'max_depth':range(0,20),
            'min_child_weight':range(0,20)}
grid_search=GridSearchCV(xgb1,param_grid,scoring='roc_auc',cv=rkf,iid=False)
grid_search.fit(x_train,y_train)
print('best_params:',grid_search.best_params_)
print('best_score:',grid_search.best_score_)

Result: max_depth=5, min_child_weight=11

# best_params: {'max_depth': 5, 'min_child_weight': 11}
# best_score: 0.8444164587726978

So update the parameters:

xgb1 = XGBClassifier(max_depth=5, learning_rate=0.1,n_estimators=33, silent=False, objective='binary:logistic', booster='gbtree', n_jobs=4,gamma=0,min_child_weight=11,subsample=0.8,colsample_bytree=0.8,reg_alpha=0,seed=888)

3.4 Tuning gamma

Tune gamma in the same way (gamma is first coarse-tuned; the fine-tuning step works just like the coarse one and was not listed in the original; a sketch is given after the code below):

param_grid={'gamma':[i for i in range(0,20)]}
grid_search=GridSearchCV(xgb1,param_grid,scoring='roc_auc',cv=rkf)
grid_search.fit(x_train,y_train)
print('best_params:',grid_search.best_params_)
print('best_score:',grid_search.best_score_)
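
The fine-tuning grid itself is not shown in the post; a hypothetical version would simply search a narrower range with a smaller step around the coarse optimum, for example:

# hypothetical fine grid around the coarse optimum (values are illustrative only)
param_grid={'gamma':[i/1000 for i in range(0,500)]}
grid_search=GridSearchCV(xgb1,param_grid,scoring='roc_auc',cv=rkf)
grid_search.fit(x_train,y_train)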

Result: gamma=0.299, and the score of 0.845 is a slight improvement over before

#best_params: {'gamma': 0.299}
# best_score: 0.845197258266488

Update the parameters:

xgb1 = XGBClassifier(max_depth=5, learning_rate=0.1,n_estimators=33, silent=False, objective='binary:logistic', booster='gbtree', n_jobs=4,gamma=0.299,min_child_weight=11,subsample=0.8,colsample_bytree=0.8,reg_alpha=0,seed=888)

3.5 Tuning subsample and colsample_bytree

This is again the coarse-tuning pass; the fine-tuning process is the same

param_grid={'subsample':[i/10 for i in range(1,11)],
            'colsample_bytree':[i/10 for i in range(1,11)]}
grid_search=GridSearchCV(xgb1,param_grid,scoring='roc_auc',iid=False,cv=rkf)
grid_search.fit(x_train,y_train)
print('best_params:',grid_search.best_params_)
print('best_score:',grid_search.best_score_)

Result: colsample_bytree=0.38, subsample=0.8, a slight improvement

#best_params: {'colsample_bytree': 0.38, 'subsample': 0.8}
# best_score: 0.8465616625676469

Update the parameters

xgb1 = XGBClassifier(max_depth=5, learning_rate=0.1,n_estimators=33, silent=False, objective='binary:logistic', booster='gbtree', n_jobs=4,gamma=0.299,min_child_weight=11,subsample=0.8,colsample_bytree=0.38,reg_alpha=0,seed=888)

3.6 Tuning the Regularization Parameters

Here reg_alpha is tuned; tuning reg_lambda instead would also be fine

param_grid={'reg_alpha':[i/10 for i in range(0,10)]}
grid_search=GridSearchCV(xgb1,param_grid,scoring='roc_auc',iid=False,cv=rkf)
grid_search.fit(x_train,y_train)
print('best_params:',grid_search.best_params_)
print('best_score:',grid_search.best_score_)

Result: reg_alpha=0.783, a slight improvement

# best_params: {'reg_alpha': 0.783}
# best_score: 0.8482302064155616

3.7 Tuning the Learning Rate and the Optimal Number of Learners

First lower the learning rate and increase the number of learners:

xgb1 = XGBClassifier(max_depth=5,
                     learning_rate=0.01,
                     n_estimators=5000,
                     silent=False,
                     objective='binary:logistic',
                     booster='gbtree',
                     n_jobs=4,
                     gamma=0.299,
                     min_child_weight=11,
                     subsample=0.8,
                     colsample_bytree=0.38,
                     reg_alpha=0.783,
                     seed=888)

Again use the cv function to obtain the optimal number of learners:

cv_result = xgb.cv(xgb1.get_xgb_params(),
                   dtrain,
                   num_boost_round=xgb1.get_xgb_params()['n_estimators'],
                   folds=rkf,
                   metrics='auc',
                   early_stopping_rounds=50,
                   callbacks=[xgb.callback.early_stop(50),
                              xgb.callback.print_evaluation(period=1,show_stdv=True)])

Result:

# Stopping. Best iteration:
# [403]	train-auc:0.90195+0.00506	test-auc:0.84498+0.06081
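
Following step 7 of the workflow, n_estimators would normally be updated to the best iteration found here before the final fit. The original post does not show this step; a sketch would be:

# hypothetical step: set the learner count to the best iteration found by xgb.cv
xgb1.set_params(n_estimators=403)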

Next, train on all of the training data and evaluate on the test data (0.65):

xgb_bst1=xgb1.fit(x_train,y_train)
pred_1=xgb_bst1.predict(x_test)
print(metrics.roc_auc_score(y_test,pred_1))
# 0.6564179104477611
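
One caveat about the evaluation above: predict returns hard class labels, so roc_auc_score is computed on 0/1 predictions rather than scores, which typically understates the AUC. Computing it on the predicted probability of the positive class would look like this (not part of the original post):

# AUC on predicted probabilities of the positive class rather than hard labels
prob_1 = xgb_bst1.predict_proba(x_test)[:, 1]
print(metrics.roc_auc_score(y_test, prob_1))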

NOTE: parameter tuning alone cannot significantly improve performance; what matters more is the quality of the data (distribution differences, outliers, missing data, and whether different features are on the same scale).

References:
1. sklearn documentation
2. XGBoost实战与调参优化
3. xgboost documentation
4. 机器学习系列(12)_XGBoost参数调优完全指南(附Python代码)

Reposted from blog.csdn.net/weixin_44839513/article/details/105299397