Machine Learning Notes: XGBoost Parameters Explained

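First, XGBoost offers two interfaces: its own native API and a Scikit-Learn style API. Their usage differs in small details, but not by much.

Before running XGBoost, we must set three types of parameters: general parameters, booster parameters, and task parameters.

General parameters relate to which booster we use for boosting, usually a tree model or a linear model.
Booster parameters depend on the booster you have chosen.
Learning task parameters decide the learning scenario; for example, regression tasks may use different parameters than ranking tasks.
Command line parameters relate to the behavior of the CLI version of xgboost.
This article covers only the native xgboost API; the Scikit-Learn API can be cross-referenced against it.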

xgboost.train(params, dtrain, num_boost_round=10, evals=(),
              obj=None, feval=None, maximize=False, early_stopping_rounds=None,
              evals_result=None, verbose_eval=True, learning_rates=None,
              xgb_model=None, callbacks=None)

params: a dictionary holding the training parameter names and their values, in the following form:

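params = {
    'booster': 'gbtree',
    'min_child_weight': 100,
    'eta': 0.02,
    'colsample_bytree': 0.7,
    'max_depth': 12,
    'subsample': 0.7,
    'alpha': 1,
    'gamma': 1,
    'silent': 1,                # newer XGBoost releases use 'verbosity' instead
    'objective': 'reg:linear',  # renamed 'reg:squarederror' in newer releases
    'seed': 12
}
# verbose_eval is an argument of xgboost.train(), not a booster parameter,
# so it is passed to train() rather than placed in params.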

The individual parameters are described below.

General Parameters
booster [default=gbtree]

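Two models are available: gbtree and gblinear. gbtree boosts with tree-based models, while gblinear boosts with linear models. The default is gbtree.
silent [default=0]

0 prints runtime messages; 1 runs in silent mode and prints no runtime messages. The default is 0.
nthread [default to maximum number of threads available if not set]

Number of threads XGBoost uses at runtime. The default is the maximum number of threads available on the current system.
num_pbuffer [set automatically by xgboost, no need to be set by user]

Size of the prediction buffer, normally set to the number of training instances. The buffers are used to save the prediction results of the last boosting step.
num_feature [set automatically by xgboost, no need to be set by user]

Feature dimension used in boosting, set to the number of features. XGBoost sets it automatically, so there is no need to set it by hand.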
Booster Parameters
eta [default=0.3]

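Step-size shrinkage used in the update to prevent overfitting. After each boosting step, the algorithm directly obtains the weights of the new features; eta shrinks these weights to make the boosting process more conservative. The default is 0.3
Range: [0,1]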
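To make the effect of eta concrete, here is a minimal sketch in plain Python. It is not XGBoost's implementation: it assumes a toy booster under squared error in which every "tree" is a single leaf predicting the mean residual, with no regularization. The helper name boost_constant is made up for this illustration.

```python
def boost_constant(targets, eta, num_rounds):
    """Toy boosting under squared error: each round fits a single-leaf
    'tree' (the mean residual) and adds it scaled by eta."""
    prediction = 0.0
    history = []
    for _ in range(num_rounds):
        residuals = [y - prediction for y in targets]
        leaf_weight = sum(residuals) / len(residuals)  # best constant fit
        prediction += eta * leaf_weight                # shrink the update by eta
        history.append(prediction)
    return history

targets = [1.0, 2.0, 3.0]  # the mean is 2.0
fast = boost_constant(targets, eta=1.0, num_rounds=5)
slow = boost_constant(targets, eta=0.3, num_rounds=5)
print(fast)  # with eta=1.0 the first round already reaches the mean
print(slow)  # with eta=0.3 the prediction approaches the mean gradually
```

A smaller eta needs more boosting rounds to reach the same fit, which is why eta and num_boost_round are usually tuned together.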
gamma [default=0]

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm will be.
Range: [0,∞]
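The role of gamma can be sketched with the split-gain formula from the XGBoost paper: a candidate split is kept only if its gain, after subtracting gamma, is still positive. In the sketch below, split_gain is an illustrative helper rather than a library function, the g and h arguments stand for the sums of first and second derivatives of the loss in each child, reg_lambda is the L2 regularization term, and the gradient statistics are made-up numbers.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma

# Hypothetical gradient statistics for one candidate split.
stats = dict(g_left=-4.0, h_left=3.0, g_right=4.0, h_right=3.0)
gain_no_penalty = split_gain(**stats, gamma=0.0)
gain_with_penalty = split_gain(**stats, gamma=5.0)
print(gain_no_penalty)    # 4.0, so the split would be made
print(gain_with_penalty)  # -1.0, so the split would be rejected
```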
max_depth [default=6]

Maximum depth of a tree. The default is 6
Range: [1,∞]
min_child_weight [default=1]

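Minimum sum of instance weights required in a child node. If a partition step would produce a leaf whose sum of instance weights is below min_child_weight, the tree stops splitting there. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm.
Range: [0,∞]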
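A rough sketch of why the meaning depends on the objective: the "instance weight" being summed is the second derivative (hessian) of the loss, which is 1 per row for squared error and p(1-p) for logistic loss. The helper names below are hypothetical, and unit sample weights are assumed.

```python
def hessian_sum_squared_error(num_rows):
    # Squared-error loss has hessian 1 for every row (unit sample weights assumed),
    # so the sum is just the row count.
    return float(num_rows)

def hessian_sum_logistic(probabilities):
    # Logistic loss has hessian p * (1 - p) for a predicted probability p.
    return sum(p * (1.0 - p) for p in probabilities)

def child_is_allowed(hessian_sum, min_child_weight):
    return hessian_sum >= min_child_weight

min_child_weight = 2.0
regression_ok = child_is_allowed(hessian_sum_squared_error(3), min_child_weight)

# Ten rows that are already predicted with high confidence carry little hessian mass.
confident_rows = [0.99] * 10
logistic_ok = child_is_allowed(hessian_sum_logistic(confident_rows), min_child_weight)

print(regression_ok)  # three rows are enough under squared error
print(logistic_ok)    # ten confident rows are not enough under logistic loss
```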
max_delta_step [default=0]

Maximum delta step we allow each tree's weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative. This parameter is usually not needed, but it might help in logistic regression when the classes are extremely imbalanced. Setting it to a value of 1-10 might help control the update.
Range: [0,∞]
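One way to picture max_delta_step is as a cap on the absolute value of a leaf weight. The sketch below is a simplification that ignores L1 regularization and other details of the real implementation; leaf_weight is a hypothetical helper and the gradient sums are made-up numbers.

```python
import math

def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, max_delta_step=0.0):
    """Leaf weight -G / (H + lambda), capped at max_delta_step when it is non-zero."""
    weight = -grad_sum / (hess_sum + reg_lambda)
    if max_delta_step != 0.0 and abs(weight) > max_delta_step:
        weight = math.copysign(max_delta_step, weight)
    return weight

# A leaf with a large gradient but a tiny hessian, as can happen with
# heavily imbalanced classes under logistic loss.
unconstrained = leaf_weight(grad_sum=-9.0, hess_sum=0.5)
capped = leaf_weight(grad_sum=-9.0, hess_sum=0.5, max_delta_step=1.0)
print(unconstrained)  # 6.0
print(capped)         # 1.0
```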
subsample [default=1]

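Fraction of the full training set that is sampled to train the model. Setting it to 0.5 means XGBoost randomly draws 50% of the training instances to build the trees, which helps prevent overfitting.
Range: (0,1]
colsample_bytree [default=1]
Fraction of features (columns) sampled when building each tree. The default is 1
Range: (0,1]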
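The two sampling ratios can be pictured with the standard library alone. This is only a sketch of the idea (draw a subset of row indices and a subset of column indices for a tree); XGBoost's own sampling procedure differs in its details, and sample_rows_and_columns is a made-up helper.

```python
import random

def sample_rows_and_columns(num_rows, num_cols, subsample, colsample_bytree, seed):
    """Pick the row and column indices one tree would be trained on."""
    rng = random.Random(seed)
    n_rows = max(1, round(num_rows * subsample))
    n_cols = max(1, round(num_cols * colsample_bytree))
    rows = sorted(rng.sample(range(num_rows), n_rows))
    cols = sorted(rng.sample(range(num_cols), n_cols))
    return rows, cols

rows, cols = sample_rows_and_columns(
    num_rows=100, num_cols=10, subsample=0.5, colsample_bytree=0.7, seed=12
)
print(len(rows), len(cols))  # 50 rows and 7 columns are used for this tree
```

Fixing the seed makes the draw reproducible, which is the same role the seed parameter plays in training.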
Task Parameters
objective [ default=reg:linear ]
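Defines the learning task and the corresponding learning objective. The available objective functions are:

“reg:linear”: linear regression.
“reg:logistic”: logistic regression.
“binary:logistic”: logistic regression for binary classification; the output is a probability.
“binary:logitraw”: logistic regression for binary classification; the output is the raw score wTx before the logistic transformation.
“count:poisson”: Poisson regression for count data; the output is the mean of the Poisson distribution. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
“multi:softmax”: makes XGBoost handle multiclass classification with the softmax objective; you also need to set num_class (the number of classes).
“multi:softprob”: same as softmax, but the output is a vector of ndata * nclass, which can be reshaped into a matrix with ndata rows and nclass columns. Each row holds the predicted probability of the sample belonging to each class.
“rank:pairwise”: sets XGBoost to do ranking tasks by minimizing the pairwise loss.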
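The relationship between “multi:softprob” and “multi:softmax” can be sketched with plain lists. The example assumes the prediction arrives as a flat vector of length ndata * nclass, as described above; the probabilities are made-up numbers and both helper names are hypothetical.

```python
def reshape_softprob(flat_probs, num_class):
    """Turn a flat ndata * nclass vector into ndata rows of nclass probabilities."""
    return [flat_probs[i:i + num_class] for i in range(0, len(flat_probs), num_class)]

def to_class_labels(prob_rows):
    """Pick the most probable class per row, which is what softmax would report."""
    return [row.index(max(row)) for row in prob_rows]

flat = [0.1, 0.7, 0.2,   # sample 0
        0.8, 0.1, 0.1]   # sample 1
prob_rows = reshape_softprob(flat, num_class=3)
class_labels = to_class_labels(prob_rows)
print(prob_rows)     # [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]
print(class_labels)  # [1, 0]
```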
base_score [ default=0.5 ]

The initial prediction score of all instances (global bias).
eval_metric [ default according to objective ]
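Evaluation metric for the validation data. A default metric is assigned according to the objective (rmse for regression, error for classification, and mean average precision for ranking).
Users can add multiple evaluation metrics. Python users should pass the metrics as a list of parameter pairs rather than a map, so that a later 'eval_metric' does not override an earlier one.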
The choices are listed below:
“rmse”: root mean square error
“logloss”: negative log-likelihood
“error”: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
“merror”: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases).
“mlogloss”: Multiclass logloss
“auc”: Area under the curve for ranking evaluation.
“ndcg”:Normalized Discounted Cumulative Gain
“map”:Mean average precision
“ndcg@n”, “map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
“ndcg-”, “map-”, “ndcg@n-”, “map@n-”: In XGBoost, NDCG and MAP will evaluate the score of a list without any positive samples as 1. By adding “-” to the evaluation metric, XGBoost will evaluate these scores as 0 to be consistent under some conditions.
“gamma-deviance”: residual deviance for gamma regression
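For reference, three of the metrics above can be written out by hand. This is a plain-Python sketch of the definitions for binary labels and unweighted samples, not the library's code; the labels and predicted probabilities are made-up numbers.

```python
import math

def rmse(labels, preds):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels))

def logloss(labels, probs):
    """Negative log-likelihood for binary labels in {0, 1}."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(labels, probs))
    return -total / len(labels)

def error_rate(labels, probs, threshold=0.5):
    """#(wrong cases) / #(all cases); predictions above the threshold count as positive."""
    wrong = sum(1 for y, p in zip(labels, probs) if (p > threshold) != (y == 1))
    return wrong / len(labels)

eval_labels = [1, 0, 1, 0]
eval_probs = [0.9, 0.2, 0.4, 0.6]
print(rmse(eval_labels, eval_probs))
print(logloss(eval_labels, eval_probs))
print(error_rate(eval_labels, eval_probs))  # 0.5: the last two rows are misclassified
```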
seed [ default=0 ]

Random number seed. The default is 0.

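dtrain: the training data.

num_boost_round: the number of boosting iterations, that is, how many base models are built.

evals: a list of items to evaluate during training, in the form evals = [(dtrain,'train'),(dval,'val')] or evals = [(dtrain,'train')]. The first form lets us watch performance on the validation set while training.

obj: custom objective function.

feval: custom evaluation function.

maximize: whether to maximize the evaluation function.

early_stopping_rounds: the number of rounds used for early stopping. With a value of 100, training stops once the validation error has failed to decrease for 100 consecutive rounds. This requires at least one item in evals; if there are several, the last one is used. The returned model is the one from the last iteration (not the best one). If early_stopping_rounds is set, the model gets three attributes: bst.best_score, bst.best_iteration and bst.best_ntree_limit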
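The stopping rule described above can be sketched without the library. The function below is a hypothetical helper that takes a list of per-round validation errors (made-up numbers here, for a metric that should be minimized) and reports where training would stop and which round was best.

```python
def early_stopping(validation_errors, early_stopping_rounds):
    """Return (last_iteration, best_iteration, best_score) for a metric to minimize."""
    best_score = float("inf")
    best_iteration = -1
    last_iteration = -1
    for iteration, error in enumerate(validation_errors):
        last_iteration = iteration
        if error < best_score:
            best_score = error
            best_iteration = iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            break  # no improvement for early_stopping_rounds rounds
    return last_iteration, best_iteration, best_score

errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
last_it, best_it, best = early_stopping(errors, early_stopping_rounds=3)
print(last_it, best_it, best)  # 5 2 0.35
```

Training stops at round 5 while the best round was 2, which is why the text distinguishes the last iteration from the best one; the 0.30 at round 6 is never reached.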

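evals_result: a dictionary that stores the evaluation results of the items in the watchlist (evals).

verbose_eval: (a boolean or a number) also requires at least one item in evals. If True, the evaluation results for the items in evals are printed at every round; if a number such as 5 is given, they are printed every 5 iterations.

learning_rates: a list of learning rates, one for each boosting round.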
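As an illustration, such a list could be built with a simple exponential decay. The starting rate and decay factor below are arbitrary choices for the sketch, and it assumes one entry per boosting round.

```python
num_boost_round = 5
initial_eta = 0.3   # starting learning rate (arbitrary for this sketch)
decay = 0.9         # multiplicative decay per round (arbitrary for this sketch)

# One learning rate per boosting round, shrinking as training proceeds.
learning_rates = [initial_eta * decay ** i for i in range(num_boost_round)]
print(learning_rates)
```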

xgb_model: an XGB model to load before training.

Original source for the above: https://blog.csdn.net/iyuanshuo/article/details/80142730
