https://zhuanlan.zhihu.com/p/28672955
3.1 XGBoost's parameters fall into three categories (keeping the original English names here):
- General Parameters: control the overall functionality
- Booster Parameters: control the individual learner at each iteration
- Learning Task Parameters: control the optimization objective and the tuning steps
(1)General Parameters:
booster [default=gbtree]
- Selects the type of model to run at each iteration. There are two options:
- gbtree: tree-based models
- gblinear: linear models
silent [default=0]
- Set to 1 to suppress the running messages; set to 0 to print them.
nthread [default to maximum number of threads available if not set]
- Sets the number of cores used for parallel processing.
- To run on all available cores, leave it at the default.
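Putting the general parameters together, here is a minimal sketch using the native API; the data is synthetic and the nthread value is just an example:

```python
import numpy as np
import xgboost as xgb

# Synthetic binary-classification data, for illustration only.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "gbtree",            # tree-based booster (the default)
    "silent": 0,                    # 0 = print running messages
    "nthread": 4,                   # example value; omit to use all cores
    "objective": "binary:logistic",
}
model = xgb.train(params, dtrain, num_boost_round=10)
```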
(2)Booster Parameters
There are two kinds of boosters, linear and tree-based; the tree booster is by far the more commonly used. (A configuration sketch combining the parameters below appears at the end of this section.)
eta [default=0.3]
- Analogous to the learning rate in GBM.
- Makes the model more robust by shrinking the weights on each step.
- Typical values: 0.01-0.2
min_child_weight [default=1]
- Defines the minimum sum of weights of all observations required in a child.
- Used to control over-fitting: higher values prevent the model from learning relations which might be highly specific to the particular sample selected for a tree.
- Too high values can lead to under-fitting, hence it should be tuned using CV.
max_depth [default=6]
- The maximum depth of a tree, same as in GBM.
- Used to control over-fitting, as a higher depth allows the model to learn relations very specific to a particular sample.
- Should be tuned using CV.
- Typical values: 3-10
max_leaf_nodes
- The maximum number of terminal nodes (leaves) in a tree; another way to limit tree size and control over-fitting.
- Can be defined in place of max_depth. Since binary trees are created, a depth of 'n' would produce a maximum of 2^n leaves.
- If this is defined, GBM will ignore max_depth: once the leaf count is fixed, the implied depth of the binary tree takes precedence.
gamma [default=0]
- A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split; with gamma=0, any split that reduces the loss is allowed.
- Makes the algorithm conservative. The appropriate value varies with the loss function and should be tuned.
max_delta_step [default=0]
- The maximum delta step we allow each tree's weight estimation to be. If set to 0, there is no constraint; a positive value makes the update step more conservative. (I still haven't fully understood this parameter...)
- Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.
subsample [default=1]
- Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree, e.g. 0.8 means each tree is built on a random 80% sample of the training data.
- Lower values make the algorithm more conservative and prevent over-fitting, but values that are too small can lead to under-fitting.
- Typical values: 0.5-1
colsample_bytree [default=1]
- Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree, e.g. 0.8 means each tree is built from a random 80% of the columns.
- Typical values: 0.5-1
colsample_bylevel [default=1]
- Denotes the subsample ratio of columns for each split, at each level.
- I don't use this often because subsample and colsample_bytree will do the job for you, but you can explore it further if you like.
lambda [default=1]
- L2 regularization term on weights (analogous to Ridge regression).
- Handles the regularization part of XGBoost. Though many data scientists don't use it often, it is worth exploring to reduce over-fitting: L2 regularization discourages over-reliance on any single feature and spreads weight across more of them.
alpha [default=0]
- L1 regularization term on weights (analogous to Lasso regression).
- Can be used in case of very high dimensionality so that the algorithm runs faster: L1 regularization encourages sparse weights, which speeds up computation.
scale_pos_weight [default=1]
- A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.
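Putting the tree-booster knobs together, here is a minimal configuration sketch with the native API; every value below is an illustrative starting point, not a recommendation:

```python
import numpy as np
import xgboost as xgb

# Synthetic data, for illustration only.
X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "eta": 0.1,                # learning rate, typically 0.01-0.2
    "max_depth": 5,            # typically 3-10; deeper trees over-fit more easily
    "min_child_weight": 1,     # raise to make splits more conservative
    "gamma": 0,                # minimum loss reduction required to split
    "subsample": 0.8,          # row sampling per tree, typically 0.5-1
    "colsample_bytree": 0.8,   # column sampling per tree, typically 0.5-1
    "lambda": 1,               # L2 regularization
    "alpha": 0,                # L1 regularization
    "scale_pos_weight": 1,     # increase when classes are imbalanced
}
model = xgb.train(params, dtrain, num_boost_round=100)
```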
(3)Learning Task Parameters
These parameters are used to define the optimization objective and the metric to be calculated at each step.
objective [default=reg:linear]
- This defines the loss function to be minimized. The most commonly used values are:
- binary:logistic: logistic regression for binary classification, returns predicted probability (not class)
- multi:softmax: multiclass classification using the softmax objective, returns predicted class (not probabilities)
- you also need to set an additional num_class (number of classes) parameter defining the number of unique classes
- multi:softprob: same as softmax, but returns predicted probability of each data point belonging to each class.
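For example, a multiclass setup needs both objective and num_class; a sketch with synthetic three-class data:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(150, 4)
y = np.random.randint(0, 3, size=150)   # three classes: 0, 1, 2
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "multi:softmax",  # returns the predicted class
    "num_class": 3,                # required for the multiclass objectives
}
model = xgb.train(params, dtrain, num_boost_round=10)
pred = model.predict(dtrain)       # class labels 0/1/2
# With multi:softprob instead, predict() returns one probability per class.
```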
eval_metric [ default according to objective ]
- The metric to be used for validation data.
- The default values are rmse for regression and error for classification.
- Typical values are:
- rmse – root mean square error
- mae – mean absolute error
- logloss – negative log-likelihood
- error – Binary classification error rate (0.5 threshold)
- merror – Multiclass classification error rate
- mlogloss – Multiclass logloss
- auc – Area under the curve
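A sketch of overriding the default metric and monitoring it on a held-out set; the data and split are illustrative:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)
dtrain = xgb.DMatrix(X[:150], label=y[:150])
dvalid = xgb.DMatrix(X[150:], label=y[150:])

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",   # overrides the classification default 'error'
}
# The metric is printed for each named evaluation set at every round.
model = xgb.train(params, dtrain, num_boost_round=20,
                  evals=[(dtrain, "train"), (dvalid, "valid")])
```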
seed [default=0]
- The random number seed.
- Can be used for generating reproducible results and also for parameter tuning; without a fixed seed, each run can produce different results.
If you have been using Scikit-Learn till now, these parameter names might not look familiar. The good news is that the xgboost module in Python has an sklearn wrapper called XGBClassifier, which follows the sklearn naming convention. The parameter names that change are:
- eta –> learning_rate
- lambda –> reg_lambda
- alpha –> reg_alpha
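A sketch of the same kind of configuration through the sklearn wrapper; the values are illustrative:

```python
import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

clf = XGBClassifier(
    learning_rate=0.1,   # eta in the native API
    reg_lambda=1,        # lambda in the native API
    reg_alpha=0,         # alpha in the native API
    max_depth=5,
    n_estimators=100,
)
clf.fit(X, y)
proba = clf.predict_proba(X)   # sklearn-style prediction API
```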
(4)Parameter Tuning
param_grid
If the dataset is not too large, you can enumerate all candidate parameter values in a param_grid and validate them with GridSearchCV (see the sketch after this list).
booster
Set some initial values, then tune step by step:
step1: With the learning rate and booster parameters fixed, tune n_estimators.
step2: With n_estimators and the booster parameters fixed, tune the learning rate.
step3: With n_estimators and the learning rate fixed, tune the booster parameters.
step4: Start with max_depth and min_child_weight, which have the largest impact, then work through every other booster parameter that may matter.
step5: Shrink the learning rate to find its best value.
step6: You end up with a parameter combination that performs reasonably well.
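A minimal GridSearchCV sketch for the enumeration route above; the grid values are illustrative, not recommendations:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

param_grid = {
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBClassifier(n_estimators=100, random_state=27),  # fixed seed for reproducibility
    param_grid,
    scoring="roc_auc",   # pick a metric that matches your task
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```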