References

XGBoost official parameter documentation

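1. Introduction

XGBoost has three main types of parameters: general parameters, booster parameters, and task parameters. The command-line version adds a further group of its own.

General parameters control which booster is used for boosting, usually a tree model or a linear model
Booster parameters depend on the booster you choose
Learning task parameters define the learning scenario; for example, a regression task and a ranking task use different parameters
Command-line parameters control the behavior of the CLI version of xgboost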
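To make the grouping concrete, here is a minimal sketch of how the three parameter types could be kept apart and then merged into the single dictionary that the Python interface expects. The specific values (for example nthread=4 and the choice of objective) are illustrative assumptions, not recommendations.

```python
# Illustrative values only; parameter names come from the sections below.
general_params = {"booster": "gbtree", "nthread": 4}
booster_params = {"eta": 0.3, "max_depth": 6, "subsample": 1.0}
task_params = {"objective": "binary:logistic", "eval_metric": "error"}

# All three groups end up in one flat dictionary.
params = {**general_params, **booster_params, **task_params}

# With the xgboost Python package installed, this dictionary would typically
# be passed as: xgboost.train(params, dtrain, num_boost_round=10)
# where dtrain is an xgboost.DMatrix built from your own data (not shown).
for name, value in params.items():
    print(name, "=", value)
```

Keeping the groups separate makes it easier to swap the booster without touching the task settings.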

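2. General parameters

booster [default=gbtree]

Selects which booster to use: gbtree, gblinear, or dart. gbtree and dart are tree-based models; gblinear uses linear functions

silent [default=0]

Whether to print running messages. 0 prints them, 1 suppresses them

nthread [default to maximum number of threads available if not set]

Number of parallel threads used to run xgboost. Defaults to the maximum number of threads available on the system

num_pbuffer [set automatically by xgboost, no need to be set by the user]

Size of the prediction buffer, normally set to the number of training instances. The buffer stores the prediction results of the last boosting step

num_feature [set automatically by xgboost, no need to be set by the user]

Feature dimension used in boosting, normally set to the maximum feature dimension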

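3. Booster parameters

3.1 Parameters for Tree Booster

eta [default=0.3]

Step size shrinkage used in each update to prevent overfitting. After each boosting step we obtain the weights of the new features, and eta shrinks these weights to make the boosting process more robust

Range: [0,1]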
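The effect of shrinkage can be illustrated with a toy example. The sketch below is not XGBoost's implementation: it assumes squared error and replaces each tree with a single constant (the mean residual), so that only the role of eta is visible. The function name is made up for this illustration.

```python
def boost_constant_model(targets, eta, num_round):
    """Toy boosting under squared error where every "tree" is one constant."""
    prediction = 0.0
    history = []
    for _ in range(num_round):
        residuals = [y - prediction for y in targets]
        step = sum(residuals) / len(residuals)  # best constant fit to the residuals
        prediction += eta * step                # eta shrinks the update
        history.append(prediction)
    return history


targets = [10.0, 10.0, 10.0]
full_step = boost_constant_model(targets, eta=1.0, num_round=3)
shrunk = boost_constant_model(targets, eta=0.3, num_round=3)
print(full_step)  # reaches the target in the first round
print(shrunk)     # approaches the target gradually
```

With eta=1.0 the first round already fits the data completely; with eta=0.3 each round only closes 30% of the remaining gap, which is why a smaller eta usually needs more boosting rounds.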

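gamma [default=0]

Minimum loss reduction required to split a node further. The larger the value, the more conservative the algorithm

Range: [0,∞]

max_depth [default=6]

Maximum depth of a tree. Larger values make each tree more complex and more prone to overfitting

Range: [1,∞]

min_child_weight [default=1]

Minimum sum of instance weights (Hessian) required in a child node. If a split would produce a leaf whose sum of instance weights is below this value, the split is abandoned. The larger the value, the more conservative the model, at the risk of underfitting

Range: [0,∞]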
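How gamma and min_child_weight interact can be sketched with the split gain formula from the XGBoost paper, where G and H are the sums of first- and second-order gradients in each child and the L2 term is the lambda parameter described further down. This is a simplified illustration under those assumptions, not the library's code; the function names are hypothetical.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
    return gain - gamma


def accept_split(g_left, h_left, g_right, h_right,
                 reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Reject splits with a too-light child or a non-positive penalized gain."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0.0


# Two children, each with gradient sum of magnitude 4 and Hessian sum 2.
print(split_gain(-4.0, 2.0, 4.0, 2.0))                         # about 5.33
print(accept_split(-4.0, 2.0, 4.0, 2.0))                        # accepted
print(accept_split(-4.0, 2.0, 4.0, 2.0, gamma=6.0))             # gamma too large
print(accept_split(-4.0, 2.0, 4.0, 2.0, min_child_weight=3.0))  # child too light
```

Raising either gamma or min_child_weight rejects more candidate splits, which is the sense in which both make the model more conservative.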

max_delta_step [default=0]

Maximum delta step allowed for each tree's weight estimation. A value of 0 means there is no constraint. A positive value can make the update step more conservative. This parameter is usually not needed, but it may help in logistic regression when the classes are extremely imbalanced. Setting it to a value between 1 and 10 may help control the update

Range: [0,∞]

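subsample [default=1]

Fraction of the training instances sampled for each tree. Setting it to 0.5 means that half of the instances are randomly selected in each round to grow the tree, which helps reduce overfitting

Range: (0,1]

colsample_bytree [default=1]

Fraction of features (columns) randomly sampled when constructing each tree

Range: (0,1]

colsample_bylevel [default=1]

Fraction of features randomly sampled for each split, at each level of the tree

Range: (0,1]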
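The row and column sampling described above can be pictured with a short sketch. It is a simplification that draws fixed-size samples with Python's random module; XGBoost's internal sampling procedure may differ, and the helper name is an assumption made for this example.

```python
import random


def sample_for_tree(num_rows, num_features, subsample, colsample_bytree, rng):
    """Pick the row and column indices one tree would be trained on."""
    n_rows = max(1, int(num_rows * subsample))
    n_cols = max(1, int(num_features * colsample_bytree))
    rows = sorted(rng.sample(range(num_rows), n_rows))
    cols = sorted(rng.sample(range(num_features), n_cols))
    return rows, cols


rng = random.Random(0)  # fixed seed so the run is repeatable
for tree_index in range(3):
    rows, cols = sample_for_tree(10, 4, subsample=0.5, colsample_bytree=0.5, rng=rng)
    print(tree_index, rows, cols)
```

Each tree sees a different half of the rows and a different half of the features, so no single tree can memorize the full training set.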

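lambda [default=1]

L2 regularization term on weights (analogous to Ridge regression). Increasing this value makes the model more conservative

alpha [default=0]

L1 regularization term on weights (analogous to LASSO). Increasing this value makes the model more conservative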

tree_method, string [default='auto']

The tree construction algorithm used in XGBoost (see the description in the reference paper)

The distributed and external memory versions only support the approximate algorithm.

Choices: {'auto', 'exact', 'approx'}

'auto': Uses a heuristic to choose the faster one. For small to medium datasets, exact greedy is used. For very large datasets, the approximate algorithm is chosen. Because the old behavior was to always use exact greedy on a single machine, the user gets a message when the approximate algorithm is chosen, to notify them of this choice.

'exact': Exact greedy algorithm.

'approx': Approximate greedy algorithm using sketching and histograms.

sketch_eps, [default=0.03]

This is only used by the approximate greedy algorithm.

It roughly translates into O(1 / sketch_eps) bins. Compared to selecting the number of bins directly, this comes with a theoretical guarantee on sketch accuracy.

Users usually do not have to tune this, but consider setting it to a lower number for more accurate enumeration.

Range: (0, 1)
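As a rough feel for the O(1 / sketch_eps) relationship, the following sketch simply evaluates 1 / sketch_eps. The true bin count is only proportional to this quantity, so treat the numbers as an order of magnitude; the function name is made up for the illustration.

```python
def approx_num_bins(sketch_eps):
    """Order-of-magnitude estimate of the number of candidate bins."""
    if not 0.0 < sketch_eps < 1.0:
        raise ValueError("sketch_eps must lie in (0, 1)")
    return round(1.0 / sketch_eps)


for eps in (0.1, 0.03, 0.01):
    print(eps, approx_num_bins(eps))
```

Lowering sketch_eps from the default 0.03 to 0.01 roughly triples the number of candidate split points considered per feature.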

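scale_pos_weight, [default=0]

Controls the balance of positive and negative weights, useful for imbalanced classes. A typical value is the ratio of the number of negative instances to the number of positive instances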
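The ratio mentioned above is easy to compute from the label vector. A minimal sketch, assuming binary labels encoded as 0 (negative) and 1 (positive); the function name is hypothetical.

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative to positive instances for 0/1 labels."""
    num_pos = sum(1 for y in labels if y == 1)
    num_neg = sum(1 for y in labels if y == 0)
    if num_pos == 0:
        raise ValueError("no positive instances in labels")
    return num_neg / num_pos


labels = [0] * 90 + [1] * 10  # 90 negatives, 10 positives
print(suggested_scale_pos_weight(labels))
```

The result is a starting point; the best value for a given dataset is usually found by validation.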

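3.2 Additional parameters for Dart Booster

To be added

3.3 Parameters for Linear Booster

lambda [default=0]

L2 regularization term on weights (analogous to Ridge regression). Increasing this value makes the model more conservative

alpha [default=0]

L1 regularization term on weights (analogous to LASSO). Increasing this value makes the model more conservative

lambda_bias

L2 regularization term on the bias. The default is 0 (there is no L1 regularization on the bias because it is not important)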

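4. Learning task parameters (specify the learning task and the corresponding learning objective)

objective [ default=reg:linear ]

Possible values include:

"reg:linear": linear regression
"reg:logistic": logistic regression
"binary:logistic": logistic regression for binary classification, outputs the probability
"binary:logitraw": logistic regression for binary classification, outputs the raw score w^T x before the logistic transformation
"count:poisson": Poisson regression for count data, outputs the mean of the Poisson distribution; max_delta_step is set to 0.7 by default in Poisson regression (used to safeguard optimization)
"multi:softmax": sets XGBoost to do multiclass classification using the softmax objective; you also need to set num_class
"multi:softprob": same as softmax, but outputs a vector of ndata * nclass values, which can be reshaped into a matrix with ndata rows and nclass columns. Each row holds the predicted probability of the instance belonging to each class
"rank:pairwise": sets XGBoost to do ranking by minimizing the pairwise loss
"reg:gamma": gamma regression for severity data, outputs the mean of the gamma distribution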
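The reshaping described for "multi:softprob" can be sketched in plain Python. The example assumes the flat vector is laid out instance by instance (all class probabilities of the first instance, then the second, and so on); the function names and the probability values are made up for the illustration.

```python
def reshape_softprob(flat, num_class):
    """Turn a flat ndata * nclass vector into ndata rows of nclass values."""
    if len(flat) % num_class != 0:
        raise ValueError("length is not a multiple of num_class")
    return [flat[i:i + num_class] for i in range(0, len(flat), num_class)]


def predicted_classes(prob_rows):
    """Index of the most probable class per row, as multi:softmax would report."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


flat = [0.7, 0.2, 0.1,   # instance 0
        0.1, 0.3, 0.6]   # instance 1
rows = reshape_softprob(flat, num_class=3)
print(rows)
print(predicted_classes(rows))
```

Taking the arg max of each row recovers the hard class labels, which is the relationship between the softprob and softmax outputs.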

base_score [ default=0.5 ]

The initial prediction score of all instances (global bias)

With a sufficient number of iterations, changing this value will not have much effect

eval_metric [ default according to objective ]

Evaluation metric for the validation data. The default depends on the objective (rmse for regression, error for classification, mean average precision for ranking)

Users can add multiple evaluation metrics. Python users should pass multiple metrics as a list

Available metrics include:

"rmse": root mean square error
"mae": mean absolute error
"logloss": negative log-likelihood
"error": binary classification error rate. Instances with a predicted value greater than 0.5 are treated as positive, the others as negative
"merror": multiclass classification error rate, computed as the number of misclassified instances divided by the total number of instances
"mlogloss": multiclass logloss
"auc": area under the curve, for ranking evaluation
"ndcg": Normalized Discounted Cumulative Gain
"map": Mean Average Precision
"ndcg@n", "map@n": n can be set to an integer to cut off the top positions in the lists for evaluation
"ndcg-", "map-", "ndcg@n-", "map@n-": in XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. Adding "-" to the metric name makes XGBoost evaluate these scores as 0, to be consistent under some conditions
"gamma-deviance": residual deviance for gamma regression
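For reference, the simpler metrics in the list can be written out in a few lines. This is a sketch of the textbook definitions, averaged over instances, not XGBoost's own code; the clipping constant in logloss is an assumption added to avoid log(0).

```python
import math


def rmse(labels, preds):
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(labels, preds)) / len(labels))


def mae(labels, preds):
    return sum(abs(p - y) for y, p in zip(labels, preds)) / len(labels)


def logloss(labels, probs, eps=1e-15):
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


def error(labels, probs, threshold=0.5):
    """Share of instances whose thresholded prediction disagrees with the label."""
    wrong = sum(1 for y, p in zip(labels, probs) if (p > threshold) != (y == 1))
    return wrong / len(labels)


labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.4, 0.8]
for metric in (rmse, mae, logloss, error):
    print(metric.__name__, metric(labels, probs))
```

In this example only the third instance (label 1, predicted 0.4) is misclassified, so the error rate is 0.25.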

seed [ default=0 ]

Random number seed

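5. Command-line parameters

The following parameters are only used in the console version of xgboost

use_buffer [ default=1 ]

Whether to create a binary buffer file from the text input. Doing so reduces loading time

num_round

Number of boosting rounds

data

Path to the training data

test:data

Path to the test data to run predictions on

save_period [default=0]

Period for saving the model. Setting it to 10 means the model is saved every 10 rounds; setting it to 0 means no model is saved during training

task [default=train] options: train, pred, eval, dump

train: train using the training data
pred: make predictions on the test data
eval: evaluate the statistics specified by eval[name]=filename
dump: dump the learned model into text format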
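The command-line version usually reads these settings from a configuration file. The sketch below generates such a file as text, assuming the "key = value" layout with quoted paths that the configuration files in the XGBoost demos use. The file names train.txt and test.txt, the chosen values, and the helper itself are hypothetical.

```python
# Keys whose values are file paths and are therefore written in quotes.
PATH_KEYS = {"data", "test:data"}


def render_cli_config(settings):
    """Render a dict of CLI parameters as 'key = value' lines."""
    lines = []
    for key, value in settings.items():
        if key in PATH_KEYS or key.startswith("eval["):
            value = '"{}"'.format(value)
        lines.append("{} = {}".format(key, value))
    return "\n".join(lines) + "\n"


settings = {
    "task": "train",
    "objective": "binary:logistic",
    "num_round": 10,
    "save_period": 0,
    "data": "train.txt",        # hypothetical training file
    "eval[test]": "test.txt",   # hypothetical evaluation file
    "test:data": "test.txt",
}
config_text = render_cli_config(settings)
print(config_text)
```

The resulting text could be saved to a file and handed to the xgboost executable; check the demos shipped with your version for the exact invocation.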

model_in [default=NULL]

Path to the input model. Needed for the test, eval, and dump tasks. If it is specified in training, xgboost continues training from the input model

model_out [default=NULL]

Path to the output model after training finishes. If not specified, the output is named like 0003.model, where 0003 is the number of boosting rounds

model_dir [default=models]

Output directory for the models saved during training

fmap

Feature map, used when dumping the model

name_dump [default=dump.txt]

Name of the model dump file

name_pred [default=pred.txt]

Name of the prediction file

pred_margin [default=0]

Predict the margin instead of the transformed probability
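For a logistic objective, the margin and the probability are related by the sigmoid function. A minimal sketch of that transformation, written in a numerically stable form; it assumes a binary logistic objective and is not taken from XGBoost's source.

```python
import math


def margin_to_probability(margin):
    """Map a raw margin (log-odds) to a probability with the sigmoid function."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_m = math.exp(margin)  # avoids overflow for very negative margins
    return exp_m / (1.0 + exp_m)


for margin in (-2.0, 0.0, 2.0):
    print(margin, margin_to_probability(margin))
```

A margin of 0 corresponds to a probability of 0.5, so thresholding the margin at 0 is equivalent to thresholding the probability at 0.5.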
