[xgboost] python

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/weixin_37993251/article/details/88867515

References: https://xgboost.apachecn.org/

https://xgboost.readthedocs.io


Python Package Introduction

This document gives a basic walkthrough of the xgboost Python package.

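Installing XGBoost

To install XGBoost, follow these steps:

  • Run make in the root directory of the project.
  • In the python-package directory, run:
python setup.py install
  • To check the installation, import the package in Python:
import xgboost as xgb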

Data Interface

The XGBoost Python module can load data from:

  • a libsvm text format file,
  • a Numpy 2D array, and
  • an xgboost binary buffer file.

The data is stored in a DMatrix object.

  • To load a libsvm text file or an XGBoost binary file into a DMatrix object:
dtrain = xgb.DMatrix('train.svm.txt')
dtest = xgb.DMatrix('test.svm.buffer')
  • To load a numpy array into a DMatrix object:
import numpy as np
data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)
  • To load a scipy.sparse array into a DMatrix object:
import scipy.sparse
# dat holds the nonzero values; row and col hold their row and column indices
csr = scipy.sparse.csr_matrix((dat, (row, col)))
dtrain = xgb.DMatrix(csr)
  • Saving a DMatrix to an XGBoost binary file makes later loads faster:
dtrain = xgb.DMatrix('train.svm.txt')
dtrain.save_binary("train.buffer")
  • To handle missing values in a DMatrix, specify the value that marks them when constructing the DMatrix:
dtrain = xgb.DMatrix(data, label=label, missing=-999.0)
  • Weights can be set when needed:
w = np.random.rand(5, 1)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)

Setting Parameters

XGBoost takes its parameters as a dictionary or as a list of key-value pairs. For example:

  • Booster parameters
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
param['nthread'] = 4
param['eval_metric'] = 'auc'
  • You can also specify multiple evaluation metrics:
param['eval_metric'] = ['auc', 'ams@0']

# alternatively:
# plst = list(param.items())
# plst += [('eval_metric', 'ams@0')]
  • Specify validation sets to watch performance
evallist = [(dtest, 'eval'), (dtrain, 'train')]
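
The list-of-pairs alternative in the comments above can be sketched in plain Python. This is only an illustration of the data structure and calls nothing in XGBoost; it shows why a list of pairs is convenient when the same key (here eval_metric) has to appear more than once.

```python
# Parameters as a dictionary: each key can appear only once.
param = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}
param['eval_metric'] = 'auc'

# Parameters as a list of pairs: the same key may be repeated,
# which is one way to request several evaluation metrics.
# list(...) is needed in Python 3, where items() returns a view.
plst = list(param.items())
plst += [('eval_metric', 'ams@0')]

# Collect every metric that the list of pairs requests, in order.
metrics = [value for key, value in plst if key == 'eval_metric']
print(metrics)
```

The order of the pairs matters: as the early stopping section notes, the last metric listed is the one used for early stopping.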

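Training

With the parameters and data in place, you can now train a model.

  • Training
num_round = 10
bst = xgb.train(param, dtrain, num_round, evallist)
  • Saving the model: after training, you can save the model and dump it.
bst.save_model('0001.model')
  • Dumping the model and feature map: you can dump the model to a text file and inspect its contents.
# dump the model
bst.dump_model('dump.raw.txt')
# dump the model together with a feature map
bst.dump_model('dump.raw.txt', 'featmap.txt')
  • Loading a model: once a model has been saved, you can load the model file at any time as follows.
bst = xgb.Booster({'nthread': 4})  # init model
bst.load_model("model.bin")  # load the saved model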
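
The featmap.txt file passed to dump_model is not described in the text. The sketch below writes one, assuming the three-column layout used by the XGBoost demo scripts: a feature index starting at 0, a feature name without spaces, and a type code (q for quantitative, i for binary indicator, int for integer). The feature names and the write_featmap helper are hypothetical.

```python
import os
import tempfile

# Hypothetical feature names and type codes, for illustration only.
features = [('age', 'q'), ('is_member', 'i'), ('num_visits', 'int')]

def write_featmap(path, features):
    """Write one '<index>\t<name>\t<type>' line per feature, indices from 0."""
    with open(path, 'w') as f:
        for index, (name, ftype) in enumerate(features):
            f.write('{}\t{}\t{}\n'.format(index, name, ftype))

# Write into a temporary directory so the sketch leaves no stray files behind.
featmap_path = os.path.join(tempfile.mkdtemp(), 'featmap.txt')
write_featmap(featmap_path, features)

with open(featmap_path) as f:
    featmap_lines = f.read().splitlines()
print(featmap_lines)
```

With such a file, the dumped trees would show the feature names in place of numeric feature indices.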

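Early Stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If there is more than one, the last one is used.

train(..., evals=evals, early_stopping_rounds=10)

The model will train until the validation score stops improving. The validation error needs to decrease at least once every early_stopping_rounds rounds for training to continue.

If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration and bst.best_ntree_limit. Note that train() returns the model from the last iteration, not the best one.

This works with both metrics to minimize (RMSE, log loss, etc.) and metrics to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric, the last one in param['eval_metric'] is used for early stopping.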
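
The stopping rule can be illustrated with a small pure-Python simulation. This is a sketch of the rule as described above, not XGBoost's own implementation; the early_stop helper and the score values are hypothetical.

```python
def early_stop(scores, early_stopping_rounds, maximize=False):
    """Return (best_iteration, last_iteration) for a list of per-round scores.

    Training stops once the score has failed to improve for
    early_stopping_rounds consecutive rounds.
    """
    best_score = None
    best_iteration = 0
    for i, score in enumerate(scores):
        if best_score is None:
            improved = True
        elif maximize:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_iteration = score, i
        elif i - best_iteration >= early_stopping_rounds:
            return best_iteration, i
    return best_iteration, len(scores) - 1

# Hypothetical validation errors, one per boosting round (lower is better).
errors = [0.30, 0.25, 0.22, 0.23, 0.24, 0.23, 0.21]
best, last = early_stop(errors, early_stopping_rounds=3)
print(best, last)
```

In this run the best round is 2 but training stops at round 5, which mirrors the difference between the best iteration and the last iteration of the returned model.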

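Prediction

Once you have trained or loaded a model and prepared the data, you can start making predictions.

# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
dtest = xgb.DMatrix(data)
ypred = bst.predict(dtest)

If early stopping was enabled during training, you can get predictions from the best iteration with bst.best_ntree_limit:

ypred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)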

Plotting

You can use the plotting module to plot feature importance and the output trees.

To plot importance, use plot_importance. This function requires matplotlib to be installed.

xgb.plot_importance(bst)

To plot an output tree via matplotlib, use plot_tree and specify the ordinal number of the target tree. This function requires graphviz and matplotlib.

xgb.plot_tree(bst, num_trees=2)

When you use IPython, you can use the to_graphviz function, which converts the target tree to a graphviz instance. The graphviz instance is rendered automatically in IPython.

xgb.to_graphviz(bst, num_trees=2)

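Notes on Parameter Tuning

Parameter tuning is a dark art in machine learning: the optimal parameters of a model can depend on many scenarios, so it is impossible to write a comprehensive guide.

This document tries to provide some guidelines for the parameters in xgboost.

Understanding the Bias-Variance Tradeoff

If you have taken a machine learning or statistics course, this is likely to be one of the most important concepts you met. When we allow the model to become more complicated (for example, deeper trees), the model has a better ability to fit the training data, resulting in a less biased model. However, such a complicated model requires more data to fit.

Most of the parameters in xgboost are about the bias-variance tradeoff. The best model should trade off model complexity against predictive power carefully. The parameter documentation will tell you whether each parameter makes the model more conservative or not. This can help you move between complicated and simple models.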

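Control Overfitting

When you observe high training accuracy but low test accuracy, you have probably run into overfitting.

There are in general two ways to control overfitting in xgboost.

  • The first way is to directly control model complexity
    • This includes max_depth, min_child_weight and gamma
  • The second way is to add randomness to make training robust to noise
    • This includes subsample and colsample_bytree
    • You can also reduce the step size eta, but remember to increase num_round when you do so.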
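
The two approaches can be written down as parameter dictionaries. Every numeric value below is an illustrative assumption and not a recommendation from the text, and scaling num_round by the same factor that eta is reduced by is only a rough rule of thumb.

```python
# Illustrative values only; none of these numbers come from the text.
base = {'objective': 'binary:logistic', 'eta': 0.3}
num_round = 100

# 1) Limit model complexity directly.
complexity = {'max_depth': 4, 'min_child_weight': 5, 'gamma': 1.0}

# 2) Add randomness so training is more robust to noise.
randomness = {'subsample': 0.8, 'colsample_bytree': 0.8}

# 3) Use a smaller step size, compensated by more boosting rounds.
shrink_factor = 3
conservative = dict(base, **complexity, **randomness)
conservative['eta'] = base['eta'] / shrink_factor
conservative_rounds = num_round * shrink_factor

print(conservative, conservative_rounds)
```

In practice each of these values would be chosen by watching the gap between training and validation scores.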

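Handle Imbalanced Dataset

In common cases such as ad click-through logs, the dataset is extremely imbalanced. This can affect the training of an xgboost model, and there are two ways to improve it.

  • If you care only about the ranking order (AUC) of your predictions
    • Balance the positive and negative weights via scale_pos_weight
    • Use AUC for evaluation
  • If you care about predicting the right probability
    • In this case, you cannot re-balance the dataset
    • In this case, setting the parameter max_delta_step to a finite number (say 1) will help convergence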
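
For the first case, a commonly used starting value for scale_pos_weight is the ratio of negative to positive instances. The sketch below computes it for a hypothetical label list; the suggest_scale_pos_weight helper is not part of XGBoost, and the resulting value is a starting point to tune, not a fixed rule.

```python
def suggest_scale_pos_weight(labels):
    """Return the ratio of negative to positive instances for 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError('no positive instances')
    return negatives / positives

# Hypothetical imbalanced data set: 5 positives, 95 negatives.
labels = [1] * 5 + [0] * 95
param = {'objective': 'binary:logistic', 'eval_metric': 'auc'}
param['scale_pos_weight'] = suggest_scale_pos_weight(labels)
print(param['scale_pos_weight'])  # 19.0
```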

Text Input Format of DMatrix

Basic Input Format

XGBoost currently supports two text formats for ingesting data: LibSVM and CSV. The rest of this document will describe the LibSVM format. (See this Wikipedia article for a description of the CSV format.)

For training or predicting, XGBoost takes an instance file with the format as below:

train.txt

1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
0 0:1.3 1:0.3
1 0:0.01 1:0.3
0 0:0.2 1:0.3

Each line represents a single instance. In the first line, ‘1’ is the instance label, ‘101’ and ‘102’ are feature indices, and ‘1.2’ and ‘0.03’ are feature values. In the binary classification case, ‘1’ indicates a positive sample and ‘0’ a negative sample. Probability values in [0,1] are also supported as labels, to indicate the probability of the instance being positive.
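
To make the layout concrete, here is a minimal sketch of a parser for one line of this format. The parse_libsvm_line helper is hypothetical and only illustrates the structure; it is not how XGBoost reads the file.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(':')
        features[int(index)] = float(value)
    return label, features

label, features = parse_libsvm_line('1 101:1.2 102:0.03')
print(label, features)  # 1.0 {101: 1.2, 102: 0.03}
```

Features that do not appear on a line are simply absent from the dictionary, which is what makes the format suitable for sparse data.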

Auxiliary Files for Additional Information

Note: all information below is applicable only to single-node version of the package. If you’d like to perform distributed training with multiple nodes, skip to the section Embedding additional information inside LibSVM file.

Group Input Format

For ranking tasks, XGBoost supports the group input format. In real-world ranking tasks, instances are categorized into query groups. For example, when learning to rank web pages, the web page instances are grouped by their queries. XGBoost requires a file that indicates the group information. For example, if the instance file is the train.txt shown above, the group file should be named train.txt.group and have the following format:

train.txt.group

2
3

This means that the data set contains 5 instances: the first two instances are in one group and the other three are in another. The numbers in the group file give the number of instances in each group, in the order in which the groups appear in the instance file. At configuration time, you do not have to indicate the path of the group file. If the instance file name is xxx, XGBoost will check whether there is a file named xxx.group in the same directory.
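
The relationship between the group sizes and the rows of the instance file can be sketched as follows. The group_ranges helper is hypothetical and only shows how the sizes map onto row ranges.

```python
def group_ranges(group_sizes):
    """Turn group sizes into (start, end) row ranges, with end exclusive."""
    ranges = []
    start = 0
    for size in group_sizes:
        ranges.append((start, start + size))
        start += size
    return ranges

# Content of train.txt.group from the example above.
group_file_content = '2\n3\n'
sizes = [int(line) for line in group_file_content.split()]
ranges = group_ranges(sizes)
print(ranges)  # [(0, 2), (2, 5)]
```

The sizes must sum to the number of instances in the instance file, here 5.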

Instance Weight File

Instances in the training data may be assigned weights to differentiate relative importance among them. For example, if we provide an instance weight file for the train.txt file in the example as below:

train.txt.weight

1
0.5
0.5
1
0.5

This means that XGBoost will place more emphasis on the first and fourth instances (i.e. the positive instances) during training. The configuration is similar to that of the group information. If the instance file name is xxx, XGBoost will look for a file named xxx.weight in the same directory. If the file exists, the instance weights will be extracted and used during training.
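
A weight file like the one above can be generated from the labels. The sketch below assumes the labels of the five train.txt instances and the weighting scheme of the example (1 for positives, 0.5 for negatives); the weights_from_labels helper is hypothetical.

```python
# Labels of the five instances in train.txt, in file order.
labels = [1, 0, 0, 1, 0]

def weights_from_labels(labels, positive_weight=1.0, negative_weight=0.5):
    """Assign one weight per instance based on its 0/1 label."""
    return [positive_weight if y == 1 else negative_weight for y in labels]

weights = weights_from_labels(labels)

# One weight per line, in the same order as the instance file.
weight_file_content = ''.join('{:g}\n'.format(w) for w in weights)
print(weight_file_content)
```

Writing weight_file_content to train.txt.weight would reproduce the file shown above.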

Note

Binary buffer format and instance weights

If you choose to save the training data as a binary buffer (using save_binary()), keep in mind that the resulting binary buffer file will include the instance weights. To update the weights, use the set_weight() function.

Initial Margin File

XGBoost supports providing an initial margin prediction for each instance. For example, if we have an initial prediction from logistic regression for the train.txt file, we can create the following file:

train.txt.base_margin

-0.4
1.0
3.4

XGBoost will take these values as the initial margin prediction and boost from there. An important note about base_margin is that it should be the margin prediction before transformation, so if you are using logistic loss, you need to supply the value before the logistic transformation. If you are using the XGBoost predictor, use pred_margin=1 to output margin values.
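
For logistic loss, the value before the logistic transformation is the log-odds of the predicted probability. A minimal sketch of the conversion, using hypothetical probabilities from an earlier model:

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the margin (log-odds) scale."""
    return math.log(p / (1.0 - p))

def sigmoid(margin):
    """Inverse of logit: map a margin back to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical probabilities predicted by an earlier model.
probabilities = [0.4, 0.73, 0.97]
margins = [logit(p) for p in probabilities]
print(margins)
```

The margins, not the probabilities, are what would go into the base_margin file, one value per line.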

Embedding additional information inside LibSVM file

This section is applicable to both single- and multiple-node settings.

Query ID Columns

This is most useful for ranking tasks, where the instances are grouped into query groups. You may embed the query group ID of each instance in the LibSVM file by adding a token of the form qid:xx to each row:

train.txt

1 qid:1 101:1.2 102:0.03
0 qid:1 1:2.1 10001:300 10002:400
0 qid:2 0:1.3 1:0.3
1 qid:2 0:0.01 1:0.3
0 qid:3 0:0.2 1:0.3
1 qid:3 3:-0.1 10:-0.3
0 qid:3 6:0.2 10:0.15

Keep in mind the following restrictions:

  • You are not allowed to specify query IDs for some instances but not for others. Either every row is assigned a query ID or none is.
  • The rows have to be sorted in ascending order by query ID. So, for instance, a row may not have a larger query ID than any of the rows that follow it.
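
Both restrictions can be checked with a short script before training. The sketch below assumes the qid token is the second token of each row, as in the example above; the qids_to_group_sizes helper is hypothetical and also returns the group sizes implied by the query IDs.

```python
def qids_to_group_sizes(lines):
    """Collect consecutive qid runs into group sizes; require ascending qids."""
    sizes = []
    previous = None
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2 or not tokens[1].startswith('qid:'):
            raise ValueError('every row must carry a qid token: ' + line)
        qid = int(tokens[1][len('qid:'):])
        if previous is not None and qid < previous:
            raise ValueError('rows must be sorted by ascending qid')
        if qid == previous:
            sizes[-1] += 1
        else:
            sizes.append(1)
        previous = qid
    return sizes

rows = [
    '1 qid:1 101:1.2 102:0.03',
    '0 qid:1 1:2.1 10001:300 10002:400',
    '0 qid:2 0:1.3 1:0.3',
    '1 qid:2 0:0.01 1:0.3',
    '0 qid:3 0:0.2 1:0.3',
    '1 qid:3 3:-0.1 10:-0.3',
    '0 qid:3 6:0.2 10:0.15',
]
group_sizes = qids_to_group_sizes(rows)
print(group_sizes)  # [2, 2, 3]
```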

Instance weights

You may specify instance weights in the LibSVM file by appending each instance label with the corresponding weight in the form of [label]:[weight], as shown by the following example:

train.txt

1:1.0 101:1.2 102:0.03
0:0.5 1:2.1 10001:300 10002:400
0:0.5 0:1.3 1:0.3
1:1.0 0:0.01 1:0.3
0:0.5 0:0.2 1:0.3

Here the negative instances are assigned half the weight of the positive instances.
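
A minimal sketch of how the first token of each row splits into a label and a weight. The parse_label_and_weight helper is hypothetical, and treating a bare label as weight 1.0 is an assumption made for this illustration.

```python
def parse_label_and_weight(token):
    """Split '<label>:<weight>' (or a bare '<label>') into two floats."""
    if ':' in token:
        label, weight = token.split(':')
        return float(label), float(weight)
    # Assumption for this sketch: a bare label gets unit weight.
    return float(token), 1.0

rows = [
    '1:1.0 101:1.2 102:0.03',
    '0:0.5 1:2.1 10001:300 10002:400',
    '0:0.5 0:1.3 1:0.3',
    '1:1.0 0:0.01 1:0.3',
    '0:0.5 0:0.2 1:0.3',
]
# Only the first token of each row carries the label and weight.
parsed = [parse_label_and_weight(row.split()[0]) for row in rows]
print(parsed)
```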
