DM12 - XGBoost Learning

Basic Resources

Paper:
https://arxiv.org/abs/1603.02754
Theory blog posts:
《机器学习(四)— 从gbdt到xgboost》 (Machine Learning (4): from GBDT to XGBoost)
https://www.cnblogs.com/mfryf/p/5946815.html
《GBDT&GBRT与XGBoost》 (GBDT & GBRT vs. XGBoost)
http://blog.csdn.net/u011826404/article/details/76427732
《XGBoost原理解析》 (XGBoost Principles Explained)
http://blog.csdn.net/dreamyx/article/details/70194018
《xgboost导读和实战》 (XGBoost Guide and Practice)
https://wenku.baidu.com/view/44778c9c312b3169a551a460.html
Code:
https://github.com/dmlc/xgboost
Python library downloads:
https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost
Official documentation:
http://xgboost.readthedocs.io/en/latest/python/python_intro.html
API reference:
http://xgboost.readthedocs.io/en/latest/python/python_api.html

The xgboost Python Package

The following looks at XGBoost from the Python side. The Python package exposes the following modules (the Learning API is used in the hands-on example later in this post; a minimal sketch of the Scikit-Learn wrapper follows this list):
Core Data Structure
----xgboost.DMatrix
----xgboost.Booster
Learning API
----xgboost.train
----xgboost.cv
Scikit-Learn API [Scikit-Learn wrapper interface for XGBoost]
----xgboost.XGBRegressor
----xgboost.XGBClassifier
Plotting API
----xgboost.plot_importance
----xgboost.plot_tree
----xgboost.to_graphviz
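As a quick orientation, here is a minimal sketch of the Scikit-Learn wrapper (xgboost.XGBClassifier) on the iris data, the same dataset used in the hands-on example below; the hyperparameter values are illustrative only, not tuned.

import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Minimal sketch of the Scikit-Learn wrapper API; values are illustrative.
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = xgb.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100)
clf.fit(X_train, y_train)  # same fit/predict/score interface as any sklearn estimator
print('test accuracy:', clf.score(X_test, y_test))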

XGBoost Parameters

The official parameter documentation is here:
http://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters

General Parameters
1. booster [default=gbtree]: the base learner; gbtree uses tree-based models, gblinear uses linear models (a small sketch contrasting the two follows this list).
2. silent [default=0]: set to 1 to suppress running messages; it is usually best to leave it at 0 so the training log is printed.
3. nthread [defaults to the maximum number of threads available if not set]: number of CPU threads.
Two further parameters are set automatically by xgboost and do not need to be set by the user:
- num_pbuffer: size of the prediction buffer, normally set to the number of training instances; the buffer is used to save the prediction results of the last boosting step.
- num_feature: feature dimension used in boosting, set to the maximum dimension of the features.
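Here is a minimal sketch of switching the base learner via the booster parameter, again on the iris data; the values are illustrative, and gblinear is included only to show the switch (it fits an additive linear model instead of trees).

import xgboost as xgb
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

for booster in ('gbtree', 'gblinear'):  # tree-based vs. linear base learner
    params = {'booster': booster, 'silent': 1, 'nthread': 4,
              'objective': 'multi:softmax', 'num_class': 3}
    bst = xgb.train(params, dtrain, num_boost_round=20)
    pred = bst.predict(dtrain)  # multi:softmax returns class labels
    print(booster, 'training accuracy:', (pred == y).mean())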

Booster Parameters
1. eta [default=0.3]: shrinkage parameter. When the leaf weights are updated they are multiplied by this factor, so the step size does not get too large. The larger the value, the more likely training is to fail to converge; setting the learning rate eta smaller makes the later boosting rounds learn more carefully (a short sketch of this eta / number-of-rounds trade-off follows this list).
2. min_child_weight [default=1]: the minimum sum of the Hessian h in each leaf. For 0-1 classification with imbalanced classes, if h is around 0.01, min_child_weight=1 means a leaf must contain at least 100 samples. This parameter strongly affects the result: it controls the minimum sum of second derivatives in a leaf, and the smaller it is, the easier it is to overfit.
3. max_depth [default=6]: maximum depth of each tree; deeper trees overfit more easily.
4. max_leaf_nodes: maximum number of leaf nodes; its effect overlaps somewhat with max_depth.
5. gamma [default=0]: controls post-pruning; a split is only kept if it reduces the loss by more than gamma, so larger values make the model more conservative.
6. max_delta_step [default=0]: acts during the update step; 0 means no constraint, while a positive value makes the updates more conservative and prevents overly large steps.
7. subsample [default=1]: random row subsampling of the training samples. Lower values make the algorithm more conservative and help prevent overfitting, but values that are too small cause underfitting.
8. colsample_bytree [default=1]: column subsampling of the features used when building each tree; typical values are 0.5-1.
9. lambda [default=1]: L2 regularisation on the leaf weights, controlling model complexity; the larger the value, the less prone the model is to overfitting.
10. alpha [default=0]: L1 regularisation on the leaf weights, controlling model complexity; the larger the value, the less prone the model is to overfitting.
11. scale_pos_weight [default=1]: when set to a value greater than 0, it helps the model converge faster on class-imbalanced data.
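As mentioned under eta, here is a minimal sketch (iris data, illustrative values) of the learning-rate / number-of-rounds trade-off, using xgb.cv with early stopping so the number of rounds is chosen by the data rather than fixed in advance.

import xgboost as xgb
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

for eta in (0.3, 0.05):  # larger vs. smaller learning rate
    params = {'eta': eta, 'max_depth': 3, 'objective': 'multi:softmax',
              'num_class': 3, 'silent': 1}
    cv = xgb.cv(params, dtrain, num_boost_round=500, nfold=3,
                metrics='mlogloss', early_stopping_rounds=20, seed=0)
    # cv is a DataFrame; its length is the number of rounds kept by early stopping
    print('eta=%.2f  rounds=%d  test-mlogloss=%.4f'
          % (eta, len(cv), cv['test-mlogloss-mean'].iloc[-1]))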
Learning Task Parameters
1. objective [default=reg:linear]: the loss function to minimise. Commonly used values (a short sketch of the softmax/softprob difference follows this list):
- binary:logistic: logistic regression for binary classification, returns the predicted probability (not the class).
- multi:softmax: multiclass classification using the softmax objective, returns the predicted class (not probabilities); you also need to set num_class, the number of unique classes.
- multi:softprob: same as softmax, but returns the predicted probability of each data point belonging to each class.
2. eval_metric [default according to objective]: the metric used on the validation data. The defaults are rmse for regression and error for classification. Typical values:
- rmse: root mean square error
- mae: mean absolute error
- logloss: negative log-likelihood
- error: binary classification error rate (0.5 threshold)
- merror: multiclass classification error rate
- mlogloss: multiclass logloss
- auc: area under the curve
3. seed [default=0]: the random number seed; used to generate reproducible results and for parameter tuning.
Note: the Python sklearn-style parameter names differ from the native ones: eta -> learning_rate, lambda -> reg_lambda, alpha -> reg_alpha.
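To make the objective distinction concrete, below is a minimal sketch (iris data, illustrative values) comparing what multi:softmax and multi:softprob return at prediction time.

import xgboost as xgb
from sklearn import datasets

X, y = datasets.load_iris(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

for objective in ('multi:softmax', 'multi:softprob'):
    params = {'objective': objective, 'num_class': 3, 'max_depth': 3,
              'eta': 0.3, 'silent': 1, 'eval_metric': 'mlogloss', 'seed': 0}
    bst = xgb.train(params, dtrain, num_boost_round=20)
    pred = bst.predict(dtrain)
    print(objective, pred.shape)  # (150,) class labels vs. (150, 3) probabilities

On the Scikit-Learn side the same model would be configured with the renamed keywords, e.g. xgb.XGBClassifier(learning_rate=0.3, reg_lambda=1, reg_alpha=0).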

Hands-on Example

# coding=utf-8

import xgboost as xgb
from sklearn import datasets
from sklearn import svm
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, shuffle=True)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

svc = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
svc_pre = svc.predict(X_test)
# svc_score = svc.score(X_test, y_test)
# print('svm-score:', svc_score)

# Build xgboost DMatrix objects from the train/test splits
xgb_train = xgb.DMatrix(X_train, label=y_train)
xgb_test = xgb.DMatrix(X_test, label=y_test)

# Parameters for xgb.train (Learning API)
params = {
    # 1. General parameters
    'booster': 'gbtree',  # default=gbtree; base learner: gbtree = tree-based models, gblinear = linear models
    'silent': 1,  # default=0; 1 suppresses running messages, 0 prints them
    'nthread': 7,  # defaults to the maximum number of threads available if not set; number of CPU threads
    # 2. Tree booster parameters
    'eta': 0.007,  # [default=0.3, alias: learning_rate] acts as the learning rate
    'gamma': 0.1,  # [default=0, alias: min_split_loss] controls post-pruning; larger is more conservative; typically 0.1-0.2; range [0, inf)
    'max_depth': 12,  # [default=6] maximum tree depth; deeper trees overfit more easily
    'min_child_weight': 3,  # [default=1] minimum sum of the Hessian h in a leaf. For imbalanced 0-1 classification,
    # if h is around 0.01, min_child_weight=1 means a leaf must contain at least 100 samples.
    # This parameter strongly affects the result: the smaller it is, the easier it is to overfit.
    'max_delta_step': 1,  # [default=0]
    'subsample': 0.9,  # [default=1] random row subsampling of the training samples; range (0, 1]
    'colsample_bytree': 0.9,  # [default=1] column subsampling when building each tree
    # 'colsample_bylevel': 0.8,  # [default=1]
    'lambda': 1.5,  # [default=1, alias: reg_lambda] L2 regularisation on the leaf weights; larger values mean less overfitting
    'alpha': 0,  # [default=0, alias: reg_alpha] L1 regularisation on the leaf weights; larger values mean less overfitting
    'tree_method': 'auto',  # [default='auto'] one of {'auto', 'exact', 'approx', 'hist', 'gpu_exact', 'gpu_hist'}
    # 'auto': use a heuristic to choose the faster one. For small to medium datasets the exact greedy
    #   algorithm is used; for very large datasets the approximate algorithm is chosen (a message is
    #   printed, because the old behaviour on a single machine was always exact greedy).
    # 'exact': exact greedy algorithm.
    # 'approx': approximate greedy algorithm using sketching and histograms.
    # 'hist': fast histogram-optimised approximate greedy algorithm, with improvements such as bin caching.
    # 'gpu_exact': GPU implementation of the exact algorithm.
    # 'gpu_hist': GPU implementation of the hist algorithm.
    'sketch_eps': 0.03,  # [default=0.03] only used by the approximate greedy algorithm; range (0, 1)
    'scale_pos_weight': 1,  # [default=1] values greater than 0 help convergence on class-imbalanced data
    'updater': 'grow_colmaker,prune',  # [default='grow_colmaker,prune'] comma-separated list of tree updaters
    # Available updater plugins:
    # 'grow_colmaker': non-distributed column-based construction of trees.
    # 'distcol': distributed tree construction with column-based data splitting mode.
    # 'grow_histmaker': distributed tree construction with row-based data splitting based on a global proposal of histogram counting.
    # 'grow_local_histmaker': based on local histogram counting.
    # 'grow_skmaker': uses the approximate sketching algorithm.
    # 'sync': synchronises trees on all distributed nodes.
    # 'refresh': refreshes the tree statistics and/or leaf values based on the current data; no random row subsampling is performed.
    # 'prune': prunes the splits where loss < min_split_loss (gamma).
    # In a distributed setting the implicit updater sequence is adjusted as follows:
    # 'grow_histmaker,prune' when dsplit='row' (or default) and prob_buffer_row == 1 (or default), or when the data has multiple sparse pages;
    # 'grow_histmaker,refresh,prune' when dsplit='row' and prob_buffer_row < 1;
    # 'distcol' when dsplit='col'.

    # 3. Learning task parameters
    'objective': 'multi:softmax',  # multiclass classification
    'num_class': 3,  # number of classes, used together with multi:softmax
    'seed': 1000,
    'eval_metric': 'mlogloss'  # [default according to objective]
}
plst = list(params.items())
num_rounds = 5000  # number of boosting rounds
watchlist = [(xgb_train, 'train'), (xgb_test, 'val')]

# Train the model and save it.
# With a large number of rounds, early_stopping_rounds stops training once the
# evaluation metric has not improved for that many consecutive rounds.
bst = xgb.train(plst, xgb_train, num_rounds, watchlist, early_stopping_rounds=100)
bst.save_model('xgb.model')  # persist the trained model
print("best_ntree_limit:", bst.best_ntree_limit)
bst_pre = bst.predict(xgb_test, ntree_limit=bst.best_ntree_limit)

xgb_sc = zero_one_loss(y_test, bst_pre)
svm_sc = zero_one_loss(y_test, svc_pre)
rs = np.c_[y_test, bst_pre, svc_pre]
print(rs)
print('xgb_sc:', 1 - xgb_sc, 'svm_sc:', 1 - svm_sc)
xgb.plot_importance(bst)
xgb.plot_tree(bst, num_trees=2)
plt.show()

# Write out data for a submission file
# np.savetxt('submission.csv', np.c_[range(1, len(X_test) + 1), preds], delimiter=',', header='ImageId,Label',
#            comments='', fmt='%d')
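Since the model was persisted with bst.save_model above, here is a minimal sketch of reloading 'xgb.model' later and predicting with it (the file name is the one used above).

# Sketch: reload the saved model and use it for prediction.
bst_loaded = xgb.Booster()
bst_loaded.load_model('xgb.model')
reload_pre = bst_loaded.predict(xgb.DMatrix(X_test))
print(reload_pre[:10])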

Results

xgb_sc: 0.9777777777777777 svm_sc: 0.9777777777777777

(Figures: feature importance from xgb.plot_importance and a tree plot from xgb.plot_tree)

References

xgboost入门与实战(原理篇) (XGBoost: Introduction and Practice, Theory)
http://blog.csdn.net/sb19931201/article/details/52557382

xgboost入门与实战(实战调参篇) (XGBoost: Introduction and Practice, Hands-on Tuning)
http://blog.csdn.net/sb19931201/article/details/52577592

DM07-Ensemble组合技术 (DM07: Ensemble Techniques)
http://blog.csdn.net/ld326/article/details/79367190

happyprince, http://blog.csdn.net/ld326/article/details/79543529
