What follows are my study notes and a summary; if you spot any mistakes, please don't hesitate to point them out.
This post is a quick introduction to XGBoost and LightGBM, two Kaggle heavy hitters, with example code for each.
First, recall the boosting principle and the algorithms derived from it: AdaBoost and GBDT, and the even stronger XGBoost that followed. If you need a refresher, see my earlier post: ml课程:决策树、随机森林、GBDT、XGBoost相关(含代码实现); there is also related material on ensembling tree models: ml课程:模型融合与调优及相关案例代码. Refresher done, on to the main content.
XGBoost:
Short for eXtreme Gradient Boosting (source code here: xgboost), it is a scalable, portable, distributed implementation of the GBDT algorithm, developed by Tianqi Chen's team. It can be used from C++, Python, R, Julia, Java, and Scala, and runs on Hadoop; it is now maintained and developed by many contributors.
XGBoost computes faster for several reasons:
- Parallelization: training can use all CPU cores to parallelize the building of each single tree (see the sketch after this list).
- Distributed Computing: distributed computation can be used to train very large models.
- Out-of-Core Computing: datasets too large to fit in memory can still be handled via out-of-core computation.
- Cache Optimization of data structures and algorithms: makes better use of the hardware.
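As a concrete illustration of the parallelization point above, here is a minimal sketch (not from the original post; the data and parameter values are made up) showing how the number of threads used for tree construction is controlled with the general parameter nthread:

```python
# Minimal sketch: controlling XGBoost's CPU parallelism (illustrative data/values).
import numpy as np
import xgboost as xgb

# toy binary classification data
X = np.random.rand(1000, 20)
y = np.random.randint(2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

# 'nthread' sets the number of parallel threads used to build each tree;
# if omitted, XGBoost defaults to all available cores
param = {'max_depth': 3, 'eta': 0.1, 'objective': 'binary:logistic', 'nthread': 4}
bst = xgb.train(param, dtrain, num_boost_round=10)
```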
The figure below compares XGBoost against other gradient boosting and bagged decision tree implementations:
Another strength of XGBoost is its predictive performance in practice; see these links from competition winners:
- Vlad Sandulescu, Mihai Chiru, 1st place of the KDD Cup 2016 competition. Link to the arxiv paper.
- Marios Michailidis, Mathias Müller and HJ van Veen, 1st place of the Dato Truly Native? competition. Link to the Kaggle interview.
- Vlad Mironov, Alexander Guschin, 1st place of the CERN LHCb experiment Flavour of Physics competition. Link to the Kaggle interview.
The most commonly used parts of XGBoost:
As with sklearn, this library has a few parts you will use constantly:
- XGBoost Tutorials: worked examples showing how to use the library.
- XGBoost Parameters: the parameters to tune, grouped into general parameters, booster parameters, and task parameters (a sketch of this grouping follows the code below).
- Python API Reference: the various API interfaces.
- Advanced usage: grab the source on GitHub and modify it as needed; for example, we can define a custom loss function and evaluation metric, as in the code below.
```python
#!/usr/bin/python
# note: the raw data must first be converted into DMatrix form
# (here loaded from the .train and .test files)
import numpy as np
import xgboost as xgb

###
# advanced: customized loss function
#
print('start running example to use customized objective function')

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

# note: for a customized objective function, we leave 'objective' as default
# note: what we get in prediction is the margin value
# you must know what you are doing
param = {'max_depth': 2, 'eta': 1, 'silent': 1}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 2

# user-defined objective function: given predictions, return the gradient
# and the second-order gradient; this is log-likelihood loss
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))
    grad = preds - labels          # first-order derivative
    hess = preds * (1.0 - preds)   # second-order derivative
    return grad, hess

# user-defined evaluation function: returns a pair (metric_name, result)
# NOTE: with a customized loss function, the default prediction value is the margin,
# which may make built-in evaluation metrics misbehave.
# For example, with logistic loss the prediction is the score *before* the logistic
# transformation, while the built-in evaluation error assumes input *after* it.
# Keep this in mind when customizing; you may also need a custom evaluation function.
def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    # the metric name must not contain a colon (:) or a space;
    # preds are margins (before logistic transformation), so cut off at 0
    return 'my-error', float(sum(labels != (preds > 0.0))) / len(labels)

# training with the customized objective; we can also train step by step —
# simply look at the implementation of train in xgboost.py
bst = xgb.train(param, dtrain, num_round, watchlist, obj=logregobj, feval=evalerror)
```
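To make the three parameter groups from the list above concrete, here is a hedged sketch; the specific values are illustrative, not recommendations:

```python
# Illustrative grouping of XGBoost parameters by category (example values only).
param = {
    # general parameters: which booster to use and how the process runs
    'booster': 'gbtree',
    'nthread': 4,
    # booster parameters: control the individual trees / the ensemble
    'max_depth': 4,
    'eta': 0.1,          # learning rate
    'subsample': 0.8,
    # task parameters: define the learning objective and evaluation metric
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
}
```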
Related links:
End-to-end XGBoost mini-project: https://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/
XGBoost sklearn API: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
XGBoost API docs: https://xgboost.readthedocs.io/en/latest/
GitHub source: https://github.com/dmlc/xgboost
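For the sklearn-style interface linked above, here is a minimal sketch (the dataset and parameter values are illustrative, not from the original post):

```python
# Minimal sketch of XGBoost's sklearn-style wrapper (illustrative values).
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# toy binary classification data
X = np.random.rand(500, 10)
y = np.random.randint(2, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBClassifier(n_estimators=50, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```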
LightGBM:
Similar to XGBoost, LightGBM is an open-source library, this one from Microsoft. Its differences from XGBoost are that it runs and trains faster, especially on large datasets, and it supports more algorithms (for example, the dart and goss boosting modes alongside plain GBDT).
The most commonly used parts of LightGBM:
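Its documentation is laid out much like XGBoost's (tutorials, parameters, Python API). As a minimal usage sketch (the dataset and parameter values are illustrative, not from the original post):

```python
# Minimal LightGBM training sketch (illustrative data/values).
import numpy as np
import lightgbm as lgb

# toy binary classification data
X = np.random.rand(1000, 20)
y = np.random.randint(2, size=1000)
train_data = lgb.Dataset(X, label=y)

params = {'objective': 'binary', 'metric': 'binary_logloss',
          'num_leaves': 31, 'learning_rate': 0.1}
bst = lgb.train(params, train_data, num_boost_round=10)
pred = bst.predict(X)  # predicted probabilities for the positive class
```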
Finally, back to example code: you are welcome to follow my GitHub.
To be continued...