A simple, hands-on understanding and implementation of ensemble learning in machine learning (III): the boosting principle, implementation, API introduction, GBDT, XGBoost, and the Taylor expansion

Ensemble learning

Learning targets

  • Understand the two core tasks that ensemble learning solves
  • Know the principle of the bagging ensemble
  • Know the process of building a random forest
  • Know why bootstrap sampling (random sampling with replacement) is needed
  • Apply the random forest algorithm with RandomForestClassifier
  • Know the principle of the boosting ensemble
  • Know the differences between bagging and boosting
  • Understand the GBDT implementation process

5.3 Boosting

1. The principle of the boosting ensemble

1.1 What is boosting?

(figure: boosting1.png)

Boosting accumulates learners, progressing from weak to strong.

In short: each time a new weak learner is added, the overall performance improves.

Representative algorithms: AdaBoost, GBDT, XGBoost

1.2 Implementation process:

1. Train the first learner

(figure: boosting2.png)

2. Adjust the data distribution

(figure: boosting3.png)

3. Train the second learner

(figure: boostin4.png)

4. Adjust the data distribution again

(figure: boosting5.png)

5. Train the remaining learners in turn, adjusting the data distribution each time

(figure: boosting6.png)

6. The overall process

(figure: boosting8.png)

Key points:

How are the voting weights determined?

How is the data distribution adjusted?

(figure: boosting9.png)

Summary of the AdaBoost construction process

(figure: boosting10.png)
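
The two key points above can be sketched with the standard AdaBoost update rules (an illustrative toy implementation, not the article's code): the voting weight of round m is alpha = ½·ln((1−err)/err), and each sample weight is multiplied by exp(−alpha·y·h(x)) and renormalized, so misclassified samples get heavier.

```python
import numpy as np

# Toy 1-D dataset with labels in {-1, +1}
X = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

n = len(X)
w = np.full(n, 1.0 / n)          # initial uniform sample distribution
stumps, alphas = [], []

for _ in range(3):
    # Choose the decision stump (x <= t predicts s, else -s) with the
    # lowest weighted error under the current distribution w
    t, s = min(
        ((t, s) for t in X for s in (1, -1)),
        key=lambda c: w[np.where(X <= c[0], c[1], -c[1]) != y].sum(),
    )
    pred = np.where(X <= t, s, -s)
    err = w[pred != y].sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # voting weight
    w = w * np.exp(-alpha * y * pred)                  # re-weight: mistakes
    w = w / w.sum()                                    # get heavier, then normalize
    stumps.append((t, s))
    alphas.append(alpha)

def predict(x):
    """Weighted majority vote of the trained stumps."""
    return np.sign(sum(a * (s if x <= t else -s)
                       for a, (t, s) in zip(alphas, stumps)))
```

This directly answers both questions: `alpha` is the voting weight, and the `w` update is the data-distribution adjustment.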

Differences between the bagging and boosting ensembles:

Difference 1: data handling

Bagging: the training data is resampled for each learner;

Boosting: sample weights are adjusted according to the previous round's learning results.

Difference 2: voting

Bagging: all learners vote with equal weight;

Boosting: learners cast weighted votes.

Difference 3: learning order

Bagging: learning is parallel, with no dependency between learners;

Boosting: learning is serial, with learners trained in order.

Difference 4: main purpose

Bagging is mainly used to improve generalization (combating overfitting, i.e. reducing variance);

Boosting is mainly used to improve training accuracy (combating underfitting, i.e. reducing bias).

(figure: baggingVSboosting.png)

1.3 API introduction

  • from sklearn.ensemble import AdaBoostClassifier
    • API link: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier
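
A minimal usage sketch of this API (the dataset and hyperparameter values here are illustrative, not from the article):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy data: iris, split into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

# n_estimators = number of boosting rounds;
# learning_rate shrinks the contribution (voting weight) of each weak learner
estimator = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=22)
estimator.fit(X_train, y_train)
accuracy = estimator.score(X_test, y_test)
```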

2 GBDT (overview)

Gradient Boosting Decision Tree (GBDT) is an iterative decision tree algorithm. **The model is composed of multiple decision trees, and the outputs of all the trees are summed to give the final answer.** From the moment it was proposed it was regarded as an algorithm with strong generalization ability, and in recent years it has drawn wide attention through its use in machine-learned ranking models for search.

GBDT = gradient descent + boosting + decision trees

2.1 The concept of the gradient (review)

(figures: gbdt1.png, gbdt2.png, gbdt3.png)

2.2 GBDT execution flow

(figure: gbdt4.png)

If h_i(x) in the formula above is taken to be a decision tree model, the formula becomes:

GBDT = gradient descent + boosting + decision trees

2.3 Worked example

Predict the height of sample 5:

| ID | Age (years) | Weight (kg) | Height (m) |
|----|-------------|-------------|------------|
| 1  | 5           | 20          | 1.1        |
| 2  | 7           | 30          | 1.3        |
| 3  | 21          | 70          | 1.7        |
| 4  | 30          | 60          | 1.8        |
| 5  | 25          | 65          | ?          |

Step 1: compute the loss function and obtain the first predicted value:

(figure: gbdt5.png)

Step 2: find the split point

(figure: gbdt6.png)

Result: with age 21 as the split point, the variance = 0.01 + 0.0025 = 0.0125

Step 3: using the adjusted target values, solve for h1(x)

(figure: gbdt8.png)

Step 4: solve for h2(x)

(figure: gbdt9.png)

… …

Final result:

(figure: gbdt10.png)

Height of sample 5 = 1.475 + 0.03 + 0.275 = 1.78
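
The arithmetic of the first rounds can be checked with a few lines of plain Python (a sketch of the same computation; the remaining rounds in the figures contribute the further +0.03 correction):

```python
# Toy table from section 2.3 (ages and heights of samples 1-4)
age    = [5, 7, 21, 30]
height = [1.1, 1.3, 1.7, 1.8]

# Round 0: with squared loss, the best constant prediction is the mean
f0 = sum(height) / len(height)      # 1.475

# Residuals (negative gradients of squared loss) become the new targets
r1 = [h - f0 for h in height]       # [-0.375, -0.175, 0.225, 0.325]

# A one-level tree splits at age 21 and outputs each side's mean residual
left  = sum(r1[:2]) / 2             # -0.275 (age < 21)
right = sum(r1[2:]) / 2             #  0.275 (age >= 21)

# Sample 5 is age 25, so it falls on the right side of the split
pred_5 = f0 + right                 # 1.475 + 0.275 = 1.75
```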

2.4 The main ideas behind GBDT

1. Use gradient descent to optimize the cost function;

2. Use one-level decision trees (stumps) as the weak learners, with the negative gradient as the target value;

3. Combine the learners using the boosting idea.
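
These three ideas are what sklearn's GradientBoostingRegressor implements; a minimal sketch on the toy table from section 2.3 (the hyperparameter values here are illustrative):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Features: [age, weight]; target: height, from the table in section 2.3
X = [[5, 20], [7, 30], [21, 70], [30, 60]]
y = [1.1, 1.3, 1.7, 1.8]

# max_depth=1 -> one-level trees (stumps) as weak learners;
# each round fits the negative gradient of the squared loss
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1)
model.fit(X, y)

pred = model.predict([[25, 65]])[0]   # sample 5
```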

3. XGBoost (overview)

XGBoost = second-order Taylor expansion + boosting + decision trees + regularization

  • Interview question: talk about your understanding of XGBoost and explain in detail how it works

Answer points: second-order Taylor expansion, boosting, decision trees, regularization

Boosting: XGBoost uses the boosting idea to learn multiple weak learners iteratively.

Second-order Taylor expansion: in each round of learning, XGBoost applies a second-order Taylor expansion to the loss function and uses both the first- and second-order gradients for optimization.

Decision trees: in each round of learning, XGBoost uses an optimized decision tree algorithm as the weak learner.

Regularization: to prevent overfitting, XGBoost adds a penalty term to the loss function that restricts the number of leaf nodes and the leaf node values of the decision tree during optimization.
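
A small numeric sketch of the second-order idea with toy numbers (the leaf-weight formula w* = −G/(H+λ) is from the XGBoost paper, shown here for the squared loss):

```python
# For squared loss l(y, f) = (y - f)^2 the gradients are g = 2(f - y), h = 2.
# XGBoost's optimal leaf weight is w* = -G / (H + lam), where G and H are the
# sums of g and h over the samples in the leaf and lam is the L2 penalty.
y_true = [1.1, 1.3]        # samples falling in one leaf (toy values)
y_pred = [1.475, 1.475]    # current model prediction for both samples

g = [2 * (p, t)[0] - 2 * t for p, t in zip(y_pred, y_true)]  # first-order gradients
h = [2.0 for _ in y_true]                                    # second-order gradients
G, H = sum(g), sum(h)
lam = 1.0                                                    # regularization strength

w_star = -G / (H + lam)    # leaf output minimizing the penalized objective
```

With lam = 0 this recovers the plain GBDT leaf value (the mean residual, −0.275 for these numbers); lam > 0 shrinks the leaf output toward zero, which is the regularization at work.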

4 What is a Taylor expansion? [extension]

(figure: taylor.png)

The more terms the Taylor expansion keeps, the more accurate the result.
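
This can be checked numerically with e^x expanded around 0 (an illustrative sketch):

```python
import math

def taylor_exp(x, n):
    """Sum of the first n+1 terms of the Taylor series of e^x at 0."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

# Absolute error at x = 1 for increasing truncation orders
errors = [abs(math.e - taylor_exp(1.0, n)) for n in (1, 2, 4, 8)]
# Each extra order shrinks the error, so `errors` is strictly decreasing
```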


Origin blog.csdn.net/qq_35456045/article/details/104644962