Ensemble Learning - Boosting Algorithms: A Brief Look at the Principles and Differences of AdaBoost, GBDT, XGBoost and LightGBM

1. Boosting algorithm


The Boosting algorithm combines a group of weak learners, trained in series, into a strong learner. Its working mechanism is as follows:
(1) Train a base learner on the initial training set;
(2) Adjust the distribution of the training samples according to the performance of this base learner, so that the samples it got wrong receive more attention in later rounds;
(3) Train the next base learner on the adjusted sample distribution;
(4) Repeat steps 2-3 until the number of base learners reaches a specified value T;
(5) Combine the T base learners with weights to obtain the final ensemble learner.
Depending on the strategy used, there are three common Boosting algorithms: AdaBoost, GBDT and XGBoost.
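As an illustration of steps (1)-(5), here is a minimal sketch of the serial boosting loop in Python. It uses AdaBoost-style reweighting and combination rules as one possible instantiation (each concrete algorithm defines its own rules), and assumes the labels are coded as -1/+1:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, T=10):
    """Serial boosting sketch; AdaBoost-style rules, labels assumed to be -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # (1) start from a uniform sample distribution
    learners, alphas = [], []
    for _ in range(T):                            # (4) repeat until T base learners are trained
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # (1)/(3) train on the current distribution
        miss = stump.predict(X) != y
        err = float(np.clip(w[miss].sum(), 1e-10, 1 - 1e-10))
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this base learner
        w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))  # (2) raise weights of misclassified samples
        w = w / w.sum()
        learners.append(stump)
        alphas.append(alpha)

    def predict(X_new):                           # (5) weighted combination of the T base learners
        votes = sum(a * m.predict(X_new) for a, m in zip(alphas, learners))
        return np.sign(votes)

    return predict
```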

2. AdaBoost algorithm

AdaBoost stands for Adaptive Boosting: it keeps adding weak classifiers while continuously modifying the sample weights (increasing the weights of misclassified samples and decreasing the weights of correctly classified samples). Its two core steps are:
Weight adjustment: increase the weights of the samples misclassified in the previous round and decrease the weights of those classified correctly, so that the misclassified samples get more attention from the next base classifier.
Combination of base classifiers: use weighted majority voting, i.e. give larger weights to classifiers with small classification error and smaller weights to classifiers with large error.
AdaBoost therefore follows the general Boosting procedure described above; a minimal usage sketch is shown below.
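As a brief, hedged usage sketch (the dataset and hyperparameter values are illustrative only), scikit-learn's AdaBoostClassifier implements this weight-adjustment / weighted-voting scheme:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# toy data standing in for a real classification task
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators is the number of weak classifiers T; learning_rate shrinks each
# classifier's voting weight; the default base learner is a depth-1 decision tree (stump)
clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```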

AdaBoost algorithm features

  • AdaBoost itself only provides the framework; the sub-classifiers can be built with various methods
  • Sub-classifiers are simple to construct
  • Training is fast, and there are few parameters to tune
  • Low generalization error

3. GBDT algorithm

GBDT (Gradient Boosting Decision Tree) works by continuously reducing the residual (regression): each newly added tree is built in the direction that reduces the residual, i.e. the negative gradient of the loss function. In other words, the loss function is chosen so that the residual shrinks as quickly as possible. Because every iteration has to fit a residual, all of the base learners are CART regression trees. Shrinkage is an important refinement of GBDT: it approaches the true result in small steps, with each tree learning only a small part of the target and the final result accumulated over many such steps, which effectively reduces the risk of overfitting.

Residual: The difference between the true value and the predicted value.

Basic principle

(1) Train a first model m1 that predicts 20 years old, leaving an error e1 of 10 years;
(2) Train a second model m2 on e1 that predicts 6 years, leaving an error e2 of 4 years;
(3) Train a third model m3 on e2 that predicts 3 years, leaving an error e3 of 1 year;
(4) Train a fourth model m4 on e3 that predicts 1 year...
(5) The final prediction is m1 + m2 + m3 + m4 = 20 + 6 + 3 + 1 = 30 years old.
Of course, the real process is not limited to a single feature: an actual GBDT model splits on different features in each tree, weights the result of each tree, and finally sums the trees' contributions. A minimal residual-fitting sketch follows.
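Here is a minimal sketch of this residual-fitting loop with shrinkage, assuming squared-error loss (the toy data, tree depth and learning rate are illustrative only):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)                               # toy features
y = 30 + 5 * X[:, 0] + rng.randn(200)              # toy target, roughly "30 years old"

learning_rate = 0.1            # Shrinkage: each tree only takes a small step
prediction = np.zeros_like(y)
residual = y.copy()            # residual = true value - current prediction
trees = []

for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=3)      # CART regression tree
    tree.fit(X, residual)                          # fit the current residual (negative gradient for squared loss)
    prediction += learning_rate * tree.predict(X)  # accumulate the small step
    residual = y - prediction                      # the residual shrinks round by round
    trees.append(tree)

print("mean |residual| after boosting:", np.abs(residual).mean())
```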

GBDT Features

Advantages :

  • In the prediction stage, the structure of every tree is already fixed, so computation can be parallelized and prediction is fast.
  • It works well on dense data, has good generalization and expressive ability, and is one of the most common top models in data science competitions.
  • It has good interpretability and robustness, and can automatically discover high-order interactions between features.

Disadvantages :

  • GBDT is less efficient on high-dimensional sparse datasets, where it performs worse than SVMs or neural networks.
  • It is best suited to numerical features and performs weakly on NLP or text features.
  • The training process cannot be parallelized across trees; engineering speed-ups are limited to the construction of a single tree.

Question 1 : Why not use the CART classification tree?

Each iteration of GBDT needs to fit a gradient value, which is continuous, so a regression tree with continuous outputs must be used.

Question 2 : What is the difference between a regression tree and a classification tree?

1. The key step for a regression tree is finding the best split point, and the candidate split points cover all possible values of every feature.
2. A classification tree chooses its best split point by entropy or the Gini index, i.e. by purity, but the labels of a regression tree are continuous values, so purity measures are not appropriate; squared error is the better fitting criterion (see the sketch below).
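A toy sketch of what "split by squared error" means for a regression tree: each candidate threshold is scored by the squared error of the two groups it produces, each group predicting its own mean (the data and thresholds are illustrative only):

```python
import numpy as np

def squared_error_of_split(x, y, threshold):
    left, right = y[x <= threshold], y[x > threshold]
    # each side predicts its mean; the score is the total squared error
    return ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([5.0, 5.5, 6.0, 12.0, 12.5, 13.0])    # continuous labels

candidates = (x[:-1] + x[1:]) / 2                  # candidate thresholds between adjacent values
best = min(candidates, key=lambda t: squared_error_of_split(x, y, t))
print("best split point:", best)                   # 3.5 for this toy data
```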

4. XGBoost algorithm

The principle of XGBoost is similar to that of GBDT. It is an optimized, distributed gradient boosting library designed for large-scale, parallel tree boosting.

The differences between XGBoost and GBDT:

  • Tree complexity: XGBoost adds the complexity of each CART tree to the objective as a regularization term, while GBDT does not;
  • Loss function: XGBoost fits a second-order (Taylor) expansion of the previous round's loss, while GBDT only uses the first-order derivative, so XGBoost is more accurate and needs fewer iterations;
  • Multi-threading: XGBoost uses multiple threads when searching for the best split point, so it runs faster (a minimal usage sketch follows this list).
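A brief, hedged usage sketch with the xgboost scikit-learn wrapper; the hyperparameter values are illustrative, and reg_lambda / gamma are the knobs for the tree-complexity penalty mentioned above:

```python
from sklearn.datasets import make_regression
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

model = XGBRegressor(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    reg_lambda=1.0,   # L2 penalty on leaf weights: part of the tree-complexity term GBDT lacks
    gamma=0.1,        # minimum loss reduction required to make a split (also penalizes complexity)
    n_jobs=4,         # multi-threaded split finding
)
model.fit(X, y)
print(model.predict(X[:5]))
```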

5. LightGBM algorithm

LightGBM is a gradient boosting framework open-sourced by Microsoft. Its main strength is efficient parallel training: it is about 10 times faster than XGBoost while using about 1/6 of the memory. Its main optimizations are the following (a brief usage sketch is given after the list):

  • A histogram-based decision tree algorithm;
  • Histogram subtraction (difference) acceleration;
  • A leaf-wise leaf growth strategy with a depth constraint;
  • Direct support for categorical features;
  • Direct support for efficient parallelism.
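A brief, hedged usage sketch with LightGBM's scikit-learn wrapper; the parameter values are illustrative and simply name the knobs behind the optimizations above:

```python
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

model = LGBMClassifier(
    n_estimators=200,
    num_leaves=31,      # leaf-wise growth is controlled by the number of leaves ...
    max_depth=8,        # ... plus a depth cap to limit overfitting
    learning_rate=0.05,
    max_bin=255,        # number of histogram bins used to discretize each feature
)
model.fit(X, y)
print("train accuracy:", model.score(X, y))
```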

Decision Tree Algorithm Based on Histogram

The basic idea is to discretize continuous feature values into k integers (bins) and build a histogram; when searching for the optimal split point, the algorithm then only traverses the histogram instead of every raw value. This sacrifices some precision, but it greatly reduces memory consumption and speeds up computation. Moreover, since a single decision tree is itself a weak model, the exact position of the split points matters little, and the coarser splits even help prevent overfitting.
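A toy sketch of the histogram idea, assuming gradient statistics from the current boosting round (the data, bin count and binning scheme are illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
feature = rng.rand(10000)          # one continuous feature
gradients = rng.randn(10000)       # per-sample gradients from the current boosting round

k = 255                            # number of bins
edges = np.quantile(feature, np.linspace(0, 1, k + 1))
bins = np.clip(np.searchsorted(edges, feature, side="right") - 1, 0, k - 1)

# one pass over the data builds the histogram ...
grad_hist = np.bincount(bins, weights=gradients, minlength=k)
count_hist = np.bincount(bins, minlength=k)

# ... and split finding then only scans the k bins, not the 10000 raw values
print(grad_hist.shape, count_hist.shape)
```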

Histogram subtraction acceleration

Normally, building a leaf's histogram requires traversing all of the data on that leaf. With histogram subtraction, however, a leaf's histogram can be obtained from its parent's histogram minus its sibling's, which only touches the k buckets of the histogram. So after building the histogram of one child leaf, the sibling's histogram comes almost for free, roughly doubling the speed.
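A toy sketch of the subtraction (the histograms here are just made-up arrays standing in for real gradient statistics):

```python
import numpy as np

k = 255
parent_hist = np.random.RandomState(0).rand(k)   # built by scanning all samples on the parent
left_hist = parent_hist * 0.4                    # built by scanning only the smaller child

right_hist = parent_hist - left_hist             # sibling obtained by subtraction: O(k) work, no data pass
print(np.allclose(right_hist, parent_hist * 0.6))  # True
```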

A leaf-wise leaf growth strategy with depth constraints

XGBoost grows trees level-wise. This makes multi-threaded optimization easy and keeps model complexity under control, so it is not prone to overfitting; however, it treats all leaves in the same level equally, so many low-gain leaves are still split and searched, which brings a lot of unnecessary overhead.

LightGBM adopts a leaf-wise growth strategy: at each step it finds the leaf with the largest split gain and splits it. For the same number of splits this gives lower error and higher accuracy. The drawback is that it may grow a very deep tree and overfit, so a maximum-depth limit has to be added.
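A parameter-level view of the two growth strategies (the values are illustrative only):

```python
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# LightGBM grows leaf-wise: complexity is controlled by the number of leaves,
# plus an explicit depth cap to guard against overfitting
leaf_wise = LGBMRegressor(num_leaves=63, max_depth=8)

# XGBoost grows level-wise (depth-wise) by default, so max_depth alone bounds the tree
level_wise = XGBRegressor(max_depth=6)
```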

Direct support for categorical features

Most machine learning tools require categorical features to be converted into numerical features (for example by one-hot encoding), which costs both space and time, whereas LightGBM can take categorical features as input directly.
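A minimal sketch of feeding a categorical column to LightGBM without one-hot encoding; the column names and data are made up for illustration:

```python
import pandas as pd
from lightgbm import LGBMClassifier

df = pd.DataFrame({
    "city": pd.Categorical(["bj", "sh", "sz", "bj", "sh", "sz"] * 100),  # categorical feature
    "amount": range(600),                                                # numerical feature
})
y = [0, 1, 0, 1, 0, 1] * 100

model = LGBMClassifier(n_estimators=50)
# columns with pandas 'category' dtype are used as categorical features directly
# (they can also be named explicitly via the categorical_feature argument)
model.fit(df, y)
```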

Direct support for efficient parallelism

Feature parallelism: every machine keeps a full copy of the training data instead of partitioning it vertically, so once the optimal split is found it can be applied locally, without communicating the split result between machines.
Data parallelism: the work of merging histograms is distributed across different machines, reducing communication and computation, and histogram subtraction further reduces the communication traffic.
Voting parallelism: each machine selects its top-k features locally; the features chosen by this vote are likely to contain the optimal split point, so only the selected features' histograms are merged, further reducing communication (a parameter-level sketch follows).
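A parameter-level sketch of choosing among these modes; tree_learner selects the parallel strategy when training across multiple machines (single-machine training uses the default, "serial"):

```python
from lightgbm import LGBMClassifier

feature_parallel = LGBMClassifier(tree_learner="feature")  # feature parallelism
data_parallel    = LGBMClassifier(tree_learner="data")     # data parallelism
voting_parallel  = LGBMClassifier(tree_learner="voting")   # voting parallelism
```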

Features of LightGBM:

Advantages :

  • Speed: traversing histograms reduces time complexity; the leaf-wise strategy avoids a large amount of computation; feature, data and voting parallelism speed up training; cache optimization raises the cache hit rate.
  • Low memory: the histogram algorithm converts feature values into bin indices, so fewer and smaller values need to be stored; Exclusive Feature Bundling (EFB) further reduces the number of features.

Disadvantages :

  • Leaf-wise growth may produce very deep trees and overfit, so a maximum-depth limit is needed to prevent this;
  • Like other boosting algorithms it reduces the error iteratively; as a bias-reduction method it is relatively sensitive to noisy samples;
  • When searching for the optimal split it evaluates one best split feature at a time rather than a combination of all features, so the search is not exhaustive.

Source: blog.csdn.net/gjinc/article/details/131921374