Random Forest, AdaBoost (Adaptive Boosting), GB (Gradient Boost), LightGBM

Random Forest

The Random Forest algorithm uses multiple (randomly generated) decision trees to generate the final output. (Reference: https://blog.csdn.net/qq_39777550/article/details/107312048)

A decision tree works like a chain of if-else conditions (a toy illustration follows below). Its disadvantage is that it can overfit: a feature with a large influence contributes a lot to the prediction and dominates the secondary features, so we build multiple decision trees that each randomly sample the features.
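
A toy illustration of this if-else view (the feature names and thresholds are hypothetical, not learned from any real data):

```python
# Hypothetical example of what a small trained decision tree amounts to:
# nested if-else conditions on the features.
def predict(sample: dict) -> str:
    if sample["age"] > 30:             # first split
        if sample["income"] > 50_000:  # second split
            return "yes"
        return "no"
    return "no"

print(predict({"age": 35, "income": 60_000}))  # -> "yes"
```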

Random forest uses the bagging technique to reduce high variance (on high variance, see: https://www.mastersindatascience.org/learning/difference-between-bias-and-variance/)

Suppose the data is an M×N matrix and we create 3 bags. Each bag contains different data, sampled at random from the M rows and N columns: usually about 2/3 of the rows and √N of the columns. A model is trained on each bag's data, and the final prediction is decided by voting.
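
A minimal sketch of this bagging procedure, assuming toy data and sklearn decision trees (the bag count, the 2/3 row fraction and the √N column count follow the description above; real Random Forest implementations differ in details):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for the M x N matrix described above
X, y = make_classification(n_samples=300, n_features=16, random_state=0)
M, N = X.shape

n_bags = 3                    # three "bags", as in the example above
n_rows = int(M * 2 / 3)       # sample roughly 2/3 of the rows per bag
n_cols = int(np.sqrt(N))      # sample roughly sqrt(N) of the columns per bag

rng = np.random.default_rng(0)
trees, col_sets = [], []
for _ in range(n_bags):
    rows = rng.choice(M, size=n_rows, replace=True)    # bootstrap row sample
    cols = rng.choice(N, size=n_cols, replace=False)   # random feature subset
    tree = DecisionTreeClassifier(random_state=0).fit(X[np.ix_(rows, cols)], y[rows])
    trees.append(tree)
    col_sets.append(cols)

# Each bag's model predicts, and the final answer is a majority vote
votes = np.stack([t.predict(X[:, c]) for t, c in zip(trees, col_sets)])
final = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the vote:", (final == y).mean())
```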

AdaBoost (Adaptive Boosting)

Adaptive Boosting
References:
https://www.cnblogs.com/pinard/p/6136914.html
https://www.cnblogs.com/pinard/p/6133937.html
For a code implementation, see:
https://blog.csdn.net/weixin_43584807/article/details/104784611

AdaBoost is an ensemble learning method in which many models are combined to create a better model

Bagging and boosting are two different ways of building such an ensemble

Random forest is an implementation of bagging

The differences between Random Forest and AdaBoost:
1. RF belongs to the bagging family; AdaBoost belongs to the boosting family.
2. RF is parallel learning: the decision trees are constructed independently from the data, in parallel. AdaBoost is sequential learning: each decision tree is built on the result of the previous one, so model 2 builds on model 1, model 3 builds on model 2, and so on.
(this is also an important difference between bagging and boosting)
3. In RF all models are equal: if there are 10 models (RF builds 10 decision trees), the 10 models vote equally on the final result. In AdaBoost the models are not equal: some models carry a higher weight on the result, others a lower weight.
4. RF uses fully grown trees (complete trees), while AdaBoost uses stumps: just a root node with two leaf nodes.

In boosting, each individual model is also called a weak learner

In AdaBoost, each weak learner is a stump: one root and two leaves, the leaves separating the two candidate bins of the chosen attribute. The stump predicts every sample, and the sample weights are adjusted dynamically according to the result: the weights of wrongly predicted samples increase, and the weights of correctly predicted samples decrease relatively. The next weak learner therefore pays more attention to the wrongly predicted samples and tries to predict correctly the samples the previous model got wrong.

The final model combines all the learned experiences to predict the outcome
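
A minimal sketch of this weight-updating loop, assuming toy data, sklearn decision stumps as the weak learners, and 10 boosting rounds chosen arbitrarily:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)          # the weight update is cleaner with labels in {-1, +1}
M = len(y)

w = np.full(M, 1 / M)                # start with equal sample weights
stumps, alphas = [], []
for _ in range(10):                  # 10 boosting rounds (arbitrary)
    stump = DecisionTreeClassifier(max_depth=1)        # a stump: one root, two leaves
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                           # weighted error of this weak learner
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))    # better stumps get a larger say
    w *= np.exp(-alpha * y * pred)                     # raise weights of misclassified samples
    w /= w.sum()                                       # renormalise so the weights sum to 1
    stumps.append(stump)
    alphas.append(alpha)

# Final model: weighted vote of all the stumps
score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(score) == y).mean())
```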

GB (Gradient Boost)

GB (Gradient Boost): now let's look at the differences between GB and AdaBoost

AdaBoost updates the sample weights according to its algorithm, while GB makes its updates according to a loss function

AdaBoost uses stumps with only one root node and two leaf nodes, while GB trees usually have 8 to 32 leaf nodes

A common loss is the MSE loss

The GB algorithm first takes the average of the target column as the initial predicted value, and then computes the distance d between the predicted value and the target

model_1 predicts this distance as pred_d and is optimized with the MSE loss against the actual distance d. Since the model predicts the distance, the predicted target becomes mean + α·pred_d, where α is the learning rate, e.g. 0.1. This result is taken as the new predicted value, the distance d between the prediction and the actual value is computed again, and model_2 is trained to predict this new distance, giving model_2's predicted target; continuing this way, the final prediction gradually approaches the actual result

final prediction = base value + α·pred_1 + α·pred_2 + α·pred_3 + … + α·pred_n
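
A minimal sketch of this loop, assuming toy regression data, small sklearn trees in place of the GB trees, and α = 0.1 as above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

alpha = 0.1                          # learning rate, as in the text
pred = np.full_like(y, y.mean())     # initial prediction: mean of the target column

trees = []
for _ in range(100):                 # 100 boosting rounds (arbitrary)
    d = y - pred                                     # distance (residual) to the target
    tree = DecisionTreeRegressor(max_leaf_nodes=8)   # small GB tree, e.g. 8-32 leaves
    tree.fit(X, d)
    pred = pred + alpha * tree.predict(X)            # base value + alpha*pred_1 + alpha*pred_2 + ...
    trees.append(tree)

print("training MSE after boosting:", np.mean((y - pred) ** 2))
```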

LightGBM

Light Gradient Boosting Machine

References:
https://lightgbm.readthedocs.io/en/v3.3.2/
For an API usage example, see:
https://www.kaggle.com/code/amitkumarjaiswal/lightgbm/script
LightGBM paper:
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/11/lightgbm.pdf

LightGBM is a gradient boosting framework

The boosting algorithm is an ensemble technique. The key idea is that we have multiple models m0, m1, m2, ..., mn, with m0 as the base model: m0 predicts the samples, we obtain an error e, the next model m1 is trained according to that error, and so on. Finally all the models are combined to get the final model

LightGBM has several characteristics, and it is these characteristics that make it very efficient to execute. When no GPU or high-compute machine is available, LightGBM is recommended; if a GPU is available, XGBoost is recommended because it scales well. The following introduces three characteristics of LightGBM.

The first characteristic is how the decision trees are trained internally. A column is selected as the root node, and an optimal split point is chosen (e.g. by Gini impurity) to split it into two subtrees. The values on the left and right of the optimal split point are grouped into bins, a process called binning; afterwards the algorithm only needs to determine which bin interval a variable falls into in order to choose a path during inference, which greatly speeds up the algorithm (histogram)
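
A toy sketch of the binning idea in plain NumPy (the bin count here is arbitrary; in LightGBM the max_bin parameter plays this role):

```python
import numpy as np

# A continuous toy feature: instead of testing every raw value as a split point,
# the histogram method first buckets the values into a fixed number of bins
rng = np.random.default_rng(0)
feature = rng.normal(size=1000)

n_bins = 16                                    # arbitrary here; LightGBM's max_bin plays this role
edges = np.histogram_bin_edges(feature, bins=n_bins)
binned = np.digitize(feature, edges[1:-1])     # each value replaced by its bin index

# Candidate split points are now only the bin boundaries, not every raw value
print("candidate splits:", n_bins - 1, "instead of", len(np.unique(feature)) - 1)
```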

The second characteristic is exclusive feature bundling (EFB): mutually exclusive variables are bundled into a new variable. For example, if Feature_a and Feature_b are mutually exclusive, only ever taking the value 1 one at a time and never both at once, then Feature_a and Feature_b can be bundled into a single Feature_ab, represented for example by the codes 10 or 01 (see the toy sketch below)
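
A toy sketch of the bundling idea (this shows only the intuition, not LightGBM's internal EFB implementation):

```python
import numpy as np

# Feature_a and Feature_b are mutually exclusive: never 1 at the same time
feature_a = np.array([1, 0, 0, 1, 0])
feature_b = np.array([0, 1, 0, 0, 1])
assert np.all(feature_a * feature_b == 0)      # check the exclusivity assumption

# Bundle them into one feature by giving each source feature its own value range:
# 0 -> both are zero, 1 -> Feature_a was 1, 2 -> Feature_b was 1
feature_ab = feature_a * 1 + feature_b * 2
print(feature_ab)                              # [1 2 0 1 2]
```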

The third characteristic is gradient-based one-side sampling (GOSS). Suppose that after the initial model m0 has run, we obtain the gradients gradient1, gradient2, ..., and sort them from high to low. Only part of the samples is selected for the next gradient step: typically the top 20% (a hyperparameter), plus 10% (another hyperparameter) sampled at random from the remaining 80%. So the next step uses the top 20% together with a random 10% of the remaining 80%, because the samples with small gradients are considered almost fully trained already and only a random subset of them needs to be revisited. (The "one-side" sampling refers to keeping the top-20% side.)
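
A minimal usage sketch assuming the lightgbm Python package (see the docs linked above); top_rate and other_rate correspond to the 20% / 10% fractions described here, and in recent LightGBM versions GOSS may instead be selected via data_sample_strategy="goss":

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GOSS: keep the rows with the largest gradients (top_rate) and a random
# sample of the remaining rows (other_rate)
model = lgb.LGBMClassifier(
    boosting_type="goss",   # gradient-based one-side sampling
    top_rate=0.2,           # keep the top 20% of gradients
    other_rate=0.1,         # plus a random 10% of the rest
    n_estimators=200,
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```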

Original: blog.csdn.net/qq_19841133/article/details/127132241