Traditional Machine Learning Notes 7 - Detailed Explanation of GBDT Model

Foreword

  In the previous blog post we introduced the basics of the regression tree model; if you are not familiar with it, you can go back and read Traditional Machine Learning Notes 6 - Regression Tree Model first. In this post we continue the traditional machine learning series with GBDT.

1. GBDT algorithm

  GBDT (Gradient Boosting Decision Tree), the gradient boosting decision tree, also known as MART, is an iterative decision tree algorithm. It builds a set of weak learners and accumulates the results of multiple decision trees as the final prediction, effectively combining decision trees with ensemble learning.

1.1.Boosting

  Before introducing the GBDT algorithm, we first need to look at Boosting. The Boosting method trains base classifiers serially, and there are dependencies between the base classifiers. Its idea is to stack base classifiers layer by layer: when training each layer, the samples misclassified by the previous layer are given higher weights, and at test time the final result is obtained by weighting the outputs of all layers of classifiers. Note that this differs from Bagging, which we introduced earlier: in Bagging there is no strong dependence between base classifiers during training, so they can be trained in parallel. If you are not familiar with Bagging, you can review the earlier blog post, 1.4.Bagging.

1.2.GBDT

  Let's now formally introduce the GBDT algorithm. Its principle is actually very simple:

  • The sum of the results of all weak classifiers equals the predicted value.
  • Each time, the current prediction is taken as the baseline, and the next weak classifier is used to fit the residual of the error function with respect to the current prediction (the gap between the predicted value and the true value).
  • The weak classifiers in GBDT are tree models.

[Figure: predicting a price of 100 yuan by summing four trees, each fitting the residual left by the previous ones]
As shown in the figure above, we use GBDT to predict a price:

  • The first weak classifier (the first tree) predicts a value (say 90 yuan), and the error turns out to be 10 yuan;
  • The second tree fits this residual; it predicts 6 yuan, and the remaining gap is 4 yuan;
  • The third tree continues to fit the residual; it predicts 3 yuan, and the remaining gap is only 1 yuan;
  • The fourth tree fits the remaining residual of 1 yuan, and the prediction is complete.

  In the end, the conclusions of the four trees are added up to obtain the labeled answer of 100 yuan (in actual engineering implementations, GBDT computes the negative gradient and uses the negative gradient to approximate the residual).
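  The accumulation idea above can be written as a short loop. Below is a minimal sketch, assuming scikit-learn's DecisionTreeRegressor as the weak learner; the toy features, targets, and number of trees are illustrative and not taken from the original example:

```python
# Minimal residual-fitting sketch (illustrative data, not the post's example).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[5.0], [7.0], [12.0], [20.0]])   # toy feature values
y = np.array([90.0, 95.0, 100.0, 110.0])       # toy targets: prices in yuan

prediction = np.zeros_like(y)                  # ensemble prediction starts at 0
trees = []
for _ in range(4):                             # four weak learners, as in the example
    residual = y - prediction                  # error of the current ensemble
    tree = DecisionTreeRegressor(max_depth=1)  # a shallow tree (stump)
    tree.fit(X, residual)                      # the next tree fits the residual
    prediction += tree.predict(X)              # final prediction = sum of all trees
    trees.append(tree)

print(prediction)                              # moves toward y as trees are added
```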

1.2.1. GBDT and the Negative Gradient Approximation of the Residual

  When doing a regression task, GBDT produces a predicted value for each sample in every iteration round, and the loss function in this case is the mean squared error loss:
$$l\left(y_i, \hat{y}_i\right)=\frac{1}{2}\left(y_i-\hat{y}_i\right)^2$$
The negative gradient of the loss function is:
$$-\left[\frac{\partial l\left(y_i, \hat{y}_i\right)}{\partial \hat{y}_i}\right]=y_i-\hat{y}_i$$
From the formula above, it can be seen that when the loss function is the mean squared error, the quantity fitted at each step is "true value minus predicted value", i.e. the residual.
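As a quick sanity check of the formula, a few lines of NumPy (with illustrative values) confirm that the negative gradient of the squared-error loss is exactly the residual:

```python
# Negative gradient of l = 0.5 * (y - y_hat)^2 equals the residual y - y_hat.
import numpy as np

y = np.array([14.0, 16.0, 24.0, 26.0])        # illustrative true values
y_hat = np.array([15.0, 15.0, 25.0, 25.0])    # illustrative current predictions

grad = -(y - y_hat)                           # dl/dy_hat
neg_grad = -grad                              # negative gradient
residual = y - y_hat
print(np.allclose(neg_grad, residual))        # True
```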

1.2.2. GBDT training process

  Let's take predicting the ages of four people as an example to introduce the GBDT training process. Suppose there are four people (A, B, C, D) whose ages are (14, 16, 24, 26). Among them, A and B are a first-year and a second-year high school student, and C and D are a fresh graduate and an employee who has worked for two years. We first train a single regression tree and get the result shown in Figure 1 below:
[Figure 1: a single regression tree fit to the four ages]
  Next, we use GBDT to do the same thing. Because there is so little data, we limit each tree to at most two leaf nodes, i.e. each tree has only one split, and we limit learning to only two trees. We get the result shown in Figure 2 below:
[Figure 2: two GBDT trees, the second fitting the residuals left by the first]
  The left tree in Figure 2 is the same as the single regression tree above: because A and B are close in age, and C and D are close in age, they are split into two groups, and each group uses its average age as the prediction. The residuals are then computed (that is, the prediction for A + the residual of A = the actual value of A), so the residual of A is 14 − 15 = −1 (note that the prediction for A refers to the accumulated sum of all previous trees; there is only one tree before it, so the prediction is simply 15, and if there were more trees they would have to be summed to give A's prediction). The residuals of A, B, C, and D are therefore −1, 1, −1, 1 respectively. We then replace the original values of A, B, C, and D with these residuals and let the second tree learn from them. If the second tree's predictions equal these residuals, we only need to add the second tree's conclusions to the first tree to recover the true ages. The data here is so simple that the second tree has only two distinct values, 1 and −1, and splits directly into two nodes. At this point everyone's residual is 0, i.e. everyone ends up with the true value. In other words, the predictions for A, B, C, and D now all match their real ages.

  • A: a 14-year-old first-year high school student who shops little and often asks questions on Baidu; predicted age A = 15 − 1 = 14
  • B: a 16-year-old second-year high school student who shops little and often answers questions; predicted age B = 15 + 1 = 16
  • C: a 24-year-old fresh graduate who shops a lot and often asks questions on Baidu; predicted age C = 25 − 1 = 24
  • D: a 26-year-old employee who has worked for two years, shops a lot, and often answers questions; predicted age D = 25 + 1 = 26
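  The two-tree walkthrough above can be reproduced in a few lines of code. This is a sketch under some assumptions: scikit-learn's DecisionTreeRegressor stands in for the GBDT base learner, and the two features (shopping amount, whether the person answers rather than asks questions) are illustrative stand-ins for the behaviour described in the text:

```python
# Sketch of the A/B/C/D age example with two trees of at most two leaves each.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative features: [shopping amount, answers questions (1) vs asks (0)]
X = np.array([[1.0, 0.0],   # A: shops little, asks questions
              [1.0, 1.0],   # B: shops little, answers questions
              [2.0, 0.0],   # C: shops a lot, asks questions
              [2.0, 1.0]])  # D: shops a lot, answers questions
y = np.array([14.0, 16.0, 24.0, 26.0])

# Tree 1 splits {A, B} from {C, D} and predicts the group means 15 and 25.
tree1 = DecisionTreeRegressor(max_leaf_nodes=2).fit(X, y)
residual = y - tree1.predict(X)               # [-1, 1, -1, 1]

# Tree 2 fits the residuals and splits them into the leaves -1 and +1.
tree2 = DecisionTreeRegressor(max_leaf_nodes=2).fit(X, residual)

final = tree1.predict(X) + tree2.predict(X)
print(final)                                  # [14. 16. 24. 26.], the true ages
```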

2. Gradient boosting and gradient descent

  Next, let's compare gradient boosting and gradient descent. Both are iterative optimization algorithms that use the information in the negative gradient direction of the loss function to update the current model at each iteration. Let's first look at the update formulas of the two.
Gradient descent:
  In gradient descent, the model is expressed in a parameterized form, so updating the model is equivalent to updating its parameters.
$$\begin{gathered} w_t=w_{t-1}-\left.\rho_t \nabla_w L\right|_{w=w_{t-1}} \\ L=\sum_i l\left(y_i, f_w\left(x_i\right)\right) \end{gathered}$$
Gradient boosting:
  In gradient boosting, the model does not need to be parameterized; it is defined directly in function space, which greatly expands the class of models that can be used.
$$\begin{gathered} F_t=F_{t-1}-\left.\rho_t \nabla_F L\right|_{F=F_{t-1}} \\ L=\sum_i l\left(y_i, F\left(x_i\right)\right) \end{gathered}$$
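  To make the contrast concrete, here is a compact sketch under illustrative assumptions (a linear model f_w(x) = w * x for gradient descent, depth-1 trees for gradient boosting, toy data, and arbitrary step sizes); it shows one method stepping in parameter space and the other in function space:

```python
# Gradient descent updates a parameter; gradient boosting updates a function.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Gradient descent: parameterized model f_w(x) = w * x, squared-error loss.
w, lr = 0.0, 0.01
for _ in range(200):
    grad_w = np.sum(-(y - w * X[:, 0]) * X[:, 0])    # dL/dw
    w = w - lr * grad_w                              # step in parameter space

# Gradient boosting: the model F is built directly as a sum of trees.
F, rho = np.zeros_like(y), 0.1
for _ in range(200):
    neg_grad = y - F                                 # -dL/dF for squared error
    tree = DecisionTreeRegressor(max_depth=1).fit(X, neg_grad)
    F = F + rho * tree.predict(X)                    # step in function space

print(w * X[:, 0])   # close to y
print(F)             # close to y
```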

3. Advantages and disadvantages of the GBDT model

  Above we introduced the basics of the GBDT model, so what are its advantages and disadvantages?
Advantages:

  • In the prediction stage, because the structure of each tree has been determined, the calculation can be parallelized and the calculation speed is fast.
  • It is suitable for dense data, has good generalization and expressive ability, and is one of the most commonly used top-performing models in data science competitions.
  • The interpretability is good, the robustness is also good, and the high-order relationship between features can be automatically discovered.

Disadvantages:

  • GBDT is less efficient on high-dimensional sparse data sets, and the performance is not as good as SVM or neural network.
  • Suitable for numerical features, weak performance on NLP or text features.
  • The training process cannot be parallelized, and engineering acceleration can only be reflected in the process of building a single tree.

4. GBDT vs Random Forest

Similarities:

  • Both are ensemble models composed of multiple trees, and the final result is decided jointly by multiple trees.
  • When RF and GBDT use CART trees, they can be classification trees or regression trees.

Differences:

  • During training, random forest trees can be generated in parallel, while GBDT can only be generated serially.
  • The result of a random forest is decided by majority vote (or averaging for regression), while GBDT sums the outputs of multiple trees.
  • Random Forest is not sensitive to outliers, while GBDT is more sensitive to outliers.
  • Random Forest reduces the variance of the model, while GBDT reduces the bias of the model.

Origin blog.csdn.net/qq_38683460/article/details/127614547