Super Detailed! An Introduction to the Principles of XGBoost

This article was written by Yuan Zhecheng himself; please don't flag it as plagiarism. Thank you! XGBoost is the earliest of the three most popular GBDT implementations, and the later two were built as improvements on XGBoost's ideas. This article therefore introduces XGBoost as a way to build an understanding of GBDT models.

Gradient Boosted Trees

Ensemble algorithms

Ensemble methods build multiple weak estimators on the data and aggregate their modeling results to obtain better regression or classification performance than any single model. There are many ways to combine weak estimators, including random forests, which build many weak estimators in parallel at once, and gradient boosting trees, which build weak estimators one after another. Gradient boosting trees can be either regression trees or classification trees; the following uses gradient boosting regression trees as the example.
The modeling process of gradient boosting regression trees is roughly as follows: first build a single tree, then iterate, adding one tree per iteration, gradually forming a strong estimator out of many tree models.
For a decision tree, each sample $i$ put into the model eventually falls to some leaf node. For an ordinary regression tree, the value on each leaf node is the mean of all samples on that leaf.
For gradient boosting regression trees, the prediction result of each sample is the weighted sum of the results on all trees:
$$\hat{y}_i = \sum_{k=1}^{K} \gamma_k h_k(x_i)$$
where $K$ is the total number of trees, $k$ indexes the $k$-th tree, $\gamma_k$ is the weight of that tree, and $h_k$ is its prediction (i.e., the mean of the samples on the leaf).
One of XGBoost's optimizations is to the prediction itself: it is no longer a weighted sum of per-leaf sample means, but instead introduces a prediction score, also called the leaf weight.
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
where $f_k(x_i)$ is the prediction score of sample $x_i$ on tree $k$, and the other symbols have the same meaning as before.
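As a toy numeric sketch of this additive prediction (the stumps below are hypothetical hand-written trees, not anything trained), each tree maps a sample to a leaf score and the ensemble sums them:

```python
# Toy illustration of y_hat_i = sum_k f_k(x_i): each "tree" maps a sample
# to a leaf score, and the ensemble prediction is the sum over all trees.
# These stumps are made up by hand, not learned from data.

def stump_1(x):
    # first tree: one split at x = 5
    return 0.5 if x < 5 else 1.5

def stump_2(x):
    # second tree: one split at x = 3
    return -0.2 if x < 3 else 0.3

trees = [stump_1, stump_2]

def predict(x):
    # y_hat = sum over all trees of the leaf score the sample falls into
    return sum(f(x) for f in trees)

print(predict(2.0))  # 0.5 + (-0.2)
print(predict(7.0))  # 1.5 + 0.3
```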

Parameter: n_estimators

Use this parameter to determine how many trees there are in total.
First, the number of trees in XGB determines the model's learning capacity: the more trees, the stronger the capacity. Given enough trees, even with only a small amount of data the model can learn 100% of the information in the training set, so XGB is a model that naturally tends to overfit; in that case, however, the model becomes very unstable.
Second, when the number of trees in XGB is small, each additional tree has a large impact on the model. Once the number of trees is already large, each addition has a relatively small impact and produces only weak changes. When the model is already overfitting, adding more trees has little effect and merely wastes computing resources.
Third, the effect of increasing the number of trees has a limit. Initially the model's performance improves as trees are added to XGB, but past a certain point adding more trees causes performance to flatten or even decline, which shows that brute-force increases in n_estimators are not necessarily effective.
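The diminishing returns described above can be seen in a from-scratch sketch: below, a tiny gradient-boosting loop fits regression stumps to residuals on made-up data. This illustrates the idea only; it is not XGBoost's actual algorithm.

```python
# Minimal from-scratch gradient boosting with regression stumps on toy data,
# illustrating that training error falls as trees are added, with most of
# the gain coming from the first few trees.

def fit_stump(x, residual):
    """Fit a depth-1 regression tree (stump) to the residuals:
    pick the threshold that minimizes squared error."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= thr]
        right = [r for xi, r in zip(x, residual) if xi > thr]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda xi: lmean if xi <= thr else rmean

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 1.0, 1.1, 3.9, 4.1, 4.0, 8.2, 8.0]

pred = [0.0] * len(x)
errors = []
for _ in range(5):                         # n_estimators = 5
    residual = [yi - pi for yi, pi in zip(y, pred)]
    tree = fit_stump(x, residual)          # each tree fits the residuals
    pred = [pi + tree(xi) for pi, xi in zip(pred, x)]
    errors.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))

print(errors)  # non-increasing training MSE as trees are added
```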

Parameter: subsample

Before training there is usually a huge data set. Tree models naturally tend to overfit, and if the amount of data is too large, training the tree model becomes very slow. We therefore perform bootstrap sampling on the original data set: sampling with replacement helps prevent overfitting.
In a gradient boosting tree, a new tree is built at every iteration, so a new training sample is drawn with replacement in each iteration. However, this alone does not guarantee that each new tree improves the ensemble. We therefore stipulate that each time an estimator is built, the model focuses more on the samples in the data set that are prone to being misjudged (similar to prioritized sampling in DQN).
Concretely, we start with the full data set. When building the first tree, we draw an initial sample with replacement and fit the model. After fitting, we evaluate the model and feed the incorrectly predicted samples back into the data set, completing one iteration. To build the second tree, we sample with replacement a second time, but this draw differs from the first uniform draw: we increase the weight of the samples misjudged by the first tree, so those samples are more likely to be selected. Trained on this weighted sample, the new tree leans toward the heavily weighted, easily misjudged samples. After fitting, the misjudged samples are again fed back into the data set, their weights grow further, and the next model leans even more toward these hard samples. This process iterates over and over.
Generally speaking, this parameter does not have much impact on the results and is usually left at its default value.
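The weighted resampling idea can be sketched in a few lines; the data, the misjudged indices, and the weight factor below are all made up for illustration:

```python
# Sketch of sampling with replacement where samples misjudged by the
# previous tree are upweighted, making them more likely to be drawn
# for the next tree's training set.
import random

random.seed(0)

samples = list(range(10))      # indices of the training samples
weights = [1.0] * 10           # start from uniform weights

for bad in (3, 7):             # suppose the previous tree misjudged 3 and 7
    weights[bad] *= 3.0        # raise their sampling probability

# weighted bootstrap draw for the next tree
draw = random.choices(samples, weights=weights, k=10)
print(draw)                    # samples 3 and 7 now have prob. 3/14 each
```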

Parameter: eta ($\eta$)

From the data's perspective, we push the model to work harder on hard-to-judge samples. But simply building a new tree biased toward hard samples does not guarantee that it will judge them correctly. A hard sample's weight is increased precisely because the previous tree failed on it, so the data the next tree faces is harder than what the previous tree saw, and judging all these samples correctly becomes harder and harder. What if the new tree handles the hard samples even worse than the previous one? Therefore, besides steering the model toward hard samples, we must also control how new weak learners are generated: each newly added tree should be the one that predicts best on this new data set.
Here a method similar to gradient descent is used. To solve for the optimal ensemble, we apply the same idea: first choose a loss function that, given our predictions, measures the gradient boosted tree's performance on the samples; then iterate the ensemble with gradient-descent-style updates:
$$\hat{y}_i^{(k+1)} = \hat{y}_i^{(k)} + f_{k+1}(x_i)$$
We let this process iterate until we find the $\hat{y}$ that minimizes the loss function. XGB likewise uses a parameter $\eta$ to control the iteration rate, so the update becomes:
$$\hat{y}_i^{(k+1)} = \hat{y}_i^{(k)} + \eta\, f_{k+1}(x_i)$$
The larger $\eta$ is, the faster the algorithm converges, but it may fail to converge to the optimum; the smaller $\eta$ is, the more accurate the solution found, but convergence is slower.
Generally speaking, $\eta$ is fine-tuned within roughly $[0.01, 0.2]$.
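A stripped-down sketch of this shrinkage update: assume, purely for illustration, that each new tree fits the current residual exactly, so the update $\hat{y} \leftarrow \hat{y} + \eta \cdot \text{residual}$ shrinks the remaining error by a factor of $(1-\eta)$ per round. The target value and round count are made up:

```python
# Shrinkage sketch: if each new tree f_{k+1} exactly fits the current
# residual, then y_hat^(k+1) = y_hat^(k) + eta * residual, and the
# remaining error shrinks by (1 - eta) each iteration.

def boost(eta, target=10.0, rounds=20):
    y_hat = 0.0
    for _ in range(rounds):
        residual = target - y_hat   # the "tree" fitted at this step
        y_hat += eta * residual     # shrinkage update with learning rate eta
    return y_hat

print(boost(eta=0.8))   # converges quickly toward 10
print(boost(eta=0.05))  # same number of rounds, still well short of 10
```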

Gradient Boosting Algorithm Summary

The gradient boosting algorithm mainly consists of 3 parts:
1. A loss function that can measure the effect of the integration algorithm and can be optimized;
2. A weak estimator $f_k(x)$ that can make predictions;
3. A means to integrate weak evaluators, such as iteration means, sampling means, sample weighting, etc.
XGBoost operates on these three core elements of the gradient boosting tree: it redefines the loss function and the weak estimator, and improves the boosting algorithm's ensembling method, achieving a balance between computing speed and model performance.

XGBoost

Objective function of XGBoost

The objective function in the gradient boosting algorithm is optional, and different objective functions can be selected for classification or regression problems. As long as it is differentiable and can represent some kind of loss.
In addition to the traditional loss function, XGB also introduces model complexity to measure the computational efficiency of the algorithm. Therefore the objective function is written: traditional loss function + model complexity.
$$Obj = \sum_{i=1}^{m} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$
Here $i$ indexes the $i$-th sample in the data set, $m$ is the total amount of data fed to the $k$-th tree, and $K$ is the total number of trees built so far (n_estimators). The first term is the traditional loss function measuring the difference between the true label and the predicted value; the second term represents the model's complexity, expressed through some transformation $\Omega$ of the tree model. Each time we build a tree we minimize $Obj$ to pursue the optimal $\hat{y}$, thereby minimizing the model's error rate and its complexity at the same time.
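A small numeric sketch of this objective, using squared-error loss and the L2 form of $\Omega$ that appears later in this article; all numbers (labels, predictions, leaf weights, $\gamma$, $\lambda$) are invented:

```python
# Numeric sketch of Obj = sum_i l(y_i, y_hat_i) + sum_k Omega(f_k) with
# squared-error loss and Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2.

y     = [1.0, 0.0, 1.0]
y_hat = [0.8, 0.3, 0.6]
gamma, lam = 1.0, 1.0

# two trees, each described only by its vector of leaf weights
trees = [[0.5, -0.2], [0.3, 0.1, -0.4]]

# first term: traditional loss over the samples
loss = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))

def omega(w):
    T = len(w)                     # number of leaves of this tree
    return gamma * T + 0.5 * lam * sum(wj ** 2 for wj in w)

# second term: complexity summed over all trees
complexity = sum(omega(w) for w in trees)
obj = loss + complexity
print(round(obj, 4))
```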
The loss function can also be understood from another perspective: variance and bias. Generally speaking, variance and bias trade off against each other: the larger the variance, the smaller the bias.
(Figure: generalization error versus model complexity, illustrating the bias-variance trade-off.)
Variance can be understood simply as the model's stability across different data sets, while bias is the accuracy of the model's predictions. The bias-variance dilemma corresponds directly to our $Obj$:
$$Obj = \sum_{i=1}^{m} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$
The first term measures bias: the less accurate the model, the larger it becomes. The second term measures variance: the more complex the model, the more specific its learning, the more its performance varies across data sets, and the larger the variance. Minimizing the objective therefore amounts to finding the balance point between variance and bias, achieving the smallest generalization error together with the fastest running speed. Tree models and tree ensembles overfit very easily, so most tree models start out in the high-complexity, high-variance region, and we normally have to control overfitting through pruning. XGBoost's loss function has a built-in part that keeps the variance from growing, which means XGBoost is smarter than other tree models and does not slide into that region so easily.

Solving the objective function of XGB

Previously, objective functions were usually solved with gradient descent or by constructing a dual problem, but there are no such parameters to solve for in XGB's objective, so traditional gradient descent cannot be used. In XGB, the tree $f_k$ is not a vector of numbers and has no direct relationship with the input feature matrix $x$; although the iterative process can be compared to gradient descent, the solution process is completely different.
In solving XGB, we strive to transform the objective function into a simpler form written directly in terms of the tree's structure, so as to establish a direct link between the tree's structure and the model's effect (both generalization ability and running speed). Because of this link, XGB's objective function is also called the "structure score".
Our objective function is:
$$Obj = \sum_{i=1}^{m} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$
The prediction obtained at the $t$-th iteration is:
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \sum_{k=1}^{t-1} f_k(x_i) + f_t(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$$
So the first term of the objective function can be written as:
$$\sum_{i=1}^{m} l\left(y_i, \hat{y}_i\right) = \sum_{i=1}^{m} l\left(y_i^t, \hat{y}_i^{t-1} + f_t(x_i)\right)$$
And since, for a given batch of data, $y_i^t$ is a known constant, the function $l$ is equivalent to the following function $F$:
$$F\left(\hat{y}_i^{t-1} + f_t(x_i)\right)$$
We now perform a Taylor expansion of $F$. Note that although $F$ looks like a function of one variable here, it also depends on the input data: for a fixed batch of data it is univariate, but the function $l$ is essentially a function of two variables. That is why partial derivatives are used here rather than total derivatives; there are many details one could dig into, which we will not elaborate on.
$$
\begin{aligned}
F\left(\hat{y}_i^{(t-1)} + f_t(x_i)\right) &\approx F\left(\hat{y}_i^{(t-1)}\right) + f_t(x_i) \cdot \frac{\partial F\left(\hat{y}_i^{(t-1)}\right)}{\partial \hat{y}_i^{(t-1)}} + \frac{1}{2}\left(f_t(x_i)\right)^2 \cdot \frac{\partial^2 F\left(\hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2} \\
&= l\left(y_i^t, \hat{y}_i^{(t-1)}\right) + f_t(x_i) \cdot \frac{\partial l\left(y_i^t, \hat{y}_i^{(t-1)}\right)}{\partial \hat{y}_i^{(t-1)}} + \frac{1}{2}\left(f_t(x_i)\right)^2 \cdot \frac{\partial^2 l\left(y_i^t, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2} \\
&= l\left(y_i^t, \hat{y}_i^{(t-1)}\right) + f_t(x_i) \cdot g_i + \frac{1}{2}\left(f_t(x_i)\right)^2 \cdot h_i
\end{aligned}
$$
Now the first term of the loss function can be simplified to the following form:
$$\sum_{i=1}^{m}\left[ l\left(y_i^t, \hat{y}_i^{(t-1)}\right) + f_t(x_i)\, g_i + \frac{1}{2}\left(f_t(x_i)\right)^2 h_i \right]$$
Now consider the current form of the loss function. We are at the $t$-th iteration, so the results of the first $t-1$ iterations are constants; hence $l\left(y_i^t, \hat{y}_i^{(t-1)}\right)$ is a constant, and $g_i$ and $h_i$ are also constants, because the Taylor expansion is taken at $\hat{y}_i^{(t-1)}$. $g_i$ and $h_i$ are known as each sample's gradient statistics and differ from sample to sample.
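As a concrete example of these gradient statistics, take the squared-error loss $l(y, \hat{y}) = (y - \hat{y})^2$; differentiating with respect to the prediction gives $g_i = 2(\hat{y}_i - y_i)$ and $h_i = 2$ (often written without the factor of 2 when the loss is defined with a $\frac{1}{2}$). The labels and previous-round predictions below are made up:

```python
# Gradient statistics g_i and h_i for the squared-error loss
# l(y, y_hat) = (y - y_hat)^2, evaluated at the previous round's
# predictions y_hat^(t-1).

y     = [1.0, 0.0, 1.0, 0.0]
y_hat = [0.7, 0.2, 0.9, 0.4]   # \hat{y}^{(t-1)}

# first derivative of l w.r.t. y_hat:  g_i = 2 * (y_hat_i - y_i)
g = [2 * (p - t) for p, t in zip(y_hat, y)]
# second derivative of l w.r.t. y_hat: h_i = 2 (constant for squared loss)
h = [2.0] * len(y)

print(g)  # approximately [-0.6, 0.4, -0.2, 0.8]
print(h)
```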

Now look at the second term of the objective function. At the $t$-th iteration:
$$\sum_{k=1}^{K} \Omega(f_k) = \sum_{k=1}^{t-1} \Omega(f_k) + \Omega(f_t)$$
The first sum is a constant, and only the last term depends on the current iteration.
It should be mentioned that the Taylor expansion rests on an assumption: $f_t(x_i)$ is a small quantity. Since tree models generally use hundreds of estimators (n_estimators), each individual tree's contribution is small and the assumption holds. Dropping the constant terms, the final form of the loss function is:
$$Obj = \sum_{i=1}^{m}\left[ f_t(x_i)\, g_i + \frac{1}{2}\left(f_t(x_i)\right)^2 h_i \right] + \Omega\left(f_t\right)$$
The core task in this formula is to determine the specific form of $f_t$.

Parameterizing the decision tree $f_t(x)$

In XGB, every leaf node carries a prediction score, also called the leaf weight. All samples falling on the same leaf node share this value, denoted $f_k(x_i)$ or $w$.
When there are multiple trees, the ensemble's regression result is the sum of the prediction scores of all trees. Suppose the ensemble contains $K$ decision trees in total; then the model's prediction for sample $i$ is:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
With this understanding, consider each tree on its own. Every tree has its own unique structure: the number of leaf nodes, the depth of the tree, the positions of the leaves, and so on, which together define a unique model. Within this structure, let $q(x_i)$ denote the leaf node that sample $x_i$ falls on, and let $w_{q(x_i)}$ denote the score of the $q(x_i)$-th leaf of the $t$-th tree, onto which the $i$-th sample falls. Then:
$$f_t(x_i) = w_{q(x_i)}$$
Suppose a tree contains $T$ leaf nodes in total, each indexed by $j$, so the weight of leaf $j$ is $w_j$. On this basis, define the model's complexity $\Omega(f)$ as:
$$\Omega(f) = \gamma T + \text{regularization term}$$
Using L2 regularization:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \|w\|^2 = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
Using L1 regularization:
$$\Omega(f) = \gamma T + \frac{1}{2}\alpha |w| = \gamma T + \frac{1}{2}\alpha \sum_{j=1}^{T} \left|w_j\right|$$
They can also be used together:
$$\Omega(f) = \gamma T + \frac{1}{2}\alpha \sum_{j=1}^{T} \left|w_j\right| + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
This structure has two parts: one is $\gamma T$, which controls the tree structure, and the other is the regularization term. The number of leaves $T$ can stand in for the entire tree structure, because all trees in XGBoost are CART trees (binary trees), so $T$ determines the depth of the tree; $\gamma$ is a user-defined parameter that controls the number of leaves.
The coefficients in front of the regularization terms control the regularization strength, just as in other models: the larger $\lambda$ and $\alpha$, the heavier the penalty and the larger the regularization term's share. In the optimization direction of minimizing the objective, the number of leaf nodes is suppressed and model complexity keeps decreasing, so for XGB, which naturally tends to overfit, regularization can improve the model's performance to a certain extent.
In practice, the regularization parameters are seldom our first choice when tuning. If we really want to control model complexity we adjust $\gamma$ rather than these two regularization parameters; for tree models, the pruning parameters remain the more important lever.

Finding the best tree structure: $w$ and $T$

We now express the tree in terms of prediction scores at leaf nodes, and the complexity of the tree is the number of leaves plus the regularization term:
$$f_t(x_i) = w_{q(x_i)}, \quad \Omega\left(f_t\right) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
Suppose the structure of the $t$-th tree has been determined as $q$. We can substitute it into the loss function and continue transforming the objective. The purpose of the transformation is to link the tree's structure (the number of leaf nodes) to the objective function. XGB uses L2 regularization by default:
$$
\begin{aligned}
&\sum_{i=1}^{m}\left[ f_t(x_i)\, g_i + \frac{1}{2}\left(f_t(x_i)\right)^2 h_i \right] + \Omega\left(f_t\right) \\
=\ &\sum_{i=1}^{m}\left[ w_{q(x_i)}\, g_i + \frac{1}{2}\, w_{q(x_i)}^2\, h_i \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \\
=\ &\sum_{i=1}^{m} w_{q(x_i)}\, g_i + \left[ \sum_{i=1}^{m} \frac{1}{2}\, w_{q(x_i)}^2\, h_i \right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2
\end{aligned}
$$
The loss function is still computed by summing over the samples; the goal of the next transformation is to compute it by summing over the leaf nodes instead.
Suppose we now have 2 leaves and 3 samples:
(Figure: a tree with two leaves and three samples; samples 1 and 2 fall on leaf 1, and sample 3 falls on leaf 2.)
then the first term of the loss function can be calculated in the following way:
$$
\begin{aligned}
\sum_{i=1}^{m} w_{q(x_i)}\, g_i &= w_{q(x_1)}\, g_1 + w_{q(x_2)}\, g_2 + w_{q(x_3)}\, g_3 \\
&= w_1\left(g_1 + g_2\right) + w_2\, g_3 \\
&= \sum_{j=1}^{T}\left( w_j \sum_{i \in I_j} g_i \right)
\end{aligned}
$$
Here we use the $w_j$ of each leaf node (note that $j$ is the leaf-node index). If we define $I_j$ as the set of all samples on the leaf with index $j$, the loss function can be further transformed into:
$$
\begin{aligned}
Obj &= \sum_{j=1}^{T}\left( w_j \sum_{i \in I_j} g_i \right) + \frac{1}{2}\sum_{j=1}^{T}\left( w_j^2 \sum_{i \in I_j} h_i \right) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \\
&= \sum_{j=1}^{T}\left[ w_j \sum_{i \in I_j} g_i + \frac{1}{2}\, w_j^2 \left( \sum_{i \in I_j} h_i + \lambda \right) \right] + \gamma T
\end{aligned}
$$
We further define:
$$G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$
So the loss function can be transformed into the following form:
$$Obj^{(t)} = \sum_{j=1}^{T}\left[ w_j G_j + \frac{1}{2}\, w_j^2 \left(H_j + \lambda\right) \right] + \gamma T, \qquad F^*\left(w_j\right) = w_j G_j + \frac{1}{2}\, w_j^2 \left(H_j + \lambda\right)$$
For each value of $j$, the summand is a quadratic function $F^*$ of the single variable $w_j$. Our goal is to make $Obj$ as small as possible, and if the quadratic for every leaf $j$ is minimized, their sum must be minimized too. So we differentiate $F^*$ with respect to $w_j$ and set the first derivative to zero to find the extremum:
$$
\begin{aligned}
\frac{\partial F^*\left(w_j\right)}{\partial w_j} &= G_j + w_j\left(H_j + \lambda\right) \\
0 &= G_j + w_j\left(H_j + \lambda\right) \\
w_j &= -\frac{G_j}{H_j + \lambda}
\end{aligned}
$$
This gives the minimum of the loss function for a given number of leaf nodes $T$, but $T$ itself is also a variable to be solved. Substituting this result back into the objective function yields:
$$
\begin{aligned}
Obj^{(t)} &= \sum_{j=1}^{T}\left[ -\frac{G_j}{H_j + \lambda}\, G_j + \frac{1}{2}\left( -\frac{G_j}{H_j + \lambda} \right)^2 \left(H_j + \lambda\right) \right] + \gamma T \\
&= \sum_{j=1}^{T}\left[ -\frac{G_j^2}{H_j + \lambda} + \frac{1}{2}\cdot\frac{G_j^2}{H_j + \lambda} \right] + \gamma T \\
&= -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T
\end{aligned}
$$
At this point the objective function has changed dramatically: the sample index $i$ has been absorbed into the leaves, and the objective is computed over the leaf nodes, that is, over the tree's structure. This is why the objective is also called the "structure score": the lower the score, the better the tree's overall structure. We have thereby established a direct link between the tree's structure (its leaves) and the model's effect.
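The two closed-form results above, the optimal leaf weight $w_j^* = -G_j/(H_j+\lambda)$ and the structure score $-\frac{1}{2}\sum_j G_j^2/(H_j+\lambda) + \gamma T$, can be checked numerically; the per-leaf gradient sums and hyperparameters below are made up:

```python
# Closed-form leaf weights and structure score for a fixed tree structure,
# computed from invented per-leaf gradient sums G_j and H_j.

lam, gamma = 1.0, 0.5

# G_j and H_j for T = 2 leaves (sums of g_i and h_i over each leaf's samples)
G = [-1.5, 2.0]
H = [3.0, 4.0]
T = len(G)

# optimal weight per leaf: w_j* = -G_j / (H_j + lambda)
w = [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]

# structure score: -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T
obj = -0.5 * sum(Gj ** 2 / (Hj + lam) for Gj, Hj in zip(G, H)) + gamma * T

print(w)               # optimal weight on each leaf
print(round(obj, 4))   # lower score means a better tree structure
```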
Let’s look at a specific example:
(Figure: a tree with three leaves; sample 1 falls on one leaf, sample 4 on another, and samples 2, 3, and 5 on the third.)
The calculation formula for the objective function of this tree is:
$$Obj = -\frac{1}{2}\left( \frac{g_1^2}{h_1 + \lambda} + \frac{g_4^2}{h_4 + \lambda} + \frac{\left(g_2 + g_3 + g_5\right)^2}{h_2 + h_3 + h_5 + \lambda} \right) + 3\gamma$$
We look for the best tree structure by way of this loss function. From the formula, $\lambda$ and $\gamma$ are preset hyperparameters, while $G_j$ and $H_j$ are determined jointly by the loss function and the prediction $\hat{y}_i^{t-1}$ under the specific structure. Minimizing the objective therefore solves for the number of leaves, i.e., essentially for the structure of the tree. That is, in each iteration we first solve for the best $T$, then compute $G_j$ and $H_j$, and then solve for the weight $w_j$ on each leaf, thus finding the optimal tree structure and completing one iteration.
Next we will consider how to find the optimal tree structure:

Finding the best branching method: Gain

A greedy algorithm pursues the global optimum by choosing the local optimum at each step. The decision tree algorithm itself is a greedy method, and XGB, as a tree ensemble, naturally uses the same approach: if each split is locally optimal, we take the resulting tree structure as optimal overall, which lets us avoid enumerating all possible tree structures.
Recall how splits are chosen in ordinary decision trees: we use the Gini coefficient or information entropy to measure the impurity of the leaf nodes after a split; the difference between the entropy before and after splitting is called the information gain; the feature whose split yields the largest information gain is chosen; and when the information gain falls below a threshold the tree stops growing. In XGB the method is similar: we use the objective function to measure the quality of the tree's structure and let the tree grow from depth 0. At each split we compute how much the objective decreases (so the value of Gain is the original score minus the score after splitting), and when the decrease falls below a set threshold the tree stops growing.
For example:
(Figure: the same example tree; the intermediate node "is male?" contains samples 1 and 4, which split into two child nodes, one holding sample 1 and one holding sample 4.)
It is the same example as before. For the intermediate node ("is male?"), $T = 1$, and the structure score of this node is:
$$
\begin{aligned}
I &= \{1, 4\} \\
G &= g_1 + g_4 \\
H &= h_1 + h_4 \\
Score_{middle} &= -\frac{1}{2}\,\frac{G^2}{H + \lambda} + \gamma
\end{aligned}
$$
For the brother and sister child nodes, we have:
$$\begin{aligned} Score_{sister} &= -\frac{1}{2}\frac{g_4^2}{h_4+\lambda} + \gamma \\ Score_{brother} &= -\frac{1}{2}\frac{g_1^2}{h_1+\lambda} + \gamma \end{aligned}$$
Adding the children's scores and subtracting the parent's, we have:
$$\begin{aligned} -Gain &= Score_{sister} + Score_{brother} - Score_{middle} \\ &= -\frac{1}{2}\frac{g_4^2}{h_4+\lambda} + \gamma - \frac{1}{2}\frac{g_1^2}{h_1+\lambda} + \gamma - \left(-\frac{1}{2}\frac{G^2}{H+\lambda} + \gamma\right) \\ &= -\frac{1}{2}\frac{g_4^2}{h_4+\lambda} + \gamma - \frac{1}{2}\frac{g_1^2}{h_1+\lambda} + \gamma + \frac{1}{2}\frac{G^2}{H+\lambda} - \gamma \\ &= -\frac{1}{2}\left[\frac{g_4^2}{h_4+\lambda} + \frac{g_1^2}{h_1+\lambda} - \frac{G^2}{H+\lambda}\right] + \gamma \\ &= -\frac{1}{2}\left[\frac{g_4^2}{h_4+\lambda} + \frac{g_1^2}{h_1+\lambda} - \frac{(g_1+g_4)^2}{(h_1+h_4)+\lambda}\right] + \gamma \end{aligned}$$
Negating both sides yields the expression for Gain:
$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$
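To make the Gain formula concrete, here is a small numerical sketch. The gradient statistics $g_1, h_1, g_4, h_4$ are made up for illustration, with $\lambda = 1$ and $\gamma = 0$:

```python
def gain(G_L, H_L, G_R, H_R, reg_lambda=1.0, gamma=0.0):
    """Split gain: 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
                        - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma"""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# Made-up stats: brother node holds sample 1, sister node holds sample 4
g1, h1 = -1.0, 1.0
g4, h4 = 2.0, 1.0
split_gain = gain(g1, h1, g4, h4)  # 0.5 * (0.5 + 2.0 - 1/3) ≈ 1.0833
```

A positive value means the split reduces the objective function; a negative value means the parent node scores better left unsplit.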
Since CART trees are binary, this formula generalizes: $G_L$ and $H_L$ are computed from the left child (the brother node), $G_R$ and $H_R$ from the right child (the sister node), and $(G_L+G_R)$ and $(H_L+H_R)$ from the parent node. Any candidate split can be evaluated this way. In practice, we perform this calculation at every candidate split point of every feature, then choose the split that decreases the objective function the most, and we repeat this for every layer of every tree. Compared with solving by plain gradient descent, practice has shown that this way of finding the optimal tree structure is faster and performs well on large datasets.
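The exhaustive search just described (try every split point of every feature, keep the largest Gain) can be sketched as follows. This is a simplified, illustrative version of XGBoost's exact greedy split finding, not the library's actual implementation:

```python
import numpy as np

def best_split(X, g, h, reg_lambda=1.0, gamma=0.0):
    """Scan all features and all split points; return (feature, threshold,
    gain) of the split with the largest Gain, following
    Gain = 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - G^2/(H+lam)] - gamma."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    G, H = g.sum(), h.sum()
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        G_L = H_L = 0.0
        for rank, i in enumerate(order[:-1]):   # try a split after each sample
            G_L += g[i]; H_L += h[i]
            gain = 0.5 * (score(G_L, H_L)
                          + score(G - G_L, H - H_L)
                          - score(G, H)) - gamma
            if gain > best[2]:
                thr = (X[i, j] + X[order[rank + 1], j]) / 2
                best = (j, thr, gain)
    return best

# Tiny made-up dataset: feature 0 separates the gradient signs perfectly
X = np.array([[1.0, 5.0], [2.0, 5.0], [8.0, 5.0], [9.0, 5.0]])
g = np.array([-1.0, -1.0, 1.0, 1.0])
h = np.ones(4)
feat, thr, gn = best_split(X, g, h)  # splits on feature 0 at threshold 5.0
```

The double loop is the cost the exact greedy algorithm pays; the approximate and histogram-based variants in later GBDT libraries exist precisely to cut down this scan.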

The important parameter $\gamma$

$\gamma$ is a penalty term subtracted once for every leaf we add: the more leaves, the more heavily Gain is penalized. It can therefore be used to prevent overfitting.
In XGB, we stipulate that as long as the difference between structure scores, $Gain$, is greater than 0, that is, as long as the objective function can still decrease, the tree is allowed to continue branching. In other words, our requirement on the reduction of the objective function is:
$$\begin{aligned} \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma &> 0 \\ \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] &> \gamma \end{aligned}$$
In this way, we can stop the trees in XGB from growing simply by setting the size of $\gamma$. $\gamma$ is therefore defined as the minimum reduction in the objective function required for a further split at a leaf node of the tree.
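In the xgboost library this threshold is exposed as the `gamma` parameter (alias `min_split_loss`). The stopping rule itself can be sketched as a simple check (a minimal illustration reusing made-up gradient statistics, not the library's code):

```python
def split_allowed(G_L, H_L, G_R, H_R, reg_lambda=1.0, gamma=0.0):
    """A split is kept only if the structure-score reduction exceeds gamma,
    i.e. 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - G^2/(H+lam)] > gamma."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    reduction = 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                       - score(G_L + G_R, H_L + H_R))
    return reduction > gamma

# The same candidate split (reduction ≈ 1.083) survives a small gamma
# but is pruned away by a large one
ok_small = split_allowed(-1.0, 1.0, 2.0, 1.0, gamma=0.5)   # True
ok_large = split_allowed(-1.0, 1.0, 2.0, 1.0, gamma=2.0)   # False
```

Raising `gamma` thus makes the model more conservative: fewer splits pass the test, the trees stay shallower, and overfitting is curbed.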

Origin blog.csdn.net/weixin_42988382/article/details/105893310