Explanation of the GBDT algorithm principle and a summary of commonly used training frameworks: XGBoost, LightGBM, CatBoost, NGBoost

1 Basic knowledge points

1.1 Ensemble Learning

Ensemble learning is a core concept in machine learning. Its main idea is to train multiple weak learners and combine them so that the combined performance is better than that of any single weak learner. Errors in machine learning can be roughly divided into two categories: bias error, the difference between the predicted value and the true value, and variance error, the degree of dispersion of the prediction viewed as a random variable. Ensemble learning can reduce both. By combining the results of multiple classifiers, the bias of the model's predictions can be reduced, especially for unstable learners, so an ensemble tends to be more stable than its individual members. The two most common ensemble methods are Bagging and Boosting, which we briefly describe next.

1.2 Bagging and Boosting

To use Bagging or Boosting we must first choose a base learner. For example, if we choose classification trees, Bagging or Boosting combines a series of tree learners into an ensemble learner. How, then, do Bagging and Boosting train N learners?

  • First: training data selection
    Each time, new training data sets are generated from the original training data set, and the N learners are trained on them separately. When generating each new training set, Bagging selects samples randomly, so every sample has the same probability of appearing in the new set, while Boosting selects samples according to their weights, so some samples have a higher probability of being chosen for the new training set.

  • Second: training process
    The main difference between Bagging and Boosting lies in the training process. Bagging trains in parallel and each learner is independent, while Boosting builds the learners sequentially: each new learner depends on the previous ones, so they are not independent. The comparison chart is as follows:
    [Figure: Bagging trains its learners in parallel, Boosting trains them sequentially]

In the Boosting algorithm, the training data fed to each classifier depends on the predictions of the previous classifier, so at each training step the sample weights are readjusted: wrongly predicted samples get larger weights and are more likely to be selected for training the next classifier, which focuses it on these hard samples.

  • Third: prediction process
    After training N learners, Bagging and Boosting also differ in how they produce the final prediction. With Bagging, the final result is the average of the N learners' results; with Boosting, it is a weighted sum:
    $\text{Bagging} = \frac{1}{N} \sum_{i=1}^N s_i$
    $\text{Boosting} = \sum_{i=1}^N w_i s_i$
    The weight $w_i$ is assigned according to how well each learner predicts: the better a learner performs, the larger its weight. This does not mean that Boosting is always better than Bagging; it depends on factors such as the specific data set and learners. If a single learner performs poorly, Bagging can hardly obtain a powerful learner, while Boosting's optimization strategy can strengthen the combined effect of multiple learners. Conversely, if every learner overfits, Bagging is the better choice, since Boosting does not help avoid overfitting. A minimal sketch of the two combination rules follows this list.
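As a minimal illustration of these two combination rules (assuming the per-learner predictions and the boosting weights are already available; the numbers below are made up), the averaging and weighted-sum steps can be written as:

```python
# Combine the outputs of N already-trained learners: Bagging averages them,
# Boosting takes a weighted sum. Values here are placeholders for illustration.
import numpy as np

preds = np.array([              # shape (N_learners, n_samples)
    [0.9, 0.2, 0.6],
    [0.8, 0.3, 0.4],
    [0.7, 0.1, 0.5],
])
weights = np.array([0.5, 0.3, 0.2])   # per-learner weights used by Boosting

bagging_pred = preds.mean(axis=0)                         # simple average
boosting_pred = (weights[:, None] * preds).sum(axis=0)    # weighted sum

print(bagging_pred)    # [0.8 0.2 0.5]
print(boosting_pred)   # [0.83 0.21 0.52]
```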

1.3 Adaptive Boosting

Adaptive Boosting (AdaBoost) is a Boosting method. The core idea of Boosting is to learn from the previous model's errors, and AdaBoost does so mainly by increasing the weights of misclassified samples so that the next model pays more attention to recognizing them. The basic training steps are as follows (a simplified sketch follows this list):

  • Train a tree model
  • Calculate the error rate e of this tree model
  • Calculate the weight of this tree based on the error rate: learning_rate * log((1 - e) / e); the larger the error rate e, the smaller the weight
  • Update the weight of each sample: for samples the model classifies correctly, the weight stays unchanged; for misclassified samples, the new weight is old_weight * np.exp(weight of this tree). After the update the weights of these samples are larger, so the next step focuses more on recognizing such misclassified samples
  • Repeat the above steps until the number of trees reaches the configured maximum
  • Make the final prediction: predict each candidate sample through a weighted voting mechanism
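A simplified sketch of these steps, assuming binary labels in {-1, +1}, decision stumps as the tree model, and learning_rate = 1 (this illustrates the procedure above, not a faithful re-implementation of any library):

```python
# Simplified AdaBoost sketch: train stumps, weight each stump by log((1-e)/e),
# and increase the sample weights of misclassified samples after each round.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_trees=10):
    n = len(y)
    sample_w = np.full(n, 1.0 / n)           # start with uniform sample weights
    trees, tree_w = [], []
    for _ in range(n_trees):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=sample_w)
        miss = stump.predict(X) != y
        e = np.clip(np.sum(sample_w[miss]) / np.sum(sample_w), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - e) / e)           # tree weight: larger error -> smaller weight
        sample_w[miss] *= np.exp(alpha)       # boost the weights of misclassified samples
        sample_w /= sample_w.sum()            # renormalize
        trees.append(stump)
        tree_w.append(alpha)
    return trees, tree_w

def adaboost_predict(trees, tree_w, X):
    # Weighted vote over all trees
    score = sum(a * t.predict(X) for t, a in zip(trees, tree_w))
    return np.sign(score)

# Toy example
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
trees, tree_w = adaboost_fit(X, y, n_trees=5)
print(adaboost_predict(trees, tree_w, X))
```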

1.4 Gradient Boosting

Gradient Boosting is also a Boosting method. As mentioned above, the core idea of boosting is to learn from previous mistakes. Each iteration of Gradient Boosting fits the residual of the previous step (the negative partial derivative of the loss function with respect to the previous prediction), so the prediction at step t moves in the negative gradient direction of the loss evaluated at the step t-1 prediction:
$f_t(x_i) = f_{t-1}(x_i) - \frac{\partial L(y_i, f_{t-1}(x_i))}{\partial f_{t-1}(x_i)}$
Each iteration therefore keeps reducing the target loss. The algorithm flow is as follows:

1. Initialization: $f_0(x) = \arg\min_\gamma \sum_{i=1}^N L(y_i, \gamma)$

2. For $t = 1$ to $T$:

   (a) Compute the negative gradient: $\hat{y}_i = -\frac{\partial L(y_i, f_{t-1}(x_i))}{\partial f_{t-1}(x_i)}, \quad i = 1, 2, \dots, N$

   (b) Fit $\hat{y}_i$ with the base learner $h_t(x)$ by minimizing the squared error:

   $w_t = \arg\min_w \sum_{i=1}^N [\hat{y}_i - h_t(x_i; w)]^2$

   (c) Use line search to determine the step size $\rho_t$ that makes $L$ smallest:

   $\rho_t = \arg\min_\rho \sum_{i=1}^N L(y_i, f_{t-1}(x_i) + \rho h_t(x_i; w_t))$

   (d) $f_t(x) = f_{t-1}(x) + \rho_t h_t(x; w_t)$

3. Output $f_T(x)$
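A minimal sketch of this procedure for the squared-error loss, where the negative gradient is simply the residual $y - f(x)$; a constant shrinkage factor replaces the line search of step (c), and scikit-learn regression trees serve as the base learner $h_t$:

```python
# Gradient boosting sketch: each new tree fits the residuals (negative gradient
# of the squared-error loss), and its output is added with a shrinkage factor.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gb_fit(X, y, n_trees=100, lr=0.1, max_depth=3):
    f0 = y.mean()                        # step 1: f_0 = argmin_gamma sum L(y_i, gamma)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residual = y - pred              # step (a): negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)            # step (b): fit the base learner to the residuals
        pred += lr * tree.predict(X)     # step (d): f_t = f_{t-1} + lr * h_t
        trees.append(tree)
    return f0, trees

def gb_predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)

# Toy example: fit a noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)
f0, trees = gb_fit(X, y)
print(gb_predict(f0, trees, X[:5]))
```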

2 GBDT algorithm

2.1 Principle

GBDT (Gradient Boosting Decision Tree) is a gradient boosting tree. Next, we derive the details of the algorithm.

1) GBDT prediction expression
Suppose we have K trees and denote the prediction of the k-th tree by the function $f_k(x)$. Then for a sample $x_i$, the final predicted value is:
$\hat{y}_i = \sum_{k=1}^K f_k(x_i)$

2) Define the objective loss function
The loss function measures the difference between the model's predictions and the true label values. A regularization term is usually added to penalize the model's weight parameters and keep the model from becoming too complex; for tree-based models, however, there are no weight parameters to optimize, so complexity must be penalized in another way. For tree models the regularization term is usually built from factors such as the depth of the tree, the number of leaf nodes, or the L2 norm of the leaf weights: the more leaves a tree has, the deeper it is, or the larger its leaf scores, the more easily it overfits. Taking these into account, for tree models we can define the objective loss function as:

$Obj = \sum_{i=1}^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k)$

where $l(y_i, \hat{y}_i)$ measures the difference between the model prediction and the real label, and $\Omega(f_k)$ is a regularization term that measures the complexity of the model and prevents overfitting. Our goal is to minimize this objective loss function.

3) Transformation of the objective loss function
With the objective function defined, how does the model learn? Since what we want to train is a tree-structured function $f_t(x)$ rather than a numeric vector, it cannot be solved directly by gradient descent. Instead, another method called additive training (boosting) is used to find the solution.
Assuming the initial prediction is 0, each additional tree updates the predicted value iteratively as follows:

$\hat{y}_i^{(0)} = 0$
$\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)$
$\hat{y}_i^{(2)} = f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)$
$\dots$
$\hat{y}_i^{(t)} = \sum_{k=1}^t f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$

It can be seen from the above formulas that at step t, the prediction is the sum of the result accumulated over the previous t-1 steps and the output of the current tree (here we ignore the learning-rate factor that scales each tree's contribution). Expanding the objective function gives:

$Obj^{(t)} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t \Omega(f_i)$
$\qquad\;\;\, = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + const$

Assuming our target loss function is the mean square error, the transformation is as follows:

$Obj^{(t)} = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \Omega(f_t) + const$
$\qquad\;\;\, = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i) f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + const$

Since the function $f_t(x_i)$ does not depend on $\hat{y}_i^{(t-1)}$, the second step above folds the term that involves only $\hat{y}_i^{(t-1)}$ into the constant. Next we solve the GBDT optimization based on Taylor's formula.

4) Taylor formula
Taylor's formula is one of the methods often used to approximate and study the properties of complex functions, and it is an important tool of differential calculus. If a function satisfies certain conditions, the Taylor formula uses the derivative values of the function at a point as coefficients to construct a polynomial that approximates the function. The second-order expansion is:
$f(x + \Delta x) \approx f(x) + f'(x)\Delta x + \frac{1}{2} f''(x) \Delta x^2$

5) The objective function via Taylor expansion
Next we apply the Taylor expansion to the objective function. The derivation is as follows:
$Obj^{(t)} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + const$
Let:
$g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$
$h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$
where $g_i$ and $h_i$ are the first and second derivatives of the loss $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$. Treating $f_t(x_i)$ as the increment $\Delta x$, the Taylor expansion of the objective is:
$Obj^{(t)} \approx \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + const$
If the loss is the mean squared error, then $g_i$ and $h_i$ evaluate to:
$g_i = \partial_{\hat{y}_i^{(t-1)}} (\hat{y}_i^{(t-1)} - y_i)^2 = 2(\hat{y}_i^{(t-1)} - y_i)$
$h_i = \partial^2_{\hat{y}_i^{(t-1)}} (\hat{y}_i^{(t-1)} - y_i)^2 = 2$
So the objective function becomes:
$Obj^{(t)} \approx \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + const$
$\qquad\;\;\, = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i) f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + const$
which matches the expansion obtained earlier for the squared error. Because the constant has no effect on the optimization of the objective, we drop it and simplify further to:
$Obj^{(t)} = \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)$
where, again, $g_i$ and $h_i$ are the first and second derivatives of $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$.
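As a quick sanity check of $g_i = 2(\hat{y}_i^{(t-1)} - y_i)$ and $h_i = 2$, the following snippet compares the formulas against finite-difference derivatives of the squared-error loss (the numbers are arbitrary illustrative values):

```python
# Verify the first/second derivatives of the squared-error loss numerically.
def loss(y, y_hat):
    return (y_hat - y) ** 2

y, y_hat, eps = 3.0, 2.5, 1e-4
g_formula = 2 * (y_hat - y)      # -1.0
h_formula = 2.0
g_numeric = (loss(y, y_hat + eps) - loss(y, y_hat - eps)) / (2 * eps)
h_numeric = (loss(y, y_hat + eps) - 2 * loss(y, y_hat) + loss(y, y_hat - eps)) / eps**2

print(g_formula, g_numeric)      # both approximately -1.0
print(h_formula, h_numeric)      # both approximately 2.0
```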

6) The regularization term $\Omega(f_t)$
How should the regularization term $\Omega(f_t)$ of the objective be expressed? The first thing to be clear about is that its purpose is to prevent the model from overfitting, i.e., from becoming too complex. Based on the structure of a tree model, we can define the regularization term as:
$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$
where $T$ is the number of leaf nodes and $w_j$ is the score of the j-th leaf node. From this expression we can see that the regularization term keeps a tree from having too many leaf nodes and keeps the leaf values from becoming too large. As shown below:
[Figure: example tree with 3 leaf nodes used to compute $\Omega$]

For this example $\Omega$ evaluates to $\gamma \cdot 3 + \frac{1}{2}\lambda(4 + 0.01 + 1)$. From the objective loss function above we know that $f_t(x)$ is the model's predicted score; for a tree structure we write it as:
$f_t(x) = w_{q(x)}$
where $w \in \mathbb{R}^T$ is a T-dimensional vector whose elements are the scores of the leaf nodes, and $q(x)$ is a mapping function that assigns a sample x to a specific leaf node of the tree. Substituting the regularization term and this tree representation into the objective function gives:
$Obj^{(t)} \approx \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)$
$\qquad\;\;\, = \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$
$\qquad\;\;\, = \sum_{j=1}^T [(\sum_{i \in I_j} g_i) w_j + \frac{1}{2}(\sum_{i \in I_j} h_i + \lambda) w_j^2] + \gamma T$

where $I_j = \{i \mid q(x_i) = j\}$ is the set of samples that fall into leaf j. We define:
$G_j = \sum_{i \in I_j} g_i$
$H_j = \sum_{i \in I_j} h_i$
Then the objective function simplifies further to:
$Obj^{(t)} = \sum_{j=1}^T [(\sum_{i \in I_j} g_i) w_j + \frac{1}{2}(\sum_{i \in I_j} h_i + \lambda) w_j^2] + \gamma T$
$\qquad\;\;\, = \sum_{j=1}^T [G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2] + \gamma T$
Here $w_j$ is the value we want to optimize, i.e., the leaf node value in the tree structure. To minimize the objective loss, we take the derivative with respect to each $w_j$ and set it to 0:
$G_j + (H_j + \lambda) w_j = 0$
which gives:
$w_j^* = -\frac{G_j}{H_j + \lambda}$
Substituting $w_j^*$ back into the objective, the target loss function becomes:
$Obj^{(t)} = -\frac{1}{2}\sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T$

This is the final expression of the objective loss. What we need to compute are $g_i$ and $h_i$, where $G_j$ and $H_j$ are respectively the sum of the first derivatives $g$ and the sum of the second derivatives $h$ over all samples that fall into leaf node j. So for any tree structure, we can compute the objective loss value from the leaf nodes the samples land in, as shown in the following figure:
[Figure: computing $Obj$ from the $G_j$ and $H_j$ of each leaf node]
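A small sketch of this computation (the $g_i$, $h_i$ values and leaf assignments below are hypothetical; $\lambda$ and $\gamma$ are the regularization parameters from the formulas above):

```python
# Given each sample's g_i, h_i and the leaf it falls into, compute the optimal
# leaf scores w_j* = -G_j/(H_j + lambda) and Obj = -1/2 * sum G_j^2/(H_j+lambda) + gamma*T.
import numpy as np

g = np.array([-2.0, -1.0, 0.5, 1.5, 2.0])   # hypothetical first derivatives
h = np.array([2.0, 2.0, 2.0, 2.0, 2.0])     # second derivatives (squared loss -> 2)
leaf = np.array([0, 0, 1, 2, 2])            # q(x_i): index of the leaf each sample reaches
lam, gamma = 1.0, 0.1
T = leaf.max() + 1                          # number of leaf nodes

G = np.array([g[leaf == j].sum() for j in range(T)])
H = np.array([h[leaf == j].sum() for j in range(T)])

w_star = -G / (H + lam)                     # optimal score of each leaf
obj = -0.5 * np.sum(G**2 / (H + lam)) + gamma * T

print(w_star)   # [ 0.6   -0.167 -0.7  ]
print(obj)      # structure score of this tree (about -1.87)
```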

7) How to find the optimal split structure of a tree
To minimize Obj we would, in principle, enumerate all possible split trees, compute the objective loss of each with the formula above, and pick the tree structure with the smallest loss. Is this feasible? Definitely not: exhaustively enumerating all tree structures has far too high a complexity. Instead, at each level of the tree we decide whether to split a node into a left and a right child by checking whether the objective after the split is smaller than before, and we choose the split feature and threshold with the largest gain (i.e., the lowest objective loss after the newly selected cut point). The split gain is computed as (loss before splitting minus loss after splitting):
$Gain = \frac{1}{2}\Big[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\Big] - \gamma$
Whether a leaf node should be split is decided with this formula: if the gain is greater than 0, splitting can continue. As for where to cut, we can sort the samples by feature value, accumulate $g$ and $h$ while scanning from left to right, and select the optimal cut point, as sketched below.
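A sketch of this left-to-right scan for a single feature (hypothetical values; `lam` and `gamma` are the $\lambda$ and $\gamma$ from the gain formula):

```python
# Scan candidate split points of one feature: sort samples by feature value,
# accumulate G_L and H_L from left to right, and keep the split with the best gain.
import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])     # one feature column (hypothetical)
g = np.array([-2.0, 0.5, -1.0, 2.0, 1.5])   # first derivatives of the samples
h = np.array([2.0, 2.0, 2.0, 2.0, 2.0])     # second derivatives
lam, gamma = 1.0, 0.1

order = np.argsort(x)
g_sorted, h_sorted = g[order], h[order]
G, H = g.sum(), h.sum()

best_gain, best_threshold = 0.0, None
G_L = H_L = 0.0
for i in range(len(x) - 1):                 # candidate split after sorted position i
    G_L += g_sorted[i]
    H_L += h_sorted[i]
    G_R, H_R = G - G_L, H - H_L
    gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                  - G**2 / (H + lam)) - gamma
    if gain > best_gain:                    # split only if the gain is positive
        best_gain, best_threshold = gain, x[order][i]

print(best_gain, best_threshold)
```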

2.2 Training

Now that we are familiar with the principle of GBDT, let's look at how the GBDT model is trained and what model structure we obtain after training.

Training steps
Assuming we use the mean squared error loss function, the GBDT training steps are summarized as follows:

  • Step 1 (initialization): the initial predicted score of every sample is the mean of all labels: $f_0(x_i) = \frac{1}{n}\sum_{i=1}^n y_i$
  • Step 2 (building the first tree): for each sample, using the previous prediction $\hat{y}_i^{(0)} = f_0(x_i)$ and the corresponding label $y_i$, compute the first derivative of the objective with respect to the previous prediction, $g_i = 2(\hat{y}_i^{(0)} - y_i)$, and the second derivative $h_i = 2$. Then traverse all features and the range of values of each feature, compute the Gain before and after each candidate split, and choose the split point with the largest gain; keep splitting nodes downward until the gain falls below a set threshold, or the tree depth, number of leaf nodes, etc. exceed their limits. Finally, the leaf values are computed with the formula above: $w = -\frac{G_j}{H_j + \lambda}$, where $G_j, H_j$ are the sums of the first and second derivatives of all samples falling into that leaf. A sample can then be routed by the split rules to the leaf it falls into, whose score is the function value $f_1(x_i)$.
  • Step 3 (building the second tree): route all samples through all previous trees according to the split conditions; the accumulated sum of the leaf scores is the current predicted score: $\hat{y}_i^{(1)} = f_0(x_i) + f_1(x_i)$. Following the same procedure as for the first tree, construct the second tree and obtain its score $f_2(x_i)$.
  • Step 4 (building the third tree): all samples pass through the previously built trees and reach their leaves according to the conditions, so the current predicted value is $\hat{y}_i^{(2)} = f_0(x_i) + f_1(x_i) + f_2(x_i)$. Based on this latest result, compute $g_i, h_i$ for each sample and choose the split with the largest gain.
  • Repeat until the number of trees reaches the configured threshold (an end-to-end sketch with a standard library follows this list).
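The same end-to-end process, run with a standard library implementation (a minimal sketch using scikit-learn's GradientBoostingRegressor on synthetic data; it uses first-order gradient boosting rather than the second-order formulas above, but the loop of adding trees to the previous prediction is the same):

```python
# Gradient-boosted trees with scikit-learn on synthetic regression data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbdt = GradientBoostingRegressor(
    n_estimators=200,       # number of trees
    learning_rate=0.1,      # shrinkage applied to each tree's contribution
    max_depth=3,            # depth limit of each tree
    loss='squared_error',   # mean squared error, as assumed above ('ls' in older versions)
)
gbdt.fit(X_train, y_train)

# staged_predict shows how the error drops as trees are added one by one
for i, y_pred in enumerate(gbdt.staged_predict(X_test)):
    if (i + 1) % 50 == 0:
        print(i + 1, mean_squared_error(y_test, y_pred))
```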

Model structure
After training we obtain N trees. Each tree is essentially a sequence of split rules: depending on which conditions a sample's features satisfy, the sample follows different branches of the tree and finally lands in one of its leaf nodes; the weight of that leaf is the sample's predicted score for this tree.

2.3 Prediction

Once the model is trained, prediction is simple. Suppose there are T trees; the final model prediction is the sum of the scores of the leaf nodes that the sample falls into across all trees:
$\hat{y}_i = \sum_{j=0}^T f_j(x_i)$

3 Training frameworks

Next we introduce several distributed training frameworks that optimize the Gradient Boosting algorithm. These frameworks support distributed training, tree tuning, missing-value handling, regularization, and other ways of avoiding overfitting.

3.1 XGBoost

XGBoost: A Scalable Tree Boosting System was released in May 2014 by DMLC. It is currently a popular, efficient framework for distributed training of Gradient Boosted Trees; for details, refer to the official documentation.
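As with the LightGBM and CatBoost examples later in this section, here is a minimal training sketch using the native xgboost API; the file name, column names, and parameter values are placeholders rather than settings from the original article:

```python
# Minimal XGBoost training sketch with the native DMatrix/train API.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Read the data ('train.csv' and the 'label' column are placeholders)
df = pd.read_csv('train.csv')
X = df.drop('label', axis=1)
y = df.label
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {
    'objective': 'binary:logistic',   # binary classification
    'eta': 0.1,                       # learning rate
    'max_depth': 6,
    'lambda': 1.0,                    # L2 regularization on leaf weights
    'gamma': 0.1,                     # minimum gain required to make a split
    'eval_metric': 'auc',
}
bst = xgb.train(params,
                dtrain,
                num_boost_round=200,
                evals=[(dtrain, 'train'), (dvalid, 'valid')],
                early_stopping_rounds=20)

bst.save_model('xgb_model.json')
y_pred = bst.predict(dvalid)
```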

3.2 LightGBM

LightGBM: A Highly Efficient Gradient Boosting Decision Tree was released by Microsoft in January 2017. It targets some problems in the XGBoost framework and designs a more efficient learning framework, mainly using Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) to speed up model training. For details, please refer to the official documentation. The following is simple code for ranking (lambdarank) based on lightgbm:

import lightgbm as lgb
import numpy as np 
import pandas as pd
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
import shap
import graphviz

# Read the data and show the first 20 rows
df = pd.read_csv('train.csv')
df.head(20)
# Show the column names
df.columns
# Extract the feature columns X and the label y
X = df.drop(['label','query','term'], axis=1)
y = df.label
group = np.loadtxt('./group.txt')
# Training data
train_data = lgb.Dataset(X, label=y, group=group, free_raw_data=False)
# Parameter definition
params = {
    'task' : 'train', 
    'boosting_type': 'gbdt',
    'objective': 'lambdarank',
    'num_iterations': 200,
    'learning_rate':0.1,
    'num_leaves': 31,
    'tree_learner': 'serial',
    'max_depth': 6,
    'metric': 'ndcg',
    'metric_freq': 10,
    'train_metric':True,
    'ndcg_at':[2],
    'max_bin':255,
    'max_position': 20,
    'verbose':0
}
# Specify the categorical features (column indices)
categorical_feature = [0, 1]
# Train
gbm=lgb.train(params,
              train_data,
              valid_sets=train_data,
              categorical_feature=categorical_feature)
# Save the model
gbm.save_model('model_large.md')
# Predict
bst = lgb.Booster(model_file='model_large.md')
df_test = pd.read_csv('test.csv')
y_pred = bst.predict(df_test)

# Feature importance
fea_imp = pd.DataFrame({
    'imp': bst.feature_importance(importance_type='split'), 'col': X.columns})
fea_imp = fea_imp.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]
fea_imp.plot(kind='barh', x='col', y='imp', figsize=(10, 7), legend=None)
plt.title('Feature Importance')
plt.ylabel('Features')
plt.xlabel('Importance');
# Feature analysis based on SHAP
explainer = shap.TreeExplainer(bst)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
shap.summary_plot(shap_values, X)
shap.dependence_plot('entropy', shap_values, X, interaction_index=None, show=True)

3.3 CatBoost

CatBoost: unbiased boosting with categorical features was developed in April 2017 by the Russian search giant Yandex as a framework that improves on xgboost. Its biggest advantage is its handling of categorical features: compared with LightGBM, it does not require label encoding of categorical features, which makes it convenient for users. For details see the official website. Below is a detailed comparison of the three:
Training loss comparison:
[Figure: training loss comparison]
Model performance comparison: the left side is a CPU machine, the right side is a GPU machine.
[Figure: model performance comparison on CPU and GPU]
It can be seen that, compared with XGBoost and LightGBM, CatBoost shows some improvement in metric convergence and, under the same conditions, a large improvement in model performance. The following is a simple code example for training with CatBoost:

import numpy as np 
import pandas as pd
import os
from sklearn.metrics import mean_squared_error
from sklearn import feature_selection
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import seaborn as sns
import matplotlib.pyplot as plt

# Read the data
df = pd.read_csv('data.csv')
df.head()
# Show the feature names
df.columns
# Inspect one feature
pd.set_option('display.float_format', '{:.2f}'.format)
df.f_ctf.describe()
# Plot the feature distribution
plt.figure(figsize = (10, 4))
plt.scatter(range(df.shape[0]), np.sort(df['f_ctf'].values))
plt.xlabel('index')
plt.ylabel('f_ctf')
plt.title("f_ctf Distribution")
plt.show();
# Extract features and label (assuming the label column is named 'label')
X = df.drop('label', axis=1)
y = df.label
# Split into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, random_state=42)
# Specify the categorical feature indices
categorical_features_indices=[1,2,3]
# Model training
model =  CatBoostClassifier(iterations=700,
                             learning_rate=0.01,
                             depth=15,
                             eval_metric='AUC',
                             random_seed = 42,
                             bagging_temperature = 0.2,
                             od_type='Iter',
                             metric_period = 75,
                             loss_function='Logloss',
                             od_wait=100)
model.fit(X_train, y_train,
                 eval_set=(X_valid, y_valid),
                 cat_features=categorical_features_indices,
                 use_best_model=True,
                 plot=True)

# Show feature importance
fea_imp = pd.DataFrame({
    'imp': model.feature_importances_, 'col': X.columns})
fea_imp = fea_imp.sort_values(['imp', 'col'], ascending=[True,False]).iloc[-30:]
fea_imp.plot(kind='barh', x='col', y='imp', figsize=(10, 7), legend=None)
plt.title('CatBoost - Feature Importance')
plt.ylabel('Features')
plt.xlabel('Importance');

3.4 NGBoost

NGBoost: Natural Gradient Boosting for Probabilistic Prediction is a relatively new gradient boosting framework, published in October 2019 by Andrew Ng's team at Stanford. The code is on GitHub: NGBoost Github. Its core idea is natural gradient boosting, a modular boosting algorithm for probabilistic prediction; the algorithm consists of a base learner, a parametric probability distribution, and a scoring rule.
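A minimal regression sketch following the style of the NGBoost README (the API names such as NGBRegressor, Normal, and pred_dist are assumed from the package's documentation and may differ across versions; the data is synthetic):

```python
# NGBoost regression sketch: the model learns the parameters of a Normal
# distribution for each sample via natural gradient boosting.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from ngboost import NGBRegressor
from ngboost.distns import Normal

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ngb = NGBRegressor(Dist=Normal, n_estimators=500, learning_rate=0.01, verbose=False)
ngb.fit(X_train, y_train)

y_pred = ngb.predict(X_test)        # point prediction (mean of the distribution)
y_dist = ngb.pred_dist(X_test)      # full predictive distribution per sample
print(y_pred[:5])
print(y_dist.params['loc'][:5], y_dist.params['scale'][:5])
```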

4 Combining tree models and deep models

Tree models have strong interpretability and stability, but their drawback is that they have no semantic features and their generalization ability is not strong enough. In practical scenarios the tree model can therefore be combined with a deep learning model; even a simple average of the two may bring an improvement. As a simple effect comparison, a BERT model improves over a GBDT model on a sentiment classification task, but combining BERT and GBDT by simply averaging their outputs gives the best result. For details, see this simple comparison test: bert vs catboost
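A trivial sketch of the averaging strategy described above, assuming the predicted probabilities of the two models are already available (the arrays below are placeholders):

```python
# Blend a deep model and a tree model by simply averaging their scores.
import numpy as np

bert_probs = np.array([0.91, 0.35, 0.62, 0.08])   # placeholder BERT scores
gbdt_probs = np.array([0.85, 0.45, 0.55, 0.12])   # placeholder GBDT scores

blend = (bert_probs + gbdt_probs) / 2.0           # simple average of the two models
pred_labels = (blend >= 0.5).astype(int)
print(blend, pred_labels)
```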

