It took three weeks to understand the principle of the xgboost algorithm

Algorithm process

I first learned the decision tree, then the random forest, and finally came to xgboost. I thought such a smooth progression would make the transition easy, but it turned out to be much harder than expected. It took more than three weeks of scattered effort, and only after reading the explanations in many articles did I finally grasp the algorithm principle of xgboost.

Among the references, there are two sources I personally recommend: one in text form and one in video form.

The algorithm flow chart of xgboost is given directly here, which makes it easier to understand xgboost intuitively. Given a training set, xgboost first trains a CART tree to obtain a model; this model produces a residual for each sample. The residuals are then used as the targets of a new training set, on which another CART tree is trained to obtain a new model. This process repeats until an exit condition is met.

The final xgboost model is the sum of all the models above. Suppose there are $M$ models in total and the output of the $m$-th model is $f_m$; then the final output of the xgboost model for sample $x_i$ is

$$\hat y_i = \sum_{m=1}^{M} f_m(x_i)$$
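To make this flow concrete, here is a minimal sketch of the additive training loop, assuming a plain squared-error setting. It is only illustrative: it uses scikit-learn's DecisionTreeRegressor as a stand-in for the CART trees and fits each tree directly to the current residuals, rather than to the second-order objective derived later in this article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=10, max_depth=3):
    """Toy additive model: each new tree is fit to the residuals of the current prediction."""
    y = np.asarray(y, dtype=float)
    trees = []
    y_pred = np.zeros(len(y), dtype=float)       # start from an all-zero prediction
    for _ in range(n_trees):
        residual = y - y_pred                    # fitting target for the next tree
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        trees.append(tree)
        y_pred += tree.predict(X)                # the model output is the sum of all trees
    return trees

def predict_boosted_trees(trees, X):
    """Final output: hat y_i = sum_m f_m(x_i)."""
    return sum(tree.predict(X) for tree in trees)
```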

Obviously, xgboost, like random forest, is an ensemble of multiple models, but there are still many differences between them. In technical terms, xgboost is a boosting method: its core goal is to reduce bias, and its trees are built serially. Random forest is a bagging method: its core goal is to reduce variance, and its trees can be built in parallel.

After figuring out the overall flow, we still need to work through its details. There are two things to understand: (1) What is a CART tree? (2) How is the CART tree model obtained?

CART tree

Put simply, a CART tree is first of all a tree; on top of that, each leaf node $j$ is assigned a node value $w_j$.

Suppose the mapping from the $i$-th sample $x_i$ to its leaf node $j$ is written $j = q(x_i)$. Taking the figure below as an example: $q(x_1)=2$, $q(x_2)=1$, $q(x_3)=2$.

The node value of the $i$-th sample is then $w_{q(x_i)}$. Conversely, the sample set $I_j$ of the $j$-th node can be expressed as

$$I_j = \{\, x_i \mid q(x_i) = j \,\}$$

Again taking the figure above as an example: $I_1=\{2\}$, $I_2=\{1,3\}$.
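As a tiny illustration of this notation, the three-sample example above can be written out directly (the node values below are made up):

```python
# Hypothetical example matching the figure: sample index -> leaf node index
q = {1: 2, 2: 1, 3: 2}          # q(x_1)=2, q(x_2)=1, q(x_3)=2
w = {1: 0.3, 2: -0.5}           # made-up node values w_1, w_2

# Node value assigned to each sample: w_{q(x_i)}
sample_values = {i: w[q[i]] for i in q}        # {1: -0.5, 2: 0.3, 3: -0.5}

# Sample set of each node: I_j = {x_i | q(x_i) = j}
I = {j: {i for i in q if q[i] == j} for j in set(q.values())}
# I == {1: {2}, 2: {1, 3}}
```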

From this definition of the CART tree, it is easy to see that two kinds of information determine a CART tree: the tree structure and the node values. Next, we first work out how to find the optimal node values when the tree structure is fixed; on that basis, we then return to the question of how to determine the optimal tree structure.

best node value

When the tree structure is fixed, the node that each sample falls into is determined immediately. At this point, define the following objective function to measure the error of xgboost on the whole sample set:

$$\text{obj} = \sum_{i=1}^{n} l(y_i, \hat y_i^{(M)}) + \sum_{m=1}^{M} \Omega(f_m)$$

Here the first term measures the error on the samples themselves: $n$ is the sample size, $l(\cdot)$ is a function that measures the per-sample error (for example MSE), $y_i$ is the true value of the $i$-th sample, and $\hat y_i^{(M)}$ is its predicted value. The second term is the regularization term, which also appears in linear model optimization (ridge regression and Lasso regression); its main purpose, both there and here, is to reduce the risk of overfitting.

The second term is the simpler one, so deal with it first:

$$\sum_{m=1}^{M}\Omega(f_m) = \sum_{m=1}^{M-1}\Omega(f_m) + \Omega(f_M)$$

From the algorithm flow we know that the CART trees are built one at a time. When we are optimizing the $M$-th tree, the first $M-1$ trees have already been determined, so their regularization values are constants and can be ignored during optimization. For the $M$-th tree, $\Omega$ is defined as

$$\Omega(f_M) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2$$

Here $\gamma$ and $\lambda$ are penalty coefficients and $T$ is the total number of leaf nodes of the tree. This definition can be understood as follows. On the one hand, a large $T$ means the tree is deep, which makes overfitting more likely, so it is penalized through $\gamma$. On the other hand, large $w_j$ values mean this tree carries a large share of the whole model, i.e. the prediction depends mainly on this tree, which also raises the risk of overfitting, so it is penalized through $\lambda$.
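A minimal sketch of this regularization term as a helper function (my own naming, not xgboost's API):

```python
import numpy as np

def omega(leaf_weights, gamma=1.0, lam=1.0):
    """Regularization of one tree: gamma * T + 0.5 * lambda * sum_j w_j^2."""
    w = np.asarray(leaf_weights, dtype=float)
    T = w.size                       # number of leaf nodes
    return gamma * T + 0.5 * lam * np.sum(w ** 2)

# e.g. a tree with two leaves and weights w_1 = 0.3, w_2 = -0.5:
# omega([0.3, -0.5], gamma=1.0, lam=1.0) == 2.0 + 0.5 * 0.34 == 2.17
```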

Now back to the first term. Given the specific expression of the second term and an explicit $l(\cdot)$, it might seem that the extremum can be obtained directly by taking gradients. In practice this does not work, mainly because $\hat y$ is produced by a tree model: it is not a continuous function of the inputs, so it cannot be differentiated directly.

That being the case, we need another approach. Fortunately, because the first $M-1$ CART trees have already been determined, we only need to pay attention to the node values of the $M$-th tree, so the first term can be transformed into

$$\sum_{i=1}^{n} l(y_i, \hat y_i^{(M)}) = \sum_{j=1}^{T}\sum_{i\in I_j} l(y_i, \hat y_i^{(M)})$$

The value of this conversion is that it changes the way the error is accumulated from summing over samples to summing over nodes, so this term now uses the same index as $\Omega$, which makes it easier to combine the expressions.

To handle this term, we take its Taylor expansion and keep terms up to second order.

First, recall the Taylor formula

$$f(x+\delta x) = f(x) + f'(x)\,\delta x + \frac{1}{2}f''(x)\,\delta x^2$$

Here, treat $\hat y_i^{(M-1)}$ as $x$ and $w_j$ as $\delta x$, and expand in exactly the same way:

$$l(y_i, \hat y_i^{(M)}) = l(y_i, \hat y_i^{(M-1)} + w_j) = l(y_i, \hat y_i^{(M-1)}) + l'(y_i, \hat y_i^{(M-1)})\,w_j + \frac{1}{2}l''(y_i, \hat y_i^{(M-1)})\,w_j^2$$

The first term is a constant and can be ignored. Let $g_i = l'(y_i, \hat y_i^{(M-1)})$ and $h_i = l''(y_i, \hat y_i^{(M-1)})$, where the derivatives are taken with respect to $\hat y_i^{(M-1)}$.
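As a concrete sketch, for the squared-error loss $l(y, \hat y) = \frac{1}{2}(y - \hat y)^2$ the two derivatives with respect to $\hat y$ are simply $g_i = \hat y_i - y_i$ and $h_i = 1$; a classification example is worked out numerically a bit later. A small helper of my own might look like this:

```python
import numpy as np

def squared_loss_grad_hess(y_true, y_pred):
    """g_i and h_i for l(y, yhat) = 0.5 * (y - yhat)^2, differentiated w.r.t. yhat."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    g = y_pred - y_true          # first derivative
    h = np.ones_like(y_true)     # second derivative is constant 1
    return g, h
```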

The overall objective function becomes

$$\text{obj} = \sum_{j=1}^{T}\sum_{i\in I_j}\left[g_i w_j + \frac{1}{2}h_i w_j^2\right] + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2$$

Since $w_j$ does not depend on $i$, this can be rearranged as

$$\text{obj} = \sum_{j=1}^{T}\left[w_j\sum_{i\in I_j}g_i + \frac{1}{2}w_j^2\sum_{i\in I_j}h_i\right] + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T} w_j^2$$

Merging the $w_j^2$ terms gives

$$\text{obj} = \sum_{j=1}^{T}\left[w_j\sum_{i\in I_j}g_i + \frac{1}{2}w_j^2\left(\lambda + \sum_{i\in I_j}h_i\right)\right] + \gamma T$$

Let $G_j = \sum_{i\in I_j}g_i$ and $H_j = \sum_{i\in I_j}h_i$; the expression above simplifies to

$$\text{obj} = \sum_{j=1}^{T}\left[w_j G_j + \frac{1}{2}w_j^2(\lambda + H_j)\right] + \gamma T$$

For each $j$ this is a simple quadratic in $w_j$, so the optimal solution is

$$w_j^* = -\frac{G_j}{\lambda + H_j}$$

and the corresponding optimal objective function value is

$$\text{obj}^* = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{\lambda + H_j} + \gamma T$$
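A minimal sketch of these two formulas, assuming the per-leaf sums $G_j$ and $H_j$ have already been accumulated:

```python
import numpy as np

def optimal_leaf_weights(G, H, lam=1.0):
    """w_j* = -G_j / (lambda + H_j) for each leaf j."""
    G = np.asarray(G, dtype=float)
    H = np.asarray(H, dtype=float)
    return -G / (lam + H)

def optimal_objective(G, H, lam=1.0, gamma=1.0):
    """obj* = -0.5 * sum_j G_j^2 / (lambda + H_j) + gamma * T."""
    G = np.asarray(G, dtype=float)
    H = np.asarray(H, dtype=float)
    T = G.size
    return -0.5 * np.sum(G ** 2 / (lam + H)) + gamma * T
```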

It is also worth spelling out how $g_i$ and $h_i$ are actually computed. For example, suppose we are optimizing the 11th CART tree, which means the first 10 CART trees have already been determined, and that for a sample $(x_i, y_i = 1)$ these 10 trees give the prediction $\hat y_i^{(10)} = -1$. Assuming we are doing classification, the loss function is

$$L(\theta) = \sum_i y_i\ln(1+e^{-\hat y_i}) + (1-y_i)\ln(1+e^{\hat y_i})$$

Because $y_i = 1$, the loss function becomes

$$L(\theta) = \ln(1+e^{-\hat y_i})$$
Taking the derivative with respect to $\hat y_i$ gives

$$-\frac{e^{-\hat y_i}}{1+e^{-\hat y_i}}$$

Substituting $\hat y_i = -1$ into this gradient expression gives $g_{11} \approx -0.73$.

Differentiating once more gives the second-derivative expression; substituting the same value of $\hat y_i$ into it yields $h_{11}$.
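A quick numerical check of this worked example, under the stated assumption that $y_i = 1$ and the current prediction is $\hat y_i^{(10)} = -1$:

```python
import numpy as np

def logistic_grad_hess(y_hat):
    """g and h of l(yhat) = ln(1 + exp(-yhat)) for a positive sample (y = 1)."""
    g = -np.exp(-y_hat) / (1.0 + np.exp(-y_hat))          # first derivative
    h = np.exp(-y_hat) / (1.0 + np.exp(-y_hat)) ** 2      # second derivative
    return g, h

g11, h11 = logistic_grad_hess(-1.0)
print(round(g11, 2), round(h11, 2))   # approximately -0.73 and 0.20
```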

Let us also build some intuition for $w_j^*$. Suppose there is only one sample at the node; then

$$w_j^* = \left(\frac{1}{\lambda + h_j}\right)(-g_j)$$

Here $-g_j$ is the negative gradient direction, i.e. the direction in which the objective function decreases fastest, which matches our intuition. The factor $\frac{1}{\lambda + h_j}$ can be seen as a learning rate: a large $h_j$ means the gradient changes quickly, so a small perturbation causes a large change in the objective, and the step size should therefore be reduced, which also matches our intuition.

In summary, once the tree structure is determined, the optimal value $w_j^*$ of each node can be obtained by the method above, minimizing the objective function. All that remains is to find the optimal tree structure.

optimal tree structure

In order to determine the optimal tree structure, this section introduces a commonly used method: the greedy algorithm.

As shown in the figure below, suppose the current node contains three samples A, B, and C. The optimal objective function value at this node, without splitting, is

$$\text{obj}_0 = \gamma - \frac{1}{2}\frac{(G_A+G_B+G_C)^2}{H_A+H_B+H_C+\lambda}$$

Now try to split the node. Suppose there are three candidate splits, [A, BC], [C, AB], and [B, AC]; the corresponding optimal objective function values are

$$\text{obj}_1 = 2\gamma - \frac{1}{2}\frac{G_A^2}{H_A+\lambda} - \frac{1}{2}\frac{(G_B+G_C)^2}{H_B+H_C+\lambda}$$

$$\text{obj}_2 = 2\gamma - \frac{1}{2}\frac{(G_A+G_B)^2}{H_A+H_B+\lambda} - \frac{1}{2}\frac{G_C^2}{H_C+\lambda}$$

$$\text{obj}_3 = 2\gamma - \frac{1}{2}\frac{(G_A+G_C)^2}{H_A+H_C+\lambda} - \frac{1}{2}\frac{G_B^2}{H_B+\lambda}$$

Compute how much the objective drops in each of the three cases, namely $\text{obj}_0-\text{obj}_1$, $\text{obj}_0-\text{obj}_2$, and $\text{obj}_0-\text{obj}_3$ (this reduction is usually called the split gain). Then take the split corresponding to the largest gain as the best way to split this node.
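A small sketch of this comparison, assuming the per-sample $g_i$ and $h_i$ values of A, B, and C are already known (the numbers below are made up purely for illustration):

```python
def leaf_score(G, H, lam=1.0):
    """Contribution of one leaf to obj*: -0.5 * G^2 / (H + lambda)."""
    return -0.5 * G ** 2 / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=1.0):
    """Reduction in the objective when one leaf is split into two (obj_0 - obj_split)."""
    unsplit = gamma + leaf_score(G_left + G_right, H_left + H_right, lam)
    split = 2 * gamma + leaf_score(G_left, H_left, lam) + leaf_score(G_right, H_right, lam)
    return unsplit - split

# Made-up g/h values for samples A, B, C
g = {"A": -0.7, "B": 0.4, "C": -0.2}
h = {"A": 0.20, "B": 0.24, "C": 0.16}

candidates = {"[A | BC]": ({"A"}, {"B", "C"}),
              "[C | AB]": ({"C"}, {"A", "B"}),
              "[B | AC]": ({"B"}, {"A", "C"})}
for name, (left, right) in candidates.items():
    gain = split_gain(sum(g[s] for s in left), sum(h[s] for s in left),
                      sum(g[s] for s in right), sum(h[s] for s in right))
    print(name, round(gain, 3))
```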

Therefore, the process of determining the optimal tree structure is as follows, starting from a tree of depth 0 (a sketch of the per-feature split search follows this list):

(1) Enumerate all available features for each leaf node.
(2) For each feature, sort the training samples belonging to the node by that feature's values in ascending order, determine the best split point of the feature through the greedy logic above, and record the split gain of that feature.
(3) Select the feature with the largest gain as the split feature and its best split point as the split position, split the node into two new left and right leaf nodes, and associate the corresponding sample set with each new node.
(4) Go back to step (1) and repeat until a stopping condition is met.
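Below is a minimal sketch of the split-point search in step (2) for a single numeric feature. It is only illustrative; it scans the sorted feature values and evaluates every threshold with the split_gain helper sketched above.

```python
import numpy as np

def best_split_for_feature(feature_values, g, h, lam=1.0, gamma=1.0):
    """Scan all thresholds of one feature and return (best_gain, best_threshold)."""
    f = np.asarray(feature_values, dtype=float)
    g = np.asarray(g, dtype=float)
    h = np.asarray(h, dtype=float)
    order = np.argsort(f)
    f, g, h = f[order], g[order], h[order]
    G_total, H_total = g.sum(), h.sum()

    best_gain, best_threshold = -np.inf, None
    G_left = H_left = 0.0
    for k in range(len(f) - 1):                   # candidate split between positions k and k + 1
        G_left += g[k]
        H_left += h[k]
        if f[k] == f[k + 1]:                      # identical values cannot be separated
            continue
        gain = split_gain(G_left, H_left, G_total - G_left, H_total - H_left, lam, gamma)
        if gain > best_gain:
            best_gain, best_threshold = gain, (f[k] + f[k + 1]) / 2
    return best_gain, best_threshold
```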

At this point, I finally figured out the algorithm principle of xgboost.

Origin blog.csdn.net/taozibaby/article/details/131372688