[Machine Learning] Detailed Interpretation of XGBoost (Ensemble Learning_Boosting_GBM)

1. Introduction

XGBoost (eXtreme Gradient Boosting, XGB) is a Boosting algorithm. It adds a large amount of theoretical and engineering optimization on top of the GBM algorithm, which greatly improves both accuracy and running efficiency. XGBoost usually uses CART as the base learner and works very well for classification and regression problems. It shines in data-science competitions and is widely used in industry, mainly because of its strong results, ease of use, and speed.



2. Fundamentals

XGBoost is an implementation of the GBM approach within the boosting family. Its main goal is to reduce bias, i.e., the error of the model. It therefore combines many base learners, each of which is kept relatively simple (to avoid overfitting); each new learner learns from the gap between the previous learners' prediction $\hat{y}_i^{t-1}$ and the actual value $y_i$, so that over many learners the difference between the model output and the true value keeps shrinking.
The base learner of XGBoost can be a CART tree or a linear model; usually CART is used, and we take CART as the example here. When the base learner is CART, the corresponding $f(x)$ can be written as follows (the step subscript is omitted here):
$$f(x) = w_{q(x)}, \qquad \mathbf{w} \in \mathbb{R}^{T}, \qquad q: \mathbb{R}^{d} \rightarrow \{1, 2, \dots, T\}$$
This formula is a bit convoluted, so let's unpack it:

  • Suppose the decision tree has $T$ leaves, and $\mathbf{w} = \{w_1, w_2, \dots, w_T\}$ is the vector formed by the individual leaf weights.
  • The index $q(x) \in \{1, 2, \dots, T\}$ is the number of the leaf that $x$ falls into; the function $q$ represents the decision-making process of the tree.
  • So $w_{q(x)}$ is the weight of the leaf node to which the input $x$ is assigned by the function $q$.
  • For example, as shown in the figure below, given an input $x_i$, the function $q$ decides that it belongs to leaf No. 3, so $f(x)$ returns the weight of node 3, $w_3$, as the output (a minimal code sketch follows this list).
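To make the notation concrete, here is a minimal sketch of a CART base learner viewed as the function $f(x) = w_{q(x)}$; the feature names, thresholds, and weights are made up purely for illustration:

```python
# A toy CART with T = 3 leaves, written as f(x) = w[q(x)].
# Feature names, thresholds, and weights are invented for illustration only.
w = [0.8, -0.2, 0.5]              # leaf weights w_1, w_2, w_3 (0-indexed below)

def q(x):
    """The decision process of the tree: map an input to a leaf index."""
    if x["age"] < 30:
        return 0                  # leaf No. 1
    elif x["income"] < 5000:
        return 1                  # leaf No. 2
    else:
        return 2                  # leaf No. 3

def f(x):
    """The base learner: return the weight of the leaf that x is routed to."""
    return w[q(x)]

print(f({"age": 45, "income": 9000}))   # routed to leaf No. 3 -> 0.5
```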

In summary, the basic idea of XGB is:

  • New trees are generated one after another, and each new tree is learned from the difference between the previous model's output and the target value, thereby reducing the bias of the model.
  • The final model output is $\hat{y}_i = \sum_{k=1}^{t} f_k(x_i)$, i.e., the sum of the outputs of all trees is the model's prediction for a sample (a minimal sketch of this additive prediction follows).
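A minimal sketch of the additive prediction, using two toy stand-in functions in place of fitted CART trees:

```python
# Two toy base learners standing in for fitted CART trees (made up for illustration).
f1 = lambda x: 0.5 if x > 3 else -0.5
f2 = lambda x: 0.2 if x > 1 else -0.1

def predict(x, trees):
    """y_hat = sum_k f_k(x): the model output is the sum of all tree outputs."""
    return sum(f_k(x) for f_k in trees)

print(predict(4.0, trees=[f1, f2]))   # 0.5 + 0.2 = 0.7
```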

How to select/generate a better tree at each step? That is determined by the following objective function.


3. Objective function (second-order Taylor expansion solution)

3.1 Basic objective function

The objective function at step $t$ is
$$Obj^{t} = \sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{t}\right) + \sum_{k=1}^{t} \Omega(f_k)$$
The objective function consists of two parts:

  • One is the model error, which is the difference between the true value of the sample and the predicted value,
  • The second is the structural error of the model, that is, the regular term, which is used to limit the complexity of the model.
    • For the CART decision tree, the regular term is:
      $$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}$$
      where $T$ is the number of leaves, $w_j$ are the leaf weights, and $\gamma$, $\lambda$ are hyperparameters that control the complexity penalty.
      Because $\hat{y}_i^{t} = \hat{y}_i^{t-1} + f_t(x_i)$, substituting this into the formula above turns it into:
      $$Obj^{t} = \sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{t-1} + f_t(x_i)\right) + \Omega(f_t) + \sum_{k=1}^{t-1} \Omega(f_k)$$

therefore,

  • The objective at step t consists of three parts: the sum of the losses of the n samples, the structural error of the t-th tree, and the structural error of the previous t-1 trees.
  • Among them, the structural error of the first t-1 trees is a constant and can be dropped, because the structures of the first t-1 trees are already known.

3.2 Second-order Taylor simplification

For a refresher on the Taylor formula, see: https://blog.csdn.net/qq_51392112/article/details/130645876 .
The formula above is the objective function at step $t$ of XGBoost. The only unknown in it is $f_t$. The loss term is still a fairly complex expression, so to simplify it, a second-order Taylor expansion is used as an approximation: $L(y_i, \hat{y}_i^{t-1} + f_t(x_i))$ is expanded around $\hat{y}_i^{t-1}$, treating $f_t(x_i)$ as the increment. The loss function is then rewritten using first and second derivatives:
$$Obj^{t} \approx \sum_{i=1}^{n}\left[ L(y_i, \hat{y}_i^{t-1}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)$$

where $g_i = \partial_{\hat{y}^{t-1}} L(y_i, \hat{y}_i^{t-1})$ and $h_i = \partial^{2}_{\hat{y}^{t-1}} L(y_i, \hat{y}_i^{t-1})$ are the first and second derivatives of the loss with respect to the previous prediction. The term $L(y_i, \hat{y}_i^{t-1})$ is a constant at step $t$ and can be dropped. Grouping the samples by the leaf they fall into, with $I_j = \{ i \mid q(x_i) = j \}$, $G_j = \sum_{i \in I_j} g_i$, $H_j = \sum_{i \in I_j} h_i$, and substituting $\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j} w_j^{2}$, the objective becomes

$$Obj^{t} = \sum_{j=1}^{T}\left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^{2} \right] + \gamma T$$

Setting the derivative with respect to each $w_j$ to zero gives the optimal leaf weight and the minimal objective:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T$$

Finally, the reduction of the objective obtained by splitting one leaf into a left part $L$ and a right part $R$ (used for node splitting in the next section) is

$$L_{split} = \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma$$
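These closed-form results translate directly into code. Below is a minimal sketch (my own helper functions, not XGBoost's internals) of the optimal leaf weight, a leaf's contribution to $Obj^{*}$, and the split gain $L_{split}$:

```python
import numpy as np

def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def leaf_obj(G, H, lam):
    """A single leaf's contribution to Obj*: -G^2 / (2 * (H + lambda))."""
    return -0.5 * G**2 / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """L_split: how much the objective drops when a leaf is split into (L, R)."""
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

# g, h of 5 samples falling into one leaf (made-up values for illustration)
g = np.array([-1.2, 0.4, -0.6, 0.9, -0.3])
h = np.ones(5)
print(leaf_weight(g.sum(), h.sum(), lam=1.0))
print(split_gain(g[:2].sum(), h[:2].sum(), g[2:].sum(), h[2:].sum(),
                 lam=1.0, gamma=0.0))
```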

3.3 Shrinkage and Column Sampling

  • The shrinkage here is the same as in GBDT: it reduces the influence of each individual tree and leaves room for the trees that follow. A learning rate $\eta$ is therefore introduced and used to scale the $f_t(x)$ produced at each step.

  • Column sampling is used in both GBDT and random forests to reduce variance at the expense of a little bias: only a subset of the features is considered when splitting nodes (a parameter sketch follows below).
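For reference, this is roughly how the two knobs appear in the xgboost Python package's scikit-learn interface (assuming the package is installed; parameter names may vary slightly between versions, so treat this as a sketch rather than canonical usage):

```python
import numpy as np
import xgboost as xgb  # assumes the xgboost package is installed

X = np.random.rand(200, 10)
y = np.random.rand(200)

model = xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,      # shrinkage: each f_t(x) is scaled by eta = 0.1
    colsample_bytree=0.8,   # column sampling: each tree sees 80% of the features
    colsample_bylevel=0.8,  # optionally resample columns at every depth level
)
model.fit(X, y)
```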


4. Node splitting

The goal of node splitting is to find the feature, and the split value of that feature, that reduces the objective function the most after splitting. This involves two nested loops: the outer loop traverses the features and the inner loop traverses the feature values. There is not much to optimize in the feature traversal, which can only check features one by one, but the feature-value traversal can be optimized to shorten the time needed to find the optimal split point. We look at node splitting in four steps:

  • Exact greedy algorithm: try every position to locate the optimal split point exactly
  • Approximate algorithm: search over quantiles to find an approximately optimal split point
  • Weighted quantile sketch: use quantiles weighted by $h$ to find a better split point than the plain quantiles
  • Sparsity-aware splitting: a small optimization for data with sparsity or missing values

4.1 Exact Greedy Algorithm

First, let's understand the most basic method: the exact greedy algorithm.

The exact greedy algorithm is a brute-force search: traverse each feature, traverse each value of that feature, compute the gain of splitting there, and select the feature value with the largest gain as the split point.

  • First, each feature needs to be sorted. After sorting, the sum of first-order derivatives $G$ and the sum of second-order derivatives $H$ on the left and right of a candidate split are easy to compute; recall that $G = \sum_i g_i$ and $H = \sum_i h_i$, summed over the samples on the corresponding side.
    Note that $g_i$ and $h_i$ are computed from the results of round $t-1$; we are currently solving for the tree structure of round $t$, so $g_i$ and $h_i$ are already known quantities.

The exact greedy algorithm works as shown in the figure below. The red line is the split position: a split is tried at every position and $L_{split}$ is computed each time. The $g, h$ values to the left of the line are summed into $G_L, H_L$, and the $g, h$ values to the right are summed into $G_R, H_R$.
After one full pass, the split point that reduces the objective function the most is selected, i.e., the node is split at the point with the largest $L_{split}$ (a minimal sketch of this scan follows).
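A minimal sketch of this scan for a single feature, assuming the $g_i$, $h_i$ from round $t-1$ are already computed (the real implementation does this for all features, typically in parallel):

```python
import numpy as np

def best_split_one_feature(x, g, h, lam=1.0, gamma=0.0):
    """Scan every split position of one feature, return the best (gain, threshold)."""
    order = np.argsort(x)                     # sort samples by this feature's value
    g_sorted, h_sorted = g[order], h[order]
    G, H = g.sum(), h.sum()                   # totals for the parent node
    GL = HL = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):               # red line between positions i and i+1
        GL += g_sorted[i]; HL += h_sorted[i]
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_thr = (x[order[i]] + x[order[i + 1]]) / 2
    return best_gain, best_thr

x = np.array([3.0, 1.0, 4.0, 2.0, 5.0])
g = np.array([-1.0, 0.5, -0.8, 0.2, 1.0])
h = np.ones(5)
print(best_split_one_feature(x, g, h))
```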

4.2 Approximation Algorithms

The problem with the exact greedy algorithm is obvious:

  • There are too many candidate split points, so it takes too long.

Hence the approximate algorithm: it is not necessary to locate the optimal split point so precisely.

  • The base learner we train is a weak learner, so it does not need to learn very precisely.
  • Boosting will keep improving its accuracy anyway.

Based on this idea, we can narrow the set of candidate split points: instead of enumerating every value, only quantile points are considered. For example, the figure below uses the two tertile points; quartiles or even more points can also be used. In essence, each feature is divided into buckets, the data granularity becomes coarser, only the buckets are processed, and the amount of computation drops.
There is a timing problem here, namely,

  • Before a tree is split, it is divided according to the quantile (global split point),
  • Or is the division done at each node split (local split point)?

Both work, but the latter needs fewer candidate points, because it re-divides the features already used by the parent node more finely within each child node.

  • For example, suppose feature No. 1 has already been used at the parent node. With local splitting, when a child node uses this feature again, the candidate split points are re-selected within the subset of samples assigned to that child; with global splitting, the candidate split points are fixed at the start and are not re-selected within the child's sample subset.
  • The author tested the global and local split-point selection methods experimentally. In the figure, eps is the parameter mentioned later, and 1/eps roughly corresponds to the number of quantiles mentioned above.
    • It can be seen that the global mode reaches the same accuracy as the exact greedy algorithm when the division is fine enough, while the local mode reaches the same accuracy even with a coarser division as training progresses.


To sum up, the approximate algorithm is essentially bucketing. The goal is to speed up computation by reducing the number of traversals, so the features are bucketed: the values of each feature are divided into buckets by quantile, and the bucket boundaries are used as the candidate set of split points. Each pass no longer traverses all feature values but only the feature's few buckets (each bucket boundary can be understood as a quantile of the feature values), which reduces the number of feature values to traverse.

  • The bucketing algorithm is divided into global mode and local mode.
    • In the global mode, after the bucket is divided for the first time, the bucket is no longer updated, and the divided bucket is always used for subsequent splits. This reduces the computational complexity, but after multiple divisions, some buckets may be empty, that is, there is no data in the bucket.
    • The local mode re-divides the buckets before each node split. The advantage is that every bucketing keeps the number of samples per bucket roughly even; the disadvantage is the larger amount of computation. (A minimal bucketing sketch follows this list.)
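A minimal sketch of the bucketing idea: take a handful of quantiles of a feature as candidate split points instead of every distinct value. In global mode this would be computed once per feature; in local mode it would be recomputed on each node's subset of samples:

```python
import numpy as np

def candidate_splits(x, n_buckets=4):
    """Unweighted quantile candidates: the interior bucket boundaries of a feature."""
    qs = np.linspace(0, 1, n_buckets + 1)[1:-1]   # e.g. 0.25, 0.5, 0.75
    return np.quantile(x, qs)

x = np.random.randn(10_000)
print(candidate_splits(x, n_buckets=4))   # 3 candidates instead of ~10000 values
```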

4.3 Weighted Quantile Sketch

There is still room for optimization in the previous method of using quantiles to select points.

  • The quantile point selection method does not pay attention to samples with large errors. It treats all sorted samples equally, which is not very good.

If samples with larger errors could be treated more finely, the result would be better. Hence the idea of the weighted quantile sketch: use a weight to represent the size of the error.

  • The XGBoost paper uses the second derivative $h$ as the weight: a candidate split point is placed whenever the accumulated $h$ reaches a certain value.
  • For example, in the figure below a cut is made each time the accumulation reaches 0.3. XGBoost also handles the details carefully here, such as how to use eps as the threshold on the cumulative sum of $h$ when choosing the weighted quantile points; see the original paper for details.

Why choose $h$ as the weight? This can be viewed from two angles.

  • The first angle is the form of the objective function. By completing the square, the original objective can be identically rewritten as
    $$Obj^{t} = \sum_{i=1}^{n} \frac{1}{2} h_i \left( f_t(x_i) - \left(-\frac{g_i}{h_i}\right) \right)^{2} + \Omega(f_t) + \text{const}$$
    The sum is a weighted squared error with target $-\frac{g_i}{h_i}$ and weight $h_i$: when $h_i$ is large, that sample's squared error takes a larger share of the objective, so $h$ acts as the weight.
  • The second angle is the expression for $h$ itself. When the loss is the squared error used for regression, $L = (y_i - \hat{y}_i)^2$, then $h = 2$ is a constant and the method degenerates into plain quantile selection. But when the loss is the cross entropy used for classification, $L = -y_i \ln p_i - (1 - y_i)\ln(1 - p_i)$ with $p_i$ the predicted probability, then $h = p(1 - p)$, which is largest when $p$ is around 0.5. A predicted probability near 0.5 means the sample keeps swinging between the two classes during training, so it deserves extra attention; the size of $h$ reflects exactly this degree of swing, which is why $h$ is chosen as the weight. (A minimal sketch of the weighted candidate selection follows.)
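A minimal sketch of the $h$-weighted idea: walk through the feature values in sorted order and place a candidate split each time the accumulated $h$ passes another eps fraction of the total. This illustrates only the intuition; the paper's actual weighted quantile sketch is a mergeable, prunable summary data structure:

```python
import numpy as np

def weighted_candidates(x, h, eps=0.3):
    """Place a candidate split each time the cumulative h grows by eps * sum(h)."""
    order = np.argsort(x)
    x_sorted, cum_h = x[order], np.cumsum(h[order])
    step = eps * cum_h[-1]
    candidates, next_mark = [], step
    for xi, ci in zip(x_sorted, cum_h):
        if ci >= next_mark:
            candidates.append(xi)
            next_mark += step
    return candidates

# Samples predicted near p = 0.5 get h = p * (1 - p) close to its maximum,
# so candidate points are packed more densely where the model is uncertain.
x = np.random.randn(1000)
p = 1.0 / (1.0 + np.exp(-x))   # pretend these are predicted probabilities
h = p * (1 - p)
print(weighted_candidates(x, h, eps=0.3))
```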

4.4 Sparsity-aware Splitting

Machine learning often encounters the problem of data sparsity. There are three situations that lead to data sparsity:

  • The data has a large number of missing values
  • There are a large number of 0 elements in the data
  • One-hot encoding is used in feature engineering, resulting in a large number of 0

So it is important for the algorithm to be aware of the sparsity pattern of the data. XGBoost's solution is to set a default direction at each node: missing values or zero elements go directly in the default direction and do not participate in learning the tree structure. For example, in the figure below, they simply follow the default direction marked by the red dotted line.
How is the default direction chosen? Try both sides and keep whichever gives the larger $L_{split}$. Concretely, when splitting a node, the samples with missing values or zero elements are first sent to the left child and $L_{split}$ is computed, then sent to the right child and $L_{split}$ is computed again; the side with the larger value becomes the default direction.

For a node being split, the sparse samples of the selected feature are effectively treated as one group, so only the non-sparse values need to be considered when searching for the optimal split point. This reduces the amount of computation and shortens the running time. (A minimal sketch of choosing the default direction follows.)
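A minimal sketch of choosing the default direction at one split, assuming the threshold over the non-missing values is already fixed: send all missing samples left, then right, and keep whichever side gives the larger gain (the helper names are my own, not XGBoost's):

```python
import numpy as np

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """L_split for a candidate partition (L, R)."""
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

def default_direction(x, g, h, threshold):
    """Pick the default branch for missing values of one feature at one split."""
    miss = np.isnan(x)
    x_filled = np.where(miss, np.inf, x)          # placeholder so NaN skips comparisons
    left  = ~miss & (x_filled <  threshold)
    right = ~miss & (x_filled >= threshold)
    GL, HL = g[left].sum(),  h[left].sum()
    GR, HR = g[right].sum(), h[right].sum()
    Gm, Hm = g[miss].sum(),  h[miss].sum()
    gain_left  = split_gain(GL + Gm, HL + Hm, GR, HR)   # missing samples go left
    gain_right = split_gain(GL, HL, GR + Gm, HR + Hm)   # missing samples go right
    return "left" if gain_left >= gain_right else "right"

x = np.array([1.0, np.nan, 3.0, np.nan, 5.0, 2.0])
g = np.array([-1.0, 0.8, -0.5, 0.9, 1.2, -0.3])
h = np.ones(6)
print(default_direction(x, g, h, threshold=2.5))
```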
The author did an experiment with sparse data, using the sparse perception algorithm, the speed can be increased by about 50 times.


5. Efficiency optimization

5.1 Block Parallel

In the decision-tree learning process, the most time-consuming part is repeatedly sorting the data at each split. To speed this up, XGBoost trades space for time: for each feature column it creates an equal number of pointers, each bound to a feature value and pointing to the corresponding sample, sorts them by feature value, and saves the sorted pointers of every feature in a block structure. This achieves the goal of sorting the features once before training, which improves speed, at the cost of extra space for storing the pointers. In summary, the block structure has the following characteristics (a small sketch follows the list):

  • Store data in a sparse format such as CSC (Compressed Sparse Column Format)
  • Stored features are sorted
  • A pointer back to the sample is stored, because information such as the first and second derivatives needs to be looked up
  • Since CSC is stored column-wise, it is possible to use multi-threaded parallel computation on features at each split
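A minimal sketch of the pre-sorting idea behind the block (plain NumPy rather than a real CSC block): for each feature column, the sorted order of sample indices is computed once before training, and every later split scan reads $g$, $h$ through these stored pointers instead of re-sorting:

```python
import numpy as np

X = np.random.rand(1000, 20)            # 1000 samples, 20 features

# Built once before training: for every column, the sample indices in feature order.
# These play the role of the pre-sorted pointers stored in the block structure.
sorted_index = np.argsort(X, axis=0)    # shape (1000, 20)

def scan_feature(j, g, h):
    """At split time, walk feature j in sorted order and read g, h via the pointers."""
    idx = sorted_index[:, j]
    return np.cumsum(g[idx]), np.cumsum(h[idx])   # G_L, H_L at every split position

g, h = np.random.randn(1000), np.ones(1000)
GL, HL = scan_feature(3, g, h)
print(GL[:5], HL[:5])
```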


5.2 Cache-aware access

The proposed block structure can optimize the time complexity of node splitting, but there is a non-sequential memory access problem when indexing samples. This is because we are accessing the samples in feature size order, but the samples are not in this order. Non-sequential access to memory will reduce the CPU's cache hit ratio. XGBoost designs two different optimization strategies for exact greedy algorithm and approximate algorithm.

For the exact greedy algorithm, a cache-aware prefetching scheme is used: a buffer is allocated in memory, and the data about to be accessed is loaded into this buffer before the CPU needs it, turning non-contiguous data into contiguous data. This raises the CPU cache hit rate and saves some of the data-reading time.

For the approximate algorithms, including the weighted quantile and sparsity-aware cases, the problem is addressed by reducing the block size. One reason for the low cache hit rate is that the blocks are too large to fit in the cache; if the block is shrunk until the CPU cache can hold it entirely, misses mostly disappear, but blocks that are too small hurt parallel efficiency. After trying various block sizes, the author settled on $2^{16}$, i.e., each block stores $2^{16}$ samples. The experimental results are as follows:

5.3 Out-of-core Computation of Blocks

When the data cannot be read into memory all at once, out-of-core computation is needed: a separate thread reads the required data from disk to work around the memory limit. However, reading from disk is very slow compared with the processor, so XGBoost uses two techniques to balance the two speeds:

  • Block compression. For the row index, the block's starting index is subtracted from the current index; since the block size is $2^{16}$ as mentioned earlier, the resulting offset fits in a 16-bit integer, and this offset replaces the original index, which takes more space. (A small offset sketch follows this list.)
  • Block sharding. The data is stored across multiple disks, so that when it is needed, several disks read in parallel and the overall read throughput improves.
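A minimal sketch of the offset idea (illustrative numbers only): because a block holds at most $2^{16}$ rows, storing each row index as its offset from the block's first index lets it fit in a 16-bit integer:

```python
import numpy as np

block_start = 7_000_000                             # first global row index of the block
rows = np.arange(block_start, block_start + 2**16)  # global indices need a wide integer

offsets = (rows - block_start).astype(np.uint16)    # values 0 .. 2**16 - 1 fit in 16 bits
restored = block_start + offsets.astype(np.int64)   # the original indices are recoverable

assert (restored == rows).all()
print(rows.nbytes, offsets.nbytes)                  # the offsets take 1/4 of the space
```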

6. Supplement

6.1 Missing value processing

For a sample whose value is missing in some feature dimension, XGBoost tries placing it in the left subtree and computes the gain, then places it in the right subtree and computes the gain, and compares the two to decide which subtree such samples should go to by default.

6.2 Prevent overfitting

XGB proposes two methods to prevent overfitting:

  • The first is Shrinkage, i.e., the learning rate: each time a new tree is added, the weight of each of its leaf nodes is multiplied by a shrinkage factor, so that no single tree has too much influence and the following trees are left more room for optimization.
  • The other is Column Subsampling, which, like the feature selection in random forests, builds trees on a subset of the features. It comes in two forms:
    • Method 1: sample by level. Before splitting the nodes of the same depth level, randomly select a subset of features to traverse when computing the information gain;
    • Method 2: sample per tree. Before building a tree, randomly select a subset of features, and all node splits of that tree traverse only these features when computing the information gain.

reference

[1] https://blog.csdn.net/qq_18293213/article/details/123965029
[2] https://zhuanlan.zhihu.com/p/360060567
[3] XGBoost: A Scalable Tree Boosting System. https://dl.acm.org/doi/10.1145/2939672.2939785
