Supervised learning (4): XGBoost algorithm

  XGBoost (eXtreme Gradient Boosting) is one of the most powerful ensemble learning methods. In Kaggle data mining competitions, many of the winning solutions use XGBoost, and it performs very well on most regression and classification problems. This article introduces the principle of the XGBoost algorithm in detail.

1. The method of constructing the optimal model

  The general approach to constructing the optimal model is to minimize the loss function on the training data, denoted by the letter L, as follows:

$$\min_f \frac{1}{N}\sum_{i=1}^{N} L\!\left(y_i, f(x_i)\right) \tag{1}$$
  Formula (1) is called empirical risk minimization. The model obtained this way can be highly complex, and when the training data is small, it is prone to overfitting.

  Therefore, in order to reduce the complexity of the model, the following formula is often used:

$$\min_f \frac{1}{N}\sum_{i=1}^{N} L\!\left(y_i, f(x_i)\right) + \lambda J(f) \tag{2}$$
  where J(f) is the complexity of the model and λ ≥ 0 balances the two terms. Formula (2) is called structural risk minimization. Models that minimize structural risk tend to predict well on both the training data and unseen test data.
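  As a minimal sketch (all names here are illustrative, not from any library), the two objectives differ only by the complexity penalty λ·J(f):

```python
import numpy as np

def empirical_risk(y_true, y_pred, loss):
    """Formula (1): average per-sample loss on the training data."""
    return np.mean([loss(yt, yp) for yt, yp in zip(y_true, y_pred)])

def structural_risk(y_true, y_pred, loss, complexity, lam=0.1):
    """Formula (2): empirical risk plus lam * J(f), a model-complexity penalty."""
    return empirical_risk(y_true, y_pred, loss) + lam * complexity

# toy example with squared-error loss
squared_error = lambda yt, yp: (yt - yp) ** 2
y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 1.9, 3.2]
print(empirical_risk(y_true, y_pred, squared_error))        # 0.02
print(structural_risk(y_true, y_pred, squared_error, 4.0))  # 0.02 + 0.1 * 4
```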

  Application: decision tree generation and pruning correspond to empirical risk minimization and structural risk minimization, respectively. XGBoost's decision tree generation is the result of structural risk minimization, as described in detail later.

2. The regression idea behind Boosting

  The Boosting method combines multiple weak learners to produce the final result. Whether the task is classification or regression, we use the idea of a regression task to construct the optimal Boosting model.

  Regression idea: treat the output of each weak learner as a continuous value. This makes it possible to accumulate the results of the weak learners and to optimize the model conveniently with a loss function.

  Suppose f_t(x_i) is the output of the t-th weak learner, ŷ_i^(t) is the output of the model after t rounds, and y_i is the true output. The expressions are as follows:

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) \tag{2.1}$$

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{2.2}$$
  The above two formulas are additive models; both treat the output of each weak learner as a continuous value.
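  A small illustrative sketch of the additive model in (2.1) and (2.2): the prediction after t rounds is just the running sum of the weak learners' continuous outputs (the weak learners below are toy stand-ins).

```python
def boosted_prediction(weak_learners, x):
    """Additive model: y_hat^(t) = sum_k f_k(x) = y_hat^(t-1) + f_t(x)."""
    y_hat = 0.0
    for f in weak_learners:   # each weak learner maps an input to a continuous value
        y_hat += f(x)         # accumulate round by round
    return y_hat

# toy weak learners (e.g. tiny regression stumps)
learners = [lambda x: 0.5,
            lambda x: 0.3 if x > 1.0 else -0.2,
            lambda x: 0.1 * x]
print(boosted_prediction(learners, 2.0))   # 0.5 + 0.3 + 0.2 = 1.0
```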

  The regression idea applied to classification tasks

  According to the result of formula 2.1, the final classifier is obtained:

$$\hat{y}_i = \operatorname{sign}\!\left(\hat{y}_i^{(t)}\right) = \operatorname{sign}\!\left(\sum_{k=1}^{t} f_k(x_i)\right) \tag{2.3}$$
  For classification, the loss function is generally chosen to be the exponential loss or the logarithmic loss. Assuming the logarithmic loss, the loss of the learner is:

$$L\!\left(y_i, \hat{y}_i^{(t)}\right) = \log\!\left(1 + e^{-y_i \hat{y}_i^{(t)}}\right) \tag{2.4}$$
  If the true label y_i = 1, then:

$$L\!\left(1, \hat{y}_i^{(t)}\right) = \log\!\left(1 + e^{-\hat{y}_i^{(t)}}\right) \tag{2.5}$$
  Taking the gradient of (2.5) with respect to ŷ_i^(t), we get:

$$\frac{\partial L}{\partial \hat{y}_i^{(t)}} = \frac{-e^{-\hat{y}_i^{(t)}}}{1 + e^{-\hat{y}_i^{(t)}}} \tag{2.6}$$
  The negative gradient direction is the direction in which the loss function decreases fastest. The negative of (2.6) is greater than 0, so the weak learners iterate in the direction of increasing ŷ_i^(t). Graphically:

[Figure: the log loss versus the model output ŷ_i^(t), with red arrows showing the iteration direction for y_i = 1 and y_i = -1]

  As shown in the figure above, when a sample's true label y_i is 1, the model output ŷ_i^(t) increases as the number of iterations grows (red arrow) and the model's loss decreases accordingly; when the true label y_i is -1, the model output ŷ_i^(t) decreases as the number of iterations grows (red arrow) and the loss likewise decreases. This is the principle of the additive model: the loss function is reduced through repeated iterations.
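  A quick numeric check of (2.4)–(2.6), as an illustrative sketch in plain NumPy: for y_i = 1 the log loss keeps shrinking and its gradient stays negative as ŷ_i^(t) grows, which is why boosting pushes ŷ_i^(t) upward for positive samples.

```python
import numpy as np

def log_loss(y, y_hat):
    """Formula (2.4): log(1 + exp(-y * y_hat)) for labels y in {+1, -1}."""
    return np.log1p(np.exp(-y * y_hat))

def grad_log_loss(y, y_hat):
    """Gradient of the loss w.r.t. the model output y_hat (cf. (2.6) for y = 1)."""
    return -y * np.exp(-y * y_hat) / (1.0 + np.exp(-y * y_hat))

for y_hat in [-1.0, 0.0, 1.0, 2.0]:
    print(y_hat, log_loss(1, y_hat), grad_log_loss(1, y_hat))
# the loss decreases and the gradient stays negative, so moving y_hat in the
# negative-gradient direction means increasing y_hat when y = 1
```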

  Summary: the Boosting method treats the output of each weak learner as a continuous value, so the loss function is also continuous. The model can therefore be optimized by iterating the weak learners. This is the principle of the additive model in ensemble learning.

3. Derivation of the objective function of XGBoost

  The objective function, i.e. the loss function, is minimized to construct the optimal model. From Section 1, we know that a regularization term representing the model's complexity should be added to the loss function. Since the XGBoost model consists of multiple CART trees, the objective function is:

$$Obj = \sum_{i=1}^{n} l\!\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k) \tag{3.1}$$
  Equation (3.1) is the regularized loss function. The first term on the right-hand side is the model's training error, and the second term is the regularization term, which here is the sum of the regularization terms of the K trees.

  Introduction to the CART tree

[Figure: the structure of the K-th CART tree, which maps input samples to leaf nodes with leaf values w]
  The figure above shows the K-th CART tree. To determine a CART tree, two parts must be specified. The first is the structure of the tree, which maps an input sample to a particular leaf node; the tree is denoted f_k(x). The second is the value of each leaf node: q(x) gives the index of the leaf that x falls into, and w_q(x) is the value of that leaf. By definition:

$$f_k(x) = w_{q(x)}, \qquad w \in \mathbb{R}^T,\ \ q: \mathbb{R}^d \to \{1, 2, \dots, T\} \tag{3.2}$$
  Definition of tree complexity

  The XGBoost model contains multiple CART trees. The complexity of each tree is defined as:

$$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \tag{3.4}$$
  where T is the number of leaf nodes and ‖w‖ is the norm of the vector of leaf values. γ controls how hard it is to split a node, and λ is the L2 regularization coefficient.
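  A minimal illustrative sketch of formula (3.4) (toy code, not library internals): the complexity of one tree is γ times the number of leaves plus ½λ times the squared norm of the leaf-value vector.

```python
import numpy as np

def tree_complexity(leaf_values, gamma=1.0, lam=1.0):
    """Omega(f) = gamma * T + 0.5 * lam * ||w||^2  (formula 3.4)."""
    T = len(leaf_values)                       # number of leaf nodes
    w = np.asarray(leaf_values, dtype=float)   # vector of leaf values
    return gamma * T + 0.5 * lam * np.dot(w, w)

# a toy tree with three leaves
print(tree_complexity([2.0, 0.1, -1.0]))   # 3 * gamma + 0.5 * lam * 5.01
```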

  The complexity of the example tree below is:

[Figure: an example tree and its complexity computed with formula (3.4)]

  Objective function derivation

  According to formula (3.1), the objective function of the model at the t-th iteration is:

$$Obj^{(t)} = \sum_{i=1}^{n} l\!\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant} \tag{3.5}$$
  The second-order approximation from Taylor's formula is:

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$$
  Treating f_t(x_i) as Δx, the second-order approximate expansion of equation (3.5) is:

$$Obj^{(t)} \approx \sum_{i=1}^{n}\left[\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + \text{constant} \tag{3.8}$$
  where:

$$g_i = \partial_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}^{(t-1)}\right), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}^{(t-1)}\right)$$
  l(y_i, ŷ^(t-1)) is the prediction error of the model made up of the first t-1 trees, and g_i and h_i are the first and second derivatives of that prediction error with respect to the current model's output. The current model iterates in the direction that reduces the prediction error.
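  As an illustrative sketch assuming the squared-error loss (function names here are hypothetical), g_i and h_i are simply the first and second derivatives of the per-sample loss evaluated at the previous round's prediction:

```python
import numpy as np

def grad_hess_squared_error(y_true, y_pred_prev):
    """For l(y, y_hat) = (y - y_hat)^2:
       g_i = dl/dy_hat = 2 * (y_hat - y),  h_i = d^2 l / dy_hat^2 = 2."""
    g = 2.0 * (y_pred_prev - y_true)
    h = np.full_like(y_true, 2.0, dtype=float)
    return g, h

y_true = np.array([1.0, 0.0, 3.0])
y_pred_prev = np.array([0.5, 0.2, 2.0])   # predictions of the first t-1 trees
g, h = grad_hess_squared_error(y_true, y_pred_prev)
print(g, h)   # [-1.  0.4 -2. ] [2. 2. 2.]
```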

  Ignoring the constant terms in equation (3.8) and combining it with equation (3.4), we get:

$$Obj^{(t)} \approx \sum_{i=1}^{n}\left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{3.9}$$
  Simplify (3.9) by using (3.2):

$$Obj^{(t)} \approx \sum_{i=1}^{n}\left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \tag{3.10}$$
  The first part of formula (3.10) sums over all training samples. Since every sample is mapped to a leaf node of the tree, we can change perspective and sum over the leaf nodes instead, obtaining:

$$Obj^{(t)} \approx \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T, \qquad I_j = \{\, i \mid q(x_i) = j \,\} \tag{3.11}$$
  Let

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
  G_j is the sum of the first-order derivatives over all samples mapped to leaf node j, and similarly H_j is the sum of the second-order derivatives. We then get:

$$Obj^{(t)} \approx \sum_{j=1}^{T}\left[ G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2\right] + \gamma T \tag{3.12}$$
  For a given structure of the t-th CART tree (which can be represented by q(x)), the leaf nodes are independent of one another, and G_j and H_j are fixed quantities. Therefore, (3.12) can be regarded as a univariate quadratic function of each leaf value w_j. Minimizing (3.12) gives:

$$w_j^{*} = -\frac{G_j}{H_j + \lambda} \tag{3.13}$$
  Substituting back gives the final objective function:

$$Obj = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{3.14}$$
  Formula (3.14) is also called the scoring function. It measures the quality of a tree structure: the smaller the value, the better the structure. We use the scoring function to select the best split points when building the CART tree.
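  A minimal sketch of formulas (3.13) and (3.14) in plain NumPy (illustrative names, toy values): given the per-leaf sums G_j and H_j, the optimal leaf values and the structure score follow directly.

```python
import numpy as np

def optimal_leaf_weights(G, H, lam=1.0):
    """Formula (3.13): w_j* = -G_j / (H_j + lambda), for each leaf j."""
    return -G / (H + lam)

def structure_score(G, H, lam=1.0, gamma=1.0):
    """Formula (3.14): -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T."""
    return -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * len(G)

G = np.array([-4.0, 2.0, 6.0])   # per-leaf sums of first derivatives
H = np.array([3.0, 2.0, 5.0])    # per-leaf sums of second derivatives
print(optimal_leaf_weights(G, H))
print(structure_score(G, H))     # smaller is better
```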

4. XGBoost's regression tree construction method

  The scoring function derived in the previous section measures the quality of a tree structure, so it can be used to select the best split point. First, enumerate all candidate split points of the sample features, then evaluate each candidate split. The criterion for the quality of a split is as follows:

$$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^2}{H_L + H_R + \lambda}\right] - \gamma$$
  Gain is the difference between the obj of a single node and the obj of the two nodes it is split into. We traverse the split points of all features, and the split point with the largest Gain is the best split point. Continuing to split nodes in this way yields the CART tree. If γ is set too large, Gain becomes negative, meaning the node is not split, because the tree structure would deteriorate after splitting. The larger the value of γ, the larger the drop in obj required for a split to be accepted. This value can be set in XGBoost. A sketch of this exact greedy split search is shown below.
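  Here is a hedged sketch of that exact greedy split search on a single feature (illustrative code, not the library's implementation): sort by feature value, sweep prefix sums of g and h, and keep the split with the largest Gain.

```python
import numpy as np

def best_split_one_feature(x, g, h, lam=1.0, gamma=1.0):
    """Exact greedy search for the split with the largest Gain on one feature."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_threshold = -np.inf, None
    for i in range(len(x) - 1):
        GL += g[i]; HL += h[i]
        GR, HR = G - GL, H - HL
        if x[i] == x[i + 1]:          # cannot split between identical values
            continue
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (x[i] + x[i + 1])
    return best_gain, best_threshold   # a negative gain means: do not split

x = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([-2.0, -1.0, 1.5, 2.5])
h = np.array([1.0, 1.0, 1.0, 1.0])
print(best_split_one_feature(x, g, h))
```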

5. Differences between XGBoost and GBDT

  1) XGBoost takes the complexity of the tree into account when generating the CART tree, whereas GBDT does not consider tree complexity until the pruning step.
  2) XGBoost fits a second-order expansion of the previous round's loss function, while GBDT fits only a first-order expansion. XGBoost is therefore more accurate and needs fewer iterations to reach the same training effect.
  3) Both XGBoost and GBDT improve model performance iteratively, but XGBoost can use multiple threads when selecting the best split point, which greatly improves running speed.

PS: this section only lists a few differences related to the content of this article.
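  To connect the derivation to practice, here is a small usage sketch of the xgboost Python package on synthetic data (parameter values are arbitrary examples): gamma and reg_lambda correspond to the γ and λ of formula (3.4), and n_jobs enables the multi-threaded split search mentioned above.

```python
import numpy as np
import xgboost as xgb   # pip install xgboost

# toy regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(
    n_estimators=100,    # number of boosting rounds (trees)
    max_depth=3,         # depth of each CART tree
    learning_rate=0.1,   # shrinkage applied to each tree's output
    gamma=0.1,           # gamma in formula (3.4): minimum loss reduction to split
    reg_lambda=1.0,      # lambda in formula (3.4): L2 penalty on leaf values
    n_jobs=4,            # multi-threaded split finding
)
model.fit(X, y)
print(model.predict(X[:3]))
```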

References

  Tianqi Chen, "XGBoost: A Scalable Tree Boosting System"
  Hang Li, "Statistical Learning Methods"


Source: blog.csdn.net/weixin_42691585/article/details/108776258