Supervised learning (4): XGBoost algorithm
XGBoost (eXtreme Gradient Boosting) is a trump card among ensemble learning methods: in Kaggle data mining competitions, a large share of the winners use XGBoost, and it performs very well on most regression and classification problems. This article introduces the principles of the XGBoost algorithm in detail.
1. The method of constructing the optimal model
The general method of constructing the optimal model is to minimize the loss function over the training data, represented by the letter L, as follows:

    \min_{f} \frac{1}{N} \sum_{i=1}^{N} L\left(y_i, f(x_i)\right)    (1)
Formula (1) is called empirical risk minimization. A model trained this way can be highly complex, and when the training data is limited it is prone to overfitting.
Therefore, in order to control the complexity of the model, the following formula is often used instead:

    \min_{f} \frac{1}{N} \sum_{i=1}^{N} L\left(y_i, f(x_i)\right) + \lambda J(f)    (2)

where J(f) measures the complexity of the model. Formula (2) is called structural risk minimization; models that minimize structural risk tend to predict well both on the training data and on unseen test data.
Application: decision tree generation and pruning correspond to empirical risk minimization and structural risk minimization, respectively. XGBoost's decision tree generation is driven by structural risk minimization, which will be described in detail later.
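The distinction between formulas (1) and (2) can be sketched in a few lines of pure Python. All names here are illustrative, not from any particular library; squared error stands in for a generic loss L:

```python
# Minimal sketch: empirical risk (formula 1) vs. structural risk (formula 2).

def empirical_risk(preds, labels):
    """Mean squared error over the training data -- formula (1)."""
    n = len(labels)
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / n

def structural_risk(preds, labels, complexity, lam=1.0):
    """Empirical risk plus a weighted complexity penalty J(f) -- formula (2)."""
    return empirical_risk(preds, labels) + lam * complexity

labels = [1.0, 2.0, 3.0]
preds  = [1.1, 1.9, 3.2]
print(empirical_risk(preds, labels))      # training error only
print(structural_risk(preds, labels, 4))  # same error plus a penalty, e.g. 4 leaves
```

A model that drives the first quantity to zero may still score badly on the second, which is exactly what the regularization term is for.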
2. The regression view of Boosting
The Boosting method combines multiple weak learners to produce the final prediction. Whether the task is classification or regression, we construct the optimal Boosting model using the idea of a regression task.

The regression view: treat the output of each weak learner as a continuous value. This lets the results of the weak learners be accumulated, and makes it easier to optimize the model through a loss function.
Suppose f_t(x_i) is the output of the t-th weak learner, \hat{y}_i^{(t)} is the model's output after t rounds, and y_i is the actual label. Then:

    \hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i)    (2.1)

    \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)    (2.2)
Both formulas above describe an additive model, and both assume that the output of each weak learner is a continuous value.
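The additive model of formula (2.1) is easy to demonstrate directly. A minimal sketch in pure Python, where each "weak learner" is just a toy function:

```python
# Sketch of the additive model: the boosted prediction is the sum of
# the weak learners' outputs.

def boost_predict(learners, x):
    """y_hat^(t) = sum of f_k(x) over k = 1..t  (formula 2.1)."""
    return sum(f(x) for f in learners)

# Three toy weak learners whose continuous outputs accumulate:
learners = [lambda x: 0.5 * x, lambda x: 0.3 * x, lambda x: 0.1 * x]

y_t   = boost_predict(learners, 2.0)       # prediction after 3 rounds
y_tm1 = boost_predict(learners[:-1], 2.0)  # prediction after 2 rounds

# The recursive form: y_hat^(t) = y_hat^(t-1) + f_t(x)
assert abs(y_t - (y_tm1 + learners[-1](2.0))) < 1e-12
```

The assertion checks that the two formulas describe the same quantity: the round-t prediction equals the round-(t-1) prediction plus the newest learner's output.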
The regression view applied to classification tasks:
According to the accumulated result of formula (2.1), the final classifier is obtained by taking the sign (labels y_i ∈ {-1, +1}):

    \hat{y}_i = \operatorname{sign}\left(\hat{y}_i^{(t)}\right)    (2.3)
The loss function for classification is generally an exponential or logarithmic function. Assume here that the loss is logarithmic; the learner's loss is then:

    L\left(y_i, \hat{y}_i^{(t)}\right) = \log\left(1 + e^{-y_i \hat{y}_i^{(t)}}\right)    (2.4)

If the actual label is y_i = 1, then:

    L\left(1, \hat{y}_i^{(t)}\right) = \log\left(1 + e^{-\hat{y}_i^{(t)}}\right)    (2.5)

Taking the gradient of (2.5) with respect to \hat{y}_i^{(t)}, we get:

    \frac{\partial L}{\partial \hat{y}_i^{(t)}} = \frac{-e^{-\hat{y}_i^{(t)}}}{1 + e^{-\hat{y}_i^{(t)}}}    (2.6)
The direction of the negative gradient is the direction in which the loss function drops fastest. Since (2.6) is negative, its opposite is greater than 0, so the weak learners iterate in the direction of increasing \hat{y}_i^{(t)}. Graphically:

As the figure above shows, when a sample's actual label y_i is 1, the model output \hat{y}_i^{(t)} increases with the number of iterations (red arrow) and the model's loss decreases accordingly; when the actual label y_i is -1, the model output \hat{y}_i^{(t)} decreases with the number of iterations (red arrow) and the loss again decreases. This is the principle of the additive model: the goal of reducing the loss function is achieved through repeated iterations.
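The behaviour in the figure can be checked numerically. A short sketch of the logarithmic loss and its gradient for ±1 labels (pure Python, no library assumed):

```python
import math

# Logistic (logarithmic) loss L = log(1 + exp(-y * y_hat)) for y in {-1, +1}.

def log_loss(y, y_hat):
    return math.log(1.0 + math.exp(-y * y_hat))

def grad(y, y_hat):
    """dL/dy_hat. For y = +1 this is negative, so the NEGATIVE gradient is > 0."""
    return -y * math.exp(-y * y_hat) / (1.0 + math.exp(-y * y_hat))

# For y = +1, moving y_hat upward (the negative-gradient direction)
# strictly lowers the loss, exactly as the red arrow in the figure shows:
losses = [log_loss(1, t) for t in (0.0, 1.0, 2.0)]
assert losses[0] > losses[1] > losses[2]
assert grad(1, 0.0) < 0   # gradient negative, so -gradient is positive
```

The symmetric check for y = -1 (loss decreasing as y_hat decreases) works the same way.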
Summary: the Boosting method treats the output of each weak learner as a continuous value, so the loss function is continuous as well. The model can therefore be optimized by iterating weak learners. This is also the principle behind additive models in ensemble learning.
3. Derivation of the objective function of XGBoost
The objective function, i.e. the loss function to be minimized, is used to construct the optimal model. From Section 1 we know that a regularization term representing model complexity should be added to the loss. The model in XGBoost consists of multiple CART trees, so the objective function of the model is:

    Obj = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k)    (3.1)
Equation (3.1) is the regularized loss function. The first term on the right-hand side is the model's training error, and the second is the regularization term, which here is the sum of the regularization terms of the K trees.
Introduction to CART trees
The figure above shows the k-th CART tree. Determining a CART tree requires two parts. The first is the structure of the tree, which maps an input sample to a particular leaf node; the tree's prediction is denoted f_k(x). The second is the value of each leaf node: q(x) gives the index of the leaf a sample falls into, and w_{q(x)} is the value stored at that leaf. By definition:

    f_k(x) = w_{q(x)}, \quad w \in \mathbb{R}^{T}, \quad q: \mathbb{R}^{d} \to \{1, 2, \ldots, T\}    (3.2)
Definition of tree complexity
The XGBoost model contains multiple CART trees, and the complexity of each tree is defined as:

    \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}    (3.3)

where T is the number of leaf nodes and \sum_j w_j^2 is the squared norm of the leaf-value vector. γ controls the difficulty of splitting a node, and λ is the L2 regularization coefficient.
As an example, the complexity of the tree in the figure below is:
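The complexity term Ω(f) = γT + ½λΣ w_j² is simple enough to compute by hand. A minimal sketch, with illustrative values for a three-leaf tree:

```python
# Complexity of a single tree: gamma * (number of leaves)
# plus 0.5 * lambda * (sum of squared leaf values).

def tree_complexity(leaf_weights, gamma=1.0, lam=1.0):
    T = len(leaf_weights)  # T = number of leaf nodes
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

# A toy tree with 3 leaves holding values 2, 0.1 and -1:
print(tree_complexity([2.0, 0.1, -1.0]))
```

Larger γ penalizes having many leaves; larger λ penalizes extreme leaf values. Both push the learner toward simpler trees, i.e. toward structural risk minimization.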
Objective function derivation
According to formula (3.1), the objective function of the model after t iterations is:

    Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k)    (3.4)

Using the additive relation \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i), this can be rewritten as:

    Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C    (3.5)

where C is the (constant) total complexity of the first t-1 trees.
Taylor's formula gives the following second-order approximation:

    f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^{2}    (3.6)
Treating f_t(x_i) as Δx, the second-order approximate expansion of equation (3.5) is:

    Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) + C    (3.7)

where:

    g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right), \qquad h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right)    (3.8)
Here l(y_i, \hat{y}_i^{(t-1)}) is the prediction error of the model formed by the first t-1 trees, and g_i and h_i are the first and second derivatives of the loss with respect to that model's prediction. The current tree is then fit so as to move in the direction that shrinks this prediction error.
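For a concrete instance of g_i and h_i, take the squared loss l(y, ŷ) = (y − ŷ)². A minimal sketch (the squared loss is chosen here only for illustration; XGBoost supports other losses too):

```python
# g_i and h_i for the squared loss, evaluated at the previous round's prediction.

def grad_hess_squared(y, y_hat_prev):
    g = 2.0 * (y_hat_prev - y)  # first derivative of (y - y_hat)^2 w.r.t. y_hat
    h = 2.0                     # second derivative: constant for squared loss
    return g, h

g, h = grad_hess_squared(y=3.0, y_hat_prev=2.5)
print(g, h)  # the new tree f_t is fit against these per-sample statistics
```

Note that g and h depend only on the previous round's prediction, which is why they can be computed once per round before the new tree is grown.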
Ignoring the constant terms in (3.7) and expanding Ω(f_t) with equation (3.3), we get:

    Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}    (3.9)
Simplifying (3.9) with (3.2), i.e. f_t(x) = w_{q(x)}:

    Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^{2} \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^{2}    (3.10)
The first sum in (3.10) runs over all training samples. Since every sample is mapped to some leaf of the tree, we can change perspective: starting from the leaves and accumulating the samples that land in each one. Writing I_j = { i | q(x_i) = j } for the set of samples in leaf j, we get:

    Obj^{(t)} \approx \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^{2} \right] + \gamma T    (3.11)
Let

    G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i
Gj represents the sum of the first-order derivatives of all input samples mapped to the leaf node j, and in the same way, Hj represents the sum of the second-order derivatives.
This gives:

    Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^{2} \right] + \gamma T    (3.12)
For a fixed structure of the t-th CART tree (which can be represented by q(x)), the leaf nodes are independent of one another and G_j and H_j are known quantities, so (3.12) is a univariate quadratic function of each leaf value w_j. Minimizing (3.12) gives:

    w_j^{*} = -\frac{G_j}{H_j + \lambda}    (3.13)
Substituting back yields the final objective function:

    Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T    (3.14)
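The optimal leaf weight and the scoring function (3.14) can be sketched directly from the G_j and H_j statistics. A minimal pure-Python illustration with made-up per-leaf sums:

```python
# Optimal leaf value and scoring function for a fixed tree structure.

def leaf_weight(G, H, lam=1.0):
    """w*_j = -G_j / (H_j + lambda): the minimizer of the per-leaf quadratic."""
    return -G / (H + lam)

def tree_score(G_list, H_list, gamma=1.0, lam=1.0):
    """Obj = -0.5 * sum(G_j^2 / (H_j + lambda)) + gamma * T  (formula 3.14)."""
    T = len(G_list)
    return -0.5 * sum(G * G / (H + lam) for G, H in zip(G_list, H_list)) + gamma * T

# Two leaves with illustrative gradient/hessian sums:
print(leaf_weight(G=4.0, H=3.0))             # optimal value of a single leaf
print(tree_score([4.0, -2.0], [3.0, 1.0]))   # smaller score = better structure
```

Note that larger λ shrinks every leaf weight toward zero, and larger γ penalizes each additional leaf, mirroring the roles both play in the complexity term.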
Equation (3.14) is also called the scoring function. It measures the quality of a tree structure: the smaller the value, the better the structure. We use the scoring function to choose the best split point when building the CART tree.
4. XGBoost's regression tree construction method
The scoring function derived in the previous section measures the quality of a tree structure, so it can be used to select the best split point. First enumerate all candidate split points of each sample feature, then evaluate each candidate split. The criterion for the quality of a split is:

    Gain = \frac{1}{2} \left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma

Gain is the difference between the obj of the single unsplit node and the total obj of the two nodes produced by the split. Traversing the split points of all features, the split point with the largest Gain is the best one; repeatedly splitting nodes by this rule grows the CART tree. If γ is set too large, Gain becomes negative, meaning the node should not be split, because the tree structure would become worse after splitting. The larger the γ value, the stricter the required drop in obj after a split. This value can be set in XGBoost.
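The Gain criterion is a one-liner once the left/right statistics are known. A sketch with illustrative G and H sums for a single candidate split:

```python
# Split gain: the obj reduction from splitting one node into left/right
# children, minus the penalty gamma for adding one more leaf.

def split_gain(G_L, H_L, G_R, H_R, gamma=1.0, lam=1.0):
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# A candidate split; a negative result would mean "do not split":
print(split_gain(G_L=6.0, H_L=3.0, G_R=-4.0, H_R=2.0))
```

In the actual algorithm this quantity is evaluated for every candidate split point of every feature, and the split with the largest positive Gain is taken.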
5. Differences between XGBoost and GBDT
1) XGBoost takes tree complexity into account while generating the CART tree, whereas GBDT only addresses complexity afterwards, in the pruning step.
2) XGBoost fits the previous round's loss function with a second-order Taylor expansion, while GBDT uses only a first-order expansion. XGBoost is therefore more accurate and needs fewer iterations to reach the same training effect.
3) Both XGBoost and GBDT improve the model iteratively, but XGBoost can use multiple threads when selecting the best split point, which greatly improves running speed.
PS: this section lists only a few differences relevant to the content of this article.
References
Tianqi Chen, "XGBoost: A Scalable Tree Boosting System"
Li Hang, "Statistical Learning Methods"