Detailed explanation of the GBDT + XGBoost algorithms (Part 2): XGBoost

4. XGBoost

XGBoost (eXtreme Gradient Boosting) approximates the loss function with a second-order Taylor expansion and then finds the optimal tree structure and leaf values by minimizing this approximate objective.

The core idea of the algorithm: keep adding trees, growing each tree by repeatedly splitting on features. Adding a tree means learning a new function f(x) that fits the residual of the previous prediction. After training we have K trees. To predict the score of a sample, the sample's features route it to one leaf node in each tree, and each leaf node carries a score; the prediction is simply the sum of the scores from all the trees.
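To make the additive prediction concrete, here is a minimal Python sketch; the two "trees" below are hypothetical hand-written rules, not output from XGBoost itself:

```python
import numpy as np

# A minimal sketch (not library code) of how an additive tree ensemble
# predicts: each tree routes the sample to one leaf and contributes that
# leaf's score; the final prediction is the sum of the scores.

def predict_additive(sample, trees):
    """trees: list of callables, each mapping a feature vector to a leaf score."""
    return sum(tree(sample) for tree in trees)

# Toy "trees" expressed as simple hand-written rules (illustrative only).
tree1 = lambda x: 2.0 if x[0] < 10 else -1.0
tree2 = lambda x: 0.5 if x[1] > 3 else -0.5

x = np.array([8.0, 5.0])
print(predict_additive(x, [tree1, tree2]))  # 2.0 + 0.5 = 2.5
```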

4.1 Construction of XGBoost objective function

Assuming that K trees have been trained, the final predicted value for the i-th sample is:

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)

That is, the final result is the sum of the predictions of all K trees.

Objective function:

Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)

The objective function consists of two parts: the loss function and a term that controls the complexity of the model.

(1) Loss function

For regression problems, the squared-error (least-squares) loss is used;

For classification problems, the cross-entropy loss is used.
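For concreteness, here is a minimal sketch of how these two losses are selected in the xgboost Python package; the parameter values below are illustrative, not recommendations:

```python
import xgboost as xgb

# Squared-error loss for regression, logistic (cross-entropy) loss for
# binary classification, chosen via the `objective` parameter.
reg = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=100)
clf = xgb.XGBClassifier(objective="binary:logistic", n_estimators=100)
```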

(2) The complexity of the model

The term that controls model complexity is also called the regularization term. Regularization penalizes model complexity, so that models that are prone to overfitting are ruled out.

Q: How do we define the complexity of the XGBoost model? How do we define the complexity of a decision tree? What does a complex decision tree look like?

A: For a decision tree, the depth of the tree, the number of leaf nodes, and the values predicted at the leaf nodes are all related to its complexity; the more complex a tree is, the more easily it overfits.

The deeper a tree is, the more easily it overfits; the more leaf nodes a tree has, the more easily it overfits; and for regression tasks, the larger the values predicted at the leaf nodes, the more complex the model.
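These notions of complexity map directly onto xgboost hyperparameters; a hedged sketch follows (the values are illustrative only):

```python
import xgboost as xgb

# Hyperparameters that correspond to the three notions of complexity above.
model = xgb.XGBRegressor(
    max_depth=4,          # limits tree depth
    gamma=1.0,            # minimum loss reduction required to split; penalizes extra leaves
    reg_lambda=1.0,       # L2 penalty on leaf values (keeps leaf scores small)
    min_child_weight=1.0, # minimum sum of hessians in a leaf; another complexity control
)
```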

4.2 Taylor expansion and approximation of objective function

Approximating the objective at iteration t with a second-order Taylor expansion around the previous prediction \hat{y}_i^{(t-1)}, and dropping terms that are constant for the new tree, gives:

Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)

where g_i and h_i are the first and second derivatives of the loss l(y_i, \hat{y}_i^{(t-1)}) with respect to the previous prediction.

The new objective function shows that we need to parameterize each tree f_k(x_i) and its complexity Ω(f_k), and then find the tree that minimizes this objective.
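As an illustration of the g_i and h_i above, here is a hedged sketch of a custom objective for the squared-error loss, written against the xgboost low-level API; the training data is random placeholder data:

```python
import numpy as np
import xgboost as xgb

# For l(y, yhat) = (y - yhat)^2 / 2:  g_i = yhat - y,  h_i = 1.
def squared_error_obj(predt, dtrain):
    y = dtrain.get_label()
    grad = predt - y            # first derivative  g_i
    hess = np.ones_like(predt)  # second derivative h_i
    return grad, hess

# Illustrative usage on random data (values are placeholders).
X = np.random.rand(100, 5)
y = np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=10, obj=squared_error_obj)
```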

4.3 Parameterization of trees

(1) Parameterizing f_k(x_i).

How to represent a tree with parameters?

The value of a leaf node is represented by w. Suppose the leaf with value 15 is w_1, the leaf with value 12 is w_2, and the leaf with value 20 is w_3; then W = (w_1, w_2, w_3) = (15, 12, 20), where W is a parameter vector.

f_k(x_i): the prediction of the k-th tree for sample x_i; more concretely, it maps sample x_i to one of the tree's leaf nodes.

Define a function q(x) that gives the leaf position of sample x. In this example, samples {1, 3} fall on the first leaf (the one with value 15), sample {4} falls on the second leaf, and samples {2, 5} fall on the third leaf.

Once the function q indicates which leaf a sample falls on, f_k(x_i) can be expressed through parameters: sample x_i falls on the q(x_i)-th leaf node, so the prediction f_k(x_i) can be written as w_{q(x_i)}, which parameterizes f_k. Here W is the parameter vector, and the subscript q(x_i) indicates which leaf node the sample falls on. Since this subscript is itself a function, we also need to define, for each leaf j, the index set

I_j = \{ i \mid q(x_i) = j \}

that is, the set of samples x_i that fall on the j-th leaf node. For example, I_1 = {1, 3} means that samples 1 and 3 fall on the first leaf. The purpose of this representation is to regroup the samples according to the leaf node they land in.
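A minimal Python sketch of this parameterization, using the toy leaf values and leaf assignments above:

```python
import numpy as np

# Leaf values w = (15, 12, 20) and a leaf-assignment function q (toy example).
w = np.array([15.0, 12.0, 20.0])  # leaf values w_1, w_2, w_3

# q maps each sample index (1..5) to the leaf it falls on (1-based here).
q = {1: 1, 2: 3, 3: 1, 4: 2, 5: 3}

# f_k(x_i) = w_{q(x_i)}
def f_k(i):
    return w[q[i] - 1]

print(f_k(1), f_k(4), f_k(5))  # 15.0 12.0 20.0

# Index sets I_j = { i : q(x_i) = j }, grouping samples by leaf.
I = {j: [i for i, leaf in q.items() if leaf == j] for j in (1, 2, 3)}
print(I)  # {1: [1, 3], 2: [4], 3: [2, 5]}
```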

(2) Parameterizing Ω(f_k).

The complexity of a tree is defined as

\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2

where T is the number of leaf nodes and w_j is the value of the j-th leaf.
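A small sketch of this complexity term, evaluated on the toy leaf values; the γ and λ values here are chosen arbitrarily for illustration:

```python
import numpy as np

# Complexity term: gamma penalizes the number of leaves T,
# lambda penalizes large leaf values.
def omega(w, gamma, lam):
    T = len(w)
    return gamma * T + 0.5 * lam * np.sum(np.asarray(w) ** 2)

print(omega([15.0, 12.0, 20.0], gamma=1.0, lam=0.1))  # 3 + 0.05 * 769 = 41.45
```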

4.4 Building each tree with a greedy algorithm

How to find the shape of a tree?

Building the tree greedily: at each step, try to split a leaf node, compute the gain of the split (the reduction in the objective from before to after the split), and keep the candidate with the largest gain. Writing G and H for the sums of g_i and h_i over the samples in a node, the gain of splitting that node into left and right children is

Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma

The tree is expanded step by step in this greedy fashion.

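A hedged sketch of evaluating this gain for a few hypothetical candidate splits; the gradient/hessian sums below are made-up numbers:

```python
import numpy as np

# G_L, H_L, G_R, H_R are the sums of gradients/hessians of the samples
# that would go to the left and right children of a candidate split.
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Evaluate a few hypothetical candidate splits and keep the best one.
candidates = [(-4.0, 3.0, 6.0, 5.0), (-1.0, 4.0, 3.0, 4.0)]
gains = [split_gain(*c) for c in candidates]
best = int(np.argmax(gains))
print(gains, "-> best candidate:", best)
```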

4.6 Advantages of XGBoost

Advantages:

(1) Regularization prevents overfitting

(2) XGBoost uses not only the first-order derivative but also the second-order derivative, so the approximation of the loss is more accurate; custom loss functions are also supported.

(3) XGBoost supports parallel optimization; the parallelism is at the feature granularity (candidate splits for different features can be evaluated in parallel).

(4) Because training data can be sparse, a default branch direction can be learned for missing (or specified) values, which greatly improves the efficiency of the algorithm.

(5) Column (feature) subsampling is supported, which both reduces overfitting and reduces computation; see the sketch after this list.
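A hedged sketch of the xgboost parameters that correspond to points (3) and (5) above; the values are illustrative only:

```python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_jobs=-1,             # use all CPU cores for parallel split finding
    colsample_bytree=0.8,  # column subsampling: 80% of features per tree
    subsample=0.8,         # row subsampling of the training data
)
```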

How does XGBoost handle missing values?

In the training phase, when searching for split points, the split gain is computed without the samples whose value for that feature is missing. Then, for completeness, the samples with missing values are assigned to the left child and to the right child in turn; the gain is computed for both cases, and the direction with the larger gain becomes the default direction for missing values at that split.
In the prediction phase, if the training set had no missing values but the test set does, a default branch direction is assigned for the missing (or unseen) value, and during prediction samples with missing values are automatically routed down that branch.
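A minimal sketch of missing-value handling with the xgboost Python package, using NaN as the missing-value marker on toy data:

```python
import numpy as np
import xgboost as xgb

# XGBoost accepts NaN as the missing marker and learns a default branch
# direction for it during training (toy data below).
X = np.random.rand(200, 3)
X[np.random.rand(200, 3) < 0.1] = np.nan            # inject ~10% missing values
y = (np.nansum(X, axis=1) > 1.5).astype(int)

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
booster = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                    dtrain, num_boost_round=20)

# Prediction also handles NaN: missing features follow the learned default branch.
X_test = np.array([[0.2, np.nan, 0.9]])
print(booster.predict(xgb.DMatrix(X_test, missing=np.nan)))
```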

What is the difference between XGBoost and GBDT?

Base classifier: GBDT uses the CART classification-and-regression tree as its base classifier. XGBoost supports not only CART but also linear base classifiers; in that case XGBoost is equivalent to logistic regression with L1 and L2 regularization (for classification) or to regularized linear regression (for regression). A configuration sketch follows this list.

Derivative information: GBDT only uses first-order derivative information when optimizing the solution. XGBoost performs second-order Taylor expansion on the cost function, and uses first-order and second-order derivative information at the same time.

Regularization term: XGBoost adds a regularization term to the cost function to control model complexity. It consists of the number of leaf nodes of the tree and the squared L2 norm of the values output at the leaf nodes. This term helps reduce the variance of the model, makes the learned model simpler, and prevents overfitting. GBDT's cost function has no regularization term.

Missing value handling: for samples with missing feature values, XGBoost automatically learns the split direction.
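As referenced above, a hedged sketch of switching XGBoost to a linear base learner; the regularization strengths are illustrative:

```python
import xgboost as xgb

linear_model = xgb.XGBClassifier(
    booster="gblinear",  # linear base learner instead of the default "gbtree"
    reg_alpha=0.1,       # L1 regularization
    reg_lambda=1.0,      # L2 regularization
)
```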

Source: blog.csdn.net/bb8886/article/details/130236970