Recently, I took a deep dive into the algorithm principles and underlying model data structures of XGBoost, the celebrated powerhouse of data mining competitions...

Introduction

Anyone who works in data mining knows the XGBoost algorithm. This tool, which once shone in one data mining competition after another, is a classic algorithm proposed by Tianqi Chen in 2016. In essence, XGBoost is usually regarded as an optimized implementation of the GBDT algorithm, but beyond inheriting the general idea of ensemble learning, the two differ considerably in their design details. I recently studied the algorithm in depth and briefly explored its underlying data structure design, and I could not help marveling at how elegantly it is put together. This post is a summary of that study, written down for future reference.

[Photo: Tianqi Chen presenting at an XGBoost sharing session in 2016]

XGBoost is an ensemble algorithm in machine learning. Ensemble methods are commonly divided into three families, and XGBoost belongs to the Boosting family, which is also the most active and powerful of the three. Besides XGBoost, the Boosting family includes the earlier AdaBoost and GBDT as well as the later LightGBM and CatBoost. Of course, LightGBM, CatBoost, and XGBoost are all generally regarded as improved and optimized implementations of GBDT.

The principles of XGBoost are by now mature and thoroughly documented, and there are countless write-ups about them online, so this article does not try to repeat that work. Instead it has two goals: first, to share my personal understanding of the XGBoost algorithm by walking through its key formulas; second, to briefly explore the underlying data structure design of XGBoost. The second goal is similar in positioning to the earlier post "Data Science: How is the decision tree in sklearn designed and stored under the hood?".

01 Understanding the principles through formula derivation

To understand how the XGBoost algorithm works, this section walks through the derivation of the five main formulas behind it. All formulas come primarily from the official XGBoost documentation (https://xgboost.readthedocs.io/en/latest) and can also be found in the related papers.

Formula 1 - The additive model of the Boosting ensemble:
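Written in the notation of the XGBoost documentation, the additive model is:

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}

where \mathcal{F} denotes the space of base learners.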

The formula itself is very simple, but it is essential for understanding the formulas that follow. As a Boosting method, XGBoost follows a typical additive model: the output of the ensemble equals the sum of the outputs of the individual base learners (for regression the summation is easy to interpret; for classification what is actually fit is the logloss, much as in logistic regression, as discussed later). In the formula above, f_k(x) denotes a single base learner and K is the number of base learners. Take the ensemble with two base learners in the figure below as an example: the ensemble output for the boy is the sum of the two base learners' outputs, 2 + 0.9 = 2.9, and the ensemble output for the grandpa is likewise the sum of the two outputs, -1 - 0.9 = -1.9.

[Figure: the classic two-tree example, where each family member's final score is the sum of the scores given by the two trees]

Formula 2 - The objective function of the base learner in XGBoost

Machine learning has three elements: model, strategy, and algorithm. The strategy covers how to define and evaluate the quality of a model, in other words, how to define the loss function. Unlike GBDT, which simply keeps fitting residuals, XGBoost additionally accounts for the structural loss of the model, which can be understood as the cost of model complexity, while the residual corresponds to the empirical loss, which can be understood as the gap left by the model's limited fitting ability.
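The objective function, as given in the XGBoost documentation, is:

Obj = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k)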

In the objective function above (the objective function is a broader term than the loss function: it generally consists of two parts, objective = loss + regularization term, and smaller is better; a related term is the cost function, which in a loose sense can be treated as roughly equivalent to the loss function), the first summation measures the gap between the model's current predictions and the true values and is called the empirical risk. How that gap is measured depends on the chosen loss function; typical choices are MSE for regression and logloss for classification. The second summation reflects the structural risk of the model and affects its generalization ability. If the base learner is a decision tree, the structural risk is defined as:
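(this is the form given in the XGBoost paper and documentation)

\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2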

Here γ and λ are regularization coefficients and T is the number of leaf nodes in the decision tree. Note that this is the number of leaf nodes, not the total number of nodes (when a CART tree is post-pruned with CCP, the regularization term counts all nodes instead). The formula above is the one usually shown when introducing the principles of XGBoost, and it is also the one written in Tianqi Chen's original paper. In the Python xgboost package, besides gamma and reg_lambda, which correspond to these two coefficients, the model also accepts a reg_alpha parameter for a first-order (L1) regularization term, in which case the structural risk can be written as:
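(with α denoting reg_alpha and λ denoting reg_lambda)

\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 + \alpha \sum_{j=1}^{T} \left|\omega_j\right|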

Formula 3 - The second-order Taylor expansion in XGBoost

This second-order Taylor approximation can be said to be the soul of XGBoost, and it is where the algorithm's subtlety and power relative to GBDT show most clearly. To explain the approximation, first recall the Taylor expansion in its usual form:
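For a twice-differentiable function f:

f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2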

The formula above only expands to second order; as long as f is sufficiently differentiable, higher-order approximations also exist. In XGBoost, the second-order Taylor approximation is applied only to the empirical-risk part of the objective, that is, to each term of the first summation in Formula 2. For a single sample, that term is:
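(writing the round-t prediction as the previous rounds' prediction plus the new tree's output)

l\left(y_i, \hat{y}_i^{(t)}\right) = l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right)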

In this expression, the subscript i indexes the i-th sample in the training set, while the superscripts t and t-1 refer to rounds t and t-1 of the ensemble. The next question is: who plays the role of f, who is x, and who is Δx? Here f is simply the function symbol, corresponding to the loss l; the key is to understand x and Δx.

During model training the training set is fixed, so within each base learner the samples are constants, and y_i is no exception: it can be regarded as a constant in the formula above. In round t of ensemble training, the goal is to find the optimal t-th learner given the results of the previous t-1 rounds (which are already determined), so as to make the current loss as small as possible.

In fact, in ensemble learning the first base learner often already fits most of the signal. In the usual age-fitting example, if the target value is 100, the first base learner is likely to fit about 90, and the remaining N-1 learners simply keep correcting the residual of 10. The point of this example is that, in the expression above, the fitted result ŷ of the first t-1 rounds plays the role of x in f(x + Δx), while the output of the t-th round can be regarded as the floating increment Δx. With that correspondence, and following the pattern of the Taylor expansion, the objective can be expanded as:
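(as in the XGBoost documentation, with the regularization term Ω(f_t) kept separate)

Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)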

where g_i and h_i are the first-order and second-order derivatives of the loss, written as follows:
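Both derivatives are taken with respect to the previous round's prediction:

g_i = \partial_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\left(y_i, \hat{y}^{(t-1)}\right)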

Further, writing y for the true value and ŷ for the fitted value: for regression with the most commonly used MSE loss, the loss function and its first and second derivatives are:
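(using the squared-error form given in the XGBoost documentation)

l(y, \hat{y}) = (y - \hat{y})^2, \qquad g = 2(\hat{y} - y), \qquad h = 2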

For classification, taking binary classification as an example, the default loss in XGBoost is logloss, and the corresponding loss function with its first and second derivatives is:
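Writing p = \sigma(\hat{y}) = 1/(1 + e^{-\hat{y}}) for the predicted probability, the standard form is:

l(y, \hat{y}) = y \ln\left(1 + e^{-\hat{y}}\right) + (1 - y) \ln\left(1 + e^{\hat{y}}\right), \qquad g = p - y, \qquad h = p(1 - p)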

Formula 4 - Solving for the optimal leaf weights of the decision tree

In theory XGBoost can support any base learner, but in practice decision trees are by far the most common choice, and the Python xgboost package uses gbtree as the default booster. With trees, finding the optimal tree in round t essentially amounts to finding the optimal leaf weights, so understanding the optimal leaf weight is particularly important.

Fortunately, the two formulas below are simple and intuitive to derive, well within junior-high-school mathematics, and much easier to follow than, say, the Lagrangian dual problem in SVMs. First look at the following transformation:
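Writing I_j for the set of samples that fall into leaf j, the transformation is:

Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)
          = \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) \omega_j + \frac{1}{2} \left(\sum_{i \in I_j} h_i + \lambda\right) \omega_j^2 \right] + \gamma T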

The approximate equality in the first step comes, of course, from the second-order Taylor expansion in Formula 3, with the constant part dropped. Note that this summation is taken over samples: i indexes the samples and n is the total number of samples. The second step regroups the sum by leaf node, aggregating all samples that fall into the same leaf. The prediction for every sample in a given leaf is simply that leaf's weight ω_j, and the inner Σ sums over the samples inside each leaf.

With this approximate expansion and the per-leaf aggregation, the following formula can be derived:
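In compact form:

Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2}\left(H_j + \lambda\right) \omega_j^2 \right] + \gamma T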

where G_j and H_j are the sums of the first and second derivatives over all samples in the j-th leaf node, namely:
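That is:

G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i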

The objective above can be viewed as the sum of T univariate quadratic expressions, each with ω_j as its variable. Minimizing an expression of the form f(x) = ax^2 + bx + c is a junior-high-school exercise, so the optimal ω_j and the corresponding minimum value of the loss follow immediately:
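Setting the derivative of each quadratic with respect to ω_j to zero gives:

\omega_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T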

Each of these univariate quadratics is guaranteed to have a minimum, because the coefficient of its quadratic term, (1/2)(H_j + λ), is always positive!

Formula 5 - The split gain of the decision tree

Formula 4 solves for the optimal leaf weights, but it actually skips over a prior question: how should an internal node of the decision tree be split? Splitting an internal node can be further divided into two sub-problems:

① Which feature should be chosen for the split?

② Which threshold should be used to divide the samples into the left and right subtrees?

The first problem is easy: the simplest approach, and the one still used, is to enumerate all features one by one and compare which feature brings the largest gain.

For the second question, the optimal split threshold is likewise found by enumeration, which can itself be broken down into two further questions:

i) Which candidate split thresholds to choose?

ii) How to measure which split threshold is better?

Choosing candidate split thresholds involves quite a few tricks; both XGBoost and LightGBM use histogram-based methods to narrow down the candidate split points, and the details are beyond the scope of this post. How to measure which split threshold is better is the more important optimization problem, and the conclusion from Formula 4, the minimum leaf loss under the optimal weight, is exactly what is needed. Based on it, the procedure for evaluating a split threshold is as follows:

  • If the node is not split, i.e. it is treated as a leaf node, its current minimum loss can be computed;

  • For the chosen feature and threshold, all samples at the current node are divided into the left and right subtrees, and the minimum losses of the left and right children can then be computed;

So, does splitting the node, which used to be a single leaf, into left and right child leaves actually reduce the loss? Just subtract the loss after the split from the loss before it! And why does the γT part become -γ after the subtraction? Because before the split this term corresponds to 1 leaf node and after the split to 2 leaf nodes, so the difference of the two γT terms is -γ.
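Putting this together, the split gain has the standard form from the XGBoost documentation, with L and R denoting the left and right children produced by the split:

Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma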

That covers the derivation of the core formulas in XGBoost. Once these five formulas are understood, you can basically understand how XGBoost is designed and implemented. Of course, XGBoost's power and clever design go well beyond the algorithmic principles above: there are many practical tricks and engineering optimizations, and they are what give XGBoost its scalability. For details, see the paper "XGBoost: A Scalable Tree Boosting System".

02 Reading the source code to understand the underlying data structures

The first part covered the core formulas of XGBoost. This part briefly looks at the underlying data structure design. The reason for adding it is some upcoming research work of my own, for which I needed to understand how XGBoost stores all of its base learners, that is, the training result of each decision tree.

To understand how the trained decision trees are stored, we look at the source code of the classifier's training routine. A quick inspection locates the following code:

[Screenshot: the fit source code of the sklearn-style XGBClassifier]

In other words, the result of training an XGBClassifier is saved in its _Booster attribute.

Note that the above is the sklearn-style interface provided by xgboost; under the hood, its native training routine calls the xgboost.train function to do the actual training. Both regression and classification tasks go through this same function; the task types are distinguished only by different objective functions.
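For reference, here is a minimal sketch of the native interface; the dataset and parameter values are only illustrative:

from sklearn.datasets import load_iris
import xgboost as xgb

# binary subset of iris, as in the sklearn-style example below
X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]

# wrap the data in a DMatrix and train via the native entry point;
# the objective string is what distinguishes the task type
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=100)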

To examine the _Booster attribute further, let's actually train an XGBoost binary classifier with the following simple example:

from sklearn.datasets import load_iris
from xgboost import XGBClassifier


X, y = load_iris(return_X_y=True)
# The original iris dataset has three classes; keep only two of them to turn it into a binary classification problem
X = X[y<2]
y = y[y<2]


xgb = XGBClassifier(use_label_encoder=False)
xgb.fit(X, y, eval_metric='logloss')

Then inspect this _Booster object with dir():
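A quick way to reproduce the check below (get_booster() returns the same Booster object that is held in _Booster):

booster = xgb.get_booster()
# list the public attributes and methods of the Booster object
print([name for name in dir(booster) if not name.startswith("_")])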

[Screenshot: dir() output of the Booster object]

This _Booster attribute is in fact an instance of a class defined in xgboost, and the results above can also be obtained by looking directly at the definition of the Booster class. Among the names in the dir() output, several methods are worth attention:

  • save_model: stores the trained xgboost model to a file. Conveniently, since version 1.0.0 xgboost supports saving directly in JSON format, which is much friendlier than the pickle format and greatly improves readability

[Screenshot: the save_model method]

  • load_model: where there is save_model there must be load_model; the two are inverse operations, i.e. load_model reads the JSON file written by save_model back into an xgboost model

  • dump_model: dump also carries the meaning of writing out; for example, the read and write functions in the json module are load and dump. In xgboost, the difference between dump_model and save_model is that the output of dump_model is meant for human reading, and the process is one-way: a dumped model cannot be loaded back;

  • get_dump: similar to dump_model, except that instead of writing a file it returns the dump as strings, one per tree;

  • trees_to_dataframe: exactly what the name suggests: it converts the information of all trained trees into a DataFrame.
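For concreteness, here is a minimal sketch of calling these methods on the model trained above; the file names are arbitrary:

booster = xgb.get_booster()           # the Booster held in _Booster
booster.save_model("xgb_model.json")  # JSON format, can be loaded back with load_model
booster.dump_model("xgb_model.txt")   # human-readable text, one-way
tree_dumps = booster.get_dump()       # list of strings, one per tree
df = booster.trees_to_dataframe()     # all trees flattened into a single DataFrame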

First, look at the result of trees_to_dataframe:

[Screenshot: the DataFrame returned by trees_to_dataframe]

Judging from the column names, the meaning of every field except the last two, Cover and Category, is quite clear, so I won't explain them further.

Next, explore the results of save_model and dump_model. Since the output of dump_model is meant to be human-readable, let's look at it first:

[Screenshot: excerpt of the text file written by dump_model, showing the first three trees]

The screenshot above shows the first three trees from the text file produced by dump_model. As you can see, dump_model keeps only the split information of each decision tree. Taking the first tree, booster[0], as an example: the tree has three nodes, the indentation expresses the parent-child relationships, and the label of each internal node identifies the chosen split feature and the corresponding threshold, while the value after a leaf node is not actually its weight.

Next, explore the JSON file produced by save_model, starting with the overall structure of the JSON:

[Screenshot: top-level structure of the JSON file written by save_model]

The trees part is a list of 100 items, corresponding to the information of the 100 base learners. Looking further, much like the array-based tree representation defined in sklearn, the node information of each decision tree here is also array-based: the i-th value of each attribute describes the corresponding attribute of the i-th node. The main fields and their meanings are as follows:

[Screenshot: the main per-node fields in the JSON model and their meanings]

It is worth pointing out that, by comparing the values of the three attributes left_children, right_children, and parents, it is easy to infer that xgboost numbers tree nodes in level-order (breadth-first) traversal, which differs from the preorder traversal used in sklearn.
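As a quick check, the node arrays can be read back from the JSON file saved above; the nesting used here follows the layout in the screenshot and may differ slightly between xgboost versions:

import json

with open("xgb_model.json") as f:
    model = json.load(f)

# per-tree node arrays: the i-th entry of each array describes node i
tree0 = model["learner"]["gradient_booster"]["model"]["trees"][0]
print(tree0["left_children"])   # index of each node's left child (-1 for leaves)
print(tree0["right_children"])  # index of each node's right child (-1 for leaves)
print(tree0["parents"])         # index of each node's parent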

The above is only a brief exploration of how the base learner information is stored in xgboost. If you are interested, try more experiments and read the corresponding source code; I hope it helps in understanding how xgboost works under the hood.


Origin blog.csdn.net/weixin_43841688/article/details/121759953