Gradient Boosting Decision Trees (GBDT) Explained, Part II: A Classification Example

In December 2006, the IEEE International Conference on Data Mining (ICDM) convened a panel of experts to select the top 10 data mining algorithms of the time; the results can be found in [1]. AdaBoost was selected as an important representative of ensemble learning. However, because of its boosting-based design, the algorithm AdaBoost is most often compared with is Gradient Boost. Among traditional machine learning algorithms it is one of the best at fitting the true distribution of the data, and in the years before deep learning became popular, Gradient Boost swept almost every kind of data mining and knowledge discovery competition.

 

Gradient Boosted Decision Trees (GBDT), also known as Gradient Tree Boosting, is an ensemble learning technique that uses decision trees as base learners within an additive boosting framework. In a previous article [2] we walked through how GBDT is used for regression; readers are encouraged to read that article first, as this one builds on it.

 

We already know that GBDT can be used for regression as well as for classification. This article focuses on how GBDT is used for classification. As an example, we will use the training set shown in the table below: each viewer's age, whether they like eating popcorn, and their favorite color together form the feature vector, and whether the viewer likes watching science fiction movies is the class label.

Just as we did when using GBDT for regression, we start by building a single-node decision tree that provides an initial estimate, or initial prediction, for every individual. In the regression setting this step was simply an average; now that the task is classification, the value used as the initial estimate is the log(odds), the same quantity used in logistic regression. For details you can refer to my earlier articles [3] and [4].

How do we use this value to make a classification prediction? Recall that in logistic regression we convert the log(odds) into a probability; specifically, the log(odds) value is passed through the logistic function. Thus, for a given viewer, the probability that they like watching science fiction movies is

p = e^log(odds) / (1 + e^log(odds)) = 1 / (1 + e^(−log(odds))) ≈ 0.7

Note that to keep the calculations simple we round to one decimal place. In fact, the log(odds) and the probability of liking science fiction are both approximately 0.7 only as a coincidence of rounding; the two values are not directly related.
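To make this first step concrete, here is a minimal Python sketch, assuming (as in the table) that four of the six viewers like science fiction and two do not:

```python
import numpy as np

# Labels for the six viewers: 1 = likes sci-fi, 0 = does not (assumed split: 4 vs. 2)
y = np.array([1, 1, 1, 1, 0, 0])

# Initial estimate: the log(odds) of liking science fiction
log_odds = np.log(y.sum() / (len(y) - y.sum()))   # log(4/2) ≈ 0.7

# Convert the log(odds) to a probability with the logistic function
prob = 1.0 / (1.0 + np.exp(-log_odds))            # ≈ 0.7 after rounding

print(round(log_odds, 1), round(prob, 1))
```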

 

Now we know that, for a given viewer, the probability of liking science fiction movies is 0.7. Since this probability is greater than 50%, we conclude that the viewer likes science fiction movies. How good is this initial guess or estimate? At the moment it is very rough: on the training set, four individuals are predicted correctly while the other two are predicted incorrectly, so the model built so far does not fit the training data particularly well. We can use pseudo-residuals to measure how far the initial estimate is from the truth. Just as in the regression case, a pseudo-residual here is the difference between the observed value and the predicted value.

As shown above, the two red points represent the two viewers in the data set who do not like science fiction; in other words, their probability of liking science fiction is 0. Similarly, the four blue points represent the four viewers who do like science fiction, i.e., their probability is 1. The red and blue dots are the actual observations, while the gray dashed line is our prediction. The pseudo-residual of each data point is then easy to compute: it is simply the observed value (0 or 1) minus the predicted probability 0.7, giving 0.3 for the viewers who like science fiction and −0.7 for those who do not.
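Continuing the sketch above, the pseudo-residuals are just the observed labels minus the current predicted probability:

```python
# Every viewer currently gets the same predicted probability, 0.7
predicted = np.full(len(y), 0.7)

# Pseudo-residual = observed - predicted: 0.3 for sci-fi fans, -0.7 otherwise
pseudo_residuals = y - predicted
print(pseudo_residuals)
```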

Just as in the regression case, the next step is to build a decision tree that uses the feature vectors to predict the pseudo-residuals given in the table, which yields the decision tree shown below. Note that we need to set an upper limit on the number of leaves allowed in the tree (we imposed a similar restriction in the regression article). In this example we set the limit to three, since the data set is very small; in practice, when working with a large data set, the maximum number of leaves is typically set somewhere between 8 and 32. When using scikit-learn, this value can be controlled through the max_leaf_nodes parameter when instantiating sklearn.ensemble.GradientBoostingClassifier.
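GradientBoostingClassifier fits these trees internally, but the idea can be sketched with a stand-alone regression tree. The feature matrix X below is purely hypothetical, a stand-in for the (age, popcorn, favorite color) table with the categorical columns already encoded as numbers:

```python
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Hypothetical encoded features: [age, likes_popcorn, likes_green]
X = np.array([
    [12, 1, 0],
    [87, 1, 1],
    [44, 0, 0],
    [19, 1, 1],
    [32, 0, 1],
    [14, 0, 0],
])

# Fit a small tree to the pseudo-residuals, capping the number of leaves at 3
tree = DecisionTreeRegressor(max_leaf_nodes=3)
tree.fit(X, pseudo_residuals)
```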

But how do we use the tree we have just built? This is where things get a bit more involved. Note that the initial estimate, log(odds) ≈ 0.7, lives on the log-odds scale, whereas the pseudo-residuals stored in the leaves of the new tree are derived from probabilities, so the two cannot simply be added together. Some conversion is essential: when using GBDT for classification, the value in each leaf node needs to be further transformed.

The derivation of this transformation involves some mathematical details, which we leave to a follow-up article. For now we simply apply the formula: the output of a leaf is the sum of the pseudo-residuals in that leaf divided by the sum, over those same residuals, of Previous Probability × (1 − Previous Probability). For example, the first leaf contains only the single value −0.7, so the summation signs in the formula can be dropped, giving

−0.7 / (0.7 × (1 − 0.7)) ≈ −3.3

Note that the previously built decision tree (the single-node tree) predicted a probability of 0.7 of liking science fiction for every viewer, so 0.7 is the value substituted for Previous Probability in the formula above. The transformed value of this leaf node is therefore −3.3.

 

Next, we compute the output of the leaf node containing the two values 0.3 and −0.7. We have

Note that because this leaf node contains two pseudo-residuals, the denominator gains one Previous Probability × (1 − Previous Probability) term for each of them. Also, at the moment the Previous Probability is the same for every pseudo-residual (since the previous tree has only a single node), but this will no longer be the case for the decision trees generated later.
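The transformation can be wrapped in a small helper function; this is a sketch of the formula stated above, not scikit-learn's internal code. The first call reproduces the single-value leaf, where the previous probability is 0.7 for the one residual:

```python
def leaf_output(residuals, previous_probs):
    """Transformed leaf value: sum of the residuals in the leaf divided by
    the sum of p * (1 - p) over the corresponding previous probabilities."""
    numerator = sum(residuals)
    denominator = sum(p * (1 - p) for p in previous_probs)
    return numerator / denominator

# Leaf containing only the residual -0.7 (previous probability 0.7)
print(round(leaf_output([-0.7], [0.7]), 1))              # -3.3

# Leaf containing the residuals 0.3 and -0.7 (both with previous probability 0.7)
print(round(leaf_output([0.3, -0.7], [0.7, 0.7]), 1))
```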

 

Similarly, we compute the output of the last leaf node:

The decision tree now looks like the following:

Combining the decision tree built at the start with the new decision tree we have just obtained, we can now predict, for each viewer, whether they like science fiction. As in the regression case, we also scale the newly obtained decision tree by a learning rate. Here we use a value of 0.8; this is relatively large and is chosen purely for the convenience of the demonstration. When using scikit-learn, the learning rate can be controlled through the learning_rate parameter when instantiating sklearn.ensemble.GradientBoostingClassifier; its default value is 0.1.

Now, for example, we compute the log(odds) for the first viewer in the table: 0.7 + (0.8 × 1.4) ≈ 1.8. Converting this with the logistic function, the probability is 1 / (1 + e^(−1.8)) ≈ 0.9.

Notice that our first estimate of the probability that this viewer likes science fiction was 0.7, and it is now 0.9; our model has clearly taken a small step in the right direction. Following the method described above, we compute the probability of liking science fiction for each of the remaining viewers one by one, and then compute the pseudo-residuals, which gives the table below.

Next, we build another decision tree based on the feature vectors to predict the pseudo-residuals given in the table, which yields the decision tree below.

We then apply the same transformation as before to the value in each leaf node, obtaining each leaf node's final output.

For example, we compute the output of the leaf node in the figure above that corresponds to the second row of the table:

0.5 / (0.5 × (1 − 0.5)) = 2

Here the value of 0.5 substituted for Previous Probability comes from the predicted probability in the table above.

The next leaf is slightly more complex to compute: as shown in the figure above, the predictions for four viewers fall into this leaf node. We have

The output of this leaf node is 0.6. In this step we can also see that the denominator term corresponding to each pseudo-residual is no longer necessarily the same, because the probabilities predicted in the previous round differ. Once the outputs of all leaf nodes have been computed, we obtain the new decision tree shown below:

Now we can combine everything obtained so far, as shown in the figure. At the beginning we had only a single-node tree; on that basis we obtained a second decision tree, and now we have a new, third decision tree. Each newly generated decision tree is scaled by the learning rate and then added to the single-node tree we started with.

Next, based on all of the existing trees, new pseudo-residuals can be computed again. Note that the first set of pseudo-residuals was computed from the initial estimate alone; the second set was computed from the initial estimate together with the first decision tree; and the third set, to be computed next, is based on the initial estimate together with the first and second decision trees. Moreover, each time a new decision tree is introduced, the pseudo-residuals gradually shrink. Smaller pseudo-residuals mean the model being constructed is steadily moving in the right direction.
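The whole procedure described so far can be summarized in a schematic loop. This is a simplified sketch under the same assumptions as the earlier snippets (binary labels y and an encoded feature matrix X), not the exact implementation scikit-learn uses:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.8
n_trees = 2

# Start from the single-node tree: the log(odds) of the positive class
F = np.full(len(y), np.log(y.sum() / (len(y) - y.sum())))

for _ in range(n_trees):
    prob = sigmoid(F)
    residuals = y - prob                      # pseudo-residuals

    tree = DecisionTreeRegressor(max_leaf_nodes=3)
    tree.fit(X, residuals)

    # Replace each leaf's raw value with the transformed output, then
    # update the log(odds) of every sample that falls into that leaf
    leaf_ids = tree.apply(X)
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        gamma = residuals[mask].sum() / (prob[mask] * (1 - prob[mask])).sum()
        F[mask] += learning_rate * gamma

print(sigmoid(F))   # current predicted probabilities on the training set
```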

 

To obtain better results, we repeatedly perform this process of computing pseudo-residuals and building a new tree, until the pseudo-residuals no longer change by more than some threshold, or until the maximum number of decision trees set in advance is reached. When using scikit-learn, the upper limit on the number of decision trees can be controlled through the n_estimators parameter when instantiating GradientBoostingClassifier; its default value is 100.
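In practice we simply let scikit-learn handle all of this. A minimal usage sketch with the hypothetical X and y from earlier and the parameters mentioned in this article:

```python
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(
    n_estimators=100,     # upper limit on the number of trees (default 100)
    learning_rate=0.1,    # default learning rate
    max_leaf_nodes=8,     # cap on leaves per tree; 8-32 is a common range
)
clf.fit(X, y)

# Predicted probability of liking science fiction for each training sample
print(clf.predict_proba(X)[:, 1])
```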

 

For demonstration purposes, suppose now that we have obtained the final GBDT and that it consists of just the three trees described above (the initial single-node tree and the two trees built afterward). Suppose a new viewer is 25 years old, likes eating popcorn, and likes the color green. Does this viewer like science fiction movies? We run this feature vector through the three decision trees to obtain the leaf values, scale them by the learning rate, and add them up: 0.7 + (0.8 × 1.4) + (0.8 × 0.6) = 2.3. Note that this is a log(odds) value, so we use the logistic function to convert it into a probability, obtaining about 0.9. We therefore conclude that this viewer likes science fiction movies. This is how a trained GBDT is used for classification.
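The arithmetic for this new viewer can be checked in a few lines; the leaf values 1.4 and 0.6 are the ones read off the two trees in the worked example:

```python
import math

learning_rate = 0.8
log_odds = 0.7 + learning_rate * 1.4 + learning_rate * 0.6   # ≈ 2.3

prob = 1.0 / (1.0 + math.exp(-log_odds))                     # ≈ 0.9
print("likes sci-fi" if prob > 0.5 else "does not like sci-fi")
```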

 

In a follow-up article, we will explain the mathematical principles behind Gradient Boost in detail; the reader will then understand more deeply why the algorithm is designed the way it is.

 


References

 

* The example in this article is primarily adapted from reference [5]; reference [6] provides a simple example of using scikit-learn's GBDT for classification.

[1] Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Philip, S.Y. and Zhou, Z.H., 2008. Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), pp.1-37.

[2] Gradient Boosting Decision Trees (GBDT) Explained: Regression

[3] blog article link

[4] blog article link

[5] Gradient Boost for Classification

[6] https://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting

 

