In-depth analysis of GBDT binary classification algorithm (with code implementation)

Table of contents:

  1. Introduction to GBDT classification algorithm

  2. GBDT binary classification algorithm

  • 2.1 Log loss function of logistic regression

  • 2.2 Principle of GBDT binary classification

  3. Example of GBDT binary classification algorithm

  4. Hand-coding the GBDT binary classification algorithm

  • 4.1 Implementing the GBDT binary classification algorithm with Python 3

  • 4.2 Implementing the GBDT binary classification algorithm with sklearn

  5. Common loss functions for GBDT classification tasks

  6. Summary

  7. References

An overview of the main content of this article:

1 Introduction to GBDT classification algorithm

Whether GBDT is used for classification or regression, it always uses CART regression trees. GBDT does not switch to classification trees just because the task happens to be classification. The core reason is that each round of GBDT training fits the negative gradient of the model obtained in the previous rounds. This requires that, at each iteration, subtracting the weak learner's output from the true label is meaningful, i.e. the residual is meaningful. If the chosen weak learner were a classification tree, subtracting categories would be meaningless. There are two ways to deal with this:

  • Use the exponential loss function, in which case GBDT degenerates into AdaBoost and can solve classification problems;

  • Use a log-likelihood loss function similar to that of logistic regression, so that the difference between the predicted probability and the true label can be used as the residual to fit;

Let's look at how GBDT handles classification through the binary classification problem.

2 GBDT binary classification algorithm

2.1 Log loss function of logistic regression

The prediction function of logistic regression is:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

The value of $h_\theta(x)$ has a special meaning: it represents the probability that the result is 1. Therefore, the probabilities that the classification result for input $x$ is category 1 and category 0 are:

$$P(Y=1|x;\theta) = h_\theta(x)$$

$$P(Y=0|x;\theta) = 1 - h_\theta(x)$$

Below we derive the log loss function of logistic regression from the above formulas. The two cases above can be combined into a single expression:

$$P(Y=y|x;\theta) = h_\theta(x)^{y}\left(1 - h_\theta(x)\right)^{1-y}$$

The likelihood function is then:

$$L(\theta) = \prod_{i=1}^{N} h_\theta(x_i)^{y_i}\left(1 - h_\theta(x_i)\right)^{1-y_i}$$

Because $L(\theta)$ and $\log L(\theta)$ reach their extreme values at the same place, we take the log-likelihood function:

$$l(\theta) = \sum_{i=1}^{N}\left[y_i \log h_\theta(x_i) + (1-y_i)\log\left(1 - h_\theta(x_i)\right)\right]$$

Maximum likelihood estimation finds the $\theta$ that maximizes $l(\theta)$. Taking its negative gives the log loss, which can be minimized by gradient descent to obtain the parameters $\theta$:

$$J(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log h_\theta(x_i) + (1-y_i)\log\left(1 - h_\theta(x_i)\right)\right]$$
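To make the loss concrete, here is a minimal sketch (not part of the original article) that evaluates this negative log-likelihood for a batch of predictions; the function name and the example numbers are purely illustrative.

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Negative log-likelihood (log loss) for binary labels y in {0, 1}."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Good predictions give a small loss, poor predictions a large one.
y = np.array([0, 0, 1, 1])
print(log_loss(y, np.array([0.1, 0.2, 0.8, 0.9])))  # small
print(log_loss(y, np.array([0.9, 0.8, 0.2, 0.1])))  # large
```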

2.2 Principle of GBDT binary classification

The loss function of a single sample in logistic regression can be expressed as:

$$L(\theta) = -y \log \hat{y} - (1-y)\log(1-\hat{y})$$

where $\hat{y} = h_\theta(x)$ is the prediction of logistic regression. Now assume that the strong learner after the $m$-th iteration is $F_m(x)$ and that $\hat{y} = \frac{1}{1+e^{-F_m(x)}}$. Substituting this into the formula above, the loss function can be written as:

$$L(y, F_m(x)) = y \log\left(1+e^{-F_m(x)}\right) + (1-y)\left[F_m(x) + \log\left(1+e^{-F_m(x)}\right)\right]$$

The response value corresponding to the $m$-th tree (the negative gradient of the loss function, i.e. the pseudo-residual) is:

$$r_{m,i} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)} = y_i - \frac{1}{1+e^{-F_{m-1}(x_i)}}$$

For the generated decision tree, the best residual fitting value of each leaf node is:

$$c_{m,j} = \arg\min_{c}\sum_{x_i \in R_{m,j}} L\left(y_i, F_{m-1}(x_i) + c\right)$$

Since this has no closed-form solution, we generally use an approximate value instead:

$$c_{m,j} = \frac{\sum_{x_i \in R_{m,j}} r_{m,i}}{\sum_{x_i \in R_{m,j}} (y_i - r_{m,i})(1 - y_i + r_{m,i})}$$

Supplement: derivation of this approximation.

Suppose there is only one sample. Its loss is:

$$L(y, F(x)) = -y \log \hat{y} - (1-y)\log(1-\hat{y})$$

Let $\hat{y} = \frac{1}{1+e^{-F(x)}}$; then, written as a function of $F(x)$:

$$L(y, F(x)) = y \log\left(1+e^{-F(x)}\right) + (1-y)\left[F(x) + \log\left(1+e^{-F(x)}\right)\right]$$

The first derivative with respect to $F(x)$ is:

$$L'(y, F(x)) = \hat{y} - y$$

The second derivative is:

$$L''(y, F(x)) = \hat{y}(1-\hat{y})$$

The second-order Taylor expansion of $L(y, F(x) + c)$ in $c$ is:

$$L(y, F(x) + c) \approx L(y, F(x)) + L'(y, F(x))\,c + \frac{1}{2}L''(y, F(x))\,c^2$$

Taking the value of $c$ that minimizes this second-order expression gives:

$$c = -\frac{L'(y, F(x))}{L''(y, F(x))} = \frac{y - \hat{y}}{\hat{y}(1-\hat{y})} = \frac{r}{(y-r)(1-y+r)}$$

where $r = y - \hat{y}$ is the pseudo-residual; this is exactly the single-sample form of the approximation above.
The complete process of GBDT binary classification algorithm is as follows:

(1) Initialize the first weak learner:

$$F_0(x) = \log\frac{p_1}{1 - p_1}$$

where $p_1$ is the proportion of positive samples ($y = 1$) in the training set; this uses prior information to initialize the learner.

(2) For $m = 1, 2, \ldots, M$, build the classification-regression trees:

a) For $i = 1, 2, \ldots, N$, calculate the response value corresponding to the $m$-th tree (the negative gradient of the loss function, i.e. the pseudo-residual):

$$r_{m,i} = y_i - \frac{1}{1+e^{-F_{m-1}(x_i)}}$$

b) For $i = 1, 2, \ldots, N$, fit a CART regression tree to the data $\{(x_i, r_{m,i})\}$ to obtain the $m$-th regression tree, whose leaf node regions are $R_{m,j}$, $j = 1, 2, \ldots, J_m$, where $J_m$ is the number of leaf nodes of the $m$-th regression tree.

c) For each leaf node region $j = 1, 2, \ldots, J_m$, calculate the best fitting value:

$$c_{m,j} = \frac{\sum_{x_i \in R_{m,j}} r_{m,i}}{\sum_{x_i \in R_{m,j}} (y_i - r_{m,i})(1 - y_i + r_{m,i})}$$

d) Update the strong learner:

$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} c_{m,j}\, I(x \in R_{m,j})$$

(3) Obtain the expression of the final strong learner:

$$F_M(x) = F_0(x) + \sum_{m=1}^{M}\sum_{j=1}^{J_m} c_{m,j}\, I(x \in R_{m,j})$$

It can be seen from the above process that, apart from the negative gradient determined by the loss function and the calculation of the best residual fitting values of the leaf nodes, binary GBDT classification is basically the same as the GBDT regression algorithm. So how exactly does binary GBDT classify?

Rearranging the logistic regression formula, we get $\log\frac{P(Y=1|x)}{1-P(Y=1|x)} = \theta^T x$, where $P(Y=1|x)$ is the probability of predicting a given input $x$ as a positive sample. Logistic regression uses a linear model to fit the log odds $\log\frac{P(Y=1|x)}{1-P(Y=1|x)}$ of the event $Y=1|x$. The binary GBDT classification algorithm follows the same idea: it uses a series of gradient boosting trees to fit this log odds, so the classification model can be expressed as:

$$P(Y=1|x) = \frac{1}{1 + e^{-F_M(x)}}$$
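To tie the whole procedure together, here is a minimal sketch of the training loop, assuming sklearn's DecisionTreeRegressor as the base learner. It is an illustrative reimplementation of steps (1)-(3), not the original article's code (that code is in the GitHub repository linked below); the function names are made up, and overwriting tree_.value to install the approximate leaf values relies on sklearn internals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(F):
    return 1.0 / (1.0 + np.exp(-F))

def gbdt_binary_fit(X, y, n_trees=5, learning_rate=0.1, max_depth=3):
    """Minimal GBDT binary classifier with log loss, following steps (1)-(3)."""
    p1 = y.mean()
    F0 = np.log(p1 / (1 - p1))            # step (1): initialize with the log odds
    F = np.full(len(y), F0)
    trees = []
    for _ in range(n_trees):              # step (2)
        r = y - sigmoid(F)                # a) pseudo-residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, r)                    # b) fit a CART regression tree to r
        # c) replace each leaf's mean of residuals by sum(r) / sum((y-r)(1-y+r))
        leaf = tree.apply(X)
        for j in np.unique(leaf):
            idx = leaf == j
            num = r[idx].sum()
            den = ((y[idx] - r[idx]) * (1 - y[idx] + r[idx])).sum()
            tree.tree_.value[j, 0, 0] = num / den
        F += learning_rate * tree.predict(X)   # d) update the strong learner
        trees.append(tree)
    return F0, trees

def gbdt_binary_predict_proba(X, F0, trees, learning_rate=0.1):
    """Step (3): sum the trees' outputs and map the score to a probability."""
    F = F0 + learning_rate * sum(t.predict(X) for t in trees)
    return sigmoid(F)
```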

 

3 Examples of GBDT binary classification algorithm

3.1 Data set introduction

The training set is shown in the following table. Each sample has two features, age and weight. Height is the classification target, with 1.5 meters as the boundary: samples taller than 1.5 meters are labeled 1, and samples not taller than 1.5 meters are labeled 0.

| Sample | Age | Weight | Label (height > 1.5 m) |
| --- | --- | --- | --- |
| 1 | 5 | 20 | 0 |
| 2 | 7 | 30 | 0 |
| 3 | 21 | 70 | 1 |
| 4 | 30 | 60 | 1 |

The test data is shown in the following table. There is only one sample, with age 25 and weight 65. We use the GBDT model trained on the training set to predict whether this sample's height is greater than 1.5 meters.

| Sample | Age | Weight | Label (height > 1.5 m) |
| --- | --- | --- | --- |
| 5 | 25 | 65 | ? |

3.2 Model training stage

Parameter settings:

  • Learning rate: learning_rate = 0.1

  • Number of iterations: n_trees = 5

  • The depth of the tree: max_depth = 3

1) Initialize the weak learner. Since the training set contains 2 positive and 2 negative samples, $p_1 = 0.5$ and:

$$F_0(x) = \log\frac{p_1}{1-p_1} = \log\frac{0.5}{0.5} = 0$$

2) For $m = 1, 2, \ldots, M$, build the classification-regression trees.

Since we set the number of iterations n_trees = 5, we have M = 5, i.e. m = 1, 2, 3, 4, 5.

First calculate the negative gradient. According to the log loss above, the negative gradient (pseudo-residual, approximate residual) is:

$$r_{m,i} = y_i - \frac{1}{1+e^{-F_{m-1}(x_i)}}$$

We know that the key of gradient boosting algorithms is to use the value of the negative gradient of the loss function as an approximation of the residual in the regression boosting-tree algorithm, and to fit a regression tree to it. Here, for convenience, we simply call the negative gradient the residual.

Since $F_0(x) = 0$, the initial predicted probability of every sample is 0.5, so the residual of each sample is $y_i - 0.5$. The residuals are listed below:

| Sample | Age | Weight | Label | Residual |
| --- | --- | --- | --- | --- |
| 1 | 5 | 20 | 0 | -0.5 |
| 2 | 7 | 30 | 0 | -0.5 |
| 3 | 21 | 70 | 1 | 0.5 |
| 4 | 30 | 60 | 1 | 0.5 |

These residuals are then used as the labels of the samples to train the weak learner, i.e. the data in the table above with the residual column as the target.
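As a quick sanity check of these numbers, a small illustrative snippet (reusing the same toy data as the sklearn example in Section 4.2) computes the initial learner and the first-round residuals:

```python
import numpy as np

age    = np.array([5, 7, 21, 30])
weight = np.array([20, 30, 70, 60])
y      = np.array([0, 0, 1, 1])

p1 = y.mean()                          # proportion of positive samples = 0.5
F0 = np.log(p1 / (1 - p1))             # initial learner: log odds = log(1) = 0
p  = 1.0 / (1.0 + np.exp(-F0))         # initial predicted probability = 0.5
residuals = y - p                      # pseudo-residuals: [-0.5, -0.5, 0.5, 0.5]
print(F0, residuals)
```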

Then find the best split node of the regression tree by traversing every possible value of each feature — the age values from 5 to 30 and the weight values from 20 to 70 — computing for each candidate split the total square error (Square Error) of the two groups after splitting, i.e. the square loss of the left node plus the square loss of the right node, and taking the split that minimizes the total square loss as the best split node.

For example, using age 7 as the split point: samples with age less than 7 go to the left node, and samples with age greater than or equal to 7 go to the right node. The left node then contains sample 1, and the right node contains samples 2, 3 and 4. All possible splits and their losses can be enumerated in the same way, as in the short sketch below.
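The split search itself can be sketched as follows (an illustrative snippet with the first-round residuals hard-coded); it enumerates the candidate split values of each feature and keeps the split with the smallest total square error:

```python
import numpy as np

age       = np.array([5, 7, 21, 30])
weight    = np.array([20, 30, 70, 60])
residuals = np.array([-0.5, -0.5, 0.5, 0.5])   # first-round pseudo-residuals

def total_square_error(values):
    """Square error of a group around its own mean (0 for an empty group)."""
    return ((values - values.mean()) ** 2).sum() if len(values) else 0.0

best = None
for name, feature in (("age", age), ("weight", weight)):
    for threshold in np.unique(feature):
        left, right = residuals[feature < threshold], residuals[feature >= threshold]
        if len(left) == 0 or len(right) == 0:
            continue                            # skip degenerate splits
        loss = total_square_error(left) + total_square_error(right)
        if best is None or loss < best[0]:
            best = (loss, name, threshold)
print(best)   # the smallest total square loss and the corresponding split
```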

Among all the split points above, the minimum total square loss is 0, and it is reached at two split points: age 21 and weight 60. We therefore randomly choose one of them as the split point; here we choose age 21. Our first tree now looks like this:

The tree depth we set in the parameters is max_depth = 3, but so far the tree only has a depth of 2, so in principle it should be split again, this time splitting the left and right nodes separately. However, when generating the tree, we set three conditions that must all hold for a node to keep growing:

  • The maximum depth has not been reached. The tree depth is set to 3, meaning the tree may grow to 3 layers;

  • The number of samples at the node >= min_samples_split;

  • The label values of the samples at this node are not all the same (if they are all the same, the node is already separated well enough and there is no need to split further). In this example, the residual labels within each node are already identical, so this condition fails and the tree has only 2 layers.

In the end our first regression tree looks like this:

At this point our tree satisfies the settings, and we need to do one more thing: assign a fitted value to each leaf node of this tree to fit the residuals.

According to the above split result, and for convenience of presentation, we number the leaf nodes from left to right as $R_{1,1}$ and $R_{1,2}$. Their fitted values are calculated as follows:

$$c_{1,1} = \frac{\sum_{x_i \in R_{1,1}} r_{1,i}}{\sum_{x_i \in R_{1,1}} (y_i - r_{1,i})(1 - y_i + r_{1,i})} = \frac{-0.5 - 0.5}{0.5 \times 0.5 + 0.5 \times 0.5} = -2$$

$$c_{1,2} = \frac{\sum_{x_i \in R_{1,2}} r_{1,i}}{\sum_{x_i \in R_{1,2}} (y_i - r_{1,i})(1 - y_i + r_{1,i})} = \frac{0.5 + 0.5}{0.5 \times 0.5 + 0.5 \times 0.5} = 2$$

The first tree at this point looks like this:

Next, we update the strong learner. We need to use the learning rate learning_rate = 0.1, denoted lr. The update formula is:

$$F_1(x) = F_0(x) + lr \times \sum_{j=1}^{2} c_{1,j}\, I(x \in R_{1,j})$$

Why use a learning rate? This is the idea of Shrinkage. If the full fitted value were added each time, i.e. the learning rate were 1, the model would easily take too large a step in one iteration and cause GBDT to overfit.
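As a tiny illustration of the shrinkage effect, using the leaf values derived above for this example, compare the updated scores with and without the learning rate:

```python
import numpy as np

F0 = np.zeros(4)                                 # F_0(x_i) = 0 for all four training samples
leaf_value = np.array([-2.0, -2.0, 2.0, 2.0])    # c_{1,j} of the leaf each sample falls into
lr = 0.1

F1_with_shrinkage = F0 + lr * leaf_value         # [-0.2, -0.2, 0.2, 0.2]: a small, cautious step
F1_without_shrinkage = F0 + leaf_value           # [-2, -2, 2, 2]: one big, overconfident step
print(F1_with_shrinkage, F1_without_shrinkage)
```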

Repeat this step until m = 5; in the end, 5 trees have been built.

The following shows the final structure of each tree. These figures were generated by the code on my GitHub; interested readers can run the code themselves.

https://github.com/Microstrong0305/WeChat-zhihu-csdnblog-code/tree/master/Ensemble%20Learning/GBDT_GradientBoostingBinaryClassifier

The first tree:

The second tree:

The third tree:

The fourth tree:

The fifth tree:

3) Obtain the final strong learner:

$$F_5(x) = F_0(x) + 0.1 \times \sum_{m=1}^{5}\sum_{j=1}^{J_m} c_{m,j}\, I(x \in R_{m,j})$$

3.3 Model prediction stage

  • In $F_0(x)$: the initial score is 0;

  • In the first tree: the age of the test sample (25) is greater than the age at the split node, so the sample falls into the right leaf and is predicted as that leaf's fitted value;

  • In the second tree: the age of the test sample is greater than the age at the split node, so it is predicted as the right leaf's fitted value;

  • In the third tree: the age of the test sample is greater than the age at the split node, so it is predicted as the right leaf's fitted value;

  • In the fourth tree: the age of the test sample is greater than the age at the split node, so it is predicted as the right leaf's fitted value;

  • In the fifth tree: the age of the test sample is greater than the age at the split node, so it is predicted as the right leaf's fitted value;

The final prediction: summing $F_0(x)$ and the shrunken outputs of the five trees gives $F_5(x) > 0$, so $P(Y=1|x) = \frac{1}{1+e^{-F_5(x)}} > 0.5$, and the test sample is predicted as class 1, i.e. its height is greater than 1.5 meters.
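In code, the prediction stage simply accumulates the shrunken leaf values of the leaves the test sample falls into and maps the resulting score through the sigmoid function. The snippet below is an illustrative sketch only: the leaf values are placeholders, not the actual fitted values of the five trained trees.

```python
import numpy as np

F0 = 0.0                                  # initial learner from the training stage
lr = 0.1
leaf_values = [2.0, 1.8, 1.6, 1.5, 1.4]   # placeholder leaf values the test sample falls into
F = F0 + lr * np.sum(leaf_values)         # final score F_5(x) for the test sample
p = 1.0 / (1.0 + np.exp(-F))              # P(Y = 1 | x)
print(p, int(p > 0.5))                    # probability and predicted class (1 if p > 0.5)
```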

4 Hand-coding the GBDT binary classification algorithm

All the data sets and code for this article are on GitHub: https://github.com/Microstrong0305/WeChat-zhihu-csdnblog-code/tree/master/Ensemble%20Learning

4.1 Implementing GBDT binary classification algorithm with Python3

Required Python libraries:

pandas, PIL, pydotplus, matplotlib

The pydotplus library automatically calls Graphviz, so you need to download graphviz-2.38.msi from the Graphviz official website and install it, then add the bin directory under the installation directory to the system environment variables, and finally restart the computer.

Since the code for implementing the GBDT binary classification algorithm with Python 3 is fairly long, I will not list the detailed code here. Interested readers can take a look at it on GitHub: https://github.com/Microstrong0305/WeChat-zhihu-csdnblog-code/tree/master/Ensemble%20Learning/GBDT_GradientBoostingBinaryClassifier

4.2 Using sklearn to implement GBDT binary classification algorithm

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


'''
Parameter tuning:
loss: the loss function. Two options: 'deviance' and 'exponential'. 'deviance' uses the log-likelihood loss; 'exponential' uses the exponential loss, which makes the model equivalent to AdaBoost.
n_estimators: maximum number of weak learners, default 100. Watch out for overfitting or underfitting when tuning; usually tuned together with learning_rate.
learning_rate: the step size, i.e. the shrinkage coefficient applied to each weak learner, default 0.1, in the range (0, 1]. A value of 1 means no shrinkage. A smaller learning_rate generally requires more iterations.
subsample: subsampling ratio, default 1, in the range (0, 1]. A value of 1 means no subsampling; a value below 1 means each weak learner is built on a randomly sampled subset of that proportion. This helps prevent overfitting, but too small a value causes high variance.
init: the initial weak learner. If not set, the weak learner built in the first iteration is used. If there is no prior information, it can be left as the default.

Since GBDT uses CART regression trees, the following parameters tune the weak learners, mainly to prevent overfitting:
max_features: maximum number of features considered when splitting, default None (all features). Possible values include 'log2', 'auto' and 'sqrt'.
max_depth: maximum depth of each CART tree, default None.
min_samples_split: minimum number of samples required to split a node. If a node has fewer samples, it becomes a leaf and is not split further. Default 2.
min_samples_leaf: minimum number of samples in a leaf node. If a leaf has fewer samples, it is pruned together with its sibling. Default 1.
min_weight_fraction_leaf: minimum weighted fraction of samples in a leaf node. If it is smaller than this value, the leaf is pruned together with its sibling. Usually used when samples have varying weights. Default 0.
max_leaf_nodes: maximum number of leaf nodes.
'''


gbdt = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=5, subsample=1
                                  , min_samples_split=2, min_samples_leaf=1, max_depth=3
                                  , init=None, random_state=None, max_features=None
                                  , verbose=0, max_leaf_nodes=None, warm_start=False
                                  )


train_feat = np.array([[1, 5, 20],
                       [2, 7, 30],
                       [3, 21, 70],
                       [4, 30, 60],
                       ])
train_label = np.array([[0], [0], [1], [1]]).ravel()


test_feat = np.array([[5, 25, 65]])
test_label = np.array([[1]])
print(train_feat.shape, train_label.shape, test_feat.shape, test_label.shape)


gbdt.fit(train_feat, train_label)
pred = gbdt.predict(test_feat)


total_err = 0
for i in range(pred.shape[0]):
    print(pred[i], test_label[i])
    err = (pred[i] - test_label[i]) / test_label[i]
    total_err += err * err
print(total_err / pred.shape[0])

The main difficulty in implementing the GBDT binary classification algorithm with sklearn's GBDT library is tuning the parameters described in the comments above.

GitHub address to implement GBDT binary classification algorithm with sklearn: https://github.com/Microstrong0305/WeChat-zhihu-csdnblog-code/tree/master/Ensemble%20Learning/GBDT_Classification_sklearn

5 Common loss functions for GBDT classification tasks

For the GBDT classification algorithm, the loss function is generally one of two kinds: the log loss function or the exponential loss function.

(1) If the exponential loss function is used, the loss function expression is:

$$L(y, F(x)) = e^{-y F(x)}$$

where $y \in \{-1, +1\}$. For the calculation of the negative gradient and of the best fitting values of the leaf nodes, please refer to the AdaBoost algorithm.

(2) If the log loss function is used, there are separate forms for binary classification and multi-class classification. This article has mainly introduced the binary classification loss function used by GBDT.

6 Summary

In this article, we first briefly introduced how the GBDT regression algorithm is turned into a classification algorithm; then we derived the principle of the GBDT binary classification algorithm from the log loss function of logistic regression; next, we implemented the GBDT binary classification algorithm both in plain Python 3 and with sklearn; finally, we introduced the common loss functions used in GBDT classification tasks. GBDT can handle binary classification tasks well, but is it also effective for multi-class tasks? If so, how does GBDT do multi-class classification? These questions require us to keep exploring the deeper principles of GBDT. Let's look forward to GBDT's performance on multi-class tasks!

 

Article source: https://mp.weixin.qq.com/s/XLxJ1m7tJs5mGq3WgQYvGw#
