Financial Scoring Card Project—4. Application of the GBDT Model to a Customer Attrition Early Warning Model

1. Introduction to GBDT model

  The gradient boosted decision tree (GBDT) is an ensemble model that can be used for classification, regression, and ranking. The core idea of GBDT is that the final prediction is the sum of the outputs of all the trees. The fact that GBDT can be used for classification does not mean that it sums the outputs of classification trees: the trees in GBDT are regression trees (feature selection and splitting use the squared-error minimization criterion, and the generated trees are binary), not classification trees. This is essential for understanding GBDT.
  In a gradient boosting tree, when the loss function is the squared loss, the next tree fits the residual of the previous trees (the actual value minus the current prediction). When the loss function is not the squared loss, the tree instead fits the negative gradient of the loss function.


A simple example: the true age of A is 18, but the first tree predicts 12, leaving a gap of 6 years, i.e., a residual of 6. In the second tree we therefore set A's target to 6 and learn again. If the second tree really puts A into the leaf node predicting 6, then the sum of the two trees equals A's true age. If the second tree predicts 5, A still has a residual of 1, so A's target becomes 1 in the third tree, and learning continues.

  When the loss function is the squared loss, the residual points directly toward the optimum and no gradient computation is needed. However, there are many kinds of loss functions. For losses other than the squared loss, Friedman, a leading figure in machine learning, proposed the gradient boosting algorithm: a steepest-descent approximation in which the value of the negative gradient of the loss function at the current model is used as an approximation of the residual of the boosting tree in the regression problem, and a regression tree is fitted to it.
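In standard gradient boosting notation (a compact restatement of the idea above, not a formula reproduced from the original post), the m-th tree is fitted to the pseudo-residual

$$
r_{mi} = -\left[\frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f = f_{m-1}},
$$

and for the squared loss $L\big(y, f(x)\big) = \tfrac{1}{2}\big(y - f(x)\big)^2$ this negative gradient is exactly $y_i - f_{m-1}(x_i)$, the ordinary residual.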

Features:

  • A combination (ensemble) model built from simple regression decision trees

    Advantages of decision trees:

    Strong interpretability;
    allow variable interactions;
    insensitive to outliers, missing values, and collinearity.

    Disadvantages of decision trees:

    Accuracy is not high enough;
    easy to overfit;
    computationally expensive.

  • Boosting is performed in the direction of gradient descent

  • Only accepts continuous numerical variables, so feature transformation is required (categorical variables must be encoded as numerical ones; see the sketch below)
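A minimal sketch of such a transformation, assuming a pandas DataFrame with a hypothetical categorical column (the column names here are made up for illustration and are not from the original project):

import pandas as pd

# Hypothetical raw data with one categorical column
raw = pd.DataFrame({
    'balance': [1200.5, 80.0, 4300.2],
    'channel': ['branch', 'online', 'mobile'],   # categorical feature
})

# One-hot encode the categorical column so that every feature fed to GBDT is numeric
raw_encoded = pd.get_dummies(raw, columns=['channel'])
print(raw_encoded.head())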

Advantages:

  • High accuracy
  • Not easy to overfit

1. The GBDT structure used in this case

(Figure: GBDT structure of the case.)

2. Commonly used GBDT parameters

For a more detailed analysis of the parameters, see the official sklearn documentation.
Common parameters of the GBDT framework:

n_estimators: the number of trees (weak learners), K

learning_rate: the shrinkage coefficient ν applied to each weak learner, also called the step size. A smaller ν means that more weak-learner iterations are needed.
The parameters n_estimators and learning_rate must be tuned together. You can start from a relatively small ν; sklearn's default is 0.1. The two parameters work in opposite directions: when one is larger, the other should be smaller.

subsample: the subsampling rate (sampling without replacement), recommended between [0.5, 0.8]. The default is 1.0, i.e., no subsampling is used.

init: the initial weak learner. It is generally used when there is prior knowledge about the data, or when some fitting has already been done beforehand.

loss: the loss function used in the GBDT algorithm

max_features : {‘auto’, ‘sqrt’, ‘log2’}, int or float, default=None
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If ‘auto’, then max_features=sqrt(n_features).
If ‘sqrt’, then max_features=sqrt(n_features).
If ‘log2’, then max_features=log2(n_features).
If None, then max_features=n_features.

Parameters of the weak classification tree:

max_features: The maximum number of features considered when dividing

max_depth: The maximum depth of the decision tree

min_samples_split: The minimum number of samples required for subdividing internal nodes. The default is 2. If the sample size is not large, you do not need to care about this value. If the sample size is very large, it is recommended to increase this value

min_samples_leaf: minimum number of samples of leaf nodes

min_weight_fraction_leaf: the minimum weighted fraction of samples required at a leaf node. The default is 0, i.e., sample weights are not considered. Generally, if many samples have missing values, or the class distribution of the samples is heavily skewed, sample weights are introduced, and this value then needs attention.

max_leaf_nodes: The maximum number of leaf nodes. By limiting the maximum number of leaf nodes, overfitting can be prevented

min_impurity_split: minimum impurity of node division
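As a quick illustration of how the framework and tree parameters above are passed to sklearn (the values here are placeholders for illustration, not the tuned settings used later in this post):

from sklearn.ensemble import GradientBoostingClassifier

# Illustrative configuration combining framework parameters and tree parameters
gbdt_demo = GradientBoostingClassifier(
    n_estimators=100,        # number of trees (weak learners), K
    learning_rate=0.1,       # shrinkage coefficient (step size)
    subsample=0.8,           # row subsampling rate, sampled without replacement
    max_depth=5,             # maximum depth of each regression tree
    min_samples_split=200,   # minimum samples required to split an internal node
    min_samples_leaf=50,     # minimum samples required at a leaf node
    max_features='sqrt',     # number of features considered at each split
    random_state=10,
)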

2. Classifier performance index—AUC

  If you want to understand AUC and the ROC curve, you must first thoroughly understand the concept of the confusion matrix!
The confusion matrix involves the concepts Positive, Negative, False, and True.

  • A predicted class of 0 is Negative, and a predicted class of 1 is Positive.
  • If the prediction is wrong, it is False; if the prediction is correct, it is True.

Combine the above concepts and you have a confusion matrix!
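A small, self-contained example (not from the original post) of how the confusion matrix, TPR/FPR, and AUC can be computed with sklearn.metrics on toy data:

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Toy labels and predicted probabilities, purely for illustration
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])
y_pred = (y_prob >= 0.5).astype(int)

# For binary labels, ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print('TPR (recall) =', tp / (tp + fn))
print('FPR =', fp / (fp + tn))

# AUC is the area under the ROC curve obtained by sweeping the threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print('AUC =', roc_auc_score(y_true, y_prob))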
For the detailed ROC calculation process, see the referenced blog post.

3. Application of GBDT in the customer attrition early warning model

1. Tuning process

Import modules, load data sets, split data sets

# Import modules
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn import ensemble, metrics

# Read the preprocessed dataset
modelData = pd.read_csv('data/modelData.csv', header=0)
allFeatures = list(modelData.columns)
# Remove the CUST_ID and CHURN_CUST_IND columns
allFeatures.remove('CUST_ID')
allFeatures.remove('CHURN_CUST_IND')

# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(modelData[allFeatures], modelData['CHURN_CUST_IND'], test_size=0.3,
                                                    shuffle=True)
print(y_train.value_counts())
print(y_test.value_counts())

GBDT with default parameters

# 1. Use the default model parameters
gbdt_0 = GradientBoostingClassifier(random_state=10)
gbdt_0.fit(x_train, y_train)
y_pred = gbdt_0.predict(x_test)
# predict_proba returns an array with n rows and k columns, where n is the number of samples and k the number of classes;
# each row gives the probability that the sample belongs to each class, and each row sums to 1
y_pred_prob = gbdt_0.predict_proba(x_test)[:, 1]
# %g formats a float (uses %e or %f depending on the magnitude of the value)
print('Accuracy : %.4g' % metrics.accuracy_score(y_test, y_pred))
print('AUC(Testing) : %f' % metrics.roc_auc_score(y_test, y_pred_prob))

# Accuracy and AUC on the training set
y_pred_1 = gbdt_0.predict(x_train)
y_pred_prob_1 = gbdt_0.predict_proba(x_train)[:, 1]
print('Accuracy : %.4g' % metrics.accuracy_score(y_train, y_pred_1))
print('AUC(Training) : %f' % metrics.roc_auc_score(y_train, y_pred_prob_1))

The first step of tuning:
  Start with learning_rate (the step size) and n_estimators (the number of iterations). Generally, a smaller step size is fixed and the best number of iterations is found by grid search. Here we set the learning rate to 0.1 and search the number of iterations over 20~80 to determine n_estimators.

# 2. Fix a small learning_rate and grid-search n_estimators
params_test = {'n_estimators': np.arange(20, 81, 10)}
gbdt_1 = GradientBoostingClassifier(learning_rate=0.1, min_samples_split=300, min_samples_leaf=20,
                                    max_depth=8, max_features='sqrt', subsample=0.8, random_state=10)
gs = GridSearchCV(estimator=gbdt_1, param_grid=params_test, scoring='roc_auc', cv=5)
gs.fit(x_train, y_train)
print('Best parameters:', gs.best_params_)
print('Best CV score:', gs.best_score_)

Output:
Best parameters: {'n_estimators': 80}
Best CV score: 0.9999988452344752

The second step of tuning:
After the learning rate and the number of iterations have been determined, we start tuning the decision-tree parameters. First, perform a grid search over the maximum tree depth max_depth and the minimum number of samples required to split an internal node, min_samples_split, with search ranges 3~13 and 100~800, respectively. Since min_samples_split also interacts with other tree parameters, at this step we only fix max_depth; min_samples_split will be tuned again later.

# 3. Grid-search max_depth and min_samples_split
params_test = {'max_depth': np.arange(3, 14, 1), 'min_samples_split': np.arange(100, 801, 100)}
gbdt_2 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, min_samples_leaf=20,
                                    max_features='sqrt', subsample=0.8, random_state=10)
gs = GridSearchCV(estimator=gbdt_2, param_grid=params_test, scoring='roc_auc', cv=5)
gs.fit(x_train, y_train)
print('Best parameters:', gs.best_params_)
print('Best CV score:', gs.best_score_)

The third step of tuning:

The minimum number of samples required to split an internal node, min_samples_split, and the minimum number of samples in a leaf node, min_samples_leaf, are tuned together. The search ranges are 400~1000 and 20~100, respectively.

# 4. Tune min_samples_split (minimum samples to split an internal node) and min_samples_leaf (minimum samples in a leaf) together
params_test = {'min_samples_leaf': np.arange(20, 101, 10), 'min_samples_split': np.arange(400, 1001, 100)}
gbdt_3 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_depth=9,
                                    max_features='sqrt', subsample=0.8, random_state=10)
gs = GridSearchCV(estimator=gbdt_3, param_grid=params_test, scoring='roc_auc', cv=5)
gs.fit(x_train, y_train)
print('Best parameters:', gs.best_params_)
print('Best CV score:', gs.best_score_)

Model with the parameters tuned so far

gbdt_4 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_depth=9, min_samples_leaf=70,
                                    min_samples_split=500, max_features='sqrt', subsample=0.8, random_state=10)
gbdt_4.fit(x_train, y_train)
y_pred_4 = gbdt_4.predict(x_test)
y_pred_prob_4 = gbdt_4.predict_proba(x_test)[:, 1]
print('Accuracy : %.4g' % metrics.accuracy_score(y_test, y_pred_4))
print('AUC(Testing) : %f' % metrics.roc_auc_score(y_test, y_pred_prob_4))

# Accuracy and AUC on the training set
y_pred_1 = gbdt_4.predict(x_train)
y_pred_prob_1 = gbdt_4.predict_proba(x_train)[:, 1]
print('Accuracy : %.4g' % metrics.accuracy_score(y_train, y_pred_1))
print('AUC(Training) : %f' % metrics.roc_auc_score(y_train, y_pred_prob_1))

We find that the result is not as good as with the default parameters. The main reason is that we used a subsample of only 0.8, so 20% of the data did not take part in fitting each tree.
The fourth step of tuning:
we perform a grid search on the maximum number of features, max_features.

# Grid-search max_features
param_test4 = {'max_features': range(5, 31, 2)}
gbdt_4 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_depth=9, min_samples_leaf=70,
                                    min_samples_split=500, subsample=0.8, random_state=10)
gs = GridSearchCV(estimator=gbdt_4, param_grid=param_test4, scoring='roc_auc', cv=5)
gs.fit(x_train, y_train)
print('Best parameters:', gs.best_params_)
print('Best CV score:', gs.best_score_)

The fifth step of tuning: we perform a grid search on the subsampling rate, subsample.

# Grid-search subsample
param_test5 = {'subsample': [0.6, 0.7, 0.75, 0.8, 0.85, 0.9]}
gbdt_5 = GradientBoostingClassifier(learning_rate=0.1, n_estimators=80, max_depth=9, min_samples_leaf=70,
                                    min_samples_split=500, max_features=28, random_state=10)
gs = GridSearchCV(estimator=gbdt_5, param_grid=param_test5, scoring='roc_auc', cv=5)
gs.fit(x_train, y_train)
print('Best parameters:', gs.best_params_)
print('Best CV score:', gs.best_score_)

Now that we have essentially obtained all of the tuning results, we can halve the step size and double the maximum number of iterations to improve the generalization ability of the model.
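For example (illustrative values only; the exact final settings depend on the grid-search results above), halving learning_rate from 0.1 to 0.05 and doubling n_estimators from 80 to 160 while keeping the other tuned parameters might look like this:

gbdt_final = GradientBoostingClassifier(learning_rate=0.05, n_estimators=160, max_depth=9, min_samples_leaf=70,
                                        min_samples_split=500, max_features=28, subsample=0.8, random_state=10)
gbdt_final.fit(x_train, y_train)
y_pred_prob_final = gbdt_final.predict_proba(x_test)[:, 1]
print('AUC(Testing) : %f' % metrics.roc_auc_score(y_test, y_pred_prob_final))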

2. Variable importance

  Like random forests, GBDT can also give the importance of features.

clf = GradientBoostingClassifier(learning_rate=0.05, n_estimators=70, max_depth=9, min_samples_leaf=70,
                                 min_samples_split=1000, max_features=28, subsample=0.8, random_state=10)
clf.fit(x_train, y_train)
importances = clf.feature_importances_
# Sort the features by importance in descending order (argsort sorts in ascending order by default)
features_sorted = np.argsort(-importances)
important_features = [allFeatures[i] for i in features_sorted]
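As a small follow-up (not part of the original post), the top-ranked features and their importance scores can be printed like this:

# Show the ten most important features with their importance scores
for name, score in zip(important_features[:10], importances[features_sorted][:10]):
    print('%s: %.4f' % (name, score))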


Origin blog.csdn.net/weixin_46649052/article/details/114399752