Step 28 Machine Learning Classification in Action: CatBoost Modeling


Foreword

This step: CatBoost modeling ~


1. Parameter tuning in Python

(1) Preparation before modeling

# Load the data and split features/label
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('X disease code fs.csv')
X = dataset.iloc[:, 1:14].values   # 13 feature columns
Y = dataset.iloc[:, 0].values      # label column

# 70/30 train-test split, then standardize the features
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=666)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

(2) CatBoost's tuning strategy

Review the parameters first (a constructor sketch mapping them onto CatBoostClassifier follows this list). The parameters that need tuning are:
depth: tree depth; default 6, maximum 16.
grow_policy: subtree growth policy. Options: SymmetricTree (the default; symmetric trees), Depthwise (grow level by level, like XGBoost), Lossguide (grow leaf by leaf, like LightGBM).
min_data_in_leaf: minimum number of samples in a leaf. Can only be used with the Lossguide and Depthwise growth policies.
max_leaves: maximum number of leaves. Values above 64 are not recommended, as they greatly slow down training. Can only be used with the Lossguide growth policy.
iterations: number of boosting iterations; default 500.
learning_rate: learning rate; default 0.03.
l2_leaf_reg: L2 regularization coefficient.
random_strength: amount of random noise added to the split-gain scores; default 1, used to avoid overfitting.
rsm: column sampling ratio; default 1, valid range (0, 1].
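
For orientation, here is a minimal sketch (my own illustration, not from the original) of how these knobs map onto the CatBoostClassifier constructor; min_data_in_leaf and max_leaves are omitted because they are invalid under the default SymmetricTree policy:

import catboost as cb

clf = cb.CatBoostClassifier(
    depth=6,                      # tree depth, default 6, max 16
    grow_policy='SymmetricTree',  # or 'Depthwise' / 'Lossguide'
    iterations=500,               # number of boosting rounds, default 500
    learning_rate=0.03,           # default 0.03
    l2_leaf_reg=3,                # L2 regularization (CatBoost's default is 3)
    random_strength=1,            # noise on split gains, default 1
    rsm=1.0,                      # column sampling ratio, range (0, 1]
    eval_metric='AUC',
)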

(3) CatBoost parameter tuning demo

(A) First, run with the default parameters

import catboost as cb
# Baseline: default CatBoost with AUC as the evaluation metric
classifier = cb.CatBoostClassifier(eval_metric='AUC')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_testprba = classifier.predict_proba(X_test)[:, 1]    # test-set probabilities
y_trainpred = classifier.predict(X_train)
y_trainprba = classifier.predict_proba(X_train)[:, 1]  # train-set probabilities
from sklearn.metrics import confusion_matrix
cm_test = confusion_matrix(y_test, y_pred)
cm_train = confusion_matrix(y_train, y_trainpred)
print(cm_train)
print(cm_test)

Although it still overfits, this already beats the earlier XGBoost and LightGBM models:
The AUC on the validation set reaches 0.8693, comparable to any previous model, but the AUC on the training set is close to 1.0 (0.9978), so overfitting remains. Keep tuning to see whether it can be improved.
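
The AUC values quoted throughout are not computed by the code above; here is a minimal sketch using sklearn's roc_auc_score on the probabilities already computed:

from sklearn.metrics import roc_auc_score
print('Train AUC:', roc_auc_score(y_train, y_trainprba))  # close to 1.0 here
print('Test AUC: ', roc_auc_score(y_test, y_testprba))    # about 0.87 here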

(B) Starting Model1 (SymmetricTree)

(a) Since grow_policy is SymmetricTree, min_data_in_leaf and max_leaves cannot be used, so start by tuning depth:

import catboost as cb
# Grid-search tree depth from 6 to 10 with 10-fold CV
param_grid = [{'depth': [i for i in range(6, 11)]}]
boost = cb.CatBoostClassifier(eval_metric='AUC')
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(boost, param_grid, n_jobs=-1, verbose=2, cv=10)
grid_search.fit(X_train, y_train)
classifier = grid_search.best_estimator_
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_testprba = classifier.predict_proba(X_test)[:, 1]
y_trainpred = classifier.predict(X_train)
y_trainprba = classifier.predict_proba(X_train)[:, 1]
from sklearn.metrics import confusion_matrix
cm_test = confusion_matrix(y_test, y_pred)
cm_train = confusion_matrix(y_train, y_trainpred)
print(cm_train)
print(cm_test)

Optimal parameter: depth=7
Retrieving the optimal parameters from a CatBoost grid search is a little different; the two methods introduced earlier are hard to apply here, so query the attribute directly:

grid_search.best_estimator_._init_params:
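
As an aside (not in the original), plain scikit-learn also exposes the winning combination; best_params_ contains only the keys that were actually searched:

print(grid_search.best_params_)                  # e.g. {'depth': 7}
print(grid_search.best_score_)                   # mean CV score of that combination
print(grid_search.best_estimator_._init_params)  # all CatBoost constructor parameters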

No need to dwell on the results: the model is surely still overfitting, since the regularization parameters have not been tuned yet.

(b) Then, adjust l2_leaf_reg:

# Grid-search the L2 regularization coefficient, with depth fixed at 7
param_grid = [{'l2_leaf_reg': [i for i in range(1, 11)]}]
boost = cb.CatBoostClassifier(depth=7, eval_metric='AUC')
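
From here on, only param_grid and the fixed parameters of boost change from step to step; the GridSearchCV scaffolding from step (a) is reused verbatim each time and is therefore omitted. For completeness, the repeated part:

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(boost, param_grid, n_jobs=-1, verbose=2, cv=10)
grid_search.fit(X_train, y_train)
classifier = grid_search.best_estimator_   # then evaluate exactly as in step (A)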

Optimal parameters: l2_leaf_reg=6
And then, disappointingly: the more I tuned, the worse the overfitting became.

(c) Continue with the anti-overfitting parameters: random_strength

# Grid-search random_strength, with depth and l2_leaf_reg fixed
param_grid = [{'random_strength': [i for i in range(1, 11)]}]
boost = cb.CatBoostClassifier(depth=7, l2_leaf_reg=6, eval_metric='AUC')

Optimal parameter: random_strength=7

Still overfitting.

(d) Continue to adjust the parameter rsm:

# Grid-search the column sampling ratio rsm
param_grid = [{'rsm': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}]
boost = cb.CatBoostClassifier(depth=7, l2_leaf_reg=6, random_strength=7, eval_metric='AUC')

Optimal parameter: rsm=0.3
After all that tuning, the model is still no better than the first (default) version. Next, try shortening the training by tuning the number of iterations:

(e) Try adjusting learning_rate and iterations together:

# Joint grid over learning rate and number of iterations
param_grid = [{'learning_rate': [0.03, 0.06, 0.08, 0.1],
               'iterations': [100, 200, 300, 400, 500, 600, 700, 800]}]
boost = cb.CatBoostClassifier(depth=7, l2_leaf_reg=6, random_strength=7, rsm=0.3, eval_metric='AUC')

Optimal parameters: learning_rate=0.06 and iterations=300

In summary, the optimal parameters are grow_policy='SymmetricTree', depth=7, l2_leaf_reg=6, rsm=0.3, random_strength=7, learning_rate=0.06, iterations=300, with eval_metric='AUC'.

(f) Finally, try the overfitting-detection settings:
① early_stopping_rounds: early stopping; not enabled by default.

classifier = cb.CatBoostClassifier(grow_policy='SymmetricTree', depth=7, l2_leaf_reg=6, rsm=0.3, random_strength=7, learning_rate=0.06, iterations=300, early_stopping_rounds=200, eval_metric='AUC')
classifier.fit(X_train, y_train)

In practice this made no difference; without an eval_set there is nothing for early stopping to monitor. Keep going:
② od_type: overfitting-detection type; default IncToDec. Options: IncToDec, Iter.
③ od_pval: threshold for IncToDec overfitting detection; training stops once it is reached. It requires a validation dataset, and the recommended range is [10e-10, 10e-2]. The default is 0, i.e., overfitting detection is disabled.
This step feeds the test split we held out earlier into model fitting as a validation set, which feels a bit like seeing the exam questions in advance (i.e., test-set leakage). It is demonstrated here anyway, just so everyone has seen it:

classifier = cb.CatBoostClassifier(grow_policy='SymmetricTree', depth=7, l2_leaf_reg=6, rsm=0.3, random_strength=7, learning_rate=0.06, iterations=300, early_stopping_rounds=200, eval_metric='AUC')
classifier.fit(X_train, y_train, eval_set=(X_test, y_test), plot=True)

Looking at the results, the overfitting has been alleviated:

Then, try adjusting od_type and od_pval:

classifier = cb.CatBoostClassifier(grow_policy='SymmetricTree', depth=7, l2_leaf_reg=6, rsm=0.3, random_strength=7, learning_rate=0.06, iterations=300, early_stopping_rounds=200, eval_metric='AUC', od_type='IncToDec', od_pval=0.1)
classifier.fit(X_train, y_train, eval_set=(X_test, y_test), plot=True)

The result did not change either. Finally, grid-search od_pval to see which value works best:

import catboost as cb
# Grid-search od_pval; the eval_set fit parameter is forwarded to CatBoost by GridSearchCV
param_grid = [{'od_pval': [0.6, 0.2, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]}]
boost = cb.CatBoostClassifier(grow_policy='SymmetricTree', depth=7, l2_leaf_reg=6,
                              rsm=0.3, random_strength=7, learning_rate=0.06, iterations=300,
                              early_stopping_rounds=200, eval_metric='AUC')
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(boost, param_grid, n_jobs=-1, verbose=2, cv=10)
grid_search.fit(X_train, y_train, eval_set=(X_test, y_test))
classifier = grid_search.best_estimator_

As it turns out, still no change, so stop here and look at Model1's final result:

[Figures: final training- and test-set results for Model1]

(C) Starting Model2 (Depthwise)

(a) With grow_policy set to Depthwise, min_data_in_leaf becomes available as an extra knob to tune. As before, tune depth first:
the optimum is again depth=7, and the model still overfits.
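
The grid for this step is not shown in the original; under the same GridSearchCV scaffolding it would presumably be just:

param_grid = [{'depth': [i for i in range(6, 11)]}]
boost = cb.CatBoostClassifier(grow_policy='Depthwise', eval_metric='AUC')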
(b) Adjust min_data_in_leaf

# Grid-search the minimum leaf size under the Depthwise policy
param_grid = [{'min_data_in_leaf': range(5, 200, 10)}]
boost = cb.CatBoostClassifier(grow_policy='Depthwise', depth=7, eval_metric='AUC')

Optimal parameter: min_data_in_leaf=135, and the overfitting is alleviated.
(c) I then tuned l2_leaf_reg, random_strength, learning_rate, and iterations as well, but performance regressed, so stop here. Look directly at the final result of Model2 (the validation set was not used for tuning):
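
Assembling the two tuned values into the final Model2 (a sketch; the variable name model2 is mine):

model2 = cb.CatBoostClassifier(grow_policy='Depthwise', depth=7,
                               min_data_in_leaf=135, eval_metric='AUC')
model2.fit(X_train, y_train)   # evaluate as in step (A): confusion matrices and AUC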

[Figures: final training- and test-set results for Model2]

(D) Starting Model3 (Lossguide)

(a) With grow_policy set to Lossguide, both min_data_in_leaf and max_leaves become available to tune. As before, tune depth first:
the optimum comes out as depth=8, and the model still overfits.
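
Again, the depth grid itself is not shown; presumably:

param_grid = [{'depth': [i for i in range(6, 11)]}]
boost = cb.CatBoostClassifier(grow_policy='Lossguide', eval_metric='AUC')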
(b) Adjust min_data_in_leaf

# Grid-search the minimum leaf size under the Lossguide policy
param_grid = [{'min_data_in_leaf': range(5, 200, 10)}]
boost = cb.CatBoostClassifier(grow_policy='Lossguide', depth=8, eval_metric='AUC')

Optimal parameter: min_data_in_leaf=115
(c) Adjust num_leaves (an alias of max_leaves)

# Grid-search the number of leaves; num_leaves is CatBoost's alias for max_leaves
param_grid = [{'num_leaves': range(5, 100, 5)}]
boost = cb.CatBoostClassifier(grow_policy='Lossguide', depth=8,
                              min_data_in_leaf=115, eval_metric='AUC')

Optimal parameter: num_leaves=5.
The result is not bad:

[Figures: training- and test-set results after tuning num_leaves]
(d) I did not go on to tune l2_leaf_reg, random_strength, learning_rate, or iterations; my hunch is that performance would regress again, so stop here. Look directly at the final result of Model3 (again, the validation set was not used for tuning):
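
Assembling the tuned values into the final Model3 (a sketch; the variable name model3 is mine):

model3 = cb.CatBoostClassifier(grow_policy='Lossguide', depth=8,
                               min_data_in_leaf=115, num_leaves=5,
                               eval_metric='AUC')
model3.fit(X_train, y_train)   # evaluate as in step (A): confusion matrices and AUC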

[Figures: final training- and test-set results for Model3]


2. SPSSPRO parameter adjustment (worked out on my own)

Omitted ~


Summary

Depending on grow_policy (the subtree growth policy), the tuning splits into three models (Model1, Model2, and Model3). Strictly speaking, the original CatBoost is the SymmetricTree variant; after all, the symmetric tree is one of its defining characteristics.
Judging from the results, Model1 overfits heavily unless the test set is pulled into tuning (a practice I have reservations about), while Model2 and Model3 introduce some decision-tree parameters and correct the overfitting rather better.
One more thing: I only picked up CatBoost recently, and I suspect there are many hidden tricks I have not yet learned. This is just one person's take, offered for reference; corrections are welcome.


Origin: blog.csdn.net/qq_30452897/article/details/129316045