CatBoost modeling
1. Parameter tuning in Python
(1) Preparation before modeling
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data: column 0 is the label, columns 1 to 13 are the features.
dataset = pd.read_csv('X disease code fs.csv')
X = dataset.iloc[:, 1:14].values
Y = dataset.iloc[:, 0].values

# Hold out 30% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=666)

# Fit the scaler on the training split only, then apply it to both splits.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
(2) Catboost's tuning strategy
Review the parameters first. The ones worth tuning are:
① depth: depth of the tree. The default is 6 and the maximum is 16.
② grow_policy: tree growth strategy. Options: SymmetricTree (the default, symmetric trees), Depthwise (grows level by level, like XGBoost), Lossguide (grows leaf by leaf, like LightGBM).
③ min_data_in_leaf: minimum number of samples in a leaf. Can only be used with the Lossguide and Depthwise growth policies.
④ max_leaves: maximum number of leaves. Values above 64 are not recommended because they slow training down considerably. Can only be used with the Lossguide growth policy.
⑤ iterations: number of boosting iterations (trees). The default is 1000.
⑥ learning_rate: learning rate. The documented default is 0.03, although CatBoost may pick a value automatically for some loss functions when it is left unset.
⑦ l2_leaf_reg: coefficient of the L2 regularization term. The default is 3.
⑧ random_strength: amount of randomness added to the split scores when the tree structure is chosen. The default is 1. It is used to reduce overfitting.
⑨ rsm: column (feature) sampling ratio, in the range (0, 1]. The default is 1.
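The compatibility rules in items ② to ④ are easy to trip over when switching grow_policy, so here is a minimal sketch that encodes them as a pre-flight check. The helper `check_growth_params` and the `ALLOWED_POLICIES` table are hypothetical names introduced for illustration and are not part of CatBoost; they only restate the restrictions listed above.

```python
# Restrictions restated from the parameter list above (not a CatBoost API):
# which grow_policy values each policy-specific parameter may be combined with.
ALLOWED_POLICIES = {
    "min_data_in_leaf": {"Depthwise", "Lossguide"},
    "max_leaves": {"Lossguide"},
}


def check_growth_params(params):
    """Return a list of human-readable problems found in a parameter dict."""
    policy = params.get("grow_policy", "SymmetricTree")  # SymmetricTree is the default
    problems = []
    for name, allowed in ALLOWED_POLICIES.items():
        if name in params and policy not in allowed:
            problems.append(
                f"{name} requires grow_policy in {sorted(allowed)}, got {policy!r}"
            )
    depth = params.get("depth", 6)  # 6 is the default depth
    if depth > 16:
        problems.append(f"depth must be at most 16, got {depth}")
    return problems


ok = check_growth_params({"grow_policy": "Lossguide", "depth": 8,
                          "min_data_in_leaf": 115, "max_leaves": 5})
bad = check_growth_params({"grow_policy": "SymmetricTree", "depth": 8,
                           "min_data_in_leaf": 115})
print(ok)   # nothing reported for the Lossguide combination
print(bad)  # flags min_data_in_leaf under SymmetricTree
```

Running a check like this before a long grid search is cheap, and it catches parameter sets that were copied from one growth policy to another.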
(3) Catboost parameter tuning demo
(A) Baseline: run with the default parameters
import catboost as cb
from sklearn.metrics import confusion_matrix

# eval_metric is the metric CatBoost uses for overfitting detection and best-model selection.
classifier = cb.CatBoostClassifier(eval_metric='AUC')
classifier.fit(X_train, y_train)

# Predicted labels and positive-class probabilities on the test and training sets.
y_pred = classifier.predict(X_test)
y_testprba = classifier.predict_proba(X_test)[:,1]
y_trainpred = classifier.predict(X_train)
y_trainprba = classifier.predict_proba(X_train)[:,1]

# Confusion matrices for both sets.
cm_test = confusion_matrix(y_test, y_pred)
cm_train = confusion_matrix(y_train, y_trainpred)
print(cm_train)
print(cm_test)
Although it overfits, the baseline is already better than the earlier XGBoost and LightGBM models:
the AUC on the validation set reaches 0.8693, on par with any previous model, but the AUC on the training set is close to 1.0 (0.9978). So the model still overfits; keep tuning to see whether that can be reduced.
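The code above prints confusion matrices only, so it may help to see what the two AUC figures measure. The sketch below is a plain-Python version of AUC, written so that it runs without the dataset or a fitted model. The function name `auc` and the toy labels and scores are made up for illustration; only the two AUC values at the end come from the text.

```python
def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half. For binary labels this is the same quantity that
    sklearn.metrics.roc_auc_score reports.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores, made up for illustration (not the article's data).
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_y, toy_scores))  # 3 of the 4 positive/negative pairs are ordered correctly

# Gap between the two AUC values reported in the text.
train_auc, test_auc = 0.9978, 0.8693
print(round(train_auc - test_auc, 4))
```

With the variables defined earlier, roc_auc_score(y_train, y_trainprba) and roc_auc_score(y_test, y_testprba) from sklearn.metrics would presumably give the two numbers being compared. The size of the gap between them is a simple way to track overfitting from one tuning step to the next.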
(B) Model1 (SymmetricTree)
(a) With grow_policy set to SymmetricTree, min_data_in_leaf and max_leaves cannot be tuned, so start with depth:
import catboost as cb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

# Candidate tree depths: 6 through 10.
param_grid = [{'depth': list(range(6, 11))}]
boost = cb.CatBoostClassifier(eval_metric='AUC')

# 10-fold grid search. No `scoring` is given, so candidates are ranked by the
# estimator's own score() (accuracy for CatBoostClassifier), not by eval_metric.
grid_search = GridSearchCV(boost, param_grid, n_jobs=-1, verbose=2, cv=10)
grid_search.fit(X_train, y_train)

# best_estimator_ is already refit on the whole training set (refit=True by default).
classifier = grid_search.best_estimator_

y_pred = classifier.predict(X_test)
y_testprba = classifier.predict_proba(X_test)[:,1]
y_trainpred = classifier.predict(X_train)
y_trainprba = classifier.predict_proba(X_train)[:,1]

cm_test = confusion_matrix(y_test, y_pred)
cm_train = confusion_matrix(y_train, y_trainpred)
print(cm_train)
print(cm_test)
Optimal parameter: depth=7
Reading the best parameters back is a little different with CatBoost. The two methods introduced in the earlier posts do not show them conveniently, so query the fitted estimator directly (grid_search.best_params_ also returns the winning grid values):
grid_search.best_estimator_._init_params
I will skip the results here. The model is certainly still overfitting, since none of the regularization parameters have been tuned yet.
(b) Then, adjust l2_leaf_reg:
param_grid=[{
'l2_leaf_reg': [i for i in range(1,11)],
},
]
boost = cb.CatBoostClassifier(depth = 7, eval_metric='AUC')
Optimal parameter: l2_leaf_reg=6
No luck: the more I tune, the more closely the model fits the training set.
(c) Continue with the anti-overfitting parameters: random_strength
param_grid=[{
'random_strength': [i for i in range(1,11)],
},
]
boost = cb.CatBoostClassifier(depth = 7, l2_leaf_reg = 6, eval_metric='AUC')
Optimal parameter: random_strength=7
Still overfitting.
(d) Continue to adjust the parameter rsm:
param_grid=[{
'rsm': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0],
},
]
boost = cb.CatBoostClassifier(depth = 7, l2_leaf_reg = 6, random_strength=7, eval_metric='AUC')
Optimal parameter: rsm=0.3
After all this tuning, the result is still not as good as the first (default) version. Next, try reducing the number of iterations:
(e) Try adjusting learning_rate and iterations together:
param_grid=[{
'learning_rate': [0.03,0.06,0.08,0.1],
'iterations': [100,200,300,400,500,600,700,800],
},
]
boost = cb.CatBoostClassifier(depth = 7, l2_leaf_reg = 6, random_strength=7, rsm=0.3, eval_metric='AUC')
Optimal parameters: learning_rate=0.06 and iterations=300
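In summary, the parameters used from here on are grow_policy='SymmetricTree', depth=8, min_data_in_leaf=115, l2_leaf_reg=6, rsm=0.3, random_strength=7, learning_rate=0.06, iterations=300, eval_metric='AUC'. (Step (a) selected depth=7, and min_data_in_leaf is described above as usable only with Depthwise and Lossguide, so these two values do not come from the SymmetricTree searches in steps (a) to (e).)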
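Steps (a) to (e) tune one grid at a time and freeze each winner before moving on. The sketch below shows that staged strategy in plain Python, together with the number of candidates it evaluates compared with one joint grid. The `stages` list copies the grids from the text; `staged_search`, `grid_size` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for a cross-validated score so that the example runs without data or CatBoost.

```python
from itertools import product
from math import prod

# The five grids from steps (a) to (e), in the order they were searched.
stages = [
    {"depth": list(range(6, 11))},
    {"l2_leaf_reg": list(range(1, 11))},
    {"random_strength": list(range(1, 11))},
    {"rsm": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]},
    {"learning_rate": [0.03, 0.06, 0.08, 0.1],
     "iterations": [100, 200, 300, 400, 500, 600, 700, 800]},
]


def grid_size(grid):
    """Number of parameter combinations in one grid."""
    return prod(len(values) for values in grid.values())


def staged_search(score_fn, stages):
    """Tune one grid at a time, freezing each stage's winner before the next."""
    best = {}
    for grid in stages:
        names = list(grid)
        candidates = [dict(zip(names, combo)) for combo in product(*grid.values())]
        winner = max(candidates, key=lambda cand: score_fn({**best, **cand}))
        best.update(winner)
    return best


staged_candidates = sum(grid_size(g) for g in stages)  # 5 + 10 + 10 + 10 + 32
full_candidates = prod(grid_size(g) for g in stages)   # 5 * 10 * 10 * 10 * 32
print(staged_candidates, full_candidates)


# Hypothetical stand-in for "mean cross-validated score": a made-up function
# with a single peak, used only so the sketch runs end to end.
def toy_score(params):
    target = {"depth": 8, "l2_leaf_reg": 4, "random_strength": 2,
              "rsm": 0.5, "learning_rate": 0.08, "iterations": 500}
    return -sum((params[k] - target[k]) ** 2 for k in params)


print(staged_search(toy_score, stages))
```

With cv=10, the staged search costs 67 × 10 = 670 fits, against 160000 × 10 for the joint grid, which is why the staged approach is attractive. The trade-off is that it is only guaranteed to find the joint optimum when the parameters do not interact, as in the toy score here. Real hyperparameters usually do interact, so a winner frozen early may stop being the best choice later.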
(f) Finally, try the overfitting-detection settings:
① early_stopping_rounds: early stopping. It is not enabled by default.
classifier = cb.CatBoostClassifier(grow_policy='SymmetricTree', depth=8, min_data_in_leaf=115, l2_leaf_reg=6, rsm=0.3, random_strength=7, learning_rate=0.06, iterations=300, early_stopping_rounds=200, eval_metric='AUC')
classifier.fit(X_train, y_train)
In practice this makes no difference here: no validation set is passed to fit(), so the overfitting detector has nothing to monitor. Moving on.
② od_type: type of overfitting detector. The default is IncToDec. Options: IncToDec, Iter.
③ od_pval: threshold for the IncToDec detector; training stops once the threshold is reached. It requires a validation dataset, and the recommended range is [1e-10, 1e-2]. The default is 0, meaning overfitting detection is switched off.
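To make the early-stopping idea concrete, here is a minimal sketch of an Iter-style rule: stop once the validation metric has not improved for a given number of iterations. It reflects my understanding of how early_stopping_rounds behaves, not CatBoost's actual code, and the off-by-one details may differ. The function name `iter_detector` and the metric curve are made up for illustration.

```python
def iter_detector(metric_history, patience, higher_is_better=True):
    """Simulate an Iter-style stop rule on a non-empty list of validation metrics.

    Returns (best_iteration, last_iteration_run), both 0-based.
    """
    best_iter, best_value = 0, metric_history[0]
    for i, value in enumerate(metric_history[1:], start=1):
        improved = value > best_value if higher_is_better else value < best_value
        if improved:
            best_iter, best_value = i, value
        elif i - best_iter >= patience:
            return best_iter, i  # no improvement for `patience` iterations
    return best_iter, len(metric_history) - 1


# Made-up validation AUC curve: rises, then drifts down.
history = [0.80, 0.85, 0.87, 0.86, 0.86, 0.85, 0.84]
print(iter_detector(history, patience=3))    # stops early, best iteration is 2
print(iter_detector(history, patience=200))  # patience longer than the run: never triggers
```

Two things follow from this. The rule needs a validation metric to watch, so without an eval_set it cannot do anything. And with iterations=300 and early_stopping_rounds=200, it can only fire when the best validation value occurs within roughly the first 100 iterations.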
Using the overfitting detector means passing the held-out set we split off earlier into fit(). That feels a bit like leaking the exam questions in advance, but here is a demonstration anyway, so you know how it works:
classifier = cb.CatBoostClassifier(grow_policy='SymmetricTree', depth=8, min_data_in_leaf=115, l2_leaf_reg=6, rsm=0.3, random_strength=7, learning_rate=0.06, iterations=300, early_stopping_rounds=200, eval_metric='AUC')
classifier.fit(X_train, y_train, eval_set=(X_test, y_test), plot=True)
Looking at the results, the overfitting has been alleviated:
Then, try adjusting od_type and od_pval:
classifier = cb.CatBoostClassifier(grow_policy='SymmetricTree', depth=8, min_data_in_leaf=115, l2_leaf_reg=6, rsm=0.3, random_strength=7, learning_rate=0.06, iterations=300, early_stopping_rounds=200, eval_metric='AUC', od_type='IncToDec', od_pval=0.1)
classifier.fit(X_train, y_train, eval_set=(X_test, y_test), plot=True)
The result does not change. Next, use a grid search to see which value of od_pval works best:
import catboost as cb
from sklearn.model_selection import GridSearchCV

param_grid = [{'od_pval': [0.6, 0.2, 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]}]
boost = cb.CatBoostClassifier(grow_policy='SymmetricTree', depth=8, min_data_in_leaf=115, l2_leaf_reg=6,
                              rsm=0.3, random_strength=7, learning_rate=0.06, iterations=300, early_stopping_rounds=200,
                              eval_metric='AUC')
grid_search = GridSearchCV(boost, param_grid, n_jobs=-1, verbose=2, cv=10)
# Extra keyword arguments to fit() are forwarded to the estimator's fit for every fold,
# so each candidate is monitored on the same held-out set.
grid_search.fit(X_train, y_train, eval_set=(X_test, y_test))
classifier = grid_search.best_estimator_
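Passing (X_test, y_test) as eval_set lets the test set influence when training stops. One common way around this is to carve a separate validation part out of the data and keep the test part for the final evaluation only. The sketch below does the split on row indices in plain Python; the function name `three_way_split` and the 15% / 30% fractions are assumptions for illustration, not settings from this article.

```python
import random


def three_way_split(n_rows, valid_frac=0.15, test_frac=0.30, seed=666):
    """Shuffle row indices and cut them into train / validation / test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_frac)
    n_valid = round(n_rows * valid_frac)
    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]
    train_idx = indices[n_test + n_valid:]
    return train_idx, valid_idx, test_idx


train_idx, valid_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With NumPy arrays, X[train_idx] and X[valid_idx] select the rows. The validation part would then go into eval_set, while the test part is only used for the final confusion matrix and AUC. Calling train_test_split twice achieves the same thing with scikit-learn.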
As it turns out, nothing changes, so stop here and look at the final result for Model1:
(C) Model2 (Depthwise)
(a) With grow_policy set to Depthwise, min_data_in_leaf becomes available for tuning. As before, tune depth first:
The best depth is again 7, and the model still overfits.
(b) Adjust min_data_in_leaf
param_grid=[{
'min_data_in_leaf': range(5,200,10),
},
]
boost = cb.CatBoostClassifier(grow_policy='Depthwise', depth=7, eval_metric='AUC')
Optimal parameter: min_data_in_leaf=135, and the overfitting is alleviated.
(c) I then tuned l2_leaf_reg, random_strength, learning_rate and iterations, but performance got worse, so let's stop here. Here are the results of Model2 (no validation set was used for tuning):
(D) Model3 (Lossguide)
(a) With grow_policy set to Lossguide, both min_data_in_leaf and max_leaves become available for tuning. As before, tune depth first:
The best depth comes out as 8, and the model still overfits.
(b) Adjust min_data_in_leaf
param_grid=[{
'min_data_in_leaf': range(5,200,10),
},
]
boost = cb.CatBoostClassifier(grow_policy='Lossguide', depth=8, eval_metric='AUC')
Optimal parameter: min_data_in_leaf=115
(c) Adjust max_leaves (passed here through its alias num_leaves)
param_grid=[{
'num_leaves': range(5, 100, 5),
},
]
boost = cb.CatBoostClassifier(grow_policy='Lossguide',
depth=8,min_data_in_leaf=115, eval_metric='AUC')
Optimal parameter: num_leaves=5,
and the result is not bad:
(d) I will not tune l2_leaf_reg, random_strength, learning_rate and iterations this time. I have a hunch that performance would get worse after tuning them, so stop here. Here are the results of Model3 (again, no validation set was used for tuning):
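For reference, the settings reported above for Model2 and Model3 can be collected in one place. This is a sketch, not the author's code: `MODEL_PARAMS` and `build_classifier` are hypothetical names, and max_leaves is spelled out where the text tunes its alias num_leaves.

```python
# Settings reported in the text for the two non-symmetric models.
# max_leaves is written out here; the text tunes it through its alias num_leaves.
MODEL_PARAMS = {
    "Model2": {"grow_policy": "Depthwise", "depth": 7, "min_data_in_leaf": 135},
    "Model3": {"grow_policy": "Lossguide", "depth": 8, "min_data_in_leaf": 115,
               "max_leaves": 5},
}


def build_classifier(name):
    """Create an unfitted CatBoostClassifier for one of the models above."""
    import catboost as cb  # imported lazily so the table can be read without catboost
    return cb.CatBoostClassifier(eval_metric="AUC", **MODEL_PARAMS[name])


for name, params in MODEL_PARAMS.items():
    print(name, params)
```

Calling build_classifier("Model3").fit(X_train, y_train) would then train the Lossguide variant on the training split prepared at the start.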
2. SPSSPRO parameter adjustment (figure it out yourself)
Omitted~
Summary
Depending on grow_policy (the tree growth strategy), the work splits into three models (Model1, Model2 and Model3). Strictly speaking, the original CatBoost is the one that uses SymmetricTree, since symmetric trees are one of its defining features.
Judging from the results, Model1 overfits heavily unless the test set is used during tuning (I reserve my opinion on that). Model2 and Model3 bring in decision-tree parameters such as min_data_in_leaf and max_leaves, which correct the overfitting better.
One more thing: I only started learning CatBoost recently, and I suspect there are many tricks I have not picked up yet. This is just one person's view, offered for reference. If you spot mistakes, please correct me.