Python grid search and Bayesian hyperparameter tuning in practice

        

Table of contents

1. Cross-validation

2. Grid search

3. Bayesian Optimization

        The most time-consuming part of the modeling process is feature engineering (including variable analysis), followed by hyperparameter tuning, so today we will walk through two tuning methods in code: grid search and Bayesian optimization.

1. Cross-validation

        The training/test splitting methods most commonly used in practice are random proportional splitting and (stratified) cross-validation. A random split divides the data into training and test sets at a ratio such as 7:3 or 8:2, but such a split carries a degree of randomness, so it is less rigorous than cross-validation.
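For reference, here is a minimal sketch of a random 7:3 split with train_test_split, assuming the same DataFrame df, feature list col_list, and binary label column y used in the code that follows:

# Random 7:3 split; stratify keeps the positive rate the same in both parts
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[col_list], df.y, test_size=0.3, stratify=df.y, random_state=123
)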

1. Import packages and set initial parameters

# Import packages
import re
import os
from sqlalchemy import create_engine
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve,roc_auc_score
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
import matplotlib.pyplot as plt
import gc
from sklearn import metrics
from sklearn.model_selection import cross_val_predict,cross_validate

# Set the XGBoost parameters
params={
    'objective':'binary:logistic'
    ,'eval_metric':'auc'
    ,'n_estimators':500
    ,'eta':0.03
    ,'max_depth':3
    ,'min_child_weight':100
    ,'scale_pos_weight':1
    ,'gamma':5
    ,'reg_alpha':10
    ,'reg_lambda':10
    ,'subsample':0.7
    ,'colsample_bytree':0.7
    ,'seed':123
}

2. Initialize the model object and run 5-fold cross-validation. The cross_validate function can evaluate multiple scoring metrics at once (AUC is used here), whereas cross_val_score only accepts one.

xgb_model = XGBClassifier(**params)  # named xgb_model so it does not shadow the xgboost module imported as xgb
scoring = ['roc_auc']
scores = cross_validate(xgb_model, df[col_list], df.y, cv=5, scoring=scoring, return_train_score=True)
scores 
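As a usage note for the call above: when scoring is passed as a list, cross_validate returns a dict of per-fold score arrays keyed as test_<metric>/train_<metric>, so the fold-averaged AUC can be read like this:

# Mean validation and training AUC over the 5 folds
print(scores['test_roc_auc'].mean())
print(scores['train_roc_auc'].mean())   # comparing with the validation AUC helps gauge overfitting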

3. The native XGBoost and LightGBM interfaces also provide a built-in cv function; XGBoost's version expects a DMatrix rather than a DataFrame.

dtrain = xgb.DMatrix(df[col_list], label=df.y)  # the native interface expects a DMatrix, not a DataFrame
xgb.cv(params, dtrain, num_boost_round=10, nfold=5, stratified=True, metrics='auc', as_pandas=True)
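With as_pandas=True the call returns a DataFrame with one row per boosting round; a sketch of capturing it and reading off the fold-averaged AUC of the last round (column names follow metrics='auc' above):

# test-auc-mean / train-auc-mean are averaged over the 5 folds for each boosting round
cv_result = xgb.cv(params, dtrain, num_boost_round=10, nfold=5, stratified=True, metrics='auc', as_pandas=True)
print(cv_result[['train-auc-mean', 'test-auc-mean']].tail(1))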

 

2. Grid search

        Grid search traverses all parameter combinations and selects the combination that scores best on the evaluation data.

1. Set the parameters to tune and their ranges, then run the grid search

from sklearn.model_selection import GridSearchCV

# Subsets of the data (the fit below runs on df)
train=df_train.head(60000)
test=df_train.tail(10000)

# Candidate values for each parameter to be searched
param_value_dics={
                   'n_estimators':range(100,900,500),
                   'eta':np.arange(0.02,0.2,0.1),
                   'max_depth':range(3,5,1),
#                    'num_leaves':range(10,30,10),
#                    'min_child_weight':range(300,1500,500),
               }

xgb_model=XGBClassifier(**params)
clf=GridSearchCV(xgb_model,param_value_dics,scoring='roc_auc',n_jobs=-1,cv=5,return_train_score=True)
clf.fit(df[col_list], df.y)

2. Return the optimal parameters

clf.best_params_

3. Return the model with optimal parameters

clf.best_estimator_

4. Return the mean cross-validated score under the optimal parameters (the AUC configured here)

clf.best_score_
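Beyond the single best score, every combination that was tried can be inspected through the fitted object's cv_results_ attribute; a minimal sketch:

# One row per parameter combination, with fold-averaged train/validation AUC
cv_results = pd.DataFrame(clf.cv_results_)
cv_results[['params', 'mean_train_score', 'mean_test_score']].sort_values('mean_test_score', ascending=False)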

         Grid search exhaustively traverses the specified parameter ranges and therefore finds the global optimum within them. However, that amount of computation makes it very slow, and since the search optimizes the evaluation score directly, the tuning still carries a risk of overfitting. Bayesian optimization addresses both of these problems.

3. Bayesian Optimization

        Bayesian tuning requires a custom objective function, which can be chosen flexibly according to the actual situation. It runs quickly and prints the model result for each parameter set as it goes, making it a very practical method.

1. Import packages

from bayes_opt import BayesianOptimization
import warnings
warnings.filterwarnings("ignore")
from sklearn import metrics
from sklearn.model_selection import cross_val_predict,cross_validate
from xgboost import XGBClassifier

2. Define the tuning objective. Here the mean validation AUC under 5-fold cross-validation is used as the objective.


def xgb_cv(n_estimators,max_depth,eta,subsample,colsample_bytree):
    # Base parameters; the tuned ones are overwritten below
    params={
            'objective':'binary:logistic',
            'eval_metric':'auc',
            'n_estimators':10,
            'eta':0.03,
            'max_depth':3,
            'min_child_weight':100,
            'scale_pos_weight':1,
            'gamma':5,
            'reg_alpha':10,
            'reg_lambda':10,
            'subsample':0.7,
            'colsample_bytree':0.7,
            'seed':123,
        }
    # BayesianOptimization passes floats, so integer parameters must be cast with int()
    params.update({'n_estimators':int(n_estimators),'max_depth':int(max_depth),'eta':eta,
                   'subsample':subsample,'colsample_bytree':colsample_bytree})
    model=XGBClassifier(**params)
    cv_result=cross_validate(model,df_train.head(10000)[col_list],df_train.head(10000).y,
                             cv=5,scoring='roc_auc',return_train_score=True)
    # Return the mean validation AUC across the 5 folds; this is the value being maximized
    return cv_result.get('test_score').mean()
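Before handing the function to the optimizer, it can be sanity-checked with a single hand-picked point (the values below are arbitrary examples inside the ranges defined in the next step):

# A single evaluation of the objective; should print an AUC between 0 and 1
print(xgb_cv(n_estimators=200, max_depth=4, eta=0.05, subsample=0.8, colsample_bytree=0.8))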

3. Set the tuning ranges. Note the fixed format for each range: (left, right). Bayesian optimization samples real-valued points within these ranges, so parameters such as n_estimators and max_depth will come back as floats and must be converted to integers inside the model code: int(n_estimators)

param_value_dics={
                   'n_estimators':(100, 500),
                   'max_depth':(3, 6),
                   'eta':(0.02, 0.2),
                   'subsample':(0.6, 1.0),
                   'colsample_bytree':(0.6, 1.0)
               }

4. Create the Bayesian optimization object and run 20 iterations

lgb_bo = BayesianOptimization(
        xgb_cv,
        param_value_dics
    )        
lgb_bo.maximize(init_points=1,n_iter=20)  # init_points: number of initial random points; n_iter: number of optimization iterations

5. View the optimal result (best parameters and target value)

lgb_bo.max

6. View all tuning results

lgb_bo.res

7. Convert all tuning results into a DataFrame for easier inspection and display

result=pd.DataFrame([res.get('params') for res in lgb_bo.res])
result['target']=[res.get('target') for res in lgb_bo.res]
result[['max_depth','n_estimators']]=result[['max_depth','n_estimators']].astype('int')
result

8. Based on the current results, you can refine or otherwise modify the parameter ranges and continue tuning.

lgb_bo.set_bounds({'n_estimators':(400, 450)})
lgb_bo.maximize(init_points=1,n_iter=20)
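Once tuning is finished, here is a minimal sketch of refitting a final model from lgb_bo.max (under the same df_train/col_list assumptions as the objective function above), remembering to cast the integer parameters:

# lgb_bo.max is a dict with 'target' (best AUC) and 'params' (best parameter values)
best_params = lgb_bo.max['params'].copy()
best_params['n_estimators'] = int(best_params['n_estimators'])
best_params['max_depth'] = int(best_params['max_depth'])

final_model = XGBClassifier(objective='binary:logistic', eval_metric='auc', seed=123, **best_params)
final_model.fit(df_train.head(10000)[col_list], df_train.head(10000).y)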

For more knowledge and code sharing, follow the public account Python risk control model and data analysis

 


Origin blog.csdn.net/a7303349/article/details/125701303