Detailed explanation and hands-on practice of NGBoost parameters

Table of contents

1. Model introduction

2. Detailed explanation of parameters

3. Hands-on practice

1. Reuse the earlier data (ps: the linked post also explains the CatBoost parameters)

2. CatBoost modeling

3. NGBoost modeling

Final note


        NGBoost is a newer member of the boosting family, following XGBoost, LightGBM, and CatBoost. It can achieve higher accuracy, but training and inference are slower because of its higher computational cost.

Official documentation: User Guide

Paper: https://arxiv.org/abs/1910.03225

1. Model introduction

        NGBoost (Natural Gradient Boosting) is a probabilistic prediction model that combines gradient boosted trees with natural gradient descent. Its main goal is to improve prediction quality by predicting the full distribution of the target variable rather than just a point estimate (this matters mostly for regression; for classification, other models can also output class probabilities).

        NGBoost updates the distribution parameters with natural gradient descent, which reduces oscillation and shortens convergence time during training. Natural gradient descent is a variant of gradient descent that preconditions the gradient with the inverse Fisher information matrix (θ ← θ − η·F⁻¹∇L), so updates behave consistently across different parameterizations of the distribution. This makes NGBoost more robust and gives it better convergence behavior.
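        To make the probabilistic idea concrete, here is a minimal sketch on synthetic data (illustrative only, not from the original post): NGBRegressor fits a Normal distribution per sample, and pred_dist exposes its parameters.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from ngboost import NGBRegressor
from ngboost.distns import Normal

X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# natural_gradient=True is the default; set it to False to compare with the ordinary gradient
ngb = NGBRegressor(Dist=Normal, natural_gradient=True, n_estimators=200, verbose=False)
ngb.fit(X_train, y_train)

point_pred = ngb.predict(X_test)      # point estimate (mean of the predicted Normal)
dist_pred = ngb.pred_dist(X_test)     # full predictive distribution per sample
print(dist_pred.params['loc'][:3])    # per-sample mean
print(dist_pred.params['scale'][:3])  # per-sample standard deviation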

2. Detailed explanation of parameters

        NGBoost has relatively few hyperparameters, far fewer than CatBoost:

ngb_params={
    'Dist':k_categorical(2), # distribution of the target y: k_categorical, Bernoulli, Normal, Exponential, etc.
    'Score':LogScore, # scoring rule (loss function): LogScore or CRPScore
    'Base':default_tree_learner, # base learner, e.g. default_tree_learner or DecisionTreeRegressor(criterion='friedman_mse', max_depth=4)
    'natural_gradient':True,   # natural gradient vs. ordinary gradient
    'n_estimators':1000,  # number of boosting iterations
    'learning_rate':0.01,  # learning rate
    'minibatch_frac':1.0,  # row subsampling fraction
    'col_sample':1.0,  # column subsampling fraction
    'verbose':True,
    'verbose_eval':100,
    'tol':0.0001,   # loss tolerance: training stops early when the change in loss falls below tol
    'random_state':1,
}
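        As a quick sanity check (a sketch, not from the original post), the dict above unpacks directly into the classifier constructor; it assumes the ngboost imports listed in the hands-on section below:

from ngboost import NGBClassifier

# The parameter dict unpacks straight into the constructor
model = NGBClassifier(**ngb_params)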

3. Hands-on practice

1. Reuse the earlier data (ps: the linked post also explains the CatBoost parameters)

Previous post: Detailed explanation and hands-on practice of CatBoost parameters (strongly recommended), from the Python risk control model and data analysis CSDN blog

# 1. Imports
import re
import os
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve,roc_auc_score
import matplotlib.pyplot as plt
import gc
from bayes_opt import BayesianOptimization
from catboost import CatBoostClassifier, Pool, cv
import ngboost
from ngboost import NGBClassifier
from ngboost.learners import default_tree_learner
from sklearn.tree import DecisionTreeClassifier,DecisionTreeRegressor
from ngboost.distns import k_categorical, Bernoulli,Normal,Exponential
from ngboost.scores import LogScore, CRPScore

# 2. Read the data
df=pd.read_csv('E:/train.csv',engine='python').head(80000)
print(df.shape)
df.head()

# 3. Preprocessing: employmentLength comes in as strings such as '< 1 year' and '10+ years'
df_copy=df.copy()
df_copy['employmentLength']=df_copy['employmentLength'].str.replace(' years','').str.replace(' year','')
dic={'< 1':0,'10+':20}
df_copy['employmentLength']=df_copy['employmentLength'].replace(dic).astype('float')


# 4. Feature grouping
float_col=list(df_copy.select_dtypes(exclude=['string','object']).drop(['id','isDefault'],axis=1).columns).copy()
cate_col=['grade', 'subGrade']
all_fea=float_col+cate_col

2. CatBoost modeling

        We set the number of iterations to 1000, the learning rate to 0.01, and both row and column sampling to 100%. The final train and test AUCs are 0.72 and 0.718, respectively.

catboost_params={
    'loss_function': 'Logloss',            # training loss
    'custom_loss': 'AUC',                  # additional metric to track
    'eval_metric': 'AUC',                  # metric used for best-model selection and early stopping
    'iterations': 1000,                    # number of boosting iterations
    'learning_rate': 0.01,
    'random_seed': 123,
    'l2_leaf_reg': 5,                      # L2 regularization on leaf values
    'bootstrap_type': 'Bernoulli',
    'subsample': 1,                        # row sampling fraction
    'sampling_frequency': 'PerTree',
    'use_best_model': True,
    'best_model_min_trees': 50,
    'depth': 4,                            # tree depth
    'grow_policy': 'SymmetricTree',
    'min_data_in_leaf': 500,
    'one_hot_max_size': 4,                 # one-hot encode categorical features up to this cardinality
    'rsm': 1,                              # column sampling fraction
    'nan_mode': 'Max',                     # treat missing values as the maximum
    'input_borders': None,
    'boosting_type': 'Ordered',
    'max_ctr_complexity': 2,               # max number of categorical features combined into CTRs
    'logging_level': 'Verbose',
    'metric_period': 1,
    'early_stopping_rounds': 20,
    'border_count': 254,                   # number of bins for numeric feature quantization
    'feature_border_type': 'GreedyLogSum'  # quantization algorithm
}
def catboost_model(df,y_name,params,cate_col=[]):
    x_train,x_test, y_train, y_test =train_test_split(df.drop(y_name,axis=1),df[y_name],test_size=0.2, random_state=123)
    
    model = CatBoostClassifier(**params)
    model.fit(x_train, y_train,eval_set=[(x_train, y_train),(x_test,y_test)],cat_features=cate_col)
    
    train_pred = [pred[1] for pred in  model.predict_proba(x_train)]
    train_auc= roc_auc_score(list(y_train),train_pred)
    
    test_pred = [pred[1] for pred in  model.predict_proba(x_test)]
    test_auc= roc_auc_score(list(y_test),test_pred)
    
    result={
        'train_auc':train_auc,
        'test_auc':test_auc,
    }
    return model,result


model,model_result=catboost_model(df_copy[all_fea+['isDefault']],'isDefault',catboost_params,cate_col)
model_result

3. NGBoost modeling

Important points:

        (1) The model does not support categorical variables; you need to encode them yourself (see the sketch after this list).

        (2) The model does not support missing values; you need to fill them in yourself.

        (3) Because the model is computationally heavy, training and inference are slow, but the standard parameters are still very effective.
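        Both points can be handled with a few lines before fitting. A minimal sketch (the ordinal encoding below is one common choice, an assumption rather than what the original post used):

from sklearn.preprocessing import OrdinalEncoder

# NGBoost needs fully numeric, non-missing inputs:
# ordinal-encode the categorical columns, then fill missing values with a sentinel.
df_ngb = df_copy.copy()
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
df_ngb[cate_col] = encoder.fit_transform(df_ngb[cate_col].astype(str))
df_ngb = df_ngb.fillna(-1)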

        The control parameters are kept similar to the CatBoost run above. For convenience, only the numeric variables are used for fitting and evaluation. The final train and test AUCs are 0.728 and 0.72, respectively. There is some information loss here, both from dropping the categorical variables and from filling missing values, yet the model still performs strongly!

ngb_params={
    'Dist':k_categorical(2), # distribution of the target y: k_categorical, Bernoulli, Normal, Exponential, etc.
    'Score':LogScore, # scoring rule (loss function): LogScore or CRPScore
    'Base':default_tree_learner, # base learner (similar to a sub-tree), e.g. default_tree_learner or DecisionTreeRegressor(criterion='friedman_mse', max_depth=4)
    'natural_gradient':True,   # natural gradient vs. ordinary gradient
    'n_estimators':1000,  # number of boosting iterations
    'learning_rate':0.01,  # learning rate
    'minibatch_frac':1.0,  # row subsampling fraction
    'col_sample':1.0,  # column subsampling fraction
    'verbose':True,
    'verbose_eval':100,
    'tol':0.0001,   # loss tolerance: training stops early when the change in loss falls below tol
    'random_state':1,
}

def ngboost_model(df,y_name,fea_list,params):
    x_train,x_test, y_train, y_test =train_test_split(df[fea_list],df[y_name],test_size=0.2, random_state=123)
    
    model = NGBClassifier(**params)
    model.fit(x_train, y_train)
    
    train_pred = model.predict_proba(x_train)[:,1]
    train_auc= roc_auc_score(list(y_train),train_pred)
    
    test_pred = model.predict_proba(x_test)[:,1]
    test_auc= roc_auc_score(list(y_test),test_pred)
    
    result={
        'train_auc':train_auc,
        'test_auc':test_auc,
    }
    return model,result

ngb_model,model_result=ngboost_model(df_copy.fillna(-1),'isDefault',float_col,ngb_params)
model_result
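        One practical variation worth knowing (a sketch under the same split, not in the original post): NGBoost's fit also accepts a validation set together with early_stopping_rounds, which complements the tol-based stopping described above:

# Sketch: early stopping on a held-out validation set
x_train, x_test, y_train, y_test = train_test_split(
    df_copy.fillna(-1)[float_col], df_copy['isDefault'], test_size=0.2, random_state=123)

ngb_es = NGBClassifier(**ngb_params)
ngb_es.fit(x_train, y_train, X_val=x_test, Y_val=y_test, early_stopping_rounds=20)
print(ngb_es.best_val_loss_itr)  # iteration with the lowest validation loss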

         Interested readers can try it out: NGBoost's regression and classification models are very effective, and it is worth benchmarking them side by side against XGBoost, LightGBM, and CatBoost.

Final note

        Follow the WeChat public account "Python risk control model and data analysis" for more on Python, machine learning, deep learning, and risk control.


Source: blog.csdn.net/a7303349/article/details/130152148