NGBoost参数详解及实战

一、模型介绍

二、参数详解

三、实战

1、使用前文的数据（ps：含catboost相关参数解释）

2、catboost建模

3、ngboost建模

精华部分

NGBoost是继xgboost、lightGBM、catboost之后boosting家族的新成员，拥有更高的精度、但是由于计算复杂度高导致训练和推理速度更慢。

官方文档：User Guide

扫描二维码关注公众号，回复： 16918458 查看本文章

论文：https://arxiv.org/abs/1910.03225

一、模型介绍

NGBoost（Natural Gradient Boosting）是一种概率预测模型，它结合了梯度提升树和自然梯度下降的思想。NGBoost 的主要目标是通过预测目标变量的分布来提高预测精度，而不仅仅是预测一个点估计值（这里更多是针对回归问题，分类问题其他模型也可以支持）。

NGBoost 使用自然梯度下降来更新模型参数，可以减少训练过程中的振荡和收敛时间。自然梯度下降是梯度下降的变体，它使用 Fisher 信息矩阵来归一化梯度方向，以便在不同参数空间中进行更准确的更新。这种方法可以使得 NGBoost 更加鲁棒，并且具有更好的收敛性能。

二、参数详解

ngboost超参数的数量并不多，远低于catboost（如需直接使用，可将：替换为 # ）

ngb_params={
    'Dist':k_categorical(2), ： 预测值y的分布，取值k_categorical, Bernoulli,Normal,Exponential等
    'Score':LogScore, ： 损失函数，取值LogScore, CRPScore
    'Base':default_tree_learner, ： 基学习器，取值default_tree_learner、DecisionTreeRegressor(criterion='friedman_mse', max_depth=4)
    'natural_gradient':True,   ： 自然梯度 or 常规梯度
    'n_estimators':1000,  ： 迭代次数
    'learning_rate':0.01,  ： 学习速率
    'minibatch_frac':1.0,  ： 行采样
    'col_sample':1.0,  ： 列采样
    'verbose':True,
    'verbose_eval':100,
    'tol':0.0001,   ： 迭代过程中损失函数阈值，当损失函数的变化小于tol时，训练过程将停止
    'random_state':1,
}

三、实战

1、使用前文的数据（ps：含catboost相关参数解释）

catboost参数详解及实战（强推）_Python风控模型与数据分析的博客-CSDN博客

# 1、导包
import re
import os
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve,roc_auc_score
import matplotlib.pyplot as plt
import gc
from bayes_opt import BayesianOptimization
from catboost import Pool, cv
import ngboost
from ngboost import NGBClassifier
from ngboost.learners import default_tree_learner
from sklearn.tree import DecisionTreeClassifier,DecisionTreeRegressor
from ngboost.distns import k_categorical, Bernoulli,Normal,Exponential
from ngboost.scores import LogScore, CRPScore

# 2、数据读取
df=pd.read_csv('E:/train.csv',engine='python').head(80000)
print(df.shape)
df.head()

# 3、预处理
df_copy['employmentLength']=df_copy['employmentLength'].replace(' years','')
dic={'< 1':0,'10+':20}
df_copy['employmentLength']=df_copy['employmentLength'].map(dic).astype('float')


# 4、特征分组
float_col=list(df_copy.select_dtypes(exclude=['string','object']).drop(['id','isDefault'],axis=1).columns).copy()
cate_col=['grade', 'subGrade']
all_fea=float_col+cate_col

2、catboost建模

设置迭代次数1000、学习速率0.01、行采样列采样均为100%，最终效果train、test的auc分别为0.72、0.718

catboost_params={
    'loss_function': 'Logloss',
    'custom_loss': 'AUC',
    'eval_metric': 'AUC',
    'iterations': 1000,
    'learning_rate': 0.01,
    'random_seed': 123,
    'l2_leaf_reg': 5,
    'bootstrap_type': 'Bernoulli',
    'subsample': 1,
    'sampling_frequency': 'PerTree',
    'use_best_model': True,
    'best_model_min_trees': 50,
    'depth': 4,
    'grow_policy': 'SymmetricTree',
    'min_data_in_leaf': 500,
    'one_hot_max_size': 4,
    'rsm': 1,
    'nan_mode': 'Max',
    'input_borders': None,
    'boosting_type': 'Ordered',
    'max_ctr_complexity': 2,
    'logging_level': 'Verbose',
    'metric_period': 1,
    'early_stopping_rounds': 20,
    'border_count': 254,
    'feature_border_type': 'GreedyLogSum'
}
def catboost_model(df,y_name,params,cate_col=[]):
    x_train,x_test, y_train, y_test =train_test_split(df.drop(y_name,axis=1),df[y_name],test_size=0.2, random_state=123)
    
    model = CatBoostClassifier(**params)
    model.fit(x_train, y_train,eval_set=[(x_train, y_train),(x_test,y_test)],cat_features=cate_col)
    
    train_pred = [pred[1] for pred in  model.predict_proba(x_train)]
    train_auc= roc_auc_score(list(y_train),train_pred)
    
    test_pred = [pred[1] for pred in  model.predict_proba(x_test)]
    test_auc= roc_auc_score(list(y_test),test_pred)
    
    result={
        'train_auc':train_auc,
        'test_auc':test_auc,
    }
    return model,result


model,model_result=catboost_model(df_copy[all_fea+['isDefault']],'isDefault',params,cate_col)
model_result

3、ngboost建模

注意点：

（1）不支持类别型变量入模，需要自行编码

（2）不支持缺失数据入模，需要自行填充

（3）由于模型复杂度高，训练、推理速度减慢，不过通用参数效果依然很好

控制参数与catboost相近，为方便此处仅使用数值型变量进行拟合评估。最终效果train、test的auc分别为0.728、0.72，这里实际上还有小部分信息损失，包括类别型变量信息、缺失值填充导致的损失，模型效果依然坚挺！

ngb_params={
    'Dist':k_categorical(2), # 预测值y的分布，取值k_categorical, Bernoulli,Normal,Exponential等
    'Score':LogScore, # 损失函数，取值LogScore, CRPScore
    'Base':default_tree_learner, # 基学习器、类似于子树，取值default_tree_learner、DecisionTreeRegressor(criterion='friedman_mse', max_depth=4)
    'natural_gradient':True,   # 自然梯度 or 常规梯度
    'n_estimators':1000,  # 迭代次数
    'learning_rate':0.01,  # 学习速率
    'minibatch_frac':1.0,  # 行采样
    'col_sample':1.0,  # 列采样
    'verbose':True,
    'verbose_eval':100,
    'tol':0.0001,   # 迭代过程中损失函数阈值，当损失函数的变化小于tol时，训练过程将停止
    'random_state':1,
}

def ngboost_model(df,y_name,fea_list,params):
    x_train,x_test, y_train, y_test =train_test_split(df[fea_list],df[y_name],test_size=0.2, random_state=123)
    
    model = NGBClassifier(**params)
    model.fit(x_train, y_train)
    
    train_pred = model.predict_proba(x_train)[:,1]
    train_auc= roc_auc_score(list(y_train),train_pred)
    
    test_pred = model.predict_proba(x_test)[:,1]
    test_auc= roc_auc_score(list(y_test),test_pred)
    
    result={
        'train_auc':train_auc,
        'test_auc':test_auc,
    }
    return model,result

ngboost_model,model_result=ngboost_model(df_copy.fillna(-1),'isDefault',float_col,ngb_params)
model_result

感兴趣的同学可以去尝试一下，ngboost的回归、分类模型效果都很好，横向对比xgboost、lightgbm、catboost。

精华部分

关注威信公众号Python风控模型与数据分析，获取更多Python、机器学习、深度学习、风控知识