Building a Financial Loan Delinquency Model, Part 4: Model Tuning

Copyright notice: this is an original post by the author and may not be reproduced without permission. https://blog.csdn.net/u012736685/article/details/85231794

I. Task

Use grid search to tune seven models (using 5-fold cross-validation during tuning), evaluate each model, and show the output of the code.

II. Overview

Almost every machine learning model involves parameter tuning, and different parameter combinations produce different results:

  • If the dataset is not very large (running time is acceptable), use GridSearchCV to automatically pick the best combination from the candidate parameters.
  • If the dataset is large and the model is expensive in compute and time, GridSearchCV may cost too much; you then need a deeper understanding of the model, or more hands-on experience, and tune the parameters manually.

1. Parameters

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')

(1) estimator
The estimator to tune, e.g. estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10); pass in every parameter except the ones being searched. Each estimator needs either a scoring parameter or a score method.
(2) param_grid
A dict (or list of dicts) giving the candidate values for the parameters to optimize, e.g. param_grid=param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
(3) scoring
The evaluation metric. Defaults to None, in which case the estimator's score method is used. It can also be a string such as scoring='roc_auc' (the appropriate metric depends on the model), or a callable with the signature scorer(estimator, X, y). The available scoring strings are listed at:
http://scikit-learn.org/stable/modules/model_evaluation.html
(4) cv
The cross-validation setting. Defaults to None, which means 3-fold cross-validation; it can also be an integer number of folds, or a generator that yields train/test splits.
(5) refit
Defaults to True: after the search finishes, the estimator is refit once on the whole training set with the best parameters found by cross-validation, and that refit model is used for the final performance evaluation.
(6) iid
Defaults to True: assumes the samples are identically distributed across folds, so the loss is summed over all samples rather than averaged per fold.
(7) verbose
Logging verbosity (int): 0 prints nothing during training, 1 prints occasionally, and >1 prints output for every sub-model.
(8) n_jobs
Number of parallel jobs (int): -1 means as many jobs as CPU cores; the default is 1.
(9) pre_dispatch
Caps the total number of dispatched parallel jobs. When n_jobs > 1 the data is copied for each dispatched job, which can cause out-of-memory errors; setting pre_dispatch limits pre-dispatched jobs so the data is copied at most pre_dispatch times.
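
As a minimal sketch of how these parameters fit together (the estimator and grid here are illustrative, echoing the examples above, not one of the tuning runs below):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=10)        # fixed parameters go on the estimator
param_grid = {'n_estimators': range(10, 71, 10)}    # searched parameters go in the grid

grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring='roc_auc',        # metric used to rank parameter combinations
    cv=5,                     # 5-fold cross-validation
    n_jobs=-1,                # use all CPU cores
    pre_dispatch='2*n_jobs',  # cap dispatched jobs to limit data copies
    refit=True,               # refit on the full training set with the best parameters
    verbose=1,                # print occasional progress
)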

2. Commonly used attributes and methods

grid.fit(): run the grid search;
grid_scores_: the evaluation results for each parameter combination (deprecated since scikit-learn 0.18 in favor of cv_results_);
best_params_: the parameter combination that achieved the best result;
best_score_: the best score observed during the search.
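
A minimal usage sketch of these members, continuing the example above (x_train and y_train assumed):

grid.fit(x_train, y_train)                    # run the grid search
print(grid.best_params_)                      # best parameter combination
print(grid.best_score_)                       # best cross-validated score
print(grid.cv_results_['mean_test_score'])    # per-combination scores (the replacement for grid_scores_)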

III. Implementation

1. Imports

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
import xgboost as xgb
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

2. Model evaluation function

## Model evaluation
def model_metrics(clf, y_target, y_predict):
    # clf is accepted but unused here; only label-based metrics are printed
    accuracy = accuracy_score(y_target, y_predict)
    print('The accuracy is ', accuracy)
    precision = precision_score(y_target, y_predict)
    print('The precision is ', precision)
    recall = recall_score(y_target, y_predict)
    print('The recall is ', recall)
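
f1_score and roc_auc_score are imported above but never used. A hedged extension of this helper (my addition, not in the original) could report them too; note that roc_auc_score needs scores or probabilities rather than hard labels, which is presumably why the clf argument exists:

def model_metrics_full(clf, x_test, y_target, y_predict):
    # extended variant (assumption): also report F1 and ROC AUC
    print('The accuracy is ', accuracy_score(y_target, y_predict))
    print('The precision is ', precision_score(y_target, y_predict))
    print('The recall is ', recall_score(y_target, y_predict))
    print('The f1 score is ', f1_score(y_target, y_predict))
    # AUC is computed from predicted probabilities of the positive class
    print('The auc is ', roc_auc_score(y_target, clf.predict_proba(x_test)[:, 1]))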

3. Loading the data

## Load the data
data = pd.read_csv("data_all.csv")
x = data.drop(labels='status', axis=1)
y = data['status']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)

## Standardize the features
# fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(x_train)
x_train_stand = scaler.transform(x_train)
x_test_stand = scaler.transform(x_test)

4、Logistic Regression

(1) Parameter tuning

lr = LogisticRegression()
# parameters to tune; note that penalty='l1' needs a solver that supports it
# (liblinear, the default in the scikit-learn version used here)
param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(estimator=lr, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1, 'penalty': 'l1'}

(2) Model evaluation

lr = LogisticRegression(C = 0.1, penalty = 'l1')
lr.fit(x_train_stand, y_train)
y_pre_lr = lr.predict(x_test_stand)
model_metrics(lr, y_test, y_pre_lr)

Output:

The accuracy is  0.7890679747722494
The precision is  0.6746987951807228
The recall is  0.31197771587743733
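
Since the winning penalty is 'l1', some coefficients are driven exactly to zero. A quick check of that sparsity (my addition, not in the original post; np is imported above):

# count the features kept vs. zeroed out by the L1 penalty
print('non-zero coefficients:', np.sum(lr.coef_ != 0))
print('zeroed-out coefficients:', np.sum(lr.coef_ == 0))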

5、SVM

(1) Parameter tuning

svm = SVC(random_state=2018, probability=True)
param = {'C': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=svm, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1}

(2) Model evaluation

svm = SVC(C=0.1, random_state=2018, probability=True)
svm.fit(x_train_stand, y_train)
y_pre_svm = svm.predict(x_test_stand)
model_metrics(svm, y_test, y_pre_svm)

Output:

The accuracy is  0.7575332866152769
The precision is  0.8823529411764706
The recall is  0.04178272980501393
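
Recall here is extremely low (about 4%): with this C the regularized SVM predicts the majority (non-overdue) class almost everywhere. One common remedy worth trying, sketched below as an assumption rather than part of the original tuning, is to reweight the classes:

# assumed variant: weight errors on the minority class more heavily
svm_balanced = SVC(C=0.1, class_weight='balanced', random_state=2018, probability=True)
svm_balanced.fit(x_train_stand, y_train)
model_metrics(svm_balanced, y_test, svm_balanced.predict(x_test_stand))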

6、Decision Tree

(1) Parameter tuning

dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90, max_features='sqrt', random_state=2018)
# the grids below were searched one stage at a time, fixing each stage's best
# values before the next; only the last param dict is active in this snippet
param = {'max_depth': range(3, 14, 2), 'min_samples_split': range(100, 801, 200)}
# Best parameters: {'max_depth': 9, 'min_samples_split': 300}
param = {'min_samples_split': range(50, 1000, 100), 'min_samples_leaf': range(60, 101, 10)}
# Best parameters: {'min_samples_leaf': 90, 'min_samples_split': 50}
param = {'max_features': range(7, 20, 2)}
# Best parameters: {'max_features': 9}
grid = GridSearchCV(estimator=dt, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

(2) Model evaluation

dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90, max_features=9, random_state=2018)
dt.fit(x_train_stand, y_train)
y_pre_dt = dt.predict(x_test_stand)
model_metrics(dt, y_test, y_pre_dt)

Output:

The accuracy is  0.7561317449194114
The precision is  0.5578947368421052
The recall is  0.14763231197771587

7、Random Forest

## Random Forest
# param = {'n_estimators': range(1, 200, 5), 'max_features': ['log2', 'sqrt', 'auto']}
# Best parameters: {'max_features': 'sqrt', 'n_estimators': 171}
rf = RandomForestClassifier(n_estimators=171, max_features='sqrt', random_state=2018)
rf.fit(x_train_stand, y_train)
y_pre_rf = rf.predict(x_test_stand)
model_metrics(rf, y_test, y_pre_rf)

Output:

The accuracy is  0.7848633496846531
The precision is  0.6857142857142857
The recall is  0.26740947075208915
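
As a side check (my addition, not in the original), a fitted random forest exposes per-feature importances, which helps sanity-check what the model relies on:

# top 10 features by importance in the fitted forest
importances = pd.Series(rf.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False).head(10))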

8、GBDT

# gbdt = GradientBoostingClassifier(random_state=2018)
# param = {'n_estimators': range(1, 100, 10), 'learning_rate': np.arange(0.1, 1, 0.1)}
# grid = GridSearchCV(estimator=gbdt, param_grid=param, scoring='roc_auc', cv=5)
# grid.fit(x_train_stand, y_train)
# print('Best parameters:', grid.best_params_)
# print('Best score on the training set:', grid.best_score_)
# print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'n_estimators': 41}
gbdt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=41, random_state=2018)
gbdt.fit(x_train_stand, y_train)
y_pre_gbdt = gbdt.predict(x_test_stand)
model_metrics(gbdt, y_test, y_pre_gbdt)

9、XGBoost

## Parameter tuning
# the grids below were searched one stage at a time; only the first dict is active here
param = {'n_estimators': range(20, 200, 20)}
# param = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 12, 2)}
# param = {'gamma': [i / 10 for i in range(1, 6)]}
# param = {'subsample': [i / 10 for i in range(5, 10)], 'colsample_bytree': [i / 10 for i in range(5, 10)]}
# param = {'reg_alpha': [1e-5, 1e-2, 0.1, 0, 1, 100]}
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11, reg_alpha=0.01,
                            gamma=0.1, subsample=0.7, colsample_bytree=0.7, objective='binary:logistic',
                            nthread=4, scale_pos_weight=1, seed=2018)
# note: the estimator must be the classifier instance, not the xgb module
grid = GridSearchCV(estimator=xgboost, param_grid=param, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'n_estimators': 40}
# Best score on the training set: 0.8028110571725202
# Score on the test set: 0.7770857458817146

## Model evaluation
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11, reg_alpha=0.01,
                        gamma=0.1, subsample=0.7, colsample_bytree=0.7, objective='binary:logistic',
                        nthread=4, scale_pos_weight=1, seed=2018)
xgboost.fit(x_train_stand, y_train)
y_pre_xgb = xgboost.predict(x_test_stand)
model_metrics(xgboost, y_test, y_pre_xgb)

Output:

The accuracy is  0.7876664330763841
The precision is  0.6521739130434783
The recall is  0.3342618384401114

10、LightGBM

## Parameter tuning
gbm = lgb.LGBMClassifier(seed=2018)
param = {'learning_rate': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6, 1),
         'n_estimators': range(30, 50, 5)}
grid = GridSearchCV(estimator=gbm, param_grid=param, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 40}
# Best score on the training set: 0.8007228827289531
# Score on the test set: 0.7729296422647178

## Model evaluation
gbm = lgb.LGBMClassifier(learning_rate=0.1, max_depth=3, n_estimators=40, seed=2018)
gbm.fit(x_train_stand, y_train)
y_pre_gbm = gbm.predict(x_test_stand)
model_metrics(gbm, y_test, y_pre_gbm)

Output:

The accuracy is  0.7932725998598459
The precision is  0.6839080459770115
The recall is  0.33147632311977715
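
Collecting the test-set metrics reported above (rounded to four decimals; the GBDT output is not reported in this post):

Model                 Accuracy   Precision   Recall
Logistic Regression   0.7891     0.6747      0.3120
SVM                   0.7575     0.8824      0.0418
Decision Tree         0.7561     0.5579      0.1476
Random Forest         0.7849     0.6857      0.2674
XGBoost               0.7877     0.6522      0.3343
LightGBM              0.7933     0.6839      0.3315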

IV. Problems encountered

1. UnboundLocalError: local variable 'xxx' referenced before assignment

Error:
UnboundLocalError: local variable 'xxx' referenced before assignment

This happens when a variable is defined outside a function and the function then both reads and assigns to it: any name assigned inside a function is treated as local, so the read fails because it happens before the local assignment.

In short, the interpreter cannot tell whether the variable is meant to be global or local.

Fix: rename the variable so the names no longer clash (or declare it global inside the function), as shown below.
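
A minimal reproduction and the two fixes (illustrative names, not from the original code):

n = 0

def bad():
    n = n + 1     # UnboundLocalError: assigning to n makes it local to bad(),
                  # so the n on the right is read before any assignment

def fixed_with_global():
    global n      # declare that n refers to the module-level variable
    n = n + 1

def fixed_by_renaming(m):
    return m + 1  # or avoid the clash entirely: rename / pass the value in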

2. ImportError: [joblib] Attempting to do parallel computing without protecting

Error:
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

Fix: wrap the entry-point code in if __name__ == '__main__':, as sketched below.
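
A sketch of the fix: everything that triggers parallel work (here, GridSearchCV with n_jobs > 1) moves under the main guard, so worker processes can import the module without re-running the search:

def main():
    # ... load the data, build grid = GridSearchCV(..., n_jobs=4), call grid.fit(...) ...
    pass

if __name__ == '__main__':
    main()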

3. Recall

Why is recall consistently low across these models?
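
A likely explanation (my reading, not spelled out in the original): overdue loans are the minority class in this dataset, and every model above classifies with the default 0.5 probability threshold, which favors the majority class; precision and recall can be traded off by lowering that threshold. A sketch using the fitted logistic regression:

# assumed follow-up: trade precision for recall by lowering the decision threshold
proba = lr.predict_proba(x_test_stand)[:, 1]   # probability of the positive (overdue) class
y_pre_lower = (proba > 0.3).astype(int)        # 0.3 is illustrative, not tuned
model_metrics(lr, y_test, y_pre_lower)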
