Building a Financial Loan Delinquency Model, Part 4: Model Tuning

Copyright notice: this is an original post by the author and may not be reproduced without permission. https://blog.csdn.net/u012736685/article/details/85231794

I. Task

Use grid search to tune seven models (using 5-fold cross-validation during tuning), evaluate each model, and show the output of the code.

II. Overview

Almost every machine learning model involves parameter tuning, and different parameter combinations produce different results:

  • If the dataset is not very large (running time is acceptable), use GridSearchCV to automatically pick the best combination from the candidate parameters.
  • If the dataset is large and the model is expensive in compute and time, GridSearchCV may cost too much; you then need a deeper understanding of the model, or more hands-on experience, and tune the parameters manually.

1. Parameters

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')

(1) estimator
The estimator to tune, e.g. estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10); pass in every parameter except the ones being searched. Each estimator needs either a scoring parameter or a score method.
(2) param_grid
A dict (or list of dicts) giving the candidate values for the parameters to optimize, e.g. param_grid=param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
(3) scoring
The evaluation metric. Defaults to None, in which case the estimator's score method is used. It can also be a string such as scoring='roc_auc' (the appropriate metric depends on the model), or a callable with the signature scorer(estimator, X, y). The available scoring strings are listed at:
http://scikit-learn.org/stable/modules/model_evaluation.html
(4) cv
The cross-validation setting. Defaults to None, which means 3-fold cross-validation; it can also be an integer number of folds, or a generator that yields train/test splits.
(5) refit
Defaults to True: after the search finishes, the estimator is refit once on the whole training set with the best parameters found by cross-validation, and that refit model is used for the final performance evaluation.
(6) iid
Defaults to True: assumes the samples are identically distributed across folds, so the loss is summed over all samples rather than averaged per fold.
(7) verbose
Logging verbosity (int): 0 prints nothing during training, 1 prints occasionally, and >1 prints output for every sub-model.
(8) n_jobs
Number of parallel jobs (int): -1 means as many jobs as CPU cores; the default is 1.
(9) pre_dispatch
Caps the total number of dispatched parallel jobs. When n_jobs > 1 the data is copied for each dispatched job, which can cause out-of-memory errors; setting pre_dispatch limits pre-dispatched jobs so the data is copied at most pre_dispatch times.
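
As a minimal sketch of how these parameters fit together (the estimator and grid here are illustrative, echoing the examples above, not one of the tuning runs below):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=10)        # fixed parameters go on the estimator
param_grid = {'n_estimators': range(10, 71, 10)}    # searched parameters go in the grid

grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring='roc_auc',        # metric used to rank parameter combinations
    cv=5,                     # 5-fold cross-validation
    n_jobs=-1,                # use all CPU cores
    pre_dispatch='2*n_jobs',  # cap dispatched jobs to limit data copies
    refit=True,               # refit on the full training set with the best parameters
    verbose=1,                # print occasional progress
)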

2. Commonly used attributes and methods

grid.fit(): run the grid search;
grid_scores_: the evaluation results for each parameter combination (deprecated since scikit-learn 0.18 in favor of cv_results_);
best_params_: the parameter combination that achieved the best result;
best_score_: the best score observed during the search.
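
A minimal usage sketch of these members, continuing the example above (x_train and y_train assumed):

grid.fit(x_train, y_train)                    # run the grid search
print(grid.best_params_)                      # best parameter combination
print(grid.best_score_)                       # best cross-validated score
print(grid.cv_results_['mean_test_score'])    # per-combination scores (the replacement for grid_scores_)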

III. Implementation

1. Imports

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
import xgboost as xgb
import numpy as np
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

2. Model evaluation function

## Model evaluation
def model_metrics(clf, y_target, y_predict):
    # clf is accepted but unused here; only label-based metrics are printed
    accuracy = accuracy_score(y_target, y_predict)
    print('The accuracy is ', accuracy)
    precision = precision_score(y_target, y_predict)
    print('The precision is ', precision)
    recall = recall_score(y_target, y_predict)
    print('The recall is ', recall)
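
f1_score and roc_auc_score are imported above but never used. A hedged extension of this helper (my addition, not in the original) could report them too; note that roc_auc_score needs scores or probabilities rather than hard labels, which is presumably why the clf argument exists:

def model_metrics_full(clf, x_test, y_target, y_predict):
    # extended variant (assumption): also report F1 and ROC AUC
    print('The accuracy is ', accuracy_score(y_target, y_predict))
    print('The precision is ', precision_score(y_target, y_predict))
    print('The recall is ', recall_score(y_target, y_predict))
    print('The f1 score is ', f1_score(y_target, y_predict))
    # AUC is computed from predicted probabilities of the positive class
    print('The auc is ', roc_auc_score(y_target, clf.predict_proba(x_test)[:, 1]))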

3. Loading the data

## Load the data
data = pd.read_csv("data_all.csv")
x = data.drop(labels='status', axis=1)
y = data['status']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)

## Standardize the features
# fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(x_train)
x_train_stand = scaler.transform(x_train)
x_test_stand = scaler.transform(x_test)

4、Logistic Regression

(1) Parameter tuning

lr = LogisticRegression()
# parameters to tune; note that penalty='l1' needs a solver that supports it
# (liblinear, the default in the scikit-learn version used here)
param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(estimator=lr, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1, 'penalty': 'l1'}

(2) Model evaluation

lr = LogisticRegression(C = 0.1, penalty = 'l1')
lr.fit(x_train_stand, y_train)
y_pre_lr = lr.predict(x_test_stand)
model_metrics(lr, y_test, y_pre_lr)

Output:

The accuracy is  0.7890679747722494
The precision is  0.6746987951807228
The recall is  0.31197771587743733
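
Since the winning penalty is 'l1', some coefficients are driven exactly to zero. A quick check of that sparsity (my addition, not in the original post; np is imported above):

# count the features kept vs. zeroed out by the L1 penalty
print('non-zero coefficients:', np.sum(lr.coef_ != 0))
print('zeroed-out coefficients:', np.sum(lr.coef_ == 0))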

5、SVM

(1) Parameter tuning

svm = SVC(random_state=2018, probability=True)
param = {'C': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=svm, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

==> Best parameters: {'C': 0.1}

(2) Model evaluation

svm = SVC(C=0.1, random_state=2018, probability=True)
svm.fit(x_train_stand, y_train)
y_pre_svm = svm.predict(x_test_stand)
model_metrics(svm, y_test, y_pre_svm)

Output:

The accuracy is  0.7575332866152769
The precision is  0.8823529411764706
The recall is  0.04178272980501393
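
Recall here is extremely low (about 4%): with this C the regularized SVM predicts the majority (non-overdue) class almost everywhere. One common remedy worth trying, sketched below as an assumption rather than part of the original tuning, is to reweight the classes:

# assumed variant: weight errors on the minority class more heavily
svm_balanced = SVC(C=0.1, class_weight='balanced', random_state=2018, probability=True)
svm_balanced.fit(x_train_stand, y_train)
model_metrics(svm_balanced, y_test, svm_balanced.predict(x_test_stand))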

6、Decision Tree

(1) Parameter tuning

dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90, max_features='sqrt', random_state=2018)
# the grids below were searched one stage at a time, fixing each stage's best
# values before the next; only the last param dict is active in this snippet
param = {'max_depth': range(3, 14, 2), 'min_samples_split': range(100, 801, 200)}
# Best parameters: {'max_depth': 9, 'min_samples_split': 300}
param = {'min_samples_split': range(50, 1000, 100), 'min_samples_leaf': range(60, 101, 10)}
# Best parameters: {'min_samples_leaf': 90, 'min_samples_split': 50}
param = {'max_features': range(7, 20, 2)}
# Best parameters: {'max_features': 9}
grid = GridSearchCV(estimator=dt, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))

(2) Model evaluation

dt = DecisionTreeClassifier(max_depth=9, min_samples_split=50, min_samples_leaf=90, max_features=9, random_state=2018)
dt.fit(x_train_stand, y_train)
y_pre_dt = dt.predict(x_test_stand)
model_metrics(dt, y_test, y_pre_dt)

Output:

The accuracy is  0.7561317449194114
The precision is  0.5578947368421052
The recall is  0.14763231197771587

7、Random Forest

## Random Forest
# param = {'n_estimators': range(1, 200, 5), 'max_features': ['log2', 'sqrt', 'auto']}
# Best parameters: {'max_features': 'sqrt', 'n_estimators': 171}
rf = RandomForestClassifier(n_estimators=171, max_features='sqrt', random_state=2018)
rf.fit(x_train_stand, y_train)
y_pre_rf = rf.predict(x_test_stand)
model_metrics(rf, y_test, y_pre_rf)

Output:

The accuracy is  0.7848633496846531
The precision is  0.6857142857142857
The recall is  0.26740947075208915
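
As a side check (my addition, not in the original), a fitted random forest exposes per-feature importances, which helps sanity-check what the model relies on:

# top 10 features by importance in the fitted forest
importances = pd.Series(rf.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False).head(10))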

8、GBDT

# gbdt = GradientBoostingClassifier(random_state=2018)
# param = {'n_estimators': range(1, 100, 10), 'learning_rate': np.arange(0.1, 1, 0.1)}
# grid = GridSearchCV(estimator=gbdt, param_grid=param, scoring='roc_auc', cv=5)
# grid.fit(x_train_stand, y_train)
# print('Best parameters:', grid.best_params_)
# print('Best score on the training set:', grid.best_score_)
# print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'n_estimators': 41}
gbdt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=41, random_state=2018)
gbdt.fit(x_train_stand, y_train)
y_pre_gbdt = gbdt.predict(x_test_stand)
model_metrics(gbdt, y_test, y_pre_gbdt)

9、XGBoost

## Parameter tuning
# the grids below were searched one stage at a time; only the first dict is active here
param = {'n_estimators': range(20, 200, 20)}
# param = {'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 12, 2)}
# param = {'gamma': [i / 10 for i in range(1, 6)]}
# param = {'subsample': [i / 10 for i in range(5, 10)], 'colsample_bytree': [i / 10 for i in range(5, 10)]}
# param = {'reg_alpha': [1e-5, 1e-2, 0.1, 0, 1, 100]}
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11, reg_alpha=0.01,
                            gamma=0.1, subsample=0.7, colsample_bytree=0.7, objective='binary:logistic',
                            nthread=4, scale_pos_weight=1, seed=2018)
# note: the estimator must be the classifier instance, not the xgb module
grid = GridSearchCV(estimator=xgboost, param_grid=param, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'n_estimators': 40}
# Best score on the training set: 0.8028110571725202
# Score on the test set: 0.7770857458817146

## Model evaluation
xgboost = xgb.XGBClassifier(learning_rate=0.1, n_estimators=40, max_depth=3, min_child_weight=11, reg_alpha=0.01,
                        gamma=0.1, subsample=0.7, colsample_bytree=0.7, objective='binary:logistic',
                        nthread=4, scale_pos_weight=1, seed=2018)
xgboost.fit(x_train_stand, y_train)
y_pre_xgb = xgboost.predict(x_test_stand)
model_metrics(xgboost, y_test, y_pre_xgb)

Output:

The accuracy is  0.7876664330763841
The precision is  0.6521739130434783
The recall is  0.3342618384401114

10、LightGBM

## Parameter tuning
gbm = lgb.LGBMClassifier(seed=2018)
param = {'learning_rate': np.arange(0.1, 0.5, 0.1), 'max_depth': range(1, 6, 1),
         'n_estimators': range(30, 50, 5)}
grid = GridSearchCV(estimator=gbm, param_grid=param, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
grid.fit(x_train_stand, y_train)
print('Best parameters:', grid.best_params_)
print('Best score on the training set:', grid.best_score_)
print('Score on the test set:', grid.score(x_test_stand, y_test))
# Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 40}
# Best score on the training set: 0.8007228827289531
# Score on the test set: 0.7729296422647178

## Model evaluation
gbm = lgb.LGBMClassifier(learning_rate=0.1, max_depth=3, n_estimators=40, seed=2018)
gbm.fit(x_train_stand, y_train)
y_pre_gbm = gbm.predict(x_test_stand)
model_metrics(gbm, y_test, y_pre_gbm)

Output:

The accuracy is  0.7932725998598459
The precision is  0.6839080459770115
The recall is  0.33147632311977715
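
Collecting the test-set metrics reported above (rounded to four decimals; the GBDT output is not reported in this post):

Model                 Accuracy   Precision   Recall
Logistic Regression   0.7891     0.6747      0.3120
SVM                   0.7575     0.8824      0.0418
Decision Tree         0.7561     0.5579      0.1476
Random Forest         0.7849     0.6857      0.2674
XGBoost               0.7877     0.6522      0.3343
LightGBM              0.7933     0.6839      0.3315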

IV. Problems encountered

1. UnboundLocalError: local variable 'xxx' referenced before assignment

Error:
UnboundLocalError: local variable 'xxx' referenced before assignment

This happens when a variable is defined outside a function and the function then both reads and assigns to it: any name assigned inside a function is treated as local, so the read fails because it happens before the local assignment.

In short, the interpreter cannot tell whether the variable is meant to be global or local.

Fix: rename the variable so the names no longer clash (or declare it global inside the function), as shown below.
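
A minimal reproduction and the two fixes (illustrative names, not from the original code):

n = 0

def bad():
    n = n + 1     # UnboundLocalError: assigning to n makes it local to bad(),
                  # so the n on the right is read before any assignment

def fixed_with_global():
    global n      # declare that n refers to the module-level variable
    n = n + 1

def fixed_by_renaming(m):
    return m + 1  # or avoid the clash entirely: rename / pass the value in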

2. ImportError: [joblib] Attempting to do parallel computing without protecting

Error:
ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

Fix: wrap the entry-point code in if __name__ == '__main__':, as sketched below.
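
A sketch of the fix: everything that triggers parallel work (here, GridSearchCV with n_jobs > 1) moves under the main guard, so worker processes can import the module without re-running the search:

def main():
    # ... load the data, build grid = GridSearchCV(..., n_jobs=4), call grid.fit(...) ...
    pass

if __name__ == '__main__':
    main()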

3. Recall

Why is recall consistently low across these models?
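
A likely explanation (my reading, not spelled out in the original): overdue loans are the minority class in this dataset, and every model above classifies with the default 0.5 probability threshold, which favors the majority class; precision and recall can be traded off by lowering that threshold. A sketch using the fitted logistic regression:

# assumed follow-up: trade precision for recall by lowering the decision threshold
proba = lr.predict_proba(x_test_stand)[:, 1]   # probability of the positive (overdue) class
y_pre_lower = (proba > 0.3).astype(int)        # 0.3 is illustrative, not tuned
model_metrics(lr, y_test, y_pre_lower)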
