Financial Risk Control - Loan Default Prediction

Foreword

The Alibaba Tianchi platform offers rich data resources that are well suited to sharpening data-analysis thinking. This article deepens the understanding of training different models by working through a loan default prediction task, with the goal of further refining a practical framework for data-analysis projects.

Module import, Jupyter environment configuration and dataset loading

Import the required modules.

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sklearn
from sklearn.linear_model import LinearRegression, LogisticRegression, LogisticRegressionCV
from sklearn.exceptions import ConvergenceWarning
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC                  # used in the baseline model comparison
from sklearn.naive_bayes import GaussianNB   # used in the baseline model comparison

Some previous settings.

# Set the font so Chinese characters render correctly.
mpl.rcParams['font.sans-serif'] = [u'simHei']
# Prevent minus signs in figures from rendering as empty boxes.
mpl.rcParams['axes.unicode_minus'] = False
# Suppress warnings.
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
# seaborn font setting
sns.set(font='SimHei')

Load the dataset.

data_submit = pd.read_csv('sample_submit.csv')
data_testA  = pd.read_csv('testA.csv')
data_train  = pd.read_csv('train.csv')
data_train
______________________________________________________

id	loanAmnt	term	interestRate	installment	grade	subGrade	employmentTitle	employmentLength	homeOwnership	...	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
0	0	35000.0	5	19.52	917.97	E	E2	320.0	2 years	2	...	9.0	8.0	4.0	12.0	2.0	7.0	0.0	0.0	0.0	2.0
1	1	18000.0	5	18.49	461.90	D	D2	219843.0	5 years	0	...	NaN	NaN	NaN	NaN	NaN	13.0	NaN	NaN	NaN	NaN
2	2	12000.0	5	16.99	298.17	D	D3	31698.0	8 years	0	...	0.0	21.0	4.0	5.0	3.0	11.0	0.0	0.0	0.0	4.0
3	3	11000.0	3	7.26	340.96	A	A4	46854.0	10+ years	1	...	16.0	4.0	7.0	21.0	6.0	9.0	0.0	0.0	0.0	1.0
4	4	3000.0	3	12.99	101.07	C	C2	54.0	NaN	1	...	4.0	9.0	10.0	15.0	7.0	12.0	0.0	0.0	0.0	4.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
799995	799995	25000.0	3	14.49	860.41	C	C4	2659.0	7 years	1	...	6.0	2.0	12.0	13.0	10.0	14.0	0.0	0.0	0.0	3.0
799996	799996	17000.0	3	7.90	531.94	A	A4	29205.0	10+ years	0	...	15.0	16.0	2.0	19.0	2.0	7.0	0.0	0.0	0.0	0.0
799997	799997	6000.0	3	13.33	203.12	C	C3	2582.0	10+ years	1	...	4.0	26.0	4.0	10.0	4.0	5.0	0.0	0.0	1.0	4.0
799998	799998	19200.0	3	6.92	592.14	A	A4	151.0	10+ years	0	...	10.0	6.0	12.0	22.0	8.0	16.0	0.0	0.0	0.0	5.0
799999	799999	9000.0	3	11.06	294.91	B	B3	13.0	5 years	0	...	3.0	4.0	4.0	8.0	3.0	7.0	0.0	0.0	0.0	2.0
800000 rows × 47 columns

Exploratory Analysis (EDA)

Dataset overview

This mainly covers: data type information, a preview of the statistical distributions, and the dimensions of the dataset.

# Check the data types
data_train.info()
# Check the data dimensions
data_train.shape
# Check the statistical distribution of the data
data_train.describe()
——————————————————————————————————————————————————————
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB

View missing values

Look at the columns that contain missing values and extract them for visualization.

# Store the number of missing values per column in a DataFrame, with the column named 'counts'
nan_num = pd.DataFrame(data_train.isnull().sum(), columns=['counts'])
# Keep only the columns whose missing-value count is greater than 0
data_nan = nan_num[nan_num['counts'] > 0]
# Sort by count
data_nan.sort_values(by='counts', inplace=True)
data_nan
————————————————————————————————————————————————————
	counts
employmentTitle	1
postCode	1
title	       1
dti	239
pubRecBankruptcies	405
revolUtil	531
n10	33239
n4	33239
n12	40270
n9	40270
n7	40270
n6	40270
n3	40270
n13	40270
n2	40270
n1	40270
n0	40270
n5	40270
n14	40270
n8	40271
employmentLength	46799
n11	69752

Visualize the missing values by count.

data_nan.plot.hist(figsize=(30,10))

[Figure: histogram of the missing-value counts]

Outlier analysis mainly checks whether values fall within a reasonable deviation range, and a box plot is a handy way to view the distribution (a sketch follows the unique-value listing below). First, look at the number of unique values in each column.

# Number of unique values per column, sorted
data_u = data_train.nunique().sort_values()
data_u
______________________________________________________
policyCode                 1
term                       2
applicationType            2
initialListStatus          2
isDefault                  2
verificationStatus         3
n12                        5
n11                        5
homeOwnership              6
grade                      7
pubRecBankruptcies        11
employmentLength          11
purpose                   14
n13                       28
delinquency_2years        30
n14                       31
pubRec                    32
n1                        33
subGrade                  35
n0                        39
ficoRangeLow              39
ficoRangeHigh             39
n9                        44
n4                        46
n2                        50
n3                        50
regionCode                51
n5                        65
n7                        70
openAcc                   75
n10                       76
n8                       102
n6                       107
totalAcc                 134
issueDate                139
interestRate             641
earliesCreditLine        720
postCode                 932
revolUtil               1286
loanAmnt                1540
dti                     6321
title                  39644
annualIncome           44926
revolBal               71116
installment            72360
employmentTitle       248683
id                    800000
dtype: int64
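
As mentioned above, a box plot is a quick way to check whether values fall outside a reasonable range. A minimal sketch, assuming the data_train DataFrame loaded earlier (the columns chosen here are just illustrative):

# Box plots of a few continuous columns to eyeball potential outliers
cols = ['loanAmnt', 'interestRate', 'installment', 'dti']
fig, axes = plt.subplots(1, len(cols), figsize=(16, 4))
for ax, col in zip(axes, cols):
    sns.boxplot(y=data_train[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()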

Data type analysis

As seen in the previous step, the dataset contains three data types: object, int64, and float64. Below, the columns are split by data type and analyzed in turn.

num_type=data_train.select_dtypes(exclude=['object'])
cls_type=data_train.select_dtypes(include=['object'])
num_type.columns ,cls_type.columns
——————————————————————————————————————————————————————
(Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment',
        'employmentTitle', 'homeOwnership', 'annualIncome',
        'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode',
        'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc',
        'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
        'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0',
        'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11',
        'n12', 'n13', 'n14'],
       dtype='object'),
 Index(['grade', 'subGrade', 'employmentLength', 'issueDate',
        'earliesCreditLine'],
       dtype='object'))
复制代码

Numeric data

A helper function is written to split the numeric columns into discrete and continuous groups. To display the results more neatly, the return values are converted to ndarrays and transposed.

def classify_types():
    num_continuous = []
    num_discrete = []
    global data_train, num_type
    for name in num_type:
        counts = data_train[name].nunique()
        if counts <= 10:
            num_discrete.append(name)
        else:
            num_continuous.append(name)
    return np.array(num_discrete).T, np.array(num_continuous).T

num_discrete, num_continuous = classify_types()

Continuously distributed data

array(['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle',
       'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti',
       'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc',
       'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil',
       'totalAcc', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6',
       'n7', 'n8', 'n9', 'n10', 'n13', 'n14'], dtype='<U18')

Discrete distributed data.

array(['term', 'homeOwnership', 'verificationStatus', 'isDefault',
       'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12'],
      dtype='<U18')

Look at the discrete columns; the full output is long, so only the beginning is shown.

for i in num_discrete:
    print(f'{i}:{data_train[i]}')


The distribution of the continuous columns can be analyzed visually. Since the full dataset is large and plotting is slow, 50,000 rows are sampled to check whether each column looks normally distributed.

data_continuous = data_train[num_continuous]
data_partial = data_continuous.head(50000)
df = pd.melt(data_partial, value_vars=num_continuous)
sp = sns.FacetGrid(df, col='variable', col_wrap=3, sharex=False, sharey=False)
sp = sp.map(sns.distplot, 'value', color='r', rug=True)

[Figure: distribution plots of the continuous columns]
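
Note that sns.distplot is deprecated in seaborn 0.11 and later; on a recent version, an equivalent sketch using displot (with the same melted df) would be:

# Faceted histograms with KDE, one panel per variable (seaborn >= 0.11)
sns.displot(data=df, x='value', col='variable', col_wrap=3,
            kde=True, facet_kws={'sharex': False, 'sharey': False})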

Categorical data

Let's first look at the distribution of the categorical columns.
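
A quick way to inspect these distributions (a sketch, assuming the cls_type frame from the dtype split above):

# Print the value counts of each categorical column
for col in cls_type.columns:
    print(data_train[col].value_counts(), '\n')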

Categorical data can also be analyzed visually. Extracting the loan grade column for plotting shows that grades B, C, and A are the most common.

plt.figure(figsize=(8,5))
grade_counts = data_train['grade'].value_counts()
sns.barplot(x=grade_counts.index, y=grade_counts.values)

[Figure: bar chart of loan grade counts]

Now look at the distribution of employment length. Borrowers with 10+ years of employment are the most numerous, while the remaining employment lengths are fairly evenly distributed.

plt.figure(figsize=(8,5))
emp_counts = data_train['employmentLength'].value_counts()
sns.barplot(x=emp_counts.index, y=emp_counts.values)

[Figure: bar chart of employmentLength counts]

Data relationship analysis

Let's first look at the distribution of the target variable.

data_train['isDefault'].value_counts().plot.bar(color='g')

[Figure: bar chart of isDefault value counts]

Next, analyze the relationship between loan default and the feature values.

Extract the subsets with isDefault = 1 and isDefault = 0 separately.

data_y = data_train[data_train['isDefault']==1]
data_n = data_train[data_train['isDefault']==0]

For the default and non-default subsets, analyze the relationship of employmentLength and grade with isDefault.

fig,((ax1,ax2),(ax3,ax4)) =plt.subplots(2,2,figsize=(14,8))
data_y.groupby('employmentLength').size().plot.bar(ax=ax1,color='r',alpha=0.8)
data_n.groupby('employmentLength')['employmentLength'].count().plot.bar(ax=ax2,color='r',alpha=0.85)
data_y.groupby('grade').size().plot.bar(ax=ax3,color='g',alpha=0.8)
data_n.groupby('grade')['grade'].count().plot.bar(ax=ax4,color='g',alpha=0.85)
plt.xticks(rotation=90)

[Figure: employmentLength and grade distributions for default vs. non-default borrowers]

Next, visually analyze the relationship between loan amount and default; defaults mostly occur among borrowers with large loan amounts.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 8))
# Total and median loan amount for non-default (0) vs. default (1)
data_train.groupby('isDefault')['loanAmnt'].sum().plot.bar(ax=ax1, color='r', alpha=0.9)
data_train.groupby('isDefault')['loanAmnt'].median().plot.bar(ax=ax2, color='b', alpha=1)

[Figure: loan amount totals and medians by default status]

Feature engineering

The exploratory analysis above gives a fairly complete picture of the dataset, which is why that early work matters. The feature engineering below deals with the key points found in EDA: handling missing values, processing feature values, and feature selection.

Missing value handling

The data types were already analyzed. After separating the prediction target from the dataset, numeric columns are filled with their mean and categorical columns with their mode.

num_type = list(data_train.select_dtypes(exclude=['object']).columns)
cls_type = list(data_train.select_dtypes(include=['object']).columns)
e = 'isDefault'
num_type.remove(e)
# Separate the target variable
data_target = data_train['isDefault']
data_train.drop('isDefault', axis=1, inplace=True)
# Fill numeric columns with the mean
data_train[num_type] = data_train[num_type].fillna(data_train[num_type].mean())
# Fill categorical columns with the mode (mode() returns a DataFrame, so take its first row)
data_train[cls_type] = data_train[cls_type].fillna(data_train[cls_type].mode().iloc[0])
data_train[cls_type].isnull().sum()
data_train['employmentLength'].fillna('10+ years', inplace=True)
data_train[cls_type]
_____________________________________________________________

grade	subGrade	employmentLength	issueDate	earliesCreditLine
0	E	E2	2 years	2014-07-01	Aug-2001
1	D	D2	5 years	2012-08-01	May-2002
2	D	D3	8 years	2015-10-01	May-2006
3	A	A4	10+ years	2015-08-01	May-1999
4	C	C2	10+ years	2016-03-01	Aug-1977
...	...	...	...	...	...
799995	C	C4	7 years	2016-07-01	Aug-2011
799996	A	A4	10+ years	2013-04-01	May-1989
799997	C	C3	10+ years	2015-10-01	Jul-2002
799998	A	A4	10+ years	2015-02-01	Jan-1994
799999	B	B3	5 years	2018-08-01	Feb-2002
800000 rows × 5 columns

Handle missing values in the test set:

data_testA[num_type] = data_testA[num_type].fillna(data_testA[num_type].mean())
data_testA[cls_type] = data_testA[cls_type].fillna(data_testA[cls_type].mode().iloc[0])
data_testA['employmentLength'] = data_testA['employmentLength'].fillna('10+ years')

Since id has no real value for model training, it is dropped directly.

data_train.drop(['id'],axis=1,inplace=True)
data_testA.drop(['id'],1,inplace=True)

Append the test set to the training set so that the same transformations are applied to both; to keep things safe, only the training rows are used for model training later.

combined=data_train.append(data_testA)

Check the combined data. policyCode takes only a single value, which carries almost no information, so it is dropped. The processed dataset has 1,000,000 rows and 44 features; the first 800,000 rows are the training set and the last 200,000 are the test set, which makes the later split straightforward, i.e. data_train = combined[:800000].

combined.drop(['policyCode'], axis=1, inplace=True)
combined.shape
_______________________________________________________
(1000000, 44)

One-hot encoding

Judging by their values, the columns that need dummy (one-hot) encoding are: loan term (term), loan grade (grade), loan sub-grade (subGrade), employment length (employmentLength), and verification status (verificationStatus). After encoding, combined has 97 features.

def dummies_coder():
    global combined
    for name in ['term', 'grade', 'subGrade',
                 'employmentLength', 'verificationStatus']:
        data_dummies = pd.get_dummies(combined[name], prefix=name)
        combined = pd.concat([combined, data_dummies], axis=1)
        combined.drop(name, axis=1, inplace=True)
    return combined

combined = dummies_coder()
combined.shape
————————————————————————————————————————————————————————
(1000000, 97)

Feature combination

Based on the field descriptions on the competition page, all features are grouped subjectively. With the one-hot encoding already done, the relationships between features can be examined on this basis.

Borrower information:
            O-employmentLength   employment length (years)
            annualIncome         annual income
            employmentTitle      employment title
            dti                  debt-to-income ratio
Loan information:
            loanAmnt             loan amount
            O-term               loan term (years)
            interestRate         loan interest rate
            O-grade              loan grade
            O-subGrade           loan sub-grade
            issueDate            month the loan was issued

Repayment information:
            installment          monthly installment amount

Additional loan information:
            O-verificationStatus verification status
            postCode             first 3 digits of the postal code on the loan application
            purpose              loan purpose category
            delinquency_2years   number of delinquency events in the past two years
            pubRec               number of derogatory public records
            homeOwnership        home-ownership status
            revolBal             total revolving credit balance
            earliesCreditLine    month the borrower's earliest reported credit line was opened
            openAcc              number of open credit lines in the borrower's credit file
            title                loan title provided by the borrower
            applicationType      individual application or joint application with two co-borrowers
            O-initialListStatus  initial listing status of the loan
            revolUtil            revolving line utilization rate, relative to all available revolving credit
            totalAcc             total number of credit lines currently in the borrower's credit file
            ficoRangeLow         lower bound of the FICO range at loan origination
            ficoRangeHigh        upper bound of the FICO range at loan origination
            pubRecBankruptcies   number of public-record bankruptcies
Additional repayment information:
            regionCode           region code

Anonymous features n0-n14 are counts of borrower behaviour; their unique-value counts are:
n12                        5
n11                        5
n13                       28
n14                       31
n1                        33
n0                        39
n9                        44
n4                        46
n2                        50
n3                        50
n5                        65
n7                        70
n10                       76
n8                       102
n6                       107
  • annualIncome × dti = debt (annual debt), which reflects the borrower's financial pressure and influences whether they default;
combined['debt']=combined['annualIncome'] *combined['dti']
combined.drop(['annualIncome','dti'],1,inplace=True)
  • loanAmnt × interestRate × term = interest; then total principal and interest (total_loan) / installment = stages (the number of installments). A larger number of installments means a longer window in which a default can occur;
combined['stages']=(combined['loanAmnt'] *combined['interestRate']*0.01*combined['term']+combined['loanAmnt'])/combined['installment']
combined.drop(['loanAmnt','interestRate','term','installment'],1,inplace=True)
combined.shape
__________________________________________________________
(1000000, 92)
  • delinquency_2years + pubRec − pubRecBankruptcies = Credit_record. Bad records give an indirect view of the borrower's default history;
combined['Credit_record'] = combined['delinquency_2years']+combined['pubRec']-combined['pubRecBankruptcies']
combined.drop(['delinquency_2years','pubRec','pubRecBankruptcies'],1,inplace=True)
  • openAcc + totalAcc + revolBal + revolUtil = credit_line (total credit). The total credit line reflects the borrower's credit standing: better credit means a larger limit, and a larger limit also helps hedge default risk;
combined['credit_line'] =combined['openAcc']+combined['totalAcc']+combined['revolBal']+combined['revolUtil']
combined.drop(['openAcc','totalAcc','revolBal','revolUtil'],1,inplace=True)
  • ficoRangeLow + ficoRangeHigh = ficoRange (the FICO range at loan origination), a simple mechanical sum;
combined['ficoRange'] = combined['ficoRangeHigh'] +combined['ficoRangeLow']
combined.drop(['ficoRangeHigh','ficoRangeLow'],1,inplace=True)
  • Sum the borrower-behaviour count features n0 + n1 + … + n14 = total_n. Since these are all behaviour counts with small values, they are merged into a single column, removing 15 columns at once;
combined['total_n'] = (combined['n0']+combined['n1']+combined['n2']+combined['n3']+combined['n4']
                       +combined['n5']+combined['n6']+combined['n7']+combined['n8']+combined['n9']
                       +combined['n10']+combined['n11']+combined['n12']+combined['n13']+combined['n14'])
combined.drop(['n0','n1','n2','n3','n4','n5','n6','n7','n8','n9','n10',
               'n11','n12','n13','n14'],1,inplace=True)
combined.shape
___________________________________________________
(1000000, 72)

After all the processing, the dataset has 72 features. Next, restore the training set, the test set, and the target values.

train=combined.iloc[:800000]
test=combined.iloc[800000:]
targets=pd.read_csv('train.csv',usecols=['isDefault'])['isDefault'].values

Model training

Feature selection

After feature processing there are still as many as 72 features, so feature selection is used to deliberately reduce the dimensionality. A random forest estimator is used below to compute feature importances.

clf=RandomForestClassifier(n_estimators=50,max_features='sqrt')
clf =clf.fit(train,targets)

Visualize the importance of each feature.

features =pd.DataFrame()
features['feature'] =train.columns
features['importance']=clf.feature_importances_
features.sort_values(by=['importance'],ascending=True,inplace=True)
features.set_index('feature',inplace=True)
features.plot(kind='barh',figsize=(15,15))

[Figure: feature importances from the random forest]

As the feature engineering suggested, stages, debt, credit_line, total_n, postCode, and employmentTitle all rank highly. Next, make the dataset more compact; the number of features drops sharply.

model =SelectFromModel(clf,prefit=True)
train_reduce =model.transform(train)
train_reduce.shape
————————————————————————————————————————————————————
(800000, 12)
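
If the reduced feature set is used for training, the test set has to go through the same selector so the columns stay aligned. A sketch (the original keeps working with the full feature set below):

# Apply the same feature selector to the test set
test_reduce = model.transform(test)
test_reduce.shape  # should be (200000, 12)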

Define a scoring function to evaluate the models:

def compute_score(clf, x, y, cv=5, scoring='accuracy'):
    xval = cross_val_score(clf, x, y, cv=cv, scoring=scoring)
    return np.mean(xval)

Baseline models

Instantiate the classifiers.

logreg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
gboost = GradientBoostingClassifier()
svc=SVC()
gnb = GaussianNB()
models = [ logreg_cv, rf,svc, gnb,gboost]

Since the dataset is large, only part of it is used for training to keep things fast.

train_partial =train[:80000]
targets_partial =targets[:80000]
for model in models:
    print (f'Cross-validation of :{model.__class__}')
    score = compute_score(clf=model, x=train_partial, y=targets_partial, scoring='accuracy')
    print (f'CV score ={score}')
    print ('****')
——————————————————————————————————————————————————————————————————
Cross-validation of :<class 'sklearn.linear_model._logistic.LogisticRegressionCV'>
CV score =0.7982125
****
Cross-validation of :<class 'sklearn.ensemble._forest.RandomForestClassifier'>
CV score =0.7988125
****
Cross-validation of :<class 'sklearn.svm._classes.SVC'>
CV score =0.7984249999999999
****
Cross-validation of :<class 'sklearn.naive_bayes.GaussianNB'>
CV score =0.7983374999999999
****
Cross-validation of :<class 'sklearn.ensemble._gb.GradientBoostingClassifier'>
CV score =0.8005749999999999
****
  • logreg_cv CV score =0.7982125
  • rf CV score =0.7988125
  • svc CV score =0.7984249999999999
  • gnb CV score =0.7983374999999999
  • gboost CV score =0.8005749999999999

Based on this first round of training, rf and gboost perform best, so their parameters are tuned to improve the models.

Hyperparameter tuning

Among the models tried, the random forest and the gradient boosting classifier performed best, so their parameters are optimized further; GridSearchCV is used for this step.

model_box= []
rf=RandomForestClassifier(random_state=2021,max_features='auto')
rf_params ={'n_estimators':[50,120,300],'max_depth':[5,8,15],
            'min_samples_leaf':[2,5,10],'min_samples_split':[2,5,10]}
model_box.append([rf,rf_params])

gboost=GradientBoostingClassifier(random_state=2021)
gboost_params ={'learning_rate':[0.05,0.1,0.15],'n_estimators':[10,50],
                'max_depth':[3,4,6,10],'min_samples_split':[50,10]}
model_box.append([gboost,gboost_params])
for i in range(len(model_box)):
    best_model= GridSearchCV(model_box[i][0],param_grid=model_box[i][1],refit=True,cv=5,scoring='roc_auc').fit(train_partial,targets_partial)
    print(model_box[i],':')
    print('best_parameters:',best_model.best_params_) 
  
 ————————————————————————————————————————————————————————————————
[RandomForestClassifier(random_state=2021), {'n_estimators': [50, 120, 300], 'max_depth': [5, 8, 15], 'min_samples_leaf': [2, 5, 10], 'min_samples_split': [2, 5, 10]}] :
best_parameters: {'max_depth': 15, 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 300}
[GradientBoostingClassifier(random_state=2021), {'learning_rate': [0.05, 0.1, 0.15], 'n_estimators': [10, 50], 'max_depth': [3, 4, 6, 10], 'min_samples_split': [50, 10]}] :
best_parameters: {'learning_rate': 0.15, 'max_depth': 4, 'min_samples_split': 50, 'n_estimators': 50}


The grid search shows that the best parameters are:

  • rf:

best_parameters: 'max_depth': 15, 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 300

  • gboost:

best_parameters: 'learning_rate': 0.15, 'max_depth': 4, 'min_samples_split': 50, 'n_estimators': 50

Plug these parameters back into the models, retrain, and compute cross-validation scores to evaluate them (the model_best dictionary used below is built in the sketch that follows).
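
A minimal sketch of how model_best can be built from the tuned parameters found above:

# Rebuild the two best models with the tuned parameters
model_best = {
    'rf': RandomForestClassifier(random_state=2021, n_estimators=300, max_depth=15,
                                 min_samples_leaf=10, min_samples_split=2),
    'gboost': GradientBoostingClassifier(random_state=2021, learning_rate=0.15,
                                         n_estimators=50, max_depth=4, min_samples_split=50),
}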

for name in model_best:
    model_best[name].fit(train_partial, targets_partial)
    score = cross_val_score(model_best[name], train_partial, targets_partial, cv=5, scoring='accuracy')
    print('%s score: %.4f' % (name, score.mean()))
__________________________________________________________________
rf score: 0.7997
gboost score: 0.8005

Select the best model, make predictions on the test set, and write the submission file.

model_final =model_best['gboost']
predictions= model_final.predict(test).astype(int)
df_predictions = pd.DataFrame()
abc = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/testA.csv')
df_predictions['id'] = abc['id']
df_predictions['isDefault'] = predictions
df_predictions[['id','isDefault']].to_csv('/content/drive/MyDrive/Colab Notebooks/submit.csv', index=False)
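
One note: this task is typically evaluated by AUC, so submitting the predicted default probability instead of a hard 0/1 label usually scores better. A sketch with the same model_final and test as above:

# Probability of the positive class (isDefault = 1) for each test row
proba = model_final.predict_proba(test)[:, 1]
df_predictions['isDefault'] = proba
df_predictions[['id','isDefault']].to_csv('/content/drive/MyDrive/Colab Notebooks/submit.csv', index=False)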


Origin juejin.im/post/6945277628093825060