Foreword
The Alibaba Tianchi platform offers rich data resources that are well suited to sharpening data-analysis thinking. This article deepens the understanding of training different models by working through a loan-default prediction task, with the aim of further refining a practical data-analysis framework.
Module imports, Jupyter environment configuration, and dataset loading
Import the required modules.
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import sklearn
from sklearn.linear_model import LinearRegression, LogisticRegression, LogisticRegressionCV
from sklearn.exceptions import ConvergenceWarning
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
Some preliminary settings.
# Set the font so Chinese characters render correctly.
mpl.rcParams['font.sans-serif'] = [u'simHei']
# Keep minus signs in figures from rendering as boxes.
mpl.rcParams['axes.unicode_minus'] = False
# Suppress warnings.
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
# Seaborn font settings.
sns.set(font='SimHei')
Load the dataset.
data_submit = pd.read_csv('sample_submit.csv')
data_testA =pd.read_csv('testA.csv')
data_train =pd.read_csv('train.csv')
data_train
______________________________________________________
id loanAmnt term interestRate installment grade subGrade employmentTitle employmentLength homeOwnership ... n5 n6 n7 n8 n9 n10 n11 n12 n13 n14
0 0 35000.0 5 19.52 917.97 E E2 320.0 2 years 2 ... 9.0 8.0 4.0 12.0 2.0 7.0 0.0 0.0 0.0 2.0
1 1 18000.0 5 18.49 461.90 D D2 219843.0 5 years 0 ... NaN NaN NaN NaN NaN 13.0 NaN NaN NaN NaN
2 2 12000.0 5 16.99 298.17 D D3 31698.0 8 years 0 ... 0.0 21.0 4.0 5.0 3.0 11.0 0.0 0.0 0.0 4.0
3 3 11000.0 3 7.26 340.96 A A4 46854.0 10+ years 1 ... 16.0 4.0 7.0 21.0 6.0 9.0 0.0 0.0 0.0 1.0
4 4 3000.0 3 12.99 101.07 C C2 54.0 NaN 1 ... 4.0 9.0 10.0 15.0 7.0 12.0 0.0 0.0 0.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
799995 799995 25000.0 3 14.49 860.41 C C4 2659.0 7 years 1 ... 6.0 2.0 12.0 13.0 10.0 14.0 0.0 0.0 0.0 3.0
799996 799996 17000.0 3 7.90 531.94 A A4 29205.0 10+ years 0 ... 15.0 16.0 2.0 19.0 2.0 7.0 0.0 0.0 0.0 0.0
799997 799997 6000.0 3 13.33 203.12 C C3 2582.0 10+ years 1 ... 4.0 26.0 4.0 10.0 4.0 5.0 0.0 0.0 1.0 4.0
799998 799998 19200.0 3 6.92 592.14 A A4 151.0 10+ years 0 ... 10.0 6.0 12.0 22.0 8.0 16.0 0.0 0.0 0.0 5.0
799999 799999 9000.0 3 11.06 294.91 B B3 13.0 5 years 0 ... 3.0 4.0 4.0 8.0 3.0 7.0 0.0 0.0 0.0 2.0
800000 rows × 47 columns
Exploratory Analysis (EDA)
Dataset overview
This covers data types, a statistical summary, and the dataset's dimensions.
# Inspect column types and non-null counts
data_train.info()
# Inspect dimensions
data_train.shape
# Inspect the statistical distribution
data_train.describe()
——————————————————————————————————————————————————————
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n3 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
View missing values
Look at data with missing values and extract them for visualization.
# Store the per-column missing-value counts in a DataFrame with a 'counts' column
nan_num = pd.DataFrame(data_train.isnull().sum(), columns=['counts'])
# Keep only the columns that actually contain missing values
data_nan = nan_num[nan_num['counts'] > 0]
# Sort by count
data_nan.sort_values(by='counts', inplace=True)
————————————————————————————————————————————————————
counts
employmentTitle 1
postCode 1
title 1
dti 239
pubRecBankruptcies 405
revolUtil 531
n10 33239
n4 33239
n12 40270
n9 40270
n7 40270
n6 40270
n3 40270
n13 40270
n2 40270
n1 40270
n0 40270
n5 40270
n14 40270
n8 40271
employmentLength 46799
n11 69752
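Beyond raw counts, the missing *ratio* per column is often easier to read on a dataset of this size. A minimal sketch on a toy frame (the column names and values here are hypothetical, not from the competition data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_train (hypothetical values).
df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [np.nan, np.nan, 6.0],
                   'c': [1.0, 2.0, 3.0]})

# Fraction of missing values per column, keeping only columns with gaps.
miss_ratio = df.isnull().mean()
miss_ratio = miss_ratio[miss_ratio > 0].sort_values()
```

On the real data the same two lines applied to `data_train` would show, e.g., that `n11` is missing in under 9% of rows.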
Visualize missing values by number.
data_nan.plot.hist(figsize=(30, 10))
Outlier analysis mainly checks whether values fall within a reasonable deviation range, and box plots can be used to view the data distribution. First, look at the number of unique values per column.
# Count unique values per column, sorted
data_u =data_train.nunique().sort_values()
data_u
______________________________________________________
policyCode 1
term 2
applicationType 2
initialListStatus 2
isDefault 2
verificationStatus 3
n12 5
n11 5
homeOwnership 6
grade 7
pubRecBankruptcies 11
employmentLength 11
purpose 14
n13 28
delinquency_2years 30
n14 31
pubRec 32
n1 33
subGrade 35
n0 39
ficoRangeLow 39
ficoRangeHigh 39
n9 44
n4 46
n2 50
n3 50
regionCode 51
n5 65
n7 70
openAcc 75
n10 76
n8 102
n6 107
totalAcc 134
issueDate 139
interestRate 641
earliesCreditLine 720
postCode 932
revolUtil 1286
loanAmnt 1540
dti 6321
title 39644
annualIncome 44926
revolBal 71116
installment 72360
employmentTitle 248683
id 800000
dtype: int64
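The unique-value counts above already reveal that policyCode is constant; such uninformative columns can also be found programmatically. A small sketch on toy data mirroring that situation:

```python
import pandas as pd

# Toy frame: policyCode is constant, as in the output above.
df = pd.DataFrame({'policyCode': [1, 1, 1], 'term': [3, 5, 3]})

# Columns with a single unique value carry no information for a model.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
```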
Data type analysis
The previous analysis shows that the dataset contains three data types: object, int64, and float64. Below, the columns are split by type and analyzed in turn.
num_type=data_train.select_dtypes(exclude=['object'])
cls_type=data_train.select_dtypes(include=['object'])
num_type.columns ,cls_type.columns
——————————————————————————————————————————————————————
(Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment',
'employmentTitle', 'homeOwnership', 'annualIncome',
'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode',
'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc',
'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0',
'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11',
'n12', 'n13', 'n14'],
dtype='object'),
Index(['grade', 'subGrade', 'employmentLength', 'issueDate',
'earliesCreditLine'],
dtype='object'))
Numeric data
A helper function splits the numeric columns into discrete and continuous ones (at most 10 unique values counts as discrete); the return values are transposed ndarrays for tidier display.
def classify_types():
    num_continuous = []
    num_discrete = []
    global data_train, num_type
    for name in num_type:
        counts = data_train[name].nunique()
        if counts <= 10:
            num_discrete.append(name)
        else:
            num_continuous.append(name)
    return np.array(num_discrete).T, np.array(num_continuous).T

num_discrete, num_continuous = classify_types()
Continuously distributed data
array(['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle',
'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti',
'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc',
'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil',
'totalAcc', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6',
       'n7', 'n8', 'n9', 'n10', 'n13', 'n14'], dtype='<U18')
Discretely distributed data.
array(['term', 'homeOwnership', 'verificationStatus', 'isDefault',
'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12'],
      dtype='<U18')
Print the discrete columns; the output is long, so only the beginning is shown here.
for i in num_discrete:
    print(f'{i}:{data_train[i]}')
The distribution of the continuous features can be inspected visually. Since the full dataset is large and slow to plot, 50,000 rows are sampled to check whether each feature is roughly normally distributed.
data_continuous = data_train[num_continuous]
data_partial = data_continuous.head(50000)
df = pd.melt(data_partial, value_vars=num_continuous)
sp = sns.FacetGrid(df, col='variable', col_wrap=3, sharex=False, sharey=False)
sp = sp.map(sns.distplot, 'value', color='r', rug=True)
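As a numeric complement to these plots, skewness quantifies how far a feature is from normal; long-right-tailed features (income- or balance-like columns) are often log-transformed. A hedged sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Toy heavy-tailed column (hypothetical values, one large outlier).
s = pd.Series([10.0, 20, 30, 40, 50, 60, 70, 80, 90, 10000.0])
skew_before = s.skew()
skew_after = np.log1p(s).skew()  # log1p pulls in the long right tail
```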
Categorical data
Let's first look at the distribution of the categorical columns.
Categorical data can be analyzed visually; here the loan-grade column is plotted, showing that grades B, C, and A are the most common.
plt.figure(figsize=(8, 5))
grade_counts = data_train['grade'].value_counts()
sns.barplot(x=grade_counts.index, y=grade_counts.values)
Next, the distribution of employment length: the "10+ years" group is by far the largest, while the remaining groups are fairly evenly distributed.
plt.figure(figsize=(8, 5))
emp_counts = data_train['employmentLength'].value_counts()
sns.barplot(x=emp_counts.index, y=emp_counts.values)
Data relationship analysis
Let's first look at the distribution of the prediction target.
data_train['isDefault'].value_counts().plot.bar(color='g')
Let's analyze the relationship between loan default and the feature values.
The default (isDefault=1) and non-default (isDefault=0) subsets are extracted separately.
data_y = data_train[data_train['isDefault']==1]
data_n = data_train[data_train['isDefault']==0]
Analyze employmentLength vs. isDefault and grade vs. isDefault for the default and non-default subsets respectively.
fig,((ax1,ax2),(ax3,ax4)) =plt.subplots(2,2,figsize=(14,8))
data_y.groupby('employmentLength').size().plot.bar(ax=ax1,color='r',alpha=0.8)
data_n.groupby('employmentLength')['employmentLength'].count().plot.bar(ax=ax2,color='r',alpha=0.85)
data_y.groupby('grade').size().plot.bar(ax=ax3,color='g',alpha=0.8)
data_n.groupby('grade')['grade'].count().plot.bar(ax=ax4,color='g',alpha=0.85)
plt.xticks(rotation=90)
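Raw counts in the two subsets are hard to compare because the classes are imbalanced; the default *rate* per category is often more telling. A minimal sketch with hypothetical rows (same column names as data_train):

```python
import pandas as pd

# Toy sample, values are made up for illustration.
df = pd.DataFrame({'grade': ['A', 'A', 'B', 'B', 'B'],
                   'isDefault': [0, 0, 0, 1, 1]})

# The mean of a 0/1 target per group is exactly the default rate.
default_rate = df.groupby('grade')['isDefault'].mean()
```

On the real data, `data_train.groupby('grade')['isDefault'].mean()` would give the default rate per loan grade directly.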
Visually analyze the relationship between loan amount and default status; defaults mostly occur among borrowers with large loan amounts.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 8))
# Total and median loan amount for each default status
data_train.groupby('isDefault')['loanAmnt'].sum().plot.bar(ax=ax1, color='r', alpha=0.9)
data_train.groupby('isDefault')['loanAmnt'].median().plot.bar(ax=ax2, color='b', alpha=1)
Feature engineering
The earlier exploratory analysis gave a fairly complete picture of the dataset, which is why that stage matters so much. Feature engineering addresses the key points found in the EDA: handling missing values, processing feature values, and feature selection.
Missing-value handling
The data types have already been analyzed. After separating the prediction target from the dataset, numeric columns are filled with their mean and categorical columns with their mode.
num_type = list(data_train.select_dtypes(exclude=['object']).columns)
cls_type = list(data_train.select_dtypes(include=['object']).columns)
num_type.remove('isDefault')
# Separate the prediction target
data_target = data_train['isDefault']
data_train.drop('isDefault', axis=1, inplace=True)
# Fill numeric columns with the mean
data_train[num_type] = data_train[num_type].fillna(data_train[num_type].mean())
# Fill categorical columns with the mode; .iloc[0] turns mode() into a Series
# so fillna maps it onto every row (for employmentLength the mode is '10+ years')
data_train[cls_type] = data_train[cls_type].fillna(data_train[cls_type].mode().iloc[0])
data_train[cls_type].isnull().sum()
_____________________________________________________________
grade subGrade employmentLength issueDate earliesCreditLine
0 E E2 2 years 2014-07-01 Aug-2001
1 D D2 5 years 2012-08-01 May-2002
2 D D3 8 years 2015-10-01 May-2006
3 A A4 10+ years 2015-08-01 May-1999
4 C C2 10+ years 2016-03-01 Aug-1977
... ... ... ... ... ...
799995 C C4 7 years 2016-07-01 Aug-2011
799996 A A4 10+ years 2013-04-01 May-1989
799997 C C3 10+ years 2015-10-01 Jul-2002
799998 A A4 10+ years 2015-02-01 Jan-1994
799999 B B3 5 years 2018-08-01 Feb-2002
800000 rows × 5 columns
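One subtlety worth knowing here: `DataFrame.fillna` with a DataFrame argument aligns on the row index, so passing `mode()` directly (whose result lives at index 0) fills almost nothing; taking the first mode row as a Series fills every column by name. A toy demonstration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'employmentLength': ['2 years', np.nan, '10+ years', '10+ years']})

# mode() returns a DataFrame indexed 0..k; aligned on the index,
# it cannot reach the NaN at row 1.
partly = df.fillna(df.mode())
# .iloc[0] yields a Series keyed by column name, which fillna
# applies to every row.
filled = df.fillna(df.mode().iloc[0])
```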
Missing-value handling for the test set:
data_testA[num_type] = data_testA[num_type].fillna(data_testA[num_type].mean())
data_testA[cls_type] = data_testA[cls_type].fillna(data_testA[cls_type].mode().iloc[0])
Since id carries no signal for model training, it is dropped outright.
data_train.drop(['id'], axis=1, inplace=True)
data_testA.drop(['id'], axis=1, inplace=True)
Append the test set to the training set so both are transformed consistently; to stay safe, only the training portion is used for model fitting later.
combined = pd.concat([data_train, data_testA], ignore_index=True)
Check the combined data. policyCode takes a single value, which contributes little to the overall analysis, so it is dropped (id has already been removed). The processed dataset has 1,000,000 rows and 44 features; the first 800,000 rows are the training set and the last 200,000 the test set, which makes the later split straightforward: 'data_train = combined[:800000]'.
combined.drop(['policyCode'], axis=1, inplace=True)
combined.shape
_______________________________________________________
(1000000, 44)
One-hot encoding
Based on their values, the columns that need dummy encoding are: loan term (term), loan grade (grade), loan subgrade (subGrade), employment length (employmentLength), and verification status (verificationStatus). After encoding, combined has 97 features.
def dummies_coder():
    global combined
    for name in ['term', 'grade', 'subGrade',
                 'employmentLength', 'verificationStatus']:
        data_dummies = pd.get_dummies(combined[name], prefix=name)
        combined = pd.concat([combined, data_dummies], axis=1)
        combined.drop(name, axis=1, inplace=True)
    return combined

combined = dummies_coder()
combined.shape
————————————————————————————————————————————————————————
(1000000, 97)
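The loop above is equivalent to a single `pd.get_dummies` call with the `columns` argument, which encodes the listed columns and drops the originals in one step. A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'term': [3, 5, 3], 'loanAmnt': [1000.0, 2000.0, 1500.0]})

# Encode 'term' and drop the original column in one call.
encoded = pd.get_dummies(df, columns=['term'], prefix='term')
```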
Feature combination
Based on the descriptions on the competition page, all features are grouped subjectively. One-hot encoding has already been done (columns marked O- below); on this basis, look for related features to combine.
Borrower information:
O-employmentLength: employment length (years)
annualIncome: annual income
employmentTitle: employment title
dti: debt-to-income ratio
Loan information:
loanAmnt: loan amount
O-term: loan term (years)
interestRate: loan interest rate
O-grade: loan grade
O-subGrade: loan subgrade
issueDate: month the loan was issued
Repayment information:
installment: installment amount
Additional loan information:
O-verificationStatus: verification status
postCode: first 3 digits of the zip code on the loan application
purpose: loan purpose category
delinquency_2years: number of delinquency events in the past 2 years
pubRec: number of derogatory public records
homeOwnership: home ownership status
revolBal: total revolving credit balance
earliesCreditLine: month the borrower's earliest reported credit line was opened
openAcc: number of open credit lines in the borrower's credit file
title: loan title provided by the borrower
applicationType: individual application or joint application with two co-borrowers
O-initialListStatus
revolUtil: revolving line utilization rate, i.e. credit used relative to all available revolving credit
totalAcc: total number of credit lines currently in the borrower's credit file
ficoRangeLow: lower bound of the borrower's FICO range at loan origination
ficoRangeHigh: upper bound of the borrower's FICO range at loan origination
pubRecBankruptcies: number of public record bankruptcies
Additional repayment information:
regionCode: region code
Anonymous features n0-n14, derived from counts of borrower behavior; their unique-value counts:
n12 5
n11 5
n13 28
n14 31
n1 33
n0 39
n9 44
n4 46
n2 50
n3 50
n5 65
n7 70
n10 76
n8 102
n6 107
- Annual income (annualIncome) × debt-to-income ratio (dti) = annual debt (debt), which reflects the borrower's financial pressure and affects the likelihood of default;
combined['debt'] = combined['annualIncome'] * combined['dti']
combined.drop(['annualIncome', 'dti'], axis=1, inplace=True)
- Loan amount (loanAmnt) × interest rate (interestRate) × number of years (term) = interest (interest); then total loan principal and interest (total_loan) / installment amount (installment) = number of installments (stages). A larger stages value means a default could play out over a longer period;
combined['stages'] = (combined['loanAmnt'] * combined['interestRate'] * 0.01 * combined['term']
                      + combined['loanAmnt']) / combined['installment']
combined.drop(['loanAmnt', 'interestRate', 'term', 'installment'], axis=1, inplace=True)
combined.shape
__________________________________________________________
(1000000, 92)
- Delinquency events (delinquency_2years) + derogatory public records (pubRec) − public record bankruptcies cleared (pubRecBankruptcies) = credit record (Credit_record); adverse records hint at the borrower's default history;
combined['Credit_record'] = combined['delinquency_2years'] + combined['pubRec'] - combined['pubRecBankruptcies']
combined.drop(['delinquency_2years', 'pubRec', 'pubRecBankruptcies'], axis=1, inplace=True)
- Open credit lines (openAcc) + total credit lines (totalAcc) + revolving balance (revolBal) + revolving utilization (revolUtil) = total credit line (credit_line). Total credit reflects the borrower's creditworthiness: better credit means a larger line, and available credit can also buffer default risk;
combined['credit_line'] = combined['openAcc'] + combined['totalAcc'] + combined['revolBal'] + combined['revolUtil']
combined.drop(['openAcc', 'totalAcc', 'revolBal', 'revolUtil'], axis=1, inplace=True)
- FICO lower bound at origination (ficoRangeLow) + FICO upper bound (ficoRangeHigh) = FICO range at origination (ficoRange). This is a purely mechanical sum, but it folds the two bounds into one column;
combined['ficoRange'] = combined['ficoRangeHigh'] + combined['ficoRangeLow']
combined.drop(['ficoRangeHigh', 'ficoRangeLow'], axis=1, inplace=True)
- Sum the borrower-behavior count features: n0 + n1 + … + n14 = total_n. Since they all count borrower behavior and take small values, they are merged into a single feature, eliminating 15 columns at once.
combined['total_n'] = (combined['n0'] + combined['n1'] + combined['n2'] + combined['n3'] + combined['n4']
                       + combined['n5'] + combined['n6'] + combined['n7'] + combined['n8'] + combined['n9']
                       + combined['n10'] + combined['n11'] + combined['n12'] + combined['n13'] + combined['n14'])
combined.drop(['n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10',
               'n11', 'n12', 'n13', 'n14'], axis=1, inplace=True)
combined.shape
___________________________________________________
(1000000, 72)
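One gap worth noting: issueDate and earliesCreditLine are still strings at this point, and tree models cannot consume them directly. A hedged sketch (not part of the original pipeline) of turning them into a numeric credit-history length:

```python
import pandas as pd

# Hypothetical sample rows in the dataset's two date formats.
df = pd.DataFrame({'issueDate': ['2014-07-01', '2016-03-01'],
                   'earliesCreditLine': ['Aug-2001', 'Aug-1977']})
issue = pd.to_datetime(df['issueDate'])
earliest = pd.to_datetime(df['earliesCreditLine'], format='%b-%Y')
# Length of credit history (in days) at the time the loan was issued.
df['creditHistory'] = (issue - earliest).dt.days
```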
After all processing, the dataset has 72 features. Now recover the training set, test set, and target values.
train=combined.iloc[:800000]
test=combined.iloc[800000:]
targets=pd.read_csv('train.csv',usecols=['isDefault'])['isDefault'].values
Model training
Feature selection
Even after feature processing there are still as many as 72 features, so feature selection is applied to reduce the dimensionality deliberately. A random forest estimator is used to compute feature importances.
clf=RandomForestClassifier(n_estimators=50,max_features='sqrt')
clf =clf.fit(train,targets)
Visualize the importance of each feature.
features =pd.DataFrame()
features['feature'] =train.columns
features['importance']=clf.feature_importances_
features.sort_values(by=['importance'],ascending=True,inplace=True)
features.set_index('feature',inplace=True)
features.plot(kind='barh',figsize=(15,15))
As the feature engineering suggested, stages, debt, credit_line, total_n, postCode, and employmentTitle all show strong relevance. The dataset is now made more compact, sharply reducing the number of features.
model =SelectFromModel(clf,prefit=True)
train_reduce =model.transform(train)
train_reduce.shape
————————————————————————————————————————————————————
(800000, 12)
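By default, SelectFromModel keeps the features whose importance exceeds the mean importance; `get_support()` exposes which columns survived. A self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
selector = SelectFromModel(clf, prefit=True)
mask = selector.get_support()   # True for columns above the mean importance
X_reduced = selector.transform(X)
```

On the competition frame, `train.columns[mask]` would name the surviving features.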
Define a scoring function to rate the models:
def compute_score(clf, x, y, cv=5, scoring='accuracy'):
    xval = cross_val_score(clf, x, y, cv=cv, scoring=scoring)
    return np.mean(xval)
Base models
Instantiate the classifiers.
logreg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
gboost = GradientBoostingClassifier()
svc=SVC()
gnb = GaussianNB()
models = [ logreg_cv, rf,svc, gnb,gboost]
Since the dataset is large, a subset is selected for training to improve efficiency.
train_partial = train[:80000]
targets_partial = targets[:80000]
for model in models:
    print(f'Cross-validation of :{model.__class__}')
    score = compute_score(clf=model, x=train_partial, y=targets_partial, scoring='accuracy')
    print(f'CV score ={score}')
    print('****')
——————————————————————————————————————————————————————————————————
Cross-validation of :<class 'sklearn.linear_model._logistic.LogisticRegressionCV'>
CV score =0.7982125
****
Cross-validation of :<class 'sklearn.ensemble._forest.RandomForestClassifier'>
CV score =0.7988125
****
Cross-validation of :<class 'sklearn.svm._classes.SVC'>
CV score =0.7984249999999999
****
Cross-validation of :<class 'sklearn.naive_bayes.GaussianNB'>
CV score =0.7983374999999999
****
Cross-validation of :<class 'sklearn.ensemble._gb.GradientBoostingClassifier'>
CV score =0.8005749999999999
****
- logreg_cv CV score = 0.7982125
- rf CV score = 0.7988125
- svc CV score = 0.7984249999999999
- gnb CV score = 0.7983374999999999
- gboost CV score = 0.8005749999999999
Based on these first results, rf and gboost perform best, so their parameters are tuned to improve the models.
Hyperparameter tuning
Among the models tried, random forest and gradient boosting performed best, so their parameters are optimized further; GridSearchCV is used for this step.
model_box= []
rf=RandomForestClassifier(random_state=2021,max_features='auto')
rf_params ={'n_estimators':[50,120,300],'max_depth':[5,8,15],
'min_samples_leaf':[2,5,10],'min_samples_split':[2,5,10]}
model_box.append([rf,rf_params])
gboost=GradientBoostingClassifier(random_state=2021)
gboost_params ={'learning_rate':[0.05,0.1,0.15],'n_estimators':[10,50],
'max_depth':[3,4,6,10],'min_samples_split':[50,10]}
model_box.append([gboost,gboost_params])
for i in range(len(model_box)):
    best_model = GridSearchCV(model_box[i][0], param_grid=model_box[i][1],
                              refit=True, cv=5, scoring='roc_auc').fit(train_partial, targets_partial)
    print(model_box[i], ':')
    print('best_parameters:', best_model.best_params_)
————————————————————————————————————————————————————————————————
[RandomForestClassifier(random_state=2021), {'n_estimators': [50, 120, 300], 'max_depth': [5, 8, 15], 'min_samples_leaf': [2, 5, 10], 'min_samples_split': [2, 5, 10]}] :
best_parameters: {'max_depth': 15, 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 300}
[GradientBoostingClassifier(random_state=2021), {'learning_rate': [0.05, 0.1, 0.15], 'n_estimators': [10, 50], 'max_depth': [3, 4, 6, 10], 'min_samples_split': [50, 10]}] :
best_parameters: {'learning_rate': 0.15, 'max_depth': 4, 'min_samples_split': 50, 'n_estimators': 50}
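Since the grids were fitted with refit=True, the tuned model does not need to be rebuilt by hand: each GridSearchCV object already holds a refit copy under `best_estimator_`, plus `best_score_` for the cross-validated metric. A compact sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={'n_estimators': [10, 20]},
                    cv=3, scoring='roc_auc', refit=True).fit(X, y)
best_rf = grid.best_estimator_   # already refit on the full training data
```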
The grid search reports the following best parameters:
- rf:
best_parameters: 'max_depth': 15, 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 300
- gboost:
best_parameters: 'learning_rate': 0.15, 'max_depth': 4, 'min_samples_split': 50, 'n_estimators': 50
Plug these parameters in, train the models, and compute cross-validated scores to evaluate them.
model_best = {'rf': RandomForestClassifier(random_state=2021, n_estimators=300, max_depth=15,
                                           min_samples_leaf=10, min_samples_split=2),
              'gboost': GradientBoostingClassifier(random_state=2021, learning_rate=0.15,
                                                   max_depth=4, min_samples_split=50, n_estimators=50)}
for i in model_best:
    model_best[i].fit(train_partial, targets_partial)
    score = cross_val_score(model_best[i], train_partial, targets_partial, cv=5, scoring='accuracy')
    print('score of %s: %.4f' % (i, score.mean()))
__________________________________________________________________
score of rf: 0.7997
score of gboost: 0.8005
Select the best model, predict on the test set, and write the result file.
model_final =model_best['gboost']
predictions= model_final.predict(test).astype(int)
df_predictions = pd.DataFrame()
abc = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/testA.csv')
df_predictions['id'] = abc['id']
df_predictions['isDefault'] = predictions
df_predictions[['id','isDefault']].to_csv('/content/drive/MyDrive/Colab Notebooks/submit.csv', index=False)
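One caveat: the grids above were scored with roc_auc, and if the leaderboard metric is AUC as well, hard 0/1 labels from predict() throw away ranking information; predict_proba yields a continuous default probability instead. A sketch of the difference on synthetic data (not the competition files):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
labels = clf.predict(X)             # hard 0/1 decisions
proba = clf.predict_proba(X)[:, 1]  # P(class == 1), a continuous score
```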