Credit Score Card Model Analysis Based on Python (Part 1)

A credit risk measurement system consists of two parts: obligor rating models and facility (debt) rating models. Both obligor rating and facility rating rely on a series of rating models. The obligor rating models can be represented by "four cards": the A card, B card, C card and F card. Facility rating models are generally divided, according to the obligor's financing purpose, into corporate finance models, cash-flow financing models and project financing models. Here we focus on the development process of the obligor rating models.

1. Project Process

A typical credit scoring model is shown in Figure 1-1. The main steps in developing a credit risk rating model are as follows:
(1) Data acquisition, including data on both existing and potential customers. Existing customers are those who have already conducted the relevant financing business with the securities company, including individual and institutional clients; potential customers are those who may conduct such business in the future, mainly institutional clients. Including potential customers is a common way of dealing with the small sample sizes in the securities industry; these potential institutional clients include listed companies, public bond issuers, companies listed on the New Third Board, companies listed on regional equity trading centers, financial institutions, and so on.
(2) Data preprocessing, mainly data cleaning and the treatment of missing values and outliers, with the goal of turning the raw data into formatted data that can be used for model development.
(3) Exploratory data analysis, which gives an overview of the sample; the main descriptive tools are histograms, box plots, and similar charts.
(4) Variable selection, which uses statistical methods to select the variables with the most significant effect on default. Both univariate feature selection and machine-learning-based methods can be used.
(5) Model development, which consists of three parts: variable binning, WOE (weight of evidence) transformation, and logistic regression estimation.
(6) Model evaluation, which assesses the model's discriminatory power, predictive accuracy and stability, and produces an evaluation report that concludes whether the model is fit for use.
(7) Credit scoring, which converts the logistic regression coefficients and WOE values into a standard score, i.e. expresses the logistic model in scorecard form.
(8) Scoring system, which builds an automatic credit scoring system based on the credit scoring method.
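A note on step (7), which Part 2 covers in detail (this is the standard scorecard scaling convention rather than something specific to this article): the log-odds produced by the logistic regression are mapped linearly to a score, score = offset + factor * ln(odds), where factor = PDO / ln(2), PDO is the number of points that doubles the odds, and offset is fixed by choosing a base score for a chosen base odds.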

 

Figure 1-1 credit scoring model development process

 

PS: For convenience, the variables are sometimes referred to below by their index numbers (x1, x2, ...) instead of their full names.

2. Data Acquisition

The data come from Kaggle's Give Me Some Credit competition and contain 150,000 sample records; the figure below gives an overview of the data.
The data are personal consumer loans. Considering only the data that will actually be available when the credit score is finally implemented, data should be obtained from the following aspects:
- Basic attributes: including the borrower's age at the time.
- Solvency: including the borrower's monthly income and debt ratio.
- Credit history: the number of times 35-59 days past due in the last two years, the number of times 60-89 days past due in the last two years, and the number of times 90 days or more past due in the last two years.
- Property status: including the number of open credit lines and loans, and the number of real estate loans or lines.
- Loan attributes: none for the time being.
- Other factors: including the number of the borrower's dependents (excluding the borrower).
- Time window: the observation window for the independent variables is the past two years, and the performance window for the dependent variable is the next two years.

Figure 2-1 Original data variables

 

3. Data Preprocessing

Before processing the data, we need to understand the missing values and outliers in the dataset. Python's describe() function gives an overview of the dataset, from which missing values, the mean, the median, and so on can be seen.

 
import pandas as pd

# Load the data
data = pd.read_csv('cs-training.csv')
# Missing values and distribution of the dataset
data.describe().to_csv('DataDescribe.csv')

Details of the data set:

 

Figure 3-1 Variable details

 

The figure shows that the variables MonthlyIncome and NumberOfDependents have missing values: MonthlyIncome has 29,731 missing values and NumberOfDependents has 3,924.
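As a quick cross-check, the missing counts can also be read off directly (a minimal sketch, assuming the DataFrame loaded above is still named data):

# Count missing values per column; MonthlyIncome and NumberOfDependents stand out
print(data.isnull().sum())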

3.1 Missing Value Treatment

Missing values are very common in real-world problems, and many analysis methods cannot handle them, so missing value treatment is the first step in developing the credit risk rating model. Common methods include the following.
(1) Directly delete the samples that contain missing values.
(2) Fill in missing values based on the similarity between samples.
(3) Fill in missing values based on the correlations between variables.
The missing rate of MonthlyIncome is relatively high, so we fill its missing values based on the correlations between variables, using the random forest method:

 
from sklearn.ensemble import RandomForestRegressor

# Use a random forest to predict and fill in the missing values
def set_missing(df):
    # Take out the existing numerical features
    process_df = df.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    # Split into the rows where MonthlyIncome is known and the rows where it is missing
    known = process_df[process_df.MonthlyIncome.notnull()].values
    unknown = process_df[process_df.MonthlyIncome.isnull()].values
    # X holds the feature values
    X = known[:, 1:]
    # y holds the target values
    y = known[:, 0]
    # Fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    # Predict the unknown MonthlyIncome values with the fitted model
    predicted = rfr.predict(unknown[:, 1:]).round(0)
    print(predicted)
    # Fill the original missing values with the predictions
    df.loc[df.MonthlyIncome.isnull(), 'MonthlyIncome'] = predicted
    return df

The variable NumberOfDependents has relatively few missing values, so deleting those rows directly will not affect the overall model much. After the missing values have been handled, we remove duplicate records.

 
data = set_missing(data)       # Fill the many MonthlyIncome missing values with the random forest
data = data.dropna()           # Drop the rows with the few remaining missing values
data = data.drop_duplicates()  # Drop duplicate records
data.to_csv('MissingData.csv', index=False)

3.2 Outlier Treatment

After the missing values have been handled, we still need to deal with outliers. An outlier is a value that deviates markedly from most of the sampled data; for example, an individual customer whose age is 0 is usually treated as an outlier. Outlier detection methods are commonly used to find such values in the sample.
First, we find that the variable age contains the value 0, which is clearly an outlier, so those records are removed directly:

 
# Remove the outlier records where age equals 0
data = data[data['age'] > 0]

For the three variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse, the box plots in Figure 3-2 below show that they all contain outliers, and the unique() function shows that each of them contains the two abnormal values 96 and 98, so these are removed. Note that removing the rows containing 96 and 98 for one of these variables also removes the 96 and 98 values of the other two, because they occur in the same rows.
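The plotting code for Figure 3-2 is not included in the article; a minimal sketch of how such box plots could be drawn with pandas and matplotlib, assuming the cleaned DataFrame is still named data, might look like this:

import matplotlib.pyplot as plt

# Box plots of the three delinquency-count variables (as in Figure 3-2)
cols = ['NumberOfTime30-59DaysPastDueNotWorse',
        'NumberOfTimes90DaysLate',
        'NumberOfTime60-89DaysPastDueNotWorse']
data[cols].plot(kind='box', subplots=True, layout=(1, 3), figsize=(12, 4))
plt.show()

# Inspect the distinct values; 96 and 98 appear as obvious outliers
for col in cols:
    print(col, sorted(data[col].unique()))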

 

Figure 3-2 Box plots

 

We remove the outliers of NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate and NumberOfTime60-89DaysPastDueNotWorse. In addition, in the dataset a good customer is coded as 0 and a defaulting customer as 1; since it is more intuitive for a customer who performs normally and pays interest to be coded as 1, we invert the target variable.

 
# Remove the outliers (the 96 and 98 values)
data = data[data['NumberOfTime30-59DaysPastDueNotWorse'] < 90]
# Invert the target variable SeriousDlqin2yrs so that 1 means a good customer
data['SeriousDlqin2yrs'] = 1 - data['SeriousDlqin2yrs']

3.3 Data Splitting

To validate the model's fit, we need to split the dataset into a training set and a test set.

from sklearn.model_selection import train_test_split

Y = data['SeriousDlqin2yrs']
X = data.iloc[:, 1:]
# The test set takes 30% of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# print(Y_train)
train = pd.concat([Y_train, X_train], axis=1)
test = pd.concat([Y_test, X_test], axis=1)
clasTest = test.groupby('SeriousDlqin2yrs')['SeriousDlqin2yrs'].count()
train.to_csv('TrainData.csv', index=False)
test.to_csv('TestData.csv', index=False)

4. Exploratory Analysis

Before building the model, we usually carry out exploratory data analysis (EDA) on the available data. EDA means exploring the existing data (especially raw data from surveys or observations) with as few prior assumptions as possible. Commonly used EDA methods include histograms, scatter plots and box plots.
The distribution of customer age is shown in Figure 4-1. Age is roughly normally distributed, which is consistent with the assumptions of the statistical analysis.
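The plotting code for Figure 4-1 is not shown in the article; a minimal sketch using matplotlib, assuming the training DataFrame from Section 3.3 is named train, could be:

import matplotlib.pyplot as plt

# Histogram of customer age (as in Figure 4-1)
plt.hist(train['age'], bins=30)
plt.xlabel('age')
plt.ylabel('count')
plt.show()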

 

Figure 4-1 Customer age distribution

 

The distribution of customer monthly income is shown in Figure 4-2. Monthly income is also roughly normally distributed, which meets the needs of the statistical analysis.

 

Figure 4-2 Customer income distribution

5. Variable Selection

Feature (variable) selection and ranking are very important for data analysis and machine learning practitioners. Good feature selection improves the model's performance and helps us understand the characteristics and underlying structure of the data, which is important for further improving the model and the algorithm. For Python implementations of variable selection, see the various introductions to common feature selection methods based on scikit-learn.
In this article we use the variable selection method of credit scoring models: WOE analysis, that is, we determine whether a variable makes economic sense by comparing the bins of the variable with the default rate in each bin. First we discretize (bin) the variables.

5.1 Binning

Variable binning is a term for the discretization of continuous variables. In scorecard development, the commonly used methods are equal-width binning, equal-frequency binning and optimal binning. Equal-width binning (equal length intervals) uses intervals of the same width, for example ten-year age bands; equal-frequency binning (equal frequency intervals) first fixes the number of bins and then makes the number of observations in each bin roughly equal; optimal binning, also called supervised discretization, uses recursive partitioning to split a continuous variable into bins, based on an algorithm that searches for the best grouping by conditional inference.
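As a quick illustration of the difference between the first two methods (a minimal sketch; it assumes the training DataFrame train from Section 3.3):

import pandas as pd

# Equal-width binning: five bins of equal length across the age range
equal_width = pd.cut(train['age'], bins=5)
# Equal-frequency binning: five bins each containing roughly the same number of customers
equal_freq = pd.qcut(train['age'], q=5)
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())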
We first try optimal binning on each continuous variable, and fall back to equal-width binning when the distribution of the variable does not satisfy the requirements of optimal binning. The code for optimal binning is as follows:

 
import numpy as np
import pandas as pd
from scipy import stats

# Define the automatic (optimal) binning function
def mono_bin(Y, X, n=20):
    r = 0
    good = Y.sum()
    bad = Y.count() - good
    # Reduce the number of bins until the bucket means are monotonic in the target
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame({'min': d2.min().X})
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d4 = d3.sort_values(by='min').reset_index(drop=True)
    print("=" * 60)
    print(d4)
    return d4

We use optimal binning on RevolvingUtilizationOfUnsecuredLines, age, DebtRatio and MonthlyIncome in the dataset.

 

Figure 5-1 Binning of RevolvingUtilizationOfUnsecuredLines

 

Figure 5-2 Binning of age

 

Figure 5-3 Binning of DebtRatio

 

Figure 5-4 Binning of MonthlyIncome

 

For the variables that cannot be optimally binned, the bins are defined manually as follows:

 
# Discretize the remaining continuous variables with manually chosen cut points
ninf = float('-inf')  # negative infinity
pinf = float('inf')   # positive infinity
cutx3 = [ninf, 0, 1, 3, 5, pinf]
cutx6 = [ninf, 1, 2, 3, 5, pinf]
cutx7 = [ninf, 0, 1, 3, 5, pinf]
cutx8 = [ninf, 0, 1, 2, 3, pinf]
cutx9 = [ninf, 0, 1, 3, pinf]
cutx10 = [ninf, 0, 1, 2, 3, 5, pinf]

5.2 WOE

WOE analysis consists of binning a variable, computing the WOE value of each bin, and observing how the WOE values change as the variable changes. The mathematical definition of WOE is:
woe=ln(goodattribute/badattribute)
When doing the analysis, we sort each variable from small to large and compute the WOE of each bin. For a positively oriented indicator, larger values should give smaller WOE values; for a negatively oriented indicator, larger values should give larger WOE values. The steeper the negative slope of the WOE curve for a positive indicator (or the positive slope for a negative indicator), the better the indicator separates good and bad customers. A WOE curve that is close to a flat line means the indicator has weak discriminatory power. If a positive indicator is positively correlated with WOE, or a negative indicator is negatively correlated with WOE, the indicator does not make economic sense and should be removed.
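For example, if a bin contains 5% of all good customers (goodattribute = 0.05) and 10% of all bad customers (badattribute = 0.10), its WOE is ln(0.05 / 0.10) ≈ -0.69, indicating that customers in this bin are worse than average.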
The WOE calculation is already included in the mono_bin() function of the previous section, so it is not repeated here.

5.3 Correlation Analysis and IV Selection

Next, we look at the correlations between the variables in the cleaned data. Note that this correlation analysis is only a preliminary check; the IV (information value) of each variable, computed below, is what we use as the basis for variable selection.
We draw the correlation matrix with the heatmap() function from Python's seaborn package; the code is as follows:

 
import matplotlib.pyplot as plt
import seaborn as sns

corr = data.corr()  # Compute the correlation matrix of the variables
xticks = ['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']  # x-axis labels
yticks = list(corr.index)  # y-axis labels
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
# Draw the correlation heatmap
sns.heatmap(corr, annot=True, cmap='rainbow', ax=ax1,
            annot_kws={'size': 9, 'weight': 'bold', 'color': 'blue'})
ax1.set_xticklabels(xticks, rotation=0, fontsize=10)
ax1.set_yticklabels(yticks, rotation=0, fontsize=10)
plt.show()

The resulting plot is shown in Figure 5-5:

 

Figure 5-5 Correlations between the variables in the dataset

 

The figure shows that the correlations between the variables are very small. The correlation coefficient between NumberOfOpenCreditLinesAndLoans and NumberRealEstateLoansOrLines is 0.43.
Next, we compute the information value (IV) of each variable. IV is generally used to measure the predictive power of an independent variable. Its formula is:
IV=sum((goodattribute-badattribute)*ln(goodattribute/badattribute))
The usual standard for judging a variable's predictive power from its IV is:
< 0.02: unpredictive
0.02 to 0.1: weak
0.1 to 0.3: medium
0.3 to 0.5: strong
> 0.5: suspicious
The IV calculation is placed inside the mono_bin() function; the code is as follows:

 
# Define the automatic binning function, extended to also return the IV, cut points and WOE values
def mono_bin(Y, X, n=20):
    r = 0
    good = Y.sum()
    bad = Y.count() - good
    # Reduce the number of bins until the bucket means are monotonic in the target
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X, "Y": Y, "Bucket": pd.qcut(X, n)})
        d2 = d1.groupby('Bucket', as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame({'min': d2.min().X})
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    # IV = sum((goodattribute - badattribute) * woe)
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = d3.sort_values(by='min').reset_index(drop=True)
    print("=" * 60)
    print(d4)
    # Collect the quantile cut points actually used for the binning
    cut = [float('-inf')]
    for i in range(1, n + 1):
        qua = X.quantile(i / (n + 1))
        cut.append(round(qua, 4))
    cut.append(float('inf'))
    woe = list(d4['woe'].round(3))
    return d4, iv, cut, woe
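The chart code below uses IV values ivx1 ... ivx10 that are not computed anywhere in the article. A sketch of how they could be obtained, assuming the x-numbering follows the column order of the dataset: the optimally binned variables use the extended mono_bin() above, while the manually binned variables need an analogous helper, hypothetically named self_bin here, that works from the fixed cut points of Section 5.1.

# WOE/IV for a variable binned with manually specified cut points (analogue of mono_bin)
def self_bin(Y, X, cat):
    good = Y.sum()
    bad = Y.count() - good
    d1 = pd.DataFrame({'X': X, 'Y': Y, 'Bucket': pd.cut(X, cat)})
    d2 = d1.groupby('Bucket', as_index=True)
    d3 = pd.DataFrame({'min': d2.min().X})
    d3['max'] = d2.max().X
    d3['sum'] = d2.sum().Y
    d3['total'] = d2.count().Y
    d3['rate'] = d2.mean().Y
    d3['woe'] = np.log((d3['rate'] / (1 - d3['rate'])) / (good / bad))
    d3['goodattribute'] = d3['sum'] / good
    d3['badattribute'] = (d3['total'] - d3['sum']) / bad
    iv = ((d3['goodattribute'] - d3['badattribute']) * d3['woe']).sum()
    d4 = d3.sort_values(by='min').reset_index(drop=True)
    woe = list(d4['woe'].round(3))
    return d4, iv, woe

# Optimally binned variables
dfx1, ivx1, cutx1, woex1 = mono_bin(train['SeriousDlqin2yrs'], train['RevolvingUtilizationOfUnsecuredLines'])
dfx2, ivx2, cutx2, woex2 = mono_bin(train['SeriousDlqin2yrs'], train['age'])
dfx4, ivx4, cutx4, woex4 = mono_bin(train['SeriousDlqin2yrs'], train['DebtRatio'])
dfx5, ivx5, cutx5, woex5 = mono_bin(train['SeriousDlqin2yrs'], train['MonthlyIncome'])
# Manually binned variables, using the cut points from Section 5.1
dfx3, ivx3, woex3 = self_bin(train['SeriousDlqin2yrs'], train['NumberOfTime30-59DaysPastDueNotWorse'], cutx3)
dfx6, ivx6, woex6 = self_bin(train['SeriousDlqin2yrs'], train['NumberOfOpenCreditLinesAndLoans'], cutx6)
dfx7, ivx7, woex7 = self_bin(train['SeriousDlqin2yrs'], train['NumberOfTimes90DaysLate'], cutx7)
dfx8, ivx8, woex8 = self_bin(train['SeriousDlqin2yrs'], train['NumberRealEstateLoansOrLines'], cutx8)
dfx9, ivx9, woex9 = self_bin(train['SeriousDlqin2yrs'], train['NumberOfTime60-89DaysPastDueNotWorse'], cutx9)
dfx10, ivx10, woex10 = self_bin(train['SeriousDlqin2yrs'], train['NumberOfDependents'], cutx10)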

The code that generates the IV bar chart:

 
ivlist = [ivx1, ivx2, ivx3, ivx4, ivx5, ivx6, ivx7, ivx8, ivx9, ivx10]  # IV of each variable
index = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']  # x-axis labels
fig1 = plt.figure(1)
ax1 = fig1.add_subplot(1, 1, 1)
x = np.arange(len(index)) + 1
ax1.bar(x, ivlist, width=0.4)  # Draw the bar chart
ax1.set_xticks(x)
ax1.set_xticklabels(index, rotation=0, fontsize=12)
ax1.set_ylabel('IV(Information Value)', fontsize=14)
# Add numeric labels above the bars
for a, b in zip(x, ivlist):
    plt.text(a, b + 0.01, '%.4f' % b, ha='center', va='bottom', fontsize=10)
plt.show()

Output:

 

Figure 5-6 IV values of the variables

 

As the chart shows, the IV values of DebtRatio, MonthlyIncome, NumberOfOpenCreditLinesAndLoans, NumberRealEstateLoansOrLines and NumberOfDependents are clearly low, so these variables are removed.
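A minimal sketch of how these low-IV columns could then be dropped from the training and test sets (assuming the train and test DataFrames from Section 3.3):

# Variables whose IV is too low to be useful for the model
low_iv_cols = ['DebtRatio', 'MonthlyIncome', 'NumberOfOpenCreditLinesAndLoans',
               'NumberRealEstateLoansOrLines', 'NumberOfDependents']
train = train.drop(columns=low_iv_cols)
test = test.drop(columns=low_iv_cols)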

Summary

This article covered the data preprocessing, exploratory analysis and variable selection stages of developing a credit scoring model. In data preprocessing, missing values were handled with the random forest method and by direct deletion, and outliers were removed based on the actual situation and on the distributions shown by the box plots; the exploratory analysis gave an initial look at the distribution of each variable; variable selection focused on binning the variables, computing WOE values from the binning results, checking the correlations between variables, and then using each variable's IV to select the variables with good predictive power.
The next article will cover model development, model evaluation and credit scoring for the scorecard model.
Credit Score Card Model Analysis Based on Python (Part 2)



Author: YoLean
Link: https://www.jianshu.com/p/f931a4df202c
Source: Jianshu (简书)
Copyright belongs to the author. For any form of reproduction, please contact the author for authorization and cite the source.
