Python Case Study: A Customer Credit Prediction Model

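Reading Notes

This article builds a customer credit prediction model with logistic regression. The experiment covers three parts: data cleaning, modeling, and prediction. I hope readers find it useful. Thanks for reading.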

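I. Project Overview

For young people, paying off Huabei has become a monthly routine, and a freshly received paycheck may lose more than half of its value to that bill. In this fast-paced era we sometimes need funds to meet a need but cannot produce a large sum at short notice. Lending institutions emerged to resolve this awkward situation.

Credit business, also called credit assets or the loan business, is the most important asset business of a commercial bank. The bank lends money, recovers principal and interest, and earns a profit after deducting costs, so lending is the main source of profit for commercial banks.
Because money that has been lent out is no longer under the bank's control, the risk of not recovering principal and interest on time is high. Lending should therefore follow the Contract Law and the General Rules on Loans, under a strict loan system that mainly covers: establishing the loan relationship, loan application, pre-loan investigation, loan approval and disbursement, post-loan inspection, loan recovery and extension, and credit sanctions.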

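1.1 The Logistic Regression Algorithm

1.1.1 The Logistic Function

In a Logistic regression model the dependent variable takes only two values, 1 and 0 (the event occurs or does not occur). Suppose that, under the influence of n independent variables x1, x2, …, xn, the probability that y takes the value 1 is p = P(y = 1|X) and the probability that it takes 0 is 1-p. The ratio of the two probabilities,
$\frac{p}{1-p}$
is called the odds of the event. Taking the natural logarithm of the odds gives the Logistic transformation
$Logit(p) = ln(\frac{p}{1-p})$ (call this expression ①)
Setting ① = z and solving for p gives
$p = \frac{1}{1+e^{-z}}$
which is called the Logistic function.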

(Figure omitted: the S-shaped curve of the Logistic function, which rises from 0 to 1 as z increases.)
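The relationship between the Logit transformation and the Logistic function can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names logit and logistic are illustrative and not part of the original article.

```python
import numpy as np


def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability p in (0, 1)."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


def logistic(z):
    """Logistic function 1 / (1 + e^(-z)), the inverse of logit."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))


probs = np.array([0.1, 0.5, 0.9])
z = logit(probs)         # log-odds of each probability
recovered = logistic(z)  # mapping back should return the original probabilities

print(z)
print(recovered)
```

Applying logistic to the log-odds returns the original probabilities, which is why the exponent in the Logistic function must be -z rather than z.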

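1.1.2 Logistic Regression Modeling Steps

a. Set the indicator variables (dependent and independent variables) according to the purpose of the analysis, then collect data and screen the features again based on the collected data.

b. Let p = P(y = 1|X) be the probability that y takes 1, and 1-p the probability that it takes 0. Write a linear regression equation between
$ln(\frac{p}{1-p})$
and the independent variables, and estimate the regression coefficients of the model.
c. Validate the model. There are many indicators of model validity: the most basic is accuracy, followed by the confusion matrix, the ROC curve, the KS value, and others.

d. Apply the model: feeding in values of the independent variables yields the value of the predicted variable; conversely, the independent variables can be controlled according to the desired value of the predicted variable.

Example: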

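年龄 (age)	教育 (education)	工龄 (years employed)	地址 (address)	收入 (income)	负债率 (debt ratio)	信用卡负债 (credit card debt)	其他负债 (other debt)	违约 (default)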
41	3	17	12	176.00	9.30	11.36	5.01	1
27	1	10	6	31.00	17.30	1.36	4.00	0

If you need the dataset, send me a private message.

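We use Scikit-Learn to run a logistic regression analysis on this data. Feature screening comes first. There are many screening methods, most of them in Scikit-Learn's feature_selection module. A simple one is the F-test (f_regression), which gives the F value and p value of each feature so that variables can be filtered (choose features with a large F value or a small p value). Newer methods include Recursive Feature Elimination (RFE) and Stability Selection. Here, randomized logistic regression, a stability selection method, is used to screen the features; a logistic regression model is then built on the screened features and its mean accuracy is printed.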
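For readers who want to try the simpler screening methods mentioned above, here is a hedged sketch on synthetic data. It is not the article's method: it swaps in f_classif (the F-test variant for a categorical target) and RFE from scikit-learn, and the dataset produced by make_classification is purely hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, f_classif
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 8 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# F-test: one F value and one p value per feature.
f_values, p_values = f_classif(X, y)
print(f_values.round(2))
print(p_values.round(4))

# Recursive Feature Elimination: repeatedly drop the weakest feature
# until 4 remain (the number 4 is an arbitrary choice for illustration).
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
selected_mask = rfe.get_support()  # boolean mask over the 8 features
print(selected_mask)
```

Features with a large F value (small p value) and features kept by RFE are the candidates to carry into the final model.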

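Logistic regression code

# -*- coding: utf-8 -*-
# Logistic regression: automated modeling
import pandas as pd
from sklearn.linear_model import LogisticRegression as LR
# Third-party stability-selection package. Whether this class exposes
# get_support() depends on the installed version; treat that as an assumption.
from stability_selection.randomized_lasso import RandomizedLogisticRegression as RLR

# Initialize parameters
filename = '../data/bankloan.xls'
data = pd.read_excel(filename)
feature_names = data.columns[:8]  # the first 8 columns are the features
x = data.iloc[:, :8].values  # as_matrix() was removed in pandas 1.0
y = data.iloc[:, 8].values  # the 9th column (违约, default) is the label

rlr = RLR()  # randomized logistic regression model for feature screening
rlr.fit(x, y)  # train the model
support = rlr.get_support()  # screening result; per-feature scores are available via .scores_
print('Feature screening with randomized logistic regression finished.')
print('Selected features: %s' % ','.join(feature_names[support]))
x = data[feature_names[support]].values  # keep only the selected features

lr = LR()  # logistic regression model
lr.fit(x, y)  # train the model on the selected features
print('Logistic regression model training finished.')
print('Mean accuracy of the model: %s' % lr.score(x, y))  # 81.4% in this example

Result:

Feature screening with randomized logistic regression finished.
Selected features: 工龄,地址,负债率,信用卡负债
Logistic regression model training finished.
Mean accuracy of the model: 0.814285714286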

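1.2 The Overdue Repayment Problem

In this article we clean and model a dataset collected from a lending institution, in order to predict whether a user is able to repay and to decide whether to lend to that user. The article works through a small hands-on exercise in three stages: data cleaning, data mining, and data modeling.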

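1.3 Understanding the Data Source

The dataset contains loan records generated on the Lending Club platform, with 52 variables and 39522 records.

The data sample has a great many columns. What feature does each column represent? I translated the meaning of a selection of them.

When we actually build the model, not every column will be used, so the data has to be preprocessed first.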

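1.4 Main Python Functions for Data Preprocessing

In data mining, massive raw datasets contain large amounts of incomplete (missing values), inconsistent, and anomalous data. These seriously reduce the efficiency of modeling and can even bias the results, so data cleaning is particularly important. Cleaning is followed by, or carried out together with, a series of steps such as data integration, transformation, and reduction; the whole process is called data preprocessing. Its purpose is on the one hand to improve data quality and on the other to make the data better suited to a particular mining technique or tool. Statistics suggest that preprocessing accounts for about 60% of the work in the whole data mining process.

The main components of data preprocessing are data cleaning, data integration, data transformation, and data reduction.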

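Function	Purpose	Library
interpolate	Interpolation of one-dimensional and higher-dimensional data	Scipy
unique	Removes duplicate elements and returns the distinct values; also available as a method on objects	Pandas/Numpy
isnull	Tests whether elements are null	Pandas
notnull	Tests whether elements are non-null	Pandas
PCA	Principal component analysis of a matrix of indicator variables	Scikit-Learn
random	Generates random matrices	Numpy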

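1.4.1 interpolate

Purpose: interpolate is a Scipy sublibrary containing many interpolation functions, such as Lagrange interpolation, spline interpolation, and higher-dimensional interpolation. Before use, import the function you need with from scipy.interpolate import *; the function names can be looked up in the official documentation.

Usage:

f = scipy.interpolate.lagrange(x, y)

Only the command for Lagrange interpolation of one-dimensional data is shown here, where x and y are the data of the independent and dependent variables. Once the interpolation is done, f(a) computes the interpolated value at a new point a. Spline interpolation and multidimensional interpolation work similarly and are not shown one by one.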
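A small worked case makes the usage concrete. This is a minimal sketch with made-up sample points taken from y = x^2 + 1; three points determine a quadratic exactly, so the interpolant reproduces the curve.

```python
import numpy as np
from scipy.interpolate import lagrange

# Three hypothetical sample points on the curve y = x^2 + 1.
x = np.array([0.0, 1.0, 2.0])
y = x ** 2 + 1

f = lagrange(x, y)  # returns a numpy.poly1d polynomial
value = f(1.5)      # evaluate the interpolant at a new point

print(f)
print(value)
```

Lagrange interpolation becomes numerically unstable with many points, so in practice it is usually applied to a handful of neighbours around a missing value.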

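1.4.2 unique

**Purpose:** Removes duplicate elements from the data and returns the distinct values. It is both a Numpy function (np.unique()) and a method of Series objects.

Usage:

np.unique(D), where D is one-dimensional data: a list, an array, or a Series
D.unique(), where D is a Pandas Series object

Example:

Find the distinct elements of the Series D

import numpy as np
import pandas as pd

D = pd.Series([1, 1, 2, 3, 5])
print(D.unique())
print(np.unique(D))

Result:

[1 2 3 5]
[1 2 3 5]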

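1.4.3 isnull / notnull

Purpose: Tests whether each element is null / non-null.

Usage: D.isnull() / D.notnull(). Here D must be a Series object, and the call returns a boolean Series. D[D.isnull()] or D[D.notnull()] selects the null / non-null values of D.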
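The boolean-mask usage can be illustrated with a tiny Series. This is a sketch on made-up values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A hypothetical Series with two missing entries (NaN and None).
D = pd.Series([1.0, np.nan, 3.0, None])

missing_mask = D.isnull()  # True where the value is missing
present = D[D.notnull()]   # keep only the non-missing values

print(missing_mask.tolist())
print(present.tolist())
print(int(missing_mask.sum()))  # number of missing values
```

Summing the mask counts the missing values, which is the same idea as the isnull().sum() call used on the whole DataFrame later in the article.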

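1.4.4 random

Purpose: random is a Numpy sublibrary (Python ships its own random module, but Numpy's is more powerful). Its functions generate random matrices that follow a given distribution, which is useful for sampling.

Usage:

np.random.rand(k, m, n, …) generates a k * m * n * … random matrix whose elements are uniformly distributed on the interval [0, 1)
np.random.randn(k, m, n, …) generates a k * m * n * … random matrix whose elements follow the standard normal distribution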
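The difference between the two functions is easy to confirm empirically. The sketch below fixes a seed only so that reruns give the same numbers; the shapes are arbitrary.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

uniform = np.random.rand(2, 3, 4)  # uniform on [0, 1), shape 2 * 3 * 4
normal = np.random.randn(1000, 2)  # standard normal, shape 1000 * 2

print(uniform.shape, uniform.min() >= 0.0, uniform.max() < 1.0)
print(normal.shape, normal.mean(), normal.std())
```

The uniform samples never leave [0, 1), while the normal samples have a mean near 0 and a standard deviation near 1.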

1.4.5 PCA

Purpose: Performs principal component analysis on a matrix of indicator variables. Before use, import it with from sklearn.decomposition import PCA.

Usage: model = PCA(). Note that PCA in Scikit-Learn is a model-style object: the usual workflow is to create the model, then train it with model.fit(D), where D is the data matrix to analyze. After training, read the model's attributes, such as .components_ for the eigenvectors and .explained_variance_ratio_ for the contribution of each component.

Example:

Run PCA() on a 10 * 4 random matrix

import numpy as np
from sklearn.decomposition import PCA

D = np.random.randn(10, 4)
pca = PCA()
pca.fit(D)
print(pca.components_)  # the eigenvectors of the model
print("*" * 50)
print(pca.explained_variance_ratio_)  # the share of variance explained by each component

Result:

[[-0.73391691  0.22922579 -0.13039917  0.62595332]
 [-0.41771778  0.57241446 -0.02724733 -0.70506108]
 [ 0.22012336  0.49807219  0.80277934  0.24293029]
 [-0.48828633 -0.60968952  0.58120475 -0.22815825]]
**************************************************
[0.50297117 0.28709267 0.14575757 0.06417859]

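II. Data Preprocessing

First remove some obviously useless features, such as desc and url, and save the remaining features to a new csv file. (The replace() function can also be used.)

2.1 Import the warnings package to suppress warning output

import warnings
warnings.filterwarnings('ignore')  # ignore warnings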

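2.2 Screening features

Inspect the dataset, list the column labels, and pick out the features we do not need.
We adopt one convention first: across the roughly 40,000 rows, a column is dropped if more than half of its values are blank.
dropna(thresh=half_count, axis=1) keeps only the columns that have at least half_count non-null values.

import pandas as pd

loans_2020 = pd.read_csv('LoanStats3a.csv', skiprows=1)  # the first line of the file is a text note, so skip it with skiprows=1
half_count = len(loans_2020) / 2  # about 40,000 rows divided by 2 = 19767.5 rows

loans_2020 = loans_2020.dropna(thresh=half_count, axis=1)
loans_2020 = loans_2020.drop(['desc', 'url'], axis=1)  # drop the description and URL columns
loans_2020.to_csv('loans_2020.csv', index=False)  # write to "loans_2020.csv"; index=False omits the row index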
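The thresh argument is easy to misread, so here is a toy illustration. The DataFrame and its column names are made up; the point is only that thresh counts the non-null values a column must have in order to be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame with different amounts of missing data.
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, np.nan, 4.0],         # 3 non-null values
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],  # 1 non-null value
    "exactly_half": [1.0, np.nan, 2.0, np.nan],     # 2 non-null values
})

half_count = len(toy) // 2  # 2; an integer threshold
kept = toy.dropna(thresh=half_count, axis=1)

print(list(kept.columns))
```

A column that is exactly half empty still meets the threshold and is kept; only columns with fewer than half_count non-null values are removed.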

Here is what the processed data looks like

import pandas as pd

loans_2020 = pd.read_csv("loans_2020.csv")
print("First row of the data:\n", loans_2020.iloc[0])

First row of the data:
 id                                1077501
member_id                      1.2966e+06
loan_amnt                            5000
funded_amnt                          5000
funded_amnt_inv                      4975
term                            36 months
int_rate                           10.65%
installment                        162.87
grade                                   B
sub_grade                              B2
emp_title                             NaN
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
issue_d                            Dec-11
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
zip_code                            860xx
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                   Jan-85
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                         83.70%
total_acc                               9
initial_list_status                     f
out_prncp                               0
out_prncp_inv                           0
total_pymnt                       5863.16
total_pymnt_inv                   5833.84
total_rec_prncp                      5000
total_rec_int                      863.16
total_rec_late_fee                      0
recoveries                              0
collection_recovery_fee                 0
last_pymnt_d                       Jan-15
last_pymnt_amnt                    171.62
last_credit_pull_d                 Nov-16
collections_12_mths_ex_med              0
policy_code                             1
application_type               INDIVIDUAL
acc_now_delinq                          0
chargeoff_within_12_mths                0
delinq_amnt                             0
pub_rec_bankruptcies                    0
tax_liens                               0
Name: 0, dtype: object

shape[1] is the number of columns and shape[0] is the number of rows

print("Original number of columns = {}".format(loans_2020.shape[1]))

Original number of columns = 52

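Generally speaking, the id and member_id attributes have no influence on whether the bank approves a loan; they are only identifiers unique to the user. funded_amnt (the total amount committed to the loan) and funded_amnt_inv (the total amount committed by investors) clearly have little to do with the prediction either. Judging whether a feature is useful requires weighing many practical considerations, which are not discussed at length here. For the convenience of the experiment we simply drop these columns.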

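'''
id: user ID
member_id: member number
funded_amnt: total amount committed to the loan
funded_amnt_inv: total amount committed by investors for the loan
grade: loan grade; riskier grades carry higher interest rates
sub_grade: loan subgrade
emp_title: job title
issue_d: month in which the loan was issued
'''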
loans_2020 = loans_2020.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)

Now let's look at the current status of each loan

#loan_status: Fully Paid = repaid in full; Charged Off = not repaid on time and written off

loans_2020['loan_status']

0         Fully Paid
1        Charged Off
2         Fully Paid
3         Fully Paid
4            Current
            ...     
39530     Fully Paid
39531            NaN
39532            NaN
39533            NaN
39534            NaN
Name: loan_status, Length: 39535, dtype: object

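Continue removing the columns we do not need

#zip_code: postal code
#out_prncp and out_prncp_inv carry the same information: the outstanding principal still owed
#out_prncp_inv: the outstanding principal for the portion funded by investors
#total_rec_prncp: principal received to date

loans_2020 = loans_2020.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)

#total_rec_int: interest received to date
#recoveries: amount recovered after charge-off
#collection_recovery_fee: collection fee after charge-off
#last_pymnt_d: month of the most recent payment received
#last_pymnt_amnt: amount of the most recent payment


#Drop these columns and keep the candidate features
loans_2020 = loans_2020.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print(loans_2020.iloc[0])  # first row of the data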

loan_amnt                            5000
term                            36 months
int_rate                           10.65%
installment                        162.87
emp_length                      10+ years
home_ownership                       RENT
annual_inc                          24000
verification_status              Verified
loan_status                    Fully Paid
pymnt_plan                              n
purpose                       credit_card
title                            Computer
addr_state                             AZ
dti                                 27.65
delinq_2yrs                             0
earliest_cr_line                   Jan-85
inq_last_6mths                          1
open_acc                                3
pub_rec                                 0
revol_bal                           13648
revol_util                         83.70%
total_acc                               9
initial_list_status                     f
last_credit_pull_d                 Nov-16
collections_12_mths_ex_med              0
policy_code                             1
application_type               INDIVIDUAL
acc_now_delinq                          0
chargeoff_within_12_mths                0
delinq_amnt                             0
pub_rec_bankruptcies                    0
tax_liens                               0
Name: 0, dtype: object

So how many feature columns remain after the initial screening?

print("Remaining columns = ", loans_2020.shape[1])

Remaining columns =  32

Determine the current loan status (the label)

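2.3 LabelEncoder and OneHotEncoder

When processing data in Python we often want to turn messy feature values into simple, easily recognized codes. scikit-learn provides two very handy tools for this.

Put simply,
LabelEncoder assigns consecutive integer codes to non-consecutive numbers or to text values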

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit([1,5,67,100])
le.transform([1,1,100,67,5])

print(le.transform([1,1,100,67,5]))


#Gives [0 0 3 2 1]: the position of each value in the sorted list of classes

OneHotEncoder expands categorical data into extra dimensions, one per category

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit([[1],[2],[3],[4]])
ohe.transform(([2],[3],[1],[4])).toarray()

print(ohe.transform(([2],[3],[1],[4])).toarray())

# Gives
		[[0. 1. 0. 0.]
		 [0. 0. 1. 0.]
		 [1. 0. 0. 0.]
		 [0. 0. 0. 1.]]

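So how should we handle our dataset?

print(loans_2020['loan_status'].value_counts())  # count how often each value of this column occurs


'''
Fully Paid: the loan was repaid in full; treated as 1
Charged Off: the loan was not repaid and was written off; treated as 0
Late (16-30 days): payment is 16-30 days late
Late (31-120 days): payment is 31-120 days late; the outcome of these loans is still undetermined, so they are set aside
'''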

Fully Paid            33693
Charged Off            5612
Current                 201
Late (31-120 days)       10
In Grace Period           9
Late (16-30 days)         5
Default                   1
Name: loan_status, dtype: int64

Binary classification

#Build a binary label encoded as 0 and 1
loans_2020 = loans_2020[(loans_2020['loan_status'] == "Fully Paid") |
                        (loans_2020['loan_status'] == "Charged Off")]
status_replace = {
    #the column name is the key; its value maps each status to 0 or 1
    "loan_status": {
        #the first status maps to 1, the second to 0
        "Fully Paid": 1,  # repaid
        "Charged Off": 0,  # not repaid
    }
}

loans_2020 = loans_2020.replace(status_replace)  # replace performs a find-and-replace

After encoding, the data looks like this

loans_2020['loan_status']

0        1
1        0
2        1
3        1
5        1
        ..
39526    1
39527    1
39528    1
39529    1
39530    1
Name: loan_status, Length: 39305, dtype: int64

2.4 Removing columns that contain only a single value

#A column whose values are all identical is useless for a classification model
orig_columns = loans_2020.columns  # all column names
drop_columns = []  # columns to drop

for col in orig_columns:
    # drop nulls first, then collect the distinct values
    col_series = loans_2020[col].dropna().unique()
    if len(col_series) == 1:  # a single distinct value: filter the column out
        drop_columns.append(col)

loans_2020 = loans_2020.drop(drop_columns, axis=1)
print(drop_columns)
print("--------------------------------------------")
print(loans_2020.shape)
loans_2020.to_csv('filtered_loans_2020.csv', index=False)

Only 39305 rows and 24 columns remain at this point

['initial_list_status', 'collections_12_mths_ex_med', 'policy_code', 'application_type', 'acc_now_delinq', 'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens']
--------------------------------------------
(39305, 24)

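Note:
Once features and labels have been selected, can we hand them straight to scikit-learn?

Of course not. We still need to handle missing values, string values, punctuation, % signs, and other str values.

2.5 Handling missing values

import pandas as pd

loans = pd.read_csv('filtered_loans_2020.csv')
null_counts = loans.isnull().sum()  # count the missing values in each column with pandas isnull
print(null_counts)

#Few values are missing in each column and most of the data is good. With a large sample it is acceptable to drop a few columns, to drop only the rows with missing values, or to fill them with the mean, median, or mode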

loan_amnt                  0
term                       0
int_rate                   0
installment                0
emp_length              1073
home_ownership             0
annual_inc                 0
verification_status        0
loan_status                0
pymnt_plan                 0
purpose                    0
title                     11
addr_state                 0
dti                        0
delinq_2yrs                0
earliest_cr_line           0
inq_last_6mths             0
open_acc                   0
pub_rec                    0
revol_bal                  0
revol_util                50
total_acc                  0
last_credit_pull_d         1
pub_rec_bankruptcies     449
dtype: int64

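The counts show that title and revol_util have few missing values relative to the total amount of data, so the rows containing them can simply be dropped. (The dropna(axis=0) call used for this also removes the rows where emp_length or last_credit_pull_d is missing.)

pub_rec_bankruptcies has more missing values, which suggests that this field was poorly recorded, so in this article the feature is deleted outright.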
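The article drops rows and columns; filling is the alternative it mentions in passing. The sketch below contrasts the two on a made-up frame that borrows two column names from the dataset. The choice of median for the numeric column and mode for the text column is an assumption for illustration, not the article's approach.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; values are invented for the example.
toy = pd.DataFrame({
    "revol_util": [83.7, np.nan, 98.5, 21.0],
    "emp_length": ["10+ years", "< 1 year", None, "10+ years"],
})

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna(axis=0)

# Option 2: fill numeric gaps with the median and text gaps with the mode.
filled = toy.copy()
filled["revol_util"] = filled["revol_util"].fillna(filled["revol_util"].median())
filled["emp_length"] = filled["emp_length"].fillna(filled["emp_length"].mode()[0])

print(len(dropped))
print(filled)
```

Dropping is simplest when few rows are affected; filling keeps the sample size but injects assumed values, so it is worth checking how sensitive the model is to that choice.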

loans = loans.drop("pub_rec_bankruptcies", axis=1)
loans = loans.dropna(axis=0) 

#use dtypes to count how many features are of type object, int, and float
print(loans.dtypes.value_counts())

Number of features of each type after the deletion

object     12
float64    10
int64       1
dtype: int64

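2.6 Converting data types

scikit-learn does not accept string data, so the 12 string-typed features above still need processing.

#Pandas select_dtypes with include=["object"] selects only the string columns

object_columns_df = loans.select_dtypes(include=["object"])
print(object_columns_df.iloc[0])

Approach
term: number of monthly installments
int_rate: interest rate, e.g. 10.65%; the % sign has to be stripped later
emp_length: more than 10 years counts as 10, 9 years counts as 9, and so on
home_ownership: whether the home is rented, owned, or mortgaged; could be represented by 0, 1, 2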

term                     36 months
int_rate                    10.65%
emp_length               10+ years
home_ownership                RENT
verification_status       Verified
pymnt_plan                       n
purpose                credit_card
title                     Computer
addr_state                      AZ
earliest_cr_line            Jan-85
revol_util                  83.70%
last_credit_pull_d          Nov-16
Name: 0, dtype: object
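Two of the string columns above are really numbers with a unit attached. The sketch below shows one way to convert them on made-up rows. Stripping the % sign matches what the article does later; extracting the number of months from term is an alternative shown only for illustration (the article one-hot encodes term instead), and the column name term_months is hypothetical.

```python
import pandas as pd

# Hypothetical rows in the same format as the dataset.
toy = pd.DataFrame({
    "term": [" 36 months", " 60 months"],
    "int_rate": ["10.65%", "15.27%"],
})

# Pull the digits out of "36 months" and convert them to integers.
toy["term_months"] = toy["term"].str.extract(r"(\d+)", expand=False).astype(int)

# Strip the trailing % and convert the rate to a float.
toy["int_rate"] = toy["int_rate"].str.rstrip("%").astype(float)

print(toy[["term_months", "int_rate"]])
print(toy.dtypes)
```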

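'''
Inspect the values of the selected columns and count them
home_ownership: home ownership
verification_status: whether the income was verified
emp_length: length of employment
term: loan term
addr_state: state of the borrower's address
'''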


cols = [
    'home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state'
]
for c in cols:
    print(loans[c].value_counts())

RENT        18237
MORTGAGE    17035
OWN          2805
OTHER          96
NONE            1
Name: home_ownership, dtype: int64
Not Verified       16182
Verified           12251
Source Verified     9741
Name: verification_status, dtype: int64
10+ years    8794
< 1 year     4492
2 years      4339
3 years      4052
4 years      3397
5 years      3262
1 year       3182
6 years      2201
7 years      1747
8 years      1463
9 years      1245
Name: emp_length, dtype: int64
 36 months    27980
 60 months    10194
Name: term, dtype: int64
CA    6876
NY    3644
FL    2739
TX    2657
NJ    1799
IL    1478
PA    1470
VA    1355
GA    1342
MA    1278
OH    1176
MD    1019
AZ     824
WA     788
CO     758
NC     739
CT     725
MI     688
MO     654
MN     589
NV     477
SC     457
OR     431
WI     429
AL     424
LA     422
KY     320
OK     292
KS     257
UT     248
AR     233
DC     211
RI     196
NM     180
HI     168
WV     167
NH     160
DE     109
MT      78
WY      78
AK      77
SD      61
VT      54
MS      19
TN      16
ID       6
IA       5
NE       1
Name: addr_state, dtype: int64

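Show the purpose and title columns

#"purpose" and "title" mean nearly the same thing, and the output shows that "title" has far more distinct values, so it can be discarded
print(loans["purpose"].value_counts())  # purpose: why the loan was taken out: a house, a car, or other spending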

print("------------------------------------------------")

print(loans["title"].value_counts())  # title: like purpose, the reason for the loan; keeping one of the two is enough

debt_consolidation    18057
credit_card            4927
other                  3761
home_improvement       2846
major_purchase         2103
small_business         1745
car                    1489
wedding                 924
medical                 665
moving                  551
house                   364
vacation                347
educational             300
renewable_energy         95
Name: purpose, dtype: int64
------------------------------------------------
Debt Consolidation                  2122
Debt Consolidation Loan             1670
Personal Loan                        625
Consolidation                        502
debt consolidation                   483
                                    ... 
Unexpected Legal Fees-Short Term       1
Payoff The cards                       1
increasing membership                  1
Silver products                        1
Getting back on the road!!             1
Name: title, Length: 18933, dtype: int64

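Label-encoding the employment length


'''
mapping_dict is a dictionary: emp_length is the key, and its value is another dictionary
"10+ years": 10...
"9 years" : 9...
...
Call replace to perform the substitution
The interest rate column contains a % sign, which is stripped with rstrip() before converting with astype()
'''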

mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0
    }
}

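# Drop: last_credit_pull_d: most recent month in which LC pulled the credit report
#earliest_cr_line: month the earliest credit line was opened
#addr_state: state of the borrower's address
#title: loan title provided by the borrower
loans = loans.drop(
    ["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
#rstrip: removes the given characters from the end of a string
loans["int_rate"] = loans["int_rate"].str.rstrip("%").astype("float")
#revol_util: revolving credit in use as a share of the available revolving credit
loans["revol_util"] = loans["revol_util"].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)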

mapping_dict

{'emp_length': {'10+ years': 10,
  '9 years': 9,
  '8 years': 8,
  '7 years': 7,
  '6 years': 6,
  '5 years': 5,
  '4 years': 4,
  '3 years': 3,
  '2 years': 2,
  '1 year': 1,
  '< 1 year': 0,
  'n/a': 0}}

The remaining string features are mapped to numeric values directly with the pandas get_dummies() function.

print(loans)

       loan_amnt        term  int_rate  installment  emp_length  \
0         5000.0   36 months     10.65       162.87          10   
1         2500.0   60 months     15.27        59.83           0   
2         2400.0   36 months     15.96        84.33          10   
3        10000.0   36 months     13.49       339.31          10   
4         5000.0   36 months      7.90       156.46           3   
...          ...         ...       ...          ...         ...   
39300    12000.0   36 months      9.33       383.45           2   
39301     4000.0   36 months      8.07       125.48           4   
39302     9000.0   36 months     10.59       292.91           1   
39303    10000.0   36 months      8.38       315.12           0   
39304    12000.0   36 months      9.96       386.99          10   

      home_ownership  annual_inc verification_status  loan_status pymnt_plan  \
0               RENT     24000.0            Verified            1          n   
1               RENT     30000.0     Source Verified            0          n   
2               RENT     12252.0        Not Verified            1          n   
3               RENT     49200.0     Source Verified            1          n   
4               RENT     36000.0     Source Verified            1          n   
...              ...         ...                 ...          ...        ...   
39300           RENT     68640.0        Not Verified            1          n   
39301           RENT     21600.0        Not Verified            1          n   
39302           RENT     25920.0        Not Verified            1          n   
39303           RENT    107000.0        Not Verified            1          n   
39304       MORTGAGE    100000.0        Not Verified            1          n   

                  purpose    dti  delinq_2yrs  inq_last_6mths  open_acc  \
0             credit_card  27.65          0.0             1.0       3.0   
1                     car   1.00          0.0             5.0       3.0   
2          small_business   8.72          0.0             2.0       2.0   
3                   other  20.00          0.0             1.0      10.0   
4                 wedding  11.20          0.0             3.0       9.0   
...                   ...    ...          ...             ...       ...   
39300  debt_consolidation   7.47          2.0             0.0       8.0   
39301  debt_consolidation  10.33          0.0             1.0       6.0   
39302      major_purchase   5.56          0.0             2.0       7.0   
39303      small_business   2.28          0.0             2.0       4.0   
39304  debt_consolidation   8.17          0.0             2.0      14.0   

       pub_rec  revol_bal  revol_util  total_acc  
0          0.0    13648.0        83.7        9.0  
1          0.0     1687.0         9.4        4.0  
2          0.0     2956.0        98.5       10.0  
3          0.0     5598.0        21.0       37.0  
4          0.0     7963.0        28.3       12.0  
...        ...        ...         ...        ...  
39300      0.0    11370.0        41.6       22.0  
39301      0.0     3737.0        55.8       11.0  
39302      0.0     6353.0        39.5        8.0  
39303      0.0    15043.0        65.2       25.0  
39304      0.0    25413.0        45.2       26.0  

[38174 rows x 19 columns]

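One-hot encode the selected columns

'''
home_ownership: home ownership
verification_status: whether the income was verified
emp_length: length of employment
purpose: purpose of the loan
term: loan term
'''

cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
# emp_length is already numeric after the replace above, so get_dummies passes it
# through unchanged; the drop below then removes it entirely, which is why it is
# absent from the final column list.
dummy_df = pd.get_dummies(loans[cat_columns])

#pd.concat joins pandas objects; axis=1 appends the dummy columns side by side
loans = pd.concat([loans, dummy_df], axis=1)

loans = loans.drop(cat_columns, axis=1)

#pymnt_plan indicates whether a payment plan has been put in place for the loan; every value is n, so drop the column
loans = loans.drop("pymnt_plan", axis=1)
loans.to_csv('cleaned_loans_2020.csv', index=False)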

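Summary:
When should one use OneHotEncoder (one-hot encoding) and when LabelEncoder (label encoding)?

The rule of thumb used here: when a feature has 3 or fewer distinct values, use OneHotEncoder; weather and gender are examples of unordered features.

When a feature has more than 3 distinct values, use LabelEncoder; the day of the week is an example of an ordered feature.

What matters most is whether the values have a natural order: unordered categories suit one-hot encoding, ordered ones suit integer codes.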
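The two encodings can be contrasted on a pair of columns borrowed from the dataset. This is a sketch on invented rows: emp_length has a natural order and is mapped to integers, while home_ownership has none and is expanded with get_dummies.

```python
import pandas as pd

# Hypothetical rows using two column names from the dataset.
toy = pd.DataFrame({
    "emp_length": ["< 1 year", "3 years", "10+ years"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE"],
})

# Ordered feature: map each level to an integer that preserves the order.
order = {"< 1 year": 0, "3 years": 3, "10+ years": 10}
toy["emp_length"] = toy["emp_length"].map(order)

# Unordered feature: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["home_ownership"])

print(encoded)
```

Integer codes on an unordered feature would imply a ranking (for example RENT < OWN) that a linear model such as logistic regression would take literally, which is the main reason to prefer indicator columns there.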

Data type conversion

import pandas as pd
loans = pd.read_csv("cleaned_loans_2020.csv")  # load the cleaned data; every column is now either float or int
print(loans.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38174 entries, 0 to 38173
Data columns (total 37 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   loan_amnt                            38174 non-null  float64
 1   int_rate                             38174 non-null  float64
 2   installment                          38174 non-null  float64
 3   annual_inc                           38174 non-null  float64
 4   loan_status                          38174 non-null  int64  
 5   dti                                  38174 non-null  float64
 6   delinq_2yrs                          38174 non-null  float64
 7   inq_last_6mths                       38174 non-null  float64
 8   open_acc                             38174 non-null  float64
 9   pub_rec                              38174 non-null  float64
 10  revol_bal                            38174 non-null  float64
 11  revol_util                           38174 non-null  float64
 12  total_acc                            38174 non-null  float64
 13  home_ownership_MORTGAGE              38174 non-null  int64  
 14  home_ownership_NONE                  38174 non-null  int64  
 15  home_ownership_OTHER                 38174 non-null  int64  
 16  home_ownership_OWN                   38174 non-null  int64  
 17  home_ownership_RENT                  38174 non-null  int64  
 18  verification_status_Not Verified     38174 non-null  int64  
 19  verification_status_Source Verified  38174 non-null  int64  
 20  verification_status_Verified         38174 non-null  int64  
 21  purpose_car                          38174 non-null  int64  
 22  purpose_credit_card                  38174 non-null  int64  
 23  purpose_debt_consolidation           38174 non-null  int64  
 24  purpose_educational                  38174 non-null  int64  
 25  purpose_home_improvement             38174 non-null  int64  
 26  purpose_house                        38174 non-null  int64  
 27  purpose_major_purchase               38174 non-null  int64  
 28  purpose_medical                      38174 non-null  int64  
 29  purpose_moving                       38174 non-null  int64  
 30  purpose_other                        38174 non-null  int64  
 31  purpose_renewable_energy             38174 non-null  int64  
 32  purpose_small_business               38174 non-null  int64  
 33  purpose_vacation                     38174 non-null  int64  
 34  purpose_wedding                      38174 non-null  int64  
 35  term_ 36 months                      38174 non-null  int64  
 36  term_ 60 months                      38174 non-null  int64  
dtypes: float64(12), int64(25)
memory usage: 10.8 MB
None

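III. Model Training

A great deal of time went into data processing above, which shows how important data preparation is in machine learning: only good data yields good classification results. For a binary classification problem, logistic regression is usually the first choice.
First define the criterion for judging the model. Based on how the lending business works, assume that lending to someone who cannot repay loses 1000, that lending to someone who can repay earns a profit of 0.1 on each loan (100 on a loan of 1000), and that all other cases earn nothing. In other words, it takes ten correct predictions to offset one wrong one, so accuracy is no longer a suitable measure for this model. To maximize profit the model needs a high recall and, at the same time, a low fall-out, so two metrics are used here: TPR (true positive rate) and FPR (false positive rate).

#Despite its name, logistic regression is a classifier, not a regressor; we train with it
from sklearn.linear_model import LogisticRegression  # classifier

lr = LogisticRegression()  # create the logistic regression model
cols = loans.columns  # the cleaned data: 38174 rows * 37 columns

train_cols = cols.drop("loan_status")  # remove loan_status, which serves as the target

features = loans[train_cols]  # feature matrix with 36 columns
target = loans["loan_status"]  # label vector

lr.fit(features, target)  # train
predictions = lr.predict(features)  # predict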

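3.1 Inspecting the predictions

predictions[:10]  # 0: not repaid, 1: repaid

#Result
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

lr.predict_proba(features)  # predicted class probabilities

#Result
#first column: probability of being unable to repay; second column: probability of being able to repay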
   array([[0.23940129, 0.76059871],
           [0.35607142, 0.64392858],
           [0.32106074, 0.67893926],
           ...,
           [0.30770809, 0.69229191],
           [0.10258821, 0.89741179],
           [0.09494366, 0.90505634]])

3.2 Hyperparameters of logistic regression

lr

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

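3.3 Analyzing the requirement

The goal is to earn interest from customers who are able to repay their loans. There are four possible outcomes:

Case 1: the actual value is 0 (the customer will not repay) but the model predicts 1 (will repay). Suppose the system lends the customer 1000 and nothing comes back: this is a false positive, and 1000 is lost.
Case 2: the actual value is 1 (able to repay) and the model predicts 1: a true positive. The lender earns the interest, 1000*0.1 = 100.
Case 3: the actual value is 0 and the model predicts 0: a true negative. No loan is made.
Case 4: the customer can repay but the model predicts that they cannot: a false negative. No loan is made.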

3.4 Building the confusion matrix

import pandas as pd
#Next, compute the four counts: fp, tp, fn, tn

print("----------------------------------------")
# False positive (FP): a negative sample predicted as positive
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])
print(fp)
print("----------------------------------------")


# True positive (TP): a positive sample predicted as positive
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])
print(tp)
print("----------------------------------------")


# False negative (FN): a positive sample predicted as negative
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])
print(fn)
print("----------------------------------------")

# True negative (TN): a negative sample predicted as negative
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])
print(tn)
print("----------------------------------------")

#Result

    ----------------------------------------
    5355
    ----------------------------------------
    32786
    ----------------------------------------
    23
    ----------------------------------------
    10
    ----------------------------------------

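This raises a question:
Which metric should ultimately be used to evaluate the model?

Closer inspection shows that the dataset is imbalanced: roughly 6 loans that were repaid for every 1 that was not. Picture 7 applicants, 6 who repay (label 1) and 1 who does not (label 0). A model that lends to all 7 has an error rate of 1/7 and an accuracy of 6/7. Here is what happens when we judge it by accuracy:

The first applicant has an actual value of 0 (unable to repay), but the model predicts 1 (able to repay), so 1000 is lost.
The remaining six have an actual value of 1 (able to repay); the model lends each of them 1000 and earns 100 in interest on each.
The total is -1000 + 600 = -400. A model that looks good by accuracy still loses money, simply because the dataset contains so many repaying samples. That is clearly unreasonable, so accuracy is set aside.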
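The arithmetic above can be written as a small function of the confusion-matrix counts. This is a sketch under the article's stated assumptions (a loan of 1000, 10% interest earned on each repaid loan, the full loan lost on each default, nothing gained or lost when no loan is made); the function name expected_profit is hypothetical.

```python
def expected_profit(tp, fp, loan=1000.0, interest_rate=0.1):
    """Profit when every predicted-positive applicant receives a loan.

    tp: loans granted to applicants who repay (earn interest on each)
    fp: loans granted to applicants who default (lose the whole loan)
    Applicants predicted negative get no loan and contribute nothing.
    """
    return tp * loan * interest_rate - fp * loan


# The 7-applicant example: 6 repay, 1 defaults, and the model lends to all of them.
accuracy = 6 / 7
profit = expected_profit(tp=6, fp=1)

print(round(accuracy, 3))
print(profit)
```

With these assumptions one default cancels the interest from ten good loans, so a model needs roughly ten true positives per false positive just to break even.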

So here we build the confusion matrix

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

lr = LogisticRegression()
predictions = cross_val_predict(lr, features, target, cv=10)  # 10-fold cross-validation
predictions = pd.Series(predictions)
print(predictions[:1000])

0      1
1      1
2      1
3      1
4      1
      ..
995    1
996    1
997    1
998    1
999    1
Length: 1000, dtype: int64

# False positive (FP): a negative sample predicted as positive
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positive (TP): a positive sample predicted as positive
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negative (FN): a positive sample predicted as negative
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negative (TN): a negative sample predicted as negative
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

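$TPR = \frac{truepositives}{truepositives + falsenegatives}$
$FPR = \frac{falsepositives}{falsepositives + truenegatives}$

True positive rate (TPR): the share of customers whose actual value is 1 (able to repay) that the model also predicts as 1. The higher it is, the more of these customers receive loans and the more interest is earned, so we want TPR to be as high as possible.

False positive rate (FPR): the share of customers whose actual value is 0 (unable to repay) that the model predicts as 1. Each of these loans loses money.

In essence, we want TPR as high as possible and FPR as low as possible.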
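The same two rates can be obtained from scikit-learn's confusion_matrix instead of four hand-written filters. The sketch below uses made-up labels and predictions; for binary labels ordered [0, 1], ravel() returns the counts in the order tn, fp, fn, tp.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = repaid, 0 = not repaid).
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

tpr = tp / (tp + fn)  # share of actual positives predicted positive
fpr = fp / (fp + tn)  # share of actual negatives predicted positive

print(tn, fp, fn, tp)
print(tpr, fpr)
```

Passing labels=[0, 1] explicitly keeps the layout fixed even if one class happens to be absent from the predictions.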

tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))


print(tpr)  # true positive rate
print(fpr)  # false positive rate
print(predictions[:20])

0.9991160961931177
0.998695246971109
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
dtype: int64

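The results show that the first 20 applicants are all predicted as able to repay and would be granted loans, and that TPR and FPR are both close to 1. The model judges nearly everyone who shows up to be creditworthy, so it clearly has no value as a classifier.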

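At this point an important factor has to be considered: weights

Why does this happen?

The problem lies in the dataset. The classes are split roughly 6:1, with the vast majority labeled 1 and a small minority labeled 0. With such imbalanced samples the classifier wrongly learns to predict 1 for everything. Because negative samples are scarce, we would need something like "data augmentation" to compensate.

Say one class makes up 6 parts of the data and the other 1 part. Give the 6-part class a weight of 1 and the 1-part class a weight of 6. The weight term rebalances the imbalanced samples, so that the positive samples have less influence on the result.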
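To see what weights a 6:1 split produces under scikit-learn's "balanced" heuristic, the sketch below uses compute_class_weight on a made-up label vector. The formula in the comment is the one scikit-learn documents for the balanced mode; the 7-sample vector is hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with the 6:1 imbalance described above.
y = np.array([1] * 6 + [0] * 1)
classes = np.array([0, 1])

# "balanced" uses n_samples / (n_classes * count_of_each_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.tolist())))
print(weights[0] / weights[1])  # relative weight of class 0 versus class 1
```

The ratio between the two weights equals the inverse of the class ratio, so the single negative sample counts six times as much as each positive one, matching the 1-to-6 weighting described above.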

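3.5 Training logistic regression with class weights

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

"""
class_weight: adjusts the weights of the positive and negative samples
balanced: asks for the positive and negative samples to be balanced
"""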
lr = LogisticRegression(class_weight="balanced")
predictions = cross_val_predict(lr, features, target, cv=10)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))

print(tpr)  # true positive rate
print()
print(fpr)  # false positive rate
print()
print(predictions[:20])

0.5273248194093084 # true positive rate

0.33401677539608576 # false positive rate

0     0
1     1
2     0
3     1
4     1
5     0
6     0
7     0
8     0
9     1
10    1
11    0
12    0
13    1
14    0
15    0
16    1
17    1
18    1
19    0
dtype: int64

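3.6 Training logistic regression with custom weights

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

"""
The weights can also be defined by hand
class 0 gets a weight of 5
class 1 gets a weight of 1
"""
penalty = {
    0: 5,  # weight class 0 five times as heavily
    1: 1
}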

lr = LogisticRegression(class_weight=penalty)
# kf = KFold(features.shape[0], random_state=1)
kf = 10
predictions = cross_val_predict(lr, features, target, cv=kf)
predictions = pd.Series(predictions)

# False positives.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])

# True positives.
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])

# False negatives.
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])

# True negatives
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])

# Rates
tpr = tp / float((tp + fn))
fpr = fp / float((fp + tn))

print(tpr)
print()
print(fpr)

0.7041360602273766 # true positive rate

0.5237651444547996 # false positive rate

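IV. Summary

Why did the absurd result above occur?

It is because our samples are highly imbalanced, which easily leads the classifier to assign every sample to the majority class. There are many remedies. One is data augmentation, that is, adding more samples of the minority class; but the added data must either be collected or synthesized, so this is hard work. We therefore turn to weights instead, increasing the weight of the minority class in the hope that the model reaches a more balanced state.

A simple analysis of the unweighted model's predictions shows that the true positive rate and the false positive rate both reach 99.9%. The false positive rate is far too high: inspecting the predictions reveals that the model classifies almost every sample as positive. Knowing the raw data, the cause is that the numbers of positive and negative samples differ too much; the imbalance biases the model toward the positive class. This is adjusted by adding weights to the samples, starting with the default balanced setting.

The case in this article is not primarily about delivering a model with high accuracy; it only presents the general workflow of modeling with machine learning.

The workflow has two main parts: data processing and model learning.

The first part requires substantial domain knowledge to clean the raw data and extract features.

The second part, model learning, involves lengthy tuning of model parameters; the direction and strategy of the tuning have to be adapted flexibly based on experience.

When the model performs poorly, possible adjustments include:

1. Adjust the weights of the positive and negative samples.

2. Switch to a different model or algorithm.

3. Predict with several models at once and combine their predictions into the final result.

4. Generate new features from the original data.

5. Tune the model parameters.

★ That concludes this brief walkthrough of the customer loan prediction case. I hope readers take something real away from it. What matters most is applying what you know flexibly when facing different cases. Thanks for reading!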

高羊羊羊羊羊杨
