A Baseline Model for the German Credit Dataset

Public account: 尤而小屋
Author: Peter
Editor: Peter

Hello everyone, my name is Peter~

This article builds a simple baseline model on the German credit dataset using three tree-based models (decision tree, random forest, and XGBoost), and closes with directions for optimization. The main contents include:

Import libraries

Libraries for data manipulation, visualization, modeling, and evaluation:

import pandas as pd
import numpy as np

# 1. Plotly-based plotting
import plotly as py
import plotly.express as px
import plotly.graph_objects as go
py.offline.init_notebook_mode(connected = True)
from plotly.subplots import make_subplots  # multiple subplots
# 2. Matplotlib-based plotting
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
# Chinese character display
# set the font
plt.rcParams["font.sans-serif"]=["SimHei"] 
# display minus signs correctly
plt.rcParams["axes.unicode_minus"]=False 

# 3. Seaborn
import seaborn as sns
# plt.style.use("fivethirtyeight")
plt.style.use('ggplot')

# Data standardization, splitting, cross-validation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.model_selection import train_test_split,cross_val_score

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Model evaluation
from sklearn import metrics  
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score, precision_score, f1_score

# Ignore warnings in the notebook
import warnings
warnings.filterwarnings("ignore")

Data introduction

The data comes from the official UCI website: archive.ics.uci.edu/ml/datasets…

Basic information: 1,000 records + 20 feature variables + 1 target variable + no missing values

Meanings and English names of the feature variables:

  • Feature meanings: 1. Status of existing checking account; 2. Loan duration; 3. Credit history; 4. Loan purpose; 5. Credit amount; 6. Savings account status; 7. Present employment; 8. Installment rate as a percentage of income; 9. Sex and personal status; 10. Guarantor information; 11. Present residence; 12. Property status; 13. Age; 14. Other installment plans; 15. Housing status; 16. Number of existing credits; 17. Job; 18. Number of dependents; 19. Telephone registration; 20. Whether a foreign worker

  • Corresponding English names: 1.status_account, 2.duration, 3.credit_history, 4.purpose, 5.amount, 6.saving_account, 7.present_emp, 8.income_rate, 9.personal_status, 10.other_debtors, 11.residence_info, 12.property, 13.age, 14.inst_plans, 15.housing, 16.num_credits, 17.job, 18.dependents, 19.telephone, 20.foreign_worker

Read the data

The downloaded file has no header row; the corresponding English column names (found online) are used to build the DataFrame:
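A minimal sketch of that step, assuming the raw UCI file german.data (space-separated, no header row) sits in the working directory; the column names below are the ones used throughout this article:

import pandas as pd

# Column names used throughout this article; the raw UCI file has no header row
columns = [
    "checking_account_status", "duration", "credit_history", "purpose",
    "credit_amount", "savings", "present_employment", "installment_rate",
    "personal", "other_debtors", "present_residence", "property", "age",
    "other_installment_plans", "housing", "existing_credits", "job",
    "dependents", "telephone", "foreign_worker", "customer_type",
]

df = pd.read_csv("german.data", sep=" ", header=None, names=columns)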

In [4]:

df.shape

Out[4]:

(1000, 21)

In [5]:

df.dtypes  # column types

Out[5]:

checking_account_status    object
duration                    int64
credit_history             object
purpose                    object
credit_amount               int64
savings                    object
present_employment         object
installment_rate            int64
personal                   object
other_debtors              object
present_residence           int64
property                   object
age                         int64
other_installment_plans    object
housing                    object
existing_credits            int64
job                        object
dependents                  int64
telephone                  object
foreign_worker             object
customer_type               int64
dtype: object

In [6]:

# Count the different column dtypes
# (pd.value_counts is deprecated in newer pandas; df.dtypes.value_counts() is equivalent)

pd.value_counts(df.dtypes.values)

Out[6]:

object    13
int64      8
dtype: int64

In [7]:

df.isnull().sum()

Out[7]:

checking_account_status    0
duration                   0
credit_history             0
purpose                    0
credit_amount              0
savings                    0
present_employment         0
installment_rate           0
personal                   0
other_debtors              0
present_residence          0
property                   0
age                        0
other_installment_plans    0
housing                    0
existing_credits           0
job                        0
dependents                 0
telephone                  0
foreign_worker             0
customer_type              0
dtype: int64

Value counts for each field

In [8]:

columns = df.columns     # column names
columns

Out[8]:

Index(['checking_account_status', 'duration', 'credit_history', 'purpose',
       'credit_amount', 'savings', 'present_employment', 'installment_rate',
       'personal', 'other_debtors', 'present_residence', 'property', 'age',
       'other_installment_plans', 'housing', 'existing_credits', 'job',
       'dependents', 'telephone', 'foreign_worker', 'customer_type'],
      dtype='object')

1. Value counts for the string-type fields:

string_columns = df.select_dtypes(include="object").columns

# Two basic parameters: number of rows and columns
fig = make_subplots(rows=3, cols=5)  

for i, v in enumerate(string_columns):  
    r = i // 5 + 1  # subplot row (1-based)
    c = i % 5 + 1   # subplot column (1-based)

    # note: this relies on pandas < 2.0, where reset_index() yields
    # columns "index" (the value) and v (the count); in pandas >= 2.0
    # the columns are named v and "count" instead
    data = df[v].value_counts().reset_index()

    fig.add_trace(go.Bar(x=data["index"], y=data[v],
                         text=data[v], name=v),
                  row=r, col=c)

fig.update_layout(width=1000, height=900)

fig.show()

2. Distributions of the numeric fields:

number_columns = df.select_dtypes(exclude="object").columns.tolist()
number_columns

# Two basic parameters: number of rows and columns
fig = make_subplots(rows=2, cols=4)  # 2 rows, 4 columns

for i, v in enumerate(number_columns):  # number_columns has length 8
    r = i // 4 + 1  # subplot row (1-based)
    c = i % 4 + 1   # subplot column (1-based)

    fig.add_trace(go.Box(y=df[v].tolist(), name=v),
                  row=r, col=c)

fig.update_layout(width=1000, height=900)

fig.show()

Field processing

Checking account status – checking_account_status

Meaning: status of the existing checking account

  • A11: < 0 DM
  • A12: 0 <= x < 200 DM
  • A13: >= 200 DM / salary assignments for at least one year
  • A14: no checking account

In [11]:

df["checking_account_status"].value_counts()

Out[11]:

A14    394
A11    274
A12    269
A13     63
Name: checking_account_status, dtype: int64

In [12]:

fig,ax = plt.subplots(figsize=(12,8), dpi=80)

sns.countplot(x="checking_account_status", data=df)

plt.title("number of checking_account_status")

for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height()+5), color='black', size=20)
plt.show()

Here we hard-code the values according to the size of each applicant's checking account balance

In [13]:

# A11: < 0 DM, A12: 0 <= x < 200 DM, A13: >= 200 DM / salary assigned for at least one year, A14: no checking account
# Encoding 1: hard coding
cas = {"A11": 1,"A12":2, "A13":3, "A14":0}
df["checking_account_status"] = df["checking_account_status"].map(cas)

Loan duration – duration

Meaning: the duration of the loan in months

In [14]:

duration = df["duration"].value_counts()
duration.head()

Out[14]:

24    184
12    179
18    113
36     83
6      75
Name: duration, dtype: int64

In [15]:

fig = px.violin(df,y="duration")

fig.show()

Credit history – credit_history

Meaning:

  • A30: no credits taken / all credits paid back duly
  • A31: all credits at this bank paid back duly
  • A32: existing credits paid back duly till now
  • A33: delay in paying off in the past
  • A34: critical account / other credits existing (not at this bank)

In [17]:

ch = df["credit_history"].value_counts().reset_index()
ch

Out[17]:

  index  credit_history
0   A32             530
1   A34             293
2   A33              88
3   A31              49
4   A30              40

In [18]:

fig = px.pie(ch,names="index",values="credit_history")

fig.update_traces(
    textposition='inside',
    textinfo='percent+label'
)

fig.show()

# Encoding 2: one-hot

df_credit_history = pd.get_dummies(df["credit_history"])
df = df.join(df_credit_history)
df.drop("credit_history", inplace=True, axis=1)

Loan purpose – purpose

The purpose of the loan.

In [20]:

# Count the number of applicants for each purpose; hard-code based on these counts
purpose = df["purpose"].value_counts().sort_values(ascending=True).reset_index()

purpose.columns = ["purpose", "number"]

purpose

# Encoding 3: frequency rank
df["purpose"] = df["purpose"].map(dict(zip(purpose.purpose,purpose.index)))

Credit amount – credit_amount

The amount of credit requested.

In [22]:

px.violin(df["credit_amount"])

Savings – savings

Savings account / bonds (A61: < 100 DM, A62: 100 <= x < 500 DM, A63: 500 <= x < 1000 DM, A64: >= 1000 DM, A65: unknown / no savings account)

In [24]:

string_columns

Out[24]:

Index(['checking_account_status', 'credit_history', 'purpose', 'savings',
       'present_employment', 'personal', 'other_debtors', 'property',
       'other_installment_plans', 'housing', 'job', 'telephone',
       'foreign_worker'],
      dtype='object')

In [25]:

df["savings"].value_counts()

Out[25]:

A61    603
A65    183
A62    103
A63     63
A64     48
Name: savings, dtype: int64

In [26]:

# Encoding 6: hard coding
savings = {"A61":1,"A62":2, "A63":3, "A64":4,"A65":0}

df["savings"] = df["savings"].map(savings)

Present employment – present_employment

  • A71: unemployed
  • A72: < 1 year
  • A73: 1 <= x < 4 years
  • A74: 4 <= x < 7 years
  • A75: >= 7 years

In [28]:

df["present_employment"].value_counts()

Out[28]:

A73    339
A75    253
A74    174
A72    172
A71     62
Name: present_employment, dtype: int64

In [29]:

# Encoding 7: one-hot

df_present_employment = pd.get_dummies(df["present_employment"])

In [30]:

df = df.join(df_present_employment)

df.drop("present_employment", inplace=True, axis=1)

Personal status and sex – personal

Marital status and sex (A91: male: divorced/separated, A92: female: divorced/separated/married, A93: male: single, A94: male: married/widowed, A95: female: single)

In [31]:

# Encoding 8: one-hot

df_personal = pd.get_dummies(df["personal"])
df = df.join(df_personal)

df.drop("personal", inplace=True, axis=1)

Other debtors / guarantors – other_debtors

A101: none, A102: co-applicant, A103: guarantor

In [32]:

# Encoding 9: one-hot

df_other_debtors = pd.get_dummies(df["other_debtors"])
df = df.join(df_other_debtors)

df.drop("other_debtors", inplace=True, axis=1)

Property – property

In [33]:

# Encoding 10: one-hot

df_property = pd.get_dummies(df["property"])
df = df.join(df_property)

df.drop("property", inplace=True, axis=1)

Housing – housing

A151: rent, A152: own, A153: for free

In [34]:

# Encoding 11: one-hot

df_housing = pd.get_dummies(df["housing"])
df = df.join(df_housing)

df.drop("housing", inplace=True, axis=1)

Other installment plans – other_installment_plans

A141: bank, A142: stores, A143: none

In [35]:

fig,ax = plt.subplots(figsize=(12,8), dpi=80)

sns.countplot(x="other_installment_plans", data=df)

plt.title("number of other_installment_plans")

for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height()+5), color='black', size=20)
plt.show()

# Encoding 12: one-hot

df_other_installment_plans = pd.get_dummies(df["other_installment_plans"])
df = df.join(df_other_installment_plans)

df.drop("other_installment_plans", inplace=True, axis=1)

Job – job

  • A171: unskilled – non-resident
  • A172: unskilled – resident
  • A173: skilled employee / official
  • A174: management / self-employed / highly qualified employee / officer

In [37]:

fig,ax = plt.subplots(figsize=(12,8), dpi=80)

sns.countplot(x="job", data=df)

plt.title("number of job")

for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height()+5), color='black', size=20)
plt.show()

# Encoding 13: one-hot

df_job = pd.get_dummies(df["job"])
df = df.join(df_job)

df.drop("job", inplace=True, axis=1)

Telephone – telephone

A191: none, A192: yes, registered under the customer's name

In [39]:

# Encoding 14: one-hot

df_telephone = pd.get_dummies(df["telephone"])
df = df.join(df_telephone)

df.drop("telephone", inplace=True, axis=1)

Foreign worker – foreign_worker

A201: yes, A202: no

In [40]:

# Encoding 15: one-hot

df_foreign_worker = pd.get_dummies(df["foreign_worker"])
df = df.join(df_foreign_worker)

df.drop("foreign_worker", inplace=True, axis=1)

Counts of the two customer types – customer_type

Prediction classes: 1 = good, 2 = bad

In [41]:

fig,ax = plt.subplots(figsize=(12,8), dpi=80)

sns.countplot(x="customer_type", data=df)

plt.title("number of customer_type")

for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height()+5), color='black', size=20)
plt.show()

Shuffle the data

In [42]:

from sklearn.utils import shuffle

# Randomly shuffle the rows
df = shuffle(df).reset_index(drop=True)
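For reproducibility the shuffle can be seeded (an optional tweak, not in the original). Note also that train_test_split below shuffles by default, so this step mainly removes any ordering present in the raw file:

df = shuffle(df, random_state=42).reset_index(drop=True)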

Modeling

Data splitting

In [44]:

# Feature matrix
X = df.drop("customer_type", axis=1)

# Target variable
y = df['customer_type']

In [45]:

# 20/80 split (train_size=0.2: 20% of the data goes to training)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)

Data standardization

In [46]:

ss = StandardScaler()

X_train = ss.fit_transform(X_train)

In [47]:

y_train

Out[47]:

556    1
957    1
577    2
795    2
85     1
      ..
106    1
270    2
860    1
435    1
102    2
Name: customer_type, Length: 200, dtype: int64

In [48]:

# Mean and standard deviation of the training set

mean_ = ss.mean_  # mean
var_ = np.sqrt(ss.var_)  # standard deviation (square root of the variance)

Apply the training-set mean and standard deviation to the test set:

In [50]:

# Standardize the test-set features with the training statistics
X_test = (X_test - mean_) / var_
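The manual computation above is equivalent to calling the fitted scaler directly. Note that transform returns a NumPy array rather than a DataFrame, which would also make the later X_test.values conversion unnecessary:

# same as (X_test - ss.mean_) / np.sqrt(ss.var_)
X_test = ss.transform(X_test)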

Model 1: Decision tree

In [51]:

dt = DecisionTreeClassifier(max_depth=5)

dt.fit(X_train, y_train)

Out[51]:

DecisionTreeClassifier(max_depth=5)

In [52]:

# Predict
y_pred = dt.predict(X_test)
y_pred[:5]

Out[52]:

array([2, 1, 1, 2, 1])

In [53]:

# Confusion matrix
confusion_mat = metrics.confusion_matrix(y_test,y_pred)
confusion_mat

Out[53]:

array([[450, 118],
       [137,  95]])

In [54]:

# Visualize the confusion matrix

classes = ["good", "bad"]

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=classes)
disp.plot(
    include_values=True,             # show the count in each cell
    cmap="GnBu",                     # any matplotlib-recognized colormap
    ax=None,  
    xticks_rotation="horizontal",    
    values_format="d"              
)

plt.show()

## AUC-ROC

auc_roc = metrics.roc_auc_score(y_test, y_pred)  # true vs. predicted labels
auc_roc

0.5008681398737251
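Note that roc_auc_score is given hard class labels here, which collapses the ROC curve to a single operating point. The usual approach scores with the predicted probability of the positive class; a sketch (class 2, "bad", is the second column of predict_proba because dt.classes_ is [1, 2]):

y_score = dt.predict_proba(X_test)[:, 1]  # P(customer_type == 2)
metrics.roc_auc_score(y_test, y_score)

The same applies to the random forest and XGBoost evaluations below.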

Model 2: Random forest

In [56]:

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

Out[56]:

RandomForestClassifier()

In [57]:

# Predict
y_pred = rf.predict(X_test)
y_pred[:5]

Out[57]:

array([1, 1, 1, 2, 1])

In [58]:

# Confusion matrix
confusion_mat = metrics.confusion_matrix(y_test,y_pred)
confusion_mat

Out[58]:

array([[476,  92],
       [142,  90]])

In [59]:

# Visualize the confusion matrix

classes = ["good", "bad"]

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=classes)
disp.plot(
    include_values=True,             # show the count in each cell
    cmap="GnBu",                     # any matplotlib-recognized colormap
    ax=None,  
    xticks_rotation="horizontal",    
    values_format="d"              
)

plt.show()

## AUC-ROC

auc_roc = metrics.roc_auc_score(y_test, y_pred)  # true vs. predicted labels
auc_roc

0.6129796017484215

Model 3: XGBoost

In [62]:

from xgboost.sklearn import XGBClassifier
# Define the XGBoost model
clf = XGBClassifier()

# X_train = X_train.values
# X_test = X_test.values

In [63]:

clf.fit(X_train, y_train)

Out[63]:

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [65]:

# Convert to a NumPy array before passing it in
X_test = X_test.values

y_pred = clf.predict(X_test)
y_pred[:5]

Out[65]:

array([1, 1, 1, 2, 1])

In [66]:

# Confusion matrix
confusion_mat = metrics.confusion_matrix(y_test,y_pred)
confusion_mat

Out[66]:

array([[445, 123],
       [115, 117]])

In [67]:

# Visualize the confusion matrix

classes = ["good", "bad"]

disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=classes)
disp.plot(
    include_values=True,             # show the count in each cell
    cmap="GnBu",                     # any matplotlib-recognized colormap
    ax=None,  
    xticks_rotation="horizontal",    
    values_format="d"              
)

plt.show()

## AUC-ROC

auc_roc = metrics.roc_auc_score(y_test, y_pred)  # true vs. predicted labels
auc_roc

0.6438805245264692

Model optimization

Feature selection based on correlation coefficients

# y: customer_type is the target variable

# 1. Compute the correlation coefficient between each feature and the target

data = pd.concat([X,y],axis=1)

corr = data.corr()
corr[:5]

Descriptive statistics of the correlation coefficients show that the overall correlations (in absolute value) are quite small.
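One way to see that at a glance (a small addition, not in the original notebook):

# distribution of each feature's absolute correlation with the target
corr["customer_type"].drop("customer_type").abs().describe()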

Heatmap

fig, ax = plt.subplots(figsize=(20,16))

ax = sns.heatmap(corr, 
                 vmax=0.8, 
                 square=True, 
                 annot=True,  # annotate cells with the values
                 cmap="YlGnBu")

Select the top 20 variables by correlation with the target

k = 20

cols = corr.nlargest(k, "customer_type")["customer_type"].index
cols

Index(['customer_type', 'duration', 'checking_account_status', 'credit_amount',
       'A30', 'A31', 'A124', 'A72', 'A141', 'A151', 'A201', 'A153', 'A92',
       'installment_rate', 'A102', 'A142', 'A91', 'A32', 'A174', 'A71'],
      dtype='object')

fig, hm = plt.subplots(figsize=(10,10))  # adjust the figure size
hm = sns.heatmap(data[cols].corr(),  # correlations among the 20 selected columns
                 annot=True, 
                 square=True)
plt.show()

Select the variables whose absolute correlation exceeds 0.1

threshold = 0.1

corrmat = data.corr()
top_corr_features = corrmat.index[abs(corrmat["customer_type"]) > threshold]

plt.figure(figsize=(10,10))

g = sns.heatmap(data[top_corr_features].corr(),  # correlation matrix of the features with |corr| > threshold
                annot=True,
                square=True,
                cmap="nipy_spectral_r"
               )

Modeling on the new data

# Keep the features that passed the threshold
useful_col = corrmat.index[abs(corrmat["customer_type"]) > threshold].tolist()  
new_df = df[useful_col]
new_df.head()

Data splitting

# Feature matrix
X  = new_df.drop("customer_type", axis=1)

# Target variable
y = new_df['customer_type']
# 30/70 split (train_size=0.3: 30% of the data goes to training)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)

Standardization

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
# Mean and standard deviation of the training set

mean_ = ss.mean_  # mean
var_ = np.sqrt(ss.var_)  # standard deviation

# Standardize the test-set features with the training statistics

X_test = (X_test - mean_) / var_

Modeling

from xgboost.sklearn import XGBClassifier

# Define the XGBoost model
clf = XGBClassifier()
clf.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [80]:

# Convert to a NumPy array before predicting
X_test = X_test.values

y_pred = clf.predict(X_test)
y_pred[:5]

Out[80]:

array([2, 1, 2, 2, 1])

In [81]:

# Confusion matrix
confusion_mat = metrics.confusion_matrix(y_test,y_pred)
confusion_mat

Out[81]:

array([[406,  94],
       [ 96, 104]])

In [82]:

## AUC-ROC

auc_roc = metrics.roc_auc_score(y_test, y_pred)  # true vs. predicted labels
auc_roc

Out[82]:

0.666

Directions for optimization

After modeling with three different tree models, we find that the AUC values are not very high. AUC is a probability value, and the larger it is, the better the classifier. Directions worth optimizing:

  1. Feature engineering: the main area for improvement. The original features were encoded with different schemes (one-hot and hard coding); the encoding of some fields could be refined.
  2. Feature selection: the correlation coefficient measures the degree of linear association between two continuous variables, but the relationship between a feature and the target is not necessarily linear. The uniformly low correlations observed here seem to support this; other selection methods are worth exploring for future modeling.
  3. Model tuning: optimize the hyperparameters of a single model, e.g. via grid search (see the sketch after this list), or boost the overall performance through model ensembling.
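A sketch of the grid-search direction from item 3 (the parameter grid is illustrative, not tuned):

from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1, 0.3],
}

grid = GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)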

Getting the dataset

Follow the public account [尤而小屋] and reply "德国" (Germany) to receive the dataset used in this article.

Origin: juejin.im/post/7120845817841188894