Official account: You Er Hut
Author: Peter
Editor: Peter
Hi everyone, my name is Peter~
This article builds a simple model of German credit data using 3 tree-based models, which can serve as a baseline, and closes with directions for optimization. The main contents include:
Import libraries
Libraries for data manipulation, visualization, modeling, etc.:
import pandas as pd
import numpy as np
# 1. plotly-based
import plotly as py
import plotly.express as px
import plotly.graph_objects as go
py.offline.init_notebook_mode(connected = True)
from plotly.subplots import make_subplots # multiple subplots
# 2. matplotlib-based
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
# Chinese character display
# set the font
plt.rcParams["font.sans-serif"]=["SimHei"]
# display minus signs correctly
plt.rcParams["axes.unicode_minus"]=False
# 3. seaborn-based
import seaborn as sns
# plt.style.use("fivethirtyeight")
plt.style.use('ggplot')
# data standardization, splitting, cross-validation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.model_selection import train_test_split,cross_val_score
# models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor,DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# model evaluation
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score, precision_score, f1_score
# silence warnings in the notebook
import warnings
warnings.filterwarnings("ignore")
Data introduction
The data comes from the official UCI website: archive.ics.uci.edu/ml/datasets…
Basic information: 1,000 records + 20 feature variables + a target variable + no missing values
Meanings of the feature variables:
- Plain-language meaning: 1. Status of the checking account; 2. Loan duration; 3. Credit history; 4. Purpose of the loan; 5. Credit amount; 6. Status of the savings account; 7. Present employment; 8. Installment rate as a percentage of income; 9. Sex and marital status; 10. Other debtors/guarantors; 11. Present residence; 12. Property; 13. Age; 14. Other installment plans; 15. Housing; 16. Number of existing credits; 17. Job; 18. Number of dependents; 19. Telephone registration; 20. Whether the applicant is a foreign worker
- Corresponding English field names: 1. status_account, 2. duration, 3. credit_history, 4. purpose, 5. amount, 6. saving_account, 7. present_emp, 8. income_rate, 9. personal_status, 10. other_debtors, 11. residence_info, 12. property, 13. age, 14. inst_plans, 15. housing, 16. num_credits, 17. job, 18. dependents, 19. telephone, 20. foreign_worker
Read the data
The downloaded file has no header; the corresponding English header was found online and used to build the DataFrame:
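The loading cell itself is not shown below; a minimal sketch, assuming the raw UCI file `german.data` (space-separated, no header) and the column names that appear in the output:
# minimal sketch: file name and separator are assumptions, adjust to your local copy
names = ["checking_account_status", "duration", "credit_history", "purpose",
         "credit_amount", "savings", "present_employment", "installment_rate",
         "personal", "other_debtors", "present_residence", "property", "age",
         "other_installment_plans", "housing", "existing_credits", "job",
         "dependents", "telephone", "foreign_worker", "customer_type"]
df = pd.read_csv("german.data", sep=" ", header=None, names=names)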
In [4]:
df.shape
Out[4]:
(1000, 21)
In [5]:
df.dtypes # field types
Out[5]:
checking_account_status object
duration int64
credit_history object
purpose object
credit_amount int64
savings object
present_employment object
installment_rate int64
personal object
other_debtors object
present_residence int64
property object
age int64
other_installment_plans object
housing object
existing_credits int64
job object
dependents int64
telephone object
foreign_worker object
customer_type int64
dtype: object
In [6]:
# counts of the different field types
pd.value_counts(df.dtypes.values)
Out[6]:
object 13
int64 8
dtype: int64
In [7]:
df.isnull().sum()
Out[7]:
checking_account_status 0
duration 0
credit_history 0
purpose 0
credit_amount 0
savings 0
present_employment 0
installment_rate 0
personal 0
other_debtors 0
present_residence 0
property 0
age 0
other_installment_plans 0
housing 0
existing_credits 0
job 0
dependents 0
telephone 0
foreign_worker 0
customer_type 0
dtype: int64
Value counts of the different fields
In [8]:
columns = df.columns # column names
columns
Out[8]:
Index(['checking_account_status', 'duration', 'credit_history', 'purpose',
'credit_amount', 'savings', 'present_employment', 'installment_rate',
'personal', 'other_debtors', 'present_residence', 'property', 'age',
'other_installment_plans', 'housing', 'existing_credits', 'job',
'dependents', 'telephone', 'foreign_worker', 'customer_type'],
dtype='object')
1. Value counts of the string-type (categorical) fields:
string_columns = df.select_dtypes(include="object").columns
# two basic parameters: number of rows and columns
fig = make_subplots(rows=3, cols=5)
for i, v in enumerate(string_columns):
    r = i // 5 + 1
    c = (i+1) % 5
    data = df[v].value_counts().reset_index()  # columns "index" and v (older pandas naming)
    if c == 0:
        fig.add_trace(go.Bar(x=data["index"], y=data[v],
                             text=data[v], name=v),
                      row=r, col=5)
    else:
        fig.add_trace(go.Bar(x=data["index"], y=data[v],
                             text=data[v], name=v),
                      row=r, col=c)
fig.update_layout(width=1000, height=900)
fig.show()
2. Distribution of the numeric fields:
number_columns = df.select_dtypes(exclude="object").columns.tolist()
number_columns
# two basic parameters: number of rows and columns
fig = make_subplots(rows=2, cols=4)  # 2 rows, 4 columns
for i, v in enumerate(number_columns):  # number_columns has length 8
    r = i // 4 + 1
    c = (i+1) % 4
    if c == 0:
        fig.add_trace(go.Box(y=df[v].tolist(), name=v),
                      row=r, col=4)
    else:
        fig.add_trace(go.Box(y=df[v].tolist(), name=v),
                      row=r, col=c)
fig.update_layout(width=1000, height=900)
fig.show()
Field processing
Checking account status - checking_account_status
Meaning: status of the existing checking account
- A11: < 0 DM
- A12: 0 <= x < 200 DM
- A13: >= 200 DM / salary assignments for at least one year
- A14: no checking account
In [11]:
df["checking_account_status"].value_counts()
Out[11]:
A14 394
A11 274
A12 269
A13 63
Name: checking_account_status, dtype: int64
In [12]:
fig,ax = plt.subplots(figsize=(12,8), dpi=80)
sns.countplot(x="checking_account_status", data=df)
plt.title("number of checking_account_status")
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height()+5), color='black', size=20)
plt.show()
Here we hard-code (ordinally encode) each applicant's checking account status according to the size of the account balance:
In [13]:
# A11: < 0 DM, A12: 0 <= x < 200 DM, A13: >= 200 DM / salary assignments for at least one year, A14: no checking account
# Encoding 1
cas = {"A11": 1,"A12":2, "A13":3, "A14":0}
df["checking_account_status"] = df["checking_account_status"].map(cas)
Loan duration - duration
Meaning: duration in months
In [14]:
duration = df["duration"].value_counts()
duration.head()
Out[14]:
24 184
12 179
18 113
36 83
6 75
Name: duration, dtype: int64
In [15]:
fig = px.violin(df,y="duration")
fig.show()
Credit history - credit_history
Meaning:
- A30: no credits taken / all credits paid back duly
- A31: all credits at this bank paid back duly
- A32: existing credits paid back duly till now
- A33: delay in paying off in the past
- A34: critical account / other credits existing (not at this bank)
In [17]:
ch = df["credit_history"].value_counts().reset_index()
ch
Out[17]:
|   | index | credit_history |
|---|-------|----------------|
| 0 | A32   | 530            |
| 1 | A34   | 293            |
| 2 | A33   | 88             |
| 3 | A31   | 49             |
| 4 | A30   | 40             |
In [18]:
fig = px.pie(ch,names="index",values="credit_history")
fig.update_traces(
textposition='inside',
textinfo='percent+label'
)
fig.show()
# Encoding 2: one-hot encoding
df_credit_history = pd.get_dummies(df["credit_history"])
df = df.join(df_credit_history)
df.drop("credit_history", inplace=True, axis=1)
Loan purpose - purpose
Meaning: purpose of the loan
In [20]:
# count applicants per purpose and hard-code based on the counts
purpose = df["purpose"].value_counts().sort_values(ascending=True).reset_index()
purpose.columns = ["purpose", "number"]
purpose
# Encoding 3: map each purpose to its rank by ascending frequency
df["purpose"] = df["purpose"].map(dict(zip(purpose.purpose,purpose.index)))
Credit amount - credit_amount
Represents the amount of the credit
In [22]:
px.violin(df["credit_amount"])
Savings account - savings
Savings account/bonds (A61: < 100 DM, A62: 100 <= x < 500 DM, A63: 500 <= x < 1000 DM, A64: >= 1000 DM, A65: unknown / no savings account)
In [24]:
string_columns
Out[24]:
Index(['checking_account_status', 'credit_history', 'purpose', 'savings',
'present_employment', 'personal', 'other_debtors', 'property',
'other_installment_plans', 'housing', 'job', 'telephone',
'foreign_worker'],
dtype='object')
In [25]:
df["savings"].value_counts()
Out[25]:
A61 603
A65 183
A62 103
A63 63
A64 48
Name: savings, dtype: int64
In [26]:
# Encoding 6: hard (ordinal) encoding
savings = {"A61":1,"A62":2, "A63":3, "A64":4,"A65":0}
df["savings"] = df["savings"].map(savings)
Present employment - present_employment
- A71: unemployed
- A72: < 1 year
- A73: 1 <= x < 4 years
- A74: 4 <= x < 7 years
- A75: >= 7 years
In [28]:
df["present_employment"].value_counts()
Out[28]:
A73 339
A75 253
A74 174
A72 172
A71 62
Name: present_employment, dtype: int64
In [29]:
# Encoding 7: one-hot encoding
df_present_employment = pd.get_dummies(df["present_employment"])
In [30]:
df = df.join(df_present_employment)
df.drop("present_employment", inplace=True, axis=1)
Personal status and sex - personal
Marital status and sex (A91: male, divorced/separated; A92: female, divorced/separated/married; A93: male, single; A94: male, married/widowed; A95: female, single)
In [31]:
# Encoding 8: one-hot encoding
df_personal = pd.get_dummies(df["personal"])
df = df.join(df_personal)
df.drop("personal", inplace=True, axis=1)
Other debtors/guarantors - other_debtors
A101: none, A102: co-applicant, A103: guarantor
In [32]:
# Encoding 9: one-hot encoding
df_other_debtors = pd.get_dummies(df["other_debtors"])
df = df.join(df_other_debtors)
df.drop("other_debtors", inplace=True, axis=1)
Property - property
In [33]:
# Encoding 10: one-hot encoding
df_property = pd.get_dummies(df["property"])
df = df.join(df_property)
df.drop("property", inplace=True, axis=1)
Housing - housing
A151: rent, A152: own, A153: for free
In [34]:
# Encoding 11: one-hot encoding
df_housing = pd.get_dummies(df["housing"])
df = df.join(df_housing)
df.drop("housing", inplace=True, axis=1)
Other installment plans - other_installment_plans
A141: bank, A142: stores, A143: none
In [35]:
fig,ax = plt.subplots(figsize=(12,8), dpi=80)
sns.countplot(x="other_installment_plans", data=df)
plt.title("number of other_installment_plans")
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height()+5), color='black', size=20)
plt.show()
# Encoding 12: one-hot encoding
df_other_installment_plans = pd.get_dummies(df["other_installment_plans"])
df = df.join(df_other_installment_plans)
df.drop("other_installment_plans", inplace=True, axis=1)
Job - job
- A171: unskilled, non-resident
- A172: unskilled, resident
- A173: skilled employee / official
- A174: management / self-employed / highly qualified employee / officer
In [37]:
fig,ax = plt.subplots(figsize=(12,8), dpi=80)
sns.countplot(x="job", data=df)
plt.title("number of job")
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height()+5), color='black', size=20)
plt.show()
# Encoding 13: one-hot encoding
df_job = pd.get_dummies(df["job"])
df = df.join(df_job)
df.drop("job", inplace=True, axis=1)
Telephone - telephone
A191: none, A192: yes, registered under the customer's name
In [39]:
# Encoding 14: one-hot encoding
df_telephone = pd.get_dummies(df["telephone"])
df = df.join(df_telephone)
df.drop("telephone", inplace=True, axis=1)
Foreign worker - foreign_worker
A201: yes, A202: no
In [40]:
# Encoding 15: one-hot encoding
df_foreign_worker = pd.get_dummies(df["foreign_worker"])
df = df.join(df_foreign_worker)
df.drop("foreign_worker", inplace=True, axis=1)
Counts of the two customer types - customer_type
Prediction classes: 1 = good, 2 = bad
In [41]:
fig,ax = plt.subplots(figsize=(12,8), dpi=80)
sns.countplot(x="customer_type", data=df)
plt.title("number of customer_type")
for p in ax.patches:
    ax.annotate(f'\n{p.get_height()}', (p.get_x(), p.get_height()+5), color='black', size=20)
plt.show()
Shuffle the data
In [42]:
from sklearn.utils import shuffle
# randomly shuffle the data
df = shuffle(df).reset_index(drop=True)
Modeling
Data splitting
In [44]:
# features
X = df.drop("customer_type",axis=1)
# target variable
y = df['customer_type']
from sklearn.model_selection import train_test_split
In [45]:
# 2:8 split (20% training, 80% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)
Data standardization
In [46]:
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
In [47]:
y_train
Out[47]:
556 1
957 1
577 2
795 2
85 1
..
106 1
270 2
860 1
435 1
102 2
Name: customer_type, Length: 200, dtype: int64
In [48]:
# mean and standard deviation of the training set
mean_ = ss.mean_ # mean
var_ = np.sqrt(ss.var_) # standard deviation
Apply the training-set mean and standard deviation to the test set:
In [50]:
# standardized feature data of the test set
X_test = (X_test - mean_) / var_
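For reference, the manual computation above is equivalent to calling the fitted scaler's transform method (a minimal sketch; note that transform returns a NumPy array rather than a DataFrame):
# equivalent to (X_test - mean_) / var_
X_test_scaled = ss.transform(X_test)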
Model 1: Decision tree
In [51]:
dt = DecisionTreeClassifier(max_depth=5)
dt.fit(X_train, y_train)
Out[51]:
DecisionTreeClassifier(max_depth=5)
In [52]:
# predict
y_pred = dt.predict(X_test)
y_pred[:5]
Out[52]:
array([2, 1, 1, 2, 1])
In [53]:
# confusion matrix
confusion_mat = metrics.confusion_matrix(y_test,y_pred)
confusion_mat
Out[53]:
array([[450, 118],
[137, 95]])
In [54]:
# visualize the confusion matrix
classes = ["good", "bad"]
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=classes)
disp.plot(
    include_values=True, # show the count in each cell
    cmap="GnBu", # any colormap recognized by matplotlib
    ax=None,
    xticks_rotation="horizontal",
    values_format="d"
)
plt.show()
## auc-roc
auc_roc = metrics.roc_auc_score(y_test, y_pred) # true and predicted labels
auc_roc
0.5008681398737251
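Note that roc_auc_score is given hard class labels here; AUC is usually computed from predicted scores or probabilities, which is generally more informative. A minimal sketch with the fitted tree:
# probability of the positive class (label 2 = "bad"); column order follows dt.classes_
y_score = dt.predict_proba(X_test)[:, 1]
metrics.roc_auc_score(y_test, y_score)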
Model 2: Random forest
In [56]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
Out[56]:
RandomForestClassifier()
In [57]:
# predict
y_pred = rf.predict(X_test)
y_pred[:5]
Out[57]:
array([1, 1, 1, 2, 1])
In [58]:
# confusion matrix
confusion_mat = metrics.confusion_matrix(y_test,y_pred)
confusion_mat
Out[58]:
array([[476, 92],
[142, 90]])
In [59]:
# visualize the confusion matrix
classes = ["good", "bad"]
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=classes)
disp.plot(
    include_values=True, # show the count in each cell
    cmap="GnBu", # any colormap recognized by matplotlib
    ax=None,
    xticks_rotation="horizontal",
    values_format="d"
)
plt.show()
## auc-roc
auc_roc = metrics.roc_auc_score(y_test, y_pred) # true and predicted labels
auc_roc
0.6129796017484215
Model 3: XGBoost
In [62]:
from xgboost.sklearn import XGBClassifier
## define the XGBoost model
clf = XGBClassifier()
# X_train = X_train.values
# X_test = X_test.values
In [63]:
clf.fit(X_train, y_train)
Out[63]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
In [65]:
# convert to a NumPy array before passing it in
X_test = X_test.values
y_pred = clf.predict(X_test)
y_pred[:5]
Out[65]:
array([1, 1, 1, 2, 1])
In [66]:
# confusion matrix
confusion_mat = metrics.confusion_matrix(y_test,y_pred)
confusion_mat
Out[66]:
array([[445, 123],
[115, 117]])
In [67]:
# visualize the confusion matrix
classes = ["good", "bad"]
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_mat, display_labels=classes)
disp.plot(
    include_values=True, # show the count in each cell
    cmap="GnBu", # any colormap recognized by matplotlib
    ax=None,
    xticks_rotation="horizontal",
    values_format="d"
)
plt.show()
## auc-roc
auc_roc = metrics.roc_auc_score(y_test, y_pred) # true and predicted labels
auc_roc
0.6438805245264692
Model optimization
Feature selection based on correlation coefficients
# y: customer_type is the target variable
# 1. compute the correlation coefficient between each feature and the target
data = pd.concat([X,y],axis=1)
corr = data.corr()
corr[:5]
Descriptive statistics of the correlation coefficients: overall the correlations (in absolute value) are fairly small.
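The call behind this observation is not shown in the cells above; a minimal sketch of the check, using the corr DataFrame computed above:
corr["customer_type"].drop("customer_type").abs().describe()  # absolute correlations with the target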
Heatmap
fig, ax = plt.subplots(figsize=(20,16))
ax = sns.heatmap(corr,
                 vmax=0.8,
                 square=True,
                 annot=True, # annotate each cell with its value
                 cmap="YlGnBu")
Select the top 20 variables by correlation with the target
k = 20
cols = corr.nlargest(k,"customer_type")["customer_type"].index
cols
Index(['customer_type', 'duration', 'checking_account_status', 'credit_amount',
'A30', 'A31', 'A124', 'A72', 'A141', 'A151', 'A201', 'A153', 'A92',
'installment_rate', 'A102', 'A142', 'A91', 'A32', 'A174', 'A71'],
dtype='object')
cm = np.corrcoef(data[cols].values.T)  # correlation matrix (not used below)
fig, hm = plt.subplots(figsize=(10,10)) # adjust the figure size
hm = sns.heatmap(data[cols].corr(), # correlations of the 20 selected fields
                 annot=True,
                 square=True)
plt.show()
Select variables whose absolute correlation with the target exceeds 0.1
threshold = 0.1
corrmat = data.corr()
top_corr_features = corrmat.index[abs(corrmat["customer_type"]) > threshold]
plt.figure(figsize=(10,10))
g = sns.heatmap(data[top_corr_features].corr(), # correlation matrix of the features above the 0.1 threshold
annot=True,
square=True,
cmap="nipy_spectral_r"
)
Modeling on the new data
# keep only the selected (True) features
useful_col = corrmat.index[abs(corrmat["customer_type"]) > threshold].tolist()
new_df = df[useful_col]
new_df.head()
Data splitting
# features
X = new_df.drop("customer_type",axis=1)
# target variable
y = new_df['customer_type']
# 3:7 split (30% training, 70% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3, random_state=42)
Standardization
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
# mean and standard deviation of the training set
mean_ = ss.mean_ # mean
var_ = np.sqrt(ss.var_) # standard deviation
# standardized feature data of the test set
X_test = (X_test - mean_) / var_
Modeling
from xgboost.sklearn import XGBClassifier
## define the XGBoost model
clf = XGBClassifier()
clf.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
In [80]:
# convert to a NumPy array before passing it in
X_test = X_test.values
y_pred = clf.predict(X_test)
y_pred[:5]
Out[80]:
array([2, 1, 2, 2, 1])
In [81]:
# confusion matrix
confusion_mat = metrics.confusion_matrix(y_test,y_pred)
confusion_mat
Out[81]:
array([[406, 94],
[ 96, 104]])
In [82]:
## auc-roc
auc_roc = metrics.roc_auc_score(y_test, y_pred) # true and predicted labels
auc_roc
Out[82]:
0.666
Directions for optimization
After building models with 3 different tree algorithms, we find that the AUC values are not very high. AUC is a probability value: the larger the AUC, the better the classifier. Directions worth considering:
- Feature engineering: the main area to optimize. The raw feature variables currently use different encoding schemes (one-hot encoding and hard/ordinal encoding); the encoding of some fields could be improved.
- Variable selection: the correlation coefficient measures the degree of linear correlation between two continuous variables, and the relationship between the features and the target variable is not necessarily linear. The uniformly low correlations observed in this article seem to support that. Other variable-selection methods should be considered for future modeling.
- Model tuning: tune the hyperparameters of individual models via grid search, or strengthen the overall result through model ensembling (see the sketch after this list).
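As an illustration of the model-tuning direction only (a sketch; the parameter grid below is an assumption, not a tuned recommendation):
from sklearn.model_selection import GridSearchCV
# hypothetical search space for the random forest
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring="roc_auc", cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)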
Getting the dataset
Follow the WeChat official account 【尤而小屋】 and reply "德国" (Germany) to receive the dataset used in this article.