Data download (data.csv): https://pan.baidu.com/s/1G1b2QJjYkkk7LDfGorbj5Q
Goal: the dataset is (non-anonymized) financial data, and the task is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.
Task: perform feature selection with IV values and with a random forest, then build models (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost, and LightGBM) and evaluate them.
I. IV Value
1. Overview
IV (Information Value) measures a variable's predictive power: the larger a feature's IV, the stronger its influence on the prediction target.
Applicability: supervised models with a binary target.
Common rule-of-thumb interpretations of IV (note that IV is always non-negative, so the first interval is effectively [0, 0.02]):
- IV in [0, 0.02]: no predictive power
- IV in (0.02, 0.1]: weak predictor
- IV in (0.1, +∞): acceptable predictor; in practice, only variables with IV greater than 0.1 are kept.
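As a quick illustration of these thresholds, here is a small helper (hypothetical, not part of the original code) that maps an IV value to the rule-of-thumb label above:

```python
def iv_strength(iv: float) -> str:
    """Map an IV value to the rule-of-thumb label (thresholds from the text above)."""
    if iv <= 0.02:
        return "no predictive power"
    elif iv <= 0.1:
        return "weak predictor"
    else:
        return "useful predictor"

print(iv_strength(0.15))  # -> useful predictor
```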
2. Computing IV
WOE is the building block of the IV computation.
(1) WOE
WOE (Weight of Evidence) is an encoding of the original independent variable.
- First, bin the feature (also called discretization or grouping).
- Then, for the i-th bin, compute its WOE:

  WOE_i = ln(p_yi / p_ni) = ln( (y_i / y_T) / (n_i / n_T) )

where p_yi = y_i / y_T is the proportion of responding customers in this bin (in a risk model, the defaulters) among all responding customers in the sample, and p_ni = n_i / n_T is the proportion of non-responding customers in this bin among all non-responding customers. Here y_i is the number of responders in the bin, n_i the number of non-responders in the bin, y_T the total number of responders in the sample, and n_T the total number of non-responders.
==> WOE captures the difference between "the share of all responders that fall in this bin" and "the share of all non-responders that fall in this bin".
- Equivalent form: WOE_i = ln( (y_i / n_i) / (y_T / n_T) )
==> i.e. the odds of responding within this bin, compared with the overall odds in the sample.
==> The larger WOE is, the larger this difference and the more likely samples in this bin are to respond; the smaller WOE is, the smaller the difference and the less likely samples in this bin are to respond.
(2) Computing IV
The IV contributed by the i-th bin is IV_i = (p_yi - p_ni) * WOE_i, and the feature's IV is the sum over all its bins:

  IV = Σ_{i=1}^{n} (p_yi - p_ni) * ln(p_yi / p_ni)

where n is the number of bins of the feature. Each term is non-negative, so IV >= 0.
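The WOE/IV formulas can be checked numerically on a tiny hypothetical dataset (column names and values invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one pre-binned feature and a binary target (1 = overdue).
toy = pd.DataFrame({
    "age_bin": ["<30", "<30", "30-50", "30-50", "30-50", ">50", ">50", ">50"],
    "status":  [1, 0, 1, 0, 0, 0, 0, 1],
})

# Per-bin responder ("bad") and total counts.
grouped = toy.groupby("age_bin")["status"].agg(bad="sum", total="count")
grouped["good"] = grouped["total"] - grouped["bad"]

bad_rate = grouped["bad"] / grouped["bad"].sum()     # p_yi per bin
good_rate = grouped["good"] / grouped["good"].sum()  # p_ni per bin

woe = np.log(bad_rate / good_rate)
iv = ((bad_rate - good_rate) * woe).sum()
print(round(iv, 4))  # -> 0.0924
```

With IV around 0.09, this toy feature would land in the "weak predictor" band of the table above.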
II. Implementation
0. Required modules
import pandas as pd
from pandas import DataFrame as df
from numpy import log
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.metrics import roc_auc_score, recall_score, roc_curve, auc
import matplotlib.pyplot as plt
1. IV value
def calcWOE(dataset, col, target):
    ## group counts per distinct value of the feature
    subdata = df(dataset.groupby(col)[col].count())
    ## number of responding (overdue) customers in each group
    suby = df(dataset.groupby(col)[target].sum())
    ## join subdata and suby on the group index
    data = df(pd.merge(subdata, suby, how='left', left_index=True, right_index=True))
    ## totals: all samples (total), responders (b_total), non-responders (g_total)
    b_total = data[target].sum()
    total = data[col].sum()
    g_total = total - b_total
    ## WOE formula: per-group responder/non-responder rates, then the log of their ratio
    data["bad"] = data.apply(lambda x: x[target] / b_total, axis=1)
    data["good"] = data.apply(lambda x: (x[col] - x[target]) / g_total, axis=1)
    data["WOE"] = data.apply(lambda x: log(x.bad / x.good), axis=1)
    return data.loc[:, ["bad", "good", "WOE"]]

def calcIV(dataset):
    ## IV is the sum of (bad_rate - good_rate) * WOE over all groups
    dataset["IV"] = dataset.apply(lambda x: (x["bad"] - x["good"]) * x["WOE"], axis=1)
    return sum(dataset["IV"])
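Note that calcWOE groups by each distinct value, so a continuous feature effectively gets one bin per unique value, which often produces empty or one-sided groups (the nan/inf rows dropped below). A common remedy, sketched here under the assumption of quantile binning, is to discretize the column with pd.qcut first and run the WOE computation on the binned column instead:

```python
import numpy as np
import pandas as pd

# Hypothetical continuous column, binned into 5 quantile buckets.
s = pd.Series(np.arange(100, dtype=float), name="repayment_capability")
binned = pd.qcut(s, q=5, duplicates="drop")

# Each bucket holds ~20% of the samples; this binned column could then
# replace the raw one before calling calcWOE on it.
print(binned.value_counts().sort_index())
```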
file_name = '1.csv'
data = pd.read_csv(file_name, encoding='gbk')
X = data.drop(labels="status", axis=1)
print(X.shape)
y = data["status"]
col_list = [col for col in data.drop(labels=['Unnamed: 0', 'status'], axis=1)]
data_IV = df()
fea_iv = []
for col in col_list:
    col_WOE = calcWOE(data, col, "status")
    ## drop rows containing nan, inf or -inf (empty or one-sided groups)
    col_WOE = col_WOE[~col_WOE.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
    col_IV = calcIV(col_WOE)
    ## keep only features with IV > 0.1
    if col_IV > 0.1:
        data_IV[col] = [col_IV]
        fea_iv.append(col)
data_IV.to_csv('data_IV.csv', index=False)
print(fea_iv)
Output:
['trans_amount_increase_rate_lately', 'trans_activity_day', 'repayment_capability', 'first_transaction_time', 'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month', 'abs', 'avg_price_last_12_month', 'trans_fail_top_count_enum_last_1_month', 'trans_fail_top_count_enum_last_6_month', 'trans_fail_top_count_enum_last_12_month', 'max_cumulative_consume_later_1_month', 'pawns_auctions_trusts_consume_last_1_month', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_day', 'trans_day_last_12_month', 'apply_score', 'loans_score', 'loans_count', 'loans_overdue_count', 'history_suc_fee', 'history_fail_fee', 'latest_one_month_suc', 'latest_one_month_fail', 'loans_avg_limit', 'consfin_credit_limit', 'consfin_max_limit', 'consfin_avg_limit', 'loans_latest_day']
2. Random Forest
rfc = RandomForestClassifier()
rfc.fit(X, y)
rfc_impc = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
fea_gini = rfc_impc[:20].index.tolist()
print(fea_gini)
Output:
['trans_fail_top_count_enum_last_1_month', 'history_fail_fee', 'loans_score', 'apply_score', 'latest_one_month_fail', 'trans_fail_top_count_enum_last_12_month', 'Unnamed: 0', 'trans_amount_3_month', 'trans_activity_day', 'max_cumulative_consume_later_1_month', 'repayment_capability', 'historical_trans_amount', 'consfin_credit_limit', 'latest_query_day', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_time', 'loans_overdue_count', 'history_suc_fee', 'trans_days_interval', 'number_of_trans_from_2011']
3. Merging the feature sets
features = list(set(fea_gini)|set(fea_iv))
X_final = X[features]
print(X_final.shape)
(4754, 35)
Analysis: filtering reduces the data from (4754, 92) to (4754, 35), removing a large amount of redundancy. Note that 'Unnamed: 0' (the CSV row index) made it into fea_gini above; it is an index artifact rather than a real feature and should arguably be dropped before modeling.
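One caveat about the union above: Python set iteration order is not stable across runs, so the column order of X_final (and anything sensitive to it) can vary. A small sketch of a reproducible variant, with toy stand-ins for the two selected-feature lists:

```python
# Toy stand-ins for the lists produced by the IV and random-forest selection.
fea_iv = ["loans_score", "apply_score", "history_fail_fee"]
fea_gini = ["apply_score", "trans_activity_day"]

# Sorting the union fixes the column order across runs.
features = sorted(set(fea_gini) | set(fea_iv))
print(features)
# -> ['apply_score', 'history_fail_fee', 'loans_score', 'trans_activity_day']
```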
4. Building the models
## split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.3, random_state=2019)
## Model 1: Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)
## Model 2: SVM
svm = SVC(kernel='linear', probability=True)
svm.fit(X_train, y_train)
## Model 3: Decision Tree
dtc = DecisionTreeClassifier(max_depth=8)
dtc.fit(X_train, y_train)
## Model 4: Random Forest
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
## Model 5: GBDT
gbdt = GradientBoostingClassifier()
gbdt.fit(X_train, y_train)
## Model 6: XGBoost
xgbc = xgb.XGBClassifier()
xgbc.fit(X_train, y_train)
## Model 7: LightGBM
lgbc = lgb.LGBMClassifier()
lgbc.fit(X_train, y_train)
5. Model evaluation
## model evaluation
def model_metrics(clf, X_train, X_test, y_train, y_test):
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    y_train_prob = clf.predict_proba(X_train)[:, 1]
    y_test_prob = clf.predict_proba(X_test)[:, 1]
    # accuracy
    print('accuracy: ', end=' ')
    print('train: ', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('test: ', '%.4f' % accuracy_score(y_test, y_test_pred))
    # precision
    print('precision:', end=' ')
    print('train: ', '%.4f' % precision_score(y_train, y_train_pred), end=' ')
    print('test: ', '%.4f' % precision_score(y_test, y_test_pred))
    # recall
    print('recall:', end=' ')
    print('train: ', '%.4f' % recall_score(y_train, y_train_pred), end=' ')
    print('test: ', '%.4f' % recall_score(y_test, y_test_pred))
    # f1-score
    print('f1-score:', end=' ')
    print('train: ', '%.4f' % f1_score(y_train, y_train_pred), end=' ')
    print('test: ', '%.4f' % f1_score(y_test, y_test_pred))
    # auc
    print('auc:', end=' ')
    print('train: ', '%.4f' % roc_auc_score(y_train, y_train_prob), end=' ')
    print('test: ', '%.4f' % roc_auc_score(y_test, y_test_prob))
    # ROC curves
    fpr_train, tpr_train, thred_train = roc_curve(y_train, y_train_prob, pos_label=1)
    fpr_test, tpr_test, thred_test = roc_curve(y_test, y_test_prob, pos_label=1)
    label = ['Train - AUC:{:.4f}'.format(auc(fpr_train, tpr_train)),
             'Test - AUC:{:.4f}'.format(auc(fpr_test, tpr_test))]
    plt.plot(fpr_train, tpr_train)
    plt.plot(fpr_test, tpr_test)
    plt.plot([0, 1], [0, 1], 'd--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(label, loc=4)
    plt.title('ROC Curve')
    plt.show()
model_metrics(lr, X_train, X_test, y_train, y_test)
model_metrics(svm, X_train, X_test, y_train, y_test)
model_metrics(dtc, X_train, X_test, y_train, y_test)
model_metrics(rfc, X_train, X_test, y_train, y_test)
model_metrics(gbdt, X_train, X_test, y_train, y_test)
model_metrics(xgbc, X_train, X_test, y_train, y_test)
model_metrics(lgbc, X_train, X_test, y_train, y_test)
Issues encountered:
- TypeError: 'list' object is not callable
  Cause: the name list had been rebound to a variable earlier in the session, shadowing the built-in. Lesson: never give an object the same name as a keyword, built-in, or imported function.
- UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
  This warning appeared during evaluation; for some models certain metrics came out as 0, and it was not resolved at the time of writing.
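The precision warning fires when a model predicts no positive (overdue) samples at all, so the denominator of precision is zero. Assuming scikit-learn >= 0.22 (where the zero_division parameter was added), the warning can be silenced explicitly; the underlying issue, a model that never predicts the positive class, usually points at class imbalance and is better addressed by, e.g., class_weight='balanced' on the classifier:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 0]  # degenerate model: never predicts the positive class

# zero_division=0 returns 0.0 silently instead of emitting UndefinedMetricWarning.
p = precision_score(y_true, y_pred, zero_division=0)
print(p)  # -> 0.0
```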
References:
https://blog.csdn.net/kevin7658/article/details/50780391/