Modeling Financial Loan Overdue, Part 6: Feature Selection

Copyright notice: this is an original article by the author; do not repost without permission. https://blog.csdn.net/u012736685/article/details/85837324

Data download (data.csv): https://pan.baidu.com/s/1G1b2QJjYkkk7LDfGorbj5Q
Goal: the dataset is financial data (not anonymized); the task is to predict whether a loan user will go overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.

Task: select features with IV values and with a random forest, then build models (logistic regression, SVM, decision tree, random forest, GBDT, XGBoost, and LightGBM) and evaluate them.

I. IV

1. Overview

IV (Information Value) measures a variable's predictive power: the larger a feature's IV, the more that feature influences the predicted outcome.

Applicable setting: supervised models with a binary target.

Common IV ranges and their interpretation:

  • IV in (-∞, 0.02]: no predictive power
  • IV in (0.02, 0.1]: weak predictive power
  • IV in (0.1, +∞): reasonable predictive power; in practice, only variables with IV greater than 0.1 are kept.


2. Computing IV

WOE is the building block for computing IV.

(1) WOE

WOE (Weight of Evidence) is an encoding of the original explanatory variable.

  • First, bin the feature (also called discretization or binning).
  • Then, for the $i$-th bin, compute $WOE_i$ as:
    $$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right)$$
    where $p_{y_i}$ is the proportion of responding customers in this bin (in a risk model, the defaulters) out of all responding customers in the sample, and $p_{n_i}$ is the proportion of non-responding customers in this bin out of all non-responding customers. $\#y_i$ and $\#n_i$ are the numbers of responders and non-responders in this bin; $\#y_T$ and $\#n_T$ are the total numbers of responders and non-responders in the sample.
    ==> $WOE_i$ captures the difference between "the share of all responders falling in this bin" and "the share of all non-responders falling in this bin".
  • An equivalent form:
    $$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right) = \ln\left(\frac{\#y_i/\#n_i}{\#y_T/\#n_T}\right)$$
    ==> $WOE_i$ compares the responder-to-non-responder odds in this bin against the same odds over the whole sample.
    ==> The larger the WOE, the larger this difference and the more likely samples in this bin are to respond; the smaller the WOE, the smaller the difference and the less likely they are to respond.

(2) Computing IV

$$IV_i = (p_{y_i} - p_{n_i}) \cdot WOE_i = (p_{y_i} - p_{n_i}) \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \left(\frac{\#y_i}{\#y_T} - \frac{\#n_i}{\#n_T}\right) \ln\left(\frac{\#y_i/\#y_T}{\#n_i/\#n_T}\right)$$
$$IV = \sum_{i=1}^{n} IV_i$$
where $n$ is the number of bins of the feature.
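
The formulas above can be checked with a small worked example. The two bins and their counts below are made up for illustration:

```python
import numpy as np

# Hypothetical binned feature:
#         responders (#y_i)   non-responders (#n_i)
# bin 1:  20                  80
# bin 2:  10                  190
bad = np.array([20, 10])
good = np.array([80, 190])

p_y = bad / bad.sum()    # #y_i / #y_T, share of all responders in each bin
p_n = good / good.sum()  # #n_i / #n_T, share of all non-responders in each bin

woe = np.log(p_y / p_n)          # WOE_i = ln(p_yi / p_ni)
iv = ((p_y - p_n) * woe).sum()   # IV = sum over bins of (p_yi - p_ni) * WOE_i
print(woe, iv)  # WOE ≈ [0.8109, -0.7472], IV ≈ 0.5771
```

With IV ≈ 0.58 > 0.1, this feature would be kept under the rule above.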

II. Implementation

0. Imports

import pandas as pd
from pandas import DataFrame as df
from numpy import log
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.metrics import roc_auc_score, recall_score, roc_curve, auc
import matplotlib.pyplot as plt

1. IV

def calcWOE(dataset, col, target):
    ## number of samples in each group (bin) of the feature
    subdata = df(dataset.groupby(col)[col].count())
    ## number of responding customers (target == 1) in each group
    suby = df(dataset.groupby(col)[target].sum())
    ## join the group counts with the responder counts on the group index
    data = df(pd.merge(subdata, suby, how='left', left_index=True, right_index=True))

    ## totals: all samples (total), responders (b_total), non-responders (g_total)
    b_total = data[target].sum()
    total = data[col].sum()
    g_total = total - b_total

    ## WOE formula
    data["bad"] = data.apply(lambda x: x[target] / b_total, axis=1)
    data["good"] = data.apply(lambda x: (x[col] - x[target]) / g_total, axis=1)
    data["WOE"] = data.apply(lambda x: log(x.bad / x.good), axis=1)
    return data.loc[:, ["bad", "good", "WOE"]]

def calcIV(dataset):
    ## IV_i = (p_bad_i - p_good_i) * WOE_i, summed over all groups
    dataset["IV"] = dataset.apply(lambda x: (x["bad"] - x["good"]) * x["WOE"], axis=1)
    return sum(dataset["IV"])
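
To see what these helpers compute without the competition CSV, here is a self-contained sketch replicating the same groupby logic on made-up toy data (the "grade" feature and its counts are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: a 3-level categorical feature and a binary target.
demo = pd.DataFrame({
    "grade":  ["A"] * 50 + ["B"] * 30 + ["C"] * 20,
    "status": [1] * 5 + [0] * 45 + [1] * 9 + [0] * 21 + [1] * 10 + [0] * 10,
})

grp = demo.groupby("grade")["status"].agg(total="count", bad="sum")
grp["good"] = grp["total"] - grp["bad"]
grp["p_bad"] = grp["bad"] / grp["bad"].sum()     # responders in bin / all responders
grp["p_good"] = grp["good"] / grp["good"].sum()  # non-responders in bin / all non-responders
grp["WOE"] = np.log(grp["p_bad"] / grp["p_good"])
iv = ((grp["p_bad"] - grp["p_good"]) * grp["WOE"]).sum()
print(grp[["bad", "good", "WOE"]].round(4))
print("IV =", round(iv, 4))  # IV ≈ 0.76, well above the 0.1 threshold
```

Bin C (50% responders) gets a large positive WOE, bin A (10% responders) a negative one, matching the interpretation in section I.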

file_name = '1.csv'
data = pd.read_csv(file_name, encoding='gbk')
X = data.drop(labels="status", axis=1)
print(X.shape)
y = data["status"]
col_list = [col for col in data.drop(labels=['Unnamed: 0', 'status'], axis=1)]
data_IV = df()
fea_iv = []

for col in col_list:
    col_WOE = calcWOE(data, col, "status")
    ## drop groups whose WOE is nan/inf/-inf (bins with zero good or bad counts)
    col_WOE = col_WOE[~col_WOE.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
    col_IV = calcIV(col_WOE)
    if col_IV > 0.1:
        data_IV[col] = [col_IV]
        fea_iv.append(col)

data_IV.to_csv('data_IV.csv', index=False)
print(fea_iv)

Output:

['trans_amount_increase_rate_lately', 'trans_activity_day', 'repayment_capability', 'first_transaction_time', 'historical_trans_day', 'rank_trad_1_month', 'trans_amount_3_month', 'abs', 'avg_price_last_12_month', 'trans_fail_top_count_enum_last_1_month', 'trans_fail_top_count_enum_last_6_month', 'trans_fail_top_count_enum_last_12_month', 'max_cumulative_consume_later_1_month', 'pawns_auctions_trusts_consume_last_1_month', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_day', 'trans_day_last_12_month', 'apply_score', 'loans_score', 'loans_count', 'loans_overdue_count', 'history_suc_fee', 'history_fail_fee', 'latest_one_month_suc', 'latest_one_month_fail', 'loans_avg_limit', 'consfin_credit_limit', 'consfin_max_limit', 'consfin_avg_limit', 'loans_latest_day']

2. Random Forest

rfc = RandomForestClassifier()
rfc.fit(X, y)
rfc_impc = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
fea_gini = rfc_impc[:20].index.tolist()
print(fea_gini)

Output:

['trans_fail_top_count_enum_last_1_month', 'history_fail_fee', 'loans_score', 'apply_score', 'latest_one_month_fail', 'trans_fail_top_count_enum_last_12_month', 'Unnamed: 0', 'trans_amount_3_month', 'trans_activity_day', 'max_cumulative_consume_later_1_month', 'repayment_capability', 'historical_trans_amount', 'consfin_credit_limit', 'latest_query_day', 'pawns_auctions_trusts_consume_last_6_month', 'first_transaction_time', 'loans_overdue_count', 'history_suc_fee', 'trans_days_interval', 'number_of_trans_from_2011']
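
Two caveats about this ranking: 'Unnamed: 0' is just the CSV's row-index column, which was never dropped from X, so it slips into the top 20 despite carrying no real signal; and the impurity-based feature_importances_ vary from run to run because RandomForestClassifier() was left without a random_state. A minimal sketch on synthetic data (X_demo and y_demo are made up stand-ins, not the real frame) showing that fixing the seed makes the ranking reproducible:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X/y: feature 0 drives the label, features 2-4 are noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 5))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=500) > 0).astype(int)

# With a fixed random_state, the Gini importances are identical across fits.
imp1 = RandomForestClassifier(n_estimators=200, random_state=2019).fit(X_demo, y_demo).feature_importances_
imp2 = RandomForestClassifier(n_estimators=200, random_state=2019).fit(X_demo, y_demo).feature_importances_
print(np.allclose(imp1, imp2))  # True: the ranking is now stable
```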

3. Merging the Feature Sets

features = list(set(fea_gini)|set(fea_iv))
X_final = X[features]
print(X_final.shape)

(4754, 35)

Analysis: selection shrinks the data from (4754, 92) down to (4754, 35) features, removing a large amount of redundancy.

4. Model Building

## split the dataset (70% train / 30% test)
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.3, random_state=2019)

## Model 1: Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)

## Model 2: SVM
svm = SVC(kernel='linear', probability=True)
svm.fit(X_train, y_train)

## Model 3: Decision Tree
dtc = DecisionTreeClassifier(max_depth=8)
dtc.fit(X_train, y_train)

## Model 4: Random Forest
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

## Model 5: GBDT
gbdt = GradientBoostingClassifier()
gbdt.fit(X_train, y_train)

## Model 6: XGBoost
xgbc = xgb.XGBClassifier()
xgbc.fit(X_train, y_train)

## Model 7: LightGBM
lgbc = lgb.LGBMClassifier()
lgbc.fit(X_train, y_train)

5. Model Evaluation

## evaluation helper: prints train/test metrics and plots the ROC curves
def model_metrics(clf, X_train, X_test, y_train, y_test):
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)

    y_train_prob = clf.predict_proba(X_train)[:, 1]
    y_test_prob = clf.predict_proba(X_test)[:, 1]

    # accuracy
    print('Accuracy:', end=' ')
    print('Train: ', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('Test: ', '%.4f' % accuracy_score(y_test, y_test_pred))

    # precision
    print('Precision:', end=' ')
    print('Train: ', '%.4f' % precision_score(y_train, y_train_pred), end=' ')
    print('Test: ', '%.4f' % precision_score(y_test, y_test_pred))

    # recall
    print('Recall:', end=' ')
    print('Train: ', '%.4f' % recall_score(y_train, y_train_pred), end=' ')
    print('Test: ', '%.4f' % recall_score(y_test, y_test_pred))

    # F1-score
    print('F1-score:', end=' ')
    print('Train: ', '%.4f' % f1_score(y_train, y_train_pred), end=' ')
    print('Test: ', '%.4f' % f1_score(y_test, y_test_pred))

    # AUC
    print('AUC:', end=' ')
    print('Train: ', '%.4f' % roc_auc_score(y_train, y_train_prob), end=' ')
    print('Test: ', '%.4f' % roc_auc_score(y_test, y_test_prob))

    # ROC curve
    fpr_train, tpr_train, thred_train = roc_curve(y_train, y_train_prob, pos_label=1)
    fpr_test, tpr_test, thred_test = roc_curve(y_test, y_test_prob, pos_label=1)

    label = ['Train - AUC:{:.4f}'.format(auc(fpr_train, tpr_train)),
             'Test - AUC:{:.4f}'.format(auc(fpr_test, tpr_test))]
    plt.plot(fpr_train, tpr_train)
    plt.plot(fpr_test, tpr_test)
    plt.plot([0, 1], [0, 1], 'd--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(label, loc=4)
    plt.title('ROC Curve')

model_metrics(lr, X_train, X_test, y_train, y_test)
model_metrics(svm, X_train, X_test, y_train, y_test)
model_metrics(dtc, X_train, X_test, y_train, y_test)
model_metrics(rfc, X_train, X_test, y_train, y_test)
model_metrics(gbdt, X_train, X_test, y_train, y_test)
model_metrics(xgbc, X_train, X_test, y_train, y_test)
model_metrics(lgbc, X_train, X_test, y_train, y_test)

Problems encountered:

  1. TypeError: 'list' object is not callable at the list(set(...)) call.
     Cause: a variable named list had been defined earlier, shadowing the built-in. Tip: never give an object the same name as a keyword, built-in, or imported function.

  2. UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. This warning appears at evaluation time, and some metrics come out as 0 for certain models; not yet resolved here.
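
For issue 2, a likely cause (an assumption, not verified against the original run) is that a model predicts no positive samples at all on this imbalanced target, making precision's denominator zero. The sketch below reproduces the warning condition with made-up arrays and shows the zero_division parameter (available in scikit-learn >= 0.22) that declares the fallback explicitly:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]  # classifier never predicts the positive class

# Default: emits UndefinedMetricWarning and silently returns 0.0
p_default = precision_score(y_true, y_pred)

# Declaring the fallback explicitly suppresses the warning
p_explicit = precision_score(y_true, y_pred, zero_division=0)
print(p_default, p_explicit)  # 0.0 0.0
```

Silencing the warning does not fix the underlying behavior; addressing the root cause usually means handling the class imbalance, e.g. class_weight='balanced' on LogisticRegression or SVC.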

Reference:
https://blog.csdn.net/kevin7658/article/details/50780391/
