ML - Loan User Overdue Analysis 5 - Feature Engineering 2 (Feature Selection)

Copyright notice: this is an original post by the author and may not be reproduced without permission. https://blog.csdn.net/a786150017/article/details/84573202

Feature selection (predicting whether a loan user will become overdue)

Given the financial data, predict whether a loan user will become overdue.
(status is the label: 0 means not overdue, 1 means overdue.)

Task 8 (Feature Engineering 2 - Feature Selection) - Select features with IV values and with a random forest respectively, then build models and evaluate them.

1. Feature selection with IV values

1.1 Overview

In binary classification problems, the IV (Information Value) is mainly used to encode input variables and assess their predictive power.

IV takes values in [0, +∞); the larger the value, the stronger the variable's predictive power. A common rule of thumb:

IV value       Predictive power
< 0.02         useless for prediction
0.02 – 0.1     weak predictor
0.1 – 0.3      medium predictor
0.3 – 0.5      strong predictor
> 0.5          suspicious

Variables with medium or strong predictive power are generally chosen for model development; some practitioners even advocate using only variables with medium IV values.

1.2 Formulas

1) WOE

WOE (weight of evidence) is a way of encoding the original variable.

To WOE-encode a variable, the variable must first be grouped (binned / discretized). Common discretization methods are equal-width binning, equal-frequency binning, and decision-tree-based binning, as illustrated in the sketch below.
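As an illustration only (toy values, not the project data), equal-width and equal-frequency binning map directly onto pandas' pd.cut and pd.qcut:

import pandas as pd

x = pd.Series([1, 3, 5, 7, 9, 11, 30, 50])
equal_width = pd.cut(x, bins=4, labels=False)   # equal-width: every bin spans the same value range
equal_freq = pd.qcut(x, q=4, labels=False)      # equal-frequency ("equal-height"): every bin holds roughly the same number of samples
print(equal_width.tolist())
print(equal_freq.tolist())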
After grouping, the WOE of group i is defined as:

$$WOE_i = \ln\frac{p_{y_1}}{p_{y_0}} = \ln\frac{\#B_i/\#B_T}{\#G_i/\#G_T}$$

where #B_i and #G_i are the numbers of bad (event) and good (non-event) samples in group i, and #B_T and #G_T are the corresponding totals over the whole sample.

It measures the difference between "responders in the current group / all responders" and "non-responders in the current group / all non-responders".
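As a toy numerical example: if a bin contains 20 of the 100 bad samples overall and 50 of the 400 good samples overall, then

$$WOE_i = \ln\frac{20/100}{50/400} = \ln 1.6 \approx 0.47$$

i.e. bad samples are over-represented in this bin relative to good ones.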

2) IV

IV is computed on top of WOE; it is essentially a weighted sum of the WOE values.

Suppose the variable is split into n groups. For group i:

$$IV_i = \left(\frac{\#B_i}{\#B_T} - \frac{\#G_i}{\#G_T}\right)\ln\frac{\#B_i/\#B_T}{\#G_i/\#G_T}$$
Having computed the IV of each group, the IV of the whole variable is:

$$IV = \sum_{i=1}^{n} IV_i$$

IV is mainly used for feature selection: to rank variables by predictive power, sort them by IV from high to low.

Compared with WOE, IV multiplies in an extra factor, which:
1) guarantees that IV is never negative;
2) properly accounts for each group's share of the overall sample (the smaller that share, the less the group contributes to the variable's overall predictive power).
A toy example is worked out below.
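Continuing the toy example above, the bin's contribution to IV is

$$IV_i = \left(\frac{20}{100} - \frac{50}{400}\right)\ln\frac{20/100}{50/400} = 0.075 \times 0.47 \approx 0.035$$

and summing such terms over all bins of the variable gives its total IV, which is then compared against the table in Section 1.1.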

2. Feature selection with random forests

Random forests offer two ways to rank features: mean decrease impurity and mean decrease accuracy.

2.1 Mean decrease impurity

Impurity is what a tree uses to pick the best split condition at each node. For classification, Gini impurity or information gain is commonly used; for regression, variance reduction or least-squares fit.
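For reference, the Gini impurity of a node whose class proportions are $p_k$ is the standard

$$Gini = 1 - \sum_k p_k^2$$

and a split's quality is measured by how much it reduces this quantity.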

When a decision tree is trained, we can record how much each feature reduces the tree's impurity. For a forest of trees, we can average this reduction over all trees and use the mean decrease in impurity as the feature's importance score.
Drawbacks:
1) The method is biased toward variables with more categories (more possible split points);
2) When the label has several correlated features (any one of which could serve as a good predictor), once one of them is selected, the importance of the others drops sharply. This can be misleading: the feature picked first looks very important, while the rest look unimportant.

2.2 Mean decrease accuracy

This method directly measures each feature's impact on model accuracy.

Shuffle the values of each feature in turn and measure how much the shuffling hurts the model's accuracy.
For unimportant variables, shuffling barely changes accuracy; for important variables, shuffling noticeably lowers it.
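As a side note, newer versions of scikit-learn (0.22+) expose this idea directly as sklearn.inspection.permutation_importance. A minimal sketch, assuming a fitted classifier clf and the X_test / y_test split defined in the code below (the post itself implements the shuffling manually in Section 3.2.2):

from sklearn.inspection import permutation_importance

# Shuffle each feature n_repeats times and record the mean drop in the chosen metric
result = permutation_importance(clf, X_test, y_test, scoring='roc_auc', n_repeats=5, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:10]:
    print(X_test.columns[idx], '%.4f' % result.importances_mean[idx])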

3. Code

import pickle
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('data.csv')
data.drop_duplicates(inplace=True)

# Load the engineered features
with open('feature.pkl', 'rb') as f:
    X = pickle.load(f)

# Extract the label
y = data.status

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2333)
# Performance evaluation
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predictions
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)

    y_train_proba = clf.predict_proba(X_train)[:, 1]
    y_test_proba = clf.predict_proba(X_test)[:, 1]

    # Accuracy
    print('[Accuracy]', end=' ')
    print('Train:', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('Test:', '%.4f' % accuracy_score(y_test, y_test_pred))

    # AUC (via roc_auc_score; auc on the ROC curve also works)
    print('[AUC]', end=' ')
    print('Train:', '%.4f' % roc_auc_score(y_train, y_train_proba), end=' ')
    print('Test:', '%.4f' % roc_auc_score(y_test, y_test_proba))

3.1 Feature selection with IV values

stats.scoreatpercentile(x, 50)  # value of x at the 50th percentile
np.in1d(B, A)  # for each element of B, check whether it also appears in A; returns a boolean array (True/False)
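A quick sanity check of these two helpers on toy values:

import numpy as np
from scipy import stats

a = np.array([1, 2, 3, 4, 5])
print(stats.scoreatpercentile(a, 50))   # 3.0 -- the value at the 50th percentile of a
print(np.in1d(a, [2, 4]))               # [False True False True False] -- which elements of a appear in [2, 4]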

While processing these features we hit the extreme IV cases where the event count or the non-event count in a bin is 0.
For simplicity, the code below smooths these extreme values.

import math
import numpy as np
from scipy import stats
from sklearn.utils.multiclass import type_of_target

def woe(X, y, event=1):
    res_woe = []
    iv_dict = {}
    for feature in X.columns:
        x = X[feature].values
        # 1) Discretize continuous features
        if type_of_target(x) == 'continuous':
            x = discrete(x)
        # 2) Compute this feature's WOE and IV
        woe_dict, iv = woe_single_x(x, y, feature, event)
        iv_dict[feature] = iv
        res_woe.append(woe_dict)

    return iv_dict
        
def discrete(x):
    # Discretize the feature into 5 equal-frequency bins (quintiles)
    res = np.zeros(x.shape)
    for i in range(5):
        point1 = stats.scoreatpercentile(x, i * 20)
        point2 = stats.scoreatpercentile(x, (i + 1) * 20)
        x1 = x[np.where((x >= point1) & (x <= point2))]
        mask = np.in1d(x, x1)
        res[mask] = i + 1    # label values falling in the i-th percentile block as i + 1
    return res

def woe_single_x(x, y, feature, event=1):
    # event is the label value treated as the positive class
    event_total = sum(y == event)
    non_event_total = y.shape[-1] - event_total

    iv = 0
    woe_dict = {}
    for x1 in set(x):    # iterate over the bins
        # select, by position, the labels of samples falling in this bin
        # (y keeps its original index after train_test_split, so use .iloc rather than reindex)
        y1 = y.iloc[np.where(x == x1)[0]]
        event_count = sum(y1 == event)
        non_event_count = y1.shape[-1] - event_count
        rate_event = event_count / event_total
        rate_non_event = non_event_count / non_event_total

        # smooth the extreme cases where one of the rates is 0
        if rate_event == 0:
            rate_event = 0.0001
            # woei = -20
        elif rate_non_event == 0:
            rate_non_event = 0.0001
            # woei = 20
        woei = math.log(rate_event / rate_non_event)
        woe_dict[x1] = woei
        iv += (rate_event - rate_non_event) * woei
    return woe_dict, iv
import warnings
warnings.filterwarnings("ignore")

iv_dict = woe(X_train, y_train)
iv = sorted(iv_dict.items(), key=lambda x: x[1], reverse=True)
iv

Output

[('historical_trans_amount', 2.6975301004625365),
('trans_amount_3_month', 2.5633548887586746),
('pawns_auctions_trusts_consume_last_6_month', 2.343990314630991),
('repayment_capability', 2.31685232254565),
('first_transaction_day', 2.10946672748192),
('abs', 2.048054369415617),
('consfin_avg_limit', 1.8005797778063934),
('consume_mini_time_last_1_month', 1.4570522032774857),
...]
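The IV values can then be screened against the thresholds from Section 1.1. A minimal sketch (the cut-offs here are illustrative only; as noted in the problems section below, smoothing pushes most features into a narrow range):

# Keep variables whose IV falls in the medium-to-strong range
selected_by_iv = [feat for feat, value in iv_dict.items() if 0.1 <= value <= 0.5]
print(len(selected_by_iv), 'features kept by IV screening')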

3.2 Feature selection with a random forest

First, tune the model with grid search to obtain its parameters.

import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Performance with default parameters
rf0 = RandomForestClassifier(oob_score=True, random_state=2333)
rf0.fit(X_train, y_train)
print('OOB score:', rf0.oob_score_)
model_metrics(rf0, X_train, X_test, y_train, y_test)

Output

OOB score: 0.7342951608055305
[Accuracy] Train: 0.9805 Test: 0.7744
[AUC] Train: 0.9996 Test: 0.7289

# Grid-search tuning; intermediate steps omitted...
"""
param_test = {'n_estimators':range(20,200,20)}
# param_test = {'max_depth':range(3,14,2), 'min_samples_split':range(50,201,20)}
# param_test = {'min_samples_split':range(10,100,20), 'min_samples_leaf':range(10,60,10)}
# param_test = {'max_features':range(3,17,2)}
gsearch = GridSearchCV(estimator = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50, 
                                                          min_samples_leaf=20, max_features = 9,random_state=2333), 
                       param_grid = param_test, scoring='roc_auc', cv=5)

gsearch.fit(X_train, y_train)
# gsearch.grid_scores_, 
gsearch.best_params_, gsearch.best_score_
"""

Final parameters and performance

rf = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features=9, oob_score=True, random_state=2333)
rf.fit(X_train, y_train)
print('OOB score:', rf.oob_score_)
model_metrics(rf, X_train, X_test, y_train, y_test)

Output

OOB score: 0.7844905320108205
[Accuracy] Train: 0.8115 Test: 0.7954
[AUC] Train: 0.8946 Test: 0.7914

3.2.1 Mean decrease impurity

For each tree, rank features by how much they reduce impurity (the Gini index here), then average the ranking over the whole forest.

rf.fit(X_train, y_train)
feature_importance1 = sorted(zip(map(lambda x: '%.4f' % x, rf.feature_importances_), list(X_train.columns)), reverse=True)
feature_importance1[:10]

Output

[('0.1333', 'trans_fail_top_count_enum_last_1_month'),
('0.0818', 'loans_score'),
('0.0784', 'history_fail_fee'),
('0.0623', 'apply_score'),
('0.0580', 'latest_one_month_fail'),
('0.0424', 'loans_overdue_count'),
('0.0307', 'trans_fail_top_count_enum_last_12_month'),
('0.0237', 'trans_fail_top_count_enum_last_6_month'),
('0.0194', 'trans_day_last_12_month'),
('0.0184', 'max_cumulative_consume_later_1_month')]

3.2.2 Mean decrease accuracy

Shuffle the values of each feature and measure how the shuffling affects model accuracy. (Alternatively, one can add noise to each feature and measure the impact on accuracy.)

import numpy as np
from collections import defaultdict
from sklearn.model_selection import cross_val_score, ShuffleSplit

scores = defaultdict(list)
rs = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
for train_idx, test_idx in rs.split(X_train):
    x_train, x_test = X_train.values[train_idx], X_train.values[test_idx]
    Y_train, Y_test = y_train.values[train_idx], y_train.values[test_idx]
    r = rf.fit(x_train, Y_train)
    acc = accuracy_score(Y_test, rf.predict(x_test))
    for i in range(x_train.shape[1]):
        X_t = x_test.copy()
        np.random.shuffle(X_t[:, i])    # shuffle one column and measure the relative drop in accuracy
        shuff_acc = accuracy_score(Y_test, rf.predict(X_t))
        scores[X_train.columns[i]].append((acc - shuff_acc) / acc)

feature_importance2 = sorted([('%.4f' % np.mean(score), feat) for feat, score in scores.items()], reverse=True)
feature_importance2[:10]

Output

[('0.0163', 'history_fail_fee'),
('0.0153', 'trans_fail_top_count_enum_last_1_month'),
('0.0120', 'loans_score'),
('0.0097', 'latest_one_month_fail'),
('0.0097', 'apply_score'),
('0.0062', 'loans_overdue_count'),
('0.0046', 'trans_fail_top_count_enum_last_12_month'),
('0.0041', 'trans_fail_top_count_enum_last_6_month'),
('0.0036', 'latest_one_month_suc'),
('0.0025', 'avg_price_last_12_month')]

3.3 Combining the two selections

# Drop features that rank outside the top 50 under both importance measures
useless = []
for feature in X_train.columns:
    if feature in [t[1] for t in feature_importance1[50:]] and feature in [t[1] for t in feature_importance2[50:]]:
        useless.append(feature)
        print(feature, iv_dict[feature])
X_train.drop(useless, axis=1, inplace=True)
X_test.drop(useless, axis=1, inplace=True)

Model selection and evaluation

Hyper-parameter tuning is omitted here; see "Finance3 - ModelAdjustPara.ipynb".

from sklearn.preprocessing import StandardScaler

# Standardize the features
std = StandardScaler()
X_train = std.fit_transform(X_train.values)
X_test = std.transform(X_test.values)
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier

lr = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')  # liblinear supports the l1 penalty
svm_linear = svm.SVC(C = 0.01, kernel = 'linear', probability=True)
svm_poly =  svm.SVC(C = 0.01, kernel = 'poly', probability=True)
svm_rbf =  svm.SVC(gamma = 0.01, C =0.01 , probability=True)
svm_sigmoid =  svm.SVC(C = 0.01, kernel = 'sigmoid',probability=True)
dt = DecisionTreeClassifier(max_depth=5,min_samples_split=50,min_samples_leaf=60, max_features=9, random_state =2333)
xgb = XGBClassifier(learning_rate =0.1, n_estimators=80, max_depth=3, min_child_weight=5, 
                    gamma=0.2, subsample=0.8, colsample_bytree=0.8, reg_alpha=1e-5, 
                    objective= 'binary:logistic', nthread=4,scale_pos_weight=1, seed=27)
lgb = LGBMClassifier(learning_rate =0.1, n_estimators=100, max_depth=3, min_child_weight=11, 
                    gamma=0.1, subsample=0.5, colsample_bytree=0.9, reg_alpha=1e-5, 
                    nthread=4,scale_pos_weight=1, seed=27)
sclf = StackingClassifier(classifiers=[svm_linear, svm_poly, svm_rbf, svm_sigmoid, dt, xgb, lgb], 
                            meta_classifier=lr, use_probas=True,average_probas=False)
sclf.fit(X_train, y_train.values)
model_metrics(sclf, X_train, X_test, y_train, y_test)

Output

[Accuracy] Train: 0.8563 Test: 0.8017
[AUC] Train: 0.9061 Test: 0.7875

(For comparison, the output without feature selection)

[Accuracy] Train: 0.8365 Test: 0.8017
[AUC] Train: 0.8773 Test: 0.7962

Analysis: after tuning, the single models improve somewhat. Comparing the stacked model with and without feature selection, the results are close (AUC drops slightly). At the very least this shows that dropping these features barely affects performance, i.e. they were redundant.

Problems encountered

1) How should the extreme cases be handled when computing IV?
Whether WOE is set to 0 / infinity or smoothed has a large effect on the IV values. Screening by the 0.2–0.5 range no longer works (apart from the "suspicious" ones, almost all features fall between 0.2 and 0.5).

2) Even with IV values or feature_importance in hand, it is not obvious whether a feature must be dropped just because its value looks abnormal.
Verifying by dropping features one at a time and comparing performance would require re-tuning each time (quite tedious...).

Reference

1) Several common feature selection methods, illustrated with scikit-learn
2) Computing and using IV values
3) Information Value (IV) and Weight of Evidence (WOE) – A Case Study from Banking (Part 4)
4) Code for computing IV values
5) A detailed explanation of IV and WOE in data mining models

More

Full code on GitHub: https://github.com/libihan/Exercise-ML/blob/master/Finance2.2 - FeatureSelection.ipynb
