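Feature Selection (Predicting Whether a Loan User Will Become Overdue)
Given financial data, predict whether a loan user will become overdue.
(status is the label: 0 means not overdue, 1 means overdue.)
Task8 (Feature Engineering 2 - Feature Selection) - Select features with IV and with a random forest, then build models and evaluate them.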
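1. Feature selection with IV
1.1 Introduction
In binary classification, IV (Information Value), together with the WOE encoding it is built on, is mainly used to encode input variables and assess their predictive power.
IV takes values in [0, +∞), and its magnitude indicates how strong the variable's predictive power is. The usual rule of thumb is: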
IV | Predictive power |
---|---|
<0.02 | Useless |
0.02-0.1 | Weak |
0.1-0.3 | Medium |
0.3-0.5 | Strong |
>0.5 | Suspicious |
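Variables with medium or strong predictive power are usually chosen for model development; some practitioners recommend using only variables with medium IV.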
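The rule of thumb in the table can be turned into a small filter. The sketch below is only an illustration: the function names `iv_band` and `select_by_iv`, the handling of the band boundaries, and the example IV values are all assumptions, not part of the original notebook.

```python
def iv_band(iv):
    """Map an IV value to the rule-of-thumb band from the table above."""
    # Boundary handling (which band 0.1 or 0.3 belongs to) is an assumption.
    if iv < 0.02:
        return "useless"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    return "suspicious"


def select_by_iv(iv_by_feature, keep=("medium", "strong")):
    """Return feature names whose IV band is in `keep`, highest IV first."""
    ranked = sorted(iv_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, iv in ranked if iv_band(iv) in keep]


# Hypothetical IV values, made up for illustration only.
example_iv = {"f_a": 0.01, "f_b": 0.25, "f_c": 0.42, "f_d": 0.8, "f_e": 0.05}
selected = select_by_iv(example_iv)
print(selected)
```

Passing `keep=("medium",)` would mimic the stricter practice of keeping only medium-IV variables.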
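1.2 Formulas
1) WOE
WOE (weight of evidence) is a way of encoding the original variable.
To WOE-encode a variable, first group its values (binning, or discretization). Common discretization methods are equal-width binning, equal-frequency binning, and binning with a decision tree.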
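To make the difference between equal-width and equal-frequency binning concrete, here is a minimal pure-Python sketch. The helper names and the sample data are invented for illustration; in practice a library routine would normally be used, and this rank-based equal-frequency version may split tied values across bins.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bin indices so each bin holds roughly the same number of samples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up, skewed data
width_bins = equal_width_bins(values, 5)
freq_bins = equal_frequency_bins(values, 5)
print(width_bins)
print(freq_bins)
```

On skewed data like this, equal-width binning puts almost every sample into the first bin, while equal-frequency binning spreads the samples evenly, which is one reason quantile-style bins are a common choice before computing WOE.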
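After grouping, the WOE of group i is

$$WOE_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right) = \ln\left(\frac{y_i / y_T}{n_i / n_T}\right)$$

where $y_i$ and $n_i$ are the numbers of responding and non-responding users in group i, $y_T$ and $n_T$ are the corresponding totals over all groups, and $p_{y_i} = y_i / y_T$, $p_{n_i} = n_i / n_T$.
WOE measures the difference between "responders in the current group / all responders" and "non-responders in the current group / all non-responders".
2) IV
IV is built on WOE; it is essentially a weighted sum of the WOE values.
Suppose the variable is split into n groups. For group i:

$$IV_i = (p_{y_i} - p_{n_i}) \cdot WOE_i = \left(\frac{y_i}{y_T} - \frac{n_i}{n_T}\right) \ln\left(\frac{y_i / y_T}{n_i / n_T}\right)$$

Once the IV of each group is known, the IV of the whole variable is their sum:

$$IV = \sum_{i=1}^{n} IV_i$$

IV is mainly used for feature selection: to rank variables by predictive power, sort them by IV from high to low.
Compared with WOE, IV multiplies in an extra factor, $(p_{y_i} - p_{n_i})$, which:
1) guarantees that IV is never negative (the factor always has the same sign as $WOE_i$);
2) accounts for the group's share of the overall sample (the smaller the share, the less the group contributes to the variable's overall predictive power).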
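A small worked example may help here. It takes the WOE of a group as ln(rate_event / rate_non_event) and IV as the sum of (rate_event - rate_non_event) * WOE over the groups, matching the code in section 3.1. The function name and the counts are made up, and the sketch assumes every group contains at least one event and one non-event.

```python
import math


def woe_iv_from_counts(event_counts, non_event_counts):
    """Return (woe_list, iv) given per-group event / non-event counts.

    Assumes every group has at least one event and one non-event,
    so no smoothing is needed.
    """
    event_total = sum(event_counts)
    non_event_total = sum(non_event_counts)
    woes, iv = [], 0.0
    for e, n in zip(event_counts, non_event_counts):
        p_event = e / event_total
        p_non_event = n / non_event_total
        woe_i = math.log(p_event / p_non_event)
        woes.append(woe_i)
        # Both factors share the same sign, so every term is >= 0.
        iv += (p_event - p_non_event) * woe_i
    return woes, iv


# Made-up counts for a feature split into three groups.
events = [10, 30, 60]        # overdue users per group
non_events = [50, 30, 20]    # non-overdue users per group
woes, iv = woe_iv_from_counts(events, non_events)
print([round(w, 4) for w in woes], round(iv, 4))
```

The middle group has the same share of events and non-events, so its WOE is 0 and it contributes nothing to IV; the two outer groups contribute positive terms even though their WOE values have opposite signs.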
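2. Feature selection with a random forest
A random forest offers two feature selection methods: mean decrease impurity and mean decrease accuracy.
2.1 Mean decrease impurity
Impurity is used to pick the optimal split condition at each node. For classification, Gini impurity or information gain is common; for regression, variance (a least-squares fit) is common.
While a decision tree is trained, one can compute how much each feature reduces the tree's impurity. For a forest, average each feature's impurity reduction over all trees and use that average as the feature's selection score.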
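The quantity being averaged is the impurity decrease of a single split. The sketch below computes it for Gini impurity on a toy node; the function names and labels are illustrative assumptions, and a real forest would additionally weight each node by its sample share and accumulate the decreases per feature across all trees.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(parent, left, right):
    """Gini decrease achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = [0, 0, 0, 0, 1, 1, 1, 1]
# A split that separates the classes perfectly.
good_split = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])
# A split that leaves both children as mixed as the parent.
poor_split = impurity_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1])
print(good_split, poor_split)
```

A feature that produces many splits like the first one accumulates a large total decrease and ends up with a high importance score.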
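[Drawbacks]
1) The method is biased: it favors variables with more categories.
2) When several correlated features are all related to the label (any one of them would make a good feature), once one is selected the importance of the others drops sharply. This is misleading: the feature picked first looks important while the rest look unimportant.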
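2.2 Mean decrease accuracy
This directly measures each feature's effect on the model's accuracy.
Shuffle the values of one feature at a time and measure how the shuffling changes the model's accuracy.
For an unimportant variable, shuffling barely affects accuracy; for an important variable, shuffling lowers it.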
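The shuffling idea can be sketched without any machine-learning library. In the toy example below the "model" is a hand-written rule standing in for a fitted classifier, and the data are random; all names are assumptions made for illustration.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where model(row) equals the label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Relative accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    drops = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        drops.append((base - accuracy(model, shuffled, labels)) / base)
    return drops


# Toy data: the label depends only on feature 0; feature 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]


def toy_model(row):
    # Stand-in for a fitted classifier; it only looks at feature 0.
    return 1 if row[0] > 0.5 else 0


drops = permutation_importance(toy_model, rows, labels, n_features=2)
print(drops)
```

Shuffling feature 1 leaves the predictions unchanged, so its drop is exactly zero, while shuffling feature 0 destroys the signal the model relies on.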
3. Code
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('data.csv')
data.drop_duplicates(inplace=True)

# Load the features
with open('feature.pkl', 'rb') as f:
    X = pickle.load(f)

# Extract the label
y = data.status

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2333)
# Performance evaluation
from sklearn.metrics import accuracy_score, roc_auc_score

def model_metrics(clf, X_train, X_test, y_train, y_test):
    # Predict labels and positive-class probabilities
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    y_train_proba = clf.predict_proba(X_train)[:, 1]
    y_test_proba = clf.predict_proba(X_test)[:, 1]

    # Accuracy
    print('[Accuracy]', end=' ')
    print('train:', '%.4f' % accuracy_score(y_train, y_train_pred), end=' ')
    print('test:', '%.4f' % accuracy_score(y_test, y_test_pred))

    # AUC: use roc_auc_score (or auc)
    print('[AUC]', end=' ')
    print('train:', '%.4f' % roc_auc_score(y_train, y_train_proba), end=' ')
    print('test:', '%.4f' % roc_auc_score(y_test, y_test_proba))
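For readers less familiar with AUC, it can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The pure-Python sketch below computes it that way by brute force; the function name is an assumption, and it is meant as an illustration of the metric rather than a replacement for roc_auc_score.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc_by_pairs(y_true, scores)
print(auc_value)  # 3 of the 4 positive/negative pairs are ordered correctly
```

Unlike accuracy, this depends only on how the scores are ordered, not on a 0.5 decision threshold, which is why both metrics are reported.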
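3.1 Feature selection with IV
Two helper calls used below:
stats.scoreatpercentile(x, 50)  # value of x at the 50th percentile
np.in1d(B, A)  # for each element of B, test whether it occurs in A; returns a boolean array
When processing the features above, we hit the extreme cases of IV: a group whose response count is 0 or whose non-response count is 0.
For simplicity, the code smooths these extreme values.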
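To see how much this smoothing matters, the sketch below computes the IV contribution of a single group, replacing a zero rate with a small floor (0.0001, the value used in the code of this section). The function name and the counts are hypothetical.

```python
import math


def bin_iv(event_count, non_event_count, event_total, non_event_total, floor=0.0001):
    """IV contribution of one group, flooring a zero rate at `floor`."""
    rate_event = event_count / event_total
    rate_non_event = non_event_count / non_event_total
    if rate_event == 0:
        rate_event = floor
    elif rate_non_event == 0:
        rate_non_event = floor
    woe = math.log(rate_event / rate_non_event)
    return (rate_event - rate_non_event) * woe


# Made-up totals: 100 events and 400 non-events overall.
regular_bin = bin_iv(20, 80, 100, 400)      # same share of events and non-events
empty_event_bin = bin_iv(0, 40, 100, 400)   # no events fall into this group
print(regular_bin, empty_event_bin)
```

A single group without events already contributes more than 0.5 to IV under this floor, so the result depends heavily on the chosen smoothing constant, and the usual IV bands become hard to apply.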
import math
import numpy as np
from scipy import stats
from sklearn.utils.multiclass import type_of_target

def woe(X, y, event=1):
    """Return {feature: IV} for every column of X."""
    res_woe = []
    iv_dict = {}
    for feature in X.columns:
        x = X[feature].values
        # 1) Discretize continuous features
        if type_of_target(x) == 'continuous':
            x = discrete(x)
        # 2) Compute the WOE and IV of this feature
        woe_dict, iv = woe_single_x(x, y, feature, event)
        iv_dict[feature] = iv
        res_woe.append(woe_dict)
    return iv_dict

def discrete(x):
    # Discretize the feature into 5 quantile bins
    res = np.zeros(x.shape)
    for i in range(5):
        point1 = stats.scoreatpercentile(x, i * 20)
        point2 = stats.scoreatpercentile(x, (i + 1) * 20)
        x1 = x[np.where((x >= point1) & (x <= point2))]
        mask = np.in1d(x, x1)
        res[mask] = i + 1  # label values in the i-th quantile interval as i + 1
    return res

def woe_single_x(x, y, feature, event=1):
    # event is the label value of the positive class
    event_total = sum(y == event)
    non_event_total = y.shape[-1] - event_total
    iv = 0
    woe_dict = {}
    for x1 in set(x):  # iterate over the bins
        # x is a positional array, so pick the matching rows of y by position
        y1 = y.iloc[np.where(x == x1)[0]]
        event_count = sum(y1 == event)
        non_event_count = y1.shape[-1] - event_count
        rate_event = event_count / event_total
        rate_non_event = non_event_count / non_event_total
        # Smooth the extreme cases so that log() stays finite
        if rate_event == 0:
            rate_event = 0.0001
            # woei = -20
        elif rate_non_event == 0:
            rate_non_event = 0.0001
            # woei = 20
        woei = math.log(rate_event / rate_non_event)
        woe_dict[x1] = woei
        iv += (rate_event - rate_non_event) * woei
    return woe_dict, iv
import warnings
warnings.filterwarnings("ignore")
iv_dict = woe(X_train, y_train)
iv = sorted(iv_dict.items(), key = lambda x:x[1],reverse = True)
iv
Output
[('historical_trans_amount', 2.6975301004625365),
('trans_amount_3_month', 2.5633548887586746),
('pawns_auctions_trusts_consume_last_6_month', 2.343990314630991),
('repayment_capability', 2.31685232254565),
('first_transaction_day', 2.10946672748192),
('abs', 2.048054369415617),
('consfin_avg_limit', 1.8005797778063934),
('consume_mini_time_last_1_month', 1.4570522032774857),
…
3.2 Feature selection with a random forest
First tune the model parameters with a grid search.
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Check the performance with default parameters
rf0 = RandomForestClassifier(oob_score=True, random_state=2333)
rf0.fit(X_train, y_train)
print('OOB score:', rf0.oob_score_)
model_metrics(rf0, X_train, X_test, y_train, y_test)
Output
OOB score: 0.7342951608055305
[Accuracy] train: 0.9805 test: 0.7744
[AUC] train: 0.9996 test: 0.7289
# Grid search over the parameters (intermediate steps omitted...)
"""
param_test = {'n_estimators': range(20, 200, 20)}
# param_test = {'max_depth': range(3, 14, 2), 'min_samples_split': range(50, 201, 20)}
# param_test = {'min_samples_split': range(10, 100, 20), 'min_samples_leaf': range(10, 60, 10)}
# param_test = {'max_features': range(3, 17, 2)}
gsearch = GridSearchCV(estimator=RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50,
                                                        min_samples_leaf=20, max_features=9, random_state=2333),
                       param_grid=param_test, scoring='roc_auc', cv=5)
gsearch.fit(X_train, y_train)
# gsearch.cv_results_  (grid_scores_ in older scikit-learn versions)
gsearch.best_params_, gsearch.best_score_
"""
Final parameters and performance
rf = RandomForestClassifier(n_estimators=120, max_depth=9, min_samples_split=50,
                            min_samples_leaf=20, max_features=9, oob_score=True, random_state=2333)
rf.fit(X_train, y_train)
print('OOB score:', rf.oob_score_)
model_metrics(rf, X_train, X_test, y_train, y_test)
Output
OOB score: 0.7844905320108205
[Accuracy] train: 0.8115 test: 0.7954
[AUC] train: 0.8946 test: 0.7914
3.2.1 Mean decrease impurity
For each tree, rank the features by impurity reduction (the Gini index here), then average over the whole forest.
rf.fit(X_train, y_train)
feature_impotance1 = sorted(zip(map(lambda x: '%.4f'%x, rf.feature_importances_), list(X_train.columns)), reverse=True)
feature_impotance1[:10]
Output
[('0.1333', 'trans_fail_top_count_enum_last_1_month'),
('0.0818', 'loans_score'),
('0.0784', 'history_fail_fee'),
('0.0623', 'apply_score'),
('0.0580', 'latest_one_month_fail'),
('0.0424', 'loans_overdue_count'),
('0.0307', 'trans_fail_top_count_enum_last_12_month'),
('0.0237', 'trans_fail_top_count_enum_last_6_month'),
('0.0194', 'trans_day_last_12_month'),
('0.0184', 'max_cumulative_consume_later_1_month')]
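3.2.2 Mean decrease accuracy
Shuffle the values of one feature at a time and measure how the shuffling changes the model's accuracy. (Alternatively, add noise to each feature and see how the accuracy of the result changes.)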
import numpy as np
from collections import defaultdict
from sklearn.model_selection import ShuffleSplit

scores = defaultdict(list)
rs = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
for train_idx, test_idx in rs.split(X_train):
    x_train, x_test = X_train.values[train_idx], X_train.values[test_idx]
    Y_train, Y_test = y_train.values[train_idx], y_train.values[test_idx]
    rf.fit(x_train, Y_train)
    acc = accuracy_score(Y_test, rf.predict(x_test))  # baseline accuracy on this split
    for i in range(x_train.shape[1]):
        X_t = x_test.copy()
        np.random.shuffle(X_t[:, i])  # shuffle column i in place
        shuff_acc = accuracy_score(Y_test, rf.predict(X_t))
        # relative drop in accuracy caused by shuffling feature i
        scores[X_train.columns[i]].append((acc - shuff_acc) / acc)
feature_impotance2 = sorted([('%.4f' % np.mean(score), feat) for feat, score in scores.items()], reverse=True)
feature_impotance2[:10]
Output
[('0.0163', 'history_fail_fee'),
('0.0153', 'trans_fail_top_count_enum_last_1_month'),
('0.0120', 'loans_score'),
('0.0097', 'latest_one_month_fail'),
('0.0097', 'apply_score'),
('0.0062', 'loans_overdue_count'),
('0.0046', 'trans_fail_top_count_enum_last_12_month'),
('0.0041', 'trans_fail_top_count_enum_last_6_month'),
('0.0036', 'latest_one_month_suc'),
('0.0025', 'avg_price_last_12_month')]
3.3 Combining the rankings to select features
# Features ranked below the top 50 by both importance measures
low1 = {t[1] for t in feature_impotance1[50:]}
low2 = {t[1] for t in feature_impotance2[50:]}
useless = []
for feature in X_train.columns:
    if feature in low1 and feature in low2:
        useless.append(feature)
        print(feature, iv_dict[feature])
X_train.drop(useless, axis=1, inplace=True)
X_test.drop(useless, axis=1, inplace=True)
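The selection rule used here, dropping only the features that rank low under both importance measures, can be shown on a toy example. The function name, the feature names, and the cut-off are hypothetical; the notebook itself uses a cut-off of 50.

```python
def weak_in_both(ranking_a, ranking_b, top_k):
    """Features that fall outside the top_k of both rankings.

    Each ranking is a list of feature names ordered from most to least important.
    """
    tail_a = set(ranking_a[top_k:])
    tail_b = set(ranking_b[top_k:])
    # Keep the order of ranking_a so the result is deterministic.
    return [name for name in ranking_a if name in tail_a and name in tail_b]


# Hypothetical rankings, made up for illustration.
by_impurity = ["f1", "f2", "f3", "f4", "f5"]
by_permutation = ["f2", "f1", "f5", "f3", "f4"]
to_drop = weak_in_both(by_impurity, by_permutation, top_k=3)
print(to_drop)
```

Requiring agreement between the two rankings is a conservative choice: a feature that only one measure considers weak (such as f3 or f5 here) is kept.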
Model selection and evaluation
Parameter tuning is omitted here; see "Finance3 - ModelAdjustPara.ipynb".
from sklearn.preprocessing import StandardScaler

# Standardize the features (the scaler is fitted on the training set only)
std = StandardScaler()
X_train = std.fit_transform(X_train.values)
X_test = std.transform(X_test.values)

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from lightgbm.sklearn import LGBMClassifier
from mlxtend.classifier import StackingClassifier

# The L1 penalty needs a solver that supports it, e.g. liblinear
lr = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
svm_linear = svm.SVC(C=0.01, kernel='linear', probability=True)
svm_poly = svm.SVC(C=0.01, kernel='poly', probability=True)
svm_rbf = svm.SVC(gamma=0.01, C=0.01, probability=True)
svm_sigmoid = svm.SVC(C=0.01, kernel='sigmoid', probability=True)
dt = DecisionTreeClassifier(max_depth=5, min_samples_split=50, min_samples_leaf=60, max_features=9, random_state=2333)
xgb = XGBClassifier(learning_rate=0.1, n_estimators=80, max_depth=3, min_child_weight=5,
                    gamma=0.2, subsample=0.8, colsample_bytree=0.8, reg_alpha=1e-5,
                    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
lgb = LGBMClassifier(learning_rate=0.1, n_estimators=100, max_depth=3, min_child_weight=11,
                     gamma=0.1, subsample=0.5, colsample_bytree=0.9, reg_alpha=1e-5,
                     nthread=4, scale_pos_weight=1, seed=27)

# Stack the base models, with logistic regression as the meta-classifier
sclf = StackingClassifier(classifiers=[svm_linear, svm_poly, svm_rbf, svm_sigmoid, dt, xgb, lgb],
                          meta_classifier=lr, use_probas=True, average_probas=False)
sclf.fit(X_train, y_train.values)
model_metrics(sclf, X_train, X_test, y_train, y_test)
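Output
[Accuracy] train: 0.8563 test: 0.8017
[AUC] train: 0.9061 test: 0.7875
(Output without feature selection)
[Accuracy] train: 0.8365 test: 0.8017
[AUC] train: 0.8773 test: 0.7962
Analysis: after tuning, single-model performance improved somewhat. Comparing the stacking results with and without feature selection, the difference is small (test AUC drops slightly, from 0.7962 to 0.7875). At the very least this shows that removing these features has little effect on performance, so they are redundant.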
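Problems encountered
1) How should extreme values be handled when computing IV?
Setting WOE to 0 or infinity, or smoothing it, has a large effect on IV. The 0.2-0.5 range can no longer be used to decide which features to drop (apart from the "suspicious" ones, all the others fall between 0.2 and 0.5).
2) Even with the IV values and feature_importance in hand, it is unclear whether a feature must be dropped just because its value falls outside the usual range.
Verifying this by removing features one at a time and comparing performance would also require re-tuning the parameters each time (very tedious...).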
More
The code is on GitHub: https://github.com/libihan/Exercise-ML/blob/master/Finance2.2 - FeatureSelection.ipynb