Model building - use logistic regression to build models, lightGBM for feature screening

1. Model building process

1.1 Experimental design

The new model should be compared with the existing scheme and validated by experiment. In particular, the model and the strategy must not be adjusted at the same time, otherwise the effect of each change cannot be separated. The general experimental design procedure is shown in the following figure:
[Figure: general experimental design procedure]

Question: After the business is stable, can manual review be removed?

Answer: No. After a model goes live it generally performs well in the high and low score segments, but the middle segment still needs manual review. Even after the model is refined, manual review can only be reduced, not abandoned completely.

1.2 Sample design

1.3 Model training and evaluation

When performing model selection and evaluation, we evaluate models in the following order: Interpretability > Stability > Discrimination.

Discrimination indicators: AUC and KS
Stability indicator: PSI
AUC: the area under the ROC curve. It reflects how well the model's output probability ranks good and bad users, and describes the model's average discrimination over all thresholds.
KS: the largest gap between the cumulative distributions of good and bad users. It describes the model's discrimination at the best threshold.
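
As a rough illustration of these three indicators, the sketch below computes AUC, KS and PSI in plain Python. The function names, the convention that label 1 means a bad user and a higher score means higher risk, and the PSI formula (the commonly used sum over bins of (actual - expected) * ln(actual / expected)) are assumptions for illustration and are not taken from the original post.

```python
import math

def auc_and_ks(labels, scores):
    """Return (AUC, KS). labels: 1 = bad, 0 = good; higher score = riskier."""
    n_bad = sum(labels)
    n_good = len(labels) - n_bad
    # Walk through the samples from the riskiest score downwards
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = ks = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with the same score share one threshold
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        tpr, fpr = tp / n_bad, fp / n_good
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid under the ROC curve
        ks = max(ks, abs(tpr - fpr))                    # largest gap between the two curves
        prev_tpr, prev_fpr = tpr, fpr
    return auc, ks

def psi(expected, actual, eps=1e-6):
    """PSI between two binned distributions given as lists of proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total

# Made-up toy data, only to show the calls
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(auc_and_ks(labels, scores))
print(psi([0.5, 0.5], [0.6, 0.4]))
```

AUC averages over every threshold, while KS only looks at the single threshold where the gap is largest, which matches the "average state" versus "best state" description above.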

Among the business indicators, we mainly look at the pass rate and overdue rate. Under the premise of a reasonable overdue rate, the pass rate should be increased as much as possible.

A card: pay more attention to the pass rate; the requirement on the overdue rate can be relaxed slightly;
B card: find ways to reduce the overdue rate and increase the quota for good users.
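
To make the trade-off between the two business indicators concrete, here is a minimal sketch that computes the pass rate and the overdue rate for a given cutoff. It assumes that an applicant is approved when the predicted overdue probability is below the cutoff; the function name, the cutoff values and the toy data are hypothetical.

```python
def pass_and_overdue_rate(probs, bad_flags, cutoff):
    """Approve applicants whose predicted overdue probability is below cutoff.

    Returns (pass_rate, overdue_rate among approved applicants).
    """
    approved = [y for p, y in zip(probs, bad_flags) if p < cutoff]
    pass_rate = len(approved) / len(probs)
    overdue_rate = sum(approved) / len(approved) if approved else 0.0
    return pass_rate, overdue_rate

# Made-up predictions and outcomes (1 = overdue)
probs = [0.05, 0.1, 0.2, 0.4, 0.8]
bad_flags = [0, 0, 1, 0, 1]
for cutoff in (0.15, 0.5):
    print(cutoff, pass_and_overdue_rate(probs, bad_flags, cutoff))
```

Raising the cutoff lets more applicants through but also admits more overdue customers, so the cutoff is chosen to keep the overdue rate reasonable while passing as many applicants as possible.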

2. Logistic regression model construction

Logistic regression is essentially a linear regression on the log-odds, and its output lies between 0 and 1. This output can be read as the user's default probability, and we can map the default probability to a score.
For example, an industry-standard scorecard conversion formula is

$$score = 650 + 50\log_{2}\left(\frac{P_{overdue}}{P_{not\ overdue}}\right)$$

So how do we transform the model output into this form? Let's look at the following Sigmoid function:

$$y = \frac{1}{1+e^{-z}} = \frac{1}{1+e^{-(w^Tx+b)}}$$

which can be transformed into the following formula:

$$\ln\left(\frac{y}{1-y}\right) = w^Tx+b$$

and the log term of our score conversion formula can be transformed as follows:

$$\log_{2}\left(\frac{P_{overdue}}{P_{not\ overdue}}\right) = \frac{\ln\left(\frac{P_{overdue}}{1-P_{overdue}}\right)}{\ln(2)} = \frac{w^Tx+b}{\ln(2)}$$
So we only need to solve for the coefficient of each feature in the logistic regression, then take the weighted sum of the sample's feature values (plus the intercept) to get the customer's standardized credit score. The 650 and 50 in the score conversion formula are example values and need to be adjusted according to the actual business.
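
The conversion can be checked numerically with a small sketch. The function names and the toy weights below are hypothetical; the base score 650 and the factor 50 are the example values from the formula above. The sketch computes the score once from the probability and once directly from the linear term, and the two should agree.

```python
import math

def score_from_prob(p_overdue, base=650.0, factor=50.0):
    """score = base + factor * log2(P_overdue / P_not_overdue)."""
    odds = p_overdue / (1.0 - p_overdue)
    return base + factor * math.log2(odds)

def score_from_linear(x, w, b, base=650.0, factor=50.0):
    """score = base + factor * (w^T x + b) / ln(2)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return base + factor * z / math.log(2)

# Toy coefficients and one toy sample (made up for illustration)
w, b = [1.0, -2.0], 0.1
x = [0.5, 0.25]
z = sum(wi * xi for wi, xi in zip(w, x)) + b
p = 1.0 / (1.0 + math.exp(-z))  # Sigmoid output of the logistic regression

print(score_from_prob(p))
print(score_from_linear(x, w, b))
```

With this form, every doubling of the odds moves the score by `factor` points, and odds of 1 (a probability of 0.5) map to the base score.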

Building a scorecard with logistic regression: code

Import modules

# Import the required modules
import math
import warnings

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

Check the basic information of the data

df = pd.read_csv('Bcard.txt', encoding='utf-8')
print(df.info())
df.head()


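'''
bad_ind is the label
External score data: td_score, jxl_score, mj_score, rh_score, zzc_score, zcx_score
Internal data: person_info, finance_info, credit_info, act_info
obs_mth: the last day of the month of the application date (the dates were preprocessed to the last day of their month)
'''
# Look at the distribution of application dates; the last month is used as the test set and the rest as the training set
print(df.obs_mth.unique())
print(df.bad_ind.describe())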

Divide training set and test set

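# The last month (2018-11-30) is the test set; drop=True avoids adding the old index as a column
train_df = df[df['obs_mth'] != '2018-11-30'].reset_index(drop=True)
test_df = df[df['obs_mth'] == '2018-11-30'].reset_index(drop=True)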

All features are used for model training

# Logistic regression model without feature screening
feature_lst = df.columns.drop(['obs_mth', 'bad_ind', 'uid'])
train_X = train_df[feature_lst]
train_y = train_df['bad_ind']
test_X = test_df[feature_lst]
test_y = test_df['bad_ind']
lr_model = LogisticRegression(C=0.1)
lr_model.fit(train_X, train_y)

# Evaluate the model. predict_proba returns one column per class: column 0 is the probability of class 0, column 1 the probability of class 1; we take column 1
# Test set
y_prob = lr_model.predict_proba(test_X)[:, 1]
auc = roc_auc_score(test_y, y_prob)
fpr_lr, tpr_lr, _ = roc_curve(test_y, y_prob)
test_KS = max(tpr_lr - fpr_lr)
# Training set
y_prob_train = lr_model.predict_proba(train_X)[:, 1]
auc_train = roc_auc_score(train_y, y_prob_train)
fpr_lr_train, tpr_lr_train, _ = roc_curve(train_y, y_prob_train)
train_KS = max(tpr_lr_train - fpr_lr_train)

plt.plot(fpr_lr, tpr_lr, label='test LR auc=%0.3f' % auc)  # ROC of the test set
plt.plot(fpr_lr_train, tpr_lr_train, label='train LR auc=%0.3f' % auc_train)  # ROC of the training set
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()
print('Training set KS: %0.3f' % train_KS)
print('Test set KS: %0.3f' % test_KS)

Filter features with lightgbm

# Screen the features to make the model more accurate
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_X, train_y, random_state=0, test_size=0.2)
lgb_clf = lgb.LGBMClassifier(boosting_type='gbdt',
                             objective='binary',
                             metric='auc',
                             learning_rate=0.1,
                             n_estimators=24,
                             max_depth=5,
                             num_leaves=20,
                             max_bin=45,
                             min_data_in_leaf=6,
                             bagging_fraction=0.6,
                             bagging_freq=0,  # 0 disables bagging, so bagging_fraction has no effect here
                             feature_fraction=0.8)
lgb_clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric='auc')
# AUC on the second eval_set entry, i.e. the held-out split
lgb_auc = lgb_clf.best_score_['valid_1']['auc']
# Rank the features by importance, most important first
feature_importance = pd.DataFrame({
    'name': lgb_clf.booster_.feature_name(),
    'importance': lgb_clf.feature_importances_,
}).sort_values(by='importance', ascending=False)
feature_importance

Build a model using the filtered four features

# Build a new model using the four top-ranked features
feature_lst = feature_importance['name'][0:4]
train_X = train_df[feature_lst]
train_y = train_df['bad_ind']
test_X = test_df[feature_lst]
test_y = test_df['bad_ind']
lr_model = LogisticRegression(C=0.1)
lr_model.fit(train_X, train_y)

# Evaluate the model. predict_proba returns one column per class: column 0 is the probability of class 0, column 1 the probability of class 1; we take column 1
# Test set
y_prob = lr_model.predict_proba(test_X)[:, 1]
auc = roc_auc_score(test_y, y_prob)
fpr_lr, tpr_lr, _ = roc_curve(test_y, y_prob)
test_KS = max(tpr_lr - fpr_lr)
# Training set
y_prob_train = lr_model.predict_proba(train_X)[:, 1]
auc_train = roc_auc_score(train_y, y_prob_train)
fpr_lr_train, tpr_lr_train, _ = roc_curve(train_y, y_prob_train)
train_KS = max(tpr_lr_train - fpr_lr_train)

plt.plot(fpr_lr, tpr_lr, label='test LR auc=%0.3f' % auc)  # ROC of the test set
plt.plot(fpr_lr_train, tpr_lr_train, label='train LR auc=%0.3f' % auc_train)  # ROC of the training set
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()
print('Training set KS: %0.3f' % train_KS)
print('Test set KS: %0.3f' % test_KS)

On the test set, the model built on the screened features has a higher KS and AUC, and its results are more stable.
Print regression coefficients

# Print the regression coefficients
print('Feature list:', feature_lst)
print('Coefficients:', lr_model.coef_)
print('Intercept:', lr_model.intercept_)

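Generate report

# Generate the report
# Compute the fields the report needs: KS value, number of bad samples, number of good samples, cumulative bad count, cumulative good count, capture rate, bad rate
temp_ = pd.DataFrame()
temp_['predict_bad_prob'] = lr_model.predict_proba(test_X)[:, 1]
temp_['real_bad'] = test_y
# Sort by predicted bad probability, riskiest customers first
temp_.sort_values('predict_bad_prob', ascending=False, inplace=True)
# Number the sorted rows, then cut the row numbers into 20 bins labelled 0 to 19
temp_['num'] = list(range(temp_.shape[0]))
temp_['num'] = pd.cut(temp_.num, bins=20, labels=list(range(20)))
temp_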

report = pd.DataFrame()
# Number of bad and good samples in each bin
report['BAD'] = temp_.groupby('num').real_bad.sum().astype(int)
report['GOOD'] = temp_.groupby('num').real_bad.count().astype(int) - report['BAD']
# Cumulative counts from the riskiest bin downwards
report['BAD_CNT'] = report['BAD'].cumsum()
report['GOOD_CNT'] = report['GOOD'].cumsum()
good_total = report['GOOD_CNT'].max()
bad_total = report['BAD_CNT'].max()
# Capture rate: share of all bad samples captured up to and including this bin
report['BAD_PECT'] = round(report.BAD_CNT/bad_total, 3)
# Bad rate inside each bin
report['BAD_RATE'] = report.apply(lambda x: round(x.BAD/(x.BAD+x.GOOD), 3), axis=1)

# Compute the KS value at each bin
def cal_ks(x):
    ks = (x.BAD_CNT/bad_total) - (x.GOOD_CNT/good_total)
    return round(math.fabs(ks), 3)
report['KS'] = report.apply(cal_ks, axis=1)
report


Plot a line chart of BAD_RATE and KS

# Plot BAD_RATE and KS as line charts on twin y-axes
fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111)
bins = list(report.index)
ax.plot(bins, report.BAD_RATE, '-o', label='BAD_RATE')
ax2 = ax.twinx()
ax2.plot(bins, report.KS, '--o', color='red', label='KS')
ax.grid()
ax.set_xlim(-1, 20)
ax.set_ylim(0, 0.1)
ax2.set_ylim(0, 0.5)
ax.legend(loc=2)
ax2.legend(loc=0)
plt.show()

Build a scoring formula to score each customer

'''
6     person_info
8     credit_info
9        act_info
7    finance_info
Name: name, dtype: object
Coefficients: [[ 2.48386162  1.88254182 -1.43356854  4.44901224]]
Intercept: [-3.90631899]
'''

# Compute the score of each customer
def score(person_info, credit_info, act_info, finance_info):
    # Linear term w^T x + b, using the coefficients and intercept printed above
    xbeta = person_info*2.48386162 + credit_info*1.88254182 + act_info*(-1.43356854) + finance_info*4.44901224 - 3.90631899
    # Base score 900, 50 points per doubling of the odds; a higher score means a higher predicted overdue probability
    return 900 + 50*xbeta/math.log(2)
test_df['score'] = test_df.apply(lambda x: score(x.person_info, x.credit_info, x.act_info, x.finance_info), axis=1)
fpr_lr, tpr_lr, _ = roc_curve(test_y, test_df['score'])
print('val_ks:', abs(fpr_lr - tpr_lr).max())


Divide customers into levels by score; the overdue rate of each level can then be obtained

# Assign a level according to the score
def level(score):
    level = ''
    if score <= 600:
        level = "D"
    elif 600 < score <= 640:
        level = "C"
    elif 640 < score <= 680:
        level = "B"
    elif score > 680:
        level = "A"
    return level

test_df['level'] = test_df.score.map(level)
# Share of test-set customers that falls into each level
test_df.level.groupby(test_df.level).count()/len(test_df)
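
The last line of the code above reports the share of customers in each level. To get the overdue rate per level, the bad label has to be averaged within each level. The following is a minimal sketch in plain Python with made-up scores and labels; `to_level` and `overdue_rate_by_level` are hypothetical helper names, and the thresholds are copied from the `level` function. With the post's DataFrame, something like `test_df.groupby('level')['bad_ind'].mean()` would presumably give the same figure.

```python
from collections import defaultdict

def to_level(score):
    """Same thresholds as the level() function in the post."""
    if score <= 600:
        return "D"
    if score <= 640:
        return "C"
    if score <= 680:
        return "B"
    return "A"

def overdue_rate_by_level(scores, bad_flags):
    """Return {level: share of customers in that level with bad_flag == 1}."""
    counts = defaultdict(lambda: [0, 0])  # level -> [customers, overdue customers]
    for s, y in zip(scores, bad_flags):
        lv = to_level(s)
        counts[lv][0] += 1
        counts[lv][1] += y
    return {lv: bad / n for lv, (n, bad) in sorted(counts.items())}

# Made-up scores and labels (1 = overdue), only to show the call
rates = overdue_rate_by_level([590, 600, 620, 650, 700, 700], [1, 0, 0, 1, 0, 1])
print(rates)
```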


Conclusion

  • From the report it can be seen that:
    • The maximum KS value of the model appears in the sixth bin (No. 5). With finer binning the KS value can increase further; its upper limit is the KS value computed earlier from the ROC curve.
    • The samples in the first 4 bins account for 20% of the total population, and the bad samples they capture account for 56.4% of all bad samples.
  • It can be seen from the line chart that:
    • The model fluctuates at the 8th bin, that is, the proportion of bad samples in the 8th bin is higher than in the 7th bin.
    • Although the graph fluctuates in several places, the fluctuations are small and the overall trend is relatively stable, so the ranking ability of the model is still acceptable to the business.
  • Judging from the overdue rates of the four levels, the overdue rates of C and D are similar.
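
The first point, that the KS read from a binned report cannot exceed the KS computed sample by sample, can be illustrated with a small sketch. It assumes equal-sized bins and no tied scores, and the function names and toy labels are made up: the binned KS only inspects the cumulative gap at bin boundaries, which are a subset of the points the exact KS inspects.

```python
def ks_gaps(labels_sorted):
    """Cumulative |bad share - good share| after each sample.

    labels_sorted: labels (1 = bad) ordered from highest to lowest predicted risk.
    """
    n_bad = sum(labels_sorted)
    n_good = len(labels_sorted) - n_bad
    bad = good = 0
    gaps = []
    for y in labels_sorted:
        bad += y
        good += 1 - y
        gaps.append(abs(bad / n_bad - good / n_good))
    return gaps

def exact_ks(labels_sorted):
    return max(ks_gaps(labels_sorted))

def binned_ks(labels_sorted, n_bins):
    """KS evaluated only at the end of each of n_bins equal-sized bins."""
    gaps = ks_gaps(labels_sorted)
    n = len(gaps)
    ends = [(k + 1) * n // n_bins - 1 for k in range(n_bins)]
    return max(gaps[e] for e in ends if e >= 0)

# Made-up labels, already sorted by descending predicted probability
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(exact_ks(labels))
for n_bins in (3, 4, 12):
    print(n_bins, binned_ks(labels, n_bins))
```

In this toy example the binned value depends on where the bin boundaries happen to fall, but it never exceeds the exact value, and with one sample per bin the two coincide.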


Origin blog.csdn.net/gjinc/article/details/131852206