Alibaba Cloud Tianchi Big Data Long-term Competition: Financial Risk Control-Loan Default Prediction (Including Code)

Preface

1. Introduction to the competition question

2. Data descriptive statistics

2.1 Read data

2.2 View duplicate values

2.3 Target variable proportion

2.4 View data statistics

2.5 Count the unique values of each variable

2.6 Check whether the feature distributions of the training set and the test set are consistent

2.7 View data correlations

3. Data cleaning

3.1 Categorical variable processing

3.1.1 grade and subGrade processing

3.1.2 employmentLength processing

3.1.3 issueDate and earliesCreditLine processing

3.2 Numeric variable filling

3.3 Save data

4. Feature exploration

4.1 PCA principal component analysis

4.2 Toad: Python-based standardized scorecard model

4.2.1 toad.quality

4.2.2 toad.selection.select

4.2.3 PSI: compare the variable distributions of the training set and the test set

5. Data modeling

Summary


Preface

Through this competition I have further improved my data analysis and mining skills. Although my final score was only 0.7346, the experience accumulated along the way is invaluable. This was my first time handling such a large amount of data; while exploring the data on my own, I also kept learning from the experience of many predecessors, which gave me a new understanding of big data processing.


1. Introduction to the competition question

The competition task is based on personal credit in financial risk control. Contestants are required to predict whether a loan applicant is likely to default based on the applicant's data, so as to decide whether to approve the loan. This is a typical classification problem. The competition is intended to introduce some of the business background of financial risk control, solve a practical problem, and help newcomers practice and improve.

The data comes from the loan records of a credit platform. It contains more than 1.2 million records with 47 columns of variables, 15 of which are anonymized. To ensure fairness, 800,000 records are used as the training set, 200,000 as test set A, and 200,000 as test set B. Fields such as employmentTitle, purpose, postCode, and title are desensitized.

The data fields are described in the competition's data dictionary (the variable description table is omitted here).

2. Data descriptive statistics

2.1 Read data

import pandas as pd     # for reading data and descriptive statistics
df = pd.read_csv("/train.csv")
test = pd.read_csv("/testA.csv")
df.shape

(800000, 47): the training set has 800,000 samples and 47 variables.

2.2 View duplicate values

df[df.duplicated()==True]   # print duplicate rows

0 rows × 47 columns: there are no duplicate rows.

2.3 Target variable proportion

(df['isDefault'].value_counts()/len(df)).round(2)
0    0.8
1    0.2

The ratio of defaults to non-defaults is about 1:4, so the sample classes are imbalanced.
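As a hedged supplement (not part of the original pipeline), the imbalance can be expressed as a weight for the positive class, which could later be passed to a booster's scale_pos_weight parameter; this is only a sketch of the idea, not something used in the final model below.

counts = df['isDefault'].value_counts()
scale_pos_weight = counts[0] / counts[1]   # roughly 4 for the 0.8 / 0.2 split above
print(round(scale_pos_weight, 2))          # a candidate value for CatBoost/XGBoost scale_pos_weight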

2.4 View data statistics

df.describe().T

The n-series features all contain missing values, and amount-related features such as the loan amount and annual income have relatively large standard deviations, i.e. they are highly dispersed.
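To quantify the missingness mentioned above, a small supplementary sketch (not in the original notebook):

missing = df.isnull().sum()                                  # missing-value count per column
print(missing[missing > 0].sort_values(ascending=False))     # show only columns that have missing values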

2.5 Count the unique values of each variable

df.nunique()
df = df.drop(['id','policyCode'], axis=1)  # drop the id column and policyCode, which has only one value

2.6 Check whether the feature distributions of the training set and the test set are consistent

# Separate numeric and categorical variables
Nu_feature = list(df.select_dtypes(exclude=['object']).columns)  # numeric variables
Ca_feature = list(df.select_dtypes(include=['object']).columns)  # categorical variables
# Compare the distributions of the numeric features in the training and test sets
Nu_feature.remove('isDefault')  # remove the target variable
# Plot
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
plt.figure(figsize=(30,30))
i=1
for col in Nu_feature:
    ax=plt.subplot(8,5,i)
    ax=sns.distplot(df[col],color='violet')   # distplot is deprecated in newer seaborn; histplot/kdeplot work as well
    ax=sns.distplot(test[col],color='lime')
    ax.set_xlabel(col)
    ax.set_ylabel('Frequency')
    ax=ax.legend(['train','test'])
    i+=1
plt.show()

Since there are many variables, only some of them are shown; the distributions are consistent. If the distributions of the training set and the test set were inconsistent, the model's generalization performance would suffer; it would be like training on the characteristics of the elderly and then predicting the characteristics of children.
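Beyond eyeballing the curves, a two-sample Kolmogorov-Smirnov test can put a number on how close the train and test distributions are; this is a supplementary sketch, not part of the original analysis.

from scipy.stats import ks_2samp
# A small KS statistic (and a large p-value) suggests the two distributions are similar.
for col in Nu_feature:
    stat, p = ks_2samp(df[col].dropna(), test[col].dropna())
    print(f"{col}: KS={stat:.4f}, p={p:.4f}")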

2.7 View data correlations

plt.figure(figsize=(10,8))
train_corr = df.corr(numeric_only=True)   # correlations of the numeric features (non-numeric columns are excluded)
sns.heatmap(train_corr, vmax=0.8, linewidths=0.05, cmap="Blues")

Some features are strongly correlated with each other, but no feature is particularly strongly correlated with the target variable.
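To see the specific pairs behind the heatmap, the sketch below (a supplement to the original post, with 0.8 as an assumed threshold) lists feature pairs whose absolute correlation exceeds that value.

import numpy as np
# Keep only the upper triangle so each pair appears once and the diagonal is excluded.
corr = df.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack().loc[lambda s: s > 0.8].sort_values(ascending=False)
print(high_pairs)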

3. Data cleaning

3.1 Categorical variable processing

Ca_feature:['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']

3.1.1 grade and subGrade processing

from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
cols = ['grade','subGrade']
for j in cols:
    df[j] = lb.fit_transform(df[j])
df[cols].head()

# grade and subGrade follow a strict alphabetical order that matches the test set, so they can be label-encoded directly; the result is shown below
   grade  subGrade
0      4        21
1      3        16
2      3        17
3      0         3
4      2        11
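Since the test set has to go through the same transformation (see the note at the end of Section 3), a hedged sketch of an alternative to the loop above that also handles the test set: fit the encoder on the training column and only transform the test column, so the two sets share one mapping (grade and subGrade contain the same ordered categories in both sets).

# Sketch: reuse the training-set encoding for the test set.
for j in ['grade', 'subGrade']:
    lb = LabelEncoder()
    df[j] = lb.fit_transform(df[j])    # fit and encode on the training set
    test[j] = lb.transform(test[j])    # apply the same mapping to the test set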

3.1.2 employmentLength processing

# Convert the employment length to a number, then fill the missing values
df['employmentLength'] = (
    df['employmentLength']
      .str.replace(' years', '', regex=False)
      .str.replace(' year', '', regex=False)
      .str.replace('+', '', regex=False)
      .replace('< 1', '0')
)

# Fill the missing employment lengths with a decision tree
# (employmentLength is the only categorical variable with missing values, so it is filled this way)
from sklearn.tree import DecisionTreeClassifier
DTC = DecisionTreeClassifier()
empLenNotNull = df.employmentLength.notnull()
columns = ['loanAmnt','grade','interestRate','annualIncome','homeOwnership','term','regionCode']
# Adding the regionCode variable raised the accuracy from 0.85 to 0.97
DTC.fit(df.loc[empLenNotNull,columns], df.employmentLength[empLenNotNull])
print(DTC.score(df.loc[empLenNotNull,columns], df.employmentLength[empLenNotNull]))
# DTC.score: 0.9828872204324179

# Fill the missing values with the predictions
for data in [df]:
    empLen_pred = DTC.predict(data.loc[:,columns])    # predict the employment length for every row
    empLenIsNull = data.employmentLength.isnull()     # boolean mask of the missing values
    data.loc[empLenIsNull, 'employmentLength'] = empLen_pred[empLenIsNull]  # fill only the missing rows

# Convert to integer
df['employmentLength'] = df['employmentLength'].astype('int64')

3.1.3 issueDate and earliesCreditLine processing

df['issueDate'] = pd.to_datetime(df['issueDate'])
df['issueDate_year'] = df['issueDate'].dt.year.astype('int64')
df['issueDate_month'] = df['issueDate'].dt.month.astype('int64')
df['earliesCreditLine'] = pd.to_datetime(df['earliesCreditLine'])  # earliesCreditLine was first converted to a date format in Excel
df['earliesCreditLine_year'] = df['earliesCreditLine'].dt.year.astype('int64')
df['earliesCreditLine_month'] = df['earliesCreditLine'].dt.month.astype('int64')
df = df.drop(['issueDate','earliesCreditLine'],axis=1)
# issueDate and earliesCreditLine are decomposed into 'year' and 'month' and converted to integers for easier computation. The 'day' part of both variables is always 1 (also in the test set), so it carries no information about the target and is not extracted. The two original columns are dropped afterwards.

3.2 Numeric variable filling

df[Nu_feature] = df[Nu_feature].fillna(df[Nu_feature].median())
# The mean is easily affected by extreme values, so the numeric variables are filled with the median

3.3 Save data

df.to_csv("/df2.csv")

Note: The test set also needs to be processed in the same way
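As a hedged sketch of what that looks like for the median-filling step (assuming the test set is loaded as test and has the same columns), the medians should come from the training set so that no information leaks from the test data; the output path below is hypothetical.

# Sketch: fill the test set's numeric columns with the training-set medians.
train_medians = df[Nu_feature].median()
test[Nu_feature] = test[Nu_feature].fillna(train_medians)
test.to_csv("/testA_clean.csv")    # hypothetical output path for the cleaned test set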

4. Feature exploration

4.1 PCA principal component analysis

import numpy as np
from sklearn.decomposition import PCA
pca = PCA()
X1 = df2.drop(columns='isDefault')      # df2 is the cleaned data saved in section 3.3
df_pca_train = pca.fit_transform(X1)
pca_var_ration = pca.explained_variance_ratio_
pca_cumsum_var_ration = np.cumsum(pca.explained_variance_ratio_)
print("PCA cumulative explained variance ratio")
print(pca_cumsum_var_ration)
x = range(len(pca_cumsum_var_ration))
plt.scatter(x, pca_cumsum_var_ration)
###################
PCA cumulative explained variance ratio
[0.6785479  0.96528967 0.99287836 0.99667955 0.9999971  0.99999948
 0.99999985 0.99999993 0.99999995 0.99999996 0.99999998 0.99999998
 0.99999999 0.99999999 0.99999999 1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.        ]

The cumulative variance contribution of the first two components is already close to 1, so the dimensionality reduction looks dramatic; this is largely because PCA is run here on unscaled data, where the large-scale amount features dominate, and the result is not suitable for modeling.
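As a hedged supplement (not in the original post), standardizing the features before PCA removes this scale effect and usually spreads the explained variance over many more components:

# Sketch: scale the features first so no single large-scale feature dominates the components.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
scaled_pca = make_pipeline(StandardScaler(), PCA())
scaled_pca.fit(X1)
print(np.cumsum(scaled_pca.named_steps['pca'].explained_variance_ratio_))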

4.2 Toad: Python-based standardized scorecard model

4.2.1 toad.quality

import toad
toad_quality = toad.quality(df2, target='isDefault', iv_only=True)
# quality() computes metrics such as the IV value, Gini index, entropy, and number of unique values, sorted by IV; with iv_only=True only the IV is calculated
# 	             iv
subGrade	    0.485106565
interestRate	0.463530061
grade	        0.463476859
term	        0.172635079
ficoRangeLow	0.125252862
ficoRangeHigh	0.125252862
dti	            0.072902752
verificationStatus	0.054518912
n14	            0.045646121
loanAmnt	    0.040412211
installment	    0.039444828
title	        0.034895535
issueDate_year	0.034170341
homeOwnership	0.031995853
n2	            0.031194387
n3	            0.031194387
annualIncome	0.030305725
n9	            0.029678353
employmentTitle	0.028019829
revolUtil	    0.025677543

The table above shows the features with IV greater than 0.02; features with an IV below 0.02 contribute almost nothing to the target variable. I tried modeling with only the features above, and the result was worse than using all the features.
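For reference, a small sketch (not from the original post) of turning the IV table into a feature list with the 0.02 cutoff used above:

# Sketch: keep the names of features whose IV is at least 0.02.
high_iv_features = toad_quality[toad_quality['iv'] >= 0.02].index.tolist()
print(len(high_iv_features), high_iv_features)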

4.2.2 toad.selection.select

selected_data, drop_lst = toad.selection.select(df2, target='isDefault', empty=0.5, iv=0.02, corr=0.7, return_drop=True)
# Drop features with an empty rate > 0.5, IV < 0.02, or correlation > 0.7
# (800000, 15): 15 features are kept
# The dropped features are listed below (returned because return_drop=True)
   {'empty': array([], dtype=float64),
   'iv': array(['employmentLength', 'purpose', 'postCode', 'regionCode',
        'delinquency_2years', 'openAcc', 'pubRec', 'pubRecBankruptcies',
        'revolBal', 'totalAcc', 'initialListStatus', 'applicationType',
        'n0', 'n1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n10', 'n11', 'n12',
        'n13', 'issueDate_month', 'earliesCreditLine_year',
        'earliesCreditLine_month'], dtype=object),
   'corr': array(['n9', 'grade', 'n3', 'installment', 'ficoRangeHigh',
          'interestRate'], dtype=object)}

I also tried modeling with only the selected features, but the result was not good either.

4.2.3 PSI: compare the variable distributions of the training set and the test set

psi = toad.metrics.PSI(df2, testA)   # no PSI exceeds 0.25, so the variable distributions are fairly stable
psi.sort_values(0, ascending=False)
############## partial results ##############
revolBal                   2.330739e-01
installment                1.916890e-01
employmentTitle            1.513944e-01
employmentLength           6.919465e-02
annualIncome               4.075954e-02
dti                        2.810131e-02
title                      1.875967e-02

Feature engineering is an indispensable part of machine learning, and it is also a very complex project. I only made a simple attempt.

5. Data modeling

I compared xgboost and catboost, and finally chose catboost. The results of the trial are as follows:

RandomForestClassifier + xgboost: AUC test 0.721 / online 0.71
xgboost + rooms: AUC test 0.722
catboost + rooms: AUC test 0.727
catboost + categorical variables: AUC test 0.736 / online 0.72
catboost + 5-fold CV + 500 iterations: AUC test 0.734 / online 0.728
catboost + 3-fold CV + 300 iterations + added categorical variables: AUC test 0.738 / online 0.7346

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.model_selection import KFold
train = pd.read_csv("/df2.csv")
testA2 = pd.read_csv("/testA.csv")   # the test set is assumed to have been cleaned in the same way (see section 3.3)
# Choose the relevant variables to treat as categorical and convert them to strings
col=['grade','subGrade','employmentTitle','homeOwnership','verificationStatus','purpose','issueDate_year','postCode','regionCode','earliesCreditLine_year','issueDate_month','earliesCreditLine_month','initialListStatus','applicationType']
for i in train.columns:
    if i in col:
        train[i] = train[i].astype('str')
for i in testA2.columns:
    if i in col:
        testA2[i] = testA2[i].astype('str')
# Separate the feature variables from the target variable
X = train.drop(columns='isDefault')
Y = train['isDefault']
# Hold-out split (the KFold loop below re-splits the data and overwrites these variables)
x_train,x_test,y_train,y_test = train_test_split(X,Y,test_size=0.2,random_state=123)
# Model training
clf=CatBoostClassifier(
            loss_function="Logloss",
            eval_metric="AUC",
            task_type="CPU",
            learning_rate=0.1,
            iterations=300,
            random_seed=2022,
            od_type="Iter",
            depth=7)
result = []
mean_score = 0
n_folds = 3
kf = KFold(n_splits=n_folds, shuffle=True, random_state=2022)
for train_index, test_index in kf.split(X):
    x_train = X.iloc[train_index]
    y_train = Y.iloc[train_index]
    x_test = X.iloc[test_index]
    y_test = Y.iloc[test_index]
    clf.fit(x_train, y_train, verbose=300, cat_features=col)
    y_pred = clf.predict_proba(x_test)[:,1]
    print('Validation AUC: {}'.format(roc_auc_score(y_test, y_pred)))
    mean_score += roc_auc_score(y_test, y_pred) / n_folds
    y_pred_final = clf.predict_proba(testA2)[:,-1]   # predict test set A in every fold
    result.append(y_pred_final)
# Model evaluation
print('Mean validation AUC: {}'.format(mean_score))
cat_pre = sum(result)/n_folds    # average the fold predictions on test set A
# Results
0:	total: 3.13s	remaining: 15m 35s
299:	total: 9m 15s	remaining: 0us
Validation AUC: 0.7388007571702323
0:	total: 2.08s	remaining: 10m 20s
299:	total: 9m 45s	remaining: 0us
Validation AUC: 0.7374681864389327
0:	total: 1.73s	remaining: 8m 38s
299:	total: 9m 22s	remaining: 0us
Validation AUC: 0.7402961974320663
Mean validation AUC: 0.7388550470137438

Note: CatBoost handles categorical features efficiently and sensibly; you only need to pass them via the cat_features parameter. The more categorical features you add, the longer training takes, but performance also improves. As shown above, the three cross-validation runs took nearly half an hour even with iterations=300. Because my PC is limited, I did not do much parameter tuning. For predicting the target on big data, cross-validation is essential: the model learns more from different splits of the training data, and averaging the predictions of each fold makes the final result more stable.
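The code above stops at the averaged probabilities in cat_pre. As a hedged sketch (the 'id' and 'isDefault' column names follow the competition's usual submission format and, like the output path, are assumptions here), the averaged fold predictions could be written out like this:

# Sketch: write the averaged test-set probabilities to a submission file.
ids = pd.read_csv("/testA.csv", usecols=['id'])['id']    # re-read the id column from the raw test file
submission = pd.DataFrame({'id': ids, 'isDefault': cat_pre})
submission.to_csv("/submission.csv", index=False)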


Summary

1. On the problem of class imbalance, imbalanced_ensemble is worth a try; the library offers many methods for balancing samples. I have tried OverBoostClassifier, BorderlineSMOTE, and SPE to balance the classes. Oversampling tends to add noise, so the training set looks good while the test set is mediocre, and it also makes predictions for the minority class less accurate; undersampling tends to leave the majority class insufficiently learned. This does not mean sample-balancing methods are not applicable, it just takes continued exploration.

2. For missing values, numeric variables are usually filled with the median and categorical variables with the mode; you can also pick correlated variables and predict the missing values with a regression model, which may bring a pleasant surprise.

3. For this kind of risk-control prediction, if the variables could be screened and supplemented with the experience of business staff, I believe the results would be different.

4. There are many other dimensionality-reduction methods worth trying; PCA is only one of them. Feature engineering is also a vast and complex field that requires continuous learning.

5. As for model parameter tuning, it can moderately improve prediction accuracy; if time allows, parameters can be tested in combination.

6. The process of competing matters more than the result; the knowledge and experience gained will lay a foundation for my future big data work.

Origin: blog.csdn.net/weixin_46685991/article/details/125836476