Scorecard project summary

1. Purpose of Analysis

The credit score A-card (application scorecard) is built to judge whether an applicant will default.

2. Data processing

This step involves not only conventional processing such as column name modification, deduplication, missing values, and outliers, but also WOE binning.

2.1. Standardize the column name format

Some column names need to be modified, such as 'NumberOfTime60-89DaysPastDueNotWorse' and 'NumberOfTime30-59DaysPastDueNotWorse'. Certain symbols in column names cause problems or even errors for some algorithms and methods; for example, the '-' here would be interpreted as a minus sign in a regression formula, so it is replaced with '_'.

train_data.columns=train_data.columns.map(lambda x:x.replace('-','_'))

2.2. Remove duplicate data

Delete duplicate rows. Duplicate data artificially shrinks the standard error of the regression coefficients, which in turn deflates the corresponding p-values.

train_data.drop_duplicates(inplace=True)

2.3. Processing of missing values

2.3.1. View the distribution of missing values

We use the matrix() method of the missingno library to visualize the missing values of each feature.

import missingno as msno
import matplotlib.pyplot as plt

msno.matrix(train_data)
plt.show()

From the figure above, missing values appear in two columns, 'MonthlyIncome' and 'NumberOfDependents'; the former accounts for more than 80% and the latter for far less.
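To get the exact numbers behind the figure, the per-column missing ratios can be checked directly (a minimal sketch):

# Proportion of missing values per column, sorted in descending order
missing_ratio = train_data.isnull().mean().sort_values(ascending=False)
print(missing_ratio[missing_ratio > 0])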

2.3.2. Missing value processing logic

1. The conventional way to handle missing values is to drop variables whose missing rate exceeds 80%. For a credit scorecard, however, every variable is binned anyway, so the missing values here can be kept as a separate bin.
2. For the last column, 'NumberOfDependents', the missing rate is only 2.56%; as a separate bin it would carry too little information, so a single value is used to fill it. We fill with the median.

2.3.3. Fill missing values with a single value

Replace the missing values of the last column, 'NumberOfDependents', with its median.

NOD_median=train_data['NumberOfDependents'].median()
train_data['NumberOfDependents'].fillna(NOD_median,inplace=True)

2.4. Outlier handling

Common ways to handle outliers include deleting the rows that contain them, replacing them with missing values (and then treating them as missing), or capping them (the blocking method). Whether and how to handle outliers should be decided by combining business logic with the algorithm's requirements. In general the blocking method works: extreme values are replaced by less extreme boundary values. They can also be left untouched. We first define a blocking function and call it when needed.

def block(x,lower=True,upper=True):
    # x is the input Series; lower/upper control whether values below the 1% quantile
    # and above the 99% quantile are replaced by those quantile values
    ql=x.quantile(0.01)
    qu=x.quantile(0.99)
    out=x.copy()
    if lower:
        out=out.mask(out<ql,ql)
    if upper:
        out=out.mask(out>qu,qu)
    return out
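When capping is actually needed, the function is applied column by column; the column chosen below is purely illustrative:

# Example: cap the 1%/99% extremes of 'DebtRatio' (illustrative; apply only where business logic allows)
train_data['DebtRatio'] = block(train_data['DebtRatio'])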

2.5. Custom function to wrap up the data cleaning process

We first define a function to encapsulate the data cleaning process. This function can be used to synchronize data cleaning when predicting new data later.

def datacleaning(testdata,include_y=False):
    testdata.columns=testdata.columns.map(lambda x:x.replace('-','_'))
    testdata['NumberOfDependents'].fillna(NOD_median,inplace=True)      # fill missing values in new data with the training-set median
    if include_y:
        testdata["SeriousDlqin2yrs"]=1-testdata.SeriousDlqin2yrs  # recode so that good customers are 1 and bad customers 0, for easier confusion-matrix analysis
    return testdata

3. Feature selection

First, we write a custom helper module, smob.py, which encapsulates the functions for decision-tree-based feature binning, WOE generation and IV calculation, and the model evaluation metrics.
Note: the functions involved are listed in the complete code at the end of the article.

3.1. Generate binning objects for each X

Let's use the 'RevolvingUtilizationOfUnsecuredLines' feature as an example; the other features are handled in the same way.
1. First initialize the variables

y='SeriousDlqin2yrs' # name of the label column
iv_all=pd.Series()   # an empty Series that will hold the IV value of each feature

2. Call the smbin function to find the woe data and IV value of the feature.

RUO=smbin(train_data,y,'RevolvingUtilizationOfUnsecuredLines')

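The smbin implementation lives in smob.py and is not reproduced here. For orientation only, the WOE and IV of an already-binned variable are typically computed along the following lines (a sketch with illustrative names, assuming the project's convention of 1 = good customer, 0 = bad customer):

import numpy as np
import pandas as pd

def woe_iv(df, bin_col, target_col):
    # df: DataFrame; bin_col: column holding the bin label of each record;
    # target_col: 0/1 label with 1 = good, 0 = bad (adjust if your labels are coded the other way)
    total_good = (df[target_col] == 1).sum()
    total_bad = (df[target_col] == 0).sum()
    grouped = df.groupby(bin_col)[target_col]
    stats = pd.DataFrame({'good': grouped.sum(), 'total': grouped.count()})
    stats['bad'] = stats['total'] - stats['good']
    stats['pct_good'] = stats['good'] / total_good
    stats['pct_bad'] = stats['bad'] / total_bad
    stats['woe'] = np.log(stats['pct_good'] / stats['pct_bad'])
    stats['iv'] = (stats['pct_good'] - stats['pct_bad']) * stats['woe']
    return stats, stats['iv'].sum()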

3.2. View the IV values of all features

The IV value is one of the indicators used to screen important features; features with IV below 0.02 are dropped. As a rule of thumb: IV < 0.02, almost no predictive power; 0.02 ≤ IV < 0.1, weak predictive power; 0.1 ≤ IV < 0.3, medium predictive power; IV ≥ 0.3, strong predictive power.

iv_all.sort_values(ascending=False)

From the results, all of these features are more or less informative. Next, we build a preliminary model to see how it performs.

3.3. Generate WOE data

Binning principles: the number of bins should be moderate, neither too many nor too few; each bin should contain a reasonable number of records; the bins should show a clear trend; and adjacent bins should not differ too sharply. We chose supervised binning, with the CART tree as the algorithm.
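For reference, supervised binning with a CART tree usually means fitting a shallow decision tree on a single feature and reading the split thresholds off the fitted tree; a minimal sketch (the parameter choices are illustrative, and this is not the smob.py implementation):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_bins(x, y, max_leaf_nodes=5, min_samples_leaf=0.05):
    # x: 1-D array of one feature (without NaNs); y: 0/1 labels
    # Fit a shallow CART tree on the single feature and collect its split thresholds as bin edges
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes,
                                  min_samples_leaf=min_samples_leaf)
    tree.fit(np.asarray(x).reshape(-1, 1), y)
    thresholds = tree.tree_.threshold[tree.tree_.feature == 0]  # keep internal split nodes only
    return np.sort(thresholds)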
1. Initialize the list.
The binning objects obtained earlier with smbin and smbin_cu are filtered by IV value and placed in a list.

x_list = [RUO,age,NO3059,DebtRatio,MonthlyIncome,NOO,NO90,NRE,NO6089,NOD]

2. Generate the WOE data.
Use the smgen function to generate a new data set from the list above.

data_woe=smgen(train_data,x_list)
data_woe.head()
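smgen itself is defined in smob.py and not shown; conceptually it appends one '_woe' column per feature by replacing each raw value with the WOE of the bin it falls into, roughly as sketched below (illustrative names, assuming bin edges and a bin-to-WOE mapping are available from the binning step):

import pandas as pd

def add_woe_column(df, col, bin_edges, woe_map):
    # bin_edges: cut points covering the full value range; woe_map: {interval -> WOE value}
    bins = pd.cut(df[col], bins=bin_edges)
    df[col + '_woe'] = bins.map(woe_map).astype(float)
    return df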


4. Modeling

4.1. Preliminary modeling

Establish a logistic regression model, fit the data, and view the regression results.

import statsmodels.api as sm

glmodel=sm.GLM(Y_train,X_train,family=sm.families.Binomial()).fit()
glmodel.summary()
# the summary shows the coefficients (coef) and their standard errors (std err)
# P>|z| is the p-value of the significance test; below 0.05 means the variable is significant, i.e. important
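The X_train and Y_train used above are not constructed in the snippets shown; a plausible preparation step, assuming the WOE data from section 3.3 and the project's good = 1 label convention, would look roughly like this:

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Assumed preparation (illustrative): the *_woe columns as predictors plus a constant,
# and the label flipped so that 1 = good customer, 0 = bad customer
X = sm.add_constant(data_woe.iloc[:, -len(x_list):])
Y = 1 - data_woe['SeriousDlqin2yrs']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)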

From the results returned by the model, the IV value helps us screen important features, but it is not an exact criterion: a high IV value in this data does not necessarily mean an important feature. So we next run hypothesis tests on the coefficients, using the p-value (below 0.05 marks an important feature). The 'NumberOfOpenCreditLinesAndLoans_woe' feature has a p-value above 0.05, so we drop it.

4.2. Check for collinearity

Calculate the VIF of each predictor to see whether there is collinearity among the variables. The larger the VIF, the larger the standard error of the coefficient and the wider the confidence interval. As a rule of thumb: VIF in [1, 3), the variable can be used directly; [3, 7), the data needs some treatment before use; [7, 10), the data must be treated before it can be used; VIF ≥ 10 indicates serious collinearity and the variables need to be changed.

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif=[variance_inflation_factor(X_train.iloc[:,1:].values,i)
     for i in range(X_train.shape[1]-1)]
print(pd.Series(dict(zip(X_train.columns[1:],vif))))

From the results, there is no collinearity among the variables. Next, we build the scorecard.

5. Generate the scorecard

5.1. Principle of the credit scoring model

1. For a given score, good and bad customers stand in a certain ratio, the odds: odds = xPctGood / xPctBad.
2. The odds double whenever the score increases by a fixed number of points (pdo, "points to double the odds"); for example, with 45 points the odds double from 50:1 to 100:1.
3. Formula (5.1): Score = Offset + Factor × ln(odds); Formula (5.2): Score + pdo = Offset + Factor × ln(2 × odds). Subtracting (5.1) from (5.2) gives Factor = pdo / ln(2), and hence Offset = Score − Factor × ln(odds).
4. At this point we use the custom function smscale(model, feature list, pdo, score, odds). Combining the two formulas above, we adjust the three parameters pdo, score and odds until the score range fits the business requirement.
The business-specified range here is [300, 843], and the final parameter values are pdo = 43, score = 1151, odds = 10.
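The smscale internals live in smob.py; for orientation, the scaling constants that follow from formulas (5.1) and (5.2) can be computed as below (a sketch of the standard derivation, not necessarily smscale's exact code):

import numpy as np

pdo, score, odds = 43, 1151, 10           # the final parameter values chosen above
factor = pdo / np.log(2)                  # from (5.2) minus (5.1): pdo = Factor * ln(2)
offset = score - factor * np.log(odds)    # from (5.1): Offset = Score - Factor * ln(odds)
print(factor, offset)
# A customer's total score is then Offset + Factor * ln(odds), where ln(odds) is the
# log-odds of being a good customer produced by the logistic regression.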

5.2. Calculate the score of each bin

The left-hand side of the logistic regression equation (the log-odds) is substituted into credit score formula (5.1) from the previous step.
Our custom ScoreCard function saves the score of each bin, and scorecard.ScoreCard retrieves that table.
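How the per-bin points come about: with good coded as 1, the regression's linear predictor beta0 + sum(beta_i * WOE_i) equals ln(odds), so the total score Offset + Factor * ln(odds) can be split evenly across the n features. A sketch of that allocation (an assumed convention, not the exact code behind scorecard.ScoreCard):

def bin_points(beta_i, woe_ij, beta0, factor, offset, n_features):
    # Points contributed by one bin of one feature, with the intercept and the offset
    # spread evenly over the n features; summing over all features reproduces the total score
    return factor * beta_i * woe_ij + (factor * beta0 + offset) / n_features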


6. Model evaluation

We evaluate the model by drawing a KS curve.
1. Using the scorecard object obtained above, compute the test-set scores.

testscore=smscoregen(scorecard,X_test)

2. Draw the KS curve from the true y of the test set and the predicted scores; at the same time obtain the corresponding optimal cutoff and related metrics.

evaluate1=evaluate01(Y_test,testscore['Score'],index='ks',plot='ks')
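evaluate01 comes from smob.py; for reference, the KS statistic itself can be computed from the ROC curve as the maximum gap between the cumulative good and bad rates (a minimal sketch using scikit-learn):

from sklearn.metrics import roc_curve

# KS = maximum vertical distance between the TPR and FPR curves over all score thresholds
fpr, tpr, thresholds = roc_curve(Y_test, testscore['Score'])
ks = (tpr - fpr).max()
best_cutoff = thresholds[(tpr - fpr).argmax()]
print('KS =', ks, 'cutoff =', best_cutoff)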


7. Predicting on new data

1. Read the data first.

test_data=pd.read_csv('data/CreditScore_data/give-me-some-credit-dataset/cs-test.csv',index_col=0)

2. Perform the same cleaning on the data as the training data, that is, use the previously defined cleaning function.

test0=datacleaning(test_data)

3. Generate data containing woe columns from the binning objects obtained previously.

test_woe=smgennew(test0,x_list)
test_woe.head()

4. Extract the WOE columns as the prediction matrix and add a constant column.

T=test_woe.iloc[:,-len(x_list):]
T=sm.add_constant(T)

5. Predict the score of each row of data to generate a total score and a score for each feature.

Tscore=smscoregen(scorecard,T)

6. Classify customers by comparing their scores against the cutoff obtained during training. Good customers are coded 1 and bad customers 0.

test0[y]=(Tscore.Score>evaluate1.cutoff)*1
test0.head()

Check the AUC: evaluate1.AUC = 0.8609421718165055, which shows that the model generalizes well.

8. Summary

1. The focus of this project is binning and identifying important features; data cleaning plays a smaller role.
2. The difficulty of this project is tuning the scaling parameters. First fix the score parameter, then adjust the other two and watch how the scores change. The score parameter is like the centre line of a balance, while pdo and odds are the weights on either side; the target range is reached by adding to or subtracting from those weights.
3. Note that regression algorithms are particularly sensitive to duplicate values, missing values, and collinearity between variables; these must be analysed and handled before modelling.
4. Another detail worth noting: split the data set before handling missing values, and only then do the processing. Whatever is done to the training set must also be applied to the test set, always following the training-set logic (for example, filling with the training-set median).

Attached: the complete code
