Introduction to Credit Scorecards

Background

With the rise of fintech start-ups, a host of new consumer lenders have emerged over the past five years to compete with traditional banks. They typically target segments that banks consider too small, or that banks have had to pull back from because of the loan losses incurred during the financial crisis. In plain terms, these consumer finance companies target the subprime segment of the banks' market.

One of the main competitive advantages of these new consumer finance companies is technology: IT infrastructure, machine learning modeling, and AI capabilities such as face and voice recognition.

Large banks are relatively traditional and conservative, with little incentive for technological change. Their main customers already have good credit, and the banks promote credit cards heavily to encourage those customers to spend ahead of their income.

For example, the UK business lender iwoca uses information from affiliated company accounts, VAT returns, and even sales transactions on eBay or Amazon to decide on new loans. Lendable, a British consumer lender, can complete a personal credit loan in minutes, instead of the days or weeks a traditional bank's approval process takes.

British lenders such as iwoca and Lendable, and Chinese consumer finance companies such as China Merchants Union, Industrial Consumer Finance, Jiebei, Weiweidai, and Paipaidai, all use risk control systems similar to the one described below to approve loans automatically for most customers.

At the heart of these systems is a fast, automated decision engine driven by a credit risk model.

What is a credit scorecard

Most of us are familiar with the concept of a credit score: a numerical value that represents an individual's creditworthiness. Credit institutions such as banks all have credit models. These models read various pieces of information about an applicant, such as salary, credit history, age, gender, and the number of existing loans, are trained on historical data, and finally output the customer's credit score through a mathematical calculation. A credit scorecard model can output either a credit score or a probability of default.

A credit scorecard is one such credit model, and one of the most common. It is based on the logistic regression algorithm, which is relatively easy to understand and has been around for decades, so the development process is standard and well known.

Credit scorecards also come in several sub-types; the common ones are the A card (application scorecard), B card (behavior scorecard), and C card (collection scorecard).

It is important to note, however, that score ranges vary from institution to institution, and the cut-off point below which an application is rejected varies from lender to lender, and may even vary within the same lender across different products.

Build a credit scorecard

The target variable is usually binary and, depending on the data, can be 0 for good customers (those to whom the lender is willing to lend) and 1 for bad customers, such as those who default or go 90 days overdue (those the lender should decline).

Step 1: Data Exploration and Cleaning

These are necessary steps in any model-fitting exercise, but since they are not specific to building a credit scorecard, we skip the details here. Don't forget to split the dataset into training and test sets.
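
As a minimal sketch of the split (the column names and values here are made up for illustration), it could look like this:

import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical cleaned dataset with a binary target (0 = good, 1 = bad)
df = pd.DataFrame({'age':    [23, 41, 35, 52, 29, 60],
                   'income': [3200, 5400, 4100, 7600, 2800, 6900],
                   'target': [1, 0, 0, 0, 1, 0]})

X = df.drop(columns='target')
y = df['target']

# stratify keeps the good/bad ratio roughly equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)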

Step 2: Data Transformation - Weight of Evidence Approach

Next we transform all independent variables (such as age and income) using the Weight of Evidence (WoE) method. The method measures how strongly each group separates good risk from bad, based on the ratio of good applicants to bad applicants at each group level, and attempts to find a monotonic relationship between the independent variable and the target variable.

Transformation steps for continuous variables:

  1. Divide the data into bins, usually about 10 and at most 20 (more bins are not always better, nor are fewer; the right number depends on the characteristics of the dataset)
  2. Calculate the percentage of good events and the percentage of bad events in each bin
  3. Calculate WoE for each bin as the natural logarithm of the ratio of the two distributions: WoE = ln(good distribution / bad distribution)
  4. Replace raw data with calculated WOE values

If the independent variable is categorical, skip step 1 above and perform the remaining steps.

Example in Python:

After putting your data into bins and grouping the good and bad counts for each bin, your data might look like the example below. WoE can then be calculated for each bin with the code that follows. Negative values indicate a higher proportion of bad applicants than good applicants in that subgroup.

import pandas as pd
import numpy as np


# dummy data as example
age = ['18 to 25','26 to 35','36 to 45','46 to 60','>= 60']
df = pd.DataFrame(age, columns=['Age Group'])
df['counts'] = [31234, 30293, 29384, 30192, 27394]
df['bad'] = [4920, 4123, 3784, 2608, 1479]
df['good'] = df.counts - df.bad

# calculate WoE per bin: WoE = ln(distribution of good / distribution of bad)
df['total_distri'] = df.counts/sum(df.counts)  # bin's share of all applicants
df['bad_distri'] = df.bad/sum(df.bad)          # bin's share of all bad applicants
df['good_distri'] = df.good/sum(df.good)       # bin's share of all good applicants
df['WOE'] = np.log(df.good_distri / df.bad_distri)
df['WOE%'] = df.WOE * 100                      # WoE expressed in percentage points

At the end of the transformation, if you started with 20 independent variables, you will now have 20 WOE_variablename columns ready for the next step.
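
To actually replace the raw data with WoE values (step 4 above), one simple approach is to map each bin's WoE back onto the original records. A minimal sketch continuing the example (the applications dataframe is hypothetical):

# map each raw age group to its WoE value
woe_map = dict(zip(df['Age Group'], df.WOE))

# hypothetical raw records to be transformed
applications = pd.DataFrame({'Age Group': ['18 to 25', '>= 60', '36 to 45']})
applications['WOE_age'] = applications['Age Group'].map(woe_map)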

Benefits of using the WoE transformation:
  • It helps establish a linear relationship with the log odds used in logistic regression
  • It can handle missing values, as they can be binned together as their own group
  • It can handle outliers or extreme values, as they are also binned, and the values fed into the model are the WoE-transformed values rather than the raw extremes
  • It also handles categorical values, so dummy variables are not needed

Step 3: Feature Selection Using Information Values

Information Value (IV) comes from information theory and measures the predictive power of an independent variable, which makes it useful for feature selection. It is good practice to perform feature selection to decide whether all features need to be included in the model; most of the time we want to eliminate weak features, since simpler models are usually preferred. IV is computed from the same bins as WoE:

IV = Σ (good distribution_i − bad distribution_i) × WoE_i

According to Siddiqi (2006), by convention the value of the IV statistic in credit scoring can be interpreted as follows:

  • IV < 0.02: not predictive
  • 0.02 to 0.1: weak predictive power
  • 0.1 to 0.3: medium predictive power
  • 0.3 to 0.5: strong predictive power
  • IV > 0.5: suspicious, too good to be true

According to Mr. Toby's years of modeling experience, variables with IV > 0.5 are rare in A cards, but such variables often appear in B and C cards. Of course, the plausibility of these strong variables still needs to be reviewed.

Example in Python:

Continuing the previous example, we calculate the IV for age to be about 0.15, which means age has "medium predictive power", so we proceed with model fitting. Variables with IV scores below 0.02 should be removed.
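
A minimal sketch of that calculation using the standard formula, continuing from the WoE dataframe above:

# IV contribution of each bin, then sum over bins
df['IV'] = (df.good_distri - df.bad_distri) * df.WOE
iv_age = df['IV'].sum()
print(round(iv_age, 2))  # ~0.15 with the dummy data above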

Remarks: According to Mr. Toby's years of modeling experience, the age variable takes different IV values in different datasets, and in many datasets its IV is very low. This is just an example; readers should not memorize the number by rote.

Step 4: Model Fitting and Interpreting Results

Now we fit a logistic regression model on the WoE-transformed training dataset.
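
A minimal sketch with scikit-learn (the WoE columns and data here are made up for illustration; statsmodels is an equally common choice when p-values are needed):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical WoE-transformed training data
rng = np.random.default_rng(0)
X_train_woe = pd.DataFrame({'WOE_age': rng.normal(size=200),
                            'WOE_income': rng.normal(size=200)})
y_train = (rng.random(200) < 0.2).astype(int)   # 0 = good, 1 = bad

model = LogisticRegression().fit(X_train_woe, y_train)
alpha = model.intercept_[0]                                   # intercept α
betas = pd.Series(model.coef_[0], index=X_train_woe.columns)  # coefficients β_i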

When scaling the model into a scorecard, we need the logistic regression coefficients from the fitted model and the transformed WoE values, and we need to convert the model output from log-odds units to a points system.

For each independent variable Xi, its corresponding score is:

Score_i = (β_i × WoE_i + α/n) × Factor + Offset/n

where:
β_i — logistic regression coefficient of variable X_i
α — logistic regression intercept
WoE_i — weight of evidence of variable X_i
n — number of independent variables in the model
Factor, Offset — the scaling parameters, where

  • Factor = pdo / ln(2), with pdo the number of points needed to double the odds
  • Offset = Target Score − (Factor × ln(Target Odds))

For the example above, we set the target score to 600, corresponding to good-to-bad odds of 50 to 1, with an increase of 20 points doubling the odds (pdo = 20). Note that the choice of scaling does not affect the predictive strength of the scorecard.
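
Plugging those numbers in, as a quick check in Python:

import numpy as np

# scaling parameters for target score 600 at odds 50:1, pdo = 20
pdo, target_score, target_odds = 20, 600, 50
factor = pdo / np.log(2)                              # ≈ 28.85
offset = target_score - factor * np.log(target_odds)  # ≈ 487.12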

The final total score is the sum of the scores for all independent variables, based on their input values. Lenders then evaluate incoming applications by comparing the modeled total score against a cut-off point (set with the help of other credit default models).

Total Score = Σ Score_i
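
Putting the pieces together, a sketch that scores one applicant, reusing the fitted coefficients and scaling parameters from the sketches above:

# per-variable points: Score_i = (β_i × WoE_i + α/n) × Factor + Offset/n
n = len(betas)                    # number of variables in the model
woe_values = X_train_woe.iloc[0]  # WoE inputs for one example applicant

scores = (betas * woe_values + alpha / n) * factor + offset / n
total_score = scores.sum()        # Total Score = Σ Score_i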

This concludes our introduction to the credit scorecard. In practice there are many more details to the model; given the limited space, this can only be a brief overview. Real-world model development is not linear but a complex, iterative process.

If you are interested in the finer details of credit scorecards, you are welcome to bookmark "Python Credit Score Card Modeling (with Code)", which covers the remaining questions about credit scorecards in depth.

Copyright statement: This article comes from the official account (python risk control model); reproduction without permission is prohibited. It follows the CC 4.0 BY-SA license; please attach the original source link and this statement when reprinting.

Origin: blog.csdn.net/toby001111/article/details/128719650