Financial Risk Control in Practice: Building a Python Credit Scorecard

Credit risk measurement models include personal credit ratings, corporate credit ratings, and national (sovereign) credit ratings. A personal credit rating system is made up of a series of models; the common ones are the A card (application scorecard), B card (behavior scorecard), C card (collection scorecard), and F card (anti-fraud scorecard). This article walks through the development process of a personal credit rating model, using the well-known Give Me Some Credit dataset from Kaggle.

1. Modeling process

The main steps in developing a credit risk rating model are as follows:
(1) Obtain data, including data on the customers who apply for loans. The data covers many dimensions of each customer: age, gender, income, occupation, number of family members, housing situation, consumption behavior, debt, and so on.
(2) Data preprocessing. The main work includes data cleaning, missing value handling, outlier handling, data type conversion, and so on. The raw data must be transformed, layer by layer, into data that can be modeled.
(3) Exploratory data analysis (EDA) and descriptive statistics, including statistics on the overall data size, the proportion of good and bad customers, data types, variable missing rates, variable frequency histograms, box plots, variable correlation plots, and so on.
(4) Variable selection: use statistical and machine learning methods to screen out the variables that have the most significant impact on default status. Common methods include IV (information value), feature importance, variance, and so on. Variables with a high missing rate, variables with no business interpretation, and variables with no value should generally also be removed.
(5) Model development. The main difficulties in scorecard modeling are WOE binning, score scaling, and variable coefficient calculation. WOE binning is one of the hardest parts of scorecard building and requires solid statistical knowledge and business experience. There are currently more than 50 binning algorithms and no single gold standard; generally the machine bins automatically first, the bins are then adjusted manually, and the model's final performance is tested repeatedly until the best binning scheme is selected.
(6) Model validation: verify the model's discrimination, predictive power, stability, ranking ability, and so on, and write a model evaluation report concluding whether the model can be used. Validation is not a one-time task; it is performed after modeling, before the model goes live, and regularly after launch. Model development and maintenance is a cycle, not a one-off process.
(7) Credit scorecard generation: the scorecard is built from the logistic regression coefficients and WOE values. Scorecards are easy for the business to interpret, have been used for decades, are very stable, and are well liked in the financial industry. The method converts the logistic model's probability output into a standard score, typically on a 300-900 scale.
(8) Scorecard system: build an automated credit scoring system based on the scorecard method. The classic American product FICO offers similar functionality, with Java as its underlying language; Java, Python, and R are currently the popular languages for building automated scorecard systems.
(9) Model monitoring. As time goes by, the model's discrimination metrics, such as KS and AUC, gradually decline, and the model's stability drifts, so a professional monitoring team is needed. When monitoring shows a significant decline in discrimination or a large drift in stability, the model must be redeveloped and iterated. The monitoring team should email model monitoring reports to the relevant teams, especially the development and business teams, on schedule every day.


2. Get data

In a previous article I used the German credit dataset to build a Python credit scorecard model. That dataset's advantage is its small size: it places few demands on computer hardware, which makes it convenient for learners from all backgrounds to study and experiment with.


Credit scoring algorithms estimate the probability of default and are what banks use to decide whether a loan should be granted. The Give Me Some Credit data concerns personal consumer loans; the task is to improve on the state of the art in credit scoring by predicting the probability that someone will experience financial distress within the next two years.

Banks play a vital role in a market economy. They decide who can get finance and on what terms, and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Give Me Some Credit contains 150,000 samples, a data volume much closer to what medium and large financial institutions handle in real projects. The dataset's task is to improve credit scoring by predicting how likely someone is to experience financial distress within the next two years.


The dataset's variables are few but precise, and they serve as a good reference for modeling.


The variables can be grouped as follows:

– Basic attributes: the age of the borrower at the time of application.

– Debt and solvency: the borrower's revolving utilization of unsecured lines (available credit ratio), monthly income, and debt ratio.

– Credit history: the number of times the borrower was 30-59 days past due, 60-89 days past due, and 90 or more days past due within the last two years.

– Property status: the number of open credit lines and loans, and the number of real estate loans or lines of credit.

– Other factors: the number of dependents of the borrower (excluding the borrower).

In the Kaggle competition, the prize was 5,000 US dollars and the model evaluation metric is AUC.


The best AUC reported online for the Give Me Some Credit dataset, in China or abroad, is about 0.85.


However, the AUC reached in our tutorial "Python credit scorecard modeling (with code)" is 0.929, and it can go higher with parameter tuning: far above the AUC of 0.85 reported for Give Me Some Credit in papers found online. Many online papers contain plausible-sounding but actually incorrect accounts of the modeling steps.


If you are curious how an AUC of 0.929 can be achieved on the Give Me Some Credit dataset, please refer to the tutorial "Python credit scorecard modeling (with code)", which also includes an overview of the dataset.


3. Data preprocessing

The main preprocessing work includes data cleaning, missing value handling, outlier handling, data type conversion, and so on. The raw data must be transformed, layer by layer, into data that can be modeled.


3.1 Missing value processing

Missing data in the Give Me Some Credit dataset is not severe: only two variables have missing values, with missing rates of about 2% and 19.8%.


In practice it is very common for data to have many missing values; the missing rate of some variables in the central bank's credit bureau data can be as high as 99%. Missing values cause problems for some analysis and modeling methods, so handling them is usually the first step of credit scorecard development. Common approaches include the following; a minimal sketch of option (2) is shown after the list.
(1) Directly delete the samples that contain missing values.
(2) Fill (impute) the missing values.
(3) Leave the missing values as they are.
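As an illustration of option (2), here is a minimal pandas sketch. It assumes the Kaggle training file cs-training.csv and the dataset's standard column names (MonthlyIncome and NumberOfDependents are the two columns with missing values); the median/mode choices are one reasonable convention, not the only one.

```python
import pandas as pd

# Load the Give Me Some Credit training data (file name as downloaded from Kaggle).
df = pd.read_csv("cs-training.csv", index_col=0)

# Inspect the missing rate of every variable.
print(df.isnull().mean().sort_values(ascending=False))

# Option (2): fill the missing values.
# MonthlyIncome (~19.8% missing) is filled with the median, which is robust to outliers;
# NumberOfDependents (~2% missing) is filled with the most frequent value.
df["MonthlyIncome"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].median())
df["NumberOfDependents"] = df["NumberOfDependents"].fillna(
    df["NumberOfDependents"].mode()[0]
)
```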

3.2 Outlier processing

After handling missing values, we need to check for outliers. Outliers fall into two kinds: statistical outliers and business outliers. Statistical outliers are usually identified with box plots.


Business outliers are judged by whether a value is reasonable given the variable's definition and business common sense. For example, in the Give Me Some Credit dataset some customers have an age of 0; by common sense we treat this as an outlier, since no loan company would lend money to a customer aged 0. A sketch of both checks is shown below.
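A minimal sketch of both checks, again assuming the cs-training.csv file; the 1.5 * IQR rule used for the statistical check is the standard box-plot whisker convention, not something specific to this tutorial.

```python
import pandas as pd

df = pd.read_csv("cs-training.csv", index_col=0)

# Business outlier: an age of 0 is impossible for a borrower, so drop those rows.
df = df[df["age"] > 0]

# Statistical outliers: values outside the 1.5 * IQR box-plot whiskers.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
print(f"{mask.sum()} statistical outliers found in age")
```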


3.3 Data partitioning

After building a model we generally encounter one of three situations: underfitting, a good fit, or overfitting.


In order to validate the model's performance, we need to split the dataset.

First separate the data into features (x) and the target variable (y).

Then split x and y into a training set and a test set, producing four objects: train_x, test_x, train_y, and test_y. A minimal sketch follows.
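A minimal scikit-learn sketch of this split, assuming the target column SeriousDlqin2yrs from the Kaggle file; the 70/30 ratio and the stratification are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cs-training.csv", index_col=0)

# y is the target (1 = serious delinquency within two years), x is everything else.
y = df["SeriousDlqin2yrs"]
x = df.drop(columns=["SeriousDlqin2yrs"])

# Stratify on y so the train and test sets keep the same good/bad ratio.
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0
)
```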


4. EDA exploratory data analysis and descriptive statistics

Because of how the human brain is wired, most people are not sensitive to raw numbers, whereas visualizations are much easier for the brain to digest. That is why data visualization matters; it also makes it easier to report to leaders and decision makers.


Exploratory data analysis (EDA) and descriptive statistics include statistics on the overall data volume, the proportion of good and bad customers, data types, variable missing rates, variable frequency histograms, box plots, variable correlation plots, and so on. Common EDA plots include the histogram, scatter plot, box plot, heat map, and pair plot; a short seaborn sketch follows.
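A short seaborn/matplotlib sketch of these plots, assuming the Kaggle column names:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("cs-training.csv", index_col=0)

sns.histplot(df["age"], bins=30)                     # histogram of age
plt.show()

sns.countplot(x="SeriousDlqin2yrs", data=df)         # good/bad customer counts
plt.show()

sns.boxplot(x="SeriousDlqin2yrs", y="age", data=df)  # age box plot by good/bad
plt.show()

sns.heatmap(df.corr(), cmap="coolwarm")              # correlation heat map
plt.show()

# Pair plot of all variables (slow on 150,000 rows, so sample first).
sns.pairplot(df.sample(2000, random_state=0).dropna(), hue="SeriousDlqin2yrs")
plt.show()
```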

Histogram of the age variable in the Give Me Some Credit dataset.


The histogram of the target variable in the Give Me Some Credit dataset shows that the classes are very imbalanced: good customers outnumber bad customers by roughly 15 to 1.


Histogram of the number of dependents (family size) variable in the Give Me Some Credit dataset.


The pair plot of all variables in the Give Me Some Credit dataset makes a great deal of information visible at a glance.


The correlation heat map of all variables in the Give Me Some Credit dataset shows that six pairs of variables are very highly correlated; this deserves attention when screening variables.


Box plots of the age variable in the Give Me Some Credit dataset, grouped by good and bad customers, show that the median age of good customers is higher than that of bad customers.


5. Variable selection

Variable selection uses statistical and machine learning methods to screen out the variables that have the most significant impact on default status. Common methods include IV, feature importance, variance, and so on. Variables with a high missing rate, variables with no business interpretation, and variables with no value should generally also be removed.

The feature importance plot produced by the CatBoost tree ensemble in the "Python credit scorecard modeling (with code)" tutorial clearly shows that RevolvingUtilizationOfUnsecuredLines (the revolving utilization, or available credit ratio) is the most important variable; the longer a variable's bar in the plot, the greater its importance.


The IV values calculated in the "Python credit scorecard modeling (with code)" tutorial point the same way: RevolvingUtilizationOfUnsecuredLines has the highest IV.


Feature importance and IV therefore lead to the same conclusion: RevolvingUtilizationOfUnsecuredLines is the most important variable. A rough IV calculation sketch is shown below.
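As a rough illustration of how an IV value can be computed, here is a sketch that uses simple equal-frequency bins (a real scorecard would use the tuned bins from the next section); the column names and the helper function name are assumptions for this example.

```python
import numpy as np
import pandas as pd

def iv_of_variable(df, var, target, bins=10):
    """Rough IV: equal-frequency bins, then IV = sum((%good - %bad) * WOE)."""
    binned = pd.qcut(df[var], q=bins, duplicates="drop")
    grouped = df.groupby(binned, observed=True)[target].agg(total="count", bad="sum")
    grouped["good"] = grouped["total"] - grouped["bad"]
    # Share of all goods / all bads that fall into each bin (0.5 avoids log(0)).
    dist_good = (grouped["good"] + 0.5) / grouped["good"].sum()
    dist_bad = (grouped["bad"] + 0.5) / grouped["bad"].sum()
    woe = np.log(dist_good / dist_bad)
    return ((dist_good - dist_bad) * woe).sum()

# Example (hypothetical call):
# df = pd.read_csv("cs-training.csv", index_col=0)
# print(iv_of_variable(df, "RevolvingUtilizationOfUnsecuredLines", "SeriousDlqin2yrs"))
```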

6. Model development

The main difficulties in model development are WOE binning, score scaling, and variable coefficient calculation. WOE binning is one of the hardest parts of the scorecard and requires solid statistical knowledge and business experience. There are currently more than 50 binning algorithms and no unified gold standard; generally the machine bins automatically first, the bins are adjusted manually, and the model's final performance is tested repeatedly to select the best binning scheme.

"Python credit scorecard modeling (with code)" explains the principles of k-means binning, equal-frequency binning, equal-width (equidistant) binning, chi-square binning, and decision tree binning, with Python code for each, and also discusses how to choose the most suitable binning method for different needs.


Binning methods are divided into supervised and unsupervised ones. The k-means clustering algorithm is an iterative algorithm: the data are to be split into K groups; K objects are randomly selected as the initial cluster centers; the distance between every object and every center is computed and each object is assigned to its nearest center; the centers and the objects assigned to them form clusters; after each assignment pass, each cluster's center is recomputed from the objects currently in it. This repeats until a termination condition is met: no (or only a minimum number of) objects are reassigned to different clusters, no (or only a minimum number of) centers change, or the sum of squared errors reaches a local minimum. A sketch of k-means binning is shown below.
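A minimal sketch of unsupervised k-means binning for a single numeric variable; taking the cut points as the midpoints between the sorted cluster centers is one reasonable convention, not necessarily the tutorial's exact implementation.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def kmeans_bins(series, k=5):
    """Cluster the 1-D values with k-means and cut at the midpoints
    between the sorted cluster centers."""
    values = series.dropna().to_numpy().reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values)
    centers = np.sort(km.cluster_centers_.ravel())
    edges = [-np.inf] + list((centers[:-1] + centers[1:]) / 2) + [np.inf]
    return pd.cut(series, bins=edges)

# Example (hypothetical): df["age_kmeans_bin"] = kmeans_bins(df["age"], k=5)
```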


The legendary "optimal binning" is decision tree binning.

The steps of the decision tree binning algorithm are:

Step 1: Train a decision tree of limited depth (2, 3, or 4) on the variable to be discretized, with the target as the label.

Step 2: Replace the original variable values with the probabilities returned by the tree. All observations that fall in the same leaf receive the same probability, so substituting probabilities is equivalent to grouping the observations by the cut-offs the tree has found.

The advantages and disadvantages of the decision tree binning algorithm are:

Advantages:

  • The probability predictions returned by the decision tree are monotonically related to the target.

  • The new bins show reduced entropy: the observations within each bin are more similar to one another than to observations in other bins.

  • The tree finds the bins automatically.

Disadvantages:

  • It may lead to overfitting.

  • Some tuning of the tree parameters may be needed to obtain the optimal split (e.g. depth, minimum number of samples per leaf, maximum number of leaves, and minimum information gain), which can be time-consuming.


(Decision tree binning visualization)
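A minimal sketch of the idea, assuming a single numeric variable with no missing values; it uses the shallow tree's split thresholds directly as bin edges, which is equivalent to the probability-substitution view described in Step 2.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def tree_bins(x, y, max_depth=3):
    """Fit a shallow decision tree on one variable and use its split
    thresholds as bin edges (x must contain no missing values)."""
    tree = DecisionTreeClassifier(max_depth=max_depth,
                                  min_samples_leaf=0.05,
                                  random_state=0)
    tree.fit(x.to_frame(), y)
    # Leaf nodes are marked with feature == -2; internal nodes hold the thresholds.
    thresholds = tree.tree_.threshold[tree.tree_.feature != -2]
    edges = [-np.inf] + list(np.unique(thresholds)) + [np.inf]
    return pd.cut(x, bins=edges)

# Example (hypothetical): df["age_tree_bin"] = tree_bins(df["age"], df["SeriousDlqin2yrs"])
```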



Equal-width (equidistant) binning can be used for variables like age; a pandas sketch of equal-width and equal-frequency binning follows.
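A pandas sketch of both simple unsupervised schemes (the number of bins is an illustrative choice):

```python
import pandas as pd

df = pd.read_csv("cs-training.csv", index_col=0)

# Equidistant (equal-width) binning: every bin covers the same age range.
df["age_equal_width"] = pd.cut(df["age"], bins=8)

# Equal-frequency binning: every bin holds roughly the same number of borrowers.
df["age_equal_freq"] = pd.qcut(df["age"], q=8, duplicates="drop")

print(df["age_equal_width"].value_counts(sort=False))
```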


After binning is complete, the binned values are converted into WOE values, and the model is finally fitted with a logistic regression algorithm, as sketched below.
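A minimal sketch of the WOE step, assuming a 0/1 target and the sign convention WOE = ln(%good / %bad); the final commented line only indicates where the WOE-encoded matrix would be fed into logistic regression.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def woe_encode(binned, y):
    """Replace each bin label with its WOE value, WOE = ln(%good / %bad)."""
    stats = pd.crosstab(binned, y)          # rows = bins, columns = 0 (good) / 1 (bad)
    dist_good = (stats[0] + 0.5) / stats[0].sum()
    dist_bad = (stats[1] + 0.5) / stats[1].sum()
    woe = np.log(dist_good / dist_bad)
    return binned.astype(object).map(woe)

# After every variable has been binned and WOE-encoded into a matrix X_woe:
# model = LogisticRegression().fit(X_woe, y)
```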

7. Model Validation

After fitting the logistic regression model, we need model validation: verify whether the model's discrimination, predictive power, stability, ranking ability, and other indicators are acceptable, and produce a model evaluation report concluding whether the model can be used. Validation is not a one-time task; it is performed after modeling, before the model goes live, and regularly after launch. Model development and maintenance is a cycle, not a one-off process.

As time goes by, the model's discrimination metrics, such as KS and AUC, gradually decline, and the model's stability drifts. When discrimination drops significantly or stability shifts a lot, the model must be redeveloped and iterated.

The AUC of the model trained in the "Python credit scorecard modeling (with code)" tutorial is 0.929. The detailed performance is as follows:

model accuracy is: 0.9406307593547452

model precision is: 0.9060132575757576

model sensitivity is: 0.6077497220898841

f1_score: 0.7274973861800208

AUC: 0.9290751730536397

good classifier

gini 0.8581503461072795

ks value:0.7107

This far exceeds the AUC of 0.85 reported for the Give Me Some Credit dataset in papers found online. The sketch below shows how such metrics can be computed with scikit-learn.
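Metrics of this kind can be computed with scikit-learn; a sketch, assuming y_true holds the actual 0/1 labels, y_pred the predicted labels, and y_prob the predicted bad-customer probabilities:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score, roc_curve)

def evaluate(y_true, y_pred, y_prob):
    auc = roc_auc_score(y_true, y_prob)
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    print("model accuracy is:", accuracy_score(y_true, y_pred))
    print("model precision is:", precision_score(y_true, y_pred))
    print("model sensitivity is:", recall_score(y_true, y_pred))
    print("f1_score:", f1_score(y_true, y_pred))
    print("AUC:", auc)
    print("gini:", 2 * auc - 1)              # Gini coefficient = 2 * AUC - 1
    print("ks value:", (tpr - fpr).max())    # KS = max distance between TPR and FPR
```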


8. The Birth of the Scorecard


The scorecard is generated from the logistic regression coefficients and the WOE values. It is easy for the business to interpret, has been used for decades, is very stable, and is well liked in the financial industry. The method converts the logistic model's probability output into a standard score, typically on a 300-900 scale. Most domestic credit scorecards emulate the American FICO score.

Individuals with FICO scores of 800 or above have an exceptional credit history. People with such high scores have typically held multiple credit lines over the years, never exceeded any credit limit, and paid off all debts on time.

A FICO score in the mid-to-high 700s is a good score. Individuals in this range borrow and spend wisely and make payments on time. Like those above 800, they tend to get credit more easily and often pay much lower interest rates.

The most common scores fall between 650 and 750. Individuals in this range have fairly good credit but may occasionally pay late. They usually have no trouble getting loans, although they may pay a slightly higher interest rate.

The last range worth noting is a score of 599 or lower, which is considered poor credit and is usually the result of multiple late payments, missed payments, or debts sent to a collection agency. Individuals with such FICO scores often find it difficult, if not impossible, to obtain any form of credit.


In the FICO score distribution, the "very poor" band (300-579) has the lowest share, only about 17%, while the "good" band (670-739) has the highest share, about 21.5%.

"Python credit scorecard modeling (with code)" has detailed chapters on scorecard generation, fully explaining terms such as PDO, theta0, P0, A, B, odds, WOE, and IV. A score-scaling sketch based on these quantities follows.


9. Scorecard automatic scoring system

Building on the above, we can generate an automated scoring system that outputs, for each loan application, the true good/bad label, the predicted label, the predicted bad-customer probability, and the scaled credit score, as sketched below.
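A sketch of assembling that output table, assuming a fitted logistic regression model, the test_x / test_y split from section 3.3, and the prob_to_score helper sketched in section 8 (all names here are illustrative):

```python
import pandas as pd

def score_applications(model, X, y_true, prob_to_score):
    """One row per application: true label, predicted label,
    predicted bad probability, and the scaled credit score."""
    p_bad = model.predict_proba(X)[:, 1]
    return pd.DataFrame({
        "y_true": y_true.values,
        "y_pred": (p_bad >= 0.5).astype(int),
        "p_bad": p_bad,
        "score": prob_to_score(p_bad),
    })

# Example (hypothetical): score_applications(model, test_x, test_y, prob_to_score)
#                         then save the result with .to_csv("scorecard_output.csv", index=False)
```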


Based on the credit scorecard method, we can build a computerized automatic credit scoring system. The classic American product FICO offers similar functionality, with Java as its underlying language; Java, Python, and R are the popular languages for building automated scorecard systems today. When the data volume is large, building such a system is not easy and requires continuous testing and updating by a professional team. Python and R are open-source languages whose packages are upgraded regularly; without a professional team to maintain it, the system will run into serious problems over time.

10. Model Monitoring

As time goes by, the model's discrimination metrics, such as KS and AUC, gradually decline, and the model's stability drifts, so a professional model monitoring team is needed. When monitoring shows a significant decline in discrimination or a large drift in stability, the model must be redeveloped and iterated. The monitoring team should email model monitoring reports to the relevant teams, especially the development and business teams, on schedule every day.

KS is a key monitoring indicator: when the model's KS falls below 0.2, it can barely separate good and bad customers, and the model needs to be iterated again.


The bad rate is another monitored indicator: when the bad rate suddenly rises, management gets very nervous, because it means a large share of the loans cannot cover their cost.


PSI (population stability index) is also monitored: when PSI exceeds 0.25, the model is considered highly unstable and needs to be redeveloped. A sketch of the PSI calculation follows.
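A minimal sketch of the PSI calculation, comparing the score distribution at development time (expected) with a recent one (actual); the ten equal-width bins and the 0.1 / 0.25 thresholds are the common rule of thumb, not a universal standard.

```python
import numpy as np
import pandas as pd

def psi(expected_scores, actual_scores, bins=10):
    """PSI = sum over bins of (%actual - %expected) * ln(%actual / %expected)."""
    edges = np.histogram_bin_edges(expected_scores, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch scores outside the baseline range
    expected_pct = pd.cut(pd.Series(expected_scores), edges).value_counts(
        normalize=True, sort=False) + 1e-6
    actual_pct = pd.cut(pd.Series(actual_scores), edges).value_counts(
        normalize=True, sort=False) + 1e-6
    return ((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)).sum()

# Rule of thumb: PSI < 0.1 stable, 0.1-0.25 needs attention, > 0.25 re-develop the model.
```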


Summary

This concludes the main workflow of a Python-based credit scorecard model, but real scorecard modeling involves many details, and online descriptions of those details are often cursory or even wrong. For example, if a variable's missing rate reaches 80%-90%, should it simply be deleted? If the correlation between two variables is as high as 0.8, should one of them be removed? Experienced modelers need to strike a balance between mathematical theory, the real needs of the business line, and the results of computer experiments, rather than looking at the problem from a single angle. It is like experienced surgeons not necessarily following textbook theory to the letter. There are many open debates in statistics, machine learning, and artificial intelligence, with no complete consensus, so keep thinking independently as you study in order to keep improving your data science knowledge.

That concludes this introduction to the Python credit scorecard model built on the Give Me Some Credit dataset.
Reference material: <Python Financial Risk Control Scorecard Model and Data Analysis Micro-Professional Course (Enhanced Edition)>

Copyright statement: this article comes from the official account (python risk control model) and may not be plagiarized without permission. It follows the CC 4.0 BY-SA license; please include the original source link and this statement when reprinting.


Original article: blog.csdn.net/toby001111/article/details/132675911