Practical case: Using machine learning to predict whether a user will default on a loan

Hello everyone! Recently a screenshot asking to "delay mortgage payments because of the epidemic" circulated on the Internet, and it immediately set off heated discussion among netizens.

A loan default occurs when a borrower takes money from a lender and fails to repay it on time. Delinquent loans are not only reported to credit bureaus; the borrower may also risk being sued.

To better manage and control risk, lending institutions usually predict whether a user will default based on the user's information. Today I will use an example dataset to walk through how loan-default prediction works. Original content is not easy to produce; if you like this article, remember to like, follow, and bookmark. The full dataset and code are available at the end of the article.

[Note] A technical exchange group is provided at the end of the article

Data

The data includes each customer's demographics and a target variable indicating whether they defaulted on their loan.

First, we import the library and load the dataset.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_theme(style="darkgrid")
data = pd.read_csv("/kaggle/input/loan-prediction-based-on-customer-behavior/Training Data.csv")
data.head()

Exploring the Dataset

First, let's understand the data and its distribution.

rows, columns = data.shape
print('Rows:', rows)
print('Columns:', columns)

output

Rows: 252000
Columns: 13

We see that the data has 252,000 rows and 13 columns, of which 12 are input features and 1 is the output feature.

Now we check the data type and other information.

data.info()

output

RangeIndex: 252000 entries, 0 to 251999

Data columns (total 13 columns)
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Id                 252000 non-null  int64 
 1   Income             252000 non-null  int64 
 2   Age                252000 non-null  int64 
 3   Experience         252000 non-null  int64 
 4   Married/Single     252000 non-null  object
 5   House_Ownership    252000 non-null  object
 6   Car_Ownership      252000 non-null  object
 7   Profession         252000 non-null  object
 8   CITY               252000 non-null  object
 9   STATE              252000 non-null  object
 10  CURRENT_JOB_YRS    252000 non-null  int64 
 11  CURRENT_HOUSE_YRS  252000 non-null  int64 
 12  Risk_Flag          252000 non-null  int64 
dtypes: int64(7), object(6)
memory usage: 25.0+ MB

We see that seven features are numeric (int64) and six are strings (object), so the string columns are likely categorical features.

In data science, numerical data is called "quantitative data" and categorical data is called "qualitative data".
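For example, we can separate the two kinds of columns with pandas (a small illustrative sketch; the variable names are my own):

# Quantitative (numeric) vs. qualitative (categorical) columns
numeric_cols = data.select_dtypes(include='number').columns.tolist()
categorical_cols = data.select_dtypes(include='object').columns.tolist()
print("Quantitative:", numeric_cols)
print("Qualitative:", categorical_cols)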

Let's check whether there are any missing values in the data.

data.isnull().sum()

output

Id                   0
Income               0
Age                  0
Experience           0
Married/Single       0
House_Ownership      0
Car_Ownership        0
Profession           0
CITY                 0
STATE                0
CURRENT_JOB_YRS      0
CURRENT_HOUSE_YRS    0
Risk_Flag            0
dtype: int64

There are no missing values in the data. Next, let's check the column names.

data.columns

output

Index(['Id', 'Income', 'Age', 'Experience', 'Married/Single',
       'House_Ownership', 'Car_Ownership', 'Profession', 'CITY', 'STATE',
       'CURRENT_JOB_YRS', 'CURRENT_HOUSE_YRS', 'Risk_Flag'],
      dtype='object')

We get the names of the data features.

Analyzing Numeric Columns

First, we start our analysis with numerical data.

data.describe()

output

Now, we examine the data distribution.

data.hist(figsize=(22, 20))
plt.show()



Now, we check the count of the target variable.

data["Risk_Flag"].value_counts()

output

0    221004
1     30996
Name: Risk_Flag, dtype: int64

Only a small fraction of borrowers (30,996 of 252,000) defaulted on their loans, so the target classes are imbalanced.
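To quantify the imbalance, we can check the class proportions (a quick illustrative check):

# Roughly 87.7% non-default vs. 12.3% default
print(data["Risk_Flag"].value_counts(normalize=True) * 100)

We will deal with this imbalance with SMOTE in the modeling section.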

Now, we draw the correlation heatmap.

fig, ax = plt.subplots(figsize=(12, 8))
corr_matrix = data.corr(numeric_only=True)  # restrict to numeric columns (required on pandas >= 2.0)
sns.heatmap(corr_matrix, cmap="flare", annot=True, ax=ax, annot_kws={"size": 14})
plt.show()

Analyzing Categorical Features

Now, we move on to analyzing categorical features.

First, we define a function to create the plot.

def categorical_valcount_hist(feature):
    # Print the category counts, then draw a count plot for the column
    print(data[feature].value_counts())
    fig, ax = plt.subplots(figsize=(6, 6))
    sns.countplot(x=feature, ax=ax, data=data)
    plt.show()

We start by examining the counts of married vs. single customers.

categorical_valcount_hist("Married/Single")

So, most people are single.

Now we check the distribution of house ownership.

categorical_valcount_hist("House_Ownership")

Now, let's check the number of states.

print( "Total categories in STATE:", len( data["STATE"].unique() ) )
print()
print( data["STATE"].value_counts() )

output

Total categories in STATE: 29
Uttar_Pradesh        28400
Maharashtra          25562
Andhra_Pradesh       25297
West_Bengal          23483
Bihar                19780
Tamil_Nadu           16537
Madhya_Pradesh       14122
Karnataka            11855
Gujarat              11408
Rajasthan             9174
Jharkhand             8965
Haryana               7890
Telangana             7524
Assam                 7062
Kerala                5805
Delhi                 5490
Punjab                4720
Odisha                4658
Chhattisgarh          3834
Uttarakhand           1874
Jammu_and_Kashmir     1780
Puducherry            1433
Mizoram                849
Manipur                849
Himachal_Pradesh       833
Tripura                809
Uttar_Pradesh[5]       743
Chandigarh             656
Sikkim                 608
Name: STATE, dtype: int64

Note that "Uttar_Pradesh[5]" appears to be a dirty duplicate of "Uttar_Pradesh" left over in the raw data.

Now, we check the number of professions.

print( "Total categories in Profession:", len( data["Profession"].unique() ) )
print()
data["Profession"].value_counts()

output

Total categories in Profession: 51
Physician                     5957
Statistician                  5806
Web_designer                  5397
Psychologist                  5390
Computer_hardware_engineer    5372
Drafter                       5359
Magistrate                    5357
Fashion_Designer              5304
Air_traffic_controller        5281
Comedian                      5259
Industrial_Engineer           5250
Mechanical_engineer           5217
Chemical_engineer             5205
Technical_writer              5195
Hotel_Manager                 5178
Financial_Analyst             5167
Graphic_Designer              5166
Flight_attendant              5128
Biomedical_Engineer           5127
Secretary                     5061
Software_Developer            5053
Petroleum_Engineer            5041
Police_officer                5035
Computer_operator             4990
Politician                    4944
Microbiologist                4881
Technician                    4864
Artist                        4861
Lawyer                        4818
Consultant                    4808
Dentist                       4782
Scientist                     4781
Surgeon                       4772
Aviator                       4758
Technology_specialist         4737
Design_Engineer               4729
Surveyor                      4714
Geologist                     4672
Analyst                       4668
Army_officer                  4661
Architect                     4657
Chef                          4635
Librarian                     4628
Civil_engineer                4616
Designer                      4598
Economist                     4573
Firefighter                   4507
Chartered_Accountant          4493
Civil_servant                 4413
Official                      4087
Engineer                      4048
Name: Profession, dtype: int64

Data Analysis

Now we start examining the relationships between the features and the target variable, beginning with income.

sns.boxplot(x="Risk_Flag", y="Income", data=data)


Now we look at the relationship between the target variable and age.

sns.boxplot(x ="Risk_Flag",y="Age" ,data = data)

sns.boxplot(x ="Risk_Flag",y="Experience" ,data = data)

sns.boxplot(x ="Risk_Flag",y="CURRENT_JOB_YRS" ,data = data)

sns.boxplot(x ="Risk_Flag",y="CURRENT_HOUSE_YRS" ,data = data)

fig, ax = plt.subplots( figsize = (8,6) )
sns.countplot(x='Car_Ownership', hue='Risk_Flag', ax=ax, data=data)

fig, ax = plt.subplots( figsize = (8,6) )
sns.countplot( x='Married/Single', hue='Risk_Flag', data=data )

fig, ax = plt.subplots( figsize = (10,8) )
sns.boxplot(x = "Risk_Flag", y = "CURRENT_JOB_YRS", hue='House_Ownership', data = data)

Feature Engineering

Before modeling, data preparation is a required step in any data science workflow. During data preparation we have to complete several tasks; one key responsibility is encoding categorical data.

Most real-world data contains categorical string values, while most machine learning models can only handle numeric input.

Encoding categorical data is the process of converting categories into a numeric format so that the data can be fed into a model.

We will apply encoding to categorical features.

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce

# Binary categories: label encode in place
label_encoder = LabelEncoder()
for col in ['Married/Single', 'Car_Ownership']:
    data[col] = label_encoder.fit_transform(data[col])

# House_Ownership has three categories, so one-hot encoding yields three columns;
# join them to the dataframe rather than assigning the 2-D result to a single column
# (on scikit-learn < 1.2, use sparse=False instead of sparse_output=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_cols = onehot_encoder.fit_transform(data[['House_Ownership']])
onehot_df = pd.DataFrame(onehot_cols,
                         columns=onehot_encoder.get_feature_names_out(['House_Ownership']),
                         index=data.index)
data = data.join(onehot_df).drop('House_Ownership', axis=1)

# High-cardinality categories: replace each category with its frequency
high_card_features = ['Profession', 'CITY', 'STATE']

count_encoder = ce.CountEncoder()

# Transform the features, rename the columns with the _count suffix, and join to dataframe
count_encoded = count_encoder.fit_transform(data[high_card_features])
data = data.join(count_encoded.add_suffix("_count"))

data = data.drop(labels=['Profession', 'CITY', 'STATE'], axis=1)
data.head()
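To see what count encoding does on its own, here is a tiny standalone sketch with made-up values (the toy dataframe is hypothetical):

# Count encoding replaces each category with its frequency: A -> 3, B -> 2, C -> 1
toy = pd.DataFrame({"city": ["A", "B", "A", "C", "A", "B"]})
print(ce.CountEncoder().fit_transform(toy["city"]))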


After the feature engineering part is done, we split the data into training and test sets.

Splitting the Data into Train and Test Sets

To evaluate how well our machine learning model performs, we must divide the dataset into a training set and a test set. The training set is used to fit the model, and the held-out test set is used to evaluate its predictions.

x = data.drop("Risk_Flag", axis=1)
y = data["Risk_Flag"]

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=7)

We set the test set to 20% of the data and stratify the split on the target so that both sets preserve the class ratio.
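As an optional sanity check, we can verify that stratification preserved the class ratio in both splits:

# Both should show roughly 87.7% non-default vs. 12.3% default
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))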

Random Forest Classifier

Tree-based algorithms are widely used in machine learning for supervised learning problems. They are adaptable and can solve almost any classification or regression problem.

Furthermore, they tend to produce accurate, stable, and relatively interpretable predictions.

Random forest is a common tree-based supervised learning technique that can be used for both classification and regression. A random forest typically combines hundreds of decision trees, training each tree on a different bootstrap sample of the data.

Now we train the model and make predictions. Because the target classes are imbalanced, we oversample the minority class with SMOTE inside an imblearn pipeline, so the resampling is applied only during fitting.

from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

rf_clf = RandomForestClassifier(criterion='gini', bootstrap=True, random_state=100)
smote_sampler = SMOTE(random_state=9)
# In an imblearn Pipeline, SMOTE resamples only during fit, so the test data is never oversampled
pipeline = Pipeline(steps=[('smote', smote_sampler), ('classifier', rf_clf)])
pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)

Now we check the evaluation metrics.

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score, roc_auc_score

print("-------------------------TEST SCORES-----------------------")
print(f"Recall: {round(recall_score(y_test, y_pred) * 100, 4)}")
print(f"Precision: {round(precision_score(y_test, y_pred) * 100, 4)}")
print(f"F1-Score: {round(f1_score(y_test, y_pred) * 100, 4)}")
print(f"Accuracy score: {round(accuracy_score(y_test, y_pred) * 100, 4)}")
print(f"AUC Score: {round(roc_auc_score(y_test, y_pred) * 100, 4)}")

output

-------------------------TEST SCORES-----------------------
Recall: 54.1378
Precision: 54.3306
F1-Score: 54.234
Accuracy score: 88.7619
AUC Score: 73.8778
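The confusion_matrix imported above is also worth a look; here is a minimal sketch using the same predictions (the plotting step is optional):

from sklearn.metrics import ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
print(cm)
ConfusionMatrixDisplay(cm).plot()
plt.show()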

Conclusion

Today I explained the whole process of predicting whether a user will default on a loan. A few points are worth noting:

  • Random forests are suitable for classification and regression tasks on datasets with many rows and features, when we need highly accurate results while avoiding overfitting, even if some values are missing.
  • Additionally, a random forest provides relative feature importances, enabling you to select the most important features (see the sketch after this list). It is easier to interpret than a neural network, but harder to interpret than a single decision tree.
  • Categorical features must be encoded so that ML algorithms can process them.
  • Loan default depends heavily on demographics: people with lower incomes are more likely to default on their loans.
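As mentioned above, a fitted random forest exposes relative feature importances. A minimal sketch, assuming the fitted pipeline and x_train from the modeling section are still in scope:

# Pull the fitted RandomForestClassifier out of the imblearn pipeline
rf = pipeline.named_steps['classifier']
importances = pd.Series(rf.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False))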

Code Acquisition

Reply "Credit default" in the background of the public account below to get the full version of the code.


Technical Exchange

Welcome to reprint, bookmark, like, and support!


A technical exchange group with more than 2,000 members is now open. When you join, the best note format is: source + interest direction, which makes it easy to find like-minded friends.

  • Method ①: Send the picture below to WeChat, long press to recognize it, and reply in the background: add group;
  • Method ②: Add WeChat ID: dkl88191, note: from CSDN;
  • Method ③: Search WeChat for the public account: Python learning and data mining, and reply in the background: add group.


Guess you like

Source: blog.csdn.net/weixin_38037405/article/details/124090892