Hello everyone. A screenshot reading "I hope to delay my mortgage payments because of the epidemic" recently circulated online and immediately sparked heated discussion among netizens.
A loan default occurs when a borrower fails to repay a loan on time. A delinquent loan is not only reported to credit bureaus; the borrower may also be sued.
To manage and control this risk, lending institutions usually predict whether a user will default based on the user's information. Today I will use an example dataset to explain how loan default prediction works.
Data
The data includes the demographics of each customer and a target variable showing whether they will default on their loans.
First, we import the libraries and load the dataset.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_theme(style = "darkgrid")
data = pd.read_csv("/kaggle/input/loan-prediction-based-on-customer-behavior/Training Data.csv")
data.head()
Explore datasets
First, let's get a sense of the data's size and distribution.
rows, columns = data.shape
print('Rows:', rows)
print('Columns:', columns)
output
Rows: 252000
Columns: 13
The data has 252,000 rows and 13 columns: 12 input columns (including the row identifier Id) and 1 target column, Risk_Flag.
Now we check the data type and other information.
data.info()
output
RangeIndex: 252000 entries, 0 to 251999
Data columns (total 13 columns)
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 252000 non-null int64
1 Income 252000 non-null int64
2 Age 252000 non-null int64
3 Experience 252000 non-null int64
4 Married/Single 252000 non-null object
5 House_Ownership 252000 non-null object
6 Car_Ownership 252000 non-null object
7 Profession 252000 non-null object
8 CITY 252000 non-null object
9 STATE 252000 non-null object
10 CURRENT_JOB_YRS 252000 non-null int64
11 CURRENT_HOUSE_YRS 252000 non-null int64
12 Risk_Flag 252000 non-null int64
dtypes: int64(7), object(6)
memory usage: 25.0+ MB
Seven columns are integers and six are strings (object dtype), so the string columns are likely categorical features.
In data science, numerical data is called "quantitative data" and categorical data is called "qualitative data".
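One way to separate the two groups programmatically is `DataFrame.select_dtypes`. The sketch below runs on a tiny made-up frame that borrows a few of the dataset's column names; the values are invented for illustration and are not taken from the data.

```python
import pandas as pd

# Toy frame: column names follow the dataset, values are invented
toy = pd.DataFrame({
    "Income": [1303834, 7574516, 3991815],
    "Age": [23, 40, 66],
    "Married/Single": ["single", "single", "married"],
    "House_Ownership": ["rented", "rented", "owned"],
    "Risk_Flag": [0, 0, 1],
})

# Numeric (quantitative) columns
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else, i.e. the string (qualitative) columns
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

On the real data the same two calls would be applied to `data` instead of `toy`.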
Let's check if there are any missing values in the data.
data.isnull().sum()
output
Id 0
Income 0
Age 0
Experience 0
Married/Single 0
House_Ownership 0
Car_Ownership 0
Profession 0
CITY 0
STATE 0
CURRENT_JOB_YRS 0
CURRENT_HOUSE_YRS 0
Risk_Flag 0
dtype: int64
Let's check the data column names.
data.columns
output
Index(['Id', 'Income', 'Age', 'Experience', 'Married/Single',
'House_Ownership', 'Car_Ownership', 'Profession', 'CITY', 'STATE',
'CURRENT_JOB_YRS', 'CURRENT_HOUSE_YRS', 'Risk_Flag'],
dtype='object')
We get the names of the data features.
Analyzing Numeric Columns
First, we start our analysis with numerical data.
data.describe()
output
Now, we examine the data distribution.
data.hist( figsize = (22, 20) )
plt.show()
Now, we check the count of the target variable.
data["Risk_Flag"].value_counts()
output
0 221004
1 30996
Name: Risk_Flag, dtype: int64
Only 30,996 of the 252,000 customers (about 12.3%) defaulted on their loans, so the target variable is heavily imbalanced.
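The imbalance can be quantified directly from the counts printed above. The sketch below is plain arithmetic on those two numbers. It also shows why accuracy alone is a weak yardstick here: a classifier that always predicts 0 would already be right as often as the majority share.

```python
# Class counts copied from the value_counts() output above
counts = {0: 221004, 1: 30996}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = counts[0] / counts[1]

for label, share in shares.items():
    print(f"Risk_Flag={label}: {share:.1%}")
print(f"non-default : default = {imbalance_ratio:.2f} : 1")
```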
Now, we draw the correlation graph.
fig, ax = plt.subplots( figsize = (12,8) )
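# numeric_only=True restricts the correlation to numeric columns; pandas 2.x raises an error on string columns otherwise
corr_matrix = data.corr(numeric_only=True)
corr_heatmap = sns.heatmap( corr_matrix, cmap = "flare", annot=True, ax=ax, annot_kws={"size": 14})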
plt.show()
Analyzing Categorical Features
Now, we move on to analyzing categorical features.
First, we define a function to create the plot.
def categorical_valcount_hist(feature):
    print(data[feature].value_counts())
    fig, ax = plt.subplots( figsize = (6,6) )
    sns.countplot(x=feature, ax=ax, data=data)
    plt.show()
Next, we compare the number of married and single customers.
categorical_valcount_hist("Married/Single")
So, most people are single.
Now, we check the distribution of house ownership.
categorical_valcount_hist("House_Ownership")
Now, let's check the number of states.
print( "Total categories in STATE:", len( data["STATE"].unique() ) )
print()
print( data["STATE"].value_counts() )
output
Total categories in STATE: 29
Uttar_Pradesh 28400
Maharashtra 25562
Andhra_Pradesh 25297
West_Bengal 23483
Bihar 19780
Tamil_Nadu 16537
Madhya_Pradesh 14122
Karnataka 11855
Gujarat 11408
Rajasthan 9174
Jharkhand 8965
Haryana 7890
Telangana 7524
Assam 7062
Kerala 5805
Delhi 5490
Punjab 4720
Odisha 4658
Chhattisgarh 3834
Uttarakhand 1874
Jammu_and_Kashmir 1780
Puducherry 1433
Mizoram 849
Manipur 849
Himachal_Pradesh 833
Tripura 809
Uttar_Pradesh[5] 743
Chandigarh 656
Sikkim 608
Name: STATE, dtype: int64
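The entry `Uttar_Pradesh[5]` looks like a footnote marker that survived in the raw data, so the same state is counted under two labels. Below is a minimal sketch of one possible cleanup, assuming every such marker is a bracketed number at the end of the string. The original walkthrough does not perform this step, and the toy values are invented.

```python
import pandas as pd

# Toy column reproducing the pattern seen in the STATE output
state = pd.Series(["Uttar_Pradesh", "Uttar_Pradesh[5]", "Sikkim", "Uttar_Pradesh[5]"])

# Drop a trailing "[digits]" marker, e.g. "Uttar_Pradesh[5]" -> "Uttar_Pradesh"
state_clean = state.str.replace(r"\[\d+\]$", "", regex=True)

print(state_clean.value_counts())
```

On the real data the same call would be applied to `data["STATE"]`; the CITY column may deserve the same check.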
Now, we check the number of professions.
print( "Total categories in Profession:", len( data["Profession"].unique() ) )
print()
data["Profession"].value_counts()
output
Total categories in Profession: 51
Physician 5957
Statistician 5806
Web_designer 5397
Psychologist 5390
Computer_hardware_engineer 5372
Drafter 5359
Magistrate 5357
Fashion_Designer 5304
Air_traffic_controller 5281
Comedian 5259
Industrial_Engineer 5250
Mechanical_engineer 5217
Chemical_engineer 5205
Technical_writer 5195
Hotel_Manager 5178
Financial_Analyst 5167
Graphic_Designer 5166
Flight_attendant 5128
Biomedical_Engineer 5127
Secretary 5061
Software_Developer 5053
Petroleum_Engineer 5041
Police_officer 5035
Computer_operator 4990
Politician 4944
Microbiologist 4881
Technician 4864
Artist 4861
Lawyer 4818
Consultant 4808
Dentist 4782
Scientist 4781
Surgeon 4772
Aviator 4758
Technology_specialist 4737
Design_Engineer 4729
Surveyor 4714
Geologist 4672
Analyst 4668
Army_officer 4661
Architect 4657
Chef 4635
Librarian 4628
Civil_engineer 4616
Designer 4598
Economist 4573
Firefighter 4507
Chartered_Accountant 4493
Civil_servant 4413
Official 4087
Engineer 4048
Name: Profession, dtype: int64
Data analysis
Now, we start by understanding the relationship between different data features.
sns.boxplot(x ="Risk_Flag",y="Income" ,data = data)
Now, we look at the relationship between the target variable and age, followed by the remaining features.
sns.boxplot(x ="Risk_Flag",y="Age" ,data = data)
sns.boxplot(x ="Risk_Flag",y="Experience" ,data = data)
sns.boxplot(x ="Risk_Flag",y="CURRENT_JOB_YRS" ,data = data)
sns.boxplot(x ="Risk_Flag",y="CURRENT_HOUSE_YRS" ,data = data)
fig, ax = plt.subplots( figsize = (8,6) )
sns.countplot(x='Car_Ownership', hue='Risk_Flag', ax=ax, data=data)
fig, ax = plt.subplots( figsize = (8,6) )
sns.countplot( x='Married/Single', hue='Risk_Flag', data=data )
fig, ax = plt.subplots( figsize = (10,8) )
sns.boxplot(x = "Risk_Flag", y = "CURRENT_JOB_YRS", hue='House_Ownership', data = data)
Feature engineering
Data preparation is a required step before modeling. It involves several tasks, and one of the key ones is encoding categorical data.
Much real-world data contains categorical string values, while most machine learning models only accept numeric input.
Encoding categorical data means converting each category into a numeric representation so that the data can be fed into a model.
We will apply encoding to categorical features.
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
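# Label-encode the two binary columns (the encoder is refit for each column)
label_encoder = LabelEncoder()
for col in ['Married/Single','Car_Ownership']:
    data[col] = label_encoder.fit_transform( data[col] )
# One-hot encode House_Ownership into one indicator column per category.
# Assumes scikit-learn >= 1.2, where the argument is sparse_output (older releases call it sparse).
onehot_encoder = OneHotEncoder(sparse_output = False)
house_encoded = onehot_encoder.fit_transform( data[['House_Ownership']] )
house_columns = onehot_encoder.get_feature_names_out(['House_Ownership'])
data = data.join( pd.DataFrame(house_encoded, columns=house_columns, index=data.index) )
data = data.drop(labels=['House_Ownership'], axis=1)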
high_card_features = ['Profession', 'CITY', 'STATE']
count_encoder = ce.CountEncoder()
# Transform the features, rename the columns with the _count suffix, and join to dataframe
count_encoded = count_encoder.fit_transform( data[high_card_features] )
data = data.join(count_encoded.add_suffix("_count"))
data= data.drop(labels=['Profession', 'CITY', 'STATE'], axis=1)
data.head()
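To make the three encodings concrete, here is a sketch on a four-row toy frame. It uses pandas-only equivalents rather than the scikit-learn and category_encoders classes used above, so it illustrates the idea and is not the same code path. The values are invented, and the `_count` suffix simply mirrors the naming used in this article.

```python
import pandas as pd

# Toy frame: column names follow the dataset, values are invented
toy = pd.DataFrame({
    "Married/Single": ["single", "married", "single", "single"],
    "House_Ownership": ["rented", "owned", "norent_noown", "rented"],
    "STATE": ["Kerala", "Bihar", "Kerala", "Kerala"],
})

# Label encoding: one integer per category, assigned in alphabetical order
toy["Married/Single"] = toy["Married/Single"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category
house_dummies = pd.get_dummies(toy["House_Ownership"], prefix="House_Ownership", dtype=int)

# Count encoding: replace each category with how often it occurs
toy["STATE_count"] = toy["STATE"].map(toy["STATE"].value_counts())

encoded = toy.drop(columns=["House_Ownership", "STATE"]).join(house_dummies)
print(encoded)
```

Label encoding suits two-valued columns, one-hot encoding suits columns with a handful of categories, and count encoding keeps high-cardinality columns such as CITY from exploding into hundreds of indicator columns.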
After the feature engineering part is done, we split the data into training and test sets.
Split the data into train and test sets
To evaluate how well our machine learning model works, we must divide the dataset into a training set and a test set. The model is fit on the training set, where the labels are known; the test set is held out and used only to check the model's predictions.
x = data.drop("Risk_Flag", axis=1)
y = data["Risk_Flag"]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = y, random_state = 7)
We hold out 20% of the data as the test set. Passing stratify = y keeps the share of defaulters the same in both splits.
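The effect of stratification is easy to check on synthetic labels. The sketch below builds 1,000 made-up labels with roughly the dataset's default rate and confirms that both splits keep it; the feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same default rate as the dataset (12.3%)
y = np.array([1] * 123 + [0] * 877)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, values are irrelevant here

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

print("default rate, full data:", y.mean())
print("default rate, train    :", y_train.mean())
print("default rate, test     :", y_test.mean())
```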
Random Forest Classifier
Tree-based algorithms are widely used for supervised learning and handle both classification and regression.
They often give accurate and stable predictions on tabular data, and a single decision tree is easy to interpret.
Random forest is a common tree-based ensemble technique for classification and regression. It trains many decision trees (scikit-learn builds 100 by default), each on a different bootstrap sample of the training data, and combines their predictions.
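The core idea can be sketched by hand: draw a bootstrap sample for each tree, fit the tree, and let the trees vote. This is a simplified illustration on synthetic data and not a replacement for `RandomForestClassifier`, which among other differences averages class probabilities instead of counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Bootstrap sample: draw n row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across the trees (no ties with an odd number of trees)
votes = np.stack([t.predict(X) for t in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("training accuracy of the hand-rolled ensemble:", (forest_pred == y).mean())
```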
Now, we train the model and perform predictions.
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
rf_clf = RandomForestClassifier(criterion='gini', bootstrap=True, random_state=100)
smote_sampler = SMOTE(random_state=9)
pipeline = Pipeline(steps = [['smote', smote_sampler],['classifier', rf_clf]])
pipeline.fit(x_train, y_train)
y_pred = pipeline.predict(x_test)
Now, we check recall, precision, F1, accuracy and AUC on the test set.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score, roc_auc_score
print("-------------------------TEST SCORES-----------------------")
print(f"Recall: {round(recall_score(y_test, y_pred)*100, 4)}")
print(f"Precision: {round(precision_score(y_test, y_pred)*100, 4)}")
print(f"F1-Score: {round(f1_score(y_test, y_pred)*100, 4)}")
print(f"Accuracy score: {round(accuracy_score(y_test, y_pred)*100, 4)}")
print(f"AUC Score: {round(roc_auc_score(y_test, y_pred)*100, 4)}")
output
-------------------------TEST SCORES-----------------------
Recall: 54.1378
Precision: 54.3306
F1-Score: 54.234
Accuracy score: 88.7619
AUC Score: 73.8778
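Note that `roc_auc_score` above receives hard 0/1 predictions rather than scores. In that case the value equals the mean of recall and specificity, that is, the balanced accuracy. The sketch below checks this on invented labels. Passing `pipeline.predict_proba(x_test)[:, 1]` instead is the usual way to obtain a ranking-based AUC, though that number is not reported here.

```python
from sklearn.metrics import balanced_accuracy_score, recall_score, roc_auc_score

# Invented toy labels and hard 0/1 predictions
y_true = [0, 0, 0, 0, 1, 1]
y_hard = [0, 0, 0, 1, 1, 0]

recall = recall_score(y_true, y_hard)                    # true positive rate
specificity = recall_score(y_true, y_hard, pos_label=0)  # true negative rate

auc_from_labels = roc_auc_score(y_true, y_hard)
balanced_acc = balanced_accuracy_score(y_true, y_hard)

print(recall, specificity, auc_from_labels, balanced_acc)
```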
Conclusion
This article walked through the whole process of predicting whether a user will default on a loan. A few points are worth noting:
- Random forests suit classification and regression tasks on datasets with many rows and features when we want accurate results with less overfitting than a single decision tree (some implementations also tolerate missing values).
- Additionally, random forest provides relative feature importances, which help you select the most important features. It is easier to interpret than a neural network but harder to interpret than a single decision tree.
- Categorical features must be encoded as numbers before ML algorithms can process them.
- Loan default prediction here depends heavily on demographic features, and customers with lower incomes appear more likely to default.
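As an illustration of the feature-importance point, the sketch below fits a small forest on synthetic data and reads its impurity-based importances. The data and the `feature_i` names are placeholders and have nothing to do with the loan dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the feature names are placeholders, not the dataset's columns
X, y = make_classification(n_samples=400, n_features=5, n_informative=2, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

With the pipeline built in this article, the fitted forest should be reachable as `pipeline.named_steps['classifier']`, and its importances can be paired with `x_train.columns` in the same way.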