Python | Classification Prediction Research Based on LendingClub Data Part01 - Problem Restatement + Feature Selection + Algorithm Comparison

Welcome to exchange and study~~


Column: Machine Learning & Deep Learning


In this paper, Python is used to analyze the data set, and a variety of machine learning algorithms are used for classification prediction.
Specific articles and data sets can be found in the resources I published: Published resources


0. Problem Restatement & Background Introduction

0.1 Question restatement

  • Question 1: Select different attributes from the lending-club data set to form at least three corresponding training and test sets, train the same machine learning algorithm on each of them, and compare and analyze the experimental results. Data balancing may be applied as a preprocessing step.

  • Question 2: Select different machine learning algorithms (at least three, such as support vector machines, neural networks, and decision trees) to perform classification prediction on the "multi-source data" set, and carry out an in-depth comparative analysis of the advantages and disadvantages of the algorithms. Data balancing may be applied as a preprocessing step.

  • Question 3: As an extension, optimize or improve a particular machine learning algorithm. After meeting the requirements of Questions 1 and 2, carry out innovative algorithm experiments.

0.2 Background introduction

In recent years, with the rapid development of the Internet, Internet financial products have grown quickly and have gradually changed how people live and save. Large-scale lending platforms have also emerged. Lending Club is one of the large P2P (peer-to-peer) lending platforms that has grown quickly and operated well. Because P2P platforms offer a low transaction threshold, a simple process, and a high return on investment, they quickly attracted a large number of customers, but they have also given rise to illegal lending and fraud. This paper models and analyzes Lending Club data and carries out risk assessment through logistic regression (LR) classification prediction, with the aim of improving the P2P platform's ability to identify customers with a high default risk, thereby providing a scientific basis for decisions by the platform and the company.

In addition, for the "multi-source data" set, this paper selects 3 machine learning algorithms (neural network, Bayesian classifier, and decision tree), compares how the algorithms perform in depth, and analyzes the advantages and disadvantages of each.

Finally, this paper studies the Lending Club approved-loan data set and related algorithms in more depth, turning the original two-class problem into a three-class problem. After classifying with a single-tree model (a decision tree), two ensemble tree algorithms, random forest and extremely randomized trees, are also used to classify the data. The three algorithms are then considered together and their advantages and disadvantages are compared.

1. Comparing prediction results across different feature groups

In this part, building on a preliminary analysis of the Lending Club data set, we select 4 different feature groups, apply the same algorithm (logistic regression, LR) to each for classification prediction, compare the evaluation results of the 4 models, and select the relatively optimal feature group.

1.1 Introduction to LR Algorithm

Logistic regression (LR) performs classification with a linear model, effectively combining the regression problem with the classification problem.
Consider a binary classification task with output label $y \in \{0,1\}$. The predicted value $z = w^{T}x + b$ produced by a linear regression model is real-valued, so we need to convert the real value $z$ into a 0/1 value. The ideal choice is the unit step function:
$$y = \begin{cases} 0, & z < 0 \\ 0.5, & z = 0 \\ 1, & z > 0 \end{cases}$$

If the predicted value $z$ is greater than zero, the sample is judged to be a positive example; if it is less than zero, a negative example; if it is exactly the critical value zero, it can be judged either way. However, since the unit step function is discontinuous, we can replace it with the logistic (log-odds) function:
$$y = \frac{1}{1 + e^{-z}}$$

The unit step function and the logistic function are shown in Figure 1:
[Figure 1: the unit step function and the logistic function]
Rearranging the logistic regression formula, we get:
$$\log \frac{p}{1 - p} = \theta^{T} x$$

where $p = P(y = 1 \mid x)$ is the probability that the given input $x$ is predicted to be a positive sample, and $\theta$ collects the model parameters (the weights $w$ and the bias $b$, with a constant 1 appended to $x$). Once the independent variable $x$ and the parameters $\theta$ are fixed, logistic regression can be regarded as the special case of a generalized linear model (Generalized Linear Model) in which the dependent variable $y$ follows a Bernoulli (two-point) distribution. This paper uses logistic regression, mainly for its simplicity and efficiency, to classify and predict the Lending Club data.
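As an illustration of the two mappings discussed in this section, here is a minimal sketch in plain Python. The function names (`unit_step`, `sigmoid`, `log_odds`, `predict_proba`) are chosen for this example and do not come from the original article; the weights and bias passed to `predict_proba` are assumed to be already fitted.

```python
import math


def unit_step(z: float) -> float:
    """Ideal but discontinuous mapping of the linear score z to {0, 0.5, 1}."""
    if z < 0:
        return 0.0
    if z > 0:
        return 1.0
    return 0.5


def sigmoid(z: float) -> float:
    """Logistic function y = 1 / (1 + e^{-z}), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_odds(p: float) -> float:
    """Inverse of the sigmoid: log(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def predict_proba(weights, bias, x):
    """P(y = 1 | x) for the linear score z = w^T x + b."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


if __name__ == "__main__":
    for z in (-2.0, 0.0, 2.0):
        print(z, unit_step(z), round(sigmoid(z), 4))
```

Applying `log_odds` to `sigmoid(z)` returns `z` again (up to floating-point error), which is the rearranged form of the logistic regression formula given above.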

1.2 Introduction to classification evaluation metrics

For binary classification, the main evaluation metrics are Recall, Precision, Accuracy, F1-score, the P-R curve, the ROC curve, and AUC. All of them can be derived from the confusion matrix:

  • Recall: the ratio of the number of correctly classified positive samples to the number of truly positive samples: $Recall = \frac{TP}{TP + FN}$

  • Precision: the ratio of the number of correctly classified positive samples to the total number of samples the classifier identified as positive: $Precision = \frac{TP}{TP + FP}$

  • F1-score: the harmonic mean of recall and precision, which gives a combined view of model performance: $F1\text{-}score = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}$

  • P-R curve: a graphical indicator of how well the classification model fits. The horizontal axis is Recall and the vertical axis is Precision.

  • Accuracy: the ratio of the number of correctly classified samples to the total number of samples: $Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$

  • ROC curve: the abscissa is the false positive rate (False Positive Rate, FPR) and the ordinate is the true positive rate (True Positive Rate, TPR), where $FPR = \frac{FP}{N}$ and $TPR = \frac{TP}{P}$, with $N = FP + TN$ the number of truly negative samples and $P = TP + FN$ the number of truly positive samples.

  • AUC: the area under the ROC curve, that is, the integral of the ROC curve along the horizontal axis. It quantifies the model performance that the ROC curve shows; the closer the value is to 1, the better the model.
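The ROC curve and AUC defined above can be computed from predicted scores by sweeping the decision threshold. The following is a minimal sketch in plain Python, not the code used in the article; the function names and the toy labels and scores in the usage note are assumptions made for illustration. In practice, `sklearn.metrics.roc_auc_score` computes the same quantity.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Visit samples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Samples with identical scores cross the threshold together.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if y_true[order[j]] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under a curve given as (x, y) points sorted by x (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

For example, with labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, `auc(roc_points(...))` gives 0.75: one of the two positives is ranked below a negative.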

The symbols in the formulas above are defined by the following binary confusion matrix:
[Table: binary confusion matrix]
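To make the definitions concrete, the sketch below counts the four confusion-matrix cells and derives the metrics from them. It is a plain-Python illustration, not the article's code; the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Recall, Precision, F1-score, Accuracy and FPR from the four counts."""
    def ratio(num, den):
        # Convention for this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "f1": ratio(2 * recall * precision, recall + precision),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "fpr": ratio(fp, fp + tn),
    }
```

Note that when defaulting customers are rare, Accuracy can be high even if Recall is low, which is why the article reports several metrics side by side.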

1.3 Data description and analysis of Lending Club

The data in this article are loan customer records collected by Lending Club over a period of time. The original data contain 77159 samples and 108 features, including integer, floating-point, categorical, and string data. The target variable is the customer's loan status (loan_status), which takes the values 'Fully Paid', 'Current', 'Charged Off', 'Late (31-120 days)', 'In Grace Period', 'Late (16-30 days)', and 'Default'. Since this paper focuses on identifying defaulting customers, 'Fully Paid' and 'Current' are treated as normal customers and labeled 0, while the remaining values ('Charged Off', 'Late (31-120 days)', 'In Grace Period', 'Late (16-30 days)', and 'Default') are treated as defaulting customers and labeled 1.
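The labeling rule just described can be written as a small lookup. This is a sketch under the assumption that `loan_status` is available as a plain string; the names `NORMAL_STATUSES`, `DEFAULT_STATUSES`, and `to_binary_label` are introduced here for illustration only.

```python
# Status values taken from the description above.
NORMAL_STATUSES = {"Fully Paid", "Current"}
DEFAULT_STATUSES = {
    "Charged Off",
    "Late (31-120 days)",
    "In Grace Period",
    "Late (16-30 days)",
    "Default",
}


def to_binary_label(loan_status: str) -> int:
    """Map a loan_status string to 0 (normal customer) or 1 (defaulting customer)."""
    status = loan_status.strip()
    if status in NORMAL_STATUSES:
        return 0
    if status in DEFAULT_STATUSES:
        return 1
    # Fail loudly instead of silently assigning a label to an unknown status.
    raise ValueError(f"unexpected loan_status: {loan_status!r}")
```

Raising an error on an unknown value is a design choice of this sketch: it keeps unexpected statuses from being counted as either class.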

Next, this paper carries out a preliminary analysis of the customers' loan status and several features. Table 2 gives summary statistics for normal and defaulting customers, and Figure 2 shows the number of customers for each loan status (loan_status):

[Table 2: summary statistics for normal and defaulting customers]

[Figure 2: number of customers by loan status (loan_status)]

Figure 3 is a box plot of loan amount (loan_amnt) against loan status (loan_status). As the loan status worsens, the loan amount tends to rise slightly, which suggests some relationship between the two.
[Figure 3: box plot of loan amount (loan_amnt) by loan status (loan_status)]

Figure 4 shows the proportion of defaulting customers for each credit rating (grade). As the credit rating falls from A to F, the proportion of defaulting customers rises steadily, while the proportion at grade G is lower. A possible reason is that the lending company applies stricter review conditions to users with a G credit rating.

[Figure 4: proportion of defaulting customers by credit rating (grade)]

Figure 5 shows the proportion of defaulting customers for each total repayment period (term). The proportion of defaulting customers with a 60-month term is significantly higher than with a 36-month term. A possible reason is that the former face greater repayment pressure and greater job uncertainty.

[Figure 5: proportion of defaulting customers by repayment term (term)]

Figure 6 is a box plot of the borrower's annual income against loan status, which does not show a strong correlation between the two.

[Figure 6: box plot of borrower's annual income by loan status]

1.4 Feature selection and data preprocessing

Based on the description and analysis of the data above, we selected the following four groups of features for analysis (Table 3):
[Table 3: the four selected feature groups]

Among them, loan_amnt is the loan amount, a continuous variable; grade is the credit rating, a categorical variable; term is the total number of repayment months, a categorical variable; annual_inc is the borrower's annual income, a continuous variable.

Data analysis shows that the selected features have no missing values. Features of different data types need different preprocessing methods:

  • Categorical variables grade and term: the grades A to G in grade are label-encoded as 0 to 6; in term, '36 months' is encoded as 0 and '60 months' as 1.
  • Continuous variables loan_amnt and annual_inc: both are standardized.
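The two preprocessing steps above can be sketched as follows. This is a plain-Python illustration rather than the article's code. It assumes that "standardized" means z-scoring (subtract the mean, divide by the standard deviation) and uses the population standard deviation, which is also what `sklearn.preprocessing.StandardScaler` uses. The names `GRADE_CODES`, `TERM_CODES`, `encode_grade`, `encode_term`, and `standardize` are hypothetical.

```python
import statistics

# Label encodings described above: A..G -> 0..6, '36 months' -> 0, '60 months' -> 1.
GRADE_CODES = {grade: code for code, grade in enumerate("ABCDEFG")}
TERM_CODES = {"36 months": 0, "60 months": 1}


def encode_grade(grade: str) -> int:
    """Encode a credit grade letter as an integer from 0 to 6."""
    return GRADE_CODES[grade.strip().upper()]


def encode_term(term: str) -> int:
    """Encode the repayment term, tolerating extra whitespace in the raw string."""
    return TERM_CODES[" ".join(term.split())]


def standardize(values):
    """Z-score a list of numbers using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

When a train/test split is used, the mean and standard deviation would normally be estimated on the training set only and then applied to the test set; the article does not say which convention it followed.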

1.5 Modeling analysis and comparison of results

To build and run the models, we use Python packages such as numpy, pandas, and sklearn. For each data set, 80% is used as the training set and 20% as the test set. The ROC curves of the models for the 4 feature groups are shown in Figure 7:

[Figure 7: ROC curves of the models for the 4 feature groups]
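The 80%/20% split mentioned above can be sketched in plain Python as follows. The article does not state whether the split was shuffled, stratified, or seeded, so the shuffling and the `seed` parameter here are assumptions, and `split_train_test` is a hypothetical helper name. With sklearn, `sklearn.model_selection.train_test_split(..., test_size=0.2)` serves the same purpose.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Randomly split rows into (train, test), keeping the original order in each part."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    # A dedicated Random instance makes the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test
```

Using the same seed for all four feature groups keeps the same customers in the test set each time, so differences in the metrics come from the features rather than from the split.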

The figure shows that the AUC values of groups 2, 3, and 4 do not differ significantly from one another and are all significantly higher than that of group 1.
The evaluation metrics of each model are given in Table 4:

[Table 4: model evaluation metrics for the 4 feature groups]

According to the evaluation metrics in the table, Group 3 has the largest Recall, Precision, F1-score, and Accuracy, so we can conclude that the features selected for Group 3 (loan_amnt, annual_inc, and term) are relatively optimal for the classification prediction of defaulting customers.

2. Comparative analysis of the advantages and disadvantages of different algorithms

This part analyzes the "multi-source data set" using 3 different machine learning algorithms (neural network, Bayesian classifier, and decision tree) to classify and predict the data, compares their model evaluation metrics, and analyzes the strengths and weaknesses of each algorithm.

2.1 Introduction to Algorithms

2.1.1 Neural network

An artificial neural network is a simulation and approximation of a biological neural network. It is an adaptive nonlinear dynamic network system composed of a large number of interconnected neurons. The field has developed from the first neuron model, the MP model, through the single-layer perceptron, to the multi-layer feedforward network trained by the error backpropagation algorithm, the backpropagation (BP) network.

Neural network models have developed into many forms, including the convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory network (LSTM), gated recurrent unit (GRU), feedforward neural network (FFNN), radial basis function network (RBF), and Hopfield network (HN).

Figure 8 shows diagrams of some neural networks:

[Figure 8: diagrams of several neural network architectures]

2.1.2 Bayesian Classifier

The Bayesian classifier (Bayes method) is a pattern classification method for the case where the prior probabilities and the class-conditional probabilities are known; the classification of a sample depends on all the samples in the class domains.

Let the training sample set be divided into $M$ classes, denoted $C = \{c_1, c_2, \ldots, c_M\}$, with prior probability $P(c_i)$ for each class. When the sample set is very large, we can take $P(c_i) = \frac{n(c_i)}{n}$, where $n(c_i)$ is the number of samples in class $c_i$ and $n$ is the total number of samples. For a sample $X$ to be classified, the class-conditional probability of $X$ given class $c_i$ is $P(X \mid c_i)$, and by Bayes' theorem the posterior probability $P(c_i \mid X)$ is:
$$P(c_i \mid X) = P(X \mid c_i)\frac{P(c_i)}{P(X)}$$

If $P(c_i \mid X) = \max_{j} \{P(c_j \mid X)\},\ j = 1, 2, \ldots, M$, then $X \in c_i$. This is the maximum a posteriori criterion, the criterion most commonly used in Bayesian classification. After long-term research, the Bayes classification method is well established in theory and is also very widely applied.
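The maximum a posteriori rule above can be sketched in a few lines of plain Python. The sketch assumes the class-conditional probabilities $P(X \mid c_i)$ are already available as numbers; how they are estimated (for example with a naive Bayes independence assumption) is not covered here. The function names and the numbers in the usage note are made up for illustration.

```python
from collections import Counter


def class_priors(labels):
    """Estimate P(c_i) = n(c_i) / n from the training labels."""
    n = len(labels)
    return {c: k / n for c, k in Counter(labels).items()}


def posteriors(priors, likelihoods):
    """P(c_i | X) = P(X | c_i) P(c_i) / P(X), where P(X) = sum_j P(X | c_j) P(c_j)."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("P(X) is zero for every class")
    return {c: value / evidence for c, value in joint.items()}


def map_class(posterior):
    """Return the class with the largest posterior probability."""
    return max(posterior, key=posterior.get)
```

For instance, with 6 samples of class 0 and 4 of class 1, the priors are 0.6 and 0.4. If a sample has hypothetical likelihoods 0.2 under class 0 and 0.5 under class 1, the posteriors are 0.375 and 0.625, so the sample is assigned to class 1 even though class 0 has the larger prior.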

2.1.3 Decision tree

A decision tree can be regarded as a tree-shaped prediction model. It classifies an instance by routing it from the root node down to a leaf node, and that leaf node gives the category to which the instance belongs. The core problems of decision trees are choosing the split attribute and pruning the tree.

There are many decision tree algorithms, such as ID3, C4.5, and CART. They all use a top-down greedy strategy: each node selects the attribute with the best classification effect and splits into two or more child nodes. This process continues until the tree classifies the training set accurately or all attributes have been used.
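The greedy choice of a split attribute can be illustrated with information gain, the criterion used by ID3 (C4.5 uses the gain ratio and CART the Gini index instead). The sketch below is plain Python with hypothetical names; rows are assumed to be dictionaries of categorical attribute values, and the data in the usage note is invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on the categorical attribute attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Weighted entropy of the child nodes after the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split_attribute(rows, labels, attrs):
    """Greedy step: pick the attribute with the largest information gain."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))
```

On a toy set of four loans where `term` separates the two classes perfectly and `grade` does not, `best_split_attribute` picks `term`: its information gain is 1 bit, while that of `grade` is 0.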

Figure 9 is a schematic example of using a decision tree to decide whether a loan can be repaid:

[Figure 9: example decision tree for deciding whether a loan can be repaid]

2.2 Modeling analysis and comparison of results

Since the "multi-source data set" already includes a processed "coded multi-source data set", in this part we preprocess the coded multi-source data set directly. For multi-class classification, in addition to the preprocessing methods described in 1.4, we can also binarize the labels (Table 5):
[Table 5: binarized labels]
To build and run the models, we use Python packages such as numpy, pandas, and sklearn. For every data set, 80% is used as the training set and 20% as the test set. We then build models with the three machine learning algorithms and record the performance of each:
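Label binarization for a multi-class target can be sketched as below: each label becomes a 0/1 indicator row with one column per class. This is a plain-Python illustration, and the contents of Table 5 are not reproduced; the five classes 0 to 4 in the usage note are an assumption based on the five output nodes and the category numbers mentioned later in this section. `sklearn.preprocessing.label_binarize` provides the same transformation.

```python
def binarize_labels(labels, classes):
    """Turn each label into a 0/1 indicator row with one column per class."""
    column = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1  # raises KeyError for a label outside `classes`
        rows.append(row)
    return rows
```

For example, `binarize_labels([0, 2, 4], classes=[0, 1, 2, 3, 4])` returns three rows with a single 1 in columns 0, 2, and 4 respectively. Each column can then be treated as its own binary problem, which is what allows a P-R curve to be drawn per category.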

2.2.1 Neural network

The number of hidden layer nodes is set according to the empirical formula:
$$\lambda = \sqrt{m + n} + \alpha$$
where $\lambda$ is the number of hidden layer nodes; $m$ is the number of input layer nodes, 26 for this data set; $n$ is the number of output layer nodes, 5; and $\alpha$ is a constant between 1 and 10, taken here as 4. Rounding the result gives $\lambda = 10$.
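The arithmetic behind the hidden layer size can be checked with a short sketch. The helper name `hidden_nodes` is hypothetical; rounding to the nearest integer follows the description above.

```python
import math


def hidden_nodes(m: int, n: int, alpha: float) -> int:
    """Empirical hidden layer size: round(sqrt(m + n) + alpha)."""
    return round(math.sqrt(m + n) + alpha)


if __name__ == "__main__":
    # m = 26 input nodes, n = 5 output nodes, alpha = 4 (values from the text).
    print(hidden_nodes(26, 5, 4))
```

With m = 26, n = 5, and α = 4, the formula gives √31 + 4 ≈ 9.57, which rounds to 10, matching the value used in the article.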

Table 6 lists the model results for the neural network:

[Table 6: model results for the neural network]

Among the reported values, the running time is the total running time of the algorithm, in seconds (s).
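A total running time of this kind can be measured with a wall-clock timer around the training and prediction calls. The sketch below shows one way to do it; the article does not say how its times were measured, so this is an assumption, and `timed` is a hypothetical helper.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

For example, `result, seconds = timed(sum, [1, 2, 3])` returns 6 together with the elapsed time. Timing all three algorithms on the same machine and the same split keeps the comparison fair.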

Figure 10 shows the P-R curves of the neural network model. The precision of some categories (category 0 among them) increases as recall increases, while other categories hold a roughly constant precision, in some cases at a high level; the overall average precision also remains relatively high.

[Figure 10: P-R curves of the neural network model]

2.2.2 Bayesian Classifier

Table 7 shows the model results for the Bayesian classifier:

[Table 7: model results for the Bayesian classifier]

Figure 11 shows the P-R curves of the Bayesian classifier. The precision of category 0 decreases rapidly as recall increases, the precision of categories 1, 2, and 3 stays at a low level, and the precision of category 4 stays at a high level, while the overall average precision keeps increasing as recall increases.
[Figure 11: P-R curves of the Bayesian classifier]

2.2.3 Decision tree

Table 8 shows the model results for the decision tree:

[Table 8: model results for the decision tree]

Figure 12 shows the P-R curves of the decision tree model. The precision of every category decreases as recall increases, at different rates, but the overall average precision still remains relatively high.
[Figure 12: P-R curves of the decision tree model]

2.2.4 Summary of the advantages and disadvantages of the three algorithms

The neural network model has higher Recall, Precision, and F1-score than the Bayesian classifier, but it takes too long to run.

The Bayesian classifier has lower Recall, Precision, and F1-score, but its running time is the shortest of the three.

The decision tree has the highest Recall, Precision, and F1-score of the three, and its running time is also significantly lower than that of the neural network.

Taking the three algorithms together, the decision tree model has the best accuracy and generalization ability, and although it runs longer than the Bayesian classifier, its running time is still within an acceptable range. We therefore conclude that the decision tree model is the best of the three.

Origin blog.csdn.net/qq_60090693/article/details/127566869