Financial Risk Control Task 1: Analysis of the Competition Problem

1.1 Learning Objectives

Understand the data and objectives of the competition, and understand the evaluation metric.

1.2 Understanding the Problem

  • Competition overview
  • Data overview
  • Prediction metric
  • Problem analysis

1.2.1 Overview of the competition questions

The task of the competition is to predict financial risk. The data set can be viewed and downloaded after registration. The data comes from the loan records of a credit platform, with a total volume exceeding 1.2 million records and 47 columns of variable information, 15 of which are anonymous variables. To ensure the fairness of the competition, 800,000 records are selected as the training set, 200,000 as test set A, and 200,000 as test set B. In addition, fields such as employmentTitle, purpose, postCode, and title are desensitized.

1.2.2 Data overview

Generally, the competition page provides a data dictionary (except for anonymous features) explaining the meaning and characteristics of each column. Understanding the columns helps us understand the data and supports subsequent analysis. Tip: anonymous features are feature columns whose meaning is not disclosed.

train.csv

  • id the unique credit identifier assigned to the loan record
  • loanAmnt loan amount
  • term loan term (years)
  • interestRate loan interest rate
  • installment installment payment amount
  • grade loan grade
  • subGrade sub-grade of loan grade
  • employmentTitle employment title
  • employmentLength employment length (years)
  • homeOwnership The home ownership status provided by the borrower at the time of registration
  • annualIncome annual income
  • verificationStatus verification status
  • issueDate The month the loan was issued
  • purpose The type of loan purpose of the borrower at the time of loan application
  • postCode The first 3 digits of the postal code provided by the borrower in the loan application
  • regionCode area code
  • dti debt-to-income ratio
  • delinquency_2years The number of default events overdue for more than 30 days in the borrower's credit file in the past 2 years
  • ficoRangeLow The lower limit range of the borrower's fico at the time of loan issuance
  • ficoRangeHigh The upper limit range of the borrower's fico at the time of loan issuance
  • openAcc the number of open credit lines in the borrower's credit file
  • pubRec the number of derogatory public records
  • pubRecBankruptcies the number of public record bankruptcies
  • revolBal Total credit revolving balance
  • revolUtil Revolving Facility Utilization, or the amount of credit used by the borrower relative to all available revolving credit
  • totalAcc The total number of credit lines currently in the borrower's credit file
  • initialListStatus The initial list status of the loan
  • applicationType indicates whether the loan is an individual application or a joint application with two co-borrowers
  • earliestsCreditLine The month the borrower's earliest reported credit line was opened
  • title The title of the loan provided by the borrower
  • policyCode policy code: publicly available policy_code = 1; new products not publicly available policy_code = 2
  • n-series anonymous features n0–n14, derived from counts of certain lender behaviors
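As a first step, the training data can be loaded and inspected with pandas. The sketch below uses a hypothetical three-row in-memory sample (the values are made up for illustration) so it is self-contained; in practice one would read the downloaded train.csv directly.

```python
import io

import pandas as pd

# Hypothetical 3-row sample standing in for train.csv; in practice use
# pd.read_csv("train.csv") on the file downloaded after registration.
sample = io.StringIO(
    "id,loanAmnt,term,interestRate,grade,subGrade\n"
    "0,35000.0,5,19.52,E,E2\n"
    "1,18000.0,5,18.49,D,D2\n"
    "2,12000.0,5,16.99,D,D3\n"
)
train = pd.read_csv(sample)

print(train.shape)   # (rows, columns)
print(train.dtypes)  # per-column data types, a first hint at numeric vs. categorical
print(train.head())  # first few records
```

Checking `dtypes` early is useful here because columns like grade and subGrade arrive as strings and will need encoding before modeling.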

1.2.3 Prediction Metric
The competition uses AUC as the evaluation index. AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axis.

Common evaluation metrics for classification algorithms are as follows:
1. Confusion Matrix

(1) If an instance is a positive class and is predicted to be a positive class, it is a true positive, TP (True Positive)
(2) If an instance is a positive class but is predicted to be a negative class, it is a false negative, FN (False Negative)
(3) If an instance is a negative class but is predicted to be a positive class, it is a false positive, FP (False Positive)
(4) If an instance is a negative class and is predicted to be a negative class, it is a true negative, TN (True Negative)
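The four cells of the confusion matrix can be tallied directly from label/prediction pairs. A minimal sketch (the helper name `confusion_counts` and the toy labels are ours, not from the competition):

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, FN, FP, TN for binary labels (1 = positive, 0 = negative)."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1  # positive correctly predicted positive
        elif t == 1 and p == 0:
            fn += 1  # positive wrongly predicted negative
        elif t == 0 and p == 1:
            fp += 1  # negative wrongly predicted positive
        else:
            tn += 1  # negative correctly predicted negative
    return tp, fn, fp, tn

# Toy example with 5 samples
print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```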
2. Accuracy. Accuracy is a commonly used evaluation metric, but it is not suitable for imbalanced samples. $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

3. Precision, the proportion of predicted positive samples (TP + FP) that are correctly predicted positive (TP). $Precision = \frac{TP}{TP + FP}$

4. Recall, the proportion of actual positive samples (TP + FN) that are correctly predicted positive (TP). $Recall = \frac{TP}{TP + FN}$

5. F1 Score. Precision and recall affect each other: when precision increases, recall tends to decrease, and vice versa. When both need to be taken into account, the F1 Score combines them as their harmonic mean. $F1\text{-}Score = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}}$
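These formulas translate directly into code. A sketch computing precision, recall, and F1 from confusion-matrix counts (the function names and the toy counts are illustrative):

```python
def precision(tp, fp):
    # Fraction of predicted positives that are truly positive
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives that were found
    return tp / (tp + fn)

def f1_score(p, r):
    # Harmonic mean of precision and recall
    return 2 / (1 / p + 1 / r)

tp, fn, fp = 2, 1, 1  # toy counts
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))  # 0.667 0.667 0.667
```

Note that when precision and recall are equal, F1 equals them both; F1 only drops below the arithmetic mean when the two diverge.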

6. PR Curve (Precision-Recall Curve). The PR curve plots precision against recall as the decision threshold varies.

7. ROC (Receiver Operating Characteristic)

The ROC space defines the false positive rate (FPR) as the X axis, and the true positive rate (TPR) as the Y axis.
TPR: among all samples that are actually positive, the proportion correctly judged as positive. $TPR = \frac{TP}{TP + FN}$
FPR: among all samples that are actually negative, the proportion wrongly judged as positive. $FPR = \frac{FP}{FP + TN}$

8. AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes. Clearly this area is no greater than 1, and because the ROC curve generally lies above the line y = x, the AUC value ranges between 0.5 and 1. The closer the AUC is to 1.0, the stronger the model's discriminative ability; when it equals 0.5, the model performs no better than random guessing and has no practical value.
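AUC can equivalently be computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, via a rank statistic. A minimal sketch under the assumption of no tied scores (in practice one would use sklearn.metrics.roc_auc_score, which also handles ties):

```python
def auc(y_true, scores):
    """Rank-based AUC for binary labels; assumes no tied scores."""
    # Rank every sample 1..n by ascending score
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = {idx: r + 1 for r, idx in enumerate(order)}

    pos = [i for i, y in enumerate(y_true) if y == 1]
    n_pos, n_neg = len(pos), len(y_true) - len(pos)
    rank_sum = sum(ranks[i] for i in pos)

    # Mann-Whitney U statistic normalized by the number of pos/neg pairs
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In the example, one positive (score 0.35) is out-ranked by one negative (score 0.4), so 3 of the 4 positive/negative pairs are ordered correctly, giving 0.75.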


Origin blog.csdn.net/BigCabbageFy/article/details/108610186