Datawhale & Alibaba Cloud Tianchi - Financial Risk Control - Loan Default Prediction - Task 1 (Understanding the Competition Problem)

Getting started from scratch: financial risk control and loan default prediction


Dataset: Alibaba Cloud Tianchi - Request Book - Risk Control Dataset

Competition page: https://tianchi.aliyun.com/competition/entrance/531830/introduction

The competition problem is set in the context of personal credit in financial risk control. Contestants must predict, from the loan applicant's data, whether the applicant is likely to default, and that prediction determines whether the loan is approved. This is a typical classification problem. The problem is meant to introduce some of the business background of financial risk control, to give practice on a real problem, and to help newcomers to competitions train and improve. The competition also provides contestants with a customized learning plan in three parts: data science libraries, the general workflow, and a baseline solution.

1. Competition data

The task is to predict whether a user's loan will default. The dataset can be viewed and downloaded after registration. The data come from the loan records of a credit platform: more than 1.2 million records in total, with 47 columns of variables, 15 of which are anonymous. To keep the competition fair, 800,000 records are selected as the training set, 200,000 as test set A, and 200,000 as test set B. Fields such as employmentTitle, purpose, postCode and title are desensitized.

Field table

Field               Description
id                  Unique identifier assigned to the loan listing
loanAmnt            Loan amount
term                Loan term (years)
interestRate        Loan interest rate
installment         Installment amount
grade               Loan grade
subGrade            Sub-grade within the loan grade
employmentTitle     Employment title
employmentLength    Length of employment (years)
homeOwnership       Home ownership status provided by the borrower at registration
annualIncome        Annual income
verificationStatus  Verification status
issueDate           Month in which the loan was issued
purpose             Loan purpose category stated by the borrower in the loan application
postCode            First 3 digits of the postal code provided by the borrower in the loan application
regionCode          Region code
dti                 Debt-to-income ratio
delinquency_2years  Number of delinquencies more than 30 days past due in the borrower's credit file over the past 2 years
ficoRangeLow        Lower bound of the borrower's FICO range at loan issuance
ficoRangeHigh       Upper bound of the borrower's FICO range at loan issuance
openAcc             Number of open credit lines in the borrower's credit file
pubRec              Number of derogatory public records
pubRecBankruptcies  Number of public records cleared
revolBal            Total revolving credit balance
revolUtil           Revolving line utilization rate, i.e. the amount of credit used by the borrower relative to all available revolving credit
totalAcc            Total number of credit lines currently in the borrower's credit file
initialListStatus   Initial listing status of the loan
applicationType     Whether the loan is an individual application or a joint application with two co-borrowers
earliesCreditLine   Month in which the borrower's earliest reported credit line was opened
title               Loan title provided by the borrower
policyCode          policy_code=1 for publicly available products; policy_code=2 for new products not publicly available
n0-n14              Anonymous features n0-n14, derived from counts of certain lender behaviors

2. Evaluation metric

The submission gives, for each test sample, the predicted probability that y = 1. Models are evaluated by AUC; the higher the AUC, the better.

ROC (Receiver Operating Characteristic)

  • ROC space takes the false positive rate (FPR) as the X axis and the true positive rate (TPR) as the Y axis.

TPR: of all samples that are actually positive, the proportion correctly judged positive.

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

FPR: of all samples that are actually negative, the proportion wrongly judged positive.

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
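
As a small worked example of the two ratios, the sketch below counts TP, FP, TN and FN for a handful of made-up labels and predictions and then applies the definitions. The helper names `confusion_counts` and `tpr_fpr` and the toy data are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) computed from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn)  # share of actual positives that were caught
    fpr = fp / (fp + tn)  # share of actual negatives that were flagged
    return tpr, fpr


# Toy data: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
print(tpr, fpr)
```

Here TP = 2, FN = 1, FP = 1 and TN = 4, so TPR = 2/3 and FPR = 1/5.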

The ROC curve is an evaluation tool for binary classification problems. It plots TPR against FPR at different thresholds, in effect showing how well the classifier separates "signal" from "noise".

The area under the curve (AUC) is a measure of the ability of the classifier to classify and is used as a summary of the ROC curve.

The competition uses AUC as the evaluation metric. AUC (Area Under the Curve) is defined as the area enclosed by the ROC curve and the coordinate axes.

The higher the AUC, the better the model's performance in distinguishing between positive and negative classes.

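When AUC = 1, the classifier correctly separates all positive points from all negative points. If AUC = 0, however, the classifier predicts every negative as positive and every positive as negative.

When 0.5 < AUC < 1, the classifier has a good chance of distinguishing positive from negative values, because it detects more true positives and true negatives than false negatives and false positives.

When AUC = 0.5, the classifier cannot distinguish positive points from negative points. It is either predicting a random class or predicting a constant class for every data point.

Therefore, the higher a classifier's AUC, the better its ability to distinguish the positive class from the negative class.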
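
The three cases above can be checked with a small sketch. It relies on the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The helper `pairwise_auc` and the toy scores are assumptions for illustration; this is not how the competition platform computes its score.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
perfect = pairwise_auc(y_true, [0.1, 0.2, 0.8, 0.9])   # every positive outranks every negative
inverted = pairwise_auc(y_true, [0.9, 0.8, 0.2, 0.1])  # every pair is ranked the wrong way
constant = pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5])  # all ties, no ranking information
print(perfect, inverted, constant)
```

The quadratic loop is fine for a toy example; on data the size of the competition set a library routine would be the practical choice.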

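On the ROC curve, a higher X-axis value means a larger share of false positives relative to true negatives, while a higher Y-axis value means a larger share of true positives relative to false negatives.

The choice of threshold therefore depends on how false positives and false negatives should be balanced.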

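Evaluation metrics

Different machine learning tasks call for different evaluation metrics, and a single task can have several metrics with different emphases.
The main task types are classification, regression, ranking, clustering, topic modeling, and recommendation.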

Testing the performance of two classifiers on a dataset

Sklearn has a very handy function, roc_curve(), which computes a classifier's ROC in a few seconds. It returns the FPR, the TPR and the thresholds.
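
A minimal sketch of how `roc_curve()` might be applied to two classifiers is given here. The synthetic data from `make_classification`, the choice of logistic regression and k-nearest neighbors, and all parameter values are assumptions for illustration, not the competition data or a recommended model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (an assumption; not the competition data).
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=27)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=27)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

roc_points = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class (column 1 of predict_proba).
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, proba)
    roc_points[name] = (fpr, tpr, thresholds)
    print(name, "ROC points:", len(thresholds))
```

Each entry of `roc_points` holds three arrays of equal length, one (FPR, TPR) pair per threshold, which can be plotted to compare the two curves.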

You can test the sensitivity and specificity at each threshold by hand, or use sklearn's roc_auc_score() to compute the AUC score.
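
Both options can be sketched on a tiny example. The labels, scores and thresholds are made-up values, and `sensitivity_specificity` is a hypothetical helper; only `roc_auc_score` comes from sklearn.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (made-up values for illustration).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]


def sensitivity_specificity(y_true, y_score, threshold):
    """Classify as positive when score >= threshold, then return (TPR, TNR)."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s < threshold)
    tn = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Manual check at two candidate thresholds.
for threshold in (0.2, 0.5):
    sens, spec = sensitivity_specificity(y_true, y_score, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")

# Threshold-free summary of the same scores.
auc = roc_auc_score(y_true, y_score)
print("AUC:", auc)
```

Lowering the threshold from 0.5 to 0.2 raises sensitivity from 0.5 to 1.0 and lowers specificity from 1.0 to 0.5, while the AUC of 0.75 summarizes the ranking quality across all thresholds.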

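3. Result submission

Before submitting, make sure the prediction file follows the format of sample_submit.csv and has the .csv file extension.

The format is as follows: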

id,isDefault
800000,0.5
800001,0.5
800002,0.5
800003,0.5
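
A file in this layout could be written with a short sketch such as the one below, which uses only the standard library. The helper `write_submission`, the output path and the placeholder probabilities are assumptions; real predictions would come from a trained model.

```python
import csv
import os
import tempfile


def write_submission(path, ids, probabilities):
    """Write an id,isDefault CSV in the layout shown above."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "isDefault"])
        for sample_id, prob in zip(ids, probabilities):
            # Each value must be a probability, so reject anything outside [0, 1].
            if not 0.0 <= prob <= 1.0:
                raise ValueError(f"probability out of range for id {sample_id}: {prob}")
            writer.writerow([sample_id, prob])


ids = [800000, 800001, 800002, 800003]
probabilities = [0.5, 0.5, 0.5, 0.5]  # placeholder predictions
path = os.path.join(tempfile.gettempdir(), "submission.csv")  # hypothetical output location
write_submission(path, ids, probabilities)

# Read the file back to confirm the header and rows.
with open(path, newline="") as f:
    lines = f.read().splitlines()
print(lines)
```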

4. Lessons learned

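Understanding the problem is the first step of any competition and helps you keep the whole contest in view. It also clarifies the business logic behind the problem, which matters a great deal for later feature engineering and model selection.

  • Study the problem thoroughly before starting the competition.
  • Know when the competition starts, when it ends, and when the leaderboard switches to the B-test data.
  • Check whether similar competitions exist that can serve as references.
  • Online submissions are usually limited, so find out in advance how many are allowed per day.
  • Identify the evaluation metric the competition uses; the same metric can be used for offline validation.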
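
The last point, validating offline with the competition's own metric, might look like the following sketch. The synthetic data, the logistic regression model and the 5-fold setup are assumptions for illustration; `scoring="roc_auc"` is the sklearn option that selects AUC as the validation score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training set (an assumption; replace with real features and labels).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2], random_state=0)

model = LogisticRegression(max_iter=1000)
# Stratified folds keep the positive/negative ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scoring="roc_auc" makes the offline score use the same metric as the leaderboard.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("fold AUCs:", scores)
print("mean AUC: %.4f" % scores.mean())
```

Comparing the mean fold AUC with the online score after each submission gives a rough sense of how far the offline estimate can be trusted before spending further submissions.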

Origin blog.csdn.net/adminkeys/article/details/108568260