Financial risk control: default prediction (Task 2)

0. Overview

A machine learning model learns the patterns in the distribution of the data and then uses those patterns to make predictions. Our data is expressed as features, some of which are useful and some of which are not. Exploratory data analysis (EDA) before modeling aims to find the features that are related to the prediction target and to build a necessary initial understanding of the data.
The first step is therefore to look at the features and their distributions, and then form reasonable hypotheses. Although feature engineering has become formalized and streamlined, being able to understand the logic behind the relationship between features and the target is still valuable experience in machine learning modeling. This is especially true in finance, where models must be highly interpretable: traditional machine learning models still dominate, and they place higher demands on understanding what each feature means and how it relates to the prediction target.

1. Data overview

With pandas, you can easily see the data overview.

import pandas as pd

# File names are assumptions; point these at your copy of the competition data.
data_train = pd.read_csv('train.csv')
data_test_a = pd.read_csv('testA.csv')

data_test_a.shape
# (200000, 48)
data_train.shape
# (800000, 47)
data_train.columns
# Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
#        'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
#        'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
#        'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
#        'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
#        'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
#        'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
#        'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
#        'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
#       dtype='object')

You can also use the info method to view the data type of each feature:

data_train.info()
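
As a small illustration of what info() reports, the sketch below runs it on a toy DataFrame. Only the column names come from the competition data; the values are made up, and the variable name toy is hypothetical. It also counts missing values per column directly, which info() shows only indirectly through its non-null counts.

```python
import io

import pandas as pd

# Toy stand-in for data_train; the values are invented for illustration only.
toy = pd.DataFrame({
    "loanAmnt": [35000.0, 18000.0, 12000.0],
    "grade": ["E", "D", "D"],
    "employmentLength": ["2 years", None, "8 years"],
    "isDefault": [1, 0, 0],
})

# info() lists each column with its non-null count and dtype.
# Writing to a buffer lets us keep the text as well as print it.
buf = io.StringIO()
toy.info(buf=buf)
info_text = buf.getvalue()
print(info_text)

# Number of missing values per column, computed directly.
missing = toy.isnull().sum()
print(missing)
```

Columns whose non-null count is lower than the number of rows are the ones that will need missing-value handling later.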

Features fall into two types: categorical features and numerical features. Numerical features can be further divided into continuous and discrete numerical features.
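
One way to make this split concrete is sketched below. The helper split_features and its discrete_max_unique threshold are assumptions introduced for illustration, not part of the competition material: a numerical column is treated as discrete when it has only a few distinct values, and the right cutoff depends on the data. The toy values are made up.

```python
import pandas as pd


def split_features(df, target="isDefault", discrete_max_unique=10):
    """Split columns into categorical, discrete numerical and continuous numerical.

    The unique-value threshold is a heuristic, not a fixed rule.
    """
    features = df.drop(columns=[target], errors="ignore")
    # Columns with a numeric dtype are numerical; everything else is categorical.
    numerical = features.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in features.columns if c not in numerical]
    # Numerical columns with few distinct values are treated as discrete.
    discrete = [c for c in numerical
                if features[c].nunique() <= discrete_max_unique]
    continuous = [c for c in numerical if c not in discrete]
    return categorical, discrete, continuous


# Toy data: column names follow the competition, values are invented.
toy = pd.DataFrame({
    "loanAmnt": [35000.0, 18000.0, 12000.0, 11000.0, 3000.0, 9500.0],
    "term": [5, 5, 5, 3, 3, 3],
    "grade": ["E", "D", "D", "A", "C", "B"],
    "isDefault": [1, 0, 0, 0, 0, 1],
})

categorical, discrete, continuous = split_features(toy, discrete_max_unique=3)
print("categorical:", categorical)
print("discrete:", discrete)
print("continuous:", continuous)
```

Note that a numeric dtype does not guarantee a numeric meaning: columns such as postCode or regionCode are stored as numbers but behave like categories, so the result of a split like this should be reviewed by hand.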

2. Feature analysis

Next, look at the individual columns. The meaning of each feature was given in the competition overview; they are listed again here for convenience:

  • id unique identifier assigned to the loan listing
  • loanAmnt loan amount
  • term loan term (years)
  • interestRate loan interest rate
  • installment installment amount
  • grade loan grade
  • subGrade sub-grade of the loan grade
  • employmentTitle employment title
  • employmentLength employment length (years)
  • homeOwnership home ownership status provided by the borrower at registration
  • annualIncome annual income
  • verificationStatus verification status
  • issueDate month in which the loan was issued
  • purpose loan purpose category stated by the borrower at the time of application
  • postCode first 3 digits of the postal code provided by the borrower in the loan application
  • regionCode region code
  • dti debt-to-income ratio
  • delinquency_2years number of delinquencies of more than 30 days past due in the borrower's credit file over the past 2 years
  • ficoRangeLow lower bound of the borrower's FICO range at the time of loan issuance
  • ficoRangeHigh upper bound of the borrower's FICO range at the time of loan issuance
  • openAcc number of open credit lines in the borrower's credit file
  • pubRec number of derogatory public records
  • pubRecBankruptcies number of public record bankruptcies
  • revolBal total revolving credit balance
  • revolUtil revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
  • totalAcc total number of credit lines currently in the borrower's credit file
  • initialListStatus initial listing status of the loan
  • applicationType whether the loan is an individual application or a joint application with two co-borrowers
  • earliesCreditLine month in which the borrower's earliest reported credit line was opened
  • title name of the loan provided by the borrower
  • policyCode 1 for publicly available products (policy_code=1), 2 for new products not publicly available (policy_code=2)
  • n0-n14 anonymous features, obtained by processing count features of some lender behaviors

The features in bold above are the ones I consider more important.
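
A quick way to check whether a feature is related to the target is to compare the default rate across its values. The sketch below does this for grade on a toy DataFrame. It assumes isDefault is a 0/1 label, and the values are made up, so the rates shown say nothing about the real data.

```python
import pandas as pd

# Toy data: column names follow the competition, values are invented.
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "C", "C", "C", "C"],
    "isDefault": [0, 0, 0, 1, 1, 0, 1, 1],
})

# For a 0/1 target, the group mean is the share of defaulted loans in the group.
default_rate = (
    toy.groupby("grade")["isDefault"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "default_rate", "count": "n_loans"})
)
print(default_rate)
```

Keeping the loan count next to the rate helps avoid over-reading groups that contain only a handful of loans.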

Origin blog.csdn.net/hu_hao/article/details/108674737