Credit scorecard model analysis in Python


A credit risk measurement system consists of two parts: obligor rating models and debt (facility) rating models. An obligor has a series of rating models, commonly represented by "four cards": the A card, B card, C card, and F card. Debt rating models are generally divided by financing purpose into corporate finance models, cash flow finance models, and project finance models. This article focuses on the development process of the obligor rating model.

1. Project workflow

A typical credit scorecard development workflow is shown in Figure 1-1. The main steps in developing a credit risk rating model are as follows:
(1) Data acquisition, including data on existing and potential customers. Existing customers are those who have already conducted credit-related securities business with the company, both individual and institutional clients; potential customers are those who may conduct such business in the future, mainly institutional clients. Including potential customers is a common way to deal with the small number of samples in the securities industry; such potential institutional clients include listed companies, public bond issuers, companies listed on the New Third Board, companies listed on regional equity trading centers, financial institutions, and so on.
(2) Data preprocessing, mainly data cleaning and the treatment of missing values and outliers, in order to turn the raw data into formatted data usable for model development.
(3) Exploratory data analysis, which gives an overall picture of the sample; the main descriptive tools are histograms, boxplots, and the like.
(4) Variable selection, which uses statistical methods to select the indicators with the most significant impact on default. The methods include univariate feature selection and machine-learning-based selection.
(5) Model development, which consists of three parts: variable segmentation (binning), WOE (weight of evidence) transformation of the variables, and logistic regression estimation.
(6) Model evaluation, which assesses the model's discriminatory power, predictive power, and stability, and produces an evaluation report concluding whether the model is fit for use.
(7) Credit scoring, which determines the credit score from the logistic regression coefficients, the WOE values, and so on, converting the logistic model into standard score form; a rough sketch of this conversion appears after the figure below.
(8) Scoring system deployment: establish an automatic credit scoring system based on the credit scoring method.
[Figure 1-1: typical credit scorecard development workflow]
PS: For convenience, variables are sometimes referred to by their reference numbers.
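As a rough illustration of step (7), here is a minimal sketch of the standard score conversion, assuming an illustrative base score of 600 at good:bad odds of 20:1 and 20 "points to double the odds" (PDO). These constants are a business choice made for illustration, not values fixed by the model or the dataset:

    import math

    # Illustrative scaling constants (assumptions, not from the dataset):
    # base score of 600 at good:bad odds of 20:1, 20 points to double the odds
    BASE_SCORE, BASE_ODDS, PDO = 600, 20, 20

    factor = PDO / math.log(2)                     # points per unit of ln(odds)
    offset = BASE_SCORE - factor * math.log(BASE_ODDS)

    def prob_to_score(p_good):
        # convert the predicted probability of being a good customer
        # into a standard scorecard score via ln(odds)
        odds = p_good / (1 - p_good)
        return offset + factor * math.log(odds)

    print(round(prob_to_score(0.95)))              # roughly 599

The same linear transform, applied bin by bin to the WOE values weighted by the regression coefficients, is what turns a logistic model into a scorecard.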

2. Data acquisition

The data come from the Kaggle Give Me Some Credit competition and contain 150,000 samples.
The data cover personal consumer loans. Considering what can actually be obtained when the credit score is finally applied, the data should be drawn from the following aspects (a quick column check follows the list):
- Basic attributes: the borrower's age at the time.
- Solvency: the borrower's monthly income and debt ratio.
- Credit history: the number of times 30-59 days past due, 60-89 days past due, and 90 or more days past due within the last two years.
- Property status: the number of open credit lines and loans, and the number of real estate loans or lines.
- Loan attributes: none for now (not available in this dataset).
- Other factors: the number of the borrower's dependents (excluding the borrower).
- Time window: the observation window for the independent variables is the past two years, and the performance window for the dependent variable is the next two years.
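As a quick check, one can load the CSV and list its raw column names to match them against the categories above (assuming the standard cs-training.csv file from the competition):

    import pandas as pd

    data = pd.read_csv('cs-training.csv')
    # raw column names, e.g. age (basic attributes); MonthlyIncome and
    # DebtRatio (solvency); NumberOfTime30-59DaysPastDueNotWorse,
    # NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse
    # (credit history); NumberOfDependents (other factors); the dependent
    # variable is SeriousDlqin2yrs
    print(data.columns.tolist())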

3. Data preprocessing

Before preprocessing the data, we need to understand the missing value and outlier situation. Python's describe() function gives an overview of the dataset: missing values, mean, median, and so on.

    import pandas as pd

    # load the data
    data = pd.read_csv('cs-training.csv')
    # summarize the dataset's missing values and distributions
    data.describe().to_csv('DataDescribe.csv')

Details of the data set:
[Figure: output of data.describe() for the dataset]
As the figure shows, the variables MonthlyIncome and NumberOfDependents contain missing values: MonthlyIncome has 29,731 missing values and NumberOfDependents has 3,924.
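These counts can also be confirmed programmatically:

    # count missing values per column
    print(data.isnull().sum())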

3.1 Missing value treatment

Missing values are very common in real-world problems, and many analytical methods cannot be applied to data that contain them, so the first step in developing a credit risk rating model is to treat the missing values. Common treatment methods include the following.
(1) Delete the samples that contain missing values.
(2) Impute missing values based on the similarity between samples.
(3) Impute missing values based on the correlation between variables.
The missing rate of the MonthlyIncome variable is relatively high, so we impute it based on the correlation between variables, using a random forest:

# function that predicts and fills missing values with a random forest
from sklearn.ensemble import RandomForestRegressor

def set_missing(df):
    # take the existing numeric features, putting MonthlyIncome first
    process_df = df.iloc[:, [5, 0, 1, 2, 3, 4, 6, 7, 8, 9]]
    # split into rows where MonthlyIncome is known and rows where it is missing
    known = process_df[process_df.MonthlyIncome.notnull()].values
    unknown = process_df[process_df.MonthlyIncome.isnull()].values
    # X holds the feature values
    X = known[:, 1:]
    # y holds the target values (MonthlyIncome)
    y = known[:, 0]
    # fit a RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0,
                                n_estimators=200, max_depth=3, n_jobs=-1)
    rfr.fit(X, y)
    # predict the missing values with the fitted model
    predicted = rfr.predict(unknown[:, 1:]).round(0)
    print(predicted)
    # fill the original missing entries with the predictions
    df.loc[(df.MonthlyIncome.isnull()), 'MonthlyIncome'] = predicted
    return df

The NumberOfDependents variable has relatively few missing values, so deleting those rows directly has little impact on the overall model. After the missing values have been treated, we also remove duplicate records.

    data = set_missing(data)        # fill the many missing values with the random forest
    data = data.dropna()            # drop the few remaining rows with missing values
    data = data.drop_duplicates()   # remove duplicate rows
    data.to_csv('MissingData.csv', index=False)

3.2 Outlier treatment

After treating the missing values, we also need to deal with outliers. An outlier is a value that deviates markedly from the bulk of the sample; for example, an individual customer's age of 0 would generally be considered an outlier. Outlier samples are usually identified in the population through outlier detection.
First, we find values of 0 in the variable age, which are clearly abnormal, and remove them directly:

    # remove the abnormal records with age equal to 0
    data = data[data['age'] > 0]

For the three variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, and NumberOfTime60-89DaysPastDueNotWorse, the boxplot in Figure 3-2 below shows that outliers are present, and the unique() function reveals two abnormal values, 96 and 98, which are removed. The three variables take the values 96 and 98 on the same rows, so removing the rows where one variable equals 96 or 98 also removes those values for the other two.

[Figure 3-2: boxplots of the three past-due count variables]
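A minimal sketch of how the boxplot and the unique-value check might be produced (assuming matplotlib is installed):

    import matplotlib.pyplot as plt

    cols = ['NumberOfTime30-59DaysPastDueNotWorse',
            'NumberOfTimes90DaysLate',
            'NumberOfTime60-89DaysPastDueNotWorse']
    # boxplots of the three past-due count variables
    data[cols].boxplot()
    plt.show()
    # the distinct values reveal the abnormal codes 96 and 98
    print(data['NumberOfTime30-59DaysPastDueNotWorse'].unique())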
This removes the outliers of NumberOfTime30-59DaysPastDueNotWorse, NumberOfTimes90DaysLate, and NumberOfTime60-89DaysPastDueNotWorse. In addition, the dataset labels good customers as 0 and defaulting customers as 1; since it is more intuitive for 1 to denote a customer who performs normally and pays interest, we invert the label.

    # remove the outliers (past-due counts of 96 and 98)
    data = data[data['NumberOfTime30-59DaysPastDueNotWorse'] < 90]
    # invert the SeriousDlqin2yrs label
    data['SeriousDlqin2yrs'] = 1 - data['SeriousDlqin2yrs']

3.3 Data splitting

To validate the model's fit, we need to split the dataset into a training set and a test set.

    from sklearn.model_selection import train_test_split

    Y = data['SeriousDlqin2yrs']
    X = data.iloc[:, 1:]
    # the test set takes 30% of the samples
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
    train = pd.concat([Y_train, X_train], axis=1)
    test = pd.concat([Y_test, X_test], axis=1)
    clasTest = test.groupby('SeriousDlqin2yrs')['SeriousDlqin2yrs'].count()
    train.to_csv('TrainData.csv', index=False)
    test.to_csv('TestData.csv', index=False)
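As a quick sanity check, one can print the class counts computed above and the split sizes:

    # verify both classes are present in the test set and check the split sizes
    print(clasTest)
    print('train:', train.shape, 'test:', test.shape)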


Source: blog.csdn.net/sinat_23971513/article/details/105026848