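Kaggle is a great platform for data analysis and data mining projects. Working through its projects shows the practical value of data analysis and mining, and it is a good way to sharpen your skills. It is also a social platform.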

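Project overview:

In data analysis and mining, the most important step is asking the right question.

If this step goes wrong, no amount of good data or clever analysis will make the result meaningful (hence the three elements of statistics: question, data, method).

Banks play a vital role in the modern economy. A bank wants to lend out as much money as possible, but it must also make sure that borrowers are able to repay. This is the problem of financial risk prediction and assessment: using historical data to predict whether a credit default will occur in the future. It is an important application of data analysis and mining, where data is used to support decisions.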

Data description:

Data source: Give Me Some Credit (Kaggle)


Variable Name	Description	Type
SeriousDlqin2yrs	Person experienced 90 days past due delinquency or worse	Y/N
RevolvingUtilizationOfUnsecuredLines	Total balance on credit cards and personal lines of credit (except real estate and no installment debt like car loans) divided by the sum of credit limits	percentage
age	Age of borrower in years	integer
NumberOfTime30-59DaysPastDueNotWorse	Number of times borrower has been 30-59 days past due but no worse in the last 2 years	integer
DebtRatio	Monthly debt payments, alimony and living costs divided by monthly gross income	percentage
MonthlyIncome	Monthly income	real
NumberOfOpenCreditLinesAndLoans	Number of open loans (installment, like a car loan or mortgage) and lines of credit (e.g. credit cards)	integer
NumberOfTimes90DaysLate	Number of times borrower has been 90 days or more past due	integer
NumberRealEstateLoansOrLines	Number of mortgage and real estate loans, including home equity lines of credit	integer
NumberOfTime60-89DaysPastDueNotWorse	Number of times borrower has been 60-89 days past due but no worse in the last 2 years	integer
NumberOfDependents	Number of dependents in family excluding themselves (spouse, children etc.)	integer

Initial data exploration:

1. Check the number of records (completeness) and the column types

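import numpy as np
import pandas as pd

data_train = pd.read_csv('cs-training.csv')
# Print the number of rows, the non-null count of each column and the column types
data_train.info()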

This reveals that two features, MonthlyIncome and NumberOfDependents, have missing values.
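Missing values can also be counted directly per column. The following is a minimal sketch on a hypothetical four-row frame (only the column names come from the dataset; the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; only the column names come from the dataset
sample = pd.DataFrame({
    'age': [45, 40, 38, 30],
    'MonthlyIncome': [9120.0, np.nan, 3042.0, np.nan],
    'NumberOfDependents': [2.0, 1.0, np.nan, 0.0],
})

# Count missing values per column, then express them as a share of all rows
missing_count = sample.isnull().sum()
missing_share = missing_count / len(sample)

print(pd.DataFrame({'missing': missing_count, 'share': missing_share}))
```

On the real data the same two lines would be applied to data_train instead of the toy frame.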

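2. Review the summary statistics of each feature:

For datasets hosted on Kaggle, the site also provides a frequency histogram for each feature.

(The screenshots above were taken in PyCharm, which does not display the result as well as Jupyter Notebook; the latter renders it as a table.)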

	Unnamed: 0	SeriousDlqin2yrs	RevolvingUtilizationOfUnsecuredLines	age	NumberOfTime30-59DaysPastDueNotWorse	DebtRatio	MonthlyIncome	NumberOfOpenCreditLinesAndLoans	NumberOfTimes90DaysLate	NumberRealEstateLoansOrLines	NumberOfTime60-89DaysPastDueNotWorse	NumberOfDependents
count	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	1.202690e+05	150000.000000	150000.000000	150000.000000	150000.000000	146076.000000
mean	75000.500000	0.066840	6.048438	52.295207	0.421033	353.005076	6.670221e+03	8.452760	0.265973	1.018240	0.240387	0.757222
std	43301.414527	0.249746	249.755371	14.771866	4.192781	2037.818523	1.438467e+04	5.145951	4.169304	1.129771	4.155179	1.115086
min	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000e+00	0.000000	0.000000	0.000000	0.000000	0.000000
25%	37500.750000	0.000000	0.029867	41.000000	0.000000	0.175074	3.400000e+03	5.000000	0.000000	0.000000	0.000000	0.000000
50%	75000.500000	0.000000	0.154181	52.000000	0.000000	0.366508	5.400000e+03	8.000000	0.000000	1.000000	0.000000	0.000000
75%	112500.250000	0.000000	0.559046	63.000000	0.000000	0.868254	8.249000e+03	11.000000	0.000000	2.000000	0.000000	1.000000
max	150000.000000	1.000000	50708.000000	109.000000	98.000000	329664.000000	3.008750e+06	58.000000	98.000000	54.000000	98.000000	20.000000
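The table above has the layout that pandas' describe() produces, although the call itself is not shown in the post. A minimal sketch on a hypothetical frame (made-up values) illustrates how such a table is obtained and how a reduced count exposes missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for cs-training.csv
sample = pd.DataFrame({
    'SeriousDlqin2yrs': [0, 0, 1, 0],
    'age': [45, 40, 38, 30],
    'MonthlyIncome': [9120.0, np.nan, 3042.0, 3300.0],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column
stats = sample.describe()
print(stats)

# A count below the number of rows signals missing values in that column
has_missing = stats.loc['count'] < len(sample)
print(has_missing)
```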

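The statistics for SeriousDlqin2yrs show that the dataset is heavily imbalanced: its mean is about 0.067, so only about 6.7% of the records are labelled 1.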
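The class balance can be checked explicitly. This is a small sketch on hypothetical labels (14 zeros and a single one, chosen only to mimic the imbalance):

```python
import pandas as pd

# Hypothetical labels: 1 marks a serious delinquency, 0 a normal borrower
labels = pd.Series([0] * 14 + [1], name='SeriousDlqin2yrs')

# Absolute counts and relative shares of each class
counts = labels.value_counts()
shares = labels.value_counts(normalize=True)
print(counts)
print(shares)

# For a 0/1 label the mean equals the share of positives
positive_rate = labels.mean()
print('positive rate:', round(positive_rate, 4))
```

This is why the mean row of the summary table is enough to spot the imbalance.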

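3. Relationship between each feature and the outcome

One way to examine how a continuous variable affects a categorical outcome is to first discretize it into bins with cut(), and then compute the probability of the event within each bin.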
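Before the full version, the idea can be shown in a few lines. This sketch uses hypothetical data and hypothetical column names (age, default) and relies on the fact that the mean of a 0/1 outcome is the event rate:

```python
import pandas as pd

# Hypothetical data: a continuous feature and a 0/1 outcome
toy = pd.DataFrame({
    'age': [22, 24, 30, 33, 41, 47, 58, 63, 70, 75],
    'default': [1, 0, 1, 0, 0, 1, 0, 0, 0, 0],
})

# Discretize the continuous feature; bins are right-closed by default:
# (20, 40], (40, 60], (60, 80]
toy['age_bin'] = pd.cut(toy['age'], bins=[20, 40, 60, 80])

# The mean of a 0/1 outcome within a bin is the event rate of that bin
event_rate = toy.groupby('age_bin', observed=False)['default'].mean()
print(event_rate)
```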

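# Data cleaning
# Drop the first column, which is just a row index (it shows up as 'Unnamed: 0' after loading the .csv)
df = data_train.drop(data_train.columns[0], axis=1)
# Comparing the summary statistics with common sense reveals an abnormal record with age < 18
# Inspect the records with age < 18
# df[df.age < 18]
# Remove the records with age < 18
df = df[df.age >= 18]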

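# From the data description we know that the features are continuous variables
# If a feature were categorical, we could build frequency tables of it for the delinquent and the normal
# borrowers and show them in one chart, to get an intuitive view of how the feature affects the outcome.


# Use cut() to convert a continuous variable into a categorical one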
def binning(col, cut_points, labels=None, isright=True):
    val_min = col.min()
    val_max = col.max()
    break_points = [val_min] + cut_points + [val_max]
    if not labels:
        labels = range(len(cut_points) + 1)
    else:
        labels = [str(i + 1) + ':' + labels[i] for i in range(len(cut_points) + 1)]
    colbin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True, right=isright)
    return colbin


# Discretize RevolvingUtilizationOfUnsecuredLines
# Avoid modifying the original dataset where possible, so work on a copy
df_tmp = df[['SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines']].copy()

cut_points = [0.25, 0.5, 0.75, 1, 2]
labels = ['below0.25', '0.25-0.5', '0.5-0.75', '0.75-1.0', '1.0-2.0', 'above2']
df_tmp['Utilization_Bin'] = binning(df_tmp['RevolvingUtilizationOfUnsecuredLines'], cut_points, labels)

# Build the frequency pivot table
# Total number of people
# total_size = df_tmp.shape[0]
# pd.pivot_table() builds pivot tables. It supports multi-level row indexes and multi-level column labels
# (with the default columns=None, every column other than the index becomes a column label with aggregated
# values), and a column can have several aggregation functions.
# per_table = pd.pivot_table(
#     df_tmp, index='Utilization_Bin',
#     aggfunc={'RevolvingUtilizationOfUnsecuredLines': [len, lambda x: len(x) / total_size * 100],
#              'SeriousDlqin2yrs': np.sum},
#     values=['RevolvingUtilizationOfUnsecuredLines', 'SeriousDlqin2yrs'])
# Add a default-rate column to the pivot table
# Columns of a frame with multi-level columns are addressed as frame[a_level, b_level, ...]
# per_table['SeriousDlqin2yrs', 'percent'] = (per_table['SeriousDlqin2yrs', 'sum']
#                                             / per_table['RevolvingUtilizationOfUnsecuredLines', 'len'] * 100)
# Rename the pivot table columns
# per_table = per_table.rename(columns={'<lambda>': 'percent', 'len': 'number', 'sum': 'number'})
# Putting number first and percent second reads better, so use reindex to reorder the columns
# per_table = per_table.reindex((per_table.columns[1], per_table.columns[0], per_table.columns[2],
#                                per_table.columns[3]), axis=1)
# Wrap the steps above in a function so that every variable can be processed the same way
def get_frequency(df, col_x, col_y, cut_points, labels, isright=True):
    # Work on a copy so that the caller's DataFrame is not modified
    df_tmp = df[[col_x, col_y]].copy()
    df_tmp['columns_Bin'] = binning(df_tmp[col_x], cut_points, labels, isright=isright)
    total_size = df_tmp.shape[0]
    # Per bin: headcount and share of all rows for col_x, number of positives for col_y
    per_table = pd.pivot_table(df_tmp, index='columns_Bin', values=[col_x, col_y],
                               aggfunc={col_x: [len, lambda x: len(x) / total_size], col_y: np.sum})
    if per_table.columns[0][0] != col_x:
        # If col_x is not in the first position, its columns are the 2nd and 3rd, so move them to the front
        per_table = per_table.reindex((per_table.columns[1], per_table.columns[2], per_table.columns[0]), axis=1)
    # Default rate within each bin, in percent
    per_table[col_y, 'percent'] = per_table[col_y, 'sum'] / per_table[col_x, 'len'] * 100
    per_table = per_table.rename(columns={'<lambda>': 'percent', 'len': 'number', 'sum': 'number'})
    per_table = per_table.reindex((per_table.columns[1], per_table.columns[0], per_table.columns[2],
                                   per_table.columns[3]), axis=1)
    return per_table

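In the resulting table, the row index is the RevolvingUtilizationOfUnsecuredLines column converted from a continuous ratio into binned, discrete values. The first number is the headcount of each bin and the first percent is that bin's share of all borrowers (as a fraction). The second number is the count of defaulters in the bin and the second percent is the default rate within the bin, that is, defaulters divided by the bin's headcount, times 100.

The table shows that the effect of RevolvingUtilizationOfUnsecuredLines on the default rate follows an asymmetric bell shape.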
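The same four columns can also be produced with groupby and named aggregation. This is an alternative sketch on hypothetical data, not the function used in this post; the names x, y, x_bin, share and default_rate are made up for illustration:

```python
import pandas as pd

# Hypothetical data; 'x' plays the role of a feature, 'y' the role of SeriousDlqin2yrs
toy = pd.DataFrame({
    'x': [0.1, 0.2, 0.3, 0.6, 0.7, 0.9, 1.5, 3.0],
    'y': [0, 0, 1, 0, 1, 1, 1, 0],
})
toy['x_bin'] = pd.cut(toy['x'], bins=[0, 0.5, 1, 4],
                      labels=['below0.5', '0.5-1.0', 'above1'])

# Named aggregation: headcount and number of defaulters per bin
freq = toy.groupby('x_bin', observed=False).agg(
    number=('y', 'count'),
    defaults=('y', 'sum'),
)
# Share of all rows that fall in the bin (a fraction between 0 and 1)
freq['share'] = freq['number'] / len(toy)
# Default rate inside the bin, in percent
freq['default_rate'] = freq['defaults'] / freq['number'] * 100
print(freq)
```

Flat column names avoid the multi-level column handling, at the cost of a different layout from the tables printed in this post.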

The same analysis for the other features and their relationship to the default rate:

# age
cut_points = [25, 35, 45, 55, 65]
labels = ['below25', '26-35', '36-45', '46-55', '56-65', 'above65']
freq_age = get_frequency(df, 'age', 'SeriousDlqin2yrs', cut_points, labels)
# print(freq_age)

# DebtRatio
cut_points = [0.25, 0.5, 0.75, 1, 2]
labels = ['below0.25', '0.25-0.5', '0.5-0.75', '0.75-1.0', '1.0-2.0', 'above2']
feq_ratio = get_frequency(df, 'DebtRatio', 'SeriousDlqin2yrs', cut_points, labels)
# print(feq_ratio)

# NumberOfOpenCreditLinesAndLoans
cut_points = [5, 10, 15, 20, 25, 30]
labels = ['below 5', '6-10', '11-15', '16-20', '21-25', '26-30', 'above 30']
feq_OpenCredit = get_frequency(df, 'NumberOfOpenCreditLinesAndLoans', 'SeriousDlqin2yrs', cut_points, labels)
# print(feq_OpenCredit)

# NumberRealEstateLoansOrLines
cut_points = [5, 10, 15, 20]
labels = ['below 5', '6-10', '11-15', '16-20', 'above 20']
feq_RealEstate = get_frequency(df, 'NumberRealEstateLoansOrLines', 'SeriousDlqin2yrs', cut_points, labels)
# print(feq_RealEstate)

# NumberOfTime30-59DaysPastDueNotWorse
# isright=False makes the bins left-closed ([0, 1), [1, 2), ...), so each label is an exact count;
# the column maximum then falls outside the last bin and is left out of the table
cut_points = [1, 2, 3, 4, 5, 6, 7]
labels = ['0', '1', '2', '3', '4', '5', '6', '7 and above']
feq_30days = get_frequency(df, 'NumberOfTime30-59DaysPastDueNotWorse', 'SeriousDlqin2yrs', cut_points, labels,
                           isright=False)
# print(feq_30days)

# MonthlyIncome
cut_points = [5000, 10000, 15000]
labels = ['below 5000', '5000-10000', '10000-15000', 'above 15000']
feq_Income = get_frequency(df, 'MonthlyIncome', 'SeriousDlqin2yrs', cut_points, labels)
# print(feq_Income)

# NumberOfDependents
cut_points = [1, 2, 3, 4, 5]
labels = ['0', '1', '2', '3', '4', '5 and more']
feq_dependent = get_frequency(df, 'NumberOfDependents', 'SeriousDlqin2yrs', cut_points, labels, isright=False)
print(feq_dependent)

                age           SeriousDlqin2yrs           
              number   percent           number    percent
columns_Bin                                               
1:below25     3027.0  0.020180              338  11.166171
2:26-35      18458.0  0.123054             2053  11.122548
3:36-45      29819.0  0.198795             2628   8.813173
4:46-55      36690.0  0.244602             2786   7.593350
5:56-65      33406.0  0.222708             1531   4.583009
6:above65    28599.0  0.190661              690   2.412672
------------------------------
            DebtRatio           SeriousDlqin2yrs           
               number   percent           number    percent
columns_Bin                                                
1:below0.25   52361.0  0.349076             3126   5.970092
2:0.25-0.5    41346.0  0.275642             2529   6.116674
3:0.5-0.75    15728.0  0.104854             1484   9.435402
4:0.75-1.0     5427.0  0.036180              596  10.982126
5:1.0-2.0      4092.0  0.027280              539  13.172043
6:above2      31045.0  0.206968             1752   5.643421
------------------------------
            NumberOfOpenCreditLinesAndLoans           SeriousDlqin2yrs  \
                                     number   percent           number   
columns_Bin                                                              
1:below 5                           46590.0  0.310602             3922   
2:6-10                              60399.0  0.402663             3345   
3:11-15                             29184.0  0.194561             1804   
4:16-20                              9846.0  0.065640              676   
5:21-25                              2841.0  0.018940              191   
6:26-30                               785.0  0.005233               62   
7:above 30                            354.0  0.002360               26   

                       
              percent  
columns_Bin            
1:below 5    8.418115  
2:6-10       5.538171  
3:11-15      6.181469  
4:16-20      6.865732  
5:21-25      6.722985  
6:26-30      7.898089  
7:above 30   7.344633  
------------------------------
            NumberRealEstateLoansOrLines           SeriousDlqin2yrs           
                                  number   percent           number    percent
columns_Bin                                                                   
1:below 5                       149206.0  0.994713             9884   6.624398
2:6-10                             699.0  0.004660              121  17.310443
3:11-15                             70.0  0.000467               16  22.857143
4:16-20                             14.0  0.000093                3  21.428571
5:above 20                          10.0  0.000067                2  20.000000
------------------------------
              NumberOfTime30-59DaysPastDueNotWorse           SeriousDlqin2yrs  \
                                            number   percent           number   
columns_Bin                                                                     
1:0                                       126018.0  0.840126             5041   
2:1                                        16032.0  0.106881             2409   
3:2                                         4598.0  0.030654             1219   
4:3                                         1754.0  0.011693              618   
5:4                                          747.0  0.004980              318   
6:5                                          342.0  0.002280              154   
7:6                                          140.0  0.000933               74   
8:7 and above                                104.0  0.000693               50   

                          
                 percent  
columns_Bin               
1:0             4.000222  
2:1            15.026198  
3:2            26.511527  
4:3            35.233751  
5:4            42.570281  
6:5            45.029240  
7:6            52.857143  
8:7 and above  48.076923  
------------------------------
              MonthlyIncome           SeriousDlqin2yrs          
                     number   percent           number   percent
columns_Bin                                                     
1:below 5000        55859.0  0.372396             4813  8.616338
2:5000-10000        46090.0  0.307269             2752  5.970926
3:10000-15000       13035.0  0.086901              547  4.196394
4:above 15000        5284.0  0.035227              245  4.636639
------------------------------
             NumberOfDependents           SeriousDlqin2yrs           
                         number   percent           number    percent
columns_Bin                                                          
1:0                     86902.0  0.579351             5095   5.862926
2:1                     26316.0  0.175441             1935   7.352941
3:2                     19521.0  0.130141             1584   8.114338
4:3                      9483.0  0.063220              837   8.826321
5:4                      2862.0  0.019080              297  10.377358
6:5 and more              990.0  0.006600               99  10.000000

The tables above can be turned into more intuitive charts:
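The original charts are not reproduced here. One way to draw such a chart is sketched below with matplotlib, using the default rates from the age table above rounded to two decimals; the plotting choices and the output file name are assumptions:

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt

# Default rate (%) per age bin, copied from the age table above and rounded to two decimals
age_bins = ['below25', '26-35', '36-45', '46-55', '56-65', 'above65']
default_rate = [11.17, 11.12, 8.81, 7.59, 4.58, 2.41]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(age_bins, default_rate)
ax.set_xlabel('age bin')
ax.set_ylabel('default rate (%)')
ax.set_title('Default rate by age bin')
fig.tight_layout()

# Hypothetical output location: a PNG in the system's temporary directory
out_path = os.path.join(tempfile.gettempdir(), 'default_rate_by_age.png')
fig.savefig(out_path)
```

The same pattern applies to the other tables by swapping in their bin labels and percent columns.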

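The statistics above show that:

1. The default rate falls as age increases.

2. As the debt ratio rises, the default rate first rises and then falls.

3. As the number of 30-59 day past-due events rises, the default rate climbs steeply and then dips slightly.

4. The default rate rises slowly as the number of dependents increases.

5. As income rises, the default rate falls slowly and then levels off.

...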

Model building:

Compare the AUC of a logistic regression model, a decision tree model and a random forest model, and pick the best one.
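To make the metric concrete: AUC can be read as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below checks this on four hypothetical samples against scikit-learn's roc_auc_score (the toy scores contain no ties, which keeps the manual count simple):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels and predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Enumerate every (positive score, negative score) pair
pairs = [(p, n)
         for p, yp in zip(y_score, y_true) if yp == 1
         for n, yn in zip(y_score, y_true) if yn == 0]
# Fraction of pairs in which the positive is ranked above the negative
manual_auc = sum(p > n for p, n in pairs) / len(pairs)

auc = roc_auc_score(y_true, y_score)
print(manual_auc, auc)
```

Because AUC depends only on the ranking of the scores, it is a reasonable choice for an imbalanced label such as this one.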

# coding = utf-8
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LogisticRegression
from sklearn import tree

# Suppress warning output
import warnings
warnings.filterwarnings('ignore')


# Build a random forest classifier on the dataset
# load the data (training and test)
# preprocessing
# split training_data into training_new & test_new to verify the model
# replace missing values with the column mean using an imputer
# build a random forest model with training_new
# deal with the unbalanced data problem
# tune the parameters with GridSearchCV and cross-validation
# export the best model and test it with data_test

# Helper that builds a dictionary
# input: keys = [] and values = []
# output: dict{}


def create_dict(keys, vals):
    lookup = {}
    if len(keys) == len(vals):
        for i in range(len(keys)):
            key = keys[i]
            val = vals[i]
            lookup[key] = val
    return lookup


# Helper that computes the AUC
# input: y_true = [] and y_score = []
# output: auc


def compute_auc(y_true, y_score):
    auc = roc_auc_score(y_true, y_score)
    print('auc', auc)
    return auc

def main():
    # 1. Load the data (training and test) and preprocess it
    colnames = ['ID', 'label', 'RUUnsecuredL', 'age', 'NOTime30-59',
                'DebtRatio', 'Income', 'NOCredit', 'NOTimes90',
                'NORealEstate', 'NOTime60-89', 'NODependents']
    # For each column, list the values to treat as NA because they carry no meaning,
    # e.g. an age of 0 or the values 96/98 in the past-due counters
    col_nas = ['', 'NA', 'NA', 0, [98, 96], 'NA', 'NA', 'NA', [98, 96], 'NA', [98, 96], 'NA']
    col_na_values = create_dict(colnames, col_nas)
    dftrain = pd.read_csv('cs-training.csv', names=colnames,
                          na_values=col_na_values, skiprows=[0])
    # print(dftrain)
    # pop() returns the removed column and drops it from the DataFrame in place
    train_id = [int(x) for x in dftrain.pop('ID')]
    y_train = np.asarray([int(x) for x in dftrain.pop('label')])
    x_train = dftrain.values

    dftest = pd.read_csv('cs-test.csv', names=colnames,
                         na_values=col_na_values, skiprows=[0])
    test_id = [int(x) for x in dftest.pop('ID')]
    y_test = np.asarray(dftest.pop('label'))
    x_test = dftest.values

    # 2. Use StratifiedShuffleSplit to split the training data into training_new and test_new (used to validate the model)
    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.33333, random_state=0)
    for train_index, test_index in sss.split(x_train, y_train):
        # print("TRAIN:", train_index, "TEST:", test_index)
        x_train_new, x_test_new = x_train[train_index], x_train[test_index]
        y_train_new, y_test_new = y_train[train_index], y_train[test_index]

    y_train = y_train_new
    x_train = x_train_new

    # 3. Use SimpleImputer to replace every NaN with the column mean (fitted on the training split only)
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')
    imp.fit(x_train)
    x_train = imp.transform(x_train)
    x_test_new = imp.transform(x_test_new)
    x_test = imp.transform(x_test)

    # Build the RF model with the training_new data
    rf = RandomForestClassifier()

    # Model comparison
    # Logistic regression, a binary classification model
    lr = LogisticRegression()
    # Fit the logistic regression model
    lr.fit(x_train, y_train)
    # Predict, for each training sample, the probability of each class
    predicted_probs_train = lr.predict_proba(x_train)
    predicted_probs_train = [x[1] for x in predicted_probs_train]
    # Compute the AUC
    compute_auc(y_train, predicted_probs_train)

    # Predict, for each validation sample (test_new), the probability of each class
    predicted_probs_test_new = lr.predict_proba(x_test_new)
    predicted_probs_test_new = [x[1] for x in predicted_probs_test_new]
    compute_auc(y_test_new, predicted_probs_test_new)

    # Decision tree, the simplest of the tree models
    model = tree.DecisionTreeClassifier()
    model.fit(x_train, y_train)
    predicted_probs_train = model.predict_proba(x_train)
    predicted_probs_train = [x[1] for x in predicted_probs_train]
    compute_auc(y_train, predicted_probs_train)

    predicted_probs_test_new = model.predict_proba(x_test_new)
    predicted_probs_test_new = [x[1] for x in predicted_probs_test_new]
    compute_auc(y_test_new, predicted_probs_test_new)

    # Random forest
    rf.fit(x_train, y_train)
    # Compare the AUC values of the models
    predicted_probs_train = rf.predict_proba(x_train)
    predicted_probs_train = [x[1] for x in predicted_probs_train]
    compute_auc(y_train, predicted_probs_train)

    predicted_probs_test_new = rf.predict_proba(x_test_new)
    predicted_probs_test_new = [x[1] for x in predicted_probs_test_new]
    compute_auc(y_test_new, predicted_probs_test_new)

    # Print the feature importances, largest first
    print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), dftrain.columns), reverse=True))
    # Tune the parameters with a cross-validated grid search
    param_grid = {"max_features": [2, 3, 4], "min_samples_leaf": [50]}
    grid_search = GridSearchCV(rf, cv=10, scoring='roc_auc', param_grid=param_grid)
    # Fit on the training data; with the default refit=True the final model is refitted with the best parameters
    grid_search.fit(x_train, y_train)
    print('the best parameter:', grid_search.best_params_)
    print('the best score:', grid_search.best_score_)

    # Predict on train with the tuned RF
    predicted_probs_train = grid_search.predict_proba(x_train)
    predicted_probs_train = [x[1] for x in predicted_probs_train]
    compute_auc(y_train, predicted_probs_train)

    # Predict on test_new with the tuned RF
    predicted_probs_test_new = grid_search.predict_proba(x_test_new)
    predicted_probs_test_new = [x[1] for x in predicted_probs_test_new]
    compute_auc(y_test_new, predicted_probs_test_new)

    # Predict on the test data with the tuned RF and write the submission file
    predicted_probs_test = grid_search.predict_proba(x_test)
    predicted_probs_test = ['%.9f' % x[1] for x in predicted_probs_test]
    submission = pd.DataFrame({'ID': test_id, 'Probability': predicted_probs_test})
    submission.to_csv('rf_submission.csv', index=False)


# When the .py file is run directly, __name__ == '__main__'
# When the .py file is imported, __name__ == the module name
if __name__ == '__main__':
    main()
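The outline in the script's comments lists "deal with the unbalanced data problem", but the script does not show a dedicated step for it. One common option, sketched here on synthetic data as an assumption rather than as the method used in this post, is to let the random forest reweight the classes with class_weight='balanced'. The dataset, the split and all parameter values below are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data (roughly 7% positives)
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.93, 0.07],
                           random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=1000, stratify=y, random_state=0)

# class_weight='balanced' weights each class inversely to its frequency
rf_balanced = RandomForestClassifier(n_estimators=100, min_samples_leaf=50,
                                     class_weight='balanced', random_state=0)
rf_balanced.fit(X_train, y_train)

# Score with the predicted probability of the positive class
valid_scores = rf_balanced.predict_proba(X_valid)[:, 1]
valid_auc = roc_auc_score(y_valid, valid_scores)
print('validation AUC:', valid_auc)
```

Whether reweighting actually improves the AUC on the real data would have to be checked on the test_new split, in the same way the three models are compared above.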

