Credit Card Overdue-Payment Data Mining

Project Overview

Importing the Data
Data Preprocessing
Undersampling the Data
Training
Plotting the Confusion Matrix
Tuning
Saving the Model

Project Overview

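Credit cards are ubiquitous in today's society. Detecting credit card fraud is a difficult task, and it matters greatly in both academic and commercial settings. This project trains on data from existing credit card users to build a detection model that flags whether a user is anomalous.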

Importing the Data


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

data = pd.read_csv('F:\\card_data\\creditcard.csv')
data.head()

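Most features in the dataset have already been standardized, but Amount has a much larger numeric range than the other features and needs to be standardized as well. Time is treated as redundant for prediction here, so it is dropped.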

Data Preprocessing

from sklearn.preprocessing import StandardScaler
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)
data.head()
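
To make the Amount step concrete, here is a minimal sketch of what standardization computes: each value minus the column mean, divided by the population standard deviation (the convention StandardScaler uses). The `standardize` helper and the sample amounts are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical transaction amounts, not taken from the dataset
amounts = [2.5, 10.0, 150.0, 3000.0]
norm_amounts = standardize(amounts)
print(norm_amounts)
```

After this transformation the column has mean 0 and standard deviation 1, which puts Amount on a scale comparable to the other, already standardized features.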


# Count records per class (0 = normal, 1 = fraud); pd.value_counts is deprecated in recent pandas
count_class = data['Class'].value_counts().sort_index()

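The class counts show that normal and anomalous accounts are heavily imbalanced. Training directly on this data would give poor predictions, so the samples need to be rebalanced. There are generally two approaches: 1. undersampling, 2. oversampling.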
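
Only undersampling is implemented in this project. For comparison, the sketch below shows one simple form of oversampling: duplicating randomly chosen minority rows until both classes have the same size. The `random_oversample` function and the toy rows are hypothetical, and this is plain random duplication, not a synthetic-sample method.

```python
import random

def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority_count = sum(1 for y in labels if y != minority_label)
    # Sample with replacement from the minority rows to fill the gap
    extra = rng.choices(minority, k=majority_count - len(minority))
    new_rows = list(rows) + extra
    new_labels = list(labels) + [minority_label] * len(extra)
    return new_rows, new_labels

# Toy data: five normal rows and one fraud row
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.9]]
labels = [0, 0, 0, 0, 0, 1]
over_rows, over_labels = random_oversample(rows, labels)
print(over_labels.count(0), over_labels.count(1))
```

If oversampling is used, it is usually applied to the training split only, so that duplicated rows do not leak into the test set.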

Undersampling the Data

X = data.iloc[:,data.columns != 'Class']
y = data.iloc[:,data.columns == 'Class']
# Number of data points in the minority (fraud) class
number_record_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)

# Indices of the normal (majority) class
normal_indices = data[data.Class == 0].index

# From the normal indices, randomly select as many as there are fraud records
random_normal_indices = np.random.choice(normal_indices, number_record_fraud, replace = False)
random_normal_indices = np.array(random_normal_indices)

# Concatenate the two sets of indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Undersampled dataset (.loc, because these are index labels rather than positions)
under_sample_data = data.loc[under_sample_indices,:]

X_undersample = under_sample_data.iloc[:,under_sample_data.columns != 'Class']
y_undersample = under_sample_data.iloc[:,under_sample_data.columns == 'Class']

#Showing ratio
print('Percentage of normal transactions:', len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print('Percentage of fraud transactions:', len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print('Total number of transactions in undersample data:', len(under_sample_data))

Percentage of normal transactions: 0.5
Percentage of fraud transactions: 0.5
Total number of transactions in undersample data: 984

Training

from sklearn.model_selection import train_test_split

#Whole dataSet
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)

print("Number transactions train dataset:",len(X_train))
print('Number transactions test dataset:',len(X_test))
print('Total number of transactions:',len(X_train)+len(X_test))

#Undersample dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample,
                                                                                                   y_undersample,
                                                                                                   test_size=0.3,
                                                                                                   random_state=0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))

Number transactions train dataset: 199364
Number transactions test dataset: 85443
Total number of transactions: 284807

Number transactions train dataset: 688
Number transactions test dataset: 296
Total number of transactions: 984

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix,recall_score

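Because the raw data has already been standardized and a LogisticRegression model is easy to build, it allows fast iteration, its results are easy to store, and it is easy to extend. The project therefore uses LogisticRegression for training and prediction.
First, cross-validation is used to find a suitable hyperparameter. The model is evaluated by recall, Recall = TP / (TP + FN), and a confusion matrix summarizes the predictions.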
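
A short sketch of why recall, rather than accuracy, is the metric here. It uses the counts reported in this post (284807 transactions in total, and 492 fraud records, which is half of the 984-row undersampled set) and a hypothetical classifier that labels every transaction as normal. The helper names `recall` and `accuracy` are illustrative.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): the share of actual frauds that are caught."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Accuracy = correct predictions / all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

total = 284807          # total transactions reported above
fraud = 492             # half of the 984-row undersampled set
normal = total - fraud

# Hypothetical classifier that predicts "normal" for every transaction
tp, fn = 0, fraud
tn, fp = normal, 0

print("accuracy:", accuracy(tp, tn, fp, fn))
print("recall:", recall(tp, fn))
```

Such a classifier is right more than 99.8% of the time and still catches no fraud at all, which is why recall is the number to watch on this dataset.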

def print_KFold_scores(x_train_data,y_train_data):
    fold = KFold(5,shuffle=False)
    
    c_param_range = [0.01,0.1,1,10,100]
    # One row per candidate C
    result_table = pd.DataFrame(index = range(len(c_param_range)), columns = ['C_parameter','Mean recall score'])
    result_table['C_parameter'] = c_param_range
    j = 0
    for c_param in c_param_range:
        print('--------------------------------------')
        print('c_param', c_param)
        print('--------------------------------------')
        # Reset for each C so the mean covers only this C's folds
        recall_accs = []
        for train_index,test_index in fold.split(x_train_data):
            # L1-regularized logistic regression; smaller C means stronger regularization
            lr = LogisticRegression(C=c_param, penalty='l1', solver = 'liblinear')
            lr.fit(x_train_data.iloc[train_index,:].values, y_train_data.iloc[train_index,:].values.ravel())
            
            y_pred = lr.predict(x_train_data.iloc[test_index,:].values)
            
            recall_acc = recall_score(y_train_data.iloc[test_index,:].values.ravel(),y_pred )
            print('recall_score',recall_acc)
            recall_accs.append(recall_acc)
        result_table.loc[j,'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Then mean recall_score of test_dataset:',np.mean(recall_accs))
        print('')
    # Pick the C with the highest mean recall across folds
    best_c = result_table.loc[result_table['Mean recall score'].astype('float').idxmax()]['C_parameter']
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    return best_c

best_c = print_KFold_scores(X_train_undersample,y_train_undersample)

c_param 0.01

recall_score 0.9315068493150684
recall_score 0.9178082191780822
recall_score 1.0
recall_score 0.9594594594594594
recall_score 0.9545454545454546

Then mean recall_score of test_dataset: 0.9526639964996129

c_param 0.1

recall_score 0.8356164383561644
recall_score 0.863013698630137
recall_score 0.9322033898305084
recall_score 0.9459459459459459
recall_score 0.9090909090909091

Then mean recall_score of test_dataset: 0.9249190364351729

c_param 1

recall_score 0.8493150684931506
recall_score 0.8904109589041096
recall_score 0.9661016949152542
recall_score 0.9594594594594594
recall_score 0.9090909090909091

Then mean recall_score of test_dataset: 0.9215712303476409

c_param 10

recall_score 0.8904109589041096
recall_score 0.8904109589041096
recall_score 0.9661016949152542
recall_score 0.9459459459459459
recall_score 0.9242424242424242

Then mean recall_score of test_dataset: 0.9220340219063228

c_param 100

recall_score 0.9041095890410958
recall_score 0.8767123287671232
recall_score 0.9661016949152542
recall_score 0.9594594594594594
recall_score 0.9090909090909091

Then mean recall_score of test_dataset: 0.9222461767760118

Best model to choose from cross validation is with C parameter = 0.01
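
Each "Then mean recall_score" line in the log above matches the mean of all fold scores printed so far, not only the five folds for that C. For example, 0.9249 at C = 0.1 is the mean of the first ten scores. The sketch below recomputes both quantities from the fold scores copied from the log; the names `fold_recalls`, `per_c_mean`, and `running_mean` are illustrative.

```python
from statistics import mean

# Fold recall scores copied from the undersampled cross-validation log above
fold_recalls = {
    0.01: [0.9315068493150684, 0.9178082191780822, 1.0,
           0.9594594594594594, 0.9545454545454546],
    0.1: [0.8356164383561644, 0.863013698630137, 0.9322033898305084,
          0.9459459459459459, 0.9090909090909091],
    1: [0.8493150684931506, 0.8904109589041096, 0.9661016949152542,
        0.9594594594594594, 0.9090909090909091],
    10: [0.8904109589041096, 0.8904109589041096, 0.9661016949152542,
         0.9459459459459459, 0.9242424242424242],
    100: [0.9041095890410958, 0.8767123287671232, 0.9661016949152542,
          0.9594594594594594, 0.9090909090909091],
}

# Mean over the five folds of each C on its own
per_c_mean = {c: mean(scores) for c, scores in fold_recalls.items()}

# Mean over every fold seen so far, which is what the log prints
running_mean = {}
seen = []
for c, scores in fold_recalls.items():
    seen.extend(scores)
    running_mean[c] = mean(seen)

best_c = max(per_c_mean, key=per_c_mean.get)

for c in fold_recalls:
    print(f"C={c}: per-C mean={per_c_mean[c]:.4f}, running mean={running_mean[c]:.4f}")
print("best C by per-C mean:", best_c)
```

On these numbers C = 0.01 still has the highest per-C mean, so the selected parameter is the same either way.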

Plotting the Confusion Matrix

def plot_confusion_matrix(cm,classes,title = 'Confusion matrix', cmap = plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j,i,cm[i,j],
                    horizontalalignment='center',
                    color='white' if cm[i,j] > thresh else 'black')
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

lr = LogisticRegression(C = best_c, penalty = 'l1',solver='liblinear')
# Fit on plain arrays so that fit and predict receive the same input type
lr.fit(X_train_undersample.values,y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

[Figure: confusion matrix for the undersampled test set]
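Recall is about 93.2%, which is a good result, but 18 normal users in the test sample are misclassified as anomalous, and that is with only a small amount of data.
What do the predictions look like on the full dataset?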

best_c = print_KFold_scores(X_train,y_train)

c_param 0.01

recall_score 0.4925373134328358
recall_score 0.6027397260273972
recall_score 0.6833333333333333
recall_score 0.5692307692307692
recall_score 0.45

Then mean recall_score of test_dataset: 0.5595682284048672

c_param 0.1

recall_score 0.5671641791044776
recall_score 0.6164383561643836
recall_score 0.6833333333333333
recall_score 0.5846153846153846
recall_score 0.525

Then mean recall_score of test_dataset: 0.5774392395241914

c_param 1

recall_score 0.5522388059701493
recall_score 0.6164383561643836
recall_score 0.7166666666666667
recall_score 0.6153846153846154
recall_score 0.5625

Then mean recall_score of test_dataset: 0.5891747226285153

c_param 10

recall_score 0.5522388059701493
recall_score 0.6164383561643836
recall_score 0.7333333333333333
recall_score 0.6153846153846154
recall_score 0.575

Then mean recall_score of test_dataset: 0.5965007975140104

c_param 100

recall_score 0.5522388059701493
recall_score 0.6164383561643836
recall_score 0.7333333333333333
recall_score 0.6153846153846154
recall_score 0.575

Then mean recall_score of test_dataset: 0.6008964424453077

Best model to choose from cross validation is with C parameter = 100.0

lr = LogisticRegression(C = best_c, penalty = 'l1',solver='liblinear')
# Fit and predict on the full (not undersampled) data, using plain arrays throughout
lr.fit(X_train.values,y_train.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test,y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

[Figure: confusion matrix for the full test set]
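The results show that a model trained on data without undersampling does not predict well: in the cross-validation log above, per-fold recall ranges only from about 0.45 to 0.73.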

Tuning

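The next question is how to reduce the model's misclassification rate. The approach tried here is to change the model's decision threshold.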

lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample.values, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
plt.figure(figsize=(10,10))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i
    plt.subplot(3,3,j)
    j+=1
    cnf_matrix = confusion_matrix(y_test_undersample.values,y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    
    print('Recall metric in the testing dataset:',cnf_matrix[1,1]/(cnf_matrix[1,1]+cnf_matrix[1,0]))
    
    classes = [0,1]
    # The comparison above is strict (>), so the title uses > as well
    plot_confusion_matrix(cnf_matrix,classes=classes,title = 'Threshold > %s'%i)
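
The trade-off behind the threshold sweep can be seen on a handful of numbers. The sketch below applies the same strict comparison (probability > threshold) to a small set of hypothetical labels and probabilities and counts the confusion-matrix cells by hand; `confusion_counts` and the toy data are illustrative, not taken from the dataset.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for probability > threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p > threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical labels and predicted fraud probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.30, 0.55, 0.65, 0.45, 0.62, 0.80, 0.95]

for threshold in (0.3, 0.5, 0.7):
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    print(f"threshold={threshold}: recall={tp / (tp + fn):.2f}, false positives={fp}")
```

Raising the threshold removes false positives but also lowers recall, so the choice of threshold is a judgment about which kind of error costs more.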

[Figure: 3 by 3 grid of confusion matrices, one per threshold from 0.1 to 0.9]
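As the figure above shows, when a user is flagged as anomalous only if the predicted probability exceeds 0.6 or 0.7, only about 1 to 3 users are misclassified.
At this point the model is essentially complete: undersampling handles the class imbalance, and a threshold of 0.6 or 0.7 brings the result (the recall printed by the code above) to about 89.1% with only 1 to 3 misclassified users, which meets practical requirements.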

Saving the Model

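# sklearn.externals.joblib was removed from scikit-learn; joblib is a separate package
import joblib
filename = 'finalized_model.sav'
# Persist the last fitted model (lr, trained on the undersampled data)
joblib.dump(lr, filename)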
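
The tuned threshold is not stored inside the fitted estimator, whose predict method uses its own default cut-off, so one option is to save the threshold alongside the model. The sketch below shows that round trip with the standard library's pickle, using a plain dict as a hypothetical stand-in for the fitted estimator and a temporary directory instead of the working directory.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted estimator, bundled with the chosen threshold
bundle = {
    "model": {"coef": [0.4, -1.2], "intercept": 0.1},
    "threshold": 0.6,
}

# Write to a temporary directory so the sketch leaves the working directory untouched
path = os.path.join(tempfile.mkdtemp(), "finalized_model.sav")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Reload and confirm that both parts survive the round trip
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["threshold"])
```

The same bundle could be passed to joblib.dump and read back with joblib.load; as with any pickle-based format, files should only be loaded from trusted sources.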

Machine Learning in Practice: A Credit Card Fraud Case Study