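Project Overview
Credit cards are ubiquitous in today's society. Detecting credit card fraud is a hard problem, and one that matters a great deal in both academia and industry. This project trains a detection model on existing cardholder data and uses it to flag whether a credit card user is anomalous.
Importing the Data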
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
data = pd.read_csv('F:\\card_data\\creditcard.csv')
data.head()
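Most features in the dataset have already been standardized, but the Amount feature is on a very different numeric scale from the others and needs to be standardized as well. The Time feature is treated as redundant for prediction here and is dropped.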
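To make the rescaling concrete, here is a minimal sketch of the z-score transform on a handful of made-up amounts (the values are illustrative and not taken from creditcard.csv). It assumes the population standard deviation (ddof=0), which is the convention StandardScaler follows.

```python
import numpy as np

# Toy transaction amounts (illustrative values only).
amounts = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# z-score: subtract the mean, then divide by the population
# standard deviation (NumPy's default ddof=0).
norm_amounts = (amounts - amounts.mean()) / amounts.std()

# After the transform the column has mean 0 and standard deviation 1.
print(norm_amounts)
print(norm_amounts.mean(), norm_amounts.std())
```

After this transform the amounts sit on the same scale as the other, already standardized, features.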
First, preprocess the data:
from sklearn.preprocessing import StandardScaler
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)
data.head()
# Count the records in each class (0 = normal, 1 = fraud)
count_class = data['Class'].value_counts().sort_index()
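Counting the labels in the training data shows that normal accounts vastly outnumber fraudulent ones, so training directly on this data would give poor predictions. The samples therefore need rebalancing, and there are generally two ways to do it: (1) undersampling and (2) oversampling.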
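This project uses only the first option. For comparison, here is a minimal sketch of the second one, plain random oversampling, on made-up toy labels. The names `X_toy`, `y_toy` and `rng` are hypothetical and appear nowhere else in the project.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy labels: 8 normal (0) and 2 fraud (1) rows; the values are made up.
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # one dummy feature per row

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw minority rows WITH replacement until both classes have the same size.
resampled_minority = rng.choice(minority_idx, size=len(majority_idx), replace=True)
over_idx = np.concatenate([majority_idx, resampled_minority])

X_over, y_over = X_toy[over_idx], y_toy[over_idx]
print("rows after oversampling:", len(y_over))
print("fraction of fraud rows:", y_over.mean())
```

Oversampling keeps every normal record but repeats fraud records, whereas undersampling discards most normal records; which trade-off is acceptable depends on the data.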
Undersampling the data:
X = data.iloc[:,data.columns != 'Class']
y = data.iloc[:,data.columns == 'Class']
# Number of data points in the minority (fraud) class
number_record_fraud = len(data[data.Class == 1])
fraud_indices = np.array(data[data.Class == 1].index)
# Indices of the normal class
normal_indices = data[data.Class == 0].index
# From the normal indices, randomly select as many as there are fraud records
random_normal_indices = np.random.choice(normal_indices, number_record_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)
# Combine the two sets of indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
# Build the undersampled dataset
under_sample_data = data.iloc[under_sample_indices, :]
X_undersample = under_sample_data.iloc[:,under_sample_data.columns != 'Class']
y_undersample = under_sample_data.iloc[:,under_sample_data.columns == 'Class']
#Showing ratio
print('Percentage of normal transactions:', len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print('Percentage of fraud transactions:', len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print('Total number of transactions in undersample data:', len(under_sample_data))
Percentage of normal transactions: 0.5
Percentage of fraud transactions: 0.5
Total number of transactions in undersample data: 984
Training
from sklearn.model_selection import train_test_split
#Whole dataSet
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
print("Number transactions train dataset:",len(X_train))
print('Number transactions test dataset:',len(X_test))
print('Total number of transactions:',len(X_train)+len(X_test))
#Undersample dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(
    X_undersample,
    y_undersample,
    test_size=0.3,
    random_state=0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))
Number transactions train dataset: 199364
Number transactions test dataset: 85443
Total number of transactions: 284807
Number transactions train dataset: 688
Number transactions test dataset: 296
Total number of transactions: 984
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix,recall_score
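Because the raw data has already been standardized and a LogisticRegression model is easy to build, it allows quick iteration, its results are easy to store, and it is easy to extend. This project therefore uses LogisticRegression for training and prediction.
First, cross-validation is used to find a suitable hyperparameter. The model is scored by recall, Recall = TP / (TP + FN), and a confusion matrix is used to tally the predictions.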
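A short worked example of why recall rather than accuracy is the metric of interest here. The class counts are the ones implied by the outputs above (284807 transactions, 492 of them fraud, since the balanced undersample holds 984 rows). The "always predict normal" baseline is hypothetical and exists only for illustration.

```python
# Class counts implied by the outputs above.
total = 284807
fraud = 492
normal = total - fraud

# Hypothetical baseline that labels every transaction as normal.
tp, fn = 0, fraud    # no fraud is caught
tn, fp = normal, 0   # every normal transaction is labelled correctly

accuracy = (tp + tn) / total
recall = tp / (tp + fn)  # Recall = TP / (TP + FN)

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

The baseline is right on almost every transaction yet catches no fraud at all, which is exactly the failure that recall exposes and accuracy hides.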
def print_KFold_scores(x_train_data, y_train_data):
    fold = KFold(5, shuffle=False)
    # Candidate values for the inverse regularization strength C
    c_param_range = [0.01, 0.1, 1, 10, 100]
    result_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    result_table['C_parameter'] = c_param_range
    j = 0
    for c_param in c_param_range:
        print('--------------------------------------')
        print('c_param', c_param)
        print('--------------------------------------')
        # Reset for every C so each mean covers only this C's folds
        recall_accs = []
        for train_index, test_index in fold.split(x_train_data):
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[train_index, :].values, y_train_data.iloc[train_index, :].values.ravel())
            y_pred = lr.predict(x_train_data.iloc[test_index, :].values)
            recall_acc = recall_score(y_train_data.iloc[test_index, :].values.ravel(), y_pred)
            print('recall_score', recall_acc)
            recall_accs.append(recall_acc)
        # Store the mean recall for this C by column name
        result_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Then mean recall_score of test_dataset:', np.mean(recall_accs))
        print('')
    # Pick the C with the highest mean recall
    best_c = result_table.loc[result_table['Mean recall score'].astype('float').idxmax()]['C_parameter']
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    return best_c
best_c = print_KFold_scores(X_train_undersample,y_train_undersample)
c_param 0.01
recall_score 0.9315068493150684
recall_score 0.9178082191780822
recall_score 1.0
recall_score 0.9594594594594594
recall_score 0.9545454545454546
Then mean recall_score of test_dataset: 0.9526639964996129
c_param 0.1
recall_score 0.8356164383561644
recall_score 0.863013698630137
recall_score 0.9322033898305084
recall_score 0.9459459459459459
recall_score 0.9090909090909091
Then mean recall_score of test_dataset: 0.9249190364351729
c_param 1
recall_score 0.8493150684931506
recall_score 0.8904109589041096
recall_score 0.9661016949152542
recall_score 0.9594594594594594
recall_score 0.9090909090909091
Then mean recall_score of test_dataset: 0.9215712303476409
c_param 10
recall_score 0.8904109589041096
recall_score 0.8904109589041096
recall_score 0.9661016949152542
recall_score 0.9459459459459459
recall_score 0.9242424242424242
Then mean recall_score of test_dataset: 0.9220340219063228
c_param 100
recall_score 0.9041095890410958
recall_score 0.8767123287671232
recall_score 0.9661016949152542
recall_score 0.9594594594594594
recall_score 0.9090909090909091
Then mean recall_score of test_dataset: 0.9222461767760118
Best model to choose from cross validation is with C parameter = 0.01
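The "mean recall_score" lines in the log above work out to running averages over every fold printed so far, not averages over each C's own five folds. For example, the 0.9249 reported for C = 0.1 equals the mean of the ten fold scores printed up to that point. As a cross-check, the sketch below recomputes a per-C mean directly from the fold scores in the log. The dictionary `fold_recalls` is just a hand-copied transcription of that output.

```python
from statistics import mean

# Fold recalls copied from the log above (undersampled training set).
fold_recalls = {
    0.01: [0.9315068493150684, 0.9178082191780822, 1.0,
           0.9594594594594594, 0.9545454545454546],
    0.1: [0.8356164383561644, 0.863013698630137, 0.9322033898305084,
          0.9459459459459459, 0.9090909090909091],
    1: [0.8493150684931506, 0.8904109589041096, 0.9661016949152542,
        0.9594594594594594, 0.9090909090909091],
    10: [0.8904109589041096, 0.8904109589041096, 0.9661016949152542,
         0.9459459459459459, 0.9242424242424242],
    100: [0.9041095890410958, 0.8767123287671232, 0.9661016949152542,
          0.9594594594594594, 0.9090909090909091],
}

# Mean recall over each C's own five folds.
per_c_mean = {c: mean(scores) for c, scores in fold_recalls.items()}
for c, m in per_c_mean.items():
    print(f"C = {c}: mean recall over its own folds = {m:.4f}")

# Running average over the first ten folds, for comparison with the log.
running_mean_after_two = mean(fold_recalls[0.01] + fold_recalls[0.1])
print(f"running mean after C = 0.1: {running_mean_after_two:.4f}")

best_c_check = max(per_c_mean, key=per_c_mean.get)
print("C with the highest per-C mean recall:", best_c_check)
```

For these fold scores the per-C means still favor C = 0.01, so the selection for the undersampled data is unchanged.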
Plotting the Confusion Matrix
def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    # Use white text on dark cells and black text on light cells
    thresh = cm.max() / 2
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, cm[i, j],
                     horizontalalignment='center',
                     color='white' if cm[i, j] > thresh else 'black')
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
lr = LogisticRegression(C = best_c, penalty = 'l1',solver='liblinear')
# Fit on plain arrays so fit and predict receive the same input type
lr.fit(X_train_undersample.values, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample,y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
, classes=class_names
, title='Confusion matrix')
plt.show()
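Recall is about 93.2%, which looks good, but 18 normal users in the test sample were misclassified as anomalous, and that is with only a small amount of data.
Let's see how the predictions hold up on the full dataset.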
best_c = print_KFold_scores(X_train,y_train)
c_param 0.01
recall_score 0.4925373134328358
recall_score 0.6027397260273972
recall_score 0.6833333333333333
recall_score 0.5692307692307692
recall_score 0.45
Then mean recall_score of test_dataset: 0.5595682284048672
c_param 0.1
recall_score 0.5671641791044776
recall_score 0.6164383561643836
recall_score 0.6833333333333333
recall_score 0.5846153846153846
recall_score 0.525
Then mean recall_score of test_dataset: 0.5774392395241914
c_param 1
recall_score 0.5522388059701493
recall_score 0.6164383561643836
recall_score 0.7166666666666667
recall_score 0.6153846153846154
recall_score 0.5625
Then mean recall_score of test_dataset: 0.5891747226285153
c_param 10
recall_score 0.5522388059701493
recall_score 0.6164383561643836
recall_score 0.7333333333333333
recall_score 0.6153846153846154
recall_score 0.575
Then mean recall_score of test_dataset: 0.5965007975140104
c_param 100
recall_score 0.5522388059701493
recall_score 0.6164383561643836
recall_score 0.7333333333333333
recall_score 0.6153846153846154
recall_score 0.575
Then mean recall_score of test_dataset: 0.6008964424453077
Best model to choose from cross validation is with C parameter = 100.0
lr = LogisticRegression(C = best_c, penalty = 'l1',solver='liblinear')
lr.fit(X_train.values, y_train.values.ravel())
# Predictions on the full (not undersampled) test set
y_pred = lr.predict(X_test.values)
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
, classes=class_names
, title='Confusion matrix')
plt.show()
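The results show that a model trained on data that has not been undersampled does not predict well.
Tuning
The next question is how to lower the model's misclassification rate. The approach taken here is to change the model's decision threshold.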
lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample.values, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)
thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
plt.figure(figsize=(10,10))
j = 1
for i in thresholds:
    # Flag a user as anomalous only if the fraud probability exceeds the threshold
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i
    plt.subplot(3, 3, j)
    j += 1
    cnf_matrix = confusion_matrix(y_test_undersample.values, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print('Recall metric in the testing dataset:', cnf_matrix[1, 1] / (cnf_matrix[1, 1] + cnf_matrix[1, 0]))
    classes = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=classes, title='Threshold > %s' % i)
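The plots above show that when a user is flagged as anomalous only if the predicted probability exceeds 0.6 or 0.7, just 1 to 3 users are misclassified.
That essentially completes the model: undersampling handles the class imbalance, and setting the decision threshold to 0.6 or 0.7 gives a recall of about 89.1% while misclassifying only 1 to 3 users, which meets the practical requirement.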
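The trade-off behind the threshold choice can be sketched on a few made-up probabilities. The arrays `proba_fraud` and `y_true` stand in for `lr.predict_proba(...)[:, 1]` and the test labels; their values and the helper `confusion_counts` are invented for illustration and are not results from this project.

```python
import numpy as np

# Made-up fraud probabilities and true labels (illustrative only).
proba_fraud = np.array([0.05, 0.30, 0.55, 0.65, 0.58, 0.72, 0.95, 0.40])
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tn, fp, fn, tp

results = {}
for threshold in (0.5, 0.6, 0.7):
    # Predict fraud only when the probability exceeds the threshold.
    y_pred = (proba_fraud > threshold).astype(int)
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    results[threshold] = {"false_positives": fp, "recall": tp / (tp + fn)}
    print(threshold, results[threshold])
```

On this toy data, raising the threshold removes false positives but also lets some fraud slip through, so the threshold should be chosen with both numbers in view.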
Saving the Model
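# sklearn.externals.joblib has been removed from scikit-learn; import joblib directly
import joblib
filename = 'finalized_model.sav'
# Save the fitted LogisticRegression model trained above
joblib.dump(lr, filename)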