Logistic Regression: Credit Card Fraud Detection


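This post covers cross-validation, recall, and the regularization penalty.
Code and data: https://pan.baidu.com/s/1E-n0iCNr4oFr4VxPPFrSBg (password: fjvk)
Part of a slow-paced introduction to machine learning.

1. Exploring the data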

# Credit card fraud detection
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = pd.read_csv("creditcard.csv")
print(data.head())

# value_counts tallies how many rows have Class == 0 and how many have Class == 1
count_classes = data['Class'].value_counts().sort_index()
print(count_classes)
count_classes.plot(kind='bar')  # bar chart, drawn directly through pandas
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.xticks(rotation=0)
plt.show()


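The counts for Class 0 and Class 1 differ enormously.

By common sense, we infer that Class = 0 marks a normal transaction and Class = 1 marks an abnormal one that may involve credit card fraud. The latter is the class we want to find.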

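Looking at the table, we find:

1.1 Feature scales differ widely

The values in columns V1 to V28 mostly fall between -2 and 1, whereas Amount varies widely: in the first five rows alone the smallest value is 2.69 and the largest is 378.66. This can mislead a learning algorithm into treating features with large values as more important than features with small values. To put the features on a comparable scale, import StandardScaler from sklearn's preprocessing module and call fit_transform to transform the data.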


from sklearn.preprocessing import StandardScaler
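# A pandas Series has no reshape method, so go through .values to get a 2-D NumPy array
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))  # new standardized feature
data = data.drop(['Time', 'Amount'], axis=1)  # drop the features we no longer need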
print(data.head())

The first five rows printed above show the transformed data, with normAmount in place of Amount.
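To make the transformation concrete, here is a minimal sketch of the z-score computation that standardization performs: subtract the mean, then divide by the population standard deviation. The five amounts are illustrative values only (the text above confirms just the minimum 2.69 and the maximum 378.66), and the sketch uses only the Python standard library rather than StandardScaler itself.

```python
from statistics import mean, pstdev

# Illustrative amounts; only 2.69 and 378.66 are quoted in the text above
amounts = [2.69, 149.62, 378.66, 123.50, 69.99]

mu = mean(amounts)
sigma = pstdev(amounts)  # population standard deviation (divides by n)

# z-score: subtract the mean, divide by the standard deviation
norm_amounts = [(a - mu) / sigma for a in amounts]

print([round(z, 3) for z in norm_amounts])
# After the transformation the values have mean 0 and standard deviation 1
print(round(mean(norm_amounts), 6), round(pstdev(norm_amounts), 6))
```

After this step the column is on a scale comparable to V1 to V28, which is the point of the normAmount feature.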

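1.2 Normal and abnormal samples are extremely imbalanced

Ways to handle imbalanced samples:

Over-sampling:
Generate additional samples for the minority class until both classes are the same size.
Under-sampling:
Randomly pick as many samples from the majority class as there are in the minority class, then combine the two, so that both classes are equally small.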
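The two strategies can be sketched on a toy label list. The class counts (95 normal, 5 fraud) are made up for illustration, and the over-sampling shown here is naive duplication with replacement, which is a simpler stand-in for generating new samples, not the SMOTE algorithm used later in this post.

```python
import random

random.seed(0)

# Toy labels: 95 normal (0) and 5 fraud (1); counts are hypothetical
labels = [0] * 95 + [1] * 5
normal_idx = [i for i, y in enumerate(labels) if y == 0]
fraud_idx = [i for i, y in enumerate(labels) if y == 1]

# Under-sampling: keep every fraud row, draw the same number of normal rows without replacement
under_idx = fraud_idx + random.sample(normal_idx, len(fraud_idx))

# Naive over-sampling: keep every normal row, draw fraud rows with replacement until counts match
over_idx = normal_idx + random.choices(fraud_idx, k=len(normal_idx))

print("under-sampled size:", len(under_idx))
print("over-sampled size:", len(over_idx))
```

Under-sampling throws away most of the majority class, while over-sampling keeps all the data at the cost of a larger training set.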

2 Under-sampling strategy

2.1 Building the under-sampled dataset

# .ix has been removed from pandas; .loc does the same label-based selection here
X = data.loc[:, data.columns != 'Class']  # all samples, every column except Class
y = data.loc[:, data.columns == 'Class']  # all samples, only the Class column (our label)

# Number of data points in the minority class
number_records_fraud = len(data[data.Class == 1])  # how many rows have Class == 1
fraud_indices = np.array(data[data.Class == 1].index)  # indices of the Class == 1 samples
# print(fraud_indices)
# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index  # indices of the Class == 0 samples

# Out of the indices we picked, randomly select number_records_fraud of them; replace=False samples without replacement
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)
# after picking the indices, convert them to np.array format

# Appending the 2 indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])

# Under sample dataset: the combined data
# (positions equal labels here because data has the default integer index)
under_sample_data = data.iloc[under_sample_indices, :]

X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']

# Showing ratio
print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0])/len(under_sample_data))
print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1])/len(under_sample_data))
print("Total number of transactions in resampled data: ", len(under_sample_data))

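The printed output shows that after under-sampling there are 492 normal samples, 50% of the total, and the same number of abnormal samples. The combined dataset has 984 samples. However, the under-sampling strategy has potential problems.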

2.2 Splitting the dataset into train and test

Randomly split both the original dataset and the under-sampled dataset into 70% train and 30% test.

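from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed
# Whole dataset: the original data; its test set is used for the final evaluation
# Split first, at random: test_size is the test fraction, random_state=0 fixes the random seed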
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

# Undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample, y_undersample, test_size=0.3, random_state=0)
print("")
print("Number transactions train dataset: ", len(X_train_undersample))
print("Number transactions test dataset: ", len(X_test_undersample))
print("Total number of transactions: ", len(X_train_undersample)+len(X_test_undersample))

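In the printed output, the first three lines are the split of the original dataset and the last three are the split of the under-sampled dataset.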

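2.3 Building the logistic regression model and using cross-validation to find the best penalty strength

2.3.1 Splitting the train set further for cross-validation

To cross-validate, split the train set from the previous step into 5 folds.
Logistic regression model: instantiate LogisticRegression and pass the data in. Note that in scikit-learn C is the inverse of the regularization strength, so a smaller C means a stronger penalty.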

# Recall = TP/(TP+FN)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score  # sklearn.cross_validation was removed
from sklearn.metrics import confusion_matrix, recall_score, classification_report
def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(n_splits=5, shuffle=False)  # split the train set into 5 folds

    # Different C parameters (regularization penalty); which strength works best has to be tried
    c_param_range = [0.01, 0.1, 1, 10, 100]

    # One row per candidate C
    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # each split yields 2 arrays: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold.split(x_train_data), start=1):  # cross-validation
            # Call the logistic regression model with a certain C parameter (instantiate the model).
            # penalty='l1' needs a solver that supports it; the default lbfgs does not, liblinear does.
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')

            # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
            # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())

            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)

            # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # The mean value of those recall scores is the metric we want to save and get hold of.
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    # The column has object dtype, so cast to float before calling idxmax
    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax()]['C_parameter']

    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c


best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)

On the under-sampled dataset, the best penalty parameter found is C = 0.01 (the original screenshot captured only part of the output).
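The 5-fold split inside printing_Kfold_scores can be pictured with a small index-only sketch. The helper kfold_indices is hypothetical, written here to mimic contiguous, unshuffled folds in which the first few folds absorb the remainder. The training-set size of 688 is an assumption derived from a 70% split of the 984 under-sampled rows.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        size = base + (1 if k < extra else 0)  # the first `extra` folds get one more sample
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

# Assumed size of the under-sampled training set
folds = list(kfold_indices(688, 5))
for i, (train, val) in enumerate(folds, start=1):
    print("fold", i, "train:", len(train), "validation:", len(val))
```

Each sample lands in the validation part exactly once, so every candidate C is scored on all of the training data across the five iterations.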

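2.4 Plotting the confusion matrix and computing recall

The confusion matrix makes it easy to compute recall.

2.4.1 Plotting function

First write a helper function for plotting.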

import itertools  # needed for itertools.product below


def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

$Recall = TP/(TP+FN)$ (more background on recall will be added later)
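As a small worked example of the recall formula, the sketch below counts true positives and false negatives directly from two label lists. The labels are toy values chosen for illustration, and the recall helper is hypothetical (it is not sklearn's recall_score, though it should agree with it for binary labels).

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# Toy data: 4 fraud cases, 3 of them caught, plus 1 false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

print(recall(y_true, y_pred))  # 3 of 4 fraud cases are caught: 0.75
```

The false alarm in the fifth position does not change the result, because recall only looks at the rows whose true label is 1.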

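2.4.2 Confusion matrix and recall on the under-sampled dataset

Note the line y_pred_undersample = lr.predict(X_test_undersample.values):
y_pred_undersample holds the predicted y values, and the predict method of the logistic regression object lr is given the test set of the under-sampled dataset.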

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')  # liblinear supports the L1 penalty
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1, 1]/(cnf_matrix[1, 0]+cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

Recall = TP/(TP+FN), with TP=138, FN=9, TN=132, FP=17
Recall metric in the testing dataset: 0.9387755102040817

What matters more is testing on the original data.

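2.4.3 Confusion matrix and recall on the original dataset

Note the line y_pred = lr.predict(X_test.values):
y_pred holds the predicted y values, and the predict method of the logistic regression object lr is given the test set of the original dataset.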

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')  # liblinear supports the L1 penalty
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1, 1]/(cnf_matrix[1, 0]+cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

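Recall = TP/(TP+FN), with TP=136, FN=11, TN=75313, FP=9983
Recall metric in the testing dataset: 0.9251700680272109
There is a problem:
The top-right cell, samples whose true label is 0 but were predicted as 1, contains 9983 samples, which is far too many. These 9983 normal transactions are wrongly flagged. This does not affect recall, but it drags precision down sharply. In practice it also adds work: the goal is to find the abnormal samples, and flagging nearly ten thousand extra ones means handling each of them unnecessarily.
So the penalty parameter obtained from the under-sampled dataset is problematic.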
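To quantify the problem, the short calculation below plugs the four counts reported for the original test set (TP=136, FN=11, TN=75313, FP=9983) into the standard definitions of recall, precision, and accuracy. Only the formulas are added here; the counts come from the run described above.

```python
# Counts reported above for the original test set
tp, fn, tn, fp = 136, 11, 75313, 9983

recall = tp / (tp + fn)                      # share of fraud cases that were caught
precision = tp / (tp + fp)                   # share of flagged transactions that are really fraud
accuracy = (tp + tn) / (tp + fn + tn + fp)   # share of all predictions that are correct

print("recall:   ", round(recall, 4))
print("precision:", round(precision, 4))
print("accuracy: ", round(accuracy, 4))
```

Recall stays above 92%, yet only about 1.3% of the flagged transactions are actual fraud, which is the extra workload the paragraph above describes.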

3 Over-sampling strategy

The SMOTE algorithm

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# Over-sampling
credit_cards = pd.read_csv('creditcard.csv')
columns = credit_cards.columns
# The labels are in the last column ('Class'). Simply remove it to obtain features columns
features_columns = columns.delete(len(columns)-1)

features = credit_cards[features_columns]
labels = credit_cards['Class']

features_train, features_test, labels_train, labels_test = train_test_split(features,
                                                                            labels,
                                                                            test_size=0.2,
                                                                            random_state=0)

oversampler = SMOTE(random_state=0)
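# fit_sample was renamed to fit_resample in newer imbalanced-learn releases
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)
# generate the new samples from the training set only
print(len(os_labels[os_labels == 1]))
print(len(os_labels[os_labels != 1]))
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)  # the balanced dataset
# find the best penalty strength again
best_c = printing_Kfold_scores(os_features, os_labels)

# plug best_c into the logistic regression model
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')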
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1, 1]/(cnf_matrix[1, 0]+cnf_matrix[1, 1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

After SMOTE, the training set contains 227454 positive and 227454 negative samples.
Confusion matrix: TP=91, FN=10, TN=56333, FP=528
Recall = TP/(TP+FN)
The run printed best_c = 1.0 and recall = 0.900990099009901
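The core idea behind SMOTE is to create a synthetic minority sample somewhere on the line segment between an existing minority sample and one of its nearest minority-class neighbors. The sketch below is a simplified illustration of that interpolation step only, not imblearn's implementation: the function smote_like_sample is hypothetical, and the neighbors are supplied by hand instead of being found with a nearest-neighbor search.

```python
import random

def smote_like_sample(point, neighbors, rng):
    """Create one synthetic point on the segment between `point` and a random neighbor."""
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(0)

# Toy 2-D minority sample and two hand-picked minority neighbors (illustrative values)
minority_point = [1.0, 2.0]
nearest_minority_neighbors = [[2.0, 3.0], [1.5, 1.0]]

synthetic = [smote_like_sample(minority_point, nearest_minority_neighbors, rng) for _ in range(5)]
for s in synthetic:
    print([round(v, 3) for v in s])
```

Because the new points are interpolated rather than copied, the minority class gains variety instead of exact duplicates.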

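4. How the logistic regression threshold affects the result

The threshold is the cut-off applied to the predicted probability: a sample whose predicted probability of fraud is above the threshold is classified as fraud (1), and otherwise as normal (0).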

lr = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')  # liblinear supports the L1 penalty
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)
# predict_proba returns the predicted class probabilities

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

plt.figure(figsize=(15, 15))

j = 1
for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i

    plt.subplot(3, 3, j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)

    print("Recall metric in the testing dataset: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold > %s' % i)
plt.show()


Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 1.0
Recall metric in the testing dataset: 0.9863945578231292
Recall metric in the testing dataset: 0.9387755102040817
Recall metric in the testing dataset: 0.8843537414965986
Recall metric in the testing dataset: 0.8367346938775511
Recall metric in the testing dataset: 0.7414965986394558
Recall metric in the testing dataset: 0.5714285714285714
The higher the threshold, the lower the recall; precision tends to rise in return.
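The trade-off can be reproduced on a handful of made-up probabilities. The sketch below applies the same rule as the loop above (flag a sample when its predicted fraud probability is greater than the threshold) and reports recall and precision per threshold. The labels, probabilities, and the helper recall_precision are all hypothetical.

```python
def recall_precision(y_true, proba, threshold):
    """Return (recall, precision) when samples with proba > threshold are flagged as fraud."""
    pred = [1 if p > threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return recall, precision

# Toy labels and predicted fraud probabilities (illustrative values)
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
proba = [0.05, 0.15, 0.35, 0.45, 0.65, 0.25, 0.55, 0.75, 0.85, 0.95]

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
results = [recall_precision(y_true, proba, t) for t in thresholds]
for t, (r, p) in zip(thresholds, results):
    print("threshold", t, "recall", round(r, 3), "precision", round(p, 3))
```

Recall can only fall or stay flat as the threshold rises, while precision usually improves but is not guaranteed to increase at every step.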

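Summary

The more samples the better; when the classes are imbalanced, prefer the over-sampling strategy.
When feature scales differ widely and all features should count equally, transform the data by standardizing it.
Use cross-validation to find the best parameters.
When evaluating a model, accuracy (number of correct predictions / size of the test set) can be misleading.
Recall (also called sensitivity or the true positive rate) is the better metric, especially for detection tasks.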
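A quick calculation shows why accuracy can mislead on imbalanced data. The counts are taken from the original test set reported earlier (147 fraud cases among 85443 transactions); the "always predict normal" classifier is a hypothetical baseline introduced only for this illustration.

```python
# Original test set from section 2.4.3: 147 fraud cases out of 85443 transactions
n_total = 85443
n_fraud = 147

# Hypothetical baseline that labels every transaction as normal (Class 0)
tp, fn = 0, n_fraud
tn, fp = n_total - n_fraud, 0

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print("accuracy:", round(accuracy, 4))
print("recall:  ", recall)
```

The baseline is right more than 99.8% of the time and still catches no fraud at all, which is why recall is the metric to watch here.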