Anomaly detection algorithm for credit card fraud based on a deep learning Autoencoder


Deep learning works well for anomaly detection. On the credit card fraud dataset, Isolation Forest reaches a Top-1000 precision of about 26%, while the Autoencoder algorithm can reach up to 33.6%. The result is unstable, however, sometimes dropping to around 25%; still, the model clearly has great potential, and more experiments are needed to find a more stable network structure.


AutoEncoder (AE) is a kind of artificial neural network used in semi-supervised and unsupervised learning. An autoencoder consists of two parts: an encoder and a decoder.

AutoEncoder is an important and very interesting part of deep learning. A neural network is trained end-to-end on large datasets to continuously improve its accuracy; an AutoEncoder designs the encoding and decoding processes so that the input and output become closer and closer. This is an unsupervised learning process that can be applied to dimensionality reduction and anomaly detection. Autoencoders built with convolutional layers can also be applied to computer vision problems such as image denoising and neural style transfer. This article mainly explains how to use an AutoEncoder for anomaly detection experiments.

Using an AutoEncoder for noise reduction: with a convolutional autoencoder the denoising effect is quite good; the final generated image looks very smooth, and the noise is almost invisible.

(Figure: image denoising with a convolutional autoencoder)

Dimensionality reduction with AutoEncoder.

(Figure: dimensionality reduction with an AutoEncoder)

1. Introduction to Autoencoder structure

An autoencoder essentially uses a neural network to produce a low-dimensional representation of a high-dimensional input. It is similar to PCA, but with nonlinear activation functions it overcomes PCA's linearity limitation.

An autoencoder contains two main parts: an encoder and a decoder. The encoder's role is to find a compressed representation of the given data, and the decoder reconstructs the original input. During training, the decoder forces the autoencoder to select the most informative features, which end up stored in the compressed representation; the final compressed representation sits in the middle (bottleneck) layer.

The following figure is an example: the original data has 10 dimensions, the encoder and decoder have two layers each, and the bottleneck layer in the middle has 3 nodes, which means the original data is reduced to only 3 dimensions. The decoder then reconstructs the original data from the reduced representation and produces a 10-dimensional output again. In the process from input to output, the autoencoder in effect also performs noise reduction.

(Figure: autoencoder network structure, 10 → 3 → 10)
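As a concrete illustration of this structure, here is a minimal Keras sketch of a 10 → 3 → 10 autoencoder; the layer sizes and variable names are assumptions for illustration, not the model trained later in this article.

# A toy 10 -> 3 -> 10 autoencoder matching the figure (illustrative sketch)
from keras.models import Model
from keras.layers import Input, Dense

toy_input = Input(shape=(10,))                    # 10-dimensional input
h = Dense(6, activation='relu')(toy_input)        # encoder layer
z = Dense(3, activation='relu')(h)                # 3-node bottleneck
h = Dense(6, activation='relu')(z)                # decoder layer
toy_output = Dense(10, activation='linear')(h)    # reconstructed 10-dim output

toy_ae = Model(inputs=toy_input, outputs=toy_output)
toy_ae.compile(optimizer='adam', loss='mean_squared_error')
# Training uses the input as its own target: toy_ae.fit(X, X, ...)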

2. Autoencoder anomaly detection process

Anomaly detection is usually divided into supervised and unsupervised settings. In the unsupervised case there are no labeled outliers to learn from, and the algorithm basically assumes that outliers follow a different distribution. An Autoencoder trained on normal data can reconstruct normal samples well, but it cannot restore data points that deviate from the normal distribution, which results in a large reconstruction error.

If the sample's features are all numerical variables, we can use MSE or MAE as the reconstruction error. For example, in the figure above, if the input sample is

$x = (x_1, x_2, \ldots, x_{10})$

and the result reconstructed by the Autoencoder is

$\hat{x} = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_{10})$

then the reconstruction error MSE is

$\mathrm{MSE} = \frac{1}{10}\sum_{i=1}^{10} (x_i - \hat{x}_i)^2$

and the reconstruction error MAE is

$\mathrm{MAE} = \frac{1}{10}\sum_{i=1}^{10} |x_i - \hat{x}_i|$
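As a quick numerical check of the two formulas, the errors can be computed directly with NumPy; the vectors below are made-up illustrations, not data from the experiment.

import numpy as np

# Hypothetical 10-dimensional sample and its reconstruction (made-up numbers)
x     = np.array([0.5, 1.2, -0.3, 0.8, 0.0, 2.1, -1.0, 0.4, 0.9, -0.2])
x_hat = np.array([0.4, 1.0, -0.2, 0.9, 0.1, 1.8, -0.8, 0.5, 0.7, -0.1])

mse = np.mean((x - x_hat) ** 2)   # mean squared reconstruction error
mae = np.mean(np.abs(x - x_hat))  # mean absolute reconstruction error
print(mse, mae)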

3. Model algorithm process

The data is again the credit card dataset from a Kaggle credit card fraud detection competition. The data quality is high, and the ratio of positive to negative samples is extremely skewed, making it a typical anomaly detection dataset; we test the effect of various anomaly detection methods on it. Of course, the results on another dataset may be very different, so the results are for reference only.

1. Dataset introduction

Credit card fraud refers to intentionally using forged or invalid credit cards, fraudulently using other people's credit cards to obtain property, or maliciously overdrawing one's own credit card. More than 60% of fraud cases involve counterfeit cards; these are typically gang operations running from stealing card information, through manufacturing and selling counterfeit cards, to committing crimes with them for huge profits. Credit card fraud detection is therefore an important means for banks to reduce losses.

This dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions from two days: out of 284,807 transactions there are 492 frauds, so the dataset is highly imbalanced, with the positive class (fraud) accounting for only 0.172% of all transactions. The original data has been desensitized and transformed with PCA; the anonymous variables V1, V2, ... V28 are the resulting principal components. The only variables not processed by PCA are Time and Amount: Time is the number of seconds between each transaction and the first transaction in the dataset, and Amount is the transaction amount. Class is the label, 1 if fraud occurred and 0 otherwise. The task is to build a classification model on this dataset to detect credit card fraud.

Note: PCA (Principal Component Analysis) extracts the principal component features of a dataset, i.e. reduces its dimensionality.
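For readers unfamiliar with PCA, here is a minimal scikit-learn sketch of such a projection; the input matrix is random stand-in data, and the component count of 28 simply mirrors the V1...V28 columns, since the organisers' actual pipeline is not public.

import numpy as np
from sklearn.decomposition import PCA

# Illustrative only: project stand-in raw features onto 28 principal
# components, analogous to the anonymised V1..V28 columns in the dataset
raw = np.random.randn(1000, 50)   # hypothetical raw feature matrix
pca = PCA(n_components=28)
V = pca.fit_transform(raw)
print(V.shape)                    # (1000, 28)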

2. Data sources

The Credit Card Fraud Detection dataset was collected by Worldline and the Machine Learning Group of the Université Libre de Bruxelles (ULB), Belgium. Download it yourself from Kaggle: https://www.kaggle.com/mlg-ulb/creditcardfraud

3. Model building

Quite a few packages are needed; load them first.

# Load the required packages
import warnings
warnings.filterwarnings("ignore")
import os 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
#plt.style.use('seaborn')
import tensorflow as tf
import seaborn as sns
from sklearn.model_selection import train_test_split
from keras.models import Model, load_model
from keras.layers import Input, Dense,LeakyReLU,BatchNormalization
from keras.callbacks import ModelCheckpoint
from keras import regularizers
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, precision_recall_curve
# Set the working directory
os.chdir('/Users/wuzhengxiang/Documents/DataSets/CreditCardFraudDetection')
os.getcwd()

Data reading and simple feature engineering

# Read the data
d = pd.read_csv('creditcard.csv')


# Check the ratio of positive to negative samples
num_nonfraud = np.sum(d['Class'] == 0)
num_fraud = np.sum(d['Class'] == 1)
plt.bar(['Fraud', 'non-fraud'], [num_fraud, num_nonfraud], color='dodgerblue')
plt.show()


# Drop the Time column and standardize Amount
data = d.drop(['Time'], axis=1)
data['Amount'] = StandardScaler().fit_transform(data[['Amount']])
X = data.drop(['Class'],axis=1)
Y = data.Class

(Figure: class distribution, fraud vs. non-fraud)

Model building + model training

# Set the Autoencoder parameters
input_dim    = X.shape[1]
encoding_dim = 128
num_epoch    = 30
batch_size   = 256


input_layer = Input(shape=(input_dim, ))


### encoder
encoder = Dense(encoding_dim,
                activation="tanh",
                activity_regularizer=regularizers.l1(10e-5)
                )(input_layer)
encoder = BatchNormalization()(encoder)
encoder = LeakyReLU(alpha=0.2)(encoder)


encoder = Dense(int(encoding_dim/2),
                activation="relu"
                )(encoder)
encoder = BatchNormalization()(encoder)
encoder = LeakyReLU(alpha=0.1)(encoder)


# Bottleneck layer: the compressed representation
encoder = Dense(int(encoding_dim/4),
                activation="relu"
                )(encoder)
encoder = BatchNormalization()(encoder)






### decoder
decoder = LeakyReLU(alpha=0.1)(encoder)
decoder = Dense(int(encoding_dim/4),
                activation='tanh'
                )(decoder)
decoder = BatchNormalization()(decoder)
decoder = LeakyReLU(alpha=0.1)(decoder)




decoder = Dense(int(encoding_dim/2),
                activation='tanh'
                )(decoder)
decoder = BatchNormalization()(decoder)
decoder = LeakyReLU(alpha=0.1)(decoder)


decoder = Dense(input_dim, 
                #activation='relu'
                )(decoder)


autoencoder = Model(inputs = input_layer, 
                    outputs = decoder
                    )
autoencoder.compile(optimizer='adam', 
                    loss='mean_squared_error', 
                    metrics=['mae','mse']
                    )


# Save the best model as XiaoWuGe_model.h5 and start training
checkpointer = ModelCheckpoint(filepath="XiaoWuGe_model.h5",
                               verbose=0,
                               save_best_only=True
                               )
history = autoencoder.fit(X, 
                          X,
                          epochs=num_epoch,
                          batch_size=batch_size,
                          shuffle=True,
                          #validation_data=(X_test, X_test),
                          verbose=1, 
                          callbacks=[checkpointer]
                          ).history
Epoch 1/30
284807/284807 [==============================] - 39s 136us/step - loss: 0.6593 - mae: 0.3893 - mse: 0.4098
Epoch 2/30
Epoch 29/30
284807/284807 [==============================] - 41s 144us/step - loss: 0.1048 - mae: 0.1188 - mse: 0.0558
Epoch 30/30
284807/284807 [==============================] - 39s 135us/step - loss: 0.0891 - mae: 0.1134 - mse: 0.0495
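Since save_best_only=True keeps only the checkpoint with the lowest loss, the saved file can later be restored with the load_model imported above. A minimal usage sketch (best_model is a hypothetical name, not part of the original run):

# Load the best checkpoint saved by ModelCheckpoint (usage sketch)
best_model = load_model("XiaoWuGe_model.h5")
best_model.summary()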






Model result visualization


# Plot the training loss curve
plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.plot(history['loss'], c='dodgerblue', lw=3)
plt.title('model loss')
plt.ylabel('mse')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper right')




# Plot the training MAE curve
plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.plot(history['mae'], c='dodgerblue', lw=3)
plt.title('model mae')
plt.ylabel('mae')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper right')


# Plot the training MSE curve
plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.plot(history['mse'], c='dodgerblue', lw=3)
plt.title('model mse')
plt.ylabel('mse')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper right')

(Figure: training loss, MAE, and MSE curves)

Model prediction and evaluation

# Reconstruct all samples with the trained autoencoder
pred_X = autoencoder.predict(X)
# Compute the reconstruction errors MSE and MAE
mse_X = np.mean(np.power(X - pred_X, 2), axis=1)
mae_X = np.mean(np.abs(X - pred_X),     axis=1)




data['mse_X'] = mse_X
data['mae_X'] = mae_X
# Top-N precision evaluation
n = 1000
df = data.sort_values(by='mae_X', ascending=False)
df = df.head(n)
rate = df[df['Class'] == 1].shape[0] / n
print('Top-{} precision: {}'.format(n, rate))


Top-1000 precision: 0.336

Our Top-1000 precision is 0.336, a big improvement over the earlier Isolation Forest, but after many experiments this is a best-case result; later I will look for a more stable structure to share. Below, we look at the difference between the reconstruction-error distributions of positive and negative samples.
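Top-N precision depends heavily on the cut-off, so a threshold-free metric is a useful sanity check. Since roc_curve and auc are already imported, one possible sketch over the same scores (not part of the original experiment):

# Threshold-free evaluation: ROC AUC of the reconstruction error as a score
fpr, tpr, _ = roc_curve(Y, mae_X)
print('ROC AUC:', auc(fpr, tpr))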

# Extract the negative samples and split them 7:3 into training and test sets
mask = (data['Class'] == 0)
X_train, X_test = train_test_split(X[mask], test_size=0.3,
                                   random_state=520)


# Extract all positive (fraud) samples as part of the test set
X_fraud = X[~mask]




# Reconstruct the test set with the trained autoencoder
pred_test  = autoencoder.predict(X_test)
pred_fraud = autoencoder.predict(X_fraud)




# Compute the reconstruction errors MSE and MAE
mse_test = np.mean(np.power(X_test - pred_test, 2), axis=1)
mse_fraud = np.mean(np.power(X_fraud - pred_fraud, 2), axis=1)
mae_test = np.mean(np.abs(X_test - pred_test), axis=1)
mae_fraud = np.mean(np.abs(X_fraud - pred_fraud), axis=1)
mse_df = pd.DataFrame()
mse_df['Class'] = [0] * len(mse_test) + [1] * len(mse_fraud)
mse_df['MSE'] = np.hstack([mse_test, mse_fraud])
mse_df['MAE'] = np.hstack([mae_test, mae_fraud])
mse_df = mse_df.sample(frac=1).reset_index(drop=True)
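Before plotting, a quick numerical summary of the two error distributions can confirm the separation; a one-line sketch using the mse_df built above:

# Compare the reconstruction-error distributions of the two classes
print(mse_df.groupby('Class')[['MSE', 'MAE']].describe())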




# Plot the reconstruction MAE and MSE of positive and negative samples separately
markers = ['o', '^']
colors = ['dodgerblue', 'red']
labels = ['Non-fraud', 'Fraud']


plt.figure(figsize=(14, 5))
plt.subplot(121)
for flag in [1, 0]:
    temp = mse_df[mse_df['Class'] == flag]
    plt.scatter(temp.index, 
                temp['MAE'],  
                alpha=0.7, 
                marker=markers[flag], 
                c=colors[flag], 
                label=labels[flag])
plt.title('Reconstruction MAE')
plt.ylabel('Reconstruction MAE')
plt.xlabel('Index')


plt.subplot(122)
for flag in [1, 0]:
    temp = mse_df[mse_df['Class'] == flag]
    plt.scatter(temp.index, 
                temp['MSE'],  
                alpha=0.7, 
                marker=markers[flag], 
                c=colors[flag], 
                label=labels[flag])
plt.legend(loc=[1, 0], fontsize=12)
plt.title('Reconstruction MSE')
plt.ylabel('Reconstruction MSE')
plt.xlabel('Index')
plt.show()

(Figure: reconstruction MAE and MSE for fraud vs. non-fraud samples)

The MAE and MSE of positive and negative samples differ visibly, which shows that this algorithm has real anomaly detection ability. Of course, some samples remain difficult to separate by anomaly detection.
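If a hard fraud/non-fraud decision is needed rather than a ranking, a threshold on the reconstruction error can be chosen from the precision-recall trade-off. precision_recall_curve is already imported; one possible sketch (maximising F1 is just one of many reasonable criteria):

# Pick an operating threshold on the reconstruction error (illustrative)
precision, recall, thresholds = precision_recall_curve(Y, mae_X)
f1 = 2 * precision * recall / (precision + recall + 1e-9)
best = np.argmax(f1[:-1])   # thresholds has one fewer entry than f1
print('threshold:', thresholds[best], 'F1:', f1[best])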

