Summary of 8 Python anomaly detection algorithms

Anomaly detection uses data mining methods to discover abnormal data that is inconsistent with the distribution of a data set; such data is also called outliers, and the task is also known as outlier detection. This article organizes 8 common Python anomaly detection algorithms for you, in the hope that they will be helpful.

1. Introduction to Anomaly Detection

Anomaly detection uses data mining methods to discover abnormal data that is inconsistent with the distribution of a data set; such data is also called outliers, and the task is also known as outlier detection.

 

1.1 Scenarios where anomaly detection is applicable

Anomaly detection algorithms are suitable for scenarios with the following characteristics: (1) the data has no labels, or the classes are extremely imbalanced; (2) the abnormal data differs greatly from the majority of the samples; (3) the abnormal data makes up a very low proportion of the overall sample. Common application cases include:

  • Finance: identifying "fraudulent users" from financial data, such as credit card application fraud, stolen credit cards, and loan fraud;
  • Security: judging from fluctuations in traffic data whether a system is under attack;
  • E-commerce: identifying "malicious buyers" from transaction data, such as coupon-abuse ("wool party") and click-farming gangs;
  • Ecological disaster warning: judging possible future extreme weather based on weather index data;
  • Medical monitoring: finding abnormal data that may indicate a disease condition in medical device data.

 

1.2 Challenges of Anomaly Detection

Anomaly detection is a popular research field, but because anomalies are unknown in advance, heterogeneous, particular, and diverse, the field still faces many challenges:

1) One of the most challenging problems is that it is difficult to achieve high recall in anomaly detection: because anomalies are so rare and heterogeneous, it is hard to identify all of them.

2) Improving the accuracy of an anomaly detection model often requires deep integration with business characteristics; otherwise, results will be poor and the model can easily develop algorithmic bias against minority groups.

2. Anomaly Detection Methods

According to whether the training set contains outliers, the task can be divided into outlier detection and novelty detection; a representative novelty detection method is the one-class SVM.

According to the type of anomaly, anomaly detection can be divided into point anomaly detection (e.g., abnormal consumer users), contextual anomaly detection (e.g., time-series anomalies), and group anomaly detection (e.g., abnormal gangs).

According to the learning method, anomaly detection can be divided into supervised, semi-supervised, and unsupervised anomaly detection. In real-world anomaly detection problems, labeled abnormal samples are hard to collect and labels are often unavailable, so unsupervised anomaly detection is the most widely used.

Unsupervised anomaly detection methods can be roughly divided into the following categories according to their algorithmic ideas:

2.1 Clustering-based methods

Clustering-based anomaly detection methods usually rely on the following assumptions: 1) normal data instances belong to a cluster in the data, while abnormal instances do not belong to any cluster; 2) normal instances are close to their nearest cluster centroid, while abnormal instances are far from theirs; 3) normal instances belong to large, dense clusters, while abnormal instances belong to small or sparse clusters. After grouping the data into clusters, the anomalies are the points that belong to small clusters, belong to no cluster, or are far from their cluster center.

  • Treating data far from the cluster centers as outliers: such methods include SOM, K-means, expectation maximization (EM), and algorithms based on semantic anomaly factors;
  • Treating the small clusters produced by clustering as outliers: representative methods include K-means clustering;
  • Treating data that belongs to no cluster as outliers: representative methods include DBSCAN, ROCK, and SNN clustering.
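As a concrete sketch of the first idea, the snippet below flags the points farthest from their assigned K-means centroid as outliers (the synthetic data, the number of clusters, and the 98% threshold quantile are illustrative assumptions, not part of the original article):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (500, 2)),     # dense normal cluster
               rng.uniform(-6, 6, (10, 2))])   # sparse anomalies

kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
# Distance from each point to its assigned cluster centroid
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
# Flag the 2% of points farthest from their centroid as outliers
threshold = np.quantile(dist, 0.98)
print("outliers found:", np.sum(dist > threshold))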

2.2 Statistical-based methods

Statistical methods rely on the assumption that the data set obeys some distribution (such as a normal, Poisson, or binomial distribution) or probability model, and detect anomalies by judging whether a data point conforms to that distribution or model (i.e., by identifying low-probability events). According to the probability model used, they can be divided into:

1) Parametric methods, which estimate model parameters (e.g., of a Gaussian model) from data assumed to follow a known distribution. The simplest parametric anomaly detection model assumes that the samples obey a normal distribution: when a data point deviates from the mean by more than two or three times the standard deviation, it is considered abnormal;

2) Non-parametric methods: when the data distribution is unknown, a histogram can be built from the training set and anomaly detection performed by checking whether a data point falls within it. Measures of variability (such as mean absolute deviation, standard deviation, coefficient of variation, and interquartile range) can also be used to find outliers in the data.
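A minimal sketch of both ideas, using the three-sigma rule for the parametric case and the common 1.5×IQR rule for the non-parametric case (the synthetic data and injected outliers are illustrative assumptions):

import numpy as np

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(50, 5, 1000), [95, 2, 110]])  # inject 3 outliers

# Parametric: three-sigma rule under a normality assumption
mu, sigma = x.mean(), x.std()
sigma_outliers = x[np.abs(x - mu) > 3 * sigma]

# Non-parametric: 1.5 * IQR rule
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print("3-sigma outliers:", sigma_outliers)
print("IQR outliers:", iqr_outliers)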

2.3 Depth-based methods

These methods map the data into layers in a k-dimensional space and assume that outliers are distributed on the periphery, while normal data points lie closer to the center of the layering (the greater the depth).

The half-space depth (ISODEPTH) method computes the depth of each point and judges abnormal data points according to their depth values.

The minimum volume ellipsoid estimator (MVE) method fits the boundary of a minimum ellipsoid around the majority of the data points (usually > 50%) according to their probability distribution model; data points outside this boundary are judged to be abnormal.
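scikit-learn does not ship MVE itself, but its EllipticEnvelope implements the closely related minimum covariance determinant idea: fit a robust ellipsoid around the central mass of the data and treat points outside it as anomalies. A minimal sketch (the contamination rate and synthetic data are illustrative assumptions):

import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (500, 2)),
               rng.uniform(-6, 6, (10, 2))])

# Fit a robust ellipsoid; points outside its boundary are labeled -1
clf = EllipticEnvelope(contamination=0.02, random_state=42)
labels = clf.fit_predict(X)
print("anomalies found:", np.sum(labels == -1))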

 

Isolation forest. The time complexity of the two depth-based models above grows exponentially with the feature dimension k, so they are usually applicable only when k ≤ 3. By changing the way depth is computed, the isolation forest can also be applied to high-dimensional data.

 

The isolation forest algorithm is an ensemble-based anomaly detection method. It has linear time complexity and high accuracy, and it is fast on big data, so it is currently widely used in industry. The basic idea is to randomly partition the sample space with tree models: dense clusters must be cut many times before the cutting stops (i.e., before every point sits alone in its own subspace), whereas points in sparsely distributed regions (i.e., abnormal points) end up alone in a subspace very early. The algorithm steps are:

1) Randomly select Ψ samples from the training data to train a single tree.

2) Randomly choose a dimension (attribute) q and randomly generate a cut point p within the current node's data, between the minimum and maximum values of dimension q.

3) The cut point defines a hyperplane that splits the current node's data space into two subspaces: points whose value in dimension q is less than p go to the node's left branch, and points greater than or equal to p go to its right branch;

4) Recursively apply steps 2 and 3 in the node's left and right branches, continually constructing new leaf nodes, until a leaf contains only one data point (no further cutting is possible) or the tree reaches the set height. (A maximum height is set for each tree because abnormal records are relatively rare and their path lengths are relatively short; since we only need to distinguish normal records from abnormal ones, we only care about the part of the tree below the average height, which makes the algorithm more efficient.)

5) Because each tree's cutting of the feature space is completely random, an ensemble is needed for the results to converge: build several trees, then combine the trees' results. For each sample x, the composite anomaly score s is calculated by the following formula (the standard isolation forest score):

s(x, Ψ) = 2^( -E(h(x)) / c(Ψ) )

where h(x) is the height (path length) of x in each tree, E(h(x)) is its average over all trees, and c(Ψ) is the average path length for a given sample size Ψ, used to normalize the path length h(x) of sample x.
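In practice the algorithm is rarely implemented by hand; a minimal sketch with scikit-learn's IsolationForest (the synthetic data and contamination rate are illustrative assumptions):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (500, 2)),    # normal points
               rng.uniform(-6, 6, (10, 2))])  # sparse anomalies

# contamination is the assumed fraction of anomalies in the data
clf = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
labels = clf.fit_predict(X)          # +1 = normal, -1 = anomaly
scores = clf.decision_function(X)    # the lower the score, the more abnormal
print("anomalies found:", np.sum(labels == -1))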

2.4 Classification-based models

The representative method is the one-class SVM. Its principle is to find a hyperplane that encircles the positive examples in the sample; prediction then uses this hyperplane to make decisions, and samples inside the circle are considered positive. Because computing the kernel function is time-consuming, it is not used much in scenarios with massive data.
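A minimal sketch with scikit-learn's OneClassSVM, trained only on normal samples in the novelty detection setting (the nu value and synthetic data are illustrative assumptions):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = rng.normal(0, 1, (500, 2))              # normal data only
X_test = np.vstack([rng.normal(0, 1, (50, 2)),    # normal test points
                    rng.uniform(-6, 6, (5, 2))])  # novelties

# nu upper-bounds the fraction of training errors / support vectors
clf = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_train)
pred = clf.predict(X_test)  # +1 = inside the learned boundary, -1 = novelty
print("novelties found:", np.sum(pred == -1))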

 

2.5 Proximity-based methods

These methods rely on the assumption that normal data instances lie in dense neighborhoods, while abnormal data instances lie in sparse ones. They can be further subdivided by density and by distance:

Density-based methods compute the density of each region in the data set and treat low-density regions as outlier regions. The classic method is the local outlier factor (LOF). Unlike the traditional global definition of outliers, LOF defines outliers locally: each data point is assigned a LOF value relative to its neighbors, and the larger the LOF, the lower the density around the point relative to its neighbors and the more likely it is an outlier. However, the minimum number of nearest neighbors is difficult to choose, and as the data dimension increases, so do the computational and time complexity.
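A minimal sketch with scikit-learn's LocalOutlierFactor (n_neighbors and the contamination rate are illustrative assumptions):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.5, (200, 2)),   # dense region
               rng.uniform(-4, 4, (8, 2))])    # sparse points

clf = LocalOutlierFactor(n_neighbors=20, contamination=0.04)
labels = clf.fit_predict(X)                    # -1 = outlier
lof_scores = -clf.negative_outlier_factor_     # the larger, the more abnormal
print("outliers found:", np.sum(labels == -1))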

Distance-based methods detect anomalies by computing the distances between a data point and its neighboring data: normal data points are similar to their neighbors, while abnormal data points differ from theirs.
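A minimal distance-based sketch, scoring each point by its mean distance to its k nearest neighbors (k and the threshold quantile are illustrative assumptions):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (500, 2)),
               rng.uniform(-6, 6, (10, 2))])

nn = NearestNeighbors(n_neighbors=6).fit(X)
dist, _ = nn.kneighbors(X)               # first column is the point itself
score = dist[:, 1:].mean(axis=1)         # mean distance to 5 nearest neighbors
threshold = np.quantile(score, 0.98)
print("anomalies found:", np.sum(score > threshold))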

2.6 Bias-based methods

Given a data set, deviation-based methods find the points that do not match the characteristics of the whole set, such that the variance of the data set decreases as the outliers are removed. The approach can be divided into sequential exception techniques, which compare data points one by one, and OLAP data cube techniques. At present, this method has little practical application.

2.7 Reconstruction-based methods

The representative method is PCA, which is applied to anomaly detection with two general ideas: one is to map the data to a low-dimensional feature space and then check each data point's deviation from the rest along the different dimensions of that space; the other is to map the data to the low-dimensional feature space, map it back to the original space, attempt to reconstruct the original data from the low-dimensional features, and examine the size of the reconstruction error.
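A minimal sketch of the second (reconstruction-error) idea with scikit-learn's PCA (the number of components and the threshold quantile are illustrative assumptions):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (500, 10)),
               rng.uniform(-8, 8, (10, 10))])   # anomalies off the main subspace

pca = PCA(n_components=3).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))  # reconstruct from 3 components
rec_error = np.mean((X - X_rec) ** 2, axis=1)    # per-sample reconstruction MSE
threshold = np.quantile(rec_error, 0.98)         # flag the top 2% as anomalies
print("anomalies found:", np.sum(rec_error > threshold))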

2.8 Neural network-based methods

Representative methods include the autoencoder (AE) and long short-term memory networks (LSTM).

An LSTM can be used for anomaly detection on time-series data: train the model on historical sequence data, then flag points that differ greatly from the predicted values as outliers.
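A minimal sketch of this idea with Keras, forecasting one step ahead and flagging points with large prediction error (the synthetic series, window length, architecture, and three-sigma threshold are all illustrative assumptions):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Synthetic series: a sine wave with a few injected spikes
t = np.arange(1000)
series = np.sin(0.05 * t)
series[[300, 600, 900]] += 3.0

# Sliding windows: predict the next value from the previous 20
window = 20
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X.reshape(-1, window, 1)

model = Sequential([LSTM(32, input_shape=(window, 1)), Dense(1)])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

# Points whose prediction error exceeds mean + 3*std are flagged as anomalies
residual = np.abs(model.predict(X, verbose=0).ravel() - y)
threshold = residual.mean() + 3 * residual.std()
print("anomalous indices:", np.where(residual > threshold)[0] + window)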

Autoencoder anomaly detection: an autoencoder essentially uses a neural network to produce a low-dimensional representation of a high-dimensional input. It is similar to PCA, but with nonlinear activation functions it overcomes PCA's linearity limitation. The basic assumption of the algorithm is that outliers follow a different distribution: an autoencoder trained on normal data can reconstruct normal samples well, but cannot restore data points that deviate from the normal distribution, which results in large reconstruction errors for them. When the reconstruction error exceeds a certain threshold, the sample is marked as an outlier.

 

Summary: the keys to unsupervised anomaly detection are selecting relevant features and choosing an appropriate algorithm based on reasonable assumptions; together they determine how effective the anomaly detection will be.

3. Hands-on Project: Credit Card Anti-Fraud

This project is the classic credit card fraud detection task on Kaggle. The data set is of high quality, and the ratio of positive to negative samples is extremely imbalanced. In this project we mainly use unsupervised autoencoder novelty detection, identifying abnormal fraud samples according to their reconstruction error.

 


#!/usr/bin/env python
# coding: utf-8

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn')
from sklearn.model_selection import train_test_split
from keras.models import Model, load_model
from keras.layers import Input, Dense
from keras.callbacks import ModelCheckpoint
from keras import regularizers
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, precision_recall_curve
# A recommended Python anomaly detection library: https://github.com/yzhao062/Pyod

# Load the data: credit card fraud dataset from https://www.kaggle.com/mlg-ulb/creditcardfraud
d = pd.read_csv('creditcard.csv')

# Inspect the class ratio
num_nonfraud = np.sum(d['Class'] == 0)
num_fraud = np.sum(d['Class'] == 1)
plt.bar(['Fraud', 'non-fraud'], [num_fraud, num_nonfraud], color='dodgerblue')
plt.show()

# Drop the Time column and standardize Amount
data = d.drop(['Time'], axis=1)
data['Amount'] = StandardScaler().fit_transform(data[['Amount']])

# For unsupervised novelty detection, keep only the negative (non-fraud)
# samples and split them 8:2 into training and test sets
mask = (data['Class'] == 0)
X_train, X_test = train_test_split(data[mask], test_size=0.2, random_state=0)
X_train = X_train.drop(['Class'], axis=1).values
X_test = X_test.drop(['Class'], axis=1).values

# Extract all positive (fraud) samples to use as part of the test set
X_fraud = data[~mask].drop(['Class'], axis=1).values

# Build the autoencoder network
# Hidden layer sizes: 16, 8, 8, 16
# 5 epochs, batch size 32
input_dim = X_train.shape[1]
encoding_dim = 16
num_epoch = 5
batch_size = 32

input_layer = Input(shape=(input_dim, ))
encoder = Dense(encoding_dim, activation="tanh",
                activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoder = Dense(int(encoding_dim / 2), activation="relu")(encoder)
decoder = Dense(int(encoding_dim / 2), activation='tanh')(encoder)
decoder = Dense(input_dim, activation='relu')(decoder)
autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer='adam',
                    loss='mean_squared_error',
                    metrics=['mae'])

# Save the model to model.h5 and start training
checkpointer = ModelCheckpoint(filepath="model.h5",
                               verbose=0,
                               save_best_only=True)
history = autoencoder.fit(X_train, X_train,
                          epochs=num_epoch,
                          batch_size=batch_size,
                          shuffle=True,
                          validation_data=(X_test, X_test),
                          verbose=1,
                          callbacks=[checkpointer]).history

# Plot the loss curves
plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.plot(history['loss'], c='dodgerblue', lw=3)
plt.plot(history['val_loss'], c='coral', lw=3)
plt.title('model loss')
plt.ylabel('mse'); plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')

plt.subplot(122)
plt.plot(history['mae'], c='dodgerblue', lw=3)
plt.plot(history['val_mae'], c='coral', lw=3)
plt.title('model mae')
plt.ylabel('mae'); plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')

# Load the saved model
autoencoder = load_model('model.h5')

# Reconstruct the test set with the autoencoder
pred_test = autoencoder.predict(X_test)
# Reconstruct the fraud samples
pred_fraud = autoencoder.predict(X_fraud)

# Compute the reconstruction MSE and MAE errors
mse_test = np.mean(np.power(X_test - pred_test, 2), axis=1)
mse_fraud = np.mean(np.power(X_fraud - pred_fraud, 2), axis=1)
mae_test = np.mean(np.abs(X_test - pred_test), axis=1)
mae_fraud = np.mean(np.abs(X_fraud - pred_fraud), axis=1)
mse_df = pd.DataFrame()
mse_df['Class'] = [0] * len(mse_test) + [1] * len(mse_fraud)
mse_df['MSE'] = np.hstack([mse_test, mse_fraud])
mse_df['MAE'] = np.hstack([mae_test, mae_fraud])
mse_df = mse_df.sample(frac=1).reset_index(drop=True)

# Plot the reconstruction MAE and MSE of the positive and negative test samples
markers = ['o', '^']
colors = ['dodgerblue', 'coral']
labels = ['Non-fraud', 'Fraud']

plt.figure(figsize=(14, 5))
plt.subplot(121)
for flag in [1, 0]:
    temp = mse_df[mse_df['Class'] == flag]
    plt.scatter(temp.index,
                temp['MAE'],
                alpha=0.7,
                marker=markers[flag],
                c=colors[flag],
                label=labels[flag])
plt.title('Reconstruction MAE')
plt.ylabel('Reconstruction MAE'); plt.xlabel('Index')
plt.subplot(122)
for flag in [1, 0]:
    temp = mse_df[mse_df['Class'] == flag]
    plt.scatter(temp.index,
                temp['MSE'],
                alpha=0.7,
                marker=markers[flag],
                c=colors[flag],
                label=labels[flag])
plt.legend(loc=[1, 0], fontsize=12); plt.title('Reconstruction MSE')
plt.ylabel('Reconstruction MSE'); plt.xlabel('Index')
plt.show()
# The two panels show the MAE and MSE reconstruction errors: the coral points
# are credit fraud (the anomalies) and the blue points are normal. The
# anomalies' reconstruction errors are clearly much higher overall.

# Plot the Precision-Recall curves
plt.figure(figsize=(14, 6))
for i, metric in enumerate(['MAE', 'MSE']):
    plt.subplot(1, 2, i+1)
    precision, recall, _ = precision_recall_curve(mse_df['Class'], mse_df[metric])
    pr_auc = auc(recall, precision)
    plt.title('Precision-Recall curve based on %s\nAUC = %0.2f'%(metric, pr_auc))
    plt.plot(recall[:-2], precision[:-2], c='coral', lw=4)
    plt.xlabel('Recall'); plt.ylabel('Precision')
plt.show()

# Plot the ROC curves
plt.figure(figsize=(14, 6))
for i, metric in enumerate(['MAE', 'MSE']):
    plt.subplot(1, 2, i+1)
    fpr, tpr, _ = roc_curve(mse_df['Class'], mse_df[metric])
    roc_auc = auc(fpr, tpr)
    plt.title('Receiver Operating Characteristic based on %s\nAUC = %0.2f'%(metric, roc_auc))
    plt.plot(fpr, tpr, c='coral', lw=4)
    plt.plot([0,1],[0,1], c='dodgerblue', ls='--')
    plt.ylabel('TPR'); plt.xlabel('FPR')
plt.show()
# Whether MAE or MSE is used as the ranking criterion, the model performs
# well: the PR AUCs are 0.51 and 0.44, and the ROC AUCs both reach 0.95.

# Scatter plot of MSE vs. MAE
plt.figure(figsize=(10, 5))
for flag in [1, 0]:
    temp = mse_df[mse_df['Class'] == flag]
    plt.scatter(temp['MAE'],
                temp['MSE'],
                alpha=0.7,
                marker=markers[flag],
                c=colors[flag],
                label=labels[flag])
plt.legend(loc=[1, 0])
plt.ylabel('Reconstruction MSE'); plt.xlabel('Reconstruction MAE')
plt.show()

This concludes this summary of 8 Python anomaly detection algorithms.


Origin: blog.csdn.net/ai520wangzha/article/details/131086750