Introduction to Anomaly Detection


1. What is an outlier?

Outliers are data points that differ significantly from the rest of the data.
Hawkins [1] defined an outlier as follows:
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”
In the data mining and statistics literature, outliers are also called abnormalities, discordants, deviants, or anomalies. In most application scenarios, data is created by one or more generating processes, which may reflect activity in a system or observations collected about entities. When a generating process behaves abnormally, it produces outliers. Outliers therefore often carry useful information about the abnormal characteristics of the systems and entities that affect the data generation process, and recognizing these characteristics provides useful application-specific insights. Some examples are as follows:

  • Intrusion detection systems: Many computer systems collect different kinds of data about operating system calls, network traffic, or other user actions. This data may show unusual behavior caused by malicious activity. Identifying such activity is called intrusion detection.
  • Credit card fraud: Credit card fraud has become increasingly common because sensitive information (such as credit card numbers) is more easily leaked. In many cases, unauthorized use of a credit card shows distinctive patterns, such as a shopping spree from a specific location or very large transactions. These patterns can be used to detect outliers in credit card transaction data.
  • Interesting sensor events: In many real-world applications, sensors are used to track various environmental and location parameters. Sudden changes in the underlying patterns may represent events of interest. Event detection is one of the main motivating applications in the field of sensor networks.
  • Medical diagnosis: In many medical applications, data is collected from various devices, such as magnetic resonance imaging (MRI) scans, positron emission tomography (PET) scans, or electrocardiogram (ECG) time series. Unusual patterns in this data usually reflect disease conditions.
  • Law enforcement: Anomaly detection has found many applications in law enforcement, especially in cases where unusual patterns can only be discovered over time through multiple actions of an entity. Determining fraud in financial transactions, trading activity, or insurance claims typically requires identifying unusual patterns in the data generated by the actions of the criminal entity.
  • Earth science: A large amount of spatiotemporal data about weather patterns, climate change, or land cover patterns is collected through mechanisms such as satellites or remote sensing. Anomalies in this data provide important insights about the human activities or environmental trends that may be their underlying causes.

In all these applications, the data has a "normal" model, and anomalies are recognized as deviations from this normal model. Normal data points are sometimes called inliers. In applications such as intrusion or fraud detection, an outlier often corresponds to a sequence of multiple data points rather than to a single data point. For example, a fraud incident may be reflected in the actions of an individual in a particular sequence, and it is the specific sequence that makes the event recognizable as anomalous. Such anomalies are called collective anomalies, because they can only be inferred collectively from a set or sequence of data points. Collective anomalies are often the result of unusual events that produce unusual patterns of activity.
The output of an anomaly detection algorithm can be one of the following two types:

  • Outlier score: Most outlier detection algorithms output a score that quantifies the degree of "outlierness" of each data point. This score can also be used to rank the data points by their tendency to be outliers. It is a very general form of output that retains all the information provided by a particular algorithm, but it does not provide a concise summary of the small number of data points that should be considered outliers.
  • Binary label: The second type of output is a binary label that indicates whether a data point is an outlier. Although some algorithms return binary labels directly, outlier scores can also be converted into binary labels. This is usually done by applying a threshold to the outlier scores, chosen based on the statistical distribution of the scores. A binary label contains less information than a score, but it is the final result that is often needed for decision-making in practical applications.
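
The conversion from scores to binary labels can be sketched as follows. This is only a minimal illustration, not taken from any particular library: the synthetic scores, the helper name `scores_to_labels`, and the mean-plus-three-standard-deviations rule are all assumptions made for the example.

```python
import numpy as np


def scores_to_labels(scores, n_std=3.0):
    """Flag points whose score exceeds mean + n_std * std (hypothetical rule)."""
    scores = np.asarray(scores, dtype=float)
    threshold = scores.mean() + n_std * scores.std()
    return (scores > threshold).astype(int), threshold


rng = np.random.RandomState(0)
# 98 ordinary scores around 1.0, followed by two clearly larger scores
scores = np.r_[rng.normal(1.0, 0.1, size=98), [5.0, 6.0]]

labels, threshold = scores_to_labels(scores)  # binary labels (1 = outlier)
ranking = np.argsort(-scores)                 # indices sorted from most to least anomalous

print("threshold:", threshold)
print("flagged points:", np.flatnonzero(labels))
print("top of ranking:", ranking[:2])
```

The ranking keeps the full information of the scores, while the labels depend on the chosen threshold; a different rule (for example a fixed percentile) would flag a different number of points.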

2. Basic anomaly detection model

From an analyst's point of view, the interpretability of an anomaly detection model is extremely important. It is often necessary to determine why a particular data point should be considered an outlier, because the explanation gives the analyst further hints about the diagnosis required in a specific application scenario. This process is also known as discovering intensional knowledge about outliers [2], or as outlier detection and description [3]. Different models have different degrees of interpretability. In general, models that work with the original attributes and apply fewer transformations to the data (such as principal component analysis) are more interpretable. The trade-off is that data transformations often enhance the contrast between outliers and normal data points, at the expense of interpretability. It is therefore important to keep these factors in mind when choosing a model for outlier analysis.

2.1 Probability and statistical models

In probabilistic and statistical models, the data is modeled in the form of a closed-form probability distribution, and the parameters of this model are learned. The key assumption is therefore the specific choice of data distribution used for modeling. For example, a Gaussian mixture model assumes that the data is the output of a generative process in which each point belongs to one of k Gaussian clusters. The parameters of these Gaussian distributions are learned by applying the Expectation Maximization (EM) algorithm to the observed data, so that the probability (or likelihood) of the process generating the data is as large as possible. A key output of this method is the membership probability of each data point in the different clusters, together with the density-based fit of each point to the modeled distribution. This provides a natural way to model outliers, because data points with very low fit values can be considered outliers. In practice, the logarithm of the fit value is used as the outlier score, because outliers then appear as extreme values of the log-fit, and extreme value tests can be applied to these values to identify the outliers.
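
As a rough illustration of this idea, the sketch below fits a two-component Gaussian mixture with scikit-learn's `GaussianMixture` (which uses EM internally) and treats the per-point log-likelihood as the outlier score. The synthetic data, the number of components, and the single injected point at (6, -6) are assumptions chosen for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(42)
# Two Gaussian clusters, plus one far-away point appended as the last row
X = np.r_[0.3 * rng.randn(100, 2) + 2,
          0.3 * rng.randn(100, 2) - 2,
          [[6.0, -6.0]]]

# Learn the mixture parameters with EM
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

log_fit = gmm.score_samples(X)     # log-likelihood of each point under the mixture
membership = gmm.predict_proba(X)  # membership probability for each cluster

# The lower the log-likelihood, the more anomalous the point
most_anomalous = int(np.argmin(log_fit))
print("most anomalous index:", most_anomalous)
print("its log-likelihood:", log_fit[most_anomalous])
```

An extreme value test (or a simple threshold) on `log_fit` would then turn these scores into binary labels.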
One of the main advantages of probabilistic models is that they can be applied easily to almost any data type (or mixed data type), as long as an appropriate generative model is available for each mixture component. For example, if the data is categorical, a discrete Bernoulli distribution can be used to model each component of the mixture. For a mixture of different attribute types, a product of attribute-specific generative components can be used. Because such models work with probabilities, the issue of data normalization is already accounted for by the generative assumptions. Probabilistic models therefore provide a general EM-based framework that can be applied to any specific data type with relative ease. This is not the case for many other models.
One disadvantage of probabilistic models is that they try to fit the data to a particular kind of distribution, which may not always be appropriate. In addition, overfitting becomes more common as the number of model parameters increases. In that case, the outliers may fit the underlying model of normal data. Many parametric models are also hard to interpret in terms of intensional knowledge, especially when the parameters of the model cannot be presented intuitively to the analyst in terms of the underlying attributes. This can defeat one of the important purposes of anomaly detection, which is to provide a diagnostic understanding of the abnormal data generation process.

2.2 Linear model

[Figure 1.4: data aligned along a 1-dimensional line in a 2-dimensional space]

These methods model the data along lower-dimensional subspaces by exploiting linear correlations [4]. For example, in Figure 1.4 the data is aligned along a 1-dimensional line in a 2-dimensional space. The best-fitting line through these points is determined by regression analysis; in general, a least-squares fit is used to determine the optimal lower-dimensional hyperplane. The distances of the data points from this hyperplane are used as outlier scores, because they quantify the deviation from the model of normal data. Extreme value analysis can then be applied to these scores to determine the outliers. For example, in the two-dimensional example of Figure 1.4, the data points {(x_i, y_i), i ∈ {1 ... n}} can be modeled with two coefficients a and b as follows:
$$y_i = a \cdot x_i + b + \epsilon_i \quad \forall i \in \{1 \ldots n\}$$
Here \epsilon_i is the residual, that is, the modeling error of the i-th point. The coefficients a and b are learned from the data by minimizing the least-squares error:
$$\sum_{i=1}^{n} \epsilon_i^2$$
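
A minimal sketch of this linear approach, using only NumPy, is shown below. The synthetic data (a line with slope 2 and intercept 1), the noise level, and the single displaced point are assumptions made for the example; `np.polyfit` performs the least-squares fit of the two coefficients a and b.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)
y[25] += 8.0  # inject one point far away from the line

# Least-squares fit of y = a * x + b
a, b = np.polyfit(x, y, deg=1)

# Residuals of the fitted model
residuals = y - (a * x + b)

# Perpendicular distance of each point to the line a*x - y + b = 0,
# used as the outlier score
scores = np.abs(residuals) / np.sqrt(a ** 2 + 1.0)

print("fitted coefficients:", a, b)
print("index with the largest score:", int(np.argmax(scores)))
```

Note that the injected point also pulls the fitted line slightly towards itself; robust regression methods are designed to limit this effect.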

2.3 Model based on proximity

The idea of proximity-based methods is to separate outliers from the rest of the data using a similarity or distance function. Proximity-based methods are among the most commonly used approaches in outlier analysis and include clustering methods, density-based methods, and nearest-neighbor methods. Clustering and density-based methods find the dense regions of the data directly and define outliers as the points that do not lie in these dense regions. Alternatively, an outlier can be defined as a point that is far away from the dense regions. The main difference between clustering and density-based methods is that clustering methods partition the data points, whereas density-based methods such as histograms partition the data space. This is because the goal in the latter case is to estimate the density at test points in the data space, which is best achieved by partitioning the space.
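
The nearest-neighbor variant can be sketched in a few lines of NumPy: each point is scored by its distance to its k-th nearest neighbor, so points far from any dense region receive large scores. The helper name `knn_distance_scores`, the value k = 5, and the synthetic data are assumptions made for this illustration.

```python
import numpy as np


def knn_distance_scores(X, k=5):
    """Score each point by the distance to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # pairwise Euclidean distances
    # After sorting each row, column 0 is the zero distance of a point to itself,
    # so column k holds the distance to the k-th nearest other point
    return np.sort(dists, axis=1)[:, k]


rng = np.random.RandomState(1)
# One dense cluster around the origin, plus one isolated point as the last row
X = np.r_[0.3 * rng.randn(60, 2), [[4.0, 4.0]]]

scores = knn_distance_scores(X, k=5)
print("index with the largest score:", int(np.argmax(scores)))
```

The full pairwise distance matrix needs memory quadratic in the number of points, so this sketch is only suitable for small data sets.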

2.4 Information theory model

Many of the outlier analysis models described above use some form of data summarization, such as the parameters of a generative probabilistic model, clusters, or a lower-dimensional hyperplane representation. These models implicitly create a small summary of the data, and deviations from this summary are flagged as outliers. Information-theoretic measures are based on the same principle, but in an indirect way. The idea is that outliers increase the minimum code length (that is, the minimum length of the summary) required to describe the data set, because they represent deviations from natural attempts to summarize the data.
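
The effect on code length can be illustrated roughly with a general-purpose compressor. In the sketch below, the compressed size produced by Python's `zlib` module stands in for the minimum code length, which is an assumption made purely for illustration (a real compressor only gives an upper bound). The byte sequences and the positions of the replaced symbols are also invented for the example.

```python
import zlib


def code_length(symbols: bytes) -> int:
    """Compressed size in bytes, used as a rough proxy for description length."""
    return len(zlib.compress(symbols, 9))


# A perfectly regular sequence can be summarized very compactly
regular = b"ab" * 500

# The same sequence with a few symbols replaced by values that break the pattern
corrupted = bytearray(regular)
for position, value in [(101, b"Q"), (377, b"Z"), (640, b"K"), (905, b"X")]:
    corrupted[position:position + 1] = value
corrupted = bytes(corrupted)

print("regular:  ", code_length(regular), "bytes")
print("corrupted:", code_length(corrupted), "bytes")
```

The few deviating symbols make the sequence more expensive to describe, which is the signal that information-theoretic methods look for.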

3. Commonly used open source libraries for anomaly detection

3.1 Scikit-learn

Scikit-learn is an open source machine learning library for Python. Its anomaly detection methods include LocalOutlierFactor (LOF), IsolationForest, OneClassSVM, and EllipticEnvelope.
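
The four estimators share the same basic interface, which the following sketch tries to show by running each of them on one small synthetic data set. The data, the 5% contamination level, and the other parameter values are assumptions chosen for the example rather than recommended settings.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
# 95 points from one Gaussian cluster and 5 points scattered uniformly
X = np.r_[0.3 * rng.randn(95, 2), rng.uniform(low=-4, high=4, size=(5, 2))]

detectors = {
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "IsolationForest": IsolationForest(contamination=0.05, random_state=42),
    "OneClassSVM": OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1),
    "EllipticEnvelope": EllipticEnvelope(contamination=0.05, random_state=42),
}

flagged = {}
for name, detector in detectors.items():
    labels = detector.fit_predict(X)  # 1: inlier, -1: outlier
    flagged[name] = int((labels == -1).sum())

print(flagged)
```

Note that scikit-learn labels outliers with -1 and inliers with 1, whereas PyOD uses 1 for outliers and 0 for inliers.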

3.2 PyOD

Python Outlier Detection (PyOD) is one of the most widely used Python toolkits for anomaly detection. Its main highlights include:

  • It includes nearly 20 common anomaly detection algorithms, from classic methods such as LOF/LOCI/ABOD to recent deep learning approaches such as generative adversarial networks (GAN) and outlier ensembles
  • It supports different versions of Python (2.7 and 3.5+ at the time of writing) and multiple operating systems: Windows, macOS, and Linux
  • It has an easy-to-use and consistent API: anomaly detection takes only a few lines of code, which makes it convenient to evaluate a large number of algorithms
  • It uses parallelization to speed up the algorithms and improve scalability, so it can handle large amounts of data

4. Examples

4.1 KNN

This example shows the basic usage of the PyOD library. It is worth mentioning that the API design of PyOD closely follows scikit-learn, so the learning cost for users is very low.
Simply put, a complete PyOD workflow consists of the following steps:

  • Generate a data set or load an existing one
  • Train the model on the training set
  • Predict the results on the test set
  • Compute the evaluation scores of the model
  • Visualize the model results (high-dimensional data sets are difficult to visualize)

from pyod.models.knn import KNN
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

contamination = 0.1  # fraction of outliers
n_train = 200  # number of training points
n_test = 100  # number of test points

# Generate synthetic data.
# Note: recent PyOD releases return (X_train, X_test, y_train, y_test);
# older releases returned (X_train, y_train, X_test, y_test).
X_train, X_test, y_train, y_test = generate_data(
    n_train=n_train,
    n_test=n_test,
    n_features=2,
    contamination=contamination,
    random_state=42)

# Train the KNN detector
clf_name = 'KNN'
clf = KNN()
clf.fit(X_train)  # no y argument is needed when training the model

# Get the labels and scores of the training data
y_train_pred = clf.labels_  # 0 is normal, 1 is abnormal
y_train_scores = clf.decision_scores_  # the larger the value, the more abnormal

# Use the trained model to predict the labels and scores of the test data
y_test_pred = clf.predict(X_test)
y_test_scores = clf.decision_function(X_test)

# Evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)

# Visualize the model results
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
          y_test_pred, show_figure=True, save_figure=True)


4.2 Isolation Forest

4.2.1 Implement Isolation Forest using Scikit-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Pseudo-random number generator
rng = np.random.RandomState(42)

# Create an array of the given shape whose elements follow the standard
# normal distribution N(0, 1), scaled by 0.3
X = 0.3 * rng.randn(100, 2)
# np.r_ stacks the arrays vertically (row-wise); they must have the same
# number of columns
X_train = np.r_[X + 1, X - 3, X - 5, X + 6]
print('X_train', X_train)

# Generate a set of regular observations
X = 0.3 * rng.randn(50, 2)
X_test = np.r_[X + 1, X - 3, X - 5, X + 6]
print('X_test', X_test)

# Generate a set of abnormal observations: a (20, 2) array drawn uniformly
# between -8 and 8
X_outliers = rng.uniform(low=-8, high=8, size=(20, 2))
print('X_outliers', X_outliers)

# Fit the model
clf = IsolationForest(max_samples=100)
clf.fit(X_train)

# Predictions for the training data (1: inlier, -1: outlier)
y_pred_train = clf.predict(X_train)
print('y_pred_train', y_pred_train)
# Predictions for the regular test data
y_pred_test = clf.predict(X_test)
print('y_pred_test', y_pred_test)
# Predictions for the abnormal data
y_pred_outliers = clf.predict(X_outliers)
print('y_pred_outliers', y_pred_outliers)

# Plot the decision function and the three groups of points
xx, yy = np.meshgrid(np.linspace(-8, 8, 50), np.linspace(-8, 8, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red')
plt.axis('tight')
plt.xlim((-8, 8))
plt.ylim((-8, 8))
plt.legend([b1, b2, c],
           ["training observations",
            "new regular observations", "new abnormal observations"],
           loc="upper left")
plt.show()

4.2.2 Implement Isolation Forest using PyOD

from __future__ import division
from __future__ import print_function

from pyod.models.iforest import IForest
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

if __name__ == "__main__":
    contamination = 0.1  # fraction of outliers
    n_train = 200  # number of training points
    n_test = 100  # number of test points

    # Generate sample data.
    # Note: recent PyOD releases return (X_train, X_test, y_train, y_test);
    # older releases returned (X_train, y_train, X_test, y_test).
    X_train, X_test, y_train, y_test = \
        generate_data(n_train=n_train,
                      n_test=n_test,
                      n_features=2,
                      contamination=contamination,
                      random_state=42)

    # Train the detector
    clf = IForest()
    clf.fit(X_train)

    # Get the predicted labels and outlier scores of the training data
    y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
    y_train_scores = clf.decision_scores_  # raw outlier scores

    # Get the predictions for the test data
    y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
    y_test_scores = clf.decision_function(X_test)  # outlier scores

    # Evaluate and print the results.
    # Passing the estimator itself as the name prints its full repr.
    print("\nOn Training Data:")
    evaluate_print(clf, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf, y_test, y_test_scores)

    # Visualize the results
    visualize(clf, X_train, y_train, X_test, y_test, y_train_pred,
              y_test_pred, show_figure=True, save_figure=False)

On Training Data:
IForest(behaviour='old', bootstrap=False, contamination=0.1, max_features=1.0,
max_samples='auto', n_estimators=100, n_jobs=1, random_state=None,
verbose=0) ROC:0.9956, precision @ rank n:0.85

On Test Data:
IForest(behaviour='old', bootstrap=False, contamination=0.1, max_features=1.0,
max_samples='auto', n_estimators=100, n_jobs=1, random_state=None,
verbose=0) ROC:0.9967, precision @ rank n:0.9

4.3 One Class SVM

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn import svm

xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))
# Generate training data
X = 0.3 * np.random.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * np.random.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))

# Fit the model
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
# Count the errors: regular points predicted as -1, abnormal points predicted as 1
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size

# Plot the line, the points, and the nearest vectors to the plane
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("Novelty Detection")
# Draw the region of abnormal samples
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.PuBu)
# Draw the boundary between normal and abnormal samples
a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='darkred')
# Draw the region of normal samples
plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='palevioletred')

s = 40
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white', s=s, edgecolors='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='blueviolet', s=s,
                 edgecolors='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='gold', s=s,
                edgecolors='k')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
# ContourSet.collections is no longer available in recent Matplotlib releases;
# legend_elements() returns a proxy handle for the contour line instead
plt.legend([a.legend_elements()[0][0], b1, b2, c],
           ["learned frontier", "training observations",
            "new regular observations", "new abnormal observations"],
           loc="upper left",
           prop=matplotlib.font_manager.FontProperties(size=11))
# 200 training points, 40 regular test points, 20 abnormal points
plt.xlabel(
    "error train: %d/200 ; errors novel regular: %d/40 ; "
    "errors novel abnormal: %d/20"
    % (n_error_train, n_error_test, n_error_outliers))
plt.show()


4.4 Local Outlier Factor

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
from scipy import stats

# Construct the training samples
n_samples = 200  # total number of samples
outliers_fraction = 0.25  # proportion of abnormal samples
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(n_inliers // 2, 2)
X_train = np.r_[X + 2, X - 2]  # normal samples
# Normal samples followed by abnormal samples
X_train = np.r_[X_train, rng.uniform(low=-6, high=6, size=(n_outliers, 2))]

# Fit the model
clf = LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction)
y_pred = clf.fit_predict(X_train)
scores_pred = clf.negative_outlier_factor_
# Threshold derived from outliers_fraction, used for plotting
threshold = stats.scoreatpercentile(scores_pred, 100 * outliers_fraction)

# Plot the level sets of the score function.
# Scoring points other than the training data requires novelty=True,
# so a second estimator is fitted for the grid.
clf_grid = LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction,
                              novelty=True)
clf_grid.fit(X_train)
xx, yy = np.meshgrid(np.linspace(-7, 7, 50), np.linspace(-7, 7, 50))
# Comparable to scores_pred: the smaller the value, the more likely the
# point is abnormal
Z = clf_grid.score_samples(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.title("Local Outlier Factor (LOF)")
# Draw the abnormal region, from the smallest value up to the threshold
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
             cmap=plt.cm.Blues_r)
# Draw the boundary between the abnormal region and the normal region
a = plt.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
# Draw the normal region, from the threshold up to the largest value
plt.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='palevioletred')

b = plt.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1], c='white',
                s=20, edgecolor='k')
c = plt.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1], c='black',
                s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-7, 7))
plt.ylim((-7, 7))
# ContourSet.collections is no longer available in recent Matplotlib releases;
# legend_elements() returns a proxy handle for the contour line instead
plt.legend([a.legend_elements()[0][0], b, c],
           ['learned decision function', 'true inliers', 'true outliers'],
           loc="upper left")
plt.show()


  1. D. Hawkins. Identification of Outliers. Chapman and Hall, 1980.

  2. E. Knorr and R. Ng. Finding Intensional Knowledge of Distance-Based Outliers. VLDB Conference, 1999.

  3. L. Akoglu, E. Muller, and J. Vreeken. ACM KDD Workshop on Outlier Detection and Description, 2013. http://www.outlieranalytics.org/odd13kdd/

  4. P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. Wiley, 2003.

Origin blog.csdn.net/weixin_43595036/article/details/112488120