Common methods and libraries for anomaly detection 2021-01-14


1. The meaning of anomaly detection

Anomaly detection identifies data that differ from normal data or that deviate greatly from expected behavior. Credit card fraud, abnormalities in industrial production, and network traffic anomalies (network intrusion) are just a few examples.

1.1 Categories of anomalies

- **Point anomaly**: a small number of individual instances are anomalous while the vast majority are normal, for example a patient's health indicators compared with those of healthy people;
- **Contextual anomaly**: also known as conditional anomaly, an instance that is anomalous in a specific context but normal in other contexts, such as a sudden rise or fall in temperature at a particular time, or unusually rapid credit card transactions in a particular scenario;
- **Collective (group) anomaly**: a subset of instances that is anomalous as a whole, even though each individual instance may look normal on its own; for example, the set of fake accounts in a social network forms a collective anomaly, while each real account is normal.

1.2 Application scenarios of anomaly detection

Fault detection
IoT anomaly detection
Fraud detection
Industrial anomaly detection
Time series anomaly detection
Video anomaly detection
Log anomaly detection
Medical anomaly detection
Network intrusion detection

2. Common methods of anomaly detection

2.1.1 Statistical methods

Statistical methods make assumptions about the normality of the data. **They assume that normal data objects are generated by a statistical model, and that data which do not fit the model are outliers.** The effectiveness of statistical methods depends heavily on whether the statistical assumptions made about the given data actually hold.

The general idea of the statistical approach to anomaly detection is: learn a generative model that fits the given data set, then identify the objects that fall in low-probability regions of the model and treat them as anomalies.

In other words, build a statistical model of the data and then measure how well each object fits the model; objects with a poor fit are flagged as anomalies. A minimal sketch follows below.
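As a simple illustration (synthetic data, a per-feature Gaussian assumption), fit a mean and standard deviation to each feature and flag points whose z-score is large:

```python
# A minimal sketch of the statistical idea: assume each feature is roughly
# Gaussian, fit mean/std on the data, and flag points that fall in a
# low-probability region (|z-score| > 3 on any feature).
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),        # normal data
               np.array([[6.0, -5.0], [7.5, 8.0]])])   # two obvious outliers

mu, sigma = X.mean(axis=0), X.std(axis=0)
z = np.abs((X - mu) / sigma)

is_outlier = (z > 3).any(axis=1)
print("Outlier indices:", np.where(is_outlier)[0])
```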

2.1.2 Linear model

The principle of PCA dimensionality reduction is to construct a new feature space and map the original data into this new low-dimensional space. PCA improves computational performance and alleviates the "curse of dimensionality", while the reduced data retain the characteristics of the original data as much as possible (measured by the data covariance). For anomaly detection, a common approach is to reconstruct each sample from the leading principal components and treat samples with a large reconstruction error as anomalies, as sketched below.
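A minimal sketch of this idea, assuming scikit-learn and synthetic data: project onto a few principal components, reconstruct, and score each sample by its reconstruction error.

```python
# A minimal PCA sketch: samples that are poorly reconstructed from the
# leading principal components receive a high anomaly score.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# normal data lies close to a 2-D subspace of a 5-D space
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(500, 5))
X = np.vstack([X, rng.normal(0, 3, size=(5, 5))])        # a few off-subspace outliers

pca = PCA(n_components=2).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))           # reconstruct from 2 components

recon_error = ((X - X_rec) ** 2).sum(axis=1)              # anomaly score
print("Top 5 most abnormal indices:", np.argsort(recon_error)[-5:])
```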

2.1.3 Similarity-based methods

This type of algorithm is suitable when the data points are tightly clustered and outliers are few. Because similarity-based algorithms usually perform a separate computation for every data point, they tend to be computationally expensive and are not well suited to large, high-dimensional data sets.
Similarity-based detection methods can be roughly divided into three categories:

Cluster-based detection, using clustering algorithms such as DBSCAN (a minimal sketch follows below).
Clustering algorithms divide the data points into relatively dense "clusters", and points that cannot be assigned to any cluster are treated as outliers. These algorithms are highly sensitive to the choice of the number of clusters: an unsuitable choice may cause many normal points to be classified as outliers, or outliers hiding in small clusters to be classified as normal. Specific parameters therefore need to be tuned for each data set to guarantee good clustering, and the settings generalize poorly across data sets. The main purpose of clustering is usually to find the clustered data, while outliers and noise are treated as worthless and ignored or discarded, so clustering algorithms are rarely designed specifically for outlier detection.
The advantages and disadvantages of clustering algorithms:
(1) They can detect anomalies in small clusters fairly well;
(2) They are usually designed to discover clusters, and the outliers are simply discarded, so outliers are not handled very gracefully;
(3) The resulting set of outliers and their scores may depend heavily on the number of clusters used and on the presence of outliers in the data;
(4) The quality of the clusters produced by the clustering algorithm has a large influence on the quality of the outliers it produces.
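A minimal cluster-based sketch, assuming scikit-learn's DBSCAN and synthetic data (eps and min_samples are illustrative and would need tuning per data set, echoing the parameter-sensitivity point above):

```python
# A minimal cluster-based sketch with DBSCAN: points that cannot be assigned
# to any cluster get the label -1 and are treated as outliers.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(200, 2)),       # dense cluster A
               rng.normal(4, 0.3, size=(200, 2)),       # dense cluster B
               rng.uniform(-3, 7, size=(10, 2))])       # scattered noise

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
outlier_idx = np.where(labels == -1)[0]
print("Number of points flagged as outliers:", len(outlier_idx))
```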
Distance-based metrics, such as the k-nearest neighbor algorithm (a minimal sketch follows below).
The basic idea of the k-nearest neighbor approach is to compute the distance from each point to its k nearest neighbors and to judge whether the point is an outlier from the size of this distance. The outlier distance is highly sensitive to the value of k: if k is too small (for example 1), a small number of nearby outliers may receive low outlier scores; if k is too large, all objects in clusters with fewer than k points may become outliers. To make the model more stable, the distance value is usually computed as the average distance to the k nearest neighbors.
The advantages and disadvantages of the k-nearest neighbor approach:
(1) Simple;
(2) Proximity-based methods require O(m²) time, which is not suitable for large data sets;
(3) Sensitive to the choice of parameters;
(4) Cannot handle data sets whose regions have different densities, because a global threshold is used that cannot account for the density variation.
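A minimal distance-based sketch, assuming scikit-learn and synthetic data: each point is scored by its average distance to its k nearest neighbors.

```python
# A minimal distance-based sketch: score each point by its average distance
# to its k nearest neighbors (a larger distance means more abnormal).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),
               rng.uniform(-8, 8, size=(10, 2))])       # a few scattered outliers

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)         # +1: each point is its own nearest neighbor
dists, _ = nn.kneighbors(X)
scores = dists[:, 1:].mean(axis=1)                      # drop the self-distance, average the rest

print("Top 5 most abnormal indices:", np.argsort(scores)[-5:])
```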
Density-based metrics, such as the LOF (local outlier factor) algorithm (a minimal sketch follows below).
The local outlier factor (LOF) algorithm is similar to k-nearest neighbors, except that it measures the local density deviation of a point relative to its neighbors instead of the distance. It turns the distances to neighboring points into "neighborhoods" and counts the points within each neighborhood (i.e., the density); samples whose density is much lower than that of their neighbors are considered outliers.
The advantages and disadvantages of the LOF (local outlier factor) algorithm:
(1) It gives a quantitative measure of the degree of outlierness;
(2) Data in regions of different density can be handled well;
(3) It is sensitive to the choice of parameters.
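A minimal LOF sketch, assuming scikit-learn and synthetic data:

```python
# A minimal LOF sketch: fit_predict returns -1 for outliers, and
# negative_outlier_factor_ holds the (negated) LOF scores.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),       # dense region
               rng.normal(5, 2.0, size=(100, 2)),       # sparse region
               np.array([[2.5, 2.5]])])                 # an isolated point in between

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                             # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_                  # larger = more abnormal

print("Outlier indices:", np.where(labels == -1)[0])
print("Most abnormal index:", np.argmax(scores))
```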

2.1.4 Ensemble methods

Ensembling is a common technique for improving the accuracy of data mining algorithms. Ensemble methods combine the outputs of multiple algorithms or multiple base detectors. The basic idea is that some algorithms perform well on some subsets of the data while other algorithms perform well on other subsets; combining them makes the output more robust. Ensemble methods have a natural affinity with subspace-based methods: subspaces correspond to different sets of points, while ensemble methods use base detectors to explore subsets of different dimensions and then aggregate these base learners.

Commonly used ensemble methods include feature bagging and isolation forest.

**Feature bagging**:

Similar to the bagging method, except that what is sampled is the features: each base detector is trained on a random subset of the features, and the resulting outlier scores are combined (see the sketch below).
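A minimal feature-bagging sketch (not any library's implementation; synthetic data, LOF as the base detector): each detector sees a random feature subset and the per-detector scores are averaged.

```python
# A minimal feature-bagging sketch: each base detector is an LOF model fitted
# on a random subset of the features, and the per-detector outlier scores are
# averaged.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(300, 10)),        # normal points
               rng.normal(4, 1, size=(10, 10))])        # a few outliers

n_detectors = 10
n_features = X.shape[1]
scores = np.zeros(X.shape[0])

for _ in range(n_detectors):
    # sample between d/2 and d features for this base detector
    k = rng.integers(n_features // 2, n_features + 1)
    cols = rng.choice(n_features, size=k, replace=False)
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(X[:, cols])
    # negative_outlier_factor_ is negative; negate so larger = more abnormal
    scores += -lof.negative_outlier_factor_

scores /= n_detectors
print("Top 5 most abnormal indices:", np.argsort(scores)[-5:])
```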

**Isolation forest**:

Isolation forest cuts the data space with random hyperplanes; each cut produces two subspaces. It keeps cutting each subspace with random hyperplanes until every subspace contains only one data point. Intuitively, dense clusters need many cuts before their points are separated, whereas points in low-density regions are isolated into their own subspace very quickly. Isolation forest treats the points that are isolated quickly as anomalies.

A simple intuition using four samples: d is the first to be isolated, so d is the most likely to be anomalous.
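A minimal isolation forest sketch, assuming scikit-learn and synthetic data:

```python
# A minimal isolation forest sketch: points that are isolated after few random
# splits receive low scores and are labelled -1.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),
               rng.uniform(-8, 8, size=(15, 2))])       # a few scattered outliers

iso = IsolationForest(n_estimators=100, contamination=0.03, random_state=0)
labels = iso.fit_predict(X)                             # -1 = outlier, 1 = inlier
scores = iso.decision_function(X)                       # lower = more abnormal

print("Number of flagged outliers:", (labels == -1).sum())
```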

2.1.5 Machine learning

When labels are available, tree models (GBDT, XGBoost, etc.) can be used for classification. The drawback is that labels in anomaly detection scenarios are highly imbalanced; the advantage of using machine learning algorithms is that many different features can be constructed. A minimal sketch follows below.
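A minimal supervised sketch, assuming the xgboost package is installed and using synthetic, heavily imbalanced labels; scale_pos_weight is one common way to compensate for the imbalance.

```python
# A minimal supervised sketch: a tree-based classifier on imbalanced labels,
# with scale_pos_weight offsetting the rarity of the anomaly class.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(980, 5)),
               rng.normal(3, 1, size=(20, 5))])
y = np.array([0] * 980 + [1] * 20)                      # 1 = anomaly, heavily imbalanced

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# weight positives by the negative/positive ratio to offset the imbalance
ratio = (y_tr == 0).sum() / (y_tr == 1).sum()
clf = XGBClassifier(n_estimators=200, max_depth=3, scale_pos_weight=ratio)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```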

3. Commonly used open source libraries for anomaly detection

Scikit-learn

Scikit-learn is an open-source machine learning library for Python. It provides a variety of classification, regression, and clustering algorithms, and it also includes several anomaly detection algorithms such as LOF and isolation forest.
Official website: https://scikit-learn.org/stable/

PyOD

**Python Outlier Detection (PyOD)** is currently the most popular Python anomaly detection toolkit (a minimal usage sketch follows below). Its main highlights include:

Nearly 20 common anomaly detection algorithms, from classics such as LOF/LOCI/ABOD to recent approaches such as generative adversarial networks (GAN) and outlier ensembles
Support for multiple Python versions (2.7 and 3.5+) and multiple operating systems (Windows, macOS, and Linux)
A simple, easy-to-use, and consistent API: anomaly detection takes only a few lines of code, which makes it convenient to evaluate a large number of algorithms
Optimization with JIT and parallelization to speed up the algorithms and improve scalability, so large amounts of data can be handled
— https://zhuanlan.zhihu.com/p/58313521

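A minimal PyOD usage sketch on synthetic data, using its kNN-based detector (illustrative only; see the PyOD documentation for the full API):

```python
# A minimal PyOD sketch: fit a kNN detector, inspect training labels/scores,
# then score unseen data.
import numpy as np
from pyod.models.knn import KNN

rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, size=(200, 2)),
                     rng.uniform(-6, 6, size=(10, 2))])   # a few outliers

clf = KNN(contamination=0.05, n_neighbors=5)
clf.fit(X_train)

print(clf.labels_[:10])              # 0 = inlier, 1 = outlier (training data)
print(clf.decision_scores_[:10])     # raw outlier scores on the training data

X_new = rng.normal(0, 1, size=(5, 2))
print(clf.predict(X_new))            # labels for unseen data
print(clf.decision_function(X_new))  # outlier scores for unseen data
```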

Origin: https://blog.csdn.net/qq_43720646/article/details/112603310