Datawhale Task 2: Anomaly Detection with Statistical Methods

The main contents include:

  • Gaussian distribution
  • Box plot

1. Overview

Statistical methods make assumptions about the normality of the data. **They assume that normal data objects are generated by a statistical model, and that data which do not fit the model are outliers.** The effectiveness of statistical methods depends heavily on whether the statistical assumptions made about the given data actually hold.

2. Parametric methods

2.1 Univariate outlier detection based on the normal distribution

Data involving only one attribute or variable is called univariate data. We assume that the data are generated by a normal distribution; we can then learn the parameters of that distribution (mean and standard deviation) from the input data and identify low-probability points as outliers.

The threshold is an empirical value; a common choice is the threshold that maximizes the evaluation metric (i.e., performs best) on a validation set.
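As an illustrative sketch of this threshold search (the helper name and the use of F1 as the evaluation metric are assumptions, not from the original text):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(scores, y_val, candidates):
    """Pick the threshold with the highest F1 on the validation set.
    `scores` are probability densities; points below the threshold
    are flagged as outliers (label 1)."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        pred = (scores < t).astype(int)  # 1 = outlier
        f1 = f1_score(y_val, pred)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The candidate grid can be as simple as a range of quantiles of the training-set densities.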

For example, under the commonly used 3σ rule, data points falling outside the range (μ − 3σ, μ + 3σ) are likely to be outliers.
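A minimal sketch of the 3σ rule, with the parameters estimated directly from the data (the function name is illustrative):

```python
import numpy as np

def three_sigma_outliers(x):
    """Flag points outside (mu - 3*sigma, mu + 3*sigma),
    with mu and sigma estimated from the data itself."""
    mu, sigma = x.mean(), x.std()
    return (x < mu - 3 * sigma) | (x > mu + 3 * sigma)
```

Note that extreme outliers inflate the estimated σ, so robust estimates (e.g., the median and MAD) are sometimes preferred.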

This method can also be used for visualization. The box plot gives a simple statistical visualization of the data distribution, using the lower and upper quartiles (Q1 and Q3) and the median of the data set. Outliers are often defined as points less than Q1 − 1.5 IQR or greater than Q3 + 1.5 IQR, where IQR = Q3 − Q1 is the interquartile range.
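The box-plot rule can be sketched as follows (function name illustrative):

```python
import numpy as np

def iqr_outliers(x):
    """Flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR,
    the usual box-plot definition of outliers."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
```

Unlike the 3σ rule, this makes no normality assumption, since quartiles are robust to the shape of the distribution.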

2.2 Multivariate outlier detection

Data involving two or more attributes or variables is called multivariate data. Many univariate outlier detection methods can be extended to handle multivariate data. The core idea is to transform the multivariate outlier detection task into a univariate one. For example, when univariate outlier detection based on the normal distribution is extended to the multivariate case, the mean and standard deviation can be estimated separately for each dimension.
This works when the dimensions are independent of each other. If the features are correlated, a multivariate Gaussian distribution must be used instead.
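A hedged sketch of the correlated case: fit a single multivariate Gaussian (mean vector plus full covariance matrix) and treat the lowest-density points as outliers. The function name and the use of `scipy.stats.multivariate_normal` are illustrative choices, not from the original text:

```python
import numpy as np
from scipy.stats import multivariate_normal

def multivariate_gaussian_scores(X):
    """Fit a multivariate Gaussian to X and return the density of
    each point. Low densities indicate likely outliers."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # full covariance captures correlations
    return multivariate_normal(mean=mu, cov=cov).pdf(X)
```

Because the covariance matrix is used, a point can be flagged for violating the correlation structure even if each coordinate looks normal on its own.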

3. Non-parametric methods

In non-parametric anomaly detection methods, the model of "normal data" is learned from the input data rather than assumed a priori. Non-parametric methods generally make fewer assumptions about the data, so they are applicable in more situations.

Example: using a histogram to detect outliers.

The histogram is a frequently used non-parametric statistical model that can be used to detect outliers. The process consists of the following two steps:

Step 1: Construct a histogram. Use the input data (training data) to construct a histogram. The histogram can be univariate or multivariate (if the input data is multidimensional).

Although non-parametric methods do not assume any prior statistical model, they usually do require the user to provide parameters for learning from the data. For example, the user must specify the histogram type (equal width or equal depth) and other parameters (the number of bins, or the size of each bin). Unlike in parametric methods, these parameters do not specify the type of the data distribution.

Step 2: Detect outliers. To determine whether an object is an outlier, check it against the histogram. In the simplest approach, if the object falls into one of the histogram's bins it is considered normal; otherwise it is considered an outlier.

In more sophisticated approaches, the histogram can be used to assign each object an outlier score, for example the inverse of the volume (height) of the bin into which the object falls.
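Both steps can be sketched for the one-dimensional case as follows (the handling of empty bins and out-of-range points as "infinitely abnormal" is an illustrative choice):

```python
import numpy as np

def histogram_scores(train, test, bins=10):
    """Score test points by the inverse height of the training-histogram
    bin they fall into. Points in empty bins or outside the training
    range get an infinite score (maximally abnormal)."""
    heights, edges = np.histogram(train, bins=bins, density=True)
    idx = np.clip(np.searchsorted(edges, test, side="right") - 1, 0, bins - 1)
    out_of_range = (test < edges[0]) | (test > edges[-1])
    scores = np.full(test.shape, np.inf)
    ok = ~out_of_range & (heights[idx] > 0)
    scores[ok] = 1.0 / heights[idx][ok]
    return scores
```

Higher scores mean sparser bins, i.e., more abnormal points.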

One disadvantage of using the histogram as a non-parametric model for outlier detection is that it is difficult to choose a suitable bin size. On the one hand, if the bins are too narrow, many normal objects fall into empty or sparse bins and are misidentified as outliers. On the other hand, if the bins are too wide, outliers may fall into dense bins and thus "masquerade" as normal points.

4. HBOS

HBOS stands for Histogram-Based Outlier Score. It is a combination of univariate methods: it cannot model dependencies between features, but it is fast to compute and scales well to large data sets. Its basic assumption is that the dimensions of the data set are independent of one another. Each dimension is then divided into bins; the denser a bin, the lower the outlier score of the points falling into it.

HBOS algorithm flow:

1. Build a histogram for each dimension of the data. For categorical data, count the occurrences of each value and compute the relative frequency. For numerical data, one of two methods is used depending on the distribution: static bin widths (equal-width bins over the value range) or dynamic bin widths (each bin contains roughly the same number of points, which suits skewed distributions).
2. An independent histogram is computed for each dimension, where the height of each bin is a density estimate. Each histogram is then normalized so that its maximum height is 1, which gives every feature equal weight in the outlier score.
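The per-dimension histograms are then combined into a score: in the HBOS paper, the score of a point p is the sum over dimensions d of log(1 / hist_d(p)). A minimal sketch of this flow (not the pyod implementation; the bin count and equal-width binning are illustrative choices):

```python
import numpy as np

def hbos_scores(X, n_bins=10, eps=1e-12):
    """Minimal HBOS: one equal-width histogram per dimension, each
    normalized so its tallest bin has height 1 (equal feature weight);
    score(p) = sum over dimensions of log(1 / bin height at p)."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        heights, edges = np.histogram(X[:, j], bins=n_bins)
        heights = heights / heights.max()  # normalize max height to 1
        idx = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1,
                      0, n_bins - 1)
        scores += np.log(1.0 / (heights[idx] + eps))
    return scores
```

Because each dimension contributes independently, a point is scored highly only for landing in sparse bins, regardless of correlations between features.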

5. Summary

1. Statistical methods for anomaly detection learn a model from the data and use it to distinguish normal data objects from outliers. One advantage of statistical methods is that the outlier label can be statistically justified. Of course, this holds only when the statistical assumptions made about the data match reality.

2. HBOS performs well at global anomaly detection but cannot detect local anomalies. However, HBOS is much faster than standard algorithms, especially on large data sets.

6. Code

```python
from pyod.models.hbos import HBOS
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

# Generate sample data: 2-D points with a small fraction of outliers
X_train, y_train, X_test, y_test = \
    generate_data(n_train=1000,        # number of training points
                  n_test=300,          # number of test points
                  n_features=2,        # number of features
                  contamination=0.01,  # estimated percentage of outliers
                  random_state=123)

# Train the HBOS detector
clf_name = 'HBOS'
clf = HBOS()
clf.fit(X_train)

# Training results: binary labels (0 = inlier, 1 = outlier) and raw scores
y_train_pred = clf.labels_
y_train_scores = clf.decision_scores_

# Predictions on the test set
y_test_pred = clf.predict(X_test)              # binary labels
y_test_scores = clf.decision_function(X_test)  # raw outlier scores

# Evaluate with ROC AUC and precision @ rank n
evaluate_print(clf_name, y_train, y_train_scores)
evaluate_print(clf_name, y_test, y_test_scores)

# Visualize the training and test results
visualize(clf_name, X_train, y_train, X_test, y_test,
          y_train_pred, y_test_pred, show_figure=True, save_figure=False)
```

## References

**About Datawhale**:

> Datawhale anomaly detection learning materials

Source: blog.csdn.net/huochuangchuang/article/details/112691204