Anomaly Detection: Statistical Methods

1. Statistical tail confidence tests

The normal distribution is widely used in statistical tail confidence tests and in many other settings. In a statistical tail confidence test, the extreme values of a set of data values are determined with respect to a normal distribution. The assumption of normality is quite common in the real world. It applies not only to variables expressed as sums of random samples (as discussed in the previous section), but also to many variables produced by different random processes. The density function f_X(x) of a normal distribution with mean μ and standard deviation σ is defined as follows:

f_X(x) = (1 / (σ·√(2π))) · exp( −(x − μ)² / (2σ²) )
In some cases, the mean μ and standard deviation σ of the model distribution can be assumed to be known. This is the case when a large number of data samples is available to estimate μ and σ accurately. In other cases, μ and σ can be obtained from domain knowledge. The z-value z_i of an observation x_i can then be computed as follows:
z_i = (x_i − μ) / σ
Since the normal distribution can be expressed directly as a function of the z-value (without other parameters), the tail probability of a point x_i can also be expressed as a function of z_i. In fact, the z-value corresponds to a shifted and scaled normal random variable, that is, a standard normal variable with mean 0 and variance 1. Therefore, the cumulative standard normal distribution can be used directly to determine the exact tail probability at the value z_i. From a practical point of view, since this cumulative distribution has no closed form, standard normal tables are used to map z_i values to tail probabilities. This yields a level of statistical significance, which can be interpreted directly as the probability that a data point is an outlier, under the basic assumption that the data were generated by a normal distribution.
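As a concrete illustration, here is a minimal sketch of this tail test using NumPy and SciPy; the sample values, the assumed μ and σ, and the 0.01 significance level are arbitrary choices made for the example.

import numpy as np
from scipy import stats

# Toy data; assume mu and sigma of the model distribution are known.
x = np.array([2.1, 1.9, 2.3, 2.0, 1.8, 2.2, 5.6])
mu, sigma = 2.0, 0.2

# z-value of each observation.
z = (x - mu) / sigma

# Two-sided tail probability under the standard normal distribution.
p_tail = 2 * stats.norm.sf(np.abs(z))

# Flag points whose tail probability falls below a chosen significance level.
alpha = 0.01
for xi, zi, pi in zip(x, z, p_tail):
    print(f"x={xi:4.1f}  z={zi:6.2f}  p={pi:.4f}  outlier={pi < alpha}")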

1.1 The t-test

The aforementioned discussion assumes that the mean and standard deviation of the model distribution can be estimated very accurately from a large number of samples. However, in practice, the available data set may be small. For example, for a sample of 20 data points, it is much more difficult to accurately model the mean and standard deviation. In this case, how do we accurately perform statistical significance testing?
Student's t-distribution provides an effective way to model anomalies in this situation. The distribution is defined by a parameter ν, the number of degrees of freedom, which is closely related to the available sample size. When the number of degrees of freedom exceeds 1000, the t-distribution is already very close to the normal distribution, and as ν tends to ∞ it converges to the normal distribution. For fewer degrees of freedom (i.e., smaller sample sizes), the t-distribution has a bell-shaped curve similar to the normal distribution, but with heavier tails. This is quite intuitive: the heavier tails account for the loss of statistical significance caused by the inability to estimate the mean and standard deviation of the model (normal) distribution accurately from a small sample.
The t-distribution is expressed as a function of several independent, identically distributed standard normal random variables. It has a single parameter ν, the number of degrees of freedom, which determines how many of these standard normal variables it is built from. The parameter ν is set to n − 1, where n is the total number of available samples. Let U_0 ... U_ν be ν + 1 independent, identically distributed normal random variables with zero mean and unit standard deviation; such a distribution is also called the standard normal distribution. Then the t-distribution is defined as follows:
T(ν) = U_0 / √( (U_1² + U_2² + … + U_ν²) / ν )
The process of extreme value detection with a small number of samples x_1 ... x_n is as follows. First, estimate the mean and standard deviation of the sample. Then use this mean and standard deviation to compute the t-value of each data point directly from the sample; the t-value is computed in the same way as the z-value. The tail probability of each data point is then obtained from the cumulative density function of the t-distribution with n − 1 degrees of freedom. As with the normal distribution, standardized tables exist for this purpose. From a practical point of view, if there are more than 1000 samples, the t-distribution (with at least 1000 degrees of freedom) is so close to the normal distribution that the normal distribution can be used as a very good approximation.
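The same test for a small sample can be sketched with the t-distribution; the toy sample and the 0.05 significance level below are arbitrary choices for illustration.

import numpy as np
from scipy import stats

x = np.array([10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 13.5])  # small toy sample
n = len(x)

# Estimate the mean and standard deviation from the sample itself.
mean = x.mean()
std = x.std(ddof=1)

# t-values are computed exactly like z-values, but the tail probability is
# read from the t-distribution with n - 1 degrees of freedom.
t_vals = (x - mean) / std
p_tail = 2 * stats.t.sf(np.abs(t_vals), df=n - 1)

alpha = 0.05
print("suspected extreme values:", x[p_tail < alpha])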

1.2 Visualizing extreme values with box plots

An interesting way to visualize univariate extreme values is the box plot, also called the box-and-whisker plot. This method is particularly useful for visualizing points that deserve attention as outliers. In a box plot, the statistics of a univariate distribution are summarized by five quantities: the "minimum" and "maximum" (the whiskers), the upper and lower quartiles (the box), and the median (the line in the middle of the box). The first two quantities are placed in quotes because they are defined in a non-standard way. The distance between the upper and lower quartiles is called the interquartile range (IQR). The "minimum" and "maximum" are defined in a (non-standard) trimmed way that determines the positions of the whiskers. If no point lies more than 1.5 IQR above the upper quartile (the top of the box), then the upper whisker is the true maximum. Otherwise, the upper whisker is placed at 1.5 IQR above the top of the box. A completely analogous rule applies to the lower whisker, which is placed at most 1.5 IQR below the bottom of the box. In the special case of normally distributed data, a point 1.5 IQR above the upper quartile lies about 2.7 standard deviations from the mean. Therefore, the positions of the whiskers are roughly comparable to the 3σ cutoff points of the normal distribution.
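A minimal matplotlib sketch of such a box plot follows; the synthetic data are made up, and whis=1.5 is matplotlib's default whisker rule, matching the 1.5-IQR convention described above.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), [6.0, -5.5]])  # normal data plus two extremes

# Whiskers extend at most 1.5 * IQR beyond the quartiles; points beyond
# the whiskers are drawn individually as suspected extreme values.
plt.boxplot(data, whis=1.5)
plt.title("Box plot with 1.5 * IQR whiskers")
plt.show()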

2. Histogram-based techniques

Histograms use a space-partitioning approach to density-based summarization. In the simplest case of univariate data, the data is discretized into bins of equal width between the minimum and maximum values, and the frequency of each bin is estimated. Data points that fall into bins of very low frequency are reported as outliers. For multivariate data, this method can be generalized in two different ways:

  • Outlier scores are computed separately for each dimension and then aggregated.
  • The discretizations of all dimensions are generated simultaneously, and a grid structure is constructed. The distribution of points in this grid structure can be used to model sparse regions, and the data points in these sparse regions are the outliers.
In some cases, the histogram is built only from a sample of the points (for efficiency reasons), but all points are scored based on the frequency of the bin they fall into. Let f_1 ... f_b be the (raw) frequencies of the b univariate or multivariate bins. These frequencies provide the outlier scores of the points in those bins; smaller values indicate stronger anomalies. In some cases, the score is adjusted by reducing the frequency count of a data point's own bin by 1, because counting a data point as evidence for itself can mask its own abnormality during extreme value analysis. This adjustment is particularly important when a data sample is used to construct the histogram but the bin frequencies are also used to score out-of-sample points; in that case, only the scores of in-sample points are reduced by 1. Therefore, the adjusted frequency f'_j of point j (belonging to the i-th bin with raw frequency f_i) is given by:

f'_j = f_i − I_j

Here, I_j ∈ {0, 1} is an indicator variable that depends on whether the j-th point is a sample point. Note that f'_j is always non-negative, because the frequency of a bin containing a sample point is always at least 1.
It is worth noting that the logarithm of the adjusted frequency f'_j represents a log-likelihood score, which makes the scores more amenable to extreme-value analysis when converting them into labels. For regularization, log2(f'_j + α) is used as the outlier score of the j-th point, where α > 0 is a regularization parameter. To convert scores to binary labels, the Student's t-distribution or the normal distribution can be used to identify abnormally low scores through extreme value analysis, and those points are flagged as outliers. Histograms are very similar to clustering methods in that they summarize the dense and sparse regions of the data in order to compute outlier scores; the main difference is that clustering methods partition the data points, whereas histogram methods partition the data space into regions of equal size.
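A minimal NumPy sketch of this univariate histogram score, including the leave-one-out adjustment and the log2(f + α) transform, is shown below; the bin count, α, and the synthetic data are arbitrary choices for the example.

import numpy as np

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(0, 1, 500), [8.0, -7.5]])  # mostly normal data plus two outliers

# Build an equal-width histogram over all points (every point is "in sample" here).
n_bins = 20
counts, edges = np.histogram(x, bins=n_bins)

# Map each point to its bin and look up the raw bin frequency.
bin_idx = np.digitize(x, edges[1:-1])
raw_freq = counts[bin_idx]

# Leave-one-out adjustment: a point should not count as evidence for itself.
adj_freq = raw_freq - 1

# Regularized log-likelihood score; lower scores indicate stronger outliers.
alpha = 1.0
scores = np.log2(adj_freq + alpha)
print("five lowest scores:", np.sort(scores)[:5])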
The main challenge of histogram-based techniques is that it is often difficult to determine the optimal bin width. A histogram that is too wide or too narrow does not model the frequency distribution at the level of granularity required to detect outliers well. When the bins are too narrow, normal data points that fall into sparse bins are flagged as outliers; when the bins are too wide, anomalous data points may fall into high-frequency bins and therefore go unrecognized. In this case, it makes sense to vary the bin width, obtain multiple scores for the same data point, and then average these (log-likelihood) scores over the different base detectors to obtain the final result, as illustrated in the sketch below. Like clustering, histogram-based methods show high variability in their predictions across parameter choices, which makes them ideal candidates for ensemble methods. For example, the RS-Hash method varies the dimensionality and size of the grid regions and determines outliers based on the averaged scores. Similarly, some recent ensemble methods such as isolation forests can be viewed as randomized histograms, in which grid regions of varying size and shape are created in a hierarchical and randomized way. Instead of measuring the number of points in a fixed-size grid region, an indirect measure of the expected size of the grid region required to isolate a single point is reported as the outlier score. This approach avoids the problem of pre-selecting a fixed grid size.
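A small sketch of this averaging idea, using equal-width histograms with different bin counts as base detectors; the set of bin counts is an arbitrary choice.

import numpy as np

def histogram_log_scores(x, n_bins, alpha=1.0):
    """Log2-frequency score of each point under one equal-width histogram."""
    counts, edges = np.histogram(x, bins=n_bins)
    bin_idx = np.digitize(x, edges[1:-1])
    return np.log2(counts[bin_idx] - 1 + alpha)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), [9.0]])  # synthetic data with one planted outlier

# Average the (log-likelihood) scores of several base detectors that use
# different bin widths; lower averaged scores indicate stronger outliers.
bin_grid = [5, 10, 20, 40, 80]
avg_scores = np.mean([histogram_log_scores(x, b) for b in bin_grid], axis=0)
print("most anomalous value:", x[np.argmin(avg_scores)])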
The second challenge in using histogram-based techniques is that their space-partitioning approach makes them blind to clustered anomalies. Multivariate grid-based methods may fail to classify a group of isolated data points as outliers unless the resolution of the grid structure is calibrated carefully. This is because the density of a grid cell depends only on the points inside it, and when the granularity of the representation is very high, a group of isolated points may create an artificially dense grid cell. This problem, too, can be partially addressed by varying the grid width and averaging the scores.
Because the grid structure becomes sparse as the dimensionality increases, histogram methods do not work well in high-dimensional settings unless the outlier scores are computed on carefully chosen low-dimensional projections. For example, a d-dimensional space contains at least 2^d grid cells, so the number of data points expected to fall into each cell decreases exponentially with the dimensionality. Techniques such as rotated bagging and subspace histograms can partially address this problem. Although histogram-based techniques have great potential, they should be combined with high-dimensional subspace ensembles to obtain the best results.

3. HBOS

HBOS stands for Histogram-Based Outlier Score. It is a combination of univariate methods; it cannot model dependencies between features, but it is fast to compute and well suited to large data sets. The basic assumption is that the dimensions of the data set are independent of each other. Each dimension is then divided into bins; the higher the density of a bin, the lower the anomaly score of the points in it.
HBOS algorithm flow:
1. Build a histogram for each dimension of the data. For categorical data, count the frequency of each value and compute the relative frequency. For numerical data, one of the following two methods is used, depending on the distribution:
(1) Static-width histogram: the standard histogram construction method uses k equal-width bins over the value range. The frequency (relative count) of samples falling into each bin is used as an estimate of the density (the bin height).
Time complexity: O(n)
(2) Dynamic-width histogram: first sort all values, then pack a fixed number N/k of consecutive values into each bin, where N is the total number of instances and k is the number of bins. The area of each bin thus represents the same number of instances. Because the width of a bin is determined by its first and last values and all bins have the same area, the height of each bin can be computed; a bin spanning a large range therefore has a low height, i.e., a low density. There is one exception: when more than N/k values are equal, more than N/k values are allowed in the same bin.
Time complexity: O(n×log(n))
2. A separate histogram is computed for each dimension, where the height of each bin is an estimate of the density. Each histogram is then normalized so that its maximum height is 1 (this ensures that every feature has equal weight in the outlier score). Finally, the HBOS value of each instance p is computed with the following formula:

HBOS(p) = Σ_{i=1..d} log( 1 / hist_i(p) )
Derivation: the score can be viewed as the product of the inverse estimated densities over the (assumed independent) features, Π_{i=1..d} 1 / hist_i(p); taking the logarithm turns this product into the sum above, which does not change the ranking of the scores and is less sensitive to floating-point errors.
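To make the formula concrete, here is a simplified from-scratch sketch of the HBOS score using static-width bins; this is only an illustration (not the PyOD implementation used in the next section), and the bin count k is an arbitrary choice.

import numpy as np

def hbos_scores(X, k=10, eps=1e-12):
    """Static-width HBOS: sum over dimensions of log(1 / normalized bin height)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    scores = np.zeros(n)
    for i in range(d):
        counts, edges = np.histogram(X[:, i], bins=k)
        heights = counts / counts.max()               # normalize so the tallest bin has height 1
        idx = np.digitize(X[:, i], edges[1:-1])
        scores += np.log(1.0 / (heights[idx] + eps))  # low density -> high outlier score
    return scores

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(300, 2)), [[6.0, 6.0]]])  # 2-d data with one planted outlier
s = hbos_scores(X)
print("index of the highest HBOS score:", np.argmax(s))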

4. Summary

1. Statistical methods for anomaly detection learn a model of the data in order to distinguish normal data objects from anomalies. One advantage of statistical methods is that the anomaly detection can be statistically justified; of course, this holds only when the statistical assumptions made about the data match the actual data.
2. HBOS performs well on global anomaly detection problems but cannot detect local anomalies. However, HBOS is much faster than standard algorithms, especially on large data sets.

5. Using the PyOD library to create a toy example and call HBOS

from pyod.models.hbos import HBOS

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyod.utils.data import generate_data, evaluate_print
from pyod.utils.example import visualize

# Basic parameters
contamination = 0.1  # fraction of outliers in the generated data
n_train = 10000
n_test = 500

# Generate the toy example data set.
# Note: the return order of generate_data may differ between PyOD versions;
# check the documentation of your installed version.
X_train, y_train, X_test, y_test = generate_data(n_train=n_train, n_test=n_test, contamination=contamination)

clf_name = 'HBOS'
clf = HBOS()
clf.fit(X_train)

# Get the scores & labels for X_train
y_train_pred = clf.labels_             # binary labels (0: inlier, 1: outlier)
y_train_scores = clf.decision_scores_  # raw outlier scores of the training data

# Get the scores & labels for X_test
y_test_pred = clf.predict(X_test)
y_test_scores = clf.decision_function(X_test)

# Output results
print("Training result:")
evaluate_print(clf_name, y_train, y_train_scores)
print("Test result:")
evaluate_print(clf_name, y_test, y_test_scores)

Training result:
HBOS ROC: 0.9892, precision @ rank n: 0.9752
Test result:
HBOS ROC: 0.9957, precision @ rank n: 0.913

visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)

(Figure: visualization of the training and test points with ground-truth and predicted labels, produced by visualize().)

Source: blog.csdn.net/weixin_43595036/article/details/112548118