High-dimensional data anomaly detection

I. Overview

Many real data sets are high-dimensional; some contain hundreds or even thousands of dimensions. As dimensionality grows, many traditional anomaly detection methods stop working effectively. This is one manifestation of the well-known curse of dimensionality: when a full-dimensional analysis is performed in a high-dimensional space, the data becomes sparse, and the true outliers are masked by the noise of many irrelevant dimensions.

One of the main reasons for the curse of dimensionality is that it becomes difficult to characterize the relative position of a point in high dimensions. Distance-based methods, for example, define locality with a distance function computed over all dimensions, but some dimensions may be irrelevant to a given test point, which degrades the quality of that distance function. Worse, in a high-dimensional space almost all pairs of points are nearly equidistant, a phenomenon known as distance concentration. Since outliers are defined as points in sparse regions, this leads to poor discrimination: all points lie in full-dimensional regions of nearly identical sparsity. The challenge posed by the curse of dimensionality is not unique to anomaly detection. Many problems, such as clustering and nearest-neighbor search, are known to degrade as dimensionality increases. Indeed, it has been argued that almost any algorithm based on the notion of proximity degrades in quality in high-dimensional space, and therefore needs to be redefined in a more meaningful way.
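As a quick numerical illustration of distance concentration (a minimal sketch, not from the original text; the point counts and dimensions are arbitrary choices), the relative gap between the nearest and farthest random points from the origin shrinks as dimensionality grows:

```python
import numpy as np

def relative_contrast(n_points=1000, dim=2, seed=0):
    """(max - min) / min over distances from the origin to random points."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, dim))
    d = np.linalg.norm(X, axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast collapses as the dimensionality increases
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim=dim), 3))
```

In low dimensions the farthest point is many times farther than the nearest; in high dimensions all points sit at nearly the same distance, so "sparse region" loses its discriminating power.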

To further illustrate the failure of full-dimensional outlier analysis, consider the example below. The figure shows four different two-dimensional views of a hypothetical data set, each corresponding to a disjoint set of dimensions. Point "A" is clearly exposed as an outlier in the first view and point "B" in the fourth view, but neither point stands out in the second or third view. From the perspective of measuring the outlierness of "A" and "B", those views are therefore noisy: three of the four views are non-informative for revealing either outlier, and when distances are measured in full dimensionality the outliers are lost in the random distributions within these views. This effect naturally grows more severe as dimensionality increases; for very high-dimensional data sets, only a small fraction of the views may be informative for the outlier analysis process.
[Figure: four two-dimensional views of a hypothetical data set; "A" is an outlier only in the first view, "B" only in the fourth]
The figure above tells us that the locally relevant dimensions of the problem matter greatly. The physical interpretation is quite intuitive in practice: an object may have many measured attributes, and its significantly abnormal behavior may be reflected in only a small subset of them. Consider, for example, an aircraft mechanical fault detection scenario in which the results of different tests are represented in different dimensions. The results of thousands of airframe tests performed on the same aircraft may be mostly normal, with some insignificant noise variation, while deviations in a small number of tests may be significant enough to indicate abnormal behavior. When the test data is represented in full dimensionality, the anomalous points appear normal in almost every view of the data except a small subset of dimensions. An aggregate proximity measure is therefore unlikely to expose the outliers, because the noise variation of the many normal tests masks them. Moreover, when examining different objects, different tests (i.e., different subsets of dimensions) may be relevant for identifying outliers. In other words, outliers are usually embedded in locally relevant subspaces.

What does this mean for full-dimensional analysis? When deviation is measured with a full-dimensional distance, the dilution effect of a large number of "normal noise" dimensions makes outliers hard to detect. In most cases, the noise in the remaining dimensions causes distances to concentrate, which makes the computation less reliable, and the cumulative noise across many dimensions interferes with the detection of actual deviations. Simply put, because of the masking and dilution effects of noise in full-dimensional computation, outliers that exist in low-dimensional subspaces are lost when full-dimensional analysis is used.
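The masking effect can be sketched numerically (a hypothetical numpy example; the deviation size, sample counts, and choice of k are arbitrary): an outlier that deviates in only 2 of 100 dimensions stands out sharply in that 2-D subspace, but barely at all under a full-dimensional k-nearest-neighbor score:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))
X[0, :2] = 6.0                     # point 0 deviates only in dimensions 0 and 1

def knn_dist(X, k=5):
    """Distance from each point to its k-th nearest neighbor."""
    sq = (X ** 2).sum(axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    np.fill_diagonal(D, np.inf)
    return np.sort(D, axis=1)[:, k - 1]

score_sub = knn_dist(X[:, :2])     # score in the informative 2-D subspace
score_full = knn_dist(X)           # score over all 100 dimensions

# How far the outlier's score stands above the typical (median) score:
contrast_sub = score_sub[0] / np.median(score_sub)
contrast_full = score_full[0] / np.median(score_full)
print(contrast_sub, contrast_full)
```

In the 2-D subspace the outlier's score is many times the median; in the full space the 98 noise dimensions push everyone's score toward the same concentrated value, so the contrast nearly vanishes.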
Other distance-based methods, such as clustering and nearest-neighbor search, show similar effects. For those problems it has been shown that examining the behavior of the data in subspaces yields more meaningful clusters that are specific to the particular subspace in question. This broad observation also applies to anomaly detection: since outliers may only be discoverable in low-dimensional subspaces of the data, it makes sense to explore the deviations of interest in those subspaces. Doing so filters out the additional noise of the many remaining dimensions and yields more robust outliers. Interestingly, such low-dimensional projections can often be identified even in data sets with missing attribute values. This is very useful in many practical applications, because feature extraction is a difficult process and a complete feature description is often unavailable. In the airframe fault detection scenario, for example, only a subset of the tests may have been applied, so only the values in that subset of dimensions are available for outlier analysis. This model is called projected anomaly detection or, alternatively, subspace anomaly detection.

Identifying the relevant subspaces is a very challenging problem, because the number of possible subspace projections of high-dimensional data is exponential in the dimensionality. An effective anomaly detection method needs to search over data points and dimensions in an integrated way in order to reveal the most relevant outliers, since different subsets of dimensions may be relevant to different outliers, as the example in the figure above makes clear. This further increases the computational complexity.

An important observation is that subspace analysis is generally harder in anomaly detection than in clustering. This is because clusters are defined by aggregate behavior, while outliers, by definition, are rare. Consequently, in outlier analysis the statistics of individual dimensions in a given region provide only very weak hints for the subspace exploration process, in contrast to aggregation-based problems such as clustering. When these weak hints cause relevant dimensions to be omitted, the effect can be far more dramatic than including irrelevant ones, especially in the interesting cases where the number of locally relevant dimensions is only a small fraction of the full data dimensionality. A common mistake is to assume that the complementary relationship between clustering and outlier analysis extends to the problem of local subspace selection. In particular, the dimension selection used in early subspace clustering methods does not account for the subtle differences between subspace analysis for clustering and for outliers, so important outliers are sometimes missed. The difficulty of identifying relevant subspaces is therefore crucial here as well. In general, selecting a single relevant subspace per data point leads to unpredictable results, so it is important to combine the results of multiple subspaces. In other words, subspace anomaly detection is inherently an ensemble-centric problem.
The commonly used methods are as follows:

  • Rarity-based : These methods try to discover subspaces based on the rarity (sparsity) of the underlying distribution. The main challenge here is computational, because the number of rare subspaces is far larger than the number of dense subspaces in high-dimensional data.
  • Unbiased : In these methods, the subspaces are sampled in an unbiased way, and the score is a combination of the scores across the sampled subspaces. When the subspaces are sampled from the original attribute set, the method is called feature bagging. When arbitrarily oriented subspaces are sampled, the method is called rotated bagging or rotated subspace sampling. Although these methods are very simple, they are often very effective.
  • Aggregation-based : In these methods, aggregate statistics, such as cluster statistics, variance statistics, or non-uniformity statistics of global or local subsets of the data, are used to quantify the relevance of a subspace. Unlike rarity-based statistics, these methods quantify the statistical properties of a global or local reference set of points rather than trying to identify sparse subspaces directly. Since such methods provide only weak (and error-prone) hints about the relevant subspaces, sampling multiple subspaces is essential.

II. Feature Bagging

Feature bagging is similar in spirit to bagging, except that the objects being sampled are features. It is a type of ensemble method, and the design of an ensemble method involves the following two main steps:

1. Choose a base detector

The base detectors can be entirely different from one another, use different parameter settings, or be trained on different sampled sub-data sets. Feature bagging commonly uses the LOF algorithm as the base algorithm. The following figure shows the general feature bagging algorithm:

[Figure: pseudocode of the general feature bagging algorithm]
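The procedure in the figure can be sketched roughly as follows (an illustrative simplification, assuming the usual formulation of feature bagging with subset sizes drawn from [d/2, d-1]; scikit-learn's LocalOutlierFactor stands in as the base LOF detector):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging_scores(X, n_rounds=10, seed=0):
    """Average LOF scores over randomly sampled feature subsets."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_rounds):
        k = rng.integers(d // 2, d)                 # subset size in [d/2, d-1]
        feats = rng.choice(d, size=k, replace=False)
        lof = LocalOutlierFactor(n_neighbors=10)
        lof.fit(X[:, feats])
        scores += -lof.negative_outlier_factor_     # higher = more outlying
    return scores / n_rounds                        # averaging combination

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
X[0] = 5.0                                          # plant an obvious outlier
print(np.argmax(feature_bagging_scores(X)))
```

Each round scores all points on a random feature subset, and the per-round scores are combined by averaging; maximization is the other common choice, discussed below.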

2. Score standardization and combination method

Different detectors may produce scores on different scales. For example, an average k-nearest-neighbor detector outputs raw distance scores, while the LOF algorithm outputs normalized values. Moreover, although detectors generally output larger scores for outliers, some output smaller ones. The scores of the various detectors must therefore be converted into normalized values that can be meaningfully combined. After the scores are standardized, a combination function must be selected to combine the scores of the different base detectors; the most common choices are the averaging and maximization combination functions.
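A minimal sketch of the standardize-then-combine step (toy score values; the z-score normalization shown is one common choice, not the only one):

```python
import numpy as np

def standardize(s):
    """z-score normalization so heterogeneous detector scores are comparable."""
    return (s - s.mean()) / s.std()

# Toy raw scores from two detectors over the same four points
scores_knn = np.array([1.2, 0.9, 1.1, 7.5])   # raw distance scores
scores_lof = np.array([1.0, 1.1, 0.9, 3.2])   # LOF-style normalized scores

Z = np.vstack([standardize(scores_knn), standardize(scores_lof)])
avg_combined = Z.mean(axis=0)   # averaging combination
max_combined = Z.max(axis=0)    # maximization combination
print(np.argmax(avg_combined), np.argmax(max_combined))
```

Without standardization the raw distance scores would dominate the LOF ratios; after it, both combination functions agree on the most outlying point.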
The following figures show two different score-combination methods for feature bagging:
[Figures: two score-combination schemes for feature bagging]
The design of the base detector and of its combination method both depend on the specific goals of the particular ensemble. In many cases we have no way of knowing the original distribution of the data and can only learn from part of it. In addition, the algorithm itself may have limitations that prevent it from learning the complete information in the data. The errors caused by these problems are usually divided into two types: bias and variance.

Variance : the error between the algorithm's output and its expected output, describing the model's dispersion and its sensitivity to fluctuations in the data.

Bias : the difference between the predicted value and the true value. In outlier detection in particular, there is often no ground truth available.

III. Isolation Forests

The isolation forest algorithm is an anomaly detection algorithm proposed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008. It is one of the few algorithms in machine learning designed specifically for anomaly detection. The method is time-efficient, can effectively handle high-dimensional and massive data, and requires no labeled samples, so it is widely used in industry.

The isolation forest is a non-parametric, unsupervised algorithm: it requires neither the definition of a mathematical model nor labeled training data. Its strategy for finding isolated points is straightforward. Suppose we cut the data space with a random hyperplane; one cut creates two subspaces. We then continue cutting each subspace with random hyperplanes, looping until each subspace contains only one data point. Intuitively, high-density clusters require many cuts before their points are separated, while points in low-density regions are quickly assigned to a subspace of their own. The isolation forest regards these quickly isolated points as anomalies.

For a simple, intuitive picture, consider four samples: d is the first to be isolated, so d is the most likely to be anomalous.

[Figure: isolating four samples; point d is separated first]
How to cut the data space is the core idea of the isolation forest. Because the cuts are random, an ensemble method is used to make the results reliable: the cutting is repeated from scratch many times, and the results of each round are averaged. An isolation forest consists of t isolation trees (iTrees); each tree is a random binary tree, meaning every node either has exactly two children or is a leaf with none. The structure is somewhat similar to the trees in a random forest. The construction process is as follows:

  1. Randomly select a sub-sample of the training data and place it at the root node of the tree;
  2. Randomly choose an attribute A, and randomly generate a cut point V, i.e., some value between the minimum and maximum of attribute A;
  3. Partition the samples by attribute A: samples with A less than V go to the left child of the current node, and samples with A greater than or equal to V go to the right child, forming two subspaces;
  4. Recursively repeat steps 2 and 3 in the child nodes, continually constructing left and right children, until a child node contains only one data point or the tree reaches its height limit.
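The four steps above can be sketched as a recursive function (an illustrative simplification; the node representation, height limit, and variable names are my choices, not the paper's exact pseudocode, and the full-data fit omits per-tree sub-sampling):

```python
import numpy as np

def build_itree(X, depth=0, max_depth=10, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n = len(X)
    if n <= 1 or depth >= max_depth:           # stop: isolated, or height limit
        return {"size": n, "depth": depth}
    a = rng.integers(X.shape[1])               # step 2: random attribute A
    lo, hi = X[:, a].min(), X[:, a].max()
    if lo == hi:                               # attribute constant: cannot split
        return {"size": n, "depth": depth}
    v = rng.uniform(lo, hi)                    # step 2: random cut point V
    return {"attr": a, "cut": v,               # step 3: A < V left, A >= V right
            "left": build_itree(X[X[:, a] < v], depth + 1, max_depth, rng),
            "right": build_itree(X[X[:, a] >= v], depth + 1, max_depth, rng)}

def path_length(tree, x):
    """Depth of the leaf in which sample x lands in one tree."""
    while "attr" in tree:
        tree = tree["left"] if x[tree["attr"]] < tree["cut"] else tree["right"]
    return tree["depth"]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # cluster + 1 outlier
trees = [build_itree(X, rng=np.random.default_rng(s)) for s in range(50)]
avg_out = np.mean([path_length(t, X[-1]) for t in trees])  # the outlier
avg_in = np.mean([path_length(t, X[0]) for t in trees])    # a normal point
print(avg_out, avg_in)
```

Averaged over the 50 trees, the planted outlier is isolated at a much shallower depth than a point inside the cluster, which is exactly the signal the scoring step exploits.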

Once t trees have been built, training of the isolation forest is complete, and the resulting forest can be used to evaluate test data.

The hypothesis behind anomaly detection with isolation forests is that anomalous points are generally very rare and are quickly partitioned into leaf nodes, so the path length from a leaf node to the root can be used to judge whether a record is anomalous. As with random forests, an isolation forest uses the averaged results of all the trees it builds as its final result, and the training samples of each tree are randomly sub-sampled. As the tree construction process shows, the isolation forest does not need to know the labels of the samples; instead, a threshold on the score determines whether a sample is anomalous. Because the paths of anomalous points are shorter than those of normal points, the isolation forest estimates the degree of anomaly of each sample from its path length.

The path length is calculated as follows:
[Figure: path-length normalization and anomaly-score formulas]
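For reference, the path-length normalization and anomaly score used by the isolation forest, as given in the 2008 paper, are:

```latex
% c(n): average path length of an unsuccessful search in a binary search
% tree built on n samples, used to normalize path lengths across trees
c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad H(i) \approx \ln(i) + 0.5772156649
% Anomaly score of point x, where E[h(x)] is its average path length
% over all trees in the forest
s(x, n) = 2^{-E[h(x)]/c(n)}
% s(x,n) close to 1: very likely an anomaly;
% s(x,n) well below 0.5: safely regarded as normal
```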
The isolation forest is also a subspace-based method in spirit: different partitions correspond to different local subspace regions of the data, and shorter paths correspond to isolating subspaces of lower dimensionality.

IV. Summary

1. Feature bagging can reduce variance.

2. The advantages of isolation forests are:

  • Lower computational cost than distance-based or density-based algorithms.
  • Linear time complexity.
  • Advantageous when processing large data sets.

Isolation forests are not suitable for extremely high-dimensional data: because a dimension is selected at random for each cut, too many dimensions leave too much noise.

V. Feature bagging implementation based on the PyOD library

from pyod.models.feature_bagging import FeatureBagging
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

if __name__ == '__main__':
    contamination = 0.1  # percentage of outliers
    n_train = 200        # number of training points
    n_test = 100         # number of testing points

    X_train, y_train, X_test, y_test = generate_data(n_train=n_train,
                                                     n_test=n_test,
                                                     n_features=10,
                                                     contamination=contamination,
                                                     random_state=42)

    clf_name = 'FeatureBagging'
    clf = FeatureBagging(check_estimator=False)
    clf.fit(X_train)

    # Training-set labels and raw outlier scores
    y_train_pred = clf.labels_
    y_train_scores = clf.decision_scores_

    # Test-set predictions and scores
    y_test_pred = clf.predict(X_test)
    y_test_scores = clf.decision_function(X_test)

    # Print the results
    print("\nOn Training Data:")
    evaluate_print(clf_name, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf_name, y_test, y_test_scores)

On Training Data:
FeatureBagging ROC:0.9964, precision @ rank n:0.9

On Test Data:
FeatureBagging ROC:0.8244, precision @ rank n:0.6


    # For comparison, look at the effect of a single LOF detector
    from pyod.models.lof import LOF
    clf_name = 'LOF'
    clf = LOF()
    clf.fit(X_train)

    y_train_pred = clf.labels_
    y_train_scores = clf.decision_scores_

    y_test_pred = clf.predict(X_test)
    y_test_scores = clf.decision_function(X_test)

    # Print the results
    print("\nOn Training Data:")
    evaluate_print(clf_name, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf_name, y_test, y_test_scores)

On Training Data:
LOF ROC:0.9889, precision @ rank n:0.8

On Test Data:
LOF ROC:0.6333, precision @ rank n:0.5

VI. Isolation forest implementation based on the PyOD library

from pyod.models.iforest import IForest
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

if __name__ == '__main__':
    contamination = 0.1
    n_train = 200
    n_test = 100

    X_train, y_train, X_test, y_test = generate_data(n_train=n_train,
                                                     n_test=n_test,
                                                     n_features=10,
                                                     contamination=contamination,
                                                     random_state=42)

    clf_name = 'IForest'
    clf = IForest()
    clf.fit(X_train)

    y_train_pred = clf.labels_
    y_train_scores = clf.decision_scores_

    y_test_pred = clf.predict(X_test)
    y_test_scores = clf.decision_function(X_test)

    # Print the results
    print("\nOn Training Data:")
    evaluate_print(clf_name, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf_name, y_test, y_test_scores)

On Training Data:
IForest ROC:1.0, precision @ rank n:1.0

On Test Data:
IForest ROC:1.0, precision @ rank n:1.0

For this ten-feature data set generated by PyOD, the isolation forest outperforms LOF-based feature bagging, which in turn outperforms a single LOF detector.

Origin blog.csdn.net/weixin_43595036/article/details/113072799