Anomaly Detection in High-Dimensional Data

The main contents include:

  • Feature Bagging

  • Isolation Forest

1. Introduction

In real applications, many data sets are high-dimensional. As the dimensionality increases, the volume of the data space grows exponentially and the data become sparse; this is the curse of dimensionality. The curse of dimensionality not only challenges anomaly detection but also makes distance computation and clustering harder. For example, proximity-based methods define locality with a distance function over all dimensions, but in high-dimensional space the distances between almost all pairs of points become nearly equal (distance concentration), which renders some distance-based methods ineffective. A common remedy in high-dimensional settings is the subspace method.
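A small sketch (not from the original text) that illustrates distance concentration: for points drawn uniformly at random, the gap between the largest and smallest pairwise distance shrinks relative to the distances themselves as the dimensionality grows.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
for d in (2, 10, 100, 1000):
    X = rng.random((300, d))      # 300 points drawn uniformly from the unit hypercube
    dist = pdist(X)               # all pairwise Euclidean distances
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"dim={d:4d}   (max - min) / min pairwise distance = {contrast:.3f}")

The printed contrast drops steadily as d grows, which is why purely distance-based outlier scores lose discriminative power in high dimensions.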

Ensembling is one of the techniques commonly combined with the subspace idea, and it can effectively improve the accuracy of data mining algorithms. An ensemble method combines the outputs of several algorithms or several base detectors. The underlying idea is that some algorithms do well on some subsets of the data while other algorithms do well on other subsets; combining them makes the overall output more robust. Ensemble methods have a natural affinity with subspace-based methods: subspaces correspond to different views of the points, and an ensemble uses base detectors to explore different subsets of dimensions and then aggregates these base learners.

Two common ensemble methods are described below.

2. Feature Bagging

The basic idea of feature bagging is the same as bagging, except that the objects being sampled are features. Feature bagging is an ensemble method, and designing such an ensemble involves two main steps:

1. Choose the base detectors. The base detectors can be entirely different algorithms, the same algorithm with different parameter settings, or the same algorithm applied to different sampled subsets of the data. LOF is the most commonly used base algorithm for feature bagging. The following figure shows the general feature bagging algorithm:

[Figure: the general feature bagging algorithm. Source: https://github.com/datawhalechina/team-learning-data-mining/raw/master/AnomalyDetection/img/image-20210104144520790.png]

2. Score normalization and combination. Different detectors may produce scores on different scales: an average k-nearest-neighbor detector outputs raw distances, while LOF outputs a normalized value. Moreover, although most detectors output larger scores for outliers, some output smaller ones. The scores from the various detectors therefore have to be converted into normalized values that can be meaningfully combined. Once the scores are normalized, a combination function is chosen to merge the scores of the base detectors; the most common choices are the average and the maximum (see the sketch after the figures below).

The following figures show two different score combination methods for feature bagging:

[Figure: breadth-first score combination. Source: https://github.com/datawhalechina/team-learning-data-mining/raw/master/AnomalyDetection/img/image-20210105140222697-1609839336763.png]

(Breadth first)

[Figure: cumulative-sum score combination. Source: https://github.com/datawhalechina/team-learning-data-mining/raw/master/AnomalyDetection/img/image-20210105140242611.png]

(Cumulative sum)
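To make the two steps above concrete, here is an illustrative sketch (not PyOD's implementation; the function name and parameters are hypothetical). It runs a LOF base detector on random feature subsets, z-score normalizes each detector's scores, and combines them by averaging or by taking the maximum; subset sizes between d/2 and d-1 follow the common description of feature bagging.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging_scores(X, n_estimators=10, combine="average", random_state=0):
    """Toy feature bagging: LOF on random feature subsets, scores normalized and combined."""
    rng = np.random.default_rng(random_state)
    n_samples, d = X.shape
    all_scores = []
    for _ in range(n_estimators):
        k = int(rng.integers(d // 2, d))                # subset size in [d/2, d-1]
        feats = rng.choice(d, size=k, replace=False)    # random feature subset
        lof = LocalOutlierFactor(n_neighbors=20)
        lof.fit(X[:, feats])
        scores = -lof.negative_outlier_factor_          # higher score = more abnormal
        scores = (scores - scores.mean()) / scores.std()  # z-score normalization
        all_scores.append(scores)
    all_scores = np.vstack(all_scores)
    # averaging is equivalent (up to scale) to the cumulative-sum combination;
    # taking the maximum is another common combination choice
    return all_scores.mean(axis=0) if combine == "average" else all_scores.max(axis=0)

On the toy data generated in the practice section, feature_bagging_scores(X_train) would return one combined outlier score per training point, with higher values meaning more anomalous.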

The design of the base detectors and of the combination method depends on the goals of the specific ensemble. In many cases we do not know the true distribution of the data and can only learn from part of it; in addition, the algorithm itself may be unable to capture all the information in the data. The errors caused by these problems are usually decomposed into two parts: bias and variance.

Variance: the variability of the algorithm's output around its expected output, i.e., how much the model's results fluctuate with the particular data it is given.

Bias: the difference between the expected prediction and the true value. It is defined even though, in outlier detection, no ground-truth labels are usually available to measure it directly.
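For reference, the standard decomposition of expected squared error (a general statistical fact, not specific to outlier detection) is

expected error = bias^2 + variance + irreducible noise,

and ensembles such as feature bagging mainly attack the variance term, since the bias term can hardly even be measured without ground-truth labels.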

3. Isolation Forest

Isolation Forest (iForest) is an anomaly detection algorithm proposed in 2008 by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou; it is one of the few machine learning algorithms designed specifically for anomaly detection. The method is fast, handles high-dimensional and massive data effectively, and does not require labeled samples, so it is widely used in industry.

Isolation Forest is a non-parametric, unsupervised algorithm: it requires neither a mathematical model of the data nor labeled training data. Its strategy for finding outliers is very efficient. Suppose we cut the data space with a random hyperplane; each cut splits a space into two subspaces. We then keep cutting each subspace with random hyperplanes until every subspace contains only one data point. Intuitively, dense clusters need many cuts before their points are separated, while points in low-density regions end up alone in a subspace very quickly. Isolation Forest treats the points that are isolated quickly as anomalies.

A simple intuition with four samples: d is the first point to be isolated, so d is the most likely to be anomalous.

[Figure: isolating four points; point d is separated first. Source: https://github.com/datawhalechina/team-learning-data-mining/raw/master/AnomalyDetection/img/v2-bb94bcf07ced88315d0a5de47677200e_720w.png]

How to cut the data space is the core idea of Isolation Forest. Because the cuts are random, an ensemble is used to make the result reliable: the cutting is repeated from scratch many times and the results are averaged. An isolation forest consists of t isolation trees, each of which is a random binary tree, meaning every node either has exactly two children or is a leaf. The tree construction is somewhat similar to that of a random forest and proceeds as follows:

  1. Randomly select a subsample of the training data and place it at the root of the tree;

  2. Randomly pick an attribute A and randomly generate a split value V, i.e., some number between the minimum and maximum of attribute A;

  3. Partition the samples on attribute A: samples with A less than V go to the left child of the current node, samples with A greater than or equal to V go to the right child, forming two subspaces;

  4. Recursively apply steps 2 and 3 in the child nodes, continually building left and right children, until a child node contains only one data point or the tree reaches a height limit.
    

Once t trees have been built, training of the isolation forest is finished, and the forest can be used to evaluate test data.
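A minimal sketch of this construction (an illustration of the steps above using a dictionary-based tree, not the PyOD or scikit-learn implementation):

import numpy as np

def build_itree(X, height, height_limit, rng):
    """Recursively build one isolation tree on the sample X (rows are data points)."""
    n = X.shape[0]
    if n <= 1 or height >= height_limit:
        return {"type": "leaf", "size": n}
    q = int(rng.integers(X.shape[1]))               # step 2: random attribute A
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                                    # constant column, nothing to split
        return {"type": "leaf", "size": n}
    v = rng.uniform(lo, hi)                         # step 2: random split value V in (min, max)
    left, right = X[X[:, q] < v], X[X[:, q] >= v]   # step 3: partition into two subspaces
    return {"type": "node", "attr": q, "split": v,  # step 4: recurse on both children
            "left": build_itree(left, height + 1, height_limit, rng),
            "right": build_itree(right, height + 1, height_limit, rng)}

def build_iforest(X, n_trees=100, sample_size=256, random_state=0):
    """Step 1 for each tree: draw a random subsample, then grow a height-limited tree."""
    rng = np.random.default_rng(random_state)
    size = min(sample_size, X.shape[0])
    height_limit = int(np.ceil(np.log2(size)))      # expected height of a tree on `size` points
    return [build_itree(X[rng.choice(X.shape[0], size=size, replace=False)],
                        0, height_limit, rng)
            for _ in range(n_trees)]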

The assumption behind anomaly detection with an isolation forest is that anomalies are rare and are therefore quickly pushed into leaf nodes close to the root, so the path length from a leaf to the root can be used to judge whether a record is anomalous. Like a random forest, an isolation forest averages the results of all its trees to form the final output, and the training samples of each tree are drawn at random. Building the trees requires no sample labels; a threshold on the score decides whether a sample is flagged as anomalous. Because anomalous points have short paths and normal points have long paths, the isolation forest estimates each point's abnormality from its path length.

Path length calculation method:

[Figure: path length and anomaly score formulas. Source: https://github.com/datawhalechina/team-learning-data-mining/raw/master/AnomalyDetection/img/image-20210103183909407.png]
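The formulas in the missing figure are, in all likelihood, the scoring formulas from the original iForest paper [2]. With n the subsample size and h(x) the path length of point x in one tree, the average path length of an unsuccessful search in a binary search tree,

c(n) = 2 H(n - 1) - 2 (n - 1) / n, where H(i) ≈ ln(i) + 0.5772156649 (Euler's constant),

is used to normalize path lengths, and the anomaly score is

s(x, n) = 2^( -E[h(x)] / c(n) ),

where E[h(x)] is the average path length of x over all trees. Scores close to 1 indicate anomalies, scores well below 0.5 indicate normal points, and scores near 0.5 for the whole sample mean there are no clearly distinct anomalies.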

Isolation Forest is also a subspace-based method: different branches correspond to different local subspace regions of the data, and a shorter path corresponds to a lower-dimensional subspace in which the point is isolated.

4. Summary

1. Feature bagging can reduce variance.

2. The advantages of Isolation Forest are:

  • Lower computational cost than distance-based or density-based algorithms.
  • Linear time complexity.
  • Works well on large data sets.

Isolation Forest is not well suited to very high-dimensional data: each split selects a dimension at random, so when there are many dimensions, a large fraction of the splits land on uninformative (noisy) dimensions.

5. Practice

1. Use the PyOD library to generate a toy example and call Feature Bagging

from pyod.models.feature_bagging import FeatureBagging
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

contamination = 0.1  # percentage of outliers
n_train = 200  # number of training points
n_test = 100  # number of testing points

# Generate sample data
X_train, y_train, X_test, y_test = \
    generate_data(n_train=n_train,
                  n_test=n_test,
                  n_features=2,
                  contamination=contamination,
                  random_state=42)

# train Feature Bagging detector
clf_name = 'FeatureBagging'
clf = FeatureBagging(check_estimator=False)
clf.fit(X_train)

# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

# get the prediction on the test data
y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test)  # outlier scores

# evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)

# visualize the results
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
          y_test_pred, show_figure=True, save_figure=False)

2. Use the PyOD library to generate a toy example and call Isolation Forest

from pyod.models.iforest import IForest
from pyod.utils.data import generate_data

from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize


contamination = 0.1  # percentage of outliers
n_train = 200  # number of training points
n_test = 100  # number of testing points

# Generate sample data
X_train, y_train, X_test, y_test = \
    generate_data(n_train=n_train,
                  n_test=n_test,
                  n_features=2,
                  contamination=contamination,
                  random_state=42)

# train IForest detector
clf_name = 'IForest'
clf = IForest()
clf.fit(X_train)

# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

# get the prediction on the test data
y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test)  # outlier scores

# evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)

# visualize the results
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
          y_test_pred, show_figure=True, save_figure=False)


3. (Question: Why can feature bagging reduce variance?)

Averaging the scores of multiple base detectors built on random feature subsets smooths out the fluctuations of any single detector: as long as the detectors' errors are not perfectly correlated, the combined score varies less across resamplings of the data, so the variance is reduced.
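A standard way to see this (not from the original text): if each of m base detectors produces a score with variance sigma^2 and the scores have average pairwise correlation rho, then

Var(average of the m scores) = rho * sigma^2 + (1 - rho) * sigma^2 / m,

so averaging shrinks the second term toward zero; the gain is largest when the base detectors are weakly correlated, which is exactly what running them on different random feature subsets encourages.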

4. (Thinking question: What are the defects of feature bagging, and what ideas can be optimized?)

Feature bagging must train and store many base detectors, which costs a lot of time and memory, and the random subspaces are chosen blindly, so many of them may be uninformative. Possible optimizations include training the base detectors in parallel and selecting subspaces more deliberately instead of purely at random.

6. References

[1] Goldstein, M. and Dengel, A., 2012. Histogram-based Outlier Score (HBOS): A fast unsupervised anomaly detection algorithm. KI-2012: Poster and Demo Track, pp. 59-63.

[2] Liu, F.T., Ting, K.M. and Zhou, Z.-H., 2008. Isolation Forest. ICDM 2008. https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf

[3] Aggarwal, C.C. Outlier Analysis. Springer.

