Anomaly Detection Task05: High-Dimensional Anomaly Detection

Main content:

  • Feature Bagging
  • Isolation Forest

1. What is special about anomaly detection in high-dimensional data

Real-world data sets often have many dimensions. On the one hand, more dimensions make the data grow rapidly in size; on the other hand, the data becomes sparse. This phenomenon is known as the curse of dimensionality. The curse of dimensionality causes problems for distance computation, which in turn makes clustering-based methods difficult to apply.
Distance computation is hit especially hard: in high-dimensional data the distances between all pairs of points become almost equal, so judging outliers by distance may become meaningless.
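To see this concretely, here is a small sketch (NumPy/SciPy, not part of the original notes; the point counts and dimensions are illustrative) that compares pairwise distances of random points at different dimensionalities. As the dimension grows, the ratio between the largest and smallest pairwise distance shrinks toward 1.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    X = rng.random((500, dim))          # 500 uniformly random points
    d = pdist(X)                        # all pairwise Euclidean distances
    print(f"dim={dim:5d}  max/min distance ratio = {d.max() / d.min():.2f}")
```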

2. High-dimensional data anomaly detection ideas

1. Reduce the dimensionality with principal component analysis, and use the dimensions with low eigenvalues for outlier judgment (this belongs to the linear-methods part of the material).
2. Use the ensemble idea. Ensemble methods combine the outputs of multiple algorithms or multiple base detectors. **The basic idea is that some algorithms perform well on some subsets of the data while other algorithms perform well on other subsets; combining them makes the output more robust.** Ensemble methods have a natural similarity to subspace-based methods: each subspace corresponds to a different set of dimensions (obtained in practice by bootstrap-style sampling), and the ensemble uses base detectors to explore subsets of different dimensions and then aggregates these base detectors.

3. Ensemble method 1: Feature Bagging

3.1 The meaning of bagging

The term bagging is short for bootstrap aggregating, and refers to building an ensemble of models from resampled data.

The bootstrap method draws repeated samples with replacement from the original sample, and recomputes the statistic or model on each resample.

The bootstrap can also be applied to multivariate data, with data rows as the sampling unit. A model can be run on each bootstrap sample to estimate the stability (or variability) of the model parameters, or to improve predictive performance. For example, we can fit classification and regression trees (decision trees) on many bootstrap samples and average the predictions of the trees (or, for classification, take a majority vote). This usually predicts better than a single tree. This procedure is called bagging.
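As a small illustration (a minimal NumPy sketch, not from the original notes; the sample sizes are arbitrary), here is the bootstrap applied to a single statistic:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # the original sample

boot_means = []
for _ in range(1000):
    # draw a resample of the same size, with replacement
    resample = rng.choice(data, size=len(data), replace=True)
    boot_means.append(resample.mean())

print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap standard error of the mean:", np.std(boot_means))
```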

3.2 Feature Bagging

The basic idea of Feature Bagging is similar to bagging, except that the object being sampled is the features: only variables are sampled, not records. Feature Bagging resamples the features of the data to obtain multiple data sets, and then trains a set of (multiple) models on these data sets (much like a random forest). A minimal sketch is given below.
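The sketch below uses scikit-learn's LocalOutlierFactor as the base detector; the toy data, the number of rounds, and the feature-subset sizes are illustrative assumptions, not the exact procedure of the original Feature Bagging paper.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Toy data: a Gaussian cloud plus a few scattered points in 10 dimensions.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 10)),
               rng.uniform(-6.0, 6.0, size=(10, 10))])

d = X.shape[1]
n_rounds = 10
scores = np.zeros(len(X))

for _ in range(n_rounds):
    # Each round uses a random subset of roughly d/2 to d features.
    k = rng.integers(d // 2, d + 1)
    cols = rng.choice(d, size=k, replace=False)
    lof = LocalOutlierFactor(n_neighbors=20).fit(X[:, cols])
    # negative_outlier_factor_ is negative; flip the sign so larger = more abnormal.
    scores += -lof.negative_outlier_factor_

scores /= n_rounds  # averaging combination
print("10 most abnormal indices:", np.argsort(scores)[-10:])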

There are two main steps in designing an ensemble method:

1. Choose a base detector or model

These base detectors can be completely different from each other, use different parameter settings, or use different sampled sub-data sets. Feature bagging commonly uses the LOF algorithm as the base detector. The figure below outlines the general feature bagging algorithm:
[Figure: pseudocode of the general feature bagging algorithm]
2. Score standardization and combination method:

Different detectors may produce scores on different scales. For example, an average k-nearest-neighbor detector outputs raw distance scores, while the LOF algorithm outputs normalized values. In addition, although most detectors output larger scores for outliers, some detectors output smaller scores for outliers. Therefore the scores from the various detectors must be converted into normalized values that can be combined meaningfully. After standardization, a combination function is chosen to combine the scores of the different base detectors; the most common choices are averaging and taking the maximum.
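A small sketch of standardization and combination (toy numbers; z-score standardization is assumed here, real implementations may use other normalization schemes):

```python
import numpy as np

# Toy score matrix: one row per sample, one column per base detector.
# Detector 0 outputs small distance-like scores, detector 1 outputs large ones.
scores = np.array([[0.20, 12.0],
                   [0.30, 15.0],
                   [0.25, 13.0],
                   [2.50, 80.0]])   # the last sample looks abnormal to both detectors

# Z-score standardization per detector so the columns become comparable.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

combined_avg = z.mean(axis=1)   # averaging combination
combined_max = z.max(axis=1)    # maximization combination
print("average:", combined_avg)
print("maximum:", combined_max)
```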

The following are two different combination methods used in feature bagging:
1) Breadth first:
[Figure: breadth-first combination algorithm]
2) Cumulative sum:
[Figure: cumulative-sum combination algorithm]
The design of the base detectors and of their combination depends on the specific goal of the ensemble method. In many cases we do not know the true distribution of the data and can only learn from part of it. In addition, the algorithm itself may have limitations that prevent it from learning the complete information in the data. The errors caused by these problems are usually divided into two types: bias and variance.

Variance: the deviation of the algorithm's output from its expected output; it describes the dispersion of the model, i.e. its sensitivity to fluctuations in the data.

Bias: the difference between the predicted value and the true value, even though in outlier detection there is usually no ground truth available.

4. Ensemble method 2: Isolation Forest

The Isolation Forest algorithm is an anomaly detection algorithm proposed by Professor Zhou Zhihua and collaborators in 2008. It is one of the few machine learning algorithms designed specifically for anomaly detection. The method is time-efficient, handles high-dimensional and massive data effectively, and requires no labeled samples, so it is widely used in industry.
The idea behind tree algorithms is very simple: the conditional branch structure in programming is the if-then structure, and the earliest decision trees are classification methods that use this structure to split data, separating the samples that satisfy a condition from those that do not, one condition at a time.
Isolation Forest is a non-parametric, unsupervised algorithm: it requires neither a mathematical model to be specified nor labels on the training data, and it is very efficient.

Example: we cut the data space with a random hyperplane, which produces two subspaces; we then cut each subspace with another random hyperplane, and keep looping until every subspace contains only one data point. Intuitively, points in high-density clusters need many cuts before they are separated, while low-density points are quickly isolated into a subspace of their own.
Isolation Forest treats these quickly isolated points as anomalies.
Using four samples for a simple, intuitive illustration: d is the first point to be isolated, so d is the most likely to be abnormal.
[Figure: toy example with four samples, in which d is isolated first]
How to cut the data space is the core idea of Isolation Forest. Because the cuts are random, an ensemble is used to make the result reliable: the cutting is repeated from scratch many times and the results are averaged to obtain a converged value. An isolation forest consists of t isolation trees, each of which is a random binary tree, meaning every node either has exactly two children or is a leaf. The tree construction is somewhat similar to that of a random forest and proceeds as follows (a minimal sketch follows the steps):

1. Randomly select a sample subset from the training data and place it in the root node of the tree;

2. Randomly choose an attribute A and randomly generate a cut point V, a value between the minimum and maximum of attribute A;

3. Partition the samples by attribute A: samples with A less than V go to the left child of the current node, samples with A greater than or equal to V go to the right child, forming two subspaces;

4. Recurse steps 2 and 3 in the child nodes, continuing to build left and right children, until a child node contains only one data point or the tree reaches its height limit.
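A minimal sketch of the four steps above in plain Python/NumPy (the sub-sample size and height limit follow common conventions; the function and field names are illustrative, not from any library):

```python
import numpy as np

def build_itree(X, height, height_limit, rng):
    """Recursively build one isolation tree following steps 1-4 above."""
    n = len(X)
    # Stop when the node holds a single point or the height limit is reached.
    if n <= 1 or height >= height_limit:
        return {"size": n}
    a = rng.integers(X.shape[1])              # step 2: pick a random attribute A
    lo, hi = X[:, a].min(), X[:, a].max()
    if lo == hi:                              # cannot split a constant column
        return {"size": n}
    v = rng.uniform(lo, hi)                   # step 2: random cut point V in (min, max)
    left = X[X[:, a] < v]                     # step 3: A < V goes to the left child
    right = X[X[:, a] >= v]                   # step 3: A >= V goes to the right child
    return {"attr": a, "value": v,            # step 4: recurse into both children
            "left": build_itree(left, height + 1, height_limit, rng),
            "right": build_itree(right, height + 1, height_limit, rng)}

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))                 # step 1: a random sample of 256 points
tree = build_itree(X, height=0, height_limit=int(np.ceil(np.log2(256))), rng=rng)
```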

After the t trees are built, training of the isolation forest is finished, and the resulting forest can be used to evaluate test data.

The hypothesis behind anomaly detection with an isolation forest is that anomalous points are rare and are quickly separated into leaf nodes, so the path length from a leaf back to the root can be used to judge whether a record is anomalous. Like a random forest, an isolation forest uses the average over all constructed trees as the final result. During training, the samples for each tree are drawn at random. The tree-building process shows that no sample labels are needed; a threshold on the score decides whether a sample is anomalous. Because anomalous points have short paths and normal points have long paths, the isolation forest estimates the abnormality of each sample from its path length.
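For reference, the anomaly score from the original Isolation Forest paper can be written as a short sketch: s(x, n) = 2^(-E[h(x)] / c(n)), where h(x) is the path length and c(n) normalizes it by the average path length of an unsuccessful binary-search-tree search. The example numbers below are illustrative.

```python
import numpy as np

def c(n):
    """Average path length of an unsuccessful search in a binary search tree of n points."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649   # harmonic number H(n-1), Euler-Mascheroni approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(expected_path_length, n_samples):
    # Scores close to 1 indicate a likely anomaly; scores around 0.5 or below look normal.
    return 2.0 ** (-expected_path_length / c(n_samples))

# A point isolated after ~3 splits versus ~12 splits, in trees built from 256-point samples.
print(anomaly_score(3, 256), anomaly_score(12, 256))
```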
An isolation forest can also be viewed as a subspace-based method: different branches correspond to different local subspace regions of the data, and a shorter path corresponds to a lower-dimensional subspace in which the point is isolated.

5. Summary

1. Advantages of feature bagging: it can reduce variance, can use the most suitable algorithm for different combinations of dimensions, and, by weighting the base detectors, obtains a better overall result.
2. Advantages of isolation forest: its computational cost is lower than that of distance-based or density-based algorithms; it has linear time complexity; and it handles large data sets well.
A limitation: isolation forest is not well suited to extremely high-dimensional data, because a dimension is chosen at random at every split, and with too many dimensions a great deal of noise is introduced.

6. Practice

6.1 Feature Bagging: use the PyOD library to generate a toy example and call feature bagging.
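A minimal sketch (assumptions: the installed PyOD version still provides pyod.models.feature_bagging, and generate_data with train_only=True returns (X, y)):

```python
from pyod.models.feature_bagging import FeatureBagging  # assumed available in this PyOD version
from pyod.models.lof import LOF
from pyod.utils.data import generate_data
from sklearn.metrics import roc_auc_score

# Toy data with 10% outliers; train_only=True is assumed to return (X, y).
X, y = generate_data(n_train=500, n_features=10, contamination=0.1,
                     train_only=True, random_state=42)

clf = FeatureBagging(base_estimator=LOF(n_neighbors=20), n_estimators=10,
                     contamination=0.1, random_state=42)
clf.fit(X)

# decision_scores_ holds the combined outlier scores on the training data.
print("ROC-AUC:", roc_auc_score(y, clf.decision_scores_))
```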
6.2 Isolation Forest: use the PyOD library to generate a toy example and call Isolation Forest.
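A minimal sketch with PyOD's IForest wrapper (same assumptions about generate_data as above; IForest wraps scikit-learn's IsolationForest):

```python
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data
from sklearn.metrics import roc_auc_score

X, y = generate_data(n_train=500, n_features=10, contamination=0.1,
                     train_only=True, random_state=42)

clf = IForest(n_estimators=100, contamination=0.1, random_state=42)
clf.fit(X)

print("first ten labels (1 = outlier):", clf.labels_[:10])
print("ROC-AUC:", roc_auc_score(y, clf.decision_scores_))
```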
6.3 Thinking: why can feature bagging reduce variance?
Variance measures the deviation of a model's output from its expected output. Feature bagging can use the most suitable algorithm for each combination of dimensions and combines (e.g. by weighted averaging) the scores of many base detectors trained on different feature subsets; the combined output fluctuates less than any single model's output, so the variance is reduced.
6.4 Thinking: what are the defects of feature bagging, and what ideas could improve it?
The base detector is not guaranteed to be the best choice, and the bootstrap procedure means the algorithm must be run many times, which consumes more computing resources and time.
One possible improvement: split the data into a training set and a test set first, use them to find a base detection algorithm suited to the data at hand, and only then carry out the bootstrap feature-sampling steps.

Source: blog.csdn.net/qq_43720646/article/details/113100996