Anomaly Detection: Exploring the Mysteries Behind the Deep Layers of Data "Part 2" --- Anomaly Detection in High-Dimensional Data: Isolation Forest

In real-world scenarios, many data sets are high-dimensional. As the dimensionality increases, the volume of the data space grows exponentially, making the data sparse. This is the curse of dimensionality. The curse of dimensionality not only challenges anomaly detection but also complicates distance computation and clustering. For example, proximity-based methods use distance functions over all dimensions to define locality. In high-dimensional space, however, the distances between almost all pairs of points become nearly equal (distance concentration), which renders some distance-based methods ineffective. In high-dimensional settings, a commonly used remedy is the subspace method.
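Distance concentration is easy to observe empirically. The short sketch below (illustrative only; the `relative_contrast` helper is an assumed name, not from the original post) measures the relative contrast (max − min) / min of point-to-origin distances for uniformly distributed data. As the dimension grows, the contrast shrinks toward zero, i.e. all points look roughly equidistant:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=1000):
    """Relative contrast (max - min) / min of point-to-origin
    distances for n_points drawn uniformly from the unit hypercube."""
    points = rng.uniform(size=(n_points, dim))
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

The printed contrast drops by orders of magnitude between dimension 2 and dimension 1000, which is exactly why nearest-neighbor distances lose discriminative power in high dimensions.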

Ensembling is one of the techniques commonly used with the subspace idea, and it can effectively improve the accuracy of data mining algorithms. Ensemble methods combine the outputs of multiple algorithms or multiple base detectors. The basic idea is that some algorithms perform well on certain subsets of the data while others perform well on other subsets; combining their outputs makes the overall result more robust. Ensemble methods have a natural affinity with subspace-based methods: subspaces are associated with different sets of dimensions, and ensemble methods use base detectors to explore those different dimension subsets and then aggregate the base learners.
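The aggregation step can be sketched in a few lines. The snippet below (a minimal illustration; the detector outputs and the `average_scores` helper are invented for this example) z-score normalizes each base detector's anomaly scores so they live on a common scale, then averages them:

```python
import numpy as np

def average_scores(score_lists):
    """Combine anomaly scores from several base detectors:
    z-score normalize each detector's output, then average."""
    combined = np.zeros(len(score_lists[0]))
    for scores in score_lists:
        s = np.asarray(scores, dtype=float)
        s = (s - s.mean()) / (s.std() + 1e-12)  # put detectors on one scale
        combined += s
    return combined / len(score_lists)

# Three hypothetical detectors scoring five points; all three flag point 3.
detector_outputs = [
    [0.1, 0.2, 0.1, 0.9, 0.2],
    [1.0, 1.1, 0.9, 3.5, 1.0],
    [5.0, 4.0, 6.0, 20.0, 5.0],
]
scores = average_scores(detector_outputs)
print(scores.argmax())  # -> 3, the point every detector agrees is anomalous
```

Normalization before averaging matters because raw scores from different detectors (e.g. LOF values versus raw distances) are not directly comparable.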

Two common ensemble methods are introduced below:

1. Feature Bagging

The basic idea of feature bagging is similar to ordinary bagging, except that it samples features rather than data points. Feature bagging is a type of ensemble method. Designing the ensemble involves the following two main steps:

1. Select a base detector. These base detectors can be completely different from each other, use different parameter settings, or be trained on differently sampled sub-datasets. Feature bagging commonly uses the LOF algorithm as the base detector. The following figure shows the general feature bagging algorithm:
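As a rough sketch of the loop the algorithm describes, the code below runs several rounds, each on a random feature subspace containing between d/2 and d − 1 of the original features, and averages the per-round scores. To stay self-contained it substitutes a simple k-NN-distance score for LOF; the function names and parameters are illustrative, not from the original post:

```python
import numpy as np

def knn_score(X, k=5):
    """Anomaly score = distance to the k-th nearest neighbor
    (a simple stand-in for LOF, which feature bagging typically uses)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero distance to self

def feature_bagging(X, n_rounds=10, k=5, seed=0):
    """Feature-bagging sketch: each round scores the data on a random
    feature subspace of size in [d/2, d-1], then rounds are averaged."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    total = np.zeros(n)
    for _ in range(n_rounds):
        m = rng.integers(d // 2, d)               # subspace size
        feats = rng.choice(d, size=m, replace=False)
        total += knn_score(X[:, feats], k)
    return total / n_rounds

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
X[0] = 8.0  # plant one obvious outlier far from the Gaussian cloud
print(feature_bagging(X).argmax())  # -> 0, the planted outlier
```

Because the outlier is extreme in every dimension, it scores highest in whichever subspace a round happens to draw, so averaging across rounds keeps it on top while damping noise from any single subspace.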


Origin blog.csdn.net/sinat_39620217/article/details/133268081