Unsupervised Outlier Detection Using Isolation Forests

Isolation Forest is a simple yet very effective algorithm for finding outliers in a dataset quickly. Understanding this algorithm is a must for data scientists working with tabular data, so in this article I will briefly introduce the theory behind the algorithm and its implementation.

Because the algorithm is simple and efficient, Scikit-Learn provides an efficient implementation that we can use directly. But before jumping straight into the example, let's first go over the theory behind it so that you can gain a deeper understanding of the algorithm.

Some theory

1. What is an anomaly?

Anomalies (outliers) can be described as data points in a dataset that differ significantly from the other observations. There are several reasons why this can happen:

  • Outliers may indicate incorrect data or the experiment may not have been run correctly.

  • Outliers may be due to random variation or may indicate something scientifically interesting.


2. Why anomaly detection?

We want to find and drill down into anomalies because these data points either waste time and effort or allow us to identify something meaningful.

In the case of simple linear regression, for example, false outliers increase the variance of the model and reduce its ability to fit the data: outliers pull the fitted line toward themselves, so regression models (especially linear ones) learn a biased picture of the data.

How isolation forests work

Most other approaches try to build a profile of normal data (its distribution, regularity, etc.) and then flag the data points that do not fit this profile as anomalies.

The highlight of the isolation forest is that it uses the property of "isolation" to detect anomalies directly, i.e. how easily a data point can be separated from the rest of the data, instead of computing distances or densities. As a result, the algorithm runs in linear time with a low constant, unlike distance- and density-based models such as k-Nearest Neighbors.

The algorithm centers on the two most obvious characteristics of outliers:

  • there are only a few of them

  • their values are clearly different from those of the other samples

Isolation forests are built from an ensemble of binary trees that recursively generate partitions by randomly choosing a feature and then randomly choosing a split value for that feature. The partitioning continues until every data point is separated from the rest of the sample.

Because each split considers only a single, randomly chosen feature, the base estimator of an isolation forest is essentially an extremely randomized tree (ExtraTrees) built on a random subset of the data.

An example of a tree in an isolation forest is as follows:

[Figure: an example isolation tree]

Looking at the properties of outliers above, we can observe that outliers require fewer random splits on average to be isolated than normal samples do. After a number of rounds, every data point receives a score based on how easily it was isolated, and the points with anomalous scores are marked as anomalies.
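To make this concrete, below is a minimal sketch of how a single random tree could isolate one point and return its path length. This is only an illustration under simplified assumptions, not sklearn's implementation, and the function name isolation_path_length is invented for this example.

import numpy as np

rng = np.random.default_rng(42)

def isolation_path_length(x, X, depth=0, max_depth=50):
    # Recursively partition X with random splits, keeping only the half
    # that still contains x, until x is alone; return the number of splits.
    if len(X) <= 1 or depth >= max_depth:
        return depth
    q = rng.integers(X.shape[1])           # randomly chosen feature
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                           # feature is constant, cannot split
        return depth
    p = rng.uniform(lo, hi)                # randomly chosen split value
    keep = X[:, q] < p if x[q] < p else X[:, q] >= p
    return isolation_path_length(x, X[keep], depth + 1, max_depth)

On average, an outlier ends up alone after only a few such splits, while a point deep inside a cluster needs many more.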

Each data instance is split recursively by randomly choosing an attribute q and a split value p (within the min-max range of attribute q) until it is completely isolated. The algorithm then provides a ranking reflecting how unusual each data instance is, based on its path length. This ranking, or score, is called the anomaly score and is calculated from the following quantities:

  • h(x): the number of splits (the path length) before data instance x is completely isolated.

  • E[h(x)]: the average of h(x) over the collection of isolation trees.

These quantities make sense, but there is one problem: the maximum possible path length of a tree grows on the order of n, while the average path length grows only on the order of log n. This makes path lengths from samples of different sizes not directly comparable, so a normalization constant c(n) that depends on n needs to be introduced. It is called the path-length normalization constant and is given by:

c(n) = 2H(n − 1) − 2(n − 1) / n

where H(i) is the harmonic number, which can be estimated by ln(i) + 0.5772156649 (the Euler–Mascheroni constant).

The full formula for the anomaly score is:

s(x, n) = 2^(−E[h(x)] / c(n))

So, if you run the entire dataset through the isolation forest, you obtain an anomaly score for every point. Using the anomaly score s, we can infer an anomaly whenever an instance's score is very close to 1, while scores much smaller than 0.5 can safely be regarded as normal instances.
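As a quick illustration of these formulas, here is a small sketch that computes the normalization constant and the anomaly score; the function names c and anomaly_score are just for this example.

import numpy as np

EULER_GAMMA = 0.5772156649  # Euler–Mascheroni constant

def c(n):
    # Path-length normalization constant: c(n) = 2*H(n-1) - 2*(n-1)/n
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + EULER_GAMMA  # H(n-1) ≈ ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(expected_path_length, n):
    # s(x, n) = 2 ** (-E[h(x)] / c(n)); values close to 1 indicate anomalies
    return 2.0 ** (-expected_path_length / c(n))

print(anomaly_score(3, 256))   # ~0.82: isolated quickly -> likely an anomaly
print(anomaly_score(20, 256))  # ~0.26: needs many splits -> likely normal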

Another note: in sklearn's implementation, the score is the opposite of the anomaly score defined in the original paper, shifted by a constant of 0.5 (the default offset). This makes anomalies easy to identify by sign alone: negative scores mark anomalies. You can refer to the sklearn documentation for details.

Example of an isolation forest

First, we quickly import some useful packages and use the make_blobs() function to generate a dataset of random data points.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 500 points in a single Gaussian cluster centered at the origin
data, _ = make_blobs(n_samples=500, centers=1, cluster_std=2, center_box=(0, 0))
plt.scatter(data[:, 0], data[:, 1])
plt.show()

[Figure: scatter plot of the generated data points]

Some outliers can easily be spotted in the plot above. We use a two-dimensional example here to quickly demonstrate how well the algorithm works, but it can be applied just as well to datasets with many more features.

Let's initialize an isolation forest object by calling IsolationForest().

The hyperparameters used here are mostly the defaults, which are the values recommended by the original paper:

The number of trees controls the size of the ensemble. The path length usually converges before t = 100. Unless otherwise stated, we will use t = 100 as the default value in our experiments.

A subsampling size of 256 usually provides enough detail to perform anomaly detection across a wide range of data.

n_estimators is the number of trees, and max_samples is the size of the subsample used to build each tree.

max_samples='auto' sets the subsample size to min(256, n_samples).

The contamination parameter represents the proportion of outliers in the dataset. By default, the anomaly score threshold follows the original paper, but if we have prior knowledge we can manually set the expected proportion of outliers in the data; it is set to 0.03 in this article.

Fitting and predicting on the entire dataset returns an array of -1s and 1s, where -1 marks an anomaly and 1 a normal instance.

from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=100, contamination=0.03, max_samples='auto')
prediction = iforest.fit_predict(data)

print(prediction[:20])
print("Number of outliers detected: {}".format((prediction < 0).sum()))
print("Number of normal samples detected: {}".format((prediction > 0).sum()))


Then we will plot every detected outlier.

normal_data = data[np.where(prediction > 0)]
outliers = data[np.where(prediction < 0)]
plt.scatter(normal_data[:, 0], normal_data[:, 1])
plt.scatter(outliers[:, 0], outliers[:, 1])
plt.title("Random data points with outliers identified.")
plt.show()

[Figure: random data points with outliers identified]

You can see that it works well, flagging the data points around the edges of the cluster.

decision_function() can also be called to compute an anomaly score for each data point, which lets us see how anomalous each point is.

score = iforest.decision_function(data)
data_scores = pd.DataFrame(list(zip(data[:, 0],data[:, 1],score)),columns = ['X','Y','Anomaly Score'])

display(data_scores.head())
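As noted earlier, sklearn's scores are tied to the paper's anomaly score: score_samples() returns the opposite of the paper's score, and decision_function() is score_samples() shifted by the fitted offset_ attribute. A small check, assuming the iforest model fitted above:

# score_samples() is the opposite of the anomaly score from the original paper,
# and decision_function() is score_samples() shifted by the fitted offset_
paper_scores = -iforest.score_samples(data)  # in (0, 1]; close to 1 means anomalous
print(np.allclose(iforest.decision_function(data),
                  iforest.score_samples(data) - iforest.offset_))  # True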


Select the top 5 outliers by anomaly score (the five lowest scores) and plot them again.

top_5_outliers = data_scores.sort_values(by = ['Anomaly Score']).head()
plt.scatter(data[:, 0], data[:, 1])
plt.scatter(top_5_outliers['X'], top_5_outliers['Y'])
plt.title("Random data points with only 5 outliers identified.")
plt.show()

[Figure: random data points with only the top 5 outliers identified]

Summary

Isolation forest is a fundamentally different kind of outlier detection model that can find anomalies extremely fast. Its linear time complexity makes it one of the best options for handling large datasets.

It is based on the idea that anomalies are "few and different", so they are easier to isolate than normal points. Its Python implementation can be found in sklearn.ensemble.IsolationForest.

Paper: Isolation Forest, by Fei Tony Liu and Kai Ming Ting (Gippsland School of Information Technology, Monash University, Victoria, Australia) and Zhi-Hua Zhou (Nanjing University), ICDM 2008.

https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest

Author: Yenwee Lim

From: deephub-imba
