Python Machine Learning 19 - Six commonly used machine learning outlier detection methods (Isolation Forest, Support Vector Data Description, autoencoder, Gaussian mixture, DBSCAN, LOF)

Case background

Outlier detection is an important field of machine learning. In the past, this blog did more prediction and less outlier detection. However, future work may involve outliers, so here I briefly summarize the commonly used machine learning methods for outlier detection, together with their code.

The machine learning methods in the title can basically all be implemented with the sklearn library, so there is no need to install a pile of extra packages.

(I won't introduce the traditional statistical methods in detail, such as the three-sigma rule, the t-test, the 95% quantile and so on; those are too simple. This article mainly introduces machine learning methods.)
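For completeness, here is a minimal sketch of the three-sigma rule on a one-dimensional series (the data and variable names are just an illustration, not part of the case below):

import numpy as np
x = np.random.randn(1000)
mu, sigma = x.mean(), x.std()
# Flag points more than 3 standard deviations from the mean
outlier_mask = np.abs(x - mu) > 3 * sigma
print(outlier_mask.sum(), 'outliers flagged')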

Introduction to method ideas

Generally speaking, the simplest idea is binary classification with supervised learning: put the outliers in one class and the normal values in another, and then treat it as an ordinary machine learning classification problem.

However, outliers usually appear rarely in a sample, so classification easily runs into class imbalance. The imbalance here is extreme and cannot be fixed simply with data sampling or data augmentation. Moreover, in most cases the data has no labels: there is no response variable telling you which values are normal, so supervised classification cannot be done at all.

Therefore, outlier detection is mostly unsupervised learning, looking for anomalies in the data itself.

Of course, there is also semi-supervised learning, where the model is told which points are outliers and which are not, and self-supervised learning such as the autoencoder.

But generally speaking, these methods only use the feature variables X during training; no labels are involved.


Code

Generate simulation data

We first import the package and then generate a simulated data set containing outliers.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score
# Generate data: normal points and outliers
np.random.seed(77)
X_normal = 0.3 * np.random.randn(800, 2)
X_abnormal = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([X_normal, X_abnormal])
print(X_normal.shape, X_abnormal.shape)

# Standardize the data, fitting the scaler on the normal data only
scaler = StandardScaler()
scaler.fit(X_normal)
X_scaled = scaler.transform(X)

X_normal_scaled = scaler.transform(X_normal)
X_abnormal_scaled = scaler.transform(X_abnormal)
print(X_scaled.shape)
true_labels = np.hstack([np.ones(len(X_normal_scaled)), -1 * np.ones(len(X_abnormal_scaled))])
print(true_labels.shape)

As you can see, 800 points are generated from a normal distribution and 20 points from a uniform distribution as outliers. To compare model performance later, we also record a label for whether each point is an outlier (real data sets generally do not have one). The gap between the normal and abnormal proportions is large, as shown below:

pd.Series(true_labels).value_counts().plot.bar(figsize=(2, 2))

A ratio like this, normal : abnormal = 40 : 1, is close to what real data sets look like.

The code for the outlier detection methods begins below.


Support Vector Data Description (SVDD)

Support Vector Data Description (SVDD) is a machine learning method for anomaly detection. It is based on the ideas of Support Vector Machines (SVM) but is used for unsupervised learning, especially when the data set consists mainly of normal samples. SVDD tries to find the smallest sphere that contains most of the normal samples while excluding the outliers.

In Python, SVDD can be implemented with the OneClassSVM class from the scikit-learn library (with an RBF kernel, One-Class SVM is equivalent to SVDD).

svmd = OneClassSVM(nu=0.025, kernel="rbf", gamma=0.1)
# Fit on the full scaled data set (unsupervised: no labels are used)
svmd.fit(X_scaled)
y_pred = svmd.predict(X_scaled)
print(y_pred.shape)

Above, the model is fitted on the standardized data and then used to predict. Let's look at the accuracy:

X_pred_normal = X[y_pred == 1]
X_pred_abnormal = X[y_pred == -1]
# Correctly and incorrectly predicted points
correctly_predicted = y_pred == true_labels
incorrectly_predicted = ~correctly_predicted
# Calculating accuracy using sklearn's accuracy_score function
accuracy_simple = accuracy_score(true_labels,y_pred)
accuracy_simple

98.9%, the accuracy is quite high.
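With a 40:1 imbalance, accuracy alone can be misleading (predicting everything as normal would already score about 97.6%), so it is worth also looking at per-class precision and recall. A small sketch reusing true_labels and y_pred from above:

from sklearn.metrics import classification_report
# Precision/recall on the minority (-1) class says more than overall accuracy
print(classification_report(true_labels, y_pred, target_names=['abnormal (-1)', 'normal (1)']))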

After a bit of filtering, I draw three plots below: one comparing normal and abnormal points in the original data, one showing the predicted normal and abnormal points, and one showing which points were predicted correctly and which were not.

# Plotting
plt.figure(figsize=(15, 5))
# Original data plot
plt.subplot(1, 3, 1)
plt.scatter(X_normal[:, 0], X_normal[:, 1], color='blue', label='Normal',s=20,marker='x')
plt.scatter(X_abnormal[:, 0], X_abnormal[:, 1], color='red', label='Abnormal',s=20,marker='x')
plt.title("Original Data")
plt.legend()

# Predicted data plot
plt.subplot(1, 3, 2)
plt.scatter(X_pred_normal[:, 0], X_pred_normal[:, 1], color='b', label='Predicted Normal',s=20,marker='x')
plt.scatter(X_pred_abnormal[:, 0], X_pred_abnormal[:, 1], color='r', label='Predicted Abnormal',s=20,marker='x')
plt.title("Predicted Data")
plt.legend()

# Prediction accuracy plot
plt.subplot(1, 3, 3)
plt.scatter(X[correctly_predicted, 0], X[correctly_predicted, 1], color='gold', label='Correctly Predicted',s=20,marker='x')
plt.scatter(X[incorrectly_predicted, 0], X[incorrectly_predicted, 1], color='purple', label='Incorrectly Predicted',s=20,marker='x')
plt.title("Prediction Accuracy")
plt.legend()

plt.show()

The real normal data is the blue cluster in the middle, and the scattered red points around it are the outliers. Yellow means the prediction was correct, purple means it was wrong.

It can be seen that there is no particular pattern to the points SVDD gets wrong; it simply judges some of the surrounding abnormal points as normal. The performance is so-so.
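Besides the hard -1/1 labels, OneClassSVM also exposes a continuous score through its decision_function (negative values lie outside the learned boundary), which is handy for ranking borderline points; a small optional sketch:

# Continuous anomaly scores: the more negative, the further outside the boundary
scores = svmd.decision_function(X_scaled)
print('10 most anomalous indices:', np.argsort(scores)[:10])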


Isolation Forest

Isolation Forest is an effective outlier detection method, especially suitable for high-dimensional data sets, and it differs from traditional density-based or distance-based anomaly detection methods. Here are some key characteristics of isolation forests:

1. Basic principles:

The core idea of an isolation forest is to randomly "isolate" each data point. The method assumes that outliers are easier to isolate because they are few in number and differ significantly from the normal points.

  • Isolation Trees: An isolation forest consists of multiple isolation trees. Each tree recursively partitions the data by randomly selecting a feature and then a random split value for that feature, until every data point is isolated or a predefined tree depth is reached.

  • Path Length: The path length required to isolate a data point is used as the basis for its anomaly score. Outliers tend to be isolated with short path lengths because they are usually far from most points.

2. Advantages:

  • Efficiency: Isolation forests run efficiently on large-scale data sets.
  • Applicability: They remain effective on high-dimensional data sets and do not suffer from the "curse of dimensionality" the way some distance-based methods do.
  • No preset distribution required: Unlike statistics-based methods, Isolation Forest does not require an a priori assumption that the data follows a specific distribution.

3. Application scenarios:

Isolation forests are suitable for a variety of scenarios, especially those where anomaly data are sparse and do not follow any specific distribution. It is widely used in financial fraud detection, network security, fault detection and other fields.

4. Precautions for use:

  • Parameter tuning: An isolation forest's performance may be affected by parameters such as the number of trees and the tree depth.
  • Data size and type: While it performs well on large data sets, performance may drop on data sets containing many repeated values or categorical variables.

After reading this introduction, you can probably all tell it was written by some GPT. Anyway, the code is below.

The same workflow as above: build the model, fit and predict, then plot. I wrap the plotting in a function so the models below can reuse it, avoiding a large duplicated block of plotting code.

It is worth noting that the isolation forest's contamination parameter is the proportion of outliers in the data. For a real data set this is unknown, so you can only give a rough estimate; for the simulated data here we simply pass in the true value.

from sklearn.ensemble import IsolationForest
# Create and fit the Isolation Forest model with the true contamination ratio
iso_forest = IsolationForest(contamination=float(len(X_abnormal)) / len(X))
iso_forest.fit(X_scaled)  # could also be fitted on X_normal_scaled only
y_pred_iso = iso_forest.predict(X_scaled)

def plot_result(y_pred,model_name=''):
    # Splitting the data into predicted normal and abnormal by Isolation Forest
    X_k_normal = X[y_pred == 1]
    X_k_abnormal = X[y_pred == -1]

    # Correctly and incorrectly predicted points by Isolation Forest
    correctly_predicted_k = y_pred == true_labels
    incorrectly_predicted_k = ~correctly_predicted_k
    accuracy_k_simple = accuracy_score(true_labels, y_pred)
    print(accuracy_k_simple)

    plt.figure(figsize=(15, 5))
    # Original data plot
    plt.subplot(1, 3, 1)
    plt.scatter(X_normal[:, 0], X_normal[:, 1], color='blue', label='Normal',s=20,marker='x')
    plt.scatter(X_abnormal[:, 0], X_abnormal[:, 1], color='red', label='Abnormal',s=20,marker='x')
    plt.title("Original Data")
    plt.legend()

    # Predicted data plot (Isolation Forest)
    plt.subplot(1, 3, 2)
    plt.scatter(X_k_normal[:, 0], X_k_normal[:, 1], color='b', label='Predicted Normal',s=20,marker='x')
    plt.scatter(X_k_abnormal[:, 0], X_k_abnormal[:, 1], color='r', label='Predicted Abnormal',s=20,marker='x')
    plt.title(f"Predicted Data ({model_name})")
    plt.legend()

    # Prediction accuracy plot (Isolation Forest)
    plt.subplot(1, 3, 3)
    plt.scatter(X[correctly_predicted_k, 0], X[correctly_predicted_k, 1], color='gold', label='Correctly Predicted',s=20,marker='x')
    plt.scatter(X[incorrectly_predicted_k, 0], X[incorrectly_predicted_k, 1], color='purple', label='Incorrectly Predicted',s=20,marker='x')
    plt.title(f"Prediction Accuracy ({model_name})")
    plt.legend()

    plt.show()
    
plot_result(y_pred_iso,model_name='IsolationForest')

The accuracy is 99.7561%, higher than SVDD, with only two points predicted wrong. And for those two... you can hardly blame the model, because they really are very close to the normal values.
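The isolation forest likewise provides continuous scores via score_samples (lower means shorter average isolation paths, i.e. more anomalous), useful if you would rather rank points than commit to a contamination estimate; a small optional sketch:

# Lower score_samples = isolated by fewer random splits = more anomalous
scores_iso = iso_forest.score_samples(X_scaled)
print('10 most anomalous indices:', np.argsort(scores_iso)[:10])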


Autoencoder

  • Principle: A neural network-based method that detects anomalies by reconstructing the input.
  • Applications: Particularly suitable for capturing complex patterns in high-dimensional data, such as detecting anomalies in images or sequence data.

The autoencoder is actually self-supervised learning: both the input features and the response are X itself. The network compresses the data, then decodes and reconstructs it. You can think of it as melting down a sword and reforging it into another sword. Through this process the model learns reconstruction features from all the samples, and if a point's reconstruction has a large error compared with the original, it may be an outlier.

This setup is a regression problem: the output is not a category. So how do we convert predicted values into categories?

To convert the autoencoder's output into a category judgment (normal or abnormal), we can set a threshold that determines how large the reconstruction error must be before a point counts as an anomaly. A common choice is a certain percentile (e.g. the 95th) of the reconstruction error on normal data; points whose reconstruction error exceeds this threshold are marked as anomalies.

In our implementation, the threshold is set from the reconstruction-error distribution of the normal data: we take the 95th percentile of the normal points' reconstruction error as the threshold and label every point above it as an anomaly. In this way, the autoencoder's continuous output (the reconstruction error) is converted into a binary category (normal/abnormal).

Strictly speaking, autoencoders call for a neural network framework such as TensorFlow or PyTorch. For simplicity I just use the sklearn library here, and it works.

The network encodes with 128 units, compresses down to 32 dimensions, then decodes back through 128 units to reconstruct the input. The output is converted to categories with the threshold method, and the custom function above handles plotting and evaluation.

from sklearn.neural_network import MLPRegressor
# Creating the autoencoder model (128 -> 32 -> 128)
autoencoder = MLPRegressor(hidden_layer_sizes=(128, 32, 128), activation='relu',
                           solver='adam', max_iter=2000, random_state=77)
autoencoder.fit(X_scaled, X_scaled)
X_reconstructed = autoencoder.predict(X_scaled)

# Calculating the per-point reconstruction error
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2, axis=1)
# Threshold: 95th percentile of the normal points' reconstruction error
threshold = np.percentile(reconstruction_error[:len(X_normal)], 95)
# Below the threshold -> normal (1); at or above it -> abnormal (-1)
predicted_anomalies = np.where(reconstruction_error < threshold, 1, -1)

plot_result(predicted_anomalies,model_name='autoencoder')

The accuracy is 95%, which is not very high. The wrongly predicted points are all in the upper-left corner, and I honestly don't know why.
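One way to sanity-check the threshold is to plot the reconstruction-error distribution with the threshold marked; the outliers should sit in the right tail. A small sketch reusing reconstruction_error and threshold from above:

# Histogram of reconstruction errors with the 95th-percentile threshold marked
plt.figure(figsize=(6, 3))
plt.hist(reconstruction_error, bins=50)
plt.axvline(threshold, color='red', linestyle='--', label='threshold')
plt.legend()
plt.show()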


DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  • Principle: A density-based clustering algorithm that marks points in sparse regions as outliers.
  • Application: Suitable when the clusters in the data set are irregular or of different sizes.

I don't quite understand the principle of this method, so I'll just call the library directly to predict and evaluate.

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.9, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
# DBSCAN labels noise as -1; the single dense cluster gets label 0,
# so map cluster 0 to normal (1) and everything else to abnormal (-1)
plot_result(np.where(dbscan_labels == 0, 1, -1), model_name='DBSCAN')

The accuracy is 99.6%, with only three prediction errors, which is still very high.
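The eps=0.9 above was hand-tuned. A common heuristic for choosing eps is the k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the "elbow". A small sketch with sklearn's NearestNeighbors (just a heuristic aid, not part of the original code):

from sklearn.neighbors import NearestNeighbors
# Distance of each point to its 5th neighbor (kneighbors counts the point itself first)
nn = NearestNeighbors(n_neighbors=5).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
plt.figure(figsize=(6, 3))
plt.plot(np.sort(distances[:, -1]))
plt.ylabel('5th-NN distance')
plt.show()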


Local Outlier Factor (LOF)

  • Principle: Detect anomalies by comparing the local density of a point with that of its neighbors.
  • Application: Suitable for detecting outliers whose neighborhood density differs significantly from their surroundings.

from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=10, contamination=0.025)
lof_labels = lof.fit_predict(X_scaled)
plot_result(lof_labels,model_name='LOF')

LOF's accuracy is also 99.6%, again with only three prediction errors, which is still very high.
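After fit_predict, LOF also stores a per-point score in negative_outlier_factor_ (values near -1 indicate inliers; much smaller values indicate outliers), which helps when inspecting borderline cases; a small optional sketch:

# negative_outlier_factor_: about -1 for inliers, strongly negative for outliers
lof_scores = lof.negative_outlier_factor_
print('10 most outlying indices:', np.argsort(lof_scores)[:10])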


Gaussian Mixture Model (GMM)

  • Principle: Model the data with a mixture of several Gaussian distributions; outliers tend not to fit these distributions well.
  • Application: Suitable for anomaly detection when the data follows a multi-modal distribution.

from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2, random_state=77)
gmm_labels = gmm.fit_predict(X_scaled)
# Under this seed, component 0 happens to capture the outliers,
# so map component 0 to abnormal (-1) and component 1 to normal (1)
plot_result(np.where(gmm_labels == 0, -1, 1), model_name='GaussianMixture')

The accuracy is 99.87%, with only one point predicted wrong, the highest of all! And that point is so close to the normal values that getting it wrong is completely understandable.
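Note that the mapping np.where(gmm_labels == 0, -1, 1) relies on component 0 having captured the outliers under this particular seed. A more seed-robust alternative (a sketch, not the original approach) is to threshold the per-sample log-likelihood from score_samples, flagging the least likely points as anomalies:

# Low log-likelihood under the fitted mixture = anomalous
log_lik = gmm.score_samples(X_scaled)
thresh = np.percentile(log_lik, 2.5)  # ~2.5% matches the true contamination here
gmm_pred = np.where(log_lik < thresh, -1, 1)
print(accuracy_score(true_labels, gmm_pred))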


Summary

Judging from performance on this simulated data set, the Gaussian mixture is the best, followed by the isolation forest; LOF and DBSCAN come next, and the autoencoder is the worst.

As for why the deep learning method did worse, my guess is the small amount of data: deep learning on small data sets generally does not perform as well as traditional machine learning.

Why is the Gaussian mixture so good? Because it is distribution-based, and our simulated data set is generated from different distributions, which happens to play exactly to its strengths, so the effect is better.

This is just a preliminary comparison case. In the future, it will need to be tested with massive real data to know which models are really useful.

Origin blog.csdn.net/weixin_46277779/article/details/134968069