A Summary of Common Anomaly Detection Algorithms with Code Implementations [statistical methods / k-nearest neighbors / Isolation Forest / DBSCAN / LOF / Gaussian mixture model (GMM) / AutoEncoder, etc.]

This blog post continues the summary notes from the previous series; it summarizes the key ideas behind today's mainstream anomaly detection algorithms.

(1) Outlier detection based on statistical methods

Outlier detection based on statistical methods is a widely used family of anomaly detection techniques that identifies observations differing significantly from the rest of the sample, using the statistical properties of the data. The following introduces the algorithm principle in detail and analyzes its advantages and disadvantages.
Algorithm principle:
Probability-distribution-based methods: Assume the sample data follows some probability distribution, such as a normal distribution. By evaluating each sample's probability density or cumulative distribution function (CDF), its degree of abnormality can be quantified; samples are then flagged as anomalous based on a threshold or a fixed proportion.
Distance-based methods: Compute a distance or similarity between each sample and the others, such as the Euclidean or Mahalanobis distance, and flag samples whose distance exceeds a threshold or percentile as outliers (see the Mahalanobis-distance sketch after this list).
Boxplot-based methods: Use the boxplot from descriptive statistics to identify outliers. After computing the quartiles and the inner fences (Q1 − 1.5·IQR and Q3 + 1.5·IQR), samples falling beyond the fences are flagged as outliers.
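
The list above mentions the Mahalanobis distance, which the demo code further below does not cover. The following is a minimal sketch of that idea, under the assumption of roughly multivariate-normal data: the squared Mahalanobis distance then approximately follows a chi-square distribution, so a quantile of that distribution serves as the cutoff. The function name and the 97.5% quantile are illustrative choices, not part of the original post.

import numpy as np
from scipy.stats import chi2

def detect_outliers_mahalanobis(X, quantile=0.975):
    """Flag rows of X whose Mahalanobis distance from the mean is unusually large."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mean
    # Squared Mahalanobis distance of each row from the sample mean
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    # Under multivariate normality, d2 is approximately chi-square distributed
    cutoff = chi2.ppf(quantile, df=X.shape[1])
    return np.where(d2 > cutoff)[0]

# Example: 100 points from a 2-D Gaussian plus one far-away point
X = np.vstack([np.random.RandomState(0).normal(0, 1, (100, 2)), [[8.0, 8.0]]])
print(detect_outliers_mahalanobis(X))  # expected to flag index 100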
Advantages:
Simple and intuitive: Statistical outlier detection algorithms are usually easy to understand and implement, and do not require complex computation.
Broad applicability: This class of methods applies to many data types and distributions, and can be used for anomaly detection problems in different domains.
Label-free: Compared with supervised anomaly detection methods, statistical methods do not rely on labeled normal samples, making them well suited to unsupervised or semi-supervised settings.
Disadvantages:
Distributional assumptions: Probability-distribution-based methods assume the data follows a specific distribution, but real data may not conform to this assumption; if the actual distribution differs from the assumed one, outliers may be misjudged.
Sensitive to noise: When the data contains substantial noise, statistical methods may mistake noise for outliers, hurting accuracy.
Dependence on parameter selection: Statistical methods usually require thresholds or other parameters to decide what counts as an outlier. The choice of parameters strongly affects performance and must be determined through experience or experiments.
Summary:
Outlier detection based on statistical methods is a simple, intuitive, broadly applicable, and label-free approach that identifies outliers by exploiting the statistical properties of the sample data. However, it relies on distributional assumptions, is sensitive to noise, and depends on parameter selection. In practice, an appropriate statistical method should be chosen according to the specific problem and data characteristics, with parameter tuning and result validation used to improve accuracy and robustness.

The demo code implementation is as follows:

import numpy as np
import pandas as pd


"""
在下述代码中,我们定义了三个函数分别实现了基于概率分布的方法(Z-score),基于距离的方法(密度离群因子 - DFF)和基于箱线图的方法。

detect_outliers_zscore 函数使用 Z 分数(Z-score)来度量每个数据点与均值之间的偏离程度,超过阈值的数据点被判定为异常值。
detect_outliers_distance 函数使用密度离群因子(DFF)来度量每个数据点与中位数之间的距离,超过阈值的数据点被判定为异常值。
detect_outliers_boxplot 函数使用箱线图的方法来判断异常值,根据四分位距离(IQR)和乘法因子,将超过上下界的数据点判定为异常值。
在示例中,数据为 [1, 2, 3, 10, 20, 30, 100],将使用每种方法进行异常值检测,并输出对应的异常值索引。你可以根据自己的输入数据和需求来调用相应的函数进行测试。
"""


# Probability-distribution-based method (Z-score)
def detect_outliers_zscore(data, threshold=3):
    mean = np.mean(data)
    std = np.std(data)
    z_scores = np.abs((data - mean) / std)
    outliers = np.where(z_scores > threshold)[0]
    return outliers

# Robust distance-based method (modified Z-score using the MAD)
def detect_outliers_distance(data, threshold=3.5):
    # 3.5 is the conventional cutoff for the modified Z-score
    median = np.median(data)
    mad = np.median(np.abs(data - median))
    if mad == 0:
        return np.array([], dtype=int)  # data is constant around the median
    modified_z = np.abs(0.6745 * (data - median) / mad)
    outliers = np.where(modified_z > threshold)[0]
    return outliers

# Boxplot-based method
def detect_outliers_boxplot(data, multiplier=1.5):
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr
    outliers = np.where((data < lower_bound) | (data > upper_bound))[0]
    return outliers

# Example usage
data = np.array([1, 2, 3, 10, 20, 30, 100])

# Probability-distribution-based method (Z-score). Note: with the population
# standard deviation np.std uses, |z| is bounded by sqrt(n - 1) (about 2.45 for
# n = 7), so the conventional threshold of 3 can never fire on such a small
# sample; we use 2 here.
outliers_zscore = detect_outliers_zscore(data, threshold=2)
print("Z-score method, outlier indices:", outliers_zscore)

# Robust distance-based method (modified Z-score / MAD)
outliers_distance = detect_outliers_distance(data)
print("Modified Z-score (MAD) method, outlier indices:", outliers_distance)

# Boxplot-based method
outliers_boxplot = detect_outliers_boxplot(data)
print("Boxplot method, outlier indices:", outliers_boxplot)

(2) Anomaly detection based on k-nearest neighbor algorithm

Anomaly detection based on the k-nearest-neighbor algorithm is a commonly used unsupervised method that identifies outliers by computing the distance or similarity between each sample and its nearest neighbors. The following introduces the algorithm principle in detail and analyzes its advantages and disadvantages.
Algorithm principle:
Compute distances or similarities: For each sample, compute its distance or similarity to the other samples, typically using the Euclidean distance, Manhattan distance, or cosine similarity.
Select the k nearest neighbors: According to the preset parameter k, select the k samples closest (or most similar) to the current sample as its nearest neighbors.
Decision by threshold: For each sample, compare its nearest-neighbor distance (e.g. the average distance to its k neighbors) against a threshold; a sample is considered an outlier if that distance exceeds the threshold (or, for similarities, falls below it). A minimal from-scratch sketch of this scoring follows this list.
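
As a complement to the sklearn-based demo further below, here is a minimal plain-NumPy sketch of the principle just described: score each sample by its mean distance to its k nearest neighbors and flag scores that are large relative to the median score. The function name and the factor 2.0 are illustrative assumptions.

import numpy as np

def knn_outlier_scores(X, k=5):
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    # Pairwise Euclidean distance matrix (n x n)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude each point's distance to itself
    knn_d = np.sort(d, axis=1)[:, :k]    # distances to the k nearest neighbors
    return knn_d.mean(axis=1)            # mean k-NN distance as the outlier score

# Example: 50 points from N(0, 1) plus one isolated point at 12
X = np.concatenate([np.random.RandomState(0).normal(0, 1, 50), [12.0]])
scores = knn_outlier_scores(X)
print(np.where(scores > 2.0 * np.median(scores))[0])  # expected to flag index 50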
Advantages:
Simple and easy to implement: k-NN-based anomaly detection is relatively simple and easy to implement and understand.
Adapts to the data distribution: The method makes no distributional assumptions and can adapt to different types of data.
No training phase: It works directly on the sample data for anomaly detection, with no separate training step.
Disadvantages:
Inefficiency: Computing the distance or similarity between every pair of samples leads to high computational complexity, especially on large-scale datasets.
Parameter selection: Choosing an appropriate k and threshold strongly affects performance and must be determined through experience or experiments; unsuitable parameters may cause false or missed detections.
Difficulty with high-dimensional data: In high-dimensional spaces, distance computations suffer from the curse of dimensionality, making the results unreliable.
Summarize:
k-NN-based anomaly detection is a simple, intuitive method that requires no training phase. It identifies outliers by computing the distances or similarities between each sample and its nearest neighbors. However, it is computationally inefficient, depends on parameter selection, and struggles with high-dimensional data. In practice, an appropriate k and threshold should be chosen for the specific problem and data, and techniques such as dimensionality reduction can be applied to mitigate high-dimensional issues and improve performance and reliability.
The demo code implementation is as follows:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Generate random experimental data
np.random.seed(42)
data = np.concatenate([np.random.normal(0, 1, 100), np.random.normal(10, 1, 20)])

"""
使用 numpy 库生成了一个实验数据集。该数据集包含了两个正态分布的样本,其中一个分布的均值为 0,标准差为 1,样本数量为 100;另一个分布的均值为 10,标准差为 1,样本数量为 20。这样生成的数据集中,后面 20 个样本点就是异常值。

然后使用 detect_outliers_knn 函数进行基于 k 最近邻算法的异常值检测。在示例中使用默认的参数,k=5 和阈值为 1.5。输出结果为后面 20 个样本点的索引列表,即 [100, 101, ..., 119],表示这些样本点被判定为异常值。
"""

def detect_outliers_knn(data, k=5, threshold=1.5):
    """
    Outlier detection based on the k-nearest-neighbor algorithm
    Parameters:
        - data: input data, a 1-D or 2-D array
        - k: number of neighbors used in the k-NN computation, default 5
        - threshold: multiplier on the median average k-NN distance, default 1.5
    Returns:
        - outliers: index list of the detected outliers
    """
    if len(data.shape) == 1:
        # Reshape a 1-D array into a column vector
        data = data.reshape(-1, 1)
    
    knn = NearestNeighbors(n_neighbors=k)
    knn.fit(data)
    distances, _ = knn.kneighbors()          # distances to the k nearest neighbors
    avg_distances = np.mean(distances, axis=1)
    
    # Flag samples whose average k-NN distance is large relative to the median
    outliers = np.where(avg_distances > threshold * np.median(avg_distances))[0]
    return outliers

# Example usage
outliers = detect_outliers_knn(data)
print("Outlier indices:", outliers)

(3) Anomaly detection algorithm based on Isolation Forest

Anomaly detection based on Isolation Forest is an unsupervised, tree-based method that identifies outliers by constructing randomly partitioned binary trees. The following introduces the algorithm principle in detail and analyzes its advantages and disadvantages.
Algorithm principle:
Build isolation trees: For a given dataset, randomly select a feature and a split point to divide the data into two subsets, and repeat this recursively on each subset until a stopping condition is reached (e.g. the tree reaches a preset height, or a subset contains only one sample).
Compute anomaly scores: An anomaly score is derived from the path length from the root node to each sample; the shorter the path, the more easily the sample was isolated, and the more likely it is an outlier (a minimal single-tree sketch follows this list).
Decision by threshold: According to a preset threshold, a sample whose anomaly score exceeds the threshold is regarded as an outlier.
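
Below is a minimal single-tree sketch of this idea in plain NumPy (a real isolation forest averages many such trees and subsamples the data, as the sklearn demo further below does). It recursively splits on a random feature at a random cut point and records the depth at which each sample ends up; shorter average paths indicate easier isolation. The function name and the tree/depth counts are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def isolation_path_lengths(X, depth=0, max_depth=10):
    """Depth at which each row of X is isolated by one random tree."""
    n = X.shape[0]
    lengths = np.zeros(n)
    if n <= 1 or depth >= max_depth:
        lengths[:] = depth
        return lengths
    f = rng.integers(X.shape[1])        # pick a random feature
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:                        # feature is constant, cannot split
        lengths[:] = depth
        return lengths
    cut = rng.uniform(lo, hi)           # pick a random split point
    mask = X[:, f] < cut
    lengths[mask] = isolation_path_lengths(X[mask], depth + 1, max_depth)
    lengths[~mask] = isolation_path_lengths(X[~mask], depth + 1, max_depth)
    return lengths

# Example: 100 clustered points plus one isolated point at (10, 10)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[10.0, 10.0]]])
avg_path = np.mean([isolation_path_lengths(X) for _ in range(20)], axis=0)
print(np.argmin(avg_path))  # the shortest average path; expected: 100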
Advantages:
Efficiency: Compared with distance-based methods, Isolation Forest is computationally efficient because it builds trees with randomized splits rather than computing pairwise distances.
Distribution-free: The method makes no assumptions about the data distribution and is suitable for different data types, including multidimensional and mixed data.
Scalability: Isolation Forest improves the accuracy and robustness of anomaly detection by ensembling many isolation trees.
Disadvantages:
Weak on high-dimensional sparse data: The algorithm may be biased on high-dimensional sparse data, because random axis-aligned splits are less effective in that setting.
Parameter selection: An appropriate tree height and anomaly-score threshold must be chosen; parameter selection strongly affects performance and must be determined through experience or experiments.
Summary:
Isolation-Forest-based anomaly detection is an efficient, broadly applicable unsupervised method. It identifies outliers by building randomly partitioned binary trees and using the path length as an anomaly score. However, it may be biased on high-dimensional sparse data and requires appropriate parameter choices. In practice, the tree height and score threshold should be selected for the specific problem and data, and other techniques (such as dimensionality reduction) can be combined to improve performance and reliability.

The demo code implementation is as follows:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# Generate experimental data
data, _ = make_blobs(n_samples=100, centers=1, random_state=42)  # normal data
outliers = np.random.RandomState(0).uniform(low=-10, high=10, size=(20, 2))  # scattered anomalies

"""
sklearn.datasets.make_blobs generates one Gaussian blob as the normal data, and
20 uniformly scattered points are appended as anomalies (a few of these may
fall inside the normal cluster and go undetected).
detect_outliers_isolation_forest then performs Isolation-Forest-based outlier
detection; the returned index list identifies the points judged to be outliers.
Note that contamination should roughly match the true anomaly fraction, which
here is 20 / 120.
"""

# Concatenate the normal data and the anomalies
data = np.concatenate([data, outliers])

def detect_outliers_isolation_forest(data, contamination=0.1):
    """
    Outlier detection based on the Isolation Forest algorithm
    Parameters:
        - data: input data, a 1-D or 2-D array
        - contamination: expected proportion of outliers, default 0.1
    Returns:
        - outliers: index list of the detected outliers
    """
    if len(data.shape) == 1:
        # Reshape a 1-D array into a column vector
        data = data.reshape(-1, 1)
    
    clf = IsolationForest(contamination=contamination)
    clf.fit(data)
    outliers = np.where(clf.predict(data) == -1)[0]
    return outliers

# Example usage; contamination is set to match the true anomaly fraction (20 / 120)
outliers = detect_outliers_isolation_forest(data, contamination=20 / 120)
print("Outlier indices:", outliers)

(4) Anomaly detection algorithm based on DBSCAN

Anomaly detection based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based unsupervised method that identifies outliers by classifying data points into core points, border points, and noise points. The following introduces the algorithm principle in detail and analyzes its advantages and disadvantages.
Algorithm principle:
Core, border, and noise points: Given a neighborhood radius ε and a minimum count MinPts, a sample whose ε-neighborhood contains at least MinPts samples is a core point. A sample that lies within the neighborhood of a core point but is not itself a core point is a border point. All remaining samples are noise points (a minimal sketch of this classification follows this list).
Cluster expansion: Starting from any unvisited core point, recursively expand through the samples in its neighborhood, assigning all density-reachable samples to the same cluster. This repeats until all core points have been visited and no new samples can be added to existing clusters.
Outlier judgment: Samples not assigned to any cluster (the noise points) are treated as outliers.
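
Here is a minimal plain-NumPy sketch of the core/border/noise classification just described (full DBSCAN additionally performs the cluster expansion, which the sklearn demo further below handles). The function name and the eps/min_pts values are illustrative assumptions.

import numpy as np

def dbscan_point_types(X, eps=0.5, min_pts=5):
    # Pairwise Euclidean distance matrix (n x n)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighborhood counts; each point counts itself, per the usual MinPts convention
    n_in_eps = (d <= eps).sum(axis=1)
    core = n_in_eps >= min_pts
    # Border points: not core, but within eps of at least one core point
    border = ~core & ((d <= eps) & core[None, :]).any(axis=1)
    noise = ~core & ~border
    return core, border, noise

# Example: a dense cluster plus one isolated point at (4, 4)
X = np.vstack([np.random.RandomState(0).normal(0, 0.3, (60, 2)), [[4.0, 4.0]]])
core, border, noise = dbscan_point_types(X)
print("noise indices:", np.where(noise)[0])  # expected to include index 60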
Advantages:
Distribution-free: DBSCAN does not rely on distributional assumptions and can find clusters of arbitrary shape (although a single global ε makes clusters of widely differing density hard to separate).
Automatic cluster count: Unlike many clustering algorithms, DBSCAN does not require the number of clusters to be specified in advance; it is determined automatically by the density of the data.
Robustness: The method tolerates noise and outliers to a degree and handles noisy datasets well.
Disadvantages:
Parameter selection: DBSCAN requires two parameters, the neighborhood radius ε and the minimum sample count MinPts, for clustering and anomaly detection. Reasonable parameter choices strongly affect the results and must be determined through experience or experiments.
Weak on high-dimensional data: Owing to the curse of dimensionality, neighborhoods become sparse in high-dimensional spaces, so DBSCAN may perform poorly on high-dimensional data.
Summary:
DBSCAN-based anomaly detection is an unsupervised method that identifies outliers by classifying samples into core, border, and noise points. It is distribution-free, determines the number of clusters automatically, and is fairly robust. However, it requires careful parameter selection and may struggle with high-dimensional data. In practice, parameters should be chosen for the specific problem and data, and other techniques (such as dimensionality reduction) can be combined to improve performance and reliability.

The demo code implementation is as follows:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

# Generate experimental data
data, _ = make_blobs(n_samples=100, centers=1, random_state=42)  # normal data
outliers = np.random.RandomState(0).uniform(low=-10, high=10, size=(20, 2))  # scattered anomalies

# Concatenate the normal data and the anomalies
data = np.concatenate([data, outliers])

"""
sklearn.datasets.make_blobs generates one Gaussian blob as the normal data, and
20 uniformly scattered points are appended as anomalies. detect_outliers_dbscan
then performs DBSCAN-based outlier detection with the default parameters
(neighborhood radius eps=0.5, minimum sample count min_samples=5). The returned
index list contains the points labeled as noise; the scattered anomalies and
any low-density tail points of the normal cluster will typically appear in it.
"""


def detect_outliers_dbscan(data, eps=0.5, min_samples=5):
    """
    Outlier detection based on the DBSCAN algorithm
    Parameters:
        - data: input data, a 1-D or 2-D array
        - eps: DBSCAN neighborhood radius, default 0.5
        - min_samples: DBSCAN minimum sample count, default 5
    Returns:
        - outliers: index list of the detected outliers
    """
    if len(data.shape) == 1:
        # Reshape a 1-D array into a column vector
        data = data.reshape(-1, 1)
    
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    dbscan.fit(data)
    
    # Points labeled -1 (noise) are treated as outliers
    outliers = np.where(dbscan.labels_ == -1)[0]
    return outliers

# Example usage
outliers = detect_outliers_dbscan(data)
print("Outlier indices:", outliers)

(5) Anomaly detection algorithm based on LOF

Anomaly detection based on LOF (Local Outlier Factor) is a density-based unsupervised method that identifies outliers by computing a local outlier factor for each sample. The following introduces the algorithm principle in detail and analyzes its advantages and disadvantages.
Algorithm principle:
Compute the local reachability density: For each sample, compute the reachability distances to its k nearest neighbors; the inverse of their mean is the local reachability density (LRD). The larger the LRD, the denser the region around the sample.
Compute the local outlier factor: For each sample, take the ratio of each neighbor's LRD to the sample's own LRD and average over the k neighbors to obtain the local outlier factor (LOF). The larger the LOF, the more the sample stands out from its neighborhood (a minimal sketch of this computation follows this list).
Judge outliers: According to a preset threshold, a sample whose LOF value exceeds the threshold is regarded as an outlier.
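
Below is a minimal plain-NumPy sketch of the LRD/LOF computation just described, using the standard reachability distance reach_k(a, b) = max(k_dist(b), d(a, b)); the sklearn demo further below computes the same quantity internally. The function name, k=5, and the cutoff 1.5 are illustrative assumptions.

import numpy as np

def lof_scores(X, k=5):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self-distances
    nn = np.argsort(d, axis=1)[:, :k]         # indices of the k nearest neighbors
    k_dist = np.sort(d, axis=1)[:, k - 1]     # distance to each point's k-th neighbor
    # Reachability distance from each point to each of its neighbors
    reach = np.maximum(k_dist[nn], np.take_along_axis(d, nn, axis=1))
    lrd = 1.0 / reach.mean(axis=1)            # local reachability density
    # LOF: average ratio of the neighbors' LRD to the point's own LRD
    return lrd[nn].mean(axis=1) / lrd

# Example: 100 clustered points plus one isolated point at (8, 8)
X = np.vstack([np.random.RandomState(0).normal(0, 1, (100, 2)), [[8.0, 8.0]]])
print(np.where(lof_scores(X) > 1.5)[0])  # expected to flag index 100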
Advantages:
Distribution-free: LOF makes no assumptions about the data distribution and can effectively find clusters and outliers of arbitrary shape.
Uses local context: By comparing each sample's density with that of its neighbors, LOF captures local context and improves detection accuracy, especially when cluster densities differ.
Usable for cluster analysis: The LOF algorithm can be used not only for anomaly detection but also to support cluster analysis, helping reveal density structure in the data.
Disadvantages:
Parameter selection: An appropriate k and threshold must be chosen; parameter selection strongly affects performance and must be determined through experience or experiments.
High computational complexity: Computing each sample's distances, k-neighborhoods, and LRD values is expensive, especially on large-scale datasets.
Weak on high-dimensional data: Owing to the curse of dimensionality, distance-based neighborhoods degrade in high dimensions, so LOF may perform poorly on high-dimensional data.
Summary:
LOF-based anomaly detection is an unsupervised method that identifies outliers by computing a local outlier factor for each sample. It is distribution-free, exploits local context, and is also applicable to cluster analysis. However, it requires careful parameter selection, can be costly to compute, and struggles with high-dimensional data. In practice, parameters should be chosen for the specific problem and data, and other techniques (such as dimensionality reduction) can be combined to improve performance and reliability.

The demo code implementation is as follows:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# Generate experimental data
data, _ = make_blobs(n_samples=100, centers=1, random_state=42)  # normal data
outliers = np.random.RandomState(0).uniform(low=-10, high=10, size=(20, 2))  # scattered anomalies

# Concatenate the normal data and the anomalies
data = np.concatenate([data, outliers])


"""
sklearn.datasets.make_blobs generates one Gaussian blob as the normal data, and
20 uniformly scattered points are appended as anomalies. detect_outliers_lof
then performs LOF-based outlier detection; contamination controls the fraction
of points flagged. The returned index list should cover most of the scattered
anomalies (indices 100-119), though scattered points that happen to fall inside
the normal cluster may be missed.
"""

def detect_outliers_lof(data, contamination=0.1):
    """
    Outlier detection based on the LOF algorithm
    Parameters:
        - data: input data, a 1-D or 2-D array
        - contamination: expected proportion of outliers, default 0.1
    Returns:
        - outliers: index list of the detected outliers
    """
    if len(data.shape) == 1:
        # Reshape a 1-D array into a column vector
        data = data.reshape(-1, 1)
    
    lof = LocalOutlierFactor(contamination=contamination)
    lof.fit_predict(data)
    scores = -lof.negative_outlier_factor_    # larger score = more anomalous
    
    # Flag the top `contamination` fraction of LOF scores
    threshold = np.percentile(scores, (1 - contamination) * 100)
    outliers = np.where(scores > threshold)[0]
    return outliers

# Example usage; contamination is set to match the true anomaly fraction (20 / 120)
outliers = detect_outliers_lof(data, contamination=20 / 120)
print("Outlier indices:", outliers)

(6) Anomaly detection algorithm based on Gaussian mixture model GMM

The anomaly detection algorithm based on the Gaussian Mixture Model (GMM) is an unsupervised, probabilistic method: it fits the data with a mixture of Gaussian distributions and uses the resulting probability density to identify outliers. The following introduces the algorithm principle in detail and analyzes its advantages and disadvantages.
Algorithm principle:
Model fitting: First, fit the dataset with a mixture of several Gaussian distributions via maximum likelihood estimation, typically using the expectation-maximization (EM) algorithm.
Compute the probability density: For each sample, evaluate its probability density under each Gaussian component and take the weighted sum to obtain the sample's total probability density (a sketch of this weighted sum follows this list).
Abnormality judgment: According to a preset threshold, a sample whose probability density falls below the threshold is regarded as an outlier.
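
The following sketch makes the weighted-sum step explicit using scipy, and checks that it matches sklearn's score_samples (which returns the log-density). The two-cluster data, the component count, and the 5% density threshold are illustrative assumptions.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Two genuine clusters plus one isolated point at (20, 20)
rs = np.random.RandomState(0)
X = np.vstack([rs.normal(0, 1, (100, 2)), rs.normal(6, 1, (100, 2)), [[20.0, 20.0]]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# p(x) = sum_j weight_j * N(x | mean_j, cov_j)
density = sum(w * multivariate_normal(m, c).pdf(X)
              for w, m, c in zip(gmm.weights_, gmm.means_, gmm.covariances_))

print(np.allclose(np.log(density), gmm.score_samples(X)))  # True: same quantity
threshold = np.percentile(density, 5)      # flag the lowest-density 5%
print(np.where(density < threshold)[0])    # expected to include index 200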
Advantages:
Models complex distributions: Because a GMM is a mixture of several Gaussian distributions, it can flexibly model complex data distributions, capturing different modes and cluster structures in the dataset.
Provides probability estimates: A GMM yields a probability density for each sample under each Gaussian component, so the degree of abnormality can be judged from the sample's probability under the different components.
Scalability: The model's complexity and flexibility can be increased by adding Gaussian components, adapting to more complex data distributions.
Disadvantages:
High computational complexity: Fitting a GMM requires iterative optimization, which is expensive, especially on large-scale datasets.
Sensitive to the curse of dimensionality: Since the number of Gaussian parameters grows with the data dimension, GMMs may suffer from the curse of dimensionality on high-dimensional data, making preprocessing such as dimensionality reduction necessary.
Parameter selection: A GMM requires choices such as the number of components, the initialization strategy, and the optimization settings. These choices strongly affect the model's performance and results and must be determined through experience or experiments.
Summary:
GMM-based anomaly detection is a flexible, probabilistic unsupervised method. It can model complex data distributions and identifies outliers from each sample's probability under the fitted mixture. However, it is computationally expensive and demanding with respect to high-dimensional data and parameter selection. In practice, parameters should be chosen for the specific problem and data, and other techniques (such as dimensionality reduction) can be combined to improve performance and reliability.

The demo code implementation is as follows:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Generate experimental data
data, _ = make_blobs(n_samples=100, centers=1, random_state=42)  # normal data
outliers = np.random.RandomState(0).uniform(low=-10, high=10, size=(20, 2))  # scattered anomalies

# Concatenate the normal data and the anomalies
data = np.concatenate([data, outliers])


"""
sklearn.datasets.make_blobs generates one Gaussian blob as the normal data, and
20 uniformly scattered points are appended as anomalies. detect_outliers_gmm
then performs GMM-based outlier detection (n_components=2 by default). The
returned index list contains the lowest-density points and should cover most of
the scattered anomalies (indices 100-119), though scattered points that happen
to fall inside the normal cluster may be missed.
"""


def detect_outliers_gmm(data, n_components=2, contamination=0.1):
    """
    Outlier detection based on a Gaussian Mixture Model (GMM)
    Parameters:
        - data: input data, a 1-D or 2-D array
        - n_components: number of mixture components, default 2
        - contamination: expected proportion of outliers, default 0.1
    Returns:
        - outliers: index list of the detected outliers
    """
    if len(data.shape) == 1:
        # Reshape a 1-D array into a column vector
        data = data.reshape(-1, 1)
    
    gmm = GaussianMixture(n_components=n_components, covariance_type='full', 
                          max_iter=100, random_state=42)
    gmm.fit(data)
    scores = -gmm.score_samples(data)   # negative log-density: larger = more anomalous
    
    # Flag the top `contamination` fraction of scores (the lowest densities)
    threshold = np.percentile(scores, (1 - contamination) * 100)
    outliers = np.where(scores > threshold)[0]
    return outliers

# Example usage; contamination is set to match the true anomaly fraction (20 / 120)
outliers = detect_outliers_gmm(data, contamination=20 / 120)
print("Outlier indices:", outliers)

(7) Anomaly detection algorithm based on autoencoder

The anomaly detection algorithm based on an autoencoder is an unsupervised method: it trains a neural network to reconstruct its input and uses the reconstruction error to judge outliers. The following introduces the algorithm principle in detail and analyzes its advantages and disadvantages.
Algorithm principle:
Autoencoder structure: An autoencoder consists of two parts, an encoder and a decoder. The encoder maps the input data to a low-dimensional latent representation, and the decoder maps the latent representation back to the input space as a reconstruction.
Training process: The autoencoder is trained by minimizing the reconstruction error between the input and the reconstructed output. Normal samples are reconstructed well, while anomalous samples produce larger reconstruction errors.
Abnormality judgment: According to a preset threshold, a sample whose reconstruction error exceeds the threshold is regarded as an outlier (a sketch of one common threshold rule follows this list).
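
As a complement to the percentile rule used in the demo further below, here is a minimal sketch of the decision step with a "mean + 3·std" threshold instead; `model` is assumed to be any trained Keras-style autoencoder with a predict() method, and the function name and n_sigmas value are illustrative.

import numpy as np

def reconstruction_outliers(model, X, n_sigmas=3.0):
    # Reconstruct the inputs and compute the per-sample mean squared error
    X_hat = model.predict(X, verbose=0)
    errors = np.mean(np.square(X - X_hat), axis=1)
    # Flag samples whose error is unusually large relative to the error distribution
    threshold = errors.mean() + n_sigmas * errors.std()
    return np.where(errors > threshold)[0]

Ideally the error statistics are computed on held-out normal data only, so that the anomalies themselves do not inflate the threshold.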
Advantages:
Unsupervised learning: An autoencoder is trained without labeled anomalies, so it also works in scenarios with no or very few known anomalous samples.
Powerful nonlinear modeling: Autoencoders can capture complex patterns and features in the data, which is especially valuable on high-dimensional data.
Generalization ability: By learning a latent representation of the data, an autoencoder generalizes to some degree and can still detect anomalies on data whose distribution is similar, but not identical, to the training data.
Disadvantages:
Parameter selection: An autoencoder requires choices such as the network architecture, loss function, and regularization. Different choices can change the model's performance and detection results, and must be determined through experience or experiments.
Hard to distinguish anomaly types: Because detection relies on the reconstruction error alone, it is difficult to distinguish different types of anomalies, and rare but normal samples may be misclassified as anomalous.
High computational complexity: Training an autoencoder requires substantial computational resources and time, which can be limiting on large-scale datasets.
Summary:
Autoencoder-based anomaly detection is an unsupervised method that trains a neural network to reconstruct its input and judges outliers from the reconstruction error. It offers unsupervised training, nonlinear modeling, and generalization ability. However, it requires careful parameter selection, has difficulty distinguishing anomaly types, and is computationally expensive to train. In practice, parameters should be chosen for the specific problem and data, and other techniques (such as feature selection) can be combined to improve performance and reliability.

The demo code implementation is as follows:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Generate experimental data
data, _ = make_blobs(n_samples=100, centers=1, random_state=42)  # normal data
outliers = np.random.RandomState(0).uniform(low=-10, high=10, size=(20, 2))  # scattered anomalies

# Concatenate the normal data and the anomalies
data = np.concatenate([data, outliers])

# Normalize the data to [0, 1]
scaler = MinMaxScaler()
data = scaler.fit_transform(data)


"""
sklearn.datasets.make_blobs generates one Gaussian blob as the normal data, and
20 uniformly scattered points are appended as anomalies. build_autoencoder
constructs an autoencoder model, and detect_outliers_autoencoder performs
autoencoder-based outlier detection with the default parameters (epochs=100,
batch_size=32). The returned index list contains the points with the largest
reconstruction errors. Note that the model is trained on all 120 points,
anomalies included, so some anomalies may be partially reconstructed; ideally
the autoencoder is trained on normal data only.
"""


def build_autoencoder(input_dim):
    """
    Build the autoencoder model
    Parameters:
        - input_dim: input dimensionality
    Returns:
        - autoencoder: the autoencoder model
        - encoder: the encoder model
    """
    input_layer = Input(shape=(input_dim,))
    
    # Encoder part
    encoded = Dense(10, activation='relu')(input_layer)
    encoded = Dense(5, activation='relu')(encoded)
    
    # Decoder part
    decoded = Dense(10, activation='relu')(encoded)
    decoded = Dense(input_dim, activation='sigmoid')(decoded)
    
    # Build the autoencoder model
    autoencoder = Model(input_layer, decoded)
    
    # Build the encoder model, usable for feature extraction
    encoder = Model(input_layer, encoded)
    
    return autoencoder, encoder

def detect_outliers_autoencoder(data, epochs=100, batch_size=32):
    """
    Outlier detection based on an autoencoder
    Parameters:
        - data: input data, a 1-D or 2-D array
        - epochs: number of training epochs, default 100
        - batch_size: batch size, default 32
    Returns:
        - outliers: index list of the detected outliers
    """
    if len(data.shape) == 1:
        # Reshape a 1-D array into a column vector
        data = data.reshape(-1, 1)
    
    input_dim = data.shape[1]
    autoencoder, encoder = build_autoencoder(input_dim)  # the encoder is unused here
    autoencoder.compile(optimizer='adam', loss='mse')
    
    # Train the autoencoder to reconstruct its own input
    autoencoder.fit(data, data, epochs=epochs, batch_size=batch_size, verbose=0)
    
    # Reconstruct the samples
    reconstructed_data = autoencoder.predict(data)
    
    # Compute the per-sample reconstruction error
    errors = np.mean(np.square(data - reconstructed_data), axis=1)
    
    # Threshold at the 95th percentile of the errors (flags the top 5%); with 20
    # anomalies among 120 points, a lower percentile (~83) would match the true
    # anomaly fraction
    threshold = np.percentile(errors, 95)
    outliers = np.where(errors > threshold)[0]
    
    return outliers

# Example usage
outliers = detect_outliers_autoencoder(data)
print("Outlier indices:", outliers)

Origin blog.csdn.net/Together_CZ/article/details/131631291