Summary of Methods for Time Series Anomaly Detection


Source: 算法进阶, kaggle竞赛宝典
This article is about 3,100 words; suggested reading time is 5 minutes.
This article explores a variety of methods for uncovering anomalous patterns and outliers in time series data.

Introduction

Time series data is a sequence of observations recorded at regular time intervals. It arises in many fields, including finance, weather forecasting, and stock market analysis. Analyzing time series data can provide valuable insights and help in making informed decisions.

Anomaly detection is the process of identifying patterns in data that do not conform to expected behavior. In the context of time series data, anomalies can represent significant events or outliers that deviate from normal patterns. Detecting anomalies in time series data is critical for a variety of applications, including fraud detection, network monitoring, and predictive maintenance.

First, import the libraries. To simplify data acquisition, we use yfinance directly:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf

# Download time series data using yfinance
data = yf.download('AAPL', start='2018-01-01', end='2023-06-30')

Understanding Time Series Data

Before diving into anomaly detection techniques, let's briefly review the characteristics of time series data. Time series data typically has the following properties:

  • Trend: The long-term increase or decrease in data values over time.

  • Seasonality: A pattern or cycle that repeats at regular intervals.

  • Autocorrelation: The correlation between the current observation and previous observations.

  • Noise: Random fluctuations or irregularities in data.

Let's visualize the downloaded time series data:

# Plot the time series data
plt.figure(figsize=(12, 6))
plt.plot(data['Close'])
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.title('AAPL Stock Price')
plt.xticks(rotation=45)
plt.grid(True)

plt.show()

[Figure: AAPL Stock Price, daily closing price over time]

The figure shows an upward trend in the stock price over time, along with periodic fluctuations that suggest some seasonality. There also appears to be autocorrelation between consecutive closing prices. A seasonal decomposition, sketched below, is one way to separate these components.
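As a quick check on these properties, the series can be decomposed into trend, seasonal, and residual components. This is a minimal sketch using statsmodels' seasonal_decompose; the period of 252 trading days (roughly one year) is an assumed cycle length, not something derived from the data:

from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the closing price into trend, seasonal, and residual parts.
# period=252 assumes a yearly cycle in daily trading data.
decomposition = seasonal_decompose(data['Close'], model='additive', period=252)
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.show()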

Time Series Data Preprocessing

Before applying anomaly detection techniques, it is crucial to preprocess time series data. Preprocessing includes handling missing values, smoothing data, and removing outliers.

Missing Values

Missing values can occur in time series data for various reasons, such as data collection errors or gaps in the record. Proper handling of missing values is essential to avoid biasing the analysis.

# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

The stock data we are using does not contain any missing values. If missing values were present, they could be handled by imputation or by dropping the affected time points.
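If gaps did appear, a hedged sketch of what that handling could look like (none of this is needed for the AAPL data above):

# Forward-fill short gaps with the last known price...
data['Close'] = data['Close'].ffill()

# ...or interpolate between known points, weighting by the time index:
# data['Close'] = data['Close'].interpolate(method='time')

# Dropping rows with missing values is the simplest option when gaps are rare:
# data = data.dropna(subset=['Close'])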

Smoothing the Data

Smoothing time series data can help reduce noise and highlight underlying patterns. A common technique for smoothing time series data is a moving average.

# Smooth the time series data using a moving average
window_size = 7
data['Smoothed'] = data['Close'].rolling(window_size).mean()

# Plot the smoothed data
plt.figure(figsize=(12, 6))
plt.plot(data['Close'], label='Original')
plt.plot(data['Smoothed'], label='Smoothed')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.title('AAPL Stock Price (Smoothed)')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)

plt.show()

[Figure: AAPL Stock Price (Smoothed), original vs. 7-day moving average]

The graph shows the raw closing price alongside a smoothed version obtained with a moving average. Smoothing helps visualize the overall trend and reduces the impact of short-term fluctuations.
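A simple moving average weights every point in the window equally. An exponentially weighted moving average (EWMA) is a common alternative that reacts faster to recent changes; a minimal sketch, where span=7 mirrors the window size above and is just as arbitrary:

# EWMA: recent observations get exponentially more weight than older ones
data['EWMA'] = data['Close'].ewm(span=7, adjust=False).mean()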

Removing Outliers

Extreme outliers can significantly affect the performance of anomaly detection algorithms. Identifying and removing them before applying anomaly detection techniques is crucial.

# Calculate z-scores for each data point
z_scores = (data['Close'] - data['Close'].mean()) / data['Close'].std()

# Define a threshold for outlier detection
threshold = 3

# Identify outliers
outliers = data[np.abs(z_scores) > threshold]

# Remove outliers from the data
data = data.drop(outliers.index)

# Plot the data without outliers
plt.figure(figsize=(12, 6))
plt.plot(data['Close'])
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.title('AAPL Stock Price (Without Outliers)')
plt.xticks(rotation=45)
plt.grid(True)

plt.show()

[Figure: AAPL Stock Price (Without Outliers)]

The figure above shows the time series data after removing the identified outliers. Removing outliers helps to improve the accuracy of anomaly detection algorithms by reducing the influence of extreme values.

You might ask: if the goal is to detect outliers, why delete them here? The outliers removed in this step are the very obvious ones; this preprocessing is a preliminary, coarse screening. Removing the blatant extremes lets the models concentrate on the values that are genuinely hard to judge.

Statistical Methods

Statistical methods provide the basis for anomaly detection in time series data. We'll explore two commonly used statistical techniques: z-scores and moving averages.

Z-Score

The z-score measures the number of standard deviations an observation is from the mean. By computing a z-score for each data point, we can identify observations that deviate significantly from expected behavior.

# Calculate z-scores for each data point
z_scores = (data['Close'] - data['Close'].mean()) / data['Close'].std()

# Plot the z-scores
plt.figure(figsize=(12, 6))
plt.plot(z_scores)
plt.xlabel('Date')
plt.ylabel('Z-Score')
plt.title('Z-Scores for AAPL Stock Price')
plt.xticks(rotation=45)
plt.axhline(y=threshold, color='r', linestyle='--', label='Threshold')
plt.axhline(y=-threshold, color='r', linestyle='--')
plt.legend()
plt.grid(True)

plt.show()

[Figure: Z-Scores for AAPL Stock Price, with dashed threshold lines at ±3]

The graph shows the calculated z-score for each data point. Observations whose absolute z-score exceeds the threshold (dashed red lines) can be considered anomalous.
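One caveat: a global z-score struggles on trending data, because the overall mean and standard deviation are dominated by the trend rather than local behavior. A rolling z-score is a common refinement; a minimal sketch, where the 30-day window is an arbitrary choice:

# Rolling z-score: compare each point to its local mean and std
window = 30
rolling_mean = data['Close'].rolling(window).mean()
rolling_std = data['Close'].rolling(window).std()
rolling_z = (data['Close'] - rolling_mean) / rolling_std

# Flag points whose local z-score exceeds the same threshold of 3
rolling_anomalies = data[np.abs(rolling_z) > threshold]
print(f'{len(rolling_anomalies)} points flagged by the rolling z-score')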

Moving Average

Another statistical approach to anomaly detection is based on moving averages. By calculating a moving average and comparing it to the raw data, we can identify deviations from expected behavior.

# Calculate the moving average
window_size = 7
moving_average = data['Close'].rolling(window_size).mean()

# Calculate the deviation from the moving average
deviation = data['Close'] - moving_average

# Plot the deviation
plt.figure(figsize=(12, 6))
plt.plot(deviation)
plt.xlabel('Date')
plt.ylabel('Deviation')
plt.title('Deviation from Moving Average')
plt.xticks(rotation=45)
plt.axhline(y=0, color='r', linestyle='--', label='Zero deviation')
plt.legend()
plt.grid(True)

plt.show()

[Figure: Deviation from Moving Average]

The graph shows each data point's deviation from the moving average. Positive deviations indicate values higher than expected, while negative deviations indicate values lower than expected.
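To turn these deviations into concrete anomaly flags, one option is to threshold them against their own spread. A minimal sketch, where the cutoff of 2 standard deviations is an arbitrary choice:

# Flag points that deviate unusually far from the moving average
k = 2
dev_std = deviation.std()
ma_anomalies = data[np.abs(deviation) > k * dev_std]
print(f'{len(ma_anomalies)} points flagged as anomalies')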

Machine Learning Methods

Machine learning methods provide more advanced techniques for anomaly detection in time series data. We will explore two popular algorithms: Isolation Forest and the LSTM autoencoder.

Isolation Forest

Isolation Forest is an unsupervised machine learning algorithm that isolates anomalies by recursively and randomly partitioning the data. It measures the average number of splits required to isolate each observation; anomalies are expected to require fewer splits.

from sklearn.ensemble import IsolationForest

# Prepare the data for Isolation Forest
X = data['Close'].values.reshape(-1, 1)

# Train the Isolation Forest model
# (contamination=0.05 assumes roughly 5% of points are anomalous)
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X)

# Predict the anomalies (-1 = anomaly, 1 = normal)
anomalies = model.predict(X)

# Plot the series and highlight the anomalous points
plt.figure(figsize=(12, 6))
plt.plot(data['Close'], label='Close')
anomaly_mask = anomalies == -1
plt.scatter(data.index[anomaly_mask], data['Close'][anomaly_mask],
            color='red', label='Anomaly')
plt.xlabel('Date')
plt.ylabel('Closing Price')
plt.title('AAPL Stock Price with Anomalies (Isolation Forest)')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)

plt.show()

[Figure: AAPL Stock Price with Anomalies (Isolation Forest)]

This figure shows the anomalies identified by the Isolation Forest algorithm, highlighted in a different color where they deviate from expected behavior.
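Because the model here is fit on raw prices, it mostly flags price levels that are rare across the whole period, which on a trending series tends to cluster at the extremes. A common refinement, offered here as a suggestion rather than part of the original recipe, is to fit on daily returns instead:

# Daily percentage returns are closer to stationary than raw prices,
# so rare values correspond to unusual moves rather than unusual levels
returns = data['Close'].pct_change().dropna()
X_ret = returns.values.reshape(-1, 1)

model_ret = IsolationForest(contamination=0.05, random_state=42)
return_anomalies = model_ret.fit_predict(X_ret)  # -1 = anomaly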

LSTM Autoencoder

LSTM (Long Short-Term Memory) autoencoders are deep learning models that can learn patterns in time series data and reconstruct input sequences. Anomalies can be detected by comparing the reconstruction error against a predefined threshold.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Prepare the data for the LSTM autoencoder
X = data['Close'].values.reshape(-1, 1)

# Normalize the data to [0, 1]
X_normalized = (X - X.min()) / (X.max() - X.min())

# Keras LSTMs expect 3D input: (samples, timesteps, features)
X_seq = X_normalized.reshape(-1, 1, 1)

# Train the LSTM autoencoder model
model = Sequential([
    LSTM(64, activation='relu', input_shape=(1, 1)),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_seq, X_normalized, epochs=10, batch_size=32)

# Reconstruct the input sequence
X_reconstructed = model.predict(X_seq)

# Calculate the reconstruction error per point
reconstruction_error = np.mean(np.abs(X_normalized - X_reconstructed), axis=1)

# The z-score threshold of 3 does not apply to normalized errors;
# use a percentile of the error distribution instead
error_threshold = np.percentile(reconstruction_error, 95)

# Plot the reconstruction error against the dates
plt.figure(figsize=(12, 6))
plt.plot(data.index, reconstruction_error)
plt.xlabel('Date')
plt.ylabel('Reconstruction Error')
plt.title('Reconstruction Error (LSTM Autoencoder)')
plt.xticks(rotation=45)
plt.axhline(y=error_threshold, color='r', linestyle='--', label='Threshold')
plt.legend()
plt.grid(True)

plt.show()

[Figure: Reconstruction Error (LSTM Autoencoder)]

The plot shows the reconstruction error for each data point. Observations with reconstruction errors above the threshold (dashed red line) can be considered anomalies.
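A short follow-up turns the error curve into concrete flags using the percentile threshold defined above. Note that with a single timestep per sample the LSTM sees no temporal context; in practice one would usually feed sliding windows of several days, which is left out of this sketch:

# Flag the points whose reconstruction error exceeds the threshold
lstm_anomalies = data[reconstruction_error > error_threshold]
print(f'{len(lstm_anomalies)} points flagged as anomalies')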

Evaluation of Anomaly Detection Models

Accurately evaluating the performance of anomaly detection models requires labeled data indicating where anomalies occur. In real-world scenarios, such labels are often very hard to obtain, so alternative techniques can be employed to assess how effective these models are.

One of the most commonly used techniques is cross-validation, which divides the available labeled data into subsets, or folds. A model is trained on part of the data and evaluated on the rest; this process is repeated several times and the results are averaged to obtain a more reliable estimate of the model's performance.
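For time series, the folds must respect temporal order so the model never trains on the future. A minimal sketch using scikit-learn's TimeSeriesSplit, reusing the Isolation Forest from above purely for illustration:

from sklearn.model_selection import TimeSeriesSplit

# Each split trains on an initial segment and tests on the one after it
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    fold_model = IsolationForest(contamination=0.05, random_state=42)
    fold_model.fit(X[train_idx])
    preds = fold_model.predict(X[test_idx])  # -1 = anomaly
    print(f'Fold {fold}: {np.sum(preds == -1)} anomalies in test segment')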

Unsupervised evaluation metrics can also be used when labeled data is not available. These metrics assess anomaly detection models based on properties inherent in the data itself, such as clustering structure or density estimates. Examples include the silhouette score, the Dunn index, and the average nearest-neighbor distance.
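As a rough illustration, the silhouette score can treat the normal and anomalous labels as a two-cluster assignment and measure how well separated they are; reusing the Isolation Forest labels from above is an assumption made for this example:

from sklearn.metrics import silhouette_score

# How well separated are the normal (1) and anomalous (-1) groups?
score = silhouette_score(X, anomalies)
print(f'Silhouette score: {score:.3f}')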

Summary

This article explored various techniques for time series anomaly detection. The data was first preprocessed to handle missing values, smooth the series, and remove obvious outliers. Statistical methods such as z-scores and moving averages were then discussed, followed by machine learning methods including Isolation Forest and LSTM autoencoders.

Anomaly detection is a challenging task that requires a deep understanding of time series data and the use of appropriate techniques to spot unusual patterns and outliers. Remember to try different algorithms, fine-tune their parameters, and evaluate model performance to get the best results.

Editor: Yu Tengkai

Proofreading: Lin Yilin

