time series clustering

Table of contents

Overview of Time Series Clustering

Time series similarity measurement

Dynamic Time Warping

Euclidean distance

Subsequence clustering

Time point clustering

KShape clustering based on time series shape

Clustering based on segmental statistical features

Overview of Time Series Clustering

Time series clustering is an unsupervised learning method that partitions time series data into distinct groups. Clustering aims to find subsets of similar data and place them in the same group. For time series, clustering finds sequences with similar characteristics and groups them together, which helps with data classification and analysis.

There are two types of time series clustering: subsequence clustering and time point clustering. Subsequence clustering clusters the subsequences extracted from a time series with a sliding window; time point clustering groups time points based on a combination of their temporal proximity and the similarity of their values. Time series clustering usually relies on common clustering algorithms such as K-means, KShape, and hierarchical clustering.

Time series similarity measurement

Dynamic Time Warping

Dynamic Time Warping (DTW) is a very effective method for time series similarity matching. It can compute distances between time series of different lengths and speeds.

The core idea of DTW is to align two time series and sum the distances between the aligned points. During alignment, DTW can locally stretch or compress the time axis so that corresponding points in the two sequences are matched. DTW can therefore handle time series of different lengths and sampling rates.

The steps of the DTW algorithm are as follows:

1. Initialize a cumulative cost matrix of size n × m, where n and m are the lengths of the two time series.
2. Fill the matrix by dynamic programming: each cell holds the local distance between the two points plus the smallest cumulative cost among its admissible predecessor cells.
3. Read the DTW distance from the final cell of the matrix; backtracking from that cell recovers the optimal alignment path.
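
The steps above can be sketched in plain NumPy. This is a minimal illustration rather than the fastdtw implementation: the function name dtw_distance is hypothetical, the local distance is assumed to be the absolute difference between scalar values, and no warping-window constraint is applied.

```python
import numpy as np

def dtw_distance(a, b):
    """Return (distance, path) for two 1-D series using the classic O(n*m) recursion."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)

    # Step 1: cumulative cost matrix with an extra border row and column of infinities
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0

    # Step 2: dynamic programming over all cell pairs
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i, j] = local + min(cost[i - 1, j - 1],  # match
                                     cost[i - 1, j],      # advance in a only
                                     cost[i, j - 1])      # advance in b only

    # Step 3: backtrack from the final cell to recover the alignment path
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return cost[n, m], path

distance, path = dtw_distance([1, 2, 3, 4, 5], [1, 2, 2, 4, 4, 5])
print("Distance:", distance)
print("Path:", path)
```

The quadratic cost of this exact recursion is the reason approximate variants such as FastDTW are often preferred for long series.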

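import numpy as np
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw

# Define two time series as column vectors of shape (length, 1), so that
# euclidean receives 1-D points (recent SciPy versions reject scalar inputs)
x = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([1, 2, 2, 4, 4, 5], dtype=float).reshape(-1, 1)

# Run the FastDTW algorithm to align the series and compute the distance
distance, path = fastdtw(x, y, dist=euclidean)

# Print the computed distance and the alignment path
print("Distance: ", distance)
print("Path: ", path)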

Euclidean distance

Euclidean distance is used to measure the similarity between multi-scale statistical features of time series, such as the daily, monthly, and annual mean, variance, median, peak, and skewness.
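
As a rough illustration of this idea, the sketch below summarizes each series by a few statistics and compares the resulting feature vectors with Euclidean distance. The helper name summary_features, the choice of four statistics, and the toy series are assumptions made for the example; the features are standardized first so that no single statistic dominates the distance.

```python
import numpy as np

def summary_features(ts):
    """Describe one series by its mean, variance, median, and peak."""
    ts = np.asarray(ts, dtype=float)
    return np.array([ts.mean(), ts.var(), np.median(ts), ts.max()])

# Toy data: the first two series are similar, the third is very different
series = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 6],
    [10, 50, 20, 80, 30],
]

# One row of features per series
F = np.vstack([summary_features(s) for s in series])

# Standardize each feature column (columns with zero spread are left unscaled)
scale = F.std(axis=0)
scale[scale == 0] = 1.0
F_std = (F - F.mean(axis=0)) / scale

# Pairwise Euclidean distance matrix between the feature vectors
D = np.linalg.norm(F_std[:, None, :] - F_std[None, :, :], axis=-1)
print(np.round(D, 3))
```

The matrix D can be passed to any clustering method that accepts precomputed distances, or F_std can be clustered directly with K-means.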

Subsequence clustering

Time series subsequence clustering is an unsupervised learning method that groups subsequences of time series data by similarity. Unlike whole-series clustering, it does not cluster each entire time series; instead, it clusters the subsequences extracted from the series, typically with a sliding window.
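
To make the sliding-window step concrete, here is a small sketch for a single univariate series. The window length of 10 and stride of 5 are illustrative values; sliding_window_view requires NumPy 1.20 or later. A series of length T yields floor((T - w) / s) + 1 windows of length w with stride s.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(40.0)      # one series with 40 time steps
window_len, stride = 10, 5

# All windows of length 10, then keep every 5th starting position
windows = sliding_window_view(series, window_shape=window_len)[::stride]

print(windows.shape)          # (number of windows, window length)
print(windows[:2])            # the first two windows start at t=0 and t=5
```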

The code below defines the number of samples, the number of time steps, and the feature dimension, and generates random time series data. It then defines the subsequence length and stride and uses the extract_subsequences() function to extract subsequences from the original data. Finally, it sets the number of clusters to 3, clusters the subsequences with the k-means algorithm, obtains the cluster label of each subsequence, and prints the number of subsequences in each cluster.

import numpy as np
from sklearn.cluster import KMeans

def extract_subsequences(X, subseq_len, stride):
    """Slide a window over every series and return the flattened subsequences."""
    n_samples, n_timesteps, n_features = X.shape
    subsequences = []
    for i in range(n_samples):
        for j in range(0, n_timesteps - subseq_len + 1, stride):
            subseq = X[i, j:j+subseq_len, :]
            subsequences.append(subseq.ravel())
    return np.array(subsequences)

# Define the number of samples, time steps, and feature dimensions
n_samples, n_timesteps, n_features = 100, 40, 5

# Generate random time series data
X = np.random.rand(n_samples, n_timesteps, n_features)

# Extract the subsequences
subseq_len, stride = 10, 5
X_subseq = extract_subsequences(X, subseq_len=subseq_len, stride=stride)

# Define the number of clusters
n_clusters = 3

# Cluster the subsequences with the k-means algorithm
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_subseq)

# Get the cluster label of each subsequence
labels = kmeans.labels_

# Print the number of subsequences in each cluster
unique, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique, counts)))

Time point clustering

KShape clustering based on time series shape

KShape is a time series clustering algorithm that partitions time series into clusters so that series within the same cluster are as similar in shape as possible. KShape measures the similarity between two time series with the shape-based distance (SBD), which is derived from the normalized cross-correlation of the z-normalized series: the cross-correlation is evaluated over all shifts, so the two series are compared at their best-matching shift rather than warped by dynamic programming as in DTW. The main parameters of KShape include the number of clusters (n_clusters), the maximum number of iterations (max_iter), and the convergence tolerance (tol), which the user sets according to the data. To use KShape, install the tslearn library and call the tslearn.clustering.KShape class.
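
The shape distance can be sketched in a few lines of NumPy. This is an illustrative version, not the tslearn implementation: the function names z_normalize and shape_based_distance are hypothetical, the cross-correlation is computed directly with np.correlate instead of an FFT, and constant series are rejected because they cannot be z-normalized.

```python
import numpy as np

def z_normalize(ts):
    """Rescale a series to zero mean and unit variance."""
    ts = np.asarray(ts, dtype=float)
    std = ts.std()
    if std == 0:
        raise ValueError("constant series cannot be z-normalized")
    return (ts - ts.mean()) / std

def shape_based_distance(x, y):
    """1 minus the maximum normalized cross-correlation over all shifts (range 0 to 2)."""
    x, y = z_normalize(x), z_normalize(y)
    cc = np.correlate(x, y, mode="full")          # cross-correlation at every shift
    return 1.0 - cc.max() / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([0, 0, 1, 2, 1, 0, 0, 0])            # a bump near the start
b = np.array([0, 0, 0, 0, 1, 2, 1, 0])            # the same bump, shifted right
print(shape_based_distance(a, a))                 # identical shapes
print(shape_based_distance(a, b))                 # same shape, different position
print(shape_based_distance(a, -a))                # inverted shape
```

The shifted copy stays close to the original while the inverted copy does not, which is the behavior that makes this distance suitable for grouping series by shape.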

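Two code examples for shape-based clustering are provided below. The first applies TimeSeriesKMeans to a symbolic (SAX) representation of the z-normalized series; the second applies KMedoids to a precomputed soft-DTW distance matrix.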

import numpy as np
import matplotlib.pyplot as plt
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.piecewise import SymbolicAggregateApproximation
from tslearn.clustering import TimeSeriesKMeans

# Load the dataset and preprocess the time series
X_train, y_train, _, _ = CachedDatasets().load_dataset("Trace")
X_train = TimeSeriesScalerMeanVariance().fit_transform(X_train[:6])
X_train = SymbolicAggregateApproximation(n_segments=10, alphabet_size_avg=2).fit_transform(X_train)

# Define and fit the KMeans clusterer
km = TimeSeriesKMeans(n_clusters=3, verbose=True, random_state=42)
y_pred = km.fit_predict(X_train)

# Visualize the clustering result: members in black, cluster center in red
plt.figure(figsize=(10, 10))
for yi in range(3):
    plt.subplot(3, 1, 1 + yi)
    for xx in X_train[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(km.cluster_centers_[yi].ravel(), "r-")
    plt.xlim(0, X_train.shape[1])
    plt.ylim(-4, 4)
    plt.title("Cluster %d" % (yi + 1))

plt.tight_layout()
plt.show()

import numpy as np
import tslearn.metrics as metrics
from sklearn_extra.cluster import KMedoids
from tslearn.clustering import silhouette_score
from tslearn.generators import random_walks

# X: the time series matrix to cluster, shape (n_series, length, n_dims).
# Random walks are used here only as placeholder data; substitute your own series.
X = random_walks(n_ts=60, sz=32, d=1, random_state=0)

# Split the series into 6 clusters
num_cluster = 6

# metric="precomputed" lets KMedoids work on a user-supplied distance matrix
km = KMedoids(n_clusters=num_cluster, random_state=0, metric="precomputed")

# Build the pairwise distance matrix dists with a DTW variant from tslearn
# dists = metrics.cdist_dtw(X)  # DTW
dists = metrics.cdist_soft_dtw_normalized(X, gamma=0.5)  # normalized soft-DTW

# Predict the cluster labels
y_pred = km.fit_predict(dists)

# Compute the silhouette score (a precomputed matrix needs a zero diagonal)
np.fill_diagonal(dists, 0)
score = silhouette_score(dists, y_pred, metric="precomputed")

print(X.shape)
print(y_pred.shape)
print("silhouette_score: " + str(score))

Clustering based on segmental statistical features

Time series statistical characteristics:

1. Whole-series statistics: the maximum, minimum, mean, variance, skewness, kurtosis, and entropy of the entire time series.
2. Segment-level statistics: the series is split by week, month, year, or a user-defined segment length, and the maximum, minimum, mean, variance, skewness, kurtosis, entropy, and other statistics are computed for each segment.
3. Features extracted automatically with tsfresh.
4. Select appropriate features from methods 1, 2, and 3, then cluster them with K-means, hierarchical clustering, density-based clustering, or similar methods.
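
A minimal sketch of the segment-level approach in method 2 is shown below. The function name segment_features is hypothetical, the segments are assumed to be equal-sized chunks rather than calendar periods, and only four statistics are computed per segment to keep the example short.

```python
import numpy as np

def segment_features(ts, n_segments):
    """Split a series into n_segments chunks and concatenate per-chunk statistics."""
    ts = np.asarray(ts, dtype=float)
    feats = []
    for seg in np.array_split(ts, n_segments):
        feats.extend([seg.max(), seg.min(), seg.mean(), seg.var()])
    return np.array(feats)

# A series of 12 points split into 3 segments gives 3 * 4 = 12 features
features = segment_features(np.arange(12), n_segments=3)
print(features.shape)
print(features[:4])   # max, min, mean, variance of the first segment
```

Stacking one such vector per series produces a feature matrix that K-means or hierarchical clustering can consume directly.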

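Code for computing the kurtosis of a time series

import pandas as pd

# Read the time series data; the first column must be the time and the second the value
data = pd.read_csv('time_series.csv', parse_dates=['time'], index_col='time')

# Compute the kurtosis of the first value column
kurtosis = data.kurtosis().iloc[0]

print("Kurtosis of the time series:", kurtosis)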

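Code for computing the skewness of a time series

import pandas as pd

# Read the time series data; the first column must be the time and the second the value
data = pd.read_csv('time_series.csv', parse_dates=['time'], index_col='time')

# Compute the skewness of the first value column
skewness = data.skew().iloc[0]

print("Skewness of the time series:", skewness)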

Computing the entropy of a time series:

t1 =[1,3,1,3,1,3,1,3]

t2 =[1,3,3,1,1,1,3,3]

The sequences t1 and t2 contain the same values but differ in structure: t1 is regular, while t2 is relatively disordered. Entropy can be used to quantify this certainty and uncertainty; the following code computes the entropy of a time series.

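import pandas as pd
from pyentrp import entropy as ent

# Read the time series data; the first column must be the time and the second the value
data = pd.read_csv('time_series.csv', parse_dates=['time'], index_col='time')

# Convert the time series to a one-dimensional array and compute its entropy
ts = data['value'].values
entropy = ent.shannon_entropy(ts)
print("Entropy of the time series:", entropy)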
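
One caveat is worth checking by hand: Shannon entropy over value frequencies does not depend on the order of the values, so it gives the same result for t1 and t2, which each contain four 1s and four 3s. An order-aware measure is needed to separate them. The sketch below uses the entropy of consecutive value pairs as one simple option; this choice and the function names are assumptions made for illustration, and measures such as permutation entropy or sample entropy are common alternatives.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy in bits of the empirical distribution of the symbols."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def block_entropy(ts, block_len=2):
    """Shannon entropy of the overlapping blocks of block_len consecutive values."""
    blocks = [tuple(ts[i:i + block_len]) for i in range(len(ts) - block_len + 1)]
    return shannon_entropy(blocks)

t1 = [1, 3, 1, 3, 1, 3, 1, 3]
t2 = [1, 3, 3, 1, 1, 1, 3, 3]

# Value-frequency entropy: identical for both sequences
print(shannon_entropy(t1), shannon_entropy(t2))

# Pair entropy: lower for the regular t1, higher for the disordered t2
print(block_entropy(t1), block_entropy(t2))
```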

tsfresh automatically computes a large number of time series features and also provides powerful feature selection algorithms.

# Time series feature extraction
from tsfresh import extract_features
from tsfresh.feature_extraction import ComprehensiveFCParameters, EfficientFCParameters
from tsfresh.utilities.dataframe_functions import impute

# data is expected in long format, with an "id" column identifying each series
# and a "time" column used for sorting

# extraction_settings = EfficientFCParameters()  # skips the features flagged as computationally expensive
# ComprehensiveFCParameters computes all default features, including multiple
# parameter combinations for the parameterized features
extraction_settings = ComprehensiveFCParameters()

X = extract_features(data, column_id="id", column_sort='time',
                     default_fc_parameters=extraction_settings,
                     # impute replaces NaN and infinite feature values
                     impute_function=impute)
X.head()

K-means clustering

The following function selects the value of k for K-means by maximizing the silhouette score.

import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Function for selecting k for K-means

def get_silhouette_K(data, range_K):
    """Try k = 2 .. range_K - 1 and return the k with the highest silhouette score."""
    K = range(2, range_K)
    Scores = []
    for k in K:
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
        kmeans.fit(data)
        Scores.append(silhouette_score(data, kmeans.labels_, metric='euclidean'))

    max_idx = Scores.index(max(Scores))
    best_k = K[max_idx]
    plt.plot(K, Scores, 'bx-')
    plt.xlabel('k')
    plt.ylabel('silhouette')
    plt.title('Selecting k with the silhouette Method')
    plt.show()
    return best_k
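
To show what the silhouette score rewards, the sketch below recomputes it by hand for a toy dataset. The function name silhouette_numpy is hypothetical and the code assumes Euclidean distance and at least two clusters; for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b).

```python
import numpy as np

def silhouette_numpy(X, labels):
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        a = D[i, same].sum() / (n_same - 1)       # excludes the point itself
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_numpy(points, [0, 0, 1, 1])     # groups the nearby points
bad = silhouette_numpy(points, [0, 1, 0, 1])      # mixes distant points
print(good, bad)
```

A labeling that keeps nearby points together scores close to 1, while one that mixes distant points scores below 0, which is why the function above picks the k with the highest score.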