A Survey of Self-Supervised Learning for Time Series

Self-supervised learning (SSL) is a machine learning approach that has recently achieved impressive performance on a variety of time series tasks. The most notable advantage of SSL is its reduced reliance on labeled data: based on pre-training and fine-tuning strategies, high performance can be achieved even with only a small amount of labeled data.

Today, I share a review article on self-supervised learning for time series by researchers from Zhejiang University, Alibaba, and other institutions. The article reviews existing surveys related to SSL and time series, provides an overview of existing time series SSL methods, and proposes a new taxonomy (Figure 1). The authors group self-supervised learning for time series analysis into three categories: generative-based, contrastive-based, and adversarial-based; all methods can be further divided into ten subclasses. To facilitate the experimentation and validation of time series SSL methods, the paper also summarizes commonly used datasets for time series forecasting, classification, anomaly detection, and clustering tasks.

Figure 1: Proposed time-series SSL taxonomy.

Time series self-supervised learning (SSL) methods can generally be divided into three categories, and the model architectures of these categories are shown in Figure 2:

  1. Generation-based methods: This approach first uses an encoder to map an input x to a representation z, and then uses a decoder to reconstruct x from z. The training objective is to minimize the reconstruction error between the input x and the reconstructed input x̂.

  2. Contrast-based methods: This approach is one of the most widely used SSL strategies. It builds positive and negative examples through data augmentation or context sampling, and the model is then trained by maximizing the mutual information (MI) between two positive samples. Contrast-based methods usually use a contrastive similarity measure such as the InfoNCE loss; a minimal sketch of this loss is given after this list.

  3. Adversarial-based methods: This method usually consists of a generator and a discriminator. The generator generates fake samples, while the discriminator is used to distinguish them from real samples.
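
As a concrete illustration of the contrastive objective mentioned in category 2, below is a minimal sketch of an InfoNCE-style loss in PyTorch. The batch layout (the i-th rows of the two view matrices form a positive pair, all other rows act as negatives) and the temperature value are illustrative assumptions, not details taken from the survey.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_view1, z_view2, temperature=0.1):
    """InfoNCE loss: row i of z_view1 and row i of z_view2 are a positive pair;
    every other row in the batch serves as a negative."""
    z_view1 = F.normalize(z_view1, dim=-1)   # (B, D)
    z_view2 = F.normalize(z_view2, dim=-1)   # (B, D)
    logits = z_view1 @ z_view2.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(z_view1.size(0), device=z_view1.device)
    return F.cross_entropy(logits, labels)

# usage: encode two views of the same batch of series to (B, D) and compare them
# loss = info_nce_loss(encoder(view1), encoder(view2))
```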

Figure 2: Learning paradigm for SSL.

1 Generation-based methods

In this category, the pretext task is to generate the expected data from a given view of the data. In the context of time series modeling, commonly used pretext tasks include using the past series to predict future time windows or specific timestamps, using encoders and decoders to reconstruct the input, and predicting the masked part of a time series from its visible part.

This section organizes existing self-supervised representation learning methods for time series modeling from three perspectives: autoregressive-based forecasting, autoencoder-based reconstruction, and diffusion-based generation (Figure 3). It should be noted that autoencoder-based reconstruction is also commonly treated as an unsupervised framework; in the context of SSL, the reconstruction task serves as the pretext task, with the ultimate goal of obtaining representations through the autoencoder model.


Figure 3: Three categories of generation-based time series SSL.

1.1 Autoregressive-based forecasting

The autoregressive-based forecasting (ARF) task is a time series forecasting task whose goal is to predict a window of length K using the sequence before timestamp t. When K = 1, the ARF task is a single-step forecasting problem, i.e., predicting the value of the next time step; when K > 1, it is a multi-step forecasting problem, i.e., predicting the values of multiple future time steps.

The mathematical expression of the ARF task is formula (1):

$$\hat{X}_{[t+1:t+K]} = f(X_{[1:t]}) \tag{1}$$

where X[1:t] denotes the sequence before timestamp t and X̂[t+1:t+K] denotes the predicted target window. The prediction model f(·) usually adopts an autoregressive structure, that is, the output at the current step is fed back as the input at the next step, and so on. Application scenarios of the ARF task include stock price forecasting, weather forecasting, traffic flow forecasting, etc.

Related research and applications of the ARF task: the ARF task can be learned in an unsupervised manner by autoencoder-style models, resulting in better time series representations; in addition, it can be combined with other tasks such as anomaly detection, classification, and clustering.
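
As a sketch of how such an autoregressive forecasting pretext task can be set up (the GRU encoder, the univariate input, and the horizon K are illustrative assumptions rather than a specific method from the survey):

```python
import torch
import torch.nn as nn

class ARForecaster(nn.Module):
    """Predict the next K steps of a univariate series from its past (pretext task)."""
    def __init__(self, hidden_dim=64, horizon_k=3):
        super().__init__()
        self.encoder = nn.GRU(input_size=1, hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, horizon_k)

    def forward(self, x):              # x: (batch, t, 1) = X[1:t]
        _, h = self.encoder(x)         # h: (1, batch, hidden_dim), the representation z
        return self.head(h[-1])        # (batch, K) = X_hat[t+1 : t+K]

# pretext training: minimize the error between the predicted and the true future window
# loss = nn.functional.mse_loss(model(x_past), x_future)
```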

1.2 Autoencoder-based reconstruction

An autoencoder is an unsupervised artificial neural network consisting of two parts: an encoder and a decoder [56]. The encoder maps the input data X to a low-dimensional representation Z, and the decoder maps Z back to the original data space; the output of the decoder is the reconstructed input X̃. The process can be expressed as:

$$Z = \mathrm{Encoder}(X), \qquad \tilde{X} = \mathrm{Decoder}(Z)$$

The goal of an autoencoder is to minimize the reconstruction error, i.e., the difference between the input data and the reconstructed data. For time series data, autoencoders can be used for reconstruction and representation learning, thereby improving the expressiveness of the learned features and downstream prediction performance.
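
A minimal reconstruction pretext task might look like the following sketch; the 1-D convolutional encoder/decoder and the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TimeSeriesAE(nn.Module):
    """Encoder maps X to a low-dimensional Z; decoder reconstructs X_tilde from Z."""
    def __init__(self, in_channels=1, latent_channels=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, latent_channels, kernel_size=3, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(latent_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, in_channels, kernel_size=3, padding=1),
        )

    def forward(self, x):              # x: (batch, channels, length)
        z = self.encoder(x)            # representation used for downstream tasks
        return self.decoder(z)         # X_tilde

# training objective: reconstruction error
# loss = nn.functional.mse_loss(model(x), x)
```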

Several variants of the autoencoder model exist, for example denoising autoencoders, spectral-analysis autoencoders, and temporal-clustering-friendly representation learning models. These variants improve the performance and applicability of the basic autoencoder by introducing additional constraints and loss terms. For example, a denoising autoencoder improves the robustness and generalization of the model by adding noise to the input data; a spectral-analysis autoencoder improves the frequency-domain representation ability by introducing spectral constraints into the loss function; and a temporal-clustering-friendly representation learning model improves clustering performance by introducing clustering constraints into the loss function.

Application scenarios of autoencoder models for time series data include signal processing, image processing, speech recognition, natural language processing, etc. Autoencoder models have achieved some success in these domains and still hold great promise for future research.

1.3 Diffusion model-based generation

The diffusion model is a probability-based generative model whose core idea is to realize sample generation through two mutually inverse processes. Specifically, a diffusion model contains a forward process and a reverse process. The forward process injects random noise into the data and performs state transitions step by step until the data is reduced to (approximately) pure noise. The reverse process generates samples starting from the noise distribution by applying the inverse state transitions. The reverse transition kernel is the key to the reverse process, but it is usually intractable; therefore, the diffusion model learns to approximate the reverse transition kernel with a deep neural network so as to generate samples efficiently.

Currently, there are three main formulations of diffusion models: denoising diffusion probabilistic models (DDPMs), score-matching diffusion models, and score-based stochastic differential equation (SDE) models. DDPMs approximate the reverse transition kernel by learning to denoise, score-matching diffusion models approximate it by matching the gradient of the log-density (the score), and SDE-based models approximate it through stochastic differential equations. All of these formulations address the intractability of the reverse transition kernel and thereby enable efficient sample generation.
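
As an example of the DDPM formulation, the denoising training objective can be sketched as follows; the linear noise schedule, the number of steps T, and the generic noise_predictor(x_t, t) interface are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative products of (1 - beta_t)

def ddpm_training_loss(noise_predictor, x0):
    """Sample a step t, noise x0 with the forward process, and train the network
    to predict the injected noise (this is how the reverse kernel is learned)."""
    b = x0.size(0)
    t = torch.randint(0, T, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a_bar = alphas_bar.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps   # forward (noising) process
    eps_hat = noise_predictor(x_t, t)                      # assumed interface: predicts eps
    return F.mse_loss(eps_hat, eps)
```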

Diffusion models have achieved great success in areas such as image synthesis, video generation, speech generation, bioinformatics, and natural language processing. They are powerful generative models that can be used for data generation and modeling in many fields. Their advantages include good generation quality, fast generation speed, strong scalability, and good interpretability. Therefore, diffusion models have attracted much attention in the field of deep learning and have become an important class of generative models.

2 Contrast-based methods

Contrastive learning is a self-supervised learning strategy that has shown powerful learning capabilities in computer vision and natural language processing. Unlike other models, contrastive learning methods learn data representations by comparing positive and negative samples, where positive samples should be similar and negative samples should be different. Therefore, the selection of positive and negative samples is very important for contrastive learning methods. Figure 4 demonstrates the five categories of contrastive-based self-supervised learning of time series.


Figure 4: Five categories of contrastive-based self-supervised learning of time series

2.1 Sampling contrast method

The sampling contrast method divides the time series into multiple fixed-length subsequences, then randomly selects two different sampling points from each subsequence as positive samples and a sampling point from another subsequence as a negative sample. By contrasting positive and negative samples, the sampling contrast method learns a representation of the time series. The method follows a widely used assumption in time series analysis, namely that adjacent time windows or timestamps have a high degree of similarity, so positive and negative samples are sampled directly from the raw time series.
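
A sketch of this sampling strategy under the adjacency assumption is shown below; the window length and the use of the other series in the batch as negatives are illustrative assumptions.

```python
import torch

def sample_contrastive_windows(batch, window=64):
    """batch: (B, L, C). Returns anchor, positive (adjacent window of the same series),
    and negative (same position, a different series in the batch)."""
    B, L, _ = batch.shape
    start = torch.randint(0, L - 2 * window, (1,)).item()
    anchor   = batch[:, start:start + window]                         # (B, window, C)
    positive = batch[:, start + window:start + 2 * window]            # adjacent -> similar
    negative = batch.roll(shifts=1, dims=0)[:, start:start + window]  # other series -> dissimilar
    return anchor, positive, negative
```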

The sampling contrast method follows the most commonly used assumptions in time series analysis. It has a simple principle and can model local correlations well, and for some time series datasets, sampling contrast methods can achieve good performance. However, its disadvantage is that spurious negative pairs may be introduced when analyzing long-term dependencies, leading to suboptimal representations. Therefore, sampling contrast methods may not be optimal when dealing with long-term dependencies. In addition, sampling contrast methods require selection of an appropriate subsequence length and number of sampling points, which may require some experience and adjustments.

2.2 Prediction contrast method

The predictive contrast method learns representations of time series by predicting future information in the series. Specifically, the method divides the time series into multiple fixed-length subsequences, then takes the last time step of each subsequence as the target and the remaining time steps as the context. The model is trained to predict the value at the target time step, while the target time steps of the other subsequences are used as negative samples. By contrasting positive and negative samples, the predictive contrast method learns meaningful and informative representations of the time series.
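
In the spirit of contrastive predictive coding, this idea can be sketched as follows; the GRU context network, the single prediction step, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveContrast(nn.Module):
    """Predict the representation of a future step and contrast it against
    the corresponding steps of other sequences in the batch (negatives)."""
    def __init__(self, feat_dim=1, hidden=64):
        super().__init__()
        self.context = nn.GRU(feat_dim, hidden, batch_first=True)
        self.target_enc = nn.Linear(feat_dim, hidden)
        self.predictor = nn.Linear(hidden, hidden)

    def forward(self, past, future_step):         # past: (B, t, F), future_step: (B, F)
        _, h = self.context(past)
        pred = self.predictor(h[-1])               # predicted future representation (B, H)
        target = self.target_enc(future_step)      # encoded true future step        (B, H)
        logits = F.normalize(pred, dim=-1) @ F.normalize(target, dim=-1).t()
        labels = torch.arange(past.size(0), device=past.device)
        return F.cross_entropy(logits / 0.1, labels)
```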

The advantage of the predictive contrast method is that it can learn meaningful and informative representations that capture important features and patterns in the data. This method pays more attention to slowly changing trends in time series data and can extract slow features; it is also simple to understand and implement. However, its downside is that it focuses primarily on local information and may not accurately model long-term dependencies in time series data. In addition, the method is sensitive to noise and outliers, which may affect the representation ability and generalization performance of the model. Therefore, the predictive contrast method may not be the best choice when dealing with time series data with complex long-term dependencies.

2.3 Augmentation contrast method

The augmentation contrast method is a commonly used contrastive learning framework that generates different views of the input samples through data augmentation, and then learns representations by maximizing the similarity between views of the same sample and minimizing the similarity between views of different samples. Specifically, each input sample is augmented into two views, and a neural network learns to map the two views into the same representation space. The network is trained with a contrastive loss, under which, for each sample, it learns to distinguish that sample's views from those of other samples.
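
A sketch of producing two augmented views of a batch of time series is shown below; jittering and scaling are commonly used time series augmentations, and the noise levels here are illustrative assumptions.

```python
import torch

def jitter(x, sigma=0.03):
    """Add Gaussian noise at every timestamp."""
    return x + sigma * torch.randn_like(x)

def scale(x, sigma=0.1):
    """Multiply each series by a random per-channel scaling factor."""
    factor = 1.0 + sigma * torch.randn(x.size(0), 1, x.size(2), device=x.device)
    return x * factor

def two_views(x):                      # x: (B, L, C)
    """Two stochastic views of the same batch; views of the same sample form positive pairs."""
    return jitter(scale(x)), jitter(scale(x))

# the two views are then encoded and trained with a contrastive loss such as InfoNCE
```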

The advantage of the augmentation contrast method is that it is easy to implement and understand, and it is applicable to various types of time series modeling tasks. In addition, it can increase the diversity of the data through various augmentation techniques, thereby improving the generalization performance of the model. However, handling temporal dependencies remains a challenge, since the essence of augmentation-based contrast lies in distinguishing the feature representations of positive and negative pairs rather than explicitly capturing temporal dependencies. Choosing an appropriate augmentation method for time series data is also a challenging problem. Furthermore, sampling bias is another concern, as it may lead to spurious negative samples that affect the performance of the model.

2.4 Prototype contrast method

The prototype contrast method is a contrastive learning framework based on clustering constraints, which learns representations of time series data by comparing samples with cluster centers. This reduces the amount of computation and encourages samples to form a clustering-friendly distribution in the feature space. Specifically, the prototype contrast method divides the samples into different clusters, takes the cluster centers as prototypes, and then compares the samples with the prototypes to learn the representation of the time series data. This can be achieved with a contrastive loss function under which each sample is pulled toward its own prototype and pushed away from the other prototypes.
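
The prototype-based comparison can be sketched as follows; the learnable prototypes, the fixed number of prototypes, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeContrast(nn.Module):
    """Contrast each sample representation against a set of learnable cluster prototypes."""
    def __init__(self, dim=64, n_prototypes=10, temperature=0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))
        self.temperature = temperature

    def forward(self, z, assignments):
        # z: (B, D) sample representations; assignments: (B,) pseudo-labels from clustering
        z = F.normalize(z, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        logits = z @ protos.t() / self.temperature     # similarity to every prototype
        return F.cross_entropy(logits, assignments)    # pull each sample toward its prototype
```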

Prototype comparison methods introduce the concept of prototypes so that samples can be assigned to a limited number of categories. This method exploits high-level semantic information and encourages samples to present a cluster distribution in the feature space instead of a uniform distribution, which is more in line with the real data distribution. However, the main problem with this method is that the number of prototypes needs to be determined in advance, which still requires some prior information.

2.5 Expert knowledge contrast method

The expert knowledge contrast method is a relatively new representation learning framework that introduces prior knowledge into contrastive learning to help the model choose correct positive and negative samples. For example, during training, an anchor sample, a positive sample, and several negative samples can be selected. The network then learns to treat the anchor as similar to the positive sample and to distinguish the anchor from the negative samples. This can be achieved with a contrastive loss function under which, for each anchor, the network learns to pull it toward the correct positive samples and push it away from the negatives.
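
One minimal way such expert-selected anchor/positive/negative indices could enter the loss is sketched below; the triplet margin formulation and the idea that the indices come from a domain-specific similarity rule are assumptions for illustration, not the specific methods discussed in the survey.

```python
import torch
import torch.nn.functional as F

def expert_guided_triplet_loss(z, pos_idx, neg_idx, margin=1.0):
    """z: (B, D) representations. pos_idx / neg_idx are per-anchor indices chosen
    from prior knowledge (e.g. a domain-specific similarity measure)."""
    anchor, positive, negative = z, z[pos_idx], z[neg_idx]
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)
```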

The characteristic of the expert knowledge comparison method is that the prior knowledge or information of domain experts can be introduced into the deep neural network to guide the selection of positive and negative samples or the measurement of similarity. Its main advantage lies in the ability to accurately select positive and negative samples. However, it is limited by the need to provide reliable prior knowledge. Obtaining reliable prior knowledge for time series data is not easy in most cases. Incorrect or misleading knowledge can lead to biased representations.

3 Adversarial-based methods

Adversarial-based methods exploit generative adversarial networks (GANs) to construct pretext tasks. A GAN consists of a generator G and a discriminator D. The generator G is responsible for generating synthetic data similar to the real data, while the discriminator D is responsible for determining whether a given sample is real or synthetic. Thus, the goal of the generator is to maximize the discriminator's failure rate, while the goal of the discriminator is to minimize it. The generator G and the discriminator D are therefore engaged in a mutual game, and learning proceeds by optimizing the following minimax objective over G and D:

$$\min_G \max_D \; \mathcal{L}(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
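
A sketch of one alternating training step under this objective is shown below; the generator and discriminator architectures are left abstract, and the assumption that the discriminator outputs a probability in [0, 1] is for illustration.

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, real, opt_g, opt_d, noise_dim=32):
    """One alternating GAN update; assumes the discriminator outputs a probability in [0, 1]."""
    ones = torch.ones(real.size(0), 1, device=real.device)
    zeros = torch.zeros(real.size(0), 1, device=real.device)

    z = torch.randn(real.size(0), noise_dim, device=real.device)
    fake = generator(z)

    # discriminator update: real samples -> 1, generated samples -> 0
    d_loss = F.binary_cross_entropy(discriminator(real), ones) + \
             F.binary_cross_entropy(discriminator(fake.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator update: try to make the discriminator output 1 on generated samples
    g_loss = F.binary_cross_entropy(discriminator(fake), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```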

According to the final task, existing adversarial-based representation learning methods can be divided into time series generation and imputation, and auxiliary representation augmentation . Figure 5 shows a schematic diagram of adversarial self-supervised learning of time series.


Figure 5: Three categories of adversarial self-supervised learning of time series

3.1 Time series generation and imputation

In terms of time series generation, using a Transformer instead of an RNN can better handle long-term dependencies and improve efficiency. Li et al. proposed Context-FID, a new metric for evaluating the quality of generated sequences. Li et al. also explored the generation of time series data with irregular spatio-temporal relationships and proposed TTS-GAN, which uses Transformers instead of RNNs to build the discriminator and generator and treats time series data as images of height 1.

In terms of time series imputation, Luo et al. treat missing value imputation as a data generation task and use GANs to learn the distribution of the training dataset. To better capture the dynamic characteristics of time series, they proposed the GRUI module. In addition, auxiliary representation enhancement has been introduced, which can improve the model's robustness and generalization ability.

The advantage of adversarial-based methods is that high-quality time series samples can be generated, and imputation or generation tasks can be performed according to the seasonality and trend of different time series data, thereby improving the coherence and rationality of the results. In addition, many efficient adversarial-based methods have been applied in the field of image generation, which can be transferred and applied to time-series data generation or imputation tasks. The disadvantage is that the training process of GAN is relatively complex and requires a trade-off between the generator and the discriminator, which may require more training time and computing resources, and may lead to unstable training.

3.2 Auxiliary representation enhancement

In addition to generation and imputation tasks, adversarial-based representation learning strategies can also be added to existing learning frameworks as an additional auxiliary learning module, which we refer to as adversarial-based auxiliary representation enhancement. It aims to help the model learn more informative representations for downstream tasks by adding an adversarial-based learning objective, and is usually defined as:

$$\mathcal{L} = \mathcal{L}_{base} + \mathcal{L}_{adv}$$

where Lbase is the base learning objective and Ladv is the additional adversarial-based learning objective. It should be noted that when Ladv is not available, the model can still extract representations from the data, so Ladv is considered as an auxiliary learning objective.

USAD [63] is a time series anomaly detection framework that contains two BAE models, denoted AE1 and AE2. The core idea behind USAD is to amplify the reconstruction error through adversarial training between the two BAEs: AE1 is regarded as a generator and AE2 as a discriminator. The auxiliary goal is to use AE2 to distinguish real data from data reconstructed by AE1, while AE1 is trained to deceive AE2. The whole process can be expressed as:

$$\min_{AE_1} \max_{AE_2} \; \big\lVert W - AE_2\big(AE_1(W)\big) \big\rVert_2 \tag{30}$$

where W is the actual input sequence. Similar to USAD, AnomalyTrans [155] also uses an adversarial strategy to amplify the anomaly scores of anomalies. But unlike (30), which uses the reconstruction error, AnomalyTrans defines a prior association and a series association, and then uses the Kullback-Leibler divergence to measure the discrepancy between the two associations.

DUBCN [156] and CRLI [157] are used for sequence retrieval and clustering tasks, respectively. Both methods adopt RNN-based BAE as the model, and add cluster-based loss and adversarial-based loss to the basic reconstruction loss, namely:

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda_1 \mathcal{L}_{cluster} + \lambda_2 \mathcal{L}_{adv}$$

where λ1 and λ2 are the weight coefficients of the auxiliary objectives.

Adversarial based methods are also effective in other time series modeling tasks. For example, introducing adversarial training in time series forecasting can improve accuracy and capture long-term recurring patterns, such as AST [158] and ACT [159]. BeatGAN [160] introduces adversarial representation learning in the task of abnormal heartbeat detection from ECG data and provides an interpretable detection framework. In modeling behavioral data, Activity2vec [161] uses adversarial-based training to model target invariance and enhance the representation ability of the model in different behavioral stages.

Adversarial methods can help the model learn more robust representations, thereby improving the generalization ability of the model. By introducing adversarial signals, the model can better fit the training data and resist disturbances or attacks. However, introducing an adversarial method as a regularization term in the loss function increases the complexity of the training process. The competition between training the generator and the discriminator needs to be carefully balanced, which may require more training time and computing resources. This can even lead to unstable training.

4 Applications and Datasets

Self-supervised learning (SSL) has broad applications in various time series tasks, such as anomaly detection, forecasting, classification, and clustering.

Table 2: Summary of time series applications and widely used datasets


Anomaly detection. The main task of time series anomaly detection is to identify abnormal time points or abnormal time series relative to given norms or normal signals. Since it is challenging to obtain labels for anomalous data, most time series anomaly detection methods employ unsupervised learning frameworks. Among the many modeling strategies, autoregressive-based prediction and autoencoder-based reconstruction are the most commonly used methods.
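
For example, with a reconstruction-based model, a per-window anomaly score can be computed as the reconstruction error and thresholded; the quantile-based decision rule below is an illustrative assumption.

```python
import torch

def anomaly_scores(model, windows):
    """windows: (N, C, L). Score = per-window reconstruction error of a trained autoencoder."""
    with torch.no_grad():
        recon = model(windows)
        return ((windows - recon) ** 2).mean(dim=(1, 2))   # higher error -> more anomalous

# a simple decision rule: flag windows whose score exceeds a quantile of the training scores
# threshold = torch.quantile(anomaly_scores(model, train_windows), 0.99)
# is_anomaly = anomaly_scores(model, test_windows) > threshold
```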

Forecasting. Time series forecasting is a statistical and modeling technique that analyzes time series data to predict values for future time windows or points in time. The autoregressive forecasting (ARF) pretext task is itself a time series forecasting task.

Classification and clustering. The goal of classification and clustering is to identify the true class to which a particular time series sample belongs. Since the core of contrastive-based self-supervised learning is to distinguish positive from negative samples, it is the best choice for these two tasks.

In summary, generation-based methods are more suitable for anomaly detection and prediction tasks, while contrast-based methods are more suitable for classification and clustering tasks. Adversarial-based methods can be useful in various tasks, but in most cases they are used as an additional regularization term to ensure that the features extracted by the model are more robust and informative. Usually, a mixture of multiple self-supervised methods is a better choice.
