Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting Paper Reading

Title: Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting

Publication: NeurIPS

Author: Tsinghua University

Published Date: 2022

Page: 1~10 (article content), 10~22 (details)

Score: Excellent

Github: thuml/Nonstationary_Transformers, code release for "Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting" (NeurIPS 2022), https://arxiv.org/abs/2205.14415, https://github.com/thuml/Nonstationary_Transformers


Research Background

Thanks to their global modeling ability, Transformers are currently very strong at time series forecasting, but they degrade severely on non-stationary datasets whose joint distribution changes over time, so a general framework called Non-stationary Transformers is proposed. The central problem this paper solves is how to stationarize the series for predictability without losing the features that stationarization removes.

Keywords: non-stationary time series, Transformers

method and nature

The research object is time series forecasting task.

Stationarizing the series (e.g., series decomposition or normalization) is a common technique in time series forecasting: a stationarized series is easier to predict, which generally improves forecasting performance. However, this paper argues that stationarization can go too far, causing over-stationarization, in which the Attention Maps learned by the Transformer become very similar for series with quite different characteristics. The paper illustrates this with three series from different time periods: when the Transformer is applied to the raw series, their Attention Maps are clearly different, but after stationarization they become nearly identical, which is over-stationarization.

So the question arises: how can stationarization be used to improve predictability while avoiding the over-stationarization problem?

This paper proposes a framework consisting of two main modules: a Series Stationarization module that stationarizes the series (improving predictability), and a De-stationary Attention mechanism that alleviates the over-stationarization problem.

1. Series Stationarization

Series Stationarization consists of two modules: a Normalization module at the input and a De-normalization module at the output. The mean and standard deviation computed during normalization are passed to the De-normalization module to restore the statistical characteristics of the series.

    Normalization module: compute the mean and variance, normalize the sequence, and feed it into the model:

        $\mu_x = \frac{1}{S}\sum_{i=1}^{S} x_i, \quad \sigma_x^2 = \frac{1}{S}\sum_{i=1}^{S}(x_i - \mu_x)^2, \quad x'_i = \frac{1}{\sigma_x}\odot(x_i - \mu_x)$

    De-normalization module: use the above mean and variance to invert the normalization on the model output $y'$:

        $\hat{y}_i = \sigma_x \odot y'_i + \mu_x$
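As a concrete illustration, below is a minimal PyTorch-style sketch of the two modules (a sketch under assumed tensor shapes and an assumed eps constant, not the authors' exact implementation):

```python
import torch

def normalize(x, eps=1e-5):
    """Normalization module: x has shape (batch, length, channels).
    Returns the stationarized series and the statistics needed later."""
    mu = x.mean(dim=1, keepdim=True)                                   # mean over the time dimension
    sigma = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + eps)
    x_norm = (x - mu) / sigma                                          # stationarized input fed to the model
    return x_norm, mu, sigma

def denormalize(y_norm, mu, sigma):
    """De-normalization module: restore the original statistics on the model output."""
    return y_norm * sigma + mu

# Usage: x -> normalize -> forecasting model -> denormalize
x = torch.randn(8, 96, 7)
x_norm, mu, sigma = normalize(x)
# y_norm = model(x_norm)               # any Transformer forecaster
# y_hat  = denormalize(y_norm, mu, sigma)
```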

2. De-stationary Attention

Because the model receives a stationarized input, the Attention computed inside the model is also based on the stationarized sequence, which is where the over-stationarization problem arises. What we actually want is the Attention Matrix of the original non-stationary sequence, so the purpose of this module is to approximate that Attention Matrix from the Attention Matrix of the stationarized sequence. The derivation is as follows:

(Q′, K′, V′ are obtained from the stationarized sequence, while Q, K, V are obtained from the original sequence before stationarization.)

The Attention computed from the sequence before stationarization, which is our actual goal, is:

$\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$   (1)

Since stationarization is a normalization, $Q' = \frac{1}{\sigma_x}(Q - \mathbf{1}\mu_Q^\top)$ and $K' = \frac{1}{\sigma_x}(K - \mathbf{1}\mu_K^\top)$, where $\mu_Q, \mu_K$ are the means of $Q, K$ over the time dimension and $\mathbf{1}$ is an all-ones column vector, so the stationarized $Q'K'^\top$ can be expanded as:

$Q'K'^\top = \frac{1}{\sigma_x^2}\left(QK^\top - \mathbf{1}\mu_Q^\top K^\top - Q\mu_K\mathbf{1}^\top + \mathbf{1}\mu_Q^\top\mu_K\mathbf{1}^\top\right)$   (2)

Substituting (2) into (1), the Attention Matrix of our target can be rewritten as:

$\mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) = \mathrm{Softmax}\!\left(\frac{\sigma_x^2\, Q'K'^\top + \mathbf{1}\mu_Q^\top K^\top + Q\mu_K\mathbf{1}^\top - \mathbf{1}\mu_Q^\top\mu_K\mathbf{1}^\top}{\sqrt{d_k}}\right)$   (3)

The last two terms in (3) are constant within each row: each row $i$ of $Q\mu_K\mathbf{1}^\top$ repeats the scalar $q_i^\top\mu_K$, and $\mathbf{1}\mu_Q^\top\mu_K\mathbf{1}^\top$ is the same scalar $\mu_Q^\top\mu_K$ everywhere. Since the Softmax in attention is computed along each row of the score matrix, and adding the same value to every element of a row does not change that row's Softmax, these two terms have no effect on the result and can be removed directly.
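This is the standard shift invariance of Softmax: for any row $a = (a_1, \dots, a_S)$ and any constant $c$,

$\mathrm{Softmax}(a + c\,\mathbf{1})_i = \dfrac{e^{a_i + c}}{\sum_{j=1}^{S} e^{a_j + c}} = \dfrac{e^{c}\, e^{a_i}}{e^{c}\sum_{j=1}^{S} e^{a_j}} = \mathrm{Softmax}(a)_i$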

After removing them, (3) simplifies to:

$\mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) = \mathrm{Softmax}\!\left(\frac{\sigma_x^2\, Q'K'^\top + \mathbf{1}\mu_Q^\top K^\top}{\sqrt{d_k}}\right)$   (4)

The first term on the right-hand side of (4) contains $Q'K'^\top$, which is exactly the Attention Matrix of the stationarized sequence. Equation (4) therefore builds a bridge from the Attention Matrix of the stationarized sequence to the Attention Matrix of the original sequence before stationarization (i.e., our goal).

Besides $Q'K'^\top$, Equation (4) also involves $\sigma_x^2$ and $(K\mu_Q)^\top$, which cannot be obtained from the stationarized sequence. Therefore two additional MLPs are used to learn these quantities: one learns $\tau = \sigma_x^2$ (this quantity is positive, so its logarithm can be learned instead), and the other learns $\Delta = K\mu_Q$. Here $\tau$ and $\Delta$ are called de-stationary factors. The input of the MLPs is the original non-stationarized sequence together with its statistics. Note that the $\sigma_x^2$ to be learned is not the same as the $\sigma_x^2$ in Series Stationarization: the latter is the variance of the input of the whole model, whereas the quantity to be learned is the variance of the input of each Attention layer; in the paper, however, the authors share the de-stationary factors across all Attention layers.

In summary, the complete De-stationary Attention can be written as:

$\mathrm{Attn}(Q', K', V', \tau, \Delta) = \mathrm{Softmax}\!\left(\frac{\tau\, Q'K'^\top + \mathbf{1}\Delta^\top}{\sqrt{d_k}}\right)V', \quad \tau = \exp\!\big(\mathrm{MLP}(\sigma_x, x)\big), \quad \Delta = \mathrm{MLP}(\mu_x, x)$
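For intuition, here is a minimal PyTorch-style sketch of De-stationary Attention with the two learned de-stationary factors. This is a sketch, not the authors' implementation: the Projector architecture, tensor shapes, and the way the raw series and its statistics are flattened into the MLP input are all assumptions.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Small MLP that maps the raw (non-stationary) series and one of its
    statistics to a de-stationary factor (log tau or Delta)."""
    def __init__(self, seq_len, channels, out_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(seq_len * channels + channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x_raw, stat):
        # x_raw: (B, S, C) original series; stat: (B, 1, C) mean or std
        inp = torch.cat([x_raw.flatten(1), stat.flatten(1)], dim=-1)
        return self.mlp(inp)                       # (B, out_dim)

def destationary_attention(q, k, v, log_tau, delta):
    """q, k, v: (B, H, S, d_k) computed from the stationarized series.
    log_tau: (B, 1)  learned log of sigma_x^2.
    delta:   (B, S)  learned K @ mu_Q."""
    d_k = q.size(-1)
    tau = log_tau.exp().view(-1, 1, 1, 1)          # positive scaling factor tau
    delta = delta.view(delta.size(0), 1, 1, -1)    # broadcast over heads and queries (the 1 Delta^T term)
    scores = (tau * q @ k.transpose(-2, -1) + delta) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v       # re-incorporates the non-stationary information
```

With this sketch, one Projector with out_dim=1 would produce log_tau from (x, sigma_x) and another with out_dim=S would produce delta from (x, mu_x), matching the formula above.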

Research Results

The results are very strong.

The datasets used are benchmarks commonly used in time series tasks:

 

On the vanilla Transformer it reduces MSE by 49.43%, on Informer by 47.34%, and on Reformer by 46.89%. The improvement is especially pronounced for long-term forecasting.

The proposed Non-stationary Transformers framework consistently and substantially improves four mainstream Transformer variants and achieves state-of-the-art results on six real-world datasets.

Data

Experimental result data:

Relative stationarity test:

Conclusion

This paper explores time series forecasting from the perspective of stationarity.

Unlike previous work that simply attenuates non-stationarity and thereby causes over-stationarization, it proposes an effective method that increases series stationarity while renovating the internal mechanism to re-incorporate the non-stationary information, thereby improving both the predictability of the data and the predictive capability of the model.

Experiments show good generality and strong performance on six real-world benchmarks, and detailed derivations and ablations are provided to demonstrate the effectiveness of each component of the proposed Non-stationary Transformers framework.

Research Outlook

In the future, the authors will explore model-agnostic solutions to the over-stationarization problem, since this work builds only on Transformer-based methods.

 Limitation:

De-stationary Attention is derived by analyzing self-attention, which may not be the optimal solution for more advanced attention mechanisms. The projector also has potential for further development, such as incorporating more inductive biases. In addition, the proposed framework is limited to Transformer-based models, while any deep forecasting model can suffer from over-stationarization if an inappropriate stationarization method is applied. Therefore, a model-agnostic solution to the over-stationarization problem will be a direction for future exploration.

Importance

Better forecasting of the future. Transformers are currently widely applied and have strong global modeling ability, but their performance degrades severely when forecasting non-stationary data, so the common practice is to stationarize first and then forecast; stationarization pre-processing of time series is widely accepted and an active research topic. However, stationarization also discards a lot of feature information, which offers little guidance for forecasting sudden real-world events. This paper therefore proposes a solution to the over-stationarization problem, addressing the issues introduced by the stationarization operation.

Thoughts and Questions

The two modules used in Series Stationarization appear to follow Reversible Instance Normalization for Accurate Time-Series Forecasting against Distribution Shift (RevIN) [an ICLR 2022 paper, which I will read after finishing SCINet], and experiments show that Series Stationarization and RevIN perform about the same. Both are ultimately inspired by Batch Normalization and Layer Normalization: normalize first, then de-normalize. The difference in time series is that the de-normalization parameters are not learned but come from the statistics computed during normalization, which also matches the intuition that the mean and variance of the input sequence should be roughly the same as those of the output sequence.

The design of De-stationary Attention is very ingenious, and a new form of Attention is introduced with theoretical support, although the two assumptions used in the derivation do not hold exactly under nonlinear activations. The key quantities in the computation are the de-stationary factors, since they are the core of avoiding over-stationarization, but the authors do not show the learned de-stationary factors or Attention Matrices and how they actually work. The other quantitative experiments are sufficient, however.

Citing references:

Non-stationary Transformers: Rethinking the Stationarity in Time Series Forecasting - 知乎

This article is for self-study use only; if you have any questions, please point them out in the comments.
