[Model Learning Series] Non-stationary Transformers

Article overview

"Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting" is an article published on NeurIPS in 2022. In the past time series forecast research, people often weaken the non-stationarity of the original sequence through data stabilization. This approach is contrary to the significance of time series forecasting for emergency prediction, and ignores the ubiquity of non-stationary data in real scenarios. This eventually leads to overstationarization of modeling and forecasting. In order to solve this problem, this paper proposes a new network structure consisting of sequence stabilization and anti-stationary attention.

Paper Link
Code Link

The overall structure

[Figure: overall architecture of the Non-stationary Transformer]
The figure above shows the structure of the model proposed in this paper. The input data $x$ first passes through a normalization module (Normalization) to obtain the transformed series $x'$, which is then embedded and fed into the usual Encoder-Decoder stack; the prediction is output after De-Normalization. This pair of normalization and de-normalization modules is what the authors call Series Stationarization. Unlike the vanilla Transformer, the $Q$, $K$, $V$ here are adjusted, and the proposed De-stationary Attention module replaces the traditional attention module, receiving two extra factors $\tau$ and $\Delta$ that are used to compute the attention scores.

Main modules

Series Stationarization

Non-stationary series are usually difficult to forecast, so one naturally hopes to convert them into stationary data before prediction. An existing method, RevIN, normalizes the input data, applies learnable parameters (a weight $a$ and a bias $b$) to the normalized result, and finally applies the corresponding inverse transform to the prediction output; this method has been shown to be effective. Inspired by it, the authors simplify the RevIN operation and discard the learnable parameters: the input only needs to be normalized and the output de-normalized.
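Below is a minimal sketch of this simplified stationarization, assuming PyTorch; the class name, tensor shapes, and the eps constant are illustrative assumptions rather than the authors' code. Statistics are computed per instance along the time axis, stored, and reused to de-normalize the forecast.

```python
import torch


class SeriesStationarization:
    """Sketch of the simplified RevIN: per-instance normalization and its
    inverse, with no learnable parameters."""

    def __init__(self, eps: float = 1e-5):
        self.eps = eps
        self.mean = None
        self.std = None

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_vars); statistics are taken over the time axis.
        self.mean = x.mean(dim=1, keepdim=True)
        self.std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + self.eps)
        return (x - self.mean) / self.std

    def denormalize(self, y: torch.Tensor) -> torch.Tensor:
        # Restore the scale and offset that were removed before encoding.
        return y * self.std + self.mean
```

The stored mean and standard deviation are also exactly the statistics from which the de-stationary factors $\tau$ and $\Delta$ are later learned (see the attention module below).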

De-stationary Attention

The Series Stationarization module can indeed convert non-stationary data into stationary data effectively, but the non-stationary characteristics are inevitably lost in the process, producing over-stationarized data, which is clearly harmful for the goal of time series forecasting.
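A tiny numerical illustration of this information loss (my own example, not from the paper): two subsequences that differ only in offset and scale become indistinguishable once each is normalized by its own statistics.

```python
import torch

# Two distinct subsequences that differ only in offset and scale.
a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([10.0, 20.0, 30.0, 40.0])

def stationarize(x: torch.Tensor) -> torch.Tensor:
    # Per-instance normalization, as in the Series Stationarization module.
    return (x - x.mean()) / x.std(unbiased=False)

# After stationarization the two series are identical, so a model that only
# sees the normalized inputs can no longer tell them apart.
print(torch.allclose(stationarize(a), stationarize(b)))  # True
```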
[Figure: attention feature maps learned from different subsequences by a vanilla Transformer, a Transformer with Series Stationarization, and the proposed method]
To illustrate this point, the authors visualize the feature maps learned by the attention mechanism, as shown in the figure above. A non-stationary series generally consists of several distinct subsequences, and these subsequences are fed into a vanilla Transformer, a Transformer with the Series Stationarization module, and the method proposed in this paper.
For the vanilla Transformer and the method in this paper, the three subsequences yield three clearly different sets of features, while the Transformer with the stationarization module extracts very similar features from all of them. This shows that stationarization makes the features extracted from different subsequences converge: the distinguishing features are lost, the extracted representations are flattened, and the subsequent forecasting task suffers.
In contrast, after stationarizing the series, the method proposed in this paper successfully reinjects the otherwise-lost non-stationary information into the features learned by attention through a redesigned attention module. Next, let's take a closer look at how this is done.

[Figure: structure of the vanilla attention module (left) and the De-stationary Attention module (right)]

General attention module:
$$z = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$

De-stationary Attention:
$$z' = \mathrm{softmax}\left(\frac{\tau Q'K'^{T} + \mathbf{1}\Delta^{T}}{\sqrt{d_{k}}}\right)V'$$

The first formula above is the general attention module, and the second is the De-stationary Attention proposed in the article. The main difference is that the latter uses the stationarized $Q'$, $K'$, $V'$ and introduces two additional de-stationary factors, $\tau$ and $\Delta$. Through these changes, the authors try to bring non-stationary information back into the attention module; the specific derivation is as follows.

1. To simplify the calculation, assume that $Q$ is obtained from the input $x$ by a linear mapping.
2. To make the derivation tractable, assume that all variables of the input series share the same mean and variance, so that $\mu_x$ and $\sigma_x$ can be treated as scalars.
At this point $x' = \frac{x - \mathbf{1}\mu_x^{T}}{\sigma_x}$, and accordingly $Q' = \frac{Q - \mathbf{1}\mu_Q^{T}}{\sigma_x}$ and $K' = \frac{K - \mathbf{1}\mu_K^{T}}{\sigma_x}$.
3. Expanding the product gives $Q'K'^{T} = \frac{1}{\sigma_x^{2}}\left[QK^{T} - \mathbf{1}(\mu_Q^{T}K^{T}) - (Q\mu_K)\mathbf{1}^{T} + \mathbf{1}(\mu_Q^{T}\mu_K)\mathbf{1}^{T}\right]$.
4. Therefore, the original attention score can be rewritten as $\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right) = \mathrm{softmax}\left(\frac{\sigma_x^{2}Q'K'^{T} + \mathbf{1}(\mu_Q^{T}K^{T}) + (Q\mu_K)\mathbf{1}^{T} - \mathbf{1}(\mu_Q^{T}\mu_K)\mathbf{1}^{T}}{\sqrt{d_{k}}}\right)$.
5. At this point the scalar-statistics assumption from step 2 comes into play. In the numerator of step 4, the last two terms, $(Q\mu_K)\mathbf{1}^{T}$ and $\mathbf{1}(\mu_Q^{T}\mu_K)\mathbf{1}^{T}$, are constant within each row, and softmax is invariant to adding a constant to every entry of a row, so they can be dropped, giving $\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right) = \mathrm{softmax}\left(\frac{\sigma_x^{2}Q'K'^{T} + \mathbf{1}(\mu_Q^{T}K^{T})}{\sqrt{d_{k}}}\right)$.
6. In the formula of step 5, $\frac{Q'K'^{T}}{\sqrt{d_{k}}}$ can be computed directly from the stationarized series, while $\sigma_x$, $K$, and $\mu_Q$ carry the non-stationary information of the original series. Denote $\sigma_x^{2}$ as $\tau$ ($\tau \ge 0$) and $K\mu_Q$ as $\Delta$, and learn these de-stationary factors with multi-layer perceptrons fed by the un-stationarized statistics; this brings the non-stationary information back into the computation after stationarization, yielding
$$z' = \mathrm{softmax}\left(\frac{\tau Q'K'^{T} + \mathbf{1}\Delta^{T}}{\sqrt{d_{k}}}\right)V',$$
where $\log \tau = \mathrm{MLP}(\sigma_x, x)$ and $\Delta = \mathrm{MLP}(\mu_x, x)$.
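To make the final formula concrete, here is a minimal single-head sketch of the de-stationary attention computation, assuming PyTorch; the projector MLPs, the way the raw statistics are flattened and concatenated, and all names and shapes are illustrative assumptions rather than the authors' exact implementation. $\tau$ is predicted in log space so that $\tau \ge 0$, and $\Delta$ holds one entry per key position.

```python
import torch
import torch.nn as nn


class DeStationaryAttention(nn.Module):
    """Single-head sketch of softmax((tau * Q'K'^T + 1 Delta^T) / sqrt(d_k)) V'."""

    def __init__(self, d_model: int, seq_len: int, n_vars: int, hidden: int = 64):
        super().__init__()
        self.scale = d_model ** -0.5
        stats_dim = seq_len * n_vars + n_vars  # flattened raw series plus one per-variable statistic
        # log tau = MLP(sigma_x, x)  and  Delta = MLP(mu_x, x)
        self.tau_mlp = nn.Sequential(nn.Linear(stats_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.delta_mlp = nn.Sequential(nn.Linear(stats_dim, hidden), nn.ReLU(), nn.Linear(hidden, seq_len))

    def forward(self, q, k, v, x_raw, mean_x, std_x):
        # q, k, v: (B, L, d_model), projected from the stationarized series x'.
        # x_raw: (B, L, C) original series; mean_x, std_x: (B, 1, C) its statistics.
        B = x_raw.shape[0]
        tau_in = torch.cat([x_raw.reshape(B, -1), std_x.reshape(B, -1)], dim=-1)
        delta_in = torch.cat([x_raw.reshape(B, -1), mean_x.reshape(B, -1)], dim=-1)
        tau = torch.exp(self.tau_mlp(tau_in)).unsqueeze(-1)    # (B, 1, 1), keeps tau >= 0
        delta = self.delta_mlp(delta_in).unsqueeze(1)          # (B, 1, L), broadcast as 1 * Delta^T
        scores = (tau * torch.matmul(q, k.transpose(-1, -2)) + delta) * self.scale
        return torch.matmul(torch.softmax(scores, dim=-1), v)
```

In the paper's framework the de-stationary factors come from dedicated projector networks outside the attention layers; here everything is folded into a single module purely for readability.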

Experimental results

[Table: main forecasting results on six real-world datasets]
The authors evaluate the method on six real-world datasets, demonstrating its effectiveness both empirically and through the formula derivation above.
(See the paper for more experimental results)

[Table: improvements obtained by applying the proposed framework to earlier Transformer-based models]
As shown above, the authors also apply the framework proposed in this article to earlier Transformer models, and the experiments demonstrate the general effectiveness of the framework.

Ablation experiment

[Figure: predictions of (a) a vanilla Transformer, (b) a Transformer with Series Stationarization, and (c) the proposed method]
This part mainly compares a vanilla Transformer (a), a Transformer with Series Stationarization (b), and the method proposed in this paper (c). The predictions in (a) and (c) show clearly distinct fluctuations, while (b) is comparatively flat, evidently due to the stationarization module; this is exactly the over-stationarization the authors describe. With the help of the De-stationary Attention module, (c) recovers the non-stationary behavior seen in (a), which shows the important role this module plays: it mitigates the negative impact of over-stationarization to a certain extent and helps the model track genuine temporal variation, which matters greatly for accurate forecasting in this line of research.

Summary

The goal of this paper is very clear: it identifies the over-stationarization problem and proposes an effective solution. The derivation relies on a few technical devices, such as approximations, dropping constant terms, and simplifying mathematical assumptions. Although some of the steps are hard to derive rigorously from a purely theoretical standpoint, the strong experimental results obtained under these reasonable simplifications outweigh the drawbacks of the somewhat loose theory.
