[Paper] Channel-independence in PatchTST


A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
Year: 2022
Venue: ICLR 2023
Citations: 8
Code: https://github.com/yuqinie98/PatchTST
Chinese references: "A Time Series is Worth 64 Words (PatchTST model) paper interpretation"; "Is the Transformer worse than linear models for time series forecasting? The latest ICLR 2023 responses are here!"
Citation: Nie Y, Nguyen N H, Sinthong P, et al. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers[J]. arXiv preprint arXiv:2211.14730, 2022.




Summary

The paper proposes two main innovations:
(1) Patching
(2) Channel-independence

Patching

Because values at adjacent time steps of a time series are very close, neighboring steps can be aggregated into patches (a small sketch follows the lists below).
Pros:

  • Reduces GPU memory usage during training.
  • Lets the model see a longer history window, which improves prediction performance.

Cons:

  • The resolution of the time series is reduced, i.e., the granularity becomes coarser.
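
A minimal sketch of the patching step in PyTorch (the values L = 336, P = 16, S = 8 and the tensor names are illustrative, not from the official code):

```python
import torch

L, P, S = 336, 16, 8          # look-back window, patch length, patch stride
x = torch.randn(32, L)        # dummy batch of univariate series: (B, L)

# Slide a window of length P with step S over the time axis.
patches = x.unfold(-1, P, S)  # (B, N, P)
print(patches.shape)          # torch.Size([32, 41, 16]); N = (L - P) // S + 1 = 41
```

Since the Transformer now attends over N ≈ L/S patch tokens instead of L raw time steps, the quadratic attention cost shrinks accordingly, which is what frees up memory for longer histories.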

Channel-independence

Each variable of the multivariate time series is considered separately, i.e., each channel is fed into the Transformer on its own (a minimal sketch follows the lists below).
Pros:

  • The authors state that it reduces overfitting. This blogger's view is that channel-independence also amounts to increasing the training data in disguise.

Cons:

  • Interactions between the variables are lost, which may be important for certain downstream tasks.
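
A minimal sketch of channel-independence, assuming an input batch of shape (B, M, L) for batch, channels, and length; the backbone call is a stand-in:

```python
import torch

B, M, L = 32, 7, 336
x = torch.randn(B, M, L)    # multivariate batch with 7 channels

# Flatten channels into the batch dimension: each channel becomes its own
# univariate sample, and the shared backbone never mixes channels.
x_ci = x.reshape(B * M, L)  # (B*M, L)
y_ci = x_ci                 # ... stand-in for the shared univariate backbone ...
y = y_ci.reshape(B, M, L)   # fold the channels back out afterwards
```

Note that the backbone effectively sees B·M samples per batch instead of B, which is the sense in which channel-independence "increases training data in disguise".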

1. Model Overview
Figure 1: Overview of the PatchTST structure.
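
To connect the two ideas in Figure 1, here is a hypothetical minimal re-implementation of the PatchTST forward pass (instance normalization, dropout, and other details of the official code are omitted; all hyperparameter values are illustrative):

```python
import torch
import torch.nn as nn

class PatchTSTSketch(nn.Module):
    def __init__(self, seq_len=336, pred_len=96, patch_len=16, stride=8, d_model=128):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        n_patches = (seq_len - patch_len) // stride + 1
        self.embed = nn.Linear(patch_len, d_model)                # patch embedding
        self.pos = nn.Parameter(torch.zeros(n_patches, d_model))  # learnable positions
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(n_patches * d_model, pred_len)      # flatten + linear head

    def forward(self, x):                              # x: (B, M, L)
        B, M, L = x.shape
        x = x.reshape(B * M, L)                        # channel-independence
        x = x.unfold(-1, self.patch_len, self.stride)  # patching: (B*M, N, P)
        z = self.encoder(self.embed(x) + self.pos)     # vanilla encoder over patch tokens
        y = self.head(z.flatten(1))                    # (B*M, pred_len)
        return y.reshape(B, M, -1)                     # (B, M, pred_len)

out = PatchTSTSketch()(torch.randn(4, 7, 336))
print(out.shape)                                       # torch.Size([4, 7, 96])
```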

2. Dataset

[Dataset table from the paper omitted.]

3. Ablation experiment

Notation:
  • B - batch size
  • M - number of variables (channels)
  • N - number of patches
  • P - patch length
  • S - patch stride

Ablation variants (see the sketch after this list):
  • Channel-independence only: P (patch length) and S (patch stride) are set to 1, so every time step is its own token.
  • Patching only: change the input shape from (B·M) × N × P to B × N × (M·P), i.e., channel-mixing with patching.
  • Original TST: refers to the TST model (KDD 2021).
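
The variants differ only in how the input tensor is shaped before entering the Transformer; a sketch under the notation above (values illustrative):

```python
import torch

B, M, N, P = 32, 7, 41, 16
patches = torch.randn(B, M, N, P)    # patched multivariate batch

# Full PatchTST (channel-independence + patching): each channel is a
# separate sequence of N patch tokens.
ci = patches.reshape(B * M, N, P)                      # (B·M) x N x P

# "Only patching" (channel-mixing): the patches of all M channels at the
# same position are concatenated into one token.
cm = patches.permute(0, 2, 1, 3).reshape(B, N, M * P)  # B x N x (M·P)

# "Only channel-independence": set P = S = 1, so every time step is a token.
```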

Advantages of Channel-independence (from the appendix of the paper):
(1) Adaptability: each time series (channel) is fed into the Transformer separately, so each channel has its own attention map and can learn its own attention pattern. With channel-mixing, all series share the same attention pattern, which may hurt performance, since the time series of each variable can exhibit its own distinct behavior.

Figure 6 reveals an interesting phenomenon: predictions of unrelated time series rely on different attention patterns, while similar series produce similar attention maps.
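
The per-channel attention maps are a direct consequence of flattening channels into the batch dimension; a small sketch (the layer and sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

B, M, N, d = 2, 7, 41, 64
tokens = torch.randn(B * M, N, d)   # patch embeddings with channels flattened

attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
_, weights = attn(tokens, tokens, tokens, need_weights=True)
print(weights.shape)                # (B*M, N, N): a separate N x N map per channel
```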

(2) Channel-mixing requires more training data to reach the performance of channel-independence. The flexibility of learning correlations between channels is a double-edged sword: it takes more data to learn the interactions between different channels and different time steps, whereas the channel-independence model only needs to focus on information along the time axis.
To test this hypothesis, we conducted experiments with different training-data sizes, shown in Figure 7. The channel-independence method clearly converges faster as the training data grows; the widely used datasets (shown in Table 2) may be too small for supervised learning.

(3) The channel-independence model is also less prone to overfitting. As shown in Figure 7, the channel-mixing model quickly overfits, while the channel-independence model does not.

More technical advantages of Channel-independence:
(1) Spatial correlation between different series can still be learned: although the paper does not study this in depth, the channel-independence design can be naturally extended to learn cross-channel relationships (e.g., with graph neural networks).
(2) Multi-task learning losses can be attached to the individual time series (a sketch follows the quote below).
(3) Tolerance to noise: if noise dominates one series, channel-mixing maps that noise into the latent space of the other series. Channel-independence alleviates this by retaining the noise only in the noisy channels.

From the paper's appendix: "Channel independence can mitigate this problem by only retaining the noise in these noisy channels. We can further alleviate the noise by introducing smaller weights to the objective losses that associate with noisy channels."
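
A hedged sketch of that loss-weighting idea, which channel-independence makes possible because the loss decomposes per channel (the weights here are hypothetical; the paper only suggests the scheme, it does not prescribe values):

```python
import torch

pred = torch.randn(32, 7, 96)    # (B, M, pred_len)
target = torch.randn(32, 7, 96)

channel_w = torch.ones(7)
channel_w[2] = 0.1               # e.g. down-weight channel 2, assumed noisy

per_channel_mse = ((pred - target) ** 2).mean(dim=(0, 2))    # (M,)
loss = (channel_w * per_channel_mse).sum() / channel_w.sum()
```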

[Excerpt of Figure 6 from the paper: per-channel attention maps.]

Origin: blog.csdn.net/LittleSeedling/article/details/130808893