[Time series] TimesNet: a general time series model based on 2D modeling

Paper: ICLR 2023 | TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis [1]

Authors: Wu, Haixu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long

Institution: Tsinghua University

Code: https://github.com/thuml/TimesNet

Citations: 29

c9fe0b2ff58b43a885a622dcaa9ea47c.png

This paper starts from the multi-periodicity of time series. It decomposes complex temporal variations into variations within a period and variations between periods, converts the 1D time series into a set of 2D tensors based on multiple periods, and applies 2D convolution kernels to extract those variations. With this design, TimesNet performs strongly on five mainstream time series analysis tasks (short- and long-term forecasting, imputation, classification, and anomaly detection).

b7eb161e088befd95adeddd8c4166723.png

The radar chart is drawn from the score rankings of the different models on each task (see the appendix of the paper for details).

Q: How do we find the multi-periodicity in the data?

A: The authors use the FFT to convert the raw data into a spectrum and, for each window of length 96, select the 6 most prominent frequencies. They then collect the corresponding period lengths and plot their normalized density, as shown in the figure below, where the multi-periodicity is visible. For example, the Electricity dataset contains periods of length 12 and 24.

1f2907aa14ffb7d2d77143f0854a7129.png
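As a small illustration of this idea, here is a minimal NumPy sketch that recovers the dominant periods of a toy signal. The helper name `top_k_periods` and the synthetic series are assumptions made for this example and do not come from the paper or its code.

```python
import numpy as np

def top_k_periods(x, k=2):
    """Return the k period lengths with the largest FFT amplitude (1D input)."""
    amplitude = np.abs(np.fft.rfft(x))            # T//2 + 1 frequency bins
    amplitude[0] = 0.0                            # ignore the DC component
    top_freqs = np.argsort(amplitude)[::-1][:k]   # indices of the k largest amplitudes
    return [len(x) // int(f) for f in top_freqs]

T = 96
t = np.arange(T)
# Hypothetical toy signal: a strong period-24 wave plus a weaker period-12 wave.
series = 2.0 * np.sin(2 * np.pi * t / 24) + 1.0 * np.sin(2 * np.pi * t / 12)

periods = top_k_periods(series, k=2)
print(periods)  # [24, 12]
```

The stronger wave is listed first because the candidates are ranked by amplitude, which is the same quantity the density plot is built from.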

To help the model capture multi-period variations in the data, the authors extract several periods from the time series and organize the data as follows:

43fec8e9a07fcdef2996138989842d24.png

Here, red marks the variation within a period (a period-over-period comparison, such as yesterday versus today), and blue marks the variation between periods (a same-phase comparison across periods, such as last Monday versus this Monday). If we run convolutions over this 2D data and then fuse the results obtained under the different periods, we can capture the temporal variations of multiple periods, as shown below:

7ac60c87fd023f8553ee8267f4fddcf9.png
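To make the two kinds of variation concrete, the following minimal NumPy sketch folds a toy series into a 2D grid with one row per period. The series and the period length of 4 are hypothetical values chosen so that the result is easy to read.

```python
import numpy as np

period = 4
# Hypothetical toy series: 3 periods of length 4, each period shifted up by 10.
series = np.array([0, 1, 2, 3, 10, 11, 12, 13, 20, 21, 22, 23])

grid = series.reshape(-1, period)   # one row per period, one column per phase
intra = np.diff(grid, axis=1)       # neighbours within a period (along a row)
inter = np.diff(grid, axis=0)       # same phase in adjacent periods (down a column)

print(grid)
print(intra)  # every entry is 1
print(inter)  # every entry is 10
```

A 2D convolution kernel sliding over `grid` sees both directions at once, which is what a 1D kernel over the original series cannot do.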

To this end, the authors propose TimesNet, whose most important component is the TimesBlock module. This module does the following:

- Converts the 1D time series into 2D structured data;

- Captures information with 2D convolution kernels;

- Dynamically merges the results from multiple periods.

Q: First, how is the 1D time series converted into 2D structured data?

A: First use the FFT to move to the frequency domain, then compute the amplitude of each frequency, where A_j denotes the amplitude of frequency j. Select the top-k frequencies with the largest amplitude; dividing the sequence length T by j then gives the period length for frequency j (the code below uses integer division).

 223b5c171d11259f7c5bf98e396db909.png

The code is as follows:

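import torch

def FFT_for_Period(x, k=2):
    # x: [B, T, C]
    # e.g. [32, 192, 16] -> xf: [32, 97, 16]
    # The real FFT along the time axis yields T/2 + 1 frequency bins
    xf = torch.fft.rfft(x, dim=1)
    # find period by amplitudes
    # Average over the batch dimension, then over the channel dimension,
    # to get one mean amplitude per frequency: shape [T/2 + 1]
    frequency_list = abs(xf).mean(0).mean(-1)
    # The first entry is the DC component; it is usually large, so zero it
    # to keep it out of the top-k selection
    # ref: https://github.com/thuml/Time-Series-Library/issues/7
    frequency_list[0] = 0
    # Pick the k frequencies with the largest amplitude: shape [k]
    # topk returns (values, indices); only the indices are needed
    _, top_list = torch.topk(frequency_list, k)
    # Period length = number of time steps // frequency index: shape [k]
    top_list = top_list.detach().cpu().numpy()
    period = x.shape[1] // top_list  # [k]
    # Return the periods and, for each sample, the amplitude of the selected
    # frequencies averaged over channels: shape [B, k]
    return period, abs(xf).mean(-1)[:, top_list]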

After that, the series is cut into segments and stacked into 2D according to each period length. Since the length of the input sequence may not be divisible by a given period length, it is padded with zeros so that the conversion from 1D to 2D can always be completed.

6badd06644a0b0439f8336a34e695d0a.png

92e2b48d26969df2ec868e3280c5feb8.png

The code is as follows:

def forward(self, x):
    B, T, N = x.size()
    # period_list: period length of each top-amplitude frequency, shape [k]
    # period_weight: per-sample mean amplitude of each top frequency, shape [B, k]
    period_list, period_weight = FFT_for_Period(x, self.k)

    res = []
    for i in range(self.k):
        # period length for the i-th selected frequency
        period = period_list[i]
        # padding
        # Pad when the sequence length is not a multiple of the period.
        # Why does the length include pred_len?
        # For forecasting, the TimesNet pipeline first expands the sequence
        # to self.seq_len + self.pred_len right after the embedding and then
        # keeps refining the prediction. The TimesBlocks in the middle layers
        # therefore process an intermediate prediction of length
        # self.seq_len + self.pred_len.
        if (self.seq_len + self.pred_len) % period != 0:
            # round the length up to the next multiple of the period
            length = (
                             ((self.seq_len + self.pred_len) // period) + 1) * period
            # zero tensor of shape [B, padded length - original length, N]
            padding = torch.zeros([x.shape[0], (length - (self.seq_len + self.pred_len)), x.shape[2]]).to(x.device)
            # append the padding along the time axis
            out = torch.cat([x, padding], dim=1)
        else:
            length = (self.seq_len + self.pred_len)
            out = x
        # reshape
        # Split the sequence of length `length` into length // period
        # segments of length `period`, then move the channel axis to dim 1.
        # Resulting shape: [B, N, length // period, period]
        out = out.reshape(B, length // period, period,
                          N).permute(0, 3, 1, 2).contiguous()
        # 2D conv: from 1d Variation to 2d Variation
        out = self.conv(out)
        # reshape back
        out = out.permute(0, 2, 3, 1).reshape(B, -1, N)
        # keep the first seq_len + pred_len steps and drop the padded tail
        res.append(out[:, :(self.seq_len + self.pred_len), :])
    res = torch.stack(res, dim=-1)
    # adaptive aggregation
    # softmax over the amplitudes A gives the weight of each period
    period_weight = F.softmax(period_weight, dim=1)
    period_weight = period_weight.unsqueeze(
        1).unsqueeze(1).repeat(1, T, N, 1)
    # weighted fusion over the k periods
    res = torch.sum(res * period_weight, -1)
    # residual connection
    res = res + x
    return res

The code above shows how TimesBlock performs the "capture information with 2D convolution kernels" and "dynamically merge multiple periods" steps on the 2D data. In short, the authors use Inception blocks as the 2D convolution to capture information. These can also be replaced with other CV backbone networks, as shown in the figure below.

15433a0abc7726ee2ea28536cd7b4c17.png

# parameter-efficient design
self.conv = nn.Sequential(
    Inception_Block_V1(configs.d_model, configs.d_ff,
                       num_kernels=configs.num_kernels),
    nn.GELU(),
    Inception_Block_V1(configs.d_ff, configs.d_model,
                       num_kernels=configs.num_kernels)
)

There is something worth learning here; the point to focus on is:

c7b55159ed07701610698dbb7221cc81.png

The dynamic merging of multiple periods works as follows: softmax weights are computed from the frequency amplitudes, and the convolution results under the different periods are merged by a weighted sum. The authors also tried other merging methods, such as direct summation and a weighted sum using the raw amplitudes, but neither worked as well as amplitude + softmax + weighted sum.

ce493fe925898845101379345f9a9a6e.png
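The amplitude + softmax + weighted sum step can be sketched in NumPy as below. This is only an illustration under assumed shapes: the random `res` tensor stands in for the per-period convolution outputs, and the amplitude values are made up.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

B, T, N, k = 2, 5, 3, 2
rng = np.random.default_rng(0)
res = rng.normal(size=(B, T, N, k))              # hypothetical per-period outputs
amplitude = np.array([[3.0, 1.0], [0.5, 0.5]])   # hypothetical amplitudes, shape [B, k]

weights = softmax(amplitude, axis=1)                   # [B, k], each row sums to 1
fused = (res * weights[:, None, None, :]).sum(-1)      # [B, T, N]
```

For the second sample the two amplitudes are equal, so the weights are 0.5 each and the fusion reduces to a plain average; for the first sample the period with the larger amplitude dominates.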

That is everything inside a TimesBlock. TimesNet itself is a stack of TimesBlocks connected by residual connections, as shown in the figure below, so I will not go over it again here:

a585ebeeffe7251690995e77f193b28e.png

The formula for the data dimension transformation process is as follows:
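Read off the forward code shown earlier, the shape changes for a single period length p can be written as follows. This is a reconstruction from the code, not the paper's own notation; the operator names Padding, Reshape, Conv, and Trunc are labels chosen here for the corresponding code steps.

```latex
\begin{aligned}
X &\in \mathbb{R}^{B \times T \times N}, \qquad T = \text{seq\_len} + \text{pred\_len} \\
L &= \lceil T / p \rceil \cdot p \\
X_{\text{pad}} &= \mathrm{Padding}(X) \in \mathbb{R}^{B \times L \times N} \\
X_{\text{2D}} &= \mathrm{Reshape}(X_{\text{pad}}) \in \mathbb{R}^{B \times N \times (L/p) \times p} \\
\hat{X}_{\text{2D}} &= \mathrm{Conv}(X_{\text{2D}}) \in \mathbb{R}^{B \times N \times (L/p) \times p} \\
\hat{X} &= \mathrm{Trunc}(\mathrm{Reshape}(\hat{X}_{\text{2D}})) \in \mathbb{R}^{B \times T \times N}
\end{aligned}
```

Here N is the channel dimension seen by the block (d_model after the embedding), and Trunc keeps only the first T time steps so that the padded tail is discarded.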

The authors have put TimesNet in their own Time-Series-Library. The code is short and the idea is easy to follow, so interested readers can go through it to deepen their understanding. Before the data enters the TimesBlocks, it is preprocessed:

- Normalization, following their earlier Non-stationary Transformer;

- Embedding: token embedding + position embedding + temporal embedding, followed by a linear layer. Note that after this step the output has length seq_len + pred_len in the time dimension. The authors explain the reason in a reply on GitHub [2]: "For forecasting tasks, TimesNet's pipeline is: after the embedding, first expand the sequence length to self.seq_len + self.pred_len, and then keep refining the prediction. So the TimesBlocks in the middle layers are actually processing the intermediate prediction result (whose length is self.seq_len + self.pred_len)."

def forecast(self, x_enc, x_mark_enc, x_dec, x_mark_dec):
    # encoder input x_enc: (batch_size, seq_len, enc_in)
    # encoder timestamp features x_mark_enc: (batch_size, seq_len, ts_fnum)
    # decoder input x_dec: (batch_size, label_len+pred_len, dec_out)
    # decoder timestamp features x_mark_dec: (batch_size, label_len+pred_len, ts_fnum)

    # Normalization from Non-stationary Transformer
    # per-window standardization
    means = x_enc.mean(1, keepdim=True).detach()
    x_enc = x_enc - means
    stdev = torch.sqrt(
        torch.var(x_enc, dim=1, keepdim=True, unbiased=False) + 1e-5)
    x_enc /= stdev

    # embedding: token embedding + position embedding + temporal embedding
    enc_out = self.enc_embedding(x_enc, x_mark_enc)  # [B,T,d_model]
    # a linear layer along the time axis extends the sequence to cover the
    # forecast horizon: [B,T,d_model] -> [B,T+pred_len,d_model]
    enc_out = self.predict_linear(enc_out.permute(0, 2, 1)).permute(
        0, 2, 1)  # align temporal dimension
    # TimesNet
    # pass through self.layer TimesBlocks:
    # TimesBlock -> layer_norm -> TimesBlock -> layer_norm
    for i in range(self.layer):
        enc_out = self.layer_norm(self.model[i](enc_out))
    # project back to the output channels: [B,seq_len+pred_len,dec_out]
    dec_out = self.projection(enc_out)

    # De-Normalization from Non-stationary Transformer
    # undo the per-window standardization
    dec_out = dec_out * \
              (stdev[:, 0, :].unsqueeze(1).repeat(
                  1, self.pred_len + self.seq_len, 1))
    dec_out = dec_out + \
              (means[:, 0, :].unsqueeze(1).repeat(
                  1, self.pred_len + self.seq_len, 1))
    return dec_out
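The normalization and de-normalization steps in the code above are inverses of each other. The following NumPy sketch checks this round trip on a hypothetical batch; the shapes and random values are made up for illustration, and broadcasting stands in for the explicit `repeat` calls.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=5.0, size=(2, 8, 3))   # hypothetical [B, T, C] batch

# Per-window statistics over the time axis, one value per sample and channel.
means = x.mean(axis=1, keepdims=True)                  # [B, 1, C]
stdev = np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)   # [B, 1, C], population variance
x_norm = (x - means) / stdev                           # what the model would see

# De-normalization: scale back by stdev, then add the mean back.
x_back = x_norm * stdev + means
```

In the model, the network output takes the place of `x_norm` before de-normalization, so the predictions are returned on the scale of the original window.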

In addition, Layer Normalization is applied after every TimesBlock. Both BN and LN normalize activations in order to stabilize training and reduce the risk of vanishing or exploding gradients. They differ as follows:

- Batch Normalization: normalizes each feature across a batch of samples, which preserves the relative magnitudes between different samples; commonly used in CV;

- Layer Normalization: normalizes all the features of each sample, which preserves the relative magnitudes between different features within a sample; commonly used in NLP.
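The difference between the two comes down to the axis over which the statistics are computed. Below is a minimal NumPy sketch on a hypothetical [batch, features] matrix; it leaves out the learnable scale and shift and the running statistics that real BN and LN layers carry.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """Standardize x along the given axis."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 6))   # hypothetical [batch, features]

bn = normalize(x, axis=0)   # batch-norm style: per feature, across samples
ln = normalize(x, axis=1)   # layer-norm style: per sample, across features
```

After `bn`, every column has zero mean; after `ln`, every row has zero mean.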

Now for the experiments. Only the long-term and short-term forecasting results are shown here; the appendix of the paper contains many more experiments and is worth reading on your own.

e3cbafeddd7c8af531f866ca350dea18.png

Q: How should k, the number of top-amplitude frequencies kept after the FFT, be chosen?

A: Through experimental tuning, the authors found that k=3 suits imputation, classification, and anomaly detection, while k=5 suits short-term forecasting.

dd500c19474f4ee91d1c76967342d85a.png

Q: How is d_model determined?

A: The authors set d_model per task according to a formula. See the following table:

ab5febcd15cc0a0ce975e42bf3ec2c1e.png

To summarize: time series data may contain multiple periods. TimesNet therefore works from the frequency domain, converts the 1D data to 2D, applies 2D convolution kernels to capture the variations both within and between periods, and then merges the per-period representations with dynamic, amplitude-based weights. This brings significant improvements across the mainstream time series tasks.

References

[1] Wu, H., Hu, T., Liu, Y., Zhou, H., Wang, J., & Long, M. (2022). TimesNet: Temporal 2D-variation modeling for general time series analysis. *arXiv preprint arXiv:2210.02186*.

[2] Period selection and length setting when changing to 2D, GitHub: https://github.com/thuml/Time-Series-Library/issues/7


Origin blog.csdn.net/qq_33431368/article/details/132703129