A detailed walkthrough of the Informer code!

Author: Xiaodou, who loves to learn. This article is excerpted from Yuanyuan's algorithm notes.

  • Zhihu link:

    • https://zhuanlan.zhihu.com/p/646853438

1

Foreword

Informer won a best-paper award in the time-series field in 2021. After reading the article and the code carefully, you will find that the ideas, the argumentation, and the code framework are all very well done and a pleasure to read. Most subsequent time-series forecasting algorithms build on Informer, including its input/output format, its way of encoding features, and so on.

2

Starting points and innovations of the paper


Figure 1 Informer overall framework diagram

Starting points

The self-attention computation in the Transformer has quadratic complexity, O(L^2), in the sequence length L.

In the traditional Transformer, the input and output of every block keep the same shape, so the network is a "column". Stacking J such blocks raises the total cost to O(J * L^2), which means the model input cannot be made very long and limits the scalability of the time-series model.

The decoder of the traditional Transformer produces its output step by step. On the one hand this increases inference time, and on the other hand it accumulates errors.

Innovations

The ProbSparse self-attention mechanism is proposed, which reduces both time and memory cost to at most O(L * log L); this is part 1 in Figure 1.

A "distillation" operation is added between each attention block, highlighting the main attention by halving the shape of the sequence, the original columnar Transformer becomes a pyramidal Transformer, so that the model can accept longer sequence inputs, and It can reduce memory and time consumption, which is 2 in Figure 1.

A relatively simple decoder is designed that outputs all predicted values in a single forward pass; this is part 3 in Figure 1 (the yellow part of the output).

3

Paper Details

The authors observed that in the original Transformer's self-attention, the attention scores follow a long-tail distribution: only a small number of points are strongly correlated with other points, as shown in Figure 2.


Figure 2 Distribution of attention scores in different heads in the original Transformer

So if we can drop the useless queries when computing attention in the Transformer, we can reduce the amount of computation. Related work also shows that dropping some useless queries does not hurt accuracy, as shown in Figure 3:


Figure 3: Prior work shows that computing attention over fewer points does not cause accuracy loss

So the two questions the authors need to answer are how to define the useless queries (called "Lazy" queries) and how to find them.

How to define and find "Lazy query"

The authors argue that the more important a query is, the further its attention distribution should be from a uniform distribution. Therefore the KL divergence between each query's attention distribution and the uniform distribution (KL divergence is commonly used to measure how different two distributions are) can serve as the importance of that query, as shown in Figure 4.


Figure 4

If we computed the KL divergence between every query's distribution and the uniform distribution exactly according to the formula, the large amount of computation required before attention is even calculated would be unaffordable. The authors therefore bound and simplify this part of the formula (if you are interested, see the proof in the appendix of the original paper; I won't repeat it here, but the code section explains in detail how it is done). Figure 5 shows the final "activity" score of each query: the lower the score, the more likely the query is "Lazy". Many terms in this formula can be reused, so the amount of computation drops considerably.


Figure 5 The "activity" score formula for each query
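For reference, the simplified "activity" measurement that Figure 5 depicts can be written out as in the Informer paper (q_i is a query, k_j ranges over the keys, d is the head dimension, L_K is the number of keys):

$$\bar{M}(\mathbf{q}_i, \mathbf{K}) = \max_{j}\left(\frac{\mathbf{q}_i \mathbf{k}_j^{\top}}{\sqrt{d}}\right) - \frac{1}{L_K}\sum_{j=1}^{L_K}\frac{\mathbf{q}_i \mathbf{k}_j^{\top}}{\sqrt{d}}$$

The first term is the query's largest attention logit and the second is its average; if the maximum barely exceeds the average, the query's distribution is close to uniform, i.e. the query is "Lazy".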

4

Code analysis

The code execution process is as follows:

1. Randomly sample a subset of the keys; by default the number of sampled keys is factor * ln(L) (L is the sequence length).

2. Using only the sampled keys, compute each query's activity score M(q_i, K) = max_j(q_i * k_j^T / sqrt(d)) - mean_j(q_i * k_j^T / sqrt(d)).

3. Select the N queries with the highest activity scores; by default N = factor * ln(L).

4. Compute attention between these N queries and all keys.

5. The remaining L - N queries do not participate in the computation; their outputs are filled directly with the mean (of the values), so the input and output shapes stay unchanged. A minimal sketch of these five steps follows below.
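To make these five steps concrete, here is a minimal, self-contained sketch of the sampling logic. It is simplified from the repo's _prob_QK (the helper name prob_sparse_scores and the shapes are my own; masking and edge cases are omitted):

import math
import torch

def prob_sparse_scores(Q, K, factor=5):
    # Q, K: (B, H, L, d) -- a simplified sketch of steps 1-4 above, not the exact repo code
    B, H, L, d = Q.shape
    u = min(factor * math.ceil(math.log(L)), L)   # queries to keep (step 3); also keys to sample (step 1)

    # Step 1: randomly sample a subset of key positions
    idx = torch.randint(0, L, (u,))
    K_sample = K[:, :, idx, :]                                  # (B, H, u, d)

    # Step 2: activity score = max over sampled keys minus their mean
    QK_sample = torch.matmul(Q, K_sample.transpose(-2, -1))     # (B, H, L, u)
    M = QK_sample.max(dim=-1).values - QK_sample.mean(dim=-1)   # (B, H, L)

    # Step 3: keep the u most "active" queries
    top_idx = M.topk(u, dim=-1).indices                         # (B, H, u)

    # Step 4: only these queries attend to all keys (the 1/sqrt(d) scaling happens later in forward)
    Q_reduced = torch.gather(Q, 2, top_idx.unsqueeze(-1).expand(-1, -1, -1, d))
    scores_top = torch.matmul(Q_reduced, K.transpose(-2, -1))   # (B, H, u, L)
    return scores_top, top_idx

Step 5 (filling the other L - u rows with the mean of the values) is handled by _get_initial_context and _update_context, which we get to below.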

Let's start by locating the ProbAttention class.


Figure 6

As shown in Figure 6, some initialization is done at the beginning. U_part and u correspond respectively to the number of sampled keys and the number of selected queries in the steps above. Both are controlled by the factor parameter, and the sample count is factor * ln(L).

Then the function _prob_QK is called to get the dot products between the most active queries and all keys (recall that the attention formula first takes the dot product and then divides by sqrt(d) for normalization), as well as the indices of these queries (see Figure 7, which I have annotated in detail).

scores_top, index = self._prob_QK(
            queries, keys, sample_k=U_part, n_top=u)


Figure 7 Probabilistic sparse attention calculation process

Back in forward(), the next step is to normalize the dot products computed above:

scale = self.scale or 1. / sqrt(D)
if scale is not None:
    scores_top = scores_top * scale  # divide by the denominator in the attention formula for normalization

Next, the function _get_initial_context computes the mean of the values and uses it to initialize the attention output of every query, which is equivalent to initializing the result for all queries and all keys with the mean; afterwards, using the scores above and the indices of the selected queries, the entries corresponding to the active queries are replaced.

# get the context
context = self._get_initial_context(values, L_Q)  # (batch_size, n_heads, seq_len_q, dim_qkv)


Figure 7 Initialize the Attention scores of all queries and keys with the mean
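As a rough sketch of that idea (simplified to the non-masked encoder case; the function name is illustrative), the initialization amounts to repeating the mean of V for every query position:

import torch

def get_initial_context_sketch(V, L_Q):
    # V: (B, H, L_V, d). Every query position starts from the mean of the values;
    # only the "active" query positions will be overwritten in the next step.
    V_mean = V.mean(dim=-2)                                       # (B, H, d)
    return V_mean.unsqueeze(-2).expand(-1, -1, L_Q, -1).clone()   # (B, H, L_Q, d)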

Then the _update_context function replaces the outputs of the active queries (see Figure 8, with comments).

# update the context with selected top_k queries
context, attn = self._update_context(context, values, scores_top, index, L_Q, attn_mask)


Figure 8
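A condensed sketch of that replacement (again simplified: no masking, names illustrative):

import torch

def update_context_sketch(context, V, scores_top, index):
    # context: (B, H, L_Q, d), initialized with the mean of V
    # scores_top: (B, H, u, L_V), scaled dot products of the active queries with all keys
    # index: (B, H, u), positions of the active queries
    attn = torch.softmax(scores_top, dim=-1)
    B, H, _ = index.shape
    b_idx = torch.arange(B)[:, None, None]    # broadcast over heads and selected queries
    h_idx = torch.arange(H)[None, :, None]
    context[b_idx, h_idx, index, :] = torch.matmul(attn, V)   # overwrite only the active rows
    return context

The lazy queries keep their mean initialization, so the output has the same shape as full attention would produce.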

As shown in Figure 9, the Informer encoder is composed of e_layers EncoderLayers, e_layers - 1 ConvLayers (for downsampling), and a LayerNorm. Each EncoderLayer contains an AttentionLayer built from probabilistic sparse attention. The overall structure is very simple, and the code is very nicely written.


Figure 9 Informer's Encoder

The probabilistic sparse attention part was covered above, so let's look at the ConvLayer here. It is controlled by the distil parameter; if this parameter is set to False, no downsampling happens at all. Looking at the code of this class, it is essentially max-pooled downsampling that halves the seq_len dimension.


Figure 10 Downsampling operation in Informer
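For reference, here is a stripped-down version of that downsampling block (assumed to match the behaviour described above; the repo's ConvLayer differs in details such as padding mode):

import torch.nn as nn

class DistilConv(nn.Module):
    # conv -> norm -> ELU -> max-pool with stride 2: seq_len is roughly halved
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                  # x: (B, L, d_model)
        x = self.conv(x.transpose(1, 2))   # Conv1d expects (B, d_model, L)
        x = self.pool(self.act(self.norm(x)))
        return x.transpose(1, 2)           # (B, ~L/2, d_model)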

Figure 11 shows the overall structure of the encoder, which is very clear from the code.


Figure 11
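Put together, the encoder's forward pass is roughly the following loop (a sketch assuming distil=True, i.e. one ConvLayer fewer than attention layers; the real layers also return attention weights, which I drop here):

def encoder_forward_sketch(x, attn_layers, conv_layers, norm_layer):
    # attn_layers: e_layers EncoderLayers; conv_layers: (e_layers - 1) distilling ConvLayers
    for attn_layer, conv_layer in zip(attn_layers, conv_layers):
        x = attn_layer(x)      # probabilistic sparse self-attention + feed-forward
        x = conv_layer(x)      # distilling: seq_len is halved here
    x = attn_layers[-1](x)     # the last attention block has no distilling after it
    return norm_layer(x)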

As shown in Figure 12, the decoder in Informer is composed of d_layers DecoderLayers, a LayerNorm, and a Linear layer (responsible for producing the final pred_len-length output). The first AttentionLayer in each DecoderLayer is probabilistic sparse self-attention, and the second AttentionLayer is full cross attention (the encoder output serves as key and value, and the output of the previous decoder layer serves as query).


Figure 12
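In code terms, each DecoderLayer does roughly the following (a simplified sketch; masking is omitted and the exact residual/norm placement in the repo may differ):

def decoder_layer_sketch(x, enc_out, self_attn, cross_attn, ffn, norm1, norm2, norm3):
    # self-attention over the decoder input (probabilistic sparse, masked in the real model)
    x = norm1(x + self_attn(x, x, x))
    # cross attention: queries from the decoder, keys/values from the encoder output
    x = norm2(x + cross_attn(x, enc_out, enc_out))
    # position-wise feed-forward network
    return norm3(x + ffn(x))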

Figure 13 shows the structure of the Decoder


Figure 13 Structure of Decoder in Informer

As you can see, the decoder input consists of two parts. The first part in the figure has length label_len and corresponds to the last label_len steps of the encoder input; the second part has length pred_len, and when fed into the decoder this part is filled with zeros. Figure 14 shows how the decoder input is generated:


Figure 14
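A minimal sketch of that construction (the function and variable names are mine; it mirrors the "last label_len steps plus pred_len zeros" recipe described above):

import torch

def build_decoder_input(x_enc, label_len, pred_len):
    # x_enc: (B, seq_len, n_features) -- the encoder input window
    dec_known = x_enc[:, -label_len:, :]                           # the known "label" segment
    dec_zeros = torch.zeros(x_enc.size(0), pred_len, x_enc.size(-1),
                            device=x_enc.device, dtype=x_enc.dtype)
    return torch.cat([dec_known, dec_zeros], dim=1)                # (B, label_len + pred_len, n_features)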

Let's look at how the data flows through the whole Informer, as shown in Figure 15, where I have added shape annotations to each step.


Figure 15

Note that both the encoder input and the decoder input first go through an embedding. This embedding consists of three parts: the token embedding of the input feature sequence, the position embedding that preserves the order of the time series, and the temporal embedding of each time point (for example, today is July 30th; this date carries information such as the day of the month, the day of the week, and whether it is the end of the month. The author encodes this information, and subsequent time-series forecasting work basically uses similar encodings). See Figure 16, with annotations.


Figure 16
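A simplified sketch of that three-part embedding (the repo uses a convolutional token embedding and a fixed sinusoidal positional embedding; here learned layers stand in, and the module name is illustrative):

import torch
import torch.nn as nn

class DataEmbeddingSketch(nn.Module):
    def __init__(self, c_in, d_model, n_time_feats=4, max_len=5000):
        super().__init__()
        self.value_emb = nn.Linear(c_in, d_model)          # token embedding of the raw features
        self.pos_emb = nn.Embedding(max_len, d_model)      # preserves the order of the window
        self.time_emb = nn.Linear(n_time_feats, d_model)   # month / day / weekday / hour features

    def forward(self, x, x_mark):
        # x: (B, L, c_in) raw series; x_mark: (B, L, n_time_feats) calendar features
        pos = torch.arange(x.size(1), device=x.device)
        return self.value_emb(x) + self.pos_emb(pos) + self.time_emb(x_mark)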

5

Personal experience

Time complexity: Informer runs very fast. Even compared with models in later papers that claim to be faster, Informer's efficiency has been the highest in my experience, mainly because it discards a large fraction of queries when computing attention and downsamples the sequence.

When I first used Informer, the results were very poor; after I removed the downsampling in the middle, the results suddenly got much better. My analysis: the seq_len we fed in was not large to begin with, and with probabilistic sparse attention plus repeated downsampling, very few queries end up being used, which is bound to be a problem.

The main parameters worth tuning in Informer are factor, seq_len, and d_model; tuning the rest has little effect.

Many time-series models focus on reducing time complexity and memory consumption, but in essence they still use the Transformer for time-series tasks. When applying them to your own scenario, you may need some structural changes: you can keep Informer's efficient modules for extracting temporal information, such as probabilistic sparse attention, and add structures suited to your own business scenario, depending on your own design.


Origin blog.csdn.net/qq_33431368/article/details/132419213