Study notes: Transformer-based time series forecasting model

1 Some preparation notes

To make things easier to follow, the author will use an actual project on predicting shield machine tunneling parameters to explain the Transformer model. This post is mainly a record of my own learning and is best suited to readers who already have some understanding of the Transformer. It will be updated irregularly.

Some references and picture sources:
  • Transformer paper link
  • What are the details of the transformer? (Zhihu)
  • In-depth understanding of Transformer and its source code interpretation
  • Informer paper link


1.1 Data used

The data is stored as a csv file; only part of it is shown here.
insert image description here
Not all parameters are used in this project. The example in this article only uses:

"state": ["刀盘转速(r/min)",
          "刀盘压力(bar)",
          "总推进力(KN)",
          "螺机转速(r/min)"],
"action": ["A组推进压力设定(bar)",
           "B组推进压力设定(bar)",
           "C组推进压力设定(bar)",
           "D组推进压力设定(bar)",
           "推进速度2(mm/min)"],
"target": ["VMT导向垂直后(mm)",
           "VMT导向水平前(mm)",
           "VMT导向垂直前(mm)",
           "VMT导向水平后(mm)",
           "VMT导向水平趋向RP(mm)",
           "VMT导向垂直趋向RP(mm)"]

These parameters are extracted using the pandas package.
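A minimal sketch of that extraction (the csv file name is hypothetical; the column lists are the ones from the configuration above):

import pandas as pd

# Hypothetical file name; the actual csv used in the project is not given in this post.
df = pd.read_csv("tunneling_data.csv")

state_cols  = ["刀盘转速(r/min)", "刀盘压力(bar)", "总推进力(KN)", "螺机转速(r/min)"]
action_cols = ["A组推进压力设定(bar)", "B组推进压力设定(bar)", "C组推进压力设定(bar)",
               "D组推进压力设定(bar)", "推进速度2(mm/min)"]
target_cols = ["VMT导向垂直后(mm)", "VMT导向水平前(mm)", "VMT导向垂直前(mm)",
               "VMT导向水平后(mm)", "VMT导向水平趋向RP(mm)", "VMT导向垂直趋向RP(mm)"]

data = df[state_cols + action_cols + target_cols].to_numpy()   # shape: (T, 15)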

1.2 Format of time series data

Next, let's look at the format of the time series data.
The data at a single moment is a 1×15 row tensor, like the following:

[[1.51, 86.656, 69.550, ......(15 values in total)]]

From left to right, this row contains the values of the 15 tunneling parameters (刀盘转速(r/min), 刀盘压力(bar), 总推进力(KN), ...) at a single moment. Data for several consecutive moments, for example 3 moments, forms a 3×15 tensor as shown below:

[[1.51, 86.656, 69.550, ......(15 values in total)],
 [1.52, 86.756, 69.650, ......(15 values in total)],
 [1.53, 86.856, 69.750, ......(15 values in total)]]

Note that the two sets of numbers above were made up by the author and have no practical meaning, so please do not read anything into them.

1.3 Input and output of Transformer

Assume batch_size = 32. In the rest of this post, the author takes the training process as the running example. The goal is to predict the data of 2 future moments (2×6) from the data of 8 past moments (8×15).

  • Input: Divided into encoder and decoder input, the size is 32×8×15 and 32×2×15 respectively
  • Output: only one, with dimensions 32×2×6

Here 32 is batch_size. A 32×8×15 tensor can be understood as 32 samples, each containing 8 past moments, and 15 is the number of tunneling parameters considered (刀盘转速, 刀盘压力, 总推进力, 螺机转速, etc., i.e. the 15 parameters listed in 1.1).
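As a sketch of how one such training sample could be sliced out of the full parameter matrix (the project's actual window construction is not shown in this post, so treat this as an assumption; only the window length 8, horizon 2 and the 15/6 parameter counts come from the text above):

def make_window(data, t, past=8, future=2):
    # data: the (T, 15) array from 1.1; the 6 target parameters are assumed to be the last 6 columns.
    # t is the index of the first future moment to predict.
    enc_in = data[t - past:t, :]              # (8, 15) past moments -> encoder input
    dec_in = data[t - 1:t - 1 + future, :]    # (2, 15) last known moment + ground-truth future (training)
    target = data[t:t + future, -6:]          # (2, 6)  future values of the 6 target parameters
    return enc_in, dec_in, target

Stacking 32 such windows (for example with a DataLoader) gives the 32×8×15 and 32×2×15 inputs and the 32×2×6 target described above.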
In the overall structure diagram below, the input and output data sizes will be marked at the corresponding positions. Note that the input and output tensor sizes of the Encoder layer are the same, and the same is true for the input and output of the Decoder layer.

2 overall structure

insert image description here
Note: to adapt the model to time series prediction, the final Softmax layer of the original Transformer is removed.

3 Enter code

The input needs to be normalized first (a small sketch of one possible normalization is given at the end of this subsection).
The flow of the input data is shown in the figure below; focus on the change in dimension (15 ——> 512).
insert image description here
In some other time series forecasting projects, such as Informer, a Global Time Stamp is also added to account for date factors such as week, month and holidays. None of these are considered in this article; only positional encoding is used. See the Informer paper link for details.
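The post does not say which normalization is used; a common choice, shown here purely as an assumed example, is per-parameter standardization with statistics from the training split (train_data and test_data are hypothetical (T, 15) arrays):

mean = train_data.mean(axis=0)
std = train_data.std(axis=0) + 1e-8           # avoid division by zero
train_data = (train_data - mean) / std
test_data = (test_data - mean) / std          # always use the training-set statistics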

3.1 Embeddings

First, nn.Linear(n_encoder_inputs, channels) is used to project the two inputs to dimension 512, so their sizes become 32×8×512 and 32×2×512.
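A minimal sketch of this step with the sizes used in this post (whether the encoder and decoder share one projection layer or use two separate ones is not stated; two separate layers are assumed here):

import torch
import torch.nn as nn

enc_projection = nn.Linear(15, 512)    # nn.Linear(n_encoder_inputs, channels)
dec_projection = nn.Linear(15, 512)    # assumed separate projection for the decoder input

src = enc_projection(torch.randn(32, 8, 15))   # -> (32, 8, 512)
tgt = dec_projection(torch.randn(32, 2, 15))   # -> (32, 2, 512)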

3.2 Positional Encoding

Positional encoding makes the order of the sequence visible to the model. The specific formula is as follows:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos = 0~7 or 0~1 and i = 0~512/2.
The two Positional Encoding blocks in the figure produce two tensors of sizes 32×8×512 and 32×2×512 respectively.
(If you want to know more about the role of positional encoding, you can refer to the Zhihu article "What are the details of the transformer?"; this article will not discuss it further.)
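A minimal PyTorch sketch of this sinusoidal encoding (batch-first layout; max_len is chosen arbitrarily here):

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=500):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                     # pos = 0, 1, 2, ...
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                      # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                      # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))                       # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); returns the encoding for the first seq_len positions
        return self.pe[:, :x.size(1)]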

3.3 Embedding With Time Signal

Adding the tensors from 3.1 and 3.2 gives two new tensors of sizes 32×8×512 and 32×2×512, as shown in the figure below; a small code sketch follows the figure.
insert image description here
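In code this is just an element-wise addition, continuing the sketches from 3.1 and 3.2:

pos_enc = PositionalEncoding(d_model=512)
src = src + pos_enc(src)    # (32, 8, 512)
tgt = tgt + pos_enc(tgt)    # (32, 2, 512)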

4 Encoder

Here we first show the process of changing the size of the tensor in the Encoder.
insert image description here
The Encoder is composed of N identical layers, and the output of each layer is used as the input of the next layer. The output of the last layer enters the Decoder as the K and V input of the Multi-Head Attention layer.
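For reference, the same structure can be assembled from PyTorch's built-in layers (a sketch only; the hand-written version is discussed from 4.1 onwards, and the number of layers here is just an example):

import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048,
                                           dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)   # N identical layers, N=6 assumed
memory = encoder(src)   # (32, 8, 512): output size equals input size, as noted above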

4.1 Multi-Head Attention layer

4.1.1 Self-Attention

The following figures show the process of the Self-Attention mechanism; X is the input and Z is the output.
insert image description here
insert image description here
insert image description here
For a more detailed explanation of self-attention you can refer to other articles; there is plenty online, so I won't go into detail here. A minimal sketch of the attention computation is given below instead.
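The sketch below is a standard implementation of Scaled Dot-Product Attention; the MultiHeadAttention code in 4.1.2 further down calls an attention function of exactly this form:

import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None, dropout=None):
    # query/key/value: (..., seq_len, d_k)
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)  # (..., seq_len, seq_len)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)   # block masked positions before softmax
    p_attn = F.softmax(scores, dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn          # Z and the attention weights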

4.1.2 Multi-Head Attention

Multi-Head Attention converts the input into Q, K and V, splits them along the d_model=512 dimension into h=8 pieces so that the per-head dimension becomes d_q = d_k = d_v = 512/8 = 64, runs Self-Attention on each piece separately, concatenates the results back to dimension d_model=512, and finally applies a Linear projection (which leaves the dimension unchanged) to obtain the output. As shown in the figures below.

There are two points about Multi-Head Attention that are easy to get confused about:
1.
When the paper introduces Multi-Head Attention, the inputs are denoted by the three letters Q, K and V, which are then passed through Linear projections. On a first reading this is genuinely confusing, because most mainstream introductions distinguish between the input and the Q, K, V used in the attention computation, which seems to conflict with the Self-Attention description. The way to understand it is that the inputs (written here as Q, K, V) are passed through Linear projections to generate new Q, K, V that replace the originals, and these are then used for the Self-Attention computation, i.e. the Scaled Dot-Product Attention part in the figure above. With that reading everything is consistent.

2.
From what I have found so far, there are two ways Multi-Head Attention is interpreted online.
(1) The first is the way it is explained in the paper:
The figure in the paper is shown below, and the formula is
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).
insert image description here
And I believe many people have also seen the following diagram explaining Multi-Head Attention:
insert image description here
Neither picture is wrong on its own, but problems appear when they are viewed together.
First, regarding the representation of the Multi-Head Attention input: the paper uses Q, K and V, while the second figure uses X, and the correspondence between the two is not obvious. This was already explained in point 1: the Q, K, V fed into Multi-Head Attention in the paper are exactly the X of the second figure, i.e. Q = K = V = X. Second, the rightmost W^O in the second figure is drawn as an elongated matrix, but in the paper it should actually be a 512x512 matrix; the author of that figure probably drew it that way to highlight the correspondence between dimensions.
After resolving these issues and redrawing the diagram, the result should be the picture below.
insert image description here
After understanding this picture, let’s take a look at how the source code is written.

(2) The second is the approach taken in the source code:
The code is given first; it may differ slightly in detail from the PyTorch source, but the overall idea is the same.

# A class implementing the multi-head attention mechanism
import copy
import torch
import torch.nn as nn

def clones(module, N):
    # Produce N identical (deep-copied) layers
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class MultiHeadAttention(nn.Module):
    def __init__(self, head, embedding_dim, dropout=0.1):
        # head: number of attention heads
        # embedding_dim: dimension of the embeddings (d_model)
        # dropout: zeroing rate used in the Dropout operation
        super(MultiHeadAttention, self).__init__()
        assert embedding_dim % head == 0

        self.d_k = embedding_dim // head
        self.head = head
        self.embedding_dim = embedding_dim
        # Four linear layers: for Q, K, V and the final output projection W^O
        self.linears = clones(nn.Linear(embedding_dim, embedding_dim), 4)
        # Attention weights, filled in during forward
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        # query, key, value are the three inputs of the attention mechanism;
        # in the Encoder all three are identical. mask is the mask tensor.
        if mask is not None:
            # Add a dimension so the same mask is broadcast over all heads
            mask = mask.unsqueeze(1)
        # Get batch_size
        batch_size = query.size(0)

        # Project the inputs, then split the last dimension into (head, d_k)
        query, key, value = \
            [model(x).view(batch_size, -1, self.head, self.d_k).transpose(1, 2)
             for model, x in zip(self.linears, (query, key, value))]

        # Run the attention function from 4.1.1 on every head in parallel
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        # The per-head result is a 4-D tensor and needs to be reshaped back:
        # dimensions 1 and 2 were transposed above, so transpose them back here.
        # contiguous() is required after transpose(), otherwise view() cannot be used.
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.head * self.d_k)
        # Finally pass x through the last linear layer in the list (the W^O projection)
        return self.linears[-1](x)

At first glance this does not seem to match the explanation in the paper: the input is simply multiplied by three matrices and then split, so where did the multiple heads go?
The concrete operation of the code can be understood from the figure below. To make it easier to compare (1) the explanation in the paper with (2) the approach of the source code, the operations in the figure ignore batch_size, and (1) and (2) are drawn in the same figure.
insert image description here
Comparing (1) and (2), you can see that the idea of (2) is simply to merge W0 ... W7 into one matrix so that all heads are computed in parallel; the essential idea is the same.
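A quick shape check of the MultiHeadAttention module above, using the sizes of this post (a usage sketch, not project code):

mha = MultiHeadAttention(head=8, embedding_dim=512)
x = torch.randn(32, 8, 512)     # encoder input after embedding + positional encoding
out = mha(x, x, x)              # self-attention: query = key = value = x
print(out.shape)                # torch.Size([32, 8, 512]), same size as the input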

4.2 Add & Norm layer

There are plenty of explanations of this online, so I will be lazy here and only leave a small sketch below.
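The sketch shows the usual form, LayerNorm applied over a residual connection (post-norm, as in the original paper; details vary between implementations):

import torch.nn as nn

class AddNorm(nn.Module):
    # Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer_out):
        return self.norm(x + self.dropout(sublayer_out))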

4.3 Feed Forward layer

There are plenty of explanations of this online as well, so again only a small sketch is left below.
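The sketch assumes the usual 512 -> 2048 -> 512 sizes from the original paper:

import torch.nn as nn

class FeedForward(nn.Module):
    # Two linear layers with ReLU in between, applied to every time step independently
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)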

5 Decoder

Except for the Masked Multi-Head Attention layer, the Decoder is consistent with the Encoder, so I will not go into details here and only show how the tensor sizes change inside the Decoder.
insert image description here

5.1 Masked Multi-Head Attention layer

For the code, refer to the mask handling in the MultiHeadAttention code in 4.1.2.
For how the mask works differently during testing and training, see Section 6, Differences between model training and testing. A small sketch of constructing such a mask follows the figures below.
insert image description here
insert image description here
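A minimal sketch of constructing such a mask (a lower-triangular matrix passed as the mask argument of the MultiHeadAttention code in 4.1.2):

import torch

def subsequent_mask(size):
    # Position i may only attend to positions <= i; masked entries are False
    return torch.tril(torch.ones(1, size, size)).bool()

print(subsequent_mask(2))
# tensor([[[ True, False],
#          [ True,  True]]])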

6 Differences between model training and testing

During my study of the Transformer I was quite confused about the difference between training and testing, so this section is devoted to it. The main ideas follow the article "What are the details of the transformer?".

6.1 When testing

In NLP tasks, the sentence to be translated is usually fed to the Encoder. If the sentence has 3 words and the translation also has 3 words (for example "我" "是" "谁" ——> "who" "am" "I"), the Encoder input (ignoring the Padding Mask) has size (3, 512).
The input and output of the Decoder are rather different. In the Decoder's Multi-Head Attention layer, K and V are linear transformations of the Encoder output Memory (which at this point contains the encoded information of every position of the original input sequence), while Q is a linear transformation of the hidden vectors output by the Decoder's Masked Multi-Head Attention layer. At each decoding moment, the Decoder first lets Q interact with K (the query) to compute the attention weight matrix, then combines the attention weights with V to obtain a weighted vector; this vector expresses how attention should be distributed over the positions of Memory during decoding.
When decoding the first moment, the Decoder is given a start-of-sentence vector (representing <start>), so the input has size (1, 512), as shown in the figure below. After Q, K and V are obtained, Q first interacts with K to get the weight vector; this can be viewed as Q (the vector to be decoded) querying K (essentially Memory) for the information at each position of Memory that is relevant to Q. The weight vector is then combined with V to obtain the decoded vector, which can be viewed as an output vector that has taken the encoded information of every position of Memory into account, i.e. it contains the information from the relevant positions of Memory that should be attended to when decoding the current moment. Finally, the Decoder output passes through a linear layer and then a classification layer to produce the decoded output for the current moment. If the model is accurate, the output should be "who".
insert image description here
insert image description here
After decoding at the first moment is finished, the first-moment input together with the first-moment output is used as the decoder input to predict the output of the second moment. Similarly, after decoding at the second moment, the inputs of the first and second moments together with the second-moment output are used as the decoder input to predict the output of the third moment.

  • The complete process is as follows:
    the first moment: {<start>} ——> {who}
    the second moment: {<start>, who} ——> {am}
    the third moment: {<start>, who, am} ——> {I}
    the fourth moment: {<start>, who, am, I} ——> {<end>}

Obviously there is a problem here. At the third moment the input is {<start>, who, am}, which is a (3, 512) tensor, and the specific computation is shown in the figures below.
insert image description here
insert image description here
The final Decoder output is then a (3, 512) tensor, the same size as the Decoder input, yet to obtain the result we only need a (1, 512) tensor. For this reason, only the last vector of the Decoder output is fed to the classifier, which gives the decoded output "I" for the current moment. Similarly, in the time series forecasting (时间序列预测) task, we want to predict the data of 2 future moments (t1, t2).

  • The complete process is as follows:
    The first moment: {t0 data} ——> {t1 data}
    The second moment: {t0 data, t1 data} ——> {t2 data}

At the second moment, the final Decoder output is a (2, 512) tensor, the same size as the Decoder input, yet to obtain the t2 data we only need a (1, 512) tensor. For this reason, only the last vector of the Decoder output is taken, and from it the t2 data is obtained.
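A sketch of this step-by-step decoding for the time series case. model.encode, model.decode and build_next_input are hypothetical names standing in for the encoder stack, the decoder stack and whatever the project does to turn the 6 predicted targets back into a full 15-dimensional input row; only the shapes follow the text above:

import torch

memory = model.encode(enc_in)              # enc_in: (1, 8, 15) -> memory: (1, 8, 512)
dec_in = enc_in[:, -1:, :]                 # the t0 data plays the role of the <start> token
predictions = []
for _ in range(2):                         # predict t1, then t2
    out = model.decode(dec_in, memory)     # (1, current_len, 6)
    next_step = out[:, -1:, :]             # keep only the last vector, as described above
    predictions.append(next_step)
    dec_in = torch.cat([dec_in, build_next_input(next_step)], dim=1)   # append the new moment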

6.2 During training

After introducing the decoding process at test time, let's look at how the network decodes during training. In real prediction, the decoder has to use the output of the previous moment as the input for decoding the next moment, decoding moment by moment. Doing the same thing during training would obviously be very time-consuming. Therefore, during training the decoder, like the encoder, receives the inputs for all decoding moments at once. The advantage is twofold: first, network training is sped up by computing all positions in parallel; second, during training the decoder is fed the correct results directly rather than its own predictions from the previous moment (which may be wrong), so the network trains better.
Still taking the NLP task of 6.1 as an example: the Encoder input is {我, 是, 谁}, the Decoder input is {<start>, who, am, I}, and the corresponding correct label is {who, am, I, <end>}. Suppose the decoder input {<start>, who, am, I} is multiplied by a matrix (the linear transformation) to obtain Q, K and V, and the attention weight matrix is computed from Q and K (the softmax operation has not yet been applied at this point), as shown below.
insert image description here
From the weight vector in row 1 it can be seen that, when decoding the first moment, 2/9 of the attention would be placed on one position, 1/3 on another, and so on. But there is a problem: when predicting, the model must not see information after the current moment. The Decoder in the Transformer solves this by adding an attention mask mechanism. As shown in the figure below, the left matrix is still the attention weight matrix computed from Q and K (before softmax), and the middle one is the so-called attention mask matrix; the two are added and then multiplied by V to give the output of the whole self-attention mechanism, which is exactly the Masked Multi-Head Attention in the Decoder.
insert image description here
Why does adding this attention mask matrix to the attention weight matrix achieve the desired effect? Take the weights in the first row of the figure as an example. When the decoder decodes the first moment, its corresponding input is only <start>, so all attention should be placed on the first position (the <start> position), even though during training the decoder is fed all inputs at once; in other words, the weight at the first position should be 1 and at all other positions 0. As can be seen from the figure, adding the attention mask of the first row to the attention vector of the first row and then applying softmax yields a vector close to [1, 0, 0, 0, 0]. This vector guarantees that only the first position (the <start> position) is attended to when decoding the first moment. A similar process happens at the later decoding moments. Incidentally, this operation fits with the "only the last vector" operation mentioned in 6.1.
In the same way, the task of time series forecasting is also similar to the above process, so I won't repeat it in detail.
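A sketch of one training step under this scheme, with ground-truth decoder inputs and the mask from 5.1; model and its call signature are assumptions, and the shapes follow section 1.3:

import torch
import torch.nn as nn

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# enc_in: (32, 8, 15), dec_in: (32, 2, 15) ground-truth inputs, target: (32, 2, 6)
tgt_mask = subsequent_mask(dec_in.size(1))           # causal mask over the 2 decoder positions
pred = model(enc_in, dec_in, tgt_mask=tgt_mask)      # (32, 2, 6): all positions predicted in parallel
loss = criterion(pred, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()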

7 Some minor improvements to the model

Inspired by Informer and combined with the requirements of my own project, I made some small adjustments to the model's input.

Origin blog.csdn.net/xxt228/article/details/128754364