Transformer framework time series model Informer content and code interpretation


Paper: https://arxiv.org/abs/2012.07436
Code: https://github.com/zhouhaoyi/Informer2020



Foreword

The Transformer model is one of the hottest models at the moment and is widely used in fields such as NLP and CV. However, it is still rarely used for time series problems. This paper applies a Transformer-based model to long-sequence time series forecasting and won the AAAI 2021 Best Paper award. The content is fairly novel: in particular, it proposes a new attention layer, ProbSparse Self-Attention, together with a Distilling operation, which greatly improve the model's performance while reducing its parameters and are worth learning from in many settings.


1. Dataset

The paper provides several datasets. I take WTH.csv as an example; the other datasets are handled similarly.
The data has 35064 rows and 13 columns. The first column is the timestamp, at hourly granularity. The remaining 12 columns are ordinary variables used for the multivariate prediction task; that is, these 12 columns serve both as the features X and as the labels Y to be predicted.
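A quick look at the file (the path below is an assumption; point it at wherever WTH.csv sits in your checkout):

import pandas as pd

df = pd.read_csv('./data/WTH.csv')   # hypothetical path
print(df.shape)        # (35064, 13)
print(df.columns[0])   # the timestamp column ('date' in the repo's CSVs)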

2. Special engineering operations of data sets

The data is split into a training set, a validation set and a test set: 24544 training samples (rows 0 ~ 24544), 3508 validation samples (rows 24448 ~ 28052) and 7012 test samples (rows 27956 ~ 35064). The data is standardized, and the time column is extracted separately. To draw more information out of the time dimension, the single time column is converted into 4 columns of time features.
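A sketch of how those boundaries can be derived (a 70% / 10% / 20% split, with a seq_len = 96 overlap so the validation and test windows still have enough history; the variable names follow the style of the repo but are assumptions here):

seq_len = 96
n = 35064
num_train = int(n * 0.7)                 # 24544
num_test = int(n * 0.2)                  # 7012
num_vali = n - num_train - num_test      # 3508
border1s = [0, num_train - seq_len, n - num_test - seq_len]   # [0, 24448, 27956]
border2s = [num_train, num_train + num_vali, n]               # [24544, 28052, 35064]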

1. Standardization

The code is as follows (example):

class StandardScaler():
    def __init__(self):
        self.mean = 0.
        self.std = 1.
    
    def fit(self, data):
        self.mean = data.mean(0)
        self.std = data.std(0)

    def transform(self, data):
        mean = torch.from_numpy(self.mean).type_as(data).to(data.device) if torch.is_tensor(data) else self.mean
        std = torch.from_numpy(self.std).type_as(data).to(data.device) if torch.is_tensor(data) else self.std
        return (data - mean) / std

    def inverse_transform(self, data):
        mean = torch.from_numpy(self.mean).type_as(data).to(data.device) if torch.is_tensor(data) else self.mean
        std = torch.from_numpy(self.std).type_as(data).to(data.device) if torch.is_tensor(data) else self.std
        if data.shape[-1] != mean.shape[-1]:
            mean = mean[-1:]
            std = std[-1:]
        return (data * std) + mean
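A minimal usage sketch, assuming df is the WTH.csv DataFrame loaded earlier and the 24544-row training split described above:

data = df.iloc[:, 1:].values          # drop the date column -> shape (35064, 12)
scaler = StandardScaler()
scaler.fit(data[:24544])              # statistics come from the training rows only
data_scaled = scaler.transform(data)  # the same statistics are applied to train / val / test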

2. Time information conversion

The code is as follows (example):

class HourOfDay(TimeFeature):
    """Hour of day encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return index.hour / 23.0 - 0.5
class DayOfWeek(TimeFeature):
    """Hour of day encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return index.dayofweek / 6.0 - 0.5
class DayOfMonth(TimeFeature):
    """Day of month encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return (index.day - 1) / 30.0 - 0.5
class DayOfYear(TimeFeature):
    """Day of year encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return (index.dayofyear - 1) / 365.0 - 0.5
 
Finally, the features are stacked together:
dates = pd.to_datetime(dates.date.values)
return np.vstack([feat(dates) for feat in time_features_from_frequency_str(freq)]).transpose(1,0)

PS: The idea here is worth learning from. Feature engineering is very important and can be refined for different business scenarios, for example by adding seasons, holidays, promotion periods and so on to further enrich the time dimension. Such feature engineering often determines a model's online performance, while the model itself can only approach that upper bound. An illustrative extension follows.
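As an illustration of that point (not part of the paper's code), extra calendar features can be added in the same TimeFeature style; the two classes below are hypothetical examples:

class MonthOfYear(TimeFeature):
    """Month of year encoded as value between [-0.5, 0.5]"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return (index.month - 1) / 11.0 - 0.5

class IsWeekend(TimeFeature):
    """1.0 on Saturday/Sunday, 0.0 otherwise"""
    def __call__(self, index: pd.DatetimeIndex) -> np.ndarray:
        return (index.dayofweek >= 5).astype(np.float64)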

3. Model input

The data is processed in the Dataloader's __getitem__ function, and the model inputs are seq_x, seq_y, seq_x_mark and seq_y_mark.
The shape of seq_x is 96 × 12: 12 features over a time series of length 96 (hours by default).
The shape of seq_y is 72 × 12: 12 features over a time series of length 72 (hours by default). Of these 72 hours, 48 overlap with seq_x, and the other 24 are the labels Y that actually need to be predicted. In other words, in the decoder's prediction stage the 48 preceding values of the series are fed in as features to improve the model.
The shape of seq_x_mark is 96 × 4, the 4 time features produced by the time information conversion above.
The shape of seq_y_mark is 72 × 4, the corresponding time features for seq_y.
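For reference, a sketch of how these windows are typically sliced in the dataset's __getitem__ (the attribute names data_x, data_y, data_stamp and the lengths seq_len=96, label_len=48, pred_len=24 are assumptions that match the shapes above):

def __getitem__(self, index):
    s_begin = index
    s_end = s_begin + self.seq_len                      # 96 encoder steps
    r_begin = s_end - self.label_len                    # decoder start: last 48 encoder steps
    r_end = r_begin + self.label_len + self.pred_len    # plus the 24 steps to predict

    seq_x = self.data_x[s_begin:s_end]                  # (96, 12)
    seq_y = self.data_y[r_begin:r_end]                  # (72, 12)
    seq_x_mark = self.data_stamp[s_begin:s_end]         # (96, 4)
    seq_y_mark = self.data_stamp[r_begin:r_end]         # (72, 4)
    return seq_x, seq_y, seq_x_mark, seq_y_mark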

3. Encoder

1. Embedding

Generally, the first layer of a Transformer framework is the embedding, which fuses various feature information together. The author performs feature fusion from three angles, namely

  1. value_embedding
  2. position_embedding
  3. temporal_embedding

The data embedding is composed of these three parts.
The code is as follows (example):

class TokenEmbedding(nn.Module):
    def __init__(self, c_in, d_model):
        super(TokenEmbedding, self).__init__()
        padding = 1 if torch.__version__>='1.5.0' else 2
        self.tokenConv = nn.Conv1d(in_channels=c_in, out_channels=d_model, 
                                    kernel_size=3, padding=padding, padding_mode='circular')
        for m in self.modules():
            if isinstance(m, nn.Conv1d):
                nn.init.kaiming_normal_(m.weight,mode='fan_in',nonlinearity='leaky_relu')

    def forward(self, x):
        x = self.tokenConv(x.permute(0, 2, 1)).transpose(1,2)
        return x

class PositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEmbedding, self).__init__()
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model).float()
        pe.require_grad = False

        position = torch.arange(0, max_len).float().unsqueeze(1)
        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return self.pe[:, :x.size(1)]
  
class TimeFeatureEmbedding(nn.Module):
    def __init__(self, d_model, embed_type='timeF', freq='h'):
        super(TimeFeatureEmbedding, self).__init__()

        freq_map = {'h':4, 't':5, 's':6, 'm':1, 'a':1, 'w':2, 'd':3, 'b':3}
        d_inp = freq_map[freq]
        self.embed = nn.Linear(d_inp, d_model)

    def forward(self, x):
        return self.embed(x)

Finally, the three embeddings are added position-wise to obtain the final embedding. The approach is fairly conventional: a 1-D convolution layer maps the input to a 512-dimensional embedding.
def forward(self, x, x_mark):
   x = self.value_embedding(x) + self.position_embedding(x) + self.temporal_embedding(x_mark)

After embedding, the output of the model is 32 × 96 × 512, where 32 is the batch size (the default), 96 is the length of the time series, and 512 is the dimension of the transformed features.
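A quick sanity check of these shapes (a sketch; it assumes the three sub-embeddings are wrapped in a DataEmbedding module, as in the repo's models/embed.py):

import torch
from models.embed import DataEmbedding   # repo wrapper that sums the three embeddings (path assumed)

x = torch.randn(32, 96, 12)        # (batch, seq_len, n_features)
x_mark = torch.randn(32, 96, 4)    # (batch, seq_len, time_features)
emb = DataEmbedding(c_in=12, d_model=512, embed_type='timeF', freq='h')
print(emb(x, x_mark).shape)        # expected: torch.Size([32, 96, 512])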

2. Encoder

The Encoder is the core of this paper; ProbSparse Self-Attention and Distilling are its two key components.

    def forward(self, queries, keys, values, attn_mask):
        B, L, _ = queries.shape
        _, S, _ = keys.shape
        H = self.n_heads

        queries = self.query_projection(queries).view(B, L, H, -1)
        keys = self.key_projection(keys).view(B, S, H, -1)
        values = self.value_projection(values).view(B, S, H, -1)

        out, attn = self.inner_attention(
            queries,
            keys,
            values,
            attn_mask
        )

The first step here is the same as in ordinary self-attention: three linear projections are applied to the embedding output to obtain queries, keys and values. This is also multi-head attention, with 8 heads by default. The focus is on how queries, keys and values are then used in ProbSparse Self-Attention.

class ProbAttention(nn.Module):
    def __init__(self, mask_flag=True, factor=5, scale=None, attention_dropout=0.1, output_attention=False):
        super(ProbAttention, self).__init__()
        self.factor = factor
        self.scale = scale
        self.mask_flag = mask_flag
        self.output_attention = output_attention
        self.dropout = nn.Dropout(attention_dropout)

    def _prob_QK(self, Q, K, sample_k, n_top): # n_top: c*ln(L_q)
        # Q [B, H, L, D]
        B, H, L_K, E = K.shape
        _, _, L_Q, _ = Q.shape

        # calculate the sampled Q_K
        K_expand = K.unsqueeze(-3).expand(B, H, L_Q, L_K, E)
        index_sample = torch.randint(L_K, (L_Q, sample_k)) # real U = U_part(factor*ln(L_k))*L_q
        K_sample = K_expand[:, :, torch.arange(L_Q).unsqueeze(1), index_sample, :]
        Q_K_sample = torch.matmul(Q.unsqueeze(-2), K_sample.transpose(-2, -1)).squeeze(-2)

        # find the Top_k query with sparisty measurement
        M = Q_K_sample.max(-1)[0] - torch.div(Q_K_sample.sum(-1), L_K)
        M_top = M.topk(n_top, sorted=False)[1]

        # use the reduced Q to calculate Q_K
        Q_reduce = Q[torch.arange(B)[:, None, None],
                     torch.arange(H)[None, :, None],
                     M_top, :] # factor*ln(L_q)
        Q_K = torch.matmul(Q_reduce, K.transpose(-2, -1)) # factor*ln(L_q)*L_k

        return Q_K, M_top

    def _get_initial_context(self, V, L_Q):
        B, H, L_V, D = V.shape
        if not self.mask_flag:
            # V_sum = V.sum(dim=-2)
            V_sum = V.mean(dim=-2)
            contex = V_sum.unsqueeze(-2).expand(B, H, L_Q, V_sum.shape[-1]).clone()
        else: # use mask
            assert(L_Q == L_V) # requires that L_Q == L_V, i.e. for self-attention only
            contex = V.cumsum(dim=-2)
        return contex

    def _update_context(self, context_in, V, scores, index, L_Q, attn_mask):
        B, H, L_V, D = V.shape

        if self.mask_flag:
            attn_mask = ProbMask(B, H, L_Q, index, scores, device=V.device)
            scores.masked_fill_(attn_mask.mask, -np.inf)

        attn = torch.softmax(scores, dim=-1) # nn.Softmax(dim=-1)(scores)

        context_in[torch.arange(B)[:, None, None],
                   torch.arange(H)[None, :, None],
                   index, :] = torch.matmul(attn, V).type_as(context_in)
        if self.output_attention:
            attns = (torch.ones([B, H, L_V, L_V])/L_V).type_as(attn).to(attn.device)
            attns[torch.arange(B)[:, None, None], torch.arange(H)[None, :, None], index, :] = attn
            return (context_in, attns)
        else:
            return (context_in, None)

    def forward(self, queries, keys, values, attn_mask):
        B, L_Q, H, D = queries.shape
        _, L_K, _, _ = keys.shape

        queries = queries.transpose(2,1)
        keys = keys.transpose(2,1)
        values = values.transpose(2,1)

        U_part = self.factor * np.ceil(np.log(L_K)).astype('int').item() # c*ln(L_k)
        u = self.factor * np.ceil(np.log(L_Q)).astype('int').item() # c*ln(L_q) 

        U_part = U_part if U_part<L_K else L_K
        u = u if u<L_Q else L_Q
        
        scores_top, index = self._prob_QK(queries, keys, sample_k=U_part, n_top=u) 

        # add scale factor
        scale = self.scale or 1./sqrt(D)
        if scale is not None:
            scores_top = scores_top * scale
        # get the context
        context = self._get_initial_context(values, L_Q)
        # update the context with selected top_k queries
        context, attn = self._update_context(context, values, scores_top, index, L_Q, attn_mask)
        
        return context.transpose(2,1).contiguous(), attn

You can take a look at the forward function in ProbAttention

  1. First, the author defines the variables U_part and u, which both work out to 25 by default (factor 5 × ⌈ln 96⌉ = 25). Their significance is that the ProbAttention layer ultimately keeps only 25 valuable queries out of the 96-step sequence: U_part is the number of keys sampled, and u is the number of queries retained.

  2. self._prob_QK first randomly samples 25 of the 96 keys as K_sample to stand in for all keys, then takes the inner product of all 96 queries with these sampled keys, so that each query gets 25 scores. For each query, the maximum of its scores minus their mean is used as that query's importance, which gives an importance value for all 96 queries. When the values are later updated, only the 25 most important queries are updated; the unimportant queries simply receive the average.

  3. self._get_initial_context initializes the output for the queries other than the 25 important ones with the mean of the values, which reduces computation and suppresses noise.

  4. self._update_context performs the actual attention computation for the selected queries and produces the final output.

ProbSparse Self-Attention thus quickly finds the most useful "active" queries, discards the "lazy" queries, and replaces their output with the mean of the values.
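To make the query ranking concrete, here is a simplified single-head sketch of the sparsity measurement (illustration only: the real code samples a different set of 25 keys for every query and works on (B, H, L, D) tensors):

import torch

L_Q, L_K, d = 96, 96, 64
Q, K = torch.randn(L_Q, d), torch.randn(L_K, d)
sample_k = n_top = 25                                 # factor * ceil(ln(96)) = 5 * 5 = 25

idx = torch.randint(L_K, (sample_k,))                 # sample 25 keys (simplified: shared by all queries)
QK_sample = Q @ K[idx].T                              # (96, 25) sampled attention scores
M = QK_sample.max(-1)[0] - QK_sample.sum(-1) / L_K    # "max minus mean" importance of each query
active = M.topk(n_top)[1]                             # indices of the 25 queries kept for full attention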

Besides ProbAttention, Distilling is the other innovation of the paper. The author's experiments show that downsampling the 96-step input to 48 steps between encoder layers works better, so the encoder takes on a pyramid-like structure.

    def forward(self, x):
        x = self.downConv(x.permute(0, 2, 1))
        x = self.norm(x)
        x = self.activation(x)
        x = self.maxPool(x)
        x = x.transpose(1,2)
        return x
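For reference, the layers used in this forward pass could be defined roughly as follows (a sketch consistent with the code above; the repo calls this block ConvLayer, and the stride-2 max pooling is what halves the length from 96 to 48):

import torch.nn as nn

class ConvLayer(nn.Module):
    def __init__(self, c_in):
        super(ConvLayer, self).__init__()
        self.downConv = nn.Conv1d(in_channels=c_in, out_channels=c_in,
                                  kernel_size=3, padding=1, padding_mode='circular')
        self.norm = nn.BatchNorm1d(c_in)
        self.activation = nn.ELU()
        self.maxPool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)  # halves the sequence length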

The other components are basically the same as in a standard Transformer: residual connections, normalization and dropout are used so that each layer is at least no worse than the one before it.

4. Decoder

There are no further innovations in the Decoder module. It is like a normal Transformer decoder, the only difference being that the last 48 values of the encoder input are also included in the decoder input, with 0 used as the initial value for the positions that actually need to be predicted; this improves the model's performance.
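A sketch of how that decoder input is typically assembled from the dataloader output (label_len=48 and pred_len=24 assumed):

import torch

label_len, pred_len = 48, 24
batch_y = torch.randn(32, label_len + pred_len, 12)                 # seq_y from the dataloader
dec_inp = torch.zeros_like(batch_y[:, -pred_len:, :])               # zeros as placeholders for the horizon
dec_inp = torch.cat([batch_y[:, :label_len, :], dec_inp], dim=1)    # (32, 72, 12) fed to the decoder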
The Mask mechanism is used in the first attention layer of the Decoder to ensure that the output decoded at time t only depends on the output before time t.

class TriangularCausalMask():
    def __init__(self, B, L, device="cpu"):
        mask_shape = [B, 1, L, L]
        with torch.no_grad():
            self._mask = torch.triu(torch.ones(mask_shape, dtype=torch.bool), diagonal=1).to(device)

    @property
    def mask(self):
        return self._mask

class ProbMask():
    def __init__(self, B, H, L, index, scores, device="cpu"):
        _mask = torch.ones(L, scores.shape[-1], dtype=torch.bool).to(device).triu(1)
        _mask_ex = _mask[None, None, :].expand(B, H, L, scores.shape[-1])
        indicator = _mask_ex[torch.arange(B)[:, None, None],
                             torch.arange(H)[None, :, None],
                             index, :].to(device)
        self._mask = indicator.view(scores.shape).to(device)
    
    @property
    def mask(self):
        return self._mask
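A tiny usage illustration of the causal mask (shapes chosen arbitrarily for readability):

import torch

mask = TriangularCausalMask(B=1, L=4)
scores = torch.randn(1, 1, 4, 4)
scores = scores.masked_fill(mask.mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)   # row t now puts zero weight on positions > t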


Summary

The paper uses ProbSparse Self-Attention and Distilling to improve ordinary Self-Attention, which both reduces the number of parameters and improves the model's performance on time series.
Could ProbSparse Self-Attention and Distilling be used in other scenarios? For example, in CV or NLP models, or in other architectures built on the Transformer mechanism, would replacing Self-Attention with ProbSparse Self-Attention and Distilling also improve results?

Source: blog.csdn.net/weixin_53280379/article/details/125021064