Forecasting time series using a Transformer (based on PyTorch code)

Code source

https://github.com/nok-halfspace/Transformer-Time-Series-Forecasting

Article: https://medium.com/mlearning-ai/transformer-implementation-for-time-series-forecasting-a9db2db5c820

Data structure

The data in this project is structured as follows: there are several sensor_ids, and each sensor records humidity values over different time periods.
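For illustration only, the raw csv is organized roughly like the hypothetical sample below (the values are made up): one humidity reading per sensor per timestamp. Each sensor is also assigned a numeric reindexed_id, which the DataLoader later uses to group rows.

sensor_id,timestamp,humidity
sensor_a,2020-05-01 00:00:00+00:00,45.2
sensor_a,2020-05-01 01:00:00+00:00,47.8
sensor_b,2020-05-01 00:00:00+00:00,52.1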

Data import and preliminary processing

The first step is to perform preliminary processing of the data. The following is the DataLoader code:

import os

import numpy as np
import pandas as pd
import torch
from joblib import dump
from sklearn.preprocessing import MinMaxScaler
from torch.utils.data import Dataset

class SensorDataset(Dataset):
    """Dataset of sensor humidity readings."""

    def __init__(self, csv_name, root_dir, training_length, forecast_window):
        """
        Args:
            csv_file (string): Path to the csv file.
            root_dir (string): Directory
        """
        
        # load raw data file
        csv_file = os.path.join(root_dir, csv_name)
        self.df = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = MinMaxScaler()  # normalize the data to the [0, 1] range
        self.T = training_length
        self.S = forecast_window

    def __len__(self):
        # return number of sensors
        return len(self.df.groupby(by=["reindexed_id"]))

    # Will pull an index between 0 and __len__. 
    def __getitem__(self, idx):
        
        # Sensors are indexed from 1
        idx = idx+1

        # np.random.seed(0)

        start = np.random.randint(0, len(self.df[self.df["reindexed_id"]==idx]) - self.T - self.S) 
        sensor_number = str(self.df[self.df["reindexed_id"]==idx][["sensor_id"]][start:start+1].values.item())
        index_in = torch.tensor([i for i in range(start, start+self.T)])
        index_tar = torch.tensor([i for i in range(start + self.T, start + self.T + self.S)])
        _input = torch.tensor(self.df[self.df["reindexed_id"]==idx][["humidity", "sin_hour", "cos_hour", "sin_day", "cos_day", "sin_month", "cos_month"]][start : start + self.T].values)
        target = torch.tensor(self.df[self.df["reindexed_id"]==idx][["humidity", "sin_hour", "cos_hour", "sin_day", "cos_day", "sin_month", "cos_month"]][start + self.T : start + self.T + self.S].values)

        # scaler is fit only to the input, to avoid the scaled values "leaking" information about the target range.
        # scaler is fit only to the humidity column, as the time features are already scaled
        # scaler input/output has shape [n_samples, n_features].
        scaler = self.transform

        scaler.fit(_input[:,0].unsqueeze(-1))
        _input[:,0] = torch.tensor(scaler.transform(_input[:,0].unsqueeze(-1)).squeeze(-1))
        target[:,0] = torch.tensor(scaler.transform(target[:,0].unsqueeze(-1)).squeeze(-1))

        # save the scaler to be used later when inverse-transforming the data for plotting.
        dump(scaler, 'scalar_item.joblib')

        return index_in, index_tar, _input, target, sensor_number
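As a quick usage sketch (my own illustration, not repository code), each item returned by the dataset is an input window of length training_length and a target window of length forecast_window, both with 7 columns (humidity plus the 6 time features):

dataset = SensorDataset(csv_name="train_dataset.csv", root_dir="Data/",
                        training_length=48, forecast_window=24)
index_in, index_tar, _input, target, sensor_number = dataset[0]
print(_input.shape)   # torch.Size([48, 7])
print(target.shape)   # torch.Size([24, 7])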

One of the more important points is that the raw data is normalized with MinMaxScaler(), a very common preprocessing step in deep learning.
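To make the "leaking" comment in the code concrete, here is a minimal sketch (my own illustration with made-up numbers, not repository code): the scaler is fit on the input window only and then applied to both input and target, so target values may fall outside [0, 1]:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
input_window = np.array([[40.0], [55.0], [60.0]])   # hypothetical humidity values
target_window = np.array([[70.0], [35.0]])
scaler.fit(input_window)                            # fit on the input window only
print(scaler.transform(input_window).ravel())       # values scaled to [0, 1]
print(scaler.transform(target_window).ravel())      # may fall outside [0, 1]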

Time information embedding

Unlike an LSTM, the Transformer receives the whole sequence at once, so it has no built-in notion of time order. For time series, the time information therefore needs extra processing. In the following code, several features are added to the original dataset, including sin_hour, cos_hour, sin_day, cos_day, sin_month and cos_month, which plays a role similar to the positional embedding mechanism. The encoding is implemented as follows:

import pandas as pd
import time
import numpy as np
import datetime
from icecream import ic

# encoding the timestamp data cyclically. See Medium Article.
def process_data(source):

    df = pd.read_csv(source)
        
    timestamps = [ts.split('+')[0] for ts in  df['timestamp']]
    timestamps_hour = np.array([float(datetime.datetime.strptime(t, '%Y-%m-%d %H:%M:%S').hour) for t in timestamps])
    timestamps_day = np.array([float(datetime.datetime.strptime(t, '%Y-%m-%d %H:%M:%S').day) for t in timestamps])
    timestamps_month = np.array([float(datetime.datetime.strptime(t, '%Y-%m-%d %H:%M:%S').month) for t in timestamps])

    hours_in_day = 24
    days_in_month = 30
    month_in_year = 12

    df['sin_hour'] = np.sin(2*np.pi*timestamps_hour/hours_in_day)
    df['cos_hour'] = np.cos(2*np.pi*timestamps_hour/hours_in_day)
    df['sin_day'] = np.sin(2*np.pi*timestamps_day/days_in_month)
    df['cos_day'] = np.cos(2*np.pi*timestamps_day/days_in_month)
    df['sin_month'] = np.sin(2*np.pi*timestamps_month/month_in_year)
    df['cos_month'] = np.cos(2*np.pi*timestamps_month/month_in_year)

    return df

train_dataset = process_data('Data/train_raw.csv')
test_dataset = process_data('Data/test_raw.csv')

train_dataset.to_csv(r'Data/train_dataset.csv', index=False)
test_dataset.to_csv(r'Data/test_dataset.csv', index=False)

Running this snippet produces a new dataset with the additional time-feature columns.

It should be noted that humidity is only normalized later, when the data is loaded through SensorDataset during training, so the normalization of the raw humidity values is not visible in this dataset.
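A quick numerical check of the cyclic encoding (my own illustration, not repository code) shows why it helps: hour 23 and hour 1 end up close together in (sin, cos) space, even though their raw values 23 and 1 are far apart:

import numpy as np

def hour_encoding(hour):
    return np.sin(2 * np.pi * hour / 24), np.cos(2 * np.pi * hour / 24)

print(hour_encoding(23))   # (-0.259, 0.966)
print(hour_encoding(1))    # ( 0.259, 0.966)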

Feeding the data into the Transformer model

After processing the data, the next step is to feed it into the model.

Define the Transformer model

This is one of the most important code segments in the project, so it is worth analyzing in detail. At the top of the file, the author notes that the architecture is based on the paper "Attention Is All You Need".

A Transformer is composed of an encoder, a decoder and feed-forward layers. In this project only the encoder stack is used, followed by a linear layer that produces the output.

Model building: in this project, known historical data is used to predict a future period. Suppose X1 to X5 are historical observations from time steps 1 to 5. When predicting X2, only X1 can be used; when predicting X3, only X1 and X2 can be used, and so on.

Masked self-attention: the Transformer relies on self-attention, but during prediction the model must not see data from after the time step being predicted, so a mask is applied. In simple terms, the attention scores of future positions are set to negative infinity so that, after the softmax, the model cannot attend to later values.
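For illustration (my own example, not from the repository), this is what the causal mask looks like for a sequence of length 4; it is exactly what _generate_square_subsequent_mask in the code below produces:

import torch

sz = 4
mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, 0.0)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
# Row i can only attend to positions <= i; the -inf entries vanish after the softmax.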

Because PyTorch already provides these modules off the shelf, you only need to set the parameters here to use them. It sounds simple, but in practice you often run into quite a bit of trouble. The main parameters are:

  • feature_size: the number of input features. In this project it is 7: the 6 time features ('sin_hour', 'cos_hour', 'sin_day', 'cos_day', 'sin_month', 'cos_month') plus 1 for the humidity value itself.
  • num_layers: the number of encoder layers; this can be tuned for the task at hand.
  • dropout: the dropout rate; this can also be tuned.
  • nhead: the number of heads in the multi-head attention mechanism. Note that feature_size must be divisible by nhead, otherwise the model will raise an error (this is easy to understand: each head handles a separate slice of the feature mapping, so the features must split evenly across the heads).
import torch.nn as nn
import torch, math
from icecream import ic
import time
"""
The architecture is based on the paper “Attention Is All You Need”. 
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017.
"""

class Transformer(nn.Module):
    # d_model : number of features
    def __init__(self,feature_size=7,num_layers=3,dropout=0):
        super(Transformer, self).__init__()

        self.encoder_layer = nn.TransformerEncoderLayer(d_model=feature_size, nhead=7, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)        
        self.decoder = nn.Linear(feature_size,1)  # feature_size is the number of inputs, 1 is the number of outputs
        self.init_weights()
    
    # init_weights initializes the parameters of the decoder (the final linear layer)
    def init_weights(self):
        initrange = 0.1    
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def _generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def forward(self, src, device):
        
        mask = self._generate_square_subsequent_mask(len(src)).to(device)
        output = self.transformer_encoder(src,mask)
        output = self.decoder(output)
        return output
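As a quick shape check (my own sketch, not repository code), the model can be run on random data. With the PyTorch default batch_first=False, src has shape (sequence length, batch size, feature_size):

model = Transformer(feature_size=7, num_layers=3, dropout=0)
src = torch.rand(48, 1, 7)        # 48 time steps, batch of 1, 7 features
out = model(src, device="cpu")    # a 48x48 causal mask is built and applied internally
print(out.shape)                  # torch.Size([48, 1, 1]): one predicted value per time step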

Running the model

In this project, batch_size is set to 1, which means that each sensor is treated independently: the Transformer is run on each sensor's series separately, and relationships between sensors are not considered. Note that although main() accepts a batch_size argument, the DataLoaders below are created with batch_size=1 regardless.

import argparse
# from train_teacher_forcing import *
from train_with_sampling import *
from DataLoader import *
from torch.utils.data import DataLoader
import torch.nn as nn
import torch
from helpers import *
from inference import *

def main(
    epoch: int = 1000,
    k: int = 60,
    batch_size: int = 1,
    frequency: int = 100,
    training_length = 48,
    forecast_window = 24,
    train_csv = "train_dataset.csv",
    test_csv = "test_dataset.csv",
    path_to_save_model = "save_model/",
    path_to_save_loss = "save_loss/", 
    path_to_save_predictions = "save_predictions/", 
    device = "cpu"
):

    clean_directory()

    train_dataset = SensorDataset(csv_name = train_csv, root_dir = "Data/", training_length = training_length, forecast_window = forecast_window)
    train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True)
    test_dataset = SensorDataset(csv_name = test_csv, root_dir = "Data/", training_length = training_length, forecast_window = forecast_window)
    test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=True)

    best_model = transformer(train_dataloader, epoch, k, frequency, path_to_save_model, path_to_save_loss, path_to_save_predictions, device)
    inference(path_to_save_predictions, forecast_window, test_dataloader, device, path_to_save_model, best_model)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", type=int, default=1000)
    parser.add_argument("--k", type=int, default=60)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--frequency", type=int, default=100)
    parser.add_argument("--path_to_save_model",type=str,default="save_model/")
    parser.add_argument("--path_to_save_loss",type=str,default="save_loss/")
    parser.add_argument("--path_to_save_predictions",type=str,default="save_predictions/")
    parser.add_argument("--device", type=str, default="cpu")
    args = parser.parse_args()

    main(
        epoch=args.epoch,
        k = args.k,
        batch_size=args.batch_size,
        frequency=args.frequency,
        path_to_save_model=args.path_to_save_model,
        path_to_save_loss=args.path_to_save_loss,
        path_to_save_predictions=args.path_to_save_predictions,
        device=args.device,
    )
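With the defaults above, the script is launched simply as python main.py; parameters such as --epoch, --k, --frequency and --device can be overridden on the command line, while training_length, forecast_window and the csv names are not exposed as arguments and keep their default values.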
