Probabilistic time series forecasting using the Transformers package

Probabilistic forecasting

Typically, classical methods fit each time series in the data set individually. However, when dealing with large numbers of time series, it is beneficial to train a "global" model on all available time series, which enables the model to learn latent representations from many different sources.

Deep learning is well suited to training global probabilistic models rather than local point-prediction models, because neural networks can learn representations from several related time series and also model the uncertainty in the data.

In probabilistic settings, it is common to learn the future parameters of some chosen parametric distribution, such as a Gaussian or Student's t, to learn the conditional quantile function, or to use a conformal prediction framework adapted to the time series setting. One can always turn a probabilistic model into a point-prediction model by taking the empirical mean or median of its samples.
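For instance, here is a minimal sketch (using synthetic samples, not real model output) of how samples drawn from a predictive distribution can be reduced to a point forecast:

import numpy as np

# Synthetic samples standing in for draws from a predicted distribution:
# shape (num_samples, prediction_length).
samples = np.random.default_rng(0).normal(loc=100.0, scale=5.0, size=(100, 24))

# The empirical median (or mean) over the sample dimension gives a point forecast.
point_forecast = np.median(samples, axis=0)
print(point_forecast.shape)  # (24,)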

Time Series Transformer

In this blog post, we will use the vanilla Transformer for univariate probabilistic forecasting, i.e. predicting the one-dimensional distribution of each time series. The encoder-decoder Transformer is a natural choice for this task because it nicely encapsulates several inductive biases.

First, the encoder-decoder architecture is helpful at inference time. Typically, for some recorded data, we want to forecast a number of prediction steps into the future. Analogous to autoregressive text generation, we can sample a value from the chosen distribution at each time step and feed it back into the decoder until we reach the desired prediction horizon, which is akin to greedy sampling/search.

Second, Transformers help us train on time series data that may contain thousands of time points. Due to time and memory constraints, it may not be feasible to feed the complete history of every time series into the model at once. Therefore, when building batches for stochastic gradient descent, one can pick an appropriate context window size and sample windows of that size, together with the subsequent prediction-length-sized windows, from the training data. The context window is passed to the encoder and the prediction window to the causal-masked decoder.
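As a rough illustration of this windowing idea (a hypothetical helper, not part of any of the libraries used below):

import numpy as np

def sample_window(series, context_length, prediction_length, rng=np.random.default_rng(0)):
    # pick a random starting point such that both windows fit inside the series
    total = context_length + prediction_length
    start = rng.integers(0, len(series) - total + 1)
    past = series[start : start + context_length]            # fed to the encoder
    future = series[start + context_length : start + total]  # fed to the causal-masked decoder
    return past, future

series = np.arange(300, dtype=float)
past, future = sample_window(series, context_length=48, prediction_length=24)
print(past.shape, future.shape)  # (48,) (24,)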

Another benefit of Transformers over other architectures is that we can treat missing values as an additional mask for the encoder or decoder and still train without resorting to padding or imputation.
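Conceptually, that mask is just a 0/1 indicator per time step (a small sketch of the idea; the actual transformation is handled by GluonTS further below):

import numpy as np

values = np.array([1149.87, np.nan, 1053.80, np.nan, 5772.88])
observed_mask = (~np.isnan(values)).astype(np.float32)  # 1 = observed, 0 = missing
values = np.nan_to_num(values)  # the masked positions incur no loss
print(observed_mask)  # [1. 0. 1. 0. 1.]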


Set up the environment

First, let's install the necessary libraries: Transformers, Datasets, Evaluate, Accelerate and GluonTS.

As we will show, GluonTS will be used to transform the data, create features, and build appropriate training, validation, and test batches.

!pip install -q transformers
!pip install -q datasets
!pip install -q evaluate
!pip install -q accelerate
!pip install -q gluonts ujson


Load dataset

In this blog post, we will use the tourism_monthly dataset available on Hugging Face Hub. This dataset contains monthly tourism flows for 366 regions in Australia.

This dataset is part of the Monash Time Series Forecasting repository, which contains time series datasets from multiple domains. It can be viewed as the GLUE benchmark for time series forecasting.

from datasets import load_dataset
dataset = load_dataset("monash_tsf", "tourism_monthly")

As can be seen, the dataset contains 3 splits: train, validation and test.

dataset
DatasetDict({
        train: Dataset({
            features: ['start', 'target', 'feat_static_cat', 'feat_dynamic_real', 'item_id'],
            num_rows: 366
        })
        test: Dataset({
            features: ['start', 'target', 'feat_static_cat', 'feat_dynamic_real', 'item_id'],
            num_rows: 366
        })
        validation: Dataset({
            features: ['start', 'target', 'feat_static_cat', 'feat_dynamic_real', 'item_id'],
            num_rows: 366
        })
    })

Each example contains a few keys, with start and target being the most important. Let's look at the first time series in the dataset:

train_example = dataset['train'][0]
train_example.keys()


dict_keys(['start', 'target', 'feat_static_cat', 'feat_dynamic_real', 'item_id'])

start only indicates the start of the time series (of type datetime), while target contains the actual values of the time series.

start will be useful for adding time-related features to the time series values as extra input to the model (e.g. "month of year"). Since we know the frequency of the data is monthly, we also know, for instance, that the timestamp of the second value is 1979-02-01, and so on.

print(train_example['start'])
print(train_example['target'])
1979-01-01 00:00:00
    [1149.8699951171875, 1053.8001708984375, ..., 5772.876953125]
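As a small sketch (assuming the monthly frequency stated above), the timestamp of every value can be reconstructed from start and the frequency with pandas:

import pandas as pd

timestamps = pd.period_range(
    start=train_example["start"], periods=len(train_example["target"]), freq="M"
)
print(timestamps[:3])  # PeriodIndex(['1979-01', '1979-02', '1979-03'], dtype='period[M]')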

The validation set contains the same data as the training set, just extended in time by prediction_length. This allows us to validate the model's predictions against the ground truth.

The test set is again prediction_length longer than the validation set (or, for testing on multiple rolling windows, some multiple of prediction_length longer than the training set).

validation_example = dataset['validation'][0]
validation_example.keys()


dict_keys(['start', 'target', 'feat_static_cat', 'feat_dynamic_real', 'item_id'])

The initial values of the validation example are exactly the same as those of the corresponding training example:

print(validation_example['start'])
print(validation_example['target'])


1979-01-01 00:00:00
    [1149.8699951171875, 1053.8001708984375, ..., 5985.830078125]

However, this example has prediction_length=24 additional values compared to the training example. Let's verify this:

freq = "1M"
prediction_length = 24


assert len(train_example["target"]) + prediction_length == len(
    validation_example["target"]
)
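Under the same assumption, the test split should extend the validation split by one more prediction_length window:

test_example = dataset["test"][0]

assert len(validation_example["target"]) + prediction_length == len(
    test_example["target"]
)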

Let’s visualize it:

import matplotlib.pyplot as plt


figure, axes = plt.subplots()
axes.plot(train_example["target"], color="blue")
axes.plot(validation_example["target"], color="red", alpha=0.5)


plt.show()

[Figure: the training series (blue) overlaid with the validation series (red), which extends it by prediction_length.]


Update start to pd.Period

The first thing we do is convert the start feature of each time series into a pandas Period index based on the freq value of the data:

from functools import lru_cache


import pandas as pd
import numpy as np


@lru_cache(10_000)
def convert_to_pandas_period(date, freq):
    return pd.Period(date, freq)


def transform_start_field(batch, freq):
    batch["start"] = [convert_to_pandas_period(date, freq) for date in batch["start"]]
    return batch

We do this on the fly using the set_transform functionality of datasets:

from functools import partial


train_dataset = dataset["train"]
test_dataset = dataset["test"]


train_dataset.set_transform(partial(transform_start_field, freq=freq))
test_dataset.set_transform(partial(transform_start_field, freq=freq))

Define model

Next, let's instantiate a model. The model will be trained from scratch, so instead of using the from_pretrained method, we randomly initialize the model from the config.

We specify several additional parameters for the model:

  • prediction_length (in our case 24 months): This is the range over which the Transformer's decoder will learn to predict;

  • context_length: If context_length is not specified, the model will set context_length (the input to the encoder) equal to prediction_length;

  • Lags for a given frequency: these determine how far back the model "looks", and the lagged values act as additional features. For example, for a Daily frequency we might consider lags of [1, 2, 7, 30, ...], i.e. looking back 1, 2, 7, 30, ... days, while for Minute data we might consider [1, 30, 60, 60*24, ...], etc.;

  • Number of time features: set to 2 in our case as we will be adding MonthOfYear and Age features;

  • Number of static categorical features: In our case this will be just 1 since we will be adding a "Time Series ID" feature;

  • Cardinality: the number of values of each static categorical feature, given as a list; for this example it will be [366], since we have 366 different time series;

  • Embedding dimension: the embedding dimension of each static categorical feature, also given as a list. For example, [3] means the model will learn an embedding vector of size 3 for each of the 366 time series (regions).

Let's use the default lags provided by GluonTS for the given frequency ("monthly"):

from gluonts.time_feature import get_lags_for_frequency


lags_sequence = get_lags_for_frequency(freq)
print(lags_sequence)


>>> [1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 23, 24, 25, 35, 36, 37]

This means that at each time step we will look back up to 37 months, and the lagged values will serve as additional features. Let's also check the default time features that GluonTS provides:

from gluonts.time_feature import time_features_from_frequency_str


time_features = time_features_from_frequency_str(freq)
print(time_features)


>>> [<function month_of_year at 0x7fa496d0ca70>]

In this case there is only one feature, "month of year". This means that for each time step, we will add the month as a scalar value (e.g. 1 if the timestamp is "january", 2 if the timestamp is "february", etc.).

We are now ready to define everything needed for the model:

from transformers import TimeSeriesTransformerConfig, TimeSeriesTransformerForPrediction


config = TimeSeriesTransformerConfig(
    prediction_length=prediction_length,
    # context length:
    context_length=prediction_length * 2,
    # lags coming from helper given the freq:
    lags_sequence=lags_sequence,
    # we'll add 2 time features ("month of year" and "age", see further):
    num_time_features=len(time_features) + 1,
    # we have a single static categorical feature, namely time series ID:
    num_static_categorical_features=1,
    # it has 366 possible values:
    cardinality=[len(train_dataset)],
    # the model will learn an embedding of size 2 for each of the 366 possible values:
    embedding_dimension=[2],


    # transformer params:
    encoder_layers=4,
    decoder_layers=4,
    d_model=32,
)


model = TimeSeriesTransformerForPrediction(config)

Note that, similar to other models in the Transformers library, TimeSeriesTransformerModel corresponds to the encoder-decoder Transformer without any head on top, while TimeSeriesTransformerForPrediction adds a distribution head on top of it. By default, the model uses a Student-t distribution (which is also configurable):

model.config.distribution_output


>>> student_t

This is an important difference compared to Transformers for NLP, where the head typically consists of a fixed categorical distribution implemented as an nn.Linear layer.
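If we preferred a different output distribution, it could be swapped at configuration time. A sketch, mirroring the config defined above but with a Gaussian ("normal") emission head instead of the default Student-t:

alternative_config = TimeSeriesTransformerConfig(
    prediction_length=prediction_length,
    context_length=prediction_length * 2,
    lags_sequence=lags_sequence,
    num_time_features=len(time_features) + 1,
    num_static_categorical_features=1,
    cardinality=[len(train_dataset)],
    embedding_dimension=[2],
    distribution_output="normal",  # instead of the default "student_t"
    encoder_layers=4,
    decoder_layers=4,
    d_model=32,
)
alternative_model = TimeSeriesTransformerForPrediction(alternative_config)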

Define transformation

Next, we define the transformations for the data, in particular the time features that need to be created (whether based on the dataset or more general ones).

Again, we use the GluonTS library. A Chain is defined here (somewhat similar to torchvision.transforms.Compose for image training). It allows us to combine multiple transformations into a single pipeline.

from gluonts.time_feature import (
    time_features_from_frequency_str,
    TimeFeature,
    get_lags_for_frequency,
)
from gluonts.dataset.field_names import FieldName
from gluonts.transform import (
    AddAgeFeature,
    AddObservedValuesIndicator,
    AddTimeFeatures,
    AsNumpyArray,
    Chain,
    ExpectedNumInstanceSampler,
    InstanceSplitter,
    RemoveFields,
    SelectFields,
    SetField,
    TestSplitSampler,
    Transformation,
    ValidationSplitSampler,
    VstackFeatures,
    RenameFields,
)

The transformation code below is commented so you can follow each step. Globally, we will iterate over the individual time series of the dataset and add or remove certain fields or features:

from transformers import PretrainedConfig


def create_transformation(freq: str, config: PretrainedConfig) -> Transformation:
    remove_field_names = []
    if config.num_static_real_features == 0:
        remove_field_names.append(FieldName.FEAT_STATIC_REAL)
    if config.num_dynamic_real_features == 0:
        remove_field_names.append(FieldName.FEAT_DYNAMIC_REAL)
    if config.num_static_categorical_features == 0:
        remove_field_names.append(FieldName.FEAT_STATIC_CAT)


    # a bit like torchvision.transforms.Compose
    return Chain(
        # step 1: remove static/dynamic fields if not specified
        [RemoveFields(field_names=remove_field_names)]
        # step 2: convert the data to NumPy (potentially not needed)
        + (
            [
                AsNumpyArray(
                    field=FieldName.FEAT_STATIC_CAT,
                    expected_ndim=1,
                    dtype=int,
                )
            ]
            if config.num_static_categorical_features > 0
            else []
        )
        + (
            [
                AsNumpyArray(
                    field=FieldName.FEAT_STATIC_REAL,
                    expected_ndim=1,
                )
            ]
            if config.num_static_real_features > 0
            else []
        )
        + [
            AsNumpyArray(
                field=FieldName.TARGET,
                # we expect an extra dim for the multivariate case:
                expected_ndim=1 if config.input_size == 1 else 2,
            ),
            # step 3: handle the NaN's by filling in the target with zero
            # and return the mask (which is in the observed values)
            # true for observed values, false for nan's
            # the decoder uses this mask (no loss is incurred for unobserved values)
            # see loss_weights inside the xxxForPrediction model
            AddObservedValuesIndicator(
                target_field=FieldName.TARGET,
                output_field=FieldName.OBSERVED_VALUES,
            ),
            # step 4: add temporal features based on freq of the dataset
            # month of year in the case when freq="M"
            # these serve as positional encodings
            AddTimeFeatures(
                start_field=FieldName.START,
                target_field=FieldName.TARGET,
                output_field=FieldName.FEAT_TIME,
                time_features=time_features_from_frequency_str(freq),
                pred_length=config.prediction_length,
            ),
            # step 5: add another temporal feature (just a single number)
            # tells the model where in its life the value of the time series is,
            # sort of a running counter
            AddAgeFeature(
                target_field=FieldName.TARGET,
                output_field=FieldName.FEAT_AGE,
                pred_length=config.prediction_length,
                log_scale=True,
            ),
            # step 6: vertically stack all the temporal features into the key FEAT_TIME
            VstackFeatures(
                output_field=FieldName.FEAT_TIME,
                input_fields=[FieldName.FEAT_TIME, FieldName.FEAT_AGE]
                + (
                    [FieldName.FEAT_DYNAMIC_REAL]
                    if config.num_dynamic_real_features > 0
                    else []
                ),
            ),
            # step 7: rename to match HuggingFace names
            RenameFields(
                mapping={
                    FieldName.FEAT_STATIC_CAT: "static_categorical_features",
                    FieldName.FEAT_STATIC_REAL: "static_real_features",
                    FieldName.FEAT_TIME: "time_features",
                    FieldName.TARGET: "values",
                    FieldName.OBSERVED_VALUES: "observed_mask",
                }
            ),
        ]
    )

InstanceSplitter

For the training, validation, and testing steps, we next create an InstanceSplitter, which samples windows from the dataset (since we cannot pass the entire history of values to the Transformer due to time and memory constraints).

The instance splitter randomly samples windows of size context_length, together with the subsequent windows of size prediction_length, from the data, and prepends past_ or future_ to the keys of the respective windows. This ensures that the values are split into past_values and subsequent future_values keys, which serve as inputs to the encoder and decoder, respectively. The same happens for all keys listed in the time_series_fields argument:

from gluonts.transform.sampler import InstanceSampler
from typing import Optional


def create_instance_splitter(
    config: PretrainedConfig,
    mode: str,
    train_sampler: Optional[InstanceSampler] = None,
    validation_sampler: Optional[InstanceSampler] = None,
) -> Transformation:
    assert mode in ["train", "validation", "test"]


    instance_sampler = {
        "train": train_sampler
        or ExpectedNumInstanceSampler(
            num_instances=1.0, min_future=config.prediction_length
        ),
        "validation": validation_sampler
        or ValidationSplitSampler(min_future=config.prediction_length),
        "test": TestSplitSampler(),
    }[mode]


    return InstanceSplitter(
        target_field="values",
        is_pad_field=FieldName.IS_PAD,
        start_field=FieldName.START,
        forecast_start_field=FieldName.FORECAST_START,
        instance_sampler=instance_sampler,
        past_length=config.context_length + max(config.lags_sequence),
        future_length=config.prediction_length,
        time_series_fields=["time_features", "observed_mask"],
    )

Create DataLoader

With the data in hand, the next step is to create the PyTorch DataLoaders, which allow us to iterate over batches of (input, output) pairs, i.e. (past_values, future_values).

from typing import Iterable


import torch
from gluonts.itertools import Cached, Cyclic
from gluonts.dataset.loader import as_stacked_batches




def create_train_dataloader(
    config: PretrainedConfig,
    freq,
    data,
    batch_size: int,
    num_batches_per_epoch: int,
    shuffle_buffer_length: Optional[int] = None,
    cache_data: bool = True,
    **kwargs,
) -> Iterable:
    PREDICTION_INPUT_NAMES = [
        "past_time_features",
        "past_values",
        "past_observed_mask",
        "future_time_features",
    ]
    if config.num_static_categorical_features > 0:
        PREDICTION_INPUT_NAMES.append("static_categorical_features")


    if config.num_static_real_features > 0:
        PREDICTION_INPUT_NAMES.append("static_real_features")


    TRAINING_INPUT_NAMES = PREDICTION_INPUT_NAMES + [
        "future_values",
        "future_observed_mask",
    ]


    transformation = create_transformation(freq, config)
    transformed_data = transformation.apply(data, is_train=True)
    if cache_data:
        transformed_data = Cached(transformed_data)


    # we initialize a Training instance
    instance_splitter = create_instance_splitter(config, "train")


    # the instance splitter will sample a window of
    # context length + lags + prediction length (from the 366 possible transformed time series)
    # randomly from within the target time series and return an iterator.
    stream = Cyclic(transformed_data).stream()
    training_instances = instance_splitter.apply(
        stream, is_train=True
    )


    return as_stacked_batches(
        training_instances,
        batch_size=batch_size,
        shuffle_buffer_length=shuffle_buffer_length,
        field_names=TRAINING_INPUT_NAMES,
        output_type=torch.tensor,
        num_batches_per_epoch=num_batches_per_epoch,
    )
def create_test_dataloader(
    config: PretrainedConfig,
    freq,
    data,
    batch_size: int,
    **kwargs,
):
    PREDICTION_INPUT_NAMES = [
        "past_time_features",
        "past_values",
        "past_observed_mask",
        "future_time_features",
    ]
    if config.num_static_categorical_features > 0:
        PREDICTION_INPUT_NAMES.append("static_categorical_features")


    if config.num_static_real_features > 0:
        PREDICTION_INPUT_NAMES.append("static_real_features")


    transformation = create_transformation(freq, config)
    transformed_data = transformation.apply(data, is_train=False)


    # we create a Test Instance splitter which will sample the very last
    # context window seen during training only for the encoder.
    instance_sampler = create_instance_splitter(config, "test")


    # we apply the transformations in test mode
    testing_instances = instance_sampler.apply(transformed_data, is_train=False)


    return as_stacked_batches(
        testing_instances,
        batch_size=batch_size,
        output_type=torch.tensor,
        field_names=PREDICTION_INPUT_NAMES,
    )
train_dataloader = create_train_dataloader(
    config=config,
    freq=freq,
    data=train_dataset,
    batch_size=256,
    num_batches_per_epoch=100,
)


test_dataloader = create_test_dataloader(
    config=config,
    freq=freq,
    data=test_dataset,
    batch_size=64,
)

Let's check the first batch:

batch = next(iter(train_dataloader))
for k, v in batch.items():
    print(k, v.shape, v.type())


>>> past_time_features torch.Size([256, 85, 2]) torch.FloatTensor
    past_values torch.Size([256, 85]) torch.FloatTensor
    past_observed_mask torch.Size([256, 85]) torch.FloatTensor
    future_time_features torch.Size([256, 24, 2]) torch.FloatTensor
    static_categorical_features torch.Size([256, 1]) torch.LongTensor
    future_values torch.Size([256, 24]) torch.FloatTensor
    future_observed_mask torch.Size([256, 24]) torch.FloatTensor

As can be seen, we do not feed input_ids and attention_mask to the encoder (as would be the case for NLP models); instead, we provide past_values, along with past_observed_mask, past_time_features, static_categorical_features and static_real_features.

The decoder inputs consist of future_values, future_observed_mask and future_time_features. future_values can be seen as the equivalent of decoder_input_ids in NLP.
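As a quick sanity check, the length 85 of the past window equals the context length plus the largest lag:

assert batch["past_values"].shape[1] == config.context_length + max(config.lags_sequence)
# 85 == 48 + 37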

Forward pass

Let's perform a forward pass on the batch we just created:

# perform forward pass
outputs = model(
    past_values=batch["past_values"],
    past_time_features=batch["past_time_features"],
    past_observed_mask=batch["past_observed_mask"],
    static_categorical_features=batch["static_categorical_features"]
    if config.num_static_categorical_features > 0
    else None,
    static_real_features=batch["static_real_features"]
    if config.num_static_real_features > 0
    else None,
    future_values=batch["future_values"],
    future_time_features=batch["future_time_features"],
    future_observed_mask=batch["future_observed_mask"],
    output_hidden_states=True,
)
print("Loss:", outputs.loss.item())


>>> Loss: 9.069628715515137

The model currently returns a loss value. This is possible because the decoder automatically shifts future_values one position to the right to obtain its inputs, which allows computing the error between the predicted values and the labels.

Also note that the decoder uses a causal mask so that it does not look into the future, since the values it needs to predict are contained in the future_values tensor.
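A toy illustration of the shift-right idea (just the concept, not the library's internal code):

import torch

future_values = torch.tensor([[10.0, 11.0, 12.0, 13.0]])
# during training, the decoder input at step t is the ground-truth value at step t-1
decoder_inputs = torch.roll(future_values, shifts=1, dims=-1)
decoder_inputs[:, 0] = 0.0  # in practice the first input comes from the context window
print(decoder_inputs)  # tensor([[ 0., 10., 11., 12.]])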

Train the model

It’s time to train the model! We will use the standard PyTorch training loop.

Here we use the Accelerate library, which automatically places the model, optimizer and data loader on the appropriate device.

from accelerate import Accelerator
from torch.optim import AdamW


accelerator = Accelerator()
device = accelerator.device


model.to(device)
optimizer = AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=1e-1)


model, optimizer, train_dataloader = accelerator.prepare(
    model,
    optimizer,
    train_dataloader,
)


model.train()
for epoch in range(40):
    for idx, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        outputs = model(
            static_categorical_features=batch["static_categorical_features"].to(device)
            if config.num_static_categorical_features > 0
            else None,
            static_real_features=batch["static_real_features"].to(device)
            if config.num_static_real_features > 0
            else None,
            past_time_features=batch["past_time_features"].to(device),
            past_values=batch["past_values"].to(device),
            future_time_features=batch["future_time_features"].to(device),
            future_values=batch["future_values"].to(device),
            past_observed_mask=batch["past_observed_mask"].to(device),
            future_observed_mask=batch["future_observed_mask"].to(device),
        )
        loss = outputs.loss


        # Backpropagation
        accelerator.backward(loss)
        optimizer.step()


        if idx % 100 == 0:
            print(loss.item())

Inference

During inference, it is recommended to use the generate() method for autoregressive generation, similar to NLP models.

At prediction time, we get data from the test instance sampler, which samples the very last context_length-sized window of values from each time series in the dataset and feeds it to the model. Note that the future_time_features, which are known ahead of time, are passed to the decoder.

The model will autoregressively sample a certain number of values from the predicted distribution and pass them back to the decoder to obtain the prediction outputs:

model.eval()


forecasts = []


for batch in test_dataloader:
    outputs = model.generate(
        static_categorical_features=batch["static_categorical_features"].to(device)
        if config.num_static_categorical_features > 0
        else None,
        static_real_features=batch["static_real_features"].to(device)
        if config.num_static_real_features > 0
        else None,
        past_time_features=batch["past_time_features"].to(device),
        past_values=batch["past_values"].to(device),
        future_time_features=batch["future_time_features"].to(device),
        past_observed_mask=batch["past_observed_mask"].to(device),
    )
    forecasts.append(outputs.sequences.cpu().numpy())

The model outputs a tensor of shape (batch_size, number of samples, prediction_length).

The output below illustrates this: for each example in the batch of size 64, we get 100 possible values for each of the next 24 months:

forecasts[0].shape


>>> (64, 100, 24)

We will stack them vertically to get predictions for all time series in the test dataset:

forecasts = np.vstack(forecasts)
print(forecasts.shape)


>>> (366, 100, 24)

We can evaluate the generated forecasts against the out-of-sample ground truth present in the test set. Here we use the MASE and sMAPE metrics, computed for each time series in the dataset:
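As a rough sketch of what MASE measures (we rely on the evaluate implementation below; m is the seasonal period, 12 for monthly data):

import numpy as np

def mase_sketch(y_pred, y_true, y_train, m=12):
    # numerator: mean absolute error of the forecast
    mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    # denominator: in-sample MAE of the seasonal naive forecast (lag m)
    y_train = np.asarray(y_train)
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return mae / scale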

from evaluate import load
from gluonts.time_feature import get_seasonality


mase_metric = load("evaluate-metric/mase")
smape_metric = load("evaluate-metric/smape")


forecast_median = np.median(forecasts, 1)


mase_metrics = []
smape_metrics = []
for item_id, ts in enumerate(test_dataset):
    training_data = ts["target"][:-prediction_length]
    ground_truth = ts["target"][-prediction_length:]
    mase = mase_metric.compute(
        predictions=forecast_median[item_id], 
        references=np.array(ground_truth), 
        training=np.array(training_data), 
        periodicity=get_seasonality(freq))
    mase_metrics.append(mase["mase"])


    smape = smape_metric.compute(
        predictions=forecast_median[item_id], 
        references=np.array(ground_truth), 
    )
    smape_metrics.append(smape["smape"])
print(f"MASE: {np.mean(mase_metrics)}")


>>> MASE: 1.2564196892177717


print(f"sMAPE: {np.mean(smape_metrics)}")


>>> sMAPE: 0.1609541520852549

We can also plot the resulting metrics for each time series in the dataset individually and observe that a few of them have a large impact on the final test metric:

plt.scatter(mase_metrics, smape_metrics, alpha=0.3)
plt.xlabel("MASE")
plt.ylabel("sMAPE")
plt.show()

[Figure: scatter plot of sMAPE vs. MASE for each time series in the dataset.]

To plot forecasts for any time series based on ground truth test data, we define the following auxiliary plotting function:

import matplotlib.dates as mdates


def plot(ts_index):
    fig, ax = plt.subplots()


    index = pd.period_range(
        start=test_dataset[ts_index][FieldName.START],
        periods=len(test_dataset[ts_index][FieldName.TARGET]),
        freq=freq,
    ).to_timestamp()


    # Major ticks every half year, minor ticks every month,
    ax.xaxis.set_major_locator(mdates.MonthLocator(bymonth=(1, 7)))
    ax.xaxis.set_minor_locator(mdates.MonthLocator())


    ax.plot(
        index[-2*prediction_length:], 
        test_dataset[ts_index]["target"][-2*prediction_length:],
        label="actual",
    )


    plt.plot(
        index[-prediction_length:], 
        np.median(forecasts[ts_index], axis=0),
        label="median",
    )


    plt.fill_between(
        index[-prediction_length:],
        forecasts[ts_index].mean(0) - forecasts[ts_index].std(axis=0), 
        forecasts[ts_index].mean(0) + forecasts[ts_index].std(axis=0), 
        alpha=0.3, 
        interpolate=True,
        label="+/- 1-std",
    )
    plt.legend()
    plt.show()
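For example, to plot one of the series (the index is arbitrary):

plot(334)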

[Figure: example forecast showing the actual series, the median prediction, and the +/- 1-std interval.]

Summary

As time series researchers know, there has been a lot of interest in applying Transformer-based models to time series. The vanilla Transformer is just one of many attention-based models, so there are more models to add to the library.

There is nothing preventing us from exploring multivariate time series as well, but for that we would need to instantiate the model with a multivariate distribution head. Currently, diagonal independent distributions are supported; support for other multivariate distributions will be added in the future. Stay tuned for future blog posts and tutorials.

Finally, the NLP/CV field benefits greatly from large pre-trained models, but to our knowledge this is not the case in the time series field. Transformer-based models seem to be the natural choice for this research direction, and we can’t wait to see what breakthroughs researchers and practitioners will discover!
