Tool series: TimeGPT (8): Using irregular timestamps for time series forecasting

Introduction

When working with time series data, the frequency of timestamps is a key factor that can have a significant impact on forecast results. Regular frequencies like daily, weekly or monthly are easy to handle. However, irregular frequencies like weekdays (excluding weekends) can be challenging for time series forecasting methods.

Our forecasting method can handle such irregular time series data as long as you specify the frequency of the series. For example, in the case of weekdays, the frequency should be passed as 'B'. Without this parameter, the method may not automatically detect frequencies, especially when the timestamps are irregular.
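As a small illustration of why an explicit frequency matters, the sketch below uses pandas alone (not TimeGPT) with made-up dates. It assumes that automatic detection behaves roughly like `pd.infer_freq`, which is an assumption about the general idea and not a description of TimeGPT's internals: a clean business-day index can be inferred, but a single missing holiday is enough to make inference fail.

```python
import pandas as pd

# Ten consecutive business days (weekends skipped); the dates are made up
business_days = pd.bdate_range("2023-06-26", periods=10)

# pandas can infer the business-day frequency from a clean index
clean_freq = pd.infer_freq(business_days)

# Drop one mid-week day, as a market holiday would
with_holiday = business_days.delete(6)

# Inference now returns None, so the frequency has to be given explicitly
holiday_freq = pd.infer_freq(with_holiday)

print(clean_freq, holiday_freq)
```

Passing the frequency explicitly avoids depending on whether the timestamps happen to be clean enough for inference.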


# Import the colab_badge module from the nixtlats.utils package
from nixtlats.utils import colab_badge

colab_badge('docs/tutorials/8_irregular_timestamps')
from fastcore.test import test_eq, test_fail, test_warns
# Import load_dotenv, which loads environment variables from a .env file
from dotenv import load_dotenv
load_dotenv()
True

# Import pandas for data handling
import pandas as pd

# Import the TimeGPT class
from nixtlats import TimeGPT

/home/ubuntu/miniconda/envs/nixtlats/lib/python3.11/site-packages/statsforecast/core.py:25: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from tqdm.autonotebook import tqdm
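# Create a TimeGPT object, passing the token parameter for authentication
# If no token is provided, the TIMEGPT_TOKEN environment variable is used by default

timegpt = TimeGPT(
    token = 'my_token_provided_by_nixtla'
)
# Alternatively, create the TimeGPT object without arguments so the token is read from the environment

timegpt = TimeGPT()  # Create an instance of the TimeGPT object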

Univariate forecasting with irregular timestamps

The first step is to obtain your time series data. The data must include timestamps and associated values. For example, you might be working with stock prices, in which case your data might look like the following. In this example we use data from OpenBB.


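# Read the dataset from the given URL
df_fed_test = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/openbb/fed.csv')

# Use pd.testing.assert_frame_equal to compare two forecasts
# The first forecast does not pass freq, so the frequency is inferred
# The second forecast passes a weekly frequency (freq='W') explicitly
# Both forecast the FF column with a 90% prediction interval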
pd.testing.assert_frame_equal(
    timegpt.forecast(df_fed_test, h=12, target_col='FF', level=[90]),
    timegpt.forecast(df_fed_test, h=12, target_col='FF', freq='W', level=[90])
)
INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...
INFO:nixtlats.timegpt:Inferred freq: W-WED
WARNING:nixtlats.timegpt:The specified horizon "h" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.
INFO:nixtlats.timegpt:Restricting input...
INFO:nixtlats.timegpt:Calling Forecast Endpoint...
INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...
INFO:nixtlats.timegpt:Inferred freq: W-WED
WARNING:nixtlats.timegpt:The specified horizon "h" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.
INFO:nixtlats.timegpt:Restricting input...
INFO:nixtlats.timegpt:Calling Forecast Endpoint...
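The log line `Inferred freq: W-WED` shows that the weekly frequency was detected automatically in both calls, which is consistent with the two forecasts being equal. As a rough illustration with pandas alone (made-up dates; how TimeGPT infers the frequency internally is not shown in this tutorial), `pd.infer_freq` yields the same alias for timestamps spaced exactly seven days apart on Wednesdays:

```python
import pandas as pd

# Six made-up timestamps, one per week, all falling on a Wednesday
weekly_dates = pd.to_datetime([
    "2023-01-04", "2023-01-11", "2023-01-18",
    "2023-01-25", "2023-02-01", "2023-02-08",
])

# Regular weekly spacing is easy to infer; the alias encodes the anchor weekday
inferred = pd.infer_freq(weekly_dates)
print(inferred)
```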


# Read the CSV file from the given URL into a DataFrame named pltr_df
pltr_df = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/openbb/pltr.csv')

# Convert the 'date' column to datetime and store the result back in 'date'
pltr_df['date'] = pd.to_datetime(pltr_df['date'])
# Show the first rows of the dataset
pltr_df.head()
        date   Open   High   Low  Close  Adj Close     Volume  Dividends  Stock Splits
0 2020-09-30  10.00  11.41  9.11   9.50       9.50  338584400        0.0           0.0
1 2020-10-01   9.69  10.10  9.23   9.46       9.46  124297600        0.0           0.0
2 2020-10-02   9.06   9.28  8.94   9.20       9.20   55018300        0.0           0.0
3 2020-10-05   9.43   9.49  8.92   9.03       9.03   36316900        0.0           0.0
4 2020-10-06   9.04  10.18  8.90   9.90       9.90   90864000        0.0           0.0

Let's check that this dataset has irregular timestamps. The dayofweek attribute of a pandas DatetimeIndex returns the day of the week, with Monday=0 and Sunday=6. So checking dayofweek > 4 tests whether a date falls on a Saturday (5) or a Sunday (6), which are usually non-working days (the weekend).

# Count the dates in pltr_df that fall on a weekend (dayofweek > 4)
(pltr_df['date'].dt.dayofweek > 4).sum()
0

The count is 0, so the series contains no weekend dates: the timestamps follow a weekday calendar and not a regular daily one. Let's examine the "Close" series.
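Another way to see the irregularity is to look at the gaps between consecutive timestamps. The sketch below uses a small made-up frame, not `pltr_df`: the Friday-to-Monday step shows up as a three-day gap, while all other steps are one day.

```python
import pandas as pd

# Made-up example: ten business days starting on a Monday
toy_df = pd.DataFrame({"date": pd.bdate_range("2023-09-04", periods=10)})

# No weekend dates, mirroring the dayofweek check
weekend_count = int((toy_df["date"].dt.dayofweek > 4).sum())

# Count how often each gap between consecutive timestamps occurs
gap_counts = toy_df["date"].diff().dropna().value_counts()

print(weekend_count)
print(gap_counts)
```

A series with more than one gap size is exactly the case where an explicit `freq` helps.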

# Use timegpt.plot to plot the closing price (Close) of pltr_df against the date column
timegpt.plot(pltr_df, time_col='date', target_col='Close')

To forecast this data, you can use our forecast method. Importantly, remember to use the freq parameter to specify the frequency of your data. In this case it should be 'B', which means business day. We also need to set time_col to select the column that holds the timestamps (the default is ds), and target_col to select the variable to forecast, which in this case is Close.

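# test_fail() checks that the wrapped call to timegpt.forecast() raises an error
# timegpt.forecast() produces a forecast from the given time series data
# Its arguments include:
# - df: the DataFrame with the time series data
# - h: the number of steps to forecast
# - time_col: the name of the time column
# - target_col: the name of the target column

# In this test we use pltr_df as the input data
# The forecast horizon is 14
# The time column is 'date'
# The target column is 'Close'

# freq is not passed, so the call is expected to fail with an error message that contains 'frequency'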
test_fail(
    lambda: timegpt.forecast(
        df=pltr_df, h=14,
        time_col='date', target_col='Close',
    ),
    contains='frequency'
)
INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...

# Call forecast with the DataFrame, the forecast horizon, the frequency, and the names of the time and target columns
fcst_pltr_df = timegpt.forecast(
    df=pltr_df, h=14, freq='B',
    time_col='date', target_col='Close',
)
INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...
WARNING:nixtlats.timegpt:The specified horizon "h" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.
INFO:nixtlats.timegpt:Calling Forecast Endpoint...
# Show the first rows of the forecast
fcst_pltr_df.head()
date TimeGPT
0 2023-09-25 14.688427
1 2023-09-26 14.742798
2 2023-09-27 14.781240
3 2023-09-28 14.824156
4 2023-09-29 14.795214

Remember, for weekdays, the frequency is 'B'. For other frequencies, you can refer to the pandas offset aliases documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases.
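To make the aliases concrete, here is a small pandas-only sketch with an arbitrary start date. It generates seven timestamps each for 'D' (calendar day) and 'B' (business day), and four for 'W' (weekly, which pandas anchors on Sunday by default):

```python
import pandas as pd

start = "2023-09-22"  # an arbitrary Friday

daily = pd.date_range(start, periods=7, freq="D")     # every calendar day
business = pd.date_range(start, periods=7, freq="B")  # weekdays only
weekly = pd.date_range(start, periods=4, freq="W")    # Sundays ('W' means 'W-SUN')

print(daily[-1].date(), business[-1].date(), weekly[0].date())
```

The same number of periods covers a longer calendar span with 'B' than with 'D', because weekends are skipped.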

By specifying frequencies, you can help prediction methods better understand patterns in the data, resulting in more accurate and reliable predictions.

Let's plot the predictions generated by TimeGPT.



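# Plot with timegpt.plot
# pltr_df is the DataFrame with the historical stock prices
# fcst_pltr_df is the DataFrame with the forecast stock prices
# time_col is the name of the time column, here 'date'
# target_col is the name of the target column, here 'Close'
# max_insample_length is the maximum number of historical observations to display, here 90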
timegpt.plot(
    pltr_df, 
    fcst_pltr_df, 
    time_col='date',
    target_col='Close',
    max_insample_length=90, 
)

You can also use the level parameter to add uncertainty quantification to your forecasts.

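# Forecast with timegpt.forecast
# df: the input data frame, here pltr_df
# h: the forecast horizon, here 42
# freq: the frequency of the data, here business days ('B')
# time_col: the name of the time column, here 'date'
# target_col: the name of the target column, here 'Close'
# add_history: whether to also return forecasts over the historical period, here True
# level: the prediction interval levels, here [40.66, 90]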
fcst_pltr_levels_df = timegpt.forecast(
    df=pltr_df, h=42, freq='B',
    time_col='date', target_col='Close',
    add_history=True,
    level=[40.66, 90],
)
INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...
WARNING:nixtlats.timegpt:The specified horizon "h" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.
INFO:nixtlats.timegpt:Calling Forecast Endpoint...
INFO:nixtlats.timegpt:Calling Historical Forecast Endpoint...


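# Plot the time series
# Arguments:
# pltr_df: DataFrame with the time series data
# fcst_pltr_levels_df: DataFrame with the forecasts and their prediction intervals
# time_col: name of the time column
# target_col: name of the target column
# level: the prediction interval levels to draw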
timegpt.plot(
    pltr_df, 
    fcst_pltr_levels_df, 
    time_col='date',
    target_col='Close',
    level=[40.66, 90],
)
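Under the usual convention, a prediction interval at level L is the central interval that leaves (100 - L) / 2 percent in each tail. The helper below is purely illustrative (the name `level_to_quantiles` is hypothetical and not part of nixtlats); it only shows which quantiles that convention implies for the levels passed to the level parameter, assuming TimeGPT follows it:

```python
def level_to_quantiles(level):
    """Return the (lower, upper) quantiles of a central interval at `level` percent."""
    tail = (100 - level) / 2 / 100
    return tail, 1 - tail


for level in [40.66, 90]:
    lower, upper = level_to_quantiles(level)
    print(f"level {level}: quantiles {lower:.4f} to {upper:.4f}")
```

A higher level therefore gives a wider band around the point forecast.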

If you want to predict another variable, just change the "target_col" parameter. Now let's predict "Volume":

# Forecast with timegpt.forecast
# df: the input time series data, here pltr_df
# h: the forecast horizon, here 14
# freq: the frequency of the series, here 'B' for business days
# time_col: the name of the time column, here 'date'
# target_col: the name of the target column, here 'Volume'
fcst_pltr_df = timegpt.forecast(
    df=pltr_df, h=14, freq='B',
    time_col='date', target_col='Volume',
)

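# Plot the time series and the forecast with timegpt.plot
# pltr_df: the original historical data
# fcst_pltr_df: the forecast results
# time_col: the name of the time column, here 'date'
# max_insample_length: the maximum number of historical observations to display, here 90
# target_col: the name of the target column, here 'Volume'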
timegpt.plot(
    pltr_df, 
    fcst_pltr_df, 
    time_col='date',
    max_insample_length=90,
    target_col='Volume',
)
INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...
WARNING:nixtlats.timegpt:The specified horizon "h" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.
INFO:nixtlats.timegpt:Calling Forecast Endpoint...

But what if we want to predict all time series simultaneously? We can do this by reshaping our data frame. Currently, the data frames are in wide format (each series is one column), but we need to convert them to long format (stacked one after another). We can achieve this using:

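# Reshape pltr_df so that each row holds a single observation
# id_vars keeps the date column as an identifier that is not reshaped
# var_name names the new column that holds the former column names: series_id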
pltr_long_df = pd.melt(
    pltr_df, 
    id_vars=['date'],
    var_name='series_id'
)

# Show the first rows of the long data frame
pltr_long_df.head()
date series_id value
0 2020-09-30 Open 10.00
1 2020-10-01 Open 9.69
2 2020-10-02 Open 9.06
3 2020-10-05 Open 9.43
4 2020-10-06 Open 9.04
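To see what the reshape does on a small scale, here is a made-up frame with two dates and two series. After `pd.melt`, each (date, series) pair becomes its own row, so the long frame has as many rows as the number of dates times the number of series:

```python
import pandas as pd

# Made-up wide frame: one row per date, one column per series
wide = pd.DataFrame({
    "date": pd.to_datetime(["2023-09-21", "2023-09-22"]),
    "Open": [14.0, 14.5],
    "Close": [14.2, 14.4],
})

# Stack the series columns; 'series_id' holds the former column name
long = pd.melt(wide, id_vars=["date"], var_name="series_id")
print(long)
```

The values land in a column named `value`, which is the pandas default when `value_name` is not given.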

Then we simply call the forecast method, specifying the id_col parameter.

# Call timegpt.forecast on the long data frame
# df: the data frame to forecast, here pltr_long_df
# h: the forecast horizon, here 14 steps ahead
# freq: the frequency of the data, here 'B' for business days
# id_col: the name of the column that identifies each series, here 'series_id'
# time_col: the name of the time column, here 'date'
# target_col: the name of the target column, here 'value'
fcst_pltr_long_df = timegpt.forecast(
    df=pltr_long_df, h=14, freq='B',
    id_col='series_id', time_col='date', target_col='value',
)
INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...
WARNING:nixtlats.timegpt:The specified horizon "h" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.
INFO:nixtlats.timegpt:Calling Forecast Endpoint...
# Show the first five rows of the forecast DataFrame
fcst_pltr_long_df.head()
series_id date TimeGPT
0 Adj Close 2023-09-25 14.688427
1 Adj Close 2023-09-26 14.742798
2 Adj Close 2023-09-27 14.781240
3 Adj Close 2023-09-28 14.824156
4 Adj Close 2023-09-29 14.795214

We can then plot the forecast for the "Open" series:


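# Plot with timegpt.plot
# pltr_long_df is the DataFrame with the original data
# fcst_pltr_long_df is the DataFrame with the forecasts
# id_col is the name of the column that identifies each series
# time_col is the name of the time column
# target_col is the name of the target column
# unique_ids is a list with the identifiers of the series to plot
# max_insample_length is the maximum number of historical observations to display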
timegpt.plot(
    pltr_long_df, 
    fcst_pltr_long_df, 
    id_col='series_id',
    time_col='date',
    target_col='value',
    unique_ids=['Open'],
    max_insample_length=90,
)

Forecasting with exogenous variables and irregular timestamps

In time series forecasting, the variables we forecast are often affected not only by their past values, but also by other factors or variables. These external variables are called exogenous variables, and they can provide additional context that can significantly improve the accuracy of our forecasts. One of those factors, and the focus of this tutorial, is the company's revenue. Revenue data can provide key indicators of a company's financial health and growth potential, both of which can have a significant impact on its stock price. We can get this data from OpenBB.



# Read the CSV file from the given URL into a DataFrame named revenue_pltr
revenue_pltr = pd.read_csv('https://raw.githubusercontent.com/Nixtla/transfer-learning-time-series/main/datasets/openbb/revenue-pltr.csv')
# Take the first value of the 'totalRevenue' column
value = revenue_pltr['totalRevenue'].iloc[0]

# If the value is not already a float and contains 'M', the column holds strings that need converting
if not isinstance(value, float) and 'M' in value:
    
    # Convert a string such as '1.5 M' or '250 K' to a float
    def convert_to_float(val):
        # Values with 'M' are in millions: strip the ' M' suffix and multiply by 1e6
        if 'M' in val:
            return float(val.replace(' M', '')) * 1e6
        # Values with 'K' are in thousands: strip the ' K' suffix and multiply by 1e3
        elif 'K' in val:
            return float(val.replace(' K', '')) * 1e3
        # Otherwise convert the value to float directly
        else:
            return float(val)
    
    # Apply convert_to_float to every value in the 'totalRevenue' column
    revenue_pltr['totalRevenue'] = revenue_pltr['totalRevenue'].apply(convert_to_float)

# Show the last rows of the data
revenue_pltr.tail()
fiscalDateEnding totalRevenue
5 2022-06-30 473010000.0
6 2022-09-30 477880000.0
7 2022-12-31 508624000.0
8 2023-03-31 525186000.0
9 2023-06-30 533317000.0
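The conversion can be checked in isolation on a few made-up strings. The sketch below repeats the same parsing logic as a standalone function (the input values are invented, and the ' M' / ' K' suffix format is assumed from the conversion code):

```python
def convert_to_float(val):
    """Convert strings such as '1.5 M' or '250 K' to plain floats."""
    if "M" in val:
        return float(val.replace(" M", "")) * 1e6  # millions
    elif "K" in val:
        return float(val.replace(" K", "")) * 1e3  # thousands
    return float(val)


samples = ["1.5 M", "250 K", "42"]
converted = [convert_to_float(s) for s in samples]
print(converted)
```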

The first thing we observe in the dataset is that we only have information up to the end of the second quarter of 2023 (2023-06-30). Our data is expressed at quarterly frequency, and our goal is to use this information to predict daily stock prices for the next 14 business days beyond this date.

However, to compute a forecast that includes revenue as an exogenous variable, we need to know the future values of revenue. This is crucial because these future revenue values can significantly impact stock prices.

Since our goal is to predict daily stock prices for the next 14 days, we only need to predict revenue for the upcoming quarter. This approach allows us to create a coherent forecasting process where the output of one forecast (revenue) is used as the input to another forecast (stock price), thereby leveraging all available information to obtain the most accurate forecast.

# Store the forecast in the variable fcst_pltr_revenue
# Call timegpt.forecast on the revenue_pltr data
# The horizon is 1 (one quarter), the time column is fiscalDateEnding, and the target column is totalRevenue
fcst_pltr_revenue = timegpt.forecast(revenue_pltr, h=1, time_col='fiscalDateEnding', target_col='totalRevenue')
INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...
INFO:nixtlats.timegpt:Inferred freq: Q-DEC
INFO:nixtlats.timegpt:Calling Forecast Endpoint...
# Show the first rows of the revenue forecast
fcst_pltr_revenue.head()
fiscalDateEnding TimeGPT
0 2023-09-30 540005888

The next key step in our forecasting process is to adjust the frequency of the revenue data to match the frequency of the stock prices, which is based on business days. To achieve this, we need to resample the historical and forecast revenue data.

We can achieve this using the following code:

# Convert the 'fiscalDateEnding' column of revenue_pltr to datetime
revenue_pltr['fiscalDateEnding'] = pd.to_datetime(revenue_pltr['fiscalDateEnding'])
# Resample to business-day frequency ('B') and forward-fill each quarter's revenue
revenue_pltr = revenue_pltr.set_index('fiscalDateEnding').resample('B').ffill().reset_index()

Important: in this process we assign the same revenue value to all days in a given quarter. This simplification is necessary because of the large difference in granularity between quarterly revenue data and daily stock price data. However, in practical applications, it is crucial to treat this assumption with caution. The impact of quarterly revenue on daily stock prices can vary widely within a quarter based on a range of factors, including changes in market expectations and other financial news and events. In this tutorial, we use this assumption to illustrate how to incorporate exogenous variables into our forecasting models, but in real life, a more nuanced approach may be required depending on the available data and the specific use case.
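The effect of the resampling step can be seen on a made-up two-quarter example (the revenue figures are invented). Every business day from one quarter-end date up to, but not including, the next one carries the earlier quarter's value:

```python
import pandas as pd

# Made-up quarterly revenue; both quarter-end dates fall on a Friday
quarterly = pd.DataFrame({
    "fiscalDateEnding": pd.to_datetime(["2023-03-31", "2023-06-30"]),
    "totalRevenue": [100.0, 200.0],
})

# Upsample to business days and carry the last known value forward
daily = (
    quarterly.set_index("fiscalDateEnding")
    .resample("B")
    .ffill()
    .reset_index()
)

print(len(daily))
print(daily.tail(3))
```

This sketch deliberately uses quarter-end dates that are business days; how a quarter end that falls on a weekend is binned is not covered here.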

We can then create the complete historical data set.

# Merge the data frames
# Rename the 'fiscalDateEnding' column of revenue_pltr to 'date' and merge it with pltr_df on that shared column
pltr_revenue_df = pltr_df.merge(revenue_pltr.rename(columns={'fiscalDateEnding': 'date'}))

# Show the first rows of the merged DataFrame
pltr_revenue_df.head()
        date       Open       High        Low      Close  Adj Close    Volume  Dividends  Stock Splits  totalRevenue
0 2021-03-31  22.500000  23.850000  22.379999  23.290001  23.290001  61458500        0.0           0.0   341234000.0
1 2021-04-01  23.950001  23.950001  22.730000  23.070000  23.070000  51788800        0.0           0.0   341234000.0
2 2021-04-05  23.780001  24.450001  23.340000  23.440001  23.440001  65374300        0.0           0.0   341234000.0
3 2021-04-06  23.549999  23.610001  22.830000  23.270000  23.270000  41933500        0.0           0.0   341234000.0
4 2021-04-07  23.000000  23.549999  22.809999  22.900000  22.900000  32766200        0.0           0.0   341234000.0
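Note that `merge` is called without `on` or `how`, so pandas joins on the columns the two frames share (here only `date`) and performs an inner join. Dates present in just one frame are dropped, which would explain why the merged data starts at 2021-03-31 and not at 2020-09-30. A made-up miniature of the same behavior:

```python
import pandas as pd

# Made-up prices and revenue for overlapping but different sets of days
prices = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-28", "2023-06-29", "2023-06-30"]),
    "Close": [15.0, 15.2, 15.3],
})
revenue = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-29", "2023-06-30", "2023-07-03"]),
    "totalRevenue": [100.0, 100.0, 200.0],
})

# Default merge: inner join on the shared 'date' column
merged = prices.merge(revenue)
print(merged)
```

Passing `on='date'` and `how='inner'` explicitly would give the same result here and makes the intent easier to read.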

Next, we build the data frame that holds the future revenue values:

# Set the forecast horizon to 14 business days
horizon = 14
# Import numpy for array operations
import numpy as np
# Create the DataFrame future_df with two columns: 'date' and 'totalRevenue'
# 'date': generate horizon + 1 business days ('B') starting at the last date of pltr_revenue_df,
# then keep the last horizon dates so that the last historical date itself is excluded
# 'totalRevenue': repeat the forecast revenue (the 'TimeGPT' value in the first row of fcst_pltr_revenue) horizon times
future_df = pd.DataFrame({
    'date': pd.date_range(pltr_revenue_df['date'].iloc[-1], periods=horizon + 1, freq='B')[-horizon:],
    'totalRevenue': np.repeat(fcst_pltr_revenue.iloc[0]['TimeGPT'], horizon)
})

# Show the first rows of future_df
future_df.head()
date totalRevenue
0 2023-07-03 540005888
1 2023-07-04 540005888
2 2023-07-05 540005888
3 2023-07-06 540005888
4 2023-07-07 540005888
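The date logic can be checked on its own. `pd.date_range` started at the last historical date returns that date as its first element (as long as it is itself a business day), so one extra period is generated and the first element is sliced off. A minimal sketch, assuming the last historical date is 2023-06-30, which is consistent with the output above:

```python
import pandas as pd

horizon = 14
last_date = pd.Timestamp("2023-06-30")  # a Friday, assumed last historical date

# horizon + 1 business days starting at last_date, then drop last_date itself
future_dates = pd.date_range(last_date, periods=horizon + 1, freq="B")[-horizon:]

print(future_dates[0].date(), future_dates[-1].date())
```

If the last historical date were not a business day, `pd.date_range` would already start at the next business day and the slice would drop a valid future date, so this pattern relies on the history ending on a weekday.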

We can then pass the future revenue to the forecast method using the X_df parameter. Since revenue is in the historical data frame, this information will be used in the model.


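# Use timegpt.forecast to forecast from pltr_revenue_df
# The forecast horizon is horizon
# The frequency is 'B', that is, every business day
# The time column is 'date'
# The target column is 'Close'
# The future values of the exogenous variable are passed in future_df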
fcst_pltr_df = timegpt.forecast(
    pltr_revenue_df, h=horizon, 
    freq='B',
    time_col='date', 
    target_col='Close',
    X_df=future_df,
)
INFO:nixtlats.timegpt:Validating inputs...
INFO:nixtlats.timegpt:Preprocessing dataframes...
WARNING:nixtlats.timegpt:The specified horizon "h" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.
INFO:nixtlats.timegpt:Calling Forecast Endpoint...
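# Plot the forecast
# Arguments:
# pltr_revenue_df: DataFrame with the historical stock prices and revenue
# fcst_pltr_df: DataFrame with the forecast closing prices
# id_col: name of the column that identifies each series
# time_col: name of the time column
# target_col: name of the target column
# max_insample_length: maximum number of historical observations to display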
timegpt.plot(
    pltr_revenue_df, 
    fcst_pltr_df, 
    id_col='series_id',
    time_col='date',
    target_col='Close',
    max_insample_length=90,
)

We can also inspect the importance of the revenue variable.

timegpt.weights_x.plot.barh(x='features', y='weights')
<Axes: ylabel='features'>

Origin blog.csdn.net/wjjc1017/article/details/135244831