Happy Learning Starts with Translation: Multi-step Time Series Forecasting (4): Persistence Model, Prepare Data

Persistence Model

A good baseline for time series forecasting is the persistence model.

This is a forecasting model in which the last observation is persisted forward: the most recent observed value is used as the forecast for each future time step. Because of its simplicity, it is often called the naive forecast.
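
As a minimal sketch of the idea, the persistence forecast can be written as a function that repeats the last observed value for each step being forecast. The function name persistence_forecast and the toy values are assumptions made for illustration; they do not come from the original post.

```python
# Minimal persistence (naive) forecast sketch.
# The function name and the toy history are hypothetical, for illustration only.
def persistence_forecast(last_observation, n_steps):
    """Repeat the last observed value for each of the next n_steps time steps."""
    return [last_observation for _ in range(n_steps)]


history = [10.0, 20.0, 30.0]  # toy observations, oldest first
forecast = persistence_forecast(history[-1], 3)
print(forecast)  # [30.0, 30.0, 30.0]
```

Any model worth keeping should produce a lower error than this baseline on the same test data.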

You can learn more about the persistence model for time series forecasting in the post:

How to Make Baseline Predictions for Time Series Forecasting with Python

Prepare Data

The first step is to transform the data from a series into a supervised learning problem.

That is, we go from a list of numbers to a list of input and output patterns. We can achieve this using a pre-prepared function called series_to_supervised().

For more on this function, see the post:

How to Convert a Time Series to a Supervised Learning Problem in Python

The function is listed below.

 
# convert time series into supervised learning problem
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

The function can be called by passing in the loaded series values, an n_in value of 1, and an n_out value of 3; for example:

 
supervised = series_to_supervised(raw_values, 1, 3)
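
To make the framing concrete, here is a small sketch of what one lagged input and three output columns look like on a toy series. It applies pandas shift() directly rather than calling series_to_supervised(), and the toy values are an assumption chosen so the pattern is easy to read.

```python
from pandas import DataFrame, concat

# Toy series for illustration only (not the shampoo data).
values = [10, 20, 30, 40, 50, 60]
df = DataFrame(values)

# One lagged input column (t-1) followed by three output columns (t, t+1, t+2).
columns = [df.shift(1), df.shift(0), df.shift(-1), df.shift(-2)]
framed = concat(columns, axis=1)
framed.columns = ['var1(t-1)', 'var1(t)', 'var1(t+1)', 'var1(t+2)']

# Shifting pushes values off the ends of the series and leaves NaN behind,
# so the first row and the last two rows are incomplete and are dropped.
framed = framed.dropna()
print(framed)
```

With 6 observations, 1 lag, and 3 output steps, only 3 complete rows remain: each row pairs one past value with the three values that follow it.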

Next, we can split the supervised learning dataset into training and test sets.

We know that in this form, the last 10 rows contain data for the final year. These rows make up the test set, and the rest of the data makes up the training dataset.
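
The split itself is plain array slicing with a negative index. The sketch below uses a toy array and a hypothetical n_test of 3 in place of the real data and the 10 test rows.

```python
import numpy as np

# Toy stand-in for the supervised array: 8 rows, 4 columns (illustrative values).
supervised_values = np.arange(32).reshape(8, 4)
n_test = 3  # hypothetical test size for this toy example

# All rows except the last n_test form the training set;
# the last n_test rows form the test set. Row order is preserved.
train, test = supervised_values[:-n_test], supervised_values[-n_test:]
print(train.shape, test.shape)  # (5, 4) (3, 4)
```

The rows are not shuffled, so the test set always lies later in time than the training set.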

We can put all of this together in a new function that takes the loaded series and some parameters and returns train and test sets ready for modeling.

 
# transform series into train and test sets for supervised learning
def prepare_data(series, n_test, n_lag, n_seq):
    # extract raw values
    raw_values = series.values
    raw_values = raw_values.reshape(len(raw_values), 1)
    # transform into supervised learning problem X, y
    supervised = series_to_supervised(raw_values, n_lag, n_seq)
    supervised_values = supervised.values
    # split into train and test sets
    train, test = supervised_values[0:-n_test], supervised_values[-n_test:]
    return train, test

We can test this with the Shampoo dataset. The complete example is listed below.

from pandas import DataFrame
from pandas import concat
from pandas import read_csv
from datetime import datetime


# date-time parsing function for loading the dataset
def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')


# convert time series into supervised learning problem
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg


# transform series into train and test sets for supervised learning
def prepare_data(series, n_test, n_lag, n_seq):
    # extract raw values
    raw_values = series.values
    raw_values = raw_values.reshape(len(raw_values), 1)
    # transform into supervised learning problem X, y
    supervised = series_to_supervised(raw_values, n_lag, n_seq)
    supervised_values = supervised.values
    # split into train and test sets
    train, test = supervised_values[0:-n_test], supervised_values[-n_test:]
    return train, test


# load dataset
series = read_csv('shampoo-sales.csv', header=0, index_col=0)
# parse the date index and reduce the single-column frame to a Series
series.index = series.index.map(parser)
series = series.squeeze('columns')
# configure
n_lag = 1
n_seq = 3
n_test = 10
# prepare data
train, test = prepare_data(series, n_test, n_lag, n_seq)
print(test)
print('Train: %s, Test: %s' % (train.shape, test.shape))

Running the example first prints the entire test dataset, which is the last 10 rows. The shapes of the train and test datasets are also printed.

 
[[ 342.3  339.7  440.4  315.9]
 [ 339.7  440.4  315.9  439.3]
 [ 440.4  315.9  439.3  401.3]
 [ 315.9  439.3  401.3  437.4]
 [ 439.3  401.3  437.4  575.5]
 [ 401.3  437.4  575.5  407.6]
 [ 437.4  575.5  407.6  682. ]
 [ 575.5  407.6  682.   475.3]
 [ 407.6  682.   475.3  581.3]
 [ 682.   475.3  581.3  646.9]]
Train: (23, 4), Test: (10, 4)
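
The printed shapes are consistent with a simple row count. The sketch below assumes the series holds 36 monthly observations (three years); that figure is an assumption here, since the file contents are not shown in this section.

```python
n_obs = 36  # assumed number of monthly observations (three years)
n_lag, n_seq, n_test = 1, 3, 10

# Shifting leaves incomplete rows: n_lag at the start and n_seq - 1 at the end.
n_rows = n_obs - n_lag - (n_seq - 1)
n_train = n_rows - n_test
print(n_rows, n_train, n_test)  # 33 23 10
```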

We can see that the single input value (first column) in the first row of the test dataset matches the observation in the shampoo-sales data for December of the 2nd year:

 
"2-12",342.3

We can also see that each row contains 4 columns: 1 input value and 3 output values for each observation.
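
Given that layout, each row can be separated into its input and output parts by slicing on n_lag. The sketch below uses the first two rows of the test set printed above; the names X and y are a common convention assumed here, not something the prepare_data() function returns.

```python
import numpy as np

# First two rows of the printed test set.
test = np.array([
    [342.3, 339.7, 440.4, 315.9],
    [339.7, 440.4, 315.9, 439.3],
])
n_lag = 1

# The first n_lag columns are inputs; the remaining columns are the outputs to forecast.
X, y = test[:, :n_lag], test[:, n_lag:]
print(X.shape, y.shape)  # (2, 1) (2, 3)
```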