Practical case: Using LSTM for multivariate time series forecasting (with complete Python code)

In this article we will perform multivariate time series forecasting using a deep learning method, the LSTM.

Let's first understand two topics -

  • What is Time Series Analysis?
  • What are LSTMs?

Time Series Analysis: A time series is a sequence of data points ordered in time. The interval can be seconds, minutes, hours, days, weeks, months, or years, and future values depend on previous ones.

In real world cases, we mainly have two types of time series analysis −

  • Univariate time series
  • Multivariate time series

For univariate time series data, we use a single column for forecasting.

Since there is only one column, each upcoming future value depends only on its own previous values.

But in the case of multivariate time series data, there are multiple feature columns, and the target depends on those features.

As the example shows, the multivariate case has multiple columns that are used to make predictions for the target value ("count" in the example is the target value).

In such data, count depends not only on its own previous values but also on the other features. Therefore, to predict the upcoming count value, we have to consider all columns, including the target column.

One thing must be kept in mind when performing multivariate time series analysis: we use multiple features to predict the current target. Let's understand this with an example −

At training time, if we use 5 columns [feature1, feature2, feature3, feature4, target] to train the model, then for the upcoming prediction day we only need to provide the 4 feature columns [feature1, feature2, feature3, feature4], as sketched below.
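
A minimal sketch of this split, with hypothetical column names (the names are for illustration only, not from the dataset used later):

# Hypothetical column names, for illustration only.
train_cols = ["feature1", "feature2", "feature3", "feature4", "target"]   # used to build the training windows
future_cols = ["feature1", "feature2", "feature3", "feature4"]            # all we can supply for a future day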

LSTM

This article does not intend to discuss LSTMs in detail, so only a brief description is given; if you don't know much about LSTMs, you can refer to our previously published articles.

LSTM is basically a recurrent neural network capable of handling long-term dependencies.

Suppose you are watching a movie: when something happens, you already know what happened before, and you can understand the new event in light of the past. RNNs work the same way; they remember past information and use it to process the current input. The problem with RNNs is that they cannot remember long-term dependencies because of vanishing gradients, so LSTMs are designed specifically to avoid this long-term dependency problem.
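
As a minimal sketch (the layer sizes here are illustrative, not from this article's model), the difference shows up directly in the layer choice:

# Both layers consume sequences of 30 timesteps with 5 features each.
from tensorflow.keras.layers import SimpleRNN, LSTM

rnn_layer = SimpleRNN(50, input_shape=(30, 5))   # plain RNN: suffers from vanishing gradients on long sequences
lstm_layer = LSTM(50, input_shape=(30, 5))       # LSTM: gated cell state preserves long-term information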

That covers the theory of time series forecasting and LSTMs. Let's start coding.

Let's start by importing the libraries needed to make predictions

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
# Note: keras.wrappers.scikit_learn has been removed from recent Keras releases;
# on newer setups, `from scikeras.wrappers import KerasRegressor` is the replacement.
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV

Load the data, and check the output -

df=pd.read_csv("train.csv",parse_dates=["Date"],index_col=[0])
df.head()

df.tail()

Now let's take a moment to look at the data: the CSV file contains Google's stock data from 2001-01-25 to 2021-09-29, sampled at a daily frequency.

[You can convert the frequency to "B" (business days) or "D" (calendar days) if you want; since we won't be using the dates directly, I'm keeping it as is.]
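
If you did want to set an explicit frequency, a one-liner like the following would do it (note that "D" would introduce NaN rows for non-trading days, which you would then need to fill or drop):

# Optional: give the DatetimeIndex an explicit business-day frequency.
df = df.asfreq("B")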

Here we are trying to predict the future value of the "Open" column, so "Open" is the target column.

Let's look at the shape of the data

df.shape
(5203,5)

Now let's do the train-test split. We cannot shuffle the data here, since time series data must remain in sequence.

test_split = round(len(df) * 0.20)   # 1041 rows (20%) held out for testing
df_for_training = df[:-test_split]
df_for_testing = df[-test_split:]
print(df_for_training.shape)
print(df_for_testing.shape)

(4162, 5)
(1041, 5)

Notice that the value ranges are very large and the columns are not scaled to a common range, so to avoid prediction errors, let's first scale the data with MinMaxScaler. (StandardScaler could also be used.)

scaler = MinMaxScaler(feature_range=(0, 1))
# Fit the scaler on the training split only, then reuse it for the test split,
# so no information from the test set leaks into the scaling.
df_for_training_scaled = scaler.fit_transform(df_for_training)
df_for_testing_scaled = scaler.transform(df_for_testing)
df_for_training_scaled

Split the data into X and Y. This is the most important part, so follow each step carefully.

def createXY(dataset, n_past):
    # Build sliding windows: each X sample is the previous n_past rows
    # (all columns), and the matching Y is column 0 of the row that follows.
    dataX = []
    dataY = []
    for i in range(n_past, len(dataset)):
        dataX.append(dataset[i - n_past:i, 0:dataset.shape[1]])
        dataY.append(dataset[i, 0])
    return np.array(dataX), np.array(dataY)

trainX, trainY = createXY(df_for_training_scaled, 30)
testX, testY = createXY(df_for_testing_scaled, 30)

Let's see what is done in the code above:

n_past is the number of past steps we look back over when predicting the next target value.

Using 30 here means that the past 30 values (of all features, including the target column) will be used to predict the 31st target value.

So trainX will contain all the feature values, while trainY contains only the target values.

Let's break down each part of the for loop

For training, dataset = df_for_training_scaled, n_past=30

When i = 30:

dataX.append(df_for_training_scaled[i - n_past:i, 0:df_for_training.shape[1]])

Since n_past is 30, the first slice is [30 - 30:30, 0:5], which is equivalent to [0:30, 0:5].

So the array df_for_training_scaled[0:30, 0:5] is the first entry appended to the dataX list.

Now, dataY.append(df_for_training_scaled[i, 0])

With i = 30, it takes only the value at row 30, column 0 (for the prediction we only need the "Open" column, and column index 0 is the "Open" column).

So the first value stored in the dataY list is df_for_training_scaled[30, 0].

In other words, the first 30 rows with all 5 columns are stored in dataX, and only the "Open" value of the 31st row is stored in dataY. We then convert the dataX and dataY lists into arrays, since the LSTM is trained on data in array format.
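
A quick sanity check of createXY on a toy array (3 features, 6 rows, n_past = 2; the numbers here are made up purely for illustration):

toy = np.arange(18).reshape(6, 3)   # rows 0..5, columns 0..2
X, y = createXY(toy, 2)
print(X.shape)   # (4, 2, 3): four windows of 2 rows x 3 columns each
print(y)         # [ 6  9 12 15]: column 0 of rows 2, 3, 4, 5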

Let's look at the shape.

print("trainX Shape-- ",trainX.shape)
print("trainY Shape-- ",trainY.shape)

(4132, 30, 5)
(4132,)

print("testX Shape-- ",testX.shape)
print("testY Shape-- ",testY.shape)

(1011, 30, 5)
(1011,)

4132 is the total number of windows in trainX. Each window has 30 rows and 5 columns, and for every window, trainY holds the next target value used to train the model.

Let's look at one of the (30, 5) windows from trainX and the corresponding trainY value:

print("trainX[0]-- \n",trainX[0])
print("trainY[0]-- ",trainY[0])

If you look at trainX[1], you will see that it contains the same data as trainX[0] except for the first row: we use the previous 30 rows to predict the 31st value, so after the first prediction the window automatically moves forward by one row and takes the next 30 values to predict the next target value.

Let's explain it all in a simple format -

trainX → trainY

[0:30, 0:5] → [30, 0]

[1:31, 0:5] → [31, 0]

[2:32, 0:5] → [32, 0]

In this way, every sample is stored in trainX and trainY. A small check below confirms this overlap.
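
Consecutive windows overlap by 29 rows; a quick assertion (an optional check, not part of the original code) makes this concrete:

# trainX[1] is just trainX[0] shifted one row forward.
print(np.array_equal(trainX[0][1:], trainX[1][:-1]))   # True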

Now let's train the model. I use GridSearchCV to do some hyperparameter tuning and find the best model.

def build_model(optimizer):
    grid_model = Sequential()
    grid_model.add(LSTM(50, return_sequences=True, input_shape=(30, 5)))
    grid_model.add(LSTM(50))
    grid_model.add(Dropout(0.2))
    grid_model.add(Dense(1))
    grid_model.compile(loss='mse', optimizer=optimizer)
    return grid_model

grid_model = KerasRegressor(build_fn=build_model, verbose=1, validation_data=(testX, testY))

parameters = {'batch_size': [16, 20],
              'epochs': [8, 10],
              'optimizer': ['adam', 'Adadelta']}

grid_search = GridSearchCV(estimator=grid_model,
                           param_grid=parameters,
                           cv=2)

You can also add more layers if you want to do more hyperparameter tuning for your model. If the data set is very large, it is better to increase the number of epochs and the number of units in the LSTM layers.

Note the input shape (30, 5) in the first LSTM layer: it comes from the shape of trainX, (trainX.shape[1], trainX.shape[2]) → (30, 5).
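
If you prefer not to hard-code the shape, it can be derived from the data itself; a small sketch:

# Derive the window length and feature count from trainX instead of hard-coding (30, 5).
n_steps, n_features = trainX.shape[1], trainX.shape[2]   # 30, 5
# ...then inside build_model:
# grid_model.add(LSTM(50, return_sequences=True, input_shape=(n_steps, n_features)))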

Now let's fit the model to the trainX and trainY data.

grid_search = grid_search.fit(trainX,trainY)

This will take some time to run due to the hyperparameter search.

You will see the loss decrease epoch by epoch as training runs.

Let us now examine the optimal parameters of the model.

grid_search.best_params_

{'batch_size': 20, 'epochs': 10, 'optimizer': 'adam'}

Save the best model in the my_model variable.

my_model=grid_search.best_estimator_.model

The model can now be tested with the test dataset.

prediction=my_model.predict(testX)
print("prediction\n", prediction)
print("\nPrediction Shape-",prediction.shape)

testY and prediction have the same length, so you can now compare testY with the predictions.

But we scaled the data at the beginning, so first we have to undo the scaling.

scaler.inverse_transform(prediction)

The error is reported because when we scaled the data, every row had 5 columns, while the prediction has only 1 column, the target column.

So we have to change the shape before we can use inverse_transform:

prediction_copies_array = np.repeat(prediction,5, axis=-1)

This simply duplicates the single prediction column 4 more times, so we now have 5 columns all holding the same value.

prediction_copies_array.shape
(1011,5)

This enables the use of the inverse_transform function.

pred=scaler.inverse_transform(np.reshape(prediction_copies_array,(len(prediction),5)))[:,0]

Only the first column after the inverse transform is the one we need, which is why we take [:, 0] at the end.
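
As a side note, a sketch of an alternative to this copy trick (not what this article's code does) is to fit a second scaler on the target column alone, so predictions can be inverse-transformed directly:

# Alternative sketch: a scaler fitted only on the target column.
target_scaler = MinMaxScaler(feature_range=(0, 1))
target_scaler.fit(df_for_training[["Open"]])             # fit on the "Open" column only
pred_alt = target_scaler.inverse_transform(prediction)[:, 0]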

Now we can compare this pred value with testY. But testY is also scaled and needs to be inverse-transformed using the same approach as above:

original_copies_array = np.repeat(testY,5, axis=-1)
original=scaler.inverse_transform(np.reshape(original_copies_array,(len(testY),5)))[:,0]

Now let's look at the predicted and original values →

print("Pred Values-- " ,pred)
print("\nOriginal Values-- " ,original)

Finally, draw a graph to compare our pred with the original data.

plt.plot(original, color = 'red', label = 'Real Stock Price')
plt.plot(pred, color = 'blue', label = 'Predicted Stock Price')
plt.title('Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Google Stock Price')
plt.legend()
plt.show()

Looks good! So far we have trained the model and checked it against the test values. Now let's predict some future values.

Get the last 30 values from the main df dataset that we loaded at the beginning. [Why 30? Because 30 is the number of past values we need in order to predict the 31st value.]

df_30_days_past=df.iloc[-30:,:]
df_30_days_past.tail()

You can see that all the columns are there, including the target column ("Open"). Now let's predict 30 values into the future.

In multivariate time series forecasting, a single column is forecast using several features, so when forecasting we need the feature values (everything except the target column) for the upcoming period.

Here we need the upcoming 30 values of the "High", "Low", "Close" and "Adj Close" columns to make predictions for the "Open" column.

df_30_days_future=pd.read_csv("test.csv",parse_dates=["Date"],index_col=[0])
df_30_days_future

After removing the "Open" column, the following operations need to be done before using the model for prediction:

  • Scale the data; since the "Open" column was removed, add it back with all values set to 0 before scaling.
  • After scaling, replace the "Open" column values in the future data with NaN.
  • Append the 30 days of old values to the 30 days of new values (where the last 30 "Open" values are NaN).

df_30_days_future["Open"]=0
df_30_days_future=df_30_days_future[["Open","High","Low","Close","Adj Close"]]
old_scaled_array=scaler.transform(df_30_days_past)
new_scaled_array=scaler.transform(df_30_days_future)
new_scaled_df=pd.DataFrame(new_scaled_array)
new_scaled_df.iloc[:,0]=np.nan
full_df=pd.concat([pd.DataFrame(old_scaled_array),new_scaled_df]).reset_index().drop(["index"],axis=1)

full_df has shape (60, 5), and the last 30 rows of its first column are NaN.
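
A quick sanity check of that layout:

print(full_df.shape)                        # (60, 5)
print(full_df.iloc[30:, 0].isna().sum())    # 30: the future "Open" values are NaN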

To make the predictions we use a for loop again, like the one we used when splitting the data into trainX and trainY; but this time we only have X values and no Y values.

full_df_scaled_array = full_df.values
all_data = []
time_step = 30
for i in range(time_step, len(full_df_scaled_array)):
    data_x = []
    data_x.append(
        full_df_scaled_array[i - time_step:i, 0:full_df_scaled_array.shape[1]])
    data_x = np.array(data_x)
    prediction = my_model.predict(data_x)
    all_data.append(prediction)
    # Write the predicted scalar back into full_df so the next window
    # sees a real value instead of NaN.
    full_df.iloc[i, 0] = prediction[0][0]

For the first prediction, the previous 30 values are all available: when the for loop runs for the first time, it reads those 30 values and predicts the 31st "Open" value.

When the for loop runs the second time, it skips the first row and takes the next 30 values [1:31]. This would fail, because the "Open" value of the last row in that window is NaN; that is why each prediction is written back to replace the NaN before the next iteration.

Finally, the predictions need to be inverse-transformed →

new_array=np.array(all_data)
new_array=new_array.reshape(-1,1)
prediction_copies_array = np.repeat(new_array,5, axis=-1)
y_pred_future_30_days = scaler.inverse_transform(np.reshape(prediction_copies_array,(len(new_array),5)))[:,0]
print(y_pred_future_30_days)
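
To make the output easier to read, the predictions can be attached to the future dates (this assumes test.csv carries the 30 upcoming dates in its index, as loaded above):

# Optional: label the 30 future predictions with their dates.
future_predictions = pd.Series(y_pred_future_30_days,
                               index=df_30_days_future.index,
                               name="Open_predicted")
print(future_predictions)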

And with that, the complete process has been run from start to finish.

Source: blog.csdn.net/m0_59596937/article/details/128271899