[Numerical prediction case] (6) LSTM and GRU time-series stock data prediction, with complete TensorFlow code

Hello everyone, today I will share how to use the recurrent neural networks LSTM and GRU to predict stock data. GRU is a simplified variant of LSTM: the three gates inside an LSTM are reduced to two, and GRU often performs as well as or better than LSTM while being cheaper to compute.
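For reference, a GRU cell keeps only an update gate z_t and a reset gate r_t; the standard equations (bias terms omitted for brevity) make the simplification concrete:

z_t = σ(W_z · [h_{t−1}, x_t])        (update gate)
r_t = σ(W_r · [h_{t−1}, x_t])        (reset gate)
h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

An LSTM, by contrast, maintains a separate cell state governed by three gates (input, forget, and output), which is where its extra parameters come from.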


1. Import the toolkit

If your computer has no GPU, delete the following code that enables GPU-accelerated computing.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt

# Enable GPU acceleration (skip if no GPU is available)
gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)  # allocate GPU memory on demand

2. Get the dataset

First, install pandas_datareader, a toolkit for fetching financial data remotely. Then use web.DataReader() to specify the data source and the ticker of the company whose stock data you want. For the function's parameters in detail, see the Zhihu column: https://zhuanlan.zhihu.com/p/341254102

This time we fetch Google's stock data from 2000 to 2021. After reading the raw data, any missing values must be dropped so they do not interfere with later processing. To make learning easier for the recurrent network, the data also needs to be sorted from oldest to newest.

This section uses each time series to predict the stock's closing price 10 days later, i.e. a prediction for a single point in time. A new 'label' column is added to the table to hold the label value for each series.

# pip install pandas_datareader
import pandas_datareader.data as web
import datetime  # datetime is Python's standard library for dates and times

# Set the date range for the stock data
start = datetime.datetime(2000,1,1)  # start date
end = datetime.datetime(2021,9,1)  # end date

# Fetch GOOGL stock data for 2000-2021 from the stooq data source
df = web.DataReader('GOOGL', 'stooq', start, end)
# Inspect the data: date index, open, high, low, close, volume
print(df)

df.dropna(inplace=True)  # drop rows with missing values

# Sort by the index (date) in ascending order
df.sort_index(inplace=True)  # sort in place, replacing the original df
print(df)

# Build the labels: predict the closing price 10 days ahead
pre_days = 10
# Add a new column to hold the labels, i.e. use the features of one row
# to predict the closing price 10 rows (trading days) later
df['label'] = df['Close'].shift(-pre_days)
print(df)

Since the 'label' column is the 'Close' column shifted up by 10 rows, the last 10 rows of the 'label' column are NaN. Keep this in mind; it is handled later.
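The behavior of shift() is easy to verify on a toy Series:

import pandas as pd

s = pd.Series([10, 11, 12, 13, 14])
print(s.shift(-2))
# 0    12.0
# 1    13.0
# 2    14.0
# 3     NaN
# 4     NaN
# dtype: float64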


3. Data preprocessing

Import sklearn's standardization method and standardize all the feature columns; the 'label' column is left untouched. Standardization keeps features with very different scales from distorting the training results.

from sklearn.preprocessing import StandardScaler  # import the standardization method

scaler = StandardScaler()  # instantiate the scaler
# Standardize all the feature columns; the last column is the label
sca_x = scaler.fit_transform(df.iloc[:,:-1])
# Inspect the standardized features
print(sca_x)

The five feature columns are, in order: open, high, low, close, and volume.
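StandardScaler transforms each column to z = (x − mean) / std. A quick check (using the sca_x array from above) confirms the transform worked:

# Each standardized feature column should now have mean ~0 and std ~1
print(sca_x.mean(axis=0).round(3))  # approximately [0. 0. 0. 0. 0.]
print(sca_x.std(axis=0).round(3))   # approximately [1. 1. 1. 1. 1.]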


4. Time series sliding window

A deque, a very convenient double-ended queue, is used here. Setting its maximum length with maxlen=20 means each time series has length 20: once the queue deq holds 20 rows, appending the 21st automatically drops the first, so the queue stays at length 20. Each time series therefore has shape=[20,5], i.e. 20 rows of data and 5 feature columns.
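A tiny demo of this maxlen behavior:

from collections import deque

d = deque(maxlen=3)
for v in [1, 2, 3, 4]:
    d.append(v)  # appending to a full deque drops the oldest element first
print(d)  # deque([2, 3, 4], maxlen=3)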

After grouping all the data into time series, the last 10 rows of the feature data sca_x have no corresponding label value, so the last 10 series must be dropped. Each series corresponds to one label, and the series list and label array end up the same length.
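The index arithmetic is the subtle part. A toy example (hypothetical numbers: window of 3, predicting 2 steps ahead) shows how series and labels line up:

# Toy example: window = 3, predict 2 steps ahead
data = list(range(8))  # 8 "days" of a single feature: 0..7
win, ahead = 3, 2

# All length-3 windows, then drop the last `ahead` windows (they have no label)
series = [data[i:i+win] for i in range(len(data) - win + 1)][:-ahead]
# The label of a window ending at day d is the value `ahead` days later,
# mirroring df['label'].values[men_his_days-1 : -pre_days] below
y_toy = data[win - 1 + ahead:]

print(series)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(y_toy)   # [4, 5, 6, 7]

The real code below does the same with a window of 20 and a 10-day horizon.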

import numpy as np
from collections import deque  # like a list, but allows fast appends/pops at both ends

men_his_days = 20  # use 20 days of feature data for each prediction
# Create a queue whose length equals the number of remembered days; sliding window size = 20
deq = deque(maxlen=men_his_days)

# Feature list holding every time series
x = []
# Iterate over each row of feature data
for i in sca_x:
    # Push the row into the queue (convert the array row to a list)
    deq.append(list(i))
    # Once the queue length equals the number of remembered days (the window
    # length), the rows in it form one complete time series. When the queue is
    # full, it automatically drops the oldest row as new rows are appended
    if len(deq) == men_his_days:
        # Save this series (copy the deque contents as a list)
        x.append(list(deq))

# The last 10 rows of the original features have no label values,
# so drop the last 10 series from x
x = x[:-pre_days]
# Number of series
print(len(x))  # 4260

# The last column of df holds the labels; take the ones that match the series.
# The first series covers rows [0..19], and its label is the 'label' value
# at row 19, i.e. index men_his_days-1
y = df['label'].values[men_his_days-1: -pre_days]
print(len(y))  # the series x and the labels y should have the same length

# Convert the features and labels to numpy arrays
x, y = np.array(x), np.array(y)

5. Dataset splitting

With the processed time series and their labels in hand, we can split the data into training, validation, and test sets by proportion. The training set is shuffled with .shuffle() so that the order of the series does not bias training. Creating an iterator with iter() and calling next() on it pulls one batch out of the training set for inspection.

total_num = len(x)  # total number of series/label pairs
train_num = int(total_num*0.8)  # 80% of the data for training
val_num = int(total_num*0.9)  # 80-90% for validation
# the remaining 10% for testing

x_train, y_train = x[:train_num], y[:train_num]  # training set
x_val, y_val = x[train_num:val_num], y[train_num:val_num]  # validation set
x_test, y_test = x[val_num:], y[val_num:]  # test set

# Convert to tf.data datasets
batch_size = 128  # number of series per training step
# Training set: shuffle the series before batching so each batch mixes periods
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_ds = train_ds.shuffle(10000).batch(batch_size)
# Validation set
val_ds = tf.data.Dataset.from_tensor_slices((x_val, y_val))
val_ds = val_ds.batch(batch_size)
# Test set
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))
test_ds = test_ds.batch(batch_size)

# Inspect the dataset
sample = next(iter(train_ds))  # take out one batch
print('x_train.shape:', sample[0].shape)  # (128, 20, 5)
print('y_train.shape:', sample[1].shape)  # (128,)

6. Construct the network model

The GRU network is used as the example; for LSTM, simply replace layers.GRU() in the code below with layers.LSTM().

Pay attention to the parameter return_sequences: it determines whether the layer returns only the last value of the output sequence or all of its values, and it defaults to False. Generally, return_sequences=True is used whenever the next layer is another GRU or LSTM layer.
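A quick way to see the difference in output shapes (a minimal sketch, reusing the keras and layers imports from above):

# return_sequences=True keeps the output of every time step;
# return_sequences=False (the default) keeps only the last step
demo_in = keras.Input(shape=(20, 5))
print(layers.GRU(8, return_sequences=True)(demo_in).shape)   # (None, 20, 8)
print(layers.GRU(8, return_sequences=False)(demo_in).shape)  # (None, 8)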

input_shape = sample[0].shape[-2:]  # [20,5]; the batch dimension is not included

# Input layer
inputs = keras.Input(shape=input_shape)  # [None,20,5]

# First GRU layer; return_sequences=True is needed because the next layer is also recurrent
x = layers.GRU(8, activation='relu', return_sequences=True, kernel_regularizer=keras.regularizers.l2(0.01))(inputs)
x = layers.Dropout(0.2)(x)  # randomly drop units to reduce overfitting

# Second GRU layer
x = layers.GRU(16, activation='relu', return_sequences=True, kernel_regularizer=keras.regularizers.l2(0.01))(x)
x = layers.Dropout(0.2)(x)

# Third GRU layer; returns only the last output (return_sequences defaults to False)
x = layers.GRU(32, activation='relu')(x)
x = layers.Dropout(0.2)(x)

# Fully connected layer with random-normal weight initialization and L2 regularization
x = layers.Dense(16, activation='relu', kernel_initializer='random_normal', kernel_regularizer=keras.regularizers.l2(0.01))(x)
x = layers.Dropout(0.2)(x)

# Output layer: the closing price 10 days after the input series is a single
# time point, so the output layer has one unit
outputs = layers.Dense(1)(x)

# Build the model
model = keras.Model(inputs, outputs)

# Print the model architecture
model.summary()

The printed network structure:

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_4 (InputLayer)         [(None, 20, 5)]           0         
_________________________________________________________________
gru_3 (GRU)                  (None, 20, 8)             360       
_________________________________________________________________
dropout_12 (Dropout)         (None, 20, 8)             0         
_________________________________________________________________
gru_4 (GRU)                  (None, 20, 16)            1248      
_________________________________________________________________
dropout_13 (Dropout)         (None, 20, 16)            0         
_________________________________________________________________
gru_5 (GRU)                  (None, 32)                4800      
_________________________________________________________________
dropout_14 (Dropout)         (None, 32)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 16)                528       
_________________________________________________________________
dropout_15 (Dropout)         (None, 16)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 17        
=================================================================
Total params: 6,953
Trainable params: 6,953
Non-trainable params: 0
_________________________________________________________________
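As a sanity check on the parameter counts: with TF2's default reset_after=True, a GRU layer has 3 × (i·u + u² + 2u) parameters, where i is the input dimension and u the number of units. The first GRU layer (i=5, u=8) gives 3 × (40 + 64 + 16) = 360, and the second (i=8, u=16) gives 3 × (128 + 256 + 32) = 1248, matching the summary.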

7. Network training

The mean absolute error (MAE) between predictions and labels is used as the loss function, and the mean squared logarithmic error (MSLE) as the network's monitoring metric. The MAE loss and MSLE metric of every epoch during training are saved in history.
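For reference, with y_i the label, p_i the prediction, and n the number of samples:

MAE  = (1/n) · Σ |y_i − p_i|
MSLE = (1/n) · Σ (log(1 + y_i) − log(1 + p_i))²

Note that the reported training loss also includes the L2 regularization penalties defined in the model, so it sits slightly above the pure MAE.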

# Compile the model
model.compile(optimizer = keras.optimizers.Adam(0.001),  # Adam optimizer, learning rate 0.001
              loss = tf.keras.losses.MeanAbsoluteError(),  # mean absolute difference between labels and predictions
              metrics = [tf.keras.losses.MeanSquaredLogarithmicError()])  # mean squared log error between labels and predictions

epochs = 10  # number of training epochs

# Train the model
history = model.fit(train_ds, epochs=epochs, validation_data=val_ds)

The training process is as follows

Epoch 1/10
27/27 [==============================] - 8s 214ms/step - loss: 395.9859 - mean_squared_logarithmic_error: 32.9226 - val_loss: 1164.5131 - val_mean_squared_logarithmic_error: 46.3883
Epoch 2/10
27/27 [==============================] - 5s 198ms/step - loss: 404.5123 - mean_squared_logarithmic_error: 28.0247 - val_loss: 1153.9086 - val_mean_squared_logarithmic_error: 20.9722
----------------------------------------------------
----------------------------------------------------
Epoch 9/10
27/27 [==============================] - 5s 200ms/step - loss: 111.9984 - mean_squared_logarithmic_error: 0.1729 - val_loss: 174.2481 - val_mean_squared_logarithmic_error: 0.0213
Epoch 10/10
27/27 [==============================] - 5s 199ms/step - loss: 101.5161 - mean_squared_logarithmic_error: 0.1041 - val_loss: 54.0906 - val_mean_squared_logarithmic_error: 0.0028

8. View training process information

Plot the training-set and validation-set loss, as well as the training-set and validation-set monitoring metric, for each epoch.

# (10) View the training information
history_dict = history.history  # dictionary of the training history
train_loss = history_dict['loss']  # training-set loss
val_loss = history_dict['val_loss']  # validation-set loss
train_msle = history_dict['mean_squared_logarithmic_error']  # training-set MSLE
val_msle = history_dict['val_mean_squared_logarithmic_error']  # validation-set MSLE

# (11) Plot the training and validation loss
plt.figure()
plt.plot(range(epochs), train_loss, label='train_loss')  # training loss
plt.plot(range(epochs), val_loss, label='val_loss')  # validation loss
plt.legend()  # show the legend
plt.xlabel('epochs')
plt.ylabel('loss')
plt.show()

# (12) Plot the training and validation MSLE
plt.figure()
plt.plot(range(epochs), train_msle, label='train_msle')  # training MSLE
plt.plot(range(epochs), val_msle, label='val_msle')  # validation MSLE
plt.legend()  # show the legend
plt.xlabel('epochs')
plt.ylabel('msle')
plt.show()


9. Prediction Phase

Use the evaluate() function to compute the loss and monitoring metric over the whole test set, then fetch the date corresponding to each true value for plotting.

# (13) Evaluate on the test set: compute the loss and monitoring metric
model.evaluate(test_ds)

# Predict
y_pred = model.predict(x_test)

# Get the dates corresponding to the test labels
df_time = df.index[-len(y_test):]

# Plot the comparison curves
fig = plt.figure(figsize=(10,5))  # figure size
axes = fig.add_subplot(111)  # add one subplot to the figure
# Actual values; df_time holds the corresponding dates
axes.plot(df_time, y_test, 'b-', label='actual')
# Predicted values, red dashed line
axes.plot(df_time, y_pred, 'r--', label='predict')
# Set the x-axis ticks
axes.set_xticks(df_time[::50])
axes.set_xticklabels(df_time[::50], rotation=45)

plt.legend()  # legend
plt.grid()  # grid
plt.show()

(Figure: comparison curve of true and predicted closing prices)


10. Comparing LSTM and GRU

Training an LSTM by the same method produces the prediction curves below. The two methods differ little on this task; when there is a real need, both can be trained and compared.
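For completeness, a minimal sketch of the LSTM counterpart (assuming the same data pipeline and training setup as above; the regularizers and initializers are omitted here for brevity):

# LSTM version of the model above; only the recurrent layers change
inputs = keras.Input(shape=input_shape)
x = layers.LSTM(8, activation='relu', return_sequences=True)(inputs)
x = layers.Dropout(0.2)(x)
x = layers.LSTM(16, activation='relu', return_sequences=True)(x)
x = layers.Dropout(0.2)(x)
x = layers.LSTM(32, activation='relu')(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(16, activation='relu')(x)
outputs = layers.Dense(1)(x)
lstm_model = keras.Model(inputs, outputs)
lstm_model.compile(optimizer=keras.optimizers.Adam(0.001),
                   loss=tf.keras.losses.MeanAbsoluteError(),
                   metrics=[tf.keras.losses.MeanSquaredLogarithmicError()])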

(Figure: training process comparison of LSTM and GRU)


Origin: blog.csdn.net/dgvv4/article/details/124386024