[Numerical prediction case] (5) LSTM time series temperature data prediction, complete with TensorFlow code

Hello everyone. Today I will show how to use an LSTM recurrent neural network to predict temperature from multiple features. The previous article covered LSTM prediction with a single feature; if you are interested, take a look: https://blog.csdn.net/dgvv4/article/details/124349963

1. Import the toolkit

I use a GPU to accelerate computation. If you do not have a GPU, you can remove the code segment that configures it.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Enable GPU acceleration and let TensorFlow allocate GPU memory on demand
gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

2. Read the dataset

Dataset address: https://pan.baidu.com/s/1E5h-imMwdIyPv1Zc7FfC9Q Extraction code: 9cb5 

The dataset contains one record every 10 minutes, about 420,000 rows in total, with 14 feature columns. Apart from the time column, the first 10 feature columns are used for this model. The pandas plotting method is used to plot each feature against time.

filepath = 'D:/deeplearning/test/神经网络/循环神经网络/climate.csv'
data = pd.read_csv(filepath)
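print(data.head())  # one record every 10 minutes

# Take all rows of columns 1 through 10 as the features
feat = data.iloc[:, 1:11]  # the last 4 feature columns are not used
date = data.iloc[:, 0]   # time information

feat.plot(subplots=True, figsize=(80,10),  # one subplot per column; set the figure size
          layout=(5,2), title='climate features')  # arrange the 10 plots in a 5x2 grid; set the title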

The dataset information is as follows

The plot shows the 10 feature columns that follow the DateTime column as functions of time

3. Data preprocessing

Because the dataset is large, using all of it for training may cause an out-of-memory error, so only the first 20,000 rows are used for training. Compute the mean and standard deviation of each feature column on the training set, and use these training-set statistics to standardize the entire dataset. The standardized air temperature is used as the label.

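train_num = 20000  # the first 20,000 rows are used for training
val_num = 23000  # rows 20,000-23,000 are used for validation
# rows 23,000-25,000 are used for testing

# Mean and standard deviation of each feature column on the training set
feat_mean = feat[:train_num].mean(axis=0)
feat_std = feat[:train_num].std(axis=0)

# Standardize the entire dataset with the training-set statistics
feat = (feat - feat_mean) / feat_std

# Keep all temperature values as the label data
targets = feat.iloc[:,1]   # the standardized air temperature is the label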
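
As a small self-contained check of the idea above, the sketch below standardizes synthetic data using statistics computed from the training rows only. The array values, the 400-row split, and the variable names are made up for illustration; ddof=1 is passed so that NumPy matches the default of the pandas std() method.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for two feature columns (hypothetical pressure and temperature)
values = rng.normal(loc=[990.0, 9.0], scale=[8.0, 8.5], size=(500, 2))
train_rows = 400  # hypothetical split point

# Statistics come from the training rows only
train_mean = values[:train_rows].mean(axis=0)
train_std = values[:train_rows].std(axis=0, ddof=1)

# The same statistics are applied to every row, including the held-out ones
scaled = (values - train_mean) / train_std

print(scaled[:train_rows].mean(axis=0))         # close to 0 for each column
print(scaled[:train_rows].std(axis=0, ddof=1))  # close to 1 for each column
print(scaled[train_rows:].mean(axis=0))         # held-out rows are only roughly centered
```

Because the validation and test rows never contribute to the mean or standard deviation, no information from them leaks into training.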

4. Time series functions to extract features and labels

Slide a window over the dataset and use the rows inside the window to predict the temperature at a future time point or over a future period; for example, 20 rows of the 10 features can be used to predict the temperature that follows them. This task uses 5 consecutive days of data to predict the temperature at the next time point, with one record every 10 minutes.

Prediction for a single time point: five days contain 5*24*6=720 records, and the window slides one step at a time. The first window is range(0, 720, 1) and predicts the temperature at index 720. The second window is range(1, 721, 1) and predicts the temperature at index 721. Note that range() includes the start value and excludes the end value.

Prediction for a time period: because the dataset is recorded every 10 minutes, adjacent rows differ very little. You can set a step so that features are taken every 60 minutes. The first window is then range(0, 720, 6), and it predicts the hourly temperatures of the following day, i.e. range(720, 720+24*6, 6). The second window is range(1, 721, 6), and it predicts the hourly temperatures range(721, 721+24*6, 6) of the following day.
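
The index arithmetic above can be checked with plain Python ranges. This is only an illustrative sketch; the helper names feature_rows and period_label_rows are hypothetical and do not appear in the tutorial code.

```python
history_size = 5 * 24 * 6   # 5 days of 10-minute records = 720 rows
day = 24 * 6                # 144 rows per day

def feature_rows(i, history_size, step):
    """Rows fed to the model when the window ends just before row i."""
    return range(i - history_size, i, step)

def period_label_rows(i, horizon, step):
    """Rows whose temperatures are the labels for a period prediction starting at row i."""
    return range(i, i + horizon, step)

# Single time point, step 1: the first two windows
w0 = feature_rows(720, history_size, 1)   # rows 0..719, label is row 720
w1 = feature_rows(721, history_size, 1)   # rows 1..720, label is row 721
print(w0, len(w0))
print(w1, len(w1))

# Period prediction with hourly sampling (step 6)
h0 = feature_rows(720, history_size, 6)   # 120 hourly rows from the 5-day window
l0 = period_label_rows(720, day, 6)       # 24 hourly labels for the next day
print(len(h0), len(l0))
```

With step 6 each sample shrinks from 720 rows to 120 rows, which reduces memory use by a factor of six.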

This article predicts the value at a single time point. The parameters are as follows and can be adjusted as needed.

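dataset: the feature data
start_index: the index at which to start taking data
history_size: the size of the sliding window
end_index: the index at which to stop taking data
target_size: which future time point or period to predict. For example, target_size=0 means the first 20 rows of features are used to predict the temperature of the 21st row
step: the interval between the rows taken inside the sliding window
point_time: Boolean; True predicts the temperature at a single future time point, False predicts a time period
true: all label values of the temperature data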

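def TimeSeries(dataset, start_index, history_size, end_index, step,
               target_size, point_time, true):
    data = []  # feature windows
    labels = []  # label for each feature window
    start_index = start_index + history_size  # the first window covers [0:start_index]
    # If no end index is given, slide the window to the end of the data
    if end_index is None:
        # The tail of the dataset is needed for labels, so features cannot run to the very end
        end_index = len(dataset) - target_size
    # Move the window one row at a time from the start position to the end position
    for i in range(start_index, end_index):
        # Not every row in the window has to be used, e.g. step=6 takes one row every 60 min
        index = range(i-history_size, i, step)  # the first pass is range(0, start_index, step)
        # Take the selected rows of all feature columns
        data.append(dataset.iloc[index].values)
        # Use these features to predict either a single time point or a future period
        if point_time is True:  # predict a single time point
            # e.g. features [0:20] (20 excluded) predict the label at index 20 + target_size
            labels.append(true.iloc[i + target_size])
        else:  # predict a future time period
            # e.g. features [0:20] (20 excluded) predict the labels in [20, 20+target_size)
            labels.append(true.iloc[i:i + target_size].values)
    # Return the windowed features and their labels
    return np.array(data), np.array(labels)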
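
For comparison, here is a vectorized sketch of the same windowing for the simplest case (step=1, single time point, window running to the end of the data). It is not the tutorial's function: it swaps the Python loop for NumPy's sliding_window_view (available in NumPy 1.20 and later), and the name make_windows and the toy array are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(features, targets, history_size, target_size=0):
    """features: (n_rows, n_features), targets: (n_rows,).
    Returns x of shape (n_samples, history_size, n_features) and y of shape (n_samples,)."""
    # sliding_window_view appends the window axis last: (n_windows, n_features, history_size)
    windows = sliding_window_view(features, history_size, axis=0)
    windows = windows.transpose(0, 2, 1)  # -> (n_windows, history_size, n_features)
    # Window k covers rows [k, k + history_size); its label is row k + history_size + target_size
    n_samples = len(features) - history_size - target_size
    x = windows[:n_samples]
    y = targets[history_size + target_size:]
    return x, y

# Toy data: 30 rows, 3 features; column 1 plays the role of the temperature
toy = np.arange(90, dtype=float).reshape(30, 3)
x, y = make_windows(toy, toy[:, 1], history_size=5)
print(x.shape, y.shape)
```

The returned x is a read-only view onto the original array rather than a copy, so it takes almost no extra memory until something downstream copies it.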

5. Divide the dataset

Use the time series function above to obtain the features and labels needed for training. The example here predicts the temperature at the next time point. history_size specifies the size of the sliding window, that is, how many rows of data are used for one prediction. target_size specifies which future time point to predict; here it is 0, so the features in range(0, 720, 1) are used to predict the temperature at index 720+0. Setting point_time=False predicts a time period instead.

history_size = 5*24*6  # each window holds 5 days of data = 720 rows
target_size =  0  # predict the temperature at the next time point
step = 1  # a step of 1 takes every row

# Build the training set
x_train, y_train = TimeSeries(dataset=feat, start_index=0, history_size=history_size, end_index=train_num,
                              step=step, target_size=target_size, point_time=True, true=targets)

# Build the validation set
x_val, y_val = TimeSeries(dataset=feat, start_index=train_num, history_size=history_size, end_index=val_num,
                          step=step, target_size=target_size, point_time=True, true=targets)

# Build the test set
x_test, y_test =  TimeSeries(dataset=feat, start_index=val_num, history_size=history_size, end_index=25000,
                              step=step, target_size=target_size, point_time=True, true=targets)

# Inspect the dataset
print('x_train_shape:', x_train.shape)  # (19280, 720, 10)
print('y_train_shape:', y_train.shape)  # (19280,)
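
The sample counts follow directly from the index arithmetic: each split loses history_size rows at its start because the first window has to be filled before the first label. A short sketch (the helper name window_count is hypothetical):

```python
def window_count(start_index, end_index, history_size):
    """Number of (window, label) pairs when labels run from
    start_index + history_size up to, but not including, end_index."""
    first_label = start_index + history_size
    return max(end_index - first_label, 0)

history_size = 720
splits = {"train": (0, 20000), "val": (20000, 23000), "test": (23000, 25000)}
counts = {name: window_count(s, e, history_size) for name, (s, e) in splits.items()}
print(counts)  # {'train': 19280, 'val': 2280, 'test': 1280}
```

The training count matches the (19280, 720, 10) shape printed above. The first validation window starts at row 20000, so validation features do not overlap the training rows.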

6. Construct the dataset

Convert the feature and label arrays to tensors, randomly shuffle() the training samples, and group them into batches of 128 samples per training step. Use iter() to create an iterator and next() to take one batch from the dataset. Each label in y_train is the single temperature value that corresponds to the 720 rows of features in one sliding window.

train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))  # training set
train_ds = train_ds.shuffle(10000).batch(128)  # shuffle the samples, then take 128 per step

val_ds = tf.data.Dataset.from_tensor_slices((x_val, y_val))  # validation set
val_ds = val_ds.batch(128)

test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test))  # test set
test_ds = test_ds.batch(128)

# Inspect the dataset
sample = next(iter(train_ds))  # take one batch
print('x_train.shape:', sample[0].shape)  # [128, 720, 10]
print('y_train.shape:', sample[1].shape)  # [128, ]
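
To make the batching arithmetic concrete, here is a plain NumPy sketch that shuffles sample indices and then cuts them into batches of 128. It only imitates the idea; it is not how tf.data is implemented, and the function name shuffled_batches is hypothetical.

```python
import math
import numpy as np

def shuffled_batches(n_samples, batch_size, rng):
    """Shuffle the sample indices once, then split them into consecutive batches."""
    order = rng.permutation(n_samples)
    return [order[s:s + batch_size] for s in range(0, n_samples, batch_size)]

n_samples, batch_size = 19280, 128
batches = shuffled_batches(n_samples, batch_size, np.random.default_rng(0))
steps_per_epoch = math.ceil(n_samples / batch_size)
print(len(batches), steps_per_epoch, len(batches[-1]))  # 151 151 80
```

The 151 steps agree with the 151/151 progress counter in the training log below. One difference worth knowing: the sketch draws a full permutation, whereas shuffle(10000) in tf.data samples from a buffer of 10,000 elements, so with 19,280 samples its shuffle is only approximate.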

7. Model building

Next, build a custom LSTM network; the exact architecture is up to you. Note that the layers.LSTM() layer has a return_sequences parameter, which controls whether the layer returns only the last output of the sequence or the output at every time step. It defaults to False. Set return_sequences=True when the next layer is also an LSTM, because that layer needs the full sequence as its input.
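
To illustrate what return_sequences changes, here is a minimal NumPy sketch of an LSTM forward pass. It is a simplified illustration with random weights and no dropout or regularization, not the Keras implementation; the function lstm_forward and all weight names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, W, U, b, return_sequences=False):
    """x: (timesteps, input_dim), W: (input_dim, 4*units),
    U: (units, 4*units), b: (4*units,)."""
    units = U.shape[0]
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state
    outputs = []
    for x_t in x:
        z = x_t @ W + h @ U + b
        i, f, g, o = np.split(z, 4)   # input gate, forget gate, candidate, output gate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    # All time steps, or only the last one
    return np.stack(outputs) if return_sequences else h

rng = np.random.default_rng(0)
timesteps, input_dim, units = 720, 10, 16
x = rng.normal(size=(timesteps, input_dim))
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

seq = lstm_forward(x, W, U, b, return_sequences=True)
last = lstm_forward(x, W, U, b, return_sequences=False)
print(seq.shape, last.shape)  # (720, 16) (16,)
```

The last row of the full sequence equals the single vector returned with return_sequences=False, which is why only the final LSTM layer before a Dense layer can leave the parameter at its default.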

inputs_shape = sample[0].shape[1:]  # [720,10]; the batch dimension is not included
inputs = keras.Input(shape=inputs_shape)  # input layer

# LSTM layers with L2 regularization
x = layers.LSTM(units=16, dropout=0.5, return_sequences=True, kernel_regularizer=tf.keras.regularizers.l2(0.01))(inputs)
x = layers.LeakyReLU()(x)
x = layers.LSTM(units=32, dropout=0.5, kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
x = layers.LeakyReLU()(x)
# Fully connected layer with random-normal weight initialization and L2 regularization
x = layers.Dense(64,kernel_initializer='random_normal',kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
x = layers.Dropout(0.5)(x)
# Output layer: the regression output is the temperature at a future time point
outputs = layers.Dense(1)(x)  # the output shape must match the label shape

# Build the model
model = keras.Model(inputs, outputs)

# Print the network structure
model.summary()

The network structure is as follows

Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 720, 10)]         0         
lstm_7 (LSTM)                (None, 720, 16)           1728      
leaky_re_lu_7 (LeakyReLU)    (None, 720, 16)           0         
lstm_8 (LSTM)                (None, 32)                6272      
leaky_re_lu_8 (LeakyReLU)    (None, 32)                0         
dense_4 (Dense)              (None, 64)                2112      
dropout_2 (Dropout)          (None, 64)                0         
dense_5 (Dense)              (None, 1)                 65        
Total params: 10,177
Trainable params: 10,177
Non-trainable params: 0
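
The parameter counts in the summary can be reproduced by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias; a Dense layer has one kernel and one bias. The helper names below are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates x (input kernel + recurrent kernel + bias)
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # kernel + bias
    return input_dim * units + units

per_layer = [
    lstm_params(10, 16),   # first LSTM: 10 input features -> 16 units
    lstm_params(16, 32),   # second LSTM: 16 -> 32 units
    dense_params(32, 64),  # Dense(64)
    dense_params(64, 1),   # Dense(1) output
]
print(per_layer, sum(per_layer))  # [1728, 6272, 2112, 65] 10177
```

The LeakyReLU and Dropout layers have no weights, so they contribute 0 parameters.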

8. Network training

Mean absolute error is used as the regression loss function. After training, call .evaluate() to compute the loss on the entire test set.

# Compile the network
model.compile(optimizer = keras.optimizers.Adam(0.001),  # Adam optimizer, learning rate 0.001
              loss = tf.keras.losses.MeanAbsoluteError())  # mean of the absolute differences between labels and predictions
epochs = 15  # number of training epochs

# Train the network
history = model.fit(train_ds, epochs=epochs, validation_data=val_ds)

# Evaluate on the test set
model.evaluate(test_ds)  # loss: 0.1212

The training process is as follows:

Epoch 1/15
151/151 [==============================] - 11s 60ms/step - loss: 0.8529 - val_loss: 0.4423
Epoch 2/15
151/151 [==============================] - 9s 56ms/step - loss: 0.3999 - val_loss: 0.2660
Epoch 14/15
151/151 [==============================] - 9s 56ms/step - loss: 0.1879 - val_loss: 0.1442
Epoch 15/15
151/151 [==============================] - 9s 56ms/step - loss: 0.1831 - val_loss: 0.1254
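
For reference, the loss reported above is the mean absolute error, which can be written in a few lines of NumPy. This is an illustrative sketch with made-up numbers; the function name and the sample values are hypothetical.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Flatten both arrays first: labels have shape (n,), predictions have shape (n, 1)
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean(np.abs(y_true - y_pred)))

y_true_demo = np.array([0.2, -0.1, 0.4])        # shape (3,)
y_pred_demo = np.array([[0.1], [0.1], [0.4]])   # shape (3, 1)

print(mean_absolute_error(y_true_demo, y_pred_demo))
print((y_true_demo - y_pred_demo).shape)  # (3, 3): without ravel(), broadcasting pairs every label with every prediction
```

Because the labels are standardized, the loss is measured in units of the training-set standard deviation of the temperature, not in degrees.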

9. Visualization of the training process

All metrics from the training process are saved in history. Only the mean absolute error loss is tracked here, so plot how the loss changes with each epoch.

history_dict = history.history  # dictionary of training metrics
train_loss = history_dict['loss']  # training loss
val_loss = history_dict['val_loss']  # validation loss

plt.plot(range(epochs), train_loss, label='train_loss')  # training loss
plt.plot(range(epochs), val_loss, label='val_loss')  # validation loss
plt.legend()  # show the legend

10. Prediction Phase

To keep the plot readable, only the first 200 samples of the test set are predicted (each sample has 720 rows and 10 columns; 720 is the sliding-window size and 10 is the number of feature columns). The .predict() function returns the predicted temperature at the next time point for each sample.

# x_test[0].shape = (720,10)
x_predict = x_test[:200]  # first 200 samples of the test set
y_true = y_test[:200]  # label for each sample

y_predict = model.predict(x_predict)  # predict from the test features

# Plot the standardized temperature values
fig = plt.figure(figsize=(10,5))  # figure size
axes = fig.add_subplot(111)  # add one subplot to the figure
# Actual values, blue dots
axes.plot(y_true, 'bo', label='actual')
# Predicted values, red dots
axes.plot(y_predict, 'ro', label='predict')
plt.legend()  # legend
plt.grid()  # grid
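
The plotted values are standardized temperatures. If you want degrees Celsius instead, the standardization can be inverted with the training-set mean and standard deviation of the temperature column. The sketch below uses made-up statistics; in this tutorial they would presumably come from feat_mean.iloc[1] and feat_std.iloc[1], and the function name to_celsius is hypothetical.

```python
import numpy as np

def to_celsius(scaled, temp_mean, temp_std):
    """Invert z = (t - mean) / std for the temperature column."""
    return np.asarray(scaled, dtype=float) * temp_std + temp_mean

# Hypothetical training-set statistics of the temperature column
temp_mean, temp_std = 9.0, 8.5

# model.predict() returns shape (n, 1); ravel() flattens it to (n,)
y_scaled = np.array([[-0.5], [0.0], [1.0]])
y_celsius = to_celsius(y_scaled, temp_mean, temp_std).ravel()
print(y_celsius)  # [ 4.75  9.   17.5 ]
```

The same conversion applied to y_true puts both curves on a degree scale, and multiplying a standardized MAE by temp_std gives the error in degrees.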

The comparison between the predicted value and the actual value is as follows


Origin blog.csdn.net/dgvv4/article/details/124379152