2022 Teddy Cup Problem B analysis (LSTM neural network, ARIMA time series model), for learning reference

This article simply records the modeling competition that two teammates and I took part in, as a keepsake.

For the complete paper and code, please like, follow, and bookmark this post, then send the author a private message.

Power system load forecasting is an important problem with many influencing factors. In this paper we build an LSTM load forecasting model and an ARIMA(p,d,q) forecasting model, combining a deep learning algorithm with a statistical method, to produce system load forecasts and compare their accuracy with that of the traditional forecasting model. We also detect and analyze sudden changes (mutations) in the load data. This work helps improve the accuracy of power system forecasting and the efficiency and stability of grid operation.

For the first sub-question of Problem 1, we use a long short-term memory (LSTM) neural network to build a load forecasting model and, from the historical load data, predict 960 values at 15-minute intervals for the next ten days, with a prediction accuracy of 0.0001309. Using the SPSS Expert Modeler, we also build an ARIMA(0,1,12)(1,1,1) model, predict the same 960 values at 15-minute intervals for the regional grid, and obtain the corresponding fit statistics. For the second sub-question of Problem 1, we predict with the LSTM model built for the first sub-question and extract the daily maximum and minimum loads from the complete predicted series; the prediction accuracy is 0.000237.

For the first sub-question of Problem 2, we introduce the Mann-Kendall test, which meteorology uses to detect sudden changes in climate data, and compute the change points and change magnitudes of the historical electricity loads of large industry, non-general industry, general industry, and commerce. We then analyze how the changes relate to weather and social factors.

For the second sub-question of Problem 2, starting from the LSTM model built for the first sub-question of Problem 1, we convert the meteorological data to a one-hot encoding and combine it with the original historical data into a two-dimensional input for prediction; the corresponding prediction accuracy is 0.000029. ARIMA models predict the daily maximum and minimum loads of the four industries for the next three months; the models are, respectively, ARIMA(0,1,14), ARIMA(0,1,14), ARIMA(0,0,9), ARIMA(0,0,1), ARIMA(0,0,14), ARIMA(1,0,14), ARIMA(0,1,8), and ARIMA(0,1,0). For the third sub-question of Problem 2, the LSTM and ARIMA predictions show that the daily maximum and minimum loads of large industry and general industry fluctuate considerably, although their development is relatively stable compared with the original data, while commerce shows a decreasing daily load. We recommend that each industry replace traditional energy with new energy as appropriate, control carbon emissions, and complete its green transformation.

Keywords: LSTM, ARIMA, power load forecasting, mutation test, "dual carbon" target

1. Restatement of the problem

Power load refers to the sum of the electrical power provided by the power system used by the electrical equipment of the power user at a certain time. Accurate load forecasting can maintain the safety and stability of power grid operation and ensure normal social production and life. Predicting power system load needs to fully consider historical data, economic conditions, meteorological conditions and other factors. According to the forecast time span, it can be divided into short-term, medium-term and long-term forecast, which is of great significance to the power grid, enterprise production, and social life. The diversification of the load structure of the power system reduces the effect of the traditional load forecasting model to a certain extent. It is necessary to further improve the relevant models and algorithms to improve the accuracy of the forecast. Build a model to solve the following load forecasting problem:

1. According to the power system load data of a certain region's power grid from January 1, 2018 to August 31, 2021 at 15-minute intervals, establish an appropriate short-term load forecasting model:

(1) Obtain the forecast results of the power grid in the region at intervals of 15 minutes for the next 10 days, and analyze the forecast accuracy of the model or algorithm;

(2) Obtain the prediction results of the maximum and minimum daily load of the power grid in the region in the next 3 months, and the time corresponding to the maximum value, and analyze its prediction accuracy.

2. Establish appropriate medium-term forecasting models based on the maximum and minimum daily load data for each of the four industries in the region from January 1, 2019 to August 31, 2021:

(1) Analyze and point out the time, magnitude and possible reasons for the sudden change of electricity load in the above four industries;

(2) Obtain the forecast results of the maximum and minimum daily load of each industry in the next 3 months, and analyze the forecast accuracy;

(3) Starting from the actual situation of each industry, study the possible impact of the national "dual carbon" target on the electricity load of various industries in the future, and put forward corresponding suggestions for related industries.

2.1 Analysis of Problem 1

Problem 1 requires building short- and medium-term load forecasting models from the regional grid's load data, sampled at 15-minute intervals over a period of time. Because the forecasts are short- and medium-term, we first consider the traditional approach, time series forecasting. For the intelligent algorithm, the data is a time series with a very large sample size, so to avoid the vanishing or exploding gradient problem we use a long short-term memory (LSTM) neural network for prediction and compare its results with those of the time series method. To compare effectiveness, we calculate the MAE (mean absolute error) of all forecasts as a measure of their accuracy.

The first sub-question asks us to use the model to predict the grid load for the next 10 days at 15-minute intervals and to analyze the prediction accuracy. Since the input is a one-dimensional sequence, the feature dimension is set to 1, the time step is taken to be the total number of samples divided by the number of predictions, and the machine learning loss function is used as the prediction accuracy. The second sub-question asks for the daily maximum and minimum loads over the next 3 months; the principle is the same as in the first sub-question. We first forecast the complete series for 3 months and then extract the daily maximum and minimum loads from it.
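The two bookkeeping steps in this analysis, choosing the time step and extracting daily extremes from a complete forecast, can be sketched as follows. This is only an illustrative sketch: the function name `daily_extremes` and the synthetic two-day forecast are assumptions, and the sample count 128544 is taken from the plotting range in the code listing later in this post.

```python
import numpy as np

POINTS_PER_DAY = 96  # 24 hours * 4 samples per hour at 15-minute intervals

# Time step = total samples / number of predictions (assumed reading of the text)
n_samples, n_predictions = 128544, 960
time_step = round(n_samples / n_predictions)


def daily_extremes(series, points_per_day=POINTS_PER_DAY):
    """Return the per-day maximum, minimum, and time of day of each maximum."""
    values = np.asarray(series, dtype=float)
    if len(values) % points_per_day != 0:
        raise ValueError("series must cover whole days")
    days = values.reshape(-1, points_per_day)  # one row per day
    max_idx = days.argmax(axis=1)              # slot index of each daily maximum
    # Convert a 15-minute slot index into an "HH:MM" label
    max_time = [f"{int(i) * 15 // 60:02d}:{int(i) * 15 % 60:02d}" for i in max_idx]
    return days.max(axis=1), days.min(axis=1), max_time


# Hypothetical two-day forecast: a daily sine-shaped curve plus a small trend
t = np.arange(2 * POINTS_PER_DAY)
forecast = 1000 + 200 * np.sin(2 * np.pi * t / POINTS_PER_DAY) + 0.1 * t
day_max, day_min, day_max_time = daily_extremes(forecast)
print(time_step, day_max, day_min, day_max_time)
```

Reshaping to one row per day keeps the extraction independent of the model that produced the forecast, so the same helper would apply to both the LSTM and the ARIMA output.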

2.2 Analysis of Problem 2

The first sub-question asks us to analyze the time, magnitude, and possible causes of sudden changes in electricity consumption, based on the historical daily maximum and minimum loads of large industry, non-general industry, general industry, and commerce. We consider the sliding t-test and the Mann-Kendall test, which meteorology uses for testing long-term data. The second sub-question asks for the daily maximum and minimum loads of each industry over the next three months. Following the method and model of Problem 1, we build ARIMA models in SPSS to predict the results for the four industries and obtain the fit parameters. We also reuse the LSTM model of Problem 1: we one-hot encode the meteorological data, feed the extracted features in as a new input dimension, and assess the influence of meteorological factors on load by comparing with the predictions made without them. For the third sub-question, in the context of the national "dual carbon" policy goal, we draw on the electricity consumption characteristics of each industry and on the results of the previous questions to discuss the policy's possible impact on each industry's electricity consumption and to offer suggestions.
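As an illustration of the sliding t-test mentioned above, the sketch below compares the mean of a window before each point with the mean of a window after it, using the pooled-variance two-sample t statistic. The window lengths, the function name `sliding_t`, and the synthetic series are assumptions for illustration, not the settings used in the competition.

```python
import math


def sliding_t(series, n1=10, n2=10):
    """Return {index: t} comparing the n1 values before each index
    with the n2 values starting at that index."""
    stats = {}
    for i in range(n1, len(series) - n2 + 1):
        before, after = series[i - n1:i], series[i:i + n2]
        m1, m2 = sum(before) / n1, sum(after) / n2
        ss1 = sum((v - m1) ** 2 for v in before)
        ss2 = sum((v - m2) ** 2 for v in after)
        pooled_sd = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
        denom = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
        if denom == 0:
            # Both windows are constant: no spread to scale the difference by
            stats[i] = 0.0 if m1 == m2 else math.copysign(math.inf, m1 - m2)
        else:
            stats[i] = (m1 - m2) / denom
    return stats


# Hypothetical series with a level shift at index 30
series = [100 + (i % 3) for i in range(30)] + [120 + (i % 3) for i in range(30)]
T_CRIT = 2.101  # two-sided critical value for alpha = 0.05 with 18 degrees of freedom
t_stats = sliding_t(series)
flagged = [i for i, t in t_stats.items() if abs(t) > T_CRIT]
strongest = max(t_stats, key=lambda i: abs(t_stats[i]))
print("strongest change at index", strongest)
```

Points where |t| exceeds the critical value are candidate change points; the result depends on the chosen window lengths, so it is usually worth trying several.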

3. Model Assumptions

1. The model and parameters obtained with the SPSS Expert Modeler are assumed to be those with the highest prediction accuracy and the best fit among the time series models;

2. The time series analysis model is assumed to be little disturbed by special factors;

3. Only the impact of historical load and meteorological conditions on the power system load forecast is considered;

4. The sliding t-test for the difference between the means of two subseries uses a significance level of α=0.05.

5.1 Establishment and solution of the first problem model

5.1.1 Establishment and solution of the first sub-question model

(1) ARIMA model. For the first sub-question we build a forecasting model for the power load time series. The SPSS Expert Modeler selects, among many candidate models, the mathematical model that describes the dependent variable most accurately, that is, the regularity of the load changes every 15 minutes over the ten days following August 31, 2021. The load forecasting model for this question is then determined from that mathematical model and used to obtain the future load forecasts [1].

The ARIMA model (autoregressive integrated moving average model) combines a p-order autoregressive model (AR(p)) with a q-order moving average process (MA(q)), and uses d to denote the number of differences applied to make the series stationary. To ensure forecast accuracy, we use the 1536 data points from 0:00 on August 16, 2021 to 23:45 on August 31, 2021 to build the time series forecasting model. To draw the time series plot and build the model, we run the following syntax in SPSS, which treats each day of the historical data as one cycle divided into 96 time periods:

DATE OBS 1 96 (1)

The time series plot of the above historical data is as follows:

 

The time series plot after seasonal decomposition is as follows:

Using the SPSS Expert Modeler to detect outliers and build a traditional model, the optimal model type is ARIMA(0,1,12)(1,1,1). Part of the forecast data for the next ten days is as follows (see Appendix 2 for the full data):

 

ARIMA(0,1,12)(1,1,1) denotes a time series forecasting model with non-seasonal order-0 autoregression, non-seasonal first-order differencing, a non-seasonal order-12 moving average, seasonal first-order autoregression, seasonal first-order differencing, and a seasonal first-order moving average. The predicted and fitted curves are as follows:
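To make the two differencing orders concrete, here is a small sketch of what the non-seasonal first difference (d = 1) and the seasonal first difference (D = 1) do to a series. The season length of 96 follows the one-day cycle described above; the helper `difference` and the synthetic load series are assumptions for illustration.

```python
import numpy as np

SEASON = 96  # one day of 15-minute samples


def difference(x, lag=1):
    """Return x[t] - x[t - lag]; the result is `lag` samples shorter."""
    x = np.asarray(x, dtype=float)
    return x[lag:] - x[:-lag]


# Hypothetical load: a linear trend plus a repeating daily pattern
t = np.arange(4 * SEASON)
load = 5000 + 2.0 * t + 300 * np.sin(2 * np.pi * t / SEASON)

d1 = difference(load, lag=1)        # non-seasonal first difference (d = 1)
d1_D1 = difference(d1, lag=SEASON)  # seasonal first difference (D = 1)
print(len(load), len(d1), len(d1_D1))
print(np.abs(d1_D1).max())          # close to 0: trend and daily cycle are removed
```

On this idealized series the two differences remove the trend and the daily cycle entirely; on real load data they leave a roughly stationary remainder for the AR and MA terms to model.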

 

The figure shows that the model's fitted curve closely coincides with the observed curve, so the fit is very good. The fit statistics of the ARIMA(0,1,12)(1,1,1) model built for this question are shown in the following table:

 

The prediction accuracy can be read from the statistics above. The stationary R-squared and R-squared are 0.787 and 0.998, respectively, which are close to 1. The root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) are 1227.667, 0.419, and 950.705, respectively, indicating that the dependent variable series differs little from the model's predicted level. The maximum absolute percentage error (MaxAPE) and maximum absolute error (MaxAE) are 2.400 and 4696.848, respectively, and represent the largest prediction errors. The normalized BIC is only 14.337, indicating that this optimal model fits well.
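For readers who want to reproduce statistics of this kind outside SPSS, the sketch below computes the error measures from observed and fitted values using their textbook definitions. SPSS may apply degrees-of-freedom adjustments, so its reported values need not match these formulas exactly; the function name `fit_metrics` and the sample numbers are assumptions.

```python
import math


def fit_metrics(observed, fitted):
    """Return RMSE, MAE, MaxAE and the percentage errors MAPE, MaxAPE."""
    errors = [o - f for o, f in zip(observed, fitted)]
    abs_err = [abs(e) for e in errors]
    # Percentage errors are relative to the observed value (must be non-zero)
    pct_err = [100 * a / abs(o) for a, o in zip(abs_err, observed)]
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs_err) / n,
        "MaxAE": max(abs_err),
        "MAPE": sum(pct_err) / n,
        "MaxAPE": max(pct_err),
    }


# Hypothetical observed and fitted loads
metrics = fit_metrics([100, 200, 400], [110, 190, 400])
print(metrics)
```

RMSE, MAE, and MaxAE are in the units of the load itself, while MAPE and MaxAPE are percentages, which is why the two groups differ so much in magnitude in the table.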

(2) LSTM model. Long short-term memory (LSTM) is essentially a special recurrent neural network (RNN). On top of the basic RNN structure, LSTM adds three logic control units, the input gate, the output gate, and the forget gate, each connected to a multiplicative element (see Figure 1). By setting the weights on the connections between the memory cell and the other parts, it controls the input and output of the information flow and the state of the memory cell. This lets it select and retain important information, filter out noise, solve the problem of RNNs forgetting their initial inputs, and make effective use of long-range, fluctuating time series information for prediction [5]. Its topology is shown in the figure below.
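The gate mechanism described above can be sketched as a single LSTM time step in NumPy. This follows the standard LSTM equations rather than any particular library's internals; the weight shapes, the gate ordering inside the stacked matrices, and the random initial weights are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * H:3 * H])  # candidate cell content
    o = sigmoid(z[3 * H:4 * H])  # output gate: how much of the cell to expose
    c = f * c_prev + i * g       # updated memory cell
    h = o * np.tanh(c)           # updated hidden state
    return h, c


rng = np.random.default_rng(0)
D, H = 1, 20                     # one input feature, 20 hidden units
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for value in [0.1, 0.3, -0.2]:   # hypothetical normalized load values
    h, c = lstm_step(np.array([value]), h, c, W, U, b)
print(h.shape, c.shape)
```

Because the cell state is updated additively (`f * c_prev + i * g`), information can be carried across many steps, which is what lets the network retain early inputs.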

 

 

First, note that the data in Annex 1 contains missing values. After filling them with the resample method, we plot the time series to see the general trend.

Next we define the LSTM model. Since the data in Annex 1 is one-dimensional, the input feature dimension (input size) is set to 1, and likewise the output feature dimension (output size) is 1. The output format of the LSTM model is (sequence_length, batch_size, hidden_size); the sequence length is set to 134 as described above, the batch size is tentatively set to 1, and the hidden layer feature dimension (hidden size) is set to 20. Other parameters, such as the number of hidden layers and the bias, keep the default values from the official LSTM example. Finally, we choose MSE as the loss function and Adam as the optimizer. For training, we temporarily set the number of training epochs to 4 and plot the training process:

 

As the training process plot shows, the training loss decreases as training proceeds, and the parameters inside the LSTM units are continually updated and optimized. After the model passes testing (explained in detail in "6. Analysis and testing of the model"), we use the entire sample set as model input to predict the 960 load values at 15-minute intervals over the next 10 days, as the question requires. The basic idea is to use the 1st to 134th values in the sample set to predict the first value, then the 2nd to 135th values to predict the second value, and so on until 960 values have been predicted.

The prediction results are visualized as follows

 

 Part of the prediction code is as follows

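import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Load the load series after the missing values have been filled
flight_data = pd.read_csv(r"填补缺失值.csv")
plt.rcParams["figure.figsize"] = (15, 5)
plt.title('power vs day')
plt.ylabel('power')
plt.xlabel('day')
plt.grid(True)
plt.autoscale(axis='x', tight=True)
plt.plot(flight_data['power'])
plt.show()

# Extract the load values
all_data = flight_data['power'].values.astype(float)
print(all_data)

# Split into training data and test data (the last 960 points)
test_data_size = 960
train_data = all_data[:-test_data_size]
test_data = all_data[-test_data_size:]

# The training values differ widely in magnitude, so rescale them with min/max scaling.
# Fit the scaler on the training data only, so that no information leaks into the test data.
scaler = MinMaxScaler(feature_range=(-1, 1))
train_data_normalized = scaler.fit_transform(train_data.reshape(-1, 1))
print(train_data_normalized)

# Convert the data to a tensor
train_data_normalized = torch.FloatTensor(train_data_normalized).view(-1)


def create_inout_sequences(input_data, tw):
    # Build (window of tw values, next value) training pairs
    inout_seq = []
    L = len(input_data)
    for i in range(L - tw):
        train_seq = input_data[i:i + tw]
        train_label = input_data[i + tw:i + tw + 1]
        inout_seq.append((train_seq, train_label))
    return inout_seq


train_window = 5
train_inout_seq = create_inout_sequences(train_data_normalized, train_window)


# Define the LSTM model
class LSTM(nn.Module):
    def __init__(self, input_size=1, hidden_size=55, output_size=1):
        super().__init__()
        self.hidden_size = hidden_size
        # LSTM layer
        self.lstm = nn.LSTM(input_size, hidden_size)
        # Linear layer applied to the LSTM output
        self.linear = nn.Linear(hidden_size, output_size)
        # Initial hidden state and cell state
        self.hidden = (torch.zeros(1, 1, self.hidden_size),
                       torch.zeros(1, 1, self.hidden_size))

    # The forward pass is omitted in the original listing. This is a reconstruction
    # (an assumption) consistent with the surviving `return predictions[-1]`
    # and the `model.hidden` resets below.
    def forward(self, input_seq):
        lstm_out, self.hidden = self.lstm(
            input_seq.view(len(input_seq), 1, -1), self.hidden)
        predictions = self.linear(lstm_out.view(len(input_seq), -1))
        return predictions[-1]


# Instantiate the model and define the loss function and optimizer
model = LSTM()
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
print(model)

epochs = 1

# The loop header is omitted in the original listing. It is reconstructed here
# (an assumption) around the surviving `optimizer.step()` and loss printout.
for i in range(epochs):
    for seq, labels in train_inout_seq:
        optimizer.zero_grad()
        # Reset the hidden state before each training sequence
        model.hidden = (torch.zeros(1, 1, model.hidden_size),
                        torch.zeros(1, 1, model.hidden_size))
        y_pred = model(seq)
        single_loss = loss_function(y_pred, labels)
        single_loss.backward()
        optimizer.step()

    print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')

# Forecast starting from the last train_window values of the training data
fut_pred = 960
test_inputs = train_data_normalized[-train_window:].tolist()
print(test_inputs)

model.eval()
# Predict one step from the latest window, append the prediction,
# and feed it back in to predict the next step
for i in range(fut_pred):
    seq = torch.FloatTensor(test_inputs[-train_window:])
    # Disable gradient tracking during evaluation
    with torch.no_grad():
        model.hidden = (torch.zeros(1, 1, model.hidden_size),
                        torch.zeros(1, 1, model.hidden_size))
        test_inputs.append(model(seq).item())

# Map the normalized predictions back to the original load scale
actual_predictions = scaler.inverse_transform(
    np.array(test_inputs[train_window:]).reshape(-1, 1))
print(actual_predictions)

# Plot the 960 predictions over the last 960 sample positions to compare with the actual data
x = np.arange(127584, 128544, 1)
plt.title('power vs day')
plt.ylabel('power')
plt.grid(True)
plt.autoscale(axis='x', tight=True)
plt.plot(flight_data['power'])
plt.plot(x, actual_predictions)
plt.show()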
 
 

 

5.2 Establishment and solution of the second problem model

5.2.1 Establishment and solution of the first sub-question model of problem 2

The problem requires mutation (change-point) detection on the load data. We introduce the Mann-Kendall test [6], a nonparametric statistical test commonly used in meteorology to test interdecadal changes, to detect sudden changes in the load data. Plotting the curves of the two statistics together with the two ± significance boundary lines on one graph gives the MK mutation test plot:

 

The sudden changes in the maximum daily load of large industry occur at positions [347, 471, 656], with a magnitude of 20000-30000. The corresponding values, dates, and weather are 116067.43 (December 10, 2019, sunny), 102863.9 (April 12, 2020, sunny), and 129905.64 (October 14, 2020, light rain). The changes may be mainly related to the weather: when the weather is good, less electricity is used, and vice versa.

The sudden changes in the minimum daily load of large industry occur at positions [31, 44, 351, 469, 708, 787, 845, 846, 919], with a magnitude of about 90000-100000. The sudden changes in the minimum at the end of January and the beginning of February may be related to the Spring Festival holiday.

The sudden changes in the maximum daily load of non-general industry occur at positions [181, 441, 532]. The corresponding values, dates, and weather are 1982.0484 (April 3, 2020, light rain), 2539.9632 (December 23, 2020, showers and cloudy), and 1871.3442 (March 22, 2021, cloudy).
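Change-point positions of this kind come from the intersections of the two Mann-Kendall statistic curves. Below is a minimal sketch of the sequential Mann-Kendall mutation test as it is commonly described in the climate statistics literature, with the forward statistic UF and the backward statistic UB. The function names and the synthetic series are assumptions, and the exact procedure used in the competition may differ.

```python
import numpy as np


def mk_forward(x):
    """Forward sequential Mann-Kendall statistic UF, with UF[0] = 0."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    uf = np.zeros(n)
    s = 0.0
    for k in range(1, n):
        s += np.sum(x[k] > x[:k])  # count earlier values smaller than x[k]
        m = k + 1                  # number of points used so far
        mean = m * (m - 1) / 4.0
        var = m * (m - 1) * (2 * m + 5) / 72.0
        uf[k] = (s - mean) / np.sqrt(var)
    return uf


def mk_mutation(x, z_crit=1.96):
    """Return UF, UB, and the indices where UF and UB cross inside +/- z_crit."""
    x = np.asarray(x, dtype=float)
    uf = mk_forward(x)
    ub = -mk_forward(x[::-1])[::-1]  # same statistic on the reversed series
    diff = uf - ub
    crossings = [k for k in range(1, len(diff))
                 if diff[k - 1] * diff[k] < 0 and abs(uf[k]) < z_crit]
    return uf, ub, crossings


# Hypothetical daily loads with a level shift after 60 days
rng = np.random.default_rng(0)
load = np.concatenate([rng.normal(100, 5, 60), rng.normal(130, 5, 60)])
uf, ub, crossings = mk_mutation(load)
print("candidate change points:", crossings)
```

A crossing of UF and UB between the two significance lines (±1.96 at α=0.05) is read as the start of a sudden change; crossings outside the lines are not counted.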

For the results of the other industries, please like, follow, and send the author a private message.

5.2.2 Establishment and solution of the second sub-question model of problem 2

1. ARIMA model. For the second sub-question, the principle and process of model building are similar to those of the first sub-question of Problem 1. We use the SPSS Expert Modeler to build models for the four industries separately, obtain the daily maximum and minimum loads of the four industries for the next 3 months from the SPSS results, and analyze the forecasts and their accuracy using the statistics and coefficients that SPSS reports. To ensure forecast accuracy, we use the 243 data points from January 1, 2021 to August 31, 2021 to build the time series forecasting models. Missing values are replaced before the models are built: the daily load data of all four industries for January 26, 2021 are missing, so they are replaced with the series mean or the linear trend at the adjacent points.

(1) The time series diagram of the maximum and minimum daily load of large industrial electricity consumption is as follows:

 

 

The SPSS Expert Modeler finds that the optimal model for predicting the maximum daily load of large industry is ARIMA(0,1,14), a time series forecasting model with order-0 autoregression, first-order differencing, and an order-14 moving average. The optimal model for the minimum daily load of large industry is also ARIMA(0,1,14) [3][4].

Part of the forecast values of the maximum and minimum daily loads of large industry for the next 3 months is as follows:

The fitting and prediction diagrams of the above two types of daily loads are as follows:

 

 

The two figures above show that both models fit very well. The fit statistics of the ARIMA(0,1,14) model for the maximum daily load of large industry over the next three months and of the ARIMA(0,1,14) model for the minimum are shown in the following two tables:

Table 5 ARIMA(0,1,14) model fitting statistics for the prediction of the maximum daily load of large industrial electricity

 

The prediction accuracy can be read from the statistics above. For the ARIMA(0,1,14) model predicting the maximum daily load of large industry over the next three months, the stationary R-squared and R-squared are 0.560 and 0.961, respectively, and the R-squared is relatively close to 1. The root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) are 5187.752, 4.279, and 3997.133, respectively, indicating that the difference between the dependent variable series and the model's predicted level is still within an acceptable range. The maximum absolute percentage error (MaxAPE) and maximum absolute error (MaxAE) are 27.494 and 14321.564, respectively, and represent the largest prediction errors. The normalized BIC is only 17.358, indicating that the optimal model fits well.

For the ARIMA(0,1,14) model predicting the minimum daily load of large industry over the next three months, the stationary R-squared and R-squared are 0.912 and 0.960, respectively, both relatively close to 1. The RMSE, MAPE, and MAE are 5343.336, 10.016, and 3945.040, respectively, indicating that the difference between the dependent variable series and the model's predicted level is still within an acceptable range. The MaxAPE and MaxAE are 287.526 and 15679.873, respectively, and represent the largest prediction errors. The normalized BIC is only 17.666, indicating that the optimal model fits well.

For the results of the other industries, please like, follow, and send the author a private message.

2. LSTM model. As this question requires, the model of Problem 1 can be reused with a different training set. In addition to the historical data, we feed meteorological factors into the model, adding features to obtain more accurate and effective predictions. We also analyze the impact of meteorological conditions on the load predictions. The extraction of the meteorological features is described below; the remaining steps are the same as in Problem 1. The meteorological features available from Annex 3 are "weather conditions", "maximum temperature", "minimum temperature", "daytime wind [...]

These five meteorological features are one-hot encoded and fed in as a new input dimension. By comparing with the predictions made without meteorological factors, we analyze the impact of meteorological factors on the power load.
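The encoding step can be sketched as follows: each categorical weather label becomes a 0/1 indicator vector, which is then stacked next to the load value so that every time step carries several input features. The helper `one_hot`, the category labels, and the load values are hypothetical; the actual columns of Annex 3 are not reproduced here.

```python
import numpy as np


def one_hot(labels, categories=None):
    """Encode categorical labels as rows of 0/1 indicators."""
    cats = sorted(set(labels)) if categories is None else list(categories)
    index = {c: j for j, c in enumerate(cats)}
    encoded = np.zeros((len(labels), len(cats)))
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1.0
    return encoded, cats


# Hypothetical daily weather labels and normalized daily loads
weather = ["sunny", "light rain", "cloudy", "sunny"]
load = np.array([0.2, 0.5, 0.4, 0.1])

encoded, cats = one_hot(weather)
features = np.column_stack([load, encoded])  # one row per day: load + indicators
print(cats)
print(features.shape)
```

With an input like this, the LSTM's input feature dimension would equal the number of columns of `features` instead of 1.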

The predicted results are as follows:

5.2.3 Establishment and solution of the third sub-question model of the second question

For the third sub-question: the national "dual carbon" goal consists of the "carbon peak" and "carbon neutrality" targets proposed to achieve carbon emission reduction. "Carbon peak" means that carbon dioxide emissions reach a peak at some point, stop growing, and then gradually fall. "Carbon neutrality" means that the total carbon dioxide or greenhouse gas emissions produced directly or indirectly by a country, enterprise, product, activity, or individual within a certain period are offset through afforestation, energy saving, emission reduction, and similar measures, so that positive and negative emissions cancel and relative "zero emissions" is achieved. Both are crucial for building the "carbon peak" and "carbon neutrality" policy system.

The LSTM and ARIMA predictions show that the daily maximum and minimum loads of large industry and general industry fluctuate considerably, although their development is relatively stable compared with the original data. We therefore recommend that large industry pursue a green, low-carbon transformation, promoting the green development of traditional large industry and in turn driving economic and social progress [7]; general industry should conserve traditional fossil energy, improve energy efficiency, achieve "stock emission reduction", and respond to the "dual carbon" call [8]. The forecasts for non-general industry and commerce show a decreasing daily load trend in both. The data also suggest that the low-carbon transformation of commercial real estate has very broad strategic prospects, and we recommend enhancing the value of existing properties at the asset operation level [9]; non-general industry could actively develop the photovoltaic industry chain or use other new, non-polluting energy to increase production capacity and complete its low-carbon strategic transformation [10].

6. Analysis and testing of the model

1. ARIMA model for the first sub-question of Problem 1. The SPSS Expert Modeler produced the ARIMA(0,1,12)(1,1,1) model. After the non-seasonal and seasonal first-order differences, the time series is relatively stationary, and the autocorrelation function (ACF) and partial autocorrelation function (PACF) values at most lag orders are not significantly different from 0, which suggests that the residuals can be regarded as white noise. However, SPSS reports a significance of 0.000 for the Ljung-Box Q test, which is below α=0.05 (the 95% confidence level). The residuals therefore may not be pure white noise: the forecasting model has not fully captured the information in the original data, and there is still room for improvement.
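To show what the Ljung-Box Q statistic measures, the sketch below computes it directly from a residual series using the textbook formula Q = n(n+2) Σ r_k² / (n−k). The function name `ljung_box_q` and the synthetic "residuals" are assumptions; for residuals of a fitted ARIMA model, the degrees of freedom of the chi-square reference distribution are normally reduced by the number of estimated parameters, which this sketch does not do.

```python
import numpy as np


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q = n(n+2) * sum over k = 1..max_lag of r_k^2 / (n - k)."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = np.sum(e * e)
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = np.sum(e[k:] * e[:-k]) / denom  # lag-k sample autocorrelation
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q


CHI2_95_DF10 = 18.307  # 95% critical value of chi-square with 10 degrees of freedom

# Hypothetical, strongly autocorrelated "residuals" (a leftover periodic pattern)
t = np.arange(200)
leftover = np.sin(2 * np.pi * t / 20)
q = ljung_box_q(leftover, max_lag=10)
print(q > CHI2_95_DF10)  # a large Q rejects the white-noise hypothesis
```

A Q value above the critical value corresponds to a small significance such as the 0.000 reported by SPSS: some autocorrelation remains in the residuals that the model has not captured.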

 

Origin blog.csdn.net/jiebaoshayebuhui/article/details/127000716