Regression algorithm | The long short-term memory (LSTM) network and its optimized implementation

This article introduces the principle of the LSTM network and an optimized implementation of it.


Sequence data has a defining characteristic: without the past, there is no present. Such data is linked through time, with countless historical events chained together to form the current state. Time therefore builds dependency relationships between earlier and later events, known as temporal dependencies, and modeling these dependencies is the key to learning from sequence data.

In recent years, more and more neural network models have been used to predict sequence data such as stock prices, electric load, wind power, and ECG signals, and they have achieved good results.

Generally, neural network models can be divided into two categories:

One category is represented by the BP neural network. Networks of this type have a simple structure, but they are prone to problems such as getting trapped in local extrema and over-fitting, and they do not exploit temporal dependencies;

The other category consists of deeper and more capable deep neural network models such as CNN, RNN, and LSTM. These are relatively cutting-edge and effective prediction models that can fit the nonlinear, complex relationships between input variables. RNN and LSTM in particular overcome the lack of memory in traditional neural networks and can learn and predict effectively from historical information. Compared with the plain RNN, the LSTM avoids the vanishing and exploding gradients that the RNN suffers from on long sequences; it is the most popular RNN variant (LSTM is an improvement built on the RNN). LSTM has therefore been widely used for learning from sequence data.

LSTM still requires setting hyperparameters such as the number of hidden-layer neurons, the learning rate, and the number of training iterations, all of which affect its prediction accuracy. Using an optimization algorithm to tune these hyperparameters is more systematic and efficient than choosing them empirically, so this article describes the principle of the LSTM model and its optimized implementation in detail.

00 Contents

1 LSTM model principle

2 Overview of optimization algorithms and their improvements

3 GWO-LSTM prediction model

4 Experimental results

5 Source code acquisition

01 LSTM neural network model[1]

The long short-term memory (LSTM) network is an algorithm proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997 as an improvement on the recurrent neural network (RNN). It was designed to solve the vanishing gradient problem of RNNs and performs much better than the RNN on long-range dependency tasks. The LSTM works in basically the same way as the RNN, but it implements a more sophisticated internal processing unit to handle the storage and updating of contextual information.

Hochreiter et al. introduced memory cells and gating units to preserve historical information and long-term state, controlling the flow of information through gating logic. Graves et al. later improved the LSTM unit by introducing a forget gate, which allows the LSTM to learn continuous tasks and to reset its internal state.

LSTM is built around three gates (input, forget, and output). A gate can be regarded as a fully connected layer, and the LSTM stores and updates information through these gates. More specifically, each gate is implemented with the sigmoid function and element-wise multiplication.
[Figure: internal structure of the LSTM unit]

Here, i, f, and o denote the input gate, forget gate, and output gate respectively; ⊙ denotes element-wise multiplication; W and b denote the network's weight matrices and bias vectors. At time step t, the input and output vectors of the LSTM hidden layer are x_t and h_t, and the memory cell is c_t. The input gate controls how much of the current input x_t flows into the memory cell, i.e., how much of it is saved to c_t. Its value is:

$$i_t = \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right), \qquad \tilde{c}_t = \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right)$$

where $\tilde{c}_t$ is the candidate state computed from the current input and the previous hidden state.

The forget gate is a key component of the LSTM. It controls which information is retained and which is forgotten, and in doing so mitigates the vanishing and exploding gradients that arise when gradients are back-propagated through time. The forget gate determines which historical information is discarded, i.e., how much the memory cell c_{t-1} of the previous time step influences the current memory cell c_t:
$$f_t = \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

The output gate controls the influence of the memory cell c_t on the current output h_t, i.e., which part of the memory cell is output at time step t. The value of the output gate and the output of the hidden layer can be expressed as:
$$o_t = \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(c_t)$$
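
To make the gate equations concrete, below is a minimal NumPy sketch of a single LSTM cell step. The weight names follow the equations above, while the toy dimensions and random initialization are purely illustrative; this is not the implementation provided with the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations above.

    x_t:    input vector at time t, shape (n_in,)
    h_prev: hidden state h_{t-1},   shape (n_hid,)
    c_prev: memory cell c_{t-1},    shape (n_hid,)
    params: weight matrices W_* of shape (n_hid, n_in + n_hid) and biases b_* of shape (n_hid,)
    """
    z = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    c_hat = np.tanh(params["W_c"] @ z + params["b_c"])      # candidate state
    c_t = f_t * c_prev + i_t * c_hat                        # memory cell update (element-wise)
    h_t = o_t * np.tanh(c_t)                                # hidden-layer output
    return h_t, c_t

# Toy usage with random weights, purely for illustration
rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
params = {}
for g in ("i", "f", "o", "c"):
    params[f"W_{g}"] = rng.standard_normal((n_hid, n_in + n_hid)) * 0.1
    params[f"b_{g}"] = np.zeros(n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.standard_normal((4, n_in)):                    # a short input sequence
    h, c = lstm_cell_step(x, h, c, params)
print(h.shape, c.shape)
```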

02 Overview of optimization algorithms and their improvements

In previous articles, the author has introduced many optimization algorithms and their improved variants.

Here the grey wolf optimization algorithm is taken as an example; the same approach applies to other algorithms. Most of the author's code follows a standardized interface, so the algorithms from the other articles can easily be swapped in.

03 GWO-LSTM prediction model

Hyperparameters affect the fitting accuracy of the LSTM network to a certain extent, so it is necessary to obtain the hyperparameter values best suited to different feature data. However, there is currently no mature theory for choosing suitable hyperparameter values. This article therefore uses the grey wolf optimization algorithm (GWO) to find the optimal LSTM hyperparameters: the initial learning rate, the numbers of neurons in the hidden layers, the batch size, and the number of training iterations, i.e., [lr, L1, L2, Batch, k]. Increasing the number of hidden layers improves the nonlinear fitting ability of the model, but it also makes the model more complex, lengthens prediction time, and can even cause over-fitting, so the number of hidden layers is fixed at 2. The constraints on the optimized parameters are set as follows:

[Figure: search ranges of the optimized hyperparameters]
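
As a rough illustration of how a candidate solution [lr, L1, L2, Batch, k] can be encoded and decoded for the optimizer, here is a small Python sketch. The bounds LOWER/UPPER are placeholder assumptions for illustration only, not the exact ranges shown in the figure above.

```python
# Illustrative encoding of the optimized hyperparameter vector [lr, L1, L2, Batch, k];
# the bounds are assumptions, not the values from the article's table.
LOWER = [1e-4,  10,  10,  16,  50]   # lr, hidden units L1, hidden units L2, batch size, iterations
UPPER = [1e-1, 200, 200, 128, 300]

def decode(position):
    """Map a continuous position vector to usable LSTM hyperparameters."""
    lr = float(position[0])
    L1, L2 = int(round(position[1])), int(round(position[2]))
    batch, epochs = int(round(position[3])), int(round(position[4]))
    return lr, L1, L2, batch, epochs
```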

Using MSE as the fitness function, the flow chart of the GWO-LSTM prediction model is as follows:

[Figure: flow chart of the GWO-LSTM prediction model]
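
The overall loop might look roughly like the following Python sketch. It continues the previous sketch (decode, LOWER, UPPER), assumes Keras is available, and assumes the training and validation arrays X_tr, y_tr, X_va, y_va have already been windowed into (samples, time steps, features) form. It is an outline under those assumptions, not the exact program distributed with the article.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam

def fitness(position, X_tr, y_tr, X_va, y_va):
    """Train an LSTM with the decoded hyperparameters and return validation MSE."""
    lr, L1, L2, batch, epochs = decode(position)        # decode() from the sketch above
    model = Sequential([
        LSTM(L1, return_sequences=True, input_shape=X_tr.shape[1:]),
        LSTM(L2),                                        # two hidden layers, as fixed in the article
        Dense(1),
    ])
    model.compile(optimizer=Adam(learning_rate=lr), loss="mse")
    model.fit(X_tr, y_tr, epochs=epochs, batch_size=batch, verbose=0)
    pred = model.predict(X_va, verbose=0).ravel()
    return float(np.mean((pred - y_va.ravel()) ** 2))    # MSE fitness, smaller is better

def gwo(obj, lower, upper, n_wolves=8, n_iter=20, seed=0):
    """Basic grey wolf optimizer minimizing obj over box constraints."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = len(lower)
    wolves = rng.uniform(lower, upper, size=(n_wolves, dim))
    scores = np.array([obj(w) for w in wolves])
    for t in range(n_iter):
        alpha, beta, delta = wolves[np.argsort(scores)[:3]]
        a = 2 - 2 * t / n_iter                           # a decreases linearly from 2 toward 0
        for i in range(n_wolves):
            new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                A = 2 * a * rng.random(dim) - a
                C = 2 * rng.random(dim)
                new += (leader - A * np.abs(C * leader - wolves[i])) / 3.0
            wolves[i] = np.clip(new, lower, upper)
            scores[i] = obj(wolves[i])
    best = int(np.argmin(scores))
    return wolves[best], scores[best]

# usage sketch:
# best_pos, best_mse = gwo(lambda p: fitness(p, X_tr, y_tr, X_va, y_va), LOWER, UPPER)
```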

04 Experimental results

Root mean square error (RMSE), mean absolute percentage error (MAPE), mean absolute error (MAE), and the coefficient of determination (R²) are used as the evaluation criteria for the fit to the sequence data.
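
For reference, a minimal NumPy version of these four metrics could look like the following; y_true is assumed to contain no zeros so that MAPE is defined.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute RMSE, MAPE (%), MAE and R^2 for 1-D arrays of actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err / y_true)) * 100           # assumes no zeros in y_true
    mae = np.mean(np.abs(err))
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RMSE": rmse, "MAPE": mape, "MAE": mae, "R2": r2}

print(evaluate([3.0, 2.5, 4.1], [2.8, 2.7, 4.0]))
```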

[Figures: prediction comparisons and error metrics for BP, LSTM, GWO-LSTM, and MSGWO-LSTM]

MSGWO in the figures is the improved grey wolf optimization algorithm that the author presented previously.

05 Source code acquisition

The code is commented in detail; the source code is provided in 3 versions, listed below. In general, you only need to replace the data set. Note that the rows of the data are samples and the columns are variables, as in the minimal sketch that follows.
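
As an illustration of that layout, here is a minimal Python sketch of loading such a matrix and windowing a target column into LSTM inputs; the file name, target column, and window length are placeholder assumptions rather than anything from the provided source code.

```python
import numpy as np

# Expected layout: rows = samples, columns = variables; "data.csv" is a placeholder name.
data = np.loadtxt("data.csv", delimiter=",")       # shape (n_samples, n_variables)
series = data[:, -1]                               # e.g. take the last column as the target

def make_windows(series, n_steps=12):
    """Turn a 1-D series into supervised sliding windows for the LSTM."""
    X, y = [], []
    for i in range(len(series) - n_steps):
        X.append(series[i:i + n_steps])
        y.append(series[i + n_steps])
    X = np.array(X)[..., np.newaxis]               # (samples, time steps, 1 feature)
    return X, np.array(y)

X, y = make_windows(series)
split = int(0.8 * len(X))                          # simple chronological train/test split
X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]
```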

1. Free version

It mainly contains an LSTM prediction model, with both Matlab and Python programs, which is enough for readers who need to make simple predictions or want to learn the LSTM algorithm.


How to obtain: reply "LSTM" to the WeChat official account (KAU's cloud experimental platform).

2. Paid version 1

This is mainly the GWO-optimized LSTM prediction model. It includes only the Matlab program and provides a prediction comparison among BP, LSTM, and GWO-LSTM; the Python version has not yet been released because the author has been busy recently. The program is commented in detail and easy to adapt, and the intelligent optimization algorithms introduced previously by Kaka can be swapped in.


How to obtain: reply "GWOLSTM" to the WeChat official account.

3. Paid version 2

This is mainly the MSGWO-optimized LSTM prediction model. It includes only the Matlab program and provides a prediction comparison among BP, LSTM, GWO-LSTM, and MSGWO-LSTM, which is what the result figures above show. MSGWO is the multi-strategy improved grey wolf optimization algorithm from Kaka's earlier article. The program is commented in detail and contains two parts, function testing and the prediction model, so it can serve as a basis for publishing work in this direction. Further innovations can also be built on this prediction model: for example, cascading another model on the prediction error, or introducing other improvement strategies into the improved grey wolf algorithm.


How to obtain: reply "MSGWOLSTM" to the WeChat official account.

[1] You Haolin. The Beauty of Python Prediction: Data Analysis and Algorithm Practice [M]. Electronic Industry Press.

One more note: if anyone has an optimization problem to be solved (in any field), you can send it to me, and I will selectively write articles that apply optimization algorithms to such problems.

If this article is helpful or inspiring to you, you can click Like/Reading (ง •̀_•́)ง in the lower right corner (you don’t have to click).

Origin: blog.csdn.net/sfejojno/article/details/134097476