Time Series Forecasting: A Review of Time Series Forecasting Research

Table of contents

1 What is time series forecasting?

2 Time series forecast classification

3 Characteristics of time series data

4 Time series prediction evaluation indicators

5 Time series forecasting methods based on deep learning

5.1 Statistical learning methods

5.2 Machine learning methods

5.3 Convolutional Neural Network

5.4 Recurrent Neural Network

5.5 Transformer-based models


1 What is time series forecasting?

Time series: a set of random variables obtained by observing the development and change of some process and recording it at a fixed frequency.
Time series forecasting: mining the core patterns hidden in large amounts of historical data and, based on known factors, producing accurate estimates of future values.

Research directions for time series data: classification and clustering, anomaly detection, event prediction, time series forecasting, etc.
Application scenarios of time series forecasting: by mining the latent patterns of a time series and extrapolating or drawing analogies from them, time series forecasting is used to solve many real-life problems, including noise elimination, stock market analysis, power load forecasting, traffic condition forecasting, influenza epidemic warning, etc.

2 Time series forecast classification

Classification of time series prediction: when the raw data provided for a forecasting task consist only of the historical values of the target variable, it is univariate time series forecasting; when the raw data contain multiple random variables, it is multivariate time series forecasting. In general, real-world time series data are multivariate.

Time series prediction tasks can be divided into four categories according to the length of the time span to be predicted.

3 Characteristics of time series data

Time series forecasting learns from and analyzes the historical data of the first t - 1 time steps to estimate the values of a specified future period. Because of the latent relationships among its variables, time series data often exhibits one or more of the following characteristics.

  • Massiveness: time series data is growing explosively, and effective preprocessing at the dataset level is key to completing forecasting tasks with high quality.
  • Trend: the data at the current moment is often closely related to the data at the previous moment. This implies that, under the influence of other factors, time series usually follow certain patterns of change; a series may exhibit a steady upward, downward, or level trend over a long period of time.
  • Cyclicity: the data in the series are affected by external factors and show recurring ups and downs over a long period of time.
  • Volatility: with the passage of time and the influence of multiple external factors, the variance and mean of the series may change systematically, which affects forecasting accuracy to a certain extent.
  • Stationarity: the individual values change randomly, but the series shows consistent statistical regularity over time and keeps a relatively stable mean and variance (a minimal stationarity-check sketch follows this list).
  • Symmetry: if, within a certain period, the distance between the original series and its time-reversed series stays within a threshold and the two curves are basically aligned, the series is considered symmetric, for example the back-and-forth runs of large transport vehicles in a port or the raising and lowering of a crane arm.
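As a small illustration of the stationarity property, the sketch below uses the augmented Dickey-Fuller test from statsmodels to check whether a series looks stationary; the white-noise and random-walk series are synthetic examples made up for this sketch, not data from the review.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)       # stationary by construction
random_walk = np.cumsum(white_noise)     # non-stationary (unit root)

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    stat, pvalue = adfuller(series)[:2]  # ADF statistic and p-value
    print(f"{name}: ADF statistic={stat:.2f}, p-value={pvalue:.3f}")
    # a small p-value (< 0.05) rejects the unit root, i.e. the series looks stationary
```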

4 Time series prediction evaluation indicators

The error evaluation index is an important way to measure the performance of a time series prediction model. Generally speaking, the larger the error computed by an evaluation index, the lower the prediction accuracy, and hence the worse the performance of the established model. The commonly used evaluation indicators for time series forecasting are as follows:

Mean absolute error (MAE): the average of the absolute differences between each sample's predicted value and its true value. The specific formula is: $\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i-y_i\right|$

Mean square error (MSE): a very practical indicator; it averages the squared differences between the predicted and true values of each sample. The specific formula is: $\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^2$

Root mean square error (RMSE): the square root of the mean square error. The specific formula is: $\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^2}$

Mean absolute percentage error (MAPE): a relative error measure that prevents positive and negative errors from cancelling each other out. The specific formula is: $\mathrm{MAPE}=\frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i-y_i}{y_i}\right|$

Coefficient of determination (R-squared): also called the goodness of fit, it reflects the accuracy of the model's predictions and takes values in [0, 1]. The closer R² is to 1, the better the model; when the model is no better than the baseline model, R² = 0. The formula is: $R^2=1-\frac{\sum_{i}\left(\hat{y}_i-y_i\right)^2}{\sum_{i}\left(y_i-\bar{y}\right)^2}$
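As a concrete illustration of these indicators, the following sketch computes MAE, MSE, RMSE, MAPE and R² with NumPy on a small made-up pair of true and predicted arrays; the numbers are only placeholders.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])   # hypothetical ground truth
y_pred = np.array([2.8, 5.4, 2.9, 6.5, 4.7])   # hypothetical predictions

err  = y_pred - y_true
mae  = np.mean(np.abs(err))                     # mean absolute error
mse  = np.mean(err ** 2)                        # mean square error
rmse = np.sqrt(mse)                             # root mean square error
mape = np.mean(np.abs(err / y_true)) * 100      # mean absolute percentage error (%)
r2   = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # R-squared

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} MAPE={mape:.2f}% R2={r2:.3f}")
```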

5 Time series forecasting methods based on deep learning

The following only introduces each forecasting method so you can get a general understanding; for specific code, please consult the relevant materials. Later columns will introduce the principles and code one by one.

5.1 Statistical learning methods

AR: the autoregressive model (AutoRegressive) is one of the most typical and basic statistical time series models. Its basic idea is to predict future behavior from historical behavior. The AR model focuses on whether a value in the series is correlated with the values before and after it, and relies on this correlation to work.

MA: the moving average model is another very classic statistical model. It is built on an idea different from the autoregressive model and can also achieve good results. Rather than focusing on the connection between the past and the future, the MA model cares about how the value at each point in time is affected by external accidental factors.

ARIMA: the Autoregressive Integrated Moving Average model is one of the most famous and widely used time series forecasting methods. It combines the ideas of the AR and MA models, that is, it attends both to the influence of the past on the future and to how the value at each time point is affected by external accidental factors, so it can handle relatively complex time series data.

SARIMA: Seasonal Autoregressive Integrated Moving Average. This model extends ARIMA so that, on top of the original model, it can learn seasonal patterns (for example monthly ones), and it generally performs well on data that shows clear monthly or yearly seasonal patterns.

The above four models all belong to the ARIMA family. They can be implemented with libraries such as statsmodels, Pmdarima and sktime, or with the statistical tool SPSS or the R language, and they are among the most commonly used methods in contemporary univariate time series forecasting.
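A minimal sketch of the statsmodels route mentioned above, fitting an ARIMA(p, d, q) model to a synthetic univariate series and forecasting a few steps ahead; the order (2, 1, 1) and the data are illustrative assumptions, not values from the review.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# synthetic univariate series: drift plus noise (illustrative only)
rng = np.random.default_rng(42)
y = np.cumsum(rng.normal(loc=0.1, scale=1.0, size=200))

model = ARIMA(y, order=(2, 1, 1))     # AR order p=2, differencing d=1, MA order q=1
fitted = model.fit()
forecast = fitted.forecast(steps=10)  # predict the next 10 points
print(forecast)
```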

Exponential Smoothing: exponential smoothing is a forecasting method for univariate time series. Together with the ARIMA family, it is one of the two most widely used classes of models in statistics. Where ARIMA models focus on the autocorrelation between samples, exponential smoothing focuses on the long-term trend and seasonality of the data, so it is particularly well suited to data with systematic trends or seasonal components.
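For exponential smoothing, a comparable sketch using statsmodels' Holt-Winters implementation; the additive trend and seasonality settings and the seasonal period of 12 are assumptions made for illustration.

```python
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# synthetic monthly-style series with trend and yearly seasonality (illustrative)
t = np.arange(120)
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + np.random.default_rng(0).normal(size=120)

model = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12)
fitted = model.fit()
print(fitted.forecast(12))  # forecast one seasonal cycle ahead
```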

5.2 Machine learning methods

Machine learning models: we can use any higher-order supervised model from machine learning to predict a time series, such as XGBoost, LightGBM, CatBoost, DeepForest, etc.; classic models such as support vector machines can be used in the same way. Note, however, that most of the time we only apply higher-order machine learning models to multivariate time series that have been de-serialized, i.e. converted into a tabular supervised form, as sketched below.
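A sketch of how a series can be turned into a supervised table of lag features and fed to a gradient-boosting regressor; the lag window of 12 and the choice of scikit-learn's GradientBoostingRegressor are illustrative assumptions (XGBoost, LightGBM, etc. plug in the same way).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def make_supervised(series, n_lags=12):
    """Turn a 1-D series into (X, y) where each row holds the previous n_lags values."""
    X, y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])
        y.append(series[i])
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 20, 300)) + np.random.default_rng(1).normal(scale=0.1, size=300)
X, y = make_supervised(series, n_lags=12)

model = GradientBoostingRegressor().fit(X[:-20], y[:-20])  # train on all but the last 20 rows
print(model.predict(X[-20:]))                              # one-step-ahead predictions
```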

5.3 Convolutional Neural Network

CNN: convolutional neural networks (CNN) are deep feedforward neural networks whose core operations are convolution and pooling. They were originally designed to solve image recognition problems in computer vision. For time series prediction, the idea is to use the convolution kernel's ability to perceive the changes in historical data over a window of time and make predictions based on those changes.

Schematic diagram of CNN structure 
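A minimal PyTorch sketch of the idea described above: a 1-D convolution slides over a window of historical values and a linear head maps the extracted features to a one-step forecast. The window of 24 and the 16 channels are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class SimpleCNNForecaster(nn.Module):
    def __init__(self, window=24, channels=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=3, padding=1),  # perceive local changes
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                           # summarize the window
        )
        self.head = nn.Linear(channels, 1)                     # one-step-ahead forecast

    def forward(self, x):            # x: (batch, 1, window)
        feats = self.conv(x).squeeze(-1)
        return self.head(feats)      # (batch, 1)

model = SimpleCNNForecaster()
dummy = torch.randn(8, 1, 24)        # batch of 8 windows of 24 past values
print(model(dummy).shape)            # torch.Size([8, 1])
```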

WaveNet-CNN: Borovykh et al., inspired by the speech sequence generation model WaveNet, proposed an improved CNN model that uses the ReLU activation function and parameterized skip connections to simplify the structure. The model achieves high performance in financial analysis tasks, showing that convolutional networks are not only simpler and easier to train but can also perform excellently on noisy prediction tasks.

WaveNet-CNN structure diagram  

Kmeans-CNN: as datasets become larger and larger, CNNs alone do not handle them well. In 2017, Dong et al. combined a CNN, which can learn more useful features, with the K-means clustering algorithm for segmenting data: similar samples in a large dataset are clustered and split into many small subsets for training. The approach performs well on a power load dataset with millions of samples.

Kmeans-CNN structure diagram   

TCN: in 2018, Bai et al. proposed the temporal convolutional network (TCN), a CNN-based architecture with lower memory consumption and support for parallel computation. TCN introduces causal convolution to ensure that future information is not seen in advance during training; its backpropagation path differs from the time direction, avoiding vanishing and exploding gradients. To address the information loss caused by stacking too many layers, TCN introduces residual connections so that information can be passed across layers as it flows through the network.

TCN structural diagram 
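A sketch of the two TCN ingredients mentioned above: a causal (left-padded) dilated convolution and a residual connection. The channel count, kernel size, and dilation are illustrative choices, and this is only a single block rather than a full TCN.

```python
import torch
import torch.nn as nn

class CausalResidualBlock(nn.Module):
    """One TCN-style block: dilated causal convolution plus a residual connection."""
    def __init__(self, channels=16, kernel_size=3, dilation=2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # pad only on the left -> causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                                # x: (batch, channels, time)
        out = nn.functional.pad(x, (self.pad, 0))        # no access to future time steps
        out = self.relu(self.conv(out))
        return out + x                                   # residual: information crosses layers

block = CausalResidualBlock()
print(block(torch.randn(4, 16, 50)).shape)               # torch.Size([4, 16, 50])
```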

The prediction accuracy of CNNs alone is no longer superior to other network structures such as recurrent neural networks, and on their own they struggle with long-horizon forecasting problems. However, CNNs are often integrated as a powerful module into more advanced models for prediction tasks. The overall analysis of convolutional neural network algorithms is as follows:

5.4 Recurrent Neural Network

RNN: recurrent neural networks (RNN) are a deep learning model proposed by Jordan in 1990 for learning features along the time dimension. The units of an RNN are connected in a long chain that recurses in the direction of the sequence, and the input to the model is sequence data. Training an RNN is prone to severe vanishing-gradient or exploding-gradient problems. The vanishing-gradient problem is mainly caused by the weights of the earliest layers not being updated effectively in time, so training fails; the exploding-gradient problem refers to an unstable learning process caused by excessively large changes in the iterated parameters. As the data length increases these problems become more and more obvious, so an RNN can only effectively capture short-term patterns, that is, it has only short-term memory.

RNN structure diagram  

Bi-RNN: in 1997, Schuster et al. extended the conventional recurrent neural network to the bidirectional recurrent neural network (Bi-RNN). By training in the forward and backward directions simultaneously, Bi-RNN can use past and future feature information at the same time, exploiting the input without restriction up to a preset future frame. In regression prediction experiments on artificial data, Bi-RNN took roughly the same training time as an RNN and achieved better prediction results.

Bi-RNN Summary  

LSTM: the long short-term memory network (LSTM) was proposed by Hochreiter in 1997 to solve many of the problems of the RNN model.

LSTM Structure diagram  
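A minimal PyTorch LSTM forecaster in the spirit of this section; the hidden size and single-layer setup are assumptions, and passing bidirectional=True (and doubling the head's input size) would give the Bi-LSTM variant discussed below.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, input_size=1, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, time, features)
        out, _ = self.lstm(x)             # out: (batch, time, hidden)
        return self.head(out[:, -1, :])   # use the last hidden state for a one-step forecast

model = LSTMForecaster()
print(model(torch.randn(8, 24, 1)).shape)   # torch.Size([8, 1])
```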

Bi-LSTM: in 2005, Graves et al. [42] proposed the bidirectional long short-term memory network (Bi-LSTM), which has a structure similar to Bi-RNN and is composed of two independent LSTMs. Bi-LSTM was designed to overcome the LSTM's inability to use future information, so that the features obtained at time t carry both past and future information. Because Bi-LSTM exploits additional context without having to remember earlier inputs, it is more capable when processing data with longer delays. Experiments show that, without time delays, the two directions return almost the same results, which means the context in both the forward and backward training directions is equally important for some time series data. The feature extraction ability of Bi-LSTM is significantly higher than that of LSTM.

GRU: the gated recurrent unit (GRU) is an improvement on the LSTM model.

GRU structural diagram

Recurrent neural network methods can capture and exploit both long-term and short-term temporal dependencies for prediction, but they do not perform well on long-sequence forecasting tasks. Because RNNs compute mostly serially, memory consumption during training is extremely large, and the problems of vanishing and exploding gradients have never been completely solved. The overall analysis of recurrent neural network algorithms is as follows:

5.5 Transformer-based models

Transformer is a new deep learning framework that differs from earlier CNN or RNN structures. The self-attention mechanism it uses addresses the situation where the input to a neural network is many vectors of different sizes: there are often latent connections between the vectors at different time steps, and if these connections between the inputs cannot be fully captured during training, model performance suffers.

 Transformer structure diagram

The core of the Transformer is the self-attention module, which can be viewed as a fully connected layer whose weights are dynamically generated from the pairwise similarity of the input patterns. Its relatively small number of parameters means less computation under the same conditions, making it suitable for modeling long-term dependencies. RNN-based models, even with LSTM and GRU, cannot fully avoid vanishing and exploding gradients: as training proceeds the gradients become smaller and smaller, and information needs n - 1 steps to reach the n-th position, whereas the longest path in a Transformer is only 1, which solves a problem that has long plagued RNNs. The Transformer's outstanding ability to capture long-term dependencies and interactions makes it highly attractive for time series modeling, and it shows high performance on a variety of time series tasks.
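A bare-bones sketch of the scaled dot-product self-attention at the core of the Transformer, showing the dynamically generated weight matrix described above; the dimensions are arbitrary, and no masking or multi-head splitting is included.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, time, d_model); w_q/w_k/w_v are plain projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # pairwise similarity
    weights = torch.softmax(scores, dim=-1)                    # dynamically generated weights
    return weights @ v                                         # every step attends to every other in one hop

d = 16
x = torch.randn(2, 10, d)
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)   # torch.Size([2, 10, 16])
```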

BERT: this model captures time series information through multi-head self-attention instead of the RNNs commonly used in prediction tasks, and uses factorized embedding parameterization to determine more effectively the autocorrelation between the states before and after each time step. It only requires traffic speed and road information for the days of a week, without traffic information from adjacent roads at the current moment, so it has few application restrictions.

BERT structure diagram 

AST: applies the idea of generative adversarial training to propose an adversarial sparse Transformer (AST) based on the Sparse Transformer.

Informer: Beijing University of Aeronautics and Astronautics proposed the Informer model, based on the classic Transformer encoder-decoder structure, to make up for the shortcomings of Transformer-like deep learning models on long-sequence time series forecasting problems. Previously, a long-sequence forecast was usually produced by repeated multi-step prediction, whereas Informer can output the desired long sequence in one pass. The specific structure of Informer is shown in the figure:

Informer structure diagram 

TFT: the temporal fusion transformer (TFT) designs a multi-horizon prediction model that includes a static covariate encoder, a gated feature selection module and a temporal self-attention decoder.

TFT structure diagram  

SSDNet: the state space decomposition neural network (SSDNet) combines the Transformer deep learning architecture with state space models (SSM), taking into account both the performance advantages of deep learning and the interpretability of SSM. SSDNet uses the Transformer architecture to learn temporal patterns and directly estimate the parameters of the SSM. For interpretability, a fixed-form SSM provides the trend and seasonal components, and the Transformer's attention mechanism identifies which parts of past history are most important for prediction.

SSDNet structure diagram 

Autoformer: designs a simple seasonal-trend decomposition architecture and inherits the Transformer encoder-decoder structure. Its distinctive internal operator separates the overall trend of the variables from the predicted hidden variables; this design allows the model to alternately decompose and refine intermediate results during prediction (a minimal sketch of this decomposition idea follows the figure).

Autoformer structure diagram
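A sketch of the series-decomposition idea behind Autoformer: a moving average extracts the trend and the remainder is treated as the seasonal part. The kernel size of 25 is an illustrative assumption, and this is only the decomposition operator, not the full model.

```python
import torch
import torch.nn as nn

class SeriesDecomposition(nn.Module):
    """Split a series into trend (moving average) and seasonal (residual) parts."""
    def __init__(self, kernel_size=25):
        super().__init__()
        self.kernel_size = kernel_size
        self.avg = nn.AvgPool1d(kernel_size, stride=1)

    def forward(self, x):                                  # x: (batch, time, features)
        # pad both ends so the moving average keeps the original length
        front = x[:, :1, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        back = x[:, -1:, :].repeat(1, (self.kernel_size - 1) // 2, 1)
        padded = torch.cat([front, x, back], dim=1)
        trend = self.avg(padded.transpose(1, 2)).transpose(1, 2)
        seasonal = x - trend
        return seasonal, trend

seasonal, trend = SeriesDecomposition()(torch.randn(4, 96, 7))
print(seasonal.shape, trend.shape)      # torch.Size([4, 96, 7]) twice
```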

Aliformer: in 2021, to solve the problem of accurate time series sales forecasting in e-commerce, Alibaba proposed Aliformer, based on a bidirectional Transformer, which uses historical information, current factors and future knowledge to predict future values. Aliformer designs a knowledge-guided self-attention layer that uses the consistency of known knowledge to guide the propagation of temporal information, and proposes a future-emphasized training strategy that makes the model pay more attention to exploiting future knowledge.

Aliformer structure diagram 

FEDformer: the frequency enhanced decomposed Transformer (FEDformer), proposed in 2022, designs two attention modules that use the Fourier transform [68] and the wavelet transform [69] to apply attention operations in the frequency domain. FEDformer integrates the seasonal-trend decomposition widely used in time series analysis into the Transformer-based method and combines it with Fourier analysis: instead of applying the Transformer in the time domain, it applies it in the frequency domain, which helps the Transformer better capture the global characteristics of a time series.

FEDformer structure diagram 

Pyraformer: proposed in 2022, Pyraformer is a new model based on pyramidal attention that can effectively describe short-term and long-term temporal dependencies with low time and space complexity.

Pyraformer structure diagram  

Conformer: in 2023, to address the efficiency and stability problems of long-sequence forecasting tasks with obvious periodicity, the Conformer model for multivariate long-horizon time series forecasting was proposed.

Conformer structure diagram   

Transformer-like algorithms are now widely used in various tasks in the field of artificial intelligence. Models built on the Transformer can break the bottlenecks of earlier algorithms: they capture both short-term and long-term dependencies well, effectively address long-sequence prediction, and can be computed in parallel. The performance comparison and overall analysis of the above algorithms are as follows:


Reference paper: A review of research on deep learning applied to time series prediction


Origin blog.csdn.net/qq_41921826/article/details/134333660