Time series forecasting — ARIMA model principle

Table of contents

1 Introduction to ARIMA model

1.1 Principle of ARIMA model

1.2 Conditions for applying the ARIMA model

1.3 Basic steps of the model

2 Differencing

2.1 The role of the difference operation

2.2 Difference operation

2.3 Order of difference

2.4 Lag of the difference

2.5 Points to note when using difference operations

3 Stationarity of data

3.1 The concept of data stationarity

3.2 The concept of the ADF test

3.3 Unit root

3.4 Principle of the ADF test

3.5 Implementing the ADF test in Python

4 Determining p and q values

4.1 How to determine the p and q values

4.2 Introduction to ACF and PACF

4.3 Plotting ACF and PACF in Python


1 Introduction to ARIMA model

1.1 Principle of ARIMA model

The ARIMA model is the classic statistical time series model, best suited to univariate time series data. The data we actually face is mostly multivariate time series, and the quantities we need to predict depend on more than time alone, but the ARIMA model is still an excellent vehicle for understanding time series forecasting.

ARIMA (Autoregressive Integrated Moving Average), the differenced autoregressive moving average model, combines the AR model (autoregressive model) and the MA model (moving average model): the label value at a point in time is affected both by the label values over a past period and by accidental events (which can be understood as noise) during that period. In other words, the ARIMA model assumes that label values fluctuate around a general trend over time, where the trend is driven by historical labels and the fluctuations are driven by accidental events within a period, and the trend itself is not necessarily stable.

Guided by this central idea, the formula of the ARIMA model is expressed as:
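In standard notation, with y_t the d-times differenced series, ε_t white noise, and φ_i, θ_i the AR and MA coefficients, the ARIMA(p, d, q) equation can be written as:

```latex
y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p}
        + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}
```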

The first half of the formula is the AR model, and the second half is the "fluctuation" part of the MA model.

1.2 Conditions for applying the ARIMA model

  • The data series must be stationary, meaning its mean and variance should not change over time. A series can be made stationary by a logarithmic transformation or by differencing.
  • The input data must be a univariate series, because ARIMA uses past values of the series to predict its future values.

1.3 Basic steps of the model

  1. Test the series for stationarity; determine the d value
  2. Determine the p value and q value
  3. Parameter estimation and diagnostic tests
  4. Predict future values

2 Differencing

2.1 The role of the difference operation

The difference operation can remove drastic fluctuations from the data, and can therefore eliminate the effects of seasonality, periodicity, holidays, and so on in a time series. Typically, we use a difference with a lag of 7 to remove day-of-week effects (when the sample's time unit is days), a lag of 12 to remove month-of-year effects (when the time unit is months), and a lag of 4 to remove quarterly effects (when the time unit is quarters). In essence, differencing is a method of information extraction, and the key information it is best at extracting is the periodicity in the data.

2.2 Difference operation

First, let's look at the concept of differencing in ARIMA, the Differenced Autoregressive Moving Average model. The ARIMA model has the difference operation built in, and the order of differencing is one of ARIMA's key hyperparameters; the difference operation improves the stationarity of the data. Differencing is a mathematical operation applied to sequences: when we difference a sequence, we compute the differences between observations in the sequence. For example, suppose we now have a sequence x1:

If we subtract from each value in this sequence the adjacent value that precedes it, a new sequence is formed: the first-order difference (First-Order Differencing) of the original sequence, which we call y1. In other words, the first-order difference operation subtracts, from each value in the sequence, the adjacent value with the smaller index, forming a new sequence. When the sequence is a time series, the first-order difference computes the difference between label values at adjacent time points. Compared with the original sequence, the differenced sequence has fewer entries (it is shorter): each first-order difference operation makes the sequence shorter by 1.
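For instance, a first-order difference on a short made-up sequence, using numpy's diff:

```python
import numpy as np

x = np.array([10, 12, 15, 14, 18])
y1 = np.diff(x)            # each element is x[t] - x[t-1]
print(y1)                  # [ 2  3 -1  4]
print(len(x) - len(y1))    # 1: the series shortens by one
```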

When actually performing the difference operation, we can vary two factors to produce different differences: the order of the difference and the lag of the difference.

2.3 Order of difference

Let's first look at the order of differencing: a higher-order difference means performing the first-order difference multiple times. For example, to perform a second-order difference (Second-Order Differencing), we apply the first-order difference to x twice: first compute y1, then apply the first-order difference to y1 to obtain z:

At this point, z is the second-order difference of the sequence x. Thus an n-th order difference applies the first-order difference n times to the original data. In practice, the order of differencing is generally not very high; in the ARIMA model, the most common values of the hyperparameter d are small numbers such as 0, 1, and 2.
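A second-order difference is just the first-order difference applied twice, which numpy also exposes directly via the n argument (the sequence here is made up):

```python
import numpy as np

x = np.array([10, 12, 15, 14, 18])
z = np.diff(np.diff(x))                     # difference twice
print(z)                                    # [ 1 -4  5]
print(np.array_equal(z, np.diff(x, n=2)))   # True: same as n=2
```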

2.4 Lag of the difference

The lag of a difference is entirely different from its order. The ordinary first-order difference is a difference with lag 1 (lag-1 differences), meaning we subtract adjacent observations. More generally, a difference with lag k subtracts observations that are k positions apart; so when the lag is 2, we subtract from each value the observation two positions earlier, with one value in between.

The sequence y1 is the lag-2 difference of the sequence x. A lagged difference operation therefore subtracts, from each value in the sequence, the value 2 positions before it, forming a new sequence.

A difference with a lag is also called a multi-step difference; for example, a difference with lag 2 is called a 2-step difference. Unlike high-order differences, which are rarely used, multi-step differences are widely used. In time series, labels often have a certain periodicity: for example, they may fluctuate regularly with the seasons (high label values in summer, low in winter) or with the day of the week (higher on weekends, lower on weekdays).
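A multi-step (lagged) difference can be sketched with pandas, whose Series.diff takes a periods argument; the series here is made up:

```python
import pandas as pd

s = pd.Series([1, 2, 4, 7, 11, 16])
lag2 = s.diff(periods=2)   # s[t] - s[t-2], a 2-step difference
print(lag2.tolist())       # [nan, nan, 3.0, 5.0, 7.0, 9.0]
```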

2.5 Points to note when using difference operations

Does a higher order or more steps mean the extracted information is more essential? In fact, like many information-extraction processes, differencing loses information and may even extract "noise" along the way. The higher the order and the more the steps, the more of the original information is lost and the more severely the information is "deformed". We therefore need to find an appropriate order and number of steps rather than insisting on high-order or multi-step differences.

As long as the data has a trend or periodicity, can differencing always remove it? The difference operation can indeed handle most trending and periodic time series, but it cannot solve every time series problem.

3 Stationarity of data

3.1 The concept of data stationarity

We now know what differencing does for time series data. The stationarity it brings is a basic requirement for the ARIMA model to work. Specifically, the ARIMA model makes the following basic assumption: the time series fed into ARIMA must be stationary.

In statistics, a stationary time series is one that always fluctuates around a fixed mean, with a fluctuation amplitude that does not vary greatly. If the mean of the series changes significantly over time, or the fluctuation amplitude differs noticeably between stages, the data is non-stationary. Concretely, the data is non-stationary when it shows an obvious upward or downward trend, when it shows seasonality or periodicity, or when the fluctuation amplitude is clearly growing or shrinking. The stationarity of the data can be checked with the ADF test.

3.2 The concept of the ADF test

As mentioned earlier, ARIMA requires the time series to be stationary, so when studying a time series, the first step is usually a stationarity test. Besides eyeballing the plot, the more common rigorous statistical method is the ADF test, also called the unit root test.

The full name of the ADF test is the Augmented Dickey-Fuller test. As the name suggests, ADF is an augmented form of the Dickey-Fuller (DF) test. The DF test applies only to first-order autoregression; when the series has higher-order lagged correlation, the ADF test can be used, so ADF is an extension of the DF test.

3.3 Unit root

The first-order AR model, that is, AR(1), has the following form:
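In standard notation, with ε_t denoting white noise:

```latex
X_t = \alpha_1 X_{t-1} + \varepsilon_t
```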

When α1 = 1, the model has a unit root and becomes a random walk, which we know is non-stationary. Intuitively, when α1 = 1 the influence of the previous moment on the current moment is passed on at 100% and never weakens; even a very distant past moment retains its full influence on the present, so the variance (the magnitude of fluctuation) accumulates contributions from all previous moments, depends on t, and the series is not stationary.

When α1 > 1, the fluctuation at the current moment is not only inherited from the previous moment but amplified, so the series is not stationary.

When |α1| < 1, the impact of past fluctuations on the current moment decays gradually. One can show that the autocovariance and autocorrelation coefficient are then fixed values independent of t, so in this case the series is stationary.
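A small simulation with made-up noise illustrates these cases: for α1 = 1 the variance keeps growing with t, while for |α1| < 1 it settles to a fixed value:

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 500, 2000

def simulate(alpha):
    """Simulate many AR(1) paths x[t] = alpha * x[t-1] + noise."""
    x = np.zeros((reps, n))
    for t in range(1, n):
        x[:, t] = alpha * x[:, t - 1] + rng.normal(size=reps)
    return x

stationary = simulate(0.5)   # |alpha| < 1: variance converges
walk = simulate(1.0)         # alpha = 1: random walk, variance grows like t

print(stationary[:, -1].var())                     # close to 1/(1-0.25) ~ 1.33
print(walk[:, -1].var() > walk[:, 50].var() * 3)   # True: variance keeps growing
```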

3.4 Principle of the ADF test

The ADF test determines whether a unit root exists in the series: if the series is stationary, there is no unit root; otherwise, there is.

Therefore, the null hypothesis H0 of the ADF test is that a unit root exists. If the test statistic is smaller than the critical value at a given significance level (10%, 5%, or 1%), we reject the null hypothesis with the corresponding (90%, 95%, 99%) confidence.

3.5 Implementing the ADF test in Python

from statsmodels.tsa.stattools import adfuller

# Run the ADF test on a slice of the load series
adfresult = adfuller(regionload['总有功功率(kw)'][150:600])
print(adfresult)
# output:
# (-4.747394852689344, 6.849295111339592e-05, 17, 432, {'1%': -3.445578199334947, '5%': -2.8682536932290876, '10%': -2.570346162765775}, 8736.266503299139)
  • adf: the test statistic.
  • pvalue: the p-value of the hypothesis test.
  • usedlag: the number of lags used in the ADF regression.
  • nobs: the number of observations used for the ADF regression and for computing the critical values.
  • critical values: the critical values of the test statistic at the 1%, 5%, and 10% significance levels (the dict in the output above).
  • icbest: the maximized information criterion, if autolag is not None.
  • resstore: a dummy class with the results attached as attributes (returned only when store=True).

Note that usedlag (17 here) is the lag order of the ADF regression, not the difference order d. The d value is the number of differences needed to make the series stationary; since the test statistic (-4.75) is below all three critical values and the p-value is far below 0.05, this series is already stationary and d can be set to 0.

4 Determining p and q values

4.1 How to determine the p and q values

The ARIMA model has three parameters: p, q, and d. p and q control the autoregressive and moving average parts of the model respectively, and d controls the order of differencing applied to the input data. Seasonal variants of ARIMA (such as SARIMA) additionally take a seasonal period parameter that controls the lag of the difference operation (the number of steps in a multi-step difference).

  • The p value can be roughly judged from the lag at which the partial autocorrelation (PACF) plot cuts off.
  • The q value can be roughly judged from the lag at which the autocorrelation (ACF) plot cuts off.
  • The d value can be roughly judged from the difference order needed to reach stationarity.

4.2 Introduction to ACF and PACF

Autocorrelation function (ACF): the autocorrelation coefficient measures the correlation between the same series at two different times. Figuratively speaking, it measures the influence of one's past behavior on one's present. The q value can be roughly judged from the lag at which the ACF plot cuts off.

Partial autocorrelation function (PACF): when computing the influence of one factor on another, the influence of all other factors is held constant, i.e., temporarily ignored. Studying the closeness of the relationship between just those two elements on their own is called partial correlation. The p value can be roughly judged from the lag at which the PACF plot cuts off.

Much like the Pearson correlation coefficient, ACF and PACF values range over [-1, 1], where 1 means the two series are perfectly positively correlated, -1 means they are perfectly negatively correlated, and 0 means they are uncorrelated.

4.3 Plotting ACF and PACF in Python

In Python, the computation can be done directly with the statsmodels package, which can be installed with pip install statsmodels.

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot the raw series and its ACF/PACF for the first 800 samples
regionload['总有功功率(kw)'][:800].plot()
plot_acf(regionload['总有功功率(kw)'][:800], lags=40, adjusted=False)
plot_pacf(regionload['总有功功率(kw)'][:800], lags=40, method='ols')
plt.show()

Figure 1 shows part of the load data from dataset 3; Figures 2 and 3 show its ACF and PACF.

The ACF and PACF plots share the same x-axis, the lag level, while the y-axis is the ACF or PACF value of the series at that lag. The shaded blue area is the 95% (or 99%) confidence interval. When an ACF/PACF value falls outside the blue area, it is considered statistically significant at that lag, i.e., the correlation between the series at that lag level is largely trustworthy.

Generally speaking, ACF and PACF plots take three common forms: tailing off, cutting off, and neither tailing off nor cutting off.

  • Tailing off: the plot decays regularly and the autocorrelation weakens gradually (Figure 2). If the ACF tails off while the PACF cuts off at lag p, an AR(p) model is suggested and the q value can be set to 0.
  • Cutting off (truncation): the ACF/PACF drops off a cliff after a certain lag (Figure 3). For example, if the ACF is high at lags 0 and 1 but collapses at lag 2, the ACF is said to "cut off after lag 1" (first-order truncation). Note that the "order" in first-order or n-th order truncation refers to lag 1 or lag n; do not confuse it with the order of higher-order differences. If the ACF cuts off at lag q while the PACF tails off, an MA(q) model is suggested and the p value can be set to 0.
  • Neither tailing off nor cutting off: the ACF and PACF plots show no regularity (Figure 3).

When a plot cuts off, the cutoff lag is usually very low (generally no more than 3), meaning that only a few recent time points in the series influence the future. When neither plot cuts off, the patterns in the original data are hard to extract: the data may be white noise, and a more complex time series model may be needed to extract its regularities.

Beyond these, ACF and PACF plots can take other forms. When the ACF or PACF plot shows strong periodicity, the original series most likely has strong periodicity as well; when it shows a strong trend (rising or falling), the original series most likely has a strong trend too.

Origin blog.csdn.net/qq_41921826/article/details/134399698