Examples of time series analysis in R language (pattern recognition, fitting, testing, forecasting)

1. Preparation

1. Data preparation

The data used is the co2 dataset from the TSA package; if the package is not installed, install it first

install.packages("TSA")	# 安装包 TSA

You will be prompted to choose a CRAN mirror; any one will do. After installation, load the package and plot the data

library(TSA)
data(co2)
win.graph(width = 4.875,height = 3,pointsize = 8)
plot(co2,ylab='CO2')        # plot the raw series


The plot shows that the original series has a clear upward trend and a strong seasonal pattern.

2. Basic concepts

Akaike's (1973) Information Criterion (AIC) is based on the concept of entropy and balances the complexity of the estimated model against how well the model fits the data:

AIC = -2 log(maximum likelihood) + 2k

where k = p + q + 1 if the model contains an intercept or constant term, and k = p + q otherwise. The smaller the AIC, the better.
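As a quick check, the AIC reported by arima() can be recomputed from the log-likelihood. A minimal sketch (note: some R versions also count sigma^2 as a parameter, which shifts every AIC by the same constant 2 and does not affect model comparisons):

fit <- arima(co2, order = c(0,1,1), seasonal = list(order = c(0,1,1), period = 12))
k <- length(coef(fit))        # p + q; no constant term after differencing
-2 * fit$loglik + 2 * k       # compare with fit$aic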

The Ljung-Box test (LB test) is a test of randomness: it checks whether the autocorrelations of a series up to lag m are jointly significant, i.e. whether the series is white noise (under the null hypothesis the statistic follows a chi-square distribution with m degrees of freedom). If the data are white noise, there is no information left to extract and no need to continue the analysis.

2. Data processing

After obtaining a series, first judge whether it is a stationary time series. If it is, proceed to pattern recognition; if not, remove the trend component to make it stationary. Then carry out pattern recognition, parameter estimation, model diagnostics and forecasting.

ps: This flowchart is from my teacher's courseware. My personal feeling is that the pattern-recognition step should not include the parameter d: d belongs to the ARIMA(p,d,q) model, which is non-stationary, and the previous step has already converted the non-stationary series into a stationary one, so d should already be determined there. It may also be that the boundary between removing the trend and pattern recognition cannot be drawn so finely, yet a flowchart has to draw it somewhere, hence this layout.

1. Pattern recognition

Generally speaking, pattern recognition means identifying the orders p, d, q of the ARIMA(p,d,q) process. Commonly used tools are acf, pacf and eacf.
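Besides the ACF plots shown below, the other two tools can be called as follows (a sketch; eacf comes from the TSA package):

pacf(as.vector(co2), lag.max = 36)   # partial autocorrelation function
eacf(as.vector(diff(co2)))           # extended ACF table (from TSA)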

First look at its autocorrelation function

acf(as.vector(co2),lag.max = 36)                # autocorrelation function


The seasonal autocorrelation is strong: strong correlations at lags 12, 24, 36, ....

plot(diff(co2),ylab='1st Diff. of CO2',xlab='Year') # first difference, removes the overall upward trend


After first differencing, the overall upward trend has been eliminated. Next, look at the sample autocorrelation function of the differenced series

acf(as.vector(diff(co2)),lag.max = 36)          # ACF after first differencing


After one differencing, there is still strong seasonality in the series; applying seasonal differencing should result in a more parsimonious model.

plot(diff(diff(co2),lag=12),xlab='Year',ylab='1st & seasonal Diff.') # first plus seasonal (lag-12) difference


Plot its autocorrelation function (ci.type='ma' bases the confidence bounds at lag k on the hypothesis of an MA(k-1) process)

acf(as.vector(diff(diff(co2),lag=12)),lag.max=36,ci.type='ma')


After first and seasonal differencing, most of the seasonal effect has been removed. The sample ACF shows that, apart from the autocorrelations at lags 1 and 12, the differenced series has almost no remaining autocorrelation, so the model we build only needs to capture autocorrelation at lags 1 and 12.

In summary, consider building a multiplicative seasonal ARIMA(0,1,1)×(0,1,1)_12 model.

2. Parameter estimation

After the model is specified, its parameters need to be estimated. Since a multiplicative seasonal ARIMA model is just a special case of the general ARIMA model, it can be fitted directly with the arima() function.

m1.co2=arima(co2,order=c(0,1,1),seasonal=list(order=c(0,1,1),period=12))
print(m1.co2)
-------------------------------------------
Coefficients:
          ma1     sma1
      -0.5792  -0.8206
s.e.   0.0791   0.1137

sigma^2 estimated as 0.5446:  log likelihood = -139.54,  aic = 283.08

The first line of code above obtains the maximum likelihood estimates of the parameters: the innovation variance is estimated as sigma^2 = 0.5446, the log-likelihood is -139.54, and AIC = 283.08. The parameter estimates of the model are all highly significant, and the model is tested next.

ps: Why can the parameter estimates be called highly significant from these numbers, and what is the criterion? A rough rule is that an estimate more than about two standard errors away from zero is statistically significant.
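The ratios can be computed from the fitted object, e.g. (a sketch):

est <- coef(m1.co2)                 # parameter estimates
se  <- sqrt(diag(vcov(m1.co2)))     # their standard errors
round(est / se, 2)                  # ma1 ≈ -7.3, sma1 ≈ -7.2, both far beyond 2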

3. Diagnostic test

1 Residual analysis

First look at the time series plot of the residuals

plot(window(rstandard(m1.co2),start=c(1995,2)),ylab='Standardized Resi.',type='o');
abline(h=0)

Apart from some anomalous behavior in the middle of the series, the residual plot does not indicate any major irregularity in the model. Next, look at the sample autocorrelation function of the residuals

acf(as.vector(window(rstandard(m1.co2),start=c(1995,2))),lag.max=36)


The only statistically significant autocorrelation is at lag 22; its value is just -0.194, the correlation is very small, and dependence at lag 22 is hard to explain. Apart from this marginal significance at lag 22, the model seems to have captured the dependence structure of the series.

Note: wrapping the acf call in print() also outputs the numeric autocorrelation value at each lag.
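For example:

print(acf(as.vector(co2), lag.max = 36))   # plots and prints the ACF values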

2 Ljung-Box test

Next, look at the histogram of the residuals, then run the Ljung-Box test:

win.graph(width=3,height =3,pointsize = 8)
hist(window(rstandard(m1.co2),start=c(1995,2)),xlab='Standardized Resi.',ylab='Frequency')


The histogram is roughly bell-shaped, though not perfectly so. The Ljung-Box test on the residuals gives, with 22 degrees of freedom, χ² = 25.59 and p = 0.27, indicating that the model has captured the dependence in the time series.

Why does this indicate a good fit, and where do the 22 degrees of freedom come from?
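Presumably the 22 degrees of freedom come from testing up to lag 24 and subtracting the 2 estimated parameters via fitdf; a sketch of the reproduction (the exact statistic may differ slightly from the quoted values):

Box.test(resid(m1.co2), lag = 24, type = "Ljung-Box", fitdf = 2)   # df = 24 - 2 = 22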

The R function for the Ljung-Box test is as follows:

Box.test(x, lag = 1, type = c("Box-Pierce", "Ljung-Box"), fitdf = 0)
  • x: a time series; when testing model residuals, pass the residual series
  • lag: the maximum lag up to which autocorrelations are tested
  • type: set to "Ljung-Box" for the Ljung-Box test (the default is "Box-Pierce")
  • fitdf: if x is a series of residuals, the number of fitted parameters to subtract from the degrees of freedom

After calling the function, the value we care about is the p-value. If p > 0.05, we cannot reject the null hypothesis that the series is white noise (a purely random sequence): for raw data this means there is nothing left to model, while for residuals it means the model has captured the dependence. Otherwise the data are not white noise and are worth analysing further.

Examples are as follows:

x <- rnorm(100)                      # white noise for comparison
Box.test(x, lag = 5)
Box.test(x, lag = 10, type = "Ljung")
# for a fitted model's residuals, subtract the number of estimated parameters
a <- Box.test(resid(m1.co2), type = "Ljung", lag = 24, fitdf = 2)

Then draw the quantile-quantile plot (QQ plot)

win.graph(width=5,height =5,pointsize = 8)
qqnorm(window(rstandard(m1.co2),start=c(1995,2)))
abline(0, 1, col='red')   # reference line y = x

An outlier again appears in the upper tail of the QQ plot. However, the Shapiro-Wilk normality test gives W = 0.982 with p = 0.11, so normality is not rejected at the usual significance levels.
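The test can be reproduced with shapiro.test on the same residual window (a sketch):

shapiro.test(window(rstandard(m1.co2), start = c(1995, 2)))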

As a further test of the model, fit the larger ARIMA(0,1,2)×(0,1,1)_12 model as an overfitting check.

m2.co2=arima(co2,order=c(0,1,2),seasonal=list(order=c(0,1,1),period=12))
print(m1.co2)
print(m2.co2)
--------------------------------
arima(x = co2, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
Coefficients:
          ma1     sma1
      -0.5792  -0.8206
s.e.   0.0791   0.1137
sigma^2 estimated as 0.5446:  log likelihood = -139.54,  aic = 283.08
--------------------------------
arima(x = co2, order = c(0, 1, 2), seasonal = list(order = c(0, 1, 1), period = 12))

Coefficients:
          ma1      ma2     sma1
      -0.5714  -0.0165  -0.8274
s.e.   0.0897   0.0948   0.1224

sigma^2 estimated as 0.5427:  log likelihood = -139.52,  aic = 285.05

The estimates of θ1 (ma1) and Θ1 (sma1) change very little (relative to the size of their standard errors), and the estimate of the new parameter θ2 (ma2) is statistically indistinguishable from zero. Neither sigma^2, the log-likelihood, nor the AIC improves noticeably. So the ARIMA(0,1,2)×(0,1,1)_12 model is an overfit. By Occam's razor (do not multiply entities beyond necessity), keep the ARIMA(0,1,1)×(0,1,1)_12 model.
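The comparison can also be read off the fitted objects directly, e.g.:

c(m1 = m1.co2$aic, m2 = m2.co2$aic)   # the smaller AIC wins; here m1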

4. Forecast

Showing the last two years of observed data, forecast 24 months (2 years) ahead and plot the forecast values and forecast limits.

win.graph(width = 4.875,height = 3,pointsize = 8)
plot(m1.co2,n1=c(2003,1),n.ahead=24,xlab='Year',type='o',ylab='CO2 Levels')


Showing the last year of observed data, forecast 48 months (4 years) ahead and plot the forecast values and forecast limits.

win.graph(width = 4.875,height = 3,pointsize = 8)
plot(m1.co2,n1=c(2004,1),n.ahead=48,xlab='Year',type='b',ylab='CO2 Levels')

ps: About the parameters above: n1=c(2004,1) means the plot starts from January 2004 (the observed data end in December 2004), and n.ahead=48 means 48 values are forecast (12 values per year, so 4 years).
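Numeric forecasts and their standard errors can also be obtained directly with predict (a sketch):

fc <- predict(m1.co2, n.ahead = 12)   # forecasts for the next 12 months
fc$pred                               # point forecasts
fc$pred + 1.96 * fc$se                # approximate upper 95% forecast limits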

Source: blog.csdn.net/Gou_Hailong/article/details/124422470