Practical case of building a time series ARIMA model in Python

This article walks through the complete steps and workflow of building an ARIMA time series model in Python.

Recommended reading

  1. Using Python for the basics of time series analysis
  2. A practical case of building a multiplicative seasonal time series model in SPSS
  3. A practical case of building a time series ARIMA model in Python


Time series analysis concepts

Time series analysis is an important branch of statistics. It is a rapidly developing, highly applicable scientific method grounded in probability theory and mathematical statistics and supported by computing. A time series is a sequence of random variables ordered by time. A large number of statistical indicators in nature, the economy, and society are recorded by year, quarter, month, or day, and as time goes by these observations form time series of the indicators: stock price indices, price indices, GDP, and product sales are all time series.
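As a minimal illustration (a sketch using the first few values that appear in the data loaded below), a time series in pandas is simply a Series indexed by dates:

import pandas as pd

# Toy annual series indexed by a DatetimeIndex (values taken from the data shown later)
idx = pd.date_range("1952", periods=5, freq="AS")  # annual, year-start frequency
ts = pd.Series([100.0, 101.6, 103.3, 111.5, 116.5], index=idx, name="xt")
print(ts)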


Basic steps to build a model

(Figure: flowchart of the ARIMA modeling steps: stationarity test, white noise test, model order selection, parameter estimation, model checking, and forecasting.)


ARIMA model modeling practice

Import modules

import sys
import os
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
from statsmodels.tsa.stattools import adfuller 
from statsmodels.stats.diagnostic import acorr_ljungbox 
from statsmodels.graphics.api import qqplot
import matplotlib.pylab as plt
from matplotlib.pylab import style
style.use('ggplot')
from arch.unitroot import ADF
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.float_format', lambda x: '%.5f' % x) 
np.set_printoptions(precision=5, suppress=True) 
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
%matplotlib inline
"""中文显示问题"""
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['SimHei']

Load the data

data = pd.read_excel("data.xlsx",index_col="年份",parse_dates=True)
data.head()
xt
years
1952-01-01 100.00000
1953-01-01 101.60000
1954-01-01 103.30000
1955-01-01 111.50000
1956-01-01 116.50000
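The differenced columns diff1 and diff2 are used below. If they are not already stored in data.xlsx, they can be computed from xt with pandas (a sketch; the column names are the ones assumed by the later code):

data["diff1"] = data["xt"].diff(1)     # first difference of xt
data["diff2"] = data["diff1"].diff(1)  # second difference (difference of the first difference)
data.head()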

Stationarity test

Time series plot

data1 = data.loc[:,["xt","diff1","diff2"]]
data1.plot(subplots=True, figsize=(18, 12), title="Difference plots")

(Figure: time series plots of xt and its first and second differences.)

Time series plot check -- this relies solely on visual judgment and personal experience, and different people may reach different conclusions from the same plot. We therefore need a more convincing and objective statistical method to test the stationarity of the series: the unit root test.

Unit root test

print("单位根检验:\n")
print(ADF(data.diff1.dropna()))  
单位根检验:

   Augmented Dickey-Fuller Results   
=====================================
Test Statistic                 -3.156
P-value                         0.023
Lags                                0
-------------------------------------

Trend: Constant
Critical Values: -3.63 (1%), -2.95 (5%), -2.61 (10%)
Null Hypothesis: The process contains a unit root.
Alternative Hypothesis: The process is weakly stationary.


**Unit root test** -- applying the ADF test to the first difference, we compare the test statistic with the critical values for rejecting the null hypothesis at the 1%, 5%, and 10% levels. Here the p-value is 0.023, close to 0, and the test statistic of -3.156 is below the 5% and 10% critical values, so the null hypothesis of a unit root is rejected: the first-differenced data are stationary.
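adfuller from statsmodels was imported above but not used; as a cross-check, the same unit root test can be run with it (a sketch):

# ADF test on the first difference using statsmodels' adfuller
adf_stat, p_value, used_lags, n_obs, crit_values, _ = adfuller(data["diff1"].dropna())
print("ADF statistic: %.3f  p-value: %.3f" % (adf_stat, p_value))
print("Critical values:", crit_values)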

White noise test

Check whether the differenced series is a non-white-noise series

from statsmodels.stats.diagnostic import acorr_ljungbox
acorr_ljungbox(data.diff1.dropna(), lags = [i for i in range(1,12)],boxpierce=True)
(array([11.30402, 13.03896, 13.37637, 14.24184, 14.6937 , 15.33042,
        16.36099, 16.76433, 18.15565, 18.16275, 18.21663]),
 array([0.00077, 0.00147, 0.00389, 0.00656, 0.01175, 0.01784, 0.02202,
        0.03266, 0.03341, 0.05228, 0.07669]),
 array([10.4116 , 11.96391, 12.25693, 12.98574, 13.35437, 13.85704,
        14.64353, 14.94072, 15.92929, 15.93415, 15.9696 ]),
 array([0.00125, 0.00252, 0.00655, 0.01135, 0.02027, 0.03127, 0.04085,
        0.06031, 0.06837, 0.10153, 0.14226]))

Since p < α, the null hypothesis that the series is white noise is rejected, so the differenced series is a stationary non-white-noise series and we can move on to the next modeling step.
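The raw tuple returned above is hard to read. With the older statsmodels return format shown here, the Ljung-Box and Box-Pierce results can be wrapped in a DataFrame for readability (a sketch; newer statsmodels versions already return a DataFrame):

# Unpack the four arrays returned when boxpierce=True (older statsmodels API)
lb_stat, lb_p, bp_stat, bp_p = acorr_ljungbox(data["diff1"].dropna(),
                                              lags=[i for i in range(1, 12)],
                                              boxpierce=True)
lb_df = pd.DataFrame({"LB stat": lb_stat, "LB p-value": lb_p,
                      "BP stat": bp_stat, "BP p-value": bp_p},
                     index=range(1, 12))
print(lb_df)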

Model order selection

Now that we have a stationary time series, the next step is to choose an appropriate ARIMA model, that is, suitable values of p and q.
The first step is to examine the autocorrelation (ACF) and partial autocorrelation (PACF) plots of the stationary series, which can be drawn with sm.graphics.tsa.plot_acf and sm.graphics.tsa.plot_pacf, as sketched below.
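A sketch of that plotting code (the number of lags shown is an assumption):

# ACF and PACF of the first-differenced series
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
sm.graphics.tsa.plot_acf(data["diff1"].dropna(), lags=12, ax=axes[0])   # autocorrelation
sm.graphics.tsa.plot_pacf(data["diff1"].dropna(), lags=12, ax=axes[1])  # partial autocorrelation
plt.show()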

(Figure: ACF and PACF plots of the first-differenced series.)

From the ACF and PACF plots of the first-differenced series we can see that:

  1. the autocorrelation function tails off (or cuts off after lag 1);
  2. the partial autocorrelation function cuts off after lag 1;
  • so we can consider building ARIMA(1,1,0), ARIMA(1,1,1), and ARIMA(0,1,1) models.

Model optimization

(Figure: the information criteria used for model comparison. In their standard forms, AIC = 2k - 2ln(L), BIC = k·ln(n) - 2ln(L), and HQIC = 2k·ln(ln(n)) - 2ln(L).)

  • Here L is the maximized likelihood of the model, n is the number of observations, and k is the number of parameters in the model.
    The Python code is as follows:
arma_mod20 = sm.tsa.ARIMA(data["xt"], order=(1,1,0)).fit()
arma_mod30 = sm.tsa.ARIMA(data["xt"], order=(0,1,1)).fit()
arma_mod40 = sm.tsa.ARIMA(data["xt"], order=(1,1,1)).fit()
values = [[arma_mod20.aic, arma_mod20.bic, arma_mod20.hqic],
          [arma_mod30.aic, arma_mod30.bic, arma_mod30.hqic],
          [arma_mod40.aic, arma_mod40.bic, arma_mod40.hqic]]
df = pd.DataFrame(values, index=["AR(1,1,0)", "MA(0,1,1)", "ARMA(1,1,1)"],
                  columns=["AIC", "BIC", "HQIC"])
df
             AIC       BIC       HQIC
AR(1,1,0)    253.09159 257.84215 254.74966
MA(0,1,1)    251.97340 256.72396 253.63147
ARMA(1,1,1)  254.09159 258.84535 259.74966
  • The statistical idea behind these criteria is the same: they account for the fitting residuals while imposing a "penalty" for the number of parameters. Note, however, that these criteria are only relative measures: for three models A, B, and C we may judge B to be the best of the three, but that does not guarantee that B characterizes the data well.
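In practice, the candidate with the smallest criterion values is preferred. A short sketch using the comparison table df built above:

print(df.idxmin())                         # the best model under each criterion
print("Chosen by AIC:", df["AIC"].idxmin())  # e.g. pick the model with the lowest AIC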

Parameter Estimation

from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(data["xt"], order=(0,1,1))
result = model.fit()
print(result.summary())
                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                   D.xt   No. Observations:                   36
Model:                 ARIMA(0, 1, 1)   Log Likelihood                -122.987
Method:                       css-mle   S.D. of innovations              7.309
Date:                Tue, 22 Dec 2020   AIC                            251.973
Time:                        09:11:55   BIC                            256.724
Sample:                    01-01-1953   HQIC                           253.631
                         - 01-01-1988                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9956      2.014      2.481      0.013       1.048       8.943
ma.L1.D.xt     0.6710      0.165      4.071      0.000       0.348       0.994
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
MA.1           -1.4902           +0.0000j            1.4902            0.5000
-----------------------------------------------------------------------------

Model checking


Significance test of parameters


Since p < α for each coefficient, we reject the null hypothesis that the coefficient is zero: the parameters of the ARIMA(0,1,1) model are significantly non-zero.
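The coefficient estimates and their p-values can also be read directly from the fitted result object (a sketch using the result fitted above):

print(result.params)   # estimated coefficients
print(result.pvalues)  # p-value of the significance test for each coefficient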

Significance test of the model


resid = result.resid  # residuals
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
fig = qqplot(resid, line='q', ax=ax, fit=True)

(Figure: QQ plot of the model residuals.)

The QQ plot shows the residual quantiles lying close to the reference line for N(0,1), a good indication that the residuals are approximately normally distributed. This suggests that the residual series is white noise and that the model has extracted the information in the data adequately. The Ljung-Box (LB) white noise test introduced earlier can also be used to confirm this, as sketched below.
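A sketch of that Ljung-Box check on the residuals (the lags chosen are an assumption; with the older statsmodels API used here the function returns arrays of statistics and p-values):

lb_stat, lb_p = acorr_ljungbox(resid, lags=[6, 12])
print("LB p-values:", lb_p)  # large p-values: the residuals show no significant autocorrelation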

The ARIMA(0,1,1) model fits the series: the residual series is white noise and the parameters are all significantly non-zero, so ARIMA(0,1,1) is an effective fitted model for this series.

Model prediction

pred = result.predict('1988', '1990',dynamic=True, typ='levels')
print (pred)
1988-01-01   278.35527
1989-01-01   283.35088
1990-01-01   288.34649
Freq: AS-JAN, dtype: float64
plt.figure(figsize=(12, 8))
plt.xticks(rotation=45)
plt.plot(pred)
plt.plot(data.xt)
plt.show()
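With the older arima_model API used here, forecast() also returns standard errors and confidence intervals, which are useful alongside the point forecasts (a sketch; the 3-step horizon matches the prediction above):

fc, se, conf_int = result.forecast(steps=3)  # point forecasts, standard errors, 95% intervals
print(fc)
print(conf_int)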

(Figure: the observed series together with the 1988-1990 forecast.)

Recommended reading

  1. Analysis of crawling danmaku (bullet comments) from a Bilibili video
  2. Using XPath to crawl data
  3. Using Jupyter Notebook
  4. Crawling the Douban Top 250 movies with BeautifulSoup
  5. One article to help you master the requests module
  6. Python web crawler basics: BeautifulSoup

That's all for this article. If it helped you, feel free to like and follow; your likes mean a lot to me.


Original article: https://blog.csdn.net/qq_45176548/article/details/111504846