Building an ARIMA model in Python

1 Visual analysis

1.1 Analysis of air quality and air detection components

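import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.diagnostic import acorr_ljungbox  # white noise (Ljung-Box) test

filename = r'C:\Users\Administrator\Desktop\data.csv'
data = pd.read_csv(filename)
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese column labels
# box plot of SO2 by air quality level ('质量等级')
sns.catplot(x='质量等级', y='SO2', kind='box', data=data)
plt.title('Box plot of SO2 by air quality level')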

[Figure: box plot of SO2 by air quality level]

The figure above shows that the median SO2 content under moderate pollution is higher than under the "good" quality level. It is reasonable to believe that the SO2 content has only a small effect on air quality.
Following the same approach, plot each of the other air components separately.
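As a numeric companion to the box plots, the group medians can be tabulated for several components at once. This is only a minimal sketch on synthetic data: the columns NO2 and PM2.5, the quality-level labels, and all values are assumptions made for illustration. Only SO2 and 质量等级 appear in the original data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data.csv (all values are made up for illustration)
rng = np.random.default_rng(0)
levels = ['excellent', 'good', 'light pollution', 'moderate pollution']  # assumed labels
demo = pd.DataFrame({
    '质量等级': rng.choice(levels, size=200),
    'SO2': rng.gamma(2.0, 5.0, size=200),
    'NO2': rng.gamma(3.0, 8.0, size=200),     # hypothetical column
    'PM2.5': rng.gamma(2.5, 15.0, size=200),  # hypothetical column
})

components = ['SO2', 'NO2', 'PM2.5']
# Median of each component within each quality level: the centre line of each box
medians = demo.groupby('质量等级')[components].median()
print(medians)
```

With the real data, replacing demo by data and passing each name in components as y to sns.catplot would give one box plot per component.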

1.2 Air composition by air quality level over different time periods

To explore how the air composition changes over consecutive time periods, air quality level, time, and component content are compared and screened together. The result is as follows:

[Figure: air component content by air quality level and time period]

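1.3 Analysis of air component content by season

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese column labels
# box plot of SO2 by season ('季节')
sns.catplot(x='季节', y='SO2', kind='box', data=data)
plt.title('Box plot of SO2 by season')

[Figure: box plot of SO2 by season]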

2 Time series ARIMA model analysis

2.1 Overview of the ARMA model

The autoregressive moving average (ARMA) model is an important method for studying time series. In economic forecasting, it accounts both for the dependence of an economic phenomenon on its own past values and for the interference of random fluctuations. It forecasts short-term trends with relatively high accuracy and is one of the more widely used methods in recent years. The ARMA model was proposed by the statisticians G. E. P. Box and G. M. Jenkins in the 1970s. It has three basic types: the autoregressive model AR(p), the moving average model MA(q), and the autoregressive moving average model ARMA(p,q). The ARMA(p,q) model can be expressed as:

$$Y_t = \varphi_1 Y_{t-1} + \cdots + \varphi_p Y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$

Here $p$ is the order of the autoregressive part and $q$ is the order of the moving average part; $Y_t$ is the value of the time series at time $t$; $\varphi_1, \ldots, \varphi_p$ are the autoregressive coefficients; $\theta_1, \ldots, \theta_q$ are the moving average coefficients; and $\varepsilon_t$ is the error or deviation of the time series in period $t$.
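To make the ARMA(p,q) recursion concrete, here is a small simulation sketch using only NumPy. The function name simulate_arma and the coefficient values are assumptions chosen for illustration, not estimates from the AQI data. The moving average terms are added with a plus sign.

```python
import numpy as np

def simulate_arma(phi, theta, n, seed=0):
    """Simulate Y_t = sum(phi_i * Y_{t-i}) + eps_t + sum(theta_j * eps_{t-j})."""
    rng = np.random.default_rng(seed)
    p, q = len(phi), len(theta)
    eps = rng.standard_normal(n)  # white-noise errors
    y = np.zeros(n)
    for t in range(n):
        # Terms that would reach before the start of the series are dropped
        ar = sum(phi[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        y[t] = ar + eps[t] + ma
    return y, eps

# ARMA(2,1) with illustrative coefficients
y, eps = simulate_arma(phi=[0.5, -0.2], theta=[0.4], n=300)
print(y[:5])
```

Setting theta=[] gives a pure AR(p) series and phi=[] a pure MA(q) series, which are the other two basic types mentioned above.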
2.2 Trend analysis and unit root test
The line plot of the AQI series is shown below. The AQI has a certain downward trend and the series is not stationary, so the first-order difference of the data is taken.
plt.plot(data[u'AQI'])
plt.title('AQI line chart for different time periods')
[Figure: line chart of AQI over time]
Sequence diagram after first-order difference:

AQIdata=data[u'AQI']
D_data = AQIdata.diff().dropna()
plt.plot(D_data)
plt.title('AQI line trend graph for different time periods after first-order difference')
[Figure: line chart of AQI after first-order difference]
As the figure above shows, the data tends to be stationary after the first-order difference, so the ADF unit root test is further performed on the differenced data. The test results are as follows:
print(u'The white noise test result of the difference sequence is:', acorr_ljungbox(D_data, lags=1))

AQI unit root test for first-order difference
                                          t-Statistic   Prob.*
Augmented Dickey-Fuller test statistic    -6.214        0.0000
Test critical values:   1% level          -3.527
                        5% level          -2.903
                        10% level         -2.589
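from statsmodels.tsa.stattools import adfuller  # ADF unit root test
print('Test result of the first-order difference sequence:', adfuller(D_data))

The associated probability P is far less than 0.01, so the unit root hypothesis is rejected and the differenced sequence is considered stationary.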
2.3 Model identification and selection
After calculating the sample autocorrelation coefficients and partial autocorrelation coefficients, we choose an appropriate ARMA model to fit the observed sequence according to their properties. This process amounts to estimating the autoregressive order p and the moving average order q from the properties of the sample autocorrelation and partial autocorrelation coefficients, so model identification is also called order determination. The basic principles of ARMA model order determination are shown in Table 2:

Table 2 Principles of ARMA(p,q) model selection
ACF                     PACF                    Model
Tails off               Cuts off after lag p    AR(p)
Cuts off after lag q    Tails off               MA(q)
Tails off               Tails off               ARMA(p,q)

Applying Python to the differenced data gives the sample autocorrelation and partial autocorrelation plots shown below:
[Figure: sample autocorrelation (ACF) plot of the differenced AQI series]
[Figure: sample partial autocorrelation (PACF) plot of the differenced AQI series]
Based on the autocorrelation and partial autocorrelation plots of the first-order differenced sequence, two candidate models can be selected. In the first, the autocorrelation coefficient tails off and the partial autocorrelation coefficient cuts off after lag 1; the selected model can be ARIMA(1,1,2). In the second, the autocorrelation coefficient cuts off after lag 2 and the partial autocorrelation coefficient cuts off after lag 1; the selected model can be ARIMA(2,1,1).
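2.4 Parameter estimation
After selecting the model to fit, the next step is to use the observed values of the sequence to determine the specification of the model, that is, to estimate the values of the unknown parameters in the model. For a non-centered ARMA(p,q) model, we have

$$Y_t = \mu + \frac{\Theta_q(B)}{\Phi_p(B)}\,\varepsilon_t$$

where $B$ is the backshift operator, $\Phi_p(B) = 1 - \varphi_1 B - \cdots - \varphi_p B^p$, $\Theta_q(B) = 1 + \theta_1 B + \cdots + \theta_q B^q$, and $\varepsilon_t$ is white noise with variance $\sigma_\varepsilon^2$.

The model contains a total of $p+q+2$ unknown parameters: $\varphi_1, \ldots, \varphi_p, \theta_1, \ldots, \theta_q, \mu, \sigma_\varepsilon^2$. There are three estimation methods for the unknown parameters: moment estimation, maximum likelihood estimation, and least squares estimation. This article uses least squares estimation to estimate the parameters of the sequence.
In the case of the (centered) ARMA(p,q) model, write

$$\beta = (\varphi_1, \ldots, \varphi_p, \theta_1, \ldots, \theta_q)', \qquad F_t(\beta) = \varphi_1 Y_{t-1} + \cdots + \varphi_p Y_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$$

The residual term is:

$$\varepsilon_t = Y_t - F_t(\beta)$$

The residual sum of squares is:

$$Q(\beta) = \sum_{t=1}^{n} \varepsilon_t^2 = \sum_{t=1}^{n} \left( Y_t - \varphi_1 Y_{t-1} - \cdots - \varphi_p Y_{t-p} - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q} \right)^2$$

The set of parameter values that minimizes the residual sum of squares is the least squares estimate of $\beta$.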
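The idea of minimizing the residual sum of squares can be sketched for the smallest mixed case, a zero-mean ARMA(1,1). Everything here is an assumption for illustration: the data are simulated, the recursion starts from a residual of 0, and scipy.optimize.minimize with the Nelder-Mead method does the search. The ARIMA class in statsmodels uses its own estimation routine, so its numbers need not match this sketch exactly.

```python
import numpy as np
from scipy.optimize import minimize

def residuals(beta, y):
    """eps_t = y_t - phi*y_{t-1} - theta*eps_{t-1}, starting from eps_0 = 0."""
    phi, theta = beta
    eps = np.zeros(len(y))
    for t in range(1, len(y)):
        eps[t] = y[t] - phi * y[t - 1] - theta * eps[t - 1]
    return eps

def rss(beta, y):
    """Residual sum of squares Q(beta)."""
    return float(np.sum(residuals(beta, y) ** 2))

# Simulate a zero-mean ARMA(1,1) series with known (illustrative) coefficients
rng = np.random.default_rng(0)
n, true_phi, true_theta = 2000, 0.6, 0.3
e = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = true_phi * y[t - 1] + e[t] + true_theta * e[t - 1]

# Minimise Q(beta) numerically; the starting point is an arbitrary choice
result = minimize(rss, x0=[0.1, 0.1], args=(y,), method='Nelder-Mead')
phi_hat, theta_hat = result.x
print(f'phi_hat={phi_hat:.3f}, theta_hat={theta_hat:.3f}, Q={result.fun:.1f}')
```

The estimates should land near the coefficients used in the simulation, which is a quick way to check that the residual recursion matches the model equation.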
The parameter estimates of the two candidate models can be obtained in Python as shown below:

# fit model

from pandas import DataFrame
from matplotlib import pyplot
import pandas as pd
# current statsmodels API; the older statsmodels.tsa.arima_model module is gone from recent releases
from statsmodels.tsa.arima.model import ARIMA

filename = r'C:\Users\Administrator\Desktop\data.csv'
data = pd.read_csv(filename)
# candidate 1: ARIMA(2,1,1); d=1 applies the first-order difference internally
model = ARIMA(data.AQI, order=(2,1,1))
model_fit = model.fit()  # the current API takes no disp argument
print(model_fit.summary())

The model results are as follows:

Assuming that AQI is represented by Y, then

# candidate 2: ARIMA(1,1,2)
model = ARIMA(data.AQI, order=(1,1,2))
model_fit = model.fit()
print(model_fit.summary())

Comparing the AIC and BIC of the two models, ARIMA(2,1,1) is selected. The fitted expression of the model is:
[Figures: fitted ARIMA(2,1,1) model]

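# plot residual errors of the selected ARIMA(2,1,1) model
# (model_fit above holds the ARIMA(1,1,2) fit, so refit the selected model first)
model_fit = ARIMA(data.AQI, order=(2,1,1)).fit()
residuals = DataFrame(model_fit.resid)
residuals.plot()  # residuals over time
pyplot.show()
residuals.plot(kind='kde')  # density of the residuals
pyplot.show()
print(residuals.describe())  # summary statistics of the residuals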

Origin blog.csdn.net/tandelin/article/details/105399401