ARIMA model-[SPSS & Python]

Reprints are welcome; please cite the source: https://blog.csdn.net/qq_41709378/article/details/105869122
———————————————————————— ————————————————————————————

Introduction:
  The ARIMA model (Autoregressive Integrated Moving Average model), also called the differenced integrated moving average autoregressive model ("moving" is sometimes rendered as "sliding"), is one of the methods for time series forecasting. In ARIMA(p, d, q), AR stands for "autoregressive" and p is the number of autoregressive terms; MA stands for "moving average" and q is the number of moving average terms; d is the number of differences (the order of differencing) needed to make the series stationary.
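
For reference, the standard textbook form of ARIMA(p, d, q) can be written with the backshift operator B, where B y_t = y_{t-1}. The symbols \phi_i, \theta_j, c and \varepsilon_t are notation introduced here for illustration; they do not appear in the original post:

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d} \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
```

With d = 1, the factor (1 - B) y_t = y_t - y_{t-1} is exactly the first-order difference used later in this post, and \varepsilon_t is the white noise term.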
  
  My graduation thesis involves modeling and analyzing time series data (commodity sales), mainly in order to forecast it. A simple scatter plot shows that the data is seasonal, that is, it fluctuates cyclically, and that earlier values influence later ones, which matches how commodity sales fluctuate over time. So I chose the ARIMA model. But why not the AR, MA, or ARMA model?

In this post you will learn:
(1) How to run the ARIMA model in SPSS
(2) How to test in Python whether data is white noise
(3) Why differencing is needed and how to determine the order

PS: The Python code for fitting the ARIMA model is appended at the end of this post.
Why use SPSS?
   Simply because it turned out to be so convenient: SPSS provides the ARIMA, AR, and MA models out of the box, along with outlier handling, differencing, white noise testing, and order determination. Isn't it nice to get all of that without programming and without repeatedly changing code?

Steps of the ARIMA model

First, the reason for using the ARIMA model:
  Past data has a certain influence on today's data. If past data has no influence on today's data, the ARIMA model is not suitable for time series forecasting.
Steps for modeling with ARIMA:

Insert picture description here
To put it simply, modeling with ARIMA can be divided into the following four steps:
(1) Obtain the original data and preprocess it (fill in missing values, replace outliers).
(2) Check whether the preprocessed data is stationary. If it is not, difference the data.
(3) Run a white noise test on the stationary data. If it is not white noise, there is still correlation in the data, and the orders p and q of ARIMA(p, d, q) need to be determined.
(4) When the finally tested data (the model residuals) is white noise, modeling is finished.

The next step is to use SPSS and Python for practical operations.

1 Raw data preprocessing

The data comes from Problem B of the 2019 Central China Mathematical Modeling Competition. After filtering part of it, we obtain the data used for modeling, as shown in the figure below:
Insert picture description here
Of course, after getting the new data, we need to fill in its missing values and check for outliers. These preprocessing operations are not shown here; the corresponding guides are linked below:
Missing value filling: https://jingyan.baidu.com/article/d8072ac456536bec95cefdb6.html
Outlier handling: https://wenku.baidu.com/view/bd0289ca6d85ec3a87c24028915f804d2b1687aa.html?fr=search

2 Stationarity test

After obtaining the preprocessed data, we can move on to the stationarity test. Put simply, a series is stationary if it fluctuates around a fixed value; in mathematical terms, its mean and variance do not change over time. So we can draw a scatter plot of the data in SPSS and judge from the plot whether the data is stationary. If it is not stationary, it needs to be differenced.
Insert picture description here
Observing the plot, we can see that the original data has only very weak seasonality and is non-stationary: since December 2018, the sales volume of product SS73210 has dropped markedly instead of fluctuating around a fixed value. Therefore, a first-order difference is applied to the data.
Why difference? Differencing is how non-stationary data is handled: it removes the trend so that the series becomes stationary.
Insert picture description here
Insert picture description here

The data after differencing:
Insert picture description here
After obtaining the stationary data, we still need to carry out a white noise test.
Below, the differenced values are obtained so that they can then be processed with Python.
Insert picture description here

Insert picture description here
Insert picture description here
Finally, we obtain the first-order differenced data: SS73210_1
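
The first-order difference computed in SPSS above is simply each value minus the previous one (the complete script at the end of this post does the same with `data_.diff().dropna()`). The following is a minimal pure-Python sketch of that operation; the helper names `difference` and `half_means` and the sales numbers are made up for illustration and are not part of the original analysis:

```python
def difference(series, d=1):
    """Return the d-th order difference of a sequence of numbers."""
    values = list(series)
    for _ in range(d):
        # Each pass replaces the series by x[t] - x[t-1], shortening it by one.
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


def half_means(values):
    """Mean of the first and second half: a rough check for a drifting level."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return sum(first) / len(first), sum(second) / len(second)


# Hypothetical sales series: a steady downward trend plus a small wiggle.
sales = [100 - 2 * t + (-1) ** t for t in range(20)]
diff_1 = difference(sales, d=1)

print("half means before differencing:", half_means(sales))
print("half means after differencing: ", half_means(diff_1))
```

Before differencing the two half means are far apart (the level drifts downward); after differencing they are nearly equal, which is the behaviour expected of a series that fluctuates around a fixed value.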

3 White noise test

After obtaining the differenced data SS73210_1, use Python to perform a white noise test. The purpose of the test is to check whether a series that fluctuates around a fixed level does so purely at random.

(White noise: purely random data. If Sig > 0.05, the series is white noise, and historical data cannot be used to predict or infer the future. For model residuals, this means the residual ACF lies within the confidence interval and can be regarded as 0; in other words, the influence of past data on today's data has already been extracted by the model.)

The next step is to use Python to test whether the series is white noise:

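'''
1. Test whether the first-order differenced data is white noise
'''
import pandas as pd
from statsmodels.stats.diagnostic import acorr_ljungbox as lb

# Load the first-order differenced series
path = 'D:/Python/Python_learning/HBUT/预处理/ARIMA.xlsx'
df1 = pd.read_excel(path)

# Ljung-Box test at lag 1. In the statsmodels version used here the result is a
# tuple (test statistic, p-value); newer versions return a DataFrame instead.
p_value = lb(df1, lags= 1)
print('White noise test p-value:', p_value)

Test result:

White noise test p-value: (array([28.53145736]), array([9.21884666e-08]))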

Result analysis: The null hypothesis is that the data is white noise. The p-value of the test is 9.21884666e-08, far below 0.05, so under the null hypothesis such a result would be a small-probability event. We therefore reject the null hypothesis: the data is not white noise, and we go on to determine the order of the ARIMA model.
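
As a sanity check, the p-value in the test result above can be reproduced from the test statistic alone. With a single lag, the Ljung-Box statistic Q is compared against a chi-square distribution with 1 degree of freedom, whose upper tail equals erfc(sqrt(Q/2)). The function names below are made up for illustration, and `ljung_box_lag1` is a simplified lag-1 sketch, not the statsmodels implementation:

```python
import math


def chi2_sf_1df(q):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(q / 2.0))


def ljung_box_lag1(series):
    """Ljung-Box Q statistic and p-value using lag 1 only."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)                         # lag-0 sum of squares
    c1 = sum(a * b for a, b in zip(dev[1:], dev[:-1]))   # lag-1 cross products
    rho1 = c1 / c0                                       # lag-1 autocorrelation
    q = n * (n + 2) * rho1 ** 2 / (n - 1)
    return q, chi2_sf_1df(q)


# Statistic reported in the test result above.
print(chi2_sf_1df(28.53145736))

# A strongly alternating toy series is clearly not white noise.
q, p = ljung_box_lag1([1, -1] * 10)
print(q, p)
```

The first print gives roughly 9.22e-08, matching the reported p-value; the toy series also yields a p-value far below 0.05.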

4 Order determination

Determining the order of the ARIMA model involves two parameters, p and q. In general, the order can be read from whether the autocorrelation (ACF) and partial autocorrelation (PACF) plots cut off (truncate). Here SPSS is used to determine the order, and the significance (Sig) value is then used to check whether the parameters of the resulting model are reliable enough.

1: The automatic approach in SPSS: "Expert Modeler"
Insert picture description here

Insert picture description here
2: Alternatively, set the parameters p, d, and q yourself through the "ARIMA" method.
Insert picture description here
The parameters p, d, and q of the model I chose here are all 1; that is, with first-order differencing, p (number of autoregressive terms) and q (number of moving average terms) are both 1.
The following is the model result after using the above model:

Insert picture description here
Insert picture description here
It can be seen from the figures that with the ARIMA(1, 1, 1) model the significance is 0.135, which is greater than 0.05. The residuals of this model are therefore considered white noise; in other words, the influence of past data on current data has already been extracted by the model.
The residual ACF and residual PACF plots also show the remaining correlation. If most of the values lie between the two lines, the residual correlation is weak or almost absent, which means the information about the influence of past data has already been extracted.
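
The "between the two lines" check can be sketched numerically. The snippet below computes sample autocorrelations and flags the lags that fall outside an approximate 95% band of ±1.96/sqrt(n), the usual band under the white noise assumption. This is an assumption-laden sketch: SPSS may compute its confidence limits differently, and the function names and the toy series are made up for illustration:

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]


def lags_outside_band(residuals, max_lag):
    """Lags whose autocorrelation lies outside the approximate 95% band."""
    band = 1.96 / math.sqrt(len(residuals))
    acf = sample_acf(residuals, max_lag)
    return [k for k, r in enumerate(acf, start=1) if abs(r) > band]


# Toy series with almost no lag-1 correlation but a strong lag-2 pattern.
toy = [1, 1, -1, -1] * 5
print(lags_outside_band(toy, max_lag=2))
```

For this toy series only lag 2 is reported. For residuals of a well-fitted model, the returned list should be empty or nearly empty.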

PS: Determining the order of the parameters in Python

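'''
# Relatively optimal model (p, q)
data_ = data_.astype(float)  # convert sales to float
# Order determination
pmax = int(len(D_data)/30) # capped at length/30 here; a common rule of thumb is that the order does not exceed length/10
qmax = int(len(D_data)/30) # same cap for q
bic_matrix = [] # BIC matrix
for p in range(pmax+1):
    tmp = []
    for q in range(qmax+1):
        try: # some (p, q) combinations fail to fit, so skip them with try
            tmp.append(ARIMA(data_, (p, 1, q)).fit().bic)
        except Exception:
            tmp.append(None)
    bic_matrix.append(tmp)

bic_matrix = pd.DataFrame(bic_matrix) # the minimum can be found from this table
p, q = bic_matrix.stack().idxmin() # flatten with stack, then locate the minimum with idxmin
print('p and q with the smallest BIC: %s, %s' % (p, q))
'''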
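
The selection step at the end of the snippet above (flatten the BIC table, then take the position of the minimum) can be illustrated without pandas or statsmodels. The BIC values below are hypothetical, invented only to show the mechanics, and `best_order` is an illustrative helper rather than part of the original code:

```python
def best_order(bic_matrix):
    """Return (p, q) with the smallest BIC, skipping failed fits stored as None."""
    candidates = [
        (bic, p, q)
        for p, row in enumerate(bic_matrix)
        for q, bic in enumerate(row)
        if bic is not None
    ]
    if not candidates:
        raise ValueError("no model could be fitted")
    _, p, q = min(candidates)
    return p, q


# Hypothetical BIC values (rows: p = 0..2, columns: q = 0..2); None marks a failed fit.
bic_matrix = [
    [None, 520.3, 518.9],
    [515.2, 512.7, None],
    [516.0, 513.1, 514.4],
]
print(best_order(bic_matrix))
```

With these made-up numbers the smallest BIC sits at p = 1, q = 1, so the sketch prints (1, 1).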

5 Prediction

After selecting the parameters, we use the model to forecast sales for the next 5 days.
Here the forecast is made in Python:

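# After choosing p and q, forecast with ARIMA
model = ARIMA(data_, (p,1,q) ).fit()  # fit the ARIMA(p, 1, q) model, here ARIMA(1, 1, 1)
model.summary2()  # produce a model report
r = model.forecast(5)  # forecast the next five days
pro_r = r[0]  # first element of the result: the point forecasts

Forecast result:

Forecast for the next five days:
[ 9.49325086  9.25931922 10.35808756  8.96617407  9.23941594]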
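
With d = 1 the model describes day-to-day changes, while the forecast values listed above are on the scale of the sales themselves. Conceptually, forecasts of first differences are turned back into levels by adding their running sum to the last observed value. The sketch below shows only that idea; it is not how statsmodels implements forecasting, and the function name and numbers are hypothetical:

```python
from itertools import accumulate


def integrate_forecast(last_value, diff_forecasts):
    """Turn forecasts of first differences back into forecasts of the level."""
    return [last_value + s for s in accumulate(diff_forecasts)]


# Hypothetical numbers: last observed sales value and predicted day-to-day changes.
print(integrate_forecast(9.0, [0.5, -0.25, 1.0]))
```

Here the predicted changes +0.5, -0.25 and +1.0 applied to a last value of 9.0 give the levels 9.5, 9.25 and 10.25.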

Finally, here is the complete Python code for the ARIMA workflow:

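# -*- coding: utf-8 -*-
# @Time    : 2020/4/3 22:50
'''
1. Model used: ARIMA
'''

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns     # seaborn plots look nicer and need less code, but are less customizable
from statsmodels.graphics.tsaplots import plot_acf  # autocorrelation plot
from statsmodels.tsa.stattools import adfuller as ADF  # stationarity test
from statsmodels.graphics.tsaplots import plot_pacf    # partial autocorrelation plot
from statsmodels.stats.diagnostic import acorr_ljungbox    # white noise test
# Legacy ARIMA import; this module has been removed in recent statsmodels releases,
# where the model is provided by statsmodels.tsa.arima.model.ARIMA instead
from statsmodels.tsa.arima_model import ARIMA
# seaborn is built on top of matplotlib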


# Load the file and select the data ('日期' is the date column of the sheet)
inputfile = 'D:/Python/Python_learning/HBUT/model_3/test_four.xlsx'
data = pd.read_excel(inputfile ,sheet_name= 'Sheet2', index_col = '日期')
print(data.head())
print(data[-5:])
data_1 = data['SS81346']; data_2 = data['SS81004']
data_3 = data['SS73210']; data_4 = data['SS81516']; data_5 = data['SS81376']
data_ = data_5

# seaborn background settings
sns.set(color_codes=True)
plt.rcParams['font.sans-serif'] = ['SimHei'] # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False # display the minus sign correctly
plt.rcParams['figure.figsize'] = (8, 5)   # set the output figure size


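# Autocorrelation plot
# Autocorrelation coefficients that stay above zero for many lags indicate a strongly autocorrelated series
f = plt.figure(facecolor='white')
ax1 = f.add_subplot(1, 1, 1)
data_drop = data_.dropna()  # drop missing values from the series
plot_acf(data_drop, lags=31, ax=ax1)


# Stationarity check
print('ADF test result for the original series:')
print(ADF(data_)) # ADF (unit root) test on the sales series
# The p-value of the unit root statistic is well above 0.05, so the series is judged to be non-stationary

# Time series plot after first-order differencing
f = plt.figure(facecolor='white')
ax2 = f.add_subplot(1, 1, 1)
D_data = data_.diff().dropna()   # first-order difference, dropping NaN values
D_data.plot(ax = ax2)
print('ADF test result for the first-order differenced series:')
print(ADF(D_data))
# The p-value is far below 0.05, so the first-order differenced series is stationary
# (that it is not white noise is established separately by the Ljung-Box test)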


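# Plot the ACF and PACF of the first-order differenced series
f = plt.figure(facecolor='white')
ax3 = f.add_subplot(2, 1, 1)
plot_acf(D_data, lags=31, ax=ax3) # autocorrelation
ax4 = f.add_subplot(2, 1, 2)
plot_pacf(D_data, lags=31, ax=ax4) # partial autocorrelation


p = 1
q = 1
# After choosing p and q, forecast with ARIMA
model = ARIMA(data_, (p,1,q) ).fit() # fit the ARIMA(1, 1, 1) model
print(model.summary2()) # print a model report
r = model.forecast(5) # forecast the next five days
pro_r = r[0] # first element of the result: the point forecasts
print('Forecast for the next five days:')
print(pro_r)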


# Put the forecast values into a Series so they can be added to the plot
# (the series is named after the product actually modeled, SS81376)
pre_data = pd.Series(pro_r, index=['2019/03/13', '2019/03/14', '2019/03/15', '2019/03/16', '2019/03/17'], name='SS81376')
pre_data.index.name = '日期'

# Plot
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(data_, 'k', label='observed')
ax.plot(pre_data,'r', label='forecast')
ax.set_title('Product: SS81376')
ax.set_xlabel('Date')
ax.set_ylabel('Sales')
ax.set_xticks(['2018/09/01', '2018/10/01', '2018/11/01',
               '2018/12/01', '2019/01/01', '2019/02/01', '2019/02/28', '2019/03/18'])
ax.legend()  # show which line is observed data and which is the forecast
plt.show()

Image:
Insert picture description here
In the figure, the black line is the original data and the red line is the sales forecast. The forecast for the next 5 days looks reasonable, which reflects the predictive power of the model.


References:
1:https://www.bilibili.com/video/BV1J7411d7wT?from=search&seid=16121932388884174904
2:https://blog.csdn.net/qq_41709378/article/details/105812871
3:https://baike.baidu.com/item/ARIMA模型/10611682?fr=aladdin
