Time series analysis and prediction with the ARIMA model in Python

ARIMA model prediction

Time series analysis and prediction builds a model of an existing time-ordered data series and uses it to predict future values. Examples include an airline's daily passenger count or the number of people in a given area; such data usually follows periodic patterns. As shown in the figure below, some data repeats a simple periodic cycle, while other data shows a periodic pattern that changes over time.

   

ARIMA (Autoregressive Integrated Moving Average model) can both fit existing data well and predict future data on that basis. The function prototype is ARIMA(data, order=(p, d, q)), so in addition to the original data (data), three parameters p, d, and q are required. These three parameters follow from the principles behind ARIMA.

ARIMA can be split into AR, I, and MA, corresponding to the three parameters p, d, and q, respectively.

  1. AR (AutoRegressive model) describes the relationship between the current value and historical values, using the variable's own past data to predict itself. The autoregressive process is defined as $y_t = \mu + \sum_{i=1}^{p} \gamma_i y_{t-i} + \epsilon_t$, where $y_t$ is the current value, $\mu$ is the constant term, $p$ is the order, $\gamma_i$ is the autocorrelation coefficient, and $\epsilon_t$ is the error. The parameter p is therefore the order of the autoregressive model.
  2. MA (Moving Average) models the current value as a linear combination of the current and past error terms, which accounts for the random fluctuations that the autoregressive part leaves unexplained. The formula is $y_t = \mu + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$, where $\theta_i$ are the moving-average coefficients. The parameter q is therefore the order of the moving average. (This is not the same thing as the rolling-mean smoothing described in the preprocessing section.)
  3. I (Integrated) stands for the differencing operation, the most commonly used way to eliminate periodic factors from a time series. It subtracts from each data point the value a fixed interval earlier so that the data becomes stationary. Usually one round of differencing is enough to make the data stationary for ARIMA, so d generally takes the values 0, 1, or 2; here d = 1, i.e. a single difference.
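
The three components can be illustrated on synthetic data. The following is a minimal sketch using the standard textbook definitions of an AR(1) and an MA(1) process; the coefficient values (mu, gamma1, theta1) and the random data are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
eps = rng.normal(size=n)          # error terms e_t

# AR(1): y_t = mu + gamma1 * y_{t-1} + e_t
mu, gamma1 = 0.5, 0.6             # illustrative values, not fitted to any data
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = mu + gamma1 * ar[t - 1] + eps[t]

# MA(1): y_t = mu + e_t + theta1 * e_{t-1}
theta1 = 0.4                      # illustrative value
ma = mu + eps
ma[1:] += theta1 * eps[:-1]

# I (d = 1): a cumulative sum "integrates" the series,
# and a single difference recovers the original values
integrated = np.cumsum(ar)
recovered = np.diff(integrated)   # equals ar[1:]
```

Differencing once (d = 1) undoes exactly one level of integration, which is why a series that looks like a cumulative sum of a stationary process usually needs d = 1.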

Determine parameter values

According to BIC

To use the ARIMA model for prediction, the values of parameters p and q must be determined first. Because the order generally does not exceed one tenth of the data length, an exhaustive search is used here: take p and q from 0 to 10 each (len(data) / 10 for the 100-point sample used below) and compute the BIC (Bayesian information criterion) of each fitted model. BIC weighs the size of the residuals against the number of model parameters: the smaller the residuals and the fewer the parameters, the smaller the BIC value and the better the model. The p and q with the smallest BIC value are therefore the ideal model parameters. The following function gives p, q = 5, 5.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def get_order(data):
    pmax = int(len(data) / 10)    # the order generally does not exceed length / 10
    qmax = int(len(data) / 10)
    bic_matrix = []
    for p in range(pmax + 1):
        temp = []
        for q in range(qmax + 1):
            try:
                temp.append(ARIMA(data, order=(p, 1, q)).fit().bic)    # store the BIC value in a 2-D list
            except Exception:
                temp.append(None)    # fitting failed for this (p, q)
        bic_matrix.append(temp)
    bic_matrix = pd.DataFrame(bic_matrix)   # convert to a DataFrame
    p, q = bic_matrix.stack().astype('float64').idxmin()        # index of the smallest BIC value
    return p, q
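
For reference, BIC is commonly defined as k * ln(n) - 2 * ln(L), where k is the number of estimated parameters, n the number of observations, and L the maximized likelihood. The sketch below applies that formula by hand; the function name bic and the candidate log-likelihoods are hypothetical, invented purely for illustration, and are not results from the taxi data.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """BIC = k * ln(n) - 2 * ln(L); smaller is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical candidates: (p, q) -> (log-likelihood, number of parameters).
# These numbers are made up for illustration only.
candidates = {
    (1, 1): (-150.0, 3),
    (2, 1): (-149.5, 4),
    (2, 2): (-146.0, 5),
}
n_obs = 100

scores = {order: bic(ll, k, n_obs) for order, (ll, k) in candidates.items()}
best_order = min(scores, key=scores.get)
print(best_order, scores[best_order])
```

In this made-up case the larger models fit slightly better, but the improvement in likelihood does not outweigh the ln(n) penalty per extra parameter, so the smallest model wins.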

According to the image

You can also determine the orders p and q from plots of the autocorrelation function (ACF) and partial autocorrelation function (PACF). The following uses library functions to plot both for the data:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# plot the ACF (top) and PACF (bottom) of the data for the given number of lags
def draw_acf_pacf(data, lags):
    f = plt.figure(facecolor='white')
    ax1 = f.add_subplot(211)
    plot_acf(data, ax=ax1, lags=lags)
    ax2 = f.add_subplot(212)
    plot_pacf(data, ax=ax2, lags=lags)
    plt.subplots_adjust(hspace=0.5)
    plt.show()

The plotted images fall into two types. In the left picture the coefficients quickly drop to around 0 and stay there, which is called truncation (cutting off). In the right picture the coefficients do not drop off but decay only gradually, which is called tailing.

 

Select the model according to whether the ACF and PACF plots tail off or cut off:

  • ACF tails off (q = 0), PACF cuts off after order p: AR model
  • ACF cuts off after order q, PACF tails off (p = 0): MA model
  • Both ACF and PACF tail off, with orders q and p respectively: ARMA model

So how is the order read from the plot? Look for the lag after which the coefficients stay inside the confidence interval. For example, in the right picture above no value falls outside the confidence interval after lag 24, so it is considered to tail off at order 24.
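
Reading the order off the plot can also be approximated numerically. The sketch below computes the sample autocorrelation by hand and reports the last lag whose coefficient lies outside a simple white-noise band of ±1.96/√n. Both helper names (sample_acf, last_significant_lag) are hypothetical, and this band is only an approximation that need not match the shaded interval drawn by plot_acf.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation coefficients for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([1.0] + [np.dot(x[:-k], x[k:]) / denom
                             for k in range(1, nlags + 1)])

def last_significant_lag(coeffs, n_obs, z=1.96):
    """Largest lag whose coefficient falls outside +/- z / sqrt(n_obs); 0 if none."""
    bound = z / np.sqrt(n_obs)
    significant = np.nonzero(np.abs(coeffs[1:]) > bound)[0]
    return int(significant[-1] + 1) if significant.size else 0

# Synthetic series: white noise smoothed over three points (assumed data)
rng = np.random.default_rng(0)
noise = rng.normal(size=200)
series = np.convolve(noise, [1.0, 0.8, 0.6], mode="valid")

coeffs = sample_acf(series, nlags=20)
print(last_significant_lag(coeffs, len(series)))
```

A single stray spike at a high lag will inflate the result, so this is a rough aid for inspecting the plot rather than a replacement for it.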

Make predictions

Next, use the ARIMA model for fitting and prediction. The ARIMA class from statsmodels.tsa.arima.model in the third-party Python package statsmodels is used here. This is a new, separate module introduced in statsmodels 0.11; the original one is statsmodels.tsa.arima_model.ARIMA. Both implement the ARIMA model and share the same property and method names. Fitting returns an ARIMAResults object, and predicted values are obtained through its predict() and forecast() methods. Note that the parameters and return values of these methods differ between the new and old modules, so take care when using them. Official documentation: statsmodels.tsa.arima.model.ARIMAResults, statsmodels.tsa.arima_model.ARIMAResults

The predict() function is used here for prediction. Its start and end parameters give the start and end positions of the data to predict; they can be int, str, or datetime values. If start and end both fall within the range of the original data, the result is the fitted prediction of the original data; if end exceeds the length of the original data, future data is predicted as well. The forecast() function takes a steps parameter, the number of future data points to predict.

import h5py
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']    # font that can display Chinese characters
plt.rcParams['axes.unicode_minus'] = False      # display the minus sign correctly

# plot the trend of the data
def draw_data(data):
    plt.figure()
    plt.plot(data)

with h5py.File('./taxi.h5', 'r') as hf:
    # read the data
    in_data = np.array(hf['in_data'])
    in_area = in_data.transpose(2, 0, 1)  # reorder the dimensions
    in_mean_day = in_area.mean(axis=2)
    sub_data = pd.Series(in_mean_day[116][0:100])      # take 100 points of the dataset as the sample
    draw_data(sub_data)

    # determine the model order
    # p, q = get_order(sub_data)
    # print('order:', p, q)

    # predict with the ARIMA model
    model = ARIMA(sub_data, order=(5, 1, 5)).fit()        # build and fit the model with the chosen order
    predict_data = model.predict(0, 150)                  # fitted values plus predictions up to position 150
    forecast = model.forecast(21)                         # predict 21 future data points

    # plot the original data against the predicted data
    plt.plot(sub_data, label='original data')
    plt.plot(predict_data, label='predicted data')
    plt.legend()
    plt.show()

The following is the result of fitting prediction:

Data preprocessing

Stationarity

ARIMA requires stationary data so that regular patterns can be mined and fitted. Stationary data has a constant mean and variance, and an autocovariance that does not depend on time. The figures below show, in turn, stationary data, data with a non-constant mean, data with a non-constant variance, and data whose autocovariance changes with time.

 

Stationarity can be checked with a unit root test, the Dickey-Fuller test. Its null hypothesis is that the time series is non-stationary, i.e. the sequence has a unit root. For a stationary time series, the test must therefore be significant at the chosen confidence level so that the null hypothesis is rejected. Use adfuller from the statsmodels package as follows:

from statsmodels.tsa.stattools import adfuller

# use the ADF test to check whether the data is stationary
def check_stable(data):
    adf_res = adfuller(data)
    print('test statistic:', adf_res[0])
    print('p-value of the test statistic:', adf_res[1])
    print('number of lags used:', adf_res[2])
    print('number of observations:', adf_res[3])
    for key, value in adf_res[4].items():
        print('critical value at', key, ':', value)

The statistical results are as follows. When the p-value is less than 0.05, the null hypothesis can be rejected, that is, the data is stationary.

Moving average

So how can the data be made stationary? First, data smoothing can be used to remove periodic fluctuations. Common smoothing techniques include the moving average and the weighted moving average. The moving average uses the mean over a fixed time window as the estimate, while the exponentially weighted average computes the mean with varying weights. In pandas, the rolling() method provides the moving window, ewm() the exponentially weighted window, and mean() takes the average; step is the averaging window.

import pandas as pd

# smooth the data with moving averages
def move_average(data, step):
    series = pd.Series(data)
    rol_mean = series.rolling(window=step).mean()     # moving average
    rol_weight_mean = series.ewm(span=step).mean()    # exponentially weighted moving average

    return rol_mean, rol_weight_mean
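
To make the behaviour of the two averages concrete, here is a small worked example on a made-up five-point series with a window of 3. The input values are arbitrary and chosen so the results are easy to check by hand.

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data for illustration

# Plain moving average: the first window - 1 entries have no full window and are NaN
rol_mean = series.rolling(window=3).mean()

# Exponentially weighted average: span=3 corresponds to alpha = 2 / (3 + 1) = 0.5,
# so recent points weigh more; it produces a value from the first point on
ewm_mean = series.ewm(span=3).mean()

print(rol_mean.tolist())
print(ewm_mean.tolist())
```

Note that the rolling mean shortens the usable series by window - 1 points, which matters when the sample is as small as 100 points.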

Difference

Differencing eliminates periodic factors by subtracting data points that are a fixed interval apart. The pandas diff() method computes the difference between values steps positions apart. For example, with steps = 7 the first seven data points have no earlier point to be compared with, so they are NaN and are removed with dropna().

After prediction, the differenced data needs to be restored. Differencing is just a subtraction, so restoration is an addition: shift the original data back by 7 positions and add it to the prediction.

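# differencing
steps = 7                        # differencing interval
diff = sub_data.diff(steps)
diff = diff.dropna()

# restore the differenced data
diff_shift = sub_data.shift(steps)
diff_recover = predict_data.add(diff_shift)    # predict_data here is the prediction of the differenced series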
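
The round trip can be verified on a toy series. In this sketch the data is made up (a weekly pattern that rises by 1 each week), and the differenced series itself stands in for a perfect prediction, so restoration should reproduce the original values exactly.

```python
import pandas as pd

steps = 7
# Toy data: a 7-point pattern that shifts up by 1 every cycle (assumed, for illustration)
original = pd.Series([float(i % 7 + i // 7) for i in range(21)])

# Difference at lag 7; the first 7 entries are NaN and are dropped
diff = original.diff(steps).dropna()

# Restore: add back the original series shifted by the same lag
restored = diff.add(original.shift(steps)).dropna()

print(diff.tolist())
print(restored.tolist())
```

With a real model, diff would be replaced by the predicted differences, and restored points beyond the end of the original data would have to be built up step by step from previously restored values.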

 


Origin blog.csdn.net/theVicTory/article/details/104941483