Time series forecasting: ARIMA in practice

Guide

This article is divided into four parts:
using pandas to process time series data,
how to check whether time series data is stationary,
how to make time series data stationary, and
forecasting time series data.

Import and process time series data with pandas

Step 1: Import commonly used libraries

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from matplotlib.pylab import rcParams
# rcParams sets the default figure size
rcParams['figure.figsize'] = 15, 6

Step 2: Import time series data.
The data file can be downloaded from GitHub: http://github.com/aarshayj/Analytics_Vidhya/tree/master/Articles/Time_Series_Analysis

# Load the data
data = pd.read_csv("../testdata/AirPassengers.csv")
print (data.head())
print ('\n Data types:')
print (data.dtypes)

The results of the operation are as follows: The data includes the number of passengers corresponding to each month.
You can see that data is already a DataFrame with two columns Month and #Passengers, where the type of Month is object, and the index is 0,1,2...
Step 3: Process the time series data.
We need to convert Month to the datetime type and use it as the index.

# Process the time series data:
# convert Month to datetime and use it as the index.
# --- index_col tells pandas which column to use as the index
# --- pd.to_datetime with an explicit format converts each 'YYYY-MM' string to a datetime value
#     (pd.datetime is no longer available in current pandas releases)
data = pd.read_csv('../testdata/AirPassengers.csv', index_col='Month')
data.index = pd.to_datetime(data.index, format='%Y-%m')
print (data.head())
print (data.index)

The result is as follows: You can see that the index of data has become Month of datetime type.

How to check the stability of time series data (Stationarity)

Because the ARIMA model requires the data to be stationary (stable), this step is crucial.
1. A series is usually judged stationary when several of its statistics are constant with respect to time:
a constant mean,
a constant variance, and
an autocovariance that does not depend on time.
These are illustrated below.
Mean
X is the value of the time series, and t is time. In the left picture, the mean of the data is constant along the time axis, that is, the mean is not a function of time, so the series is stationary. In the right picture, the values increase over time, so the mean is a function of time; the data has a trend and is not stationary.
Variance
In the left picture, the variance of the data is constant over time, that is, the amplitude of the fluctuation around the mean is fixed, so the data is stationary. In the right picture, the amplitude differs at different points in time, so the variance is not independent of time and the data is not stationary. The mean is the same in both pictures.
Autocovariance
The autocovariance of a time series is the covariance between its values at two different times i and j. In the left picture the autocovariance does not depend on time. In the right picture the fluctuation frequency clearly changes over time, so pairs i and j with the same separation give different covariances at different times, and the series is not stationary. Although the mean and variance in the right picture are independent of time, it is still non-stationary data.
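The first two conditions can be checked numerically on synthetic data. The following is a minimal sketch, not part of the original article: the three series and the helper `half_stats` are hypothetical, and comparing the two halves of a series is only a crude stand-in for a proper test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
t = np.arange(n)

noise = rng.normal(0.0, 1.0, n)        # constant mean and variance
trending = noise + 0.05 * t            # mean grows with time
widening = noise * (1.0 + 0.01 * t)    # variance grows with time

def half_stats(x):
    """Return (mean, std) of the first half and of the second half of x."""
    first, second = x[: len(x) // 2], x[len(x) // 2:]
    return (first.mean(), first.std()), (second.mean(), second.std())

for name, series in [("noise", noise), ("trending", trending), ("widening", widening)]:
    (m1, s1), (m2, s2) = half_stats(series)
    print(f"{name:9s} mean {m1:6.2f} -> {m2:6.2f}   std {s1:5.2f} -> {s2:5.2f}")
```

For the stationary series both halves give similar statistics; the trending series shows a shifted mean, and the widening series shows a larger standard deviation in the second half.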
2. Checking stationarity in Python
There are two methods:
1. Rolling statistics: plot the mean and standard deviation of the data within a moving window.
2. Dickey-Fuller test: this is more involved. Roughly, the null hypothesis is that the time series is non-stationary.
If the test statistic is less than the critical value at a given confidence level, the null hypothesis is rejected and the data is considered stationary; otherwise it is considered non-stationary.

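# Checking stationarity of time series data in Python
# The ARIMA model requires stationary data, so this step is crucial.
# There are two methods:
# 1. Rolling statistics -- the mean and standard deviation of the data within a moving window.
# 2. Dickey-Fuller test -- more involved; roughly, the null hypothesis is that the series is non-stationary.
# If the test statistic < critical value, reject the null hypothesis, i.e. the data is stationary; otherwise it is non-stationary.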
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
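    # Use one year as the window: the value at each time t is replaced by the mean of the 12 months up to and including t; likewise for the standard deviation.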
    rolmean = timeseries.rolling(12).mean()
    rolstd = timeseries.rolling(12).std()

    # plot rolling statistics:
    fig = plt.figure()
    fig.add_subplot()
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='rolling mean')
    std = plt.plot(rolstd, color='black', label='Rolling standard deviation')

    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    # Dickey-Fuller test:

    print ('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
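    # The first items of dftest are, in order: test statistic, p-value, number of lags used, number of observations used, and the critical values at each confidence level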
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical value (%s)' % key] = value

    print (dfoutput)
ts = data['#Passengers']  # column name as printed by data.head() above
test_stationarity(ts)

The results are as follows:
The rolling mean and standard deviation both increase over time, so the data is not stationary.
The Dickey-Fuller test agrees: the null hypothesis cannot be rejected at any of the reported confidence levels.
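
To see what the test statistic measures, here is a minimal sketch of the plain (non-augmented) Dickey-Fuller regression using only NumPy. It is an illustration under stated assumptions, not what `adfuller` does internally: `adfuller` also includes lagged differences and chooses their number (here by AIC), and the function name `dickey_fuller_stat` and the synthetic series are hypothetical.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic of rho in the regression  dy_t = c + rho * y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])    # constant and lagged level
    beta, _, _, _ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])    # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)              # OLS covariance of beta
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)              # stationary
random_walk = np.cumsum(rng.normal(size=500))   # unit root, non-stationary

stat_noise = dickey_fuller_stat(white_noise)
stat_walk = dickey_fuller_stat(random_walk)
print("white noise:", stat_noise)
print("random walk:", stat_walk)
```

A strongly negative statistic is evidence against the non-stationary null hypothesis. For large samples the 5% critical value for this constant-only form is roughly -2.86, so the white-noise series is rejected as non-stationary by a wide margin, while a random walk typically is not.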

3. Making time series data stationary
There are two main reasons why data is non-stationary:

Trend: the mean level changes over time, for example by increasing or decreasing.
Seasonality: the data varies within a specific recurring period, for example around holidays or other recurring events.
Because the range of the original data is relatively large, a common way to narrow the value range while retaining the other information is to take the logarithm.

ts_log = np.log(ts)
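
Why the logarithm helps can be seen on a made-up series whose seasonal swing grows with its level. This is only a sketch; the growth rate and swing size are assumptions chosen for illustration, not properties of the passenger data.

```python
import numpy as np

months = np.arange(48)
# Hypothetical series: 2% growth per month, seasonal swing proportional to the level.
level = 100 * 1.02 ** months
series = level * (1 + 0.1 * np.sin(2 * np.pi * months / 12))

log_series = np.log(series)

def swing(x):
    """Difference between the largest and smallest value."""
    return x.max() - x.min()

# Swing in the first and in the last year, before and after taking the log.
raw_first, raw_last = swing(series[:12]), swing(series[-12:])
log_first, log_last = swing(log_series[:12]), swing(log_series[-12:])
print("raw swing:", raw_first, "->", raw_last)
print("log swing:", log_first, "->", log_last)
```

On the raw scale the yearly swing roughly doubles over the three years, while on the log scale it stays the same, because the log turns multiplicative growth into an additive, linear trend.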


There are usually three methods for detecting and removing trends:
Aggregation: shorten the time axis by using the weekly/monthly/yearly average as the data value, which reduces the gap between values in different periods.
Smoothing: replace each original value with the mean over a sliding window, which narrows the gap between values.
Polynomial fitting: fit a regression model to the existing data to obtain a smoother curve.
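
As a rough illustration of the first and third methods, the sketch below applies aggregation and a degree-1 polynomial fit to a hypothetical monthly series. The data, the variable names, and the choice of a straight line are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: linear trend plus a 12-month seasonal pattern.
index = pd.date_range("2000-01-01", periods=60, freq="MS")
t = np.arange(60)
series = pd.Series(10 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 12), index=index)

# Aggregation: one mean value per calendar year.
yearly_mean = series.groupby(series.index.year).mean()

# Polynomial fitting: a least-squares line (degree 1) as the trend estimate.
slope, intercept = np.polyfit(t, series.to_numpy(), deg=1)
fitted_trend = pd.Series(intercept + slope * t, index=index)
detrended = series - fitted_trend

print(yearly_mean)
print("estimated slope:", slope)
```

The yearly means average out the seasonal pattern and expose the trend directly; subtracting the fitted line leaves a series that fluctuates around zero.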
This article mainly uses the smoothing method.
Moving average

# Smoothing method
# Moving average
moving_avg = ts_log.rolling(12).mean()
plt.plot(ts_log ,color = 'blue')
plt.plot(moving_avg, color='red')
plt.show()

It can be seen that moving_avg is much smoother than the original values.
Then subtract it from the log series:

ts_log_moving_avg_diff = ts_log-moving_avg
ts_log_moving_avg_diff.dropna(inplace = True)
test_stationarity(ts_log_moving_avg_diff)

The processed data shows essentially no trend over time. The Dickey-Fuller test result tells us that the data is stationary with 95% confidence.

The method above weights all points in the window equally, but in many cases more recent values can be considered more important. This motivates the exponentially weighted moving average (EWMA), provided in pandas by the ewm() method.

# The halflife value determines the decay factor alpha:  alpha = 1 - exp(log(0.5) / halflife)
expweighted_avg = ts_log.ewm(halflife=12).mean()
ts_log_ewma_diff = ts_log - expweighted_avg
test_stationarity(ts_log_ewma_diff)

The mean and standard deviation of the new data vary less than with the ordinary moving average, and the Dickey-Fuller test concludes that the data is stationary with 99% confidence.
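
The relation between halflife and the decay factor alpha can be checked by hand. The sketch below uses a short made-up series and compares a manual recursion with pandas. Note the assumption: it passes `adjust=False` so that pandas uses the simple recursive form, whereas the call in this article keeps the default `adjust=True`, which reweights the early values.

```python
import numpy as np
import pandas as pd

halflife = 12
alpha = 1 - np.exp(np.log(0.5) / halflife)   # decay factor implied by the half-life

values = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])

# Recursive definition: y_0 = x_0, y_t = (1 - alpha) * y_{t-1} + alpha * x_t
manual = [values.iloc[0]]
for x in values.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)
manual = pd.Series(manual, index=values.index)

# With adjust=False pandas applies the same recursion.
from_pandas = values.ewm(halflife=halflife, adjust=False).mean()

print("alpha:", alpha)
print(pd.DataFrame({"manual": manual, "pandas": from_pandas}))
```

With halflife=12, alpha is about 0.056, so the weight of an observation halves every 12 steps.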


There are two ways to detect and remove seasonality:
1. Differencing: take the difference between values separated by a specific lag.
2. Decomposition: model the trend and seasonality separately, then remove them.
Differencing

ts_log_diff = ts_log - ts_log.shift()
ts_log_diff.dropna(inplace=True)
test_stationarity(ts_log_diff)

Compared with the moving average method, the mean and variance of the differenced data fluctuate much less along the time axis. The Dickey-Fuller test concludes that the data is stationary with 90% confidence.
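
The effect of the lag can be made concrete on a small synthetic series. In this sketch the data and the lag of 12 are assumptions for illustration; the article's own code uses a lag of 1 (`ts_log.shift()`).

```python
import numpy as np
import pandas as pd

t = np.arange(48)
seasonal = np.tile([0, 1, 3, 2, 0, -1, -3, -2, 0, 1, 0, -1], 4)  # repeats every 12 steps
series = pd.Series(5.0 + 0.5 * t + seasonal)

first_diff = (series - series.shift(1)).dropna()      # removes the linear trend
seasonal_diff = (series - series.shift(12)).dropna()  # removes the trend and the 12-step pattern

print(first_diff.head(13).to_list())
print(seasonal_diff.unique())
```

The lag-1 difference no longer drifts upward but still carries the repeating pattern, while the lag-12 difference collapses to a constant (12 steps of trend), since both the pattern and the changing level cancel.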

Decomposition

# Decomposition separates the trend and the seasonal component from the time series:
from statsmodels.tsa.seasonal import seasonal_decompose
def decompose(timeseries):
    
    # The result has three parts: trend, seasonal, and resid (residual)
    decomposition = seasonal_decompose(timeseries)
    
    trend = decomposition.trend
    seasonal = decomposition.seasonal
    residual = decomposition.resid
    
    plt.subplot(411)
    plt.plot(timeseries, label='Original')
    plt.legend(loc='best')
    plt.subplot(412)
    plt.plot(trend, label='Trend')
    plt.legend(loc='best')
    plt.subplot(413)
    plt.plot(seasonal,label='Seasonality')
    plt.legend(loc='best')
    plt.subplot(414)
    plt.plot(residual, label='Residuals')
    plt.legend(loc='best')
    plt.tight_layout()
    
    return trend , seasonal, residual

As the figure shows, the original data is split into three parts: the trend component shows a clear trend, the seasonal component shows clear periodicity, and the residuals are what remains. After removing the trend and seasonal components, the residual is the stationary data we need.

# After removing trend and seasonality, only the residual is treated as the time series of interest
trend , seasonal, residual = decompose(ts_log)
residual.dropna(inplace=True)
test_stationarity(residual)

As the figure shows, the mean and variance of the data are nearly constant, with almost no fluctuation (the plot looks steeper than before, but note that its range is only [-0.05, 0.05]), so the data can intuitively be considered stationary. In addition, the Dickey-Fuller test statistic is well below the 1% critical value, so the data is stationary with 99% confidence.
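
To make the decomposition less of a black box, here is a minimal sketch of a classical additive decomposition written with pandas only. It follows the textbook procedure (centered moving average for the trend, per-position averages for the seasonal part); `seasonal_decompose` may differ in details, and the function name `classical_decompose` and the test series are hypothetical. The sketch assumes an even period.

```python
import numpy as np
import pandas as pd

def classical_decompose(series, period=12):
    """Additive decomposition: series = trend + seasonal + residual (even period)."""
    # Centered moving average over one full period (a 2 x period moving average).
    trend = series.rolling(period).mean().rolling(2).mean().shift(-(period // 2))
    detrended = series - trend
    # Average the detrended values by position within the period, then centre on zero.
    position = np.arange(len(series)) % period
    pattern = detrended.groupby(position).mean()
    pattern = pattern - pattern.mean()
    seasonal = pd.Series(pattern.to_numpy()[position], index=series.index)
    residual = series - trend - seasonal
    return trend, seasonal, residual

# Hypothetical series: linear trend plus an exactly repeating zero-sum pattern.
true_pattern = np.array([-3, -2, -1, 0, 1, 2, 3, 2, 1, 0, -1, -2], dtype=float)
index = pd.date_range("2000-01-01", periods=72, freq="MS")
series = pd.Series(2.0 + 0.1 * np.arange(72) + np.tile(true_pattern, 6), index=index)

trend, seasonal, residual = classical_decompose(series)
print(seasonal.iloc[:12].to_list())
print(residual.abs().max())
```

Because the moving average needs half a period on each side, the trend and residual are undefined for the first and last six months, which is why the article calls dropna() on the residual before testing it.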

4. Forecasting time series data
Assume that stationary time series data has been obtained after processing. Next, we use the ARIMA model to forecast the data. An introduction to ARIMA can be found in another article.

Step 1: Estimate the p, q parameters of ARIMA(p, d, q) through the ACF and PACF

The Differencing section above showed that the data is stationary after first-order differencing, so d=1.
So the first-order difference ts_log_diff = ts_log - ts_log.shift() is used as input,
which is equivalent to using y_t = Y_t − Y_{t−1} as input.

First plot the ACF and PACF with the following code:

#ACF and PACF plots:
from statsmodels.tsa.stattools import acf, pacf
lag_acf = acf(ts_log_diff, nlags=20)
lag_pacf = pacf(ts_log_diff, nlags=20, method='ols')
#Plot ACF: 
plt.subplot(121) 
plt.plot(lag_acf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')
plt.title('Autocorrelation Function')

#Plot PACF:
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(ts_log_diff)),linestyle='--',color='gray')
plt.title('Partial Autocorrelation Function')
plt.tight_layout()

In the figure, the upper and lower gray lines mark the confidence interval. The value of p is the lag at which the PACF first crosses the upper confidence bound, and the value of q is the lag at which the ACF first crosses the upper confidence bound. From the figure we get p=2 and q=2.
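
The quantities in the plot can also be computed by hand. The sketch below is an illustration with NumPy only: the helper names are hypothetical, the AR(1) test series is synthetic, and reading "first crossing" as "first lag whose value falls below the upper bound 1.96/sqrt(n)" is one interpretation of the rule of thumb.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = centered @ centered
    return np.array([centered[: len(x) - k] @ centered[k:] / denom
                     for k in range(nlags + 1)])

def first_lag_inside_band(corr, n):
    """First lag >= 1 whose correlation is below the upper bound 1.96 / sqrt(n)."""
    upper = 1.96 / np.sqrt(n)
    for lag in range(1, len(corr)):
        if corr[lag] < upper:
            return lag
    return None

# Synthetic AR(1) series with coefficient 0.7.
rng = np.random.default_rng(0)
noise = rng.normal(size=300)
x = np.zeros(300)
for i in range(1, 300):
    x[i] = 0.7 * x[i - 1] + noise[i]

acf_vals = sample_acf(x, nlags=20)
lag = first_lag_inside_band(acf_vals, len(x))
print(acf_vals[:5])
print("first lag inside the band:", lag)
```

Reading orders off these plots is a heuristic; comparing a few candidate models, as the next step does, is a reasonable cross-check.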

Step 2: After obtaining the parameter estimates p, d, q, build the model ARIMA(p, d, q).
To highlight the differences, three models with different parameter values are compared.
Model 1: AR model (ARIMA(2,1,0))

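# Note: statsmodels.tsa.arima_model is the legacy ARIMA module (removed in statsmodels 0.13), so this code needs an older release
from statsmodels.tsa.arima_model import ARIMA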
model = ARIMA(ts_log, order=(2, 1, 0))  
results_AR = model.fit(disp=-1)  
plt.plot(ts_log_diff)
plt.plot(results_AR.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_AR.fittedvalues-ts_log_diff)**2))

In the figure, the blue line is the input, the red line is the model's fitted values, and RSS is the residual sum of squares.

Model 2: MA model (ARIMA (0,1,2))

model = ARIMA(ts_log, order=(0, 1, 2))  
results_MA = model.fit(disp=-1)  
plt.plot(ts_log_diff)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_MA.fittedvalues-ts_log_diff)**2))


Model 3: ARIMA model (ARIMA(2,1,2))

model = ARIMA(ts_log, order=(2, 1, 2))  
results_ARIMA = model.fit(disp=-1)  
plt.plot(ts_log_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-ts_log_diff)**2))

From the RSS values, model 3, ARIMA(2,1,2), has the best fit, so it is chosen as the final prediction model.

Step 3: Map the model output back to the original data for prediction.
The model's fitted values are fitted to the input after the original data was made stationary, so the corresponding inverse operations must be applied to the fitted values to bring them back to the scale of the original data.

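# ARIMA actually fits the first-order difference ts_log_diff; predictions_ARIMA_diff[i] is the difference in ts_log between month i and month i-1.
# Because differencing has a lag of one, the value for the first month is missing.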
predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
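print(predictions_ARIMA_diff.head())
# Accumulate the diffs to get the difference of each value from the first month (on the same log scale),
# i.e. predictions_ARIMA_diff_cumsum[i] is the difference in ts_log between month i and month 1.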
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
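# Invert step by step: ts_log_diff => ts_log => ts
# First use the first value of ts_log as the base for every entry, then add to each time point its cumulative difference from the first month (this also handles the missing diff for the first month).
# This gives predictions_ARIMA_log => predictions_ARIMA
predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)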
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)
predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.figure()
plt.plot(ts)
plt.plot(predictions_ARIMA)
plt.title('RMSE: %.4f'% np.sqrt(sum((predictions_ARIMA-ts)**2)/len(ts)))
plt.show()
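
The inverse operations can be verified on their own, without a model: if the exact differences are fed back in, the original series must come out. The sketch below does this round trip on a few illustrative values (the numbers are placeholders, and no ARIMA fit is involved).

```python
import numpy as np
import pandas as pd

ts = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0, 135.0])  # illustrative values
ts_log = np.log(ts)
ts_log_diff = (ts_log - ts_log.shift()).dropna()

# Undo the differencing: cumulative sum of the differences, added to the first log value.
restored_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
restored_log = restored_log.add(ts_log_diff.cumsum(), fill_value=0)

# Undo the logarithm.
restored = np.exp(restored_log)
print(restored)
```

With fitted rather than exact differences, the same steps accumulate the model's errors over time, which is one reason the reconstructed curve can drift away from the original data.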

Step 4: Forecast the data changes in the next year

5. Summary
The previous article summarized the steps of ARIMA modeling:
(1). Obtain the time series data of the observed system;
(2). Plot the data to see whether it is a stationary time series; for a non-stationary time series, first apply d-order differencing to transform it into a stationary one;
(3). With the stationary time series from step (2), compute the autocorrelation function (ACF) and partial autocorrelation function (PACF), and determine the best orders p and q by analyzing the autocorrelation and partial autocorrelation plots;
(4). Build the ARIMA model from the d, q, and p obtained above, then check the resulting model.
This article uses an example to show how to do the following in Python:
1. Determine whether time series data is stationary, corresponding to step (1).
2. Make time series data stationary, corresponding to step (2).
3. Use the ARIMA model to forecast time series data, corresponding to steps (3) and (4).

Origin blog.csdn.net/qq_30868737/article/details/108472380