Use ARIMA for stock forecasts

1. Introduction to ARIMA

1. Introduction

        ARIMA stands for Autoregressive Integrated Moving Average model. It is one of the most common statistical models used for time series forecasting. The model is simple: it needs only the endogenous variable (the series itself) and no exogenous variables.

2. Model introduction

1. Autoregressive model (AR)

        The autoregressive model describes the relationship between the current value and historical values: it uses the variable's own past data to predict itself. The series must be stationary. The p-order autoregressive process is defined as:

$$y_t = \mu + \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t$$

where $y_t$ is the current value, $\mu$ is a constant term, $p$ is the order, $\phi_i$ are the autocorrelation coefficients, and $\epsilon_t$ is the error. Using an autoregressive model comes with the following restrictions:

  • The autoregressive model uses the series' own data to make predictions;
  • The series must be stationary;
  • The series must be autocorrelated; if the autocorrelation coefficient (φi) is less than 0.5, the model should not be used;
  • Autoregression is only suitable for predicting phenomena that are related to their own previous periods.
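
As a rough illustration of the AR(p) definition above, the sketch below simulates a stationary AR(2) series and recovers its coefficients by ordinary least squares. The helper names `simulate_ar` and `fit_ar_ols` and the coefficient values are made up for this example; they are not part of the original workflow.

```python
import numpy as np

def simulate_ar(phi, n, mu=0.0, sigma=1.0, seed=0):
    """Simulate y_t = mu + sum_i phi[i] * y_{t-1-i} + eps_t (hypothetical helper)."""
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n + p)
    y = np.zeros(n + p)
    for t in range(p, n + p):
        # y[t-p:t][::-1] is (y_{t-1}, ..., y_{t-p})
        y[t] = mu + np.dot(phi, y[t - p:t][::-1]) + eps[t]
    return y[p:]

def fit_ar_ols(y, p):
    """Estimate the constant and phi_1..phi_p by regressing y_t on its p lags."""
    n = len(y)
    # Column i holds y_{t-i-1} for t = p, ..., n-1
    lagged = np.column_stack([y[p - i - 1:n - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(n - p), lagged])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef[0], coef[1:]

true_phi = np.array([0.6, -0.3])   # assumed values, chosen to give a stationary process
y = simulate_ar(true_phi, n=5000)
mu_hat, phi_hat = fit_ar_ols(y, p=2)
print(mu_hat, phi_hat)
```

With a few thousand observations the estimates should land close to the assumed coefficients, which is the sense in which the model "predicts itself from its own history".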

2. Moving Average Model (MA)

        The moving average model focuses on the accumulation of the error terms in the autoregressive model. The q-order moving average process is defined as:

$$y_t = \mu + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$$

The moving average method can effectively smooth out random fluctuations in forecasting.
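
To make the MA(q) definition concrete, here is a small sketch that simulates an MA(1) process and checks a known property: the correlation between neighbouring values is about θ/(1+θ²), and it vanishes beyond lag q. The helper `simulate_ma` and the value θ = 0.8 are assumptions made for this example.

```python
import numpy as np

def simulate_ma(theta, n, mu=0.0, sigma=1.0, seed=0):
    """Simulate y_t = mu + eps_t + sum_i theta[i] * eps_{t-1-i} (hypothetical helper)."""
    theta = np.asarray(theta, dtype=float)
    q = len(theta)
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=n + q)
    y = np.empty(n)
    for t in range(n):
        past = eps[t:t + q][::-1]            # (eps_{t-1}, ..., eps_{t-q})
        y[t] = mu + eps[t + q] + np.dot(theta, past)
    return y

theta = [0.8]                                 # assumed MA(1) coefficient
y_ma = simulate_ma(theta, n=20000)

# Sample correlation between the series and itself shifted by 1 and by 2 steps
lag1 = np.corrcoef(y_ma[1:], y_ma[:-1])[0, 1]   # theory: 0.8 / (1 + 0.8**2)
lag2 = np.corrcoef(y_ma[2:], y_ma[:-2])[0, 1]   # theory: 0 for an MA(1)
print(lag1, lag2)
```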

3. Autoregressive integrated moving average model (ARIMA)

        The ARIMA(p, d, q) model (Autoregressive Integrated Moving Average Model) is also called the differenced autoregressive moving average model. It combines autoregression and moving average.

        What the ARIMA model does is transform a non-stationary time series into a stationary one by differencing, and then regress the dependent variable on its own lagged values and on the present and lagged values of the random error term. For the differenced series the formula is:

$$y_t = \mu + \sum_{i=1}^{p} \phi_i y_{t-i} + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$$

where $\phi_i$ are the autoregressive coefficients and $\theta_i$ are the moving average coefficients. AR stands for autoregressive and p is the number of autoregressive terms; MA stands for moving average and q is the number of moving average terms; d is the number of times the series is differenced to make it stationary.
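
The role of d can be seen in a short sketch on synthetic data (not the stock series): a random walk is built by accumulating stationary increments, one round of differencing (d = 1) recovers those increments, and a cumulative sum undoes the differencing again.

```python
import numpy as np

rng = np.random.default_rng(1)
steps = rng.normal(0.1, 1.0, size=500)    # stationary increments with a small drift
walk = 10.0 + np.cumsum(steps)            # integrated, non-stationary series

diff1 = np.diff(walk)                     # first difference: y_t - y_{t-1}

# Undo the differencing: first observed level plus the running sum of differences
restored = np.concatenate([[walk[0]], walk[0] + np.cumsum(diff1)])
print(np.allclose(restored, walk))
```

This is also why a model fitted on the differenced series can still produce forecasts on the original price scale: the differences are summed back up.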

3. Autocorrelation function

1. Autocorrelation function ACF

        The autocorrelation function compares an ordered sequence of random variables with itself: it reflects the correlation between values of the same series at different times. The formula is:

$$\rho_k = \frac{\mathrm{Cov}(y_t, y_{t-k})}{\mathrm{Var}(y_t)}$$

The value of $\rho_k$ lies in [-1, 1].
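
A minimal sketch of the sample version of this formula, written with NumPy only. The function name `acf` and the two synthetic series are assumptions for illustration: a random walk should show autocorrelation near 1 at short lags, while white noise should show values near 0.

```python
import numpy as np

def acf(y, nlags):
    """Sample autocorrelation for lags 0..nlags (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    denom = np.dot(d, d)
    return np.array([np.dot(d[k:], d[:len(d) - k]) / denom
                     for k in range(nlags + 1)])

rng = np.random.default_rng(0)
walk = np.cumsum(rng.normal(size=1000))   # strongly autocorrelated
noise = rng.normal(size=1000)             # no autocorrelation

acf_walk = acf(walk, 5)
acf_noise = acf(noise, 5)
print(acf_walk)
print(acf_noise)
```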

2. Partial autocorrelation function (PACF)

        For a stationary AR(p) model, the autocorrelation coefficient ρ(k) at lag k is not a pure correlation between x(t) and x(t-k). x(t) is also affected by the k-1 intermediate random variables x(t-1), x(t-2), ..., x(t-k+1), and these k-1 variables are all correlated with x(t-k), so ρ(k) is actually mixed with the influence of other variables on x(t) and x(t-k).

        In other words, the ACF also contains the influence of other variables, while the partial autocorrelation coefficient (PACF) is strictly the correlation between these two variables: it measures the influence of x(t-k) on x(t) after the interference of the k-1 intermediate variables x(t-1), x(t-2), ..., x(t-k+1) has been removed.
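
One common way to compute the partial autocorrelation at lag k is to regress x(t) on x(t-1), ..., x(t-k) and take the coefficient on x(t-k). The sketch below does this on a synthetic AR(1) series; the helper `pacf_ols` and the coefficient 0.7 are assumptions for the example. At lag 2 the plain autocorrelation is still clearly non-zero, while the partial autocorrelation is close to zero.

```python
import numpy as np

def pacf_ols(y, k):
    """Partial autocorrelation at lag k: the coefficient on y_{t-k} when y_t is
    regressed on y_{t-1}, ..., y_{t-k} plus an intercept (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    lagged = np.column_stack([y[k - i - 1:n - i - 1] for i in range(k)])
    X = np.column_stack([np.ones(n - k), lagged])
    coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
    return coef[-1]

rng = np.random.default_rng(0)
phi = 0.7                                  # assumed AR(1) coefficient
x = np.zeros(5000)
for t in range(1, len(x)):
    x[t] = phi * x[t - 1] + rng.normal()

acf2 = np.corrcoef(x[2:], x[:-2])[0, 1]    # theory: phi**2, passed on through x(t-1)
pacf1 = pacf_ols(x, 1)                     # theory: phi
pacf2 = pacf_ols(x, 2)                     # theory: 0 once x(t-1) is accounted for
print(acf2, pacf1, pacf2)
```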

2. Data acquisition

        The stock data is fetched from Yahoo Finance through pandas-datareader. The imports are as follows:

%matplotlib inline
import datetime

import numpy as np
import pandas as pd
import pandas_datareader
import matplotlib.pyplot as plt
import seaborn as sns
# Note: this legacy ARIMA class was deprecated in statsmodels 0.12 and is
# unavailable in later releases, where statsmodels.tsa.arima.model.ARIMA replaces it.
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt

# Set fonts and plot style
%config InlineBackend.figure_format = 'retina'
sns.set_style("whitegrid")
# Fonts that can render Chinese labels; matplotlib uses the first one it finds
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei']
plt.rcParams['axes.unicode_minus'] = False

The data used here is the stock of Taiji Industrial (stock code 600667.SS), covering the period from 2019 to the present.

# pd.datetime is no longer available in recent pandas; use the datetime module
start = datetime.datetime(2019, 1, 1)
end = datetime.datetime.today()
stock = pandas_datareader.DataReader('600667.SS', 'yahoo', start, end)
stock.head()

Among the columns, Close is the closing price of the day. Let's look at the trend first.

# Assumption: the series analysed below is the daily closing price
stock_train = stock['Close']

stock_train.plot(figsize=(12,8))
plt.legend(bbox_to_anchor=(1.25, 0.5))
plt.title("Stock Close")
sns.despine()

The series is resampled to one value per day, and the missing days (when the market is closed) are filled by linear interpolation:

stock_train = stock_train.resample('D').interpolate('linear')

stock_train.plot(figsize=(12,8))
plt.legend(bbox_to_anchor=(1.25, 0.5))
plt.title("Stock Close")
sns.despine()
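
What `resample('D').interpolate('linear')` does can be seen on a tiny made-up series (the prices below are hypothetical, not real quotes): a Friday, a Monday and a Tuesday are given, and the weekend is filled in on a straight line between Friday and Monday.

```python
import pandas as pd

# Hypothetical closing prices for Fri 2020-01-03, Mon 2020-01-06, Tue 2020-01-07
idx = pd.to_datetime(['2020-01-03', '2020-01-06', '2020-01-07'])
close = pd.Series([10.0, 13.0, 12.0], index=idx)

# Upsample to calendar days, then fill the new gaps linearly
daily = close.resample('D').interpolate(method='linear')
print(daily)
```

Note that the interpolated weekend values are not real trades, so they smooth the series slightly.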


To understand the data better and more intuitively, plot the series together with its distribution:

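def tsplot(y, lags=None, title='', figsize=(14, 8)):
    """Plot the series, its histogram, and its ACF/PACF in a 2x2 grid."""
    fig = plt.figure(figsize=figsize)
    layout = (2, 2)
    ts_ax   = plt.subplot2grid(layout, (0, 0))
    hist_ax = plt.subplot2grid(layout, (0, 1))
    acf_ax  = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))

    y.plot(ax=ts_ax)
    ts_ax.set_title(title)
    y.plot(ax=hist_ax, kind='hist', bins=25)
    hist_ax.set_title('Histogram')
    # ACF and PACF of the series fill the bottom row
    plot_acf(y, lags=lags, ax=acf_ax)
    plot_pacf(y, lags=lags, ax=pacf_ax)
    sns.despine()
    plt.tight_layout()
    return ts_ax, acf_ax, pacf_ax

tsplot(stock_train, title='Stock Close', lags=36)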

3. Data preprocessing

        The plots above show that the series is not stationary, so it needs to be differenced. First, the series can be plotted against its own lagged values as scatter plots, one per lag:

lags=9
ncols=3
nrows=int(np.ceil(lags/ncols))

fig, axes = plt.subplots(ncols=ncols, nrows=nrows, figsize=(4*ncols, 4*nrows))

for ax, lag in zip(axes.flat, np.arange(1,lags+1, 1)):
    lag_str = 't-{}'.format(lag)
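    # shift(lag) pairs each value with the one observed `lag` days earlier
    X = (pd.concat([stock_train, stock_train.shift(lag)], axis=1,
                   keys=['y'] + [lag_str]).dropna())

    X.plot(ax=ax, kind='scatter', y='y', x=lag_str);
    corr = X.corr().values[0][1]  # as_matrix() was removed in pandas 1.0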
    ax.set_ylabel('Original')
    ax.set_title('Lag: {} (corr={:.2f})'.format(lag_str, corr));
    ax.set_aspect('equal');
    sns.despine();

fig.tight_layout();

In the scatter plots the points lie close to a straight line, so the series is strongly correlated with its own lags. Take the first-order difference:

stock_diff = stock_train.diff()
stock_diff = stock_diff.dropna()

plt.figure()
plt.plot(stock_diff)
plt.title('First-order difference')
plt.show()

After differencing, the series is fairly stationary. The data fluctuates greatly from February to March because of the epidemic; it would be better to remove that part of the data, but I will not do so here. When introducing the ARIMA model earlier, I explained the parameters p, q, and d. Obviously, d is 1 here. So how are p and q determined? They are chosen from the ACF and PACF, according to the following rules:

  • AR(p): the ACF tails off, and the PACF cuts off after lag p;
  • MA(q): the ACF cuts off after lag q, and the PACF tails off;
  • ARMA(p, q): both the ACF and the PACF tail off.

Cutting off means falling inside the confidence interval: from that lag on, 95% of the points lie within it. We first need to plot the ACF and PACF:

fig = plt.figure(figsize=(12,8))

ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(stock_diff, lags=20,ax=ax1)
ax1.xaxis.set_ticks_position('bottom')
fig.tight_layout();

ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(stock_diff, lags=20, ax=ax2)
ax2.xaxis.set_ticks_position('bottom')
fig.tight_layout();

According to the rules above, first determine q from the ACF plot. The shaded region is the confidence interval, so the question is from which lag the values fall inside it. The plot shows that this happens at lag 2, where the PACF is also close to zero. To determine p, look at the PACF plot: the values fall inside the interval after lag 2, where the ACF is also close to zero. So p = 2 and q = 2.

4. Model training

model = ARIMA(stock_train, order=(2, 1, 2),freq='D') # p,d,q
result = model.fit()
result.summary()

The following uses the trained model to predict the trend from 2019-08-01 to 2020-10-01:

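pred = result.predict('20190801', '20201001', dynamic=True, typ='levels')
plt.figure(figsize=(6, 6))
plt.xticks(rotation=45)
# Explicit colors so the plot matches the description: blue = actual, red = predicted
plt.plot(stock_train, color='blue', label='actual')
plt.plot(pred, color='red', label='predicted')
plt.legend()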

The blue line is the actual value and the red line is the predicted value; the forecast does not look useful. Finally, take a look at the distribution of predicted values:

In essence, ARIMA can only capture linear relationships, not nonlinear ones. In other words, a time series forecast with ARIMA must be stationary; if it is not, the model cannot capture its pattern. Stock data is non-stationary and often fluctuates under the influence of policy and news, so the result is poor. It may be better to extract features from something like changes in Google search terms and then predict with a tree model. I will try that later.

Origin blog.csdn.net/qq_22172133/article/details/107795239