Machine learning practice notes (3): time series data prediction

1. Start by analyzing the data

Once we have the data, the first step is to analyze it. The variation in a time series is usually divided into four categories: 1. Long-term trend. 2. Seasonal changes. 3. Cyclical changes. 4. Irregular changes.
A quick primer on each (tongue in cheek):
1. Long-term trend:

The long-term trend is the general direction the data follows over a long period of time. Simply put, the function is monotonically increasing or monotonically decreasing.

2. Seasonal changes:

Due to natural conditions and social factors, the observed values change in a regular pattern that repeats within each year.

3. Cyclical changes:

The data behaves like a periodic function: it rises and falls in recurring cycles.

4. Irregular changes:

Irregular changes are fluctuations caused by unexpected events, and such events are infrequent.

A concrete demonstration:
[Figure: plot of the raw data]
Embarrassingly, I can't tell which category this data belongs to. The method should be chosen according to the specific application scenario, and my approach may not suit every case.
I went straight to the ARIMA algorithm based on my actual application scenario. (After all, it is just an algorithm. People are flexible, and you don't have to stick to one fixed algorithm.)

2. ARIMA

We can see that the data above is not stationary. Of course, the computer cannot see that just by looking, so we have to test it.

The essential background

A primer on time series stationarity: stationarity means that the curve fitted to the sample time series will keep following its existing "inertia" for some time into the future. Stationarity requires that the mean and variance of the series do not change significantly over time.
There are two notions of stationarity used with ARIMA:
Strictly stationary: the distribution of the series does not change with time, so in particular the expectation and variance are constant.
Weakly stationary: the expectation is constant, and the correlation (dependency) between two values depends only on the lag between them. For example, X_t at some future moment depends on past data. This is the dependency.

ps: In real life, strict stationarity is too hard to satisfy. Weak stationarity is usually good enough.
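As a rough illustration of "the mean and variance do not change significantly", the sketch below compares the two halves of a series. The data is synthetic and the slope of 0.01 is an arbitrary assumption. This is an informal sanity check, not a formal stationarity test.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
noise = rng.normal(size=n)               # constant mean and variance
trending = 0.01 * np.arange(n) + noise   # mean drifts upward over time


def half_stats(series):
    """Return (mean, variance) for the first and second half of the series."""
    first, second = np.array_split(series, 2)
    return (first.mean(), first.var()), (second.mean(), second.var())


for name, series in [("noise", noise), ("trending", trending)]:
    (m1, v1), (m2, v2) = half_stats(series)
    print(f"{name:9s} mean {m1:6.2f} -> {m2:6.2f}   variance {v1:5.2f} -> {v2:5.2f}")
```

For the noise series the two halves look alike, while for the trending series the mean of the second half is clearly larger.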

What if the data is not stationary?
This is where differencing comes in: take the difference between the series at time t and at time t-1.
ps: Suppose we have data x1, x2, x3, where x1 and x2 are stationary but x3 is not. In that case we need to difference all of x1, x2, and x3.

The first panel is the original series, and the second panel is the differenced series:
[Figure: original series (top) and first-differenced series (bottom)]
Code:

# -*- coding: utf-8 -*-
"""
Created on Sat Dec 26 18:57:06 2020

@author: 13056
"""
import pandas as pd
import matplotlib.pyplot as plt

# Load the data and drop the date column ('日期' means "date")
data = pd.read_csv(r'C:/Users/13056/Desktop/145.csv', encoding='gb2312')
data = data.drop(['日期'], axis=1)

# Create one figure and draw two stacked panels with subplot()
plt.figure(figsize=(6, 6), dpi=80)
# Split the figure into 2 rows x 1 column and select the first panel
ax1 = plt.subplot(211)
ax1.plot(data['ds'])               # original series
data['ds'] = data['ds'].diff(1)    # first-order differencing
# Select the second panel and plot the differenced series
ax2 = plt.subplot(212)
ax2.plot(data['ds'])
plt.show()

Regarding differencing, here is a worked example (my operating systems review finished today, so I have time to spare):
[Figure: worked example of differencing]
It should be simple and clear: a difference is just a subtraction. (The difference of a time series is still a time series, tongue in cheek.) The number passed to diff() above is the lag: diff(1) subtracts values one time step apart, and diff(2) subtracts values two time steps apart.
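A small sketch of what diff() does, using made-up numbers rather than the article's data. Note that diff(2) is a lag-2 difference, which is not the same as differencing twice; the last column shows the second-order difference for comparison.

```python
import pandas as pd

s = pd.Series([1, 4, 9, 16, 25])

lag1 = s.diff(1)                # x[t] - x[t-1]
lag2 = s.diff(2)                # x[t] - x[t-2]
second_order = s.diff().diff()  # difference of the first difference

print(pd.DataFrame({"x": s, "diff(1)": lag1, "diff(2)": lag2,
                    "diff().diff()": second_order}))
```

The leading NaN values appear because the first one or two points have nothing to be subtracted from, which is why the later code calls dropna() after diff().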

ARIMA model

ARIMA is the combination of AR + I + MA. Let's go through the parts one by one.

AR

AR is the autoregressive model.
The formula for the autoregressive process of order p: $y_t = \mu + \sum_{i=1}^{p} r_i y_{t-i} + \varepsilon_t$ (where $y_t$ is the current value, $\mu$ is the constant term, $p$ is the order, $r_i$ is the autocorrelation coefficient, and $\varepsilon_t$ is the error).

  • Uses past data to predict future data
  • The data must meet the stationarity requirement
  • As a rule of thumb, the autocorrelation coefficient $r_i$ should be at least 0.5 (the autocorrelation coefficient measures how strongly the same quantity is correlated across two different periods)
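To make the AR formula concrete, here is a minimal sketch that simulates an order-1 process and recovers its coefficient by least squares. The values mu = 1.0 and r1 = 0.7 are illustrative assumptions, not taken from the article's data, and plain least squares stands in for whatever estimator a library would use.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, r1, n = 1.0, 0.7, 5000   # illustrative values only

# Simulate y_t = mu + r1 * y_{t-1} + eps_t
y = np.zeros(n)
eps = rng.normal(size=n)
for t in range(1, n):
    y[t] = mu + r1 * y[t - 1] + eps[t]

# Estimate (mu, r1) by least squares: regress y_t on [1, y_{t-1}]
X = np.column_stack([np.ones(n - 1), y[:-1]])
mu_hat, r1_hat = np.linalg.lstsq(X, y[1:], rcond=None)[0]
print(f"estimated mu = {mu_hat:.3f}, estimated r1 = {r1_hat:.3f}")
```

The estimates should land close to the values used in the simulation, which is the sense in which the coefficients are "found from the existing data".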

MA

MA is the moving average model (the focus is on the accumulation of the error terms of the autoregressive model).
The formula for the moving average process of order q: $y_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$
Purpose: to effectively smooth out random fluctuations in the prediction process.

ARMA

The autoregressive moving average model. Its formula:
$y_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} + \sum_{i=1}^{p} r_i y_{t-i}$

To explain: p and q are chosen by us, and we use the existing data to estimate $\theta_i$ and $r_i$. The I in ARIMA is the differencing we did earlier (simply put, the number of times the data is differenced).

ARIMA

ARIMA is the differenced (integrated) autoregressive moving average model.
We need to specify three parameters (p, d, q): p and q are the orders of the autoregressive model and the moving average model, and d is the number of times the series is differenced. (The order here is the number of lags; a first-order lag is the value from the previous period.)

How to choose p and q

Autocorrelation function ACF

Purpose: to measure the correlation between values of the same series at different times.
The formula: $acf(k) = \rho_k = \mathrm{Cov}(y_t, y_{t-k}) / \mathrm{Var}(y_t)$
The value range of $\rho_k$ is [-1, 1]: -1 means negative correlation, +1 means positive correlation, and 0 means no correlation.
The ACF plot I drew: (An autocorrelation plot is a two-dimensional stem plot. The abscissa is the lag order and the ordinate is the ACF value; in a partial autocorrelation plot, the abscissa is again the lag order and the ordinate is the partial autocorrelation coefficient. The blue area is the confidence interval, normally 95%.) Simply put, the point at k on the abscissa describes the relationship with the data at t-k.
[Figure: ACF plot of the differenced series]
Code:

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Load the data and drop the date column ('日期' means "date")
data = pd.read_csv(r'C:/Users/13056/Desktop/145.csv', encoding='gb2312')
data = data.drop(['日期'], axis=1)
data['ds'] = data['ds'].diff(1)   # first-order differencing
data1 = data['ds'].dropna()       # drop the NaN created by diff()
plot_acf(data1)
plt.show()
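The ACF formula can also be computed by hand. The sketch below applies it to a simulated MA(1) series, where the coefficient 0.6 is an arbitrary assumption; for such a process the theoretical autocorrelation is theta / (1 + theta^2) at lag 1 and 0 at higher lags.

```python
import numpy as np


def acf(y, k):
    """Sample autocorrelation at lag k: Cov(y_t, y_{t-k}) / Var(y_t)."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    if k == 0:
        return 1.0
    return float(np.sum(centered[k:] * centered[:-k]) / np.sum(centered ** 2))


rng = np.random.default_rng(7)
theta, n = 0.6, 20000                 # illustrative MA(1) coefficient and sample size
eps = rng.normal(size=n + 1)
ma1 = eps[1:] + theta * eps[:-1]      # y_t = eps_t + theta * eps_{t-1}

for k in range(4):
    print(f"acf({k}) = {acf(ma1, k): .3f}")
print(f"theoretical acf(1) = {theta / (1 + theta ** 2):.3f}")
```

The values at lags 2 and 3 should sit near zero, which is the "cut-off" behavior used when reading q from an ACF plot.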

Now for the partial autocorrelation function, PACF. The acf(k) we obtained before is not the pure correlation between $y_t$ and $y_{t-k}$: it is also affected by the values in between, from t-1 to t-k+1. PACF removes these intermediate effects and measures only the direct correlation between the two. (It is quite involved; a working understanding is enough.)
The plot:
[Figure: PACF plot of the differenced series]
Code:

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

# Load the data and drop the date column ('日期' means "date")
data = pd.read_csv(r'C:/Users/13056/Desktop/145.csv', encoding='gb2312')
data = data.drop(['日期'], axis=1)
data['ds'] = data['ds'].diff(1)   # first-order differencing
data1 = data['ds'].dropna()       # drop the NaN created by diff()
plot_pacf(data1)
plt.show()
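For a feel of how PACF strips out the intermediate effect, the lag-2 partial autocorrelation can be written in terms of the ordinary autocorrelations as (rho_2 - rho_1^2) / (1 - rho_1^2). The sketch below plugs in theoretical autocorrelations for an AR(1) and an MA(1) process; the coefficients 0.7 and 0.6 are arbitrary assumptions.

```python
def pacf_lag2(rho1, rho2):
    """Partial autocorrelation at lag 2 from the lag-1 and lag-2 autocorrelations."""
    return (rho2 - rho1 ** 2) / (1 - rho1 ** 2)


# AR(1) with coefficient 0.7: theoretical autocorrelations are 0.7 and 0.7**2
r = 0.7
ar1_pacf2 = pacf_lag2(r, r ** 2)

# MA(1) with theta = 0.6: autocorrelation is theta / (1 + theta**2) at lag 1 and 0 at lag 2
theta = 0.6
ma1_pacf2 = pacf_lag2(theta / (1 + theta ** 2), 0.0)

print(f"AR(1) pacf(2) = {ar1_pacf2:.4f}")
print(f"MA(1) pacf(2) = {ma1_pacf2:.4f}")
```

For the AR(1) case the lag-2 value vanishes, because all of the lag-2 correlation passes through lag 1. For the MA(1) case it does not vanish, even though the ordinary autocorrelation at lag 2 is zero.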

Build the ARIMA model

We now need to find (p, d, q). d needs no further discussion: it is the order of differencing.

model        ACF                            PACF
AR(p)        decays toward 0                cuts off after lag p
MA(q)        cuts off after lag q           decays toward 0
ARMA(p, q)   decays toward 0 after lag q    decays toward 0 after lag p

Look at our PACF plot again: starting from the second stem (lag 1) the values move into the confidence band, so the AR model takes p = 1 here. For this choice, the ACF should decay toward zero.
Now look at the ACF plot again: it also enters the confidence band from the second stem (lag 1), so the MA model takes q = 1 here.
For this choice, the PACF should decay toward zero.
If this is hard to follow, there is another way: brute-force search!!!
Process:

  1. Make the series stationary (determine d)
  2. Find p,q
  3. Call model arima(p,d,q)

Finally, the code to build the ARIMA model:

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Load the data and drop the date column ('日期' means "date")
data = pd.read_csv(r'C:/Users/13056/Desktop/145.csv', encoding='gb2312')
data = data.drop(['日期'], axis=1)

data['ds'] = data['ds'].diff(1)   # first-order differencing
data1 = data['ds'].dropna()       # drop the NaN created by diff()

# order is (p, d, q); d is 0 because data1 is already differenced
model = sm.tsa.ARIMA(data1, order=(1, 0, 0))
results = model.fit()
resid = results.resid   # residuals of the fitted model
# Plot the ACF of the residuals on a single figure
fig, ax = plt.subplots(figsize=(12, 8))
sm.graphics.tsa.plot_acf(resid.values.squeeze(), ax=ax)
plt.show()

The resulting plot:
[Figure: ACF of the model residuals]

Determining the parameters

When we pick p and q from the plots, we often get not just one candidate pair but several that meet the conditions.
So we can use another method to choose p and q:
the AIC and BIC criteria (the smaller the value the better, which favors a smaller k and a larger L).
AIC (Akaike Information Criterion): AIC = 2k - 2ln(L)
BIC (Bayesian Information Criterion): BIC = k ln(n) - 2ln(L)
k is the number of model parameters, n is the number of samples, and L is the likelihood function.
An example using BIC:

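import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# Search ranges: example values only (the original does not give them).
# data1 is the differenced series from the earlier listings.
p_min, p_max = 0, 4
d_min, d_max = 0, 0   # data1 is already differenced, so d stays at 0
q_min, q_max = 0, 4

# BIC criterion: one row per AR order, one column per MA order
results_bic = pd.DataFrame(index=['AR{}'.format(i) for i in range(p_min, p_max + 1)],
                           columns=['MA{}'.format(i) for i in range(q_min, q_max + 1)])

for p, d, q in itertools.product(range(p_min, p_max + 1),
                                 range(d_min, d_max + 1),
                                 range(q_min, q_max + 1)):
    if p == 0 and d == 0 and q == 0:
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = np.nan
        continue

    try:
        model = sm.tsa.ARIMA(data1, order=(p, d, q),
                             # enforce_stationarity=False,
                             # enforce_invertibility=False,
                             )
        results = model.fit()
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = results.bic
    except Exception:
        # Skip orders for which the fit fails
        continue
results_bic = results_bic[results_bic.columns].astype(float)

# Draw the BIC table as a heat map; cells without a result are masked
fig, ax = plt.subplots(figsize=(10, 8))
ax = sns.heatmap(results_bic,
                 mask=results_bic.isnull(),
                 ax=ax,
                 annot=True,
                 fmt='.2f',
                 )
ax.set_title('BIC')
plt.show()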

Result:
[Figure: BIC heat map]
The lower the value in this heat map, the better.
In fact, there is another way to find them:

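# train is the training series (it must already be stationary)
# trend='nc' (no constant) is the spelling used by older statsmodels releases;
# newer releases expect trend='n' instead
train_results = sm.tsa.arma_order_select_ic(train, ic=['aic', 'bic'], trend='nc', max_ar=8, max_ma=8)
print('AIC', train_results.aic_min_order)
print('BIC', train_results.bic_min_order)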
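To make the AIC and BIC formulas concrete, here is a small worked sketch. The two candidate models, their parameter counts, and their log-likelihoods are made-up numbers chosen only for illustration.

```python
import math


def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(k, n, log_likelihood):
    """BIC = k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


n = 100  # hypothetical sample size
# Hypothetical candidates: (name, number of parameters k, maximized log-likelihood ln(L))
candidates = [("ARMA(1,0)", 2, -150.0), ("ARMA(2,2)", 5, -148.5)]

for name, k, ll in candidates:
    print(f"{name}: AIC = {aic(k, ll):.2f}, BIC = {bic(k, n, ll):.2f}")
```

In this made-up case the larger model fits slightly better but pays more for its extra parameters, so both criteria prefer the smaller one. BIC penalizes parameters more heavily than AIC once ln(n) exceeds 2.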

Finally, we need to check whether the residuals of the model are normally distributed with a mean of 0 and a constant variance.
My code above already includes this step:

model = sm.tsa.ARIMA(train, order=(1, 1, 1))
results = model.fit()
resid = results.resid   # residuals of the fitted model
# Plot the ACF of the residuals on a single figure
fig, ax = plt.subplots(figsize=(12, 8))
sm.graphics.tsa.plot_acf(resid.values.squeeze(), ax=ax)
plt.show()
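Alongside the plot, a couple of numbers can serve as a quick sanity check. The sketch below uses synthetic white noise as a stand-in for `results.resid` and reports the residual mean and the lag-1 autocorrelation against the approximate 95% band of 1.96 / sqrt(n). It is a rough check under those assumptions, not a formal normality test.

```python
import numpy as np


def residual_summary(resid):
    """Return the mean, the lag-1 autocorrelation, and the approximate 95% band."""
    resid = np.asarray(resid, dtype=float)
    centered = resid - resid.mean()
    lag1 = float(np.sum(centered[1:] * centered[:-1]) / np.sum(centered ** 2))
    band = 1.96 / np.sqrt(len(resid))
    return float(resid.mean()), lag1, float(band)


# Stand-in residuals: white noise, which is what a well-fitted model should leave behind
rng = np.random.default_rng(3)
fake_resid = rng.normal(size=2000)

mean, lag1, band = residual_summary(fake_resid)
print(f"mean = {mean:.3f}, lag-1 autocorrelation = {lag1:.3f}, 95% band = +/-{band:.3f}")
```

A mean far from zero, or a lag-1 value well outside the band, would suggest the model has left structure in the residuals.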

Model prediction

This is just an example of mine, and the result is not that good.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Load the data and drop the date column ('日期' means "date")
data = pd.read_csv(r'C:/Users/13056/Desktop/145.csv', encoding='gb2312')
data = data.drop(['日期'], axis=1)

data['ds'] = data['ds'].diff(1)   # first-order differencing
data1 = data['ds'].dropna()       # drop the NaN created by diff()

# order is (p, d, q). Note: data1 is already differenced once, so d=1 here
# differences it a second time; order=(1, 0, 1) would avoid that.
model = sm.tsa.ARIMA(data1, order=(1, 1, 1))
results = model.fit()
resid = results.resid   # residuals of the fitted model
# In-sample one-step-ahead predictions for observations 1 to 101
predict_sunspots = results.predict(start=1, end=101, dynamic=False)
plt.plot(data1)
plt.plot(predict_sunspots)
plt.show()

Result:
[Figure: differenced series with the model's predictions]
Or get the forecast value directly:

results.forecast()[0]
Out[46]: array([0.16628989])
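Because the model above is fitted on the differenced series, its forecasts are forecasts of differences. To get back to the original scale, they have to be added cumulatively to the last observed value. A minimal sketch with made-up numbers (the function name and the values are hypothetical):

```python
import numpy as np


def undo_first_difference(last_observed, forecast_diffs):
    """Turn forecasts of first differences back into forecasts of the original series."""
    return last_observed + np.cumsum(forecast_diffs)


# Hypothetical numbers: the last real observation and three forecasted differences
last_observed = 10.0
forecast_diffs = np.array([0.5, -0.2, 0.1])

levels = undo_first_difference(last_observed, forecast_diffs)
print(levels)
```

If the differencing is instead left to the model (raw data with d = 1), this step is not needed, since the forecasts then come back on the original scale.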

When I write up this project, I will post a nicer picture.

Origin blog.csdn.net/weixin_45743162/article/details/109428700