Seven methods of time series forecasting (Python 3)

Table of Contents
Data reading and processing
Install the library (statsmodels)
Method 1 – Naive method
Method 2 – Simple average
Method 3 – Moving average
Method 4 – Simple exponential smoothing
Method 5 – Holt's linear trend method
Method 6 – Holt-Winters seasonal method
Method 7 – Autoregressive integrated moving average (ARIMA)

Data reading and processing

First understand the problem description and the data set, then build the training and test sets for modeling, and finally visualize both sets to see how the data changes over time.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

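# Read the time series data, skipping malformed rows
# (pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv('../testdata/train.csv', on_bad_lines='skip')
print(df.head())

# Split into training and test sets by row position
train = df[0:550].copy()
test = df[550:].copy()

# Resample the data to daily frequency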
df.time = pd.to_datetime(df.time,format='%Y-%m-%d')
df.index = df.time
df = df.resample('D').mean()
print(df)
train.time = pd.to_datetime(train.time,format='%Y-%m-%d')
train.index = train.time
train = train.resample('D').mean()
print(train)
test.time = pd.to_datetime(test.time,format='%Y-%m-%d')
test.index = test.time
test = test.resample('D').mean()
print(test)

# Visualize the data to inspect the trend
train.cv.plot(figsize=(15,8), title= 'Daily Ridership', fontsize=14)
test.cv.plot(figsize=(15,8), title= 'Daily Ridership', fontsize=14)
plt.show()

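Every method below repeats the same split-and-resample steps. As a rough sketch, they could be wrapped in one helper. The function name `split_daily` and the synthetic data frame are illustrative assumptions and not part of the original code; only the column names `time` and `cv` are taken from it.

```python
import numpy as np
import pandas as pd

def split_daily(df, n_train, time_col="time"):
    """Split df by row position, then resample each part to daily means."""
    parts = []
    for part in (df.iloc[:n_train], df.iloc[n_train:]):
        part = part.copy()
        part[time_col] = pd.to_datetime(part[time_col], format="%Y-%m-%d")
        # Index by date and average rows that share a day; missing days become NaN
        parts.append(part.set_index(time_col).resample("D").mean())
    return parts[0], parts[1]

# Synthetic stand-in for train.csv: 20 consecutive days of made-up values
demo = pd.DataFrame({
    "time": pd.date_range("2020-01-01", periods=20, freq="D").strftime("%Y-%m-%d"),
    "cv": np.arange(20, dtype=float),
})
train, test = split_daily(demo, n_train=15)
print(len(train), len(test))
```

Working on copies of the slices avoids pandas warnings about assigning to a view of `df`.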

Install the library (statsmodels)

Several of the methods below (Methods 4 to 7) rely on the statsmodels library, so install it first. The accuracy checks also import scikit-learn.
pip3 install statsmodels

Method 1: Naive method

Consider the following graph:
[Figure: coin price over time]
The graph shows a coin price that has been stable from the start. Many data sets are similarly stable over the whole time period. To predict the next day's price, we can simply reuse the previous day's price. This technique, which assumes that the next point equals the last observed point, is called the Naive method:
$$\hat{y}_{t+1} = y_t$$
Now we use the Naive method to predict the prices in the test data.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

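# Read the time series data, skipping malformed rows
# (pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv('../testdata/train.csv', on_bad_lines='skip')
print(df.head())

# Split into training and test sets by row position
train = df[0:550].copy()
test = df[550:].copy()

# Resample the data to daily frequency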
df.time = pd.to_datetime(df.time,format='%Y-%m-%d')
df.index = df.time
df = df.resample('D').mean()
train.time = pd.to_datetime(train.time,format='%Y-%m-%d')
train.index = train.time
train = train.resample('D').mean()
test.time = pd.to_datetime(test.time,format='%Y-%m-%d')
test.index = test.time
test = test.resample('D').mean()

# Naive forecast: repeat the last training observation over the whole test period
dd = np.asarray(train.cv)
y_hat = test.copy()
y_hat['naive'] = dd[-1]
print(y_hat)
plt.figure(figsize=(12,8))
plt.plot(train.index,train['cv'],label='Train')
plt.plot(test.index,test['cv'],label='Test')
plt.plot(y_hat.index,y_hat['naive'],label='Naive Forecast')
plt.legend(loc='best')
plt.title('Naive Forecast')
plt.show()

# Check model accuracy with the root mean squared error (RMSE)
from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(test.cv,y_hat.naive))
print(rms)
# 22658116.88412684

The RMSE value and the graph above suggest that the Naive method is a poor fit for data sets that change frequently. It works best on stable data sets.
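The RMSE used above is the square root of the mean squared difference between actual and predicted values. The following pure-Python sketch shows the arithmetic for a naive forecast. The numbers are made up for illustration and do not come from the article's data set.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, for illustration only
train_values = [2.0, 3.0, 4.0]
test_values = [3.0, 5.0, 7.0]

# Naive forecast: repeat the last training value for every test point
naive_forecast = [train_values[-1]] * len(test_values)

# Errors are -1, 1 and 3, so the RMSE is sqrt((1 + 1 + 9) / 3)
score = rmse(test_values, naive_forecast)
print(round(score, 4))  # 1.9149
```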

Method 2: Simple average method

Consider the graph below.
[Figure: coin price fluctuating around a constant average]
The graph shows a coin price that rises and falls randomly within a small range while its average stays constant. Many data sets vary slightly over the whole period yet keep a constant average in each time period. In that case we can predict that the next day's price will be close to the average of all past days.

This technique, in which the forecast equals the average of all previously observed points, is called the simple average method:
$$\hat{y}_{x+1} = \frac{1}{x}\sum_{i=1}^{x} y_i$$
We take all previously known values, compute their average, and use it as the next value. The result will not be exact, but it will be reasonably close, and there are situations where this technique works best.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

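# Read the time series data, skipping malformed rows
# (pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv('../testdata/train.csv', on_bad_lines='skip')
print(df.head())

# Split into training and test sets by row position
train = df[0:550].copy()
test = df[550:].copy()

# Resample the data to daily frequency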
df.time = pd.to_datetime(df.time,format='%Y-%m-%d')
df.index = df.time
df = df.resample('D').mean()
train.time = pd.to_datetime(train.time,format='%Y-%m-%d')
train.index = train.time
train = train.resample('D').mean()
test.time = pd.to_datetime(test.time,format='%Y-%m-%d')
test.index = test.time
test = test.resample('D').mean()

#Average Forecast
y_hat_avg = test.copy()
y_hat_avg['avg'] = train['cv'].mean()
print(y_hat_avg)
plt.figure(figsize=(12,8))
plt.plot(train.index,train['cv'],label='Train')
plt.plot(test.index,test['cv'],label='Test')
plt.plot(y_hat_avg.index,y_hat_avg['avg'],label='Average Forecast')
plt.legend(loc='best')
plt.title('Average Forecast')
plt.show()

# Check model accuracy with the root mean squared error (RMSE)
from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(test.cv,y_hat_avg.avg))
print(rms)
#9298381.954466905

On this data set the simple average method scores better than the Naive method: its RMSE is about 9.3 million, against about 22.7 million. The method works best when the average of each time period stays constant. A better score here does not mean that the simple average method beats the Naive method on every data set.

Method 3: Moving average method

Consider the graph below.
[Figure: coin price that rose sharply earlier and is now stable]
The graph shows a coin price that rose sharply some time ago but is now stable. Many data sets contain such a sharp rise or fall in price or sales at some point in the past. If we used the early prices, they would heavily distort the forecast for the next period. As an improvement over the simple average method, we therefore average only the prices of the last few periods, since only the recent values matter. This technique, which averages over a sliding time window, is called the moving average method.

A simple moving average model predicts the next value or values in a time series from the average of a fixed, finite number $p$ of previous values. Thus, for all $i > p$:
$$\hat{y}_i = \frac{1}{p}\left(y_{i-1} + y_{i-2} + \dots + y_{i-p}\right)$$
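To make the windowed average concrete, here is a small pure-Python sketch that compares it with the simple average. The function name and the price list are hypothetical. With `p = 10` the function corresponds to the `rolling(10).mean().iloc[-1]` call used in the code for this method.

```python
def moving_average_forecast(history, p):
    """Forecast the next value as the mean of the last p observations."""
    if p < 1 or p > len(history):
        raise ValueError("p must be between 1 and len(history)")
    window = history[-p:]
    return sum(window) / p

# Hypothetical series: low early values, then a jump to a stable level
prices = [10.0, 10.0, 50.0, 52.0, 51.0, 53.0]

simple_avg = sum(prices) / len(prices)           # uses every observation
recent_avg = moving_average_forecast(prices, 3)  # uses only the last 3

print(round(simple_avg, 2), round(recent_avg, 2))  # 37.67 52.0
```

The early low values pull the simple average far below the current level, while the windowed average stays close to it.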

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

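# Read the time series data, skipping malformed rows
# (pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv('../testdata/train.csv', on_bad_lines='skip')
print(df.head())

# Split into training and test sets by row position
train = df[0:600].copy()
test = df[600:].copy()

# Resample the data to daily frequency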
df.time = pd.to_datetime(df.time,format='%Y-%m-%d')
df.index = df.time
df = df.resample('D').mean()
train.time = pd.to_datetime(train.time,format='%Y-%m-%d')
train.index = train.time
train = train.resample('D').mean()
test.time = pd.to_datetime(test.time,format='%Y-%m-%d')
test.index = test.time
test = test.resample('D').mean()

# Moving average forecast: mean of the last 10 training observations
y_hat_avg = test.copy()
y_hat_avg['mov_avg'] = train['cv'].rolling(10).mean().iloc[-1]
print(y_hat_avg)
plt.figure(figsize=(12,8))
plt.plot(train.index,train['cv'],label='Train')
plt.plot(test.index,test['cv'],label='Test')
plt.plot(y_hat_avg.index,y_hat_avg['mov_avg'],label='Moving Average Forecast')
plt.legend(loc='best')
plt.title('Moving Average Forecast')
plt.show()

# Check model accuracy with the root mean squared error (RMSE)
from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(test.cv,y_hat_avg.mov_avg))
print(rms)
#1432105.0414506379
# Judging by the RMSE and the plot above, averaging only the last few periods works better here than the simple average, because the most recent values matter most.

An improvement on the moving average method is the weighted moving average method. In the moving average method above, the past N observations are weighted equally. In practice, however, each past observation may affect the forecast differently. Weighting past observations differently is called the weighted moving average technique.
The weighted moving average is a moving average that assigns different weights to the values inside the sliding window, usually so that more recent points matter more.
Instead of choosing only a window size, you supply a list of weights, which should sum to 1. For example, the weights [0.40, 0.25, 0.20, 0.15] give 40%, 25%, 20% and 15% to the last four points, respectively.
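The article gives no code for the weighted variant, so the following is only a minimal sketch. It assumes that the first weight applies to the most recent observation. The function name and the prices are hypothetical.

```python
def weighted_moving_average(history, weights):
    """Forecast the next value; weights[0] applies to the most recent point."""
    if len(weights) > len(history):
        raise ValueError("need at least as many observations as weights")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    # Reverse so the newest observation lines up with weights[0]
    recent_first = history[::-1][:len(weights)]
    return sum(w * y for w, y in zip(weights, recent_first))

prices = [100.0, 110.0, 120.0, 130.0]  # hypothetical, oldest first
weights = [0.40, 0.25, 0.20, 0.15]     # 40% goes to the newest point

forecast = weighted_moving_average(prices, weights)
print(round(forecast, 2))  # 119.0
```

An equally weighted average of the same four prices would be 115.0, so the weights shift the forecast toward the recent, higher values.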

Method 4: Simple exponential smoothing

Having seen the methods above, note that the simple average method and the weighted moving average method sit at opposite extremes. We need something between the two that takes all the data into account while weighting the data points differently. This technique is called simple exponential smoothing. The forecast is a weighted average in which the weights decrease exponentially as the observations get older, so the smallest weights go to the oldest observations:
$$\hat{y}_{T+1|T} = \alpha y_T + \alpha(1-\alpha) y_{T-1} + \alpha(1-\alpha)^2 y_{T-2} + \dots$$
where $0 \le \alpha \le 1$ is the smoothing parameter.

The one-step-ahead forecast for time $T+1$ is a weighted average of all the observations in the series $y_1, \dots, y_T$. The rate at which the weights decrease is controlled by the parameter $\alpha$.

Looking at the formula long enough, you will see that the forecast $\hat{y}_{t+1|t}$ is the sum of two products: $\alpha \cdot y_t$ and $(1-\alpha) \cdot \hat{y}_{t|t-1}$.

It can therefore also be written as:
$$\hat{y}_{t+1|t} = \alpha y_t + (1-\alpha)\hat{y}_{t|t-1}$$
So essentially we have a weighted moving average with two weights: $\alpha$ and $1-\alpha$.

As you can see, $1-\alpha$ multiplies the previous forecast $\hat{y}_{t|t-1}$, which makes the expression recursive. This is why the method is called exponential. The forecast at time $t+1$ is a weighted average of the most recent observation $y_t$ and the most recent forecast $\hat{y}_{t|t-1}$.
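The recursion described above can be checked by hand. This sketch computes the forecast both recursively and as an explicit weighted sum. Initializing with the first observation is an assumption made for illustration, and statsmodels may initialize differently, so its numbers need not match. The function names and data are hypothetical.

```python
def ses_forecast(series, alpha):
    """One-step-ahead forecast via forecast = alpha*y + (1-alpha)*forecast."""
    forecast = series[0]  # assumption: start from the first observation
    for y in series:
        forecast = alpha * y + (1 - alpha) * forecast
    return forecast

def ses_weighted_sum(series, alpha):
    """The same forecast written as an exponentially weighted sum."""
    n = len(series)
    total = sum(alpha * (1 - alpha) ** k * series[n - 1 - k] for k in range(n))
    # The leftover weight (1-alpha)^n stays on the initial value
    return total + (1 - alpha) ** n * series[0]

data = [10.0, 12.0, 11.0, 13.0]  # hypothetical values
recursive = ses_forecast(data, 0.6)
explicit = ses_weighted_sum(data, 0.6)
print(round(recursive, 3), round(explicit, 3))  # 12.232 12.232
```

Both forms agree, which shows that the short recursion carries the full exponentially decaying weighting of past observations.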

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

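# Read the time series data, skipping malformed rows
# (pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv('../testdata/train.csv', on_bad_lines='skip')
print(df.head())

# Split into training and test sets by row position
train = df[0:600].copy()
test = df[600:].copy()

# Resample the data to daily frequency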
df.time = pd.to_datetime(df.time,format='%Y-%m-%d')
df.index = df.time
df = df.resample('D').mean()
train.time = pd.to_datetime(train.time,format='%Y-%m-%d')
train.index = train.time
train = train.resample('D').mean()
test.time = pd.to_datetime(test.time,format='%Y-%m-%d')
test.index = test.time
test = test.resample('D').mean()

# Simple exponential smoothing forecast with a fixed alpha of 0.6
from statsmodels.tsa.api import ExponentialSmoothing,SimpleExpSmoothing,Holt
y_hat_avg = test.copy()
fit2 = SimpleExpSmoothing(np.asarray(train['cv'])).fit(smoothing_level=0.6,optimized=False)
y_hat_avg['SES'] = fit2.forecast(len(test))
print(y_hat_avg)
plt.figure(figsize=(12,8))
plt.plot(train.index,train['cv'],label='Train')
plt.plot(test.index,test['cv'],label='Test')
plt.plot(y_hat_avg.index,y_hat_avg['SES'],label='Exponential Forecast')
plt.legend(loc='best')
plt.title('Exponential Forecast')
plt.show()

# Check model accuracy with the root mean squared error (RMSE)
from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(test.cv,y_hat_avg.SES))
print(rms)
#1425009.0760642842

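The simple exponential smoothing model with α = 0.6 gives the lowest RMSE so far, slightly below that of the moving average (about 1.425 million against 1.432 million). Note that Methods 1 and 2 used a different train/test split (550 rows instead of 600), so their scores are not strictly comparable.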

Method 5: Holt's linear trend method

We have now covered several forecasting methods, but none of them works well on data with high variation.

A trend is the general pattern of prices observed over a period of time. Each of the earlier methods could also be applied to the trend itself. For example, the Naive method would assume that the trend between the last two points stays the same, or we could average the slopes between all points to get an average trend, use a moving average of the trend, or apply exponential smoothing.

But we need a method that maps the trend accurately without such assumptions. A method that takes the trend of the data set into account is Holt's linear trend method. Every time series can be decomposed into trend, seasonality and residual components, and any data set that follows a trend can be forecast with Holt's linear trend method.

Holt extended simple exponential smoothing to allow forecasting of data with a trend. It is simply exponential smoothing applied to both the level (the average value of the series) and the trend. In mathematical notation, three equations are needed: one for the level, one for the trend, and one that combines the level and the trend to give the forecast $\hat{y}$.

As in simple exponential smoothing, the level equation shows that the level is a weighted average of the observation and the within-sample one-step-ahead forecast. The trend equation shows that the trend is a weighted average of the estimated trend at time $t$, based on $\ell_t - \ell_{t-1}$, and the previous trend estimate $b_{t-1}$.

Adding the level and the trend gives the forecast equation. It is also possible to build a multiplicative forecast equation by multiplying the trend and the level instead of adding them. The additive form is used when the trend rises or falls linearly, and the multiplicative form when it rises or falls exponentially. In practice the multiplicative form tends to give more stable predictions, but the additive form is easier to understand.
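The three equations described above can be written out as a short loop. This is a hand-rolled sketch of the additive form, not the statsmodels implementation. Taking the starting level and trend from the first two observations is an assumption, and the function name and data are hypothetical.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Holt's additive linear trend method.

    level_t = alpha * y_t + (1 - alpha) * (level_{t-1} + trend_{t-1})
    trend_t = beta * (level_t - level_{t-1}) + (1 - beta) * trend_{t-1}
    forecast_{t+h} = level_t + h * trend_t
    """
    # Assumption: simple initial values taken from the first two observations
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]

# Hypothetical, perfectly linear series: the forecast should continue the line
data = [2.0, 4.0, 6.0, 8.0, 10.0]
prediction = holt_forecast(data, alpha=0.3, beta=0.1, horizon=3)
print([round(v, 6) for v in prediction])  # [12.0, 14.0, 16.0]
```

Unlike the earlier methods, which produce a flat forecast line, this forecast keeps rising with the estimated slope.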

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

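# Read the time series data, skipping malformed rows
# (pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv('../testdata/train.csv', on_bad_lines='skip')
print(df.head())

# Split into training and test sets by row position
train = df[0:580].copy()
test = df[580:].copy()

# Resample the data to daily frequency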
df.time = pd.to_datetime(df.time,format='%Y-%m-%d')
df.index = df.time
df = df.resample('D').mean()
train.time = pd.to_datetime(train.time,format='%Y-%m-%d')
train.index = train.time
train = train.resample('D').mean()
test.time = pd.to_datetime(test.time,format='%Y-%m-%d')
test.index = test.time
test = test.resample('D').mean()

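# Holt's linear trend method
# (statsmodels >= 0.12 names the trend parameter smoothing_trend; older versions used smoothing_slope)
from statsmodels.tsa.api import ExponentialSmoothing,SimpleExpSmoothing,Holt
y_hat_avg = test.copy()
fit2 = Holt(np.asarray(train['cv'])).fit(smoothing_level=0.3,smoothing_trend=0.1)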
y_hat_avg['holt_linear'] = fit2.forecast(len(test))
print(y_hat_avg)
plt.figure(figsize=(12,8))
plt.plot(train.index,train['cv'],label='Train')
plt.plot(test.index,test['cv'],label='Test')
plt.plot(y_hat_avg.index,y_hat_avg['holt_linear'],label='Holt Linear')
plt.legend(loc='best')
plt.title('Holt Linear')
plt.show()

# Check model accuracy with the root mean squared error (RMSE)
from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(test.cv,y_hat_avg.holt_linear))
print(rms)
#7062387.952482162

Method 6: Holt-Winters seasonal method

Consider a hotel on a hill. It receives many visitors during the summer and relatively few during the rest of the year, so the owner's profit is much higher in summer than in other seasons. The same pattern repeats every year, which makes it seasonal. Data sets that show a similar pattern at fixed time intervals are said to have seasonality.

Because it accounts for seasonality, the Holt-Winters method is a better fit for such data than the models above. The Holt-Winters seasonal method comprises a forecast equation and three smoothing equations: one for the level $\ell_t$, one for the trend $b_t$ and one for the seasonal component $s_t$, with smoothing parameters $\alpha$, $\beta$ and $\gamma$.
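For reference, the additive form of these four equations is commonly written as follows. Here $m$ is the length of the seasonal period (the code for this method uses `seasonal_periods=7`), $h$ is the forecast horizon, and $k = \lfloor (h-1)/m \rfloor$. This is the standard textbook formulation, given as an aid and not taken from the original article.

```latex
\begin{aligned}
\hat{y}_{t+h|t} &= \ell_t + h\,b_t + s_{t+h-m(k+1)} \\
\ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\,s_{t-m}
\end{aligned}
```

The level equation removes the seasonal effect from the observation before smoothing, and the seasonal equation updates the estimate for the same position in the previous cycle.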

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

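# Read the time series data, skipping malformed rows
# (pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv('../testdata/train.csv', on_bad_lines='skip')
print(df.head())

# Split into training and test sets by row position
train = df[0:600].copy()
test = df[600:].copy()

# Resample the data to daily frequency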
df.time = pd.to_datetime(df.time,format='%Y-%m-%d')
df.index = df.time
df = df.resample('D').mean()
train.time = pd.to_datetime(train.time,format='%Y-%m-%d')
train.index = train.time
train = train.resample('D').mean()
test.time = pd.to_datetime(test.time,format='%Y-%m-%d')
test.index = test.time
test = test.resample('D').mean()

# Holt-Winters: additive trend and additive seasonality with a 7-day period
from statsmodels.tsa.api import ExponentialSmoothing,SimpleExpSmoothing,Holt
y_hat_avg = test.copy()
fit1 = ExponentialSmoothing(np.asarray(train['cv']), seasonal_periods=7, trend='add', seasonal='add').fit()
y_hat_avg['Holt_Winter'] = fit1.forecast(len(test))
plt.figure(figsize=(16,8))
plt.plot( train['cv'], label='Train')
plt.plot(test['cv'], label='Test')
plt.plot(y_hat_avg['Holt_Winter'], label='Holt_Winter')
plt.legend(loc='best')
plt.show()

# Check model accuracy with the root mean squared error (RMSE)
from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(test.cv,y_hat_avg.Holt_Winter))
print(rms)
#2627706.3047538185

Method 7: ARIMA

Another time series model that is very popular among data scientists is ARIMA, which stands for autoregressive integrated moving average. Whereas exponential smoothing models are based on a description of the trend and seasonality in the data, ARIMA models aim to describe the correlations within the data. An extension of ARIMA, seasonal ARIMA (SARIMA), also takes the seasonality of the data set into account, just like the Holt-Winters method.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

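# Read the time series data, skipping malformed rows
# (pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv('../testdata/train.csv', on_bad_lines='skip')
print(df.head())

# Split into training and test sets by row position
train = df[0:600].copy()
test = df[600:].copy()

# Resample the data to daily frequency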
df.time = pd.to_datetime(df.time,format='%Y-%m-%d')
df.index = df.time
df = df.resample('D').mean()
train.time = pd.to_datetime(train.time,format='%Y-%m-%d')
train.index = train.time
train = train.resample('D').mean()
test.time = pd.to_datetime(test.time,format='%Y-%m-%d')
test.index = test.time
test = test.resample('D').mean()

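# Seasonal ARIMA (SARIMA) with a weekly seasonal period of 7
import statsmodels.api as sm
y_hat_avg = test.copy()
fit1 = sm.tsa.statespace.SARIMAX(train.cv, order=(2, 1, 4),seasonal_order=(0,1,1,7)).fit()
# Forecast over the test period; these dates are specific to this data set
y_hat_avg['SARIMA'] = fit1.predict(start="2020-08-22", end="2020-09-03", dynamic=True)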
plt.figure(figsize=(16,8))
plt.plot( train['cv'], label='Train')
plt.plot(test['cv'], label='Test')
plt.plot(y_hat_avg['SARIMA'], label='SARIMA')
plt.legend(loc='best')
plt.show()

# Check model accuracy with the root mean squared error (RMSE)
from sklearn.metrics import mean_squared_error
from math import sqrt

rms = sqrt(mean_squared_error(test.cv,y_hat_avg.SARIMA))
print(rms)
#1511711.2373576711

As you can see, seasonal ARIMA gives a forecast comparable to Holt-Winters, here with a lower RMSE (about 1.51 million against 2.63 million). The model orders were chosen by inspecting the ACF and PACF plots.
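The ACF mentioned above measures how strongly a series correlates with a lagged copy of itself. The sketch below computes the sample autocorrelation by hand on a made-up weekly pattern. It is an illustrative stand-in for a plotted ACF, and the function name and data are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((y - mean) ** 2 for y in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance

# Hypothetical series that repeats every 7 steps, like a weekly pattern
weekly = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0] * 6

acf_values = {lag: autocorrelation(weekly, lag) for lag in (1, 3, 7)}
print({lag: round(value, 3) for lag, value in acf_values.items()})
```

A pronounced peak at lag 7 is the kind of evidence that motivates a seasonal period of 7 in `seasonal_order`.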

Origin blog.csdn.net/qq_30868737/article/details/108470954