[Data Mining] Time Series Model Processing Guide (2)

1. Description

This article is the second part of a series of articles. This article will use stock data for time series analysis as an example to elaborate on the method and process of time analysis.

Second, the previous section

In the first part of the article: [Data Mining] Time Series Model Processing (1)_Mr. Anhui's Blog-CSDN Blog

7. Time series model example: predicting stock prices

We will use the historical share price of the New German Fund (GF) to try to predict the closing price for the next five trading days. (You can work with datasets and notebooks .

7.1 Import data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import median_absolute_error, mean_squared_error, mean_squared_log_error

from scipy.optimize import minimize
import statsmodels.tsa.api as smt
import statsmodels.api as sm

from tqdm import tqdm_notebook

from itertools import product

def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

DATAPATH = 'data/stock_prices_sample.csv'

data = pd.read_csv(DATAPATH, index_col=['DATE'], parse_dates=['DATE'])
data.head(10)

First, we'll import some libraries that will help throughout the analysis. Also, we need to define the mean percent error (MAPE), as this will be our error metric.

Then, we'll import the dataset and the first ten entries. You should get:

As you can see, we have some entries dealing with different stocks than the New German Fund (GF). Also, we have an entry for intraday information, but we only want end-of-day (EOD) information.

7.2 Clean data

data = data[data.TICKER != 'GEF']
data = data[data.TYPE != 'Intraday']

drop_cols = ['SPLIT_RATIO', 'EX_DIVIDEND', 'ADJ_FACTOR', 'ADJ_VOLUME', 'ADJ_CLOSE', 'ADJ_LOW', 'ADJ_HIGH', 'ADJ_OPEN', 'VOLUME', 'FREQUENCY', 'TYPE', 'FIGI']

data.drop(drop_cols, axis=1, inplace=True)

data.head()

First, we'll delete unwanted entries . Then, we'll drop the columns we don't need, since we only want to focus on the stock's closing price.

Prepare for data analysis: for exploratory data analysis .

7.3. Exponential Moving Average (EDA)

# Plot closing price

plt.figure(figsize=(17, 8))
plt.plot(data.CLOSE)
plt.title('Closing price of New Germany Fund Inc (GF)')
plt.ylabel('Closing price ($)')
plt.xlabel('Trading day')
plt.grid(False)
plt.show()

We will plot the closing price for the entire time period of the dataset.

You should get:

Obviously, this is not a static process, and it is difficult to tell if there is some kind of seasonality.

7.4. Moving average

Let's smooth our time series using a moving average model. For this, we will rely on a helper function that will run the moving average model over a specified time window and will plot the resulting smooth curve:

def plot_moving_average(series, window, plot_intervals=False, scale=1.96):

    rolling_mean = series.rolling(window=window).mean()
    
    plt.figure(figsize=(17,8))
    plt.title('Moving average\n window size = {}'.format(window))
    plt.plot(rolling_mean, 'g', label='Rolling mean trend')
    
    #Plot confidence intervals for smoothed values
    if plot_intervals:
        mae = mean_absolute_error(series[window:], rolling_mean[window:])
        deviation = np.std(series[window:] - rolling_mean[window:])
        lower_bound = rolling_mean - (mae + scale * deviation)
        upper_bound = rolling_mean + (mae + scale * deviation)
        plt.plot(upper_bound, 'r--', label='Upper bound / Lower bound')
        plt.plot(lower_bound, 'r--')
            
    plt.plot(series[window:], label='Actual values')
    plt.legend(loc='best')
    plt.grid(True)
    
#Smooth by the previous 5 days (by week)
plot_moving_average(data.CLOSE, 5)

#Smooth by the previous month (30 days)
plot_moving_average(data.CLOSE, 30)

#Smooth by previous quarter (90 days)
plot_moving_average(data.CLOSE, 90, plot_intervals=True)

Using a time window of five days, we get:

Smoothed curve from the previous trading week

We can hardly see the trend because it is too close to the actual curve. Let's smooth the previous month and the previous quarter to compare the results.

Smoothed from previous quarter (90 days)

Trends are now easier to spot. Notice how the 30-day and 90-day trends show a downward curve at the end. This could mean that the stock could fall over the next few days.

7.5. Exponential smoothing

Now, let's use exponential smoothing to see if it can get a better trend.

def exponential_smoothing(series, alpha):

    result = [series[0]] # first value is same as series
    for n in range(1, len(series)):
        result.append(alpha * series[n] + (1 - alpha) * result[n-1])
    return result
  
def plot_exponential_smoothing(series, alphas):
 
    plt.figure(figsize=(17, 8))
    for alpha in alphas:
        plt.plot(exponential_smoothing(series, alpha), label="Alpha {}".format(alpha))
    plt.plot(series.values, "c", label = "Actual")
    plt.legend(loc="best")
    plt.axis('tight')
    plt.title("Exponential Smoothing")
    plt.grid(True);

plot_exponential_smoothing(data.CLOSE, [0.05, 0.3])

Here, we use 0.05 and 0.3 as values for the smoothing factor. Feel free to experiment with other values and see how it turns out.

Exponential smoothing applied to the model. | Image: Marco Pesciero

As you can see, an alpha value of 0.05 smoothes the curve, picking up most of the up and down trends at the same time.

Now, let's use double exponential smoothing .

7.6. Double Exponential Smoothing

def double_exponential_smoothing(series, alpha, beta):

    result = [series[0]]
    for n in range(1, len(series)+1):
        if n == 1:
            level, trend = series[0], series[1] - series[0]
        if n >= len(series): # forecasting
            value = result[-1]
        else:
            value = series[n]
        last_level, level = level, alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        result.append(level + trend)
    return result

def plot_double_exponential_smoothing(series, alphas, betas):
     
    plt.figure(figsize=(17, 8))
    for alpha in alphas:
        for beta in betas:
            plt.plot(double_exponential_smoothing(series, alpha, beta), label="Alpha {}, beta {}".format(alpha, beta))
    plt.plot(series.values, label = "Actual")
    plt.legend(loc="best")
    plt.axis('tight')
    plt.title("Double Exponential Smoothing")
    plt.grid(True)
    
plot_double_exponential_smoothing(data.CLOSE, alphas=[0.9, 0.02], betas=[0.9, 0.02])

you will get:

Double exponential smoothing applied to the model. | Image: Marco Pesciero

Again, try different combinations of alpha and beta to get better looking curves.

7.7. Modeling

As mentioned before, we have to turn our series into a static process in order to model it. So let's apply the Dickey-Fuller test to see if it's a stationary process:

def tsplot(y, lags=None, figsize=(12, 7), syle='bmh'):
    
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
        
    with plt.style.context(style='bmh'):
        fig = plt.figure(figsize=figsize)
        layout = (2,2)
        ts_ax = plt.subplot2grid(layout, (0,0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1,0))
        pacf_ax = plt.subplot2grid(layout, (1,1))
        
        y.plot(ax=ts_ax)
        p_value = sm.tsa.stattools.adfuller(y)[1]
        ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
        plt.tight_layout()
        
tsplot(data.CLOSE, lags=30)

# Take the first difference to remove to make the process stationary
data_diff = data.CLOSE - data.CLOSE.shift(1)

tsplot(data_diff[1:], lags=30)

You should see:

The Dicke-Fuller test was applied to models that revealed non-stationary models.

The time series is unsurprisingly non-stationary by the Dickey-Fuller test. Also, looking at the autocorrelation plot, we see that it's very high and there doesn't seem to be significant seasonality.

To get rid of the high autocorrelation and make the process stationary, let's take the first difference (line 23 in the code block). We simply subtract the one-day-lag time series from itself to get:

Our series is now fixed and we can start modeling.

7.8. Sarima

#Set initial values and some bounds
ps = range(0, 5)
d = 1
qs = range(0, 5)
Ps = range(0, 5)
D = 1
Qs = range(0, 5)
s = 5

#Create a list with all possible combinations of parameters
parameters = product(ps, qs, Ps, Qs)
parameters_list = list(parameters)
len(parameters_list)

# Train many SARIMA models to find the best set of parameters
def optimize_SARIMA(parameters_list, d, D, s):
    """
        Return dataframe with parameters and corresponding AIC
        
        parameters_list - list with (p, q, P, Q) tuples
        d - integration order
        D - seasonal integration order
        s - length of season
    """
    
    results = []
    best_aic = float('inf')
    
    for param in tqdm_notebook(parameters_list):
        try: model = sm.tsa.statespace.SARIMAX(data.CLOSE, order=(param[0], d, param[1]),
                                               seasonal_order=(param[2], D, param[3], s)).fit(disp=-1)
        except:
            continue
            
        aic = model.aic
        
        #Save best model, AIC and parameters
        if aic < best_aic:
            best_model = model
            best_aic = aic
            best_param = param
        results.append([param, model.aic])
        
    result_table = pd.DataFrame(results)
    result_table.columns = ['parameters', 'aic']
    #Sort in ascending order, lower AIC is better
    result_table = result_table.sort_values(by='aic', ascending=True).reset_index(drop=True)
    
    return result_table

result_table = optimize_SARIMA(parameters_list, d, D, s)

#Set parameters that give the lowest AIC (Akaike Information Criteria)
p, q, P, Q = result_table.parameters[0]

best_model = sm.tsa.statespace.SARIMAX(data.CLOSE, order=(p, d, q),
                                       seasonal_order=(P, D, Q, s)).fit(disp=-1)

print(best_model.summary())

Now, for SARIMA, we first need to define some parameters and ranges of values for other parameters to generate a list of all possible combinations of p,q,d,P,Q,D,S.

Now, in the code cell above, we have 625 different combinations. We will try each combination and train SARIMA with each combination to find the best performing model. This may take a while, depending on your computer's processing power.

Once this is done, we will print out a summary of the best model and you should see:

Print the results of the time series model.

We can finally predict the closing price for the next five trading days and evaluate the mean absolute percentage error (MAPE) of the model.

In this case, we have a MAPE of 0.79%, which is very good.

More on Data Science C-Means Clustering Explained

7.9. Comparing forecasted prices with actual data

# Make a dataframe containing actual and predicted prices
comparison = pd.DataFrame({'actual': [18.93, 19.23, 19.08, 19.17, 19.11, 19.12],
                          'predicted': [18.96, 18.97, 18.96, 18.92, 18.94, 18.92]}, 
                          index = pd.date_range(start='2018-06-05', periods=6,))


#Plot predicted vs actual price

plt.figure(figsize=(17, 8))
plt.plot(comparison.actual)
plt.plot(comparison.predicted)
plt.title('Predicted closing price of New Germany Fund Inc (GF)')
plt.ylabel('Closing price ($)')
plt.xlabel('Trading day')
plt.legend(loc='best')
plt.grid(False)
plt.show()

Now, to compare our forecasts with actual data, we can take the financial data from Yahoo Finance and create a DataFrame.

Then, we make a graph to see how far we are from the actual closing price:

Comparison of predicted (orange line) and actual (blue line) closing prices

It appears our predictions were a bit off. The fact that the predicted prices are essentially flat means that our model may be underperforming.

Again, this is not due to our program, but due to the fact that predicting stock prices is basically impossible.

From the first project, we learned the whole process of making a time series stationary before modeling with SARIMA. This is a long and tedious process that requires a lot of manual adjustments.

Now, let's introduce Facebook's Prophet. It is a predictive tool available in Python and R. This tool allows experts and non-experts to generate high-quality forecasts with minimal effort.