[Translated] Time Series Analysis, Visualization, and Forecasting with LSTM

Statistical normality testing, the Dickey-Fuller test for stationarity, and long short-term memory networks

The title says it all.

Without further ado, let's get started!

Data

The data is the power consumption of a single household, measured at a one-minute sampling rate over almost four years. You can download it here.

The data contains several different electrical quantities and some sub-metering values. However, we are only concerned with the Global_active_power variable.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.4f' % x)
import seaborn as sns
sns.set_context("paper", font_scale=1.3)
sns.set_style('white')
import warnings
warnings.filterwarnings('ignore')
from time import time
import matplotlib.ticker as tkr
from scipy import stats
from statsmodels.tsa.stattools import adfuller
from sklearn import preprocessing
from statsmodels.tsa.stattools import pacf
%matplotlib inline

import math
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import *
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from keras.callbacks import EarlyStopping

df=pd.read_csv('household_power_consumption.txt', delimiter=';')
print('Number of rows and columns:', df.shape)
df.head(5)

Table 1
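One caveat worth noting (an addition to the original post): in the commonly distributed UCI file, missing measurements are recorded as the '?' character, which is why the numeric conversion below uses errors='coerce'. If you prefer, you can handle this at load time instead; a minimal sketch:

# Alternative load (sketch only): treat '?' placeholders as NaN up front,
# so Global_active_power parses as numeric without a later coercion step.
df = pd.read_csv('household_power_consumption.txt', delimiter=';',
                 na_values=['?'], low_memory=False)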

The following data preprocessing and feature engineering steps are required to complete the project:

  • Merge the date and time into a single column and convert it to datetime type.
  • Convert Global_active_power to numeric and remove the missing values (about 1.2% of the data).
  • Create year, quarter, month, and day features.
  • Create a weekday feature: "0" for weekends and "1" for working days.
df['date_time'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'], errors='coerce')
df = df.dropna(subset=['Global_active_power'])
df['date_time']=pd.to_datetime(df['date_time'])
df['year'] = df['date_time'].apply(lambda x: x.year)
df['quarter'] = df['date_time'].apply(lambda x: x.quarter)
df['month'] = df['date_time'].apply(lambda x: x.month)
df['day'] = df['date_time'].apply(lambda x: x.day)
df=df.loc[:,['date_time','Global_active_power', 'year','quarter','month','day']]
df.sort_values('date_time', inplace=True, ascending=True)
df = df.reset_index(drop=True)
df["weekday"]=df.apply(lambda row: row["date_time"].weekday(),axis=1)
df["weekday"] = (df["weekday"] < 5).astype(int)

print('Number of rows and columns after removing missing values:', df.shape)
print('The time series starts from: ', df.date_time.min())
print('The time series ends on: ', df.date_time.max())

After removing the missing values, the data contains 2,049,280 measurements gathered between December 2006 and November 2010 (47 months).

The raw data has multiple variables. Here we will focus on just one: the household's Global_active_power history, i.e., the household's global minute-averaged active power, in kilowatts.

Statistical Normality Test

There are several statistical tests that can be used to quantify whether our sample data looks as if it were drawn from a Gaussian distribution. We will use D'Agostino's K² test.

For SciPy's implementation of this test, we interpret the p-value as follows.

  • p <= alpha: reject H0, the data is not normal.
  • p > alpha: fail to reject H0, the data looks normal.
stat, p = stats.normaltest(df.Global_active_power)
print('Statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('Data looks Gaussian (fail to reject H0)')
else:
    print('Data does not look Gaussian (reject H0)')

We will also compute the kurtosis and skewness to determine whether the data distribution deviates from the normal distribution.

sns.distplot(df.Global_active_power);
print( 'Kurtosis of normal distribution: {}'.format(stats.kurtosis(df.Global_active_power)))
print( 'Skewness of normal distribution: {}'.format(stats.skew(df.Global_active_power)))

Figure 1

Kurtosis: describes the heaviness of the tails of a distribution

The kurtosis of a normal distribution is close to zero. If the kurtosis is greater than zero, the distribution has heavier tails; if it is less than zero, the tails are lighter. Our computed kurtosis is greater than zero.

Skewness: measures the asymmetry of a distribution

If the skewness is between -0.5 and 0.5, the data is fairly symmetrical. If it is between -1 and -0.5 or between 0.5 and 1, the data is moderately skewed. If it is less than -1 or greater than 1, the data is highly skewed. Our computed skewness is greater than 1.
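As a quick sanity check of these interpretation rules (an illustration added here, not part of the original analysis), we can compute both statistics on synthetic samples with known shape:

# Illustration only: kurtosis and skewness of synthetic samples.
# The distributions and sample size are arbitrary choices for this sketch.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
normal_sample = rng.normal(size=100000)        # symmetric, light tails
skewed_sample = rng.exponential(size=100000)   # right-skewed, heavy right tail

# For a normal sample, both statistics should be near 0.
print('normal:      kurtosis=%.2f, skewness=%.2f'
      % (stats.kurtosis(normal_sample), stats.skew(normal_sample)))
# For an exponential sample, skewness is about 2 and excess kurtosis about 6.
print('exponential: kurtosis=%.2f, skewness=%.2f'
      % (stats.kurtosis(skewed_sample), stats.skew(skewed_sample)))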

A First Time Series Plot

df1=df.loc[:,['date_time','Global_active_power']]
df1.set_index('date_time',inplace=True)
df1.plot(figsize=(12,5))
plt.ylabel('Global active power')
plt.legend().set_visible(False)
plt.tight_layout()
plt.title('Global Active Power Time Series')
sns.despine(top=True)
plt.show();

Figure 2

Clearly, this is not the plot we want. Don't do it this way.

Box Plot Comparison of Yearly and Quarterly Global Active Power

plt.figure(figsize=(14,5))
plt.subplot(1,2,1)
plt.subplots_adjust(wspace=0.2)
sns.boxplot(x="year", y="Global_active_power", data=df)
plt.xlabel('year')
plt.title('Box plot of Yearly Global Active Power')
sns.despine(left=True)
plt.tight_layout()

plt.subplot(1,2,2)
sns.boxplot(x="quarter", y="Global_active_power", data=df)
plt.xlabel('quarter')
plt.title('Box plot of Quarterly Global Active Power')
sns.despine(left=True)
plt.tight_layout();

Figure 3

Comparing the yearly box plots side by side, we notice that the median global active power in 2006 is much higher than in other years. This is a little misleading: if you recall, we only have data for December 2006, and December is clearly a peak month for household power consumption.

The quarterly medians of global active power are more in line with expectations: higher in the first and fourth quarters (winter) and lowest in the third quarter (summer).

Global Active Power Distribution

plt.figure(figsize=(14,6))
plt.subplot(1,2,1)
df['Global_active_power'].hist(bins=50)
plt.title('Global Active Power Distribution')

plt.subplot(1,2,2)
stats.probplot(df['Global_active_power'], plot=plt);
df1.describe().T

Figure 4

The normal probability plot also shows that the data deviates significantly from a normal distribution.

Mean Global Active Power Resampled over Day, Week, Month, Quarter, and Year

fig = plt.figure(figsize=(18,16))
fig.subplots_adjust(hspace=.4)
ax1 = fig.add_subplot(5,1,1)
ax1.plot(df1['Global_active_power'].resample('D').mean(),linewidth=1)
ax1.set_title('Mean Global active power resampled over day')
ax1.tick_params(axis='both', which='major')

ax2 = fig.add_subplot(5,1,2, sharex=ax1)
ax2.plot(df1['Global_active_power'].resample('W').mean(),linewidth=1)
ax2.set_title('Mean Global active power resampled over week')
ax2.tick_params(axis='both', which='major')

ax3 = fig.add_subplot(5,1,3, sharex=ax1)
ax3.plot(df1['Global_active_power'].resample('M').mean(),linewidth=1)
ax3.set_title('Mean Global active power resampled over month')
ax3.tick_params(axis='both', which='major')

ax4  = fig.add_subplot(5,1,4, sharex=ax1)
ax4.plot(df1['Global_active_power'].resample('Q').mean(),linewidth=1)
ax4.set_title('Mean Global active power resampled over quarter')
ax4.tick_params(axis='both', which='major')

ax5  = fig.add_subplot(5,1,5, sharex=ax1)
ax5.plot(df1['Global_active_power'].resample('A').mean(),linewidth=1)
ax5.set_title('Mean Global active power resampled over year')
ax5.tick_params(axis='both', which='major');

Figure 5

In general, our time series does not show an upward or downward trend. The highest average power consumption appears to come before 2007; in fact, that is because we only have December data for 2006 (translator's note: the original article attributes this to having only 12 months of data in 2007, which is wrong), and December is a peak consumption month. In other words, comparing year by year, the series is relatively stable.

Mean Global Active Power Grouped by Year, Quarter, Month, and Day

plt.figure(figsize=(14,8))
plt.subplot(2,2,1)
df.groupby('year').Global_active_power.agg('mean').plot()
plt.xlabel('')
plt.title('Mean Global active power by Year')

plt.subplot(2,2,2)
df.groupby('quarter').Global_active_power.agg('mean').plot()
plt.xlabel('')
plt.title('Mean Global active power by Quarter')

plt.subplot(2,2,3)
df.groupby('month').Global_active_power.agg('mean').plot()
plt.xlabel('')
plt.title('Mean Global active power by Month')

plt.subplot(2,2,4)
df.groupby('day').Global_active_power.agg('mean').plot()
plt.xlabel('')
plt.title('Mean Global active power by Day');

Figure 6

These plots confirm our earlier findings. By year, the series is relatively stable. By quarter, the lowest average power consumption occurs in the third quarter. By month, the lowest average consumption is in July and August. By day of the month, the lowest average consumption falls around the 8th (not sure why).

Global Active Power by Year

This time, we exclude 2006.

pd.pivot_table(df.loc[df['year'] != 2006], values = "Global_active_power",
               columns = "year", index = "month").plot(subplots = True, figsize=(12, 12), layout=(3, 5), sharey=True);

Figure 7

From 2007 to 2010, the annual patterns are very similar.

Box Plot Comparison of Global Active Power on Weekdays vs. Weekends

dic={0:'Weekend',1:'Weekday'}
df['Day'] = df.weekday.map(dic)

a=plt.figure(figsize=(9,4))
plt1=sns.boxplot('year','Global_active_power',hue='Day',width=0.6,fliersize=3,
                    data=df)
a.legend(loc='upper center', bbox_to_anchor=(0.5, 1.00), shadow=True, ncol=2)
sns.despine(left=True, bottom=True)
plt.xlabel('')
plt.tight_layout()
plt.legend().set_visible(False);

Figure 8

Before 2010, the median global active power on weekdays appears lower than on weekends. In 2010, they are exactly equal.

Factor Plot Comparison of Global Active Power on Weekdays vs. Weekends

plt1=sns.factorplot('year','Global_active_power',hue='Day',
                    data=df, size=4, aspect=1.5, legend=False)
plt.title('Factor Plot of Global active power by Weekend/Weekday')
plt.tight_layout()
sns.despine(left=True, bottom=True)
plt.legend(loc='upper right');

Figure 9

Year over year, weekdays and weekends follow the same pattern.

In principle, we do not need to check for or correct stationarity when using an LSTM. However, if the data is stationary, it helps the model perform better and makes it easier for the neural network to learn.
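For reference (this sketch is an addition, not part of the original workflow), if the series did need to be made stationary, a common remedy is first-order differencing:

# Sketch only: first-order differencing as one way to stabilize a series.
# df1 is the Global_active_power frame indexed by date_time, built earlier.
diffed = df1['Global_active_power'].diff().dropna()
# The differenced series models minute-to-minute changes; forecasts made on it
# must be cumulatively summed back from the last observed level to recover kilowatts.
print(diffed.head())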

Stationarity

In statistics, the Dickey-Fuller test tests the null hypothesis that a unit root is present in an autoregressive model. The alternative hypothesis differs depending on which version of the test is used, but it is usually stationarity or trend-stationarity.

A stationary series has a constant mean and variance over time. Equivalently, its mean and standard deviation in a sliding window do not change with time.

Dickey-Fuller test

Null hypothesis (H0): the time series has a unit root, meaning it is non-stationary. It contains some time-dependent component.

Alternative hypothesis (H1): the time series has no unit root, meaning it is stationary. It does not contain a time-dependent component.

p-value > 0.05: fail to reject the null hypothesis (H0); the data has a unit root and is non-stationary.

p-value <= 0.05: reject the null hypothesis (H0); the data has no unit root and is stationary.

df2 = df1.resample('D').mean()  # daily means; the deprecated how= keyword is replaced by .mean()

def test_stationarity(timeseries):
    rolmean = timeseries.rolling(window=30).mean()
    rolstd = timeseries.rolling(window=30).std()

    plt.figure(figsize=(14,5))
    sns.despine(left=True)
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')

    plt.legend(loc='best'); plt.title('Rolling Mean & Standard Deviation')
    plt.show()

    print ('<Results of Dickey-Fuller Test>')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4],
                         index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print(dfoutput)
test_stationarity(df2.Global_active_power.dropna())

Figure 10

From the results above, we reject the null hypothesis H0, because the data has no unit root and is stationary.

LSTM

Our task is to make a forecast from this time series of two million minutes of a household's power consumption history. We will use a multi-layer LSTM recurrent neural network to predict the last value of a sequence of values.

If you want to reduce the computation time and get quick results to test the model, you can resample the data by hour. For the experiments in this article, I will keep the original minute-level resolution.
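For example (a sketch added here; df1 is the datetime-indexed frame built earlier), the hourly resampling could look like this:

# Optional: resample to hourly means, shrinking the dataset roughly 60x.
df_hourly = df1['Global_active_power'].resample('H').mean().dropna().to_frame()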

Before building the LSTM model, the following data preprocessing and feature engineering work is required:

  • Create the dataset and make sure all values are of type float.
  • Normalize the features.
  • Split into training and test sets.
  • Convert the array of values into a dataset matrix.
  • Reshape the data into X = t and Y = t + 1.
  • Reshape the input into three dimensions (num_samples, num_timesteps, num_features).
dataset = df.Global_active_power.values #numpy.ndarray
dataset = dataset.astype('float32')
dataset = np.reshape(dataset, (-1, 1))
scaler = MinMaxScaler(feature_range=(0, 1))
dataset = scaler.fit_transform(dataset)
train_size = int(len(dataset) * 0.80)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]

def create_dataset(dataset, look_back=1):
    X, Y = [], []
    for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back), 0]
        X.append(a)
        Y.append(dataset[i + look_back, 0])
    return np.array(X), np.array(Y)

look_back = 30
X_train, Y_train = create_dataset(train, look_back)
X_test, Y_test = create_dataset(test, look_back)

# reshape the input to [samples, time steps, features]
X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))

Model Architecture

  • Define an LSTM model with a hidden layer of 100 neurons and an output layer of 1 neuron for predicting Global_active_power. The input shape is 1 time step with 30 features.
  • Apply 20% dropout.
  • Use mean squared error as the loss function, with Adam, a more efficient variant of stochastic gradient descent, as the optimizer.
  • Train the model for 20 epochs with a batch size of 70.
model = Sequential()
model.add(LSTM(100, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

history = model.fit(X_train, Y_train, epochs=20, batch_size=70, validation_data=(X_test, Y_test),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=10)], verbose=1, shuffle=False)

model.summary()

Making Predictions

train_predict = model.predict(X_train)
test_predict = model.predict(X_test)
# invert the predictions back to the original scale
train_predict = scaler.inverse_transform(train_predict)
Y_train = scaler.inverse_transform([Y_train])
test_predict = scaler.inverse_transform(test_predict)
Y_test = scaler.inverse_transform([Y_test])

print('Train Mean Absolute Error:', mean_absolute_error(Y_train[0], train_predict[:,0]))
print('Train Root Mean Squared Error:',np.sqrt(mean_squared_error(Y_train[0], train_predict[:,0])))
print('Test Mean Absolute Error:', mean_absolute_error(Y_test[0], test_predict[:,0]))
print('Test Root Mean Squared Error:',np.sqrt(mean_squared_error(Y_test[0], test_predict[:,0])))
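As a usage sketch (an addition, not from the original article), here is how one might forecast the next minute from the most recent 30 scaled observations, reusing the trained model and scaler defined above:

# Sketch only: one-step-ahead forecast from the last look_back scaled values.
last_window = dataset[-look_back:, 0]                      # most recent 30 scaled values
last_window = np.reshape(last_window, (1, 1, look_back))   # [samples, time steps, features]
next_scaled = model.predict(last_window)
next_kw = scaler.inverse_transform(next_scaled)[0, 0]
print('Predicted Global_active_power for the next minute: %.3f kW' % next_kw)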

Plotting the Model Loss

plt.figure(figsize=(8,4))
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Test Loss')
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epochs')
plt.legend(loc='upper right')
plt.show();

Figure 11

Comparing Actual and Predicted Values

In my results, each time step represents 1 minute. If you resampled the data by hour earlier, each time step in your results will represent 1 hour.

I will compare 200 minutes of actual and predicted values.

aa=[x for x in range(200)]
plt.figure(figsize=(8,4))
plt.plot(aa, Y_test[0][:200], marker='.', label="actual")
plt.plot(aa, test_predict[:,0][:200], 'r', label="prediction")
# plt.tick_params(left=False, labelleft=True)  # remove ticks
plt.tight_layout()
sns.despine(top=True)
plt.subplots_adjust(left=0.07)
plt.ylabel('Global_active_power', size=15)
plt.xlabel('Time step', size=15)
plt.legend(fontsize=15)
plt.show();

Figure 12

LSTMs are amazing!

The Jupyter notebook can be found on GitHub. Enjoy the rest of your week!

Reference: Multivariate Time Series Forecasting with LSTMs in Keras
