A practical deep learning case: multivariate time-series air quality forecasting (with complete code)

In this article, you will learn how to use the Keras deep learning library to develop LSTM models for multivariate time series forecasting.

1. Air Pollution Forecast

In this article, we will use the air quality dataset.

This is a dataset reporting hourly weather and pollution levels at the U.S. Embassy in Beijing, China, over five years.

The data includes the date-time, the pollution level (known as PM2.5 concentration), and weather information including dew point, temperature, pressure, wind direction, wind speed, and the cumulative number of hours of snow and rain. The complete list of features in the raw data is as follows:

  1. No : the row number
  2. year : the year of the data in this row
  3. month : the month of the data in this row
  4. day : the day of the data in this row
  5. hour : the hour of the data in this row
  6. pm2.5 : PM2.5 concentration
  7. DEWP : dew point
  8. TEMP : temperature
  9. PRES : pressure
  10. cbwd : combined wind direction
  11. Iws : cumulated wind speed
  12. Is : cumulated hours of snow
  13. Ir : cumulated hours of rain

2. Basic data preparation

Below are the first few rows of the original dataset.

No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0
2,2010,1,1,1,NA,-21,-12,1020,NW,4.92,0,0
3,2010,1,1,2,NA,-21,-11,1019,NW,6.71,0,0
4,2010,1,1,3,NA,-21,-14,1019,NW,9.84,0,0
5,2010,1,1,4,NA,-20,-12,1018,NW,12.97,0,0

The first step is to consolidate the date-time information into a single date-time column so that we can use it as an index in Pandas.

A quick check reveals NA values for pm2.5 for the first 24 hours. We will therefore need to remove the first 24 hours of data. There are also a few scattered "NA" values later in the dataset; we can mark them with 0 values for now.

The script below loads the raw dataset and parses the date-time information into the Pandas DataFrame index. The "No" column is dropped, and each column is given a clearer name. Finally, the NA values are replaced with "0" values and the first 24 hours are removed.

from pandas import read_csv
from datetime import datetime
# load data
def parse(x):
	return datetime.strptime(x, '%Y %m %d %H')
dataset = read_csv('raw.csv',  parse_dates = [['year', 'month', 'day', 'hour']], index_col=0, date_parser=parse)
dataset.drop('No', axis=1, inplace=True)
# manually specify column names
dataset.columns = ['pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain']
dataset.index.name = 'date'
# mark all NA values with 0
dataset['pollution'].fillna(0, inplace=True)
# drop the first 24 hours
dataset = dataset[24:]
# summarize first 5 rows
print(dataset.head(5))
# save to file
dataset.to_csv('pollution.csv')

Running the example prints the first 5 rows of the transformed dataset and saves the dataset to "pollution.csv".

                     pollution  dew  temp   press wnd_dir  wnd_spd  snow  rain
date
2010-01-02 00:00:00      129.0  -16  -4.0  1020.0      SE     1.79     0     0
2010-01-02 01:00:00      148.0  -15  -4.0  1020.0      SE     2.68     0     0
2010-01-02 02:00:00      159.0  -11  -5.0  1021.0      SE     3.57     0     0
2010-01-02 03:00:00      181.0   -7  -5.0  1022.0      SE     5.36     1     0
2010-01-02 04:00:00      138.0   -7  -5.0  1022.0      SE     6.25     2     0

Now that we have the data in an easy-to-use form, we can quickly create a plot of each series and see what we have.
The code below loads the new "pollution.csv" file and plots each series as a separate subplot, except wind direction, which is categorical.

from pandas import read_csv
from matplotlib import pyplot
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# specify columns to plot
groups = [0, 1, 2, 3, 5, 6, 7]
i = 1
# plot each column
pyplot.figure()
for group in groups:
	pyplot.subplot(len(groups), 1, i)
	pyplot.plot(values[:, group])
	pyplot.title(dataset.columns[group], y=0.5, loc='right')
	i += 1
pyplot.show()

Running the example creates a graph with 7 subplots showing 5 years of data for each variable.

3. Multivariate LSTM prediction model

LSTM data preparation

The first step is to prepare the pollution dataset for the LSTM. This involves framing the dataset as a supervised learning problem and normalizing the input variables.

We define the supervised learning problem as predicting pollution at the current time (t) given pollution measurements and weather conditions at previous time steps.

First, the "pollution.csv" dataset is loaded. The wind direction feature is label encoded (integer encoded). This could be further one-hot encoded if you are interested in exploring it.

Next, all features are normalized, and the dataset is transformed into a supervised learning problem. The weather variables for the hour to be predicted (t) are then removed.

# prepare data for lstm
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
	n_vars = 1 if type(data) is list else data.shape[1]
	df = DataFrame(data)
	cols, names = list(), list()
	# input sequence (t-n, ... t-1)
	for i in range(n_in, 0, -1):
		cols.append(df.shift(i))
		names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
	# forecast sequence (t, t+1, ... t+n)
	for i in range(0, n_out):
		cols.append(df.shift(-i))
		if i == 0:
			names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
		else:
			names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
	# put it all together
	agg = concat(cols, axis=1)
	agg.columns = names
	# drop rows with NaN values
	if dropnan:
		agg.dropna(inplace=True)
	return agg

# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode wind direction
encoder = LabelEncoder()
values[:, 4] = encoder.fit_transform(values[:, 4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)
print(reframed.head())

Running the example prints the first 5 rows of the transformed dataset. We can see 8 input variables (input sequence) and 1 output variable (pollution level for the current hour).

   var1(t-1)  var2(t-1)  var3(t-1)  var4(t-1)  var5(t-1)  var6(t-1)  \
1   0.129779   0.352941   0.245902   0.527273   0.666667   0.002290
2   0.148893   0.367647   0.245902   0.527273   0.666667   0.003811
3   0.159960   0.426471   0.229508   0.545454   0.666667   0.005332
4   0.182093   0.485294   0.229508   0.563637   0.666667   0.008391
5   0.138833   0.485294   0.229508   0.563637   0.666667   0.009912
 
   var7(t-1)  var8(t-1)   var1(t)
1   0.000000        0.0  0.148893
2   0.000000        0.0  0.159960
3   0.000000        0.0  0.182093
4   0.037037        0.0  0.138833
5   0.074074        0.0  0.109658

This kind of data preparation is simple, and there is more we could explore. Some ideas you can consider include:

  • One-hot encoding the wind direction (a minimal sketch follows below).
  • Making all series stationary with differencing and seasonal adjustment.
  • Providing more than 1 hour of input time steps.

This last point is probably the most important considering that LSTMs use backpropagation through time when learning sequence prediction problems.
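
For the first idea, here is a minimal sketch of one-hot encoding the wind direction with pandas, using the column names from the prepared "pollution.csv" above (get_dummies replaces the single categorical column with one binary column per direction):

from pandas import read_csv, get_dummies

# load the prepared dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
# replace the categorical wind direction column with binary indicator columns
dataset = get_dummies(dataset, columns=['wnd_dir'])
print(dataset.head())

The extra columns would then flow through the scaling and series_to_supervised() steps like any other feature (the number of features would need updating accordingly).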

Define and fit the model

In this section, we will fit an LSTM on multivariate input data.

First, we must split the prepared dataset into train and test sets. To speed up model training for this demonstration, we will only fit the model on the first year of data, then evaluate it on the remaining 4 years of data. If you have time, consider exploring the inverted version of this test harness.

The example below splits the dataset into train and test sets, then splits those into input and output variables. Finally, the inputs (X) are reshaped into the 3D format expected by LSTMs, i.e. [samples, timesteps, features].

# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

Running this example prints the shapes of the train and test input and output sets: about 9K hours of data for training and about 35K hours for testing.

(8760, 1, 8) (8760,) (35039, 1, 8) (35039,)

Now we can define and fit our LSTM model.

We will define the LSTM with 50 neurons in the first hidden layer and 1 neuron in the output layer for predicting pollution. The input shape will be 1 time step with 8 features.

from keras.models import Sequential
from keras.layers import Dense, LSTM
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
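
The loss curves discussed below can be drawn from the returned history object; a minimal sketch, assuming pyplot is imported as in the earlier plotting script:

# plot the training and validation loss per epoch
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()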

Evaluate the model

After the model is fitted, we can make predictions on the entire test dataset.

We combine the forecast with the test dataset and invert the scaling. We also invert the scaling on the test dataset with the expected pollution values.

With the forecasts and actual values in their original scale, we can calculate an error score for the model. In this case, we calculate the root mean squared error (RMSE), which gives error in the same units as the variable itself.

from math import sqrt
from numpy import concatenate
from sklearn.metrics import mean_squared_error
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)

Running the example first creates a graph showing the training and testing loss during training.

Interestingly, we can see that the test loss is lower than the training loss. The model may be overfitting the training data. Measuring and plotting the RMSE during training might illustrate this more clearly.
The training and test losses are printed at the end of each training epoch, and at the end of the run the final RMSE of the model on the test dataset is printed.
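
One way to measure RMSE during training is a Keras callback; a minimal sketch, assuming the test_X and test_y arrays from the snippets above (the RMSE here is on the scaled data, so the shape of the curve is what matters rather than its absolute level):

from math import sqrt
from sklearn.metrics import mean_squared_error
from keras.callbacks import Callback

class RMSEHistory(Callback):
	# record RMSE on the test set at the end of every epoch
	def __init__(self, test_X, test_y):
		super().__init__()
		self.test_X, self.test_y = test_X, test_y
		self.rmse = list()
	def on_epoch_end(self, epoch, logs=None):
		yhat = self.model.predict(self.test_X, verbose=0)
		self.rmse.append(sqrt(mean_squared_error(self.test_y, yhat)))

# pass the callback to fit(), then plot rmse_history.rmse like the loss curves
rmse_history = RMSEHistory(test_X, test_y)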

We can see that the model achieves a respectable RMSE of 26.496, lower than the RMSE of 30 found with a persistence model (a sketch of that baseline follows the training log below).

Epoch 46/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 47/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 48/50
0s - loss: 0.0144 - val_loss: 0.0133
Epoch 49/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 50/50
0s - loss: 0.0144 - val_loss: 0.0133
Test RMSE: 26.496
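
For reference, here is a minimal sketch of the persistence baseline mentioned above; it simply predicts the previous hour's pollution value (the column name assumes the "pollution.csv" built earlier, and the split mirrors the first-year/remaining-years division, so the result is indicative rather than exact):

from math import sqrt
from pandas import read_csv
from sklearn.metrics import mean_squared_error

# load the prepared dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
pollution = dataset['pollution'].values
# evaluate on the same hours used as the test set above
n_train_hours = 365 * 24
test = pollution[n_train_hours:]
# persistence forecast: the prediction for hour t is the observation at t-1
yhat, y = test[:-1], test[1:]
rmse = sqrt(mean_squared_error(y, yhat))
print('Persistence RMSE: %.3f' % rmse)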

4. Training with multiple lag time steps

First, you must frame the problem appropriately when calling series_to_supervised(). We will use 3 hours of data as input.

# specify the number of lag hours
n_hours = 3
n_features = 8
# frame as supervised learning
reframed = series_to_supervised(scaled, n_hours, 1)

Next, we need to be more careful about specifying the input and output columns.
We have 3 * 8 + 8 = 32 columns in our framed dataset. We will take 3 * 8 or 24 columns as input: the observations of all features across the previous 3 hours. We will take only the pollution variable at the next hour (t) as output, as follows:

# split into input and outputs
n_obs = n_hours * n_features
train_X, train_y = train[:, :n_obs], train[:, -n_features]
test_X, test_y = test[:, :n_obs], test[:, -n_features]
print(train_X.shape, len(train_X), train_y.shape)

Next, we can properly reshape our input data to reflect time steps and features.

# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))
test_X = test_X.reshape((test_X.shape[0], n_hours, n_features))

The fitted model is the same.

The only other small change is in how the model is evaluated. Specifically, in how we reconstruct the rows with 8 columns suitable for inverting the scaling operation, to get y and yhat back into their original scale so that we can calculate the RMSE.

The gist of the change is that we concatenate the y or yhat column with the last 7 features of the test dataset in order to reverse the scaling, like so:

# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], n_hours * n_features))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, -7:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, -7:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]

The model fits as well as before in a minute or two.

Epoch 45/50
1s - loss: 0.0143 - val_loss: 0.0154
Epoch 46/50
1s - loss: 0.0143 - val_loss: 0.0148
Epoch 47/50
1s - loss: 0.0143 - val_loss: 0.0152
Epoch 48/50
1s - loss: 0.0143 - val_loss: 0.0151
Epoch 49/50
1s - loss: 0.0143 - val_loss: 0.0152
Epoch 50/50
1s - loss: 0.0144 - val_loss: 0.0149

The training and testing losses are plotted for various epochs.

In the end, the test RMSE is printed, and it does not really show any advantage in skill, at least on this problem.

Test RMSE: 27.177

I would add that LSTMs do not appear to be well suited to autoregression-type problems; you are better off exploring an MLP with a large window (a sketch follows below).
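
As a starting point for that idea, here is a minimal sketch of an MLP baseline (my addition, not part of the original code) that flattens the 3D train_X and test_X arrays from the reshape step in section 4 back into one long feature vector per sample:

from keras.models import Sequential
from keras.layers import Dense

# flatten the windowed input: [samples, n_hours, n_features] -> [samples, n_hours * n_features]
train_X_flat = train_X.reshape((train_X.shape[0], n_hours * n_features))
test_X_flat = test_X.reshape((test_X.shape[0], n_hours * n_features))
# a simple MLP over the full lag window
model = Sequential()
model.add(Dense(50, activation='relu', input_shape=(n_hours * n_features,)))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
model.fit(train_X_flat, train_y, epochs=50, batch_size=72, validation_data=(test_X_flat, test_y), verbose=2, shuffle=False)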

I hope this example helps you to conduct your own time series forecasting experiments.
