Time Series Forecasting 12: Electricity Consumption Forecasting 02, Multi-Step Forecasting with Naive Models

Continuing from the previous article, this article uses simple univariate (naive) models to make multi-step forecasts on the household electricity consumption data set. The main contents are as follows:

  • How to prepare the data set for modeling;
  • How to define evaluation metrics, split the data set, and evaluate forecasting models;
  • How to develop, evaluate, and compare the performance of naive models;


How to develop naive models for multi-step household electricity consumption forecasting

1. Data processing

1.1 Outlier handling

As the previous article showed, the data set contains many values marked as missing with '?'. These need to be replaced with np.nan so that the data can be converted to a floating-point type, which speeds up processing. The code is as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The read_csv() arguments were described in the previous article and are not repeated here.
dataset = pd.read_csv('household_power_consumption.txt', sep=';', header=0, 
                      low_memory=False, infer_datetime_format=True, engine='c',
                      parse_dates={'datetime':[0,1]}, index_col=['datetime'])
                      
dataset.replace('?', np.nan, inplace=True) # replace the '?' markers with NaN
values = dataset.values.astype('float32') # convert all values to float32

1.2 Handling of missing values

The previous step replaced the '?' markers with missing values (NaN); this step fills them in. A very simple method is to copy the value sampled at the same time on the previous day. The custom fill_missing() function implements this:

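def fill_missing(values):
    '''
    Fill missing values in place.
    Idea: replace each NaN with the value sampled at the same time on the previous day.
    '''
    one_day = 60 * 24
    for row in range(values.shape[0]): # loop over rows
        for col in range(values.shape[1]): # loop over columns
            if np.isnan(values[row, col]):
                values[row, col] = values[row - one_day, col]

fill_missing(values) # fill the float32 array from step 1.1 in place
dataset = pd.DataFrame(values, index=dataset.index, columns=dataset.columns) # write the filled values back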

After the missing values are filled, add a new column and save the processed data. The formula was explained in the previous article and is not repeated here. The code is as follows:

# Add a column for the remaining energy consumption not covered by the three sub-meters
values = dataset.values
dataset['sub_metering_4'] = (values[:,0] * 1000 / 60) - (values[:,4] + values[:,5] + values[:,6])
dataset.to_csv('household_power_consumption.csv')

At this point, the data processing is complete.


2. Model evaluation

This section describes how to develop and evaluate a prediction model for household electricity consumption data sets. This section mainly includes the following four parts:

  • Problem framing
  • Evaluation metric
  • Train/test split
  • Walk-forward validation

2.1 Problem modeling

There are many ways to explore the household electricity consumption data set. This article uses the data to address one specific question: given recent consumption, how much electricity will be used in the coming week? This requires a forecasting model that predicts the total active power for each day of the next seven days. This type of problem is called a multi-step time series forecasting problem. A model that uses multiple input variables can be called a multivariate multi-step time series forecasting model.

To make this easier, the per-minute samples in the original data are resampled into total daily consumption. This is not required, but it makes sense because we care about the total power per day. The pandas resample() function does this: calling it with the argument 'D' groups the loaded data by day using the datetime index. Summing all the samples within each day then gives a new daily data set for the 8 variables (features).

View the resampling results:

daily_groups = dataset.resample('D')
daily_groups.sum()

Count the number of original samples that fall into each daily group:

daily_groups.count()

Re-sampling:

daily_data = daily_groups.sum()

The output shows that after resampling, the data has shape (1442, 8).
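To illustrate what resample('D') does, here is a minimal sketch on synthetic minute-level data. The column name and the constant values are made up for illustration and are not taken from the real data set.

```python
import numpy as np
import pandas as pd

# Synthetic minute-level data: two full days plus a 60-minute partial day.
index = pd.date_range('2006-12-16', periods=2 * 1440 + 60, freq='min')
minute_data = pd.DataFrame({'Global_active_power': np.ones(len(index))}, index=index)

daily_groups = minute_data.resample('D')  # group rows by calendar day
daily_sum = daily_groups.sum()            # one row per day, values summed
daily_count = daily_groups.count()        # number of minute samples in each day

print(daily_sum.shape)
print(daily_count['Global_active_power'].tolist())
```

The count per day is a quick way to spot partial days at the start or end of the data, which matters later when the data are organized into full weeks.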

Save as a new csv file:

daily_data.to_csv('household_power_consumption_days.csv')

2.2 Evaluation metric

The forecast is a vector of seven values, one for the predicted power consumption on each day of the coming week. In multi-step forecasting problems it is common to evaluate each forecast time step separately. This makes it possible to:

  • Assess skill at a specific lead time (for example, +1 day versus +3 days);
  • Compare models by their skill at different lead times (for example, one model that is good at +1 day and another that is good at +5 days);

Total power is measured in kilowatts, so an error metric in the same unit is convenient. Both root mean squared error (RMSE) and mean absolute error (MAE) meet this requirement; this article uses the more common RMSE. Unlike MAE, RMSE penalizes large forecast errors more heavily. The performance metric for this problem is the RMSE for each lead time from day 1 to day 7. A single score is also convenient for model selection; one option is the RMSE over all forecast days. The custom evaluate_forecasts() function implements both.
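As a rough illustration of the two scores described above, the following sketch computes a per-day RMSE and an overall RMSE with NumPy. The helper name rmse_per_day and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def rmse_per_day(actual, predicted):
    """Return (overall RMSE, list of per-day RMSEs) for arrays shaped (weeks, 7)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    squared_error = (actual - predicted) ** 2
    per_day = np.sqrt(squared_error.mean(axis=0))  # one score per forecast day
    overall = np.sqrt(squared_error.mean())        # one score over all days
    return float(overall), per_day.tolist()

# Made-up values: 2 weeks x 7 days.
actual = np.array([[10, 10, 10, 10, 10, 10, 10],
                   [20, 20, 20, 20, 20, 20, 20]])
# Forecast errors of 3 on the first day and 4 on the last day, 0 elsewhere.
predicted = actual + np.array([3, 0, 0, 0, 0, 0, 4])

overall, per_day = rmse_per_day(actual, predicted)
print(overall, per_day)
```

The per-day scores show where in the week a model is weak, while the overall score gives a single number for ranking models.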


2.3 Division of training set and test set

Use the first three years of data to train the forecasting models and the last year to evaluate them. The data are divided into standard weeks (Sunday to Saturday). This framing suits the chosen problem of predicting power consumption for the week ahead. It also helps with modeling, since a model can be used to predict a specific day (such as Wednesday) or the entire sequence.

The data are processed in weeks, working backwards: the test data are cut off first, and the rest is training data. The last year in the data is 2010, and its first Sunday is January 3. The data end on November 26, 2010, and the closest preceding Saturday is November 20, which gives 46 weeks of test data. The first and last rows of the daily test data are shown below for confirmation.
[Figure: first and last rows of the daily test data]
The earlier resampled output shows that the daily data begin on December 16, 2006. The first Sunday in the data set is December 17, which is the second row of data. Organizing the data into standard weeks gives 159 complete weeks for training the forecasting models.
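The week counts above can be double-checked with plain date arithmetic. This is a small sketch using only the dates quoted in the text; the variable names are illustrative.

```python
from datetime import date

first_day = date(2006, 12, 16)    # first row of the daily data
train_start = date(2006, 12, 17)  # first Sunday in the data
test_start = date(2010, 1, 3)     # first Sunday of 2010
test_end = date(2010, 11, 20)     # last Saturday in the data
last_day = date(2010, 11, 26)     # last row of the daily data

train_days = (test_start - train_start).days   # days before the test set starts
test_days = (test_end - test_start).days + 1   # inclusive of both ends
total_days = (last_day - first_day).days + 1
trailing_days = (last_day - test_end).days     # days after the last full week

print(train_days // 7, test_days // 7, total_days, trailing_days)
```

The one leading day and the six trailing days are the rows that do not belong to a complete standard week.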
The custom split_dataset() function splits the daily data into training and test sets and organizes each into standard weeks. Specific row offsets, based on knowledge of the data set, are used to split the data. The NumPy split() function then organizes each part into weekly blocks. Note that split() divides an array into equal-sized sections and raises an error if the data cannot be divided evenly.
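The behaviour of the row offsets and of split() can be checked on a stand-in array with the same shape as the daily data. This is only a sketch: the array below holds synthetic numbers, not the real consumption values.

```python
import numpy as np

# Synthetic stand-in for the daily data: 1442 rows x 8 columns.
data = np.arange(1442 * 8, dtype='float32').reshape(1442, 8)

# Same offsets as described above: skip 1 leading day and 6 trailing days.
train, test = data[1:-328], data[-328:-6]
train_weeks = np.array(np.split(train, len(train) // 7))
test_weeks = np.array(np.split(test, len(test) // 7))
print(train_weeks.shape, test_weeks.shape)

# split() requires an even division; 10 rows cannot be cut into 3 equal parts.
try:
    np.split(data[:10], 3)
    uneven_split_raised = False
except ValueError:
    uneven_split_raised = True
print(uneven_split_raised)
```

Each resulting array is indexed as [week, day, variable], which is the layout the evaluation code expects.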


2.4 Walk-forward validation

The models are evaluated using walk-forward validation. The process is: first give the model one week of data and predict the second week; then reveal the actual data for the second week and use the first two weeks to predict the third week; then use the first three weeks to predict the fourth week, and so on. The input data and the predicted data are illustrated below:

Input,  Predict
[Week1]  Week2
[Week1 + Week2]  Week3
[Week1 + Week2 + Week3]  Week4
...

Walk-forward validation of a forecasting model on this data set is implemented by the custom evaluate_model() function. Its model_func argument is the name of a function that defines the model, fits it, and forecasts one week of electricity consumption. evaluate_model() then scores the model's forecasts against the test data set.
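The walk-forward loop can be sketched on toy data as follows. The function names walk_forward and repeat_last_week are hypothetical, and each week is simplified to a 1-D array of seven daily totals rather than the (7, 8) blocks used in the article.

```python
import numpy as np

def walk_forward(model_func, train_weeks, test_weeks):
    """Forecast each test week from all weeks seen so far, then reveal the actual week."""
    history = list(train_weeks)
    predictions = []
    for actual_week in test_weeks:
        predictions.append(model_func(history))  # forecast made before seeing actual_week
        history.append(actual_week)              # the actual week becomes available afterwards
    return np.array(predictions)

def repeat_last_week(history):
    """Toy model: forecast the coming week as a copy of the most recent week."""
    return history[-1]

# Toy data: week k holds the constant value k on all seven days.
train_weeks = [np.full(7, 1.0)]
test_weeks = [np.full(7, 2.0), np.full(7, 3.0), np.full(7, 4.0)]

predictions = walk_forward(repeat_last_week, train_weeks, test_weeks)
print(predictions[:, 0])
```

Each forecast lags the actual value by one week, which confirms that the model never sees the week it is asked to predict.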


2.5 Naive prediction model (simple model)

On any new forecasting problem, it is important to test the naive forecasting model. The prediction results of the naive model quantitatively show the difficulty of the prediction problem and provide a standard performance (baseline) through which more complex prediction methods can be evaluated. This section compares three simple prediction methods for household power prediction problems:

  • Daily persistence forecast;
  • Weekly persistence forecast;
  • Weekly persistence forecast from one year ago;

2.5.1 Daily persistence forecast

This model takes the active power of the last day before the forecast period (for example, Saturday) and uses it as the daily active power for every day of the forecast period (Sunday to Saturday). The daily_persistence() function implements this strategy.

2.5.2 Weekly Persistence Forecast

This model uses the whole of the previous week as the forecast for the next week, on the assumption that next week will be very similar to this week. The weekly_persistence() function implements this strategy.

2.5.3 Weekly persistence forecast from one year ago

Similar to using the previous week to forecast the next week, this model assumes that the next week will resemble the same week one year earlier, so the week observed 52 weeks ago is used as the forecast. The week_one_year_ago_persistence() function implements this strategy.
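To make the difference between the three strategies concrete, here is a small sketch on a made-up history. It assumes each week is a 1-D array of seven daily totals, which is a simplification of the weekly blocks used in the article.

```python
import numpy as np

# Toy history: 60 weeks; week w holds the values w*10 + 0..6 (Sunday..Saturday).
history = [w * 10 + np.arange(7) for w in range(60)]

daily = np.full(7, history[-1][-1])  # repeat the last observed day seven times
weekly = history[-1]                 # repeat the most recent week
year_ago = history[-52]              # week observed 52 weeks before the forecast week

print(daily.tolist())
print(weekly.tolist())
print(year_ago.tolist())
```

Note that the one-year-ago strategy needs at least 52 weeks of history before it can make its first forecast.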


3. Complete code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import sklearn.metrics as skm

# Font settings for displaying Chinese characters in plots
plt.rcParams['font.sans-serif'] = ['Microsoft JhengHei']
plt.rcParams['axes.unicode_minus'] = False

def split_dataset(data):
    '''
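    Split the daily data into training and test sets, organized into standard weeks.
    '''
    # data holds the daily consumption totals, with shape (1442, 8).
    # The test set is the last 46 full weeks (322 days); the preceding 159 weeks (1113 days) form the training set.
    train, test = data[1:-328], data[-328:-6]
    train = np.array(np.split(train, len(train) // 7)) # reorganize into weekly blocks of 7 days
    test = np.array(np.split(test, len(test) // 7))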
    return train, test

def evaluate_forecasts(actual, predicted):
    '''
    Score one or more weekly forecasts against the actual values.
    Idea: compute the RMSE for each forecast day, plus an overall RMSE.
    '''
    scores = list()
    for i in range(actual.shape[1]):
        mse = skm.mean_squared_error(actual[:, i], predicted[:, i])
        rmse = math.sqrt(mse)
        scores.append(rmse)
    
    s = 0 # accumulate squared errors for the overall RMSE
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col]) ** 2
    score = math.sqrt(s / (actual.shape[0] * actual.shape[1]))
    print('actual.shape[0]:{}, actual.shape[1]:{}'.format(actual.shape[0], actual.shape[1]))
    return score, scores

def summarize_scores(name, score, scores):
    s_scores = ', '.join(['%.1f' % s for s in scores])
    print('%s: [%.3f] %s\n' % (name, score, s_scores))

def evaluate_model(model_func, train, test):
    '''
    Evaluate a single model using walk-forward validation.
    '''
    history = [x for x in train] # list of weekly data seen so far
    predictions = [] # walk-forward forecast for each test week
    for i in range(len(test)):
        yhat_sequence = model_func(history) # forecast the coming week
        predictions.append(yhat_sequence)
        history.append(test[i, :]) # add the actual week to history before forecasting the next one
    predictions = np.array(predictions)
    score, scores = evaluate_forecasts(test[:, :, 0], predictions) # score the forecasts for each day of the week
    return score, scores

def daily_persistence(history):
    last_week = history[-1] # data for the most recent week
    value = last_week[-1, 0] # total active power on the last day of that week
    forecast = [value for _ in range(7)] # repeat it for all 7 forecast days
    return forecast

def weekly_persistence(history):
    last_week = history[-1] # use the most recent week as the forecast
    return last_week[:, 0]

def week_one_year_ago_persistence(history):
    last_week = history[-52] # use the week observed 52 weeks earlier as the forecast
    return last_week[:, 0]


def model_predict_plot(dataset, days):
    train, test = split_dataset(dataset.values)
    # names and functions of the models to evaluate
    models = dict()
    models['daily'] = daily_persistence
    models['weekly'] = weekly_persistence
    models['week-oya'] = week_one_year_ago_persistence
    
    plt.figure(figsize=(8,6), dpi=150)
    for name, func in models.items():
        score, scores = evaluate_model(func, train, test)
        summarize_scores(name, score, scores)
        plt.plot(days, scores, marker='o', label=name)
    plt.grid(linestyle='--', alpha=0.5)
    plt.ylabel(r'$RMSE$', size=15)
    plt.title('Comparison of the three naive models', color='blue', size=20)
    plt.legend()
    plt.show()

if __name__ == '__main__':
    dataset = pd.read_csv('household_power_consumption_days.csv', header=0, 
                       infer_datetime_format=True, engine='c',
                       parse_dates=['datetime'], index_col=['datetime'])
    days = ['sun', 'mon', 'tue', 'wed', 'thu', 'fri', 'sat']
    model_predict_plot(dataset, days)

Running the example first prints the overall score and the per-day scores of each model. The weekly strategy performs better than the daily strategy, and the week-one-year-ago strategy performs slightly better still. This shows in the overall RMSE of each model and in most of the per-day scores. One exception is the first day (Sunday), where the daily persistence model does better than the two weekly strategies. The week-one-year-ago strategy, with an overall RMSE of 465.294 kilowatts, can serve as the performance baseline against which more complex models on this data set are evaluated.

actual.shape[0]:46, actual.shape[1]:7
daily: [511.886] 452.9, 596.4, 532.1, 490.5, 534.3, 481.5, 482.0

actual.shape[0]:46, actual.shape[1]:7
weekly: [469.389] 567.6, 500.3, 411.2, 466.1, 471.9, 358.3, 482.0

actual.shape[0]:46, actual.shape[1]:7
week-oya: [465.294] 550.0, 446.7, 398.6, 487.0, 459.3, 313.5, 555.1

[Figure: per-day RMSE of the three naive models]


The next article introduces the ARIMA model to achieve multi-step electricity consumption forecasting.


Reference:
https://machinelearningmastery.com/naive-methods-for-forecasting-household-electricity-consumption/
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
https://www.statsmodels.org/stable/generated/statsmodels.graphics.tsaplots.plot_acf.html?highlight=plot_acf
https://matplotlib.org/index.html


Origin blog.csdn.net/weixin_39653948/article/details/105412563