NILMTK: introduction and use of the classic REDD dataset

After setting up the NILMTK environment, I wanted some data to test it with. In the API docs on the NILMTK website, I found that the dataset_converters module provides built-in conversion functions for a number of public datasets.

These converters turn the raw data into HDF5 files. The supported datasets are of good quality; the most commonly used are REDD and UK-DALE.

1. The REDD dataset

The current version can be downloaded from http://redd.csail.mit.edu; you need to email the authors for a username and password before you can download it.

The accompanying paper is: J. Zico Kolter and Matthew J. Johnson. REDD: A public data set for energy disaggregation research. In Proceedings of the SustKDD Workshop on Data Mining Applications in Sustainability, 2011.

The dataset mainly contains low-frequency power data and high-frequency voltage and current data, organized in three parts:

low_freq: 1 Hz power data

high_freq: calibrated and grouped voltage and current waveform data

high_freq_raw: raw voltage and current waveform data

(1) Contents of low_freq

Data from six households are included. The labels file records the appliance type of each channel, and each channel file records UTC-timestamped power readings for that channel.

labels: one "channel_number appliance_name" pair per line.

channel (one reading per second): one "utc_timestamp power" pair per line.
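
For example, a minimal sketch of peeking at these files with pandas, assuming the archive was unpacked to a low_freq directory (paths are illustrative):

import pandas as pd

# labels.dat: one "channel_number appliance_name" pair per line
labels = pd.read_csv('low_freq/house_1/labels.dat', sep=' ',
                     header=None, names=['channel', 'appliance'])
print(labels.head())

# channel_1.dat: one "utc_timestamp power_in_watts" pair per line
chan1 = pd.read_csv('low_freq/house_1/channel_1.dat', sep=' ',
                    header=None, names=['timestamp', 'power'])
chan1['timestamp'] = pd.to_datetime(chan1['timestamp'], unit='s', utc=True)
print(chan1.head())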

(2) Contents of high_freq

Data from six households are included as well; current_1 records the current of the first mains phase, current_2 records the current of the second, and voltage records the voltage data.

Things to be aware of:

a. The decimal UTC timestamp has the same format as the low-frequency UTC timestamp, but fractional parts are allowed here.

b. The cycle count, although stored as a double in the file, is actually an integer indicating how many AC cycles this particular waveform spans.

c. The waveform itself is given as 275 decimal values, equally spaced over the cycle period (see the parsing sketch below).
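
Putting the three notes together, here is a minimal sketch of parsing one line of a high_freq file (field layout as described above; the file path is illustrative):

def parse_waveform_line(line):
    """Parse one line of a REDD high_freq current/voltage file."""
    fields = line.split()
    timestamp = float(fields[0])         # decimal UTC timestamp, fractional part allowed
    cycle_count = int(float(fields[1]))  # stored as a double, but integral in value
    samples = [float(v) for v in fields[2:]]  # 275 equally spaced waveform values
    assert len(samples) == 275, 'expected 275 waveform samples'
    return timestamp, cycle_count, samples

# illustrative usage; adjust the path to your copy of the dataset
with open('high_freq/house_3/current_1.dat') as f:
    ts, cycles, wave = parse_waveform_line(next(f))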

After downloading the dataset, it can be converted to HDF5 format with the dataset_converters function:

from nilmtk.dataset_converters import convert_redd

convert_redd(r'C:\Users\admin\Anaconda3\nilm_metadata\low_freq',
             r'C:\Users\admin\Anaconda3\nilm_metadata\low_freq\redd_low_new.h5')
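
As a quick sanity check, the resulting HDF5 file can be opened with nilmtk's DataSet class (same path as the output file above):

from nilmtk import DataSet

redd = DataSet(r'C:\Users\admin\Anaconda3\nilm_metadata\low_freq\redd_low_new.h5')
print(redd.buildings)          # should list the six buildings
print(redd.buildings[1].elec)  # the meter group of house 1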

2. Use of the REDD dataset

a. Load disaggregation algorithms

According to the API docs on the NILMTK website, the disaggregation algorithms provided include Combinatorial Optimisation (CO), the Factorial Hidden Markov Model (FHMM), and Hart's 1985 algorithm; CO and FHMM are the most commonly used.

b. Implementing load disaggregation

The following example runs both CO and FHMM. The implementations can be found at:

CO: http://nilmtk.github.io/nilmtk/master/_modules/nilmtk/disaggregate/combinatorial_optimisation.html#CombinatorialOptimisation

FHMM: the fhmm_exact module under nilmtk.legacy.disaggregate.

  • Retrieve the data:
from __future__ import print_function, division
import pandas as pd
import numpy as np
from nilmtk.dataset import DataSet
#from nilmtk.metergroup import MeterGroup
#from nilmtk.datastore import HDFDataStore
#from nilmtk.timeframe import TimeFrame
from nilmtk.disaggregate.combinatorial_optimisation import CombinatorialOptimisation
from nilmtk.legacy.disaggregate.fhmm_exact import FHMM

train = DataSet('C:/Users/admin/PycharmProjects/nilmtktest/low_freq/redd_low.h5')  # load the dataset
test = DataSet('C:/Users/admin/PycharmProjects/nilmtktest/low_freq/redd_low.h5')   # load the dataset
building = 1  # select the household (house)
train.set_window(end="30-4-2011")   # split the data: everything before 30 April 2011 is the training set
test.set_window(start="30-4-2011")  # everything after 30 April 2011 is the test set

# elec holds all appliance-level and mains power data for this household; buildings 1-6 are available
train_elec = train.buildings[building].elec
test_elec = test.buildings[building].elec

top_5_train_elec = train_elec.submeters().select_top_k(k=5)  # pick the 5 appliances with the highest energy consumption for training and testing

House 1 is selected, and the five appliances with the highest energy consumption are used for training and testing.
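
To see which appliances select_top_k actually picked, you can print their labels (a small check using the same label() method that appears in the evaluation code below):

for meter in top_5_train_elec.meters:
    print(meter.label())  # the five highest-consumption appliances in house 1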

  • Disaggregation:
def predict(clf, test_elec, sample_period, timezone):   # prediction helper
    pred = {}
    gt = {}
    # iterate over chunks of the aggregate (mains) load data
    for i, chunk in enumerate(test_elec.mains().load(sample_period=sample_period)):
        chunk_drop_na = chunk.dropna()   # drop missing values
        pred[i] = clf.disaggregate_chunk(chunk_drop_na)  # disaggregate_chunk performs the actual decomposition
        gt[i] = {}  # ground truth: the measured power of each individual appliance

        for meter in test_elec.submeters().meters:
            # collect the ground-truth readings of every submeter
            gt[i][meter] = next(meter.load(sample_period=sample_period))
        gt[i] = pd.DataFrame({k: v.squeeze() for k, v in gt[i].items()},
                             index=next(iter(gt[i].values())).index).dropna()  # assemble the ground truth into a DataFrame

    # If everything can fit in memory
    gt_overall = pd.concat(gt)
    gt_overall.index = gt_overall.index.droplevel()
    pred_overall = pd.concat(pred)
    pred_overall.index = pred_overall.index.droplevel()

    # Having the same order of columns
    gt_overall = gt_overall[pred_overall.columns]

    # Intersection of index
    gt_index_utc = gt_overall.index.tz_convert("UTC")
    pred_index_utc = pred_overall.index.tz_convert("UTC")
    common_index_utc = gt_index_utc.intersection(pred_index_utc)

    common_index_local = common_index_utc.tz_convert(timezone)
    gt_overall = gt_overall.loc[common_index_local]      # .ix is deprecated; use .loc
    pred_overall = pred_overall.loc[common_index_local]
    appliance_labels = [m.label() for m in gt_overall.columns.values]
    gt_overall.columns = appliance_labels
    pred_overall.columns = appliance_labels

    return gt_overall, pred_overall

classifiers = {'CO': CombinatorialOptimisation(), 'FHMM': FHMM()}  # the two algorithms: CO and FHMM
predictions = {}
sample_period = 120  # sampling period of 120 s, i.e. one reading every two minutes
for clf_name, clf in classifiers.items():
    print("*" * 20)
    print(clf_name)
    print("*" * 20)
    clf.train(top_5_train_elec, sample_period=sample_period)  # training step
    gt, predictions[clf_name] = predict(clf, test_elec, sample_period, train.metadata['timezone'])

clf.train first learns the power signatures of these five appliances; the trained model is then used to decompose the aggregate power data into per-appliance contributions. gt records the measured power of each appliance, sampled once every two minutes, and the disaggregation targets are the five highest-consumption appliances selected earlier.

The predictions variable stores the results of the two algorithms:
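
For instance, the per-appliance estimates of either algorithm can be inspected directly:

print(predictions['CO'].head())    # first rows of CO's per-appliance estimates
print(predictions['FHMM'].head())  # same for FHMM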

  • Evaluation:
def compute_rmse(gt, pred):   # evaluation metric: RMSE
    from sklearn.metrics import mean_squared_error
    rms_error = {}
    for appliance in gt.columns:
        # the metric is simply the root mean squared error per appliance
        rms_error[appliance] = np.sqrt(mean_squared_error(gt[appliance], pred[appliance]))
    return pd.Series(rms_error)

rmse = {}
for clf_name in classifiers.keys():
    rmse[clf_name] = compute_rmse(gt, predictions[clf_name])
rmse = pd.DataFrame(rmse)

The result is a DataFrame with one row per appliance and one column per algorithm, giving each algorithm's RMSE for that appliance.

Reference blog: https://blog.csdn.net/baidu_36161077/article/details/81144037

Original post: blog.csdn.net/qq_28409193/article/details/109490513