[Python] Electric power data mining based on non-intrusive load detection and decomposition

Preface

This case explores the current, voltage, and power of each electrical device based on the collected power data and analyzes each device's actual power consumption, thereby providing a reference for power companies when formulating energy strategies. For more details, please refer to the book "Python Data Mining: Introduction, Advanced and Practical Case Analysis".

1. Case background

In order to better monitor the energy consumption of electrical equipment, power sub-metering technology emerged. Power sub-metering is of great significance for power companies: it helps them accurately predict power loads, scientifically formulate power grid dispatch plans, and improve the stability and reliability of power systems. For users, sub-metering helps them understand how their electrical equipment is used, raises awareness of energy conservation, and promotes scientific and rational use of electricity.


2. Analysis Goals

Based on the background and business requirements of power data mining for non-intrusive load detection and decomposition, the goals to be achieved in this case are as follows.

  • Analyze the operating attributes of each electrical device.
  • Build a device identification attribute library.
  • Use a K-nearest neighbors (KNN) model to "decompose" the independent power consumption data of each electrical device from the whole-line data.

3. Analysis process

The detailed analysis process is shown in the figure below: from the data sources, through data preparation and attribute construction, to model training and performance measurement.

(Figure: Overall analysis process)

4. Data preparation

4.1 Data exploration

The power data mining analysis in this case does not involve the operation record data; the device data, cycle data, and harmonic data are the main data used here. Since there are many data tables, each with many attributes, data exploration and analysis is needed after the data is obtained. During data exploration, the data corresponding to the different attributes of each device was visualized based on the characteristics of the original data. Some of the results are shown in Figures 1 to 3 below.

(Figure 1: Reactive power and total reactive power)

(Figure 2: Current trace)

(Figure 3: Voltage trace)

Based on the visualization results, it can be seen that the current, voltage, and power properties vary between different devices.

The code for visualizing the data attributes is shown in Code Listing 1.

import pandas as pd
import matplotlib.pyplot as plt
import os

filename = os.listdir('../data/附件1')  # get all file names in the folder
n_filename = len(filename)

# Add operation information to each device's data, plot and save the trajectory of each attribute
def fun(a):
    save_name = ['YD1', 'YD10', 'YD11', 'YD2', 'YD3', 'YD4',
                 'YD5', 'YD6', 'YD7', 'YD8', 'YD9']
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # display minus signs correctly
    for i in range(a):
        Sb = pd.read_excel('../data/附件1/' + filename[i], '设备数据', index_col=None)
        Xb = pd.read_excel('../data/附件1/' + filename[i], '谐波数据', index_col=None)
        Zb = pd.read_excel('../data/附件1/' + filename[i], '周波数据', index_col=None)
        # Current trace
        plt.plot(Sb['IC'])
        plt.title(save_name[i] + '-IC')
        plt.ylabel('电流(0.001A)')
        plt.show()
        # Voltage trace
        plt.plot(Sb['UC'])
        plt.title(save_name[i] + '-UC')
        plt.ylabel('电压(0.1V)')
        plt.show()
        # Active power and total active power
        plt.plot(Sb[['PC', 'P']])
        plt.title(save_name[i] + '-P')
        plt.ylabel('有功功率(0.0001kW)')
        plt.show()
        # Reactive power and total reactive power
        plt.plot(Sb[['QC', 'Q']])
        plt.title(save_name[i] + '-Q')
        plt.ylabel('无功功率(0.0001kVar)')
        plt.show()
        # Power factor and total power factor
        plt.plot(Sb[['PFC', 'PF']])
        plt.title(save_name[i] + '-PF')
        plt.ylabel('功率因数(%)')
        plt.show()
        # Harmonic voltage
        plt.plot(Xb.loc[:, 'UC02':].T)
        plt.title(save_name[i] + '-谐波电压')
        plt.show()
        # Cycle data
        plt.plot(Zb.loc[:, 'IC001':].T)
        plt.title(save_name[i] + '-周波数据')
        plt.show()

fun(n_filename)

4.2 Missing value processing

Through data exploration, it was found that the time attribute in the data has missing values that need to be processed. Since the length of the missing time period differs from one piece of data to another, different treatments are required: records falling in larger missing time periods are deleted, while records in smaller missing time periods are imputed with the previous value (forward fill).
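Before looking at the full processing code, the strategy can be sketched with plain pandas. This is a minimal illustration only, assuming a DataFrame with a 'time' column sampled once per second and unique timestamps; the actual processing in this case is done by Code Listing 3 further below.

import pandas as pd

def fill_short_gaps(df, max_gap_seconds=2):
    """Sketch: reindex to a complete 1-second range, forward-fill only short gaps."""
    df = df.copy()
    df['time'] = pd.to_datetime(df['time'])
    df = df.set_index('time').sort_index()
    full_index = pd.date_range(df.index.min(), df.index.max(), freq='S')
    df = df.reindex(full_index)                # missing seconds appear as NaN rows
    filled = df.ffill(limit=max_gap_seconds)   # impute short gaps with the previous value
    return filled.dropna(how='all')            # rows deep inside long gaps stay NaN and are dropped

# Hypothetical usage on one device table read from Excel:
# sb = pd.read_excel('../data/附件1/YD1.xlsx', '设备数据')
# sb_clean = fill_short_gaps(sb)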

Before processing missing values, the device data, cycle data, harmonic data, and operation record tables of every device in the training data, as well as the device data, cycle data, and harmonic data tables of every device in the test data, need to be extracted into independent data files. Some of the generated files are shown in Figure 4.

(Figure 4: Partial results of extracting data files)

The code for extracting the data files is shown in Code Listing 2.


# Convert the xlsx files into CSV files
import glob
import pandas as pd
import math

def file_transform(xls):
    print('共发现%s个xlsx文件' % len(glob.glob(xls)))
    print('正在处理............')
    for file in glob.glob(xls):  # loop over the xlsx files in the folder
        combine1 = pd.read_excel(file, index_col=0, sheet_name=None)
        for key in combine1:
            combine1[key].to_csv('../tmp/' + file[8: -5] + key + '.csv', encoding='utf-8')
    print('处理完成')

xls_list = ['../data/附件1/*.xlsx', '../data/附件2/*.xlsx']
file_transform(xls_list[0])  # process the training data
file_transform(xls_list[1])  # process the test data

After the extraction of the data files is completed, the extracted files are processed for missing values, as shown in Code Listing 3. Some of the files generated after processing are shown in Figure 5.

(Figure 5: Partial results after missing value processing)

# For each data file, delete data at larger missing time periods and impute smaller
# missing time periods with the previous value
def missing_data(evi):
    print('共发现%s个CSV文件' % len(glob.glob(evi)))
    for j in glob.glob(evi):
        fr = pd.read_csv(j, header=0, encoding='gbk')
        fr['time'] = pd.to_datetime(fr['time'])
        helper = pd.DataFrame({'time': pd.date_range(fr['time'].min(), fr['time'].max(), freq='S')})
        fr = pd.merge(fr, helper, on='time', how='outer').sort_values('time')
        fr = fr.reset_index(drop=True)

        frame = pd.DataFrame()
        for g in range(0, len(list(fr['time'])) - 1):
            # skip records inside longer gaps (this row and the next are both missing)
            if math.isnan(fr.iloc[:, 1][g + 1]) and math.isnan(fr.iloc[:, 1][g]):
                continue
            else:
                scop = pd.Series(fr.loc[g])
                frame = pd.concat([frame, scop], axis=1)
        frame = pd.DataFrame(frame.values.T, index=frame.columns, columns=frame.index)
        frames = frame.fillna(method='ffill')  # impute the remaining short gaps with the previous value
        frames.to_csv(j[:-4] + '1.csv', index=False, encoding='utf-8')
    print('处理完成')

evi_list = ['../tmp/附件1/*数据.csv', '../tmp/附件2/*数据.csv']
missing_data(evi_list[0])  # process the training data
missing_data(evi_list[1])  # process the test data

5. Attribute construction

Although the attributes were initially processed during the data preparation process, too many attributes were introduced, and there was duplicate information among these attributes. In order to retain important attributes and establish an accurate and simple model, the original attributes need to be further screened and constructed.

5.1 Device data

During data exploration, it was found that the reactive power, total reactive power, active power, total active power, power factor, and total power factor differ greatly between devices and are highly discriminative. Therefore, in this case, reactive power, total reactive power, active power, total active power, power factor, and total power factor are selected as the device data attributes for building the discriminant attribute library.

After handling the missing values, the data of each device has changed from one table to multiple tables, so tables of the same type need to be merged into one table, for example merging the device data tables of all devices into a single table. At the same time, because one of the missing value treatments is forward filling with the previous value, duplicate records are generated and need to be removed. The data table generated after processing is shown in Table 1.

(Table 1: Merged and deduplicated device data)

Merging and deduplicating device data is shown in code listing 4:

import glob
import pandas as pd
import os

# Merge the data of the 11 devices and handle duplicate records produced by the merge
def combined_equipment(csv_name):
    # Merge
    print('共发现%s个CSV文件' % len(glob.glob(csv_name)))
    print('正在处理............')
    for i in glob.glob(csv_name):  # loop over the CSV files in the folder
        fr = open(i, 'rb').read()
        file_path = os.path.split(i)
        with open(file_path[0] + '/device_combine.csv', 'ab') as f:
            f.write(fr)
    print('合并完毕!')

    # Deduplicate
    df = pd.read_csv(file_path[0] + '/device_combine.csv', header=None, encoding='utf-8')
    datalist = df.drop_duplicates()
    datalist.to_csv(file_path[0] + '/device_combine.csv', index=False, header=0)
    print('去重完成')

csv_list = ['../tmp/附件1/*设备数据1.csv', '../tmp/附件2/*设备数据1.csv']
combined_equipment(csv_list[0])  # process the training data
combined_equipment(csv_list[1])  # process the test data

5.2 Cycle data

During data exploration, it was found that the current in the cycle data fluctuates considerably over time, and the fluctuation patterns in the current line graphs differ noticeably between devices. Therefore, in this case the wave peaks and wave troughs of the current are selected as the cycle data attributes for building the discriminant attribute library.

Since the two attributes, current wave peak and wave trough, do not exist in the original cycle data, they need to be constructed. The data table generated by the construction is shown in Table 2.

(Table 2: Wave trough and peak attributes constructed from the cycle data)
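Code Listing 5 obtains the trough and peak of each sampled waveform by clustering the current samples into two groups with K-Means and taking the two sorted cluster centers as the trough-level and peak-level estimates. A minimal sketch of that idea for a single waveform (with a made-up signal, not the case data) is shown below.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical current waveform: 128 samples of a periodic signal
waveform = 1000 * np.sin(np.linspace(0, 4 * np.pi, 128))

# Cluster the sample values into two groups; the sorted cluster centers
# approximate the low (trough) and high (peak) levels of the waveform
model = KMeans(n_clusters=2, random_state=10, n_init=10)
model.fit(waveform.reshape(-1, 1))
trough, peak = sorted(model.cluster_centers_.ravel())
print('trough level = %.1f, peak level = %.1f' % (trough, peak))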

The code for constructing the cycle data attributes is shown in Code Listing 5:

# Compute the wave troughs and peaks of the current in the cycle data as attribute parameters
import glob
import pandas as pd
from sklearn.cluster import KMeans
import os

def cycle(cycle_file):
    for file in glob.glob(cycle_file):
        cycle_YD = pd.read_csv(file, header=0, encoding='utf-8')
        cycle_YD1 = cycle_YD.iloc[:, 0:128]
        models = []
        for types in range(0, len(cycle_YD1)):
            model = KMeans(n_clusters=2, random_state=10)
            model.fit(pd.DataFrame(cycle_YD1.iloc[types, 1:]))  # all columns except time
            models.append(model)

        # Average the steady samples within the same state
        mean = pd.DataFrame()
        for model in models:
            r = pd.DataFrame(model.cluster_centers_)  # find the cluster centers
            r = r.sort_values(axis=0, ascending=True, by=[0])
            mean = pd.concat([mean, r.reset_index(drop=True)], axis=1)
        mean = pd.DataFrame(mean.values.T, index=mean.columns, columns=mean.index)
        mean.columns = ['波谷', '波峰']
        mean.index = list(cycle_YD['time'])
        mean.to_csv(file[:-9] + '波谷波峰.csv', index=False, encoding='gbk')

cycle_file = ['../tmp/附件1/*周波数据1.csv', '../tmp/附件2/*周波数据1.csv']
cycle(cycle_file[0])  # process the training data
cycle(cycle_file[1])  # process the test data

# Merge the trough/peak files of the cycle data
def merge_cycle(cycles_file):
    means = pd.DataFrame()
    for files in glob.glob(cycles_file):
        mean0 = pd.read_csv(files, header=0, encoding='gbk')
        means = pd.concat([means, mean0])
    file_path = os.path.split(glob.glob(cycles_file)[0])
    means.to_csv(file_path[0] + '/zuhe.csv', index=False, encoding='gbk')
    print('合并完成')

cycles_file = ['../tmp/附件1/*波谷波峰.csv', '../tmp/附件2/*波谷波峰.csv']
merge_cycle(cycles_file[0])  # training data
merge_cycle(cycles_file[1])  # test data

6. Model training

When identifying the device type, the K-nearest neighbors (KNN) model is selected. The model is trained with the constructed attribute library, and the trained model is then used to identify device 1 and device 2. Building the discriminant model and identifying the device types is shown in Code Listing 6.
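As a warm-up to the full listing, the toy sketch below uses made-up numbers (not the case data) to show the mechanics: each record is a small feature vector, and KNeighborsClassifier with n_neighbors=6 assigns the label held by the majority of the 6 nearest training records.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two hypothetical devices, each described by 3 numeric attributes per record
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(loc=0, scale=1, size=(20, 3)),   # records of device "A"
                     rng.normal(loc=5, scale=1, size=(20, 3))])  # records of device "B"
y_train = np.array(['A'] * 20 + ['B'] * 20)

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
print(clf.predict([[0.2, -0.1, 0.3], [4.8, 5.2, 5.1]]))  # expected: ['A' 'B']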

import glob
import pandas as pd
from sklearn import neighbors
import pickle
import os

# Model training
def model(test_files, test_devices):
    # Training set
    zuhe = pd.read_csv('../tmp/附件1/zuhe.csv', header=0, encoding='gbk')
    device_combine = pd.read_csv('../tmp/附件1/device_combine.csv', header=0, encoding='gbk')
    train = pd.concat([zuhe, device_combine], axis=1)
    train.index = train['time'].tolist()  # set the "time" column as the index
    train = train.drop(['PC', 'QC', 'PFC', 'time'], axis=1)
    train.to_csv('../tmp/' + 'train.csv', index=False, encoding='gbk')

    # Test set
    for test_file, test_device in zip(test_files, test_devices):
        test_bofeng = pd.read_csv(test_file, header=0, encoding='gbk')
        test_devi = pd.read_csv(test_device, header=0, encoding='gbk')
        test = pd.concat([test_bofeng, test_devi], axis=1)
        test.index = test['time'].tolist()  # set the "time" column as the index
        test = test.drop(['PC', 'QC', 'PFC', 'time'], axis=1)

        # K-nearest neighbors
        clf = neighbors.KNeighborsClassifier(n_neighbors=6, algorithm='auto')
        clf.fit(train.drop(['label'], axis=1), train['label'])
        predicted = clf.predict(test.drop(['label'], axis=1))
        predicted = pd.DataFrame(predicted)
        file_path = os.path.split(test_file)[1]
        test.to_csv('../tmp/' + file_path[:3] + 'test.csv', encoding='gbk')
        predicted.to_csv('../tmp/' + file_path[:3] + 'predicted.csv', index=False, encoding='gbk')
        with open('../tmp/' + file_path[:3] + 'model.pkl', 'ab') as pickle_file:
            pickle.dump(clf, pickle_file)
        print(clf)

model(glob.glob('../tmp/附件2/*波谷波峰.csv'),
      glob.glob('../tmp/附件2/*设备数据1.csv'))

7. Performance Measurement

Based on the device discrimination results in Code Listing 6, the model is evaluated and the results are as follows. The confusion matrix is shown in Figure 7 and the ROC curve is shown in Figure 8.

模型分类准确度: 0.7951219512195122
模型评估报告:

               precision    recall  f1-score   support

         0.0       1.00      0.84      0.92        64

        21.0       0.00      0.00      0.00         0

        61.0       0.00      0.00      0.00         0

        91.0       0.78      0.84      0.81        77

        92.0       0.00      0.00      0.00         5

        93.0       0.76      0.75      0.75        59

       111.0       0.00      0.00      0.00         0

    accuracy                           0.80       205
   macro avg       0.36      0.35      0.35       205
weighted avg       0.82      0.80      0.81       205


计算auc:0.8682926829268293

The confusion matrix is shown below:

Insert image description here

The ROC curve is as follows:

Insert image description here

Model evaluation is shown in Code Listing 7:

import glob
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import label_binarize
import os
import pickle

# Model evaluation
def model_evaluation(model_file, test_csv, predicted_csv):
    for clf, test, predicted in zip(model_file, test_csv, predicted_csv):
        with open(clf, 'rb') as pickle_file:
            clf = pickle.load(pickle_file)

        test = pd.read_csv(test, header=0, encoding='gbk')
        predicted = pd.read_csv(predicted, header=0, encoding='gbk')
        test.columns = ['time', '波谷', '波峰', 'IC', 'UC', 'P', 'Q', 'PF', 'label']
        print('模型分类准确度:', clf.score(test.drop(['label', 'time'], axis=1), test['label']))
        print('模型评估报告:\n', metrics.classification_report(test['label'], predicted))

        confusion_matrix0 = metrics.confusion_matrix(test['label'], predicted)
        confusion_matrix = pd.DataFrame(confusion_matrix0)
        class_names = list(set(test['label']))

        tick_marks = range(len(class_names))
        sns.heatmap(confusion_matrix, annot=True, cmap='YlGnBu', fmt='g')
        plt.xticks(tick_marks, class_names)
        plt.yticks(tick_marks, class_names)
        plt.tight_layout()
        plt.title('混淆矩阵')
        plt.ylabel('真实标签')
        plt.xlabel('预测标签')
        plt.show()

        y_binarize = label_binarize(test['label'], classes=class_names)
        predicted = label_binarize(predicted, classes=class_names)

        fpr, tpr, thresholds = metrics.roc_curve(y_binarize.ravel(), predicted.ravel())
        auc = metrics.auc(fpr, tpr)
        print('计算auc:', auc)

        # Plot the ROC curve
        plt.figure(figsize=(8, 4))
        lw = 2
        plt.plot(fpr, tpr, label='area = %0.2f' % auc)
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.fill_between(fpr, tpr, alpha=0.2, color='b')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('1-特异性')
        plt.ylabel('灵敏度')
        plt.title('ROC曲线')
        plt.legend(loc='lower right')
        plt.show()

model_evaluation(glob.glob('../tmp/*model.pkl'),
                 glob.glob('../tmp/*test.csv'),
                 glob.glob('../tmp/*predicted.csv'))
               

According to the analysis goal, real-time power consumption needs to be calculated. Real-time power consumption is the product of the instantaneous power (current times voltage) and time. The formula is as follows.

$W = \dfrac{P \times t}{3600}$

where $W$ is the real-time power consumption in units of 0.001 kWh (i.e., Wh), $P$ is the power in W, and $t$ is the time in seconds.
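As a quick sanity check under these units: a device drawing $P = 1000$ W for $t = 1$ s consumes $W = 1000 \times 1 / 3600 \approx 0.28$, i.e., about 0.28 Wh (0.28 in units of 0.001 kWh).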

After the real-time power consumption is calculated, the results obtained are shown in Table 3.

(Table 3: Real-time power consumption results)

The code for calculating the real-time power consumption is shown in Code Listing 8.

# Calculate real-time power consumption and output the status table
def cw(test_csv, predicted_csv, test_devices):
    for test, predicted, test_device in zip(test_csv, predicted_csv, test_devices):
        # Partition the predicted timetable
        test = pd.read_csv(test, header=0, encoding='gbk')
        test.columns = ['time', '波谷', '波峰', 'IC', 'UC', 'P', 'Q', 'PF', 'label']
        test['time'] = pd.to_datetime(test['time'])
        test.index = test['time']
        predicteds = pd.read_csv(predicted, header=0, encoding='gbk')
        predicteds.columns = ['label']
        indexes = []
        class_names = list(set(test['label']))
        for j in class_names:
            index = list(predicteds.index[predicteds['label'] == j])
            indexes.append(index)

        # Take the first index and time point of each run of consecutive records
        from itertools import groupby  # group consecutive numbers
        dif_indexs = []
        time_indexes = []
        info_lists = pd.DataFrame()
        for y, z in zip(indexes, class_names):
            dif_index = []
            fun = lambda x: x[1] - x[0]
            for k, g in groupby(enumerate(y), fun):
                dif_list = [j for i, j in g]  # list of consecutive numbers
                if len(dif_list) > 1:
                    scop = min(dif_list)  # take the first of the consecutive range
                else:
                    scop = dif_list[0]
                dif_index.append(scop)
            time_index = list(test.iloc[dif_index, :].index)
            time_indexes.append(time_index)
            info_list = pd.DataFrame({'时间': time_index, 'model_设备状态': [z] * len(time_index)})
            dif_indexs.append(dif_index)
            info_lists = pd.concat([info_lists, info_list])

        # Calculate real-time power consumption and save the status table
        test_devi = pd.read_csv(test_device, header=0, encoding='gbk')
        test_devi['time'] = pd.to_datetime(test_devi['time'])
        test_devi['实时用电量'] = test_devi['P'] * 100 / 3600
        info_lists = info_lists.merge(test_devi[['time', '实时用电量']],
                                      how='inner', left_on='时间', right_on='time')
        info_lists = info_lists.sort_values(by=['时间'], ascending=True)
        info_lists = info_lists.drop(['time'], axis=1)
        file_path = os.path.split(test_device)[1]
        info_lists.to_csv('../tmp/' + file_path[:3] + '状态表.csv', index=False, encoding='gbk')
        print(info_lists)

cw(glob.glob('../tmp/*test.csv'),
   glob.glob('../tmp/*predicted.csv'),
   glob.glob('../tmp/附件2/*设备数据1.csv'))

A book giveaway at the end of the article: "Python Data Mining: Introduction, Advanced and Practical Case Analysis"

What the blogger recommends today is the Python data mining book "Python Data Mining: Introduction, Advanced and Practical Case Analysis".

  • How to participate: Follow the blogger and leave a message in the comment area to participate

  • Quantity given out: Tentatively 2~3 copies will be given out to fans


"Python Data Mining: Introduction, Advanced and Practical Case Analysis" is a data mining book driven by actual project cases. It can help readers who have no Python programming foundation and data mining foundation to quickly master Python data mining technology. Processes and Methods. In terms of writing style, it is different from the traditional "combination of theory and practice" introductory books. It uses the well-known events in the field of data mining as the "Teddy Cup" Data Mining Challenge (which has been held for 10 years) and the "Teddy Cup" data analysis. Based on the Skills Competition (which has been held for 5 times) (more than 100,000 teachers and students from more than 1,500 colleges and universities participated), 11 classic competition questions were selected to integrate Python programming knowledge, data mining knowledge and industry knowledge, so that Readers can quickly master data mining methods in seven major industries including e-commerce, education, transportation, media, electric power, tourism, and manufacturing in practice.

This book is not only suitable for self-study by readers with zero basic knowledge, but also suitable for teacher teaching. In order to help readers master the content of this book more efficiently, this book provides the following 10 additional values:

  1. Modeling platform: Provides a one-stop big data mining modeling platform that requires no configuration and includes a large number of case projects. You can learn while practicing and say goodbye to talking on paper
  2. Video explanation: Provide no less than 600 minutes of Python programming and data mining related teaching videos, learn while watching, and quickly gain experience value
  3. Selected Exercises: Carefully select no less than 60 data mining exercises and provide detailed answers. Learn and practice while checking knowledge blind spots
  4. Author Q&A: If you have any questions during the learning process, you can use the "Tree Hole" applet to photograph the page of the paper book and send it to the author with one click, learning while asking and getting twice the result with half the effort
  5. Data files: Provide data files for each case, combined with engineering practice, ready to use out of the box, enhancing practicality
  6. Program code: Provides electronic files of the code in the book and installation packages of related tools. The code can be imported into the platform and run, and the learning effect is immediate
  7. Teaching courseware: Provides matching PPT courseware. Teachers who use this book as a teaching material can apply to save lesson preparation time
  8. Model service: Provides no less than 10 data mining models. The model provides a complete case implementation process to help improve data mining practice capabilities
  9. Teaching Platform: Teddy Technology provides a one-stop digital teaching platform for the additional resources provided in this book, with detailed operation guides, so you can learn and practice while reading, saving time
  10. Employment recommendation: Provides a large number of employment recommendation opportunities and cooperates with 1,500+ companies, including well-known companies such as Huawei, JD.com, and Midea

By studying this book, readers can understand the principles of data mining, quickly master the relevant operations of big data technology, and lay a good technical foundation for subsequent data analysis, data mining, and deep learning practices and competitions.

