Introduction
This case examines the current, voltage, and power of each electrical device based on collected power data and analyzes each device's actual power consumption, providing a reference for power companies when they formulate energy strategies. For more details, refer to the book **"Python Data Mining: Introduction, Advanced and Practical Case Analysis"**.
1 Case background
Power sub-metering technology was developed to better monitor the energy consumption of electrical equipment. It helps power companies predict loads accurately, plan grid dispatch scientifically, and improve the stability and reliability of power systems. For users, sub-metering shows how each electrical device is used, raises awareness of energy conservation, and encourages rational use of electricity.
2 Analysis goals
Based on the background and business requirements of power data mining for non-intrusive load detection and decomposition, the goals to be achieved in this case are as follows.
- Analyze the operating attributes of each electrical device.
- Construct a device identification attribute library.
- Use the K-nearest neighbors model to "decompose" the independent power consumption data of each electrical device from the entire line.
3 Analysis process
4 Data preparation
1. Data exploration
The analysis in this case does not involve the operation record data, so mainly the device data, cycle data, and harmonic data are used. Because there are many data tables and each table has many attributes, the data must first be explored. During data exploration, the data for the different attributes of each device is visualized directly from the raw data. Some of the results are shown in Figures 1 to 3.
Figure 1 Reactive power and total reactive power
Figure 2 Current trace
Figure 3 Voltage trace
Based on the visualization results, it can be seen that the current, voltage, and power properties vary between different devices.
The code for visualizing the data attributes is shown in Code Listing 1.
Code Listing 1 Visualize data attributes
import os

import pandas as pd
import matplotlib.pyplot as plt

data_dir = '../data/附件1/'
# Names of all files in the folder, sorted so they line up with save_name below
filename = sorted(os.listdir(data_dir))
n_filename = len(filename)


# Draw the trace of each attribute for every device
def fun(a):
    save_name = ['YD1', 'YD10', 'YD11', 'YD2', 'YD3', 'YD4', 'YD5', 'YD6', 'YD7', 'YD8', 'YD9']
    plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # Display minus signs correctly
    for i in range(a):
        path = data_dir + filename[i]
        Sb = pd.read_excel(path, sheet_name='设备数据', index_col=None)  # Device data
        Zb = pd.read_excel(path, sheet_name='周波数据', index_col=None)  # Cycle data
        # Harmonic data: this read is missing from the original listing,
        # so the sheet name '谐波数据' is an assumption
        Xb = pd.read_excel(path, sheet_name='谐波数据', index_col=None)
        # Current trace
        plt.plot(Sb['IC'])
        plt.title(save_name[i] + '-IC')
        plt.ylabel('Current (0.001A)')
        plt.show()
        # Voltage trace
        plt.plot(Sb['UC'])
        plt.title(save_name[i] + '-UC')
        plt.ylabel('Voltage (0.1V)')
        plt.show()
        # Active power and total active power
        plt.plot(Sb[['PC', 'P']])
        plt.title(save_name[i] + '-P')
        plt.ylabel('Active power (0.0001kW)')
        plt.show()
        # Reactive power and total reactive power
        plt.plot(Sb[['QC', 'Q']])
        plt.title(save_name[i] + '-Q')
        plt.ylabel('Reactive power (0.0001kVar)')
        plt.show()
        # Power factor and total power factor
        plt.plot(Sb[['PFC', 'PF']])
        plt.title(save_name[i] + '-PF')
        plt.ylabel('Power factor (%)')
        plt.show()
        # Harmonic voltage
        plt.plot(Xb.loc[:, 'UC02':].T)
        plt.title(save_name[i] + '-harmonic voltage')
        plt.show()
        # Cycle data
        plt.plot(Zb.loc[:, 'IC001':].T)
        plt.title(save_name[i] + '-Cycle data')
        plt.show()


fun(n_filename)
2. Missing value processing
Data exploration shows that the "time" attribute has missing values that need to be handled. Because the length of the missing period differs from record to record, two treatments are used: in each device's data, records belonging to longer missing periods are deleted, and shorter missing periods are imputed with the previous value.
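The rule just described (drop long gaps, forward-fill short ones) can be illustrated on a toy series. This is only a minimal sketch, not the code used in this case: the one-second grid, the `max_gap` threshold, and the sample values are assumptions chosen for the example.

```python
import pandas as pd

# Toy device readings with one short gap and one long gap (made-up values).
raw = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-01-27 17:11:00", "2018-01-27 17:11:01",
        "2018-01-27 17:11:03",   # 17:11:02 is missing (short gap)
        "2018-01-27 17:11:09",   # 17:11:04 to 17:11:08 are missing (long gap)
    ]),
    "IC": [33, 34, 35, 36],
})

max_gap = 2  # hypothetical threshold: gaps of up to 2 seconds are imputed

# Put the data on a complete one-second grid; missing seconds become NaN rows.
grid = pd.date_range(raw["time"].min(), raw["time"].max(), freq="s")
full = raw.set_index("time").reindex(grid)

# Label each run of consecutive missing / present rows and measure its length.
is_missing = full["IC"].isna()
run_id = is_missing.astype(int).diff().ne(0).cumsum()
run_len = is_missing.groupby(run_id).transform("sum")

# Forward-fill everything, then drop the rows that belong to long gaps.
filled = full.ffill()
result = filled[~(is_missing & (run_len > max_gap))]
print(result)
```

The short gap is filled with the previous reading, while the five rows of the long gap are removed rather than filled with a stale value.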
Before missing values are processed, the device data, cycle data, harmonic data, and operation record tables of every device in the training data, and the device data, cycle data, and harmonic data tables of every device in the test data, must be extracted as separate data files. Some of the generated files are shown in Figure 4.
Figure 4 Partial results of extracting data files
The code for extracting the data files is shown in Code Listing 2.
Code Listing 2 Extract data files
# Convert xlsx files to CSV files
import glob
import math  # used in Code Listing 3

import pandas as pd


def file_transform(xls):
    print('A total of %s xlsx files found' % len(glob.glob(xls)))
    print('Processing............')
    for file in glob.glob(xls):  # Loop over the xlsx files in the folder
        # sheet_name=None reads every sheet into a dict keyed by sheet name
        combine1 = pd.read_excel(file, index_col=0, sheet_name=None)
        for key in combine1:
            # file[8:-5] strips the leading '../data/' and the trailing '.xlsx'
            combine1[key].to_csv('../tmp/' + file[8:-5] + key + '.csv', encoding='utf-8')
    print('Done')


xls_list = ['../data/附件1/*.xlsx', '../data/附件2/*.xlsx']
file_transform(xls_list[0])  # Process the training data
file_transform(xls_list[1])  # Process the test data
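After the data files are extracted, their missing values are processed. Some of the files generated after processing are shown in Figure 5.
Figure 5 Partial results after missing value processing
Missing value processing is shown in Code Listing 3.
Code Listing 3 Missing value processing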
# In each data file, delete the records of longer missing periods and
# fill shorter missing periods with the previous value
def missing_data(evi):
    print('A total of %s CSV files found' % len(glob.glob(evi)))
    for j in glob.glob(evi):
        fr = pd.read_csv(j, header=0, encoding='utf-8')
        fr['time'] = pd.to_datetime(fr['time'])
        # Merge in a complete one-second time grid so missing seconds appear as NaN rows
        helper = pd.DataFrame({'time': pd.date_range(fr['time'].min(), fr['time'].max(), freq='s')})
        fr = pd.merge(fr, helper, on='time', how='outer').sort_values('time')
        fr = fr.reset_index(drop=True)

        # Skip a missing row when the next row is missing too, so that only
        # the last row of each gap is kept and later forward-filled
        col = fr.iloc[:, 1]
        keep = [g for g in range(len(fr) - 1)
                if not (math.isnan(col[g]) and math.isnan(col[g + 1]))]
        frames = fr.loc[keep].ffill()
        frames.to_csv(j[:-4] + '1.csv', index=False, encoding='utf-8')
    print('Done')


evi_list = ['../tmp/附件1/*数据.csv', '../tmp/附件2/*数据.csv']
missing_data(evi_list[0])  # Process the training data
missing_data(evi_list[1])  # Process the test data
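5 Attribute construction
Although the attributes were preliminarily processed during data preparation, too many attributes were introduced and they carry redundant information. To retain the important attributes and build an accurate, simple model, the original attributes need to be further screened and constructed.
- Device data
Data exploration showed that reactive power, total reactive power, active power, total active power, power factor, and total power factor differ greatly between devices and are therefore highly discriminative. This case selects these six attributes of the device data to build the device identification attribute library.
After missing values are handled, each device's data has been split from one table into several, so tables of the same type need to be merged into one, for example the device data tables of all devices. In addition, because one of the missing-value treatments imputes with the previous value, identical records are produced, and these duplicates must be removed. The resulting table is shown in Table 1.
Table 1 Device data after merging and deduplication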
| time | IC | UC | PC | QC | PFC | P | Q | PF | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2018/1/27 17:11 | 33 | 2212 | 10 | 65 | 137 | 10 | 65 | 137 | 0 |
| 2018/1/27 17:11 | 33 | 2212 | 10 | 66 | 143 | 10 | 66 | 143 | 0 |
| 2018/1/27 17:11 | 33 | 2213 | 10 | 65 | 143 | 10 | 65 | 143 | 0 |
| 2018/1/27 17:11 | 33 | 2211 | 10 | 66 | 135 | 10 | 66 | 135 | 0 |
| 2018/1/27 17:11 | 33 | 2211 | 10 | 66 | 141 | 10 | 66 | 141 | 0 |
| …… | …… | …… | …… | …… | …… | …… | …… | …… | …… |
Merging and deduplicating the device data is shown in Code Listing 4.
Code Listing 4 Merge and deduplicate device data
import glob
import os

import pandas as pd


# Merge the device data of the 11 devices and remove duplicate records
def combined_equipment(csv_name):
    files = glob.glob(csv_name)
    print('A total of %s CSV files found' % len(files))
    print('Processing............')
    # Merge: read every matching CSV file and stack the rows.
    # The result is written in one pass instead of appended, so rerunning
    # the function does not accumulate rows in device_combine.csv.
    merged = pd.concat((pd.read_csv(f, header=0, encoding='utf-8') for f in files),
                       ignore_index=True)
    print('Merge complete!')
    # Deduplicate
    datalist = merged.drop_duplicates()
    out_dir = os.path.split(files[0])[0]
    datalist.to_csv(out_dir + '/device_combine.csv', index=False, encoding='utf-8')
    print('Deduplication complete')


csv_list = ['../tmp/附件1/*设备数据1.csv', '../tmp/附件2/*设备数据1.csv']
combined_equipment(csv_list[0])  # Process the training data
combined_equipment(csv_list[1])  # Process the test data
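- Cycle data
Data exploration showed that the current in the cycle data fluctuates considerably over time, and that the line charts of this current differ markedly between devices. This case therefore selects the peak and the trough as the cycle data attributes for building the device identification attribute library.
Because the original cycle data does not contain peak and trough attributes for the current, they have to be constructed. The resulting table is shown in Table 2.
Table 2 Data generated by constructing attributes from the cycle data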
| Trough (波谷) | Peak (波峰) |
| --- | --- |
| 344 | 1666365 |
| 362 | 1666324 |
| 301 | 1666325 |
| 314 | 1666392 |
| 254 | 1666435 |
| …… | …… |
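The code for constructing the attributes from the cycle data is shown in Code Listing 5.
Code Listing 5 Construct attributes from the cycle data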
# Compute the peak and trough of the current in the cycle data as attributes
import glob
import os

import pandas as pd
from sklearn.cluster import KMeans


def cycle(cycle_file):
    for file in glob.glob(cycle_file):
        cycle_YD = pd.read_csv(file, header=0, encoding='utf-8')
        cycle_YD1 = cycle_YD.iloc[:, 0:128]
        centers = []
        for types in range(0, len(cycle_YD1)):
            model = KMeans(n_clusters=2, n_init=10, random_state=10)
            # Fit on all columns except time, one sample per current value
            model.fit(cycle_YD1.iloc[types, 1:].to_numpy(dtype=float).reshape(-1, 1))
            # The two cluster centers in ascending order: trough first, then peak
            centers.append(sorted(model.cluster_centers_.ravel()))

        # One row per record; columns are trough (波谷) and peak (波峰)
        mean = pd.DataFrame(centers, columns=['波谷', '波峰'])
        # file[:-9] strips the trailing '周波数据1.csv'
        mean.to_csv(file[:-9] + '波谷波峰.csv', index=False, encoding='gbk')


cycle_file = ['../tmp/附件1/*周波数据1.csv', '../tmp/附件2/*周波数据1.csv']
cycle(cycle_file[0])  # Process the training data
cycle(cycle_file[1])  # Process the test data


# Merge the peak/trough files of the cycle data
def merge_cycle(cycles_file):
    files = glob.glob(cycles_file)
    means = pd.concat(pd.read_csv(f, header=0, encoding='gbk') for f in files)
    file_path = os.path.split(files[0])
    means.to_csv(file_path[0] + '/zuhe.csv', index=False, encoding='gbk')
    print('Merge complete')


cycles_file = ['../tmp/附件1/*波谷波峰.csv', '../tmp/附件2/*波谷波峰.csv']
merge_cycle(cycles_file[0])  # Training data
merge_cycle(cycles_file[1])  # Test data
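To make the peak/trough construction more concrete, the sketch below applies the same two-cluster KMeans idea to a single synthetic cycle. The sine-shaped samples and their amplitude are assumptions for illustration only; real rows come from the extracted cycle data files.

```python
import numpy as np
from sklearn.cluster import KMeans

# One synthetic cycle of 127 current samples (a stand-in for one row of cycle data).
t = np.linspace(0, 2 * np.pi, 127, endpoint=False)
current = 100 * np.sin(t)

# Cluster the samples into two groups: a low group and a high group.
km = KMeans(n_clusters=2, n_init=10, random_state=10)
km.fit(current.reshape(-1, 1))

# The smaller center plays the role of the trough, the larger one the peak.
trough, peak = np.sort(km.cluster_centers_.ravel())
print(f"trough={trough:.1f}, peak={peak:.1f}")
```

Note that the cluster centers are the means of the low and high groups of samples, so they are smoother than the raw minimum and maximum of the cycle.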
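6 Model training
The K-nearest neighbors model is used to discriminate device types. The model is trained on the attribute library built during attribute construction, and the trained model is then used to classify device 1 and device 2. Building the discriminant model and classifying the device types is shown in Code Listing 6.
Code Listing 6 Build the discriminant model and classify device types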
import glob
import os
import pickle

import pandas as pd
from sklearn import neighbors


# Model training
def model(test_files, test_devices):
    # Training set
    zuhe = pd.read_csv('../tmp/附件1/zuhe.csv', header=0, encoding='gbk')
    device_combine = pd.read_csv('../tmp/附件1/device_combine.csv', header=0, encoding='gbk')
    train = pd.concat([zuhe, device_combine], axis=1)
    train.index = train['time'].tolist()  # Use the "time" column as the index
    train = train.drop(['PC', 'QC', 'PFC', 'time'], axis=1)
    train.to_csv('../tmp/' + 'train.csv', index=False, encoding='gbk')
    # Test set
    for test_file, test_device in zip(test_files, test_devices):
        test_bofeng = pd.read_csv(test_file, header=0, encoding='gbk')
        test_devi = pd.read_csv(test_device, header=0, encoding='gbk')
        test = pd.concat([test_bofeng, test_devi], axis=1)
        test.index = test['time'].tolist()  # Use the "time" column as the index
        test = test.drop(['PC', 'QC', 'PFC', 'time'], axis=1)

        # K-nearest neighbors
        clf = neighbors.KNeighborsClassifier(n_neighbors=6, algorithm='auto')
        clf.fit(train.drop(['label'], axis=1), train['label'])
        predicted = clf.predict(test.drop(['label'], axis=1))
        predicted = pd.DataFrame(predicted)
        file_path = os.path.split(test_file)[1]
        test.to_csv('../tmp/' + file_path[:3] + 'test.csv', encoding='gbk')
        predicted.to_csv('../tmp/' + file_path[:3] + 'predicted.csv', index=False, encoding='gbk')
        # 'wb' overwrites any earlier model file instead of appending to it
        with open('../tmp/' + file_path[:3] + 'model.pkl', 'wb') as pickle_file:
            pickle.dump(clf, pickle_file)
        print(clf)


# Sort both lists so that each peak/trough file is paired with its device data file
model(sorted(glob.glob('../tmp/附件2/*波谷波峰.csv')),
      sorted(glob.glob('../tmp/附件2/*设备数据1.csv')))
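For readers new to K-nearest neighbors, the following sketch shows the classifier on made-up data. The two device states, their feature values, and the labels 0 and 91 are hypothetical and only illustrate how a new reading is assigned the majority label of its nearest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two hypothetical device states with clearly separated (P, Q) features.
state_a = rng.normal(loc=[10.0, 65.0], scale=1.0, size=(30, 2))
state_b = rng.normal(loc=[200.0, 20.0], scale=1.0, size=(30, 2))
X_train = np.vstack([state_a, state_b])
y_train = np.array([0] * 30 + [91] * 30)

# Same neighbor count as in the listing above.
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)

# Classify two new readings, one close to each state.
X_new = np.array([[11.0, 64.0], [198.0, 21.0]])
pred = clf.predict(X_new)
print(pred)
```

Because the distance is computed on raw feature values, attributes with large numeric ranges dominate the vote; whether that matters for the real attribute library depends on the data.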
7 Performance measure
Based on the device discrimination results in code listing 6, the model is evaluated and the results are as follows. The confusion matrix is shown in Figure 7 and the ROC curve is shown in Figure 8.
Model classification accuracy: 0.7951219512195122
Model evaluation report:
              precision    recall  f1-score   support

         0.0       1.00      0.84      0.92        64
        21.0       0.00      0.00      0.00         0
        61.0       0.00      0.00      0.00         0
        91.0       0.78      0.84      0.81        77
        92.0       0.00      0.00      0.00         5
        93.0       0.76      0.75      0.75        59
       111.0       0.00      0.00      0.00         0

    accuracy                           0.80       205
   macro avg       0.36      0.35      0.35       205
weighted avg       0.82      0.80      0.81       205

AUC: 0.8682926829268293
Note: Some results here have been omitted.
Figure 7 Confusion matrix
Figure 8 ROC curve
Model evaluation is shown in Code Listing 7.
Code Listing 7 Model Evaluation
import glob
import os
import pickle

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.preprocessing import label_binarize


# Model evaluation
def model_evaluation(model_file, test_csv, predicted_csv):
    for clf, test, predicted in zip(model_file, test_csv, predicted_csv):
        with open(clf, 'rb') as pickle_file:
            clf = pickle.load(pickle_file)
        test = pd.read_csv(test, header=0, encoding='gbk')
        predicted = pd.read_csv(predicted, header=0, encoding='gbk')
        y_pred = predicted.iloc[:, 0]
        # Columns: time, trough (波谷), peak (波峰), then the device attributes
        test.columns = ['time', '波谷', '波峰', 'IC', 'UC', 'P', 'Q', 'PF', 'label']
        print('Model classification accuracy:',
              clf.score(test.drop(['label', 'time'], axis=1), test['label']))
        print('Model evaluation report:\n', metrics.classification_report(test['label'], y_pred))
        # Label the matrix with every class found in the true or predicted labels,
        # so the axis labels always match the rows and columns
        all_labels = sorted(set(test['label']) | set(y_pred))
        confusion_matrix0 = metrics.confusion_matrix(test['label'], y_pred, labels=all_labels)
        confusion_matrix = pd.DataFrame(confusion_matrix0, index=all_labels, columns=all_labels)
        sns.heatmap(confusion_matrix, annot=True, cmap='YlGnBu', fmt='g')
        plt.title('Confusion matrix')
        plt.ylabel('True label')
        plt.xlabel('Predicted label')
        plt.tight_layout()
        plt.show()
        # ROC curve over the classes present in the true labels
        class_names = sorted(set(test['label']))
        y_binarize = label_binarize(test['label'], classes=class_names)
        y_pred_binarize = label_binarize(y_pred, classes=class_names)
        fpr, tpr, thresholds = metrics.roc_curve(y_binarize.ravel(), y_pred_binarize.ravel())
        auc = metrics.auc(fpr, tpr)
        print('AUC:', auc)
        # Plot
        plt.figure(figsize=(8, 4))
        lw = 2
        plt.plot(fpr, tpr, label='area = %0.2f' % auc)
        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
        plt.fill_between(fpr, tpr, alpha=0.2, color='b')
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('1 - specificity')
        plt.ylabel('Sensitivity')
        plt.title('ROC curve')
        plt.legend(loc='lower right')
        plt.show()


# Sort the file lists so that model, test, and prediction files are paired correctly
model_evaluation(sorted(glob.glob('../tmp/*model.pkl')),
                 sorted(glob.glob('../tmp/*test.csv')),
                 sorted(glob.glob('../tmp/*predicted.csv')))
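The rows with a support of 0 in the evaluation report correspond to classes that the model predicted but that never occur in the true labels of the test data. The small sketch below reproduces this situation with made-up labels (the values are hypothetical and unrelated to the real results).

```python
from sklearn import metrics

# Made-up labels: class 21 is predicted once but never occurs in y_true.
y_true = [0, 0, 91, 91, 93, 93]
y_pred = [0, 21, 91, 91, 93, 91]

# Use the union of true and predicted classes so every class gets a row and a column.
labels = sorted(set(y_true) | set(y_pred))
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
acc = metrics.accuracy_score(y_true, y_pred)

print(labels)
print(cm)
print(acc)
```

The row for class 21 contains only zeros because no true sample has that label, which is why such a class also pulls down the macro average in a classification report.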
According to the analysis goals, real-time power consumption needs to be calculated. It is the product of the instantaneous power (current times voltage) and time. The formula, as implemented in Code Listing 8, is as follows.

$$W = P \times 100 / 3600$$

Here $W$ is the real-time power consumption in units of 0.001 kWh, and $P$ is the power in W.
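As a quick arithmetic check, the sketch below mirrors the `P * 100 / 3600` expression that Code Listing 8 applies to the `P` column. The function name `realtime_consumption` and the sample readings are hypothetical.

```python
def realtime_consumption(p):
    """Convert one power reading to per-second consumption, mirroring P * 100 / 3600."""
    return p * 100 / 3600


# Made-up power readings, one per second.
readings = [10, 10, 36]
consumption = [realtime_consumption(p) for p in readings]
total = sum(consumption)

print(consumption)
print(total)
```

Dividing by 3600 converts a one-second reading into an hourly energy unit, so summing the per-second values over a period gives the consumption for that period.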
The calculated real-time power consumption is shown in Table 3.
Table 3 Real-time power consumption
The calculation of real-time power consumption is shown in Code Listing 8.
Code Listing 8 Calculate real-time power consumption
# Calculate real-time power consumption and output the status table
import glob
import os
from itertools import groupby

import pandas as pd


def cw(test_csv, predicted_csv, test_devices):
    for test, predicted, test_device in zip(test_csv, predicted_csv, test_devices):
        # Split the predictions into row positions per class
        test = pd.read_csv(test, header=0, encoding='gbk')
        test.columns = ['time', '波谷', '波峰', 'IC', 'UC', 'P', 'Q', 'PF', 'label']
        test['time'] = pd.to_datetime(test['time'])
        test.index = test['time']
        predicteds = pd.read_csv(predicted, header=0, encoding='gbk')
        predicteds.columns = ['label']
        indexes = []
        class_names = list(set(test['label']))
        for j in class_names:
            index = list(predicteds.index[predicteds['label'] == j])
            indexes.append(index)
        # Take the first position and timestamp of each run of consecutive rows
        dif_indexs = []
        time_indexes = []
        info_lists = pd.DataFrame()
        for y, z in zip(indexes, class_names):
            dif_index = []
            fun = lambda x: x[1] - x[0]  # constant within a run of consecutive numbers
            for k, g in groupby(enumerate(y), fun):
                dif_list = [j for i, j in g]  # one run of consecutive numbers
                dif_index.append(dif_list[0])  # first number of the run
            time_index = list(test.iloc[dif_index, :].index)
            time_indexes.append(time_index)
            # Columns: 时间 (time) and model_设备状态 (device state from the model)
            info_list = pd.DataFrame({
                '时间': time_index, 'model_设备状态': [z] * len(time_index)})
            dif_indexs.append(dif_index)
            info_lists = pd.concat([info_lists, info_list])
        # Calculate real-time power consumption (实时用电量) and save the status table
        test_devi = pd.read_csv(test_device, header=0, encoding='gbk')
        test_devi['time'] = pd.to_datetime(test_devi['time'])
        test_devi['实时用电量'] = test_devi['P'] * 100 / 3600
        info_lists = info_lists.merge(test_devi[['time', '实时用电量']],
                                      how='inner', left_on='时间', right_on='time')
        info_lists = info_lists.sort_values(by=['时间'], ascending=True)
        info_lists = info_lists.drop(['time'], axis=1)
        file_path = os.path.split(test_device)[1]
        # 状态表 = status table
        info_lists.to_csv('../tmp/' + file_path[:3] + '状态表.csv', index=False, encoding='gbk')
        print(info_lists)


# Sort the file lists so that the files of the same device are paired
cw(sorted(glob.glob('../tmp/*test.csv')),
   sorted(glob.glob('../tmp/*predicted.csv')),
   sorted(glob.glob('../tmp/附件2/*设备数据1.csv')))
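The `groupby(enumerate(...))` idiom used above to find where each device state begins can be hard to read at first. The standalone sketch below isolates it; the helper name `run_starts` and the sample positions are hypothetical.

```python
from itertools import groupby


def run_starts(indexes):
    """Return the first element of each run of consecutive integers."""
    starts = []
    # Within a run of consecutive integers, value minus position is constant,
    # so that difference can serve as the grouping key.
    for _, group in groupby(enumerate(indexes), key=lambda pair: pair[1] - pair[0]):
        run = [value for _, value in group]
        starts.append(run[0])
    return starts


print(run_starts([3, 4, 5, 9, 10, 20]))  # → [3, 9, 20]
```

Each returned position marks the first row of a stretch in which the model predicted the same state, which is the row whose timestamp ends up in the status table.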
8 Book recommendation
Official purchase link: https://item.jd.com/13814157.html
**"Python Data Mining: Introduction, Advanced and Practical Case Analysis"** is a data mining book driven by real project cases. It helps readers with no background in Python programming or data mining quickly master the techniques, processes, and methods of data mining with Python. Unlike traditional introductory books that "combine theory and practice", it draws on two well-known events in the data mining field: the "Teddy Cup" Data Mining Challenge (held for 10 years) and the "Teddy Cup" Data Analysis Skills Competition (held 5 times), in which more than 100,000 teachers and students from more than 1,500 colleges and universities have participated. From these events, 11 classic competition questions were selected to integrate Python programming knowledge, data mining knowledge, and industry knowledge, so that readers can quickly master, through practice, data mining methods in seven major industries: e-commerce, education, transportation, media, electric power, tourism, and manufacturing.
This book is suitable both for self-study by readers with no prior knowledge and for classroom teaching. To help readers master its content more efficiently, the book provides the following 10 extras:
**(1) Modeling platform:** A one-stop big data mining modeling platform that requires no configuration and includes a large number of case projects, so you can learn while practicing instead of learning only on paper.
**(2) Video explanation:** No less than 600 minutes of teaching videos on Python programming and data mining, so you can learn while watching and gain experience quickly.
**(3) Selected exercises:** No fewer than 60 carefully selected data mining exercises with detailed answers, so you can learn while practicing and find gaps in your knowledge.
**(4) Author Q&A:** If you have questions while studying, you can use the "Tree Hole" applet to photograph the printed page and send it to the author with one tap, so you can learn while asking.
**(5) Data files:** Data files for each case, drawn from engineering practice and ready to use out of the box.
**(6) Program code:** Electronic files of the code in the book and installation packages of the related tools. The code can be imported into the platform and run directly.
**(7) Teaching courseware:** Supporting PPT courseware. Teachers who use this book as a textbook can apply for it to save lesson preparation time.
**(8) Model service:** No fewer than 10 data mining models, each with a complete case implementation process, to help improve practical data mining skills.
**(9) Teaching platform:** Teddy Technology provides a one-stop data-based teaching platform for the additional resources of this book, with a detailed operation guide.
**(10) Employment recommendation:** A large number of job referral opportunities, in cooperation with 1500+ companies, including well-known companies such as Huawei, JD.com, and Midea.
By studying this book, readers can understand the principles of data mining, quickly master the relevant operations of big data technology, and lay a good technical foundation for subsequent data analysis, data mining, and deep learning practices and competitions.
9 How to participate
- 2 books are given away this time
- Activity time: Until 2023-11-2
- Participation method: follow the blogger, then like, bookmark, and leave a comment of your choice
- Winners are selected from comments of more than 20 words, ranked by the number of likes on the comment
- One more book is added if the view count exceeds 2k (the final number of books depends on the view count; if the view count falls short, books are given out according to the actual amount)

PS: After the event ends, the list of winners will be announced in the fan group and in the comment section.