[Chenyuan Book Donation Event: Issue 01] Python Data Mining: Introduction, Advanced and Practical Case Analysis

Introduction

Based on the collected power data, this case explores the current, voltage, and power of each electrical device and analyzes its actual power consumption, providing a reference for power companies when formulating energy strategies. For more details, please refer to the book **"Python Data Mining: Introduction, Advanced and Practical Case Analysis"**.


1 Case background

Power sub-metering technology emerged to better monitor the energy consumption of electrical equipment. Sub-metering is of great significance for power companies: it helps them accurately predict power loads, scientifically formulate grid dispatch plans, and improve the stability and reliability of the power system. For users, sub-metering clarifies how their electrical equipment is actually used, raises awareness of energy conservation, and promotes scientific and rational use of electricity.


2 Analysis objectives

Based on the background and business requirements of power data mining for non-intrusive load detection and decomposition, the goals to be achieved in this case are as follows.

• Analyze the operating attributes of each electrical device.

• Construct a device identification attribute library.

• Use the K-nearest neighbors (KNN) model to "decompose" the independent power consumption data of each electrical device from the whole-line data.

3 Analysis process

(Figure: analysis process flowchart of this case)

4 Data preparation

1. Data exploration

The power data mining analysis in this case does not involve the operation record data; therefore, the equipment data, cycle data, and harmonic data are the main data obtained here. Since there are many data tables and each table has many attributes, data exploration and analysis is needed after the data is obtained. During data exploration, the data for the different attributes of each device was visualized based on the characteristics of the original data. Some of the results are shown in Figures 1 to 3.


Figure 1 Reactive power and total reactive power


Figure 2 Current trace


Figure 3 Voltage trace

Based on the visualization results, it can be seen that the current, voltage, and power properties vary between different devices.

The code to visualize the data attributes is shown in Code Listing 1.

Code Listing 1 Visualize data attributes

import pandas as pd

import matplotlib.pyplot as plt

import os

 

filename = os.listdir('../data/附件1')  # Get the names of all files in the folder

n_filename = len(filename)  

# For each device, read its data and plot the trajectory of each attribute

def fun(a):

    save_name = ['YD1', 'YD10', 'YD11', 'YD2', 'YD3', 'YD4',

           'YD5', 'YD6', 'YD7', 'YD8', 'YD9']

    plt.rcParams['font.sans-serif'] = ['SimHei'] # Used to display Chinese labels normally

    plt.rcParams['axes.unicode_minus'] = False # Used to display negative signs normally

    for i in range(a):

        Sb = pd.read_excel('../data/附件1/' + filename[i], '设备数据', index_col=None)  # equipment data

        Zb = pd.read_excel('../data/附件1/' + filename[i], '周波数据', index_col=None)  # cycle data

        Xb = pd.read_excel('../data/附件1/' + filename[i], '谐波数据', index_col=None)  # harmonic data (sheet name assumed to follow the same pattern)

        # Current trace diagram

        plt.plot(Sb['IC'])

        plt.title(save_name[i] + '-IC')

        plt.ylabel('Current (0.001A)')

        plt.show()

        # Voltage trace diagram

        plt.plot(Sb['UC'])

        plt.title(save_name[i] + '-UC')

        plt.ylabel('Voltage (0.1V)')

        plt.show()

        # Active power and total active power

        plt.plot(Sb[['PC', 'P']])

        plt.title(save_name[i] + '-P')

        plt.ylabel('Active power (0.0001kW)')

        plt.show()

        # Reactive power and total reactive power

        plt.plot(Sb[['QC', 'Q']])

        plt.title(save_name[i] + '-Q')

        plt.ylabel('Reactive power (0.0001kVar)')

        plt.show()

        # Power factor and total power factor

        plt.plot(Sb[['PFC', 'PF']])

        plt.title(save_name[i] + '-PF')

        plt.ylabel('Power factor (%)')

        plt.show()

        # Harmonic voltage

        plt.plot(Xb.loc[:, 'UC02':].T)

        plt.title(save_name[i] + '-harmonic voltage')

        plt.show()

        # Cycle data

        plt.plot(Zb.loc[:, 'IC001':].T)

        plt.title(save_name[i] + '-Cycle data')

        plt.show()

 

fun(n_filename)

2. Missing value processing

Through data exploration, it was found that the "time" attribute in the data has missing values that need to be processed. Since the missing time periods of the "time" attribute differ from one data file to another, different processing is required: records in longer missing periods are deleted, while records in shorter missing periods are imputed with the previous value (forward fill).
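Before looking at the full pipeline below, the gap-handling idea can be sketched on a toy per-second series (timestamps and values are made up for illustration, not taken from the case data): rebuild a complete one-second index, keep an empty second only when the following second has data (a short gap, which is then forward-filled), and drop runs of consecutive empty seconds (a long gap).

```python
import pandas as pd

# Toy per-second record with one short gap (17:11:01) and one long gap (17:11:03-17:11:09)
raw = pd.DataFrame({
    'time': pd.to_datetime(['2018-01-27 17:11:00', '2018-01-27 17:11:02', '2018-01-27 17:11:10']),
    'IC': [33, 34, 35],
})
helper = pd.DataFrame({'time': pd.date_range(raw['time'].min(), raw['time'].max(), freq='S')})
full = raw.merge(helper, on='time', how='outer').sort_values('time').reset_index(drop=True)

# Keep an empty second only if the next second has data (short gap); drop the rest (long gap)
keep = full['IC'].notna() | full['IC'].shift(-1).notna()
print(full[keep].ffill())
```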

Before processing missing values, the equipment data, cycle data, harmonic data, and operation record sheets in each device workbook of the training data, as well as the equipment data, cycle data, and harmonic data sheets in each device workbook of the test data, need to be extracted into independent data files. Some of the generated files are shown in Figure 4.


Figure 4 Partial results of extracting data files

The code to extract the data files is shown in Code Listing 2.

Code Listing 2 Extract data files

# Convert xlsx files into CSV files

import glob

import pandas as pd

import math

 

def file_transform(xls):

    print('A total of %s xlsx files found' % len(glob.glob(xls)))

    print('Processing...')

    for file in glob.glob(xls):  # Iterate over the xlsx files in the folder

        combine1 = pd.read_excel(file, index_col=0, sheet_name=None)  # dict of DataFrames, one per sheet

        for key in combine1:

            combine1[key].to_csv('../tmp/' + file[8: -5] + key + '.csv', encoding='utf-8')  # one CSV per sheet

    print('Processing complete')

 

xls_list = ['../data/附件1/*.xlsx', '../data/附件2/*.xlsx']

file_transform(xls_list[0])  # Process the training data

file_transform(xls_list[1])  # Process the test data

After the data files have been extracted, missing values in the extracted files are processed. Some of the files generated after processing are shown in Figure 5.


Figure 5 Partial results after missing value processing

The missing value processing code is shown in Code Listing 3.

Code Listing 3 Missing value processing

# For each data file, delete records in longer missing periods and forward-fill records in shorter missing periods

def missing_data(evi):

    print('Found %s CSV files' % len(glob.glob(evi)))

    for j in glob.glob(evi):

        fr = pd.read_csv(j, header=0, encoding='gbk')

        fr['time'] = pd.to_datetime(fr['time'])

        helper = pd.DataFrame({'time': pd.date_range(fr['time'].min(), fr['time'].max(), freq='S')})  # complete per-second time index

        fr = pd.merge(fr, helper, on='time', how='outer').sort_values('time')  # every second becomes a row

        fr = fr.reset_index(drop=True)

 

        frame = pd.DataFrame()

        for g in range(0, len(list(fr['time'])) - 1):

            if math.isnan(fr.iloc[:, 1][g + 1]) and math.isnan(fr.iloc[:, 1][g]):  # this row and the next are both empty: part of a long gap, drop it

                continue

            else:

                scop = pd.Series(fr.loc[g])

                frame = pd.concat([frame, scop], axis=1)

        frame = pd.DataFrame(frame.values.T, index=frame.columns, columns=frame.index)  # transpose back to one record per row

        frames = frame.fillna(method='ffill')  # forward-fill the remaining short gaps

        frames.to_csv(j[:-4] + '1.csv', index=False, encoding='utf-8')

    print('Processing complete')

 

evi_list = ['../tmp/附件1/*数据.csv', '../tmp/附件2/*数据.csv']

missing_data(evi_list[0])  # Process the training data

missing_data(evi_list[1])  # Process the test data

5 Attribute construction

Although the attributes were given initial processing during data preparation, too many attributes were introduced and they contain redundant information. To retain the important attributes and build an accurate, simple model, the original attributes need to be further filtered and constructed.

1. Equipment data

During data exploration it was found that the reactive power, total reactive power, active power, total active power, power factor, and total power factor differ greatly between devices and are highly discriminative. Therefore, this case selects these six attributes of the equipment data to build the identification attribute library.

After missing values are handled, each device's data changes from one table into several tables, so tables of the same type need to be merged into a single table; for example, all devices' equipment data tables are merged into one table. In addition, because one of the missing-value treatments imputes with the previous value, identical records are produced, and these duplicate records need to be removed. The resulting data table is shown in Table 1.

Table 1 Equipment data after merging and deduplication

| time | IC | UC | PC | QC | PFC | P | Q | PF | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2018/1/27 17:11 | 33 | 2212 | 10 | 65 | 137 | 10 | 65 | 137 | 0 |
| 2018/1/27 17:11 | 33 | 2212 | 10 | 66 | 143 | 10 | 66 | 143 | 0 |
| 2018/1/27 17:11 | 33 | 2213 | 10 | 65 | 143 | 10 | 65 | 143 | 0 |
| 2018/1/27 17:11 | 33 | 2211 | 10 | 66 | 135 | 10 | 66 | 135 | 0 |
| 2018/1/27 17:11 | 33 | 2211 | 10 | 66 | 141 | 10 | 66 | 141 | 0 |
| …… | …… | …… | …… | …… | …… | …… | …… | …… | …… |

The code to merge and deduplicate the equipment data is shown in Code Listing 4.

Code Listing 4 Merge and deduplicate the equipment data

import glob

import pandas as pd

import os

 

# Merge the data of the 11 devices and handle duplicate records produced by the merge

def combined_equipment(csv_name):

    # Merge

    print('Found %s CSV files' % len(glob.glob(csv_name)))

    print('Processing...')

    for i in glob.glob(csv_name):  # Iterate over the CSV files in the folder

        fr = open(i, 'rb').read()

        file_path = os.path.split(i)

        with open(file_path[0] + '/device_combine.csv', 'ab') as f:  # append each file to the combined CSV

            f.write(fr)

    print('Merge complete!')

    # Deduplicate

    df = pd.read_csv(file_path[0] + '/device_combine.csv', header=None, encoding='utf-8')

    datalist = df.drop_duplicates()  # drop the duplicate records

    datalist.to_csv(file_path[0] + '/device_combine.csv', index=False, header=0)

    print('Deduplication complete')

 

csv_list = ['../tmp/附件1/*设备数据1.csv', '../tmp/附件2/*设备数据1.csv']

combined_equipment(csv_list[0])  # Process the training data

combined_equipment(csv_list[1])  # Process the test data
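As a design note, the byte-level append in Code Listing 4 also copies each source file's header line into the combined CSV. A pandas-only sketch of the same merge-and-deduplicate step (an illustrative alternative, not the book's code, assuming the per-device CSV names produced earlier) could look like this:

```python
import glob
import pandas as pd

def combine_with_pandas(pattern, out_csv):
    # Read every per-device CSV and stack them into one table
    frames = [pd.read_csv(f, header=0, encoding='utf-8') for f in glob.glob(pattern)]
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.drop_duplicates()  # remove records duplicated by the forward fill
    combined.to_csv(out_csv, index=False, encoding='utf-8')
    return combined

# combine_with_pandas('../tmp/附件1/*设备数据1.csv', '../tmp/附件1/device_combine.csv')
```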

2. Cycle data

During data exploration it was found that the current in the cycle data fluctuates considerably over time, and the line plots of the cycle-data current differ markedly between devices. Therefore, this case selects the wave peak and wave trough as the cycle-data attributes for the identification attribute library.

Since the original cycle data does not contain peak and trough attributes, they need to be constructed. The resulting data table is shown in Table 2.

Table 2 Data generated by constructing attributes from the cycle data

| 波谷 (trough) | 波峰 (peak) |
| --- | --- |
| 344 | 1666365 |
| 362 | 1666324 |
| 301 | 1666325 |
| 314 | 1666392 |
| 254 | 1666435 |
| …… | …… |

The code to construct the cycle-data attributes is shown in Code Listing 5.

Code Listing 5 Construct attributes from the cycle data

# Compute the wave trough and wave peak of the current in the cycle data as attribute parameters

import glob

import pandas as pd

from sklearn.cluster import KMeans

import os

 

def cycle(cycle_file):

    for file in glob.glob(cycle_file):

        cycle_YD = pd.read_csv(file, header=0, encoding='utf-8')

        cycle_YD1 = cycle_YD.iloc[:, 0:128]

        models = []

        for types in range(0, len(cycle_YD1)):

            model = KMeans(n_clusters=2, random_state=10)

            model.fit(pd.DataFrame(cycle_YD1.iloc[types, 1:]))  # cluster one record's sampling points (all columns except time) into two groups

            models.append(model)

 

        # Average within each steady state: the two cluster centers approximate the trough and peak levels

        mean = pd.DataFrame()

        for model in models:

            r = pd.DataFrame(model.cluster_centers_)  # the two cluster centers

            r = r.sort_values(axis=0, ascending=True, by=[0])  # smaller center = trough, larger = peak

            mean = pd.concat([mean, r.reset_index(drop=True)], axis=1)

        mean = pd.DataFrame(mean.values.T, index=mean.columns, columns=mean.index)  # one record per row

        mean.columns = ['波谷', '波峰']  # trough, peak

        mean.index = list(cycle_YD['time'])

        mean.to_csv(file[:-9] + '波谷波峰.csv', index=False, encoding='gbk')

 

cycle_file = ['../tmp/附件1/*周波数据1.csv', '../tmp/附件2/*周波数据1.csv']

cycle(cycle_file[0])  # Process the training data

cycle(cycle_file[1])  # Process the test data

 

# Merge the per-device trough/peak files

def merge_cycle(cycles_file):

    means = pd.DataFrame()

    for files in glob.glob(cycles_file):

        mean0 = pd.read_csv(files, header=0, encoding='gbk')

        means = pd.concat([means, mean0])

    file_path = os.path.split(glob.glob(cycles_file)[0])

    means.to_csv(file_path[0] + '/zuhe.csv', index=False, encoding='gbk')

    print('Merge complete')

 

cycles_file = ['../tmp/附件1/*波谷波峰.csv', '../tmp/附件2/*波谷波峰.csv']

merge_cycle(cycles_file[0])  # Training data

merge_cycle(cycles_file[1])  # Test data
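To see why two-cluster KMeans recovers a trough level and a peak level for each cycle record, here is a minimal, self-contained illustration on a synthetic waveform (the numbers are invented; Code Listing 5 applies the same idea to each row of the real cycle data):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic "cycle" record: the signal alternates between a low and a high steady level, plus noise
rng = np.random.default_rng(0)
signal = np.concatenate([300 + rng.normal(0, 10, 64), 1666000 + rng.normal(0, 50, 64)])

model = KMeans(n_clusters=2, random_state=10)
model.fit(pd.DataFrame(signal))  # a single column, as in the case code

trough, peak = sorted(model.cluster_centers_.ravel())
print('trough ≈ %.0f, peak ≈ %.0f' % (trough, peak))
# The smaller cluster center approximates the low (trough) level, the larger one the high (peak) level.
```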

6 Model training

To identify the device type, the K-nearest neighbors (KNN) model is selected. The model is trained on the attribute library built from the constructed attributes, and the trained model is then used to identify device 1 and device 2. Building the discrimination model and identifying the device types is shown in Code Listing 6.

Code Listing 6 Build the discrimination model and identify the device types

import glob

import pandas as pd

from sklearn import neighbors

import pickle

import os

 

# Model training

def model(test_files, test_devices):

    # Training set

    zuhe = pd.read_csv('../tmp/附件1/zuhe.csv', header=0, encoding='gbk')

    device_combine = pd.read_csv('../tmp/附件1/device_combine.csv', header=0, encoding='gbk')

    train = pd.concat([zuhe, device_combine], axis=1)

    train.index = train['time'].tolist()  # Set the "time" column as the index

    train = train.drop(['PC', 'QC', 'PFC', 'time'], axis=1)

    train.to_csv('../tmp/' + 'train.csv', index=False, encoding='gbk')

    # Test set

    for test_file, test_device in zip(test_files, test_devices):

        test_bofeng = pd.read_csv(test_file, header=0, encoding='gbk')

        test_devi = pd.read_csv(test_device, header=0, encoding='gbk')

        test = pd.concat([test_bofeng, test_devi], axis=1)

        test.index = test['time'].tolist() # Set the "time" column as the index

        test = test.drop(['PC', 'QC', 'PFC', 'time'], axis=1)

 

        # K-nearest neighbors classifier

        clf = neighbors.KNeighborsClassifier(n_neighbors=6, algorithm='auto')

        clf.fit(train.drop(['label'], axis=1), train['label'])

        predicted = clf.predict(test.drop(['label'], axis=1))

        predicted = pd.DataFrame(predicted)

        file_path = os.path.split(test_file)[1]

        test.to_csv('../tmp/' + file_path[:3] + 'test.csv', encoding='gbk')

        predicted.to_csv('../tmp/' + file_path[:3] + 'predicted.csv', index=False, encoding='gbk')

        with open('../tmp/' + file_path[:3] + 'model.pkl', 'ab') as pickle_file:

            pickle.dump(clf, pickle_file)

        print(clf)

 

model(glob.glob('../tmp/附件2/*波谷波峰.csv'),

      glob.glob('../tmp/附件2/*设备数据1.csv'))

7 Performance measurement

Based on the device identification results of Code Listing 6, the model is evaluated; the results are as follows. The confusion matrix is shown in Figure 7 and the ROC curve is shown in Figure 8.

Model classification accuracy: 0.7951219512195122

Model evaluation report:

               precision    recall  f1-score   support

         0.0       1.00      0.84      0.92        64

        21.0       0.00      0.00      0.00         0

        61.0       0.00      0.00      0.00         0

        91.0       0.78      0.84      0.81        77

        92.0       0.00      0.00      0.00         5

        93.0       0.76      0.75      0.75        59

       111.0       0.00      0.00      0.00         0

 

        accuracy                                0.80        205

     macro avg       0.36      0.35      0.35       205

weighted avg       0.82      0.80      0.81       205

 

Computed AUC: 0.8682926829268293

Note: Some results here have been omitted.


Figure 7 Confusion matrix


Figure 8 ROC curve

Model evaluation is shown in Code Listing 7.

Code Listing 7 Model Evaluation

import glob

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn import metrics

from sklearn.preprocessing import label_binarize

import os

import pickle

 

# Model evaluation

def model_evaluation(model_file, test_csv, predicted_csv):

    for clf, test, predicted in zip(model_file, test_csv, predicted_csv):

        with open(clf, 'rb') as pickle_file:

            clf = pickle.load(pickle_file)

        test = pd.read_csv(test, header=0, encoding='gbk')

        predicted = pd.read_csv(predicted, header=0, encoding='gbk')

        test.columns = ['time', '波谷', '波峰', 'IC', 'UC', 'P', 'Q', 'PF', 'label']

        print('Model classification accuracy:', clf.score(test.drop(['label', 'time'], axis=1), test['label']))

        print('Model evaluation report:\n', metrics.classification_report(test['label'], predicted))

 

        confusion_matrix0 = metrics.confusion_matrix(test['label'], predicted)

        confusion_matrix = pd.DataFrame(confusion_matrix0)

        class_names = list(set(test['label']))

 

        tick_marks = range(len(class_names))

        sns.heatmap(confusion_matrix, annot=True, cmap='YlGnBu', fmt='g')

        plt.xticks(tick_marks, class_names)

        plt.yticks(tick_marks, class_names)

        plt.tight_layout()

        plt.title('Confusion matrix')

        plt.ylabel('True label')

        plt.xlabel('Predicted label')

        plt.show()

        y_binarize = label_binarize(test['label'], classes=class_names)

        predicted = label_binarize(predicted, classes=class_names)

 

        fpr, tpr, thresholds = metrics.roc_curve(y_binarize.ravel(), predicted.ravel())

        auc = metrics.auc(fpr, tpr)

        print('Computed AUC:', auc)

        # Plot the ROC curve

        plt.figure(figsize=(8, 4))

        lw = 2

        plt.plot(fpr, tpr, label='area = %0.2f' % auc)

        plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')

        plt.fill_between(fpr, tpr, alpha=0.2, color='b')

        plt.xlim([0.0, 1.0])

        plt.ylim([0.0, 1.05])

        plt.xlabel('1 - specificity')

        plt.ylabel('Sensitivity')

        plt.title('ROC curve')

        plt.legend(loc='lower right')

        plt.show()

 

model_evaluation(glob.glob('../tmp/*model.pkl'),

                 glob.glob('../tmp/*test.csv'),

                 glob.glob('../tmp/*predicted.csv'))

According to the analysis goal, the real-time power consumption needs to be calculated. It is obtained from the instantaneous current and voltage (i.e., the power) multiplied by time. The formula is as follows.

W = P × t / 3600

where W is the real-time power consumption in 0.001 kWh (i.e., Wh), P is the power in W, and t is the duration in seconds.
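As a quick sanity check of the stated units (a sketch with made-up values, not the case data): a device drawing 100 W for one second consumes 100 × 1 / 3600 ≈ 0.028 Wh, i.e., about 0.028 units of 0.001 kWh.

```python
# Illustrative unit check for the formula above (values are made up)
P = 100            # power in W
t = 1              # duration in seconds
W = P * t / 3600   # real-time power consumption in 0.001 kWh (i.e. Wh)
print(W)           # ≈ 0.0278
```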

After the real-time power consumption is calculated, the results are shown in Table 3.

Table 3 Real-time power consumption

The code to calculate the real-time power consumption is shown in Code Listing 8.

Code Listing 8 Calculate real-time power consumption

# Calculate the real-time power consumption and output the status table

def cw(test_csv, predicted_csv, test_devices):

    for test, predicted, test_device in zip(test_csv, predicted_csv, test_devices):

        # Split the prediction results by device class

        test = pd.read_csv(test, header=0, encoding='gbk')

        test.columns = ['time', '波谷', '波峰', 'IC', 'UC', 'P', 'Q', 'PF', 'label']

        test['time'] = pd.to_datetime(test['time'])

        test.index = test['time']

        predicteds = pd.read_csv(predicted, header=0, encoding='gbk')

        predicteds.columns = ['label']

        indexes = []

        class_names = list(set(test['label']))

        for j in class_names:

            index = list(predicteds.index[predicteds['label'] == j])

            indexes.append(index)

 

        # Take the first index and time point of each consecutive run

        from itertools import groupby  # used to group runs of consecutive indices

        dif_indexs = []

        time_indexes = []

        info_lists = pd.DataFrame()

        for y, z in zip(indexes, class_names):

            dif_index = []

            fun = lambda x: x[1] - x[0]  # (position, index) pairs within a consecutive run share the same difference

            for k, g in groupby(enumerate(y), fun):

                dif_list = [j for i, j in g]  # one run of consecutive indices

                if len(dif_list) > 1:

                    scop = min(dif_list)  # take the first index of the consecutive range

                else:

                    scop = dif_list[0]

                dif_index.append(scop)

            time_index = list(test.iloc[dif_index, :].index)

            time_indexes.append(time_index)

            info_list = pd.DataFrame({'时间': time_index, 'model_设备状态': [z] * len(time_index)})

            dif_indexs.append(dif_index)

            info_lists = pd.concat([info_lists, info_list])

        # Calculate the real-time power consumption and save the status table

        test_devi = pd.read_csv(test_device, header=0, encoding='gbk')

        test_devi['time'] = pd.to_datetime(test_devi['time'])

        test_devi['实时用电量'] = test_devi['P'] * 100 / 3600

        info_lists = info_lists.merge(test_devi[['time', '实时用电量']],

                                      how='inner', left_on='时间', right_on='time')

        info_lists = info_lists.sort_values(by=['时间'], ascending=True)

        info_lists = info_lists.drop(['time'], axis=1)

        file_path = os.path.split(test_device)[1]

        info_lists.to_csv('../tmp/' + file_path[:3] + '状态表.csv', index=False, encoding='gbk')

        print(info_lists)

 

cw(glob.glob('../tmp/*test.csv'),

   glob.glob('../tmp/*predicted.csv'),

   glob.glob('../tmp/附件2/*设备数据1.csv'))
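The `groupby(enumerate(...))` idiom used in Code Listing 8 to find the starting index of each run of consecutive indices can be illustrated on a small, made-up list of row indices:

```python
from itertools import groupby

# Made-up row indices predicted as one device class: three runs (2-4, 7, 10-11)
indices = [2, 3, 4, 7, 10, 11]

starts = []
for _, g in groupby(enumerate(indices), lambda x: x[1] - x[0]):
    run = [j for _, j in g]   # one run of consecutive indices
    starts.append(run[0])     # keep the first index of the run (the start of that state interval)

print(starts)  # [2, 7, 10]
```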

8 Recommendation


Official purchase link: https://item.jd.com/13814157.html

**"Python Data Mining: Introduction, Advanced and Practical Case Analysis"** is a data mining book driven by actual project cases. It can help readers who have no Python programming foundation or data mining foundation to quickly master Python data. Mining techniques, processes and methods. In terms of writing style, it is different from the traditional "combination of theory and practice" introductory books. It uses the well-known events in the field of data mining as the "Teddy Cup" Data Mining Challenge (which has been held for 10 years) and the "Teddy Cup" data analysis. Based on the Skills Competition (which has been held for 5 times) (more than 100,000 teachers and students from more than 1,500 colleges and universities participated), 11 classic competition questions were selected to integrate Python programming knowledge, data mining knowledge and industry knowledge, so that Readers can quickly master data mining methods in seven major industries including e-commerce, education, transportation, media, electric power, tourism, and manufacturing in practice.

This book is suitable both for self-study by readers with no prior background and for classroom teaching. To help readers master its content more efficiently, the book provides the following 10 additional resources:

(1) Modeling platform: a one-stop big data mining modeling platform that requires no configuration and includes a large number of case projects, so readers can practice while they learn.
(2) Video explanations: no less than 600 minutes of teaching videos on Python programming and data mining.
(3) Selected exercises: no less than 60 carefully selected data mining exercises with detailed answers, for checking knowledge blind spots.
(4) Author Q&A: questions that arise during study can be photographed from the paper book and sent to the author with one click through the "Tree Hole" mini program.
(5) Data files: data files for each case, combined with engineering practice and ready to use out of the box.
(6) Program code: electronic files of the code in the book and installation packages of the related tools; the code can be imported into the platform and run directly.
(7) Teaching courseware: supporting PPT courseware; teachers who adopt the book as a textbook can apply for it to save lesson preparation time.
(8) Model services: no less than 10 data mining models, each with a complete case implementation process, to help improve practical data mining skills.
(9) Teaching platform: Teddy Technology provides a one-stop data-oriented teaching platform for the additional resources of this book, with a detailed operation guide.
(10) Employment recommendations: a large number of job referral opportunities through cooperation with more than 1,500 companies, including well-known companies such as Huawei, JD.com, and Midea.








By studying this book, readers can understand the principles of data mining, quickly master the relevant operations of big data technology, and lay a good technical foundation for subsequent data analysis, data mining, and deep learning practices and competitions.


9 How to participate

  • 2 books will be given away this time
  • Activity time: until 2023-11-02
  • How to participate: follow the blogger, then like, favorite, and comment on this post
    PS: Winners will be chosen from comments of more than 20 words, based on the number of likes the comments receive
  • One more book will be added if the post exceeds 2k views (books are ultimately given out according to the view count; if the target is not reached, they will be given out according to the actual count)
    PS: The winner list will be announced in the fan group and in the comment section after the event ends


Origin blog.csdn.net/weixin_55756734/article/details/134028780