錬金術:SPICの2020風力タービン異常データの識別とクリーニングのベースラインf1 = 0.858シェアリング

質問アドレス:

https://www.datafountain.cn/competitions/451

チャレンジタスク:

タイムスタンプ情報、風速情報、電力情報などを含む、1年10分間隔の12の風力タービンの提供されたSCADA運用データに基づいて、機械学習関連技術を使用して、堅牢な風力タービン異常データ検出モデルが確立され、識別および除去されますデータ品質を改善するための潜在的に異常なデータ。
このタスクは異常なデータラベルを与えず、クラスタリングタスクと見なされます。競争のニーズに一致するようにプレーヤーをガイドするために、異常なデータの定義について簡単に説明します。風力タービンの運転プロセスと設計運転条件との間に大きな偏差が発生すると、異常なデータが生成されます。たとえば、風速計の異常な風測定により、収集された電力散布点が設計風力から大幅にずれます。

データ紹介:

https://www.datafountain.cn/competitions/451/datasets

以下は、オンラインのf1 0.858プランです。

#!/usr/bin/env python
# coding: utf-8
import pandas as pd
import numpy as np
from tqdm import tqdm
from matplotlib import pyplot as plt

data_df = pd.read_csv('../data/dataset.csv')
fan_info = pd.read_csv('../data/12faninfo.csv', names=["WindNumber", "fan_diam", "rated_power", "speed_in", "speed_out", "speed_min", "speed_max", "speed_range"])

data_fan_df = data_df.merge(fan_info, on='WindNumber')
data_fan_df['label'] = 0

#异常值
#三列值小于0
data_fan_df.loc[(data_fan_df['WindSpeed'] < 0) ,'label'] = 1
data_fan_df.loc[(data_fan_df['Power'] < 0    ) ,'label'] = 1
data_fan_df.loc[(data_fan_df['RotorSpeed'] < 0),'label'] = 1
#风速小于切入,功率大于0
data_fan_df.loc[(data_fan_df['WindSpeed'] < data_fan_df['speed_in'])  & (data_fan_df['Power'] > 0 ) ,'label'] = 1
#风速大于切入,功率小于等于0
data_fan_df.loc[(data_fan_df['WindSpeed'] >= data_fan_df['speed_in']) & (data_fan_df['Power'] <= 0) ,'label'] = 1
#风速大于切除,功率大于0
data_fan_df.loc[(data_fan_df['WindSpeed'] > data_fan_df['speed_out']) & (data_fan_df['Power'] > 0)  ,'label'] = 1

#功率大于额定功率1.2倍
data_fan_df.loc[ data_fan_df['Power'] > 1.2*data_fan_df['rated_power'] ,'label'] = 1
#风轮异常
data_fan_df.loc[ data_fan_df['RotorSpeed'] > 1.2*data_fan_df['speed_max'] ,'label'] = 1
data_fan_df.loc[ data_fan_df['RotorSpeed'] < 0.8*data_fan_df['speed_min'] ,'label'] = 1

data_fan_df_cp = data_fan_df.reset_index()
data_fan_df_cp = data_fan_df.reset_index()
data_fan_df_filter = data_fan_df_cp[data_fan_df['label']==0].copy()

for corr_col in tqdm( ['Power','RotorSpeed','WindSpeed' ] ):
    for cut_col in ['Power','RotorSpeed','WindSpeed' ]:
        if corr_col != cut_col:
            data_df = data_fan_df_filter[data_fan_df_filter['label']==0].copy()
            cut_label = pd.DataFrame()
            cut_col_name = cut_col   + '_bin'
            corr_col_name = corr_col + 'Qt75'
            low_ratio = 0.5
            hig_ratio = 1.5

            for i in data_df['WindNumber'].unique():
                temp_data_df = data_df[data_df['WindNumber']==i].copy()
                cut_bin   = int( data_fan_df_filter[cut_col].max() ) * 5
                print('cut col and bin is :', cut_col, cut_bin, 'temp data rows:', temp_data_df.shape[0])
                temp_data_df[cut_col_name] = pd.cut(temp_data_df[cut_col], cut_bin, duplicates='drop').apply(lambda x:x.left)

                PowerQt75 = pd.DataFrame( (temp_data_df.groupby([cut_col_name])[corr_col].quantile(0.75)) )
                PowerQt75.columns = [corr_col_name]

                temp_data_df = temp_data_df.merge(PowerQt75, on=cut_col_name)
                temp_x = temp_data_df[cut_col_name]
                temp_y = temp_data_df[corr_col]
                temp_data_df.loc[ (temp_data_df[corr_col] < low_ratio*temp_data_df[corr_col_name]) | (temp_data_df[corr_col] > hig_ratio*temp_data_df[corr_col_name]),'label'] = 1

                cut_label = pd.concat([cut_label, temp_data_df[['index','label']].copy() ], axis=0)
                temp_data_df = temp_data_df[ (temp_data_df[corr_col] < low_ratio*temp_data_df[corr_col_name]) | (temp_data_df[corr_col] > hig_ratio*temp_data_df[corr_col_name])]
                print(temp_data_df.shape)
                #分别画异常点、所有点散点图
                plt.scatter(temp_data_df[cut_col_name], temp_data_df[corr_col])
                plt.scatter(temp_x, temp_y, alpha=0.1)
                plt.xlabel(cut_col)
                plt.ylabel(corr_col)
                plt.show()
            cut_label.columns = ['index', 'labelcut']
            data_fan_df_filter = data_fan_df_filter.merge(cut_label, how='outer', on='index')
            data_fan_df_filter['labelcut'] = data_fan_df_filter['labelcut'].fillna(0)
            data_fan_df_filter['label'] = data_fan_df_filter['label'] + data_fan_df_filter['labelcut']
            data_fan_df_filter['label'] = data_fan_df_filter['label'].replace(2,1)
            data_fan_df_filter.drop(['labelcut'], axis=1,inplace=True)

filter_res  = data_fan_df_filter[['index', 'label']].copy()
filter_res.columns = ['index', 'labelres']
data_fan_df_cp = data_fan_df_cp.merge(filter_res, how='outer', on='index')
data_fan_df_cp['labelres'] = data_fan_df_cp['labelres'].fillna(1)
data_fan_df_cp['label'] = data_fan_df_cp['label'] + data_fan_df_cp['labelres']
data_fan_df_cp['label'] = data_fan_df_cp['label'].replace(2,1)
data_fan_df_cp.drop(['labelres'], axis=1,inplace=True)

data_sub = data_fan_df_cp[['WindNumber','Time','label']].copy()
data_sub.to_csv('../subs/sub_6_cut_low0.5_hight1.5.csv',index=None)

上記の提出結果は、オンラインで約0.85824です。

おすすめ

転載: blog.csdn.net/lanxuxml/article/details/108714621