[Data Mining] Heartbeat Signal Classification: My_Task3 Feature Engineering

3.1 Learning objectives

  • Learn feature preprocessing methods for time-series data
  • Learn to use Tsfresh (TimeSeries Fresh), a feature-processing tool for time series

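3.2 Overview

Data preprocessing

  • Reshaping the time-series data format
  • Adding the time-step feature time

Feature engineering

  • Constructing time-series features
  • Feature selection
  • Using tsfresh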

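3.3 Code examples

3.3.1 Importing packages and reading the data

Tsfresh is a feature-engineering tool for time-series data that can automatically extract more than 100 features from a time series.
The package includes a range of feature-extraction methods and a robust feature-selection algorithm, as well as methods for
evaluating the explanatory power and importance of these features for regression or classification tasks.
https://zhuanlan.zhihu.com/p/93310900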

# Package imports
import pandas as pd
import numpy as np
import tsfresh as tsf
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
# Read the data
data_train = pd.read_csv("train.csv")
data_test_A = pd.read_csv("testA.csv")

print(data_train.shape)
print(data_test_A.shape)
(100000, 3)
(20000, 2)

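3.3.2 Data preprocessing

  • Reshape the ECG signal from one comma-separated string per sample into one row per time step, adding a time-step feature time for each signal
  • Using reset_index() and set_index()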
train_heartbeat_df = data_train["heartbeat_signals"].str.split(",",expand=True).stack()
train_heartbeat_df
0      0      0.9912297987616655
       1      0.9435330436439665
       2      0.7646772997256593
       3      0.6185708990212999
       4      0.3796321642826237
                     ...        
99999  200                   0.0
       201                   0.0
       202                   0.0
       203                   0.0
       204                   0.0
Length: 20500000, dtype: object
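The split-then-stack step is easier to follow on a tiny frame. The following is a minimal sketch with made-up values (the toy DataFrame and its three-step signals are assumptions, not the competition data); it shows how expand=True produces one column per time step and how stack() turns those columns into an inner index level.

```python
import pandas as pd

# Toy stand-in for train.csv: two samples with three time steps each.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.0", "1.0,0.4,0.1"],
})

# expand=True gives one column per time step (columns 0, 1, 2).
wide = toy["heartbeat_signals"].str.split(",", expand=True)

# stack() pivots those columns into an inner index level, producing a
# Series indexed by (row label, time step).
long = wide.stack()

print(wide.shape)
print(long.index.tolist()[:3])
print(type(long.iloc[0]))  # the values are still strings at this point
```

Because the values are still strings after stacking, a later conversion to float is needed before any numeric feature can be computed.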
  • Reset the index; the result is now a DataFrame
train_heartbeat_df = train_heartbeat_df.reset_index()  
train_heartbeat_df
level_0 level_1 0
0 0 0 0.9912297987616655
1 0 1 0.9435330436439665
2 0 2 0.7646772997256593
3 0 3 0.6185708990212999
4 0 4 0.3796321642826237
... ... ... ...
20499995 99999 200 0.0
20499996 99999 201 0.0
20499997 99999 202 0.0
20499998 99999 203 0.0
20499999 99999 204 0.0

20500000 rows × 3 columns

  • Set level_0 as the index
train_heartbeat_df =  train_heartbeat_df.set_index("level_0")
train_heartbeat_df
level_1 0
level_0
0 0 0.9912297987616655
0 1 0.9435330436439665
0 2 0.7646772997256593
0 3 0.6185708990212999
0 4 0.3796321642826237
... ... ...
99999 200 0.0
99999 201 0.0
99999 202 0.0
99999 203 0.0
99999 204 0.0

20500000 rows × 2 columns

  • Clear the index name, so the level_0 label no longer appears above the index
train_heartbeat_df.index.name = None
train_heartbeat_df
level_1 0
0 0 0.9912297987616655
0 1 0.9435330436439665
0 2 0.7646772997256593
0 3 0.6185708990212999
0 4 0.3796321642826237
... ... ...
99999 200 0.0
99999 201 0.0
99999 202 0.0
99999 203 0.0
99999 204 0.0

20500000 rows × 2 columns

  • Rename the columns with rename(); inplace=True modifies the DataFrame directly instead of returning a new one
train_heartbeat_df.rename(columns={"level_1": "time", 0: "heartbeat_signals"}, inplace=True)
train_heartbeat_df
time heartbeat_signals
0 0 0.9912297987616655
0 1 0.9435330436439665
0 2 0.7646772997256593
0 3 0.6185708990212999
0 4 0.3796321642826237
... ... ...
99999 200 0.0
99999 201 0.0
99999 202 0.0
99999 203 0.0
99999 204 0.0

20500000 rows × 2 columns

train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)
train_heartbeat_df
time heartbeat_signals
0 0 0.991230
0 1 0.943533
0 2 0.764677
0 3 0.618571
0 4 0.379632
... ... ...
99999 200 0.000000
99999 201 0.000000
99999 202 0.000000
99999 203 0.000000
99999 204 0.000000

20500000 rows × 2 columns
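The reset_index(), set_index(), rename() and astype() steps above can also be written as a single method chain. The sketch below is one possible alternative, not the original author's code; the toy DataFrame and the temporary level name "row" are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for train.csv (made-up values).
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.0", "1.0,0.4,0.1"],
})

long_df = (
    toy["heartbeat_signals"]
    .str.split(",", expand=True)
    .stack()
    .astype(float)                    # strings -> floats
    .rename_axis(["row", "time"])     # name both index levels
    .rename("heartbeat_signals")      # name the Series itself
    .reset_index(level="time")        # move "time" into a column, keep "row" as index
)
long_df.index.name = None             # drop the temporary "row" label

print(long_df)
```

The result has the same layout as train_heartbeat_df: the original row label as the index, plus time and heartbeat_signals columns.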

  • Add the processed ECG features back to the training data, and store the label column of the training data separately
data_train_label = data_train["label"]
data_train_label
0        0.0
1        0.0
2        2.0
3        0.0
4        2.0
        ... 
99995    0.0
99996    2.0
99997    3.0
99998    2.0
99999    0.0
Name: label, Length: 100000, dtype: float64
  • Drop the label column from data_train
data_train = data_train.drop('label',axis=1)
data_train
id heartbeat_signals
0 0 0.9912297987616655,0.9435330436439665,0.764677...
1 1 0.9714822034884503,0.9289687459588268,0.572932...
2 2 1.0,0.9591487564065292,0.7013782792997189,0.23...
3 3 0.9757952826275774,0.9340884687738161,0.659636...
4 4 0.0,0.055816398940721094,0.26129357194994196,0...
... ... ...
99995 99995 1.0,0.677705342021188,0.22239242747868546,0.25...
99996 99996 0.9268571578157265,0.9063471198026871,0.636993...
99997 99997 0.9258351628306013,0.5873839035878395,0.633226...
99998 99998 1.0,0.9947621698382489,0.8297017704865509,0.45...
99999 99999 0.9259994004527861,0.916476635326053,0.4042900...

100000 rows × 2 columns

data_train = data_train.drop("heartbeat_signals", axis=1)
data_train
id
0 0
1 1
2 2
3 3
4 4
... ...
99995 99995
99996 99996
99997 99997
99998 99998
99999 99999

100000 rows × 1 columns

data_train = data_train.join(train_heartbeat_df)
data_train
id time heartbeat_signals
0 0 0 0.991230
0 0 1 0.943533
0 0 2 0.764677
0 0 3 0.618571
0 0 4 0.379632
... ... ... ...
99999 99999 200 0.000000
99999 99999 201 0.000000
99999 99999 202 0.000000
99999 99999 203 0.000000
99999 99999 204 0.000000

20500000 rows × 3 columns

data_train[data_train["id"]==1]
id time heartbeat_signals
1 1 0 0.971482
1 1 1 0.928969
1 1 2 0.572933
1 1 3 0.178457
1 1 4 0.122962
... ... ... ...
1 1 200 0.000000
1 1 201 0.000000
1 1 202 0.000000
1 1 203 0.000000
1 1 204 0.000000

205 rows × 3 columns

As shown above, the ECG feature of each sample consists of the signal values at 205 time steps.
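The join works because DataFrame.join() aligns on the index: each id row is repeated once for every signal row that carries the same index label. A small sketch with made-up data (three time steps instead of 205) illustrates this and shows one way to check that every id ends up with the same number of time steps.

```python
import pandas as pd

ids = pd.DataFrame({"id": [0, 1]})             # index labels 0 and 1
signals = pd.DataFrame(
    {
        "time": [0, 1, 2, 0, 1, 2],
        "heartbeat_signals": [0.9, 0.5, 0.0, 1.0, 0.4, 0.1],
    },
    index=[0, 0, 0, 1, 1, 1],                  # repeated row labels
)

# join() aligns on the index, so each id row is repeated per time step.
joined = ids.join(signals)

# Count the time steps per id; all counts should be identical.
steps_per_id = joined.groupby("id")["time"].count()
print(steps_per_id)
```

On the real data the same groupby would be expected to report 205 for every id.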

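3.3.3 Processing time-series features with tsfresh

1. Feature extraction
Tsfresh (TimeSeries Fresh) is a third-party Python package that automatically computes a large number of features from time-series data. It also includes methods for assessing feature importance and selecting features, so tsfresh is a good choice for feature extraction in both classification and regression problems on time-series data. Official documentation: the "Introduction" page of the tsfresh 0.17.1.dev24+g860c4e1 documentation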

# # Feature extraction
# train_features = extract_features(data_train,column_id = 'id',column_sort='time')
# train_features
  • Load previously computed features (stored as a pkl file) so that the time-consuming extraction does not have to be rerun each time
import pickle
with open("./HeartbeatClassification/train_features_file.pkl", "rb") as feature_file:
    train_features = pickle.load(feature_file)

train_features
heartbeat_signals__variance_larger_than_standard_deviation heartbeat_signals__has_duplicate_max heartbeat_signals__has_duplicate_min heartbeat_signals__has_duplicate heartbeat_signals__sum_values heartbeat_signals__abs_energy heartbeat_signals__mean_abs_change heartbeat_signals__mean_change heartbeat_signals__mean_second_derivative_central heartbeat_signals__median ... heartbeat_signals__fourier_entropy__bins_2 heartbeat_signals__fourier_entropy__bins_3 heartbeat_signals__fourier_entropy__bins_5 heartbeat_signals__fourier_entropy__bins_10 heartbeat_signals__fourier_entropy__bins_100 heartbeat_signals__permutation_entropy__dimension_3__tau_1 heartbeat_signals__permutation_entropy__dimension_4__tau_1 heartbeat_signals__permutation_entropy__dimension_5__tau_1 heartbeat_signals__permutation_entropy__dimension_6__tau_1 heartbeat_signals__permutation_entropy__dimension_7__tau_1
0 0.0 0.0 1.0 1.0 38.927945 18.216197 0.019894 -0.004859 0.000117 0.125531 ... 0.095763 0.109222 0.109222 0.356175 0.940492 1.180828 1.734917 2.184420 2.500658 2.722686
1 0.0 0.0 1.0 1.0 19.445634 7.705092 0.019952 -0.004762 0.000105 0.030481 ... 0.248333 0.409767 0.567944 0.913016 1.791964 1.360828 2.118249 2.710933 3.065802 3.224835
2 0.0 0.0 1.0 1.0 21.192974 9.140423 0.009863 -0.004902 0.000101 0.000000 ... 0.054659 0.054659 0.150231 0.204601 0.542013 0.712221 1.031064 1.263370 1.406001 1.509478
3 0.0 0.0 1.0 1.0 42.113066 15.757623 0.018743 -0.004783 0.000103 0.241397 ... 0.054659 0.109222 0.186062 0.258874 1.426345 1.389686 2.206088 2.986728 3.534354 3.854177
4 0.0 0.0 1.0 1.0 69.756786 51.229616 0.014514 0.000000 -0.000137 0.000000 ... 0.054659 0.109222 0.109222 0.163690 0.517722 1.045339 1.543338 1.914511 2.165627 2.323993
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99995 0.0 0.0 1.0 1.0 63.323449 28.742238 0.023588 -0.004902 0.000794 0.388402 ... 0.054659 0.054659 0.109222 0.109222 1.405361 1.326208 2.137411 2.873602 3.391830 3.679969
99996 0.0 0.0 1.0 1.0 69.657534 31.866323 0.017373 -0.004543 0.000051 0.421138 ... 0.095763 0.095763 0.109222 0.163690 0.749555 1.408284 2.244166 3.085504 3.728881 4.095457
99997 0.0 0.0 1.0 1.0 40.897057 16.412857 0.019470 -0.004538 0.000834 0.213306 ... 0.164224 0.186062 0.299588 0.353661 0.995174 1.305626 2.005282 2.601062 2.996962 3.293562
99998 0.0 0.0 1.0 1.0 42.333303 14.281281 0.017032 -0.004902 0.000013 0.264974 ... 0.095763 0.109222 0.163690 0.218060 1.321241 1.460980 2.387132 3.236950 3.793512 4.018302
99999 0.0 0.0 1.0 1.0 53.290117 21.637471 0.021870 -0.004539 0.000023 0.320124 ... 0.095763 0.150231 0.204601 0.463604 1.768224 1.344607 2.186286 2.949266 3.462549 3.688612

100000 rows × 779 columns
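The compute-once, load-afterwards pattern can be wrapped in a small helper. This is a hypothetical sketch (load_or_compute, expensive and the temporary cache path are illustrative names, not part of the original notebook); in practice the compute argument would wrap the extract_features call. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the object pickled at `path`; compute and save it first if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def expensive():
    # Stand-in for a slow feature-extraction run.
    calls.append(1)
    return {"features": [1.0, 2.0, 3.0]}

cache_path = os.path.join(tempfile.mkdtemp(), "features.pkl")
first = load_or_compute(cache_path, expensive)   # computes and writes the cache
second = load_or_compute(cache_path, expensive)  # reads the cache, no recompute
print(len(calls))
```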

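2. Feature selection
    train_features contains 779 common time-series features of heartbeat_signals (see the official documentation for an explanation of each). Some of these values may be NaN because the feature cannot be computed for the given data. The impute function cleans them up in place, replacing each NaN with the column median and each infinite value with the column maximum or minimum:
# Replace the NaN values in the extracted features
impute(train_features)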
heartbeat_signals__variance_larger_than_standard_deviation heartbeat_signals__has_duplicate_max heartbeat_signals__has_duplicate_min heartbeat_signals__has_duplicate heartbeat_signals__sum_values heartbeat_signals__abs_energy heartbeat_signals__mean_abs_change heartbeat_signals__mean_change heartbeat_signals__mean_second_derivative_central heartbeat_signals__median ... heartbeat_signals__fourier_entropy__bins_2 heartbeat_signals__fourier_entropy__bins_3 heartbeat_signals__fourier_entropy__bins_5 heartbeat_signals__fourier_entropy__bins_10 heartbeat_signals__fourier_entropy__bins_100 heartbeat_signals__permutation_entropy__dimension_3__tau_1 heartbeat_signals__permutation_entropy__dimension_4__tau_1 heartbeat_signals__permutation_entropy__dimension_5__tau_1 heartbeat_signals__permutation_entropy__dimension_6__tau_1 heartbeat_signals__permutation_entropy__dimension_7__tau_1
0 0.0 0.0 1.0 1.0 38.927945 18.216197 0.019894 -0.004859 0.000117 0.125531 ... 0.095763 0.109222 0.109222 0.356175 0.940492 1.180828 1.734917 2.184420 2.500658 2.722686
1 0.0 0.0 1.0 1.0 19.445634 7.705092 0.019952 -0.004762 0.000105 0.030481 ... 0.248333 0.409767 0.567944 0.913016 1.791964 1.360828 2.118249 2.710933 3.065802 3.224835
2 0.0 0.0 1.0 1.0 21.192974 9.140423 0.009863 -0.004902 0.000101 0.000000 ... 0.054659 0.054659 0.150231 0.204601 0.542013 0.712221 1.031064 1.263370 1.406001 1.509478
3 0.0 0.0 1.0 1.0 42.113066 15.757623 0.018743 -0.004783 0.000103 0.241397 ... 0.054659 0.109222 0.186062 0.258874 1.426345 1.389686 2.206088 2.986728 3.534354 3.854177
4 0.0 0.0 1.0 1.0 69.756786 51.229616 0.014514 0.000000 -0.000137 0.000000 ... 0.054659 0.109222 0.109222 0.163690 0.517722 1.045339 1.543338 1.914511 2.165627 2.323993
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99995 0.0 0.0 1.0 1.0 63.323449 28.742238 0.023588 -0.004902 0.000794 0.388402 ... 0.054659 0.054659 0.109222 0.109222 1.405361 1.326208 2.137411 2.873602 3.391830 3.679969
99996 0.0 0.0 1.0 1.0 69.657534 31.866323 0.017373 -0.004543 0.000051 0.421138 ... 0.095763 0.095763 0.109222 0.163690 0.749555 1.408284 2.244166 3.085504 3.728881 4.095457
99997 0.0 0.0 1.0 1.0 40.897057 16.412857 0.019470 -0.004538 0.000834 0.213306 ... 0.164224 0.186062 0.299588 0.353661 0.995174 1.305626 2.005282 2.601062 2.996962 3.293562
99998 0.0 0.0 1.0 1.0 42.333303 14.281281 0.017032 -0.004902 0.000013 0.264974 ... 0.095763 0.109222 0.163690 0.218060 1.321241 1.460980 2.387132 3.236950 3.793512 4.018302
99999 0.0 0.0 1.0 1.0 53.290117 21.637471 0.021870 -0.004539 0.000023 0.320124 ... 0.095763 0.150231 0.204601 0.463604 1.768224 1.344607 2.186286 2.949266 3.462549 3.688612

100000 rows × 779 columns
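To make the effect of imputation concrete, the sketch below re-implements the replacement rule in plain pandas on a toy frame. It is an approximation written for illustration (impute_like is a hypothetical helper, not tsfresh code), based on the rule described in the tsfresh documentation: NaN becomes the column median, +inf the column maximum, -inf the column minimum, and a column with no finite values is filled with zeros.

```python
import numpy as np
import pandas as pd

def impute_like(df):
    """Column-wise: NaN -> median, +inf -> max, -inf -> min of the finite values."""
    out = df.copy()
    for col in out.columns:
        values = out[col].astype(float)
        finite = values[np.isfinite(values)]
        if finite.empty:
            out[col] = 0.0            # nothing finite to learn from
            continue
        values = values.replace(np.inf, finite.max())
        values = values.replace(-np.inf, finite.min())
        out[col] = values.fillna(finite.median())
    return out

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],          # NaN -> median of [1, 3]
    "b": [np.inf, 2.0, -np.inf],      # infinities -> max / min of [2]
    "c": [np.nan, np.nan, np.nan],    # no finite values -> zeros
})
clean = impute_like(toy)
print(clean)
```

Unlike this sketch, which returns a copy, the impute call above modifies train_features in place.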

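Next, features are selected according to their relevance to the response variable. This takes two steps:

  • First, the relevance of each feature to the response variable is evaluated individually, as a p-value from a significance test
  • Then the Benjamini-Yekutieli procedure [1] is applied to decide which features are kept.
    [Figure: some common feature selection methods]
# Select features according to their relevance to the data label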
train_features_filtered = select_features(train_features,data_train_label)

train_features_filtered
heartbeat_signals__sum_values heartbeat_signals__fft_coefficient__attr_"abs"__coeff_35 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_34 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_33 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_32 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_31 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_30 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_29 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_28 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_27 ... heartbeat_signals__fft_coefficient__attr_"abs"__coeff_84 heartbeat_signals__fft_coefficient__attr_"imag"__coeff_97 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_90 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_94 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_92 heartbeat_signals__fft_coefficient__attr_"real"__coeff_97 heartbeat_signals__fft_coefficient__attr_"abs"__coeff_75 heartbeat_signals__fft_coefficient__attr_"real"__coeff_88 heartbeat_signals__fft_coefficient__attr_"real"__coeff_92 heartbeat_signals__fft_coefficient__attr_"real"__coeff_83
0 38.927945 1.168685 0.982133 1.223496 1.236300 1.104172 1.497129 1.358095 1.704225 1.745158 ... 0.531883 -0.047438 0.554370 0.307586 0.564596 0.562960 0.591859 0.504124 0.528450 0.473568
1 19.445634 1.460752 1.924501 1.925485 1.715938 2.079957 1.818636 2.490450 1.673244 2.821067 ... 0.563590 -0.109579 0.697446 0.398073 0.640969 0.270192 0.224925 0.645082 0.635135 0.297325
2 21.192974 1.787166 2.146987 1.686190 1.540137 2.291031 2.403422 1.765422 1.993213 2.756081 ... 0.712487 -0.074042 0.321703 0.390386 0.716929 0.316524 0.422077 0.722742 0.680590 0.383754
3 42.113066 2.071539 1.000340 2.728281 1.391727 2.017176 2.610492 0.747448 2.900299 1.294779 ... 0.601499 -0.184248 0.564669 0.623353 0.466980 0.651774 0.308915 0.550097 0.466904 0.494024
4 69.756786 0.653924 0.231422 1.080003 0.711244 1.357904 1.237998 1.346404 1.645870 0.941866 ... 0.015292 0.070505 0.065835 0.051780 0.092940 0.103773 0.179405 -0.089611 0.091841 0.056867
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99995 63.323449 0.417221 2.036034 1.659054 0.500584 1.693545 0.859932 1.963009 1.524831 1.344715 ... 0.779955 0.005525 0.486013 0.273372 0.705386 0.602898 0.447929 0.474844 0.564266 0.133969
99996 69.657534 1.611333 1.793044 1.092325 0.507138 1.763940 2.677643 2.640827 1.128049 0.856280 ... 0.539489 0.114670 0.579498 0.417226 0.270110 0.556596 0.703258 0.462312 0.269719 0.539236
99997 40.897057 1.190514 0.674603 1.632769 0.229008 2.027802 0.302457 2.016243 0.352602 1.836034 ... 0.282597 -0.474629 0.460647 0.478341 0.527891 0.904111 0.728529 0.178410 0.500813 0.773985
99998 42.333303 1.237608 1.325212 2.785515 1.918571 0.814167 2.613950 2.083409 1.330934 2.801509 ... 0.594252 -0.162106 0.694276 0.681025 0.357196 0.498088 0.433297 0.406154 0.324771 0.340727
99999 53.290117 0.154759 2.921164 2.183932 1.485150 2.685922 0.583443 3.101826 1.264842 2.877000 ... 0.463697 0.289364 0.285321 0.422103 0.692009 0.276236 0.245780 0.269519 0.681719 -0.053993

100000 rows × 700 columns
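The second step of the selection, the Benjamini-Yekutieli procedure, can be sketched in a few lines of NumPy. This is a standalone illustration of the procedure itself, not tsfresh's internal code; the p-values and the level alpha=0.05 are made up for the example. With m sorted p-values, the procedure finds the largest rank k such that p_(k) <= alpha * k / (m * c(m)), where c(m) = 1 + 1/2 + ... + 1/m, and rejects the k smallest.

```python
import numpy as np

def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks p-values ascending
    c_m = np.sum(1.0 / np.arange(1, m + 1))        # harmonic correction term
    thresholds = alpha * np.arange(1, m + 1) / (m * c_m)
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank that passes
        rejected[order[: k + 1]] = True            # reject it and all smaller p-values
    return rejected

# Hypothetical p-values for five features.
p_vals = [0.001, 0.8, 0.002, 0.04, 0.5]
mask = benjamini_yekutieli(p_vals, alpha=0.05)
print(mask)  # [ True False  True False False]
```

A rejected hypothesis here corresponds to a feature judged relevant, so only the first and third features would be kept in this toy case. Note that 0.04 is below 0.05 yet is not kept, because the procedure controls the false discovery rate across all tests rather than testing each feature in isolation.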

Feature engineering summary:
[Figure: feature engineering summary]

References

GitHub link

Reposted from blog.csdn.net/jcjic/article/details/115031352