Feature Selection Techniques

Agenda

  1. Introduction to Feature Selection
  2. VarianceThreshold
  3. Chi-squared stats
  4. ANOVA using f_classif
  5. Univariate Linear Regression Tests using f_regression
  6. F-score vs Mutual Information
  7. Mutual Information for discrete values
  8. Mutual Information for continuous values
  9. SelectKBest
  10. SelectPercentile
  11. SelectFromModel
  12. Recursive Feature Elimination

Introduction to Feature Selection

  • Selecting a subset of features from the dataset

  • Improves the estimator’s accuracy

  • Boosts performance on high-dimensional datasets

  • Below we will discuss univariate selection methods

  • We will also cover a recursive feature elimination method


from sklearn import feature_selection
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

VarianceThreshold

  • Drops the columns whose variance is below the configured level
  • This method is unsupervised, i.e. the target is not taken into account
  • Intuition: columns whose values are pretty much the same won’t have much impact on the target
df = pd.DataFrame({'A':['m','f','m','m','m','m','m','m'], 
              'B':[1,2,3,1,2,1,1,1], 
              'C':[1,2,3,1,2,1,1,1]})
df
A B C
0 m 1 1
1 f 2 2
2 m 3 3
3 m 1 1
4 m 2 2
5 m 1 1
6 m 1 1
7 m 1 1
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['A'] = le.fit_transform(df.A)
df
A B C
0 1 1 1
1 0 2 2
2 1 3 3
3 1 1 1
4 1 2 2
5 1 1 1
6 1 1 1
7 1 1 1
vt = feature_selection.VarianceThreshold(threshold=.2)
vt.fit_transform(df)
vt.variances_
array([0.109375, 0.5     , 0.5     ])
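  • To see which columns survive, the fitted selector’s get_support mask can be combined with the column names; a minimal sketch reusing vt and df from above
mask = vt.get_support()    # boolean mask: True where a column's variance exceeds the 0.2 threshold
print(df.columns[mask])    # B and C are kept; A (variance ~0.11) is dropped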

Chi-Square for Non-negative Features & Class

  • Feature data should be booleans or counts
  • A supervised technique for feature selection
  • The target should be discrete
df = pd.read_csv('Data/tennis.csv.txt')
for col in df.columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
df
outlook temp humidity windy play
0 2 1 0 0 0
1 2 1 0 1 0
2 0 1 0 0 1
3 1 2 0 0 1
4 1 0 1 0 1
5 1 0 1 1 0
6 0 0 1 1 1
7 2 2 0 0 0
8 2 0 1 0 1
9 1 2 1 0 1
10 2 2 1 1 1
11 0 2 0 1 1
12 0 1 1 0 1
13 1 2 0 1 0
chi2, pval = feature_selection.chi2(df.drop('play',axis=1),df.play)
chi2
array([2.02814815, 0.02222222, 1.4       , 0.53333333])
  • A higher value means the feature is more important for the target
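  • These scores are easier to read when paired with the column names; a small sketch reusing chi2 from above
chi2_scores = pd.Series(chi2, index=df.drop('play', axis=1).columns)
chi2_scores.sort_values(ascending=False)   # outlook and humidity score highest here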

ANOVA using f_classif

  • For feature variables that are continuous in nature
  • And a target variable that is discrete in nature
  • Internally, this method uses the F-statistic: the ratio of variation between the target classes to variation within them
from sklearn.datasets import load_breast_cancer
cancer_data = load_breast_cancer()
X = cancer_data.data
Y = cancer_data.target
print(X)
F, pval = feature_selection.f_classif(X,Y)
[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
np.round(F)
array([647., 118., 697., 573.,  84., 313., 534., 862.,  70.,   0., 269.,
         0., 254., 244.,   3.,  53.,  39., 113.,   0.,   3., 861., 150.,
       898., 662., 122., 304., 437., 964., 119.,  66.])
  • Each value represents the importance of a feature
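  • The scores can be tied to the dataset’s feature names for readability; a minimal sketch using F from above
anova_scores = pd.Series(F, index=cancer_data.feature_names)
anova_scores.sort_values(ascending=False).head()   # 'worst concave points' and 'worst perimeter' rank highest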

Univariate Regression Tests using f_regression

  • Linear model for testing the individual effect of each of many regressors.
  • The correlation between each feature & the target is calculated
  • The F-test captures only linear dependency
from sklearn.datasets import california_housing
house_data = california_housing.fetch_california_housing()
X,Y = house_data.data, house_data.target
F, pval = feature_selection.f_regression(X,Y)
Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to C:\Users\Administrator\scikit_learn_data
F
array([1.85565716e+04, 2.32841479e+02, 4.87757462e+02, 4.51085756e+01,
       1.25474103e+01, 1.16353421e+01, 4.38005453e+02, 4.36989761e+01])
pval
array([0.00000000e+000, 2.76186068e-052, 7.56924213e-107, 1.91258939e-011,
       3.97630785e-004, 6.48344237e-004, 2.93985929e-096, 3.92332207e-011])
  • Columns with the top F values are the selected features
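  • A quick way to pull out the top columns; a small sketch using the dataset’s feature_names attribute
top_idx = np.argsort(F)[::-1][:3]                    # indices of the three largest F statistics
print(np.array(house_data.feature_names)[top_idx])   # MedInc dominates, followed by AveRooms and Latitude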

F-score versus Mutual Information

np.random.seed(0)
X = np.random.rand(1000, 3)
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)
feature_selection.f_regression(X,y)
(array([187.42118421,  52.52357392,   0.47268298]),
 array([3.19286906e-39, 8.50243215e-13, 4.91915197e-01]))
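  • Here y depends linearly on X[:, 0] and sinusoidally on X[:, 1], while X[:, 2] is irrelevant; the F-test picks up only the linear relationship, scoring X[:, 0] far higher than X[:, 1] and X[:, 2] near zero, as the scatter plots below suggest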
plt.scatter(X[:,0],y,s=10)

[Figure: scatter plot of y versus X[:,0]]

plt.scatter(X[:,1],y,s=10)

[Figure: scatter plot of y versus X[:,1]]

Mutual Information for regression using mutual_info_regression

  • Returns a non-negative dependency score between each feature & the target (0 means independent)
  • Captures any kind of dependency, even if non-linear
  • The target is continuous in nature
feature_selection.mutual_info_regression(X,y)
array([0.31431334, 0.86235026, 0.        ])
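  • Unlike the F-test, mutual information gives the highest score to the non-linear feature X[:, 1] and assigns exactly 0 to the irrelevant X[:, 2]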

Mutual Information for classification using mutual_info_classif

  • Returns a non-negative dependency score between each feature & the target (0 means independent)
  • Captures any kind of dependency, even if non-linear
  • The target is discrete in nature
cols = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship'
        ,'race','sex','capital-gain','capital-loss','hours-per-week','native-country','Salary']
adult_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/adult.data.txt', names=cols)
cat_cols = list(adult_data.select_dtypes('object').columns)
cat_cols.remove('Salary')
from sklearn.preprocessing import LabelEncoder
for col in cat_cols:
    le = LabelEncoder()
    adult_data[col]  = le.fit_transform(adult_data[col])
adult_data
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country Salary
0 39 7 77516 9 13 4 1 1 4 1 2174 0 40 39 <=50K
1 50 6 83311 9 13 2 4 0 4 1 0 0 13 39 <=50K
2 38 4 215646 11 9 0 6 1 4 1 0 0 40 39 <=50K
3 53 4 234721 1 7 2 6 0 2 1 0 0 40 39 <=50K
4 28 4 338409 9 13 2 10 5 2 0 0 0 40 5 <=50K
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32556 27 4 257302 7 12 2 13 5 4 0 0 0 38 39 <=50K
32557 40 4 154374 11 9 2 7 0 4 1 0 0 40 39 >50K
32558 58 4 151910 11 9 6 1 4 4 0 0 0 40 39 <=50K
32559 22 4 201490 11 9 4 1 3 4 1 0 0 20 39 <=50K
32560 52 5 287927 11 9 2 4 5 4 0 15024 0 40 39 >50K

32561 rows × 15 columns

adult_data.Salary.value_counts()
 <=50K    24720
 >50K      7841
Name: Salary, dtype: int64

Important

#adult_data.Salary[adult_data.Salary=='<=50K']=0;
#adult_data.Salary[adult_data.Salary=='>50K']=1;
#adult_data

#adult_data.loc[adult_data['Salary']=='<=50K','Salary']=0
#adult_data.loc[adult_data['Salary']=='>50K','Salary']=1
#adult_data
#adult_data['Salary']=adult_data['Salary'].map({'<=50K':0,'>50K':1})
adult_data['Salary'] = np.where(adult_data['Salary']=='<=50K',0,1)
d:\Anaconda3\lib\site-packages\pandas\core\ops\__init__.py:1115: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  result = method(y)
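  • The FutureWarning above hints that the comparison never matched: in the raw file the labels carry a leading space (' <=50K' / ' >50K', visible in the value_counts output), so np.where sends every row to the else branch and, as the table below shows, the whole Salary column becomes 1. A safer sketch (assuming the labels only differ by surrounding whitespace) strips the strings and maps them, replacing the np.where call above
adult_data['Salary'] = adult_data['Salary'].str.strip().map({'<=50K': 0, '>50K': 1})   # strip the leading space, then encode <=50K as 0 and >50K as 1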
adult_data
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country Salary
0 39 7 77516 9 13 4 1 1 4 1 2174 0 40 39 1
1 50 6 83311 9 13 2 4 0 4 1 0 0 13 39 1
2 38 4 215646 11 9 0 6 1 4 1 0 0 40 39 1
3 53 4 234721 1 7 2 6 0 2 1 0 0 40 39 1
4 28 4 338409 9 13 2 10 5 2 0 0 0 40 5 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32556 27 4 257302 7 12 2 13 5 4 0 0 0 38 39 1
32557 40 4 154374 11 9 2 7 0 4 1 0 0 40 39 1
32558 58 4 151910 11 9 6 1 4 4 0 0 0 40 39 1
32559 22 4 201490 11 9 4 1 3 4 1 0 0 20 39 1
32560 52 5 287927 11 9 2 4 5 4 0 15024 0 40 39 1

32561 rows × 15 columns

feature_selection.mutual_info_classif(adult_data, adult_data.Salary)
array([1.07490556e-04, 3.88501582e-03, 0.00000000e+00, 1.45880041e-03,
       1.44344461e-03, 2.62584073e-03, 7.52433893e-04, 1.38202144e-03,
       6.12696170e-03, 3.51647677e-03, 3.07115875e-05, 4.60673812e-05,
       2.27265747e-03, 7.41684838e-03, 7.55505052e-03])
adult_data.columns
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'Salary'],
      dtype='object')
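  • The scores come back as a bare array, so pairing them with the column names makes them easier to read; note that the call above left Salary itself inside the feature matrix, so a minimal sketch drops the target first
mi_scores = feature_selection.mutual_info_classif(adult_data.drop('Salary', axis=1), adult_data.Salary)
pd.Series(mi_scores, index=adult_data.drop('Salary', axis=1).columns).sort_values(ascending=False)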

SelectKBest

  • SelectKBest returns the K most important features based on the above techniques
  • Based on its score_func configuration, it can use mutual information, ANOVA, or regression-based scores
adult_data.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country Salary
0 39 7 77516 9 13 4 1 1 4 1 2174 0 40 39 1
1 50 6 83311 9 13 2 4 0 4 1 0 0 13 39 1
2 38 4 215646 11 9 0 6 1 4 1 0 0 40 39 1
3 53 4 234721 1 7 2 6 0 2 1 0 0 40 39 1
4 28 4 338409 9 13 2 10 5 2 0 0 0 40 5 1
selector = feature_selection.SelectKBest(k=7, score_func=feature_selection.f_classif)
data = selector.fit_transform(adult_data.drop('Salary',axis=1),adult_data.Salary)
selector.scores_
d:\Anaconda3\lib\site-packages\sklearn\feature_selection\univariate_selection.py:109: RuntimeWarning: invalid value encountered in true_divide
  msb = ssbn / float(dfbn)

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan])
  • The all-NaN scores are what f_classif returns when the target contains only a single class (exactly what the broken Salary mapping above produced); after re-running with a correctly encoded target, the scores come out as below
selector.scores_
array([1.88670731e+03, 8.69361605e+01, 2.91559359e+00, 2.06129509e+02,
       4.12009578e+03, 1.34685178e+03, 1.86500322e+02, 2.18764583e+03,
       1.68934788e+02, 1.59310791e+03, 1.70915006e+03, 7.54830452e+02,
       1.81338628e+03, 8.17155711e+00])
data[0]
array([   1,    4,    1, 2174,    0,   40,   39], dtype=int64)
selector = feature_selection.SelectKBest(k=7, score_func=feature_selection.mutual_info_classif)
data = selector.fit_transform(adult_data.drop('Salary',axis=1),adult_data.Salary)
selector.scores_
array([1.07490556e-04, 3.62396732e-03, 0.00000000e+00, 1.41273302e-03,
       1.53557937e-03, 2.44157120e-03, 7.21722306e-04, 9.98126593e-04,
       6.26516385e-03, 3.28613986e-03, 6.14231750e-05, 1.53557937e-05,
       2.02696477e-03, 6.77190504e-03])
data[0]
array([ 7, 13,  4,  4,  1, 40, 39], dtype=int64)
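  • To map the selected columns back to their names, the fitted selector’s get_support mask can be applied to the column index; a small sketch
selected_cols = adult_data.drop('Salary', axis=1).columns[selector.get_support()]
print(selected_cols)   # the 7 columns kept by the mutual-information scorer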

SelectPercentile

  • Selects the top features whose importances fall within the configured percentile
  • The default is the top 10 percent
selector = feature_selection.SelectPercentile(percentile=20, score_func=feature_selection.mutual_info_classif)
data = selector.fit_transform(adult_data.drop('Salary',axis=1),adult_data.Salary)
data[:5]
array([[ 7,  4, 39],
       [ 6,  4, 39],
       [ 4,  4, 39],
       [ 4,  2, 39],
       [ 4,  2,  5]], dtype=int64)
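  • With 14 candidate features, percentile=20 keeps roughly the top 3, which is why the transformed data has 3 columns; comparing the values with the earlier table, these appear to be workclass, race and native-country, and the exact mask can be confirmed with selector.get_support() as before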

SelectFromModel

  • Selecting important features based on model weights
  • The estimator should expose a feature_importances_ or coef_ attribute
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

boston = load_boston()
clf = LinearRegression()
sfm = feature_selection.SelectFromModel(clf, threshold=0.25)
sfm.fit_transform(boston.data, boston.target).shape
(506, 7)
boston.data.shape
(506, 13)
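  • The same idea works with tree-based estimators, which expose feature_importances_; a minimal sketch on the same Boston data (the RandomForestRegressor and the threshold='median' setting are illustrative choices, not part of the original example)
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
sfm = feature_selection.SelectFromModel(rf, threshold='median')   # keep features whose importance is above the median
sfm.fit_transform(boston.data, boston.target).shape               # roughly half of the 13 columns survive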

Recursive Feature Elimination

  • Uses an external estimator to calculate the weights of features
  • First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute.
  • Then, the least important features are pruned from the current set of features.
  • That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_regression(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=5, step=1)
data = selector.fit_transform(X, y)
X.shape
data.shape
selector.ranking_
array([1, 1, 4, 3, 1, 6, 1, 2, 5, 1])
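  • A ranking of 1 marks a selected feature; the boolean selection mask is also available directly, as in this small sketch reusing the fitted selector
print(selector.support_)               # True for the 5 features RFE kept
print(np.where(selector.support_)[0])  # their column indices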