Feature Selection Techniques


Agenda

Introduction to Feature Selection
VarianceThreshold
Chi-squared stats
ANOVA using f_classif
Univariate Linear Regression Tests using f_regression
F-score vs Mutual Information
Mutual Information for discrete values
Mutual Information for continuous values
SelectKBest
SelectPercentile
SelectFromModel
Recursive Feature Elimination

Introduction to Feature Selection

Feature selection means choosing a subset of the features in a dataset
It can improve an estimator's accuracy
It can boost performance on high-dimensional datasets
Below we discuss univariate selection methods
We also cover feature elimination methods

from sklearn import feature_selection
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

VarianceThreshold

Drop the columns whose variance is below a configured threshold
This method is unsupervised, i.e. the target is not taken into account
Intuition: columns whose values are pretty much the same everywhere won't have much impact on the target

df = pd.DataFrame({'A':['m','f','m','m','m','m','m','m'], 
              'B':[1,2,3,1,2,1,1,1], 
              'C':[1,2,3,1,2,1,1,1]})

df

	A	B	C
0	m	1	1
1	f	2	2
2	m	3	3
3	m	1	1
4	m	2	2
5	m	1	1
6	m	1	1
7	m	1	1

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['A'] = le.fit_transform(df.A)
df

	A	B	C
0	1	1	1
1	0	2	2
2	1	3	3
3	1	1	1
4	1	2	2
5	1	1	1
6	1	1	1
7	1	1	1

vt = feature_selection.VarianceThreshold(threshold=.2)

vt.fit_transform(df)
vt.variances_

array([0.109375, 0.5     , 0.5     ])
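To see which columns survive the threshold, the fitted selector's mask can be mapped back to column names. This is a minimal sketch that rebuilds the toy frame above with column A already label-encoded; the variable names kept and reduced are only illustrative.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Same toy frame as above, with column A already label-encoded (m -> 1, f -> 0).
df = pd.DataFrame({'A': [1, 0, 1, 1, 1, 1, 1, 1],
                   'B': [1, 2, 3, 1, 2, 1, 1, 1],
                   'C': [1, 2, 3, 1, 2, 1, 1, 1]})

vt = VarianceThreshold(threshold=0.2)
reduced = vt.fit_transform(df)

# get_support() returns a boolean mask over the input columns.
kept = df.columns[vt.get_support()].tolist()
print(kept)            # ['B', 'C'] : A has variance 0.109375, below 0.2
print(reduced.shape)   # (8, 2)
```

Column A is dropped because 7 of its 8 values are identical, which is exactly the low-variance case the method targets.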

Chi-Square for Non-negative Features and Class

Feature values should be booleans or counts (non-negative)
Supervised technique for feature selection
Target should be discrete

df = pd.read_csv('Data/tennis.csv.txt')

for col in df.columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])

df

	outlook	temp	humidity	windy	play
0	2	1	0	0	0
1	2	1	0	1	0
2	0	1	0	0	1
3	1	2	0	0	1
4	1	0	1	0	1
5	1	0	1	1	0
6	0	0	1	1	1
7	2	2	0	0	0
8	2	0	1	0	1
9	1	2	1	0	1
10	2	2	1	1	1
11	0	2	0	1	1
12	0	1	1	0	1
13	1	2	0	1	0

chi2, pval = feature_selection.chi2(df.drop('play',axis=1),df.play)

chi2

array([2.02814815, 0.02222222, 1.4       , 0.53333333])

A higher value means the feature is more important for the target
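The scores are easier to read when paired with their column names. The sketch below rebuilds the label-encoded tennis table inline (so it does not depend on the CSV file) and ranks the features; the name ranking is just an illustrative variable. Note that chi2 treats each value as a count, so scores on label-encoded categories depend on the arbitrary integer codes and should be read with some caution.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Label-encoded tennis data, copied from the table above.
df = pd.DataFrame({
    'outlook':  [2, 2, 0, 1, 1, 1, 0, 2, 2, 1, 2, 0, 0, 1],
    'temp':     [1, 1, 1, 2, 0, 0, 0, 2, 0, 2, 2, 2, 1, 2],
    'humidity': [0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0],
    'windy':    [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],
    'play':     [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0],
})

X = df.drop('play', axis=1)
scores, pvals = chi2(X, df['play'])

# Pair each score with its column and sort from most to least important.
ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(ranking)
```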

ANOVA using f_classif

For feature variables that are continuous in nature
And a target variable that is discrete in nature
Internally, for each feature, this method uses the ratio of the variation between the class means to the variation within each class

from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
X = cancer_data.data
Y = cancer_data.target
print(X)
F, pval = feature_selection.f_classif(X,Y)

[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]

np.round(F)

array([647., 118., 697., 573.,  84., 313., 534., 862.,  70.,   0., 269.,
         0., 254., 244.,   3.,  53.,  39., 113.,   0.,   3., 861., 150.,
       898., 662., 122., 304., 437., 964., 119.,  66.])

Each value is the F-statistic of one feature; a larger value indicates a stronger relationship with the target
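The F-statistic can be reproduced by hand as the between-class mean square divided by the within-class mean square. The following is a rough sketch on synthetic data (the helper anova_f and the data are made up for illustration, not part of scikit-learn), checked against f_classif.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def anova_f(x, y):
    """One-way ANOVA F-statistic of a single feature x against class labels y."""
    classes = np.unique(y)
    grand_mean = x.mean()
    # Variation of the class means around the grand mean.
    ss_between = sum(len(x[y == c]) * (x[y == c].mean() - grand_mean) ** 2
                     for c in classes)
    # Variation of the samples around their own class mean.
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    df_between = len(classes) - 1
    df_within = len(x) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

rng = np.random.RandomState(0)
y = np.repeat([0, 1], 50)
# Feature 0 shifts with the class; feature 1 is pure noise.
X = np.column_stack([rng.normal(loc=y * 2.0, scale=1.0), rng.normal(size=100)])

F, pval = f_classif(X, y)
manual = np.array([anova_f(X[:, j], y) for j in range(X.shape[1])])
print(F, manual)
```

The class-dependent feature gets a far larger F value than the noise feature, and the manual values agree with what f_classif returns.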

Univariate Regression Test using f_regression

Linear model for testing the individual effect of each of many regressors
The correlation between each feature and the target is calculated
The F-test captures only linear dependency

from sklearn.datasets import fetch_california_housing
house_data = fetch_california_housing()
X,Y = house_data.data, house_data.target
F, pval = feature_selection.f_regression(X,Y)
F

Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to C:\Users\Administrator\scikit_learn_data

array([1.85565716e+04, 2.32841479e+02, 4.87757462e+02, 4.51085756e+01,
       1.25474103e+01, 1.16353421e+01, 4.38005453e+02, 4.36989761e+01])

pval

array([0.00000000e+000, 2.76186068e-052, 7.56924213e-107, 1.91258939e-011,
       3.97630785e-004, 6.48344237e-004, 2.93985929e-096, 3.92332207e-011])

The columns with the highest F values are the ones to select
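The link between correlation and the F value can be checked directly: with centering (the default), the score is r² / (1 - r²) multiplied by n - 2, where r is the Pearson correlation between a feature and the target. A small sketch on synthetic data, where the data and the name F_manual are assumptions for illustration:

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
# Feature 0 has a strong linear effect, feature 1 a weak one, feature 2 none.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(200)

F, pval = f_regression(X, y)

# Pearson correlation of each feature with the target, turned into an F value.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
F_manual = r ** 2 / (1 - r ** 2) * (len(y) - 2)
print(F, F_manual)
```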

F-score versus Mutual Information

np.random.seed(0)
X = np.random.rand(1000, 3)
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)

feature_selection.f_regression(X,y)

(array([187.42118421,  52.52357392,   0.47268298]),
 array([3.19286906e-39, 8.50243215e-13, 4.91915197e-01]))

plt.scatter(X[:,0],y,s=10)


plt.scatter(X[:,1],y,s=10)


Mutual Information for regression using mutual_info_regression

Returns a non-negative dependency score between each feature and the target; 0 means independent, higher means more dependent
Captures any kind of dependency, even if non-linear
Target is continuous in nature

feature_selection.mutual_info_regression(X,y)

array([0.31431334, 0.86235026, 0.        ])
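Putting the two scores side by side makes the difference concrete. This sketch regenerates the same synthetic data and compares which feature each method ranks first; passing random_state=0 is an added assumption to make the mutual information estimate repeatable.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

np.random.seed(0)
X = np.random.rand(1000, 3)
# Feature 0 enters linearly, feature 1 through a sine wave, feature 2 not at all.
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)

F, _ = f_regression(X, y)
mi = mutual_info_regression(X, y, random_state=0)

print("top feature by F-test:", int(np.argmax(F)))
print("top feature by MI:    ", int(np.argmax(mi)))
```

The F-test favours the linear feature, while mutual information favours the sine-shaped one, since it is not limited to linear relationships.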

Mutual Information for classification using mutual_info_classif

Returns a non-negative dependency score between each feature and the target; 0 means independent, higher means more dependent
Captures any kind of dependency, even if non-linear
Target is discrete in nature

cols = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship'
        ,'race','sex','capital-gain','capital-loss','hours-per-week','native-country','Salary']
adult_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/adult.data.txt', names=cols)

cat_cols = list(adult_data.select_dtypes('object').columns)

cat_cols.remove('Salary')

from sklearn.preprocessing import LabelEncoder

for col in cat_cols:
    le = LabelEncoder()
    adult_data[col]  = le.fit_transform(adult_data[col])
adult_data

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	Salary
0	39	7	77516	9	13	4	1	1	4	1	2174	0	40	39	<=50K
1	50	6	83311	9	13	2	4	0	4	1	0	0	13	39	<=50K
2	38	4	215646	11	9	0	6	1	4	1	0	0	40	39	<=50K
3	53	4	234721	1	7	2	6	0	2	1	0	0	40	39	<=50K
4	28	4	338409	9	13	2	10	5	2	0	0	0	40	5	<=50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32556	27	4	257302	7	12	2	13	5	4	0	0	0	38	39	<=50K
32557	40	4	154374	11	9	2	7	0	4	1	0	0	40	39	>50K
32558	58	4	151910	11	9	6	1	4	4	0	0	0	40	39	<=50K
32559	22	4	201490	11	9	4	1	3	4	1	0	0	20	39	<=50K
32560	52	5	287927	11	9	2	4	5	4	0	15024	0	40	39	>50K

32561 rows × 15 columns

adult_data.Salary.value_counts()

 <=50K    24720
 >50K      7841
Name: Salary, dtype: int64

Important: the raw Salary strings carry a leading space (see the value counts above), so strip them before comparing

# Encode the target: '<=50K' -> 0, '>50K' -> 1
# Equivalent: adult_data['Salary'].str.strip().map({'<=50K': 0, '>50K': 1})
adult_data['Salary'] = np.where(adult_data['Salary'].str.strip()=='<=50K',0,1)

adult_data

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	Salary
0	39	7	77516	9	13	4	1	1	4	1	2174	0	40	39	0
1	50	6	83311	9	13	2	4	0	4	1	0	0	13	39	0
2	38	4	215646	11	9	0	6	1	4	1	0	0	40	39	0
3	53	4	234721	1	7	2	6	0	2	1	0	0	40	39	0
4	28	4	338409	9	13	2	10	5	2	0	0	0	40	5	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
32556	27	4	257302	7	12	2	13	5	4	0	0	0	38	39	0
32557	40	4	154374	11	9	2	7	0	4	1	0	0	40	39	1
32558	58	4	151910	11	9	6	1	4	4	0	0	0	40	39	0
32559	22	4	201490	11	9	4	1	3	4	1	0	0	20	39	0
32560	52	5	287927	11	9	2	4	5	4	0	15024	0	40	39	1

32561 rows × 15 columns

features = adult_data.drop('Salary', axis=1)  # keep the target out of the feature matrix
feature_selection.mutual_info_classif(features, adult_data.Salary)

features.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')
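To get a feel for the scale of the scores, here is a small sketch on synthetic discrete data (the three features are made up for illustration). With discrete_features=True the estimate is reported in nats.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
y = rng.randint(0, 4, size=2000)           # four-class target

X = np.column_stack([
    y,                                      # exact copy of the target
    y % 2,                                  # keeps only part of the information
    rng.randint(0, 4, size=2000),           # unrelated noise
])

mi = mutual_info_classif(X, y, discrete_features=True)
print(mi)
```

The copy of the target scores close to ln 4, the partial feature close to ln 2, and the noise feature close to 0.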

SelectKBest

SelectKBest keeps the K highest-scoring features, scored with one of the techniques above
Depending on score_func, it can use mutual information, ANOVA, or regression-based tests


selector = feature_selection.SelectKBest(k=7, score_func=feature_selection.f_classif)

data = selector.fit_transform(adult_data.drop('Salary',axis=1),adult_data.Salary)
selector.scores_


array([1.88670731e+03, 8.69361605e+01, 2.91559359e+00, 2.06129509e+02,
       4.12009578e+03, 1.34685178e+03, 1.86500322e+02, 2.18764583e+03,
       1.68934788e+02, 1.59310791e+03, 1.70915006e+03, 7.54830452e+02,
       1.81338628e+03, 8.17155711e+00])

data[0]

array([  39,   13,    4,    1,    1, 2174,   40], dtype=int64)

selector = feature_selection.SelectKBest(k=7, score_func=feature_selection.mutual_info_classif)

data = selector.fit_transform(adult_data.drop('Salary',axis=1),adult_data.Salary)

selector.scores_

array([1.07490556e-04, 3.62396732e-03, 0.00000000e+00, 1.41273302e-03,
       1.53557937e-03, 2.44157120e-03, 7.21722306e-04, 9.98126593e-04,
       6.26516385e-03, 3.28613986e-03, 6.14231750e-05, 1.53557937e-05,
       2.02696477e-03, 6.77190504e-03])

data[0]

array([ 7, 13,  4,  4,  1, 40, 39], dtype=int64)
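The selected columns can be recovered by name with get_support. The sketch below uses the breast cancer dataset bundled with scikit-learn so that it runs without any download; k=5 is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()

selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(data.data, data.target)

# Indices of the kept columns, in their original order.
selected = selector.get_support(indices=True)
names = [data.feature_names[i] for i in selected]
print(X_new.shape)
print(names)
```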

SelectPercentile

Selects the top-scoring features that fall within the configured percentile
The default keeps the top 10 percent of features

selector = feature_selection.SelectPercentile(percentile=20, score_func=feature_selection.mutual_info_classif)
data = selector.fit_transform(adult_data.drop('Salary',axis=1),adult_data.Salary)
data[:5]

array([[ 7,  4, 39],
       [ 6,  4, 39],
       [ 4,  4, 39],
       [ 4,  2, 39],
       [ 4,  2,  5]], dtype=int64)
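As a rough check of how the percentile translates into a feature count, the sketch below applies the default percentile to the 30-feature breast cancer dataset bundled with scikit-learn, with f_classif as the scoring function (both choices are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile, f_classif

data = load_breast_cancer()

# percentile=10 on 30 features keeps the 3 highest-scoring ones.
selector = SelectPercentile(score_func=f_classif, percentile=10)
X_new = selector.fit_transform(data.data, data.target)

selected = selector.get_support(indices=True)
top3 = np.argsort(selector.scores_)[-3:]
print(X_new.shape)
print(sorted(selected), sorted(top3))
```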

SelectFromModel

Selects important features based on the weights of a fitted model
The estimator must expose coef_ or feature_importances_ after fitting

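# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

boston = load_boston()
clf = LinearRegression()
# LinearRegression exposes coef_, which SelectFromModel uses as the importance
sfm = feature_selection.SelectFromModel(clf, threshold=0.25)
sfm.fit_transform(boston.data, boston.target).shape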

(506, 7)

boston.data.shape

(506, 13)
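A version that runs without any bundled dataset is sketched below. The synthetic data and the threshold="mean" setting are assumptions for illustration: features are kept when the absolute value of their coefficient exceeds the mean absolute coefficient.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
# Only features 0 and 2 drive the target.
y = 4.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.randn(200)

sfm = SelectFromModel(LinearRegression(), threshold="mean")
X_new = sfm.fit_transform(X, y)

selected = sfm.get_support(indices=True)
print(selected)       # indices of the kept features
print(X_new.shape)
```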

Recursive Feature Elimination

Uses an external estimator to assign weights to features
First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute.
Then, the least important features are pruned from the current set of features.
That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_regression(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=5, step=1)
data = selector.fit_transform(X, y)

X.shape
data.shape
selector.ranking_

array([1, 1, 4, 3, 1, 6, 1, 2, 5, 1])
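The elimination loop described above can be written out by hand and compared with RFE. This is only a sketch under simple assumptions: synthetic data, a LinearRegression estimator, and one feature dropped per round; it is not how scikit-learn implements RFE internally.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 6)
# Only features 0, 1 and 3 drive the target.
y = 5.0 * X[:, 0] + 3.0 * X[:, 1] - 2.0 * X[:, 3] + 0.1 * rng.randn(100)

# Manual loop: refit on the surviving columns, drop the weakest, repeat.
remaining = list(range(X.shape[1]))
while len(remaining) > 3:
    model = LinearRegression().fit(X[:, remaining], y)
    weakest = int(np.argmin(np.abs(model.coef_)))
    del remaining[weakest]

# The same selection with RFE.
rfe = RFE(LinearRegression(), n_features_to_select=3, step=1).fit(X, y)
rfe_selected = rfe.get_support(indices=True).tolist()

print(remaining, rfe_selected)
```

Both approaches end with the three informative features, since the noise features always carry the smallest coefficients.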

sljwy
