Table of Contents
- 1 Feature Selection Techniques
- 1.1 Agenda
- 1.2 Introduction to Feature Selection
- 1.3 VarianceThreshold
- 1.4 Chi-Square for Non-negative feature & class
- 1.5 ANOVA using f_classif
- 1.6 Univariate Regression Test using f_regression
- 1.7 F-score versus Mutual Information
- 1.8 Mutual Information for regression using mutual_info_regression
- 1.9 Mutual Information for classification using mutual_info_classif
- 1.10 SelectKBest
- 1.11 SelectPercentile
- 1.12 SelectFromModel
- 1.13 Recursive Feature Elimination
Feature Selection Techniques
Agenda
- Introduction to Feature Selection
- VarianceThreshold
- Chi-squared stats
- ANOVA using f_classif
- Univariate Linear Regression Tests using f_regression
- F-score vs Mutual Information
- Mutual Information for discrete targets
- Mutual Information for continuous targets
- SelectKBest
- SelectPercentile
- SelectFromModel
- Recursive Feature Elimination
Introduction to Feature Selection
- Feature selection means choosing a subset of the features in the dataset
- It can improve an estimator's accuracy
- It can boost performance on high-dimensional datasets
- Below we discuss univariate selection methods
- We also cover feature elimination methods
from sklearn import feature_selection
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
VarianceThreshold
- Drops the columns whose variance is below a configured threshold
- This method is unsupervised, i.e. the target is not taken into account
- Intuition: columns whose values are pretty much the same everywhere won't have much impact on the target
df = pd.DataFrame({'A':['m','f','m','m','m','m','m','m'],
'B':[1,2,3,1,2,1,1,1],
'C':[1,2,3,1,2,1,1,1]})
df
| | A | B | C |
---|---|---|---|
0 | m | 1 | 1 |
1 | f | 2 | 2 |
2 | m | 3 | 3 |
3 | m | 1 | 1 |
4 | m | 2 | 2 |
5 | m | 1 | 1 |
6 | m | 1 | 1 |
7 | m | 1 | 1 |
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['A'] = le.fit_transform(df.A)
df
| | A | B | C |
---|---|---|---|
0 | 1 | 1 | 1 |
1 | 0 | 2 | 2 |
2 | 1 | 3 | 3 |
3 | 1 | 1 | 1 |
4 | 1 | 2 | 2 |
5 | 1 | 1 | 1 |
6 | 1 | 1 | 1 |
7 | 1 | 1 | 1 |
vt = feature_selection.VarianceThreshold(threshold=.2)
vt.fit_transform(df)
vt.variances_
array([0.109375, 0.5 , 0.5 ])
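As a quick check of the variances above, the following sketch recomputes them with the population variance (ddof=0) and asks the fitted selector which columns survive the 0.2 threshold. It rebuilds the toy frame with column A already encoded as 0/1; the variable names `manual`, `kept` and `reduced` are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Same toy data as above, with column A already label-encoded (m -> 1, f -> 0)
df = pd.DataFrame({'A': [1, 0, 1, 1, 1, 1, 1, 1],
                   'B': [1, 2, 3, 1, 2, 1, 1, 1],
                   'C': [1, 2, 3, 1, 2, 1, 1, 1]})

vt = VarianceThreshold(threshold=0.2)
reduced = vt.fit_transform(df)

# Population variance (ddof=0) of every column, computed by hand for comparison
manual = df.var(ddof=0).to_numpy()

# get_support() is a boolean mask over the input columns
kept = list(df.columns[vt.get_support()])
print(kept, reduced.shape)
```

Column A falls below the threshold because seven of its eight values are identical, so only B and C remain.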
Chi-Square for Non-negative feature & class
- Feature data should be booleans or counts (non-negative values)
- Supervised technique for feature selection
- Target should be discrete
df = pd.read_csv('Data/tennis.csv.txt')
for col in df.columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
df
| | outlook | temp | humidity | windy | play |
---|---|---|---|---|---|
0 | 2 | 1 | 0 | 0 | 0 |
1 | 2 | 1 | 0 | 1 | 0 |
2 | 0 | 1 | 0 | 0 | 1 |
3 | 1 | 2 | 0 | 0 | 1 |
4 | 1 | 0 | 1 | 0 | 1 |
5 | 1 | 0 | 1 | 1 | 0 |
6 | 0 | 0 | 1 | 1 | 1 |
7 | 2 | 2 | 0 | 0 | 0 |
8 | 2 | 0 | 1 | 0 | 1 |
9 | 1 | 2 | 1 | 0 | 1 |
10 | 2 | 2 | 1 | 1 | 1 |
11 | 0 | 2 | 0 | 1 | 1 |
12 | 0 | 1 | 1 | 0 | 1 |
13 | 1 | 2 | 0 | 1 | 0 |
chi2, pval = feature_selection.chi2(df.drop('play',axis=1),df.play)
chi2
array([2.02814815, 0.02222222, 1.4 , 0.53333333])
- A higher chi-squared score indicates a stronger dependence between the feature and the target
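To make the score less of a black box, here is a minimal sketch of how the chi-squared statistic can be computed by hand and compared with `feature_selection.chi2`. The count matrix and labels are made up for illustration, and the manual formula (observed versus expected per-class feature totals) is my reading of the statistic rather than a quote from the library.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical count data: 6 samples, 3 non-negative features, binary target
X = np.array([[1, 0, 3],
              [0, 2, 1],
              [4, 0, 0],
              [1, 1, 1],
              [0, 3, 2],
              [2, 0, 1]])
y = np.array([0, 1, 0, 1, 1, 0])

# One-hot encode the target so that Y.T @ X sums each feature per class
classes = np.unique(y)
Y = (y[:, None] == classes[None, :]).astype(float)

observed = Y.T @ X                              # per-class feature totals
class_prob = Y.mean(axis=0)                     # share of samples in each class
feature_total = X.sum(axis=0)                   # overall total of each feature
expected = np.outer(class_prob, feature_total)  # totals expected under independence

manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sk_scores, sk_pvalues = chi2(X, y)
print(manual_scores, sk_scores)
```

A feature whose totals are split between the classes in the same proportion as the samples gets a score near zero.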
ANOVA using f_classif
- For feature variables that are continuous in nature
- And a target variable that is discrete in nature
- Internally, for each feature this method uses the ratio of the variation between the class groups to the variation within the class groups
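The ratio described above can be sketched by hand and compared with `f_classif`. The data below is synthetic and the helper `anova_f` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)                          # three classes, 20 samples each
x_informative = rng.normal(loc=y.astype(float), scale=1.0)  # mean shifts with the class
x_noise = rng.normal(size=y.size)                     # unrelated to the class
X = np.column_stack([x_informative, x_noise])

def anova_f(x, y):
    """One-way ANOVA F-statistic of a single feature x against labels y."""
    classes = np.unique(y)
    grand_mean = x.mean()
    # Variation of the class means around the grand mean
    ss_between = sum((y == c).sum() * (x[y == c].mean() - grand_mean) ** 2
                     for c in classes)
    # Variation of the samples around their own class mean
    ss_within = sum(((x[y == c] - x[y == c].mean()) ** 2).sum() for c in classes)
    df_between = len(classes) - 1
    df_within = len(x) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)

manual_F = np.array([anova_f(X[:, j], y) for j in range(X.shape[1])])
F, pval = f_classif(X, y)
print(manual_F, F)
```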
from sklearn.datasets import load_breast_cancer
cancer_data = load_breast_cancer()
X = cancer_data.data
Y = cancer_data.target
print(X)
F, pval = feature_selection.f_classif(X,Y)
[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
[2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
[1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
...
[1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
[2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
[7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
np.round(F)
array([647., 118., 697., 573., 84., 313., 534., 862., 70., 0., 269.,
0., 254., 244., 3., 53., 39., 113., 0., 3., 861., 150.,
898., 662., 122., 304., 437., 964., 119., 66.])
- Each value is the F-statistic of one feature; a larger value suggests the feature separates the classes better
Univariate Regression Test using f_regression
- Linear model for testing the individual effect of each of many regressors.
- The correlation between each feature and the target is calculated
- The F-test captures linear dependency only
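The link between correlation and the F-statistic can be sketched as follows: each Pearson correlation r is turned into F = r² / (1 - r²) × (n - 2). The data is synthetic and the conversion is written out by hand as an illustration; it assumes the default centering behaviour of `f_regression`.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Target depends strongly on column 0, weakly on column 1, not on column 2
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

n_samples = X.shape[0]
# Pearson correlation of every column with the target
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
# Convert each correlation into an F-statistic with (1, n - 2) degrees of freedom
manual_F = r ** 2 / (1 - r ** 2) * (n_samples - 2)

F, pval = f_regression(X, y)
print(manual_F, F)
```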
from sklearn.datasets import fetch_california_housing
house_data = fetch_california_housing()
X,Y = house_data.data, house_data.target
F, pval = feature_selection.f_regression(X,Y)
Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to C:\Users\Administrator\scikit_learn_data
F
array([1.85565716e+04, 2.32841479e+02, 4.87757462e+02, 4.51085756e+01,
1.25474103e+01, 1.16353421e+01, 4.38005453e+02, 4.36989761e+01])
pval
array([0.00000000e+000, 2.76186068e-052, 7.56924213e-107, 1.91258939e-011,
3.97630785e-004, 6.48344237e-004, 2.93985929e-096, 3.92332207e-011])
- Columns with top F values are the selected features
F-score versus Mutual Information
np.random.seed(0)
X = np.random.rand(1000, 3)
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)
feature_selection.f_regression(X,y)
(array([187.42118421, 52.52357392, 0.47268298]),
array([3.19286906e-39, 8.50243215e-13, 4.91915197e-01]))
plt.scatter(X[:,0],y,s=10)
plt.scatter(X[:,1],y,s=10)
Mutual Information for regression using mutual_info_regression
- Returns a non-negative dependency estimate between each feature and the target; 0 means independent and larger values mean stronger dependency
- Captures any kind of dependency, even non-linear
- Target is continuous in nature
feature_selection.mutual_info_regression(X,y)
array([0.31431334, 0.86235026, 0. ])
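To read the two scores side by side, the sketch below recomputes both on the same synthetic data and rescales each by its maximum. The rescaling and the fixed `random_state` are choices made for this illustration only.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

# Same synthetic data as above
np.random.seed(0)
X = np.random.rand(1000, 3)
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * np.random.randn(1000)

F, _ = f_regression(X, y)
mi = mutual_info_regression(X, y, random_state=0)  # random_state fixed for repeatability

# Rescale each score vector by its maximum so the two can be read side by side
F_rel = F / F.max()
mi_rel = mi / mi.max()
for j in range(3):
    print(f"feature {j}: F={F_rel[j]:.2f}  MI={mi_rel[j]:.2f}")
```

The F-test ranks the linear feature 0 first, while mutual information ranks the sinusoidal feature 1 first; both give the unrelated feature 2 a low score.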
Mutual Information for classification using mutual_info_classif
- Returns a non-negative dependency estimate between each feature and the target; 0 means independent and larger values mean stronger dependency
- Captures any kind of dependency, even non-linear
- Target is discrete in nature
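A small synthetic case may help show what "any kind of dependency" means in practice. In this sketch the class depends only on the absolute value of a feature, so the class means are nearly equal and an F-test is likely to rate the feature low, whereas mutual information should still pick it up. All data and names here are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=1000)
x1 = rng.uniform(-1, 1, size=1000)      # pure noise
X = np.column_stack([x0, x1])
# The class depends on |x0| only, a symmetric and therefore non-linear relationship
y = (np.abs(x0) > 0.5).astype(int)

F, _ = f_classif(X, y)
mi = mutual_info_classif(X, y, random_state=0)  # random_state fixed for repeatability
print("F: ", F)
print("MI:", mi)
```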
cols = ['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship'
,'race','sex','capital-gain','capital-loss','hours-per-week','native-country','Salary']
adult_data = pd.read_csv('https://raw.githubusercontent.com/zekelabs/data-science-complete-tutorial/master/Data/adult.data.txt', names=cols)
cat_cols = list(adult_data.select_dtypes('object').columns)
cat_cols.remove('Salary')
from sklearn.preprocessing import LabelEncoder
for col in cat_cols:
    le = LabelEncoder()
    adult_data[col] = le.fit_transform(adult_data[col])
adult_data
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | Salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | 7 | 77516 | 9 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | <=50K |
1 | 50 | 6 | 83311 | 9 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | <=50K |
2 | 38 | 4 | 215646 | 11 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | <=50K |
3 | 53 | 4 | 234721 | 1 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | <=50K |
4 | 28 | 4 | 338409 | 9 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | <=50K |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32556 | 27 | 4 | 257302 | 7 | 12 | 2 | 13 | 5 | 4 | 0 | 0 | 0 | 38 | 39 | <=50K |
32557 | 40 | 4 | 154374 | 11 | 9 | 2 | 7 | 0 | 4 | 1 | 0 | 0 | 40 | 39 | >50K |
32558 | 58 | 4 | 151910 | 11 | 9 | 6 | 1 | 4 | 4 | 0 | 0 | 0 | 40 | 39 | <=50K |
32559 | 22 | 4 | 201490 | 11 | 9 | 4 | 1 | 3 | 4 | 1 | 0 | 0 | 20 | 39 | <=50K |
32560 | 52 | 5 | 287927 | 11 | 9 | 2 | 4 | 5 | 4 | 0 | 15024 | 0 | 40 | 39 | >50K |
32561 rows × 15 columns
adult_data.Salary.value_counts()
<=50K 24720
>50K 7841
Name: Salary, dtype: int64
Important: the raw Salary strings in this file may carry leading whitespace (' <=50K'), so strip them before comparing; otherwise no row matches and every label becomes 1.
adult_data['Salary'] = np.where(adult_data['Salary'].str.strip()=='<=50K',0,1)
adult_data
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | Salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | 7 | 77516 | 9 | 13 | 4 | 1 | 1 | 4 | 1 | 2174 | 0 | 40 | 39 | 0 |
1 | 50 | 6 | 83311 | 9 | 13 | 2 | 4 | 0 | 4 | 1 | 0 | 0 | 13 | 39 | 0 |
2 | 38 | 4 | 215646 | 11 | 9 | 0 | 6 | 1 | 4 | 1 | 0 | 0 | 40 | 39 | 0 |
3 | 53 | 4 | 234721 | 1 | 7 | 2 | 6 | 0 | 2 | 1 | 0 | 0 | 40 | 39 | 0 |
4 | 28 | 4 | 338409 | 9 | 13 | 2 | 10 | 5 | 2 | 0 | 0 | 0 | 40 | 5 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32556 | 27 | 4 | 257302 | 7 | 12 | 2 | 13 | 5 | 4 | 0 | 0 | 0 | 38 | 39 | 0 |
32557 | 40 | 4 | 154374 | 11 | 9 | 2 | 7 | 0 | 4 | 1 | 0 | 0 | 40 | 39 | 1 |
32558 | 58 | 4 | 151910 | 11 | 9 | 6 | 1 | 4 | 4 | 0 | 0 | 0 | 40 | 39 | 0 |
32559 | 22 | 4 | 201490 | 11 | 9 | 4 | 1 | 3 | 4 | 1 | 0 | 0 | 20 | 39 | 0 |
32560 | 52 | 5 | 287927 | 11 | 9 | 2 | 4 | 5 | 4 | 0 | 15024 | 0 | 40 | 39 | 1 |
32561 rows × 15 columns
feature_selection.mutual_info_classif(adult_data.drop('Salary',axis=1), adult_data.Salary)
array([1.07490556e-04, 3.88501582e-03, 0.00000000e+00, 1.45880041e-03,
1.44344461e-03, 2.62584073e-03, 7.52433893e-04, 1.38202144e-03,
6.12696170e-03, 3.51647677e-03, 3.07115875e-05, 4.60673812e-05,
2.27265747e-03, 7.41684838e-03])
adult_data.columns
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
'marital-status', 'occupation', 'relationship', 'race', 'sex',
'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
'Salary'],
dtype='object')
SelectKBest
- SelectKBest keeps the K highest-scoring features according to one of the scoring functions above
- Depending on the score_func argument, it can use mutual information, ANOVA, or regression-based scores
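As a self-contained illustration that needs no download, the sketch below applies SelectKBest with `f_classif` to the breast cancer data bundled with scikit-learn and lists the names of the columns that were kept. The choice of k=5 is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

selector = SelectKBest(score_func=f_classif, k=5)
X_top = selector.fit_transform(X, y)

# get_support() marks the columns that were kept, in their original order
selected = cancer.feature_names[selector.get_support()]
print(X_top.shape)
print(list(selected))
```

Note that the transformed array keeps the selected columns in their original order, not in order of score.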
selector = feature_selection.SelectKBest(k=7, score_func=feature_selection.f_classif)
data = selector.fit_transform(adult_data.drop('Salary',axis=1),adult_data.Salary)
selector.scores_
array([1.88670731e+03, 8.69361605e+01, 2.91559359e+00, 2.06129509e+02,
4.12009578e+03, 1.34685178e+03, 1.86500322e+02, 2.18764583e+03,
1.68934788e+02, 1.59310791e+03, 1.70915006e+03, 7.54830452e+02,
1.81338628e+03, 8.17155711e+00])
data[0]
array([  39,   13,    4,    1,    1, 2174,   40], dtype=int64)
selector = feature_selection.SelectKBest(k=7, score_func=feature_selection.mutual_info_classif)
data = selector.fit_transform(adult_data.drop('Salary',axis=1),adult_data.Salary)
selector.scores_
array([1.07490556e-04, 3.62396732e-03, 0.00000000e+00, 1.41273302e-03,
1.53557937e-03, 2.44157120e-03, 7.21722306e-04, 9.98126593e-04,
6.26516385e-03, 3.28613986e-03, 6.14231750e-05, 1.53557937e-05,
2.02696477e-03, 6.77190504e-03])
data[0]
array([ 7, 13, 4, 4, 1, 40, 39], dtype=int64)
SelectPercentile
- Keeps the features whose scores fall in the top percentage set by the percentile parameter
- The default keeps the top 10 percent
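To see how the percentile translates into a number of columns, this sketch runs SelectPercentile on the 30-feature breast cancer data bundled with scikit-learn. The dataset and the value percentile=20 are chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = load_breast_cancer(return_X_y=True)     # 30 features

selector = SelectPercentile(score_func=f_classif, percentile=20)
X_top = selector.fit_transform(X, y)

# 20 percent of 30 features leaves 6 columns
n_kept = X_top.shape[1]
print(X.shape[1], "->", n_kept)
```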
selector = feature_selection.SelectPercentile(percentile=20, score_func=feature_selection.mutual_info_classif)
data = selector.fit_transform(adult_data.drop('Salary',axis=1),adult_data.Salary)
data[:5]
array([[ 7, 4, 39],
[ 6, 4, 39],
[ 4, 4, 39],
[ 4, 2, 39],
[ 4, 2, 5]], dtype=int64)
SelectFromModel
- Selects important features based on the weights of a fitted model
- The estimator should expose feature_importances_ or coef_ after fitting
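A minimal sketch of the mechanism, on synthetic data rather than the dataset used in this notebook: a linear model is fitted, the absolute coefficients serve as importances, and columns whose importance is below the threshold are dropped. The `threshold="mean"` setting and the data parameters are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, of which only 3 drive the target
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=1.0, random_state=0)

sfm = SelectFromModel(LinearRegression(), threshold="mean")
X_sel = sfm.fit_transform(X, y)

# For linear models the importance of a feature is the absolute coefficient
importance = np.abs(sfm.estimator_.coef_)
mask = sfm.get_support()
print("threshold:", sfm.threshold_)
print("kept columns:", np.flatnonzero(mask))
```

Because coefficients depend on feature scale, it is usually sensible to standardize the features before comparing coefficient magnitudes.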
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
boston = load_boston()
clf = LinearRegression()
sfm = feature_selection.SelectFromModel(clf, threshold=0.25)
sfm.fit_transform(boston.data, boston.target).shape
(506, 7)
boston.data.shape
(506, 13)
Recursive Feature Elimination
- Uses an external estimator to calculate weights of features
- First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute.
- Then, the least important features are pruned from the current set of features.
- That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
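The steps above can be sketched as an explicit loop and compared with scikit-learn's RFE. This version swaps in LinearRegression as the estimator to keep the example fast; it is a simplified illustration of the procedure, not the library's internal code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=50, n_features=10, random_state=0)

# Manual loop: fit, drop the weakest remaining feature, repeat
remaining = list(range(X.shape[1]))
while len(remaining) > 5:
    model = LinearRegression().fit(X[:, remaining], y)
    weakest = int(np.argmin(np.abs(model.coef_)))   # position within `remaining`
    del remaining[weakest]

# The same procedure through scikit-learn's RFE
rfe = RFE(LinearRegression(), n_features_to_select=5, step=1).fit(X, y)
print("manual:", remaining)
print("RFE:   ", list(np.flatnonzero(rfe.support_)))
```

Refitting after every removal matters: the coefficients of the remaining features can change once a correlated feature is gone.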
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_regression(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFE(estimator, n_features_to_select=5, step=1)
data = selector.fit_transform(X, y)
X.shape
data.shape
selector.ranking_
array([1, 1, 4, 3, 1, 6, 1, 2, 5, 1])