Data Preprocessing: Wrapper-Based Feature Selection

RFE

Model prototype

class sklearn.feature_selection.RFE(estimator,n_features_to_select=None,step=1,estimator_params=None,verbose=0)
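Parameters

estimator: a learner that exposes feature weights (SVMs and generalized linear models are the usual choices)
n_features_to_select: the number of features to keep; if None, half of the features are kept
step: how many of the lowest-weight features to remove in each iteration
- an integer greater than or equal to 1: the number of lowest-weight features removed per iteration
- a float strictly between 0.0 and 1.0: the fraction of the original feature count (rounded down) removed per iteration
estimator_params: a dict of parameters for the estimator (this argument was removed in later scikit-learn releases; set the parameters on the estimator itself instead)
verbose: controls how much progress output is printed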
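To make the two forms of step concrete, here is a minimal sketch. The synthetic dataset from make_classification and the LogisticRegression estimator are assumptions chosen for illustration; they do not appear in the original post. With 10 features and 2 to keep, step=1 needs 8 elimination rounds, while step=0.5 removes 5 features in the first round and the remaining 3 in the second.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# step=1: drop one feature per iteration.
sel_one = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=2, step=1).fit(X, y)

# step=0.5: drop int(0.5 * 10) = 5 features per iteration,
# capped so that exactly n_features_to_select remain.
sel_half = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=2, step=0.5).fit(X, y)

# Selected features always have rank 1; a larger rank means the
# feature was eliminated in an earlier round.
print('step=1   ranking:', sel_one.ranking_)
print('step=0.5 ranking:', sel_half.ranking_)
```

A larger step is cheaper because the estimator is refit fewer times, but the ranking becomes coarser since several features share the same rank.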

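Attributes

n_features_: the number of selected features
support_: a boolean array, the mask of the selected features
ranking_: the feature ranking (selected features have rank 1)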

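Methods

fit(X,y): fit the RFE model
transform(X): apply the feature selection to X
fit_transform(X,y): fit the RFE model on the sample data, then apply the feature selection
get_support([indices]): return the mask of the selected features, or their integer indices if indices=True
inverse_transform(X): map reduced data back to the original feature space, filling the removed columns with zeros
predict(X)/predict_log_proba(X)/predict_proba(X): reduce X to the selected features, then predict with the internal estimator
score(X,y): reduce X to the selected features, then score with the internal estimator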
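The methods listed above can be exercised with a short sketch like the following. The choice of LogisticRegression as the estimator is an assumption made for illustration; any estimator that exposes coef_ or feature_importances_ should behave similarly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)

mask = selector.get_support()             # boolean mask over the 4 original features
idx = selector.get_support(indices=True)  # the same selection as integer indices

X_sel = selector.transform(X)                # keeps only the selected columns
X_back = selector.inverse_transform(X_sel)   # original width, removed columns are zero

pred = selector.predict(X)   # selects features, then calls the fitted estimator
acc = selector.score(X, y)   # selects features, then scores the fitted estimator

print('Selected indices:', idx)
print('Reduced shape:', X_sel.shape, 'restored shape:', X_back.shape)
print('Training accuracy: %.3f' % acc)
```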

Example

from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
iris=load_iris()
X=iris.data
y=iris.target
estimator=LinearSVC()
selector=RFE(estimator=estimator,n_features_to_select=2)
selector.fit(X,y)
print('N_features %s'%selector.n_features_)
print('Support is %s'%selector.support_)
print('Ranking %s'%selector.ranking_)

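Feature selection does not necessarily improve predictive performance. The following comparison trains the same classifier on the original features and on the selected features: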

from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.datasets import load_iris

# Load the data
iris=load_iris()
X,y=iris.data,iris.target

# Feature selection
estimator=LinearSVC()
selector=RFE(estimator=estimator,n_features_to_select=2)
X_t=selector.fit_transform(X,y)

# Split into training and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0,stratify=y)
X_train_t,X_test_t,y_train_t,y_test_t=train_test_split(X_t,y,test_size=0.25,random_state=0,stratify=y)

# Train and evaluate
clf=LinearSVC()
clf_t=LinearSVC()
clf.fit(X_train,y_train)
clf_t.fit(X_train_t,y_train_t)
print('Original DataSet:test score=%s'%(clf.score(X_test,y_test)))
print('Selected DataSet:test score=%s'%(clf_t.score(X_test_t,y_test_t)))

RFECV

Model prototype

class sklearn.feature_selection.RFECV(estimator,step=1,cv=None,scoring=None,estimator_params=None,verbose=0)
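Parameters

estimator: a learner that exposes feature weights (SVMs and generalized linear models are the usual choices)
step: how many of the lowest-weight features to remove in each iteration
- an integer greater than or equal to 1: the number of lowest-weight features removed per iteration
- a float strictly between 0.0 and 1.0: the fraction of the original feature count (rounded down) removed per iteration
cv: determines the cross-validation strategy
- None: use the default 3-fold cross-validation (5-fold since scikit-learn 0.22)
- an integer k: use k-fold cross-validation
- a cross-validation generator: use that object directly
- an iterable: use it to yield the train/test splits
scoring: a scorer name (string) or a callable with the signature scorer(estimator, X, y); None falls back to the estimator's own score method
estimator_params: a dict of parameters for the estimator (this argument was removed in later scikit-learn releases; set the parameters on the estimator itself instead)
verbose: controls how much progress output is printed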
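As a rough illustration of the cv and scoring parameters, the sketch below passes an explicit cross-validation generator and a scorer name. The StratifiedKFold settings, the 'accuracy' scorer, and the LogisticRegression estimator are assumptions for the example, not choices made in the original post.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# A cross-validation generator is used as-is; shuffling with a fixed
# seed keeps the folds reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

selector = RFECV(LogisticRegression(max_iter=1000),
                 step=1, cv=cv, scoring='accuracy').fit(X, y)

# RFECV picks the number of features itself, based on the CV scores.
print('N_features %s' % selector.n_features_)
print('Support is %s' % selector.support_)
```

Passing an integer such as cv=5 would be shorter; an explicit generator is mainly useful when the splits need shuffling, grouping, or a fixed seed.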

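Attributes

n_features_: the number of selected features
support_: a boolean array, the mask of the selected features
ranking_: the feature ranking (selected features have rank 1)
grid_scores_: the cross-validation scores, one per evaluated feature-subset size (removed in scikit-learn 1.2; newer releases expose cv_results_ instead)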
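Because the score attribute differs between scikit-learn releases, a version-tolerant way to read the cross-validation scores might look like the sketch below. The hasattr check and the LogisticRegression estimator are assumptions made for illustration; on recent releases only the cv_results_ branch runs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=3).fit(X, y)

if hasattr(selector, 'cv_results_'):
    # Newer releases: a dict of arrays, one entry per feature-subset size.
    mean_scores = np.asarray(selector.cv_results_['mean_test_score'])
else:
    # Older releases: the scores live in grid_scores_.
    mean_scores = np.asarray(selector.grid_scores_)

# With 4 features and step=1, subsets of size 1, 2, 3 and 4 are scored.
for n_kept, s in enumerate(mean_scores, start=1):
    print('%d feature(s): mean CV score %.3f' % (n_kept, s))
```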

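Methods

fit(X,y): fit the RFECV model
transform(X): apply the feature selection to X
fit_transform(X,y): fit the RFECV model on the sample data, then apply the feature selection
get_support([indices]): return the mask of the selected features, or their integer indices if indices=True
inverse_transform(X): map reduced data back to the original feature space, filling the removed columns with zeros
predict(X)/predict_log_proba(X)/predict_proba(X): reduce X to the selected features, then predict with the internal estimator
score(X,y): reduce X to the selected features, then score with the internal estimator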

Example


import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris

iris=load_iris()
X=iris.data
y=iris.target
estimator=LinearSVC()
selector=RFECV(estimator=estimator,cv=3)
selector.fit(X,y)
print('N_features %s'%selector.n_features_)
print('Support is %s'%selector.support_)
print('Ranking %s'%selector.ranking_)
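# grid_scores_ was removed in scikit-learn 1.2; cv_results_ holds the per-subset CV scores
print('Grid Scores %s'%selector.cv_results_['mean_test_score'])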
