Sklearn - feature dimensionality reduction


Related official documents:
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection

Related APIs:

Matrix Decomposition

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition

The sklearn.decomposition module includes matrix decomposition algorithms, including among others PCA, NMF or ICA. Most of the algorithms of this module can be regarded as dimensionality reduction techniques.

User guide: See the Decomposing signals in components (matrix factorization problems) section for further details.

decomposition.DictionaryLearning([…]) Dictionary learning.
decomposition.FactorAnalysis([n_components, …]) Factor Analysis (FA).
decomposition.FastICA([n_components, …]) FastICA: a fast algorithm for Independent Component Analysis.
decomposition.IncrementalPCA([n_components, …]) Incremental principal components analysis (IPCA).
decomposition.KernelPCA([n_components, …]) Kernel Principal component analysis (KPCA).
decomposition.LatentDirichletAllocation([…]) Latent Dirichlet Allocation with online variational Bayes algorithm.
decomposition.MiniBatchDictionaryLearning([…]) Mini-batch dictionary learning.
decomposition.MiniBatchSparsePCA([…]) Mini-batch Sparse Principal Components Analysis.
decomposition.NMF([n_components, init, …]) Non-Negative Matrix Factorization (NMF).
decomposition.MiniBatchNMF([n_components, …]) Mini-Batch Non-Negative Matrix Factorization (NMF).
decomposition.PCA([n_components, copy, …]) Principal component analysis (PCA).
decomposition.SparsePCA([n_components, …]) Sparse Principal Components Analysis (SparsePCA).
decomposition.SparseCoder(dictionary, *[, …]) Sparse coding.
decomposition.TruncatedSVD([n_components, …]) Dimensionality reduction using truncated SVD (aka LSA).
decomposition.dict_learning(X, n_components, …) Solve a dictionary learning matrix factorization problem.
decomposition.dict_learning_online(X[, …]) Solve a dictionary learning matrix factorization problem online.
decomposition.fastica(X[, n_components, …]) Perform Fast Independent Component Analysis.
decomposition.non_negative_factorization(X) Compute Non-negative Matrix Factorization (NMF).
decomposition.sparse_encode(X, dictionary, *) Sparse coding.

Feature Selection

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection

The sklearn.feature_selection module implements feature selection algorithms. It currently includes univariate filter selection methods and the recursive feature elimination algorithm.

User guide: See the Feature selection section for further details.

feature_selection.GenericUnivariateSelect([…]) Univariate feature selector with configurable strategy.
feature_selection.SelectPercentile([…]) Select features according to a percentile of the highest scores.
feature_selection.SelectKBest([score_func, k]) Select features according to the k highest scores.
feature_selection.SelectFpr([score_func, alpha]) Filter: Select the pvalues below alpha based on a FPR test.
feature_selection.SelectFdr([score_func, alpha]) Filter: Select the p-values for an estimated false discovery rate.
feature_selection.SelectFromModel(estimator, *) Meta-transformer for selecting features based on importance weights.
feature_selection.SelectFwe([score_func, alpha]) Filter: Select the p-values corresponding to Family-wise error rate.
feature_selection.SequentialFeatureSelector(…) Transformer that performs Sequential Feature Selection.
feature_selection.RFE(estimator, *[, …]) Feature ranking with recursive feature elimination.
feature_selection.RFECV(estimator, *[, …]) Recursive feature elimination with cross-validation to select features.
feature_selection.VarianceThreshold([threshold]) Feature selector that removes all low-variance features.
feature_selection.chi2(X, y) Compute chi-squared stats between each non-negative feature and class.
feature_selection.f_classif(X, y) Compute the ANOVA F-value for the provided sample.
feature_selection.f_regression(X, y, *[, …]) Univariate linear regression tests returning F-statistic and p-values.
feature_selection.r_regression(X, y, *[, …]) Compute Pearson’s r for each feature and the target.
feature_selection.mutual_info_classif(X, y, *) Estimate mutual information for a discrete target variable.
feature_selection.mutual_info_regression(X, y, *) Estimate mutual information for a continuous target variable.

I. Dimensionality reduction by feature extraction: decomposition

1. PCA: principal components

PCA: a linear dimensionality reduction method

Principal Component Analysis

  • Projects the sample data onto the principal component space of the feature matrix

  • An unsupervised learning method: it uses only the feature matrix and needs no information from the target vector

  • Maps the feature space to a lower-dimensional space


Official documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html


PCA parameters

  • n_components: set to 0.95 or 0.99 to keep 95% or 99% of the original feature information (measured by variance)
  • whiten=True: transforms every principal component so that it has zero mean and unit variance
  • svd_solver='randomized': uses a stochastic method to find the first principal components, which is usually much faster
from sklearn.preprocessing import StandardScaler 
from sklearn.decomposition import PCA, KernelPCA

from sklearn import datasets

iris = datasets.load_iris()

iris_feature = iris.data
iris_target = iris.target 

digits = datasets.load_digits()   
digits_features = digits.data 
digits_target = digits.target  
digits = datasets.load_digits()
digits
{'data': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ..., 10.,  0.,  0.],
        [ 0.,  0.,  0., ..., 16.,  9.,  0.],
        ...,
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  2., ..., 12.,  0.,  0.],
        [ 0.,  0., 10., ..., 12.,  1.,  0.]]),
 'target': array([0, 1, 2, ..., 8, 9, 8]),
 'frame': None,
 'feature_names': ['pixel_0_0',
  'pixel_0_1',
  'pixel_0_2',
  'pixel_0_3',
  'pixel_0_4',
  'pixel_0_5',
  'pixel_0_6',
  'pixel_0_7',
  'pixel_1_0',
  'pixel_1_1',
  'pixel_1_2',
  'pixel_1_3',
  'pixel_1_4',
  ...
  'pixel_7_6',
  'pixel_7_7'],
 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 'images': array([[[ 0.,  0.,  5., ...,  1.,  0.,  0.],
         [ 0.,  0., 13., ..., 15.,  5.,  0.],
         [ 0.,  3., 15., ..., 11.,  8.,  0.],
         ...,
         [ 0.,  4., 11., ..., 12.,  7.,  0.],
         [ 0.,  2., 14., ..., 12.,  0.,  0.],
         [ 0.,  0.,  6., ...,  0.,  0.,  0.]],
 
        [[ 0.,  0.,  0., ...,  5.,  0.,  0.],
         [ 0.,  0.,  0., ...,  9.,  0.,  0.],
         [ 0.,  0.,  3., ...,  6.,  0.,  0.],
         ...,
         [ 0.,  0.,  1., ...,  6.,  0.,  0.],
         [ 0.,  0.,  1., ...,  6.,  0.,  0.],
         [ 0.,  0.,  0., ..., 10.,  0.,  0.]], 
 
        ...,
  
        [[ 0.,  0., 10., ...,  1.,  0.,  0.],
         [ 0.,  2., 16., ...,  1.,  0.,  0.],
         [ 0.,  0., 15., ..., 15.,  0.,  0.],
         ...,
         [ 0.,  4., 16., ..., 16.,  6.,  0.],
         [ 0.,  8., 16., ..., 16.,  8.,  0.],
         [ 0.,  1.,  8., ..., 12.,  1.,  0.]]]),
 'DESCR': ".. _digits_dataset:\n\nOptical recognition of handwritten digits dataset\n--------------------------------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 1797\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttps://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\n.. topic:: References\n\n  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n    Graduate Studies in Science and Engineering, Bogazici University.\n  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n    Linear dimensionalityreduction using relevance weighted LDA. School of\n    Electrical and Electronic Engineering Nanyang Technological University.\n    2005.\n  - Claudio Gentile. A New Approximate Maximal Margin Classification\n    Algorithm. NIPS. 2000.\n"}
## Standardize the feature matrix
features = StandardScaler().fit_transform(digits.data)
features
array([[ 0.        , -0.33501649, -0.04308102, ..., -1.14664746,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -1.09493684, ...,  0.54856067,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -1.09493684, ...,  1.56568555,
         1.6951369 , -0.19600752],
       ...,
       [ 0.        , -0.33501649, -0.88456568, ..., -0.12952258,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -0.67419451, ...,  0.8876023 ,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649,  1.00877481, ...,  0.8876023 ,
        -0.26113572, -0.19600752]])
## Create a PCA that keeps 95% of the information (measured by variance)
pca = PCA(n_components=0.95, whiten=True) ## n_components is usually 0.95 or 0.99
features_pca = pca.fit_transform(features) 
features_pca
array([[ 0.70631939, -0.39512814, -1.73816236, ...,  1.50118222,
         0.04737545,  0.63346688],
       [ 0.21732591,  0.38276482,  1.72878893, ...,  0.36929518,
         0.16553156, -0.96249022],
       [ 0.4804351 , -0.13130437,  1.33172761, ..., -0.39573004,
        -2.43023645,  1.07404128],
       ...,
       [ 0.37732433, -0.0612296 ,  1.0879821 , ...,  1.10895845,
         0.78365806,  2.37869004],
       [ 0.39705007, -0.15768102, -1.08160094, ...,  0.8069015 ,
        -1.32761099, -0.87248068],
       [-0.46407544, -0.92213976,  0.12493334, ..., -0.70274994,
         0.24428465,  2.60007322]])
features.shape[1]
64
features_pca.shape[1]
40
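To see what the 0.95 threshold amounts to, you can inspect the fitted PCA object. A minimal sketch (my own addition, reusing the pca object fitted above):

## Number of components kept and the variance each one explains
pca.n_components_                       ## how many components were needed
pca.explained_variance_ratio_[:5]       ## ratio per component (first five shown)
pca.explained_variance_ratio_.sum()     ## total retained variance, at least 0.95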

2. Dimensionality reduction for linearly inseparable data

  • Linearly separable data: the two classes can be separated by a straight line or hyperplane
  • Linearly inseparable data: they cannot

The kernel trick

If the data is not linearly separable, a linear transformation does not work well.
With plain PCA the two classes get projected onto the first principal component and may end up intertwined (the sketch after the KernelPCA output below illustrates this).
A kernel function maps the inseparable data into a higher-dimensional space in which it becomes linearly separable.

This technique is called the kernel trick.

KernelPCA offers several kernel functions; common ones are

  • rbf: Gaussian radial basis function
  • poly: polynomial kernel
  • sigmoid: sigmoid kernel
  • linear: linear kernel, which gives the same result as plain PCA

Drawback of kernel PCA: it requires specifying a number of parameters and hyperparameters (for example the kernel and gamma).

from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles 
## Create linearly inseparable data

features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)
features 
array([[ 0.23058395, -0.10671314],
       [-0.0834218 , -0.22647078],
       [ 0.9246533 , -0.71492522],
       ...,
       [ 0.02517206,  0.00964548],
       [-0.92836187,  0.06693357],
       [ 1.03502248,  0.54878286]])
## Apply kernel PCA with a radial basis function (RBF) kernel

kpca = KernelPCA(kernel='rbf', gamma=15, n_components=1 )
features_kpca = kpca.fit_transform(features)  

If you get the error:

cannot import name 'available_if' from 'sklearn.utils.metaestimators'

upgrade sklearn; I got this error on version 0.24, and it went away after upgrading to 1.0.2.

features_kpca
array([[ 0.08961469],
       [ 0.17082614],
       [-0.36539792],
       [-0.37995615],
       [-0.37090715],
       [ 0.6078442 ],
       [-0.39356042],
       [ 0.55552131], 
       ... 
       [ 0.60795829],
       [-0.38024627],
       [-0.36804001]])
features.shape[1] 
2
features_kpca.shape[1]
1
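The difference the kernel makes can be checked directly. The sketch below (my own addition, not part of the original notebook) regenerates the circles while keeping the class labels and prints the range of the first component per class for plain PCA and for RBF kernel PCA; if the two ranges overlap, no single threshold on that component can separate the classes.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

pca_1d = PCA(n_components=1).fit_transform(X)
kpca_1d = KernelPCA(kernel='rbf', gamma=15, n_components=1).fit_transform(X)

for name, comp in [('PCA', pca_1d), ('KernelPCA', kpca_1d)]:
    inner = comp[y == 1].ravel()   ## inner circle
    outer = comp[y == 0].ravel()   ## outer circle
    print(name,
          'inner:', (round(float(inner.min()), 2), round(float(inner.max()), 2)),
          'outer:', (round(float(outer.min()), 2), round(float(outer.max()), 2)))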


3. Dimensionality reduction by maximizing class separability

LDA: Linear Discriminant Analysis

Both LDA and PCA map data from a high-dimensional space to a lower-dimensional one.

LDA projects the feature data onto component axes that maximize the separability between classes.


Official LDA documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html

class sklearn.discriminant_analysis.LinearDiscriminantAnalysis(solver='svd', shrinkage=None, priors=None, n_components=None, store_covariance=False, tol=0.0001, covariance_estimator=None)

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 
lda = LinearDiscriminantAnalysis(n_components=1) 
feature_lda = lda.fit(iris_feature, iris_target).transform(iris_feature)    

iris_feature.shape, feature_lda.shape 
((150, 4), (150, 1))
## The amount of information (variance) retained by each output feature
## The attribute is explained_variance_ratio_; the misspelling below is why the call fails:
## lda.explaineal_variance_ratio_
## AttributeError: 'LinearDiscriminantAnalysis' object has no attribute 'explaineal_variance_ratio_'
## (the correctly spelled attribute is shown with the eigen solver below)


from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda
clf = lda(solver='eigen', shrinkage = 'auto')
clf.fit(iris_feature, iris_target)

LinearDiscriminantAnalysis(shrinkage='auto', solver='eigen')
clf.explained_variance_ratio_
array([0.97875711, 0.01751255])
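If you prefer to derive n_components from the explained variance instead of fixing it by hand, a small helper can do it. A sketch (select_n_components and the 0.95 target are my own, not a sklearn API):

import numpy as np

def select_n_components(var_ratio, goal_var):
    ## Smallest number of components whose cumulative explained variance reaches the goal
    cumulative = np.cumsum(var_ratio)
    return int(np.searchsorted(cumulative, goal_var) + 1)

select_n_components(clf.explained_variance_ratio_, 0.95)   ## 1 for the iris run above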

4. Dimensionality reduction with matrix factorization

NMF (Non-Negative Matrix Factorization) reduces the dimensionality of the feature matrix.
It is an unsupervised, linear dimensionality reduction method.

It factorizes the feature matrix into several matrices whose product approximates the original matrix.
NMF reduces the number of dimensions by transforming the feature matrix into matrices that capture the latent relationships between samples and features.

NMF factorizes the feature matrix as V ≈ WF, where

V: d × n feature matrix (d features, n samples)
W: d × r matrix
F: r × n matrix

Unlike PCA and LDA, NMF does not tell us how much of the original information is retained in the output features.

from sklearn.decomposition import NMF 
## Create the NMF transformer and apply it
nmf = NMF(n_components=10, random_state=1)
nmf 
NMF(n_components=10, random_state=1)
features_nmf = nmf.fit_transform(digits_features) 
features_nmf
array([[0.        , 0.        , 0.837839  , ..., 0.        , 1.20675846,
        0.42031996],
       [0.25929024, 1.20855754, 0.17863194, ..., 0.06564288, 0.        ,
        0.        ],
       [0.17020236, 0.7959156 , 0.91306418, ..., 0.57233516, 0.        ,
        0.        ],
       ...,
       [0.23816978, 0.78676462, 0.93940245, ..., 0.08195033, 0.50671336,
        0.05726816],
       [0.        , 0.        , 0.12887881, ..., 0.3640044 , 0.7338979 ,
        0.55289412],
       [0.52050455, 0.        , 0.82556395, ..., 0.55138297, 0.58910755,
        0.23813568]])
features.shape, features_nmf.shape   ## note: `features` still holds the make_circles data from section 2
((1000, 2), (1797, 10))
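As a sanity check on the V ≈ WF factorization described above (in sklearn's orientation the samples are rows, so digits_features (1797 × 64) is approximated by features_nmf (1797 × 10) times nmf.components_ (10 × 64)), a minimal sketch:

import numpy as np

V_approx = features_nmf @ nmf.components_
V_approx.shape                                   ## (1797, 64)
np.linalg.norm(digits_features - V_approx)       ## Frobenius reconstruction error
nmf.reconstruction_err_                          ## error stored by sklearn at fit time, should be close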

5. Dimensionality reduction for sparse matrices

TSVD: Truncated Singular Value Decomposition

from sklearn.decomposition import TruncatedSVD, NMF
from scipy.sparse import csr_matrix
digits_features = digits.data 
digits_target = digits.target  
## Standardize the feature matrix
features = StandardScaler().fit_transform(digits_features)  
features
array([[ 0.        , -0.33501649, -0.04308102, ..., -1.14664746,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -1.09493684, ...,  0.54856067,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -1.09493684, ...,  1.56568555,
         1.6951369 , -0.19600752],
       ...,
       [ 0.        , -0.33501649, -0.88456568, ..., -0.12952258,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649, -0.67419451, ...,  0.8876023 ,
        -0.5056698 , -0.19600752],
       [ 0.        , -0.33501649,  1.00877481, ...,  0.8876023 ,
        -0.26113572, -0.19600752]])
## Build a sparse matrix
features_sparse = csr_matrix(features)
features_sparse
<1797x64 sparse matrix of type '<class 'numpy.float64'>'
	with 109617 stored elements in Compressed Sparse Row format>
tsvd = TruncatedSVD(n_components=10)
tsvd
TruncatedSVD(n_components=10)
features_sparse_tsvd = tsvd.fit(features_sparse).transform(features_sparse) 
features_sparse.shape, features_sparse_tsvd.shape 
((1797, 64), (1797, 10))
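Unlike PCA, TruncatedSVD does not accept a variance fraction for n_components. A common approach (sketched here, not from the original) is to fit it and look at the cumulative explained variance of the kept components:

import numpy as np

tsvd.explained_variance_ratio_                   ## variance explained by each of the 10 components
np.cumsum(tsvd.explained_variance_ratio_)        ## pick n_components where this reaches your target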
 

II. Dimensionality reduction by feature selection: feature_selection


1. Thresholding the variance of numerical features

Variance Thresholding (VT)
Rationale: a feature with very small variance is probably less informative than a feature with large variance.
The first step of VT is to compute the variance of each feature.


https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html

class sklearn.feature_selection.VarianceThreshold(threshold=0.0)

from sklearn.feature_selection import VarianceThreshold 
threshold = VarianceThreshold(threshold=.5) 
threshold 
VarianceThreshold(threshold=0.5)
## Keep only the features with high variance
features_high_variance = threshold.fit_transform(iris_feature)
## Inspect the result
features_high_variance.shape 
(150, 3)
features_high_variance [:3] 
array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2]])
## Show the variances
threshold.fit(iris_feature).variances_ 
array([0.68112222, 0.18871289, 3.09550267, 0.57713289])
from sklearn.preprocessing import StandardScaler 

## Standardize the feature matrix

scaler = StandardScaler() 
feature_std = scaler.fit_transform(iris_feature)
feature_std 
array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00],
        ...
       [ 6.86617933e-02, -1.31979479e-01,  7.62758269e-01,
         7.90670654e-01]])
## Compute the variance of each standardized feature
## (after standardization every variance is 1, so variance thresholding is no longer useful)

selector = VarianceThreshold() 
selector.fit(feature_std).variances_ 
array([1., 1., 1., 1.])

2. Variance thresholding for binary features

  • As with numerical features, one way to pick out high-information categorical features is to examine their variances.

  • For a binary feature (i.e. a Bernoulli random variable), the variance is Var(x) = p(1 − p)

    • where p is the probability that an observation belongs to the first class.

By choosing a threshold on p, we can remove features in which the vast majority of observations belong to the same class.

features = [[0, 1, 0], 
            [0, 1, 1], 
            [0, 1, 0], 
            [0, 1, 1], 
            [1, 0, 0],   
           ]
threshold = VarianceThreshold(threshold=(0.75 * (1-0.75)) ) 
threshold 
VarianceThreshold(threshold=0.1875)
threshold.fit_transform(features) 
array([[0],
       [1],
       [0],
       [1],
       [0]])
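To connect this output back to the Var(x) = p(1 − p) formula, the sketch below (my own addition) computes the per-column probabilities and compares them with the variances VarianceThreshold uses:

import numpy as np

X = np.array(features)
p = X.mean(axis=0)                        ## probability of a 1 in each column
p * (1 - p)                               ## Bernoulli variances: [0.16, 0.16, 0.24]
threshold.fit(features).variances_        ## VarianceThreshold computes the same values
## Only the third column exceeds the 0.1875 threshold, which is why it is the one kept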


3. Handling highly correlated features

Problem: some features in the feature matrix are highly correlated with one another, i.e. they are redundant.

Solution: use a correlation matrix to check for highly correlated features; if any exist, drop one feature from each correlated pair.

import pandas as pd 
import numpy as np  
## Create a feature matrix that contains two highly correlated features

features = [[1, 1, 1], 
            [2, 2, 0],   
            [3, 3, 1],
            [4, 4, 0],   
            [5, 5, 1],
            [6, 6, 0],   
            [7, 7, 1],
            [8, 7, 0],   
            [9, 7, 1],     
           ]

df = pd.DataFrame(features) 
df
   0  1  2
0  1  1  1
1  2  2  0
2  3  3  1
3  4  4  0
4  5  5  1
5  6  6  0
6  7  7  1
7  8  7  0
8  9  7  1

corr_matrix = df.corr().abs() 
corr_matrix
          0         1         2
0  1.000000  0.976103  0.000000
1  0.976103  1.000000  0.034503
2  0.000000  0.034503  1.000000

m = np.triu(np.ones( corr_matrix.shape ), k=1)
m 
array([[0., 1., 1.],
       [0., 0., 1.],
       [0., 0., 0.]])
## Select the upper triangle of the correlation matrix
upper = corr_matrix.where(m.astype(bool))   ## np.bool was removed from recent NumPy; use the builtin bool
upper

    0         1         2
0 NaN  0.976103  0.000000
1 NaN       NaN  0.034503
2 NaN       NaN       NaN
## Find the indices of feature columns whose correlation is greater than 0.95

to_drop = [column for column in upper.columns if any(upper[column] > 0.95 ) ] 
to_drop 
[1]
## Drop those features

df.drop(df.columns[to_drop], axis=1).head(3)  

   0  2
0  1  1
1  2  0
2  3  1
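The three steps above (correlation matrix, upper triangle, drop list) are often wrapped into a small helper. A sketch using the same 0.95 cutoff (drop_correlated is my own name, not a pandas or sklearn function):

import numpy as np
import pandas as pd

def drop_correlated(df, cutoff=0.95):
    ## Drop one feature from every pair whose absolute correlation exceeds the cutoff
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > cutoff).any()]
    return df.drop(columns=to_drop)

drop_correlated(df).head(3)   ## same result as above: columns 0 and 2 remain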

4. Removing features irrelevant to the classification task

Remove low-information features based on the classification target vector.

from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2, f_classif  
## Convert the feature data to integers
features = iris_feature.astype(int) 
iris_feature[:3], features[:3]  
(array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2]]),
 array([[5, 3, 1, 0],
        [4, 3, 1, 0],
        [4, 3, 1, 0]]))

1) Categorical features

Compute the chi-squared statistic between each feature and the target vector.

## Select the two features with the largest chi-squared statistics

chi2_selector = SelectKBest(chi2, k=2) 
features_kbest = chi2_selector.fit_transform(features, iris_target) 
chi2_selector
SelectKBest(k=2, score_func=<function chi2 at 0x7fd058bde9e0>)
features.shape, features_kbest.shape 
((150, 4), (150, 2))
features_kbest[:3] 
array([[1, 0],
       [1, 0],
       [1, 0]])
iris_target[:3] 
array([0, 0, 0])

2) Numerical features

Compute the ANOVA F-value between each feature and the target vector.

## Select the two features with the largest F-values
fvalue_selector = SelectKBest(f_classif, k=2) 
features_kbest = fvalue_selector.fit_transform(features, iris_target) 
fvalue_selector
SelectKBest(k=2)
features.shape, features_kbest.shape 
((150, 4), (150, 2))
features_kbest[:3]
array([[1, 0],
       [1, 0],
       [1, 0]])

3) SelectPercentile: select the top n% of features

The examples above select a fixed number of features; SelectPercentile selects a given percentage of them instead.

from sklearn.feature_selection import SelectPercentile 
## Select the features whose F-values are in the top 75%
fvalue_selector = SelectPercentile(f_classif, percentile=75)  
features_kbest = fvalue_selector.fit_transform(features, iris_target) 
fvalue_selector
SelectPercentile(percentile=75)
features_kbest[:3] 
array([[5, 1, 0],
       [4, 1, 0],
       [4, 1, 0]])
features.shape, features_kbest.shape  
((150, 4), (150, 3))

5. Recursive feature elimination

RFE: Recursive Feature Elimination

Use sklearn's RFECV class to perform recursive feature elimination with cross-validation.

This method trains the model repeatedly, removing one feature on each iteration, until model performance starts to degrade; the features that remain are the best ones.

from sklearn.datasets import make_regression  
from sklearn.feature_selection import RFECV 
from sklearn import datasets, linear_model 
import warnings 

warnings.filterwarnings(action='ignore', module='scipy', message='^internal gelsd')

features, target = make_regression(n_samples=10000, n_features=100, n_informative=2, random_state=1) 
features, target
(array([[-0.64280735, -0.76165969, -0.21101598, ..., -0.09009697,
         -0.465717  ,  0.57708925],
        [ 1.3445917 , -0.46550514, -0.35861939, ..., -0.83288561,
         -0.80204139,  0.23866708],
        [-0.8296151 ,  1.36836958,  0.72194912, ..., -0.19323117,
          0.61859301,  0.56851734],
        ...,
        [ 0.07312981,  0.39023196, -0.19507822, ...,  0.16783823,
         -1.44486423,  0.80140047],
        [-0.59308625,  0.44825266, -0.75444722, ..., -0.23055945,
          1.83227846, -0.31107328],
        [ 0.94761575,  1.15616404, -0.10539287, ...,  0.05952898,
         -0.12939911, -1.18507119]]),
 array([  27.46715058,   23.8681199 ,   27.34686337, ..., -116.75887199,
         -24.22656921,   -6.46891152]))
## Create a linear regression object 
ols = linear_model.LinearRegression() 

## Recursively eliminate features
rfecv = RFECV(estimator=ols, step=1, scoring='neg_mean_squared_error')  
rfecv.fit(features, target) 
rfecv.transform(features)   


array([[ 0.00850799,  0.7031277 ],
       [-1.07500204,  2.56148527],
       [ 1.37940721, -1.77039484],
       ...,
       [-0.80331656, -1.60648007],
       [ 0.39508844, -1.34564911],
       [-0.55383035,  0.82880112]])
## Number of best features

rfecv.n_features_ 
2
## Which features are the best
rfecv.support_ 
array([False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])
## Feature ranking from best to worst (1 = best)
rfecv.ranking_ 
array([50, 43, 28, 44, 17,  1, 13, 27, 24, 97, 22,  4, 47, 21, 26, 10, 52,
       23, 16, 11, 60, 73, 33, 72, 80, 32, 49, 42, 37, 30, 38, 55, 84, 86,
       85, 95, 48, 62, 57,  1, 40, 35, 70, 71, 41, 29, 82, 91, 66, 69, 67,
       14,  2, 76, 61, 56, 20, 75, 74, 58, 64, 12, 79,  7, 89, 77, 90, 87,
       46, 19,  3, 65, 68, 34, 31, 92,  8, 88, 54, 94, 59, 81, 53,  6, 93,
       98, 18, 78, 83, 45, 36, 63,  5, 25, 39, 96,  9, 15, 51, 99])
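To recover which columns of the original feature matrix were kept, the boolean mask can be turned into indices (a small sketch using the fitted rfecv object):

import numpy as np

selected = np.where(rfecv.support_)[0]
selected                           ## column indices of the selected features
features[:, selected][:3]          ## the same columns that rfecv.transform(features) returns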

2023-03-25
