Feature selection for dimensionality reduction

Feature extraction: create new features that reduce the dimensionality of the feature matrix while preserving as much of the original information as possible.

Feature selection, the other dimensionality reduction approach, keeps the features with high information content and discards those with low information content.

  • Feature selection methods:
    • Filter: selects the best features based on statistical properties of the features themselves.
    • Wrapper: uses trial and error to find the subset of features that produces a model with high predictive quality.
    • Embedded: selects the best feature subset as part of training the machine learning algorithm itself (see the sketch after this list).
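
The filter strategy (variance thresholding, chi-square, ANOVA F-value) and the wrapper strategy (recursive feature elimination) are covered in the recipes below. The embedded strategy is not, so here is a minimal sketch of it on the iris data, using scikit-learn's SelectFromModel with an L1-penalized logistic regression; the penalty strength C = 0.1 is an arbitrary value chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# Load the data
iris = load_iris()
features = iris.data
target = iris.target
# Embedded selection: features whose L1-penalized coefficients shrink to zero are dropped
lasso_logreg = LogisticRegression(penalty = "l1", solver = "liblinear", C = 0.1)
embedded_selector = SelectFromModel(lasso_logreg)
features_embedded = embedded_selector.fit_transform(features, target)
print("original number of features:", features.shape[1])
print("Reduced number of features:", features_embedded.shape[1])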

1. Thresholding of variance of numerical features

Problem description: remove features with small variance (i.e., features that probably contain little information) from a set of numerical features.

Solution: keep only the features whose variance is greater than a given threshold.

from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold
# Load the data
iris = datasets.load_iris()
# Create features and target
features = iris.data
target = iris.target
# Create the VarianceThreshold object
threshold = VarianceThreshold(threshold = 0.5)
# Create the high-variance feature matrix
features_high_variance = threshold.fit_transform(features)
# Show the high-variance feature matrix
features_high_variance[0:3]
Output:
array([[5.1, 1.4, 0.2],
       [4.9, 1.4, 0.2],
       [4.7, 1.3, 0.2]])

Variance thresholding (VT) is one of the most basic feature selection methods. It is based on the assumption that features with small variance are likely to be less informative than features with large variance.

  • Important points:
    • Variance is not centered (it is expressed in the square of the feature's units), so VT will not work well if the features in the dataset have different units.
    • The variance threshold is chosen manually, so you must rely on your own judgment to pick an appropriate value. You can inspect the variance of each feature through the variances_ attribute: VarianceThreshold().fit(features).variances_ (see the snippet after this list).
    • If the features have already been standardized (zero mean, unit variance), variance thresholding cannot screen them, because every feature then has variance 1.
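
As a quick sketch of the last two points (assuming the iris features loaded in the example above): standardizing the features forces every variance to 1, so no threshold can distinguish them.

from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
# Inspect the variance of each original feature
print(VarianceThreshold().fit(features).variances_)
# After standardization every feature has variance 1, so thresholding cannot screen anything
features_std = StandardScaler().fit_transform(features)
print(VarianceThreshold().fit(features_std).variances_)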

2. Thresholding the variance of binary features

Problem description: given a set of binary categorical features, remove the features with small variance (i.e., those that probably contain little information).

Solution: keep only the binary features whose variance exceeds a given threshold.

from sklearn.feature_selection import VarianceThreshold
# Create the feature matrix
features = [[0,1,0],
            [0,1,1],
            [0,1,0],
            [0,1,1],
            [1,0,0]]
# Create the VarianceThreshold object and run it
thresholder = VarianceThreshold(threshold = (0.75 * (1 - 0.75)))
thresholder.fit_transform(features)
Output:
array([[0],
       [1],
       [0],
       [1],
       [0]])

As with numerical features, one way to select categorical features with high information content is to examine their variance. For binary features (i.e., Bernoulli random variables), the variance is calculated as:

Var(x) = p(1 − p)
where p is the proportion of observations belonging to class 1. By choosing a value of p, you can remove features in which the vast majority of observations fall into the same class.
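
A minimal check of why only the third column survives the threshold above (using the features list from the previous example): the variance of each binary column equals p(1 − p), and only the third column's variance exceeds 0.75 × (1 − 0.75) = 0.1875.

import numpy as np
# Population variance of each binary column, equal to p * (1 - p)
print(np.var(features, axis = 0))   # [0.16 0.16 0.24]
# The threshold used above
print(0.75 * (1 - 0.75))            # 0.1875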

3. Dealing with highly correlated features

Problem description: some features in the feature matrix are highly correlated with each other.

Solution: use the correlation matrix to check for pairs of highly correlated features, and if any exist, drop one feature from each pair.

import pandas as pd
import numpy as np
# Create a feature matrix that contains two highly correlated features
features = np.array([[1,1,1],
                     [2,2,0],
                     [3,3,1],
                     [4,4,0],
                     [5,5,1],
                     [6,6,0],
                     [7,7,1],
                     [8,7,0],
                     [9,7,1]])
# Convert the feature matrix into a DataFrame
dataframe = pd.DataFrame(features)
print(dataframe)
# Create the correlation matrix
corr_matrix = dataframe.corr().abs()
# Select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(bool))
# Find the indices of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop those features
dataframe.drop(dataframe.columns[to_drop], axis = 1).head(5)
Output:
   0  2
0  1  1
1  2  0
2  3  1
3  4  0
4  5  1

4. Remove features not related to the classification task

Problem description: given a categorical target vector, remove features with low information content.

Solution:
For categorical features, calculate the chi-square statistic between each feature and the target vector:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif
# Load the data
iris = load_iris()
features = iris.data
target = iris.target
# Convert the features to integers so they can be treated as categorical data
features = features.astype(int)
# Select the two features with the largest chi-square statistics
chi2_selector = SelectKBest(chi2, k = 2)
features_kbest = chi2_selector.fit_transform(features, target)
# Show the results
print("original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Output:

original number of features: 4
Reduced number of features: 2

For numerical features, calculate the ANOVA F-value between each feature and the target vector:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif
# Load the data
iris = load_iris()
features = iris.data
target = iris.target
# Convert the features to integers so they can be treated as categorical data
features = features.astype(int)
# Select the two features with the largest F-values
fvalue_selector = SelectKBest(f_classif, k = 2)
features_kbest = fvalue_selector.fit_transform(features, target)
# Show the results
print("original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

In addition to selecting a fixed number of features, you can select the top n percent of features with the SelectPercentile method:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif
from sklearn.feature_selection import SelectPercentile
# Load the data
iris = load_iris()
features = iris.data
target = iris.target
# Convert the features to integers so they can be treated as categorical data
features = features.astype(int)
# Select the top 75% of features by F-value
fvalue_selector = SelectPercentile(f_classif, percentile = 75)
features_kbest = fvalue_selector.fit_transform(features, target)
# Show the results
print("original number of features:", features.shape[1])
print("Reduced number of features:", features_kbest.shape[1])

Output:

original number of features: 4
Reduced number of features: 3

The chi-square statistic tests whether two categorical vectors are independent. In other words, it measures the difference between the observed counts and the counts that would be expected if the feature and the target vector were unrelated.

Computing the chi-square statistic between a feature and the target vector therefore gives a measure of their independence. If the two are independent, the feature tells us nothing about the target vector and contains no information useful for classification. Conversely, if the feature and the target vector are strongly dependent, the feature carries a lot of the information needed to train a classification model.
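
For reference, the standard form of the statistic compares observed counts with the counts expected under independence:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed count in cell i and Eᵢ is the count expected if the feature and the target were independent; larger values indicate stronger dependence. Note that scikit-learn's chi2 expects non-negative feature values and treats them as counts or frequencies, which is why the iris features are cast to integers in the examples above.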

5. Recursive feature elimination

Problem description: automatically select the best set of features to retain.

Solution: use sklearn's RFECV class to perform recursive feature elimination (RFE) with cross-validation (CV). The method repeatedly trains the model, removing one feature each time, until model performance (for example, accuracy) starts to deteriorate; the features that remain are the optimal ones:

import warnings
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn import datasets, linear_model
# Suppress warnings
# warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
# Generate a feature matrix and a target vector
features, target = make_regression(n_samples = 10000,
                                   n_features = 100,
                                   n_informative = 2,
                                   random_state = 1)
# Create a linear regression object
ols = linear_model.LinearRegression()
# Recursively eliminate features
rfecv = RFECV(estimator = ols, step = 1, scoring = "neg_mean_squared_error")
rfecv.fit(features, target)
rfecv.transform(features)
# Number of optimal features
rfecv.n_features_
# Which features are optimal
rfecv.support_
# Rank the features from best to worst
rfecv.ranking_
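
A short usage note (a sketch using the objects created above): the fitted selector's attributes can be combined to report the result and subset the matrix.

# Print how many features were kept
print("Optimal number of features:", rfecv.n_features_)
# support_ is a boolean mask over the original columns; use it to subset the feature matrix
features_selected = features[:, rfecv.support_]
print("Selected feature matrix shape:", features_selected.shape)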
