[sklearn] Feature selection and dimensionality reduction

1.13 Feature Selection

The classes in the sklearn.feature_selection module can be used for feature selection / dimensionality reduction on sample sets, either to improve estimators' accuracy scores or to boost their performance on very high-dimensional datasets.

1.13.1 Removing features with low variance

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet a given threshold.
By default, it removes all zero-variance features, i.e. features that have the same value in every sample.

For example, suppose we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples.
Boolean features are Bernoulli random variables, whose variance is given by

\(\mathrm{Var}[X] = p(1 - p)\)

so we can select features using the threshold .8 * (1 - .8):

from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

As expected, VarianceThreshold removed the first column, which has a probability p = 5/6 > .8 of containing a zero (its variance, 5/6 * 1/6 ≈ 0.14, is below the 0.16 threshold).

1.13.2 Univariate feature selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator.
Scikit-learn exposes feature selection routines as objects that implement the transform method:

  • SelectKBest removes all but the k highest-scoring features;

  • SelectPercentile removes all but a user-specified highest-scoring percentage of features;

  • selectors that apply a common univariate statistical test to each feature: false positive rate (SelectFpr), false discovery rate (SelectFdr), or family-wise error rate (SelectFwe);

  • GenericUnivariateSelect allows performing univariate feature selection with a configurable strategy, which makes it possible to search for the best univariate selection strategy with a hyper-parameter search estimator (a sketch follows the chi2 example below).

For example, we can perform a \(\chi^2\) test on the samples to retrieve only the two best features:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)

X.shape, X_new.shape

((150, 4), (150, 2))
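
The other selectors listed above are used in the same way. As an illustration, here is a minimal sketch (an addition, not from the original guide) showing GenericUnivariateSelect configured to mimic the SelectKBest call above; the mode and param values are chosen only for this example:

from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, chi2

X, y = load_iris(return_X_y=True)
# mode='k_best' with param=2 reproduces SelectKBest(chi2, k=2); switching mode to
# 'percentile', 'fpr', 'fdr' or 'fwe' selects a different strategy with the same API.
transformer = GenericUnivariateSelect(chi2, mode='k_best', param=2)
X_new = transformer.fit_transform(X, y)
X_new.shape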

These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):

For regression: f_regression, mutual_info_regression

For classification: chi2, f_classif, mutual_info_classif

The methods based on the F-test estimate the degree of linear dependency between two random variables.
Mutual information, on the other hand, can capture any kind of statistical dependency, but being nonparametric, it requires more samples for accurate estimation.
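
As a quick illustration of this difference, the sketch below (an addition, not part of the original text) computes both kinds of scores on the iris features; f_classif returns F-statistics with p-values, while mutual_info_classif returns one non-negative score per feature:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif, mutual_info_classif

X, y = load_iris(return_X_y=True)
# F-test: measures the linear dependency between each feature and the target.
f_scores, p_values = f_classif(X, y)
# Mutual information: captures any dependency, but needs more samples to estimate.
mi_scores = mutual_info_classif(X, y, random_state=0)
print(np.round(f_scores, 1), np.round(mi_scores, 2))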

1.13.3 Recursive feature elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features.
First, the estimator is trained on the initial set of features, and the importance of each feature is obtained through its coef_ or feature_importances_ attribute.
Then, the least important features are pruned from the current set. The procedure is repeated recursively on the pruned set until the desired number of features is reached.

RFECV performs RFE in a cross-validation loop to find the optimal number of features.
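
Neither RFE nor RFECV is demonstrated in the original text; the following is a minimal sketch, assuming a LogisticRegression on the iris data as the external estimator (any estimator exposing coef_ or feature_importances_ would work):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# RFE: recursively drop the weakest feature until only 2 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)
X_rfe.shape, rfe.support_, rfe.ranking_

# RFECV: let 5-fold cross-validation choose how many features to keep.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), cv=5)
rfecv.fit(X, y)
rfecv.n_features_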

1.13.4 Feature selection using SelectFromModel

SelectFromModel is a meta-transformer that can be used with any estimator that has a coef_ or feature_importances_ attribute after fitting.
Features are considered unimportant and removed if their corresponding coef_ or feature_importances_ values are below the given threshold parameter.
Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold from a string argument.
Available heuristics are "mean", "median", and float multiples of these, such as "0.1*mean".

The sketch below illustrates the string threshold heuristic; for more complete usage examples, refer to the sections that follow.
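
Here is a minimal sketch (an addition, not from the original guide), assuming a RandomForestClassifier fitted on the iris data; features whose importance falls below the median importance are dropped:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
# Keep only the features whose importance is at least the median importance.
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0),
                           threshold="median")
X_new = selector.fit_transform(X, y)
X.shape, X_new.shape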

1.13.4.1 L1-based feature selection

Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero.
When the goal is to reduce the dimensionality of the data for use with another classifier, they can be combined with feature_selection.SelectFromModel to select the non-zero coefficients.
In particular, sparse estimators well suited to this purpose are linear_model.Lasso for regression, and linear_model.LogisticRegression and svm.LinearSVC for classification:

from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
X.shape, X_new.shape

((150, 4), (150, 3))

With SVMs and logistic regression, the parameter C controls the sparsity: the smaller C, the fewer features selected. With Lasso, the higher the alpha parameter, the fewer features selected.
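
To make the effect of C concrete, here is a small sketch (an addition; the particular C values and max_iter are arbitrary) that counts how many iris features survive L1-based selection as C shrinks:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
# A smaller C means a stronger L1 penalty, hence more zero coefficients
# and fewer features retained by SelectFromModel.
for C in (0.005, 0.01, 0.1, 1.0):
    lsvc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    n_kept = SelectFromModel(lsvc, prefit=True).transform(X).shape[1]
    print(C, n_kept)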

1.13.4.2 Tree-based feature selection

Tree-based estimators (see the sklearn.tree module and the forests of trees in sklearn.ensemble) can be used to compute feature importances, which in turn can be used to discard irrelevant features (when combined with the SelectFromModel meta-transformer):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
clf = ExtraTreesClassifier(n_estimators=50)
clf = clf.fit(X, y)

model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X.shape, X_new.shape, clf.feature_importances_

((150, 4), (150, 2), array([0.09394361, 0.05591118, 0.42796637, 0.42217885]))

1.13.5 Feature selection as part of a pipeline

Feature selection is usually used as a preprocessing step before the actual learning. The recommended way to do this in scikit-learn is with a sklearn.pipeline.Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)

In this snippet, sklearn.svm.LinearSVC is combined with sklearn.feature_selection.SelectFromModel to evaluate feature importances and select the most relevant features.
Then a sklearn.ensemble.RandomForestClassifier is trained on the transformed output, i.e. using only the relevant features.
Similar operations can be performed with the other feature selection methods, and with any classifier that provides a way to evaluate feature importances.

Complete example

An example showing univariate feature selection.

Noisy (non-informative) features are added to the iris data and univariate feature selection is applied.
For each feature, we plot the p-values of the univariate feature selection and the corresponding weights of an SVM.
We can see that univariate feature selection picks out the informative features and that these have larger SVM weights.

In the total set of features, only the first four are significant, and we can see that they receive the highest univariate scores.
The SVM assigns large weights to these features, but also selects many non-informative ones.
Applying univariate feature selection before the SVM increases the SVM weight attributed to the significant features, and thereby improves classification.

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets, svm
from sklearn.feature_selection import SelectPercentile, f_classif

# #############################################################################
# Import some data to play with

# The iris dataset
iris = datasets.load_iris()

# Some noisy data not correlated
E = np.random.uniform(0, 0.1, size=(len(iris.data), 20))

# Add the noisy data to the informative features
X = np.hstack((iris.data, E))
y = iris.target

plt.figure(1)
plt.clf()

X_indices = np.arange(X.shape[-1])

# #############################################################################
# Univariate feature selection with F-test for feature scoring
# We use the default selection function: the 10% most significant features
selector = SelectPercentile(f_classif, percentile=10)
selector.fit(X, y)
scores = -np.log10(selector.pvalues_)
scores /= scores.max()
plt.bar(X_indices - .45, scores, width=.2,
        label=r'Univariate score ($-Log(p_{value})$)', color='darkorange',
        edgecolor='black')

# #############################################################################
# Compare to the weights of an SVM
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

svm_weights = (clf.coef_ ** 2).sum(axis=0)
svm_weights /= svm_weights.max()

plt.bar(X_indices - .25, svm_weights, width=.2, label='SVM weight',
        color='navy', edgecolor='black')

clf_selected = svm.SVC(kernel='linear')
clf_selected.fit(selector.transform(X), y)

svm_weights_selected = (clf_selected.coef_ ** 2).sum(axis=0)
svm_weights_selected /= svm_weights_selected.max()

plt.bar(X_indices[selector.get_support()] - .05, svm_weights_selected,
        width=.2, label='SVM weights after selection', color='c',
        edgecolor='black')


plt.title("Comparing feature selection")
plt.xlabel('Feature number')
plt.yticks(())
plt.axis('tight')
plt.legend(loc='upper right')
plt.show()

Automatically created module for IPython interactive environment

Reference material

scikit-learn User Guide, 1.13 Feature selection

Origin www.cnblogs.com/yanqiang/p/11781377.html