Feature Selection

How do we find the features a model needs? First of all, find business experts in the field and ask them for advice. For example, if we need to solve a classification problem about a drug's efficacy, we should first find experts in that field and consult them on which factors (features) affect the efficacy, which have a larger impact and which a smaller one. These factors form our candidate feature set. (Excerpt from: https://www.cnblogs.com/pinard/p/9032759.html )

 

The above describes selecting features from a business perspective, which is the most important method.

 

In addition, from a technical viewpoint, feature selection methods fall into three categories:

  • Filter: features are screened by their individual statistical properties. Each feature is scored according to its divergence (e.g. variance) or its correlation with the target, and features are selected by setting a threshold on the score.
  • Wrapper: essentially an iterative method. Guided by the model's prediction performance, a feature is added to or removed from the model at each step, until a specified stopping condition is reached.
  • Embedded: a model-based approach. A model is trained to obtain a weight coefficient for each feature, and features are selected in descending order of these coefficients. Feature selection is itself integrated into the model training process.

 

Filter

(The following section is adapted from: https://www.zhihu.com/question/28641663/answer/110165221 )

1. Variance

The basic principle: remove features with low variance. If a feature does not diverge, for example its variance is close to zero, the samples show essentially no difference on this feature, so it is of no use for distinguishing samples. The specific method: compute the variance of each feature and select the features whose variance is greater than a threshold.

Code for selecting features with the VarianceThreshold class in sklearn's feature_selection module is as follows:

from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()  # the iris dataset is used in all of the examples below

# Variance threshold method; returns the data after feature selection
# The parameter threshold is the variance threshold
VarianceThreshold(threshold=3).fit_transform(iris.data)
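As a follow-up, the fitted selector can be inspected to see which features pass the threshold; a minimal self-contained sketch (variances_ and get_support() are standard sklearn attributes):

from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

iris = load_iris()

selector = VarianceThreshold(threshold=3)
X_new = selector.fit_transform(iris.data)

print(selector.variances_)     # variance of each of the four iris features
print(selector.get_support())  # boolean mask of the features whose variance exceeds the threshold
print(X_new.shape)             # shape of the reduced feature matrix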

 

2. Pearson correlation coefficient

The basic principle: remove features that are irrelevant to the target y. The specific method: compute the Pearson correlation coefficient between each feature and the target, together with the p-value of the correlation, and select the features that are significantly correlated with the target.

Code for selecting features by combining the SelectKBest class in sklearn's feature_selection module (which selects the k strongest features) with the Pearson correlation coefficient is as follows:

from numpy import array
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest

# Select the K best features; returns the data after feature selection
# The first argument is a scoring function: given the feature matrix and the target, it returns a
# tuple (scores, p-values) in which the i-th entries are the score and p-value of the i-th feature.
# Here it is defined to compute the Pearson correlation coefficient of each feature with the target.
# The parameter k is the number of features to select
SelectKBest(lambda X, Y: tuple(array([pearsonr(x, Y) for x in X.T]).T), k=2).fit_transform(iris.data, iris.target)

You can also use the SelectPercentile class to select the strongest k% of features. (The same applies to the methods below.)

Note: the Pearson correlation coefficient can only measure linear correlation between x and y.
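A tiny illustration of this limitation (synthetic data, my own example rather than something from the original post): even a perfectly deterministic but nonlinear relationship can yield a Pearson coefficient of essentially zero.

import numpy as np
from scipy.stats import pearsonr

# y is completely determined by x, but the relationship is quadratic rather than linear
x = np.linspace(-1, 1, 101)
y = x ** 2

r, p = pearsonr(x, y)
print(r)  # approximately 0: the linear correlation misses the dependence entirely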

 

3. Chi-square test

The basic principle: use the chi-square test of independence; if there is an association between two categorical variables, they are not independent of each other. The specific method: suppose the independent variable has N possible values and the dependent variable has M possible values; consider the gap between the observed frequency and the expected frequency of samples where the independent variable equals i and the dependent variable equals j, and construct the test statistic from it.
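For reference, the statistic described above is the standard chi-square statistic (the notation below is mine): writing A_ij for the observed number of samples with independent-variable value i and dependent-variable value j, and E_ij for the count expected under independence,

$$\chi^2 = \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$$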

Code for combining the SelectKBest class in sklearn's feature_selection module with the chi-square test to select features is as follows:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Select the K best features; returns the data after feature selection
SelectKBest(score_func=chi2, k=2).fit_transform(iris.data, iris.target)

Note: the chi-square test applies only when both x and y are categorical variables.

 

4. F test

The basic principle: use one-way analysis of variance. If the between-group sum of squares accounts for a large proportion of the variable's total sum of squares, the variation of the observed variable is mainly caused by the control variable. The specific method: divide the between-group mean square by the within-group mean square to construct the statistic.
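For reference, writing k for the number of groups (the classes of y) and n for the total number of samples (notation chosen here for illustration), the statistic described above is

$$F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (n - k)}$$

where SS_between and SS_within are the between-group and within-group sums of squares.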

Code for combining the SelectKBest class in sklearn's feature_selection module (which selects the k strongest features) with f_classif to select features is as follows:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# Select the K best features by the ANOVA F-value; returns the data after feature selection
SelectKBest(score_func=f_classif, k=4).fit_transform(iris.data, iris.target)

Note: f_classif applies when y is a categorical variable; if y is numeric, use f_regression instead, which is then equivalent to using the Pearson correlation coefficient (it ranks features by their squared correlation with y).
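A minimal sketch of the regression case (the diabetes dataset is my own choice of example, not one used in the original post):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# The diabetes target is numeric, so f_regression is the appropriate score function
diabetes = load_diabetes()
X_new = SelectKBest(score_func=f_regression, k=4).fit_transform(diabetes.data, diabetes.target)
print(X_new.shape)  # 4 of the original 10 features remain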

 

5. Mutual information

Mutual information can also evaluate the correlation between a qualitative independent variable and a qualitative dependent variable. From the perspective of information gain, the mutual information is the amount by which introducing x reduces the uncertainty of y. The larger the mutual information, the greater the correlation between the independent and dependent variables. Mutual information is calculated as follows:

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$

Code for combining the SelectKBest class in sklearn's feature_selection module (which selects the k strongest features) with mutual_info_classif to select features is as follows:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

# Select the K best features by mutual information; returns the data after feature selection
SelectKBest(score_func=mutual_info_classif, k=4).fit_transform(iris.data, iris.target)

 

However, using mutual information directly for feature selection is not very convenient: (1) it is not a normalized metric and there is no way to normalize it, so results on different datasets cannot be compared; (2) it is inconvenient to compute for continuous variables, which usually need to be discretized first, and the mutual information result is very sensitive to the discretization scheme.

Therefore, to handle quantitative data, the maximal information coefficient (MIC) was proposed. It first searches for an optimal discretization, and then converts the mutual information value into a metric that lies in the interval [0, 1].

Code for combining the SelectKBest class in sklearn's feature_selection module with the maximal information coefficient (the MIC function is provided by minepy) to select features is as follows:

from numpy import array
from minepy import MINE
from sklearn.feature_selection import SelectKBest

# Because MINE's design is not functional, wrap its mic method in a scoring function that
# returns a (score, p-value) tuple, with the p-value fixed at 0.5
def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    return (m.mic(), 0.5)

# Select the K best features; returns the data after feature selection
SelectKBest(lambda X, Y: tuple(array([mic(x, Y) for x in X.T]).T), k=2).fit_transform(iris.data, iris.target)

Note: mutual information can measure nonlinear as well as linear relationships between x and y.

 

Wrapper

1. Recursive feature elimination (RFE)

Recursive feature elimination uses a base model to perform multiple rounds of training. In each round, the features with the smallest weight coefficients are eliminated, and the next round of training is based on the new feature set. First, the model is trained on the original features and each feature receives a weight; then the features whose weights have the smallest absolute value are removed from the feature set. This is repeated recursively until the number of remaining features reaches the required number. RFE's performance depends heavily on the base model, so if the base model is stable, the quality of the selected features will also be stable.

The stepwise regression described earlier (see: https://www.cnblogs.com/HuZihu/p/12329998.html ) is a form of recursive feature elimination; that article dealt with linear regression models, but if the idea is extended to other models, cross-validated prediction error is used to evaluate the models.

Code for selecting features with the RFE class in sklearn's feature_selection module is as follows:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursive feature elimination; returns the data after feature selection
# The parameter estimator is the base model
# The parameter n_features_to_select is the number of features to select
RFE(estimator=LogisticRegression(), n_features_to_select=2).fit_transform(iris.data, iris.target)
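To see which features survive and in what order the rest were dropped, the fitted RFE object exposes support_ and ranking_; a small follow-up sketch on the same iris data:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Fit RFE explicitly so that the elimination result can be inspected
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(iris.data, iris.target)
print(rfe.support_)  # boolean mask of the two features that were kept
print(rfe.ranking_)  # selected features have rank 1; larger ranks were eliminated earlier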

 

Embedded

1. Feature selection based on tree models

At present, for classification problems the Gini index or information gain is generally used, and for regression problems MSE (mean squared error) or RMSE (root mean squared error) is generally used.

Code for combining the SelectFromModel class in sklearn's feature_selection module with a GBDT model to select features is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier

# GBDT model as the base model for feature selection
SelectFromModel(GradientBoostingClassifier()).fit_transform(iris.data, iris.target)
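The selection here is driven by the tree model's feature importances; a small follow-up sketch showing how to inspect them through the fitted selector (estimator_, threshold_ and get_support() are standard sklearn attributes):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()

# Fit the selector explicitly so the importances and the threshold can be inspected
selector = SelectFromModel(GradientBoostingClassifier()).fit(iris.data, iris.target)
print(selector.estimator_.feature_importances_)  # importance of each feature in the fitted GBDT
print(selector.threshold_)                       # by default, the mean of the importances
print(selector.get_support())                    # features whose importance reaches the threshold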
 

2. Regularization

This refers to L1 regularization, which essentially sets the weights of some features to zero. Using a base model with an L1 penalty term not only selects features but also trains the model at the same time.

Code for combining the SelectFromModel class in sklearn's feature_selection module with a logistic regression model that uses an L1 penalty term to select features is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Logistic regression with an L1 penalty term as the base model for feature selection
# (solver="liblinear" is specified because the default solver does not support the L1 penalty)
SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit_transform(iris.data, iris.target)
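To see why this acts as feature selection, it helps to look at the fitted coefficients directly: with a strong L1 penalty some coefficients become exactly zero, and SelectFromModel drops the features whose coefficients have been zeroed out. A small sketch on iris:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# A small C means a strong L1 penalty, which drives some coefficients exactly to zero
clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(iris.data, iris.target)
print(clf.coef_)  # one row per class; zero entries mark features the penalty has eliminated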

 

 

Advantages and disadvantages of the three categories of feature selection methods:

  • The filter method scores each feature with a statistical index and filters features accordingly; it focuses on the characteristics of the data itself. Its advantage is that it is fast to compute and does not depend on a specific model. Its disadvantage is that the chosen statistical index is not tailored to a particular model, so the final accuracy may not be high; and because the statistical tests are univariate, interactions between features are not considered and interaction terms cannot be selected.
  • The wrapper method uses a model to screen features: by repeatedly adding or removing features and testing the model's accuracy on a validation set, it searches for the optimal feature subset. Because the model is directly involved, its accuracy is usually higher, but since the model must be retrained every time a feature changes, the computational overhead is large; another drawback is that it is prone to overfitting.
  • The embedded method uses properties of the model itself and embeds feature selection into the model construction process. Typical examples are Lasso and tree models. Its accuracy is high and its computational complexity lies between the filter and wrapper methods, but its drawback is that only some models have this capability.

 

Reference: https://www.cnblogs.com/massquantity/p/10486904.html

 
