[Machine Learning] Summary of Feature Selection Methods

1. Summary

(Figure: summary of feature selection methods)

2. Introduction

When working with structured data, feature selection is a very important part of feature engineering. Feature selection means keeping the features that are important to the model. Its benefits are:

● Reduce the size of training data and speed up model training.

● Reduce model complexity and avoid overfitting.

● Fewer features make the model easier to interpret.

● With the right subset of features, model accuracy may even improve.

Feature selection methods are divided into three categories: Filter, Wrapper, and Embedded.

3. Filter Method

(Figure: diagram of the filter method)
Filter method: features are selected without reference to any specific model, based on their general statistical properties, such as correlation with the target, correlation among features, and divergence.

● Advantages: The computational cost of feature selection is small, and it can effectively avoid overfitting.

● Disadvantage: the feature subset is chosen without considering the learner that will use it later, so it may weaken the learner's fitting ability.

When we examine variables with the filter method, we judge whether a variable should be filtered out both from the properties of the variable itself (univariate) and from its relationships with other variables (multivariate).
(Figure: univariate and multivariate filter criteria)

3.1 Univariate

(1) Missing Percentage

If the proportion of missing values is too large and the feature is difficult to impute, it is recommended to remove the variable.

(2) Variance

If the variance of a continuous variable is close to 0, its values are almost constant, which is of little help to the model. It is recommended to remove the variable.

(3) Frequency

If the distribution of a categorical variable is concentrated almost entirely on a single level, it is recommended to remove the variable.

Here is an example using the Boston house price dataset. The sample code is as follows:

# load the Boston house price dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2, requires an older version
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)


# Missing percentage + variance of each feature
stat_df = pd.DataFrame({
    '# of miss': df.isnull().sum(),
    '% of miss': df.isnull().sum() / len(df) * 100,
    'var': df.var()})
print(stat_df)


# Frequency distribution of the categorical variable CHAS
cat_name = 'CHAS'
chas = df[cat_name].value_counts().sort_index()
cat_df = pd.DataFrame({
    'enumerate_val': list(chas.index),
    'frequency': list(chas.values)})
sns.barplot(x="enumerate_val", y="frequency", data=cat_df, palette="Set3")
for x, y in zip(range(len(cat_df)), cat_df.frequency):
    plt.text(x, y, '%d' % y, ha='center', va='bottom', color='grey')
plt.title(cat_name)
plt.show()

(Figure: CHAS frequency bar plot and the statistics table produced by the code above)
NOX has a low variance and the frequency distribution of CHAS is severely imbalanced, so both can be considered for removal.
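If you want to automate the variance check, sklearn's VarianceThreshold can drop near-constant features. Below is a minimal sketch, assuming the df from the code above and an arbitrary illustrative threshold; note that raw variance is scale-dependent, so the threshold should be chosen with the feature scales in mind.

# drop features whose variance falls below a (hypothetical) threshold
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.05)   # 0.05 is an arbitrary example value
X_high_var = selector.fit_transform(df)
kept_features = df.columns[selector.get_support()]
print('kept features:', list(kept_features))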

3.2 Multivariate

When studying the relationship between multiple variables, we mainly start from two relationships:

● Correlation among independent variables: the higher the correlation, the more severe the multicollinearity problem, which hurts the stability of the model; a small perturbation of the samples can cause large changes in the parameters [5]. Among a group of linearly correlated features, keep one and remove the rest.

● Correlation between an independent variable and the dependent variable: the higher the correlation, the more important the feature is for predicting the target, so it is recommended to keep it.

Since the variables are divided into continuous variables and categorical variables, different methods should be used when studying the relationship between variables:

3.2.1 Continuous vs Continuous

(1) Pearson Correlation Coefficient

The Pearson correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The covariance reflects the degree of correlation between two random variables (positive covariance means positive correlation, negative covariance means negative correlation); dividing by the standard deviations bounds the Pearson coefficient in [-1, 1]. As the linear relationship between the two variables strengthens, the coefficient tends toward 1 or -1, and the sign indicates whether the correlation is positive or negative. [6]
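As a minimal sketch (assuming the Boston df from the earlier example; RM and LSTAT are just two of its continuous columns):

# Pearson correlation between two continuous variables
from scipy import stats

r, p_value = stats.pearsonr(df['RM'], df['LSTAT'])
print('Pearson r = %.3f, p-value = %.3g' % (r, p_value))

# or compute the full correlation matrix with pandas
print(df.corr(method='pearson'))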

(2) Spearman's Rank Correlation Coefficient

The Pearson correlation coefficient assumes that the variables are normally distributed, while the Spearman correlation coefficient makes no assumption about the distribution: it measures correlation based on ranks. If a variable is an ordinal feature (Ordinal Feature), the Spearman correlation coefficient is recommended.

The Spearman coefficient is defined as

$$\rho_s = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the ranks of the two variables for sample $i$, and $n$ is the number of samples (ranks). A small example makes this easier to follow: suppose we want to compute the Spearman correlation coefficient of two continuous variables; the calculation proceeds as shown below.

(Figure: worked example of the Spearman rank correlation calculation)

As with Pearson, the coefficient tends toward 1 or -1 as the monotonic relationship strengthens, and the sign indicates whether the correlation is positive or negative.
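A minimal sketch with scipy (same assumed df and columns as above):

# Spearman rank correlation: based on ranks, no normality assumption
from scipy import stats

rho, p_value = stats.spearmanr(df['RM'], df['LSTAT'])
print('Spearman rho = %.3f, p-value = %.3g' % (rho, p_value))

# pandas equivalent
print(df.corr(method='spearman'))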

3.2.2 Continuous vs Categorical

(1) Analysis of variance (ANOVA)

The purpose of ANOVA is to test whether the means differ significantly across groups. For example, suppose we want to judge whether the average math scores of students in classes 1, 2 and 3 differ significantly. Here the class is a categorical variable and the math score is a continuous variable. If class and math score are correlated, for example the students in class 1 are better at mathematics, then the average math scores of the different classes differ significantly. To verify the correlation between class and math score, ANOVA first sets up the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$ (the mean math scores of the three classes do not differ significantly), and then checks whether the between-group variance (Mean Squared Between, MSB) is greater than the within-group variance (Mean Squared Error, MSE). If the between-group variance is much larger than the within-group variance, at least one group's distribution lies far from the others, and we can consider rejecting the null hypothesis.

Let's try a small worked example:

(Figure: worked ANOVA calculation example)
Note that three assumptions must be met before running ANOVA: the groups have homogeneous variances, the samples within each group are normally distributed, and the samples are independent.
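A minimal sketch of a one-way ANOVA with scipy; the three score lists are made up for illustration and are not data from this article.

# one-way ANOVA: H0 is that all group means are equal
from scipy import stats

class_1 = [78, 82, 85, 90, 76]   # hypothetical math scores of class 1
class_2 = [80, 79, 88, 84, 81]   # hypothetical math scores of class 2
class_3 = [95, 92, 89, 97, 93]   # hypothetical math scores of class 3

f_stat, p_value = stats.f_oneway(class_1, class_2, class_3)
print('F = %.3f, p-value = %.3g' % (f_stat, p_value))
# a small p-value (e.g. < 0.05) suggests rejecting H0: at least one class mean differs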

(2) Kendall tau rank correlation coefficient

Suppose we want to evaluate the correlation between education and salary. The Kendall coefficient sorts the samples by education. If, after sorting, the rankings by education and by salary agree exactly, the Kendall coefficient is 1 and the two variables are positively correlated. If the education and salary rankings are completely opposite, the coefficient is -1, a complete negative correlation. If education and salary are completely independent, the coefficient is 0. The Kendall coefficient is calculated as follows:

$$\tau = \frac{C - D}{\tfrac{1}{2}n(n-1)}$$

where $C$ is the number of concordant (same-order) pairs, $D$ is the number of discordant (out-of-order) pairs, and $\tfrac{1}{2}n(n-1)$ is the total number of pairs. Again, a worked example shows the calculation process:

(Figure: worked example of the Kendall tau calculation)
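A minimal sketch with scipy, using hypothetical ordinal education codes and salaries (illustrative values only):

# Kendall tau between an ordinal feature and a numeric target
from scipy import stats

education = [1, 2, 2, 3, 4, 4, 5]          # hypothetical ordinal education levels
salary    = [30, 35, 32, 40, 50, 48, 70]   # hypothetical salaries

tau, p_value = stats.kendalltau(education, salary)
print('Kendall tau = %.3f, p-value = %.3g' % (tau, p_value))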

3.2.3 Categorical vs Categorical

(1) Chi-squared Test

The chi-square test can be used to test the association between two categorical variables. The null hypothesis it establishes is that there is no correlation between the two variables. The formula for calculating the chi-square value is as follows:

$$\chi^2 = \sum \frac{(A - T)^2}{T}$$

where $A$ is the actual (observed) value and $T$ is the theoretical (expected) value. The chi-square value measures the degree of difference between theory and observation. For example, to study whether exercising is related to getting injured, the calculation proceeds as follows:

(Figure: worked example of the chi-square calculation)

A high chi-square value indicates a greater likelihood of correlation between the two variables.
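A minimal sketch with scipy on a hypothetical 2x2 contingency table (the exercise/injury counts below are invented for illustration):

# chi-squared test of independence between two categorical variables
import numpy as np
from scipy import stats

#                  injured  not injured
table = np.array([[20,      80],           # exercises (hypothetical counts)
                  [35,      65]])          # does not exercise (hypothetical counts)

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print('chi2 = %.3f, p-value = %.3g, dof = %d' % (chi2, p_value, dof))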

(2) Mutual Information

Mutual information is a measure of the degree of interdependence between variables, and its calculation formula is as follows:

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$$

It can also be written in terms of entropy: $I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)$, where $H(X|Y)$ is the conditional entropy and $H(X,Y)$ is the joint entropy. When $X$ and $Y$ are independent, $p(x,y) = p(x)p(y)$, so the mutual information is 0; when the two variables are identical, the mutual information is maximal. Thus the larger the mutual information, the stronger the correlation between the variables. Furthermore, mutual information is non-negative and symmetric (i.e. $I(X;Y) = I(Y;X)$).

(Figure: worked example of the mutual information calculation)
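A minimal sketch with sklearn's mutual_info_score on two hypothetical categorical arrays:

# mutual information between two categorical variables (in nats)
import numpy as np
from sklearn.metrics import mutual_info_score

x = np.array([0, 0, 1, 1, 2, 2, 0, 1])   # hypothetical category labels
y = np.array([0, 0, 1, 1, 1, 1, 0, 1])

print('MI(x, y) =', mutual_info_score(x, y))
print('MI(y, x) =', mutual_info_score(y, x))   # symmetry: same value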

3.3 Summary of filtering methods

To summarize the above content, as shown in the following figure:

(Figure: summary table of the filter method indicators)
We can use the indicators above as needed to examine the correlations between variables and then select features manually. Alternatively, scikit-learn's feature selection module sklearn.feature_selection can be used. Here SelectKBest (select the top K features with the highest scores) is taken as an example:

# load the Boston house price dataset
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2, requires an older version
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression


boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
target = pd.DataFrame(boston.target, columns=['MEDV'])
print('X.shape:', df.shape)


# select features by univariate F-test (f_regression scores are based on the correlation with the target)
X = np.array(df)
Y = np.array(target)
skb = SelectKBest(score_func=f_regression, k=5)
skb.fit(X, Y.ravel())
print('Selected features:', [boston.feature_names[i] for i in skb.get_support(indices=True)])
X_selected = skb.transform(X)
print('X_selected.shape:', X_selected.shape)


4. Wrapper Method

(Figure: diagram of the wrapper method)
Wrapper method: the performance of the learner that will actually be used serves as the evaluation criterion for candidate feature subsets; the goal is to select a feature subset "tailor-made" for the given learner.

● Pros: Feature selection is more targeted than filtering and is good for model performance.

● Disadvantage: more computational overhead.

The wrapper method uses the following three types of search strategies:
(Figure: the three search strategies of the wrapper method)

1. Full search

Enumerate all possible feature subsets, train the model on each, and select the subset with the best model score. Not recommended: with m features there are 2^m possible subsets, so the computational cost is prohibitive.

2. Heuristic search

Heuristic search is a method to continuously narrow the search space by using heuristic information. In feature selection, model scores or feature weights can be used as heuristic information.

2.1 Forward/backward search

Forward search starts from an empty set, adds one feature per round, and trains the model; if the evaluation score improves, the feature added in that round is kept, otherwise it is discarded. Backward search does the opposite kind of subtraction: it starts from the full feature set and removes one feature per round; if model performance drops, the feature is kept, otherwise it is discarded.
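scikit-learn (0.24+) ships SequentialFeatureSelector, a greedy forward/backward search driven by cross-validated scores; it is a close relative of the procedure described above (it always adds or removes the best-scoring feature each round rather than testing for an improvement). A minimal sketch on the iris data:

# greedy forward selection with cross-validated scoring
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2,
                                direction='forward',   # use 'backward' for backward search
                                cv=5)
sfs.fit(X_iris, y_iris)
print('selected feature mask:', sfs.get_support())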

2.2 Recursive Feature Elimination

Recursive Feature Elimination (RFE) trains a base model over multiple rounds; after each round, the features with the lowest weights (feature coefficients or feature importances) are eliminated, and the next round is trained on the remaining feature set [1]. When using RFE, the number of features to keep (n_features_to_select) must be specified in advance, and it is hard to set this hyperparameter well on the first try: set it too high and redundant features remain, set it too low and relatively important features may be filtered out. Moreover, RFE selects features only by their weights without considering model performance, which is why RFECV exists. RFECV is RFE + CV (cross-validation): it first runs RFE to obtain a ranking of all features, then, based on that ranking, trains and cross-validates the model on feature subsets of every size in [min_features_to_select, len(features)], and finally keeps the feature subset with the highest average score.

(Figure: RFECV workflow)
There is no need to reinvent the wheel here; the sample code below is provided by wanglei5205:

### Generate data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000,         # number of samples
                           n_features=25,          # total number of features
                           n_informative=3,        # number of informative features
                           n_redundant=2,          # number of redundant features (random combinations of informative ones)
                           n_repeated=0,           # number of duplicated features (copies of informative/redundant ones)
                           n_classes=8,            # number of classes
                           n_clusters_per_class=1, # number of clusters per class
                           random_state=0)


### Feature selection
# RFE
from sklearn.svm import SVC
svc = SVC(kernel="linear")


from sklearn.feature_selection import RFE
rfe = RFE(estimator = svc,           # base estimator
          n_features_to_select = 2,  # number of features to select
          step = 1,                  # number of features removed per iteration
          verbose = 0                # verbosity of intermediate output
          ).fit(X,y)
X_RFE = rfe.transform(X)
print("RFE feature selection results --------------------------------------")
print("Number of selected features: %d" % rfe.n_features_)
print("Ranking of all features: %s" % list(rfe.ranking_))


# RFECV
from sklearn.svm import SVC
svc = SVC(kernel="linear")


from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
rfecv = RFECV(estimator=svc,            # base estimator
              min_features_to_select=2, # minimum number of features to select
              step=1,                   # number of features removed per iteration
              cv=StratifiedKFold(2),    # cross-validation strategy (2 folds)
              scoring='accuracy',       # scoring metric for the estimator
              verbose = 0,
              n_jobs = 1
              ).fit(X, y)
X_RFECV = rfecv.transform(X)
print("RFECV feature selection results ------------------------------------")
print("Number of selected features: %d" % rfecv.n_features_)
print("Ranking of all features: %s" % list(rfecv.ranking_))

3. Random search

3.1 Random feature subset

Randomly sample multiple feature subsets, evaluate model performance on each, and keep the subset with the highest evaluation score.

3.2 Null Importance

Kaggle Grandmaster Olivier proposed the Null Importance feature selection method 3 years ago. Having read the code recently, I think it is wonderful: it successfully finds and removes the features that "swing with the wind". What are these wind-swinging features? They are typically strong identifier-like features or features full of noise. For example, suppose we add userID as a feature to a model that predicts which consumer group each user belongs to. An overfitted model can learn a direct mapping from userID to consumer group (in effect, the model simply memorizes which group each userID belongs to). If we then shuffle the labels and retrain on these fake labels, the model will again map userID directly onto the shuffled labels, so under both the real and the fake labels the userID feature "swings with the wind" and makes itself look like the most important feature. How do we find such features? Olivier's idea is simple: a truly robust, stable and important feature must be important under the real labels, but once the labels are shuffled, the importance of such a high-quality feature should drop sharply. Conversely, if a feature performs only moderately under the original labels yet its importance actually rises after the labels are shuffled, it is clearly unreliable, and this kind of "wind-swinging" feature must be eliminated.

The calculation process of Null Importance is roughly as follows:

(1) Run the model on the original data set to obtain feature importance;

(2) Shuffle the labels several times and, after each shuffle, record the feature importance obtained under the shuffled (null) labels;

(3) Calculate the difference in feature importance between the real and the shuffled labels, and filter features based on this difference.

(Figure: the Null Importance workflow)
From the above we know the general procedure of Null Importance, but a few details deserve attention. For the importance measure, either importance_gain or importance_split can be chosen. In addition, as shown in Figure 14, to compare the feature importance under the original labels and the shuffled labels, Olivier provides two comparison methods:

The first method: quantile comparison.

The actual importance is compared, on a log scale, with a quantile of the feature's null-importance distribution; the 1 added to the denominator avoids division by zero when that quantile is 0. A sample output is as follows:

(Figure: sample output of the quantile-comparison scores)
The second method: frequency ratio.

This score is the percentage of shuffles in which a feature's null importance falls below the 25th percentile of its actual importances. Normally a feature has only a single actual importance value; the reason the author takes the 25th percentile is that the model can also be retrained several times on the original labels to produce multiple sets of actual importance, and the percentile summarizes that distribution. The output sample is as follows:
(Figure: sample output of the frequency-ratio scores)
As can be seen above, the feature score produced by the second method lies in the range 0-100, so Olivier chooses the second method, filters features with different thresholds, and evaluates model performance at each threshold. Olivier's open-source code is easy to follow and well worth reading.
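For intuition only, here is a condensed sketch of the idea. It is not Olivier's original LightGBM implementation: it uses a random forest and a log-ratio of the actual importance against a high percentile of the null importances, in the spirit of the first comparison method above.

# minimal Null Importance sketch (illustrative, not Olivier's original code)
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

def feature_importances(X, y, seed=0):
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    return rf.feature_importances_

# importance under the real labels
actual_imp = feature_importances(X, y)

# importance under shuffled (null) labels, repeated several times
rng = np.random.RandomState(0)
null_imp = np.array([feature_importances(X, rng.permutation(y), seed=i) for i in range(20)])

# score each feature: actual importance vs. the 75th percentile of its null importances
score = np.log(1e-10 + actual_imp / (1 + np.percentile(null_imp, 75, axis=0)))
print(pd.Series(score).sort_values(ascending=False))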

4.1 Summary of the wrapper method

In practice, RFECV and Null Importance are recommended because they take both feature weights and model performance into account.

5. Embedded Method

(Figure: diagram of the embedded method)
Embedded method: feature selection is embedded into the learner's training process. Unlike the filter and wrapper methods, there is no clear separation between feature selection and learner training. [4]

● Advantages: it saves time and effort compared with the wrapper method, leaving feature selection to the model itself during training.

● Disadvantage: it adds to the burden of model training.

Common embedded methods include the L1 penalty term of LASSO and the feature subsets that random forests select when building each tree. Using the embedded method is fairly straightforward: sklearn provides SelectFromModel, which selects features directly through a model. Reference sample code is as follows:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier


iris = load_iris()


# Feature selection with L1-penalized logistic regression as the base model
selected_data_lr = SelectFromModel(LogisticRegression(penalty='l1', C = 0.1, solver = 'liblinear'), max_features = 3).fit_transform(iris.data, iris.target)


# Feature selection with GBDT as the base model
selected_data_gbdt = SelectFromModel(GradientBoostingClassifier(), max_features = 3).fit_transform(iris.data, iris.target)


print(iris.data.shape)
print(selected_data_lr.shape)
print(selected_data_gbdt.shape)


6. Summary

When performing feature selection, it is recommended to try the filter, wrapper and embedded methods as appropriate; filtering features early reduces the model's learning burden. Of course, the most sophisticated feature selection is still manual selection based on business knowledge. Even for features chosen by the methods above, it is worth thinking about why a given feature helps the model, and whether the selected high-quality features can be mined further.

Origin blog.csdn.net/wzk4869/article/details/128748981