Feature engineering in sklearn (filtering, embedding and wrapping)

Table of contents

The first step of feature engineering: understand the business

Filter method

- Variance filtering

- Relevance filtering

  - Chi-square filtering

  - F test

  - Mutual information method

Embedded method (Embedded)

Wrapper method (Wrapper)

The first step of feature engineering: understand the business


If there are only a few features and they are easy to understand, we can judge and select them ourselves, as with the earlier Titanic dataset. However, in real data applications such as finance, medical care, and e-commerce, the data rarely has features as few and as obvious as the Titanic data. So what do we do in the extreme case where we cannot rely on our understanding of the business to select features? We have four methods that can be used to select features: filtering, embedding, wrapping, and dimensionality reduction.

 

Filter method


Filtering methods are often used as a preprocessing step; here feature selection is completely independent of any machine learning algorithm. Features are selected based on their scores in various statistical tests and on various measures of correlation.

Variance filtering


VarianceThreshold
If the variance of a feature is very small, the samples barely differ on that feature. Most of its values are probably the same, and the whole feature may even take a single value, so it contributes nothing to distinguishing samples.
Therefore, whatever feature engineering comes next, features with a variance of 0 should be eliminated first. VarianceThreshold has an important parameter, threshold, the variance cut-off: all features with variance below the threshold are discarded. If it is not set, the default is 0, i.e. only the features whose value is identical in every record are removed.

 

First, import the original data:

import pandas as pd
data = pd.read_csv(r'F:\data\digit recognizer.csv')
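The feature/label split used in the code below follows the one given in the timing comparison later in the post: the first column of this dataset is the label, and the remaining 784 columns are the pixel features.

X = data.iloc[:, 1:]  #all pixel columns as the feature matrix
y = data.iloc[:, 0]   #the first column is the label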

The shape of the original feature matrix X is (42000, 784).
Then we apply variance filtering:

from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold()  #instantiate; without parameters the default threshold is 0
X_var0 = selector.fit_transform(X)  #get the new feature matrix with the unqualified features removed
#Equivalent to: X_var0 = VarianceThreshold().fit_transform(X)

The shape of the feature matrix after variance filtering is (42000, 708).
So far, all features with a variance of 0 have been deleted.
If you only want to keep half of the features, you can do:

import numpy as np
np.median(X.var().values)  #the median of all feature variances
X_fsvar = VarianceThreshold(np.median(X.var().values)).fit_transform(X)

X_fsvar.shape is now (42000, 392).
Of course, if you want to keep only the 50 highest-variance features, you can sort the variances first and set the threshold to the variance of the 50th feature, as sketched below.
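A minimal sketch of that idea. Note that VarianceThreshold keeps features whose variance is strictly greater than the threshold, so using the 51st-largest variance as the cut-off keeps (ties aside) exactly the top 50:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

variances = np.sort(X.var().values)[::-1]  #feature variances, largest first
X_top50 = VarianceThreshold(threshold=variances[50]).fit_transform(X)  #threshold = 51st-largest variance
X_top50.shape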

In addition, when a feature is binary, its values follow a Bernoulli random variable, and its variance can be computed as Var[X] = p(1-p), where p is the proportion of one of the two classes within that feature.
For example, we can decide to delete binary features in which one class accounts for more than 80% of the samples:

X_bvar = VarianceThreshold(.8 * (1 - .8)).fit_transform(X)

The impact of variance filtering on the model
How does all this affect the model? Here is a comparison of the accuracy and running time of KNN and random forest before and after variance filtering.

#KNN vs. random forest: comparison under different variance filtering
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import cross_val_score
import numpy as np

X = data.iloc[:,1:]
y = data.iloc[:,0]
X_fsvar = VarianceThreshold(np.median(X.var().values)).fit_transform(X)

#KNN before variance filtering
#======【TIME WARNING: 35 mins+】======#
cross_val_score(KNN(),X,y,cv=5).mean()

#%%timeit is an IPython magic command that times the code in a cell.
#To estimate the time, it runs the code several times (usually 7) and averages the results,
#so running %%timeit takes far longer than running the cell once.
#======【TIME WARNING: 4 hours】======#
%%timeit
cross_val_score(KNN(),X,y,cv=5).mean()


KNN results before variance filtering: (screenshot in the original post)

KNN after variance filtering:

#======【TIME WARNING: 20 mins+】======#
cross_val_score(KNN(),X_fsvar,y,cv=5).mean()

#======【TIME WARNING: 2 hours】======#
%%timeit
cross_val_score(KNN(),X_fsvar,y,cv=5).mean()

KNN results after variance filtering: (screenshot in the original post)

 

The accuracy improves slightly, while the average running time drops by about 10 minutes; after feature selection the efficiency of the algorithm increases by roughly a third.
What about random forests?
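The corresponding random forest comparison uses the same estimator settings as the rest of the post; the after-filtering call simply swaps X for X_fsvar:

#Random forest before variance filtering
cross_val_score(RFC(n_estimators=10,random_state=0),X,y,cv=5).mean()

#Random forest after variance filtering
cross_val_score(RFC(n_estimators=10,random_state=0),X_fsvar,y,cv=5).mean()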
Random forest results before variance filtering: (screenshot in the original post)

Random forest results after variance filtering: (screenshot in the original post)

The first thing to notice is that random forest's accuracy is slightly lower than KNN's, but its running time is less than 1% of KNN's: it takes only ten-odd seconds. Second, after variance filtering the accuracy of random forest also rises slightly, but the running time barely changes and is still about 11 seconds.
Why does random forest run so fast, and why does variance filtering have so little effect on it? The reason lies in how much computation the two algorithms need. Nearest neighbors (KNN), single decision trees, support vector machines (SVM), neural networks, and regression algorithms all have to traverse every feature or project into higher dimensions to compute, so they are inherently expensive and slow, and feature selection such as variance filtering matters a great deal to them. Algorithms that do not traverse all features, such as random forest, randomly sample features at each split and are fast to begin with, so feature selection makes little difference to them. This is easy to understand: no matter how the filter reduces the number of features, random forest will still sample only a fixed number of features to build each tree. The nearest-neighbor algorithm is different: fewer features mean fewer dimensions in the distance computation, so the model obviously becomes lighter as features are removed. Therefore, the main targets of the filtering method are algorithms that need to traverse features or expand dimensions, and its main purpose is to reduce the cost of computation while maintaining the algorithm's performance.
In general, the effect of variance filtering can be summarized as follows: a small threshold filters out few features, so the model is barely affected and the running time drops only modestly; a large threshold filters out many features, so the running time drops sharply, while the model's performance may improve (if what was removed was mostly noise) or get worse (if useful information was removed along with it).
Relevance filtering


After variance filtering, the next question to consider is correlation. We want to select features that are relevant and meaningful to the label, because such features carry a lot of information; features that are unrelated to the label only waste memory and computation and may introduce noise into the model. In sklearn there are three commonly used ways to judge the correlation between features and the label: the chi-square test, the F test, and mutual information.

Chi-square filtering


Chi-square filtering is a correlation filter designed for discrete labels (i.e., classification problems). The chi-square test class feature_selection.chi2 computes the chi-square statistic between each non-negative feature and the label and ranks features from the highest statistic to the lowest. Combined with feature_selection.SelectKBest, which takes a scoring function and keeps the K highest-scoring features, we can use it to remove the features that are most likely to be independent of the label and therefore irrelevant to our classification goal.
In addition, if the chi-square test detects that all values in a feature are identical, it will prompt us to run variance filtering first. We have just verified that variance filtering with the median threshold (which removes half of the features) improves the model, so here we run the chi-square test on the variance-filtered data obtained with threshold=median. (If variance filtering had reduced the model's performance, we would use the raw data instead.)

 

from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2  #chi-square test

#Assume here that we already know we want 300 features
X_fschi = SelectKBest(chi2, k=300).fit_transform(X_fsvar, y)
cross_val_score(RFC(n_estimators=10,random_state=0),X_fschi,y,cv=5).mean()
cross_val_score(RFC(n_estimators=10,random_state=0),X_fsvar,y,cv=5).mean()

We find that with median-threshold variance filtering the cross-validation score was 0.9388098166696807, but after chi-square filtering it drops to 0.9333098667649198.
This tells us that k=300 is a poor setting: the k value is too small.
So how do we choose k?
One option is to plot a learning curve and let it run:

%matplotlib inline
import matplotlib.pyplot as plt
score = []
for i in range(390,200,-10):
    X_fschi = SelectKBest(chi2, k=i).fit_transform(X_fsvar, y)
    once = cross_val_score(RFC(n_estimators=10,random_state=0),X_fschi,y,cv=5).mean()
    score.append(once)
plt.plot(range(390,200,-10),score)
plt.show()

 

Another way to choose k: look at the p-values.
The essence of the chi-square test is to infer whether two groups of data differ; its null hypothesis is that "the two groups of data are independent of each other". The test returns two statistics, the chi-square value and the p-value. The chi-square value is hard to interpret on its own because it has no fixed effective range, but for the p-value we generally use 0.01 or 0.05 as the significance level, i.e. the decision boundary: when p is at or below the significance level, we reject the null hypothesis and conclude that the feature and the label are correlated; when p is above it, we accept the null hypothesis and treat the feature as independent of the label.

From the feature-engineering point of view, we therefore want features with a large chi-square value and a p-value below 0.05, i.e. features associated with the label. Before calling SelectKBest, we can obtain the chi-square value and p-value of every feature directly from chi2:

chivalue, pvalues_chi = chi2(X_fsvar, y)
#How large should k be? We want to eliminate all features whose p-value is above the chosen level, say 0.05 or 0.01:
k = chivalue.shape[0] - (pvalues_chi > 0.05).sum()
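With this k in hand, the selection follows the same pattern as before; a quick sketch using the 0.05 level from above:

#Keep only the features whose chi-square p-value is at or below 0.05
X_fschi = SelectKBest(chi2, k=k).fit_transform(X_fsvar, y)
cross_val_score(RFC(n_estimators=10,random_state=0),X_fschi,y,cv=5).mean()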
 

F test


The F test, also known as ANOVA, is a filtering method used to capture the linear relationship between each feature and the label. It can be used for both regression and classification, so it contains two classes: feature_selection.f_classif (F test for classification) and feature_selection.f_regression (F test for regression). F-test classification is used when the label is a discrete variable, while F-test regression is used when the label is continuous.
The essence of the F test is to look for a linear relationship between two sets of data, and its null hypothesis is that "there is no significant linear relationship in the data". It returns two statistics, the F value and the p-value. As with chi-square filtering, we want features whose p-value is below 0.05 or 0.01, i.e. features that are significantly linearly related to the label; features with a p-value above 0.05 or 0.01 are considered to have no significant linear relationship with the label and should be removed. Taking F-test classification as an example, we continue feature selection on the digit dataset:

from sklearn.feature_selection import f_classif
F, pvalues_f = f_classif(X_fsvar, y)
k = F.shape[0] - (pvalues_f > 0.05).sum()
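As with chi-square, the resulting k can be fed straight to SelectKBest; a brief sketch (the variable name X_fsF is just for illustration):

#Keep the k features that are significantly linearly related to the label
X_fsF = SelectKBest(f_classif, k=k).fit_transform(X_fsvar, y)
cross_val_score(RFC(n_estimators=10,random_state=0),X_fsF,y,cv=5).mean()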
 

Mutual information method


The mutual information method is a filtering method used to capture any kind of relationship (both linear and nonlinear) between each feature and the label. Like the F test, it can be used for both regression and classification, and it contains two classes: feature_selection.mutual_info_classif (mutual information classification) and feature_selection.mutual_info_regression (mutual information regression). The usage and parameters of these two classes are exactly the same as for the F test, but the mutual information method is more powerful: the F test can only find linear relationships, while mutual information can capture any relationship.

from sklearn.feature_selection import mutual_info_classif as MIC
result = MIC(X_fsvar, y)
k = result.shape[0] - sum(result <= 0)
#X_fsmic = SelectKBest(MIC, k=<the chosen k>).fit_transform(X_fsvar, y)
#cross_val_score(RFC(n_estimators=10,random_state=0),X_fsmic,y,cv=5).mean()

Suggestion:
Use variance filtering first, then use the mutual information method to capture correlation; a sketch of that combination follows.
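As a minimal sketch of that suggestion, the two steps can be chained in a Pipeline (the Pipeline wiring and the k=300 value are illustrative assumptions, not something from the original post):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

#Illustrative: median-variance filtering followed by mutual-information selection
selector = Pipeline([
    ("variance", VarianceThreshold(np.median(X.var().values))),
    ("mi", SelectKBest(mutual_info_classif, k=300)),  #k chosen only for illustration
])
X_selected = selector.fit_transform(X, y)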

Embedded method (Embedded)


In the embedding method, feature selection and model training happen at the same time: a model is fitted first, and features are kept or discarded according to the weights the model assigns to them (for example coef_ or feature_importances_). Therefore, compared with the filtering method, the result of the embedding method is tailored to the utility of the model itself, which makes it better at improving the model's effectiveness. Moreover, because the contribution of each feature to the model is taken into account, irrelevant features (the ones relevance filtering targets) and indiscriminate features (the ones variance filtering targets) are dropped for lack of contribution, so the embedding method can be regarded as an evolution of the filtering method.


 

You can use the embedding method directly without filtering at all.

feature_selection.SelectFromModel

class sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)

The first two parameters are the most important: estimator is any model that exposes feature_importances_ or coef_ after fitting, and threshold is the importance cut-off below which features are discarded.
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier as RFC
RFC_ = RFC(n_estimators=10, random_state=0)  #instantiate the random forest
X_embedded = SelectFromModel(RFC_, threshold=0.005).fit_transform(X, y)
X_embedded.shape

The result is (42000, 47).
A threshold of 0.005 is very high for data with 784 features (the importances sum to 1, so the average importance is only about 0.0013), so the dimensionality of the data is reduced drastically here.
Similarly, we can find the best threshold by drawing a learning curve.
Similarly, we can also find the optimal threshold by drawing a learning curve.

import numpy as np
import matplotlib.pyplot as plt
RFC_.fit(X, y).feature_importances_

threshold = np.linspace(0, (RFC_.fit(X, y).feature_importances_).max(), 20)

score = []
for i in threshold:
    X_embedded = SelectFromModel(RFC_, threshold=i).fit_transform(X, y)
    once = cross_val_score(RFC_, X_embedded, y, cv=5).mean()
    score.append(once)
plt.plot(threshold, score)
plt.show()

The result of the code is as follows: (learning-curve plot in the original post)

 

As the threshold rises, the model gradually gets worse: more and more features are removed and more information is lost. But below roughly 0.00134 the model stays above 0.93, so we can pick a value in that range and verify the model.

X_embedded = SelectFromModel(RFC_, threshold=0.00067).fit_transform(X, y)
X_embedded.shape
cross_val_score(RFC_, X_embedded, y, cv=5).mean()

The result is:
(42000, 324)
0.939905083368037

The number of features drops at once to 324, smaller than the 392 columns left by median variance filtering, and the cross-validation score of 0.9399 is higher than the 0.9388 obtained after variance filtering. This is because the embedding method is tuned to the performance of this specific model more directly than variance filtering is; if you switch to another algorithm and reuse the same threshold, the result may not be as good.

As with other hyperparameters, after the first learning curve we can narrow the range and draw a finer learning curve to find the best value:

score2 = []
for i in np.linspace(0, 0.00134, 20):
    X_embedded = SelectFromModel(RFC_, threshold=i).fit_transform(X, y)
    once = cross_val_score(RFC_, X_embedded, y, cv=5).mean()
    score2.append(once)
plt.figure(figsize=[20,5])
plt.plot(np.linspace(0, 0.00134, 20), score2)
plt.xticks(np.linspace(0, 0.00134, 20))
plt.show()
 


X_embedded = SelectFromModel(RFC_, threshold=0.000564).fit_transform(X, y)
X_embedded.shape
cross_val_score(RFC_, X_embedded, y, cv=10).mean()

We may have found the best result of the existing model.
(42000, 340)
0.9414774325210074
What if we adjust the parameters of the random forest?

cross_val_score(RFC(n_estimators=100,random_state=0),X_embedded,y,cv=5).mean()

0.9639525817795566

The number of features selected is still smaller than with variance filtering, and the model now performs better than it did without any selection; it is already comparable to the KNN run that took half an hour per evaluation (KNN's accuracy was 96.58%), and with further parameter tuning the random forest's accuracy should rise quite a bit more. So with the embedding method we easily achieve the goal of feature selection: less computation and better model performance. In that sense, embedding can be more efficient than filtering, which requires thinking about many statistics and thresholds. However, when the algorithm itself is computationally heavy, filtering is much faster than embedding, so on very large datasets filtering is still the first choice.

Wrapper method (Wrapper)


 

The wrapper method trains an estimator on the initial feature set and obtains the importance of each feature through the coef_ attribute or the feature_importances_ attribute. Then the least important features are pruned from the current feature set, and the procedure is repeated recursively on the pruned set until the desired number of features is reached. Unlike the filtering and embedding methods, which solve the whole problem with a single round of training, the wrapper method has to train on many feature subsets, so its computational cost is the highest.

The most typical objective function here is recursive feature elimination (RFE). It is a greedy optimization algorithm that aims to find the best-performing feature subset. It builds models repeatedly, keeping the best features or removing the worst features at each iteration; the next model is then built on the features not yet eliminated, and this continues until all features are exhausted. Features are then ranked by the order in which they were kept or removed, and the best subset is selected. Of all feature selection methods, the wrapper method is the most effective at improving model performance: it can achieve excellent results with very few features. When the number of features is the same, the wrapper and embedding methods give comparable results, but the wrapper is slower, so it is not suitable for very large datasets. In exchange, the wrapper method is the feature selection method that best guarantees model performance.

feature_selection.RFE
class sklearn.feature_selection.RFE(estimator,n_features_to_select=None,step=1,verbose=0)

Here n_features_to_select is the number of features to keep and step is the number of features removed at each iteration. The fitted selector exposes .support_, a boolean mask of the selected features, and .ranking_, the rank assigned to each feature (selected features are ranked 1).

 

from sklearn.feature_selection import RFE
RFC_ = RFC(n_estimators=10, random_state=0)
selector = RFE(RFC_, n_features_to_select=340, step=50).fit(X, y)  #340 is the number of features just selected by the embedding method
selector.support_.sum()  #support_ is a boolean mask; its sum is the number of selected features
selector.ranking_
X_wrapper = selector.transform(X)
cross_val_score(RFC_, X_wrapper, y, cv=5).mean()

0.9389522459432109

Next, we draw the learning curve for the wrapper method:

score = []
for i in range(1,751,50):
    X_wrapper = RFE(RFC_, n_features_to_select=i, step=50).fit_transform(X, y)
    once = cross_val_score(RFC_, X_wrapper, y, cv=5).mean()
    score.append(once)
plt.figure(figsize=[20,5])
plt.plot(range(1,751,50),score)
plt.xticks(range(1,751,50))
plt.show()
(learning-curve plot in the original post)

 

It is clear that under the wrapper method, with only 50 features the model already performs above 90%, far more efficient than either the embedding or the filtering method. We could zoom in on this curve, look for the point where the model becomes stable, and draw a finer learning curve there (as we did for the embedding method). If what we care about most is minimizing running time, we could even use 50 features directly: that cuts the feature count by about 94% while keeping performance above 90%, and such a feature combination is by no means inefficient.

We also mentioned that, for the same number of features, the wrapper method can outperform the embedding method. Try using 340 as the number of features here as well, run it, and feel for yourself which is faster, the wrapper or the embedding method (a sketch follows). Since the wrapper's results are close to the embedding method's, a learning curve over a narrower range would let us tune the wrapper just as finely; you can try it.
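A sketch of that timing comparison, reusing the two calls from earlier in the post; run each in its own notebook cell so %%timeit can time it (threshold=0.000564 is the value that gave 340 features above):

%%timeit
X_wrapper = RFE(RFC_, n_features_to_select=340, step=50).fit_transform(X, y)  #wrapper at 340 features

and in a separate cell:

%%timeit
X_embedded = SelectFromModel(RFC_, threshold=0.000564).fit_transform(X, y)  #embedding at the threshold that also gave 340 features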

Summary of feature engineering
When the amount of data is large, use variance filtering and the mutual information method first to pare down the features, then apply other feature selection methods.
When using logistic regression, prefer the embedding method.
When using support vector machines, prefer the wrapper method.

Original link: https://blog.csdn.net/xlperpetual/article/details/103402737

I think this article is very well written, but it has very few views. I want to bring it to everyone's attention so that more people can learn from it.


Origin blog.csdn.net/m0_63309778/article/details/130682922