Summary of Data Preprocessing and Feature Engineering - Feature Selection - Embedded and Wrapper Methods (5)

Organized like the courses of a meal, so it is easy to remember and understand


Embedded method

The embedded method lets the algorithm itself decide which features to use; that is, feature selection and model training are performed at the same time.

When using the embedded method,

  • First, train a machine learning algorithm or model to obtain the weight coefficient of each feature, then select features by these weight coefficients from large to small (the weight coefficients often represent some kind of contribution of a feature to the model, i.e. its importance).
  • Based on this assessment of contribution, we can find the features that are most useful for building the model.
  • Compared with the filter method, the result of the embedded method is tied more closely to the utility of the model itself, so it is better at improving the model's effectiveness.
  • Because the embedded method considers each feature's contribution to the model, irrelevant features (those that would need correlation filtering) and undiscriminating features (those that would need variance filtering) are removed for lack of contribution; it can therefore be seen as an upgraded version of the filter method.


  • Shortcomings
    • The statistics used in the filter method come with ranges that can be set from statistical knowledge and common sense (e.g. a p-value should be below the significance level of 0.05), whereas the weight coefficients used in the embedded method have no such range. We can say that a feature with a weight coefficient of 0 contributes nothing to the model, but when a large number of features all contribute to the model to different degrees, it is hard to define an effective cut-off value.
      • In this case, the weight-coefficient threshold is a hyperparameter; we may need a learning curve, or properties of the model itself, to judge its optimal value. Below we explore the embedded method with random forest and decision tree models.
    • The embedded method introduces an algorithm to select features, so its computation speed depends heavily on the algorithm being used. If the algorithm is computationally heavy and slow, the embedded method itself will be very time-consuming. And after the selection is complete, we still need to evaluate the model ourselves.

feature_selection.SelectFromModel

class sklearn.feature_selection.SelectFromModel (estimator, threshold=None, prefit=False, norm_order=1,max_features=None)

  • SelectFromModel is a meta-transformer that can be used with any estimator that has a coef_ or feature_importances_ attribute after fitting, or that takes an optional penalty term in its parameters (for example, random forests and tree models have feature_importances_, logistic regression has l1 and l2 penalties, and linear support vector machines also support an l2 penalty).
    • For models with feature_importances_, features whose importance falls below the given threshold are considered unimportant and removed. feature_importances_ takes values in [0, 1]; with a very small threshold such as 0.001, only features that contribute nothing at all to label prediction are removed, while with a threshold close to 1, only one or two features may survive.
  • Embedded method for models with penalty terms (see the sketch after this list)
    • For models that use a penalty term, the larger the regularization penalty, the smaller the coefficients of the features in the model. When the penalty becomes large enough, some feature coefficients become 0; as it keeps growing, all coefficients tend toward 0. However, some coefficients reach 0 much sooner than others, and those are the features that can be filtered out. In other words, we keep the features with larger coefficients.
    • Support vector machines and logistic regression use the parameter C to control the sparsity of the returned feature matrix: the smaller C is, the fewer features are returned.
    • Lasso regression uses the alpha parameter to control the returned feature matrix: the larger alpha is, the fewer features are returned.
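Below is a minimal sketch (not from the original post; the data is synthetic and purely illustrative) of the penalty-based usage: SelectFromModel wraps an L1-penalized logistic regression, and features whose coefficients are shrunk to 0 are dropped.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=30,
                                     n_informative=5, random_state=0)

# Smaller C -> stronger L1 penalty -> more coefficients pushed to exactly 0
lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(lr).fit(X_demo, y_demo)
print(selector.get_support().sum(), "features kept out of", X_demo.shape[1])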
Parameter descriptions:

  • estimator: the model estimator to use; any model with a feature_importances_ or coef_ attribute, or with an l1/l2 penalty term, can be used
  • threshold: the feature-importance threshold; features whose importance is below this threshold are removed
  • prefit: default False; whether an already-fitted model is passed directly to the constructor. If True, transform must be called directly (fit_transform cannot be used), and SelectFromModel cannot be used with cross_val_score, GridSearchCV, or similar utilities that clone the estimator (see the sketch after this table)
  • norm_order: accepts a non-zero integer, positive infinity, or negative infinity; default 1. When the estimator's coef_ attribute has more than one dimension, this is the order of the norm used to filter out coefficient vectors below the threshold
  • max_features: the maximum number of features to select under the threshold. To disable the threshold and select based only on max_features, set threshold = -np.inf
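As a hedged illustration of the less commonly tuned parameters (this sketch is not from the original post and assumes the same X and y used throughout this article):

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier as RFC
import numpy as np

# prefit=True: fit the estimator yourself first, then call transform() directly
rfc_fitted = RFC(n_estimators=10, random_state=0).fit(X, y)
X_pre = SelectFromModel(rfc_fitted, threshold=0.005, prefit=True).transform(X)

# max_features: disable the threshold and keep only the top 300 features by importance
X_top = SelectFromModel(RFC(n_estimators=10, random_state=0),
                        threshold=-np.inf, max_features=300).fit_transform(X, y)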

The first two parameters are the ones we mainly need to consider. Here we use a random forest as an example, so we need a learning curve to help us find the best threshold.

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier as RFC

# X, y: the feature matrix and labels prepared earlier in this series
RFC_ = RFC(n_estimators=10, random_state=0)
X_embedded = SelectFromModel(RFC_, threshold=0.005).fit_transform(X, y)
X_embedded

"""
array([[  0,   0,   0, ..., 253,   0,   0],
       [254, 254, 254, ..., 254, 255, 254],
       [  9, 254, 254, ...,   0, 254, 254],
       ...,
       [  0,   0,   0, ...,   0, 255, 255],
       [  0,   0,  27, ..., 242,   0,   0],
       [  0,   0,   0, ...,   0,   0,   0]], dtype=int64)
"""

X_embedded.shape  # (42000, 47)
  • Combine with a learning curve to choose the threshold
# Plot the learning curve over threshold
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

# Candidate thresholds range from 0 up to the largest feature importance
threshold = np.linspace(0, RFC_.fit(X, y).feature_importances_.max(), 20)
score = []
for i in threshold:
    X_embedded = SelectFromModel(RFC_, threshold=i).fit_transform(X, y)
    once = cross_val_score(RFC_, X_embedded, y, cv=5).mean()
    score.append(once)
plt.plot(threshold, score)
plt.show()

[Figure: learning curve of cross-validation score vs. threshold]

From the plot, as the threshold increases the model's performance gradually worsens: more and more features are removed, and the information loss grows.

  • Verify the model's performance after feature selection
X_embedded = SelectFromModel(RFC_,threshold=0.00067).fit_transform(X,y)
X_embedded.shape
cross_val_score(RFC_,X_embedded,y,cv=5).mean()
# 0.9391190476190475

The number of features immediately shrinks to around 324, smaller than the 392 columns we obtained with median-based variance filtering, and the cross-validation score of 0.9399 is higher than the 0.9388 obtained after variance filtering. This is because the embedded method is more specific to the model's own performance than variance filtering.

  • We can refine the learning curve further
score2 = []
for i in np.linspace(0,0.00134,20):
    X_embedded = SelectFromModel(RFC_,threshold=i).fit_transform(X,y)
    once = cross_val_score(RFC_,X_embedded,y,cv=5).mean()
    score2.append(once)
plt.figure(figsize=[20,5])
plt.plot(np.linspace(0,0.00134,20),score2)
plt.xticks(np.linspace(0,0.00134,20))
plt.show()
复制代码

[Figure: refined learning curve of cross-validation score vs. threshold]

  • Pick the threshold at the best position
X_embedded = SelectFromModel(RFC_,threshold=0.000071).fit_transform(X,y)
X_embedded.shape  # (42000, 340)

cross_val_score(RFC_,X_embedded,y,cv=5).mean()  # 0.9392857142857144

With the embedded method, we can easily achieve the goals of feature selection: reducing the amount of computation while improving model performance.

Compared with the filter method, which requires reasoning about many statistics, the embedded method may be the more effective approach. However, when the algorithm itself is complex, the filter method computes far faster than the embedded method, so for large datasets we still prefer the filter method.

Wrapper method

  • Similarities with the embedded method
    • Feature selection and model training happen at the same time.
    • It relies on the algorithm's own selection, e.g. the coef_ or feature_importances_ attribute, to complete feature selection.
  • Differences from the embedded method
    • We usually use an objective function as a black box to help us select features, rather than supplying a threshold on some evaluation metric or statistic ourselves.
    • Unlike the filter and embedded methods, which solve everything with a single training run, the wrapper method trains many times on feature subsets, so its computational cost is the highest.
  • The wrapper method trains the estimator on the initial feature set and obtains each feature's importance through the coef_ or feature_importances_ attribute. The least important features are then pruned from the current set. This procedure is repeated recursively on the pruned set until the desired number of features to select is reached.


Note that the "algorithm" in this diagram does not refer to the classification or regression algorithm we ultimately feed the data into (i.e. not the random forest), but to a dedicated data-mining algorithm, i.e. our objective function. The core job of these data-mining algorithms is to pick the best feature subset. The most typical objective function is Recursive Feature Elimination (RFE). It is a greedy optimization algorithm that aims to find the best-performing feature subset: it builds models repeatedly, keeping the best features or discarding the worst ones at each iteration; in the next iteration it builds the next model from the features that were not selected in the previous round, and so on until all features are exhausted. It then ranks the features according to the order in which they were kept or discarded and finally selects the best subset.

The wrapper method improves model performance more than any other feature-selection method; it can achieve excellent results with very few features. In addition, with the same number of features, the wrapper and embedded methods perform comparably, but the wrapper method is slower than the embedded method, so it is not suitable for very large datasets. On the other hand, the wrapper method is the feature-selection method that best guarantees model performance.

feature_selection.RFE

class sklearn.feature_selection.RFE (estimator, n_features_to_select=None, step=1, verbose=0)

  • Parameters:
    • estimator: the instantiated estimator to be used
    • n_features_to_select: the number of features you want to select
    • step: the number of features to remove at each iteration
  • Attributes
    • .support_: returns a boolean mask indicating whether each feature was selected in the end
    • .ranking_: returns the ranking of the features by their combined importance over the iterations
  • Characteristics
    • It is a greedy optimization algorithm that aims to find the best-performing subset of features. It builds the model iteratively, keeping the best features or discarding the worst ones at each iteration; on the next iteration it builds the next model from the features that were not selected in the previous round, until all features are exhausted.
  • Advantages
    • Among all feature-selection methods, the wrapper method is the most effective at improving model performance; it can achieve excellent results with few features.
    • The wrapper method is the feature-selection method that best guarantees the model's effectiveness.
  • Shortcomings
    • With the same number of features, the wrapper and embedded methods perform comparably, but the wrapper method is slower than the embedded method, so it is not suitable for very large datasets.

feature_selection.RFECV

RFECV performs RFE inside a cross-validation loop to find the optimal number of features. It adds a cv parameter, and is otherwise used exactly like RFE.
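A minimal sketch of RFECV (not from the original post; it reuses the X, y and RFC_ defined above, and step=50, cv=5 are illustrative choices):

from sklearn.feature_selection import RFECV

selector_cv = RFECV(RFC_, step=50, cv=5).fit(X, y)
selector_cv.n_features_   # number of features chosen by cross-validation
X_cv = selector_cv.transform(X)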

### Recursive feature elimination: feature_selection.RFE
from sklearn.feature_selection import RFE
RFC_ = RFC(n_estimators=10, random_state=0)
selector = RFE(RFC_, n_features_to_select=340, step=50).fit(X, y)
selector.support_
selector.support_

"""
array([False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False, False, False, False, False, False,
       False, False, False, False, False, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False, False, False, False, False,
       False, False, False, False, False, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False, False, False])
"""
  • Inspect the attributes
selector.support_.sum()  # 340

selector.ranking_

"""
array([10,  9,  8,  7,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,
        6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  7,  7,  6,  6,
        5,  6,  5,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  7,  6,  7,  7,
        7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  6,  6,  5,  4,
        4,  5,  3,  4,  4,  4,  5,  4,  5,  7,  6,  7,  7,  7,  8,  8,  8,
        8,  8,  8,  8,  8,  6,  7,  4,  3,  1,  2,  3,  3,  1,  1,  1,  1])
"""
  • Verify the results of feature selection
X_wrapper = selector.transform(X)
cross_val_score(RFC_,X_wrapper,y,cv=5).mean() # 0.9379761904761905

# Plot the learning curve
score = []
for i in range(1,751,50):
    X_wrapper = RFE(RFC_,n_features_to_select=i, step=50).fit_transform(X,y)
    once = cross_val_score(RFC_,X_wrapper,y,cv=5).mean()
    score.append(once)
plt.figure(figsize=[20,5])
plt.plot(range(1,751,50),score)
plt.xticks(range(1,751,50))
plt.show()

[Figure: cross-validation score vs. number of features selected by RFE]

Under the wrapper method, the model already performs above 90% with only 50 features, which is far more efficient than the embedded and filter methods.

Feature Selection Summary

  • The filter method is faster but coarser. The wrapper and embedded methods are more precise and better suited to tailoring selection to a specific algorithm, but they are computationally heavier and take longer to run.
  • When the amount of data is large, variance filtering and mutual information are preferred for an initial pass, followed by the other feature-selection methods.
  • When using logistic regression, prefer the embedded method; when using support vector machines, prefer the wrapper method.


Origin juejin.im/post/7086647430569000990