Feature selection

The difference between feature selection and dimensionality reduction: feature selection simply drops the original features that are unrelated to the prediction target, while dimensionality reduction computes combinations of the original features to form new features (a quick contrast is sketched below).
Feature selection methods fall into three groups: filter, wrapper, and embedded.
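
For a quick, concrete contrast (my own sketch, using scikit-learn's iris data and PCA as the dimensionality-reduction example; neither is prescribed by the original post): feature selection keeps a subset of the original columns unchanged, while dimensionality reduction produces new columns that mix all of them.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
# selection: the 2 kept columns are original iris measurements
X_sel = SelectKBest(chi2, k=2).fit_transform(X, y)
# reduction: the 2 new columns are linear combinations of all 4 measurements
X_pca = PCA(n_components=2).fit_transform(X)
print(X_sel.shape, X_pca.shape)  # (150, 2) (150, 2)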

Filter

  • Method: evaluate how strongly each individual feature correlates with the target, and keep the top-ranked features.
  • Evaluation metrics: Pearson correlation coefficient, mutual information.
  • Disadvantage: correlations between features are ignored, so useful interacting features may be thrown away; for this reason it is used less in industry.
  • Python: SelectKBest keeps a fixed number of features, SelectPercentile keeps a fixed percentage (see the sketch after the example below).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
X.shape  # 150 samples, 4 features before selection
(150, 4)

X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # keep the 2 features with the highest chi-squared scores
X_new.shape
(150, 2)
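
The bullets above name Pearson correlation and mutual information as scoring criteria, and SelectPercentile as the percentage-based variant. A minimal sketch combining them (my own example, not from the original post):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

X, y = load_iris(return_X_y=True)
# Pearson correlation of each feature with the target (filter criterion 1)
pearson = [abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])]
print(np.round(pearson, 3))
# keep the top 50% of features ranked by mutual information (filter criterion 2)
X_new = SelectPercentile(mutual_info_classif, percentile=50).fit_transform(X, y)
print(X_new.shape)  # (150, 2)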

Wrapper

  • Approach: treat feature selection as a search over feature subsets, evaluating candidate subsets with a model (Recursive Feature Elimination, RFE).
  • Applied to LR: train a model on the full feature set; remove the 5~10% weakest features and watch how accuracy/AUC changes; repeat step by step, stopping when accuracy/AUC drops sharply (a cross-validated sketch of this loop follows the example below).
  • Python: RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
lr = LinearRegression()
# n_features_to_select=1 keeps eliminating until one feature is left,
# so ranking_ gives a full ordering of all features
rfe = RFE(lr, n_features_to_select=1)
rfe.fit(X, Y)
print("Features sorted by their rank:")
print(sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))
Features sorted by their rank:
[(1, 'NOX'), (2, 'RM'), (3, 'CHAS'), (4, 'PTRATIO'), (5, 'DIS'), (6, 'LSTAT'), (7, 'RAD'), (8, 'CRIM'), (9, 'INDUS'), (10, 'ZN'), (11, 'TAX'), (12, 'B'), (13, 'AGE')]
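
The stepwise remove-and-check loop described above can be cross-validated automatically with RFECV. A rough sketch (the breast-cancer dataset, logistic regression estimator, and step=0.1 are my own choices for illustration, not from the original post):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
# step=0.1 removes roughly 10% of the remaining features per iteration;
# cross-validated AUC decides where the elimination should stop
rfecv = RFECV(LogisticRegression(solver="liblinear"), step=0.1, cv=5, scoring="roc_auc")
rfecv.fit(X, y)
print(rfecv.n_features_)  # number of features kept
print(rfecv.support_)     # boolean mask of the selected features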

Embedded

  • Method: use the model itself to judge feature importance; the most common approach is feature selection via regularization.
  • Example: early CTR prediction in e-commerce used LR; an L1-regularized LR model was trained on features of the order of 300-500 million dimensions, and only about 20-30 million features kept non-zero weights, meaning the rest were not important enough. (A small L1-regularized LR sketch follows the LinearSVC example below.)
  • Python: feature_selection.SelectFromModel selects the features whose weights are non-zero.
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
X.shape
(150, 4)

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)  # the L1 penalty drives some weights to exactly zero
model = SelectFromModel(lsvc, prefit=True)  # keep only features with non-zero weights
X_new = model.transform(X)
X_new.shape
(150, 3)
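
Since the bullets above single out L1-regularized logistic regression, here is a minimal sketch of the same SelectFromModel pattern with LogisticRegression in place of LinearSVC (my own variation, not in the original post):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# liblinear supports the L1 penalty; a smaller C gives stronger sparsity
lr = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
X_new = SelectFromModel(lr, prefit=True).transform(X)
X_new.shape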

Reprinted from: https://blog.csdn.net/fisherming/article/details/79925574

