Use PermutationImportance to select variables

When building a tree model (XGBoost, LightGBM, etc.), we often want to know which variables matter most. Feature importance can be obtained through the model's feature_importances_ attribute. For example, LightGBM's feature_importances_ can be measured either by the number of times a feature is used to split, or by the gain those splits bring. In general, different measurement criteria produce different importance rankings, so I usually cross-select features across multiple criteria: if a feature ranks highly under several different criteria, I believe it has genuine predictive power for the label.
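For example, here is a minimal sketch of that cross-selection with LightGBM, picking features that land in the top 30 under both the 'split' and 'gain' criteria (the top-30 cut is my own choice here, and X_train / y_train are assumed to already exist):

# Sketch: cross-select features that rank in the top 30 under both of
# LightGBM's importance criteria ('split' count and total 'gain').
import pandas as pd
import lightgbm as lgb

model = lgb.LGBMClassifier(random_state=1024).fit(X_train, y_train)

imp = pd.DataFrame({
    'var': X_train.columns,
    'split': model.booster_.feature_importance(importance_type='split'),
    'gain': model.booster_.feature_importance(importance_type='gain'),
})

top_split = set(imp.nlargest(30, 'split')['var'])
top_gain = set(imp.nlargest(30, 'gain')['var'])
selected = sorted(top_split & top_gain)  # features important under both criteria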

Here I introduce another method for evaluating feature importance: PermutationImportance. The eli5 documentation describes it as follows: eli5 provides a way to compute feature importances for any black-box estimator by measuring how the score decreases when a feature is not available; the method is also known as “permutation importance” or “Mean Decrease Accuracy (MDA)”. My understanding is: if replacing a feature with shuffled (effectively random) values makes the model's performance drop a lot, the feature is important; otherwise, it is not.
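To make that intuition concrete, here is a hand-rolled sketch for a single feature, assuming a fitted classifier clf, a validation set X_valid / y_valid, and AUC as the score (eli5 does essentially this for every column and averages over repeats):

# Sketch: measure how much the validation AUC drops when one column is shuffled.
# A large average drop means the model relies heavily on that feature.
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_drop(clf, X_valid, y_valid, col, n_iter=5, seed=1024):
    rng = np.random.RandomState(seed)
    base = roc_auc_score(y_valid, clf.predict_proba(X_valid)[:, 1])
    drops = []
    for _ in range(n_iter):
        X_perm = X_valid.copy()
        X_perm[col] = rng.permutation(X_perm[col].values)  # break the link between the feature and the label
        drops.append(base - roc_auc_score(y_valid, clf.predict_proba(X_perm)[:, 1]))
    return np.mean(drops)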

Here is a simple example. We use different models (RF, LightGBM, LR) to select variables, and keep the 30 most important ones (out of 200+ in total) for modeling.

import eli5
from eli5.sklearn import PermutationImportance
from sklearn.feature_selection import SelectFromModel
import pandas as pd
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def PermutationImportance_(clf, X_train, y_train, X_valid, X_test):

    # Fit PermutationImportance: each feature is shuffled n_iter times under 5-fold CV
    perm = PermutationImportance(clf, n_iter=5, random_state=1024, cv=5)
    perm.fit(X_train, y_train)

    # Collect the mean importance and its standard deviation for every variable
    result_ = {'var': X_train.columns.values,
               'feature_importances_': perm.feature_importances_,
               'feature_importances_std_': perm.feature_importances_std_}
    feature_importances_ = pd.DataFrame(result_, columns=['var', 'feature_importances_', 'feature_importances_std_'])
    feature_importances_ = feature_importances_.sort_values('feature_importances_', ascending=False)
    # eli5.show_weights(perm, feature_names=X_train.columns.tolist(), top=500)  # visualize the results

    # Keep only the features whose permutation importance exceeds the threshold
    sel = SelectFromModel(perm, threshold=0.00, prefit=True)
    X_train_ = sel.transform(X_train)
    X_valid_ = sel.transform(X_valid)
    X_test_ = sel.transform(X_test)

    return feature_importances_, X_train_, X_valid_, X_test_


#PermutationImportance
model_1 = RandomForestClassifier(random_state=1024)
feature_importances_1,X_train_1,X_valid_1,X_test_1 = PermutationImportance_(model_1,X_train,y_train,X_valid,X_test)

model_2 = lgb.LGBMClassifier(objective='binary',random_state=1024)
feature_importances_2,X_train_2,X_valid_2,X_test_2 = PermutationImportance_(model_2,X_train,y_train,X_valid,X_test)

model_3 = LogisticRegression(random_state=1024)
feature_importances_3,X_train_3,X_valid_3,X_test_3 = PermutationImportance_(model_3,X_train,y_train,X_valid,X_test)

The results below compare four feature sets: all features, the top 30 features selected by RF, the top 30 selected by LightGBM, and the top 30 selected by LR. As the numbers show, the 30 features selected by LightGBM through PermutationImportance generalize better than modeling with all variables.

All features      train auc: 0.737572501101  valid auc: 0.707917079532  test auc: 0.698453842775
RF top 30         train auc: 0.728547026706  valid auc: 0.694552089056  test auc: 0.674431794411
LightGBM top 30   train auc: 0.737740963444  valid auc: 0.711832783676  test auc: 0.702665571919
LR top 30         train auc: 0.721754352344  valid auc: 0.694629157213  test auc: 0.679796146766
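For reference, here is a hedged sketch of how such a comparison could be set up; the evaluate helper, the 30-feature cut, and the y_valid / y_test labels are assumptions for illustration, not the original code:

# Sketch: refit one LightGBM model per feature set and compare train/valid/test AUC.
from sklearn.metrics import roc_auc_score

def evaluate(cols):
    clf = lgb.LGBMClassifier(objective='binary', random_state=1024)
    clf.fit(X_train[cols], y_train)
    return [roc_auc_score(y, clf.predict_proba(X[cols])[:, 1])
            for X, y in [(X_train, y_train), (X_valid, y_valid), (X_test, y_test)]]

feature_sets = {
    'All features': list(X_train.columns),
    'RF top 30': feature_importances_1['var'].head(30).tolist(),
    'LightGBM top 30': feature_importances_2['var'].head(30).tolist(),
    'LR top 30': feature_importances_3['var'].head(30).tolist(),
}
for name, cols in feature_sets.items():
    train_auc, valid_auc, test_auc = evaluate(cols)
    print(name, 'train auc:', train_auc, 'valid auc:', valid_auc, 'test auc:', test_auc)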

Official documentation:

https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html

Original post: blog.csdn.net/lz_peter/article/details/88654198