Calculate Feature Importance Using Random Forest

A random forest is made up of many decision trees, and each node of a decision tree is a condition on a single feature, used to split the dataset into two parts so that the response values within each part are as similar as possible. The optimal condition is chosen based on impurity: typically Gini impurity or information gain/entropy for classification problems, and variance for regression problems.
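To make the impurity criterion concrete, here is a minimal sketch (not from the original post) that scores a candidate split of a toy regression target by variance reduction; the helper variance_reduction and the toy arrays are purely illustrative.

import numpy as np

def variance_reduction(y, mask):
    # Impurity decrease for a regression split: parent variance minus
    # the size-weighted variance of the two child nodes.
    left, right = y[mask], y[~mask]
    n, n_l, n_r = len(y), len(left), len(right)
    return np.var(y) - (n_l / n) * np.var(left) - (n_r / n) * np.var(right)

# Toy example: the condition "x < 5.0" cleanly separates low and high targets,
# so it reduces variance far more than the weaker condition "x < 2.5".
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(variance_reduction(y, x < 5.0))  # ~4.0
print(variance_reduction(y, x < 2.5))  # ~2.0
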
The following are two methods for calculating feature importance:
1 Mean decrease in impurity (implementation based on sklearn)

from sklearn.datasets import load_boston  # note: removed in recent scikit-learn versions
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Load the Boston housing dataset as an example
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

rf = RandomForestRegressor()
rf.fit(X, Y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names),
             reverse=True))

The output is:

Features sorted by their score:
[(0.5298, 'LSTAT'), (0.4116, 'RM'), (0.0252, 'DIS'), (0.0172, 'CRIM'), (0.0065, 'NOX'), (0.0035, 'PTRATIO'), (0.0021, 'TAX'), (0.0017, 'AGE'), (0.0012, 'B'), (0.0008, 'INDUS'), (0.0004, 'RAD'), (0.0001, 'CHAS'), (0.0, 'ZN')]

There are a few caveats when ranking features by impurity-based importance:
(1) Importance based on impurity reduction is biased towards features with many distinct values or categories.
(2) When features are correlated, once one of them is selected, the importance of the others related to it drops sharply, because the impurity they could have reduced has already been removed by the earlier feature, as illustrated in the sketch below.
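Caveat (2) is easy to see empirically. The following short sketch is an illustration, not part of the original post; it reuses X, Y, names, rf and np from the first snippet and adds a fixed random_state for repeatability. It appends an exact duplicate of the LSTAT column and refits the forest, after which the importance LSTAT earned on its own is typically shared roughly between the two identical columns.

# Illustrative only: duplicate LSTAT and compare its importance
# before and after the duplication.
lstat_idx = list(names).index("LSTAT")
X_dup = np.hstack([X, X[:, [lstat_idx]]])
rf_dup = RandomForestRegressor(random_state=0).fit(X_dup, Y)
print(round(rf.feature_importances_[lstat_idx], 3))                    # LSTAT alone
print(rf_dup.feature_importances_[[lstat_idx, X.shape[1]]].round(3))   # LSTAT and its copy
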
2 Mean decrease in accuracy
This method directly measures the influence of each feature on the model's prediction accuracy. The basic idea is to randomly permute the values in a single feature column and observe how much the model's accuracy drops. For an unimportant feature, the permutation has little effect on accuracy; for an important feature, it reduces accuracy substantially.
Here is an example of this approach:

from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict

X = boston["data"]
Y = boston["target"]

rf = RandomForestRegressor()
scores = defaultdict(list)

# Cross-validate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=0.3).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        scores[names[i]].append((acc-shuff_acc)/acc)
print "Features sorted by their score:"
print sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True)

The output is:

Features sorted by their score:
[(0.7276, 'LSTAT'), (0.5675, 'RM'), (0.0867, 'DIS'), (0.0407, 'NOX'), (0.0351, 'CRIM'), (0.0233, 'PTRATIO'), (0.0168, 'TAX'), (0.0122, 'AGE'), (0.005, 'B'), (0.0048, 'INDUS'), (0.0043, 'RAD'), (0.0004, 'ZN'), (0.0001, 'CHAS')]

In this example, LSTAT and RM are the two features with the largest impact on the model: shuffling their values reduces the model's R² score by about 73% and 57%, respectively.

See http://blog.datadive.net/selecting-good-features-part-iii-random-forests/
