Using a random forest to evaluate feature importance

Preface

Random forest is an ensemble learning algorithm built on decision tree base learners. It is simple, easy to implement, and has low computational overhead, yet it shows surprisingly strong performance in both classification and regression, which is why it is often described as "a method representative of the level of ensemble learning techniques". 
This article is a brief introduction to how random forests can be used for feature selection.

Introduction to Random Forest (RF)

As long as you understand decision trees, random forests are fairly easy to grasp. The random forest algorithm can be summarized in the following steps:

  1. Use sampling with replacement (bootstrap) to draw n samples from the sample set as a training set
  2. Grow a decision tree from the bootstrapped training set. At each node: 
    • Randomly select d features without repetition
    • Split the sample set on the best of these d features (the quality of a split can be judged by the Gini index, the gain ratio, or the information gain)
  3. Repeat steps 1 and 2 a total of k times; k is the number of decision trees in the random forest.
  4. Use the trained random forest to predict test samples, and determine the prediction result by majority vote (a minimal sketch of the whole procedure is given right after this list).
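To make the steps concrete, here is a minimal, self-contained sketch of the procedure above (not the sklearn implementation); the names fit_random_forest and predict_random_forest are made up for illustration, and sklearn's DecisionTreeClassifier is used for the individual trees, with max_features playing the role of the per-node random selection of d features.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, k=100, d='sqrt', random_state=0):
    rng = np.random.RandomState(random_state)
    trees = []
    n = X.shape[0]
    for _ in range(k):
        # Step 1: bootstrap sample of size n (sampling with replacement)
        idx = rng.randint(0, n, size=n)
        # Step 2: grow a tree; max_features=d makes every split consider
        # only a random subset of d features
        tree = DecisionTreeClassifier(max_features=d,
                                      random_state=rng.randint(2**31 - 1))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_random_forest(trees, X):
    # Step 4: majority vote over the individual trees' predictions
    # (assumes non-negative integer class labels)
    votes = np.array([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)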

The following figure shows the random forest algorithm more intuitively (the figure is taken from reference [2]): 

[Figure 1: Schematic diagram of the random forest algorithm]


That's right, randomness is everywhere in this algorithm, and yet it achieves excellent results in both classification and regression. Surprisingly effective for something so simple, isn't it? 
However, the focus of this article is not the algorithm itself but what comes next: feature importance evaluation.

 

Feature importance evaluation

In practice, a data set often has hundreds or even thousands of features. How to select the features that have the greatest impact on the result, so that the number of features can be reduced when building a model, is a problem we care a lot about. There are many methods for this, such as principal component analysis, lasso, and so on. Here, however, we introduce how to use a random forest to filter features. 
The idea of using a random forest to evaluate feature importance is actually very simple. To put it bluntly, we look at how much each feature contributes to each tree in the forest, take the average, and then compare the features by the size of their average contribution. 
So what is this contribution? It is usually measured with the Gini index or with the out-of-bag (OOB) error rate. 
Here we only introduce the Gini-index method; for the other one, please refer to reference [2]. 
We use $VIM$ to denote the variable importance measure and $GI$ to denote the Gini index. Suppose there are $c$ features $X_1, X_2, X_3, \ldots, X_c$. We want to compute the Gini importance score $VIM_j^{(Gini)}$ of each feature $X_j$, i.e., the average change in node split impurity caused by the $j$-th feature over all decision trees in the random forest.
The formula for calculating the Gini index is 

$$GI_m = \sum_{k=1}^{|K|} \sum_{k' \neq k} p_{mk} \, p_{mk'} = 1 - \sum_{k=1}^{|K|} p_{mk}^2$$


where $|K|$ is the number of categories and $p_{mk}$ is the proportion of category $k$ in node $m$. 
Intuitively, $GI_m$ is the probability that two samples drawn at random from node $m$ have inconsistent category labels.
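As a quick numerical check of this formula, here is a small sketch (the function name gini_index is just illustrative):

import numpy as np

def gini_index(labels):
    # Gini index of a node, given the class labels of the samples in it
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # p_mk: proportion of each category k
    return 1.0 - np.sum(p ** 2)      # 1 - sum_k p_mk^2

# A node holding 6 samples of class 1 and 2 samples of class 2:
print(gini_index([1, 1, 1, 1, 1, 1, 2, 2]))   # 1 - (0.75^2 + 0.25^2) = 0.375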
The importance of feature $X_j$ at node $m$, i.e., the change in the Gini index before and after node $m$ is split, is

$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$


where $GI_l$ and $GI_r$ denote the Gini indices of the two new child nodes after the split. 
If $M$ is the set of nodes at which feature $X_j$ appears in decision tree $i$, then the importance of $X_j$ in the $i$-th tree is

$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}$$


Assuming there are $n$ trees in the random forest, then

$$VIM_j^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)}$$


Finally, normalize all the obtained importance scores. 

$$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{i=1}^{c} VIM_i^{(Gini)}}$$
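For reference, sklearn implements a closely related (per-tree normalized, sample-weighted) version of this Gini importance. Here is a small self-contained sketch, on synthetic data, to check that the forest-level scores are essentially the per-tree scores averaged over all trees:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Each fitted tree exposes its own normalized Gini importances;
# averaging them over the trees reproduces the forest's scores.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
print(np.allclose(per_tree.mean(axis=0), forest.feature_importances_))   # True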

 

An example

Fortunately, sklearn has already encapsulated all of this for us; we only need to call the right functions. 
Let's use the wine data set from UCI as an example. First, import the data set.

import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
df = pd.read_csv(url, header = None)
df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 
              'Alcalinity of ash', 'Magnesium', 'Total phenols', 
              'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 
              'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']

Then let's take a quick look at what this data set looks like.

import numpy as np
np.unique(df['Class label'])

Output is

array([1, 2, 3], dtype=int64)

It can be seen that there are 3 categories. Then look at the data information:

df.info()

Output is

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
Class label                     178 non-null int64
Alcohol                         178 non-null float64
Malic acid                      178 non-null float64
Ash                             178 non-null float64
Alcalinity of ash               178 non-null float64
Magnesium                       178 non-null int64
Total phenols                   178 non-null float64
Flavanoids                      178 non-null float64
Nonflavanoid phenols            178 non-null float64
Proanthocyanins                 178 non-null float64
Color intensity                 178 non-null float64
Hue                             178 non-null float64
OD280/OD315 of diluted wines    178 non-null float64
Proline                         178 non-null int64
dtypes: float64(11), int64(3)
memory usage: 19.5 KB

So, apart from the class label, there are 13 features, and the data set contains 178 samples. 
Following the usual practice, we split the data set into a training set and a test set.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
feat_labels = df.columns[1:]
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)

With that, the random forest is trained and the feature importances have already been computed. Let's take a look.

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(x_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

The output result is

 1) Color intensity                0.182483
 2) Proline                        0.158610
 3) Flavanoids                     0.150948
 4) OD280/OD315 of diluted wines   0.131987
 5) Alcohol                        0.106589
 6) Hue                            0.078243
 7) Total phenols                  0.060718
 8) Alcalinity of ash              0.032033
 9) Malic acid                     0.025400
10) Proanthocyanins                0.022351
11) Magnesium                      0.022078
12) Nonflavanoid phenols           0.014645
13) Ash                            0.013916

Yes, it is so convenient. 
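If you would also like a visual ranking, a simple bar chart does the job; this is just a sketch using matplotlib and the variables (importances, indices, feat_labels, x_train) defined above:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))
plt.bar(range(x_train.shape[1]), importances[indices], align='center')
plt.xticks(range(x_train.shape[1]), feat_labels[indices], rotation=90)
plt.title('Feature importances (wine data set)')
plt.tight_layout()
plt.show()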
If you want to filter out the more important variables, you can do this

threshold = 0.15
x_selected = x_train[:, importances > threshold]
x_selected.shape

Output is

(124, 3)

See? This selects the 3 features whose importance is greater than 0.15.
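sklearn also ships a transformer that wraps exactly this kind of thresholding; as a sketch, the same selection can be written with SelectFromModel, reusing the forest fitted above:

from sklearn.feature_selection import SelectFromModel

# prefit=True reuses the already trained forest instead of refitting it
sfm = SelectFromModel(forest, threshold=0.15, prefit=True)
x_selected = sfm.transform(x_train)
print(x_selected.shape)   # (124, 3), the same 3 features as above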

References

[1] Raschka S. Python Machine Learning. Packt Publishing, 2015. 
[2] Yang Kai, Hou Yan, Li Kang. Random forest variable importance scores and their research progress. 2015.

 

Reprinted from: https://blog.csdn.net/zjuPeco/article/details/77371645?locationNum=7&fps=1
