Feature importance -- feature_importance

 Taking random forest as an example, the feature importance mechanism helps with model interpretability. Consider that even a decision tree, a highly interpretable model, becomes hard for a human to explain once the tree grows too large.

 A random forest is usually made up of hundreds of trees, which makes it even harder to interpret. Fortunately, we can identify which features matter most, and that helps us explain the model. More importantly, we can drop unimportant features to reduce noise, and unlike the components produced by PCA dimensionality reduction, the result stays human-interpretable.
There are two common ways to implement feature importance:
  (1) mean decrease in node impurity: 

Feature importance is calculated by looking at the splits of each tree.
The importance of the splitting variable is proportional to the improvement in the Gini index produced by that split, and it is accumulated (for each variable) over all the trees in the forest.

       In other words, for every tree we compute the improvement in the splitting criterion (Gini or entropy) contributed by each splitting feature, then aggregate over all trees in the forest to obtain the feature weights.

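As a concrete illustration, here is a minimal sketch using scikit-learn's RandomForestClassifier, whose feature_importances_ attribute is an impurity-based importance of this kind; the breast-cancer dataset and the hyperparameters are illustrative assumptions, not something taken from the original post.

```python
# Minimal sketch: impurity-based (mean decrease in impurity) importance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# feature_importances_ is the impurity decrease contributed by each feature,
# averaged over all trees and normalized to sum to 1.
top5 = sorted(zip(X.columns, rf.feature_importances_),
              key=lambda t: t[1], reverse=True)[:5]
for name, imp in top5:
    print(f"{name}: {imp:.4f}")
```
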
  (2) mean decrease in accuracy:

 This method, proposed in the original paper, passes the OOB samples down each tree and records the prediction accuracy.
A variable is then selected and its values in the OOB samples are randomly permuted. The OOB samples are passed down the tree again and the accuracy is recomputed.
The decrease in accuracy caused by this permutation, averaged over all trees, gives the importance of that variable (the higher the decrease, the higher the importance).

    Simply put, if a feature is really important, then even a small perturbation of its values will have a large effect on the model's predictions.

             Fabricating new data by hand is tedious, so we simply shuffle that feature's column in the OOB set and evaluate again (no retraining is needed); the accuracy before shuffling minus the accuracy after shuffling is that feature's importance. This method is also known as permutation importance.
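
As a sketch of this idea, the code below uses scikit-learn's permutation_importance utility; note that it shuffles each feature on an explicit held-out split rather than on each tree's OOB samples, which is a simplifying assumption made here. The dataset and parameters are again illustrative.

```python
# Minimal sketch: permutation (mean decrease in accuracy) importance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature column in the evaluation set several times and measure
# the resulting drop in accuracy; no retraining is involved.
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=0, scoring="accuracy")

top5 = sorted(zip(X.columns, result.importances_mean),
              key=lambda t: t[1], reverse=True)[:5]
for name, mean_drop in top5:
    print(f"{name}: {mean_drop:.4f}")
```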



Reposted from www.cnblogs.com/wqbin/p/12803594.html