Ensemble Learning (III): Random Forests

           As introduced in Ensemble Learning (I), Bagging (bootstrap aggregation) is an ensemble method that enhances model diversity and is an effective means of reducing variance, especially for high-variance, low-bias models such as decision trees. Note also that Bagging is a parallel method, whereas Boosting is a sequential algorithm that turns weak learners into a strong learner. In the vast majority of situations Boosting outperforms Bagging, but Bagging has its own advantages: it is simple, parallelizable, and fast, with a computational cost significantly lower than Boosting.

           Random Forest is a further improvement over Bagging: it further enhances the diversity of the sub-models and further reduces the correlation between them. In many applications Random Forest achieves performance comparable to Boosting, yet it is easier to train and tune.

 

 

 

The Random Forest Algorithm

 

      

           The main difference between Random Forests and ordinary Bagging is the so-called "random attribute selection": at each node split, $m \leq P$ attributes are randomly selected, and the best attribute among them is then chosen to split the node. The algorithm is summarized as follows:

 


 

                   Input: training data set $D=\lbrace (x_{1},y_{1}),...,(x_{N},y_{N})\rbrace$, where $x_{i}\in\mathbb{R}^{P}$; a positive integer $T$; a positive integer $m\leq P$; a positive integer $S\leq N$

                   Output: a classifier or a regressor

                   Step 1. For each $t=1,...,T$, do the following:

                              1) draw $S$ samples from $D$ at random with the Bootstrap method to form the sample set $D_{t}$;

                              2) grow a decision tree model $f_{t}$ on $D_{t}$, splitting recursively until the defined stopping conditions are met; at each node split, perform the following steps:

                                  i. randomly select $m$ attributes;

                                 ii. select the best attribute and its corresponding split point among these $m$ attributes;

                                iii. split the node into two child nodes.

                   Step 2. Output the ensemble of trees $\lbrace f_{t}\rbrace_{t=1}^{T}$: average the predictions for regression, take a majority vote for classification.

 


 

          We can see that the only improvement of Random Forest over Bagging is the random selection of attributes at each split. On top of the random Bootstrap sampling (perturbation of the training data), this input-perturbation method further enhances the diversity of the sub-models, reduces the correlation between them, lowers the variance of the final model, and improves generalization; moreover, it also greatly reduces the amount of computation compared to Bagging. A minimal code sketch of the whole procedure is given below.
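The following is a minimal, illustrative Python sketch of the algorithm above, built on scikit-learn's DecisionTreeClassifier (its max_features option performs the random selection of $m$ attributes at each split). The names fit_random_forest, n_trees, m_features and sample_size are my own illustrative choices, not part of any library API, and integer class labels are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, n_trees=100, m_features="sqrt", sample_size=None, seed=0):
    """Train T trees on bootstrap samples, each splitting on m randomly chosen attributes."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    S = sample_size or N                          # S <= N samples per bootstrap set
    trees, oob_masks = [], []
    for t in range(n_trees):
        idx = rng.integers(0, N, size=S)          # Step 1.1: bootstrap sample D_t
        tree = DecisionTreeClassifier(
            max_features=m_features,              # Step 1.2: m <= P attributes per split
            random_state=int(rng.integers(1 << 30)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
        oob = np.ones(N, dtype=bool)
        oob[idx] = False                          # out-of-bag samples D'_t (used later)
        oob_masks.append(oob)
    return trees, oob_masks

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])   # shape (T, n_samples)
    # Step 2: majority vote across the trees (use votes.mean(axis=0) for regression);
    # assumes integer class labels.
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```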

 

Out-of-Bag Samples and Out-of-Bag Error

 

      For each tree $f_{t}$ of a random forest, trained on the sample set $D_{t}$, we call the samples of $D$ that are not in $D_{t}$ the out-of-bag samples of $f_{t}$, denoted $D^{\prime}_{t}$, and define the out-of-bag error:

                                                     \begin{equation}OOBE(f_{t})\triangleq \frac{1}{\vert D_{t}^{\prime}\vert}\sum_{(x,y)\in D^{\prime}_{t}}l(f_{t}(x),y)\end{equation}   

This is the Out-of-Bag Error of a single tree. For the final integrated Random Forest model $f$, we define its out-of-bag error as the average of the out-of-bag errors of all the trees $f_{t}$:

                                                      \begin{equation}OOBE(f)=\frac{1}{T}\sum_{t=1}^{T}OOBE(f_{t})\end{equation}

Here we see a big benefit of the Bootstrap:

          OOBE can serve as an intrinsic evaluation of the learning performance of a random forest, so we do not need cross-validation; we can simply stop training once the OOB error no longer decreases significantly. A sketch of the computation follows.
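As an illustration, given the trees and oob_masks returned by the hypothetical fit_random_forest sketch above, the two formulas can be computed directly, here with 0/1 loss as $l$:

```python
import numpy as np

def oob_error(trees, oob_masks, X, y):
    """Average the per-tree out-of-bag errors, as in the formula for OOBE(f)."""
    per_tree = []
    for tree, oob in zip(trees, oob_masks):
        if oob.any():                                  # D'_t can be empty for tiny N
            errs = tree.predict(X[oob]) != y[oob]      # l(f_t(x), y) with 0/1 loss
            per_tree.append(errs.mean())               # OOBE(f_t)
    return float(np.mean(per_tree))                    # OOBE(f): average over the trees
```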

 

Intrinsic Similarity Matrix (Intrinsic Proximity Matrix)

 

      In addition, for any two samples $(x_{i},y_{i})$ and $(x_{j},y_{j})$ we can define a degree of similarity (or proximity) $d_{ij}$, giving an $N\times N$ similarity matrix. The procedure is easy to understand:

      Start with $d_{ij}=0$ and traverse all leaf nodes of all trees; whenever $(x_{i},y_{i})$ and $(x_{j},y_{j})$ ($i\neq j$) appear in the same leaf node, increase $d_{ij}$ by 1. Finally, divide all pairwise similarities by the number of trees in the random forest (a code sketch is given after the list below).

      The similarity matrix obtained this way is an intrinsic measure of similarity between samples, and we can make use of it for [3]:

      1) clustering;

      2) missing-value imputation;

      3) outlier detection.
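A minimal sketch of the proximity computation, again assuming the hypothetical trees list from the earlier fit_random_forest example; scikit-learn's tree.apply(X) returns the index of the leaf that each sample falls into:

```python
import numpy as np

def proximity_matrix(trees, X):
    """d_ij = fraction of trees in which samples i and j land in the same leaf."""
    N = X.shape[0]
    prox = np.zeros((N, N))
    for tree in trees:
        leaves = tree.apply(X)                        # leaf index of every sample
        prox += (leaves[:, None] == leaves[None, :])  # count co-occurrences in leaves
    prox /= len(trees)                                # divide all pairwise counts by T
    return prox
```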

 

 

Feature Importance

 

       Random Forest offers two ways to compute feature importance. The more direct way:

                   The importance of a feature is the sum, over all trees and over all nodes that split on this feature, of the improvement in the splitting criterion produced by the split (a quick illustration follows the list below).

            Note that this approach has drawbacks, mainly (see the blog post in reference [4]):

                   1) it tends to favor variables with more categories;
                   2) when correlated features exist, once one of them is selected, the importance of the other features correlated with it becomes very low, because the impurity they could have reduced has already been removed by the earlier-selected feature; in other words, the importance depends heavily on the order in which features are chosen for splitting.
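As a quick illustration of this impurity-based importance, scikit-learn's RandomForestClassifier exposes a normalized score of this kind through its feature_importances_ attribute; the iris dataset is used only as a convenient stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Fit a forest and read off the impurity-based (mean-decrease-in-impurity) importances.
data = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)
for name, imp in zip(data.feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```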

 

            The other way: the importance of a feature is the sum of its importances over all trees, and for each tree $f_{t}$ the importance of feature $X_{k}$ is computed as follows (a sketch is given after the two steps):

                   1) find the out-of-bag samples of $f_{t}$ and compute the out-of-bag error $OOBE$;

                   2) randomly permute the values of the $k$-th feature (the $k$-th column of the out-of-bag input matrix), recompute the out-of-bag error $OOBE^{\prime}$, and use $OOBE^{\prime}-OOBE$ to measure the importance of feature $k$ for the tree $f_{t}$.
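A minimal sketch of this permutation importance, once more assuming the trees and oob_masks from the hypothetical fit_random_forest example and using 0/1 loss:

```python
import numpy as np

def permutation_importance_tree(tree, oob, X, y, k, rng):
    """OOBE' - OOBE for feature k on a single tree, following the two steps above."""
    X_oob, y_oob = X[oob], y[oob]
    base = np.mean(tree.predict(X_oob) != y_oob)          # step 1: OOBE
    X_perm = X_oob.copy()
    X_perm[:, k] = rng.permutation(X_perm[:, k])          # step 2: shuffle feature k
    return np.mean(tree.predict(X_perm) != y_oob) - base  # OOBE' - OOBE

def permutation_importance(trees, oob_masks, X, y, seed=0):
    """Forest-level importance of every feature: sum of the per-tree importances."""
    rng = np.random.default_rng(seed)
    return np.array([
        sum(permutation_importance_tree(t, oob, X, y, k, rng)
            for t, oob in zip(trees, oob_masks) if oob.any())
        for k in range(X.shape[1])
    ])
```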

 

Summary:

 

           1. Advantages of Random Forests:

               Training is parallelizable and fast;

               Random attribute selection further enhances diversity, reduces the amount of computation, and reduces variance;

              The out-of-bag error gives an intrinsic evaluation of model performance, with no need to set up cross-validation or a validation set;

              Feature importance can be obtained, which is convenient for feature selection;

              An intrinsic similarity matrix of the samples can be obtained, which facilitates clustering, outlier detection, and missing-value imputation.

 

           2. Disadvantages: poor interpretability; the bias cannot be reduced; it is still prone to overfitting on very noisy data.

 

References:

   [1] Zhou Zhihua: Machine Learning, Beijing, Tsinghua University Press, 2016;

     [2] Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second edition, Springer Verlag, 2009;

     [3] Leo Breiman: Manual On Setting Up, Using, And Understanding Random Forests V3.1, http://oz.berkeley.edu/users/breiman/Using_random_forests_V3.1.pdf;

     [4] "Random Forest Feature Importance Ranking": https://blog.csdn.net/qq_15111861/article/details/80366787
