Machine Learning: Random Forest

Prerequisite knowledge: decision trees


1. What is a random forest?

After learning about decision trees, it is easy to understand what a random forest is. A random forest is an algorithm that combines multiple trees through the idea of ensemble learning: its basic unit is the decision tree. There are two keywords in its name, "random" and "forest". "Forest" is easy to understand: one tree is just a tree, but hundreds or thousands of trees together can be called a forest. This analogy is quite apt, and it reflects the main idea behind random forests, namely ensemble learning. The meaning of "random" is discussed in Section 3.

In fact, from an intuitive point of view, each decision tree is a classifier (assuming a classification problem), so for an input sample, N trees produce N classification results. The random forest collects all of these votes and takes the class with the most votes as the final output. This is the simplest form of the Bagging idea.
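
As a toy illustration of this voting step (the tree outputs below are hard-coded stand-ins rather than predictions from real trees), majority voting can be sketched as follows:

from collections import Counter

def majority_vote(tree_predictions):
    # Return the class that receives the most votes among the trees
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical outputs of N = 5 trees for one input sample
votes = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(votes))  # -> cat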

2. Features of Random Forest

  Random forest is a very flexible and practical method; it has the following characteristics:

  • It is unexcelled in accuracy among current algorithms;
  • It runs efficiently on large databases;
  • It can handle thousands of input variables without variable deletion;
  • It gives estimates of what variables are important in the classification;
  • It gives an internal unbiased estimate of the generalization error as the forest building progresses;
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing;
  • ...

In fact, the characteristics of random forests are not limited to these six points. The method is something like the Leatherman (multi-tool) of machine learning: you can throw almost anything at it and it will usually give something usable, and it is particularly good at estimating complex mappings without requiring the heavy parameter tuning that an SVM does. For a detailed introduction, please refer to the Random Forests homepage.

3. Random forest generation

As mentioned earlier, a random forest contains many classification trees. To classify an input sample, we feed it to every tree in the forest for classification. The results of these weak classifiers are then combined by voting to form a strong classifier; this is the bagging idea behind random forests.

With trees we can classify, but how is each tree in the forest generated?

  Each tree is generated according to the following rules:

  1) If the size of the training set is N, then for each tree, N training samples are drawn at random from the training set with replacement (this is the bootstrap sampling method) and used as that tree's training set. From this we can see that the training set of each tree is different, and it may contain repeated samples.

  2) If each sample has M feature dimensions, a number m << M is specified; at each node, m features are randomly selected out of the M, and the best of these m features (the one maximizing the information gain) is used to split the node. The value of m is held constant while the forest is grown.

  3) Each tree is grown to its maximum extent, with no pruning. (A minimal code sketch of these three rules is given below.)
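
A minimal sketch of these three rules, assuming integer class labels 0..K-1 and using scikit-learn's DecisionTreeClassifier as the base learner (the grow_forest and predict_forest helpers are illustrative, not part of any library):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, m_features="sqrt", random_state=0):
    # Grow a forest following the three rules above
    rng = np.random.RandomState(random_state)
    N = X.shape[0]
    forest = []
    for _ in range(n_trees):
        # Rule 1: bootstrap sample of size N, drawn with replacement
        idx = rng.randint(0, N, size=N)
        # Rule 2: max_features=m restricts every split to m random features
        # Rule 3: max_depth=None grows each tree fully, with no pruning
        tree = DecisionTreeClassifier(max_features=m_features, max_depth=None,
                                      random_state=rng.randint(2**31 - 1))
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_forest(forest, X):
    # Majority vote over the per-tree predictions (integer labels assumed)
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])

In practice, RandomForestClassifier implements this scheme (with many refinements); the sketch is only meant to make the two sources of randomness concrete.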

  The "random" in the random forest we mentioned at the beginning refers to the two randomnesses here . The introduction of two randomnesses is crucial to the classification performance of random forests. Due to their introduction, random forests are not easy to fall into overfitting, and have good anti-noise ability (for example: insensitive to default values).

  The classification performance (error rate) of a random forest depends on two factors:

  • The correlation of any two trees in the forest : the greater the correlation, the greater the error rate;
  • Classification ability of each tree in the forest: The stronger the classification ability of each tree, the lower the error rate of the entire forest.

  Decreasing the number of selected features m reduces both the correlation between trees and the classification ability of each tree; increasing m raises both. So the key question is how to choose the optimal m, which is essentially the only tunable parameter of a random forest.

4. Out-of-Bag (OOB) Error Rate

As mentioned above, the key problem in constructing a random forest is how to choose the optimal number of features m. Solving it relies mainly on computing the out-of-bag (OOB) error.

An important advantage of random forests is that there is no need for cross-validation or a separate test set to obtain an unbiased estimate of the error. The error can be estimated internally, that is, an unbiased estimate is obtained while the forest is being built.

We know that each tree is built from a different bootstrap sample of the training set (drawn randomly with replacement). So for each tree (say the k-th), about 1/3 of the training samples did not participate in building it; these are called the OOB samples of the k-th tree. (The probability that a given sample is never drawn in N draws with replacement is (1 - 1/N)^N, which approaches 1/e ≈ 0.368 for large N, hence "about 1/3".)

This sampling scheme allows us to perform OOB estimation, which is computed as follows (a code sketch follows this list):

  1) For each training sample, collect the classifications made by the trees for which it is an OOB sample (about 1/3 of the trees);

  2) Take a simple majority vote among these as the classification result for that sample;

  3) Finally, the ratio of misclassified samples to the total number of samples is the OOB error rate of the random forest.
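
A minimal sketch of this three-step procedure, again assuming integer class labels 0..K-1 and using an illustrative (non-library) helper that keeps track of which samples each tree never saw:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, n_trees=100, m_features="sqrt", random_state=0):
    rng = np.random.RandomState(random_state)
    N, n_classes = X.shape[0], int(y.max()) + 1
    vote_counts = np.zeros((N, n_classes))       # OOB vote tally per sample
    for _ in range(n_trees):
        idx = rng.randint(0, N, size=N)          # bootstrap indices for this tree
        oob = np.ones(N, dtype=bool)
        oob[idx] = False                         # samples this tree never saw
        tree = DecisionTreeClassifier(max_features=m_features).fit(X[idx], y[idx])
        # Step 1: classify each sample only with the trees for which it is OOB
        preds = tree.predict(X[oob]).astype(int)
        vote_counts[np.flatnonzero(oob), preds] += 1
    # Step 2: simple majority vote; Step 3: ratio of misclassified samples
    voted = vote_counts.argmax(axis=1)
    covered = vote_counts.sum(axis=1) > 0        # OOB for at least one tree
    return np.mean(voted[covered] != y[covered])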

  The OOB error rate is an unbiased estimate of the random forest's generalization error, and its result is comparable to that of k-fold cross-validation, which requires far more computation.

In this way, the best number of features m can be selected by comparing OOB error rates.
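
In scikit-learn this comparison can be done directly with the built-in OOB score (rf.oob_score_ equals 1 minus the OOB error rate). A minimal sketch on a toy dataset, where the candidate values of m are arbitrary examples:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)   # 30 features, binary labels

# Compare the OOB error for several candidate values of m (max_features)
for m in [2, 5, "sqrt", "log2", None]:
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    print(m, "OOB error:", 1 - rf.oob_score_)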

5. Random forest parameters

In scikit-learn, the random forest classifier is RandomForestClassifier and the regressor is RandomForestRegressor. Tuning an RF involves two groups of parameters: the parameters of the Bagging framework, and the parameters of the individual CART decision trees. For reference, here is the signature of the random forest classifier (from an older scikit-learn version; newer versions change some defaults, for example n_estimators is now 100 and min_impurity_split has been replaced by min_impurity_decrease):

sklearn.ensemble.RandomForestClassifier(
        n_estimators=10, criterion='gini',
        max_depth=None, min_samples_split=2,
        min_samples_leaf=1, min_weight_fraction_leaf=0.0,
        max_features='auto', max_leaf_nodes=None,
        min_impurity_split=1e-07, bootstrap=True,
        oob_score=False, n_jobs=1,
        random_state=None, verbose=0,
        warm_start=False, class_weight=None)

5.1 Bagging framework parameters

Let's first look at the important Bagging framework parameters of RF. Since most of these parameters are shared by RandomForestClassifier and RandomForestRegressor, we discuss them together and point out the differences where they exist.

     1) n_estimators: the number of weak learners (decision trees). Generally speaking, if n_estimators is too small the model easily underfits; if it is too large the computational cost grows, and beyond a certain point adding more trees brings little further improvement, so a moderate value is usually chosen. The default is 10 in older scikit-learn versions and 100 since scikit-learn 0.22.

     2) oob_score: whether to use out-of-bag samples to evaluate the model. The default is False. Setting it to True is recommended, because the out-of-bag score reflects the generalization ability of the fitted model.

     3) criterion: the criterion used to evaluate candidate splits when growing the CART trees; the loss function differs for classification and regression. For classification RF, the CART classification trees default to the Gini impurity ('gini'), with information gain ('entropy') as the alternative. For regression RF, the CART regression trees default to the mean squared error ('mse'), with the mean absolute error ('mae') as the alternative. The default criterion is usually fine.

As can be seen, RF has relatively few important framework parameters; the main one to pay attention to is n_estimators, the number of decision trees in the forest.

5.2 Decision tree parameters

Let's look at the decision tree parameters of RF:

1) The maximum number of features considered when splitting, max_features: this is the m in "at each node, m features are randomly selected out of the M" mentioned earlier. The default is "auto", which means each split considers at most √M randomly chosen features; "log2" means at most log2(M) features; an integer gives the absolute number of features; a float gives the fraction of the M features to consider (rounded down). The default "auto" is generally sufficient; if the number of features is very large, the other options can be used to limit the number of features considered per split and thus control the time needed to grow each tree.

2) The maximum depth of the decision tree, max_depth: by default it is not set, and the depth of the tree is not limited when building subtrees. This value can usually be left alone when there are few samples or features. If the model has a large sample size and many features, it is recommended to limit the maximum depth; the specific value depends on the distribution of the data, with common values between 10 and 100.

3) The minimum number of samples required to split a node, min_samples_split: this limits the conditions under which a node may be split further. If a node has fewer than min_samples_split samples, it is not split again. The default is 2. If the sample size is very large, it is recommended to increase this value.

4) The minimum number of samples per leaf node, min_samples_leaf: this limits the minimum number of samples in a leaf. If a leaf would contain fewer samples than this, it is pruned together with its sibling, leaving only the parent node. The default is 1. If the sample size is very large, it is recommended to increase this value.

5) The minimum weighted fraction of samples in a leaf node, min_weight_fraction_leaf: this limits the minimum value of the sum of the sample weights in a leaf. If the sum falls below this value, the leaf is pruned together with its sibling, leaving only the parent node. The default is 0, i.e. sample weights are not considered. If many samples have missing values, or if the class distribution of the classification samples is very unbalanced, sample weights are usually introduced, and this parameter then deserves attention.

6) The maximum number of leaf nodes, max_leaf_nodes: limiting the maximum number of leaf nodes can prevent overfitting. The default is None, i.e. the number of leaf nodes is not limited. If a limit is set, the algorithm builds the best tree it can within that number of leaf nodes. When there are many features, this can be restricted, with the specific value found through cross-validation.

7) The minimum impurity required to split a node, min_impurity_split: this limits the growth of the tree. If a node's impurity (Gini impurity or mean squared error) is below this threshold, the node is not split further and becomes a leaf. It is generally not recommended to change the default value of 1e-7.

The most important of the above decision tree parameters are the maximum number of features max_features, the maximum depth max_depth, the minimum number of samples required to split an internal node min_samples_split, and the minimum number of samples per leaf min_samples_leaf. A sketch of tuning them with cross-validated grid search follows.
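
A sketch of tuning these four parameters together with GridSearchCV (the grid values below are arbitrary examples, not recommendations):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "max_features": ["sqrt", "log2", 0.5],
    "max_depth": [None, 10, 30],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(RandomForestClassifier(n_estimators=200, random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)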


