Ensemble Learning & Random Forest

Ensemble Learning

Ensemble learning methods fall into three broad families of algorithms:

Bagging, Boosting, and Stacking.


Bagging : The core idea is to build multiple independent base estimators and then average their predictions or take a majority vote to determine the result of the ensemble estimator.

  • The representative model of the bagging method is the random forest.

Boosting : In the boosting method, the base estimators are dependent on one another and are built sequentially.
Its core idea is to combine the power of weak estimators, repeatedly re-fitting the samples that are hard to predict, so as to form a strong estimator.

  • Representative models of the boosting method include AdaBoost and gradient boosting trees (a minimal side-by-side sketch of the two families follows below).
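
As a minimal illustration of the two families (a sketch assuming scikit-learn with its default tree base learners and a built-in toy dataset; none of this is from the original post):

```python
# Sketch: the two ensemble families in scikit-learn, with default tree base learners.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: independent trees trained on bootstrap samples, combined by voting
bag = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: weak trees built one after another, each focusing on the previous mistakes
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

print("Bagging :", cross_val_score(bag, X, y, cv=5).mean())
print("Boosting:", cross_val_score(boost, X, y, cv=5).mean())
```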


decision tree

  • Refining Decision Rules with Tree Models

There are two core problems in building a decision tree:

  1. How to find the right feature to split on at each node, i.e., how to branch
  2. When the tree should stop growing

For the first question, we define an impurity index to measure the quality of a split. For classification trees, impurity is measured with the Gini index or information entropy; for regression trees, it is measured with the mean squared error (MSE). At each split, the decision tree computes the impurity of every candidate feature, selects the feature with the lowest impurity to branch on, and then repeats the process within each resulting branch, again choosing the lowest-impurity split.
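
For reference, the usual definitions of these impurity measures (standard formulas, added here and not taken verbatim from the original post) are

$$
\text{Gini}(t) = 1 - \sum_{k} p_k^2, \qquad
\text{Entropy}(t) = -\sum_{k} p_k \log_2 p_k, \qquad
\text{MSE}(t) = \frac{1}{N_t} \sum_{i \in t} (y_i - \bar{y}_t)^2,
$$

where $p_k$ is the proportion of samples of class $k$ at node $t$, $N_t$ is the number of samples at node $t$, and $\bar{y}_t$ is the mean target value at node $t$.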


  • With each additional layer of splits, the overall impurity of the tree decreases; the decision tree always pursues minimum impurity. It therefore keeps branching until no more features are available or the overall impurity can no longer be improved, at which point the tree stops growing.

random forest

class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)


n_estimators

Number of base learners

This parameter's effect on the accuracy of the random forest is monotonic: the larger n_estimators is, the better the model tends to perform. But every model has a performance ceiling; once n_estimators grows beyond a certain point, the accuracy of the random forest usually stops rising or begins to fluctuate. Moreover, a larger n_estimators requires more computation and memory, and the training time grows accordingly. For this parameter, we want to strike a balance between training cost and model performance.
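
A quick way to look for that balance (a sketch assuming scikit-learn and an illustrative toy dataset; not code from the original post) is to score the model over a range of n_estimators values:

```python
# Sketch: cross-validated accuracy as n_estimators grows.
# The dataset and the candidate values are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

for n in (10, 50, 100, 200, 400):
    rf = RandomForestClassifier(n_estimators=n, random_state=0, n_jobs=-1)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"n_estimators={n:4d}  CV accuracy={score:.4f}")
```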

random_state

At its core, random forest is a bagging ensemble algorithm: it averages the predictions of the base estimators or applies majority voting to determine the result of the ensemble estimator.

The randomness of a random forest has two parts, randomness over samples and randomness over features, and it is controlled by random_state.

  • The greater this randomness is, the better the bagging ensemble generally performs. When ensembling with bagging, the base classifiers should be independent of one another and not identical.
  • But relying on random_state alone is very limited. When we need thousands of trees, the data may not provide thousands of features with which to build that many different trees.

So, besides random_state, we also need other sources of randomness, for example random sampling with replacement (bootstrap sampling).

  • Note that sampling with replacement has a drawback: some samples will appear repeatedly while others may never be drawn at all. On average only about 63.2% of the samples are used, so roughly 37% of the samples are left out of each bootstrap sample (see the short derivation below).
  • However, this out-of-bag data can be used to evaluate the model.
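
The 63.2% figure follows from a standard bootstrap argument (this derivation is added here for completeness; it is not spelled out in the original post). The probability that a given sample is never drawn in $m$ draws with replacement from $m$ samples is

$$
\left(1 - \frac{1}{m}\right)^{m} \;\xrightarrow{\;m \to \infty\;}\; \frac{1}{e} \approx 0.368,
$$

so each sample appears in a given bootstrap sample with probability about $1 - 1/e \approx 0.632$, and the remaining ~36.8% form that tree's out-of-bag (OOB) set.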

In other words, when using a random forest we do not necessarily need to split off separate training and test sets; we can evaluate the model with the out-of-bag data instead.

However, when the sample size n and n_estimators are both small, it may happen that no data falls
out of the bag, and then the OOB data naturally cannot be used to evaluate the model.

If you want to evaluate with out-of-bag data, set the parameter oob_score to True when instantiating the model. After training, you can use another important attribute of the random forest, oob_score_, to view the result of the out-of-bag evaluation.
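
A minimal sketch of this workflow (assuming scikit-learn; the dataset is an illustrative choice, not one used in the original post):

```python
# Sketch: train with oob_score=True and read the OOB accuracy from oob_score_.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# No separate train/test split here: the out-of-bag samples act as validation data
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print("Out-of-bag accuracy:", rf.oob_score_)
```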


  1. Common interfaces: apply, fit, predict, score, predict_proba (a short usage sketch follows this list)
  2. Additional condition: the accuracy of each base learner should exceed that of a random classifier. For binary classification, if a base learner's accuracy cannot exceed 0.5, that base learner should be discarded.
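
For illustration, a sketch of these interfaces (same assumptions as above, with an arbitrary train/test split):

```python
# Sketch: the common RandomForestClassifier interfaces.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print(rf.score(X_test, y_test))        # mean accuracy on the test set
print(rf.predict(X_test)[:5])          # predicted class labels
print(rf.predict_proba(X_test)[:5])    # per-class probability estimates
print(rf.apply(X_test)[:5])            # leaf index each sample reaches in every tree
```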

Random Forest Interview Questions

1. A brief introduction to Random Forest

Random forest is an optimized version of Bagging built on tree models: a single tree generalizes far worse than many trees, so the random forest was proposed to overcome the weak generalization ability of an individual decision tree.

It repeatedly draws random samples, repeatedly selects random subsets of features, chooses the optimal split point, builds multiple (CART) classifiers, and combines them by voting.

Algorithm flow:

  • Input: a sample set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ and the number of weak learners $T$.

  • Output: the final strong classifier $f(x)$.

  • For $t = 1, 2, \dots, T$:

    • Draw $m$ samples from the training set with replacement (bootstrap sampling) to obtain a sampling set $D_t$ containing $m$ samples.
    • Train the $t$-th decision tree model $G_t(x)$ on $D_t$. When splitting a node of this tree, randomly select a subset of the features available at that node and choose the optimal feature within that subset to split the node into left and right subtrees.
  • For classification, the final prediction is the class that receives the most votes among the $T$ weak learners. For regression, the final output is the arithmetic mean of the $T$ weak learners' predictions. (A minimal from-scratch sketch of this flow follows below.)
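
To make the flow concrete, here is a bare-bones sketch of the procedure above (an illustration under stated assumptions, not the original author's code; it reuses scikit-learn's DecisionTreeClassifier, lets max_features="sqrt" handle the per-node feature subsampling, and assumes X and y are NumPy arrays):

```python
# Sketch: a minimal random forest classifier following the flow above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, T=100, random_state=0):
    rng = np.random.default_rng(random_state)
    m = X.shape[0]
    trees = []
    for _ in range(T):
        # Bootstrap: draw m indices with replacement to form the sampling set D_t
        idx = rng.integers(0, m, size=m)
        # max_features="sqrt" -> a random feature subset is considered at every split
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 30)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_random_forest(trees, X):
    # Majority vote over the T weak learners
    votes = np.stack([tree.predict(X) for tree in trees])  # shape (T, n_samples)
    preds = []
    for column in votes.T:
        values, counts = np.unique(column, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)
```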

2. Where is the randomness of random forest?

Repeated random sampling with replacement (bootstrap sampling), and repeated random selection of feature subsets when splitting nodes.

3. Why is random forest not easy to overfit?

  • Each individual tree in the random forest may overfit to very small details of its own training data

  • By introducing randomness, the random forest makes each tree overfit to different details

  • When all the trees are combined, the overfitted parts tend to cancel each other out.

Therefore, the probability that the random forest as a whole overfits is relatively low.

4. Why not use full sample training?

Training every tree on the full sample would ignore the patterns in local subsets of the data (every decision tree would tend to be the same), which hurts the generalization ability of the model and makes the random forest lose its randomness at the sample level.

5. Why random features?

Random feature selection guarantees the diversity (difference) between base classifiers. The generalization performance of the final ensemble improves as the individual learners differ more from one another, which improves both generalization ability and robustness to noise.

6. What is the difference between RF and GBDT?

  • A random forest obtains its final result by voting over multiple decision trees and does not further adjust one tree based on the training results of another; this is the Bagging approach.
  • GBDT uses the Boosting approach: at each iteration it builds a new weak learner to compensate for the shortcomings of the current model. The "Gradient Boost" in GBDT means that each iteration builds the new learner in the direction of steepest gradient descent of the loss (a minimal comparison sketch follows below).
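
A minimal side-by-side sketch of the two (assuming scikit-learn, with GradientBoostingClassifier standing in for GBDT and an illustrative dataset):

```python
# Sketch: RF (bagging) vs. GBDT (boosting) on the same data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=200, random_state=0)         # independent trees, voted
gbdt = GradientBoostingClassifier(n_estimators=200, random_state=0)   # sequential trees fit to residuals

print("RF  :", cross_val_score(rf, X, y, cv=5).mean())
print("GBDT:", cross_val_score(gbdt, X, y, cv=5).mean())
```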

7. Why is RF more efficient than Bagging?

Because when building each individual decision tree, Bagging uses a "deterministic" tree that must examine all of the features when choosing a split attribute, whereas random forest considers only a random subset of the features at each split, so each tree is cheaper to build.

8. You have built a random forest model with 10000 trees and are very happy to get a training error of 0.00. However, the validation error is 34.23. What is going on? Is the model simply not trained well enough?

  • The model overfitting is very serious
  • The new test set does not match the data distribution of the training set

9. How to use random forest to evaluate feature importance?

Out-of-bag data (OOB): about 1/3 of the training instances do not take part in growing the $k$-th tree; they are called the out-of-bag samples of the $k$-th tree.

The importance of a feature $X$ in a random forest is computed as follows:

  • For each decision tree in the forest, use its corresponding OOB (out-of-bag) data to compute the out-of-bag error, denoted $err_{OOB1}$.
  • Randomly add noise to feature $X$ in all of the OOB samples (for example, randomly permute the values of feature $X$), then compute the out-of-bag error again, denoted $err_{OOB2}$.
  • Suppose the random forest contains $N$ trees. The importance of feature $X$ is $\frac{1}{N}\sum (err_{OOB2} - err_{OOB1})$. This expression works as an importance measure because, if the out-of-bag accuracy drops sharply after random noise is added to a feature, that feature has a large influence on the classification result, i.e., it is highly important. (A scikit-learn sketch follows below.)
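
In practice, scikit-learn exposes impurity-based importances directly, and a permutation-based estimate can be computed with sklearn.inspection.permutation_importance (note: the sketch below permutes features on a held-out set rather than strictly on each tree's OOB samples; the dataset and split are illustrative assumptions):

```python
# Sketch: two ways to estimate feature importance with a random forest.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# 1) Impurity-based importances, accumulated while the trees were grown
print(rf.feature_importances_)

# 2) Permutation importances: shuffle one feature at a time and measure the score drop
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```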

10. What parameters need to be adjusted during random forest algorithm training?

  • **n_estimators:** the number of trees the random forest builds.
    More trees generally give the model better performance, but also make the code slower. You need to choose the optimal number of trees for the forest.

  • **max_features:** the maximum number of features the random forest allows a single decision tree to consider.
    Increasing max_features generally improves model performance, because each node then has more candidate features to choose from. However, this is not always true, because it also reduces the diversity of the individual trees, which is the unique advantage of random forests. Increasing max_features will also certainly slow the algorithm down. You therefore need to strike a proper balance and choose the best max_features.

  • max_depth: the maximum depth of the decision tree

    The default decision tree does not limit the depth of subtrees when building subtrees

  • **min_samples_split:** the minimum number of samples required to split an internal node
    If the number of samples at a node is less than min_samples_split, the node will not attempt to select an optimal feature for further splitting.

  • min_samples_leaf: Minimum samples of leaf nodes

This value limits the minimum number of samples in a leaf node. If a leaf would contain fewer samples than this value, it is pruned together with its sibling node.

  • max_leaf_nodes: maximum number of leaf nodes

Limiting the maximum number of leaf nodes can prevent overfitting. The default is None, i.e., the number of leaf nodes is not limited. If a limit is set, the algorithm builds the optimal decision tree within that maximum number of leaf nodes.

  • min_impurity_split: the minimum impurity required to split a node.
    This value limits the growth of the decision tree. If a node's impurity (Gini index or mean squared error) is below this threshold, the node generates no children and becomes a leaf. It is generally not recommended to change the default value of 1e-7. (A rough tuning sketch covering several of these parameters follows this list.)
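
As an illustration of tuning these parameters (a sketch assuming scikit-learn's GridSearchCV, with an illustrative dataset and grid, not code from the original post):

```python
# Sketch: grid search over a few of the parameters discussed above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "n_estimators": [100, 200],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 8, 16],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```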

11. Advantages and disadvantages of random forest

  • advantage

    • Training can be highly parallelized, which is an advantage for training speed on large samples in the era of big data. Personally, I think this is the main advantage.
    • Because each node split considers only a random subset of features, the model can still be trained efficiently when the sample feature dimension is very high.
    • After training, it can report the importance of each feature to the output.
    • Thanks to random sampling, the trained model has low variance and strong generalization ability.
    • Compared with Boosting-family methods such as AdaBoost and GBDT, RF is relatively simple to implement.
    • It is insensitive to missing feature values and can maintain accuracy even when a large proportion of the features are missing.
  • shortcoming

    • On sample sets with a lot of noise, the RF model is prone to overfitting.
    • Features with many distinct values (more possible split points) tend to have a larger influence on RF decisions, which can bias the fitted model.

12. Briefly describe the principle of Adaboost

The Adaboost algorithm uses the same type of base classifier (a weak classifier), assigns each classifier a different weight based on its error rate, and finally outputs the weighted sum of the predictions.

  • Adaboost algorithm process:
    • Train the first classifier on the weighted samples.
    • Compute the classifier's error rate and assign the classifier a weight according to that error rate (note: this is the weight of the classifier).
    • Increase the weights of misclassified samples and decrease the weights of correctly classified samples (note: these are the weights of the samples).
    • Train on the data again with the new sample weights to obtain a new classifier.
    • Iterate until the classifier's error rate is 0, the overall error of the weak classifiers reaches 0, or the maximum number of iterations is reached.
    • The weighted sum of all the weak classifiers' results gives the final, more accurate classification. A classifier with a low error rate receives a larger weight and therefore plays a bigger role in the final prediction. (The standard weight formulas are written out after this list.)
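
For reference, the standard binary AdaBoost update (textbook formulas with labels $y_i \in \{-1, +1\}$; these are added here and are not quoted from the original post) is

$$
\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}, \qquad
w_{t+1,i} = \frac{w_{t,i}\, e^{-\alpha_t y_i G_t(x_i)}}{Z_t}, \qquad
f(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t G_t(x)\right),
$$

where $\varepsilon_t$ is the weighted error rate of the $t$-th weak classifier $G_t$ and $Z_t$ is a normalization factor.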

13. Advantages and disadvantages of AdaBoost

  • advantage
    • Adaboost provides a framework within which sub-classifiers can be constructed using various methods. A simple weak classifier can be used without screening features, and there is no over-fitting phenomenon.
    • The Adaboost algorithm does not require prior knowledge of weak classifiers, and the classification accuracy of the final strong classifier depends on all weak classifiers. Whether applied to artificial or real data, Adaboost can significantly improve learning accuracy.
    • The Adaboost algorithm does not need to know the upper limit of the error rate of the weak classifier in advance, and the classification accuracy of the final strong classifier depends on the classification accuracy of all weak classifiers, which can dig deep into the ability of the classifier.
    • Adaboost can adaptively adjust the assumed error rate according to the feedback of the weak classifier, and the execution efficiency is high.
    • Adaboost trains different weak classifiers on the same training sample set and assembles these weak classifiers into a strong classifier with strong classification ability, in the spirit of "three cobblers together are a match for Zhuge Liang".
  • shortcoming
    • During training, Adaboost makes the weights of hard-to-classify samples grow exponentially, so training becomes overly biased toward such difficult samples, which makes the Adaboost algorithm susceptible to noise.
    • Adaboost relies on weak classifiers, which tend to take a long time to train.

14. Is Adaboost sensitive to noise?

During training, Adaboost makes the weights of hard-to-classify samples grow exponentially, so training becomes overly biased toward such difficult samples, which makes the Adaboost algorithm susceptible to noise.

15. Similarities and differences between Adaboost and random forest algorithms

Both random forest and Adaboost algorithms can be used for classification, and they are both excellent combination algorithms based on decision trees.

  • similarities
    • Both select samples with the Bootstrap method (random sampling with replacement).
    • Both are to train a lot of decision trees.
  • the difference
    • Adaboost is a Boosting-based algorithm, and Random Forest is a Bagging-based algorithm.
    • When Adaboost trains the later trees, samples that the earlier trees classified incorrectly are given a higher probability of being sampled (a larger weight).
    • When random forest trains each tree, some features are randomly selected as split features, rather than all features are used as split features.
    • When predicting new data, all the trees in Adaboost vote with weights related to their error rates to determine the predicted value; a random forest determines the prediction by a simple majority vote over all trees (or by averaging the trees' predictions for regression).

Source: blog.csdn.net/RandyHan/article/details/130408959