Eight, Decision Trees and Random Forests

Reference URL:

https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html

The nonparametric random forest algorithm is an ensemble method that builds a strong learner from the cumulative effect of many relatively simple estimators: the final result, decided by a majority vote among the estimators, tends to perform better than the vote of any single estimator.

1, Motivating random forests: decision trees

  Random forests are an ensemble learner built on decision trees.

  A decision tree classifies objects or assigns labels in a very intuitive way: it simply asks a series of questions designed to zero in on the classification.

  Binary branching makes this very efficient: in a well-constructed decision tree, each question roughly halves the remaining possibilities, so even when deciding among a large number of classes the range of choices narrows very quickly.

  The difficulty is in designing the question to ask at each step. In machine-learning implementations of decision trees, the questions generally take the form of axis-aligned splits of the data: that is, each node of the tree divides the data into two groups according to a threshold on a single feature.
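  As a concrete illustration, the following hand-rolled sketch performs one such split on a toy data set; the data values and the choice of feature 0 with threshold 2.0 are made up for illustration.

    import numpy as np

    # Toy data: four points with two features each (values are made up).
    X = np.array([[1.0, 2.0],
                  [3.0, 0.5],
                  [0.2, 4.0],
                  [2.5, 2.5]])

    feature, threshold = 0, 2.0           # hypothetical node: feature 0, threshold 2.0
    left = X[X[:, feature] <= threshold]  # child group at or below the threshold
    right = X[X[:, feature] > threshold]  # child group above the threshold
    print(len(left), len(right))          # -> 2 2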

  1, Creating a decision tree
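    Following the handbook, a simple four-class data set to build a tree on can be generated with make_blobs; the parameter values below are illustrative choices.

      import matplotlib.pyplot as plt
      from sklearn.datasets import make_blobs

      # 300 points in four labeled clusters.
      X, y = make_blobs(n_samples=300, centers=4,
                        random_state=0, cluster_std=1.0)
      plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow')
      plt.show()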

    A simple decision tree built on this data set will iteratively divide the data along one feature or the other according to some quantitative criterion.

    At each split, the label of each new region is assigned by a majority vote of the points inside it.
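    Below is a sketch of fitting a tree to this data and coloring the plane by its predictions; the grid-evaluation code is our own stand-in for the handbook's visualization helper.

      import matplotlib.pyplot as plt
      import numpy as np
      from sklearn.datasets import make_blobs
      from sklearn.tree import DecisionTreeClassifier

      X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
      tree = DecisionTreeClassifier().fit(X, y)

      # Evaluate the fitted tree on a dense grid: every point in a region
      # receives that region's majority-vote label.
      xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                           np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
      Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

      plt.contourf(xx, yy, Z, alpha=0.3, cmap='rainbow')   # axis-aligned regions
      plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow')
      plt.show()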

    Notice that after the first split, all the data points in the upper branch remain unchanged, so that branch does not need to be divided further. Except for nodes that contain points of only one color, at each level every region is again split along one of the two features.
  2, Decision trees and overfitting

    Such overfitting turns out to be a general property of decision trees: it is very easy for a tree to grow too deep and end up fitting the details of the particular data rather than the overall properties of the distribution it was drawn from; put differently, trees trained on different subsets of the data come out noticeably different.
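    One way to see this, sketched below, is to grow one unpruned tree on each half of the same data set and check where the two models disagree; the even/odd split by index is our own illustrative choice.

      from sklearn.datasets import make_blobs
      from sklearn.tree import DecisionTreeClassifier

      X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

      # Grow one unpruned tree on each half of the data.
      tree_a = DecisionTreeClassifier().fit(X[::2], y[::2])    # even-indexed half
      tree_b = DecisionTreeClassifier().fit(X[1::2], y[1::2])  # odd-indexed half

      # The trees agree on the obvious cluster cores but disagree in the
      # ambiguous regions between clusters -- the signature of overfitting.
      disagree = (tree_a.predict(X) != tree_b.predict(X)).mean()
      print(f"fraction of points where the trees disagree: {disagree:.2f}")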

2, Ensembles of estimators: random forests

  The idea that combining multiple over-fit estimators can reduce the effect of overfitting is an ensemble-learning method called bagging.

  Bagging uses an ensemble of parallel estimators, each trained on a random sample of the data drawn with replacement; each estimator over-fits its sample, and averaging their results yields a better classification.

  An ensemble of randomized decision trees is known as a random forest.
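  A sketch of bagging with Scikit-Learn's BaggingClassifier, on the same kind of blob data as above; the parameter values follow the spirit of the handbook's example.

    from sklearn.datasets import make_blobs
    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

    # 100 trees, each fit to a random 80% draw of the data; the ensemble
    # classifies new points by aggregating the trees' votes.
    bag = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=100,
                            max_samples=0.8,
                            random_state=1)
    bag.fit(X, y)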

  Fitting the randomized decision trees is more effective if the stochasticity is injected into how the splits themselves are chosen: this way, all of the data contributes to every fit, but the results of each fit still carry the desired randomness.

  In Scikit-Learn, such an optimized ensemble of randomized decision trees is implemented by the RandomForestClassifier estimator, which takes care of all the randomization automatically; all you need to do is select the number of estimators, and it will very quickly (in parallel, if desired) fit the ensemble of trees.
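  For example (the parameter values here are illustrative):

    from sklearn.datasets import make_blobs
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

    # One estimator does it all: pick the number of trees and fit.
    # n_jobs=-1 asks Scikit-Learn to grow the trees in parallel.
    model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(X, y)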

3, Random forest regression

  Random forests can also be used for regression (that is, to handle continuous rather than discrete variables).

  The random forest regression estimator is RandomForestRegressor.
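  A sketch of the setup, using our own stand-in data that mixes a slow and a fast oscillation with noise:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(42)
    x = 10 * rng.rand(200)
    # A slow and a fast oscillation plus noise: simple multi-period data.
    y = np.sin(0.5 * x) + np.sin(5 * x) + 0.3 * rng.randn(200)

    forest = RandomForestRegressor(n_estimators=200)
    forest.fit(x[:, None], y)

    xfit = np.linspace(0, 10, 1000)
    yfit = forest.predict(xfit[:, None])
    ytrue = np.sin(0.5 * xfit) + np.sin(5 * xfit)   # noiseless true model

    plt.scatter(x, y, s=10, alpha=0.5)
    plt.plot(xfit, ytrue, '-', label='true model')
    plt.plot(xfit, yfit, '-', label='random forest')
    plt.legend()
    plt.show()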

  In the resulting plot, the true model is the smooth curve, while the random forest model is the jagged line.

  As the plot shows, the nonparametric random forest model is flexible enough to fit multi-period data without our needing to specify a multi-period model.

4, Example: classifying handwritten digits with a random forest
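  A sketch of the workflow: load the digits data, split it into training and test sets, fit a random forest, and report the quality of the predictions (n_estimators=1000 is an illustrative choice).

    from sklearn import metrics
    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    digits = load_digits()                      # 8x8 grayscale digit images
    Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target,
                                                    random_state=0)

    model = RandomForestClassifier(n_estimators=1000)
    model.fit(Xtrain, ytrain)
    ypred = model.predict(Xtest)

    print(metrics.classification_report(ytest, ypred))  # per-digit precision/recall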

5, Random forests summary

  Random forests are a powerful machine-learning method. Their advantages:

  (1) Because the underlying decision trees are so simple in principle, both training and prediction are very fast. In addition, the work parallelizes directly across multiple tasks, because each tree is entirely independent.

  (2) The multiple trees allow for probabilistic classification: a majority vote among the estimators gives an estimate of the probability (obtained in Scikit-Learn with the predict_proba() method; see the sketch after this list).

  (3) The nonparametric model is extremely flexible and can therefore perform well on tasks that other estimators under-fit.
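  A minimal sketch of the probabilistic output mentioned in (2), on illustrative blob data:

    from sklearn.datasets import make_blobs
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # One row per sample, one column per class; each entry is the mean
    # predicted class probability across the trees of the forest.
    print(model.predict_proba(X[:3]))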

  The main drawback of random forests is that their results are not easy to interpret: if you want to summarize what the classification model means, a random forest may not be the best choice.
