[Practical Machine Learning] 3.2 Decision Trees

Notes taken while following the course.
[3.2 The simplest and most commonly used model: decision trees — Stanford 2021 Fall: Practical Machine Learning (Chinese Edition)]

Decision trees can be used both for classification tasks and for regression tasks (regression trees).

Decision Trees

  • Pros
    • Explainable
    • Can handle both numerical features (split by greater-than / less-than thresholds) and categorical features
  • Cons
    • Very non-robust: easily affected by noise in the data (ensemble learning helps)
    • Complex trees cause over-fitting (mitigated by pruning the trees; see the sketch after this list)
    • Not easy to parallelize: the tree is grown split by split, and this sequential process limits parallel performance
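
A minimal sketch (not lecture code), assuming scikit-learn is available; the dataset and the max_depth value are illustrative, with the depth cap acting as a simple stand-in for pruning:

```python
# A minimal sketch: fit a depth-limited decision tree classifier with scikit-learn.
# The depth cap is a simple stand-in for pruning to limit over-fitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth limits tree complexity, which reduces over-fitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```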

Random Forest

  • Train multiple decision trees to improve robustness
    • Each tree is trained independently
    • Majority voting for classification, averaging for regression
    • The price is a somewhat higher training cost
  • Where is the randomness from? (two random ways)
    • Bagging: randomly sample training examples with replacement (repeatedly draw a bootstrap sample from the data and train one tree on each, giving n trees)
      • E.g. [1,2,3,4,5] → [1,2,2,3,4] (the same example may be drawn more than once)
    • Randomly select a subset of features for each tree instead of using all features (a rough sketch follows this list)
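
The rough sketch below (not lecture code) illustrates the two sources of randomness; variable names such as n_trees and feature_subsets are assumptions made for illustration:

```python
# A rough sketch of a random forest: bagging plus random feature subsets,
# with majority voting at prediction time (averaging would be used for regression).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_trees, n_samples, n_features = 10, X.shape[0], X.shape[1]

trees, feature_subsets = [], []
for _ in range(n_trees):
    # Bagging: sample training examples with replacement, e.g. [1,2,3,4,5] -> [1,2,2,3,4].
    rows = rng.integers(0, n_samples, size=n_samples)
    # Random feature subset: each tree only sees part of the features.
    cols = rng.choice(n_features, size=max(1, n_features // 2), replace=False)
    t = DecisionTreeClassifier(max_depth=3).fit(X[np.ix_(rows, cols)], y[rows])
    trees.append(t)
    feature_subsets.append(cols)

# Majority voting for classification.
votes = np.stack([t.predict(X[:, cols]) for t, cols in zip(trees, feature_subsets)])
pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training accuracy of the ensemble:", (pred == y).mean())
```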

Gradient Boosting Decision Trees

  • Train multiple trees sequentially (as before, many trees are trained, but no longer independently; they are trained one after another and combined into a larger model)
  • At step $t=1,\ldots$, denote by $F_t(x)$ the sum of the trees trained so far (each tree is a function, and the model's output is the sum of the outputs of all trees)
    • Train a new tree $f_t$ on the residuals $\{(x_i,\, y_i - F_t(x_i))\}_{i=1,\ldots}$: at step $t$ the new tree is fit not on the original labels but on the residuals (the difference between the true value and the current prediction, i.e. the part the model has not yet captured), so the combined model moves closer to the true values
    • $F_{t+1}(x) = F_t(x) + f_t(x)$
  • The residual equals $-\partial L / \partial F$ when using mean squared error as the loss, so each new tree fits the negative gradient of the loss, hence the name gradient boosting (gradient descent itself is defined in a later section; a bare-bones sketch follows this list)
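
A bare-bones sketch of this loop, assuming scikit-learn regression trees and squared loss; the shrinkage factor learning_rate is a common practical addition rather than part of the formula above:

```python
# Gradient boosting with squared loss: each tree fits the current residuals,
# and the ensemble prediction is the (shrunk) sum of all trees so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate, n_steps = 0.1, 50
F = np.zeros_like(y)              # F_0(x): start from the zero prediction
trees = []
for _ in range(n_steps):
    residual = y - F              # with squared loss this is -dL/dF (up to a constant)
    f_t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F = F + learning_rate * f_t.predict(X)   # F_{t+1}(x) = F_t(x) + f_t(x), with shrinkage
    trees.append(f_t)

print("mean squared error after boosting:", np.mean((y - F) ** 2))
```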

Summary

  • Decision tree: an explainable model for classification/regression
  • Ensemble trees to reduce bias and variance (decision trees are very sensitive to noise in the data; bias and variance are defined in a later section)
    • Random forest: trees trained in parallel, with randomness injected into each tree
    • Gradient boosting trees: trained sequentially on residuals (each new tree fits the part that the previous trees predicted inaccurately)
  • Trees are widely used in industry
    • Simple, easy to tune, often gives satisfactory results (training is simple, there are not too many hyperparameters to adjust, and it is easy to obtain good results)

Origin blog.csdn.net/weixin_62501745/article/details/128796852