[Machine Learning] How does a decision tree perform regression

1. Introduction

In https://blog.csdn.net/qq_51392112/article/details , we covered the basics of decision trees in detail: how they are built, the types of decision trees, the formulas involved, and so on. That content leaned toward the classification task, which is also easy to understand, because intuitively decision trees are naturally suited to classification.

  • But decision trees can also handle regression tasks, which is what this post covers in detail.
  • Unlike linear regression, a regression tree partitions the feature "space" into regions, and each region corresponds to a single predicted value, as the small sketch below illustrates.
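
As a small illustration (my own sketch, not from the original post, and assuming scikit-learn is available), fitting a shallow DecisionTreeRegressor on 1-D data shows the piecewise-constant nature of the predictions: every point that falls into the same region receives the same value.

```python
# A minimal sketch (my own illustration, assuming scikit-learn; not code from the post):
# a regression tree predicts one constant value per region of the input space.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=80)).reshape(-1, 1)   # 1-D toy inputs
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)   # noisy targets

# A shallow tree splits the x-axis into a handful of intervals (regions)
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)

# Predictions are piecewise constant: every x that falls in the same leaf
# gets the same value (the mean of the training labels in that leaf)
X_grid = np.linspace(0, 6, 20).reshape(-1, 1)
print(np.round(tree.predict(X_grid), 3))   # at most 4 distinct values appear
```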

2. Constructing a regression tree

1) The predicted value of a region (space) is the average of the labels of all training samples that fall into it:

$$\hat{y}_{R_j} = \frac{1}{|R_j|} \sum_{x_i \in R_j} y_i$$

2) Like linear regression, a regression tree also needs a loss function to evaluate how well it fits. Here the residual sum of squares (RSS) is used:

$$RSS = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2$$

  • Here $y_i$ is the true value of sample $i$, and $\hat{y}_{R_j}$ is the predicted value shared by all samples in region $R_j$.
  • The inner ∑ sums the squared difference between the predicted value and the true value over all samples in one region;
  • The outer ∑ traverses all the divided regions. A small numeric sketch of these two formulas follows.
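
Below is a tiny worked example (my own illustration, not code from the post) of the two formulas above: each region predicts the mean label of its training samples, and RSS adds up the squared errors over all regions.

```python
# Compute per-region mean predictions and the total RSS for a given assignment
# of samples to regions.
import numpy as np

def region_means_and_rss(y, region_ids):
    """y: array of true labels; region_ids: which region R_j each sample falls in."""
    rss = 0.0
    means = {}
    for j in np.unique(region_ids):
        y_j = y[region_ids == j]
        means[j] = y_j.mean()                     # \hat{y}_{R_j}: mean label in region j
        rss += np.sum((y_j - means[j]) ** 2)      # inner sum over samples in R_j
    return means, rss                             # outer sum already accumulated

# Example: 6 samples assigned to 2 regions
y = np.array([1.0, 1.2, 0.8, 5.0, 5.5, 4.5])
region_ids = np.array([0, 0, 0, 1, 1, 1])
print(region_means_and_rss(y, region_ids))  # means ≈ {0: 1.0, 1: 5.0}, RSS ≈ 0.58
```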

But if we think about it carefully, using this criterion directly for regression requires an astonishing amount of computation, because there are far too many possible ways to partition the space. To deal with this, a method is commonly used to simplify the search over partitions.

  • This method is called "recursive binary splitting"!

3. Recursive binary splitting

What is recursive binary splitting? As the name suggests, each split of the tree is made in the form of a binary tree: after we first split the samples into two child nodes (that is, regions) $R_j$ according to a feature and its best split point, we continue to split the samples of each region into two parts again from the current position!
1) Splitting procedure:

  • Top-down: start from all samples, and keep splitting the samples into 2 branches from the current position.
  • Greedy: each split only considers the best choice for the current split and does not look back at previous splits.

2) Optimization principles:

  • Select the splitting dimension $x_j$ (that is, one of the data's features) and the split point $s$ so that the RSS of the regression tree after the split is smallest:

    $$\min_{j,\,s} \left[ \sum_{x_i \in R_1(j,s)} \left( y_i - \hat{y}_{R_1} \right)^2 + \sum_{x_i \in R_2(j,s)} \left( y_i - \hat{y}_{R_2} \right)^2 \right], \quad R_1(j,s) = \{ x \mid x_j < s \},\; R_2(j,s) = \{ x \mid x_j \geq s \}$$

In plain terms, after the initial division into two regions, we keep selecting a dimension according to the RSS loss and dividing the sub-region into two again at the split point $s$ along that dimension, as the sketch below demonstrates.
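
The following is a sketch of a single greedy step of recursive binary splitting (my own illustration, not the original author's code): scan every feature $j$ and candidate split point $s$, and keep the $(j, s)$ pair whose two child regions give the smallest total RSS.

```python
import numpy as np

def best_split(X, y):
    """Greedily find the (feature, threshold) pair that minimizes the children's RSS."""
    n, d = X.shape
    best = (None, None, np.inf)            # (feature j, threshold s, RSS)
    for j in range(d):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            # RSS of the two children, each predicting its own mean
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best[2]:
                best = (j, s, rss)
    return best

# The full tree is grown by applying best_split recursively to each child region
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
print(best_split(X, y))   # splits the x-axis roughly between 3 and 10
```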

4. Pruning of regression trees

Similarly, optimizing a regression tree runs into the same issue as linear regression: in the process of optimizing the model by reducing the loss function, the model easily falls into an "overfitting" state, so a "regularization term" must be introduced as a penalty.

  • The difference from linear regression is that, since a regression tree is not a numerical model, the regularization term cannot be a numerical one such as the L2 penalty; instead, the regularization term of a regression tree is tied to its leaf nodes:

    $$\sum_{m=1}^{|T|} \sum_{x_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2 + \alpha |T|$$

    Here $|T|$ denotes the number of terminal (leaf) nodes of the tree $T$. When the hyperparameter α > 0, the more leaves the tree has, the more complex the model; the tree pays a price for its complexity, so the subtree that minimizes the expression above tends to be smaller. A scikit-learn sketch of this idea follows.
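
Below is a hedged illustration of this cost-complexity idea (my own sketch, assuming scikit-learn, which exposes the penalty weight as `ccp_alpha`; this is not code from the post): grow a large tree, then watch the optimal subtree shrink as α grows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

big_tree = DecisionTreeRegressor(random_state=0).fit(X, y)    # fully grown tree

# Each alpha on the pruning path corresponds to one optimal subtree:
# larger alpha -> heavier penalty on |T| -> fewer leaves
path = big_tree.cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")
```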

5. Summary

Regression decision tree algorithm:

  1. Generate a large tree on the training set using recursive binary splitting, stopping only when the number of observations contained in a terminal node falls below some minimum value.
  2. Apply cost-complexity pruning to the large tree to obtain a sequence of optimal subtrees, as a function of α.
  3. Use K-fold cross-validation to select α. Specifically, divide the training set into K folds.
    • For each k = 1, 2, ⋯, K: repeat steps (1)-(2) on all training data that does not belong to the k-th fold to obtain the subtrees corresponding to each α, and compute the mean squared prediction error of those subtrees on the k-th fold.
  4. Each α then has K corresponding mean squared prediction errors; average these K values and select the α that minimizes the average error.
  5. Return the subtree from step (2) that corresponds to the selected α; a cross-validation sketch is given below.
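
The following sketch walks through steps 3-5 (my own illustration of the procedure, assuming scikit-learn): pick the α whose K-fold mean squared error is smallest, then refit the pruned tree on the full training set.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

# Candidate alphas taken from the cost-complexity path of a fully grown tree
alphas = (DecisionTreeRegressor(random_state=0)
          .fit(X, y)
          .cost_complexity_pruning_path(X, y)
          .ccp_alphas)

# K-fold cross-validation (K=5): average MSE over the folds for every candidate alpha
cv_mse = [
    -cross_val_score(DecisionTreeRegressor(random_state=0, ccp_alpha=a),
                     X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in alphas
]

# Select the alpha with the smallest average error and refit the corresponding subtree
best_alpha = alphas[int(np.argmin(cv_mse))]
final_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(f"best alpha={best_alpha:.4f}  leaves={final_tree.get_n_leaves()}")
```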

