Traditional machine learning notes 6 - regression tree model

Foreword

  We introduced the decision tree classification model in the previous part (Traditional machine learning notes 4 - decision tree); if you are not familiar with it, it is worth going back to review it first. In fact, a decision tree can also be used for regression tasks, in which case we call it a regression tree. The core is still a tree structure, but the way the tree grows on the attributes differs from the classification decision tree.

1. Decision tree regression

1.1. Core idea

  Let's first take a look at the typical structure of a decision tree, shown in the figure below. For a more detailed introduction to decision trees, refer back to the earlier note; we won't repeat it here.
[Figure: typical structure of a decision tree]
  Now let's look at the core idea. When it comes to regression trees, the first thing that comes to mind is the CART tree. CART stands for Classification And Regression Trees, i.e., it covers both classification trees and regression trees. A CART tree has the following characteristics:
The decision tree is assumed to be a binary tree: the value of each internal node's feature test is either yes or no, where the right branch is the branch taking the value yes and the left branch is the branch taking the value no. Such a decision tree recursively bisects each feature, divides the input space into a finite number of units, and determines the predicted distribution on these units, i.e., it outputs the conditional distribution of the target given the input. Now suppose we have a data set D; the general idea of building a regression tree is as follows:

  1. Consider every feature j on the data set D, traverse all possible values or split points s of each feature, and divide the data set D into two parts $D_1$ and $D_2$.
  2. Compute the sum of squared errors of $D_1$ and $D_2$ respectively, select the feature and split point with the smallest total squared error, and generate two child nodes, i.e., split the data into two parts.
  3. Recursively apply steps 1 and 2 to the two child nodes until the stopping condition is satisfied (steps 1 and 2 are sketched in code below).
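  A minimal sketch of steps 1 and 2, assuming the samples are held in a NumPy feature matrix X and target vector y (the function name and the choice of candidate split points are my own):

```python
import numpy as np

def best_split(X: np.ndarray, y: np.ndarray):
    """Scan every feature j and every candidate split point s, and return the
    pair that minimizes the total sum of squared errors of D1 and D2."""
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(X.shape[1]):                       # step 1: every feature j
        for s in np.unique(X[:, j]):                  # every candidate split point s
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:     # skip degenerate splits
                continue
            # step 2: SSE of D1 plus SSE of D2, each measured around its own mean
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse
```

Calling best_split on the data at a node yields the feature/threshold pair used to create its two children; step 3 then repeats the call on each child.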

  After the regression tree is constructed, the partition of the input space is complete, i.e., the regression tree has been built. The entire input space is divided into multiple sub-regions, and the output of each sub-region is the average target value of the training samples that fall in it. Let's look at two examples:
Partitioning the entire input space into sub-regions:
[Figure: partition of the input space into sub-regions]

The output of each sub-region is the average target value of the training samples in that region:
[Figure: each sub-region outputs the average target value of its training samples]
  We know that a regression tree divides the input space into units, and the output value of each unit is the average target value of the points inside it. But we want to build the most effective regression tree, i.e., the one with the smallest difference between the predicted values and the true values. Below we introduce how a regression tree grows.
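  A short justification for using the average as the region output: for a fixed region $R_j$ containing $N_j$ samples, the constant prediction $c$ that minimizes the squared error inside the region is exactly the sample mean, since
$$\frac{\partial}{\partial c} \sum_{x_i \in R_j}\left(y_i-c\right)^2=-2 \sum_{x_i \in R_j}\left(y_i-c\right)=0 \quad \Longrightarrow \quad c=\frac{1}{N_j} \sum_{x_i \in R_j} y_i$$
So, given a partition, the averages are already the best possible leaf outputs; what remains is to choose the partition itself.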

2. Heuristic segmentation and optimal attribute selection

2.1. Regression model example

  Let's take the income of baseball players as an example to explain the regression model, as shown in the figure below:
[Figure: baseball player salary data (Years vs. Hits)]
where:

  • Red and yellow indicate high income, blue and green indicate low income
  • The abscissa indicates years of experience (Years), and the ordinate indicates performance (Hits)

There are two features in total: years of experience (Years) and hits (Hits). The decision process is determined by the generated regression tree, as shown in the figure below:
[Figure: regression tree for predicting baseball player salary]
  The root decision node splits on the feature Years with a threshold of 4.5: samples with Years less than 4.5 go to the left, and samples with Years greater than or equal to 4.5 go to the right. The second decision node splits on Hits with a threshold of 117.5: samples with Hits less than 117.5 go to the left, and samples with Hits greater than or equal to 117.5 go to the right. A sample follows the decision conditions of the tree down to a leaf node to obtain its predicted salary. There are three possible predicted values here: 5.11, 6.0, and 6.74.
  After the regression tree is constructed, the partition of the entire space is realized, as shown in the figure below. At prediction time, a new sample is routed by the decision process of the regression tree into one of the regions $R_i$, and the predicted value for this new sample (here, the baseball player's salary) is the output of that region. As shown below, the entire plane is divided into 3 parts:
$$\begin{gathered} R_1=\{X \mid \text{Years}<4.5\} \\ R_2=\{X \mid \text{Years} \geq 4.5,\ \text{Hits}<117.5\} \\ R_3=\{X \mid \text{Years} \geq 4.5,\ \text{Hits} \geq 117.5\} \end{gathered}$$
[Figure: partition of the Years-Hits plane into regions R1, R2, R3]
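  To make the decision process concrete, here is a minimal sketch of the rules encoded by this tree, with the thresholds (4.5 and 117.5) and the three leaf values (5.11, 6.0, 6.74) read off the example above; the function name is just for illustration:

```python
def predict_salary(years: float, hits: float) -> float:
    """Decision rules of the regression tree in the baseball example."""
    if years < 4.5:        # region R1
        return 5.11
    if hits < 117.5:       # region R2: Years >= 4.5, Hits < 117.5
        return 6.0
    return 6.74            # region R3: Years >= 4.5, Hits >= 117.5

print(predict_salary(years=3, hits=100))   # lands in R1 -> 5.11
print(predict_salary(years=6, hits=150))   # lands in R3 -> 6.74
```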

2.2. Construction method of regression tree

  The core of constructing a regression tree lies in the segmentation method and the attribute selection. First assume a regression problem in which the estimated result is $y \in R$ and the feature vector is $X=\left[x_1, x_2, x_3, \ldots, x_p\right]$. The two steps for constructing the regression tree are then:

  1. First divide the feature space into $J$ non-overlapping regions $R_1, R_2, R_3, \ldots, R_J$.
  2. For every sample that falls into region $R_j$ we give the same prediction $\tilde{y}_{R_j}=\frac{1}{n} \sum_{i \in R_j} y_i$, where $n$ is the total number of samples in $R_j$.

Through the above construction, we hope to find the partition $R_1, R_2, R_3, \ldots, R_J$ that minimizes the RSS, where the RSS is expressed as follows:
$$RSS=\sum_{j=1}^J \sum_{i \in R_j}\left(y_i-\tilde{y}_{R_j}\right)^2$$
where:

  • $y$: the label vector formed by the labels of the training samples; each element $y_i$ corresponds to the label of one sample.
  • $X$: the collection of features, with $x_1, x_2, \ldots, x_p$ being the 1st through the $p$-th feature.
  • $R_1, R_2, R_3, \ldots, R_J$: the non-overlapping regions into which the entire feature space is divided (see the figure above).
  • $\tilde{y}_{R_j}$: the average label value of the samples assigned to the $j$-th region $R_j$; this value is used as the predicted value of the region, i.e., if a test sample falls into this region at test time, its label is predicted as $\tilde{y}_{R_j}$.

  From the calculation above, we can see that when the feature space is complex, the amount of computation is very large, since enumerating all possible partitions is infeasible. This leads to the recursive binary splitting introduced below.
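  As a small illustration of the objective above, here is a sketch (with made-up region assignments) that computes the RSS of a given partition, using the region means as predictions:

```python
import numpy as np

def rss_of_partition(y: np.ndarray, region_ids: np.ndarray) -> float:
    """RSS = sum over regions of the squared deviations from the region mean.

    y          : target values of all training samples
    region_ids : index of the region (0..J-1) each sample falls into
    """
    rss = 0.0
    for j in np.unique(region_ids):
        y_j = y[region_ids == j]
        rss += np.sum((y_j - y_j.mean()) ** 2)   # leaf prediction = region mean
    return rss

# toy example with J = 2 regions
y = np.array([1.0, 1.2, 0.8, 5.0, 5.5])
regions = np.array([0, 0, 0, 1, 1])
print(rss_of_partition(y, regions))   # small: the partition separates the two clusters
```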

Recursive binary splitting

  The regression tree uses a top-down, greedy, recursive method. "Greedy" here means that each split considers only the current optimum, without revisiting earlier splits. Mathematically, we select the splitting dimension (feature) and the split point that minimize the RSS of the split tree. The formula is as follows:
$$\begin{aligned} &R_1(j, s)=\left\{x \mid x_j<s\right\} \\ &R_2(j, s)=\left\{x \mid x_j \geq s\right\} \\ &RSS=\sum_{x_i \in R_1(j, s)}\left(y_i-\tilde{y}_{R_1}\right)^2+\sum_{x_i \in R_2(j, s)}\left(y_i-\tilde{y}_{R_2}\right)^2 \end{aligned}$$
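  Putting the pieces together, here is a from-scratch sketch of the greedy recursion (the names are my own, and a maximum depth plus a minimum node size stand in for the generic stopping condition):

```python
import numpy as np

def sse(v: np.ndarray) -> float:
    """Sum of squared errors of v around its own mean (0 for an empty array)."""
    return float(np.sum((v - v.mean()) ** 2)) if len(v) else 0.0

def build_tree(X, y, depth=0, max_depth=3, min_samples=5):
    """Greedy recursive binary splitting; each leaf stores the region mean."""
    if depth >= max_depth or len(y) < min_samples:
        return {"leaf": True, "value": float(y.mean())}
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            mask = X[:, j] < s
            if mask.all() or not mask.any():          # degenerate split, skip
                continue
            rss = sse(y[mask]) + sse(y[~mask])        # RSS of R1(j, s) and R2(j, s)
            if best is None or rss < best[2]:
                best = (j, s, rss)
    if best is None:                                  # no valid split found
        return {"leaf": True, "value": float(y.mean())}
    j, s, _ = best
    mask = X[:, j] < s
    return {"leaf": False, "feature": j, "threshold": s,
            "left":  build_tree(X[mask],  y[mask],  depth + 1, max_depth, min_samples),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}

def predict_one(node: dict, x: np.ndarray) -> float:
    """Walk from the root to a leaf and return the leaf's stored mean."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node["value"]
```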
  Now let's compare recursive and non-recursive segmentation directly in the two pictures below. The left picture is obtained by non-recursive segmentation, and the right picture by recursive binary splitting.
[Figure: non-recursive partition (left) vs. recursive binary partition (right)]
  As can be seen from the figure, recursive segmentation reliably finds a good solution, while non-recursive segmentation cannot exhaustively cover all possible partitions (doing so is algorithmically infeasible) and may fail to find a good solution.
  The overall process of the regression tree is similar to that of the classification tree: at each branch, the possible split thresholds of every feature are exhausted to find the optimal splitting feature and the optimal split threshold, where the criterion is minimizing the squared error. Branching stops once a preset termination condition (such as an upper limit on the number of leaves) is reached.
  In practice, a single regression tree model has limited capacity and may overfit. We therefore often use the Boosting idea from ensemble learning to enhance regression trees; the resulting model is the Boosting Decision Tree, which leads further to the Gradient Boosting Decision Tree (GBDT) and then to XGBoost. By fitting the residuals with a sequence of regression trees, the deviation between the predicted values and the label values is reduced step by step, achieving accurate prediction.
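  A minimal sketch of the residual-fitting idea, using scikit-learn's DecisionTreeRegressor as the base learner (the data, the learning rate, and the number of trees are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)

n_trees, lr = 100, 0.1
pred = np.full_like(y, y.mean())        # initial prediction: the global mean
trees = []
for _ in range(n_trees):
    residual = y - pred                 # each new tree fits the current residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * tree.predict(X)
    trees.append(tree)

print(np.mean((y - pred) ** 2))         # training error shrinks as trees are added
```

For squared loss, the residual is (up to a constant) the negative gradient of the loss with respect to the current prediction, which is why this residual-fitting scheme generalizes to GBDT.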

3. Overfitting and regularization

3.1. Overfitting problem

  If the tree is too small, the model performs poorly; if the tree is too large, it overfits, and this trade-off is hard to control. The following methods for addressing overfitting have therefore been developed.

3.2. The solution to the overfitting problem

3.2.1. Constraints to control tree overgrowth

  • Limit tree depth: stop growing the tree when the preset maximum depth is reached.
  • Classification error method: when continuing to grow the tree cannot achieve the desired reduction in error, growth stops.
  • Minimum leaf-node data volume: when the amount of data in a leaf node is too small, the tree stops growing (see the sketch after this list).
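  As an illustration, these constraints map naturally onto hyperparameters of scikit-learn's DecisionTreeRegressor (the data here is synthetic and the parameter values are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

reg = DecisionTreeRegressor(
    max_depth=4,                  # limit tree depth
    min_impurity_decrease=0.01,   # stop when a split does not reduce the error enough
    min_samples_leaf=10,          # minimum amount of data in a leaf node
).fit(X, y)

print(reg.get_depth(), reg.get_n_leaves())
```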

3.2.2. Pruning

  The disadvantage of constraining tree growth is that it rules out other possibilities in advance and terminates the growth of the tree prematurely. Alternatively, we can let the tree grow fully and then prune it, which is so-called post-pruning. The main post-pruning algorithms are the following:

  • Reduced-Error Pruning (REP, error rate reduction pruning).
  • Pessimistic-Error Pruning (PEP, pessimistic error pruning).
  • Cost-Complexity Pruning (CCP, cost complexity pruning).
  • Error-Based Pruning (EBP, error-based pruning).

3.2.3. Regularization

  For regression trees, we add a regularization term during pruning. Consider a subtree $T$ obtained after pruning, as shown below, where $\alpha$ is the coefficient of the regularization term. For a fixed $\alpha$, the best subtree is the one that minimizes the value of the following expression:
$$\sum_{m=1}^{|T|} \sum_{x_i \in R_m}\left(y_i-\tilde{y}_{R_m}\right)^2+\alpha|T|$$

  • $|T|$: the number of leaf nodes of the regression (sub)tree $T$.
  • $\alpha$: the regularization coefficient, which can be selected by cross-validation (see the sketch below).
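  Here is a sketch of how $\alpha$ might be chosen by cross-validation, using scikit-learn's cost-complexity pruning support (the data is synthetic; ccp_alpha is scikit-learn's name for $\alpha$):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)

# candidate alphas come from the cost-complexity pruning path of a fully grown tree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0.0, None)   # guard against tiny negative values

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
    scoring="neg_mean_squared_error",
).fit(X, y)

print(search.best_params_["ccp_alpha"], search.best_estimator_.get_n_leaves())
```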

That covers the basics of the regression tree model's principles; comments and corrections are welcome.

Origin blog.csdn.net/qq_38683460/article/details/127510978