"Machine Learning in Action" Chapter 9 Tree Regression

      Linear regression builds a single global model that must fit all of the samples (locally weighted linear regression is the exception). When the data has many features and the relationships among them are complex, building one global model becomes too difficult. A feasible alternative is to split the data set into pieces that are easy to model and then apply linear regression to each piece. If a piece is still hard to fit after the first split, it is split again. Tree structures combined with regression methods are very useful for this kind of recursive splitting.

       The CART (Classification And Regression Trees) algorithm can be used for both classification and regression. Tree pruning is used to keep the tree from overfitting.

Chapter 3 of this book uses decision trees for classification. A decision tree keeps splitting the data into smaller subsets until all the target values in a subset are identical, or until the data can no longer be split. A decision tree is a greedy algorithm: it makes the best choice available at each step without considering whether that leads to a global optimum.

Three algorithms are commonly used to construct decision trees: ID3, C4.5, and CART.
The main difference among the three lies in the criterion used to split the branches of the tree:
       (1) ID3 splits by information gain
       (2) C4.5 splits by information gain ratio
       (3) CART splits by the Gini index
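
       For reference (standard definitions, not spelled out in these notes), where p_k is the fraction of class k in data set D and D^v is the subset of D taking value v on feature A:

    Ent(D) = -\sum_{k} p_k \log_2 p_k                                               % entropy
    Gain(D, A) = Ent(D) - \sum_{v} \frac{|D^v|}{|D|} Ent(D^v)                       % ID3: information gain
    GainRatio(D, A) = Gain(D, A) \Big/ \Bigl(-\sum_{v} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}\Bigr)   % C4.5: gain ratio
    Gini(D) = 1 - \sum_{k} p_k^2                                                    % CART: Gini index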

       The main differences between CART and C4.5: CART can be used for regression as well as classification, while C4.5 can only be used for classification; a C4.5 node may be split into several child nodes, while CART always uses binary splits, so every internal node has exactly two children. Building on CART there is the "tree ensemble" random forest, and building on regression trees there is the "tree ensemble" GBDT.

       The tree-building algorithm used earlier, in Chapter 3, is ID3. ID3 chooses the current best feature each time it splits the data and splits on every possible value of that feature; if a feature has 4 possible values, the data is cut into 4 parts. Once the data has been split on a feature, that feature plays no further role in the rest of the algorithm, and one criticism is that this way of splitting is too aggressive. Besides splitting too quickly, another problem with ID3 is that it cannot handle continuous features directly: a continuous feature can be used with ID3 only if it is converted into a discrete one in advance, and that conversion destroys the inherent properties of the continuous variable.

       Another tree-construction method is binary splitting, in which the data set is cut into exactly two parts at each step. If a sample's value for the chosen feature satisfies the split condition, the sample goes into the left subtree; otherwise it goes into the right subtree. Binary splitting makes it easy to adapt the tree-building process to continuous features: if the feature value is greater than the given split value, the sample goes to the left subtree, otherwise to the right subtree. Binary splitting also saves tree-construction time, although this is not especially important, because trees are usually built offline and construction time is rarely a concern.
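
       A minimal sketch of such a binary split, assuming the data set is a two-dimensional NumPy array whose rows are samples (the function name binSplitDataSet follows the book's convention):

    import numpy as np

    def binSplitDataSet(dataSet, feature, value):
        # Split dataSet on one feature and one threshold: rows whose value for
        # `feature` is greater than `value` go to the left subtree (mat0),
        # the remaining rows go to the right subtree (mat1).
        mat0 = dataSet[np.nonzero(dataSet[:, feature] > value)[0], :]
        mat1 = dataSet[np.nonzero(dataSet[:, feature] <= value)[0], :]
        return mat0, mat1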

       CART is a well-known and well-documented tree-building algorithm. It uses binary splits to handle continuous variables, and with slight modifications it can handle regression problems. The CART code here stores the tree as a dictionary containing:
    (1) the feature to split on;
    (2) the value of that feature used for the split;
    (3) the right subtree (or a single value, when no further split is needed);
    (4) the left subtree, similar to the right subtree.
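
       As an illustration (the numbers are hypothetical; the key names spInd, spVal, left and right follow the book's code), one node of such a tree might look like:

    retTree = {
        'spInd': 0,          # index of the feature to split on
        'spVal': 0.48,       # value of that feature used for the split
        'left': 1.02,        # a leaf: a single constant value
        'right': {           # a subtree: another dictionary of the same form
            'spInd': 0,
            'spVal': 0.20,
            'left': 0.72,
            'right': -0.04,
        },
    }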

       CART can build two kinds of trees: a regression tree, where each leaf node holds a single constant value, and a model tree, where each leaf node holds a linear equation. The idea of the regression tree is similar to that of the classification tree, except that the values at the leaf nodes are continuous rather than discrete.

1. Use the CART algorithm for regression

       A regression tree assumes that each leaf node is a constant value; the underlying assumption is that the complex relationships in the data can be summarized by a tree of such constants. To build a tree with piecewise constant leaves, we need a way to measure how consistent the data at a node is. When the ID3 algorithm builds a classification tree, it measures the disorder of the data at a node. Measuring the disorder of continuous values is straightforward: first compute the mean of all the target values, then compute how far each value deviates from that mean. To treat positive and negative deviations equally, the absolute value or the square of the deviation is used instead of the raw difference. Here the squared deviations are used, giving the total variance (total squared error): total variance = mean squared error * number of samples.
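
       A minimal sketch of this error measure, assuming as above that the data set is a NumPy array whose last column holds the target values (the name regErr follows the book's convention):

    import numpy as np

    def regErr(dataSet):
        # Total squared error of the targets at this node:
        # variance of the target column times the number of samples.
        return np.var(dataSet[:, -1]) * dataSet.shape[0]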

1.1 Building the tree

       The first thing to do is to find the best place to split the data set; the book uses the function chooseBestSplit() for this. Given an error-calculation method, the function finds the best binary split of the data set, and returns a leaf node once splitting should stop. It iterates over all features and all of their possible values to find the split threshold that minimizes the error; a sketch of the function is given after the stopping conditions listed below.


Three conditions under which splitting stops:
(1) All of the remaining target values are identical (only one distinct value is left);
(2) Splitting the data set does not reduce the error by much (less than a user-defined tolerance), so no split is performed and a leaf node is created directly;
(3) One of the two subsets produced by the split has fewer samples than the user-defined parameter tolN.
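
       A sketch of chooseBestSplit() along these lines, assuming the binSplitDataSet() and regErr() sketched above; regLeaf() and the parameter ops = (tolS, tolN), the minimum error reduction and the minimum subset size, follow the book's conventions:

    import numpy as np

    def regLeaf(dataSet):
        # Leaf value for a regression tree: the mean of the target column.
        return np.mean(dataSet[:, -1])

    def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
        tolS, tolN = ops
        # Stop condition (1): all remaining target values are identical.
        if len(set(dataSet[:, -1].tolist())) == 1:
            return None, leafType(dataSet)
        m, n = dataSet.shape
        S = errType(dataSet)                      # error before any split
        bestS, bestIndex, bestValue = np.inf, 0, 0
        for featIndex in range(n - 1):            # every feature
            for splitVal in set(dataSet[:, featIndex].tolist()):   # every value
                mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
                if mat0.shape[0] < tolN or mat1.shape[0] < tolN:
                    continue
                newS = errType(mat0) + errType(mat1)
                if newS < bestS:                  # keep the lowest-error split
                    bestIndex, bestValue, bestS = featIndex, splitVal, newS
        # Stop condition (2): the best split barely reduces the error.
        if (S - bestS) < tolS:
            return None, leafType(dataSet)
        mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
        # Stop condition (3): a subset produced by the split is too small.
        if mat0.shape[0] < tolN or mat1.shape[0] < tolN:
            return None, leafType(dataSet)
        return bestIndex, bestValue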

1.2 Tree pruning

       A tree with too many nodes indicates that the model may be "overfitting" the data. The process of reducing the complexity of a decision tree in order to avoid overfitting is called pruning. The early-termination conditions in chooseBestSplit() are in fact a form of so-called prepruning. The other kind of pruning, which requires separate training and test sets, is called postpruning.

       Prepruning: as the name suggests, prepruning stops tree growth early, pruning while the decision tree is being constructed. Every decision-tree construction method stops creating branches when the impurity (for example, the entropy) can no longer be reduced. To guard against overfitting, a threshold can be set: if a split reduces the impurity by less than this threshold, branch creation stops even though the impurity could still be lowered. In practice, however, this approach does not work very well.
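
       In the regression-tree code, the prepruning thresholds are exactly what the ops parameter carries. A minimal sketch of the tree-building routine, assuming the chooseBestSplit() above (the name createTree follows the book's convention; different ops values make the tree more or less eager to split):

    def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1, 4)):
        # Recursively build the tree dictionary; ops = (tolS, tolN) act as prepruning.
        feat, val = chooseBestSplit(dataSet, leafType, errType, ops)
        if feat is None:                 # a stopping condition was met: return a leaf
            return val
        retTree = {'spInd': feat, 'spVal': val}
        lSet, rSet = binSplitDataSet(dataSet, feat, val)
        retTree['left'] = createTree(lSet, leafType, errType, ops)
        retTree['right'] = createTree(rSet, leafType, errType, ops)
        return retTree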

       Postpruning: pruning is performed after the decision tree has been fully constructed. The pruning process examines groups of nodes that share the same parent and asks whether merging them would increase the impurity by less than some threshold; if so, the group is collapsed into a single node covering all of the outcomes. In a regression tree this merging, also called collapsing, is done by taking the average of the subtrees being merged, and the merge is kept when it lowers the error on the test set. Postpruning is by far the more common practice. The function prune() works as sketched below:
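
       A sketch of prune(), assuming the binSplitDataSet() from earlier; isTree() and getMean() are small helpers named after the book's code (isTree() checks whether a node is still a dictionary, getMean() recursively collapses a subtree to the average of its leaves):

    import numpy as np

    def isTree(obj):
        return isinstance(obj, dict)

    def getMean(tree):
        if isTree(tree['right']): tree['right'] = getMean(tree['right'])
        if isTree(tree['left']):  tree['left'] = getMean(tree['left'])
        return (tree['left'] + tree['right']) / 2.0

    def prune(tree, testData):
        if testData.shape[0] == 0:        # no test data reaches this branch: collapse it
            return getMean(tree)
        if isTree(tree['left']) or isTree(tree['right']):
            lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        if isTree(tree['left']):  tree['left'] = prune(tree['left'], lSet)
        if isTree(tree['right']): tree['right'] = prune(tree['right'], rSet)
        # If both children are now leaves, test whether merging them lowers the test error.
        if not isTree(tree['left']) and not isTree(tree['right']):
            lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
            errorNoMerge = (np.sum(np.power(lSet[:, -1] - tree['left'], 2)) +
                            np.sum(np.power(rSet[:, -1] - tree['right'], 2)))
            treeMean = (tree['left'] + tree['right']) / 2.0
            errorMerge = np.sum(np.power(testData[:, -1] - treeMean, 2))
            if errorMerge < errorNoMerge:
                return treeMean            # merging reduces the test error: collapse
        return tree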


2. Model tree

       When a tree is used to model data, besides simply setting the leaf nodes to constant values, another option is to set each leaf node to a linear function, so that the overall model is piecewise linear, that is, it consists of multiple linear segments. The interpretability of a model tree is one of its advantages over a regression tree, and it can also achieve higher prediction accuracy. For the data shown in the figure, fitting with two straight lines is clearly better than fitting with a single straight line.

(Figure: a data set with two linear segments, fitted noticeably better by two straight lines than by one.)
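
       A sketch of how the model-tree leaves could be produced within the same createTree() framework: linearSolve() fits ordinary least squares to the data at a node, modelLeaf() returns the coefficients as the leaf value, and modelErr() is the error measure passed in instead of regErr() (the names follow the book's convention):

    import numpy as np

    def linearSolve(dataSet):
        # Format the node's data into X (with a bias column) and Y, then solve OLS.
        m, n = dataSet.shape
        X = np.ones((m, n)); Y = np.ones((m, 1))
        X[:, 1:n] = dataSet[:, 0:n - 1]
        Y[:, 0] = dataSet[:, -1]
        xTx = X.T @ X
        if np.linalg.det(xTx) == 0.0:
            raise NameError('This matrix is singular, cannot do inverse; try increasing ops[1]')
        ws = np.linalg.inv(xTx) @ (X.T @ Y)
        return ws, X, Y

    def modelLeaf(dataSet):
        # Leaf value for a model tree: the fitted linear-regression coefficients.
        ws, X, Y = linearSolve(dataSet)
        return ws

    def modelErr(dataSet):
        # Node error for a model tree: sum of squared residuals of the linear fit.
        ws, X, Y = linearSolve(dataSet)
        yHat = X @ ws
        return np.sum(np.power(Y - yHat, 2))

A model tree is then built with createTree(data, modelLeaf, modelErr, ops) rather than the regression-tree defaults.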

       To judge which of the three models, the model tree, the regression tree, or the ordinary regression method, fits best, the book uses the correlation coefficient, R^2, as the measure. R^2 is the coefficient of determination, a measure of goodness of fit: it reflects the proportion of the variation in the dependent variable that is explained by the independent variables in the regression model. The closer the value of R^2 is to 1.0, the better.
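
       A sketch of how this measure could be computed with NumPy (the book compares predictions against the true values with np.corrcoef; yHat stands for the predictions of whichever model is being evaluated):

    import numpy as np

    def r_squared(yHat, y):
        # Square of the Pearson correlation between predictions and true values;
        # the closer to 1.0, the better the fit.
        return np.corrcoef(yHat, y)[0, 1] ** 2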
