Machine Learning: Decision Tree (Part 2)

Decision tree (part 2): the CART algorithm for classification and regression

1. Overview

Building a complete decision tree model takes three steps: feature selection, decision tree generation, and pruning.

CART decision trees are binary trees.

The CART algorithm consists of the following two steps:

(1) Decision tree generation: Generate a decision tree based on the training data set, and the generated decision tree should be as large as possible;

(2) Decision tree pruning: Use a validation data set to prune the generated tree and select the optimal subtree, using minimization of the loss function as the pruning criterion.

The generation of a decision tree is the process of recursively building a binary decision tree: use the squared-error minimization criterion for the regression tree, and the Gini index minimization criterion for the classification tree, to perform feature selection and generate a binary tree.

(1) Regression tree:

generate:

Take each value of each input in the training set as a candidate split point, compute the loss once with the squared-error criterion (given in the main content below), and use the point with the smallest loss as the split point. Repeat this until the stopping requirement is met.

The value of each resulting unit is set to the mean of the outputs of all data points in that unit.
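As a minimal sketch of this split search in the one-dimensional case (the function name, the plain (x, y) tuples, and treating every observed x as a candidate are illustrative choices, not from the original post):

```python
def best_split_1d(points):
    """Scan candidate split points on 1-D data and return the one with the
    smallest total squared error (the leaf value of each side is its mean)."""
    def sse(ys):
        # squared error of a group around its mean, the optimal constant value
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    best_s, best_loss = None, float("inf")
    for s, _ in points:                      # every observed x is a candidate split point
        left = [y for x, y in points if x <= s]
        right = [y for x, y in points if x > s]
        loss = sse(left) + sse(right)
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s, best_loss

# Example: the best split separates the low cluster from the high one.
print(best_split_1d([(1, 1.1), (2, 0.9), (3, 1.0), (7, 5.2), (8, 4.8)]))
```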

 (2) Classification tree:

Traverse all features and all of their values in the training set. For each feature A and value a, split the data into two classes, D1 (samples with A = a) and D2 (the rest), and compute the conditional Gini index under that feature value. Take the feature value with the smallest Gini index as the split condition to divide the data set, and so on until the stopping condition is met.

Here p is the probability that the feature takes the corresponding value; all data that do not meet the condition are counted as the other class, with probability 1 − p.
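For example (an illustrative toy split, not from the original post): suppose splitting on A = a puts 4 samples in D1 (3 positive, 1 negative) and 6 samples in D2 (2 positive, 4 negative). Then Gini(D1) = 2 · (3/4) · (1/4) = 0.375, Gini(D2) = 2 · (2/6) · (4/6) ≈ 0.444, and Gini(D, A) = (4/10) · 0.375 + (6/10) · 0.444 ≈ 0.417. The feature/value pair with the smallest such value is chosen.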

2. Main content

(1) Regression tree

Model:

f(x) = Σ_{m=1}^{M} c_m · I(x ∈ R_m)

That is, for a regression problem one input corresponds to one output. The resulting regression tree divides the input space into M units R_1, ..., R_M by M − 1 splits, and each unit corresponds to a fixed output value. When predicting, the output is the value of the unit that the input falls into. Expressed geometrically, taking one-dimensional input data as an example, the plotted x-y graph is a step function.

In the formula, c_m is the value of the m-th unit, and the indicator function I(x ∈ R_m) indicates whether the input x lies in that unit.
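A minimal sketch of this prediction rule for one-dimensional input; the region boundaries and values below are made-up illustrative numbers:

```python
# Units of a fitted 1-D regression tree: (lower, upper, value). Illustrative numbers only.
REGIONS = [(float("-inf"), 2.5, 1.0), (2.5, 6.0, 4.2), (6.0, float("inf"), 7.8)]

def predict(x, regions=REGIONS):
    """Return c_m of the unit R_m containing x; plotted against x this is a step function."""
    for lower, upper, value in regions:
        if lower < x <= upper:
            return value
    raise ValueError("x falls outside all regions")
```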

Strategy:

//============= Extension =============//

After the division is completed, the squared error can be used to represent the prediction error of the training data on each unit.

For unit R_m:  Σ_{x_i ∈ R_m} (y_i − f(x_i))²

According to the criterion of minimum squared error, the value of the unit should clearly be set to the mean of the outputs of all input points falling in it:

ĉ_m = ave(y_i | x_i ∈ R_m)
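A one-line check of this claim: minimizing Σ_{x_i ∈ R_m} (y_i − c)² over the constant c, setting the derivative −2 Σ_{x_i ∈ R_m} (y_i − c) to zero gives c = (1/N_m) Σ_{x_i ∈ R_m} y_i, i.e. exactly the mean of the outputs in the unit.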

//===============================//

The problem is how to divide the input space.

Here, a heuristic method is used: select the j-th variable x^(j) and a value s of it as the splitting variable and splitting point, and define two regions:

R1(j, s) = { x | x^(j) ≤ s },   R2(j, s) = { x | x^(j) > s }

Then find the optimal splitting variable j and optimal splitting point s. Specifically, solve

min_{j,s} [ min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)² ]

Taking every input value in the training set as a candidate split point, evaluate the expression above once for each candidate, and use the one with the smallest loss as this split.

The values of the two divided regions are the means of the outputs of the sample points falling in them, that is:

ĉ1 = ave(y_i | x_i ∈ R1(j, s)),   ĉ2 = ave(y_i | x_i ∈ R2(j, s))

The optimal split point s can be found for a fixed input variable j.

Traverse all input variables to find the optimal splitting variable j, forming the pair (j, s). This divides the input space into two regions. Next, repeat the above division process on each region until the stopping condition is satisfied. This generates a regression tree, often called a least-squares regression tree.

Generate algorithm:
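A minimal recursive sketch of this generation procedure, assuming the training data are plain Python lists (X of feature vectors, y of targets) and using a simple minimum-sample stopping rule; the names and the stopping rule are illustrative choices, not the original post's listing:

```python
def sse(ys):
    """Squared error of a group of outputs around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((yi - mean) ** 2 for yi in ys)

def build_regression_tree(X, y, min_samples=2):
    """Grow a least-squares regression tree, returned as nested dicts.

    X: list of feature vectors (lists/tuples), y: list of target values.
    Stops when a node has fewer than min_samples points or a single target value.
    """
    if len(y) < min_samples or len(set(y)) == 1:
        return {"value": sum(y) / len(y)}                 # leaf: mean output of the unit

    best = None
    for j in range(len(X[0])):                            # traverse splitting variables x^(j)
        for s in set(row[j] for row in X):                # every observed value is a candidate s
            left = [i for i, row in enumerate(X) if row[j] <= s]
            right = [i for i, row in enumerate(X) if row[j] > s]
            if not left or not right:
                continue
            loss = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or loss < best[0]:
                best = (loss, j, s, left, right)

    if best is None:                                      # no valid split: make a leaf
        return {"value": sum(y) / len(y)}

    _, j, s, left, right = best
    return {
        "feature": j,
        "threshold": s,
        "left": build_regression_tree([X[i] for i in left], [y[i] for i in left], min_samples),
        "right": build_regression_tree([X[i] for i in right], [y[i] for i in right], min_samples),
    }
```

Prediction then simply walks the nested dicts from the root, comparing x[feature] with threshold at each node until a leaf value is reached.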

(2) Classification tree

The classification tree uses the Gini index to select the optimal feature, and at the same time determines the optimal binary split point of that feature.

For a classification problem with K classes, where p_k is the probability that a sample belongs to class k, the Gini index of the probability distribution is

Gini(p) = Σ_{k=1}^{K} p_k (1 − p_k) = 1 − Σ_{k=1}^{K} p_k²

and for a sample set D, with C_k the subset of D belonging to class k,

Gini(D) = 1 − Σ_{k=1}^{K} (|C_k| / |D|)²

For the CART decision tree, since each split asks a binary question (A = a or not), the two-class form Gini(p) = 2p(1 − p) is the one used.

Conditional Gini index: if feature A with value a splits D into D1 (samples satisfying A = a) and D2 (the rest), then

Gini(D, A) = (|D1| / |D|) · Gini(D1) + (|D2| / |D|) · Gini(D2)

The Gini index Gini(D) represents the uncertainty of the set D, and Gini(D, A) represents the uncertainty of D after it is partitioned according to A = a. The larger the Gini index, the greater the uncertainty of the sample set, which is similar to entropy.

Generate algorithm:

Traverse all values of all features, compute the conditional Gini index for each, select the smallest one as the split point, and so on.
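A minimal sketch of this search, assuming categorical features stored as plain lists and a split that tests feature == value as described above; the function names are illustrative:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_gini_split(X, y):
    """Return (feature index, value, conditional Gini) of the best binary split."""
    best = None
    for j in range(len(X[0])):
        for a in set(row[j] for row in X):
            d1 = [y[i] for i, row in enumerate(X) if row[j] == a]   # samples with A = a
            d2 = [y[i] for i, row in enumerate(X) if row[j] != a]   # the rest
            if not d1 or not d2:
                continue
            cond = len(d1) / len(y) * gini(d1) + len(d2) / len(y) * gini(d2)
            if best is None or cond < best[2]:
                best = (j, a, cond)
    return best
```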

Pruning algorithm:

The CART pruning algorithm cuts some subtrees off the bottom of the "fully grown" decision tree to make it smaller (the model becomes simpler), so that it makes more accurate predictions on unknown data. The algorithm consists of two steps: first, starting from the bottom of the tree T0 produced by the generation algorithm, keep pruning until only the root node of T0 remains, forming a subtree sequence {T0, T1, ..., Tn}; second, test this sequence of subtrees on an independent validation data set by cross-validation and select the optimal subtree from it.

Trees are pruned recursively. Increasing a from small to large, 0 = a0 < a1 < ... < an < +∞, produces by pruning the optimal subtree sequence {T0, T1, ..., Tn}; the subtrees in the sequence are nested.

 

As a grows from small to large, the number of leaf nodes gradually decreases, so leaves are cut off by adjusting a rather than at random. The problem is how to control a to achieve this pruning.

 //================ Preliminary knowledge ================//

Suppose the whole tree T0 is to be pruned, and t is any internal node of it. If t is regarded as a tree with a single node, its loss is

C_a(t) = C(t) + a

because |T| = 1 here. In general, C_a(T) = C(T) + a|T| is the loss of a tree T when the parameter is a, C(T) is its prediction error on the training data (for example the squared error or the Gini index), and |T| is its number of leaf nodes. Since the tree consisting only of node t has a single leaf, its loss is C(t) + a.

The loss function of the subtree T_t with t as its root node is

C_a(T_t) = C(T_t) + a|T_t|

A subtree with t as the root node means the tree whose root is node t, together with all the internal nodes and leaf nodes below it; |T_t| is the number of leaf nodes of this subtree.

Obviously:

when a = 0, and whenever a is sufficiently small,

C_a(T_t) < C_a(t)

and as a increases, at some value of a,

C_a(T_t) = C_a(t)

Because the larger a is, the greater the loss for a fixed number of leaf nodes; since T_t has more leaves than the single node t, its loss grows faster, until it reaches the loss of keeping only the node t with no children.

In this way, the leaf nodes of the tree are reduced by increasing a.

//===================================//

 

Doing so means: setting C_a(T_t) = C_a(t), i.e. C(T_t) + a|T_t| = C(t) + a, and solving for a gives

g(t) = (C(t) − C(T_t)) / (|T_t| − 1)

Compute g(t) for every internal node t and take the smallest one as a. At this value of a, the loss of the subtree T_t before pruning equals the loss after pruning it down to the single node t, so T_t can be cut off. This yields a new tree T1 after pruning, and the corresponding value of a for this new tree, g(t), is denoted a1.

Then find the smallest g(t) within the new tree T1 and prune again, and so on. The end result is the sequence of subtrees running from the original tree all the way down to a single root node.
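A minimal sketch of one such weakest-link step, assuming each tree node is a dict that stores its own training error C(t) under the key "error" (its error if collapsed to a leaf), with children under "left"/"right" and leaves being nodes without children; this node representation is an illustrative assumption, not from the original post:

```python
def leaves_and_error(node):
    """(|T_t|, C(T_t)): number of leaves and total leaf error of the subtree rooted at node."""
    if "left" not in node:                       # leaf node
        return 1, node["error"]
    nl, el = leaves_and_error(node["left"])
    nr, er = leaves_and_error(node["right"])
    return nl + nr, el + er

def weakest_link(node, best=None):
    """Find the internal node with the smallest g(t) = (C(t) - C(T_t)) / (|T_t| - 1)."""
    if "left" not in node:
        return best
    n_leaves, subtree_err = leaves_and_error(node)
    g = (node["error"] - subtree_err) / (n_leaves - 1)
    if best is None or g < best[0]:
        best = (g, node)
    best = weakest_link(node["left"], best)
    return weakest_link(node["right"], best)

def prune_once(root):
    """One pruning step: collapse the weakest-link subtree T_t into the single node t.

    Returns the g(t) that triggered the cut, i.e. the next parameter value a_k.
    """
    found = weakest_link(root)
    if found is None:                            # root is already a single leaf
        return None
    g, node = found
    del node["left"], node["right"]              # t keeps its own value/error and becomes a leaf
    return g
```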

2. Select the optimal subtree T_a by cross-validation within the subtree sequence T0, T1, ..., Tn obtained by pruning. Specifically, use an independent validation data set to measure the squared error or the Gini index of each subtree in the sequence; the decision tree with the smallest squared error or Gini index is taken as the optimal decision tree. Each subtree T1, T2, ..., Tn corresponds to a parameter a1, a2, ..., an, so once the optimal subtree Tk is determined, the corresponding ak is also determined, and the optimal decision tree T_a is obtained.
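A minimal sketch of this selection step, assuming subtrees holds the pruned sequence {T0, ..., Tn}, alphas the matching values a_k, and error_fn a caller-supplied function returning the squared error or Gini index of a tree on the validation set; all of these names are illustrative:

```python
def select_best_subtree(subtrees, alphas, X_val, y_val, error_fn):
    """Pick the subtree with the smallest error on the validation set.

    error_fn(tree, X_val, y_val) should return the squared error or Gini index
    of the tree's predictions; it is passed in rather than fixed here.
    Returns the optimal subtree T_k and its corresponding a_k.
    """
    errors = [error_fn(t, X_val, y_val) for t in subtrees]
    k = min(range(len(errors)), key=errors.__getitem__)
    return subtrees[k], alphas[k]
```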

 

 


Origin blog.csdn.net/stephon_100/article/details/125243173