Machine Learning Review (8): The CART Decision Tree Algorithm

Note: This series is continuously updated and is also posted on GitHub and Gitee; you can download the note files for all articles in this series from GitHub or Gitee.

1 Introduction

The previous posts introduced two decision tree algorithms, ID3 and C4.5, both of which can only be used for classification. The CART (Classification and Regression Tree) algorithm discussed in this article can be used not only for classification problems but also for regression problems.

Compared with ID3 and C4.5, a distinctive characteristic of CART is that every non-leaf node has exactly two subtrees. In other words, when the dataset is split on a feature attribute, no matter how many values that attribute can take, there are only two branches: "yes" and "no". Take the earlier example of deciding whether someone is a programmer: if we split on the degree of myopia, the data might be divided into the two subsets {'slight'} and {'moderate', 'severe'} (other pairings of the values are of course also possible), and each subset is then further refined in subsequent iterations.

Below, we discuss how the CART algorithm handles classification problems and regression problems.

2 Classification

For classification problems, the CART algorithm uses the Gini index as the criterion for selecting the optimal splitting feature attribute.

Let us first define the Gini index. Like entropy, the smaller the Gini index, the lower the uncertainty of the dataset and the higher its purity. Given a dataset $X$ containing $L$ classes, the Gini index of $X$ is:

$$Gini(X) = \sum\limits_{l = 1}^L {\frac{{|{X_l}|}}{{|X|}}\left( {1 - \frac{{|{X_l}|}}{{|X|}}} \right)}  = 1 - \sum\limits_{l = 1}^L {{{\left( {\frac{{|{X_l}|}}{{|X|}}} \right)}^2}} $$
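
As a small illustration, here is a minimal Python sketch of this formula (the function and the toy labels are my own, not from the original post):

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum over classes of (|X_l| / |X|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# 5 programmers and 5 non-programmers: maximum impurity for two classes
print(gini(["yes"] * 5 + ["no"] * 5))  # 0.5
```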

Suppose feature attribute $A$ of dataset $X$ has several possible values and $a$ is one of them. Splitting $X$ according to whether $A$ equals $a$ divides it into two subsets, namely ${X_1} = \left\{ {x \in X|{x_A} = a} \right\}$ and ${X_2} = \left\{ {x \in X|{x_A} \ne a} \right\}$. The Gini index of $X$ under feature $A$ at the value $a$ is then:

$$Gini(X,A) = \frac{{|{X_1}|}}{{|X|}}Gini({X_1}) + \frac{{|{X_2}|}}{{|X|}}Gini({X_2})$$
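
Building on the sketch above, the split Gini index can be computed by weighting the Gini index of each subset by its relative size (again, the function and argument names are my own):

```python
def gini_split(values, labels, a):
    """Gini(X, A): split samples into x_A == a and x_A != a, then weight each subset's Gini by its size."""
    left = [y for v, y in zip(values, labels) if v == a]
    right = [y for v, y in zip(values, labels) if v != a]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```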

Next, we demonstrate how to select the optimal splitting feature attribute using the Gini index through an example. We again use the dataset from the earlier post introducing the ID3 algorithm, shown below. First, compute the Gini index for each possible value of the three feature attributes.

Gini index "wear plaid shirt" attribute value of $ A $:

$$Gini(X,{A_1}) = \frac{5}{{10}} \times \left\{ {2 \times \frac{4}{5} \times \frac{1}{5}} \right\} + \frac{5}{{10}} \times \left\{ {2 \times \frac{3}{5} \times \frac{2}{5}} \right\} = 0.4$$

$ A $ attribute of "do not wear plaid shirt" This value is calculated Gini index, as only two properties, regardless of which property is calculated according to the same result, so:

$$Gini(X,{A_2}) = Gini(X,{A_1}) = 0.4$$

Gini index for the value "severe" of attribute $B$:

$$Gini(X,{B_1}) = \frac{3}{{10}} \times \left\{ {2 \times \frac{2}{3} \times \frac{1}{3}} \right\} + \frac{7}{{10}} \times \left\{ {2 \times \frac{5}{7} \times \frac{2}{7}} \right\} = 0.42$$

Gini index of "moderate" the value of the property $ B $:

$$Gini(X,{B_2}) = \frac{4}{{10}} \times \left\{ {2 \times \frac{4}{4} \times \frac{0}{4}} \right\} + \frac{6}{{10}} \times \left\{ {2 \times \frac{3}{6} \times \frac{3}{6}} \right\} = 0.3$$

$ B $ attribute of "slight" Gini index value is calculated:

$$Gini(X,{B_3}) = \frac{3}{{10}} \times \left\{ {2 \times \frac{1}{3} \times \frac{2}{3}} \right\} + \frac{7}{{10}} \times \left\{ {2 \times \frac{6}{7} \times \frac{1}{7}} \right\} = 0.46$$

$ C $ attribute of "serious" this value is calculated Gini index:

$$Gini(X,{C_1}) = \frac{3}{{10}} \times \left\{ {2 \times \frac{0}{3} \times \frac{3}{3}} \right\} + \frac{7}{{10}} \times \left\{ {2 \times \frac{4}{7} \times \frac{3}{7}} \right\} = 0.34$$

Gini index of "medium" attribute value of $ C $:

$$Gini(X,{C_2}) = \frac{3}{{10}} \times \left\{ {2 \times \frac{1}{3} \times \frac{2}{3}} \right\} + \frac{7}{{10}} \times \left\{ {2 \times \frac{5}{7} \times \frac{2}{7}} \right\} = 0.42$$

Gini index of "slight" attribute value of $ C $:

$$Gini(X,{C_3}) = \frac{3}{{10}} \times \left\{ {2 \times \frac{1}{3} \times \frac{2}{3}} \right\} + \frac{7}{{10}} \times \left\{ {2 \times \frac{6}{7} \times \frac{1}{7}} \right\} = 0.46$$

We can see that the Gini index is smallest when attribute $B$ takes the value "moderate", so this value is chosen as the optimal split for the current dataset. After this split, two datasets are obtained; we then continue computing Gini indices on each resulting dataset and selecting its optimal splitting feature value, iterating in this way until a complete decision tree is formed.
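
The by-hand calculation above can be automated by scanning every attribute and every one of its values and keeping the split with the smallest Gini index. The toy dataset below is a hypothetical stand-in, not the actual table from the ID3 post:

```python
def best_gini_split(X, y):
    """Return (attribute, value, gini) of the binary split with the smallest Gini index.

    X maps attribute name -> list of values; y is the list of class labels.
    """
    best = None
    for attr, values in X.items():
        for a in set(values):
            g = gini_split(values, y, a)
            if best is None or g < best[2]:
                best = (attr, a, g)
    return best

# Hypothetical data in the spirit of the "is this person a programmer?" example
X = {
    "plaid_shirt": ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "no"],
    "hair_loss":   ["severe", "moderate", "slight", "moderate", "severe",
                    "moderate", "slight", "moderate", "severe", "slight"],
}
y = ["yes", "yes", "no", "yes", "yes", "yes", "no", "no", "no", "no"]
print(best_gini_split(X, y))
```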

For continuous feature attributes, CART can adopt the same handling of continuous attributes as the C4.5 algorithm (discretizing via candidate split thresholds), except that the Gini index is computed instead of the information gain ratio.
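
For the continuous case, a minimal sketch (my own, following the C4.5-style idea the paragraph refers to) sorts the distinct values, takes the midpoint between each adjacent pair as a candidate threshold, and scores every threshold with the weighted Gini index:

```python
def best_threshold_split(values, labels):
    """Best threshold t for a continuous attribute: split into x <= t and x > t, minimize the Gini index."""
    distinct = sorted(set(values))
    best_t, best_g = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2.0                      # candidate threshold: midpoint of adjacent distinct values
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        n = len(labels)
        g = len(left) / n * gini(left) + len(right) / n * gini(right)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g
```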

3 Regression

Now the problem we study is regression (the difference between regression and classification was already analyzed when discussing the linear regression algorithm; if it is not clear, please refer to that post first), so a change of mindset is needed. For any $x \in X$, the output $f(x)$ of the tree is no longer, as in a classification tree, restricted to the label values that appear in $X$: the output $f(x)$ of a regression tree may well be a value never seen in the training data, and the number of possible output values is not even fixed. Therefore, for regression trees, the first problem to solve is how to determine the possible values of $f(x)$.

For dataset $X$ and a feature attribute $A$, suppose we divide the dataset into two parts according to one of $A$'s values, $a$:

$${X_1} = \{ x|{x_A} \leqslant a\} $$

$${X_2} = \{ x|{x_A} > a\} $$

Suppose the output values $f(x)$ on these two subsets are ${c_1}$ and ${c_2}$ respectively. Then, when $X$ is split on the value $a$ of feature attribute $A$, the total error produced is:

$$Los{s_{A,a}} = \sum\limits_{x \in {X_1}} {(y - {c_1}} {)^2} + \sum\limits_{x \in {X_2}} {(y - {c_2}} {)^2}$$

Here $y$ is the actual target value corresponding to each sample in $X$. Our goal is to choose ${c_1}$ and ${c_2}$ so that $Los{s_{A,a}}$ is minimized, i.e. the objective is:

$${\min \sum\limits_{x \in {X_1}} {{{(y - {c_1})}^2}}  + \min \sum\limits_{x \in {X_2}} {{{(y - {c_2})}^2}} }$$

So for what values of ${c_1}$ and ${c_2}$ is $Los{s_{A,a}}$ smallest? From the properties of least squares, $Los{s_{A,a}}$ is minimized when ${c_1}$ and ${c_2}$ are the averages of all the $y$ values in ${X_1}$ and ${X_2}$ respectively, namely:

$${c_i} = ave(y|x \in {X_i})$$

Therefore, if the subset obtained by splitting on $A$ is a leaf node, the final output value for samples falling in that leaf is the average of all the $y$ values in the corresponding subset:

$$f(x)={c_i} = ave(y|x \in {X_i})$$
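
A minimal sketch of this loss for one candidate split (the function name and the `<=` convention follow the formulas above; everything else is my own):

```python
def squared_loss_split(values, targets, a):
    """Loss_{A,a}: split on x_A <= a vs x_A > a, fit each side with its mean c_i, sum the squared errors."""
    left = [y for v, y in zip(values, targets) if v <= a]
    right = [y for v, y in zip(values, targets) if v > a]
    loss = 0.0
    for side in (left, right):
        if side:
            c = sum(side) / len(side)            # optimal constant output for this side: the mean of y
            loss += sum((y - c) ** 2 for y in side)
    return loss
```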

The question of how to determine the output values has now been resolved. Two more questions remain: which feature attribute to choose as the optimal splitting attribute, and which of its values to choose as the optimal split point.

These can be addressed by iterating over the possible values of every feature attribute in the dataset: for each feature attribute $A$ of dataset $X$, compute $Los{s_{A,a}}$ for every one of its values $a$; then compare all the $Los{s_{A,a}}$. The feature attribute $A$ corresponding to the minimum $Los{s_{A,a}}$ is the current optimal splitting feature attribute, and $a$ is the optimal split point.

Now that we know how to determine each branch's output value and how to choose the optimal splitting feature attribute and split point, we can summarize the build process of the CART decision tree for regression problems (a code sketch follows the steps):

(1) For the current dataset $X$, compute $Los{s_{A,a}}$ for every feature attribute $A$ and every value $a$ of $A$ taken as a split point;

(2) Compare all $Los{s_{A,a}}$ and select the feature attribute $A$ corresponding to the smallest $Los{s_{A,a}}$ as the current optimal splitting feature attribute, with $a$ as the optimal split point; split the dataset into left and right subtrees accordingly;

(3) Repeat steps (1) and (2) on the left and right subtree datasets, continuing to split until the datasets at the tree nodes satisfy the specified stopping condition.
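
A minimal recursive sketch of these three steps, reusing `squared_loss_split` from above (the minimum-node-size stopping rule is my own choice of "specified condition"):

```python
def build_regression_tree(X, y, min_samples=3):
    """Grow a CART regression tree. X maps attribute name -> list of numeric values; y is the target list."""
    if len(y) < min_samples:
        return {"leaf": True, "value": sum(y) / len(y)}      # leaf outputs the mean of y

    # Steps (1)-(2): scan every attribute A and split point a, keep the smallest Loss_{A,a}
    best = None
    for attr, values in X.items():
        for a in sorted(set(values))[:-1]:                   # drop the max so the right subset is non-empty
            loss = squared_loss_split(values, y, a)
            if best is None or loss < best[2]:
                best = (attr, a, loss)
    if best is None:                                          # nothing left to split on
        return {"leaf": True, "value": sum(y) / len(y)}

    attr, a, _ = best
    mask = [v <= a for v in X[attr]]
    left_X = {k: [v for v, m in zip(vals, mask) if m] for k, vals in X.items()}
    right_X = {k: [v for v, m in zip(vals, mask) if not m] for k, vals in X.items()}
    left_y = [t for t, m in zip(y, mask) if m]
    right_y = [t for t, m in zip(y, mask) if not m]

    # Step (3): recurse on the two subsets
    return {"leaf": False, "attr": attr, "split": a,
            "left": build_regression_tree(left_X, left_y, min_samples),
            "right": build_regression_tree(right_X, right_y, min_samples)}
```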

4 Tree Pruning

Whether for classification or regression, the resulting tree is likely to end up overly complex and prone to overfitting, so after the tree is built, it still needs to be pruned.

Here we consider the Cost-Complexity Pruning (CCP) method, which proceeds as follows (a code sketch of the key step follows the list):

Input: the decision tree $T_0$ generated by the CART algorithm  
Output: the optimal pruned decision tree ${T_\alpha}$  
(1) Let $k = 0$, $T = {T_0}$, $\alpha  =  + \infty $;  
(2) From the top down, compute, for each internal node $t$, $C({T_t})$ and $|{T_t}|$, as well as

$$g(t) = \frac{{C(t) - C({T_t})}}{{|{T_t}| - 1}}$$

$$\alpha  = \min (\alpha ,g(t))$$ 

where ${T_t}$ denotes the subtree rooted at $t$, ${C(t)}$ is the prediction error on the training dataset after pruning at $t$ (i.e. replacing ${T_t}$ by a single leaf), ${C({T_t})}$ is the prediction error of ${T_t}$ on the training dataset, and ${|{T_t}|}$ is the number of leaf nodes of ${T_t}$;

(3) From the top down, visit the internal nodes $t$; if $g(t) = \alpha $, prune at $t$ and determine the output of the new leaf node $t$ by majority voting, obtaining the tree $T$;

(4) Let $k = k + 1$, ${\alpha _k} = \alpha $, ${T_k} = T$;

(5) If $T$ is not a tree consisting of the root node alone, return to step (2);

(6) Use cross-validation on the subtree sequence ${T_0},{T_1}, \cdots ,{T_k}$ to select the optimal subtree ${T_\alpha}$.
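
A small sketch of step (2), the weakest-link computation (the node dictionary fields `error` on leaves and `error_if_leaf` on internal nodes are my own assumed bookkeeping, not from the post):

```python
def leaves(node):
    """Collect the leaf nodes of the subtree rooted at node."""
    return [node] if node["leaf"] else leaves(node["left"]) + leaves(node["right"])

def weakest_link(node, best=None):
    """Find the internal node with the smallest g(t) = (C(t) - C(T_t)) / (|T_t| - 1)."""
    if node["leaf"]:
        return best
    c_subtree = sum(leaf["error"] for leaf in leaves(node))   # C(T_t): training error of the subtree
    g = (node["error_if_leaf"] - c_subtree) / (len(leaves(node)) - 1)
    if best is None or g < best[0]:
        best = (g, node)
    best = weakest_link(node["left"], best)
    return weakest_link(node["right"], best)

# One pruning round (step (3)): collapse the weakest-link node into a leaf
# g, t = weakest_link(tree)
# t.update(leaf=True, left=None, right=None, error=t["error_if_leaf"])
```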

To understand the CART pruning process, the key is to understand the meaning of $g(t)$. For an ideal decision tree, we naturally want the prediction error to be as small as possible and the tree to be as small as possible, but the two cannot be achieved at the same time, because the prediction error tends to grow as the tree shrinks. Considering only the change in prediction error, or only the change in tree size, is therefore inappropriate; it is better to choose a measure that accounts for both at once, for example their ratio.

Looking carefully at $g(t)$: the numerator is the prediction error after pruning minus the prediction error before pruning, i.e. the change in prediction error; the denominator is the change in the number of leaf nodes before and after pruning. So their ratio can be read as the change in prediction error contributed by each leaf node of the subtree rooted at $t$, or the rate at which the subtree at $t$ reduces the prediction error. That is the meaning of $g(t)$.
Why prune the node with the smallest $g(t)$ each time? Because a smaller $g(t)$ means the subtree at $t$ contributes less to the whole decision tree: pruning it has the least impact on the tree's accuracy, so it should naturally be pruned first.
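
As a quick hypothetical illustration of the formula (the numbers here are made up, not taken from the example that follows): if collapsing the subtree at $t$ into a single leaf raises the training error from $C({T_t}) = 0.02$ to $C(t) = 0.05$, and the subtree has $|{T_t}| = 4$ leaves, then

$$g(t) = \frac{0.05 - 0.02}{4 - 1} = 0.01$$

so on average each extra leaf of this subtree buys a 0.01 reduction in training error.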

If this is still unclear, the following example may help (the example comes from https://www.jianshu.com/p/b90a9ce05b28).

Suppose we have the dataset of points below and the tree constructed from it, as shown in the following figure:

At this point, ${\alpha _1} = 0$ and the tree has three internal nodes; compute $g(t)$ for each of them:

The $g(t)$ values of nodes ${t_2}$ and ${t_3}$ are the smallest; we choose to prune the smaller one, i.e. prune at ${t_3}$, and let ${\alpha _2} = \frac{1}{8}$. The tree after pruning is as follows:

Two internal nodes remain; continue to compute $g(t)$ for each of them:

Clearly $g({t_2})$ is smaller, so we prune at ${t_2}$ and let ${\alpha _3} = \frac{1}{8}$:

At this point only ${t_1}$ remains; the calculation gives $g({t_1}) = \frac{1}{4}$, so ${\alpha _4} = \frac{1}{4}$.

Having completed all the calculations above, we obtain the sequence ${\alpha _1} = 0,{\alpha _2} = \frac{1}{8},{\alpha _3} = \frac{1}{8},{\alpha _4} = \frac{1}{4}$ and the corresponding subtrees. The remaining task is to compute, on an independent validation dataset, the squared error or Gini index of each subtree, and select the subtree with the smallest error as the optimal pruned tree.
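
That final selection step can be sketched as below, assuming a hypothetical `predict(tree, x)` function and an independent validation set (neither is defined in the original post):

```python
def validation_error(tree, X_val, y_val, predict):
    """Mean squared error of one subtree on an independent validation set."""
    return sum((predict(tree, x) - y) ** 2 for x, y in zip(X_val, y_val)) / len(y_val)

# best_tree = min(subtrees, key=lambda T: validation_error(T, X_val, y_val, predict))
```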

5 Summary

To conclude, here is a brief comparison of the three decision tree algorithms: ID3 chooses splits by information gain, produces multiway splits, and handles only classification on discrete features; C4.5 chooses splits by the information gain ratio, still for classification only, but can additionally handle continuous features; CART always builds binary trees, uses the Gini index for classification and the squared error for regression, and thus supports both classification and regression.


Origin www.cnblogs.com/chenhuabin/p/11774926.html