Decision Tree Algorithm Principles (Part 2)

In Decision Tree Algorithm Principles (Part 1), we covered the ID3 algorithm and C4.5, the improved version of ID3. We also mentioned C4.5's shortcomings: the entropy-based model involves relatively expensive computations, it builds comparatively complex multi-way trees, and it handles only classification, not regression. The CART algorithm improves on most of these problems, and it is the focus of this article. Since CART can do both classification and regression, we introduce them separately: we start with the CART classification tree and how it differs from C4.5, then cover the CART regression tree and how it differs from the CART classification tree. After that we discuss the CART tree-building and pruning algorithms, and conclude with the advantages and disadvantages of decision tree algorithms in general.

1. The optimal feature selection method of the CART classification tree algorithm

We know that the ID3 algorithm uses information gain to select features, preferring features with larger gain. The C4.5 algorithm uses the information gain ratio instead, to correct information gain's bias toward features with many values. But both ID3 and C4.5 are based on the entropy model from information theory, which involves many logarithm computations. Can we simplify the model without losing the benefits of the entropy model entirely? Yes! The CART classification tree algorithm uses the Gini coefficient in place of the information gain ratio. The Gini coefficient measures the impurity of the model: the smaller the Gini coefficient, the lower the impurity and the better the feature. This is the opposite direction of information gain (ratio).

Specifically, in a classification problem with K classes, where the probability of the k-th class is $p_k$, the Gini coefficient is defined as:

$Gini(p) = \sum\limits_{k=1}^{K}p_k(1-p_k) = 1 - \sum\limits_{k=1}^{K}p_k^2$

For a binary classification problem, the computation is even simpler: if the probability that a sample belongs to the first class is p, the Gini coefficient is:

$Gini(p) = 2p(1-p)$

For a given sample set D with K classes, where $C_k$ denotes the samples belonging to the k-th class, the Gini coefficient of D is:

$Gini(D) = 1 - \sum\limits_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^2$

In particular, if a value a of feature A splits D into two parts D1 and D2, then given feature A, the Gini coefficient of D is:

$Gini(D, A) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$

Comparing the Gini coefficient expressions with the entropy model, isn't the quadratic computation much simpler than the logarithm? The binary-classification case in particular is easy to compute. But simplicity aside, how much error does the Gini coefficient introduce compared with the entropy model as a metric? For binary classification, the curves of the Gini coefficient and half the entropy are as follows:

As can be seen from the figure, the Gini coefficient and half the entropy are very close, deviating only slightly near the 45-degree line. Therefore the Gini coefficient can be used as an approximate substitute for the entropy model. The CART classification tree algorithm uses the Gini coefficient to select features for the decision tree. In addition, to simplify further, CART performs only a binary split on each feature value at a node rather than a multi-way split, so the CART classification tree algorithm builds a binary tree instead of a multi-way tree. This further simplifies the Gini coefficient computation, and it also yields a more refined binary-tree model.
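To make these formulas concrete, here is a minimal sketch (the function names and toy data are just for illustration) that computes the Gini coefficient of a sample set and of a candidate binary split, mirroring the $Gini(D)$ and $Gini(D, A)$ expressions above:

```python
import numpy as np

def gini(labels):
    """Gini coefficient of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels, mask):
    """Weighted Gini coefficient after splitting into D1 (mask) and D2 (~mask)."""
    n = len(labels)
    d1, d2 = labels[mask], labels[~mask]
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# toy data: 6 samples of a continuous feature x with binary labels y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 1, 1, 1, 0])
print(gini(y))                  # 0.5  (three samples of each class)
print(gini_split(y, x <= 2.5))  # 0.25 (the split x <= 2.5 reduces impurity)
```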

2. Improvements in the CART classification tree algorithm's handling of continuous and discrete features

For continuous values, the CART classification tree follows the same idea as C4.5: discretize the continuous feature. The only difference is the metric used to choose the split point: C4.5 uses the information gain ratio, while the CART classification tree uses the Gini coefficient.

The specific idea is as follows. Suppose a continuous feature A takes m values among the samples, sorted in ascending order as $\{a_1, a_2, ..., a_m\}$. The CART algorithm takes the average of each pair of adjacent values, obtaining m-1 candidate split points, where the i-th split point $T_i$ is $T_i = \frac{a_i + a_{i+1}}{2}$. For each of these m-1 points it computes the Gini coefficient when that point is used as the binary classification threshold, and selects the point with the smallest Gini coefficient as the split point for discretizing the continuous feature. For example, if the point with the minimum Gini coefficient is $a_t$, then values smaller than $a_t$ fall into category 1 and values greater than $a_t$ fall into category 2; this is how the continuous feature is discretized. Note that, unlike the way ID3 and C4.5 treat discrete attributes, if the current node splits on a continuous attribute, that attribute can still participate in the selection process when producing descendant nodes.
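A sketch of this split-point search over the m-1 midpoints (reusing the gini_split helper and the toy x, y from the previous snippet; all names are illustrative):

```python
def best_threshold(x, y):
    """Find the midpoint threshold of continuous feature x with the smallest
    weighted Gini coefficient for the binary split x <= t vs x > t."""
    values = np.sort(np.unique(x))
    thresholds = (values[:-1] + values[1:]) / 2.0   # T_i = (a_i + a_{i+1}) / 2
    best_t, best_g = None, float("inf")
    for t in thresholds:
        g = gini_split(y, x <= t)
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g

print(best_threshold(x, y))   # (2.5, 0.25) on the toy data above
```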

For discrete values, the CART classification tree uses the idea of repeatedly binarizing the discrete feature.

Recall that in ID3 or C4.5, if a discrete feature A is selected at a node and has three categories A1, A2, A3, the tree builds a three-way branch at that node, which results in a multi-way tree. The CART classification tree takes a different approach: it keeps binarizing. In this example, CART considers the three splits {A1} vs {A2, A3}, {A2} vs {A1, A3}, and {A3} vs {A1, A2}, and picks the combination with the smallest Gini coefficient, say {A2} vs {A1, A3}: the node then becomes a binary node, with one child holding the samples with value A2 and the other holding the samples with values {A1, A3}. Because this split does not fully separate the values of feature A, we may later select feature A again in a descendant node to further split A1 and A3. This is different from ID3 and C4.5, where within a subtree a discrete feature participates in at most one node split.
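The corresponding search over binary partitions of a discrete feature's categories might be sketched as follows (again reusing the gini_split helper from the first snippet; the data is a made-up example where {A2} vs {A1, A3} happens to be the best split):

```python
from itertools import combinations

def best_categorical_split(x_cat, y):
    """Enumerate binary partitions of a discrete feature's categories and
    return the left-hand category set with the smallest weighted Gini."""
    cats = list(np.unique(x_cat))
    best_left, best_g = None, float("inf")
    # every non-empty proper subset defines one side of a binary split
    # (each partition is visited twice, once per side, which is harmless here)
    for r in range(1, len(cats)):
        for left in combinations(cats, r):
            g = gini_split(y, np.isin(x_cat, left))
            if g < best_g:
                best_left, best_g = set(left), g
    return best_left, best_g

x_cat = np.array(["A1", "A1", "A2", "A2", "A3", "A3"])
y_cat = np.array([0, 1, 1, 1, 1, 0])
print(best_categorical_split(x_cat, y_cat))   # ({'A2'}, 0.333...)
```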

3. The concrete process of the CART classification tree algorithm

Having described some differences between CART and C4.5, let's look at the concrete process of building a CART classification tree. We only cover tree building here; CART has a separate pruning algorithm, which we discuss in section 5.

The input to the algorithm is the training set D, a Gini coefficient threshold, and a sample count threshold.

The output is a decision tree T.

The algorithm builds the CART tree recursively, starting from the root node with the full training set.

1) For the data set D at the current node, if the number of samples is below the threshold or there are no remaining features, return the decision subtree and stop recursing at the current node.

2) Compute the Gini coefficient of the sample set D. If it is below the threshold, return the decision subtree and stop recursing at the current node.

3) Compute the Gini coefficient of data set D for each value of every remaining feature at the current node. The handling of discrete and continuous values in the Gini coefficient computation follows section 2 above, and missing values are handled in the same way as described for the C4.5 algorithm in the previous article.

4) Among the Gini coefficients computed for each value of each feature on data set D, select the feature A and the feature value a with the smallest Gini coefficient. Using this optimal feature and optimal value, split the data set into two parts D1 and D2, and create left and right child nodes of the current node, with the left node holding data set D1 and the right node holding data set D2.

5) Recursively apply steps 1) through 4) to the left and right child nodes to generate the decision tree.

When the generated decision tree is used for prediction, if a test sample falls into a leaf node that contains multiple training samples, the predicted class for that sample is the majority class among the training samples in the leaf.
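Putting steps 1) through 5) together, a simplified sketch of the recursive build and of leaf prediction might look like this (numeric features only, reusing the gini and best_threshold helpers defined earlier; a real implementation would also handle categorical features and missing values):

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the split feature A
    threshold: Optional[float] = None  # split value a
    left: Optional["Node"] = None      # D1: samples with feature <= threshold
    right: Optional["Node"] = None     # D2: samples with feature >  threshold
    label: Optional[int] = None        # majority class, used when predicting at a leaf

def build(X, y, min_samples=2, min_gini=1e-3):
    node = Node(label=Counter(y).most_common(1)[0][0])
    # steps 1) and 2): stop on a too-small or nearly pure node
    if len(y) < min_samples or gini(y) < min_gini:
        return node
    # steps 3) and 4): pick the (feature, value) pair with the smallest Gini
    best_j, best_t, best_g = None, None, float("inf")
    for j in range(X.shape[1]):
        t, g = best_threshold(X[:, j], y)
        if t is not None and g < best_g:
            best_j, best_t, best_g = j, t, g
    if best_j is None:
        return node
    # step 5): split into D1/D2 and recurse on the left and right children
    mask = X[:, best_j] <= best_t
    node.feature, node.threshold = best_j, best_t
    node.left = build(X[mask], y[mask], min_samples, min_gini)
    node.right = build(X[~mask], y[~mask], min_samples, min_gini)
    return node

def predict_one(node, sample):
    """Route a sample down the tree; the leaf answers with its majority class."""
    while node.feature is not None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.label
```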

4. Building the CART regression tree

The CART regression tree is built in much the same way as the CART classification tree, so here we only discuss where the CART regression tree algorithm differs from the CART classification tree.

First, we need to be clear about what a regression tree is and what a classification tree is. The difference lies in the sample output: if the sample output is a discrete value, it is a classification tree; if the sample output is a continuous value, it is a regression tree.

Apart from this conceptual difference, the building and prediction of CART regression trees differ from CART classification trees mainly in two points:

  • The way continuous values are handled is different.
  • The way predictions are made after the tree has been built is different.

For handling continuous values, we saw that the CART classification tree uses the Gini coefficient to measure how good each candidate split point of a feature is. That metric suits classification, but for regression the CART regression tree uses the common sum-of-squared-errors metric: for any feature A and any split point s dividing the data into D1 and D2, we seek the feature and split point that minimize the sum of the squared errors of D1 and D2, each measured against its own mean output. The expression is:

$\min\limits_{A,s}\Big[\min\limits_{c_1}\sum\limits_{x_i \in D_1(A,s)}(y_i - c_1)^2 + \min\limits_{c_2}\sum\limits_{x_i \in D_2(A,s)}(y_i - c_2)^2\Big]$

Here, $c_1$ is the mean of the sample outputs in data set D1, and $c_2$ is the mean of the sample outputs in data set D2.

As for prediction after the tree is built, we mentioned above that the CART classification tree predicts with the highest-probability class among the leaf node's samples. A regression tree does not output a class; instead it uses the mean or median of the outputs of the leaf's samples as the final prediction.
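A sketch of the regression split search under this sum-of-squared-errors criterion (self-contained, with illustrative names and toy data):

```python
import numpy as np

def sse(y):
    """Sum of squared errors of y around its own mean (the inner min over c)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_regression_split(x, y):
    """Midpoint threshold s on feature x minimizing SSE(D1) + SSE(D2)."""
    values = np.sort(np.unique(x))
    best_s, best_err = None, float("inf")
    for s in (values[:-1] + values[1:]) / 2.0:
        mask = x <= s
        err = sse(y[mask]) + sse(y[~mask])
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([5.5, 5.7, 5.6, 8.1, 8.3, 8.0])   # the output jumps between x=3 and x=4
print(best_regression_split(x, y))             # the best threshold is s = 3.5
```

A leaf of the finished regression tree then predicts with the mean (or median) of the outputs of the training samples that fall into it, as described above.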

Apart from the points above, the CART regression tree and the CART classification tree are built and used for prediction in the same way.

5. CART tree pruning algorithm

The pruning strategies of CART classification trees and CART regression trees are identical except for the loss metric: the regression tree uses the sum of squared errors and the classification tree uses the Gini coefficient. The basic algorithm is exactly the same, so we discuss them together here.

Because the decision tree algorithm easily overfits the training set, leading to poor generalization, we need to prune the CART tree. This is analogous to regularization in linear regression and increases the generalization ability of the decision tree. But there are many possible prunings, so which should we choose? CART uses post-pruning: it first generates the decision tree, then produces all possible pruned CART subtrees, and uses cross-validation to compare the effect of each pruning, finally selecting the pruning strategy with the best generalization.

That is, CART tree pruning can be summarized in two steps: first, generate the candidate pruned subtrees from the original tree; second, use cross-validation to test the predictive power of the pruned subtrees and select the subtree with the best generalization as the final CART tree.

First let's look at the loss function used as the pruning metric. During pruning, for a subtree $T_t$ rooted at any node t, the loss function is:

$C_{\alpha}(T_t) = C(T_t) + \alpha|T_t|$

Here, $\alpha$ is the regularization parameter, analogous to the one in regularized linear regression. $C(T_t)$ is the prediction error on the training data, measured by the Gini coefficient for a classification tree and by the sum of squared errors for a regression tree. $|T_t|$ is the number of leaf nodes of the subtree $T_t$.

When $\alpha = 0$, i.e. with no regularization, the originally generated CART tree is the optimal subtree. When $\alpha = \infty$, i.e. with maximal regularization, the single-node tree formed by the root of the original CART tree is the optimal subtree. These are, of course, the two extremes. In general, the larger $\alpha$ is, the stronger the pruning, and the smaller the optimal subtree relative to the original tree. For a fixed $\alpha$, there exists a unique subtree that minimizes the loss function $C_{\alpha}(T)$.

Having seen the pruning loss metric, let's look at the idea behind pruning. For a subtree $T_t$ rooted at any node t, if it is not pruned, its loss is:

$C_{\alpha}(T_t) = C(T_t) + \alpha|T_t|$

If it is pruned, keeping only the root node t, the loss is:

$C_{\alpha}(T) = C(T) + \alpha$

When $\alpha = 0$ or $\alpha$ is very small, $C_{\alpha}(T_t) < C_{\alpha}(T)$. When $\alpha$ increases to a certain value, we have

$C_{\alpha}(T_t) = C_{\alpha}(T)$

and when $\alpha$ increases further, the inequality reverses. In other words, when the following formula is satisfied:

$\alpha = \frac{C(T) - C(T_t)}{|T_t| - 1}$

the subtree $T_t$ and the single-node tree $T$ have the same loss, but $T$ has fewer nodes, so we can prune the subtree $T_t$, i.e. cut off all its child nodes so that t becomes a leaf node $T$.
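As a hypothetical worked example: if a subtree $T_t$ has $|T_t| = 5$ leaves with training error $C(T_t) = 0.1$, and collapsing it to a single leaf gives error $C(T) = 0.3$, then the threshold is $\alpha = \frac{0.3 - 0.1}{5 - 1} = 0.05$; once the regularization strength reaches 0.05, the subtree's lower training error no longer pays for its four extra leaves and it gets pruned.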

Finally, let's look at the cross-validation strategy for CART tree pruning. From the above, we can compute for each subtree the threshold $\alpha$ at which pruning it becomes worthwhile. If we compute these $\alpha$ values for all nodes and then cross-validate the optimal subtrees corresponding to the different $\alpha$ values, we can choose the best $\alpha$ and use its corresponding optimal subtree as the final result.

With these ideas in hand, let's now look at the CART tree pruning algorithm.

The input is the CART tree T obtained from the tree-building algorithm.

The output is the optimal subtree $T_\alpha$.

The algorithm is as follows:

1) Initialize $\alpha_{min} = \infty$ and the optimal subtree set $\omega = \{T\}$.

2) Starting from the leaf nodes and working bottom-up, compute for each internal node t the training-error loss $C_{\alpha}(T_t)$ (sum of squared errors for a regression tree, Gini coefficient for a classification tree), the number of leaf nodes $|T_t|$, and the regularization threshold $\alpha = \min\{\frac{C(T) - C(T_t)}{|T_t| - 1}, \alpha_{min}\}$, then update $\alpha_{min} = \alpha$.

3) This gives the set M of the $\alpha$ values of all nodes.

4) Select the maximum value $\alpha_k$ from M, visit the internal nodes t of the subtree top-down, and prune whenever $\frac{C(T) - C(T_t)}{|T_t| - 1} \leq \alpha_k$, determining the value of the new leaf node t: for a classification tree, the class with the highest probability; for a regression tree, the mean of the outputs of all its samples. This yields the optimal subtree $T_k$ corresponding to $\alpha_k$.

5) Update the optimal subtree set $\omega = \omega \cup \{T_k\}$ and $M = M - \{\alpha_k\}$.

6) If M is not empty, return to step 4). Otherwise, we have obtained the full set of candidate optimal subtrees $\omega$.

7) Use cross-validation to select the optimal subtree $T_\alpha$ from $\omega$; a library-based sketch of this generate-and-validate procedure follows below.
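In practice this procedure does not have to be implemented by hand. For example, scikit-learn exposes minimal cost-complexity pruning directly; a minimal sketch of enumerating the candidate $\alpha$ values and cross-validating them (the dataset and parameters here are only illustrative) could be:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: grow the full tree and enumerate the candidate pruning strengths alpha
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# Step 2: cross-validate the pruned tree for each alpha and keep the best one
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean()
          for a in ccp_alphas]
best_alpha = ccp_alphas[int(np.argmax(scores))]
final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, max(scores))
```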

6. Summary of the CART algorithm

We have now described the CART algorithm in detail. Compared with the C4.5 algorithm for classification, CART uses a simplified binary-tree model and selects features with the approximate Gini coefficient to simplify computation. Of course, the biggest benefit is that a CART tree can also be a regression model, which C4.5 cannot. The table below compares ID3, C4.5, and CART; hopefully it helps in understanding them.

Algorithm | Supported models | Tree structure | Feature selection | Continuous values | Missing values | Pruning
ID3 | Classification | Multi-way tree | Information gain | Not supported | Not supported | Not supported
C4.5 | Classification | Multi-way tree | Information gain ratio | Supported | Supported | Supported
CART | Classification, regression | Binary tree | Gini coefficient, mean squared error | Supported | Supported | Supported

The CART algorithm looks impressive, so does it have any disadvantages? It does! I think the main drawbacks are as follows:

1) Note that whether it is ID3, C4.5, or CART, feature selection picks a single optimal feature to make the classification decision at each node. In most cases, however, the classification decision should not be determined by a single feature but by a set of features, which would yield a more accurate decision tree. Such a tree is called a multivariate decision tree. When selecting the optimal feature, a multivariate decision tree does not pick one optimal feature but an optimal linear combination of features to make the decision. A representative algorithm is OC1, which we will not cover here.

2) A small change in the samples can lead to drastic changes in the tree structure. This can be addressed by ensemble learning methods such as random forests.

7. Summary of decision tree algorithm

Finally, the concluding summary. Here we no longer distinguish among ID3, C4.5, and CART; instead we look at the advantages and disadvantages of decision trees as a family of classification and regression algorithms. This section is summarized from the scikit-learn English documentation.

First we look at the advantages of the decision tree algorithm:

1) Simple and intuitive; the generated decision tree is very easy to interpret visually.

2) Basically no preprocessing is needed: no normalization in advance, and missing values can be handled.

3) The cost of making a prediction with a decision tree is $O(\log_2 m)$, where m is the number of samples.

4) It can handle both discrete values and continuous values. Many algorithms focus only on discrete values or only on continuous values.

5) It can handle classification problems with multi-dimensional output.

6) Compared with black-box classification models such as neural networks, a decision tree can be explained well logically.

7) The model can be pruned via cross-validation to improve its generalization ability.

8) Good fault tolerance for outliers; high robustness.

Let us look at the shortcomings of the decision tree algorithm:

1) Decision tree algorithms overfit very easily, resulting in weak generalization. This can be improved by setting a minimum number of samples per node and limiting the depth of the decision tree.

2) A small change in the samples can cause drastic changes in the tree structure. This can be solved by means such as ensemble learning.

3) Finding the optimal decision tree is an NP-hard problem; we generally rely on heuristics, which easily fall into local optima. This can be improved by methods such as ensemble learning.

4) Some more complex relationships are hard for a decision tree to learn, such as XOR. There is not much to be done about this; such relationships are generally better handled by switching to classification methods such as neural networks.

5) If the samples of certain features take up too large a proportion, the generated decision tree tends to be biased toward those features. This can be improved by adjusting sample weights.


References: https://www.cnblogs.com/pinard/p/6053344.html
