[Machine Learning] Decision Trees (Part 1): ID3, C4.5, CART

The decision tree is a very common and effective machine learning algorithm. It is easy to understand and to interpret, and it can be used both as a classification algorithm and as a regression model.

For the basic tree algorithms, I will briefly introduce each one from four aspects: the core idea, the splitting criterion, the pruning strategy, and the advantages and disadvantages.

1. ID3

The ID3 algorithm is based on Occam's razor (accomplish more with less): all else being equal, a smaller decision tree is preferable to a larger one.

1.1 Idea

From information theory we know that the larger the information entropy, the lower the purity of a sample set. The core idea of ID3 is to use information gain as the feature-selection measure and to split on the feature with the largest information gain. The algorithm performs a top-down greedy search through the space of possible decision trees (C4.5 is also a greedy search). The general steps are:

  1. Initialize the feature set and data set;
  2. Calculate the information entropy of the data set and the conditional entropy of all features, and select the feature with the largest information gain as the current decision node;
  3. Update the data set and feature set (delete the feature used in the previous step, and divide the data set of different branches according to the feature value);
  4. Repeat steps 2 and 3; a branch becomes a leaf node when all the samples in its subset belong to the same class (or when no features are left to split on).

1.2 Splitting criterion

ID3 uses information gain as its splitting criterion. Information gain measures how much the uncertainty about the class of the samples in set D is reduced once the value of feature A is known.

The information entropy of data set D is:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$

where $C_k$ denotes the subset of samples in D that belong to the k-th class.

For a feature A, the conditional entropy of data set D is:

$$\mathrm{Ent}(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}\,\mathrm{Ent}(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$

where $D_i$ denotes the subset of samples in D for which feature A takes its i-th value, and $D_{ik}$ denotes the subset of samples in $D_i$ that belong to the k-th class.

Information gain = information entropy minus conditional entropy:

$$\mathrm{Gain}(D, A) = \mathrm{Ent}(D) - \mathrm{Ent}(D \mid A)$$

The larger the information gain, the larger the "purity improvement" obtained by splitting on feature A.
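To make the formulas concrete, here is a minimal NumPy sketch of the entropy and information-gain computations. The helper names (`entropy`, `information_gain`) and the toy weather-style data are my own illustration, not code from the original post.

```python
import numpy as np

def entropy(labels):
    """Information entropy Ent(D) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Gain(D, A) = Ent(D) - Ent(D|A) for one discrete feature column."""
    conditional = 0.0
    for value in np.unique(feature):
        mask = feature == value
        conditional += mask.mean() * entropy(labels[mask])   # |D_i|/|D| * Ent(D_i)
    return entropy(labels) - conditional

# Toy example: an "outlook"-like feature against a binary class label.
feature = np.array(["sunny", "sunny", "overcast", "rain", "rain", "overcast"])
labels  = np.array([0, 0, 1, 1, 0, 1])
print(information_gain(feature, labels))
```

ID3 would evaluate `information_gain` for every remaining feature and split on the one with the largest value.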

1.3 Disadvantages

  • ID3 has no pruning strategy and overfits easily;
  • The information gain criterion prefers features with many possible values: for an ID-like feature such as a serial number, every branch is pure, so its information gain is close to the maximum;
  • It can only handle discrete features;
  • Missing values are not considered.

2. C4.5

The most important characteristic of C4.5 is that it overcomes ID3's bias towards features with many values by introducing the information gain ratio as the splitting criterion.

2.1 Idea

Compared with ID3, C4.5 makes the following improvements:

  • It introduces pessimistic pruning as a post-pruning strategy;
  • It uses the information gain ratio as the splitting criterion;
  • It discretizes continuous features. Suppose a continuous feature A takes m distinct values among the n samples. C4.5 sorts these values and takes the midpoint of every pair of adjacent values, giving m-1 candidate split points; it computes the information gain of the binary split at each candidate point and selects the point with the largest gain as the binary split point for that continuous feature (see the sketch after this list);
  • Its treatment of missing values breaks down into two sub-problems:
  • Question 1: how should the splitting feature be selected when some samples are missing the feature's value (i.e., how is the information gain ratio of such a feature computed)?
  • Question 2: once the splitting feature is chosen, how should a sample that is missing its value be handled (i.e., into which child node should the sample go)?
  • For question 1, C4.5 computes the gain on the subset of samples whose value is not missing and scales it by the proportion of non-missing samples;
  • For question 2, C4.5 sends the sample into all child nodes at the same time but with adjusted sample weights; in effect, the sample is distributed among the child nodes with different probabilities.
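As a rough illustration of the continuous-feature handling described above, the sketch below generates the m-1 midpoint candidates and picks the binary split with the largest information gain. It reuses the `entropy` helper from the ID3 sketch, and the function names are my own.

```python
import numpy as np

def candidate_thresholds(values):
    """C4.5-style candidates: midpoints of adjacent distinct sorted values."""
    v = np.unique(values)            # sorts and removes duplicates
    return (v[:-1] + v[1:]) / 2.0    # m distinct values -> m-1 candidates

def best_binary_split(values, labels):
    """Choose the threshold whose binary split yields the largest information gain."""
    base = entropy(labels)
    best_t, best_gain = None, -np.inf
    for t in candidate_thresholds(values):
        left, right = labels[values <= t], labels[values > t]
        conditional = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - conditional
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```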

2.2 Splitting criterion

The information gain ratio overcomes the shortcoming of information gain. Its formula is:

$$\mathrm{GainRatio}(D, A) = \frac{\mathrm{Gain}(D, A)}{\mathrm{IV}(A)}, \qquad \mathrm{IV}(A) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

where $\mathrm{IV}(A)$ is called the intrinsic value of feature A.

Note that the information gain ratio is biased towards features with few values (the smaller the denominator $\mathrm{IV}(A)$, the larger the ratio), so C4.5 does not simply split on the feature with the largest gain ratio. Instead, it uses a heuristic: among the candidate features, first keep those whose information gain is above the average, then choose the one with the highest gain ratio.
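Below is a minimal sketch of the gain-ratio computation and the two-step heuristic just described, reusing `entropy` and `information_gain` from above; `features` is assumed to be a dict mapping feature names to NumPy columns (my own convention, not from the original post).

```python
import numpy as np

def intrinsic_value(feature):
    """IV(A): entropy of the feature's own value distribution."""
    _, counts = np.unique(feature, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels):
    iv = intrinsic_value(feature)
    return information_gain(feature, labels) / iv if iv > 0 else 0.0

def c45_select_feature(features, labels):
    """C4.5 heuristic: keep features with above-average gain, then pick
    the one with the highest gain ratio among them."""
    gains = {name: information_gain(col, labels) for name, col in features.items()}
    average = sum(gains.values()) / len(gains)
    candidates = [name for name, g in gains.items() if g >= average]
    return max(candidates, key=lambda name: gain_ratio(features[name], labels))
```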

2.3 Pruning strategy

Why prune: an over-fitted tree generalizes very poorly.

2.3.1 Pre-pruning

Pre-pruning decides, before a node is split, whether the tree should keep growing. The main stopping criteria are:

  • The number of samples in the node is below a threshold;
  • All features have already been used for splitting;
  • The accuracy before splitting the node is higher than the accuracy after splitting it.

Pre-pruning not only reduces the risk of overfitting but also shortens training time; on the other hand, it is a greedy strategy and therefore carries a risk of underfitting.

2.3.2 Post-pruning

Post-pruning prunes an already grown decision tree to obtain a simplified tree.

The pessimistic pruning adopted by C4.5 evaluates each non-leaf node recursively from the bottom up, checking whether replacing the subtree with its best leaf node is beneficial. If the error rate after pruning stays the same or decreases compared with before pruning, the subtree is replaced. C4.5 estimates the error rate on unseen samples from the number of misclassifications on the training set.

A post-pruned decision tree has a very small risk of underfitting, and its generalization performance is usually better than that of a pre-pruned tree, but its training time is much longer.

2.4 Disadvantages

  • The pruning strategy could be further optimized;
  • C4.5 builds a multi-way tree; a binary tree would be more efficient;
  • C4.5 can only be used for classification;
  • The entropy model used by C4.5 involves many time-consuming logarithm operations, plus sorting for continuous values;
  • When building the tree, C4.5 must sort the values of every numeric attribute to choose a split point, so it is only suitable for data sets that fit in memory; when the training set is too large to fit in memory, the program cannot run.

3. CART

Although ID3 and C4.5 extract as much information as possible from the training set, the trees they generate tend to have many branches and a large scale. CART's binary splitting simplifies the scale of the tree and improves the efficiency of tree generation.

3.1 Idea

CART includes the basic processes of splitting, pruning and tree selection.

  • Splitting: the splitting process is binary recursive partitioning; the input and split features can be either continuous or discrete. CART has no stopping criterion during growth and lets the tree keep growing;
  • Pruning: CART uses cost-complexity pruning. Starting from the largest tree, it repeatedly selects the split node that contributes least to the overall performance on the training data as the next pruning target, until only the root node is left. This produces a series of nested pruned trees, from which an optimal decision tree must be selected;
  • Tree selection: the prediction performance of each pruned tree is evaluated on a separate test set (cross-validation can also be used).

CART makes many improvements on the basis of C4.5:

  • C4.5 builds a multi-way tree and is slower to compute; CART builds a binary tree and is faster;
  • C4.5 can only do classification; CART can do both classification and regression;
  • CART uses the Gini index as its impurity measure, removing a large number of logarithm operations;
  • CART uses surrogate splits to handle missing values, while C4.5 distributes a sample over the child nodes with different probabilities;
  • CART prunes with cost-complexity pruning, while C4.5 uses pessimistic pruning (see the usage sketch below).
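If you want to experiment with a CART-style tree in practice, scikit-learn's tree module builds binary trees with the Gini criterion and supports minimal cost-complexity pruning. This usage sketch is my own addition, not part of the original post; note that scikit-learn does not implement surrogate splits for missing values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" and ccp_alpha (cost-complexity pruning) mirror CART's choices.
clf = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# The pruning path exposes the alpha thresholds at which subtrees get collapsed.
path = clf.cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)
```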

3.2 Splitting criterion

The entropy model involves many time-consuming logarithm operations; the Gini index keeps the benefits of the entropy model while simplifying the computation. The Gini index represents the impurity of the model: the smaller the Gini index, the lower the impurity and the better the feature, which is the opposite direction from information gain (ratio).

$$\mathrm{Gini}(D) = \sum_{k=1}^{K} \frac{|C_k|}{|D|}\left(1 - \frac{|C_k|}{|D|}\right) = 1 - \sum_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^2$$

where $C_k$ denotes the subset of samples in D that belong to the k-th class.

The Gini index reflects the probability that two samples drawn at random from the data set carry different class labels; therefore, the smaller the Gini index, the higher the purity of the data set. Like information gain, the Gini index is biased towards features with many values. The Gini index can measure any uneven distribution; it is a number between 0 and 1, where 0 means complete equality and 1 means complete inequality.

In addition, when CART is used for binary classification, with p the proportion of the positive class, the expression becomes:

$$\mathrm{Gini}(p) = 2p(1-p)$$

We can see that with only squaring and two classes the computation is much simpler, and its behavior is very close to that of the entropy model.
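Here is a small sketch of the two Gini formulas above (general and two-class); the function names and example labels are my own.

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 for an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_binary(p):
    """Two-class shortcut: Gini(p) = 2 * p * (1 - p)."""
    return 2 * p * (1 - p)

labels = np.array([1, 1, 0, 0, 0, 1])
print(gini(labels))        # 0.5
print(gini_binary(0.5))    # 0.5, same value, no logarithms needed
```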

So a natural question arises: the Gini index behaves much like the entropy model, but how far apart are they exactly?

We know that the first-order Taylor expansion of $\ln x$ at $x = 1$ gives $\ln x \approx x - 1$, so

$$H(X) = -\sum_{k} p_k \ln p_k \approx \sum_{k} p_k (1 - p_k) = \mathrm{Gini}(p)$$

Thus the Gini index can be understood as a first-order Taylor approximation of the entropy model. (A classic figure comparing the Gini index with half the entropy usually accompanies this derivation; the two curves nearly coincide. The figure is not reproduced here.)

3.3 Missing value processing

As mentioned above, the model's handling of missing values ​​will be divided into two sub-problems:

  1. How to select the division feature when the feature value is missing?
  2. After selecting the partition feature, how should the model deal with samples that lack the feature value?

For question 1, CART strictly requires that only the samples with no missing value on a feature be used when evaluating that feature as a split. Later versions of the CART algorithm use a penalty mechanism to discount the feature's improvement score, thereby reflecting the influence of the missing values (for example, if a feature is missing in 20% of the records at a node, its improvement score is reduced by 20%, or by some other factor).

For question 2, CART finds surrogate splitters for every node of the tree, regardless of whether the training data actually contain missing values. A feature's score must exceed that of the default rule for it to qualify as a surrogate (the surrogate replaces the missing split feature when routing a sample). When a missing value is encountered in a CART tree, whether the instance goes left or right is determined by the highest-ranked surrogate; if that surrogate's value is also missing, the second-ranked surrogate is used, and so on. If all surrogate values are missing, the default rule sends the sample to the larger child node. Surrogate splitters ensure that a tree grown on training data without missing values can still process new data that contain missing values.
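To make the routing logic concrete, here is a simplified sketch of how a sample with missing values might be sent down a node: primary split first, then the ranked surrogates, then the default (larger-child) direction. The node dictionary layout and field names are hypothetical, not CART's actual data structures.

```python
import math

def route(sample, node):
    """Decide 'left' or 'right' for one sample at one node.

    `sample` is a dict of feature -> numeric value (None/NaN means missing).
    `node` is a hypothetical dict:
      {"primary": (feature, threshold),
       "surrogates": [(feature, threshold), ...],   # ranked best first
       "default": "left" or "right"}                # direction of the larger child
    """
    for feature, threshold in [node["primary"], *node["surrogates"]]:
        value = sample.get(feature)
        if value is not None and not math.isnan(value):
            return "left" if value <= threshold else "right"
    return node["default"]
```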

3.4 Pruning strategy

CART uses cost-complexity post-pruning. This method generates a series of trees, each obtained by replacing one or more subtrees of the previous tree with a leaf node; the last tree in the series contains only a single leaf node that predicts the class. A cost-complexity criterion is then used to decide which subtree should be replaced by a leaf predicting the class value. The method needs a separate test set to evaluate all of the trees, and the best tree is selected according to its classification performance on that test set.

Let's take a look at the cost complexity pruning algorithm in detail:

First, call the largest (fully grown) tree $T_0$. We want to reduce the size of the tree to prevent overfitting, but we also worry that removing nodes will increase the prediction error, so we define a loss function that balances the two:

$$C_\alpha(T) = C(T) + \alpha |T|$$

Here $T$ is an arbitrary subtree, $C(T)$ is its prediction error on the training data, $|T|$ is the number of leaf nodes of the subtree, and $\alpha \geq 0$ is a parameter: $C(T)$ measures how well the tree fits the training data, $|T|$ measures the complexity of the tree, and $\alpha$ trades fit off against complexity.

So how do we find the right $\alpha$ to achieve the best balance between complexity and fit? The approach is to let $\alpha$ range from 0 to $+\infty$. For each fixed $\alpha$ we can find the smallest subtree $T_\alpha$ that minimizes the loss. When $\alpha$ is very small, the full tree $T_0$ is optimal; when $\alpha$ is very large, the single root node is optimal. As $\alpha$ increases, we obtain a sequence of subtrees $T_0 \supset T_1 \supset \dots \supset T_n$, where each subtree is produced by cutting off an internal node of the previous one.

Breiman proved that as $\alpha$ increases from 0, there is a sequence of thresholds $0 = \alpha_0 < \alpha_1 < \dots < \alpha_n < +\infty$, and within each interval $[\alpha_i, \alpha_{i+1})$ the subtree $T_i$ is the best.

This is the core idea of ​​cost complexity pruning.

Each pruning step acts on a single non-leaf node and leaves all other nodes unchanged, so we only need to compare the loss function of that node before and after pruning.

For any internal node $t$: before pruning, the subtree $T_t$ rooted at $t$ has $|T_t|$ leaf nodes and prediction error $C(T_t)$; after pruning, $t$ becomes a single leaf node with prediction error $C(t)$.

Therefore, the loss function of the subtree rooted at node $t$ before pruning is:

$$C_\alpha(T_t) = C(T_t) + \alpha |T_t|$$

and the loss function after pruning is:

$$C_\alpha(t) = C(t) + \alpha$$

Following Breiman's argument, there must be an $\alpha$ at which the two losses are equal, namely:

$$\alpha = \frac{C(t) - C(T_t)}{|T_t| - 1}$$

The significance of this value is that for $\alpha$ below it, keeping the subtree $T_t$ gives the smaller loss; once $\alpha$ exceeds it, we have $C_\alpha(t) < C_\alpha(T_t)$, i.e., it is better to cut this node than to keep it. Therefore, each optimal subtree corresponds to an interval of $\alpha$ on which it is optimal.

Then, for each internal node $t$ of the current tree, we compute:

$$g(t) = \frac{C(t) - C(T_t)}{|T_t| - 1}$$

$g(t)$ represents the pruning threshold of node $t$, so at each step we prune the node with the smallest $g(t)$.
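Below is a tiny sketch of one weakest-link pruning step based on the $g(t)$ formula above; the node attributes (`error_as_leaf` for $C(t)$, `subtree_error` for $C(T_t)$, `n_leaves` for $|T_t|$) are hypothetical names of my own.

```python
def g(node):
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1) for an internal node t."""
    return (node.error_as_leaf - node.subtree_error) / (node.n_leaves - 1)

def weakest_link(internal_nodes):
    """One pruning step: collapse the internal node with the smallest g(t)."""
    return min(internal_nodes, key=g)
```

Repeating this step until only the root remains yields the nested sequence of subtrees $T_0 \supset T_1 \supset \dots \supset T_n$, and the recorded $g(t)$ values are exactly the interval endpoints $\alpha_i$.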

3.5 Imbalanced classes

A big advantage of CART is that no matter how imbalanced the training data set is, CART automatically removes the effect of the imbalance without requiring any extra action from the modeler.

CART uses a prior mechanism, which is equivalent to weighting the classes. This prior mechanism is embedded in the computation CART uses to judge the quality of a split: in CART's default classification mode, the class frequencies of each node are always taken relative to those of the root node, which is equivalent to automatically re-weighting the data so that the classes are balanced.

For a binary classification problem, a node is assigned to class 1 if and only if:

$$\frac{N_1(\mathrm{node})}{N_1(\mathrm{root})} > \frac{N_0(\mathrm{node})}{N_0(\mathrm{root})}$$

For example, suppose the root node contains 20 samples of class 1 and 80 samples of class 0, and a child node contains 30 samples, 10 of class 1 and 20 of class 0. Since 10/20 > 20/80, the node is assigned to class 1.

With this rule, there is no need to worry about the true class distribution of the data: if there are K target classes, the rule behaves as if each class had probability 1/K at the root. This default mode is called "priors equal".
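The class-assignment rule and the worked example above can be written in a couple of lines; this tiny sketch is my own illustration.

```python
def assign_class(n1_node, n0_node, n1_root, n0_root):
    """Label a node as class 1 iff its share of the root's class-1 samples
    exceeds its share of the root's class-0 samples."""
    return 1 if n1_node / n1_root > n0_node / n0_root else 0

# Example from the text: root has 20 / 80 samples of class 1 / 0,
# the child node has 10 / 20 of class 1 / 0; 10/20 > 20/80, so class 1.
print(assign_class(10, 20, 20, 80))   # -> 1
```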

The difference between setting priors and weighting is that priors do not change the number or share of samples of each class in each node; priors only affect the class assigned to each node and the choice of splits during tree growth.

3.6 Regression tree

CART stands for Classification and Regression Tree; as the name suggests, it can be used not only for classification but also for regression. The algorithm for building a regression tree is similar to that for a classification tree, so here is a brief look at the differences.

3.6.1 Continuous value processing

For continuous values, the CART classification tree uses the Gini index to evaluate the split points of a feature, whereas the regression tree uses the common sum-of-squared-errors criterion. For any split feature A and any split point s, the data set is divided into two parts $D_1$ and $D_2$, and we look for the feature and split point that minimize the sum of the squared errors of the two parts:

$$\min_{A,\,s}\left[\min_{c_1}\sum_{x_i \in D_1(A,s)}\left(y_i - c_1\right)^2 + \min_{c_2}\sum_{x_i \in D_2(A,s)}\left(y_i - c_2\right)^2\right]$$

where $c_1$ is the mean of the sample outputs in $D_1$ and $c_2$ is the mean of the sample outputs in $D_2$.
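Here is a minimal sketch of the regression split search for a single continuous feature, scoring each candidate split by the total squared error around the two half means; the function name and the toy arrays are my own.

```python
import numpy as np

def best_regression_split(x, y):
    """Find the split point s on feature x minimizing the summed squared
    error of the two halves, each predicted by its own mean."""
    best_s, best_err = None, np.inf
    for s in np.unique(x)[:-1]:            # the largest value cannot split the data
        left, right = y[x <= s], y[x > s]
        err = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 0.9, 1.0, 3.9, 4.1])
print(best_regression_split(x, y))       # splits between 3.0 and 4.0
```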

3.6.2 Prediction method

As for how predictions are made once the decision tree is built: as mentioned above, the CART classification tree uses the most frequent class in a leaf node as the prediction for samples that fall in that leaf. The regression tree does not output a class; it uses the mean (or the median) of the targets in the leaf as the predicted value.

4. Summary

Finally, let's summarize by comparing the differences between ID3, C4.5 and CART.

Besides the splitting criteria, pruning strategies and continuous-value handling already discussed, here are some other differences:

  • Splitting criterion: ID3 uses information gain and is biased towards features with many values; C4.5 uses the information gain ratio, which overcomes this shortcoming of information gain but is biased towards features with few values; CART uses the Gini index, which avoids the heavy logarithm computation required by C4.5 but, like information gain, is biased towards features with many values.
  • Usage scenarios: ID3 and C4.5 can only be used for classification problems, while CART can be used for both classification and regression; ID3 and C4.5 build multi-way trees and are slower, while CART builds a binary tree and is faster;
  • Sample data: ID3 can only handle discrete data and is sensitive to missing values; C4.5 and CART can handle continuous data and have several ways of dealing with missing values. Regarding sample size, C4.5 is recommended for small samples and CART for large samples: C4.5 has to scan and sort the data set many times, which is time-consuming, while CART is itself a statistical method for large samples, and its generalization error is larger on small samples;
  • Sample features: ID3 and C4.5 use each feature only once along a path (a feature used at one level is not reused below it), while CART can reuse a feature multiple times;
  • Pruning strategy: ID3 has no pruning strategy, C4.5 uses pessimistic pruning to correct the tree, and CART uses cost-complexity pruning.

