Machine learning theory study: decision tree

Table of Contents

1. Decision tree model

1.1 Learning of the decision tree model

2. Feature selection

2.1 Information gain

2.2 Information gain ratio

3. Generation of the decision tree

3.1 ID3 algorithm

3.2 C4.5 algorithm

4. Pruning of the decision tree

5. Classification and regression tree (CART) algorithm

5.1 CART generation

5.2 CART pruning


The decision tree is a basic method for classification and regression. A decision tree is a non-parametric supervised learning method: it summarizes decision rules from data with features and labels and presents these rules in a tree structure to solve classification and regression problems. Decision tree algorithms are easy to understand, applicable to many kinds of data, and perform well on a wide range of problems; in particular, ensemble algorithms built around tree models are widely used across many industries and fields.

1. Decision tree model

The classification decision tree model is a tree structure that classifies instances based on features. Classification starts from the root node: one feature of the instance is tested, and the instance is assigned to a child node according to the test result, each child node corresponding to one value of that feature. This continues recursively until a leaf node is reached, and the instance is finally assigned to the class of that leaf node.

A decision tree can be converted into a set of if-then rules, or it can be regarded as a conditional probability distribution of the classes defined on a partition of the feature space.

  • if-then rules

Each path from the root node to a leaf node of the decision tree defines one rule: the features tested at the internal nodes on the path correspond to the rule's conditions, and the class of the leaf node corresponds to the rule's conclusion. The set of if-then rules is mutually exclusive and complete, that is, every instance is covered by exactly one path, and hence by exactly one rule.

  • Conditional probability distribution

The conditional probability distribution represented by a decision tree is composed of the conditional distributions of the class over the cells of the feature-space partition. Let X be the random variable representing the features and Y the random variable representing the class; the conditional probability distribution is then P(Y|X). In decision tree classification, an instance falling into a node is assigned to the class with the highest conditional probability at that node.

1.1 Learning of the decision tree model

The essence of decision tree learning is to induce a set of classification rules from data. The goal is to build a decision tree that fits the training data well while keeping the model's complexity low. Because selecting the optimal decision tree from all possible trees is NP-complete, heuristic methods are used in practice to learn a sub-optimal decision tree.

Decision tree learning usually proceeds by recursively selecting the optimal feature and splitting the data according to that feature, so that each resulting subset is classified as well as possible. At the start, the root node is constructed and all the data is placed in it; an optimal feature is selected, and the data is split into subsets according to the values of that feature, so that each subset has the best possible classification under the current conditions. If a subset is already essentially correctly classified, a leaf node is built for it and the subset is assigned to that leaf; if some subsets are still not correctly classified, the optimal feature is selected for each of them and splitting continues, building the corresponding nodes. This is done recursively until all subsets are essentially correctly classified or no suitable features remain. Finally, every subset is assigned to a leaf node, and the decision tree is complete.

The decision tree learning algorithm includes three parts: feature selection, tree generation and tree pruning. Commonly used algorithms are ID3, C4.5 and CART.

2. Feature selection

Feature selection chooses features that have the ability to discriminate between the classes in the training data, which improves the efficiency of decision tree learning. In other words, feature selection decides which feature to use to partition the feature space. The criterion usually used is the information gain or the information gain ratio.

2.1 Information gain

  • Entropy: a measure of the uncertainty of a random variable X.

Suppose the probability distribution of the discrete random variable X is:

    P(X = x_i) = p_i, \quad i = 1, 2, \ldots, n

Then the entropy of X is defined as:

    H(X) = -\sum_{i=1}^{n} p_i \log p_i

The greater the entropy, the greater the uncertainty of the random variable. In general, 0 \leq H(p) \leq \log n, where n is the number of values that X can take.

  • Conditional entropy: the uncertainty of the random variable Y given that the random variable X is known.

Suppose the joint distribution of random variables (X, Y) is:

    P(X = x_i, Y = y_j) = p_{ij}, \quad i = 1, 2, \ldots, n; \ j = 1, 2, \ldots, m

Then the conditional entropy of Y given X is defined as:

    H(Y|X) = \sum_{i=1}^{n} p_i H(Y|X = x_i), \quad p_i = P(X = x_i)

Generally, entropy and conditional entropy are also called empirical entropy and empirical conditional entropy when they are estimated from data.
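As a concrete illustration of the empirical entropy and empirical conditional entropy, here is a minimal Python sketch (not from the referenced book; the function names and the toy data are assumptions made for this example):

```python
from collections import Counter
from math import log2

def empirical_entropy(labels):
    """Empirical entropy H(D) estimated from the class frequencies in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def empirical_conditional_entropy(feature_values, labels):
    """Empirical conditional entropy H(D|A): entropies of the subsets obtained by
    grouping the samples on the values of feature A, weighted by subset size."""
    n = len(labels)
    groups = {}
    for v, y_k in zip(feature_values, labels):
        groups.setdefault(v, []).append(y_k)
    return sum(len(g) / n * empirical_entropy(g) for g in groups.values())

# Toy data: one binary feature A and binary class labels for five samples
A = ["yes", "yes", "no", "no", "no"]
y = [1, 1, 0, 1, 0]
print(empirical_entropy(y))                 # H(D)
print(empirical_conditional_entropy(A, y))  # H(D|A)
```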

1. The definition of information gain

The information gain g(D, A) of feature A with respect to the data set D is defined as the difference between the empirical entropy H(D) of D and the empirical conditional entropy H(D|A) of D given feature A, namely

    g(D, A) = H(D) - H(D|A)

Information gain is also called mutual information; the information gain used in a decision tree is equivalent to the mutual information between the classes and the features in the training set. Feature selection by information gain: compute the information gain of each feature on the training set D, compare the values, and select the feature with the largest information gain.

2. Information gain algorithm

Input: training data set D and feature A

Output: information gain g(D,A) of feature A to training data set D

(1) The empirical entropy of data D:

    H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}

where C_k is the subset of samples in D that belong to class k, and K is the number of classes.

(2) The empirical conditional entropy of D given feature A:

    H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}

where D_i is the subset of D in which feature A takes its i-th value, and D_{ik} is the subset of D_i that belongs to class k.

(3) Calculate the information gain:

    g(D, A) = H(D) - H(D|A)
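Following the three steps above, a short sketch of the information gain computation; it reuses the empirical_entropy and empirical_conditional_entropy helpers (and the toy data A, y) defined in the earlier snippet, which are assumptions of these examples:

```python
def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A), following steps (1)-(3) above."""
    return empirical_entropy(labels) - empirical_conditional_entropy(feature_values, labels)

print(information_gain(A, y))  # information gain of feature A on the toy data
```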

2.2 Information gain ratio

Using information gain as the criterion for splitting the data set tends to favor features that take many values; the information gain ratio can be used to correct this bias.

1. Definition of information gain ratio

The information gain ratio g_R(D, A) of feature A with respect to the training data set D is the ratio of its information gain g(D, A) to the entropy H_A(D) of D with respect to the values of feature A, namely

    g_R(D, A) = \frac{g(D, A)}{H_A(D)}

where H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|} and n is the number of values taken by feature A.
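A small follow-up sketch of the information gain ratio, again reusing the helpers from the previous snippets (assumed names); note that H_A(D) is simply the entropy of the value distribution of feature A on D:

```python
def information_gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D)."""
    h_a = empirical_entropy(feature_values)  # entropy of D with respect to the values of A
    return information_gain(feature_values, labels) / h_a if h_a > 0 else 0.0

print(information_gain_ratio(A, y))
```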

3. Generation of the decision tree

3.1 ID3 algorithm

The core of the ID3 algorithm is to select features by the information gain criterion at each node and to construct the decision tree recursively. Specifically: starting from the root node, compute the information gain of every candidate feature for that node, select the feature with the largest information gain as the node's splitting feature, and create a child node for each value of that feature; then apply the same procedure recursively to the child nodes, until the information gain of every feature is small or no features remain to choose from. The result is the decision tree.

ID3 is equivalent to selecting a probability model by maximum likelihood.
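A minimal sketch of the recursive ID3 procedure described above, building on the information_gain helper from the earlier snippets; the data representation (a list of dicts mapping feature names to values plus a label list), the threshold eps, and the function name are assumptions of this example:

```python
from collections import Counter

def id3(samples, labels, features, eps=1e-3):
    """Recursively build an ID3 tree as nested dicts; leaves are class labels."""
    # Stop if all samples share one class or no candidate features remain.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Select the feature with the largest information gain.
    gains = {f: information_gain([s[f] for s in samples], labels) for f in features}
    best = max(gains, key=gains.get)
    if gains[best] < eps:  # all gains are small: return a majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Split on each value of the chosen feature and recurse on the subsets.
    tree = {best: {}}
    remaining = [f for f in features if f != best]
    for v in set(s[best] for s in samples):
        idx = [i for i, s in enumerate(samples) if s[best] == v]
        tree[best][v] = id3([samples[i] for i in idx],
                            [labels[i] for i in idx], remaining, eps)
    return tree
```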

3.2 C4.5 algorithm

The C4.5 algorithm is similar to the ID3 algorithm, except that the information gain ratio is used to select features during the C4.5 generation process.

4. Pruning of the decision tree

The process of simplifying a generated tree is called pruning. Decision tree pruning is usually achieved by minimizing the overall loss function of the tree. Suppose the tree T has |T| leaf nodes, t is a leaf node of T with N_t sample points, of which N_{tk} belong to class k; then the loss function of the decision tree is:

    C_\alpha(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \alpha |T|

where the empirical entropy of leaf node t is

    H_t(T) = -\sum_{k} \frac{N_{tk}}{N_t} \log \frac{N_{tk}}{N_t}

Writing

    C(T) = \sum_{t=1}^{|T|} N_t H_t(T) = -\sum_{t=1}^{|T|} \sum_{k=1}^{K} N_{tk} \log \frac{N_{tk}}{N_t}

the loss function can be written as

    C_\alpha(T) = C(T) + \alpha |T|

In this formula, C(T) represents the prediction error of the model on the training data, that is, the degree of fit; |T| represents the complexity of the model; and the parameter \alpha \geq 0 controls the trade-off between the two. A larger \alpha favors a simpler tree; a smaller \alpha favors a more complex tree; \alpha = 0 means that only the fit to the training data is considered and the complexity of the model is ignored.

Pruning selects the model with the smallest loss function for a given \alpha. When \alpha is fixed, a smaller subtree has lower model complexity but often fits the training data less well; the loss function balances the two.

Pruning algorithm of decision tree:

Input: the generated tree T and the parameter \alpha

Output: the pruned subtree T_\alpha

(1) Calculate the empirical entropy of each node;

(2) Recursively retract upward from the leaf nodes of the tree: let T_B and T_A be the trees before and after a group of leaf nodes is retracted into its parent node; if C_\alpha(T_A) \leq C_\alpha(T_B), prune, making the parent node a new leaf node;

(3) Return to step (2) until no further pruning is possible, obtaining the subtree T_\alpha with the smallest loss function.
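To make the comparison behind step (2) concrete, here is a minimal sketch of the loss function C_\alpha(T) computed from the class counts of the leaf nodes; representing a tree simply as a list of per-leaf class-count dictionaries is an assumption of this example:

```python
from math import log

def leaf_entropy(counts):
    """Empirical entropy H_t(T) of one leaf from its class counts N_tk."""
    n_t = sum(counts.values())
    return -sum(c / n_t * log(c / n_t) for c in counts.values() if c > 0)

def loss(leaves, alpha):
    """C_alpha(T) = sum_t N_t * H_t(T) + alpha * |T|; leaves is a list of {class: count}."""
    return sum(sum(c.values()) * leaf_entropy(c) for c in leaves) + alpha * len(leaves)

# One candidate retraction: two sibling leaves are collapsed into their parent,
# so their class counts are merged. Prune if the loss does not increase.
before = [{"yes": 3, "no": 1}, {"yes": 0, "no": 2}]
after = [{"yes": 3, "no": 3}]
alpha = 1.0
print(loss(before, alpha), loss(after, alpha), loss(after, alpha) <= loss(before, alpha))
```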

5. Classification and regression tree (CART) algorithm

The CART algorithm is mainly composed of two parts:

  1. Decision tree generation: generate a decision tree from the training data; the generated tree should be as large as possible;
  2. Decision tree pruning: prune the generated tree using a validation set and select the optimal subtree, with minimization of the loss function as the pruning criterion.

5.1 CART generation

Generating a CART tree is the process of recursively building a binary decision tree. A regression tree uses the squared-error minimization criterion, and a classification tree uses the Gini index minimization criterion, to select features and generate the binary tree.

1. Generation of the regression tree

Input: training data D

Output: regression tree f(x)

(1) Choose the optimal splitting variable j and split point s by solving:

    \min_{j, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \right]

Traverse the variables j; for each fixed j, scan the split points s; and select the pair (j, s) that attains the minimum.

(2) Partition the region with the selected pair (j, s) and determine the corresponding output values:

    R_1(j, s) = \{ x \mid x^{(j)} \leq s \}, \quad R_2(j, s) = \{ x \mid x^{(j)} > s \}

    \hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m(j, s)} y_i, \quad x \in R_m, \ m = 1, 2

(3) Continue to apply steps (1) and (2) to the two resulting regions until the stopping condition is met;

(4) Divide the input space into M regions R_1, R_2, \ldots, R_M and generate the regression tree:

    f(x) = \sum_{m=1}^{M} \hat{c}_m I(x \in R_m)
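Here is a minimal sketch of step (1), the least-squares search for the optimal splitting variable j and split point s; the brute-force scan over the observed feature values and the function name are assumptions of this example:

```python
def best_split(X, y):
    """Return the (j, s) minimizing the total squared error of the two regions.
    X is a list of feature vectors (lists of numbers), y a list of targets."""
    def sse(values):
        if not values:
            return 0.0
        c = sum(values) / len(values)        # the optimal constant c_m is the mean
        return sum((v - c) ** 2 for v in values)

    best_j, best_s, best_err = None, None, float("inf")
    for j in range(len(X[0])):                  # traverse the variables j
        for s in sorted(set(x[j] for x in X)):  # scan the candidate split points s
            left = [y[i] for i, x in enumerate(X) if x[j] <= s]
            right = [y[i] for i, x in enumerate(X) if x[j] > s]
            err = sse(left) + sse(right)
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s

# Toy example with a single feature: the best split separates the two clusters
X = [[1.0], [2.0], [3.0], [10.0], [11.0]]
y = [1.1, 0.9, 1.0, 5.0, 5.2]
print(best_split(X, y))  # expected: (0, 3.0)
```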

2. Generation of the classification tree

  • Gini Index

    Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2

For a sample set D,

    Gini(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2

If D is split by feature A into two parts D_1 and D_2 according to whether A takes the value a, then under this condition the Gini index of D is

    Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)

The Gini index Gini(D) represents the uncertainty of the set D: the larger the Gini index, the greater the uncertainty of the sample set.

Input: training data set D and the stopping condition;

Output: CART decision tree

(1) For the data set, calculate the Gini index of each existing feature at each of its possible split points;

(2) Among all the features A and all their possible split points a, select the feature with the smallest Gini index and its corresponding split point as the optimal feature and optimal split point, and use them to split the data set into two child nodes;

(3) Recursively call (1) and (2) on the two child nodes until the stop condition is met;

(4) Generate a decision tree.

The general stopping condition is: the number of samples in the node is less than a predetermined threshold, or the Gini index of the sample set is less than a predetermined threshold, or there are no more features.
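A minimal sketch of steps (1) and (2): computing Gini(D) and Gini(D, A) for binary "A = a versus A ≠ a" splits and picking the feature/split-point pair with the smallest Gini index; the data representation and the function names are assumptions of this example:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(feature_values, labels, a):
    """Gini(D, A) for the binary split 'A == a' versus 'A != a'."""
    d1 = [y_k for v, y_k in zip(feature_values, labels) if v == a]
    d2 = [y_k for v, y_k in zip(feature_values, labels) if v != a]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def best_gini_split(samples, labels, features):
    """Return the (feature, value) pair with the smallest Gini(D, A)."""
    best_f, best_a, best_g = None, None, float("inf")
    for f in features:                       # step (1): every feature ...
        values = [s[f] for s in samples]
        for a in set(values):                # ... and every possible split point
            g = gini_index(values, labels, a)
            if g < best_g:                   # step (2): keep the smallest Gini index
                best_f, best_a, best_g = f, a, g
    return best_f, best_a
```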

5.2 CART pruning

The CART pruning algorithm consists of two steps: first, starting from the bottom of the decision tree T_0 produced by the generation algorithm, prune repeatedly up to the root of T_0, forming a sequence of subtrees \{T_0, T_1, \ldots, T_n\}; then test these subtrees on an independent validation set by cross-validation and select the optimal subtree.

Input: Decision tree T0 generated by CART algorithm

Output: the optimal decision tree T_\alpha

(1) Set k=0, T=T0;

(2) Set \alpha = +\infty;

(3) For each internal node t, from top to bottom, calculate C(T_t), |T_t|, and

    g(t) = \frac{C(t) - C(T_t)}{|T_t| - 1}

    \alpha = \min(\alpha, g(t))

Here, T_t denotes the subtree rooted at node t, C(T_t) is its prediction error on the training data, and |T_t| is the number of leaf nodes of T_t.

(4) Prune the internal nodes t for which g(t) = \alpha, determine the class of each new leaf node t by majority voting, and obtain the tree T;

(5) Set k = k + 1, \alpha_k = \alpha, T_k = T;

(6) If T_k is not a tree consisting of the root node and two leaf nodes, return to step (2); otherwise, let T_n = T_k;

(7) Test the subtrees T_0, T_1, \ldots, T_n on an independent validation set by cross-validation, and select the optimal subtree T_\alpha.
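As a small worked illustration of the quantity g(t) used in step (3): it measures the increase in training error per leaf removed when the subtree T_t is collapsed into the single node t, and the current \alpha is the smallest such value over all internal nodes. The numeric values below are placeholders for the example:

```python
def g(c_t, c_subtree, n_leaves):
    """g(t) = (C(t) - C(T_t)) / (|T_t| - 1)."""
    return (c_t - c_subtree) / (n_leaves - 1)

# Placeholder: collapsing a 3-leaf subtree raises the training error from 2.0 to 3.5,
# so pruning this node costs 0.75 per removed leaf; alpha_k is the minimum over nodes.
print(g(c_t=3.5, c_subtree=2.0, n_leaves=3))  # 0.75
```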

Reference:

"Statistical Learning Methods" 2nd Edition

 


Original post: blog.csdn.net/wxplol/article/details/105346485