Day4 "Machine Learning" Chapter 4 Study Notes

Decision Trees

  Over the past few days I worked through the first three chapters of "Machine Learning", which cover the basic concepts of the field. Chapters 4 to 10 then introduce some classic and commonly used machine learning methods, moving from foundations to concrete techniques. Chapter 4 introduces one such family of methods: decision trees.

4.1 Basic Process

  A decision tree is a common type of machine learning method. Taking binary classification as an example, we want to learn a model from a given training set that can classify new examples. Classifying a sample can be viewed as a "decision" process that answers the question "Does the current sample belong to the positive class?". As the name suggests, a decision tree makes its decisions based on a tree structure. For example, Figure 4.1 in the book illustrates the judgment process for a good watermelon as a sequence of sub-decisions on attributes such as color, root, and knock sound.

  Obviously, the final conclusion of the decision process is our judgment result: the melon is, or is not, a good one. In general, (1) a decision tree contains one root node, several internal nodes, and several leaf nodes; (2) the leaf nodes correspond to decision results, while every other node corresponds to an attribute test; (3) the sample set contained in each node is partitioned into its child nodes according to the outcome of the attribute test; and (4) the root node contains the full set of samples. The path from the root node to each leaf node corresponds to a sequence of decision tests. The goal of decision tree learning is to produce a decision tree with strong generalization ability, that is, a strong ability to handle unseen examples. Its basic procedure follows the simple and intuitive "divide-and-conquer" strategy of the TreeGenerate algorithm shown in Figure 4.2 of the book.

  Obviously, the generation of a decision tree is a recursive process. In the decision tree algorithm, three situations cause the recursion to return: (1) the samples contained in the current node all belong to the same class, so no further splitting is needed; (2) the current attribute set is empty, or all samples take the same value on every remaining attribute, so the node cannot be split; (3) the sample set contained in the current node is empty, so the node cannot be split. (In case (2), the current node is marked as a leaf node and its class is set to the class with the most samples in that node; in case (3), the current node is likewise marked as a leaf node, but its class is set to the class with the most samples in its parent node. The two cases are essentially different: (2) uses the posterior distribution of the current node, while (3) takes the sample distribution of the parent node as the prior distribution of the current node.)
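
A rough Python sketch of this divide-and-conquer recursion is given below; the data layout, the dict-based tree representation, and helper names such as choose_attribute are my own assumptions rather than anything prescribed by the book.

```python
from collections import Counter

def tree_generate(samples, labels, attributes, choose_attribute, parent_majority=None):
    """Divide-and-conquer decision tree generation (a sketch, not the book's exact pseudocode).

    samples:    list of dicts mapping attribute name -> value
    labels:     list of class labels aligned with samples
    attributes: dict mapping attribute name -> list of its possible values
    choose_attribute: function(samples, labels, attributes) -> name of the best split attribute
    """
    # Case (3): the sample set is empty -- fall back on the parent's majority class (prior).
    if not samples:
        return {"leaf": parent_majority}

    majority = Counter(labels).most_common(1)[0][0]

    # Case (1): all samples belong to the same class -- no further splitting needed.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}

    # Case (2): no attributes left, or all samples agree on every remaining attribute --
    # label the leaf with the majority class of the current node (posterior).
    if not attributes or all(len({s[a] for s in samples}) == 1 for a in attributes):
        return {"leaf": majority}

    # Otherwise pick the best splitting attribute and recurse on each of its values.
    best = choose_attribute(samples, labels, attributes)
    node = {"split_on": best, "children": {}}
    remaining = {a: vals for a, vals in attributes.items() if a != best}
    for value in attributes[best]:
        idx = [i for i, s in enumerate(samples) if s[best] == value]
        node["children"][value] = tree_generate(
            [samples[i] for i in idx],
            [labels[i] for i in idx],
            remaining,
            choose_attribute,
            parent_majority=majority,
        )
    return node
```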

4.2 Split Selection

  As can be seen from the algorithm, the key to decision tree learning lies in line 8, namely how to choose the optimal splitting attribute. In general, as splitting proceeds, we want the samples contained in each branch node to belong to the same class as far as possible; that is, the "purity" of the nodes should become higher and higher.

  4.2.1 Information Gain 

  "Information entropy" is the most commonly used indicator to measure the purity of a sample set. We assume that the proportion of the kth sample in the current sample set D is pk (k=1,2,...,|y|), then the information entropy of D is defined as:

The smaller the value of Ent(D), the higher the purity of D. In the extreme case Ent(D) = 0, which occurs exactly when some pk = 1, all the samples in D belong to a single class.
  Assume that the discrete attribute a has V possible values {a1, a2, ..., aV}. If a is used to split the sample set D, V branch nodes are produced, and the v-th branch node contains all the samples in D whose value on attribute a is av; denote this subset by Dv. The information entropy of Dv can be computed from the definition above. Since different branch nodes contain different numbers of samples, each branch node is given the weight |Dv|/|D|, so that branch nodes with more samples have a greater influence. The "information gain" obtained by splitting the sample set D on attribute a can then be computed as:
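
$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$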

  The larger the information gain, the greater the "purity improvement" obtained by splitting on attribute a. We can therefore use information gain to choose the splitting attributes of a decision tree, i.e., to select the attribute with the largest information gain in line 8 of the algorithm in Figure 4.2. The well-known ID3 (Iterative Dichotomiser) decision tree learning algorithm [Quinlan, 1986] selects its splitting attributes based on information gain.
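
To make the criterion concrete, here is a minimal Python sketch of Ent(D), Gain(D, a), and an ID3-style selection rule. The function names and the list-of-dicts data layout are my assumptions; they match the tree_generate sketch above, so id3_choose_attribute could be passed in as its choose_attribute argument.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, attribute):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v) for a single attribute."""
    n = len(labels)
    gain = entropy(labels)
    for value in {s[attribute] for s in samples}:
        branch = [lab for s, lab in zip(samples, labels) if s[attribute] == value]
        gain -= (len(branch) / n) * entropy(branch)
    return gain

def id3_choose_attribute(samples, labels, attributes):
    """ID3 criterion: pick the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(samples, labels, a))
```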

  Take the watermelon dataset 2.0 in Table 4.1 as an example:

  The dataset contains 17 training examples, which we use to learn a decision tree that predicts whether a melon that has not been cut open is a good one. Obviously, |Y| = 2. At the start of decision tree learning, the root node contains all the examples in D, in which positive examples account for p1 = 8/17 and negative examples for p2 = 9/17. According to the definition of information entropy, the information entropy of the root node is:
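
$$\mathrm{Ent}(D) = -\sum_{k=1}^{2} p_k \log_2 p_k = -\left(\frac{8}{17}\log_2\frac{8}{17} + \frac{9}{17}\log_2\frac{9}{17}\right) = 0.998$$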

  Then we compute the information gain of each attribute in the current attribute set {color, root, knock, texture, navel, touch}. Take the attribute "color" as an example; it has three possible values: {green, jet black, light white}. Using this attribute to split D yields three subsets, denoted D1 (color = green), D2 (color = jet black), and D3 (color = light white). Subset D1 contains the samples numbered {1, 4, 6, 10, 13, 17}, of which positive examples account for p1 = 3/6 and negative examples for p2 = 3/6. Similarly, listing the sample numbers of D2 and D3 and computing the corresponding proportions of positive and negative examples, the information entropies of the three branch nodes produced by splitting on "color" are, by the definition of information entropy:
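
$$\mathrm{Ent}(D^1) = -\left(\frac{3}{6}\log_2\frac{3}{6} + \frac{3}{6}\log_2\frac{3}{6}\right) = 1.000$$
$$\mathrm{Ent}(D^2) = -\left(\frac{4}{6}\log_2\frac{4}{6} + \frac{2}{6}\log_2\frac{2}{6}\right) = 0.918$$
$$\mathrm{Ent}(D^3) = -\left(\frac{1}{5}\log_2\frac{1}{5} + \frac{4}{5}\log_2\frac{4}{5}\right) = 0.722$$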

  The information gain of "color" then follows from the information gain formula:
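
$$\mathrm{Gain}(D, \text{color}) = \mathrm{Ent}(D) - \sum_{v=1}^{3}\frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v) = 0.998 - \left(\frac{6}{17}\times 1.000 + \frac{6}{17}\times 0.918 + \frac{5}{17}\times 0.722\right) = 0.109$$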


  Similarly we can calculate the information gain of other attributes:
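
$$\mathrm{Gain}(D, \text{root}) = 0.143;\quad \mathrm{Gain}(D, \text{knock}) = 0.141;\quad \mathrm{Gain}(D, \text{texture}) = 0.381;$$
$$\mathrm{Gain}(D, \text{navel}) = 0.289;\quad \mathrm{Gain}(D, \text{touch}) = 0.006.$$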


  Obviously, the attribute "texture" has the largest information gain, so it is selected as the splitting attribute. The root node is split on "texture", and we then continue to compute the information gains of the remaining attributes within each branch:

It can be seen that the three attributes "root", "navel", and "touch" achieve the largest information gain, so any one of them can be chosen as the next splitting attribute. Repeating this operation on every branch node finally yields the decision tree shown in Figure 4.4.

  4.2.2 Gain Ratio

  In the splits above, we did not use the "number" column as a splitting attribute. If the number were used to split the samples, 17 branches would be produced, each branch node containing only one sample and therefore having maximal purity. However, such a decision tree obviously has no generalization ability and cannot make effective predictions on new samples.
In fact, the information gain criterion has a preference for attributes with a large number of possible values. To reduce the possible adverse effects of this preference, the famous C4.5 decision tree algorithm [Quinlan, 1993] does not use information gain directly; instead, it selects the optimal splitting attribute using the "gain ratio", which is defined as:
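
$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$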

where
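
$$\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$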

IV(a) is called the "intrinsic value" of attribute a [Quinlan, 1993]. The larger the number of possible values of attribute a (that is, the larger V is), the larger the value of IV(a) usually is.
  It should be noted that the gain ratio criterion has a preference for attributes with a small number of possible values. Therefore, the C4.5 algorithm does not simply choose the candidate splitting attribute with the largest gain ratio; instead, it uses a heuristic [Quinlan, 1993]: first select, from the candidate splitting attributes, those whose information gain is above the average, and then choose the one with the highest gain ratio among them.
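
A minimal sketch of this two-step heuristic, reusing information_gain from the earlier sketch; the helper names and data layout are again my assumptions, not code from C4.5 itself.

```python
from collections import Counter
from math import log2

def intrinsic_value(samples, attribute):
    """IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)."""
    n = len(samples)
    counts = Counter(s[attribute] for s in samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def c45_choose_attribute(samples, labels, attributes):
    """C4.5-style heuristic: keep attributes whose information gain is above average,
    then pick the one with the highest gain ratio among them."""
    gains = {a: information_gain(samples, labels, a) for a in attributes}
    avg_gain = sum(gains.values()) / len(gains)
    candidates = [a for a, g in gains.items() if g >= avg_gain]

    # Guard against IV(a) == 0 (an attribute taking a single value on this node).
    def gain_ratio(a):
        iv = intrinsic_value(samples, a)
        return gains[a] / iv if iv > 0 else 0.0

    return max(candidates, key=gain_ratio)
```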

  4.2.3 Gini Index

  The CART (Classification and Regression Tree) decision tree [Breiman et al., 1984] uses the "Gini index" to select its splitting attributes. Using the same notation as Eq. (4.1), the purity of the dataset D can be measured by the Gini value. Gini(D) reflects the probability that two samples drawn at random from dataset D have inconsistent class labels; therefore, the smaller Gini(D) is, the higher the purity of dataset D. The Gini index of attribute a is then defined in terms of the Gini values of its branch subsets. Both formulas are given below:
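
$$\mathrm{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|}\sum_{k'\neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$
$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v)$$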

  Therefore, within the candidate attribute set A, we select the attribute with the smallest Gini index after splitting as the optimal splitting attribute, that is:
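
$$a_* = \underset{a \in A}{\arg\min}\ \mathrm{Gini\_index}(D, a)$$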

4.3 Pruning   

  Pruning is the main means by which decision tree learning algorithms deal with "overfitting". In decision tree learning, in order to classify the training samples as correctly as possible, the node-splitting process is repeated over and over, which sometimes produces too many branches; the learner may then mistake peculiarities of the training set itself for general properties of all data, leading to overfitting. The risk of overfitting can therefore be reduced by actively removing some branches.
The basic strategies of decision tree pruning are "pre-pruning" and "post-pruning" [Quinlan, 1993].

  4.3.1 Pre-pruning

  4.3.2 Post-pruning

4.4 Continuous and Missing Values

  4.4.1 Continuous Value Handling
  4.4.2 Missing Value Handling

4.5 Multivariate Decision Trees

  (To be continued)
