[Machine Learning] Pruning of Decision Trees

Introduction

Pruning is the main strategy that decision tree learning algorithms use to deal with overfitting. During learning, node splitting sometimes produces too many branches, so that peculiarities of the training set are mistaken for general properties of all data, which leads to overfitting.

The basic strategies of decision tree pruning are "pre-pruning" and "post-pruning", both of which need to judge whether a split improves the generalization ability of the decision tree. To do this, a validation set is set aside from the data set with the hold-out method.
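As a concrete illustration of the hold-out method, here is a minimal Python sketch. The 17 sample IDs and the 10/7 split mirror the example below, but the random shuffle and the function name are purely illustrative assumptions (the book fixes its split by hand).

```python
import random

def holdout_split(sample_ids, n_val, seed=0):
    """Hold-out method: set aside n_val samples as a validation set."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)   # illustrative; not the book's fixed split
    return ids[n_val:], ids[:n_val]    # (training set, validation set)

train_ids, val_ids = holdout_split(range(1, 18), n_val=7)
print(train_ids, val_ids)
```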
The following table is the data set used to demonstrate the principle of pruning:

[Image: the data set, divided into a training set and a validation set]

The following figure is the decision tree generated without pruning:

[Image: the unpruned decision tree]

Pre-pruning

Pre-pruning estimates each node before it is split during tree generation: if splitting the current node cannot improve the generalization ability of the decision tree, the split is stopped and the current node is marked as a leaf node.
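To make this test concrete, the sketch below compares the validation accuracy of keeping a node as a majority-vote leaf against splitting it on a candidate attribute. The sample representation, an (attribute-dict, label) pair, and all function names are assumptions for illustration, not the book's notation.

```python
from collections import Counter

def majority_label(samples):
    """Most common class label among (attributes, label) pairs."""
    return Counter(label for _, label in samples).most_common(1)[0][0]

def prepruning_keeps_split(train_set, val_set, attr):
    """Pre-pruning gate: split on `attr` only if it raises validation accuracy."""
    leaf = majority_label(train_set)
    acc_leaf = sum(label == leaf for _, label in val_set) / len(val_set)

    # Each branch is labelled by the majority class of the training samples it receives.
    by_value = {}
    for attrs, label in train_set:
        by_value.setdefault(attrs[attr], []).append(label)
    branch = {v: Counter(labels).most_common(1)[0][0] for v, labels in by_value.items()}

    # Validation samples with an unseen attribute value fall back to the leaf label.
    hits = sum(branch.get(attrs[attr], leaf) == label for attrs, label in val_set)
    return hits / len(val_set) > acc_leaf
```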

After calculating the relevant information gains, the attribute "umbilical" is selected to split the training set, which, as the table above shows, produces three branches. Since we are pre-pruning, we must first use the validation set to judge whether to split:

Assume first that "umbilical" is not used for splitting. The decision tree is then a single leaf node, marked "good melon", and the validation-set accuracy is $\frac{3}{7} \times 100\% = 42.9\%$.

After splitting on "umbilical", the training samples contained in nodes ②, ③ and ④ in the figure are $\{1,2,3,14\}$, $\{6,7,15,17\}$ and $\{10,16\}$ respectively, so these three branch nodes are marked "good melon", "good melon" and "bad melon". Checking against the validation set, the samples $\{4,5,8,11,12\}$ are classified correctly, so the validation-set accuracy is now $\frac{5}{7} \times 100\% = 71.4\%$, and the split on "umbilical" should therefore be made.
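The two accuracies can be verified with simple arithmetic, using exactly the counts quoted above:

```python
val_size = 7                       # size of the validation set

acc_leaf  = 3 / val_size           # root kept as a single "good melon" leaf
acc_split = 5 / val_size           # after splitting on "umbilical",
                                   # samples {4, 5, 8, 11, 12} are correct

print(f"{acc_leaf:.1%} vs {acc_split:.1%}")   # 42.9% vs 71.4%
assert acc_split > acc_leaf        # so pre-pruning keeps this split
```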

Subsequent nodes are examined in the same way, and in the end none of the three nodes produced by the "umbilical" split can be divided any further. The final decision tree is as follows:
[Image: the pre-pruned decision tree]
Comparing the two decision trees, it can be seen that pre-pruning significantly reduces the time overhead of training and lowers the risk of overfitting. However, because it stops branch expansion greedily, pre-pruning also brings a risk of underfitting: a split that is useless on its own might still have enabled useful splits deeper down.

Post-pruning

Post-pruning first generates a complete decision tree and then examines its non-leaf nodes from the bottom up: if replacing the subtree rooted at a node with a leaf node improves the generalization ability of the decision tree, that subtree is replaced with a leaf node.
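A minimal sketch of this bottom-up procedure, assuming a simple Node structure and dict-based samples (illustrative choices, not the book's data structures):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    label: str                                    # majority class at this node
    attr: Optional[str] = None                    # attribute this node splits on
    children: dict = field(default_factory=dict)  # attribute value -> child Node

    def is_leaf(self):
        return not self.children

def classify(node, attrs):
    while not node.is_leaf() and attrs[node.attr] in node.children:
        node = node.children[attrs[node.attr]]
    return node.label

def accuracy(root, val_set):
    """val_set: list of (attrs_dict, true_label) pairs."""
    return sum(classify(root, a) == y for a, y in val_set) / len(val_set)

def post_prune(node, root, val_set):
    """Examine non-leaf nodes bottom up; collapse a subtree into a leaf
    whenever that does not lower validation accuracy."""
    for child in node.children.values():
        if not child.is_leaf():
            post_prune(child, root, val_set)
    before = accuracy(root, val_set)
    saved_children, saved_attr = node.children, node.attr
    node.children, node.attr = {}, None           # tentatively prune to a leaf
    if accuracy(root, val_set) < before:          # pruning hurt: restore the subtree
        node.children, node.attr = saved_children, saved_attr
```

Note that this sketch prunes when the accuracy stays equal, i.e. the Occam's razor choice mentioned below; the book's figure keeps such subtrees for ease of drawing.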

The validation-set accuracy of the unpruned decision tree is 42.9%. First consider replacing node ⑥ with a leaf node. It contains the two training samples $\{7,15\}$, so the node is marked "good melon", and the validation-set accuracy rises to 57.1%.

The other nodes are examined in the same way (where pruning leaves the validation-set accuracy unchanged, the Occam's razor criterion would prefer to prune, but the original book chooses not to prune there for ease of drawing), giving the final decision tree shown below, with a validation-set accuracy of 71.4%:
[Image: the post-pruned decision tree]
It can be seen that post-pruning retains more branches, so its risk of underfitting is small and its generalization performance is often better than that of pre-pruning. However, since post-pruning examines every non-leaf node only after the decision tree has been fully generated, its time overhead is much larger.

References

Zhou Zhihua, "Machine Learning".
