Eat Melon Tutorial - Task03 (Decision Trees)


Knowledge points

Decision tree

Logical perspective: a combination of if-else rules (see the sketch below)
Geometric perspective: a partition of the feature space according to some criterion
Ultimate goal: split the samples into subsets that are as pure as possible
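
To make the logical perspective concrete, here is a toy sketch in Python; the features, values, and thresholds below are invented for illustration and are not taken from the watermelon dataset:

```python
# A decision tree viewed logically: nothing but nested if-else rules.
# Features and thresholds here are made up for illustration.
def is_good_melon(texture: str, root: str, density: float) -> bool:
    if texture == "clear":
        # Clear-textured melons: fall back to a density test.
        return density > 0.4
    elif texture == "slightly_blurry":
        # Slightly blurry texture: decide by the shape of the root.
        return root == "curled"
    else:
        # Blurry texture: predict "bad melon".
        return False

print(is_good_melon("clear", "curled", 0.7))   # True
print(is_good_melon("blurry", "curled", 0.7))  # False
```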
The key to decision-tree learning is how to choose the optimal splitting attribute. Generally speaking, as the splitting process proceeds, we want the samples contained in each branch node to belong to the same class as far as possible; that is, the "purity" of the nodes should become higher and higher.
The purpose of decision-tree learning is to produce a tree with strong generalization ability, i.e. a strong ability to handle unseen examples. Its basic procedure follows the simple and intuitive "divide and conquer" strategy, as shown below:
(Figure: pseudocode of the basic decision-tree learning algorithm)
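
A minimal Python sketch of this divide-and-conquer recursion is given below. The dataset format, a list of `(feature_dict, label)` pairs, and the pluggable `choose_best_attribute` function are assumptions made for illustration; the concrete criterion plugged in (information gain, gain ratio, or Gini index) is what distinguishes ID3, C4.5, and CART:

```python
from collections import Counter

def tree_generate(dataset, attributes, choose_best_attribute):
    """Divide-and-conquer tree construction (illustrative sketch).

    dataset: list of (feature_dict, label) pairs.
    attributes: list of attribute names still available for splitting.
    choose_best_attribute: criterion function (dataset, attributes) -> attribute.
    """
    labels = [label for _, label in dataset]
    # Case 1: all samples belong to one class -> return a leaf of that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: no attributes left to split on -> return a majority-class leaf.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise: pick the best attribute ("divide") and recurse ("conquer").
    best = choose_best_attribute(dataset, attributes)
    tree = {best: {}}
    for value in {features[best] for features, _ in dataset}:
        subset = [(f, y) for f, y in dataset if f[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = tree_generate(subset, remaining, choose_best_attribute)
    return tree
```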

ID3 decision tree

Self-information: the amount of information carried by a single outcome of a random variable,

$$I(x) = -\log_b p(x)$$

(with $b = 2$ the unit is bits; with $b = e$, nats).
Conditional entropy: the expected remaining entropy of $Y$ once $X$ is known,

$$H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x)$$
Information gain: the reduction in entropy obtained by splitting dataset $D$ on attribute $a$,

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, \mathrm{Ent}(D^v)$$

where $\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$ is the information entropy of $D$ and $D^v$ is the subset of $D$ whose value on attribute $a$ is $a^v$.
Generally speaking, the greater the information gain, the greater the "purity improvement" obtained by splitting on that attribute. Therefore information gain can be used to select the splitting attribute of the decision tree, which is exactly the criterion the ID3 algorithm uses.
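
As a sketch of how these quantities might be computed, assuming the same `(feature_dict, label)` dataset format as above:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Ent(D): information entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(dataset, attr):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)."""
    labels = [y for _, y in dataset]
    base = entropy(labels)                           # Ent(D)
    n = len(dataset)
    remainder = 0.0
    for value in {f[attr] for f, _ in dataset}:
        sub = [y for f, y in dataset if f[attr] == value]
        remainder += (len(sub) / n) * entropy(sub)   # weighted Ent(D^v)
    return base - remainder
```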

C4.5 decision tree

In fact, the information gain criterion prefers attributes with many possible values. For example, if the sample ID number is used as a candidate splitting attribute on the watermelon dataset, its information gain is 0.998, far higher than that of any other attribute, even though such a split clearly cannot generalize. To reduce the possible adverse effect of this preference, the C4.5 decision tree uses the gain ratio to select the optimal splitting attribute:
$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \qquad \mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

where $\mathrm{IV}(a)$, the "intrinsic value" of attribute $a$, tends to grow as the number of possible values of $a$ grows.
However, the gain-ratio criterion in turn prefers attributes with fewer possible values. C4.5 therefore does not simply pick the attribute with the largest gain ratio; it adopts a heuristic: first find the candidate attributes whose information gain is above the average, then choose from among them the one with the highest gain ratio, as sketched below.
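
A possible sketch of the gain ratio and of this two-step heuristic, reusing `entropy` and `information_gain` from the ID3 sketch above:

```python
from math import log2

def intrinsic_value(dataset, attr):
    """IV(a): intrinsic value of attribute attr."""
    n = len(dataset)
    counts = {}
    for f, _ in dataset:
        counts[f[attr]] = counts.get(f[attr], 0) + 1
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain_ratio(dataset, attr):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    iv = intrinsic_value(dataset, attr)
    return information_gain(dataset, attr) / iv if iv > 0 else 0.0

def c45_choose(dataset, attributes):
    gains = {a: information_gain(dataset, a) for a in attributes}
    avg = sum(gains.values()) / len(gains)
    # Step 1: keep attributes whose information gain is above average;
    # Step 2: among them, pick the one with the highest gain ratio.
    candidates = [a for a in attributes if gains[a] >= avg]
    return max(candidates, key=lambda a: gain_ratio(dataset, a))
```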

CART decision tree

The CART decision tree uses the "Gini index" to select splitting attributes. The Gini value of dataset $D$ measures the probability that two samples drawn at random from $D$ have different class labels:

$$\mathrm{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|} \sum_{k' \neq k} p_k\, p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$

The Gini index of attribute $a$ is the weighted Gini value of the subsets produced by splitting on $a$:

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\, \mathrm{Gini}(D^v)$$

and the attribute that minimizes it is selected:

$$a_* = \underset{a \in A}{\arg\min}\; \mathrm{Gini\_index}(D, a)$$
Note: when actually constructing a CART decision tree, the optimal splitting attribute is not selected strictly by the formula above. The main reason is that CART is a binary tree: the formula alone picks an attribute but says nothing about where to split within that attribute. In practice, CART evaluates the Gini index of every candidate (attribute, split point) pair and chooses the pair with the smallest value, as sketched below.
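
A rough sketch of that joint search over (attribute, split point) pairs, again assuming the `(feature_dict, label)` dataset format; for a discrete attribute, a "split point" here means an equal-versus-not-equal binary partition:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index_binary(dataset, attr, split_value):
    """Gini index of the binary split f[attr] == split_value vs. the rest."""
    left = [y for f, y in dataset if f[attr] == split_value]
    right = [y for f, y in dataset if f[attr] != split_value]
    n = len(dataset)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

def cart_choose(dataset, attributes):
    # Search every (attribute, split point) pair; smallest Gini index wins.
    return min(
        ((a, v) for a in attributes for v in {f[a] for f, _ in dataset}),
        key=lambda av: gini_index_binary(dataset, av[0], av[1]),
    )
```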
The construction process of a CART decision tree, using the worked example from the watermelon book:

(Figures: step-by-step CART construction on the watermelon dataset)

Source: blog.csdn.net/weixin_44195690/article/details/129166679