Decision Tree

There has been a lot to learn lately. Today I studied decision trees and ensemble learning. The material does not feel very complicated, and I had already covered decision trees in my information theory class, so I am writing this blog post as a record.

The main reference for this post is the "watermelon book", Machine Learning by Zhou Zhihua.

What is a decision tree

Decision trees are a common class of machine learning methods. As the name implies, a decision tree makes decisions based on a tree structure, which mirrors the natural way humans handle decision problems: we test several attributes of an object one after another and finally arrive at a decision.

Generally, a decision tree contains one root node, several internal nodes, and several leaf nodes. The leaf nodes correspond to decision results, and every other node corresponds to an attribute test; the samples contained in a node are divided among its child nodes according to the outcome of that test. The root node contains the complete sample set, and the path from the root node to each leaf node corresponds to a sequence of decision tests. The goal of decision tree learning is to produce a tree with strong generalization ability, that is, one that can handle unseen examples.
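
To make the structure concrete, here is a minimal Python sketch. The dict-based node layout and the predict helper are my own illustration (not code from the book): an internal node stores an attribute test and one child per attribute value, and a leaf node stores the decision.

```python
# Minimal sketch of the tree structure described above (illustrative only).
# Internal nodes test one attribute; leaf nodes carry the final decision.

def predict(node, sample):
    """Follow the decision test sequence from the root down to a leaf."""
    while "label" not in node:              # internal node: perform its attribute test
        value = sample[node["attribute"]]
        node = node["children"][value]      # descend into the matching branch
    return node["label"]                    # leaf node: return the decision

# A hypothetical watermelon-style tree: test "texture" first, then "root".
tree = {
    "attribute": "texture",
    "children": {
        "clear": {"attribute": "root",
                  "children": {"curled": {"label": "good"},
                               "stiff": {"label": "bad"}}},
        "blurry": {"label": "bad"},
    },
}

print(predict(tree, {"texture": "clear", "root": "curled"}))  # -> good
```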

Partition selection

This is, in my view, the most important part of building a decision tree: at each step we must select the optimal partition attribute. Generally speaking, as the partitioning proceeds, we want the samples contained in each branch node to belong to the same class as far as possible, that is, the "purity" of the nodes should keep increasing.

Information entropy

Here I can directly reuse the formula from my information theory course:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)}$$
This quantity measures the purity of a sample set and is the most commonly used indicator for it. For example, if p(x) = 1, the set contains only that one element and its purity is highest; in that case H(X) = 0. High school chemistry already mentions that entropy represents the degree of disorder of a system. In this context we call it information entropy.
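
A tiny Python sketch of this formula (my own illustration, using log base 2 as is conventional for decision trees):

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = sum_x p(x) * log2(1 / p(x)), estimated from a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

print(entropy(["good"] * 8))                # 0.0 -> a pure set has zero entropy
print(entropy(["good"] * 4 + ["bad"] * 4))  # 1.0 -> a 50/50 split is maximally impure
```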

Information gain (ID3)

In a decision tree we need not only information entropy but also information gain. Entropy measures the uncertainty of a sample set: the larger the entropy, the more uncertain the set. Therefore, the difference between the set's entropy before and after a split can measure how effective the current feature is at partitioning the sample set D.

Hence the following formula: split the data set D using some feature A, and compute the entropy of the resulting subsets (the "after" entropy):

$$g(D, A) = H(D) - H(D \mid A)$$

Specifically,

$$H(D \mid A) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} H(D^v)$$

where $D^v$ denotes the subset of samples whose value on attribute A is $a^v$.

Disadvantage: information gain is biased towards features with many values.
Reason: when a feature has many values, splitting on it easily yields high-purity subsets, so the entropy after the split is low. Since the entropy before the split is fixed, the information gain is consequently larger, which is why information gain favours features with many values (the sketch below shows a unique "id" attribute looking just as good as a genuinely informative one).
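
Here is a hedged sketch of the formulas above. The toy data set and the info_gain helper are invented for illustration; the unique "id" column makes the bias visible, since it achieves the same (maximal) gain as the genuinely informative attribute.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def info_gain(dataset, attribute):
    """g(D, A) = H(D) - sum_v |D^v| / |D| * H(D^v)."""
    # Group the samples by their value on the chosen attribute (the D^v subsets).
    subsets = defaultdict(list)
    for row in dataset:
        subsets[row[attribute]].append(row["label"])
    before = entropy([row["label"] for row in dataset])
    after = sum(len(sub) / len(dataset) * entropy(sub) for sub in subsets.values())
    return before - after

# Hypothetical toy data: "texture" separates the classes, "id" is unique per sample.
data = [
    {"id": 1, "texture": "clear",  "label": "good"},
    {"id": 2, "texture": "clear",  "label": "good"},
    {"id": 3, "texture": "blurry", "label": "bad"},
    {"id": 4, "texture": "blurry", "label": "bad"},
]
print(info_gain(data, "texture"))  # 1.0
print(info_gain(data, "id"))       # also 1.0 -- the many-valued "id" looks just as good
```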

Information gain ratio (C4.5)

$$g_R(D, A) = \frac{g(D, A)}{H_A(D)}$$

$$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$

Disadvantage: the gain ratio is biased towards features with fewer values.
Reason: when a feature has fewer values, $H_A(D)$ is smaller, so its reciprocal is larger, and therefore the gain ratio is larger. As a result, features with fewer values are favoured.

How C4.5 uses the gain ratio: because of the shortcoming above, it does not directly select the feature with the largest gain ratio. Instead, it first finds the candidate features whose information gain is above average, and then picks the one with the highest gain ratio among them (see the sketch below).
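
A sketch of the gain ratio and of this two-stage selection rule, under the same assumptions as the previous sketch (the helpers and toy data are illustrative, not the book's code):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def info_gain(dataset, attribute):
    subsets = defaultdict(list)
    for row in dataset:
        subsets[row[attribute]].append(row["label"])
    before = entropy([row["label"] for row in dataset])
    after = sum(len(sub) / len(dataset) * entropy(sub) for sub in subsets.values())
    return before - after

def split_info(dataset, attribute):
    """H_A(D): the entropy of the attribute's own value distribution."""
    return entropy([row[attribute] for row in dataset])

def gain_ratio(dataset, attribute):
    return info_gain(dataset, attribute) / split_info(dataset, attribute)

def c45_choose(dataset, attributes):
    """Keep only attributes with above-average information gain,
    then pick the one with the highest gain ratio among them."""
    gains = {a: info_gain(dataset, a) for a in attributes}
    average = sum(gains.values()) / len(gains)
    candidates = [a for a in attributes if gains[a] >= average]
    return max(candidates, key=lambda a: gain_ratio(dataset, a))

# Same hypothetical toy data as before.
data = [
    {"id": 1, "texture": "clear",  "label": "good"},
    {"id": 2, "texture": "clear",  "label": "good"},
    {"id": 3, "texture": "blurry", "label": "bad"},
    {"id": 4, "texture": "blurry", "label": "bad"},
]
print(gain_ratio(data, "id"))               # 0.5 -- penalised by its large H_A(D)
print(gain_ratio(data, "texture"))          # 1.0
print(c45_choose(data, ["id", "texture"]))  # texture
```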

Gini Index (CART algorithm)

CART is the abbreviation of Classification and Regression Tree, which is a well-known decision tree learning algorithm for both classification and regression tasks.

The CART decision tree uses the "Gini Index" to select partition attributes.
$$\mathrm{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$

Intuitively, Gini(D) reflects the probability that two samples drawn at random from data set D carry different class labels. Therefore, the smaller Gini(D) is, the higher the purity of D.

The Gini index of attribute a is defined as

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Gini}(D^v)$$
Therefore, in the candidate attribute set A, we select the attribute with the smallest Gini index after division as the optimal division attribute. Namely
$$a^* = \underset{a \in A}{\arg\min}\ \mathrm{Gini\_index}(D, a)$$
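
A minimal sketch of the Gini index and of this attribute choice on invented toy data. It follows the multiway form of Gini_index(D, a) given above; practical CART implementations usually perform binary splits instead.

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(dataset, attribute):
    """Gini_index(D, a) = sum_v |D^v| / |D| * Gini(D^v)."""
    subsets = defaultdict(list)
    for row in dataset:
        subsets[row[attribute]].append(row["label"])
    return sum(len(sub) / len(dataset) * gini(sub) for sub in subsets.values())

def cart_choose(dataset, attributes):
    """Pick the attribute whose split gives the smallest Gini index."""
    return min(attributes, key=lambda a: gini_index(dataset, a))

# Hypothetical toy data: "texture" yields perfectly pure subsets, "root" does not.
data = [
    {"texture": "clear",  "root": "curled", "label": "good"},
    {"texture": "clear",  "root": "stiff",  "label": "good"},
    {"texture": "blurry", "root": "curled", "label": "bad"},
    {"texture": "blurry", "root": "stiff",  "label": "bad"},
]
print(gini_index(data, "texture"))             # 0.0
print(gini_index(data, "root"))                # 0.5
print(cart_choose(data, ["texture", "root"]))  # texture
```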

Pruning

In fact, a freshly grown decision tree often lacks generalization ability; in other words, it may fail to predict correctly on data it has never seen, because it has overfitted the training set. We must actively remove some branches to reduce the risk of overfitting.

To decide whether to prune, we first randomly split the whole data set into a training set and a validation set. Put simply, whether a branch of the trained decision tree is pruned depends on which choice performs better on the validation set.

Pre-pruning evaluates splits from the top down: if the validation accuracy after a split is less than or equal to the accuracy before the split, the node is not split.

Post-pruning first constructs a complete decision tree as usual, and then inspects the non-leaf nodes one by one from the bottom up, replacing a subtree with a leaf whenever doing so improves accuracy on the validation set.

Comparison

Post-pruned decision trees usually retain more branches than pre-pruned decision trees.
Post-pruned decision trees have little risk of underfitting, and their generalization performance is often better than that of pre-pruned decision trees.
The training time overhead of post-pruning is much larger than that of unpruned and pre-pruned decision trees.
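
For completeness, here is a practical sketch using scikit-learn (assumed to be installed). It is not the watermelon book's accuracy-based procedure: pre-pruning is approximated by depth and leaf-size limits, post-pruning by cost-complexity pruning, and a held-out validation set decides which tree generalizes best.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Unpruned tree: grown until the leaves are pure.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# "Pre-pruning": stop splitting early via depth / leaf-size limits.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             random_state=0).fit(X_train, y_train)

# "Post-pruning": grow the full tree, then cut branches with cost-complexity
# pruning, keeping the alpha that scores best on the validation set.
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
post = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in alphas),
    key=lambda t: t.score(X_val, y_val),
)

for name, tree in [("unpruned", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name}: {tree.get_n_leaves()} leaves, "
          f"validation accuracy = {tree.score(X_val, y_val):.3f}")
```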

Origin blog.csdn.net/Code_Tookie/article/details/104841129