Explain the decision tree algorithm in detail

1. Decision tree

1.1 Decision tree definition

What is a decision tree? As the name implies, it is a tree-shaped decision-making algorithm: through the "decision" made at each node, a classification or regression task is carried out step by step. Decision trees are most often used for classification problems. Even if you have never seen a decision tree before, you can grasp the basic principle from the following figure, which shows the decision process of a decision tree for deciding whether a date candidate meets your standards.
[Figure: an example decision tree for deciding whether to go on a date]
Through a chain of logic like the one above, an algorithm for deciding whether to go on a date can be realized. This decision-making process is the general idea behind a decision tree.
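The "decision" at each node is nothing more than a conditional test, so the figure can be read as nested if/else checks. Below is a minimal Python sketch; the criteria (age, looks, shared hobby) are hypothetical stand-ins, not necessarily the ones in the figure:

```python
def date_decision(age: int, looks_ok: bool, shares_hobby: bool) -> str:
    """Each if/else below plays the role of one node in the decision tree."""
    if age > 35:                 # root node: attribute test on age
        return "no date"         # leaf node: decision result
    if not looks_ok:             # internal node: attribute test on looks
        return "no date"
    if shares_hobby:             # internal node: attribute test on hobbies
        return "date"
    return "think it over"       # leaf node

print(date_decision(age=28, looks_ok=True, shares_hobby=True))  # -> "date"
```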

1.2 Basic process of decision tree

Generally, a decision tree contains one root node, several internal nodes and several leaf nodes. Each leaf node corresponds to a decision result, while every other node corresponds to an attribute test; the sample set held at a node is split into its child nodes according to the outcome of that test, and the root node holds the complete sample set. The decision tree follows a "divide and conquer" strategy, and its general process is as follows:
Input:
training set D = {(x1, y1), (x2, y2), (x3, y3), ..., (xm, ym)}
attribute set A = {a1, a2, a3, ..., ad}
Process:

  • a. Generate node node
  • b. if all the samples in D belong to the same class C then
  • c. mark node as a class-C leaf node; return
  • d. end if
  • e. Select the optimal partition attribute a* from A
  • f. for each value a*v of a* do
  • g. generate a branch of node holding the subset Dv of samples in D that take the value a*v on a*
  • h. if Dv is empty, mark the branch as a leaf node labelled with the majority class of D; otherwise recurse on (Dv, A \ {a*})
  • i. end if
  • j. end for

As the pseudo-code above shows, generating a decision tree is a recursive process. In short, the algorithm keeps checking whether all samples in the current subset belong to the same class; if not, it finds the best way to split, divides the subset accordingly, and repeats this on each resulting subset until every subset has been fully divided. A minimal sketch of this process follows.
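Here is a hedged Python sketch of the recursive "divide and conquer" procedure. It assumes samples are `(feature_dict, label)` pairs and takes the attribute-selection rule as a parameter; Section 1.3 is exactly about how to choose that rule, so a trivial placeholder is used in the usage line:

```python
from collections import Counter

def majority_class(labels):
    """Most common label, used when a branch is empty or no attributes remain."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(samples, attributes, choose_attribute):
    """samples: list of (feature_dict, label) pairs; attributes: list of attribute names.
    choose_attribute(samples, attributes) -> the attribute to split on (see Section 1.3).
    Returns a nested dict {attribute: {value: subtree_or_label}} or a bare label (leaf)."""
    labels = [y for _, y in samples]
    if len(set(labels)) == 1:          # steps (b)-(c): all samples share one class -> leaf
        return labels[0]
    if not attributes:                 # no attributes left to test -> majority-class leaf
        return majority_class(labels)
    best = choose_attribute(samples, attributes)      # step (e): optimal partition attribute
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in {x[best] for x, _ in samples}:       # step (f): one branch per observed value
        subset = [(x, y) for x, y in samples if x[best] == value]
        # Step (h): with observed values the subset is never empty; if every possible
        # value of the attribute were enumerated, an empty subset would become a leaf
        # labelled with the majority class of D.
        tree[best][value] = build_tree(subset, remaining, choose_attribute)
    return tree

# Hypothetical toy data; any chooser works here, e.g. "pick the first attribute":
data = [({"cry": "meow", "size": "small"}, "cat"),
        ({"cry": "squeak", "size": "small"}, "mouse"),
        ({"cry": "meow", "size": "medium"}, "cat")]
print(build_tree(data, ["cry", "size"], lambda s, attrs: attrs[0]))
```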

1.3 Partition selection of decision tree

As Section 1.2 shows, the most important step of the decision tree algorithm is selecting the optimal partition attribute. A good partition attribute serves two purposes:

    1. It greatly shortens the decision process and improves decision accuracy.
    2. It improves the generalization ability of the model. This can be understood as follows: the way humans tell cats from mice is also a decision process, and we distinguish them by a few informative features such as the animal's cry, shape and a handful of local traits, rather than by every available feature such as fur, paw size, smell, weight, colour and droppings.

1.3.1 Information Gain

Information entropy is the most commonly used measure of sample purity, where entropy is defined as the expected value of information.
Definition of information: if the objects to be classified may fall into several categories, the information carried by the symbol $x_i$ is defined as
$$l(x_i) = -\log_2 p(x_i)$$
where $p(x_i)$ is the probability of category $x_i$.
To compute the entropy, we take the expected value of the information over all categories. Recall that the expectation of a discrete random variable is $\mu = \sum_i x_i\, p(x_i)$; replacing the value $x_i$ with its information $l(x_i)$ gives the information entropy:
$$H(D) = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$
The purer the training set D, i.e. the larger the proportion $p(x_i)$ of the dominant class, the smaller H is. Furthermore,
$$\text{information gain} = \text{entropy(before the split)} - \text{entropy(after the split)}$$
Suppose attribute a has V possible values $\{a^1, a^2, \dots, a^V\}$. Splitting the sample set D on a produces V branch nodes, where the v-th branch node contains the samples of D that take the value $a^v$ on attribute a, denoted $D^v$. Since the branch nodes contain different numbers of samples, each is weighted by $\frac{|D^v|}{|D|}$, which gives the information gain:
$$Gain(D, a) = H(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}H(D^v)$$
Therefore, the larger the information gain Gain(D, a), the greater the improvement in purity obtained by splitting the training set D on attribute a. The well-known ID3 decision tree algorithm uses information gain to choose its decision nodes.
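As a hedged sketch of the two formulas above, assuming the same `(feature_dict, label)` sample format as the earlier build sketch:

```python
from collections import Counter
from math import log2

def entropy(samples):
    """H(D) = -sum_i p(x_i) * log2 p(x_i), computed over the class labels."""
    labels = [y for _, y in samples]
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(samples, attribute):
    """Gain(D, a) = H(D) - sum_v |D^v| / |D| * H(D^v)."""
    total = len(samples)
    remainder = 0.0
    for value in {x[attribute] for x, _ in samples}:
        subset = [(x, y) for x, y in samples if x[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(samples) - remainder
```

With this helper, the attribute chooser in the earlier sketch can be written as `lambda s, attrs: max(attrs, key=lambda a: information_gain(s, a))`, which is essentially how ID3 picks its splits.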

1.3.2 Gain ratio

Reconsider the choice of the optimal partition attribute. If we gave every sample a unique ID and used that ID as a split attribute, its information gain would be the largest possible: with m samples it produces m branches, each containing a single sample. The purity of every branch is then maximal, yet such a split has no practical value. For this reason the C4.5 decision tree algorithm no longer splits on information gain but on the gain ratio. Information gain behaves like chasing the largest absolute profit, so it favours attributes with many possible values; the gain ratio instead normalises the gain by how finely the attribute splits the data, which to some extent overcomes this shortcoming, although it in turn tends to favour attributes with fewer possible values. Its formula is:
$$Gain\_ratio(D, a) = \frac{Gain(D, a)}{SplitInformation(D, a)}$$
where SplitInformation is:
$$SplitInformation(D, a) = -\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$
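Under the same assumptions, a sketch of the gain ratio, reusing the `information_gain` helper from the previous sketch:

```python
from collections import Counter
from math import log2

def split_information(samples, attribute):
    """SplitInformation(D, a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)."""
    total = len(samples)
    counts = Counter(x[attribute] for x, _ in samples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def gain_ratio(samples, attribute):
    """Gain_ratio(D, a) = Gain(D, a) / SplitInformation(D, a)."""
    si = split_information(samples, attribute)
    if si == 0:                  # attribute takes a single value: no useful split
        return 0.0
    return information_gain(samples, attribute) / si
```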

1.3.3 Gini Index

Another well-known decision tree algorithm, CART, uses the Gini index to select partition attributes. The Gini index of a sample set D is the probability that two samples drawn at random from D belong to different classes; the smaller this probability, the higher the purity. The formula is:
$$Gini\_index(D, a) = \sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$$
where
$$Gini(D) = \sum_{k=1}^{|y|}\sum_{k' \ne k} p_k p_{k'}$$
It should be noted that the CART algorithm always generates a binary tree: every internal node performs a binary test (typically a yes/no question), whereas algorithms such as ID3 may give a node two or more children.
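A corresponding sketch for the Gini index, in the same sample format. It uses the standard computational form of the double sum, $Gini(D) = 1 - \sum_k p_k^2$:

```python
from collections import Counter

def gini(samples):
    """Gini(D) = sum_{k != k'} p_k p_{k'}, computed as 1 - sum_k p_k^2."""
    labels = [y for _, y in samples]
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_index(samples, attribute):
    """Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v); CART prefers the smallest value."""
    total = len(samples)
    index = 0.0
    for value in {x[attribute] for x, _ in samples}:
        subset = [(x, y) for x, y in samples if x[attribute] == value]
        index += len(subset) / total * gini(subset)
    return index
```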

1.4 Optimization of decision tree

1.4.1 Pruning processing

Just as trees in the real world are pruned regularly to control their shape and direction of growth, decision trees also need to be "pruned" to fight overfitting. As we saw when discussing partition selection, a decision tree sometimes grows too many branches in its effort to classify every training sample correctly. Peculiarities of the training set are then mistaken for general rules, which is as absurd as concluding that black cats are cats but white cats are not. To avoid this, the decision tree is pruned. Generally speaking, the pruning strategies are:

    1. pre-pruning
    2. post-pruning

Pre-pruning:

Pre-pruning means that during decision tree generation, each node is evaluated before it is split. If splitting the current node cannot improve the generalization performance of the decision tree, the split is abandoned and the node is marked as a leaf node.

Post-pruning:

Post-pruning means that a complete decision tree is trained first, and its non-leaf nodes are then examined from the bottom up. If replacing the subtree rooted at a node with a leaf node improves the generalization ability of the model, that subtree is replaced with a leaf node.
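As a preview of the sklearn practice promised in the next article, here is a hedged sketch of both strategies: pre-pruning is expressed through constructor limits such as `max_depth` and `min_samples_leaf`, while sklearn's post-pruning takes the form of cost-complexity pruning controlled by `ccp_alpha`. The dataset and the specific parameter values are placeholders.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting early via depth / leaf-size limits (placeholder values).
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path, then refit with a chosen alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned accuracy :", pre_pruned.score(X_test, y_test))
print("post-pruned accuracy:", post_pruned.score(X_test, y_test))
```

In practice the `ccp_alpha` value would be chosen by cross-validation over `path.ccp_alphas` rather than picked by index as in this sketch.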

1.5 Application scope of decision tree

Decision trees can be used for both classification and regression tasks, though in practice they are used more often for classification. They do not handle ring-shaped or circular data well and are very sensitive to rotations of the data; they are also sensitive to small changes in the training data, so special care must be taken during training to avoid overfitting.

1.6 References

    1. Zhou Zhihua, "Machine Learning"
    2. Wang Jingyuan, Jia Wei, Bian Rui and Qiu Juntao, "Machine Learning in Action"
    3. Li Rui, Li Peng, Qu Yadong and Wang Bin, "Machine Learning in Action"

The next article will use the decision tree in sklearn for hands-on practice; feel free to check out my other blog posts.
