Generation of decision tree - ID3 algorithm



The origin of the algorithm:

The decision tree approach originated with CLS (Concept Learning System), proposed by Earl B. Hunt, but CLS gave no method for selecting the optimal feature. Later, J. Ross Quinlan proposed the ID3 algorithm, which uses information gain to determine the optimal feature; Quinlan subsequently refined ID3 into the C4.5 algorithm, which uses the information gain ratio instead. The two algorithms are essentially similar and differ only in how the optimal feature is chosen: ID3's information gain is biased toward features with many distinct values, while C4.5's gain ratio counteracts that bias.

ID3 algorithm

Input: training data set D, feature set A, threshold ε
Output: decision tree T

  1. Determine whether T can be generated without selecting a feature:

    ——If all instances in D belong to the same class C_k, then T is a single-node tree; take C_k as the class label of that node and return T;
    ——If the feature set is empty (A = ∅), then T is a single-node tree; record the class C_k with the largest number of instances in D, take it as the class label of that node, and return T;

  2. Otherwise, compute the information gain of each feature in A and select the feature A_g with the largest information gain:

    ——If the information gain of A_g is less than the threshold ε, then T is a single-node tree; record the class C_k with the largest number of instances in D, take it as the class label of that node, and return T;
    ——Otherwise, for each possible value a_i of A_g, split D into non-empty subsets D_i; take the class with the largest number of instances in each D_i as the label of the corresponding sub-node; T consists of this node and its sub-nodes; return T;

  3. For the i-th child node, take D_i as the training set and A - {A_g} as the feature set, call the above steps recursively to obtain the subtree T_i, and return T_i.
    The key point of the algorithm is the stopping rule: whenever the largest information gain computed on the current training subset is smaller than the threshold ε, the recursion stops and a single-node (leaf) tree is returned.
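
The recursion above maps almost line by line onto code. Below is a minimal sketch of ID3 in Python, assuming each instance is a dict of feature values plus a separate class label; the helper names (entropy, info_gain, id3) and the eps threshold parameter are illustrative choices, not part of the original description.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Information gain g(D, A) = H(D) - H(D | A)."""
    n = len(labels)
    cond = 0.0
    for value in {r[feature] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[feature] == value]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def id3(rows, labels, features, eps=1e-3):
    """Return a nested-dict decision tree; leaves are class labels."""
    if len(set(labels)) == 1:                 # all instances in one class
        return labels[0]
    if not features:                          # no features left: majority class
        return Counter(labels).most_common(1)[0][0]
    gains = {f: info_gain(rows, labels, f) for f in features}
    best = max(gains, key=gains.get)
    if gains[best] < eps:                     # gain below threshold: majority leaf
        return Counter(labels).most_common(1)[0][0]
    tree = {best: {}}
    for value in {r[best] for r in rows}:     # one branch per value of the feature
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [f for f in features if f != best],
                                eps)
    return tree
```

On the loan example worked below, calling id3 with the four features would reproduce the two-level tree that is derived by hand: a root split on "has own house", then a split on "has a job".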


Example:

After computing the empirical entropy, the empirical conditional entropies, and the information gain of each feature on the full data set, the feature with the maximum information gain turns out to be "has own house", which splits the data into two groups: applicants with a house and applicants without one. All applicants with a house get a single leaf node labelled "approve the loan". The nine instances without a house (3 approve, 6 reject) form the new training data set D_2; the information gains of the remaining features are then computed on D_2, and the feature with the largest gain is selected for the next split.

Denote the remaining features as: age is A_1, has a job is A_2, and credit status is A_3.

Information gain formula:

g(D_2, A_i) = H(D_2) - H(D_2 \mid A_i)

Empirical entropy:

H(D_2) = -\frac{6}{9}\log_2\frac{6}{9} - \frac{3}{9}\log_2\frac{3}{9} = 0.918
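
For reference, the general definitions behind these numbers (standard formulas, implied rather than stated in the post): for a data set D with classes C_k and a feature A splitting D into subsets D_i,

H(D) = -\sum_{k} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}, \qquad H(D \mid A) = \sum_{i} \frac{|D_i|}{|D|} H(D_i)

The 6/9 and 3/9 above are the proportions of the two classes (reject / approve) among the nine instances of D_2.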
Feature: age (A_1)

Computation:

Youth:

w_1 = \frac{|D_{21}|}{|D_2|} = \frac{4}{9}, \quad H(D_{21}) = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} = 0.811

Middle-aged:

w_2 = \frac{|D_{22}|}{|D_2|} = \frac{2}{9}, \quad H(D_{22}) = 0

Elderly:

w_3 = \frac{|D_{23}|}{|D_2|} = \frac{3}{9}, \quad H(D_{23}) = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.918

Empirical conditional entropy:

H(D_2 \mid A_1) = w_1 H(D_{21}) + w_2 H(D_{22}) + w_3 H(D_{23}) = 0.360 + 0 + 0.307 = 0.667
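
The information gain of the age feature then follows directly:

g(D_2, A_1) = H(D_2) - H(D_2 \mid A_1) = 0.918 - 0.667 = 0.251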
Feature: has a job (A_2)

Computation:

Has a job:

w_1 = \frac{|D_{21}|}{|D_2|} = \frac{3}{9}, \quad H(D_{21}) = 0

No job:

w_2 = \frac{|D_{22}|}{|D_2|} = \frac{6}{9}, \quad H(D_{22}) = 0

Empirical conditional entropy:

H(D_2 \mid A_2) = w_1 H(D_{21}) + w_2 H(D_{22}) = 0
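
So the information gain of the "has a job" feature equals the full entropy of D_2:

g(D_2, A_2) = H(D_2) - H(D_2 \mid A_2) = 0.918 - 0 = 0.918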

Feature: credit status (A_3)

Computation:

Very good:

w_1 = \frac{|D_{21}|}{|D_2|} = \frac{1}{9}, \quad H(D_{21}) = 0

Good:

w_2 = \frac{|D_{22}|}{|D_2|} = \frac{4}{9}, \quad H(D_{22}) = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1

Fair:

w_3 = \frac{|D_{23}|}{|D_2|} = \frac{4}{9}, \quad H(D_{23}) = 0

Empirical conditional entropy:

H(D_2 \mid A_3) = w_1 H(D_{21}) + w_2 H(D_{22}) + w_3 H(D_{23}) = 0 + 0.444 + 0 = 0.444
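
And the information gain of the credit-status feature is:

g(D_2, A_3) = H(D_2) - H(D_2 \mid A_3) = 0.918 - 0.444 = 0.474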
Putting it together:
Comparing the three, the feature "has a job" (A_2) has the largest information gain, g(D_2, A_2) = 0.918, so it is selected as the splitting feature for this node.

In this example the whole decision tree uses only two features. Applicants who have their own house are approved for the loan; among those without a house, applicants with a job are approved and applicants without a job are rejected. Every resulting subset contains instances of a single class, so each branch ends in a single leaf node and the generation of the decision tree is complete.
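
As a quick numerical check of the example (a sketch, not from the original post: the per-value class counts, e.g. [1, 3] meaning 1 "approve" and 3 "reject" among the young applicants, are reconstructed from the fractions used above), the following script recomputes the three information gains on D_2:

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# Empirical entropy of D_2: 9 no-house instances, 3 approve / 6 reject.
H_D2 = entropy([3, 6])                                        # ~0.918

# Empirical conditional entropies from the per-value class counts.
H_age    = 4/9 * entropy([1, 3]) + 2/9 * entropy([2]) + 3/9 * entropy([2, 1])
H_job    = 3/9 * entropy([3]) + 6/9 * entropy([6])
H_credit = 1/9 * entropy([1]) + 4/9 * entropy([2, 2]) + 4/9 * entropy([4])

for name, h in [("age", H_age), ("has a job", H_job), ("credit", H_credit)]:
    print(f"g(D2, {name}) = {H_D2 - h:.3f}")
# Prints roughly 0.25, 0.92 and 0.47: "has a job" has the largest gain.
```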

Source: blog.csdn.net/qq_44795788/article/details/124731658