[Machine Learning] Decision Tree Algorithm Attribute Screening Metrics



In information theory, entropy is the average amount of information contained in each received message; it is also called information entropy, source entropy, or average self-information. Here, "messages" stand for events, samples, or features drawn from a distribution or data stream. The concept of entropy originated in physics, where it measures the degree of disorder of a thermodynamic system. In information theory, entropy is a measure of uncertainty: the higher the entropy, the more information a source can transmit; the lower the entropy, the less information it can transmit.

Information Entropy

In 1948, Shannon borrowed the concept of entropy from thermodynamics to measure the amount of information. Entropy is the sum of the self-information of each possible outcome weighted by its probability, where the self-information of an outcome with probability $p$ is $\log \frac{1}{p} = -\log p$. For an event $x$, the amount of information it carries is

$$\mathrm{I}(x) = \log \frac{1}{p(x)} = -\log p(x)$$

where $p(x)$ is the probability that $x$ occurs.

Information entropy is the mathematical expectation of the amount of information; it is the prior uncertainty before the source emits a message, and is therefore also called prior entropy. For a random variable $X$ taking the values $x_1, x_2, \cdots, x_n$ with distribution

$$\mathrm{P}(X = x_i) = p_i, \qquad \sum_{i=1}^n p_i = 1,$$

the entropy of $X$ is defined as

$$\mathrm{H}(X) = \mathrm{I}_E(x_1, x_2, \cdots, x_n) = \sum_{i=1}^n p_i \mathrm{I}(x_i) = -\sum_{i=1}^n p_i \log p_i$$
Suppose a source has two symbols $x_1, x_2$. The figure below plots the information entropy $\mathrm{H}$ against the probability $P(x_1)$ of the symbol $x_1$. When $P(x_1) = 1$ or $P(x_1) = 0$, the entropy is $\mathrm{H} = 0$ and there is no uncertainty; when $P(x_1) = 0.5$, the entropy reaches its maximum and the uncertainty is greatest.

[Figure: information entropy of a binary source as a function of the source symbol probability]
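The shape of this curve is easy to reproduce numerically. Below is a minimal Python sketch of the binary entropy function (the helper name `binary_entropy` is illustrative, not from any particular library):

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(p, round(binary_entropy(p), 3))
# Entropy rises from 0 at p = 0, peaks at 1 bit at p = 0.5, and falls back to 0 at p = 1.
```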

For example, in a classification system where the class label $c$ takes the values $c_1, c_2, \cdots, c_n$ ($n$ is the total number of classes), the entropy of the classification system is

$$\mathrm{H}(c) = -\sum_{i=1}^n p(c_i) \log p(c_i)$$

In particular, for a binary classification system the entropy is

$$\mathrm{H}(c) = -p(c_0) \log_2 p(c_0) - p(c_1) \log_2 p(c_1)$$

where $p(c_0)$ and $p(c_1)$ are the probabilities of the two classes.
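As a quick sanity check on these definitions, here is a minimal Python sketch that computes the entropy of a list of class counts (the helper name `entropy` is illustrative; the later sketches reuse it):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts, e.g. [9, 5]."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

print(round(entropy([7, 7]), 3))   # 1.0  -- two equally likely classes, maximum uncertainty
print(round(entropy([14, 0]), 3))  # 0.0  -- a single class, no uncertainty
print(round(entropy([9, 5]), 3))   # 0.94 -- the 'play' column of the example that follows
```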

Information Gain

The ID3 decision tree algorithm selects attributes by information gain. Based on information entropy and self-information, information gain is defined as

$$\overbrace{\mathrm{IG}(T,a)}^{\text{information gain}} = \overbrace{\mathrm{H}(T)}^{\text{entropy (parent)}} - \overbrace{\mathrm{H}(T \mid a)}^{\text{weighted sum of entropy (children)}} = -\sum_{i=1}^{J} p_i \log_2 p_i - \sum_{a} p(a) \sum_{i=1}^{J} -\Pr(i \mid a) \log_2 \Pr(i \mid a)$$
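Translated directly into code, the definition looks like the sketch below; `info_gain` is an illustrative helper (not a library function) and reuses the `entropy` helper defined above:

```python
from collections import Counter, defaultdict

def info_gain(labels, attribute_values):
    """IG(T, a) = H(T) - sum over attribute values v of p(v) * H(T | a = v)."""
    parent = entropy(list(Counter(labels).values()))   # H(T), entropy of the parent node
    groups = defaultdict(list)                         # partition the labels by attribute value
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    children = sum(                                    # weighted sum of child-node entropies
        len(group) / len(labels) * entropy(list(Counter(group).values()))
        for group in groups.values()
    )
    return parent - children
```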
Take a look at an example given in Wikipedia:

Example: the data set has four attributes: outlook (sunny, overcast, rainy), temperature (hot, mild, cool), humidity (high, normal), and windy (true, false), with the target value play (yes, no) and 14 data points in total.

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

To build the decision tree, the information gains of the four possible splits are compared, each split using one attribute. The split with the highest information gain is used as the first split, and the process is repeated at each child node until the information gain is 0.

If the attribute windy is used for the split, two child nodes are produced: windy = true and windy = false. In the current data set, 6 data points have windy = true; of these, 3 have play = yes and 3 have play = no. The remaining 8 data points have windy = false; of these, 6 have play = yes and 2 have play = no.

  • The information entropy of the windy = true child node is calculated as:

$$\mathrm{I}_E([3,3]) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$$

  • The information entropy of the windy = false child node is calculated as:

$$\mathrm{I}_E([6,2]) = -\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8} = -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} = 0.8112781$$

The information entropy of this split (using the attribute windy) is the weighted sum of the information entropies of the two child nodes:

$$\mathrm{I}_E([3,3],[6,2]) = \mathrm{I}_E(\text{windy or not}) = \frac{6}{14} \cdot 1 + \frac{8}{14} \cdot 0.8112781 = 0.8921589$$
To compute the information gain of the attribute windy, the information entropy of the initial (unsplit) data set must first be calculated. The play column contains 9 yes and 5 no:

$$\mathrm{I}_E([9,5]) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940286$$
Thus, the information gain of windy is

$$\mathrm{IG}(\text{windy}) = \mathrm{I}_E([9,5]) - \mathrm{I}_E([3,3],[6,2]) = 0.940286 - 0.8921589 = 0.0481271$$
The above is only the information gain for splitting on windy; what about splitting on temperature, humidity, and outlook? By the same procedure,

$$\mathrm{IG}(\text{outlook}) = 0.246, \qquad \mathrm{IG}(\text{temperature}) = 0.029, \qquad \mathrm{IG}(\text{humidity}) = 0.151$$
Select the attribute with the largest information gain for the split, then repeat the above steps until the tree is built; in this example, the attribute chosen for the first split is outlook.
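To verify these numbers, the weather table can be encoded as parallel lists and fed to the `info_gain` sketch above (a verification sketch only; the column names simply mirror the table):

```python
# The 14 data points of the weather table, one list per column.
outlook     = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
               "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
temperature = ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
               "mild", "cool", "mild", "mild", "mild", "hot", "mild"]
humidity    = ["high", "high", "high", "high", "normal", "normal", "normal",
               "high", "normal", "normal", "normal", "high", "normal", "high"]
windy       = ["false", "true", "false", "false", "false", "true", "true",
               "false", "false", "false", "true", "true", "false", "true"]
play        = ["no", "no", "yes", "yes", "yes", "no", "yes",
               "no", "yes", "yes", "yes", "yes", "yes", "no"]

for name, column in [("outlook", outlook), ("temperature", temperature),
                     ("humidity", humidity), ("windy", windy)]:
    print(name, round(info_gain(play, column), 3))
# outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048
# (these agree with the values quoted above up to rounding)
```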

Note: an attribute with many distinct values tends to split the data into "purer" subsets, so its information gain is larger and the decision tree will prefer it near the top of the tree. The resulting tree is very wide and very shallow, which is usually unreasonable.
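To see this bias concretely, a hypothetical attribute that takes a distinct value on every row (for example, a record ID) pushes the information gain all the way up to the parent entropy, even though it is useless for prediction. Using the sketches above:

```python
row_id = list(range(14))                  # one unique value per data point
print(round(info_gain(play, row_id), 3))  # ~0.94: every child node is pure, so IG equals H(T)
```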

Information Gain Rate

Because information gain prefers attributes with many values (in the extreme, an attribute that takes a distinct value for every sample), the C4.5 decision tree algorithm replaces ID3's information gain with the information gain rate (gain ratio), defined as

$$\mathrm{IG}_r = \frac{\mathrm{IG}(T,a)}{\mathrm{H}(a)}$$

where $\mathrm{IG}(T,a)$ is the information gain obtained by splitting on attribute $a$, and $\mathrm{H}(a)$ is the intrinsic information of $a$: the entropy of the partition that attribute $a$ induces on the data.

Therefore, when splitting on the attribute outlook (5 sunny, 4 overcast, 5 rainy), the split entropy is

$$\mathrm{I}_E([5,4,5]) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{4}{14}\log_2\frac{4}{14} - \frac{5}{14}\log_2\frac{5}{14} = 1.5774$$
and the information gain rate is

$$\mathrm{IG}_r(\text{outlook}) = \frac{0.246}{1.5774} = 0.15595$$
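A small sketch of the same computation, reusing the helpers and the outlook/play columns from the earlier sketches (`intrinsic_info` and `gain_ratio` are illustrative names):

```python
from collections import Counter

def intrinsic_info(attribute_values):
    """H(a): entropy of the partition induced by attribute a."""
    return entropy(list(Counter(attribute_values).values()))

def gain_ratio(labels, attribute_values):
    return info_gain(labels, attribute_values) / intrinsic_info(attribute_values)

print(round(intrinsic_info(outlook), 4))    # 1.5774 -- the [5, 4, 5] split
print(round(gain_ratio(play, outlook), 3))  # ~0.156 -- matches 0.15595 above up to rounding
```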

Gini Coefficient

However, both ID3 and C4.5 rely on the entropy model from information theory, which involves many logarithm operations. Can the model be simplified without losing the benefits of the entropy model? Yes: the CART (Classification and Regression Tree) algorithm uses the Gini coefficient in place of the information gain ratio for its classification trees. The Gini coefficient measures the impurity of the model: the smaller the Gini coefficient, the lower the impurity (the higher the purity), and the better the attribute.

Similarly, for a sample set $T$ with $n$ classes $c_1, c_2, \cdots, c_n$, where the probability that a sample belongs to the $i$-th class is $p_i$, the Gini coefficient is

$$\mathrm{Gini}(T) = \sum_{i=1}^n p_i (1 - p_i) = 1 - \sum_{i=1}^n p_i^2$$
In particular, for a binary classification problem the Gini coefficient is

$$\mathrm{Gini}(T) = \sum_{i=1}^2 p_i (1 - p_i) = 1 - \sum_{i=1}^2 p_i^2 = 2p(1 - p)$$

where $p$ is the probability of one of the two classes.
Gini impurity is zero when all samples in a node are of one class.

If attribute $a$ takes $k$ distinct values, splitting the sample set $T$ into subsets $T^1, T^2, \cdots, T^k$, the Gini coefficient of the split is

$$\mathrm{Gini}(T,a) = \sum_{i=1}^k \frac{|T^i|}{|T|} \mathrm{Gini}(T^i)$$
For feature selection, the attribute with the smallest $\mathrm{Gini}(T,a)$ is preferred:

$$a^* = \mathop{\arg\min}_{a \in A} \mathrm{Gini}(T,a)$$
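Finally, a minimal sketch of the Gini computation on the same weather data, reusing the column lists from the information-gain example (`gini` and `gini_split` are illustrative names):

```python
from collections import Counter, defaultdict

def gini(counts):
    """Gini impurity of a list of class counts: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(labels, attribute_values):
    """Gini(T, a): size-weighted Gini impurity of the subsets produced by attribute a."""
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    return sum(
        len(group) / len(labels) * gini(list(Counter(group).values()))
        for group in groups.values()
    )

candidates = {"outlook": outlook, "temperature": temperature,
              "humidity": humidity, "windy": windy}
print({name: round(gini_split(play, col), 3) for name, col in candidates.items()})
# {'outlook': 0.343, 'temperature': 0.44, 'humidity': 0.367, 'windy': 0.429}
best = min(candidates, key=lambda name: gini_split(play, candidates[name]))
print("best attribute:", best)  # outlook -- the same first split the entropy-based criteria chose
```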

References

[1] Super detailed information entropy, information gain, information gain ratio, Gini coefficient

[2] Information Gain_Information Gain Rate_Gini, xiaoxiyouran

[3] Decision tree learning, Wikipedia
