Some notes on "Improved Use of Continuous Attributes in C4.5"

Copyright notice: this is the blogger's original article and may be freely reposted. https://blog.csdn.net/appleyuchi/article/details/83154696

Here are the formulas given in
"Improved Use of Continuous Attributes in C4.5",
Journal of Artificial Intelligence Research 4 (1996), 77-90:

$$Info(D)=-\sum_{j=1}^{C}p(D,j)\cdot\log_2(p(D,j))$$

$$Gain(D,T)=Info(D)-\sum_{i=1}^{k}\frac{|D_i|}{|D|}\cdot Info(D_i)$$

$$Split(D,T)=-\sum_{i=1}^{k}\frac{|D_i|}{|D|}\cdot\log_2\left(\frac{|D_i|}{|D|}\right)$$
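To make the three formulas concrete, here is a minimal Python sketch. The function names (`info`, `gain`, `split_info`) and the representation of D as a plain list of class labels are my own assumptions for illustration; they are not taken from the paper or from the C4.5 source code.

```python
import math
from collections import Counter


def info(labels):
    """Info(D): entropy (in bits) of the class labels in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(labels, partition):
    """Gain(D, T): `partition` is the list of label lists D_1..D_k produced by test T."""
    n = len(labels)
    return info(labels) - sum(len(d) / n * info(d) for d in partition if d)


def split_info(labels, partition):
    """Split(D, T): entropy of the subset sizes |D_i| / |D|."""
    n = len(labels)
    return -sum(len(d) / n * math.log2(len(d) / n) for d in partition if d)


# Toy data (made up): four cases, and a test T that separates the classes perfectly.
D = ["yes", "yes", "no", "no"]
parts = [["yes", "yes"], ["no", "no"]]

print(info(D))               # 1.0
print(gain(D, parts))        # 1.0
print(split_info(D, parts))  # 1.0
```

Empty subsets are skipped so that `log2(0)` is never evaluated.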

The following is my understanding.
------------------first change-----------------------------
The usual gain ratio is
$$Gain\_Ratio=\frac{Gain(D,T)}{Split(D,T)}$$

My understanding of the "first change" is that, for a continuous attribute with $N$ distinct values in $D$ (so $N-1$ candidate thresholds), the gain is reduced by $\frac{\log_2(N-1)}{|D|}$ before the ratio is taken:
$$Gain\_Ratio\_adjusted=\frac{Gain(D,T)-\frac{\log_2(N-1)}{|D|}}{Split(D,T)}$$
Is this right?
Many Thanks~
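A small numeric sketch of this reading of the "first change" may help. The function names and the numbers below are made up for illustration; `n_distinct` plays the role of $N$ and `n_cases` the role of $|D|$.

```python
import math


def gain_ratio(gain, split):
    """Plain gain ratio: Gain(D, T) / Split(D, T)."""
    return gain / split


def adjusted_gain_ratio(gain, split, n_distinct, n_cases):
    """Gain ratio with the gain reduced by log2(N - 1) / |D|.

    Assumes n_distinct >= 2, i.e. at least one candidate threshold exists.
    """
    penalty = math.log2(n_distinct - 1) / n_cases
    return (gain - penalty) / split


# Hypothetical values: Gain = 0.25, Split = 1.0, N = 5 distinct values, |D| = 16 cases.
g, s = 0.25, 1.0
N, size = 5, 16  # penalty = log2(4) / 16 = 0.125

print(gain_ratio(g, s))                    # 0.25
print(adjusted_gain_ratio(g, s, N, size))  # 0.125
```

With these numbers the penalty halves the score, which shows how an attribute with many distinct values in a small D is pushed down.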
--------------------second change---------------------------
The relevant part of the article for the "second change" is:
"This seems to be an unnecessary complication, so the threshold t is chosen instead to maximize gain. Once the threshold is chosen, however, the final selection of the attribute to be used for the test is still made on the basis of the gain ratio criterion using the adjusted gain."
My understanding is:

1st step:
choose the threshold t that maximizes $Gain(D,T)$,
not the one that maximizes $Gain\_Ratio$,
and not the one that maximizes $Gain(D,T)-\frac{\log_2(N-1)}{|D|}$.
2nd step:
choose the best feature according to:
$$Gain\_Ratio(discrete\ feature)=\frac{Gain(D,T)}{Split(D,T)}$$
$$Gain\_Ratio\_adjusted(continuous\ feature)=\frac{Gain(D,T)-\frac{\log_2(N-1)}{|D|}}{Split(D,T)}$$
Finally, choose the feature whose gain ratio (discrete feature) or adjusted gain ratio (continuous feature) is the largest.


Is this understanding right?
Many thanks~
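The two steps as I read them can be sketched in Python as follows. This is only an illustration of my reading, not the C4.5 implementation: the function names and the toy data are made up, and using midpoints between adjacent distinct values as candidate thresholds is an assumption of this sketch.

```python
import math
from collections import Counter


def info(labels):
    """Info(D): entropy (in bits) of the class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def gain_and_split(labels, parts):
    """Return (Gain(D, T), Split(D, T)) for the partition `parts` of `labels`."""
    n = len(labels)
    parts = [p for p in parts if p]  # ignore empty subsets
    g = info(labels) - sum(len(p) / n * info(p) for p in parts)
    s = -sum(len(p) / n * math.log2(len(p) / n) for p in parts)
    return g, s


def score_discrete(values, labels):
    """Plain gain ratio for a discrete feature (one subset per value)."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    g, s = gain_and_split(labels, list(groups.values()))
    return g / s if s > 0 else 0.0


def score_continuous(values, labels):
    """Return (adjusted gain ratio, threshold) for a continuous feature."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        return 0.0, None  # no threshold possible
    best = None  # (gain, split, threshold)
    # 1st step: pick the threshold by plain gain only.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # assumption: midpoint as candidate threshold
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        g, s = gain_and_split(labels, [left, right])
        if best is None or g > best[0]:
            best = (g, s, t)
    g, s, t = best
    # 2nd step: subtract log2(N - 1) / |D| from the gain, then take the ratio.
    penalty = math.log2(len(distinct) - 1) / len(labels)
    return (g - penalty) / s, t


# Toy data (made up): one continuous and one discrete feature.
labels = ["no", "no", "yes", "yes"]
temperature = [1.0, 2.0, 3.0, 4.0]
colour = ["r", "g", "r", "g"]

cont_score, threshold = score_continuous(temperature, labels)
scores = {"temperature": cont_score, "colour": score_discrete(colour, labels)}
best_attr = max(scores, key=scores.get)
print(threshold, best_attr)  # 2.5 temperature
```

Here the threshold 2.5 wins on plain gain, and only afterwards is the gain penalized when `temperature` is compared against `colour`.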
