[Machine Learning, Chapter 4: Decision Trees] - Homework

Decision Tree Homework

1. Problem

You are stranded on a deserted island. Mushrooms of various types grow wildly all over the island, but no other food is anywhere to be found. Some of the mushrooms have been determined to be poisonous and others not, through your former companions' trial and error. You are the only one remaining on the island. You have the following data to consider.

You know whether or not mushrooms A through H are poisonous, but you do not know about U through W. Build a decision tree to classify mushrooms as poisonous or not.

Questions

(a) What is the entropy of IsPoisonous?

(b) Which attribute should you choose as the root of a decision tree? Hint: You can figure this out by looking at the data without explicitly computing the information gain of all four attributes.

(c) What is the information gain of the attribute you chose in the previous question?

(d) Build a decision tree to classify mushrooms as poisonous or not.

(e) Classify mushrooms U, V, and W using this decision tree as poisonous or not poisonous.


| Example | IsHeavy | IsSmelly | IsSpotted | IsSmooth | IsPoisonous |
|---------|---------|----------|-----------|----------|-------------|
| A | 0 | 0 | 0 | 0 | 0 |
| B | 0 | 0 | 1 | 0 | 0 |
| C | 1 | 1 | 0 | 1 | 0 |
| D | 1 | 0 | 0 | 1 | 1 |
| E | 0 | 1 | 1 | 0 | 1 |
| F | 0 | 0 | 1 | 1 | 1 |
| G | 0 | 0 | 0 | 1 | 1 |
| H | 1 | 1 | 0 | 0 | 1 |
| U | 1 | 1 | 1 | 1 | ? |
| V | 0 | 1 | 0 | 1 | ? |
| W | 1 | 1 | 0 | 0 | ? |

2. Solution

2.1 What is the entropy of IsPoisonous?

Entropy formula (summing over the $c$ classes):

$$Entropy(t) = -\sum_{i=0}^{c-1} p(i \mid t)\log_2 p(i \mid t)$$

Information gain formula:

$$Gain(D, a) = Entropy(D) - \sum_{i=1}^{k} \frac{|D_i|}{|D|}\, Entropy(D_i)$$

Computing the entropy of the class label (5 poisonous, 3 not poisonous):

$$Entropy(IsPoisonous) = -\frac{5}{8}\log_2\frac{5}{8} - \frac{3}{8}\log_2\frac{3}{8} = 0.954434002924965$$

For reference, the entropies of each attribute's own value distribution: IsHeavy, IsSmelly, and IsSpotted each split the samples 3 : 5, so each also has entropy 0.954434002924965, while IsSmooth splits them 4 : 4, giving

$$Entropy(IsSmooth) = -\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8} = 1.0$$
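The entropy value can be verified with a few lines of Python. This is a minimal sketch: the `entropy` helper and the 0/1 label list are my own encoding of the table above.

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

# IsPoisonous column for mushrooms A-H: A, B, C are 0; D through H are 1
is_poisonous = [0, 0, 0, 1, 1, 1, 1, 1]
print(entropy(is_poisonous))  # ≈ 0.9544
```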

2.2 Which attribute should be chosen as the root of the decision tree?

There are 8 training samples in total. Splitting on each attribute gives:

| Attribute value | Samples | Not poisonous | Poisonous |
|-----------------|---------|---------------|-----------|
| Heavy | 3 | 1 | 2 |
| Not heavy | 5 | 2 | 3 |
| Smelly | 3 | 1 | 2 |
| Not smelly | 5 | 2 | 3 |
| Spotted | 3 | 1 | 2 |
| Not spotted | 5 | 2 | 3 |
| Smooth | 4 | 1 | 3 |
| Not smooth | 4 | 2 | 2 |
| All | 8 | 3 | 5 |

IsHeavy, IsSmelly, and IsSpotted all induce the identical 3-sample (1 : 2) / 5-sample (2 : 3) split, so they must share the same information gain; only IsSmooth splits differently, which is why the hint says you can pick the root without computing all four gains.

First, the information-gain calculation for the IsHeavy attribute.

For IsHeavy (overall: 5 poisonous, 3 not poisonous):

- Heavy (3 samples): 1 not poisonous, 2 poisonous
- Not heavy (5 samples): 2 not poisonous, 3 poisonous

The calculation is as follows:

$$
\begin{aligned}
Gain(IsHeavy) &= Entropy(IsPoisonous) - \frac{5}{8}\,Entropy(\text{not heavy}) - \frac{3}{8}\,Entropy(\text{heavy}) \\
&= 0.954434002924965 - \frac{5}{8}\left[-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right] - \frac{3}{8}\left[-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right] \\
&= 0.0032289436203635224
\end{aligned}
$$

Next, compute IsSmooth.

For IsSmooth (overall: 5 poisonous, 3 not poisonous):

- Smooth (C, D, F, G): 1 not poisonous, 3 poisonous
- Not smooth (A, B, E, H): 2 not poisonous, 2 poisonous

This gives:

$$Gain(IsSmooth) = 0.048794940695398636$$

$$Gain(IsSmelly) = 0.0032289436203635224$$

$$Gain(IsSpotted) = 0.0032289436203635224$$
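All four gains can be checked programmatically. This is a sketch rather than any reference implementation: the row tuples re-encode the data table, and the helper names `entropy` and `info_gain` are my own.

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(rows, attr):
    """Information gain of splitting on column `attr`; class label is the last column."""
    labels = [r[-1] for r in rows]
    g = entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [r[-1] for r in rows if r[attr] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

# (IsHeavy, IsSmelly, IsSpotted, IsSmooth, IsPoisonous) for mushrooms A-H
rows = [(0,0,0,0,0), (0,0,1,0,0), (1,1,0,1,0), (1,0,0,1,1),
        (0,1,1,0,1), (0,0,1,1,1), (0,0,0,1,1), (1,1,0,0,1)]

for i, name in enumerate(["IsHeavy", "IsSmelly", "IsSpotted", "IsSmooth"]):
    print(f"Gain({name}) = {info_gain(rows, i):.6f}")  # IsSmooth is largest, ≈ 0.0488
```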

Since IsSmooth has the largest information gain, choose IsSmooth as the root.

2.3 What is the information gain?

The information gain of IsSmooth is 0.048794940695398636.

2.4 Build a decision tree to classify mushrooms as poisonous or not

First split into two branches on IsSmooth:

- IsSmooth = 0 (not smooth): {A, B, E, H}
- IsSmooth = 1 (smooth): {C, D, F, G}

Then split each branch further. For the subset {A, B, E, H}:

| Attribute value | Samples | Not poisonous | Poisonous |
|-----------------|---------|---------------|-----------|
| Heavy | 1 | 0 | 1 |
| Not heavy | 3 | 2 | 1 |
| Smelly | 2 | 0 | 2 |
| Not smelly | 2 | 2 | 0 |
| Spotted | 2 | 1 | 1 |
| Not spotted | 2 | 1 | 1 |
| All | 4 | 2 | 2 |

$$Gain(IsHeavy) = 0.31127812445913283$$

$$Gain(IsSmelly) = 1$$

$$Gain(IsSpotted) = 0$$
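These subset gains follow from the same formula applied only to the four not-smooth rows. A small self-contained check (my own encoding of rows A, B, E, H, reusing the helper names from before):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(rows, attr):
    """Information gain of splitting on column `attr`; class label is the last column."""
    labels = [r[-1] for r in rows]
    g = entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [r[-1] for r in rows if r[attr] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

# Not-smooth branch {A, B, E, H}: (IsHeavy, IsSmelly, IsSpotted, IsPoisonous)
abeh = [(0,0,0,0), (0,0,1,0), (0,1,1,1), (1,1,0,1)]
for i, name in enumerate(["IsHeavy", "IsSmelly", "IsSpotted"]):
    print(f"Gain({name}) = {info_gain(abeh, i):.4f}")  # IsSmelly splits perfectly
```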

So choose IsSmelly for this branch. IsSmelly also separates the smooth branch {C, D, F, G} perfectly: C (not poisonous) is the only smelly mushroom there, while D, F, and G (poisonous) are all not smelly. The resulting tree is:

- IsSmooth = 0 (not smooth) → {A, B, E, H}, split on IsSmelly:
  - IsSmelly = 0: {A, B} → not poisonous
  - IsSmelly = 1: {E, H} → poisonous
- IsSmooth = 1 (smooth) → {C, D, F, G}, split on IsSmelly:
  - IsSmelly = 0: {D, F, G} → poisonous
  - IsSmelly = 1: {C} → not poisonous

2.5 Classify mushrooms U, V, and W

Reading each unknown mushroom off the tree:

- U (IsSmooth = 1, IsSmelly = 1) → not poisonous
- V (IsSmooth = 1, IsSmelly = 1) → not poisonous
- W (IsSmooth = 0, IsSmelly = 1) → poisonous
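As a check on part (e), the finished tree (root IsSmooth, then IsSmelly on each branch) can be written out directly. The function name `classify` and the label strings are my own:

```python
def classify(is_smelly, is_smooth):
    """Decision tree from this exercise: split on IsSmooth, then IsSmelly."""
    if is_smooth == 0:
        # Not-smooth branch: {A, B} not poisonous, {E, H} poisonous
        return "poisonous" if is_smelly == 1 else "not poisonous"
    # Smooth branch: {D, F, G} poisonous, {C} not poisonous
    return "not poisonous" if is_smelly == 1 else "poisonous"

# Unknown mushrooms: (name, IsSmelly, IsSmooth) from the data table
for name, smelly, smooth in [("U", 1, 1), ("V", 1, 1), ("W", 1, 0)]:
    print(name, classify(smelly, smooth))
```

The same function also reproduces the training labels for all of A through H, so the tree is consistent with the data.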


Reposted from blog.csdn.net/wujing1_1/article/details/125091951