Decision Tree Algorithm in Plain Language (Part 2): A Worked Example

Previously in this series:
Decision Tree Algorithm in Plain Language (Part 1): Basic Concepts

I. Overview

In the previous article we introduced some basic concepts of decision trees, covering the basics of trees and information entropy. This time we will walk through an example to show how a decision tree is actually built, and what role information entropy plays in the process.

One more thing worth saying: over the course of its refinement, the decision tree has three classic algorithms, ID3, C4.5, and CART. Each later algorithm improves on shortcomings of the one before it. This time we will talk about the ID3 algorithm, and later on we will discuss its shortcomings and how they were addressed.

II. An example

As we all know, whether or not to stay in bed in the morning is a very profound question. It depends on many variables, so let's take a look at Xiao Ming's stay-in-bed habits.

| Season | Past 8:00? | Wind conditions | Stay in bed? |
| ------ | ---------- | --------------- | ------------ |
| spring | no  | breeze  | yes |
| winter | no  | no wind | yes |
| autumn | yes | breeze  | yes |
| winter | no  | no wind | yes |
| summer | no  | breeze  | yes |
| winter | yes | breeze  | yes |
| winter | no  | gale    | yes |
| winter | no  | no wind | yes |
| spring | yes | no wind | no  |
| summer | yes | gale    | no  |
| summer | no  | gale    | no  |
| autumn | yes | breeze  | no  |

OK, we have randomly sampled 12 days of Xiao Ming's stay-in-bed record from one year. Now we can build a decision tree from these data.

We can see that three attributes affect the final outcome: the season, whether the time is past 8:00, and the wind conditions.

How to choose a branching attribute

To build a decision tree, we first need to find its root. As we said in the previous article, we need to calculate the information entropy of each attribute.

Before calculating the entropy of each attribute, we use the historical data to calculate the entropy of staying in bed without conditioning on any attribute. From the table we know that, out of the 12 days, Xiao Ming stayed in bed on 8 days and did not stay in bed on 4 days. So:

p(stay in bed) = 8 / 12
p(not stay in bed) = 4 / 12

Information entropy

H(stay in bed) = -(p(stay in bed) * log2(p(stay in bed)) + p(not stay in bed) * log2(p(not stay in bed))) ≈ 0.918
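As a quick sanity check on that number, here is a minimal Python sketch (my own, not from the original article; the variable and function names are just illustrative) that computes the same entropy directly from the 12 labels in the table:

```python
from math import log2

# The "stay in bed?" column from the table: 8 yes, 4 no.
labels = ["yes"] * 8 + ["no"] * 4

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete labels."""
    n = len(values)
    return -sum((values.count(v) / n) * log2(values.count(v) / n) for v in set(values))

print(round(entropy(labels), 3))  # 0.918
```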

Next we can calculate the entropy of each attribute. There are three attributes here: season, whether it is past 8:00 in the morning, and wind conditions. For each attribute we need to calculate, for every value it can take, the probability of staying in bed and of not staying in bed, along with the corresponding entropy.

That is a bit of a mouthful, so let's take the wind-conditions attribute first and work out its entropy. Wind conditions can take three values, so we calculate the entropy for each of the three cases.

Entropy of wind conditions

When the wind condition is breeze, the probability of staying in bed is 4/5 and of not staying in bed is 1/5, so its entropy is entropy(breeze) = -(4/5 * log2(4/5) + 1/5 * log2(1/5)) ≈ 0.722.

When the wind condition is no wind, the same calculation gives an entropy of entropy(no wind) ≈ 0.811.

When the wind condition is gale, the entropy is entropy(gale) ≈ 0.918.

Finally, the entropy of the wind-conditions attribute is the entropy of each of its values weighted by the frequency of that value:

H(wind conditions) = 5/12 * entropy(breeze) + 4/12 * entropy(no wind) + 3/12 * entropy(gale) ≈ 0.801

Remember, before introducing any attribute the entropy of staying in bed was H(stay in bed) ≈ 0.918. After bringing in the wind-conditions attribute, the entropy of staying in bed drops to 0.801. This means the information has become clearer and our classification is getting better.
Reducing information entropy
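To make the weighted sum above concrete, here is a short illustrative Python sketch (again my own; the record list simply mirrors the wind and stay-in-bed columns of the table) that reproduces the 0.801 figure:

```python
from math import log2

# (wind condition, stay in bed?) pairs copied from the table above.
records = [
    ("breeze", "yes"), ("no wind", "yes"), ("breeze", "yes"), ("no wind", "yes"),
    ("breeze", "yes"), ("breeze", "yes"), ("gale", "yes"), ("no wind", "yes"),
    ("no wind", "no"), ("gale", "no"), ("gale", "no"), ("breeze", "no"),
]

def entropy(values):
    n = len(values)
    return -sum((values.count(v) / n) * log2(values.count(v) / n) for v in set(values))

def conditional_entropy(pairs):
    """Entropy of the label after splitting on the attribute, weighted by value frequency."""
    n = len(pairs)
    total = 0.0
    for value in set(attr for attr, _ in pairs):
        subset = [label for attr, label in pairs if attr == value]
        total += len(subset) / n * entropy(subset)
    return total

print(round(conditional_entropy(records), 3))  # 0.801
```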

With the same kind of calculation, we can work out the information entropy of the other two attributes:

H(season) = 0.56
H(past 8:00 or not) = 0.748

From these entropies we can calculate the information gain of each attribute. Yes, here comes yet another new term, information gain, but it is not hard to understand: information gain is simply the entropy from the previous step (here, the original entropy of staying in bed) minus the entropy of the chosen attribute, i.e.

information gain g(season) = H(stay in bed) - H(season) = 0.918 - 0.56 ≈ 0.36

In this way we can calculate the information gain of every attribute and then pick the one with the largest gain as the root node. In this example the attribute with the largest information gain is clearly season.
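Here is an illustrative Python sketch (mine, not the article's; the column order and helper names are assumptions made for the demo) that computes the information gain of all three attributes from the table and picks the root:

```python
from math import log2

# Each record: (season, past 8:00?, wind condition, stay in bed?), copied from the table.
data = [
    ("spring", "no",  "breeze",  "yes"),
    ("winter", "no",  "no wind", "yes"),
    ("autumn", "yes", "breeze",  "yes"),
    ("winter", "no",  "no wind", "yes"),
    ("summer", "no",  "breeze",  "yes"),
    ("winter", "yes", "breeze",  "yes"),
    ("winter", "no",  "gale",    "yes"),
    ("winter", "no",  "no wind", "yes"),
    ("spring", "yes", "no wind", "no"),
    ("summer", "yes", "gale",    "no"),
    ("summer", "no",  "gale",    "no"),
    ("autumn", "yes", "breeze",  "no"),
]
attributes = {"season": 0, "past 8:00": 1, "wind": 2}

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n) for v in set(labels))

def info_gain(rows, col):
    """Entropy of the labels minus the weighted entropy after splitting on column col."""
    labels = [row[-1] for row in rows]
    n = len(rows)
    cond = 0.0
    for value in set(row[col] for row in rows):
        subset = [row[-1] for row in rows if row[col] == value]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

gains = {name: round(info_gain(data, col), 2) for name, col in attributes.items()}
print(gains)                      # season has the largest gain, about 0.36
print(max(gains, key=gains.get))  # 'season'
```

The printed gains line up with the hand calculation above: season wins, so it becomes the root.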

What do we do after choosing the root? Treat each child node as a new tree, choose among the remaining attributes, and repeat the steps above (see the sketch after the figure below).
(Figure: choosing attributes for the remaining branches)
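The "treat each child node as a new tree" idea can be written as a short recursive function. The sketch below continues directly from the previous one (it assumes data, attributes, and info_gain are already defined) and is only an illustrative ID3-style outline of my own, not the article's code:

```python
def build_tree(rows, remaining):
    """Recursively grow an ID3-style tree: dicts for internal nodes, label strings for leaves."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:             # pure node: every record agrees, stop here
        return labels[0]
    if not remaining:                     # no attributes left: fall back to a majority vote
        return max(set(labels), key=labels.count)
    # choose the remaining attribute with the largest information gain
    best = max(remaining, key=lambda name: info_gain(rows, attributes[name]))
    children = {}
    for value in set(row[attributes[best]] for row in rows):
        subset = [row for row in rows if row[attributes[best]] == value]
        children[value] = build_tree(subset, [a for a in remaining if a != best])
    return {best: children}

print(build_tree(data, list(attributes)))  # the root key of the printed dict is 'season'
```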

Once every attribute has been worked through, a complete tree has been built. In this example, the tree we end up with looks like this:
(Figure: the complete decision tree)

III. Overfitting and pruning

When building a decision tree, our goal is to build the shortest tree possible. Why the shortest? Because we want to avoid overfitting.

What is overfitting? The figure below shows a small classification example: on the left is a normal classification result, and on the right is an overfitted one.

(Figure: a normal classification boundary on the left, an overfitted one on the right)

In the real world our data is usually not perfect; a dataset may contain erroneous records or some rather odd ones, like the blue square in the figure above. Normally we tolerate a certain amount of error and aim for generality, i.e. fitting the majority of cases. When we overfit, the model chases correctness too hard, and its generality becomes very poor.

Pruning, that is, reducing the height of the tree, is how we deal with overfitting. Think about it: an overfitted decision tree can classify every attribute of every given sample exactly, but being that precise leads to the situation in the figure above and loses generality.

Pruning comes in two flavours: pre-pruning and post-pruning. Both are quite easy to understand: one works top-down and the other bottom-up. Let's look at each in turn.

Pre-pruning

You can think of pre-pruning as a top-down approach. During construction we set a maximum height; once the tree being built reaches that height, we stop growing it. That is the basic idea of pre-pruning.

Post-pruning

Post-pruning is a bottom-up approach. It first lets the decision tree be built to completion, and only then, starting from the bottom, decides which branches should be cut off.

Notice the biggest difference between pre-pruning and post-pruning: pre-pruning stops early, while post-pruning lets the tree finish building. So in terms of performance pre-pruning is faster, while post-pruning can be more accurate.
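For reference, the scikit-learn library mentioned later in this article exposes both styles on its DecisionTreeClassifier: constructor limits such as max_depth act as pre-pruning, while the ccp_alpha parameter performs cost-complexity post-pruning. A brief sketch on one of sklearn's built-in toy datasets (my own example, not from the article):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing once the tree reaches a fixed depth.
pre_pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Post-pruning: grow the tree fully, then cut branches back via cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```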

IV. Shortcomings of the ID3 decision tree and improvements

Shortcomings of the ID3 decision tree

Building a decision tree with the ID3 algorithm is certainly simple, but the algorithm has a problem: the trees ID3 builds favour attributes with many possible values. Why does this happen? Take the example above and suppose we add a date attribute. A year has 365 days, and if we really used this attribute as the split criterion, the stay-in-bed outcome for each individual day would look perfectly clear, because each day has very few samples. The information gain would therefore be very large, but we would run straight into the overfitting problem described above. Do you think such a tree would generalize to other cases? Obviously not!

The C4.5 decision tree

To address this problem with ID3, another algorithm, C4.5, was proposed for building decision trees.
C4.5 introduces a new concept: instead of using information gain to choose which attribute to branch on, we now use the gain ratio!

Gain_ratio(D, a) = Gain(D, a) / IV(a)

IV(a) = -Σ_v ( |D_v| / |D| ) * log2( |D_v| / |D| ), summing over every value v that attribute a can take

Note that the more possible values attribute a has (for example, a date attribute with 365 possible values), the larger IV(a) becomes.
And the larger IV(a) is, the smaller the gain ratio, which creates a new problem: the C4.5 decision tree, the opposite of ID3, is biased towards attributes with fewer possible values. That is rather annoying. Is there a fairer, more objective decision tree algorithm? There is!
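To see how IV(a) penalizes many-valued attributes, here is an illustrative Python sketch (mine; the columns are the ones from the table plus a made-up date-like id column) computing IV and the gain ratio:

```python
from math import log2

def entropy(values):
    """Shannon entropy (base 2) of any discrete column."""
    n = len(values)
    return -sum((values.count(v) / n) * log2(values.count(v) / n) for v in set(values))

# The "stay in bed?" labels and the season column, in the table's row order,
# plus a hypothetical date-like attribute that is unique for every record.
labels  = ["yes"] * 8 + ["no"] * 4
seasons = ["spring", "winter", "autumn", "winter", "summer", "winter",
           "winter", "winter", "spring", "summer", "summer", "autumn"]
day_id  = [str(i) for i in range(12)]

def gain_ratio(attr, labels):
    n = len(labels)
    cond = sum(
        attr.count(v) / n * entropy([l for a, l in zip(attr, labels) if a == v])
        for v in set(attr)
    )
    gain = entropy(labels) - cond
    iv = entropy(attr)  # IV(a) is the entropy of the attribute's own value distribution
    return gain / iv

# IV grows with the number of distinct values: about 1.89 for season (4 values)
# versus log2(12), about 3.58, for the unique ids, so any gain gets divided by more.
print(round(entropy(seasons), 2), round(entropy(day_id), 2))
print(round(gain_ratio(seasons, labels), 2))  # season's gain ratio, about 0.19
```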

The CART decision tree

As mentioned above, ID3 selects attributes by information gain and C4.5 by gain ratio. Both have their limitations, so CART was eventually proposed; the decision tree currently used in sklearn is also CART. The CART decision tree uses yet another attribute-selection criterion: the Gini index.
Gini(D) = 1 - Σ_k p_k^2

Gini_index(D, a) = Σ_v ( |D_v| / |D| ) * Gini(D_v)

In the first equation, p_k is the proportion of each class in D; these terms are computed and accumulated. The result actually behaves much like information entropy: when the distribution is more even, i.e. when the information is fuzzier, the Gini value is larger, and it is smaller otherwise. However, this calculation always splits two ways. For example, the season attribute above has four possible values (spring, summer, autumn, winter), so when spring is considered, the split evaluated is the two-way one (spring | (summer, autumn, winter)). The remaining steps are similar to ID3. Calculated this way, CART is better able to avoid the defects of ID3.
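Here is a small illustrative sketch (my own, not library code or the article's) of the Gini impurity and the spring-versus-rest binary split just described:

```python
# Gini impurity of the 12 "stay in bed?" labels, and the weighted Gini of CART's
# two-way split season == "spring" versus season != "spring".

labels  = ["yes"] * 8 + ["no"] * 4
seasons = ["spring", "winter", "autumn", "winter", "summer", "winter",
           "winter", "winter", "spring", "summer", "summer", "autumn"]

def gini(values):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(values)
    return 1.0 - sum((values.count(v) / n) ** 2 for v in set(values))

def gini_index_binary(attr, labels, value):
    """Weighted Gini impurity of the binary split attr == value vs. attr != value."""
    left  = [l for a, l in zip(attr, labels) if a == value]
    right = [l for a, l in zip(attr, labels) if a != value]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(round(gini(labels), 3))                                  # impurity before any split
print(round(gini_index_binary(seasons, labels, "spring"), 3))  # spring vs. (summer, autumn, winter)
```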

So a decision tree trained with the CART algorithm will differ a little from the ID3 tree we built above. In the next article we will use sklearn, which implements CART, to train a decision tree and show the result as a plot.


Recommended reading:
Scala Functional Programming Guide (Part 1): An Introduction to Functional Thinking
A Brief Look at the Actor Concurrency Model
The Evolution of Big Data Storage: From RAID to Hadoop HDFS
C, Java, Python: the stories behind these names!

Origin www.cnblogs.com/listenfwind/p/10199720.html