This post is part of week seven of my data-scientist study notes. The main reference is:
Geek Time, "Data Analysis in Practice: 45 Lectures" — the lectures on decision trees.
The figure above shows a typical decision tree. Building a decision tree involves two stages:
- Construction
- Pruning

During construction, we need to answer three questions:

- Which attribute/feature should be chosen as the root node?
- Which attributes/features should be chosen as the internal nodes?
- When do we stop and reach the target state, i.e., the leaf nodes that hold the final decision?
Two failure modes matter here:

- Overfitting: a concept you need to understand. The trained model looks "good" on the training data, yet turns out "rigid" in real applications, leading to misclassification.
- Underfitting: together with overfitting, these are the first and third cases in the figure below. An underfit model fails to fit even the training data well, and in real applications it likewise leads to misclassification.
One cause of overfitting is a training set with too few samples. If the tree selects too many attributes, the constructed decision tree can classify the training samples "perfectly", but in doing so it treats features specific to the training set as features of all data. Since those features do not necessarily hold in general, the tree makes errors when classifying real data; in other words, the model "generalizes" poorly.
Generalization is the classifier's ability to abstract its classification capability beyond the training set; it can also be understood as the ability to extend what was learned from one case to others. If we depend too heavily on the training set, the resulting tree will have a low error rate on that set but poor generalization ability, because the training set is only a sample of all the data and cannot reflect the characteristics of the whole.
Pruning: to ensure the model generalizes adequately, the decision tree needs to be pruned. Pruning comes in two forms:

- Pre-pruning: prune while the tree is being constructed, stopping a branch from growing further.
- Post-pruning: prune after the tree has been fully constructed, cutting back branches that contribute little.
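Both forms of pruning can be tried directly in scikit-learn. The sketch below is only illustrative — the dataset (iris) and every parameter value are assumptions chosen for demonstration, not taken from the lecture: `max_depth`/`min_samples_leaf` constrain growth up front (pre-pruning), while `ccp_alpha` applies minimal cost-complexity pruning after the full tree is grown (post-pruning).

```python
# Illustrative sketch of pre- vs post-pruning with scikit-learn.
# Dataset and parameter values are assumptions for demonstration.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: constrain growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                    random_state=0).fit(X, y)

# Post-pruning: grow the full tree, then cut weak branches via
# minimal cost-complexity pruning (ccp_alpha > 0).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02,
                                     random_state=0).fit(X, y)

# Unpruned tree, for comparison.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
print(full.get_depth(), pre_pruned.get_depth(), post_pruned.get_depth())
```

In practice the pruning strength (`max_depth`, `ccp_alpha`, etc.) is usually tuned on a validation set rather than fixed by hand.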
Looking at the dataset in the figure above, we need to decide which condition to use as the root node and which as internal nodes. Two metrics guide this choice:

- Purity: make the divergence of the target variable as small as possible.
- Information entropy (entropy): a measure of the uncertainty of information.
- Set 1: Entropy(t) = -(1/6)·log2(1/6) - (5/6)·log2(5/6) ≈ 0.65
- Set 2: Entropy(t) = -(3/6)·log2(3/6) - (3/6)·log2(3/6) = 1
The larger the information entropy, the lower the purity; entropy is at its maximum, and purity at its minimum, when the samples in a set are evenly mixed. The entropy formula above is the foundation of each classification algorithm below.
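As a quick check of the two entropy values above, here is a minimal sketch of the formula H(t) = -Σ p_i·log2(p_i) in Python, using the set contents stated above (one positive out of six, and an even three/three mix):

```python
import math

def entropy(labels):
    """H(t) = -sum_i p_i * log2(p_i) over the class frequencies in `labels`."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

set1 = ["+"] + ["-"] * 5      # Set 1: one positive, five negatives
set2 = ["+"] * 3 + ["-"] * 3  # Set 2: evenly mixed
print(round(entropy(set1), 2))  # 0.65
print(entropy(set2))            # 1.0
```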
Building on the definitions of purity and information entropy above, we will use the following three metrics:
- Information gain: the ID3 algorithm
- Information gain ratio: the C4.5 algorithm
- Gini index: the CART algorithm
Information gain: the ID3 algorithm

The steps of ID3 are:
- Compute the information entropy of the root node. (Note: the root node here is the outcome "plays basketball or not", because we do not yet know which attribute to place at the root — finding it is exactly what we are solving for.)
- For a candidate node (a column), compute the information entropy of each of its attribute values.
- Compute that node's normalized information entropy.
- Compute Gain(D, a).
- Following the steps above, iterate over all candidate nodes (all columns/attributes) to get each column's Gain; the node with the maximum Gain becomes the current node.
- Repeat the steps above to determine all the nodes.
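The steps above can be sketched as a single Gain(D, a) computation. The records below are hypothetical toy rows in the spirit of the lecture's play-basketball table, not its actual data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target):
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v)."""
    n = len(rows)
    base = entropy([r[target] for r in rows])          # entropy of the parent
    cond = 0.0
    for v in {r[attr] for r in rows}:                  # each value of attribute a
        subset = [r[target] for r in rows if r[attr] == v]
        cond += len(subset) / n * entropy(subset)      # normalized (weighted) entropy
    return base - cond

# Hypothetical toy records (not the lecture's actual table).
rows = [
    {"weather": "sunny",    "play": "-"},
    {"weather": "sunny",    "play": "-"},
    {"weather": "overcast", "play": "+"},
    {"weather": "rain",     "play": "+"},
]
print(information_gain(rows, "weather", "play"))  # 1.0
```

Running `information_gain` for every column and picking the maximum is exactly the node-selection loop described in the steps above.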
- Compute the information entropy of the root node (again, the outcome "plays basketball or not", since which attribute to place at the root is what we are solving for).
- For a candidate node (a column), compute the information entropy of each of its attribute values (take the first column, Weather, as an example — split on the Weather attribute).

This yields three leaf nodes D1, D2 and D3, corresponding to sunny, overcast and light rain. Using + for "goes to play basketball" and - for "does not play", the first record (sunny, does not play) can be written 1-, so D1, D2 and D3 can be recorded as:

D1(Weather = sunny) = {1-, 2-, 6+}
D2(Weather = overcast) = {3+, 7-}
- Compute that node's normalized information entropy.
- Compute Gain(D, a).
- Following the steps above, iterate over all candidate nodes (all columns) to get each column's Gain; the node with the maximum Gain becomes the current node.
- Next we split the first leaf node in the figure above, D1 = {1-, 2-, 3+, 4+}, further downward. Computing the information gain of each candidate attribute (Weather, Humidity, Wind) at this node, we get:
1. Compute the information entropy of the parent node, i.e., the node with Temperature = high:

Ent(D) = -(2/4·log2(2/4) + 2/4·log2(2/4)) = -(-1) = 1
2. If Weather is chosen as the next splitting attribute, the subset splits into "sunny" = {1-, 2-}, "overcast" = {3+} and "rain" = {4+} (note that we split only within the D1 subset above). The normalized information entropy for Weather is then:

2/4 × (-(2/2)·log2(2/2)) + 1/4 × (-(1/1)·log2(1/1)) + 1/4 × (-(1/1)·log2(1/1)) = 2/4 × 0 + 1/4 × 0 + 1/4 × 0 = 0

so Gain(D1, Weather) = 1 - 0 = 1.
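To double-check this arithmetic, here is a small sketch using the equivalent positive form H = Σ p·log2(1/p), which gives the same values as the formulas above:

```python
import math

def entropy(counts):
    # H = sum p * log2(1/p), over the nonzero class counts of a subset
    n = sum(counts)
    return sum((c / n) * math.log2(n / c) for c in counts if c)

# Parent node D1 = {1-, 2-, 3+, 4+}: two "+" and two "-"
parent = entropy([2, 2])
# Split by Weather: sunny {1-, 2-}, overcast {3+}, rain {4+} -- all pure
weighted = (2/4) * entropy([2]) + (1/4) * entropy([1]) + (1/4) * entropy([1])
print(parent, weighted, parent - weighted)  # 1.0 0.0 1.0
```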
3. We proceed in the same way: after deciding that Humidity is the child node under Temperature = high, we determine the child node for Temperature = medium, then for Temperature = low, and so on. We can see that both Humidity and Weather yield the maximum information gain at node D1; here we choose Humidity as the splitting attribute. Following the same computation steps, we obtain the complete decision tree, shown below:
The next post continues with the C4.5 and CART algorithms.