Watermelon Book Chapter 4 - Decision Tree Models

This part is a summary based on the Watermelon Book and Zou Bo's video lectures.

Part I: Information theory and the construction of entropy

Main points:

  • entropy
  • Joint entropy and conditional entropy
  • Mutual information (information gain)

My understanding: entropy is essentially a measure of uncertainty.

Entropy construction:

  1. The basic idea: the less likely an event is, the more uncertainty (information) its occurrence carries.

"The sun rises in the east and sets in the west" carries no uncertainty, while "an earthquake happens" still does. Our goal is to construct a measure of an event's uncertainty, one that is smallest for the events most likely to occur.

  1. Construction: first, the measure should turn products of probabilities into sums, which calls for a logarithm (classically base 2, though the choice of base is irrelevant when the model is only used to find extrema); second, the smaller the probability \(\Rightarrow\) the more uncertain the information \(\Rightarrow\) add a negative sign.

That is, for an event with probability \(p\), the measure of its uncertainty is \(-\ln(p)\).

Entropy: the combined uncertainty over all random outcomes, i.e. the expectation of the measure above:
\[ -\sum_{i=1}^{N} p(i) \cdot \ln p(i) \]
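To make the definition concrete, here is a minimal Python sketch (my own illustration; the function name `entropy` and the example distributions are not from the text) computing entropy as the expectation of \(-\ln p\):

```python
import math

def entropy(probs):
    """Entropy as the expected value of -ln(p); terms with p == 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A certain event carries no uncertainty; a fair coin carries the most among 2 outcomes.
print(entropy([1.0]))        # 0.0
print(entropy([0.5, 0.5]))   # ln 2 ≈ 0.693
print(entropy([0.9, 0.1]))   # ≈ 0.325, less uncertain than the fair coin
```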

Conditional entropy \(H(Y|X)\)

The conditional entropy is defined as: given that X has occurred, the "new" uncertainty (entropy) of Y.

From this definition we get: \(H(X, Y)-H(X)\)

[The total uncertainty contained in (X, Y)] minus [the uncertainty contained in X] equals [the extra uncertainty Y adds on top of X].

Drawing a Venn diagram makes this easier to understand:

Derivation:
\[
\begin{aligned}
H(X, Y)-H(X) ={}& -\sum_{x, y} p(x, y) \log p(x, y)+\sum_{x} p(x) \log p(x) \\
={}& -\sum_{x, y} p(x, y) \log p(x, y)+\sum_{x}\left(\sum_{y} p(x, y)\right) \log p(x) \\
={}& -\sum_{x, y} p(x, y) \log p(x, y)+\sum_{x, y} p(x, y) \log p(x) \\
={}& -\sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)} \\
={}& -\sum_{x, y} p(x, y) \log p(y | x)
\end{aligned}
\]
The last line is not in its friendliest form: a joint probability sits in front of a conditional probability. What happens if we rewrite the joint probability as a conditional one?
\[
\begin{aligned}
H(X, Y)-H(X) ={}& -\sum_{x, y} p(x, y) \log p(y | x) \\
={}& -\sum_{x} \sum_{y} p(x, y) \log p(y | x) \\
={}& -\sum_{x} \sum_{y} p(x) p(y | x) \log p(y | x) \\
={}& \sum_{x} p(x)\left(-\sum_{y} p(y | x) \log p(y | x)\right) \\
={}& \sum_{x} p(x) H(Y | X=x)
\end{aligned}
\]
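A small numeric check of the identity just derived, on a made-up joint distribution (all numbers are illustrative assumptions of mine):

```python
import math

# Joint distribution p(x, y); the values sum to 1 and are purely illustrative.
p_xy = {('x1', 'y1'): 0.2, ('x1', 'y2'): 0.3,
        ('x2', 'y1'): 0.4, ('x2', 'y2'): 0.1}

def H(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Marginal p(x)
p_x = {}
for (x, _), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

# Left-hand side: H(X, Y) - H(X)
lhs = H(p_xy.values()) - H(p_x.values())

# Right-hand side: sum over x of p(x) * H(Y | X = x)
rhs = 0.0
for x, px in p_x.items():
    cond = [p / px for (xx, _), p in p_xy.items() if xx == x]
    rhs += px * H(cond)

print(lhs, rhs)  # both ≈ 0.587, i.e. H(Y|X)
```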
The formula above is easy to understand once we place it in the context of a decision tree. Two kinds of information come into play: first, the true class labels; second, a feature \(A=(a_1, a_2, \dots, a_V)\). (The quantity is called Information Entropy in English.) \(D\) denotes the data set; according to the label (target), \(D\) can be divided into \(K\) classes, and according to a feature \(A\), \(D\) can be divided into \(V\) subsets.

Always remember: the entropy formula measures the "purity or uncertainty" of the true classes:
\[ Ent(D) = -\sum_{k=1}^{K} p_k \ln(p_k) \]

Information gain: the reduction in entropy before vs. after a feature is added.

By definition: \[ Ent(D)-Ent(D|A) \]
\[
\begin{aligned}
Ent(D)-Ent(D|A) ={}& Ent(D)-\sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)
\end{aligned}
\]
The equality is easy to accept conceptually: after the feature A is added, we compute the entropy of each child node, and since the child nodes receive different numbers of samples, their entropies are weighted accordingly. But a mathematical justification is still needed.

\[ \begin{aligned} Ent(D|A) ={} &-\sum_{v, k} p\left(D_{k}, A_{v}\right) \log p\left(D_{k} | A_{v}\right) \\ ={} & -\sum_{v, k} p\left(A_{v}\right) p\left(D_{k} | A_{v}\right) \log p\left(D_{k} | A_{v}\right)\\ ={} & -\sum_{v=1}^{V} \sum_{k=1}^{K} p\left(A_{v}\right) p\left(D_{k} | A_{v}\right) \log p\left(D_{k} | A_{v}\right)\\ ={} & -\sum_{v=1}^{V} \frac{\left|D_{v}\right|}{|D |} \sum_{k=1}^{K} \frac{\left|D_{v k}\right|}{\left|D_{v}\right|} \log \frac{\left|D_{v k}\right|}{\left|D_{v}\right|} =\sum_{v=1}^V\frac{|D^v|}{|D|}Ent(D^v) \end{aligned} \]

The formula here is a bit more involved. In the Watermelon Book, \(D^v\) denotes a subset obtained by splitting the data set with feature \(A\), but the entropy of each subset is still computed from the \(K\) class labels. So the set \(D_{vk}\) is obtained by first splitting by \(A\), and then splitting each \(D_v\) by the \(K\) classes.
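A sketch of \(Ent(D)\), the weighted conditional entropy and the resulting gain on a toy data set (the labels and feature values below are invented for illustration, not the Watermelon Book's data):

```python
import math
from collections import Counter

def ent(labels):
    """Ent(D) = -sum_k p_k ln(p_k), computed from a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def gain(labels, feature):
    """Gain(D, A) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        subset = [y for y, a in zip(labels, feature) if a == v]  # D^v, split by class inside ent()
        cond += (len(subset) / n) * ent(subset)                  # weighted child entropy
    return ent(labels) - cond

# Toy example: 'good'/'bad' melons and one discrete feature A.
y = ['good', 'good', 'bad', 'bad', 'good', 'bad']
A = ['dark', 'dark', 'light', 'light', 'dark', 'dark']
print(ent(y), gain(y, A))   # Ent(D) ≈ 0.693, Gain(D, A) ≈ 0.318
```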

Part II: Decision tree generation strategy

Decision trees are built recursively, top-down. The basic idea is to use information entropy as the measure: choose the attribute along which entropy drops fastest (i.e. the attribute with the largest information gain) as the root node, then keep splitting on the same criterion, as in the sketch below. Termination condition: the entropy at every leaf node is 0, i.e. the instances in each leaf all belong to the same class.
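A minimal, ID3-style sketch of this recursive top-down procedure. The data representation (a list of feature dicts plus a label list) and the helper names are my own choices, and the `ent`/`gain` helpers from the previous sketch are repeated so the block runs on its own; this illustrates the strategy, it is not the book's pseudocode:

```python
import math
from collections import Counter

def ent(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def gain(rows, labels, feat):
    n = len(labels)
    cond = 0.0
    for v in set(r[feat] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[feat] == v]
        cond += len(sub) / n * ent(sub)
    return ent(labels) - cond

def build(rows, labels, feats):
    """Grow the tree top-down; stop when a node is pure or no features remain."""
    if len(set(labels)) == 1:                     # entropy 0: all samples share one class
        return labels[0]
    if not feats:
        return Counter(labels).most_common(1)[0][0]
    best = max(feats, key=lambda f: gain(rows, labels, f))   # largest information gain
    node = {best: {}}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        node[best][v] = build([rows[i] for i in idx],
                              [labels[i] for i in idx],
                              [f for f in feats if f != best])
    return node

rows = [{'color': 'dark', 'sound': 'dull'}, {'color': 'dark', 'sound': 'crisp'},
        {'color': 'light', 'sound': 'dull'}, {'color': 'light', 'sound': 'crisp'}]
labels = ['good', 'good', 'bad', 'bad']
print(build(rows, labels, ['color', 'sound']))    # splits on 'color' first
```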

Flow-chart analysis:

So far we have used information entropy as the measure and selected the splitting feature by information gain; in fact there are other criteria:

Three common algorithms:

  • ID3: information gain (covered above)
  • C4.5: information gain ratio (this section)
  • CART: Gini index (this section)

Information gain ratio

Information gain has a drawback. Suppose watermelons are divided into two classes, good melons and bad melons, with 50 samples of each, and one attribute takes 100 distinct values (one per sample). If this attribute is used as the basis for the split, the entropy of every child node after the split is 0 [because each melon gets its own node. Recall how the gain is computed: first divide by \(V\), then by \(K\); here, once the division by \(V\) is done, there is nothing left for \(K\) to divide, since each node contains only one class]. Information gain therefore favors attributes with many values, which motivates the gain ratio.

The intrinsic value of attribute \(a\):
\[ \mathrm{IV}(a)=-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log _{2} \frac{\left|D^{v}\right|}{|D|} \]
The more values \(a\) takes \(\Rightarrow\) the more evenly the \(D^v\) are spread \(\Rightarrow\) the larger this entropy-like value \(\mathrm{IV}(a)\).

Information gain ratio:
\[ \text { Gain ratio }(D, a)=\frac{\operatorname{Gain}(D, a)}{\operatorname{IV}(a)} \]
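A sketch of \(\mathrm{IV}(a)\) and the gain ratio on the ID-like attribute described above (toy data and function names are my own; note that, following the text's formulas, \(Ent\) uses \(\ln\) while \(\mathrm{IV}\) uses \(\log_2\)):

```python
import math
from collections import Counter

def ent(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def iv(feature):
    """Intrinsic value IV(a): entropy of the split induced by the feature values themselves."""
    n = len(feature)
    return -sum((c / n) * math.log2(c / n) for c in Counter(feature).values())

def gain_ratio(labels, feature):
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        sub = [y for y, a in zip(labels, feature) if a == v]
        cond += len(sub) / n * ent(sub)
    return (ent(labels) - cond) / iv(feature)

# An ID-like feature gets the maximal gain but also a large IV, so its gain ratio is tamed.
y   = ['good', 'good', 'bad', 'bad']
ids = ['1', '2', '3', '4']                # unique value per sample
col = ['dark', 'dark', 'light', 'light']
print(gain_ratio(y, ids), gain_ratio(y, col))   # ≈ 0.347 vs ≈ 0.693
```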

Gini index

Definition of the Gini value: the probability that two randomly drawn samples belong to different classes.
\[ \begin{aligned} \operatorname{Gini}(D) &=\sum_{k=1}^{| \mathcal{Y |}} \sum_{k^{\prime} \neq k} p_{k} p_{k^{\prime}} \\ &=1-\sum_{k=1}^{|\mathcal{Y}|} p_{k}^{2} \end{aligned} \]
Gini index [the analogue of the information gain]:
\[ \text { Gini } \operatorname{index}(D, a)=\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Gini}\left(D^{v}\right) \]
The optimal splitting attribute:
\[ a_{*}=\underset{a \in A}{\arg \min } \operatorname{Gini} \operatorname{index}(D, a) \]
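A sketch of the Gini value, the Gini index and the arg-min attribute choice (toy data invented for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: probability that two random samples differ in class."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(labels, feature):
    """Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v)."""
    n = len(labels)
    total = 0.0
    for v in set(feature):
        sub = [y for y, a in zip(labels, feature) if a == v]
        total += len(sub) / n * gini(sub)
    return total

# Choose the attribute with the smallest Gini index (CART's criterion).
y = ['good', 'good', 'bad', 'bad', 'good', 'bad']
features = {'color': ['dark', 'dark', 'light', 'light', 'dark', 'dark'],
            'sound': ['dull', 'crisp', 'dull', 'crisp', 'dull', 'crisp']}
best = min(features, key=lambda a: gini_index(y, features[a]))
print(best, {a: round(gini_index(y, f), 3) for a, f in features.items()})
# 'color' wins: index 0.25 vs ≈ 0.444 for 'sound'
```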

A deeper observation: the Gini value is the first-order Taylor approximation of information entropy.
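This follows from expanding \(-\ln p_k\) to first order around \(p_k = 1\):
\[ -\ln p_k \approx 1-p_k \quad\Rightarrow\quad Ent(D)=-\sum_{k=1}^{K} p_{k} \ln p_{k} \approx \sum_{k=1}^{K} p_{k}\left(1-p_{k}\right)=1-\sum_{k=1}^{K} p_{k}^{2}=\operatorname{Gini}(D) \]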

Part III: Algorithm tuning

Handling overfitting:

  1. Pruning

  2. Random forests


Pruning

The pruning techniques described in the Watermelon Book and in Zou Bo's lectures are different.

⭕ [Zou Bo's approach]:

  • Evaluating a decision tree

    The entropy of a pure node is \(H_p=0\), the minimum.

    The entropy of a uniform node is \(H_u=\ln k\), the maximum.

    Sum the entropies of all leaf nodes: the smaller the sum, the finer the classification of the samples.

    Since each node holds a different number of samples, the evaluation function uses a sample-weighted sum of entropies.

    Evaluation function:
    \[ C(T)=\sum_{t \in leaf} N_{t} \cdot H(t) \]
    This is taken as the loss function.

  • Regularization: use the number of leaves as the measure of model complexity.

    Loss function:
    \[ C_{\alpha}=C(T)+\alpha|T_{leaf}| \]

    Goal: find the value of \(\alpha\) at which pruning a subtree \(R\) down to a single leaf node \(r\) leaves the loss unchanged (this \(\alpha\) is the pruning coefficient; a small numeric sketch follows this list). Setting \(C_\alpha(r)=C_\alpha(R)\) and solving gives:
    \[ \alpha = \frac{C(r)-C(R)}{|R_{leaf}|-1} \]
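A small numeric sketch of the pruning coefficient (every number below is invented purely for illustration): a subtree \(R\) with three leaves versus its root collapsed into a single leaf \(r\).

```python
def C(leaves):
    """C(T) = sum over leaves of N_t * H(t), given (sample_count, entropy) pairs."""
    return sum(n * h for n, h in leaves)

# Subtree R: 3 leaves with (sample count, entropy); illustrative numbers only.
R_leaves = [(10, 0.0), (6, 0.2), (4, 0.5)]
C_R = C(R_leaves)                        # loss of the subtree

# Collapse R into a single leaf r: 20 samples, with the pooled-class entropy
# assumed to be 0.6 for the sake of the example.
C_r = C([(20, 0.6)])

alpha = (C_r - C_R) / (len(R_leaves) - 1)   # pruning coefficient of this node
print(C_R, C_r, alpha)                      # 3.2, 12.0, 4.4
```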

Pruning algorithm:

  • For a given decision tree \(T\):
    • compute the pruning coefficient of every internal node;
    • find the node with the smallest pruning coefficient and prune it, obtaining the decision tree \(T_k\);
    • repeat the steps above until the decision tree has only one node;
    • this yields a sequence of decision trees \(T_0, T_1, T_2, \dots, T_k\);
    • use a validation set to select the optimal subtree.

Note: when the validation set is used as the criterion for choosing the optimal subtree, use the earlier evaluation function without the regularization term: \(C(T)=\sum_{t \in leaf} N_{t} \cdot H(t)\).

⭕ [Watermelon Book approach]:

The Watermelon Book's pruning is driven mainly by a validation set, and it distinguishes pre-pruning from post-pruning.

  • Pre-pruning: decide during tree generation, before each split, whether the split is worth making at all.
  • Post-pruning: prune after the decision tree has been fully generated.

Note: pre-pruning is based on a greedy strategy: a split is abandoned whenever doing so improves validation accuracy, so it carries a risk of underfitting. Post-pruning examines every non-leaf node one by one, bottom-up; it costs much more time, but generalization improves.

Handling continuous and missing values

Continuous values:

Given a sample set \(D\) and a continuous attribute \(a\), suppose \(a\) takes \(n\) distinct values on \(D\). Sort these values in increasing order and denote them \(\{a_1, a_2, \dots, a_n\}\). A split point \(t\) partitions \(D\) into the subsets \(D_t^-\) and \(D_t^+\), where \(D_t^-\) contains the samples whose value of \(a\) is no greater than \(t\) and \(D_t^+\) contains those whose value is greater than \(t\). We then examine the candidate split-point set with \(n-1\) elements:
\[ T_{a}=\left\{\frac{a^{i}+a^{i+1}}{2} | 1 \leqslant i \leqslant n-1\right\} \]
To recap: a node has \(n-1\) candidate split points.

The gain formula is accordingly modified to:
\[ \begin{aligned} \operatorname{Gain}(D, a) &=\max _{t \in T_{a}} \operatorname{Gain}(D, a, t) \\ &=\max _{t \in T_{a}} \operatorname{Ent}(D)-\sum_{\lambda \in\{-,+\}} \frac{\left|D_{t}^{\lambda}\right|}{|D|} \operatorname{Ent}\left(D_{t}^{\lambda}\right) \end{aligned} \]
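A sketch of the candidate-split search for a continuous attribute (the values, labels and the function name `best_threshold` are illustrative assumptions):

```python
import math
from collections import Counter

def ent(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try the n-1 midpoints of the sorted distinct values; return (best gain, best t)."""
    xs = sorted(set(values))
    base = ent(labels)
    best = (-1.0, None)
    for a, b in zip(xs, xs[1:]):                              # candidate split points T_a
        t = (a + b) / 2
        left  = [y for x, y in zip(values, labels) if x <= t]  # D_t^-
        right = [y for x, y in zip(values, labels) if x > t]   # D_t^+
        g = base - (len(left) / len(labels)) * ent(left) \
                 - (len(right) / len(labels)) * ent(right)
        best = max(best, (g, t))
    return best

density = [0.243, 0.245, 0.343, 0.360, 0.608, 0.634, 0.697, 0.774]
label   = ['bad', 'bad', 'bad', 'bad', 'good', 'good', 'good', 'good']
print(best_threshold(density, label))   # best gain ln 2 ≈ 0.693 at t = 0.484
```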
Missing values:

Two questions need answering: (1) how do we choose the splitting attribute when some attribute values are missing? (2) given a splitting attribute, how do we assign a sample whose value on that attribute is missing?

The proportion of samples with no missing value:
\[ \rho=\frac{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in D} w_{\boldsymbol{x}}} \]
Among the samples with no missing value, the proportion of class \(k\) and the proportion taking the \(v\)-th value of the attribute [same form as the earlier formulas]:
\[ \begin{aligned} \tilde{p}_{k} &=\frac{\sum_{\boldsymbol{x} \in \tilde{D}_{k}} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}} \quad(1 \leqslant k \leqslant|\mathcal{Y}|) \\ \tilde{r}_{v} &=\frac{\sum_{\boldsymbol{x} \in \tilde{D}^{v}} w_{\boldsymbol{x}}}{\sum_{\boldsymbol{x} \in \tilde{D}} w_{\boldsymbol{x}}} \quad(1 \leqslant v \leqslant V) \end{aligned} \]
\(w_x\) is the weight of each sample \(x\); in the root node every sample has weight 1.

The gain formula generalizes to:
\[ \begin{aligned} \operatorname{Gain}(D, a) &=\rho \times \operatorname{Gain}(\tilde{D}, a) \\ &=\rho \times\left(\operatorname{Ent}(\tilde{D})-\sum_{v=1}^{V} \tilde{r}_{v} \operatorname{Ent}\left(\tilde{D}^{v}\right)\right) \end{aligned} \]
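A sketch of \(\rho\), \(\tilde{r}_v\) and the generalized gain with missing values marked as `None` (weights, data and helper names are my own illustration; it only addresses question (1), attribute selection, not question (2), sample assignment):

```python
import math

def weighted_ent(pairs):
    """Entropy from (weight, label) pairs: -sum_k p_k ln p_k with weighted frequencies."""
    total = sum(w for w, _ in pairs)
    probs = {}
    for w, y in pairs:
        probs[y] = probs.get(y, 0.0) + w / total
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def gain_with_missing(weights, values, labels):
    """rho * (Ent(D~) - sum_v r~_v * Ent(D~^v)); None means the value of a is missing."""
    D  = list(zip(weights, values, labels))
    Dt = [(w, v, y) for w, v, y in D if v is not None]       # D~: no missing value on a
    rho = sum(w for w, _, _ in Dt) / sum(w for w, _, _ in D)
    ent_Dt = weighted_ent([(w, y) for w, _, y in Dt])
    cond = 0.0
    for v in set(v for _, v, _ in Dt):
        Dv = [(w, y) for w, vv, y in Dt if vv == v]           # D~^v
        r_v = sum(w for w, _ in Dv) / sum(w for w, _, _ in Dt)
        cond += r_v * weighted_ent(Dv)
    return rho * (ent_Dt - cond)

w = [1, 1, 1, 1, 1, 1]
a = ['dark', None, 'light', 'light', 'dark', None]
y = ['good', 'good', 'bad', 'bad', 'good', 'bad']
print(gain_with_missing(w, a, y))   # ≈ 0.462 = (4/6) * ln 2
```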

Multivariate decision trees

⭕ The same feature may be tested more than once along a path.

⭕ In general the classification boundary is axis-parallel, but a linear classifier can also be used as the decision criterion at each node, which produces "oblique" classification boundaries.

Trick: oblique split boundaries can be used to simplify the decision tree model.

Origin www.cnblogs.com/wangjs-jacky/p/11808278.html