[Study Notes] [Machine Learning] 6. Decision Tree Algorithms, Part 1 (Entropy, Information Gain (Ratio), Gini Value (Index), CART Pruning, Feature Engineering / Feature Extraction, Jieba Word Segmentation, Regression Decision Trees)

  1. Video link
  2. Dataset download: none required

Learning objectives:

  • Master the decision tree construction process
  • Know the formula for information entropy and what it measures
  • Know what information gain, information gain ratio, and the Gini index are used for
  • Know the differences between the ID3, C4.5, and CART algorithms
  • Understand the purpose of CART pruning
  • Know the purpose of feature extraction
  • Apply DecisionTreeClassifier to build a decision tree classifier

1. Introduction to Decision Tree Algorithms

1.1 Introduction to Decision Tree Algorithms

Learning objectives

  • Know what a decision tree is

The idea behind decision trees is very simple. The conditional branch in programming is the if-else structure, and the earliest decision trees were classification methods that used exactly this kind of structure to split data.

A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node (a node with no children) represents a class label. In essence, it is a tree composed of a series of decision nodes.

insert image description here

Think about it: why does the woman in this example check age first?

In the case above she puts age at the top based on subjective, qualitative judgment. If we want to quantify this process, how should we do it?

This is where concepts from information theory come in:

  • Information entropy
  • Information gain

Summary

  • Definition of a decision tree:
    • A tree-shaped structure
    • In essence, a tree composed of a series of decision nodes
2. Decision Tree Classification Principles

Learning objectives

  • Know how to compute information entropy
  • Know how information gain is computed
  • Know how the information gain ratio is computed
  • Know how the Gini index is computed
  • Know the differences and connections among information gain, information gain ratio, and the Gini index

2.1 Entropy

2.1.1 Concept

In thermodynamics, entropy (denoted $S$) is one of the parameters characterizing the state of matter; physically it measures the degree of disorder of a system. The concept has also been borrowed by the social sciences to describe the state of human societies. It was introduced by the German physicist Clausius in 1865.

Simply put, in physics entropy is a measure of "disorder".

insert image description here

From the figure above we can see:

  • The more ordered a system is, the lower its entropy
  • The more disordered or spread out a system is, the higher its entropy

In 1948 Claude Shannon introduced the concept of information entropy.

Claude Elwood Shannon (April 30, 1916 to February 24, 2001) was an American mathematician, electrical engineer, and cryptographer, known as the father of information theory. He received his bachelor's degree from the University of Michigan and his PhD from MIT. In 1948 he published the landmark paper "A Mathematical Theory of Communication", which laid the foundation of modern information theory.

In information theory, information entropy is a measure of the purity of a sample set. It can be described from two angles: the completeness and the orderliness of the information.

1. In terms of completeness

When the degree of order of the system is fixed, the more concentrated the data, the lower the entropy; the more spread out the data, the higher the entropy.

2. In terms of orderliness

When the amount of data is fixed, the more ordered the system, the lower the entropy; the more disordered or spread out the system, the higher the entropy.


Assume the current sample set $D$ contains $n$ classes, and the proportion of class-$k$ samples is $p_k = \frac{|C^k|}{|D|}$ for $k = 1, 2, ..., n$, where $|D|$ is the total number of samples and $|C^k|$ is the number of samples belonging to class $k$.

The information entropy of the sample set $D$ is defined as

$$\begin{aligned} \mathrm{Ent}(D) & = -\sum_{k=1}^{n} \frac{|C^k|}{|D|} \log_2 \frac{|C^k|}{|D|} \\ & = -\sum_{k=1}^{n} p_k \log_2 p_k \\ & = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_n \log_2 p_n \end{aligned}$$

where:

  • $\log$ here is taken base 2 ($\lg$ would denote base 10).
  • $D$: the sample set.
  • $k$: the index of the $k$-th class in the sample set.
  • $n$: the total number of classes in the sample set; $|y|$ and $n$ are used interchangeably for this quantity.
  • $p_k$: the proportion of class-$k$ samples in the sample set, i.e. $p_k = \frac{|C^k|}{|D|}$, where $|C^k|$ is the number of class-$k$ samples.
  • $\mathrm{Ent}(D)$: the information entropy of the sample set $D$.

So the information entropy of the sample set $D$ can be written compactly as

$$\mathrm{Ent}(D) = -\sum_{k=1}^{n} p_k \log_2 p_k$$

where $\sum_{k=1}^{n}$ sums over all classes in the sample set and $\log_2 p_k$ is the base-2 logarithm of $p_k$.

From this formula we can see:

  • The smaller $\mathrm{Ent}(D)$, the higher the purity of $D$ (the higher the entropy, the more uncertain the set and the lower its purity).
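As a quick sanity check of the formula above, here is a minimal Python sketch of the entropy computation (the function name `entropy` and the example class counts are illustrative, not from the original notes):

```python
import math

def entropy(counts):
    """Information entropy (base 2) of a class distribution given as raw counts."""
    total = sum(counts)
    ent = 0.0
    for c in counts:
        if c == 0:
            continue  # by convention 0 * log2(0) = 0
        p = c / total
        ent -= p * math.log2(p)
    return ent

print(entropy([5, 5]))    # maximally impure 2-class set -> 1.0 bit
print(entropy([10, 0]))   # pure set -> 0.0
print(entropy([1] * 16))  # 16 equally likely outcomes -> 4.0 bits
```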

2.1.2 Case

Problem:

Suppose we did not watch the World Cup but want to know which team won. We can only ask yes/no questions such as "is team X the champion?", and we want to ask as few questions as possible. What strategy should we use?

Answer: binary search.

If there are 16 teams, numbered 1-16, first ask whether the champion is among teams 1~8. If so, ask whether it is among 1~4, and so on, until the champion is identified.

With 16 teams we need 4 questions to get the answer, so the information entropy of the message "who won the World Cup" is 4 (bits).

Why is the information entropy equal to 4? It is computed as

$$\mathrm{Ent}(D) = -(p_1 \log_2 p_1 + p_2 \log_2 p_2 + \dots + p_{16} \log_2 p_{16})$$

where $p_1, ..., p_{16}$ are the probabilities of the 16 teams winning the championship.

When every team is equally likely to win (each probability is $\frac{1}{16}$):

$$\begin{aligned} \mathrm{Ent}(D) &= -16 \times \left(\frac{1}{16} \times \log_2 \frac{1}{16}\right) \\ &= -16 \times \left(\frac{1}{16} \times \log_2 2^{-4}\right) \\ &= -16 \times \left(\frac{1}{16} \times (-4)\right) \\ &= 4 \end{aligned}$$

When all outcomes are equally likely, the entropy is maximal and the event is at its most uncertain.

2.1.3 In-class exercise

In a basketball tournament there are 4 teams {A, B, C, D} with winning probabilities {1/2, 1/4, 1/8, 1/8}. Compute $\mathrm{Ent}(D)$.

Answer:

$$\begin{aligned} \mathrm{Ent}(D) & = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \dots - p_n \log_2 p_n \\ & = -\left(\frac{1}{2}\log_2 \frac{1}{2} + \frac{1}{4}\log_2 \frac{1}{4} + \frac{1}{8}\log_2 \frac{1}{8} + \frac{1}{8}\log_2 \frac{1}{8}\right) \\ & = -\left(\frac{1}{2}\log_2 2^{-1} + \frac{1}{4}\log_2 2^{-2} + \frac{1}{8}\log_2 2^{-3} + \frac{1}{8}\log_2 2^{-3}\right) \\ & = -\left(-\frac{1}{2} - \frac{1}{2} - \frac{3}{8} - \frac{3}{8}\right) \\ & = \frac{1}{2} + \frac{1}{2} + \frac{3}{8} + \frac{3}{8} \\ & = 1 + \frac{3}{4} \\ & = \frac{7}{4} \end{aligned}$$

From these two examples we can see that when every team has the same chance of winning, it is hard to guess which team takes the title (the entropy is high); when the winning probabilities differ, we tend to bet on the teams with higher winning rates, so the entropy decreases.
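A quick numerical check of the exercise above (a tiny sketch; the probabilities are taken directly from the exercise):

```python
import math

probs = [1/2, 1/4, 1/8, 1/8]
ent = -sum(p * math.log2(p) for p in probs)
print(ent)  # 1.75 == 7/4
```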

2.2 Information Gain [Decision tree splitting criterion 1]

2.2.1 Concept

Information gain is a statistic that describes how well an attribute separates the data samples. It is defined as the amount of information a feature brings to the classification system: the more information it brings, the more important the feature, and the larger its information gain. In decision tree algorithms, information gain is an important criterion for feature selection. It measures how much the information complexity (uncertainty) decreases under a given condition. If splitting on a feature yields the largest information gain (the largest reduction in uncertainty), we choose that feature.

Information entropy is a measure of the purity of a sample set.


Q: Does this mean that the simpler a feature is, the larger its information gain?
A: Not exactly. Information gain measures a feature's contribution to classifying the data, not the complexity of the feature itself. The larger the information gain, the more the feature contributes to classification, i.e. splitting on this feature separates the data better.

  • For example, suppose we want to predict whether a person likes sports. One feature is "height" and another is "favorite color". Clearly "height" helps us predict whether someone likes sports much more than "favorite color" does, so the information gain of "height" will be larger than that of "favorite color".
  • In short, information gain measures a feature's contribution to classification, not the feature's own complexity.

Information gain is the difference in entropy before and after splitting the data set on some feature (you can think of information gain as a $\Delta$). Entropy measures the uncertainty of a sample set: the larger the entropy, the more uncertain the samples. The difference between the entropy before and after the split therefore measures how well the split on the current feature partitions the sample set $D$:

$$\text{Information gain } (\Delta) = \mathrm{Entropy}_{\text{before}} - \mathrm{Entropy}_{\text{after}}$$

Note: information gain measures how much knowing feature $X$ reduces the information entropy of class $Y$.

2.2.2 Definition and formula

Assume a discrete attribute $a$ has $V$ possible values $a^1, a^2, ..., a^V$. As a concrete example, let $a$ be gender with 2 possible values (male or female), i.e. $V = 2$, $a^1$ = male, $a^2$ = female.

If we use $a$ to split the sample set $D$, we obtain $V = 2$ branch nodes. The $v$-th branch node contains all samples of $D$ whose value on attribute $a$ is $a^v$, denoted $D^v$.

In the gender example, if we split $D$ on the gender attribute $a$, all male samples go into one subset $D^1$ and all female samples go into the other subset $D^2$; within each subset, all samples share the same value of the attribute.

Note that $v$ here is an index over the values, not a fixed number; replacing $v$ by $n$ may make the notation easier to read.

We can compute the information entropy of each $D^v$ with the entropy formula given earlier. Since different branch nodes contain different numbers of samples, each branch is weighted by $\frac{|D^v|}{|D|}$ (the absolute-value bars denote the number of elements in a set, not a numeric absolute value), so branch nodes with more samples have more influence.

We can then compute the "information gain" obtained by splitting the sample set $D$ on attribute $a$.

The information gain $\mathrm{Gain}(D, a)$ of feature $a = \{a^1, a^2\}$ with respect to the training set $D$ is defined as the difference between the information entropy $\mathrm{Ent}(D)$ of $D$ and the conditional entropy $\mathrm{Ent}(D|a)$ of $D$ given feature $a$:

$$\begin{aligned} \mathrm{Gain}(D, a) &= \mathrm{Ent}(D) - \mathrm{Ent}(D|a) \\ &= \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v) \end{aligned}$$

where:

  • $D$: the sample set.
  • $a$: a discrete attribute.
  • $V$: the number of possible values of attribute $a$.
  • $a^v$: the $v$-th possible value of attribute $a$.
  • $D^v$: the subset of samples in $D$ whose value on attribute $a$ is $a^v$.
  • $\mathrm{Gain}(D, a)$: the information gain of feature $a$ with respect to the training set $D$.
  • $\mathrm{Ent}(D)$: the information entropy of $D$.
  • $\mathrm{Ent}(D|a)$: the conditional entropy of $D$ given feature $a$.
  • $\sum_{v=1}^{V}$: the sum over all possible values of attribute $a$.
  • $\frac{|D^v|}{|D|}$: the weight of the $v$-th branch node.
  • $\mathrm{Ent}(D^v)$: the information entropy of the branch node $D^v$.
  • Note: the absolute-value bars denote the number of elements in a set, not the absolute value of a number.

The meaning of this formula: a discrete attribute $a = \{a^1, a^2, ..., a^V\}$ has $V$ possible values. Splitting the sample set $D$ on this attribute produces $V$ branch nodes $D^1, D^2, ..., D^V$, where the $v$-th branch node contains all samples of $D$ whose value on $a$ is $a^v$. We compute the entropy of each branch node with the entropy formula, weight it by $\frac{|D^v|}{|D|}$ (so that branch nodes with more samples have more influence), and sum the weighted entropies to obtain the conditional entropy $\mathrm{Ent}(D|a)$ of $D$ given feature $a$.

Note: $v$ here is an index, not a fixed value; thinking of it as $n$ may make it easier to read.

Therefore the information gain $\mathrm{Gain}(D, a)$ of the discrete attribute $a = \{a^1, a^2, ..., a^V\}$ with respect to the training set $D$ can be understood as how much the uncertainty of $D$ decreases once feature $a$ is known. If a feature yields the largest information gain (the largest reduction in uncertainty), that feature is chosen for the split.


The formula in detail:

  • Computing the information entropy

$$\mathrm{Ent}(D) = -\sum^{n}_{k=1}\frac{|C^k|}{|D|}\log_2 \frac{|C^k|}{|D|}$$

  • Computing the conditional entropy

$$\begin{aligned} \mathrm{Ent}(D|a) & = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v) \\ & = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \sum_{k=1}^{K} \frac{|C^{kv}|}{|D^v|} \log_2 \frac{|C^{kv}|}{|D^v|} \end{aligned}$$

where:

  • $|D^v|$ is the number of samples contained in the $v$-th branch node of attribute $a$
  • $|C^{kv}|$ is the number of class-$k$ samples among the samples contained in the $v$-th branch node of attribute $a$
  • Note: the absolute-value bars denote the number of elements in a set, not the absolute value of a number

The conditional entropy is written $\mathrm{Ent}(D|a)$ with a vertical bar, while the information gain is written $\mathrm{Gain}(D, a)$ with a comma.

In general, the larger the information gain, the larger the "purity improvement" obtained by splitting on attribute $a$. We can therefore use information gain to choose the splitting attribute of a decision tree. The well-known ID3 decision tree learning algorithm uses information gain as its criterion for selecting splitting attributes.

The ID3 decision tree learning algorithm is a greedy algorithm for building decision trees. Its full name is Iterative Dichotomiser 3. ID3 grew out of the Concept Learning System (CLS) and uses the rate of decrease of information entropy as the criterion for choosing test attributes: at each node it selects the not-yet-used attribute with the highest information gain as the splitting attribute, and repeats this process until the resulting tree classifies the training examples perfectly.

ID3 is mainly used for decision tree classification problems. It selects the best splitting attribute by computing the information gain of each feature and then builds the decision tree recursively. ID3 learns rules from the data automatically and uses the resulting tree to classify new data.
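To make the two formulas concrete, here is a small, self-contained Python sketch of the ID3 criterion (entropy, conditional entropy, and information gain); the toy attribute values and labels are made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)."""
    total = len(labels)
    cond = 0.0
    for v in set(values):
        subset = [lbl for val, lbl in zip(values, labels) if val == v]
        cond += len(subset) / total * entropy(subset)
    return entropy(labels) - cond

# Toy example: a hypothetical binary attribute and the class labels.
attr = ['x', 'x', 'y', 'y', 'y']
labels = ['pos', 'pos', 'neg', 'neg', 'pos']
print(information_gain(attr, labels))  # ~0.42
```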

2.2.3 Case

In the table below (see the figure), the first column is the forum user id, the second column is gender, the third column is activity level, and the last column is whether the user churned (1 means churned, the positive class; 0 means not churned, the negative class). The question we want to answer: using the two features gender and activity, which feature has a greater influence on user churn?

insert image description here

This question can be answered by computing information gain; the counts are summarized in the statistics table on the right.

Here Positive denotes positive samples (churned), Negative denotes negative samples (not churned), and the numbers below are the counts under the different splits.

We therefore need three entropies.

a. Compute the class entropy (the overall entropy)

$$\begin{aligned} \mathrm{Ent}(D) & = -\sum^{n}_{k=1}\frac{|C^k|}{|D|}\log_2 \frac{|C^k|}{|D|} \\ & = -\frac{5}{15}\log_2 \frac{5}{15} - \frac{10}{15}\log_2 \frac{10}{15} \\ & = 0.9182 \end{aligned}$$

b1. Compute the information entropy of the gender attribute (a = "gender")

$$\begin{aligned} \mathrm{Ent}(D|\text{gender}) & = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v) \\ & = \frac{|D^1|}{|D|}\mathrm{Ent}(D^1) + \frac{|D^2|}{|D|}\mathrm{Ent}(D^2) \end{aligned}$$

$$\mathrm{Ent}(D^1) = -\frac{3}{8}\log_2 \frac{3}{8} - \frac{5}{8}\log_2 \frac{5}{8} = 0.9543$$

$$\mathrm{Ent}(D^2) = -\frac{2}{7}\log_2 \frac{2}{7} - \frac{5}{7}\log_2 \frac{5}{7} = 0.8631$$

c1. Compute the information gain of gender (a = "gender")

$$\begin{aligned} \mathrm{Gain}(D, \text{gender}) &= \mathrm{Ent}(D) - \mathrm{Ent}(D|a) \\ & = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v) \\ &= \mathrm{Ent}(D) - \frac{8}{15}\mathrm{Ent}(D^1) - \frac{7}{15}\mathrm{Ent}(D^2) \\ &= 0.0064 \end{aligned}$$


b2. Compute the information entropy of the activity attribute (a = "activity")

$$\begin{aligned} \mathrm{Ent}(D|\text{activity}) & = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v) \\ & = \frac{|D^1|}{|D|}\mathrm{Ent}(D^1) + \frac{|D^2|}{|D|}\mathrm{Ent}(D^2) + \frac{|D^3|}{|D|}\mathrm{Ent}(D^3) \end{aligned}$$

$$\mathrm{Ent}(D^1) = -\frac{0}{6}\log_2 \frac{0}{6} - \frac{6}{6}\log_2 \frac{6}{6} = -0 - 0 = 0$$

$$\mathrm{Ent}(D^2) = -\frac{1}{5}\log_2 \frac{1}{5} - \frac{4}{5}\log_2 \frac{4}{5} = 0.7219$$

$$\mathrm{Ent}(D^3) = -\frac{4}{4}\log_2 \frac{4}{4} - \frac{0}{4}\log_2 \frac{0}{4} = -0 - 0 = 0$$

Note: when the argument of $\log$ is 1, $\log(1) = 0$; when a proportion is 0, the term $0 \cdot \log(0)$ is taken to be 0 by convention (even though $\log(0)$ itself is undefined). These terms are therefore dropped when computing information entropy.

c2. Compute the information gain of activity (a = "activity")

$$\begin{aligned} \mathrm{Gain}(D, \text{activity}) &= \mathrm{Ent}(D) - \mathrm{Ent}(D|a) \\ & = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v) \\ &= \mathrm{Ent}(D) - \frac{6}{15}\mathrm{Ent}(D^1) - \frac{5}{15}\mathrm{Ent}(D^2) - \frac{4}{15}\mathrm{Ent}(D^3) \\ &= 0.6776 \end{aligned}$$


Comparing the information gain of the two features:

| Feature | Information gain |
| --- | --- |
| Gender | $\mathrm{Gain}(D, \text{gender}) = 0.0064$ |
| Activity | $\mathrm{Gain}(D, \text{activity}) = 0.6776$ |

Clearly the information gain of activity is larger than that of gender. In other words, activity has a greater influence on user churn than gender does. When doing feature selection or data analysis, we should pay more attention to the activity indicator.
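The same numbers can be reproduced with a few lines of Python (a small sketch; the per-group counts are read off the statistics table described above):

```python
import math

def ent(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

ent_d = ent([5, 10])  # 5 churned, 10 not churned -> ~0.9182
# gender: 8 samples (3 churned, 5 not) and 7 samples (2 churned, 5 not)
gain_gender = ent_d - (8/15 * ent([3, 5]) + 7/15 * ent([2, 5]))
# activity: 6 samples (0, 6), 5 samples (1, 4), 4 samples (4, 0)
gain_activity = ent_d - (6/15 * ent([0, 6]) + 5/15 * ent([1, 4]) + 4/15 * ent([4, 0]))
print(round(gain_gender, 4), round(gain_activity, 4))  # ~0.0064 ~0.6776
```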

2.3 Information Gain Ratio [Decision tree splitting criterion 2]

2.3.1 Concept

In the discussion above we deliberately ignored the "id" column. If "id" were also treated as a candidate splitting attribute, its information gain computed by the formula would be 0.9182, much larger than that of the other candidate attributes.

This is because, when computing the conditional entropy of this attribute, every branch contains a single sample and its entropy is 0, so the information gain equals the overall entropy, 0.9182. But obviously a split like this has no generalization ability: it cannot make effective predictions on new samples.

In fact, the information gain criterion is biased toward attributes with many possible values. To reduce the potential harm of this bias, the well-known C4.5 decision tree algorithm was proposed. C4.5 does not use information gain directly; instead it uses the information gain ratio to select the optimal splitting attribute.

Information Gain Ratio: the information gain ratio is defined as the ratio of the information gain $\mathrm{Gain}(D, a)$ introduced above to the "intrinsic value" ($\mathrm{IV}$) of attribute $a$:

$$\mathrm{Gain\ ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$

where:

  • $D$ denotes the data set
  • $a$ denotes the attribute
  • $\mathrm{IV}(a)$ denotes the intrinsic value of attribute $a$
  • $D^v$ denotes the subset of samples in $D$ whose value on attribute $a$ is $a^v$
    • Hence $\frac{|D^v|}{|D|}$ is the proportion of samples in $D$ whose value on $a$ is $a^v$.

The intrinsic value $\mathrm{IV}(a)$ is computed as:

$$\mathrm{IV}(a) = -\sum^{V}_{v=1}\frac{|D^v|}{|D|}\log_2 \frac{|D^v|}{|D|}$$

where:

  • $\mathrm{IV}$ stands for Intrinsic Value.
    • The intrinsic value $\mathrm{IV}(a)$ measures the amount of information inherent in attribute $a$ itself.
  • $V$ is the number of possible values of attribute $a$
  • $D^v$ is the subset of samples in $D$ whose value on attribute $a$ is $a^v$
    • Hence $\frac{|D^v|}{|D|}$ is the proportion of samples in $D$ whose value on $a$ is $a^v$.

The more possible values attribute $a$ has (the larger $V$ is), the larger $\mathrm{IV}(a)$ usually becomes.

2.3.2 Case

2.3.2.1 Case 1

a. Compute the class entropy

b. Compute the information entropy of each attribute (gender, activity)

c. Compute the information gain of each attribute (gender, activity)

(Steps a to c are computed exactly as in the information gain case above.)

d. Compute the split information measure of each attribute (i.e. the intrinsic value $\mathrm{IV}$)

The split information measure accounts for the number and size of the branches produced when splitting on an attribute; it is also called the intrinsic information of the attribute (Intrinsic Information). The information gain ratio IGR = information gain IG / intrinsic information IV makes an attribute less attractive as its intrinsic information grows (that is, if the attribute itself is highly uncertain, we become less inclined to choose it), which can be seen as a compensation for using pure information gain.

$$\begin{aligned} \mathrm{IV}(\text{gender}) & = -\sum^{V}_{v=1}\frac{|D^v|}{|D|}\log_2 \frac{|D^v|}{|D|} \\ & = -\frac{7}{15}\log_2 \frac{7}{15} - \frac{8}{15}\log_2 \frac{8}{15} \\ & = 0.9968 \end{aligned}$$

$$\begin{aligned} \mathrm{IV}(\text{activity}) & = -\sum^{V}_{v=1}\frac{|D^v|}{|D|}\log_2 \frac{|D^v|}{|D|} \\ & = -\frac{6}{15}\log_2 \frac{6}{15} - \frac{5}{15}\log_2 \frac{5}{15} - \frac{4}{15}\log_2 \frac{4}{15} \\ & = 1.5656 \end{aligned}$$

e. Compute the information gain ratio

$$\begin{aligned} \mathrm{Gain\ ratio}(D, \text{gender}) & = \frac{\mathrm{Gain}(D, \text{gender})}{\mathrm{IV}(\text{gender})} \\ & = \frac{0.0064}{0.9968} \\ & = 0.0064 \end{aligned}$$

$$\begin{aligned} \mathrm{Gain\ ratio}(D, \text{activity}) & = \frac{\mathrm{Gain}(D, \text{activity})}{\mathrm{IV}(\text{activity})} \\ & = \frac{0.6776}{1.5656} \\ & = 0.4328 \end{aligned}$$

The information gain ratio of activity is higher (0.4328 > 0.0064), so activity is preferred when building the decision tree.

By dividing by the intrinsic value IV to obtain the information gain ratio, we reduce the preference for attributes with many values when selecting split nodes.
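The division step can be checked directly (a small sketch; the group sizes and gain values are the ones from the churn case above, and the function name is illustrative):

```python
import math

def intrinsic_value(group_sizes):
    """IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)."""
    total = sum(group_sizes)
    return -sum(s / total * math.log2(s / total) for s in group_sizes if s > 0)

iv_gender = intrinsic_value([8, 7])        # ~0.9968
iv_activity = intrinsic_value([6, 5, 4])   # ~1.5656
print(0.0064 / iv_gender, 0.6776 / iv_activity)  # gain ratios ~0.0064, ~0.4328
```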

2.3.2.2 Case 2

In the table below (see the figures), the first column is the weather, the second the temperature, the third the humidity, the fourth the wind speed, and the last column indicates whether the activity took place.

The question to solve: based on the data in the table, decide whether the activity will take place under a given weather condition.

insert image description here

insert image description here

The data set has four attributes, attribute set $A = \{\text{weather}, \text{temperature}, \text{humidity}, \text{wind speed}\}$, and two class labels, class set $L = \{\text{proceed}, \text{cancel}\}$.


a. Compute the class entropy

The class entropy $\mathrm{Ent}(D)$ is the total uncertainty over the classes in all samples. By the definition of entropy, the larger the entropy, the greater the uncertainty and the more information is needed to resolve it.

$$\begin{aligned} \mathrm{Ent}(D) & = -\sum^{n}_{k=1}\frac{|C^k|}{|D|}\log_2 \frac{|C^k|}{|D|} \\ & = -\frac{9}{14}\log_2 \frac{9}{14} - \frac{5}{14}\log_2 \frac{5}{14} \\ & = 0.940 \end{aligned}$$


b. Compute the information entropy of each attribute

The information entropy of each attribute is a conditional entropy $\mathrm{Ent}(D|a)$: the total class uncertainty given attribute $a$. The larger an attribute's conditional entropy, the less "pure" the classes are within that attribute's branches.

  • a = "weather" (5 "sunny", 4 "overcast", 5 "rainy")

$$\begin{aligned} \mathrm{Ent}(D|\text{weather}) &= \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v) \\ &= \frac{5}{14} \times \left[-\frac{2}{5}\log_2 \frac{2}{5} - \frac{3}{5}\log_2 \frac{3}{5}\right] + \frac{4}{14} \times \left[-\frac{4}{4}\log_2 \frac{4}{4}\right] + \frac{5}{14} \times \left[-\frac{2}{5}\log_2 \frac{2}{5} - \frac{3}{5}\log_2 \frac{3}{5}\right] \\ &= 0.694 \end{aligned}$$

  • a = "temperature" (4 "cold", 6 "moderate", 4 "hot")

$$\begin{aligned} \mathrm{Ent}(D|\text{temperature}) & = \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v) \\ & = \frac{4}{14} \times \left[-\frac{2}{4}\log_2 \frac{2}{4} - \frac{2}{4}\log_2 \frac{2}{4}\right] + \frac{6}{14} \times \left[-\frac{4}{6}\log_2 \frac{4}{6} - \frac{2}{6}\log_2 \frac{2}{6}\right] + \frac{4}{14} \times \left[-\frac{3}{4}\log_2 \frac{3}{4} - \frac{1}{4}\log_2 \frac{1}{4}\right] \\ & = 0.911 \end{aligned}$$

  • a = "humidity" (7 "normal", 7 "high")

$$\mathrm{Ent}(D|\text{humidity}) = 0.789$$

  • a = "wind speed" (8 "weak", 6 "strong")

$$\mathrm{Ent}(D|\text{wind speed}) = 0.892$$


c. Compute the information gain

Information gain = entropy - conditional entropy; here it is the class entropy minus the attribute's conditional entropy, and it measures how much the uncertainty is reduced. The larger an attribute's information gain, the better splitting on it reduces the uncertainty of the resulting subsets, and the faster and better we reach the classification goal.

The larger the information gain, the better.

Information gain is the feature selection criterion of the ID3 algorithm.

$$\mathrm{Gain}(D, \text{weather}) = 0.940 - 0.694 = 0.246$$

$$\mathrm{Gain}(D, \text{temperature}) = 0.940 - 0.911 = 0.029$$

$$\mathrm{Gain}(D, \text{humidity}) = 0.940 - 0.789 = 0.151$$

$$\mathrm{Gain}(D, \text{wind speed}) = 0.940 - 0.892 = 0.048$$

Now suppose we add a "number" column (values 1~14) in front of the data in the table above. If "number" is also used as a candidate splitting attribute, then following the same steps we find that its conditional entropy is 0, so its information gain is 0.940.

But clearly a split like this has no generalization ability, and information gain cannot select an effective splitting feature here. This is why C4.5 improves on ID3 by using the information gain ratio instead.


d. Compute the split information measure of each attribute (intrinsic value IV)

$$\begin{aligned} \mathrm{IV}(\text{weather}) & = -\sum^{V}_{v=1}\frac{|D^v|}{|D|}\log_2 \frac{|D^v|}{|D|} \\ & = -\frac{5}{14}\log_2 \frac{5}{14} - \frac{5}{14}\log_2 \frac{5}{14} - \frac{4}{14}\log_2 \frac{4}{14} \\ & = 1.577 \end{aligned}$$

$$\begin{aligned} \mathrm{IV}(\text{temperature}) & = -\frac{4}{14}\log_2 \frac{4}{14} - \frac{6}{14}\log_2 \frac{6}{14} - \frac{4}{14}\log_2 \frac{4}{14} \\ & = 1.556 \end{aligned}$$

$$\begin{aligned} \mathrm{IV}(\text{humidity}) & = -\frac{7}{14}\log_2 \frac{7}{14} - \frac{7}{14}\log_2 \frac{7}{14} \\ & = 1.0 \end{aligned}$$

$$\begin{aligned} \mathrm{IV}(\text{wind speed}) & = -\frac{8}{14}\log_2 \frac{8}{14} - \frac{6}{14}\log_2 \frac{6}{14} \\ & = 0.985 \end{aligned}$$


e. Compute the information gain ratio

$$\begin{aligned} \mathrm{Gain\ ratio}(D, \text{weather}) & = \frac{\mathrm{Gain}(D, \text{weather})}{\mathrm{IV}(\text{weather})} \\ & = \frac{0.246}{1.577} \\ & = 0.156 \end{aligned}$$

$$\begin{aligned} \mathrm{Gain\ ratio}(D, \text{temperature}) & = \frac{\mathrm{Gain}(D, \text{temperature})}{\mathrm{IV}(\text{temperature})} \\ & = \frac{0.029}{1.556} \\ & \approx 0.019 \end{aligned}$$

$$\begin{aligned} \mathrm{Gain\ ratio}(D, \text{humidity}) & = \frac{\mathrm{Gain}(D, \text{humidity})}{\mathrm{IV}(\text{humidity})} \\ & = \frac{0.151}{1.0} \\ & = 0.151 \end{aligned}$$

$$\begin{aligned} \mathrm{Gain\ ratio}(D, \text{wind speed}) & = \frac{\mathrm{Gain}(D, \text{wind speed})}{\mathrm{IV}(\text{wind speed})} \\ & = \frac{0.048}{0.985} \\ & = 0.0487 \end{aligned}$$

| Attribute | Information gain ratio |
| --- | --- |
| Weather | 0.156 |
| Temperature | 0.019 |
| Humidity | 0.151 |
| Wind speed | 0.0487 |

From the table above, weather has the highest information gain ratio, so weather is chosen as the splitting attribute. After the split, the class under the "overcast" branch is already pure, so that branch becomes a leaf node, and we continue to split the branches that are not yet pure.
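The whole selection step can be reproduced end to end. The sketch below assumes the 14-row weather data implied by the counts given above; the per-value class splits for humidity and wind speed are not spelled out in the text, so they are filled in from the classic version of this dataset (an assumption):

```python
import math

def ent(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# (proceed, cancel) counts per attribute value; 14 samples, 9 proceed / 5 cancel overall.
attributes = {
    "weather":     {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},
    "temperature": {"hot": (2, 2), "moderate": (4, 2), "cold": (3, 1)},
    "humidity":    {"high": (3, 4), "normal": (6, 1)},   # assumed per-value splits
    "wind speed":  {"weak": (6, 2), "strong": (3, 3)},   # assumed per-value splits
}

ent_d = ent([9, 5])  # class entropy ~0.940
for name, groups in attributes.items():
    total = sum(sum(g) for g in groups.values())
    cond = sum(sum(g) / total * ent(list(g)) for g in groups.values())
    iv = ent([sum(g) for g in groups.values()])  # IV(a) has the same form as entropy
    gain = ent_d - cond
    print(f"{name:12s} gain={gain:.3f} IV={iv:.3f} gain_ratio={gain / iv:.3f}")
```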

insert image description here

Repeat steps 1~5 on the child nodes until all leaf nodes are sufficiently "pure".

Let us now summarize the C4.5 algorithm flow:

while (the current node is not "pure"):
    1. Compute the class entropy of the current node (over the class values): Ent(D)
    2. Compute the attribute entropies at the current stage (class values grouped by attribute): Ent(D|a)
    3. Compute the information gain: Gain(D, a)
    4. Compute the split information measure of each attribute: the intrinsic value IV(a)
    5. Compute the information gain ratio of each attribute: Gain Ratio(D, a)

    if all resulting values are leaf nodes:
        return  # done

Q: What does "value" refer to here?
A: In the C4.5 algorithm, "value" refers to an attribute value. In a decision tree, each node represents an attribute and each branch represents one possible value of that attribute. So "all resulting values are leaf nodes" means that all child nodes of the current node are leaves, i.e. each of them contains samples of a single class.

2.3.3 Why is C4.5 better?

  1. It selects attributes with the information gain ratio: this overcomes the bias of information gain toward attributes with many values.
  2. It uses post-pruning: this prevents the tree from growing without bound and thus avoids overfitting the data.
  3. It handles missing values: in some cases the available data may be missing values for some attributes.
    • Suppose <x, c(x)> is a training instance in the sample set S, but the value A(x) of its attribute A is unknown.
    • There are two common strategies for handling a missing attribute value:
      • One strategy is to assign it the most common value of the attribute among the training instances at node n;
      • A more complex strategy is to assign a probability to each possible value of A.
    • For example: given a Boolean attribute A, if node n contains 6 known instances with A = 1 and 4 with A = 0, then the probability that A(x) = 1 is 0.6 and the probability that A(x) = 0 is 0.4. Thus 60% of instance x is sent down the A = 1 branch and 40% down the other branch.
    • C4.5 handles missing attribute values in this way.

2.4 Gini Value and Gini Index [Decision tree splitting criterion 3]

2.4.1 Concept

The CART decision tree (Classification and Regression Tree) uses the Gini index to select the splitting attribute.

CART, short for Classification and Regression Tree, is a well-known decision tree learning algorithm that can be used for both classification and regression tasks.

Definitions of the Gini value and the Gini index:

1. Gini value

Definition of the Gini value: an indicator of the purity of a data set. It is the probability that two samples drawn at random from the data set $D$ have different class labels, with $\mathrm{Gini}(D) \in [0, 1]$.

  • The smaller $\mathrm{Gini}(D)$, the higher the purity of the data set $D$ (better)
  • The larger $\mathrm{Gini}(D)$, the lower the purity of the data set $D$ (worse)

In other words, the purity of the data set $D$ can be quantified by the Gini value:

$$\begin{aligned} \mathrm{Gini}(D) & = \sum_{k=1}^{|y|}\sum_{k' \neq k} p_k p_{k'} \\ & = 1 - \sum_{k=1}^{|y|} p_k^2 \end{aligned}$$

where:

  • $|y|$ is the number of classes
  • $p_k = \frac{|C^k|}{|D|}$ is the proportion of class-$k$ samples in the data set $D$
  • $|C^k|$ is the number of class-$k$ samples
  • $k'$ is another class index, different from $k$ (i.e. $k' \neq k$). When computing the Gini value we sum over all pairs of distinct classes, so two indices $k$ and $k'$ are needed.

2. Gini index

Definition of the Gini index: an indicator used to select the optimal splitting attribute. In general, the attribute whose split yields the smallest Gini index is chosen as the optimal splitting attribute.

$$\mathrm{Gini\ index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\mathrm{Gini}(D^v)$$

where:

  • $D$ is the data set.
  • $a$ is an attribute.
  • $V$ is the number of values of attribute $a$.
  • $v$ is an index over the values of attribute $a$.
  • $D^v$ is the subset of samples in $D$ whose value on attribute $a$ is $a^v$.
  • $\mathrm{Gini}(D^v)$ is the Gini value of the subset $D^v$.
  • $\frac{|D^v|}{|D|}$ is the proportion of $D^v$ in the data set $D$.

So the Gini index $\mathrm{Gini\ index}(D, a)$ is the weighted average of the Gini values of the subsets obtained by splitting the data set $D$ on attribute $a$.
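A minimal Python sketch of both quantities (the function names are illustrative):

```python
def gini(counts):
    """Gini value of a class distribution given as raw counts: 1 - sum(p_k^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_index(groups):
    """Weighted Gini value over the subsets produced by a split.

    `groups` is a list of class-count lists, one per attribute value.
    """
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * gini(g) for g in groups)

print(gini([5, 5]))                    # 0.5, maximally impure two-class set
print(gini_index([[0, 3], [3, 4]]))    # e.g. the "owns a house" split below -> ~0.343
```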


Q1: What is the Gini index (Gini coefficient)?
A1: The Gini coefficient is an index defined by the Italian statistician Corrado Gini in the early 20th century, based on the Lorenz curve, to judge how fairly annual income is distributed. It is a ratio between 0 and 1. The Gini index expresses the Gini coefficient as a percentage, i.e. the coefficient multiplied by 100.

  • Gini coefficient $\in$ [0, 1]
  • Gini index $\in$ [0%, 100%]

Q2: Is the Gini coefficient only used to judge the fairness of annual income distribution?
A2: The Gini coefficient is commonly used as an indicator of the income gap among residents of a country or region. It includes the income Gini coefficient and the wealth Gini coefficient; the two are computed in roughly the same way, the difference being that the income Gini uses household income statistics of a region while the wealth Gini uses household total asset statistics.

Beyond income inequality, the Gini coefficient can also be used to measure inequality in other respects, such as education level, health status and political participation.

Q3: So the Gini coefficient (Gini index) is used to measure inequality?
A3: Yes, the Gini coefficient (Gini index) is commonly used to measure the income gap among residents of a country or region, and it can also measure inequality in other dimensions such as education, health and political participation.

Q4: What does a larger Gini coefficient (Gini index) mean?
A4: The larger the Gini coefficient (Gini index), the more unequal the income distribution (or whatever is being measured). Its maximum is 1 and its minimum is 0: the former means income is distributed with absolute inequality, i.e. all income is held by a single unit; the latter means income is distributed with absolute equality, i.e. everyone's income is identical. In practice the Gini coefficient therefore lies between 0 and 1: the smaller it is, the more even the distribution; the larger it is, the more uneven.

Q5: Can the Gini coefficient (Gini index) be applied to machine learning?
A5: Yes. In decision tree algorithms the Gini coefficient (Gini index) is often used to measure the impurity of a data set. When building a decision tree, the algorithm chooses the feature split with the smallest Gini index in order to obtain purer subsets.

Q6: What is the Gini value?
A6: The Gini value is the quantity used in decision tree algorithms to measure the impurity of a data set, computed from the same idea as the Gini coefficient. The smaller the Gini value, the lower the impurity of the data set, i.e. the more likely it is that two samples drawn from the data set belong to the same class.


In the context of the CART decision tree, Gini value, Gini index and Gini coefficient are essentially different names for the same idea: an indicator of the purity of a data set, namely the probability that two samples drawn at random from the data set have different class labels. The smaller the Gini value, the higher the purity of the data set; the larger it is, the lower the purity. When building a CART tree, the Gini value (weighted over the subsets of a split, i.e. the Gini index) is used to select the best splitting attribute.

2.4.2 Case

Build a decision tree from the table below, using the Gini index as the splitting criterion.

| No. | Owns a house | Marital status | Annual income | Defaulted on loan |
| --- | --- | --- | --- | --- |
| 1 | yes | single | 125k | no |
| 2 | no | married | 100k | no |
| 3 | no | single | 70k | no |
| 4 | yes | married | 120k | no |
| 5 | no | divorced | 95k | yes |
| 6 | no | married | 60k | no |
| 7 | yes | divorced | 220k | no |
| 8 | no | single | 85k | yes |
| 9 | no | married | 75k | no |
| 10 | no | single | 90k | yes |

1. Compute the Gini index of each non-id attribute of the data set (owns a house, marital status, annual income), and take the attribute with the smallest Gini index as the root node of the decision tree.

First big cycle

2. The Gini value of the root node is:

$$\begin{aligned} \mathrm{Gini}(\text{defaulted on loan}) & = 1 - \sum_{k=1}^{|y|} p_k^2 \\ & = 1 - \left[\left(\frac{3}{10}\right)^2 + \left(\frac{7}{10}\right)^2\right] \\ & = 0.42 \end{aligned}$$

In a decision tree we choose one class attribute as the decision (target) attribute and then split the data set according to the values of the other attributes. Here the decision attribute is "defaulted on loan", and the Gini value of the data set $D$ with respect to it is 0.42.

3. Splitting on "owns a house", the Gini index is computed as:

$$\begin{aligned} \mathrm{Gini}(\text{left child}) & = 1 - \sum_{k=1}^{|y|} p_k^2 \\ & = 1 - \left[\left(\frac{0}{3}\right)^2 + \left(\frac{3}{3}\right)^2\right] \\ & = 0 \end{aligned}$$

$$\begin{aligned} \mathrm{Gini}(\text{right child}) & = 1 - \sum_{k=1}^{|y|} p_k^2 \\ & = 1 - \left[\left(\frac{3}{7}\right)^2 + \left(\frac{4}{7}\right)^2\right] \\ & = 0.4898 \end{aligned}$$

$$\begin{aligned} \mathrm{Gini\ index}(D, \text{owns a house}) & = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\mathrm{Gini}(D^v) \\ & = \frac{7}{10} \times 0.4898 + \frac{3}{10} \times 0 \\ & = 0.343 \end{aligned}$$

where:

  • Left child: owns a house (yes)
  • Right child: does not own a house (no)

insert image description here

4. If we split on the marital status attribute, it has three possible groupings:

  • {married} | {single, divorced}
  • {single} | {married, divorced}
  • {divorced} | {single, married}

4.1 For the grouping {married} | {single, divorced}:

$$\begin{aligned} \mathrm{Gini\ index}(D, \text{marital status}) & = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\mathrm{Gini}(D^v) \\ & = \frac{4}{10} \times 0 + \frac{6}{10} \times \left[1 - \left(\frac{3}{6}\right)^2 - \left(\frac{3}{6}\right)^2\right] \\ & = 0.3 \end{aligned}$$

4.2 For the grouping {single} | {married, divorced}:

$$\begin{aligned} \mathrm{Gini\ index}(D, \text{marital status}) & = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\mathrm{Gini}(D^v) \\ & = \frac{4}{10} \times \left[1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right] + \frac{6}{10} \times \left[1 - \left(\frac{1}{6}\right)^2 - \left(\frac{5}{6}\right)^2\right] \\ & = 0.367 \end{aligned}$$

4.3 For the grouping {divorced} | {single, married}:

$$\begin{aligned} \mathrm{Gini\ index}(D, \text{marital status}) & = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\mathrm{Gini}(D^v) \\ & = \frac{2}{10} \times \left[1 - \left(\frac{1}{2}\right)^2 - \left(\frac{1}{2}\right)^2\right] + \frac{8}{10} \times \left[1 - \left(\frac{2}{8}\right)^2 - \left(\frac{6}{8}\right)^2\right] \\ & = 0.4 \end{aligned}$$

Comparing the results, when the root node is split on marital status, the grouping with the smallest Gini index is chosen, namely {married} | {single, divorced}.

5. In the same way we obtain the Gini index of annual income:

Since annual income is a numeric attribute, the data must first be sorted in ascending order; then, going from smallest to largest, the midpoint between each pair of adjacent values is used as a candidate split point that divides the samples into two groups. For example, for the two annual income values 60k and 70k, the midpoint is 65k, and the Gini index is computed with 65k as the split point.

Taking the midpoint 65k as an example: incomes below 65k go into one group and incomes above 65k into the other, and the Gini index of this split can then be computed.

insert image description here

According to the calculation, two of the three attributes achieve the smallest Gini index for the root split: annual income and marital status, both with a Gini index of 0.3. In that case the attribute that appears first, the marital status grouping {married} | {single, divorced}, is chosen as the first split.
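The midpoint search over the numeric attribute can be sketched as follows (the annual income values and default labels are taken from the table above; 1 = defaulted, 0 = not):

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
default = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

pairs = sorted(zip(incomes, default))
best = None
for (x1, _), (x2, _) in zip(pairs, pairs[1:]):
    split = (x1 + x2) / 2  # midpoint between adjacent sorted values
    left = [y for x, y in pairs if x <= split]
    right = [y for x, y in pairs if x > split]
    index = (len(left) / len(pairs)) * gini([left.count(0), left.count(1)]) \
          + (len(right) / len(pairs)) * gini([right.count(0), right.count(1)])
    if best is None or index < best[1]:
        best = (split, index)
print(best)  # best split point and its Gini index: (97.5, 0.3)
```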

Second big cycle

6. Next, the remaining attributes are evaluated in the same way on the remaining records. The Gini value of the new node (the 6 remaining records contain 3 defaulted and 3 non-defaulted loans) is:

$$\begin{aligned} \mathrm{Gini}(\text{defaulted on loan}) & = 1 - \sum_{k=1}^{|y|} p_k^2 \\ & = 1 - \left[\left(\frac{3}{6}\right)^2 + \left(\frac{3}{6}\right)^2\right] \\ & = 0.5 \end{aligned}$$

7. For the "owns a house" attribute we get:

$$\begin{aligned} \mathrm{Gini\ index}(D, \text{owns a house}) & = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\mathrm{Gini}(D^v) \\ & = \frac{2}{6} \times 0 + \frac{4}{6} \times \left[1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2\right] \\ & = 0.25 \end{aligned}$$

8. For the annual income attribute (best midpoint split), we have:

insert image description here

$$\begin{aligned} \mathrm{Gini\ index}(D, \text{annual income}) & = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\mathrm{Gini}(D^v) \\ & = \frac{2}{6} \times 0 + \frac{4}{6} \times \left[1 - \left(\frac{3}{4}\right)^2 - \left(\frac{1}{4}\right)^2\right] \\ & = 0.25 \end{aligned}$$

After this process, the resulting decision tree is:

insert image description here


Let us now summarize the algorithm flow of CART (Classification and Regression Tree):

while (the current node is not "pure"):
    1. Enumerate every possible split of every variable and find the best split point
    2. Split the node into two child nodes N1 and N2

    if every node is sufficiently "pure":
        return  # done
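Since the learning objectives mention DecisionTreeClassifier, here is a minimal sklearn sketch; sklearn's tree implementation follows the CART style (binary splits), and the iris data is used purely as a placeholder dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

# criterion="gini" is the default; criterion="entropy" switches to an information-gain style split.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=22)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))  # accuracy on the held-out split
print(export_text(tree))           # the learned if-else structure
```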

2.5 Summary

2.5.1 Comparison of the splitting criteria (heuristic functions) of common decision trees

1. Information entropy

$$\mathrm{Ent}(D) = -\sum_{k = 1}^{n} p_k \log_2 p_k$$

  • $D$ denotes the data set
  • $n$ denotes the number of classes in the data set
  • $p_k$ denotes the proportion of class-$k$ samples in the data set

2. Information gain: the ID3 decision tree

$$\begin{aligned} \mathrm{Gain}(D, a) & = \mathrm{Ent}(D) - \mathrm{Ent}(D|a) \\ & = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \mathrm{Ent}(D^v) \end{aligned}$$

  • $\mathrm{Gain}(D, a)$ is the information gain obtained by splitting the data set $D$ on attribute $a$
  • $\mathrm{Ent}(D|a)$ is the conditional entropy of the data set $D$ after splitting on attribute $a$
  • $V$ is the number of possible values of attribute $a$
  • $\frac{|D^v|}{|D|}$ is the proportion of samples in $D$ whose value on attribute $a$ is $a^v$
  • $\mathrm{Ent}(D^v)$ is the information entropy of the subset of $D$ whose value on attribute $a$ is $a^v$

3. Information gain ratio: the C4.5 decision tree

$$\mathrm{Gain\ Ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$

  • $\mathrm{Gain\ Ratio}(D, a)$ is the information gain ratio obtained by splitting the data set $D$ on attribute $a$
  • $\mathrm{Gain}(D, a)$ is the information gain obtained by splitting the data set $D$ on attribute $a$
  • $\mathrm{IV}(a)$ is the intrinsic value of attribute $a$
    • $\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$
    • where:
      • $V$ is the number of possible values of attribute $a$
      • $\frac{|D^v|}{|D|}$ is the proportion of samples in $D$ whose value on attribute $a$ is $a^v$
    • The intrinsic value $\mathrm{IV}(a)$ reflects how strongly attribute $a$ splits the data set $D$: the larger $\mathrm{IV}(a)$, the stronger the splitting ability of attribute $a$ on $D$.

4. Gini value

$$\begin{aligned} \mathrm{Gini}(D) & = \sum_{k=1}^{|y|}\sum_{k' \neq k} p_k p_{k'} \\ & = 1 - \sum_{k=1}^{|y|} p_k^2 \end{aligned}$$

  • $\mathrm{Gini}(D)$ is the Gini value of the data set $D$
  • $|y|$ is the number of classes in the data set
  • $p_k$ is the proportion of class-$k$ samples in the data set

5. Gini index: the CART decision tree

$$\mathrm{Gini\ index}(D, a) = \sum_{v = 1}^{V} \frac{|D^v|}{|D|} \mathrm{Gini}(D^v)$$

  • $\mathrm{Gini\ index}(D, a)$ is the Gini index obtained by splitting the data set $D$ on attribute $a$
  • $\frac{|D^v|}{|D|}$ is the proportion of samples in $D$ whose value on attribute $a$ is $a^v$
  • $\mathrm{Gini}(D^v)$ is the Gini value of the subset of $D$ whose value on attribute $a$ is $a^v$

| Name | Year | Author | Splitting criterion | Regression task | Classification task | Discrete features | Continuous features | Remarks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ID3 | 1986 | Ross Quinlan | Information gain | no | yes | yes | no | ID3 can only build decision trees for data sets with discrete attributes |
| C4.5 | 1993 | Ross Quinlan | Information gain ratio | no | yes | yes | yes | Fixes ID3's preference for attributes with many values when branching |
| CART | 1984 | Breiman et al. | Gini index or least squares | yes | yes | yes | yes | Handles both classification and regression, and both discrete and continuous attributes |

Q1: What are these formulas for?
A1: They are the core quantities used in decision tree learning.

  • Information entropy: a measure of the purity of a data set; the smaller the value, the purer the data set.
  • Information gain: measures the purity improvement obtained by splitting the data set $D$ on attribute $a$.
  • Information gain ratio: a normalized information gain, introduced to counter the bias of information gain toward attributes with many possible values.
  • Gini value: similar to information entropy, also a measure of the purity of a data set.
  • Gini index: the CART decision tree uses the Gini index to select the optimal splitting attribute.

In short, these formulas are used to choose the optimal splitting attribute when building a decision tree.


Q2: What is a decision tree?
A2: A decision tree is a commonly used supervised learning method that can handle both classification and regression problems. It builds a tree structure for prediction by recursively partitioning the data set.

Each internal (non-leaf) node of a decision tree represents a test on an attribute, each branch represents one outcome of the test, and each leaf node stores a class label. To make a prediction, start from the root node, test the instance, follow the branch matching the test result down to the next level, and repeat until a leaf node is reached; the class label stored in the leaf is the prediction.

Decision trees are easy to understand and interpret, computationally cheap, and able to handle both categorical and numerical data.


Q3: Where are decision trees applied?
A3: Decision trees are widely used in many fields, including finance, healthcare, e-commerce and marketing. They can be used for classification and regression problems such as:

  • A bank can use a decision tree to predict whether a customer will default.
  • Doctors can use decision trees to assist in diagnosing diseases.
  • An e-commerce site can use a decision tree to predict whether a user will buy a certain product.
  • Marketers can use decision trees to segment customers for better marketing campaigns.

In addition, decision trees serve as the base model of other machine learning algorithms, such as random forests and gradient boosted trees.


Q4: What are the disadvantages of decision trees?
A4: Although decision trees have many advantages, they also have some drawbacks, mainly:

  • Prone to overfitting: decision trees tend to grow overly complex models, which leads to overfitting. Pruning is needed to avoid this.
  • Instability: decision trees are very sensitive to small changes in the data, which can make the resulting model unstable. Ensemble learning methods can improve stability.
  • Local optima: decision tree learning uses a greedy strategy, choosing the locally optimal splitting attribute at each step, which does not guarantee a globally optimal tree.

Despite these shortcomings, decision trees remain a very practical machine learning method.


Q5: How is a decision tree built?
A5: Building a decision tree usually involves the following steps:

  1. Select the optimal splitting attribute: use some criterion (information gain, information gain ratio, Gini index, etc.) to choose the best splitting attribute.
  2. Partition the data set: split the data set into subsets according to the values of the chosen attribute.
  3. Build the tree recursively: repeat the process on each subset until a stopping condition is met (all samples in a subset belong to the same class, the subset is smaller than a threshold, etc.). A minimal recursive sketch is given after this answer.

The resulting tree may overfit and therefore needs pruning. Pruning comes in two flavors: pre-pruning and post-pruning.

  • Pre-pruning prunes while the tree is being built
  • Post-pruning prunes after the tree has been built

The exact construction procedure differs between algorithms; common decision tree algorithms include ID3, C4.5 and CART.
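As a companion to the three steps above, here is a minimal recursive builder in Python. It is a toy ID3-style sketch (information gain, no pruning, plain dictionaries as nodes), not a production implementation, and the tiny example data is made up:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute with the largest information gain."""
    def gain(attr):
        cond = 0.0
        for v in set(r[attr] for r in rows):
            sub = [lbl for r, lbl in zip(rows, labels) if r[attr] == v]
            cond += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - cond
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:        # stopping condition: the node is pure
        return labels[0]
    if not attributes:               # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    tree = {attr: {}}
    for v in set(r[attr] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[attr] == v]
        tree[attr][v] = build_tree([rows[i] for i in idx],
                                   [labels[i] for i in idx],
                                   [a for a in attributes if a != attr])
    return tree

# Tiny hypothetical example.
rows = [{"house": "yes", "married": "yes"}, {"house": "no", "married": "yes"},
        {"house": "no", "married": "no"},  {"house": "no", "married": "no"}]
labels = ["no", "no", "yes", "yes"]
print(build_tree(rows, labels, ["house", "married"]))  # "married" is chosen as the root split
```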


2.5.1.1 Advantages and disadvantages of ID3 algorithm

The ID3 algorithm is a decision tree algorithm. Its advantages include simple and easy to understand, clear theory, simple method, strong learning ability, good robustness, not affected by noise, and can train instances lacking attribute values.

However, the ID3 algorithm also has some disadvantages :

  • It only handles categorical (discrete) features; continuous features such as length or density cannot be used directly, which greatly limits its applicability. The ID3 algorithm can only build decision trees for datasets whose attributes are all discrete.
  • The ID3 algorithm does not handle missing values and does not address overfitting.
  • When selecting the root node and the branching attribute at each internal node, the ID3 algorithm uses the information gain $\mathrm{Gain}(D)$ as the evaluation criterion. The drawback of information gain is that it prefers attributes with many values, which in some cases may not carry much useful information.

2.5.1.2 Advantages and disadvantages of C4.5 algorithm

The C4.5 algorithm is a decision tree algorithm, which is an extension and optimization of the ID3 algorithm. The C4.5 algorithm has improved the ID3 algorithm, and the improvements mainly include:

  • It uses the information gain ratio $\mathrm{Gain\ Ratio}(D)$ to select splitting features, overcoming the bias of information gain toward attributes with many values
  • Able to handle discrete and continuous attribute types, that is, to discretize continuous attributes
  • Ability to handle training data with missing attribute values
  • pruning during tree construction

The C4.5 algorithm has the advantages of being clear, able to handle continuous attributes, preventing overfitting, high accuracy and wide application range. However, the C4.5 algorithm also has some disadvantages :

  • C4.5 performs multiple sequential scans and sorts on the data, which is inefficient
  • Although C4.5 uses the more refined information gain ratio $\mathrm{Gain\ Ratio}(D)$, the gain ratio in turn tends to prefer attributes with fewer possible values
  • C4.5 is only suitable for data sets that can reside in memory. When the training set is too large to fit in memory, the program cannot run

2.5.1.3 Advantages and disadvantages of CART algorithm

The CART algorithm is a binary decision tree algorithm that can be used for both classification and regression problems . In classification problems, the CART algorithm uses the Gini coefficient as a feature selection criterion. The CART algorithm can handle both continuous and discrete attribute types, and can handle training data with missing attribute values.

Compared with ID3 and C4.5, the advantages of CART algorithm are as follows:

  • Using a simplified binary tree structure, the operation speed is faster
  • The CART algorithm can be used not only for classification problems, but also for regression problems

Disadvantages:

  • Like ID3 and C4.5, CART chooses splits greedily, so the resulting tree is not guaranteed to be globally optimal, and a single unpruned CART tree is still prone to overfitting.

Notice:

  • C4.5 is not necessarily a binary tree, but CART must be a binary tree

2.5.1.4 Multi-variate Decision Tree

Whether it is ID3, C4.5 or CART, feature selection picks a single optimal feature to make each splitting decision. In many cases, however, the decision should not depend on one feature alone but on a combination of features, which yields a more accurate tree. A decision tree generated from a combination of features is called a multivariate decision tree (Multi-variate Decision Tree).

When selecting the optimal feature, a multivariate decision tree does not pick one best feature but an optimal linear combination of features to make the decision. A representative algorithm of this kind is OC1, which is not covered here.

In general, a small change in the samples can cause drastic changes in the tree structure. This problem can be alleviated by ensemble learning methods such as random forests.

2.5.2 Two types of decision tree variables

1. Numeric

The variable is an integer or floating-point number, such as "annual income" in the earlier example. Conditions of the form >=, >, < or <= are used for splitting (after sorting the values, existing split points can be reused to reduce the time complexity of the splitting algorithm).

2. Nominal

Similar to enumeration types in programming languages: the variable can only take one of a limited set of values, such as "marital status" in the earlier example, which can only be "single", "married" or "divorced"; = is used for splitting.

2.5.3 How to evaluate the quality of the segmentation point?

If a split point divides the records at the current node into two groups such that each group is very "pure" (i.e. dominated by records of a single class), then it is a good split point.

For example, in the earlier loan example, splitting on "owns property" divides the records into two groups. The "yes" node contains only borrowers who repaid their loans (no defaults), so it is very "pure". The "no" node mixes defaulted and non-defaulted loans, so it is less "pure". Nevertheless, the difference between the combined purity of the two child nodes and the purity of the original node is the largest among all candidate splits, so this split is chosen.

Decision trees are built with a greedy algorithm: at each step, only the split with the largest purity gain at the current node is considered.
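A minimal sketch of this greedy split selection for a numeric feature, using the Gini value as the purity measure (the income numbers below are made up for illustration and are not the exact table from the earlier example):

from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_k^2): the lower, the purer."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, threshold):
    """Weighted Gini value after splitting a numeric feature at `threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(values, labels):
    """Greedy choice: try midpoints of the sorted values, keep the largest purity gain."""
    base = gini(labels)
    candidates = sorted(set(values))
    thresholds = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    return max((base - gini_split(values, labels, t), t) for t in thresholds)

# Illustrative "annual income" vs. "defaulted" data (not the original table)
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
default = ["no", "no", "no", "yes", "yes", "yes", "no", "no", "no", "no"]
print(best_split(income, default))  # (largest purity gain, chosen threshold)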

2.5.4 [Expansion] Greedy algorithm, dynamic programming and divide and conquer algorithm

Q1 : What is a greedy algorithm?
A1 : The greedy algorithm is an algorithm that takes the best or optimal (that is, the most favorable) choice in the current state in each step of selection, hoping to lead to the best or optimal result . Greedy algorithms are especially effective in problems with optimal substructure. The optimal substructure means that the local optimal solution can determine the global optimal solution. Simply put, the problem can be decomposed into sub-problems to solve, and the optimal solution of the sub-problems can be deduced to the optimal solution of the final problem.

A greedy algorithm does not guarantee a globally optimal solution, because a sequence of locally best choices is not always globally best. For many problems, however, a greedy algorithm produces results that are very close to, or even equal to, the optimal solution.
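A classic illustration of both points, as a small sketch unrelated to decision trees: greedy change-making is optimal for a canonical coin system but can fail for an unusual one:

def greedy_change(amount, coins=(25, 10, 5, 1)):
    """At each step, take the largest coin that still fits (locally optimal choice)."""
    result = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            result.append(coin)
    return result

print(greedy_change(63))                  # [25, 25, 10, 1, 1, 1] -- optimal here
print(greedy_change(6, coins=(4, 3, 1)))  # [4, 1, 1] -- not optimal; 3 + 3 would be better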


Q2 : What is dynamic programming?
A2 : Dynamic programming is an algorithm used to solve optimization problems. It is solved by breaking the problem into smaller subproblems, usually using recursive or iterative methods . Dynamic programming is often used to solve problems with overlapping subproblems and optimal substructure properties.

Overlapping subproblems means that the subproblems will be calculated multiple times, so the dynamic programming algorithm will store the solutions of the subproblems that have already been calculated to avoid repeated calculations. The optimal substructure property means that the optimal solution of a problem can be constructed from the optimal solutions of its subproblems.

Dynamic programming algorithms are often used to solve combinatorial optimization problems, such as the shortest path, longest common subsequence, and knapsack problems.
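As a minimal sketch of one of these problems, a bottom-up dynamic-programming solution to the 0/1 knapsack (the weights, values and capacity are made up for illustration):

def knapsack(weights, values, capacity):
    """Bottom-up DP: dp[c] is the best value achievable with capacity c."""
    dp = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacity downwards so each item is used at most once
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]

print(knapsack(weights=[1, 3, 4], values=[15, 20, 30], capacity=4))  # 35 (items of weight 1 and 3)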


Q3 : What is a divide and conquer algorithm?
A3 : The divide-and-conquer approach solves a problem by decomposing a complex problem into several smaller sub-problems, solving these sub-problems recursively, and finally combining the sub-problem solutions into the solution of the original problem. Divide-and-conquer algorithms are often used for problems that have a recursive structure.

The basic steps of the divide and conquer algorithm include:

  1. Decomposition : Decompose the original problem into several smaller sub-problems.
  2. Solving : Solve each subproblem recursively. If the size of the subproblem is small enough, it is solved directly.
  3. Merging : Combining the solutions to the subproblems to form a solution to the original problem.

Classic applications of divide and conquer algorithms include quick sort, merge sort, and binary search .
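A minimal sketch of the divide-and-conquer pattern using merge sort (decompose, solve recursively, merge):

def merge_sort(seq):
    """Divide: split in half; conquer: sort each half recursively; combine: merge."""
    if len(seq) <= 1:                        # small enough -> solve directly
        return list(seq)
    mid = len(seq) // 2
    left, right = merge_sort(seq[:mid]), merge_sort(seq[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):  # merge the two sorted halves
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]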


Q4 : What is the difference between greedy algorithm, dynamic programming and divide and conquer algorithm?
A4 : Greedy algorithm, dynamic programming and divide and conquer algorithm are all algorithms used to solve optimization problems, but there are some differences between them.

  • The greedy algorithm takes the optimal choice in the current state at each step, hoping to get the global optimal solution. It does not guarantee a globally optimal solution, but for many problems, a greedy algorithm can produce results very close to the optimal solution, or even the optimal solution.

  • Dynamic programming solves a problem by breaking it down into smaller subproblems, usually using recursive or iterative methods. It is often used to solve problems with overlapping subproblems and optimal substructure properties. The dynamic programming algorithm will store the solutions of the sub-problems that have been calculated to avoid repeated calculations.

  • The divide and conquer algorithm decomposes a complex problem into several smaller sub-problems, then recursively solves these sub-problems, and finally combines the solutions of the sub-problems to obtain the solution of the original problem . Divide and conquer algorithms are often used to solve problems that have a recursive structure.

In conclusion, greedy algorithm, dynamic programming and divide and conquer algorithm are all algorithms used to solve optimization problems, but they take different approaches. The greedy algorithm takes a local optimal choice at each step; the dynamic programming avoids repeated calculations by decomposing the problem and storing the solutions of the sub-problems; the divide-and-conquer algorithm obtains the solution of the original problem by decomposing the problem and recursively solving the sub-problems.


Q5 : Both dynamic programming and divide and conquer algorithms use recursion. Is there any difference between them?
A5 : Both dynamic programming and divide and conquer algorithms can use recursion to solve problems, but there are some important differences between them.

  • The divide-and-conquer algorithm solves a complex problem by decomposing it into several smaller sub-problems, then recursively solves these sub-problems, and finally combines the solutions of the sub-problems to obtain the solution of the original problem. Divide and conquer algorithms are often used to solve problems that have a recursive structure.

  • Dynamic programming also solves problems by breaking them down into smaller subproblems, but it is usually used to solve problems with overlapping subproblems and optimal substructure properties. The dynamic programming algorithm will store the solutions of the sub-problems that have been calculated to avoid repeated calculations . This method of storing already computed solutions to subproblems is called "memoization".

In summary, both dynamic programming and divide and conquer algorithms can use recursion to solve problems, but they take different approaches.

  • The divide and conquer algorithm obtains the solution of the original problem by decomposing the problem and recursively solving the sub-problems ;
  • Dynamic programming avoids double computation by decomposing the problem and storing the solutions to the subproblems .
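A minimal sketch of memoization using the classic Fibonacci example, where a naive recursion would recompute the same subproblems many times (here functools.lru_cache does the storing for us):

from functools import lru_cache

@lru_cache(maxsize=None)            # store already-computed subproblem solutions
def fib(n):
    if n < 2:                       # base cases
        return n
    return fib(n - 1) + fib(n - 2)  # overlapping subproblems, each computed only once

print(fib(50))  # 12586269025 -- instant, whereas the unmemoized recursion would be extremely slow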

3. CART pruning

Learning Objectives :

  • Learn why you should do CART pruning
  • Know the commonly used CART pruning methods

3.1 Why pruning

insert image description here

Graphic description :

  • The horizontal axis represents the total number of nodes in the tree during the process of creating the decision tree ( the total number of nodes in the decision tree can be understood as the complexity of the model, the more nodes, the more complex the model ), and the vertical axis represents the prediction accuracy of the decision tree
  • The solid line shows the accuracy of the decision tree on the training set, and the dashed line shows the accuracy on an independent test set

As the tree grows, the accuracy on the training sample increases monotonically, while the accuracy measured on the independent test examples first increases and then decreases. Clearly, the model (decision tree) is overfitting!

Reasons for this behavior :

  • Reason 1: Noise and conflicting samples, i.e. erroneous sample data (the tree learns from wrong samples and features)
  • Reason 2: Features (i.e. attributes) cannot fully serve as classification criteria
  • Reason 3: The tree learns coincidental regularities that are not true patterns, usually because the amount of data is not large enough

Pruning is the main means of decision tree learning algorithm to deal with "overfitting".

Pruning: UK [ˈpruːnɪŋ] US [ˈpruːnɪŋ]
v. to cut back branches; to trim; to streamline
n. pruning; trimming
adj. used for pruning

In decision tree learning, in order to classify the training samples as correctly as possible, the node-splitting process is repeated over and over, which sometimes produces too many branches. Peculiarities of the training set itself may then be treated as general properties of all data, leading to overfitting. Actively removing some branches therefore reduces the risk of overfitting.

Q: How do we judge whether the generalization performance of a decision tree has improved?
A: We can use the hold-out method described earlier, i.e. reserve part of the data as a "test set" for performance evaluation.
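A minimal sketch of this hold-out evaluation with scikit-learn (the dataset, split ratio and max_depth here are only illustrative; the watermelon example below performs the same kind of split by hand):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22)

full_tree = DecisionTreeClassifier(random_state=22).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=22).fit(X_train, y_train)

# Training accuracy keeps climbing with tree size; the held-out test set reveals overfitting
print("full tree   train/test:", full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("pruned tree train/test:", pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))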

For example, for the watermelon dataset in the tables below, we randomly split it into two parts: the samples numbered {1, 2, 3, 6, 7, 10, 14, 15, 16, 17} form the training set, and the samples numbered {4, 5, 8, 9, 11, 12, 13} form the test set.

Training set:

| No. | color | roots | knock | texture | navel | touch | good melon |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | green | curled up | loud | clear | sunken | hard and slippery | yes |
| 2 | jet black | curled up | dull | clear | sunken | hard and slippery | yes |
| 3 | jet black | curled up | loud | clear | sunken | hard and slippery | yes |
| 6 | green | slightly curled up | loud | clear | slightly concave | soft and sticky | yes |
| 7 | jet black | slightly curled up | loud | slightly blurred | slightly concave | soft and sticky | yes |
| 10 | green | stiff | crisp | clear | flat | soft and sticky | no |
| 14 | light white | slightly curled up | dull | slightly blurred | sunken | hard and slippery | no |
| 15 | jet black | slightly curled up | loud | clear | slightly concave | soft and sticky | no |
| 16 | light white | curled up | loud | blurred | flat | hard and slippery | no |
| 17 | green | curled up | dull | slightly blurred | slightly concave | hard and slippery | no |

Test set:

| No. | color | roots | knock | texture | navel | touch | good melon |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | green | curled up | dull | clear | sunken | hard and slippery | yes |
| 5 | light white | curled up | loud | clear | sunken | hard and slippery | yes |
| 8 | jet black | slightly curled up | loud | clear | slightly concave | hard and slippery | yes |
| 9 | jet black | slightly curled up | dull | slightly blurred | slightly concave | soft and sticky | no |
| 11 | light white | stiff | crisp | blurred | flat | hard and slippery | no |
| 12 | light white | curled up | loud | blurred | flat | soft and sticky | no |
| 13 | green | slightly curled up | loud | slightly blurred | sunken | hard and slippery | no |

Suppose we use the information gain criterion $\mathrm{Gain}(D)$ to select the splitting attributes. The training set in the table above then generates the decision tree shown below. For ease of discussion, we have numbered some of the nodes in the figure.

insert image description here

Next, let us look at how to prune this tree.

3.2 Common pruning methods

The basic strategies for decision tree pruning are pre-pruning and post-pruning:

  • Pre-pruning: during tree construction, each node is evaluated before it is split; if splitting the current node does not improve the generalization performance of the tree, the split is abandoned and the node is marked as a leaf (it gets no children)
  • Post-pruning: a complete decision tree is first grown from the training set, then the non-leaf nodes are examined bottom-up; if replacing the subtree rooted at a node with a leaf improves the generalization performance, the subtree is replaced by a leaf

3.2.1 Pre-pruning

First, based on the information gain criterion, we select the attribute "navel" to split the training set, producing 3 branches, as shown in the figure below. But should this split be made at all? Pre-pruning requires us to estimate the generalization performance before and after the split.

insert image description here

Before any split, all samples are concentrated in the root node.

  • If we do not split, this node is marked as a leaf (no children) and its class label is the majority class among the training samples; suppose we label this leaf "good melon".
  • Evaluating this single-node decision tree on the test set above, the samples numbered {4, 5, 8} are classified correctly and the other 4 samples are misclassified, so the test accuracy is $\frac{3}{7}\times 100\% = 42.9\%$.

After splitting on the attribute "navel", nodes ②, ③ and ④ in the figure contain the training samples numbered {1, 2, 3, 14}, {6, 7, 15, 17} and {10, 16} respectively, so these 3 nodes are marked as the leaves "good melon", "good melon" and "bad melon".

insert image description here

Now the test samples numbered {4, 5, 8, 11, 12} are classified correctly, and the test accuracy is $\frac{5}{7}\times 100\% = 71.4\% > 42.9\%$.

Therefore the split on "navel" is kept ("navel" is confirmed as the root node).

Next, the algorithm would split node ②, and based on the information gain criterion the attribute "color" would be selected. However, after splitting on "color", the test sample numbered {5} changes from correctly classified to misclassified, so the test accuracy drops to $57.1\%$. The pre-pruning strategy therefore forbids splitting node ②.

For node ③, the best splitting attribute is "roots", but the split leaves the test accuracy at $71.4\%$. Since it does not improve the test accuracy, pre-pruning forbids splitting node ③.

For node ④, all of its training samples already belong to the same class, so it is not split any further.

Thus the decision tree generated from the table above with the pre-pruning strategy is the one shown above, with a test accuracy of $71.4\%$. It is a decision tree with only one level of splits, also called a "decision stump".

3.2.2 Post-pruning

Post-pruning first grows a complete decision tree from the training set. Continuing with the case above, the complete tree we built earlier has a test accuracy of $42.9\%$.

insert image description here

Post-pruning first examines node ⑥. If the branch it heads is cut off (i.e. node ⑥ is replaced by a leaf), the resulting leaf contains the training samples numbered {7, 15} and is therefore labeled "good melon". The test accuracy of the tree then rises to $57.1\%$, so the post-pruning strategy decides to prune, as shown below.

insert image description here

Next, node ⑤ is examined. If the subtree it heads is replaced by a leaf, the leaf contains the training samples numbered {6, 7, 15} and is labeled "good melon". The test accuracy remains $57.1\%$, so no pruning needs to be performed here.

For node ②: if the subtree it heads is replaced by a leaf, the leaf contains the training samples numbered {1, 2, 3, 14} and is labeled "good melon". The test accuracy then rises to $71.4\%$, so the post-pruning strategy decides to prune.

For nodes ③ and ①: replacing the subtrees they head with leaves gives test accuracies of $71.4\%$ and $42.9\%$ respectively, neither of which is an improvement, so both subtrees are kept.

Finally, the decision tree generated by the post-pruning strategy is the one shown above (only a few branches were cut), and its test accuracy is $71.4\%$.

Comparison of the two pruning methods

  • A post-pruned decision tree usually keeps more branches than a pre-pruned one.
  • In general, the risk of underfitting with post-pruning is small, and its generalization performance is often better than that of pre-pruning.
  • However, post-pruning is carried out only after the complete tree has been grown, and every non-leaf node has to be examined bottom-up one by one, so its training time cost is much higher than that of an unpruned or pre-pruned tree.

Q1: What if the accuracy stays the same during pruning?
A1: If the accuracy is unchanged before and after pruning, the decision depends on the situation.

  • In pre-pruning, stopping the growth of the tree early reduces computation, so pruning is usually performed.
  • In post-pruning, pruning simplifies the model and reduces its complexity, so pruning may also be performed. Neither rule is absolute; the decision depends on the actual application scenario and requirements.

Q2: Which is better, pre-pruning or post-pruning?
A2: Both have their own advantages and disadvantages; which is better depends on the application.

  • The advantage of pre-pruning is speed: it stops the growth of the tree early during construction and reduces computation. However, pre-pruning sometimes over-simplifies the model and causes underfitting.

  • The advantage of post-pruning is that it avoids underfitting better, because pruning is done only after the complete tree has been built. Its computation cost is larger and it is slower than pre-pruning.

In short, both pre-pruning and post-pruning are effective ways to prevent overfitting. In practice, the choice depends on factors such as dataset size, model complexity and available computing power.
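For reference, a minimal sketch of how the two strategies map onto scikit-learn's DecisionTreeClassifier: pre-pruning via constructor parameters such as max_depth and min_samples_leaf, and post-pruning via cost-complexity pruning (ccp_alpha); the dataset and the choice of alpha are only illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

# Pre-pruning: stop growing early by limiting depth / minimum samples per leaf
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=22)
pre.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path, then refit with a chosen alpha
path = DecisionTreeClassifier(random_state=22).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # pick one candidate alpha for the sketch
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=22)
post.fit(X_train, y_train)

print("pre-pruned  test accuracy:", pre.score(X_test, y_test))
print("post-pruned test accuracy:", post.score(X_test, y_test))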


Summary

  • Reasons for pruning [understand]
    • Noise and conflicting samples, i.e. erroneous sample data
    • Features (attributes) cannot fully serve as classification criteria
    • Coincidental regularities learned because the amount of data is not large enough
  • Common pruning methods [know]
    • Pre-pruning: prune while building the tree
      • limit the minimum number of samples per node
      • limit the tree depth
      • set a minimum entropy threshold
    • Post-pruning: build the complete tree first, then prune it from the bottom up

4. Feature Engineering - Feature Extraction

Learning objectives

  • Understand what feature extraction is
  • Know the workflow of dictionary feature extraction
  • Know the workflow of text feature extraction
  • Know the idea behind TF-IDF

What is feature extraction?

insert image description here

When we want a machine to recognize text, it cannot work with the raw characters directly, so we convert the text into numbers to make it usable by the machine.

4.1 Feature extraction

4.1.1 Definition

Feature extraction converts arbitrary data (such as text or images) into numerical features that can be used for machine learning.

Note: features are turned into numbers so that the computer can understand the data better.

  • Types of feature extraction
    • dictionary feature extraction (feature discretization)
    • text feature extraction
    • image feature extraction (introduced later in deep learning)

4.1.2 Feature extraction API

sklearn.feature_extraction

4.2 Dictionary feature extraction

sklearn.feature_extraction.DictVectorizer is a class that converts a list of feature-value mappings into vectors. This transformer turns lists of mappings from feature names to feature values (dict-like objects) into NumPy arrays or scipy.sparse matrices for use with scikit-learn estimators (models).

sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
  • Purpose: turn dictionary data into numerical features.
  • Parameters
    • dtype: defaults to np.float64. The type of feature values, passed as the dtype argument to the NumPy array or scipy.sparse matrix constructor.
    • separator: defaults to '='. The separator string used when constructing new features during one-hot encoding.
    • sparse: defaults to True. Whether a scipy.sparse matrix should be produced.
    • sort: defaults to True. Whether feature_names_ and vocabulary_ should be sorted when fitting.

sparse: UK [spɑːs] US [spɑːrs]
adj. scarce; sparse; scattered


Class method

DictVectorizer.fit_transform(X)
  • Purpose: learn the mapping from feature names to indices and transform the input data into vectors.
  • Parameters
    • X: the input data, a list of dictionaries where each dictionary represents one sample, keys are feature names and values are feature values.
    • y: optional, defaults to None. Target values, only present for compatibility with scikit-learn pipelines and model-selection tools.

Class method

DictVectorizer.get_feature_names_out()
  • Purpose: returns a list of feature names, in the same order as the features in the transformed vectors.
  • Parameters
    • input_features: optional, defaults to None. Input feature names, used to generate output feature names.

4.2.1 Application

We perform feature extraction on the following data:

[{'city': '北京', 'temperature': 100},
 {'city': '上海', 'temperature': 60},
 {'city': '深圳', 'temperature': 30}]

4.2.2 Workflow

  1. Instantiate the DictVectorizer class
  2. Call the fit_transform method to feed in the data and transform it (mind the return format)
from sklearn.feature_extraction import DictVectorizer


data = [{'city': '北京', 'temperature': 100},
        {'city': '上海', 'temperature': 60},
        {'city': '深圳', 'temperature': 30}]

# 1. Instantiate a transformer class
transfer = DictVectorizer(sparse=False)  # do not produce a scipy.sparse matrix

# 2. Call the fit_transform method
data = transfer.fit_transform(data)
print(f"Returned result:\r\n {data}")

print(f"Feature names:\r\n {transfer.get_feature_names_out()}")

Result:

Returned result:
[[  0.   1.   0. 100.]
 [  1.   0.   0.  60.]
 [  0.   0.   1.  30.]]
Feature names:
['city=上海' 'city=北京' 'city=深圳' 'temperature']

This code uses the DictVectorizer class from scikit-learn to extract features from dictionary data. DictVectorizer converts dictionary data into a matrix: numerical features are kept as they are, while categorical features are converted into one-hot encodings.

In this example, the original data contains three samples, each with two features: city and temperature. City is a categorical feature and temperature is a numerical one.

After transforming the original data with DictVectorizer we obtain a 3x4 matrix: the first three columns are the one-hot encodings for the cities 上海, 北京 and 深圳, and the fourth column is the temperature.

For example, the city of the first sample is 北京, so the second column (corresponding to 北京) is 1 and the other city columns are 0; the city of the second sample is 上海, so the first column (corresponding to 上海) is 1; the city of the third sample is 深圳, so the third column (corresponding to 深圳) is 1. The temperature feature is unchanged.


Note: if the sparse=False argument is omitted, the result is:

Returned result:
  (0, 1)	1.0
  (0, 3)	100.0
  (1, 0)	1.0
  (1, 3)	60.0
  (2, 2)	1.0
  (2, 3)	30.0
Feature names:
['city=上海' 'city=北京' 'city=深圳' 'temperature']

Without sparse=False, DictVectorizer returns a sparse matrix (a scipy.sparse matrix). A sparse matrix only stores the non-zero elements, which can save a lot of memory.

In this example the returned sparse matrix is stored in coordinate format (COO format). Each row describes one non-zero element: the first two numbers are its coordinates in the matrix (row and column), and the third number is its value.

For example, the first row (0, 1) 1.0 means there is a non-zero element with value 1.0 at row 0, column 1; the second row (0, 3) 100.0 means there is a non-zero element with value 100.0 at row 0, column 3.

Converting this sparse matrix to a dense matrix gives the same result as before:

[[  0.   1.   0. 100.]
 [  1.   0.   0. 60.]
 [  0.   0.   1. 30.]]

Extension: one-hot encoding

When we studied discretization in pandas earlier, we achieved a similar effect. This data-processing trick is called "one-hot" encoding.

Original data

insert image description here

Data after one-hot conversion

insert image description here

What we do is generate one boolean column per category. For each sample only one of these columns can take the value 1, hence the name one-hot encoding.


Summary

  • Whenever a feature contains category information, we apply one-hot encoding to it.

4.3 Text feature extraction

sklearn.feature_extraction.text.CountVectorizer is a class that converts a collection of text documents into a matrix of token counts. The implementation uses scipy.sparse.csr_matrix to produce a sparse representation of the counts.


A token count matrix is a matrix that represents a collection of text documents. Each row corresponds to a document and each column to a token (a word or phrase). Each element is the number of times that token appears in that document.

For example, suppose we have two documents:

Document 1: "I love dogs"
Document 2: "I love cats and dogs"

The corresponding token count matrix is:

             I  love  dogs  cats  and
Document 1:  1    1     1     0    0
Document 2:  1    1     1     1    1

sklearn.feature_extraction.text.CountVectorizer converts a collection of text documents into such a token count matrix. It uses scipy.sparse.csr_matrix to produce a sparse representation of the counts, which can save a lot of memory.


sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
  • Purpose: turn text data into numerical features.
  • Parameters
    • stop_words: specifies the stop word list.
      • If set to 'english', the built-in English stop word list is used.
      • If set to a list, that list is taken to contain the stop words, all of which are removed from the resulting tokens.
      • If set to None, no stop words are used. In this case, setting max_df to a high value (for example in the range 0.7 to 1.0) can automatically detect and filter stop words based on corpus document frequency.

Method 1: fit_transform

CountVectorizer.fit_transform(X)
  • Purpose: learn the vocabulary and convert the collection of text documents into a token count matrix.
  • Parameters
    • X: the input data, a list of strings where each string is one document.
    • y: optional, defaults to None. Target values, only present for compatibility with scikit-learn pipelines and model-selection tools.
      Return value: a sparse matrix containing the token counts of every document in the collection. Each row is a document, each column a feature (token), and each element the number of times that token appears in that document.

Method 2: CountVectorizer.get_feature_names_out()

CountVectorizer.get_feature_names_out()
  • Purpose: returns a list of feature names, in the same order as the features in the transformed vectors.
  • Parameters
    • input_features: optional, defaults to None. Input feature names, used to generate output feature names.

sklearn.feature_extraction.text.TfidfVectorizer

This will be explained later.

4.3.1 Application

We perform feature extraction on the following data:

["life is short, i like python",
"life is too long, i dislike python"]

4.3.2 Workflow

  1. Instantiate the CountVectorizer class
  2. Call the fit_transform method to feed in the data and transform it (mind the return format; use toarray() to convert the sparse matrix into an array)
from sklearn.feature_extraction.text import CountVectorizer


data = ["life is short, i like python",
        "life is too long, i dislike python"]

# 1. Instantiate a transformer class
transfer = CountVectorizer()

# 2. Call the fit_transform method
data = transfer.fit_transform(raw_documents=data)

# Use toarray to convert the sparse matrix into an ndarray
print("Returned feature names:", transfer.get_feature_names_out())
print("Text feature extraction result:\r\n", data.toarray())  # convert the sparse matrix into a dense NumPy array

# Print the sparse matrix itself
print("\r\nText feature extraction result (sparse matrix):\r\n", data)

The result is as follows:

Returned feature names: ['dislike' 'is' 'life' 'like' 'long' 'python' 'short' 'too']
Text feature extraction result:
 [[0 1 1 1 0 1 1 0]
 [1 1 1 0 1 1 0 1]]

Text feature extraction result (sparse matrix):
  (0, 2)	1
  (0, 1)	1
  (0, 6)	1
  (0, 3)	1
  (0, 5)	1
  (1, 2)	1
  (1, 1)	1
  (1, 5)	1
  (1, 7)	1
  (1, 4)	1
  (1, 0)	1

The code above uses the CountVectorizer class from the sklearn.feature_extraction.text module to extract text features from the given data. Calling fit_transform converts the data into a sparse matrix of token counts; toarray then converts the sparse matrix into a dense NumPy array; get_feature_names_out returns the feature names corresponding to the columns of the transformed data. The output shows the results of these operations.

The CountVectorizer class converts text data into numerical feature vectors by counting how many times each word appears in each text.

In this example, the given data contains two strings: "life is short, i like python" and "life is too long, i dislike python". CountVectorizer first tokenizes these texts, obtaining 8 distinct words: 'dislike', 'is', 'life', 'like', 'long', 'python', 'short' and 'too'. It then counts how often each word appears in each text and turns the counts into feature vectors.

So the first string "life is short, i like python" becomes the feature vector [0 1 1 1 0 1 1 0]: 'dislike' appears 0 times, 'is' 1 time, 'life' 1 time, 'like' 1 time, 'long' 0 times, 'python' 1 time, 'short' 1 time and 'too' 0 times.

Likewise, the second string "life is too long, i dislike python" becomes [1 1 1 0 1 1 0 1]: 'dislike' appears 1 time, 'is' 1 time, 'life' 1 time, 'like' 0 times, 'long' 1 time, 'python' 1 time, 'short' 0 times and 'too' 1 time.


Q1: What does tokenization depend on, and why does 'i' not appear?
A1: Tokenization is the process of splitting a text string into individual words. For English text this is usually done on whitespace and punctuation, since English words are separated by spaces.

In this example, CountVectorizer uses its default tokenizer, which splits on whitespace and punctuation. The two strings are therefore split into the words 'life', 'is', 'short', 'i', 'like', 'python', 'life', 'is', 'too', 'long', 'i', 'dislike' and 'python'.

However, CountVectorizer's default token_pattern only keeps tokens that contain at least two alphanumeric characters, so single-letter words such as 'i' (and all punctuation) are dropped when the feature vectors are built. (By default no stop word list is applied: stop_words is None.)

If you want to keep single-letter words, you can pass a custom token pattern when creating the CountVectorizer object, for example:

transfer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")

With this pattern no words are dropped. You can additionally pass stop_words to explicitly ignore common, uninformative words such as 'I', 'me', 'my', 'we' and so on.

Simply put, stop words are words that are not counted.


Q2: What if we replace the data with Chinese text?

"人生苦短，我喜欢Python", "生活太长久，我不喜欢Python"

A2: Then the final result is:

Returned feature names: ['人生苦短' '我不喜欢python' '我喜欢python' '生活太长久']
Text feature extraction result:
 [[1 0 1 0]
 [0 1 0 1]]

Text feature extraction result (sparse matrix):
  (0, 0)	1
  (0, 2)	1
  (1, 3)	1
  (1, 1)	1

The correct segmentation should be: '人生' '苦短' '不喜欢' 'python' '喜欢' '生活' '太长久'. The problem is that CountVectorizer itself does not support Chinese word segmentation. Its default tokenizer splits on whitespace and punctuation, which works for English but not for Chinese, because Chinese words are usually not separated by spaces.

4.3.3 Word segmentation with jieba

jieba.cut is the main method in the jieba library for segmenting text. It takes a string as input and returns a generator that yields the segmented words.

jieba: 结巴 ("stutter")

jieba.cut()
  • Purpose: segment Chinese text into words
  • Parameters
    • sentence: the text string to segment.
    • cut_all: whether to use full mode.
      • If True, all possible segmentations are returned;
      • If False, the most precise segmentation is returned. Defaults to False.
    • HMM: whether to use a hidden Markov model (HMM) to recognize out-of-vocabulary words. Defaults to True.
    • use_paddle: whether to use the PaddlePaddle deep learning framework for segmentation. Defaults to False.

Here is a simple example showing how to segment text with jieba.cut:

import jieba

text = "人生苦短，我喜欢Python"

# Segment using precise mode
words = jieba.cut(text, cut_all=False)
print("Precise mode result:", "/".join(words))

# Segment using full mode
words = jieba.cut(text, cut_all=True)
print("Full mode result:", "/".join(words))

The output is:

Precise mode result: 人生/苦短/，/我/喜欢/Python
Full mode result: 人生/苦短/，/我/喜欢/Python

As you can see, the choice of parameters can affect the segmentation result.


如果要使用 CountVectorizer 处理中文文本,需要自定义一个中文分词器,并将其传递给 CountVectorizer。可以使用第三方中文分词库,例如 jieba 来实现。

下面是一个简单的例子,展示了如何使用 jieba 库和 CountVectorizer 对中文文本进行特征提取:

import jieba
from sklearn.feature_extraction.text import CountVectorizer

def cut_word(text):
    # 使用jieba库进行中文分词
    return " ".join(list(jieba.cut(text)))

data = ["人生苦短,我喜欢Python", 
        "生活太长久,我不喜欢Python"]

# 对中文文本进行分词
data = [cut_word(text) for text in data]

# 创建CountVectorizer对象,并指定自定义的分词器
transfer = CountVectorizer(tokenizer=lambda text: text.split())

# 调用fit_transform方法
data = transfer.fit_transform(raw_documents=data)

# 输出结果
print("返回的特征名称为:", transfer.get_feature_names_out())
print("文本特征抽取的结果为:\n", data.toarray())

The output is:

Returned feature names: ['python' '不' '人生' '喜欢' '太长久' '我' '生活' '苦短' '，']
Text feature extraction result:
 [[1 0 1 1 0 1 0 1 1]
 [1 1 0 1 1 1 1 0 1]]

There is still a problem here: stop words are not being handled (in principle they should be excluded).

4.3.4 Case study

We extract features from the following three sentences:

今天很残酷，明天更残酷，后天很美好，但绝对大部分是死在明天晚上，所以每个人不要放弃今天。

我们看到的从很远星系来的光是在几百万年之前发出的，这样当我们看到宇宙时，我们是在看它的过去。

如果只用一种方式了解某样事物，你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。
  • Analysis
    • Prepare the sentences and segment them with jieba.cut().
    • Instantiate CountVectorizer.
    • Turn the segmentation results into strings and use them as the input to fit_transform().
from sklearn.feature_extraction.text import CountVectorizer
import jieba


def cut_word(text):
    # Split a Chinese string with jieba
    text = " ".join(list(jieba.cut(text)))  # join the segmented words with spaces
    return text


if __name__ == "__main__":
    data = ["今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。", 
            "我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。",
            "如果只用一种方式了解某样事物,你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。"]
    
    # Convert the raw data into segmented form
    text_lst = []
    for sentence in data:
        text_lst.append(cut_word(sentence))
    print("Segmented data:\r\n", text_lst)
    
    # 1. Instantiate a transformer class (for Chinese we can also use stop words)
    transfer = CountVectorizer(stop_words=["一种", "今天", "我", ",", "。"])

    # 2. Call the fit_transform method
    data = transfer.fit_transform(raw_documents=text_lst)

    # Use toarray to convert the sparse matrix into an ndarray
    print("\r\nReturned feature names:", transfer.get_feature_names_out())
    print("\r\nText feature extraction result:\r\n", data.toarray())

The returned result:

Segmented data:
 ['今天 很 残酷 , 明天 更 残酷 , 后天 很 美好 , 但 绝对 大部分 是 死 在 明天 晚上 , 所以 每个 人 不要 放弃 今天 。', '我们 看到 的 从 很 远 星系 来 的 光是在 几百万年 之前 发出 的 , 这样 当 我们 看到 宇宙 时 , 我们 是 在 看 它 的 过去 。', '如果 只用 一种 方式 了解 某样 事物 , 你 就 不会 真正 了解 它 。 了解 事物 真正 含义 的 秘密 取决于 如何 将 其 与 我们 所 了解 的 事物 相 联系 。']

Returned feature names: ['不会' '不要' '之前' '了解' '事物' '光是在' '几百万年' '发出' '取决于' '只用' '后天' '含义' '大部分'
 '如何' '如果' '宇宙' '我们' '所以' '放弃' '方式' '明天' '星系' '晚上' '某样' '残酷' '每个' '看到'
 '真正' '秘密' '绝对' '美好' '联系' '过去' '这样']

Text feature extraction result:
 [[0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 2 0 1 0 2 1 0 0 0 1 1 0 0 0]
 [0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 1]
 [1 0 0 4 3 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 2 1 0 0 1 0 0]]

Note: there are many Chinese stop words, and we cannot type them in by hand every time, so we can reuse stop word lists compiled by others.

An example of a comprehensive stop word collection: 最全中文停用词表整理（1893个） (a compiled list of 1,893 Chinese stop words).


QCountVectorizer 是用来分词和数量统计的,这些次数统计对于我们来说有什么意义呢?
ACountVectorizer 是一个文本特征提取工具,它可以将文本数据转换为数值特征向量,以便用于机器学习模型。它通过统计每个文本中每个单词出现的次数来实现这一点

这些次数统计对于文本分析和自然语言处理任务来说非常重要。例如,在文本分类任务中,我们可以使用 CountVectorizer 将文本数据转换为数值特征向量,然后使用这些特征向量来训练分类器。分类器会根据每个文本中不同单词出现的次数来判断该文本属于哪个类别

此外,CountVectorizer 还可以用于其他自然语言处理任务,例如情感分析、主题建模和文本聚类等。在这些任务中,单词出现的次数也是非常重要的特征。

总之,CountVectorizer 提供了一种简单有效的方法,可以将文本数据转换为数值特征向量,以便用于各种自然语言处理任务。这些次数统计对于理解文本数据、挖掘文本中的信息以及构建有效的机器学习模型都非常重要


那如果把这样的词语特征用于分类,会出现什么问题

insert image description here

该如何处理某个词或短语在多篇文章中出现的次数高这种情况?

从上图中可以看到,“车”和“共享”出现的频率高,这篇文章可能是在说“共享单车”或“共享汽车”;“经济”“证券”“银行”出现的次数多,说明这篇文章可能与经融有关。

Q:那么上面这些分析是通过什么实现的呢?
A:就是我们之前按下不表的 TfidfVectorizer

4.3.5 TF-IDF text feature extraction

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical method for evaluating how important a word is to a document.

Main idea: if a word or phrase appears with high probability in one article but rarely appears in other articles, it is considered to have good discriminative power between categories and is suitable for classification.

Purpose: to measure the importance of a word within a document. If a word appears frequently in one document but rarely in others, its TF-IDF value is high, which means it is very informative for that document.

Applications: TF-IDF is widely used in natural language processing, for example in information retrieval, text classification, keyword extraction and text summarization. In an information retrieval system, TF-IDF can be used to compute the relevance between a query term and a document, so that the most relevant documents are returned.

In short, TF-IDF is a common text feature extraction method that measures the importance of words in documents and is used in many NLP tasks.

4.3.5.1 Formula

  • Term frequency (TF) is the frequency with which a given word appears in a document.
  • Inverse document frequency (IDF) measures how much general importance a word carries. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and taking the base-10 logarithm of the quotient.
  • The TF-IDF value is the product of TF and IDF:

$$\mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$$

The result can be interpreted as the importance of term $i$ within document $j$.


Example

Suppose a document contains 100 words in total and the word "very" appears 5 times. The term frequency (TF) of "very" in this document is then $\frac{5}{100} = 0.05$. The inverse document frequency (IDF) is computed from the total number of documents in the corpus divided by the number of documents containing "very".

So if "very" appears in 10,000 documents and the corpus contains 10,000,000 documents in total, its inverse document frequency is $\lg{\frac{10,000,000}{10,000}} = 3$.

The TF-IDF score of "very" for this document is therefore:

$$\mathrm{TFIDF}_{i,j} = \mathrm{TF}_{i,j}\times \mathrm{IDF}_i = 0.05 \times 3 = 0.15$$
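A minimal sketch of this hand calculation (note that the TfidfVectorizer used in the next section applies a smoothed variant of the IDF formula, so its numbers will not match this textbook version exactly):

import math

tf = 5 / 100                           # "very" appears 5 times in a 100-word document
idf = math.log10(10_000_000 / 10_000)  # total documents / documents containing "very"
print(tf, idf, tf * idf)               # 0.05, 3.0, 0.15 (up to floating-point rounding)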

4.3.5.2 Case

from sklearn.feature_extraction.text import TfidfVectorizer
import jieba


def cut_word(txt):
    # Segment a Chinese string with jieba (join with spaces)
    return " ".join(list(jieba.cut(txt)))


if __name__ == "__main__":
    data = ["今天很残酷,明天更残酷,后天很美好,但绝对大部分是死在明天晚上,所以每个人不要放弃今天。", 
            "我们看到的从很远星系来的光是在几百万年之前发出的,这样当我们看到宇宙时,我们是在看它的过去。",
            "如果只用一种方式了解某样事物,你就不会真正了解它。了解事物真正含义的秘密取决于如何将其与我们所了解的事物相联系。"]
    
    # 1. Segment the Chinese data with jieba
    txt_lst = []
    for sentence in data:
        txt_lst.append(cut_word(sentence))
    print("Segmented result:\r\n", txt_lst)
    
    # 2. Instantiate a TF-IDF transformer class
    transfer = TfidfVectorizer(stop_words=["一种", "今天", "我", "，", "。"])
    
    # 3. Call fit_transform
    data = transfer.fit_transform(txt_lst)
    print("\r\nFeature names:\r\n", transfer.get_feature_names_out())
    print("\r\nNumber of features:", len(transfer.get_feature_names_out()))
    print("\r\nText feature extraction result:\r\n", data.toarray())

The result is as follows:

Segmented result:
 ['今天 很 残酷 , 明天 更 残酷 , 后天 很 美好 , 但 绝对 大部分 是 死 在 明天 晚上 , 所以 每个 人 不要 放弃 今天 。', '我们 看到 的 从 很 远 星系 来 的 光是在 几百万年 之前 发出 的 , 这样 当 我们 看到 宇宙 时 , 我们 是 在 看 它 的 过去 。', '如果 只用 一种 方式 了解 某样 事物 , 你 就 不会 真正 了解 它 。 了解 事物 真正 含义 的 秘密 取决于 如何 将 其 与 我们 所 了解 的 事物 相 联系 。']

Feature names:
 ['不会' '不要' '之前' '了解' '事物' '光是在' '几百万年' '发出' '取决于' '只用' '后天' '含义' '大部分'
 '如何' '如果' '宇宙' '我们' '所以' '放弃' '方式' '明天' '星系' '晚上' '某样' '残酷' '每个' '看到'
 '真正' '秘密' '绝对' '美好' '联系' '过去' '这样']

Number of features: 34

Text feature extraction result:
 [[0.         0.24253563 0.         0.         0.         0.
  0.         0.         0.         0.         0.24253563 0.
  0.24253563 0.         0.         0.         0.         0.24253563
  0.24253563 0.         0.48507125 0.         0.24253563 0.
  0.48507125 0.24253563 0.         0.         0.         0.24253563
  0.24253563 0.         0.         0.        ]
 [0.         0.         0.2410822  0.         0.         0.2410822
  0.2410822  0.2410822  0.         0.         0.         0.
  0.         0.         0.         0.2410822  0.55004769 0.
  0.         0.         0.         0.2410822  0.         0.
  0.         0.         0.48216441 0.         0.         0.
  0.         0.         0.2410822  0.2410822 ]
 [0.15895379 0.         0.         0.63581516 0.47686137 0.
  0.         0.         0.15895379 0.15895379 0.         0.15895379
  0.         0.15895379 0.15895379 0.         0.12088845 0.
  0.         0.15895379 0.         0.         0.         0.15895379
  0.         0.         0.         0.31790758 0.15895379 0.
  0.         0.15895379 0.         0.        ]]

4.3.6 The importance of TF-IDF

TF-IDF is one of the early data-processing steps when classification algorithms are used to categorize articles. In many natural language processing tasks, such as information retrieval and text mining, we want to find the important words or sentences. To do so, the text has to be quantified (converted into vectors) for further processing and filtering, and TF-IDF is a commonly used way of measuring how important a word is to a document. It helps us find the important words or phrases in areas such as information retrieval and text mining.


Summary

  • Feature extraction [understand]: converting arbitrary data (such as text or images) into numerical features usable for machine learning
  • Types of feature extraction [understand]:
    • dictionary feature extraction (feature discretization)
    • text feature extraction
    • image feature extraction
  • Dictionary feature extraction [know]: dictionary feature extraction converts categorical data.
  • API: sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
    • sparse matrix:
      • Save memory
      • Improve reading efficiency
    • Notice:
      • We will do One-hot encoding for the category information in the features
  • Text feature extraction (English) [know]
    • API:sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
      • stop_words: stop words (words whose frequency is not counted)
      • Note :
        • CountVectorizer does not have a sparse parameter
        • For English text, CountVectorizer does not count single letters or punctuation marks
  • Text feature extraction (Chinese) [know]
    • Note :
      • Before Chinese text feature extraction, sentence (article) needs to be segmented
        • jieba.cut()
      • Stop words can still be used with CountVectorizer to filter out words
  • TF-IDF【Know】
    • Main idea : If a word or phrase has a high probability of appearing in an article and rarely appears in other articles, it is considered that the word or phrase has a good ability to distinguish categories and is suitable for classification
    • TF-IDF
      • TF: Term Frequency
      • IDF: Inverse Document Frequency
      • API:sklearn.feature_extraction.text.TfidfVectorizer
      • Notice:
        • TF-IDF is one of the early data-processing steps when classification algorithms are used to categorize articles

Note: Feature engineering is not exclusive to decision trees. For other feature extraction methods (models) as well, we use feature engineering to convert data (such as text or images) into numerical features that can be used for machine learning!


Origin blog.csdn.net/weixin_44878336/article/details/130798760