Machine Learning (Notes) -- Comparison of the Decision Tree Algorithms ID3, C4.5, and CART


From: https://www.cnblogs.com/wxquare/p/5379970.html

Decision tree models are very common in supervised learning and can be used for both classification (binary and multiclass) and regression. Although ensembles of many weak decision trees, such as Bagging, Random Forest, and Boosting, are more common in practice, the "fully grown" single decision tree is still widely used because it is simple, intuitive, and highly interpretable, and because it is the building block of tree ensembles, it is worth understanding. Generally speaking, learning a "fully grown" decision tree involves three steps: feature selection, tree construction, and pruning. This article mainly summarizes and compares the ID3, C4.5, and CART algorithms; a more detailed introduction can be found in "Statistical Learning Methods".

1. Advantages and disadvantages of decision trees

    Advantages:

  1. The process of learning simple decision rules to build a decision tree model is very easy to understand.
  2. The decision tree model can be visualized and is very intuitive.
  3. It has a wide range of applications: it can be used for both classification and regression, and multiclass classification is straightforward.
  4. It can handle numerical and continuous sample features.

    Disadvantages:

  1. It is easy to grow a complex tree structure on the training data, which leads to overfitting. Pruning can alleviate the negative effects of overfitting; common methods are limiting the height of the tree and the minimum number of samples in a leaf node (a minimal scikit-learn sketch follows this list).
  2. Learning an optimal decision tree is an NP-complete problem. Practical decision trees are therefore built with heuristic greedy algorithms, which cannot guarantee a globally optimal tree; Random Forest introduces randomness to alleviate this problem.
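
As a concrete illustration of the constraints mentioned in point 1, here is a minimal sketch using scikit-learn (not part of the original post); it assumes scikit-learn is installed and simply contrasts a fully grown tree with one whose height and leaf size are limited.

```python
# A minimal sketch: limiting tree depth and leaf size is a simple way to
# counter the overfitting described above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained ("fully grown") tree versus a constrained one.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(
    max_depth=3,          # limit the height of the tree
    min_samples_leaf=5,   # minimum number of samples in a leaf node
    random_state=0,
).fit(X_train, y_train)

print("full tree   train/test acc:",
      full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("pruned tree train/test acc:",
      pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))
```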

Information entropy and information gain

Before introducing the algorithms, we first need to understand what information entropy and information gain are.

    Entropy: measures the uncertainty of a random variable.

    Conditional entropy: the uncertainty of a random variable given some condition.

    Information gain: information entropy minus conditional entropy, i.e. the degree to which the uncertainty is reduced under a given condition.

In layman's terms, X ("it will rain tomorrow") is a random variable whose entropy can be computed, and Y ("it is cloudy tomorrow") is another random variable. The entropy of rain given that it is cloudy (which requires the joint probability distribution, or an estimate of it from data) is the conditional entropy.

The difference between the two is the information gain. For example, suppose the entropy of "it will rain tomorrow" is 2 and the conditional entropy is 0.01 (because if it is cloudy, rain is very likely, so little uncertainty remains). Their difference is 1.99: after obtaining the "cloudy" information, the uncertainty about rain drops by 1.99, which is a lot, so the information gain is large. In other words, the "cloudy" information matters a great deal for predicting rain.

This is why information gain is often used for feature selection: if a feature has a large information gain, it is very important for the classification task, and this is exactly how a decision tree finds its features. A small numerical sketch of this example is given below.

To summarize, information entropy measures the complexity (uncertainty) of a random variable, and conditional entropy measures the complexity (uncertainty) that remains under a given condition. Entropy describes how disordered a system is: the larger the entropy, the higher the uncertainty and the harder it is to predict. A decision tree therefore prefers splits that lead to low entropy and high purity. Information gain is entropy minus conditional entropy, i.e. the degree to which uncertainty is reduced, so the feature with the larger information gain is preferred.
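
The following is a small numerical sketch of the rain/cloudy example above (the probabilities are invented purely for illustration; they are not from the original post):

```python
# A minimal sketch: entropy, conditional entropy, and information gain for the
# rain/cloudy example. The joint distribution below is hypothetical.
from math import log2

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical probabilities.
p_cloudy = 0.5
p_rain_given_cloudy = 0.9
p_rain_given_clear = 0.1

# Marginal distribution of "it will rain tomorrow" (the variable X).
p_rain = p_cloudy * p_rain_given_cloudy + (1 - p_cloudy) * p_rain_given_clear
h_rain = entropy([p_rain, 1 - p_rain])                     # H(X)

# Conditional entropy H(X | Y): expected entropy of rain given the sky.
h_rain_given_sky = (
    p_cloudy * entropy([p_rain_given_cloudy, 1 - p_rain_given_cloudy])
    + (1 - p_cloudy) * entropy([p_rain_given_clear, 1 - p_rain_given_clear])
)

info_gain = h_rain - h_rain_given_sky                      # H(X) - H(X | Y)
print(f"H(rain) = {h_rain:.3f} bits")
print(f"H(rain | cloudy?) = {h_rain_given_sky:.3f} bits")
print(f"information gain = {info_gain:.3f} bits")
```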

2. ID3 algorithm

      ID3 was proposed by Ross Quinlan in 1986. An ID3 decision tree can have multiple branches, but it cannot handle continuous-valued features. A decision tree is a greedy algorithm: the feature chosen to split the data at each step is the best choice at that moment, with no concern for whether the result is globally optimal. In ID3, the best feature is selected at each step according to the "maximum information gain" criterion, and the data are split into one branch per value of that feature; that is, if a feature has 4 possible values, the data are split into 4 parts. Once the data have been split on a feature, that feature plays no further role in the rest of the algorithm, and some argue that this way of splitting is too hasty. The ID3 algorithm is very simple: its core is to choose, by the "maximum information gain" principle, the feature that best partitions the current data set. Information entropy is a concept from information theory and a way of measuring information: the greater the uncertainty (the more disordered the data), the larger the entropy. While the tree is being built, partitioning the data by a feature reduces the entropy (disorder) of the originally "disordered" data, and different features reduce the entropy by different amounts. ID3 greedily chooses the feature that reduces the entropy the most, which is exactly the "maximum information gain" principle. The formulas are given below; the linked post also works through a concrete information gain calculation.
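
The formulas referred to above (the image from the original post is not reproduced here) are the standard definitions of entropy, conditional entropy, and information gain for a data set D with classes C_k and a feature A whose values partition D into subsets D_i, as given for example in "Statistical Learning Methods":

```latex
% Entropy of the data set D, conditional entropy given feature A,
% and the information gain used by ID3 to pick the splitting feature.
H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}
\qquad
H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)
\qquad
g(D, A) = H(D) - H(D \mid A)
```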

3. C4.5 algorithm

      C4.5 was proposed by Ross Quinlan in 1993 as an improvement on ID3. The information gain measure used by ID3 has a drawback: it tends to prefer features with many distinct values, because such features usually have a relatively large information gain (information gain reflects how much the uncertainty decreases once a condition is given, and the more finely the data are split, the more certain each subset becomes, so the conditional entropy is smaller and the information gain is larger). To avoid this shortcoming, C4.5 uses the information gain ratio as the criterion for choosing a split. The gain ratio penalizes features with many values by introducing a term called the split information. In addition, C4.5 remedies ID3's inability to handle continuous feature values; however, continuous attributes must be scanned and sorted, which degrades C4.5's performance. Interested readers can consult the referenced blog post.

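The image at this point in the original post is not reproduced here; the standard C4.5 definitions of split information and gain ratio (the quantities described in the paragraph above) are:

```latex
% Split information penalizes features with many values;
% the gain ratio is the information gain divided by it.
\mathrm{SplitInfo}_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}
\qquad
\mathrm{GainRatio}(D, A) = \frac{g(D, A)}{\mathrm{SplitInfo}_A(D)}
```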

5. CART algorithm

     Reference: CART uses the Gini index for classification

     CART (Classification and Regression Tree) was proposed by L. Breiman, J. Friedman, R. Olshen and C. Stone in 1984. In ID3, once the data have been split by a feature's values, that feature plays no further role, and this rapid way of consuming features affects the algorithm's accuracy. CART is a binary tree that uses binary splits: each split divides the data into two parts, which go into the left and right subtrees. Since every internal node has exactly two children, a CART tree has one more leaf node than internal nodes. Compared with ID3 and C4.5, CART is used more widely, because it can be applied to both classification and regression. For classification, CART uses the Gini index to choose the feature that best splits the data; the Gini index describes purity and has a meaning similar to information entropy, and each iteration of CART reduces the Gini index. The figure referenced below shows that half the entropy, the Gini index, and the misclassification error are very close as evaluation criteria. For regression, CART uses the mean squared error as the loss function. The Gini index is computed in a way very similar to information gain; the formula is as follows.

(Figure in the original post: half the entropy, the Gini index, and the misclassification error plotted together as impurity measures; the three curves are very close.)
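
The formula image is likewise not reproduced here; the standard Gini index definitions used by CART for classification, for a data set D with classes C_k and a binary split of D into D_1 and D_2, are:

```latex
% Gini impurity of a data set, and the Gini index of a binary split:
% CART picks the split with the smallest Gini(D, A).
\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2
\qquad
\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)
```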

 

6. Classification trees vs. regression trees

         When decision tree algorithms are mentioned, most people think of the classification trees described above: ID3, C4.5, and CART. In fact, decision trees come in two kinds, classification trees and regression trees. The former are used for classification, e.g. sunny/overcast/rainy weather, a user's gender, or whether an email is spam; the latter are used to predict real values, e.g. tomorrow's temperature or a user's age.

         For comparison, consider classification trees first. At each split, classification trees such as ID3 and C4.5 exhaustively try every threshold of every feature and pick the feature and threshold whose two branches, feature <= threshold and feature > threshold, give the largest information gain (i.e. the smallest weighted entropy of the two branches). Splitting by this criterion produces two new nodes, and the same procedure is applied recursively until every sample falls into a leaf node with a single gender, or a preset stopping condition is reached. If the genders in a final leaf node are not unique, the majority gender is taken as the prediction of that leaf. A minimal sketch of this split search is given below.
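
The following sketch (not the original author's code) illustrates the split search just described: exhaustively try every threshold of every feature and keep the split with the largest information gain.

```python
# Exhaustive search for the best (feature, threshold) classification split.
import numpy as np

def entropy(y):
    """Entropy (in bits) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_classification_split(X, y):
    """Return (feature index, threshold, information gain) of the best split."""
    n_samples, n_features = X.shape
    parent_entropy = entropy(y)
    best = (None, None, 0.0)
    for j in range(n_features):
        for threshold in np.unique(X[:, j]):
            left, right = y[X[:, j] <= threshold], y[X[:, j] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n_samples
            gain = parent_entropy - weighted          # larger gain = better split
            if gain > best[2]:
                best = (j, threshold, gain)
    return best
```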

         The overall procedure for a regression tree is similar, except that every node (not only the leaf nodes) produces a predicted value; taking age as the example, that value is the average age of all the people who fall in that node. When splitting, the tree again exhaustively tries every threshold of every feature to find the best split point, but the criterion is no longer maximum information gain: it is the minimum mean squared error, i.e. the sum of (each person's age - predicted age)^2 divided by N, in other words the average squared prediction error. This is easy to understand: the more people are predicted wrongly, and the further off the predictions are, the larger the mean squared error, so minimizing it finds the most reliable splitting criterion. Splitting continues until the ages in every leaf node are identical (which is rarely achievable) or a preset stopping condition is reached (e.g. a limit on the number of leaves); if the ages in a final leaf node are not identical, the average age of all people in that node is used as the leaf's predicted age. A sketch of this squared-error split search follows.
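
Below is a matching sketch (again not the original author's code) of the regression-tree criterion: the prediction of a node is the mean of its targets, and the best split minimizes the total squared error of the two children.

```python
# Exhaustive search for the best (feature, threshold) regression split.
import numpy as np

def squared_error(y):
    """Sum of squared deviations from the node's mean prediction."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_regression_split(X, y):
    """Return (feature index, threshold, total squared error) of the best split."""
    n_features = X.shape[1]
    best = (None, None, squared_error(y))             # start from the unsplit node
    for j in range(n_features):
        for threshold in np.unique(X[:, j]):
            mask = X[:, j] <= threshold
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:
                continue
            total = squared_error(left) + squared_error(right)
            if total < best[2]:                       # smaller error = better split
                best = (j, threshold, total)
    return best
```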
