Decision Tree - Classification

Decision Tree Algorithm

How can classification and regression be performed on existing data? A decision tree is a machine learning model built to answer exactly this kind of question.

The idea is to take three counts as input: 1) the number of samples taking each value of a feature; 2) within each feature value, the number of samples belonging to each class; 3) the total number of samples. From these counts the model computes an index for every feature: information gain for ID3, information gain ratio for C4.5, and the Gini index for CART. Once the indices are available, one feature is chosen as a node according to some rule (for example, the largest index), that feature is removed, and the child nodes are chosen from the remaining features in the same way, recursively.

The decision tree algorithms (heuristic functions) include ID3, C4.5 and CART. The Decision Tree implementation in scikit-learn is CART, which also shows that CART is the most important of the three. Why call them heuristic functions? Because whichever criterion is used, it does not guarantee the best possible split; it only gives one reasonable way of splitting, which is why follow-up steps such as pruning and rebuilding are still needed to optimize the nodes of the tree.

ID3 only classifies on discrete features. C4.5 can also classify on continuous values: a continuous feature is cut into segments, and each segment is used as a partition at the node. CART can not only classify on both kinds of data (discrete and continuous) but can also be used for regression (as the name Classification And Regression Tree suggests); CART handles continuous values in the same way as C4.5.

The relationship between the three is progressive. The ID3 algorithm splits on maximum information gain. What is information gain? We first have to understand empirical entropy. From the point of view of physics (thermodynamics), entropy is the logarithm of the number of microscopic states in a state space that correspond to the same macroscopic state. In the field of information, Shannon borrowed this concept and used entropy to describe the uncertainty of information: the higher the entropy, the more uncertain the information. The entropy of a classification is defined as:

H(D) = -Σ_k (|Ck| / |D|) · log2(|Ck| / |D|)

Here D is the data set, Ck is the set of samples belonging to class k, and H(D) is the empirical entropy of D. For example, with 100 samples of which 20 are in class A, 30 in class B and 50 in class C, we have |C1| = 20, |C2| = 30, |C3| = 50 and |D| = 100.
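As a quick sanity check of the formula, here is a minimal Python sketch (not from the original post) that computes H(D) for the 20/30/50 example above:

```python
import math

def entropy(class_counts):
    """Empirical entropy H(D) = -sum_k p_k * log2(p_k), with p_k = |Ck| / |D|."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

# Example from the text: 100 samples, 20 in class A, 30 in class B, 50 in class C.
print(entropy([20, 30, 50]))  # ≈ 1.485 bits
```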

The empirical conditional entropy of D given a specified feature A is defined as:

H(D|A) = Σ_i (|Di| / |D|) · H(Di) = -Σ_i (|Di| / |D|) Σ_k (|Dik| / |Di|) · log2(|Dik| / |Di|)

Here Di is the subset of samples whose feature A takes the i-th value, and Dik is the subset of Di belonging to the k-th class. For example, take 100 people (sample set D) with nationality as feature A: 50 are Chinese (i = 1), 30 are American (i = 2) and 20 are Korean (i = 3); among the 50 Chinese, 15 are young people (k = 1) and 26 are men (k = 2). Then |D| = 100, |D1| = 50, |D2| = 30, |D3| = 20, |D11| = 15, |D12| = 26.
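The nationality example does not list every Dik, so the numbers below are hypothetical, chosen only to make the computation runnable; the function itself follows the H(D|A) formula directly (a sketch, not from the original post):

```python
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def conditional_entropy(groups):
    """H(D|A) = sum_i |Di|/|D| * H(Di), where each group holds the class counts inside Di."""
    total = sum(sum(g) for g in groups)
    return sum((sum(g) / total) * entropy(g) for g in groups)

# Hypothetical class counts per feature value (each inner list sums to |Di|):
# Chinese: 50 samples, American: 30, Korean: 20, split into two made-up classes.
groups = [[15, 35], [10, 20], [5, 15]]
print(conditional_entropy(groups))
```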

Information gain describes how much certainty a specified feature brings: the larger the gain, the more stable the feature (note that this is the overall gain with respect to the specified feature). The formula is:

g(D, A) = H(D) - H(D|A)

A larger information gain means a smaller conditional entropy H(D|A), i.e. less uncertainty once A is known, so the field with the largest information gain is selected as the splitting field. In the extreme case such a field can directly serve as the parent of the leaf nodes, i.e. it directly determines the classification result.

But this criterion has a drawback: because the formula involves X·log2(X), a feature with many distinct values tends to get a smaller entropy contribution from each value, which makes its H(D|A) smaller. The most extreme case is choosing an ID-like feature: every sample has a unique value, so Dik = 1, Di = 1, Dik / Di = 1, the log term is 0, and H(D|A) drops to 0, making the gain maximal.
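To make the bias concrete, here is a small sketch (hypothetical data, not from the original post) comparing the information gain of a normal feature against an ID-like feature where every sample has a unique value:

```python
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def information_gain(total_counts, groups):
    """g(D, A) = H(D) - H(D|A)."""
    total = sum(total_counts)
    h_d_a = sum((sum(g) / total) * entropy(g) for g in groups)
    return entropy(total_counts) - h_d_a

labels = [20, 30, 50]                            # overall class counts in D
normal_feature = [[10, 20, 10], [10, 10, 40]]    # two feature values, made-up split
id_feature = [[1] for _ in range(100)]           # every sample has its own value

print(information_gain(labels, normal_feature))  # a moderate gain
print(information_gain(labels, id_feature))      # equals H(D), the maximum possible
```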

The C4.5 algorithm improves on this: instead of looking only at the information gain, it looks at the ratio between the information gain and the entropy of D with respect to the values of feature A. This ratio reduces, to some extent, the advantage of features with many values:

gR(D, A) = g(D, A) / HA(D), where HA(D) = -Σ_i (|Di| / |D|) · log2(|Di| / |D|)

Here Di is again the subset of samples whose feature A takes the i-th value, and Dik has the same meaning as above. A feature with many values gains an unfair advantage in the H(D|A) computation, but because the denominator HA(D) grows for the same reason, dividing by it offsets (penalizes) that advantage.
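A minimal sketch of the C4.5 gain ratio, reusing the made-up counts from the previous snippet (assumptions, not the original post's data):

```python
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def gain_ratio(total_counts, groups):
    """gR(D, A) = g(D, A) / HA(D), HA(D) being the entropy of the feature-value split itself."""
    total = sum(total_counts)
    h_d_a = sum((sum(g) / total) * entropy(g) for g in groups)
    h_a = entropy([sum(g) for g in groups])   # entropy of D with respect to A's values
    return (entropy(total_counts) - h_d_a) / h_a if h_a > 0 else 0.0

print(gain_ratio([20, 30, 50], [[10, 20, 10], [10, 10, 40]]))
```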

CART uses the Gini index as its measure. Many books describe it as "impurity", but reading it as "eccentricity" may actually be more appropriate, because a Gini index of 0 represents the highest purity (the best certainty and stability):

Gini(D) = 1 - Σ_k (|Ck| / |D|)²
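A quick sketch of the Gini computation for the same 20/30/50 class counts (again just an illustration, not from the original post):

```python
def gini(class_counts):
    """Gini(D) = 1 - sum_k (|Ck| / |D|)**2; 0 means a perfectly pure node."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini([20, 30, 50]))   # ≈ 0.62
print(gini([100, 0, 0]))    # 0.0, a pure node
```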

When CART builds the tree, at every step it selects the feature-value condition with the smallest Gini index (the most stable one) as a node of a binary tree (another difference between CART and ID3/C4.5 is that a CART tree is always binary).
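Since scikit-learn's implementation is CART, a minimal sketch of training a Gini-based binary tree could look like the following (standard scikit-learn API; the dataset choice is only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" is the default: every split minimizes the weighted Gini index,
# and every internal node has exactly two children (a binary tree).
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```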

In addition, the decision trees built at different times may well differ from one another. This is because the training samples drawn for each construction may be different, so the statistics computed for each feature differ, and therefore the trees differ. In a random forest, for example, every subtree is built from part of the data, and the next subtree is built from another sample drawn with replacement, which is why the trees in the forest are all different.

 

Decision probabilities of a decision tree

In a classification model, most decision tree nodes are not pure, so when the tree is built from training samples, the samples that end up in a leaf node are partly of class A, partly of class B, and so on. This means every leaf node assigns each class a certain probability (based on the samples the tree was trained on). Suppose a binary classification model has a leaf node into which 100 training samples fall, 80 of class A and 20 of class B. Calling predict naturally returns class A; but decision trees also provide a predict_proba function, which tells us the predicted probability of each class (A: 80%, B: 20%).
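A short sketch of the difference between predict and predict_proba on a fitted tree (standard scikit-learn API; the exact probabilities depend on which leaf the sample lands in):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

sample = X[:1]
print(clf.predict(sample))        # the majority class of the leaf, e.g. [0]
print(clf.predict_proba(sample))  # per-class fractions inside that leaf, e.g. [[1. 0. 0.]]
```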

 

Pruning

The problem with building a decision tree is that the construction overfits (the tree performs very well on the data it was built on but noticeably worse on held-out data), so the tree needs to be pruned to improve its generalization ability. Research has shown that the heuristic functions used to build the tree do not differ much from one another (after all, they are all heuristics, not optimal algorithms); the key lies in pruning.

How overfitting is addressed: pruning essentially deletes some leaf nodes and lets their parent become a leaf, making the pruned tree simpler and somewhat better at generalizing.

How this is implemented: there are many pruning algorithms; here we look at CART's CCP (Cost Complexity Pruning) algorithm. The idea is to compute, for every parent (non-leaf) node, the error before and after cutting away its children, and from that the rate of error increase:

α = [R(t) - R(T)] / (|L(T)| - 1)

Here R(t) is the error of node t when it is treated as a leaf, R(T) is the error of the subtree T rooted at t, and |L(T)| is the number of leaves of that subtree. The children of the node with the smallest α are pruned away. Why choose the smallest? Because it means the finer branching under that node adds the least value, so it can be cut directly. We then keep traversing, recomputing the error increase rates, pruning the node with the smallest value, and iterating until the smallest α <= 0, at which point pruning ends, because further pruning would no longer be meaningful.
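scikit-learn exposes this CCP mechanism through cost_complexity_pruning_path and the ccp_alpha parameter; a minimal sketch follows (the dataset and the choice of which α to keep are illustrative only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The effective alphas at which subtrees would be pruned, from weakest link upward.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit with one candidate alpha: larger values prune more aggressively.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
print(pruned.score(X_test, y_test), pruned.get_n_leaves())
```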

 

A broader view

You may notice that the decision tree process is similar to gradient descent. Both are heuristic modeling processes that iterate towards a local optimum. For a decision tree, entropy, the Gini value and so on measure how stable a feature is, a heuristic function decides which feature becomes a branch node, and pruning then iteratively optimizes the tree. For gradient descent, we start by heuristically assigning the parameter θ, then compute the gradient to obtain the next adjustment of θ. Neither can exhaustively enumerate all possibilities to obtain the optimal solution; both take a local optimum as the final result.

In essence, both are greedy algorithms, because every step chooses a local optimization, i.e. whatever is best by the current criterion. In a decision tree, every node split maximizes the information gain or gain ratio, or minimizes the Gini index; in gradient descent, the parameter θ always moves along the tangent (descending) direction. Of course, for linear regression the loss function is convex, so this approach finds the global minimum, but for other shapes of curve the first low point reached (the first time the loss starts to increase) is not necessarily the global minimum. And so on.

 

In addition, CART's regression implementation is a different algorithm based on minimizing a loss function; that deserves a separate article.

 
