Decision Tree - Regression

Problem

  Regression, implemented as a traversal of a tree built from feature-based ranges.

Solution

  We construct a binary tree by searching for the best feature and feature value of the samples to serve as the best split point. The principle for selecting the best feature and value is that the choice minimizes the loss function. In essence, the process partitions the training samples into intervals, computes the mean of the samples inside each interval, and that regional mean becomes the predicted value.

  When predicting, the binary tree is traversed according to the feature values of the sample (this is the region-determination process), and the value stored at the leaf node reached is the predicted value.

  The process of constructing a regression tree can in fact be understood as supervised clustering of the training samples: each category is described by a logical set of feature ranges. Prediction is then really a matching step: find the category whose logical path of feature ranges matches the sample to be predicted, and take the mean y of that category as the predicted value.

  Because the tree takes, at every step, the solution that is optimal for the loss function at that step, it is essentially a greedy algorithm: each step is only a local optimum. For this reason the CART regression algorithm also belongs to the family of heuristic algorithms.
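  To make this greedy construction concrete, here is a minimal Python sketch (the function names and the dictionary-based node layout are my own, not from any particular library). It searches every feature and observed value for the split with the smallest squared error, builds the tree recursively, and predicts by traversing it. For simplicity it splits at observed feature values rather than at midpoints between adjacent values.

import numpy as np

def best_split(X, y):
    """Greedy step: find the (feature j, split value s) pair whose two
    resulting regions have the smallest total squared error."""
    best = None  # (loss, j, s)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    """Recursive construction; each leaf stores the mean y of its region."""
    split = best_split(X, y)
    if depth >= max_depth or len(y) < min_samples or split is None:
        return {"leaf": True, "value": float(y.mean())}
    _, j, s = split
    mask = X[:, j] <= s
    return {"leaf": False, "feature": j, "threshold": s,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}

def predict_one(node, x):
    """Prediction = traversing the tree by each node's condition;
    the value stored at the leaf reached is the predicted value."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["value"]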

 

Implementation

  The implementation of the solution can be read off the equation below. Since this is CART regression, the idea is to find the split that minimizes the loss function:

  min_{j,s} [ min_{c1} Σ_{xi∈R1(j,s)} (yi - c1)²  +  min_{c2} Σ_{xi∈R2(j,s)} (yi - c2)² ]

  Here R1 and R2 denote the two regions (sample sets) produced by the binary split, with feature j and split value s as the decision condition:

R1(j, s) = {x | xj ≤ s}, R2(j, s) = {x | xj > s}

  The terms min_{c1} and min_{c2} give the optimal output value of each region; that optimal value is simply the mean of y over the region, so the formula can be written as (binding min_{c1} / min_{c2} to the specific values c1_hat and c2_hat):

  c1_hat = mean(yi | xi ∈ R1(j, s)),  c2_hat = mean(yi | xi ∈ R2(j, s))

  The final decision tree model we build is:

  f(x) = Σ_{m=1..M} c_m_hat · I(x ∈ R_m)

  where I(·) is the indicator function, equal to 1 when x falls in region R_m and 0 otherwise.

  This model may seem abstract, so let's look back at the decision tree graphically. First, a tree diagram:

  The figure on the left shows the view corresponding to the data. The last layer consists of leaf nodes; since they carry no decision condition, they are not counted as a layer: only layers that contain parent (decision) nodes are counted.

 

  As the figure shows, what the decision tree fits is in fact a horizontal line within each specified range, and that range is defined by the conditions at each node.

  Let's first look at the figure on the left, with depth = 2. The root node splits at 0.2 (the solid line toward the left), dividing the whole sample space in two: R1: (0, 0.2) and R2: (0.2, 1.0). Then the two child nodes on the second level split at roughly 0.1 and 0.8, each further dividing the region defined by the root in two, giving four regions in total. Each interval has its own R space (set of samples), and the red line inside each interval is the mean of the response variable of the samples in that space, i.e. the value c_m. Partitioning the space in this way is exactly what minimizes the value of the loss function.

  Now look at the figure on the right, with depth = 3. You will notice one characteristic: every R interval contains a split line from the last layer. This is because the leaf nodes are obtained last: every non-leaf child node necessarily splits the region defined by its parent into two again, so every final region is bounded by a split line corresponding to a node in the last layer.

  In short, a decision tree fits the samples of the training space with a step-like (piecewise-constant) curve, which is somewhat rigid. That is why CART is rarely used directly for regression; instead, an ensemble of CARTs, such as a random forest, is used.

  Now back to the formula, taking the left figure (depth = 2) as an example. For f(0.5), the point belongs to only one of the R spaces, here the region (0.2, 0.8), so every other region should contribute 0 in the calculation. That is the role of the indicator function: it acts like a switch, making sure that only the region containing the point contributes a value while all other terms are 0.
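  To show how this switch works mechanically, here is a tiny sketch of the piecewise model f(x) = Σ c_m_hat · I(x ∈ R_m) for the depth-2 figure. The split points (0.1, 0.2, 0.8) are taken from the description above, while the c values are placeholders I made up, since the actual per-region means are shown only in the figure.

# Regions of the depth-2 example (split points 0.1, 0.2, 0.8); the c values
# are placeholders standing in for the per-region means, not read from the figure.
regions = [((0.0, 0.1), 0.05), ((0.1, 0.2), 0.15), ((0.2, 0.8), 0.50), ((0.8, 1.0), 0.90)]

def f(x):
    # The indicator acts as a switch: only the region containing x contributes.
    return sum(c * (lo < x <= hi) for (lo, hi), c in regions)

print(f(0.5))  # only the mean of the (0.2, 0.8) region is returned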

 

Worked Example

  Let's walk through this solution process with a concrete example.

  The table below gives the values of a feature j (x) together with the response y:

  x: 1     2     3     4     5     6     7     8     9     10
  y: 5.56  5.70  5.91  6.40  6.80  7.05  8.90  8.70  9.00  9.05

  We need to consider the following 9 candidate split points: {1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5}. Each split point is taken as the midpoint of two adjacent feature values.

  For the first split point, 1.5, the y value of the first sample (5.56) forms the first data region R1, and the samples with x from 2 to 10 form the second data region R2.

  So c1 = 5.56 and c2 = (1/9)·Σ yi = 7.5, summing over i = 2, ..., 10.

  L(1.5) = (y1 - 5.56)² + Σ (yi - c2)² = 0 + 15.72 = 15.72, where the sum runs over i = 2, ..., 10.

  Continuing with the second split point, 2.5: R1 = {5.56, 5.7}, R2 = {5.91, 6.4, ..., 9.05}, so:

  c1 = mean(R1) = 5.63, c2 = mean(R2) = 7.73

  L(2.5) = Σ (yi - c1)² + Σ (yj - c2)² = 12.07, where the first sum runs over i = 1, 2 and the second over j = 3, ..., 10.

  Proceeding in the same way for the remaining split points gives the loss at every candidate split, as the sketch below computes:
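  Since the original table of L(s) values came from an image, here is a short sketch that recomputes the loss at every candidate split point from the data table above (the value 6.80 at x = 5 is inferred from the c2 = 7.5 calculation):

# y values for x = 1..10 from the table above.
y = [5.56, 5.70, 5.91, 6.40, 6.80, 7.05, 8.90, 8.70, 9.00, 9.05]

def loss(s):
    # Split the samples by x <= s versus x > s (x is the 1-based position).
    left = [v for i, v in enumerate(y, start=1) if i <= s]
    right = [v for i, v in enumerate(y, start=1) if i > s]
    c1, c2 = sum(left) / len(left), sum(right) / len(right)
    return sum((v - c1) ** 2 for v in left) + sum((v - c2) ** 2 for v in right)

for s in (1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5):
    print(s, round(loss(s), 2))
# L(1.5) = 15.72 and L(2.5) = 12.07 match the hand calculations above;
# the minimum is reached at s = 6.5.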

  We can see that the split point with the smallest loss for this feature is 6.5, so 6.5 is the best split point for this feature. We then have (writing each R in x:y form):

R1 = {1:5.56, 2:5.7, 3:5.91, 4:6.4, 5:6.8, 6:7.05}, R2 = {7:8.9, 8:8.7, 9:9.0, 10:9.05}

  If the sample has other features, the same procedure applies: compute L(s) at the best split point of each feature, then choose the feature and split point with the smallest L(s) as the node of the binary tree. After removing the chosen feature, repeat the same operation on R1 and R2 to obtain new child nodes, and so on recursively.

 

Appendix

  Why are the optimal solutions c1_hat and c2_hat the corresponding means? (Note: the hat indicates that the value is an estimate.)

  Here is the derivation:

  F(a) = (x1 - a)² + (x2 - a)² + ... + (xn - a)²

  Examine its monotonicity (find the extremum by differentiating):

  F'(a) = -2(x1 - a) - 2(x2 - a) - ... - 2(xn - a) = 2na - 2·Σxi (note: since we differentiate with respect to a, and a appears with a minus sign inside each term, the chain rule gives a coefficient of -2 in front of each bracket)

  F'(a) = 0 => a = (1/n)*Σxi

  From the monotonicity (F'(a) < 0 for a below this value and F'(a) > 0 above it), we know that a_hat = (1/n)·Σxi gives the minimum.

  Since, by the formula, what is being minimized is an L2 (squared) value, the tree built this way is also called a least-squares regression tree.
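  As a quick sanity check of the derivation, the following sketch evaluates F(a) at the sample mean and at two nearby values; the mean gives the smallest sum of squared deviations:

import numpy as np

xs = np.array([5.56, 5.70, 5.91, 6.40, 6.80, 7.05])  # any sample works

def F(a):
    # Sum of squared deviations from a.
    return ((xs - a) ** 2).sum()

a_hat = xs.mean()
for a in (a_hat - 0.5, a_hat, a_hat + 0.5):
    print(round(float(a), 3), round(float(F(a)), 4))
# F is smallest at the sample mean a_hat.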

 

  Why can monotonicity be read off from the derivative, and how?

  The derivative represents the rate of change of a function, while monotonicity describes whether y changes in the same direction as x: the same direction corresponds to a positive derivative, the opposite direction to a negative one. From this understanding it is not hard to derive the following facts about derivatives:

  f'(x) > 0 => f(x) is an increasing function (monotonically increasing)

  f'(x) < 0 => f(x) is a decreasing function (monotonically decreasing)

  f'(x) = 0 => f(x) is a constant function (on that interval).

  Note that the conditions on the left are sufficient but not necessary: a monotonically increasing or decreasing function does not necessarily have a strictly positive or negative derivative everywhere (for example, f(x) = x³ is increasing, yet f'(0) = 0).

  How does the derivative determine an extremum, and whether it is a maximum or a minimum?

  First, understand that an extremum is a local concept: it only describes the trend of the function on the two sides of a point.

  A point where the derivative equals 0 is called a stationary point. Next, check whether the derivative takes opposite signs on the two sides of the stationary point. If it does, and the derivative is > 0 on the left and < 0 on the right, the stationary point is a local maximum (a peak: moving along the positive x direction, the left side of the peak goes uphill and the right side goes downhill); the opposite pattern gives a local minimum (a valley: along the positive x direction, the left side of the valley goes downhill and the right side goes uphill). If the derivative has the same sign on both sides of the stationary point, the point is called a saddle (inflection) point.

  The general approach to analyzing extrema with derivatives:

  1. Compute the derivative, set it to 0, and solve for the stationary points;

  2. Using the stationary points (within the given interval) to partition the x-axis, analyze the sign of the derivative on each side of every stationary point, and use that to decide whether the point is an extremum and, if so, which kind.

  Here is an example:

  y = (1/3)*x³ - 4x + 4

  Differentiating: y' = x² - 4 = (x + 2)(x - 2)

  Stationary points: y' = 0 => (x + 2)(x - 2) = 0 => x = ±2

  Partition the x-axis and analyze the sign of y' piecewise: y' > 0 for x < -2, y' < 0 for -2 < x < 2, and y' > 0 for x > 2. Hence x = -2 is a local maximum and x = 2 is a local minimum.
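  A small numerical check of the same classification, evaluating the sign of y' just to the left and right of each stationary point:

def dy(x):
    # Derivative of y = (1/3)*x**3 - 4*x + 4
    return x * x - 4

for p in (-2, 2):
    left, right = dy(p - 0.1), dy(p + 0.1)
    kind = "local max" if left > 0 > right else "local min" if left < 0 < right else "saddle point"
    print(p, kind)
# x = -2: derivative goes from + to -, a local maximum
# x =  2: derivative goes from - to +, a local minimum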

 

 

References

https://zhuanlan.zhihu.com/p/42505644

https://blog.csdn.net/weixin_40604987/article/details/79296427

https://www.jianshu.com/p/b90a9ce05b28

https://wenku.baidu.com/view/eb97650e76c66137ee061960?fr=uc (reference for the derivative extremum example)

https://wenku.baidu.com/view/4357a7ce58f5f61fb73666c3?fr=uc (mentions that an extremum is only a local concept)

https://blog.csdn.net/wfei101/article/details/80766934 (mentions the concept of a stationary point)

 
