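Machine Learning (2): Decision Tree Models

1. Overview

A decision tree is a simple machine learning method and a fairly intuitive way to classify observed data.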

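Advantages: low computational cost, output that is easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
Disadvantages: prone to overfitting.
Applicable data types: numeric and nominal.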

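2. Constructing a Decision Tree

The key to decision tree learning is choosing the optimal splitting attribute. In general, as splitting proceeds, we want the samples in each branch node to belong to the same class as far as possible; that is, we want the "purity" of the nodes to keep increasing.

Information gain: the change in information (the reduction in entropy) between before and after a dataset is split is called the information gain.

Once we know how to compute information gain, we can compute the gain obtained by splitting the dataset on each feature; the feature with the highest information gain is the best choice.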

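In practice, the information gain criterion favors attributes with many possible values. To reduce the adverse effect of this bias, the well-known C4.5 decision tree algorithm does not use information gain directly; it uses the "gain ratio" to select the optimal splitting attribute.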
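
To make this bias concrete, here is a minimal sketch (not part of the original code) that compares information gain with gain ratio on a toy dataset. The helper names `partition`, `information_gain`, and `gain_ratio` and the data are hypothetical. Gain ratio is assumed to be the usual C4.5 definition: information gain divided by the intrinsic value $IV(a) = -\sum_{v}\frac{|D^{v}|}{|D|}\log_{2}\frac{|D^{v}|}{|D|}$.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def partition(rows, col):
    """Group the labels (last column) by the value found in column `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    return groups

def information_gain(rows, col):
    """Entropy of all labels minus the size-weighted entropy of each group."""
    labels = [row[-1] for row in rows]
    groups = partition(rows, col)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(rows, col):
    """Information gain divided by the intrinsic value of the attribute."""
    groups = partition(rows, col)
    # Intrinsic value: entropy of the partition sizes themselves.
    iv = -sum(len(g) / len(rows) * log2(len(g) / len(rows)) for g in groups.values())
    return information_gain(rows, col) / iv if iv > 0 else 0.0

# Hypothetical data: column 0 is a unique ID, column 1 is a real feature.
rows = [
    (1, "sunny", "no"), (2, "sunny", "no"),
    (3, "rainy", "yes"), (4, "rainy", "yes"),
]
for col in (0, 1):
    print(col, information_gain(rows, col), gain_ratio(rows, col))
```

Both columns separate the classes perfectly, so their information gain is the same, but the ID column takes four distinct values and its gain ratio is only half that of the two-valued feature.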

Code:

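class decisionnode:
  def __init__(self,col=-1,value=None,results=None,tb=None,fb=None):
    self.col=col          # index of the column tested at this node
    self.value=value      # value the column must match for a true result
    self.results=results  # dict of class counts; None except at leaves
    self.tb=tb            # subtree followed when the test is true
    self.fb=fb            # subtree followed when the test is false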
# Divides a set on a specific column. Can handle numeric
# or nominal values
def divideset(rows,column,value):
   # Make a function that tells us if a row is in 
   # the first group (true) or the second group (false)
   split_function=None
   if isinstance(value,int) or isinstance(value,float):
      split_function=lambda row:row[column]>=value
   else:
      split_function=lambda row:row[column]==value
   
   # Divide the rows into two sets and return them
   set1=[row for row in rows if split_function(row)]
   set2=[row for row in rows if not split_function(row)]
   return (set1,set2)

# Create counts of possible results (the last column of 
# each row is the result)
def uniquecounts(rows):
   results={}
   for row in rows:
      # The result is the last column
      r=row[len(row)-1]
      if r not in results: results[r]=0
      results[r]+=1
   return results

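2.1 Gini Impurity

Definition: the expected error rate if one of the outcomes from the set is randomly applied to an item in the set.

Formula: $Gini(D)=\sum_{k=1}^{\left|\mathcal{Y}\right|}\sum_{k' \neq k} p_{k}p_{k'} = 1-\sum_{k=1}^{\left|\mathcal{Y}\right|}p_{k}^{2}$, where $p_{k}$ is the proportion of samples in $D$ that belong to class $k$ and $\left|\mathcal{Y}\right|$ is the number of classes.

Code: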

# Probability that a randomly placed item will
# be in the wrong category
def giniimpurity(rows):
  total=len(rows)
  counts=uniquecounts(rows)
  imp=0
  for k1 in counts:
    p1=float(counts[k1])/total
    for k2 in counts:
      if k1==k2: continue
      p2=float(counts[k2])/total
      imp+=p1*p2
  return imp

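The function computes the probability of each outcome by dividing its number of occurrences by the total number of rows, then sums the products of these probabilities over every pair of distinct outcomes. The result is the probability that a row would be randomly assigned to the wrong outcome; the higher this probability, the worse the split.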
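
The pairwise sum used above equals one minus the sum of squared class probabilities, which avoids the nested loop. The following sketch checks this on a small label list; the function names and labels are hypothetical, and it works on bare labels, not on full rows.

```python
from collections import Counter

def gini_pairwise(labels):
    """Sum of p_k * p_k' over all ordered pairs of distinct classes."""
    total = len(labels)
    probs = [c / total for c in Counter(labels).values()]
    return sum(p1 * p2
               for i, p1 in enumerate(probs)
               for j, p2 in enumerate(probs) if i != j)

def gini_closed_form(labels):
    """Equivalent closed form: 1 minus the sum of squared class probabilities."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

labels = ["a", "a", "b", "c"]  # hypothetical labels: p = 0.5, 0.25, 0.25
print(gini_pairwise(labels), gini_closed_form(labels))
```

For these labels both forms give 0.625, and a set containing a single class gives 0.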

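2.2 Entropy

Formula:

$Ent(D)= -\sum_{k=1}^{\left|\mathcal{Y}\right|}p_{k}\log_{2}p_{k}$

$p_{k} = \text{frequency}(\text{outcome}_{k}) = \text{count}(\text{outcome}_{k}) / \text{count}(\text{total rows})$

Here $\left|\mathcal{Y}\right|$ is the number of distinct outcomes (classes).

Code: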

# Entropy is the sum of -p(x)*log2(p(x)) across all 
# the different possible results
def entropy(rows):
   from math import log
   log2=lambda x:log(x)/log(2)  
   results=uniquecounts(rows)
   # Now calculate the entropy
   ent=0.0
   for r in results.keys():
      p=float(results[r])/len(rows)
      ent=ent-p*log2(p)
   return ent
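
As a quick sanity check on the formula, this sketch evaluates entropy on three hypothetical label lists. `label_entropy` is an illustrative helper that takes bare labels instead of rows; it is not a function from the code above.

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

print(label_entropy(["yes"] * 4))            # a pure set
print(label_entropy(["yes", "no"] * 2))      # an even two-way mix
print(label_entropy(["a", "b", "c", "d"]))   # four equally likely classes
```

A pure set has zero entropy, an even two-class mix has one bit, and four equally likely classes have two bits, so lower entropy means a purer set.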

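3. Building the Tree Recursively

First compute the entropy of the whole group. Then try splitting the group on each possible value of each attribute and compute the entropy of the two resulting groups. From these, compute the information gain of every candidate split and choose the attribute with the highest gain. The same procedure is repeated for each newly created node to find its best splitting attribute, so the branches keep splitting and the tree keeps growing. Splitting of a branch stops only when the information gain from splitting a node is no greater than 0.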

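The basic decision tree algorithm has three cases that cause the recursion to return: (1) all samples at the current node belong to the same class, so no split is needed; (2) the current attribute set is empty, or all samples take identical values on every attribute, so no split is possible; (3) the sample set at the current node is empty, so it cannot be split.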
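
The three return cases can be sketched as a single check. This is only an illustration: `stop_reason` and its `attributes` parameter (a list of candidate column indices) are hypothetical and do not appear in the code of this article, which does not track an explicit attribute set.

```python
def stop_reason(rows, attributes):
    """Return which base case applies, or None if the node can still be split."""
    if not rows:
        return "empty sample set"          # case (3)
    if len({row[-1] for row in rows}) == 1:
        return "single class"              # case (1)
    # Case (2): no attributes left, or every attribute is constant over rows.
    if not attributes or all(len({row[a] for row in rows}) == 1 for a in attributes):
        return "no usable attribute"
    return None

# Made-up rows: the last column is the class label.
rows = [("google", "yes", "Premium"), ("google", "yes", "Basic")]
print(stop_reason([], [0, 1]))
print(stop_reason(rows[:1], [0, 1]))
print(stop_reason(rows, [0, 1]))
print(stop_reason(rows + [("digg", "no", "None")], [0, 1]))
```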

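Information gain formula: $Gain(D,a) = Ent(D) - \sum_{v=1}^{V}\frac{\left|D^{v}\right|}{\left|D\right|}Ent(D^{v})$, where $D^{v}$ is the subset of $D$ whose samples take the $v$-th of the $V$ possible values of attribute $a$.

Code: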

def buildtree(rows,scoref=entropy):
  if len(rows)==0: return decisionnode()
  current_score=scoref(rows)

  # Set up some variables to track the best criteria
  best_gain=0.0
  best_criteria=None
  best_sets=None
  
  column_count=len(rows[0])-1
  for col in range(0,column_count):
    # Generate the list of different values in
    # this column
    column_values={}
    for row in rows:
       column_values[row[col]]=1
    # Now try dividing the rows up for each value
    # in this column
    for value in column_values.keys():
      (set1,set2)=divideset(rows,col,value)
      
      # Information gain
      p=float(len(set1))/len(rows)
      gain=current_score-p*scoref(set1)-(1-p)*scoref(set2)
      if gain>best_gain and len(set1)>0 and len(set2)>0:
        best_gain=gain
        best_criteria=(col,value)
        best_sets=(set1,set2)
  # Create the sub branches   
  if best_gain>0:
    trueBranch=buildtree(best_sets[0],scoref)
    falseBranch=buildtree(best_sets[1],scoref)
    return decisionnode(col=best_criteria[0],value=best_criteria[1],
                        tb=trueBranch,fb=falseBranch)
  else:
    return decisionnode(results=uniquecounts(rows))
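
The nested nodes returned by buildtree are hard to read directly, so a small text dump can help. The sketch below is an assumption-laden illustration: `node` is a hypothetical stand-in with the same fields as decisionnode, `format_tree` is not part of the original code, and the tree is built by hand from made-up data.

```python
from types import SimpleNamespace

def node(col=-1, value=None, results=None, tb=None, fb=None):
    """Hypothetical stand-in with the same fields as decisionnode."""
    return SimpleNamespace(col=col, value=value, results=results, tb=tb, fb=fb)

def format_tree(tree, indent=""):
    """Return the tree as a list of text lines."""
    if tree.results is not None:          # leaf: show the class counts
        return [indent + str(tree.results)]
    lines = [f"{indent}col {tree.col} : {tree.value}?"]
    lines += [indent + "T->"] + format_tree(tree.tb, indent + "  ")
    lines += [indent + "F->"] + format_tree(tree.fb, indent + "  ")
    return lines

# Hand-built example tree (made-up data, not produced by buildtree).
tree = node(col=0, value="google",
            tb=node(results={"Premium": 3}),
            fb=node(col=1, value=21,
                    tb=node(results={"Basic": 2}),
                    fb=node(results={"None": 1})))
print("\n".join(format_tree(tree)))
```

Each internal node prints its test, followed by the true branch and the false branch at one extra level of indentation.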

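4. Classifying New Observations

On each call, the function checks whether it has reached the end of a branch by testing whether the node has results. If it has not, it evaluates the observation to see whether the value in the tested column matches the reference value. If it matches, classify is called again on the True branch; otherwise it is called on the False branch.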

def classify(observation,tree):
  if tree.results is not None:
    return tree.results
  else:
    v=observation[tree.col]
    branch=None
    if isinstance(v,int) or isinstance(v,float):
      if v>=tree.value: branch=tree.tb
      else: branch=tree.fb
    else:
      if v==tree.value: branch=tree.tb
      else: branch=tree.fb
    return classify(observation,branch)
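
Note that classify returns the leaf's results dictionary (class counts), not a single label. One possible way to turn that dictionary into a prediction is sketched below; `majority_label`, `class_probabilities`, and the example counts are hypothetical and not part of the original code.

```python
def majority_label(results):
    """Pick the most frequent class from a leaf's {label: count} dict."""
    return max(results, key=results.get)

def class_probabilities(results):
    """Convert the leaf's class counts into relative frequencies."""
    total = sum(results.values())
    return {label: count / total for label, count in results.items()}

leaf = {"Basic": 3, "None": 1}   # made-up counts from an impure leaf
print(majority_label(leaf))
print(class_probabilities(leaf))
```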

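5. Pruning the Decision Tree

Pruning addresses overfitting. The two main pruning strategies are "pre-pruning" and "post-pruning".

5.1 Pre-pruning

Pre-pruning estimates generalization performance before and after each split: if accuracy on the validation set improves after the split, the split is made; otherwise it is prohibited.

Decision stump: a decision tree with only a single level of splitting.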

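Pre-pruning leaves many branches of the tree "unexpanded", which not only lowers the risk of overfitting but also markedly reduces the training and testing time of the tree. On the other hand, although a given split may fail to improve generalization, or may even lower it temporarily, subsequent splits built on it could improve performance significantly. Because pre-pruning is greedy and forbids such branches from expanding, it exposes the pre-pruned tree to a risk of underfitting.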
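
The pre-pruning test for a single candidate split can be sketched as follows. This is a simplified illustration under stated assumptions: the split is an equality test on one nominal column, each side predicts its majority training label, and `should_split` and the data are hypothetical, not part of the code in this article.

```python
from collections import Counter

def majority(rows):
    """Most common label (last column) among the rows."""
    return Counter(row[-1] for row in rows).most_common(1)[0][0]

def accuracy(predict, rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(row) == row[-1] for row in rows) / len(rows)

def should_split(train, val, col, value):
    """Pre-pruning test: allow the split only if validation accuracy improves."""
    left = [r for r in train if r[col] == value]
    right = [r for r in train if r[col] != value]
    if not left or not right:
        return False
    before_label = majority(train)                  # prediction without the split
    left_label, right_label = majority(left), majority(right)
    acc_before = accuracy(lambda r: before_label, val)
    acc_after = accuracy(
        lambda r: left_label if r[col] == value else right_label, val)
    return acc_after > acc_before

# Made-up training and validation rows: (referrer, label).
train = [("slashdot", "None"), ("google", "Premium"), ("google", "Premium"),
         ("digg", "None"), ("slashdot", "None")]
val = [("google", "Premium"), ("digg", "None"), ("google", "Premium")]
print(should_split(train, val, 0, "google"))
```

In this example, predicting the majority training label for every validation row gets one of three right, while the split gets all three right, so the split is allowed.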

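5.2 Post-pruning

Post-pruning first grows a complete decision tree from the training set and then traverses it bottom-up. For each non-leaf node, if merging its branches (replacing the subtree with a leaf) improves accuracy on the validation set, the branches are merged.

A post-pruned tree usually retains more branches than a pre-pruned one. In general, post-pruned trees have a low risk of underfitting and often generalize better than pre-pruned trees. However, post-pruning runs after the full tree has been grown and must examine every non-leaf node bottom-up, so its training time is much greater than that of unpruned or pre-pruned trees.

Code: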

def prune(tree,mingain):
  # A leaf has no branches to prune
  if tree.results is not None: return
  # If the branches aren't leaves, then prune them
  if tree.tb.results is None:
    prune(tree.tb,mingain)
  if tree.fb.results is None:
    prune(tree.fb,mingain)
    
  # If both the subbranches are now leaves, see if they
  # should be merged
  if tree.tb.results is not None and tree.fb.results is not None:
    # Build a combined dataset
    tb,fb=[],[]
    for v,c in tree.tb.results.items():
      tb+=[[v]]*c
    for v,c in tree.fb.results.items():
      fb+=[[v]]*c
    
    # Test the reduction in entropy: combined entropy minus
    # the (unweighted) mean entropy of the two leaves
    delta=entropy(tb+fb)-(entropy(tb)+entropy(fb))/2

    if delta<mingain:
      # Merge the branches
      tree.tb,tree.fb=None,None
      tree.results=uniquecounts(tb+fb)
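
The prune function above merges leaves based on an entropy threshold (mingain), whereas the text describes post-pruning in terms of validation accuracy. A minimal sketch of that validation-based variant is given below. Everything in it is an assumption for illustration: `node` is a hypothetical stand-in for decisionnode, `predict` handles only equality tests on nominal columns, and a merge is kept only when validation accuracy strictly improves.

```python
from collections import Counter
from types import SimpleNamespace

def node(col=-1, value=None, results=None, tb=None, fb=None):
    """Hypothetical stand-in with the same fields as decisionnode."""
    return SimpleNamespace(col=col, value=value, results=results, tb=tb, fb=fb)

def predict(tree, row):
    """Follow equality tests down to a leaf and return its majority label."""
    while tree.results is None:
        tree = tree.tb if row[tree.col] == tree.value else tree.fb
    return max(tree.results, key=tree.results.get)

def accuracy(tree, rows):
    """Fraction of rows whose predicted label matches the last column."""
    return sum(predict(tree, r) == r[-1] for r in rows) / len(rows)

def prune_by_validation(root, tree, val):
    """Bottom-up: merge a pair of leaves only if validation accuracy improves."""
    if tree.results is not None:
        return
    prune_by_validation(root, tree.tb, val)
    prune_by_validation(root, tree.fb, val)
    if tree.tb.results is not None and tree.fb.results is not None:
        before = accuracy(root, val)
        saved = (tree.col, tree.value, tree.tb, tree.fb)
        merged = Counter(tree.tb.results) + Counter(tree.fb.results)
        tree.results, tree.tb, tree.fb = dict(merged), None, None
        if accuracy(root, val) <= before:     # no improvement: undo the merge
            tree.col, tree.value, tree.tb, tree.fb = saved
            tree.results = None

# Hand-built tree and validation rows (made-up data).
tree = node(col=0, value="google",
            tb=node(results={"Premium": 3}),
            fb=node(col=1, value="yes",
                    tb=node(results={"Basic": 1}),
                    fb=node(results={"None": 4})))
val = [("google", "no", "Premium"), ("digg", "yes", "None"), ("digg", "no", "None")]
prune_by_validation(tree, tree, val)
print(tree.fb.results, accuracy(tree, val))
```

Here the lower split is merged because doing so fixes one validation error, while the root split is kept because merging it would lower validation accuracy. Using `<` instead of `<=` in the undo test would also merge on ties, which favors smaller trees.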