> By joey周琦

概述

线性模型一般variance小，bias大；而树模型一般variance大，bias小
决策树的优点：具有可读性，分类速度快
一般包含三个步骤：
- 特征选择
- 决策树生成
- 剪枝
决策树定义：分类决策树模型是一种描述对实例进行分类的树形结构。决策树由结点(node)和有向边(directed edge)组成。结点分为内部结点和叶结点，内部结点表示一个特征(feature),外部结点表示一个类。
- 决策树可以看为一个if-then规则集合，具有“互斥完备”性质

决策树学习

假设有数据集

D = {(x 1, y 1), (x 2, y 2), . . . (x N, y N)}

$D= \{(x_1,y_1),(x_2,y_2) ,...(x_N,y_N)\}$
其中

xi=(x1i,x2i,...xni) $x_i = (x_i^1,x_i^2,...x_i^n)$ , 为输入特征向量，n为特征的数目.

yi∈{1,2,...,K} $y_i \in \{1,2,...,K\}$ 为类标记，i=1,2,…N, N为样本容量。
我们需要一个与训练数据矛盾较小的决策树，而且有很好泛化能力(剪枝）。

特征选择

特征选择是选取对训练数据具有分类能力的特征，将最有分类能力的特征放在前面判断。那么如何量化特征的分类能力呢？这里引入一个“信息增益（information gain“的概念

信息理论Information Theory

考虑随机变量 $x$ , 若获取 $x$ 的确切值，我们可以获取多少信息量？
知道了一件可能性很小事情的发生，相对于知道一件可能性很大事情的发生，前者的信息量要大很多。如果我们知道一件必然事情要发生，即没有信息量。
如何衡量？利用“信息熵(entropy)”,
- 若 $x$ 离散随机变量则熵表示为： $H [x] = - \sum i p (x i) log 2 p (x i)$ $H[x] = - \sum_i p(x_i) \log_2 p(x_i)$
- 若 $x$ 连续随机变量则熵表示为： $H [x] = - \int p (x) log 2 p (x) d x$ $H[x] = - \int p(x) \log_2 p(x) d_x$
- 若 D 为一个数据集，则信息熵表示为:
  H(D)=−∑x∈Xp(x)log2p(x)
  
  其中
  - X - 数据集D中所有的类
  - p(x) - 每个类所占的比例

信息增益Information gain

信息增益IG(A)衡量了，一个数据集在被变量A分割之前与分割之后的熵之差。用另外一句话说，数据集的不确定程度在分割变量A之后减少了多少。

I G (A, D) = H (D) - \sum t \in T p (t) H (t)

$IG(A,D) = H(D) - \sum_{t \in T} p(t)H(t)$
其中,
* H(D) - 数据集D的熵
* T - 数据集D在被变量A分割之后所产生的数据子集 D =

⋃t∈Tt $\bigcup_{t \in T} t$
* p(t) - 子数据集t所占比例
* H(t) - 子数据集t的熵
使用信息增益作为选择特征依据的算法，叫作ID3算法。如果使用特征增益比作为选择特征的依据，算法为ID4.5, 特征增益比表达式如下：

I G r (A, D) = I G ( A , D ) H ( D )

$IG_r(A,D) = \frac{IG(A,D)} {H(D)}$

决策树建立

下面我们以ID3决策树分类算法为例，介绍分类决策树建立。
ID3算法从跟结点开始迭代，选择出信息增益最大的特征作为结点的分割，产生若个数据子集，当出现如下情况时停止：
*每个子集中的元素都属于一个类，该结点变为叶结点，并且被标记为所有元素属于的那一类
*没有特征可以选择了，该结点变为叶结点，并且被标记为所有元素中占比最多的那一类
*子集中没有元素，那么创建叶结点，并且被标记为父结点中元素占比最多的那一类

伪代码为(ID3 wiki)：

    ID3 (Examples, Target_Attribute, Attributes)
    Create a root node for the tree
    If all examples are positive, Return the single-node tree Root, with label = +.
    If all examples are negative, Return the single-node tree Root, with label = -.
    If number of predicting attributes is empty, then Return the single node tree Root,
    with label = most common value of the target attribute in the examples.
    Otherwise Begin
        A ← The Attribute that best classifies examples.
        Decision Tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples
            Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A})
    End
    Return Root

剪枝

python代码实现

对于ID3算法的实现，我们分为三部分介绍
1. 获取数据并且预处理
2. 建立决策树
3. 测试决策树效果

获取数据并且预处理

而实验数据来自常用的数据集，”adults”, 链接： http://archive.ics.uci.edu/ml/datasets/Adult。该数据有14个特征，连续或离散，预测目标是离散的二类特征。数据分为训练数据和测试数据。

因为特征包含连续、离散特征，而ID3决策树只支持离散特征，那么首先要解决的问题就是连续数据的离线化。这里主要用了panda库中的一个函数cut, 即在下面的程序中“pd.cut(data_of_feature,bin_num)”这一句。data_of_feature表示某一个特征的数据集，是一列数据，而bin_num就是一个数字，表示将个列连续数据分为多少个离散组。利用这两个参数panda.cut可以自动离散化数据，并且进行合适的分割。值得一提的是，这里需要将训练数据和测试数据放在一起离散化，因为如果分开离散化，训练数据、测试数据的分割点不同。

代码有些部分是根据adult数据集写死的，比如哪些列是连续的，离散化时需要分割为多少个bin, 如果更换数据集则要相应变动。经过预处理，最终可以得到一个两维的np.ndarray数据，作为生成决策树的输入数据。整体的数据预处理代码如下：


def prepareData(training_data_file_path,testing_data_file_path):
    print "prepare adult data"
    #To split continues data and have the same split standard, we have to discretize the 
    #training data and testing data in one batch. And use the panda.cut to discretize .

    continue_feature_list = [0,2,4,10,11,12] #The index of continues feature
    bins = [10,12,8,12,12,12] #The bins of each continues feature
    data = []
    training_size = 0
    with open(training_data_file_path) as f:
        dataList = f.read().splitlines()    
    for datai in dataList:
        training_size += 1
        datai_feature_list = datai.split(", ")
        data.append(np.array(datai_feature_list))

    with open(testing_data_file_path) as f:
        dataList = f.read().splitlines()    
    for datai in dataList:
        datai_feature_list = datai.split(", ")
        data.append(np.array(datai_feature_list))
    data = np.array(data)
    discretizedData = discretizeData(data, continue_feature_list, bins)

    #return training data and testing data
    print "training_size: ", training_size
    return discretizedData[0:training_size,:],discretizedData[training_size:,:]

#data_of_feature:np.array, the data of a feature
#bin_num: to discretize to how many bins
def discretizeFeature(data_of_feature,bin_num):    
    return pd.cut(data_of_feature,bin_num)

    #data: np.ndarray, the training data
    #continue_attr: list
    #bins: the length of each discretized feature
    #To discretize the continues attribute/feature of data
def discretizeData(data,continue_feature_list,bins):
    for feature_i_index in range(len(continue_feature_list)):
        feature = continue_feature_list[feature_i_index]
        data_of_feature_i = np.array( [float(rowi) for rowi in data[:,feature] ] )#str to float
        discretized_feature_i = discretizeFeature(data_of_feature_i,bins[feature_i_index])
        print discretized_feature_i
        data[:,feature] = np.array(discretized_feature_i) #Use the discretized feature replace the continues feature
    return data

建立决策树

下面建立决策数，建立的过程，就如上面的伪代码所描述。所以这里仅解释一些代码里的每个函数的作用、和一些参数的含义：

def makeTree(data,attributes,target,depth):#构建决策数
#data 训练数据
#attributes 特征和预测目标名字的集合
#target 预测目标
#depth 树的深度（可以为剪枝使用）
#返回的树保存为一个“字典格式”. key:是根部最优的分割特征
#value两种可能。 1 就是预测的label. 2 就是一个字典。字典的key为分割特征的某个值，value就是一个子树

def majority(data,attributes,target): #获取该数据集下，预测值占大多数的那一个
def get_entropy_data(data,attributes,target,rows):#获取熵
def get_expected_entropy_data(data,attributes,attri,target):#获取条件期望熵
def infoGain(data,attributes,attri,target):#获取信息增益
def best_split(data,attributes,target):#获取最优的分割特征
def getValue(data,attributes,best_attri):#获取最优特征的所有值的集合
def getExample(data,attributes,best_attri,val):#获取满足“最优特征==val”的子数据集

而整体的构建代码如下：

'''
Created on 2015.10

@author: joeyqzhou
'''


import numpy as np
import copy 
from matplotlib.mlab import entropy


#return the majority of the label
def majority(data,attributes,target):
    target_index = attributes.index(target)

    valFreq = {}
    for i in range(data.shape[0]):
        if valFreq.has_key(data[i,target_index]):
            valFreq[ data[i,target_index] ] += 1
        else:
            valFreq[ data[i,target_index] ] = 1

    maxLabel = 0
    major = ""
    for label in valFreq.keys():
        if valFreq[label] > maxLabel:
            major = label
            max = valFreq[label]

    return major


def get_entropy_data(data,attributes,target,rows):
    data_len = data.shape[0]
    target_index = attributes.index(target)
    target_list = list( [ data[i,target_index]  for i in range(data_len) if rows[i]==1 ] )
    target_set = set(target_list)
    len_of_each_target_val = []
    for target_val in target_set:
        len_of_each_target_val.append( target_list.count(target_val) )

    entropy_data = 0.0 

    for target_count in len_of_each_target_val:
        entropy_data += -target_count*1.0/sum(len_of_each_target_val) * np.log(target_count*1.0/sum(len_of_each_target_val) )

    return entropy_data*sum(rows)*1.0/len(rows)

def get_expected_entropy_data(data,attributes,attri,target):

    attri_index = attributes.index(attri)
    attri_value_set = set( data[:,attri_index] )
    data_len = data.shape[0]
    sum_expected_entropy = 0.0

    for attri_value in attri_value_set:
        attri_selected_rows = np.zeros(data_len)
        for i in range(data_len):
            if data[i,attri_index] == attri_value:
                attri_selected_rows[i] = 1
        sum_expected_entropy += get_entropy_data(data,attributes,target,attri_selected_rows)

    return sum_expected_entropy


def infoGain(data,attributes,attri,target):
    entropy_data = get_entropy_data(data,attributes,target,np.ones(data.shape[0]))
    expected_entropy_data = get_expected_entropy_data(data,attributes,attri,target)
    return entropy_data - expected_entropy_data
#id3
def best_split(data,attributes,target):
    max_info = 0.000001 #Also can be seen as a threshold
    best_attri = ""
    print "best_split attributes: ",attributes
    print "data_len: ", data.shape[0]
    for attri in attributes:
        if attri != target:
            attri_infoGain = infoGain(data,attributes,attri,target)
            if attri_infoGain > max_info :
                max_info = attri_infoGain
                best_attri = attri

    print "max_info_gain: ",attri_infoGain
    print "best_attri: ", best_attri


    return best_attri

#get the possible value of best_attri in the data
def getValue(data,attributes,best_attri):
    best_attri_index = attributes.index(best_attri)
    return set(data[:,best_attri_index])

#get the data that best_attri==val from parent-data
def getExample(data,attributes,best_attri,val):
    best_attri_index = attributes.index(best_attri)
    data_len = data.shape[0]
    subset_data = [ ]
    for i in range(data_len):
        if data[i,best_attri_index] == val:
            subset_data.append( np.concatenate([data[i,0:best_attri_index],data[i,(best_attri_index+1):]]) )

    return np.array(subset_data)

#data: np.ndarray, training data, each row is a piece of data, each column is a feature
#attributes: list , feature name list
#target: target name
def makeTree(data,attributes,target,depth):
    print "depth: ", depth
    depth += 1
    val = [ record[attributes.index(target)] for record in data] #val is the value of target
    label_prediction = majority(data,attributes,target)

    #if data is empty or attributes is empty
    # len(attributes) <= 1, 1 is from the target
    if len(attributes) <= 1:
        return label_prediction
    elif val.count(val[0]) == len(val):
         return val[0]
    else:
         best_attri = best_split(data,attributes,target)
         print "best_attri: ", best_attri
         if best_attri == "":
            return label_prediction

         #create a new decision tree
         tree = {best_attri:{}}

         for val in getValue(data,attributes,best_attri):
             examples = getExample(data,attributes,best_attri,val)
             if examples.shape[0] == 0: #if the data_len ==0, then this is leaf node whose value is the majority
                 tree[best_attri][val] = label_prediction
             else:
                 newAttr = copy.copy(attributes)
                 newAttr.remove(best_attri)
                 subTree = makeTree(examples,newAttr,target,depth)
                 tree[best_attri][val] = subTree

    return tree

测试决策树效果

最后就是测试数据了，测试的时候，我也获得了几个小经验
1. 连续数据的离散化程度，可以影响到最终的准确率。比如，你把年龄这个连续特征分为几段，我一开始分为了4段，准确率就不高，后面试了10段，就提高了不少。所以可以通过调节每个连续特征的分割数目(bin_num),来优化最终的结果
2. 对于决策树无法分辨的情况，可以给出其父树中，大多数的label作为测试结果

经过测试adult.test, 准确率可以达到78.3%左右，但是这个远远没有达到 85%左右的准确率，原因可以有如下
1. 连续特征的离散化有优化空间
2. 没有进行剪枝

总结

完整的代码在http://download.csdn.net/detail/u011467621/9211651可以下载。直接运行main.py即可
另外，写博客的确经验不是很多，希望可以越来越好吧。如果有任何问题、建议，可以留言或联系我的邮箱[email protected]，一起讨论进步

参考文献

1 统计学习方法
2 PRML
3 https://en.wikipedia.org/wiki/ID3_algorithm
4 https://github.com/NinjaSteph/DecisionTree (代码参考）

机器学习算法之：决策树 (decision trees)

概述