Machine Learning in Action Chapter 3: Decision Trees

Decision tree overview

The decision tree algorithm is a basic method for classification and regression, and one of the most frequently used data mining algorithms. In this chapter we only discuss decision trees used for classification.

A decision tree model has a tree structure; in a classification problem it represents the process of classifying instances based on their features. It can be viewed either as a collection of if-then rules, or as a conditional probability distribution defined on the feature space and the class space.

Decision tree learning usually consists of 3 steps: feature selection, decision tree generation, and decision tree pruning.

Decision tree scenario

Consider the game "Twenty Questions". The rules are simple: one player thinks of something, and the other players ask questions about it. Only 20 questions are allowed, and each answer must be either true or false. By reasoning about the answers, the questioners gradually narrow down the range of possible things and finally arrive at the answer.

Another example is an email classification system, whose general workflow is as follows:

[Figure: decision tree flowchart for the email classifier]

First check the domain of the sending address. If the address is myEmployer.com, put the message in the category "email to read when bored".
If the message is not from that domain, check whether the body contains the word "hockey". If it does, classify the message as "email from friends that needs prompt attention";
if it does not, classify it as "spam that does not need to be read".
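
These rules are just nested if-then tests, which is exactly what a decision tree encodes. A minimal sketch (the function name and category strings below are illustrative, not from the original system):

def classify_email(sender_domain, body):
    # Toy version of the email-sorting flowchart above
    if sender_domain == 'myEmployer.com':
        # Mail from the employer's domain
        return 'email to read when bored'
    elif 'hockey' in body:
        # Mail from friends that mentions hockey
        return 'email from friends that needs prompt attention'
    else:
        return 'spam that does not need to be read'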

Definition of decision tree:

A classification decision tree model is a tree structure that describes the classification of instances. A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature (attribute), and a leaf node represents a class (label).

To classify a test instance with a decision tree: start from the root node and test a certain feature of the instance, then assign the instance to one of the node's children according to the test result; each child node corresponds to one value of that feature. The instance is tested and assigned recursively in this way until a leaf node is reached. Finally, the instance is assigned to the class of that leaf node.

Decision tree principle

Decision tree concepts to know

Information entropy & information gain

Entropy:
Entropy refers to the degree of disorder of a system. Different disciplines derive more specific definitions, and it is an important quantity in many fields.

Entropy (Shannon entropy) in information theory:
It is a measure of information, representing the degree of disorder of the information: the more ordered the information, the lower its entropy. For example, matches placed neatly in a matchbox have a very low entropy value, while matches scattered at random have a very high one.

Information gain:
The change in entropy before and after splitting the data set is called information gain: infoGain(D, A) = H(D) - H(D|A), where H(D) is the entropy of data set D and H(D|A) is the weighted entropy that remains after splitting D on feature A. The feature with the largest information gain is the best one to split on.

How decision trees work

How to construct a decision tree?

We use the createBranch() method as follows:

def createBranch():
    '''
    This uses the idea of recursion. If you are interested, look up
    "recursion", or even "dynamic programming".
    '''
    Check whether all instances in the data set have the same class label:
        If so, return the class label
        Else:
            Find the best feature to split the data set
            (the split with the lowest resulting entropy, i.e. the feature with the largest information gain)
            Split the data set
            Create a branch node
                for each subset of the split
                    call createBranch() (this function) and add the result to the branch node
            return the branch node

Decision tree development process

Collect data: any method can be used.
Prepare data: the tree construction algorithm (here we use ID3, which only works with nominal data; this is why numeric data must be discretized. There are other tree construction algorithms, such as CART).
Analyze data: any method can be used; after the tree has been constructed, we should check that the diagram matches our expectations.
Train the algorithm: construct the tree's data structure.
Test the algorithm: use the trained tree to compute the error rate.
Use the algorithm: this step applies to any supervised learning task; using a decision tree helps us better understand the inherent meaning of the data.

Decision tree algorithm characteristics

Advantages: low computational complexity, output that is easy to understand, tolerance for missing values, and the ability to handle irrelevant features.
Disadvantages: prone to overfitting.
Applicable data types: numeric and nominal.

Decision tree project case

Project Case 1: Determining fish vs. non-fish

Project Overview

Animals are classified into two categories, fish and non-fish, based on the following 2 features:

  1. Can it survive without surfacing?
  2. Does it have flippers?
Development Process
Collect data: any method can be used.
Prepare data: the tree construction algorithm (here we use ID3, so numeric data must be discretized).
Analyze data: any method can be used; after constructing the tree, we can draw it.
Train the algorithm: construct the tree structure.
Test the algorithm: perform classification with the learned decision tree.
Use the algorithm: this step applies to any supervised learning task; using a decision tree helps us better understand the inherent meaning of the data.

Collect data: any method can be used

Marine life data

We use the createDataSet() function to input data

def createDataSet():
    # Each row is [no surfacing, flippers, class label]
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
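
Running it in the Python interpreter returns the five training instances and the two feature names:

>>> myDat, labels = createDataSet()
>>> myDat
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> labels
['no surfacing', 'flippers']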

Preparing the Data: Tree Construction Algorithm

Since the input data is already discretized, this step can be omitted here.

Analyze data: You can use any method. After constructing the tree, we can draw the tree.

Entropy calculation formula: for a data set with n classes, where p(x_i) is the proportion of instances belonging to class i,

    H = -Σ_{i=1}^{n} p(x_i) · log2 p(x_i)

Function that calculates the Shannon entropy of a given data set

from math import log

def calcShannonEnt(dataSet):
    # Length of the list, i.e. how many training instances are involved
    numEntries = len(dataSet)
    # Count how many times each class label appears
    labelCounts = {}
    for featVec in dataSet:
        # The last element of each row is the instance's class label
        currentLabel = featVec[-1]
        # Create a dictionary entry for every possible class; if the current
        # key does not exist yet, add it. Each key records how many times
        # that class occurs.
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1

    # Use the frequency of each class label to compute its probability,
    # then accumulate the Shannon entropy
    shannonEnt = 0.0
    for key in labelCounts:
        # Probability of this class among all instances
        prob = float(labelCounts[key]) / numEntries
        # Shannon entropy uses the base-2 logarithm
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
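
For the marine-animal data above (2 'yes' labels and 3 'no' labels), the entropy works out to -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971:

>>> myDat, labels = createDataSet()
>>> calcShannonEnt(myDat)
0.9709505944546686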

Split the data set on a given feature

Take the rows whose value in the specified feature column equals value, remove that column, and return them as the sub-data set.

def splitDataSet(dataSet, index, value):
    """splitDataSet(iterate over dataSet and select the rows whose value in column index equals value)

        In other words, split on column index: whenever a row's value in
        column index equals value, put that row into the new sub-data set.
    Args:
        dataSet  the data set to be split
        index    the feature column used to split the data set
        value    the feature value the returned rows must have
    Returns:
        the rows whose column index equals value (with column index removed)
    """
    retDataSet = []
    for featVec in dataSet:
        # Check whether the value in column index equals value
        if featVec[index] == value:
            # Chop out the column used for splitting:
            # featVec[:index] takes the elements before column index
            reducedFeatVec = featVec[:index]
            # Note the difference between list.extend and list.append:
            #   [1, 2, 3].extend([4, 5]) -> [1, 2, 3, 4, 5]
            #   [1, 2, 3].append([4, 5]) -> [1, 2, 3, [4, 5]]
            # extend merges the elements of a sequence into the list
            # (similar to += on a list); append adds the whole object
            # as a single element.
            reducedFeatVec.extend(featVec[index+1:])
            # featVec[index+1:] skips column index and takes the rest
            # Collect the resulting row (with column index removed)
            retDataSet.append(reducedFeatVec)
    return retDataSet
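
For example, splitting the marine-animal data on feature 0 ('no surfacing'):

>>> myDat, labels = createDataSet()
>>> splitDataSet(myDat, 0, 1)
[[1, 'yes'], [1, 'yes'], [0, 'no']]
>>> splitDataSet(myDat, 0, 0)
[[1, 'no'], [1, 'no']]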

Choose the best feature to split the data set

def chooseBestFeatureToSplit(dataSet):
    """chooseBestFeatureToSplit(choose the best feature)

    Args:
        dataSet  the data set
    Returns:
        bestFeature  the index of the best feature column
    """
    # Number of feature columns in the first row; the last column is the label
    numFeatures = len(dataSet[0]) - 1
    # Original entropy of the data set
    baseEntropy = calcShannonEnt(dataSet)
    # Best information gain found so far, and the index of the best feature
    bestInfoGain, bestFeature = 0.0, -1
    # Iterate over all the features
    for i in range(numFeatures):
        # Create a list of all the values of this feature
        featList = [example[i] for example in dataSet]
        # Use a set to deduplicate the list and get the unique values
        uniqueVals = set(featList)
        # Accumulator for the entropy after splitting on this feature
        newEntropy = 0.0
        # Iterate over the unique values of the current feature: split the
        # data set once per value, compute the entropy of each subset, and
        # sum the subset entropies weighted by the subsets' proportions.
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            # Proportion of instances that fall into this subset
            prob = len(subDataSet) / float(len(dataSet))
            # Weighted entropy of the subset
            newEntropy += prob * calcShannonEnt(subDataSet)
        # Information gain: the change in entropy before and after the split,
        # i.e. the reduction in entropy (reduction in data disorder). Compare
        # the gains of all features and keep the index of the best split.
        infoGain = baseEntropy - newEntropy
        print('infoGain=', infoGain, 'bestFeature=', i, baseEntropy, newEntropy)
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
Q: Why is newEntropy above computed on the subsets?
A: Because after splitting on a feature, all instances inside a subset share the same value of that feature, so the feature itself contributes zero entropy within the subset; the only remaining uncertainty comes from the class labels of the subset.
That is why the new entropy is computed on the subsets, weighted by each subset's proportion of the data.
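
On the marine-animal data, feature 0 ('no surfacing') wins: its information gain is about 0.420, versus about 0.171 for 'flippers' (the function also prints each gain as it iterates):

>>> myDat, labels = createDataSet()
>>> chooseBestFeatureToSplit(myDat)
0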

Training the algorithm: constructing the tree's data structure

The function code to create the tree is as follows:

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # First stopping condition: all class labels are identical (the first
    # label's count equals the size of the whole list, i.e. there is only
    # one class), so return that class label directly.
    # count() returns how many times a value occurs in the list
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Second stopping condition: all features have been used but the data
    # still cannot be split into groups with a single class; only the label
    # column is left, so return the majority class as the result.
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)

    # Choose the best column to split on, and get the corresponding label name
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    # Initialize myTree
    myTree = {bestFeatLabel: {}}
    # Note: labels is a mutable object and is passed by reference into this
    # function, so the del below also removes the element from the caller's
    # list. If the caller reuses the same list afterwards, later calls can
    # fail with "'no surfacing' is not in list".
    del(labels[bestFeat])
    # Take the values of the best column and branch on them
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # The remaining feature labels
        subLabels = labels[:]
        # For each value of the chosen feature, recursively call createTree()
        # on the corresponding split of the data set
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
        # print('myTree', value, myTree)
    return myTree
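
createTree() relies on a helper, majorityCnt(), that this post does not list. A minimal sketch consistent with how it is used here (return the class label that occurs most often in classList):

import operator

def majorityCnt(classList):
    # Count the occurrences of each class label
    # (sketch; the original post does not show this helper)
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    # Sort by count, descending, and return the most frequent label
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

Building the tree for the marine-animal data yields a nested dictionary:

>>> myDat, labels = createDataSet()
>>> myTree = createTree(myDat, labels[:])  # pass a copy, since createTree modifies its labels argument
>>> myTree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}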

Testing the Algorithm: Performing Classification Using Decision Trees

def classify(inputTree, featLabels, testVec):
    """classify(classify an input instance)

    Args:
        inputTree   the decision tree model
        featLabels  the names corresponding to the feature columns
        testVec     the input instance to test
    Returns:
        classLabel  the resulting class label
    """
    # Get the key of the tree's root node
    firstStr = list(inputTree.keys())[0]
    # Use the key to get the root node's value (the subtree below it)
    secondDict = inputTree[firstStr]
    # Find the root node's position among the feature labels, which tells us
    # which element of testVec to compare against the tree
    featIndex = featLabels.index(firstStr)
    # Read that element of the test instance and follow the matching branch
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    print('+++', firstStr, 'xxx', secondDict, '---', key, '>>>', valueOfFeat)
    # Check whether the branching has ended: if valueOfFeat is a dict, it is
    # still a subtree, so recurse; otherwise it is the class label
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        classLabel = valueOfFeat
    return classLabel
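
Classifying two test instances against the tree built above (labels still holds both feature names because createTree() was given a copy; classify() also prints a trace line at each node it visits):

>>> classify(myTree, labels, [1, 0])
'no'
>>> classify(myTree, labels, [1, 1])
'yes'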

Use the algorithm: this step applies to any supervised learning task, and using a decision tree helps us better understand the inherent meaning of the data.

Project Case 2: Predicting contact lens types using decision trees

Project Overview

Contact lens types include hard, soft, and "no lenses" (not suited to wearing contact lenses). We need to use a decision tree to predict the type of contact lenses a patient should be prescribed.

Development Process
  1. Collect data: use the provided text file.
  2. Parse data: parse the tab-delimited data rows.
  3. Analyze the data: quickly check the data to make sure it was parsed correctly, and use the createPlot() function to draw the final tree diagram.
  4. Train the algorithm: use the createTree() function.
  5. Test the algorithm: write a test function to verify that the decision tree can correctly classify a given data instance.
  6. Use the algorithm: store the tree's data structure so that the tree does not need to be reconstructed the next time it is used.

Collect data: provided text file

The text file data format is as follows:

young	myope	no	reduced	no lenses
pre	myope	no	reduced	no lenses
presbyopic	myope	no	reduced	no lenses

Parse data: parse tab-delimited data rows

fr = open('lenses.txt')  # the data file (file name assumed here)
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']

Analyze the data: quickly check the data to ensure it was parsed correctly, and use the createPlot() function to draw the final tree diagram.

>>> treePlotter.createPlot(lensesTree)

Train the algorithm: use the createTree() function

>>> lensesTree = trees.createTree(lenses, lensesLabels)
>>> lensesTree
{'tearRate': {'reduced': 'no lenses',
              'normal': {'astigmatic': {'yes': {'prescript': {'hyper': {'age': {'pre': 'no lenses',
                                                                                'presbyopic': 'no lenses',
                                                                                'young': 'hard'}},
                                                              'myope': 'hard'}},
                                        'no': {'age': {'pre': 'soft',
                                                       'presbyopic': {'prescript': {'hyper': 'soft',
                                                                                    'myope': 'no lenses'}},
                                                       'young': 'soft'}}}}}}

Test the algorithm: Write a test function to verify that the decision tree can correctly classify a given data instance.

Use the algorithm: store the tree's data structure so that the tree does not need to be reconstructed the next time it is used.

Use the pickle module to store decision trees

import pickle

def storeTree(inputTree, filename):
    # Serialize the decision tree to disk so it does not have to be
    # rebuilt each time it is needed
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    # Load a previously stored decision tree from disk
    with open(filename, 'rb') as fr:
        return pickle.load(fr)
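
For example (the file name here is arbitrary):

>>> storeTree(lensesTree, 'classifierStorage.txt')
>>> grabTree('classifierStorage.txt') == lensesTree
True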

  • Reference information comes from ApacheCN
