ML Basics - Decision Trees - 3 - Building a Decision Tree Recursively

Building a decision tree recursively

Data flow when splitting the dataset

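import operator

#
# This function takes a list of class names and builds a dictionary whose keys are the unique values in classList.
# The dictionary stores how often each class label occurs; the entries are then sorted by count
# using operator.itemgetter, and the class name that occurs most often is returned.
#
# classList: the list of class names
#
def majorityCnt(classList):
    # Tally how often each class label occurs
    classCount = {}
    for vote in classList:
        # First time we see this label: start its count at 0
        if vote not in classCount:
            classCount[vote] = 0
        # Increment the count
        classCount[vote] += 1
    # Sort the (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # Return the most frequent class label
    return sortedClassCount[0][0]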
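
For comparison, the same majority vote can be sketched with collections.Counter from the standard library. This is an alternative formulation, not the code used in this series, and the name majority_label is made up for illustration.

```python
from collections import Counter


def majority_label(class_list):
    """Return the most frequent label in class_list (hypothetical helper name)."""
    # most_common(1) returns a one-element list of (label, count) pairs
    label, _count = Counter(class_list).most_common(1)[0]
    return label


print(majority_label(['yes', 'no', 'no', 'yes', 'no']))  # no
```

Counter does the tallying and the ordering by count in one step, so the explicit dictionary and the sorted call are no longer needed.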

Creating the decision tree

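#
# Parameter 1: the data set. Parameter 2: the list of labels (one name for every feature in the data set).
# Requirements on the data set (same as before):
# 1. The data must be a list of lists, and every inner list must have the same length.
# 2. The last column, i.e. the last element of each instance, is the class label of that instance.
#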
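def createTree(dataSet, labels):
    # classList holds the class label of every instance in the data set
    # (the last element of each row, as in requirement 2)
    classList = [example[-1] for example in dataSet]
    # The recursion has two stopping conditions:
    # ❶ (next two lines) Stop splitting when every class label is identical,
    #    i.e. the count of the first label equals len(classList)
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # ❷ (next two lines) When every feature has been used, return the most frequent label:
    #    only the class column is left, yet the data still cannot be split into single-class groups
    if len(dataSet[0]) == 1:
        # No unique class label can be returned, so use majorityCnt to pick the most frequent one
        return majorityCnt(classList)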
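    # Choose the best feature to split the data set on
    bestFeat = chooseBestFeatureToSplit(dataSet)
    # Look up the label (name) of that feature
    bestFeatLabel = labels[bestFeat]
    # The tree is stored as a nested dictionary
    myTree = {bestFeatLabel: {}}
    # ❸ Remove the chosen feature's label, then collect every value that feature takes.
    #    Note: del modifies the list passed in by the caller.
    del labels[bestFeat]
    # All values of the chosen feature across the data set
    featValues = [example[bestFeat] for example in dataSet]
    # set() builds an unordered collection without duplicates,
    # giving the distinct values of the feature
    uniqueVals = set(featValues)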
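    # Iterate over the distinct feature values
    for value in uniqueVals:
        # subLabels is a copy of labels (the chosen feature's label has already been removed).
        # Copying keeps the recursive calls from modifying the list used at this level.
        subLabels = labels[:]
        # bestFeatLabel is the name of the chosen feature; value is one of its distinct values.
        # The subtree returned by the recursive call is inserted into myTree.
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)

    return myTree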
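
createTree depends on two helpers from earlier posts in this series, chooseBestFeatureToSplit and splitDataSet, which are not reproduced here. The sketch below is only an assumption about what splitDataSet does: keep the rows whose chosen feature equals value, and drop that feature's column. The name split_data_set_sketch and the sample rows are made up for illustration. Dropping the column is what makes stopping condition ❷, len(dataSet[0]) == 1, reachable.

```python
def split_data_set_sketch(data_set, axis, value):
    """Keep rows where row[axis] == value, with column `axis` removed.

    Hypothetical stand-in for splitDataSet; the real helper may differ.
    """
    subset = []
    for row in data_set:
        if row[axis] == value:
            # Copy the row without the feature that was just used
            subset.append(row[:axis] + row[axis + 1:])
    return subset


# Illustrative rows: two feature columns followed by the class label
sample = [[1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no']]
print(split_data_set_sketch(sample, 0, 1))  # [[1, 'yes'], [0, 'no']]
```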

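The variable myTree holds nested dictionaries that encode the structure of the tree. Reading the printed result from the left, the first key, no surfacing, is the name of the first feature used to split the data set, and its value is another dictionary. The keys of that second dictionary are the values of the no surfacing feature that partition the data set, and their values are the child nodes of the no surfacing node. Each of those values is either a class label or another dictionary: a class label means the child is a leaf node, while a dictionary means the child is a decision node. Repeating this pattern builds up the entire tree.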

myDat, labels = createDataSet()
myTree = createTree(myDat, labels)
print(myTree)

# result:
#{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
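
To make the leaf node / decision node distinction concrete, here is a minimal sketch that walks the nested dictionary for one sample. The function name classify_sketch and the column order in feat_labels are assumptions made for illustration. A fresh label list is used because createTree, as written, deletes entries from the labels list it is given.

```python
def classify_sketch(tree, feat_labels, sample):
    """Follow the nested dictionaries until a class label (leaf) is reached."""
    # A decision node is a dict with exactly one key: the feature name
    feature_name = next(iter(tree))
    branches = tree[feature_name]
    # Find which column of the sample holds this feature
    feature_index = feat_labels.index(feature_name)
    child = branches[sample[feature_index]]
    if isinstance(child, dict):
        # Decision node: keep descending
        return classify_sketch(child, feat_labels, sample)
    # Leaf node: the value is the class label
    return child


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
feat_labels = ['no surfacing', 'flippers']  # assumed full label list, in column order
print(classify_sketch(tree, feat_labels, [1, 1]))  # yes
print(classify_sketch(tree, feat_labels, [1, 0]))  # no
```

A feature value that never appeared during training has no branch in the dictionary, so this sketch would raise a KeyError for it.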

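Python functions

About the built-in sorted function (used here together with the operator module)

import operator

# sorted(iterable, *, key=None, reverse=False)
#   Return a new sorted list from the items in iterable.
# In short, sorted is a built-in function that sorts any iterable and returns the result as a new list.
# The optional key argument is a function that produces the comparison key for each item;
# reverse is a boolean that says whether to sort in descending order.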

print(sorted([5, 2, 3, 1, 4]))
print(sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'}, reverse = True))
print(sorted("This is a test string from Andrew".split(), key=str.lower))
student_tuples = [
        ('john', 'A', 75),
        ('jane', 'B', 62),
        ('dave', 'B', 100),
]
print(sorted(student_tuples, key=lambda student: student[2]))   # sort by age

# result:
#[1, 2, 3, 4, 5]
#[5, 4, 3, 2, 1]
#['a', 'Andrew', 'from', 'is', 'string', 'test', 'This']
#[('jane', 'B', 62), ('john', 'A', 75), ('dave', 'B', 100)]

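About the key function:
key is a function that picks which part of each element to sort by. Using the example above:
sorted(student_tuples, key=lambda student: student[2])

The lambda passed as key takes the third field of each element (student[2]), so sorted orders the elements of student_tuples by their third field.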

operator.itemgetter, the key function used in the sorted call above

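# operator.itemgetter(1)
#   itemgetter, from the operator module, selects items of an object by position (or key).
#   Its arguments are the indices of the items to fetch.
#   itemgetter does not return a value directly; it returns a function, and that function must be applied to an object to get the values.

a = [1, 2, 3]
b = operator.itemgetter(1)     # b fetches the item at index 1 (the second element)
print(b(a))
b = operator.itemgetter(1, 0)  # b fetches the items at index 1 and index 0, as a tuple
print(b(a))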

# result:
# 2
# (2, 1)

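With operator.itemgetter, the same sort can be written without a lambda. For example, to sort by the third field of each student:

sorted(student_tuples, key=operator.itemgetter(2))

sorted can also sort on several levels. For example, to sort by the second field and then by the third:

sorted(student_tuples, key=operator.itemgetter(1, 2))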
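
As a small runnable check of the multi-level sort, the sketch below reuses the student_tuples list from earlier and also repeats the pattern majorityCnt relies on. The counts dictionary is made-up sample data.

```python
from operator import itemgetter

student_tuples = [
    ('john', 'A', 75),
    ('jane', 'B', 62),
    ('dave', 'B', 100),
]

# Sort by the second field first, then by the third field within equal second fields
by_second_then_third = sorted(student_tuples, key=itemgetter(1, 2))
print(by_second_then_third)
# [('john', 'A', 75), ('jane', 'B', 62), ('dave', 'B', 100)]

# The same pattern majorityCnt uses: sort (label, count) pairs by count, largest first
counts = {'yes': 2, 'no': 3}
by_count = sorted(counts.items(), key=itemgetter(1), reverse=True)
print(by_count)
# [('no', 3), ('yes', 2)]
```

With several indices, itemgetter returns a tuple for each element, and tuples compare field by field, which is what produces the two-level ordering.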