Machine Learning in Action from Scratch (Part 2): Decision Trees

Platform: Windows

Python version: Python 3.5

IDE: PyCharm

As before, this post follows the blog of Jack-Cui (http://blog.csdn.net/c406495762); here I add notes on the parts I found confusing myself.

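1. append and extend

a = [1, 2, 3]
b = [4, 5, 6]
a.append(b)
# a is now [1, 2, 3, [4, 5, 6]]
a = [1, 2, 3]
a.extend(b)
# a is now [1, 2, 3, 4, 5, 6]

In other words, append adds b to the list as a single element, while extend adds the elements of b to the list one by one.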

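2. splitDataSet()

def splitDataSet(dataSet, axis, value):
    retDataSet = []                                     # list that will hold the returned subset
    for featVec in dataSet:                             # iterate over every sample in the dataset
        if featVec[axis] == value:                      # keep only samples whose axis feature equals value
            reducedFeatVec = featVec[:axis]             # copy everything before the axis feature
            reducedFeatVec.extend(featVec[axis+1:])     # add everything after it, which drops the axis feature
            retDataSet.append(reducedFeatVec)           # add the reduced sample to the returned subset
    return retDataSet                                   # return the subset produced by the split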

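This part is a little roundabout, so first recall what we covered earlier about indexing and slicing.

reducedFeatVec = featVec[:axis]

first takes the columns from the first one up to the one just before the chosen feature. Because a slice creates a new list, the original sample is not modified.

reducedFeatVec.extend(featVec[axis+1:])

then takes the columns from the one just after the chosen feature through the last one.

Together, these two steps shrink the original sample by removing the feature whose value is already fixed.

Finally, append adds the reduced sample to the list being returned.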
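To make the slicing concrete, here is a small sketch that runs splitDataSet on a toy dataset. The three samples are hypothetical and made up for illustration only; they are not the data used in the book.

```python
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]            # slice makes a new list
            reducedFeatVec.extend(featVec[axis+1:])    # skip the axis column
            retDataSet.append(reducedFeatVec)
    return retDataSet

# Hypothetical toy data: two binary features followed by a class label.
dataSet = [[1, 1, 'yes'],
           [1, 0, 'no'],
           [0, 1, 'no']]

# Keep the samples whose feature 0 equals 1, with feature 0 removed.
subset = splitDataSet(dataSet, 0, 1)
print(subset)        # [[1, 'yes'], [0, 'no']]

# The original rows are untouched because the slices are copies.
print(dataSet[0])    # [1, 1, 'yes']
```

Note that append is the right call on the last step: each reduced sample must stay a separate row, whereas extend would flatten all rows into one long list.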

3. getNumLeafs()

def getNumLeafs(myTree):
    numLeafs = 0                                                # initialize the leaf count
    # In Python 3, myTree.keys() returns a dict_keys object rather than a list,
    # so myTree.keys()[0] no longer works; list(myTree.keys())[0] is an alternative.
    firstStr = next(iter(myTree))                               # label of the current node
    secondDict = myTree[firstStr]                               # dictionary holding this node's children
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':              # a dict is a subtree; anything else is a leaf
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

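iter() creates an iterator; its argument is an object that supports iteration, such as a list or a dict.

next() returns the next item from the iterator. Iterating over a dict yields its keys, so next(iter(myTree)) returns the first key of the tree, which is the label of the current node.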

A simple example:

it = iter([1, 2, 3, 4, 5])
# loop:
while True:
    try:
        # get the next value:
        x = next(it)
        print(x)
    except StopIteration:
        # leave the loop once StopIteration is raised
        break

The output is the numbers 1 through 5, one per line.

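The check if type(secondDict[key]).__name__=='dict' can tell leaves from internal nodes because of how the tree is stored: an internal node is a nested dictionary that maps feature values to further subtrees, while a leaf is just a class label and not a dictionary. If the value is not a dict, the node is a leaf.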
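The following sketch puts the two ideas together on a small tree. The tree and its labels are hypothetical, chosen only for illustration, and isinstance is used here in place of the type-name comparison; for plain dictionaries the two checks give the same answer.

```python
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = next(iter(myTree))          # first (and only) key: the node label
    secondDict = myTree[firstStr]          # children of this node
    for key in secondDict:
        if isinstance(secondDict[key], dict):
            numLeafs += getNumLeafs(secondDict[key])   # subtree: recurse
        else:
            numLeafs += 1                              # class label: a leaf
    return numLeafs

# Hypothetical tree: the root tests 'feature A'; one branch tests 'feature B'.
myTree = {'feature A': {0: 'no', 1: {'feature B': {0: 'no', 1: 'yes'}}}}

root = next(iter(myTree))
print(root)                                   # feature A
print(isinstance(myTree[root][0], dict))      # False, so branch 0 is a leaf
print(isinstance(myTree[root][1], dict))      # True, so branch 1 is a subtree
print(getNumLeafs(myTree))                    # 3
```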

4. plotTree()

def plotTree(myTree, parentPt, nodeTxt):
    # node styles: "sawtooth" gives the box a jagged edge, fc sets the fill shade of the annotation box
    decisionNode = dict(boxstyle="sawtooth", fc="0.8")
    leafNode = dict(boxstyle="round4", fc="0.8")
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    firstStr = next(iter(myTree))
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)                                  # label the edge with the feature value
    plotNode(firstStr, cntrPt, parentPt, decisionNode)                      # draw the decision node
    secondDict = myTree[firstStr]                                           # next dictionary, i.e. the child nodes to draw
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD                     # move y down one level
    for key in secondDict.keys():
        if type(secondDict[key]).__name__=='dict':                          # a dict is a subtree; anything else is a leaf
            plotTree(secondDict[key],cntrPt,str(key))                       # not a leaf: recurse to keep drawing
        else:                                                               # a leaf: draw it and label its edge
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD                     # move y back up after drawing the children

cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)

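This line computes the coordinates of the current node. The x coordinate is the current offset plotTree.xOff plus (1 + numLeafs) / 2 / plotTree.totalW, which places the node above the midpoint of the leaves in its own subtree. The y coordinate is the current plotTree.yOff.

plotTree.totalW stores the width of the whole tree (measured in leaf nodes, like numLeafs in the formula) and plotTree.totalD stores its depth. Both are attributes attached to the plotTree function and used like global variables.

plotTree.xOff and plotTree.yOff hold the starting coordinates and are updated as nodes are drawn.

The valid range of both the x axis and the y axis of the plot is 0.0 to 1.0, with the origin at the bottom-left corner.

The goal is to keep the tree as centered as possible: the root node sits at (0.5, 1.0), and the canvas is divided evenly using the tree's width and depth.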
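As a quick check of the arithmetic, the sketch below evaluates the x part of cntrPt for a tree with three leaves. The helper center_x is a hypothetical name introduced here, and the starting offset of -0.5/totalW is an assumption: the setup code that initializes plotTree.xOff is not shown in this post.

```python
def center_x(x_off, num_leafs, total_w):
    # Same expression as the x component of cntrPt in plotTree.
    return x_off + (1.0 + float(num_leafs)) / 2.0 / total_w

total_w = 3                  # hypothetical tree with three leaves
x_off = -0.5 / total_w       # assumed initial value of plotTree.xOff

# For the root, numLeafs equals totalW, so it lands in the middle of the canvas.
root_x = center_x(x_off, total_w, total_w)
print(round(root_x, 6))      # 0.5

# Each leaf advances xOff by 1/totalW before it is drawn.
leaf_xs = []
for _ in range(total_w):
    x_off += 1.0 / total_w
    leaf_xs.append(round(x_off, 4))
print(leaf_xs)               # [0.1667, 0.5, 0.8333]
```

Under that assumed starting offset, the leaves end up evenly spaced with half a slot of margin on each side, which is why the root comes out at x = 0.5.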
